[std::char] Add MAX_UTF8_LEN and MAX_UTF16_LEN #45795

Closed
@behnam

Background

Encoding any character as UTF-8 takes at most 4 bytes (u8), and encoding it as UTF-16 takes at most 2 code units (u16). Both bounds are guaranteed by the encoding specifications, and many places in the Rust libraries and in applications rely on them.
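As a quick sanity check of the bounds stated above, the following sketch walks over every `char` value and records the largest encoded length reported by the existing `len_utf8` and `len_utf16` methods. The function name `max_encoded_lens` is made up for this illustration.

```rust
/// Returns the largest UTF-8 and UTF-16 encoded lengths over all `char` values.
/// (Illustrative helper; the name is not part of any library.)
fn max_encoded_lens() -> (usize, usize) {
    let mut max_utf8 = 0;
    let mut max_utf16 = 0;
    // A range of `char` only yields valid scalar values, so the
    // surrogate gap (U+D800..=U+DFFF) is skipped automatically.
    for c in '\0'..=char::MAX {
        max_utf8 = max_utf8.max(c.len_utf8());
        max_utf16 = max_utf16.max(c.len_utf16());
    }
    (max_utf8, max_utf16)
}
```

Calling `max_encoded_lens()` should yield `(4, 2)`, which matches the two magic numbers discussed in this issue.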

Currently, the magic numbers 4 and 2 appear throughout the code wherever a buffer is created that must be long enough to hold a character encoded as UTF-8 or UTF-16.

Examples

fn check(input: char, expect: &[u8]) {
    let mut buf = [0; 4]; // magic number: maximum UTF-8 length of a char
    let ptr = buf.as_ptr();
    let s = input.encode_utf8(&mut buf);
    // ...
}

fn check(input: char, expect: &[u16]) {
    let mut buf = [0; 2]; // magic number: maximum UTF-16 length of a char
    let ptr = buf.as_mut_ptr();
    let b = input.encode_utf16(&mut buf);
    // ...
}

Proposal

Add the following public definitions to std::char and core::char, for use both inside the Rust codebase and by downstream users.

pub const MAX_UTF8_LEN: usize = 4;
pub const MAX_UTF16_LEN: usize = 2;

Why should we do this?

This will allow the code to be written like this:

let mut buf = [0; char::MAX_UTF16_LEN];
let b = input.encode_utf16(&mut buf);

This will guide users, without requiring them to know the details of the UTF-8/UTF-16 encodings, to allocate the correct amount of memory while writing the code, instead of waiting until a runtime error is raised. Such an error may never show up in basic tests and may only be discovered externally. It also improves readability for anyone reading such code.
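To illustrate the failure mode described above, here is a minimal sketch showing how an undersized buffer passes ASCII-only tests and then panics on wider characters. It relies on the documented behavior that `encode_utf8` panics when the buffer is too small; the helper name `fits_in` is invented for this example.

```rust
use std::panic;

/// Tries to encode `c` as UTF-8 into a buffer of `N` bytes.
/// Returns `true` if encoding succeeded and `false` if it panicked.
/// (Illustrative helper; the name is not part of any library.)
fn fits_in<const N: usize>(c: char) -> bool {
    panic::catch_unwind(|| {
        let mut buf = [0u8; N];
        // `encode_utf8` panics if `buf` is shorter than the encoded char.
        c.encode_utf8(&mut buf).len()
    })
    .is_ok()
}
```

With a 2-byte buffer, `fits_in::<2>('a')` succeeds, while `fits_in::<2>('\u{20AC}')` (the euro sign, 3 bytes in UTF-8) fails; the panic message is still printed to stderr by the default panic hook. A 4-byte buffer works for every `char`.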

Besides using these max-length values for char-level allocations, users can also use them to pre-allocate memory for encoding a list of chars into UTF-8/UTF-16.
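The pre-allocation use case could look roughly like the sketch below. The two constants are defined locally as stand-ins for the proposed `char::MAX_UTF8_LEN` and `char::MAX_UTF16_LEN`, and the function names `encode_all_utf8` and `encode_all_utf16` are hypothetical.

```rust
// Local stand-ins for the constants proposed in this issue.
const MAX_UTF8_LEN: usize = 4;
const MAX_UTF16_LEN: usize = 2;

/// Encodes `chars` as UTF-8 into a vector whose capacity is reserved
/// once, up front, for the worst case.
fn encode_all_utf8(chars: &[char]) -> Vec<u8> {
    let mut out = Vec::with_capacity(chars.len() * MAX_UTF8_LEN);
    let mut buf = [0u8; MAX_UTF8_LEN];
    for &c in chars {
        out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    }
    out
}

/// Same idea for UTF-16: at most two `u16` code units per `char`.
fn encode_all_utf16(chars: &[char]) -> Vec<u16> {
    let mut out = Vec::with_capacity(chars.len() * MAX_UTF16_LEN);
    let mut buf = [0u16; MAX_UTF16_LEN];
    for &c in chars {
        out.extend_from_slice(c.encode_utf16(&mut buf));
    }
    out
}
```

Reserving the worst case trades some memory for the guarantee that the vector never reallocates while encoding; an exact size could instead be computed by summing `len_utf8()` or `len_utf16()` in a first pass.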

How do we teach this?

The std/core libs will be updated to use these values wherever possible (see this list), and the docs for the encoding-related functions in the char module will be updated to recommend these values when allocating memory for the encoding functions.

Alternatives

1) Only update the docs

We could just update the function docs to mention these max-length values, without defining them as named constants.

2) New functions for allocations with max limit

Although this could be handy for some users, it would cover only one use case of these numbers and would not help with other operations.
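The issue does not spell out what such functions would look like, so the following is only a guess at one possible shape. Both `utf8_buf` and `utf16_buf` are hypothetical names, not existing or proposed APIs.

```rust
/// Hypothetical helper: a zeroed buffer large enough to hold
/// any single `char` encoded as UTF-8.
fn utf8_buf() -> [u8; 4] {
    [0; 4]
}

/// Hypothetical helper: a zeroed buffer large enough to hold
/// any single `char` encoded as UTF-16.
fn utf16_buf() -> [u16; 2] {
    [0; 2]
}
```

A caller would write `let mut buf = utf8_buf();` and pass `&mut buf` to `encode_utf8`. This covers the single-char buffer case, but the length is only available through the returned array, which makes it awkward for things like computing a `Vec` capacity for many chars. Named constants serve both uses.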


What do you think?

Metadata

Labels

C-feature-accepted: Category: A feature request that has been accepted pending implementation.
E-easy: Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.
T-libs-api: Relevant to the library API team, which will review and decide on the PR/issue.
