bytes, strings: add ToValidUTF8

I have seen multiple places where string([]rune(x)) (where x has underlying type string) or a handcrafted function is used to replace invalid UTF-8 sequences in strings with utf8.RuneError.
In many of these places, only very few of the processed strings actually contain invalid UTF-8 sequences.

The pattern string([]rune(x)) is short and does not require a package import, but it is inefficient: it involves at least two runtime function calls (each of which iterates over the runes, so the input is traversed multiple times) and may allocate twice (once for the rune slice and once for the string).
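For readers unfamiliar with the pattern, here is a minimal sketch of what it does. The helper name `sanitize` and the sample input are made up for illustration; only the conversion itself is the pattern under discussion.

```go
package main

import "fmt"

// sanitize is a hypothetical wrapper around the conversion pattern.
// Converting to []rune decodes the string rune by rune; every byte that
// is not part of a valid UTF-8 sequence decodes to utf8.RuneError
// (U+FFFD). Converting back to string re-encodes all runes as UTF-8.
func sanitize(x string) string {
	return string([]rune(x))
}

func main() {
	// Three invalid bytes: a lone 0xff, an invalid lead byte 0xc0,
	// and a stray continuation byte 0xaf.
	in := "a\xffb\xc0\xafc"
	out := sanitize(in)

	fmt.Printf("%+q\n", out)
	// Each invalid byte (1 byte) became U+FFFD (3 bytes in UTF-8).
	fmt.Println(len(in), len(out)) // 6 12
}
```

Note that each invalid byte is replaced individually, so a run of invalid bytes produces a run of replacement characters.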

I propose (no language change needed) that the compiler detect this pattern and replace it with a specialized runtime function that sanitizes UTF-8 sequences the same way string([]rune(x)) does. This function should use a stack-local temporary buffer if the resulting string is small and does not escape (as other runtime string functions do), and it should allocate at most once, for the result string, and only when the input contains invalid UTF-8 sequences.

As an alternative (or an addition to the proposal), a new function could be added to the utf8 package that is semantically equivalent to string([]rune(x)) but more efficient. However, such a function could not use all the optimizations available to the runtime with compiler support, and it would only be available from a specific standard library version onwards. New programs using it would not compile with older standard libraries, and existing uses of string([]rune(x)) would not be optimized automatically. There would also likely be performance regressions for small non-escaping strings: the existing string([]rune(x)) can stack allocate the returned string, but the new utf8 function could not do so when returning a modified string.
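To make the library alternative concrete, here is a rough sketch of what such a function might look like in user code. The name `sanitizeUTF8` is hypothetical and not an existing or proposed API; the sketch only aims to match the result of string([]rune(x)) while returning valid input unchanged and sizing the output buffer up front. It cannot use a stack-local buffer, which is the limitation described above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 is a hypothetical helper that returns s with every
// invalid UTF-8 byte replaced by utf8.RuneError (U+FFFD), matching the
// result of string([]rune(s)). Valid input is returned as-is without
// allocating.
func sanitizeUTF8(s string) string {
	// First pass: count invalid bytes. DecodeRuneInString reports an
	// invalid byte as (utf8.RuneError, 1); a correctly encoded U+FFFD
	// has size 3 and is therefore not counted.
	invalid := 0
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			invalid++
		}
		i += size
	}
	if invalid == 0 {
		return s // common case: nothing to fix, nothing allocated
	}

	// Second pass: build the result. Each invalid byte grows from 1 to
	// 3 bytes, so the final length is known before writing.
	var b strings.Builder
	b.Grow(len(s) + 2*invalid)
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		b.WriteRune(r) // writes U+FFFD for invalid bytes
		i += size
	}
	return b.String()
}

func main() {
	in := "a\xffb\xc0\xafc"
	fmt.Println(sanitizeUTF8(in) == string([]rune(in))) // true
	fmt.Println(sanitizeUTF8("héllo") == "héllo")       // true
}
```

The two-pass structure trades an extra scan for an exactly sized buffer; a single pass that starts copying at the first invalid byte would be another reasonable design.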

If this sounds like something we can pursue, I would like to implement an optimization of string([]rune(x)) (and/or a utf8 package function) for go1.12.

/cc @josharian @randall77 

bytes, strings: add ToValidUTF8 #25805
