Description
Given the number of heated discussions on the topic so far, especially in the context of Interface Types and GC, I am not getting the impression that anything of relevance is going to change, and we should start planning for the worst case.
So I have been thinking about what the implications would be of switching AssemblyScript's string encoding to W/UTF-8 by default again, and it doesn't look too bad if all one really wants is to get rid of WTF-16, is willing to break certain string APIs, wants it to be efficient after the breakage and otherwise is not judging.
Implications:
- String#charCodeAt would be removed in favor of String#codePointAt
- String#charAt would be removed, or changed to retain codepoint boundaries if viable
- String#[] would be removed, or changed to retain codepoint boundaries if viable, or to return numeric bytes like C
- String#length would return the length of the string in bytes
- String.fromCharCode would be removed, or deprecated and polyfilled
- String#split with an empty string separator would split at codepoints
- Ill-formed Unicode would be rejected
  - if it can be done efficiently
  - if not, we'd have to think about WTF-8 instead
- Anything returning a character offset before would return a byte offset after:
  - Most String APIs would "just work" with byte offsets instead of character offsets as well
  - Mileage may vary if one uses string APIs with constant (incremented) offsets, as that would not map well anymore
  - Example: The compiler's tokenizer would need to skip codepoints instead of += 1
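To make the tokenizer point concrete, here is a minimal sketch in plain TypeScript (not actual AS stdlib or compiler code). The names utf8SequenceLength and codePointOffsets are made up for illustration, and the sketch assumes the input is already well-formed UTF-8, which is what rejecting ill-formed Unicode would guarantee.

```typescript
// Hypothetical helper: byte length of the UTF-8 sequence that starts with `lead`.
// Assumes well-formed UTF-8; continuation bytes (10xxxxxx) are not valid lead bytes.
function utf8SequenceLength(lead: number): number {
  if (lead < 0x80) return 1;            // 0xxxxxxx: 7-bit ASCII
  if ((lead & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((lead & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((lead & 0xf8) === 0xf0) return 4; // 11110xxx
  throw new RangeError("not a UTF-8 lead byte: " + lead);
}

// Hypothetical helper: byte offsets at which each codepoint starts.
function codePointOffsets(bytes: Uint8Array): number[] {
  const result: number[] = [];
  let pos = 0;
  while (pos < bytes.length) {
    result.push(pos);
    pos += utf8SequenceLength(bytes[pos]); // skip a whole codepoint, not `pos += 1`
  }
  return result;
}

// UTF-8 bytes of "a", U+00E9, U+20AC, U+1F600 (1 + 2 + 3 + 4 = 10 bytes).
const sample = new Uint8Array([
  0x61,
  0xc3, 0xa9,
  0xe2, 0x82, 0xac,
  0xf0, 0x9f, 0x98, 0x80,
]);
const offsets = codePointOffsets(sample); // [0, 1, 3, 6]
```

The same four codepoints take 5 code units in WTF-16, so offsets obtained from one encoding cannot be reused in the other once non-ASCII text is involved.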
This means we'd essentially jump on the UTF-8 train to have
- efficient calls to WASI APIs
- efficient calls to DOM APIs typically accessed with 7-bit ASCII strings (think .className = "abc")
- the same problem as everyone else where 7-bit ASCII is not enough
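As a rough illustration of the 7-bit ASCII case, the sketch below computes how many bytes a JS (WTF-16) string would occupy as UTF-8. It is plain TypeScript, utf8ByteLength is a made-up name, and lone surrogates are counted as 3 bytes, which matches both WTF-8 and replacement with U+FFFD.

```typescript
// Hypothetical helper: number of bytes `s` would occupy when encoded as UTF-8.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // 7-bit ASCII: one byte per character
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (
      unit >= 0xd800 && unit <= 0xdbff &&
      i + 1 < s.length &&
      (s.charCodeAt(i + 1) & 0xfc00) === 0xdc00
    ) {
      bytes += 4; // valid surrogate pair: one supplementary codepoint
      i++;        // consume the low surrogate as well
    } else {
      bytes += 3; // rest of the BMP, or a lone surrogate
    }
  }
  return bytes;
}

const asciiUtf8Bytes = utf8ByteLength("abc"); // 3
const asciiUtf16Bytes = "abc".length * 2;     // 6
```

For ASCII-only strings the UTF-8 byte count equals the character count, so offsets and lengths coincide there. Anything beyond ASCII brings back the byte-versus-codepoint distinction from the last bullet.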
Note that the proposition of switching AS to UTF-8 is different from most of what has been discussed more recently, even though it has always been lingering in the background. It hasn't been a real topic so far due to the implied breakage with JS, but unlike the alternatives it can be made efficient when willing to break with JS. Certainly, the support-most-of-TypeScript folks may disagree as it picks a definite side.
If anything, however, we should switch the entire default and make a hard cut because
- maintaining two string implementations, and ensuring that all APIs work with both, is not exactly realistic
- maintaining a single string implementation understanding both encodings would yield the problem we are trying to avoid in Wasm, but in AS
- either of the above would often double the code size of string operations
Thoughts?