Switch AssemblyScript to UTF-8 by default? #1653

Open
@dcodeIO

Given the number of heated discussions on the topic so far, especially in the context of Interface Types and GC, I don't get the impression that anything of relevance is going to change, so we should start planning for the worst case.

So I have been thinking about the implications of switching AssemblyScript's string encoding to W/UTF-8 by default again. It doesn't look too bad, provided that all one really wants is to get rid of WTF-16, one is willing to break certain string APIs, one wants the result to be efficient after the breakage, and one is otherwise not judging.

Implications:

  • String#charCodeAt would be removed in favor of String#codePointAt
  • String#charAt would be removed, or changed to retain codepoint boundaries if viable
  • String#[] would be removed, or changed to retain codepoint boundaries if viable, or changed to return the numeric byte value, as in C
  • String#length would return the length of the string in bytes
  • String.fromCharCode would be removed, or deprecated and polyfilled
  • String#split with an empty string separator would split at codepoints
  • Ill-formed Unicode would be rejected
    • if it can be done efficiently
    • if not, we'd have to think about WTF-8 instead
  • Anything returning a character offset before would return a byte offset after:
    • Most String APIs would "just work" with byte offsets instead of character offsets as well
    • Mileage may vary if one uses string APIs with constant (incremented) offsets, as that would not map well anymore
      • Example: The compiler's tokenizer would need to skip codepoints instead of += 1
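To make the offset-related implications above more concrete, here is a minimal sketch in plain TypeScript (not AssemblyScript's actual string implementation). It shows a byte length differing from the UTF-16 length, a scanner that skips whole codepoints instead of `+= 1`, and a check for lone surrogates, which is the kind of ill-formed input that strict UTF-8 would reject. The helpers `utf8SequenceLength`, `codePointOffsets` and `hasLoneSurrogate` are hypothetical names made up for illustration, and the first one assumes its input is already well-formed UTF-8.

```typescript
// Hypothetical helper: byte length of the UTF-8 sequence starting with `lead`.
// Assumes `lead` is the lead byte of a well-formed sequence.
function utf8SequenceLength(lead: number): number {
  if (lead < 0x80) return 1;            // 0xxxxxxx (7-bit ASCII)
  if ((lead & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((lead & 0xf0) === 0xe0) return 3; // 1110xxxx
  return 4;                             // 11110xxx
}

// Hypothetical helper: byte offsets at which each codepoint starts.
function codePointOffsets(bytes: Uint8Array): number[] {
  const offsets: number[] = [];
  let pos = 0;
  while (pos < bytes.length) {
    offsets.push(pos);
    pos += utf8SequenceLength(bytes[pos]); // skip a codepoint, not `pos += 1`
  }
  return offsets;
}

// Hypothetical helper: true if a JS (WTF-16) string contains an unpaired surrogate.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return true;
      i++; // skip the paired low surrogate
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

// "a", U+00E9, U+20AC, U+1F600: 1, 2, 3 and 4 bytes in UTF-8.
const sample = "a\u00e9\u20ac\u{1F600}";
const sampleUtf8 = new TextEncoder().encode(sample);
const utf16Length = sample.length;              // UTF-16 code units
const byteLength = sampleUtf8.length;           // UTF-8 bytes
const offsets = codePointOffsets(sampleUtf8);   // byte offset of each codepoint
```

With byte offsets, a loop that previously did `pos += 1` per character would step into the middle of a multi-byte sequence, which is why the tokenizer example in the list needs to skip by sequence length.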

This means we'd essentially jump on the UTF-8 train and get

  • efficient calls to WASI APIs
  • efficient calls to DOM APIs typically accessed with 7-bit ASCII strings (think .className = "abc")
  • the same problem as everyone else where 7-bit ASCII is not enough
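As a rough illustration of why 7-bit ASCII strings are the cheap case, the sketch below compares the bytes of "abc" in UTF-8 and in UTF-16LE. It is plain TypeScript and does not call any real WASI or DOM binding; `encodeUtf16LE` and `isAscii` are hypothetical helpers written only for this comparison.

```typescript
// Hypothetical helper: encode a JS string's code units as UTF-16LE bytes.
function encodeUtf16LE(s: string): Uint8Array {
  const out = new Uint8Array(s.length * 2);
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    out[2 * i] = unit & 0xff;   // low byte first (little endian)
    out[2 * i + 1] = unit >> 8; // high byte
  }
  return out;
}

// Hypothetical helper: true if every byte is 7-bit ASCII.
function isAscii(bytes: Uint8Array): boolean {
  return bytes.every((b) => b < 0x80);
}

const className = "abc";
const classNameUtf8 = new TextEncoder().encode(className); // one byte per character
const classNameUtf16 = encodeUtf16LE(className);           // two bytes per character

// For ASCII-only text, the UTF-8 bytes are exactly the ASCII bytes, so a
// UTF-8 consumer can take them as-is; the UTF-16 form has to be transcoded.
const asciiOnly = isAscii(classNameUtf8);
```

Once a string contains anything beyond 7-bit ASCII, the UTF-8 form becomes multi-byte and the two sides no longer agree on lengths and offsets, which is the "same problem as everyone else" mentioned in the list.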

Note that the proposition of switching AS to UTF-8 is different from most of what has been discussed more recently, even though it has always been lingering in the background. It hasn't been a real topic so far because of the implied breakage with JS, but unlike the alternatives, it can be made efficient if one is willing to break with JS. Certainly, the support-most-of-TypeScript folks may disagree, as it picks a definite side.

If anything, however, we should switch the entire default and make a hard cut because

  • maintaining two string implementations, and ensuring that all APIs work with both, is not exactly realistic
  • maintaining a single string implementation understanding both encodings would yield the problem we are trying to avoid in Wasm, but in AS
  • either of the above would often double the code size of string operations

Thoughts?
