This repository was archived by the owner on Jun 15, 2023. It is now read-only.

Unicode support #433

Merged 4 commits on Aug 24, 2021

IwanKaramazow (Contributor) commented Jun 4, 2021

Unicode support

This PR adds support for Unicode codepoints at the syntax level: ReScript source code is now considered unicode text encoded in UTF-8.

Fixes #397

Codepoint literals

A codepoint literal represents an integer value identifying a unicode code point. It is expressed as one or more characters enclosed in single quotes. Examples are 'x', '\n' or '\u{00A9}'. Multiple UTF-8-encoded bytes may represent a single integer value.
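To illustrate the last sentence, here is a minimal TypeScript sketch of how one code point value maps to one or more UTF-8 bytes. The helper name `utf8Bytes` is hypothetical and not part of this PR; it is a simplified encoder that does not reject surrogate values.

```typescript
// Hypothetical helper (not from the PR): encode one Unicode code point as UTF-8.
// Simplified sketch: surrogate values (0xD800..0xDFFF) are not rejected.
function utf8Bytes(codePoint: number): number[] {
  if (codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  if (codePoint < 0x80) {
    return [codePoint]; // 1 byte: ASCII range
  }
  if (codePoint < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

console.log(utf8Bytes(0x78)); // 'x' is a single byte: [120]
console.log(utf8Bytes(0x00a9)); // '\u{00A9}' is two bytes: [194, 169]
```

The integer value of the literal is the code point itself (0x00A9), regardless of how many bytes the source file needs to spell it.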

String literals

String literals are (possibly multi-byte) UTF-8 encoded character sequences between double quotes, as in "fox 🦊 \u{2665}". Internally they now compile to {js|fox 🦊 \u{2665}|js}, so that the JS output is not garbled.
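For a sense of what the JS side sees, the sketch below (TypeScript, with a hypothetical helper `countCodePoints` that is not part of this PR) compares the number of UTF-16 code units with the number of code points for the example string. It assumes the string reaches JavaScript unchanged.

```typescript
// Hypothetical helper: count Unicode code points in a JS string
// by treating a high surrogate followed by a low surrogate as one unit.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low surrogate of the pair
      }
    }
    count++;
  }
  return count;
}

const fox = "fox 🦊 \u{2665}";
console.log(fox.length); // 8 UTF-16 code units (the fox emoji takes two)
console.log(countCodePoints(fox)); // 7 code points
```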

New escape sequences

Both codepoint and string literals accept the following new escape sequences:

  1. Unicode escape sequences
    Any character with a character code lower than 65536 can be escaped using the hexadecimal value of its character code, prefixed with \u. Unicode escapes are six characters long: they require exactly four hexadecimal digits following \u. If the hexadecimal character code is only one, two or three digits long, you’ll need to pad it with leading zeroes.
    Example: '\u2665' (Represents ♥)

  2. Unicode codepoint escape sequences
    Any code point or character can be escaped using the hexadecimal value of its character code, prefixed with \u{ and suffixed with }. This allows for code points up to 0x10FFFF, which is the highest code point defined by Unicode. Unicode code point escapes consist of at least five characters, since at least one hexadecimal digit must be wrapped in \u{…}. There is no upper limit on the number of hex digits in use (for example, '\u{000000000061}' == 'a').
    Example: '\u{2318}' (Represents ⌘)
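
The two escape forms above can be sketched as a small recognizer. This is a TypeScript illustration of the rules as described, not the parser code from this PR; `parseUnicodeEscape` and its return shape are hypothetical.

```typescript
// Hypothetical sketch: recognize one Unicode escape at the start of `input`.
// Returns the code point and the number of characters consumed, or null.
function parseUnicodeEscape(
  input: string
): { codePoint: number; length: number } | null {
  // Form 1: \u followed by exactly four hex digits (six characters total).
  const fixed = /^\\u([0-9a-fA-F]{4})/.exec(input);
  if (fixed) {
    return { codePoint: parseInt(fixed[1], 16), length: fixed[0].length };
  }
  // Form 2: \u{ ... } with one or more hex digits and no upper digit limit.
  const braced = /^\\u\{([0-9a-fA-F]+)\}/.exec(input);
  if (braced) {
    // Drop leading zeros first so very long digit runs stay exact.
    const digits = braced[1].replace(/^0+/, "");
    if (digits.length > 6) {
      return null; // more than six significant digits exceeds 0x10FFFF
    }
    const codePoint = digits === "" ? 0 : parseInt(digits, 16);
    if (codePoint > 0x10ffff) {
      return null; // beyond the highest code point defined by Unicode
    }
    return { codePoint: codePoint, length: braced[0].length };
  }
  return null;
}

console.log(parseUnicodeEscape("\\u2665")); // code point 0x2665, 6 characters
console.log(parseUnicodeEscape("\\u{000000000061}")); // code point 0x61
console.log(parseUnicodeEscape("\\u26")); // null: too few digits, no padding
```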

jfrolich commented Jun 7, 2021

Why are \u{..} escape sequences needed in string literals if the source text is already assumed to be encoded in utf-8 (and thus able to express any code point)?

@IwanKaramazow
Contributor Author

Escape sequences are primarily used to put hard-to-represent "characters" in a character or string literal. I guess you could compare it to using \n, \t or a hex escape sequence like \x41.

jfrolich commented Jun 7, 2021

Does this mean that normal string literals ("") when compiled to a JavaScript source file would be 100% compatible, as opposed to messing up the string literal when there is a codepoint that doesn't fit in 16-bit? (like an emoji)? That would be very good news!

@IwanKaramazow
Contributor Author

@jfrolich Indeed that's the goal, no more weird escaping in the js output =D

Iwan added 4 commits August 23, 2021 21:51
Codepoint makes more sense with unicode
The compiler processes these strings with js semantics.

Previously {js||js} were interpreted as template literal strings.
The internal encoding has been changed to use an attribute (@res.template) to detect template literal strings.

jfrolich commented Sep 6, 2021

Question: why does this need to happen at the syntax level? Wouldn't it be a better fix to just not let the ReScript compiler fudge string literals in any case, and not just for {js|...|js} type strings?

@IwanKaramazow
Contributor Author

It happens on both levels:

  • compiler: {js|...|js} ast node is a compiler construct with utf8 semantics.
  • syntax: needs to parse strings into that construct

Hongbo suggested to just parse strings as {js|...|js}, since it already contains all the logic.

jfrolich commented Sep 6, 2021

Yeah, I guess it works, but is there any reason to keep the garbled way of parsing string literals around in the compiler?

Background: the JS interpreter will automatically convert a UTF8 string literal to UTF16 internally, so if you want behavior that matches JS, the current approach doesn't make sense. Also if you want to match legacy OCaml behaviors the approach also doesn't make sense, because OCaml sees strings as bytes, so some kind of UTF16 internal representation doesn't make sense in that case as well.

BTW: will it also print {js|...|js} as "..." with this change?

IwanKaramazow (Contributor, Author) commented Sep 6, 2021

> is there any reason to keep the garbled way of parsing string literals around in the compiler?

Time is limited, this is the fastest way to get the feature.

> the JS interpreter will automatically convert a UTF8 string literal to UTF16 internally, so if you want behavior that matches JS, the current approach doesn't make sense. Also if you want to match legacy OCaml behaviors the approach also doesn't make sense, because OCaml sees strings as bytes, so some kind of UTF16 internal representation doesn't make sense in that case as well.

The standard says that the runtime model of JavaScript is based on UTF-16 code units. That doesn't specify what byte-encoding is used to convert a source file to those units.
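
As a rough illustration of the UTF-8 source to UTF-16 runtime step being discussed, here is a TypeScript sketch that turns UTF-8 bytes into UTF-16 code units. The function name `utf8ToUtf16Units` is hypothetical, the input is assumed to be well-formed UTF-8, and this is not how any particular JS engine or the ReScript compiler implements it.

```typescript
// Hypothetical sketch: decode well-formed UTF-8 bytes into UTF-16 code units.
// No validation is performed; malformed input gives meaningless results.
function utf8ToUtf16Units(bytes: number[]): number[] {
  const units: number[] = [];
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let codePoint: number;
    let extra: number; // number of continuation bytes that follow
    if (lead < 0x80) {
      codePoint = lead;
      extra = 0;
    } else if (lead < 0xe0) {
      codePoint = lead & 0x1f;
      extra = 1;
    } else if (lead < 0xf0) {
      codePoint = lead & 0x0f;
      extra = 2;
    } else {
      codePoint = lead & 0x07;
      extra = 3;
    }
    for (let k = 1; k <= extra; k++) {
      codePoint = (codePoint << 6) | (bytes[i + k] & 0x3f);
    }
    i += extra + 1;
    if (codePoint < 0x10000) {
      units.push(codePoint); // fits in one 16-bit code unit
    } else {
      // Outside the BMP: split into a high and a low surrogate.
      const offset = codePoint - 0x10000;
      units.push(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
    }
  }
  return units;
}

// The UTF-8 bytes of the fox emoji (U+1F98A) become a surrogate pair.
const foxUnits = utf8ToUtf16Units([0xf0, 0x9f, 0xa6, 0x8a]);
console.log(foxUnits); // [55358, 56714], i.e. 0xD83E 0xDD8A
```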

> BTW: will it also print {js|...|js} as "..." with this change?

The printer follows a different codepath with a separate internal encoding, you can rely on the parser emitting {js|…|js} for the compiler but not the other way around.

jfrolich commented Sep 7, 2021

> The standard says that the runtime model of JavaScript is based on UTF-16 code units. That doesn't specify what byte-encoding is used to convert a source file to those units.

Indeed. But most JavaScript files and thus string literals are encoded in utf-8 in the source files. If we need the garbling around (I never needed it and never found a usecase for it even after researching), can we put it as {utf16|...|utf16} for the people that really need it.

I am aware that this quick fix is probably the fastest way to get it shipped while not relying on Hongbo, but I am speaking on the long term. And if we agree I might be able to try and help accomplish this?

jfrolich commented Sep 7, 2021

> The printer follows a different codepath with a separate internal encoding, you can rely on the parser emitting {js|…|js} for the compiler but not the other way around.

Hmm so reformatting "bla" is going to yield {js|bla|js} (in some amount of the cases because it's not 100% reliable if I understand correctly?). That is not super pretty.

@IwanKaramazow
Contributor Author

If you don't mind me asking, why do you want your source files encoded in utf16?

UTF-8 is the most common character encoding method used on the internet today. Over 97% of all websites, likely including your own, store characters this way.
UTF-8 encoding is preferable to UTF-16 on the majority of websites, because it uses less memory. Recall that UTF-8 encodes each ASCII character in just one byte. UTF-16 must encode these same characters in either two or four bytes. This means that an English text file encoded with UTF-16 would be at least double the size of the same file encoded with UTF-8.

> Hmm so reformatting "bla" is going to yield {js|bla|js} (in some amount of the cases because it's not 100% reliable if I understand correctly?). That is not super pretty.

"bla" will still yield "bla".

jfrolich commented Sep 8, 2021

Haha no, I don't want my source encoded in utf16 at all. Where did you think I said that? I am just saying that even if string literals are in utf-8 at the file level, v8 (or any JavaScript engine) will convert them to the correct internal runtime representation (which is utf16 for JavaScript), so there is no need to do this "garbling" (let's use this term, probably not the best but you get what I mean).

I am just proposing the following (longer term):

  • Don't run garbling on string at all at a compiler level for "".
  • If we need to keep this around (I would love to see a usecase, so I know better why this feature exists, I think only Hongbo knows this), give it a separate literal like {utf16|...|utf16}.

jfrolich commented Sep 8, 2021

> "bla" will still yield "bla".

Ok, that is nice!

mununki added a commit to green-labs/ppx_spice that referenced this pull request Oct 25, 2021
- Currently, due to the pattern matching with unicode issue, use the ifthenelse.
- It is planned to release the built-in unicode encode for source. rescript-lang/syntax#433
- As soon as the new built-in feature is released, will change to use pattern matching again.