Unicode support #433

IwanKaramazow · 2021-06-04T16:34:10Z

Unicode support

This PR adds support for Unicode codepoints at the syntax level: ReScript source code is now considered Unicode text encoded in UTF-8.

Fixes #397

Codepoint literals

A codepoint literal represents an integer value identifying a Unicode code point. It is written as one or more characters enclosed in single quotes, for example 'x', '\n' or '\u{00A9}'. A single integer value may be represented by multiple UTF-8-encoded bytes in the source.
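
To illustrate how several UTF-8 bytes map to one integer value, here is a minimal TypeScript sketch that decodes a single code point from its bytes. The helper name `decodeUtf8CodePoint` is hypothetical and this is not the scanner's actual implementation; it also assumes well-formed input and skips checks for overlong encodings and surrogates.

```typescript
// Hypothetical helper for illustration only; not the ReScript scanner's code.
// Decodes the first code point of a UTF-8 byte sequence and reports how many
// bytes it occupied. Overlong encodings and surrogates are NOT rejected here.
function decodeUtf8CodePoint(bytes: number[]): { codePoint: number; length: number } {
  const lead = bytes[0];
  let length: number;
  let codePoint: number;

  if (lead < 0x80) {
    // 0xxxxxxx: plain ASCII, one byte
    length = 1;
    codePoint = lead;
  } else if ((lead & 0xe0) === 0xc0) {
    // 110xxxxx: two-byte sequence
    length = 2;
    codePoint = lead & 0x1f;
  } else if ((lead & 0xf0) === 0xe0) {
    // 1110xxxx: three-byte sequence
    length = 3;
    codePoint = lead & 0x0f;
  } else if ((lead & 0xf8) === 0xf0) {
    // 11110xxx: four-byte sequence
    length = 4;
    codePoint = lead & 0x07;
  } else {
    throw new Error("invalid UTF-8 lead byte");
  }

  if (bytes.length < length) {
    throw new Error("truncated UTF-8 sequence");
  }

  // Each continuation byte (10xxxxxx) contributes six more bits.
  for (let i = 1; i < length; i++) {
    const b = bytes[i];
    if ((b & 0xc0) !== 0x80) {
      throw new Error("invalid UTF-8 continuation byte");
    }
    codePoint = (codePoint << 6) | (b & 0x3f);
  }

  return { codePoint, length };
}
```

For example, the two bytes 0xC2 0xA9 decode to 0xA9 (©), and the four bytes 0xF0 0x9F 0xA6 0x8A decode to 0x1F98A (🦊).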

String literals

String literals are (possibly multi-byte) UTF-8 encoded character sequences between double quotes, as in "fox 🦊 \u{2665}". Internally they now compile to {js|fox 🦊 \u{2665}|js}, because we don't want to garble the JS output.

New escape sequences

Both codepoint and string literals accept the following new escape sequences:

1) Unicode escape sequences
Any character with a character code lower than 65536 (0x10000) can be escaped using the hexadecimal value of its character code, prefixed with \u. These escapes are always six characters long: \u followed by exactly four hexadecimal digits. If the hexadecimal character code is only one, two or three digits long, pad it with leading zeroes.
Example: '\u2665' (represents ♥)
2) Unicode codepoint escape sequences
Any code point can be escaped using the hexadecimal value of its character code, prefixed with \u{ and suffixed with }. This allows for code points up to 0x10FFFF, the highest code point defined by Unicode. These escapes are at least five characters long, because at least one hexadecimal digit must be wrapped in \u{…}. There is no upper limit on the number of hex digits (for example, '\u{000000000061}' == 'a').
Example: '\u{2318}' (represents ⌘)
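
As a rough illustration of the two escape forms, the TypeScript sketch below decodes them inside the body of a literal. The names `codePointToString` and `decodeUnicodeEscapes` are hypothetical, and the regex-based approach is a simplification chosen for brevity, not how the ReScript scanner works: it does not handle an escaped backslash in front of `\u`, and it does not reject surrogate code points.

```typescript
// Hypothetical helpers for illustration; not the parser's implementation.

// Turn a code point into a JS string (one or two UTF-16 code units).
function codePointToString(cp: number): string {
  if (cp > 0x10ffff) {
    throw new Error("code point above 0x10FFFF");
  }
  if (cp < 0x10000) {
    return String.fromCharCode(cp);
  }
  // Code points above 0xFFFF need a surrogate pair in UTF-16.
  const offset = cp - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

// Replace \u{H...} (one or more hex digits) and \uHHHH (exactly four hex
// digits) with the characters they denote.
function decodeUnicodeEscapes(body: string): string {
  const pattern = /\\u\{([0-9a-fA-F]+)\}|\\u([0-9a-fA-F]{4})/g;
  return body.replace(pattern, (_match: string, braced?: string, fixed?: string) => {
    const hex = (braced || fixed) as string;
    // Leading zeroes are harmless: "000000000061" parses to 0x61.
    return codePointToString(parseInt(hex, 16));
  });
}
```

With these definitions, `decodeUnicodeEscapes("\\u2665")` yields "♥", `decodeUnicodeEscapes("\\u{2318}")` yields "⌘", and `decodeUnicodeEscapes("\\u{000000000061}")` yields "a".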

jfrolich · 2021-06-07T01:02:35Z

Why are \u{..} escape sequences needed in string literals if the source text is already assumed to be encoded in utf-8 (and thus able to express any code point)?

IwanKaramazow · 2021-06-07T06:35:40Z

Escape sequences are primarily used to put difficult-to-represent "characters" in a character or string literal. I guess you could compare it to using \n, \t or a hex escape sequence like \x41.

jfrolich · 2021-06-07T07:12:45Z

Does this mean that normal string literals ("") compiled to a JavaScript source file would be 100% compatible, as opposed to the string literal getting messed up when it contains a codepoint that doesn't fit in 16 bits (like an emoji)? That would be very good news!

IwanKaramazow · 2021-06-08T07:03:34Z

@jfrolich Indeed that's the goal, no more weird escaping in the js output =D

jfrolich · 2021-09-06T03:23:56Z

Question: why does this need to happen at the syntax level? Wouldn't it be a better fix to just not let the ReScript compiler fudge string literals in any case, and not just for {js|...|js} type strings?

IwanKaramazow · 2021-09-06T07:40:36Z

It happens on both levels:

compiler: {js|...|js} ast node is a compiler construct with utf8 semantics.
syntax: needs to parse strings into that construct

Hongbo suggested just parsing strings as {js|...|js}, since that construct already contains all the logic.

jfrolich · 2021-09-06T09:30:23Z

Yeah, I guess it works, but is there any reason to keep the garbled way of parsing string literals around in the compiler?

Background: the JS interpreter will automatically convert a UTF8 string literal to UTF16 internally, so if you want behavior that matches JS, the current approach doesn't make sense. Also if you want to match legacy OCaml behaviors the approach also doesn't make sense, because OCaml sees strings as bytes, so some kind of UTF16 internal representation doesn't make sense in that case as well.

BTW: will it also print {js|...|js} as "..." with this change?

IwanKaramazow · 2021-09-06T10:23:10Z

> is there any reason to keep the garbled way of parsing string literals around in the compiler.

Time is limited, this is the fastest way to get the feature.

> the JS interpreter will automatically convert a UTF8 string literal to UTF16 internally, so if you want behavior that matches JS, the current approach doesn't make sense. Also if you want to match legacy OCaml behaviors the approach also doesn't make sense, because OCaml sees strings as bytes, so some kind of UTF16 internal representation doesn't make sense in that case as well.

The standard says that the runtime model of JavaScript is based on UTF-16 code units. That doesn't specify what byte-encoding is used to convert a source file to those units.
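
The distinction between runtime code units and source bytes can be seen with a small TypeScript sketch. The helper names `utf16CodeUnits` and `countCodePoints` are hypothetical and only illustrate the point: a JS string exposes UTF-16 code units no matter how the source file was encoded on disk.

```typescript
// Hypothetical helpers for illustration.

// List the UTF-16 code units of a string as the JS runtime exposes them.
function utf16CodeUnits(s: string): number[] {
  const units: number[] = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// Count code points by treating a high surrogate followed by a low
// surrogate as a single code point.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) {
      i++; // skip the low surrogate, it belongs to the same code point
    }
    count++;
  }
  return count;
}
```

Here `utf16CodeUnits("🦊")` gives the two code units 0xD83E and 0xDD8A, so `"🦊".length` is 2 while `countCodePoints("🦊")` is 1.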

> BTW: will it also print {js|...|js} as "..." with this change?

The printer follows a different codepath with a separate internal encoding. You can rely on the parser emitting {js|…|js} for the compiler, but not the other way around.

jfrolich · 2021-09-07T01:16:44Z

> The standard says that the runtime model of JavaScript is based on UTF-16 code units. That doesn't specify what byte-encoding is used to convert a source file to those units.

Indeed. But most JavaScript files, and thus their string literals, are encoded in UTF-8 in the source files. If we need to keep the garbling around (I never needed it and never found a use case for it, even after researching), can we expose it as {utf16|...|utf16} for the people who really need it?

I am aware that this quick fix is probably the fastest way to get it shipped without relying on Hongbo, but I am talking about the long term. And if we agree, I might be able to try and help accomplish this?

jfrolich · 2021-09-07T01:18:08Z

> The printer follows a different codepath with a separate internal encoding, you can rely on the parser emitting {js|…|js} for the compiler but not the other way around.

Hmm so reformatting "bla" is going to yield {js|bla|js} (in some amount of the cases because it's not 100% reliable if I understand correctly?). That is not super pretty.

IwanKaramazow · 2021-09-07T15:10:34Z

If you don't mind me asking, why do you want your source files encoded in UTF-16?

UTF-8 is the most common character encoding method used on the internet today. Over 97% of all websites, likely including your own, store characters this way.
UTF-8 encoding is preferable to UTF-16 on the majority of websites, because it uses less memory. Recall that UTF-8 encodes each ASCII character in just one byte. UTF-16 must encode these same characters in either two or four bytes. This means that an English text file encoded with UTF-16 would be at least double the size of the same file encoded with UTF-8.
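
The size comparison can be checked with a small TypeScript sketch. The helpers `utf8ByteLength` and `utf16ByteLength` are hypothetical names; the UTF-16 figure assumes no byte order mark, and a lone surrogate is counted as three UTF-8 bytes, which is an assumption of this sketch.

```typescript
// Hypothetical helpers for illustration.

// Number of bytes the string would take when encoded as UTF-8.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      bytes += 4; // surrogate pair: one code point above 0xFFFF
      i++;
    } else {
      bytes += 3; // rest of the Basic Multilingual Plane (and lone surrogates)
    }
  }
  return bytes;
}

// Number of bytes the string would take when encoded as UTF-16 (no BOM):
// every code unit is two bytes.
function utf16ByteLength(s: string): number {
  return s.length * 2;
}
```

For the ASCII text "fox" this gives 3 bytes in UTF-8 and 6 bytes in UTF-16. For "🦊" both encodings need 4 bytes, and for "♥" UTF-8 needs 3 bytes while UTF-16 needs 2, so the doubling applies to ASCII-heavy text rather than to every string.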

> Hmm so reformatting "bla" is going to yield {js|bla|js} (in some amount of the cases because it's not 100% reliable if I understand correctly?). That is not super pretty.

"bla" will still yield "bla".

jfrolich · 2021-09-08T00:48:00Z

Haha no, I don't want my source encoded in UTF-16 at all; where did you think I said that? I am just saying that even if string literals are in UTF-8 at the file level, V8 (or any JavaScript engine) will convert them to the correct internal runtime representation (which is UTF-16 for JavaScript), so there is no need to do this "garbling" (let's use this term; probably not the best, but you get what I mean).

I am just proposing the following (longer term):

Don't run garbling on strings at all at the compiler level for "".
If we need to keep this around (I would love to see a use case, so I know better why this feature exists; I think only Hongbo knows this), give it a separate literal like {utf16|...|utf16}.

jfrolich · 2021-09-08T00:51:11Z

> "bla" will still yield "bla".

Ok, that is nice!

- Currently, due to the issue with pattern matching on Unicode, use if-then-else instead.
- A built-in Unicode encoding for source files is planned for release: rescript-lang/syntax#433
- As soon as the new built-in feature is released, this will change back to pattern matching.

IwanKaramazow force-pushed the utf8 branch from c5f808f to 8cea73c Compare August 23, 2021 16:41

Iwan added 4 commits August 23, 2021 21:51

Rename Character token to Codepoint token.

a6a6d71

Codepoint makes more sense with unicode

Add comment about codepoint literal encoding for printer.

051abc1

Parse all normal strings as {js||js} strings.

f5166a0

The compiler processes these strings with js semantics. Previously {js||js} were interpreted as template literal strings. The internal encoding has been changed to use an attribute (@res.template) to detect template literal strings.

IwanKaramazow force-pushed the utf8 branch from 8cea73c to f5166a0 Compare August 23, 2021 19:51

IwanKaramazow merged commit 79a4bef into master Aug 24, 2021

IwanKaramazow deleted the utf8 branch August 24, 2021 05:09

l3v-m mentioned this pull request Nov 13, 2021

Possibly forgotten change of string quotation delimiter #466

Closed
