Why are |
Escape sequences are primarily used to put hard/difficult to represent "characters" in a character or string literal. I guess you could compare it by using |
Does this mean that normal string literals ( |
@jfrolich Indeed that's the goal, no more weird escaping in the JS output =D
Codepoint makes more sense with unicode
The compiler processes these strings with JS semantics. Previously, `{js||js}` strings were interpreted as template literal strings. The internal encoding now uses an attribute (`@res.template`) to detect template literal strings.
Question: why does this need to happen at the syntax level? Wouldn't it be a better fix to just not let the ReScript compiler fudge string literals in any case, and not just |
It happens on both levels:
Hongbo suggested to just parse strings as |
Yeah, I guess it works, but is there any reason to keep the garbled way of parsing string literals around in the compiler? Background: the JS interpreter will automatically convert a UTF-8 string literal to UTF-16 internally, so if you want behavior that matches JS, the current approach doesn't make sense. If you want to match legacy OCaml behavior, the approach doesn't make sense either: OCaml sees strings as bytes, so some kind of UTF-16 internal representation doesn't fit that case. BTW: will it also print |
Time is limited, this is the fastest way to get the feature.
The standard says that the runtime model of JavaScript is based on UTF-16 code units. That doesn't specify what byte-encoding is used to convert a source file to those units.
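To make the code-unit point concrete, here is a minimal TypeScript sketch. It is not taken from this PR, and `countCodePoints` is a hypothetical helper written only for illustration. It shows that a JS string's `length` counts UTF-16 code units, regardless of how the source file was encoded, while the number of Unicode code points can be smaller.

```typescript
// Hypothetical helper (an assumption for illustration, not part of ReScript):
// counts Unicode code points by walking the UTF-16 code units of a JS string.
function countCodePoints(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const isHighSurrogate = unit >= 0xd800 && unit <= 0xdbff;
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    const isLowSurrogate = next >= 0xdc00 && next <= 0xdfff;
    // A high surrogate followed by a low surrogate encodes a single code point.
    if (isHighSurrogate && isLowSurrogate) {
      i++;
    }
    count++;
  }
  return count;
}

const fox = "fox 🦊";
const codeUnits = fox.length; // counts UTF-16 code units
const codePoints = countCodePoints(fox); // counts Unicode code points
```

The emoji occupies two code units at runtime but is one code point, which is why the file-level encoding (UTF-8 here) and the runtime model (UTF-16 code units) are separate questions.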
The printer follows a different codepath with a separate internal encoding, you can rely on the parser emitting |
Indeed. But most JavaScript files, and thus their string literals, are encoded in UTF-8 in the source files. If we need the garbling around (I never needed it and never found a use case for it even after researching), can we put it as
I am aware that this quick fix is probably the fastest way to get it shipped while not relying on Hongbo, but I am talking about the long term. And if we agree, I might be able to try and help accomplish this?
Hmm so reformatting |
If you don't mind me asking, why do you want your source files encoded in UTF-16? UTF-8 is the most common character encoding method used on the internet today. Over 97% of all websites, likely including your own, store characters this way.
Haha, no, I don't want my source encoded in UTF-16 at all. Where did you think I said that? I am just saying that even if string literals are UTF-8 at the file level, V8 (or any JavaScript engine) will convert them to the correct internal runtime representation (which is UTF-16 for JavaScript), so there is no need to do this "garbling" (let's use this term; it's probably not the best, but you get what I mean). I am just proposing the following (longer term):
Ok, that is nice!
- Currently, due to the issue with pattern matching on Unicode, use if-then-else.
- A built-in Unicode encoding for source is planned for release: rescript-lang/syntax#433
- As soon as the new built-in feature is released, this will change back to pattern matching.
Unicode support
This PR adds support for Unicode codepoints at the syntax level: ReScript source code is now considered unicode text encoded in UTF-8.
Fixes #397
Codepoint literals
A codepoint literal represents an integer value identifying a Unicode code point. It is expressed as one or more characters enclosed in single quotes. Examples are `'x'`, `'\n'` or `'\u{00A9}'`. Multiple UTF-8-encoded bytes may represent a single integer value.
String literals
String literals are (possibly multi-byte) UTF-8 encoded character sequences between double quotes, as in `"fox 🦊 \u{2665}"`. Internally they now compile as `{js|fox 🦊 \u{2665}|js}`, because we don't want to garble the JS output.
New escape sequences
Both codepoint and string literals accept the following new escape sequences:
Unicode escape sequences
Any character with a character code lower than 65536 can be escaped using the hexadecimal value of its character code, prefixed with `\u`. Unicode escapes are six characters long: they require exactly four hexadecimal digits following `\u`. If the hexadecimal character code is only one, two or three digits long, you’ll need to pad it with leading zeroes.
Example: `'\u2665'` (represents ♥)
Unicode codepoint escape sequences
Any code point or character can be escaped using the hexadecimal value of its character code, prefixed with `\u{` and suffixed with `}`. This allows for code points up to 0x10FFFF, which is the highest code point defined by Unicode. Unicode code point escapes consist of at least five characters, since at least one hexadecimal digit must be wrapped in `\u{…}`. There is no upper limit on the number of hex digits in use (for example `'\u{000000000061}' == 'a'`).
Example: `'\u{2318}'` (represents ⌘)
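
The two escape forms described above can be sketched as a small decoder. This is an illustrative TypeScript sketch under stated assumptions, not the parser's actual implementation: `decodeUnicodeEscape` is a hypothetical name, and rejecting values above 0x10FFFF is an assumption based on the maximum stated above. How the real parser treats surrogate values is not covered here.

```typescript
// Hypothetical helper (an assumption for illustration, not the ReScript parser):
// decodes one `\uXXXX` or `\u{X...}` escape into its integer code point.
function decodeUnicodeEscape(escape: string): number {
  // Form 1: exactly four hex digits after `\u` (code points below 65536).
  if (/^\\u[0-9a-fA-F]{4}$/.test(escape)) {
    return parseInt(escape.slice(2), 16);
  }
  // Form 2: one or more hex digits wrapped in `\u{...}`, with no limit on digit count.
  if (/^\\u\{[0-9a-fA-F]+\}$/.test(escape)) {
    const codePoint = parseInt(escape.slice(3, -1), 16);
    // Assumption: anything above the highest Unicode code point is rejected.
    if (codePoint > 0x10ffff) {
      throw new RangeError("code point out of range: " + escape);
    }
    return codePoint;
  }
  throw new SyntaxError("not a Unicode escape: " + escape);
}

const heart = decodeUnicodeEscape("\\u2665"); // four-digit form
const command = decodeUnicodeEscape("\\u{2318}"); // braced form
const paddedA = decodeUnicodeEscape("\\u{000000000061}"); // leading zeroes are allowed
```

Here `paddedA` is 97, the code point of `a`, which matches the padding example in the description. A three-digit input such as `\u266` is rejected by the first form because exactly four digits are required.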