Description
The `utf8toUnicode` transformation function outputs hex sequences of the form `%uXXXX` and `%uXXXXXX`. Other characters are passed through as-is. For some inputs, the resulting output is indistinguishable from the encoded output produced by different inputs.
This ticket reports two separate such encoding ambiguities: no escaping for a literal `%`, and trailing literal hex digits after four-digit sequences.
Escaping literal % characters
This function outputs hex sequences only for non-ASCII codepoints; other characters are passed through as-is.
The `%` character is passed through as-is, too, so input of the form `abc%uXXXXxyz` will produce output which is indistinguishable from a legitimate hex sequence generated by the function.
This is possibly also a security risk: a consumer reading hex sequences would treat the characters following the literal `%u` as digits (if they are legal hex characters), and then convert the sequence to a single codepoint. This allows a bypass by way of "sneaking through" those characters.
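A minimal Python sketch of the assumed behaviour makes the collision concrete. The function name and exact encoding logic here are illustrative assumptions, not the actual implementation:

```python
def utf8_to_unicode(s: str) -> str:
    # Illustrative sketch (assumed behaviour, not the real implementation):
    # non-ASCII codepoints become %uXXXX or %uXXXXXX; everything else,
    # including a literal '%', passes through untouched.
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)              # '%' is not escaped
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04x}")   # four-digit form
        else:
            out.append(f"%u{cp:06x}")   # six-digit form
    return "".join(out)

# Two different inputs collide on the same output:
print(utf8_to_unicode("abc\u1234xyz"))   # prints "abc%u1234xyz"
print(utf8_to_unicode("abc%u1234xyz"))   # prints "abc%u1234xyz"
```

A downstream consumer decoding the second output has no way to tell that its `%u1234` was attacker-supplied literal text rather than an encoded codepoint.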
Variable length sequences
`utf8toUnicode` encodes some codepoints to four hex digits, and others to six.
For example, the input `\xc4\x80-\xf4\x8f\xbf\xbf` is encoded to `%u0100-%u10ffff`.
This makes the output ambiguous. Take the input `å00`, for example: the output would be `%uXXXX00`, where the `00` are low-ASCII bytes (literal hex digits) passed through as-is. That's indistinguishable from a single codepoint encoded as `%uXXXXXX`.
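Under the same assumed encoding behaviour (a hypothetical sketch, not the real implementation), the four-digit/six-digit split yields a concrete collision between a BMP codepoint followed by two literal hex digits and a single astral codepoint:

```python
def utf8_to_unicode(s: str) -> str:
    # Illustrative sketch (assumed behaviour): BMP codepoints encode to
    # %uXXXX, astral codepoints to %uXXXXXX; ASCII passes through as-is.
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04x}")
        else:
            out.append(f"%u{cp:06x}")
    return "".join(out)

# U+0102 followed by the literal digits "34" collides with U+10234:
print(utf8_to_unicode("\u010234"))      # prints "%u010234"
print(utf8_to_unicode("\U00010234"))    # prints "%u010234"
```

A parser reading the output cannot know whether to stop after four hex digits or consume six.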
This could also be a possible bypass: for a rule that looks to match `%u1234` but would allow `%u123456`, an attacker would find some character for the `56` part which is syntactically permissible (whitespace, for example), and arrange for it to appear after the disallowed character. Rather like `\0`-free shellcode, but the other way around.
My suggested fix for both issues is to always encode to six-digit hex sequences, and also to encode `%` as a hex sequence.
This breaks compatibility for rules using `utf8toUnicode`, so notice would need to be given to rule authors to update those rules. Hence I would suggest replacing `utf8toUnicode` with a new function (which I have called `utf8toHex`), updating rules accordingly, and then removing `utf8toUnicode`.
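A sketch of what the proposed replacement could look like; `utf8toHex` is only the name suggested in this ticket, and the code below is an assumption about its behaviour, not an implementation proposal for any particular codebase:

```python
def utf8_to_hex(s: str) -> str:
    # Hypothetical sketch of the suggested fix: always emit six hex
    # digits, and escape the literal '%' as well, so every '%' in the
    # output unambiguously starts a fixed-width sequence.
    out = []
    for ch in s:
        cp = ord(ch)
        if cp >= 0x80 or ch == "%":
            out.append(f"%u{cp:06x}")
        else:
            out.append(ch)
    return "".join(out)

# The ambiguous inputs from above now produce distinct outputs:
print(utf8_to_hex("abc%u1234xyz"))  # literal '%' is escaped
print(utf8_to_hex("abc\u1234xyz"))
print(utf8_to_hex("\u010234"))
print(utf8_to_hex("\U00010234"))
```

With a fixed width and an escaped `%`, a consumer can always read exactly six hex digits after `%u`, and no literal input byte can masquerade as part of a sequence.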