Description
The `utf8toUnicode` transformation function outputs hex sequences of the form `%uXXXX` and `%uXXXXXX`. Other characters are passed through as-is. For some inputs, the resulting output is indistinguishable from the encoded output produced by different inputs.
This ticket reports two separate such encoding ambiguities: no escaping for a literal `%`, and trailing literal hex digits after four-digit sequences.
Escaping literal % characters
This function outputs hex sequences only for non-ASCII codepoints; other characters are passed through as-is.
The `%` character is passed through as-is, too, so input of the form `abc%uXXXXxyz` will produce output which is indistinguishable from a legitimate hex sequence generated by the function.
This is possibly also a security risk: a consumer reading hex sequences would treat the characters following the literal `%u` as digits (if they are legal hex characters), and then convert the sequence to a single codepoint. This allows a bypass by way of "sneaking through" those characters.
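A minimal Python sketch of the assumed behaviour makes the collision concrete. The function name and exact encoding logic here are illustrative assumptions, not the actual implementation:

```python
def utf8_to_unicode(s: str) -> str:
    # Illustrative sketch (assumed behaviour, not the real implementation):
    # non-ASCII codepoints become %uXXXX or %uXXXXXX; everything else,
    # including a literal '%', passes through untouched.
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)              # '%' is not escaped
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04x}")   # four-digit form
        else:
            out.append(f"%u{cp:06x}")   # six-digit form
    return "".join(out)

# Two different inputs collide on the same output:
print(utf8_to_unicode("abc\u1234xyz"))   # prints "abc%u1234xyz"
print(utf8_to_unicode("abc%u1234xyz"))   # prints "abc%u1234xyz"
```

A downstream consumer decoding the second output has no way to tell that its `%u1234` was attacker-supplied literal text rather than an encoded codepoint.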
Variable length sequences
`utf8toUnicode` encodes some codepoints to four hex digits, and others to six.
For example, the input `\xc4\x80-\xf4\x8f\xbf\xbf` is encoded to `%u0100-%u10ffff`.
This makes the output ambiguous. Take the input `å00`, for example: the output would be `%uXXXX00`, where the `00` are low-ASCII bytes (literal hex digits) passed through as-is. That's indistinguishable from a single codepoint encoded as `%uXXXXXX`.
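Under the same assumed encoding behaviour (a hypothetical sketch, not the real implementation), the four-digit/six-digit split yields a concrete collision between a BMP codepoint followed by two literal hex digits and a single astral codepoint:

```python
def utf8_to_unicode(s: str) -> str:
    # Illustrative sketch (assumed behaviour): BMP codepoints encode to
    # %uXXXX, astral codepoints to %uXXXXXX; ASCII passes through as-is.
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04x}")
        else:
            out.append(f"%u{cp:06x}")
    return "".join(out)

# U+0102 followed by the literal digits "34" collides with U+10234:
print(utf8_to_unicode("\u010234"))      # prints "%u010234"
print(utf8_to_unicode("\U00010234"))    # prints "%u010234"
```

A parser reading the output cannot know whether to stop after four hex digits or consume six.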
This could also be a possible bypass: for a rule that looks to match `%u1234` but would allow `%u123456`, an attacker would find some character for the `56` part which is syntactically permissible (whitespace, for example), and arrange for it to appear after the disallowed character. Rather like `\0`-free shellcode, but the other way around.
My suggested fix for both issues is to always encode to six-digit hex sequences, and also to encode `%` as a hex sequence.
This breaks compatibility for rules using `utf8toUnicode`, so notice would need to be given to rule authors to update those rules. Hence I would suggest replacing `utf8toUnicode` with a new function (which I have called `utf8toHex`), updating rules accordingly, and then removing `utf8toUnicode`.
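A sketch of what the proposed replacement could look like; `utf8toHex` is only the name suggested in this ticket, and the code below is an assumption about its behaviour, not an implementation proposal for any particular codebase:

```python
def utf8_to_hex(s: str) -> str:
    # Hypothetical sketch of the suggested fix: always emit six hex
    # digits, and escape the literal '%' as well, so every '%' in the
    # output unambiguously starts a fixed-width sequence.
    out = []
    for ch in s:
        cp = ord(ch)
        if cp >= 0x80 or ch == "%":
            out.append(f"%u{cp:06x}")
        else:
            out.append(ch)
    return "".join(out)

# The ambiguous inputs from above now produce distinct outputs:
print(utf8_to_hex("abc%u1234xyz"))  # literal '%' is escaped
print(utf8_to_hex("abc\u1234xyz"))
print(utf8_to_hex("\u010234"))
print(utf8_to_hex("\U00010234"))
```

With a fixed width and an escaped `%`, a consumer can always read exactly six hex digits after `%u`, and no literal input byte can masquerade as part of a sequence.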