
utf8toUnicode encodes to ambiguous hex sequences #1211

Open
@katef


The utf8toUnicode transformation function outputs hex sequences of the form %uXXXX and %uXXXXXX. Other characters are passed through as-is. For some inputs, the encoded output is indistinguishable from the output produced by different inputs.

This ticket is to report two separate such encoding ambiguities: no escaping for a literal %, and trailing literal hex digits after four-digit sequences.

Escaping literal % characters

This function outputs hex sequences for non-ASCII codepoints. Other characters are passed through as-is.

The % character is passed through as-is, too, and so input of the form abc%uXXXXxyz will produce output which is indistinguishable from a legitimate hex sequence generated by the function.

This is possibly also a security risk, in that a consumer reading hex sequences would then treat the following few characters as digits (if they are legal hex characters), and then convert the sequence to a single codepoint. So this allows for a bypass by way of "sneaking through" those characters.
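To make the collision concrete, here is a minimal sketch in Python. The function model_utf8_to_unicode is a hypothetical model built only from the behaviour described in this ticket (four digits up to U+FFFF, six digits above, ASCII passed through); it is not the ModSecurity source and may differ from it in detail.

```python
def model_utf8_to_unicode(data: bytes) -> str:
    """Hypothetical model of the behaviour described in this ticket.

    Assumption: four hex digits for codepoints up to U+FFFF, six for
    anything above. This is not the ModSecurity implementation.
    """
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # ASCII, including '%', passes through as-is
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04x}")
        else:
            out.append(f"%u{cp:06x}")
    return "".join(out)


# A literal "%u0100" typed by the client ...
from_literal = model_utf8_to_unicode(b"abc%u0100xyz")
# ... and a real U+0100 character (UTF-8 bytes \xc4\x80).
from_codepoint = model_utf8_to_unicode(b"abc\xc4\x80xyz")

# Two different inputs, one output: a consumer cannot tell them apart.
assert from_literal == from_codepoint == "abc%u0100xyz"
```

Under this model, a consumer decoding the output has no way to know whether the % came from the encoder or from the original input.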

Variable length sequences

utf8toUnicode encodes to four hex digits for some codepoints, and six hex digits for other codepoints.

For example, the input \xc4\x80-\xf4\x8f\xbf\xbf is encoded to: %u0100-%u10ffff

This makes the output ambiguous. For example, for the input å00 the output would be %uXXXX00, where the 00 are low-ASCII bytes (literal hex digits) passed through as-is. That's indistinguishable from a single codepoint encoded as %uXXXXXX.

This could also be a possible bypass: if a rule looks to match %u1234 but would allow %u123456, an attacker could find some character for the 56 part which is syntactically permissible (whitespace, for example), and arrange for that to come after the disallowed character. Rather like \0-free shellcode, but the other way around.
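The same kind of sketch shows the variable-length collision. Again, model_utf8_to_unicode is a hypothetical model of the behaviour described here, not the real code. In particular, I am assuming that six-digit sequences are zero-padded, so that U+10000 comes out as %u010000.

```python
def model_utf8_to_unicode(data: bytes) -> str:
    """Hypothetical model of the behaviour described in this ticket.

    Assumptions: four hex digits up to U+FFFF, six zero-padded hex
    digits above. This is not the ModSecurity implementation.
    """
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # ASCII hex digits pass through as-is
        elif cp <= 0xFFFF:
            out.append(f"%u{cp:04x}")
        else:
            out.append(f"%u{cp:06x}")
    return "".join(out)


# U+0100 followed by the two literal ASCII characters "00" ...
four_plus_literal = model_utf8_to_unicode("\u010000".encode("utf-8"))
# ... versus the single codepoint U+10000.
single_six_digit = model_utf8_to_unicode("\U00010000".encode("utf-8"))

# Both inputs yield the same output string.
assert four_plus_literal == single_six_digit == "%u010000"
```

Here a three-character input and a one-character input are indistinguishable after encoding, so a rule written against one cannot reliably exclude the other.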


My suggested fix for both issues is to always encode to six-digit hex sequences, and also to encode % as a hex sequence.

This breaks compatibility for rules using utf8toUnicode, and so notice would need to be given to rule authors to update those rules. Hence I would suggest replacing utf8toUnicode with a new function (which I have called utf8toHex), and to update rules accordingly. Then utf8toUnicode may be removed.
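As a rough illustration of the proposal, the sketch below always emits six hex digits and also encodes a literal %. The name sketch_utf8_to_hex and the choice to keep the %u prefix are assumptions made for this example; the actual output format of utf8toHex is not specified in this ticket.

```python
def sketch_utf8_to_hex(data: bytes) -> str:
    """Sketch of the proposed fix, under assumptions stated above.

    Every non-ASCII codepoint, and the literal '%', becomes a fixed
    six-digit sequence; all other ASCII passes through unchanged.
    """
    out = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp >= 0x80 or ch == "%":
            out.append(f"%u{cp:06x}")  # fixed width, so no trailing-digit ambiguity
        else:
            out.append(ch)
    return "".join(out)


# A literal '%' is now escaped, so it cannot imitate an encoder sequence.
assert sketch_utf8_to_hex(b"abc%u0100xyz") == "abc%u000025u0100xyz"
assert sketch_utf8_to_hex("abc\u0100xyz".encode("utf-8")) == "abc%u000100xyz"

# U+0100 followed by "00" no longer collides with U+10000.
assert sketch_utf8_to_hex("\u010000".encode("utf-8")) == "%u00010000"
assert sketch_utf8_to_hex("\U00010000".encode("utf-8")) == "%u010000"
```

With a fixed width, every % in the output is followed by exactly one u and six hex digits, so a consumer can split the output back into sequences and literal characters without guessing.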

Metadata

Labels

2.x (Related to ModSecurity version 2.x), enhancement, waiting for v3 (New feature in v2 that is not yet available in v3. Therefore, not yet released.)
