Fix fallback for non-mapped Unicode char #1609

allanbomsft
For example see this request:

POST / HTTP/1.1
Host: somehost:8080
Accept: */*
User-Agent: someagent
Content-Length: 30
Content-Type: application/json;charset=utf-8

{
    "a": "娧    "
}

t:utf8toUnicode turns this Unicode character into %u5a27, and the lowest byte of that code point is 0x27, which is a single quote in ASCII. This triggers false positives.

This happens with any Unicode character that has no mapping in the SecUnicodeMapFile and whose code point ends in the byte 0x27. Likewise, a character whose code point ends in 0x22 is treated as a double quote, and so on.
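
To make the effect concrete, here is a minimal Python sketch. It is not ModSecurity's code: the function name `low_byte_fallback` is hypothetical, and it assumes the fallback for an unmapped character simply keeps the lowest byte of the code point, as described above.

```python
def low_byte_fallback(ch: str) -> str:
    """Hypothetical model of the fallback: keep only the lowest byte of the code point."""
    return chr(ord(ch) & 0xFF)


# U+5A27 has low byte 0x27, which is an ASCII single quote.
print(repr(low_byte_fallback("娧")))

# Any code point ending in 0x22 collapses to an ASCII double quote.
print(repr(low_byte_fallback(chr(0x5A22))))
```

Neither input contains a quote character, yet under this model both are seen as one.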

The same problem exists with t:jsDecode given a request containing the Unicode character fullwidth G (code point FF27).

Said differently: the last byte in the Unicode code point does not have any meaningful relation to whatever ASCII char happens to be represented by the same byte, and so we shouldn't treat it so.

I suggest replacing such characters with an x. I also considered a question mark or a space, but those could also trigger false positives (too many non-alphanumeric characters in a row). I also considered omitting the character entirely, but that could trigger a false positive too: for example, "-娧-" would have been OK but "--" is not.
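
The suggested behaviour can be sketched in Python as follows. This only illustrates the idea and is not the patch itself: the function `to_ascii_with_fallback`, the `unicode_map` dictionary, and the sample map entry are all hypothetical stand-ins for whatever SecUnicodeMapFile provides.

```python
def to_ascii_with_fallback(text, unicode_map, placeholder="x"):
    """Map each character to ASCII; unmapped non-ASCII characters become a placeholder.

    unicode_map is a hypothetical dict {code point: ASCII byte value}
    standing in for the contents of SecUnicodeMapFile.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                     # plain ASCII passes through
        elif cp in unicode_map:
            out.append(chr(unicode_map[cp]))   # explicit mapping wins
        else:
            out.append(placeholder)            # no mapping: neutral alphanumeric
    return "".join(out)


# Hypothetical map entry: fullwidth apostrophe (U+FF07) maps to an ASCII quote.
sample_map = {0xFF07: 0x27}

print(to_ascii_with_fallback("-娧-", sample_map))          # unmapped -> "-x-"
print(to_ascii_with_fallback("-" + chr(0xFF07) + "-", sample_map))  # mapped -> "-'-"
```

Using an alphanumeric placeholder keeps the character count and avoids creating new runs of punctuation, which is the reason given above for preferring x over a space, a question mark, or deletion.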

@zimmerle zimmerle added the 3.x Related to ModSecurity version 3.x label Feb 28, 2018
@zimmerle zimmerle self-assigned this Apr 24, 2018
@zimmerle zimmerle self-requested a review April 24, 2018 01:58
@victorhora victorhora self-assigned this Sep 14, 2018
@victorhora victorhora self-requested a review September 14, 2018 20:46
@victorhora victorhora added this to the v3.0.4 milestone Nov 13, 2018
@zimmerle
Contributor

Hi @allanbomsft,

Thank you for the patch. The transformations in ModSecurity are basically used as a way to prevent evasion. That is the case for t:utf8toUnicode. The conversion takes SecUnicodeMapFile into consideration. The conversion here may not need a fallback, as it is working exactly as it was designed to: matching whatever happens in the backend app.

Python

>>> hex(ord("娧"))
'0x5a27'

php

$ /tmp  cat a.php
<?php
echo json_encode("娧");
?>

$ /tmp  php a.php
"\u5a27"

JavaScript

> encodeURIComponent(escape("娧"))
< "%25u5A27"

Rules that make use of t:utf8toUnicode need to be aware that the result will be Unicode, and it is highly recommended to have the SecUnicodeMapFile configured correctly. Therefore I am closing this without a merge. If you point us to the specific rule that is leading to the false positive, we may be able to assist you better. Thank you.

@zimmerle zimmerle closed this Nov 26, 2018
@allanbomsft
Author

allanbomsft commented Nov 27, 2018

It's been more than a year since I sent this, so my memory of this issue is a bit hazy :-) I've dug through my notes and reproduced the scenario again on the SpiderLabs branch (we are running with my patch in production on the Microsoft branch, so no repro there).

I understand that the conversion takes SecUnicodeMapFile into consideration, but this fix relates only to characters for which no mapping exists in the SecUnicodeMapFile.

For example, if there is no mapping for 娧 in the file, then the following request triggers a false positive on CRS rule 942110.

POST / HTTP/1.1
Host: somehost:8080
Accept: */*
User-Agent: someagent
Content-Length: 37
Content-Type: application/json;charset=utf-8

{
    "a": "娧",
    "b": "娧"
}

This is because ModSecurity misinterprets this request as if it were

{
    "a": "'",
    "b": "'"
}

because, as mentioned in the original post, the last octet of codepoint 5A27 is 27. It is this fallback mapping from codepoint 5A27 to 27 that is incorrect. It is not what the backend receives. The UTF-8 encoded representation of 娧 is E5A8A7.
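
A quick Python check of the distinction between the code point and the bytes actually sent on the wire (plain standard-library calls, nothing ModSecurity-specific):

```python
ch = "娧"

code_point = ord(ch)            # the Unicode code point
wire_bytes = ch.encode("utf-8")  # what the backend receives in a UTF-8 body

print(hex(code_point))          # code point, ends in 0x27
print(wire_bytes.hex())         # UTF-8 bytes, none of which is 0x27
print(0x27 in wire_bytes)       # is there an ASCII single quote on the wire?
```

The byte 0x27 appears only as the low byte of the code point, never in the UTF-8 encoding, so a backend decoding the body as UTF-8 has no way to see a single quote here.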

This is true for any char whose codepoint ends in 27, such as

5727  圧
5527  唧
5427  吧

Labels
3.x Related to ModSecurity version 3.x enhancement RIP - libmodsecurity