Fix fallback for non-mapped Unicode char #1609
Closed
For example see this request:
t:utf8toUnicode turns this Unicode char into %u5a27, so its lowest byte is 0x27, which is a single quote in ASCII. This triggers false positives.
This happens with any Unicode character that has no mapping in the SecUnicodeMapFile and whose code point has 0x27 as its last byte. Likewise, characters whose code point ends in 0x22 are treated as a double quote, and so on.
The same problem exists with t:jsDecode, given a request containing the Unicode character fullwidth G (code point FF27).
Said differently: the last byte of a Unicode code point has no meaningful relation to the ASCII character that happens to share that byte value, so we shouldn't treat it as that character.
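To make the collision concrete, here is a minimal Python sketch of the low-byte behavior described above. It is an illustrative model only, not the actual transformation code; the function name `low_byte_fallback` is hypothetical.

```python
def low_byte_fallback(ch: str) -> str:
    """Hypothetical model of the problematic fallback:
    keep only the lowest byte of the code point and read it as ASCII."""
    return chr(ord(ch) & 0xFF)


# U+5A27 has low byte 0x27, which reads as a single quote.
print(repr(low_byte_fallback("\u5a27")))
# U+FF27 (fullwidth G) has the same low byte.
print(repr(low_byte_fallback("\uff27")))
# Any code point ending in 0x22 reads as a double quote.
print(repr(low_byte_fallback("\u5a22")))
```

None of these three characters has anything to do with quoting, yet each one comes out as a quote character that downstream rules may flag.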
I suggest replacing such characters with an x. I also considered a question mark or a space, but those could also trigger false positives (too many non-alphanumeric characters in a row). I also considered omitting the character entirely, but that could trigger a false positive too: for example, "-娧-" would have been OK but "--" is not.
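The trade-off between the candidate replacements can be sketched in Python as follows. This is a simplified model under stated assumptions, not the patch itself: `decode_unmapped` and its `mapping` argument (a plain dict standing in for the SecUnicodeMapFile contents) are hypothetical names.

```python
def decode_unmapped(text: str, mapping: dict, fallback: str = "x") -> str:
    """Hypothetical sketch: ASCII passes through, mapped code points use
    the map, and every other character becomes `fallback`."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # plain ASCII is kept as-is
        elif cp in mapping:
            out.append(chr(mapping[cp]))  # a mapping exists: use it
        else:
            out.append(fallback)          # no mapping: neutral placeholder
    return "".join(out)


sample = "-\u5a27-"  # the "-娧-" example from the description
print(decode_unmapped(sample, {}))                # -x-
print(decode_unmapped(sample, {}, fallback="?"))  # -?-
print(decode_unmapped(sample, {}, fallback=""))   # --
```

With x, the placeholder stays alphanumeric, so it neither introduces a quote nor lengthens a run of non-alphanumeric characters, which is the concern with the other two options.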