Description
I want to talk about an inherent defect of the incremental UTF-7 decoder in Python.
UTF-7 is an historical encoding that encodes Unicode as a sequence of 7-bit bytes. Some characters (including Latin letters and digits) can be represented in UTF-8 directly as single ASCII bytes. "+" is encoded as "+-". All other characters first encoded using UTF-16, and then Base64, and the result is surrounded by "+" and "-". For example, "абвгде" is represented as "+BDAEMQQyBDMENAQ1-".
Incremental decoders allow to decode chunked data. If it encounters at the and of the input a bytes sequence which can be a prefix of the valid sequence, it does not interpret it as error, but returns the start position of an incomplete sequence. When new data is available, it is appended to the incomplete sequence, and the incremental decoder is called again.
The problem is that the length of an incomplete sequence in UTF-7 is not limited. The sequence starts with "+" and ends with "-", with Base64 characters in between. If "-" never occurs, the sequence will never end. After receiving new data the incremental decoder starts decoding from the same position. Each step has a linear complexity depending on the length of the sequence, so the total complexity is quadratic. UTF-7 can be used in email messages and HTTP requests and responses, so it can by used to organize a DOS attack.
I discussed this problem with @vstinner more than 10 years ago, and we decided that this vulnerability has a low risk rating, because the UTF-7 decoder is so fast, that the attacker needs to feed byte-by-byte several megabytes of data to make a noticeable effect. I returned to this now because we handled similar vulnerabilities (too long lines in some text protocols) in past years, and recently fixed CVE-2023-52425 in Expat has the similar nature. Moreover, I thought about the implementation of UTF-7 in Python, and this would make the attack more feasible.
I first reported this on the PSRT, but on @gvanrossum request open a public issue for discussion.