Description
Rustc already mitigates against the potential spoofing effects (CVE-2021-42574) of bidirectional format characters.
What we currently do is a bit of a blunt hammer: we lint on any bidirectional format character except LRM and RLM, which is both too aggressive (it doesn't allow for "isolated" pairs of format characters) and not aggressive enough (RLM can have negative effects, and even plain visible characters can cause problems).
Since that CVE was released, Unicode has been working on a specification, (Draft) Unicode Technical Standard #55 Unicode Source Code Handling to help programming language designers implement various mitigations to such issues. The goal of this specification is to protect source code from spoofing while retaining the ability to use bidirectional text in source. It contains advice for multiple levels of the stack: compiler warning engines, code formatters, and source code editors.
Relevant to rustc is the section on Directional Formatting Characters, which refers to the algorithm in Conversion to Plain Text. That algorithm is primarily written for the purposes of a formatting tool like rustfmt, and I've filed a rustfmt issue to implement it there1, however it can also be tweaked to be a good lint too.
In essence, it will warn about format characters nestled amongst whitespace (except for LRM2), and also warn about mismatched stateful format characters in contexts where it matters (e.g. a block comment where there is other code on the same line after it). The difference from the current algorithm is that it would ignore these format characters when they don't have any effect that leaks outside of the "atom" (string, comment, etc), and furthermore lint on RLMs found in whitespace.
An important thing to highlight is that this change would allow the proposed rustfmt pass to generate warning-free code: Under the current regime, this pass would generate warning-free code in most cases provided you do not already have warnings of this form, but there is the FSI/PDI insertion section that is rare but will turn warning-free code into code that warns.
Here is the tweaked algorithm such that it is suitable for linting.
declare
Needs_LRM : Boolean := False;
begin
Separate the text into lines and each line into atoms,
as described in Section 4.1.1, Atoms, in a language-dependent manner;
for each Line loop
for each Atom of Line loop
if Atom is a whitespace atom then
LINT on any instance of RLM
end if;
if Needs_LRM then
if inserting LRM before Atom has no effect on the meaning of the program then
Insert LRM before Atom;
Needs_LRM := False;
else -- For instance, if the position before Atom is within a string literal
Look for the first character C in Atom such that
the Bidi_Class property of C is one of LRE, RLE, LRI, RLI, or FSI;
if C exists and its Bidi_Class property is not L then
Emit an error: the line cannot be converted to plain text by this algorithm;
end if;
end if;
if Atom is a comment content atom then
if Atom is followed by a code point that does not have Bidi_Class B then
If there are any isolate initiators in Atom that do not have a matching PDI, as defined in BD9, lint
If there are any embedding initiators in Atom that
do not have a matching PDF, as defined in BD11, and
do not lie between an isolate initiator and its matching PDI,
then Lint
end if;
end if;
if Atom is followed by a code point that does not have Bidi_Class B then
if Atom has any isolate initiators that
do not have a matching PDI, as defined in BD9 then
Emit an error: the line cannot be converted to plain text by this algorithm;
end if;
if Atom has any embedding initiators that
do not lie between an isolate initiator and its matching PDI, and
do not have a matching PDF, as defined in BD11 then
Emit an error: the line cannot be converted to plain text by this algorithm;
end if;
Look for the last character C in Atom such that
the Bidi_Class property of C is one of L, R, AL, PDF, or PDI;
if C exists and its Bidi_Class property is not L then
Needs_LRM := True;
end if;
end if;
end loop;
if Line is terminated by a character with Bidi_Class B then
Needs_LRM := False;
end if;
end loop;
end;
However, a thing that will need to be decided is this: Do we want to lint cases where visible RTL characters change the rendered direction of the text?
For example, the code <arabic> = 2 - x;
will render as
عربی = 2 - x;
which can be confusing. This can be mitigated by inserting an LRM3 after the arabic identifier; which is what the algorithm recommends formatters like rustfmt do (and is what I am proposing rustfmt support).
We could in our lint require users do the same: any visible RTL code points that affect the directionality of surrounding text are required to have that effect be mitigated, usually with an LRM. This would roughly amount to requiring people run the rustfmt pass though we can have lint suggestions also do the same for them. This is an expansion of the current set of things we lint on: currently we only lint on invisible control characters and the effects of visible RTL characters are basically treated as okay since you can see the RTL character and know there is something fishy.
I generally lean on the side of not linting here. The algorithm above becomes much simpler in such a case, basically all the stuff producing and consuming Needs_LRM
goes away, I can write up a modified algorithm if necessary.
Happy to help mentor the implementation and provide additional background on this algorithm and the bidi algorithm/properties.
Thoughts? @estebank @pietroalbini
Footnotes
-
Likely initially as a config option; I do not want to make decisions on whether it ought to be the default just yet. ↩
-
LRM is a format character that, in a left-to-right context, can only fix problems, not cause them. It is the equivalent of a latin letter, except invisible. ↩
-
Treated as a space by Rust ↩