Implement the new Unicode algorithm for preventing bidi-based source code spoofing


Rustc already [mitigates against](https://github.com/rust-lang/rust/pull/90462) the [potential spoofing effects (CVE-2021-42574)](https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html) of bidirectional format characters.

What we currently do is a bit of a blunt hammer: we lint on any bidirectional format character except LRM and RLM, which is both too aggressive (it doesn't allow for "isolated" pairs of format characters) and not aggressive enough (RLM can have negative effects, and even plain visible characters can cause problems).

Since that CVE was released, Unicode has been working on a specification, [(Draft) Unicode Technical Standard #55 Unicode Source Code Handling](https://www.unicode.org/reports/tr55/) to help programming language designers implement various mitigations to such issues. The goal of this specification is to protect source code from spoofing while retaining the ability to use bidirectional text in source. It contains advice for multiple levels of the stack: compiler warning engines, code formatters, and source code editors.

Relevant to rustc is the section on [Directional Formatting Characters](https://www.unicode.org/reports/tr55/#Directional-Formatting), which refers to the algorithm in [Conversion to Plain Text](https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text). That algorithm is primarily written for the purposes of a formatting tool like rustfmt, and I've filed a [rustfmt issue](https://github.com/rust-lang/rustfmt/issues/5815) to implement it there[^0], however it can also be tweaked to be a good lint too.

In essence, it will warn about format characters nestled amongst whitespace (except for LRM[^1]), and also warn about _mismatched_ stateful format characters in contexts where it matters (e.g. a block comment where there is other code on the same line after it). The difference from the current algorithm is that it would ignore these format characters when they don't have any effect that leaks outside of the "atom" (string, comment, etc), and furthermore lint on RLMs found in whitespace.

An important thing to highlight is that this change would allow the [proposed rustfmt pass](https://github.com/rust-lang/rustfmt/issues/5815) to generate warning-free code: Under the current regime, this pass would generate warning-free code in _most_ cases provided you do not already have warnings of this form, but there is the FSI/PDI insertion section that is rare but will turn warning-free code into code that warns.

------

Here is the tweaked algorithm such that it is suitable for linting.

<details>

```ada
declare
  Needs_LRM : Boolean := False;
begin
  Separate the text into lines and each line into atoms,
      as described in Section 4.1.1, Atoms, in a language-dependent manner;
  for each Line loop
    for each Atom of Line loop
      if Atom is a whitespace atom then
        LINT on any instance of RLM
      end if;
      if Needs_LRM then
        if inserting LRM before Atom has no effect on the meaning of the program then
          Insert LRM before Atom;
          Needs_LRM := False;
        else  -- For instance, if the position before Atom is within a string literal
          Look for the first character C in Atom such that
            the Bidi_Class property of C is one of LRE, RLE, LRI, RLI, or FSI;
        if C exists and its Bidi_Class property is not L then
            Emit an error: the line cannot be converted to plain text by this algorithm;
        end if;
      end if;
      if Atom is a comment content atom then
        if Atom is followed by a code point that does not have Bidi_Class B then
          If there are any isolate initiators in Atom that do not have a matching PDI, as defined in BD9, lint
          If there are any embedding initiators in Atom that
            do not have a matching PDF, as defined in BD11, and
            do not lie between an isolate initiator and its matching PDI,
          then Lint
        end if;
      end if;
      if Atom is followed by a code point that does not have Bidi_Class B then
        if Atom has any isolate initiators that
            do not have a matching PDI, as defined in BD9 then
          Emit an error: the line cannot be converted to plain text by this algorithm;
        end if;
        if Atom has any embedding initiators that
            do not lie between an isolate initiator and its matching PDI, and
            do not have a matching PDF, as defined in BD11 then
          Emit an error: the line cannot be converted to plain text by this algorithm;
        end if;
        Look for the last character C in Atom such that
          the Bidi_Class property of C is one of L, R, AL, PDF, or PDI;
        if C exists and its Bidi_Class property is not L then
          Needs_LRM := True;
        end if;
      end if;
    end loop;
    if Line is terminated by a character with Bidi_Class B then
      Needs_LRM := False;
    end if;
  end loop;
end;
```

</details>


However, a thing that will need to be decided is this: Do we want to lint cases where _visible_ RTL characters change the rendered direction of the text?

For example, the code `<arabic> = 2 - x;` will render as

```text
عربی = 2 - x;
```

which can be confusing. This can be mitigated by inserting an LRM[^2] after the arabic identifier; which is what the algorithm recommends formatters like rustfmt do (and is what I am [proposing rustfmt support](https://github.com/rust-lang/rustfmt/issues/5815)).


We _could_ in our lint require users do the same: any _visible_ RTL code points that affect the directionality of surrounding text are required to have that effect be mitigated, usually with an LRM. This would roughly amount to requiring people run the rustfmt pass though we can have lint suggestions also do the same for them. This is an expansion of the current set of things we lint on: currently we only lint on invisible control characters and the effects of visible RTL characters are basically treated as okay since you can _see_ the RTL character and know there is something fishy.

I generally lean on the side of _not_ linting here. The algorithm above becomes much simpler in such a case, basically all the stuff producing and consuming `Needs_LRM` goes away, I can write up a modified algorithm if necessary.


Happy to help mentor the implementation and provide additional background on this algorithm and the bidi algorithm/properties.

Thoughts? @estebank @pietroalbini 

 [^0]: Likely initially as a config option; I do not want to make decisions on whether it ought to be the default just yet.
 [^1]: LRM is a format character that, in a left-to-right context, can only fix problems, not cause them. It is the equivalent of a latin letter, except invisible.
 [^2]: Treated as a space by Rust

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement the new Unicode algorithm for preventing bidi-based source code spoofing #113363

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Implement the new Unicode algorithm for preventing bidi-based source code spoofing #113363

Description

Footnotes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions