Skip to content

Implement the new Unicode algorithm for preventing bidi-based source code spoofing #113363

Open
@Manishearth

Description

@Manishearth

Rustc already mitigates against the potential spoofing effects (CVE-2021-42574) of bidirectional format characters.

What we currently do is a bit of a blunt hammer: we lint on any bidirectional format character except LRM and RLM, which is both too aggressive (it doesn't allow for "isolated" pairs of format characters) and not aggressive enough (RLM can have negative effects, and even plain visible characters can cause problems).

Since that CVE was released, Unicode has been working on a specification, (Draft) Unicode Technical Standard #55 Unicode Source Code Handling to help programming language designers implement various mitigations to such issues. The goal of this specification is to protect source code from spoofing while retaining the ability to use bidirectional text in source. It contains advice for multiple levels of the stack: compiler warning engines, code formatters, and source code editors.

Relevant to rustc is the section on Directional Formatting Characters, which refers to the algorithm in Conversion to Plain Text. That algorithm is primarily written for the purposes of a formatting tool like rustfmt, and I've filed a rustfmt issue to implement it there1, however it can also be tweaked to be a good lint too.

In essence, it will warn about format characters nestled amongst whitespace (except for LRM2), and also warn about mismatched stateful format characters in contexts where it matters (e.g. a block comment where there is other code on the same line after it). The difference from the current algorithm is that it would ignore these format characters when they don't have any effect that leaks outside of the "atom" (string, comment, etc), and furthermore lint on RLMs found in whitespace.

An important thing to highlight is that this change would allow the proposed rustfmt pass to generate warning-free code: Under the current regime, this pass would generate warning-free code in most cases provided you do not already have warnings of this form, but there is the FSI/PDI insertion section that is rare but will turn warning-free code into code that warns.


Here is the tweaked algorithm such that it is suitable for linting.

declare
  Needs_LRM : Boolean := False;
begin
  Separate the text into lines and each line into atoms,
      as described in Section 4.1.1, Atoms, in a language-dependent manner;
  for each Line loop
    for each Atom of Line loop
      if Atom is a whitespace atom then
        LINT on any instance of RLM
      end if;
      if Needs_LRM then
        if inserting LRM before Atom has no effect on the meaning of the program then
          Insert LRM before Atom;
          Needs_LRM := False;
        else  -- For instance, if the position before Atom is within a string literal
          Look for the first character C in Atom such that
            the Bidi_Class property of C is one of LRE, RLE, LRI, RLI, or FSI;
        if C exists and its Bidi_Class property is not L then
            Emit an error: the line cannot be converted to plain text by this algorithm;
        end if;
      end if;
      if Atom is a comment content atom then
        if Atom is followed by a code point that does not have Bidi_Class B then
          If there are any isolate initiators in Atom that do not have a matching PDI, as defined in BD9, lint
          If there are any embedding initiators in Atom that
            do not have a matching PDF, as defined in BD11, and
            do not lie between an isolate initiator and its matching PDI,
          then Lint
        end if;
      end if;
      if Atom is followed by a code point that does not have Bidi_Class B then
        if Atom has any isolate initiators that
            do not have a matching PDI, as defined in BD9 then
          Emit an error: the line cannot be converted to plain text by this algorithm;
        end if;
        if Atom has any embedding initiators that
            do not lie between an isolate initiator and its matching PDI, and
            do not have a matching PDF, as defined in BD11 then
          Emit an error: the line cannot be converted to plain text by this algorithm;
        end if;
        Look for the last character C in Atom such that
          the Bidi_Class property of C is one of L, R, AL, PDF, or PDI;
        if C exists and its Bidi_Class property is not L then
          Needs_LRM := True;
        end if;
      end if;
    end loop;
    if Line is terminated by a character with Bidi_Class B then
      Needs_LRM := False;
    end if;
  end loop;
end;

However, a thing that will need to be decided is this: Do we want to lint cases where visible RTL characters change the rendered direction of the text?

For example, the code <arabic> = 2 - x; will render as

عربی = 2 - x;

which can be confusing. This can be mitigated by inserting an LRM3 after the arabic identifier; which is what the algorithm recommends formatters like rustfmt do (and is what I am proposing rustfmt support).

We could in our lint require users do the same: any visible RTL code points that affect the directionality of surrounding text are required to have that effect be mitigated, usually with an LRM. This would roughly amount to requiring people run the rustfmt pass though we can have lint suggestions also do the same for them. This is an expansion of the current set of things we lint on: currently we only lint on invisible control characters and the effects of visible RTL characters are basically treated as okay since you can see the RTL character and know there is something fishy.

I generally lean on the side of not linting here. The algorithm above becomes much simpler in such a case, basically all the stuff producing and consuming Needs_LRM goes away, I can write up a modified algorithm if necessary.

Happy to help mentor the implementation and provide additional background on this algorithm and the bidi algorithm/properties.

Thoughts? @estebank @pietroalbini

Footnotes

  1. Likely initially as a config option; I do not want to make decisions on whether it ought to be the default just yet.

  2. LRM is a format character that, in a left-to-right context, can only fix problems, not cause them. It is the equivalent of a latin letter, except invisible.

  3. Treated as a space by Rust

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-UnicodeArea: UnicodeA-lintsArea: Lints (warnings about flaws in source code) such as unused_mut.A-rustfmtArea: RustfmtT-compilerRelevant to the compiler team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions