Implement the "conversion to plain text" algorithm from DUTS #55 to protect from bidirectional spoofing #5815

Open
@Manishearth


rustc already mitigates the potential spoofing effects (CVE-2021-42574) of bidirectional format characters.

Since that CVE was released, Unicode has been working on a specification, (Draft) Unicode Technical Standard #55, Unicode Source Code Handling, to help programming language designers implement mitigations for such issues. The goal of this specification is to protect source code from spoofing while retaining the ability to use bidirectional text in source. It contains advice for multiple levels of the stack: compiler warning engines, code formatters, and source code editors.

Essentially, the algorithm works at the level of what it calls "atoms" (basically tokens, so strings, comments, and identifiers are all atoms). It allows bidirectional text within an atom but "resets" the text direction after it, using a format character called the left-to-right mark (LRM), plus two others that "pop" directional formatting. Rust treats LRM as whitespace and does not warn about it, since in a left-to-right context all it can do is undo bidirectional effects; it cannot create Exciting New Bidirectional stuff.

Basically the algorithm is "if the atom starts with an RTL character, insert an LRM after the atom", though it has some nuanced additional stuff as well.
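To make the simplified rule above concrete, here is a rough sketch. It is not the algorithm from the draft: it skips the nuanced parts, it assumes the caller has already split the source into atoms (with whitespace passed through as its own entries), and `is_rtl_approx` is a hypothetical block-range stand-in for a real Bidi_Class lookup (R or AL), so it will misclassify some characters such as combining marks and digits in those blocks.

```rust
/// U+200E LEFT-TO-RIGHT MARK.
const LRM: char = '\u{200E}';

/// Hypothetical helper: a coarse approximation of "strong right-to-left".
/// A real implementation would consult the Unicode Bidi_Class property.
fn is_rtl_approx(c: char) -> bool {
    matches!(
        c,
        '\u{0590}'..='\u{08FF}'     // Hebrew, Arabic, Syriac, Thaana, NKo, ...
        | '\u{FB1D}'..='\u{FDFF}'   // Hebrew and Arabic presentation forms
        | '\u{FE70}'..='\u{FEFC}'   // Arabic presentation forms B
    )
}

/// Concatenate pre-split atoms, appending an LRM after every atom whose
/// first character looks right-to-left. This is only the simplified rule.
fn reset_direction_after_atoms(atoms: &[&str]) -> String {
    let mut out = String::new();
    for atom in atoms {
        out.push_str(atom);
        if atom.chars().next().map_or(false, is_rtl_approx) {
            out.push(LRM);
        }
    }
    out
}
```

With atoms like `["\u{05D0}\u{05D1}", " ", "-", " ", "bar"]`, the only change is an LRM right after the Hebrew identifier; purely left-to-right input comes back unchanged.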

This is better than the existing heuristics since it also ensures that code that legitimately contains bidirectional text renders correctly. For example, the expression FOO - bar, where FOO is a variable name written in Arabic, will render in the right order (instead of rendering as bar - FOO).

The algorithm also inserts some characters that Rust lints about (the ones that "pop" directional formatting) where a comment contains an opening directional format character that is not properly terminated.
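A sketch of what "terminating" a comment could look like, under my own assumptions rather than the draft's exact wording: walk the comment, track unmatched embedding/override and isolate openers on a stack, and return the PDF (U+202C) and PDI (U+2069) characters needed to close whatever is left open. The name `closers_needed` is hypothetical, and the sketch ignores paragraph separators, which also reset directional state.

```rust
const PDF: char = '\u{202C}'; // POP DIRECTIONAL FORMATTING
const PDI: char = '\u{2069}'; // POP DIRECTIONAL ISOLATE

/// Return the closing characters that would balance any directional
/// formatting still open at the end of `text`.
fn closers_needed(text: &str) -> String {
    // Stack entries: true = isolate initiator, false = embedding/override.
    let mut stack: Vec<bool> = Vec::new();
    for c in text.chars() {
        match c {
            // LRE, RLE, LRO, RLO
            '\u{202A}' | '\u{202B}' | '\u{202D}' | '\u{202E}' => stack.push(false),
            // LRI, RLI, FSI
            '\u{2066}'..='\u{2068}' => stack.push(true),
            // PDF only closes an embedding/override at the top of the stack.
            PDF => {
                if stack.last() == Some(&false) {
                    stack.pop();
                }
            }
            // PDI closes the innermost isolate and anything opened inside it.
            PDI => {
                if let Some(pos) = stack.iter().rposition(|&is_isolate| is_isolate) {
                    stack.truncate(pos);
                }
            }
            _ => {}
        }
    }
    // Close innermost first.
    stack
        .iter()
        .rev()
        .map(|&is_isolate| if is_isolate { PDI } else { PDF })
        .collect()
}
```

A formatter could append the returned string before the end of the comment; a comment whose formatting is already balanced yields an empty string.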

It also recommends inserting FSI characters, which Rust lints about, so we should improve the linting algorithm using the official guidance. My proposal in rust-lang/rust#113363 would fix that.


Anyway, I propose we add a format config mode that runs this algorithm. I'm not yet proposing that this be on track to be "default" (I'm not sure if it should be), but it's worth considering.

I'm not familiar with the rustfmt codebase, but I am familiar with Unicode properties and the bidirectional algorithm, and am happy to help someone implement this. (cc @crlf0710, who may also be interested; perhaps we can implement some of this inside unicode-security.)
