Implement the "conversion to plain text" algorithm from DUTS #55 to protect from bidirectional spoofing

Rustc already [mitigates against](https://github.com/rust-lang/rust/pull/90462) the [potential spoofing effects (CVE-2021-42574)](https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html) of bidirectional format characters.

Since that CVE was released, Unicode has been working on a specification, [(Draft) Unicode Technical Standard #55 Unicode Source Code Handling](https://www.unicode.org/reports/tr55/), to help programming language designers implement mitigations for such issues. The goal of this specification is to protect source code from spoofing while retaining the ability to use bidirectional text in source. It contains advice for multiple levels of the stack: compiler warning engines, code formatters, and source code editors.


_Essentially_, the algorithm works at the level of what it calls "atoms" (basically, tokens, so strings and comments and identifiers are all atoms), and it allows for bidirectional stuff _within_ an atom, but "resets" the text direction after each atom, using a format character called a left-to-right mark (and two others that "pop" directional formatting). The left-to-right mark is treated as a space by Rust, and is _not_ one of the characters Rust warns about: in a left-to-right context all it can do is _undo_ bidirectional stuff; it cannot create Exciting New Bidirectional stuff.

Basically the algorithm is "if the atom starts with an RTL character, insert an LRM after the atom", though it has some nuanced additional stuff as well.
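To make the simplified rule above concrete, here is a minimal sketch of it in Rust. It is _not_ the full algorithm from the draft specification, and it has nothing to do with rustfmt's actual internals: the function names (`is_rtl_approx`, `append_lrm_if_rtl_start`, `convert_atoms`) are hypothetical, atoms are assumed to be already split out as string slices, and the "is this an RTL character" check is a rough code-point-range stand-in for a real Bidi_Class lookup.

```rust
/// U+200E LEFT-TO-RIGHT MARK.
const LRM: char = '\u{200E}';

/// Rough approximation of "strong right-to-left character".
/// Assumption: a real implementation would look up Bidi_Class (R / AL)
/// from Unicode data instead of using block ranges, which also cover
/// digits and combining marks.
fn is_rtl_approx(c: char) -> bool {
    matches!(
        c as u32,
        0x0590..=0x08FF      // Hebrew through Arabic Extended-A
        | 0xFB1D..=0xFDFF    // Hebrew and Arabic presentation forms
        | 0xFE70..=0xFEFC    // Arabic Presentation Forms-B
    )
}

/// Simplified rule: if the atom starts with an RTL character,
/// append an LRM after the atom. Otherwise leave it unchanged.
fn append_lrm_if_rtl_start(atom: &str) -> String {
    let mut out = String::from(atom);
    if atom.chars().next().map_or(false, is_rtl_approx) {
        out.push(LRM);
    }
    out
}

/// Apply the rule to a sequence of pre-split atoms and join the result.
fn convert_atoms(atoms: &[&str]) -> String {
    atoms.iter().map(|a| append_lrm_if_rtl_start(a)).collect()
}
```

With this sketch, an atom like `"\u{05D0}\u{05D1}"` (two Hebrew letters) comes back with a trailing U+200E, while `"bar"` comes back untouched. The "nuanced additional stuff" in the spec is deliberately left out.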

This is better than the existing heuristics since it also ensures that code that legitimately contains bidirectional text will render correctly. For example, the expression `FOO - bar`, where FOO is a variable name written in Arabic, will render in the right order (instead of rendering as `bar - FOO`).

The algorithm _does_ also insert some characters that Rust lints about (ones that "pop" directional formatting) in contexts where a comment contains an opening directional format character that is not properly terminated.
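As a rough illustration of what "terminating" an atom could look like, the sketch below scans an atom for opening directional format characters that are never closed and returns the "pop" characters needed to close them. This is one plausible reading under stated assumptions, not the spec's exact procedure: `closers_needed` and the `Open` enum are hypothetical names, and the matching rules are a simplified version of how PDF and PDI pair with their openers.

```rust
/// U+202C POP DIRECTIONAL FORMATTING (closes LRE/RLE/LRO/RLO).
const PDF: char = '\u{202C}';
/// U+2069 POP DIRECTIONAL ISOLATE (closes LRI/RLI/FSI).
const PDI: char = '\u{2069}';

#[derive(Clone, Copy, PartialEq)]
enum Open {
    Embedding, // LRE, RLE, LRO, RLO
    Isolate,   // LRI, RLI, FSI
}

/// Returns the closing characters that would need to be appended to `atom`
/// so that no directional formatting opened inside it leaks past its end.
fn closers_needed(atom: &str) -> String {
    let mut stack: Vec<Open> = Vec::new();
    for c in atom.chars() {
        match c {
            '\u{202A}' | '\u{202B}' | '\u{202D}' | '\u{202E}' => stack.push(Open::Embedding),
            '\u{2066}'..='\u{2068}' => stack.push(Open::Isolate),
            PDF => {
                // A PDF only closes an embedding/override at the top of the stack.
                if stack.last() == Some(&Open::Embedding) {
                    stack.pop();
                }
            }
            PDI => {
                // A PDI closes the innermost open isolate and anything opened inside it.
                if let Some(pos) = stack.iter().rposition(|o| *o == Open::Isolate) {
                    stack.truncate(pos);
                }
            }
            _ => {}
        }
    }
    // Close whatever is still open, innermost first.
    stack
        .iter()
        .rev()
        .map(|o| match o {
            Open::Embedding => PDF,
            Open::Isolate => PDI,
        })
        .collect()
}
```

For a comment containing a lone U+202E (RLO), this returns a single U+202C; for a comment whose isolates and embeddings are all closed, it returns an empty string.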

It also recommends inserting FSI characters, which Rust lints about. We should improve the linting algorithm using the official guidance; my proposal in https://github.com/rust-lang/rust/issues/113363 would fix that.

-----

Anyway, I propose we add a format config mode that runs this algorithm. I'm not yet proposing that this be on track to be "default" (I'm not sure if it _should_ be), but it's worth considering.

I don't have familiarity with the rustfmt codebase but I do have familiarity with Unicode properties and the bidirectional algorithm and am happy to help someone implement this. (cc @crlf0710 who may also be interested; and perhaps we can implement some of this inside unicode-security)
