Create an internal lint for detecting "Unicode-unaware" BytePos & Span manipulations #128790

Open
@fmease

Inspired by #128717 (comment). CC @jieyouxu.


Since we recover from lexically invalid tokens that are Unicode-confusable with tokens that are lexically valid (e.g., U+037E Greek Question Mark → U+003B Semicolon; U+066B Arabic Decimal Separator → U+002C Comma), (suggestion) diagnostic code down the line generally ought not to make too many assumptions about the length and/or position in bytes that the Span of a supposed Rust token/lexeme "maps to".

In reality, however, (suggestion) diagnostic code all too often doesn't follow this 'rule' when performing low-level "span manipulations", defaulting to hard-coded lengths and/or positions instead. The compiler contains a bunch of snippets like `- BytePos(1)` or `+ BytePos(1)` where the code guesses that the code point before/after corresponds to a certain token/lexeme like `,`, `;` or `)`. However, such code doesn't account for the aforesaid recovery, which may have mapped a code point whose UTF-8 encoding is longer than 1 byte to an ASCII character of length 1. This can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).
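To make the length mismatch concrete, here is a minimal standalone sketch in plain std Rust, with no rustc internals. The names `CONFUSABLES` and `splits_a_char` are made up for illustration; the code points are the ones mentioned in this issue. `splits_a_char` mimics what `hi - BytePos(1)` effectively assumes, namely that the last character is exactly one byte long.

```rust
/// (confusable, ASCII token it gets recovered as) pairs mentioned in this issue.
const CONFUSABLES: [(char, char); 3] = [
    ('\u{037E}', ';'), // Greek Question Mark
    ('\u{066B}', ','), // Arabic Decimal Separator
    ('\u{2769}', ')'), // Medium Right Parenthesis Ornament
];

/// Hypothetical helper: returns `true` if stepping back one byte from the end
/// of `src` (the moral equivalent of `hi - BytePos(1)`) lands in the middle of
/// a multi-byte character instead of on a char boundary.
fn splits_a_char(src: &str) -> bool {
    match src.len().checked_sub(1) {
        Some(offset) => !src.is_char_boundary(offset),
        None => false, // empty input: nothing to step back over
    }
}
```

For ASCII input such as `"f(0,1)"` the one-byte step is fine, while for `"f(0,1\u{2769}"` it ends up inside the three-byte encoding of U+2769, which is the kind of offset that trips boundary assertions or slicing later on.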

sigh
we have too many hard coded +1/-1 in the compiler
-- @compiler-errors


So it might be worth linting against these error-prone BytePos & Span manipulations.
I don't know how feasible it'd be to implement such a lint well (i.e., with a low false positive rate) or what the exact rules should look like.

This issue may serve a dual purpose as a tracking issue for eliminating this 'pattern' from the code base.


Uplifted from #128717 (comment):

It's so easy to find these kinds of ICEs:

  1. One simply needs to look for /(\+|-) BytePos\(\d+\)/ inside compiler/,
  2. Figure out which ASCII character is meant
  3. Open one's favorite Unicode table website or program that can list Unicode-confusables
  4. Pick a confusable whose .len_utf8() is >1 and 💥
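Step 1 above can be approximated without any tooling beyond std. The following is only a rough textual sketch of that regex search, not a proposal for how the lint itself should work (a real lint would presumably inspect expressions rather than source text). The function name `find_hardcoded_offsets` is hypothetical.

```rust
/// Returns `(1-based line number, matched text)` for every occurrence of
/// `+ BytePos(<digits>)` or `- BytePos(<digits>)` in `src`, roughly matching
/// the regex `(\+|-) BytePos\(\d+\)`.
fn find_hardcoded_offsets(src: &str) -> Vec<(usize, String)> {
    const NEEDLE: &str = " BytePos(";
    let mut hits = Vec::new();
    for (line_idx, line) in src.lines().enumerate() {
        for (pos, _) in line.match_indices(NEEDLE) {
            // The operator has to sit directly before the space.
            let op = match line[..pos].chars().next_back() {
                Some(c @ ('+' | '-')) => c,
                _ => continue,
            };
            // Require at least one ASCII digit followed by a closing paren.
            let rest = &line[pos + NEEDLE.len()..];
            let digits = rest.bytes().take_while(u8::is_ascii_digit).count();
            if digits > 0 && rest[digits..].starts_with(')') {
                hits.push((line_idx + 1, format!("{op} BytePos({})", &rest[..digits])));
            }
        }
    }
    hits
}
```

Feeding it the contents of a source file yields the candidate lines to inspect by hand in steps 2 to 4; it deliberately ignores offsets that are not integer literals.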

Example ICE: #128717.

Example ICE: I just found this a minute ago while reviewing an unrelated PR:

This code uses a Medium Right Parenthesis Ornament (U+2769) which is confusable with Right Parenthesis.

fn f() {}

fn main() {
    f(0,1❩;
}

Leads to:

thread 'rustc' panicked at compiler/rustc_span/src/lib.rs:2119:17:
assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32

Code in question:

call_expr.span.with_lo(call_expr.span.hi() - BytePos(1))
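For comparison, a Unicode-aware way of computing "the range of the last character" can be sketched on a simplified model. `ByteRange` and `last_char_range` below are hypothetical stand-ins, not rustc types or APIs; the point is only that the width of the last character is read from the source text instead of being assumed to be 1. rustc's `SourceMap` has helpers in this spirit (`end_point`, as far as I know), but this sketch does not depend on them.

```rust
/// Hypothetical stand-in for a span: a half-open byte range into a source string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ByteRange {
    lo: usize,
    hi: usize,
}

/// Returns the range covering only the last character inside `range`.
/// Yields `None` if `range` is empty, out of bounds, or does not start and
/// end on char boundaries of `src`.
fn last_char_range(src: &str, range: ByteRange) -> Option<ByteRange> {
    let snippet = src.get(range.lo..range.hi)?;
    let last = snippet.chars().next_back()?;
    // Step back by the real encoded width rather than a hard-coded 1.
    Some(ByteRange {
        lo: range.hi - last.len_utf8(),
        hi: range.hi,
    })
}
```

With the source `f(0,1❩;` and a call range of bytes 0 to 8, this gives the range 5 to 8, i.e. exactly the three bytes of U+2769, whereas `hi - 1` would point at byte 7, which is not a char boundary.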

Labels

    A-Unicode (Area: Unicode)
    A-diagnostics (Area: Messages for errors, warnings, and lints)
    A-lints (Area: Lints (warnings about flaws in source code) such as unused_mut.)
    C-feature-request (Category: A feature request, i.e: not implemented / a PR.)
    C-tracking-issue (Category: An issue tracking the progress of sth. like the implementation of an RFC)
    D-Unicode-unaware (Diagnostics: Diagnostics that are unaware of Unicode and trigger codepoint boundary assertions)
    P-low (Low priority)
    T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)