Description
Inspired by #128717 (comment). CC @jieyouxu.
Since we recover from lexically invalid tokens that are Unicode-confusable with tokens that are lexically valid (e.g., U+037E Greek Question Mark → U+003B Semicolon; U+066B Arabic Decimal Separator → U+002C Comma), (suggestion) diagnostic code down the line generally ought not make too many assumptions about the length and/or position in bytes that the Span
of a supposed Rust token/lexeme "maps to" .
In reality however, all too often (suggestion) diagnostic code doesn't follow this 'rule' when performing low-level "span manipulations" defaulting to hard-coded lengths and/or positions. The compiler contains a bunch of snippets like - BytePos(1)
or + BytePos(1)
where the code guesses that the code point before/after corresponds to a certain token/lexeme like ,
, ;
, )
. However, such code doesn't account for the aforesaid recovery which may have mapped a UTF-8 code unit with byte length > 1 to an ASCII character of length 1 which can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).
sigh
we have too many hard coded +1/-1 in the compiler
-- @compiler-errors
So it might be worth linting against these error-prone BytePos
& Span
manipulations.
I don't know how feasible it'd be to implement such a lint well (i.e., low false positive rate) or how the exact rules should look like.
This issue may serve a dual purpose as a tracking issue for eliminating this 'pattern' from the code base.
Uplifted from #128717 (comment):
It's so easy to find these kinds of ICEs:
- One simply needs to look for
/(\+|-) BytePos\(\d+\)/
insidecompiler/
, - Figure out which ASCII character is meant
- Open one's favorite Unicode table website or program that can list Unicode-confusables
- Pick a confusable whose
.len_utf8()
is >1 and 💥
Example ICE: #128717.
Example ICE: I just found this a minute ago while reviewing an unrelated PR:
This code uses a Medium Right Parenthesis Ornament (U+2769) which is confusable with Right Parenthesis.
fn f() {}
fn main() {
f(0,1❩;
}
Leads to:
thread 'rustc' panicked at compiler/rustc_span/src/lib.rs:2119:17:
assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32
Code in question: