Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations

Inspired by https://github.com/rust-lang/rust/issues/128717#issuecomment-2270315205. CC @jieyouxu.

---

Since we recover from lexically invalid tokens that are Unicode-confusable with tokens that are lexically valid (e.g., *U+037E Greek Question Mark* → *U+003B Semicolon*; *U+066B Arabic Decimal Separator* → *U+002C Comma*), (suggestion) diagnostic code down the line generally *ought not* make too many assumptions about the length and/or position in bytes that the `Span` of a supposed Rust token/lexeme "maps to" .

In reality however, all too often (suggestion) diagnostic code doesn't follow this 'rule' when performing low-level "span manipulations" defaulting to hard-coded lengths and/or positions. The compiler contains a bunch of snippets like `- BytePos(1)` or `+ BytePos(1)` where the code guesses that the code point before/after corresponds to a certain token/lexeme like `,`, `;`, `)`. However, such code doesn't account for the aforesaid recovery which may have mapped a UTF-8 code unit with byte length > 1 to an ASCII character of length 1 which can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).

> sigh
> we have too many hard coded +1/-1 in the compiler
> -- @compiler-errors 

---

So it might be worth **linting** against these error-prone `BytePos` & `Span` manipulations.
I don't know how feasible it'd be to implement such a lint *well* (i.e., low false positive rate) or how the exact rules should look like.

This issue may serve a dual purpose as a **tracking issue** for eliminating this 'pattern' from the code base.

---

Uplifted from https://github.com/rust-lang/rust/issues/128717#issuecomment-2273735687:

It's so easy to find these kinds of ICEs:

1. One simply needs to look for `/(\+|-) BytePos\(\d+\)/` inside `compiler/`,
2. Figure out which ASCII character is meant
3. Open one's favorite Unicode table website or program that can list Unicode-confusables
4. Pick a confusable whose `.len_utf8()` is >1 and :boom:

---

**Example ICE**: #128717.

**Example ICE**: I just found this a minute ago while reviewing an unrelated PR:

This code uses a *Medium Right Parenthesis Ornament* (U+2769) which is confusable with *Right Parenthesis*.

```rs
fn f() {}

fn main() {
    f(0,1❩;
}
```

Leads to:

```
thread 'rustc' panicked at compiler/rustc_span/src/lib.rs:2119:17:
assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32
```

Code in question:

https://github.com/rust-lang/rust/blob/c9687a95a602091777e28703aa5abf20f1ce1797/compiler/rustc_hir_typeck/src/fn_ctxt/checks.rs#L1144

### Discussions

- https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Internal.20lint.20for.20.22non-Unicode-aware.22.20BytePos.20manipulations.3F

### Related issues

- https://github.com/rust-lang/rust/issues?q=is%3Aopen+is%3Aissue+label%3AD-unicode-unaware
- https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-ICE+%22bpos.to_u32%22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790

Discussions

Related issues

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Create an internal lint for detecting "Unicode-unaware" BytePos & Span manipulations #128790

Description

Discussions

Related issues

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790