Description
Summary
Hi! We've been seeing some surprising codegen decisions in Hubris programs for a while, and I wanted to surface it upstream in case there's anything to be done.
We build primarily at opt-level="z"
, though many of these behaviors seem to happen at opt-level="s"
too. (z
code is very slightly smaller.)
There are a handful of routines in core
that rustc often decides not to inline which have significant knock-on effects on LLVM's ability to reason about panics, integer ranges for bounds checks, and the like. The problem primarily arises with:
<Enumerate<core::slice::iter::Iter<T>> as Iterator>>::next
- not inlining this appears to hamper the ability of other parts of the compiler to notice that the enumerated indices it's producing are in a predictable range<Index<usize> for [T]>::index
or<Index<Range<usize>> for [T]>::index
- the non-inlined version of this does an unconditional bounds check, when in many cases there is information available at the callsite to eliminate it were it inlined- More rarely, but often enough,
<core::slice::Iter<T> as Iterator>::next
All of these routines, when inlined, tend to produce either one or zero machine instructions -- one in the case of index
which is typically an indexed addressing mode load, zero in the case of Enumerate::next
where the information is often already available in the calling frame for other reasons, and generally one in the case of slice::Iter::next
.
A concrete example from 1.75.0-ish
Here is a concrete example from one of our current programs.
The originating Rust code is (simplified):
let mut argsbuf = [0; 8]; // length actually a constant but I've inlined it
// ... things happen ...
let arglen = an_inlined_operation();
// ... things happen ...
&argsbuf[..arglen]
Here's the ARMv7E-M code as produced by objdump from the output of nightly-2023-12-21
(chosen to be approximately equivalent to 1.75.0).
80071ee: 2601 movs r6, #1 ; included for reasons I'll explain in a bit
; unrelated things happen...
core::array::<impl core::ops::index::Index<I> for [T; N]>::index:
/rustc/5ac4c8a63ee305742071ac6dd11817f7c24adce2/library/core/src/array/mod.rs:348
8007202: 4640 mov r0, r8
8007204: 2108 movs r1, #8
8007206: 4632 mov r2, r6
8007208: f000 fb5b bl 80078c2 <core::slice::index::<impl core::ops::index::Index<I> for [T]>::index>
r6
here holds the length being produced by the inlined operation; along all code paths the length it produces is either 0, 1, or 8, all literal constants. I've included the simplest one here, which sets it to literal 1.
Given that we are indexing a fixed-length 8-element array normally rustc would be very good about eliding the bounds check, since 0, 1, and 8 are all valid options here.
Instead we have the argument setup and a call to the non-inlined index function, which contains a bounds check and conditional panic:
080078c2 <core::slice::index::<impl core::ops::index::Index<I> for [T]>::index>:
80078c2: 428a cmp r2, r1
80078c4: bf9c itt ls
80078c6: 4611 movls r1, r2
80078c8: 4770 bxls lr
80078ca: b580 push {r7, lr}
80078cc: 466f mov r7, sp
80078ce: 4610 mov r0, r2
80078d0: f7ff ff50 bl 8007774 <core::slice::index::slice_end_index_len_fail>
Impact
In many of our programs, this has a pretty significant impact. In at least one case, opt-level=3
winds up being smaller because of more aggressive inlining causing checks to be elided. Because panic sites are relatively expensive -- we format and record panic messages, so they pull in formatting machinery that might not otherwise be needed -- in programs that otherwise contain no panics, a shift in the inlining decisions around something like index
can cause a program's size to grow by 4 kiB pretty easily.
For context, that's a 50-100% size increase on most of our programs.
A possibly clumsy solution
So, for the specific cases I described above -- indexing arrays with usizes or ranges of usizes, and core::slice::Iter
-- does it ever make sense to not inline these routines? I'm wondering if folks would be open to me going through and carefully applying some inline(always)
attributes in this neighborhood. (This might not be sufficient for the Enumerate<T>
case but could definitely be applied to the array/slice Index impls.)
If so, let me know what tests you'd like me to perform on the result. I would obviously measure the impact on various Hubris-based firmware, but perhaps there are other contexts where this could be a bad thing that you'd want me to check.
My LLVM experience is limited and rusty, so there might be a cleverer thing that could be done in the compiler, but I wouldn't know what to suggest.
rustc version
The examples above were generated by:
rustc 1.76.0-nightly (5ac4c8a63 2023-12-20)
binary: rustc
commit-hash: 5ac4c8a63ee305742071ac6dd11817f7c24adce2
commit-date: 2023-12-20
host: x86_64-unknown-linux-gnu
release: 1.76.0-nightly
LLVM version: 17.0.6
but I've been observing this behavior consistently since at least 2021, and just wasn't sure if an upstream bug report made sense.