Description
The contract for `delay` currently states:

> Blocks the program for at least `n` instruction cycles

The current implementation is:

```rust
llvm_asm!("1:
      nop
      subs $0, $$1
      bne.n 1b"
     : "+r"(_n / 4 + 1)
     :
     : "cpsr"
     : "volatile");
```

This is approximately an assembly gloss of the code:

```rust
let mut i = n / 4 + 1;
loop {
    i -= 1;
    if i == 0 { break }
}
```

Why the divide-by-four? Well, the code appears to be assuming that the minimum time the instruction sequence could take is four cycles per iteration, and so it divides by four to minimize the amount of over-delaying. That cycle count is approximately accurate on M3/M4 (it's an underestimate in practice, even in zero-wait-state RAM), but it is wrong on M7 or any other part with any combination of nop-folding, branch prediction, and multiple issue. As a result, on such CPUs this function delays for substantially fewer cycles than advertised -- by roughly a factor of 2, depending on cache and memory system configuration, branch alignment, etc. (It also appears to be wrong on M33, but M33 is not superscalar, so it's probably just an instruction timing issue.)
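
For a rough sense of scale, here is a back-of-the-envelope calculation; the per-iteration cycle costs are illustrative assumptions, not measurements of any particular part:

```rust
// Back-of-the-envelope only: the per-iteration costs below are assumed for illustration.
fn main() {
    let n: u32 = 1_000;          // requested delay, in cycles
    let iterations = n / 4 + 1;  // iteration count the current implementation runs

    // If the body (nop + subs + taken bne.n) costs about 4 cycles, as on M3/M4,
    // the delay roughly meets the contract.
    let m3_like = iterations * 4; // ~1004 cycles

    // If nop-folding and dual issue shrink the body to about 2 cycles, as on M7,
    // the delay comes in at roughly half of what was requested.
    let m7_like = iterations * 2; // ~502 cycles

    println!("requested {}, M3-ish {}, M7-ish {}", n, m3_like, m7_like);
}
```
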
As CPU core complexity grows, hand-tuned cycle-delay loops need to be specialized for the particular microarchitecture being used, much like in the 486/Pentium era. Alternatively, you could use a conservative implementation that, when asked to delay for at least `n` cycles, repeats the loop `n` times -- that should be safe for now, but I wouldn't recommend it, particularly because we've got the zero-overhead-loop extension to ARMv8-M on the horizon.
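
To make that concrete, here is a minimal sketch of what such a conservative variant could look like; `delay_conservative` is a hypothetical name rather than anything in the crate, and it assumes stable inline `asm!` is available:

```rust
use core::arch::asm;

/// Hypothetical conservative delay: one loop iteration per requested cycle.
/// Each iteration costs at least one cycle on any core, so this can only
/// over-delay, never under-delay, at the price of running several times
/// longer than necessary on simple, in-order parts.
pub fn delay_conservative(n: u32) {
    for _ in 0..n {
        // A volatile nop keeps the compiler from deleting the empty loop.
        unsafe { asm!("nop", options(nomem, nostack, preserves_flags)) };
    }
}
```
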
Shout out to @labbott for noticing this.