ptr::align_offset generates surprisingly bad code

I happened to look at more closely at the generated assembly for some code that uses `align_offset`, and noticed... `align_offset` does not compile as efficiently as one might hope for the case of "align pointer to `size_of::<usize>()`"

For example (ignore that I omit handling of align_offset's error return value):

```rust
pub unsafe fn align_offset(p: *const u8) -> *const u8 {
    p.add(p.align_offset(core::mem::size_of::<usize>()))
}
```

compiles to

```
example::align_offset:
        movl    %edi, %ecx
        andl    $7, %ecx
        movl    $8, %eax
        subq    %rcx, %rax
        testq   %rcx, %rcx
        cmoveq  %rcx, %rax
        addq    %rdi, %rax
        retq
```

Whereas performing the same alignment manually (forgive my convoluted way of doing this, my usual pattern is very slightly different and I don't have a memorized idiom for this without going through usize, since, well, I figured I just wanted to use align_offset):

```rust
pub unsafe fn manual_align_offset(p: *const u8) -> *const u8 {
    // IIRC just doing `arithmetic(p as usize) as *const u8` makes LLVM sad
    let aligned = ((p as usize + USIZE_SIZE - 1) & !(USIZE_SIZE - 1)) as isize;
    p.offset(aligned - (p as isize))
}
```

compiles to

```
example::manual_align_offset:
        leaq    7(%rdi), %rax
        andq    $-8, %rax
        retq
```

Which is substantially better along a variety of metrics, including but not limited to actual runtime.

Taking a look at the source for `align_offset` reveals that it uh, well it does some stuff. https://github.com/rust-lang/rust/blob/ac48e62db85e6db4bbe026490381ab205f4a614d/library/core/src/ptr/mod.rs#L1166-L1271 ... Anyway I'm just gonna take at face value that this all is needed since it tracks that sometimes you might need to GCD... And hell, maybe mine misses a case like "`p + USIZE - 1` wraps around the address space but the aligned value wouldn't).

Anyway IIUC align_offset is really considered the way forward for all pointer aligning, as miri will throw your code straight into the trash if it catches it dereferencing a pointer that you manually aligned... (I have Opinions on this but I'll spare you from a rant).

So, for that and a lot of reasons, I think we'd like the codegen for align_offset to look a lot closer to what I provided at opt-level=3, even if it means special-casing when `size_of::<T> == 1`... (I mean, I'd also love for it not to generate the whole code mountain in debug builds, but one thing at a time I guess).

Anyway, the function's documentation comment tells me that @nagisa has taken up the sacred burden of "keeper of `align_offset`'s secrets"... I have questions: is this fixable? And if not, is there an interface that we can provide that lets us produce good code here? Am I missing something?

P.S. Godbolt link: https://godbolt.org/z/388Enf

	/// Any questions go to @nagisa.
	#[lang = "align_offset"]
	pub(crate) unsafe fn align_offset<T: Sized>(p: *const T, a: usize) -> usize {
	/// Calculate multiplicative modular inverse of `x` modulo `m`.
	///
	/// This implementation is tailored for align_offset and has following preconditions:
	///
	/// * `m` is a power-of-two;
	/// * `x < m`; (if `x ≥ m`, pass in `x % m` instead)
	///
	/// Implementation of this function shall not panic. Ever.
	#[inline]
	fn mod_inv(x: usize, m: usize) -> usize {
	/// Multiplicative modular inverse table modulo 2⁴ = 16.
	///
	/// Note, that this table does not contain values where inverse does not exist (i.e., for
	/// `0⁻¹ mod 16`, `2⁻¹ mod 16`, etc.)
	const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
	/// Modulo for which the `INV_TABLE_MOD_16` is intended.
	const INV_TABLE_MOD: usize = 16;
	/// INV_TABLE_MOD²
	const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;

	let table_inverse = INV_TABLE_MOD_16[(x & (INV_TABLE_MOD - 1)) >> 1] as usize;
	if m <= INV_TABLE_MOD {
	table_inverse & (m - 1)
	} else {
	// We iterate "up" using the following formula:
	//
	// $$ xy ≡ 1 (mod 2ⁿ) → xy (2 - xy) ≡ 1 (mod 2²ⁿ) $$
	//
	// until 2²ⁿ ≥ m. Then we can reduce to our desired `m` by taking the result `mod m`.
	let mut inverse = table_inverse;
	let mut going_mod = INV_TABLE_MOD_SQUARED;
	loop {
	// y = y * (2 - xy) mod n
	//
	// Note, that we use wrapping operations here intentionally – the original formula
	// uses e.g., subtraction `mod n`. It is entirely fine to do them `mod
	// usize::MAX` instead, because we take the result `mod n` at the end
	// anyway.
	inverse = inverse.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(inverse)));
	if going_mod >= m {
	return inverse & (m - 1);
	}
	going_mod = going_mod.wrapping_mul(going_mod);
	}
	}
	}

	let stride = mem::size_of::<T>();
	let a_minus_one = a.wrapping_sub(1);
	let pmoda = p as usize & a_minus_one;

	if pmoda == 0 {
	// Already aligned. Yay!
	return 0;
	}

	if stride <= 1 {
	return if stride == 0 {
	// If the pointer is not aligned, and the element is zero-sized, then no amount of
	// elements will ever align the pointer.
	!0
	} else {
	a.wrapping_sub(pmoda)
	};
	}

	let smoda = stride & a_minus_one;
	// SAFETY: a is power-of-two so cannot be 0. stride = 0 is handled above.
	let gcdpow = unsafe { intrinsics::cttz_nonzero(stride).min(intrinsics::cttz_nonzero(a)) };
	let gcd = 1usize << gcdpow;

	if p as usize & (gcd.wrapping_sub(1)) == 0 {
	// This branch solves for the following linear congruence equation:
	//
	// ` p + so = 0 mod a `
	//
	// `p` here is the pointer value, `s` - stride of `T`, `o` offset in `T`s, and `a` - the
	// requested alignment.
	//
	// With `g = gcd(a, s)`, and the above asserting that `p` is also divisible by `g`, we can
	// denote `a' = a/g`, `s' = s/g`, `p' = p/g`, then this becomes equivalent to:
	//
	// ` p' + s'o = 0 mod a' `
	// ` o = (a' - (p' mod a')) * (s'^-1 mod a') `
	//
	// The first term is "the relative alignment of `p` to `a`" (divided by the `g`), the second
	// term is "how does incrementing `p` by `s` bytes change the relative alignment of `p`" (again
	// divided by `g`).
	// Division by `g` is necessary to make the inverse well formed if `a` and `s` are not
	// co-prime.
	//
	// Furthermore, the result produced by this solution is not "minimal", so it is necessary
	// to take the result `o mod lcm(s, a)`. We can replace `lcm(s, a)` with just a `a'`.
	let a2 = a >> gcdpow;
	let a2minus1 = a2.wrapping_sub(1);
	let s2 = smoda >> gcdpow;
	let minusp2 = a2.wrapping_sub(pmoda >> gcdpow);
	return (minusp2.wrapping_mul(mod_inv(s2, a2))) & a2minus1;
	}

	// Cannot be aligned at all.
	usize::MAX
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ptr::align_offset generates surprisingly bad code #75579

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ptr::align_offset generates surprisingly bad code #75579

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions