Description
I found that both GCC 12 and Clang 11 call __aeabi_uldivmod when dividing a uint64_t by a constant on AArch32, and that it is possible to emulate the AArch64 umulh instruction, which enables the divide-by-constant optimization even on AArch32. I wrote this up here and will include the relevant bits below; there is a GCC Bugzilla suggestion as well. This optimization applies only to Arm cores with the umull instruction, which covers many of them, but umull is not present on the smallest microcontrollers and some older cores.
C code for dividing uint64_t by a constant
uint64_t div3(uint64_t x)
{
    return x / 3;
}
Clang 11 (-O3 -mcpu=cortex-m4):
div3(unsigned long long):
        push    {r7, lr}
        movs    r2, #3
        movs    r3, #0
        bl      __aeabi_uldivmod
        pop     {r7, pc}
However, Clang 16 for AArch64 implements division by 3 as a multiplication by the fixed-point reciprocal:
div3(unsigned long): // @div3(unsigned long)
        mov     x8, #-6148914691236517206
        movk    x8, #43691
        umulh   x8, x0, x8
        lsr     x0, x8, #1
        ret
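The constant built by the mov/movk pair is 0xAAAAAAAAAAAAAAAB = ceil(2^65 / 3), so 3 * m = 2^65 + 1 and floor(x * m / 2^65) = floor(x / 3) for every 64-bit x. A minimal sketch of the math (the function name is mine; umulh is modeled with the GCC/Clang unsigned __int128 extension, so this is a host-side illustration, not the AArch32 code path):

```c
#include <stdint.h>

/* What the AArch64 sequence computes: a umulh by ceil(2^65 / 3),
   then a one-bit right shift of the high half. */
static uint64_t div3_by_mul(uint64_t x)
{
    const uint64_t m = 0xAAAAAAAAAAAAAAABull;       /* mov + movk */
    /* umulh: high 64 bits of the 128-bit product */
    const uint64_t hi = (uint64_t)(((unsigned __int128)x * m) >> 64);
    return hi >> 1;                                 /* lsr #1 */
}
```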
While an AArch32 core like the Cortex-M4 does not have the umulh instruction, it does have umull (a 32-bit x 32-bit multiply with a 64-bit result). Using that, we can emulate umulh with the following function:
uint64_t umulh( uint64_t a, uint64_t b )
{
    const uint32_t a_lo = a;
    const uint32_t a_hi = a >> 32;
    const uint32_t b_lo = b;
    const uint32_t b_hi = b >> 32;

    /* FOIL method of multiplication
       See https://en.wikipedia.org/wiki/FOIL_method.
       Instead of binomials (ax + b) * (cy + d) with constants a, b, c, d
       and variables x, y, consider variables a, b, c, d and constants
       x, y = 1 << 32.
       Where applicable, each product is an ARM UMULL instruction;
       it might do better with UMLAL. */
    const uint64_t acc0 = (uint64_t)a_lo * b_lo;
    const uint64_t acc1 = (uint64_t)a_hi * b_lo;
    const uint64_t acc2 = (uint64_t)a_lo * b_hi;
    const uint64_t acc3 = (uint64_t)a_hi * b_hi;

    /* Break up into 32-bit values */
    const uint32_t hi0 = acc0 >> 32;
    const uint32_t lo1 = acc1;
    const uint32_t hi1 = acc1 >> 32;
    const uint32_t lo2 = acc2;
    const uint32_t hi2 = acc2 >> 32;
    const uint32_t lo3 = acc3;
    const uint32_t hi3 = acc3 >> 32;

    /* Bits 0..31 of the product don't affect the result, so there is
       no need to store or even compute them; only the carry out of
       bits 32..63 matters, so start accumulating at that word. */
    uint64_t acc = 0;
    acc += hi0;
    acc += lo1;
    acc += lo2;
    acc >>= 32;
    acc += hi1;
    acc += hi2;
    acc += lo3;
    const uint32_t res2 = (uint32_t)acc;
    acc >>= 32;
    acc += hi3;
    const uint32_t res3 = (uint32_t)acc;
    return (((uint64_t)res3) << 32) + res2;
}
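Wiring the emulation into division by 3 then mirrors the Clang 16 AArch64 output: umulh by ceil(2^65 / 3) followed by a one-bit shift. A condensed, self-contained sketch (function names are mine; the partial-product bookkeeping is the same as above, folded into the carry extraction):

```c
#include <stdint.h>

/* Condensed umulh emulation: four umull-sized partial products; the
   low halves are combined only to extract their carry into bit 64. */
static uint64_t umulh_emul(uint64_t a, uint64_t b)
{
    const uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    const uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    const uint64_t ll = (uint64_t)a_lo * b_lo;
    const uint64_t lh = (uint64_t)a_lo * b_hi;
    const uint64_t hl = (uint64_t)a_hi * b_lo;
    const uint64_t hh = (uint64_t)a_hi * b_hi;
    /* carry out of bits 32..63 of the full 128-bit product */
    const uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* Division by 3, matching the Clang 16 AArch64 sequence:
   umulh by 0xAAAAAAAAAAAAAAAB = ceil(2^65 / 3), then lsr #1. */
uint64_t div3(uint64_t x)
{
    return umulh_emul(x, 0xAAAAAAAAAAAAAAABull) >> 1;
}
```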
I will refer to this as umulh emulation. In testing on Cortex-M4 it yields a speedup of between 2x and 37x, depending on several factors.
One reason is that the execution time of umulh() is constant on Cortex-M4, whereas the execution time of __aeabi_uldivmod() depends on the numerator.
If the core has the udiv instruction and the compiler support library implements __aeabi_uldivmod() using it, the speedup from emulating umulh is relatively small, only around 2-4x on Cortex-M4. If not, the umulh approach can divide numbers at roughly 28-37x the rate of __aeabi_uldivmod(). Broadly, we can group cores as follows:
- Cores where umulh emulation performs poorly because umull is absent: old Arm cores before the ARM7TDMI, and Cortex-M0/M0+/M1/M23. On these we should keep using __aeabi_uldivmod().
- Cores where umulh emulation is an order of magnitude faster because umull is present but udiv is not (or __aeabi_uldivmod() does not take advantage of udiv): ARM7TDMI, ARM9, ARM10, ARM11, Cortex-A8, and Cortex-A9.
- Cores where umulh emulation is only a little faster: Cortex-A7, Cortex-A15, and 64-bit Arm cores.
If this is of interest, it might be worth adding umulh to the compiler support library and using it to divide uint64_t by constants.