Description
I found that both GCC 12 and Clang 11 call __aeabi_uldivmod when dividing a uint64_t by a constant on AArch32, and that it is possible to emulate the AArch64 umulh instruction, which enables the divide-by-constant optimization even on AArch32. I wrote this up here and will include the relevant bits below; there is a GCC Bugzilla suggestion as well. This optimization applies only to Arm cores with the umull instruction, which covers many of them, but umull is not present on the smallest microcontrollers and some older cores.
C code for dividing uint64_t by a constant
uint64_t div3(uint64_t x)
{
    return x / 3;
}
Clang 11 (-O3 -mcpu=cortex-m4):
div3(unsigned long long):
        push    {r7, lr}
        movs    r2, #3
        movs    r3, #0
        bl      __aeabi_uldivmod
        pop     {r7, pc}
However, Clang 16 for AArch64 implements division by 3 as a multiplication by the fixed-point reciprocal:
div3(unsigned long): // @div3(unsigned long)
        mov     x8, #-6148914691236517206
        movk    x8, #43691
        umulh   x8, x0, x8
        lsr     x0, x8, #1
        ret
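The constant built by the mov/movk pair is 0xAAAAAAAAAAAAAAAB = ceil(2^65 / 3), so 3 * m = 2^65 + 1 and floor(x * m / 2^65) = floor(x / 3) for every 64-bit x. A minimal sketch of the math (the function name is mine; umulh is modeled with the GCC/Clang unsigned __int128 extension, so this is a host-side illustration, not the AArch32 code path):

```c
#include <stdint.h>

/* What the AArch64 sequence computes: a umulh by ceil(2^65 / 3),
   then a one-bit right shift of the high half. */
static uint64_t div3_by_mul(uint64_t x)
{
    const uint64_t m = 0xAAAAAAAAAAAAAAABull;       /* mov + movk */
    /* umulh: high 64 bits of the 128-bit product */
    const uint64_t hi = (uint64_t)(((unsigned __int128)x * m) >> 64);
    return hi >> 1;                                 /* lsr #1 */
}
```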
While an AArch32 core like the Cortex-M4 does not have the umulh instruction, it does have umull (a 32-bit x 32-bit multiply with a 64-bit result). Using that, we can emulate umulh with the following function:
uint64_t umulh( uint64_t a, uint64_t b )
{
    const uint32_t a_lo = a;
    const uint32_t a_hi = a >> 32;
    const uint32_t b_lo = b;
    const uint32_t b_hi = b >> 32;

    /* FOIL method of multiplication
       See https://en.wikipedia.org/wiki/FOIL_method.
       Instead of binomials (ax + b) * (cy + d) with constants a, b, c, d
       and variables x, y, consider variables a, b, c, d and constants
       x, y = 1 << 32.
       Where applicable, each product is an ARM UMULL instruction;
       it might do better with UMLAL. */
    const uint64_t acc0 = (uint64_t)a_lo * b_lo;
    const uint64_t acc1 = (uint64_t)a_hi * b_lo;
    const uint64_t acc2 = (uint64_t)a_lo * b_hi;
    const uint64_t acc3 = (uint64_t)a_hi * b_hi;

    /* Break up into 32-bit values */
    const uint32_t hi0 = acc0 >> 32;
    const uint32_t lo1 = acc1;
    const uint32_t hi1 = acc1 >> 32;
    const uint32_t lo2 = acc2;
    const uint32_t hi2 = acc2 >> 32;
    const uint32_t lo3 = acc3;
    const uint32_t hi3 = acc3 >> 32;

    /* Bits 0..31 of the product don't affect the result, so there is
       no need to store or even compute them; only the carry out of
       bits 32..63 matters, so start accumulating at that word. */
    uint64_t acc = 0;
    acc += hi0;
    acc += lo1;
    acc += lo2;
    acc >>= 32;
    acc += hi1;
    acc += hi2;
    acc += lo3;
    const uint32_t res2 = (uint32_t)acc;
    acc >>= 32;
    acc += hi3;
    const uint32_t res3 = (uint32_t)acc;
    return (((uint64_t)res3) << 32) + res2;
}
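Wiring the emulation into division by 3 then mirrors the Clang 16 AArch64 output: umulh by ceil(2^65 / 3) followed by a one-bit shift. A condensed, self-contained sketch (function names are mine; the partial-product bookkeeping is the same as above, folded into the carry extraction):

```c
#include <stdint.h>

/* Condensed umulh emulation: four umull-sized partial products; the
   low halves are combined only to extract their carry into bit 64. */
static uint64_t umulh_emul(uint64_t a, uint64_t b)
{
    const uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    const uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    const uint64_t ll = (uint64_t)a_lo * b_lo;
    const uint64_t lh = (uint64_t)a_lo * b_hi;
    const uint64_t hl = (uint64_t)a_hi * b_lo;
    const uint64_t hh = (uint64_t)a_hi * b_hi;
    /* carry out of bits 32..63 of the full 128-bit product */
    const uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* Division by 3, matching the Clang 16 AArch64 sequence:
   umulh by 0xAAAAAAAAAAAAAAAB = ceil(2^65 / 3), then lsr #1. */
uint64_t div3(uint64_t x)
{
    return umulh_emul(x, 0xAAAAAAAAAAAAAAABull) >> 1;
}
```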
I will refer to this as umulh emulation. In testing on Cortex-M4 it yields a speedup of between 2x and 37x, depending on several factors.
One reason is that the execution time of umulh() is constant on Cortex-M4, whereas the execution time of __aeabi_uldivmod() depends on the numerator.
If the core has the udiv instruction and the compiler support library implements __aeabi_uldivmod() using it, the speedup from emulating umulh is relatively small, only around 2-4x on Cortex-M4. If not, the umulh approach can divide numbers at roughly 28-37x the rate of __aeabi_uldivmod(). Broadly, we can group cores as follows:
- Cores where umulh emulation performs poorly because umull is absent: old Arm cores before the ARM7TDMI, and Cortex-M0/M0+/M1/M23. On these we should keep using __aeabi_uldivmod().
- Cores where umulh emulation is an order of magnitude faster because umull is present but udiv is not (or __aeabi_uldivmod() does not take advantage of udiv): ARM7TDMI, ARM9, ARM10, ARM11, Cortex-A8, and Cortex-A9.
- Cores where umulh emulation is only a little faster: Cortex-A7, Cortex-A15, and 64-bit Arm cores.
If this is of interest, it might be worth adding umulh to the compiler support library and using it to divide uint64_t by constants.