
Missed optimization with movzx and mov #56498

Closed
@Hintro

Description


I came across this while examining a loop that runs slower than I expected. It involves explicit and implicit conversions between 8-bit and 32/64-bit values, and when I looked through the generated assembly in the Godbolt compiler explorer, I found many movzx instructions that neither break a dependency nor contribute to correctness; on top of that, many of them use the same register, like movzx eax, al, which cannot be eliminated.

I then tried some simple examples on Godbolt, and found that this behavior is persistent and easily reproducible, even when I specify -march=skylake. Here's an example:

#include <stdint.h>

int add2bytes(uint8_t* a, uint8_t* b) {
    return uint8_t(*a + *b);
}

Clang 14 -O3

add2bytes(unsigned char*, unsigned char*):                       # @add2bytes(unsigned char*, unsigned char*)
        mov     al, byte ptr [rsi]
        add     al, byte ptr [rdi]
        movzx   eax, al
        ret

It would be better to use movzx for the load in place of the mov, rather than placing it at the end: that breaks the dependency on the old RAX value from the start and clears the upper bits of RAX in the process.
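
For reference, here is a minimal sketch of the sequence being asked for (hand-written, not actual compiler output):

add2bytes(unsigned char*, unsigned char*):
        movzx   eax, byte ptr [rsi]         # zero-extending load: no dependency on old RAX, upper bits cleared
        add     al, byte ptr [rdi]          # 8-bit add wraps in AL; bits 8-31 of EAX stay zero
        ret

With the movzx load up front, the trailing movzx eax, al is no longer needed, since the upper bits of EAX are already zero and the 8-bit add cannot disturb them.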

I also asked about this on Stack Overflow, and Peter Cordes has a detailed answer (https://stackoverflow.com/a/72953035/14730360) explaining why this codegen is bad for pretty much all x86 processors.

Godbolt link with code for examples: https://godbolt.org/z/z45xr4hq1
Here's one that's closer to what I was originally examining:

int foo(uint8_t* a, uint8_t i, uint8_t j) {
    return a[a[i] | a[j]];
}

Clang 14 -O3:

foo(unsigned char*, unsigned char, unsigned char):                             # @foo(unsigned char*, unsigned char, unsigned char)
        mov     eax, esi
        mov     ecx, edx
        mov     cl, byte ptr [rdi + rcx]
        or      cl, byte ptr [rdi + rax]
        movzx   eax, cl
        movzx   eax, byte ptr [rdi + rax]
        ret

The movzx eax, cl here just seems unnecessary. The upper bits of RCX should already be clean, since RCX is used as an index in mov cl, byte ptr [rdi + rcx]; the subsequent or does not affect those upper bits, and the dependency of RCX on that or is not something movzx eax, cl can break anyway. So I think it would be better to simply do movzx eax, byte ptr [rdi + rcx] after the or.
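
For illustration, a sketch of the sequence this suggests (hand-edited from the Clang 14 output above, not actual compiler output):

foo(unsigned char*, unsigned char, unsigned char):
        mov     eax, esi
        mov     ecx, edx
        mov     cl, byte ptr [rdi + rcx]    # a[j]; RCX's upper bits are already relied on being clean here
        or      cl, byte ptr [rdi + rax]    # a[i] | a[j] in CL; upper bits of RCX untouched
        movzx   eax, byte ptr [rdi + rcx]   # final load indexes with RCX directly, zero-extending into EAX
        ret

This drops the redundant movzx eax, cl and saves one instruction on the critical path.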
