
Missed optimization with movzx and mov #56498

Closed
@Hintro

Description


I came across this while examining a loop that runs slower than I expected. It involves explicit and implicit conversions between 8-bit and 32/64-bit values, and when I looked through the generated assembly in the Godbolt compiler explorer, I found many movzx instructions that neither break a dependency nor contribute to correctness; on top of that, many of them use the same register, like movzx eax, al, which cannot be eliminated.

I then tried some simple examples on Godbolt, and found that this behavior is persistent and easily reproducible, even when I specify -march=skylake. Here's an example:

#include <stdint.h>

int add2bytes(uint8_t* a, uint8_t* b) {
    return uint8_t(*a + *b);
}

Clang 14 -O3

add2bytes(unsigned char*, unsigned char*):                       # @add2bytes(unsigned char*, unsigned char*)
        mov     al, byte ptr [rsi]
        add     al, byte ptr [rdi]
        movzx   eax, al
        ret

It would be better to use movzx for the load in place of the mov, rather than placing it at the end: that breaks the dependency on the old RAX value from the start and clears the upper bits of RAX in the process.
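
For reference, here is a minimal sketch of the sequence being asked for (hand-written, not actual compiler output):

add2bytes(unsigned char*, unsigned char*):
        movzx   eax, byte ptr [rsi]         # zero-extending load: no dependency on old RAX, upper bits cleared
        add     al, byte ptr [rdi]          # 8-bit add wraps in AL; bits 8-31 of EAX stay zero
        ret

With the movzx load up front, the trailing movzx eax, al is no longer needed, since the upper bits of EAX are already zero and the 8-bit add cannot disturb them.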

I also asked about this on Stack Overflow, and Peter Cordes has a detailed answer (https://stackoverflow.com/a/72953035/14730360) explaining why this codegen is bad for pretty much all x86 processors.

Godbolt link with code for examples: https://godbolt.org/z/z45xr4hq1
Here's one that's closer to what I was originally examining:

int foo(uint8_t* a, uint8_t i, uint8_t j) {
    return a[a[i] | a[j]];
}

Clang 14 -O3:

foo(unsigned char*, unsigned char, unsigned char):                             # @foo(unsigned char*, unsigned char, unsigned char)
        mov     eax, esi
        mov     ecx, edx
        mov     cl, byte ptr [rdi + rcx]
        or      cl, byte ptr [rdi + rax]
        movzx   eax, cl
        movzx   eax, byte ptr [rdi + rax]
        ret

The movzx eax, cl here just seems unnecessary. The upper bits of RCX should already be clean, since RCX is used as an index in mov cl, byte ptr [rdi + rcx]; the subsequent or does not affect those upper bits, and the dependency of RCX on that or is not something movzx eax, cl can break anyway. So I think it would be better to simply do movzx eax, byte ptr [rdi + rcx] after the or.
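
For illustration, a sketch of the sequence this suggests (hand-edited from the Clang 14 output above, not actual compiler output):

foo(unsigned char*, unsigned char, unsigned char):
        mov     eax, esi
        mov     ecx, edx
        mov     cl, byte ptr [rdi + rcx]    # a[j]; RCX's upper bits are already relied on being clean here
        or      cl, byte ptr [rdi + rax]    # a[i] | a[j] in CL; upper bits of RCX untouched
        movzx   eax, byte ptr [rdi + rcx]   # final load indexes with RCX directly, zero-extending into EAX
        ret

This drops the redundant movzx eax, cl and saves one instruction on the critical path.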
