Description
https://gcc.godbolt.org/z/YjbY7snas shows the reduced C++ source code, IR and generated code for x86, and https://gcc.godbolt.org/z/W69boErh1 is the same test case for aarch64. For both x86 and aarch64, clang 16.0.0 produces more efficient code than trunk. The C++ source code has an outer loop (`do {...} while`) and an inner loop (unrolled into two basic blocks).
- For x86, `LBB0_1` is the first unrolled basic block in the inner loop. Clang 16.0.0 generates 14 instructions while trunk generates 15 instructions. In addition, the code sequence from trunk is less efficient: `lea` instructions of the form `[base + index*scale + displacement]` have higher latency and lower throughput than `add` on some x86 processors (see Agner's instruction table).
- For aarch64, in clang 16.0.0, `ip += select(is_literal, next_literal_tag + 1, tag_type + 1)` is sunk from `LBB0_3` (the second basic block of the unrolled inner loop) into `LBB0_5` (the exiting block of the outer loop).
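
To make the sinking in the aarch64 case concrete, here is a minimal sketch of the transformation in plain C++. It is not the reduced test case from the godbolt links: the functions `scan_unsunk` and `scan_sunk`, the helper `select_value`, the `0xFF` early-exit condition and the use of `literal_len` are all hypothetical, chosen only to show a pointer update being moved past an early exit that does not use it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the IR `select`; not taken from the test case.
static inline std::size_t select_value(bool cond, std::size_t a, std::size_t b) {
  return cond ? a : b;
}

// Unsunk form: the advance is computed before the early-exit check,
// so the work is also done on the path that leaves the loop early.
std::size_t scan_unsunk(const std::uint8_t* buf, std::size_t n) {
  std::size_t ip = 0;
  do {
    const std::uint8_t tag = buf[ip];
    const std::size_t tag_type = tag & 3u;
    const bool is_literal = (tag_type == 0);
    const std::size_t literal_len = tag >> 2;
    const std::size_t next =
        ip + select_value(is_literal, literal_len + 1, tag_type + 1);
    if (tag == 0xFF) return ip;  // assumed early exit; does not use `next`
    ip = next;
  } while (ip < n);
  return ip;
}

// Sunk form: the same advance is computed only on the path that reaches
// the loop's exiting check, after the early exit has been ruled out.
std::size_t scan_sunk(const std::uint8_t* buf, std::size_t n) {
  std::size_t ip = 0;
  do {
    const std::uint8_t tag = buf[ip];
    const std::size_t tag_type = tag & 3u;
    const bool is_literal = (tag_type == 0);
    const std::size_t literal_len = tag >> 2;
    if (tag == 0xFF) return ip;  // assumed early exit
    ip += select_value(is_literal, literal_len + 1, tag_type + 1);
  } while (ip < n);
  return ip;
}
```

Both functions return the same value for any input with `n >= 1`; the only difference is where the `select` and the add are executed, which is the kind of difference visible between the two aarch64 listings.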
The test case is reduced from a function in an open-source compression/decompression library.