missed optimization: `gep + gep -> add + gep` #78214

https://gcc.godbolt.org/z/YjbY7snas shows the reduced C++ source code, IR, and generated code for x86; https://gcc.godbolt.org/z/W69boErh1 is the same test case for aarch64. On both x86 and aarch64, clang 16.0.0 produces more efficient code than trunk. The C++ source code has an outer loop (`do {...} while`) and an inner loop (unrolled into two basic blocks).

* For x86, `LBB0_1` is the first unrolled basic block of the inner loop. Clang 16.0.0 generates 14 instructions for it while trunk generates 15. In addition, the code sequence from trunk is less efficient: `lea` instructions of the form `[base + index*scale + displacement]` have higher latency and lower throughput than `add` on some x86 processors (see [Agner's instruction table](https://www.agner.org/optimize/instruction_tables.pdf)).

* For aarch64, clang 16.0.0 sinks `ip += select(is_literal, next_literal_tag + 1, tag_type + 1)` from `LBB0_3` (the second basic block of the unrolled inner loop) into `LBB0_5` (the exiting block of the outer loop).
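
For readers unfamiliar with the pattern named in the title, here is a minimal C++ sketch of the two equivalent address computations. It is not the reduced test case from the godbolt links; the function names `advance_chained` and `advance_folded` and the parameters `a` and `b` are hypothetical, and how a particular compiler version lowers each form is an assumption rather than something verified here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: two chained pointer offsets. A front end would
// typically lower each pointer addition to its own getelementptr,
// giving the "gep + gep" shape.
const std::uint8_t* advance_chained(const std::uint8_t* ip,
                                    std::ptrdiff_t a, std::ptrdiff_t b) {
    const std::uint8_t* tmp = ip + a;  // first pointer offset
    return tmp + b;                    // second pointer offset
}

// Hypothetical helper: the offsets are summed first and applied once,
// giving the "add + gep" shape. It yields the same address as above
// whenever both computations stay inside the same object.
const std::uint8_t* advance_folded(const std::uint8_t* ip,
                                   std::ptrdiff_t a, std::ptrdiff_t b) {
    return ip + (a + b);
}
```

Both helpers return the same pointer for in-bounds offsets, so the choice between the two shapes only affects the instructions the backend ends up selecting.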

The test case is reduced from a [function](https://github.com/google/snappy/blob/27f34a580be4a3becf5f8c0cba13433f53c21337/snappy.cc#L1194C38-L1194C58) in an open-source compression/decompression library.
