missed optimization : gep + gep -> add + gep  #78214

Open
@mingmingl-llvm

https://gcc.godbolt.org/z/YjbY7snas shows the reduced C++ source code, IR, and generated code for x86, and https://gcc.godbolt.org/z/W69boErh1 is the same test case for aarch64. For both x86 and aarch64, clang 16.0.0 produces more efficient code than trunk. The C++ source code has an outer loop (do {...} while) and an inner loop (unrolled into two basic blocks).

  • For x86, LBB0_1 is the first unrolled basic block in the inner loop. Clang 16.0.0 generates 14 instructions while trunk generates 15. Besides, the code sequence from trunk is less efficient: lea instructions of the form [base + index*scale + displacement] have higher latency and lower throughput than add on some x86 processors (see Agner's instruction table).

  • For aarch64, in clang 16.0.0, ip += select(is_literal, next_literal_tag + 1, tag_type + 1) sinks from LBB0_3 (the 2nd basic block of the unrolled inner loop) into LBB0_5 (the exiting block of the outer loop).

The test case is reduced from a function in an open-source compression/decompression library.
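
For readers unfamiliar with the notation in the title, here is a minimal source-level sketch of the two address-computation shapes it names. This is not the reduced test case from the godbolt links. The function names advance_chained and advance_summed and their parameters are hypothetical, chosen only for illustration. The first function offsets a pointer twice (roughly "gep + gep" in IR terms); the second sums the offsets first and offsets the pointer once (roughly "add + gep"). An optimizing compiler may canonicalize either form into the other, so the source shape does not determine the emitted IR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration: two chained pointer offsets.
// In IR terms this loosely corresponds to a GEP whose base is another GEP.
const std::uint8_t* advance_chained(const std::uint8_t* ip,
                                    std::size_t a, std::size_t b) {
    const std::uint8_t* tmp = ip + a;  // first pointer offset
    return tmp + b;                    // second pointer offset
}

// Hypothetical illustration: the integer offsets are added first,
// then applied to the pointer once (an add feeding a single GEP).
const std::uint8_t* advance_summed(const std::uint8_t* ip,
                                   std::size_t a, std::size_t b) {
    std::size_t total = a + b;         // integer add
    return ip + total;                 // single pointer offset
}
```

Both functions return the same address when the offsets stay within the underlying buffer; the difference is only in how the arithmetic is grouped, which can affect what addressing modes the backend selects.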
