Skip to content

[RISCV] Add CompressPat for all cases in isCopyInstrImpl #136875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

asb
Copy link
Contributor

@asb asb commented Apr 23, 2025

We may end up with a non-canonical move (copy) instruction after a pass like MachineCopyPropagation that doesn't get compressed because we don't have an explicit pattern for it. This patch addresses that by adding additional CompressPat. This means that the generated code will be slightly different with/without the compressed extension. This isn't expected to be problematic as the instructions in question are all simple arithmetic ops.

This removes 214 static instances of such instructions in an llvm-test-suite rva22 compile including SPEC. I don't claim this is any great win, but particularly as these do show up in benchmarks that people spend a lot of time analysing my preference is to remove codegen oddities like this where it's reasonable to do so, so as to avoid distraction/wasted time when people are analysing said benchmarks.


Question for reviewers: are you happy with the approach of adding as a CompressPat? The alternative would be to try to do this independent of whether C is enabled or not through either a canonicalisation pass, or adding a TargetInstrInfo hook like canonicalizeCopyInstr that MachineCopyPropagation calls after mutating something that was isCopyInstr.

If we're happy with this implementation approach I will add the necessary MC tests and mark for review.

@asb asb requested review from lenary, preames and topperc April 23, 2025 14:36
We may end up with a non-canonical move (copy) instruction after a pass like
MachineCopyPropagation that doesn't get compressed because we don't have
an explicit pattern for it. This patch addressses that by adding
additional CompressPat. This means that the generated code will be
slightly different with/without the compressed extension. This isn't
expected to be problematic as the instructions in question are all
simple arithmetic ops.
@asb asb force-pushed the 2025q2-riscv-additional-compress-pats branch from 6a6a224 to 2698e03 Compare April 23, 2025 14:38
@@ -1053,3 +1057,21 @@ let Predicates = [HasStdExtCOrZca, IsRV64] in {
def : CompressPat<(SD GPR:$rs2, SPMem:$rs1, uimm9_lsb000:$imm),
(C_SDSP GPR:$rs2, SPMem:$rs1, uimm9_lsb000:$imm)>;
} // Predicates = [HasStdExtCOrZca, IsRV64]

let isCompressOnly = true, Predicates = [HasStdExtCOrZca, HasStdExtZba] in {
def : CompressPat<(SH1ADD GPRNoX0:$rs1, X0, GPRNoX0:$rs2),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHXADD is available on a single ALU on SiFive x280, but ADD/ADDI are available on both ALUs. x280 supports C so this patch would compress, but I could envision other CPUs that support SHXADD with lower performance and don't support C. Should we canonicalize a copy SHXADD to ADD earlier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the kind trade-off I wondered about. I oversimplified a bit in my summary.

On the one hand: Although canonicalising earlier is theoretically better for this reason, it also seem defensible to optimise for existing real-world and likely future hardware where unless something big changes, it's hard to imagine it doesn't have RVC support. Especially as this is more of a micro-optimisation that we're mainly motivated to do to keep output clean and non-surprising for anyone doing perf analysis.

On the other hand: there's an attraction and probably minor engineering advantage to keeping differences between compressed/non-compressed as minimal as possible. And if non-C hardware did appear, our future selves would be thankful.

Overall I'm still torn...

@preames
Copy link
Collaborator

preames commented Apr 23, 2025

I'm mildly of the opinion that we should canonicalize the copy instead. Not a strongly held opinion or anything.

Maybe instead of a specific canonicalizeCopyInst, we could introduce a generic optimizeInstruction which is called after mutation? It could canonicalize the copy, but could also e.g. fold a sh2add x2, x1, x0 to slli x2, x1, 2.

Alternatively, if we wanted something localized, we could extend the late optimization pass we just added for branches to handle non-branch instructions.

@asb
Copy link
Contributor Author

asb commented Apr 24, 2025

I'm mildly of the opinion that we should canonicalize the copy instead. Not a strongly held opinion or anything.

Maybe instead of a specific canonicalizeCopyInst, we could introduce a generic optimizeInstruction which is called after mutation? It could canonicalize the copy, but could also e.g. fold a sh2add x2, x1, x0 to slli x2, x1, 2.

Good point. I'm seeing only a single static instance of that specific example (sh?add that can be a shift), but there are a few other potentials.

Alternatively, if we wanted something localized, we could extend the late optimization pass we just added for branches to handle non-branch instructions.

Before that I had it in mind that the pseudo instruction expansion pass might become a canonicalisation + pseudo instruction expansion pass. But yes, there are a few possible places for this.

Just to give a holding pattern update: given the fact we suspect this isn't necessarily the best way of handling it I'm going through the other suboptimal or improperly canonicalised instructions to make sure I can follow up with a proposal that considers all the ones we care about. I suspect this looks like the optimizeInstruction idea, but will report back once I've checked.

asb added a commit to asb/llvm-project that referenced this pull request Apr 30, 2025
…s after MachineCopyPropagation

PR llvm#136875 was posted as a draft PR that handled a subset of these
cases, using the CompressPat mechanism. The consensus from that
discussion (and a conclusion I agree with) is that it would be
beneficial doing this optimisation earlier on, and in a way that isn't
limited just to cases that can be handled by instruction compression.

The most common source for instructions that can be
optimized/canonicalized in this way is through tail duplication followed
by machine copy propagation. For RISC-V, choosing a more canonical
instruction allows it to be compressed when it couldn't be before. There
is the potential that it would make other MI-level optimisations easier.

This modifies ~910 instructions across an llvm-test-suite build
including SPEC2017, targeting rva22u64.

Coverage of instructions is based on observations from a script written
to find redundant or improperly canonicalized instructions (though I
aim to support all instructions in a 'group' at once, e.g. MUL* even if
I only saw some variants of MUL in practice).
@asb
Copy link
Contributor Author

asb commented Apr 30, 2025

I've found that all cases I saw were ultimately the result of MachineCopyPropagation and can be handled cleanly via the suggested optimizeInstruction route. #137973 implements this approach, and I'll close this draft PR in favour of that one.

@asb asb closed this Apr 30, 2025
asb added a commit that referenced this pull request May 8, 2025
…ns after MachineCopyPropagation (#137973)

PR #136875 was posted as a draft PR that handled a subset of these
cases, using the CompressPat mechanism. The consensus from that
discussion (and a conclusion I agree with) is that it would be
beneficial doing this optimisation earlier on, and in a way that isn't
limited just to cases that can be handled by instruction compression.

The most common source for instructions that can be
optimized/canonicalized in this way is through tail duplication in
MachineBlockPlacement followed by machine copy propagation. For RISC-V,
choosing a more canonical instruction allows it to be compressed when it
couldn't be before. There is the potential that it would make other
MI-level optimisations easier.

This modifies ~910 instructions across an llvm-test-suite build
including SPEC2017, targeting rva22u64. Looking at the diff, it seems
there's room for eliminating instructions or further propagating after
this.

Coverage of instructions is based on observations from a script written
to find redundant or improperly canonicalized instructions (though I aim
to support all instructions in a 'group' at once, e.g. MUL* even if I
only saw some variants of MUL in practice).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants