-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[RISCV] Add CompressPat for all cases in isCopyInstrImpl #136875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We may end up with a non-canonical move (copy) instruction after a pass like MachineCopyPropagation that doesn't get compressed because we don't have an explicit pattern for it. This patch addressses that by adding additional CompressPat. This means that the generated code will be slightly different with/without the compressed extension. This isn't expected to be problematic as the instructions in question are all simple arithmetic ops.
6a6a224
to
2698e03
Compare
@@ -1053,3 +1057,21 @@ let Predicates = [HasStdExtCOrZca, IsRV64] in { | |||
def : CompressPat<(SD GPR:$rs2, SPMem:$rs1, uimm9_lsb000:$imm), | |||
(C_SDSP GPR:$rs2, SPMem:$rs1, uimm9_lsb000:$imm)>; | |||
} // Predicates = [HasStdExtCOrZca, IsRV64] | |||
|
|||
let isCompressOnly = true, Predicates = [HasStdExtCOrZca, HasStdExtZba] in { | |||
def : CompressPat<(SH1ADD GPRNoX0:$rs1, X0, GPRNoX0:$rs2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHXADD is available on a single ALU on SiFive x280, but ADD/ADDI are available on both ALUs. x280 supports C so this patch would compress, but I could envision other CPUs that support SHXADD with lower performance and don't support C. Should we canonicalize a copy SHXADD to ADD earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the kind trade-off I wondered about. I oversimplified a bit in my summary.
On the one hand: Although canonicalising earlier is theoretically better for this reason, it also seem defensible to optimise for existing real-world and likely future hardware where unless something big changes, it's hard to imagine it doesn't have RVC support. Especially as this is more of a micro-optimisation that we're mainly motivated to do to keep output clean and non-surprising for anyone doing perf analysis.
On the other hand: there's an attraction and probably minor engineering advantage to keeping differences between compressed/non-compressed as minimal as possible. And if non-C hardware did appear, our future selves would be thankful.
Overall I'm still torn...
I'm mildly of the opinion that we should canonicalize the copy instead. Not a strongly held opinion or anything. Maybe instead of a specific canonicalizeCopyInst, we could introduce a generic optimizeInstruction which is called after mutation? It could canonicalize the copy, but could also e.g. fold a sh2add x2, x1, x0 to slli x2, x1, 2. Alternatively, if we wanted something localized, we could extend the late optimization pass we just added for branches to handle non-branch instructions. |
Good point. I'm seeing only a single static instance of that specific example (
Before that I had it in mind that the pseudo instruction expansion pass might become a canonicalisation + pseudo instruction expansion pass. But yes, there are a few possible places for this. Just to give a holding pattern update: given the fact we suspect this isn't necessarily the best way of handling it I'm going through the other suboptimal or improperly canonicalised instructions to make sure I can follow up with a proposal that considers all the ones we care about. I suspect this looks like the optimizeInstruction idea, but will report back once I've checked. |
…s after MachineCopyPropagation PR llvm#136875 was posted as a draft PR that handled a subset of these cases, using the CompressPat mechanism. The consensus from that discussion (and a conclusion I agree with) is that it would be beneficial doing this optimisation earlier on, and in a way that isn't limited just to cases that can be handled by instruction compression. The most common source for instructions that can be optimized/canonicalized in this way is through tail duplication followed by machine copy propagation. For RISC-V, choosing a more canonical instruction allows it to be compressed when it couldn't be before. There is the potential that it would make other MI-level optimisations easier. This modifies ~910 instructions across an llvm-test-suite build including SPEC2017, targeting rva22u64. Coverage of instructions is based on observations from a script written to find redundant or improperly canonicalized instructions (though I aim to support all instructions in a 'group' at once, e.g. MUL* even if I only saw some variants of MUL in practice).
I've found that all cases I saw were ultimately the result of MachineCopyPropagation and can be handled cleanly via the suggested optimizeInstruction route. #137973 implements this approach, and I'll close this draft PR in favour of that one. |
…ns after MachineCopyPropagation (#137973) PR #136875 was posted as a draft PR that handled a subset of these cases, using the CompressPat mechanism. The consensus from that discussion (and a conclusion I agree with) is that it would be beneficial doing this optimisation earlier on, and in a way that isn't limited just to cases that can be handled by instruction compression. The most common source for instructions that can be optimized/canonicalized in this way is through tail duplication in MachineBlockPlacement followed by machine copy propagation. For RISC-V, choosing a more canonical instruction allows it to be compressed when it couldn't be before. There is the potential that it would make other MI-level optimisations easier. This modifies ~910 instructions across an llvm-test-suite build including SPEC2017, targeting rva22u64. Looking at the diff, it seems there's room for eliminating instructions or further propagating after this. Coverage of instructions is based on observations from a script written to find redundant or improperly canonicalized instructions (though I aim to support all instructions in a 'group' at once, e.g. MUL* even if I only saw some variants of MUL in practice).
We may end up with a non-canonical move (copy) instruction after a pass like MachineCopyPropagation that doesn't get compressed because we don't have an explicit pattern for it. This patch addresses that by adding additional CompressPat. This means that the generated code will be slightly different with/without the compressed extension. This isn't expected to be problematic as the instructions in question are all simple arithmetic ops.
This removes 214 static instances of such instructions in an llvm-test-suite rva22 compile including SPEC. I don't claim this is any great win, but particularly as these do show up in benchmarks that people spend a lot of time analysing my preference is to remove codegen oddities like this where it's reasonable to do so, so as to avoid distraction/wasted time when people are analysing said benchmarks.
Question for reviewers: are you happy with the approach of adding as a CompressPat? The alternative would be to try to do this independent of whether C is enabled or not through either a canonicalisation pass, or adding a TargetInstrInfo hook like canonicalizeCopyInstr that MachineCopyPropagation calls after mutating something that was isCopyInstr.
If we're happy with this implementation approach I will add the necessary MC tests and mark for review.