-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Simplify codegen for niche-encoded enums in simple cases #102901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
r? @davidtwco (rust-highfive has picked a reviewer for you, use r? to override) |
Some before & after assembly:pub enum Foo {Bar(bool), Baz, Quux, Quux2}
pub fn baz(foo: Foo) -> bool {
matches!(foo, Foo::Baz)
}
pub enum E {A, Other(char), B}
pub fn x(x: E) -> char {
match x {
Other(c) => c,
_ => 'x',
}
}
wins: before & after
And here, the case of ugly pub fn disco(foo: Foo) -> std::mem::Discriminant<Foo> {
std::mem::discriminant(&foo)
} before & after:
I'm not even sure if that's an improvement or regression, on one hand, critical path to cmp seems shorter – add-(movzx)-lea-cmp-cmov vs lea-cmp-(movzx)-cmov. On the other hand, I googled somewhere that |
This comment has been minimized.
This comment has been minimized.
r? @nagisa |
This comment was marked as outdated.
This comment was marked as outdated.
@mikebenfield You've inspired me to file an issue for suboptimal codegen when attempting to do arithmetics directly on u8-tag :) llvm/llvm-project#58338. |
That's great, there's several ways LLVM's generated code seems to be suboptimal around this. In fact it seems to me it really could be doing the optimizations we're making on its own (just with some range data). That said, I'm not sure about the particular issue you filed. There's a surprising lack of clarity about this, but I think the x86_64 Sys V ABI does actually not require integer arguments in registers to be zero extended, and AFAICT gcc keeps in the |
The |
Most often all niche-encoded tags are at the end of tag scalar range. In these cases we can `<` on tag directly to know whether `is_untagged`. Also, avoiding casting and performing arithmetics directly on tag's type seem to help optimizations.
This should help with get discr codegen in cases where untagged_variant is not 0.
This helps avoid weird movzx instructions
d1daa62
to
f658bd6
Compare
@mikebenfield Turns out I was a bit overeager with the "tagged variant never has the same value as untagged" |
@nagisa I believe we've with @mikebenfield in #102872 converged on very similar approach, but can we do a perf run also on my PR? If it'll show the same results, I'll close my PR as redundant. |
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit f658bd6 with merge e395cd1dcb5ab05e102b56b735bc1a84453d9b11... |
☀️ Try build successful - checks-actions |
Queued e395cd1dcb5ab05e102b56b735bc1a84453d9b11 with parent 0940040, future comparison URL. |
Finished benchmarking commit (e395cd1dcb5ab05e102b56b735bc1a84453d9b11): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)This benchmark run did not return any relevant results for this metric. CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Footnotes |
When toggling "show non-relevant results", looks like there's a small, but consistent improvements – instr counts are mostly green (esp. when looking only at full-check). Even better results with cycles. Bootstrap is also green-ish (not sure whether that's significant enough though). Looked at some profiles and The regressions I see are only in I also found a small codegen regression ( @mikebenfield I've seen you have some regressions on your branch, sadly. Do you mind if I "steal" your tests and undraft my PR, or do you plan to debug the regressions on your PR? |
I do indeed plan to debug the regressions on my PR, although it's possible they won't be resolved until the LLVM issue I filed is fixed. Feel free to do as you wish and steal tests and whatever you want from me. I do find the cases I wrote a little cleaner and more general - AFAICT you only handle 1 out of the 4 cases I handle, but it is the common case. I'd ideally like to get all those cases in there, but whatever works is fine with me. |
@bors try @rust-timer queue |
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
⌛ Trying commit b33cb65 with merge 9d2875fb68102e33b0ee7a4ebb82c48344b258d5... |
☀️ Try build successful - checks-actions |
1 similar comment
☀️ Try build successful - checks-actions |
Queued 9d2875fb68102e33b0ee7a4ebb82c48344b258d5 with parent 898f463, future comparison URL. |
Finished benchmarking commit (9d2875fb68102e33b0ee7a4ebb82c48344b258d5): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesThis benchmark run did not return any relevant results for this metric. Footnotes |
#102872 was merged with the same idea 🎉. I think the codegen could be optimized even further though, will try to reopen. |
Vast majority of niche-encoded enums have either a single niche-encoded variant or all their niche-encoded tags are greater than the whole untagged range, eg. with
enum E { A(bool), B, C }
we have 0 | 1 → A, 2 → B, 3 → C. Currently get_discriminant accounts for various edgecases, like wrapping ranges, etc., but for such simple enum, we should be able to just determine whether an untagged value is stored by simple<
comparison.In pseudocode, before:
after:
Also:
cast_to
yields better codegen (perhaps LLVM has problems withrange
propagation when casting?).assume
that niche-encoded discriminant will never be the same as untagged discriminant helps a lot in cases when untagged variant is not the first one. Eg. inenum E { A, B(bool), C }
LLVM would have to consider 0 | 1 | 3 as valid values for B) if not for this assumption.tag < X
condition (<=
or just inverting the boolean? not sure now), but it resulted in LLVM getting too smart and recognizing theif tag > X { 0 } else { tag - X }
as saturating-sub intrinsic too early in the process, which resulted in nice codegen, but stopped further optimization.Despite some nice wins in generated assembly – see the "enum-fair-opt" file in this gist, I've seen only tiny (~0.03%) improvements on check scenarios in local perf testing on instr counts. The wins were quite consistent though, so I'd be happy to see a perf run – perhaps there's an improvement in cycles? (I couldn't get reliable values for cycles on local setup)
Pushing it as draft, because TODO
CraneliftIt could be done in followup PR, I guessMay help with #101872