Simplify codegen for niche-encoded enums in simple cases #102901


Closed
wants to merge 6 commits into from

Conversation

krdln
Contributor

@krdln krdln commented Oct 10, 2022

The vast majority of niche-encoded enums either have a single niche-encoded variant or have all their niche-encoded tags greater than the whole untagged range, e.g. with enum E { A(bool), B, C } we have 0 | 1 → A, 2 → B, 3 → C. Currently get_discriminant accounts for various edge cases, like wrapping ranges, but for such a simple enum we should be able to determine whether an untagged value is stored with a single < comparison.

In pseudocode, before:

let relative_discr = tag.wrapping_sub(X);
if relative_discr < Y { Z } else { relative_discr + A }

after:

if tag < X { Z } else { tag - B }

Also:

  • Noticed that doing all the computation before the cast to cast_to yields better codegen (perhaps LLVM has problems with range propagation when casting?).
  • Adding an assume that a niche-encoded discriminant is never the same as the untagged discriminant helps a lot in cases where the untagged variant is not the first one. E.g. in enum E { A, B(bool), C }, LLVM would otherwise have to consider 0 | 1 | 3 as valid values for B if not for this assumption.
  • I remember trying a slight variant of the tag < X condition (<=, or just inverting the boolean? not sure now), but it resulted in LLVM getting too smart and recognizing if tag > X { 0 } else { tag - X } as a saturating-sub intrinsic too early in the pipeline, which produced nice codegen but blocked further optimization.

Despite some nice wins in generated assembly (see the "enum-fair-opt" file in this gist), I've seen only tiny (~0.03%) improvements in instruction counts on check scenarios in local perf testing. The wins were quite consistent, though, so I'd be happy to see a perf run; perhaps there's an improvement in cycles? (I couldn't get reliable cycle measurements on my local setup.)

Pushing it as draft, because TODO

May help with #101872

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 10, 2022
@rust-highfive
Contributor

r? @davidtwco

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 10, 2022
@krdln krdln marked this pull request as draft October 10, 2022 21:21
@krdln
Contributor Author

krdln commented Oct 10, 2022

Some before & after assembly:
pub enum Foo { Bar(bool), Baz, Quux, Quux2 }

pub fn baz(foo: Foo) -> bool {
    matches!(foo, Foo::Baz)
}

pub enum E { A, Other(char), B }

pub fn x(x: E) -> char {
    match x {
        E::Other(c) => c,
        _ => 'x',
    }
}

wins: before & after

0000000000000000 <enum_fair::bar>:
   0:	40 80 c7 fb          	add    dil,0xfb
   4:	40 80 ff fd          	cmp    dil,0xfd
   8:	0f 92 c0             	setb   al
   b:	c3                   	ret    

0000000000000000 <abx_bad::x>:
   0:	8d 8f 00 00 ef ff    	lea    ecx,[rdi-0x110000]
   6:	83 f9 01             	cmp    ecx,0x1
   9:	b8 78 00 00 00       	mov    eax,0x78
   e:	0f 44 c7             	cmove  eax,edi
  11:	83 f9 03             	cmp    ecx,0x3
  14:	0f 43 c7             	cmovae eax,edi
  17:	c3                   	ret    
0000000000000000 <enum_fair::bar>:
   0:	40 80 ff 02          	cmp    dil,0x2
   4:	0f 92 c0             	setb   al
   7:	c3                   	ret    

0000000000000000 <abx_bad::x>:
   0:	81 ff 00 00 11 00    	cmp    edi,0x110000
   6:	b8 78 00 00 00       	mov    eax,0x78
   b:	0f 42 c7             	cmovb  eax,edi
   e:	c3                   	ret    

And here's the case of the ugly movzxes (note: x86_64 doesn't have an 8-bit cmov, only 16-, 32- and 64-bit):

pub fn disco(foo: Foo) -> std::mem::Discriminant<Foo> {
    std::mem::discriminant(&foo)
}

before & after:

0000000000000000 <enum_fair::disco>:
   0:	40 80 c7 fe          	add    dil,0xfe
   4:	40 0f b6 cf          	movzx  ecx,dil
   8:	48 8d 51 01          	lea    rdx,[rcx+0x1]
   c:	31 c0                	xor    eax,eax
   e:	80 f9 03             	cmp    cl,0x3
  11:	48 0f 42 c2          	cmovb  rax,rdx
  15:	c3                   	ret    
0000000000000000 <enum_fair::disco>:
   0:	8d 47 ff             	lea    eax,[rdi-0x1]
   3:	31 c9                	xor    ecx,ecx
   5:	40 80 ff 02          	cmp    dil,0x2
   9:	0f b6 c0             	movzx  eax,al
   c:	0f 42 c1             	cmovb  eax,ecx
   f:	0f b6 c0             	movzx  eax,al
  12:	c3                   	ret    

I'm not even sure whether that's an improvement or a regression. On one hand, the critical path to the cmp seems shorter: add-(movzx)-lea-cmp-cmov vs lea-cmp-(movzx)-cmov. On the other hand, I've read that a movzx within the same register is a bad idea: InstLatx64/InstLatx64#4

@rust-log-analyzer

This comment has been minimized.

@davidtwco
Member

r? @nagisa

@rust-highfive rust-highfive assigned nagisa and unassigned davidtwco Oct 11, 2022
@krdln

This comment was marked as outdated.

@krdln
Contributor Author

krdln commented Oct 13, 2022

@mikebenfield You've inspired me to file an issue for the suboptimal codegen when attempting to do arithmetic directly on the u8 tag :) llvm/llvm-project#58338.

@mikebenfield
Contributor

mikebenfield commented Oct 13, 2022

That's great, there are several ways LLVM's generated code seems to be suboptimal around this. In fact, it seems to me it really could be doing the optimizations we're making on its own (just with some range data).

That said, I'm not sure about the particular issue you filed. There's a surprising lack of clarity about this, but I think the x86_64 System V ABI does not actually require integer arguments in registers to be zero-extended, and AFAICT GCC keeps the movzx in too.

@krdln
Contributor Author

krdln commented Oct 13, 2022

@mikebenfield

There's a surprising lack of clarity about this, but I think the x86_64 Sys V ABI does actually not require integer arguments in registers to be zero extended, and AFAICT gcc keeps in the movzx too.

The cast example turned out (as @nikic explained to me) to be just a bad attempt at minimizing the repro. And it looks like both clang and gcc do emit just mov eax, edi, so I guess zero-extension is in the ABI? Anyway, the other example still seems valid. Edit: oh wow, this behaviour is C-specific: https://rust.godbolt.org/z/GrWT3n5qj

krdln added 3 commits October 14, 2022 02:13
Most often, all niche-encoded tags are at the end of the tag scalar's range. In these cases we can compare `<` on the tag directly to determine `is_untagged`.

Also, avoiding the cast and performing arithmetic directly on the tag's type seems to help optimizations.
This should help with get-discriminant codegen in cases where untagged_variant is not 0.
This helps avoid weird movzx instructions.
@krdln krdln force-pushed the niche-discr-speed-3 branch from d1daa62 to f658bd6 Compare October 14, 2022 07:57
@krdln
Contributor Author

krdln commented Oct 14, 2022

@mikebenfield Turns out I was a bit overeager with the "tagged variant never has the same value as untagged" assume after all. Without gating, it was incorrectly assuming that e.g. enum E { A(bool), B, C } is never A(true). Anyway, this assume doesn't seem to have any effect on rustc performance itself, although it improves codegen a lot (down to a single cmp + jmp) in two-arm scenarios where the untagged variant is not the first one.

@krdln
Contributor Author

krdln commented Oct 18, 2022

@nagisa I believe @mikebenfield and I have converged on a very similar approach in #102872, but could we do a perf run on my PR as well? If it shows the same results, I'll close my PR as redundant.

@nagisa
Member

nagisa commented Oct 21, 2022

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 21, 2022
@bors
Collaborator

bors commented Oct 21, 2022

⌛ Trying commit f658bd6 with merge e395cd1dcb5ab05e102b56b735bc1a84453d9b11...

@bors
Collaborator

bors commented Oct 21, 2022

☀️ Try build successful - checks-actions
Build commit: e395cd1dcb5ab05e102b56b735bc1a84453d9b11

@rust-timer
Collaborator

Queued e395cd1dcb5ab05e102b56b735bc1a84453d9b11 with parent 0940040, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (e395cd1dcb5ab05e102b56b735bc1a84453d9b11): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean¹    range             count²
Regressions ❌ (primary)      0.3%    [0.2%, 0.5%]      6
Regressions ❌ (secondary)    2.2%    [2.2%, 2.2%]      1
Improvements ✅ (primary)     -       -                 0
Improvements ✅ (secondary)   -0.8%   [-1.6%, -0.3%]    5
All ❌✅ (primary)            0.3%    [0.2%, 0.5%]      6

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean¹    range             count²
Regressions ❌ (primary)      -       -                 0
Regressions ❌ (secondary)    -       -                 0
Improvements ✅ (primary)     -1.6%   [-1.6%, -1.6%]    1
Improvements ✅ (secondary)   -2.1%   [-2.2%, -2.0%]    2
All ❌✅ (primary)            -1.6%   [-1.6%, -1.6%]    1

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Oct 21, 2022
@krdln
Contributor Author

krdln commented Oct 25, 2022

When toggling "show non-relevant results", it looks like there are small but consistent improvements: instruction counts are mostly green (especially when looking only at full-check). Even better results for cycles. Bootstrap is also green-ish (not sure whether that's significant, though). I looked at some profiles, and mir_borrowck seems to be a major contributor to the speedup.

The regressions I see are only in incr-patched/incr-unchanged scenarios. I tried to look at the profiles there, but I don't understand the cause. Anyway, perhaps these regressions are acceptable?

I also found a small codegen regression (enum Abc { B, A(bool), C } with matches!(x, Abc::B)): it looks like the assume was pessimizing for some reason. I pushed a fix; changing the location of the assume seemed to help. Now I think all the is_x()-like functions should compile down to a single cmp + set. See this gist. Anyway, based on local testing I don't think this change affects the perf results significantly, but perhaps we can have a rerun, @nagisa?

@mikebenfield I've seen you have some regressions on your branch, sadly. Do you mind if I "steal" your tests and undraft my PR, or do you plan to debug the regressions on your PR?

@mikebenfield
Contributor

I do indeed plan to debug the regressions on my PR, although it's possible they won't be resolved until the LLVM issue I filed is fixed.

Feel free to do as you wish and steal tests and whatever you want from me. I do find the cases I wrote a little cleaner and more general - AFAICT you only handle 1 out of the 4 cases I handle, but it is the common case. I'd ideally like to get all those cases in there, but whatever works is fine with me.

@nagisa
Member

nagisa commented Oct 28, 2022

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 28, 2022
@bors
Collaborator

bors commented Oct 28, 2022

⌛ Trying commit b33cb65 with merge 9d2875fb68102e33b0ee7a4ebb82c48344b258d5...

@bors
Collaborator

bors commented Oct 28, 2022

☀️ Try build successful - checks-actions
Build commit: 9d2875fb68102e33b0ee7a4ebb82c48344b258d5


@rust-timer
Collaborator

Queued 9d2875fb68102e33b0ee7a4ebb82c48344b258d5 with parent 898f463, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (9d2875fb68102e33b0ee7a4ebb82c48344b258d5): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean¹    range             count²
Regressions ❌ (primary)      0.3%    [0.2%, 0.3%]      5
Regressions ❌ (secondary)    2.1%    [2.1%, 2.1%]      1
Improvements ✅ (primary)     -       -                 0
Improvements ✅ (secondary)   -0.5%   [-1.7%, -0.2%]    17
All ❌✅ (primary)            0.3%    [0.2%, 0.3%]      5

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean¹    range             count²
Regressions ❌ (primary)      -       -                 0
Regressions ❌ (secondary)    -       -                 0
Improvements ✅ (primary)     -4.7%   [-4.7%, -4.7%]    1
Improvements ✅ (secondary)   -5.5%   [-5.5%, -5.5%]    1
All ❌✅ (primary)            -4.7%   [-4.7%, -4.7%]    1

Cycles

This benchmark run did not return any relevant results for this metric.

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 28, 2022
@krdln
Contributor Author

krdln commented Nov 15, 2022

#102872 was merged with the same idea 🎉. I think the codegen could be optimized even further, though; I'll try to reopen.

@krdln krdln closed this Nov 15, 2022
Labels
perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

9 participants