Simplify codegen for niche-encoded enums in simple cases #102901


Closed
wants to merge 6 commits into from

Conversation

krdln
Contributor

@krdln krdln commented Oct 10, 2022

The vast majority of niche-encoded enums either have a single niche-encoded variant or have all their niche-encoded tags greater than the whole untagged range, e.g. with enum E { A(bool), B, C } we have 0 | 1 → A, 2 → B, 3 → C. Currently get_discriminant accounts for various edge cases, like wrapping ranges, but for such a simple enum we should be able to determine whether an untagged value is stored with a single < comparison.

In pseudocode, before:

let relative_discr = tag.wrapping_sub(X);
if relative_discr < Y { Z } else { relative_discr + A }

after:

if tag < X { Z } else { tag - B }

Also:

  • Noticed that doing all the computation before the cast to cast_to yields better codegen (perhaps LLVM has problems with range propagation when casting?).
  • Adding an assume that a niche-encoded discriminant is never the same as the untagged discriminant helps a lot in cases where the untagged variant is not the first one. E.g. in enum E { A, B(bool), C }, LLVM would otherwise have to consider 0 | 1 | 3 as valid values for B if not for this assumption.
  • I remember trying a slight variant of the tag < X condition (<=, or just inverting the boolean? not sure now), but it resulted in LLVM getting too smart and recognizing if tag > X { 0 } else { tag - X } as a saturating-sub intrinsic too early in the pipeline, which produced nice codegen but blocked further optimization.

Despite some nice wins in generated assembly (see the "enum-fair-opt" file in this gist), I've seen only tiny (~0.03%) improvements in instruction counts on check scenarios in local perf testing. The wins were quite consistent, though, so I'd be happy to see a perf run; perhaps there's an improvement in cycles? (I couldn't get reliable cycle measurements on my local setup.)

Pushing it as draft, because TODO

May help with #101872

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 10, 2022
@rust-highfive
Contributor

r? @davidtwco

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 10, 2022
@krdln krdln marked this pull request as draft October 10, 2022 21:21
@krdln
Contributor Author

krdln commented Oct 10, 2022

Some before & after assembly:
pub enum Foo { Bar(bool), Baz, Quux, Quux2 }

pub fn baz(foo: Foo) -> bool {
    matches!(foo, Foo::Baz)
}

pub enum E { A, Other(char), B }

pub fn x(x: E) -> char {
    match x {
        E::Other(c) => c,
        _ => 'x',
    }
}

wins: before & after

0000000000000000 <enum_fair::bar>:
   0:	40 80 c7 fb          	add    dil,0xfb
   4:	40 80 ff fd          	cmp    dil,0xfd
   8:	0f 92 c0             	setb   al
   b:	c3                   	ret    

0000000000000000 <abx_bad::x>:
   0:	8d 8f 00 00 ef ff    	lea    ecx,[rdi-0x110000]
   6:	83 f9 01             	cmp    ecx,0x1
   9:	b8 78 00 00 00       	mov    eax,0x78
   e:	0f 44 c7             	cmove  eax,edi
  11:	83 f9 03             	cmp    ecx,0x3
  14:	0f 43 c7             	cmovae eax,edi
  17:	c3                   	ret    
0000000000000000 <enum_fair::bar>:
   0:	40 80 ff 02          	cmp    dil,0x2
   4:	0f 92 c0             	setb   al
   7:	c3                   	ret    

0000000000000000 <abx_bad::x>:
   0:	81 ff 00 00 11 00    	cmp    edi,0x110000
   6:	b8 78 00 00 00       	mov    eax,0x78
   b:	0f 42 c7             	cmovb  eax,edi
   e:	c3                   	ret    

And here's the case of the ugly movzxes (note: x86_64 doesn't have an 8-bit cmov, only 16-, 32- and 64-bit):

pub fn disco(foo: Foo) -> std::mem::Discriminant<Foo> {
    std::mem::discriminant(&foo)
}

before & after:

0000000000000000 <enum_fair::disco>:
   0:	40 80 c7 fe          	add    dil,0xfe
   4:	40 0f b6 cf          	movzx  ecx,dil
   8:	48 8d 51 01          	lea    rdx,[rcx+0x1]
   c:	31 c0                	xor    eax,eax
   e:	80 f9 03             	cmp    cl,0x3
  11:	48 0f 42 c2          	cmovb  rax,rdx
  15:	c3                   	ret    
0000000000000000 <enum_fair::disco>:
   0:	8d 47 ff             	lea    eax,[rdi-0x1]
   3:	31 c9                	xor    ecx,ecx
   5:	40 80 ff 02          	cmp    dil,0x2
   9:	0f b6 c0             	movzx  eax,al
   c:	0f 42 c1             	cmovb  eax,ecx
   f:	0f b6 c0             	movzx  eax,al
  12:	c3                   	ret    

I'm not even sure whether that's an improvement or a regression. On one hand, the critical path to the cmp seems shorter: add-(movzx)-lea-cmp-cmov vs lea-cmp-(movzx)-cmov. On the other hand, I've read that a movzx within the same register is a bad idea: InstLatx64/InstLatx64#4

@rust-log-analyzer

This comment has been minimized.

@davidtwco
Member

r? @nagisa

@rust-highfive rust-highfive assigned nagisa and unassigned davidtwco Oct 11, 2022
@krdln

This comment was marked as outdated.

@krdln
Contributor Author

krdln commented Oct 13, 2022

@mikebenfield You've inspired me to file an issue for the suboptimal codegen when attempting to do arithmetic directly on the u8 tag :) llvm/llvm-project#58338.

@mikebenfield
Contributor

mikebenfield commented Oct 13, 2022

That's great, there are several ways LLVM's generated code seems to be suboptimal around this. In fact, it seems to me it really could be doing the optimizations we're making on its own (just with some range data).

That said, I'm not sure about the particular issue you filed. There's a surprising lack of clarity about this, but I think the x86_64 System V ABI does not actually require integer arguments in registers to be zero-extended, and AFAICT GCC keeps the movzx in too.

@krdln
Contributor Author

krdln commented Oct 13, 2022

@mikebenfield

There's a surprising lack of clarity about this, but I think the x86_64 Sys V ABI does actually not require integer arguments in registers to be zero extended, and AFAICT gcc keeps in the movzx too.

The cast example turned out (as @nikic explained to me) to be just a bad attempt at minimizing the repro. And it looks like both clang and gcc do emit just mov eax, edi, so I guess zero-extension is in the ABI? Anyway, the other example still seems valid. Edit: oh wow, this behaviour is C-specific: https://rust.godbolt.org/z/GrWT3n5qj

krdln added 3 commits October 14, 2022 02:13
Most often, all niche-encoded tags are at the end of the tag scalar's range. In these cases we can compare `<` on the tag directly to determine `is_untagged`.

Also, avoiding the cast and performing arithmetic directly on the tag's type seems to help optimizations.
This should help with get-discriminant codegen in cases where untagged_variant is not 0.
This helps avoid weird movzx instructions.
@krdln krdln force-pushed the niche-discr-speed-3 branch from d1daa62 to f658bd6 Compare October 14, 2022 07:57
@krdln
Contributor Author

krdln commented Oct 14, 2022

@mikebenfield Turns out I was a bit overeager with the "tagged variant never has the same value as untagged" assume after all. Without gating, it was incorrectly assuming that e.g. enum E { A(bool), B, C } is never A(true). Anyway, this assume doesn't seem to have any effect on rustc performance itself, although it improves codegen a lot (down to a single cmp + jmp) in two-arm scenarios where the untagged variant is not the first one.

@krdln
Contributor Author

krdln commented Oct 18, 2022

@nagisa I believe @mikebenfield and I have converged on a very similar approach in #102872, but could we do a perf run on my PR as well? If it shows the same results, I'll close my PR as redundant.

@nagisa
Member

nagisa commented Oct 21, 2022

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 21, 2022
@bors
Collaborator

bors commented Oct 21, 2022

⌛ Trying commit f658bd6 with merge e395cd1dcb5ab05e102b56b735bc1a84453d9b11...

@bors
Collaborator

bors commented Oct 21, 2022

☀️ Try build successful - checks-actions
Build commit: e395cd1dcb5ab05e102b56b735bc1a84453d9b11

@rust-timer
Collaborator

Queued e395cd1dcb5ab05e102b56b735bc1a84453d9b11 with parent 0940040, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (e395cd1dcb5ab05e102b56b735bc1a84453d9b11): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean¹    range             count²
Regressions ❌ (primary)      0.3%    [0.2%, 0.5%]      6
Regressions ❌ (secondary)    2.2%    [2.2%, 2.2%]      1
Improvements ✅ (primary)     -       -                 0
Improvements ✅ (secondary)   -0.8%   [-1.6%, -0.3%]    5
All ❌✅ (primary)            0.3%    [0.2%, 0.5%]      6

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean¹    range             count²
Regressions ❌ (primary)      -       -                 0
Regressions ❌ (secondary)    -       -                 0
Improvements ✅ (primary)     -1.6%   [-1.6%, -1.6%]    1
Improvements ✅ (secondary)   -2.1%   [-2.2%, -2.0%]    2
All ❌✅ (primary)            -1.6%   [-1.6%, -1.6%]    1

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Oct 21, 2022
@krdln
Contributor Author

krdln commented Oct 25, 2022

When toggling "show non-relevant results", it looks like there are small but consistent improvements: instruction counts are mostly green (especially when looking only at full-check). Even better results for cycles. Bootstrap is also green-ish (not sure whether that's significant, though). I looked at some profiles, and mir_borrowck seems to be a major contributor to the speedup.

The regressions I see are only in incr-patched/incr-unchanged scenarios. I tried to look at the profiles there, but I don't understand the cause. Anyway, perhaps these regressions are acceptable?

I also found a small codegen regression (enum Abc { B, A(bool), C } with matches!(x, Abc::B)): it looks like the assume was pessimizing for some reason. I pushed a fix; changing the location of the assume seemed to help. Now I think all the is_x()-like functions should compile down to a single cmp + set. See this gist. Anyway, based on local testing I don't think this change affects the perf results significantly, but perhaps we can have a rerun, @nagisa?

@mikebenfield I've seen you have some regressions on your branch, sadly. Do you mind if I "steal" your tests and undraft my PR, or do you plan to debug the regressions on your PR?

@mikebenfield
Contributor

I do indeed plan to debug the regressions on my PR, although it's possible they won't be resolved until the LLVM issue I filed is fixed.

Feel free to do as you wish and steal tests and whatever you want from me. I do find the cases I wrote a little cleaner and more general - AFAICT you only handle 1 out of the 4 cases I handle, but it is the common case. I'd ideally like to get all those cases in there, but whatever works is fine with me.

@nagisa
Member

nagisa commented Oct 28, 2022

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 28, 2022
@bors
Collaborator

bors commented Oct 28, 2022

⌛ Trying commit b33cb65 with merge 9d2875fb68102e33b0ee7a4ebb82c48344b258d5...

@bors
Collaborator

bors commented Oct 28, 2022

☀️ Try build successful - checks-actions
Build commit: 9d2875fb68102e33b0ee7a4ebb82c48344b258d5


@rust-timer
Collaborator

Queued 9d2875fb68102e33b0ee7a4ebb82c48344b258d5 with parent 898f463, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (9d2875fb68102e33b0ee7a4ebb82c48344b258d5): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean¹    range             count²
Regressions ❌ (primary)      0.3%    [0.2%, 0.3%]      5
Regressions ❌ (secondary)    2.1%    [2.1%, 2.1%]      1
Improvements ✅ (primary)     -       -                 0
Improvements ✅ (secondary)   -0.5%   [-1.7%, -0.2%]    17
All ❌✅ (primary)            0.3%    [0.2%, 0.3%]      5

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean¹    range             count²
Regressions ❌ (primary)      -       -                 0
Regressions ❌ (secondary)    -       -                 0
Improvements ✅ (primary)     -4.7%   [-4.7%, -4.7%]    1
Improvements ✅ (secondary)   -5.5%   [-5.5%, -5.5%]    1
All ❌✅ (primary)            -4.7%   [-4.7%, -4.7%]    1

Cycles

This benchmark run did not return any relevant results for this metric.

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 28, 2022
@krdln
Contributor Author

krdln commented Nov 15, 2022

#102872 was merged with the same idea 🎉. I think the codegen could be optimized even further, though; I'll try to reopen.

@krdln krdln closed this Nov 15, 2022
Labels
perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

9 participants