Skip to content

[NVPTX] Disable incorrect peephole optimizations #79920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

justinfargnoli
Copy link
Contributor

@justinfargnoli justinfargnoli commented Jan 30, 2024

The removed patterns are not correct. Doing the comparison, without first ensuring that the value only uses the lower 8 bits, can lead to incorrect results.

For example, when %a := Ox100 and %b := Ox1,
setugt (i16 (and (trunc Int32Regs:$a), 255)), (i16 (and (trunc Int32Regs:$b), 255))) produces a false predicate. Whereas SETP_u32rr Int32Regs:$a, Int32Regs:$b, CmpHI will produce a true predicate.

@justinfargnoli justinfargnoli self-assigned this Jan 30, 2024
@justinfargnoli justinfargnoli marked this pull request as ready for review January 30, 2024 00:40
@justinfargnoli
Copy link
Contributor Author

Link to original comment: #67866 (comment)

@Artem-B
Copy link
Member

Artem-B commented Jan 30, 2024

def: Pat<(setule (i16 (and (trunc Int32Regs:$a), 255)), (i16 (and (trunc Int32Regs:$b), 255))),
         (SETP_u32rr Int32Regs:$a, Int32Regs:$b, CmpLS)>;

I agree that the pattern is wrong, though the problem here is that we pass $a and $b directly to SETP_u32rr.
It should've been (SETP_u32rr BFE(Int32Regs:$a), BFE(Int32Regs:$b), CmpLS) as in the test cases changed by the patch.

Also, the example in the description seems to be invalid:

For example, when %a := Ox100 and %b := Ox1,
setugt (i16 (and (trunc Int32Regs:$a), 255)), (i16 (and (trunc Int32Regs:$b), 255))) produces a false predicate. Whereas SETP_u32rr Int32Regs:$a, Int32Regs:$b, CmpHI will produce a true predicate.

Translated into a more readable pseudo-code:

a = 0x100
b = 0x1
pred =  ((uint16_t)(a) & 0xff) > ((uint16_t)(b) & 0xff);

AFAICT, pred should be false here. It's not the matching part of the pattern that we need to fix, but rather the replacement.

@justinfargnoli
Copy link
Contributor Author

It should've been (SETP_u32rr BFE(Int32Regs:$a), BFE(Int32Regs:$b), CmpLS) as in the test cases changed by the patch.

That makes sense!

Also, the example in the description seems to be invalid:

For example, when %a := Ox100 and %b := Ox1, setugt (i16 (and (trunc Int32Regs:$a), 255)), (i16 (and (trunc Int32Regs:$b), 255))) produces a false predicate. Whereas SETP_u32rr Int32Regs:$a, Int32Regs:$b, CmpHI will produce a true predicate.
Translated into a more readable pseudo-code:

a = 0x100
b = 0x1
pred =  ((uint16_t)(a) & 0xff) > ((uint16_t)(b) & 0xff);

AFAICT, pred should be false here. It's not the matching part of the pattern that we need to fix, but rather the replacement.

It seems like we're on the same page. pred should be false. But, the replacement will change the code to behave like the below pseudo code. That pseudo code will set pred to true, producing an incorrect result.

a = 0x100
b = 0x1
pred =  a > b;

Assuming there are no other review comments, can I submit this PR to get rid of the incorrect transformation?

@justinfargnoli
Copy link
Contributor Author

Pinging @Artem-B @ThomasRaoux for review

@Artem-B
Copy link
Member

Artem-B commented Feb 8, 2024

Would you be willing to fix it, instead of removing it?

@justinfargnoli
Copy link
Contributor Author

TLDR: I won't have the bandwidth to fix this in the foreseeable future.

I implemented the fix that you suggested (and I agree that it's likely the right way to go). However, it wasn't trivial to fix the unit tests. That gave me pause about submitting a fix that I don't fully understand. e.g. I haven't run tests to assess the performance impact of the fix and I don't understand the original rationale for the change. So, instead of doing a deep dive on the fix, I'd like to remove the erroneous code for the time being and revisit it when necessary.

@Artem-B
Copy link
Member

Artem-B commented Feb 8, 2024

Fair enough. Thank you for bringing the issue to my attention and giving it a try. Let me try fixing it forward, and if it does turn out too hairy, I'll merge this change.

@Artem-B
Copy link
Member

Artem-B commented Feb 9, 2024

This should fix the issue: https://github.com/llvm/llvm-project/compare/main...Artem-B:llvm-project:bfe-fix?expand=1

While I was tinkering with this, there may be some opportunities for further optimizations for v4i8 types.
We could potentially use PRMT instruction to split it into v2i16, do vectorized ops on v2i16, and then pack results back into v4i8 with PRMT.

One remaining quirk is that the patterns above create an interesting quirk -- we end up extracting the same fields twice, with different signedness. Once, to be used in comparison, and another if the same field is used for something else (and uses the BFE lowered for extractelt). I'll need to see what works best for ptxas:

  • bfe.s32 + bfs.u32
  • bfe.u32 + sign-extend i8->i16.

@justinfargnoli
Copy link
Contributor Author

Closing in favor of #81308

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants