Skip to content

[CI] Disable flang from pre-commit tests when Flang files are not touched #93485

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

rengolin
Copy link
Member

@rengolin rengolin commented May 27, 2024

Flang pre-commit tests are triggered when any file in LLVM or MLIR are touched. However, Flang is not yet achieved a stable CI, which have caused instability to many other completely unrelated PRs. This PR disables Flang tests on non-Flang changes on pre-commit, but does nothing to post-commit tests, which continue to run as is.

Once Flang becomes stable like Clang and MLIR we can re-enable it on pre-commit tests, following our current staged policy on CI, testing and reporting failures.

Flang tests are unstable on Windows pre-commit CI:
https://discourse.llvm.org/t/flang-tests-are-extremely-slow-on-windows/78591/40

Potential timeout issue:
https://discourse.llvm.org/t/flang-tests-are-extremely-slow-on-windows/78591/42

Since there doesn't seem to be interest in the Flang community to fix
this, we disable. This should only be re-enabled once the stability
issues are proven to be fixed.
Copy link
Contributor

@Endilll Endilll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
CC @lnihlen

Copy link
Contributor

@kiranchandramohan kiranchandramohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes since the original issue reported was fixed.

@Endilll
Copy link
Contributor

Endilll commented May 27, 2024

Flang tests are unstable on Windows pre-commit CI: https://discourse.llvm.org/t/flang-tests-are-extremely-slow-on-windows/78591/40

I think is should be removed from the description to avoid confusion. The current issue is OOM while building Flang on Windows, which is not the issue the Discourse thread was created for (worth a new thread?)

@joker-eph
Copy link
Collaborator

joker-eph commented May 27, 2024

The problem IMO is more the disruption caused to other projects by flang failures.
So I wouldn’t just disable flang tests entirely, instead I would exclude flang from being built/tested on windows when no files from the flang directory was touched.
(I already argued this point to the clang folks who IMO incorrectly handled it)

@AaronBallman AaronBallman requested review from tstellar and lnihlen May 28, 2024 12:51
@AaronBallman
Copy link
Collaborator

I do not think Flang is in a state where it makes sense to precommit test Flang because of changes to Clang or LLVM for any platform. At least within Clang, I'm unaware of precommit CI ever testing a change to Clang which catches a legitimate failure in Flang, but there have been plenty of instances of Flang being in an invalid state and causing precommit CI to "fail" for Clang reviews as a result.

That said, I think it makes a lot of sense for Flang to have precommit CI testing when Flang source changes.

@tstellar
Copy link
Collaborator

I agree with @joker-eph we should still build and test it when flang/ files are touched.

@rengolin
Copy link
Member Author

I agree with @joker-eph we should still build and test it when flang/ files are touched.

Agree. What's the right incantation to make that work in our CI?

@kiranchandramohan
Copy link
Contributor

I agree with @joker-eph we should still build and test it when flang/ files are touched.

Agree. What's the right incantation to make that work in our CI?

It will be great if you can write a new post in discourse before going ahead. The original issue regarding the previous post was fixed, the last patch in the sequence was #93295. The other issue discussed in the post was OOM. Do you still see the problems after #93329?

Windows is important for Flang and Flang is important for the Windows community as highlighted in https://discourse.llvm.org/t/the-you-in-eucatastrophe-the-story-of-building-scipy-with-flang/74768.

@rengolin
Copy link
Member Author

It will be great if you can write a new post in discourse before going ahead. The original issue regarding the previous post was fixed, the last patch in the sequence was #93295. The other issue discussed in the post was OOM. Do you still see the problems after #93329?

This has nothing to do with the importance of Flang or Windows, only that we should not run tests when we don't touch those files. No need for an RFC here, this is just basic good CI practices.

The fact that those tests are unstable for over a year, for different issues, on Linux and Windows is just the trigger to do this in the first place, not the justification to do so.

@rengolin
Copy link
Member Author

@tstellar if you have a quick way to get this done, feel free to create a new PR and close this one. Thanks!

@Endilll
Copy link
Contributor

Endilll commented May 29, 2024

I agree with @joker-eph we should still build and test it when flang/ files are touched.

Agree. What's the right incantation to make that work in our CI?

You'd need to remove flang from compute-projects-to-test.

@rengolin
Copy link
Member Author

I agree with @joker-eph we should still build and test it when flang/ files are touched.

Agree. What's the right incantation to make that work in our CI?

You'd need to remove flang from compute-projects-to-test.

Ack. I see it's on both llvm and mlir. Now, of course, flang depends on both, but it's also no officially in production, so random crashes are still expected until it's as stable and complete as clang.

I propose we disable these two dependencies for now, and let them be tested in post-commit checks. I'm assuming that, when flang itself changes in a PR, it will be tested regardless of these deps.

@Endilll
Copy link
Contributor

Endilll commented May 29, 2024

I'm assuming that, when flang itself changes in a PR, it will be tested regardless of these deps.

That's correct.

@AaronBallman
Copy link
Collaborator

I propose we disable these two dependencies for now, and let them be tested in post-commit checks. I'm assuming that, when flang itself changes in a PR, it will be tested regardless of these deps.

I think this is a happy medium (Flang still gets precommit CI test coverage when Flang changes, but relies on post-commit CI test coverage when Clang, MLIR, or LLVM changes), so I'm in favor.

@rengolin rengolin changed the title [CI] Disable flang from Windows pre-commit tests [CI] Disable flang from pre-commit tests when Flang files are not touched May 29, 2024
@rengolin
Copy link
Member Author

I think is should be removed from the description to avoid confusion. The current issue is OOM while building Flang on Windows, which is not the issue the Discourse thread was created for (worth a new thread?)

Reworded the title and description of the PR to avoid confusion. Should be clear now.

@joker-eph @tstellar

@kiranchandramohan
Copy link
Contributor

I have created a post in discourse. https://discourse.llvm.org/t/disable-flang-from-pre-commit-tests-when-flang-files-are-not-touched/79251

@clementval
Copy link
Contributor

So is this disabling both linux and windows pre-commit testing?

@joker-eph
Copy link
Collaborator

Now, of course, flang depends on both, but it's also no officially in production, so random crashes are still expected until it's as stable and complete as clang.

This is a bit of a weird take to me: what does "officially in production" mean? Did MLIR ever got "officially in production" in the context of the LLVM project? The only thing I'm aware of is the notion of "experimental", which applies to backends (Flang was kind of in-between at the beginning because it was C++17 before LLVM was, but it's all aligned now).
There are degrees of maturity of course, but I haven't seen issues with the Flang-on-Linux CI so far (some frontend files are annoyingly heavy to compile!), or only very occasionally (flaky tests can happen, but not more than in LLVM itself or any other project)

@joker-eph
Copy link
Collaborator

So is this disabling both linux and windows pre-commit testing?

Yes but only on PR that don't touch flang directly. That is a PR that would only touch other projects (llvm/mlir/clang) used to to trigger testing of Flang since they can potentially break it.

@clementval
Copy link
Contributor

clementval commented May 29, 2024

So is this disabling both linux and windows pre-commit testing?

Yes but only on PR that don't touch flang directly. That is a PR that would only touch other projects (llvm/mlir/clang) used to to trigger testing of Flang since they can potentially break it.

It would be nice to keep flang testing on linux at least when MLIR is changed since it relies a lot on it.

@rengolin
Copy link
Member Author

This is a bit of a weird take to me: what does "officially in production" mean? Did MLIR ever got "officially in production" in the context of the LLVM project? The only thing I'm aware of is the notion of "experimental", which applies to backends (Flang was kind of in-between at the beginning because it was C++17 before LLVM was, but it's all aligned now). There are degrees of maturity of course, but I haven't seen issues with the Flang-on-Linux CI so far (some frontend files are annoyingly heavy to compile!), or only very occasionally (flaky tests can happen, but not more than in LLVM itself or any other project)

Our main support policy is: https://llvm.org/docs/SupportPolicy.html I have reported at least three serious issues with Flang testing since last year, all of which took over a month to get fixed. This one is only the last example.

But, this PR is NOT about that at all. This is a tangent that we should not pursue. The core issue here is PRE-COMMIT versus POST-COMMIT testing.

I think we need to be serious about pre-commit CI. This issue was causing grief for a month to a lot of people who do not care about Flang or cannot fix the underlying issue. It will not make it less tested if it has post-commit testing, it will just have a slightly longer cycle to fix rarer problems.

@rengolin
Copy link
Member Author

Yes but only on PR that don't touch flang directly. That is a PR that would only touch other projects (llvm/mlir/clang) used to to trigger testing of Flang since they can potentially break it.

It would be nice to keep flang testing on linux at least when MLIR is changed since it relies a lot on it.

Isn't is tested in post-commit bots?

@joker-eph
Copy link
Collaborator

joker-eph commented May 29, 2024

Our main support policy is: https://llvm.org/docs/SupportPolicy.html I

Thanks, I don't see any mention here that makes me think that Flang isn't the core tier, do you read anything differently here?

This issue was causing grief for a month to a lot of people who do not care about Flang or cannot fix the underlying issue

By "this issue", do you mean the "OOM on Windows" issue?
If so I agree with you, and I agree that disabling flang as a dependency to fix this error is the right course of action.
But I'm still puzzled as of why we'd not just do this for Windows where the issue exists and leave the Linux flow as-is, mind elaborating on this?
(if there are flakiness issues on Linux that I'm not aware, please say so, of course I'd support disabling this on Linux if this was the case)

@clementval
Copy link
Contributor

Yes but only on PR that don't touch flang directly. That is a PR that would only touch other projects (llvm/mlir/clang) used to to trigger testing of Flang since they can potentially break it.

It would be nice to keep flang testing on linux at least when MLIR is changed since it relies a lot on it.

Isn't is tested in post-commit bots?

Yes, I just don't see the problem to have it in linux pre-commit since it was reliable so far until now. If it's a problem of build time or load then it's a different issue.

@rengolin
Copy link
Member Author

rengolin commented May 29, 2024

Isn't is tested in post-commit bots?

Yes, I just don't see the problem to have it in linux pre-commit since it was reliable so far until now. If it's a problem of build time or load then it's a different issue.

If there's a way to run on Linux but not Windows, then I don't mind. I just want to make sure we only add to pre-commit what we know is stable, but letting it run post-commit for a few months without crashes.

@kiranchandramohan
Copy link
Contributor

This is a bit of a weird take to me: what does "officially in production" mean? Did MLIR ever got "officially in production" in the context of the LLVM project? The only thing I'm aware of is the notion of "experimental", which applies to backends (Flang was kind of in-between at the beginning because it was C++17 before LLVM was, but it's all aligned now). There are degrees of maturity of course, but I haven't seen issues with the Flang-on-Linux CI so far (some frontend files are annoyingly heavy to compile!), or only very occasionally (flaky tests can happen, but not more than in LLVM itself or any other project)

Our main support policy is: https://llvm.org/docs/SupportPolicy.html I have reported at least three serious issues with Flang testing since last year, all of which took over a month to get fixed. This one is only the last example.

This issue did take some time to fix because it was not reproducible on any of the Windows machines we tried. One of the past issues that was brought up (#77086) was a bot issue as pointed out by @joker-eph in #77086 (comment). I don't remember what the third issue was.

@rengolin
Copy link
Member Author

This issue did take some time to fix because it was not reproducible on any of the Windows machines we tried. One of the past issues that was brought up (#77086) was a bot issue as pointed out by @joker-eph in #77086 (comment). I don't remember what the third issue was.

I can't remember either, tbh. This was last year, and IIRC on Linux.

Right now I don't mind if we just wait until it crashes again, but we need to be more responsive. If we get instability and it's hard to reproduce, then the tests must be taken down from pre-commit CI until it can be reproduced, fixed and shown stable.

This is the kind of cost that is non-trivial and it makes no sense for everyone to have to pay when only a few people can actually do anything about it.

@joker-eph
Copy link
Collaborator

Isn't is tested in post-commit bots?

Yes, I just don't see the problem to have it in linux pre-commit since it was reliable so far until now. If it's a problem of build time or load then it's a different issue.

If there's a way to run on Linux but not Windows, then I don't mind.

OK! Then what about: #93729 ?

@clementval
Copy link
Contributor

Isn't is tested in post-commit bots?

Yes, I just don't see the problem to have it in linux pre-commit since it was reliable so far until now. If it's a problem of build time or load then it's a different issue.

If there's a way to run on Linux but not Windows, then I don't mind. I just want to make sure we only add to pre-commit what we know is stable, but letting it run post-commit for a few months without crashes.

Yeah that makes sense to enable pre-commit that are stable only. If we can keep linux that would be great. Even with pre-commit enabled it happened a couple of times that some patches were committed even with obvious failure in the pre commit CI.

@rengolin
Copy link
Member Author

Even with pre-commit enabled it happened a couple of times that some patches were committed even with obvious failure in the pre commit CI.

This is what I'm trying to avoid. We can't really move to better CI if we keep asking people to "just ignore the failures if you don't understand".

@clementval
Copy link
Contributor

Even with pre-commit enabled it happened a couple of times that some patches were committed even with obvious failure in the pre commit CI.

This is what I'm trying to avoid. We can't really move to better CI if we keep asking people to "just ignore the failures if you don't understand".

In these cases it was understandable failure that were breaking flang.

@rengolin
Copy link
Member Author

Fixed in #93729

@rengolin rengolin closed this May 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants