-
Notifications
You must be signed in to change notification settings - Fork 941
Treat YARN/Kubernetes application NOT_FOUND as failed to prevent data quality issue #7033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
73b420f
to
676f6e2
Compare
17a9fee
to
182373c
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7033 +/- ##
=======================================
Coverage 0.00% 0.00%
=======================================
Files 695 696 +1
Lines 42833 42997 +164
Branches 5833 5845 +12
=======================================
- Misses 42833 42997 +164 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
cc @pan3793 |
kyuubi-server/src/main/scala/org/apache/kyuubi/operation/BatchJobSubmission.scala
Outdated
Show resolved
Hide resolved
kyuubi-server/src/main/scala/org/apache/kyuubi/operation/BatchJobSubmission.scala
Outdated
Show resolved
Hide resolved
537582b
to
c8cb419
Compare
cc @pan3793 |
c8cb419
to
b9d6ac2
Compare
kyuubi-server/src/main/scala/org/apache/kyuubi/engine/KubernetesApplicationOperation.scala
Outdated
Show resolved
Hide resolved
kyuubi-server/src/main/scala/org/apache/kyuubi/engine/ApplicationOperation.scala
Outdated
Show resolved
Hide resolved
@@ -212,6 +215,20 @@ class BatchJobSubmission( | |||
metadata match { | |||
case Some(metadata) if metadata.peerInstanceClosed => | |||
setState(OperationState.CANCELED) | |||
case Some(metadata) | |||
// in case it has been updated by peer kyuubi instance, see KYUUBI-6278 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: KYUUBI-6278
=> KYUUBI #6278
does this fix another issue? what happens without this part of the code before/after this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that,
- if the batch state already updated by peer with terminated state
- without this change, it would try to get the current application state and might NOT_FOUND, can then cause the application failure(NOT FOUND As failure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we shall respect the persisted application state instead of get the current application state(might not found).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR:
- it might get NOT_FOUND
- then batch finished
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this PR:
- respect the terminated application state set by peer, and then FINISH/FAIL the batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI:
kyuubi/kyuubi-server/src/main/scala/org/apache/kyuubi/operation/BatchJobSubmission.scala
Lines 323 to 327 in 70c03ef
private def monitorBatchJob(appId: String): Unit = { | |
info(s"Monitoring submitted $batchType batch[$batchId] job: $appId") | |
if (_applicationInfo.isEmpty) { | |
_applicationInfo = currentApplicationInfo() | |
} |
For monitorBatchJob, it would get the currentApplicationInfo first, and does not respect the metadata store application state.
@pan3793 addressed all the comments |
04b0300
to
68d35b7
Compare
68d35b7
to
f03d612
Compare
kyuubi-server/src/main/scala/org/apache/kyuubi/engine/ApplicationOperation.scala
Outdated
Show resolved
Hide resolved
…ionOperation.scala
Thanks, merged to master |
Why are the changes needed?
Currently, NOT_FOUND application stated is treated as a terminated but not failed state.
It might cause some data quality issue if downstream application depends on the batch state for data processing.
So, I think we should treat NOT_FOUND as a failed state instead.
Currently, we support 3 types of application manager.
YarnApplicationOperation and KubernetesApplicationOperation are widely used in production use case.
And in multiple kyuubi instance mode, the NOT_FOUND case should rarely happen.
kyuubi/kyuubi-server/src/main/scala/org/apache/kyuubi/server/api/v1/BatchesResource.scala
Lines 369 to 385 in 7e199d6
[KYUUBI #7028] Persist the kubernetes application terminate state into metastore for app info store fallback #7029
So, I think we should treat NOT_FOUND as a failed state in production use case.
It is better to fail some corner cases than to mistakenly set unsuccessful batches to the finished state.
How was this patch tested?
GA.
Was this patch authored or co-authored using generative AI tooling?
No.