
Treat YARN/Kubernetes application NOT_FOUND as failed to prevent data quality issue #7033


Closed
turboFei wants to merge 6 commits into master from revist_not_found

Conversation


@turboFei turboFei commented Apr 16, 2025

Why are the changes needed?

Currently, the NOT_FOUND application state is treated as a terminated but not failed state.

This might cause data quality issues if a downstream application depends on the batch state for data processing.

So, I think we should treat NOT_FOUND as a failed state instead.
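To make the intent concrete, here is a minimal sketch of the proposed classification, assuming a hypothetical enum and helper names rather than Kyuubi's actual ApplicationState API:

    object AppStateSketch extends Enumeration {
      val PENDING, RUNNING, FINISHED, FAILED, KILLED, NOT_FOUND, UNKNOWN = Value

      // NOT_FOUND has always been a terminal state.
      def isTerminated(state: Value): Boolean =
        Set(FINISHED, FAILED, KILLED, NOT_FOUND).contains(state)

      // Before: NOT_FOUND terminated the batch without failing it,
      // so the batch could end up FINISHED.
      def isFailedBefore(state: Value): Boolean =
        state == FAILED || state == KILLED

      // After: NOT_FOUND also counts as a failure, so downstream jobs
      // that key off the batch state never consume a bogus success.
      def isFailedAfter(state: Value): Boolean =
        state == FAILED || state == KILLED || state == NOT_FOUND
    }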

Currently, we support three types of ApplicationOperation:

  1. JpsApplicationOperation
  2. YarnApplicationOperation
  3. KubernetesApplicationOperation

YarnApplicationOperation and KubernetesApplicationOperation are widely used in production.

And in multi-Kyuubi-instance mode, the NOT_FOUND case should rarely happen, because:

  1. error(s"Error redirecting get batch[$batchId] to ${metadata.kyuubiInstance}", e)
    val batchAppStatus = sessionManager.applicationManager.getApplicationInfo(
    metadata.appMgrInfo,
    batchId,
    Some(userName),
    // prevent that the batch be marked as terminated if application state is NOT_FOUND
    Some(metadata.engineOpenTime).filter(_ > 0).orElse(Some(System.currentTimeMillis)))
    // if the batch app is terminated, update the metadata in db.
    if (BatchJobSubmission.applicationTerminated(batchAppStatus)) {
    val appInfo = batchAppStatus.get
    sessionManager.updateMetadata(Metadata(
    identifier = batchId,
    engineId = appInfo.id,
    engineName = appInfo.name,
    engineUrl = appInfo.url.orNull,
    engineState = appInfo.state.toString,
    engineError = appInfo.error))

  2. [KYUUBI #7028] Persist the kubernetes application terminate state into metastore for app info store fallback #7029

So, I think we should treat NOT_FOUND as a failed state in production use cases.
It is better to fail some corner cases than to mistakenly mark unsuccessful batches as finished.
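Note the Some(metadata.engineOpenTime).filter(_ > 0).orElse(Some(System.currentTimeMillis)) argument in the snippet under point 1: it gives the operation a submit-time hint so a just-submitted application is not instantly classified as NOT_FOUND. A rough sketch of such a grace-window guard, where classifyMissingApp, graceWindowMs, and the returned strings are all illustrative assumptions:

    // Hypothetical guard: only trust NOT_FOUND after the application has
    // had a grace window to register with YARN/Kubernetes.
    def classifyMissingApp(
        submitTime: Option[Long],
        graceWindowMs: Long,
        now: Long = System.currentTimeMillis()): String =
      submitTime match {
        case Some(t) if now - t < graceWindowMs =>
          "UNKNOWN" // too early to tell; keep polling instead of failing
        case _ =>
          "NOT_FOUND" // past the grace window: treat as a terminal failure
      }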

How was this patch tested?

GA.

Was this patch authored or co-authored using generative AI tooling?

No.

@turboFei turboFei self-assigned this Apr 16, 2025
@turboFei turboFei added this to the v1.11.0 milestone Apr 16, 2025
@turboFei turboFei changed the title Treat application state NOT_FOUND as failed Treat application state NOT_FOUND as failed to prevent data quality issue Apr 16, 2025
@turboFei turboFei marked this pull request as draft April 17, 2025 03:22
@turboFei turboFei marked this pull request as ready for review April 17, 2025 04:13
@turboFei turboFei changed the title Treat application state NOT_FOUND as failed to prevent data quality issue Treat YARN/Kubernetes application state NOT_FOUND as failed to prevent data quality issue Apr 17, 2025
@turboFei turboFei force-pushed the revist_not_found branch 2 times, most recently from 17a9fee to 182373c Compare April 17, 2025 04:54
@turboFei turboFei changed the title Treat YARN/Kubernetes application state NOT_FOUND as failed to prevent data quality issue Treat YARN/Kubernetes application NOT_FOUND as failed to prevent data quality issue Apr 17, 2025

codecov-commenter commented Apr 17, 2025

Codecov Report

Attention: Patch coverage is 0% with 33 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (29b6076) to head (ada4f88).
Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...g/apache/kyuubi/operation/BatchJobSubmission.scala 0.00% 20 Missing ⚠️
...kyuubi/engine/KubernetesApplicationOperation.scala 0.00% 5 Missing ⚠️
...rg/apache/kyuubi/engine/ApplicationOperation.scala 0.00% 3 Missing ⚠️
.../apache/kyuubi/server/api/v1/BatchesResource.scala 0.00% 2 Missing ⚠️
...apache/kyuubi/engine/JpsApplicationOperation.scala 0.00% 1 Missing ⚠️
...pache/kyuubi/engine/KyuubiApplicationManager.scala 0.00% 1 Missing ⚠️
...pache/kyuubi/engine/YarnApplicationOperation.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #7033    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files         695     696     +1     
  Lines       42833   42997   +164     
  Branches     5833    5845    +12     
=======================================
- Misses      42833   42997   +164     


@turboFei turboFei requested a review from pan3793 April 17, 2025 06:12
@turboFei
Member Author

cc @pan3793

@turboFei
Member Author

I am confident in this PR after #7029

cc @pan3793

@turboFei turboFei force-pushed the revist_not_found branch 2 times, most recently from 537582b to c8cb419 Compare April 23, 2025 23:16
@turboFei
Member Author

cc @pan3793

@@ -212,6 +215,20 @@ class BatchJobSubmission(
    metadata match {
      case Some(metadata) if metadata.peerInstanceClosed =>
        setState(OperationState.CANCELED)
      case Some(metadata)
          // in case it has been updated by peer kyuubi instance, see KYUUBI-6278
Member

nit: KYUUBI-6278 => KYUUBI #6278

does this fix another issue? what happens without this part of the code before/after this PR?

Member Author

@turboFei turboFei Apr 27, 2025

The issue is that:

  1. if the batch state has already been updated by the peer instance with a terminated state,
  2. without this change, it would still try to fetch the current application state, which might be NOT_FOUND, and could then wrongly fail the batch (since NOT_FOUND now counts as a failure).

Member Author

Here we shall respect the persisted application state instead of fetching the current application state (which might be NOT_FOUND).

Member Author

Before this PR:

  1. it might get NOT_FOUND,
  2. and then the batch would be marked finished.

Member Author

After this PR:

  1. it respects the terminated application state set by the peer, and then FINISHes/FAILs the batch accordingly.
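Putting the before/after together, a hedged sketch of the recovery branch described above; MetadataSketch, RecoverySketch, and the state strings are illustrative stand-ins for the real Metadata class and ApplicationState values:

    // Illustrative only; the real logic lives in BatchJobSubmission.
    case class MetadataSketch(peerInstanceClosed: Boolean, engineState: String)

    object RecoverySketch {
      private val terminal = Set("FINISHED", "FAILED", "KILLED", "NOT_FOUND")

      def recoverBatch(metadata: Option[MetadataSketch]): String = metadata match {
        case Some(m) if m.peerInstanceClosed =>
          "CANCELED" // unchanged: the peer instance already closed this batch
        case Some(m) if terminal.contains(m.engineState) =>
          // New: trust the terminal state the peer instance persisted
          // (see KYUUBI #6278) instead of re-querying YARN/Kubernetes,
          // where the application may legitimately be gone by now.
          if (m.engineState == "FINISHED") "FINISHED" else "ERROR"
        case _ =>
          "POLL" // fall back to querying the live application state
      }
    }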

Member Author

@turboFei turboFei Apr 27, 2025

FYI:

    private def monitorBatchJob(appId: String): Unit = {
      info(s"Monitoring submitted $batchType batch[$batchId] job: $appId")
      if (_applicationInfo.isEmpty) {
        _applicationInfo = currentApplicationInfo()
      }

For monitorBatchJob, it fetches currentApplicationInfo first and does not respect the application state in the metadata store.
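In other words, monitorBatchJob starts from a live lookup, while the recovery path discussed above consults the persisted state first. A compressed sketch contrasting the two orderings, using an illustrative trait rather than the actual Kyuubi API:

    trait AppLookupSketch {
      def persistedState: Option[String] // from the metadata store
      def liveState(): Option[String]    // from YARN/Kubernetes

      // monitorBatchJob-style: go straight to the resource manager.
      def monitorStyle(): Option[String] = liveState()

      // recovery-style (this PR): respect a persisted terminal state first.
      def recoveryStyle(): Option[String] = {
        val terminal = Set("FINISHED", "FAILED", "KILLED", "NOT_FOUND")
        persistedState.filter(terminal).orElse(liveState())
      }
    }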

@turboFei
Member Author

turboFei commented Apr 27, 2025

@pan3793 addressed all the comments

@turboFei turboFei requested a review from pan3793 April 27, 2025 08:19
@pan3793 pan3793 closed this in ecfca79 Apr 27, 2025
@pan3793
Member

pan3793 commented Apr 27, 2025

Thanks, merged to master

@turboFei turboFei deleted the revist_not_found branch April 27, 2025 17:12