ClusterCondition::last_update_time is updated on no-ops, causing infinite reconciles (in the worst case) #1032

Open
@nightkr

Description

Affected version

Yes. (Still an issue on trunk, introduced in #571, rolled out around SDP 23.4.)

Current and expected behavior

Reconciling a cluster where nothing has changed should be a no-op.

ClusterCondition::last_update_time breaks this expectation, since it is unconditionally set to the current time, rounded to the second:

    if old_condition.status == new_condition.status {
        ClusterCondition {
            last_update_time: Some(now),
            last_transition_time: old_condition.last_transition_time,
            ..new_condition
        }

This is registered as another object modification whenever the new reconcile does not fall within the same wall-clock second as the previous one. Depending on how long one reconcile takes, this can cause a re-reconciliation loop (infinite, in the worst case) while the object is trying to settle down, which is likely to be an indication that the cluster is struggling to begin with!
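To make the effect concrete, here is a minimal, self-contained sketch of the behavior described above. It is not the operator's actual code: the `ClusterCondition` struct is a simplified stand-in with timestamps as plain Unix seconds, the function names `update_timestamps` and `is_modification` are hypothetical, and the `else` branch is an assumption, since the excerpt only shows the unchanged-status case.

```rust
// Simplified stand-in for the real condition type (assumption): only the
// fields relevant to this issue, with timestamps as whole Unix seconds.
#[derive(Clone, Debug, PartialEq)]
pub struct ClusterCondition {
    pub status: String,
    pub message: Option<String>,
    pub last_update_time: Option<u64>,
    pub last_transition_time: Option<u64>,
}

// Mirrors the excerpt above: `last_update_time` is always set to `now`,
// even when nothing about the condition has changed.
pub fn update_timestamps(
    old: &ClusterCondition,
    new: ClusterCondition,
    now: u64,
) -> ClusterCondition {
    if old.status == new.status {
        ClusterCondition {
            last_update_time: Some(now),
            last_transition_time: old.last_transition_time,
            ..new
        }
    } else {
        // Assumption: not shown in the excerpt. A status change is taken
        // to reset both timestamps.
        ClusterCondition {
            last_update_time: Some(now),
            last_transition_time: Some(now),
            ..new
        }
    }
}

// True if writing `updated` in place of `old` would modify the object,
// which is what triggers another reconcile.
pub fn is_modification(old: &ClusterCondition, updated: &ClusterCondition) -> bool {
    old != updated
}
```

With an unchanged status, calling `update_timestamps` in the same second as the stored `last_update_time` yields an identical condition, while calling it one second later yields a different one, so the "no-op" reconcile still counts as a modification.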

Possible solution

  1. Drop last_update_time completely (for compat: either stub it out or make it equivalent to last_transition_time)
  2. Take the value from whenever the data source for the condition was updated, rather than the current wall time (if it makes sense/is possible for that condition)
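Both options could be sketched roughly as follows. This is an illustration under stated assumptions rather than a proposed patch: `ClusterCondition` is again a simplified stand-in with Unix-second timestamps, the function names are hypothetical, and `source_updated_at` stands for whatever timestamp the condition's data source exposes (if it exposes one at all).

```rust
// Simplified stand-in for the real condition type (assumption).
#[derive(Clone, Debug, PartialEq)]
pub struct ClusterCondition {
    pub status: String,
    pub last_update_time: Option<u64>,
    pub last_transition_time: Option<u64>,
}

// Option 1, compat variant: keep the field, but make it equivalent to
// `last_transition_time`, so it only moves when the status changes.
pub fn update_mirroring_transition(
    old: &ClusterCondition,
    new: ClusterCondition,
    now: u64,
) -> ClusterCondition {
    let last_transition_time = if old.status == new.status {
        old.last_transition_time
    } else {
        Some(now)
    };
    ClusterCondition {
        last_update_time: last_transition_time,
        last_transition_time,
        ..new
    }
}

// Option 2: derive `last_update_time` from the data source rather than the
// wall clock. `source_updated_at` is a hypothetical input; using it for the
// transition time as well is a choice made for this sketch only.
pub fn update_from_source(
    old: &ClusterCondition,
    new: ClusterCondition,
    source_updated_at: u64,
) -> ClusterCondition {
    let last_transition_time = if old.status == new.status {
        old.last_transition_time
    } else {
        Some(source_updated_at)
    };
    ClusterCondition {
        last_update_time: Some(source_updated_at),
        last_transition_time,
        ..new
    }
}
```

In both variants, repeating the update with unchanged inputs returns the same condition regardless of how much wall time has passed, which is the property the no-op reconcile needs.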

Additional context

Discovered by @siegfriedweber, discussed at https://stackable-workspace.slack.com/archives/C02FZ581UCD/p1747230004370629

Environment

No response

Would you like to work on fixing this bug?

None

Metadata

Projects

Status

In Progress
