Skip to content

sql: fix memory leak in index backfill progress tracking #147511

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 31, 2025

Conversation

rafiss
Copy link
Collaborator

@rafiss rafiss commented May 29, 2025

Previously, when resuming an index backfill job, the completed spans
from the job's checkpoint would be repeatedly added to the progress
tracker on each update, causing unbounded memory growth. For large
tables with many ranges, this could exhaust available memory and
cause the backfill to fail.

This change fixes the memory leak by initializing the completed spans
group once with the checkpointed progress at job startup, then only
adding new spans from subsequent batches. A test-only assertion is
added to detect overlapping spans in progress updates.

The test is also updated to verify that checkpointed spans properly
enclose all completed work and to remove flaky timing dependencies.
These test changes also address the test issues that were present since
the test was first added in 058a7f5.
The testing changes include:

  • Use a context in the testing hook so that when the job is paused, the
    testing hook does not keep blocking.
  • Fixed a race condition by getting the progress from the persisted
    progress, rather than a different hook.
  • Avoid using require / t.Fatal from inside a goroutine, which
    violates the expectations of testing.T.

fixes #140358
fixes #147209

Release note (bug fix): Fixed a memory leak in index backfill jobs
where completed spans were duplicated in memory on each progress
update after resuming from a checkpoint. This could cause out-of-memory
errors when backfilling indexes on large tables with many ranges. This
bug affected release version v25.2.0 and pre-release versions
v25.2.0-alpha.3 through v25.2.0-rc.1.

Copy link

blathers-crl bot commented May 29, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rafiss rafiss force-pushed the fix-backfill-progress branch from bc3ed0b to 67b7d70 Compare May 30, 2025 21:34
@rafiss rafiss changed the title sql: fix index backfill progress calculation sql: fix memory leak in index backfill progress tracking May 30, 2025
@rafiss rafiss marked this pull request as ready for review May 30, 2025 21:34
@rafiss rafiss requested a review from a team as a code owner May 30, 2025 21:34
@rafiss rafiss requested review from rytaft, fqazi, annrpom and a team and removed request for a team May 30, 2025 21:34
@rafiss rafiss added backport-25.2.x Flags PRs that need to be backported to 25.2 backport-25.2.1-rc FROZEN: requires ER request to thaw labels May 30, 2025
Copy link
Contributor

@annrpom annrpom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: nice test fixes

Reviewed 2 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @rytaft)


pkg/sql/indexbackfiller_test.go line 579 at r1 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	skip.UnderStress(t, "timing-sensitive test")

are we able to remove this skip?

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! A couple of drive-by comments from me.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi, @rafiss, and @rytaft)


-- commits line 27 at r1:
nit: it'd be good to include the release versions that are affected by the bug.


pkg/sql/index_backfiller.go line 97 at r1 (raw file):

			for i, span1 := range progress.CompletedSpans {
				for j, span2 := range progress.CompletedSpans {
					if i == j {

nit: we could skip more work because we currently will check both combinations (i=0, j=1) and (i=1, j=0) even though it is effectively the same check.

Previously, when resuming an index backfill job, the completed spans
from the job's checkpoint would be repeatedly added to the progress
tracker on each update, causing unbounded memory growth. For large
tables with many ranges, this could exhaust available memory and
cause the backfill to fail.

This change fixes the memory leak by initializing the completed spans
group once with the checkpointed progress at job startup, then only
adding new spans from subsequent batches. A test-only assertion is
added to detect overlapping spans in progress updates.

The test is also updated to verify that checkpointed spans properly
enclose all completed work and to remove flaky timing dependencies.
These test changes also address the test issues that were present since
the test was first added in 058a7f5.
The testing changes include:
- Use a context in the testing hook so that when the job is paused, the
  testing hook does not keep blocking.
- Fixed a race condition by getting the progress from the persisted
  progress, rather than a different hook.
- Avoid using `require` / `t.Fatal` from inside a goroutine, which
  violates the expectations of testing.T.

Release note (bug fix): Fixed a memory leak in index backfill jobs
where completed spans were duplicated in memory on each progress
update after resuming from a checkpoint. This could cause out-of-memory
errors when backfilling indexes on large tables with many ranges. This
bug affected release version v25.2.0 and pre-release versions
v25.2.0-alpha.3 through v25.2.0-rc.1.
@rafiss rafiss force-pushed the fix-backfill-progress branch from 67b7d70 to 48a6512 Compare May 31, 2025 15:23
Copy link
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @annrpom, @fqazi, and @rytaft)


-- commits line 27 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it'd be good to include the release versions that are affected by the bug.

done; just 25.2.0-alpha.3 through 25.2.0.


pkg/sql/index_backfiller.go line 97 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: we could skip more work because we currently will check both combinations (i=0, j=1) and (i=1, j=0) even though it is effectively the same check.

done; used i <= j


pkg/sql/indexbackfiller_test.go line 579 at r1 (raw file):

Previously, annrpom (annie pompa) wrote…

are we able to remove this skip?

done; nice call!

@rafiss
Copy link
Collaborator Author

rafiss commented May 31, 2025

tftr!

bors r+

@craig
Copy link
Contributor

craig bot commented May 31, 2025

@craig craig bot merged commit b216877 into cockroachdb:master May 31, 2025
20 of 21 checks passed
Copy link

blathers-crl bot commented May 31, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #140358: branch-release-25.2, branch-release-25.2.1-rc.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-25.2.x Flags PRs that need to be backported to 25.2 backport-25.2.1-rc FROZEN: requires ER request to thaw target-release-25.3.0
Projects
None yet
4 participants