-
Notifications
You must be signed in to change notification settings - Fork 3.9k
sql: fix memory leak in index backfill progress tracking #147511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
bc3ed0b
to
67b7d70
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r1, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @rytaft)
pkg/sql/indexbackfiller_test.go
line 579 at r1 (raw file):
defer leaktest.AfterTest(t)() defer log.Scope(t).Close(t) skip.UnderStress(t, "timing-sensitive test")
are we able to remove this skip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find! A couple of drive-by comments from me.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @fqazi, @rafiss, and @rytaft)
-- commits
line 27 at r1:
nit: it'd be good to include the release versions that are affected by the bug.
pkg/sql/index_backfiller.go
line 97 at r1 (raw file):
for i, span1 := range progress.CompletedSpans { for j, span2 := range progress.CompletedSpans { if i == j {
nit: we could skip more work because we currently will check both combinations (i=0, j=1)
and (i=1, j=0)
even though it is effectively the same check.
Previously, when resuming an index backfill job, the completed spans from the job's checkpoint would be repeatedly added to the progress tracker on each update, causing unbounded memory growth. For large tables with many ranges, this could exhaust available memory and cause the backfill to fail. This change fixes the memory leak by initializing the completed spans group once with the checkpointed progress at job startup, then only adding new spans from subsequent batches. A test-only assertion is added to detect overlapping spans in progress updates. The test is also updated to verify that checkpointed spans properly enclose all completed work and to remove flaky timing dependencies. These test changes also address the test issues that were present since the test was first added in 058a7f5. The testing changes include: - Use a context in the testing hook so that when the job is paused, the testing hook does not keep blocking. - Fixed a race condition by getting the progress from the persisted progress, rather than a different hook. - Avoid using `require` / `t.Fatal` from inside a goroutine, which violates the expectations of testing.T. Release note (bug fix): Fixed a memory leak in index backfill jobs where completed spans were duplicated in memory on each progress update after resuming from a checkpoint. This could cause out-of-memory errors when backfilling indexes on large tables with many ranges. This bug affected release version v25.2.0 and pre-release versions v25.2.0-alpha.3 through v25.2.0-rc.1.
67b7d70
to
48a6512
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @annrpom, @fqazi, and @rytaft)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: it'd be good to include the release versions that are affected by the bug.
done; just 25.2.0-alpha.3 through 25.2.0.
pkg/sql/index_backfiller.go
line 97 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: we could skip more work because we currently will check both combinations
(i=0, j=1)
and(i=1, j=0)
even though it is effectively the same check.
done; used i <= j
pkg/sql/indexbackfiller_test.go
line 579 at r1 (raw file):
Previously, annrpom (annie pompa) wrote…
are we able to remove this skip?
done; nice call!
tftr! bors r+ |
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #140358: branch-release-25.2, branch-release-25.2.1-rc. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Previously, when resuming an index backfill job, the completed spans
from the job's checkpoint would be repeatedly added to the progress
tracker on each update, causing unbounded memory growth. For large
tables with many ranges, this could exhaust available memory and
cause the backfill to fail.
This change fixes the memory leak by initializing the completed spans
group once with the checkpointed progress at job startup, then only
adding new spans from subsequent batches. A test-only assertion is
added to detect overlapping spans in progress updates.
The test is also updated to verify that checkpointed spans properly
enclose all completed work and to remove flaky timing dependencies.
These test changes also address the test issues that were present since
the test was first added in 058a7f5.
The testing changes include:
testing hook does not keep blocking.
progress, rather than a different hook.
require
/t.Fatal
from inside a goroutine, whichviolates the expectations of testing.T.
fixes #140358
fixes #147209
Release note (bug fix): Fixed a memory leak in index backfill jobs
where completed spans were duplicated in memory on each progress
update after resuming from a checkpoint. This could cause out-of-memory
errors when backfilling indexes on large tables with many ranges. This
bug affected release version v25.2.0 and pre-release versions
v25.2.0-alpha.3 through v25.2.0-rc.1.