[BUGFIX] Ring: fix DoBatch cleanup callback blocked forever on callback panic#7559
Open
sandy2008 wants to merge 3 commits into
Open
[BUGFIX] Ring: fix DoBatch cleanup callback blocked forever on callback panic#7559sandy2008 wants to merge 3 commits into
sandy2008 wants to merge 3 commits into
Conversation
DoBatch's submitted closures called wg.Done() inline after callback() and tracker.record(). If callback panicked, wg.Done() was skipped, causing wg.Wait() to block forever and the cleanup callback to never execute. This leaked context timers, request buffers, and any other resources owned by the cleanup function for all DoBatch callers (distributor, alertmanager). Move wg.Done() to a defer so it runs even during panic unwinding, ensuring the cleanup goroutine always completes. Fixes cortexproject#7558 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
ring.DoBatchsubmitted closures calledwg.Done()inline aftercallback()andtracker.record(). Ifcallbackpanicked,wg.Done()was skipped, causingwg.Wait()to block forever and the cleanup callback to never execute.Symptom: any panic inside a DoBatch callback permanently blocked the cleanup goroutine. For the distributor, this leaked the
context.WithTimeouttimer (untilRemoteTimeoutexpired) and request buffers (req.Timeseries,req.Free()) — they were never reclaimed. The same issue affected all three production DoBatch callers: distributor, alertmanager distributor, and multitenant alertmanager.Fix: move
wg.Done()todefer wg.Done()so it runs even during panic unwinding, ensuring the cleanup goroutine always completes. An earlier iteration also addeddefer cancel()to the distributor'sdoBatchfunction, but this was intentionally reverted after review found it contradicts the design intent: "Use a background context to make sure all ingesters get samples even if we return early." Thedefer wg.Done()fix makes the existing cleanup callback (which already callscancel()) reliable on all paths, preserving send semantics without cancelling in-flight ingester RPCs.Which issue(s) this PR fixes
Fixes #7558
Checklist
CHANGELOG.mdupdated — not yet, will add if maintainers confirm the fix direction.TestDoBatchCleanupCalledOnCallbackPanicverifies cleanup runs even when callback panics.Test plan
go vet ./pkg/ring/... ./pkg/distributor/...— cleango test -tags "netgo slicelabels" -count=1 -timeout 120s ./pkg/ring/...— 10/10 packages passgo test -tags "netgo slicelabels" -run TestDistributor_BatchTimeoutMetric ./pkg/distributor/...— passesdeferduring development)