fix(integration): repair TestExpandedPostingsCacheFuzz latent bugs#7549
Open
sandy2008 wants to merge 1 commit into
Open
fix(integration): repair TestExpandedPostingsCacheFuzz latent bugs#7549sandy2008 wants to merge 1 commit into
sandy2008 wants to merge 1 commit into
Conversation
TestExpandedPostingsCacheFuzz had two independent defects that together neutered the test: 1. The query-generation loop at integration/query_fuzz_test.go:550-557 was inverted: `break` fired the moment isValidQuery returned true, exiting the outer `for i` loop, and only *invalid* expressions ever reached the append. The queries/matchers slices ended up holding at most testRun invalid exprs (or empty if the first draw was valid). Every downstream c1.Query / c1.QueryRange call then short-circuited with the same error on both clients, so cmp.Equal(err1, err2) trivially held and nothing was actually compared. 2. The Series diff branch at integration/query_fuzz_test.go:657 read `cmp.Equal(tc.sres1, tc.sres1, ...)` — self-comparison — so the Series API divergence check was dead code. Net effect: the test had been a no-op for divergences. Issue cortexproject#7545 captures the empirical census. Fix: wrap the WalkRangeQuery call in a nested `for { ... }` that re-draws until isValidQuery is satisfied, mirroring the sibling fuzz tests in this file, and change `tc.sres1, tc.sres1` to `tc.sres1, tc.sres2`. No production code touched; enabledAggrs is already wired in. Two follow-ups worth flagging honestly: wall-clock will grow substantially (the previously-skipped ~9k HTTP round-trips now actually run), and any genuine v1.18.0-vs-HEAD divergence in the expanded-postings-cache code path will now surface as a real failure for the first time, to be triaged as a real bug rather than a regression from this PR. Fixes cortexproject#7545 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Repairs two latent bugs in
TestExpandedPostingsCacheFuzz(integration/query_fuzz_test.go) that together gut the test: an invertedforloop that filters out valid queries instead of accepting them, and a self-comparison typo (sres1, sres1) that silently passes everySeriesdiff. After the fix the test actually exercises the expanded-postings-cache code path that it claims to test.Root cause
The full empirical census (occurrences, environments, exit signal) is in issue #7545. Two independent defects in the same test:
Inverted query-generation loop (
integration/query_fuzz_test.go:550-557, pre-fix):breakexits the outerfor iloop the moment a valid query is produced, and only the invalid expressions reach theappend. Net effect:queries/matchersend up holding at mosttestRuninvalid exprs (whichever onespromqlsmithhappened to spit out before the first valid one), or an empty slice if the very first draw is valid. Every downstream call toc1.Query/c1.QueryRangethen short-circuits with the same error from both clients, socmp.Equal(tc.err1, tc.err2)trivially holds and nothing is actually compared.Seriesself-comparison typo (integration/query_fuzz_test.go:657, pre-fix):tc.sres1is compared to itself, so theSeriesAPI divergence branch is dead code — it can never report a failure.The combination means the test has been a no-op for divergences. It will only fail on infra/setup errors (e.g. a failed Push). Issue #7545 documents how the test sometimes still trips in CI even in this neutered form, and confirms the queries/matchers slices being effectively empty in failing runs.
Fix
Two surgical edits to
integration/query_fuzz_test.go:Convert the inverted single-
ifinto a nestedfor { ... }that re-draws untilisValidQueryreturns true:This mirrors what every other fuzz test in this file expects:
testRunvalid queries.Fix the self-comparison:
tc.sres1, tc.sres1→tc.sres1, tc.sres2.No production code is touched.
enabledAggrsis already wired in at line 544, so the generator's safe-aggregator restriction (matching the other nine fuzz tests in this file) is unchanged.Why not other approaches
for { ... }with a bounded retry (e.g. capped at N attempts).WalkRangeQueryalready returns a valid query in the vast majority of draws; an unbounded retry is what the sibling tests already do implicitly (most of them don't even callisValidQuerybefore appending). A bounded retry adds complexity for a failure mode that doesn't occur in practice.continueand decrementiinstead of a nested loop. Semantically equivalent but harder to read and easier to break on future edits.Expected follow-up
Two honest caveats worth flagging:
queriesslice (only invalid exprs that immediately error out on both clients), so the heavy inner loop — 5 iterations × (instant + range) × 300 queries + 300Seriesmatchers × 2 clients ≈ 9k HTTP round-trips — was largely a no-op. Post-fix that work actually happens. Local timing on the unfixed vs fixed variant of this test wasn't measured here (would require Docker harness), but expect the wall-clock to grow accordingly. The other nine fuzz tests inintegration/query_fuzz_test.goalready do this volume of work and are tolerated in CI, so this should land within the existing budget.Seriesdiff and the query diff were both effectively suppressed, any genuine v1.18.0-vs-HEAD divergence in the expanded-postings-cache code path will now show up as a test failure for the first time. If new failures appear, they should be triaged as real bugs (or as additionalshouldUseSampleNumComparer/enabledAggrsadjustments matching what the sibling tests already do), not as regressions introduced by this PR.Which issue(s) this PR fixes
Fixes #7545
Checklist
CHANGELOG.mdupdated — not applicable; test-only change with no user-facing behaviour change.TestExpandedPostingsCacheFuzzto its intended behaviour. Re-running the fuzz test itself across CI iterations is the meaningful signal.Test plan
Local (no Docker harness required):
gofmt -l integration/query_fuzz_test.go— cleangoimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go— cleango vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/...— cleango build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/...— cleanValidation that the test now actually exercises the cache path (requires Docker):
make ./cmd/cortex/.uptodatego test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=5 ./integration/... -run "^TestExpandedPostingsCacheFuzz$"— expect 5/5 PASS, and wall-clock per iteration significantly higher than pre-fix (the previously-skipped 9k HTTP round-trips now actually run). Any newly surfaced divergences should be triaged per the "Expected follow-up" section above.