fix(integration): repair TestExpandedPostingsCacheFuzz latent bugs by sandy2008 · Pull Request #7549 · cortexproject/cortex

sandy2008 · 2026-05-22T09:48:58Z

What this PR does

Repairs two latent bugs in TestExpandedPostingsCacheFuzz (integration/query_fuzz_test.go) that together gut the test: an inverted for loop that filters out valid queries instead of accepting them, and a self-comparison typo (sres1, sres1) that silently passes every Series diff. After the fix the test actually exercises the expanded-postings-cache code path that it claims to test.

Root cause

The full empirical census (occurrences, environments, exit signal) is in issue #7545. Two independent defects in the same test:

Inverted query-generation loop (integration/query_fuzz_test.go:550-557, pre-fix):
```
for i := 0; i < testRun; i++ {
    expr := ps.WalkRangeQuery()
    if isValidQuery(expr, true) {
        break          // <- exits the *outer* loop on first valid query
    }
    queries = append(queries, expr.Pretty(0))   // <- only invalid exprs reach here
    ...
}
```
break exits the outer for i loop the moment a valid query is produced, and only the invalid expressions reach the append. Net effect: queries/matchers end up holding at most testRun invalid exprs (whichever ones promqlsmith happened to spit out before the first valid one), or an empty slice if the very first draw is valid. Every downstream call to c1.Query / c1.QueryRange then short-circuits with the same error from both clients, so cmp.Equal(tc.err1, tc.err2) trivially holds and nothing is actually compared.
Series self-comparison typo (integration/query_fuzz_test.go:657, pre-fix):
```
} else if !cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer) {
```
tc.sres1 is compared to itself, so the Series API divergence branch is dead code — it can never report a failure.

The combination means the test has been a no-op for divergences. It will only fail on infra/setup errors (e.g. a failed Push). Issue #7545 documents how the test sometimes still trips in CI even in this neutered form, and confirms the queries/matchers slices being effectively empty in failing runs.

Fix

Two surgical edits to integration/query_fuzz_test.go:

Convert the inverted single-if into a nested for { ... } that re-draws until isValidQuery returns true:

for i := 0; i < testRun; i++ {
    var expr parser.Expr
    for {
        expr = ps.WalkRangeQuery()
        if isValidQuery(expr, true) {
            break
        }
    }
    queries = append(queries, expr.Pretty(0))
    ...
}

This mirrors what every other fuzz test in this file expects: testRun valid queries.

Fix the self-comparison: tc.sres1, tc.sres1 → tc.sres1, tc.sres2.

No production code is touched. enabledAggrs is already wired in at line 544, so the generator's safe-aggregator restriction (matching the other nine fuzz tests in this file) is unchanged.

Why not other approaches

Skip / disable the test. Hides the regression instead of fixing it. The expanded-postings-cache path needs fuzz coverage.
Replace the inner for { ... } with a bounded retry (e.g. capped at N attempts). WalkRangeQuery already returns a valid query in the vast majority of draws; an unbounded retry is what the sibling tests already do implicitly (most of them don't even call isValidQuery before appending). A bounded retry adds complexity for a failure mode that doesn't occur in practice.
Use continue and decrement i instead of a nested loop. Semantically equivalent but harder to read and easier to break on future edits.
Add a separate non-fuzz unit test for the expanded-postings-cache path. Worth doing eventually but out of scope here — this PR is about restoring the existing test's intended behaviour, not redesigning coverage.

Expected follow-up

Two honest caveats worth flagging:

Runtime increases substantially. Pre-fix the test executed against a near-empty queries slice (only invalid exprs that immediately error out on both clients), so the heavy inner loop — 5 iterations × (instant + range) × 300 queries + 300 Series matchers × 2 clients ≈ 9k HTTP round-trips — was largely a no-op. Post-fix that work actually happens. Local timing on the unfixed vs fixed variant of this test wasn't measured here (would require Docker harness), but expect the wall-clock to grow accordingly. The other nine fuzz tests in integration/query_fuzz_test.go already do this volume of work and are tolerated in CI, so this should land within the existing budget.
Real divergences may newly surface. Because the Series diff and the query diff were both effectively suppressed, any genuine v1.18.0-vs-HEAD divergence in the expanded-postings-cache code path will now show up as a test failure for the first time. If new failures appear, they should be triaged as real bugs (or as additional shouldUseSampleNumComparer/enabledAggrs adjustments matching what the sibling tests already do), not as regressions introduced by this PR.

Which issue(s) this PR fixes

Fixes #7545

Checklist

CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change.
Documentation updated — not applicable; no flags or config changed.
Tests: no new test added. This is the test fix; the change restores TestExpandedPostingsCacheFuzz to its intended behaviour. Re-running the fuzz test itself across CI iterations is the meaningful signal.

Test plan

Local (no Docker harness required):

gofmt -l integration/query_fuzz_test.go — clean
goimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go — clean
go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean

Validation that the test now actually exercises the cache path (requires Docker):

make ./cmd/cortex/.uptodate
go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=5 ./integration/... -run "^TestExpandedPostingsCacheFuzz$" — expect 5/5 PASS, and wall-clock per iteration significantly higher than pre-fix (the previously-skipped 9k HTTP round-trips now actually run). Any newly surfaced divergences should be triaged per the "Expected follow-up" section above.

TestExpandedPostingsCacheFuzz had two independent defects that together neutered the test: 1. The query-generation loop at integration/query_fuzz_test.go:550-557 was inverted: `break` fired the moment isValidQuery returned true, exiting the outer `for i` loop, and only *invalid* expressions ever reached the append. The queries/matchers slices ended up holding at most testRun invalid exprs (or empty if the first draw was valid). Every downstream c1.Query / c1.QueryRange call then short-circuited with the same error on both clients, so cmp.Equal(err1, err2) trivially held and nothing was actually compared. 2. The Series diff branch at integration/query_fuzz_test.go:657 read `cmp.Equal(tc.sres1, tc.sres1, ...)` — self-comparison — so the Series API divergence check was dead code. Net effect: the test had been a no-op for divergences. Issue cortexproject#7545 captures the empirical census. Fix: wrap the WalkRangeQuery call in a nested `for { ... }` that re-draws until isValidQuery is satisfied, mirroring the sibling fuzz tests in this file, and change `tc.sres1, tc.sres1` to `tc.sres1, tc.sres2`. No production code touched; enabledAggrs is already wired in. Two follow-ups worth flagging honestly: wall-clock will grow substantially (the previously-skipped ~9k HTTP round-trips now actually run), and any genuine v1.18.0-vs-HEAD divergence in the expanded-postings-cache code path will now surface as a real failure for the first time, to be triaged as a real bug rather than a regression from this PR. Fixes cortexproject#7545 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

pull-request-size Bot added the size/S label May 22, 2026

dosubot Bot added type/bug type/tests labels May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(integration): repair TestExpandedPostingsCacheFuzz latent bugs#7549

fix(integration): repair TestExpandedPostingsCacheFuzz latent bugs#7549
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-expanded-postings-cache-fuzz

sandy2008 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sandy2008 commented May 22, 2026

What this PR does

Root cause

Fix

Why not other approaches

Expected follow-up

Which issue(s) this PR fixes

Checklist

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant