Skip to content

fix(integration): repair TestExpandedPostingsCacheFuzz latent bugs#7549

Open
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-expanded-postings-cache-fuzz
Open

fix(integration): repair TestExpandedPostingsCacheFuzz latent bugs#7549
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-expanded-postings-cache-fuzz

Conversation

@sandy2008
Copy link
Copy Markdown
Contributor

What this PR does

Repairs two latent bugs in TestExpandedPostingsCacheFuzz (integration/query_fuzz_test.go) that together gut the test: an inverted for loop that filters out valid queries instead of accepting them, and a self-comparison typo (sres1, sres1) that silently passes every Series diff. After the fix the test actually exercises the expanded-postings-cache code path that it claims to test.

Root cause

The full empirical census (occurrences, environments, exit signal) is in issue #7545. Two independent defects in the same test:

  1. Inverted query-generation loop (integration/query_fuzz_test.go:550-557, pre-fix):

    for i := 0; i < testRun; i++ {
        expr := ps.WalkRangeQuery()
        if isValidQuery(expr, true) {
            break          // <- exits the *outer* loop on first valid query
        }
        queries = append(queries, expr.Pretty(0))   // <- only invalid exprs reach here
        ...
    }

    break exits the outer for i loop the moment a valid query is produced, and only the invalid expressions reach the append. Net effect: queries/matchers end up holding at most testRun invalid exprs (whichever ones promqlsmith happened to spit out before the first valid one), or an empty slice if the very first draw is valid. Every downstream call to c1.Query / c1.QueryRange then short-circuits with the same error from both clients, so cmp.Equal(tc.err1, tc.err2) trivially holds and nothing is actually compared.

  2. Series self-comparison typo (integration/query_fuzz_test.go:657, pre-fix):

    } else if !cmp.Equal(tc.sres1, tc.sres1, labelSetsComparer) {

    tc.sres1 is compared to itself, so the Series API divergence branch is dead code — it can never report a failure.

The combination means the test has been a no-op for divergences. It will only fail on infra/setup errors (e.g. a failed Push). Issue #7545 documents how the test sometimes still trips in CI even in this neutered form, and confirms the queries/matchers slices being effectively empty in failing runs.

Fix

Two surgical edits to integration/query_fuzz_test.go:

  1. Convert the inverted single-if into a nested for { ... } that re-draws until isValidQuery returns true:

    for i := 0; i < testRun; i++ {
        var expr parser.Expr
        for {
            expr = ps.WalkRangeQuery()
            if isValidQuery(expr, true) {
                break
            }
        }
        queries = append(queries, expr.Pretty(0))
        ...
    }

    This mirrors what every other fuzz test in this file expects: testRun valid queries.

  2. Fix the self-comparison: tc.sres1, tc.sres1tc.sres1, tc.sres2.

No production code is touched. enabledAggrs is already wired in at line 544, so the generator's safe-aggregator restriction (matching the other nine fuzz tests in this file) is unchanged.

Why not other approaches

  • Skip / disable the test. Hides the regression instead of fixing it. The expanded-postings-cache path needs fuzz coverage.
  • Replace the inner for { ... } with a bounded retry (e.g. capped at N attempts). WalkRangeQuery already returns a valid query in the vast majority of draws; an unbounded retry is what the sibling tests already do implicitly (most of them don't even call isValidQuery before appending). A bounded retry adds complexity for a failure mode that doesn't occur in practice.
  • Use continue and decrement i instead of a nested loop. Semantically equivalent but harder to read and easier to break on future edits.
  • Add a separate non-fuzz unit test for the expanded-postings-cache path. Worth doing eventually but out of scope here — this PR is about restoring the existing test's intended behaviour, not redesigning coverage.

Expected follow-up

Two honest caveats worth flagging:

  • Runtime increases substantially. Pre-fix the test executed against a near-empty queries slice (only invalid exprs that immediately error out on both clients), so the heavy inner loop — 5 iterations × (instant + range) × 300 queries + 300 Series matchers × 2 clients ≈ 9k HTTP round-trips — was largely a no-op. Post-fix that work actually happens. Local timing on the unfixed vs fixed variant of this test wasn't measured here (would require Docker harness), but expect the wall-clock to grow accordingly. The other nine fuzz tests in integration/query_fuzz_test.go already do this volume of work and are tolerated in CI, so this should land within the existing budget.
  • Real divergences may newly surface. Because the Series diff and the query diff were both effectively suppressed, any genuine v1.18.0-vs-HEAD divergence in the expanded-postings-cache code path will now show up as a test failure for the first time. If new failures appear, they should be triaged as real bugs (or as additional shouldUseSampleNumComparer/enabledAggrs adjustments matching what the sibling tests already do), not as regressions introduced by this PR.

Which issue(s) this PR fixes

Fixes #7545

Checklist

  • CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change.
  • Documentation updated — not applicable; no flags or config changed.
  • Tests: no new test added. This is the test fix; the change restores TestExpandedPostingsCacheFuzz to its intended behaviour. Re-running the fuzz test itself across CI iterations is the meaningful signal.

Test plan

Local (no Docker harness required):

  • gofmt -l integration/query_fuzz_test.go — clean
  • goimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go — clean
  • go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
  • go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean

Validation that the test now actually exercises the cache path (requires Docker):

  • make ./cmd/cortex/.uptodate
  • go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=5 ./integration/... -run "^TestExpandedPostingsCacheFuzz$" — expect 5/5 PASS, and wall-clock per iteration significantly higher than pre-fix (the previously-skipped 9k HTTP round-trips now actually run). Any newly surfaced divergences should be triaged per the "Expected follow-up" section above.

TestExpandedPostingsCacheFuzz had two independent defects that together
neutered the test:

1. The query-generation loop at integration/query_fuzz_test.go:550-557
   was inverted: `break` fired the moment isValidQuery returned true,
   exiting the outer `for i` loop, and only *invalid* expressions ever
   reached the append. The queries/matchers slices ended up holding at
   most testRun invalid exprs (or empty if the first draw was valid).
   Every downstream c1.Query / c1.QueryRange call then short-circuited
   with the same error on both clients, so cmp.Equal(err1, err2)
   trivially held and nothing was actually compared.

2. The Series diff branch at integration/query_fuzz_test.go:657 read
   `cmp.Equal(tc.sres1, tc.sres1, ...)` — self-comparison — so the
   Series API divergence check was dead code.

Net effect: the test had been a no-op for divergences. Issue cortexproject#7545
captures the empirical census.

Fix: wrap the WalkRangeQuery call in a nested `for { ... }` that
re-draws until isValidQuery is satisfied, mirroring the sibling fuzz
tests in this file, and change `tc.sres1, tc.sres1` to
`tc.sres1, tc.sres2`. No production code touched; enabledAggrs is
already wired in.

Two follow-ups worth flagging honestly: wall-clock will grow
substantially (the previously-skipped ~9k HTTP round-trips now
actually run), and any genuine v1.18.0-vs-HEAD divergence in the
expanded-postings-cache code path will now surface as a real failure
for the first time, to be triaged as a real bug rather than a
regression from this PR.

Fixes cortexproject#7545

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestExpandedPostingsCacheFuzz — inverted isValidQuery filter lets backward-incompat queries (quantile, atan2, …) through

1 participant