fix(integration): stabilise TestParquetFuzz topk/bottomk tie flake by sandy2008 · Pull Request #7544 · cortexproject/cortex

sandy2008 · 2026-05-22T08:52:16Z

What this PR does

Stabilises the flaky TestParquetFuzz integration test by restricting its promqlsmith random query generator to the safe aggregator set (enabledAggrs), matching what nine sibling fuzz tests in integration/query_fuzz_test.go already do.

Root cause

The full diagnosis (data, occurrences, why simpler fixes fail) is in issue #7543. In short: TestParquetFuzz only passed WithEnabledFunctions(enabledFunctions) to promqlsmith.New, so the generator kept emitting topk / bottomk queries. The block data produced by e2e.CreateBlock (float64(i+j)) is tie-prone, and inner expressions like ({a} or {b}) % {a} collapse to constant 0, so topk(1, …) has no canonical winner. Cortex's parquet path and standalone Prometheus pick different tied series, and the existing sampleNumComparer (which only compares total sample counts across all output series) can't reconcile the resulting time-window divergence. Empirically this fires on ~5 of ~245 PR CI runs over the last 18 days, 4 of those 5 on arm64 (full census in #7543).

Fix

One-line change at integration/parquet_querier_test.go:172-176: add promqlsmith.WithEnabledAggrs(enabledAggrs) to the opts slice. enabledAggrs is already declared at integration/query_fuzz_test.go:44-46 as {SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE} — i.e. excludes TOPK, BOTTOMK, COUNT_VALUES, STDDEV, STDVAR. This is the same aggregator restriction used by TestNativeHistogramFuzz, TestExperimentalPromQLFuncsWithPrometheus, TestDisableChunkTrimmingFuzz, TestExpandedPostingsCacheFuzz, TestVerticalShardingFuzz, TestProtobufCodecFuzz, TestBackwardCompatibilityQueryFuzz, TestPrometheusCompatibilityQueryFuzz, and TestRW1vsRW2QueryFuzz in integration/query_fuzz_test.go.

Why not other approaches

Per-series epsilon in e2e.CreateBlock data. Doesn't help. The failing queries contain … % …, which produces exact 0 regardless of input epsilons — ties are unavoidable.
Strengthen sampleNumComparer (e.g. per-series buckets). Still doesn't normalize the downstream time-window divergence caused by picking different tied winners; would require effectively re-implementing topk semantics in the comparator.
Force sortSeries=true in parquet Select. Production change to satisfy a test — off-table.

If random fuzz coverage of topk/bottomk against the parquet path is wanted later, a dedicated deterministic test is the right shape.

Which issue(s) this PR fixes

Fixes #7543

Checklist

CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change (consistent with fix(integration): Fix flaky TestParquetFuzz by uploading block before cortex starts #7499, the prior TestParquetFuzz flake fix, which also omitted a CHANGELOG entry).
Documentation updated — not applicable; no flags or config changed.
Tests: no new test added. This is the test fix; the change narrows the generator's aggregator set to the one already validated by nine sibling fuzz tests in integration/query_fuzz_test.go. A new unit test would not exercise the integration harness, and re-running the fuzz test itself across CI iterations is the meaningful signal.

Test plan

Local (no Docker harness required):

gofmt -l integration/parquet_querier_test.go — clean
goimports -local github.com/cortexproject/cortex -l integration/parquet_querier_test.go — clean
go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean

Validation that the flake is actually gone (requires Docker):

make ./cmd/cortex/.uptodate
go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=20 ./integration/... -run "^TestParquetFuzz$" on an arm64 runner — expect 20/20 PASS. The pre-fix rate is ~2% per single run, so 20 iterations gives ~33% confidence of seeing one failure if the fix were wrong; CI's normal cadence (many runs/day across PRs) will give the final signal.

TestParquetFuzz only passed WithEnabledFunctions to promqlsmith.New and not WithEnabledAggrs, so the generator kept emitting topk / bottomk queries against tie-prone data (e2e.CreateBlock uses float64(i+j), and inner sub-expressions like ({a} or {b}) % {a} collapse to constant 0). Cortex's parquet path and standalone Prometheus pick different tied series, and sampleNumComparer — which only compares total sample count across all output series — cannot reconcile the resulting time-window divergence. Empirically this fires on ~5 of ~245 PR CI runs over the last 18 days, 4 of those 5 on arm64 (full census in cortexproject#7543). Pass promqlsmith.WithEnabledAggrs(enabledAggrs) at integration/parquet_querier_test.go:172-176. enabledAggrs is already declared at integration/query_fuzz_test.go:44-46 as {SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE}; this matches the aggregator restriction used by nine sibling fuzz tests in integration/query_fuzz_test.go (TestNativeHistogramFuzz, TestVerticalShardingFuzz, TestProtobufCodecFuzz, TestBackwardCompatibilityQueryFuzz, and others). Same trade-off already accepted elsewhere: random fuzz coverage of topk / bottomk / count_values / stddev / stdvar against the parquet path is dropped. A deterministic dedicated test is the right shape if that coverage is wanted later — strengthening sampleNumComparer or injecting epsilons cannot work (modulo always produces ties). Fixes cortexproject#7543 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

pull-request-size Bot added the size/XS label May 22, 2026

dosubot Bot added the type/flaky-test label May 22, 2026

sandy2008 mentioned this pull request May 22, 2026

fix(integration): log fuzz seed so failures can be reproduced #7552

Open

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(integration): stabilise TestParquetFuzz topk/bottomk tie flake#7544

fix(integration): stabilise TestParquetFuzz topk/bottomk tie flake#7544
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-parquet-fuzz-topk-tie-flake

sandy2008 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sandy2008 commented May 22, 2026

What this PR does

Root cause

Fix

Why not other approaches

Which issue(s) this PR fixes

Checklist

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant