Skip to content

fix(integration): stabilise TestParquetFuzz topk/bottomk tie flake#7544

Open
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-parquet-fuzz-topk-tie-flake
Open

fix(integration): stabilise TestParquetFuzz topk/bottomk tie flake#7544
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-parquet-fuzz-topk-tie-flake

Conversation

@sandy2008
Copy link
Copy Markdown
Contributor

What this PR does

Stabilises the flaky TestParquetFuzz integration test by restricting its promqlsmith random query generator to the safe aggregator set (enabledAggrs), matching what nine sibling fuzz tests in integration/query_fuzz_test.go already do.

Root cause

The full diagnosis (data, occurrences, why simpler fixes fail) is in issue #7543. In short: TestParquetFuzz only passed WithEnabledFunctions(enabledFunctions) to promqlsmith.New, so the generator kept emitting topk / bottomk queries. The block data produced by e2e.CreateBlock (float64(i+j)) is tie-prone, and inner expressions like ({a} or {b}) % {a} collapse to constant 0, so topk(1, …) has no canonical winner. Cortex's parquet path and standalone Prometheus pick different tied series, and the existing sampleNumComparer (which only compares total sample counts across all output series) can't reconcile the resulting time-window divergence. Empirically this fires on ~5 of ~245 PR CI runs over the last 18 days, 4 of those 5 on arm64 (full census in #7543).

Fix

One-line change at integration/parquet_querier_test.go:172-176: add promqlsmith.WithEnabledAggrs(enabledAggrs) to the opts slice. enabledAggrs is already declared at integration/query_fuzz_test.go:44-46 as {SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE} — i.e. excludes TOPK, BOTTOMK, COUNT_VALUES, STDDEV, STDVAR. This is the same aggregator restriction used by TestNativeHistogramFuzz, TestExperimentalPromQLFuncsWithPrometheus, TestDisableChunkTrimmingFuzz, TestExpandedPostingsCacheFuzz, TestVerticalShardingFuzz, TestProtobufCodecFuzz, TestBackwardCompatibilityQueryFuzz, TestPrometheusCompatibilityQueryFuzz, and TestRW1vsRW2QueryFuzz in integration/query_fuzz_test.go.

Why not other approaches

  • Per-series epsilon in e2e.CreateBlock data. Doesn't help. The failing queries contain … % …, which produces exact 0 regardless of input epsilons — ties are unavoidable.
  • Strengthen sampleNumComparer (e.g. per-series buckets). Still doesn't normalize the downstream time-window divergence caused by picking different tied winners; would require effectively re-implementing topk semantics in the comparator.
  • Force sortSeries=true in parquet Select. Production change to satisfy a test — off-table.

If random fuzz coverage of topk/bottomk against the parquet path is wanted later, a dedicated deterministic test is the right shape.

Which issue(s) this PR fixes

Fixes #7543

Checklist

  • CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change (consistent with fix(integration): Fix flaky TestParquetFuzz by uploading block before cortex starts #7499, the prior TestParquetFuzz flake fix, which also omitted a CHANGELOG entry).
  • Documentation updated — not applicable; no flags or config changed.
  • Tests: no new test added. This is the test fix; the change narrows the generator's aggregator set to the one already validated by nine sibling fuzz tests in integration/query_fuzz_test.go. A new unit test would not exercise the integration harness, and re-running the fuzz test itself across CI iterations is the meaningful signal.

Test plan

Local (no Docker harness required):

  • gofmt -l integration/parquet_querier_test.go — clean
  • goimports -local github.com/cortexproject/cortex -l integration/parquet_querier_test.go — clean
  • go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
  • go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean

Validation that the flake is actually gone (requires Docker):

  • make ./cmd/cortex/.uptodate
  • go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=20 ./integration/... -run "^TestParquetFuzz$" on an arm64 runner — expect 20/20 PASS. The pre-fix rate is ~2% per single run, so 20 iterations gives ~33% confidence of seeing one failure if the fix were wrong; CI's normal cadence (many runs/day across PRs) will give the final signal.

TestParquetFuzz only passed WithEnabledFunctions to promqlsmith.New
and not WithEnabledAggrs, so the generator kept emitting topk /
bottomk queries against tie-prone data (e2e.CreateBlock uses
float64(i+j), and inner sub-expressions like ({a} or {b}) % {a}
collapse to constant 0). Cortex's parquet path and standalone
Prometheus pick different tied series, and sampleNumComparer — which
only compares total sample count across all output series — cannot
reconcile the resulting time-window divergence. Empirically this
fires on ~5 of ~245 PR CI runs over the last 18 days, 4 of those 5
on arm64 (full census in cortexproject#7543).

Pass promqlsmith.WithEnabledAggrs(enabledAggrs) at
integration/parquet_querier_test.go:172-176. enabledAggrs is already
declared at integration/query_fuzz_test.go:44-46 as
{SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE}; this matches the
aggregator restriction used by nine sibling fuzz tests in
integration/query_fuzz_test.go (TestNativeHistogramFuzz,
TestVerticalShardingFuzz, TestProtobufCodecFuzz,
TestBackwardCompatibilityQueryFuzz, and others).

Same trade-off already accepted elsewhere: random fuzz coverage of
topk / bottomk / count_values / stddev / stdvar against the parquet
path is dropped. A deterministic dedicated test is the right shape
if that coverage is wanted later — strengthening sampleNumComparer
or injecting epsilons cannot work (modulo always produces ties).

Fixes cortexproject#7543

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestParquetFuzz — topk/bottomk tie non-determinism not fully filtered

1 participant