fix(integration): stabilise TestParquetFuzz topk/bottomk tie flake#7544
Open
sandy2008 wants to merge 1 commit into
Open
fix(integration): stabilise TestParquetFuzz topk/bottomk tie flake#7544sandy2008 wants to merge 1 commit into
sandy2008 wants to merge 1 commit into
Conversation
TestParquetFuzz only passed WithEnabledFunctions to promqlsmith.New
and not WithEnabledAggrs, so the generator kept emitting topk /
bottomk queries against tie-prone data (e2e.CreateBlock uses
float64(i+j), and inner sub-expressions like ({a} or {b}) % {a}
collapse to constant 0). Cortex's parquet path and standalone
Prometheus pick different tied series, and sampleNumComparer — which
only compares total sample count across all output series — cannot
reconcile the resulting time-window divergence. Empirically this
fires on ~5 of ~245 PR CI runs over the last 18 days, 4 of those 5
on arm64 (full census in cortexproject#7543).
Pass promqlsmith.WithEnabledAggrs(enabledAggrs) at
integration/parquet_querier_test.go:172-176. enabledAggrs is already
declared at integration/query_fuzz_test.go:44-46 as
{SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE}; this matches the
aggregator restriction used by nine sibling fuzz tests in
integration/query_fuzz_test.go (TestNativeHistogramFuzz,
TestVerticalShardingFuzz, TestProtobufCodecFuzz,
TestBackwardCompatibilityQueryFuzz, and others).
Same trade-off already accepted elsewhere: random fuzz coverage of
topk / bottomk / count_values / stddev / stdvar against the parquet
path is dropped. A deterministic dedicated test is the right shape
if that coverage is wanted later — strengthening sampleNumComparer
or injecting epsilons cannot work (modulo always produces ties).
Fixes cortexproject#7543
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
11 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Stabilises the flaky
TestParquetFuzzintegration test by restricting itspromqlsmithrandom query generator to the safe aggregator set (enabledAggrs), matching what nine sibling fuzz tests inintegration/query_fuzz_test.goalready do.Root cause
The full diagnosis (data, occurrences, why simpler fixes fail) is in issue #7543. In short:
TestParquetFuzzonly passedWithEnabledFunctions(enabledFunctions)topromqlsmith.New, so the generator kept emittingtopk/bottomkqueries. The block data produced bye2e.CreateBlock(float64(i+j)) is tie-prone, and inner expressions like({a} or {b}) % {a}collapse to constant0, sotopk(1, …)has no canonical winner. Cortex's parquet path and standalone Prometheus pick different tied series, and the existingsampleNumComparer(which only compares total sample counts across all output series) can't reconcile the resulting time-window divergence. Empirically this fires on ~5 of ~245 PR CI runs over the last 18 days, 4 of those 5 on arm64 (full census in #7543).Fix
One-line change at
integration/parquet_querier_test.go:172-176: addpromqlsmith.WithEnabledAggrs(enabledAggrs)to the opts slice.enabledAggrsis already declared atintegration/query_fuzz_test.go:44-46as{SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE}— i.e. excludesTOPK,BOTTOMK,COUNT_VALUES,STDDEV,STDVAR. This is the same aggregator restriction used byTestNativeHistogramFuzz,TestExperimentalPromQLFuncsWithPrometheus,TestDisableChunkTrimmingFuzz,TestExpandedPostingsCacheFuzz,TestVerticalShardingFuzz,TestProtobufCodecFuzz,TestBackwardCompatibilityQueryFuzz,TestPrometheusCompatibilityQueryFuzz, andTestRW1vsRW2QueryFuzzinintegration/query_fuzz_test.go.Why not other approaches
e2e.CreateBlockdata. Doesn't help. The failing queries contain… % …, which produces exact0regardless of input epsilons — ties are unavoidable.sampleNumComparer(e.g. per-series buckets). Still doesn't normalize the downstream time-window divergence caused by picking different tied winners; would require effectively re-implementingtopksemantics in the comparator.sortSeries=truein parquetSelect. Production change to satisfy a test — off-table.If random fuzz coverage of
topk/bottomkagainst the parquet path is wanted later, a dedicated deterministic test is the right shape.Which issue(s) this PR fixes
Fixes #7543
Checklist
CHANGELOG.mdupdated — not applicable; test-only change with no user-facing behaviour change (consistent with fix(integration): Fix flaky TestParquetFuzz by uploading block before cortex starts #7499, the priorTestParquetFuzzflake fix, which also omitted a CHANGELOG entry).integration/query_fuzz_test.go. A new unit test would not exercise the integration harness, and re-running the fuzz test itself across CI iterations is the meaningful signal.Test plan
Local (no Docker harness required):
gofmt -l integration/parquet_querier_test.go— cleangoimports -local github.com/cortexproject/cortex -l integration/parquet_querier_test.go— cleango vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/...— cleango build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/...— cleanValidation that the flake is actually gone (requires Docker):
make ./cmd/cortex/.uptodatego test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=20 ./integration/... -run "^TestParquetFuzz$"on an arm64 runner — expect 20/20 PASS. The pre-fix rate is ~2% per single run, so 20 iterations gives ~33% confidence of seeing one failure if the fix were wrong; CI's normal cadence (many runs/day across PRs) will give the final signal.