
Flaky test: TestParquetFuzz — topk/bottomk tie non-determinism not fully filtered #7543

@sandy2008

Summary

TestParquetFuzz's promqlsmith opts do not include WithEnabledAggrs(enabledAggrs), so the random query generator still emits topk / bottomk queries in this test. Two things make those queries unstable: the data values produced by e2e.CreateBlock are highly tie-prone (float64(i+j) with i ∈ [0,19], j ∈ [0,59] → values 0–78 with massive overlap), and topk/bottomk tie-breaking is inherently non-deterministic between Cortex's parquet path and standalone Prometheus. The result is a recurring "1 test cases failed" flake at the "# of samples mismatch" line.
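To make the "massive overlap" concrete, here is a minimal sketch that counts how many distinct values float64(i+j) yields over the stated ranges. It assumes i and j are simply two nested loop indices; countDistinct is a hypothetical helper, and the real loop in e2e.CreateBlock may be structured differently.

```go
package main

import "fmt"

// countDistinct returns the total number of generated samples and the
// number of distinct values when each sample is float64(i+j).
// Assumption: i and j are two nested loop indices; this is not the
// actual e2e.CreateBlock code.
func countDistinct(numI, numJ int) (total, distinct int) {
	seen := make(map[float64]struct{})
	for i := 0; i < numI; i++ {
		for j := 0; j < numJ; j++ {
			seen[float64(i+j)] = struct{}{}
			total++
		}
	}
	return total, len(seen)
}

func main() {
	total, distinct := countDistinct(20, 60)
	// prints "samples=1200 distinct=79"
	fmt.Printf("samples=%d distinct=%d\n", total, distinct)
}
```

Under that assumption, 1200 generated values collapse onto only 79 distinct numbers, so arithmetic between series readily lands on equal results.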

The existing sampleNumComparer was meant to be the relaxation for this case, but it only compares total sample count across all output series. When two engines pick different tied series at different timestamps, the chosen winners have different time-window coverage downstream of topk, so total counts still diverge — and the assertion fires.
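A simplified sketch of why a total-count comparison cannot absorb different tie winners. The result type and the sameTotal function are hypothetical stand-ins, not the real sampleNumComparer, and the per-series counts are the approximate ones from the failure excerpt in this issue.

```go
package main

import "fmt"

// result maps a series label string to its number of samples.
// Hypothetical stand-in for a range-query Matrix.
type result map[string]int

// totalSamples sums the sample counts over all output series.
func totalSamples(r result) int {
	n := 0
	for _, c := range r {
		n += c
	}
	return n
}

// sameTotal mimics a comparer that only checks the total sample count.
func sameTotal(a, b result) bool {
	return totalSamples(a) == totalSamples(b)
}

func main() {
	// Approximate counts taken from the failure excerpt.
	res1 := result{
		`{job="test", series="0", status_code="502"}`: 20,
		`{job="test", series="1", status_code="502"}`: 33,
	}
	res2 := result{
		`{job="test", series="0", status_code="400"}`: 26,
	}
	// prints "53 26 false"
	fmt.Println(totalSamples(res1), totalSamples(res2), sameTotal(res1, res2))
}
```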

This is not a Cortex / parquet correctness bug — it's a test-side issue. The majority of fuzz tests in query_fuzz_test.go (9 of 12) already pass WithEnabledAggrs(enabledAggrs) to suppress these specific aggregators; TestParquetFuzz is one of three that omits it (the other two being TestStoreGatewayLazyExpandedPostingsSeriesFuzz and TestStoreGatewayLazyExpandedPostingsSeriesFuzzWithPrometheus, which haven't surfaced as flakes in the recent 18-day CI window but share the same theoretical hole).

Most recent occurrence

Failure excerpt

    query_fuzz_test.go:1963: case 1773 # of samples mismatch.
        range query: (
            topk without (status_code, series) (
              1,
              (({__name__="test_series_a"} or {__name__="test_series_b"}) % {__name__="test_series_a"})
            )
          > bool
            {__name__="test_series_b"}
        )
        res1: {job="test", series="0", status_code="502"} => …(~20 samples)…
              {job="test", series="1", status_code="502"} => …(~33 samples)…
        res2: {job="test", series="0", status_code="400"} => …(~26 samples)…
    parquet_querier_test.go:178: 
        Error:      finished query fuzzing tests
        Messages:   1 test cases failed
--- FAIL: TestParquetFuzz (30.87s)

res1 and res2 selected different tied series (different status_code values) because the inner sub-expression (a or b) % a produces value 0 for every surviving sample → topk(1, …) has no canonical winner.

Empirical flake rate (last 18 days, ~245 CI runs across master + PRs)

| Test | PR-CI occurrences | Master-CI occurrences |
| --- | --- | --- |
| TestParquetFuzz (sample-count mismatch) | 5 | 0 |
| integration_query_fuzz job (any test) | 15 | 0 |

  • TestParquetFuzz failure rate: ~2% of PR CI runs, 0% on master.
  • Arch split for TestParquetFuzz failures: 4/5 arm64, 1/5 amd64. arm64 runners are 10–35% slower per fuzz test, which likely widens race / iteration-order divergence windows; parquet-go also has amd64-only SIMD (vendor/github.com/parquet-go/parquet-go/*_amd64.s), leaving arm64 on the pure-Go fallback. These are hypotheses for the skew; neither was independently proven in this investigation.
  • PR #7499, "fix(integration): Fix flaky TestParquetFuzz by uploading block before cortex starts" (2026-05-11), reduced the rate from ~3% to ~1.4% but did not eliminate it; it fixed two other root causes (block-upload race; stdvar/stddev cross-version differences), not the tie-breaking described here.

Sample prior failures (same root cause)

Root cause

  1. integration/parquet_querier_test.go:172-175 passes only WithEnabledFunctions(enabledFunctions) to promqlsmith.New(…) — it does not pass WithEnabledAggrs(enabledAggrs). Therefore the generator uses the promqlsmith default aggregator set (vendor/github.com/cortexproject/promqlsmith/opts.go:22-35), which includes TOPK, BOTTOMK, COUNT_VALUES, STDDEV, STDVAR, QUANTILE.
  2. isValidQuery(skipBackwardIncompat=true) (called via skipStdAggregations=true by this test, see integration/query_fuzz_test.go:1983-2020) filters generated query strings containing stddev, stdvar, quantile, limitk, limit_ratio — but not topk, bottomk, or count_values.
  3. topk/bottomk ties are resolved by the upstream Prometheus engine via strict > (see vendor/github.com/prometheus/prometheus/promql/engine.go topkHeap insertion), so the first-encountered tied series wins. Whether Cortex's parquet path and standalone Prometheus encounter the series in the same order depends on storage iteration; in practice they differ.
  4. sampleNumComparer (integration/query_fuzz_test.go:897-925) compares only total Matrix sample count. Different tied winners → different downstream time coverage → different total counts.
  5. Sibling test TestParquetProjectionPushdownFuzz is t.Skip("Disabled due to flakiness"); that one uses hardcoded queries (not promqlsmith), so its skip is for a separate reason — not addressed by this issue.
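The tie-breaking behavior in step 3 can be illustrated with a minimal sketch. This is not the Prometheus heap implementation; sample and top1 are hypothetical names, and the only property modeled is the strict > comparison, which lets the first-encountered series keep its place when values are equal.

```go
package main

import "fmt"

// sample is a hypothetical (labels, value) pair at one timestamp.
type sample struct {
	labels string
	value  float64
}

// top1 keeps the current best and replaces it only on a strictly
// greater value, so the first-seen series wins every tie.
// It assumes a non-empty input slice.
func top1(in []sample) sample {
	best := in[0]
	for _, s := range in[1:] {
		if s.value > best.value {
			best = s
		}
	}
	return best
}

func main() {
	a := sample{labels: `status_code="502"`, value: 0}
	b := sample{labels: `status_code="400"`, value: 0}

	// Same tied inputs, different iteration order, different winner.
	fmt.Println(top1([]sample{a, b}).labels) // prints status_code="502"
	fmt.Println(top1([]sample{b, a}).labels) // prints status_code="400"
}
```

Both calls see identical data; only the order in which storage hands over the series changes, which is enough to produce the res1/res2 split shown in the failure excerpt.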

Proposed fix

Primary (~15 lines, matches sibling-test precedent):

Pass promqlsmith.WithEnabledAggrs(enabledAggrs) to the opts in integration/parquet_querier_test.go:172-175, e.g.:

opts := []promqlsmith.Option{
    promqlsmith.WithEnabledFunctions(enabledFunctions),
    promqlsmith.WithEnabledAggrs(enabledAggrs),
}

This matches the pattern used by TestNativeHistogramFuzz, TestExperimentalPromQLFuncsWithPrometheus, TestDisableChunkTrimmingFuzz, TestExpandedPostingsCacheFuzz, TestVerticalShardingFuzz, TestProtobufCodecFuzz, TestBackwardCompatibilityQueryFuzz, TestPrometheusCompatibilityQueryFuzz, and TestRW1vsRW2QueryFuzz. enabledAggrs is already defined at integration/query_fuzz_test.go:44-46 as {SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE}.

Trade-off accepted

  • Loses random fuzz coverage of topk/bottomk (and count_values, stddev, stdvar) against the parquet path. Same trade-off already accepted by most other fuzz tests in the suite.
  • If topk/bottomk coverage of parquet is desired in the future, write a deterministic dedicated test rather than expanding sampleNumComparer's relaxation; the comparator cannot normalize away time-window coverage drift induced by tie-break choice without effectively re-implementing topk semantics.

Why simpler/alternative fixes don't work

  • "Inject per-series epsilon in data generation" (e.g. change e2e/util.go:384 from float64(i+j) to float64(i+j) + 1e-9*float64(i)) — does not fix this case, because the inner expression ({a} or {b}) % {a} produces exact 0 for every surviving sample regardless of input values. Modulo always produces ties.
  • "Strengthen sampleNumComparer to count per-series buckets" — still does not normalize for time-window divergence between different chosen ties.
  • "Force sortSeries=true in parquet Select" — production behavior change to satisfy a test; off the table.
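A small sketch of why the epsilon idea in the first bullet cannot help. It assumes the PromQL % operator on floats behaves like Go's math.Mod and that a sample ends up being taken modulo itself; selfMod is a hypothetical helper, and the index values are arbitrary.

```go
package main

import (
	"fmt"
	"math"
)

// selfMod returns x modulo x using math.Mod.
// Assumption: this models what ({a} or {b}) % {a} computes when both
// sides carry the same value for a matched series.
func selfMod(x float64) float64 {
	return math.Mod(x, x)
}

func main() {
	j := 5 // arbitrary sample index
	for _, i := range []int{1, 7, 19} {
		plain := float64(i + j)
		shifted := float64(i+j) + 1e-9*float64(i) // proposed epsilon variant
		// Both columns are 0 for every i: the epsilon cancels out.
		fmt.Println(selfMod(plain), selfMod(shifted))
	}
}
```

Whatever per-series offset goes into the data, a nonzero value modulo itself is still 0, so every series remains tied going into topk.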

Not addressed by this issue (separate flakes)

The same integration_query_fuzz job hits other flaky tests with different root causes; this issue should not try to subsume them. Each likely needs a separate report/fix:

| Test | Distinct root cause (rough) |
| --- | --- |
| TestExpandedPostingsCacheFuzz | Data-freshness race: res1 = NaN, res2 = values (one Cortex hasn't yet ingested the iteration's new push). |
| TestPrometheusCompatibilityQueryFuzz / TestExperimentalPromQLFuncsWithPrometheus | Error-string comparator: same error type with [A,B] vs [B,A] list order from non-deterministic map iteration in the error message. |
| TestVerticalShardingFuzz | Semantic divergence: … or vector(…) fallback fires in the unsharded engine but not all shards of the sharded engine. |
| TestProtobufCodecFuzz | Unknown / not enough samples to classify. |

(arm64 skew across the job is hypothesized to come from ~10–35% slower runners + amd64-only SIMD in parquet-go, which would widen race windows for all of the above. Not independently proven.)

Acceptance criteria

  • After the fix, observe no TestParquetFuzz "# of samples mismatch" failures across a representative local-run sample (e.g. ≥200 iterations on each of arm64 and amd64; given the ~2% per-CI-run rate, fewer iterations are statistically inconclusive).
  • No reduction in coverage of parquet-specific paths (the test still exercises parquet via SUM/MIN/MAX/AVG/GROUP/COUNT/QUANTILE aggregators and the full function set).
  • TestParquetProjectionPushdownFuzz remains skipped (separate issue).
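The iteration count in the first criterion can be sanity-checked with a short calculation. This sketch assumes each run fails independently with the same probability and that a local iteration fails about as often as a CI run (~2%); neither assumption was verified in this issue, and pAtLeastOne is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// pAtLeastOne returns the probability of seeing at least one failure
// in n independent runs when each run fails with probability p.
func pAtLeastOne(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	const flakeRate = 0.02 // assumed per-run rate, taken from the ~2% figure
	for _, n := range []int{50, 100, 200} {
		fmt.Printf("n=%d  P(at least one failure)=%.3f\n", n, pAtLeastOne(flakeRate, n))
	}
}
```

Under these assumptions an unfixed test would show at least one failure in roughly 64% of 50-iteration samples, 87% of 100-iteration samples, and 98% of 200-iteration samples, so a clean run of 200 is reasonably strong evidence while a clean run of 50 is not.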

Filed after a 3-round multi-agent investigation; full notes available on request.
