Flaky test: TestParquetFuzz — topk/bottomk tie non-determinism not fully filtered


## Summary

`TestParquetFuzz` does not pass `WithEnabledAggrs(enabledAggrs)` in its `promqlsmith` options, so the random query generator still emits `topk` / `bottomk` queries in this test. Two factors then combine: the highly tie-prone data values produced by `e2e.CreateBlock` (`float64(i+j)` with `i ∈ [0,19], j ∈ [0,59]` → values 0–78 with massive overlap), and the inherent non-determinism of `topk`/`bottomk` tie-breaking between Cortex's parquet path and standalone Prometheus. The result is a recurring `1 test cases failed` flake at the `# of samples mismatch` line.

The existing `sampleNumComparer` was meant to be the relaxation for this case, but it only compares **total sample count** across all output series. When two engines pick different tied series at different timestamps, the chosen winners have different time-window coverage downstream of `topk`, so total counts still diverge — and the assertion fires.
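To make the limitation concrete, here is a minimal sketch of a total-count comparison. The `series` type and `totalSamples` helper are hypothetical stand-ins, not the real `sampleNumComparer`, and the per-series counts are the approximate figures from the failure excerpt further down.

```go
package main

import "fmt"

// series is a hypothetical stand-in for one output series of a range query:
// its label string plus the number of samples it carries.
type series struct {
	labels  string
	samples int
}

// totalSamples sums sample counts across all series, which is the only
// quantity a total-count comparison looks at.
func totalSamples(result []series) int {
	total := 0
	for _, s := range result {
		total += s.samples
	}
	return total
}

func main() {
	// Approximate counts from the failure excerpt (res1 vs res2).
	res1 := []series{
		{`{job="test", series="0", status_code="502"}`, 20},
		{`{job="test", series="1", status_code="502"}`, 33},
	}
	res2 := []series{
		{`{job="test", series="0", status_code="400"}`, 26},
	}
	t1, t2 := totalSamples(res1), totalSamples(res2)
	fmt.Println(t1, t2, t1 == t2)
}
```

Ignoring which series were chosen does not help here: the totals themselves differ once the two engines pick different tied winners.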

This is **not** a Cortex / parquet correctness bug — it's a test-side issue. The majority of fuzz tests in `query_fuzz_test.go` (9 of 12) already pass `WithEnabledAggrs(enabledAggrs)` to suppress these specific aggregators; `TestParquetFuzz` is one of three that omits it (the other two being `TestStoreGatewayLazyExpandedPostingsSeriesFuzz` and `TestStoreGatewayLazyExpandedPostingsSeriesFuzzWithPrometheus`, which haven't surfaced as flakes in the recent 18-day CI window but share the same theoretical hole).

## Most recent occurrence

- Run: https://github.com/cortexproject/cortex/actions/runs/26265920573/job/77309556574?pr=7533
- Date: 2026-05-22T03:13:59Z
- Arch: `ubuntu-24.04-arm`, `arm64`, build tag `integration_query_fuzz`
- PR #7533 changes (ingester channel-close fix) are unrelated to the failing path.

### Failure excerpt

```
    query_fuzz_test.go:1963: case 1773 # of samples mismatch.
        range query: (
            topk without (status_code, series) (
              1,
              (({__name__="test_series_a"} or {__name__="test_series_b"}) % {__name__="test_series_a"})
            )
          > bool
            {__name__="test_series_b"}
        )
        res1: {job="test", series="0", status_code="502"} => …(~20 samples)…
              {job="test", series="1", status_code="502"} => …(~33 samples)…
        res2: {job="test", series="0", status_code="400"} => …(~26 samples)…
    parquet_querier_test.go:178: 
        Error:      finished query fuzzing tests
        Messages:   1 test cases failed
--- FAIL: TestParquetFuzz (30.87s)
```

`res1` and `res2` selected different tied series (different `status_code` values) because the inner sub-expression `(a or b) % a` produces value `0` for every surviving sample → `topk(1, …)` has no canonical winner.
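The effect can be reproduced with a small sketch of a first-wins selection. This is an illustration only, assuming a strict `>` comparison as described under "Root cause"; the `sample` type and `top1` function are hypothetical and are not the Prometheus implementation.

```go
package main

import "fmt"

// sample is a hypothetical (series label, value) pair at a single timestamp.
type sample struct {
	series string
	value  float64
}

// top1 mimics a topk(1)-style selection that replaces the current best only
// on a strictly greater value, so the first tied sample encountered wins.
func top1(in []sample) sample {
	best := in[0]
	for _, s := range in[1:] {
		if s.value > best.value {
			best = s
		}
	}
	return best
}

func main() {
	// Every value ties at 0, as with the `(a or b) % a` sub-expression.
	orderA := []sample{{`status_code="502"`, 0}, {`status_code="400"`, 0}}
	orderB := []sample{{`status_code="400"`, 0}, {`status_code="502"`, 0}}
	fmt.Println(top1(orderA).series)
	fmt.Println(top1(orderB).series)
}
```

The same tied inputs yield a different winner purely because of the order in which they are visited.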

## Empirical flake rate (last 18 days, ~245 CI runs across master + PRs)

| Test | PR-CI occurrences | Master-CI occurrences |
|------|-------------------|-----------------------|
| `TestParquetFuzz` (sample-count mismatch) | 5 | 0 |
| `integration_query_fuzz` job (any test) | 15 | 0 |

- `TestParquetFuzz` failure rate: ~2% of PR CI runs, 0% on master.
- Arch split for `TestParquetFuzz` failures: 4/5 arm64, 1/5 amd64. arm64 runners are 10–35% slower per fuzz test, which likely widens race / iteration-order divergence windows; parquet-go also has amd64-only SIMD (`vendor/github.com/parquet-go/parquet-go/*_amd64.s`), leaving arm64 on the pure-Go fallback. These are hypotheses for the skew; neither was independently proven in this investigation.
- PR #7499 (2026-05-11, "Fix flaky TestParquetFuzz") reduced the rate ~3% → ~1.4% but did not eliminate it — it fixed two **different** root causes (block-upload race; stdvar/stddev cross-version differences).

### Sample prior failures (same root cause)

- arm64, 2026-05-21: https://github.com/cortexproject/cortex/actions/runs/26174622955
- arm64, 2026-05-11 (on PR #7499 itself): https://github.com/cortexproject/cortex/actions/runs/25644799472
- amd64, 2026-05-04: https://github.com/cortexproject/cortex/actions/runs/25317431115

## Root cause

1. `integration/parquet_querier_test.go:172-175` passes only `WithEnabledFunctions(enabledFunctions)` to `promqlsmith.New(…)` — it does **not** pass `WithEnabledAggrs(enabledAggrs)`. Therefore the generator uses the promqlsmith default aggregator set (`vendor/github.com/cortexproject/promqlsmith/opts.go:22-35`), which includes `TOPK`, `BOTTOMK`, `COUNT_VALUES`, `STDDEV`, `STDVAR`, `QUANTILE`.
2. `isValidQuery(skipBackwardIncompat=true)` (called via `skipStdAggregations=true` by this test, see `integration/query_fuzz_test.go:1983-2020`) filters generated query *strings* containing `stddev`, `stdvar`, `quantile`, `limitk`, `limit_ratio` — but **not** `topk`, `bottomk`, or `count_values`.
3. `topk`/`bottomk` ties are resolved by the upstream Prometheus engine via strict `>` (see `vendor/github.com/prometheus/prometheus/promql/engine.go` `topkHeap` insertion), so the first-encountered tied series wins. Whether Cortex's parquet path and standalone Prometheus encounter the series in the same order depends on storage iteration; in practice they differ.
4. `sampleNumComparer` (`integration/query_fuzz_test.go:897-925`) compares only total Matrix sample count. Different tied winners → different downstream time coverage → different total counts.
5. Sibling test `TestParquetProjectionPushdownFuzz` is `t.Skip("Disabled due to flakiness")`; that one uses hardcoded queries (not `promqlsmith`), so its skip is for a separate reason — not addressed by this issue.
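The gap in step 2 can be illustrated with a sketch of a substring filter over generated query strings. The token list is the one quoted above; `filteredTokens` and `passesFilter` are hypothetical names, and the real `isValidQuery` may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// filteredTokens lists the substrings the issue says are rejected.
// Illustrative only; not copied from the real isValidQuery.
var filteredTokens = []string{"stddev", "stdvar", "quantile", "limitk", "limit_ratio"}

// passesFilter reports whether a generated query string contains none of
// the filtered tokens.
func passesFilter(query string) bool {
	for _, tok := range filteredTokens {
		if strings.Contains(query, tok) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(passesFilter(`stddev by (job) ({__name__="test_series_a"})`))
	fmt.Println(passesFilter(`topk without (status_code, series) (1, {__name__="test_series_a"})`))
}
```

Under these assumptions a `stddev` query is rejected while a `topk` query passes through to the comparison stage.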

## Proposed fix

**Primary** (~15 lines, matches sibling-test precedent):

Pass `promqlsmith.WithEnabledAggrs(enabledAggrs)` to the opts in `integration/parquet_querier_test.go:172-175`, e.g.:

```go
opts := []promqlsmith.Option{
    promqlsmith.WithEnabledFunctions(enabledFunctions),
    // Restrict generated aggregators to the shared allow-list, which excludes topk/bottomk.
    promqlsmith.WithEnabledAggrs(enabledAggrs),
}
```

This matches the pattern used by `TestNativeHistogramFuzz`, `TestExperimentalPromQLFuncsWithPrometheus`, `TestDisableChunkTrimmingFuzz`, `TestExpandedPostingsCacheFuzz`, `TestVerticalShardingFuzz`, `TestProtobufCodecFuzz`, `TestBackwardCompatibilityQueryFuzz`, `TestPrometheusCompatibilityQueryFuzz`, and `TestRW1vsRW2QueryFuzz`. `enabledAggrs` is already defined at `integration/query_fuzz_test.go:44-46` as `{SUM, MIN, MAX, AVG, GROUP, COUNT, QUANTILE}`.

### Trade-off accepted

- Loses random fuzz coverage of `topk`/`bottomk` (and `count_values`, `stddev`, `stdvar`) against the parquet path. Same trade-off already accepted by most other fuzz tests in the suite.
- If `topk`/`bottomk` coverage of parquet is desired in the future, write a deterministic dedicated test rather than expanding `sampleNumComparer`'s relaxation; the comparator cannot normalize away time-window coverage drift induced by tie-break choice without effectively re-implementing `topk` semantics.

### Why simpler/alternative fixes don't work

- **"Inject per-series epsilon in data generation"** (e.g. change `e2e/util.go:384` from `float64(i+j)` to `float64(i+j) + 1e-9*float64(i)`): does **not** fix this case, because the inner expression `({a} or {b}) % {a}` produces exact `0` for every surviving sample regardless of input values. This modulo yields ties whatever the inputs are.
- **"Strengthen `sampleNumComparer` to count per-series buckets"**: still does not normalize for time-window divergence between different chosen ties.
- **"Force `sortSeries=true` in parquet `Select`"**: a production behavior change made only to satisfy a test; off the table.
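A quick check of the epsilon argument, as a sketch. It assumes each surviving sample is effectively a value taken modulo itself, as the first bullet describes, and uses Go's `math.Mod` as a stand-in for PromQL's `%`. The helper name `selfModAllZero` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// selfModAllZero reports whether v % v is exactly 0 for every generated value
// v = float64(i+j) + eps*float64(i). The v == 0 case is skipped because
// math.Mod(0, 0) is NaN rather than 0.
func selfModAllZero(eps float64) bool {
	for i := 0; i < 20; i++ {
		for j := 0; j < 60; j++ {
			v := float64(i+j) + eps*float64(i)
			if v == 0 {
				continue
			}
			if math.Mod(v, v) != 0 {
				return false
			}
		}
	}
	return true
}

func main() {
	// Without and with the proposed per-series epsilon.
	fmt.Println(selfModAllZero(0), selfModAllZero(1e-9))
}
```

Both calls report `true`: perturbing the inputs changes the operands but not the result of the modulo, so the values fed to `topk` still tie.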

## Not addressed by this issue (separate flakes)

The same `integration_query_fuzz` job hits other flaky tests with **different** root causes; this issue should *not* try to subsume them. Each likely needs a separate report/fix:

| Test | Distinct root cause (rough) |
|------|------------------------------|
| `TestExpandedPostingsCacheFuzz` | Data-freshness race: `res1 = NaN`, `res2 = values` (one Cortex hasn't yet ingested the iteration's new push). |
| `TestPrometheusCompatibilityQueryFuzz` / `TestExperimentalPromQLFuncsWithPrometheus` | Error-string comparator: same error type with `[A,B]` vs `[B,A]` list order from non-deterministic map iteration in the error message. |
| `TestVerticalShardingFuzz` | Semantic divergence: `… or vector(…)` fallback fires in the unsharded engine but not all shards of the sharded engine. |
| `TestProtobufCodecFuzz` | Unknown / not enough samples to classify. |

(arm64 skew across the job is hypothesized to come from ~10–35% slower runners + amd64-only SIMD in `parquet-go`, which would widen race windows for all of the above. Not independently proven.)

## Acceptance criteria

- [ ] After the fix, no `TestParquetFuzz` `# of samples mismatch` failures are observed across a representative local-run sample (e.g. ≥200 iterations on each of arm64 and amd64; given the ~2% per-CI-run rate, fewer iterations are statistically inconclusive).
- [ ] No reduction in coverage of parquet-specific paths beyond the aggregators dropped under "Trade-off accepted" (the test still exercises parquet via SUM/MIN/MAX/AVG/GROUP/COUNT/QUANTILE aggregators and the full function set).
- [ ] `TestParquetProjectionPushdownFuzz` remains skipped (separate issue).
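
The iteration threshold in the first criterion can be sanity-checked with a short calculation. This sketch assumes each run is an independent trial that fails with probability 0.02 (the ~2% rate above); that independence is an assumption, and `probAllPass` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// probAllPass returns the probability that n independent runs all pass when
// each run fails with probability p.
func probAllPass(p float64, n int) float64 {
	return math.Pow(1-p, float64(n))
}

func main() {
	// Chance of seeing zero failures even though the flake is still present.
	for _, n := range []int{50, 100, 200} {
		fmt.Printf("n=%d: P(no failure) = %.3f\n", n, probAllPass(0.02, n))
	}
}
```

Under that assumption an unfixed test would still pass 50 runs in a row roughly 36% of the time and 100 runs roughly 13% of the time, while 200 clean runs would happen by chance only about 2% of the time.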

---

*Filed after a 3-round multi-agent investigation; full notes available on request.*

