Skip to content

fix(integration): skip or vector(...) in fuzz query generator#7551

Open
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-vertical-sharding-fuzz-or-vector
Open

fix(integration): skip or vector(...) in fuzz query generator#7551
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix-vertical-sharding-fuzz-or-vector

Conversation

@sandy2008
Copy link
Copy Markdown
Contributor

@sandy2008 sandy2008 commented May 22, 2026

What this PR does

Stabilises the flaky TestVerticalShardingFuzz integration test by teaching the promqlsmith-driven fuzz harness in integration/query_fuzz_test.go to reject any randomly-generated expression whose AST contains a <lhs> or vector(<rhs>) sub-expression. A new hasOrVectorFallback(parser.Expr) bool predicate walks the parsed expression with parser.Inspect and matches exactly that shape; it is called from the shared isValidQuery gate. A small unit test TestHasOrVectorFallback (6 subtests, all pass locally) pins the predicate's behaviour. This is a test-side dodge: it removes a known false-positive shape from the random surface, it does not fix the underlying sharding semantic.

Root cause

The full empirical census (occurrences, environments, exact failing query and log) is in issue #7547. In short: when the generator emits an expression like

({a} or vector(0)) <op> ({b} or vector(0))

against block data that has partial time coverage for the LHS series, Cortex's vertically-sharded query path and the unsharded reference engine disagree on which timestamps trigger the vector(...) fallback. Each shard sees its own slice of series coverage and decides locally whether the LHS is empty at a given step; the unsharded path sees the union. The two engines then materialise the constant-vector fallback at different sample timestamps, so the final result diverges on a handful of points even though both answers are individually correct under their own model. This is the same class of analyser-side problem already known and tracked for absent / absent_over_time / scalar (see #5203, #5204, #5205): the Thanos PromQL analyser does not classify these "fallback-to-constant-when-empty" shapes as non-shardable, so the planner shards them when it should not.

The proper fix is upstream: the Thanos analyser needs to recognise or vector(...) as non-shardable, in the same family as the existing absent / scalar exclusions. That work is tracked separately and is out of scope for this PR.

Fix

One new AST-walker helper, one unit test, and one call site in integration/query_fuzz_test.go:

  1. hasOrVectorFallback(expr parser.Expr) bool — uses parser.Inspect to walk the expression tree and return true if any *parser.BinaryExpr with Op == parser.LOR has a *parser.Call on its RHS whose Func.Name is "vector". Brace / paren depth and surrounding context don't matter — the AST gives an exact structural match, so it does not over-fire on incidental substrings (e.g. a label value of "vector(" or a top-level bare vector(0)).
  2. TestHasOrVectorFallback — six subtests pinning the predicate: up (no), vector(1) (no, bare call is fine), up or up (no, or without vector), up or vector(1) (yes), (sum(rate(up[1m])) == bool 0) or vector(0) (yes, deep LHS), and a simplified form of the actual failing query from Flaky test: TestVerticalShardingFuzz — sharded vs unsharded "or vector(…)" semantic divergence #7547 ((sum without (job) (stddev_over_time(up[4m]) == -up)) or vector(2.0099) — yes). All 6 pass locally.
  3. Inside isValidQuery, a if hasOrVectorFallback(generatedQuery) { return false } early-return is added next to the existing "limitk" / "limit_ratio" / "--" filters that run for every caller.

isValidQuery is the shared gate used by all eight fuzz tests in this file (TestVerticalShardingFuzz, TestRW1vsRW2QueryFuzz, TestExpandedPostingsCacheFuzz, TestBackwardCompatibilityQueryFuzz, TestPrometheusCompatibilityQueryFuzz, TestNativeHistogramFuzz, TestDisableChunkTrimmingFuzz, TestProtobufCodecFuzz, TestExperimentalPromQLFuncsWithPrometheus), invoked via runQueryFuzzTestCases and the two direct call sites at integration/query_fuzz_test.go:387 and :554. The new filter therefore applies to every fuzz caller, not just TestVerticalShardingFuzz — see "Trade-off" below.

Why an AST predicate, not a substring filter

An earlier draft of this PR used strings.Contains(queryStr, "vector(") to flag the offending shape. Round-2 review flagged that as too broad and the implementation was switched to the AST walker. The reasoning, made explicit so it survives future edits:

  • A substring scan fires on every occurrence of vector( — including the perfectly-shardable bare vector(0) at the top level, sum(vector(1)), vector(1) + on() vector(2), etc. Only <lhs> or vector(<rhs>) actually diverges between the sharded and unsharded engines; the substring filter drops orders of magnitude more random surface than necessary and silently shrinks coverage of legitimate vector(...) shapes.
  • A substring scan also false-positives on incidental text in label literals (e.g. a label value of "vector(" in {x="vector("}), which promqlsmith can in principle produce. The AST predicate is immune by construction.
  • The AST walker is ~15 lines of straightforward code: one parser.Inspect, one BinaryExpr/LOR check, one Call/vector check. The TestHasOrVectorFallback subtests demonstrate it matches exactly the intended shape and nothing else. This is the same level of rigour the surrounding filters maintain.

Net: the AST predicate drops only the genuinely-problematic shape and leaves bare vector(...) calls in the random surface, while remaining as cheap to maintain as a substring scan.

Why not other approaches

  • Add a real fix in the Thanos PromQL analyser so or vector(...) is classified as non-shardable. This is the right long-term fix and the issue notes it is tracked separately. It is a vendor-tree change that would have to be re-applied on every Thanos / Prometheus bump, and it is materially larger than a test-side filter. Doing it here would conflate "stop the flake" with "change sharding behaviour"; better to land them as separate, reviewable changes.
  • Skip / disable TestVerticalShardingFuzz outright. Removes the signal entirely. The test still has value for the rest of the fuzz surface (binary ops, aggregations, range vector functions, label matchers); only the or vector(...) shape is currently a known false positive.
  • Loosen the result comparator to tolerate timestamp-shifted vector-fallback divergences. Either masks real bugs or has to encode the sharding semantic into the comparator, which is exactly the analyser-side complexity we want to keep out of the test harness.
  • Per-caller filter list (apply the predicate only inside TestVerticalShardingFuzz). Duplicates plumbing for a shape that will become moot the moment the Thanos analyser is fixed; see "Trade-off" for why a global filter is accepted here.

Trade-off

isValidQuery is global — the new filter applies to every fuzz test that calls it, not just TestVerticalShardingFuzz. The biggest collateral is TestRW1vsRW2QueryFuzz, which exercises two isolated Cortex instances against the same data and does not involve sharding; an or vector(...) expression there would in principle still be a valid divergence-detection target. The same is true to a lesser degree for TestBackwardCompatibilityQueryFuzz, TestPrometheusCompatibilityQueryFuzz, and the other six fuzz callers.

This is accepted for the reasons spelled out in issue #7547: (a) the empirical evidence points at the sharded vs unsharded path as the only context where this shape currently diverges; (b) the next-most-likely caller, TestRW1vsRW2QueryFuzz, runs two identically-configured Cortex instances and is unlikely to surface a new divergence on this shape without the upstream analyser already disagreeing; (c) random fuzz coverage of or vector(...) is the only thing dropped — deterministic dedicated tests for the fallback behaviour can be added later if wanted; and (d) the alternative (a per-caller filter list) duplicates plumbing for a shape that will become moot the moment the Thanos analyser is fixed.

Which issue(s) this PR fixes

Fixes #7547

Checklist

  • CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change.
  • Documentation updated — not applicable; no flags or config changed.
  • Tests: TestHasOrVectorFallback (6 subtests) added next to the new predicate, covering the no-op shapes (up, vector(1), up or up), the basic positive (up or vector(1)), a deep-LHS positive, and a simplified form of the actual failing query from Flaky test: TestVerticalShardingFuzz — sharded vs unsharded "or vector(…)" semantic divergence #7547. All 6 pass locally.

Test plan

Local (no Docker harness required):

  • gofmt -l integration/query_fuzz_test.go — clean
  • goimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go — clean
  • go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
  • go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
  • go test -tags "netgo slicelabels integration integration_query_fuzz" -run "^TestHasOrVectorFallback$" ./integration/... — PASS (6/6 subtests)

Validation that the flake is gone (requires Docker):

  • make ./cmd/cortex/.uptodate
  • go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=10 ./integration/... -run "^TestVerticalShardingFuzz$" — expect 10/10 PASS. If new failures surface on other fuzz tests (TestRW1vsRW2QueryFuzz is the most likely) they should be triaged as real divergences, not regressions from this PR.

TestVerticalShardingFuzz flakes when promqlsmith emits an expression
containing `<lhs> or vector(<rhs>)` against block data with partial
time coverage on the LHS: Cortex's vertically-sharded query path and
the unsharded reference engine disagree on which timestamps trigger
the `vector(...)` fallback, because each shard decides locally
whether the LHS is empty at a given step while the unsharded path
sees the union. Both answers are individually correct under their
own model, but the final results diverge on a handful of points.
Same analyser-side family as the known `absent` / `absent_over_time`
/ `scalar` cases (cortexproject#5203, cortexproject#5204, cortexproject#5205): the Thanos PromQL analyser
does not classify `or vector(...)` as non-shardable. Full empirical
census in cortexproject#7547.

Add hasOrVectorFallback(parser.Expr), an AST walker that uses
parser.Inspect to detect any *parser.BinaryExpr with Op == parser.LOR
whose RHS is a *parser.Call to the `vector` function. Call it from
isValidQuery alongside the existing `limitk` / `limit_ratio` / `--`
filters so every fuzz caller (TestVerticalShardingFuzz plus the
seven other fuzz tests that go through runQueryFuzzTestCases) skips
the offending shape.

An earlier draft used `strings.Contains(queryStr, "vector(")`. Round-2
review flagged that as too broad: a substring scan fires on bare
`vector(0)`, `sum(vector(1))`, `vector(1) + on() vector(2)`, and even
on incidental `vector(` text inside label literals — none of which
diverge between the sharded and unsharded engines. The AST predicate
matches only `<lhs> or vector(<rhs>)`, leaving the rest of the random
surface intact.

Add TestHasOrVectorFallback with six subtests pinning the predicate:
`up` (no), `vector(1)` (no — bare call is fine), `up or up` (no — `or`
without `vector`), `up or vector(1)` (yes), a deep-LHS positive
(`(sum(rate(up[1m])) == bool 0) or vector(0)`), and a simplified form
of the actual failing query from cortexproject#7547. All six pass locally.

This is a test-side dodge, not a sharding-engine fix. The proper
upstream change — teaching the Thanos analyser to recognise
`or vector(...)` as non-shardable, mirroring the existing
`absent` / `scalar` exclusions — is tracked separately.

Trade-off: isValidQuery is global to all fuzz callers, so random
fuzz coverage of `or vector(...)` is dropped everywhere, not just
in TestVerticalShardingFuzz. Accepted for the reasons in cortexproject#7547:
the empirical signal points at the sharded vs unsharded path as
the only current divergence context, and the filter becomes moot
the moment the upstream analyser is fixed.

Fixes cortexproject#7547

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
sandy2008 added a commit to sandy2008/cortex-1 that referenced this pull request May 22, 2026
Failing TestVerticalShardingFuzz on the prior CI run was the same
pre-existing flake tracked in cortexproject#7547 / fixed by cortexproject#7551 — not a
regression from this PR's error-comparator change.

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestVerticalShardingFuzz — sharded vs unsharded "or vector(…)" semantic divergence

1 participant