fix(integration): compare query fuzz errors by class, not exact string by sandy2008 · Pull Request #7550 · cortexproject/cortex

sandy2008 · 2026-05-22T09:58:12Z

What this PR does

Replaces three strict cmp.Equal(tc.err1, tc.err2) call sites in integration/query_fuzz_test.go with a new sameErrorClass(err1, err2) helper that compares two Prometheus HTTP API errors by *promv1.Error.Type and a canonicalized message. Canonicalization sorts the contents of any [...] group inside an error message on top-level commas (brace-depth aware so labelset commas are preserved), so two errors that differ only in the order of a non-deterministic series list compare equal — while two genuinely different errors of the same Type still diverge.

Touches TestDisableChunkTrimmingFuzz, TestExpandedPostingsCacheFuzz, and the shared runQueryFuzzTestCases helper (used by ~ten other fuzz tests in this file).

Root cause

Full empirical census (occurrences, environments, exact failing log) is in issue #7546. In short: when a fuzz query exercises a vector-matching path such as vector(0) or test_series_a + test_series_b, the upstream PromQL engine emits an error of shape

execution: found duplicate series for the match group {series="2"} on the right hand-side of the operation:
  [{__name__="test_series_a", series="2"}, {series="2"}];
  many-to-many matching not allowed: matching labels must be unique on one side

The list inside [...] is built from Go-map iteration in vendor/github.com/prometheus/prometheus/promql/engine.go, so its order is non-deterministic. Cortex and the reference Prometheus produce the same error class with the same set of series, but the two entries can appear in either order across two processes. cmp.Equal(tc.err1, tc.err2) then fails on a pure string diff and the fuzz test reports a divergence that does not exist.

Fix

Two new helpers and one comparator in integration/query_fuzz_test.go:

splitTopLevelCommas(s string) []string — splits s on commas at brace depth 0. Commas inside {...} (Prometheus labelset literals) are preserved, so {a="1", b="2"} is one part, not two.
canonicalizeErrMsg(msg string) string — for every [...] group in msg, splits its contents with splitTopLevelCommas, trims, sorts, and rejoins with ", ". Groups with fewer than two parts are returned untouched.
sameErrorClass(err1, err2 error) bool — both-nil → true; one-nil → false; both *promv1.Error (via errors.As) → require equal Type and equal canonicalized Msg; otherwise fall back to comparing canonicalized Error() strings.

Then replace the three strict-string comparators:

integration/query_fuzz_test.go:408 (inside TestDisableChunkTrimmingFuzz)
integration/query_fuzz_test.go:644 (inside TestExpandedPostingsCacheFuzz)
integration/query_fuzz_test.go:1959 (inside runQueryFuzzTestCases, used by ~ten sibling fuzz tests)

A new TestSameErrorClass exercises the helper with nine subtests: both-nil, each-only-nil, typed-same-class-with-reordered-bracket-list, typed-different-classes, typed-same-class-with-materially-different-messages (must NOT mask), labelset-with-internal-comma (verifies brace-depth awareness), untyped-same-after-canonicalization, untyped-different. Runs locally; all subtests pass.

Why not other approaches

Skip / disable the affected fuzz tests. Hides the symptom without curing the root cause; the next non-deterministic-set message in a different code path would re-flake.
Wrap every brackets-in-message error site upstream to emit sorted output. A vendor-tree change to satisfy a test, plus would have to be re-applied on every Prometheus bump.
Compare only by Type and drop message comparison entirely. Too loose. A pure-Type comparator would mask cases like execution: division by zero vs execution: parse error: unexpected token — same class, materially different bug. The test plan deliberately includes that subtest as a guard.
Use a fuzzy text similarity (e.g. Levenshtein) on messages. Opaque threshold, opaque failure modes, opaque future maintenance. Bracket-list canonicalization is the minimum-surface fix that targets the actual non-determinism.

Trade-off worth flagging honestly: two errors with the same *promv1.Error.Type but materially different messages now have to both match before they are accepted. That is intentional — the comparator must not mask a real divergence between Cortex and Prometheus that happens to share an error category. The new TestSameErrorClass has an explicit subtest pinning this behaviour (same typed class, materially different messages (must not mask)).

Which issue(s) this PR fixes

Fixes #7546

Checklist

CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change.
Documentation updated — not applicable; no flags or config changed.
Tests: TestSameErrorClass (9 subtests) added next to the new helper, covering all branches including the "must not mask materially different messages" guard.

Test plan

Local (no Docker harness required):

gofmt -l integration/query_fuzz_test.go — clean
goimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go — clean
go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
go test -tags "netgo slicelabels integration integration_query_fuzz" -run "^TestSameErrorClass$" ./integration/... — PASS (9/9 subtests)

Validation that the flake is gone (requires Docker):

make ./cmd/cortex/.uptodate
go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=10 ./integration/... -run "^TestDisableChunkTrimmingFuzz$|^TestExpandedPostingsCacheFuzz$|^TestPrometheusCompatibilityQueryFuzz$" — expect 10/10 PASS on each. If any new failures surface they should be triaged as real divergences (not regressions from this PR).

Three call sites in integration/query_fuzz_test.go compared two Prometheus HTTP API errors with cmp.Equal(tc.err1, tc.err2), which is a strict byte-wise string compare on the underlying *promv1.Error.Msg. When a fuzz query exercised a vector-matching path, the upstream engine produced messages of shape execution: found duplicate series for the match group ... [{__name__="test_series_a", series="2"}, {series="2"}]; ... The contents of the [...] group are built from Go-map iteration in vendor/github.com/prometheus/prometheus/promql/engine.go, so the two entries can appear in either order between Cortex and the reference Prometheus, even though the error class and the underlying series set are identical. The strict compare then reported a divergence that did not exist. Issue cortexproject#7546 captures the full census. Fix: add sameErrorClass(err1, err2) backed by two helpers in the same file. canonicalizeErrMsg sorts the contents of every [...] group on top-level commas; splitTopLevelCommas is brace-depth aware so commas inside a Prometheus labelset literal ({a="1", b="2"}) are NOT split. When both errors unwrap to *promv1.Error via errors.As, the comparator requires equal Type AND equal canonicalized Msg; otherwise it falls back to comparing canonicalized Error() strings. Replace the three strict-string sites in TestDisableChunkTrimmingFuzz (~:408), TestExpandedPostingsCacheFuzz (~:644), and the shared runQueryFuzzTestCases helper (~:1959) used by ~ten other fuzz tests in this file. Add TestSameErrorClass with nine subtests, including the explicit "same typed class, materially different messages (must not mask)" guard — two errors with the same *promv1.Error.Type but truly different messages must still diverge, so the comparator cannot silently swallow a real Cortex-vs-Prometheus regression. Fixes cortexproject#7546 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

sandy2008 · 2026-05-22T10:51:29Z

Reopening to retrigger CI — the failing TestVerticalShardingFuzz is a pre-existing flake (tracked in #7547, fixed by #7551) and unrelated to this PR's error-comparator change.

Failing TestVerticalShardingFuzz on the prior CI run was the same pre-existing flake tracked in cortexproject#7547 / fixed by cortexproject#7551 — not a regression from this PR's error-comparator change. Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

pull-request-size Bot added the size/L label May 22, 2026

dosubot Bot added go Pull requests that update Go code type/tests labels May 22, 2026

sandy2008 closed this May 22, 2026

sandy2008 reopened this May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(integration): compare query fuzz errors by class, not exact string#7550

fix(integration): compare query fuzz errors by class, not exact string#7550
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix-promql-compat-error-comparator

sandy2008 commented May 22, 2026

Uh oh!

sandy2008 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sandy2008 commented May 22, 2026

What this PR does

Root cause

Fix

Why not other approaches

Which issue(s) this PR fixes

Checklist

Test plan

Uh oh!

sandy2008 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant