Skip to content

fix(integration): compare query fuzz errors by class, not exact string#7550

Open
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix-promql-compat-error-comparator
Open

fix(integration): compare query fuzz errors by class, not exact string#7550
sandy2008 wants to merge 2 commits into
cortexproject:masterfrom
sandy2008:fix-promql-compat-error-comparator

Conversation

@sandy2008
Copy link
Copy Markdown
Contributor

What this PR does

Replaces three strict cmp.Equal(tc.err1, tc.err2) call sites in integration/query_fuzz_test.go with a new sameErrorClass(err1, err2) helper that compares two Prometheus HTTP API errors by *promv1.Error.Type and a canonicalized message. Canonicalization sorts the contents of any [...] group inside an error message on top-level commas (brace-depth aware so labelset commas are preserved), so two errors that differ only in the order of a non-deterministic series list compare equal — while two genuinely different errors of the same Type still diverge.

Touches TestDisableChunkTrimmingFuzz, TestExpandedPostingsCacheFuzz, and the shared runQueryFuzzTestCases helper (used by ~ten other fuzz tests in this file).

Root cause

Full empirical census (occurrences, environments, exact failing log) is in issue #7546. In short: when a fuzz query exercises a vector-matching path such as vector(0) or test_series_a + test_series_b, the upstream PromQL engine emits an error of shape

execution: found duplicate series for the match group {series="2"} on the right hand-side of the operation:
  [{__name__="test_series_a", series="2"}, {series="2"}];
  many-to-many matching not allowed: matching labels must be unique on one side

The list inside [...] is built from Go-map iteration in vendor/github.com/prometheus/prometheus/promql/engine.go, so its order is non-deterministic. Cortex and the reference Prometheus produce the same error class with the same set of series, but the two entries can appear in either order across two processes. cmp.Equal(tc.err1, tc.err2) then fails on a pure string diff and the fuzz test reports a divergence that does not exist.

Fix

Two new helpers and one comparator in integration/query_fuzz_test.go:

  1. splitTopLevelCommas(s string) []string — splits s on commas at brace depth 0. Commas inside {...} (Prometheus labelset literals) are preserved, so {a="1", b="2"} is one part, not two.
  2. canonicalizeErrMsg(msg string) string — for every [...] group in msg, splits its contents with splitTopLevelCommas, trims, sorts, and rejoins with ", ". Groups with fewer than two parts are returned untouched.
  3. sameErrorClass(err1, err2 error) bool — both-nil → true; one-nil → false; both *promv1.Error (via errors.As) → require equal Type and equal canonicalized Msg; otherwise fall back to comparing canonicalized Error() strings.

Then replace the three strict-string comparators:

  • integration/query_fuzz_test.go:408 (inside TestDisableChunkTrimmingFuzz)
  • integration/query_fuzz_test.go:644 (inside TestExpandedPostingsCacheFuzz)
  • integration/query_fuzz_test.go:1959 (inside runQueryFuzzTestCases, used by ~ten sibling fuzz tests)

A new TestSameErrorClass exercises the helper with nine subtests: both-nil, each-only-nil, typed-same-class-with-reordered-bracket-list, typed-different-classes, typed-same-class-with-materially-different-messages (must NOT mask), labelset-with-internal-comma (verifies brace-depth awareness), untyped-same-after-canonicalization, untyped-different. Runs locally; all subtests pass.

Why not other approaches

  • Skip / disable the affected fuzz tests. Hides the symptom without curing the root cause; the next non-deterministic-set message in a different code path would re-flake.
  • Wrap every brackets-in-message error site upstream to emit sorted output. A vendor-tree change to satisfy a test, plus would have to be re-applied on every Prometheus bump.
  • Compare only by Type and drop message comparison entirely. Too loose. A pure-Type comparator would mask cases like execution: division by zero vs execution: parse error: unexpected token — same class, materially different bug. The test plan deliberately includes that subtest as a guard.
  • Use a fuzzy text similarity (e.g. Levenshtein) on messages. Opaque threshold, opaque failure modes, opaque future maintenance. Bracket-list canonicalization is the minimum-surface fix that targets the actual non-determinism.

Trade-off worth flagging honestly: two errors with the same *promv1.Error.Type but materially different messages now have to both match before they are accepted. That is intentional — the comparator must not mask a real divergence between Cortex and Prometheus that happens to share an error category. The new TestSameErrorClass has an explicit subtest pinning this behaviour (same typed class, materially different messages (must not mask)).

Which issue(s) this PR fixes

Fixes #7546

Checklist

  • CHANGELOG.md updated — not applicable; test-only change with no user-facing behaviour change.
  • Documentation updated — not applicable; no flags or config changed.
  • Tests: TestSameErrorClass (9 subtests) added next to the new helper, covering all branches including the "must not mask materially different messages" guard.

Test plan

Local (no Docker harness required):

  • gofmt -l integration/query_fuzz_test.go — clean
  • goimports -local github.com/cortexproject/cortex -l integration/query_fuzz_test.go — clean
  • go vet -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
  • go build -tags "netgo slicelabels integration integration_query_fuzz" ./integration/... — clean
  • go test -tags "netgo slicelabels integration integration_query_fuzz" -run "^TestSameErrorClass$" ./integration/... — PASS (9/9 subtests)

Validation that the flake is gone (requires Docker):

  • make ./cmd/cortex/.uptodate
  • go test -v -tags=integration,requires_docker,integration_query_fuzz -timeout 2400s -count=10 ./integration/... -run "^TestDisableChunkTrimmingFuzz$|^TestExpandedPostingsCacheFuzz$|^TestPrometheusCompatibilityQueryFuzz$" — expect 10/10 PASS on each. If any new failures surface they should be triaged as real divergences (not regressions from this PR).

Three call sites in integration/query_fuzz_test.go compared two
Prometheus HTTP API errors with cmp.Equal(tc.err1, tc.err2), which is
a strict byte-wise string compare on the underlying *promv1.Error.Msg.
When a fuzz query exercised a vector-matching path, the upstream
engine produced messages of shape

  execution: found duplicate series for the match group ...
    [{__name__="test_series_a", series="2"}, {series="2"}]; ...

The contents of the [...] group are built from Go-map iteration in
vendor/github.com/prometheus/prometheus/promql/engine.go, so the two
entries can appear in either order between Cortex and the reference
Prometheus, even though the error class and the underlying series set
are identical. The strict compare then reported a divergence that did
not exist. Issue cortexproject#7546 captures the full census.

Fix: add sameErrorClass(err1, err2) backed by two helpers in the same
file. canonicalizeErrMsg sorts the contents of every [...] group on
top-level commas; splitTopLevelCommas is brace-depth aware so commas
inside a Prometheus labelset literal ({a="1", b="2"}) are NOT split.
When both errors unwrap to *promv1.Error via errors.As, the
comparator requires equal Type AND equal canonicalized Msg; otherwise
it falls back to comparing canonicalized Error() strings. Replace
the three strict-string sites in TestDisableChunkTrimmingFuzz
(~:408), TestExpandedPostingsCacheFuzz (~:644), and the shared
runQueryFuzzTestCases helper (~:1959) used by ~ten other fuzz tests
in this file.

Add TestSameErrorClass with nine subtests, including the explicit
"same typed class, materially different messages (must not mask)"
guard — two errors with the same *promv1.Error.Type but truly
different messages must still diverge, so the comparator cannot
silently swallow a real Cortex-vs-Prometheus regression.

Fixes cortexproject#7546

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@dosubot dosubot Bot added go Pull requests that update Go code type/tests labels May 22, 2026
@sandy2008
Copy link
Copy Markdown
Contributor Author

Reopening to retrigger CI — the failing TestVerticalShardingFuzz is a pre-existing flake (tracked in #7547, fixed by #7551) and unrelated to this PR's error-comparator change.

@sandy2008 sandy2008 closed this May 22, 2026
@sandy2008 sandy2008 reopened this May 22, 2026
Failing TestVerticalShardingFuzz on the prior CI run was the same
pre-existing flake tracked in cortexproject#7547 / fixed by cortexproject#7551 — not a
regression from this PR's error-comparator change.

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go Pull requests that update Go code size/L type/tests

Projects

None yet

1 participant