feat(playwright): add flaky detection via multi-process orchestration#85
Conversation
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 ApprovalWaiting for
This rule is failing.
🔴 🔎 ReviewsWaiting for
This rule is failing.
🟢 Continuous IntegrationWonderful, this rule succeeded.
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 📕 PR descriptionWonderful, this rule succeeded.
|
af4cad1 to
df915d5
Compare
Revision history
|
df915d5 to
ecbb059
Compare
There was a problem hiding this comment.
Pull request overview
Adds a multi-process Playwright flaky-detection “phase 2” rerun orchestration path (gated by _MERGIFY_TEST_NEW_FLAKY_DETECTION) that enriches the shared state file, reruns candidate tests in a subprocess, aggregates outcomes, and annotates test-case spans + prints a summary.
Changes:
- Rename and generalize the canonical test key builder to
buildTestKey(used for both quarantine and flaky detection). - Extend state-file schema + validation to carry flaky-detection context/mode/candidates/deadlines, and enrich candidates on
reporter.onBegin. - Implement phase-2 rerun subprocess + JSONL outcome collection/aggregation, plus add unit + integration test coverage and README docs.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/playwright/src/reporter.ts | Implements rerun-mode JSONL writing, state enrichment, phase-2 subprocess reruns, aggregation, span enrichment, and summary printing. |
| packages/playwright/src/state-file.ts | Expands state schema to include flaky-detection fields and validates/strips malformed optional fields. |
| packages/playwright/src/global-setup.ts | Fetches flaky-detection context (feature-flagged) and persists flakyContext + flakyMode into the state file. |
| packages/playwright/src/fixture.ts | Extends quarantine absorption to also include unhealthy_test_names in unhealthy mode. |
| packages/playwright/src/utils.ts | Renames buildQuarantineKey → buildTestKey and updates docs accordingly. |
| packages/playwright/README.md | Documents flaky-detection preview behavior, modes, attributes, and env vars. |
| packages/core/src/flaky-detection.ts | Exposes mode, candidates, and perTestDeadlineMs as readonly public fields; refactors candidate/budget computation into a static helper. |
| packages/core/tests/flaky-detection.test.ts | Adds test coverage for the new public readonly fields on FlakyDetector. |
| packages/playwright/tests/reporter.test.ts | Adds unit tests for state enrichment, annotation parsing, and summary output. |
| packages/playwright/tests/state-file.test.ts | Adds round-trip/validation tests for new flaky-detection state-file fields. |
| packages/playwright/tests/utils.test.ts | Updates tests to use buildTestKey. |
| packages/playwright/tests/global-setup.test.ts | Adds tests for feature-flagged flaky context fetching and mode selection. |
| packages/playwright/tests/integration/run.test.ts | Adds end-to-end integration coverage for unhealthy/new flaky-detection modes. |
| packages/playwright/tests/fixtures/playwright.config.ts | Makes fixture testDir configurable via PW_FIXTURE_DIR for integration tests. |
| packages/playwright/tests/fixtures/tests-unhealthy/sample.spec.ts | Adds a deterministic flaky fixture spec used by integration tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Build a grep regex from candidate test titles. We escape regex | ||
| // metacharacters and join with `|`. Anchoring isn't necessary because | ||
| // Playwright matches against the full test path; substring match is | ||
| // sufficient and robust to project-name prefixes. | ||
| const titles = [...this.phase1Outcomes.keys()] | ||
| .map((k) => k.split(' > ').pop() ?? k) | ||
| .map((s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')); | ||
| const grepPattern = `(${titles.join('|')})`; |
There was a problem hiding this comment.
Partial pushback. Playwright --grep matches the joined title path (describes + title), not the file path, so adding the filepath to the regex would not narrow the match. The aggregation step filters JSONL entries by full-key membership, so non-candidate reruns produce JSONL lines that are dropped — wasted CI time, not incorrect results. Tighter scoping requires switching from --grep to a list-then-filter approach, which is V2 scope. Documented the limitation in the code comment for now.
ecbb059 to
ec7a552
Compare
ec7a552 to
70a39b9
Compare
Phase 1 runs Playwright normally and records candidate-test outcomes via the existing reporter pipeline. The quarantine fixture is extended to also absorb tests listed in the API's `unhealthy_test_names` so phase-1 failures of known-flaky tests don't break the build. After phase 1, the reporter spawns a single Playwright subprocess targeting all candidates (`--grep '<regex>' --repeat-each=N`) and writes per-attempt outcomes to a JSONL file. The main reporter aggregates phase-1 + phase-2 outcomes, emits `cicd.test.flaky_detection` / `.new` / `.flaky` / `.rerun_count` attributes on the consolidated test-case spans, and prints a summary. Each phase-2 rerun is a fresh Playwright invocation, so all fixtures (including user-defined `test.extend(...)` ones) re-initialize between attempts — matching Playwright's normal per-test isolation guarantees. This avoids the in-test body-wrapping approach we initially attempted, which broke `testInfo.file` resolution because the body lived in a wrapper function instead of the user's spec file. Activation is gated by `_MERGIFY_TEST_NEW_FLAKY_DETECTION=true`. Mode is inferred from `vcs.ref.base.name` (PR-like → `new`, push → `unhealthy`). Fixes MRGFY-7017 Change-Id: I8f7ae58a8ea95ff5ddabddd99e769324fcb4079d
70a39b9 to
38b39bc
Compare
Phase 1 runs Playwright normally and records candidate-test outcomes via
the existing reporter pipeline. The quarantine fixture is extended to
also absorb tests listed in the API's
unhealthy_test_namesso phase-1failures of known-flaky tests don't break the build. After phase 1, the
reporter spawns a single Playwright subprocess targeting all candidates
(
--grep '<regex>' --repeat-each=N) and writes per-attempt outcomes toa JSONL file. The main reporter aggregates phase-1 + phase-2 outcomes,
emits
cicd.test.flaky_detection/.new/.flaky/.rerun_countattributes on the consolidated test-case spans, and prints a summary.
Each phase-2 rerun is a fresh Playwright invocation, so all fixtures
(including user-defined
test.extend(...)ones) re-initialize betweenattempts — matching Playwright's normal per-test isolation guarantees.
This avoids the in-test body-wrapping approach we initially attempted,
which broke
testInfo.fileresolution because the body lived in awrapper function instead of the user's spec file.
Activation is gated by
_MERGIFY_TEST_NEW_FLAKY_DETECTION=true. Mode isinferred from
vcs.ref.base.name(PR-like →new, push →unhealthy).Fixes MRGFY-7017