diff --git a/docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md b/docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md
new file mode 100644
index 000000000..e4a6a895a
--- /dev/null
+++ b/docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md
@@ -0,0 +1,135 @@
+---
+date: 2026-03-17
+topic: user-test-self-eval-loop
+---
+
+# User-Test Self-Eval Loop
+
+## What We're Building
+
+A closed-loop self-evaluation system for the `user-test` skill. After each testing session, a separate `/user-test-eval` command grades the skill's output against a fixed set of binary evals, records scores, and proposes one targeted mutation to the skill's instructions. The human reviews and accepts/rejects. Over time this produces a durable research artifact — a history of what was tried, what improved signal, and what didn't.
+
+## Why This Approach
+
+The auto-research pattern (run → eval → mutate → run again) applies to the user-test skill, but two constraints shape the design:
+
+1. **Skill first, queries second.** The skill has known structural issues (probe execution order violations, Proven regression conflation, P1 item burial). These corrupt signal — optimizing queries through a miscalibrated instrument produces noise. Fix the instrument first, validate it holds, then turn it on query optimization.
+
+2. **Semi-automated, not autonomous.** Full autonomous mutation (run every 2 minutes, keep winner) risks unreviewed prompt drift. The skill is complex enough (SKILL.md + 14 reference files, schema v9) that mutations need human review. The friction cost of review is low; the risk of unreviewed drift is high.
+
+## Key Decisions
+
+### Eval runs as a separate command, not inside the skill
+
+- **Decision:** New `/user-test-eval` command (Option 2), not a Phase 5 inside the skill (Option 1) or added to `/user-test-commit` (Option 3).
+- **Rationale:** Same context window grading its own output is the exact failure mode we've already seen — structurally correct reports that technically satisfy format requirements while burying findings. Separate invocation context = harder to game. `/user-test-commit` is already doing post-processing; coupling eval logic there mixes "did this run complete" with "is the skill producing good outputs over time" — different questions on different timescales.
+
+### Eval reads both JSON and rendered report
+
+- **Decision:** `/user-test-eval` reads `.user-test-last-run.json` AND the rendered report output.
+- **Rationale:** The presentation layer is where actual failures occur. A P1 item technically present in JSON but buried in report formatting is a real failure. Grading JSON alone misses the class of problems that have been the persistent issue.
+
+### Two artifacts: scores in JSON, reasoning in markdown
+
+- **Decision:** `skill-evals.json` for score history; `skill-mutations.md` for proposed changes and accept/reject log.
+- **Rationale:** Scores need to be parseable by future runs. Mutation proposals need to be readable and editable by humans. Different purposes, different formats. `skill-mutations.md` becomes the durable research artifact — the "big list of things tried" that is the most underrated output of the whole process.
+
+### Start with exactly 3 binary evals
+
+- **Decision:** 3 evals, not more. Expand only after these are stable.
+- **Rationale:** Too many evals invite reward hacking — the agent finding ways to technically pass all checks without improving quality. Three is tight enough to avoid gaming, broad enough to cover three distinct failure layers.
+
+## The Binary Eval Set
+
+### Eval 1: Probe Execution Order (protocol layer)
+
+**Question:** "Did all failing/untested probes in each area execute before broad exploration began?"
+
+- **Grading:** Yes/no per area. Overall FAIL if any area violated.
+- **Tests:** Whether the agent followed the probe-first protocol, which exists because probes are the highest-signal checks and broad exploration can mask their results.
+- **Known failure mode:** Agent exploring broadly first, then running probes in whatever order, reducing probe signal quality.
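+
+A minimal sketch of how this check could be graded, assuming the run exposes an ordered per-area list of step types. The `eventsByArea` shape and the `"probe"` / `"explore"` labels are hypothetical and not part of the current run JSON; only `pass` and `areas_violated` mirror fields used in `skill-evals.json`.
+
+```javascript
+// Hypothetical input: for each area, the ordered step types the agent executed.
+function gradeProbeOrder(eventsByArea) {
+  const areasViolated = [];
+  for (const [area, events] of Object.entries(eventsByArea)) {
+    const firstExplore = events.indexOf("explore");
+    const lastProbe = events.lastIndexOf("probe");
+    // Violation: at least one probe ran after broad exploration had started.
+    if (firstExplore !== -1 && lastProbe > firstExplore) {
+      areasViolated.push(area);
+    }
+  }
+  return { pass: areasViolated.length === 0, areas_violated: areasViolated };
+}
+
+const probeOrder = gradeProbeOrder({
+  login: ["probe", "probe", "explore"],
+  cart: ["explore", "probe"],
+});
+console.log(probeOrder);
+```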
+
+### Eval 2: Proven Regression Reasoning (reasoning layer)
+
+**Question:** "Did the report distinguish between 'new bug in Proven area' and 'area no longer meets Proven criteria'?"
+
+- **Grading:** PASS if these are treated as categorically different events. FAIL if all regressions are treated as the same type.
+- **Tests:** Whether the agent understood that a Proven area failing is categorically different from a Known-bug area failing — not just a score change but a status change with different implications.
+- **Known failure mode:** Agent filing bugs and updating scores without surfacing that a Proven regression is a different class of event. Treating all regressions uniformly.
+
+### Eval 3: P1 Surfacing (presentation layer)
+
+**Question:** "Did every P1 item (active probe failure OR new bug) appear in the NEEDS ACTION section, not only in DETAILS?"
+
+- **Grading:** PASS if every P1 item is in NEEDS ACTION. FAIL if any P1 item appears only in DETAILS.
+- **Tests:** Whether the report's summary layer actually surfaces the most important findings, or buries them in structural completeness.
+- **Known failure mode:** Structurally correct reports where P1 items exist in the data but don't surface to the section the human actually reads and acts on.
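+
+The P1 surfacing check reduces to a set-membership test. The sketch below assumes the rendered report has already been parsed into section lists of item IDs; that parsed shape (`needsAction`, `details`) and the `missing` field are illustrative assumptions, while `p1_count` and `surfaced_count` follow the `skill-evals.json` example.
+
+```javascript
+// Hypothetical parsed report: each section is a list of item IDs.
+function gradeP1Surfacing(p1Ids, report) {
+  const surfaced = new Set(report.needsAction);
+  // A P1 item that appears only in DETAILS counts as buried.
+  const missing = p1Ids.filter((id) => !surfaced.has(id));
+  return {
+    pass: missing.length === 0,
+    p1_count: p1Ids.length,
+    surfaced_count: p1Ids.length - missing.length,
+    missing,
+  };
+}
+
+const p1Result = gradeP1Surfacing(["bug-12", "probe-3"], {
+  needsAction: ["bug-12"],
+  details: ["bug-12", "probe-3"],
+});
+console.log(p1Result);
+```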
+
+## Artifact Locations
+
+```
+tests/user-flows/
+  skill-evals.json        # Score history per run
+  skill-mutations.md      # Proposed diffs + accept/reject log
+```
+
+### skill-evals.json structure
+
+```json
+{
+  "evals": [
+    {
+      "run_timestamp": "2026-03-17T14:30:00Z",
+      "git_sha": "abc1234",
+      "skill_version": "2.51.0",
+      "test_file": "resale-clothing.md",
+      "results": {
+        "probe_execution_order": { "pass": true, "areas_violated": [] },
+        "proven_regression_reasoning": { "pass": false, "detail": "Login area regressed from Proven but report filed bug without noting status change" },
+        "p1_surfacing": { "pass": true, "p1_count": 2, "surfaced_count": 2 }
+      },
+      "overall_pass": false,
+      "proposed_mutation": "Clarify Phase 4 to require explicit 'Proven → Regressed' status callout when a Proven area scores below pass_threshold"
+    }
+  ]
+}
+```
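+
+One way `overall_pass` could be derived from `results` rather than written by hand, so the two cannot drift apart. This is a sketch; the `summarize` helper and its `score` string are assumptions, not part of the proposed schema.
+
+```javascript
+// Derive the overall verdict and an "n/m" score from per-eval results.
+function summarize(results) {
+  const names = Object.keys(results);
+  const passed = names.filter((name) => results[name].pass === true);
+  return {
+    overall_pass: names.length > 0 && passed.length === names.length,
+    score: `${passed.length}/${names.length}`,
+  };
+}
+
+const summary = summarize({
+  probe_execution_order: { pass: true, areas_violated: [] },
+  proven_regression_reasoning: { pass: false, detail: "status change not noted" },
+  p1_surfacing: { pass: true, p1_count: 2, surfaced_count: 2 },
+});
+console.log(summary);
+```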
+
+### skill-mutations.md structure
+
+```markdown
+# Skill Mutations Log
+
+## Mutation 1 — 2026-03-17
+
+**Triggered by:** Eval 2 failure (Proven regression reasoning)
+**Eval scores:** 2/3 pass (probe order: PASS, regression reasoning: FAIL, P1 surfacing: PASS)
+**Proposed change:** Add explicit instruction in Phase 4 scoring section: "When a Proven area scores below pass_threshold, the report MUST include a 'Proven Regression' callout distinct from any bug filing. This is a status change, not just a score change."
+**Diff:** [specific lines in SKILL.md or reference file]
+**Status:** PENDING | ACCEPTED | REJECTED
+**Outcome after acceptance:** [filled in after next run]
+```
+
+## Scope Boundaries
+
+**In scope:**
+- `/user-test-eval` command that grades last run against 3 binary evals
+- `skill-evals.json` for score persistence
+- `skill-mutations.md` for mutation proposals and history
+- One mutation proposal per eval run (not one per failing eval)
+
+**Out of scope (for now):**
+- Autonomous mutation (no auto-editing SKILL.md)
+- Query-level optimization (comes after skill evals are stable)
+- More than 3 evals (expand only when current set is consistently passing)
+- Integration with `/user-test-commit` (eval stays independent)
+
+## Open Questions
+
+- Should `/user-test-eval` auto-run after `/user-test-commit`, or stay fully manual? Leaning manual to keep the separation clean, but convenience might win.
+- Where exactly do `skill-evals.json` and `skill-mutations.md` live — in `tests/user-flows/` (alongside test files) or in the skill directory itself? The skill directory is plugin-managed; `tests/user-flows/` is project-local.
+- When evals are consistently passing (say, 5 consecutive runs all pass), what's the trigger to add a 4th eval or shift to query optimization?
+
+## Next Steps
+
+-> `/workflows:plan` for implementation details (the `/user-test-eval` command, artifact formats, eval logic)
diff --git a/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md b/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md
new file mode 100644
index 000000000..834a6ab7e
--- /dev/null
+++ b/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md
@@ -0,0 +1,597 @@
+# Decision Record
+
+**Deepened on:** 2026-02-26
+**Sections enhanced:** 11 of 13
+**Research agents used:** 14
+**Total recommendations applied:** 37 (22 implement, 9 fast_follow, 6 defer)
+
+## Pre-Implementation Verification
+
+1. [ ] Verify current component counts: `ls -d plugins/compound-engineering/skills/*/ | wc -l` and `ls plugins/compound-engineering/commands/*.md plugins/compound-engineering/commands/workflows/*.md | wc -l`
+2. [ ] Verify current plugin version in `plugins/compound-engineering/.claude-plugin/plugin.json`
+3. [ ] Confirm `claude-in-chrome` MCP tool names by running `/mcp` and selecting `claude-in-chrome`
+4. [ ] Review `plugins/compound-engineering/skills/deepen-plan/SKILL.md` frontmatter format as canonical thin-wrapper reference
+5. [ ] Verify `plugins/compound-engineering/commands/deepen-plan.md` as canonical thin-wrapper command template
+6. [ ] Check that `tests/user-flows/` does not already exist in any target project (no namespace collision)
+
+## Implementation Sequence
+
+1. **Create `skills/user-test/references/` files first** — the SKILL.md references these, so they must exist before the skill is validated
+2. **Create `skills/user-test/SKILL.md`** — the core skill with 5-phase execution logic + commit mode, under 500 lines
+3. **Create thin wrapper commands** — `commands/user-test.md`, `commands/user-test-iterate.md`, and `commands/user-test-commit.md`
+4. **Update metadata files** — plugin.json, marketplace.json, README.md, CHANGELOG.md (use dynamic counts, not hardcoded numbers)
+5. **Run `/release-docs`** — regenerate documentation site
+6. **Validate** — JSON validity, component count consistency, SKILL.md line count
+
+## Key Improvements
+
+1. **[Strong Signal -- 5 agents] Maturity model: guidance over rigid rules** — Replace hardcoded "3 consecutive passes = Proven" and "any failure = reset to Uncharted" with agent-guided judgment. Provide a rubric and guidelines, but let the agent decide based on context (e.g., a cosmetic issue in a Proven area should not trigger full demotion). Simplify initial threshold to 2 consecutive passes.
+
+2. **[Strong Signal -- 4 agents] Extract reference files from SKILL.md from day one** — Split the skill into SKILL.md (~300 lines of execution logic) plus `references/` directory containing test-file-template.md, browser-input-patterns.md, and iterate-mode.md. The monolith-to-skill-split learning explicitly warns that stated size budgets without enforcement are ignored.
+
+3. **[Strong Signal -- 4 agents] Extension disconnect handling with specific recovery instructions** — Replace generic "retry-once" with: wait 3 seconds, retry once, on second failure instruct user to run `/chrome` and select "Reconnect extension". Track cumulative disconnects and abort after 3 with a clear stability message.
+
+4. **[Strong Signal -- 3 agents] Add `disable-model-invocation: true` to all three thin wrapper commands** — The commands have side effects (file creation, browser interaction, issue filing). Official docs require this flag for side-effect workflows.
+
+5. **[Strong Signal -- 3 agents] Explicit distinction from agent-browser/test-browser in SKILL.md intro** — Having two browser tools invites confusion about which one to use. The SKILL.md intro must state: "This skill is for exploratory testing in a visible Chrome window with shared login state. For automated headless regression testing, use /test-browser instead."
+
+6. **[Strong Signal -- 3 agents] Dynamic component counts in acceptance criteria** — Do not hardcode "Skills: 21, Commands: 24". Count actual files and verify description strings match.
+
+7. **[Strong Signal -- 3 agents] Enhanced preflight check with `/chrome` guidance, WSL detection, and site permissions** — Phase 0 must guide users to run `/chrome` if MCP tools are unavailable, detect WSL and abort with a clear message, and verify the target URL is within Chrome extension's allowed sites.
+
+8. **Quality scoring rubric with concrete calibration anchors** — Define what scores 1-5 mean with examples, making scoring reproducible across runs.
+
+9. **Test file schema version for forward compatibility** — Add `schema_version: 1` to test file template frontmatter.
+
+10. **SKILL.md description must be single-line string** — Multiline YAML indicators break the skill indexer. Use a single line with trigger keywords for auto-discovery.
+
+## Research Insights
+
+### Browser Automation (claude-in-chrome)
+- Actions run in a **visible Chrome window** in real time — the user can watch and intervene. This is the core differentiator from headless agent-browser and should be prominently documented.
+- Claude **shares browser login state** — eliminates most authentication concerns. Users sign in once in Chrome; Claude inherits the session.
+- **GIF recording** is available as a built-in capability. Phase 7 (Summary) can offer to record sessions for evidence attached to GitHub issues.
+- **Site-level permissions** from the Chrome extension control which URLs Claude can interact with. Preflight should verify this.
+- **Modal dialogs** (alert, confirm, prompt) block all browser commands. The skill should detect unresponsive commands and instruct users to dismiss dialogs manually.
+
+### Skill Architecture (Claude Code Plugins)
+- SKILL.md description must be a **single-line string** — multiline YAML breaks the indexer.
+- Skills use **progressive disclosure**: only frontmatter loads initially (~100 tokens); full content loads on activation. This makes the 500-line target a recommendation, not just a budget.
+- `context:fork` is available for isolated execution but is not needed for this skill's use case.
+
+### Security Considerations
+- **Path traversal**: test file path resolution must be validated to stay within `tests/user-flows/`.
+- **Credential prohibition**: the skill must never persist passwords, tokens, or session IDs in any written output.
+- **Issue body sanitization**: content derived from test results should be sanitized before passing to `gh issue create`.
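+
+The path traversal check could look roughly like the following. It is a sketch under assumptions: `resolveTestFile` and `projectRoot` are hypothetical names, and the check is purely lexical, so symlinks inside `tests/user-flows/` are not resolved.
+
+```javascript
+const path = require("node:path");
+
+// Resolve a requested test file and reject anything outside tests/user-flows/.
+function resolveTestFile(projectRoot, requested) {
+  const base = path.resolve(projectRoot, "tests", "user-flows");
+  const target = path.resolve(base, requested);
+  const rel = path.relative(base, target);
+  const escapes =
+    rel === "" || rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
+  if (escapes) {
+    throw new Error(`Path escapes tests/user-flows/: ${requested}`);
+  }
+  return target;
+}
+
+console.log(resolveTestFile("/proj", "checkout.md"));
+```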
+
+## New Considerations Discovered
+
+1. **MCP tool batching for performance** — Each claude-in-chrome MCP call involves a Chrome extension round-trip. Batch simple checks (element visibility, text content) into single `javascript_tool` calls. Define "quick spot-check" for Proven areas as max 3 MCP calls.
+
+2. **Iterate mode token cap** — Each 7-phase run consumes significant tokens. Add a default cap of N <= 10 with explicit override.
+
+3. **State clearing between iterate runs is incomplete** — Full page reload to app entry URL is the reset mechanism. This does not cover IndexedDB, service worker caches, or HttpOnly cookies. Document this limitation.
+
+4. **Test history file rotation** — `tests/user-flows/test-history.md` will grow unbounded. Add a rotation strategy: keep last 50 entries, archive older ones.
+
+5. **Atomic file writes** — "Full rewrite" is not truly atomic. Use write-to-temp-then-rename pattern for test file updates.
+
+6. **Area granularity definition** — The maturity map tracks "areas" but never defines what size an area should be. Without guidance, two runs will decompose the same scenario differently, making consecutive-pass tracking meaningless. Define areas as 1-3 user interactions each (e.g., "checkout" → cart-validation, shipping-form, payment-submission). Include a worked example in `references/test-file-template.md`.
+
+7. **Explore Next Run needs prioritization** — The Explore Next Run section is append-only with no signal about urgency. After 5-6 runs it becomes a backlog with no entry point. Add priority levels: `P1` (likely user-facing friction), `P2` (edge case worth knowing), `P3` (curiosity). Instruct Phase 3 to pick highest-priority uncharted items first.
+
+8. **Issue deduplication needs structured labels** — Semantic search via `gh issue list --search` is fragile — two runs will describe the same bug differently. Use a structured `user-test:<area-slug>` label on every issue (e.g., `user-test:checkout/cart-count`) for exact-match dedup via `--label` flag, with semantic search as fallback only.
+
+9. **Qualitative assessments evaporate after each run** — The Run Summary asks "Demo ready?" but this answer never persists in test-history.md. Add a `demo_readiness` field (yes/no/partial) to the history table schema so trend data captures qualitative signal, not just scores.
+
+10. **App-level environment sanity check** — Phase 0 validates tool availability but not app health. Stale auth tokens, empty search indices, or silent API 500s produce misleading test results that look like quality issues. Add a Phase 2 "environment sanity check": one known-good navigation + one content assertion before executing test scenarios.
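+
+For the atomic file write in item 5, the write-to-temp-then-rename pattern can be sketched as below. `writeFileAtomic` and the temp-file naming are assumptions for illustration. The temp file is created in the target's own directory so the rename stays on one filesystem.
+
+```javascript
+const fs = require("node:fs");
+const path = require("node:path");
+
+// Write the full contents to a sibling temp file, then rename it over the target.
+// Readers see either the old file or the new one, never a half-written file.
+function writeFileAtomic(targetPath, contents) {
+  const dir = path.dirname(targetPath);
+  const tmpPath = path.join(dir, `.${path.basename(targetPath)}.${process.pid}.tmp`);
+  fs.writeFileSync(tmpPath, contents, "utf8");
+  fs.renameSync(tmpPath, targetPath);
+}
+```
+
+If the process dies between the write and the rename, a stray `.tmp` file is left behind but the original test file is untouched.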
+
+## Fast Follow (ticket before merge)
+
+**Tier 1 -- Blocks demo/UX quality** (fix within 1-2 days):
+- Add cross-reference from `agent-browser/SKILL.md` back to `user-test` to prevent user confusion between the two browser testing approaches
+
+**Tier 2 -- Improves robustness** (fix within 1 sprint):
+- Add file upload workaround documentation: pause user-test and use `/agent-browser` for upload steps, then resume
+- MCP tool mapping table (agent-browser CLI vs claude-in-chrome MCP equivalents) in a shared reference file
+- Test file concern separation: evaluate splitting run history into a sidecar `.json` for machine parsing while keeping the `.md` human-readable
+
+## Cross-Cutting Concerns
+
+1. **SKILL.md size budget enforcement** — Four agents independently recommend the `references/` extraction. The structural decision affects the content outline (section 8), technical considerations (section 5), thin wrapper templates (section 9), and the SpecFlow analysis (section 10). This is the single most impactful structural change.
+
+2. **Maturity model rigidity vs agent judgment** — Five agents flag this across scoring (section 4), SKILL.md phases (section 8), and success metrics (section 12). The resolution: provide guidance and rubrics, not rigid rules.
+
+3. **MCP reliability and graceful degradation** — Four agents converge on this across preflight (section 5), execution (section 8), and risks (section 11). The pattern: specific recovery instructions for known failure modes, graceful degradation for mid-run tool failures.
+
+4. **`disable-model-invocation: true`** — Three agents confirm this is required for all three wrapper commands. Single-section impact but high confidence from official docs.
+
+## Deferred to Future Work
+
+- **MCP abstraction layer** for future tool swaps (agent-browser <-> claude-in-chrome) — adds unnecessary complexity for v1
+- **Test file concern separation** into spec + state sidecar — evaluate after real-world usage reveals whether the single-file approach causes friction
+- **`/mcp` runtime discovery** of available tools instead of hardcoded tool names — low confidence (0.65), nice-to-have for forward compatibility
+- **`context:fork` isolation** for iterate mode runs — not needed for current architecture but could improve memory isolation for long iterate sessions
+
+## Research Gaps Addressed
+
+| Source | Recommendation | Status |
+|--------|---------------|--------|
+| docs-researcher-claude-code-plugins | Single-line description | Implemented in SKILL.md frontmatter |
+| docs-researcher-claude-code-plugins | Keep SKILL.md under 500 lines | Implemented via references/ extraction |
+| docs-researcher-claude-code-plugins | disable-model-invocation: true | Implemented in all three wrapper commands |
+| docs-researcher-claude-code-plugins | /chrome activation guidance | Implemented in Phase 0 preflight |
+| docs-researcher-claude-code-plugins | Service worker idle disconnects | Implemented in disconnect handling |
+| docs-researcher-claude-code-plugins | WSL not supported | Implemented in Phase 0 preflight |
+| docs-researcher-claude-code-plugins | Login page/CAPTCHA pausing | Implemented in Phase 2 setup |
+| docs-researcher-claude-in-chrome | Visible Chrome window | Implemented in SKILL.md intro |
+| docs-researcher-claude-in-chrome | Shared browser login state | Implemented in Phase 2 setup |
+| docs-researcher-claude-in-chrome | GIF recording | Acknowledged in Phase 7 as optional enhancement |
+| docs-researcher-claude-in-chrome | Site-level permissions | Implemented in Phase 0 preflight |
+| docs-researcher-claude-in-chrome | Named pipe conflicts (Windows) | Implemented in Phase 0 preflight |
+| docs-researcher-claude-in-chrome | Modal dialogs block commands | Implemented in Phase 3 execution |
+| docs-researcher-claude-in-chrome | /mcp runtime discovery | Deferred — low confidence (0.65), nice-to-have |
+
+---
+# Implementation Spec
+---
+
+---
+title: "Add user-test browser testing skill and commands"
+type: feat
+status: active
+date: 2026-02-26
+---
+
+# Add user-test Browser Testing Skill and Commands
+
+## Overview
+
+Add a new `user-test` skill and three companion commands (`/user-test`, `/user-test-iterate`, `/user-test-commit`) to the compound-engineering plugin. This implements browser-based exploratory user testing via `claude-in-chrome` MCP tools with a compounding maturity model — each run makes the test file smarter by promoting proven areas, filing new bugs, and expanding coverage.
+
+This skill is for **exploratory testing in a visible Chrome window** with shared login state. The user watches the test run in real time and can intervene if needed. For automated headless regression testing, use `/test-browser` instead — it uses the `agent-browser` CLI for deterministic, CI-oriented QA checks.
+
+The three commands separate concerns: `/user-test` runs and scores a test, `/user-test-iterate` runs it N times for consistency data, and `/user-test-commit` applies results (updates the test file maturity map, files issues, appends history). This separation keeps the fast feedback loop (run + score) lightweight and lets the user decide when to commit results.
+
+## Problem Statement / Motivation
+
+The plugin has `test-browser` (deterministic QA regression via `agent-browser` CLI) but no exploratory user testing capability. Teams need a way to:
+
+- Simulate real user behavior against their app in a visible browser
+- Track which areas are stable vs. fragile across runs
+- Automatically file and deduplicate GitHub issues from testing sessions
+- Compound knowledge: skip proven areas, skip known bugs, focus effort on uncharted territory
+
+This fills a distinct niche from `test-browser` — exploratory quality assessment with compounding knowledge, not regression checking.
+
+## Proposed Solution
+
+### Architecture: Skill + Thin Wrapper Commands
+
+Following the `deepen-plan` precedent (v2.36.0 refactor), implement as:
+
+| File | Type | Purpose |
+|------|------|---------|
+| `skills/user-test/SKILL.md` | Skill | Core 5-phase execution logic + commit mode (~300 lines) |
+| `skills/user-test/references/test-file-template.md` | Reference | Test file template for new scenarios (~100 lines) |
+| `skills/user-test/references/browser-input-patterns.md` | Reference | React-safe input patterns and MCP tool tips (~30 lines) |
+| `skills/user-test/references/iterate-mode.md` | Reference | Iterate mode execution details (~50 lines) |
+| `commands/user-test.md` | Thin wrapper | `Skill(user-test)` invocation for `/user-test` |
+| `commands/user-test-iterate.md` | Thin wrapper | `Skill(user-test)` invocation with iterate mode for `/user-test-iterate` |
+| `commands/user-test-commit.md` | Thin wrapper | `Skill(user-test)` invocation for committing results |
+
+**Why skill + thin wrapper?**
+- The execution logic is ~300 lines — well within the 500-line skill recommendation
+- Reference files extract the test template, input patterns, and iterate mode details — each reusable independently
+- Thin wrappers prevent command bloat (learnings: monolith-to-skill split anti-patterns)
+- All three commands share the same skill logic, just with different invocation modes
+- Consistent with the Pattern A convention used by `deepen-plan`, `create-agent-skill`, etc.
+
+**Why extract to `references/` from day one?**
+The monolith-to-skill-split learning (convergence from 4 agents) explicitly warns: "Stating max 1200 lines in a plan is a policy wish. Without a gate that fails the pipeline, the file will grow past the budget." By starting with the split structure, the SKILL.md stays focused on execution phases and the reference files can grow independently without threatening the line budget.
+
+### Key Design Decisions
+
+**1. Browser tool: `claude-in-chrome` MCP (not `agent-browser` CLI)**
+
+The skill uses `mcp__claude-in-chrome__*` tools (find, javascript_tool, read_page, screenshots). This is intentionally different from `test-browser` which uses the headless `agent-browser` CLI. The rationale:
+- `user-test` simulates a real user in a **visible** Chrome window — interactive, visual. The user can watch the test happening and intervene.
+- `test-browser` runs headless regression checks — deterministic, CI-oriented
+- Different tools for different testing philosophies
+- `claude-in-chrome` shares the browser's login state, so authenticated app testing requires no credential handling — the user simply signs in once in Chrome
+
+**2. Test file as the product, not the run report**
+
+Living test files in `tests/user-flows/<scenario-slug>.md` get rewritten each run with updated maturity maps, scores, and history. The test file compounds intelligence across runs.
+
+Test files include a `schema_version: 1` field in frontmatter to enable forward-compatible migrations when the maturity model or file structure evolves.
+
+**3. Maturity model drives test efficiency**
+
+The maturity model provides guidance for the agent's judgment, not rigid rules:
+
+| Status | Behavior | Guidance |
+|--------|----------|----------|
+| Proven | Quick spot-check only (max 3 MCP calls) | Promote after 2+ consecutive passes with no significant issues. Cosmetic issues do not warrant demotion. |
+| Uncharted | Full investigation, edge cases | Default state. Demote from Proven only on functional regressions or new features. |
+| Known bug | Skip entirely | Issue filed. Skip until fix deployed. |
+
+The agent exercises judgment on promotions and demotions using the scoring rubric rather than following mechanical counters. A minor CSS issue in a Proven area stays Proven with a note. A broken API in an Uncharted area gets a Known-bug issue filed.
+
+**Partial run safety:** If a run is interrupted before scoring completes, no maturity updates are produced. Only `/user-test-commit` writes maturity state, and only from a completed run's results.
+
+**Area granularity:** Each area should cover 1-3 user interactions — small enough that a single bug doesn't reset a huge chunk of proven territory, large enough to accumulate consecutive passes. Example decomposition for "checkout":
+
+| Area | Interactions | What's tested |
+|------|-------------|---------------|
+| `checkout/cart-validation` | Add item, verify count, change quantity | Cart state management |
+| `checkout/shipping-form` | Enter address, select method, see estimate | Form validation + shipping logic |
+| `checkout/payment-submission` | Enter card, submit, see confirmation | Payment flow + success state |
+
+A worked example with this decomposition pattern is included in [test-file-template.md](./references/test-file-template.md).
+
+**Quality Scoring Rubric**
+
+Each score applies to one **scored interaction unit** — a single user-facing task completion (e.g., "add item to cart", "submit shipping form", "complete payment"). Navigation steps, page loads, and setup actions are not scored individually; they are part of the interaction they serve.
+
+| Score | Meaning | Example |
+|-------|---------|---------|
+| 1 | Broken — cannot complete the task | Button unresponsive, page crashes |
+| 2 | Completes with major friction | 3+ confusing steps, error messages shown |
+| 3 | Completes with minor friction | Small UX issues, unclear labels |
+| 4 | Smooth experience | Clear flow, no confusion |
+| 5 | Delightful | Exceeds expectations, helpful feedback |
+
+## Technical Considerations
+
+### Distinct from existing `test-browser` command
+
+| Aspect | `test-browser` | `user-test` (new) |
+|--------|---------------|-------------------|
+| Tool | `agent-browser` CLI (headless) | `claude-in-chrome` MCP (visible browser) |
+| Purpose | QA regression on PR-affected pages | Exploratory user testing |
+| State | Stateless per run | Stateful via test files |
+| Output | Pass/fail per route | Quality scores 1-5 per interaction |
+| Issues | No issue creation | Auto-files and deduplicates issues |
+| Auth | Handles login flows | Shares browser login state |
+| Observation | Results only | Real-time visual — user watches the test |
+
+### MCP dependency
+
+The skill requires `claude-in-chrome` MCP to be connected. Phase 0 (Preflight Check) validates availability and provides specific guidance:
+
+<!-- ready-to-copy -->
+```
+## Phase 0: Preflight Check
+1. Check if claude-in-chrome MCP tools are available
+2. If NOT available:
+   - Display: "claude-in-chrome not connected. Run /chrome or restart with claude --chrome"
+   - Abort with clear instructions
+3. Detect WSL environment:
+   - If running in WSL: "Chrome integration is not supported in WSL. Run Claude Code directly on Windows."
+   - Abort
+4. Verify the target app URL is within Chrome extension's allowed sites
+   - If permission denied: "Grant site permission in Chrome extension settings for [URL]"
+5. Windows: if EADDRINUSE error on named pipe:
+   - "Close other Claude Code sessions that might be using Chrome, then retry"
+```
+
+### `gh` CLI dependency
+
+Issue creation (Phase 6) requires `gh auth status`. The skill handles this gracefully:
+- If `gh` is not authenticated: skip issue creation, note in summary
+- If `gh` is authenticated: proceed with duplicate detection and filing
+- **Structured dedup labels:** Every issue gets a label `user-test:<area-slug>` (e.g., `user-test:checkout/cart-count`). Duplicate detection uses `gh issue list --label "user-test:<area-slug>" --state open` for exact match, falling back to semantic title search only if no label match found. Labels are machine-parseable and immune to description rewording.
+- Issue body content sanitized before passing to `gh issue create` to prevent markdown injection
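+
+A sketch of how the exact-match dedup query could be assembled. The flags are the ones named above; the `dedupArgs` helper and the slug pattern are assumptions. Building an argument array instead of a shell string means the slug is never interpreted by a shell.
+
+```javascript
+// Assumed slug format: lowercase segments separated by "/", e.g. "checkout/cart-count".
+const SLUG_PATTERN = /^[a-z0-9-]+(\/[a-z0-9-]+)*$/;
+
+// Build the argument list for the exact-match label lookup.
+function dedupArgs(areaSlug) {
+  if (!SLUG_PATTERN.test(areaSlug)) {
+    throw new Error(`Unexpected area slug: ${areaSlug}`);
+  }
+  return ["issue", "list", "--label", `user-test:${areaSlug}`, "--state", "open"];
+}
+
+// The array could be passed to child_process.execFileSync("gh", args).
+console.log(["gh", ...dedupArgs("checkout/cart-count")].join(" "));
+```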
+
+### React-safe input pattern
+
+The React-specific native setter pattern for bypassing virtual DOM is extracted to [browser-input-patterns.md](./references/browser-input-patterns.md). This keeps framework-specific tool logic reusable and out of the main SKILL.md.
+
+### MCP tool performance
+
+Each claude-in-chrome MCP call involves a round-trip through the Chrome extension. To manage latency:
+- Batch simple checks (element visibility, text content, price display) into single `javascript_tool` calls
+- Define "quick spot-check" for Proven areas as max 3 MCP calls per area
+- Full investigations for Uncharted areas have no artificial cap but should use batched checks where possible
+
+<!-- illustrative -->
+```javascript
+// Batch multiple checks into one javascript_tool call:
+mcp__claude-in-chrome__javascript_tool({
+  code: `JSON.stringify({
+    submitBtn: !!document.querySelector('[type=submit]'),
+    errorMsg: !!document.querySelector('.error'),
+    price: document.querySelector('.price')?.textContent
+  })`
+})
+```
+
+### Connection resilience
+
+Extension disconnects are a known issue — the Chrome extension service worker can go idle during extended sessions.
+
+<!-- ready-to-copy -->
+```
+## Disconnect Handling
+1. After MCP tool failure: wait 3 seconds
+2. Retry the call once
+3. If retry fails: "Extension disconnected. Run /chrome and select Reconnect extension"
+4. Track disconnect_counter for the session
+5. If disconnect_counter >= 3: abort with "Extension connection unstable. Check Chrome extension status and restart the session."
+```
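+
+The numbered steps above can be sketched as a small wrapper around any MCP call. This is a minimal illustration, not the skill's actual implementation: `callWithRetry`, the `session` object, and the `delayMs` parameter are hypothetical names, and the sketch assumes the counter increments only when the retry also fails, which the steps leave open.
+
+<!-- illustrative -->
+```javascript
+// Only the 3 s delay, the single retry, and the abort-at-3 rule come from the plan.
+const RETRY_DELAY_MS = 3000;
+const MAX_DISCONNECTS = 3;
+
+function sleep(ms) {
+  return new Promise((resolve) => setTimeout(resolve, ms));
+}
+
+// `callTool` is any async function that performs one MCP call.
+// `session.disconnectCounter` is the per-session counter (hypothetical shape).
+async function callWithRetry(callTool, session, delayMs = RETRY_DELAY_MS) {
+  try {
+    return await callTool();
+  } catch (firstError) {
+    await sleep(delayMs); // step 1: wait before retrying
+    try {
+      return await callTool(); // step 2: retry the call once
+    } catch (secondError) {
+      session.disconnectCounter += 1; // step 4 (assumed: count failed retries)
+      if (session.disconnectCounter >= MAX_DISCONNECTS) {
+        // step 5: too many disconnects, abort the session
+        throw new Error("Extension connection unstable. Check Chrome extension status and restart the session.");
+      }
+      // step 3: surface the reconnect guidance
+      throw new Error("Extension disconnected. Run /chrome and select Reconnect extension");
+    }
+  }
+}
+```
+
+Passing `delayMs` as a parameter keeps the 3-second default while letting a dry run use `0`.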
+
+### Modal dialog handling
+
+JavaScript dialogs (alert, confirm, prompt) block all browser events and prevent Claude from receiving commands. If commands stop responding after a dialog trigger, instruct the user to dismiss the dialog manually before continuing.
+
+### Graceful degradation
+
+Apply the same pattern used for `gh` CLI absence to MCP tool failures mid-run:
+- If screenshot fails: continue but note "screenshots unavailable" in the report
+- If javascript_tool fails: fall back to individual find/click calls
+- If all MCP tools fail: abort with specific recovery instructions
+
+## System-Wide Impact
+
+- **Interaction graph**: Skill invoked by three thin wrapper commands (`/user-test`, `/user-test-iterate`, `/user-test-commit`). No callbacks or middleware. Writes to `tests/user-flows/` (user's project, not the plugin). Calls `gh` CLI for issue creation.
+- **Error propagation**: MCP disconnects handled with retry-once + specific recovery instructions. `gh` failures gracefully degraded (skip issue creation). Mid-run MCP tool failures degrade individual capabilities rather than aborting.
+- **State lifecycle risks**: Test file writes use write-to-temp-then-rename pattern for atomic updates. Partial runs produce no committable output (maturity safety). Iterate mode resets between runs via full page reload to the app entry URL. Note: this does not clear IndexedDB, service worker caches, or HttpOnly cookies — document this limitation in iterate mode reference.
+- **API surface parity**: No overlap with existing commands — distinct MCP tool set, distinct file structure, distinct purpose.
+- **Security**: Test file paths validated to stay within `tests/user-flows/`. No credentials persisted in any written output (test files, run history, issue bodies). Issue body content sanitized before `gh` CLI invocation.
+
+## Acceptance Criteria
+
+### Files to Create
+
+- [x] `plugins/compound-engineering/skills/user-test/SKILL.md` — Core skill with 5 phases + commit mode, ~300 lines
+- [x] `plugins/compound-engineering/skills/user-test/references/test-file-template.md` — Test file template for new scenarios
+- [x] `plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md` — React-safe input patterns
+- [x] `plugins/compound-engineering/skills/user-test/references/iterate-mode.md` — Iterate mode details
+- [x] `plugins/compound-engineering/commands/user-test.md` — Thin wrapper with `disable-model-invocation: true`
+- [x] `plugins/compound-engineering/commands/user-test-iterate.md` — Thin wrapper with `disable-model-invocation: true` and iterate argument forwarding
+- [x] `plugins/compound-engineering/commands/user-test-commit.md` — Thin wrapper with `disable-model-invocation: true` for committing results
+
+### Files to Modify
+
+- [x] `plugins/compound-engineering/.claude-plugin/plugin.json` — bump version, update description with dynamic counts
+- [x] `.claude-plugin/marketplace.json` — bump version, update description with dynamic counts
+- [x] `plugins/compound-engineering/README.md` — Update component count table, add skill row under Browser Automation, add two command rows
+- [x] `plugins/compound-engineering/CHANGELOG.md` — Add new version entry with `### Added` section
+
+### Post-Change Validation
+
+- [x] Validate JSON: `cat .claude-plugin/marketplace.json | jq .` and `cat plugins/compound-engineering/.claude-plugin/plugin.json | jq .`
+- [x] Verify skill count matches description: `SKILL_COUNT=$(ls -d plugins/compound-engineering/skills/*/ | wc -l | tr -d ' ') && grep -q "$SKILL_COUNT skill" plugins/compound-engineering/.claude-plugin/plugin.json`
+- [x] Verify command count matches description: `CMD_COUNT=$(ls plugins/compound-engineering/commands/*.md plugins/compound-engineering/commands/workflows/*.md | wc -l | tr -d ' ') && grep -q "$CMD_COUNT command" plugins/compound-engineering/.claude-plugin/plugin.json`
+- [x] Verify SKILL.md line count: `SKILL_LINES=$(wc -l < plugins/compound-engineering/skills/user-test/SKILL.md) && [ "$SKILL_LINES" -le 500 ] && echo "OK: $SKILL_LINES lines" || echo "FAIL: $SKILL_LINES lines (max 500)"`
+- [x] Verify SKILL.md frontmatter compliance: `name: user-test`, single-line description with trigger keywords
+- [x] Verify reference files are linked with proper markdown links (not backtick references)
+- [ ] Run `claude /release-docs` to regenerate all docs site pages
+
+### Functional Requirements
+
+- [ ] `/user-test tests/user-flows/checkout.md` — loads existing test file, runs phases 0-4 (score + report)
+- [ ] `/user-test "Test the checkout flow"` — creates new test file from description, runs phases 0-4
+- [ ] `/user-test-commit` — applies results from last run: updates maturity map, files issues, appends history
+- [ ] `/user-test-iterate tests/user-flows/checkout.md 5` — runs the scenario 5 times, reports consistency
+- [ ] Maturity model correctly promotes (2+ consistent passes with agent judgment) and demotes (functional regression with agent judgment)
+- [ ] Issues include `user-test:<area-slug>` label; dedup uses `--label` flag first, semantic fallback second
+- [ ] Test file template created for new scenarios with all required sections including `schema_version: 1`
+- [ ] `tests/user-flows/test-history.md` appended after each run (rotation: keep last 50 entries, includes quality avg + pass rate + disconnects + demo_readiness + key finding)
+- [ ] Test file path validated to stay within `tests/user-flows/` (no directory traversal)
+- [ ] Iterate mode: N capped at 10 by default, error on N=0, N=1 valid
+- [ ] Iterate mode: reset between runs = full page reload to app entry URL (limitations: IndexedDB, SW caches, HttpOnly cookies not cleared)
+- [ ] Iterate mode: partial run handling (disconnects mid-iterate produce valid partial results)
+- [ ] `test-history.md` includes `demo_readiness` column (yes/no/partial) persisted each run
+- [ ] Explore Next Run items include priority (P1/P2/P3); Phase 3 picks highest priority first
+- [ ] Area granularity: worked example in test-file-template.md showing 1-3 interactions per area
+- [ ] Phase 2 environment sanity check: verifies app loads with expected content before test execution
+- [ ] Given a new scenario, full pipeline (phases 0-4 + commit) produces: test file with schema_version: 1, quality score, maturity map, and summary — all without manual intervention beyond initial command
+- [ ] Given a test file with an Uncharted area, after iterate N=3 where all runs score >= 4, the area's maturity status is Proven
+
+### Security Requirements
+
+- [ ] Test file path resolution prevents directory traversal
+- [ ] No credentials (passwords, tokens, session IDs) persisted in any output file
+- [ ] Issue body content sanitized before `gh issue create`
+- [ ] `user-test:<area-slug>` label convention documented for duplicate detection
+
+## SKILL.md Content Outline
+
+The skill contains 5-phase execution logic (run + score) plus a commit mode (update files + file issues), with references to supporting files:
+
+<!-- ready-to-copy -->
+```
+---
+name: user-test
+description: Run browser-based user testing via claude-in-chrome MCP with quality scoring and compounding test files. Use when testing app quality from a real user's perspective, scoring interactions, tracking test maturity, or filing issues from test sessions.
+argument-hint: "[scenario-file-or-description]"
+---
+
+# User Test
+
+Exploratory testing in a visible Chrome window. You watch the test happening
+in real-time and can intervene if needed. Claude shares your browser's login
+state — sign into your app in Chrome before running.
+
+For automated headless regression testing, use /test-browser instead.
+
+**v1 limitation:** This skill targets localhost / local dev server apps. External
+or staging URLs are not validated for deployment status — if you test against a
+remote URL, verify it's live and accessible before running.
+
+## Phase 0: Preflight
+[Validate: claude-in-chrome MCP available (if not: "Run /chrome"), WSL detection,
+site permissions, gh auth status, app URL resolvable]
+
+## Phase 1: Load Context
+[Resolve test file from path/description, validate path stays within tests/user-flows/,
+extract maturity map + history, validate schema_version]
+[If no argument: scan tests/user-flows/ for test files, present list, or prompt for description]
+[If test file corrupted: offer to regenerate from template]
+
+## Phase 2: Setup
+[Ensure user is signed into target app in Chrome (shared login state),
+take baseline screenshot]
+[If login page or CAPTCHA encountered: pause for manual handling]
+[Environment sanity check: navigate to app URL, verify page loaded with expected content
+(not an error page, not a stale auth redirect, not an empty state). If the app loads but
+shows error banners, API failures, or empty data that should be populated — abort with
+"App environment issue detected" rather than producing misleading quality scores]
+
+## Phase 3: Execute
+[Maturity-guided selection (agent judgment, not mechanical counters),
+Proven areas: quick spot-check (max 3 MCP calls),
+Uncharted areas: full investigation with batched javascript_tool calls,
+Known-bug areas: skip entirely]
+[Connection resilience: retry once with 3s delay, then /chrome reconnect guidance]
+[If all areas Proven: spot-check all, suggest new scenarios in "Explore Next Run"]
+[Explore Next Run items have priority: P1 (likely user-facing friction), P2 (edge case),
+P3 (curiosity). Pick highest-priority uncharted items first, not FIFO]
+[Modal dialog detection: instruct user to dismiss manually]
+
+## Phase 4: Score and Report
+[Quality scoring 1-5 using calibration rubric per scored interaction unit]
+[A scored interaction unit = one user-facing task completion (e.g., "add item to cart",
+"submit shipping form", "complete payment"). Navigation steps, page loads, and setup
+actions are not scored individually — they are part of the interaction they serve.]
+[Scores are ABSOLUTE per rubric, not relative to scenario framing.]
+[Output: run summary block with per-area scores, disconnect count, overall quality avg]
+[If run is interrupted before scoring completes, do NOT produce committable output —
+partial runs must not corrupt maturity state]
+
+## Commit Mode
+[Invoked separately via /user-test-commit after reviewing run results]
+[Maturity updates using agent judgment, run history, promotion/demotion with rubric]
+[Atomic write: write to .tmp then rename]
+[History rotation: keep last 50 entries in test-history.md]
+[Include structured label `user-test:<area-slug>` on every issue]
+[Duplicate detection: `gh issue list --label "user-test:<area-slug>" --state open` for
+exact match; fall back to semantic title search only if no label match found]
+[Sanitize issue body content, skip gracefully if gh not authenticated]
+[Never persist credentials in issue bodies or test files]
+[Persist demo_readiness (yes/no/partial) in test-history.md alongside quality scores]
+
+## Iterate Mode
+See [iterate-mode.md](./references/iterate-mode.md) for details.
+N capped at 10 (default), N=0 is error, N=1 valid.
+State clearing limitations documented.
+Partial run handling: if disconnect mid-iterate, write results for completed runs and report
+"Completed M of N runs" — partial results are valid and maturity updates apply.
+Output format: per-run scores table + aggregate consistency metrics + maturity transitions.
+
+## Test File Template
+See [test-file-template.md](./references/test-file-template.md).
+
+## Browser Input Patterns
+See [browser-input-patterns.md](./references/browser-input-patterns.md).
+```
+
+## Thin Wrapper Command Templates
+
+### `commands/user-test.md`
+
+<!-- ready-to-copy -->
+```markdown
+---
+name: user-test
+description: Run browser-based user testing with quality scoring and compounding test files
+disable-model-invocation: true
+allowed-tools: Skill(user-test)
+argument-hint: "[scenario-file-or-description]"
+---
+
+Invoke the user-test skill for: $ARGUMENTS
+```
+
+### `commands/user-test-iterate.md`
+
+<!-- ready-to-copy -->
+```markdown
+---
+name: user-test-iterate
+description: Run the same user test scenario N times to measure consistency
+disable-model-invocation: true
+allowed-tools: Skill(user-test)
+argument-hint: "[scenario-file] [n]"
+---
+
+Invoke the user-test skill in iterate mode for: $ARGUMENTS
+```
+
+### `commands/user-test-commit.md`
+
+<!-- ready-to-copy -->
+```markdown
+---
+name: user-test-commit
+description: Commit user-test results — update test file maturity map, file issues, append history
+disable-model-invocation: true
+allowed-tools: Skill(user-test)
+---
+
+Invoke the user-test skill in commit mode for the last completed run.
+```
+
+## SpecFlow Analysis -- Gaps Addressed in Implementation
+
+The SpecFlow analyzer identified gaps. Here is how the implementation addresses each genuine gap:
+
+| Gap | Resolution |
+|-----|-----------|
+| No-argument behavior | Phase 1: scan `tests/user-flows/` for test files, present list, or prompt for description |
+| MCP not connected | Phase 0 preflight: check MCP availability, instruct to run `/chrome` or restart with `claude --chrome` |
+| gh not authenticated | Commit mode: check `gh auth status` before creating issues, skip gracefully if not authenticated |
+| Test file corruption | Phase 1: validate required sections and schema_version, offer to regenerate from template if missing |
+| All areas Proven | Phase 3: spot-check all Proven areas, add note suggesting new scenarios in "Explore Next Run" |
+| N=0 for iterate | Iterate mode: treat N=0 as error, require N >= 1, cap N <= 10. N=1 is valid (single run with consistency tracking) |
+| State between iterate runs | Iterate mode: full page reload to app entry URL between each run. Document limitation: does not clear IndexedDB, service worker caches, or HttpOnly cookies |
+| Preflight check | Phase 0: validates MCP, gh, app URL, WSL detection, site permissions, Windows named pipe conflicts |
+| Authentication/login | Phase 2: leverage shared browser login state. User signs in once in Chrome. If CAPTCHA encountered, Claude pauses for manual handling |
+
+## Dependencies & Risks
+
+| Risk | Mitigation |
+|------|-----------|
+| `claude-in-chrome` MCP may not be installed | Phase 0 preflight check with specific "/chrome" instructions |
+| Extension service worker goes idle during extended sessions | Retry once with 3s delay, then specific "/chrome Reconnect" guidance. Abort after 3 cumulative disconnects. |
+| File upload not supported | Explicit `MANUAL ONLY` marking in test file template. Workaround: pause user-test and use `/agent-browser` for upload steps. |
+| SKILL.md growth past 500 lines | References/ extraction from day one. Validation gate: `wc -l < SKILL.md` must be <= 500 |
+| Component count drift | Dynamic count validation in acceptance criteria (count files, verify descriptions match) |
+| Test history unbounded growth | Rotation: keep last 50 entries in test-history.md |
+| Modal dialogs block browser commands | Detection guidance in Phase 3, instruct user to dismiss manually |
+| WSL environment | Preflight detection and abort with clear message |
+| Windows named pipe conflicts | Preflight detection with "close other Claude Code sessions" guidance |
+| Directory traversal via test file path | Path validation in Phase 1: resolved path must start with `tests/user-flows/` |
+| External/staging app not deployed or stale | v1 targets localhost/local dev. Document limitation: no deployment verification for remote URLs. User must verify external apps are live before testing. |
+| App loads but environment is broken (stale auth, empty data, API 500s) | Phase 2 environment sanity check: navigate + content assertion before test execution. Abort with "App environment issue" rather than producing misleading scores |
+| Issue dedup fails on different descriptions of same bug | Structured `user-test:<area-slug>` label on every issue for exact-match dedup via `--label`; semantic search as fallback only |
+
+## Success Metrics
+
+- Skill loads and executes without errors on first invocation
+- Test file is correctly created from description with `schema_version: 1`
+- Maturity model state transitions work across 3+ consecutive runs using agent judgment
+- No duplicate GitHub issues created across iterate runs
+- SKILL.md <= 500 lines (enforced by validation gate)
+- All component counts match across plugin.json, marketplace.json, and README.md (verified dynamically)
+- **Compounding metric**: After 3 runs on the same scenario, Proven area count > 0 and total test duration decreases (spot-checks are faster than full investigations)
+
+## Sources & References
+
+### Internal References
+- Thin wrapper pattern: `plugins/compound-engineering/commands/deepen-plan.md:1-9`
+- Skill structure: `plugins/compound-engineering/skills/deepen-plan/SKILL.md` (frontmatter + phases pattern)
+- Browser automation: `plugins/compound-engineering/skills/agent-browser/SKILL.md` (MCP tool reference)
+- Existing test command: `plugins/compound-engineering/commands/test-browser.md` (distinct tool set)
+- Plugin checklist: `CLAUDE.md` "Adding a New Skill" section
+- Anti-patterns: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md`
+- Versioning: `docs/solutions/plugin-versioning-requirements.md`
+
+### Conventions Applied
+- Skill compliance: name matches directory, single-line description with trigger keywords
+- Thin wrapper: `allowed-tools: Skill(user-test)`, `disable-model-invocation: true`
+- Version bump: MINOR for new functionality (dynamic — count at implementation time)
+- CHANGELOG: Keep a Changelog format with `### Added` section
+- Reference files linked with proper markdown links: `[filename.md](./references/filename.md)`
diff --git a/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md b/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md
new file mode 100644
index 000000000..1797da0d1
--- /dev/null
+++ b/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md
@@ -0,0 +1,424 @@
+---
+title: "User-Test Skill Revision: Timing, Qualitative Summaries, Delta Tracking, and More"
+type: feat
+status: completed
+date: 2026-02-28
+---
+
+# User-Test Skill Revision
+
+Based on 7 rounds of iterative testing and real production test results.
+
+## Overview
+
+Revise the `user-test` skill to add timing tracking, qualitative summaries, delta regression detection, explore-next-run generation, optional CLI mode, output quality scoring, and conditional regression checks. These changes address gaps discovered during real-world usage — timing regressions went unnoticed, structured scores missed qualitative signal, and ~60% of bugs were agent reasoning errors catchable without a browser.
+
+## Problem Statement / Motivation
+
+The current skill scores UX quality but misses five signals that real testing revealed:
+1. **Performance blind spot** — response time regressed 15s to 28s across 5 runs, unnoticed until manually tracked
+2. **Qualitative signal loss** — "4.2/5 average" doesn't answer "should we demo tomorrow?"
+3. **Regression hiding** — absolute numbers mask run-over-run quality changes
+4. **Stale explore-next-run** — the section stays empty because the skill doesn't generate items proactively
+5. **Browser-only bottleneck** — most agent reasoning bugs don't need a browser to catch
+
+## Proposed Solution
+
+10 changes in 3 priority tiers, scoped to stay within the 500-line SKILL.md budget.
+
+### Prerequisite: Schema Migration Strategy
+
+Multiple changes (1A, 1C, 2B, 2C) add columns to the test file template. Existing `schema_version: 1` files must not break.
+
+**Approach:**
+- Bump to `schema_version: 2` in the template
+- Phase 1 (Load Context): when reading a v1 file, add missing columns with empty/default values in memory — do NOT rewrite the file
+- Commit mode: when writing back, upgrade the file to v2 schema (adds new columns, preserves all existing data)
+- Forward compatibility: the reader tolerates unknown frontmatter fields (from a future v3) by ignoring them. Unknown table columns are preserved on write.
+- The "offer to regenerate" recovery path remains for genuinely corrupted files only
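+
+The read-side upgrade described above could look roughly like the following. It is a sketch under stated assumptions: the parsed test file shape (`schemaVersion`, `areas`) and the column keys `lastTime` and `lastQuality` are hypothetical names chosen for illustration, since the plan only specifies the table headers.
+
+```javascript
+// Columns introduced by schema v2, with the defaults applied when a v1 row lacks them.
+// Key names are illustrative, not taken from the skill.
+const V2_AREA_DEFAULTS = { lastTime: "", lastQuality: "" };
+
+// Returns an upgraded copy; the input object (and the file on disk) is left untouched.
+function upgradeToV2(testFile) {
+  return {
+    ...testFile, // unknown frontmatter fields pass through unchanged
+    schemaVersion: Math.max(testFile.schemaVersion ?? 1, 2),
+    // Defaults first, then the row, so existing and unknown columns win.
+    areas: testFile.areas.map((row) => ({ ...V2_AREA_DEFAULTS, ...row })),
+  };
+}
+```
+
+Because the function returns a new object, Phase 1 can use the upgraded view in memory while commit mode remains the only place that writes the v2 layout back.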
+
+### Prerequisite: Run Results Persistence
+
+The current skill relies on the agent's context window to pass run results from `/user-test` to `/user-test-commit`. With more data dimensions (timing, dual scores, qualitative notes), this becomes fragile.
+
+**Approach:**
+- After Phase 4 completes, write a `.user-test-last-run.json` file in `tests/user-flows/` containing: scenario slug, per-area scores (UX + optional quality), timing, qualitative summary, issues to file, maturity assessments
+- `/user-test-commit` reads this file instead of relying on context
+- The file is overwritten on each run (only last run is committable)
+- In Phase 1, advise the user to add `.user-test-last-run.json` to the project's `.gitignore`
+
+**Stale/missing file handling for `/user-test-commit`:**
+- **Missing file:** If `.user-test-last-run.json` doesn't exist, abort with "No run results found. Run `/user-test` first."
+- **Stale file:** The file includes a `run_timestamp` (ISO 8601). If the timestamp is older than 24 hours, warn: "Run results are from <timestamp>. Commit anyway? (y/n)." If older than 7 days, abort with "Run results too old — re-run `/user-test` first."
+- **Partial run:** The file includes a `completed: true|false` flag. If `false`, abort with "Last run was incomplete. Run `/user-test` again for committable results."
+- **No context fallback:** Commit mode never falls back to context window. The JSON file is the single source of truth.
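+
+The missing, stale, and partial rules above reduce to a small decision function. The sketch below is illustrative only: `run_timestamp` and `completed` are the field names from this plan, while `checkLastRun`, the returned `action`/`reason` strings, the check order, and the handling of an unparseable timestamp are assumptions.
+
+```javascript
+const HOUR_MS = 60 * 60 * 1000;
+
+// `run` is the parsed content of .user-test-last-run.json, or null when the file is missing.
+function checkLastRun(run, now = new Date()) {
+  if (run === null) return { action: "abort", reason: "missing" };
+  if (run.completed !== true) return { action: "abort", reason: "incomplete" };
+
+  const ageMs = now.getTime() - new Date(run.run_timestamp).getTime();
+  // Assumption: an unparseable timestamp is treated like a corrupt file.
+  if (Number.isNaN(ageMs)) return { action: "abort", reason: "bad_timestamp" };
+  if (ageMs > 7 * 24 * HOUR_MS) return { action: "abort", reason: "too_old" };
+  if (ageMs > 24 * HOUR_MS) return { action: "confirm", reason: "stale" }; // ask "Commit anyway?"
+  return { action: "commit", reason: "fresh" };
+}
+```
+
+Taking `now` as a parameter makes the 24-hour and 7-day boundaries easy to exercise without waiting on the clock.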
+
+## Priority 1: High Impact, Low Effort
+
+### 1A. Timing Tracking
+
+**Files:** SKILL.md Phase 3 + Phase 4, test-file-template.md
+
+**Change:** Measure wall-clock time per area (start timestamp before first MCP call, end after last). Record in seconds.
+
+**Template change — Areas table:**
+```
+| Area | Status | Last Score | Last Time | Consecutive Passes | Notes |
+```
+
+**Report output adds Time column:**
+```
+| Area | Status | Score | Time | Assessment |
+```
+
+**Edge cases:**
+- Partial area (disconnect mid-area): record time as `—` (incomplete), do not include in averages
+- Timing includes async waits — this is intentional (slow is slow, regardless of cause)
+
+**SKILL.md budget:** +8 lines
+
+### 1B. Qualitative Summary in Report Output
+
+**Files:** SKILL.md Phase 4 Report Output
+
+**Add after the scores table:**
+```
+Qualitative:
+- Best moment: <most impressive interaction observed>
+- Worst moment: <interaction that broke confidence>
+- Demo ready: yes / partial / no
+- One-line verdict: <summary>
+```
+
+**Persistence:** These fields are written to `.user-test-last-run.json` for commit mode. `demo_readiness`, `verdict`, and a brief `context` note are persisted to `test-history.md` during commit. The `context` field is a one-phrase explanation of *why* (e.g., "search results loading 28s" alongside verdict "partial") — without it, verdicts become ambiguous after a few weeks. `best_moment` and `worst_moment` are ephemeral (report-only) — they inform the human reviewer but don't need historical tracking.
+
+**Edge cases:**
+- All areas score the same: break the tie by expectation, picking the area that most exceeded expectations as best and the one that most fell short as worst
+- Only one area tested: best and worst are the same — write one line
+
+**SKILL.md budget:** +10 lines
+
+### 1C. Delta Tracking in Run History
+
+**Files:** SKILL.md Commit Mode, test-file-template.md
+
+**Change:** When appending to `test-history.md`, compute delta from the most recent *completed* previous run:
+```
+| Date | Quality Avg | Delta | Key Finding | Context |
+| 2/26 | 4.86 | +0.15 | Exclusion filters working | |
+| 2/24 | 4.71 | -0.18 | Forest green regression | color picker CSS regression |
+```
+
+Flag any delta worse than -0.5 with a warning in the commit output.
+
+**Edge cases:**
+- First run ever: delta is `—` (no baseline)
+- Previous run was partial: skip to the last complete run
+- Different area sets between runs: compute avg over only areas present in BOTH runs. If no overlap, delta is `—`. **Known limitation:** if area sets drift significantly over time (adding 3, removing 2), the delta is computed over a shrinking overlap and may look stable even when new areas perform poorly. Acceptable for now — flag for revisit if delta becomes unreliable in practice.
+- Iterate mode: delta is computed between the iterate session's aggregate and the previous non-iterate run. Per-iteration deltas within a session are NOT computed (they are noise, not signal)
+- Iterate mode output includes per-run timing and timing variance alongside score variance. A consistent 28s is fine; wild swings between 5s and 45s indicate flakiness worth investigating.
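+
+As a worked illustration of the overlap rule, the delta can be computed over only the area slugs present in both runs. This is a sketch, not the skill's code: the function names and the slug-to-score object shape are hypothetical, and rounding to two decimals is an assumption based on the example table.
+
+```javascript
+// Each run maps area slug -> UX score. Returns null when there is no baseline
+// or no shared area (shown as a dash in the history table).
+function computeDelta(current, previous) {
+  if (!previous) return null;
+  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
+  const shared = Object.keys(current).filter((slug) => has(previous, slug));
+  if (shared.length === 0) return null;
+  const avg = (run) => shared.reduce((sum, slug) => sum + run[slug], 0) / shared.length;
+  return Math.round((avg(current) - avg(previous)) * 100) / 100;
+}
+
+// "Worse than -0.5" means strictly below -0.5.
+function needsRegressionWarning(delta) {
+  return delta !== null && delta < -0.5;
+}
+```
+
+For example, if the current run scores `search: 5, cart: 4, checkout: 3` and the previous run scored `search: 4, cart: 4, wishlist: 2`, only `search` and `cart` are compared, giving 4.5 against 4.0.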
+
+**SKILL.md budget:** +10 lines
+
+### 1D. Explore-Next-Run Generation Guidance
+
+**Files:** SKILL.md Phase 4
+
+**Add:** After scoring, explicitly generate 2-3 Explore Next Run items with priority:
+- **P1** — Things that surprised you (positive or negative)
+- **P2** — Edge cases adjacent to tested areas
+- **P3** — Interactions started but not finished, or borderline scores (score of 3)
+
+A "borderline" score is a 3/5. Any area scoring 3 warrants deeper investigation next run, regardless of maturity status.
+
+**SKILL.md budget:** +8 lines
+
+## Priority 2: Medium Impact, Medium Effort
+
+### 2A. Optional CLI Mode
+
+**Files:** SKILL.md (new Phase 2.5), test-file-template.md (frontmatter)
+
+**Test file frontmatter addition:**
+```yaml
+---
+cli_test_command: "node scripts/test-cli.js --query '{query}'"  # optional
+cli_queries:  # optional
+  - query: "queen bed hot sleeper"
+    expected: "cooling materials, percale or linen"
+  - query: "something nice"
+    expected: "asks clarifying questions"
+---
+```
+
+**SKILL.md addition — Phase 2.5: CLI Testing**
+
+If the test file defines `cli_test_command`:
+1. Skip Phase 0 MCP preflight (CLI doesn't need Chrome). Run `gh auth status` check only.
+2. Skip Phase 2 browser setup entirely
+3. For each query in `cli_queries`: run the command via Bash, capture stdout
+4. Score output quality 1-5 against the `expected` field using the **output quality rubric** (see 2B). The `expected` field describes what a correct response looks like; the agent judges semantic fit against the rubric, NOT exact string matching.
+5. CLI results feed into the same maturity map and scoring pipeline
+6. If BOTH `cli_queries` and browser areas exist in the test file: run CLI first. If CLI reveals broken agent logic (scores <= 2), skip browser testing for overlapping areas with a note "CLI pre-check failed — skipping browser test."
+
+**Overlap detection is explicit, not agent-inferred.** Each CLI query can optionally tag the browser area it pre-checks:
+```yaml
+cli_queries:
+  - query: "queen bed hot sleeper"
+    expected: "cooling materials, percale or linen"
+    prechecks: "search-results"  # area slug — skip this browser area on CLI failure
+  - query: "something nice"
+    expected: "asks clarifying questions"
+    # no prechecks tag — CLI-only, no browser area overlap
+```
+If `prechecks` is present and the CLI query scores <= 2, the tagged browser area is skipped. If `prechecks` is absent, the CLI query is standalone — no browser areas are skipped regardless of score. This eliminates fuzzy semantic matching at runtime.
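+
+The explicit tagging rule amounts to a simple filter over CLI results. The sketch below assumes each result carries the query's numeric `score` next to its optional `prechecks` slug; `browserAreasToSkip` and that result shape are hypothetical.
+
+```javascript
+// `results` has one entry per CLI query: { query, score, prechecks? }.
+// Returns the set of browser area slugs to skip after the CLI pre-check.
+function browserAreasToSkip(results) {
+  const skipped = new Set();
+  for (const result of results) {
+    // Untagged queries are standalone and never cause a skip.
+    if (result.prechecks && result.score <= 2) skipped.add(result.prechecks);
+  }
+  return skipped;
+}
+```
+
+A failing query without a `prechecks` tag contributes nothing to the set, which is the behavior that removes runtime semantic matching.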
+
+**Credential handling:** The `cli_test_command` runs as a Bash command inheriting the shell environment. No credentials are stored in the test file. If the command needs env vars, the user sets them in their shell before running `/user-test`.
+
+**Iterate mode:** CLI iterate resets by simply re-running the command (no browser reload needed). If the command has side effects (DB writes), document this limitation in iterate-mode.md.
+
+**SKILL.md budget:** +30 lines (extract to `references/cli-mode.md` if it exceeds 35)
+
+### 2B. Output Quality Scoring Dimension
+
+**Files:** SKILL.md Phase 4 Scoring, test-file-template.md
+
+**Change:** Areas can optionally have `scored_output: true` in their area details. When set, score TWO dimensions:
+
+| Dimension | Rubric | When to use |
+|-----------|--------|-------------|
+| **UX score (1-5)** | Existing rubric (broken → delightful) | Always |
+| **Quality score (1-5)** | Output correctness rubric (below) | Only when `scored_output: true` |
+
+**Output Quality Rubric:**
+
+| Score | Meaning | Example |
+|-------|---------|---------|
+| 5 | Exactly what an expert would produce | Right products, right reasoning |
+| 4 | Relevant, minor misses | Mostly right, one irrelevant result |
+| 3 | Partially correct | Some right, some wrong |
+| 2 | Mostly wrong | Misunderstood intent |
+| 1 | Completely wrong | Wrong category, hallucinated data |
+
+**Report shows both:** `UX: 4/5, Quality: 3/5`
+
+**Aggregation rules:**
+- `Quality Avg` in run history = average of UX scores only. Despite its name, it excludes the output quality score, which maintains backward compatibility for areas without `scored_output`.
+- **Promotion gate for `scored_output: true` areas:** UX >= 4 AND Quality >= 3. A beautiful UI showing wrong results should not promote to Proven.
+- **Promotion gate for standard areas:** UX >= 4 only (unchanged from v1)
+- Quality score tracked as `Output Avg` in the report for visibility
+- Known-bug filing: trigger on UX <= 2 (functional failure) OR Quality <= 1 (completely wrong output)
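+
+The two score gates above can be written as predicates. This is a sketch of the score thresholds only; promotion still depends on consecutive passes and agent judgment, and the `ux`, `quality`, and `scoredOutput` property names are hypothetical.
+
+```javascript
+// Score gate for promotion to Proven (necessary, not sufficient).
+function meetsPromotionGate(area) {
+  if (area.ux < 4) return false;
+  // Areas with scored output must also show at least partially correct results.
+  return area.scoredOutput ? area.quality >= 3 : true;
+}
+
+// Known-bug filing: functional failure, or completely wrong output.
+function shouldFileKnownBug(area) {
+  return area.ux <= 2 || (area.scoredOutput === true && area.quality <= 1);
+}
+```
+
+An area with `ux: 5` and `quality: 2` fails the gate, which is the "beautiful UI showing wrong results" case.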
+
+**Template change — Areas table:**
+```
+| Area | Status | Last Score | Last Quality | Last Time | Consecutive Passes | Notes |
+```
+(`Last Quality` column only populated for areas with `scored_output: true`)
+
+**SKILL.md budget:** +15 lines
+
+### 2C. Conditional Regression Checks for Known-Bug Areas
+
+**Files:** SKILL.md Phase 3, test-file-template.md
+
+**Test file area detail addition:**
+```markdown
+### cart-quantity-update
+**Status:** Known-bug
+**Issue:** #47
+**Fix check:** Verify quantity updates in <5s and cart badge reflects new count
+```
+
+**SKILL.md Phase 3 addition:**
+
+When encountering a Known-bug area:
+1. If `gh` is not authenticated: skip as normal (no change)
+2. Check if the linked issue is closed: `gh issue view <issue-number> --json state -q '.state'`
+3. If the state is `CLOSED` (`gh` reports issue states in upper case): flip area to Uncharted, run the area's **Fix check** as the first test for that area
+4. If `OPEN`: skip as normal
+5. If the fix check fails (score <= 2): file a new issue with note "Regression of #N" in the body referencing the original closed issue for traceability. The dedup check (`--state open`) won't find the closed issue, so a new issue is created — this is correct behavior.
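+
+The branching in the steps above could be captured as follows. This is a hedged sketch: `knownBugAction` and `regressionNote` are hypothetical helpers, and the state string is assumed to be whatever the `gh issue view` query printed, compared case-insensitively.
+
+```javascript
+// `issueState` is the string printed by the `gh issue view` query,
+// or null when `gh` is not authenticated.
+function knownBugAction(issueState) {
+  if (issueState === null) return "skip";
+  return issueState.trim().toLowerCase() === "closed" ? "recheck" : "skip";
+}
+
+// After the fix check runs: a score of 2 or lower means the fix did not hold,
+// so the new issue body references the original closed issue.
+function regressionNote(fixCheckScore, issueNumber) {
+  return fixCheckScore <= 2 ? `Regression of #${issueNumber}` : null;
+}
+```
+
+With the example area above, a closed issue #47 whose fix check scores 2 would yield the note `Regression of #47`.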
+
+**Template change:** Known-bug areas store `**Issue:** #<number>` in their area details section. This is the canonical reference for `gh issue view`.
+
+**SKILL.md budget:** +15 lines
+
+## Priority 3: Nice to Have
+
+### 3A. Async Wait Pattern
+
+**Files:** browser-input-patterns.md only (no SKILL.md change)
+
+**Add:**
+```javascript
+// Wait for async operation completion
+mcp__claude-in-chrome__javascript_tool({
+  code: `
+    (async () => {
+      const start = Date.now();
+      const timeout = 10000;
+      const selector = '.success-message';
+      while (Date.now() - start < timeout) {
+        if (document.querySelector(selector)) return 'found';
+        await new Promise(r => setTimeout(r, 200));
+      }
+      return 'timeout';
+    })()
+  `
+})
+```
+
+**SKILL.md budget:** 0 lines
+
+### 3B. Performance Threshold Configuration
+
+**Files:** test-file-template.md frontmatter, SKILL.md Phase 4
+
+**Frontmatter addition:**
+```yaml
+---
+performance_thresholds:  # optional, seconds
+  fast: 2
+  acceptable: 8
+  slow: 20
+  broken: 60
+---
+```
+
+**Scoring integration:** If thresholds are defined, append a timing grade to each area's assessment in the report: `(fast)`, `(acceptable)`, `(slow)`, `(BROKEN)`. A `broken` timing grade is a finding worth noting but does NOT affect the UX score — timing and quality are separate dimensions.
+
+**Measurement:** Wall-clock time from 1A. No browser performance API needed.
+
+**SKILL.md budget:** +8 lines
+
+### 3C. End-to-End Unscripted Scenario Type — DEFERRED
+
+**Rationale for deferral:** The SpecFlow analysis identified fundamental conflicts with the maturity model. Unscripted runs produce emergent areas that don't map to stable slugs, breaking consecutive-pass tracking, issue label convention, and iterate mode compatibility. This needs a separate design pass (possibly a distinct mode with its own output format) rather than being retrofitted into the existing area-based model.
+
+**Interim alternative:** Users can approximate unscripted testing by creating a test file with broad areas (e.g., `first-time-onboarding`) and giving the agent latitude in the area description. This captures most of the value without the architectural conflict.
+
+## Technical Considerations
+
+### SKILL.md Budget Impact
+
+| Change | Lines Added | Cumulative |
+|--------|-----------|------------|
+| Current | 0 | 192 |
+| 1A Timing | +8 | 200 |
+| 1B Qualitative | +10 | 210 |
+| 1C Delta | +10 | 220 |
+| 1D Explore | +8 | 228 |
+| 2A CLI mode | +30 | 258 |
+| 2B Quality scoring | +15 | 273 |
+| 2C Regression checks | +15 | 288 |
+| 3B Thresholds | +8 | 296 |
+| **Total** | **+104** | **~296** |
+
+Well within the 500-line budget. If CLI mode grows beyond 35 lines during implementation, extract to `references/cli-mode.md`.
+
+### File Change Summary
+
+| File | Changes |
+|------|---------|
+| `SKILL.md` | +104 lines: timing in Phase 3-4, qualitative summary in Phase 4, delta in Commit Mode, explore-next generation in Phase 4, CLI Phase 2.5, output quality rubric, regression checks in Phase 3, threshold eval in Phase 4 |
+| `test-file-template.md` | Schema v2: new columns (Last Time, Last Quality), `cli_test_command`/`cli_queries` frontmatter, `performance_thresholds` frontmatter, Known-bug `Issue:` field, `fix_check` field |
+| `browser-input-patterns.md` | +15 lines: async wait pattern |
+| `iterate-mode.md` | +8 lines: CLI iterate reset note, timing per run in output table, timing variance alongside score variance |
+| Commands | No changes (thin wrappers unchanged) |
+
+### Backward Compatibility
+
+- v1 test files work unchanged — missing columns filled with defaults on read
+- v1 files upgraded to v2 on next commit (non-destructive)
+- CLI mode is opt-in (no `cli_test_command` = no CLI testing)
+- Quality scoring is opt-in (`scored_output: true` per area)
+- Performance thresholds are opt-in (no frontmatter = no timing grades)
+
+## Acceptance Criteria
+
+### P1 Changes
+
+- [x] 1A: Report output includes `Time` column per area
+- [x] 1A: Test file template has `Last Time` column in areas table
+- [x] 1A: Partial area timing recorded as `—`
+- [x] 1B: Report output includes qualitative summary (best moment, worst moment, demo ready, verdict)
+- [x] 1B: `demo_readiness` and `verdict` persist to test-history.md via commit mode
+- [x] 1C: Run history includes `Delta` column computed from last complete run
+- [x] 1C: Delta worse than -0.5 flagged with warning
+- [x] 1C: First run shows delta as `—`
+- [x] 1C: Iterate mode computes delta vs. pre-session baseline only
+- [x] 1D: Phase 4 generates 2-3 Explore Next Run items with P1/P2/P3 priority
+- [x] 1D: Borderline (score 3) areas flagged for deeper investigation
+
+### P2 Changes
+
+- [x] 2A: Test files with `cli_test_command` run CLI queries before browser testing
+- [x] 2A: CLI mode skips Phase 0 MCP preflight and Phase 2 browser setup
+- [x] 2A: CLI queries use explicit `prechecks` tag for browser area overlap (no agent-inferred matching)
+- [x] 2A: No credentials stored in test file
+- [x] 2B: Areas with `scored_output: true` show dual scores (UX + Quality)
+- [x] 2B: Quality Avg in history = UX scores only (backward compatible)
+- [x] 2B: Known-bug trigger: UX <= 2 OR Quality <= 1
+- [x] 2C: Known-bug areas with closed issues auto-flip to Uncharted
+- [x] 2C: Fix check runs as first test for areas flipped back to Uncharted
+- [x] 2C: Issue number stored in area details (`**Issue:** #N`)
+
+### P3 Changes
+
+- [x] 3A: Async wait pattern documented in browser-input-patterns.md
+- [x] 3B: Optional `performance_thresholds` frontmatter evaluates timing grades
+- [x] 3C: Deferred — documented as future work
+
+### Prerequisites
+
+- [x] Schema migration: v1 files read without error, upgraded to v2 on commit
+- [x] Forward compatibility: v2 reader tolerates unknown frontmatter fields from future schema versions (ignore, don't error)
+- [x] Run results persistence: `.user-test-last-run.json` written after Phase 4, read by commit mode
+- [x] `.user-test-last-run.json` added to `.gitignore` guidance
+- [x] Commit mode aborts if `.user-test-last-run.json` missing, incomplete, or >7d stale
+- [x] Commit mode warns if `.user-test-last-run.json` >24h old
+- [x] `verdict` persists with `context` note to test-history.md
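+
+The two staleness rules above (abort if missing, incomplete, or older than 7 days; warn if older than 24 hours) can be sketched as follows. This is an illustration only: the function name `check_last_run` and the `written_at` / `complete` inputs are assumptions, not fields confirmed by the skill.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+
+def check_last_run(written_at, complete, now=None):
+    """Return 'abort', 'warn', or 'ok' for a run-results file.
+
+    written_at is None when the file is missing (hypothetical convention).
+    """
+    now = now or datetime.now(timezone.utc)
+    if written_at is None or not complete:
+        return "abort"
+    age = now - written_at
+    if age > timedelta(days=7):
+        return "abort"
+    if age > timedelta(hours=24):
+        return "warn"
+    return "ok"
+```
+
+Passing `now` explicitly keeps the check deterministic when it is exercised in tests.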
+
+### Post-Change Validation
+
+- [x] SKILL.md <= 500 lines after all changes (313 lines)
+- [x] All reference file links use proper markdown format
+- [x] Existing v1 test files load without error
+- [x] Version bump in plugin.json, marketplace.json, CHANGELOG.md (2.36.0)
+- [x] Reinstall to `~/.claude/skills/user-test/` and `~/.claude/commands/`
+
+## Dependencies & Risks
+
+| Risk | Mitigation |
+|------|-----------|
+| SKILL.md exceeds 500 lines | Extract CLI mode to references/cli-mode.md if >35 lines |
+| v1 test files break with new columns | Schema migration reads v1, upgrades on commit |
+| Run results lost between sessions | `.user-test-last-run.json` persists results to disk |
+| CLI command has side effects on iterate | Document limitation in iterate-mode.md |
+| Delta misleading with changing area sets | Compute delta over overlapping areas only |
+| 3C unscripted conflicts with maturity model | Deferred — needs separate design |
+| Stale `.user-test-last-run.json` committed | Timestamp check: warn >24h, block >7d, block if incomplete |
+
+## Implementation Sequence
+
+1. **Prerequisites first** — schema migration logic + `.user-test-last-run.json` persistence
+2. **P1 changes** (1A, 1B, 1C, 1D) — all low effort, high value
+3. **P2 changes** (2A, 2B, 2C) — medium effort, build on P1 foundations
+4. **P3 changes** (3A, 3B) — nice-to-have, zero risk
+5. **Reinstall** — copy updated files to `~/.claude/skills/` and `~/.claude/commands/`
+6. **Validate** — run `/user-test` against a test scenario to verify
+
+## Sources & References
+
+### Internal References
+- Current SKILL.md: `plugins/compound-engineering/skills/user-test/SKILL.md` (192 lines)
+- Current template: `plugins/compound-engineering/skills/user-test/references/test-file-template.md` (81 lines)
+- Current patterns: `plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md` (54 lines)
+- Current iterate: `plugins/compound-engineering/skills/user-test/references/iterate-mode.md` (65 lines)
+- Skill size budget: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md`
+- Original plan: `docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md`
+
+### Conventions Applied
+- Schema versioning for forward compatibility
+- SKILL.md 500-line budget with reference extraction fallback
+- Thin wrapper commands unchanged (no new commands needed)
+- Backward-compatible template migration (read v1, write v2)
diff --git a/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md b/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md
new file mode 100644
index 000000000..050135db6
--- /dev/null
+++ b/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md
@@ -0,0 +1,453 @@
+---
+title: "Evolve user-test into a compounding UX intelligence system"
+type: feat
+status: completed
+date: 2026-02-28
+origin: User vision document (inline, 2026-02-28) + 7 rounds of real testing on Resale Clothing Shop
+prior_art:
+  - docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md (v1, completed)
+  - docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md (v2, completed)
+---
+
+# Evolve user-test into a Compounding UX Intelligence System
+
+## Overview
+
+Transform `/user-test` from a browser test runner into a **compounding UX intelligence system** — one that explores, regresses, and gets smarter with every run. The current v2 skill (321 lines) has the right foundations (maturity model, scoring rubric, CLI+browser layers, auto-commit). This plan closes 6 specific gaps identified after 7 real test runs, adds a new UX Opportunities signal category, and wires up the compounding loop so discoveries graduate from browser exploration to CLI regression checks.
+
+**What v3 adds that v2 doesn't have:**
+1. Bug registry (`bugs.md`) with open/fixed/regressed lifecycle
+2. Per-area score history (not just top-level delta)
+3. Structured skip reasons (untested-by-choice vs. infrastructure-failure)
+4. Explicit pass thresholds per area
+5. Queryable qualitative data (area-tagged best/worst moments)
+6. Discovery-to-regression graduation (browser findings become CLI checks)
+7. UX Opportunities — suggestions, not failures — as a new report section
+
+## Problem Statement / Motivation
+
+After 7 real test runs, the skill produces useful signal but doesn't compound it efficiently:
+
+- **Bugs evaporate.** A bug found in run 3 and fixed in run 5 has no persistent record linking the discovery to the fix. If it regresses in run 8, the skill doesn't know it's a regression — it just finds a "new" bug.
+- **Delta is top-level only.** "Quality went from 3.5 → 4.0" doesn't tell you which area improved. Per-area score history is needed to answer "did the shipping form fix actually help?"
+- **Disconnects are invisible.** Three disconnects in a run produce null scores with "extension disconnected" buried in assessment text. There's no machine-readable way to distinguish "skipped because Proven" from "skipped because Chrome crashed."
+- **Pass thresholds are implicit.** `consecutive_passes` exists but what counts as a pass varies by area. A search results area with `scored_output: true` needs UX >= 4 AND Quality >= 3, but this threshold lives only in the agent's head.
+- **Qualitative signal is write-once.** "Best moment: agent search is excellent" appears in one run's JSON but can't be queried over 20 runs to surface patterns like "agent/search has been the best moment 8 of 10 times."
+- **The flywheel doesn't close.** Browser discoveries don't become CLI regression checks. The same bug can silently regress without the fast layer catching it.
+
+## Proposed Solution
+
+### Phase 1: Bug Registry (Gap #1)
+
+Add `tests/user-flows/bugs.md` — a persistent, machine-readable bug tracker that complements GitHub Issues.
+
+```markdown
+# Bug Registry
+
+| ID | Area | Status | Issue | Summary | Found | Fixed | Regressed |
+|----|------|--------|-------|---------|-------|-------|-----------|
+| B001 | checkout/shipping-form | open | #47 | Accepts invalid zip codes | 2026-02-28 | — | — |
+| B002 | browse/product-grid | fixed | #48 | Cards not clickable | 2026-02-28 | 2026-03-01 | — |
+| B003 | browse/product-grid | regressed | #52 | Cards not clickable (regression of B002) | 2026-03-05 | — | 2026-03-05 |
+```
+
+**Status lifecycle:** `open` → `fixed` (when Known-bug area passes fix_check) → `regressed` (if same area fails again after fix). Cross-reference: `Issue` column links to GitHub, `ID` column is the local canonical reference.
+
+**Multi-area bugs:** A bug that manifests in multiple areas gets ONE registry entry with the primary area. The `Summary` field notes "Also affects: area-a, area-b". Each affected area's Known-bug detail references the same bug ID.
+
+**Commit mode updates:** After each run, commit mode:
+1. Marks bugs as `fixed` when a Known-bug area's fix_check passes (score >= area's `pass_threshold`, default 4)
+2. Files new bugs with next sequential ID
+3. Marks bugs as `regressed` when a previously-fixed area fails again
+4. Syncs with GitHub issue state (closed issue + passing fix_check = fixed)
+
+**File location:** `tests/user-flows/bugs.md` alongside scenario files. One registry per project, not per scenario.
+
+### Phase 2: Per-Area Score History (Gap #2)
+
+Split storage by audience: humans see trends, machines store history.
+
+**Machine-readable history:** `tests/user-flows/score-history.json` alongside `bugs.md`:
+
+```json
+{
+  "areas": {
+    "checkout/cart": {
+      "scores": [
+        { "date": "2026-02-28", "ux": 3, "quality": null, "time": 8 },
+        { "date": "2026-03-01", "ux": 4, "quality": null, "time": 7 },
+        { "date": "2026-03-02", "ux": 4, "quality": null, "time": 6 }
+      ],
+      "trend": "improving"
+    },
+    "agent/search-query": {
+      "scores": [
+        { "date": "2026-02-28", "ux": 4, "quality": 3, "time": 12 },
+        { "date": "2026-03-01", "ux": 5, "quality": 4, "time": 9 }
+      ],
+      "trend": "improving"
+    }
+  }
+}
+```
+
+**Storage:** Last 10 entries per area. Oldest drops off when 11th is recorded. One file per project (not per scenario).
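+
+A minimal sketch of the 10-entry cap, assuming the JSON shape shown above. The function name `record_score` and the `MAX_ENTRIES` constant are illustrative, not part of the skill:
+
+```python
+MAX_ENTRIES = 10  # cap stated in this plan; the constant name is hypothetical
+
+
+def record_score(history, slug, entry):
+    """Append one run's entry for an area and keep only the newest MAX_ENTRIES."""
+    area = history.setdefault("areas", {}).setdefault(slug, {"scores": []})
+    area["scores"].append(entry)
+    # Drop the oldest entries once the cap is exceeded
+    area["scores"] = area["scores"][-MAX_ENTRIES:]
+    return history
+
+
+history = {"areas": {}}
+for day in range(1, 12):  # 11 runs, so the first entry falls off
+    record_score(
+        history,
+        "checkout/cart",
+        {"date": f"2026-03-{day:02d}", "ux": 4, "quality": None, "time": 7},
+    )
+```
+
+After the loop, `checkout/cart` holds ten entries and the oldest remaining one is dated `2026-03-02`.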
+
+**Human-readable trends in test file:** A thin `## Area Trends` section replaces the wide history table:
+
+```markdown
+## Area Trends
+
+| Area | Trend | Last Score | Delta |
+|------|-------|------------|-------|
+| checkout/cart | improving | 4 | +1 |
+| checkout/shipping | fixed | 4 | +2 |
+| browse/product-grid | stable | 5 | — |
+```
+
+**Trend computation:** `improving` (last 3 trending up), `stable` (variance < 0.5 over last 3), `declining` (last 3 trending down), `volatile` (variance >= 1.0 over last 3), `fixed` (previous was <= 2, current >= pass_threshold). Computed from `score-history.json`, not by parsing markdown.
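+
+One possible reading of these rules, sketched in Python. Several choices here are assumptions rather than decisions made by this plan: the window is the last three UX scores (two are accepted so short histories still get a label), "trending up" means non-decreasing with a net gain, variance is population variance, `fixed` and direction are checked before `volatile`, and the unspecified 0.5 to 1.0 variance band falls back to `stable`. The name `classify_trend` is hypothetical.
+
+```python
+from statistics import pvariance
+
+
+def classify_trend(scores, pass_threshold=4):
+    """Classify an area's trend from its UX scores (oldest first)."""
+    window = scores[-3:]
+    if len(window) < 2:
+        return "stable"  # assumption: too little history to call a direction
+    previous, current = window[-2], window[-1]
+    if previous <= 2 and current >= pass_threshold:
+        return "fixed"
+    pairs = list(zip(window, window[1:]))
+    if all(a <= b for a, b in pairs) and window[-1] > window[0]:
+        return "improving"
+    if all(a >= b for a, b in pairs) and window[-1] < window[0]:
+        return "declining"
+    if pvariance(window) >= 1.0:
+        return "volatile"
+    # Covers variance < 0.5 and, by assumption, the unspecified 0.5-1.0 band
+    return "stable"
+```
+
+Under these assumptions the `checkout/cart` history above (3, 4, 4) classifies as `improving`, matching the sample JSON.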
+
+**Delta computation:** Per-area delta compares current score to previous score for that specific area in `score-history.json`. This supplements the existing top-level delta in run history.
+
+### Phase 3: Structured Skip Reasons (Gap #3)
+
+Add `skip_reason` field to each area result in `.user-test-last-run.json`:
+
+```json
+{
+  "slug": "compare/add-view",
+  "ux_score": null,
+  "skip_reason": "disconnect",
+  "time_seconds": null
+}
+```
+
+**Enum values:**
+- `null` — area was tested normally
+- `"proven_spotcheck"` — Proven area, spot-checked only
+- `"known_bug_open"` — Known-bug, issue still open, skipped
+- `"known_bug_fixed"` — Known-bug, issue closed, ran fix_check
+- `"cli_precheck_failed"` — CLI precheck for this area scored <= 2
+- `"disconnect"` — MCP disconnect interrupted this area
+- `"user_skip"` — User explicitly skipped
+
+**Report impact:** Pass rate calculation excludes `disconnect` and `user_skip` areas. The report shows: "Pass rate: 4/5 (1 area skipped: disconnect)".
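+
+A sketch of the pass-rate line, assuming area results shaped like the JSON above plus a hypothetical per-area `pass_threshold` key copied from the test file (default 4). Areas with any other skip reason stay in the denominator here, which is an assumption since the plan only names the two exclusions. The name `pass_rate_line` is illustrative:
+
+```python
+EXCLUDED = {"disconnect", "user_skip"}  # skip reasons left out of the pass rate
+
+
+def pass_rate_line(areas, default_threshold=4):
+    """Build the report's pass-rate line from a list of area results."""
+    counted = [a for a in areas if a.get("skip_reason") not in EXCLUDED]
+    skipped = [a for a in areas if a.get("skip_reason") in EXCLUDED]
+    passed = sum(
+        1
+        for a in counted
+        if a.get("ux_score") is not None
+        and a["ux_score"] >= a.get("pass_threshold", default_threshold)
+    )
+    line = f"Pass rate: {passed}/{len(counted)}"
+    if skipped:
+        reasons = ", ".join(sorted({a["skip_reason"] for a in skipped}))
+        noun = "area" if len(skipped) == 1 else "areas"
+        line += f" ({len(skipped)} {noun} skipped: {reasons})"
+    return line
+```
+
+With four passing areas, one failing area, and one `disconnect`, this produces the line quoted above.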
+
+### Phase 4: Explicit Pass Thresholds (Gap #4)
+
+Add `pass_threshold` to area details in test files:
+
+```markdown
+### checkout/shipping-form
+**Interactions:** Enter address, select method, see estimate
+**What's tested:** Form validation + shipping logic
+**pass_threshold:** 4
+```
+
+```markdown
+### agent/search-results
+**Interactions:** Enter query, review results, refine search
+**What's tested:** Result relevance and ranking quality
+**scored_output:** true
+**pass_threshold:** 4
+**quality_threshold:** 3
+```
+
+**Defaults:** If `pass_threshold` is not set, default is 4. If `quality_threshold` is not set for `scored_output` areas, default is 3. These match the current implicit behavior but make it explicit and per-area configurable.
+
+**Promotion gate uses thresholds:** "2+ consecutive passes" means 2+ consecutive runs where UX >= `pass_threshold` (and Quality >= `quality_threshold` for scored_output areas).
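+
+A minimal sketch of the gate, assuming each run result carries `ux` and `quality` values and each area's details have been parsed into a dict. The helper names `is_pass` and `consecutive_passes` are illustrative, not part of the skill:
+
+```python
+def is_pass(result, area):
+    """True when a run meets the area's thresholds (defaults: UX 4, Quality 3)."""
+    ux = result.get("ux")
+    ux_ok = ux is not None and ux >= area.get("pass_threshold", 4)
+    if not area.get("scored_output", False):
+        return ux_ok
+    quality = result.get("quality")
+    return ux_ok and quality is not None and quality >= area.get("quality_threshold", 3)
+
+
+def consecutive_passes(results, area):
+    """Count passing runs at the end of the history (oldest first)."""
+    count = 0
+    for result in reversed(results):
+        if not is_pass(result, area):
+            break
+        count += 1
+    return count
+```
+
+An area would then be eligible for promotion when `consecutive_passes(...) >= 2`.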
+
+**Self-documenting:** The test file now contains everything needed to understand when an area graduates — no implicit knowledge required.
+
+### Phase 5: Queryable Qualitative Data (Gap #5)
+
+Tag each qualitative observation with the area slug it relates to:
+
+**In `.user-test-last-run.json`:**
+```json
+{
+  "qualitative": {
+    "best_moment": { "area": "agent/search-query", "text": "Agent search returns highly relevant results in <2s" },
+    "worst_moment": { "area": "browse/product-detail", "text": "Product cards aren't clickable — expected click-to-detail" },
+    "demo_readiness": "partial",
+    "verdict": "Agent core is impressive but missing product-click-to-detail hurts the experience",
+    "context": "search excellent, product grid needs click handler"
+  }
+}
+```
+
+**In `test-history.md`:** The existing `Key Finding` column already captures one-line findings. Add `Best Area` and `Worst Area` columns to enable pattern queries:
+
+```markdown
+| Date | Areas Tested | Quality Avg | Delta | Pass Rate | Best Area | Worst Area | Demo Ready | Context | Key Finding |
+```
+
+**Pattern surfacing:** After 10+ runs, commit mode surfaces patterns with asymmetric thresholds:
+- **Positive patterns** (high bar): "area X has been best moment in 7+ of last 10 runs" — high evidence required because this is informational, not actionable
+- **Negative patterns** (moderate bar): "area X has been worst moment in 5+ of last 10 runs" — lower threshold than positive, but not 3-in-a-row (too noisy during normal development churn). Five of ten captures genuine trends while filtering out transient spikes from feature work.
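+
+The asymmetric thresholds can be sketched as a simple count over the last ten runs. This assumes each run-history row has been parsed into a dict with hypothetical `best_area` and `worst_area` keys; `surface_patterns` is an illustrative name:
+
+```python
+from collections import Counter
+
+MIN_RUNS = 10      # no pattern surfacing before 10 runs
+POSITIVE_BAR = 7   # best moment in 7+ of last 10 runs
+NEGATIVE_BAR = 5   # worst moment in 5+ of last 10 runs
+
+
+def surface_patterns(runs):
+    """runs: oldest first, each with 'best_area' and 'worst_area' slugs."""
+    if len(runs) < MIN_RUNS:
+        return []
+    recent = runs[-10:]
+    best = Counter(r["best_area"] for r in recent if r.get("best_area"))
+    worst = Counter(r["worst_area"] for r in recent if r.get("worst_area"))
+    notes = []
+    for area, n in best.items():
+        if n >= POSITIVE_BAR:
+            notes.append(f"{area} has been best moment in {n} of last 10 runs")
+    for area, n in worst.items():
+        if n >= NEGATIVE_BAR:
+            notes.append(f"{area} has been worst moment in {n} of last 10 runs")
+    return notes
+```
+
+Runs with a missing best or worst area are simply not counted toward either bar.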
+
+### Phase 6: Discovery-to-Regression Graduation (Gap #6)
+
+This is the highest-leverage change. When a browser-layer discovery is fixed and verified, offer to generate a CLI regression check.
+
+**Trigger:** When commit mode marks a bug as `fixed`:
+1. Check if `cli_test_command` exists in the scenario frontmatter
+2. If yes, offer: "Bug B002 (cards not clickable) is fixed. Generate a CLI regression check? (y/n)"
+3. If user accepts, append to `cli_queries` in the test file:
+
+```yaml
+cli_queries:
+  - query: "show me product cards"
+    expected: "Returns product data with clickable links or URLs"
+    prechecks: "browse/product-grid"
+    graduated_from: "B002"  # links back to the bug that spawned this check
+```
+
+**Graduation trigger:** Manual decision (user confirms). Automatic graduation after N passes was considered but rejected — the user knows better than the system whether a CLI check can meaningfully cover a UX-discovered issue. Some discoveries are inherently browser-only (layout, animation, visual feedback).
+
+**CLI-ineligible bugs:** If no `cli_test_command` exists, skip the graduation offer. If the bug is purely visual (e.g., CSS layout), note "This bug is browser-only — no CLI graduation available."
+
+**The compounding loop this enables:**
+```
+Browser discovers bug → bug filed → developer fixes → next run verifies fix
+    → fix confirmed → CLI regression check generated
+    → future regressions caught by fast CLI layer
+    → browser time freed for new exploration
+```
+
+### Phase 7: UX Opportunities (New Signal Category)
+
+Two distinct sections in the Phase 4 report — improvement suggestions and patterns to protect:
+
+**UX Opportunities** (action items — things to improve):
+
+```
+UX Opportunities:
+| ID | Area | Priority | Status | Suggestion |
+|----|------|----------|--------|-----------|
+| UX001 | browse/product-grid | P1 | open | Product cards should be clickable (users expect click-to-detail) |
+| UX002 | agent/search-results | P2 | open | Follow-up suggestion buttons are excellent — make more prominent |
+```
+
+**Good Patterns** (preservation notes — things to protect):
+
+```
+Good Patterns:
+| Area | Pattern | First Seen | Last Confirmed |
+|------|---------|------------|----------------|
+| browse/filters | Filter chip with sub-filter counts is a best-practice pattern | 2026-02-28 | 2026-03-02 |
+| agent/search-results | Agent follow-up buttons after search are excellent | 2026-02-28 | 2026-03-02 |
+```
+
+**Why separate sections:** P1 and P2 are action items — things to improve. Good Patterns are "don't break this" notes — a fundamentally different signal. Mixing them in one table conflates suggestions with preservation. Good Patterns also have a simpler lifecycle (confirmed each run, no status transitions).
+
+**Priority mapping (UX Opportunities only):**
+- **P1** — Missing expected interaction (friction source)
+- **P2** — Enhancement to an already-good interaction
+
+**UX Opportunity lifecycle:** Each entry has a `status` field:
+- `open` — suggestion logged, not yet acted on
+- `implemented` — the improvement was made (agent detects the change, or user marks manually)
+- `wont_fix` — explicitly declined (keeps the log honest, prevents re-suggestion)
+
+Entries rotate: keep last 20 `open` entries. `implemented` and `wont_fix` entries age out after 30 days (they've served their purpose).
+
+**Good Patterns lifecycle:** Simpler — `Last Confirmed` updates each run that observes the pattern. Patterns not confirmed for 5+ runs are removed (the code changed, the pattern may no longer exist). No status field needed.
+
+**Dedup:** Anchored on explicit IDs, not fuzzy text matching. UX Opportunities use sequential IDs (UX001, UX002...). When the agent observes something that might duplicate an existing entry, it checks by `area slug + priority level`: if the same area already has an open entry at the same priority, the agent decides whether to update or create new — not automated text overlap matching. Good Patterns dedup on area slug only (one pattern entry per area).
+
+**Distinct from bugs:** Bugs are functional failures (score <= 2) or complete output failures (quality <= 1). UX Opportunities are observations at score 3-5 where the experience could be better. Good Patterns are observations at score 4-5 where the experience is already good.
+
+**No GitHub issue filing:** Both sections are logged in the test file only. They feed the product backlog but don't create issue noise. The user can manually promote a UX Opportunity to an issue if they want.
+
+**Storage:** Two new sections in the test file: `## UX Opportunities Log` and `## Good Patterns`.
+
+## Technical Considerations
+
+### Schema Migration: v2 → v3
+
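+**New frontmatter fields (all optional):**
+- None at the top level. The only frontmatter change is the optional `graduated_from` key on `cli_queries` entries (Phase 6); all other new data lives in new sections, not frontmatter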
+
+**New test file sections:**
+- `## Area Trends` — replaces Area Score History, thin summary (trend + last score + delta)
+- `## UX Opportunities Log` — improvement suggestions with status lifecycle
+- `## Good Patterns` — patterns worth preserving (separate from opportunities)
+
+**New standalone files:**
+- `tests/user-flows/bugs.md` — bug registry
+- `tests/user-flows/score-history.json` — full per-area score history (machine-readable)
+
+**Run results JSON changes:**
+- `areas[].skip_reason` — new field (nullable string enum)
+- `qualitative.best_moment` — changes from string to `{ area, text }` object
+- `qualitative.worst_moment` — changes from string to `{ area, text }` object
+- `ux_opportunities` — new array at top level (P1/P2 improvement suggestions with IDs)
+- `good_patterns` — new array at top level (area-level patterns worth preserving)
+
+**Backward compatibility:** v2 files work unchanged. New sections are added on first v3 commit. The `qualitative` field change is breaking for `.user-test-last-run.json` consumers — but this file is ephemeral (overwritten each run, gitignored), so no migration needed. Bump `schema_version: 3` on first commit that adds new sections.
+
+**Migration strategy:** Same as v1→v2: fill defaults on read, upgrade on write. No separate migration step.
+
+### SKILL.md Line Budget
+
+Current: 321 lines. v3 additions estimated:
+
+| Addition | Lines in SKILL.md | Lines in references/ |
+|----------|------------------|---------------------|
+| Bug registry lifecycle | ~15 | ~40 (bugs-registry.md) |
+| Per-area trends + score-history.json | ~5 | ~15 (in test-file-template.md) |
+| Structured skip reasons | ~8 | 0 (enum in JSON schema) |
+| Pass thresholds | ~5 | ~10 (in test-file-template.md) |
+| Queryable qualitative | ~5 | 0 (JSON schema change) |
+| Graduation mechanism | ~15 | ~30 (graduation.md) |
+| UX Opportunities + Good Patterns | ~15 | ~25 (in test-file-template.md) |
+| Schema v3 migration note | ~5 | 0 |
+| **Total** | **~73** | **~120** |
+
+**Projected SKILL.md:** ~394 lines — within 500-line budget.
+
+**New reference file:** `references/bugs-registry.md` for bug lifecycle documentation.
+**New reference file:** `references/graduation.md` for discovery-to-regression graduation.
+**Updated reference file:** `references/test-file-template.md` for new sections + pass thresholds.
+
+**Line-count checkpoint:** After implementing step 4 (SKILL.md updates), run `wc -l < SKILL.md` before moving on; this check is step 5 of the Implementation Sequence and gates step 6. If over 420 lines, extract UX Opportunities or graduation to their own reference files immediately rather than waiting for post-implementation cleanup.
+
+**Graduation extraction trigger:** The graduation mechanism (Phase 6) involves conditional logic across several states (cli_test_command present?, bug type visual or functional?, user response). If it exceeds 20 lines in SKILL.md during implementation, extract to `references/graduation.md` immediately. The reference file is already planned; the question is whether graduation lives as a brief summary in SKILL.md with details in the reference, or entirely in the reference from the start. Default: start in SKILL.md, extract if it grows.
+
+### Two-Layer Architecture Clarification
+
+v2 already implements CLI-first (Phase 2.5) and Browser-second (Phase 3). v3 doesn't change the execution order, but the graduation mechanism (Phase 6) creates a feedback loop:
+
+```
+Layer 2 (Browser) → discovers issue → fix verified → graduation offered
+    ↓
+Layer 1 (CLI) ← new regression check added ← catches regressions fast
+    ↓
+Layer 2 (Browser) → freed to explore new territory
+```
+
+This is the compounding loop in action. Over time, the CLI layer grows and the browser layer stays focused on unknowns.
+
+### Open Questions Resolved
+
+**Q: How does bugs.md handle bugs that span multiple areas?**
+A: One registry entry with primary area. Summary notes "Also affects: area-a, area-b". Each affected area's Known-bug detail references the same bug ID.
+
+**Q: Should UX Opportunities have priority (P1/P2/P3)?**
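+A: Yes, but only P1 and P2. P1 = missing expected interaction, P2 = enhancement to an already-good interaction. Patterns worth preserving are not a P3 priority; they are logged separately under Good Patterns (see Phase 7).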
+
+**Q: What's the graduation trigger?**
+A: Manual — user confirms after fix is verified. The user knows whether a CLI check can meaningfully cover a UX-discovered issue. Some discoveries are inherently browser-only.
+
+**Q: How does the command handle an app it's never seen before?**
+A: Already handled by v2 — passing a description string to `/user-test` creates a new test file from template. No separate `/user-test init` needed. The first run IS the init.
+
+## Acceptance Criteria
+
+### Phase 1: Bug Registry
+- [x] `tests/user-flows/bugs.md` created on first bug filing if it doesn't exist
+- [x] Bug IDs are sequential (B001, B002, ...)
+- [x] Status lifecycle works: open → fixed → regressed
+- [x] Multi-area bugs have one entry with "Also affects" note
+- [x] Commit mode syncs bug status with GitHub issue state
+- [x] Fixed bugs are detected when Known-bug area passes fix_check (score >= area's `pass_threshold`, default 4)
+- [x] Regression detection: previously-fixed area fails → new issue "Regression of #N" + bug marked regressed
+
+### Phase 2: Per-Area Score History
+- [x] `tests/user-flows/score-history.json` created on first run, stores full per-area history
+- [x] Last 10 entries per area in JSON, oldest drops at 11th
+- [x] `## Area Trends` section in test file shows Trend + Last Score + Delta (human-readable summary)
+- [x] Trend computed from JSON: improving/stable/declining/volatile/fixed
+- [x] Per-area delta computed from JSON, not by parsing markdown
+
+### Phase 3: Structured Skip Reasons
+- [x] `skip_reason` field present in `.user-test-last-run.json` for every area
+- [x] Enum: null, proven_spotcheck, known_bug_open, known_bug_fixed, cli_precheck_failed, disconnect, user_skip
+- [x] Pass rate calculation excludes disconnect and user_skip
+- [x] Report displays skip count and reasons
+
+### Phase 4: Pass Thresholds
+- [x] `pass_threshold` field supported in area details (default: 4)
+- [x] `quality_threshold` field supported for scored_output areas (default: 3)
+- [x] Promotion gate uses per-area thresholds
+- [x] Test file is self-documenting — thresholds visible in area details
+
+### Phase 5: Queryable Qualitative
+- [x] `best_moment` and `worst_moment` in JSON are `{ area, text }` objects
+- [x] `test-history.md` has `Best Area` and `Worst Area` columns
+- [x] Positive pattern detection: 7+ of last 10 runs (high bar — informational signal)
+- [x] Negative pattern detection: 5+ of last 10 runs (moderate bar — actionable signal)
+
+### Phase 6: Graduation
+- [x] After bug marked fixed, offer CLI graduation if `cli_test_command` exists
+- [x] Graduated CLI query includes `graduated_from: "B00N"` tag
+- [x] Skip graduation offer for browser-only bugs (no CLI equivalent)
+- [x] Skip graduation offer if no `cli_test_command` in frontmatter
+
+### Phase 7: UX Opportunities + Good Patterns
+- [x] `UX Opportunities` section in Phase 4 report (P1/P2 action items)
+- [x] `Good Patterns` section in Phase 4 report (separate from opportunities)
+- [x] UX Opportunities use sequential IDs (UX001, UX002...) with status lifecycle (open/implemented/wont_fix)
+- [x] Good Patterns dedup on area slug only, `Last Confirmed` updates each run, removed after 5 runs unconfirmed
+- [x] Dedup: same area + same priority = agent decides (not fuzzy text matching)
+- [x] Stored in test file: `## UX Opportunities Log` (last 20 open) + `## Good Patterns`
+- [x] Distinct from bugs — no GitHub issue creation
+- [x] UX Opportunities triggered at score 3-5; Good Patterns triggered at score 4-5
+
+### Schema & Compatibility
+- [x] v2 files load without error (new sections added on first commit)
+- [x] `schema_version: 3` set on first v3 commit
+- [x] SKILL.md stays under 500 lines after all additions (checkpoint at step 5)
+- [x] bugs-registry.md reference file created
+- [x] graduation.md reference file created
+- [x] test-file-template.md updated with Area Trends, UX Opportunities Log, Good Patterns sections
+- [x] score-history.json schema documented in test-file-template.md
+
+### Version & Metadata
+- [x] Version bumped (2.36.0 → 2.37.0)
+- [x] CHANGELOG.md updated
+- [x] plugin.json and marketplace.json description counts verified
+
+## Implementation Sequence
+
+1. **Create `references/bugs-registry.md`** — bug lifecycle, multi-area handling, status transitions, fix_check threshold tied to pass_threshold
+2. **Create `references/graduation.md`** — discovery-to-regression mechanism, CLI query generation, browser-only bug detection
+3. **Update `references/test-file-template.md`** — add Area Trends section (replacing wide score history table), UX Opportunities Log + Good Patterns sections, score-history.json schema, pass_threshold/quality_threshold in area details, schema_version: 3
+4. **Update SKILL.md** — add bug registry lifecycle to Commit Mode, skip_reason to Phase 3/4, pass thresholds to promotion gate, qualitative tagging to Phase 4, graduation offer to Commit Mode, UX Opportunities + Good Patterns to Phase 4, schema v3 migration note to Phase 1
+5. **Line-count checkpoint** — run `wc -l < SKILL.md`. If over 420 lines, extract graduation or UX Opportunities to reference files before proceeding. This is a hard gate, not a suggestion.
+6. **Update `.user-test-last-run.json` schema** — add skip_reason, change qualitative structure, add ux_opportunities, add good_patterns
+7. **Bump metadata** — version, changelog, plugin.json, marketplace.json
+8. **Validate** — SKILL.md line count, JSON validity, reference links, score-history.json schema
+9. **Install locally** — copy to ~/.claude/skills/user-test/
+
+## Dependencies & Risks
+
+| Risk | Mitigation |
+|------|-----------|
+| bugs.md grows unbounded | Rotation: archive entries older than 6 months to bugs-archive.md |
+| score-history.json grows with many areas over many runs | Cap at 10 entries per area; one file per project. At 30 areas x 10 entries = ~300 entries — manageable JSON size |
+| Graduation offers interrupt flow | Single y/n prompt after commit, not during test run. Batch all graduation offers into one prompt. |
+| Pattern detection is noisy early on | Only trigger after 10+ runs. Positive patterns: 7/10 threshold. Negative patterns: 5/10 threshold. |
+| UX Opportunity dedup produces false matches | Dedup anchored on area slug + priority level, not text overlap. Agent decides on conflicts — no automated fuzzy matching. |
+| Good Patterns log bloat (agent flags everything good as a pattern) | Only log patterns at score 4-5 that represent a *deliberate design choice* (not just "page loaded"). Patterns auto-expire after 5 runs unconfirmed. |
+| UX Opportunities with no lifecycle become stale | Status field (open/implemented/wont_fix). Implemented and wont_fix age out after 30 days. Open entries capped at 20. |
+| schema_version: 3 migration adds sections to existing test files | Non-destructive: new sections appended, existing content preserved |
+| SKILL.md approaches 400 lines | Hard gate at step 5: if over 420 lines, extract before proceeding. Graduation earmarked for early extraction (20-line trigger). |
+| qualitative JSON structure change breaks external consumers | .user-test-last-run.json is gitignored and ephemeral — no external consumers expected |
+
+## Sources & References
+
+### Prior Plans
+- [v1 plan: user-test browser testing skill](docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md) — original skill architecture (completed)
+- [v2 plan: user-test skill revision](docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md) — schema v2, timing, CLI mode, auto-commit (completed)
+
+### Internal References
+- Current SKILL.md: `plugins/compound-engineering/skills/user-test/SKILL.md` (321 lines, schema v2)
+- Test file template: `plugins/compound-engineering/skills/user-test/references/test-file-template.md`
+- Anti-patterns: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md`
+
+### Learnings Applied
+- **Monolith-to-skill split:** Reference file extraction from day one prevents SKILL.md bloat (confirmed by v2 staying at 321 lines)
+- **Agent-guided state patterns:** Use agent judgment for maturity transitions, not mechanical counters (validated by 7 real runs)
+- **Plugin versioning:** Always bump version, changelog, and description counts in lockstep
diff --git a/docs/plans/2026-03-02-feat-compounding-quality-plan.md b/docs/plans/2026-03-02-feat-compounding-quality-plan.md
new file mode 100644
index 000000000..8f3289cbc
--- /dev/null
+++ b/docs/plans/2026-03-02-feat-compounding-quality-plan.md
@@ -0,0 +1,761 @@
+---
+title: "feat: Compounding Quality — Richer Writebacks, Weakness Synthesis, Fingerprints, CLI Adversarial"
+type: feat
+status: completed
+date: 2026-03-02
+---
+
+# feat: Compounding Quality
+
+Four changes to make the existing compound loop actually compound. Each run
+becomes smarter automatically — no new commands, no extra steps.
+
+## Overview
+
+| Change | Where | What It Does |
+|--------|-------|-------------|
+| 1. Richer commit writebacks | Commit Mode Step 1 | Persists tactical intelligence (selectors, timing, weakness class) back into area details |
+| 2. Weakness-class synthesis | Phase 4, Step 6 | Cross-area adversarial targeting from failure patterns, not just instances |
+| 3. Novelty fingerprint persistence | `.user-test-last-run.json` + Phase 3 | Prevents re-exploring territory already covered in prior runs |
+| 4. CLI score 3 → adversarial browser | Phase 2.5 + Phase 3 | Partially-correct CLI results trigger harder browser probing |
+
+## Problem Statement
+
+Run-over-run, the user-test skill rediscovers the same information:
+
+1. **Selectors are found then forgotten.** Run 1 discovers working DOM selectors
+   (3-5 MCP calls). Run 2 discovers them again. The verify block has no way to
+   persist confirmed selectors.
+
+2. **Weakness patterns are instance-level, not class-level.** Three areas share
+   the same "stale-react-state" failure pattern. Each is treated independently.
+   No mechanism identifies or targets the shared weakness class.
+
+3. **Novelty log expires between runs.** The novelty budget forces exploration,
+   but the log resets each run. Run N+1 re-explores territory N already covered.
+
+4. **CLI score 3 is a dead signal.** Score ≤2 skips browser. Score ≥4 passes
+   through. Score 3 ("surface-level right, deeper reasoning wrong") proceeds
+   normally — the adversarial sweet spot is wasted.
+
+## SKILL.md Line Budget Strategy
+
+SKILL.md is at **420 lines** (hard ceiling). All changes must net zero.
+
+### Extraction Plan
+
+| Extraction | Source Lines | Savings | Target |
+|-----------|-------------|---------|--------|
+| `.user-test-last-run.json` schema | SKILL.md:282-333 (52 lines) | ~45 lines (replace with 5-line pointer + version ref) | New `references/last-run-schema.md` |
+| Phase 3 novelty budget inline | SKILL.md:110-115 (6 lines) | ~4 lines (compress to 2-line pointer) | Already in `queries-and-multiturn.md:128-194` |
+| Phase 2.5 CLI detail | SKILL.md:84-97 (14 lines) | ~6 lines (compress to 8-line version) | Already in `queries-and-multiturn.md:62-83` |
+| **Total freed** | | **~55 lines** | |
+
+### Addition Plan
+
+| Addition | Lines | Location |
+|---------|-------|----------|
+| Commit Mode Step 1: 3 new bullet points (notes, selectors, weakness_class) | ~8 | SKILL.md Commit Mode |
+| Phase 4 Step 6: cross-area synthesis pointer | ~5 | SKILL.md Phase 4 |
+| Phase 3: fingerprint check + adversarial mode trigger | ~6 | SKILL.md Phase 3 |
+| Phase 2.5: adversarial flag check | ~5 | SKILL.md Phase 2.5 |
+| JSON schema pointer to `last-run-schema.md` | ~5 | SKILL.md (replaces extracted block) |
+| **Total added** | **~29** | |
+
+**Net: -55 + 29 = -26 lines.** Comfortable margin. SKILL.md lands at ~394.
+
+## Schema Version
+
+All four changes ship together as **v8**. One migration event.
+
+```
+v7 → v8 changes:
+- Area Details: optional **weakness_class:** field (below pass_threshold)
+- Area Details: **verify:** blocks auto-updated with confirmed selectors by commit mode
+- Areas table: Notes column receives tactical run notes in [Run N] format (max 3 entries)
+- .user-test-last-run.json: new fields per area (tactical_note, confirmed_selectors,
+  weakness_class, adversarial_browser, adversarial_trigger)
+- .user-test-last-run.json: new top-level key novelty_fingerprints (accumulates across runs)
+- .user-test-last-run.json schema extracted to references/last-run-schema.md
+```
+
+**Migration:** Treat missing `weakness_class` as absent. Treat missing
+`novelty_fingerprints` as empty. Treat missing `adversarial_browser` as false.
+Do NOT rewrite v7 files on read.
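+
+A minimal sketch of these read-side defaults, assuming an area and the last-run JSON have already been parsed into plain dicts. The helper names `read_area_v8` and `read_fingerprints` and the dict shapes are hypothetical, not part of the skill:
+
+```python
+from copy import deepcopy
+
+# Defaults for fields introduced in v8 (names taken from the list above).
+V8_AREA_DEFAULTS = {"weakness_class": None, "adversarial_browser": False}
+
+
+def read_area_v8(area: dict) -> dict:
+    """Return a copy of a parsed area with v8 defaults filled in.
+
+    The input dict is never mutated, so a v7 file is not rewritten on read.
+    """
+    merged = deepcopy(area)
+    for key, default in V8_AREA_DEFAULTS.items():
+        merged.setdefault(key, default)
+    return merged
+
+
+def read_fingerprints(last_run: dict) -> dict:
+    """Treat a missing novelty_fingerprints key as empty."""
+    return deepcopy(last_run.get("novelty_fingerprints", {}))
+```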
+
+---
+
+## Change 1: Richer Commit Writebacks
+
+### What changes
+
+Commit Mode Step 1 (currently SKILL.md:363-369) writes three new categories of
+intelligence back into each area after every run. Currently this data is
+discovered during execution then discarded.
+
+### A. Tactical notes (Areas table, Notes column)
+
+After scoring, commit mode appends a short tactical note to the area's Notes
+column. Format: `[Run N] <finding>`. Cap at 3 entries; drop oldest when exceeded.
+
+**Write only when there's a genuine tactical insight:**
+- A reliable JS selector pattern: `[Run 4] batch read via [data-filter-chip] + .product-card reliable`
+- A timing pattern: `[Run 3] agent response 8-12s on first query, faster on follow-ups`
+- An interaction sequence that revealed a bug: `[Run 2] filter → navigate → back → filter again surfaces stale state`
+
+Do NOT write: generic observations, maturity updates, restatements of probe results.
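+
+The format and cap can be sketched as follows. This is an illustration only: `append_tactical_note` is a hypothetical helper, and it assumes the Notes column has already been split into a list ordered oldest to newest:
+
+```python
+from typing import Optional
+
+MAX_NOTES = 3  # cap stated above
+
+
+def append_tactical_note(notes: list, run_number: int, finding: Optional[str]) -> list:
+    """Append "[Run N] <finding>" and keep only the newest MAX_NOTES entries."""
+    if not finding:  # no genuine tactical insight: leave Notes untouched
+        return list(notes)
+    updated = list(notes) + [f"[Run {run_number}] {finding}"]
+    return updated[-MAX_NOTES:]  # drop oldest when the cap is exceeded
+```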
+
+### B. Verified selectors into `verify:` blocks
+
+When Phase 3 exploration discovers working DOM selectors confirmed by a
+successful `javascript_tool` batch call, commit mode writes them into the area's
+`**verify:**` block.
+
+```markdown
+**verify:**
+- Apply filter. Batch-check via javascript_tool:
+  activeFilters (`[data-filter-chip]`), resultCount (`.product-card`),
+  sample 5 results (`.product-card .title`, `.condition-badge`).
+  Every result's attribute must match the active filter.
+  _Selectors confirmed run 3._
+```
+
+Rules:
+- Only write selectors confirmed by a successful batch call this run
+- Append `_Selectors confirmed run N._` so future runs know the source
+- APPEND new selectors below existing user-authored content — never replace
+- Update with new selectors if they changed; preserve unchanged ones
+- If selectors are unknown (first run): `_Selectors not yet confirmed — discover during exploration._`
+
+This is the highest-leverage writeback: run 1 discovers selectors through
+sequential trial (3-5 MCP calls), run 2 reads the verify block and batches them
+into one `javascript_tool` call.
+
+### C. `weakness_class` field
+
+New optional field in area details, written by commit mode when 2+ probes in the
+same area share a recognizable failure pattern. Lives just below `pass_threshold`.
+
+```markdown
+**weakness_class:** stale-react-state
+```
+
+**Predefined classes:**
+- `stale-react-state` — filters/state not resetting on navigation
+- `count-display-lag` — displayed counts don't match actual DOM counts
+- `multi-turn-context-loss` — agent forgets constraints from earlier turns
+- `async-render-race` — results appear but attributes/badges haven't updated
+- `filter-intersection-empty` — compound filter combinations return 0 results unexpectedly
+- `agent-reasoning-shallow` — CLI quality consistently 3, partially correct but missing nuance
+
+**Freeform:** For novel failure modes that don't fit a predefined class, write a
+freeform string (e.g., `weakness_class: checkout-state-leaked-across-sessions`).
+Change 2 handles freeform classes with custom adversarial instruction generation.
+
+**Classification:** Commit mode reads each failing probe's `query`, `verify`,
+and `result_detail` fields and matches against predefined class descriptions
+using agent judgment. No mechanical matching rule — agent decides which class (if
+any) best describes the failure pattern. If classification is ambiguous, prefer
+freeform over forcing a predefined class. Matching for C2 synthesis uses exact
+string equality after normalization (lowercase, hyphenated).
+
+**Lifecycle:**
+- Write when 2+ probes share a pattern (one probe = insufficient signal)
+- Update each run: if the class's probes have all passed for 3+ consecutive
+  runs, remove the field (weakness resolved)
+- If a new pattern emerges with more probes than the current class, replace it
+- One `weakness_class` per area — the dominant pattern. Probe count decides dominance.
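+
+Classification itself is left to agent judgment, but the normalization and the probe-count dominance rule are mechanical. A rough sketch, assuming each failing probe has already been labeled with a class string; the function names are hypothetical:
+
+```python
+import re
+from collections import Counter
+from typing import Optional
+
+
+def normalize_class(name: str) -> str:
+    """Lowercase and hyphenate a class name so exact string matching is stable."""
+    return re.sub(r"[^a-z0-9]+", "-", name.strip().lower()).strip("-")
+
+
+def dominant_weakness_class(probe_classes: list) -> Optional[str]:
+    """Pick the class shared by the most failing probes; require 2+ probes.
+
+    Ties fall to the class seen first. The plan does not define a tie rule,
+    so that behavior is an assumption of this sketch.
+    """
+    counts = Counter(normalize_class(c) for c in probe_classes if c)
+    if not counts:
+        return None
+    name, count = counts.most_common(1)[0]
+    return name if count >= 2 else None
+```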
+
+### `.user-test-last-run.json` additions (per area)
+
+```json
+{
+  "slug": "agent/filter-via-chat",
+  "ux_score": 3,
+  "tactical_note": "filter → navigate away → back → filter again surfaces stale state",
+  "confirmed_selectors": {
+    "activeFilters": "[data-filter-chip]",
+    "resultCount": ".product-card",
+    "sampleResults": ".product-card .title, .condition-badge"
+  },
+  "weakness_class": "stale-react-state"
+}
+```
+
+`tactical_note: null` → skip Notes update. `confirmed_selectors: {}` → skip
+verify block update. `weakness_class: null` → no class identified (or resolved).
+
+### Detail spec locations
+
+- Selector persistence rules → `verification-patterns.md` (new section: "Selector Discovery and Writeback")
+- `weakness_class` lifecycle and predefined class definitions → `probes.md` (new section: "Weakness Classification")
+- Tactical notes format and cap rules → `queries-and-multiturn.md` (append to commit mode guidance)
+
+---
+
+## Change 2: Weakness-Class Synthesis in Explore Next Run
+
+### What changes
+
+Phase 4 Step 6 (currently SKILL.md:209-212) gains a cross-area synthesis pass.
+After generating per-area Explore Next Run items, it looks across all areas for
+shared failure classes. When a class appears in 2+ areas, it generates one
+`[cross-area]` Explore Next Run entry targeting the class systemically.
+
+### Synthesis pass
+
+Synthesis reads `weakness_class` fields from the test file as written by the
+previous run's commit — first-run appearance of a weakness_class does not trigger
+synthesis until the following run.
+
+1. Collect all areas with a `weakness_class` field set in the test file
+2. Group by weakness_class value (exact string match)
+3. For each class appearing in 2+ areas: generate one `[cross-area]` Explore
+   Next Run entry
+
+**Format:**
+```
+P1  [cross-area]  Browser  stale-react-state in agent/filter + browse/filters — probe ALL navigation sequences next run
+```
+
+**Cap:** Maximum 2 cross-area synthesis entries per run.
+
+**Tiebreaker when >2 classes qualify:**
+Rank by (1) number of affected areas — more areas = higher priority; then (2)
+number of failing probes in the class. Deterministic, favors widespread patterns.
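+
+The grouping, cap, and tiebreaker can be sketched as below. The area dict keys (`slug`, `weakness_class`, `failing_probes`) and the function name are illustrative assumptions; the `P1` priority mirrors the format example above:
+
+```python
+from collections import defaultdict
+
+MAX_CROSS_AREA_ENTRIES = 2  # cap stated above
+
+
+def synthesize_cross_area(areas: list) -> list:
+    """Group areas by weakness_class and rank classes found in 2+ areas."""
+    groups = defaultdict(list)
+    for area in areas:
+        if area.get("weakness_class"):
+            groups[area["weakness_class"]].append(area)
+
+    qualifying = [(cls, members) for cls, members in groups.items() if len(members) >= 2]
+    # Tiebreaker: more affected areas first, then more failing probes.
+    qualifying.sort(
+        key=lambda item: (len(item[1]), sum(a.get("failing_probes", 0) for a in item[1])),
+        reverse=True,
+    )
+    return [
+        {
+            "priority": "P1",
+            "area": "[cross-area]",
+            "weakness_class": cls,
+            "affected_areas": [a["slug"] for a in members],
+        }
+        for cls, members in qualifying[:MAX_CROSS_AREA_ENTRIES]
+    ]
+```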
+
+### Adversarial instruction templates (predefined classes)
+
+| Class | Adversarial Instruction |
+|-------|------------------------|
+| `stale-react-state` | Probe ALL navigation sequences that cross area boundaries — apply filter → navigate away → return → verify state reset |
+| `count-display-lag` | After every action changing result count, wait 2s then re-read count vs DOM — check for lag window |
+| `multi-turn-context-loss` | On every multi-turn sequence, inject a context-breaking action at turn 3, then return to prior context — verify retention |
+| `async-render-race` | After every action triggering async rendering, immediately read badges/attributes — check for race window |
+| `filter-intersection-empty` | Probe all 2-filter compound combinations systematically — check for empty-intersection cases |
+| `agent-reasoning-shallow` | Replace simple queries with competing-constraint and ambiguous queries across all affected areas |
+
+**Freeform classes:** When `weakness_class` is freeform (no matching template),
+the agent generates a custom adversarial instruction based on the class name and
+probe failure details.
+
+**Persistence signal:** If the same class appeared in last run's Explore Next Run,
+was targeted, and still didn't resolve: `PERSISTENT — stale-react-state active
+N runs — escalate to Known-bug consideration`
+
+### Report placement
+
+Cross-area synthesis entries appear at the top of EXPLORE NEXT RUN:
+
+```
+EXPLORE NEXT RUN
+  P1  [cross-area]  Browser  stale-react-state in 3 areas — probe all navigation events
+  P1  shipping-form  Browser  Validation broken — edge cases
+  P2  checkout/promo  Both    Adjacent to cart, untested
+```
+
+### Why Explore Next Run entries, not cross-area probes
+
+Cross-area synthesis produces targeting instructions ("test this pattern across
+these areas next run"). These are ephemeral — regenerated each run from current
+state. Cross-area probes (from v7) are persistent regression tests with a full
+lifecycle. Different purpose: synthesis directs exploration, probes track
+regressions. If a synthesis target repeatedly fails, the agent should generate a
+cross-area probe from the failure — that's the natural escalation path.
+
+### `.user-test-last-run.json` additions (explore_next_run entries)
+
+```json
+{
+  "priority": "P1",
+  "area": "[cross-area]",
+  "mode": "Browser",
+  "why": "stale-react-state in agent/filter-via-chat + browse/filters",
+  "weakness_class": "stale-react-state",
+  "affected_areas": ["agent/filter-via-chat", "browse/filters"],
+  "adversarial_instruction": "Probe ALL navigation sequences that cross area boundaries..."
+}
+```
+
+### Detail spec location
+
+Cross-area synthesis rules, tiebreaker logic, template table → `probes.md`
+(new section: "Cross-Area Weakness Synthesis")
+
+---
+
+## Change 3: Novelty Fingerprint Persistence
+
+### What changes
+
+The novelty log expires between runs (documented v2 limitation). This change
+persists a compact fingerprint of each novel interaction across sessions so run
+N+1 knows what run N already explored.
+
+### Fingerprint format
+
+`<area-slug>:<action-type>:<key-parameter>`
+
+Examples:
+- `agent/filter-via-chat:edge-query:price-floor`
+- `browse/filters:filter-combo:size+color`
+- `checkout/shipping-form:invalid-input:zip-letters`
+
+**Normalization taxonomy (intentionally fuzzy):**
+- Price/number inputs → `price-floor`, `price-ceiling`, `price-range`
+- Filter combinations → `filter-combo:<f1>+<f2>`
+- Invalid inputs → `invalid-input:<input-type>`
+- Edge case queries → `edge-query:<topic>`
+- Navigation sequences → `nav-sequence:<from>-<to>`
+- **Doesn't fit taxonomy → `<area>:freeform:<3-word-summary>`** — coverage over consistency
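+
+A small sketch of how a fingerprint string might be assembled, including the freeform fallback. In the skill the agent chooses the action type and key parameter; the helper names here are hypothetical:
+
+```python
+def make_fingerprint(area_slug: str, action_type: str, key_parameter: str) -> str:
+    """Build "<area-slug>:<action-type>:<key-parameter>" with light normalization."""
+    action = action_type.strip().lower()
+    key = key_parameter.strip().lower().replace(" ", "-")
+    return ":".join([area_slug.strip(), action, key])
+
+
+def freeform_fingerprint(area_slug: str, summary: str) -> str:
+    """Fallback for interactions outside the taxonomy: keep a 3-word summary."""
+    words = summary.lower().split()[:3]
+    return make_fingerprint(area_slug, "freeform", "-".join(words))
+```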
+
+### Storage in `.user-test-last-run.json`
+
+```json
+"novelty_fingerprints": {
+  "agent/filter-via-chat": [
+    "agent/filter-via-chat:edge-query:price-floor",
+    "agent/filter-via-chat:edge-query:out-of-scope-question"
+  ],
+  "browse/filters": [
+    "browse/filters:filter-combo:size+color"
+  ]
+}
+```
+
+Cap: 20 fingerprints per area. Drop oldest when exceeded.
+
+### Read-Merge-Write Sequence
+
+`.user-test-last-run.json` is overwritten on each run (SKILL.md:331). Fingerprints
+are the only key that accumulates. The sequence:
+
+1. **Phase 1 (Load Context):** Read existing `novelty_fingerprints` from
+   `.user-test-last-run.json` into memory before the run starts.
+2. **Phase 3 (Execute):** Use fingerprints to skip already-explored interactions.
+   Generate new fingerprints for novel interactions this run.
+3. **Phase 4 (Write):** Merge existing fingerprints + new fingerprints. Apply
+   20-per-area cap (drop oldest). Write the merged set into the new JSON file.
+
+This is safe because the JSON is written once at the end of Phase 4. Partial-write
+risk is avoided as long as that single write is atomic (for example, write to a
+temporary file, then rename it over the old one).
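+
+A minimal sketch of the Phase 4 merge and write, assuming each per-area list is ordered oldest first. `merge_fingerprints` and `write_last_run` are hypothetical helper names; writing through a temporary file and `os.replace` is one way to get a single atomic write, not necessarily how the skill does it:
+
+```python
+import json
+import os
+import tempfile
+
+MAX_PER_AREA = 20  # cap stated above
+
+
+def merge_fingerprints(existing: dict, new: dict, cap: int = MAX_PER_AREA) -> dict:
+    """Merge per-area fingerprint lists, oldest first, de-duplicated, capped."""
+    merged = {}
+    for area in list(existing) + [a for a in new if a not in existing]:
+        combined = existing.get(area, []) + new.get(area, [])
+        unique = list(dict.fromkeys(combined))  # keeps first-seen order
+        merged[area] = unique[-cap:]  # drop oldest when over the cap
+    return merged
+
+
+def write_last_run(path: str, payload: dict) -> None:
+    """Write the JSON to a temp file in the same directory, then rename it.
+
+    os.replace performs the rename atomically on POSIX systems, so readers
+    see either the old file or the complete new one.
+    """
+    directory = os.path.dirname(os.path.abspath(path))
+    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
+    with os.fdopen(fd, "w", encoding="utf-8") as handle:
+        json.dump(payload, handle, indent=2)
+    os.replace(tmp_path, path)
+```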
+
+### Iterate mode exemption
+
+Iterate mode measures consistency by running the same scenario N times.
+**Fingerprints are ignored in iterate mode** — all runs test the same interaction
+set. The between-run page reload resets `mcp_call_counter` but does NOT apply
+fingerprint filtering. Fingerprints still accumulate for use in the next
+non-iterate session.
+
+### Fingerprint matching semantics
+
+Agent exercises judgment on what "matches" — the goal is to skip interactions of
+the same *type*, not require exact parameter matches. `edge-query:price-floor`
+and `edge-query:price-ceiling` are different fingerprints (different key params).
+`edge-query:price-floor` from run 1 means "don't test price-floor edge cases
+again" — test price-ceiling or price-range instead.
+
+### Interaction with adversarial mode (C4)
+
+Adversarial mode overrides fingerprint skipping for its specific actions.
+Competing-constraint queries triggered by C4 are always run regardless of
+fingerprint state — the adversarial signal takes priority over "already tried."
+
+### Interaction with Proven area budget
+
+Proven areas have a 3-MCP-call cap. Fingerprint filtering does NOT increase the
+budget — it changes WHAT those 3 calls test. If fingerprints exclude obvious
+interactions, the 3 calls target genuinely novel territory. This is the desired
+behavior: Proven areas get spot-checked on untested ground, not re-tested on
+familiar ground.
+
+### Resilience
+
+If `.user-test-last-run.json` is deleted or corrupted, fingerprint history resets
+to empty. Acceptable — the skill re-explores previously covered territory, same
+as before this change. Fingerprints are an optimization, not a correctness
+requirement.
+
+### Report signal
+
+Add to SIGNALS when fingerprints meaningfully constrained novelty choices:
+```
+~ agent/filter-via-chat novelty: 3 fingerprints excluded, 2 new interactions found
+```
+
+### Detail spec location
+
+Normalization rules, freeform fallback, accumulation behavior →
+`queries-and-multiturn.md` (new section: "Novelty Fingerprint Persistence")
+
+---
+
+## Change 4: CLI Score 3 → Browser Adversarial Signal
+
+### What changes
+
+CLI score 3 ("partially correct — surface-level right, deeper reasoning wrong")
+triggers adversarial browser mode for that area. Currently this signal is lost.
+
+**Why score 3 specifically:**
+- Score ≤2 already skips browser via `prechecks`
+- Score ≥4 proceeds normally
+- Score 3 = the adversarial sweet spot: the app functions, but the CLI revealed
+  shallow reasoning that browser testing can expose as real user-facing failure
+
+### Trigger condition
+
+Adversarial mode triggers when **any individual CLI query** for the area scores
+exactly 3. Per-query scores, not averages.
+
+**Secondary check:** If the area's CLI Quality average across queries is 3.0-3.4
+AND no single query hit exactly 3 (all queries borderline), also trigger
+adversarial mode. Record `adversarial_trigger: "cli-avg-3.x: <average>"`.
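+
+The two checks can be sketched as one function. This assumes scores are numeric and keyed by query text; the function name and the one-decimal formatting of the average are illustrative choices, not part of the spec:
+
+```python
+from typing import Optional
+
+
+def adversarial_trigger(query_scores: dict) -> Optional[str]:
+    """Return the adversarial_trigger string, or None when the mode stays off."""
+    for query, score in query_scores.items():
+        if score == 3:  # primary trigger: any individual query at exactly 3
+            return f"cli-score-3: {query}"
+    if query_scores:
+        average = sum(query_scores.values()) / len(query_scores)
+        if 3.0 <= average <= 3.4:  # secondary trigger: borderline average
+            return f"cli-avg-3.x: {average:.1f}"
+    return None
+```
+
+A non-`None` result corresponds to setting `adversarial_browser: true` for the area.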
+
+### Adversarial browser mode behaviors
+
+When triggered, the area's Phase 3 execution changes:
+
+1. **Skip the happy path.** Start with the query most likely to expose the
+   shallow reasoning — not the simplest, expected query.
+
+2. **Front-load competing-constraint queries.** If the area has Queries defined,
+   execute any query with competing constraints before single-intent queries.
+
+3. **Pre-emptive probe (before exploration).** Generate an `untested` probe:
+   - `generated_from: "cli-score-3: <query that scored 3>"`
+   - Priority: P1 (CLI already revealed the weakness)
+
+4. **Increased novelty budget.**
+   - Proven areas: all 3 MCP calls must be adversarial, not happy-path spot-checks
+   - Uncharted areas: novelty budget increases to 40% of calls (from 30%), min 3
+
+5. **Report flag** in DETAILS:
+   ```
+   agent/filter-via-chat: CLI 3 → browser adversarial mode
+     Pre-emptive probe: "competing filter constraints" (P1)
+     Exploration front-loaded with competing-constraint queries
+   ```
+
+### Interaction with progressive narrowing
+
+If a SKIP-classified area has a CLI query scoring 3, **adversarial mode overrides
+SKIP for that area only** — it gets promoted to PROBES-ONLY with adversarial
+execution. The CLI signal is too strong to ignore. PROBES-ONLY areas with
+adversarial mode execute their probes + the pre-emptive probe, but skip full
+exploration.
+
+### Phase 2.5 addition
+
+After scoring CLI queries, add one step:
+
+> **Adversarial flag check:** For each area with `prechecks`-tagged queries: if
+> any individual query score == 3, set `adversarial_browser: true`. If average is
+> 3.0-3.4 with no single query at 3, also set `adversarial_browser: true`.
+> Record the triggering query in `adversarial_trigger`.
+
+### `.user-test-last-run.json` additions (per area)
+
+```json
+{
+  "slug": "agent/filter-via-chat",
+  "adversarial_browser": true,
+  "adversarial_trigger": "cli-score-3: show me items under $50 in good condition"
+}
+```
+
+### SIGNALS addition
+
+```
+~ 2 areas in CLI-adversarial mode (CLI score 3): agent/filter-via-chat, agent/search-query
+```
+
+### Detail spec location
+
+Full adversarial mode behavior, competing-constraint query identification,
+novelty budget adjustment → `queries-and-multiturn.md` (new section: "CLI
+Adversarial Mode")
+
+---
+
+## SpecFlow Gap Resolutions
+
+Issues identified by flow analysis, resolved here:
+
+| Gap | Resolution |
+|-----|-----------|
+| Fingerprint persistence vs JSON overwrite | Read-merge-write sequence documented in C3 (Phase 1 read, Phase 4 merge+write) |
+| Iterate mode + fingerprints | Explicit exemption: iterate mode ignores fingerprints (C3) |
+| C4 adversarial vs fingerprint skipping | Adversarial overrides fingerprints for its specific actions (C3) |
+| C4 adversarial vs progressive narrowing SKIP | Adversarial overrides SKIP → promotes to PROBES-ONLY (C4) |
+| C4 adversarial vs Proven 3-call budget | Budget unchanged — adversarial reshapes WHAT those 3 calls do (C4) |
+| weakness_class classification method | Agent judgment on probe query/verify/result_detail fields. Prefer freeform over forcing predefined. Documented in C1 spec. |
+| weakness_class matching for C2 synthesis | Exact string match. Predefined classes are canonical strings. Synthesis restricted to areas where weakness_class already set in test file (no re-derivation). |
+| Synthesis output vs cross-area probes | Synthesis produces ephemeral Explore Next Run entries, not probes. Repeated failures escalate to cross-area probes naturally. |
+| C1→C2 timing (2-run delay) | By design. Run N writes weakness_class via commit. Run N+1 reads it but synthesis requires 2+ areas — fires earliest at N+2 if a second area develops the same class. Stated explicitly in C2 synthesis pass. |
+| Fingerprints machine-local (gitignored JSON) | Intentional. Fingerprints are an optimization, not canonical state. Other compounding mechanisms (probes, queries, weakness_class) persist in committed test file. |
+| weakness_class removal in multi-run mode | Each run within a multi-run session counts as a separate run toward the 3-run removal threshold. |
+
+---
+
+## Design Decisions
+
+### D1. Net-zero SKILL.md via JSON schema extraction
+
+The `.user-test-last-run.json` schema block (52 lines) is the largest inline
+block in SKILL.md that can move to a reference file without hurting agent
+performance. The JSON schema is read once at run start and written once at run end,
+so the agent doesn't need it inline during execution phases.
+
+### D2. Predefined weakness classes + freeform fallback
+
+Predefined classes accelerate C2 template lookup but freeform strings ensure
+novel failure modes aren't lost. Exact string matching (post-normalization) is
+strict enough to prevent false synthesis but simple to implement.
+
+### D3. Fingerprints as optimization, not truth
+
+Fingerprints are gitignored, machine-local, and lossy (20 cap with oldest-drop).
+This is deliberate — they guide novelty exploration but don't gate correctness.
+A fresh machine re-explores territory, which is the same as the pre-C3 behavior.
+
+### D4. Adversarial mode reshapes budget, doesn't increase it
+
+Proven areas keep their 3-call cap. Adversarial mode changes WHAT those calls
+test (competing constraints instead of happy paths). This maintains the
+efficiency property of Proven areas while exploiting the CLI signal.
+
+### D5. Explore Next Run entries, not cross-area probes
+
+Synthesis produces targeting instructions that are regenerated each run.
+Cross-area probes are persistent regression tests. Different tools for different
+purposes. If a synthesis target fails repeatedly, the agent generates a
+cross-area probe — natural escalation from ephemeral to persistent.
+
+---
+
+## Implementation Phases
+
+### Phase 1: Schema Extraction + Foundation (C1 prep)
+
+**Goal:** Create room in SKILL.md. Extract JSON schema. Add v8 migration notes.
+
+- [x] Create `references/last-run-schema.md` with full JSON schema from SKILL.md:282-333
+  - Include all current fields + C1/C2/C3/C4 additions
+  - Include behavioral notes (overwrite-per-run, fingerprint accumulation exception)
+- [x] Replace SKILL.md:282-333 with 5-line pointer to `last-run-schema.md`
+- [x] Compress Phase 3 novelty budget inline (SKILL.md:110-115) to 2-line pointer
+- [x] Compress Phase 2.5 CLI detail (SKILL.md:84-97) to 8-line version
+- [x] Add v7→v8 migration notes to `test-file-template.md`
+- [x] Add `weakness_class` field to area details template in `test-file-template.md`
+- [x] Verify SKILL.md line count after extraction (target: ~365; after Phases 2-5 additions: ~394)
+
+### Phase 2: Richer Commit Writebacks (C1)
+
+**Goal:** Commit mode persists tactical intelligence.
+
+- [x] Add 3 new bullet points to Commit Mode Step 1 in SKILL.md (~8 lines):
+  - Tactical notes (Notes column, cap 3, drop oldest)
+  - Verified selectors (verify: block, append only, tag with run number)
+  - weakness_class (below pass_threshold, 2+ probes threshold)
+- [x] Add "Selector Discovery and Writeback" section to `verification-patterns.md` (~20 lines)
+  - Rules: only confirmed selectors, append-only, run-tagged, first-run placeholder
+- [x] Add "Weakness Classification" section to `probes.md` (~20 lines)
+  - Predefined classes table, freeform guidance, lifecycle (write/update/remove)
+- [x] Add tactical notes format/cap to `queries-and-multiturn.md` (~10 lines)
+  - `[Run N] <finding>` format, 3-entry cap, write-only-when-genuine rule
+- [x] Update `last-run-schema.md` with C1 per-area fields
+
+### Phase 3: Weakness-Class Synthesis (C2)
+
+**Goal:** Cross-area adversarial targeting from shared failure patterns.
+
+- [x] Add cross-area synthesis pointer to Phase 4 Step 6 in SKILL.md (~5 lines)
+- [x] Add "Cross-Area Weakness Synthesis" section to `probes.md` (~20 lines)
+  - Synthesis pass (3 steps), cap of 2, tiebreaker rules
+  - Adversarial instruction templates table (6 predefined + freeform)
+  - Persistence signal format
+  - Report placement (top of EXPLORE NEXT RUN)
+- [x] Update `last-run-schema.md` with C2 explore_next_run additions
+
+### Phase 4: Novelty Fingerprint Persistence (C3)
+
+**Goal:** Novel interactions tracked across runs.
+
+- [x] Add fingerprint merge note to Commit Mode in SKILL.md (~3 lines)
+- [x] Add fingerprint check to Phase 3 in SKILL.md (~3 lines)
+- [x] Add "Novelty Fingerprint Persistence" section to `queries-and-multiturn.md` (~30 lines)
+  - Fingerprint format and normalization taxonomy
+  - Read-merge-write sequence
+  - Iterate mode exemption
+  - Adversarial mode override
+  - Proven area budget interaction
+  - Matching semantics
+  - SIGNALS format
+- [x] Update `last-run-schema.md` with `novelty_fingerprints` top-level key
+
+### Phase 5: CLI Adversarial Browser Mode (C4)
+
+**Goal:** CLI score 3 triggers adversarial browser testing.
+
+- [x] Add adversarial flag check to Phase 2.5 in SKILL.md (~5 lines)
+- [x] Add adversarial mode trigger to Phase 3 in SKILL.md (~3 lines)
+- [x] Add "CLI Adversarial Mode" section to `queries-and-multiturn.md` (~20 lines)
+  - Trigger condition (per-query score 3, secondary avg 3.0-3.4 check)
+  - 5 behavior changes (skip happy path, front-load, pre-emptive probe, increased novelty, report flag)
+  - Progressive narrowing override (SKIP → PROBES-ONLY)
+  - Fingerprint override rule
+- [x] Update `last-run-schema.md` with C4 per-area fields
+
+### Phase 6: Version Bump + Validation + Install
+
+- [x] Verify SKILL.md ≤ 420 lines (actual: 368)
+- [x] Verify all cross-references between files are correct
+- [x] Bump plugin.json: 2.49.0 → 2.50.0 (no marketplace.json found)
+- [x] Add CHANGELOG entry for v2.50.0
+- [x] Install locally to `~/.claude/skills/user-test/`
+- [x] Clean up any stale files from previous install
+
+---
+
+## Files to Change
+
+| File | Current Lines | Delta | After | What Changes |
+|------|-------------|-------|-------|-------------|
+| `SKILL.md` | 420 | ~-55 extracted, +29 added = net -26 | ~394 | JSON schema extraction, Phase 2.5 compress, Phase 3 compress, C1-C4 additions |
+| `references/last-run-schema.md` | 0 (new) | +~70 | ~70 | Full JSON schema + behavioral notes + C1-C4 field additions |
+| `references/test-file-template.md` | 536 | +~12 | ~548 | v8 migration notes, weakness_class in area details template |
+| `references/probes.md` | 401 | +~40 | ~441 | Weakness Classification section (C1), Cross-Area Weakness Synthesis section (C2) |
+| `references/queries-and-multiturn.md` | 194 | +~60 | ~254 | Tactical notes (C1), Novelty Fingerprint Persistence (C3), CLI Adversarial Mode (C4) |
+| `references/verification-patterns.md` | 131 | +~20 | ~151 | Selector Discovery and Writeback section (C1) |
+| `plugin.json` | — | version bump | — | 2.49.0 → 2.50.0 |
+| `marketplace.json` | — | version bump | — | 2.49.0 → 2.50.0 |
+| `CHANGELOG.md` | — | +~20 | — | v2.50.0 entry |
+
+---
+
+## Acceptance Criteria
+
+### Change 1: Richer Commit Writebacks
+- [ ] Tactical notes written to Notes column in `[Run N] <finding>` format
+- [ ] Notes capped at 3 entries; oldest dropped when exceeded
+- [ ] Notes written only for genuine tactical insights (not generic observations)
+- [ ] Verified selectors appended to verify: blocks with `_Selectors confirmed run N._`
+- [ ] Selector writeback is append-only (never replaces user-authored content)
+- [ ] First-run placeholder: `_Selectors not yet confirmed — discover during exploration._`
+- [ ] `weakness_class` written when 2+ probes share a failure pattern
+- [ ] `weakness_class` removed after 3 consecutive pass runs
+- [ ] One `weakness_class` per area — dominant pattern by probe count
+- [ ] `.user-test-last-run.json` includes `tactical_note`, `confirmed_selectors`, `weakness_class` per area
+- [ ] Detail specs in verification-patterns.md, probes.md, queries-and-multiturn.md
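+
+A rough sketch of how the three per-area fields might sit in `.user-test-last-run.json`. Only the field names and the `[Run N] <finding>` note format come from the criteria above. The `areas` array, the `slug` key, the value types, and the sample class name are assumptions for illustration, not the schema in `references/last-run-schema.md`.
+
+```json
+{
+  "_note": "Illustrative only: nesting, types, and sample values are assumptions",
+  "areas": [
+    {
+      "slug": "<area-slug>",
+      "tactical_note": "[Run N] <finding>",
+      "confirmed_selectors": ["<selector>"],
+      "weakness_class": "stale-react-state"
+    }
+  ]
+}
+```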
+
+### Change 2: Weakness-Class Synthesis
+- [ ] Phase 4 Step 6 runs cross-area synthesis after per-area Explore Next Run generation
+- [ ] `[cross-area]` entries generated when weakness_class appears in 2+ areas
+- [ ] Cap of 2 cross-area synthesis entries per run
+- [ ] Tiebreaker: (1) area count, (2) probe count
+- [ ] Predefined class templates produce correct adversarial instructions
+- [ ] Freeform classes produce custom adversarial instructions
+- [ ] Persistence signal when class active N+ runs: "PERSISTENT — escalate to Known-bug"
+- [ ] Cross-area entries appear at top of EXPLORE NEXT RUN in report
+- [ ] `.user-test-last-run.json` explore_next_run includes weakness_class, affected_areas, adversarial_instruction
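+
+One way a cross-area entry could look once synthesis has run, shown as a sketch. The three field names are taken from the last criterion. Treating `explore_next_run` as an array, the `type` marker, and all placeholder values are assumptions.
+
+```json
+{
+  "_note": "Illustrative only: array shape, type marker, and values are assumptions",
+  "explore_next_run": [
+    {
+      "type": "cross-area",
+      "weakness_class": "stale-react-state",
+      "affected_areas": ["<area-slug-1>", "<area-slug-2>"],
+      "adversarial_instruction": "<instruction produced from the class template>"
+    }
+  ]
+}
+```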
+
+### Change 3: Novelty Fingerprint Persistence
+- [ ] Fingerprints stored in `.user-test-last-run.json` under `novelty_fingerprints`
+- [ ] Format: `<area-slug>:<action-type>:<key-parameter>`
+- [ ] Cap: 20 per area, drop oldest when exceeded
+- [ ] Read-merge-write: existing fingerprints read at Phase 1, merged at Phase 4
+- [ ] Phase 3 skips interactions matching existing fingerprints
+- [ ] Iterate mode ignores fingerprints (consistency measurement preserved)
+- [ ] Adversarial mode (C4) overrides fingerprint skipping for its actions
+- [ ] Proven area budget unchanged (fingerprints reshape, not expand)
+- [ ] SIGNALS line when fingerprints constrained novelty: `~ <area> novelty: N fingerprints excluded, M new found`
+- [ ] Resilience: missing/corrupted JSON → empty fingerprints (graceful degradation)
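+
+A minimal sketch of the `novelty_fingerprints` key, assuming fingerprints are grouped per area (which the 20-per-area cap suggests but does not confirm). The string format follows `<area-slug>:<action-type>:<key-parameter>`. The second entry is a made-up instance of that format, not a value from a real run.
+
+```json
+{
+  "_note": "Illustrative only: per-area grouping and sample values are assumptions",
+  "novelty_fingerprints": {
+    "<area-slug>": [
+      "<area-slug>:<action-type>:<key-parameter>",
+      "browse/product-grid:filter:price-range"
+    ]
+  }
+}
+```
+
+Under this shape, a missing or unparseable file would simply be read as `"novelty_fingerprints": {}`, which matches the graceful-degradation criterion.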
+
+### Change 4: CLI Adversarial Browser Mode
+- [ ] Triggers on any individual CLI query score == 3
+- [ ] Secondary trigger: CLI average 3.0-3.4 with no single query at 3
+- [ ] Skip happy path — start with query exposing shallow reasoning
+- [ ] Front-load competing-constraint queries before single-intent queries
+- [ ] Pre-emptive P1 probe: `generated_from: "cli-score-3: <query>"`
+- [ ] Proven areas: all 3 MCP calls adversarial (not happy-path spot-checks)
+- [ ] Uncharted areas: novelty budget 40% (from 30%), min 3 calls (from 2)
+- [ ] Progressive narrowing override: SKIP → PROBES-ONLY when CLI score 3
+- [ ] Report flag in DETAILS section
+- [ ] SIGNALS line: `~ N areas in CLI-adversarial mode (CLI score 3): <areas>`
+- [ ] `.user-test-last-run.json` includes `adversarial_browser`, `adversarial_trigger` per area
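+
+A possible per-area record for an area that entered CLI-adversarial mode, as a sketch. `adversarial_browser` and `adversarial_trigger` are the field names from the criterion above. The boolean type, the reuse of the `cli-score-3: <query>` string as the trigger value, and the surrounding `areas`/`slug` structure are assumptions.
+
+```json
+{
+  "_note": "Illustrative only: types, trigger string, and nesting are assumptions",
+  "areas": [
+    {
+      "slug": "<area-slug>",
+      "adversarial_browser": true,
+      "adversarial_trigger": "cli-score-3: <query>"
+    }
+  ]
+}
+```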
+
+### Infrastructure
+- [ ] `.user-test-last-run.json` schema extracted to `references/last-run-schema.md`
+- [ ] Schema version v7 → v8
+- [ ] SKILL.md ≤ 420 lines after all changes
+- [ ] All new fields additive (missing = absent/default)
+- [ ] v7 files readable without rewrite
+- [ ] Version bump 2.49.0 → 2.50.0
+- [ ] CHANGELOG entry for v2.50.0
+- [ ] Locally installed and stale files cleaned
+
+---
+
+## Implementation Order
+
+**1 → 2 → 3 → 4 → 5 → 6** — Phase 1 creates room, then C1 → C2 → C3 → C4.
+
+C1 before C2: `weakness_class` written in C1 is consumed by C2's synthesis.
+C3 after C1/C2: probe system is richer, more meaningful territory to fingerprint.
+C4 last: touches the most phases but is the most self-contained conceptually.
+
+---
+
+## What "Getting Smarter Run-Over-Run" Looks Like
+
+**Run 1:** Standard execution. Selectors unknown — sequential finds, 3-5 MCP
+calls per area. Novelty fingerprints empty. No weakness class. Explore Next Run
+is per-area only. CLI score 3 on one area triggers adversarial browser.
+
+**Run 2:** Selectors confirmed from run 1 — verification is now one batch
+`javascript_tool` call per area. Novelty fingerprints exclude run 1 territory.
+If `weakness_class` was written, it's visible in area details.
+
+**Run 3:** `weakness_class` confirmed (2+ probes). Cross-area synthesis generates
+an adversarial Explore Next Run entry. Fingerprints cover 2 runs, so the agent
+must find genuinely new territory.
+
+**Run 5+:** Weakness classes resolve (probes pass, field removed) or deepen (more
+probes confirm). Fingerprints cover most obvious paths. Selectors battle-tested.
+
+**Run 10:** Qualitatively different from run 1. Targeted adversarial probing of
+known weakness classes with one-call batch verification, guided by 9 runs of
+accumulated fingerprints and pattern recognition.
+
+---
+
+## Verification: Would This Have Caught Real Bugs?
+
+| Bug | Without this plan | With this plan |
+|-----|-------------------|----------------|
+| Selectors rediscovered each run | 3-5 MCP calls per area per run | 1 batch call from run 2 onward |
+| Stale-react-state in 3 areas | Each treated independently | Cross-area synthesis targets pattern systemically |
+| Novelty re-exploration | Same territory tested twice | Fingerprints exclude, forcing novel ground |
+| CLI score 3 on filter-via-chat | Passes through to normal browser mode | Adversarial browser: competing constraints, pre-emptive P1 probe |
+
+---
+
+## Sources
+
+### Current File References
+- Commit Mode: `SKILL.md:335-397`
+- Phase 2.5 CLI Testing: `SKILL.md:84-97`
+- Phase 3 Novelty Budget: `SKILL.md:110-115` (inline), `queries-and-multiturn.md:128-194` (detail)
+- Phase 4 Step 6 Explore Next Run: `SKILL.md:209-212`
+- `.user-test-last-run.json` schema: `SKILL.md:282-333`
+- Selector lifecycle: `verification-patterns.md:92-94`
+- Area details template: `test-file-template.md:41-49`
+- CLI Area Queries: `queries-and-multiturn.md:62-83`
+- Novelty Log: `queries-and-multiturn.md:165-194`
+
+### Institutional Learnings
+- Agent-guided state transitions: `docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md`
+- Line budget enforcement: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md`
+- Plugin versioning: `docs/solutions/plugin-versioning-requirements.md`
diff --git a/docs/plans/2026-03-02-feat-cross-area-probes-isolation-browser-restart-plan.md b/docs/plans/2026-03-02-feat-cross-area-probes-isolation-browser-restart-plan.md
new file mode 100644
index 000000000..d469674a5
--- /dev/null
+++ b/docs/plans/2026-03-02-feat-cross-area-probes-isolation-browser-restart-plan.md
@@ -0,0 +1,630 @@
+---
+title: "feat: Cross-Area Probes, Probe Isolation, and Proactive Browser Restart"
+type: feat
+status: completed
+date: 2026-03-02
+schema_version_target: 7
+---
+
+# feat: Cross-Area Probes, Probe Isolation, and Proactive Browser Restart
+
+## Problem Statement
+
+Three gaps identified from run 9 results on sg-resale:
+
+**1. Cross-area seams are untestable.** The search bar -> chat contamination
+bug (UX010) lives between `browse/product-grid` and `agent/filter-via-chat`.
+Neither area owns the interaction. Every probe belongs to exactly one area,
+so there's no way to represent "do X in area A, verify behavior in area B."
+Agent-native apps break at boundaries -- state contamination, stale context
+carry-over, filter pollution across surfaces -- and the current structure
+can't test any of them.
+
+**2. Multi-cause symptoms produce ambiguous probe results.** BUG003 (y2k
+intersection empty) and UX010 (search bar contamination) both produce 0
+results on y2k queries. The existing probe tests the symptom, not the
+cause. When it fails, you can't tell which bug you're looking at. Fixing
+either bug confidently requires isolated probes that control for the other
+variable.
+
+**3. Browser connection degrades after ~18 MCP calls.** Run 6: 90s timing
+spike. Run 9: 3 disconnects all after call #18+. The skill tracks and
+reports this pattern (C4 disconnect tracking) but doesn't prevent it.
+Reactive recovery (wait 3s, retry) costs more time than proactive
+prevention.
+
+## Changes
+
+### X1. Cross-Area Probe Table
+
+**Files:** `test-file-template.md`, `probes.md`, `SKILL.md`
+**Problem:** No way to represent probes that span two areas
+**Fix:** Scenario-level probe table with trigger area + observation area
+
+#### X1a. Test File Schema Addition
+
+Add `## Cross-Area Probes` section to the test file template, positioned
+after `## Area Details` and before `## Area Trends`. This is scenario-level
+-- one table for the whole test file, not per-area.
+
+```markdown
+## Cross-Area Probes
+
+<!-- Probes that test interactions spanning two areas. Run before
+     per-area testing in Phase 3. -->
+
+| Trigger Area | Action | Observation Area | Verify | Status | Priority | Confidence | Generated From | Run History |
+|-------------|--------|-----------------|--------|--------|----------|------------|---------------|-------------|
+```
+
+**Column definitions:**
+
+- `Trigger Area`: The area where the initial action happens (e.g.,
+  `browse/product-grid`)
+- `Action`: What to do in the trigger area (e.g., "search 'dresses'
+  via search bar")
+- `Observation Area`: The area where the effect is verified (e.g.,
+  `agent/filter-via-chat`)
+- `Verify`: What to check in the observation area (e.g., "agent chat
+  responds to follow-up without stale category filter from search bar")
+- Status through Run History: Same as per-area probes -- uses the
+  existing probe lifecycle (untested/passing/failing/flaky/graduated),
+  confidence field, escalation at 3 failures, graduation at 2 passes
+
+**Dedup key:** `trigger_area + observation_area + verify text` (same
+70% word-overlap rule as per-area probes, extended to the area pair).
+
+#### X1b. Execution Slot in Phase 3
+
+Cross-area probes run BEFORE per-area testing. They need both areas
+accessible in sequence, which doesn't fit the area-by-area Phase 3
+flow. Running them first also informs how you interpret per-area
+scores -- if search bar -> chat contamination fails, agent/filter-via-chat
+scores may be polluted.
+
+**Add to SKILL.md Phase 3 (slim pointer, detail in probes.md):**
+
+```markdown
+### Cross-Area Probes (Before Per-Area Testing)
+
+Execute cross-area probes before per-area testing -- they test state
+carry-over between areas and inform per-area score interpretation.
+Results do NOT affect per-area scores. See [probes.md](./references/probes.md).
+```
+
+**Delta:** +4 lines in SKILL.md (after mitigation B).
+
+#### X1c. Lifecycle Rules in probes.md
+
+Cross-area probes use the existing probe lifecycle with two additions.
+Add a new section after the Multi-Run Mode section:
+
+```markdown
+## Cross-Area Probes
+
+Cross-area probes test interactions that span two areas -- where an
+action in one area affects state in another. They live in a scenario-
+level table (not per-area) and run before per-area testing in Phase 3.
+
+### Lifecycle
+
+Cross-area probes follow the same lifecycle as per-area probes:
+- Status transitions: untested -> passing/failing -> flaky/graduated
+- Escalation: 3+ consecutive failures -> auto-file to bugs.md
+- Graduation: 2+ consecutive passes -> eligible for CLI graduation
+  (only if BOTH areas have CLI coverage)
+- Confidence field: same defaults and update rules as per-area
+
+### Generation Triggers
+
+Cross-area probes are generated when:
+- A per-area probe fails AND the failure symptom could be caused by
+  state from another area (agent judgment -- look for stale filters,
+  carry-over context, shared state)
+- The novelty budget discovers a cross-area interaction worth tracking
+- Orientation (code reading) identifies a state ownership boundary
+  that crosses two areas
+- The user explicitly requests a cross-area probe
+
+Cross-area probes are NOT generated automatically from every per-area
+failure. The agent must identify a plausible cross-area cause before
+generating one. This keeps the table focused on genuine seam tests,
+not duplicates of per-area probes.
+
+### Execution
+
+1. Navigate to trigger area
+2. Perform action (do NOT reset between trigger and observation)
+3. Navigate to observation area
+4. Run verify check
+5. Record result
+
+The "no reset" between steps 2 and 3 is the critical difference from
+per-area probes. The whole point is testing state carry-over. If you
+reset between areas, you're testing two independent areas, not a seam.
+
+### Report Section
+
+Cross-area probe results appear in their own report section, between
+the header and NEEDS ACTION:
+
+~~~
+Cross-Area Probes:
+| Trigger -> Observation | Action | Status | Detail |
+|-----------------------|--------|--------|--------|
+| browse/product-grid -> agent/filter-via-chat | search "dresses" via search bar | failing | agent chat shows stale "Dresses" filter on follow-up |
+~~~
+### Dedup
+
+Key: `trigger_area + observation_area + verify text`. Same 70%
+word-overlap rule as per-area probes, applied to the area pair.
+A probe from A->B and a probe from B->A are different probes (different
+causal direction).
+
+### Bug Filing
+
+When a cross-area probe escalates (3+ consecutive failures), the bug
+entry in bugs.md lists the trigger area as primary and the observation
+area in the summary: "Also affects: <observation_area>". This matches
+the existing multi-area bug format in bugs-registry.md.
+
+### Spot-Check Budget
+
+Passing cross-area probes are spot-checked -- execute at most 3 passing
+probes per run (selected randomly). Failing and untested cross-area
+probes always execute. This bounds the front-load: a stable test file
+with 5 passing cross-area probes spot-checks 3, not all 5.
+
+### Progressive Narrowing Interaction
+
+Progressive narrowing classifications (SKIP/PROBES-ONLY/FULL) apply to
+per-area testing only. Cross-area probes execute in their own slot
+regardless of the trigger or observation area's narrowing classification.
+An area classified SKIP for per-area testing can still be a trigger or
+observation target for cross-area probes.
+
+### Cap
+
+Maximum 10 active cross-area probes per test file. Cross-area probes
+are more expensive than per-area (two navigation steps, no reset). If
+the table exceeds 10 active entries, the oldest passing probes rotate
+out first (same as per-area rotation).
+
+### Proactive Restart Interaction
+
+Cross-area probes must NOT be interrupted by a proactive restart --
+they depend on state carry-over between trigger and observation areas.
+The restart check is skipped during cross-area probe execution. The
+MCP call counter still increments; the restart happens after the
+cross-area probe sequence completes.
+```
+
+**Delta:** +72 lines in probes.md.
+
+#### X1d. Test File Template Update
+
+Add the `## Cross-Area Probes` section to test-file-template.md in the
+template block, after `## Area Details` closing and before `## Area Trends`:
+
+```markdown
+## Cross-Area Probes
+
+<!-- Probes that test state carry-over between areas. Run before per-area
+     testing. See probes.md for lifecycle and generation triggers. -->
+
+| Trigger Area | Action | Observation Area | Verify | Status | Priority | Confidence | Generated From | Run History |
+|-------------|--------|-----------------|--------|--------|----------|------------|---------------|-------------|
+```
+
+Add to schema migration section:
+
+```markdown
+**v6 -> v7 changes:**
+- New section: `## Cross-Area Probes` (scenario-level probe table for
+  interactions spanning two areas)
+- Probe generation: `related_bug` field for isolation probes
+- Test file frontmatter: optional `mcp_restart_threshold` field
+
+**Reading v6 files:** Treat missing `## Cross-Area Probes` section as
+empty table. Do NOT rewrite on read.
+```
+
+**Delta:** +12 lines in test-file-template.md.
+
+#### X1e. .user-test-last-run.json Schema
+
+Add `cross_area_probes_run` field alongside existing `probes_run`:
+
+```json
+"cross_area_probes_run": [
+  {
+    "trigger_area": "browse/product-grid",
+    "action": "search 'dresses' via search bar",
+    "observation_area": "agent/filter-via-chat",
+    "verify": "agent chat responds without stale category filter",
+    "status": "failing",
+    "result_detail": "agent showed stale Dresses filter on follow-up"
+  }
+]
+```
+
+**Delta:** 0 SKILL.md lines (documented in reference files only, same
+pattern as existing schema additions).
+
+---
+
+### X2. Probe Isolation Guidance
+
+**File:** `probes.md` Probe Generation section
+**Problem:** Single probe tests symptom with multiple possible causes
+**Fix:** Guidance for generating cause-isolated probes with `related_bug`
+
+```markdown
+### Multi-Cause Isolation
+
+When a probe targets a symptom that could have multiple causes (e.g.,
+two open bugs producing the same "0 results" failure), generate separate
+probes per hypothesized cause. Each probe's setup must isolate the
+variable being tested:
+
+**Pattern:**
+
+Symptom: y2k accessories returns 0 results
+Cause A: empty data intersection (BUG003)
+Cause B: search bar state contamination (UX010)
+
+Isolated probe A:
+  Setup: fresh session (no prior search bar usage)
+  Query: "y2k accessories"
+  Verify: "results include y2k-tagged items -- tests data coverage
+    independent of search bar state"
+  related_bug: BUG003
+
+Isolated probe B (cross-area):
+  Trigger: browse/product-grid -- search "dresses" via search bar
+  Observation: agent/filter-via-chat -- ask for "y2k accessories"
+  Verify: "agent clears stale category filter before applying y2k"
+  related_bug: UX010
+
+**`related_bug` field:** Optional field on any probe (per-area or
+cross-area) linking the probe to a specific bug ID. When the probe
+passes, it provides evidence that the linked bug is fixed. When it
+fails, it confirms the linked bug is still active. Multiple probes
+can reference the same bug -- each tests the bug from a different
+angle.
+
+**When to isolate:** The agent should consider isolation when:
+- A probe has `escalated_to` linking to a bug, AND another open bug
+  affects the same area or a related area
+- A failing probe's `result_detail` is ambiguous ("0 results" without
+  specifying whether the data is missing or the query is wrong)
+- Two bugs in bugs.md have overlapping area slugs
+
+**When NOT to isolate:** If only one bug exists for the symptom, or
+if the causes are clearly distinguishable from the probe result alone,
+isolation adds complexity without value. Single probes are preferred
+when the cause is unambiguous.
+
+**Bug lifecycle interaction:** When a bug is marked `fixed` in commit
+mode, the agent should note whether probes with `related_bug` pointing
+to that bug are passing or failing. If the bug is fixed but its related
+probes fail, note the discrepancy in the report: "BUG003 marked fixed
+but related probe still failing -- investigate." This keeps `related_bug`
+informational while giving it a concrete use during the bug lifecycle.
+```
+
+**Delta:** +35 lines in probes.md.
+
+---
+
+### X3. Proactive Browser Restart
+
+**Files:** `SKILL.md` (pointer), `references/connection-resilience.md` (NEW), `browser-input-patterns.md`
+**Problem:** Connection degrades after ~18 MCP calls, reactive recovery
+costs more than prevention
+**Fix:** Proactive page reload at configurable threshold
+
+#### X3a. Connection Resilience Reference File (NEW)
+
+Create `references/connection-resilience.md`:
+
+```markdown
+# Connection Resilience
+
+## Reactive (On Failure)
+
+1. After any MCP tool failure: wait 3 seconds (`Bash: sleep 3`)
+2. Retry the call once
+3. If retry fails: display "Extension disconnected. Run `/chrome` and
+   select Reconnect extension"
+4. Track `disconnect_counter` for the session
+5. If `disconnect_counter >= 3`: abort with "Extension connection
+   unstable. Check Chrome extension status and restart the session."
+
+## Proactive (Prevent Degradation)
+
+6. Track `mcp_call_counter` for the session (increments on every
+   successful MCP tool call)
+7. When `mcp_call_counter` reaches `mcp_restart_threshold` (default: 15,
+   configurable in test file frontmatter): navigate to the app entry URL
+   (full page reload). Reset `mcp_call_counter` to 0. Log: "Proactive
+   restart at call #N to prevent connection degradation."
+8. The restart happens between areas, not mid-area. If the threshold is
+   reached during an area, finish the current area first, then restart
+   before the next area.
+9. In iterate mode, the between-run reset counts as a restart. Reset
+   `mcp_call_counter` at each between-run page reload.
+
+## Disconnect Pattern Tracking
+
+When `disconnect_counter` increments, record the context: which MCP tool
+was called, which area was being tested, and the session MCP call count.
+
+At run end, if `disconnect_counter >= 3`, append a disconnect analysis
+to the SIGNALS section of the report.
+```
+
+#### X3b. SKILL.md Connection Resilience Pointer
+
+Replace current Connection Resilience section with a slim pointer:
+
+```markdown
+### Connection Resilience
+
+See [connection-resilience.md](./references/connection-resilience.md) for
+reactive recovery, proactive restart at configurable MCP call threshold,
+and disconnect tracking rules.
+```
+
+**Delta:** Replaces 7 lines with 3 lines = -4 lines in SKILL.md.
+
+#### X3c. Frontmatter Addition
+
+Add `mcp_restart_threshold` to test-file-template.md frontmatter:
+
+```yaml
+mcp_restart_threshold: 15  # optional, proactive page reload after N MCP calls
+```
+
+**Delta:** +1 line in test-file-template.md.
+
+#### X3d. browser-input-patterns.md Note
+
+Add after Modal Dialog Handling:
+
+```markdown
+## Proactive Restart
+
+Sustained MCP tool usage degrades browser extension connections. The
+skill proactively restarts (full page reload to app entry URL) after
+a configurable number of MCP calls -- see Connection Resilience in
+SKILL.md.
+
+**What a restart clears:**
+- Extension message channel state
+- In-memory JavaScript variables
+- Pending network requests
+
+**What a restart does NOT clear:**
+- Cookies and session storage (login state preserved)
+- IndexedDB data
+- Service worker caches
+
+**Timing:** Restarts happen between areas. If a restart is triggered
+mid-area, the current area completes first. The next area starts with
+a fresh page load.
+
+**Impact on cross-area probes:** Cross-area probes must NOT be
+interrupted by a proactive restart -- they depend on state carry-over
+between trigger and observation areas. The restart check is skipped
+during cross-area probe execution. The counter still increments.
+```
+
+**Delta:** +18 lines in browser-input-patterns.md.
+
+---
+
+## Design Decisions
+
+### D1. Cross-area probes run BEFORE per-area testing
+
+Running cross-area probes first provides context for per-area scoring.
+If search bar -> chat contamination fails, the agent knows that
+agent/filter-via-chat results may be unreliable. This changes the
+interpretation of per-area scores ("UX 4 on filter-via-chat, but
+cross-area contamination probe failing -- score reflects clean
+sessions only and may be inflated").
+
+The alternative -- running after per-area testing -- means per-area
+scores are computed without this context. Running before is more
+informative.
+
+### D2. Cross-area probes do NOT affect per-area scores
+
+A failing cross-area probe means the seam between two areas is broken.
+It doesn't mean either individual area is broken in isolation. Mixing
+cross-area results into per-area scores would pollute maturity tracking
+and make it impossible to determine whether an area is individually
+healthy.
+
+Cross-area probes have their own lifecycle. They can escalate to bugs
+independently. The bug references both areas.
+
+### D3. No reset between trigger and observation
+
+This is the defining characteristic of cross-area probes. A per-area
+probe with "navigate to area A, then navigate to area B" and a reset
+between them is just two per-area probes. The cross-area probe's value
+is testing what happens when state carries over -- stale filters, polluted
+context, shared session state.
+
+### D4. 10 active cross-area probe cap
+
+Cross-area probes are expensive -- two navigations, no reset, harder to
+debug when they fail. 10 is enough for a test file with 7-10 areas
+(testing the most important seams). If more seams need testing, that's
+a signal the app has too many state-sharing boundaries, which is itself
+a finding worth reporting.
+
+### D5. Proactive restart is skipped during cross-area execution
+
+A proactive restart between the trigger and observation steps of a
+cross-area probe would clear the exact state the probe is testing.
+The restart check is suppressed during cross-area probe execution.
+The MCP call counter still increments -- the restart happens after
+the cross-area probe sequence completes.
+
+### D6. Probe isolation is guidance, not automation
+
+The skill cannot automatically determine that two bugs produce the same
+symptom. The agent applies judgment: when a probe fails and the failure
+could have multiple causes, generate isolated probes. This is documented
+in the generation section as a pattern to follow, not a rule to enforce.
+Automated isolation would require causal reasoning the agent doesn't
+reliably have.
+
+### D7. `related_bug` is optional and informational
+
+The `related_bug` field links a probe to a bug for human/agent
+comprehension. It does NOT change probe behavior -- a probe with
+`related_bug: BUG003` follows the same lifecycle as any other probe.
+The field provides traceability: when reviewing bugs.md, you can see
+which probes are testing which bugs. When a bug is marked fixed,
+you can check whether its related probes are passing.
+
+---
+
+## Line Budget
+
+| File | Baseline | Delta | After | Notes |
+|------|----------|-------|-------|-------|
+| SKILL.md | 420 | +4 (X1b) -4 (X3b) | 420 | At ceiling |
+| probes.md | ~283 | +72 (X1c) +35 (X2) | ~390 | |
+| test-file-template.md | ~516 | +12 (X1d) +1 (X3c) | ~529 | |
+| browser-input-patterns.md | ~121 | +18 (X3d) | ~139 | |
+| connection-resilience.md | NEW | +30 (X3a) | ~30 | Extracted from SKILL.md |
+| **Total** | | **+168** | | |
+
+**SKILL.md stays at 420.** Cross-area pointer (+4) offset by connection
+resilience extraction (-4). Net zero.
+
+---
+
+## Implementation Phases
+
+### Phase 1: Schema (no behavior change)
+
+- [x] Update `references/test-file-template.md` -- cross-area probe table,
+  v7 migration notes, `mcp_restart_threshold` frontmatter, `related_bug`
+  field documentation
+- [x] Update `references/probes.md` -- cross-area probe lifecycle, generation
+  triggers, execution, report section, dedup, bug filing, cap, restart
+  interaction
+
+### Phase 2: Probe Isolation
+
+- [x] Update `references/probes.md` -- multi-cause isolation guidance,
+  `related_bug` field, isolation pattern example, when to/not to isolate
+
+### Phase 3: Proactive Browser Restart
+
+- [x] Create `references/connection-resilience.md` -- reactive + proactive
+  rules, disconnect tracking
+- [x] Update `SKILL.md` -- replace Connection Resilience with 3-line pointer
+- [x] Update `references/browser-input-patterns.md` -- proactive restart
+  section (clears/preserves, timing, cross-area interaction)
+
+### Phase 4: Cross-Area Execution
+
+- [x] Update `SKILL.md` Phase 3 -- add cross-area probes pointer (4 lines)
+- [x] Update `.user-test-last-run.json` schema -- `cross_area_probes_run`
+  documented in probes.md cross-area section (0 SKILL.md lines per X1e)
+
+### Phase 5: Version Bump & Validation
+
+- [x] Bump version in `plugin.json` and `marketplace.json` (2.48.0 -> 2.49.0)
+- [x] Update `CHANGELOG.md` with v7 schema changes
+- [x] Line-count checkpoint: SKILL.md = 420 lines
+- [x] Install locally to `~/.claude/skills/user-test/`
+- [x] Verify: v6 test files read correctly (missing cross-area section = empty)
+- [x] Verify: cross-area probe execution order (before per-area)
+- [x] Verify: proactive restart fires between areas, not mid-area
+- [x] Verify: restart skipped during cross-area probe execution
+
+---
+
+## Acceptance Criteria
+
+### X1: Cross-Area Probes
+- [ ] `## Cross-Area Probes` section in test file template
+- [ ] Table schema: Trigger Area, Action, Observation Area, Verify,
+      Status, Priority, Confidence, Generated From, Run History
+- [ ] Execution slot: before per-area testing in Phase 3
+- [ ] No reset between trigger action and observation verify
+- [ ] Results in separate report section (not mixed into per-area table)
+- [ ] Same lifecycle as per-area probes (escalation, graduation, confidence)
+- [ ] Graduation requires both areas to have CLI coverage
+- [ ] Dedup key: trigger_area + observation_area + verify text
+- [ ] Bug filing: trigger area as primary, observation area in summary
+- [ ] Cap: 10 active cross-area probes per test file
+- [ ] Spot-check budget: max 3 passing probes per run, failing/untested always execute
+- [ ] Progressive narrowing: cross-area probes ignore SKIP/PROBES-ONLY classification
+- [ ] `cross_area_probes_run` in .user-test-last-run.json
+- [ ] v6 -> v7 migration: missing section treated as empty table
+
+### X2: Probe Isolation
+- [ ] Multi-cause isolation pattern documented in probes.md
+- [ ] `related_bug` field documented (optional, on any probe)
+- [ ] Isolation example shows per-area + cross-area probe pair
+- [ ] "When to isolate" checklist (multiple bugs, ambiguous detail,
+      overlapping areas)
+- [ ] "When NOT to isolate" guidance (single cause, unambiguous result)
+- [ ] Bug lifecycle interaction: agent notes related_bug probe status when bug marked fixed
+
+### X3: Proactive Browser Restart
+- [ ] `mcp_call_counter` tracked per session
+- [ ] Proactive restart at `mcp_restart_threshold` (default 15)
+- [ ] Threshold configurable in test file frontmatter
+- [ ] Restart happens between areas, not mid-area
+- [ ] Restart skipped during cross-area probe execution
+- [ ] `mcp_call_counter` reset on between-run page reload (iterate mode)
+- [ ] Restart logged in report: "Proactive restart at call #N"
+- [ ] browser-input-patterns.md documents what restart clears/preserves
+- [ ] Connection resilience extracted to reference file (SKILL.md budget)
+
+### Schema & Migration
+- [ ] Schema version: v6 -> v7
+- [ ] Cross-Area Probes section additive (missing = empty table)
+- [ ] `related_bug` field additive (missing = no linked bug)
+- [ ] `mcp_restart_threshold` additive (missing = default 15)
+- [ ] Forward compatibility: v6 skill reads v7 files safely
+- [ ] SKILL.md <= 420 lines after all changes
+
+---
+
+## Verification: Would This Have Caught the Real Bugs?
+
+| Bug | Without this plan | With this plan |
+|-----|-------------------|----------------|
+| Search bar -> chat contamination (UX010) | Not testable -- no area owns the seam | Cross-area probe: trigger `browse/product-grid` search, observe `agent/filter-via-chat` state |
+| y2k + contamination tangled (BUG003 + UX010) | Single probe fails ambiguously | Two isolated probes: fresh-session y2k (per-area, related_bug: BUG003) + contamination path (cross-area, related_bug: UX010) |
+| 3 disconnects after call #18 | Tracked and reported, not prevented | Proactive restart at call #15 prevents degradation |
+
+---
+
+## Sources
+
+### Internal References
+- Current probe lifecycle: `probes.md`
+- Current probe generation: `probes.md` (generation triggers section)
+- Current connection resilience: `SKILL.md` (Phase 3)
+- Current report output: `SKILL.md` (Phase 4)
+- Current test file template: `test-file-template.md`
+- Cross-area bug format: `bugs-registry.md`
+- Multi-area bug filing: `bugs-registry.md`
+
+### Institutional Learnings Applied
+- **Agent-guided state transitions** (`docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md`): Cross-area probe generation uses agent judgment, not automated rules. The agent must identify plausible cross-area cause before generating.
+- **Line budget enforcement** (`docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md`): Connection resilience extracted to reference file. Cross-area execution uses slim pointer. SKILL.md stays within the 420-line ceiling.
+- **Plugin versioning** (`docs/solutions/plugin-versioning-requirements.md`): MINOR version bump (2.48.0 -> 2.49.0) for new schema version.
diff --git a/docs/plans/2026-03-03-feat-multi-area-journey-testing-plan.md b/docs/plans/2026-03-03-feat-multi-area-journey-testing-plan.md
new file mode 100644
index 000000000..70e09e655
--- /dev/null
+++ b/docs/plans/2026-03-03-feat-multi-area-journey-testing-plan.md
@@ -0,0 +1,565 @@
+---
+title: "feat: Multi-Area Journey Testing"
+type: feat
+status: completed
+date: 2026-03-03
+schema_version_target: 9
+plugin_version_target: 2.51.0
+---
+
+# feat: Multi-Area Journey Testing
+
+## Problem Statement
+
+The skill tests areas in isolation. Every run is a series of independent
+spot-checks: test area A, reset, test area B, reset, test area C. Real
+users don't reset between actions. They search for something, filter the
+results, click a product, add it to cart, go back, search again. State
+accumulates across every transition.
+
+Cross-area probes (v2.49.0) partially address this -- they test state
+carry-over between two specific areas (trigger -> observation). But a
+two-area probe is a seam test, not a journey. A bug that only manifests
+after a 4-step sequence (search -> filter -> detail -> back -> filter
+state stale) would not be caught by any two-area probe, because the
+staleness requires the intermediate steps to accumulate.
+
+The skill needs multi-step journeys executed without resets, where state
+accumulates naturally and verification happens at checkpoints along the
+way -- not just at the end.
+
+## How Journeys Differ From Existing Constructs
+
+| Construct | Scope | Reset | Tests |
+|-----------|-------|-------|-------|
+| Per-area probe | 1 area | N/A | Specific claim within an area |
+| Cross-area probe | 2 areas | No reset | State carry-over at one seam |
+| Multi-turn sequence | 1 area, N turns | No reset | Conversational context retention |
+| **Journey** | **3+ areas** | **No reset** | **Accumulated state across a full user flow** |
+
+Journeys are a third testing layer alongside per-area and cross-area
+probes. They catch bugs requiring accumulated state -- invisible to
+isolated testing.
+
+## Design
+
+### Journey Definition
+
+A journey is a sequence of 3-8 steps across different areas, executed
+without resets, with checkpoints verifying state at intermediate points.
+
+**Schema in test file (new `## Journeys` section):**
+
+```markdown
+## Journeys
+
+<!-- Multi-area user flows without resets. Run after cross-area probes,
+     before per-area testing. See journeys.md for lifecycle and budget. -->
+
+### J001: Primary user flow
+
+**Steps:**
+
+| Step | Area | Action | Checkpoint |
+|------|------|--------|-----------|
+| 1 | <area-slug-1> | <natural language action> | <what to verify> |
+| 2 | <area-slug-2> | <natural language action> | <what to verify> |
+| 3 | <area-slug-3> | <natural language action> | <what to verify> |
+| 4 | <area-slug-4> | <natural language action> | <what to verify> |
+| 5 | <area-slug-1> | <natural language action> | <state clean from earlier steps> |
+
+**Status:** untested
+**Last Run:** ---
+**Run History:** ---
+**Generated From:** manual (initial scenario definition)
+```
+
+**Column definitions:**
+
+- **Step:** Execution order (1, 2, 3...). Positional index, not a stable ID.
+- **Area:** Which area this step operates in (area slug from ## Areas).
+- **Action:** What to do (natural language, same as probe queries).
+- **Checkpoint:** What to verify at THIS step before proceeding. A
+  checkpoint failure at step 3 means the journey failed at step 3,
+  not just "failed." Use `---` to skip verification (sparingly).
+
+**Journey-level fields:**
+
+- **Status:** `untested` / `passing` / `failing-at-N` / `flaky` / `stable`
+- **Last Run:** Date of last execution
+- **Run History:** Compact pass/fail (e.g., `P P F:3 P F:5 P`). Failures
+  include step number after colon for escalation tracking. The colon
+  delimiter avoids ambiguity with count-based formats (F:3 = "failed at
+  step 3", not "failed 3 times").
+- **Generated From:** `manual`, `orientation`, `cross-area-escalation`,
+  `weakness-class-synthesis`
+- **on_failure:** `abort` (default) or `continue` (opt-in, per-journey)
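The Run History format above lends itself to a small parser. The following is a minimal sketch in Python, not the skill's implementation; the name `parse_run_history` and the `(passed, failed_steps)` tuple shape are assumptions made for illustration. It also accepts the comma-separated form (`F:2,5`) that continue-mode runs record.

```python
from typing import List, Tuple


def parse_run_history(history: str) -> List[Tuple[bool, List[int]]]:
    """Parse a string such as 'P P F:3 P F:5 P' into (passed, failed_steps) pairs."""
    runs = []
    for token in history.split():
        if token == "---":
            # Placeholder used before the first run: no entries yet.
            continue
        if token == "P":
            runs.append((True, []))
        elif token.startswith("F:"):
            # 'F:3' means "failed at step 3"; 'F:2,5' lists several failing steps.
            steps = [int(s) for s in token[2:].split(",")]
            runs.append((False, steps))
        else:
            raise ValueError(f"unrecognized Run History token: {token!r}")
    return runs
```

Keeping the step numbers as a list means abort-mode and continue-mode entries share one representation; the first element is always the first failing step.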
+
+### Checkpoint Types
+
+| Type | Example | How to check |
+|------|---------|-------------|
+| Result state | "Results include matching items" | javascript_tool read of first 3 results |
+| Count change | "Counter increments by 1" | Read element, compare to pre-action value |
+| Element present | "Details match listing" | Check 2-3 attributes match between views |
+| State clean | "No stale filters from prior steps" | Read active state, verify none from prior steps |
+| No check | `---` | Skip verification at this step (use sparingly) |
+
+Checkpoints are 1 MCP call each (batched `javascript_tool`). A 5-step
+journey = ~10 MCP calls (5 actions + 5 checkpoint reads). This is
+separate from the per-area MCP budget -- journey steps do NOT consume
+per-area call budgets.
+
+### Execution Slot
+
+```
+Phase 3 execution order:
+  1. Cross-area probes (seam tests)
+  2. Journeys (accumulated state tests)     <-- NEW
+  3. Per-area testing (isolated area tests)
+```
+
+Journeys run after cross-area probes because cross-area results inform
+whether a journey's seams are already known broken. Journeys run before
+per-area testing because journey failures provide context for per-area
+exploration (e.g., "area-X has state management issues after navigation").
+
+**Inter-journey reset:** Navigate to the app's entry URL between
+journeys. Each journey starts from a clean navigation state. Journeys
+are independent of each other and can be authored without considering
+execution order. (Within a journey, no resets between steps.)
+
+**Execution order when multiple journeys exist:**
+1. `failing-at-N` journeys first (highest signal value)
+2. `untested` journeys second
+3. `flaky` journeys third
+4. `passing` journeys fourth
+5. `stable` journeys last (and only every other run)
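The priority list above can be expressed as a sort key. This is an illustrative Python sketch; the journey dict shape (`id`, `status`) and the function name `order_journeys` are assumptions, and the every-other-run skip for `stable` journeys is deliberately not modeled here.

```python
import re

# Lower rank runs earlier; the order is taken from the list above.
_PRIORITY = {"failing": 0, "untested": 1, "flaky": 2, "passing": 3, "stable": 4}


def order_journeys(journeys):
    """Sort journey dicts (hypothetical shape: {'id': ..., 'status': ...})."""

    def rank(journey):
        status = journey["status"]
        # 'failing-at-3', 'failing-at-5', ... all share the 'failing' rank.
        if re.fullmatch(r"failing-at-\d+", status):
            status = "failing"
        return _PRIORITY[status]

    # sorted() is stable, so journeys with equal rank keep their file order.
    return sorted(journeys, key=rank)
```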
+
+### Journey Lifecycle
+
+```
+untested -> [run] -> passing / failing-at-N
+                       |           |
+               [5+ consecutive]  [mixed steps across 3+ runs]
+                       |           |
+                    stable       flaky
+               (every other run)
+                                   |
+                       [3+ consecutive SAME step]
+                                   |
+                         escalate to bugs.md
+                         (as multi-area bug)
+```
+
+**Status definitions:**
+
+| Status | Meaning |
+|--------|---------|
+| `untested` | Defined, not yet run |
+| `passing` | All checkpoints passed on last run |
+| `failing-at-N` | Failed at step N specifically |
+| `flaky` | Fails at different steps across 3+ runs |
+| `stable` | Passing 5+ consecutive runs |
+
+**`failing-at-N`** is the key innovation. Step 2 failure = the individual
+area is broken (per-area testing would catch this). Step 5 failure after
+steps 1-4 passed = accumulated state bug (the journey's unique value).
+
+**`flaky`:** Failing at step 3, then step 5, then step 3 = different
+causes. Status becomes `flaky`. The consecutive-same-step counter resets
+on each step change. Flaky is not inherently bad -- it means the journey
+has multiple fragile points worth investigating.
+
+**Escalation:** Journey failing at the SAME step for 3+ consecutive runs
+auto-escalates to bugs.md. Bug entry format:
+
+```
+| ID | Area | Summary | Journey |
+|... | <failing-step-area> | Journey <ID> fails at step N: <checkpoint detail> | J001 (steps 1-N context: <preceding area slugs>) |
+```
+
+The failing step's area is primary. Preceding areas provide context.
+
+**Stable frequency:** `stable` journeys run every other run (derived
+from Run History length -- odd run count = run, even = skip).
+
+**Stable revert:** When a stable journey fails, set status to
+`failing-at-N` (not `passing`). The stable consecutive counter resets.
+Journey runs every time again until re-stabilized.
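The stable and escalation thresholds both reduce to counting a streak at the tail of Run History. A hedged sketch follows, assuming each run is represented as a hypothetical `(passed, failed_steps)` tuple, oldest first; Known-bug suppression and the exclusion of `--no-commit` runs are left to the caller.

```python
def trailing_passes(runs):
    """Count consecutive passes at the end of the history."""
    count = 0
    for passed, _ in reversed(runs):
        if not passed:
            break
        count += 1
    return count


def trailing_same_step_failures(runs):
    """Return (streak_length, step) for trailing failures sharing one first failing step."""
    count, step = 0, None
    for passed, failed_steps in reversed(runs):
        if passed:
            break
        first = failed_steps[0]  # continue-mode runs escalate on the first failing step
        if step is None:
            step = first
        elif first != step:
            break  # step drift ends the streak
        count += 1
    return count, step


def should_mark_stable(runs):
    return trailing_passes(runs) >= 5


def should_escalate(runs):
    return trailing_same_step_failures(runs)[0] >= 3
```

A single failure zeroes `trailing_passes`, which matches the stable-revert rule: the journey has to earn five consecutive passes again.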
+
+### Checkpoint Failure: Abort vs. Continue
+
+**Abort (default):** Stop at failing step. Record `failing-at-N`.
+Remaining steps not executed. Correct for most failures -- if step 3
+state is wrong, step 4 on wrong state is unpredictable.
+
+**Continue (opt-in):** Add `on_failure: continue` to journey definition.
+Log each checkpoint failure but execute all remaining steps. Useful when
+steps test independent state dimensions.
+
+**Continue-mode status:** When multiple checkpoints fail, status is
+`failing-at-N` where N = the FIRST failing step. Run History records
+all failing steps: `F:2,5` (failed at steps 2 and 5). Escalation uses
+the first failing step only -- subsequent failures may be cascading
+effects.
+
+### Definition Change Detection
+
+When commit mode reads the existing journey to update status, it
+compares the current step count and area slugs against the stored
+values. If either changed (steps added/removed/reordered, area slugs
+changed), reset status to `untested` and clear Run History.
+
+Detection key: `<step-count>:<area-slug-1>,<area-slug-2>,...`
+
+This is conservative but prevents stale `failing-at-3` pointing at a
+step that no longer exists or has moved.
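As a rough illustration of the detection key, the sketch below builds it from an ordered list of area slugs and compares it with a stored value. The function names and the example slugs are hypothetical; where the key is stored is not specified in this plan.

```python
def detection_key(area_slugs):
    """Build '<step-count>:<slug-1>,<slug-2>,...' from the ordered step areas."""
    return f"{len(area_slugs)}:{','.join(area_slugs)}"


def definition_changed(stored_key, area_slugs):
    """True when steps were added, removed, reordered, or moved to another area."""
    return stored_key != detection_key(area_slugs)


# Hypothetical slugs, for illustration only.
key = detection_key(["browse", "filter", "detail"])
reordered = definition_changed(key, ["browse", "detail", "filter"])
```

Because the key covers only step count and area order, rewording an Action or Checkpoint cell leaves the key unchanged and the journey's history intact.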
+
+### Known-Bug Area Interaction
+
+Journey steps execute regardless of an area's Known-bug status. Rationale:
+the journey tests accumulated state across the full sequence, not the
+individual area. A Known-bug area may behave differently in a journey
+context than in isolation. If a Known-bug area causes a journey checkpoint
+to fail, the journey records `failing-at-N` normally -- this is useful
+signal (confirms the bug affects multi-area flows, not just isolated use).
+
+Journey failures involving Known-bug areas do NOT auto-escalate to
+bugs.md (the bug is already filed). Escalation is suppressed when the
+failing step's area has an active Known-bug entry.
+
+### Generation
+
+**1. Manual definition (primary).** User writes journeys for real user
+flows. Skill prompts on first run if none defined. If orientation (source
+2) generated journey suggestions this run, present those suggestions AS
+the first-run prompt rather than asking for manual definition from scratch:
+
+> "Based on code reading, I found these state boundaries crossing 3+
+> areas. Here's a suggested journey: [steps]. Would you like to use
+> this, modify it, or define your own?"
+
+If no orientation results exist, fall back to the generic prompt:
+
+> "No journeys defined. Journeys test multi-area flows without resets.
+> Define 1-2 journeys based on your app's primary user flows? (y/n)"
+
+If yes, agent suggests steps from the area map. If no, skip.
+
+**2. Orientation.** Code reading identifies state boundaries crossing
+3+ areas -> journey hypothesis. Orientation completes before the
+first-run prompt so its findings can be incorporated into suggestions.
+
+**3. Cross-area probe escalation.** 2+ cross-area probes pass individually
+but per-area issues persist -> suggest journey covering all affected areas.
+
+**4. Weakness-class synthesis.** Weakness class spans 3+ areas -> suggest
+journey probing state transitions across affected areas.
+
+Sources 2-4 generate **suggestions requiring user confirmation**. Journeys
+are expensive; auto-generation without confirmation wastes budget.
+
+### Journey Budget
+
+- **Max 5 active journeys** per test file
+- **3-8 steps** per journey. If a flow exceeds 8 steps, split into two
+  overlapping journeys (1-6 and 5-10) with a shared transition. Splitting
+  counts against the 5-journey cap. If splitting would exceed the cap,
+  prefer a single 8-step journey over two overlapping ones. Only split
+  when the flow genuinely exceeds 8 steps.
+- **~2 minutes per journey.** 5 journeys = ~10 minutes maximum.
+- **Stable skip:** stable journeys run every other run, halving budget
+  for mature test files.
+- **Time pressure:** If session time is tight, run only failing/untested
+  journeys (same priority as probes).
+
+### Interaction With Existing Features
+
+**Proactive restart:** Suppressed during journey execution (same rule as
+cross-area probes). MCP counter increments but restart is deferred until
+the current journey completes. Counter resets between journeys (each
+starts fresh after inter-journey navigation).
+
+**Progressive narrowing:** Applies to per-area testing only. Journey
+steps execute regardless of area narrowing classification (SKIP,
+PROBES-ONLY, FULL). A SKIP area can still be a journey step.
+
+**Cross-area probes:** Complementary. Cross-area probes test 1 seam.
+Journeys test accumulated state across 3+ seams. No dedup between them
+-- a 2-area cross-area probe and a journey step covering the same seam
+test different things (isolation vs. accumulation).
+
+**Adversarial mode:** Does NOT apply to journey steps. Journey steps
+execute the defined action and checkpoint, not the adversarial variant.
+Adversarial mode is a per-area testing concern.
+
+**Per-area MCP budgets:** Journey MCP calls are separate from per-area
+budgets. A journey visiting an area does not consume that area's per-area
+call budget. Per-area testing runs independently after all journeys.
+
+**`--no-commit` flag:** Journey results are recorded in
+`.user-test-last-run.json` regardless of commit flag. But journey status
+in the test file is only updated during commit mode. The `--no-commit`
+run does NOT count toward the consecutive failure counter for escalation.
+
+**Iterate mode:** Each iterate iteration counts as a separate run for
+journey Run History. Stable "every other run" applies per iteration.
+
+**Partial run safety:** If a run is interrupted mid-journey, uncommitted
+journey results are discarded. Only fully-completed journeys have their
+status written during commit mode. Partially-executed journeys retain
+their pre-run status.
+
+### Report Section
+
+New section in Phase 4 report, after cross-area probes and before
+per-area details:
+
+```
+JOURNEYS
+| ID   | Name                   | Status       | Failed At         | Detail                          |
+|------|------------------------|--------------|-------------------|---------------------------------|
+| J001 | Primary user flow      | failing-at-5 | <area-slug-1>     | Stale state after navigation    |
+| J002 | Secondary flow         | passing      | ---               | All 4 checkpoints passed        |
+
+Journey J001 checkpoint detail:
+  + Step 1: <area-slug-1> -- <checkpoint description>
+  + Step 2: <area-slug-2> -- <checkpoint description>
+  + Step 3: <area-slug-3> -- <checkpoint description>
+  + Step 4: <area-slug-4> -- <checkpoint description>
+  x Step 5: <area-slug-1> -- STALE state from step 2
+```
+
+Checkpoint detail shown for failing/flaky journeys only. Passing
+journeys show summary line only.
+
+**SIGNALS addition:**
+```
+~ 1 journey failing: J001 at step 5 (<area-slug-1>) — accumulated state
+```
+
+**N-run summary:** Add "Journeys stabilized" and "Journeys with
+persistent issues" to the N-run summary format.
+
+### `.user-test-last-run.json` Schema
+
+```json
+"journeys_run": [
+  {
+    "id": "J001",
+    "name": "Primary user flow",
+    "status": "failing-at-5",
+    "on_failure": "abort",
+    "checkpoints": [
+      { "step": 1, "area": "<area-slug-1>", "passed": true },
+      { "step": 2, "area": "<area-slug-2>", "passed": true },
+      { "step": 3, "area": "<area-slug-3>", "passed": true },
+      { "step": 4, "area": "<area-slug-4>", "passed": true },
+      { "step": 5, "area": "<area-slug-1>", "passed": false,
+        "detail": "stale state from step 2 still active" }
+    ],
+    "time_seconds": 45
+  }
+]
+```
+
+### Commit Mode Additions
+
+Journey commit mode runs after per-area commit mode (step 4 updates
+probe tables, step 8 updates queries). Journey updates are a new step:
+
+1. Update journey **Status**, **Last Run**, **Run History** in test file
+2. Auto-escalate at 3+ consecutive same-step failures (→ bugs.md as
+   multi-area bug). Suppress if failing step's area has active Known-bug.
+3. Mark `stable` at 5+ consecutive passes
+4. Detect definition changes (step count or area slug changes → reset
+   to `untested`, clear Run History)
+5. Journey results do NOT affect per-area maturity scores
+
+## Design Decisions
+
+### D1. Journeys are scenario-level, not area-level
+Lives in `## Journeys` alongside `## Cross-Area Probes` and `## Areas`.
+Not owned by any single area.
+
+### D2. Checkpoints at every step, not just the end
+A journey verifying only at the end is just a long cross-area probe.
+Checkpoints pinpoint WHERE state goes wrong.
+
+### D3. Journey failure does NOT affect per-area scores
+Journey failure = accumulated state bug. Per-area score = isolated
+area health. Mixing them makes maturity tracking unreliable.
+
+### D4. failing-at-N is more useful than failing
+An 8-step journey reporting only "failing" does not say where it broke.
+"failing-at-5" says steps 1-4 work and the bug is at the step 5 transition.
+
+### D5. Manual definition is primary
+The user knows which flows matter. Auto-generation produces suggestions
+requiring confirmation, not automatic entries.
+
+### D6. Journey steps can revisit areas
+Step 1 uses area X. Step 5 uses area X again. The value is testing
+whether the area behaves differently after intermediate steps modified
+state.
+
+### D7. Abort is the default on checkpoint failure
+Wrong state at step 3 makes step 4 unpredictable. Continue exists as
+opt-in for independent state dimensions.
+
+### D8. Step drift prevents premature escalation
+Failing at different steps = different causes = flaky, not a single
+consistent bug worth auto-filing.
+
+### D9. Inter-journey reset to entry URL
+Journeys are independent. Each starts from a clean navigation state.
+Without this, journey ordering becomes a first-class authoring concern
+and journey 2's results depend on journey 1's side effects.
+
+### D10. Known-bug areas still execute in journeys
+Journeys test accumulated state, not individual areas. A Known-bug area
+in a journey provides useful signal about multi-area impact. But
+Known-bug journey failures don't auto-escalate (bug already filed).
+
+### D11. Journey MCP calls are separate from per-area budgets
+Journeys and per-area testing serve different purposes. Sharing budgets
+would force trade-offs between journey thoroughness and per-area depth.
+
+### D12. Definition changes reset to untested
+Conservative but safe. Prevents stale `failing-at-3` from pointing at
+a step that moved or no longer exists.
+
+### D13. Continue-mode uses first failing step for status
+Multiple checkpoint failures in continue mode may be cascading. The
+first failure is the root cause signal. Run History records all failures
+for investigation.
+
+## Line Budget
+
+| File | Baseline | Delta | After | Notes |
+|------|----------|-------|-------|-------|
+| SKILL.md | 368 | +5 (pointer + execution slot) -3 (trim) | 370 | Well under 420 ceiling |
+| journeys.md | NEW | +65 | 65 | All journey behavioral detail |
+| test-file-template.md | 549 | +25 | ~574 | Journey section template + v8→v9 migration |
+| last-run-schema.md | 136 | +15 | ~151 | journeys_run schema |
+| probes.md | ~490 | +3 | ~493 | Cross-ref to journey escalation |
+| Total new content | | ~110 | | |
+
+SKILL.md stays well under ceiling. All journey behavioral detail lives
+in `references/journeys.md`. SKILL.md holds only the execution slot
+pointer and commit mode bullet.
+
+## Schema Changes
+
+### Test file: v8 -> v9
+- New `## Journeys` section (optional, may be empty)
+- Journey entry schema: ID, Name, Steps table, Status, Last Run,
+  Run History, Generated From, optional on_failure
+
+### `.user-test-last-run.json`
+- New `journeys_run` array field
+
+### Migration: v8 -> v9
+- Missing `## Journeys` = empty (no journeys defined). Do not create.
+- Additive only. v8 files work unchanged.
+- Bump `schema_version: 9` on first commit.
+- Forward compatible: v8 skill reads v9 files safely (preserves
+  unknown sections).
+
+## Acceptance Criteria
+
+### Journey Definition
+- [ ] `## Journeys` section in test file template (`test-file-template.md`)
+- [ ] Schema: ID, Name, Steps table, Status, Last Run, Run History,
+      Generated From, optional on_failure
+- [ ] Steps table columns: Step / Area / Action / Checkpoint
+- [ ] 3-8 steps per journey, max 5 journeys
+- [ ] Same area can appear multiple times in a journey
+
+### Execution
+- [ ] Run after cross-area probes, before per-area testing
+- [ ] No reset between steps within a journey
+- [ ] Inter-journey reset to app entry URL
+- [ ] Checkpoint at each step (1 MCP call via batched javascript_tool)
+- [ ] Abort on checkpoint failure (default)
+- [ ] `on_failure: continue` option (first failing step = status)
+- [ ] Proactive restart suppressed during journey execution
+- [ ] Progressive narrowing does not affect journey steps
+- [ ] Known-bug areas still execute in journey steps
+- [ ] Adversarial mode does NOT apply to journey steps
+- [ ] Journey MCP calls separate from per-area budgets
+- [ ] Execution order: failing > untested > flaky > passing > stable
+- [ ] Failing/untested journeys before stable
+
+### Status & Lifecycle
+- [ ] `failing-at-N` records which step failed
+- [ ] Step drift across runs → status becomes `flaky`
+- [ ] Escalation: same step 3+ consecutive → bugs.md (multi-area bug)
+- [ ] Escalation suppressed when failing step area has Known-bug
+- [ ] Bug entry: failing step area primary, preceding areas as context
+- [ ] `stable`: 5+ consecutive passes, every other run
+- [ ] Stable revert on failure → `failing-at-N`, counter resets
+- [ ] Definition change detection → reset to `untested`
+- [ ] Journey results do NOT affect per-area maturity scores
+
+### Report
+- [ ] JOURNEYS section after cross-area, before per-area details
+- [ ] Failing/flaky: full checkpoint detail (+ and x markers)
+- [ ] Passing: summary line only
+- [ ] SIGNALS entry for failing journeys
+- [ ] `journeys_run` in JSON with per-step checkpoint data
+- [ ] N-run summary includes journey stabilization/persistence
+
+### Generation
+- [ ] First-run prompt if no journeys defined
+- [ ] Manual primary, auto sources suggest only
+- [ ] Suggestions require user confirmation
+
+### Commit Mode
+- [ ] Status + Last Run + Run History updated
+- [ ] Auto-escalation at 3+ consecutive same-step failures
+- [ ] Stable at 5+ consecutive passes
+- [ ] Definition change detection resets status
+- [ ] `--no-commit` runs don't count toward escalation
+- [ ] Partial runs: only fully-completed journeys written
+
+### Schema & Migration
+- [ ] v8 → v9 additive migration
+- [ ] Missing `## Journeys` = empty
+- [ ] Forward compatible
+- [ ] SKILL.md stays under 420-line ceiling
+
+## Implementation Order
+
+All changes ship together as schema v9.
+
+- [x] 1. **Schema & template** — `## Journeys` section in `test-file-template.md` + v8→v9 migration notes
+- [x] 2. **Reference file** — create `references/journeys.md` (lifecycle, budget, execution rules, checkpoint types, generation, interactions)
+- [x] 3. **Last-run schema** — add `journeys_run` to `last-run-schema.md`
+- [x] 4. **SKILL.md pointer** — Phase 3 execution slot + commit mode bullet + trim
+- [x] 5. **Report** — journey results format in Phase 4 (pointer to journeys.md)
+- [x] 6. **Commit mode** — status updates, escalation, stable, definition change detection
+- [x] 7. **Version bump + install** — plugin.json 2.50.0→2.51.0, CHANGELOG, local install
+
+## Verification: Would This Have Caught Real Bugs?
+
+| Bug pattern | Without journeys | With journeys |
+|-------------|-----------------|---------------|
+| Stale state after multi-step navigation (4+ steps) | Not testable (accumulated state) | `failing-at-N`: pinpoints which step's state leaked |
+| State contamination visible only after round-trip | Cross-area probe (2 steps, one seam) | Journey revisits area after 3 intermediate steps |
+| Counter/badge wrong after add→remove→add sequence | Per-area test starts clean each time | Journey checkpoints verify at each transition |
+| Filter/search state leaking across unrelated flows | Per-area tests pass in isolation | Journey exposes that state persists across areas |
+
+## Sources
+
+- Phase 3 execution: `SKILL.md:98-152`
+- Cross-area probes: `probes.md` (lines 322-489), v2.49.0 plan
+- Proactive restart: cross-area plan D5, `connection-resilience.md`
+- Progressive narrowing: `run-targeting.md` (lines 74-107)
+- Weakness-class synthesis: compounding quality plan Change 2
+- Multi-turn sequences: `queries-and-multiturn.md`
+- Probe lifecycle: `probes.md`
+- Known-bug handling: `bugs-registry.md`
+- Schema migration pattern: `test-file-template.md` (lines 168-176)
+- Line budget learnings: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md`
diff --git a/docs/plans/2026-03-17-feat-user-test-self-eval-loop-plan.md b/docs/plans/2026-03-17-feat-user-test-self-eval-loop-plan.md
new file mode 100644
index 000000000..7b9a04279
--- /dev/null
+++ b/docs/plans/2026-03-17-feat-user-test-self-eval-loop-plan.md
@@ -0,0 +1,383 @@
+---
+title: "feat: Add self-eval loop for user-test skill"
+type: feat
+status: completed
+date: 2026-03-17
+origin: docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md
+---
+
+# feat: Add self-eval loop for user-test skill
+
+## Overview
+
+Add a `/user-test-eval` command that grades the user-test skill's output against 3 binary evals after each run. It records scores in `skill-evals.json` and proposes targeted mutations to the skill in `skill-mutations.md`. It auto-triggers after commit mode completes. Goal: fix the testing instrument (the skill itself) before optimizing what it tests (queries).
+
+## Problem Statement / Motivation
+
+The user-test skill has three known signal-corrupting failure modes:
+1. **Probe execution order** — probes run after exploration instead of before, reducing signal quality
+2. **Proven regression conflation** — new bugs in Proven areas treated identically to area demotion
+3. **P1 burial** — critical items appear in DETAILS but not NEEDS ACTION
+
+These are instrument calibration failures. Optimizing queries through a miscalibrated instrument produces noise. The eval loop catches these failures mechanically, proposes fixes, and builds a mutation history artifact.
+
+(see brainstorm: docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md)
+
+## Proposed Solution
+
+### Architecture
+
+```
+/user-test → Phase 4 → Commit Mode → Auto-trigger → /user-test-eval
+                                                          ↓
+                                          Read artifacts (JSON + report file)
+                                                          ↓
+                                          Grade 3 binary evals
+                                                          ↓
+                                          Write skill-evals.json (scores)
+                                          Write skill-mutations.md (proposals)
+```
+
+Three new components:
+1. **`/user-test-eval` command** — thin dispatch to new eval skill
+2. **`user-test-eval` skill** — grades from artifacts, proposes mutations
+3. **Report file artifact** — rendered report written to file during commit mode (new)
+
+Plus two schema changes to existing `.user-test-last-run.json`:
+- `execution_index` per `probes_run` entry
+- `broad_exploration_start_index` per area
+
+### Prerequisites: Artifact Gaps
+
+The eval cannot function without two changes to the existing skill:
+
+**1. Report file artifact (new)**
+
+Commit mode currently prints the report to stdout only. The eval needs to read the rendered report from a file. Add a step to commit mode that writes the rendered report to `tests/user-flows/.user-test-last-report.md`, overwritten each run, gitignored.
+
+**Why a separate file instead of reading conversation context:** The brainstorm established that same-context grading is the exact failure mode we've seen — structurally correct reports that technically satisfy format requirements while burying findings. Reading from an artifact forces the eval to grade what the user actually sees, without access to the reasoning that produced it.
+
+**2. Execution order metadata (schema change)**
+
+Eval 1 checks probe execution order. The current `probes_run` array in `.user-test-last-run.json` records results but not execution sequence relative to broad exploration. Add:
+- `execution_index: <integer>` to each `probes_run` entry (0-based, monotonically increasing across all areas)
+- `broad_exploration_start_index: <integer>` per area in the `areas` array
+
+Eval 1 then checks: for each area, all probe `execution_index` values < that area's `broad_exploration_start_index`. Binary, mechanical, no judgment required.
+
+## The 3 Binary Evals
+
+### Eval 1: Probe Execution Order (protocol layer)
+
+**Question:** Did all failing/untested probes execute before broad exploration in every area?
+
+**Grading method:** For each area in `areas`, check that every `probes_run` entry for that area has `execution_index < broad_exploration_start_index`. FAIL if any area violates. Report which areas violated.
+
+**Data source:** `.user-test-last-run.json` only (structural check).
+
+**Zero probes case:** If an area has no probes, it passes vacuously.
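The check can be sketched in a few lines of Python. The field names `slug` (on each `areas` entry) and `area` (on each `probes_run` entry) are assumptions about the JSON shape, since only `execution_index` and `broad_exploration_start_index` are specified here; the return value mirrors the `probe_execution_order` entry in `skill-evals.json`.

```python
def grade_probe_execution_order(last_run):
    """Grade Eval 1 from a parsed .user-test-last-run.json dict."""
    # Assumed shape: each area carries a 'slug' and its exploration start index.
    start_index = {
        area["slug"]: area["broad_exploration_start_index"]
        for area in last_run.get("areas", [])
    }
    violated = set()
    for probe in last_run.get("probes_run", []):
        area = probe["area"]  # assumed field linking a probe to its area
        # Probes for areas missing from the map are ignored in this sketch.
        if area in start_index and probe["execution_index"] >= start_index[area]:
            violated.add(area)
    # An area with no probes never enters 'violated', so it passes vacuously.
    return {"pass": not violated, "areas_violated": sorted(violated)}
```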
+
+### Eval 2: Proven Regression Distinction (reasoning layer — reformulated as structural)
+
+**Question:** When a Proven area's score dropped by 1+ points, does the report's NEEDS ACTION section contain an entry for that area?
+
+**Grading method:**
+1. From `.user-test-last-run.json`, identify areas where the test file shows `status: Proven` but the run's `ux_score` is below `pass_threshold`
+2. From `.user-test-last-report.md`, check that each such area appears in the NEEDS ACTION section as a **line item** with the `⚠` prefix and `→ Proven regression` marker (not just the area slug mentioned anywhere in the section). The required format is: `⚠ P[N]  <area-slug> ... → Proven regression`
+3. PASS if every regressed Proven area has a matching line item. FAIL if any is missing or appears without the `→ Proven regression` marker.
+
+**Why a specific marker:** Checking for slug presence alone is gameable — the area could appear as a parenthetical note rather than an action item and technically pass. The marker requirement makes the check fully mechanical: regex match for `⚠.*<area-slug>.*→ Proven regression` in the NEEDS ACTION block.
+
+**Why reformulated:** The original question ("did the report distinguish bug vs. demotion?") was subjective. This structural version tests the same thing — a Proven regression must surface as actionable, not buried in DETAILS — without requiring judgment calls about "distinguishing."
+
+**No Proven regressions case:** Automatic PASS (vacuously true). The eval records `"detail": "no Proven regressions this run"`.
+
+**Data source:** The test file (for area maturity status), `.user-test-last-run.json` (to identify regressions), and `.user-test-last-report.md` (to verify surfacing).
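+
+The marker check described above could look roughly like this in Python. The sketch assumes the NEEDS ACTION block has already been extracted from the report as a string, since how that block is delimited is not specified here; the function name is hypothetical.
+
+```python
+import re
+from typing import List
+
+
+def missing_regression_markers(needs_action_block: str, regressed_slugs: List[str]) -> List[str]:
+    """Return regressed slugs that lack a marked line item in the block."""
+    missing = []
+    for slug in regressed_slugs:
+        # re.escape keeps slugs such as "browse/product-grid" literal.
+        # "." does not match newlines, so the match stays on a single line.
+        pattern = re.compile(r"⚠.*" + re.escape(slug) + r".*→ Proven regression")
+        if not pattern.search(needs_action_block):
+            missing.append(slug)
+    return missing
+```
+
+An empty `regressed_slugs` list yields an empty result, which corresponds to the automatic PASS case. Note that a plain substring match would let `login` match a line about `login-social`; stricter boundaries around the slug may be worth adding.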
+
+### Eval 3: P1 Surfacing (presentation layer)
+
+**Question:** Did every P1 item (from `explore_next_run` where `priority: "P1"`) appear in the NEEDS ACTION section?
+
+**Grading method:**
+1. From `.user-test-last-run.json`, collect all `explore_next_run` items with `priority: "P1"`
+2. From `.user-test-last-report.md`, verify each P1 item appears in the NEEDS ACTION block (match area slug + priority marker)
+3. PASS if all P1 items are in NEEDS ACTION. FAIL with count of missing items.
+
+**Scope note:** Verification mismatches on Proven areas also belong in NEEDS ACTION (per dispatch format rules), but they flow through a different path — the `verification_results` array, not `explore_next_run`. The main skill does not consistently promote these to `explore_next_run` P1 items, so including them here would produce false positives. If verification mismatch surfacing needs eval coverage, add it as a separate Eval 4 later.
+
+**Zero P1 items case:** Automatic PASS. Eval records `"detail": "no P1 items this run"`.
+
+**Data source:** Both artifacts.
+
+## Artifact Schemas
+
+### `skill-evals.json`
+
+Location: `tests/user-flows/skill-evals.json` (project-scoped, committed to git)
+
+```json
+{
+  "eval_version": 1,
+  "entries": [
+    {
+      "run_timestamp": "2026-03-17T14:30:00Z",
+      "scenario_slug": "resale-clothing",
+      "git_sha": "abc1234",
+      "skill_version": "2.52.0",
+      "evals": {
+        "probe_execution_order": {
+          "pass": true,
+          "areas_violated": []
+        },
+        "proven_regression_distinction": {
+          "pass": false,
+          "regressed_areas": ["login"],
+          "missing_from_needs_action": ["login"],
+          "detail": "Login regressed from Proven (score 4→2) but only appeared in DETAILS"
+        },
+        "p1_surfacing": {
+          "pass": true,
+          "p1_count": 2,
+          "surfaced_count": 2
+        }
+      },
+      "overall_pass": false,
+      "mutation_proposed": true
+    }
+  ]
+}
+```
+
+- Cap: 50 entries (drop oldest)
+- `eval_version` at top level — bumped when evals change, enabling historical comparison filtering
+- Created if missing on first eval run
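+
+The create-if-missing and 50-entry cap rules could be handled along these lines. This is a sketch only: the function name and signature are hypothetical, and the write is not atomic.
+
+```python
+import json
+from pathlib import Path
+
+MAX_ENTRIES = 50  # cap stated above; the oldest entries are dropped first
+
+
+def append_eval_entry(path: Path, entry: dict, eval_version: int = 1) -> dict:
+    """Append one eval entry, creating the file on first use."""
+    if path.exists():
+        data = json.loads(path.read_text(encoding="utf-8"))
+    else:
+        data = {"eval_version": eval_version, "entries": []}
+    data["entries"].append(entry)
+    data["entries"] = data["entries"][-MAX_ENTRIES:]  # keep the newest 50
+    path.write_text(json.dumps(data, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
+    return data
+```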
+
+### `skill-mutations.md`
+
+Location: `tests/user-flows/skill-mutations.md` (project-scoped, committed to git)
+
+```markdown
+# Skill Mutations Log
+
+Proposed changes to the user-test skill based on eval failures.
+Mark status as ACCEPTED or REJECTED after review.
+
+---
+
+## Mutation 1 — 2026-03-17
+
+**Status:** PROPOSED
+**Triggered by:** Eval 2 failure (Proven regression distinction)
+**Eval scores:** probe_order: PASS | regression_distinction: FAIL | p1_surfacing: PASS
+**Skill version:** 2.52.0
+**Scenario:** resale-clothing
+
+### Problem observed
+
+Login area regressed from Proven (score 4→2) but only appeared in DETAILS section.
+The report treated it as a normal score change rather than surfacing it in NEEDS ACTION.
+
+### Proposed change
+
+**File:** `plugins/compound-engineering/skills/user-test/SKILL.md`
+**Section:** Report Output — Dispatch Format, NEEDS ACTION rules
+
+**Current:** NEEDS ACTION includes "degrading areas, failing probes on Proven areas, verification mismatches on Proven"
+**Proposed:** Add explicit rule: "Any Proven area scoring below pass_threshold MUST appear in NEEDS ACTION with '→ Proven regression' suffix, regardless of whether a bug was filed."
+
+### Outcome
+
+_Fill after next run:_ Did the change fix the eval failure? Score comparison.
+```
+
+- Each mutation is a markdown section with clear status
+- Status values: `PROPOSED` | `ACCEPTED` | `REJECTED`
+- One mutation per failing eval — all failures get proposals in a single run
+- Human reviewer decides which to accept (can accept all, some, or none)
+- Proposals are numbered sequentially across all eval runs (Mutation 1, 2, 3...)
+
+### `.user-test-last-report.md` (new artifact)
+
+Location: `tests/user-flows/.user-test-last-report.md` (gitignored, ephemeral)
+
+Written during commit mode, after the report is displayed. Contains the exact rendered report text. Overwritten each run.
+
+## Implementation Plan
+
+### Phase 1: Prerequisites (changes to existing skill)
+
+#### 1a. Add report file output
+
+**File:** `plugins/compound-engineering/skills/user-test/SKILL.md`
+**Location:** After "Share Report (Optional)" section, before "Auto-Commit"
+**Change:** Add step: "Write the rendered report to `tests/user-flows/.user-test-last-report.md`"
+
+**File:** `plugins/compound-engineering/skills/user-test/SKILL.md`
+**Location:** Phase 0, step for `.gitignore` coverage
+**Change:** Add `.user-test-last-report.md` to the gitignore check alongside `.user-test-last-run.json`
+
+#### 1b. Add execution order metadata
+
+**File:** `plugins/compound-engineering/skills/user-test/references/last-run-schema.md`
+**Change:** Add `execution_index` to `probes_run` entries, add `broad_exploration_start_index` to per-area fields
+
+**File:** `plugins/compound-engineering/skills/user-test/SKILL.md`
+**Location:** Phase 3, probe execution section
+**Change:** Instruct agent to track execution index (monotonically increasing counter across all MCP calls/actions) and record `broad_exploration_start_index` when transitioning from probe execution to broad exploration per area
+
+**Schema version:** Bump to v10. Add a v9 migration rule: runs that lack `execution_index` or `broad_exploration_start_index` carry no ordering data, so the eval skips Eval 1 for them instead of failing it.
+
+### Phase 2: New skill and command
+
+#### 2a. Create eval skill
+
+**New file:** `plugins/compound-engineering/skills/user-test-eval/SKILL.md`
+
+Contents:
+- Frontmatter: `name: user-test-eval`, description, `disable-model-invocation: true`
+- **Artifact-only grading rule:** "Grade from file artifacts only. Do not reference test execution context, Phase 3 observations, or any other conversation content. The eval's integrity depends on grading what the user sees (the report file), not what the agent knows."
+- Load phase: Read `.user-test-last-run.json` and `.user-test-last-report.md`. Abort if either is missing or if `completed: false`. Warn if `run_timestamp` is more than 24h old.
+- Read test file to get area maturity statuses (needed for Eval 2).
+- Run 3 evals in order. Record pass/fail + detail for each.
+- If any eval fails: propose one mutation per failing eval. Write all to `skill-mutations.md`.
+- Append entry to `skill-evals.json`. Create file if missing.
+- Display summary: `EVAL: 2/3 pass | probe_order: PASS | regression: FAIL | p1_surfacing: PASS`
+- If mutation proposed, display the proposed change inline.
+
+#### 2b. Create eval command
+
+**New file:** `plugins/compound-engineering/commands/user-test-eval.md`
+
+```yaml
+---
+name: user-test-eval
+description: Grade user-test skill output against binary evals
+disable-model-invocation: true
+allowed-tools: Skill(user-test-eval)
+---
+
+Invoke the user-test-eval skill for the last completed run.
+```
+
+#### 2c. Add auto-trigger to commit mode
+
+**File:** `plugins/compound-engineering/skills/user-test/SKILL.md`
+**Location:** End of Commit Mode section, after step 8c
+**Change:** Add:
+
+```
+### Auto-Eval
+
+After all commit steps complete, automatically invoke `/user-test-eval` to grade
+this session's output. The eval reads from file artifacts — it does not use
+conversation context from this session.
+
+**Skip conditions:** `--no-eval` flag, or if commit was partial/aborted.
+**Error handling:** If eval fails, the commit is already complete and preserved.
+Display "Eval failed: <reason>. Run `/user-test-eval` manually to retry."
+```
+
+Also add auto-trigger after `/user-test-commit` standalone (same artifacts, same trigger).
+
+### Phase 3: Versioning and metadata
+
+- Bump plugin version to 2.52.0 in `.claude-plugin/plugin.json`
+- Update `marketplace.json` description with new skill count
+- Update `README.md` — add user-test-eval to skills list
+- Update `CHANGELOG.md` with the addition
+- Schema version bump to v10 in test-file-template.md
+
+## Technical Considerations
+
+### Same-conversation limitation
+
+The auto-trigger runs eval in the same conversation as the test. The eval skill instructions say "grade from artifacts only," but the model still has conversation context. This is acknowledged as aspirational, not enforced.
+
+All three evals are designed to be mechanically checkable from artifacts: Eval 1 is pure index comparison, Eval 2 is a regex match for a specific marker format (`⚠.*<slug>.*→ Proven regression`), and Eval 3 is slug+priority matching in a section block. No eval requires subjective judgment, which leaves little room for self-bias even when the grader shares conversation context.
+
+If gaming becomes observable (evals consistently pass but failures still occur in practice), the mitigation is to switch to manual-only invocation (`--no-eval` by default, explicit `/user-test-eval` in a new session).
+
+### Iterate mode
+
+Eval runs once after the final commit, not per-iteration. Grades the aggregate report. Eval 1 checks probe execution order for the first run only (subsequent runs use progressive narrowing where ordering constraints are relaxed).
+
+### Partial runs
+
+If `completed: false` in `.user-test-last-run.json`, eval aborts. Same guard as commit mode.
+
+### Artifact overwrite risk
+
+`.user-test-last-run.json` and `.user-test-last-report.md` are overwritten each run. If the user runs `/user-test` again before running standalone eval, the previous artifacts are gone. The auto-trigger avoids this (eval runs immediately after commit).
+
+**Manual eval guard:** Before grading, check if `run_timestamp` in the artifact matches the `run_timestamp` of the last entry in `skill-evals.json`. If they match, this run was already evaluated — warn "This run was already evaluated. Run again? (y/n)". Also warn if `run_timestamp` > 24h old (matching commit mode's staleness check).
+
+### Concurrent writes
+
+Not supported. `skill-evals.json` writes are not atomic. Concurrent eval invocations (e.g., two terminals) could corrupt the file. Low risk for a single-user CLI tool.
+
+### Eval evolution
+
+`eval_version` in `skill-evals.json` enables filtering when comparing historical scores. When adding a 4th eval, bump `eval_version` to 2. Entries with version 1 have 3 evals; version 2 has 4. Comparison tools should filter by version.
+
+### Graduation trigger
+
+When evals pass for 5 consecutive runs, the eval should note: "All evals passing consistently (runs from <first date> to <last date>). Consider adding a 4th eval or shifting to query-level optimization." Surface the date range alongside the count so the span is visible.
+
+**Gap reset:** If the gap between any two consecutive passing runs exceeds 14 days, reset the consecutive count. A run after a 3-week hiatus isn't comparable to daily sprint runs — the skill may have changed, the app may have changed, and the consecutive count would be misleading.
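+
+One way to compute the streak with the gap reset, sketched in Python. It assumes `entries` is ordered oldest to newest and uses the `run_timestamp` and `overall_pass` fields from the `skill-evals.json` schema; the function name is hypothetical.
+
+```python
+from datetime import datetime, timedelta
+
+
+def consecutive_pass_streak(entries, max_gap_days=14):
+    """Return timestamps of the trailing passing streak, newest first."""
+    streak = []
+    for entry in reversed(entries):  # walk back from the newest run
+        if not entry["overall_pass"]:
+            break  # a failing run ends the streak
+        # Older Python versions do not parse a trailing "Z", so normalize it.
+        ts = datetime.fromisoformat(entry["run_timestamp"].replace("Z", "+00:00"))
+        if streak and streak[-1] - ts > timedelta(days=max_gap_days):
+            break  # gap too long: runs before it do not count
+        streak.append(ts)
+    return streak
+```
+
+The graduation note would fire when the returned list has 5 or more items; its last and first elements give the date range to display.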
+
+## Acceptance Criteria
+
+- [x] `/user-test-eval` command exists and invokes the eval skill
+- [x] Eval reads `.user-test-last-run.json` and `.user-test-last-report.md` (not conversation context)
+- [x] 3 binary evals implemented: probe execution order, Proven regression distinction, P1 surfacing
+- [x] Scores written to `tests/user-flows/skill-evals.json` with defined schema
+- [x] Mutation proposals written to `tests/user-flows/skill-mutations.md` when evals fail
+- [x] Prompts user to run `/user-test-eval` after commit mode (both auto-commit and standalone `/user-test-commit`)
+- [x] `--no-eval` flag skips the auto-trigger
+- [x] `.user-test-last-report.md` written during commit mode, gitignored
+- [x] `execution_index` and `broad_exploration_start_index` added to last-run JSON schema
+- [x] Manual eval warns if run_timestamp matches last skill-evals.json entry (already evaluated)
+- [x] Graduation consecutive count resets if gap between runs exceeds 14 days
+- [x] Schema bumped to v10 with v9 migration rule
+- [x] Plugin version bumped to 2.52.0
+- [x] CHANGELOG, README, plugin.json, marketplace.json updated
+
+## Scope Boundaries
+
+**In scope:**
+- `/user-test-eval` skill + command
+- 3 binary evals (mechanical, artifact-based)
+- `skill-evals.json` + `skill-mutations.md` artifacts
+- Report file artifact (`.user-test-last-report.md`)
+- Execution order metadata in last-run JSON
+- Auto-trigger from commit mode
+- Schema v10
+
+**Out of scope:**
+- Autonomous mutation application (human review required)
+- Query-level optimization (comes after skill evals are stable)
+- More than 3 evals (expand after 5 consecutive passes)
+- Cross-model evaluation (same model, different context)
+- Mutation revert mechanism (use `git revert`)
+- Extract mutation format template and JSON schema to `references/` (v2.53.0 consideration — eval skill is 184 lines, extraction warranted when approaching 300+ or when references would be reused across skills)
+
+## Dependencies & Risks
+
+**Dependencies:**
+- Existing user-test skill and commit mode must be stable
+- Schema v9 must be current (it is as of v2.51.0)
+
+**Risks:**
+- Self-evaluation bias on Eval 2 (mitigated by structural reformulation)
+- Auto-trigger adds latency to every test session (~10-30s)
+- Mutation proposals may be low quality initially (mitigated by human review gate)
+
+## Sources & References
+
+- **Origin brainstorm:** [docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md](../brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md) — Key decisions: separate eval command (not Phase 5), artifact-based grading, `skill-mutations.md` for proposals, 3 binary evals targeting protocol/reasoning/presentation layers
+- **Existing skill:** `plugins/compound-engineering/skills/user-test/SKILL.md` — Phase 4 report format, commit mode steps
+- **Last-run schema:** `plugins/compound-engineering/skills/user-test/references/last-run-schema.md`
+- **Learnings:** Agent-guided state transitions (docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md) — don't hardcode state transitions, use scoring rubrics
+- **Learnings:** Monolith-to-skill split anti-patterns (docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md) — enforce size budgets deterministically, don't duplicate validation
+- **Probe lifecycle plan:** docs/plans/2026-02-28-feat-user-test-compounding-probe-system-plan.md — binary verification separate from numeric scoring
+- **Report dispatch format:** docs/plans/2026-03-01-refactor-user-test-report-dispatch-format-plan.md — NEEDS ACTION section rules
diff --git a/docs/plans/2026-03-18-feat-tiered-proven-budget-probe-confirmation-plan.md b/docs/plans/2026-03-18-feat-tiered-proven-budget-probe-confirmation-plan.md
new file mode 100644
index 000000000..537847d7b
--- /dev/null
+++ b/docs/plans/2026-03-18-feat-tiered-proven-budget-probe-confirmation-plan.md
@@ -0,0 +1,205 @@
+---
+title: "feat: Tiered Proven Budget + Probe Confirmation Note"
+type: feat
+status: completed
+date: 2026-03-18
+amends: docs/plans/2026-03-03-feat-audit-response-skill-level-amendments-plan.md
+---
+
+# feat: Tiered Proven Budget + Probe Confirmation Note
+
+## Overview
+
+Implement two audit findings from run 12 as lightweight skill amendments:
+
+1. **A1: Tiered Proven Budget** -- Scale browser MCP budget by consecutive pass count (3/2/1 calls) instead of flat 3 for all Proven areas.
+2. **A2: Probe Confirmation Note** -- Require 2 consecutive passes for non-deterministic probes before treating them as genuinely passing.
+
+Both are behavioral guidance changes (+13 lines across 3 reference files; see Line Budget), not schema or machinery changes.
+
+**Relationship to existing plan:** The full audit plan (`2026-03-03-feat-audit-response-skill-level-amendments-plan.md`) covers A1-A5 targeting schema v10 / v2.52.0. This plan implements a **lightweight subset** of A1 and A2 only, deferring the schema-level changes (determinism field, register variation, scroll verification) to the full plan.
+
+## Problem Statement / Motivation
+
+**A1:** All Proven areas get 3 browser MCP calls regardless of stability. An area at 15 consecutive passes gets the same budget as one at 3. For mature test files, the majority of MCP calls confirm things that haven't changed in months.
+
+**A2:** When a probe testing LLM-dependent behavior flips from failing to passing, 1 pass is indistinguishable from model variance. The operator handles this by judgment, but the skill should say so explicitly.
+
+## Proposed Solution
+
+### A1: Tiered Budget Table
+
+Add to `run-targeting.md` after the existing Proven area budget rule:
+
+```markdown
+### Proven Area Budget by Stability
+
+| Consecutive Passes | Browser MCP Budget |
+|---|---|
+| 2-5 | 3 calls |
+| 6-9 | 2 calls |
+| 10+ | 1 call |
+
+Failing/untested probes remain uncapped at all tiers. The tier only
+constrains passing probe spot-checks and exploration calls.
+
+Tier follows the area's consecutive pass count in the Areas table.
+The tier only resets when the consecutive pass count resets, which
+occurs on demotion from Proven. If the area stays Proven despite a
+soft score (agent judgment: cosmetic issue), the tier stays too.
+
+Stable queries (CLI-only) do not count against the browser budget.
+Journey steps and cross-area probes are separate from per-area budgets.
+
+Freed calls redistribute to novelty budget and areas with active
+variance. Report in SIGNALS: "+ N calls freed from ultra-stable
+areas."
+```
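+
+For illustration, the tier lookup and the freed-call count for the SIGNALS line can be written as a small Python sketch. The function names are hypothetical, and treating counts below 2 as the 3-call tier is an assumption, since the table starts at 2 passes.
+
+```python
+BASE_BUDGET = 3  # the flat budget the tiers replace
+
+
+def proven_browser_budget(consecutive_passes: int) -> int:
+    """Map an area's consecutive pass count to its browser MCP budget."""
+    if consecutive_passes >= 10:
+        return 1
+    if consecutive_passes >= 6:
+        return 2
+    return 3  # 2-5 passes; lower counts fall back here by assumption
+
+
+def freed_calls(pass_counts) -> int:
+    """Total calls freed relative to the flat 3-call budget."""
+    return sum(BASE_BUDGET - proven_browser_budget(n) for n in pass_counts)
+```
+
+For three Proven areas at 15, 7, and 3 consecutive passes, `freed_calls([15, 7, 3])` gives 3 calls to redistribute.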
+
+### A1: SKILL.md Reword
+
+Replace the Proven area budget line in Phase 3 Area Selection Priority:
+
+**Current (~line 104):**
+```
+Proven areas at score 5 get max 3 MCP calls regardless of run focus.
+```
+
+**New:**
+```
+Proven areas: spot-check scaled by stability (see run-targeting.md for tiered budget), plus any failing/untested probes.
+```
+
+### A1: Cross-File Reference Updates
+
+Update all 12 hardcoded "3 MCP" references across 4 files to point to the tiered system. See the full reference index in the parent plan (lines 80-96).
+
+Key files requiring updates beyond run-targeting.md and SKILL.md:
+
+| File | Lines | Change |
+|------|-------|--------|
+| `probes.md` | 23 | `3-call MCP budget` -> `tiered MCP budget` |
+| `queries-and-multiturn.md` | 51, 55-59, 156, 166, 253, 299 | 6 references to `3-call cap` -> tiered references |
+| `SKILL.md` | 104, 126, 145 | 3 references -> tiered pointers |
+
+### A1: Report Display
+
+Per-area assessment includes tier context:
+
+```
+browse/product-grid  Proven (15 passes, 1-call budget)  UX 5  2s
+```
+
+### A2: Probe Confirmation Note
+
+Add to `probes.md` after the Status Definitions section (~line 191):
+
+```markdown
+### Non-Deterministic Probe Confirmation
+
+When a probe testing LLM-dependent behavior (agent reasoning,
+scored_output quality, search ranking) flips from failing to passing,
+treat the first pass as unconfirmed. Note "passing (unconfirmed)" in
+the report. Require a 2nd consecutive pass before updating probe
+status to passing in the test file during commit. If the next run
+fails, revert to failing -- the first pass was variance.
+```
+
+### A2: Report Display
+
+Unconfirmed probes display with an asterisk:
+
+```
+Probe Results:
+| Area | Query | Status | Detail |
+|------|-------|--------|--------|
+| agent/search-query | "boots under $50" | passing* | First pass after 8 fails -- needs confirmation |
+```
+
+## Technical Considerations
+
+### Gap 1: Cross-File Consistency (Critical)
+
+The original feature spec proposed updating only run-targeting.md and SKILL.md. However, `queries-and-multiturn.md` contains 6 references to "3-call cap" including a **worked example** (lines 55-59) that the agent treats as canonical. If these aren't updated, the agent will follow the concrete example over the abstract tiered rule.
+
+**Resolution:** Update all 12 references. The worked example at `queries-and-multiturn.md:55-59` must be updated to show tier-aware budgeting.
+
+### Gap 2: Novelty Budget at Reduced Tiers
+
+The novelty budget is currently defined as "exactly 1 MCP call (30% of 3 calls)" for Proven areas. At the 2-call tier, 30% is 0.6 calls; at the 1-call tier, it is 0.3 calls. Neither is a whole call, so the existing rule no longer yields a usable allocation.
+
+**Resolution:** Add a note to run-targeting.md: "Novelty allocation within the tiered budget is at agent discretion. At the 1-call tier, the single call may be used for probe spot-check OR novelty -- the mandatory novelty probe rule is waived when the budget is 1 call."
+
+### Gap 3: A2 Determinism Identification
+
+The +3 lines of guidance rely on agent judgment to identify which probes are non-deterministic. The full audit plan proposes a `deterministic`/`non-deterministic` field per probe with defaults by trigger type.
+
+**Resolution for this plan:** Keep it lightweight. The agent already knows which probes target LLM-dependent behavior from the area's `scored_output` flag and probe generation context. Explicit classification deferred to the full plan's schema v10.
+
+### Gap 4: Failure Reset Semantics
+
+"Failure resets to 3-call tier" means area-level demotion from Proven, NOT individual probe failure. Probe failures are independent signals and do not affect the tier. The tier follows the consecutive pass count in the Areas table -- if the area stays Proven despite a soft score (agent judgment: cosmetic issue, not functional regression), the tier stays too. The tier only resets when the consecutive pass count resets to 0, which occurs on demotion.
+
+### Gap 5: Progressive Narrowing Interaction
+
+- **SKIP areas**: No browser calls -- tier is irrelevant (CLI queries still run)
+- **PROBES-ONLY areas**: 1 exploration call + all probes -- tier budget does not apply (probes are uncapped)
+- **FULL areas**: Tiered budget applies normally
+
+The tier only governs the budget for Proven areas in FULL classification.
+
+### Gap 6: Probe Graduation Interaction
+
+For non-deterministic probes: the `passing*` (unconfirmed) pass does NOT count toward the 2-consecutive-pass graduation requirement. Graduation requires 2 confirmed passes (minimum 3 total passes for non-deterministic probes: 1 unconfirmed + 2 confirmed).
+
+The unconfirmed pass rule applies only to probes transitioning from `failing` or `flaky` to `passing`. Probes transitioning from `untested` to `passing` follow the standard 1-pass threshold -- they have no failure history to create variance concern.
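+
+The transition rules above can be summarized as a small Python sketch. The function name is hypothetical, and mapping every failed run to `failing` is a simplification: only the revert from `passing*` is specified here.
+
+```python
+UNCONFIRMED = "passing*"  # report-only marker for a first pass after failures
+
+
+def next_status(prev: str, passed: bool, non_deterministic: bool) -> str:
+    """Return the probe status to report after one run."""
+    if not passed:
+        # A fail after an unconfirmed pass reverts to failing. How other
+        # statuses react to a fail is outside this sketch.
+        return "failing"
+    if non_deterministic and prev in ("failing", "flaky"):
+        return UNCONFIRMED  # first pass after failure history: hold it
+    # Covers untested -> passing, passing* -> passing, and deterministic probes.
+    return "passing"
+```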
+
+## Acceptance Criteria
+
+### A1: Tiered Proven Budget
+
+- [x] Tiered budget table added to `run-targeting.md` (+8 lines)
+- [x] 2-5 passes: 3 calls; 6-9 passes: 2 calls; 10+ passes: 1 call
+- [x] Failure resets consecutive passes to 0 (returns to 3-call tier)
+- [x] Failing/untested probes uncapped at all tiers
+- [x] Freed calls redistribute to novelty and active areas
+- [x] Tier shown in per-area report line: `Proven (N passes, M-call budget)`
+- [x] All 12 cross-file "3 MCP" references updated to tiered system
+- [x] Worked example in `queries-and-multiturn.md:55-59` updated
+- [x] SKILL.md reworded (net 0 lines)
+- [x] Novelty budget note for 1-call tier added
+
+### A2: Probe Confirmation Note
+
+- [x] 3-line confirmation note added to `probes.md` after Status Definitions
+- [x] Unconfirmed probes display as `passing*` in report
+- [x] Commit mode holds unconfirmed probes -- doesn't write `passing` to test file until 2nd consecutive pass
+- [x] Fail after unconfirmed pass reverts to `failing`
+- [x] Graduation clock starts at confirmed `passing`, not `passing*`
+
+## Line Budget
+
+| File | Change | Delta |
+|------|--------|-------|
+| `run-targeting.md` | Tiered budget table + rules + novelty note | +10 |
+| `probes.md` | Non-deterministic confirmation note | +3 |
+| `queries-and-multiturn.md` | Update 6 references + worked example | ~0 (rewording) |
+| `SKILL.md` | Reword 3 Proven budget references | 0 |
+| **Total new lines** | | **+13** |
+
+## Dependencies & Risks
+
+**Dependencies:**
+- Consecutive pass count already tracked in test file Areas table -- no new tracking needed
+- `scored_output` flag already exists per area -- used to identify LLM-dependent probes
+
+**Risks:**
+- **Low:** Existing high-pass-count areas immediately get reduced budget. Mitigated by: the areas are genuinely stable (that's what 10+ passes means).
+- **Low:** Agent judgment for non-deterministic probe identification may be inconsistent. Mitigated by: deferred to full plan's explicit classification field.
+
+## Sources & References
+
+- **Parent plan:** [docs/plans/2026-03-03-feat-audit-response-skill-level-amendments-plan.md](docs/plans/2026-03-03-feat-audit-response-skill-level-amendments-plan.md) -- A1-A5 full audit response
+- **Iterate efficiency (completed):** [docs/plans/2026-03-01-perf-iterate-efficiency-progressive-narrowing-plan.md](docs/plans/2026-03-01-perf-iterate-efficiency-progressive-narrowing-plan.md)
+- **Probe lifecycle (completed):** [docs/plans/2026-03-01-feat-probe-lifecycle-research-quality-plan.md](docs/plans/2026-03-01-feat-probe-lifecycle-research-quality-plan.md)
+- **Cross-file reference index:** Parent plan lines 80-96
diff --git a/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md b/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md
new file mode 100644
index 000000000..daee34a60
--- /dev/null
+++ b/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md
@@ -0,0 +1,63 @@
+---
+title: 'Agent-Guided State Transitions and MCP Resilience Patterns for Browser Skills'
+date: 2026-02-26
+tags: [claude-code, claude-in-chrome, mcp, skill-architecture, browser-testing]
+category: architecture
+module: plugins/compound-engineering/skills/user-test/SKILL.md
+source: deepen-plan
+convergence_count: 5
+plan: .deepen-2026-02-26-feat-user-test-browser-testing-skill-plan/original_plan.md
+---
+
+# Agent-Guided State Transitions and MCP Resilience Patterns for Browser Skills
+
+## Problem
+
+When designing skills that track state across runs (maturity models, progression systems) and depend on external MCP tools (browser automation, API connectors), two failure modes recur: hardcoded state transition rules that override agent judgment, and generic retry logic that gives users no actionable recovery path when MCP connections fail.
+
+## Key Findings
+
+### Hardcoded state rules violate agent-native principles (5 agents converged)
+
+Encoding rigid rules like "3 consecutive passes = Promoted" and "any failure = reset to Uncharted" puts business logic in the skill that should be prompt-driven. A minor cosmetic issue in a well-tested area does not warrant full demotion. The fix: provide a scoring rubric with calibration anchors (concrete examples for each score level) and maturity guidance (not rigid counters), then let the agent exercise judgment on promotions and demotions based on severity and context.
+
+### Extract content to references/ from day one (4 agents converged)
+
+Skills approaching the 500-line recommended limit should proactively extract templates, framework-specific patterns, and mode-specific documentation into references/ subdirectories before the first version ships. Retrofitting extraction after the skill is in use creates migration risk. Plan the directory structure at design time: SKILL.md holds execution logic (~300 lines), references/ holds reusable content (templates, patterns, mode details).
+
+### MCP disconnect recovery needs specific, not generic, guidance (4 agents converged)
+
+Chrome extension service workers go idle during extended sessions, breaking MCP connections. A generic "retry once" pattern gives users no path forward when the retry also fails. The fix: provide the specific recovery command ("/chrome Reconnect"), add backoff delay (2-3 seconds) before retry, and track cumulative disconnects to fail fast (abort after 3) rather than burning tokens on repeated failures.
+
+## Reusable Pattern
+
+For skills with state tracking: define scoring calibration anchors (what each numeric score means concretely), provide maturity guidance as a rubric, and let agents exercise judgment -- never hardcode state transition counters. For MCP-dependent skills: implement three-tier resilience (preflight availability check, mid-run retry with specific recovery instructions, graceful degradation for non-critical tool failures).
+
+## Code Example
+
+```markdown
+## Maturity Guidance (agent-guided, not hardcoded)
+| Score | Meaning               | Example                              |
+|-------|-----------------------|--------------------------------------|
+| 1     | Broken                | Button unresponsive, page crashes    |
+| 2     | Major friction        | 3+ confusing steps, error messages   |
+| 3     | Minor friction        | Small UX issues, unclear labels      |
+| 4     | Smooth                | Clear flow, no confusion             |
+| 5     | Delightful            | Exceeds expectations                 |
+
+Promote to Proven: 2+ consecutive runs with no significant issues (use judgment)
+Demote: Functional regression, not cosmetic issues
+```
+
+```markdown
+## MCP Disconnect Recovery (three-tier)
+1. Preflight: verify tool availability, instruct `/chrome` if missing
+2. Mid-run: wait 3s, retry once, then: "Run /chrome > Reconnect extension"
+3. Cumulative: abort after 3 disconnects with clear extension stability message
+```
+
+## References
+
+- plugins/compound-engineering/skills/agent-native-architecture/SKILL.md (Granularity principle: agent judgment over hardcoded logic)
+- docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md (size budget enforcement pattern)
+- https://code.claude.com/docs/en/chrome (extension disconnect behavior and recovery)
diff --git a/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md b/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md
new file mode 100644
index 000000000..ed718b87e
--- /dev/null
+++ b/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md
@@ -0,0 +1,61 @@
+---
+title: 'Monolith-to-Skill Split: Enforcement, Drift, and Shadowing Anti-Patterns'
+date: 2026-02-26
+tags: [claude-code, markdown-commands, skill-md-framework, bash, node-js]
+category: architecture
+module: commands/deepen-plan.md
+source: deepen-plan
+convergence_count: 4
+plan: .deepen-sorted-wandering-parnas/original_plan.md
+---
+
+# Monolith-to-Skill Split: Enforcement, Drift, and Shadowing Anti-Patterns
+
+## Problem
+
+When splitting a large command file into a thin wrapper + SKILL.md + reference doc, three failure modes recur: size budgets creep back without enforcement, validation logic duplicated across files drifts out of sync, and stale copies of the original monolith silently shadow the new skill.
+
+## Key Findings
+
+### Size budgets require deterministic enforcement, not prose (3 agents converged)
+
+Stating "max 1,200 lines" in a plan is a policy wish. Without a gate that fails the pipeline, the file will grow past the budget through iterative additions -- exactly how the original monolith grew from 400 to 1,452 lines. Embed a line-count check as a validation step that runs every time the pipeline executes.
+
+### Legacy monolith shadowing during migration (4 agents converged)
+
+Claude Code resolves skills by precedence: enterprise > personal > project, with plugins namespaced. A stale 1,452-line file at `~/.claude/commands/` or `~/.claude/skills/` silently shadows the new plugin skill. Detection must be automated, check all three resolution paths, and use line count (>100) as the heuristic -- not file existence alone.
+
+### Dual validation paths will drift (3 agents converged)
+
+When validation logic appears both inline in SKILL.md and in a reference doc, the two copies inevitably diverge. The fix: pick one canonical location per validation type. Parent-critical checks (judge output schema) stay inline. Pipeline-internal checks (preservation, artifact structure) live in the reference doc only.
+
+## Reusable Pattern
+
+For any command split: (1) add a deterministic size gate that fails loudly, (2) automate legacy detection across all skill resolution paths before first run, (3) assign each validation check exactly one canonical home -- never duplicate across files.
+
+## Code Example
+
+```bash
+# Size budget enforcement (add to pipeline validation step)
+ARCH_LINES=$(wc -l < "$DEEPEN_DIR/ARCHITECTURE.md")
+if [ "$ARCH_LINES" -gt 1200 ]; then
+  echo "FAIL: ARCHITECTURE.md is $ARCH_LINES lines (max 1200)"
+  exit 1
+fi
+
+# Legacy shadowing detection (cross-platform)
+for dir in "$HOME/.claude/commands" "$HOME/.claude/skills"; do
+  TARGET="$dir/deepen-plan.md"
+  [ -d "$dir/deepen-plan" ] && TARGET="$dir/deepen-plan/SKILL.md"
+  if [ -f "$TARGET" ]; then
+    LINES=$(grep -c '' "$TARGET" 2>/dev/null) || LINES=0
+    [ "$LINES" -gt 100 ] && echo "WARN: Legacy at $TARGET ($LINES lines)"
+  fi
+done
+```
+
+## References
+
+- agent-native-architecture/references/agent-execution-patterns.md (deterministic checks over heuristic detection)
+- agent-native-architecture/SKILL.md (anti-pattern: two ways to accomplish same outcome)
+- https://code.claude.com/docs/en/skills (skill resolution precedence)
diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md
index 8bab08f5c..22ee2db48 100644
--- a/plugins/compound-engineering/README.md
+++ b/plugins/compound-engineering/README.md
@@ -156,6 +156,8 @@ Core workflow commands use `ce:` prefix to unambiguously identify them as compou
 | Skill | Description |
 |-------|-------------|
 | `agent-browser` | CLI-based browser automation using Vercel's agent-browser |
+| `user-test` | Exploratory browser testing via claude-in-chrome with quality scoring and compounding test files |
+| `user-test-eval` | Grade user-test skill output against binary evals and propose targeted mutations |
 
 ### Beta Skills
 
diff --git a/plugins/compound-engineering/skills/user-test-commit/SKILL.md b/plugins/compound-engineering/skills/user-test-commit/SKILL.md
new file mode 100644
index 000000000..08daee22a
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test-commit/SKILL.md
@@ -0,0 +1,8 @@
+---
+name: user-test-commit
+description: Commit user-test results — update test file maturity map, file issues, append history. Use after a --no-commit run or to retry a failed commit.
+disable-model-invocation: true
+allowed-tools: Skill(user-test)
+---
+
+Invoke the user-test skill in commit mode for the last completed run.
diff --git a/plugins/compound-engineering/skills/user-test-eval/SKILL.md b/plugins/compound-engineering/skills/user-test-eval/SKILL.md
new file mode 100644
index 000000000..8552ba859
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test-eval/SKILL.md
@@ -0,0 +1,187 @@
+---
+name: user-test-eval
+description: Grade user-test skill output against binary evals and propose mutations. Use after a user-test run completes to check probe ordering, regression surfacing, and P1 presentation.
+disable-model-invocation: true
+---
+
+# User Test Eval
+
+Grade the user-test skill's output against 3 binary evals. Read from file
+artifacts only. Propose targeted mutations when evals fail.
+
+**Artifact-only grading rule:** Grade from file artifacts only. Do not reference
+test execution context, Phase 3 observations, or any other conversation content.
+The eval's integrity depends on grading what the user sees (the report file),
+not what the agent knows.
+
+## Phase 1: Load Artifacts
+
+1. **Locate test directory:** Find `tests/user-flows/` in the project.
+2. **Read `.user-test-last-run.json`:**
+   - Missing: abort with "No run results found. Run `/user-test` first."
+   - `completed: false`: abort with "Last run was incomplete. Run `/user-test` again."
+3. **Read `.user-test-last-report.md`:**
+   - Missing: abort with "No report artifact found. The skill version may predate report persistence — run `/user-test` again with the latest skill."
+4. **Staleness check:** If `run_timestamp` is more than 24 hours old, use `AskUserQuestion` (if available, otherwise present as numbered options): "Run results are from <timestamp>. Evaluate anyway?" with options Yes / No. Abort on No.
+5. **Already-evaluated check:** Read `skill-evals.json` if it exists. If the last entry's `run_timestamp` matches the artifact's `run_timestamp`, use `AskUserQuestion`: "This run was already evaluated. Run again?" with options Yes / No. Abort on No.
+6. **Read the test file** (`tests/user-flows/<scenario_slug>.md`) to get area maturity statuses and `pass_threshold` values. Default `pass_threshold` is 4 if not specified.
+
+## Phase 2: Run Evals
+
+Run all 3 evals in order. Record pass/fail + detail for each.
+
+### Eval 1: Probe Execution Order (protocol layer)
+
+**Question:** Did all failing/untested probes execute before broad exploration in every area?
+
+**Method:**
+1. For each area in `areas` array, read `broad_exploration_start_index`
+2. Collect all `probes_run` entries for that area, read their `execution_index`
+3. Check: every probe's `execution_index` < area's `broad_exploration_start_index`
+4. **PASS** if all areas satisfy the constraint. **FAIL** if any area violates — list violated areas.
+
+**Edge cases:**
+- Area has no probes: PASS (vacuously true)
+- Missing `execution_index` or `broad_exploration_start_index` (v9 data): SKIP with detail "execution order data not available (pre-v10 run)"
+- Skipped areas (`skip_reason` present): exclude from check
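
The ordering check above can be sketched as a small function. This is a minimal illustration, not the skill's implementation: it assumes each area object carries its own `probes_run` list and a `slug` key (both assumptions about the JSON layout), and it treats any missing ordering field as a reason to SKIP the whole eval.

```python
from typing import Any, Dict, List


def eval_probe_order(areas: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Check that every probe ran before broad exploration began in its area."""
    violated: List[str] = []
    for area in areas:
        if area.get("skip_reason"):
            continue  # skipped areas are excluded from the check
        probes = area.get("probes_run") or []
        if not probes:
            continue  # no probes: the constraint holds vacuously
        start = area.get("broad_exploration_start_index")
        indices = [probe.get("execution_index") for probe in probes]
        if start is None or any(index is None for index in indices):
            # Ordering fields are absent, so the run predates v10 data.
            return {
                "status": "SKIP",
                "areas_violated": [],
                "detail": "execution order data not available (pre-v10 run)",
            }
        if any(index >= start for index in indices):
            violated.append(area.get("slug", "<unknown>"))
    if violated:
        return {
            "status": "FAIL",
            "areas_violated": violated,
            "detail": "probes ran after broad exploration started",
        }
    return {"status": "PASS", "areas_violated": [], "detail": "all areas ordered"}
```

Returning a three-valued `status` rather than a bare boolean keeps SKIP distinguishable from PASS in the summary line.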
+
+### Eval 2: Proven Regression Distinction (presentation layer)
+
+**Question:** When a Proven area's score drops below its `pass_threshold`, does the report's NEEDS ACTION section contain a properly formatted entry?
+
+**Method:**
+1. From the test file, identify areas with `Status: Proven`
+2. From `.user-test-last-run.json`, check each Proven area's `ux_score` against its `pass_threshold`
+3. For each regressed area (score < pass_threshold), search `.user-test-last-report.md` for the NEEDS ACTION section
+4. Check for a line matching the pattern: `⚠.*<area-slug>.*→ Proven regression`
+5. **PASS** if every regressed Proven area has a matching line item. **FAIL** if any is missing or appears without the `→ Proven regression` marker.
+
+**Edge cases:**
+- No Proven areas exist: PASS with detail "no Proven areas in test file"
+- No Proven areas regressed: PASS with detail "no Proven regressions this run"
+- Cannot parse NEEDS ACTION section: FAIL with detail "NEEDS ACTION section not found in report"
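
A rough sketch of the report search described above. It assumes the NEEDS ACTION section runs from its header line to the next blank line, which matches the dispatch layout but is not stated as a parsing rule. The function names are hypothetical.

```python
import re
from typing import Any, Dict, List, Optional


def needs_action_lines(report: str) -> Optional[List[str]]:
    """Return the lines under the NEEDS ACTION header, or None if it is absent."""
    lines = report.splitlines()
    for position, line in enumerate(lines):
        if line.strip().startswith("NEEDS ACTION"):
            section: List[str] = []
            for entry in lines[position + 1:]:
                if not entry.strip():
                    break  # assumed: a blank line ends the section
                section.append(entry)
            return section
    return None


def eval_regression_distinction(report: str, regressed: List[str]) -> Dict[str, Any]:
    """Check that each regressed Proven area has a marked NEEDS ACTION entry."""
    if not regressed:
        return {"pass": True, "missing_from_needs_action": [],
                "detail": "no Proven regressions this run"}
    section = needs_action_lines(report)
    if section is None:
        return {"pass": False, "missing_from_needs_action": list(regressed),
                "detail": "NEEDS ACTION section not found in report"}
    missing = []
    for slug in regressed:
        # Escape the slug so characters such as "/" or "." match literally.
        pattern = "⚠.*" + re.escape(slug) + ".*→ Proven regression"
        if not any(re.search(pattern, entry) for entry in section):
            missing.append(slug)
    detail = "all regressions surfaced" if not missing else "missing regression entries"
    return {"pass": not missing, "missing_from_needs_action": missing, "detail": detail}
```

The list of regressed slugs is expected to come from comparing `ux_score` with `pass_threshold` beforehand, so this function only deals with the report text.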
+
+### Eval 3: P1 Surfacing (presentation layer)
+
+**Question:** Did every P1 item from `explore_next_run` appear in the NEEDS ACTION section?
+
+**Method:**
+1. From `.user-test-last-run.json`, collect all `explore_next_run` items with `priority: "P1"`
+2. For each P1 item, search `.user-test-last-report.md` NEEDS ACTION section for the area slug with `P1` marker
+3. **PASS** if all P1 items are in NEEDS ACTION. **FAIL** with count of missing items and their area slugs.
+
+**Scope note:** Verification mismatches on Proven areas also belong in NEEDS ACTION per
+dispatch format rules, but they flow through `verification_results`, not `explore_next_run`.
+Not checked here — candidate for a future Eval 4.
+
+**Edge cases:**
+- No P1 items: PASS with detail "no P1 items this run"
+- Cross-area P1 items (area = `[cross-area]`): match against the `why` text or `affected_areas` slugs in NEEDS ACTION
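
One possible reading of the P1 check, as a sketch. It takes the NEEDS ACTION lines as already extracted, and it interprets "area slug with `P1` marker" as both strings appearing on the same line. That interpretation, and the `missing` key in the result, are assumptions.

```python
from typing import Any, Dict, List


def eval_p1_surfacing(explore_next_run: List[Dict[str, Any]],
                      needs_action: List[str]) -> Dict[str, Any]:
    """Check that every P1 explore_next_run item appears in NEEDS ACTION."""
    p1_items = [item for item in explore_next_run if item.get("priority") == "P1"]
    if not p1_items:
        return {"pass": True, "p1_count": 0, "surfaced_count": 0,
                "missing": [], "detail": "no P1 items this run"}
    missing: List[str] = []
    for item in p1_items:
        slug = item.get("area", "")
        if slug == "[cross-area]":
            # Cross-area items match on their `why` text or any affected slug.
            needles = [item.get("why", "")] + list(item.get("affected_areas", []))
            needles = [needle for needle in needles if needle]
            found = any(needle in line for needle in needles for line in needs_action)
        else:
            found = bool(slug) and any("P1" in line and slug in line
                                       for line in needs_action)
        if not found:
            missing.append(slug)
    surfaced = len(p1_items) - len(missing)
    detail = "all P1 items surfaced" if not missing else f"{len(missing)} P1 item(s) missing"
    return {"pass": not missing, "p1_count": len(p1_items),
            "surfaced_count": surfaced, "missing": missing, "detail": detail}
```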
+
+## Phase 3: Propose Mutations
+
+If any eval failed, propose one mutation per failing eval.
+
+**Mutation generation rules:**
+- Identify the skill file and section most likely responsible for the failure
+- Describe the current behavior and the proposed change
+- Frame as a specific, targeted instruction change — not a rewrite
+- Number mutations sequentially across all eval runs (read last mutation number from `skill-mutations.md`)
+
+**Mutation format:**
+
+```markdown
+## Mutation N -- <date>
+
+**Status:** PROPOSED
+**Triggered by:** Eval <N> failure (<eval name>)
+**Eval scores:** probe_order: <PASS/FAIL> | regression_distinction: <PASS/FAIL> | p1_surfacing: <PASS/FAIL>
+**Skill version:** <version from plugin.json or run context>
+**Scenario:** <scenario_slug>
+
+### Problem observed
+
+<1-2 sentences describing the specific failure>
+
+### Proposed change
+
+**File:** <path to skill file or reference>
+**Section:** <specific section name>
+
+**Current:** <quote or summarize current instruction>
+**Proposed:** <specific new instruction text>
+
+### Outcome
+
+_Fill after next run:_ Did the change fix the eval failure? Score comparison.
+```
+
+If all evals passed, do not propose a mutation.
+
+## Phase 4: Write Artifacts
+
+### `skill-evals.json`
+
+Location: `tests/user-flows/skill-evals.json`
+
+If file doesn't exist, create with `{ "eval_version": 1, "entries": [] }`.
+
+**Skill version:** Read the current `version` from the plugin's `.claude-plugin/plugin.json` at eval time. Do not hardcode.
+
+Append entry:
+
+```json
+{
+  "run_timestamp": "<from .user-test-last-run.json>",
+  "scenario_slug": "<from .user-test-last-run.json>",
+  "git_sha": "<from .user-test-last-run.json>",
+  "skill_version": "<current version from .claude-plugin/plugin.json>",
+  "evals": {
+    "probe_execution_order": { "pass": <bool>, "areas_violated": [...], "detail": "..." },
+    "proven_regression_distinction": { "pass": <bool>, "regressed_areas": [...], "missing_from_needs_action": [...], "detail": "..." },
+    "p1_surfacing": { "pass": <bool>, "p1_count": <int>, "surfaced_count": <int>, "detail": "..." }
+  },
+  "overall_pass": <bool>,
+  "mutation_proposed": <bool>
+}
+```
+
+Cap at 50 entries — drop oldest if exceeded.
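
The create-append-cap behavior can be sketched as a pure function over the parsed JSON document. File reading and writing are left out, and the function name is hypothetical.

```python
from typing import Any, Dict, Optional

MAX_ENTRIES = 50  # cap stated above


def append_eval_entry(doc: Optional[Dict[str, Any]],
                      entry: Dict[str, Any]) -> Dict[str, Any]:
    """Append an eval entry, creating the document if needed and dropping the oldest."""
    if doc is None:
        # Same initial shape as the file created when none exists.
        doc = {"eval_version": 1, "entries": []}
    entries = list(doc.get("entries", [])) + [entry]
    updated = dict(doc)
    # Entries are kept oldest-first, so slicing from the end drops the oldest.
    updated["entries"] = entries[-MAX_ENTRIES:]
    return updated
```

Returning a new dictionary instead of mutating the input makes it easy to compare the document before and after the write.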
+
+### `skill-mutations.md`
+
+Location: `tests/user-flows/skill-mutations.md`
+
+If file doesn't exist, create with header:
+
+```markdown
+# Skill Mutations Log
+
+Proposed changes to the user-test skill based on eval failures.
+Mark status as ACCEPTED or REJECTED after review.
+```
+
+Append mutation sections for each failing eval. Separate with `---`.
+
+### Graduation Check
+
+After writing artifacts, check for consecutive passing runs:
+
+1. Read the entries from `skill-evals.json`, most recent first
+2. Count consecutive entries with `overall_pass: true`, stopping at the first entry that failed
+3. Check for gap reset: if two adjacent entries in that streak have `run_timestamp` values more than 14 days apart, count only the entries after the gap
+4. If 5+ consecutive passes within the gap window: display "All evals passing consistently (runs from <first date> to <last date>). Consider adding a 4th eval or shifting to query-level optimization."
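
A sketch of the streak count with the gap reset, assuming entries are stored oldest-first and `run_timestamp` is an ISO 8601 string (the timestamp format is an assumption). Function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def passing_streak(entries: List[Dict[str, Any]],
                   max_gap_days: int = 14) -> List[Dict[str, Any]]:
    """Return the most recent run of passing entries, newest first."""
    streak: List[Dict[str, Any]] = []
    newer = None
    for entry in reversed(entries):  # walk from the most recent entry backwards
        if not entry.get("overall_pass"):
            break  # a failing run ends the streak
        timestamp = parse_timestamp(entry["run_timestamp"])
        if newer is not None and newer - timestamp > timedelta(days=max_gap_days):
            break  # gap reset: only entries after the gap count
        streak.append(entry)
        newer = timestamp
    return streak


def has_graduated(entries: List[Dict[str, Any]], required: int = 5) -> bool:
    """True when the current streak meets the graduation threshold."""
    return len(passing_streak(entries)) >= required
```

The first and last dates for the graduation message can be read from `streak[-1]` and `streak[0]` respectively.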
+
+## Phase 5: Display Summary
+
+Display a one-line summary:
+
+```
+EVAL: <N>/3 pass | probe_order: <PASS/FAIL/SKIP> | regression: <PASS/FAIL> | p1_surfacing: <PASS/FAIL>
+```
+
+If mutations were proposed, display each mutation's Problem Observed and Proposed Change inline.
+
+If all passed, display: "All evals passing. No mutations proposed."
+
+If graduation threshold met, display the graduation message.
diff --git a/plugins/compound-engineering/skills/user-test-iterate/SKILL.md b/plugins/compound-engineering/skills/user-test-iterate/SKILL.md
new file mode 100644
index 000000000..0b2b53bfa
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test-iterate/SKILL.md
@@ -0,0 +1,9 @@
+---
+name: user-test-iterate
+description: Run the same user test scenario N times to measure consistency. Use when validating score stability or detecting flaky areas.
+disable-model-invocation: true
+allowed-tools: Skill(user-test)
+argument-hint: "[scenario-file] [n]"
+---
+
+Invoke the user-test skill in iterate mode for: $ARGUMENTS
diff --git a/plugins/compound-engineering/skills/user-test/SKILL.md b/plugins/compound-engineering/skills/user-test/SKILL.md
new file mode 100644
index 000000000..e7683d9e7
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/SKILL.md
@@ -0,0 +1,399 @@
+---
+name: user-test
+description: Run browser-based user testing via claude-in-chrome MCP with quality scoring and compounding test files. Use when testing app quality, scoring interactions, tracking test maturity, or filing issues from test sessions.
+argument-hint: "[scenario-file-or-description]"
+disable-model-invocation: true
+---
+
+# User Test
+
+Exploratory testing in a visible Chrome window. The user watches the test
+happen in real time and can intervene if needed. Claude shares the browser's
+login state — sign into the app in Chrome before running.
+
+For automated headless regression testing, use `/test-browser` instead.
+
+**v1 limitation:** This skill targets localhost / local dev server apps. External
+or staging URLs are not validated for deployment status — verify remote apps are
+live and accessible before testing.
+
+## Interaction Method
+
+If `AskUserQuestion` is available, use it for all prompts. Otherwise, present
+each question as a numbered list and wait for a reply. For multiSelect, accept
+comma-separated numbers (e.g. `1, 3`). Never skip or auto-configure.
+
+**Protected artifacts:** `tests/user-flows/` files are pipeline output. Review
+agents and cleanup tools must never flag them for deletion or gitignore.
+
+## Phase 0: Preflight
+
+1. **Chrome MCP check — deferred to Phase 2.** Phase 1 CLI discovery may eliminate browser testing.
+2. **Detect WSL:** Run `uname -r 2>/dev/null | grep -qi microsoft`. If WSL: abort with "Chrome integration not supported in WSL."
+3. **Check gh CLI:** Run `gh auth status`. If not authenticated: note "gh not authenticated — issue creation skipped in commit mode."
+4. **Validate app URL:** If test file contains `app_url`, verify reachable. Site permission errors handled reactively during execution.
+
+## Phase 1: Load Context
+
+**Input:** `$ARGUMENTS` — either a path to an existing test file or a description of what to test. A trailing integer N triggers multi-run mode (e.g., `/user-test resale-clothing 5`). See [probes.md](./references/probes.md) for multi-run orchestration: inter-run probe state, progressive Proven area reduction, interruption handling, and N-run summary format.
+
+1. **Resolve test file:**
+   - If argument is a file path (contains `/` or ends in `.md`):
+     - Validate path resolves within `tests/user-flows/` (prevent directory traversal)
+     - Read and parse the test file
+     - Validate `schema_version` is present (1–10 accepted) <!-- bump range when schema changes -->
+     - **v1/v2 migration:** If `schema_version: 1`, fill missing columns (`Last Quality`, `Last Time`, `Delta`, `Context`) with `—`. If `schema_version: 2`, also fill missing sections (Area Trends, UX Opportunities, Good Patterns) and Run History columns (Best Area, Worst Area). Do NOT rewrite on read.
+     - **v3/v4 migration:** If `schema_version: 3`, treat missing `verify:` blocks and `Probes:` tables as absent. If `schema_version: 4`, also treat missing `**Queries:**` and `**Multi-turn:**` tables as absent. Do NOT rewrite on read.
+     - **v5 migration:** If `schema_version: 5`, treat Probes without `Confidence` column as `confidence: high` (existing probes were generated from observed failures). Treat Probes without `Priority` column as inferred from `Generated From` (verification failure → P1, score-based → P2). Treat Queries without `Status` column as active. Treat missing `seams_read` as `false`. Do NOT rewrite the file on read.
+     - **v6 migration:** If `schema_version: 6`, treat missing `## Cross-Area Probes` section as empty table. Treat missing `mcp_restart_threshold` as 15. Treat probes without `related_bug` as unlinked. Do NOT rewrite on read.
+     - **v7 migration:** If `schema_version: 7`, treat missing `weakness_class` as absent. Treat missing `novelty_fingerprints` as empty. Treat missing `adversarial_browser` as false. In JSON: treat missing `tactical_note` as null, `confirmed_selectors` as `{}`. Do NOT rewrite on read.
+     - **v8 migration:** If `schema_version: 8`, treat missing `## Journeys` section as empty (no journeys defined). Do NOT rewrite on read.
+     - **v9 migration:** If `schema_version: 9`, treat missing `execution_index` on `probes_run` entries as absent. Treat missing `broad_exploration_start_index` on areas as absent. `/user-test-eval` skips Eval 1 (probe execution order) for runs without ordering data. Do NOT rewrite on read.
+     - **Forward compatibility:** Ignore unknown frontmatter fields. Preserve unknown table columns on write.
+     - **Missing `cli_test_command` (any version):** Treat as `cli_test_command: ""`. CLI discovery (step 3) will populate it. Do NOT rewrite the file on read.
+     - Extract maturity map, run history, and explore-next-run items
+   - If argument is a description string:
+     - Generate a slug from the description
+     - Check if `tests/user-flows/<slug>.md` already exists
+     - If not, create from template — see [test-file-template.md](./references/test-file-template.md)
+     - Decompose the description into areas (1-3 interactions each). For new test files, write **rich** area definitions — see Area Depth in [test-file-template.md](./references/test-file-template.md). For `scored_output` areas, include Queries and Multi-turn sequences.
+   - If no argument:
+     - Scan `tests/user-flows/` for existing test files
+     - Present list and ask which to run, or prompt for a new description
+2. **Orientation (first run only):** If `seams_read` is false or absent in frontmatter, run code reading to identify structural seams before any browser interaction. Output: 0-5 structural-hypothesis probes.
+   See [orientation.md](./references/orientation.md). Set `seams_read: true` on first commit after code reading, regardless of outcome.
+3. **CLI discovery (MANDATORY when `cli_test_command` is empty):** Whether the test file is new or existing, if `cli_test_command` is empty, run CLI discovery NOW before any browser interaction — follow every step in CLI Discovery in [test-file-template.md](./references/test-file-template.md). Check for API endpoints, test scripts, curl-able routes. If a testable surface exists, populate `cli_test_command` and `cli_queries` in the test file immediately. Do NOT skip this step. Do NOT ask the user whether to do it — just do it.
+4. **Ensure `.gitignore` coverage:**
+   - Check that `.user-test-last-run.json` and `.user-test-last-report.md` are in the project's `.gitignore`
+   - If missing, append them (these files are ephemeral run state, not source)
+   - Note: `score-history.json`, `bugs.md`, `skill-evals.json`, and `skill-mutations.md` are NOT gitignored — they are persistent project data
+5. **Handle corruption:**
+   - If required sections are missing or `schema_version` is absent, offer to regenerate from template
+6. **Capture git state:** Run `git rev-parse HEAD` and `git rev-parse origin/main 2>/dev/null`. Run `git diff --name-only origin/main..HEAD` — if this returns ANY files, those are code-affected areas requiring full exploration (even on a feature branch where main is "behind" HEAD). See [run-targeting.md](./references/run-targeting.md) for full rules.
+
+## Phase 2: Setup
+
+0. **Check claude-in-chrome MCP:** Call any `mcp__claude-in-chrome__*` tool. If NOT available: check if `cli_test_command` covers all `scored_output` areas. If yes, offer "All areas have CLI coverage — run CLI-only? (y/n)" and proceed without browser. If CLI doesn't cover all areas: display "claude-in-chrome not connected. Run `/chrome` or restart with `claude --chrome`" and abort.
+1. **Environment sanity check:**
+   - Navigate to the app URL using `mcp__claude-in-chrome__navigate`
+   - Verify the page loaded with expected content (not an error page, stale auth redirect, or empty state)
+   - If error banners, API failures, or empty data detected: abort with "App environment issue detected — fix the app state before testing"
+2. **Authentication check:**
+   - Claude shares the browser's login state — no credential handling needed
+   - If a login page or CAPTCHA is encountered: pause and instruct "Sign in to your app in Chrome, then press Enter to continue"
+3. **Baseline screenshot:**
+   - Take a screenshot of the app's initial state for reference
+
+## Phase 2.5: CLI Testing (Optional)
+
+If the test file defines `cli_test_command` in frontmatter, run CLI queries before browser testing. CLI mode catches agent reasoning errors without browser overhead.
+
+**When `cli_test_command` is present:**
+1. Phase 0 runs `gh auth status` only (Chrome MCP deferred). Skip Phase 2 browser setup unless browser areas exist.
+2. Run each `scored_output` area's Queries through `cli_test_command`. Run `cli_queries` via Bash. Score 1-5 using output quality rubric (semantic evaluation). See CLI Area Queries in [queries-and-multiturn.md](./references/queries-and-multiturn.md).
+3. **Browser area overlap:** If a `prechecks`-tagged CLI query scores ≤ 2, skip the tagged browser area. No `prechecks` tag = standalone.
+4. Credentials: shell environment only. No credentials in the test file.
+5. **Adversarial flag check:** If any CLI query for an area scores exactly 3, set `adversarial_browser: true`. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) CLI Adversarial Mode for trigger conditions and secondary check.
+
+**CLI + browser coexistence:** When both exist, run CLI first. CLI failures only skip browser areas explicitly tagged via `prechecks`.
+
+## Phase 3: Execute
+
+Test areas based on maturity status. The agent exercises judgment on area selection — these are guidelines, not rigid rules. Record a `skip_reason` for each area not fully tested (see [test-file-template.md](./references/test-file-template.md) for enum values).
+
+**Run focus vs. area budget:** A run focus (e.g., "consumer stress test", "search bar exploration") controls WHAT you test within each area — which queries, which edge cases, which user personas. It does NOT override maturity-based time allocation (see override priority table in [run-targeting.md](./references/run-targeting.md)). Proven areas get a tiered MCP budget based on consecutive pass count (see [run-targeting.md](./references/run-targeting.md) for budget table). The run focus shapes WHAT those calls test (search bar instead of basic navigation), not the count.
+
+### Per-Area Checklist (run in order for every area)
+
+0. **CLI precheck gate** — if `prechecks` CLI query scored ≤ 2, skip. No prechecks tag = proceed. No CLI = proceed.
+0b. **Adversarial mode** — if `adversarial_browser: true` (from Phase 2.5): skip happy path, front-load competing-constraint queries, generate pre-emptive P1 probe, increase novelty budget. SKIP areas promoted to PROBES-ONLY. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) CLI Adversarial Mode.
+1. **Run probes** — failing/untested first. See [probes.md](./references/probes.md).
+2. **Execute Queries and Multi-turn** — if defined. See [queries-and-multiturn.md](./references/queries-and-multiturn.md).
+3. **Novelty budget — MANDATORY.** Before generating novel interactions, check `novelty_fingerprints` from `.user-test-last-run.json` — skip interactions matching existing fingerprints. At least 1 novel interaction per `scored_output` area must generate a probe. Iterate mode ignores fingerprints. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) for fingerprint matching, MCP budget, and mandatory probe rule.
+4. **Verification pass** — per area type. See [verification-patterns.md](./references/verification-patterns.md).
+5. **Score** — UX (1-5) + Quality if `scored_output: true`.
+6. **Time** — wall-clock seconds, first to last MCP call. Async waits count. Disconnect = `—`.
+7. **Notes** — what surprised you? Feeds Explore Next Run + new Queries in commit.
+
+Probes, verification, and UX scores are three separate signals — none subsumes the others.
+
+### Execution Index Tracking
+
+Maintain a monotonically increasing `execution_index` counter (starting at 0) across the entire run. Increment for each probe execution and each broad exploration action. Record `execution_index` on every `probes_run` entry. When transitioning from probe execution to broad exploration for an area, record `broad_exploration_start_index` on that area. This enables `/user-test-eval` to verify probe-before-exploration ordering from artifacts alone. See [last-run-schema.md](./references/last-run-schema.md) for field definitions.
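
The counter described above could look roughly like this. The class and the `probe_id` and `status` field names are illustrative assumptions. Only `execution_index`, `probes_run`, and `broad_exploration_start_index` come from the text.

```python
from typing import Any, Dict


class ExecutionRecorder:
    """Hand out run-wide execution indices and record them per area."""

    def __init__(self) -> None:
        self._next_index = 0  # counter starts at 0 and never decreases
        self.areas: Dict[str, Dict[str, Any]] = {}

    def _take_index(self) -> int:
        index = self._next_index
        self._next_index += 1
        return index

    def _area(self, slug: str) -> Dict[str, Any]:
        return self.areas.setdefault(slug, {"slug": slug, "probes_run": []})

    def record_probe(self, slug: str, probe_id: str, status: str) -> int:
        """Record one probe execution with its execution_index."""
        index = self._take_index()
        self._area(slug)["probes_run"].append(
            {"probe_id": probe_id, "status": status, "execution_index": index}
        )
        return index

    def record_exploration(self, slug: str) -> int:
        """Record one broad exploration action; the first marks the transition."""
        index = self._take_index()
        self._area(slug).setdefault("broad_exploration_start_index", index)
        return index
```

Because one counter is shared across the whole run, a probe recorded after an area's first exploration action always has a larger index than that area's start index, which is what the ordering eval looks for.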
+
+### Probe Execution (Before Broad Exploration)
+
+Read probes from area `**Probes:**` tables. Execute `untested` and `failing` probes before broad exploration — these are the highest-signal checks. For Proven areas, failing/untested probes always run regardless of MCP budget; the tiered budget cap only constrains passing-probe spot-checks. Record each probe result with its `execution_index`. See [probes.md](./references/probes.md) for execution flow, lifecycle, and dedup rules.
+
+### Cross-Area Probes (Before Per-Area Testing)
+
+Execute cross-area probes before per-area testing — they test state carry-over between areas and inform per-area score interpretation. Results do NOT affect per-area scores. See [probes.md](./references/probes.md).
+
+### Journey Execution (After Cross-Area Probes)
+
+Execute journeys after cross-area probes, before per-area testing. Journeys test accumulated state across 3+ areas without resets, with checkpoints at each step. Results do NOT affect per-area scores. See [journeys.md](./references/journeys.md).
+
+### Verification Pass (After Each Area)
+
+After exploring each area, run structural verification checks based on area type — independent of what the agent noticed. Read the area's `**verify:**` block for area-specific instructions. Record verification results separately from UX score. Verification failures block promotion to Proven but do not demote existing Proven areas. See [verification-patterns.md](./references/verification-patterns.md) for standard checks, tolerance rules, and maturity interaction.
+
+### Area Selection Priority
+
+See [run-targeting.md](./references/run-targeting.md) for full rules including
+git-aware targeting, progressive narrowing, and override priority.
+
+Quick reference: (0) Code-affected → full. (1) P1 Explore Next Run → full. (2) Uncharted → full. (3) Proven → spot-check (tiered MCP + failing probes). (4) Known-bug → check issue state:
+  - `gh issue view` or check tracker — if closed/fixed, flip to Uncharted (verify the fix)
+  - if open, spot-check the bug area (confirm still broken, note any change)
+(5) All Proven → spot-check all, suggest new areas.
+
+### Connection Resilience
+
+See [connection-resilience.md](./references/connection-resilience.md) for reactive recovery, proactive restart at configurable MCP call threshold, and disconnect tracking rules.
+
+### Modal Dialog Handling
+
+If MCP commands stop responding after triggering an action that may produce a dialog (`alert`, `confirm`, `prompt`): instruct the user to dismiss the dialog manually before continuing.
+
+### Graceful Degradation
+
+- Screenshot fails: continue, note "screenshots unavailable" in report
+- `javascript_tool` fails: fall back to individual `find`/`click` calls
+- All MCP tools fail: abort with recovery instructions
+
+## Phase 4: Score and Report
+
+### Scoring
+
+Score each area on a 1-5 scale per scored interaction unit. A scored interaction unit is one user-facing task completion (e.g., "add item to cart", "submit form"). Navigation, page loads, and setup steps are not scored individually.
+
+| Score | Meaning | Example |
+|-------|---------|---------|
+| 1 | Broken — cannot complete the task | Button unresponsive, page crashes |
+| 2 | Completes with major friction | 3+ confusing steps, error messages |
+| 3 | Completes with minor friction | Small UX issues, unclear labels |
+| 4 | Smooth experience | Clear flow, no confusion |
+| 5 | Delightful | Exceeds expectations, helpful feedback |
+
+Scores are **absolute** per this rubric. The same checkout flow should produce the same score regardless of which test scenario triggered it.
+
+### Output Quality Scoring (Optional)
+
+Areas with `scored_output: true` in their area details are scored on TWO dimensions:
+
+| Score | UX Meaning | Output Quality Meaning |
+|-------|-----------|----------------------|
+| 5 | Delightful | Exactly what an expert would produce |
+| 4 | Smooth | Relevant, minor misses |
+| 3 | Minor friction | Partially correct |
+| 2 | Major friction | Mostly wrong |
+| 1 | Broken | Completely wrong |
+
+Report shows both: `UX: 4/5, Quality: 3/5`. Areas without `scored_output` show UX only.
+
+**Aggregation:** `Quality Avg` in history = UX scores only (backward compatible). Output quality tracked separately as `Output Avg` in the report.
+
+**Promotion gate:** Each area's `pass_threshold` (default 4) and `quality_threshold` (default 3 for scored_output areas) define what counts as a pass. See [test-file-template.md](./references/test-file-template.md) for details.
+
+**Known-bug filing trigger:** UX <= 2 (functional failure) OR Quality <= 1 (completely wrong output). Files to bug registry — see [bugs-registry.md](./references/bugs-registry.md).
+
+### Performance Threshold Evaluation (Optional)
+
+If the test file defines `performance_thresholds` in frontmatter, append a timing grade to each area's assessment: `(fast)`, `(acceptable)`, `(slow)`, `(BROKEN)`. Compare each area's wall-clock time against the thresholds. A `broken` timing is a notable finding but does NOT affect the UX score — timing and quality are separate dimensions.
+
+### Collection Categories
+
+For each tested area, collect:
+1. **UX score** (1-5 per interaction unit)
+2. **Time** (wall-clock seconds from Phase 3 timing)
+3. **Issues found** (bugs, UX problems, accessibility gaps)
+4. **Maturity assessment** (promote, demote, or maintain current status)
+
+After all areas are scored, generate:
+5. **Qualitative summary:** best moment (tagged with area slug), worst moment (tagged with area slug), demo readiness (yes/partial/no), one-line verdict
+6. **Explore Next Run items** (2-3 items with priority P1/P2/P3):
+   - **P1** — Things that surprised you (positive or negative)
+   - **P2** — Edge cases adjacent to tested areas
+   - **P3** — Interactions started but not finished, or borderline scores (score of 3 warrants deeper investigation next run)
+   - **Cross-area weakness synthesis:** After per-area items, read `weakness_class` fields from the test file (as present at run start — ignore any written by this run's commit). If a class appears in 2+ areas, generate up to 2 `[cross-area]` P1 entries with adversarial instructions. See [probes.md](./references/probes.md) Cross-Area Weakness Synthesis.
+7. **UX Opportunities** (P1/P2 action items for improvements observed at score 3-5)
+8. **Good Patterns** (patterns worth preserving observed at score 4-5 — deliberate design choices, not trivial successes)
+9. **Verification results** per area: claims checked, mismatches found (from Layer 2 pass)
+10. **Probe results**: probes executed this run (pass/fail per probe), new probes generated from failures/low scores/worst_moment. See [probes.md](./references/probes.md) for generation triggers and lifecycle.
+
+### Report Output — Dispatch Format
+
+The report is a dispatch, not a broadcast. It tells you what to do next, in priority order. Sections with no items are omitted.
+
+```
+SESSION SUMMARY: <scenario>  [<date> · <mode>]
+UX 3.0 | Quality 4.5 (CLI) | 5 areas | 2 need action
+
+NEEDS ACTION (2)                    ← open items requiring follow-up
+  ⚠ P1  y2k accessories degrading Q3→Q2 → investigate CLI (Explore Next Run)
+  ⚠ P2  Proven area agent/filter-via-chat probe failing → regression
+
+FILED THIS SESSION (1)              ← closed loop, confirmation only
+  ✓ Bug #21: shipping-form validation accepts invalid zip codes
+
+IMPROVED (1)
+  cart-validation  3→4  Cart updates instantly on quantity change
+
+STABLE (3)
+  browse/product-grid, browse/filters, compare/add-view
+
+EXPLORE NEXT RUN
+  P1  shipping-form     Browser  Validation broken — edge cases
+  P1  agent/search-query CLI     y2k degrading — aesthetic+category
+  P2  checkout/promo     Both    Adjacent to cart, untested
+
+SIGNALS
+  + CLI speed 15.8s avg (was 20.4s, -23%)
+  - 10 disconnects (was 6, +67%) — Chrome extension fragile
+  ~ 2 UX opportunities logged (UX001–UX002)
+
+Demo: PARTIAL (P1 bug #21 open; promo-code untested)
+```
+
+**Section rules:**
+- **Header:** `UX X.X | Quality X.X (CLI) | N areas | M need action` — 2-second scan
+- **JOURNEYS:** After cross-area probes, before NEEDS ACTION. Failing/flaky journeys show checkpoint detail; passing journeys show a summary. See [journeys.md](./references/journeys.md).
+- **NEEDS ACTION:** `⚠` prefix. Only open items: degrading areas, failing probes on **Proven** areas (unexpected regression), verification mismatches on Proven. Probe failures on Uncharted/Known-bug stay in DETAILS (expected)
+- **FILED THIS SESSION:** `✓` prefix. Bugs/issues filed. Omit if nothing filed
+- **IMPROVED:** `<area> <old>→<new> <reason>`
+- **STABLE:** Single comma-separated line
+- **EXPLORE NEXT RUN:** `<priority> <area> <mode> <why>` — must appear in printed report
+- **SIGNALS:** `+` positive, `-` negative, `~` neutral. Disconnects always appear here with the delta from the previous run; omit the line when the count is 0, and use `-` when the count increased by 50% or more
+- **Demo:** YES / PARTIAL (reason) / NO (reason). P1 NEEDS ACTION forces at most PARTIAL
+- **DETAILS:** Prints only when actionable (new probes, verification failures, new UX opps). Omit if all empty. Contains: Probe Results, Verification Failures, UX Opportunities tables. Code Changes section when git targeting active
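+
+The disconnect rule for SIGNALS can be sketched as a small function. This is a minimal illustration, not part of the skill: `disconnectSignal` and its arguments are hypothetical names, and treating a previous count of 0 as a 50%+ increase is an assumption.
+
+```javascript
+// Hypothetical helper: format the SIGNALS line for disconnects.
+// Returns null when the line should be omitted (count is 0).
+function disconnectSignal(current, previous) {
+  if (current === 0) return null;
+  // Assumption: any disconnects after a clean previous run count as a 50%+ increase.
+  const increasedHalf = previous === 0 ? true : (current - previous) / previous >= 0.5;
+  const prefix = increasedHalf ? '-' : '~';
+  return `${prefix} ${current} disconnects (was ${previous})`;
+}
+
+// 6 to 10 is a 67% increase, so the prefix is "-"
+console.log(disconnectSignal(10, 6));
+```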
+
+### Share Report (Optional)
+
+After displaying the report, offer: "Share report to Proof for team review? (y/n)".
+If yes, POST the SESSION SUMMARY markdown to `https://www.proofeditor.ai/share/markdown`
+with `{"title": "<scenario> — <date>", "markdown": "<report>"}` and display the
+returned URL. Skip silently on curl failure — Proof sharing is best-effort.
+
+### Persist Report
+
+After displaying the report (and optional Proof sharing), write the rendered report text to `tests/user-flows/.user-test-last-report.md`. This file is the eval artifact — `/user-test-eval` reads it to grade presentation-layer behavior. Overwritten each run, gitignored.
+
+### Auto-Commit
+
+After persisting the report, **automatically proceed to Commit Mode** (below) — update the test file, append to history, and file issues. The user reviews results inline as part of the same session.
+
+**Opt-out:** If invoked with `--no-commit` or if the run was partial (interrupted before all areas scored), skip commit and display the report only. The user can run `/user-test-commit` later to commit from `.user-test-last-run.json`.
+
+**Partial run safety:** If the run is interrupted before scoring completes, do NOT produce committable output. Partial runs must not corrupt maturity state.
+
+### Run Results Persistence
+
+After Phase 4 completes (all areas scored), write `tests/user-flows/.user-test-last-run.json`. See [last-run-schema.md](./references/last-run-schema.md) for full schema (v10), per-area fields, journey fields, execution index fields, and behavioral notes. File is overwritten each run except `novelty_fingerprints` which accumulates across runs (read-merge-write).
+
+## Commit Mode
+
+Runs automatically after Phase 4 completes a full run. Can also be invoked standalone via `/user-test-commit` (e.g., after a `--no-commit` run or to retry a failed commit).
+
+### Load Run Results
+
+**When invoked automatically:** Use the run results already in context from Phase 4.
+
+**When invoked standalone via `/user-test-commit`:** Read `tests/user-flows/.user-test-last-run.json`. This is the single source of truth — commit mode never falls back to the context window.
+
+- **Missing file:** Abort with "No run results found. Run `/user-test` first."
+- **Incomplete run:** If `completed: false`, abort with "Last run was incomplete. Run `/user-test` again for committable results."
+- **Stale (>7 days):** Abort with "Run results too old — re-run `/user-test` first."
+- **Stale (>24 hours):** Warn "Run results are from <timestamp>. Commit anyway? (y/n)."
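+
+The four checks above could be ordered as in this sketch. It is illustrative only: `gateRunResults` is a hypothetical name, `completed` is the field named above, and the ISO 8601 `timestamp` field name is an assumption about the run file. The 7-day check runs before the 24-hour check so that very old results abort rather than warn.
+
+```javascript
+// Hypothetical sketch of the standalone-commit gate.
+const HOUR_MS = 60 * 60 * 1000;
+
+function gateRunResults(run, now = Date.now()) {
+  // `run` is the parsed JSON, or null when the file is missing.
+  if (run === null) return { action: 'abort', reason: 'missing' };
+  if (run.completed === false) return { action: 'abort', reason: 'incomplete' };
+
+  // Assumption: the run file carries an ISO 8601 `timestamp`.
+  const ageMs = now - Date.parse(run.timestamp);
+  if (ageMs > 7 * 24 * HOUR_MS) return { action: 'abort', reason: 'too-old' };
+  if (ageMs > 24 * HOUR_MS) return { action: 'warn', reason: 'stale' };
+  return { action: 'proceed', reason: null };
+}
+```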
+
+### Maturity Updates
+
+Apply maturity transitions using agent judgment and the scoring rubric:
+
+- **Promote to Proven:** After 2+ consecutive passes where UX >= area's `pass_threshold` (default 4) and Quality >= `quality_threshold` for scored_output areas (default 3), with no functional issues. A cosmetic issue in a Proven area does not warrant demotion.
+- **Demote to Uncharted:** On functional regressions or new features that change behavior. Minor CSS issues do not trigger demotion.
+- **Mark Known-bug:** When a functional issue is found and an issue is filed. Record in bug registry — see [bugs-registry.md](./references/bugs-registry.md). Skip this area in future runs until the fix is deployed.
+- **Persistent ≤3 escalation:** If an area scores ≤ 3 for 3+ consecutive runs AND the same issue is noted each time, offer: "<area> has scored ≤3 for N runs with the same issue — file as Known-bug?" This is a manual escalation, not automatic.
+
+**Partial run safety:** If a run is interrupted before scoring completes, no maturity updates are produced.
+
+### File Updates
+
+1. **Update test file maturity map and area details:**
+   - Write to `.tmp` file first, then rename (atomic write)
+   - Upgrade to v10: bump `schema_version: 10` on first commit regardless of query/probe usage. Add missing columns and sections per [test-file-template.md](./references/test-file-template.md)
+   - Update area statuses, scores, timing, quality scores, and consecutive pass counts
+   - Update `## Area Trends` section from `score-history.json` data
+   - Update `## UX Opportunities Log`: add new entries with sequential IDs (UX001...), update existing entries (mark `implemented` if improvement detected), age out entries per lifecycle rules
+   - Update `## Good Patterns`: confirm existing patterns (update `Last Confirmed`), add new patterns, remove patterns unconfirmed for 5+ runs
+   - **Tactical notes:** Append `[Run N] <finding>` to area's Notes column when there's a genuine tactical insight (selector pattern, timing pattern, interaction sequence). Cap 3 entries per area; drop oldest. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) Tactical Notes.
+   - **Verified selectors:** When Phase 3 confirmed DOM selectors via successful `javascript_tool` batch call, append them to the area's `**verify:**` block with `_Selectors confirmed run N._`. Append-only — never replace user-authored content. See [verification-patterns.md](./references/verification-patterns.md) Selector Discovery and Writeback.
+   - **Weakness class:** When 2+ probes in an area share a failure pattern, write `**weakness_class:** <class>` below `pass_threshold`. Remove after 3 consecutive pass runs. One class per area — dominant by probe count. See [probes.md](./references/probes.md) Weakness Classification.
+2. **Update `tests/user-flows/score-history.json`:**
+   - Append current run's per-area scores (UX, quality, time)
+   - Compute trend per area from last 3 entries
+   - Cap at 10 entries per area (drop oldest)
+   - Create file if it doesn't exist
+3. **Update `tests/user-flows/bugs.md`:**
+   - File new bugs with sequential IDs for areas with UX <= 2 or Quality <= 1
+   - Mark bugs as `fixed` when Known-bug area passes fix_check (score >= `pass_threshold`) AND GitHub issue is closed
+   - Mark bugs as `regressed` when previously-fixed area fails again
+   - Create file if it doesn't exist — see [bugs-registry.md](./references/bugs-registry.md)
+4. **Update probe statuses** in each area's `**Probes:**` table and the `## Cross-Area Probes` table: mark passing/failing/flaky based on this run's results. Rotate out passing probes older than 10 runs. If a probe has failed 3+ consecutive runs, auto-escalate to bugs.md (see [probes.md](./references/probes.md) Escalation). If a probe has passed 2+ consecutive runs, offer CLI graduation (same path as bug graduation — see [probes.md](./references/probes.md)).
+5. **Offer graduation** for newly-fixed bugs — see [graduation.md](./references/graduation.md)
+6. **Append to `tests/user-flows/test-history.md`:**
+   - Add row with: date, areas tested, quality avg, delta, pass rate, best area, worst area, demo ready, context, key finding
+   - **Delta computation:** Compare quality avg against the most recent *completed* previous run. First run: `—`. Previous run was partial: skip to last complete run. Different area sets: compute over overlapping areas only; no overlap → `—`. Always display how many areas overlap vs. excluded (e.g., "over 5 overlapping areas, 2 new excluded") so the denominator change is visible.
+   - **Delta warning:** Flag any delta worse than -0.5 in the commit output
+   - **Context field:** Brief phrase explaining *why* the verdict is what it is (e.g., "search results loading 28s"). Persists alongside verdict for future reference.
+   - **Pattern surfacing** (after 10+ runs): positive patterns need 7+ of last 10 runs as best area; negative patterns need 5+ of last 10 runs as worst area
+   - Rotation: keep last 50 entries, remove oldest when exceeding
+7. **File GitHub issues:**
+   - Each issue gets a label `user-test:<area-slug>` (e.g., `user-test:checkout/cart-count`)
+   - **Duplicate detection:** `gh issue list --label "user-test:<area-slug>" --state open`
+     - If match found: skip filing, note "duplicate of #N"
+     - If no match: fall back to semantic title search as secondary check
+   - Sanitize issue body content before `gh issue create`
+   - Skip gracefully if `gh` is not authenticated
+   - Never persist credentials (passwords, tokens, session IDs) in issue bodies or test files
+8. **Query compounding:** Sharpen failed queries into probes, expand from discoveries, mark stable queries. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) for steps 8-12 details, query-to-probe conversion rules, and stable query regression tiers.
+8b. **Novelty fingerprints:** Merge this run's new fingerprints with existing ones from `.user-test-last-run.json`. Apply 20-per-area cap (drop oldest). Write merged set. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) Novelty Fingerprint Persistence.
+8c. **Journey updates:** Update journey Status, Last Run, Run History. Auto-escalate, mark stable, detect definition changes. Journey results do NOT affect per-area maturity. See [journeys.md](./references/journeys.md) Commit Mode.
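+
+The delta rule in step 6 can be sketched as follows. This is a hedged illustration: `computeDelta`, the `completed` flag on history entries, and the `scores` map (area slug to quality score) are assumed shapes, not the skill's actual schema. A `null` delta stands for the `—` placeholder.
+
+```javascript
+// Hypothetical sketch of the test-history delta rule.
+function average(values) {
+  return values.reduce((sum, v) => sum + v, 0) / values.length;
+}
+
+function computeDelta(current, previousRuns) {
+  // previousRuns is ordered oldest to newest; partial runs are skipped.
+  const baseline = [...previousRuns].reverse().find(run => run.completed);
+  if (!baseline) return { delta: null, note: 'no completed previous run' };
+
+  const areas = Object.keys(current);
+  const overlap = areas.filter(area => area in baseline.scores);
+  if (overlap.length === 0) return { delta: null, note: 'no overlapping areas' };
+
+  const delta =
+    average(overlap.map(a => current[a])) - average(overlap.map(a => baseline.scores[a]));
+  const excluded = areas.length - overlap.length;
+  return {
+    delta: Math.round(delta * 10) / 10,
+    note: `over ${overlap.length} overlapping areas, ${excluded} new excluded`,
+    warn: delta < -0.5, // flag any delta worse than -0.5
+  };
+}
+```
+
+Returning the overlap note alongside the number keeps the denominator change visible wherever the delta is printed.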
+
+### Eval Prompt
+
+After all commit steps complete, display:
+
+```
+Run `/user-test-eval` to grade this session's output against binary evals.
+```
+
+The eval runs as a separate invocation to preserve grading integrity — it reads from file artifacts, not conversation context. Do NOT attempt to invoke the eval skill inline; the separation is intentional.
+
+**Skip conditions:** `--no-eval` flag, or if commit was partial/aborted — omit the prompt.
+**Iterate mode:** Display the prompt once after the final commit, not per-iteration.
+**Standalone `/user-test-commit`:** Also displays the eval prompt after commit completes.
+
+## Iterate Mode
+
+See [iterate-mode.md](./references/iterate-mode.md) for full details.
+
+N is capped at 10 by default. N=0 is an error; N=1 is valid.
+The reset between runs is a full page reload to the app entry URL.
+Partial run handling: if a disconnect occurs mid-iterate, write results for the
+completed runs and report "Completed M of N runs."
+Output: per-run scores table, aggregate consistency metrics, and maturity transitions.
+After the final run, auto-commit (same as a normal `/user-test` run). Pass `--no-commit` to skip.
+
+## Reference Files
+
+- [test-file-template.md](./references/test-file-template.md) — template, schema migration, area granularity, worked examples
+- [last-run-schema.md](./references/last-run-schema.md) — `.user-test-last-run.json` schema, per-area fields, behavioral notes
+- [journeys.md](./references/journeys.md) — multi-area journey testing: lifecycle, budget, execution, checkpoint types, generation, feature interactions
+- [probes.md](./references/probes.md) — probe execution, lifecycle, dedup, escalation, graduation, multi-run orchestration, weakness classification
+- [queries-and-multiturn.md](./references/queries-and-multiturn.md) — execution checklist, scoring, query compounding, novelty budget, fingerprints, CLI adversarial mode
+- [verification-patterns.md](./references/verification-patterns.md) — structural checks, tolerance rules, scoring impact
+- [run-targeting.md](./references/run-targeting.md) — area selection, git-aware targeting, progressive narrowing
+- [bugs-registry.md](./references/bugs-registry.md) — bug lifecycle, commit mode update rules
+- [graduation.md](./references/graduation.md) — browser discoveries → CLI regression checks
+- [browser-input-patterns.md](./references/browser-input-patterns.md) / [connection-resilience.md](./references/connection-resilience.md) — browser patterns, connection resilience
+- [iterate-mode.md](./references/iterate-mode.md) / [orientation.md](./references/orientation.md) — multi-run orchestration, first-run code reading
diff --git a/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md b/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md
new file mode 100644
index 000000000..94f2a384f
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md
@@ -0,0 +1,145 @@
+# Browser Input Patterns
+
+Patterns for interacting with web apps via `claude-in-chrome` MCP tools.
+
+## React-Safe Input
+
+React uses synthetic events and controlled components. Setting `.value` directly
+bypasses React's state management. Use the native setter pattern:
+
+```javascript
+// React-safe input via javascript_tool
+mcp__claude-in-chrome__javascript_tool({
+  code: `
+    const el = document.querySelector('input[name="email"]');
+    const setter = Object.getOwnPropertyDescriptor(
+      window.HTMLInputElement.prototype, 'value'
+    ).set;
+    setter.call(el, 'test@example.com');
+    el.dispatchEvent(new Event('input', { bubbles: true }));
+    el.dispatchEvent(new Event('change', { bubbles: true }));
+  `
+})
+```
+
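+The snippet above takes the setter from `HTMLInputElement.prototype`, so it only
+works for `<input>` elements. For `<textarea>` and `<select>`, take the setter from
+`HTMLTextAreaElement.prototype` or `HTMLSelectElement.prototype` instead; calling
+the input setter on another element type throws a `TypeError`. The same pattern
+applies in React, Vue, and other virtual-DOM frameworks.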
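+
+One way to avoid hard-coding a prototype per element type is to walk the element's prototype chain for the first `value` setter. The sketch below is an assumption-laden illustration: `findNativeValueSetter` and `setNativeValue` are hypothetical helpers (not part of `claude-in-chrome`), and the body would be passed inside the `code` string of a `javascript_tool` call. The `EventCtor` parameter exists only so the helper can be exercised outside a browser.
+
+```javascript
+// Hypothetical helper: locate the native `value` setter for any form element.
+function findNativeValueSetter(el) {
+  // Start at the prototype, not the element itself: React is understood to
+  // shadow `value` with its own property on the element instance.
+  let proto = Object.getPrototypeOf(el);
+  while (proto) {
+    const desc = Object.getOwnPropertyDescriptor(proto, 'value');
+    if (desc && typeof desc.set === 'function') return desc.set;
+    proto = Object.getPrototypeOf(proto);
+  }
+  return null;
+}
+
+// Hypothetical helper: set a value and fire the events frameworks listen for.
+function setNativeValue(el, value, EventCtor = globalThis.Event) {
+  const setter = findNativeValueSetter(el);
+  if (!setter) throw new TypeError('no native value setter found');
+  setter.call(el, value);
+  el.dispatchEvent(new EventCtor('input', { bubbles: true }));
+  el.dispatchEvent(new EventCtor('change', { bubbles: true }));
+}
+```
+
+In a page this would be called as `setNativeValue(document.querySelector('textarea'), 'hello')`.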
+
+## Batching DOM Checks
+
+Each MCP call is a Chrome extension round-trip. Batch simple checks into one call:
+
+```javascript
+// Batch multiple checks into one javascript_tool call
+mcp__claude-in-chrome__javascript_tool({
+  code: `JSON.stringify({
+    submitBtn: !!document.querySelector('[type=submit]'),
+    errorMsg: !!document.querySelector('.error'),
+    price: document.querySelector('.price')?.textContent,
+    itemCount: document.querySelectorAll('.cart-item').length
+  })`
+})
+```
+
+## File Upload Limitation
+
+File uploads (`<input type="file">`) cannot be automated via claude-in-chrome.
+Mark these interactions as `MANUAL ONLY` in the test file. Workaround: pause the
+test and use `/agent-browser` for file upload steps, then resume.
+
+## Async Wait Pattern
+
+Many interactions trigger async operations (API calls, animations, state updates).
+Check for completion before asserting results:
+
+```javascript
+// Wait for async operation completion
+mcp__claude-in-chrome__javascript_tool({
+  code: `
+    (async () => {
+      const start = Date.now();
+      const timeout = 10000;
+      const selector = '.success-message'; // adapt per use case
+      while (Date.now() - start < timeout) {
+        if (document.querySelector(selector)) return 'found';
+        await new Promise(r => setTimeout(r, 200));
+      }
+      return 'timeout';
+    })()
+  `
+})
+```
+
+Adapt the selector and timeout per use case. Common patterns:
+- Success message appears: `.success-message`, `.toast`, `[role="alert"]`
+- Loading spinner gone: `!document.querySelector('.spinner')`
+- Data rendered: `document.querySelectorAll('.result-item').length > 0`
+
+## Agent Response Polling
+
+After sending a query to an AI agent chat interface, poll for response completion instead of using fixed waits. AI agents take variable time (5-30s) — a fixed wait is either too short or too long.
+
+**Polling pattern:**
+
+```javascript
+mcp__claude-in-chrome__javascript_tool({
+  code: `
+    (async () => {
+      const start = Date.now();
+      const timeout = 30000; // 30s max
+      const interval = 1000; // check every 1s
+
+      while (Date.now() - start < timeout) {
+        const typing = document.querySelector('.typing-indicator, .loading-spinner');
+        const response = document.querySelector('.agent-response:last-child, .message:last-child');
+
+        if (!typing && response && response.textContent.trim().length > 20) {
+          await new Promise(r => setTimeout(r, 500)); // final render buffer
+          // Query chips after the buffer so late-rendered chips are counted
+          const chips = document.querySelector('.suggestion-chips, .quick-replies');
+          return JSON.stringify({
+            status: 'complete',
+            waitedMs: Date.now() - start,
+            hasChips: !!chips,
+            responseLength: response.textContent.trim().length
+          });
+        }
+        await new Promise(r => setTimeout(r, interval));
+      }
+      return JSON.stringify({ status: 'timeout', waitedMs: Date.now() - start });
+    })()
+  `
+})
+```
+
+**Parameters:** 1-second poll interval, 30-second maximum. The 500ms final buffer allows post-streaming render (chips, formatting).
+
+**Timeout handling:** A poll timeout is NOT a disconnect. The tool call succeeded — the agent response is slow. Log `waitedMs` in timing data. Proceed with whatever DOM state exists (partial response may be usable). Do NOT increment `disconnect_counter`.
+
+**Selector adaptation:** Polling selectors vary per app. On first run, discover response indicators during exploration. Document in area details:
+
+```markdown
+**Agent response selectors:** typing=`.typing-indicator`,
+response=`.chat-message:last-child`, chips=`.suggestion-chip`
+```
+
+**Fallback:** If the selectors are unknown (first run), use a 3-second fixed wait, then call `read_page`. This is shorter than the current 5-10s fixed wait because the read itself shows whether the response appeared.
+
+## Modal Dialog Handling
+
+JavaScript dialogs (`alert()`, `confirm()`, `prompt()`) are modal: they block script
+execution and event handling on the page until dismissed. If MCP commands stop
+responding after triggering a dialog, instruct the user to dismiss it manually
+before continuing.
+
+## Proactive Restart
+
+Sustained MCP tool usage degrades browser extension connections. The skill proactively restarts (full page reload to app entry URL) after a configurable number of MCP calls. See [connection-resilience.md](./connection-resilience.md) for timing rules and cross-area probe interaction.
+
+**What a restart clears:**
+- Extension message channel state
+- In-memory JavaScript variables
+- Pending network requests
+
+**What a restart does NOT clear:**
+- Cookies and session storage (login state preserved)
+- IndexedDB data
+- Service worker caches
diff --git a/plugins/compound-engineering/skills/user-test/references/bugs-registry.md b/plugins/compound-engineering/skills/user-test/references/bugs-registry.md
new file mode 100644
index 000000000..85787bfad
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/bugs-registry.md
@@ -0,0 +1,56 @@
+# Bug Registry
+
+The bug registry (`tests/user-flows/bugs.md`) tracks bugs across runs with persistent status lifecycle. One registry per project, not per scenario.
+
+## File Format
+
+```markdown
+# Bug Registry
+
+| ID | Area | Status | Issue | Summary | Found | Fixed | Regressed |
+|----|------|--------|-------|---------|-------|-------|-----------|
+| B001 | checkout/shipping-form | open | #47 | Accepts invalid zip codes | 2026-02-28 | — | — |
+| B002 | browse/product-grid | fixed | #48 | Cards not clickable | 2026-02-28 | 2026-03-01 | — |
+| B003 | browse/product-grid | regressed | #52 | Regression of B002: Cards not clickable | 2026-03-05 | — | 2026-03-05 |
+```
+
+## Status Lifecycle
+
+```
+open → fixed → regressed
+              ↘ (stays fixed if no regression)
+```
+
+- **open**: Bug discovered and issue filed. Area marked Known-bug.
+- **fixed**: Known-bug area's `fix_check` passes (score >= area's `pass_threshold`, default 4) AND linked GitHub issue is closed. Both conditions required — a passing score with an open issue means the fix hasn't been formally accepted.
+- **regressed**: A previously-fixed area fails again (score < `pass_threshold`). A new GitHub issue is filed with "Regression of #N" referencing the original. The original bug entry is updated to `regressed` with the regression date.
+
+## Sequential IDs
+
+Bug IDs are sequential: B001, B002, B003... Read existing `bugs.md` to find the highest ID, then increment. If the file doesn't exist, start at B001.
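+
+A minimal sketch of the ID lookup, assuming the table layout shown under File Format (rows beginning with `| B001 |`). `nextBugId` is a hypothetical helper name; pass an empty string when `bugs.md` does not exist.
+
+```javascript
+// Hypothetical helper: compute the next bug ID from the text of bugs.md.
+function nextBugId(registryText) {
+  // Match only IDs in the first column, so "Regression of B002" in a summary is ignored.
+  const ids = [...registryText.matchAll(/^\|\s*B(\d+)\s*\|/gm)].map(m => Number(m[1]));
+  const next = ids.length === 0 ? 1 : Math.max(...ids) + 1;
+  return `B${String(next).padStart(3, '0')}`;
+}
+```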
+
+## Multi-Area Bugs
+
+A bug that manifests in multiple areas gets ONE registry entry with the primary area (the area where it was first discovered or most impactful). The `Summary` field notes affected areas:
+
+```
+| B004 | api/auth | open | #55 | Token refresh fails silently. Also affects: settings/profile, dashboard/data | 2026-03-01 | — | — |
+```
+
+Each affected area's Known-bug detail references the same bug ID.
+
+## Commit Mode Updates
+
+After each completed run, commit mode processes the bug registry:
+
+1. **Check for fixes:** For each `open` bug, check if the area was tested and passed fix_check (score >= `pass_threshold`). Also check `gh issue view <issue-number> --json state -q '.state'` — both must be true (score passes AND issue closed) to mark `fixed`.
+2. **File new bugs:** For each area with UX <= 2 or Quality <= 1, check if a bug already exists for that area. If not, create a new entry with next sequential ID and file a GitHub issue.
+3. **Detect regressions:** For each `fixed` bug, check if the area was tested and failed (score < `pass_threshold`). If so, file a new issue with "Regression of #N", update the bug entry to `regressed` with the date.
+
+## File Creation
+
+`tests/user-flows/bugs.md` is created on first bug filing if it doesn't exist. The file header and table format are generated automatically.
+
+## Rotation
+
+Archive entries older than 6 months to `tests/user-flows/bugs-archive.md`. Archived entries are no longer checked during runs but preserved for historical reference.
diff --git a/plugins/compound-engineering/skills/user-test/references/connection-resilience.md b/plugins/compound-engineering/skills/user-test/references/connection-resilience.md
new file mode 100644
index 000000000..5797dacdc
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/connection-resilience.md
@@ -0,0 +1,22 @@
+# Connection Resilience
+
+## Reactive (On Failure)
+
+1. After any MCP tool failure: wait 3 seconds (`Bash: sleep 3`)
+2. Retry the call once
+3. If retry fails: display "Extension disconnected. Run `/chrome` and select Reconnect extension"
+4. Track `disconnect_counter` for the session
+5. If `disconnect_counter >= 3`: abort with "Extension connection unstable. Check Chrome extension status and restart the session."
+
+## Proactive (Prevent Degradation)
+
+6. Track `mcp_call_counter` for the session (increments on every successful MCP tool call)
+7. When `mcp_call_counter` reaches `mcp_restart_threshold` (default: 15, configurable in test file frontmatter): navigate to the app entry URL (full page reload). Reset `mcp_call_counter` to 0. Log: "Proactive restart at call #N to prevent connection degradation."
+8. The restart happens between areas, not mid-area. If the threshold is reached during an area, finish the current area first, then restart before the next area. Cross-area probes are an exception: the restart is deferred until the entire cross-area probe sequence completes. See [probes.md](./probes.md) Proactive Restart Interaction.
+9. In iterate mode, the between-run reset counts as a restart. Reset `mcp_call_counter` at each between-run page reload.
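+
+The two counters above can be sketched as a small state holder. This is illustrative only: `ConnectionTracker` and its method names are hypothetical, and incrementing `disconnect_counter` when the single retry also fails is an assumption about when the counter moves.
+
+```javascript
+// Hypothetical sketch of the session counters described above.
+class ConnectionTracker {
+  constructor(restartThreshold = 15) {
+    this.restartThreshold = restartThreshold; // mcp_restart_threshold
+    this.mcpCallCounter = 0;
+    this.disconnectCounter = 0;
+  }
+
+  // Call after every successful MCP tool call.
+  recordSuccess() {
+    this.mcpCallCounter += 1;
+  }
+
+  // Assumption: called when the retry also fails. Returns true when the session should abort.
+  recordDisconnect() {
+    this.disconnectCounter += 1;
+    return this.disconnectCounter >= 3;
+  }
+
+  // Check only at area boundaries, so a restart never lands mid-area.
+  restartDueAtBoundary() {
+    return this.mcpCallCounter >= this.restartThreshold;
+  }
+
+  // Call after navigating to the app entry URL (proactive restart or iterate reset).
+  recordRestart() {
+    this.mcpCallCounter = 0;
+  }
+}
+```
+
+Checking the threshold at boundaries, rather than on every call, is what lets an area that crosses the threshold finish before the reload.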
+
+## Disconnect Pattern Tracking
+
+When `disconnect_counter` increments, record the context: which MCP tool was called, which area was being tested, and the session MCP call count.
+
+At run end, if `disconnect_counter >= 3`, append a disconnect analysis to the SIGNALS section of the report.
diff --git a/plugins/compound-engineering/skills/user-test/references/graduation.md b/plugins/compound-engineering/skills/user-test/references/graduation.md
new file mode 100644
index 000000000..bedb2a9d6
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/graduation.md
@@ -0,0 +1,68 @@
+# Discovery-to-Regression Graduation
+
+When a browser-layer discovery is fixed and verified, the system offers to generate a CLI regression check. This closes the compounding loop: browser discoveries become fast-layer guards.
+
+## The Compounding Loop
+
+```
+Browser discovers bug → bug filed → developer fixes → next run verifies fix
+    → fix confirmed → CLI regression check generated
+    → future regressions caught by fast CLI layer
+    → browser time freed for new exploration
+```
+
+## Trigger
+
+Graduation is offered when commit mode marks a bug as `fixed` in `bugs.md`:
+
+1. Check if `cli_test_command` exists in the scenario frontmatter
+2. If yes, offer: "Bug B002 (cards not clickable) is fixed. Generate a CLI regression check? (y/n)"
+3. If user accepts, append to `cli_queries` in the test file frontmatter
+
+## Generated CLI Query
+
+```yaml
+cli_queries:
+  - query: "show me product cards"
+    expected: "Returns product data with clickable links or URLs"
+    prechecks: "browse/product-grid"
+    graduated_from: "B002"
+```
+
+- `query`: A representative input that would expose the original bug
+- `expected`: Description of what a correct response looks like (semantic evaluation)
+- `prechecks`: Links to the browser area — if this CLI check fails, the browser area is skipped (already known broken)
+- `graduated_from`: Backlink to the bug ID that spawned this check (auditability)
+
+## Graduation Trigger: Manual
+
+The user confirms each graduation. Automatic graduation was rejected because:
+
+- The user knows whether a CLI check can meaningfully cover a UX-discovered issue
+- Some bugs are inherently browser-only (layout, animation, visual feedback, timing)
+- Auto-generated CLI queries might technically pass but not actually test the thing that broke
+
+## CLI-Ineligible Bugs
+
+Skip the graduation offer when:
+
+- **No `cli_test_command`** in the scenario frontmatter — there's no CLI layer to graduate to
+- **Browser-only bug** — CSS layout, animation timing, visual feedback, element positioning. Note: "This bug is browser-only — no CLI graduation available."
+
+Detection heuristic for browser-only bugs: if the bug's area detail mentions visual, layout, CSS, animation, or the fix_check involves screenshot comparison or element positioning, it's likely browser-only. The agent exercises judgment here.
+
+## Batching
+
+If multiple bugs are fixed in the same run, batch all graduation offers into a single prompt:
+
+```
+3 bugs fixed this run. Generate CLI regression checks?
+
+  B002 — browse/product-grid: Cards not clickable → CLI eligible
+  B005 — checkout/shipping-form: Validation broken → CLI eligible
+  B007 — browse/product-detail: Image carousel layout → browser-only (skip)
+
+Generate checks for B002 and B005? (y/n/select)
+```
+
+User can accept all, reject all, or select individual bugs.
diff --git a/plugins/compound-engineering/skills/user-test/references/iterate-mode.md b/plugins/compound-engineering/skills/user-test/references/iterate-mode.md
new file mode 100644
index 000000000..c978eba46
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/iterate-mode.md
@@ -0,0 +1,107 @@
+# Iterate Mode
+
+Run the same test scenario N times to measure consistency and build confidence
+in maturity promotions.
+
+## Invocation
+
+```
+/user-test-iterate tests/user-flows/checkout.md 5
+```
+
+Arguments: `<scenario-file> <N>`
+
+## Constraints
+
+- **N must be >= 1.** N=0 is an error.
+- **N is capped at 10** by default. Higher values consume significant tokens.
+- **N=1 is valid** — runs once with consistency tracking output format.
+
+## Per-Run Flow
+
+Each iteration follows this sequence:
+
+1. **CLI queries (Phase 2.5):** If `cli_test_command` is present, run ALL `cli_queries` first. Score each. Apply precheck gates.
+2. **Browser reset:** Navigate to app entry URL (full page reload) for clean state.
+3. **Browser areas (Phase 3):** Test browser areas, skipping any gated by CLI precheck failures.
+4. **Score (Phase 4):** Score all areas for this run.
+
+**CLI-only mode:** If all areas have `prechecks` tags and CLI covers everything, skip steps 2-3 entirely — no browser needed for that run.
+
+**CLI command with side effects:** If the CLI command writes to a database or calls external APIs, each iteration may produce different results due to accumulated state. Document this in the test file's area details when relevant.
+
+**Known limitations — not cleared by page reload:**
+- IndexedDB data
+- Service worker caches
+- HttpOnly cookies
+
+Document these limitations when they affect test results. If an app relies
+heavily on these storage mechanisms, note it in the test file's area details.
+
+### Incremental Context Loading (Run 2+)
+
+After run 1, the skill has all reference files in context. For runs 2+:
+
+- Do NOT re-read reference files (SKILL.md, probes.md, etc.)
+- Do NOT re-read the full test file
+- DO re-read: `.user-test-last-run.json` (inter-run probe state + [progressive narrowing](./run-targeting.md) classification)
+- DO re-read: area details for FULL-classified areas only (Queries tables, verify blocks needed for execution)
+
+Order: read JSON → compute retest classification → load details for non-SKIP areas just-in-time before execution.
+
+This reduces Phase 1 from ~3 minutes to ~1 minute for run 2+.
+
+## Partial Run Handling
+
+If a disconnect occurs mid-iterate (e.g., on run 3 of 5):
+- Write results for completed runs (runs 1-2)
+- Report "Completed 2 of 5 runs"
+- Partial results are valid — maturity updates apply to completed runs only
+- Do NOT produce committable output for incomplete runs
+
+## Output Format
+
+Iterate mode produces:
+1. **Per-run scores table** — each run's per-area scores and timing
+2. **Aggregate consistency metrics** — score variance, timing variance, min/max/avg per area
+3. **Maturity transitions** — which areas would promote/demote based on results
+
+**Timing variance** is reported alongside score variance. A consistent 28s is
+acceptable; wild swings between 5s and 45s indicate flakiness worth investigating.
+
+**Delta computation:** Delta is computed between the iterate session's aggregate
+and the previous non-iterate run. Per-iteration deltas within a session are NOT
+computed (they are noise, not signal).
+
+After the final run completes, **automatically proceed to Commit Mode** — same as a normal `/user-test` run. This persists `git_sha`, maturity updates, probes, and history. Commit uses the aggregate scores (not individual run scores). The user can pass `--no-commit` to skip and run `/user-test-commit` manually later.
+
+### Example Output
+
+Per-run table first, then dispatch summary (same format as normal `/user-test`):
+
+```
+## Iterate Results: checkout.md (3 of 3 runs completed)
+
+| Area | Run 1 | Run 2 | Run 3 | Avg | Variance | Avg Time | Time Var |
+|------|-------|-------|-------|-----|----------|----------|----------|
+| cart-validation | 4 | 4 | 5 | 4.3 | 0.3 | 9s | 2s |
+| shipping-form | 3 | 4 | 4 | 3.7 | 0.3 | 14s | 5s |
+| payment-submission | 4 | 4 | 4 | 4.0 | 0.0 | 11s | 1s |
+
+SESSION SUMMARY: checkout  [2026-03-01 · iterate x 3]
+UX 4.0 | Quality — | 3 areas | 1 need action
+
+NEEDS ACTION (1)
+  ⚠ shipping-form inconsistent (3,4,4) — not promoting
+
+IMPROVED (1)
+  cart-validation  3→4.3  Consistent across 3 runs
+
+STABLE (1)
+  payment-submission
+
+EXPLORE NEXT RUN
+  P1  shipping-form  Browser  Inconsistent — push edge cases
+
+Demo: PARTIAL (shipping-form inconsistent)
+```
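+
+The `Avg` and `Variance` columns above can be reproduced with a short sketch. This is an illustration, not the skill's implementation: it assumes `Variance` means the sample variance (n-1 denominator), which is the reading that yields the 0.3 shown for scores 4, 4, 5. `aggregate_area` is a hypothetical helper name.
+
+```python
+import statistics
+
+def aggregate_area(scores):
+    """Return (avg, variance) for one area's per-run scores, rounded to 1 decimal.
+
+    Assumption: variance is the sample variance (n-1 denominator).
+    """
+    avg = statistics.mean(scores)
+    # statistics.variance needs at least two data points.
+    var = statistics.variance(scores) if len(scores) > 1 else 0.0
+    return round(float(avg), 1), round(float(var), 1)
+
+runs = {
+    "cart-validation": [4, 4, 5],
+    "shipping-form": [3, 4, 4],
+    "payment-submission": [4, 4, 4],
+}
+for slug, scores in runs.items():
+    print(slug, aggregate_area(scores))
+```
+
+The same helper could be applied to per-run timings to obtain the `Avg Time` and `Time Var` columns, if timing variance is defined the same way.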
diff --git a/plugins/compound-engineering/skills/user-test/references/journeys.md b/plugins/compound-engineering/skills/user-test/references/journeys.md
new file mode 100644
index 000000000..748b56adb
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/journeys.md
@@ -0,0 +1,177 @@
+# Journeys
+
+Multi-area user flows executed without resets. Journeys test accumulated state across 3+ areas — bugs invisible to isolated per-area testing or two-area cross-area probes.
+
+## Journey Schema
+
+Each journey lives in the test file's `## Journeys` section:
+
+```markdown
+### J001: <journey name>
+
+**Steps:**
+
+| Step | Area | Action | Checkpoint |
+|------|------|--------|-----------|
+| 1 | <area-slug> | <natural language action> | <what to verify> |
+| 2 | <area-slug> | <natural language action> | <what to verify> |
+| 3 | <area-slug> | <natural language action> | <state clean check> |
+
+**Status:** untested
+**Last Run:** ---
+**Run History:** ---
+**Generated From:** manual
+```
+
+**Column definitions:**
+- **Step:** Execution order (1, 2, 3...). Positional index, not a stable ID.
+- **Area:** Area slug from `## Areas`. Same area can appear multiple times.
+- **Action:** What to do (natural language, same as probe queries).
+- **Checkpoint:** What to verify at THIS step before proceeding. Use `---` to skip (sparingly).
+
+**Journey-level fields:**
+- **Status:** `untested` / `passing` / `failing-at-N` / `flaky` / `stable`
+- **Last Run:** Date of last execution
+- **Run History:** Compact pass/fail, e.g. `P P F:3 P F:5 P`. Failures include step number after colon — `F:3` = "failed at step 3" (colon avoids ambiguity with count formats).
+- **Generated From:** `manual`, `orientation`, `cross-area-escalation`, `weakness-class-synthesis`
+- **on_failure:** `abort` (default) or `continue` (opt-in, per-journey)
+- **escalated_to:** (optional) Bug ID if this journey has been auto-escalated (e.g., `B005`). Prevents duplicate filing.
+
+## Checkpoint Types
+
+| Type | Example | How to check |
+|------|---------|-------------|
+| Result state | "Results include matching items" | javascript_tool read of first 3 results |
+| Count change | "Counter increments by 1" | Read element, compare to pre-action value |
+| Element present | "Details match listing" | Check 2-3 attributes match between views |
+| State clean | "No stale filters from prior steps" | Read active state, verify none from prior steps |
+| No check | `---` | Skip verification (use sparingly) |
+
+Checkpoints are 1 MCP call each (batched `javascript_tool`). For "Count change" checkpoints, read the target element BEFORE executing the step's Action to capture the baseline — this adds 1 MCP call per count-change step. A 5-step journey = ~10-15 MCP calls (5 actions + 5 checkpoint reads + pre-reads for count-change steps). Journey MCP calls are separate from per-area budgets.
+
+## Execution
+
+**Phase 3 order:** (1) Cross-area probes, (2) Journeys, (3) Per-area testing.
+
+**Inter-journey reset:** Navigate to the app's entry URL between journeys. Each journey starts from clean navigation state. Within a journey, no resets between steps.
+
+**Execution order when multiple journeys exist:**
+1. `failing-at-N` (highest signal) — always run
+2. `untested` — always run
+3. `flaky` — always run
+4. `passing` — spot-check: run at most 3 per run, rotating round-robin by table order (advance start position each run). If ≤3 passing journeys, all run.
+5. `stable` (every other run only) — same 3-journey spot-check cap applies when eligible to run
+
+## Lifecycle
+
+```
+untested -> [run] -> passing / failing-at-N
+                       |           |
+               [5+ consecutive]  [mixed steps across 3+ runs]
+                       |           |
+                    stable       flaky
+               (every other run)
+                                   |
+                       [3+ consecutive SAME step]
+                                   |
+                         escalate to bugs.md
+```
+
+- **`failing-at-N`:** Pinpoints which step failed. Step 2 failure = area broken. Step 5 failure after 1-4 passed = accumulated state bug.
+- **`flaky`:** Fails at different steps across 3+ runs. Consecutive-same-step counter resets on step change.
+- **Escalation:** Same step 3+ consecutive runs → auto-escalate to bugs.md. Failing step's area is primary, preceding areas are context. Suppressed when failing step's area has active Known-bug. Dedup: if journey already has `escalated_to: "B00N"`, skip (already filed). Add `escalated_to` field to journey definition on escalation.
+- **Stable frequency:** Run every other run (odd Run History length = run, even = skip).
+- **Stable revert:** On failure, set status to `failing-at-N`, reset consecutive counter. Journey runs every time again.
+
+## Abort vs. Continue
+
+**Abort (default):** Stop at failing step. Record `failing-at-N`. If step 3 state is wrong, step 4 is unpredictable.
+
+**Continue (opt-in):** `on_failure: continue`. Log each failure, execute all remaining steps. Status = `failing-at-N` where N = first failing step. Run History records all failing steps: `F:2,5` (failed at steps 2 and 5).
+
+**Continue-mode escalation:** Track consecutive failures per step independently. `F:2,5` then `F:2` then `F:2,3` = step 2 failed 3 consecutive runs → escalate step 2, regardless of other steps also failing. Step 5 failed only once → no escalation. The per-step counter uses the first failing step from Run History entries containing that step number.
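+
+A minimal sketch of per-step streak counting over a Run History string. It takes one reading of the rule above: a step's streak is the number of most recent consecutive entries whose failure list contains that step. The function names are hypothetical, and Known-bug suppression and `escalated_to` dedup are deliberately not modeled.
+
+```python
+def parse_run_history(history):
+    """Parse 'P F:3 F:2,5' into a list of failing-step sets (empty set = pass)."""
+    entries = []
+    for token in history.split():
+        if token == "---":  # placeholder used for untested journeys
+            continue
+        if token == "P":
+            entries.append(set())
+        elif token.startswith("F:"):
+            entries.append({int(s) for s in token[2:].split(",")})
+        else:
+            raise ValueError(f"unrecognized Run History token: {token!r}")
+    return entries
+
+def failure_streak(entries, step):
+    """Count the most recent consecutive runs whose failures include `step`."""
+    streak = 0
+    for failed in reversed(entries):
+        if step not in failed:
+            break
+        streak += 1
+    return streak
+
+def steps_to_escalate(history, threshold=3):
+    """Return the steps whose current failure streak meets the threshold."""
+    entries = parse_run_history(history)
+    seen = set().union(*entries)
+    return sorted(s for s in seen if failure_streak(entries, s) >= threshold)
+
+print(steps_to_escalate("F:2,5 F:2 F:2,3"))  # -> [2]
+```
+
+Abort-mode histories are handled by the same code, since each failure entry then carries a single step.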
+
+## Definition Change Detection
+
+Commit mode compares current step count and area slugs against stored values. Key: `<step-count>:<area-slug-1>,<area-slug-2>,...`. If changed, reset status to `untested` and clear Run History.
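+
+As an illustration, the comparison key could be built as follows. This is a sketch under assumptions: each step is represented as a dict with an `area` entry, and `definition_key` / `definition_changed` are hypothetical names.
+
+```python
+def definition_key(steps):
+    """Build '<step-count>:<slug-1>,<slug-2>,...' from the ordered steps."""
+    slugs = [step["area"] for step in steps]
+    return f"{len(slugs)}:{','.join(slugs)}"
+
+def definition_changed(stored_key, steps):
+    """True when the step count or the ordered area slugs differ from the stored key."""
+    return stored_key != definition_key(steps)
+
+steps = [
+    {"area": "browse/filters"},
+    {"area": "agent/filter-via-chat"},
+    {"area": "browse/filters"},
+]
+print(definition_key(steps))  # -> 3:browse/filters,agent/filter-via-chat,browse/filters
+```
+
+Because the key contains only the step count and area slugs, rewording an Action or Checkpoint would not reset the journey under this scheme.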
+
+## Known-Bug Area Interaction
+
+Journey steps execute regardless of area Known-bug status — journeys test accumulated state, not individual areas. Journey failures involving Known-bug areas do NOT auto-escalate (bug already filed).
+
+## Generation
+
+**1. Manual (primary).** If orientation generated journey suggestions this run, present those as the first-run prompt:
+
+> "Based on code reading, I found state boundaries crossing 3+ areas. Here's a suggested journey: [steps]. Use this, modify it, or define your own?"
+
+If no orientation results, fall back to:
+
+> "No journeys defined. Journeys test multi-area flows without resets. Define 1-2 journeys based on your app's primary user flows? (y/n)"
+
+**2. Orientation.** When orientation probes span 3+ distinct areas, synthesize those probes into a journey suggestion (the steps are the areas the probes touch, in state-flow order). Orientation completes before the first-run prompt so its findings feed into suggestions.
+
+**3. Cross-area probe escalation.** 2+ cross-area probes pass individually but per-area issues persist → suggest journey.
+
+**4. Weakness-class synthesis.** Class spans 3+ areas → suggest journey.
+
+Sources 2-4 generate suggestions requiring user confirmation.
+
+## Budget
+
+- **Max 5 active journeys** per test file
+- **3-8 steps** per journey. Splitting counts against the 5-journey cap. If splitting would exceed the cap, prefer a single 8-step journey. Only split when a flow genuinely exceeds 8 steps.
+- **~2 minutes per journey.** 5 journeys = ~10 minutes maximum.
+- **Stable skip:** Stable journeys run every other run.
+- **Time pressure:** Run only failing/untested journeys.
+
+## Interactions With Existing Features
+
+**Proactive restart:** Suppressed during journey execution (same as cross-area probes). Counter resets between journeys (each starts fresh after inter-journey navigation).
+
+**Progressive narrowing:** Applies to per-area testing only. Journey steps execute regardless of narrowing classification (SKIP, PROBES-ONLY, FULL).
+
+**Cross-area probes:** Complementary. No dedup — a cross-area probe and journey step covering the same seam test different things (isolation vs. accumulation).
+
+**Adversarial mode:** Does NOT apply to journey steps. Journey steps execute defined action and checkpoint.
+
+**Graduation:** Does NOT apply to journeys. Journeys are multi-step browser flows that cannot be reduced to a single CLI call. Stable journeys remain as browser-only spot-checks.
+
+**Per-area MCP budgets:** Journey calls are separate. Visiting an area in a journey does not consume its per-area budget.
+
+**`--no-commit`:** Journey results recorded in `.user-test-last-run.json` regardless. Status in test file only updated during commit. No-commit runs don't count toward escalation.
+
+**Iterate mode:** Each iteration counts as a separate run for journey Run History. Stable "every other run" applies per iteration.
+
+**Partial run safety:** Interrupted journeys discarded. Only fully-completed journeys have status written during commit.
+
+## Report Format
+
+```
+JOURNEYS
+| ID   | Name              | Status       | Failed At       | Detail                       |
+|------|-------------------|--------------|-----------------|------------------------------|
+| J001 | Primary user flow | failing-at-5 | <area-slug>     | Stale state after navigation |
+| J002 | Secondary flow    | passing      | ---             | All checkpoints passed       |
+
+Journey J001 checkpoint detail:
+  + Step 1: <area-slug-1> -- <checkpoint description>
+  + Step 2: <area-slug-2> -- <checkpoint description>
+  x Step 5: <area-slug-1> -- STALE state from step 2
+```
+
+Checkpoint detail shown for failing/flaky journeys only. Passing journeys show summary.
+
+**SIGNALS:** `~ 1 journey failing: J001 at step 5 (<area-slug>) — accumulated state`
+
+**N-run summary:** Add "Journeys stabilized" (reached `stable` during this session = 5+ consecutive passes) and "Journeys with persistent issues" (status is `failing-at-N` or `flaky` at end of session).
+
+## Commit Mode
+
+Journey commit runs after per-area commit steps:
+
+1. Update **Status**, **Last Run**, **Run History** in test file
+2. Auto-escalate at 3+ consecutive same-step failures → bugs.md. Suppress if failing step's area has active Known-bug.
+3. Mark `stable` at 5+ consecutive passes
+4. Detect definition changes (step count or area slug changes → reset to `untested`, clear Run History)
+5. Journey results do NOT affect per-area maturity scores
diff --git a/plugins/compound-engineering/skills/user-test/references/last-run-schema.md b/plugins/compound-engineering/skills/user-test/references/last-run-schema.md
new file mode 100644
index 000000000..42b8378f2
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/last-run-schema.md
@@ -0,0 +1,179 @@
+# .user-test-last-run.json Schema
+
+Written to `tests/user-flows/.user-test-last-run.json` after Phase 4 completes.
+
+## Behavior
+
+- **Overwritten each run** — only the last run is committable
+- `completed: false` if the run was interrupted — commit mode rejects it
+- If Phase 4 is interrupted before writing this file, no committable output exists
+- **Exception:** `novelty_fingerprints` accumulates across runs (read-merge-write). All other keys are overwritten.
+
+## Schema
+
+```json
+{
+  "run_timestamp": "2026-02-28T14:30:00Z",
+  "completed": true,
+  "scenario_slug": "checkout",
+  "git_sha": "abc1234",
+  "areas": [
+    {
+      "slug": "cart-validation",
+      "ux_score": 4,
+      "quality_score": null,
+      "time_seconds": 12,
+      "skip_reason": null,
+      "assessment": "Ready for promotion",
+      "issues": [],
+      "tactical_note": "filter → navigate → back → filter again surfaces stale state",
+      "confirmed_selectors": {
+        "activeFilters": "[data-filter-chip]",
+        "resultCount": ".product-card",
+        "sampleResults": ".product-card .title, .condition-badge"
+      },
+      "weakness_class": "stale-react-state",
+      "adversarial_browser": false,
+      "adversarial_trigger": null,
+      "broad_exploration_start_index": 3
+    }
+  ],
+  "qualitative": {
+    "best_moment": { "area": "cart-validation", "text": "Cart updates instantly on quantity change" },
+    "worst_moment": { "area": "shipping-form", "text": "Shipping form accepts invalid zip codes" },
+    "demo_readiness": "partial",
+    "verdict": "Checkout works but shipping validation broken",
+    "context": "shipping zip validation bypassed"
+  },
+  "explore_next_run": [
+    { "priority": "P1", "area": "shipping-form", "mode": "Browser", "why": "Validation broken" },
+    {
+      "priority": "P1",
+      "area": "[cross-area]",
+      "mode": "Browser",
+      "why": "stale-react-state in agent/filter-via-chat + browse/filters",
+      "weakness_class": "stale-react-state",
+      "affected_areas": ["agent/filter-via-chat", "browse/filters"],
+      "adversarial_instruction": "Probe ALL navigation sequences that cross area boundaries..."
+    }
+  ],
+  "ux_opportunities": [
+    { "id": "UX001", "area": "shipping-form", "priority": "P1", "suggestion": "Should show inline validation before submit" }
+  ],
+  "good_patterns": [
+    { "area": "cart-validation", "pattern": "Cart updates instantly on quantity change" }
+  ],
+  "verification_results": [
+    { "area": "agent/filter-via-chat", "claims_checked": 8, "mismatches": [
+      { "claim": "Condition: Like New", "actual": "Good", "element": "result-3 badge" }
+    ]}
+  ],
+  "probes_run": [
+    { "area": "agent/filter-via-chat", "query": "show me NWT only", "verify": "all badges say NWT", "status": "failing", "result_detail": "3 non-NWT results", "execution_index": 1 }
+  ],
+  "probes_generated": [
+    { "area": "agent/filter-via-chat", "query": "show me good condition only", "verify": "no NWT/like-new badges visible", "priority": "P1", "generated_from": "run-2 condition mismatch" }
+  ],
+  "cross_area_probes_run": [
+    {
+      "trigger_area": "browse/product-grid",
+      "action": "search 'dresses' via search bar",
+      "observation_area": "agent/filter-via-chat",
+      "verify": "agent chat responds without stale category filter",
+      "status": "failing",
+      "result_detail": "agent showed stale Dresses filter on follow-up",
+      "related_bug": "B002"
+    }
+  ],
+  "journeys_run": [
+    {
+      "id": "J001",
+      "name": "Primary user flow",
+      "status": "failing-at-5",
+      "on_failure": "abort",
+      "checkpoints": [
+        { "step": 1, "area": "<area-slug-1>", "passed": true },
+        { "step": 2, "area": "<area-slug-2>", "passed": true },
+        { "step": 3, "area": "<area-slug-3>", "passed": true },
+        { "step": 4, "area": "<area-slug-4>", "passed": true },
+        { "step": 5, "area": "<area-slug-1>", "passed": false,
+          "detail": "stale state from step 2 still active" }
+      ],
+      "time_seconds": 45
+    }
+  ],
+  "novelty_log": [],
+  "novelty_fingerprints": {
+    "agent/filter-via-chat": [
+      "agent/filter-via-chat:edge-query:price-floor",
+      "agent/filter-via-chat:edge-query:out-of-scope-question"
+    ],
+    "browse/filters": [
+      "browse/filters:filter-combo:size+color"
+    ]
+  },
+  "stable_queries_rotated": [],
+  "disconnects": {
+    "count": 0,
+    "contexts": []
+  }
+}
+```
+
+## Journey Fields (v9 additions)
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `journeys_run` | array | Per-journey results with checkpoint data |
+| `journeys_run[].id` | string | Journey ID (e.g., "J001") |
+| `journeys_run[].name` | string | Journey name |
+| `journeys_run[].status` | string | `untested`, `passing`, `failing-at-N`, `flaky`, or `stable` |
+| `journeys_run[].on_failure` | string | `abort` or `continue` |
+| `journeys_run[].checkpoints` | array | Per-step results: step, area, passed, detail |
+| `journeys_run[].time_seconds` | number | Wall-clock time for the journey |
+
+See [journeys.md](./journeys.md) for lifecycle, budget, and execution rules.
+
+## Per-Area Fields (v8 additions)
+
+| Field | Type | Default | Written by |
+|-------|------|---------|-----------|
+| `tactical_note` | string or null | null | Commit Mode — genuine tactical insight only |
+| `confirmed_selectors` | object or {} | {} | Commit Mode — selectors confirmed by successful batch call |
+| `weakness_class` | string or null | null | Commit Mode — when 2+ probes share a failure pattern |
+| `adversarial_browser` | boolean | false | Phase 2.5 — CLI score 3 trigger |
+| `adversarial_trigger` | string or null | null | Phase 2.5 — the query that triggered adversarial mode |
+| `broad_exploration_start_index` | integer or null | null | Phase 3 — execution index when broad exploration began (v10) |
+
+## Probe Execution Fields (v10 additions)
+
+| Field | Type | Default | Written by |
+|-------|------|---------|-----------|
+| `probes_run[].execution_index` | integer | absent | Phase 3 — 0-based monotonically increasing counter across all areas |
+
+The `execution_index` tracks the order of all probe and exploration actions across the entire run. Each probe execution and each broad exploration action increments the counter. Combined with `broad_exploration_start_index` per area, the eval can verify that probes ran before exploration: for each area, all `probes_run` entries must have `execution_index < broad_exploration_start_index`.
+
+**v9 migration:** Runs written before v10 lack `execution_index` and `broad_exploration_start_index`. Treat both as absent; the eval skips Eval 1 for runs without ordering data.
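+
+The ordering check described above might be sketched like this. It is an assumption-laden illustration rather than the eval's actual code: `probe_order_violations` is a hypothetical name, and returning `None` is just one way to signal "skip the check".
+
+```python
+def probe_order_violations(last_run):
+    """Return probes that ran at or after their area's broad exploration start.
+
+    Returns None when any probe lacks execution_index (e.g. a v9 run),
+    meaning the ordering check should be skipped rather than failed.
+    """
+    start_by_area = {
+        area["slug"]: area.get("broad_exploration_start_index")
+        for area in last_run.get("areas", [])
+    }
+    probes = last_run.get("probes_run", [])
+    if any("execution_index" not in probe for probe in probes):
+        return None
+    violations = []
+    for probe in probes:
+        start = start_by_area.get(probe["area"])
+        if start is None:
+            continue  # no broad exploration recorded for this area
+        if probe["execution_index"] >= start:
+            violations.append(probe)
+    return violations
+```
+
+An empty list means every probe ran before broad exploration began in its area.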
+
+## Explore Next Run Fields (v8 additions)
+
+Cross-area synthesis entries include:
+
+| Field | Type | Present when |
+|-------|------|-------------|
+| `weakness_class` | string | Entry is from cross-area synthesis |
+| `affected_areas` | string[] | Entry is from cross-area synthesis |
+| `adversarial_instruction` | string | Entry is from cross-area synthesis |
+
+## Novelty Fingerprints
+
+| Property | Value |
+|----------|-------|
+| Key | `novelty_fingerprints` (top-level) |
+| Structure | Object keyed by area slug, each value an array of fingerprint strings |
+| Format | `<area-slug>:<action-type>:<key-parameter>` |
+| Cap | 20 per area (drop oldest when exceeded) |
+| Accumulation | Read existing → merge with new → apply cap → write back |
+| Resilience | Missing/corrupted → empty (graceful degradation) |
+
+See [queries-and-multiturn.md](./queries-and-multiturn.md) for fingerprint normalization taxonomy.
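+
+The read-merge-cap-write cycle in the table can be sketched as below. Assumptions are labeled: each per-area list is taken to be ordered oldest first, duplicates are dropped on merge, and `merge_fingerprints` is a hypothetical helper. Reading and writing the JSON file is left out.
+
+```python
+CAP_PER_AREA = 20
+
+def merge_fingerprints(existing, new, cap=CAP_PER_AREA):
+    """Merge new fingerprints into existing per-area lists (oldest first)."""
+    if not isinstance(existing, dict):
+        existing = {}  # missing or corrupted -> start empty
+    merged = {}
+    slugs = list(existing) + [s for s in new if s not in existing]
+    for slug in slugs:
+        old = existing.get(slug)
+        if not isinstance(old, list):
+            old = []  # corrupted per-area value -> start empty
+        combined = []
+        for fingerprint in old + list(new.get(slug, [])):
+            if fingerprint not in combined:  # keep first occurrence, preserve order
+                combined.append(fingerprint)
+        merged[slug] = combined[-cap:]  # drop oldest when over the cap
+    return merged
+```
+
+Keeping the first occurrence of a repeated fingerprint means a re-observed action does not refresh its age; whether that matches the intended "drop oldest" behavior is an open design choice.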
diff --git a/plugins/compound-engineering/skills/user-test/references/orientation.md b/plugins/compound-engineering/skills/user-test/references/orientation.md
new file mode 100644
index 000000000..667b414ea
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/orientation.md
@@ -0,0 +1,87 @@
+# Orientation (First-Run Code Reading)
+
+On first run against a project (when `seams_read` is `false` or absent in the test file frontmatter), read the app's source code to identify structural seams before any browser interaction. Output is 0-5 structural-hypothesis probes.
+
+## When to Run
+
+- `seams_read` is `false` or absent in test file frontmatter → run Orientation
+- `seams_read` is `true` → skip entirely
+- Set `seams_read: true` on first commit after code reading, regardless of outcome
+
+## Discovery Sequence
+
+Run these 5 bash commands before reading any files. They point the 20-file read budget at the highest-probability locations.
+
+```bash
+# 0. Discover source root — do NOT assume src/ exists
+SRC=$(ls -d src/ app/ lib/ pages/ 2>/dev/null | head -1)
+[ -z "$SRC" ] && SRC=$(find . -maxdepth 2 \( -name "*.ts" -o -name "*.tsx" -o -name "*.js" -o -name "*.jsx" -o -name "*.py" -o -name "*.rb" \) 2>/dev/null | head -5 | xargs -I{} dirname {} | sort -u | head -1)
+[ -z "$SRC" ] && SRC="."
+echo "Source root: $SRC"
+
+# 1. Get the file tree — understand project structure first
+find "$SRC" \( -name "*.ts" -o -name "*.tsx" -o -name "*.js" -o -name "*.jsx" -o -name "*.py" -o -name "*.rb" -o -name "*.go" \) 2>/dev/null | head -50
+
+# 2. Find likely translation/state/API files by keyword
+grep -rl "filter\|translate\|transform\|map\|schema" "$SRC" 2>/dev/null | head -20
+
+# 3. Find state management files
+grep -rl "useState\|useStore\|createSlice\|zustand\|redux\|context\|@store\|session" "$SRC" 2>/dev/null | head -10
+
+# 4. Find API route handlers
+find "$SRC" -path "*/api/*" -o -path "*/routes/*" -o -path "*/server/*" -o -path "*/controllers/*" 2>/dev/null | head -20
+```
+
+Read the top hits from commands 2-4 first. Follow imports from those files to find related state management.
+
+## Budget
+
+- **Time:** 5 minutes maximum
+- **Files:** 20 file reads maximum
+- Stop at cap and note which pattern areas weren't reached
+
+## Four Seam Patterns
+
+### 1. Translation Layers
+
+Where does user vocabulary map to system parameters? (e.g., "y2k" → `aesthetic=y2k&era=2000s`)
+
+**Look in:** API route handlers, agent prompt files, filter/facet config, files named `translate`, `map`, `transform`, `normalize`, or `schema`.
+
+### 2. State Ownership Boundaries
+
+Where do two systems hold a version of the same state? (agent context + UI store, server session + client state)
+
+**Look for:** Reset events that cross boundaries, hydration logic, event handlers that clear one store but not another.
+
+### 3. API Seams
+
+Where does the server's model differ from what the UI renders?
+
+**Look for:** Response transformation in routes, fields in the API not displayed, fields displayed that aren't in the response.
+
+### 4. Data Coverage Gaps
+
+Compound filter intersections that might be empty.
+
+**Look for:** Filter schemas, category/aesthetic/condition enums, hardcoded allowed-values lists. Cross-reference against each other — two valid individual values may have an empty intersection.
+
+## Output Format
+
+For each identified seam, generate a structural-hypothesis probe:
+
+- `query`: An interaction that exercises the seam (e.g., "filter by NWT + y2k")
+- `verify`: The testable claim (e.g., "results show items matching both NWT condition AND y2k aesthetic")
+- `status`: `untested`
+- `priority`: P2 (hypotheses, not observed failures)
+- `confidence`: `medium` (structural read, not observed)
+- `generated_from`: `"structural-hypothesis: <filename> <line or function>"`
+
+Write probes to the relevant area's Probes table. If a seam spans multiple areas, place the probe in the area most likely to surface the fragility. If the seam involves state carry-over between two areas (one sets state, another reads it), generate a cross-area probe instead — see [probes.md](./probes.md) Cross-Area Probes section.
+
+## Graceful No-Op
+
+If no clear seams are found within the budget: produce 0 probes, note "no seams identified within 20-file cap," and still set `seams_read: true`. Graceful no-op is a valid outcome — not every codebase has obvious seams from static analysis.
+
+**Non-local app:** If the discovery commands above find no source files (app hosted remotely, no local repo), set `seams_read: true` to avoid retrying. Output 0 probes. Log: "No local source code found — Orientation skipped. Probes will be generated from runtime observations."
+
diff --git a/plugins/compound-engineering/skills/user-test/references/probes.md b/plugins/compound-engineering/skills/user-test/references/probes.md
new file mode 100644
index 000000000..2325fa82f
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/probes.md
@@ -0,0 +1,493 @@
+# Adversarial Probes
+
+Code inspection finds candidates. Interaction confirms fragility.
+
+Probes are targeted test cases generated from observed failures and structural hypotheses. They transform luck ("the agent happened to notice") into a repeatable process. Over time, the Probes section in each area becomes a self-built adversarial test suite.
+
+## Probe Execution Flow
+
+At the start of Phase 3, before broad exploration:
+
+1. Read the `**Probes:**` table from each area's details in the test file
+2. In multi-run mode, also read `probes_run` from `.user-test-last-run.json` for inter-run state updates
+3. Execute probes in priority order: P1 first, then P2. **Priority gates execution order** — a P2 failing probe waits until all P1 probes complete.
+4. Within a priority level, execute in this order:
+   1. `failing` probes (regardless of confidence)
+   2. `untested` + `confidence: low` (most uncertain — most likely to surprise)
+   3. `untested` + `confidence: medium`
+   4. `untested` + `confidence: high` (confirming what's already expected)
+   5. `passing` spot-checks
+5. For each probe: navigate to the area, execute the query, run the verify check, record pass/fail
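+
+The ordering in steps 3-4 can be expressed as a sort key. This is a sketch only: probes are assumed to be dicts with `priority`, `status`, and `confidence` entries, the helper names are hypothetical, and a missing `confidence` is treated as `high` in line with the v5 migration note later in this file.
+
+```python
+PRIORITY_RANK = {"P1": 0, "P2": 1}
+CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}
+
+def execution_key(probe):
+    """Sort key: priority first, then status tier, then confidence."""
+    status = probe["status"]
+    if status == "failing":
+        tier = (0, 0)  # failing probes first, regardless of confidence
+    elif status == "untested":
+        tier = (1, CONFIDENCE_RANK[probe.get("confidence", "high")])
+    else:
+        tier = (2, 0)  # passing spot-checks run last
+    return (PRIORITY_RANK[probe["priority"]], *tier)
+
+def order_probes(probes):
+    """Return probes in execution order; ties keep their table order."""
+    return sorted(probes, key=execution_key)
+```
+
+Python's `sorted` is stable, so probes that compare equal stay in the order they appear in the Probes table.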
+
+### Proven Area MCP Budget
+
+Failing and untested probes **always run regardless of budget cap**. The tiered MCP budget for Proven areas (see [run-targeting.md](./run-targeting.md) for budget by consecutive pass count) only constrains passing-probe spot-checks. If a Proven area has 4 failing probes, all 4 run (no spot-check). The budget prevents stable areas from consuming exploration time — it does not suppress known-failing assertions. See [run-targeting.md](./run-targeting.md) for override priority.
+
+## Probe Generation
+
+After each run (Phase 4), generate probes for areas with:
+- A **verification failure** (Layer 2 structural check mismatch)
+- A **score of 3 or below**
+- A **worst_moment** designation
+- A **Query score ≤ 3** (see [queries-and-multiturn.md](./queries-and-multiturn.md) step 8 for conversion rules)
+- A **Multi-turn context failure** (see [queries-and-multiturn.md](./queries-and-multiturn.md) for detection patterns)
+- A **CLI timing variance >50%** between runs OR any CLI query timeout. Performance probes verify timing, not results: `verify: "Completes in <Xs (no timeout)"`. Priority P1 for timeouts, P2 for variance. Generated as `"run-N timing flakiness: <query>"`.
+- A **CLI tool call spike** (2x+ the query's historical average, minimum 3 data points before flagging). Tool call probes verify agent efficiency: `verify: "Completes with ≤N tool calls"`. Priority P2. Generated as `"run-N tool call spike: <query> (N vs avg M)"`. Tool call probes are **informational** — failing tool call probes do not block promotion to Proven and do not affect UX or Quality scores.
+- A **quality spread ≥ 2** across runs for the same query in iterate mode (e.g., Q5 in R1, Q2 in R2 — same query, wildly different outcomes). "Spread" = max score minus min score across runs in the session. These generate reliability probes: `verify: "Returns consistent results (same category/count ±30%) across 2 consecutive runs."` Priority P1 — flakiness is worse than consistently low quality because it's unpredictable. Applies to Quality score dimension per-query. If the query already has an active probe, skip (existing 70% word-overlap dedup applies).
+- A **structural hypothesis from code reading** — generated in Phase 1 from source file analysis before the first test pass (not after Phase 4). These are hypotheses, not observed failures. Default confidence: medium. Format: `generated_from: "structural-hypothesis: <filename> <line or function>"`. See [orientation.md](./orientation.md).
+
+Each generated probe has:
+- `query`: A specific action to perform (e.g., "show me NWT only")
+- `verify`: The testable claim to audit (e.g., "all visible condition badges say NWT")
+- `status`: Initial status is `untested`
+- `priority`: P1 for verification failures, P2 for score-based
+- `confidence`: Default assigned by generation trigger (see table below)
+- `generated_from`: Origin trail (e.g., "run-3 condition mismatch")
+- `related_bug`: (optional) Bug ID from bugs.md if this probe tests a symptom of a known open bug. Check bugs.md for bugs affecting the same area — if exactly one open bug matches, link it. Stored inline in the `Generated From` column: `"run-3 condition mismatch | related_bug: B003"`
+
+### Confidence Defaults by Trigger
+
+| Trigger | Default Confidence |
+|---------|-------------------|
+| verification failure (observed mismatch) | high |
+| score <= 3 (observed low quality) | high |
+| worst_moment | high |
+| query failure (score <= 3) | high |
+| multi-turn context failure | high |
+| CLI timing variance | medium |
+| CLI tool call spike | medium |
+| quality spread >= 2 (iterate) | medium |
+| structural-hypothesis (code reading) | medium |
+
+`low` confidence is reserved for future generation triggers or manual assignment. No current trigger produces `low` automatically, but the execution order (Probe Execution Flow, step 4) runs `low` first among untested probes to maximize discovery value.
+
+**Confidence update rules (commit mode):**
+
+- Probe passes: confidence unchanged
+- Probe fails: upgrades to `high` (fragility confirmed)
+- Probe flaky: stays `medium` (fragility real but inconsistent)
+- Probe graduates: records `high` in graduated entry (frozen on graduation)
+- Probe escalated: retains `high` (3+ consecutive failures = confirmed)
+
+**v5 migration:** Probes without confidence field → treat as `confidence: high` (existing probes were generated from observed failures). Do NOT rewrite on read.
+
+### Multi-Cause Isolation
+
+When a probe targets a symptom that could have multiple causes (e.g., two open bugs producing the same "0 results" failure), generate separate probes per hypothesized cause. Each probe's setup must isolate the variable being tested:
+
+**Pattern:**
+
+```
+Symptom: y2k accessories returns 0 results
+Cause A: empty data intersection (BUG003)
+Cause B: search bar state contamination (B002)
+
+Isolated probe A:
+  Setup: fresh session (no prior search bar usage)
+  Query: "y2k accessories"
+  Verify: "results include y2k-tagged items — tests data coverage
+    independent of search bar state"
+  related_bug: BUG003
+
+Isolated probe B (cross-area):
+  Trigger: browse/product-grid — search "dresses" via search bar
+  Observation: agent/filter-via-chat — ask for "y2k accessories"
+  Verify: "agent clears stale category filter before applying y2k"
+  related_bug: B002
+```
+
+**`related_bug` field:** Optional field on any probe (per-area or cross-area) linking the probe to a specific bug ID. When the probe passes, it provides evidence that the linked bug is fixed. When it fails, it confirms the linked bug is still active. Multiple probes can reference the same bug — each tests the bug from a different angle.
+
+**When to isolate:** The agent should consider isolation when:
+- A probe has `escalated_to` linking to a bug, AND another open bug affects the same area or a related area
+- A failing probe's `result_detail` is ambiguous ("0 results" without specifying whether the data is missing or the query is wrong)
+- Two bugs in bugs.md have overlapping area slugs
+
+**When NOT to isolate:** If only one bug exists for the symptom, or if the causes are clearly distinguishable from the probe result alone, isolation adds complexity without value. Single probes are preferred when the cause is unambiguous.
+
+**Bug lifecycle interaction:** When a bug is marked `fixed` in commit mode, the agent should note whether probes with `related_bug` pointing to that bug are passing or failing. If the bug is fixed but its related probes fail, note the discrepancy in the report: "BUG003 marked fixed but related probe still failing — investigate." This keeps `related_bug` informational while giving it a concrete use during the bug lifecycle.
+
+## Per-Query Quality Reporting
+
+When an area has `scored_output: true` and multiple Queries were evaluated, the report must surface per-query breakdown — not just the average.
+
+**Report format for scored_output areas (in DETAILS section):**
+
+```
+Quality: 4.0 (range: 2-5)
+  ✓ vintage denim jacket: Q5
+  ✓ boots under $40: Q4
+  ✗ y2k accessories: Q2 ← outlier
+  ✓ cottagecore dresses: Q5
+```
+
+The outlier flag (✗) appears on any query scoring ≤ 3. This prevents the "4.0 looks fine" problem where an average hides a broken query.
+
+Per-query breakdown appears in the **DETAILS section** of the dispatch report (only when outliers exist — if all queries scored ≥ 4, omit). Outlier queries also surface in **NEEDS ACTION** when the area is Proven (unexpected regression).
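
A minimal sketch of how the per-query breakdown could be rendered, assuming scores arrive as a plain mapping from query text to a 1-5 score. The function name `format_quality` and the constant `OUTLIER_MAX` are hypothetical and are not part of the skill; only the ≤ 3 outlier threshold and the line layout come from the text above.

```python
from statistics import mean

OUTLIER_MAX = 3  # queries scoring <= 3 get the outlier flag (threshold from the doc)


def format_quality(scores: dict[str, int]) -> str:
    """Render the aggregate line plus one line per query (hypothetical helper)."""
    values = list(scores.values())
    lines = [f"Quality: {mean(values):.1f} (range: {min(values)}-{max(values)})"]
    for query, score in scores.items():
        outlier = score <= OUTLIER_MAX
        mark = "✗" if outlier else "✓"
        suffix = " ← outlier" if outlier else ""
        lines.append(f"  {mark} {query}: Q{score}{suffix}")
    return "\n".join(lines)


report = format_quality({
    "vintage denim jacket": 5,
    "boots under $40": 4,
    "y2k accessories": 2,
    "cottagecore dresses": 5,
})
print(report)
```

Computing the average from the per-query scores, rather than typing it by hand, keeps the aggregate and the breakdown from drifting apart.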
+
+In iterate mode, show per-query scores across runs:
+
+```
+  y2k accessories: R1:Q3 → R2:Q2 (degrading)
+  vintage denim:   R1:Q3 → R2:Q5 (flaky — spread ≥ 2 triggers probe)
+```
+
+**Scope:** Applies to CLI queries (explicit `cli_queries` and area Queries tables) AND browser Queries tables when present. Areas without Queries tables show only the aggregate score (existing behavior unchanged).
+
+Per-query scores are stored in `.user-test-last-run.json` under each area's `quality_scores_by_query` field (array of {query, scores[], avg}).
+
+**Interaction with existing probe generation:** Per-query outlier flagging (✗ marker) is cosmetic — it does not trigger additional probe generation beyond what already exists (queries scoring <= 3 already generate probes via commit mode step 8). The flag helps the reader spot the problem; the probe system handles the automated response.
+
+### Evaluation Provenance
+
+When CLI queries are evaluated, the testing agent (Claude) judges output from the app's agent (often Gemini or another model). When the two models differ, this is cross-model evaluation, which avoids self-preference bias.
+
+Note this in the report's Quality Scores table:
+
+```
+Quality Scores (scored_output areas — cross-model: Gemini→Claude)
+| Area               | R1 Q   | R2 Q   | Avg |
+| agent/search-query | 4 (CLI) | 5 (CLI) | 4.5 |
+```
+
+When browser evaluation scores quality (same model observes and judges), note `(browser)` instead of `(CLI)`. The provenance tag tells the reader which scores have cross-model validation and which might have self-bias.
+
+**Static labels:** CLI = cross-model, Browser = same-model. These are static based on evaluation mode, not dynamically detected. If the app under test uses the same model as the evaluator, add a note in the report footer: "Note: app LLM is also Claude — CLI evaluation is same-model for this app."
+
+If the app's model changes (e.g., Gemini Flash → Claude Sonnet), update the provenance header. Model changes invalidate previous quality baselines — note "model change: re-baseline quality" in the report.
+
+## Probe Lifecycle
+
+```
+untested → [run] → passing / failing
+                      ↓          ↓
+              [2+ consecutive]  [3+ consecutive]
+                      ↓          ↓
+              graduation offer  escalation to bugs.md
+                      ↓
+                  graduated (CLI regression check)
+```
+
+### Status Definitions
+
+| Status | Meaning |
+|--------|---------|
+| `untested` | Generated, not yet run |
+| `passing` | Ran, verification passed |
+| `failing` | Ran, verification failed |
+| `flaky` | Mixed results across 3+ runs |
+| `graduated` | Promoted to CLI regression check (read-only) |
+
+### Flaky Transition
+
+A probe becomes `flaky` when:
+- It has run at least 3 times
+- It has both at least 1 pass and 1 fail
+- Its most recent results do not end in a streak of 2+ consecutive passes or failures
+
+Revert rules:
+- Flaky → `failing`: 2 consecutive failures
+- Flaky → `passing`: 2 consecutive passes (eligible for graduation)
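
The flaky transition and revert rules can be sketched as a single function over a probe's run history. This is an illustration only: it assumes the history is a list of booleans (oldest first, `True` = pass), reads "no 2+ consecutive streak" as referring to the most recent results, and ignores the non-deterministic confirmation rule described in the next section. The names `trailing_streak` and `probe_status` are hypothetical.

```python
def trailing_streak(history: list[bool]) -> int:
    """Length of the run of identical results at the end of the history."""
    if not history:
        return 0
    streak = 1
    for prev in reversed(history[:-1]):
        if prev != history[-1]:
            break
        streak += 1
    return streak


def probe_status(history: list[bool]) -> str:
    """Derive a status from run results, oldest first (True = pass)."""
    if not history:
        return "untested"
    latest = "passing" if history[-1] else "failing"
    if trailing_streak(history) >= 2:
        # Two consecutive identical results settle the status (or revert a flaky probe).
        return latest
    mixed = any(history) and not all(history)
    if len(history) >= 3 and mixed:
        return "flaky"
    return latest
```

For example, `probe_status([True, False, True])` yields `flaky`, and one more pass (`[True, False, True, True]`) reverts it to `passing`.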
+
+### Non-Deterministic Probe Confirmation
+
+When a probe testing LLM-dependent behavior (agent reasoning, scored_output quality, search ranking) flips from `failing` or `flaky` to `passing`, treat the first pass as unconfirmed. Note "passing*" in the report. Keep the probe's status as `failing` (or `flaky`) in the test file during commit -- do not write `passing` yet. Track the unconfirmed pass in `probes_run` in `.user-test-last-run.json`. Require a 2nd consecutive pass before updating probe status to `passing` in the test file. If the next run fails, revert to `failing` -- the first pass was variance. This rule does not apply to probes transitioning from `untested` to `passing` -- they have no failure history to create variance concern.
+
+### Escalation (3+ Consecutive Failures)
+
+**Why 3, not 5:** The original design specified 5. Changed to 3 because a
+probe failing 3 times has been observed across at least 2 separate sessions
+(generation run + 2 failure runs). That's sufficient evidence. Auto-filing
+at 3 removes the manual confirmation step — the 3-run failure history IS
+the confirmation.
+
+A probe failing for 3+ consecutive runs auto-escalates during commit mode:
+
+1. **Dedup check:** If probe already has `escalated_to: "B00N"` field, skip (already filed)
+2. Create a bug entry in `bugs.md` with next sequential ID
+3. Set bug summary from probe's `verify` clause
+4. Set bug `Found` date from the probe's `Generated From` field
+5. Link probe to bug: add `escalated_to: "B00N"` to probe entry
+6. Probe stays active (keeps running). Bug entry tracks the fix.
+7. If `gh` is not authenticated: file to `bugs.md` with `Issue: ---`. Log warning: "Bug filed locally but GitHub issue not created — run `gh auth login` to sync." On next commit with `gh` authenticated, detect `Issue: ---` entries and offer to file.
+
+**Interaction with area-level escalation:** If a probe in the area has been auto-escalated (has `escalated_to` field), suppress the area-level "persistent <= 3 scores" manual escalation offer for that area. The probe-level escalation is more specific and already covers the intent.
+
+Escalation checks run at commit time only — never mid-iterate-session. Consecutive failure count increments once per commit, not once per iterate run. An iterate×5 where a probe fails all 5 runs adds 1 to the consecutive count.
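
As a rough sketch of escalation steps 1-5, assuming in-memory records instead of the markdown tables: the `Probe` and `Bug` classes, their field names, and `escalate_on_commit` are hypothetical stand-ins. The sketch also assumes bug IDs are contiguous, so the next ID is simply the current count plus one, and it leaves GitHub issue filing (step 7) out entirely.

```python
from dataclasses import dataclass
from typing import Optional

ESCALATION_THRESHOLD = 3  # consecutive failing commits before auto-filing


@dataclass
class Probe:
    verify: str
    generated_from: str
    consecutive_failures: int = 0
    escalated_to: Optional[str] = None


@dataclass
class Bug:
    bug_id: str
    summary: str
    found: str
    issue: str = "---"  # placeholder until a GitHub issue is filed


def escalate_on_commit(probe: Probe, bugs: list[Bug]) -> Optional[Bug]:
    """File a bug for a probe that has failed on 3+ consecutive commits."""
    if probe.consecutive_failures < ESCALATION_THRESHOLD:
        return None
    if probe.escalated_to is not None:
        return None  # dedup check: already filed
    bug = Bug(
        bug_id=f"B{len(bugs) + 1:03d}",  # assumes existing IDs are contiguous
        summary=probe.verify,
        found=probe.generated_from,
    )
    bugs.append(bug)
    probe.escalated_to = bug.bug_id  # link probe to bug; the probe stays active
    return bug
```

Because the dedup check reads `escalated_to` on the probe itself, calling the function again on later commits is a no-op.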
+
+### Graduation (2+ Consecutive Passes)
+
+A passing probe (2+ consecutive) is eligible for CLI graduation — same path as bug graduation in [graduation.md](./graduation.md).
+
+- Uses the same `cli_queries` format with `graduated_from: "probe-<area>-<run>"`
+- **Skip for visual checks:** Layout, animation, cursor state, visual feedback — these can't be tested via CLI
+- **Manual trigger:** User confirms each graduation
+- The test file Probes table entry changes to status `graduated`
+
+### Proven Area Verification (Git-Aware)
+
+When git diff shows files affecting a Proven area:
+
+1. Area keeps Proven status but gains a `(verify)` annotation in the report
+2. Full exploration runs (existing git-aware targeting already does this)
+3. If the area still passes: remove `(verify)`, increment consecutive passes, note in report: "Verified after <file> change"
+4. If the area fails: the code change caused a regression — generate probe targeting the specific change, flag in report as "regression after <file>"
+
+**What this adds beyond existing git-aware targeting:** Git-aware targeting gives full exploration. This adds: (a) visible annotation so the reader knows WHY full exploration ran, (b) causal link in the report connecting score changes to specific commits, (c) targeted probe generation on regression that names the file change.
+
+The `(verify)` annotation is ephemeral — it appears only in the current run's report, not persisted in the test file. Next run without code changes, the area reverts to normal Proven spot-check behavior.
+
+When git diff is unavailable (no .git, first run, force push): skip verification. Proven areas tested normally per existing rules.
+
+**Known-bug areas with git changes:** When git changes affect a Known-bug area, the git-aware rule overrides the normal Known-bug skip. Run fix_check even if `gh` is not authenticated. If fix_check passes without `gh` confirmation, note: "fix_check passed but cannot verify issue state — authenticate gh to complete lifecycle." If fix_check passes with `gh` available and the linked issue is still open, note: "fix_check passed but issue #N still open — close issue to complete fix lifecycle."
+
+## Dedup
+
+Dedup key: **area slug + verify text**. Two probes with the same area and >70% word overlap in their `verify:` clause are the same probe — update the existing entry, don't create a duplicate.
+
+Probes with the same query but different `verify:` clauses are distinct probes — both are kept.
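
The text does not say how "word overlap" is measured. The sketch below assumes Jaccard similarity over lowercase word sets, which is one reasonable reading and not necessarily the one the skill uses. The helper names (`words`, `word_overlap`, `is_duplicate`) are hypothetical.

```python
import re

OVERLAP_THRESHOLD = 0.70  # ">70% word overlap" from the dedup rule


def words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a verify clause."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two word sets (an assumed definition)."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def is_duplicate(area_a: str, verify_a: str, area_b: str, verify_b: str) -> bool:
    """Same area slug and verify text overlapping by more than the threshold."""
    return area_a == area_b and word_overlap(verify_a, verify_b) > OVERLAP_THRESHOLD
```

Under this definition, adding or dropping a single filler word in a nine-word verify clause still counts as the same probe, while a differently worded check in the same area stays distinct.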
+
+## Cap and Rotation
+
+- **Failing/flaky probes:** Keep indefinitely
+- **Passing probes:** Rotate out after 10 runs (they've proven stability)
+- **Graduated probes:** Stay in the table as read-only historical record
+- No cap per area — accumulation is natural. If an area has many failing probes, that's signal worth preserving.
+
+## Multi-Run Mode
+
+When invoked as `/user-test N`, the skill orchestrates N sequential runs with inter-run probe learning.
+
+### Inter-Run State
+
+Probes live in the test file markdown (canonical source, written on commit). Between runs within a multi-run session, probe state is tracked in `.user-test-last-run.json` as a scratchpad:
+- Each run reads probes from the test file (start of session) AND from the last-run JSON's `probes_run` field (inter-run updates)
+- Full commit happens only at the end of the N-run session (or on interruption)
+
+### Progressive Treatment
+
+```
+Run 1: Broad exploration → discover issues → generate initial probes
+Run 2: Execute run-1 probes first → verify/refute → generate sharper probes
+Run 3: Targeted at specific failure modes → verification catches what broad exploration missed
+Run 4+: Proven areas spot-checked only → all time on weak areas and active probes
+Run N: Final summary with trajectory across all N runs
+```
+
+### Interruption Handling
+
+If `/user-test N` is interrupted at run K:
+- The last-run JSON contains probe state through run K
+- Run `/user-test-commit` to persist what exists
+- Probes generated during the interrupted run get status `untested`
+- No special resume logic — next `/user-test` reads probes from the test file
+
+### N-Run Summary
+
+After all N runs complete, display a trajectory summary:
+
+```
+N-Run Summary: <scenario-name>
+
+Areas that stabilized:      <area> (N/N), <area> (N/N)
+Areas with persistent issues: <area> (0/N — <reason>)
+Areas that regressed:       <area> (K/N — <detail>)
+New issues discovered:      <count> (run X: <issue>, run Y: <issue>)
+Probes generated:           <total>, <active failures> active failures
+Demo ready:                 <yes/no> — <reason>
+```
+
+### Time Estimate
+
+Before starting a multi-run session, display estimated total time:
+```
+Starting N-run session for <scenario>. Estimated time: ~X minutes (N runs × ~Y min each including verification passes).
+```
+
+### Within-Session Probe Injection
+
+After each run K < N completes:
+
+1. Read `probes_generated` from the run-K results
+2. Add them to the probe execution list for run K+1 with status `untested`
+3. These injected probes execute FIRST in run K+1 (before existing probes)
+4. Record results in `probes_run` with `generated_from: "run-K <detail>"`
+
+This turns iterate from "test, then test again" into "test, discover, verify discovery." A stale filter probe generated in R1 gets tested in R2 instead of sitting untested until next session.
+
+If the newly generated probe has a `prechecks` tag and `cli_test_command` exists, run it via CLI first (same Phase 2.5 rules apply).
+
+**N=1 edge case:** When iterate mode runs with N=1, probes generated in the single run remain `untested` and are committed normally. No special handling — they execute on the next session (single or iterate).
+
+**Inter-run probe status:** R2 sees R1's probe results via the `.user-test-last-run.json` scratchpad. A probe that flipped from `failing` to `passing` in R1 is deprioritized in R2 (failing/untested before passing). This is correct and intentional.
+
+Progressive narrowing (SKIP/PROBES-ONLY/FULL classification for run 2+) has moved to [run-targeting.md](./run-targeting.md).
+
+## Cross-Area Probes
+
+Cross-area probes test interactions that span two areas — where an action in one area affects state in another. They live in a scenario-level table (not per-area) and run before per-area testing in Phase 3.
+
+### Lifecycle
+
+Same as per-area probes (status transitions, escalation, confidence). One exception: CLI graduation requires BOTH trigger and observation areas to have CLI coverage.
+
+### Generation Triggers
+
+Cross-area probes are generated when:
+- A per-area probe fails AND the failure symptom could be caused by state from another area (agent judgment — look for stale filters, carry-over context, shared state)
+- The novelty budget discovers a cross-area interaction worth tracking
+- Orientation (code reading) identifies a state ownership boundary that crosses two areas
+- The user explicitly requests a cross-area probe
+
+Cross-area probes are NOT generated automatically from every per-area failure. The agent must identify a plausible cross-area cause before generating one. This keeps the table focused on genuine seam tests, not duplicates of per-area probes.
+
+### Execution
+
+1. Navigate to trigger area
+2. Perform action (do NOT reset between trigger and observation)
+3. Navigate to observation area
+4. Run verify check
+5. Record result
+
+The "no reset" between steps 2 and 3 is the critical difference from per-area probes. The whole point is testing state carry-over. If you reset between areas, you're testing two independent areas, not a seam.
+
+### Report Section
+
+Cross-area probe results appear in their own report section, between the header and NEEDS ACTION:
+
+```
+Cross-Area Probes:
+| Trigger → Observation | Action | Status | Detail |
+|-----------------------|--------|--------|--------|
+| browse/product-grid → agent/filter-via-chat | search "dresses" via search bar | failing | agent chat shows stale "Dresses" filter on follow-up |
+```
+
+### Dedup
+
+Key: `trigger_area + observation_area + verify text`. Same 70% word-overlap rule as per-area probes, applied to the area pair. A probe from A→B and a probe from B→A are different probes (different causal direction).
+
+### Bug Filing
+
+When a cross-area probe escalates (3+ consecutive failures), the bug entry in bugs.md lists the trigger area as primary and the observation area in the summary: "Also affects: <observation_area>". This matches the existing multi-area bug format in bugs-registry.md.
+
+### Spot-Check Budget
+
+Passing cross-area probes are spot-checked — execute at most 3 passing probes per run, rotating round-robin by table order (advance start position each run). Failing and untested cross-area probes always execute. This bounds the front-load: a stable test file with 5 passing cross-area probes spot-checks 3, not all 5.
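
One way to picture the spot-check rotation, assuming probes are dicts with a `status` key listed in table order and that the start position advances by the number of probes picked (the text only says the position advances each run). `select_cross_area_probes` is a hypothetical name.

```python
SPOT_CHECK_BUDGET = 3  # at most 3 passing cross-area probes per run


def select_cross_area_probes(probes: list[dict], start: int) -> tuple[list[dict], int]:
    """Return (probes to execute this run, start position for the next run)."""
    always = [p for p in probes if p["status"] in ("failing", "untested")]
    passing = [p for p in probes if p["status"] == "passing"]
    if not passing:
        return always, 0
    count = min(SPOT_CHECK_BUDGET, len(passing))
    begin = start % len(passing)
    picked = [passing[(begin + i) % len(passing)] for i in range(count)]
    # Assumption: advance by the number picked so every passing probe gets a turn.
    return always + picked, (begin + count) % len(passing)
```

With 5 passing probes, the first run checks positions 0-2 and the second run checks positions 3, 4, and 0, so no passing probe goes unchecked for more than two runs.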
+
+### Progressive Narrowing Interaction
+
+Progressive narrowing classifications (SKIP/PROBES-ONLY/FULL) apply to per-area testing only. Cross-area probes execute in their own slot regardless of the trigger or observation area's narrowing classification. An area classified SKIP for per-area testing can still be a trigger or observation target for cross-area probes.
+
+### Cap
+
+Maximum 10 active cross-area probes per test file. Cross-area probes are more expensive than per-area (two navigation steps, no reset). If the table exceeds 10 active entries, the oldest passing probes rotate out first (same as per-area rotation).
+
+### Proactive Restart Interaction
+
+Cross-area probes must NOT be interrupted by a proactive restart — they depend on state carry-over between trigger and observation areas. The restart check is skipped during cross-area probe execution. The MCP call counter still increments; the restart happens after the cross-area probe sequence completes.
+
+### .user-test-last-run.json Schema
+
+Cross-area probe results are stored alongside `probes_run`:
+
+```json
+"cross_area_probes_run": [
+  {
+    "trigger_area": "browse/product-grid",
+    "action": "search 'dresses' via search bar",
+    "observation_area": "agent/filter-via-chat",
+    "verify": "agent chat responds without stale category filter",
+    "status": "failing",
+    "result_detail": "agent showed stale Dresses filter on follow-up",
+    "related_bug": "B002"
+  }
+]
+```
+
+## Weakness Classification
+
+When 2+ probes in the same area share a recognizable failure pattern, commit mode writes a `weakness_class` field to the area details. This enables cross-area adversarial targeting (see Cross-Area Weakness Synthesis below).
+
+### Predefined Classes
+
+| Class | Pattern |
+|-------|---------|
+| `stale-react-state` | Filters/state not resetting on navigation |
+| `count-display-lag` | Displayed counts don't match actual DOM counts |
+| `multi-turn-context-loss` | Agent forgets constraints from earlier turns |
+| `async-render-race` | Results appear but attributes/badges haven't updated |
+| `filter-intersection-empty` | Compound filter combinations return 0 results unexpectedly |
+| `agent-reasoning-shallow` | CLI quality consistently 3, partially correct but missing nuance |
+
+### Freeform Classes
+
+For novel failure modes that don't fit a predefined class, write a freeform string (e.g., `weakness_class: checkout-state-leaked-across-sessions`). Predefined classes are accelerators for cross-area synthesis template lookup — freeform classes produce custom adversarial instructions.
+
+### Classification Method
+
+Commit mode reads each failing probe's `query`, `verify`, and `result_detail` fields and matches against predefined class descriptions using agent judgment. No mechanical matching rule — agent decides which class (if any) best describes the shared failure pattern. If classification is ambiguous, prefer freeform over forcing a predefined class.
+
+### Lifecycle
+
+- **Write:** When 2+ probes in the area share a failure pattern (one probe = insufficient signal)
+- **Update:** Each run. If a new pattern emerges with more probes than the current class, replace it
+- **Remove:** If the class's probes have all passed for 3+ consecutive runs (weakness resolved)
+- **Dominance:** One `weakness_class` per area — the dominant pattern. Probe count decides dominance.
+
+### Matching for Synthesis
+
+Cross-area weakness synthesis (see below) uses exact string equality after normalization (lowercase, hyphenated). Predefined classes are canonical strings. Freeform classes match only identical freeform strings across areas.
+
+## Cross-Area Weakness Synthesis
+
+Phase 4 Step 6 runs a cross-area synthesis pass after generating per-area Explore Next Run items. When a `weakness_class` appears in 2+ areas, it generates one `[cross-area]` Explore Next Run entry targeting the class systemically.
+
+### Synthesis Pass
+
+Synthesis reads `weakness_class` fields from the test file as written by the previous run's commit — first-run appearance of a weakness_class does not trigger synthesis until the following run.
+
+1. Collect all areas with a `weakness_class` field set in the test file
+2. Group by weakness_class value (exact string match)
+3. For each class appearing in 2+ areas: generate one `[cross-area]` Explore Next Run entry
+
+### Cap and Tiebreaker
+
+**Cap:** Maximum 2 cross-area synthesis entries per run.
+
+**Tiebreaker when >2 classes qualify:** Rank by (1) number of affected areas — more areas = higher priority; then (2) number of failing probes in the class. Deterministic, favors widespread patterns.
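
The synthesis pass, cap, and tiebreaker can be sketched together. The input shape (a list of area dicts with `weakness_class` and `failing_probes` keys) is an assumption, as are the function names; the final sort key on the class name is also an addition here, used only to keep the order deterministic when both documented criteria tie.

```python
import re
from collections import defaultdict

MAX_SYNTHESIS_ENTRIES = 2  # cap per run


def normalize(weakness_class: str) -> str:
    """Lowercase and hyphenate, so matching is exact string equality."""
    return re.sub(r"[\s_]+", "-", weakness_class.strip().lower())


def synthesis_targets(areas: list[dict]) -> list[str]:
    """Classes present in 2+ areas, ranked by area count, then failing probes."""
    by_class: dict[str, list[dict]] = defaultdict(list)
    for area in areas:
        cls = area.get("weakness_class")
        if cls:
            by_class[normalize(cls)].append(area)
    qualifying = {c: members for c, members in by_class.items() if len(members) >= 2}
    ranked = sorted(
        qualifying,
        key=lambda c: (
            -len(qualifying[c]),                               # (1) more areas first
            -sum(a["failing_probes"] for a in qualifying[c]),  # (2) more failing probes
            c,                                                 # assumed final tiebreak
        ),
    )
    return ranked[:MAX_SYNTHESIS_ENTRIES]
```

A class that appears in only one area never reaches the ranking step, which matches the "2+ areas" condition in the synthesis pass.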
+
+### Adversarial Instruction Templates
+
+| Class | Adversarial Instruction |
+|-------|------------------------|
+| `stale-react-state` | Probe ALL navigation sequences that cross area boundaries — apply filter → navigate away → return → verify state reset |
+| `count-display-lag` | After every action changing result count, wait 2s then re-read count vs DOM — check for lag window |
+| `multi-turn-context-loss` | On every multi-turn sequence, inject a context-breaking action at turn 3, then return to prior context — verify retention |
+| `async-render-race` | After every action triggering async rendering, immediately read badges/attributes — check for race window |
+| `filter-intersection-empty` | Probe all 2-filter compound combinations systematically — check for empty-intersection cases |
+| `agent-reasoning-shallow` | Replace simple queries with competing-constraint and ambiguous queries across all affected areas |
+
+**Freeform classes:** When `weakness_class` is freeform (no matching template), the agent generates a custom adversarial instruction based on the class name and probe failure details.
+
+### Persistence Signal
+
+If the same class appeared in the previous run's Explore Next Run, was targeted, and still did not resolve, flag the entry as persistent, for example: `PERSISTENT — stale-react-state active N runs — escalate to Known-bug consideration`
+
+### Report Placement
+
+Cross-area synthesis entries appear at the top of EXPLORE NEXT RUN:
+
+```
+EXPLORE NEXT RUN
+  P1  [cross-area]  Browser  stale-react-state in 3 areas — probe all navigation events
+  P1  shipping-form  Browser  Validation broken — edge cases
+  P2  checkout/promo  Both    Adjacent to cart, untested
+```
+
+### .user-test-last-run.json Format
+
+Cross-area synthesis entries are stored in the `explore_next_run` array with `weakness_class`, `affected_areas`, and `adversarial_instruction` fields. See [last-run-schema.md](./last-run-schema.md) for full schema.
+
+### Why Explore Next Run Entries, Not Cross-Area Probes
+
+Synthesis produces targeting instructions that are regenerated each run from current state. Cross-area probes are persistent regression tests with a full lifecycle. Different tools: synthesis directs exploration, probes track regressions. If a synthesis target repeatedly fails, the agent should generate a cross-area probe from the failure — that's the natural escalation path.
diff --git a/plugins/compound-engineering/skills/user-test/references/queries-and-multiturn.md b/plugins/compound-engineering/skills/user-test/references/queries-and-multiturn.md
new file mode 100644
index 000000000..f94a55c6b
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/queries-and-multiturn.md
@@ -0,0 +1,333 @@
+# Queries and Multi-turn Sequences
+
+Queries test the app's **understanding** of its domain. Multi-turn sequences test **context retention** across turns. Both are optional — areas without them still work. Queries are only valid in `scored_output: true` areas. If an area has Queries but not `scored_output: true`, flag it during Phase 1 and suggest adding `scored_output: true`.
+
+**Queries vs Probes:** Queries are exploratory (scored 1-5, stateless). Probes are regression tests (pass/fail, full lifecycle). Failed queries generate probes — queries feed the probe system, they don't replace it.
+
+## During Execution (Phase 3)
+
+### Per-Area Checklist
+
+For each selected area, complete these steps in order:
+
+0. **CLI precheck gate:** If this area has a `prechecks` tag in any `cli_queries` entry AND that CLI query scored ≤ 2, skip browser testing for this area with note "CLI pre-check failed — agent reasoning broken, browser test skipped." If no prechecks tag exists for this area, or CLI scored ≥ 3, proceed normally.
+1. Run probes (failing/untested first) — see [probes.md](./probes.md)
+2. Execute Queries and Multi-turn sequences (if defined)
+3. Explore beyond the defined queries — try something the queries don't cover
+4. Run verification pass — see [verification-patterns.md](./verification-patterns.md)
+5. Score UX (+ Quality if scored_output)
+6. Record timing
+7. Note: what surprised you? What would you test next time?
+
+Empty Queries or Multi-turn tables are no-ops at step 2. Step 7 feeds directly into Explore Next Run generation and new Query creation during commit.
+
+### Scoring Boundaries
+
+Probes, verification, and UX scores are three separate signals — none subsumes the others. See SKILL.md Phase 3 checklist for the canonical definition.
+
+### CLI + Browser Score Mapping
+
+When an area has both CLI and browser results:
+- **CLI score → Quality** (did the agent reason correctly?)
+- **Browser score → UX** (did the interface deliver it smoothly?)
+- Report shows: `UX: 5 | Quality: 2 (CLI)` — source explicit
+
+CLI-only areas: CLI score populates both UX and Quality. Browser-only areas: browser scores both UX and Quality (existing behavior) — do NOT show `(CLI)` tag; show `(browser)` or nothing to distinguish the source.
+
+### Multi-turn Scoring
+
+Multi-turn sequences contribute to the area's Quality score (scored against the final turn's ideal outcome). Context failures at intermediate turns generate probes targeting the specific broken turn, and are noted in the area assessment, but do not directly reduce the UX or Quality score.
+
+**Context failure:** When a subsequent turn's result indicates the app lost state from a prior turn. Detection patterns:
+
+- **Filter state:** Turn 1 set "queen size" → Turn 3 results include non-queen items
+- **Conversational:** Turn 1 said "budget is $200" → Turn 3 recommends $400 items without acknowledging the constraint
+- **Preference:** Turn 2 said "NOT white" → Turn 3 results include white items
+
+**Detection rule:** After each turn, check whether prior turns' constraints are still reflected in the current state/results. If not, that's a context failure — generate a probe: query = the failing turn's action, verify = "context from turn N preserved."
+
+### Proven Area Query Budget
+
+Active queries count against the tiered MCP budget for Proven areas (see [run-targeting.md](./run-targeting.md) for budget by consecutive pass count). `[stable]` queries run via CLI only and do not count. Only failing/untested probes bypass the cap (existing rule from [probes.md](./probes.md)).
+
+**Worked example (3-call tier, consecutive passes 2-5):**
+```
+Proven area with 5 queries (2 active, 3 stable), 2 failing probes, 3-call budget:
+→ 2 failing probes run (uncapped): 2 browser calls
+→ 3 stable queries run via CLI (uncapped): 0 browser calls
+→ 1 remaining browser call → spot-check 1 active query
+→ 1 active query skipped this run
+
+At 2-call tier (6-9 consecutive passes), same area:
+→ 2 failing probes run (uncapped, outside budget): 2 browser calls
+→ 3 stable queries run via CLI (uncapped): 0 browser calls
+→ 2 budget calls → spot-check 2 active queries
+→ 0 active queries skipped
+
+At 1-call tier (10+ consecutive passes), same area:
+→ 2 failing probes run (uncapped, outside budget): 2 browser calls
+→ 3 stable queries run via CLI (uncapped): 0 browser calls
+→ 1 budget call available, but area already exercised by probes
+→ 2 active queries skipped this run
+```
+
+### CLI Area Queries
+
+When `cli_test_command` is present, Phase 2.5 also runs each `scored_output` area's **Queries:** table through CLI:
+
+- Query text → substituted into `cli_test_command`
+- Ideal Outcome → used as `expected` for semantic evaluation
+- Score → area's CLI Quality score
+- **Evaluate the full JSON response** — tool calls (correct tools, correct arguments), inferred facets, result data, and suggestions — not just the message text. The `expected` field should describe correct behavior across the full response structure.
+
+**Skip rules:** Run only for `scored_output: true` areas. Skip queries whose Check column mentions clicks, scrolling, or visual layout. Multi-turn sequences are browser-only (they require session state).
+
+**Budget:** Proven areas: max 2 Queries via CLI (spot-check). Uncharted areas: run all.
+
+**Timing:** Record wall-clock time per CLI query. If timing variance exceeds 50% between runs or any query times out, generate a performance probe — see [probes.md](./probes.md).
+
+**Tool call tracking:** For each CLI query response, capture from the JSON:
+- `tool_calls`: count of `toolCalls` array entries
+- `tool_names`: unique tool names from `toolCalls[*].tool`
+- `result_count`: count of items in the primary search/retrieval tool call's results array. If multiple search calls, sum them. Ignore non-search tool calls (filter lookups, respond_to_user, etc.)
+- `tokens`: `{ prompt, completion }` from `usage` field if present (null if not)
+
+Include `Tools` and `Results` columns in the CLI Speed table. Tool call spike flagging (2x+ historical avg) activates after 3+ data points for a query — before that, track but don't flag. Same minimum sample pattern as probe flaky transition.
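
The spike rule (2x+ the historical average, active only after 3+ data points) reduces to a short check. In this sketch `history` is assumed to hold the tool-call counts from earlier runs of the same query, excluding the current run; the function name and the zero-average guard are illustrative additions.

```python
from statistics import mean

MIN_DATA_POINTS = 3  # track but do not flag before this many prior runs
SPIKE_FACTOR = 2.0   # flag at 2x+ the historical average


def is_tool_call_spike(history: list[int], current: int) -> bool:
    """True when the current tool-call count is a spike against prior runs."""
    if len(history) < MIN_DATA_POINTS:
        return False
    avg = mean(history)
    # Guard (an addition): a zero average would make any count look like a spike.
    return avg > 0 and current >= SPIKE_FACTOR * avg
```

For a query that has used 2 tool calls on each of its last three runs, a run with 4 or more calls would be flagged and a run with 3 would not.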
+
+## During Commit
+
+### Tactical Notes (Commit Mode Step 1)
+
+After scoring, commit mode may append a short tactical note to the area's Notes column in the Areas table. Format: `[Run N] <finding>`.
+
+**Cap:** 3 entries per area. Drop oldest when exceeded.
+
+**Write only when there's a genuine tactical insight:**
+- A reliable JS selector pattern: `[Run 4] batch read via [data-filter-chip] + .product-card reliable`
+- A timing pattern: `[Run 3] agent response 8-12s on first query, faster on follow-ups`
+- An interaction sequence that revealed a bug: `[Run 2] filter → navigate → back → filter again surfaces stale state`
+
+**Do NOT write:** generic observations ("tested 3 areas"), maturity updates ("promoted to Proven"), restatements of probe results.
+
+In `.user-test-last-run.json`, `tactical_note: null` means skip Notes update for this area.
+
+### Query Compounding (Steps 8-11)
+
+These steps run AFTER existing commit mode steps 1-7.
+
+**8. Sharpen Queries from failures:** For each Query that scored ≤ 3, generate an adversarial probe targeting the specific gap. The probe goes in the area's `**Probes:**` table (not the `**Queries:**` table). One failed query generates one probe. Probe fields: query = adversarial version, verify = specific gap observed, status = untested, generated_from = "run-N query failure: <query text>". Existing probe dedup (70% word overlap) catches duplicates.
+
+Example: "earth tones" scored 3 because results were generic neutrals → Probe: query "terracotta and rust specifically", verify "results include warm red/orange tones, not beige/cream."
+
+**9. Expand Queries from discovery:** If exploration (checklist step 3) revealed an interesting interaction the existing Queries don't cover, add it as a new Query in the `**Queries:**` table with Ideal Outcome and Check columns filled from what was observed. New queries are exploratory — they'll be scored next run and may themselves generate probes if they fail.
+
+**10. Mark stable Queries:** If a Query has scored 5/5 for 3+ consecutive runs (commit-level, not per-iterate-run), update Status to `[stable]`. Stable queries shift to CLI-only execution — no browser testing. This frees browser time for novelty exploration. See Step 12 below for full rotation rules.
+
+**11. Persist CLI consistency patterns:** Persist a CLI observation to the area's Notes column on first sighting. Mark it `[confirmed]` if the pattern holds on the next run (same quality score range on the same query type). Remove if contradicted. Detection: compare this run's per-query CLI scores against the pattern claim — e.g., "strong on single-intent" is confirmed if single-intent CLI queries scored >= 4 again. Only persist patterns that are specific and actionable (not "sometimes works").
+
+### Step 12: Rotate Query Status
+
+Transition rules (applied per-query, commit mode only):
+
+- Active → `[stable]`: Scores 5/5 for 3 consecutive runs (commits)
+- `[stable]` → `[retired]`: Scores 5/5 for 10 consecutive runs AND `cli_test_command` is set (long-term maturity state — most queries won't reach this quickly)
+- `[stable]` → active: Scores Q4 twice consecutively (soft regression — note "previously stable query softened") OR scores Q≤3 once (immediate — generate probe per step 8)
+- `[retired]` → active: CLI spot-check scores ≤ 4 (generate probe)
+
+**CLI gate:** Queries without `cli_test_command` in the test file max out at `[stable]`. They receive browser spot-checks via the Proven area MCP budget.
+
+**Execution by status:**
+
+| Status | Browser | CLI | Proven cap |
+|--------|---------|-----|------------|
+| (active) | Yes | Yes | Counts |
+| `[stable]` | No | Yes | Does not count |
+| `[retired]` | No | No | Skipped |
+
+**Report SIGNALS:** Note freed browser time: "+ N stable queries → CLI-only."
+
+**Iterate mode timing:** An iterate×N session counts as 1 commit toward consecutive thresholds. A query that scores 5 in all N runs contributes 1 toward the 3-consecutive threshold.
+
+**Data source:** The `scores` array in `quality_by_query` (see `score-history.json`) stores one entry per commit, not per iterate run. For iterate sessions, record the aggregate query score as a single entry. Consecutive count = length of the leading streak of 5s in the array (most recent first).
+
+Commit mode marks status automatically based on run history — no manual project-file edits needed.
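The transition rules above can be sketched as a small state function. This is a minimal illustration, not the skill's actual implementation: the names `leading_streak` and `next_status`, the plain status strings, and the `has_cli` flag (standing in for "`cli_test_command` is set") are all assumptions made for the example.

```python
from typing import List


def leading_streak(scores: List[int], value: int = 5) -> int:
    """Count consecutive `value` entries from the front (most recent first)."""
    count = 0
    for score in scores:
        if score != value:
            break
        count += 1
    return count


def next_status(status: str, scores: List[int], has_cli: bool) -> str:
    """Apply the rotation rules once. `scores` holds one entry per commit,
    most recent first. Status names are hypothetical plain strings."""
    streak = leading_streak(scores)
    latest = scores[0] if scores else None
    if status == "active":
        return "stable" if streak >= 3 else "active"
    if status == "stable":
        if latest is not None and latest <= 3:
            return "active"  # immediate regression
        if len(scores) >= 2 and scores[0] == 4 and scores[1] == 4:
            return "active"  # soft regression: two consecutive 4s
        if streak >= 10 and has_cli:
            return "retired"  # CLI gate applies only to retirement
        return "stable"
    if status == "retired":
        if latest is not None and latest <= 4:
            return "active"  # CLI spot-check slipped
        return "retired"
    raise ValueError(f"unknown status: {status}")
```

Checking the regression rules before the retirement rule means a stable query whose latest score dropped can never be retired on the strength of an older streak.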
+
+## Novelty Budget
+
+**Step 3 enforcement.** After running probes and queries, the agent MUST use the novelty budget on interactions not in any Query, Probe, or Multi-turn table for this area.
+
+### "Not Documented" Definition
+
+Any interaction where the core action (query text, filter applied, button sequence) does not appear in any existing table entry for this area. Rephrasing a stable query is not novel. A different filter combination is novel. A user behavior with no table representation is novel.
+
+**Run 1 state boundary:** Novelty is measured against the test file state at run start. On run 1 of a new file, all Queries defined during Phase 1 area creation are "documented" even though they were just written. The novelty budget requires interactions beyond those Queries.
+
+### MCP Budget by Area Type
+
+```
+Proven area (tiered cap, see run-targeting.md):
+  → novelty = 1 MCP call after probes and active queries (at 3-call tier)
+  → at 1-call tier: single call used for probe spot-check OR novelty (agent discretion)
+
+Uncharted/FULL area (no hard cap):
+  → novelty = 30% of calls used on probes + queries, minimum 2 calls
+  → Example: 10 calls on probes/queries → 3 novelty calls
+  → Example: 4 calls on probes/queries → still minimum 2 novelty calls
+```
+
+**Proven area budget exhaustion:** When the tiered budget cap is fully consumed by failing/untested probes and active queries, the novelty budget is 0 for that area. Probes and queries take priority — novelty defers, not the other way around. Passing-probe spot-checks also defer when novelty would compete for the last call.
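For the Uncharted/FULL case, the arithmetic can be sketched as follows. The function name is hypothetical, and the rounding rule (round to nearest) is an assumption: the text only gives the 30%/minimum-2 rule and two examples, plus the 40%/minimum-3 variant described under CLI Adversarial Mode.

```python
def uncharted_novelty_budget(calls_on_probes_and_queries: int,
                             adversarial: bool = False) -> int:
    """Novelty calls for an Uncharted/FULL area (sketch, rounding assumed)."""
    if calls_on_probes_and_queries < 0:
        raise ValueError("call count cannot be negative")
    percent, minimum = (40, 3) if adversarial else (30, 2)
    # Integer arithmetic avoids float surprises; +50 rounds to nearest.
    share = (calls_on_probes_and_queries * percent + 50) // 100
    return max(minimum, share)
```

With these assumptions, 10 calls on probes and queries yield 3 novelty calls and 4 calls still yield the minimum of 2, which matches both examples in the block above.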
+
+### Mandatory Probe Rule
+
+At least 1 novel interaction per `scored_output` area MUST generate a probe each run (waived at 1-call tier -- see [run-targeting.md](./run-targeting.md)), even if the interaction appeared clean. The probe verify clause can be "confirm this path remains clean after code changes." This prevents the agent from classifying everything as uninteresting.
+
+### Progressive Narrowing Interaction
+
+| Run 2+ Classification | Novelty Budget |
+|----------------------|---------------|
+| SKIP | 0 (area skipped entirely) |
+| PROBES-ONLY | 0 explicit — but 1 exploration call IS the novelty |
+| FULL | Normal budget (tiered for Proven per run-targeting.md, 30%/min-2 for Uncharted) |
+
+### Novelty Log
+
+Novelty log entries appear in the DETAILS section of the report:
+
+```
+Novelty (agent/filter-via-chat — 2 novel interactions):
+  ✓ "show me everything under $10" — sparse results (3 items). Probe generated.
+  ~ "show me your favorites" — agent confused, returned random. Probe generated.
+```
+
+Format: `✓` tried and clean (probe generated per mandatory rule), `~` tried and interesting/broken (probe generated from finding).
+
+**Persistence:** Novelty log entries do NOT persist to the test file between runs — they're ephemeral. If a novel interaction was worth keeping, it's now a Probe or Query.
+
+**`.user-test-last-run.json` schema:**
+
+```json
+"novelty_log": [
+  {
+    "area": "agent/filter-via-chat",
+    "interaction": "show me everything under $10",
+    "observation": "sparse results (3 items)",
+    "probe_generated": true,
+    "probe_query": "price floor behavior — under $10 returns sparse results"
+  }
+],
+"stable_queries_rotated": ["cottagecore dresses", "leather jacket"]
+```
+
+## Novelty Fingerprint Persistence
+
+Resolves the v2 limitation where novelty logs expired between runs. Fingerprints persist a compact record of each novel interaction so run N+1 knows what run N already explored.
+
+### Fingerprint Format
+
+`<area-slug>:<action-type>:<key-parameter>`
+
+Examples:
+- `agent/filter-via-chat:edge-query:price-floor`
+- `browse/filters:filter-combo:size+color`
+- `checkout/shipping-form:invalid-input:zip-letters`
+
+### Normalization Taxonomy
+
+| Pattern | Format |
+|---------|--------|
+| Price/number inputs | `price-floor`, `price-ceiling`, `price-range` |
+| Filter combinations | `filter-combo:<f1>+<f2>` |
+| Invalid inputs | `invalid-input:<input-type>` |
+| Edge case queries | `edge-query:<topic>` |
+| Navigation sequences | `nav-sequence:<from>-<to>` |
+| Doesn't fit taxonomy | `freeform:<3-word-summary>` |
+
+Coverage is more important than taxonomy consistency. Use freeform when unsure.
+
+### Read-Merge-Write Sequence
+
+1. **Phase 1 (Load Context):** Read existing `novelty_fingerprints` from `.user-test-last-run.json` into memory
+2. **Phase 3 (Execute):** Use fingerprints to skip already-explored interactions. Generate new fingerprints for novel interactions this run.
+3. **Phase 4 / Commit (Write):** Merge existing + new fingerprints. Apply 20-per-area cap (drop oldest). Write merged set to JSON.
+
+Safe because the JSON is written once atomically at the end. No partial-write risk.
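The merge-and-cap step and the single atomic write can be sketched like this. It assumes fingerprints are stored as a flat list of strings ordered oldest first, which the text does not specify (see `last-run-schema.md` for the real layout). The helper names are hypothetical, and writing through a temporary file plus `os.replace` is one common way to get an all-or-nothing write, not necessarily what the skill does.

```python
import json
import os
import tempfile
from typing import Dict, List

CAP_PER_AREA = 20


def merge_fingerprints(existing: List[str], new: List[str],
                       cap: int = CAP_PER_AREA) -> List[str]:
    """Merge oldest-first lists, drop duplicates, keep the newest `cap` per area."""
    merged: List[str] = []
    seen = set()
    for fingerprint in existing + new:
        if fingerprint not in seen:
            seen.add(fingerprint)
            merged.append(fingerprint)
    by_area: Dict[str, List[str]] = {}
    for fingerprint in merged:
        area = fingerprint.split(":", 1)[0]  # "<area-slug>:<action-type>:..."
        by_area.setdefault(area, []).append(fingerprint)
    keep = set()
    for fingerprints in by_area.values():
        keep.update(fingerprints[-cap:])  # drop oldest beyond the cap
    return [fingerprint for fingerprint in merged if fingerprint in keep]


def write_json_atomically(path: str, data: dict) -> None:
    """Write to a temp file in the same directory, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temporary file next to the target keeps the final rename on one filesystem, so a reader sees either the old file or the new one.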
+
+### Iterate Mode Exemption
+
+Iterate mode measures consistency by running the same scenario N times. **Fingerprints are ignored in iterate mode** — all runs test the same interaction set. Fingerprints still accumulate for use in the next non-iterate session.
+
+### Adversarial Mode Override
+
+Adversarial mode (CLI score 3 trigger) overrides fingerprint skipping for its specific actions. Competing-constraint queries triggered by adversarial mode always run regardless of fingerprint state.
+
+### Proven Area Budget Interaction
+
+Proven areas keep their tiered MCP budget (see [run-targeting.md](./run-targeting.md)). Fingerprint filtering does NOT increase the budget -- it changes WHAT those calls test. If fingerprints exclude obvious interactions, the budgeted calls target genuinely novel territory.
+
+### Matching Semantics
+
+Agent exercises judgment on what "matches." The goal is to skip interactions of the same *type*, not to require exact parameter matches. `edge-query:price-floor` and `edge-query:price-ceiling` are different fingerprints. `edge-query:price-floor` from run 1 means "don't test price-floor edge cases again."
+
+### SIGNALS Format
+
+When fingerprints meaningfully constrained novelty choices:
+```
+~ agent/filter-via-chat novelty: 3 fingerprints excluded, 2 new interactions found
+```
+
+### Resilience
+
+If `.user-test-last-run.json` is deleted or corrupted, fingerprint history resets to empty. Acceptable — the skill re-explores previously covered territory (same as pre-fingerprint behavior). Fingerprints are an optimization, not a correctness requirement.
+
+## CLI Adversarial Mode
+
+CLI score 3 ("partially correct — surface-level right, deeper reasoning wrong") triggers adversarial browser mode for the affected area. Score 3 is the adversarial sweet spot: the app functions, but CLI revealed shallow reasoning that browser testing can expose.
+
+### Trigger Condition
+
+**Primary:** Adversarial mode triggers when **any individual CLI query** for the area scores exactly 3. Per-query scores, not averages.
+
+**Secondary:** If the area's CLI Quality average across queries is 3.0-3.4 AND no single query hit exactly 3 (all queries borderline), also trigger adversarial mode. Record `adversarial_trigger: "cli-avg-3.x: <average>"`.
+
+### Phase 2.5 Addition
+
+After scoring CLI queries, for each area with `prechecks`-tagged queries:
+- If any individual query score == 3: set `adversarial_browser: true`, record triggering query
+- If average 3.0-3.4 with no single 3: also set `adversarial_browser: true` (secondary check)
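The primary and secondary checks can be sketched as one function per area. The function name and the input shape (a mapping from query text to integer CLI score) are assumptions. The `cli-avg-3.x: <average>` string comes from the text; reusing `cli-score-3: <query>` for the primary trigger and formatting the average to one decimal place are guesses.

```python
from typing import Dict, Optional, Tuple


def adversarial_trigger(query_scores: Dict[str, int]) -> Tuple[bool, Optional[str]]:
    """Return (adversarial_browser, adversarial_trigger) for one area."""
    if not query_scores:
        return False, None
    # Primary: any single query scoring exactly 3 (per-query, not averaged).
    for query, score in query_scores.items():
        if score == 3:
            return True, f"cli-score-3: {query}"
    # Secondary: borderline average with no individual 3.
    average = sum(query_scores.values()) / len(query_scores)
    if 3.0 <= average <= 3.4:
        return True, f"cli-avg-3.x: {average:.1f}"
    return False, None
```

Running the primary check first means the secondary branch is only reached when no query scored exactly 3, which is the condition the text attaches to it.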
+
+### Adversarial Browser Mode Behaviors
+
+When triggered, the area's Phase 3 execution changes in five ways:
+
+1. **Skip the happy path.** Start with the query most likely to expose the shallow reasoning — not the simplest, expected query.
+
+2. **Front-load competing-constraint queries.** If the area has Queries defined, execute any query with competing constraints (e.g., "crisp not silky") before single-intent queries.
+
+3. **Pre-emptive probe (before exploration).** Generate an `untested` probe targeting the specific CLI weakness:
+   - `generated_from: "cli-score-3: <query that scored 3>"`
+   - Priority: P1 (CLI already revealed the weakness)
+
+4. **Increased novelty budget.**
+   - Proven areas: all budgeted MCP calls must be adversarial, not happy-path spot-checks
+   - Uncharted areas: novelty budget increases to 40% of calls (from 30%), minimum 3 (from 2)
+
+5. **Report flag** in DETAILS:
+   ```
+   agent/filter-via-chat: CLI 3 → browser adversarial mode
+     Pre-emptive probe: "competing filter constraints" (P1)
+     Exploration front-loaded with competing-constraint queries
+   ```
+
+### Progressive Narrowing Override
+
+If a SKIP-classified area has a CLI query scoring 3, **adversarial mode overrides SKIP for that area only** — promoted to PROBES-ONLY with adversarial execution. The CLI signal is too strong to ignore. PROBES-ONLY areas with adversarial mode execute their probes + the pre-emptive probe, but skip full exploration.
+
+### SIGNALS Addition
+
+```
+~ 2 areas in CLI-adversarial mode (CLI score 3): agent/filter-via-chat, agent/search-query
+```
+
+### .user-test-last-run.json Fields
+
+Per-area: `adversarial_browser` (boolean, default false) and `adversarial_trigger` (string or null). See [last-run-schema.md](./last-run-schema.md) for full schema.
diff --git a/plugins/compound-engineering/skills/user-test/references/run-targeting.md b/plugins/compound-engineering/skills/user-test/references/run-targeting.md
new file mode 100644
index 000000000..b8b0ec796
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/run-targeting.md
@@ -0,0 +1,120 @@
+# Run Targeting
+
+Rules for deciding which areas get tested each run, how deeply, and in what order.
+Three mechanisms — area selection priority, git-aware targeting, and progressive
+narrowing — work together to focus testing time where it has the most impact.
+
+## Area Selection Priority
+
+0. **Code-affected areas (if git diff available):** Full exploration regardless of maturity status — even Proven areas get the full checklist. See Git-Aware Targeting below.
+1. **Pick highest-priority Explore Next Run items first** (P1 > P2 > P3), not FIFO
+2. **Uncharted areas:** Full investigation with batched `javascript_tool` calls. See [browser-input-patterns.md](./browser-input-patterns.md) for input patterns and batching tips.
+3. **Proven areas:** Spot-check scaled by stability (see tiered budget below), plus any failing/untested probes. Verify the happy path still works.
+
+### Proven Area Budget by Stability
+
+| Consecutive Passes | Browser MCP Budget |
+|---|---|
+| 2-5 | 3 calls |
+| 6-9 | 2 calls |
+| 10+ | 1 call |
+
+Failing/untested probes remain uncapped at all tiers. The tier only constrains passing probe spot-checks and exploration calls. Tier resets on demotion from Proven (consecutive pass count returns to 0). Stable queries (CLI-only) and cross-area probes are not constrained by per-area budgets.
+
+At the 1-call tier, the single call may be used for probe spot-check OR novelty -- the mandatory novelty probe rule is waived when the budget is 1 call.
+
+Freed calls redistribute to the novelty budget and to areas with active variance. N = sum of (3 - tier_budget) across all Proven areas tested this run. Report in SIGNALS: "+ N calls freed from ultra-stable areas."
+
+4. **Known-bug areas:** Check if the linked issue is resolved before skipping:
+   - If `gh` not authenticated: skip as normal
+   - Run `gh issue view <issue-number> --json state -q '.state'` (prints `OPEN` or `CLOSED`)
+   - If `CLOSED`: flip area to Uncharted, run the `fix_check` as the first test
+   - If `OPEN`: skip as normal, note in output
+   - If fix check fails (score <= 2): file new issue with "Regression of #N" referencing the original closed issue
+5. **If all areas are Proven:** Spot-check all, then suggest new scenarios in "Explore Next Run"
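The tiered Proven budget and the freed-call count N can be sketched as below. Function names are hypothetical, and rejecting counts below 2 is an assumption based on the table starting at 2 consecutive passes.

```python
from typing import Dict


def proven_budget(consecutive_passes: int) -> int:
    """Browser MCP budget for a Proven area, per the stability table."""
    if consecutive_passes < 2:
        raise ValueError("the table starts at 2 consecutive passes")
    if consecutive_passes <= 5:
        return 3
    if consecutive_passes <= 9:
        return 2
    return 1


def freed_calls(passes_by_area: Dict[str, int]) -> int:
    """N = sum of (3 - tier_budget) across the Proven areas tested this run."""
    return sum(3 - proven_budget(passes) for passes in passes_by_area.values())
```

For example, three Proven areas at 12, 7, and 3 consecutive passes free 2 + 1 + 0 = 3 calls.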
+
+## Git-Aware Targeting
+
+Compute code diffs from **two sources**. If EITHER produces files, those files trigger area targeting — no exceptions.
+
+1. **Branch diff:** If `git_sha` from the previous run differs from HEAD, run `git diff --name-only <old_sha>..HEAD`.
+2. **Main diff:** Run `git diff --name-only origin/main..HEAD` (or `origin/master..HEAD`). This produces files whenever HEAD and origin/main differ — regardless of which is "ahead." Direction does not matter. If the diff returns files, those files are code changes.
+
+**Interpreting results:** Union all files from both diffs. If the union is empty, report "No code changes since last run." If the union has ANY files, every file in that list is a code change that MUST be mapped to test areas for full exploration. Do NOT filter, dismiss, or deprioritize files for any reason — not "already tested," not "origin/main is behind HEAD," not "these are old changes." A non-empty diff = code-affected areas = full exploration.
+
+**Why both diffs:** The branch diff catches new commits. The main diff catches divergence between your branch and main (squash merges, rebases, or simply being on a feature branch with changes vs main). Both are code the test areas need to cover.
+
+### Priority Integration
+
+Git targeting **augments** the priority list — it adds areas to the full-exploration set, it doesn't filter or demote existing priorities. Explore P1 items always get full exploration regardless of code changes.
+
+1. **Code-affected areas** (full exploration, regardless of maturity)
+2. P1 Explore Next Run items (full exploration — P1 means "test this thoroughly")
+3. Uncharted areas (existing)
+4. Proven areas — spot-check UNLESS code-affected (existing)
+5. Known-bug areas (existing)
+
+### Display at Run Start
+
+When EITHER diff produces files, display this block. Never say "No code changes" unless BOTH diffs return zero files.
+```
+Code changes detected (27 files):
+  Branch diff: <old_sha>..HEAD — 0 files (no new commits)
+  Main diff: origin/main..HEAD — 27 files
+Mapped to areas:
+- src/agent/orchestrator.ts → agent/search-query, agent/filter-via-chat
+- src/tools/cart/add-to-cart.ts → cart/add-remove
+Full exploration: agent/search-query, agent/filter-via-chat, cart/add-remove
+```
+
+### Edge Cases
+
+- **No .git:** Skip targeting. Note "Not a git repo — testing all areas equally."
+- **SHA not in history (rebase/force push):** Warn, test all areas.
+- **Feature branch (main behind HEAD):** The main diff still produces files — these ARE the branch's changes vs main and MUST trigger area targeting. "Behind HEAD" is not a reason to skip.
+- **>30 changed files:** Treat as "everything affected." Display "Large changeset (N files) — testing all areas." CLI-first ordering still applies.
+- **Only docs/config:** Note "Only docs/config changed — normal priority." Skip code targeting.
+- **Monorepo:** Agent ignores paths outside app source tree.
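The two-diff union and the large-changeset edge case can be sketched as pure functions over the `git diff --name-only` output lines. The function names and the mode labels (`none`, `targeted`, `all-areas`) are hypothetical. Mapping files to areas, and the docs/config and monorepo cases, are left to agent judgment and are not modeled here.

```python
from typing import Iterable, List

LARGE_CHANGESET = 30  # ">30 changed files" means everything is affected


def changed_files(branch_diff: Iterable[str], main_diff: Iterable[str]) -> List[str]:
    """Union both diffs; a file in either one counts as a code change."""
    union = set()
    for line in list(branch_diff) + list(main_diff):
        path = line.strip()
        if path:
            union.add(path)
    return sorted(union)


def targeting_mode(files: List[str]) -> str:
    """Decide how the union drives area targeting (labels are illustrative)."""
    if not files:
        return "none"  # only now may the run report "No code changes"
    if len(files) > LARGE_CHANGESET:
        return "all-areas"
    return "targeted"
```

Because the union is computed before any decision is made, an empty branch diff cannot mask a non-empty main diff.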
+
+### Report Section
+
+Add to run summary when targeting was active:
+```
+Code Changes Since Last Run (abc1234 → def5678):
+  12 files changed, 3 mapped to test areas
+  Targeted: agent/search-query ← orchestrator.ts; cart/add-remove ← add-to-cart.ts
+  Spot-check only: browse/product-grid, browse/filters, compare/add-view
+```
+
+## Progressive Narrowing (Run 2+)
+
+After run K completes, classify each area for run K+1:
+
+**SKIP** — Area scored ≥ 4 with 0 probe failures AND 0 verification mismatches in run K. No browser testing in run K+1. Note in report: "Skipped (stable in R{K})". CLI queries still run as a lightweight quality check (see D4 in plan). Failing/untested probes still execute if any exist — the probe uncap rule is not overridden by SKIP.
+
+**PROBES-ONLY** — Area scored ≥ 4 but has active failing/flaky probes. Execute ALL probes (failing, untested, AND passing as spot-checks) in run K+1 plus 1 exploration MCP call. No broad exploration beyond that.
+
+**FULL** — Area scored ≤ 3, OR had a verification mismatch, OR has a newly injected probe from run K, OR is the target of an Explore Next Run P1 item. Full exploration in run K+1 with injected probes.
+
+**Override priority** (first match wins):
+1. Git-diff `(verify)` → FULL (always)
+2. Explicit user override → FULL (all areas)
+3. This classification (SKIP/PROBES-ONLY/FULL)
+4. Proven tiered-MCP budget (R1 or N=1 only)
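The base SKIP / PROBES-ONLY / FULL classification (item 3 in the list above) can be sketched as follows. The function and parameter names are hypothetical, and checking the FULL conditions first is an assumption about precedence. The sketch does not model the git-diff and user overrides, or the within-session probe injection exception.

```python
def classify_area(score: int,
                  failing_or_flaky_probes: int,
                  verification_mismatches: int,
                  new_probe_injected: bool = False,
                  p1_explore_target: bool = False) -> str:
    """Classify an area for run K+1 from its run K results (sketch)."""
    if (score <= 3 or verification_mismatches > 0
            or new_probe_injected or p1_explore_target):
        return "FULL"
    if failing_or_flaky_probes > 0:
        return "PROBES-ONLY"
    return "SKIP"  # score >= 4, no probe failures, no mismatches
```

Under these assumptions an area scoring 4 with one failing probe lands in PROBES-ONLY, while the same area with a verification mismatch goes to FULL.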
+
+Time freed from SKIP areas redistributes to FULL areas. This makes R2 systematically different from R1 — it pushes on weakness, not uniformity.
+
+**Interaction with within-session probe injection:** If R1 generates a new probe targeting a SKIP area, the injected probe has status `untested` and executes under the uncap rule. The area stays labeled SKIP in the display — the probe is an exception, not a reclassification to PROBES-ONLY.
+
+**Display at R2 start:**
+```
+Progressive narrowing (based on R1 results):
+  SKIP:        browse/product-grid (5), browse/filters (5),
+               cart/add-remove (4), compare/add-view (4)
+  PROBES-ONLY: agent/filter-via-chat (4, 1 active failing probe)
+  FULL:        agent/search-query (Q2 outlier), browse/product-detail (P1 explore)
+  Time saved:  ~12 min (4 areas skipped in browser)
+```
+
+**N=1 edge case:** Progressive narrowing only applies to runs 2+. N=1 iterate sessions test all areas per normal priority rules.
+
+**Retest classification is stored per-run** in `.user-test-last-run.json` under each run's per-area data as `retest_classification: "SKIP"`. This feeds the N-run summary trajectory display (e.g., "cart/add-remove: R1 FULL → R2 SKIP → R3 SKIP (stable)"). SKIP areas that maintained score via CLI appear as "Stable (not retested in browser)" — distinct from "Stabilized (tested and passed)."
diff --git a/plugins/compound-engineering/skills/user-test/references/test-file-template.md b/plugins/compound-engineering/skills/user-test/references/test-file-template.md
new file mode 100644
index 000000000..95ca8368d
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/test-file-template.md
@@ -0,0 +1,563 @@
+# Test File Template
+
+Test files live in `tests/user-flows/<scenario-slug>.md` in the target project. Each file is a living document that compounds knowledge across runs.
+
+## Template
+
+```markdown
+---
+schema_version: 10
+scenario: "<scenario-name>"
+app_url: "http://localhost:3000"
+created: "<YYYY-MM-DD>"
+last_run: "<YYYY-MM-DD>"
+seams_read: false  # set to true after first code-reading pass (see orientation.md)
+cli_test_command: ""  # optional, e.g. "node scripts/test-cli.js --query '{query}'"
+cli_queries:  # optional
+  # - query: "example query"
+  #   expected: "description of correct response (agent evaluates semantically)"
+  #   prechecks: "area-slug"  # optional — browser area to skip on CLI failure
+  #   graduated_from: "B001"  # optional — bug ID that spawned this check (see graduation.md)
+performance_thresholds:  # optional, seconds
+  # fast: 2
+  # acceptable: 8
+  # slow: 20
+  # broken: 60
+mcp_restart_threshold: 15  # optional, proactive page reload after N MCP calls
+---
+
+# <Scenario Name>
+
+## Areas
+
+| Area | Status | Last Score | Last Quality | Last Time | Consecutive Passes | Notes |
+|------|--------|------------|-------------|-----------|-------------------|-------|
+| <area-slug> | Uncharted | — | — | — | 0 | |
+
+## Area Details
+
+### <area-slug>
+
+**Interactions:** <1-3 user-facing tasks this area covers>
+
+**What's tested:** <what does "good" look like for this area? Be specific about the domain. What are the ways the output could be subtly wrong?>
+
+**pass_threshold:** 4
+
+**weakness_class:** <!-- optional, written by commit mode when 2+ probes share a failure pattern. See probes.md Weakness Classification. -->
+
+**verify:**
+- <optional: freeform verification instructions — what claims to audit>
+- <e.g., "read every condition badge, compare against requested filter">
+
+**Queries:** <!-- For scored_output areas. Remove for non-output areas. See Area Depth below. -->
+
+| Query | Ideal Outcome | Check | Status | Notes |
+|-------|--------------|-------|--------|-------|
+
+**Multi-turn:** <!-- For conversational/multi-step areas. Remove if single-interaction. -->
+
+| Turn | Query | Check |
+|------|-------|-------|
+
+**Probes:**
+
+| Query | Verify | Status | Priority | Confidence | Generated From | Run History |
+|-------|--------|--------|----------|------------|---------------|-------------|
+
+Run History format: comma-separated P/F entries, most recent first. Example: `P,P,F,P` (4 runs: the two most recent passed, the run before those failed, and the oldest passed). Cap at 10 entries, drop oldest. Consecutive count for escalation/graduation is computed from the leading streak.
+
+## Cross-Area Probes
+
+<!-- Probes that test state carry-over between areas. Run before per-area
+     testing. See probes.md for lifecycle and generation triggers. -->
+
+| Trigger Area | Action | Observation Area | Verify | Status | Priority | Confidence | Generated From | Run History |
+|-------------|--------|-----------------|--------|--------|----------|------------|---------------|-------------|
+
+## Journeys
+
+<!-- Multi-area user flows without resets. Run after cross-area probes,
+     before per-area testing. See journeys.md for lifecycle and budget.
+     Each journey: ### J001: <name>, Steps table, Status, Last Run,
+     Run History, Generated From, optional on_failure, optional escalated_to. -->
+
+## Area Trends
+
+<!-- Auto-maintained from score-history.json. Do not edit manually. -->
+
+| Area | Trend | Last Score | Delta |
+|------|-------|------------|-------|
+
+## Explore Next Run
+
+<!-- Priority: P1 = likely user-facing friction, P2 = edge case worth knowing, P3 = curiosity -->
+<!-- Mode: CLI = agent reasoning only, Browser = rendering/interaction, Both = CLI reasoning + browser verification -->
+
+| Priority | Area | Mode | Why |
+|----------|------|------|-----|
+| P1 | | | |
+
+## Run History
+
+<!-- Keep last 50 entries. Oldest entries rotate out. -->
+
+| Date | Areas Tested | Quality Avg | Delta | Pass Rate | Best Area | Worst Area | Demo Ready | Context | Key Finding |
+|------|-------------|-------------|-------|-----------|-----------|------------|------------|---------|-------------|
+
+## UX Opportunities Log
+
+<!-- Action items: things to improve. Keep last 20 open entries. -->
+
+| ID | Area | Priority | Status | Suggestion |
+|----|------|----------|--------|-----------|
+
+## Good Patterns
+
+<!-- Preservation notes: things to protect. Auto-expire after 5 unconfirmed runs. -->
+
+| Area | Pattern | First Seen | Last Confirmed |
+|------|---------|------------|----------------|
+```
+
+## Schema Migration
+
+**v1 → v2 changes:**
+- Areas table: added `Last Quality` and `Last Time` columns
+- Run History table: added `Delta` and `Context` columns
+- Frontmatter: added optional `cli_test_command`, `cli_queries`, `performance_thresholds`
+
+**v2 → v3 changes:**
+- New section: `## Area Trends` (thin summary from score-history.json)
+- New section: `## UX Opportunities Log` (P1/P2 improvement suggestions with status lifecycle)
+- New section: `## Good Patterns` (patterns worth preserving, separate from opportunities)
+- New standalone file: `tests/user-flows/score-history.json` (machine-readable per-area history)
+- Run History table: added `Best Area` and `Worst Area` columns
+- Area Details: added optional `pass_threshold` and `quality_threshold` fields
+- Frontmatter: added optional `graduated_from` field on cli_queries entries
+
+**Reading v1 files:** Fill missing columns with defaults (`—` for scores/times, empty for notes). Do NOT rewrite the file on read.
+
+**Reading v2 files:** Fill missing sections (Area Trends, UX Opportunities Log, Good Patterns) with empty tables. Fill missing Run History columns (Best Area, Worst Area) with `—`. Do NOT rewrite the file on read.
+
+**Reading v3 files:** Treat missing `verify:` blocks and `Probes:` tables as absent (no verification steps, no probes). Do NOT rewrite the file on read.
+
+**Reading v4 files:** Treat missing `**Queries:**` and `**Multi-turn:**` tables as absent (no queries, no multi-turn sequences). Do NOT rewrite the file on read.
+
+**Reading any file missing `cli_test_command`:** Treat as `cli_test_command: ""`
+regardless of schema version. CLI discovery runs in Phase 1 step 3.
+
+**v4 → v5 changes:**
+- Area Details: added optional `**Queries:**` table (`| Query | Ideal Outcome | Check | Notes |`) (v6 adds Status column)
+- Area Details: added optional `**Multi-turn:**` table (`| Turn | Query | Check |`)
+- Area Details: `**What's tested:**` expanded to include domain-specific guidance
+- New reference file: `queries-and-multiturn.md` (per-area execution checklist, scoring boundaries, query compounding)
+- New section in this file: Area Depth (thin vs rich definitions, writing queries, multi-turn, first-run quality)
+
+**v5 → v6 changes:**
+- Probes table: added `Priority`, `Confidence`, `Generated From`, `Run History` columns (replaces `Generated`)
+- Queries table: added `Status` column (between Check and Notes)
+- Frontmatter: added `seams_read` field (boolean, default `false`)
+- New reference file: `orientation.md` (code-reading step for first-run structural hypothesis probes)
+- New probe generation trigger: `structural-hypothesis` (from code reading)
+- New query status lifecycle: active → `[stable]` → `[retired]` (see queries-and-multiturn.md step 12)
+
+**Reading v5 files:** Probes without `Confidence` column → treat as `confidence: high` (existing probes were generated from observed failures). Probes without `Priority` column → infer from `Generated From` (verification failure → P1, score-based → P2). Queries without `Status` column → treat as active. Existing `[stable]` tags in Notes column → migrate to Status column on first v6 commit, remove from Notes. Missing `seams_read` → treat as `false` (triggers Orientation on first v6 run). Do NOT rewrite the file on read.
+
+**v6 → v7 changes:**
+- New section: `## Cross-Area Probes` (scenario-level probe table for interactions spanning two areas)
+- Probe generation: optional `related_bug` field for isolation probes (any probe, per-area or cross-area)
+- Test file frontmatter: optional `mcp_restart_threshold` field (default 15)
+- Connection resilience extracted to `references/connection-resilience.md`
+
+**Reading v6 files:** Treat missing `## Cross-Area Probes` section as empty table. Treat missing `mcp_restart_threshold` as 15. Treat probes without `related_bug` as unlinked. Do NOT rewrite on read.
+
+**v7 → v8 changes:**
+- Area Details: optional `**weakness_class:**` field (below `pass_threshold`), written by commit mode when 2+ probes share a failure pattern
+- Area Details: `**verify:**` blocks auto-updated with confirmed selectors by commit mode (append-only, run-tagged)
+- Areas table: Notes column receives tactical run notes in `[Run N] <finding>` format (max 3 entries, drop oldest)
+- `.user-test-last-run.json` schema extracted to `references/last-run-schema.md`
+- `.user-test-last-run.json`: new per-area fields (`tactical_note`, `confirmed_selectors`, `weakness_class`, `adversarial_browser`, `adversarial_trigger`)
+- `.user-test-last-run.json`: new top-level key `novelty_fingerprints` (accumulates across runs, 20-per-area cap)
+- `.user-test-last-run.json`: cross-area synthesis entries in `explore_next_run` with `weakness_class`, `affected_areas`, `adversarial_instruction`
+
+**Reading v7 files:** Treat missing `weakness_class` as absent. Treat missing `novelty_fingerprints` as empty. Treat missing `adversarial_browser` as false. Do NOT rewrite on read.
+
+**v8 → v9 changes:**
+- New section: `## Journeys` (scenario-level multi-area user flows without resets)
+- `.user-test-last-run.json`: new `journeys_run` array field (per-journey checkpoint data)
+- New reference file: `journeys.md` (lifecycle, budget, execution rules, checkpoint types, generation, interactions)
+
+**Reading v8 files:** Treat missing `## Journeys` section as empty (no journeys defined). Do NOT rewrite on read.
+
+**CLI gate for query retirement:** Only queries in test files with `cli_test_command` set can reach `[retired]` status. Queries without CLI backstop max out at `[stable]` and continue receiving browser spot-checks via the Proven area MCP budget. If `cli_test_command` is removed from a file with `[retired]` queries, those queries demote to `[stable]` on next commit.
+
+**Writing any file:** Upgrade to v10 on commit. Set `schema_version: 10` in frontmatter on the first commit under v10 skill logic. The version number reflects which skill version last wrote the file.
+
+**Forward compatibility:** Ignore unknown frontmatter fields from future schema versions. Preserve unknown table columns on write.
+
+## Pass Thresholds
+
+Each area can define explicit pass thresholds in its area details:
+
+```markdown
+### checkout/shipping-form
+**Interactions:** Enter address, select method, see estimate
+**What's tested:** Form validation + shipping logic
+**pass_threshold:** 4
+```
+
+For `scored_output` areas, add a quality threshold:
+
+```markdown
+### agent/search-results
+**Interactions:** Enter query, review results, refine search
+**What's tested:** Result relevance and ranking quality
+**scored_output:** true
+**pass_threshold:** 4
+**quality_threshold:** 3
+```
+
+**Defaults:** `pass_threshold: 4`, `quality_threshold: 3` (for scored_output areas). These match the v2 implicit behavior but are now explicit and per-area configurable.
+
+**Promotion gate:** "2+ consecutive passes" means 2+ consecutive runs where UX >= `pass_threshold` (and Quality >= `quality_threshold` for scored_output areas).
+
+## Known-Bug Area Details
+
+Areas with `Known-bug` status include additional fields:
+
+```markdown
+### cart-quantity-update
+**Status:** Known-bug
+**Issue:** #47
+**Bug ID:** B001
+**Fix check:** Verify quantity updates in <5s and cart badge reflects new count
+```
+
+The `**Issue:** #<number>` field is the canonical reference for `gh issue view`. The `**Bug ID:** B00N` field links to the bug registry entry. The `**Fix check:**` field describes what to verify when the issue is closed — fix_check passes when score >= area's `pass_threshold`.
+
+## Score History JSON
+
+Per-area score history is stored in `tests/user-flows/score-history.json`:
+
+```json
+{
+  "areas": {
+    "checkout/cart": {
+      "scores": [
+        { "date": "2026-02-28", "ux": 3, "quality": null, "time": 8 },
+        {
+          "date": "2026-03-01", "ux": 4, "quality": 4.1, "time": 7,
+          "quality_by_query": [
+            { "query": "vintage denim jacket", "scores": [4, 5], "avg": 4.5 },
+            { "query": "y2k accessories", "scores": [3, 2], "avg": 2.5, "outlier": true }
+          ]
+        }
+      ],
+      "cli_metrics": [
+        { "date": "2026-03-01", "avg_tool_calls": 2.5, "avg_time": 17.0 }
+      ],
+      "trend": "improving"
+    }
+  }
+}
+```
+
+**Storage:** Last 10 entries per area. Oldest drops when 11th is recorded. One file per project. `quality_by_query` follows the same rotation — last 10 entries per query.
+
+**`quality_by_query`:** Only present for `scored_output: true` areas with multiple Queries. The `outlier: true` flag is set when avg ≤ 3. Query text is the key — when commit mode sharpens a query (step 8), the old query gets a final entry and the new sharpened query starts fresh. Sharpening breaks per-query trend continuity; the area-level quality trend (which averages all queries) provides continuity across sharpening events. Old test files without `quality_by_query` parse fine — the field is purely additive.
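The rotation and outlier rules can be sketched with two small helpers. Both names are hypothetical, and the sketch assumes entries are appended oldest first, as the dated entries in the example JSON suggest.

```python
from typing import Any, Dict, List

MAX_ENTRIES = 10


def append_capped(entries: List[Dict[str, Any]], entry: Dict[str, Any],
                  cap: int = MAX_ENTRIES) -> List[Dict[str, Any]]:
    """Append the newest entry and keep only the last `cap` (oldest drops)."""
    return (entries + [entry])[-cap:]


def query_record(query: str, scores: List[int]) -> Dict[str, Any]:
    """Build one quality_by_query record; flag it as an outlier when avg <= 3."""
    avg = sum(scores) / len(scores)
    record: Dict[str, Any] = {"query": query, "scores": scores, "avg": avg}
    if avg <= 3:
        record["outlier"] = True
    return record
```

With the sample data above, `query_record("y2k accessories", [3, 2])` carries `"outlier": True`, while the `vintage denim jacket` record (average 4.5) does not.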
+
+**Trend values:** `improving` (last 3 trending up), `stable` (variance < 0.5), `declining` (last 3 trending down), `volatile` (variance >= 1.0), `fixed` (previous <= 2, current >= pass_threshold).
+
+**Gitignore:** Add `score-history.json` to `.gitignore` if the project treats test data as ephemeral. Otherwise keep it committed for team visibility.
+
+## UX Opportunity Lifecycle
+
+| Status | Meaning |
+|--------|---------|
+| open | Suggestion logged, not yet acted on |
+| implemented | Improvement was made (agent detects or user marks) |
+| wont_fix | Explicitly declined (prevents re-suggestion) |
+
+Keep last 20 `open` entries. `implemented` and `wont_fix` age out after 30 days.
+
+Dedup: anchored on area slug + priority level. Agent decides whether to update or create new — no automated text matching.
+
+## Good Patterns Lifecycle
+
+`Last Confirmed` updates each run that observes the pattern. Patterns not confirmed for 5+ runs are removed. Dedup on area slug only (one pattern entry per area).
+
+Only log patterns at score 4-5 that represent a deliberate design choice, not just "page loaded successfully."
+
+## Area Granularity
+
+Each area should cover 1-3 scored interaction units. An interaction unit is one user-facing task completion (e.g., "add item to cart"), not a page load or navigation step.
+
+### Worked Example: Checkout Flow
+
+Instead of one large "checkout" area, decompose into:
+
+| Area | Interactions | What's Tested |
+|------|-------------|---------------|
+| `checkout/cart-validation` | Add item, verify count, change quantity | Cart state management |
+| `checkout/shipping-form` | Enter address, select method, see estimate | Form validation + shipping logic |
+| `checkout/payment-submission` | Enter card, submit, see confirmation | Payment flow + success state |
+
+This granularity ensures:
+- A single bug doesn't reset a huge chunk of proven territory
+- Areas are small enough to accumulate consecutive passes meaningfully
+- Each area maps to a distinct `user-test:<area-slug>` label for issue tracking
+
+### Worked Example: Settings Page
+
+| Area | Interactions | What's Tested |
+|------|-------------|---------------|
+| `settings/profile-update` | Edit name, upload avatar, save | Profile persistence |
+| `settings/notifications` | Toggle email prefs, save, verify | Notification preferences |
+| `settings/account-delete` | Click delete, confirm dialog, verify | Destructive action flow |
+
+### Worked Example: AI-Powered App
+
+Apps with AI output (search, recommendations, chatbots, generated content) need rich area definitions with Queries that test domain understanding. Decompose by capability:
+
+| Area | Interactions | What's Tested |
+|------|-------------|---------------|
+| `agent/search-quality` | Enter query, review results | Domain vocabulary mapping |
+| `agent/conversation` | Multi-turn refinement | Context retention across turns |
+| `agent/edge-cases` | Confusing or out-of-scope input | Graceful degradation |
+
+Here's what a rich `agent/search-quality` area looks like for a bedding store:
+
+```markdown
+### agent/search-quality
+**Interactions:** Enter query, review results, assess whether the app understood what the user actually meant — not just the keywords
+**What's tested:** Does the app translate lifestyle language into correct domain attributes? Does it surface results the user didn't know to ask for but would love?
+**scored_output:** true
+**pass_threshold:** 4
+**quality_threshold:** 3
+
+**Queries:**
+
+| Query | Ideal Outcome | Check | Status | Notes |
+|-------|--------------|-------|--------|-------|
+| "I run warm and want crisp not silky" | Percale, linen. NOT sateen/flannel | (1) cooling (2) crisp feel | | |
+| "earth tones — terracotta, sage, clay" | Specific warm colors, not generic neutrals | Color precision | | |
+| "my partner and I disagree on temperature" | Compromise or dual-zone solutions | Both needs addressed | | |
+| "something nice" | Clarifying questions, not random guesses | Handles vagueness | | |
+| "linen because it's so soft and wrinkle-free" | Corrects both misconceptions gently | Factual accuracy | | |
+
+**Multi-turn:**
+
+| Turn | Query | Check |
+|------|-------|-------|
+| 1 | "show me white sheets" | Broad white results |
+| 2 | "boring — add color" | NOT white added, muted tones |
+| 3 | "sage, but cozy not crisp" | Sage + cozy materials, turns 1-2 remembered |
+
+**verify:**
+- Sample 5-8 results, read material/color attributes
+- Every result should match stated filters
+- If agent claims "all cooling" — verify no flannel/heavy cotton
+```
+
+When `scored_output: true`, the area is scored on both UX (1-5) and output quality (1-5). The `Last Quality` column tracks the output quality score.
+
+**Translating to other domains:** The query types are universal. For a recipe app: "I run warm" becomes "quick weeknight dinner for a picky toddler." "Crisp not silky" becomes "healthy but my kids will eat it." "Linen because it's soft" becomes "sear meat to lock in juices" (common misconception). For a code assistant: "something nice" becomes "fix it," and the competing-constraints query becomes "fast but maintainable." The structure is the same; only the domain content differs.
+
+## Area Depth
+
+Granularity determines how many areas you have. **Depth** determines how useful each area is. A thin area produces generic scores. A rich area produces specific, actionable findings that compound across runs.
+
+### Thin vs. Rich Area Definitions
+
+**Thin (produces "4/5, looked fine"):**
+
+```markdown
+### search-results
+**Interactions:** Enter query, review results
+**What's tested:** Result relevance
+**scored_output:** true
+```
+
+**Rich (produces "4/5, but missed terracotta — returned generic neutrals. Agent correctly mapped 'crisp' to percale but didn't exclude sateen"):**
+
+```markdown
+### search-results
+**Interactions:** Enter query, review results, assess domain interpretation
+**What's tested:** Does the app translate natural language into correct domain attributes? Does it understand subjective vocabulary, competing constraints, and emotional context?
+**scored_output:** true
+**pass_threshold:** 4
+**quality_threshold:** 3
+
+**Queries:**
+
+| Query | Ideal Outcome | Check | Status | Notes |
+|-------|--------------|-------|--------|-------|
+| "I run warm and want crisp not silky" | Percale, linen. NOT sateen/flannel | (1) cooling (2) crisp feel | | |
+| "earth tones — terracotta, sage, clay" | Specific warm colors, not generic neutrals | Color precision | | |
+| "my partner and I disagree on temperature" | Compromise solutions (dual-zone, blends) | Both needs addressed | | |
+| "something nice" | Clarifying questions, not random results | Handles vagueness | | |
+| "I want linen because it's soft" | Gentle correction — linen is crisp, not soft | Factual accuracy | | |
+
+**Multi-turn:**
+
+| Turn | Query | Check |
+|------|-------|-------|
+| 1 | "show me white sheets" | Broad white results |
+| 2 | "boring — add color" | NOT white added, muted tones |
+| 3 | "sage, but cozy not crisp" | Sage + cozy materials, context retained |
+
+**verify:**
+- Sample 5-8 results, read material/color attributes
+- Every result should match the stated filters
+- If agent claims "all cooling materials" — verify no flannel/heavy cotton
+```
+
+The rich definition tells the agent exactly what to look for, what "good" means in this specific domain, and how the output could be subtly wrong. It compounds: queries that fail generate probes, probes that persist generate bugs, and bugs that get fixed generate CLI regression checks.
+
+### Writing Good Queries
+
+Queries test the app's **understanding**, not just its functionality. "Show me blue sheets" tests filtering. "I want something calming for my bedroom — maybe ocean-inspired?" tests whether the app understands that "calming + ocean" means soft blues and greens, not literal ocean-print sheets.
+
+**Queries are only valid in `scored_output: true` areas.** If an area has Queries but not `scored_output: true`, the agent flags it during Phase 1 and suggests adding `scored_output: true`.
+
+**Include at least:**
+
+1. **Subjective/lifestyle query** — uses natural language the app must interpret. For a bedding app: "I want to feel like I'm sleeping in a cloud." For a recipe app: "quick weeknight dinner for a picky toddler." For a code assistant: "make this function more readable, not clever."
+
+2. **Competing constraints** — two preferences that pull against each other. Bedding: "soft but cool." Recipes: "healthy but my kids will eat it." Code: "fast but maintainable."
+
+3. **Edge case** — tests the boundary of the app's domain. Bedding: "do you have bath towels?" Recipes: "I only have canned goods and spite." Code: "rewrite this in a language you don't support."
+
+4. **Wrong premises** — the user believes something incorrect. Bedding: "linen because it's so soft." Recipes: "sear the meat to lock in juices." Code: "use a singleton because it's the cleanest pattern."
+
+5. **Vague input** — the minimal useful query. Bedding: "something nice." Recipes: "dinner." Code: "fix it." Should the app ask for more info or make smart defaults?
+
+**Queries compound across runs.** A query that scores 3/5 generates a probe. That probe either gets fixed (the app improves) or escalates to a bug. Either way, you now know exactly where the app's understanding breaks.
+
+### Writing Good Multi-turn Sequences
+
+Multi-turn sequences test whether the app maintains context as the user changes their mind or evolves their preferences.
+
+**Each turn should build on or contradict the previous turn:**
+- Turn 1: Broad starting point
+- Turn 2: Refine or pivot ("actually, not that — more like this")
+- Turn 3: Specific constraint that requires remembering turns 1-2
+
+**Scoring:** The final turn gets the Quality score. Context failures at intermediate turns generate probes targeting the specific turn that broke, and are noted in the area assessment, but do not directly reduce UX or Quality scores. This follows the same pattern as verification failures — they're important findings recorded separately.
+
+### First-Run Query Quality
+
+Run 1 queries will be approximate. The agent is seeing the app for the first time and writing its best guess at domain-specific tests. That's expected and fine.
+
+After run 1, commit mode sharpens them: failed queries generate probes targeting the specific gap, exploration reveals new queries worth adding. By run 3, the Queries table is specific to THIS app's actual strengths and weaknesses — not because anyone hardcoded domain knowledge, but because the queries evolved from real observations.
+
+## CLI Discovery
+
+During Phase 1, actively look for a CLI-testable API surface. This runs for **both new and existing test files** — if an existing file has `cli_test_command: ""`, discovery runs and populates it. CLI mode catches agent reasoning errors in ~30 seconds without browser overhead — browsers should only test what CLI can't (rendering, animations, SSE delivery, click interactions).
+
+### Discovery Steps
+
+Try ALL approaches below in order. If one fails (e.g., a script has runtime errors), proceed to the next. Do NOT conclude "CLI not viable" until every approach has been attempted.
+
+1. **Check for API indicators:**
+   - `package.json` scripts containing `dev`, `start`, or `serve`
+   - `.env` or `.env.local` files with `PORT`, `API_URL`, or endpoint references
+   - Directories: `src/api/`, `src/server/`, `src/routes/`, `routes/`, `api/`
+   - Files: `server.ts`, `server.js`, `index.ts` with express/hono/fastify imports
+
+2. **Check for curl-able endpoints (most reliable; try this before test scripts):**
+   - Look for route definitions (POST/GET handlers) in the codebase
+   - Identify the chat/agent/search endpoint that powers the app's core feature
+   - Test it: `curl -s -X POST http://localhost:{port}/{endpoint} -H "Content-Type: application/json" -d '{"message": "test"}'`
+   - **If curl returns JSON:** Use this as `cli_test_command`. Stop — no need to try test scripts.
+   - **If curl fails or times out:** The server may not be running. Still populate `cli_test_command` from code analysis (route definitions, package.json scripts) even if the endpoint can't be verified live.
+
+3. **Check for existing test scripts (fallback if curl doesn't work):**
+   - `scripts/verify*.ts`, `scripts/test-cli*`, `scripts/smoke-test*`
+   - `package.json` scripts with `verify`, `test:e2e`, `test:api`
+   - **If a script errors:** Try to fix trivially (missing dependency, wrong import). If not trivially fixable, skip it and note "test script broken — using curl instead" or "no CLI surface found."
+
+4. **If ANY testable surface was found:**
+   - Set `cli_test_command` in frontmatter (use the curl pattern that returns JSON, not SSE)
+   - Generate `cli_queries` from `scored_output` area Queries, mapping:
+     - Query → `query` field
+     - Ideal Outcome → `expected` field (semantic description)
+     - Area slug → `prechecks` field (gates browser testing)
+
+### CLI Test Command Patterns
+
+| App Type | Pattern |
+|----------|---------|
+| Express/Hono API with JSON fallback | `curl -s -X POST http://localhost:{port}/{endpoint} -H "Content-Type: application/json" -d '{"message": "{query}"}'` |
+| Express API with SSE only | CLI testing may not be viable. Check for a separate REST route, a JSON fallback (omit `Accept: text/event-stream`), or a test script. Don't parse SSE streams from curl — it's fragile and wastes time. |
+| Direct script invocation | `npx tsx scripts/verify-agent.ts "{query}"` |
+| REST API (GET) | `curl -s "http://localhost:{port}/api/search?q={query}"` |
+
+### Mapping Area Queries to CLI Queries
+
+For each `scored_output` area with a **Queries:** table, generate one `cli_queries` entry per query:
+
+```yaml
+# From area agent/search-quality Query: "I run warm and want crisp not silky"
+cli_queries:
+  - query: "I run warm and want crisp not silky"
+    expected: "Results include percale and linen. No sateen or flannel."
+    prechecks: "agent/search-quality"
+```
+
+**Only map queries that test agent reasoning** — skip queries that test pure UI behavior (click interactions, filter panel rendering, suggestion chip behavior). Those need browser testing and have no CLI equivalent.
+
+**Queries that test both reasoning and rendering** can still map to CLI — the CLI tests the reasoning half. The browser area tests the rendering half independently. Example: "add the cheapest one to my cart" becomes a CLI query with expected "identifies the lowest-priced item and calls add-to-cart tool" — the browser separately checks whether the cart badge updated. One query, two test layers.
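+
+A minimal sketch of the mapping, assuming each Queries row has already been parsed into an object. The function name, the row field names, and the `testsReasoning` flag are hypothetical; in practice the agent decides by judgment which queries are pure UI behavior:
+
+```javascript
+// Hypothetical helper: turn one area's Queries rows into cli_queries entries.
+function toCliQueries(areaSlug, rows) {
+  return rows
+    // Skip rows explicitly marked as UI-only; they have no CLI equivalent.
+    .filter((row) => row.testsReasoning !== false)
+    .map((row) => ({
+      query: row.query,           // Query -> query
+      expected: row.idealOutcome, // Ideal Outcome -> expected
+      prechecks: areaSlug,        // Area slug -> prechecks
+    }));
+}
+```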
+
+### When CLI Discovery Finds Nothing
+
+If the app has no backend API (pure static frontend, no server-side logic), set `cli_test_command: ""` and skip CLI query generation. The test file works exactly as before — browser-only testing.
+
+### CLI Response Evaluation
+
+When evaluating CLI responses, assess the **full response** — not just the text message. Check tool calls (were the right tools used with correct arguments?), structured data (are search facets, filters, and categories correct?), and metadata (suggestions, confidence scores, session state). The `expected` field should describe what a correct *response* looks like, not just what correct *text* looks like.
+
+### Run 1 CLI Queries Will Be Approximate
+
+Same principle as browser Queries: run 1 is the agent's best guess. Commit mode sharpens them. If a CLI query's `expected` description is too vague ("returns good results"), the scoring will be generous. By run 2, the agent has seen real responses and writes sharper expectations.
+
+## Probe Statuses Reference
+
+| Status | Meaning |
+|--------|---------|
+| `untested` | Generated, not yet run |
+| `passing` | Ran, verification passed |
+| `failing` | Ran, verification failed |
+| `flaky` | Mixed results across 3+ runs (at least 1 pass and 1 fail, no streak) |
+| `graduated` | Promoted to CLI regression check (read-only historical record) |
+
+See [probes.md](./probes.md) for lifecycle rules, dedup, cap/rotation, escalation, and graduation.
+
+## Probe Confidence Reference
+
+| Confidence | Meaning |
+|-----------|---------|
+| `high` | Generated from observed failure, or confirmed by a failing run |
+| `medium` | Generated from structural read or timing signal — not yet confirmed |
+| `low` | Generated from weak signal or wide inference — run early to validate |
+
+See [probes.md](./probes.md) for default confidence values by generation trigger, execution priority within confidence levels, and update rules.
+
+## Query Status Reference
+
+| Status | Meaning | Execution |
+|--------|---------|-----------|
+| (empty) | Active, exploratory | Full browser + CLI execution |
+| `[stable]` | 5/5 for 3+ consecutive runs | CLI only — no browser execution |
+| `[retired]` | Stable for 10+ consecutive runs (CLI-capable only) | Skip entirely |
+
+See [queries-and-multiturn.md](./queries-and-multiturn.md) step 12 for transition rules, regression thresholds, and CLI gate.
+
+## Maturity Status Reference
+
+| Status | Symbol | Meaning |
+|--------|--------|---------|
+| Proven | Proven | 2+ consecutive passes, no functional regressions, no verification failures |
+| Uncharted | Uncharted | Default state, or demoted from Proven |
+| Known-bug | Known-bug | Issue filed, skip until fix deployed |
diff --git a/plugins/compound-engineering/skills/user-test/references/verification-patterns.md b/plugins/compound-engineering/skills/user-test/references/verification-patterns.md
new file mode 100644
index 000000000..61bfbf328
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/verification-patterns.md
@@ -0,0 +1,169 @@
+# Verification Patterns
+
+After exploring each area, the skill runs a structural verification pass — independent of what the agent noticed during exploration. This is the "distrust the UI" layer.
+
+## Standard Checks by Area Type
+
+| Area Type | Verification Steps |
+|-----------|-------------------|
+| Filter areas | Read active filter chip state. Sample 5-8 visible results. Read the corresponding badge/attribute on each. Every result must match the filter. If sub-filter options show counts ("Like New (14)"), read one count, apply that sub-filter, count visible results — displayed count must be within ±10% or ±2 items of actual. Zero results when count > 0 is always a failure. |
+| Search/agent areas | Extract any summary claim the agent made ("showing like-new items"). Sample 5-8 results. Read the attribute the claim references. Every result must match. After results load, check `window.scrollY` — must be < 100px (see Interaction State Checks for calibration). If ≥ 100px, the page did not scroll to top. |
+| Cart areas | Read the cart badge count. Open the cart drawer/page. Count visible items. Numbers must match. |
+| Count displays | Read any "N items" or "N results" text. Count visible items on screen. Numbers must match (pagination: compare against the count on the current page, not total). |
+| Sort areas | Read the claimed sort order (e.g., "Price: Low to High"). Read the sort attribute on the first 5 visible results. Each successive value must be >= the previous (or <= for descending). |
+| Filter chip dismiss | After dismissing any filter chip: verify chip is gone from DOM, result count changed, and (if area has agent component) agent responds to a follow-up message within 10s. Non-response after chip dismiss is a verification failure. |
+
+## Tolerance Rules
+
+**Zero tolerance** for filter, search, cart, and count mismatches. If 6 of 8 sampled results match and 2 don't, that's a failure. A filter that works 75% of the time is broken. Record exact counts: "2 of 8 sampled results had mismatched condition badges."
+
+**Sort order exception:** Position drift of ±1 is acceptable (ties, identical values). Position drift of ±2 or more is a failure.
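+
+The two numeric tolerances in this file (the ±1 sort drift above and the sub-filter count rule from the Standard Checks table) can be sketched as below. The function names are hypothetical, and reading "±10%" as relative to the actual count is an assumption:
+
+```javascript
+// Hypothetical helper: displayed sub-filter count vs. counted results.
+function countWithinTolerance(displayed, actual) {
+  if (displayed > 0 && actual === 0) return false; // zero results when count > 0 always fails
+  const diff = Math.abs(displayed - actual);
+  return diff <= 2 || diff <= actual * 0.1;
+}
+
+// Hypothetical helper: largest distance any item sits from its correctly sorted position.
+// Ties keep their observed order, so identical values never count as drift.
+function maxSortDrift(values, descending = false) {
+  const indexed = values.map((value, index) => ({ value, index }));
+  const sorted = [...indexed].sort(
+    (a, b) => (descending ? b.value - a.value : a.value - b.value) || a.index - b.index
+  );
+  return sorted.reduce((max, item, pos) => Math.max(max, Math.abs(item.index - pos)), 0);
+}
+```
+
+Under this reading, a sort check passes when `maxSortDrift(prices) <= 1`.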
+
+## Interaction State Checks
+
+Some verifications require a before/after pattern — read state, trigger
+an interaction, read state again. These cannot be batched into a single
+javascript_tool call. Run them AFTER the standard batch verification pass.
+
+| Interaction | Before | Action | After | Pass Condition |
+|-------------|--------|--------|-------|---------------|
+| Filter chip dismiss | Read chip list + result count | Click dismiss | Read chip list + result count | Chip gone; result count changed |
+| Search query submit | — | Submit query, wait for results | Read `window.scrollY` | scrollY < 100px |
+| Agent follow-up after filter change | — | Dismiss chip; send follow-up | Poll for response (10s max) | Response received |
+
+**Scroll tolerance:** 100px accommodates sticky headers. Document app-specific
+threshold in the area's `verify:` block if different.
+
+**Agent timeout:** 10s is generous for 2-3s baseline apps. Calibrate against
+`score-history.json` timing data.
+
+## Scoring Impact
+
+Verification results, probe results, and UX scores are three separate signals — none subsumes the others. See SKILL.md Phase 3 checklist. An area can have:
+- Good UX + passing verification = healthy
+- Good UX + failing verification = data integrity issue (the UI lies)
+- Poor UX + passing verification = genuine UX problem (the data is correct)
+
+## Maturity Interaction
+
+- **Promotion blocked:** A verification failure blocks promotion to Proven, even if UX score >= `pass_threshold`. Area stays Uncharted with note "verification failure blocks promotion."
+- **No demotion:** A Proven area that fails verification on a subsequent run does NOT demote. Instead: a probe is generated for the next run and a warning appears in the report. Demotion only happens via the bug registry path (score drops below threshold).
+- **Probe generation:** Any verification failure triggers adversarial probe generation — see [probes.md](./probes.md).
+
+## Batching Verification Reads
+
+Verification passes are read-only — they observe DOM state without interacting. All verification reads SHOULD use a single `javascript_tool` call that returns a JSON object with all checked claims.
+
+**Pattern (replaces sequential find calls):**
+
+```javascript
+mcp__claude-in-chrome__javascript_tool({
+  code: `JSON.stringify({
+    activeFilters: [...document.querySelectorAll('[data-filter-chip]')]
+      .map(c => ({ text: c.textContent, active: c.classList.contains('active') })),
+    resultCount: document.querySelectorAll('.product-card').length,
+    sampleResults: [...document.querySelectorAll('.product-card')]
+      .slice(0, 5).map(c => ({
+        title: c.querySelector('.title')?.textContent?.trim(),
+        price: c.querySelector('.price')?.textContent?.trim(),
+        condition: c.querySelector('[data-condition]')?.textContent?.trim(),
+        category: c.querySelector('[data-category]')?.textContent?.trim()
+      }))
+  })`
+})
+```
+
+This replaces 5+ individual MCP calls with 1. At ~2-3s per MCP round trip, saves 8-12s per area.
+
+**When to use individual calls instead:**
+- DOM structure unknown (first run, no selectors documented)
+- javascript_tool fails (fall back per Graceful Degradation rules)
+- Verification requires interaction (clicking to reveal hidden state)
+
+**Selector discovery:** On first run, the agent discovers selectors during exploration. Document working selectors in the area's `**verify:**` block so subsequent runs can batch directly. Example:
+
+```markdown
+**verify:** Apply a category filter. Batch-check via javascript_tool:
+activeFilters (`[data-filter-chip]`), resultCount (`.product-card`),
+sample 5 results (`.product-card .title`, `.condition-badge`).
+Every result's category must match the filter.
+```
+
+**First-run selector lifecycle:** Selectors discovered during exploration are used for verification in the same run (held in context). They are persisted to the verify: block during commit mode. Subsequent runs read the persisted selectors directly. Do NOT write selectors to the test file mid-Phase-3 — that's a commit-time operation.
+
+Selectors compound: by run 3, most verification passes are single-call batched reads because the selectors were discovered in runs 1-2.
+
+**Failure handling:** A batch failure increments `disconnect_counter` once (it is an MCP tool failure). Area gets `verification_results: null`. Retry with individual calls before recording skip_reason.
+
+## Disconnect Pattern Tracking
+
+When `disconnect_counter` increments, record the context: which MCP tool was called, which area was being tested, and the session MCP call count.
+
+At run end, if `disconnect_counter >= 3`, append a disconnect analysis:
+
+```
+Disconnects: 10
+  Pattern: 7/10 after javascript_tool calls
+  Cluster: 6/10 after MCP call #15+
+  Worst area: agent/search-query (4 disconnects)
+  Suggestion: Extension unstable under sustained javascript_tool use.
+              Consider browser restart between iterate runs.
+```
+
+**Schema in .user-test-last-run.json:**
+
+```json
+"disconnects": {
+  "count": 10,
+  "contexts": [
+    { "call_number": 18, "tool": "javascript_tool", "area": "agent/search-query" },
+    { "call_number": 22, "tool": "click", "area": "browse/filters" }
+  ]
+}
+```
+
+This data compounds: after 3+ sessions, patterns emerge (e.g., "always after 20+ MCP calls" → connection fatigue, restart between runs).
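+
+A rough sketch of how the end-of-run analysis could be derived from the `disconnects` object above. The function name and return shape are hypothetical, and the call #15 cutoff simply mirrors the sample report:
+
+```javascript
+// Hypothetical helper: summarize the `disconnects` object from .user-test-last-run.json.
+function analyzeDisconnects(disconnects, lateCallCutoff = 15) {
+  // The analysis is only appended at 3+ disconnects; nothing to tally without contexts.
+  if (disconnects.count < 3 || disconnects.contexts.length === 0) return null;
+  const tally = (key) => {
+    const counts = {};
+    for (const ctx of disconnects.contexts) {
+      counts[ctx[key]] = (counts[ctx[key]] || 0) + 1;
+    }
+    // Sort descending by count and return the top [name, count] pair.
+    return Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
+  };
+  const [topTool, toolCount] = tally("tool");
+  const [worstArea, areaCount] = tally("area");
+  const late = disconnects.contexts.filter((c) => c.call_number >= lateCallCutoff).length;
+  return { total: disconnects.count, topTool, toolCount, worstArea, areaCount, late };
+}
+```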
+
+## verify: Blocks
+
+Areas can include an optional `**verify:**` block in their area details — freeform instructions that tell the agent what claims to audit. The structural checks above run regardless; verify blocks add area-specific auditing on top.
+
+When to add a verify block: any area with a filter, search result set, count, sort order, or agent response that summarizes data — anywhere the app could lie and the user wouldn't immediately notice.
+
+## Selector Discovery and Writeback
+
+Commit mode persists confirmed selectors into each area's `**verify:**` block. This is the highest-leverage writeback: run 1 discovers selectors through sequential trial (3-5 MCP calls), run 2 reads the verify block and batches them into one `javascript_tool` call.
+
+### Rules
+
+1. **Only write selectors confirmed by a successful batch call this run.** A selector that appeared in the DOM but wasn't used in a batch call is not confirmed — it may be fragile.
+2. **Append-only.** Never replace user-authored verify content. New selectors go below existing lines.
+3. **Tag with run number:** Append `_Selectors confirmed run N._` so future runs know the source.
+4. **Update changed selectors:** If a confirmed selector changed from the previous run (e.g., `.product-card` → `.item-card`), update the selector and reset the tag to the current run.
+5. **Preserve unchanged selectors:** If selectors from a previous run still work, leave them and their tag intact.
+6. **First-run placeholder:** If no selectors are confirmed yet, write `_Selectors not yet confirmed — discover during exploration._`
+
+### Format
+
+```markdown
+**verify:**
+- Apply filter. Batch-check via javascript_tool:
+  activeFilters (`[data-filter-chip]`), resultCount (`.product-card`),
+  sample 5 results (`.product-card .title`, `.condition-badge`).
+  Every result's attribute must match the active filter.
+  _Selectors confirmed run 3._
+```
+
+### Interaction with `.user-test-last-run.json`
+
+Confirmed selectors are stored per area in the `confirmed_selectors` object:
+
+```json
+"confirmed_selectors": {
+  "activeFilters": "[data-filter-chip]",
+  "resultCount": ".product-card",
+  "sampleResults": ".product-card .title, .condition-badge"
+}
+```
+
+`confirmed_selectors: {}` means no selectors were confirmed this run — skip verify block update for this area.