diff --git a/docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md b/docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md new file mode 100644 index 000000000..e4a6a895a --- /dev/null +++ b/docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md @@ -0,0 +1,135 @@ +--- +date: 2026-03-17 +topic: user-test-self-eval-loop +--- + +# User-Test Self-Eval Loop + +## What We're Building + +A closed-loop self-evaluation system for the `user-test` skill. After each testing session, a separate `/user-test-eval` command grades the skill's output against a fixed set of binary evals, records scores, and proposes one targeted mutation to the skill's instructions. The human reviews and accepts/rejects. Over time this produces a durable research artifact — a history of what was tried, what improved signal, and what didn't. + +## Why This Approach + +The auto-research pattern (run → eval → mutate → run again) applies to the user-test skill, but two constraints shape the design: + +1. **Skill first, queries second.** The skill has known structural issues (probe execution order violations, Proven regression conflation, P1 item burial). These corrupt signal — optimizing queries through a miscalibrated instrument produces noise. Fix the instrument first, validate it holds, then turn it on query optimization. + +2. **Semi-automated, not autonomous.** Full autonomous mutation (run every 2 minutes, keep winner) risks unreviewed prompt drift. The skill is complex enough (SKILL.md + 14 reference files, schema v9) that mutations need human review. The friction cost of review is low; the risk of unreviewed drift is high. + +## Key Decisions + +### Eval runs as a separate command, not inside the skill + +- **Decision:** New `/user-test-eval` command (Option 2), not a Phase 5 inside the skill (Option 1) or added to `/user-test-commit` (Option 3). 
+- **Rationale:** Same context window grading its own output is the exact failure mode we've already seen — structurally correct reports that technically satisfy format requirements while burying findings. Separate invocation context = harder to game. `/user-test-commit` is already doing post-processing; coupling eval logic there mixes "did this run complete" with "is the skill producing good outputs over time" — different questions on different timescales. + +### Eval reads both JSON and rendered report + +- **Decision:** `/user-test-eval` reads `.user-test-last-run.json` AND the rendered report output. +- **Rationale:** The presentation layer is where actual failures occur. A P1 item technically present in JSON but buried in report formatting is a real failure. Grading JSON alone misses the class of problems that have been the persistent issue. + +### Two artifacts: scores in JSON, reasoning in markdown + +- **Decision:** `skill-evals.json` for score history; `skill-mutations.md` for proposed changes and accept/reject log. +- **Rationale:** Scores need to be parseable by future runs. Mutation proposals need to be readable and editable by humans. Different purposes, different formats. `skill-mutations.md` becomes the durable research artifact — the "big list of things tried" that is the most underrated output of the whole process. + +### Start with exactly 3 binary evals + +- **Decision:** 3 evals, not more. Expand only after these are stable. +- **Rationale:** Too many evals invites reward hacking — the agent finding ways to technically pass all checks without improving quality. Three is tight enough to avoid gaming, broad enough to cover three distinct failure layers. + +## The Binary Eval Set + +### Eval 1: Probe Execution Order (protocol layer) + +**Question:** "Did all failing/untested probes in each area execute before broad exploration began?" + +- **Grading:** Yes/no per area. Overall FAIL if any area violated. 
+- **Tests:** Whether the agent followed the probe-first protocol, which exists because probes are the highest-signal checks and broad exploration can mask their results. +- **Known failure mode:** Agent exploring broadly first, then running probes in whatever order, reducing probe signal quality. + +### Eval 2: Proven Regression Reasoning (reasoning layer) + +**Question:** "Did the report distinguish between 'new bug in Proven area' and 'area no longer meets Proven criteria'?" + +- **Grading:** PASS if these are treated as categorically different events. FAIL if all regressions are treated as the same type. +- **Tests:** Whether the agent understood that a Proven area failing is categorically different from a Known-bug area failing — not just a score change but a status change with different implications. +- **Known failure mode:** Agent filing bugs and updating scores without surfacing that a Proven regression is a different class of event. Treating all regressions uniformly. + +### Eval 3: P1 Surfacing (presentation layer) + +**Question:** "Did every P1 item (active probe failure OR new bug) appear in the NEEDS ACTION section, not only in DETAILS?" + +- **Grading:** PASS if every P1 item is in NEEDS ACTION. FAIL if any P1 item appears only in DETAILS. +- **Tests:** Whether the report's summary layer actually surfaces the most important findings, or buries them in structural completeness. +- **Known failure mode:** Structurally correct reports where P1 items exist in the data but don't surface to the section the human actually reads and acts on. 
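The two mechanical evals above (probe execution order and P1 surfacing) can be sketched as small pure functions. This is a minimal sketch under stated assumptions, not the actual `/user-test-eval` logic: the `steps`, `kind`, and `name` fields, the use of plain item IDs, and the idea that the report contains literal `NEEDS ACTION` and `DETAILS` headings in that order are all hypothetical. Eval 2 is left out because it grades reasoning and likely needs judgment rather than a string check.

```javascript
// Hypothetical input shapes: the real .user-test-last-run.json schema and
// report layout are not specified in this document, so these are assumptions.

// Eval 1: an area is violated if any probe step runs after the first
// broad-exploration step in that area.
function gradeProbeOrder(areas) {
  const areasViolated = areas
    .filter((area) => {
      const firstExplore = area.steps.findIndex((s) => s.kind === "explore");
      if (firstExplore === -1) return false; // no exploration, nothing to violate
      return area.steps.some((s, i) => s.kind === "probe" && i > firstExplore);
    })
    .map((area) => area.name);
  return { pass: areasViolated.length === 0, areas_violated: areasViolated };
}

// Eval 3: every P1 item ID must appear in the NEEDS ACTION section,
// not only in DETAILS.
function gradeP1Surfacing(p1Ids, reportText) {
  const start = reportText.indexOf("NEEDS ACTION");
  const end = start === -1 ? -1 : reportText.indexOf("DETAILS", start);
  const needsAction =
    start === -1 ? "" : reportText.slice(start, end === -1 ? undefined : end);
  const surfaced = p1Ids.filter((id) => needsAction.includes(id));
  return {
    pass: surfaced.length === p1Ids.length,
    p1_count: p1Ids.length,
    surfaced_count: surfaced.length,
  };
}

// Usage with made-up data: "cart" probes after exploring, and "bug-7"
// appears only in DETAILS, so both evals fail.
const sampleAreas = [
  { name: "login", steps: [{ kind: "probe" }, { kind: "explore" }] },
  { name: "cart", steps: [{ kind: "explore" }, { kind: "probe" }] },
];
const sampleReport =
  "NEEDS ACTION\n- P1 bug-12 cart count wrong\nDETAILS\n- bug-12 trace\n- bug-7 label unclear";
console.log(gradeProbeOrder(sampleAreas));
console.log(gradeP1Surfacing(["bug-12", "bug-7"], sampleReport));
```

Both functions return objects whose keys mirror the `results` entries in the `skill-evals.json` example in this document, so a real implementation could write them out unchanged.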
+ +## Artifact Locations + +``` +tests/user-flows/ + skill-evals.json # Score history per run + skill-mutations.md # Proposed diffs + accept/reject log +``` + +### skill-evals.json structure + +```json +{ + "evals": [ + { + "run_timestamp": "2026-03-17T14:30:00Z", + "git_sha": "abc1234", + "skill_version": "2.51.0", + "test_file": "resale-clothing.md", + "results": { + "probe_execution_order": { "pass": true, "areas_violated": [] }, + "proven_regression_reasoning": { "pass": false, "detail": "Login area regressed from Proven but report filed bug without noting status change" }, + "p1_surfacing": { "pass": true, "p1_count": 2, "surfaced_count": 2 } + }, + "overall_pass": false, + "proposed_mutation": "Clarify Phase 4 to require explicit 'Proven → Regressed' status callout when a Proven area scores below pass_threshold" + } + ] +} +``` + +### skill-mutations.md structure + +```markdown +# Skill Mutations Log + +## Mutation 1 — 2026-03-17 + +**Triggered by:** Eval 2 failure (Proven regression reasoning) +**Eval scores:** 2/3 pass (probe order: PASS, regression reasoning: FAIL, P1 surfacing: PASS) +**Proposed change:** Add explicit instruction in Phase 4 scoring section: "When a Proven area scores below pass_threshold, the report MUST include a 'Proven Regression' callout distinct from any bug filing. This is a status change, not just a score change." 
+**Diff:** [specific lines in SKILL.md or reference file] +**Status:** PENDING | ACCEPTED | REJECTED +**Outcome after acceptance:** [filled in after next run] +``` + +## Scope Boundaries + +**In scope:** +- `/user-test-eval` command that grades last run against 3 binary evals +- `skill-evals.json` for score persistence +- `skill-mutations.md` for mutation proposals and history +- One mutation proposal per eval run (not one per failing eval) + +**Out of scope (for now):** +- Autonomous mutation (no auto-editing SKILL.md) +- Query-level optimization (comes after skill evals are stable) +- More than 3 evals (expand only when current set is consistently passing) +- Integration with `/user-test-commit` (eval stays independent) + +## Open Questions + +- Should `/user-test-eval` auto-run after `/user-test-commit`, or stay fully manual? Leaning manual to keep the separation clean, but convenience might win. +- Where exactly do `skill-evals.json` and `skill-mutations.md` live — in `tests/user-flows/` (alongside test files) or in the skill directory itself? The skill directory is plugin-managed; `tests/user-flows/` is project-local. +- When evals are consistently passing (say, 5 consecutive runs all pass), what's the trigger to add a 4th eval or shift to query optimization? + +## Next Steps + +-> `/workflows:plan` for implementation details (the `/user-test-eval` command, artifact formats, eval logic) diff --git a/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md b/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md new file mode 100644 index 000000000..834a6ab7e --- /dev/null +++ b/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md @@ -0,0 +1,597 @@ +# Decision Record + +**Deepened on:** 2026-02-26 +**Sections enhanced:** 11 of 13 +**Research agents used:** 14 +**Total recommendations applied:** 37 (22 implement, 9 fast_follow, 6 defer) + +## Pre-Implementation Verification + +1. 
[ ] Verify current component counts: `ls -d plugins/compound-engineering/skills/*/ | wc -l` and `ls plugins/compound-engineering/commands/*.md plugins/compound-engineering/commands/workflows/*.md | wc -l` +2. [ ] Verify current plugin version in `plugins/compound-engineering/.claude-plugin/plugin.json` +3. [ ] Confirm `claude-in-chrome` MCP tool names by running `/mcp` and selecting `claude-in-chrome` +4. [ ] Review `plugins/compound-engineering/skills/deepen-plan/SKILL.md` frontmatter format as canonical thin-wrapper reference +5. [ ] Verify `plugins/compound-engineering/commands/deepen-plan.md` as canonical thin-wrapper command template +6. [ ] Check that `tests/user-flows/` does not already exist in any target project (no namespace collision) + +## Implementation Sequence + +1. **Create `skills/user-test/references/` files first** — the SKILL.md references these, so they must exist before the skill is validated +2. **Create `skills/user-test/SKILL.md`** — the core skill with 5-phase execution logic + commit mode, under 500 lines +3. **Create thin wrapper commands** — `commands/user-test.md`, `commands/user-test-iterate.md`, and `commands/user-test-commit.md` +4. **Update metadata files** — plugin.json, marketplace.json, README.md, CHANGELOG.md (use dynamic counts, not hardcoded numbers) +5. **Run `/release-docs`** — regenerate documentation site +6. **Validate** — JSON validity, component count consistency, SKILL.md line count + +## Key Improvements + +1. **[Strong Signal -- 5 agents] Maturity model: guidance over rigid rules** — Replace hardcoded "3 consecutive passes = Proven" and "any failure = reset to Uncharted" with agent-guided judgment. Provide a rubric and guidelines, but let the agent decide based on context (e.g., a cosmetic issue in a Proven area should not trigger full demotion). Simplify initial threshold to 2 consecutive passes. + +2. 
**[Strong Signal -- 4 agents] Extract reference files from SKILL.md from day one** — Split the skill into SKILL.md (~300 lines of execution logic) plus `references/` directory containing test-file-template.md, browser-input-patterns.md, and iterate-mode.md. The monolith-to-skill-split learning explicitly warns that stated size budgets without enforcement are ignored. + +3. **[Strong Signal -- 4 agents] Extension disconnect handling with specific recovery instructions** — Replace generic "retry-once" with: wait 3 seconds, retry once, on second failure instruct user to run `/chrome` and select "Reconnect extension". Track cumulative disconnects and abort after 3 with a clear stability message. + +4. **[Strong Signal -- 3 agents] Add `disable-model-invocation: true` to both thin wrapper commands** — The commands have side effects (file creation, browser interaction, issue filing). Official docs require this flag for side-effect workflows. + +5. **[Strong Signal -- 3 agents] Explicit distinction from agent-browser/test-browser in SKILL.md intro** — Two browser tools creates confusion. The SKILL.md intro must state: "This skill is for exploratory testing in a visible Chrome window with shared login state. For automated headless regression testing, use /test-browser instead." + +6. **[Strong Signal -- 3 agents] Dynamic component counts in acceptance criteria** — Do not hardcode "Skills: 21, Commands: 24". Count actual files and verify description strings match. + +7. **[Strong Signal -- 3 agents] Enhanced preflight check with `/chrome` guidance, WSL detection, and site permissions** — Phase 0 must guide users to run `/chrome` if MCP tools are unavailable, detect WSL and abort with a clear message, and verify the target URL is within Chrome extension's allowed sites. + +8. **Quality scoring rubric with concrete calibration anchors** — Define what scores 1-5 mean with examples, making scoring reproducible across runs. + +9. 
**Test file schema version for forward compatibility** — Add `schema_version: 1` to test file template frontmatter. + +10. **SKILL.md description must be single-line string** — Multiline YAML indicators break the skill indexer. Use a single line with trigger keywords for auto-discovery. + +## Research Insights + +### Browser Automation (claude-in-chrome) +- Actions run in a **visible Chrome window** in real time — the user can watch and intervene. This is the core differentiator from headless agent-browser and should be prominently documented. +- Claude **shares browser login state** — eliminates most authentication concerns. Users sign in once in Chrome; Claude inherits the session. +- **GIF recording** is available as a built-in capability. Phase 7 (Summary) can offer to record sessions for evidence attached to GitHub issues. +- **Site-level permissions** from the Chrome extension control which URLs Claude can interact with. Preflight should verify this. +- **Modal dialogs** (alert, confirm, prompt) block all browser commands. The skill should detect unresponsive commands and instruct users to dismiss dialogs manually. + +### Skill Architecture (Claude Code Plugins) +- SKILL.md description must be a **single-line string** — multiline YAML breaks the indexer. +- Skills use **progressive disclosure**: only frontmatter loads initially (~100 tokens); full content loads on activation. This makes the 500-line target a recommendation, not just a budget. +- `context:fork` is available for isolated execution but is not needed for this skill's use case. + +### Security Considerations +- **Path traversal**: test file path resolution must be validated to stay within `tests/user-flows/`. +- **Credential prohibition**: the skill must never persist passwords, tokens, or session IDs in any written output. +- **Issue body sanitization**: content derived from test results should be sanitized before passing to `gh issue create`. + +## New Considerations Discovered + +1. 
**MCP tool batching for performance** — Each claude-in-chrome MCP call involves a Chrome extension round-trip. Batch simple checks (element visibility, text content) into single `javascript_tool` calls. Define "quick spot-check" for Proven areas as max 3 MCP calls. + +2. **Iterate mode token cap** — Each 7-phase run consumes significant tokens. Add a default cap of N <= 10 with explicit override. + +3. **State clearing between iterate runs is incomplete** — Full page reload to app entry URL is the reset mechanism. This does not cover IndexedDB, service worker caches, or HttpOnly cookies. Document this limitation. + +4. **Test history file rotation** — `tests/user-flows/test-history.md` will grow unbounded. Add a rotation strategy: keep last 50 entries, archive older ones. + +5. **Atomic file writes** — "Full rewrite" is not truly atomic. Use write-to-temp-then-rename pattern for test file updates. + +6. **Area granularity definition** — The maturity map tracks "areas" but never defines what size an area should be. Without guidance, two runs will decompose the same scenario differently, making consecutive-pass tracking meaningless. Define areas as 1-3 user interactions each (e.g., "checkout" → cart-validation, shipping-form, payment-submission). Include a worked example in `references/test-file-template.md`. + +7. **Explore Next Run needs prioritization** — The Explore Next Run section is append-only with no signal about urgency. After 5-6 runs it becomes a backlog with no entry point. Add priority levels: `P1` (likely user-facing friction), `P2` (edge case worth knowing), `P3` (curiosity). Instruct Phase 3 to pick highest-priority uncharted items first. + +8. **Issue deduplication needs structured labels** — Semantic search via `gh issue list --search` is fragile — two runs will describe the same bug differently. 
Use a structured `user-test:` label on every issue (e.g., `user-test:checkout/cart-count`) for exact-match dedup via `--label` flag, with semantic search as fallback only. + +9. **Qualitative assessments evaporate after each run** — The Run Summary asks "Demo ready?" but this answer never persists in test-history.md. Add a `demo_readiness` field (yes/no/partial) to the history table schema so trend data captures qualitative signal, not just scores. + +10. **App-level environment sanity check** — Phase 0 validates tool availability but not app health. Stale auth tokens, empty search indices, or silent API 500s produce misleading test results that look like quality issues. Add a Phase 2 "environment sanity check": one known-good navigation + one content assertion before executing test scenarios. + +## Fast Follow (ticket before merge) + +**Tier 1 -- Blocks demo/UX quality** (fix within 1-2 days): +- Add cross-reference from `agent-browser/SKILL.md` back to `user-test` to prevent user confusion between the two browser testing approaches + +**Tier 2 -- Improves robustness** (fix within 1 sprint): +- Add file upload workaround documentation: pause user-test and use `/agent-browser` for upload steps, then resume +- MCP tool mapping table (agent-browser CLI vs claude-in-chrome MCP equivalents) in a shared reference file +- Test file concern separation: evaluate splitting run history into a sidecar `.json` for machine parsing while keeping the `.md` human-readable + +## Cross-Cutting Concerns + +1. **SKILL.md size budget enforcement** — Four agents independently recommend the `references/` extraction. The structural decision affects the content outline (section 8), technical considerations (section 5), thin wrapper templates (section 9), and the SpecFlow analysis (section 10). This is the single most impactful structural change. + +2. 
**Maturity model rigidity vs agent judgment** — Five agents flag this across scoring (section 4), SKILL.md phases (section 8), and success metrics (section 12). The resolution: provide guidance and rubrics, not rigid rules. + +3. **MCP reliability and graceful degradation** — Four agents converge on this across preflight (section 5), execution (section 8), and risks (section 11). The pattern: specific recovery instructions for known failure modes, graceful degradation for mid-run tool failures. + +4. **`disable-model-invocation: true`** — Three agents confirm this is required for both wrapper commands. Single-section impact but high confidence from official docs. + +## Deferred to Future Work + +- **MCP abstraction layer** for future tool swaps (agent-browser <-> claude-in-chrome) — adds unnecessary complexity for v1 +- **Test file concern separation** into spec + state sidecar — evaluate after real-world usage reveals whether the single-file approach causes friction +- **`/mcp` runtime discovery** of available tools instead of hardcoded tool names — low confidence (0.65), nice-to-have for forward compatibility +- **`context:fork` isolation** for iterate mode runs — not needed for current architecture but could improve memory isolation for long iterate sessions + +## Research Gaps Addressed + +| Source | Recommendation | Status | +|--------|---------------|--------| +| docs-researcher-claude-code-plugins | Single-line description | Implemented in SKILL.md frontmatter | +| docs-researcher-claude-code-plugins | Keep SKILL.md under 500 lines | Implemented via references/ extraction | +| docs-researcher-claude-code-plugins | disable-model-invocation: true | Implemented in both wrapper commands | +| docs-researcher-claude-code-plugins | /chrome activation guidance | Implemented in Phase 0 preflight | +| docs-researcher-claude-code-plugins | Service worker idle disconnects | Implemented in disconnect handling | +| docs-researcher-claude-code-plugins | WSL not supported | 
Implemented in Phase 0 preflight | +| docs-researcher-claude-code-plugins | Login page/CAPTCHA pausing | Implemented in Phase 2 setup | +| docs-researcher-claude-in-chrome | Visible Chrome window | Implemented in SKILL.md intro | +| docs-researcher-claude-in-chrome | Shared browser login state | Implemented in Phase 2 setup | +| docs-researcher-claude-in-chrome | GIF recording | Acknowledged in Phase 7 as optional enhancement | +| docs-researcher-claude-in-chrome | Site-level permissions | Implemented in Phase 0 preflight | +| docs-researcher-claude-in-chrome | Named pipe conflicts (Windows) | Implemented in Phase 0 preflight | +| docs-researcher-claude-in-chrome | Modal dialogs block commands | Implemented in Phase 3 execution | +| docs-researcher-claude-in-chrome | /mcp runtime discovery | Deferred — low confidence (0.65), nice-to-have | + +--- +# Implementation Spec +--- + +--- +title: "Add user-test browser testing skill and commands" +type: feat +status: active +date: 2026-02-26 +--- + +# Add user-test Browser Testing Skill and Commands + +## Overview + +Add a new `user-test` skill and three companion commands (`/user-test`, `/user-test-iterate`, `/user-test-commit`) to the compound-engineering plugin. This implements browser-based exploratory user testing via `claude-in-chrome` MCP tools with a compounding maturity model — each run makes the test file smarter by promoting proven areas, filing new bugs, and expanding coverage. + +This skill is for **exploratory testing in a visible Chrome window** with shared login state. The user watches the test happening in real-time and can intervene if needed. For automated headless regression testing, use `/test-browser` instead — it uses the `agent-browser` CLI for deterministic, CI-oriented QA checks. 
+ +The three commands separate concerns: `/user-test` runs and scores a test, `/user-test-iterate` runs it N times for consistency data, and `/user-test-commit` applies results (updates the test file maturity map, files issues, appends history). This separation keeps the fast feedback loop (run + score) lightweight and lets the user decide when to commit results. + +## Problem Statement / Motivation + +The plugin has `test-browser` (deterministic QA regression via `agent-browser` CLI) but no exploratory user testing capability. Teams need a way to: + +- Simulate real user behavior against their app in a visible browser +- Track which areas are stable vs. fragile across runs +- Automatically file and deduplicate GitHub issues from testing sessions +- Compound knowledge: skip proven areas, skip known bugs, focus effort on uncharted territory + +This fills a distinct niche from `test-browser` — exploratory quality assessment with compounding knowledge, not regression checking. + +## Proposed Solution + +### Architecture: Skill + Thin Wrapper Commands + +Following the `deepen-plan` precedent (v2.36.0 refactor), implement as: + +| File | Type | Purpose | +|------|------|---------| +| `skills/user-test/SKILL.md` | Skill | Core 5-phase execution logic + commit mode (~300 lines) | +| `skills/user-test/references/test-file-template.md` | Reference | Test file template for new scenarios (~100 lines) | +| `skills/user-test/references/browser-input-patterns.md` | Reference | React-safe input patterns and MCP tool tips (~30 lines) | +| `skills/user-test/references/iterate-mode.md` | Reference | Iterate mode execution details (~50 lines) | +| `commands/user-test.md` | Thin wrapper | `Skill(user-test)` invocation for `/user-test` | +| `commands/user-test-iterate.md` | Thin wrapper | `Skill(user-test)` invocation with iterate mode for `/user-test-iterate` | +| `commands/user-test-commit.md` | Thin wrapper | `Skill(user-test)` invocation for committing results | + +**Why skill + 
thin wrapper?** +- The execution logic is ~300 lines — well within the 500-line skill recommendation +- Reference files extract the test template, input patterns, and iterate mode details — each reusable independently +- Thin wrappers prevent command bloat (learnings: monolith-to-skill split anti-patterns) +- Both commands share the same skill logic, just with different invocation modes +- Consistent with the Pattern A convention used by `deepen-plan`, `create-agent-skill`, etc. + +**Why extract to `references/` from day one?** +The monolith-to-skill-split learning (convergence from 4 agents) explicitly warns: "Stating max 1200 lines in a plan is a policy wish. Without a gate that fails the pipeline, the file will grow past the budget." By starting with the split structure, the SKILL.md stays focused on execution phases and the reference files can grow independently without threatening the line budget. + +### Key Design Decisions + +**1. Browser tool: `claude-in-chrome` MCP (not `agent-browser` CLI)** + +The skill uses `mcp__claude-in-chrome__*` tools (find, javascript_tool, read_page, screenshots). This is intentionally different from `test-browser` which uses the headless `agent-browser` CLI. The rationale: +- `user-test` simulates a real user in a **visible** Chrome window — interactive, visual. The user can watch the test happening and intervene. +- `test-browser` runs headless regression checks — deterministic, CI-oriented +- Different tools for different testing philosophies +- `claude-in-chrome` shares the browser's login state, so authenticated app testing requires no credential handling — the user simply signs in once in Chrome + +**2. Test file as the product, not the run report** + +Living test files in `tests/user-flows/.md` get rewritten each run with updated maturity maps, scores, and history. The test file compounds intelligence across runs. 
+ +Test files include a `schema_version: 1` field in frontmatter to enable forward-compatible migrations when the maturity model or file structure evolves. + +**3. Maturity model drives test efficiency** + +The maturity model provides guidance for the agent's judgment, not rigid rules: + +| Status | Behavior | Guidance | +|--------|----------|----------| +| Proven | Quick spot-check only (max 3 MCP calls) | Promote after 2+ consecutive passes with no significant issues. Cosmetic issues do not warrant demotion. | +| Uncharted | Full investigation, edge cases | Default state. Demote from Proven only on functional regressions or new features. | +| Known bug | Skip entirely | Issue filed. Skip until fix deployed. | + +The agent exercises judgment on promotions and demotions using the scoring rubric rather than following mechanical counters. A minor CSS issue in a Proven area stays Proven with a note. A broken API in an Uncharted area gets a Known-bug issue filed. + +**Partial run safety:** If a run is interrupted before scoring completes, no maturity updates are produced. Only `/user-test-commit` writes maturity state, and only from a completed run's results. + +**Area granularity:** Each area should cover 1-3 user interactions — small enough that a single bug doesn't reset a huge chunk of proven territory, large enough to accumulate consecutive passes. Example decomposition for "checkout": + +| Area | Interactions | What's tested | +|------|-------------|---------------| +| `checkout/cart-validation` | Add item, verify count, change quantity | Cart state management | +| `checkout/shipping-form` | Enter address, select method, see estimate | Form validation + shipping logic | +| `checkout/payment-submission` | Enter card, submit, see confirmation | Payment flow + success state | + +A worked example with this decomposition pattern is included in [test-file-template.md](./references/test-file-template.md). 
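The area-granularity guidance above (1-3 user interactions per area) can be sketched as a small lint over a maturity map. This is an illustrative sketch only: the `{ name, interactions }` shape and the `checkAreaGranularity` helper are hypothetical and are not part of the test file template. It flags areas for a human or agent to reconsider and does not promote or demote anything.

```javascript
// Hypothetical shape for one area:
//   { name: "checkout/cart-validation", interactions: ["Add item", "Verify count"] }
// Returns the areas that fall outside the suggested 1-3 interaction range.
function checkAreaGranularity(areas, min = 1, max = 3) {
  return areas
    .filter((a) => a.interactions.length < min || a.interactions.length > max)
    .map((a) => ({
      area: a.name,
      interactions: a.interactions.length,
      hint:
        a.interactions.length > max
          ? "consider splitting into smaller areas"
          : "consider merging into a neighbouring area",
    }));
}

// Usage: the "checkout" decomposition from the table above passes,
// while a single monolithic "checkout" area is flagged.
const decomposed = [
  { name: "checkout/cart-validation", interactions: ["Add item", "Verify count", "Change quantity"] },
  { name: "checkout/shipping-form", interactions: ["Enter address", "Select method", "See estimate"] },
  { name: "checkout/payment-submission", interactions: ["Enter card", "Submit", "See confirmation"] },
];
const monolith = [
  { name: "checkout", interactions: decomposed.flatMap((a) => a.interactions) },
];
console.log(checkAreaGranularity(decomposed));
console.log(checkAreaGranularity(monolith));
```

Keeping this as a hint rather than a hard gate matches the plan's preference for agent judgment over mechanical counters.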
+ +**Quality Scoring Rubric** + +Each score applies to one **scored interaction unit** — a single user-facing task completion (e.g., "add item to cart", "submit shipping form", "complete payment"). Navigation steps, page loads, and setup actions are not scored individually; they are part of the interaction they serve. + +| Score | Meaning | Example | +|-------|---------|---------| +| 1 | Broken — cannot complete the task | Button unresponsive, page crashes | +| 2 | Completes with major friction | 3+ confusing steps, error messages shown | +| 3 | Completes with minor friction | Small UX issues, unclear labels | +| 4 | Smooth experience | Clear flow, no confusion | +| 5 | Delightful | Exceeds expectations, helpful feedback | + +## Technical Considerations + +### Distinct from existing `test-browser` command + +| Aspect | `test-browser` | `user-test` (new) | +|--------|---------------|-------------------| +| Tool | `agent-browser` CLI (headless) | `claude-in-chrome` MCP (visible browser) | +| Purpose | QA regression on PR-affected pages | Exploratory user testing | +| State | Stateless per run | Stateful via test files | +| Output | Pass/fail per route | Quality scores 1-5 per interaction | +| Issues | No issue creation | Auto-files and deduplicates issues | +| Auth | Handles login flows | Shares browser login state | +| Observation | Results only | Real-time visual — user watches the test | + +### MCP dependency + +The skill requires `claude-in-chrome` MCP to be connected. Phase 0 (Preflight Check) validates availability and provides specific guidance: + + +``` +## Phase 0: Preflight Check +1. Check if claude-in-chrome MCP tools are available +2. If NOT available: + - Display: "claude-in-chrome not connected. Run /chrome or restart with claude --chrome" + - Abort with clear instructions +3. Detect WSL environment: + - If running in WSL: "Chrome integration is not supported in WSL. Run Claude Code directly on Windows." + - Abort +4. 
Verify the target app URL is within Chrome extension's allowed sites + - If permission denied: "Grant site permission in Chrome extension settings for [URL]" +5. Windows: if EADDRINUSE error on named pipe: + - "Close other Claude Code sessions that might be using Chrome, then retry" +``` + +### `gh` CLI dependency + +Issue creation (Phase 6) requires `gh auth status`. The skill handles this gracefully: +- If `gh` is not authenticated: skip issue creation, note in summary +- If `gh` is authenticated: proceed with duplicate detection and filing +- **Structured dedup labels:** Every issue gets a label `user-test:` (e.g., `user-test:checkout/cart-count`). Duplicate detection uses `gh issue list --label "user-test:" --state open` for exact match, falling back to semantic title search only if no label match found. Labels are machine-parseable and immune to description rewording. +- Issue body content sanitized before passing to `gh issue create` to prevent markdown injection + +### React-safe input pattern + +The React-specific native setter pattern for bypassing virtual DOM is extracted to [browser-input-patterns.md](./references/browser-input-patterns.md). This keeps framework-specific tool logic reusable and out of the main SKILL.md. + +### MCP tool performance + +Each claude-in-chrome MCP call involves a round-trip through the Chrome extension. 
To manage latency: +- Batch simple checks (element visibility, text content, price display) into single `javascript_tool` calls +- Define "quick spot-check" for Proven areas as max 3 MCP calls per area +- Full investigations for Uncharted areas have no artificial cap but should use batched checks where possible + + +```javascript +// Batch multiple checks into one javascript_tool call: +mcp__claude-in-chrome__javascript_tool({ + code: `JSON.stringify({ + submitBtn: !!document.querySelector('[type=submit]'), + errorMsg: !!document.querySelector('.error'), + price: document.querySelector('.price')?.textContent + })` +}) +``` + +### Connection resilience + +Extension disconnects are a known issue: the Chrome extension service worker can go idle during extended sessions. + + +``` +## Disconnect Handling +1. After MCP tool failure: wait 3 seconds +2. Retry the call once +3. If retry fails: "Extension disconnected. Run /chrome and select Reconnect extension" +4. Track disconnect_counter for the session +5. If disconnect_counter >= 3: abort with "Extension connection unstable. Check Chrome extension status and restart the session." +``` + +### Modal dialog handling + +JavaScript dialogs (alert, confirm, prompt) block all browser events and prevent Claude from receiving commands. If commands stop responding after a dialog trigger, instruct the user to dismiss the dialog manually before continuing. + +### Graceful degradation + +Apply the same pattern used for `gh` CLI absence to MCP tool failures mid-run: +- If screenshot fails: continue but note "screenshots unavailable" in the report +- If javascript_tool fails: fall back to individual find/click calls +- If all MCP tools fail: abort with specific recovery instructions + +## System-Wide Impact + +- **Interaction graph**: Skill invoked by three thin wrapper commands. No callbacks or middleware. Writes to `tests/user-flows/` (user's project, not the plugin). Calls `gh` CLI for issue creation. 
+- **Error propagation**: MCP disconnects handled with retry-once + specific recovery instructions. `gh` failures gracefully degraded (skip issue creation). Mid-run MCP tool failures degrade individual capabilities rather than aborting. +- **State lifecycle risks**: Test file writes use write-to-temp-then-rename pattern for atomic updates. Partial runs produce no committable output (maturity safety). Iterate mode resets between runs via full page reload to the app entry URL. Note: this does not clear IndexedDB, service worker caches, or HttpOnly cookies — document this limitation in iterate mode reference. +- **API surface parity**: No overlap with existing commands — distinct MCP tool set, distinct file structure, distinct purpose. +- **Security**: Test file paths validated to stay within `tests/user-flows/`. No credentials persisted in any written output (test files, run history, issue bodies). Issue body content sanitized before `gh` CLI invocation. + +## Acceptance Criteria + +### Files to Create + +- [x] `plugins/compound-engineering/skills/user-test/SKILL.md` — Core skill with 5 phases + commit mode, ~300 lines +- [x] `plugins/compound-engineering/skills/user-test/references/test-file-template.md` — Test file template for new scenarios +- [x] `plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md` — React-safe input patterns +- [x] `plugins/compound-engineering/skills/user-test/references/iterate-mode.md` — Iterate mode details +- [x] `plugins/compound-engineering/commands/user-test.md` — Thin wrapper with `disable-model-invocation: true` +- [x] `plugins/compound-engineering/commands/user-test-iterate.md` — Thin wrapper with `disable-model-invocation: true` and iterate argument forwarding +- [x] `plugins/compound-engineering/commands/user-test-commit.md` — Thin wrapper with `disable-model-invocation: true` for committing results + +### Files to Modify + +- [x] `plugins/compound-engineering/.claude-plugin/plugin.json` — bump 
version, update description with dynamic counts +- [x] `.claude-plugin/marketplace.json` — bump version, update description with dynamic counts +- [x] `plugins/compound-engineering/README.md` — Update component count table, add skill row under Browser Automation, add two command rows +- [x] `plugins/compound-engineering/CHANGELOG.md` — Add new version entry with `### Added` section + +### Post-Change Validation + +- [x] Validate JSON: `cat .claude-plugin/marketplace.json | jq .` and `cat plugins/compound-engineering/.claude-plugin/plugin.json | jq .` +- [x] Verify skill count matches description: `SKILL_COUNT=$(ls -d plugins/compound-engineering/skills/*/ | wc -l) && grep -q "$SKILL_COUNT skill" plugins/compound-engineering/.claude-plugin/plugin.json` +- [x] Verify command count matches description: `CMD_COUNT=$(ls plugins/compound-engineering/commands/*.md plugins/compound-engineering/commands/workflows/*.md | wc -l) && grep -q "$CMD_COUNT command" plugins/compound-engineering/.claude-plugin/plugin.json` +- [x] Verify SKILL.md line count: `SKILL_LINES=$(wc -l < plugins/compound-engineering/skills/user-test/SKILL.md) && [ "$SKILL_LINES" -le 500 ] && echo "OK: $SKILL_LINES lines" || echo "FAIL: $SKILL_LINES lines (max 500)"` +- [x] Verify SKILL.md frontmatter compliance: `name: user-test`, single-line description with trigger keywords +- [x] Verify reference files are linked with proper markdown links (not backtick references) +- [ ] Run `claude /release-docs` to regenerate all docs site pages + +### Functional Requirements + +- [ ] `/user-test tests/user-flows/checkout.md` — loads existing test file, runs phases 0-4 (score + report) +- [ ] `/user-test "Test the checkout flow"` — creates new test file from description, runs phases 0-4 +- [ ] `/user-test-commit` — applies results from last run: updates maturity map, files issues, appends history +- [ ] `/user-test-iterate tests/user-flows/checkout.md 5` — runs the scenario 5 times, reports consistency +- [ ] Maturity 
model correctly promotes (2+ consistent passes with agent judgment) and demotes (functional regression with agent judgment) +- [ ] Issues include `user-test:` label; dedup uses `--label` flag first, semantic fallback second +- [ ] Test file template created for new scenarios with all required sections including `schema_version: 1` +- [ ] `tests/user-flows/test-history.md` appended after each run (rotation: keep last 50 entries, includes quality avg + pass rate + disconnects + demo_readiness + key finding) +- [ ] Test file path validated to stay within `tests/user-flows/` (no directory traversal) +- [ ] Iterate mode: N capped at 10 by default, error on N=0, N=1 valid +- [ ] Iterate mode: reset between runs = full page reload to app entry URL (limitations: IndexedDB, SW caches, HttpOnly cookies not cleared) +- [ ] Iterate mode: partial run handling (disconnects mid-iterate produce valid partial results) +- [ ] `test-history.md` includes `demo_readiness` column (yes/no/partial) persisted each run +- [ ] Explore Next Run items include priority (P1/P2/P3); Phase 3 picks highest priority first +- [ ] Area granularity: worked example in test-file-template.md showing 1-3 interactions per area +- [ ] Phase 2 environment sanity check: verifies app loads with expected content before test execution +- [ ] Given a new scenario, full pipeline (phases 0-4 + commit) produces: test file with schema_version: 1, quality score, maturity map, and summary — all without manual intervention beyond initial command +- [ ] Given a test file with an Uncharted area, after iterate N=3 where all runs score >= 4, the area's maturity status is Proven + +### Security Requirements + +- [ ] Test file path resolution prevents directory traversal +- [ ] No credentials (passwords, tokens, session IDs) persisted in any output file +- [ ] Issue body content sanitized before `gh issue create` +- [ ] `user-test:` label convention documented for duplicate detection + +## SKILL.md Content Outline + +The skill 
contains 5-phase execution logic (run + score) plus a commit mode (update files + file issues), with references to supporting files: + + +``` +--- +name: user-test +description: Run browser-based user testing via claude-in-chrome MCP with quality scoring and compounding test files. Use when testing app quality from a real user's perspective, scoring interactions, tracking test maturity, or filing issues from test sessions. +argument-hint: "[scenario-file-or-description]" +--- + +# User Test + +Exploratory testing in a visible Chrome window. You watch the test happening +in real-time and can intervene if needed. Claude shares your browser's login +state — sign into your app in Chrome before running. + +For automated headless regression testing, use /test-browser instead. + +**v1 limitation:** This skill targets localhost / local dev server apps. External +or staging URLs are not validated for deployment status — if you test against a +remote URL, verify it's live and accessible before running. + +## Phase 0: Preflight +[Validate: claude-in-chrome MCP available (if not: "Run /chrome"), WSL detection, +site permissions, gh auth status, app URL resolvable] + +## Phase 1: Load Context +[Resolve test file from path/description, validate path stays within tests/user-flows/, +extract maturity map + history, validate schema_version] +[If no argument: scan tests/user-flows/ for test files, present list, or prompt for description] +[If test file corrupted: offer to regenerate from template] + +## Phase 2: Setup +[Ensure user is signed into target app in Chrome (shared login state), +take baseline screenshot] +[If login page or CAPTCHA encountered: pause for manual handling] +[Environment sanity check: navigate to app URL, verify page loaded with expected content +(not an error page, not a stale auth redirect, not an empty state). 
If the app loads but +shows error banners, API failures, or empty data that should be populated — abort with +"App environment issue detected" rather than producing misleading quality scores] + +## Phase 3: Execute +[Maturity-guided selection (agent judgment, not mechanical counters), +Proven areas: quick spot-check (max 3 MCP calls), +Uncharted areas: full investigation with batched javascript_tool calls, +Known-bug areas: skip entirely] +[Connection resilience: retry once with 3s delay, then /chrome reconnect guidance] +[If all areas Proven: spot-check all, suggest new scenarios in "Explore Next Run"] +[Explore Next Run items have priority: P1 (likely user-facing friction), P2 (edge case), +P3 (curiosity). Pick highest-priority uncharted items first, not FIFO] +[Modal dialog detection: instruct user to dismiss manually] + +## Phase 4: Score and Report +[Quality scoring 1-5 using calibration rubric per scored interaction unit] +[A scored interaction unit = one user-facing task completion (e.g., "add item to cart", +"submit shipping form", "complete payment"). Navigation steps, page loads, and setup +actions are not scored individually — they are part of the interaction they serve.] +[Scores are ABSOLUTE per rubric, not relative to scenario framing.] 
+[Output: run summary block with per-area scores, disconnect count, overall quality avg] +[If run is interrupted before scoring completes, do NOT produce committable output — +partial runs must not corrupt maturity state] + +## Commit Mode +[Invoked separately via /user-test-commit after reviewing run results] +[Maturity updates using agent judgment, run history, promotion/demotion with rubric] +[Atomic write: write to .tmp then rename] +[History rotation: keep last 50 entries in test-history.md] +[Include structured label `user-test:` on every issue] +[Duplicate detection: `gh issue list --label "user-test:" --state open` for +exact match; fall back to semantic title search only if no label match found] +[Sanitize issue body content, skip gracefully if gh not authenticated] +[Never persist credentials in issue bodies or test files] +[Persist demo_readiness (yes/no/partial) in test-history.md alongside quality scores] + +## Iterate Mode +See [iterate-mode.md](./references/iterate-mode.md) for details. +N capped at 10 (default), N=0 is error, N=1 valid. +State clearing limitations documented. +Partial run handling: if disconnect mid-iterate, write results for completed runs and report +"Completed M of N runs" — partial results are valid and maturity updates apply. +Output format: per-run scores table + aggregate consistency metrics + maturity transitions. + +## Test File Template +See [test-file-template.md](./references/test-file-template.md). + +## Browser Input Patterns +See [browser-input-patterns.md](./references/browser-input-patterns.md). 
+``` + +## Thin Wrapper Command Templates + +### `commands/user-test.md` + + +```markdown +--- +name: user-test +description: Run browser-based user testing with quality scoring and compounding test files +disable-model-invocation: true +allowed-tools: Skill(user-test) +argument-hint: "[scenario-file-or-description]" +--- + +Invoke the user-test skill for: $ARGUMENTS +``` + +### `commands/user-test-iterate.md` + + +```markdown +--- +name: user-test-iterate +description: Run the same user test scenario N times to measure consistency +disable-model-invocation: true +allowed-tools: Skill(user-test) +argument-hint: "[scenario-file] [n]" +--- + +Invoke the user-test skill in iterate mode for: $ARGUMENTS +``` + +### `commands/user-test-commit.md` + + +```markdown +--- +name: user-test-commit +description: Commit user-test results — update test file maturity map, file issues, append history +disable-model-invocation: true +allowed-tools: Skill(user-test) +--- + +Invoke the user-test skill in commit mode for the last completed run. +``` + +## SpecFlow Analysis -- Gaps Addressed in Implementation + +The SpecFlow analyzer identified gaps. Here is how the implementation addresses each genuine gap: + +| Gap | Resolution | +|-----|-----------| +| No-argument behavior | Phase 1: scan `tests/user-flows/` for test files, present list, or prompt for description | +| MCP not connected | Phase 0 preflight: check MCP availability, instruct to run `/chrome` or restart with `claude --chrome` | +| gh not authenticated | Phase 6: check `gh auth status` before creating issues, skip gracefully if not authenticated | +| Test file corruption | Phase 1: validate required sections and schema_version, offer to regenerate from template if missing | +| All areas Proven | Phase 3: spot-check all Proven areas, add note suggesting new scenarios in "Explore Next Run" | +| N=0 for iterate | Iterate mode: treat N=0 as error, require N >= 1, cap N <= 10. 
N=1 is valid (single run with consistency tracking) | +| State between iterate runs | Iterate mode: full page reload to app entry URL between each run. Document limitation: does not clear IndexedDB, service worker caches, or HttpOnly cookies | +| Preflight check | Phase 0: validates MCP, gh, app URL, WSL detection, site permissions, Windows named pipe conflicts | +| Authentication/login | Phase 2: leverage shared browser login state. User signs in once in Chrome. If CAPTCHA encountered, Claude pauses for manual handling | + +## Dependencies & Risks + +| Risk | Mitigation | +|------|-----------| +| `claude-in-chrome` MCP may not be installed | Phase 0 preflight check with specific "/chrome" instructions | +| Extension service worker goes idle during extended sessions | Retry once with 3s delay, then specific "/chrome Reconnect" guidance. Abort after 3 cumulative disconnects. | +| File upload not supported | Explicit `MANUAL ONLY` marking in test file template. Workaround: pause user-test and use `/agent-browser` for upload steps. | +| SKILL.md growth past 500 lines | References/ extraction from day one. Validation gate: `wc -l < SKILL.md` must be <= 500 | +| Component count drift | Dynamic count validation in acceptance criteria (count files, verify descriptions match) | +| Test history unbounded growth | Rotation: keep last 50 entries in test-history.md | +| Modal dialogs block browser commands | Detection guidance in Phase 3, instruct user to dismiss manually | +| WSL environment | Preflight detection and abort with clear message | +| Windows named pipe conflicts | Preflight detection with "close other Claude Code sessions" guidance | +| Directory traversal via test file path | Path validation in Phase 1: resolved path must start with `tests/user-flows/` | +| External/staging app not deployed or stale | v1 targets localhost/local dev. Document limitation: no deployment verification for remote URLs. User must verify external apps are live before testing. 
| +| App loads but environment is broken (stale auth, empty data, API 500s) | Phase 2 environment sanity check: navigate + content assertion before test execution. Abort with "App environment issue" rather than producing misleading scores | +| Issue dedup fails on different descriptions of same bug | Structured `user-test:` label on every issue for exact-match dedup via `--label`; semantic search as fallback only | + +## Success Metrics + +- Skill loads and executes without errors on first invocation +- Test file is correctly created from description with `schema_version: 1` +- Maturity model state transitions work across 3+ consecutive runs using agent judgment +- No duplicate GitHub issues created across iterate runs +- SKILL.md <= 500 lines (enforced by validation gate) +- All component counts match across plugin.json, marketplace.json, and README.md (verified dynamically) +- **Compounding metric**: After 3 runs on the same scenario, Proven area count > 0 and total test duration decreases (spot-checks are faster than full investigations) + +## Sources & References + +### Internal References +- Thin wrapper pattern: `plugins/compound-engineering/commands/deepen-plan.md:1-9` +- Skill structure: `plugins/compound-engineering/skills/deepen-plan/SKILL.md` (frontmatter + phases pattern) +- Browser automation: `plugins/compound-engineering/skills/agent-browser/SKILL.md` (MCP tool reference) +- Existing test command: `plugins/compound-engineering/commands/test-browser.md` (distinct tool set) +- Plugin checklist: `CLAUDE.md` "Adding a New Skill" section +- Anti-patterns: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md` +- Versioning: `docs/solutions/plugin-versioning-requirements.md` + +### Conventions Applied +- Skill compliance: name matches directory, single-line description with trigger keywords +- Thin wrapper: `allowed-tools: Skill(user-test)`, `disable-model-invocation: true` +- Version bump: MINOR for new functionality (dynamic — count at 
implementation time) +- CHANGELOG: Keep a Changelog format with `### Added` section +- Reference files linked with proper markdown links: `[filename.md](./references/filename.md)` diff --git a/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md b/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md new file mode 100644 index 000000000..1797da0d1 --- /dev/null +++ b/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md @@ -0,0 +1,424 @@ +--- +title: "User-Test Skill Revision: Timing, Qualitative Summaries, Delta Tracking, and More" +type: feat +status: completed +date: 2026-02-28 +--- + +# User-Test Skill Revision + +Based on 7 rounds of iterative testing and real production test results. + +## Overview + +Revise the `user-test` skill to add timing tracking, qualitative summaries, delta regression detection, explore-next-run generation, optional CLI mode, output quality scoring, and conditional regression checks. These changes address gaps discovered during real-world usage — timing regressions went unnoticed, structured scores missed qualitative signal, and ~60% of bugs were agent reasoning errors catchable without a browser. + +## Problem Statement / Motivation + +The current skill scores UX quality but has five gaps that real testing revealed: +1. **Performance blind spot** — response time regressed from 15s to 28s across 5 runs, unnoticed until manually tracked +2. **Qualitative signal loss** — "4.2/5 average" doesn't answer "should we demo tomorrow?" +3. **Regression hiding** — absolute numbers mask run-over-run quality changes +4. **Stale explore-next-run** — the section stays empty because the skill doesn't generate items proactively +5. **Browser-only bottleneck** — most agent reasoning bugs don't need a browser to catch + +## Proposed Solution + +10 changes in 3 priority tiers, scoped to stay within the 500-line SKILL.md budget. 
+ +### Prerequisite: Schema Migration Strategy + +Multiple changes (1A, 1C, 2B, 2C) add columns to the test file template. Existing `schema_version: 1` files must not break. + +**Approach:** +- Bump to `schema_version: 2` in the template +- Phase 1 (Load Context): when reading a v1 file, add missing columns with empty/default values in memory — do NOT rewrite the file +- Commit mode: when writing back, upgrade the file to v2 schema (adds new columns, preserves all existing data) +- Forward compatibility: the reader tolerates unknown frontmatter fields (from a future v3) by ignoring them. Unknown table columns are preserved on write. +- The "offer to regenerate" recovery path remains for genuinely corrupted files only + +### Prerequisite: Run Results Persistence + +The current skill relies on the agent's context window to pass run results from `/user-test` to `/user-test-commit`. With more data dimensions (timing, dual scores, qualitative notes), this becomes fragile. + +**Approach:** +- After Phase 4 completes, write a `.user-test-last-run.json` file in `tests/user-flows/` containing: scenario slug, per-area scores (UX + optional quality), timing, qualitative summary, issues to file, maturity assessments +- `/user-test-commit` reads this file instead of relying on context +- The file is overwritten on each run (only last run is committable) +- Add `.user-test-last-run.json` to the project's `.gitignore` guidance in Phase 1 + +**Stale/missing file handling for `/user-test-commit`:** +- **Missing file:** If `.user-test-last-run.json` doesn't exist, abort with "No run results found. Run `/user-test` first." +- **Stale file:** The file includes a `run_timestamp` (ISO 8601). If the timestamp is older than 24 hours, warn: "Run results are from . Commit anyway? (y/n)." If older than 7 days, abort with "Run results too old — re-run `/user-test` first." +- **Partial run:** The file includes a `completed: true|false` flag. If `false`, abort with "Last run was incomplete. 
Run `/user-test` again for committable results." +- **No context fallback:** Commit mode never falls back to context window. The JSON file is the single source of truth. + +## Priority 1: High Impact, Low Effort + +### 1A. Timing Tracking + +**Files:** SKILL.md Phase 3 + Phase 4, test-file-template.md + +**Change:** Measure wall-clock time per area (start timestamp before first MCP call, end after last). Record in seconds. + +**Template change — Areas table:** +``` +| Area | Status | Last Score | Last Time | Consecutive Passes | Notes | +``` + +**Report output adds Time column:** +``` +| Area | Status | Score | Time | Assessment | +``` + +**Edge cases:** +- Partial area (disconnect mid-area): record time as `—` (incomplete), do not include in averages +- Timing includes async waits — this is intentional (slow is slow, regardless of cause) + +**SKILL.md budget:** +8 lines + +### 1B. Qualitative Summary in Report Output + +**Files:** SKILL.md Phase 4 Report Output + +**Add after the scores table:** +``` +Qualitative: +- Best moment: +- Worst moment: +- Demo ready: yes / partial / no +- One-line verdict: +``` + +**Persistence:** These fields are written to `.user-test-last-run.json` for commit mode. `demo_readiness`, `verdict`, and a brief `context` note are persisted to `test-history.md` during commit. The `context` field is a one-phrase explanation of *why* (e.g., "search results loading 28s" alongside verdict "partial") — without it, verdicts become ambiguous after a few weeks. `best_moment` and `worst_moment` are ephemeral (report-only) — they inform the human reviewer but don't need historical tracking. + +**Edge cases:** +- All areas score the same: pick the area that was most/least expected +- Only one area tested: best and worst are the same — write one line + +**SKILL.md budget:** +10 lines + +### 1C. 
Delta Tracking in Run History + +**Files:** SKILL.md Commit Mode, test-file-template.md + +**Change:** When appending to `test-history.md`, compute delta from the most recent *completed* previous run: +``` +| Date | Quality Avg | Delta | Key Finding | Context | +| 2/26 | 4.86 | +0.15 | Exclusion filters working | | +| 2/24 | 4.71 | -0.18 | Forest green regression | color picker CSS regression | +``` + +Flag any delta worse than -0.5 with a warning in the commit output. + +**Edge cases:** +- First run ever: delta is `—` (no baseline) +- Previous run was partial: skip to the last complete run +- Different area sets between runs: compute avg over only areas present in BOTH runs. If no overlap, delta is `—`. **Known limitation:** if area sets drift significantly over time (adding 3, removing 2), the delta is computed over a shrinking overlap and may look stable even when new areas perform poorly. Acceptable for now — flag for revisit if delta becomes unreliable in practice. +- Iterate mode: delta is computed between the iterate session's aggregate and the previous non-iterate run. Per-iteration deltas within a session are NOT computed (they are noise, not signal) +- Iterate mode output includes per-run timing and timing variance alongside score variance. A consistent 28s is fine; wild swings between 5s and 45s indicate flakiness worth investigating. + +**SKILL.md budget:** +10 lines + +### 1D. Explore-Next-Run Generation Guidance + +**Files:** SKILL.md Phase 4 + +**Add:** After scoring, explicitly generate 2-3 Explore Next Run items with priority: +- **P1** — Things that surprised you (positive or negative) +- **P2** — Edge cases adjacent to tested areas +- **P3** — Interactions started but not finished, or borderline scores (score of 3) + +A "borderline" score is any area scoring 3/5 — warrants deeper investigation next run regardless of maturity status. + +**SKILL.md budget:** +8 lines + +## Priority 2: Medium Impact, Medium Effort + +### 2A. 
Optional CLI Mode + +**Files:** SKILL.md (new Phase 2.5), test-file-template.md (frontmatter) + +**Test file frontmatter addition:** +```yaml +--- +cli_test_command: "node scripts/test-cli.js --query '{query}'" # optional +cli_queries: # optional + - query: "queen bed hot sleeper" + expected: "cooling materials, percale or linen" + - query: "something nice" + expected: "asks clarifying questions" +--- +``` + +**SKILL.md addition — Phase 2.5: CLI Testing** + +If the test file defines `cli_test_command`: +1. Skip Phase 0 MCP preflight (CLI doesn't need chrome). Run `gh auth status` check only. +2. Skip Phase 2 browser setup entirely +3. For each query in `cli_queries`: run the command via Bash, capture stdout +4. Score output quality 1-5 against the `expected` field using the **output quality rubric** (see 2B). The agent evaluates whether the CLI output satisfies the expected description semantically — this is NOT exact string matching. The `expected` field describes what a correct response looks like, and the agent applies the output quality rubric to judge. +5. CLI results feed into the same maturity map and scoring pipeline +6. If BOTH `cli_queries` and browser areas exist in the test file: run CLI first. If CLI reveals broken agent logic (scores <= 2), skip browser testing for overlapping areas with a note "CLI pre-check failed — skipping browser test." + +**Overlap detection is explicit, not agent-inferred.** Each CLI query can optionally tag the browser area it pre-checks: +```yaml +cli_queries: + - query: "queen bed hot sleeper" + expected: "cooling materials, percale or linen" + prechecks: "search-results" # area slug — skip this browser area on CLI failure + - query: "something nice" + expected: "asks clarifying questions" + # no prechecks tag — CLI-only, no browser area overlap +``` +If `prechecks` is present and the CLI query scores <= 2, the tagged browser area is skipped. 
If `prechecks` is absent, the CLI query is standalone — no browser areas are skipped regardless of score. This eliminates fuzzy semantic matching at runtime. + +**Credential handling:** The `cli_test_command` runs as a Bash command inheriting the shell environment. No credentials are stored in the test file. If the command needs env vars, the user sets them in their shell before running `/user-test`. + +**Iterate mode:** CLI iterate resets by simply re-running the command (no browser reload needed). If the command has side effects (DB writes), document this limitation in iterate-mode.md. + +**SKILL.md budget:** +30 lines (extract to `references/cli-mode.md` if it exceeds 35) + +### 2B. Output Quality Scoring Dimension + +**Files:** SKILL.md Phase 4 Scoring, test-file-template.md + +**Change:** Areas can optionally have `scored_output: true` in their area details. When set, score TWO dimensions: + +| Dimension | Rubric | When to use | +|-----------|--------|-------------| +| **UX score (1-5)** | Existing rubric (broken → delightful) | Always | +| **Quality score (1-5)** | Output correctness rubric (below) | Only when `scored_output: true` | + +**Output Quality Rubric:** + +| Score | Meaning | Example | +|-------|---------|---------| +| 5 | Exactly what an expert would produce | Right products, right reasoning | +| 4 | Relevant, minor misses | Mostly right, one irrelevant result | +| 3 | Partially correct | Some right, some wrong | +| 2 | Mostly wrong | Misunderstood intent | +| 1 | Completely wrong | Wrong category, hallucinated data | + +**Report shows both:** `UX: 4/5, Quality: 3/5` + +**Aggregation rules:** +- `Quality Avg` in run history = average of UX scores only (maintains backward compatibility for areas without `scored_output`) +- **Promotion gate for `scored_output: true` areas:** UX >= 4 AND Quality >= 3. A beautiful UI showing wrong results should not promote to Proven. 
+- **Promotion gate for standard areas:** UX >= 4 only (unchanged from v1) +- Quality score tracked as `Output Avg` in the report for visibility +- Known-bug filing: trigger on UX <= 2 (functional failure) OR Quality <= 1 (completely wrong output) + +**Template change — Areas table:** +``` +| Area | Status | Last Score | Last Quality | Last Time | Consecutive Passes | Notes | +``` +(`Last Quality` column only populated for areas with `scored_output: true`) + +**SKILL.md budget:** +15 lines + +### 2C. Conditional Regression Checks for Known-Bug Areas + +**Files:** SKILL.md Phase 3, test-file-template.md + +**Test file area detail addition:** +```markdown +### cart-quantity-update +**Status:** Known-bug +**Issue:** #47 +**Fix check:** Verify quantity updates in <5s and cart badge reflects new count +``` + +**SKILL.md Phase 3 addition:** + +When encountering a Known-bug area: +1. If `gh` is not authenticated: skip as normal (no change) +2. Check if the linked issue is closed: `gh issue view --json state -q '.state'` +3. If `closed`: flip area to Uncharted, run the `fix_check` as the first test for that area +4. If `open`: skip as normal +5. If the fix check fails (score <= 2): file a new issue with note "Regression of #N" in the body referencing the original closed issue for traceability. The dedup check (`--state open`) won't find the closed issue, so a new issue is created — this is correct behavior. + +**Template change:** Known-bug areas store `**Issue:** #` in their area details section. This is the canonical reference for `gh issue view`. + +**SKILL.md budget:** +15 lines + +## Priority 3: Nice to Have + +### 3A. 
Async Wait Pattern + +**Files:** browser-input-patterns.md only (no SKILL.md change) + +**Add:** +```javascript +// Wait for async operation completion +mcp__claude-in-chrome__javascript_tool({ + code: ` + (async () => { + const start = Date.now(); + const timeout = 10000; + const selector = '.success-message'; + while (Date.now() - start < timeout) { + if (document.querySelector(selector)) return 'found'; + await new Promise(r => setTimeout(r, 200)); + } + return 'timeout'; + })() + ` +}) +``` + +**SKILL.md budget:** 0 lines + +### 3B. Performance Threshold Configuration + +**Files:** test-file-template.md frontmatter, SKILL.md Phase 4 + +**Frontmatter addition:** +```yaml +--- +performance_thresholds: # optional, seconds + fast: 2 + acceptable: 8 + slow: 20 + broken: 60 +--- +``` + +**Scoring integration:** If thresholds are defined, append a timing grade to each area's assessment in the report: `(fast)`, `(acceptable)`, `(slow)`, `(BROKEN)`. A `broken` timing grade is a finding worth noting but does NOT affect the UX score — timing and quality are separate dimensions. + +**Measurement:** Wall-clock time from 1A. No browser performance API needed. + +**SKILL.md budget:** +8 lines + +### 3C. End-to-End Unscripted Scenario Type — DEFERRED + +**Rationale for deferral:** The SpecFlow analysis identified fundamental conflicts with the maturity model. Unscripted runs produce emergent areas that don't map to stable slugs, breaking consecutive-pass tracking, issue label convention, and iterate mode compatibility. This needs a separate design pass (possibly a distinct mode with its own output format) rather than being retrofitted into the existing area-based model. + +**Interim alternative:** Users can approximate unscripted testing by creating a test file with broad areas (e.g., `first-time-onboarding`) and giving the agent latitude in the area description. This gets 80% of the value without the architectural conflict. 
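
**Sketch: timing grades from 3B.** The grading step described in 3B above could look roughly like the following. This is a minimal sketch, not the skill's implementation. The helper names `timingGrade` and `withTimingGrade` are hypothetical, and the boundary semantics are an assumption the plan does not spell out: `fast` and `acceptable` are treated as inclusive upper bounds in seconds, `broken` as an inclusive lower bound, and anything in between is graded slow (so the `slow` value itself is not consulted here).

```javascript
// Hypothetical helper: maps a wall-clock area time (seconds) to a timing grade.
// Assumed semantics (not specified in the plan): fast/acceptable are inclusive
// upper bounds, broken is an inclusive lower bound, everything else is slow.
function timingGrade(seconds, thresholds) {
  // No thresholds in frontmatter, or timing was not recorded as a number
  // (for example an incomplete area): produce no grade at all.
  if (!thresholds || typeof seconds !== "number" || Number.isNaN(seconds)) {
    return null;
  }
  if (seconds >= thresholds.broken) return "BROKEN";
  if (seconds <= thresholds.fast) return "fast";
  if (seconds <= thresholds.acceptable) return "acceptable";
  return "slow";
}

// Appends the grade to an area's assessment text. The UX score is never
// touched, since timing and quality are separate dimensions.
function withTimingGrade(assessment, seconds, thresholds) {
  const grade = timingGrade(seconds, thresholds);
  return grade ? `${assessment} (${grade})` : assessment;
}

// Example values taken from the frontmatter shown in 3B.
const thresholds = { fast: 2, acceptable: 8, slow: 20, broken: 60 };
console.log(withTimingGrade("Filters applied correctly", 28, thresholds));
```

Returning `null` when thresholds are absent keeps the feature opt-in: a test file without `performance_thresholds` produces the same report text as before.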
+ +## Technical Considerations + +### SKILL.md Budget Impact + +| Change | Lines Added | Cumulative | +|--------|-----------|------------| +| Current | 0 | 192 | +| 1A Timing | +8 | 200 | +| 1B Qualitative | +10 | 210 | +| 1C Delta | +10 | 220 | +| 1D Explore | +8 | 228 | +| 2A CLI mode | +30 | 258 | +| 2B Quality scoring | +15 | 273 | +| 2C Regression checks | +15 | 288 | +| 3B Thresholds | +8 | 296 | +| **Total** | **+104** | **~296** | + +Well within the 500-line budget. If CLI mode grows beyond 35 lines during implementation, extract to `references/cli-mode.md`. + +### File Change Summary + +| File | Changes | +|------|---------| +| `SKILL.md` | +104 lines: timing in Phase 3-4, qualitative summary in Phase 4, delta in Commit Mode, explore-next generation in Phase 4, CLI Phase 2.5, output quality rubric, regression checks in Phase 3, threshold eval in Phase 4 | +| `test-file-template.md` | Schema v2: new columns (Last Time, Last Quality), `cli_test_command`/`cli_queries` frontmatter, `performance_thresholds` frontmatter, Known-bug `Issue:` field, `fix_check` field | +| `browser-input-patterns.md` | +15 lines: async wait pattern | +| `iterate-mode.md` | +8 lines: CLI iterate reset note, timing per run in output table, timing variance alongside score variance | +| Commands | No changes (thin wrappers unchanged) | + +### Backward Compatibility + +- v1 test files work unchanged — missing columns filled with defaults on read +- v1 files upgraded to v2 on next commit (non-destructive) +- CLI mode is opt-in (no `cli_test_command` = no CLI testing) +- Quality scoring is opt-in (`scored_output: true` per area) +- Performance thresholds are opt-in (no frontmatter = no timing grades) + +## Acceptance Criteria + +### P1 Changes + +- [x] 1A: Report output includes `Time` column per area +- [x] 1A: Test file template has `Last Time` column in areas table +- [x] 1A: Partial area timing recorded as `—` +- [x] 1B: Report output includes qualitative summary (best moment, worst 
moment, demo ready, verdict) +- [x] 1B: `demo_readiness` and `verdict` persist to test-history.md via commit mode +- [x] 1C: Run history includes `Delta` column computed from last complete run +- [x] 1C: Delta worse than -0.5 flagged with warning +- [x] 1C: First run shows delta as `—` +- [x] 1C: Iterate mode computes delta vs. pre-session baseline only +- [x] 1D: Phase 4 generates 2-3 Explore Next Run items with P1/P2/P3 priority +- [x] 1D: Borderline (score 3) areas flagged for deeper investigation + +### P2 Changes + +- [x] 2A: Test files with `cli_test_command` run CLI queries before browser testing +- [x] 2A: CLI mode skips Phase 0 MCP preflight and Phase 2 browser setup +- [x] 2A: CLI queries use explicit `prechecks` tag for browser area overlap (no agent-inferred matching) +- [x] 2A: No credentials stored in test file +- [x] 2B: Areas with `scored_output: true` show dual scores (UX + Quality) +- [x] 2B: Quality Avg in history = UX scores only (backward compatible) +- [x] 2B: Known-bug trigger: UX <= 2 OR Quality <= 1 +- [x] 2C: Known-bug areas with closed issues auto-flip to Uncharted +- [x] 2C: Fix check runs as first test for re-opened areas +- [x] 2C: Issue number stored in area details (`**Issue:** #N`) + +### P3 Changes + +- [x] 3A: Async wait pattern documented in browser-input-patterns.md +- [x] 3B: Optional `performance_thresholds` frontmatter evaluates timing grades +- [x] 3C: Deferred — documented as future work + +### Prerequisites + +- [x] Schema migration: v1 files read without error, upgraded to v2 on commit +- [x] Forward compatibility: v2 reader tolerates unknown frontmatter fields from future schema versions (ignore, don't error) +- [x] Run results persistence: `.user-test-last-run.json` written after Phase 4, read by commit mode +- [x] `.user-test-last-run.json` added to `.gitignore` guidance +- [x] Commit mode aborts if `.user-test-last-run.json` missing, incomplete, or >7d stale +- [x] Commit mode warns if `.user-test-last-run.json` >24h 
old +- [x] `verdict` persists with `context` note to test-history.md + +### Post-Change Validation + +- [x] SKILL.md <= 500 lines after all changes (313 lines) +- [x] All reference file links use proper markdown format +- [x] Existing v1 test files load without error +- [x] Version bump in plugin.json, marketplace.json, CHANGELOG.md (2.36.0) +- [x] Reinstall to `~/.claude/skills/user-test/` and `~/.claude/commands/` + +## Dependencies & Risks + +| Risk | Mitigation | +|------|-----------| +| SKILL.md exceeds 500 lines | Extract CLI mode to references/cli-mode.md if >35 lines | +| v1 test files break with new columns | Schema migration reads v1, upgrades on commit | +| Run results lost between sessions | `.user-test-last-run.json` persists results to disk | +| CLI command has side effects on iterate | Document limitation in iterate-mode.md | +| Delta misleading with changing area sets | Compute delta over overlapping areas only | +| 3C unscripted conflicts with maturity model | Deferred — needs separate design | +| Stale `.user-test-last-run.json` committed | Timestamp check: warn >24h, block >7d, block if incomplete | + +## Implementation Sequence + +1. **Prerequisites first** — schema migration logic + `.user-test-last-run.json` persistence +2. **P1 changes** (1A, 1B, 1C, 1D) — all low effort, high value +3. **P2 changes** (2A, 2B, 2C) — medium effort, build on P1 foundations +4. **P3 changes** (3A, 3B) — nice-to-have, zero risk +5. **Reinstall** — copy updated files to `~/.claude/skills/` and `~/.claude/commands/` +6. 
**Validate** — run `/user-test` against a test scenario to verify + +## Sources & References + +### Internal References +- Current SKILL.md: `plugins/compound-engineering/skills/user-test/SKILL.md` (192 lines) +- Current template: `plugins/compound-engineering/skills/user-test/references/test-file-template.md` (81 lines) +- Current patterns: `plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md` (54 lines) +- Current iterate: `plugins/compound-engineering/skills/user-test/references/iterate-mode.md` (65 lines) +- Skill size budget: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md` +- Original plan: `docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md` + +### Conventions Applied +- Schema versioning for forward compatibility +- SKILL.md 500-line budget with reference extraction fallback +- Thin wrapper commands unchanged (no new commands needed) +- Backward-compatible template migration (read v1, write v2) diff --git a/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md b/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md new file mode 100644 index 000000000..050135db6 --- /dev/null +++ b/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md @@ -0,0 +1,453 @@ +--- +title: "Evolve user-test into a compounding UX intelligence system" +type: feat +status: completed +date: 2026-02-28 +origin: User vision document (inline, 2026-02-28) + 7 rounds of real testing on Resale Clothing Shop +prior_art: + - docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md (v1, completed) + - docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md (v2, completed) +--- + +# Evolve user-test into a Compounding UX Intelligence System + +## Overview + +Transform `/user-test` from a browser test runner into a **compounding UX intelligence system** — one that explores, regresses, and gets smarter with every run. 
The current v2 skill (321 lines) has the right foundations (maturity model, scoring rubric, CLI+browser layers, auto-commit). This plan closes 6 specific gaps identified after 7 real test runs, adds a new UX Opportunities signal category, and wires up the compounding loop so discoveries graduate from browser exploration to CLI regression checks. + +**What v3 adds that v2 doesn't have:** +1. Bug registry (`bugs.md`) with open/fixed/regressed lifecycle +2. Per-area score history (not just top-level delta) +3. Structured skip reasons (untested-by-choice vs. infrastructure-failure) +4. Explicit pass thresholds per area +5. Queryable qualitative data (area-tagged best/worst moments) +6. Discovery-to-regression graduation (browser findings become CLI checks) +7. UX Opportunities — suggestions, not failures — as a new report section + +## Problem Statement / Motivation + +After 7 real test runs, the skill produces useful signal but doesn't compound it efficiently: + +- **Bugs evaporate.** A bug found in run 3 and fixed in run 5 has no persistent record linking the discovery to the fix. If it regresses in run 8, the skill doesn't know it's a regression — it just finds a "new" bug. +- **Delta is top-level only.** "Quality went from 3.5 → 4.0" doesn't tell you which area improved. Per-area score history is needed to answer "did the shipping form fix actually help?" +- **Disconnects are invisible.** Three disconnects in a run produce null scores with "extension disconnected" buried in assessment text. There's no machine-readable way to distinguish "skipped because Proven" from "skipped because Chrome crashed." +- **Pass thresholds are implicit.** `consecutive_passes` exists but what counts as a pass varies by area. A search results area with `scored_output: true` needs UX >= 4 AND Quality >= 3, but this threshold lives only in the agent's head. 
+- **Qualitative signal is write-once.** "Best moment: agent search is excellent" appears in one run's JSON but can't be queried over 20 runs to surface patterns like "agent/search has been the best moment 8 of 10 times." +- **The flywheel doesn't close.** Browser discoveries don't become CLI regression checks. The same bug can silently regress without the fast layer catching it. + +## Proposed Solution + +### Phase 1: Bug Registry (Gap #1) + +Add `tests/user-flows/bugs.md` — a persistent, machine-readable bug tracker that complements GitHub Issues. + +```markdown +# Bug Registry + +| ID | Area | Status | Issue | Summary | Found | Fixed | Regressed | +|----|------|--------|-------|---------|-------|-------|-----------| +| B001 | checkout/shipping-form | open | #47 | Accepts invalid zip codes | 2026-02-28 | — | — | +| B002 | browse/product-grid | fixed | #48 | Cards not clickable | 2026-02-28 | 2026-03-01 | — | +| B003 | browse/product-grid | regressed | #52 | Cards not clickable (regression of B002) | 2026-03-05 | — | 2026-03-05 | +``` + +**Status lifecycle:** `open` → `fixed` (when Known-bug area passes fix_check) → `regressed` (if same area fails again after fix). Cross-reference: `Issue` column links to GitHub, `ID` column is the local canonical reference. + +**Multi-area bugs:** A bug that manifests in multiple areas gets ONE registry entry with the primary area. The `Summary` field notes "Also affects: area-a, area-b". Each affected area's Known-bug detail references the same bug ID. + +**Commit mode updates:** After each run, commit mode: +1. Marks bugs as `fixed` when a Known-bug area's fix_check passes (score >= area's `pass_threshold`, default 4) +2. Files new bugs with next sequential ID +3. Marks bugs as `regressed` when a previously-fixed area fails again +4. Syncs with GitHub issue state (closed issue + passing fix_check = fixed) + +**File location:** `tests/user-flows/bugs.md` alongside scenario files. One registry per project, not per scenario. 
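The bug registry lifecycle above can be sketched as two small functions. This is a minimal sketch, not the commit-mode implementation: `nextBugId` and `nextBugStatus` are hypothetical names, and "fails again" is assumed to mean a score below the area's `pass_threshold`, which the plan does not define precisely.

```javascript
// Hypothetical helper: next sequential registry ID (B001, B002, ...).
function nextBugId(existingIds) {
  const max = existingIds.reduce(
    (m, id) => Math.max(m, parseInt(id.slice(1), 10)),
    0
  );
  return "B" + String(max + 1).padStart(3, "0");
}

// Hypothetical helper: status transition for one bug after a run.
// `score` is the area's score for this run, or null if the area was not run.
function nextBugStatus(status, score, passThreshold = 4) {
  if (score == null) return status; // no evidence this run, no transition
  const passed = score >= passThreshold;
  if (status === "open" && passed) return "fixed"; // fix_check passed
  // Assumption: "fails again" means scoring below pass_threshold.
  if (status === "fixed" && !passed) return "regressed";
  return status;
}

console.log(nextBugId(["B001", "B002", "B003"]), nextBugStatus("open", 4));
```

Because the fix_check only runs once the linked GitHub issue is closed, the sketch does not take the issue state as a separate input.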
+ +### Phase 2: Per-Area Score History (Gap #2) + +Split storage by audience: humans see trends, machines store history. + +**Machine-readable history:** `tests/user-flows/score-history.json` alongside `bugs.md`: + +```json +{ + "areas": { + "checkout/cart": { + "scores": [ + { "date": "2026-02-28", "ux": 3, "quality": null, "time": 8 }, + { "date": "2026-03-01", "ux": 4, "quality": null, "time": 7 }, + { "date": "2026-03-02", "ux": 4, "quality": null, "time": 6 } + ], + "trend": "improving" + }, + "agent/search-query": { + "scores": [ + { "date": "2026-02-28", "ux": 4, "quality": 3, "time": 12 }, + { "date": "2026-03-01", "ux": 5, "quality": 4, "time": 9 } + ], + "trend": "improving" + } + } +} +``` + +**Storage:** Last 10 entries per area. Oldest drops off when 11th is recorded. One file per project (not per scenario). + +**Human-readable trends in test file:** A thin `## Area Trends` section replaces the wide history table: + +```markdown +## Area Trends + +| Area | Trend | Last Score | Delta | +|------|-------|------------|-------| +| checkout/cart | improving | 4 | +1 | +| checkout/shipping | fixed | 4 | +2 | +| browse/product-grid | stable | 5 | — | +``` + +**Trend computation:** `improving` (last 3 trending up), `stable` (variance < 0.5 over last 3), `declining` (last 3 trending down), `volatile` (variance >= 1.0 over last 3), `fixed` (previous was <= 2, current >= pass_threshold). Computed from `score-history.json`, not by parsing markdown. + +**Delta computation:** Per-area delta compares current score to previous score for that specific area in `score-history.json`. This supplements the existing top-level delta in run history. 
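The storage cap and trend rules above can be sketched as follows. The plan does not specify rule precedence, the variance formula, or what "trending up" means exactly, so this sketch makes assumptions: `fixed` is checked first, "trending up" means non-decreasing with a net increase over the last three scores (which matches the `checkout/cart` example of 3, 4, 4 being `improving`), variance is the population variance, and anything left over is `stable`. `appendScore` and `computeTrend` are hypothetical names.

```javascript
const MAX_ENTRIES = 10; // last 10 entries per area

// Returns a new array with the entry appended and the oldest dropped if needed.
function appendScore(scores, entry) {
  return [...scores, entry].slice(-MAX_ENTRIES);
}

// Population variance (assumption: the plan does not name a formula).
function variance(xs) {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  return xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length;
}

function computeTrend(scores, passThreshold = 4) {
  const ux = scores.map((s) => s.ux).filter((v) => v != null);
  if (ux.length < 2) return "stable"; // not enough history to say more
  const current = ux[ux.length - 1];
  const previous = ux[ux.length - 2];
  if (previous <= 2 && current >= passThreshold) return "fixed";
  const recent = ux.slice(-3);
  const first = recent[0];
  const nonDecreasing = recent.every((v, i) => i === 0 || v >= recent[i - 1]);
  const nonIncreasing = recent.every((v, i) => i === 0 || v <= recent[i - 1]);
  if (nonDecreasing && current > first) return "improving";
  if (nonIncreasing && current < first) return "declining";
  if (variance(recent) >= 1.0) return "volatile";
  return "stable"; // assumption: fallback for everything else
}

console.log(computeTrend([{ ux: 3 }, { ux: 4 }, { ux: 4 }]));
```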
+ +### Phase 3: Structured Skip Reasons (Gap #3) + +Add `skip_reason` field to each area result in `.user-test-last-run.json`: + +```json +{ + "slug": "compare/add-view", + "ux_score": null, + "skip_reason": "disconnect", + "time_seconds": null +} +``` + +**Enum values:** +- `null` — area was tested normally +- `"proven_spotcheck"` — Proven area, spot-checked only +- `"known_bug_open"` — Known-bug, issue still open, skipped +- `"known_bug_fixed"` — Known-bug, issue closed, ran fix_check +- `"cli_precheck_failed"` — CLI precheck for this area scored <= 2 +- `"disconnect"` — MCP disconnect interrupted this area +- `"user_skip"` — User explicitly skipped + +**Report impact:** Pass rate calculation excludes `disconnect` and `user_skip` areas. The report shows: "Pass rate: 4/5 (1 area skipped: disconnect)". + +### Phase 4: Explicit Pass Thresholds (Gap #4) + +Add `pass_threshold` to area details in test files: + +```markdown +### checkout/shipping-form +**Interactions:** Enter address, select method, see estimate +**What's tested:** Form validation + shipping logic +**pass_threshold:** 4 +``` + +```markdown +### agent/search-results +**Interactions:** Enter query, review results, refine search +**What's tested:** Result relevance and ranking quality +**scored_output:** true +**pass_threshold:** 4 +**quality_threshold:** 3 +``` + +**Defaults:** If `pass_threshold` is not set, default is 4. If `quality_threshold` is not set for `scored_output` areas, default is 3. These match the current implicit behavior but make it explicit and per-area configurable. + +**Promotion gate uses thresholds:** "2+ consecutive passes" means 2+ consecutive runs where UX >= `pass_threshold` (and Quality >= `quality_threshold` for scored_output areas). + +**Self-documenting:** The test file now contains everything needed to understand when an area graduates — no implicit knowledge required. 
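Phases 3 and 4 combine into one calculation: decide whether each area passed against its own thresholds, then compute the pass rate over the areas that were not excluded. The sketch below is illustrative only. `isPass` and `passRate` are hypothetical names, `quality_score` is an assumed field name (the plan shows `ux_score` but not the quality field in the run JSON), and a non-excluded area with a null score is assumed to count as not passing.

```javascript
// Skip reasons that remove an area from the pass-rate denominator.
const EXCLUDED = new Set(["disconnect", "user_skip"]);

function isPass(area) {
  const passThreshold = area.pass_threshold ?? 4; // documented default
  if (area.ux_score == null || area.ux_score < passThreshold) return false;
  if (area.scored_output) {
    const qualityThreshold = area.quality_threshold ?? 3; // documented default
    // `quality_score` is an assumed field name for the Quality score.
    return area.quality_score != null && area.quality_score >= qualityThreshold;
  }
  return true;
}

function passRate(areas) {
  const counted = areas.filter((a) => !EXCLUDED.has(a.skip_reason));
  const skipped = areas.filter((a) => EXCLUDED.has(a.skip_reason));
  const passed = counted.filter(isPass).length;
  let line = `Pass rate: ${passed}/${counted.length}`;
  if (skipped.length > 0) {
    const reasons = [...new Set(skipped.map((a) => a.skip_reason))].join(", ");
    const noun = skipped.length === 1 ? "area" : "areas";
    line += ` (${skipped.length} ${noun} skipped: ${reasons})`;
  }
  return { passed, total: counted.length, line };
}

const exampleRun = [
  { slug: "checkout/cart", ux_score: 4, skip_reason: null },
  { slug: "compare/add-view", ux_score: null, skip_reason: "disconnect" },
];
console.log(passRate(exampleRun).line);
```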
+ +### Phase 5: Queryable Qualitative Data (Gap #5) + +Tag each qualitative observation with the area slug it relates to: + +**In `.user-test-last-run.json`:** +```json +{ + "qualitative": { + "best_moment": { "area": "agent/search-query", "text": "Agent search returns highly relevant results in <2s" }, + "worst_moment": { "area": "browse/product-detail", "text": "Product cards aren't clickable — expected click-to-detail" }, + "demo_readiness": "partial", + "verdict": "Agent core is impressive but missing product-click-to-detail hurts the experience", + "context": "search excellent, product grid needs click handler" + } +} +``` + +**In `test-history.md`:** The existing `Key Finding` column already captures one-line findings. Add `Best Area` and `Worst Area` columns to enable pattern queries: + +```markdown +| Date | Areas Tested | Quality Avg | Delta | Pass Rate | Best Area | Worst Area | Demo Ready | Context | Key Finding | +``` + +**Pattern surfacing:** After 10+ runs, commit mode surfaces patterns with asymmetric thresholds: +- **Positive patterns** (high bar): "area X has been best moment in 7+ of last 10 runs" — high evidence required because this is informational, not actionable +- **Negative patterns** (moderate bar): "area X has been worst moment in 5+ of last 10 runs" — lower threshold than positive, but not 3-in-a-row (too noisy during normal development churn). Five of ten captures genuine trends while filtering out transient spikes from feature work. + +### Phase 6: Discovery-to-Regression Graduation (Gap #6) + +This is the highest-leverage change. When a browser-layer discovery is fixed and verified, offer to generate a CLI regression check. + +**Trigger:** When commit mode marks a bug as `fixed`: +1. Check if `cli_test_command` exists in the scenario frontmatter +2. If yes, offer: "Bug B002 (cards not clickable) is fixed. Generate a CLI regression check? (y/n)" +3. 
If user accepts, append to `cli_queries` in the test file: + +```yaml +cli_queries: + - query: "show me product cards" + expected: "Returns product data with clickable links or URLs" + prechecks: "browse/product-grid" + graduated_from: "B002" # links back to the bug that spawned this check +``` + +**Graduation trigger:** Manual decision (user confirms). Automatic graduation after N passes was considered but rejected — the user knows better than the system whether a CLI check can meaningfully cover a UX-discovered issue. Some discoveries are inherently browser-only (layout, animation, visual feedback). + +**CLI-ineligible bugs:** If no `cli_test_command` exists, skip the graduation offer. If the bug is purely visual (e.g., CSS layout), note "This bug is browser-only — no CLI graduation available." + +**The compounding loop this enables:** +``` +Browser discovers bug → bug filed → developer fixes → next run verifies fix + → fix confirmed → CLI regression check generated + → future regressions caught by fast CLI layer + → browser time freed for new exploration +``` + +### Phase 7: UX Opportunities (New Signal Category) + +Two distinct sections in the Phase 4 report — improvement suggestions and patterns to protect: + +**UX Opportunities** (action items — things to improve): + +``` +UX Opportunities: +| ID | Area | Priority | Status | Suggestion | +|----|------|----------|--------|-----------| +| UX001 | browse/product-grid | P1 | open | Product cards should be clickable (users expect click-to-detail) | +| UX002 | agent/search-results | P2 | open | Follow-up suggestion buttons are excellent — make more prominent | +``` + +**Good Patterns** (preservation notes — things to protect): + +``` +Good Patterns: +| Area | Pattern | First Seen | Last Confirmed | +|------|---------|------------|----------------| +| browse/filters | Filter chip with sub-filter counts is a best-practice pattern | 2026-02-28 | 2026-03-02 | +| agent/search-results | Agent follow-up buttons after 
search are excellent | 2026-02-28 | 2026-03-02 | +``` + +**Why separate sections:** P1 and P2 are action items — things to improve. Good Patterns are "don't break this" notes — a fundamentally different signal. Mixing them in one table conflates suggestions with preservation. Good Patterns also have a simpler lifecycle (confirmed each run, no status transitions). + +**Priority mapping (UX Opportunities only):** +- **P1** — Missing expected interaction (friction source) +- **P2** — Enhancement to an already-good interaction + +**UX Opportunity lifecycle:** Each entry has a `status` field: +- `open` — suggestion logged, not yet acted on +- `implemented` — the improvement was made (agent detects the change, or user marks manually) +- `wont_fix` — explicitly declined (keeps the log honest, prevents re-suggestion) + +Entries rotate: keep last 20 `open` entries. `implemented` and `wont_fix` entries age out after 30 days (they've served their purpose). + +**Good Patterns lifecycle:** Simpler — `Last Confirmed` updates each run that observes the pattern. Patterns not confirmed for 5+ runs are removed (the code changed, the pattern may no longer exist). No status field needed. + +**Dedup:** Anchored on explicit IDs, not fuzzy text matching. UX Opportunities use sequential IDs (UX001, UX002...). When the agent observes something that might duplicate an existing entry, it checks by `area slug + priority level`: if the same area already has an open entry at the same priority, the agent decides whether to update or create new — not automated text overlap matching. Good Patterns dedup on area slug only (one pattern entry per area). + +**Distinct from bugs:** Bugs are functional failures (score <= 2) or complete output failures (quality <= 1). UX Opportunities are observations at score 3-5 where the experience could be better. Good Patterns are observations at score 4-5 where the experience is already good. 
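The score bands and the dedup rule above can be sketched as two helpers. This is a rough sketch under assumptions: `eligibleSignals` and `findDedupCandidate` are hypothetical names, the object fields mirror the table columns (`area`, `priority`, `status`), and the bands are applied independently, so an area with a high UX score but a failed output can qualify for more than one category. The helper only finds a candidate duplicate; whether to update it or create a new entry stays an agent decision, as the plan requires.

```javascript
// Which signal categories an observation is eligible for, by score band.
function eligibleSignals({ ux, quality = null }) {
  const signals = [];
  // Bugs: functional failure (score <= 2) or complete output failure (quality <= 1).
  if (ux <= 2 || (quality != null && quality <= 1)) signals.push("bug");
  // UX Opportunities: observations at score 3-5.
  if (ux >= 3) signals.push("ux_opportunity");
  // Good Patterns: observations at score 4-5.
  if (ux >= 4) signals.push("good_pattern");
  return signals;
}

// Returns the open entry with the same area and priority, or null.
// No text matching is involved: the key is area slug + priority level.
function findDedupCandidate(log, observation) {
  const match = log.find(
    (e) =>
      e.status === "open" &&
      e.area === observation.area &&
      e.priority === observation.priority
  );
  return match ?? null;
}

const exampleLog = [
  { id: "UX001", area: "browse/product-grid", priority: "P1", status: "open" },
];
console.log(
  findDedupCandidate(exampleLog, { area: "browse/product-grid", priority: "P1" })
);
```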
+ +**No GitHub issue filing:** Both sections are logged in the test file only. They feed the product backlog but don't create issue noise. The user can manually promote a UX Opportunity to an issue if they want. + +**Storage:** Two new sections in the test file: `## UX Opportunities Log` and `## Good Patterns`. + +## Technical Considerations + +### Schema Migration: v2 → v3 + +**New frontmatter fields (all optional):** +- None — all new data lives in new sections, not frontmatter + +**New test file sections:** +- `## Area Trends` — replaces Area Score History, thin summary (trend + last score + delta) +- `## UX Opportunities Log` — improvement suggestions with status lifecycle +- `## Good Patterns` — patterns worth preserving (separate from opportunities) + +**New standalone files:** +- `tests/user-flows/bugs.md` — bug registry +- `tests/user-flows/score-history.json` — full per-area score history (machine-readable) + +**Run results JSON changes:** +- `areas[].skip_reason` — new field (nullable string enum) +- `qualitative.best_moment` — changes from string to `{ area, text }` object +- `qualitative.worst_moment` — changes from string to `{ area, text }` object +- `ux_opportunities` — new array at top level (P1/P2 improvement suggestions with IDs) +- `good_patterns` — new array at top level (area-level patterns worth preserving) + +**Backward compatibility:** v2 files work unchanged. New sections are added on first v3 commit. The `qualitative` field change is breaking for `.user-test-last-run.json` consumers — but this file is ephemeral (overwritten each run, gitignored), so no migration needed. Bump `schema_version: 3` on first commit that adds new sections. + +**Migration strategy:** Same as v1→v2: fill defaults on read, upgrade on write. No separate migration step. + +### SKILL.md Line Budget + +Current: 321 lines. 
v3 additions estimated: + +| Addition | Lines in SKILL.md | Lines in references/ | +|----------|------------------|---------------------| +| Bug registry lifecycle | ~15 | ~40 (bugs-registry.md) | +| Per-area trends + score-history.json | ~5 | ~15 (in test-file-template.md) | +| Structured skip reasons | ~8 | 0 (enum in JSON schema) | +| Pass thresholds | ~5 | ~10 (in test-file-template.md) | +| Queryable qualitative | ~5 | 0 (JSON schema change) | +| Graduation mechanism | ~15 | ~30 (graduation.md) | +| UX Opportunities + Good Patterns | ~15 | ~25 (in test-file-template.md) | +| Schema v3 migration note | ~5 | 0 | +| **Total** | **~73** | **~120** | + +**Projected SKILL.md:** ~394 lines — within 500-line budget. + +**New reference file:** `references/bugs-registry.md` for bug lifecycle documentation. +**New reference file:** `references/graduation.md` for discovery-to-regression graduation. +**Updated reference file:** `references/test-file-template.md` for new sections + pass thresholds. + +**Line-count checkpoint:** After implementing step 4 (SKILL.md updates), run `wc -l < SKILL.md` before proceeding to step 5. If over 420 lines, extract UX Opportunities or graduation to their own reference files immediately — don't wait until post-implementation cleanup. + +**Graduation extraction trigger:** The graduation mechanism (Phase 6) involves conditional logic across several states (cli_test_command present?, bug type visual or functional?, user response). If it exceeds 20 lines in SKILL.md during implementation, extract to `references/graduation.md` immediately. The reference file is already planned; the question is whether graduation lives as a brief summary in SKILL.md with details in the reference, or entirely in the reference from the start. Default: start in SKILL.md, extract if it grows. + +### Two-Layer Architecture Clarification + +v2 already implements CLI-first (Phase 2.5) and Browser-second (Phase 3). 
v3 doesn't change the execution order, but the graduation mechanism (Phase 6) creates a feedback loop: + +``` +Layer 2 (Browser) → discovers issue → fix verified → graduation offered + ↓ +Layer 1 (CLI) ← new regression check added ← catches regressions fast + ↓ +Layer 2 (Browser) → freed to explore new territory +``` + +This is the compounding loop in action. Over time, the CLI layer grows and the browser layer stays focused on unknowns. + +### Open Questions Resolved + +**Q: How does bugs.md handle bugs that span multiple areas?** +A: One registry entry with primary area. Summary notes "Also affects: area-a, area-b". Each affected area's Known-bug detail references the same bug ID. + +**Q: Should UX Opportunities have priority (P1/P2/P3)?** +A: Yes, but only P1 and P2. P1 = missing expected interaction, P2 = enhancement to an already-good interaction. There is no P3: patterns worth preserving are logged in the separate Good Patterns section (see Phase 7). + +**Q: What's the graduation trigger?** +A: Manual: the user confirms after the fix is verified. The user knows whether a CLI check can meaningfully cover a UX-discovered issue. Some discoveries are inherently browser-only. + +**Q: How does the command handle an app it's never seen before?** +A: Already handled by v2: passing a description string to `/user-test` creates a new test file from template. No separate `/user-test init` needed. The first run IS the init. + +## Acceptance Criteria + +### Phase 1: Bug Registry +- [x] `tests/user-flows/bugs.md` created on first bug filing if it doesn't exist +- [x] Bug IDs are sequential (B001, B002, ...) 
+- [x] Status lifecycle works: open → fixed → regressed +- [x] Multi-area bugs have one entry with "Also affects" note +- [x] Commit mode syncs bug status with GitHub issue state +- [x] Fixed bugs are detected when Known-bug area passes fix_check (score >= area's `pass_threshold`, default 4) +- [x] Regression detection: previously-fixed area fails → new issue "Regression of #N" + bug marked regressed + +### Phase 2: Per-Area Score History +- [x] `tests/user-flows/score-history.json` created on first run, stores full per-area history +- [x] Last 10 entries per area in JSON, oldest drops at 11th +- [x] `## Area Trends` section in test file shows Trend + Last Score + Delta (human-readable summary) +- [x] Trend computed from JSON: improving/stable/declining/volatile/fixed +- [x] Per-area delta computed from JSON, not by parsing markdown + +### Phase 3: Structured Skip Reasons +- [x] `skip_reason` field present in `.user-test-last-run.json` for every area +- [x] Enum: null, proven_spotcheck, known_bug_open, known_bug_fixed, cli_precheck_failed, disconnect, user_skip +- [x] Pass rate calculation excludes disconnect and user_skip +- [x] Report displays skip count and reasons + +### Phase 4: Pass Thresholds +- [x] `pass_threshold` field supported in area details (default: 4) +- [x] `quality_threshold` field supported for scored_output areas (default: 3) +- [x] Promotion gate uses per-area thresholds +- [x] Test file is self-documenting — thresholds visible in area details + +### Phase 5: Queryable Qualitative +- [x] `best_moment` and `worst_moment` in JSON are `{ area, text }` objects +- [x] `test-history.md` has `Best Area` and `Worst Area` columns +- [x] Positive pattern detection: 7+ of last 10 runs (high bar — informational signal) +- [x] Negative pattern detection: 5+ of last 10 runs (moderate bar — actionable signal) + +### Phase 6: Graduation +- [x] After bug marked fixed, offer CLI graduation if `cli_test_command` exists +- [x] Graduated CLI query includes 
`graduated_from: "B00N"` tag +- [x] Skip graduation offer for browser-only bugs (no CLI equivalent) +- [x] Skip graduation offer if no `cli_test_command` in frontmatter + +### Phase 7: UX Opportunities + Good Patterns +- [x] `UX Opportunities` section in Phase 4 report (P1/P2 action items) +- [x] `Good Patterns` section in Phase 4 report (separate from opportunities) +- [x] UX Opportunities use sequential IDs (UX001, UX002...) with status lifecycle (open/implemented/wont_fix) +- [x] Good Patterns dedup on area slug only, `Last Confirmed` updates each run, removed after 5 runs unconfirmed +- [x] Dedup: same area + same priority = agent decides (not fuzzy text matching) +- [x] Stored in test file: `## UX Opportunities Log` (last 20 open) + `## Good Patterns` +- [x] Distinct from bugs — no GitHub issue creation +- [x] UX Opportunities triggered at score 3-5; Good Patterns triggered at score 4-5 + +### Schema & Compatibility +- [x] v2 files load without error (new sections added on first commit) +- [x] `schema_version: 3` set on first v3 commit +- [x] SKILL.md stays under 500 lines after all additions (checkpoint at step 4.5) +- [x] bugs-registry.md reference file created +- [x] graduation.md reference file created +- [x] test-file-template.md updated with Area Trends, UX Opportunities Log, Good Patterns sections +- [x] score-history.json schema documented in test-file-template.md + +### Version & Metadata +- [x] Version bumped (2.36.0 → 2.37.0) +- [x] CHANGELOG.md updated +- [x] Plugin.json and marketplace.json description counts verified + +## Implementation Sequence + +1. **Create `references/bugs-registry.md`** — bug lifecycle, multi-area handling, status transitions, fix_check threshold tied to pass_threshold +2. **Create `references/graduation.md`** — discovery-to-regression mechanism, CLI query generation, browser-only bug detection +3. 
**Update `references/test-file-template.md`** — add Area Trends section (replacing wide score history table), UX Opportunities Log + Good Patterns sections, score-history.json schema, pass_threshold/quality_threshold in area details, schema_version: 3 +4. **Update SKILL.md** — add bug registry lifecycle to Commit Mode, skip_reason to Phase 3/4, pass thresholds to promotion gate, qualitative tagging to Phase 4, graduation offer to Commit Mode, UX Opportunities + Good Patterns to Phase 4, schema v3 migration note to Phase 1 +5. **Line-count checkpoint** — run `wc -l < SKILL.md`. If over 420 lines, extract graduation or UX Opportunities to reference files before proceeding. This is a hard gate, not a suggestion. +6. **Update `.user-test-last-run.json` schema** — add skip_reason, change qualitative structure, add ux_opportunities, add good_patterns +7. **Bump metadata** — version, changelog, plugin.json, marketplace.json +8. **Validate** — SKILL.md line count, JSON validity, reference links, score-history.json schema +9. **Install locally** — copy to ~/.claude/skills/user-test/ + +## Dependencies & Risks + +| Risk | Mitigation | +|------|-----------| +| bugs.md grows unbounded | Rotation: archive entries older than 6 months to bugs-archive.md | +| score-history.json grows with many areas over many runs | Cap at 10 entries per area; one file per project. At 30 areas x 10 entries = ~300 entries — manageable JSON size | +| Graduation offers interrupt flow | Single y/n prompt after commit, not during test run. Batch all graduation offers into one prompt. | +| Pattern detection is noisy early on | Only trigger after 10+ runs. Positive patterns: 7/10 threshold. Negative patterns: 5/10 threshold. | +| UX Opportunity dedup produces false matches | Dedup anchored on area slug + priority level, not text overlap. Agent decides on conflicts — no automated fuzzy matching. 
| +| Good Patterns log bloat (agent flags everything good as a pattern) | Only log patterns at score 4-5 that represent a *deliberate design choice* (not just "page loaded"). Patterns auto-expire after 5 runs unconfirmed. | +| UX Opportunities with no lifecycle become stale | Status field (open/implemented/wont_fix). Implemented and wont_fix age out after 30 days. Open entries capped at 20. | +| schema_version: 3 migration adds sections to existing test files | Non-destructive: new sections appended, existing content preserved | +| SKILL.md approaches 400 lines | Hard gate at step 5: if over 420 lines, extract before proceeding. Graduation earmarked for early extraction (20-line trigger). | +| qualitative JSON structure change breaks external consumers | .user-test-last-run.json is gitignored and ephemeral — no external consumers expected | + +## Sources & References + +### Prior Plans +- [v1 plan: user-test browser testing skill](docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md) — original skill architecture (completed) +- [v2 plan: user-test skill revision](docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md) — schema v2, timing, CLI mode, auto-commit (completed) + +### Internal References +- Current SKILL.md: `plugins/compound-engineering/skills/user-test/SKILL.md` (321 lines, schema v2) +- Test file template: `plugins/compound-engineering/skills/user-test/references/test-file-template.md` +- Anti-patterns: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md` + +### Learnings Applied +- **Monolith-to-skill split:** Reference file extraction from day one prevents SKILL.md bloat (confirmed by v2 staying at 321 lines) +- **Agent-guided state patterns:** Use agent judgment for maturity transitions, not mechanical counters (validated by 7 real runs) +- **Plugin versioning:** Always bump version, changelog, and description counts in lockstep diff --git a/docs/plans/2026-03-02-feat-compounding-quality-plan.md 
b/docs/plans/2026-03-02-feat-compounding-quality-plan.md new file mode 100644 index 000000000..8f3289cbc --- /dev/null +++ b/docs/plans/2026-03-02-feat-compounding-quality-plan.md @@ -0,0 +1,761 @@ +--- +title: "feat: Compounding Quality — Richer Writebacks, Weakness Synthesis, Fingerprints, CLI Adversarial" +type: feat +status: completed +date: 2026-03-02 +--- + +# feat: Compounding Quality + +Four changes to make the existing compound loop actually compound. Each run +becomes smarter automatically — no new commands, no extra steps. + +## Overview + +| Change | Where | What It Does | +|--------|-------|-------------| +| 1. Richer commit writebacks | Commit Mode Step 1 | Persists tactical intelligence (selectors, timing, weakness class) back into area details | +| 2. Weakness-class synthesis | Phase 4, Step 6 | Cross-area adversarial targeting from failure patterns, not just instances | +| 3. Novelty fingerprint persistence | `.user-test-last-run.json` + Phase 3 | Prevents re-exploring territory already covered in prior runs | +| 4. CLI score 3 → adversarial browser | Phase 2.5 + Phase 3 | Partially-correct CLI results trigger harder browser probing | + +## Problem Statement + +Run-over-run, the user-test skill rediscovers the same information: + +1. **Selectors are found then forgotten.** Run 1 discovers working DOM selectors + (3-5 MCP calls). Run 2 discovers them again. The verify block has no way to + persist confirmed selectors. + +2. **Weakness patterns are instance-level, not class-level.** Three areas share + the same "stale-react-state" failure pattern. Each is treated independently. + No mechanism identifies or targets the shared weakness class. + +3. **Novelty log expires between runs.** The novelty budget forces exploration, + but the log resets each run. Run N+1 re-explores territory N already covered. + +4. **CLI score 3 is a dead signal.** Score ≤2 skips browser. Score ≥4 passes + through. 
Score 3 ("surface-level right, deeper reasoning wrong") proceeds + normally — the adversarial sweet spot is wasted. + +## SKILL.md Line Budget Strategy + +SKILL.md is at **420 lines** (hard ceiling). All changes must net zero. + +### Extraction Plan + +| Extraction | Source Lines | Savings | Target | +|-----------|-------------|---------|--------| +| `.user-test-last-run.json` schema | SKILL.md:282-333 (52 lines) | ~45 lines (replace with 5-line pointer + version ref) | New `references/last-run-schema.md` | +| Phase 3 novelty budget inline | SKILL.md:110-115 (6 lines) | ~4 lines (compress to 2-line pointer) | Already in `queries-and-multiturn.md:128-194` | +| Phase 2.5 CLI detail | SKILL.md:84-97 (14 lines) | ~6 lines (compress to 8-line version) | Already in `queries-and-multiturn.md:62-83` | +| **Total freed** | | **~55 lines** | | + +### Addition Plan + +| Addition | Lines | Location | +|---------|-------|----------| +| Commit Mode Step 1: 3 new bullet points (notes, selectors, weakness_class) | ~8 | SKILL.md Commit Mode | +| Phase 4 Step 6: cross-area synthesis pointer | ~5 | SKILL.md Phase 4 | +| Phase 3: fingerprint check + adversarial mode trigger | ~6 | SKILL.md Phase 3 | +| Phase 2.5: adversarial flag check | ~5 | SKILL.md Phase 2.5 | +| JSON schema pointer to `last-run-schema.md` | ~5 | SKILL.md (replaces extracted block) | +| **Total added** | **~29** | | + +**Net: -55 + 29 = -26 lines.** Comfortable margin. SKILL.md lands at ~394. + +## Schema Version + +All four changes ship together as **v8**. One migration event. 
+ +``` +v7 → v8 changes: +- Area Details: optional **weakness_class:** field (below pass_threshold) +- Area Details: **verify:** blocks auto-updated with confirmed selectors by commit mode +- Areas table: Notes column receives tactical run notes in [Run N] format (max 3 entries) +- .user-test-last-run.json: new fields per area (tactical_note, confirmed_selectors, + weakness_class, adversarial_browser, adversarial_trigger) +- .user-test-last-run.json: new top-level key novelty_fingerprints (accumulates across runs) +- .user-test-last-run.json schema extracted to references/last-run-schema.md +``` + +**Migration:** Treat missing `weakness_class` as absent. Treat missing +`novelty_fingerprints` as empty. Treat missing `adversarial_browser` as false. +Do NOT rewrite v7 files on read. + +--- + +## Change 1: Richer Commit Writebacks + +### What changes + +Commit Mode Step 1 (currently SKILL.md:363-369) writes three new categories of +intelligence back into each area after every run. Currently this data is +discovered during execution then discarded. + +### A. Tactical notes (Areas table, Notes column) + +After scoring, commit mode appends a short tactical note to the area's Notes +column. Format: `[Run N] `. Cap at 3 entries; drop oldest when exceeded. + +**Write only when there's a genuine tactical insight:** +- A reliable JS selector pattern: `[Run 4] batch read via [data-filter-chip] + .product-card reliable` +- A timing pattern: `[Run 3] agent response 8-12s on first query, faster on follow-ups` +- An interaction sequence that revealed a bug: `[Run 2] filter → navigate → back → filter again surfaces stale state` + +Do NOT write: generic observations, maturity updates, restatements of probe results. + +### B. Verified selectors into `verify:` blocks + +When Phase 3 exploration discovers working DOM selectors confirmed by a +successful `javascript_tool` batch call, commit mode writes them into the area's +`**verify:**` block. 
+ +```markdown +**verify:** +- Apply filter. Batch-check via javascript_tool: + activeFilters (`[data-filter-chip]`), resultCount (`.product-card`), + sample 5 results (`.product-card .title`, `.condition-badge`). + Every result's attribute must match the active filter. + _Selectors confirmed run 3._ +``` + +Rules: +- Only write selectors confirmed by a successful batch call this run +- Append `_Selectors confirmed run N._` so future runs know the source +- APPEND new selectors below existing user-authored content — never replace +- Update with new selectors if they changed; preserve unchanged ones +- If selectors are unknown (first run): `_Selectors not yet confirmed — discover during exploration._` + +This is the highest-leverage writeback: run 1 discovers selectors through +sequential trial (3-5 MCP calls), run 2 reads the verify block and batches them +into one `javascript_tool` call. + +### C. `weakness_class` field + +New optional field in area details, written by commit mode when 2+ probes in the +same area share a recognizable failure pattern. Lives just below `pass_threshold`. + +```markdown +**weakness_class:** stale-react-state +``` + +**Predefined classes:** +- `stale-react-state` — filters/state not resetting on navigation +- `count-display-lag` — displayed counts don't match actual DOM counts +- `multi-turn-context-loss` — agent forgets constraints from earlier turns +- `async-render-race` — results appear but attributes/badges haven't updated +- `filter-intersection-empty` — compound filter combinations return 0 results unexpectedly +- `agent-reasoning-shallow` — CLI quality consistently 3, partially correct but missing nuance + +**Freeform:** For novel failure modes that don't fit a predefined class, write a +freeform string (e.g., `weakness_class: checkout-state-leaked-across-sessions`). +Change 2 handles freeform classes with custom adversarial instruction generation. 
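
A minimal sketch of how a class string might be brought into the lowercase, hyphenated form that this plan uses for exact-match comparison. The helper names `normalize_weakness_class` and `is_predefined` are hypothetical, and the exact normalization rule (collapsing any run of non-alphanumeric characters into one hyphen) is an assumption; the plan only states "lowercase, hyphenated".

```python
import re

# The six predefined classes listed in this plan.
PREDEFINED_CLASSES = {
    "stale-react-state",
    "count-display-lag",
    "multi-turn-context-loss",
    "async-render-race",
    "filter-intersection-empty",
    "agent-reasoning-shallow",
}


def normalize_weakness_class(raw: str) -> str:
    """Lowercase the value and collapse non-alphanumeric runs into single hyphens.

    Assumption: this is one plausible reading of "lowercase, hyphenated".
    """
    return re.sub(r"[^a-z0-9]+", "-", raw.strip().lower()).strip("-")


def is_predefined(raw: str) -> bool:
    """Return True when the normalized value is one of the predefined classes."""
    return normalize_weakness_class(raw) in PREDEFINED_CLASSES
```

Under this sketch a freeform value such as `checkout-state-leaked-across-sessions` passes through unchanged and is simply reported as not predefined, which is the case where a custom adversarial instruction would be generated.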
+ +**Classification:** Commit mode reads each failing probe's `query`, `verify`, +and `result_detail` fields and matches against predefined class descriptions +using agent judgment. No mechanical matching rule — agent decides which class (if +any) best describes the failure pattern. If classification is ambiguous, prefer +freeform over forcing a predefined class. Matching for C2 synthesis uses exact +string equality after normalization (lowercase, hyphenated). + +**Lifecycle:** +- Write when 2+ probes share a pattern (one probe = insufficient signal) +- Update each run: if the class's probes have all passed for 3+ consecutive + runs, remove the field (weakness resolved) +- If a new pattern emerges with more probes than the current class, replace it +- One `weakness_class` per area — the dominant pattern. Probe count decides dominance. + +### `.user-test-last-run.json` additions (per area) + +```json +{ + "slug": "agent/filter-via-chat", + "ux_score": 3, + "tactical_note": "filter → navigate away → back → filter again surfaces stale state", + "confirmed_selectors": { + "activeFilters": "[data-filter-chip]", + "resultCount": ".product-card", + "sampleResults": ".product-card .title, .condition-badge" + }, + "weakness_class": "stale-react-state" +} +``` + +`tactical_note: null` → skip Notes update. `confirmed_selectors: {}` → skip +verify block update. `weakness_class: null` → no class identified (or resolved). + +### Detail spec locations + +- Selector persistence rules → `verification-patterns.md` (new section: "Selector Discovery and Writeback") +- `weakness_class` lifecycle and predefined class definitions → `probes.md` (new section: "Weakness Classification") +- Tactical notes format and cap rules → `queries-and-multiturn.md` (append to commit mode guidance) + +--- + +## Change 2: Weakness-Class Synthesis in Explore Next Run + +### What changes + +Phase 4 Step 6 (currently SKILL.md:209-212) gains a cross-area synthesis pass. 
+After generating per-area Explore Next Run items, it looks across all areas for +shared failure classes. When a class appears in 2+ areas, it generates one +`[cross-area]` Explore Next Run entry targeting the class systemically. + +### Synthesis pass + +Synthesis reads `weakness_class` fields from the test file as written by the +previous run's commit — first-run appearance of a weakness_class does not trigger +synthesis until the following run. + +1. Collect all areas with a `weakness_class` field set in the test file +2. Group by weakness_class value (exact string match) +3. For each class appearing in 2+ areas: generate one `[cross-area]` Explore + Next Run entry + +**Format:** +``` +P1 [cross-area] Browser stale-react-state in agent/filter + browse/filters — probe ALL navigation sequences next run +``` + +**Cap:** Maximum 2 cross-area synthesis entries per run. + +**Tiebreaker when >2 classes qualify:** +Rank by (1) number of affected areas — more areas = higher priority; then (2) +number of failing probes in the class. Deterministic, favors widespread patterns. 
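
The synthesis pass, cap, and tiebreaker above can be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's implementation: the function name `synthesize_cross_area`, the input shape, and the `failing_probes` count are hypothetical, and the final sort on class name is an added assumption so that fully tied classes still order deterministically.

```python
from collections import defaultdict

MAX_CROSS_AREA_ENTRIES = 2  # cap per run, as stated in the plan


def synthesize_cross_area(areas):
    """Build [cross-area] Explore Next Run entries from per-area weakness classes.

    `areas` is a list of dicts with hypothetical keys:
    'slug', 'weakness_class' (str or None), and 'failing_probes' (int).
    """
    # Steps 1-2: collect areas that have a class set and group by exact string.
    groups = defaultdict(list)
    for area in areas:
        weakness_class = area.get("weakness_class")
        if weakness_class:
            groups[weakness_class].append(area)

    # Step 3: only classes appearing in 2+ areas qualify.
    qualifying = [(wc, members) for wc, members in groups.items() if len(members) >= 2]

    # Tiebreaker: more affected areas first, then more failing probes.
    # The class name as a last key is an assumption added for determinism.
    qualifying.sort(
        key=lambda item: (
            -len(item[1]),
            -sum(a.get("failing_probes", 0) for a in item[1]),
            item[0],
        )
    )

    entries = []
    for weakness_class, members in qualifying[:MAX_CROSS_AREA_ENTRIES]:
        slugs = [a["slug"] for a in members]
        entries.append(
            {
                "priority": "P1",
                "area": "[cross-area]",
                "mode": "Browser",
                "why": f"{weakness_class} in {' + '.join(slugs)}",
                "weakness_class": weakness_class,
                "affected_areas": slugs,
            }
        )
    return entries
```

The sketch leaves out the adversarial instruction text, which the plan derives from a per-class template or generates for freeform classes.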
+ +### Adversarial instruction templates (predefined classes) + +| Class | Adversarial Instruction | +|-------|------------------------| +| `stale-react-state` | Probe ALL navigation sequences that cross area boundaries — apply filter → navigate away → return → verify state reset | +| `count-display-lag` | After every action changing result count, wait 2s then re-read count vs DOM — check for lag window | +| `multi-turn-context-loss` | On every multi-turn sequence, inject a context-breaking action at turn 3, then return to prior context — verify retention | +| `async-render-race` | After every action triggering async rendering, immediately read badges/attributes — check for race window | +| `filter-intersection-empty` | Probe all 2-filter compound combinations systematically — check for empty-intersection cases | +| `agent-reasoning-shallow` | Replace simple queries with competing-constraint and ambiguous queries across all affected areas | + +**Freeform classes:** When `weakness_class` is freeform (no matching template), +the agent generates a custom adversarial instruction based on the class name and +probe failure details. + +**Persistence signal:** If the same class appeared in last run's Explore Next Run, +was targeted, and still didn't resolve: `PERSISTENT — stale-react-state active +N runs — escalate to Known-bug consideration` + +### Report placement + +Cross-area synthesis entries appear at the top of EXPLORE NEXT RUN: + +``` +EXPLORE NEXT RUN + P1 [cross-area] Browser stale-react-state in 3 areas — probe all navigation events + P1 shipping-form Browser Validation broken — edge cases + P2 checkout/promo Both Adjacent to cart, untested +``` + +### Why Explore Next Run entries, not cross-area probes + +Cross-area synthesis produces targeting instructions ("test this pattern across +these areas next run"). These are ephemeral — regenerated each run from current +state. Cross-area probes (from v7) are persistent regression tests with a full +lifecycle. 
Different purpose: synthesis directs exploration, probes track +regressions. If a synthesis target repeatedly fails, the agent should generate a +cross-area probe from the failure — that's the natural escalation path. + +### `.user-test-last-run.json` additions (explore_next_run entries) + +```json +{ + "priority": "P1", + "area": "[cross-area]", + "mode": "Browser", + "why": "stale-react-state in agent/filter-via-chat + browse/filters", + "weakness_class": "stale-react-state", + "affected_areas": ["agent/filter-via-chat", "browse/filters"], + "adversarial_instruction": "Probe ALL navigation sequences that cross area boundaries..." +} +``` + +### Detail spec location + +Cross-area synthesis rules, tiebreaker logic, template table → `probes.md` +(new section: "Cross-Area Weakness Synthesis") + +--- + +## Change 3: Novelty Fingerprint Persistence + +### What changes + +The novelty log expires between runs (documented v2 limitation). This change +persists a compact fingerprint of each novel interaction across sessions so run +N+1 knows what run N already explored. + +### Fingerprint format + +`::` + +Examples: +- `agent/filter-via-chat:edge-query:price-floor` +- `browse/filters:filter-combo:size+color` +- `checkout/shipping-form:invalid-input:zip-letters` + +**Normalization taxonomy (intentionally fuzzy):** +- Price/number inputs → `price-floor`, `price-ceiling`, `price-range` +- Filter combinations → `filter-combo:+` +- Invalid inputs → `invalid-input:` +- Edge case queries → `edge-query:` +- Navigation sequences → `nav-sequence:-` +- **Doesn't fit taxonomy → `:freeform:<3-word-summary>`** — coverage over consistency + +### Storage in `.user-test-last-run.json` + +```json +"novelty_fingerprints": { + "agent/filter-via-chat": [ + "agent/filter-via-chat:edge-query:price-floor", + "agent/filter-via-chat:edge-query:out-of-scope-question" + ], + "browse/filters": [ + "browse/filters:filter-combo:size+color" + ] +} +``` + +Cap: 20 fingerprints per area. 
Drop oldest when exceeded. + +### Read-Merge-Write Sequence + +`.user-test-last-run.json` is overwritten on each run (SKILL.md:331). Fingerprints +are the only key that accumulates. The sequence: + +1. **Phase 1 (Load Context):** Read existing `novelty_fingerprints` from + `.user-test-last-run.json` into memory before the run starts. +2. **Phase 3 (Execute):** Use fingerprints to skip already-explored interactions. + Generate new fingerprints for novel interactions this run. +3. **Phase 4 (Write):** Merge existing fingerprints + new fingerprints. Apply + 20-per-area cap (drop oldest). Write the merged set into the new JSON file. + +This is safe because the JSON is written once at the end of Phase 4. There is no +partial-write risk — the entire file is written atomically. + +### Iterate mode exemption + +Iterate mode measures consistency by running the same scenario N times. +**Fingerprints are ignored in iterate mode** — all runs test the same interaction +set. The between-run page reload resets `mcp_call_counter` but does NOT apply +fingerprint filtering. Fingerprints still accumulate for use in the next +non-iterate session. + +### Fingerprint matching semantics + +Agent exercises judgment on what "matches" — the goal is to skip interactions of +the same *type*, not require exact parameter matches. `edge-query:price-floor` +and `edge-query:price-ceiling` are different fingerprints (different key params). +`edge-query:price-floor` from run 1 means "don't test price-floor edge cases +again" — test price-ceiling or price-range instead. + +### Interaction with adversarial mode (C4) + +Adversarial mode overrides fingerprint skipping for its specific actions. +Competing-constraint queries triggered by C4 are always run regardless of +fingerprint state — the adversarial signal takes priority over "already tried." + +### Interaction with Proven area budget + +Proven areas have a 3-MCP-call cap. 
Fingerprint filtering does NOT increase the +budget — it changes WHAT those 3 calls test. If fingerprints exclude obvious +interactions, the 3 calls target genuinely novel territory. This is the desired +behavior: Proven areas get spot-checked on untested ground, not re-tested on +familiar ground. + +### Resilience + +If `.user-test-last-run.json` is deleted or corrupted, fingerprint history resets +to empty. Acceptable — the skill re-explores previously covered territory, same +as before this change. Fingerprints are an optimization, not a correctness +requirement. + +### Report signal + +Add to SIGNALS when fingerprints meaningfully constrained novelty choices: +``` +~ agent/filter-via-chat novelty: 3 fingerprints excluded, 2 new interactions found +``` + +### Detail spec location + +Normalization rules, freeform fallback, accumulation behavior → +`queries-and-multiturn.md` (new section: "Novelty Fingerprint Persistence") + +--- + +## Change 4: CLI Score 3 → Browser Adversarial Signal + +### What changes + +CLI score 3 ("partially correct — surface-level right, deeper reasoning wrong") +triggers adversarial browser mode for that area. Currently this signal is lost. + +**Why score 3 specifically:** +- Score ≤2 already skips browser via `prechecks` +- Score ≥4 proceeds normally +- Score 3 = the adversarial sweet spot: the app functions, but the CLI revealed + shallow reasoning that browser testing can expose as real user-facing failure + +### Trigger condition + +Adversarial mode triggers when **any individual CLI query** for the area scores +exactly 3. Per-query scores, not averages. + +**Secondary check:** If the area's CLI Quality average across queries is 3.0-3.4 +AND no single query hit exactly 3 (all queries borderline), also trigger +adversarial mode. Record `adversarial_trigger: "cli-avg-3.x: "`. + +### Adversarial browser mode behaviors + +When triggered, the area's Phase 3 execution changes: + +1. 
**Skip the happy path.** Start with the query most likely to expose the + shallow reasoning — not the simplest, expected query. + +2. **Front-load competing-constraint queries.** If the area has Queries defined, + execute any query with competing constraints before single-intent queries. + +3. **Pre-emptive probe (before exploration).** Generate an `untested` probe: + - `generated_from: "cli-score-3: "` + - Priority: P1 (CLI already revealed the weakness) + +4. **Increased novelty budget.** + - Proven areas: all 3 MCP calls must be adversarial, not happy-path spot-checks + - Uncharted areas: novelty budget increases to 40% of calls (from 30%), min 3 + +5. **Report flag** in DETAILS: + ``` + agent/filter-via-chat: CLI 3 → browser adversarial mode + Pre-emptive probe: "competing filter constraints" (P1) + Exploration front-loaded with competing-constraint queries + ``` + +### Interaction with progressive narrowing + +If a SKIP-classified area has a CLI query scoring 3, **adversarial mode overrides +SKIP for that area only** — it gets promoted to PROBES-ONLY with adversarial +execution. The CLI signal is too strong to ignore. PROBES-ONLY areas with +adversarial mode execute their probes + the pre-emptive probe, but skip full +exploration. + +### Phase 2.5 addition + +After scoring CLI queries, add one step: + +> **Adversarial flag check:** For each area with `prechecks`-tagged queries: if +> any individual query score == 3, set `adversarial_browser: true`. If average is +> 3.0-3.4 with no single query at 3, also set `adversarial_browser: true`. +> Record the triggering query in `adversarial_trigger`. 
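
The adversarial flag check can be sketched as below, assuming per-query CLI scores arrive as `(query, score)` pairs. The function name `adversarial_flag` and the input shape are hypothetical. When several queries score 3, this sketch records the first one, which is an assumption. The average-based trigger string here carries only the `cli-avg-` prefix and the rounded average; whatever follows it in the plan's format is not reproduced.

```python
def adversarial_flag(query_scores):
    """Decide whether an area enters adversarial browser mode.

    `query_scores` is a list of (query_text, score) pairs for one area
    (a hypothetical input shape used only for this illustration).
    """
    if not query_scores:
        return {"adversarial_browser": False, "adversarial_trigger": None}

    # Primary trigger: any individual query scoring exactly 3.
    for query, score in query_scores:
        if score == 3:
            return {
                "adversarial_browser": True,
                "adversarial_trigger": f"cli-score-3: {query}",
            }

    # Secondary trigger: no query at exactly 3, but the average is 3.0-3.4.
    average = sum(score for _, score in query_scores) / len(query_scores)
    if 3.0 <= average <= 3.4:
        return {
            "adversarial_browser": True,
            "adversarial_trigger": f"cli-avg-{average:.1f}",
        }

    return {"adversarial_browser": False, "adversarial_trigger": None}
```

Checking per-query scores before the average keeps the two triggers mutually exclusive, matching the plan's wording that the secondary check applies only when no single query hit exactly 3.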
+ +### `.user-test-last-run.json` additions (per area) + +```json +{ + "slug": "agent/filter-via-chat", + "adversarial_browser": true, + "adversarial_trigger": "cli-score-3: show me items under $50 in good condition" +} +``` + +### SIGNALS addition + +``` +~ 2 areas in CLI-adversarial mode (CLI score 3): agent/filter-via-chat, agent/search-query +``` + +### Detail spec location + +Full adversarial mode behavior, competing-constraint query identification, +novelty budget adjustment → `queries-and-multiturn.md` (new section: "CLI +Adversarial Mode") + +--- + +## SpecFlow Gap Resolutions + +Issues identified by flow analysis, resolved here: + +| Gap | Resolution | +|-----|-----------| +| Fingerprint persistence vs JSON overwrite | Read-merge-write sequence documented in C3 (Phase 1 read, Phase 4 merge+write) | +| Iterate mode + fingerprints | Explicit exemption: iterate mode ignores fingerprints (C3) | +| C4 adversarial vs fingerprint skipping | Adversarial overrides fingerprints for its specific actions (C3) | +| C4 adversarial vs progressive narrowing SKIP | Adversarial overrides SKIP → promotes to PROBES-ONLY (C4) | +| C4 adversarial vs Proven 3-call budget | Budget unchanged — adversarial reshapes WHAT those 3 calls do (C4) | +| weakness_class classification method | Agent judgment on probe query/verify/result_detail fields. Prefer freeform over forcing predefined. Documented in C1 spec. | +| weakness_class matching for C2 synthesis | Exact string match. Predefined classes are canonical strings. Synthesis restricted to areas where weakness_class already set in test file (no re-derivation). | +| Synthesis output vs cross-area probes | Synthesis produces ephemeral Explore Next Run entries, not probes. Repeated failures escalate to cross-area probes naturally. | +| C1→C2 timing (2-run delay) | By design. Run N writes weakness_class via commit. Run N+1 reads it but synthesis requires 2+ areas — fires earliest at N+2 if a second area develops the same class. 
Stated explicitly in C2 synthesis pass. | +| Fingerprints machine-local (gitignored JSON) | Intentional. Fingerprints are an optimization, not canonical state. Other compounding mechanisms (probes, queries, weakness_class) persist in committed test file. | +| weakness_class removal in multi-run mode | Each run within a multi-run session counts as a separate run toward the 3-run removal threshold. | + +--- + +## Design Decisions + +### D1. Net-zero SKILL.md via JSON schema extraction + +The `.user-test-last-run.json` schema block (52 lines) is the largest inline +block in SKILL.md that can move to a reference file without hurting agent +performance. The JSON schema is read once at run start and written once at run end — +the agent doesn't need it inline during execution phases. + +### D2. Predefined weakness classes + freeform fallback + +Predefined classes accelerate C2 template lookup but freeform strings ensure +novel failure modes aren't lost. Exact string matching (post-normalization) is +strict enough to prevent false synthesis but simple to implement. + +### D3. Fingerprints as optimization, not truth + +Fingerprints are gitignored, machine-local, and lossy (20 cap with oldest-drop). +This is deliberate — they guide novelty exploration but don't gate correctness. +A fresh machine re-explores territory, which is the same as the pre-C3 behavior. + +### D4. Adversarial mode reshapes budget, doesn't increase it + +Proven areas keep their 3-call cap. Adversarial mode changes WHAT those calls +test (competing constraints instead of happy paths). This maintains the +efficiency property of Proven areas while exploiting the CLI signal. + +### D5. Explore Next Run entries, not cross-area probes + +Synthesis produces targeting instructions that are regenerated each run. +Cross-area probes are persistent regression tests. Different tools for different +purposes. 
If a synthesis target fails repeatedly, the agent generates a +cross-area probe — natural escalation from ephemeral to persistent. + +--- + +## Implementation Phases + +### Phase 1: Schema Extraction + Foundation (C1 prep) + +**Goal:** Create room in SKILL.md. Extract JSON schema. Add v8 migration notes. + +- [x] Create `references/last-run-schema.md` with full JSON schema from SKILL.md:282-333 + - Include all current fields + C1/C2/C3/C4 additions + - Include behavioral notes (overwrite-per-run, fingerprint accumulation exception) +- [x] Replace SKILL.md:282-333 with 5-line pointer to `last-run-schema.md` +- [x] Compress Phase 3 novelty budget inline (SKILL.md:110-115) to 2-line pointer +- [x] Compress Phase 2.5 CLI detail (SKILL.md:84-97) to 8-line version +- [x] Add v7→v8 migration notes to `test-file-template.md` +- [x] Add `weakness_class` field to area details template in `test-file-template.md` +- [x] Verify SKILL.md line count after extraction (target: ~358; after Phases 2-5 additions: ~394) + +### Phase 2: Richer Commit Writebacks (C1) + +**Goal:** Commit mode persists tactical intelligence. 
+ +- [x] Add 3 new bullet points to Commit Mode Step 1 in SKILL.md (~8 lines): + - Tactical notes (Notes column, cap 3, drop oldest) + - Verified selectors (verify: block, append only, tag with run number) + - weakness_class (below pass_threshold, 2+ probes threshold) +- [x] Add "Selector Discovery and Writeback" section to `verification-patterns.md` (~20 lines) + - Rules: only confirmed selectors, append-only, run-tagged, first-run placeholder +- [x] Add "Weakness Classification" section to `probes.md` (~20 lines) + - Predefined classes table, freeform guidance, lifecycle (write/update/remove) +- [x] Add tactical notes format/cap to `queries-and-multiturn.md` (~10 lines) + - `[Run N] ` format, 3-entry cap, write-only-when-genuine rule +- [x] Update `last-run-schema.md` with C1 per-area fields + +### Phase 3: Weakness-Class Synthesis (C2) + +**Goal:** Cross-area adversarial targeting from shared failure patterns. + +- [x] Add cross-area synthesis pointer to Phase 4 Step 6 in SKILL.md (~5 lines) +- [x] Add "Cross-Area Weakness Synthesis" section to `probes.md` (~20 lines) + - Synthesis pass (3 steps), cap of 2, tiebreaker rules + - Adversarial instruction templates table (6 predefined + freeform) + - Persistence signal format + - Report placement (top of EXPLORE NEXT RUN) +- [x] Update `last-run-schema.md` with C2 explore_next_run additions + +### Phase 4: Novelty Fingerprint Persistence (C3) + +**Goal:** Novel interactions tracked across runs. 
+ +- [x] Add fingerprint merge note to Commit Mode in SKILL.md (~3 lines) +- [x] Add fingerprint check to Phase 3 in SKILL.md (~3 lines) +- [x] Add "Novelty Fingerprint Persistence" section to `queries-and-multiturn.md` (~30 lines) + - Fingerprint format and normalization taxonomy + - Read-merge-write sequence + - Iterate mode exemption + - Adversarial mode override + - Proven area budget interaction + - Matching semantics + - SIGNALS format +- [x] Update `last-run-schema.md` with `novelty_fingerprints` top-level key + +### Phase 5: CLI Adversarial Browser Mode (C4) + +**Goal:** CLI score 3 triggers adversarial browser testing. + +- [x] Add adversarial flag check to Phase 2.5 in SKILL.md (~5 lines) +- [x] Add adversarial mode trigger to Phase 3 in SKILL.md (~3 lines) +- [x] Add "CLI Adversarial Mode" section to `queries-and-multiturn.md` (~20 lines) + - Trigger condition (per-query score 3, secondary avg 3.0-3.4 check) + - 5 behavior changes (skip happy path, front-load, pre-emptive probe, increased novelty, report flag) + - Progressive narrowing override (SKIP → PROBES-ONLY) + - Fingerprint override rule +- [x] Update `last-run-schema.md` with C4 per-area fields + +### Phase 6: Version Bump + Validation + Install + +- [x] Verify SKILL.md ≤ 420 lines (actual: 368) +- [x] Verify all cross-references between files are correct +- [x] Bump plugin.json: 2.49.0 → 2.50.0 (no marketplace.json found) +- [x] Add CHANGELOG entry for v2.50.0 +- [x] Install locally to `~/.claude/skills/user-test/` +- [x] Clean up any stale files from previous install + +--- + +## Files to Change + +| File | Current Lines | Delta | After | What Changes | +|------|-------------|-------|-------|-------------| +| `SKILL.md` | 420 | ~-55 extracted, +29 added = net -26 | ~394 | JSON schema extraction, Phase 2.5 compress, Phase 3 compress, C1-C4 additions | +| `references/last-run-schema.md` | 0 (new) | +~70 | ~70 | Full JSON schema + behavioral notes + C1-C4 field additions | +| 
`references/test-file-template.md` | 536 | +~12 | ~548 | v8 migration notes, weakness_class in area details template | +| `references/probes.md` | 401 | +~40 | ~441 | Weakness Classification section (C1), Cross-Area Weakness Synthesis section (C2) | +| `references/queries-and-multiturn.md` | 194 | +~60 | ~254 | Tactical notes (C1), Novelty Fingerprint Persistence (C3), CLI Adversarial Mode (C4) | +| `references/verification-patterns.md` | 131 | +~20 | ~151 | Selector Discovery and Writeback section (C1) | +| `plugin.json` | — | version bump | — | 2.49.0 → 2.50.0 | +| `marketplace.json` | — | version bump | — | 2.49.0 → 2.50.0 | +| `CHANGELOG.md` | — | +~20 | — | v2.50.0 entry | + +--- + +## Acceptance Criteria + +### Change 1: Richer Commit Writebacks +- [ ] Tactical notes written to Notes column in `[Run N] ` format +- [ ] Notes capped at 3 entries; oldest dropped when exceeded +- [ ] Notes written only for genuine tactical insights (not generic observations) +- [ ] Verified selectors appended to verify: blocks with `_Selectors confirmed run N._` +- [ ] Selector writeback is append-only (never replaces user-authored content) +- [ ] First-run placeholder: `_Selectors not yet confirmed — discover during exploration._` +- [ ] `weakness_class` written when 2+ probes share a failure pattern +- [ ] `weakness_class` removed after 3 consecutive pass runs +- [ ] One `weakness_class` per area — dominant pattern by probe count +- [ ] `.user-test-last-run.json` includes `tactical_note`, `confirmed_selectors`, `weakness_class` per area +- [ ] Detail specs in verification-patterns.md, probes.md, queries-and-multiturn.md + +### Change 2: Weakness-Class Synthesis +- [ ] Phase 4 Step 6 runs cross-area synthesis after per-area Explore Next Run generation +- [ ] `[cross-area]` entries generated when weakness_class appears in 2+ areas +- [ ] Cap of 2 cross-area synthesis entries per run +- [ ] Tiebreaker: (1) area count, (2) probe count +- [ ] Predefined class templates produce 
correct adversarial instructions +- [ ] Freeform classes produce custom adversarial instructions +- [ ] Persistence signal when class active N+ runs: "PERSISTENT — escalate to Known-bug" +- [ ] Cross-area entries appear at top of EXPLORE NEXT RUN in report +- [ ] `.user-test-last-run.json` explore_next_run includes weakness_class, affected_areas, adversarial_instruction + +### Change 3: Novelty Fingerprint Persistence +- [ ] Fingerprints stored in `.user-test-last-run.json` under `novelty_fingerprints` +- [ ] Format: `::` +- [ ] Cap: 20 per area, drop oldest when exceeded +- [ ] Read-merge-write: existing fingerprints read at Phase 1, merged at Phase 4 +- [ ] Phase 3 skips interactions matching existing fingerprints +- [ ] Iterate mode ignores fingerprints (consistency measurement preserved) +- [ ] Adversarial mode (C4) overrides fingerprint skipping for its actions +- [ ] Proven area budget unchanged (fingerprints reshape, not expand) +- [ ] SIGNALS line when fingerprints constrained novelty: `~ novelty: N fingerprints excluded, M new found` +- [ ] Resilience: missing/corrupted JSON → empty fingerprints (graceful degradation) + +### Change 4: CLI Adversarial Browser Mode +- [ ] Triggers on any individual CLI query score == 3 +- [ ] Secondary trigger: CLI average 3.0-3.4 with no single query at 3 +- [ ] Skip happy path — start with query exposing shallow reasoning +- [ ] Front-load competing-constraint queries before single-intent queries +- [ ] Pre-emptive P1 probe: `generated_from: "cli-score-3: "` +- [ ] Proven areas: all 3 MCP calls adversarial (not happy-path spot-checks) +- [ ] Uncharted areas: novelty budget 40% (from 30%), min 3 calls (from 2) +- [ ] Progressive narrowing override: SKIP → PROBES-ONLY when CLI score 3 +- [ ] Report flag in DETAILS section +- [ ] SIGNALS line: `~ N areas in CLI-adversarial mode (CLI score 3): ` +- [ ] `.user-test-last-run.json` includes `adversarial_browser`, `adversarial_trigger` per area + +### Infrastructure +- [ ] 
`.user-test-last-run.json` schema extracted to `references/last-run-schema.md` +- [ ] Schema version v7 → v8 +- [ ] SKILL.md ≤ 420 lines after all changes +- [ ] All new fields additive (missing = absent/default) +- [ ] v7 files readable without rewrite +- [ ] Version bump 2.49.0 → 2.50.0 +- [ ] CHANGELOG entry for v2.50.0 +- [ ] Locally installed and stale files cleaned + +--- + +## Implementation Order + +**1 → 2 → 3 → 4 → 5 → 6** — Phase 1 creates room, then C1 → C2 → C3 → C4. + +C1 before C2: `weakness_class` written in C1 is consumed by C2's synthesis. +C3 after C1/C2: probe system is richer, more meaningful territory to fingerprint. +C4 last: touches the most phases but is the most self-contained conceptually. + +--- + +## What "Getting Smarter Run-Over-Run" Looks Like + +**Run 1:** Standard execution. Selectors unknown — sequential finds, 3-5 MCP +calls per area. Novelty fingerprints empty. No weakness class. Explore Next Run +is per-area only. CLI score 3 on one area triggers adversarial browser. + +**Run 2:** Selectors confirmed from run 1 — verification is now one batch +`javascript_tool` call per area. Novelty fingerprints exclude run 1 territory. +If `weakness_class` was written, it's visible in area details. + +**Run 3:** `weakness_class` confirmed (2+ probes). Cross-area synthesis generates +adversarial Explore Next Run entry. Fingerprints cover 2 runs — agent must find +genuinely new territory. + +**Run 5+:** Weakness classes resolve (probes pass, field removed) or deepen (more +probes confirm). Fingerprints cover most obvious paths. Selectors battle-tested. + +**Run 10:** Qualitatively different from run 1. Targeted adversarial probing of +known weakness classes with one-call batch verification, guided by 9 runs of +accumulated fingerprints and pattern recognition. + +--- + +## Verification: Would This Have Caught Real Bugs? 
+ +| Bug | Without this plan | With this plan | +|-----|-------------------|----------------| +| Selectors rediscovered each run | 3-5 MCP calls per area per run | 1 batch call from run 2 onward | +| Stale-react-state in 3 areas | Each treated independently | Cross-area synthesis targets pattern systemically | +| Novelty re-exploration | Same territory tested twice | Fingerprints exclude, forcing novel ground | +| CLI score 3 on filter-via-chat | Passes through to normal browser mode | Adversarial browser: competing constraints, pre-emptive P1 probe | + +--- + +## Sources + +### Current File References +- Commit Mode: `SKILL.md:335-397` +- Phase 2.5 CLI Testing: `SKILL.md:84-97` +- Phase 3 Novelty Budget: `SKILL.md:110-115` (inline), `queries-and-multiturn.md:128-194` (detail) +- Phase 4 Step 6 Explore Next Run: `SKILL.md:209-212` +- `.user-test-last-run.json` schema: `SKILL.md:282-333` +- Selector lifecycle: `verification-patterns.md:92-94` +- Area details template: `test-file-template.md:41-49` +- CLI Area Queries: `queries-and-multiturn.md:62-83` +- Novelty Log: `queries-and-multiturn.md:165-194` + +### Institutional Learnings +- Agent-guided state transitions: `docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md` +- Line budget enforcement: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md` +- Plugin versioning: `docs/solutions/plugin-versioning-requirements.md` diff --git a/docs/plans/2026-03-02-feat-cross-area-probes-isolation-browser-restart-plan.md b/docs/plans/2026-03-02-feat-cross-area-probes-isolation-browser-restart-plan.md new file mode 100644 index 000000000..d469674a5 --- /dev/null +++ b/docs/plans/2026-03-02-feat-cross-area-probes-isolation-browser-restart-plan.md @@ -0,0 +1,630 @@ +--- +title: "feat: Cross-Area Probes, Probe Isolation, and Proactive Browser Restart" +type: feat +status: completed +date: 2026-03-02 +schema_version_target: 7 +--- + +# feat: Cross-Area Probes, Probe Isolation, and Proactive 
Browser Restart + +## Problem Statement + +Three gaps identified from run 9 results on sg-resale: + +**1. Cross-area seams are untestable.** The search bar -> chat contamination +bug (UX010) lives between `browse/product-grid` and `agent/filter-via-chat`. +Neither area owns the interaction. Every probe belongs to exactly one area, +so there's no way to represent "do X in area A, verify behavior in area B." +Agent-native apps break at boundaries -- state contamination, stale context +carry-over, filter pollution across surfaces -- and the current structure +can't test any of them. + +**2. Multi-cause symptoms produce ambiguous probe results.** BUG003 (y2k +intersection empty) and UX010 (search bar contamination) both produce 0 +results on y2k queries. The existing probe tests the symptom, not the +cause. When it fails, you can't tell which bug you're looking at. Fixing +either bug confidently requires isolated probes that control for the other +variable. + +**3. Browser connection degrades after ~18 MCP calls.** Run 6: 90s timing +spike. Run 9: 3 disconnects all after call #18+. The skill tracks and +reports this pattern (C4 disconnect tracking) but doesn't prevent it. +Reactive recovery (wait 3s, retry) costs more time than proactive +prevention. + +## Changes + +### X1. Cross-Area Probe Table + +**Files:** `test-file-template.md`, `probes.md`, `SKILL.md` +**Problem:** No way to represent probes that span two areas +**Fix:** Scenario-level probe table with trigger area + observation area + +#### X1a. Test File Schema Addition + +Add `## Cross-Area Probes` section to the test file template, positioned +after `## Area Details` and before `## Area Trends`. This is scenario-level +-- one table for the whole test file, not per-area. 
+ +```markdown +## Cross-Area Probes + + + +| Trigger Area | Action | Observation Area | Verify | Status | Priority | Confidence | Generated From | Run History | +|-------------|--------|-----------------|--------|--------|----------|------------|---------------|-------------| +``` + +**Column definitions:** + +- `Trigger Area`: The area where the initial action happens (e.g., + `browse/product-grid`) +- `Action`: What to do in the trigger area (e.g., "search 'dresses' + via search bar") +- `Observation Area`: The area where the effect is verified (e.g., + `agent/filter-via-chat`) +- `Verify`: What to check in the observation area (e.g., "agent chat + responds to follow-up without stale category filter from search bar") +- Status through Run History: Same as per-area probes -- uses the + existing probe lifecycle (untested/passing/failing/flaky/graduated), + confidence field, escalation at 3 failures, graduation at 2 passes + +**Dedup key:** `trigger_area + observation_area + verify text` (same +70% word-overlap rule as per-area probes, extended to the area pair). + +#### X1b. Execution Slot in Phase 3 + +Cross-area probes run BEFORE per-area testing. They need both areas +accessible in sequence, which doesn't fit the area-by-area Phase 3 +flow. Running them first also informs how you interpret per-area +scores -- if search bar -> chat contamination fails, agent/filter-via-chat +scores may be polluted. + +**Add to SKILL.md Phase 3 (slim pointer, detail in probes.md):** + +```markdown +### Cross-Area Probes (Before Per-Area Testing) + +Execute cross-area probes before per-area testing -- they test state +carry-over between areas and inform per-area score interpretation. +Results do NOT affect per-area scores. See [probes.md](./references/probes.md). +``` + +**Delta:** +4 lines in SKILL.md (after mitigation B). + +#### X1c. Lifecycle Rules in probes.md + +Cross-area probes use the existing probe lifecycle with two additions. 
+Add a new section after the Multi-Run Mode section: + +```markdown +## Cross-Area Probes + +Cross-area probes test interactions that span two areas -- where an +action in one area affects state in another. They live in a scenario- +level table (not per-area) and run before per-area testing in Phase 3. + +### Lifecycle + +Cross-area probes follow the same lifecycle as per-area probes: +- Status transitions: untested -> passing/failing -> flaky/graduated +- Escalation: 3+ consecutive failures -> auto-file to bugs.md +- Graduation: 2+ consecutive passes -> eligible for CLI graduation + (only if BOTH areas have CLI coverage) +- Confidence field: same defaults and update rules as per-area + +### Generation Triggers + +Cross-area probes are generated when: +- A per-area probe fails AND the failure symptom could be caused by + state from another area (agent judgment -- look for stale filters, + carry-over context, shared state) +- The novelty budget discovers a cross-area interaction worth tracking +- Orientation (code reading) identifies a state ownership boundary + that crosses two areas +- The user explicitly requests a cross-area probe + +Cross-area probes are NOT generated automatically from every per-area +failure. The agent must identify a plausible cross-area cause before +generating one. This keeps the table focused on genuine seam tests, +not duplicates of per-area probes. + +### Execution + +1. Navigate to trigger area +2. Perform action (do NOT reset between trigger and observation) +3. Navigate to observation area +4. Run verify check +5. Record result + +The "no reset" between steps 2 and 3 is the critical difference from +per-area probes. The whole point is testing state carry-over. If you +reset between areas, you're testing two independent areas, not a seam. 
+ +### Report Section + +Cross-area probe results appear in their own report section, between +the header and NEEDS ACTION: + +``` +Cross-Area Probes: +| Trigger -> Observation | Action | Status | Detail | +|-----------------------|--------|--------|--------| +| browse/product-grid -> agent/filter-via-chat | search "dresses" via search bar | failing | agent chat shows stale "Dresses" filter on follow-up | +``` + +### Dedup + +Key: `trigger_area + observation_area + verify text`. Same 70% +word-overlap rule as per-area probes, applied to the area pair. +A probe from A->B and a probe from B->A are different probes (different +causal direction). + +### Bug Filing + +When a cross-area probe escalates (3+ consecutive failures), the bug +entry in bugs.md lists the trigger area as primary and the observation +area in the summary: "Also affects: ". This matches +the existing multi-area bug format in bugs-registry.md. + +### Spot-Check Budget + +Passing cross-area probes are spot-checked -- execute at most 3 passing +probes per run (selected randomly). Failing and untested cross-area +probes always execute. This bounds the front-load: a stable test file +with 5 passing cross-area probes spot-checks 3, not all 5. + +### Progressive Narrowing Interaction + +Progressive narrowing classifications (SKIP/PROBES-ONLY/FULL) apply to +per-area testing only. Cross-area probes execute in their own slot +regardless of the trigger or observation area's narrowing classification. +An area classified SKIP for per-area testing can still be a trigger or +observation target for cross-area probes. + +### Cap + +Maximum 10 active cross-area probes per test file. Cross-area probes +are more expensive than per-area (two navigation steps, no reset). If +the table exceeds 10 active entries, the oldest passing probes rotate +out first (same as per-area rotation). 
+ +### Proactive Restart Interaction + +Cross-area probes must NOT be interrupted by a proactive restart -- +they depend on state carry-over between trigger and observation areas. +The restart check is skipped during cross-area probe execution. The +MCP call counter still increments; the restart happens after the +cross-area probe sequence completes. +``` + +**Delta:** +72 lines in probes.md. + +#### X1d. Test File Template Update + +Add the `## Cross-Area Probes` section to test-file-template.md in the +template block, after `## Area Details` closing and before `## Area Trends`: + +```markdown +## Cross-Area Probes + + + +| Trigger Area | Action | Observation Area | Verify | Status | Priority | Confidence | Generated From | Run History | +|-------------|--------|-----------------|--------|--------|----------|------------|---------------|-------------| +``` + +Add to schema migration section: + +```markdown +**v6 -> v7 changes:** +- New section: `## Cross-Area Probes` (scenario-level probe table for + interactions spanning two areas) +- Probe generation: `related_bug` field for isolation probes +- Test file frontmatter: optional `mcp_restart_threshold` field + +**Reading v6 files:** Treat missing `## Cross-Area Probes` section as +empty table. Do NOT rewrite on read. +``` + +**Delta:** +12 lines in test-file-template.md. + +#### X1e. .user-test-last-run.json Schema + +Add `cross_area_probes_run` field alongside existing `probes_run`: + +```json +"cross_area_probes_run": [ + { + "trigger_area": "browse/product-grid", + "action": "search 'dresses' via search bar", + "observation_area": "agent/filter-via-chat", + "verify": "agent chat responds without stale category filter", + "status": "failing", + "result_detail": "agent showed stale Dresses filter on follow-up" + } +] +``` + +**Delta:** 0 SKILL.md lines (documented in reference files only, same +pattern as existing schema additions). + +--- + +### X2. 
Probe Isolation Guidance + +**File:** `probes.md` Probe Generation section +**Problem:** Single probe tests symptom with multiple possible causes +**Fix:** Guidance for generating cause-isolated probes with `related_bug` + +```markdown +### Multi-Cause Isolation + +When a probe targets a symptom that could have multiple causes (e.g., +two open bugs producing the same "0 results" failure), generate separate +probes per hypothesized cause. Each probe's setup must isolate the +variable being tested: + +**Pattern:** + +Symptom: y2k accessories returns 0 results +Cause A: empty data intersection (BUG003) +Cause B: search bar state contamination (UX010) + +Isolated probe A: + Setup: fresh session (no prior search bar usage) + Query: "y2k accessories" + Verify: "results include y2k-tagged items -- tests data coverage + independent of search bar state" + related_bug: BUG003 + +Isolated probe B (cross-area): + Trigger: browse/product-grid -- search "dresses" via search bar + Observation: agent/filter-via-chat -- ask for "y2k accessories" + Verify: "agent clears stale category filter before applying y2k" + related_bug: UX010 + +**`related_bug` field:** Optional field on any probe (per-area or +cross-area) linking the probe to a specific bug ID. When the probe +passes, it provides evidence that the linked bug is fixed. When it +fails, it confirms the linked bug is still active. Multiple probes +can reference the same bug -- each tests the bug from a different +angle. 
+ +**When to isolate:** The agent should consider isolation when: +- A probe has `escalated_to` linking to a bug, AND another open bug + affects the same area or a related area +- A failing probe's `result_detail` is ambiguous ("0 results" without + specifying whether the data is missing or the query is wrong) +- Two bugs in bugs.md have overlapping area slugs + +**When NOT to isolate:** If only one bug exists for the symptom, or +if the causes are clearly distinguishable from the probe result alone, +isolation adds complexity without value. Single probes are preferred +when the cause is unambiguous. + +**Bug lifecycle interaction:** When a bug is marked `fixed` in commit +mode, the agent should note whether probes with `related_bug` pointing +to that bug are passing or failing. If the bug is fixed but its related +probes fail, note the discrepancy in the report: "BUG003 marked fixed +but related probe still failing -- investigate." This keeps `related_bug` +informational while giving it a concrete use during the bug lifecycle. +``` + +**Delta:** +35 lines in probes.md. + +--- + +### X3. Proactive Browser Restart + +**Files:** `SKILL.md` (pointer), `references/connection-resilience.md` (NEW), `browser-input-patterns.md` +**Problem:** Connection degrades after ~18 MCP calls, reactive recovery +costs more than prevention +**Fix:** Proactive page reload at configurable threshold + +#### X3a. Connection Resilience Reference File (NEW) + +Create `references/connection-resilience.md`: + +```markdown +# Connection Resilience + +## Reactive (On Failure) + +1. After any MCP tool failure: wait 3 seconds (`Bash: sleep 3`) +2. Retry the call once +3. If retry fails: display "Extension disconnected. Run `/chrome` and + select Reconnect extension" +4. Track `disconnect_counter` for the session +5. If `disconnect_counter >= 3`: abort with "Extension connection + unstable. Check Chrome extension status and restart the session." + +## Proactive (Prevent Degradation) + +6. 
Track `mcp_call_counter` for the session (increments on every + successful MCP tool call) +7. When `mcp_call_counter` reaches `mcp_restart_threshold` (default: 15, + configurable in test file frontmatter): navigate to the app entry URL + (full page reload). Reset `mcp_call_counter` to 0. Log: "Proactive + restart at call #N to prevent connection degradation." +8. The restart happens between areas, not mid-area. If the threshold is + reached during an area, finish the current area first, then restart + before the next area. +9. In iterate mode, the between-run reset counts as a restart. Reset + `mcp_call_counter` at each between-run page reload. + +## Disconnect Pattern Tracking + +When `disconnect_counter` increments, record the context: which MCP tool +was called, which area was being tested, and the session MCP call count. + +At run end, if `disconnect_counter >= 3`, append a disconnect analysis +to the SIGNALS section of the report. +``` + +#### X3b. SKILL.md Connection Resilience Pointer + +Replace current Connection Resilience section with a slim pointer: + +```markdown +### Connection Resilience + +See [connection-resilience.md](./references/connection-resilience.md) for +reactive recovery, proactive restart at configurable MCP call threshold, +and disconnect tracking rules. +``` + +**Delta:** Replaces 7 lines with 3 lines = -4 lines in SKILL.md. + +#### X3c. Frontmatter Addition + +Add `mcp_restart_threshold` to test-file-template.md frontmatter: + +```yaml +mcp_restart_threshold: 15 # optional, proactive page reload after N MCP calls +``` + +**Delta:** +1 line in test-file-template.md. + +#### X3d. browser-input-patterns.md Note + +Add after Modal Dialog Handling: + +```markdown +## Proactive Restart + +Sustained MCP tool usage degrades browser extension connections. The +skill proactively restarts (full page reload to app entry URL) after +a configurable number of MCP calls -- see Connection Resilience in +SKILL.md. 
+ +**What a restart clears:** +- Extension message channel state +- In-memory JavaScript variables +- Pending network requests + +**What a restart does NOT clear:** +- Cookies and session storage (login state preserved) +- IndexedDB data +- Service worker caches + +**Timing:** Restarts happen between areas. If a restart is triggered +mid-area, the current area completes first. The next area starts with +a fresh page load. + +**Impact on cross-area probes:** Cross-area probes must NOT be +interrupted by a proactive restart -- they depend on state carry-over +between trigger and observation areas. The restart check is skipped +during cross-area probe execution. The counter still increments. +``` + +**Delta:** +18 lines in browser-input-patterns.md. + +--- + +## Design Decisions + +### D1. Cross-area probes run BEFORE per-area testing + +Running cross-area probes first provides context for per-area scoring. +If search bar -> chat contamination fails, the agent knows that +agent/filter-via-chat results may be unreliable. This changes the +interpretation of per-area scores ("UX 4 on filter-via-chat, but +cross-area contamination probe failing -- score may be inflated in +clean sessions"). + +The alternative -- running after per-area testing -- means per-area +scores are computed without this context. Running before is more +informative. + +### D2. Cross-area probes do NOT affect per-area scores + +A failing cross-area probe means the seam between two areas is broken. +It doesn't mean either individual area is broken in isolation. Mixing +cross-area results into per-area scores would pollute maturity tracking +and make it impossible to determine whether an area is individually +healthy. + +Cross-area probes have their own lifecycle. They can escalate to bugs +independently. The bug references both areas. + +### D3. No reset between trigger and observation + +This is the defining characteristic of cross-area probes. 
A per-area +probe with "navigate to area A, then navigate to area B" and a reset +between them is just two per-area probes. The cross-area probe's value +is testing what happens when state carries over -- stale filters, polluted +context, shared session state. + +### D4. 10 active cross-area probe cap + +Cross-area probes are expensive -- two navigations, no reset, harder to +debug when they fail. 10 is enough for a test file with 7-10 areas +(testing the most important seams). If more seams need testing, that's +a signal the app has too many state-sharing boundaries, which is itself +a finding worth reporting. + +### D5. Proactive restart skips during cross-area execution + +A proactive restart between the trigger and observation steps of a +cross-area probe would clear the exact state the probe is testing. +The restart check is suppressed during cross-area probe execution. +The MCP call counter still increments -- the restart happens after +the cross-area probe sequence completes. + +### D6. Probe isolation is guidance, not automation + +The skill cannot automatically determine that two bugs produce the same +symptom. The agent applies judgment: when a probe fails and the failure +could have multiple causes, generate isolated probes. This is documented +in the generation section as a pattern to follow, not a rule to enforce. +Automated isolation would require causal reasoning the agent doesn't +reliably have. + +### D7. `related_bug` is optional and informational + +The `related_bug` field links a probe to a bug for human/agent +comprehension. It does NOT change probe behavior -- a probe with +`related_bug: BUG003` follows the same lifecycle as any other probe. +The field provides traceability: when reviewing bugs.md, you can see +which probes are testing which bugs. When a bug is marked fixed, +you can check whether its related probes are passing. 
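As a rough illustration of the traceability that `related_bug` provides, the check "bug marked fixed but related probe still failing" could be sketched as below. This is a minimal sketch, not part of the skill: the `Probe` record, the `find_discrepancies` helper, and the sample probe texts are hypothetical, and probes are assumed to have been parsed out of their markdown tables already. Only the `related_bug` field, the status names, and the report wording come from this plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    # Hypothetical in-memory record; the real probes live in markdown tables.
    verify: str
    status: str  # e.g. "untested", "passing", "failing", "flaky", "graduated"
    related_bug: Optional[str] = None  # optional link to a bug ID


def find_discrepancies(bug_status: dict[str, str], probes: list[Probe]) -> list[str]:
    """Return report notes for bugs marked fixed whose related probes still fail."""
    notes = []
    for probe in probes:
        if probe.related_bug is None:
            continue  # the field is optional, so unlinked probes are ignored
        if bug_status.get(probe.related_bug) == "fixed" and probe.status == "failing":
            notes.append(
                f"{probe.related_bug} marked fixed but related probe still failing -- investigate"
            )
    return notes


# Illustrative data only.
probes = [
    Probe("results include y2k-tagged items", "failing", related_bug="BUG003"),
    Probe("agent clears stale category filter", "passing", related_bug="UX010"),
    Probe("unlinked probe", "failing"),
]
notes = find_discrepancies({"BUG003": "fixed", "UX010": "open"}, probes)
print(notes)
```

The helper only reads probe state and never changes it, which keeps `related_bug` informational as D7 requires.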
+ +--- + +## Line Budget + +| File | Baseline | Delta | After | Notes | +|------|----------|-------|-------|-------| +| SKILL.md | 420 | +4 (X1b) -4 (X3b) | 420 | At ceiling | +| probes.md | ~283 | +72 (X1c) +35 (X2) | ~390 | | +| test-file-template.md | ~516 | +12 (X1d) +1 (X3c) | ~529 | | +| browser-input-patterns.md | ~121 | +18 (X3d) | ~139 | | +| connection-resilience.md | NEW | +30 (X3a) | ~30 | Extracted from SKILL.md | +| **Total** | | **+168** | | | + +**SKILL.md stays at 420.** Cross-area pointer (+4) offset by connection +resilience extraction (-4). Net zero. + +--- + +## Implementation Phases + +### Phase 1: Schema (no behavior change) + +- [x] Update `references/test-file-template.md` -- cross-area probe table, + v7 migration notes, `mcp_restart_threshold` frontmatter, `related_bug` + field documentation +- [x] Update `references/probes.md` -- cross-area probe lifecycle, generation + triggers, execution, report section, dedup, bug filing, cap, restart + interaction + +### Phase 2: Probe Isolation + +- [x] Update `references/probes.md` -- multi-cause isolation guidance, + `related_bug` field, isolation pattern example, when to/not to isolate + +### Phase 3: Proactive Browser Restart + +- [x] Create `references/connection-resilience.md` -- reactive + proactive + rules, disconnect tracking +- [x] Update `SKILL.md` -- replace Connection Resilience with 3-line pointer +- [x] Update `references/browser-input-patterns.md` -- proactive restart + section (clears/preserves, timing, cross-area interaction) + +### Phase 4: Cross-Area Execution + +- [x] Update `SKILL.md` Phase 3 -- add cross-area probes pointer (4 lines) +- [x] Update `.user-test-last-run.json` schema -- `cross_area_probes_run` + documented in probes.md cross-area section (0 SKILL.md lines per X1e) + +### Phase 5: Version Bump & Validation + +- [x] Bump version in `plugin.json` and `marketplace.json` (2.48.0 -> 2.49.0) +- [x] Update `CHANGELOG.md` with v7 schema changes +- [x] Line-count 
checkpoint: SKILL.md = 420 lines +- [x] Install locally to `~/.claude/skills/user-test/` +- [x] Verify: v6 test files read correctly (missing cross-area section = empty) +- [x] Verify: cross-area probe execution order (before per-area) +- [x] Verify: proactive restart fires between areas, not mid-area +- [x] Verify: restart skipped during cross-area probe execution + +--- + +## Acceptance Criteria + +### X1: Cross-Area Probes +- [ ] `## Cross-Area Probes` section in test file template +- [ ] Table schema: Trigger Area, Action, Observation Area, Verify, + Status, Priority, Confidence, Generated From, Run History +- [ ] Execution slot: before per-area testing in Phase 3 +- [ ] No reset between trigger action and observation verify +- [ ] Results in separate report section (not mixed into per-area table) +- [ ] Same lifecycle as per-area probes (escalation, graduation, confidence) +- [ ] Graduation requires both areas to have CLI coverage +- [ ] Dedup key: trigger_area + observation_area + verify text +- [ ] Bug filing: trigger area as primary, observation area in summary +- [ ] Cap: 10 active cross-area probes per test file +- [ ] Spot-check budget: max 3 passing probes per run, failing/untested always execute +- [ ] Progressive narrowing: cross-area probes ignore SKIP/PROBES-ONLY classification +- [ ] `cross_area_probes_run` in .user-test-last-run.json +- [ ] v6 -> v7 migration: missing section treated as empty table + +### X2: Probe Isolation +- [ ] Multi-cause isolation pattern documented in probes.md +- [ ] `related_bug` field documented (optional, on any probe) +- [ ] Isolation example shows per-area + cross-area probe pair +- [ ] "When to isolate" checklist (multiple bugs, ambiguous detail, + overlapping areas) +- [ ] "When NOT to isolate" guidance (single cause, unambiguous result) +- [ ] Bug lifecycle interaction: agent notes related_bug probe status when bug marked fixed + +### X3: Proactive Browser Restart +- [ ] `mcp_call_counter` tracked per session +- [ 
] Proactive restart at `mcp_restart_threshold` (default 15) +- [ ] Threshold configurable in test file frontmatter +- [ ] Restart happens between areas, not mid-area +- [ ] Restart skipped during cross-area probe execution +- [ ] `mcp_call_counter` reset on between-run page reload (iterate mode) +- [ ] Restart logged in report: "Proactive restart at call #N" +- [ ] browser-input-patterns.md documents what restart clears/preserves +- [ ] Connection resilience extracted to reference file (SKILL.md budget) + +### Schema & Migration +- [ ] Schema version: v6 -> v7 +- [ ] Cross-Area Probes section additive (missing = empty table) +- [ ] `related_bug` field additive (missing = no linked bug) +- [ ] `mcp_restart_threshold` additive (missing = default 15) +- [ ] Forward compatibility: v6 skill reads v7 files safely +- [ ] SKILL.md <= 420 lines after all changes + +--- + +## Verification: Would This Have Caught the Real Bugs? + +| Bug | Without this plan | With this plan | +|-----|-------------------|----------------| +| Search bar -> chat contamination (UX010) | Not testable -- no area owns the seam | Cross-area probe: trigger `browse/product-grid` search, observe `agent/filter-via-chat` state | +| y2k + contamination tangled (BUG003 + UX010) | Single probe fails ambiguously | Two isolated probes: fresh-session y2k (per-area, related_bug: BUG003) + contamination path (cross-area, related_bug: UX010) | +| 3 disconnects after call #18 | Tracked and reported, not prevented | Proactive restart at call #15 prevents degradation | + +--- + +## Sources + +### Internal References +- Current probe lifecycle: `probes.md` +- Current probe generation: `probes.md` (generation triggers section) +- Current connection resilience: `SKILL.md` (Phase 3) +- Current report output: `SKILL.md` (Phase 4) +- Current test file template: `test-file-template.md` +- Cross-area bug format: `bugs-registry.md` +- Multi-area bug filing: `bugs-registry.md` + +### Institutional Learnings Applied +- 
**Agent-guided state transitions** (`docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md`): Cross-area probe generation uses agent judgment, not automated rules. The agent must identify plausible cross-area cause before generating. +- **Line budget enforcement** (`docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md`): Connection resilience extracted to reference file. Cross-area execution uses slim pointer. SKILL.md stays at 420. +- **Plugin versioning** (`docs/solutions/plugin-versioning-requirements.md`): MINOR version bump (2.48.0 -> 2.49.0) for new schema version. diff --git a/docs/plans/2026-03-03-feat-multi-area-journey-testing-plan.md b/docs/plans/2026-03-03-feat-multi-area-journey-testing-plan.md new file mode 100644 index 000000000..70e09e655 --- /dev/null +++ b/docs/plans/2026-03-03-feat-multi-area-journey-testing-plan.md @@ -0,0 +1,565 @@ +--- +title: "feat: Multi-Area Journey Testing" +type: feat +status: completed +date: 2026-03-03 +schema_version_target: 9 +plugin_version_target: 2.51.0 +--- + +# feat: Multi-Area Journey Testing + +## Problem Statement + +The skill tests areas in isolation. Every run is a series of independent +spot-checks: test area A, reset, test area B, reset, test area C. Real +users don't reset between actions. They search for something, filter the +results, click a product, add it to cart, go back, search again. State +accumulates across every transition. + +Cross-area probes (v2.49.0) partially address this -- they test state +carry-over between two specific areas (trigger -> observation). But a +two-area probe is a seam test, not a journey. A bug that only manifests +after a 4-step sequence (search -> filter -> detail -> back -> filter +state stale) would not be caught by any two-area probe, because the +staleness requires the intermediate steps to accumulate. 
+ +The skill needs multi-step journeys executed without resets, where state +accumulates naturally and verification happens at checkpoints along the +way -- not just at the end. + +## How Journeys Differ From Existing Constructs + +| Construct | Scope | Reset | Tests | +|-----------|-------|-------|-------| +| Per-area probe | 1 area | N/A | Specific claim within an area | +| Cross-area probe | 2 areas | No reset | State carry-over at one seam | +| Multi-turn sequence | 1 area, N turns | No reset | Conversational context retention | +| **Journey** | **3+ areas** | **No reset** | **Accumulated state across a full user flow** | + +Journeys are a third testing layer alongside per-area and cross-area +probes. They catch bugs requiring accumulated state -- invisible to +isolated testing. + +## Design + +### Journey Definition + +A journey is a sequence of 3-8 steps across different areas, executed +without resets, with checkpoints verifying state at intermediate points. + +**Schema in test file (new `## Journeys` section):** + +```markdown +## Journeys + + + +### J001: Primary user flow + +**Steps:** + +| Step | Area | Action | Checkpoint | +|------|------|--------|-----------| +| 1 | | | | +| 2 | | | | +| 3 | | | | +| 4 | | | | +| 5 | | | | + +**Status:** untested +**Last Run:** --- +**Run History:** --- +**Generated From:** manual (initial scenario definition) +``` + +**Column definitions:** + +- **Step:** Execution order (1, 2, 3...). Positional index, not a stable ID. +- **Area:** Which area this step operates in (area slug from ## Areas). +- **Action:** What to do (natural language, same as probe queries). +- **Checkpoint:** What to verify at THIS step before proceeding. A + checkpoint failure at step 3 means the journey failed at step 3, + not just "failed." Use `---` to skip verification (sparingly). 
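The step table above can be read as an ordered list of records. The following is a minimal sketch under stated assumptions: `JourneyStep`, `parse_step_row`, and `validate_journey` are hypothetical helpers that do not exist in the skill, and the `browse/product-detail` slug is invented for the example. The 3-8 step bound, the 3+ area scope, and the `---` skip marker are taken from this plan; how the skill would enforce them is an assumption.

```python
from dataclasses import dataclass


@dataclass
class JourneyStep:
    step: int        # positional index, not a stable ID
    area: str        # area slug
    action: str      # natural-language action
    checkpoint: str  # what to verify, or "---" to skip verification

    @property
    def has_checkpoint(self) -> bool:
        return self.checkpoint.strip() != "---"


def parse_step_row(row: str) -> JourneyStep:
    """Parse one markdown table row such as '| 1 | area | action | check |'."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != 4:
        raise ValueError(f"expected 4 cells, got {len(cells)}: {row!r}")
    return JourneyStep(int(cells[0]), cells[1], cells[2], cells[3])


def validate_journey(steps: list[JourneyStep]) -> list[str]:
    """Return a list of problems; an empty list means the definition looks valid."""
    problems = []
    if not 3 <= len(steps) <= 8:
        problems.append(f"expected 3-8 steps, got {len(steps)}")
    if [s.step for s in steps] != list(range(1, len(steps) + 1)):
        problems.append("steps must be numbered 1..N in order")
    if len({s.area for s in steps}) < 3:
        problems.append("journey should span 3+ areas")
    return problems


rows = [
    "| 1 | browse/product-grid | search 'dresses' via search bar | results include matching items |",
    "| 2 | agent/filter-via-chat | ask for 'y2k accessories' | --- |",
    "| 3 | browse/product-detail | open first result | details match listing |",
]
steps = [parse_step_row(row) for row in rows]
print(validate_journey(steps))
```

The parser does not handle cells that themselves contain a pipe character; a real implementation would need to decide how to escape them.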
+ +**Journey-level fields:** + +- **Status:** `untested` / `passing` / `failing-at-N` / `flaky` / `stable` +- **Last Run:** Date of last execution +- **Run History:** Compact pass/fail (e.g., `P P F:3 P F:5 P`). Failures + include step number after colon for escalation tracking. The colon + delimiter avoids ambiguity with count-based formats (F:3 = "failed at + step 3", not "failed 3 times"). +- **Generated From:** `manual`, `orientation`, `cross-area-escalation`, + `weakness-class-synthesis` +- **on_failure:** `abort` (default) or `continue` (opt-in, per-journey) + +### Checkpoint Types + +| Type | Example | How to check | +|------|---------|-------------| +| Result state | "Results include matching items" | javascript_tool read of first 3 results | +| Count change | "Counter increments by 1" | Read element, compare to pre-action value | +| Element present | "Details match listing" | Check 2-3 attributes match between views | +| State clean | "No stale filters from prior steps" | Read active state, verify none from prior steps | +| No check | `---` | Skip verification at this step (use sparingly) | + +Checkpoints are 1 MCP call each (batched `javascript_tool`). A 5-step +journey = ~10 MCP calls (5 actions + 5 checkpoint reads). This is +separate from the per-area MCP budget -- journey steps do NOT consume +per-area call budgets. + +### Execution Slot + +``` +Phase 3 execution order: + 1. Cross-area probes (seam tests) + 2. Journeys (accumulated state tests) <-- NEW + 3. Per-area testing (isolated area tests) +``` + +Journeys run after cross-area probes because cross-area results inform +whether a journey's seams are already known broken. Journeys run before +per-area testing because journey failures provide context for per-area +exploration (e.g., "area-X has state management issues after navigation"). + +**Inter-journey reset:** Navigate to the app's entry URL between +journeys. Each journey starts from a clean navigation state. 
Journeys +are independent of each other and can be authored without considering +execution order. (Within a journey, no resets between steps.) + +**Execution order when multiple journeys exist:** +1. `failing-at-N` journeys first (highest signal value) +2. `untested` journeys second +3. `flaky` journeys third +4. `passing` journeys fourth +5. `stable` journeys last (and only every other run) + +### Journey Lifecycle + +``` +untested -> [run] -> passing / failing-at-N + | | + [5+ consecutive] [mixed steps across 3+ runs] + | | + stable flaky + (every other run) + | + [3+ consecutive SAME step] + | + escalate to bugs.md + (as multi-area bug) +``` + +**Status definitions:** + +| Status | Meaning | +|--------|---------| +| `untested` | Defined, not yet run | +| `passing` | All checkpoints passed on last run | +| `failing-at-N` | Failed at step N specifically | +| `flaky` | Fails at different steps across 3+ runs | +| `stable` | Passing 5+ consecutive runs | + +**`failing-at-N`** is the key innovation. Step 2 failure = the individual +area is broken (per-area testing would catch this). Step 5 failure after +steps 1-4 passed = accumulated state bug (the journey's unique value). + +**`flaky`:** Failing at step 3, then step 5, then step 3 = different +causes. Status becomes `flaky`. The consecutive-same-step counter resets +on each step change. Flaky is not inherently bad -- it means the journey +has multiple fragile points worth investigating. + +**Escalation:** Journey failing at the SAME step for 3+ consecutive runs +auto-escalates to bugs.md. Bug entry format: + +``` +| ID | Area | Summary | Journey | +|... | | Journey fails at step N: | J001 (steps 1-N context: ) | +``` + +The failing step's area is primary. Preceding areas provide context. + +**Stable frequency:** `stable` journeys run every other run (derived +from Run History length -- odd run count = run, even = skip). 
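As a rough sketch of how the Run History string could drive the thresholds above (5+ consecutive passes for `stable`, 3+ consecutive same-step failures for escalation), assuming space-separated `P` / `F:N` tokens as in the example `P P F:3 P F:5 P`. The function names are hypothetical, and the `flaky` rule and the every-other-run skip are deliberately left out.

```python
from typing import Dict, List, Optional


def parse_run_history(history: str) -> List[Optional[int]]:
    """One entry per run: None for a pass, the failing step number for a failure."""
    if history.strip() in ("", "---"):
        return []  # untested journey, no runs recorded yet
    runs: List[Optional[int]] = []
    for token in history.split():
        if token == "P":
            runs.append(None)
        elif token.startswith("F:"):
            # "F:3" -> 3. If several steps are listed ("F:2,5"), keep the first.
            runs.append(int(token[2:].split(",")[0]))
        else:
            raise ValueError(f"unrecognized Run History token: {token!r}")
    return runs


def trailing_passes(runs: List[Optional[int]]) -> int:
    """Number of consecutive passes at the end of the history."""
    count = 0
    for run in reversed(runs):
        if run is not None:
            break
        count += 1
    return count


def trailing_same_step_failures(runs: List[Optional[int]]) -> int:
    """Number of consecutive failures at the same step at the end of the history."""
    if not runs or runs[-1] is None:
        return 0
    step = runs[-1]
    count = 0
    for run in reversed(runs):
        if run != step:
            break  # a pass or a different step resets the counter
        count += 1
    return count


def lifecycle_flags(history: str) -> Dict[str, bool]:
    runs = parse_run_history(history)
    return {
        "stable": trailing_passes(runs) >= 5,
        "escalate": trailing_same_step_failures(runs) >= 3,
    }
```

With this reading, `F:3 F:5 F:3` does not escalate, because the step change resets the same-step counter.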
+ +**Stable revert:** When a stable journey fails, set status to +`failing-at-N` (not `passing`). The stable consecutive counter resets. +Journey runs every time again until re-stabilized. + +### Checkpoint Failure: Abort vs. Continue + +**Abort (default):** Stop at failing step. Record `failing-at-N`. +Remaining steps not executed. Correct for most failures -- if step 3 +state is wrong, step 4 on wrong state is unpredictable. + +**Continue (opt-in):** Add `on_failure: continue` to journey definition. +Log each checkpoint failure but execute all remaining steps. Useful when +steps test independent state dimensions. + +**Continue-mode status:** When multiple checkpoints fail, status is +`failing-at-N` where N = the FIRST failing step. Run History records +all failing steps: `F:2,5` (failed at steps 2 and 5). Escalation uses +the first failing step only -- subsequent failures may be cascading +effects. + +### Definition Change Detection + +When commit mode reads the existing journey to update status, it +compares the current step count and area slugs against the stored +values. If either changed (steps added/removed/reordered, area slugs +changed), reset status to `untested` and clear Run History. + +Detection key: `<step-count>:<area-slug-1>,<area-slug-2>,...` + +This is conservative but prevents stale `failing-at-3` pointing at a +step that no longer exists or has moved. + +### Known-Bug Area Interaction + +Journey steps execute regardless of an area's Known-bug status. Rationale: +the journey tests accumulated state across the full sequence, not the +individual area. A Known-bug area may behave differently in a journey +context than in isolation. If a Known-bug area causes a journey checkpoint +to fail, the journey records `failing-at-N` normally -- this is useful +signal (confirms the bug affects multi-area flows, not just isolated use). + +Journey failures involving Known-bug areas do NOT auto-escalate to +bugs.md (the bug is already filed). 
Escalation is suppressed when the +failing step's area has an active Known-bug entry. + +### Generation + +**1. Manual definition (primary).** User writes journeys for real user +flows. Skill prompts on first run if none defined. If orientation (source +2) generated journey suggestions this run, present those suggestions AS +the first-run prompt rather than asking for manual definition from scratch: + +> "Based on code reading, I found these state boundaries crossing 3+ +> areas. Here's a suggested journey: [steps]. Would you like to use +> this, modify it, or define your own?" + +If no orientation results exist, fall back to the generic prompt: + +> "No journeys defined. Journeys test multi-area flows without resets. +> Define 1-2 journeys based on your app's primary user flows? (y/n)" + +If yes, agent suggests steps from the area map. If no, skip. + +**2. Orientation.** Code reading identifies state boundaries crossing +3+ areas -> journey hypothesis. Orientation completes before the +first-run prompt so its findings can be incorporated into suggestions. + +**3. Cross-area probe escalation.** 2+ cross-area probes pass individually +but per-area issues persist -> suggest journey covering all affected areas. + +**4. Weakness-class synthesis.** Weakness class spans 3+ areas -> suggest +journey probing state transitions across affected areas. + +Sources 2-4 generate **suggestions requiring user confirmation**. Journeys +are expensive; auto-generation without confirmation wastes budget. + +### Journey Budget + +- **Max 5 active journeys** per test file +- **3-8 steps** per journey. If a flow exceeds 8 steps, split into two + overlapping journeys (1-6 and 5-10) with shared transition. Splitting + counts against the 5-journey cap. If splitting would exceed the cap, + prefer a single 8-step journey over two overlapping ones. Only split + when the flow genuinely exceeds 8 steps. +- **~2 minutes per journey.** 5 journeys = ~10 minutes maximum. 
+- **Stable skip:** stable journeys run every other run, halving budget + for mature test files. +- **Time pressure:** If session time is tight, run only failing/untested + journeys (same priority as probes). + +### Interaction With Existing Features + +**Proactive restart:** Suppressed during journey execution (same rule as +cross-area probes). MCP counter increments but restart is deferred until +the current journey completes. Counter resets between journeys (each +starts fresh after inter-journey navigation). + +**Progressive narrowing:** Applies to per-area testing only. Journey +steps execute regardless of area narrowing classification (SKIP, +PROBES-ONLY, FULL). A SKIP area can still be a journey step. + +**Cross-area probes:** Complementary. Cross-area probes test 1 seam. +Journeys test accumulated state across 3+ seams. No dedup between them +-- a 2-area cross-area probe and a journey step covering the same seam +test different things (isolation vs. accumulation). + +**Adversarial mode:** Does NOT apply to journey steps. Journey steps +execute the defined action and checkpoint, not the adversarial variant. +Adversarial mode is a per-area testing concern. + +**Per-area MCP budgets:** Journey MCP calls are separate from per-area +budgets. A journey visiting an area does not consume that area's per-area +call budget. Per-area testing runs independently after all journeys. + +**`--no-commit` flag:** Journey results are recorded in +`.user-test-last-run.json` regardless of commit flag. But journey status +in the test file is only updated during commit mode. The `--no-commit` +run does NOT count toward the consecutive failure counter for escalation. + +**Iterate mode:** Each iterate iteration counts as a separate run for +journey Run History. Stable "every other run" applies per iteration. + +**Partial run safety:** If a run is interrupted mid-journey, uncommitted +journey results are discarded. 
Only fully-completed journeys have their +status written during commit mode. Partially-executed journeys retain +their pre-run status. + +### Report Section + +New section in Phase 4 report, after cross-area probes and before +per-area details: + +``` +JOURNEYS +| ID | Name | Status | Failed At | Detail | +|------|------------------------|--------------|-------------------|---------------------------------| +| J001 | Primary user flow | failing-at-5 | | Stale state after navigation | +| J002 | Secondary flow | passing | --- | All 4 checkpoints passed | + +Journey J001 checkpoint detail: + + Step 1: -- + + Step 2: -- + + Step 3: -- + + Step 4: -- + x Step 5: -- STALE state from step 2 +``` + +Checkpoint detail shown for failing/flaky journeys only. Passing +journeys show summary line only. + +**SIGNALS addition:** +``` +~ 1 journey failing: J001 at step 5 () — accumulated state +``` + +**N-run summary:** Add "Journeys stabilized" and "Journeys with +persistent issues" to the N-run summary format. + +### `.user-test-last-run.json` Schema + +```json +"journeys_run": [ + { + "id": "J001", + "name": "Primary user flow", + "status": "failing-at-5", + "on_failure": "abort", + "checkpoints": [ + { "step": 1, "area": "", "passed": true }, + { "step": 2, "area": "", "passed": true }, + { "step": 3, "area": "", "passed": true }, + { "step": 4, "area": "", "passed": true }, + { "step": 5, "area": "", "passed": false, + "detail": "stale state from step 2 still active" } + ], + "time_seconds": 45 + } +] +``` + +### Commit Mode Additions + +Journey commit mode runs after per-area commit mode (step 4 updates +probe tables, step 8 updates queries). Journey updates are a new step: + +1. Update journey **Status**, **Last Run**, **Run History** in test file +2. Auto-escalate at 3+ consecutive same-step failures (→ bugs.md as + multi-area bug). Suppress if failing step's area has active Known-bug. +3. Mark `stable` at 5+ consecutive passes +4. 
Detect definition changes (step count or area slug changes → reset + to `untested`, clear Run History) +5. Journey results do NOT affect per-area maturity scores + +## Design Decisions + +### D1. Journeys are scenario-level, not area-level +Lives in `## Journeys` alongside `## Cross-Area Probes` and `## Areas`. +Not owned by any single area. + +### D2. Checkpoints at every step, not just the end +A journey verifying only at the end is just a long cross-area probe. +Checkpoints pinpoint WHERE state goes wrong. + +### D3. Journey failure does NOT affect per-area scores +Journey failure = accumulated state bug. Per-area score = isolated +area health. Mixing them makes maturity tracking unreliable. + +### D4. failing-at-N is more useful than failing +8-step journey reporting "failing" tells you nothing. "failing-at-5" +tells you steps 1-4 work and the bug is at the step 5 transition. + +### D5. Manual definition is primary +The user knows which flows matter. Auto-generation produces suggestions +requiring confirmation, not automatic entries. + +### D6. Journey steps can revisit areas +Step 1 uses area X. Step 5 uses area X again. The value is testing +whether the area behaves differently after intermediate steps modified +state. + +### D7. Abort is the default on checkpoint failure +Wrong state at step 3 makes step 4 unpredictable. Continue exists as +opt-in for independent state dimensions. + +### D8. Step drift prevents premature escalation +Failing at different steps = different causes = flaky, not a single +consistent bug worth auto-filing. + +### D9. Inter-journey reset to entry URL +Journeys are independent. Each starts from a clean navigation state. +Without this, journey ordering becomes a first-class authoring concern +and journey 2's results depend on journey 1's side effects. + +### D10. Known-bug areas still execute in journeys +Journeys test accumulated state, not individual areas. 
A Known-bug area +in a journey provides useful signal about multi-area impact. But +Known-bug journey failures don't auto-escalate (bug already filed). + +### D11. Journey MCP calls are separate from per-area budgets +Journeys and per-area testing serve different purposes. Sharing budgets +would force trade-offs between journey thoroughness and per-area depth. + +### D12. Definition changes reset to untested +Conservative but safe. Prevents stale `failing-at-3` from pointing at +a step that moved or no longer exists. + +### D13. Continue-mode uses first failing step for status +Multiple checkpoint failures in continue mode may be cascading. The +first failure is the root cause signal. Run History records all failures +for investigation. + +## Line Budget + +| File | Baseline | Delta | After | Notes | +|------|----------|-------|-------|-------| +| SKILL.md | 368 | +5 (pointer + execution slot) -3 (trim) | 370 | Well under 420 ceiling | +| journeys.md | NEW | +65 | 65 | All journey behavioral detail | +| test-file-template.md | 549 | +25 | ~574 | Journey section template + v8→v9 migration | +| last-run-schema.md | 136 | +15 | ~151 | journeys_run schema | +| probes.md | ~490 | +3 | ~493 | Cross-ref to journey escalation | +| Total new content | | ~110 | | | + +SKILL.md stays well under ceiling. All journey behavioral detail lives +in `references/journeys.md`. SKILL.md holds only the execution slot +pointer and commit mode bullet. + +## Schema Changes + +### Test file: v8 -> v9 +- New `## Journeys` section (optional, may be empty) +- Journey entry schema: ID, Name, Steps table, Status, Last Run, + Run History, Generated From, optional on_failure + +### `.user-test-last-run.json` +- New `journeys_run` array field + +### Migration: v8 -> v9 +- Missing `## Journeys` = empty (no journeys defined). Do not create. +- Additive only. v8 files work unchanged. +- Bump `schema_version: 9` on first commit. 
+- Forward compatible: v8 skill reads v9 files safely (preserves + unknown sections). + +## Acceptance Criteria + +### Journey Definition +- [ ] `## Journeys` section in test file template (`test-file-template.md`) +- [ ] Schema: ID, Name, Steps table, Status, Last Run, Run History, + Generated From, optional on_failure +- [ ] Steps table columns: Step / Area / Action / Checkpoint +- [ ] 3-8 steps per journey, max 5 journeys +- [ ] Same area can appear multiple times in a journey + +### Execution +- [ ] Run after cross-area probes, before per-area testing +- [ ] No reset between steps within a journey +- [ ] Inter-journey reset to app entry URL +- [ ] Checkpoint at each step (1 MCP call via batched javascript_tool) +- [ ] Abort on checkpoint failure (default) +- [ ] `on_failure: continue` option (first failing step = status) +- [ ] Proactive restart suppressed during journey execution +- [ ] Progressive narrowing does not affect journey steps +- [ ] Known-bug areas still execute in journey steps +- [ ] Adversarial mode does NOT apply to journey steps +- [ ] Journey MCP calls separate from per-area budgets +- [ ] Execution order: failing > untested > flaky > passing > stable +- [ ] Failing/untested journeys before stable + +### Status & Lifecycle +- [ ] `failing-at-N` records which step failed +- [ ] Step drift across runs → status becomes `flaky` +- [ ] Escalation: same step 3+ consecutive → bugs.md (multi-area bug) +- [ ] Escalation suppressed when failing step area has Known-bug +- [ ] Bug entry: failing step area primary, preceding areas as context +- [ ] `stable`: 5+ consecutive passes, every other run +- [ ] Stable revert on failure → `failing-at-N`, counter resets +- [ ] Definition change detection → reset to `untested` +- [ ] Journey results do NOT affect per-area maturity scores + +### Report +- [ ] JOURNEYS section after cross-area, before per-area details +- [ ] Failing/flaky: full checkpoint detail (+ and x markers) +- [ ] Passing: summary line only +- [ 
] SIGNALS entry for failing journeys +- [ ] `journeys_run` in JSON with per-step checkpoint data +- [ ] N-run summary includes journey stabilization/persistence + +### Generation +- [ ] First-run prompt if no journeys defined +- [ ] Manual primary, auto sources suggest only +- [ ] Suggestions require user confirmation + +### Commit Mode +- [ ] Status + Last Run + Run History updated +- [ ] Auto-escalation at 3+ consecutive same-step failures +- [ ] Stable at 5+ consecutive passes +- [ ] Definition change detection resets status +- [ ] `--no-commit` runs don't count toward escalation +- [ ] Partial runs: only fully-completed journeys written + +### Schema & Migration +- [ ] v8 → v9 additive migration +- [ ] Missing `## Journeys` = empty +- [ ] Forward compatible +- [ ] SKILL.md stays under 420-line ceiling + +## Implementation Order + +All changes ship together as schema v9. + +- [x] 1. **Schema & template** — `## Journeys` section in `test-file-template.md` + v8→v9 migration notes +- [x] 2. **Reference file** — create `references/journeys.md` (lifecycle, budget, execution rules, checkpoint types, generation, interactions) +- [x] 3. **Last-run schema** — add `journeys_run` to `last-run-schema.md` +- [x] 4. **SKILL.md pointer** — Phase 3 execution slot + commit mode bullet + trim +- [x] 5. **Report** — journey results format in Phase 4 (pointer to journeys.md) +- [x] 6. **Commit mode** — status updates, escalation, stable, definition change detection +- [x] 7. **Version bump + install** — plugin.json 2.50.0→2.51.0, CHANGELOG, local install + +## Verification: Would This Have Caught Real Bugs? 
+ +| Bug pattern | Without journeys | With journeys | +|-------------|-----------------|---------------| +| Stale state after multi-step navigation (4+ steps) | Not testable (accumulated state) | `failing-at-N`: pinpoints which step's state leaked | +| State contamination visible only after round-trip | Cross-area probe (2 steps, one seam) | Journey revisits area after 3 intermediate steps | +| Counter/badge wrong after add→remove→add sequence | Per-area test starts clean each time | Journey checkpoints verify at each transition | +| Filter/search state leaking across unrelated flows | Per-area tests pass in isolation | Journey exposes that state persists across areas | + +## Sources + +- Phase 3 execution: `SKILL.md:98-152` +- Cross-area probes: probes.md (lines 322-489), v2.49.0 plan +- Proactive restart: cross-area plan D5, `connection-resilience.md` +- Progressive narrowing: `run-targeting.md` (lines 74-107) +- Weakness-class synthesis: compounding quality plan Change 2 +- Multi-turn sequences: `queries-and-multiturn.md` +- Probe lifecycle: `probes.md` +- Known-bug handling: `bugs-registry.md` +- Schema migration pattern: `test-file-template.md` (lines 168-176) +- Line budget learnings: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md` diff --git a/docs/plans/2026-03-17-feat-user-test-self-eval-loop-plan.md b/docs/plans/2026-03-17-feat-user-test-self-eval-loop-plan.md new file mode 100644 index 000000000..7b9a04279 --- /dev/null +++ b/docs/plans/2026-03-17-feat-user-test-self-eval-loop-plan.md @@ -0,0 +1,383 @@ +--- +title: "feat: Add self-eval loop for user-test skill" +type: feat +status: completed +date: 2026-03-17 +origin: docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md +--- + +# feat: Add self-eval loop for user-test skill + +## Overview + +Add a `/user-test-eval` command that grades the user-test skill's output against 3 binary evals after each run. 
Records scores in `skill-evals.json`, proposes targeted mutations to the skill in `skill-mutations.md`. Auto-triggers after commit mode completes. Goal: fix the testing instrument (the skill itself) before optimizing what it tests (queries). + +## Problem Statement / Motivation + +The user-test skill has three known signal-corrupting failure modes: +1. **Probe execution order** — probes run after exploration instead of before, reducing signal quality +2. **Proven regression conflation** — new bugs in Proven areas treated identically to area demotion +3. **P1 burial** — critical items appear in DETAILS but not NEEDS ACTION + +These are instrument calibration failures. Optimizing queries through a miscalibrated instrument produces noise. The eval loop catches these failures mechanically, proposes fixes, and builds a mutation history artifact. + +(see brainstorm: docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md) + +## Proposed Solution + +### Architecture + +``` +/user-test → Phase 4 → Commit Mode → Auto-trigger → /user-test-eval + ↓ + Read artifacts (JSON + report file) + ↓ + Grade 3 binary evals + ↓ + Write skill-evals.json (scores) + Write skill-mutations.md (proposals) +``` + +Three new components: +1. **`/user-test-eval` command** — thin dispatch to new eval skill +2. **`user-test-eval` skill** — grades from artifacts, proposes mutations +3. **Report file artifact** — rendered report written to file during commit mode (new) + +Plus two schema changes to existing `.user-test-last-run.json`: +- `execution_index` per `probes_run` entry +- `broad_exploration_start_index` per area + +### Prerequisites: Artifact Gaps + +The eval cannot function without two changes to the existing skill: + +**1. Report file artifact (new)** + +Commit mode currently prints the report to stdout only. The eval needs to read the rendered report from a file. 
Add a step to commit mode that writes the rendered report to `tests/user-flows/.user-test-last-report.md`, overwritten each run, gitignored. + +**Why a separate file instead of reading conversation context:** The brainstorm established that same-context grading is the exact failure mode we've seen — structurally correct reports that technically satisfy format requirements while burying findings. Reading from an artifact forces the eval to grade what the user actually sees, without access to the reasoning that produced it. + +**2. Execution order metadata (schema change)** + +Eval 1 checks probe execution order. The current `probes_run` array in `.user-test-last-run.json` records results but not execution sequence relative to broad exploration. Add: +- `execution_index: ` to each `probes_run` entry (0-based, monotonically increasing across all areas) +- `broad_exploration_start_index: ` per area in the `areas` array + +Eval 1 then checks: for each area, all probe `execution_index` values < that area's `broad_exploration_start_index`. Binary, mechanical, no judgment required. + +## The 3 Binary Evals + +### Eval 1: Probe Execution Order (protocol layer) + +**Question:** Did all failing/untested probes execute before broad exploration in every area? + +**Grading method:** For each area in `areas`, check that every `probes_run` entry for that area has `execution_index < broad_exploration_start_index`. FAIL if any area violates. Report which areas violated. + +**Data source:** `.user-test-last-run.json` only (structural check). + +**Zero probes case:** If an area has no probes, it passes vacuously. + +### Eval 2: Proven Regression Distinction (reasoning layer — reformulated as structural) + +**Question:** When a Proven area's score dropped by 1+ points, does the report's NEEDS ACTION section contain an entry for that area? + +**Grading method:** +1. 
From `.user-test-last-run.json`, identify areas where the test file shows `status: Proven` but the run's `ux_score` is below `pass_threshold` +2. From `.user-test-last-report.md`, check that each such area appears in the NEEDS ACTION section as a **line item** with the `⚠` prefix and `→ Proven regression` marker (not just the area slug mentioned anywhere in the section). The required format is: `⚠ P[N] ... → Proven regression` +3. PASS if every regressed Proven area has a matching line item. FAIL if any is missing or appears without the `→ Proven regression` marker. + +**Why a specific marker:** Checking for slug presence alone is gameable: the area could appear as a parenthetical note rather than an action item and technically pass. The marker requirement makes the check fully mechanical: regex match for `⚠.*<area-slug>.*→ Proven regression` in the NEEDS ACTION block. + +**Why reformulated:** The original question ("did the report distinguish bug vs. demotion?") was subjective. This structural version tests the same thing (a Proven regression must surface as actionable, not buried in DETAILS) without requiring judgment calls about "distinguishing." + +**No Proven regressions case:** Automatic PASS (vacuously true). The eval records `"detail": "no Proven regressions this run"`. + +**Data source:** Both `.user-test-last-run.json` (to identify regressions) and `.user-test-last-report.md` (to verify surfacing). + +### Eval 3: P1 Surfacing (presentation layer) + +**Question:** Did every P1 item (from `explore_next_run` where `priority: "P1"`) appear in the NEEDS ACTION section? + +**Grading method:** +1. From `.user-test-last-run.json`, collect all `explore_next_run` items with `priority: "P1"` +2. From `.user-test-last-report.md`, verify each P1 item appears in the NEEDS ACTION block (match area slug + priority marker) +3. PASS if all P1 items are in NEEDS ACTION. FAIL with count of missing items. 
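The Eval 3 grading steps above could look roughly like the following Python sketch. Several details are assumptions rather than anything this plan specifies: that each `explore_next_run` item carries an `area` field holding the slug, that the NEEDS ACTION block starts at a line containing `NEEDS ACTION` and ends at the next blank line, and that the priority marker is the literal text `P1`. The function names are hypothetical.

```python
from typing import Dict, List


def extract_needs_action(report: str) -> List[str]:
    """Assumed layout: the block runs from the 'NEEDS ACTION' line to the next blank line."""
    block: List[str] = []
    inside = False
    for line in report.splitlines():
        if not inside:
            if "NEEDS ACTION" in line:
                inside = True
            continue
        if not line.strip():
            break
        block.append(line)
    return block


def grade_p1_surfacing(last_run: Dict, report: str) -> Dict:
    """Binary check: every P1 item must show up as a line in NEEDS ACTION."""
    p1_items = [
        item for item in last_run.get("explore_next_run", [])
        if item.get("priority") == "P1"
    ]
    if not p1_items:
        return {"pass": True, "p1_count": 0, "surfaced_count": 0,
                "detail": "no P1 items this run"}
    block = extract_needs_action(report)
    surfaced = 0
    for item in p1_items:
        area = item.get("area")  # assumed field name for the area slug
        if area and any("P1" in line and area in line for line in block):
            surfaced += 1
    return {"pass": surfaced == len(p1_items),
            "p1_count": len(p1_items),
            "surfaced_count": surfaced}
```

Plain substring matching is the weakest part of this sketch: a slug that is a prefix of another slug would match both, so a real grader would probably want word-boundary matching.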
+ +**Scope note:** Verification mismatches on Proven areas also belong in NEEDS ACTION (per dispatch format rules), but they flow through a different path — the `verification_results` array, not `explore_next_run`. The main skill does not consistently promote these to `explore_next_run` P1 items, so including them here would produce false positives. If verification mismatch surfacing needs eval coverage, add it as a separate Eval 4 later. + +**Zero P1 items case:** Automatic PASS. Eval records `"detail": "no P1 items this run"`. + +**Data source:** Both artifacts. + +## Artifact Schemas + +### `skill-evals.json` + +Location: `tests/user-flows/skill-evals.json` (project-scoped, committed to git) + +```json +{ + "eval_version": 1, + "entries": [ + { + "run_timestamp": "2026-03-17T14:30:00Z", + "scenario_slug": "resale-clothing", + "git_sha": "abc1234", + "skill_version": "2.52.0", + "evals": { + "probe_execution_order": { + "pass": true, + "areas_violated": [] + }, + "proven_regression_distinction": { + "pass": false, + "regressed_areas": ["login"], + "missing_from_needs_action": ["login"], + "detail": "Login regressed from Proven (score 4→2) but only appeared in DETAILS" + }, + "p1_surfacing": { + "pass": true, + "p1_count": 2, + "surfaced_count": 2 + } + }, + "overall_pass": false, + "mutation_proposed": true + } + ] +} +``` + +- Cap: 50 entries (drop oldest) +- `eval_version` at top level — bumped when evals change, enabling historical comparison filtering +- Created if missing on first eval run + +### `skill-mutations.md` + +Location: `tests/user-flows/skill-mutations.md` (project-scoped, committed to git) + +```markdown +# Skill Mutations Log + +Proposed changes to the user-test skill based on eval failures. +Mark status as ACCEPTED or REJECTED after review. 
+ +--- + +## Mutation 1 — 2026-03-17 + +**Status:** PROPOSED +**Triggered by:** Eval 2 failure (Proven regression distinction) +**Eval scores:** probe_order: PASS | regression_distinction: FAIL | p1_surfacing: PASS +**Skill version:** 2.52.0 +**Scenario:** resale-clothing + +### Problem observed + +Login area regressed from Proven (score 4→2) but only appeared in DETAILS section. +The report treated it as a normal score change rather than surfacing it in NEEDS ACTION. + +### Proposed change + +**File:** `plugins/compound-engineering/skills/user-test/SKILL.md` +**Section:** Report Output — Dispatch Format, NEEDS ACTION rules + +**Current:** NEEDS ACTION includes "degrading areas, failing probes on Proven areas, verification mismatches on Proven" +**Proposed:** Add explicit rule: "Any Proven area scoring below pass_threshold MUST appear in NEEDS ACTION with '→ Proven regression' suffix, regardless of whether a bug was filed." + +### Outcome + +_Fill after next run:_ Did the change fix the eval failure? Score comparison. +``` + +- Each mutation is a markdown section with clear status +- Status values: `PROPOSED` | `ACCEPTED` | `REJECTED` +- One mutation per failing eval — all failures get proposals in a single run +- Human reviewer decides which to accept (can accept all, some, or none) +- Proposals are numbered sequentially across all eval runs (Mutation 1, 2, 3...) + +### `.user-test-last-report.md` (new artifact) + +Location: `tests/user-flows/.user-test-last-report.md` (gitignored, ephemeral) + +Written during commit mode, after the report is displayed. Contains the exact rendered report text. Overwritten each run. + +## Implementation Plan + +### Phase 1: Prerequisites (changes to existing skill) + +#### 1a. 
Add report file output + +**File:** `plugins/compound-engineering/skills/user-test/SKILL.md` +**Location:** After "Share Report (Optional)" section, before "Auto-Commit" +**Change:** Add step: "Write the rendered report to `tests/user-flows/.user-test-last-report.md`" + +**File:** `plugins/compound-engineering/skills/user-test/SKILL.md` +**Location:** Phase 0, step for `.gitignore` coverage +**Change:** Add `.user-test-last-report.md` to the gitignore check alongside `.user-test-last-run.json` + +#### 1b. Add execution order metadata + +**File:** `plugins/compound-engineering/skills/user-test/references/last-run-schema.md` +**Change:** Add `execution_index` to `probes_run` entries, add `broad_exploration_start_index` to per-area fields + +**File:** `plugins/compound-engineering/skills/user-test/SKILL.md` +**Location:** Phase 3, probe execution section +**Change:** Instruct agent to track execution index (monotonically increasing counter across all MCP calls/actions) and record `broad_exploration_start_index` when transitioning from probe execution to broad exploration per area + +**Schema version:** Bump to v10. Add v9 migration rule: treat missing `execution_index` as absent (eval skips Eval 1 for runs without ordering data). Treat missing `broad_exploration_start_index` as absent. + +### Phase 2: New skill and command + +#### 2a. Create eval skill + +**New file:** `plugins/compound-engineering/skills/user-test-eval/SKILL.md` + +Contents: +- Frontmatter: `name: user-test-eval`, description, `disable-model-invocation: true` +- **Artifact-only grading rule:** "Grade from file artifacts only. Do not reference test execution context, Phase 3 observations, or any other conversation content. The eval's integrity depends on grading what the user sees (the report file), not what the agent knows." +- Load phase: Read `.user-test-last-run.json` and `.user-test-last-report.md`. Abort if either missing or if `completed: false`. Warn if run_timestamp > 24h old. 
+- Read test file to get area maturity statuses (needed for Eval 2). +- Run 3 evals in order. Record pass/fail + detail for each. +- If any eval fails: propose one mutation per failing eval. Write all to `skill-mutations.md`. +- Append entry to `skill-evals.json`. Create file if missing. +- Display summary: `EVAL: 2/3 pass | probe_order: PASS | regression: FAIL | p1_surfacing: PASS` +- If mutation proposed, display the proposed change inline. + +#### 2b. Create eval command + +**New file:** `plugins/compound-engineering/commands/user-test-eval.md` + +```yaml +--- +name: user-test-eval +description: Grade user-test skill output against binary evals +disable-model-invocation: true +allowed-tools: Skill(user-test-eval) +--- + +Invoke the user-test-eval skill for the last completed run. +``` + +#### 2c. Add auto-trigger to commit mode + +**File:** `plugins/compound-engineering/skills/user-test/SKILL.md` +**Location:** End of Commit Mode section, after step 8c +**Change:** Add: + +``` +### Auto-Eval + +After all commit steps complete, automatically invoke `/user-test-eval` to grade +this session's output. The eval reads from file artifacts — it does not use +conversation context from this session. + +**Skip conditions:** `--no-eval` flag, or if commit was partial/aborted. +**Error handling:** If eval fails, the commit is already complete and preserved. +Display "Eval failed: . Run `/user-test-eval` manually to retry." +``` + +Also add auto-trigger after `/user-test-commit` standalone (same artifacts, same trigger). + +### Phase 3: Versioning and metadata + +- Bump plugin version to 2.52.0 in `.claude-plugin/plugin.json` +- Update `marketplace.json` description with new skill count +- Update `README.md` — add user-test-eval to skills list +- Update `CHANGELOG.md` with the addition +- Schema version bump to v10 in test-file-template.md + +## Technical Considerations + +### Same-conversation limitation + +The auto-trigger runs eval in the same conversation as the test. 
The eval skill instructions say "grade from artifacts only," but the model still has conversation context. This is acknowledged as aspirational, not enforced. + +All three evals are designed to be mechanically checkable from artifacts: Eval 1 is pure index comparison, Eval 2 is a regex match for a specific marker format (`⚠.*.*→ Proven regression`), Eval 3 is slug+priority matching in a section block. No eval requires subjective judgment, which limits the surface area for self-bias to near zero. + +If gaming becomes observable (evals consistently pass but failures still occur in practice), the mitigation is to switch to manual-only invocation (`--no-eval` by default, explicit `/user-test-eval` in a new session). + +### Iterate mode + +Eval runs once after the final commit, not per-iteration. Grades the aggregate report. Eval 1 checks probe execution order for the first run only (subsequent runs use progressive narrowing where ordering constraints are relaxed). + +### Partial runs + +If `completed: false` in `.user-test-last-run.json`, eval aborts. Same guard as commit mode. + +### Artifact overwrite risk + +`.user-test-last-run.json` and `.user-test-last-report.md` are overwritten each run. If user runs `/user-test` again before running standalone eval, the previous artifacts are gone. The auto-trigger avoids this (eval runs immediately after commit). + +**Manual eval guard:** Before grading, check if `run_timestamp` in the artifact matches the `run_timestamp` of the last entry in `skill-evals.json`. If they match, this run was already evaluated — warn "This run was already evaluated. Run again? (y/n)". Also warn if `run_timestamp` > 24h old (matching commit mode's staleness check). + +### Concurrent writes + +Not supported. `skill-evals.json` writes are not atomic. Concurrent eval invocations (e.g., two terminals) could corrupt the file. Low risk for single-user CLI tool. 
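
The manual eval guard described under "Artifact overwrite risk" can be sketched as a small pure function. This is an illustration under stated assumptions, not the skill's implementation: the function name `manual_eval_warnings` and the warning labels are hypothetical, and `run_timestamp` is assumed to be an ISO 8601 string with an explicit UTC offset.

```python
from datetime import datetime, timedelta, timezone

# Staleness window taken from the plan's 24h check.
STALE_AFTER = timedelta(hours=24)


def manual_eval_warnings(last_run, eval_log, now):
    """Return the warnings the guard would raise before grading.

    last_run: parsed .user-test-last-run.json (only run_timestamp is used).
    eval_log: parsed skill-evals.json, assumed to hold an "entries" list.
    now:      timezone-aware datetime, injected so the check is testable.
    """
    warnings = []
    run_ts = last_run["run_timestamp"]

    # Already evaluated: the newest log entry carries the same timestamp.
    entries = eval_log.get("entries", [])
    if entries and entries[-1]["run_timestamp"] == run_ts:
        warnings.append("already_evaluated")

    # Stale: the run is more than 24 hours old.
    if now - datetime.fromisoformat(run_ts) > STALE_AFTER:
        warnings.append("stale")

    return warnings
```

Both conditions only warn; whether to continue remains the user's decision, as the plan states.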
+ +### Eval evolution + +`eval_version` in `skill-evals.json` enables filtering when comparing historical scores. When adding a 4th eval, bump `eval_version` to 2. Entries with version 1 have 3 evals; version 2 has 4. Comparison tools should filter by version. + +### Graduation trigger + +When evals pass for 5 consecutive runs, the eval should note: "All evals passing consistently (runs from to ). Consider adding a 4th eval or shifting to query-level optimization." Surface the date range alongside the count so the span is visible. + +**Gap reset:** If the gap between any two consecutive passing runs exceeds 14 days, reset the consecutive count. A run after a 3-week hiatus isn't comparable to daily sprint runs — the skill may have changed, the app may have changed, and the consecutive count would be misleading. + +## Acceptance Criteria + +- [x] `/user-test-eval` command exists and invokes the eval skill +- [x] Eval reads `.user-test-last-run.json` and `.user-test-last-report.md` (not conversation context) +- [x] 3 binary evals implemented: probe execution order, Proven regression distinction, P1 surfacing +- [x] Scores written to `tests/user-flows/skill-evals.json` with defined schema +- [x] Mutation proposals written to `tests/user-flows/skill-mutations.md` when evals fail +- [x] Prompts user to run `/user-test-eval` after commit mode (both auto-commit and standalone `/user-test-commit`) +- [x] `--no-eval` flag skips the auto-trigger +- [x] `.user-test-last-report.md` written during commit mode, gitignored +- [x] `execution_index` and `broad_exploration_start_index` added to last-run JSON schema +- [x] Manual eval warns if run_timestamp matches last skill-evals.json entry (already evaluated) +- [x] Graduation consecutive count resets if gap between runs exceeds 14 days +- [x] Schema bumped to v10 with v9 migration rule +- [x] Plugin version bumped to 2.52.0 +- [x] CHANGELOG, README, plugin.json, marketplace.json updated + +## Scope Boundaries + +**In scope:** +- 
`/user-test-eval` skill + command +- 3 binary evals (mechanical, artifact-based) +- `skill-evals.json` + `skill-mutations.md` artifacts +- Report file artifact (`.user-test-last-report.md`) +- Execution order metadata in last-run JSON +- Auto-trigger from commit mode +- Schema v10 + +**Out of scope:** +- Autonomous mutation application (human review required) +- Query-level optimization (comes after skill evals are stable) +- More than 3 evals (expand after 5 consecutive passes) +- Cross-model evaluation (same model, different context) +- Mutation revert mechanism (use `git revert`) +- Extract mutation format template and JSON schema to `references/` (v2.53.0 consideration — eval skill is 184 lines, extraction warranted when approaching 300+ or when references would be reused across skills) + +## Dependencies & Risks + +**Dependencies:** +- Existing user-test skill and commit mode must be stable +- Schema v9 must be current (it is as of v2.51.0) + +**Risks:** +- Self-evaluation bias on Eval 2 (mitigated by structural reformulation) +- Auto-trigger adds latency to every test session (~10-30s) +- Mutation proposals may be low quality initially (mitigated by human review gate) + +## Sources & References + +- **Origin brainstorm:** [docs/brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md](../brainstorms/2026-03-17-user-test-self-eval-loop-brainstorm.md) — Key decisions: separate eval command (not Phase 5), artifact-based grading, `skill-mutations.md` for proposals, 3 binary evals targeting protocol/reasoning/presentation layers +- **Existing skill:** `plugins/compound-engineering/skills/user-test/SKILL.md` — Phase 4 report format, commit mode steps +- **Last-run schema:** `plugins/compound-engineering/skills/user-test/references/last-run-schema.md` +- **Learnings:** Agent-guided state transitions (docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md) — don't hardcode state transitions, use scoring rubrics +- **Learnings:** 
Monolith-to-skill split anti-patterns (docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md) — enforce size budgets deterministically, don't duplicate validation +- **Probe lifecycle plan:** docs/plans/2026-02-28-feat-user-test-compounding-probe-system-plan.md — binary verification separate from numeric scoring +- **Report dispatch format:** docs/plans/2026-03-01-refactor-user-test-report-dispatch-format-plan.md — NEEDS ACTION section rules diff --git a/docs/plans/2026-03-18-feat-tiered-proven-budget-probe-confirmation-plan.md b/docs/plans/2026-03-18-feat-tiered-proven-budget-probe-confirmation-plan.md new file mode 100644 index 000000000..537847d7b --- /dev/null +++ b/docs/plans/2026-03-18-feat-tiered-proven-budget-probe-confirmation-plan.md @@ -0,0 +1,205 @@ +--- +title: "feat: Tiered Proven Budget + Probe Confirmation Note" +type: feat +status: completed +date: 2026-03-18 +amends: docs/plans/2026-03-03-feat-audit-response-skill-level-amendments-plan.md +--- + +# feat: Tiered Proven Budget + Probe Confirmation Note + +## Overview + +Implement two audit findings from run 12 as lightweight skill amendments: + +1. **A1: Tiered Proven Budget** -- Scale browser MCP budget by consecutive pass count (3/2/1 calls) instead of flat 3 for all Proven areas. +2. **A2: Probe Confirmation Note** -- Require 2 consecutive passes for non-deterministic probes before treating them as genuinely passing. + +Both are behavioral guidance changes (+11 lines across 3 reference files), not schema or machinery changes. + +**Relationship to existing plan:** The full audit plan (`2026-03-03-feat-audit-response-skill-level-amendments-plan.md`) covers A1-A5 targeting schema v10 / v2.52.0. This plan implements a **lightweight subset** of A1 and A2 only, deferring the schema-level changes (determinism field, register variation, scroll verification) to the full plan. + +## Problem Statement / Motivation + +**A1:** All Proven areas get 3 browser MCP calls regardless of stability. 
An area at 15 consecutive passes gets the same budget as one at 3. For mature test files, the majority of MCP calls confirm things that haven't changed in months. + +**A2:** When a probe testing LLM-dependent behavior flips from failing to passing, 1 pass is indistinguishable from model variance. The operator handles this by judgment, but the skill should say so explicitly. + +## Proposed Solution + +### A1: Tiered Budget Table + +Add to `run-targeting.md` after the existing Proven area budget rule: + +```markdown +### Proven Area Budget by Stability + +| Consecutive Passes | Browser MCP Budget | +|---|---| +| 2-5 | 3 calls | +| 6-9 | 2 calls | +| 10+ | 1 call | + +Failing/untested probes remain uncapped at all tiers. The tier only +constrains passing probe spot-checks and exploration calls. + +Tier follows the area's consecutive pass count in the Areas table. +The tier only resets when the consecutive pass count resets, which +occurs on demotion from Proven. If the area stays Proven despite a +soft score (agent judgment: cosmetic issue), the tier stays too. + +Stable queries (CLI-only) do not count against the browser budget. +Journey steps and cross-area probes are separate from per-area budgets. + +Freed calls redistribute to novelty budget and areas with active +variance. Report in SIGNALS: "+ N calls freed from ultra-stable +areas." +``` + +### A1: SKILL.md Reword + +Replace the Proven area budget line in Phase 3 Area Selection Priority: + +**Current (~line 104):** +``` +Proven areas at score 5 get max 3 MCP calls regardless of run focus. +``` + +**New:** +``` +Proven areas: spot-check scaled by stability (see run-targeting.md for tiered budget), plus any failing/untested probes. +``` + +### A1: Cross-File Reference Updates + +Update all 12 hardcoded "3 MCP" references across 4 files to point to the tiered system. See the full reference index in the parent plan (lines 80-96). 
+ +Key files requiring updates beyond run-targeting.md and SKILL.md: + +| File | Lines | Change | +|------|-------|--------| +| `probes.md` | 23 | `3-call MCP budget` -> `tiered MCP budget` | +| `queries-and-multiturn.md` | 51, 55-59, 156, 166, 253, 299 | 6 references to `3-call cap` -> tiered references | +| `SKILL.md` | 104, 126, 145 | 3 references -> tiered pointers | + +### A1: Report Display + +Per-area assessment includes tier context: + +``` +browse/product-grid Proven (15 passes, 1-call budget) UX 5 2s +``` + +### A2: Probe Confirmation Note + +Add to `probes.md` after the Status Definitions section (~line 191): + +```markdown +### Non-Deterministic Probe Confirmation + +When a probe testing LLM-dependent behavior (agent reasoning, +scored_output quality, search ranking) flips from failing to passing, +treat the first pass as unconfirmed. Note "passing (unconfirmed)" in +the report. Require a 2nd consecutive pass before updating probe +status to passing in the test file during commit. If the next run +fails, revert to failing -- the first pass was variance. +``` + +### A2: Report Display + +Unconfirmed probes display with an asterisk: + +``` +Probe Results: +| Area | Query | Status | Detail | +|------|-------|--------|--------| +| agent/search-query | "boots under $50" | passing* | First pass after 8 fails -- needs confirmation | +``` + +## Technical Considerations + +### Gap 1: Cross-File Consistency (Critical) + +The original feature spec proposed updating only run-targeting.md and SKILL.md. However, `queries-and-multiturn.md` contains 6 references to "3-call cap" including a **worked example** (lines 55-59) that the agent treats as canonical. If these aren't updated, the agent will follow the concrete example over the abstract tiered rule. + +**Resolution:** Update all 12 references. The worked example at `queries-and-multiturn.md:55-59` must be updated to show tier-aware budgeting. 
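
One way to confirm that no flat-budget reference survives the update is a small scan over the reference files. The sketch below is hypothetical: the function name `find_flat_budget_refs` is invented for illustration, and the regex covers only the two phrasings quoted in this plan ("3-call" and "3 MCP"), so the real files may need additional patterns.

```python
import re

# Assumed phrasings of the flat budget; the actual wording may vary per file.
FLAT_BUDGET = re.compile(r"\b3[- ]call\b|\b3 MCP\b", re.IGNORECASE)


def find_flat_budget_refs(files):
    """Scan a mapping of file name -> text for leftover flat-budget wording.

    Returns a list of (file name, 1-based line number, stripped line) tuples.
    """
    hits = []
    for name, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if FLAT_BUDGET.search(line):
                hits.append((name, line_no, line.strip()))
    return hits
```

An empty result would suggest the references were all updated, although a phrasing the pattern does not anticipate would slip through.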
+ +### Gap 2: Novelty Budget at Reduced Tiers + +The novelty budget is currently defined as "exactly 1 MCP call (30% of 3 calls)" for Proven areas. At the 2-call tier, 30% = 0.6. At the 1-call tier, 30% = 0.3. + +**Resolution:** Add a note to run-targeting.md: "Novelty allocation within the tiered budget is at agent discretion. At the 1-call tier, the single call may be used for probe spot-check OR novelty -- the mandatory novelty probe rule is waived when the budget is 1 call." + +### Gap 3: A2 Determinism Identification + +The +3 lines of guidance rely on agent judgment to identify which probes are non-deterministic. The full audit plan proposes a `deterministic`/`non-deterministic` field per probe with defaults by trigger type. + +**Resolution for this plan:** Keep it lightweight. The agent already knows which probes target LLM-dependent behavior from the area's `scored_output` flag and probe generation context. Explicit classification deferred to the full plan's schema v10. + +### Gap 4: Failure Reset Semantics + +"Failure resets to 3-call tier" means area-level demotion from Proven, NOT individual probe failure. Probe failures are independent signals and do not affect the tier. The tier follows the consecutive pass count in the Areas table -- if the area stays Proven despite a soft score (agent judgment: cosmetic issue, not functional regression), the tier stays too. The tier only resets when consecutive passes resets to 0, which occurs on demotion. + +### Gap 5: Progressive Narrowing Interaction + +- **SKIP areas**: No browser calls -- tier is irrelevant (CLI queries still run) +- **PROBES-ONLY areas**: 1 exploration call + all probes -- tier budget does not apply (probes are uncapped) +- **FULL areas**: Tiered budget applies normally + +The tier only governs the budget for Proven areas in FULL classification. 
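
Taken together, the tier table and Gaps 2, 4, and 5 reduce to a small lookup. The following is a minimal sketch, assuming the area's consecutive pass count and narrowing classification are already known; the function names are hypothetical, and treating counts below 2 as the 3-call tier follows the acceptance criterion that a reset returns to the 3-call tier.

```python
from typing import Optional

FLAT_BUDGET = 3  # the pre-tier budget for every Proven area


def proven_browser_budget(consecutive_passes: int,
                          classification: str = "FULL") -> Optional[int]:
    """Browser MCP calls for spot-checks and exploration in a Proven area.

    Returns None when the tier does not apply (SKIP and PROBES-ONLY areas).
    Failing/untested probes are uncapped and are not counted here.
    """
    if classification != "FULL":
        return None
    if consecutive_passes >= 10:
        return 1
    if consecutive_passes >= 6:
        return 2
    return FLAT_BUDGET


def freed_calls(consecutive_passes: int) -> int:
    """Calls released to the novelty budget and areas with active variance."""
    budget = proven_browser_budget(consecutive_passes)
    return FLAT_BUDGET - budget


def novelty_probe_required(consecutive_passes: int) -> bool:
    """The mandatory novelty probe is waived at the 1-call tier."""
    return proven_browser_budget(consecutive_passes) > 1
```

Under these assumptions, an area at 15 consecutive passes gets 1 call and frees 2 for redistribution.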
+ +### Gap 6: Probe Graduation Interaction + +For non-deterministic probes: the `passing*` (unconfirmed) pass does NOT count toward the 2-consecutive-pass graduation requirement. Graduation requires 2 confirmed passes (minimum 3 total passes for non-deterministic probes: 1 unconfirmed + 2 confirmed). + +The unconfirmed pass rule applies only to probes transitioning from `failing` or `flaky` to `passing`. Probes transitioning from `untested` to `passing` follow the standard 1-pass threshold -- they have no failure history to create variance concern. + +## Acceptance Criteria + +### A1: Tiered Proven Budget + +- [x] Tiered budget table added to `run-targeting.md` (+8 lines) +- [x] 2-5 passes: 3 calls; 6-9 passes: 2 calls; 10+ passes: 1 call +- [x] Failure resets consecutive passes to 0 (returns to 3-call tier) +- [x] Failing/untested probes uncapped at all tiers +- [x] Freed calls redistribute to novelty and active areas +- [x] Tier shown in per-area report line: `Proven (N passes, M-call budget)` +- [x] All 12 cross-file "3 MCP" references updated to tiered system +- [x] Worked example in `queries-and-multiturn.md:55-59` updated +- [x] SKILL.md reworded (net 0 lines) +- [x] Novelty budget note for 1-call tier added + +### A2: Probe Confirmation Note + +- [x] 3-line confirmation note added to `probes.md` after Status Definitions +- [x] Unconfirmed probes display as `passing*` in report +- [x] Commit mode holds unconfirmed probes -- doesn't write `passing` to test file until 2nd consecutive pass +- [x] Fail after unconfirmed pass reverts to `failing` +- [x] Graduation clock starts at confirmed `passing`, not `passing*` + +## Line Budget + +| File | Change | Delta | +|------|--------|-------| +| `run-targeting.md` | Tiered budget table + rules + novelty note | +10 | +| `probes.md` | Non-deterministic confirmation note | +3 | +| `queries-and-multiturn.md` | Update 6 references + worked example | ~0 (rewording) | +| `SKILL.md` | Reword 3 Proven budget references | 0 | +| 
**Total new lines** | | **+13** | + +## Dependencies & Risks + +**Dependencies:** +- Consecutive pass count already tracked in test file Areas table -- no new tracking needed +- `scored_output` flag already exists per area -- used to identify LLM-dependent probes + +**Risks:** +- **Low:** Existing high-pass-count areas immediately get reduced budget. Mitigated by: the areas are genuinely stable (that's what 10+ passes means). +- **Low:** Agent judgment for non-deterministic probe identification may be inconsistent. Mitigated by: deferred to full plan's explicit classification field. + +## Sources & References + +- **Parent plan:** [docs/plans/2026-03-03-feat-audit-response-skill-level-amendments-plan.md](docs/plans/2026-03-03-feat-audit-response-skill-level-amendments-plan.md) -- A1-A5 full audit response +- **Iterate efficiency (completed):** [docs/plans/2026-03-01-perf-iterate-efficiency-progressive-narrowing-plan.md](docs/plans/2026-03-01-perf-iterate-efficiency-progressive-narrowing-plan.md) +- **Probe lifecycle (completed):** [docs/plans/2026-03-01-feat-probe-lifecycle-research-quality-plan.md](docs/plans/2026-03-01-feat-probe-lifecycle-research-quality-plan.md) +- **Cross-file reference index:** Parent plan lines 80-96 diff --git a/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md b/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md new file mode 100644 index 000000000..daee34a60 --- /dev/null +++ b/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md @@ -0,0 +1,63 @@ +--- +title: 'Agent-Guided State Transitions and MCP Resilience Patterns for Browser Skills' +date: 2026-02-26 +tags: [claude-code, claude-in-chrome, mcp, skill-architecture, browser-testing] +category: architecture +module: plugins/compound-engineering/skills/user-test/SKILL.md +source: deepen-plan +convergence_count: 5 +plan: .deepen-2026-02-26-feat-user-test-browser-testing-skill-plan/original_plan.md +--- + +# 
Agent-Guided State Transitions and MCP Resilience Patterns for Browser Skills + +## Problem + +When designing skills that track state across runs (maturity models, progression systems) and depend on external MCP tools (browser automation, API connectors), two failure modes recur: hardcoded state transition rules that override agent judgment, and generic retry logic that gives users no actionable recovery path when MCP connections fail. + +## Key Findings + +### Hardcoded state rules violate agent-native principles (5 agents converged) + +Encoding rigid rules like "3 consecutive passes = Promoted" and "any failure = reset to Uncharted" puts business logic in the skill that should be prompt-driven. A minor cosmetic issue in a well-tested area does not warrant full demotion. The fix: provide a scoring rubric with calibration anchors (concrete examples for each score level) and maturity guidance (not rigid counters), then let the agent exercise judgment on promotions and demotions based on severity and context. + +### Extract content to references/ from day one (4 agents converged) + +Skills approaching the 500-line recommended limit should proactively extract templates, framework-specific patterns, and mode-specific documentation into references/ subdirectories before the first version ships. Retrofitting extraction after the skill is in use creates migration risk. Plan the directory structure at design time: SKILL.md holds execution logic (~300 lines), references/ holds reusable content (templates, patterns, mode details). + +### MCP disconnect recovery needs specific, not generic, guidance (4 agents converged) + +Chrome extension service workers go idle during extended sessions, breaking MCP connections. A generic "retry once" pattern gives users no path forward when the retry also fails. 
The fix: provide the specific recovery command ("/chrome Reconnect"), add backoff delay (2-3 seconds) before retry, and track cumulative disconnects to fail fast (abort after 3) rather than burning tokens on repeated failures. + +## Reusable Pattern + +For skills with state tracking: define scoring calibration anchors (what each numeric score means concretely), provide maturity guidance as a rubric, and let agents exercise judgment -- never hardcode state transition counters. For MCP-dependent skills: implement three-tier resilience (preflight availability check, mid-run retry with specific recovery instructions, graceful degradation for non-critical tool failures). + +## Code Example + +```markdown +## Maturity Guidance (agent-guided, not hardcoded) +| Score | Meaning | Example | +|-------|-----------------------|--------------------------------------| +| 1 | Broken | Button unresponsive, page crashes | +| 2 | Major friction | 3+ confusing steps, error messages | +| 3 | Minor friction | Small UX issues, unclear labels | +| 4 | Smooth | Clear flow, no confusion | +| 5 | Delightful | Exceeds expectations | + +Promote to Proven: 2+ consecutive runs with no significant issues (use judgment) +Demote: Functional regression, not cosmetic issues +``` + +```markdown +## MCP Disconnect Recovery (three-tier) +1. Preflight: verify tool availability, instruct `/chrome` if missing +2. Mid-run: wait 3s, retry once, then: "Run /chrome > Reconnect extension" +3. 
Cumulative: abort after 3 disconnects with clear extension stability message +``` + +## References + +- plugins/compound-engineering/skills/agent-native-architecture/SKILL.md (Granularity principle: agent judgment over hardcoded logic) +- docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md (size budget enforcement pattern) +- https://code.claude.com/docs/en/chrome (extension disconnect behavior and recovery) diff --git a/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md b/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md new file mode 100644 index 000000000..ed718b87e --- /dev/null +++ b/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md @@ -0,0 +1,61 @@ +--- +title: 'Monolith-to-Skill Split: Enforcement, Drift, and Shadowing Anti-Patterns' +date: 2026-02-26 +tags: [claude-code, markdown-commands, skill-md-framework, bash, node-js] +category: architecture +module: commands/deepen-plan.md +source: deepen-plan +convergence_count: 4 +plan: .deepen-sorted-wandering-parnas/original_plan.md +--- + +# Monolith-to-Skill Split: Enforcement, Drift, and Shadowing Anti-Patterns + +## Problem + +When splitting a large command file into a thin wrapper + SKILL.md + reference doc, three failure modes recur: size budgets creep back without enforcement, validation logic duplicated across files drifts out of sync, and stale copies of the original monolith silently shadow the new skill. + +## Key Findings + +### Size budgets require deterministic enforcement, not prose (3 agents converged) + +Stating "max 1,200 lines" in a plan is a policy wish. Without a gate that fails the pipeline, the file will grow past the budget through iterative additions -- exactly how the original monolith grew from 400 to 1,452 lines. Embed a line-count check as a validation step that runs every time the pipeline executes. 
+ +### Legacy monolith shadowing during migration (4 agents converged) + +Claude Code resolves skills by precedence: enterprise > personal > project, with plugins namespaced. A stale 1,452-line file at `~/.claude/commands/` or `~/.claude/skills/` silently shadows the new plugin skill. Detection must be automated, check all three resolution paths, and use line count (>100) as the heuristic -- not file existence alone. + +### Dual validation paths will drift (3 agents converged) + +When validation logic appears both inline in SKILL.md and in a reference doc, the two copies inevitably diverge. The fix: pick one canonical location per validation type. Parent-critical checks (judge output schema) stay inline. Pipeline-internal checks (preservation, artifact structure) live in the reference doc only. + +## Reusable Pattern + +For any command split: (1) add a deterministic size gate that fails loudly, (2) automate legacy detection across all skill resolution paths before first run, (3) assign each validation check exactly one canonical home -- never duplicate across files. 
+ +## Code Example + +```bash +# Size budget enforcement (add to pipeline validation step) +ARCH_LINES=$(wc -l < "$DEEPEN_DIR/ARCHITECTURE.md") +if [ "$ARCH_LINES" -gt 1200 ]; then + echo "FAIL: ARCHITECTURE.md is $ARCH_LINES lines (max 1200)" + exit 1 +fi + +# Legacy shadowing detection (cross-platform) +for dir in "$HOME/.claude/commands" "$HOME/.claude/skills/deepen-plan"; do + TARGET="$dir/deepen-plan.md" + [ -d "$dir/deepen-plan" ] && TARGET="$dir/deepen-plan/SKILL.md" + if [ -f "$TARGET" ]; then + LINES=$(grep -c '' "$TARGET" 2>/dev/null || echo 0) + [ "$LINES" -gt 100 ] && echo "WARN: Legacy at $TARGET ($LINES lines)" + fi +done +``` + +## References + +- agent-native-architecture/references/agent-execution-patterns.md (deterministic checks over heuristic detection) +- agent-native-architecture/SKILL.md (anti-pattern: two ways to accomplish same outcome) +- https://code.claude.com/docs/en/skills (skill resolution precedence) diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md index 8bab08f5c..22ee2db48 100644 --- a/plugins/compound-engineering/README.md +++ b/plugins/compound-engineering/README.md @@ -156,6 +156,8 @@ Core workflow commands use `ce:` prefix to unambiguously identify them as compou | Skill | Description | |-------|-------------| | `agent-browser` | CLI-based browser automation using Vercel's agent-browser | +| `user-test` | Exploratory browser testing via claude-in-chrome with quality scoring and compounding test files | +| `user-test-eval` | Grade user-test skill output against binary evals and propose targeted mutations | ### Beta Skills diff --git a/plugins/compound-engineering/skills/user-test-commit/SKILL.md b/plugins/compound-engineering/skills/user-test-commit/SKILL.md new file mode 100644 index 000000000..08daee22a --- /dev/null +++ b/plugins/compound-engineering/skills/user-test-commit/SKILL.md @@ -0,0 +1,8 @@ +--- +name: user-test-commit +description: Commit user-test results — update test 
file maturity map, file issues, append history. Use after a --no-commit run or to retry a failed commit. +disable-model-invocation: true +allowed-tools: Skill(user-test) +--- + +Invoke the user-test skill in commit mode for the last completed run. diff --git a/plugins/compound-engineering/skills/user-test-eval/SKILL.md b/plugins/compound-engineering/skills/user-test-eval/SKILL.md new file mode 100644 index 000000000..8552ba859 --- /dev/null +++ b/plugins/compound-engineering/skills/user-test-eval/SKILL.md @@ -0,0 +1,187 @@ +--- +name: user-test-eval +description: Grade user-test skill output against binary evals and propose mutations. Use after a user-test run completes to check probe ordering, regression surfacing, and P1 presentation. +disable-model-invocation: true +--- + +# User Test Eval + +Grade the user-test skill's output against 3 binary evals. Read from file +artifacts only. Propose targeted mutations when evals fail. + +**Artifact-only grading rule:** Grade from file artifacts only. Do not reference +test execution context, Phase 3 observations, or any other conversation content. +The eval's integrity depends on grading what the user sees (the report file), +not what the agent knows. + +## Phase 1: Load Artifacts + +1. **Locate test directory:** Find `tests/user-flows/` in the project. +2. **Read `.user-test-last-run.json`:** + - Missing: abort with "No run results found. Run `/user-test` first." + - `completed: false`: abort with "Last run was incomplete. Run `/user-test` again." +3. **Read `.user-test-last-report.md`:** + - Missing: abort with "No report artifact found. The skill version may predate report persistence — run `/user-test` again with the latest skill." +4. **Staleness check:** If `run_timestamp` > 24 hours old, use `AskUserQuestion` (if available, otherwise present as numbered options): "Run results are from . Evaluate anyway?" with options Yes / No. Abort on No. +5. **Already-evaluated check:** Read `skill-evals.json` if it exists. 
If the last entry's `run_timestamp` matches the artifact's `run_timestamp`, use `AskUserQuestion`: "This run was already evaluated. Run again?" with options Yes / No. Abort on No. +6. **Read the test file** (`tests/user-flows/.md`) to get area maturity statuses and `pass_threshold` values. Default `pass_threshold` is 4 if not specified. + +## Phase 2: Run Evals + +Run all 3 evals in order. Record pass/fail + detail for each. + +### Eval 1: Probe Execution Order (protocol layer) + +**Question:** Did all failing/untested probes execute before broad exploration in every area? + +**Method:** +1. For each area in `areas` array, read `broad_exploration_start_index` +2. Collect all `probes_run` entries for that area, read their `execution_index` +3. Check: every probe's `execution_index` < area's `broad_exploration_start_index` +4. **PASS** if all areas satisfy the constraint. **FAIL** if any area violates — list violated areas. + +**Edge cases:** +- Area has no probes: PASS (vacuously true) +- Missing `execution_index` or `broad_exploration_start_index` (v9 data): SKIP with detail "execution order data not available (pre-v10 run)" +- Skipped areas (`skip_reason` present): exclude from check + +### Eval 2: Proven Regression Distinction (presentation layer) + +**Question:** When a Proven area's score dropped below pass_threshold, does the report's NEEDS ACTION section contain a properly formatted entry? + +**Method:** +1. From the test file, identify areas with `Status: Proven` +2. From `.user-test-last-run.json`, check each Proven area's `ux_score` against its `pass_threshold` +3. For each regressed area (score < pass_threshold), search `.user-test-last-report.md` for the NEEDS ACTION section +4. Check for a line matching the pattern: `⚠.*.*→ Proven regression` +5. **PASS** if every regressed Proven area has a matching line item. **FAIL** if any is missing or appears without the `→ Proven regression` marker. 
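
The method above can be sketched as a pure function over already-loaded data. Several things here are assumptions for illustration: the function name, the merged per-area dict (status from the test file, `ux_score` from the last-run JSON), the NEEDS ACTION section being passed in as pre-extracted text (`None` when it could not be found), and the area slug sitting between the `⚠` and the `→ Proven regression` marker. The missing-section check is applied only when at least one area regressed, which is one reading of the edge cases.

```python
import re
from typing import Optional

DEFAULT_PASS_THRESHOLD = 4


def _result(passed, regressed, missing, detail):
    # Field names mirror the proven_regression_distinction entry in skill-evals.json.
    return {
        "pass": passed,
        "regressed_areas": regressed,
        "missing_from_needs_action": missing,
        "detail": detail,
    }


def grade_proven_regressions(areas: list, needs_action: Optional[str]) -> dict:
    """Grade Eval 2 from artifact data only."""
    proven = [a for a in areas if a["status"] == "Proven"]
    if not proven:
        return _result(True, [], [], "no Proven areas in test file")

    regressed = [
        a["slug"] for a in proven
        if a["ux_score"] < a.get("pass_threshold", DEFAULT_PASS_THRESHOLD)
    ]
    if not regressed:
        return _result(True, [], [], "no Proven regressions this run")

    if needs_action is None:
        return _result(False, regressed, list(regressed),
                       "NEEDS ACTION section not found in report")

    lines = needs_action.splitlines()
    missing = []
    for slug in regressed:
        # Assumed line shape: warning sign, then the slug, then the marker.
        pattern = r"⚠.*" + re.escape(slug) + r".*→ Proven regression"
        if not any(re.search(pattern, line) for line in lines):
            missing.append(slug)

    detail = ("all regressed Proven areas surfaced" if not missing
              else "missing marker for: " + ", ".join(missing))
    return _result(not missing, regressed, missing, detail)
```

Keeping the grader free of file I/O means it can be exercised against hand-written report fragments before it is pointed at real artifacts.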
+ +**Edge cases:** +- No Proven areas exist: PASS with detail "no Proven areas in test file" +- No Proven areas regressed: PASS with detail "no Proven regressions this run" +- Cannot parse NEEDS ACTION section: FAIL with detail "NEEDS ACTION section not found in report" + +### Eval 3: P1 Surfacing (presentation layer) + +**Question:** Did every P1 item from `explore_next_run` appear in the NEEDS ACTION section? + +**Method:** +1. From `.user-test-last-run.json`, collect all `explore_next_run` items with `priority: "P1"` +2. For each P1 item, search `.user-test-last-report.md` NEEDS ACTION section for the area slug with `P1` marker +3. **PASS** if all P1 items are in NEEDS ACTION. **FAIL** with count of missing items and their area slugs. + +**Scope note:** Verification mismatches on Proven areas also belong in NEEDS ACTION per +dispatch format rules, but they flow through `verification_results`, not `explore_next_run`. +Not checked here — candidate for a future Eval 4. + +**Edge cases:** +- No P1 items: PASS with detail "no P1 items this run" +- Cross-area P1 items (area = `[cross-area]`): match against the `why` text or `affected_areas` slugs in NEEDS ACTION + +## Phase 3: Propose Mutations + +If any eval failed, propose one mutation per failing eval. 
+ +**Mutation generation rules:** +- Identify the skill file and section most likely responsible for the failure +- Describe the current behavior and the proposed change +- Frame as a specific, targeted instruction change — not a rewrite +- Number mutations sequentially across all eval runs (read last mutation number from `skill-mutations.md`) + +**Mutation format:** + +```markdown +## Mutation N -- + +**Status:** PROPOSED +**Triggered by:** Eval failure () +**Eval scores:** probe_order: | regression_distinction: | p1_surfacing: +**Skill version:** +**Scenario:** + +### Problem observed + +<1-2 sentences describing the specific failure> + +### Proposed change + +**File:** +**Section:** + +**Current:** +**Proposed:** + +### Outcome + +_Fill after next run:_ Did the change fix the eval failure? Score comparison. +``` + +If all evals passed, do not propose a mutation. + +## Phase 4: Write Artifacts + +### `skill-evals.json` + +Location: `tests/user-flows/skill-evals.json` + +If file doesn't exist, create with `{ "eval_version": 1, "entries": [] }`. + +**Skill version:** Read the current `version` from the plugin's `.claude-plugin/plugin.json` at eval time. Do not hardcode. + +Append entry: + +```json +{ + "run_timestamp": "", + "scenario_slug": "", + "git_sha": "", + "skill_version": "", + "evals": { + "probe_execution_order": { "pass": , "areas_violated": [...], "detail": "..." }, + "proven_regression_distinction": { "pass": , "regressed_areas": [...], "missing_from_needs_action": [...], "detail": "..." }, + "p1_surfacing": { "pass": , "p1_count": , "surfaced_count": , "detail": "..." } + }, + "overall_pass": , + "mutation_proposed": +} +``` + +Cap at 50 entries — drop oldest if exceeded. + +### `skill-mutations.md` + +Location: `tests/user-flows/skill-mutations.md` + +If file doesn't exist, create with header: + +```markdown +# Skill Mutations Log + +Proposed changes to the user-test skill based on eval failures. +Mark status as ACCEPTED or REJECTED after review. 
+``` + +Append mutation sections for each failing eval. Separate with `---`. + +### Graduation Check + +After writing artifacts, check for consecutive passing runs: + +1. Read the last N entries from `skill-evals.json` where `overall_pass: true` +2. Count consecutive passes from most recent backwards +3. Check for gap reset: if any two consecutive entries have `run_timestamp` more than 14 days apart, reset count to entries after the gap +4. If 5+ consecutive passes within the gap window: display "All evals passing consistently (runs from to ). Consider adding a 4th eval or shifting to query-level optimization." + +## Phase 5: Display Summary + +Display a one-line summary: + +``` +EVAL: /3 pass | probe_order: | regression: | p1_surfacing: +``` + +If mutations were proposed, display each mutation's Problem Observed and Proposed Change inline. + +If all passed, display: "All evals passing. No mutations proposed." + +If graduation threshold met, display the graduation message. diff --git a/plugins/compound-engineering/skills/user-test-iterate/SKILL.md b/plugins/compound-engineering/skills/user-test-iterate/SKILL.md new file mode 100644 index 000000000..0b2b53bfa --- /dev/null +++ b/plugins/compound-engineering/skills/user-test-iterate/SKILL.md @@ -0,0 +1,9 @@ +--- +name: user-test-iterate +description: Run the same user test scenario N times to measure consistency. Use when validating score stability or detecting flaky areas. 
+disable-model-invocation: true +allowed-tools: Skill(user-test) +argument-hint: "[scenario-file] [n]" +--- + +Invoke the user-test skill in iterate mode for: $ARGUMENTS diff --git a/plugins/compound-engineering/skills/user-test/SKILL.md b/plugins/compound-engineering/skills/user-test/SKILL.md new file mode 100644 index 000000000..e7683d9e7 --- /dev/null +++ b/plugins/compound-engineering/skills/user-test/SKILL.md @@ -0,0 +1,399 @@ +--- +name: user-test +description: Run browser-based user testing via claude-in-chrome MCP with quality scoring and compounding test files. Use when testing app quality, scoring interactions, tracking test maturity, or filing issues from test sessions. +argument-hint: "[scenario-file-or-description]" +disable-model-invocation: true +--- + +# User Test + +Exploratory testing in a visible Chrome window. The user watches the test +happening in real-time and can intervene if needed. Claude shares the browser's +login state — sign into the app in Chrome before running. + +For automated headless regression testing, use `/test-browser` instead. + +**v1 limitation:** This skill targets localhost / local dev server apps. External +or staging URLs are not validated for deployment status — verify remote apps are +live and accessible before testing. + +## Interaction Method + +If `AskUserQuestion` is available, use it for all prompts. Otherwise, present +each question as a numbered list and wait for a reply. For multiSelect, accept +comma-separated numbers (e.g. `1, 3`). Never skip or auto-configure. + +**Protected artifacts:** `tests/user-flows/` files are pipeline output. Review +agents and cleanup tools must never flag them for deletion or gitignore. + +## Phase 0: Preflight + +1. **Chrome MCP check — deferred to Phase 2.** Phase 1 CLI discovery may eliminate browser testing. +2. **Detect WSL:** Run `uname -r 2>/dev/null | grep -qi microsoft`. If WSL: abort with "Chrome integration not supported in WSL." +3. 
**Check gh CLI:** Run `gh auth status`. If not authenticated: note "gh not authenticated — issue creation skipped in commit mode." +4. **Validate app URL:** If test file contains `app_url`, verify reachable. Site permission errors handled reactively during execution. + +## Phase 1: Load Context + +**Input:** `$ARGUMENTS` — either a path to an existing test file or a description of what to test. A trailing integer N triggers multi-run mode (e.g., `/user-test resale-clothing 5`). See [probes.md](./references/probes.md) for multi-run orchestration: inter-run probe state, progressive Proven area reduction, interruption handling, and N-run summary format. + +1. **Resolve test file:** + - If argument is a file path (contains `/` or ends in `.md`): + - Validate path resolves within `tests/user-flows/` (prevent directory traversal) + - Read and parse the test file + - Validate `schema_version` is present (1–10 accepted) + - **v1/v2 migration:** If `schema_version: 1`, fill missing columns (`Last Quality`, `Last Time`, `Delta`, `Context`) with `—`. If `schema_version: 2`, also fill missing sections (Area Trends, UX Opportunities, Good Patterns) and Run History columns (Best Area, Worst Area). Do NOT rewrite on read. + - **v3/v4 migration:** If `schema_version: 3`, treat missing `verify:` blocks and `Probes:` tables as absent. If `schema_version: 4`, also treat missing `**Queries:**` and `**Multi-turn:**` tables as absent. Do NOT rewrite on read. + - **v5 migration:** If `schema_version: 5`, treat Probes without `Confidence` column as `confidence: high` (existing probes were generated from observed failures). Treat Probes without `Priority` column as inferred from `Generated From` (verification failure → P1, score-based → P2). Treat Queries without `Status` column as active. Treat missing `seams_read` as `false`. Do NOT rewrite the file on read. + - **v6 migration:** If `schema_version: 6`, treat missing `## Cross-Area Probes` section as empty table. 
Treat missing `mcp_restart_threshold` as 15. Treat probes without `related_bug` as unlinked. Do NOT rewrite on read. + - **v7 migration:** If `schema_version: 7`, treat missing `weakness_class` as absent. Treat missing `novelty_fingerprints` as empty. Treat missing `adversarial_browser` as false. In JSON: treat missing `tactical_note` as null, `confirmed_selectors` as `{}`. Do NOT rewrite on read. + - **v8 migration:** If `schema_version: 8`, treat missing `## Journeys` section as empty (no journeys defined). Do NOT rewrite on read. + - **v9 migration:** If `schema_version: 9`, treat missing `execution_index` on `probes_run` entries as absent. Treat missing `broad_exploration_start_index` on areas as absent. Eval skips Eval 1 (probe execution order) for runs without ordering data. Do NOT rewrite on read. + - **Forward compatibility:** Ignore unknown frontmatter fields. Preserve unknown table columns on write. + - **Missing `cli_test_command` (any version):** Treat as `cli_test_command: ""`. CLI discovery (step 3) will populate it. Do NOT rewrite the file on read. + - Extract maturity map, run history, and explore-next-run items + - If argument is a description string: + - Generate a slug from the description + - Check if `tests/user-flows/.md` already exists + - If not, create from template — see [test-file-template.md](./references/test-file-template.md) + - Decompose the description into areas (1-3 interactions each). For new test files, write **rich** area definitions — see Area Depth in [test-file-template.md](./references/test-file-template.md). For `scored_output` areas, include Queries and Multi-turn sequences. + - If no argument: + - Scan `tests/user-flows/` for existing test files + - Present list and ask which to run, or prompt for a new description +2. **Orientation (first run only):** If `seams_read` is false or absent in frontmatter, run code reading to identify structural seams before any browser interaction. Output: 0-5 structural-hypothesis probes. 
+ See [orientation.md](./references/orientation.md). Set `seams_read: true` on first commit after code reading, regardless of outcome. +3. **CLI discovery (MANDATORY when `cli_test_command` is empty):** Whether the test file is new or existing, if `cli_test_command` is empty, run CLI discovery NOW before any browser interaction — follow every step in CLI Discovery in [test-file-template.md](./references/test-file-template.md). Check for API endpoints, test scripts, curl-able routes. If a testable surface exists, populate `cli_test_command` and `cli_queries` in the test file immediately. Do NOT skip this step. Do NOT ask the user whether to do it — just do it. +4. **Ensure `.gitignore` coverage:** + - Check that `.user-test-last-run.json` and `.user-test-last-report.md` are in the project's `.gitignore` + - If missing, append them (these files are ephemeral run state, not source) + - Note: `score-history.json`, `bugs.md`, `skill-evals.json`, and `skill-mutations.md` are NOT gitignored — they are persistent project data +5. **Handle corruption:** + - If required sections are missing or `schema_version` is absent, offer to regenerate from template +6. **Capture git state:** Run `git rev-parse HEAD` and `git rev-parse origin/main 2>/dev/null`. Run `git diff --name-only origin/main..HEAD` — if this returns ANY files, those are code-affected areas requiring full exploration (even on a feature branch where main is "behind" HEAD). See [run-targeting.md](./references/run-targeting.md) for full rules. + +## Phase 2: Setup + +0. **Check claude-in-chrome MCP:** Call any `mcp__claude-in-chrome__*` tool. If NOT available: check if `cli_test_command` covers all `scored_output` areas. If yes, offer "All areas have CLI coverage — run CLI-only? (y/n)" and proceed without browser. If CLI doesn't cover all areas: display "claude-in-chrome not connected. Run `/chrome` or restart with `claude --chrome`" and abort. +1. 
**Environment sanity check:** + - Navigate to the app URL using `mcp__claude-in-chrome__navigate` + - Verify the page loaded with expected content (not an error page, stale auth redirect, or empty state) + - If error banners, API failures, or empty data detected: abort with "App environment issue detected — fix the app state before testing" +2. **Authentication check:** + - Claude shares the browser's login state — no credential handling needed + - If a login page or CAPTCHA is encountered: pause and instruct "Sign in to your app in Chrome, then press Enter to continue" +3. **Baseline screenshot:** + - Take a screenshot of the app's initial state for reference + +## Phase 2.5: CLI Testing (Optional) + +If the test file defines `cli_test_command` in frontmatter, run CLI queries before browser testing. CLI mode catches agent reasoning errors without browser overhead. + +**When `cli_test_command` is present:** +1. Phase 0 runs `gh auth status` only (Chrome MCP deferred). Skip Phase 2 browser setup unless browser areas exist. +2. Run each `scored_output` area's Queries through `cli_test_command`. Run `cli_queries` via Bash. Score 1-5 using output quality rubric (semantic evaluation). See CLI Area Queries in [queries-and-multiturn.md](./references/queries-and-multiturn.md). +3. **Browser area overlap:** If a `prechecks`-tagged CLI query scores ≤ 2, skip the tagged browser area. No `prechecks` tag = standalone. +4. Credentials: shell environment only. No credentials in the test file. +5. **Adversarial flag check:** If any CLI query for an area scores exactly 3, set `adversarial_browser: true`. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) CLI Adversarial Mode for trigger conditions and secondary check. + +**CLI + browser coexistence:** When both exist, run CLI first. CLI failures only skip browser areas explicitly tagged via `prechecks`. + +## Phase 3: Execute + +Test areas based on maturity status. 
The agent exercises judgment on area selection — these are guidelines, not rigid rules. Record a `skip_reason` for each area not fully tested (see [test-file-template.md](./references/test-file-template.md) for enum values). + +**Run focus vs. area budget:** A run focus (e.g., "consumer stress test", "search bar exploration") controls WHAT you test within each area — which queries, which edge cases, which user personas. It does NOT override maturity-based time allocation (see override priority table in [run-targeting.md](./references/run-targeting.md)). Proven areas get a tiered MCP budget based on consecutive pass count (see [run-targeting.md](./references/run-targeting.md) for budget table). The run focus shapes WHAT those calls test (search bar instead of basic navigation), not the count. + +### Per-Area Checklist (run in order for every area) + +0. **CLI precheck gate** — if `prechecks` CLI query scored ≤ 2, skip. No prechecks tag = proceed. No CLI = proceed. +0b. **Adversarial mode** — if `adversarial_browser: true` (from Phase 2.5): skip happy path, front-load competing-constraint queries, generate pre-emptive P1 probe, increase novelty budget. SKIP areas promoted to PROBES-ONLY. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) CLI Adversarial Mode. +1. **Run probes** — failing/untested first. See [probes.md](./references/probes.md). +2. **Execute Queries and Multi-turn** — if defined. See [queries-and-multiturn.md](./references/queries-and-multiturn.md). +3. **Novelty budget — MANDATORY.** Before generating novel interactions, check `novelty_fingerprints` from `.user-test-last-run.json` — skip interactions matching existing fingerprints. At least 1 novel interaction per `scored_output` area must generate a probe. Iterate mode ignores fingerprints. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) for fingerprint matching, MCP budget, and mandatory probe rule. +4. **Verification pass** — per area type. 
See [verification-patterns.md](./references/verification-patterns.md). +5. **Score** — UX (1-5) + Quality if `scored_output: true`. +6. **Time** — wall-clock seconds, first to last MCP call. Async waits count. Disconnect = `—`. +7. **Notes** — what surprised you? Feeds Explore Next Run + new Queries in commit. + +Probes, verification, and UX scores are three separate signals — none subsumes the others. + +### Execution Index Tracking + +Maintain a monotonically increasing `execution_index` counter (starting at 0) across the entire run. Increment for each probe execution and each broad exploration action. Record `execution_index` on every `probes_run` entry. When transitioning from probe execution to broad exploration for an area, record `broad_exploration_start_index` on that area. This enables `/user-test-eval` to verify probe-before-exploration ordering from artifacts alone. See [last-run-schema.md](./references/last-run-schema.md) for field definitions. + +### Probe Execution (Before Broad Exploration) + +Read probes from area `**Probes:**` tables. Execute `untested` and `failing` probes before broad exploration — these are the highest-signal checks. For Proven areas, failing/untested probes always run regardless of MCP budget; the tiered budget cap only constrains passing-probe spot-checks. Record each probe result with its `execution_index`. See [probes.md](./references/probes.md) for execution flow, lifecycle, and dedup rules. + +### Cross-Area Probes (Before Per-Area Testing) + +Execute cross-area probes before per-area testing — they test state carry-over between areas and inform per-area score interpretation. Results do NOT affect per-area scores. See [probes.md](./references/probes.md). + +### Journey Execution (After Cross-Area Probes) + +Execute journeys after cross-area probes, before per-area testing. Journeys test accumulated state across 3+ areas without resets, with checkpoints at each step. Results do NOT affect per-area scores. 
See [journeys.md](./references/journeys.md). + +### Verification Pass (After Each Area) + +After exploring each area, run structural verification checks based on area type — independent of what the agent noticed. Read the area's `**verify:**` block for area-specific instructions. Record verification results separately from UX score. Verification failures block promotion to Proven but do not demote existing Proven areas. See [verification-patterns.md](./references/verification-patterns.md) for standard checks, tolerance rules, and maturity interaction. + +### Area Selection Priority + +See [run-targeting.md](./references/run-targeting.md) for full rules including +git-aware targeting, progressive narrowing, and override priority. + +Quick reference: (0) Code-affected → full. (1) P1 Explore Next Run → full. (2) Uncharted → full. (3) Proven → spot-check (tiered MCP + failing probes). (4) Known-bug → check issue state: + - `gh issue view` or check tracker — if closed/fixed, flip to Uncharted (verify the fix) + - if open, spot-check the bug area (confirm still broken, note any change) +(5) All Proven → spot-check all, suggest new areas. + +### Connection Resilience + +See [connection-resilience.md](./references/connection-resilience.md) for reactive recovery, proactive restart at configurable MCP call threshold, and disconnect tracking rules. + +### Modal Dialog Handling + +If MCP commands stop responding after triggering an action that may produce a dialog (`alert`, `confirm`, `prompt`): instruct the user to dismiss the dialog manually before continuing. + +### Graceful Degradation + +- Screenshot fails: continue, note "screenshots unavailable" in report +- `javascript_tool` fails: fall back to individual `find`/`click` calls +- All MCP tools fail: abort with recovery instructions + +## Phase 4: Score and Report + +### Scoring + +Score each area on a 1-5 scale per scored interaction unit. 
A scored interaction unit is one user-facing task completion (e.g., "add item to cart", "submit form"). Navigation, page loads, and setup steps are not scored individually. + +| Score | Meaning | Example | +|-------|---------|---------| +| 1 | Broken — cannot complete the task | Button unresponsive, page crashes | +| 2 | Completes with major friction | 3+ confusing steps, error messages | +| 3 | Completes with minor friction | Small UX issues, unclear labels | +| 4 | Smooth experience | Clear flow, no confusion | +| 5 | Delightful | Exceeds expectations, helpful feedback | + +Scores are **absolute** per this rubric. The same checkout flow should produce the same score regardless of which test scenario triggered it. + +### Output Quality Scoring (Optional) + +Areas with `scored_output: true` in their area details are scored on TWO dimensions: + +| Score | UX Meaning | Output Quality Meaning | +|-------|-----------|----------------------| +| 5 | Delightful | Exactly what an expert would produce | +| 4 | Smooth | Relevant, minor misses | +| 3 | Minor friction | Partially correct | +| 2 | Major friction | Mostly wrong | +| 1 | Broken | Completely wrong | + +Report shows both: `UX: 4/5, Quality: 3/5`. Areas without `scored_output` show UX only. + +**Aggregation:** `Quality Avg` in history = UX scores only (backward compatible). Output quality tracked separately as `Output Avg` in the report. + +**Promotion gate:** Each area's `pass_threshold` (default 4) and `quality_threshold` (default 3 for scored_output areas) define what counts as a pass. See [test-file-template.md](./references/test-file-template.md) for details. + +**Known-bug filing trigger:** UX <= 2 (functional failure) OR Quality <= 1 (completely wrong output). Files to bug registry — see [bugs-registry.md](./references/bugs-registry.md). 
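
The promotion gate and the Known-bug filing trigger above can be read as two small predicates. This is a minimal sketch, assuming scores are plain integers and that `quality` is `None` for areas without `scored_output`; the function names `area_passes` and `should_file_known_bug` are hypothetical.

```python
# Hypothetical predicates mirroring the thresholds described above.
# quality=None stands for an area that is not scored on output quality.
def area_passes(ux, quality=None, pass_threshold=4, quality_threshold=3):
    """True when the area meets its UX threshold and, if scored, its quality threshold."""
    if ux < pass_threshold:
        return False
    if quality is not None and quality < quality_threshold:
        return False
    return True


def should_file_known_bug(ux, quality=None):
    """True on a functional failure (UX <= 2) or completely wrong output (Quality <= 1)."""
    return ux <= 2 or (quality is not None and quality <= 1)


print(area_passes(4, 3))            # True
print(area_passes(4, 2))            # False
print(should_file_known_bug(4, 1))  # True
print(should_file_known_bug(3, 2))  # False
```

Keeping the two checks separate reflects the text: an area can fail its pass threshold (for example UX 3) without being bad enough to file as a bug.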
+ +### Performance Threshold Evaluation (Optional) + +If the test file defines `performance_thresholds` in frontmatter, append a timing grade to each area's assessment: `(fast)`, `(acceptable)`, `(slow)`, `(BROKEN)`. Compare each area's wall-clock time against the thresholds. A `broken` timing is a notable finding but does NOT affect the UX score — timing and quality are separate dimensions. + +### Collection Categories + +For each tested area, collect: +1. **UX score** (1-5 per interaction unit) +2. **Time** (wall-clock seconds from Phase 3 timing) +3. **Issues found** (bugs, UX problems, accessibility gaps) +4. **Maturity assessment** (promote, demote, or maintain current status) + +After all areas are scored, generate: +5. **Qualitative summary:** best moment (tagged with area slug), worst moment (tagged with area slug), demo readiness (yes/partial/no), one-line verdict +6. **Explore Next Run items** (2-3 items with priority P1/P2/P3): + - **P1** — Things that surprised you (positive or negative) + - **P2** — Edge cases adjacent to tested areas + - **P3** — Interactions started but not finished, or borderline scores (score of 3 warrants deeper investigation next run) + - **Cross-area weakness synthesis:** After per-area items, read `weakness_class` fields from the test file (as present at run start — ignore any written by this run's commit). If a class appears in 2+ areas, generate up to 2 `[cross-area]` P1 entries with adversarial instructions. See [probes.md](./references/probes.md) Cross-Area Weakness Synthesis. +7. **UX Opportunities** (P1/P2 action items for improvements observed at score 3-5) +8. **Good Patterns** (patterns worth preserving observed at score 4-5 — deliberate design choices, not trivial successes) +9. **Verification results** per area: claims checked, mismatches found (from Layer 2 pass) +10. **Probe results**: probes executed this run (pass/fail per probe), new probes generated from failures/low scores/worst_moment. 
See [probes.md](./references/probes.md) for generation triggers and lifecycle. + +### Report Output — Dispatch Format + +The report is a dispatch, not a broadcast. It tells you what to do next, in priority order. Sections with no items are omitted. + +``` +SESSION SUMMARY: [ · ] +UX 3.0 | Quality 4.5 (CLI) | 5 areas | 2 need action + +NEEDS ACTION (2) ← open items requiring follow-up + ⚠ P1 y2k accessories degrading Q3→Q2 → investigate CLI (Explore Next Run) + ⚠ P2 Proven area agent/filter-via-chat probe failing → regression + +FILED THIS SESSION (1) ← closed loop, confirmation only + ✓ Bug #21: shipping-form validation accepts invalid zip codes + +IMPROVED (1) + cart-validation 3→4 Cart updates instantly on quantity change + +STABLE (3) + browse/product-grid, browse/filters, compare/add-view + +EXPLORE NEXT RUN + P1 shipping-form Browser Validation broken — edge cases + P1 agent/search-query CLI y2k degrading — aesthetic+category + P2 checkout/promo Both Adjacent to cart, untested + +SIGNALS + + CLI speed 15.8s avg (was 20.4s, -23%) + ~ 10 disconnects (was 6) — Chrome extension fragile + ~ 2 UX opportunities logged (UX001–UX002) + +Demo: PARTIAL (P1 bug #21 open; promo-code untested) +``` + +**Section rules:** +- **Header:** `UX X.X | Quality X.X (CLI) | N areas | M need action` — 2-second scan +- **JOURNEYS:** After cross-area probes, before NEEDS ACTION. Failing/flaky journeys show checkpoint detail. Passing show summary. See [journeys.md](./references/journeys.md). +- **NEEDS ACTION:** `⚠` prefix. Only open items: degrading areas, failing probes on **Proven** areas (unexpected regression), verification mismatches on Proven. Probe failures on Uncharted/Known-bug stay in DETAILS (expected) +- **FILED THIS SESSION:** `✓` prefix. Bugs/issues filed. Omit if nothing filed +- **IMPROVED:** ` ` +- **STABLE:** Single comma-separated line +- **EXPLORE NEXT RUN:** ` ` — must appear in printed report +- **SIGNALS:** `+` positive, `-` negative, `~` neutral. 
Disconnects always here with delta. Omit if 0. Use `-` if increased 50%+ +- **Demo:** YES / PARTIAL (reason) / NO (reason). P1 NEEDS ACTION forces at most PARTIAL +- **DETAILS:** Prints only when actionable (new probes, verification failures, new UX opps). Omit if all empty. Contains: Probe Results, Verification Failures, UX Opportunities tables. Code Changes section when git targeting active + +### Share Report (Optional) + +After displaying the report, offer: "Share report to Proof for team review? (y/n)". +If yes, POST the SESSION SUMMARY markdown to `https://www.proofeditor.ai/share/markdown` +with `{"title": "", "markdown": ""}` and display the +returned URL. Skip silently on curl failure — Proof sharing is best-effort. + +### Persist Report + +After displaying the report (and optional Proof sharing), write the rendered report text to `tests/user-flows/.user-test-last-report.md`. This file is the eval artifact — `/user-test-eval` reads it to grade presentation-layer behavior. Overwritten each run, gitignored. + +### Auto-Commit + +After persisting the report, **automatically proceed to Commit Mode** (below) — update the test file, append to history, and file issues. The user reviews results inline as part of the same session. + +**Opt-out:** If invoked with `--no-commit` or if the run was partial (interrupted before all areas scored), skip commit and display the report only. The user can run `/user-test-commit` later to commit from `.user-test-last-run.json`. + +**Partial run safety:** If the run is interrupted before scoring completes, do NOT produce committable output. Partial runs must not corrupt maturity state. + +### Run Results Persistence + +After Phase 4 completes (all areas scored), write `tests/user-flows/.user-test-last-run.json`. See [last-run-schema.md](./references/last-run-schema.md) for full schema (v10), per-area fields, journey fields, execution index fields, and behavioral notes. 
File is overwritten each run except `novelty_fingerprints` which accumulates across runs (read-merge-write). + +## Commit Mode + +Runs automatically after Phase 4 completes a full run. Can also be invoked standalone via `/user-test-commit` (e.g., after a `--no-commit` run or to retry a failed commit). + +### Load Run Results + +**When invoked automatically:** Use the run results already in context from Phase 4. + +**When invoked standalone via `/user-test-commit`:** Read `tests/user-flows/.user-test-last-run.json`. This is the single source of truth — commit mode never falls back to context window. + +- **Missing file:** Abort with "No run results found. Run `/user-test` first." +- **Incomplete run:** If `completed: false`, abort with "Last run was incomplete. Run `/user-test` again for committable results." +- **Stale (>7 days):** Abort with "Run results too old — re-run `/user-test` first." +- **Stale (>24 hours):** Warn "Run results are from . Commit anyway? (y/n)." + +### Maturity Updates + +Apply maturity transitions using agent judgment and the scoring rubric: + +- **Promote to Proven:** After 2+ consecutive passes where UX >= area's `pass_threshold` (default 4) and Quality >= `quality_threshold` for scored_output areas (default 3), with no functional issues. A cosmetic issue in a Proven area does not warrant demotion. +- **Demote to Uncharted:** On functional regressions or new features that change behavior. Minor CSS issues do not trigger demotion. +- **Mark Known-bug:** When a functional issue is found and an issue is filed. Record in bug registry — see [bugs-registry.md](./references/bugs-registry.md). Skip this area in future runs until the fix is deployed. +- **Persistent ≤3 escalation:** If an area scores ≤ 3 for 3+ consecutive runs AND the same issue is noted each time, offer: " has scored ≤3 for N runs with the same issue — file as Known-bug?" This is a manual escalation, not automatic. 
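
The promotion and escalation rules above both depend on counting recent runs. The sketch below is one possible reading, assuming run history is available oldest-first; the data shapes and function names are hypothetical, and the skill itself leaves these transitions to agent judgment.

```python
# Hypothetical helpers; the skill applies these rules with agent judgment,
# so treat this as an illustration of the counting logic only.
def consecutive_passes(history):
    """Count passing runs from the most recent backwards (history is oldest-first)."""
    count = 0
    for passed in reversed(history):
        if not passed:
            break
        count += 1
    return count


def should_promote(history, has_functional_issue=False):
    """Promote to Proven after 2+ consecutive passes with no functional issues."""
    return consecutive_passes(history) >= 2 and not has_functional_issue


def should_offer_known_bug(runs, window=3):
    """runs: oldest-first list of (ux_score, issue_note) tuples.

    True when the last `window` runs all scored <= 3 with the same issue noted.
    """
    if len(runs) < window:
        return False
    recent = runs[-window:]
    scores_low = all(score <= 3 for score, _ in recent)
    notes = {note for _, note in recent}
    same_issue = len(notes) == 1 and None not in notes
    return scores_low and same_issue


print(should_promote([False, True, True]))  # True
print(should_promote([True, True, False]))  # False
print(should_offer_known_bug([(3, "slow search"), (2, "slow search"), (3, "slow search")]))  # True
```

Note that `should_offer_known_bug` only decides whether to raise the question; per the rule above, the escalation itself stays manual.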
+ +**Partial run safety:** If a run is interrupted before scoring completes, no maturity updates are produced. + +### File Updates + +1. **Update test file maturity map and area details:** + - Write to `.tmp` file first, then rename (atomic write) + - Upgrade to v10: bump `schema_version: 10` on first commit regardless of query/probe usage. Add missing columns and sections per [test-file-template.md](./references/test-file-template.md) + - Update area statuses, scores, timing, quality scores, and consecutive pass counts + - Update `## Area Trends` section from `score-history.json` data + - Update `## UX Opportunities Log`: add new entries with sequential IDs (UX001...), update existing entries (mark `implemented` if improvement detected), age out entries per lifecycle rules + - Update `## Good Patterns`: confirm existing patterns (update `Last Confirmed`), add new patterns, remove patterns unconfirmed for 5+ runs + - **Tactical notes:** Append `[Run N] ` to area's Notes column when there's a genuine tactical insight (selector pattern, timing pattern, interaction sequence). Cap 3 entries per area; drop oldest. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) Tactical Notes. + - **Verified selectors:** When Phase 3 confirmed DOM selectors via successful `javascript_tool` batch call, append them to the area's `**verify:**` block with `_Selectors confirmed run N._`. Append-only — never replace user-authored content. See [verification-patterns.md](./references/verification-patterns.md) Selector Discovery and Writeback. + - **Weakness class:** When 2+ probes in an area share a failure pattern, write `**weakness_class:** ` below `pass_threshold`. Remove after 3 consecutive pass runs. One class per area — dominant by probe count. See [probes.md](./references/probes.md) Weakness Classification. +2. 
**Update `tests/user-flows/score-history.json`:** + - Append current run's per-area scores (UX, quality, time) + - Compute trend per area from last 3 entries + - Cap at 10 entries per area (drop oldest) + - Create file if it doesn't exist +3. **Update `tests/user-flows/bugs.md`:** + - File new bugs with sequential IDs for areas with UX <= 2 or Quality <= 1 + - Mark bugs as `fixed` when Known-bug area passes fix_check (score >= `pass_threshold`) AND GitHub issue is closed + - Mark bugs as `regressed` when previously-fixed area fails again + - Create file if it doesn't exist — see [bugs-registry.md](./references/bugs-registry.md) +4. **Update probe statuses** in each area's `**Probes:**` table and the `## Cross-Area Probes` table: mark passing/failing/flaky based on this run's results. Rotate out passing probes older than 10 runs. If a probe has failed 3+ consecutive runs, auto-escalate to bugs.md (see [probes.md](./references/probes.md) Escalation). If a probe has passed 2+ consecutive runs, offer CLI graduation (same path as bug graduation — see [probes.md](./references/probes.md)). +5. **Offer graduation** for newly-fixed bugs — see [graduation.md](./references/graduation.md) +6. **Append to `tests/user-flows/test-history.md`:** + - Add row with: date, areas tested, quality avg, delta, pass rate, best area, worst area, demo ready, context, key finding + - **Delta computation:** Compare quality avg against the most recent *completed* previous run. First run: `—`. Previous run was partial: skip to last complete run. Different area sets: compute over overlapping areas only; no overlap → `—`. Always display how many areas overlap vs. excluded (e.g., "over 5 overlapping areas, 2 new excluded") so the denominator change is visible. + - **Delta warning:** Flag any delta worse than -0.5 in the commit output + - **Context field:** Brief phrase explaining *why* the verdict is what it is (e.g., "search results loading 28s"). Persists alongside verdict for future reference. 
+ - **Pattern surfacing** (after 10+ runs): positive patterns need 7+ of last 10 runs as best area; negative patterns need 5+ of last 10 runs as worst area + - Rotation: keep last 50 entries, remove oldest when exceeding +7. **File GitHub issues:** + - Each issue gets a label `user-test:` (e.g., `user-test:checkout/cart-count`) + - **Duplicate detection:** `gh issue list --label "user-test:" --state open` + - If match found: skip filing, note "duplicate of #N" + - If no match: fall back to semantic title search as secondary check + - Sanitize issue body content before `gh issue create` + - Skip gracefully if `gh` is not authenticated + - Never persist credentials (passwords, tokens, session IDs) in issue bodies or test files +8. **Query compounding:** Sharpen failed queries into probes, expand from discoveries, mark stable queries. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) for steps 8-12 details, query-to-probe conversion rules, and stable query regression tiers. +8b. **Novelty fingerprints:** Merge this run's new fingerprints with existing ones from `.user-test-last-run.json`. Apply 20-per-area cap (drop oldest). Write merged set. See [queries-and-multiturn.md](./references/queries-and-multiturn.md) Novelty Fingerprint Persistence. +8c. **Journey updates:** Update journey Status, Last Run, Run History. Auto-escalate, mark stable, detect definition changes. Journey results do NOT affect per-area maturity. See [journeys.md](./references/journeys.md) Commit Mode. + +### Eval Prompt + +After all commit steps complete, display: + +``` +Run `/user-test-eval` to grade this session's output against binary evals. +``` + +The eval runs as a separate invocation to preserve grading integrity — it reads from file artifacts, not conversation context. Do NOT attempt to invoke the eval skill inline; the separation is intentional. + +**Skip conditions:** `--no-eval` flag, or if commit was partial/aborted — omit the prompt. 
+**Iterate mode:** Display the prompt once after the final commit, not per-iteration.
+**Standalone `/user-test-commit`:** Also displays the eval prompt after commit completes.
+
+## Iterate Mode
+
+See [iterate-mode.md](./references/iterate-mode.md) for full details.
+
+N capped at 10 (default), N=0 is error, N=1 is valid.
+Reset between runs = full page reload to app entry URL.
+Partial run handling: if disconnect mid-iterate, write results for completed
+runs and report "Completed M of N runs."
+Output: per-run scores table + aggregate consistency metrics + maturity transitions.
+After final run, auto-commit (same as normal `/user-test`). Pass `--no-commit` to skip.
+
+## Reference Files
+
+- [test-file-template.md](./references/test-file-template.md) — template, schema migration, area granularity, worked examples
+- [last-run-schema.md](./references/last-run-schema.md) — `.user-test-last-run.json` schema, per-area fields, behavioral notes
+- [journeys.md](./references/journeys.md) — multi-area journey testing: lifecycle, budget, execution, checkpoint types, generation, feature interactions
+- [probes.md](./references/probes.md) — probe execution, lifecycle, dedup, escalation, graduation, multi-run orchestration, weakness classification
+- [queries-and-multiturn.md](./references/queries-and-multiturn.md) — execution checklist, scoring, query compounding, novelty budget, fingerprints, CLI adversarial mode
+- [verification-patterns.md](./references/verification-patterns.md) — structural checks, tolerance rules, scoring impact
+- [run-targeting.md](./references/run-targeting.md) — area selection, git-aware targeting, progressive narrowing
+- [bugs-registry.md](./references/bugs-registry.md) — bug lifecycle, commit mode update rules
+- [graduation.md](./references/graduation.md) — browser discoveries → CLI regression checks
+- [browser-input-patterns.md](./references/browser-input-patterns.md) / [connection-resilience.md](./references/connection-resilience.md) — browser patterns, connection resilience
+- [iterate-mode.md](./references/iterate-mode.md) / [orientation.md](./references/orientation.md) — multi-run orchestration, first-run code reading
diff --git a/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md b/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md
new file mode 100644
index 000000000..94f2a384f
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md
@@ -0,0 +1,145 @@
+# Browser Input Patterns
+
+Patterns for interacting with web apps via `claude-in-chrome` MCP tools.
+
+## React-Safe Input
+
+React uses synthetic events and controlled components. Setting `.value` directly
+bypasses React's state management. Use the native setter pattern:
+
+```javascript
+// React-safe input via javascript_tool
+mcp__claude-in-chrome__javascript_tool({
+  code: `
+    const el = document.querySelector('input[name="email"]');
+    const setter = Object.getOwnPropertyDescriptor(
+      window.HTMLInputElement.prototype, 'value'
+    ).set;
+    setter.call(el, 'test@example.com');
+    el.dispatchEvent(new Event('input', { bubbles: true }));
+    el.dispatchEvent(new Event('change', { bubbles: true }));
+  `
+})
+```
+
+This works for ``, `