From 7a67ed1d886cdf5acfe1977eba5bbfba51bbb866 Mon Sep 17 00:00:00 2001 From: Drew Miller Date: Sat, 28 Feb 2026 14:42:00 -0500 Subject: [PATCH 01/17] =?UTF-8?q?feat(user-test):=20add=20user-test=20skil?= =?UTF-8?q?l=20v3=20=E2=80=94=20compounding=20UX=20intelligence=20system?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Complete implementation of the user-test skill across three plan iterations: - v1: Core skill with 5-phase browser testing, maturity model, thin wrapper commands - v2: Schema migration, timing, CLI mode, auto-commit, quality scoring, performance thresholds - v3: Bug registry, per-area score history, structured skip reasons, pass thresholds, queryable qualitative data, discovery-to-regression graduation, UX opportunities + good patterns New files: SKILL.md (364 lines), 5 reference files, 3 thin wrapper commands, 3 plans, 2 learnings. Version: 2.37.0 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- .claude-plugin/marketplace.json | 4 +- ...at-user-test-browser-testing-skill-plan.md | 597 ++++++++++++++++++ ...2-28-feat-user-test-skill-revision-plan.md | 424 +++++++++++++ ...-feat-user-test-v3-ux-intelligence-plan.md | 453 +++++++++++++ ...uided-state-and-mcp-resilience-patterns.md | 63 ++ ...6-monolith-to-skill-split-anti-patterns.md | 61 ++ .../.claude-plugin/plugin.json | 4 +- plugins/compound-engineering/CHANGELOG.md | 48 ++ plugins/compound-engineering/README.md | 8 +- .../commands/user-test-commit.md | 8 + .../commands/user-test-iterate.md | 9 + .../commands/user-test.md | 9 + .../skills/user-test/SKILL.md | 364 +++++++++++ .../references/browser-input-patterns.md | 82 +++ .../user-test/references/bugs-registry.md | 56 ++ .../skills/user-test/references/graduation.md | 68 ++ .../user-test/references/iterate-mode.md | 80 +++ .../references/test-file-template.md | 236 +++++++ 18 files changed, 2568 insertions(+), 6 deletions(-) create mode 100644 
docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md create mode 100644 docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md create mode 100644 docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md create mode 100644 docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md create mode 100644 docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md create mode 100644 plugins/compound-engineering/commands/user-test-commit.md create mode 100644 plugins/compound-engineering/commands/user-test-iterate.md create mode 100644 plugins/compound-engineering/commands/user-test.md create mode 100644 plugins/compound-engineering/skills/user-test/SKILL.md create mode 100644 plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md create mode 100644 plugins/compound-engineering/skills/user-test/references/bugs-registry.md create mode 100644 plugins/compound-engineering/skills/user-test/references/graduation.md create mode 100644 plugins/compound-engineering/skills/user-test/references/iterate-mode.md create mode 100644 plugins/compound-engineering/skills/user-test/references/test-file-template.md diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index a1b7be99e..c3b438532 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -11,8 +11,8 @@ "plugins": [ { "name": "compound-engineering", - "description": "AI-powered development tools that get smarter with every use. Make each unit of engineering work easier than the last. Includes 29 specialized agents, 22 commands, and 19 skills.", - "version": "2.34.0", + "description": "AI-powered development tools that get smarter with every use. Make each unit of engineering work easier than the last. 
Includes 29 specialized agents, 25 commands, and 21 skills.", + "version": "2.37.0", "author": { "name": "Kieran Klaassen", "url": "https://github.com/kieranklaassen", diff --git a/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md b/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md new file mode 100644 index 000000000..834a6ab7e --- /dev/null +++ b/docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md @@ -0,0 +1,597 @@ +# Decision Record + +**Deepened on:** 2026-02-26 +**Sections enhanced:** 11 of 13 +**Research agents used:** 14 +**Total recommendations applied:** 37 (22 implement, 9 fast_follow, 6 defer) + +## Pre-Implementation Verification + +1. [ ] Verify current component counts: `ls -d plugins/compound-engineering/skills/*/ | wc -l` and `ls plugins/compound-engineering/commands/*.md plugins/compound-engineering/commands/workflows/*.md | wc -l` +2. [ ] Verify current plugin version in `plugins/compound-engineering/.claude-plugin/plugin.json` +3. [ ] Confirm `claude-in-chrome` MCP tool names by running `/mcp` and selecting `claude-in-chrome` +4. [ ] Review `plugins/compound-engineering/skills/deepen-plan/SKILL.md` frontmatter format as canonical thin-wrapper reference +5. [ ] Verify `plugins/compound-engineering/commands/deepen-plan.md` as canonical thin-wrapper command template +6. [ ] Check that `tests/user-flows/` does not already exist in any target project (no namespace collision) + +## Implementation Sequence + +1. **Create `skills/user-test/references/` files first** — the SKILL.md references these, so they must exist before the skill is validated +2. **Create `skills/user-test/SKILL.md`** — the core skill with 5-phase execution logic + commit mode, under 500 lines +3. **Create thin wrapper commands** — `commands/user-test.md`, `commands/user-test-iterate.md`, and `commands/user-test-commit.md` +4. 
**Update metadata files** — plugin.json, marketplace.json, README.md, CHANGELOG.md (use dynamic counts, not hardcoded numbers) +5. **Run `/release-docs`** — regenerate documentation site +6. **Validate** — JSON validity, component count consistency, SKILL.md line count + +## Key Improvements + +1. **[Strong Signal -- 5 agents] Maturity model: guidance over rigid rules** — Replace hardcoded "3 consecutive passes = Proven" and "any failure = reset to Uncharted" with agent-guided judgment. Provide a rubric and guidelines, but let the agent decide based on context (e.g., a cosmetic issue in a Proven area should not trigger full demotion). Simplify initial threshold to 2 consecutive passes. + +2. **[Strong Signal -- 4 agents] Extract reference files from SKILL.md from day one** — Split the skill into SKILL.md (~300 lines of execution logic) plus `references/` directory containing test-file-template.md, browser-input-patterns.md, and iterate-mode.md. The monolith-to-skill-split learning explicitly warns that stated size budgets without enforcement are ignored. + +3. **[Strong Signal -- 4 agents] Extension disconnect handling with specific recovery instructions** — Replace generic "retry-once" with: wait 3 seconds, retry once, on second failure instruct user to run `/chrome` and select "Reconnect extension". Track cumulative disconnects and abort after 3 with a clear stability message. + +4. **[Strong Signal -- 3 agents] Add `disable-model-invocation: true` to both thin wrapper commands** — The commands have side effects (file creation, browser interaction, issue filing). Official docs require this flag for side-effect workflows. + +5. **[Strong Signal -- 3 agents] Explicit distinction from agent-browser/test-browser in SKILL.md intro** — Two browser tools creates confusion. The SKILL.md intro must state: "This skill is for exploratory testing in a visible Chrome window with shared login state. For automated headless regression testing, use /test-browser instead." + +6. 
**[Strong Signal -- 3 agents] Dynamic component counts in acceptance criteria** — Do not hardcode "Skills: 21, Commands: 24". Count actual files and verify description strings match. + +7. **[Strong Signal -- 3 agents] Enhanced preflight check with `/chrome` guidance, WSL detection, and site permissions** — Phase 0 must guide users to run `/chrome` if MCP tools are unavailable, detect WSL and abort with a clear message, and verify the target URL is within Chrome extension's allowed sites. + +8. **Quality scoring rubric with concrete calibration anchors** — Define what scores 1-5 mean with examples, making scoring reproducible across runs. + +9. **Test file schema version for forward compatibility** — Add `schema_version: 1` to test file template frontmatter. + +10. **SKILL.md description must be single-line string** — Multiline YAML indicators break the skill indexer. Use a single line with trigger keywords for auto-discovery. + +## Research Insights + +### Browser Automation (claude-in-chrome) +- Actions run in a **visible Chrome window** in real time — the user can watch and intervene. This is the core differentiator from headless agent-browser and should be prominently documented. +- Claude **shares browser login state** — eliminates most authentication concerns. Users sign in once in Chrome; Claude inherits the session. +- **GIF recording** is available as a built-in capability. Phase 7 (Summary) can offer to record sessions for evidence attached to GitHub issues. +- **Site-level permissions** from the Chrome extension control which URLs Claude can interact with. Preflight should verify this. +- **Modal dialogs** (alert, confirm, prompt) block all browser commands. The skill should detect unresponsive commands and instruct users to dismiss dialogs manually. + +### Skill Architecture (Claude Code Plugins) +- SKILL.md description must be a **single-line string** — multiline YAML breaks the indexer. 
+- Skills use **progressive disclosure**: only frontmatter loads initially (~100 tokens); full content loads on activation. This makes the 500-line target a recommendation, not just a budget. +- `context:fork` is available for isolated execution but is not needed for this skill's use case. + +### Security Considerations +- **Path traversal**: test file path resolution must be validated to stay within `tests/user-flows/`. +- **Credential prohibition**: the skill must never persist passwords, tokens, or session IDs in any written output. +- **Issue body sanitization**: content derived from test results should be sanitized before passing to `gh issue create`. + +## New Considerations Discovered + +1. **MCP tool batching for performance** — Each claude-in-chrome MCP call involves a Chrome extension round-trip. Batch simple checks (element visibility, text content) into single `javascript_tool` calls. Define "quick spot-check" for Proven areas as max 3 MCP calls. + +2. **Iterate mode token cap** — Each 7-phase run consumes significant tokens. Add a default cap of N <= 10 with explicit override. + +3. **State clearing between iterate runs is incomplete** — Full page reload to app entry URL is the reset mechanism. This does not cover IndexedDB, service worker caches, or HttpOnly cookies. Document this limitation. + +4. **Test history file rotation** — `tests/user-flows/test-history.md` will grow unbounded. Add a rotation strategy: keep last 50 entries, archive older ones. + +5. **Atomic file writes** — "Full rewrite" is not truly atomic. Use write-to-temp-then-rename pattern for test file updates. + +6. **Area granularity definition** — The maturity map tracks "areas" but never defines what size an area should be. Without guidance, two runs will decompose the same scenario differently, making consecutive-pass tracking meaningless. Define areas as 1-3 user interactions each (e.g., "checkout" → cart-validation, shipping-form, payment-submission). 
Include a worked example in `references/test-file-template.md`.

7. **Explore Next Run needs prioritization** — The Explore Next Run section is append-only with no signal about urgency. After 5-6 runs it becomes a backlog with no entry point. Add priority levels: `P1` (likely user-facing friction), `P2` (edge case worth knowing), `P3` (curiosity). Instruct Phase 3 to pick highest-priority uncharted items first.

8. **Issue deduplication needs structured labels** — Semantic search via `gh issue list --search` is fragile — two runs will describe the same bug differently. Use a structured `user-test:<area>` label on every issue (e.g., `user-test:checkout/cart-count`) for exact-match dedup via the `--label` flag, with semantic search as a fallback only.

9. **Qualitative assessments evaporate after each run** — The Run Summary asks "Demo ready?" but this answer never persists in test-history.md. Add a `demo_readiness` field (yes/no/partial) to the history table schema so trend data captures qualitative signal, not just scores.

10. **App-level environment sanity check** — Phase 0 validates tool availability but not app health. Stale auth tokens, empty search indices, or silent API 500s produce misleading test results that look like quality issues. Add a Phase 2 "environment sanity check": one known-good navigation + one content assertion before executing test scenarios.
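The priority-first selection in item 7 can be sketched as a small helper. This is a minimal illustration only — the rank map, item shape, and function name are assumptions for the example, not part of the planned skill:

```javascript
// Sketch: pick the highest-priority "Explore Next Run" item first.
// P1 outranks P2, which outranks P3; unknown priorities are ignored.
const PRIORITY_RANK = { P1: 1, P2: 2, P3: 3 };

function pickNextExploreItem(items) {
  // Drop entries without a recognized priority tag
  const ranked = items.filter((item) => PRIORITY_RANK[item.priority]);
  if (ranked.length === 0) return null;
  // Lower rank number wins
  return ranked.reduce((best, item) =>
    PRIORITY_RANK[item.priority] < PRIORITY_RANK[best.priority] ? item : best
  );
}

const next = pickNextExploreItem([
  { area: "checkout/gift-codes", priority: "P3" },
  { area: "checkout/address-autofill", priority: "P1" },
  { area: "search/empty-query", priority: "P2" },
]);
console.log(next.area); // "checkout/address-autofill"
```

The same ordering could just as well live in prose instructions to the agent; the sketch only makes the tie-breaking rule explicit.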
+ +## Fast Follow (ticket before merge) + +**Tier 1 -- Blocks demo/UX quality** (fix within 1-2 days): +- Add cross-reference from `agent-browser/SKILL.md` back to `user-test` to prevent user confusion between the two browser testing approaches + +**Tier 2 -- Improves robustness** (fix within 1 sprint): +- Add file upload workaround documentation: pause user-test and use `/agent-browser` for upload steps, then resume +- MCP tool mapping table (agent-browser CLI vs claude-in-chrome MCP equivalents) in a shared reference file +- Test file concern separation: evaluate splitting run history into a sidecar `.json` for machine parsing while keeping the `.md` human-readable + +## Cross-Cutting Concerns + +1. **SKILL.md size budget enforcement** — Four agents independently recommend the `references/` extraction. The structural decision affects the content outline (section 8), technical considerations (section 5), thin wrapper templates (section 9), and the SpecFlow analysis (section 10). This is the single most impactful structural change. + +2. **Maturity model rigidity vs agent judgment** — Five agents flag this across scoring (section 4), SKILL.md phases (section 8), and success metrics (section 12). The resolution: provide guidance and rubrics, not rigid rules. + +3. **MCP reliability and graceful degradation** — Four agents converge on this across preflight (section 5), execution (section 8), and risks (section 11). The pattern: specific recovery instructions for known failure modes, graceful degradation for mid-run tool failures. + +4. **`disable-model-invocation: true`** — Three agents confirm this is required for both wrapper commands. Single-section impact but high confidence from official docs. 
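The retry-once recovery pattern behind concern 3 can be sketched roughly as follows. The `callTool` argument is an illustrative stand-in for any MCP invocation, not a real API; the 3-second delay and reconnect message come from the plan:

```javascript
// Sketch of the retry-once disconnect pattern: wait, retry once,
// then surface reconnect guidance to the user.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withOneRetry(callTool, delayMs = 3000) {
  try {
    return await callTool();
  } catch {
    await sleep(delayMs);
    try {
      return await callTool();
    } catch {
      throw new Error(
        "Extension disconnected. Run /chrome and select Reconnect extension"
      );
    }
  }
}

// Example: a call that fails once (idle service worker), then succeeds on retry.
let attempts = 0;
withOneRetry(async () => {
  attempts += 1;
  if (attempts === 1) throw new Error("service worker idle");
  return "ok";
}, 10).then((result) => console.log(result, attempts)); // "ok 2"
```

A session-level disconnect counter (abort after 3) would wrap this helper rather than live inside it.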
+ +## Deferred to Future Work + +- **MCP abstraction layer** for future tool swaps (agent-browser <-> claude-in-chrome) — adds unnecessary complexity for v1 +- **Test file concern separation** into spec + state sidecar — evaluate after real-world usage reveals whether the single-file approach causes friction +- **`/mcp` runtime discovery** of available tools instead of hardcoded tool names — low confidence (0.65), nice-to-have for forward compatibility +- **`context:fork` isolation** for iterate mode runs — not needed for current architecture but could improve memory isolation for long iterate sessions + +## Research Gaps Addressed + +| Source | Recommendation | Status | +|--------|---------------|--------| +| docs-researcher-claude-code-plugins | Single-line description | Implemented in SKILL.md frontmatter | +| docs-researcher-claude-code-plugins | Keep SKILL.md under 500 lines | Implemented via references/ extraction | +| docs-researcher-claude-code-plugins | disable-model-invocation: true | Implemented in both wrapper commands | +| docs-researcher-claude-code-plugins | /chrome activation guidance | Implemented in Phase 0 preflight | +| docs-researcher-claude-code-plugins | Service worker idle disconnects | Implemented in disconnect handling | +| docs-researcher-claude-code-plugins | WSL not supported | Implemented in Phase 0 preflight | +| docs-researcher-claude-code-plugins | Login page/CAPTCHA pausing | Implemented in Phase 2 setup | +| docs-researcher-claude-in-chrome | Visible Chrome window | Implemented in SKILL.md intro | +| docs-researcher-claude-in-chrome | Shared browser login state | Implemented in Phase 2 setup | +| docs-researcher-claude-in-chrome | GIF recording | Acknowledged in Phase 7 as optional enhancement | +| docs-researcher-claude-in-chrome | Site-level permissions | Implemented in Phase 0 preflight | +| docs-researcher-claude-in-chrome | Named pipe conflicts (Windows) | Implemented in Phase 0 preflight | +| docs-researcher-claude-in-chrome 
| Modal dialogs block commands | Implemented in Phase 3 execution | +| docs-researcher-claude-in-chrome | /mcp runtime discovery | Deferred — low confidence (0.65), nice-to-have | + +--- +# Implementation Spec +--- + +--- +title: "Add user-test browser testing skill and commands" +type: feat +status: active +date: 2026-02-26 +--- + +# Add user-test Browser Testing Skill and Commands + +## Overview + +Add a new `user-test` skill and three companion commands (`/user-test`, `/user-test-iterate`, `/user-test-commit`) to the compound-engineering plugin. This implements browser-based exploratory user testing via `claude-in-chrome` MCP tools with a compounding maturity model — each run makes the test file smarter by promoting proven areas, filing new bugs, and expanding coverage. + +This skill is for **exploratory testing in a visible Chrome window** with shared login state. The user watches the test happening in real-time and can intervene if needed. For automated headless regression testing, use `/test-browser` instead — it uses the `agent-browser` CLI for deterministic, CI-oriented QA checks. + +The three commands separate concerns: `/user-test` runs and scores a test, `/user-test-iterate` runs it N times for consistency data, and `/user-test-commit` applies results (updates the test file maturity map, files issues, appends history). This separation keeps the fast feedback loop (run + score) lightweight and lets the user decide when to commit results. + +## Problem Statement / Motivation + +The plugin has `test-browser` (deterministic QA regression via `agent-browser` CLI) but no exploratory user testing capability. Teams need a way to: + +- Simulate real user behavior against their app in a visible browser +- Track which areas are stable vs. 
fragile across runs +- Automatically file and deduplicate GitHub issues from testing sessions +- Compound knowledge: skip proven areas, skip known bugs, focus effort on uncharted territory + +This fills a distinct niche from `test-browser` — exploratory quality assessment with compounding knowledge, not regression checking. + +## Proposed Solution + +### Architecture: Skill + Thin Wrapper Commands + +Following the `deepen-plan` precedent (v2.36.0 refactor), implement as: + +| File | Type | Purpose | +|------|------|---------| +| `skills/user-test/SKILL.md` | Skill | Core 5-phase execution logic + commit mode (~300 lines) | +| `skills/user-test/references/test-file-template.md` | Reference | Test file template for new scenarios (~100 lines) | +| `skills/user-test/references/browser-input-patterns.md` | Reference | React-safe input patterns and MCP tool tips (~30 lines) | +| `skills/user-test/references/iterate-mode.md` | Reference | Iterate mode execution details (~50 lines) | +| `commands/user-test.md` | Thin wrapper | `Skill(user-test)` invocation for `/user-test` | +| `commands/user-test-iterate.md` | Thin wrapper | `Skill(user-test)` invocation with iterate mode for `/user-test-iterate` | +| `commands/user-test-commit.md` | Thin wrapper | `Skill(user-test)` invocation for committing results | + +**Why skill + thin wrapper?** +- The execution logic is ~300 lines — well within the 500-line skill recommendation +- Reference files extract the test template, input patterns, and iterate mode details — each reusable independently +- Thin wrappers prevent command bloat (learnings: monolith-to-skill split anti-patterns) +- Both commands share the same skill logic, just with different invocation modes +- Consistent with the Pattern A convention used by `deepen-plan`, `create-agent-skill`, etc. 
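As a rough illustration of the thin-wrapper shape, a command file might look like the sketch below. Field names and body wording are assumptions modeled on the `deepen-plan` precedent; only `disable-model-invocation: true` is specified by this plan:

```markdown
---
description: Run browser-based exploratory user testing with quality scoring
argument-hint: "[scenario-file-or-description]"
disable-model-invocation: true
---

Invoke the user-test skill with the provided arguments: $ARGUMENTS
```

The wrapper carries no logic of its own — all phase execution lives in the skill, so the three commands stay in sync automatically.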
**Why extract to `references/` from day one?**
The monolith-to-skill-split learning (convergence from 4 agents) explicitly warns: "Stating max 1200 lines in a plan is a policy wish. Without a gate that fails the pipeline, the file will grow past the budget." By starting with the split structure, the SKILL.md stays focused on execution phases and the reference files can grow independently without threatening the line budget.

### Key Design Decisions

**1. Browser tool: `claude-in-chrome` MCP (not `agent-browser` CLI)**

The skill uses `mcp__claude-in-chrome__*` tools (find, javascript_tool, read_page, screenshots). This is intentionally different from `test-browser` which uses the headless `agent-browser` CLI. The rationale:
- `user-test` simulates a real user in a **visible** Chrome window — interactive, visual. The user can watch the test happening and intervene.
- `test-browser` runs headless regression checks — deterministic, CI-oriented
- Different tools for different testing philosophies
- `claude-in-chrome` shares the browser's login state, so authenticated app testing requires no credential handling — the user simply signs in once in Chrome

**2. Test file as the product, not the run report**

Living test files in `tests/user-flows/*.md` get rewritten each run with updated maturity maps, scores, and history. The test file compounds intelligence across runs.

Test files include a `schema_version: 1` field in frontmatter to enable forward-compatible migrations when the maturity model or file structure evolves.

**3. Maturity model drives test efficiency**

The maturity model provides guidance for the agent's judgment, not rigid rules:

| Status | Behavior | Guidance |
|--------|----------|----------|
| Proven | Quick spot-check only (max 3 MCP calls) | Promote after 2+ consecutive passes with no significant issues. Cosmetic issues do not warrant demotion. |
| Uncharted | Full investigation, edge cases | Default state.
Demote from Proven only on functional regressions or new features. | +| Known bug | Skip entirely | Issue filed. Skip until fix deployed. | + +The agent exercises judgment on promotions and demotions using the scoring rubric rather than following mechanical counters. A minor CSS issue in a Proven area stays Proven with a note. A broken API in an Uncharted area gets a Known-bug issue filed. + +**Partial run safety:** If a run is interrupted before scoring completes, no maturity updates are produced. Only `/user-test-commit` writes maturity state, and only from a completed run's results. + +**Area granularity:** Each area should cover 1-3 user interactions — small enough that a single bug doesn't reset a huge chunk of proven territory, large enough to accumulate consecutive passes. Example decomposition for "checkout": + +| Area | Interactions | What's tested | +|------|-------------|---------------| +| `checkout/cart-validation` | Add item, verify count, change quantity | Cart state management | +| `checkout/shipping-form` | Enter address, select method, see estimate | Form validation + shipping logic | +| `checkout/payment-submission` | Enter card, submit, see confirmation | Payment flow + success state | + +A worked example with this decomposition pattern is included in [test-file-template.md](./references/test-file-template.md). + +**Quality Scoring Rubric** + +Each score applies to one **scored interaction unit** — a single user-facing task completion (e.g., "add item to cart", "submit shipping form", "complete payment"). Navigation steps, page loads, and setup actions are not scored individually; they are part of the interaction they serve. 
+ +| Score | Meaning | Example | +|-------|---------|---------| +| 1 | Broken — cannot complete the task | Button unresponsive, page crashes | +| 2 | Completes with major friction | 3+ confusing steps, error messages shown | +| 3 | Completes with minor friction | Small UX issues, unclear labels | +| 4 | Smooth experience | Clear flow, no confusion | +| 5 | Delightful | Exceeds expectations, helpful feedback | + +## Technical Considerations + +### Distinct from existing `test-browser` command + +| Aspect | `test-browser` | `user-test` (new) | +|--------|---------------|-------------------| +| Tool | `agent-browser` CLI (headless) | `claude-in-chrome` MCP (visible browser) | +| Purpose | QA regression on PR-affected pages | Exploratory user testing | +| State | Stateless per run | Stateful via test files | +| Output | Pass/fail per route | Quality scores 1-5 per interaction | +| Issues | No issue creation | Auto-files and deduplicates issues | +| Auth | Handles login flows | Shares browser login state | +| Observation | Results only | Real-time visual — user watches the test | + +### MCP dependency + +The skill requires `claude-in-chrome` MCP to be connected. Phase 0 (Preflight Check) validates availability and provides specific guidance: + + +``` +## Phase 0: Preflight Check +1. Check if claude-in-chrome MCP tools are available +2. If NOT available: + - Display: "claude-in-chrome not connected. Run /chrome or restart with claude --chrome" + - Abort with clear instructions +3. Detect WSL environment: + - If running in WSL: "Chrome integration is not supported in WSL. Run Claude Code directly on Windows." + - Abort +4. Verify the target app URL is within Chrome extension's allowed sites + - If permission denied: "Grant site permission in Chrome extension settings for [URL]" +5. 
Windows: if EADDRINUSE error on named pipe:
   - "Close other Claude Code sessions that might be using Chrome, then retry"
```

### `gh` CLI dependency

Issue creation (commit mode) requires `gh auth status`. The skill handles this gracefully:
- If `gh` is not authenticated: skip issue creation, note in summary
- If `gh` is authenticated: proceed with duplicate detection and filing
- **Structured dedup labels:** Every issue gets a label `user-test:<area>` (e.g., `user-test:checkout/cart-count`). Duplicate detection uses `gh issue list --label "user-test:<area>" --state open` for an exact match, falling back to semantic title search only if no label match is found. Labels are machine-parseable and immune to description rewording.
- Issue body content sanitized before passing to `gh issue create` to prevent markdown injection

### React-safe input pattern

The React-specific native setter pattern for bypassing virtual DOM is extracted to [browser-input-patterns.md](./references/browser-input-patterns.md). This keeps framework-specific tool logic reusable and out of the main SKILL.md.

### MCP tool performance

Each claude-in-chrome MCP call involves a round-trip through the Chrome extension. To manage latency:
- Batch simple checks (element visibility, text content, price display) into single `javascript_tool` calls
- Define "quick spot-check" for Proven areas as max 3 MCP calls per area
- Full investigations for Uncharted areas have no artificial cap but should use batched checks where possible

```javascript
// Batch multiple checks into one javascript_tool call:
mcp__claude-in-chrome__javascript_tool({
  code: `JSON.stringify({
    submitBtn: !!document.querySelector('[type=submit]'),
    errorMsg: !!document.querySelector('.error'),
    price: document.querySelector('.price')?.textContent
  })`
})
```

### Connection resilience

Extension disconnects are a known issue — the Chrome extension service worker can go idle during extended sessions.
+ + +``` +## Disconnect Handling +1. After MCP tool failure: wait 3 seconds +2. Retry the call once +3. If retry fails: "Extension disconnected. Run /chrome and select Reconnect extension" +4. Track disconnect_counter for the session +5. If disconnect_counter >= 3: abort with "Extension connection unstable. Check Chrome extension status and restart the session." +``` + +### Modal dialog handling + +JavaScript dialogs (alert, confirm, prompt) block all browser events and prevent Claude from receiving commands. If commands stop responding after a dialog trigger, instruct the user to dismiss the dialog manually before continuing. + +### Graceful degradation + +Apply the same pattern used for `gh` CLI absence to MCP tool failures mid-run: +- If screenshot fails: continue but note "screenshots unavailable" in the report +- If javascript_tool fails: fall back to individual find/click calls +- If all MCP tools fail: abort with specific recovery instructions + +## System-Wide Impact + +- **Interaction graph**: Skill invoked by two thin wrapper commands. No callbacks or middleware. Writes to `tests/user-flows/` (user's project, not the plugin). Calls `gh` CLI for issue creation. +- **Error propagation**: MCP disconnects handled with retry-once + specific recovery instructions. `gh` failures gracefully degraded (skip issue creation). Mid-run MCP tool failures degrade individual capabilities rather than aborting. +- **State lifecycle risks**: Test file writes use write-to-temp-then-rename pattern for atomic updates. Partial runs produce no committable output (maturity safety). Iterate mode resets between runs via full page reload to the app entry URL. Note: this does not clear IndexedDB, service worker caches, or HttpOnly cookies — document this limitation in iterate mode reference. +- **API surface parity**: No overlap with existing commands — distinct MCP tool set, distinct file structure, distinct purpose. 
+- **Security**: Test file paths validated to stay within `tests/user-flows/`. No credentials persisted in any written output (test files, run history, issue bodies). Issue body content sanitized before `gh` CLI invocation. + +## Acceptance Criteria + +### Files to Create + +- [x] `plugins/compound-engineering/skills/user-test/SKILL.md` — Core skill with 5 phases + commit mode, ~300 lines +- [x] `plugins/compound-engineering/skills/user-test/references/test-file-template.md` — Test file template for new scenarios +- [x] `plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md` — React-safe input patterns +- [x] `plugins/compound-engineering/skills/user-test/references/iterate-mode.md` — Iterate mode details +- [x] `plugins/compound-engineering/commands/user-test.md` — Thin wrapper with `disable-model-invocation: true` +- [x] `plugins/compound-engineering/commands/user-test-iterate.md` — Thin wrapper with `disable-model-invocation: true` and iterate argument forwarding +- [x] `plugins/compound-engineering/commands/user-test-commit.md` — Thin wrapper with `disable-model-invocation: true` for committing results + +### Files to Modify + +- [x] `plugins/compound-engineering/.claude-plugin/plugin.json` — bump version, update description with dynamic counts +- [x] `.claude-plugin/marketplace.json` — bump version, update description with dynamic counts +- [x] `plugins/compound-engineering/README.md` — Update component count table, add skill row under Browser Automation, add two command rows +- [x] `plugins/compound-engineering/CHANGELOG.md` — Add new version entry with `### Added` section + +### Post-Change Validation + +- [x] Validate JSON: `cat .claude-plugin/marketplace.json | jq .` and `cat plugins/compound-engineering/.claude-plugin/plugin.json | jq .` +- [x] Verify skill count matches description: `SKILL_COUNT=$(ls -d plugins/compound-engineering/skills/*/ | wc -l) && grep -q "$SKILL_COUNT skill" 
plugins/compound-engineering/.claude-plugin/plugin.json` +- [x] Verify command count matches description: `CMD_COUNT=$(ls plugins/compound-engineering/commands/*.md plugins/compound-engineering/commands/workflows/*.md | wc -l) && grep -q "$CMD_COUNT command" plugins/compound-engineering/.claude-plugin/plugin.json` +- [x] Verify SKILL.md line count: `SKILL_LINES=$(wc -l < plugins/compound-engineering/skills/user-test/SKILL.md) && [ "$SKILL_LINES" -le 500 ] && echo "OK: $SKILL_LINES lines" || echo "FAIL: $SKILL_LINES lines (max 500)"` +- [x] Verify SKILL.md frontmatter compliance: `name: user-test`, single-line description with trigger keywords +- [x] Verify reference files are linked with proper markdown links (not backtick references) +- [ ] Run `claude /release-docs` to regenerate all docs site pages + +### Functional Requirements + +- [ ] `/user-test tests/user-flows/checkout.md` — loads existing test file, runs phases 0-4 (score + report) +- [ ] `/user-test "Test the checkout flow"` — creates new test file from description, runs phases 0-4 +- [ ] `/user-test-commit` — applies results from last run: updates maturity map, files issues, appends history +- [ ] `/user-test-iterate tests/user-flows/checkout.md 5` — runs the scenario 5 times, reports consistency +- [ ] Maturity model correctly promotes (2+ consistent passes with agent judgment) and demotes (functional regression with agent judgment) +- [ ] Issues include `user-test:` label; dedup uses `--label` flag first, semantic fallback second +- [ ] Test file template created for new scenarios with all required sections including `schema_version: 1` +- [ ] `tests/user-flows/test-history.md` appended after each run (rotation: keep last 50 entries, includes quality avg + pass rate + disconnects + demo_readiness + key finding) +- [ ] Test file path validated to stay within `tests/user-flows/` (no directory traversal) +- [ ] Iterate mode: N capped at 10 by default, error on N=0, N=1 valid +- [ ] Iterate mode: reset 
between runs = full page reload to app entry URL (limitations: IndexedDB, SW caches, HttpOnly cookies not cleared) +- [ ] Iterate mode: partial run handling (disconnects mid-iterate produce valid partial results) +- [ ] `test-history.md` includes `demo_readiness` column (yes/no/partial) persisted each run +- [ ] Explore Next Run items include priority (P1/P2/P3); Phase 3 picks highest priority first +- [ ] Area granularity: worked example in test-file-template.md showing 1-3 interactions per area +- [ ] Phase 2 environment sanity check: verifies app loads with expected content before test execution +- [ ] Given a new scenario, full pipeline (phases 0-4 + commit) produces: test file with schema_version: 1, quality score, maturity map, and summary — all without manual intervention beyond initial command +- [ ] Given a test file with an Uncharted area, after iterate N=3 where all runs score >= 4, the area's maturity status is Proven + +### Security Requirements + +- [ ] Test file path resolution prevents directory traversal +- [ ] No credentials (passwords, tokens, session IDs) persisted in any output file +- [ ] Issue body content sanitized before `gh issue create` +- [ ] `user-test:` label convention documented for duplicate detection + +## SKILL.md Content Outline + +The skill contains 5-phase execution logic (run + score) plus a commit mode (update files + file issues), with references to supporting files: + + +``` +--- +name: user-test +description: Run browser-based user testing via claude-in-chrome MCP with quality scoring and compounding test files. Use when testing app quality from a real user's perspective, scoring interactions, tracking test maturity, or filing issues from test sessions. +argument-hint: "[scenario-file-or-description]" +--- + +# User Test + +Exploratory testing in a visible Chrome window. You watch the test happening +in real-time and can intervene if needed. 
Claude shares your browser's login +state — sign into your app in Chrome before running. + +For automated headless regression testing, use /test-browser instead. + +**v1 limitation:** This skill targets localhost / local dev server apps. External +or staging URLs are not validated for deployment status — if you test against a +remote URL, verify it's live and accessible before running. + +## Phase 0: Preflight +[Validate: claude-in-chrome MCP available (if not: "Run /chrome"), WSL detection, +site permissions, gh auth status, app URL resolvable] + +## Phase 1: Load Context +[Resolve test file from path/description, validate path stays within tests/user-flows/, +extract maturity map + history, validate schema_version] +[If no argument: scan tests/user-flows/ for test files, present list, or prompt for description] +[If test file corrupted: offer to regenerate from template] + +## Phase 2: Setup +[Ensure user is signed into target app in Chrome (shared login state), +take baseline screenshot] +[If login page or CAPTCHA encountered: pause for manual handling] +[Environment sanity check: navigate to app URL, verify page loaded with expected content +(not an error page, not a stale auth redirect, not an empty state). If the app loads but +shows error banners, API failures, or empty data that should be populated — abort with +"App environment issue detected" rather than producing misleading quality scores] + +## Phase 3: Execute +[Maturity-guided selection (agent judgment, not mechanical counters), +Proven areas: quick spot-check (max 3 MCP calls), +Uncharted areas: full investigation with batched javascript_tool calls, +Known-bug areas: skip entirely] +[Connection resilience: retry once with 3s delay, then /chrome reconnect guidance] +[If all areas Proven: spot-check all, suggest new scenarios in "Explore Next Run"] +[Explore Next Run items have priority: P1 (likely user-facing friction), P2 (edge case), +P3 (curiosity). 
Pick highest-priority uncharted items first, not FIFO] +[Modal dialog detection: instruct user to dismiss manually] + +## Phase 4: Score and Report +[Quality scoring 1-5 using calibration rubric per scored interaction unit] +[A scored interaction unit = one user-facing task completion (e.g., "add item to cart", +"submit shipping form", "complete payment"). Navigation steps, page loads, and setup +actions are not scored individually — they are part of the interaction they serve.] +[Scores are ABSOLUTE per rubric, not relative to scenario framing.] +[Output: run summary block with per-area scores, disconnect count, overall quality avg] +[If run is interrupted before scoring completes, do NOT produce committable output — +partial runs must not corrupt maturity state] + +## Commit Mode +[Invoked separately via /user-test-commit after reviewing run results] +[Maturity updates using agent judgment, run history, promotion/demotion with rubric] +[Atomic write: write to .tmp then rename] +[History rotation: keep last 50 entries in test-history.md] +[Include structured label `user-test:` on every issue] +[Duplicate detection: `gh issue list --label "user-test:" --state open` for +exact match; fall back to semantic title search only if no label match found] +[Sanitize issue body content, skip gracefully if gh not authenticated] +[Never persist credentials in issue bodies or test files] +[Persist demo_readiness (yes/no/partial) in test-history.md alongside quality scores] + +## Iterate Mode +See [iterate-mode.md](./references/iterate-mode.md) for details. +N capped at 10 (default), N=0 is error, N=1 valid. +State clearing limitations documented. +Partial run handling: if disconnect mid-iterate, write results for completed runs and report +"Completed M of N runs" — partial results are valid and maturity updates apply. +Output format: per-run scores table + aggregate consistency metrics + maturity transitions. 
+ +## Test File Template +See [test-file-template.md](./references/test-file-template.md). + +## Browser Input Patterns +See [browser-input-patterns.md](./references/browser-input-patterns.md). +``` + +## Thin Wrapper Command Templates + +### `commands/user-test.md` + + +```markdown +--- +name: user-test +description: Run browser-based user testing with quality scoring and compounding test files +disable-model-invocation: true +allowed-tools: Skill(user-test) +argument-hint: "[scenario-file-or-description]" +--- + +Invoke the user-test skill for: $ARGUMENTS +``` + +### `commands/user-test-iterate.md` + + +```markdown +--- +name: user-test-iterate +description: Run the same user test scenario N times to measure consistency +disable-model-invocation: true +allowed-tools: Skill(user-test) +argument-hint: "[scenario-file] [n]" +--- + +Invoke the user-test skill in iterate mode for: $ARGUMENTS +``` + +### `commands/user-test-commit.md` + + +```markdown +--- +name: user-test-commit +description: Commit user-test results — update test file maturity map, file issues, append history +disable-model-invocation: true +allowed-tools: Skill(user-test) +--- + +Invoke the user-test skill in commit mode for the last completed run. +``` + +## SpecFlow Analysis -- Gaps Addressed in Implementation + +The SpecFlow analyzer identified gaps. 
Here is how the implementation addresses each genuine gap:
+
+| Gap | Resolution |
+|-----|-----------|
+| No-argument behavior | Phase 1: scan `tests/user-flows/` for test files, present list, or prompt for description |
+| MCP not connected | Phase 0 preflight: check MCP availability, instruct to run `/chrome` or restart with `claude --chrome` |
+| gh not authenticated | Commit mode: check `gh auth status` before creating issues, skip gracefully if not authenticated |
+| Test file corruption | Phase 1: validate required sections and schema_version, offer to regenerate from template if missing |
+| All areas Proven | Phase 3: spot-check all Proven areas, add note suggesting new scenarios in "Explore Next Run" |
+| N=0 for iterate | Iterate mode: treat N=0 as error, require N >= 1, cap N <= 10. N=1 is valid (single run with consistency tracking) |
+| State between iterate runs | Iterate mode: full page reload to app entry URL between each run. Document limitation: does not clear IndexedDB, service worker caches, or HttpOnly cookies |
+| Preflight check | Phase 0: validates MCP, gh, app URL, WSL detection, site permissions, Windows named pipe conflicts |
+| Authentication/login | Phase 2: leverage shared browser login state. User signs in once in Chrome. If CAPTCHA encountered, Claude pauses for manual handling |
+
+## Dependencies & Risks
+
+| Risk | Mitigation |
+|------|-----------|
+| `claude-in-chrome` MCP may not be installed | Phase 0 preflight check with specific "/chrome" instructions |
+| Extension service worker goes idle during extended sessions | Retry once with 3s delay, then specific "/chrome Reconnect" guidance. Abort after 3 cumulative disconnects. |
+| File upload not supported | Explicit `MANUAL ONLY` marking in test file template. Workaround: pause user-test and use `/agent-browser` for upload steps. |
+| SKILL.md growth past 500 lines | References/ extraction from day one.
Validation gate: `wc -l < SKILL.md` must be <= 500 | +| Component count drift | Dynamic count validation in acceptance criteria (count files, verify descriptions match) | +| Test history unbounded growth | Rotation: keep last 50 entries in test-history.md | +| Modal dialogs block browser commands | Detection guidance in Phase 3, instruct user to dismiss manually | +| WSL environment | Preflight detection and abort with clear message | +| Windows named pipe conflicts | Preflight detection with "close other Claude Code sessions" guidance | +| Directory traversal via test file path | Path validation in Phase 1: resolved path must start with `tests/user-flows/` | +| External/staging app not deployed or stale | v1 targets localhost/local dev. Document limitation: no deployment verification for remote URLs. User must verify external apps are live before testing. | +| App loads but environment is broken (stale auth, empty data, API 500s) | Phase 2 environment sanity check: navigate + content assertion before test execution. 
Abort with "App environment issue" rather than producing misleading scores | +| Issue dedup fails on different descriptions of same bug | Structured `user-test:` label on every issue for exact-match dedup via `--label`; semantic search as fallback only | + +## Success Metrics + +- Skill loads and executes without errors on first invocation +- Test file is correctly created from description with `schema_version: 1` +- Maturity model state transitions work across 3+ consecutive runs using agent judgment +- No duplicate GitHub issues created across iterate runs +- SKILL.md <= 500 lines (enforced by validation gate) +- All component counts match across plugin.json, marketplace.json, and README.md (verified dynamically) +- **Compounding metric**: After 3 runs on the same scenario, Proven area count > 0 and total test duration decreases (spot-checks are faster than full investigations) + +## Sources & References + +### Internal References +- Thin wrapper pattern: `plugins/compound-engineering/commands/deepen-plan.md:1-9` +- Skill structure: `plugins/compound-engineering/skills/deepen-plan/SKILL.md` (frontmatter + phases pattern) +- Browser automation: `plugins/compound-engineering/skills/agent-browser/SKILL.md` (MCP tool reference) +- Existing test command: `plugins/compound-engineering/commands/test-browser.md` (distinct tool set) +- Plugin checklist: `CLAUDE.md` "Adding a New Skill" section +- Anti-patterns: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md` +- Versioning: `docs/solutions/plugin-versioning-requirements.md` + +### Conventions Applied +- Skill compliance: name matches directory, single-line description with trigger keywords +- Thin wrapper: `allowed-tools: Skill(user-test)`, `disable-model-invocation: true` +- Version bump: MINOR for new functionality (dynamic — count at implementation time) +- CHANGELOG: Keep a Changelog format with `### Added` section +- Reference files linked with proper markdown links: 
`[filename.md](./references/filename.md)`
diff --git a/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md b/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md
new file mode 100644
index 000000000..1797da0d1
--- /dev/null
+++ b/docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md
@@ -0,0 +1,424 @@
+---
+title: "User-Test Skill Revision: Timing, Qualitative Summaries, Delta Tracking, and More"
+type: feat
+status: completed
+date: 2026-02-28
+---
+
+# User-Test Skill Revision
+
+Based on 7 rounds of iterative testing and real production test results.
+
+## Overview
+
+Revise the `user-test` skill to add timing tracking, qualitative summaries, delta regression detection, explore-next-run generation, optional CLI mode, output quality scoring, and conditional regression checks. These changes address gaps discovered during real-world usage — timing regressions went unnoticed, structured scores missed qualitative signal, and ~60% of bugs were agent reasoning errors catchable without a browser.
+
+## Problem Statement / Motivation
+
+The current skill scores UX quality but misses five signals that real testing revealed:
+1. **Performance blind spot** — response time regressed 15s to 28s across 5 runs, unnoticed until manually tracked
+2. **Qualitative signal loss** — "4.2/5 average" doesn't answer "should we demo tomorrow?"
+3. **Regression hiding** — absolute numbers mask run-over-run quality changes
+4. **Stale explore-next-run** — the section stays empty because the skill doesn't generate items proactively
+5. **Browser-only bottleneck** — most agent reasoning bugs don't need a browser to catch
+
+## Proposed Solution
+
+10 changes in 3 priority tiers, scoped to stay within the 500-line SKILL.md budget.
+
+### Prerequisite: Schema Migration Strategy
+
+Multiple changes (1A, 1C, 2B, 2C) add columns to the test file template. Existing `schema_version: 1` files must not break.
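To make the non-breaking constraint concrete, here is a minimal sketch of a tolerant read path. It is illustrative only: the column names mirror the template tables in this plan, and the dash used for empty cells matches the plan's empty marker.

```python
# Sketch: tolerant v1 -> v2 read of the areas table (names illustrative).
V2_COLUMNS = ["Area", "Status", "Last Score", "Last Quality", "Last Time",
              "Consecutive Passes", "Notes"]

def parse_areas_table(lines):
    """Parse a markdown areas table into row dicts keyed by header cell."""
    header = [c.strip() for c in lines[0].strip("|\n").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [c.strip() for c in line.strip("|\n").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

def upgrade_row(row):
    """Fill v2 columns missing from a v1 row with the empty marker.

    Unknown columns (e.g. from a future v3) pass through untouched,
    so a later write-back preserves them.
    """
    upgraded = {col: row.get(col, "—") for col in V2_COLUMNS}
    upgraded.update({k: v for k, v in row.items() if k not in V2_COLUMNS})
    return upgraded

v1_table = [
    "| Area | Status | Last Score | Consecutive Passes | Notes |\n",
    "|------|--------|------------|--------------------|-------|\n",
    "| checkout | Proven | 4 | 3 | stable |\n",
]
rows = [upgrade_row(r) for r in parse_areas_table(v1_table)]
```

The in-memory upgrade never rewrites the file on read; only commit mode writes the v2 shape back.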
+
+**Approach:**
+- Bump to `schema_version: 2` in the template
+- Phase 1 (Load Context): when reading a v1 file, add missing columns with empty/default values in memory — do NOT rewrite the file
+- Commit mode: when writing back, upgrade the file to v2 schema (adds new columns, preserves all existing data)
+- Forward compatibility: the reader tolerates unknown frontmatter fields (from a future v3) by ignoring them. Unknown table columns are preserved on write.
+- The "offer to regenerate" recovery path remains for genuinely corrupted files only
+
+### Prerequisite: Run Results Persistence
+
+The current skill relies on the agent's context window to pass run results from `/user-test` to `/user-test-commit`. With more data dimensions (timing, dual scores, qualitative notes), this becomes fragile.
+
+**Approach:**
+- After Phase 4 completes, write a `.user-test-last-run.json` file in `tests/user-flows/` containing: scenario slug, per-area scores (UX + optional quality), timing, qualitative summary, issues to file, maturity assessments
+- `/user-test-commit` reads this file instead of relying on context
+- The file is overwritten on each run (only last run is committable)
+- Add `.user-test-last-run.json` to the project's `.gitignore` guidance in Phase 1
+
+**Stale/missing file handling for `/user-test-commit`:**
+- **Missing file:** If `.user-test-last-run.json` doesn't exist, abort with "No run results found. Run `/user-test` first."
+- **Stale file:** The file includes a `run_timestamp` (ISO 8601). If the timestamp is older than 24 hours, warn: "Run results are from `{run_timestamp}`. Commit anyway? (y/n)." If older than 7 days, abort with "Run results too old — re-run `/user-test` first."
+- **Partial run:** The file includes a `completed: true|false` flag. If `false`, abort with "Last run was incomplete. Run `/user-test` again for committable results."
+- **No context fallback:** Commit mode never falls back to context window. The JSON file is the single source of truth.
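The gating rules above reduce to one decision function. This is a sketch: the field names (`run_timestamp`, `completed`) follow the plan, while the return labels are illustrative.

```python
# Sketch of commit-mode gating for .user-test-last-run.json.
from datetime import datetime, timedelta, timezone

def commit_gate(results, now=None):
    """Decide whether commit mode may proceed on the last-run payload."""
    if results is None:  # missing file
        return "abort", "No run results found. Run /user-test first."
    if not results.get("completed", False):  # partial run
        return "abort", "Last run was incomplete. Run /user-test again."
    now = now or datetime.now(timezone.utc)
    age = now - datetime.fromisoformat(results["run_timestamp"])
    if age > timedelta(days=7):
        return "abort", "Run results too old — re-run /user-test first."
    if age > timedelta(hours=24):
        return "warn", f"Run results are from {results['run_timestamp']}. Commit anyway? (y/n)"
    return "ok", ""
```

Note the ordering: the `completed` flag is checked before staleness, so a fresh-but-partial run still aborts rather than warns.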
+ +## Priority 1: High Impact, Low Effort + +### 1A. Timing Tracking + +**Files:** SKILL.md Phase 3 + Phase 4, test-file-template.md + +**Change:** Measure wall-clock time per area (start timestamp before first MCP call, end after last). Record in seconds. + +**Template change — Areas table:** +``` +| Area | Status | Last Score | Last Time | Consecutive Passes | Notes | +``` + +**Report output adds Time column:** +``` +| Area | Status | Score | Time | Assessment | +``` + +**Edge cases:** +- Partial area (disconnect mid-area): record time as `—` (incomplete), do not include in averages +- Timing includes async waits — this is intentional (slow is slow, regardless of cause) + +**SKILL.md budget:** +8 lines + +### 1B. Qualitative Summary in Report Output + +**Files:** SKILL.md Phase 4 Report Output + +**Add after the scores table:** +``` +Qualitative: +- Best moment: +- Worst moment: +- Demo ready: yes / partial / no +- One-line verdict: +``` + +**Persistence:** These fields are written to `.user-test-last-run.json` for commit mode. `demo_readiness`, `verdict`, and a brief `context` note are persisted to `test-history.md` during commit. The `context` field is a one-phrase explanation of *why* (e.g., "search results loading 28s" alongside verdict "partial") — without it, verdicts become ambiguous after a few weeks. `best_moment` and `worst_moment` are ephemeral (report-only) — they inform the human reviewer but don't need historical tracking. + +**Edge cases:** +- All areas score the same: pick the area that was most/least expected +- Only one area tested: best and worst are the same — write one line + +**SKILL.md budget:** +10 lines + +### 1C. 
Delta Tracking in Run History + +**Files:** SKILL.md Commit Mode, test-file-template.md + +**Change:** When appending to `test-history.md`, compute delta from the most recent *completed* previous run: +``` +| Date | Quality Avg | Delta | Key Finding | Context | +| 2/26 | 4.86 | +0.15 | Exclusion filters working | | +| 2/24 | 4.71 | -0.18 | Forest green regression | color picker CSS regression | +``` + +Flag any delta worse than -0.5 with a warning in the commit output. + +**Edge cases:** +- First run ever: delta is `—` (no baseline) +- Previous run was partial: skip to the last complete run +- Different area sets between runs: compute avg over only areas present in BOTH runs. If no overlap, delta is `—`. **Known limitation:** if area sets drift significantly over time (adding 3, removing 2), the delta is computed over a shrinking overlap and may look stable even when new areas perform poorly. Acceptable for now — flag for revisit if delta becomes unreliable in practice. +- Iterate mode: delta is computed between the iterate session's aggregate and the previous non-iterate run. Per-iteration deltas within a session are NOT computed (they are noise, not signal) +- Iterate mode output includes per-run timing and timing variance alongside score variance. A consistent 28s is fine; wild swings between 5s and 45s indicate flakiness worth investigating. + +**SKILL.md budget:** +10 lines + +### 1D. Explore-Next-Run Generation Guidance + +**Files:** SKILL.md Phase 4 + +**Add:** After scoring, explicitly generate 2-3 Explore Next Run items with priority: +- **P1** — Things that surprised you (positive or negative) +- **P2** — Edge cases adjacent to tested areas +- **P3** — Interactions started but not finished, or borderline scores (score of 3) + +A "borderline" score is any area scoring 3/5 — warrants deeper investigation next run regardless of maturity status. + +**SKILL.md budget:** +8 lines + +## Priority 2: Medium Impact, Medium Effort + +### 2A. 
Optional CLI Mode + +**Files:** SKILL.md (new Phase 2.5), test-file-template.md (frontmatter) + +**Test file frontmatter addition:** +```yaml +--- +cli_test_command: "node scripts/test-cli.js --query '{query}'" # optional +cli_queries: # optional + - query: "queen bed hot sleeper" + expected: "cooling materials, percale or linen" + - query: "something nice" + expected: "asks clarifying questions" +--- +``` + +**SKILL.md addition — Phase 2.5: CLI Testing** + +If the test file defines `cli_test_command`: +1. Skip Phase 0 MCP preflight (CLI doesn't need chrome). Run `gh auth status` check only. +2. Skip Phase 2 browser setup entirely +3. For each query in `cli_queries`: run the command via Bash, capture stdout +4. Score output quality 1-5 against the `expected` field using the **output quality rubric** (see 2B). The agent evaluates whether the CLI output satisfies the expected description semantically — this is NOT exact string matching. The `expected` field describes what a correct response looks like, and the agent applies the output quality rubric to judge. +5. CLI results feed into the same maturity map and scoring pipeline +6. If BOTH `cli_queries` and browser areas exist in the test file: run CLI first. If CLI reveals broken agent logic (scores <= 2), skip browser testing for overlapping areas with a note "CLI pre-check failed — skipping browser test." + +**Overlap detection is explicit, not agent-inferred.** Each CLI query can optionally tag the browser area it pre-checks: +```yaml +cli_queries: + - query: "queen bed hot sleeper" + expected: "cooling materials, percale or linen" + prechecks: "search-results" # area slug — skip this browser area on CLI failure + - query: "something nice" + expected: "asks clarifying questions" + # no prechecks tag — CLI-only, no browser area overlap +``` +If `prechecks` is present and the CLI query scores <= 2, the tagged browser area is skipped. 
If `prechecks` is absent, the CLI query is standalone — no browser areas are skipped regardless of score. This eliminates fuzzy semantic matching at runtime. + +**Credential handling:** The `cli_test_command` runs as a Bash command inheriting the shell environment. No credentials are stored in the test file. If the command needs env vars, the user sets them in their shell before running `/user-test`. + +**Iterate mode:** CLI iterate resets by simply re-running the command (no browser reload needed). If the command has side effects (DB writes), document this limitation in iterate-mode.md. + +**SKILL.md budget:** +30 lines (extract to `references/cli-mode.md` if it exceeds 35) + +### 2B. Output Quality Scoring Dimension + +**Files:** SKILL.md Phase 4 Scoring, test-file-template.md + +**Change:** Areas can optionally have `scored_output: true` in their area details. When set, score TWO dimensions: + +| Dimension | Rubric | When to use | +|-----------|--------|-------------| +| **UX score (1-5)** | Existing rubric (broken → delightful) | Always | +| **Quality score (1-5)** | Output correctness rubric (below) | Only when `scored_output: true` | + +**Output Quality Rubric:** + +| Score | Meaning | Example | +|-------|---------|---------| +| 5 | Exactly what an expert would produce | Right products, right reasoning | +| 4 | Relevant, minor misses | Mostly right, one irrelevant result | +| 3 | Partially correct | Some right, some wrong | +| 2 | Mostly wrong | Misunderstood intent | +| 1 | Completely wrong | Wrong category, hallucinated data | + +**Report shows both:** `UX: 4/5, Quality: 3/5` + +**Aggregation rules:** +- `Quality Avg` in run history = average of UX scores only (maintains backward compatibility for areas without `scored_output`) +- **Promotion gate for `scored_output: true` areas:** UX >= 4 AND Quality >= 3. A beautiful UI showing wrong results should not promote to Proven. 
+- **Promotion gate for standard areas:** UX >= 4 only (unchanged from v1)
+- Quality score tracked as `Output Avg` in the report for visibility
+- Known-bug filing: trigger on UX <= 2 (functional failure) OR Quality <= 1 (completely wrong output)
+
+**Template change — Areas table:**
+```
+| Area | Status | Last Score | Last Quality | Last Time | Consecutive Passes | Notes |
+```
+(`Last Quality` column only populated for areas with `scored_output: true`)
+
+**SKILL.md budget:** +15 lines
+
+### 2C. Conditional Regression Checks for Known-Bug Areas
+
+**Files:** SKILL.md Phase 3, test-file-template.md
+
+**Test file area detail addition:**
+```markdown
+### cart-quantity-update
+**Status:** Known-bug
+**Issue:** #47
+**Fix check:** Verify quantity updates in <5s and cart badge reflects new count
+```
+
+**SKILL.md Phase 3 addition:**
+
+When encountering a Known-bug area:
+1. If `gh` is not authenticated: skip as normal (no change)
+2. Check if the linked issue is closed: `gh issue view <issue-number> --json state -q '.state'`
+3. If `closed`: flip area to Uncharted, run the `fix_check` as the first test for that area
+4. If `open`: skip as normal
+5. If the fix check fails (score <= 2): file a new issue with note "Regression of #N" in the body referencing the original closed issue for traceability. The dedup check (`--state open`) won't find the closed issue, so a new issue is created — this is correct behavior.
+
+**Template change:** Known-bug areas store `**Issue:** #N` in their area details section. This is the canonical reference for `gh issue view`.
+
+**SKILL.md budget:** +15 lines
+
+## Priority 3: Nice to Have
+
+### 3A.
Async Wait Pattern + +**Files:** browser-input-patterns.md only (no SKILL.md change) + +**Add:** +```javascript +// Wait for async operation completion +mcp__claude-in-chrome__javascript_tool({ + code: ` + (async () => { + const start = Date.now(); + const timeout = 10000; + const selector = '.success-message'; + while (Date.now() - start < timeout) { + if (document.querySelector(selector)) return 'found'; + await new Promise(r => setTimeout(r, 200)); + } + return 'timeout'; + })() + ` +}) +``` + +**SKILL.md budget:** 0 lines + +### 3B. Performance Threshold Configuration + +**Files:** test-file-template.md frontmatter, SKILL.md Phase 4 + +**Frontmatter addition:** +```yaml +--- +performance_thresholds: # optional, seconds + fast: 2 + acceptable: 8 + slow: 20 + broken: 60 +--- +``` + +**Scoring integration:** If thresholds are defined, append a timing grade to each area's assessment in the report: `(fast)`, `(acceptable)`, `(slow)`, `(BROKEN)`. A `broken` timing grade is a finding worth noting but does NOT affect the UX score — timing and quality are separate dimensions. + +**Measurement:** Wall-clock time from 1A. No browser performance API needed. + +**SKILL.md budget:** +8 lines + +### 3C. End-to-End Unscripted Scenario Type — DEFERRED + +**Rationale for deferral:** The SpecFlow analysis identified fundamental conflicts with the maturity model. Unscripted runs produce emergent areas that don't map to stable slugs, breaking consecutive-pass tracking, issue label convention, and iterate mode compatibility. This needs a separate design pass (possibly a distinct mode with its own output format) rather than being retrofitted into the existing area-based model. + +**Interim alternative:** Users can approximate unscripted testing by creating a test file with broad areas (e.g., `first-time-onboarding`) and giving the agent latitude in the area description. This gets 80% of the value without the architectural conflict. 
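The 3B threshold grading can be sketched as a small lookup. One assumption worth flagging: the frontmatter does not say how times between the `slow` and `broken` bounds should read, so this sketch keeps them `(slow)` until the `broken` threshold is crossed.

```python
# Sketch: map an area's wall-clock seconds to a report timing annotation
# using the optional performance_thresholds frontmatter. Keys mirror the
# plan; inclusive-upper-bound band semantics are an assumption.
DEFAULT_THRESHOLDS = {"fast": 2, "acceptable": 8, "slow": 20, "broken": 60}

def timing_grade(seconds, thresholds=DEFAULT_THRESHOLDS):
    """Return the timing annotation; it never feeds into the UX score."""
    if seconds is None:  # partial area (disconnect): no timing, no grade
        return ""
    for grade in ("fast", "acceptable", "slow"):
        if seconds <= thresholds[grade]:
            return f"({grade})"
    # Past the slow bound: still (slow) until the broken threshold is crossed
    return "(BROKEN)" if seconds >= thresholds["broken"] else "(slow)"
```

On the default thresholds, the 28-second search regression that motivated 1A would annotate as `(slow)`.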
+ +## Technical Considerations + +### SKILL.md Budget Impact + +| Change | Lines Added | Cumulative | +|--------|-----------|------------| +| Current | 0 | 192 | +| 1A Timing | +8 | 200 | +| 1B Qualitative | +10 | 210 | +| 1C Delta | +10 | 220 | +| 1D Explore | +8 | 228 | +| 2A CLI mode | +30 | 258 | +| 2B Quality scoring | +15 | 273 | +| 2C Regression checks | +15 | 288 | +| 3B Thresholds | +8 | 296 | +| **Total** | **+104** | **~296** | + +Well within the 500-line budget. If CLI mode grows beyond 35 lines during implementation, extract to `references/cli-mode.md`. + +### File Change Summary + +| File | Changes | +|------|---------| +| `SKILL.md` | +104 lines: timing in Phase 3-4, qualitative summary in Phase 4, delta in Commit Mode, explore-next generation in Phase 4, CLI Phase 2.5, output quality rubric, regression checks in Phase 3, threshold eval in Phase 4 | +| `test-file-template.md` | Schema v2: new columns (Last Time, Last Quality), `cli_test_command`/`cli_queries` frontmatter, `performance_thresholds` frontmatter, Known-bug `Issue:` field, `fix_check` field | +| `browser-input-patterns.md` | +15 lines: async wait pattern | +| `iterate-mode.md` | +8 lines: CLI iterate reset note, timing per run in output table, timing variance alongside score variance | +| Commands | No changes (thin wrappers unchanged) | + +### Backward Compatibility + +- v1 test files work unchanged — missing columns filled with defaults on read +- v1 files upgraded to v2 on next commit (non-destructive) +- CLI mode is opt-in (no `cli_test_command` = no CLI testing) +- Quality scoring is opt-in (`scored_output: true` per area) +- Performance thresholds are opt-in (no frontmatter = no timing grades) + +## Acceptance Criteria + +### P1 Changes + +- [x] 1A: Report output includes `Time` column per area +- [x] 1A: Test file template has `Last Time` column in areas table +- [x] 1A: Partial area timing recorded as `—` +- [x] 1B: Report output includes qualitative summary (best moment, worst 
moment, demo ready, verdict) +- [x] 1B: `demo_readiness` and `verdict` persist to test-history.md via commit mode +- [x] 1C: Run history includes `Delta` column computed from last complete run +- [x] 1C: Delta worse than -0.5 flagged with warning +- [x] 1C: First run shows delta as `—` +- [x] 1C: Iterate mode computes delta vs. pre-session baseline only +- [x] 1D: Phase 4 generates 2-3 Explore Next Run items with P1/P2/P3 priority +- [x] 1D: Borderline (score 3) areas flagged for deeper investigation + +### P2 Changes + +- [x] 2A: Test files with `cli_test_command` run CLI queries before browser testing +- [x] 2A: CLI mode skips Phase 0 MCP preflight and Phase 2 browser setup +- [x] 2A: CLI queries use explicit `prechecks` tag for browser area overlap (no agent-inferred matching) +- [x] 2A: No credentials stored in test file +- [x] 2B: Areas with `scored_output: true` show dual scores (UX + Quality) +- [x] 2B: Quality Avg in history = UX scores only (backward compatible) +- [x] 2B: Known-bug trigger: UX <= 2 OR Quality <= 1 +- [x] 2C: Known-bug areas with closed issues auto-flip to Uncharted +- [x] 2C: Fix check runs as first test for re-opened areas +- [x] 2C: Issue number stored in area details (`**Issue:** #N`) + +### P3 Changes + +- [x] 3A: Async wait pattern documented in browser-input-patterns.md +- [x] 3B: Optional `performance_thresholds` frontmatter evaluates timing grades +- [x] 3C: Deferred — documented as future work + +### Prerequisites + +- [x] Schema migration: v1 files read without error, upgraded to v2 on commit +- [x] Forward compatibility: v2 reader tolerates unknown frontmatter fields from future schema versions (ignore, don't error) +- [x] Run results persistence: `.user-test-last-run.json` written after Phase 4, read by commit mode +- [x] `.user-test-last-run.json` added to `.gitignore` guidance +- [x] Commit mode aborts if `.user-test-last-run.json` missing, incomplete, or >7d stale +- [x] Commit mode warns if `.user-test-last-run.json` >24h 
old +- [x] `verdict` persists with `context` note to test-history.md + +### Post-Change Validation + +- [x] SKILL.md <= 500 lines after all changes (313 lines) +- [x] All reference file links use proper markdown format +- [x] Existing v1 test files load without error +- [x] Version bump in plugin.json, marketplace.json, CHANGELOG.md (2.36.0) +- [x] Reinstall to `~/.claude/skills/user-test/` and `~/.claude/commands/` + +## Dependencies & Risks + +| Risk | Mitigation | +|------|-----------| +| SKILL.md exceeds 500 lines | Extract CLI mode to references/cli-mode.md if >35 lines | +| v1 test files break with new columns | Schema migration reads v1, upgrades on commit | +| Run results lost between sessions | `.user-test-last-run.json` persists results to disk | +| CLI command has side effects on iterate | Document limitation in iterate-mode.md | +| Delta misleading with changing area sets | Compute delta over overlapping areas only | +| 3C unscripted conflicts with maturity model | Deferred — needs separate design | +| Stale `.user-test-last-run.json` committed | Timestamp check: warn >24h, block >7d, block if incomplete | + +## Implementation Sequence + +1. **Prerequisites first** — schema migration logic + `.user-test-last-run.json` persistence +2. **P1 changes** (1A, 1B, 1C, 1D) — all low effort, high value +3. **P2 changes** (2A, 2B, 2C) — medium effort, build on P1 foundations +4. **P3 changes** (3A, 3B) — nice-to-have, zero risk +5. **Reinstall** — copy updated files to `~/.claude/skills/` and `~/.claude/commands/` +6. 
**Validate** — run `/user-test` against a test scenario to verify + +## Sources & References + +### Internal References +- Current SKILL.md: `plugins/compound-engineering/skills/user-test/SKILL.md` (192 lines) +- Current template: `plugins/compound-engineering/skills/user-test/references/test-file-template.md` (81 lines) +- Current patterns: `plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md` (54 lines) +- Current iterate: `plugins/compound-engineering/skills/user-test/references/iterate-mode.md` (65 lines) +- Skill size budget: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md` +- Original plan: `docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md` + +### Conventions Applied +- Schema versioning for forward compatibility +- SKILL.md 500-line budget with reference extraction fallback +- Thin wrapper commands unchanged (no new commands needed) +- Backward-compatible template migration (read v1, write v2) diff --git a/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md b/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md new file mode 100644 index 000000000..050135db6 --- /dev/null +++ b/docs/plans/2026-02-28-feat-user-test-v3-ux-intelligence-plan.md @@ -0,0 +1,453 @@ +--- +title: "Evolve user-test into a compounding UX intelligence system" +type: feat +status: completed +date: 2026-02-28 +origin: User vision document (inline, 2026-02-28) + 7 rounds of real testing on Resale Clothing Shop +prior_art: + - docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md (v1, completed) + - docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md (v2, completed) +--- + +# Evolve user-test into a Compounding UX Intelligence System + +## Overview + +Transform `/user-test` from a browser test runner into a **compounding UX intelligence system** — one that explores, regresses, and gets smarter with every run. 
The current v2 skill (321 lines) has the right foundations (maturity model, scoring rubric, CLI+browser layers, auto-commit). This plan closes 6 specific gaps identified after 7 real test runs, adds a new UX Opportunities signal category, and wires up the compounding loop so discoveries graduate from browser exploration to CLI regression checks. + +**What v3 adds that v2 doesn't have:** +1. Bug registry (`bugs.md`) with open/fixed/regressed lifecycle +2. Per-area score history (not just top-level delta) +3. Structured skip reasons (untested-by-choice vs. infrastructure-failure) +4. Explicit pass thresholds per area +5. Queryable qualitative data (area-tagged best/worst moments) +6. Discovery-to-regression graduation (browser findings become CLI checks) +7. UX Opportunities — suggestions, not failures — as a new report section + +## Problem Statement / Motivation + +After 7 real test runs, the skill produces useful signal but doesn't compound it efficiently: + +- **Bugs evaporate.** A bug found in run 3 and fixed in run 5 has no persistent record linking the discovery to the fix. If it regresses in run 8, the skill doesn't know it's a regression — it just finds a "new" bug. +- **Delta is top-level only.** "Quality went from 3.5 → 4.0" doesn't tell you which area improved. Per-area score history is needed to answer "did the shipping form fix actually help?" +- **Disconnects are invisible.** Three disconnects in a run produce null scores with "extension disconnected" buried in assessment text. There's no machine-readable way to distinguish "skipped because Proven" from "skipped because Chrome crashed." +- **Pass thresholds are implicit.** `consecutive_passes` exists but what counts as a pass varies by area. A search results area with `scored_output: true` needs UX >= 4 AND Quality >= 3, but this threshold lives only in the agent's head. 
+- **Qualitative signal is write-once.** "Best moment: agent search is excellent" appears in one run's JSON but can't be queried over 20 runs to surface patterns like "agent/search has been the best moment 8 of 10 times." +- **The flywheel doesn't close.** Browser discoveries don't become CLI regression checks. The same bug can silently regress without the fast layer catching it. + +## Proposed Solution + +### Phase 1: Bug Registry (Gap #1) + +Add `tests/user-flows/bugs.md` — a persistent, machine-readable bug tracker that complements GitHub Issues. + +```markdown +# Bug Registry + +| ID | Area | Status | Issue | Summary | Found | Fixed | Regressed | +|----|------|--------|-------|---------|-------|-------|-----------| +| B001 | checkout/shipping-form | open | #47 | Accepts invalid zip codes | 2026-02-28 | — | — | +| B002 | browse/product-grid | fixed | #48 | Cards not clickable | 2026-02-28 | 2026-03-01 | — | +| B003 | browse/product-grid | regressed | #52 | Cards not clickable (regression of B002) | 2026-03-05 | — | 2026-03-05 | +``` + +**Status lifecycle:** `open` → `fixed` (when Known-bug area passes fix_check) → `regressed` (if same area fails again after fix). Cross-reference: `Issue` column links to GitHub, `ID` column is the local canonical reference. + +**Multi-area bugs:** A bug that manifests in multiple areas gets ONE registry entry with the primary area. The `Summary` field notes "Also affects: area-a, area-b". Each affected area's Known-bug detail references the same bug ID. + +**Commit mode updates:** After each run, commit mode: +1. Marks bugs as `fixed` when a Known-bug area's fix_check passes (score >= area's `pass_threshold`, default 4) +2. Files new bugs with next sequential ID +3. Marks bugs as `regressed` when a previously-fixed area fails again +4. Syncs with GitHub issue state (closed issue + passing fix_check = fixed) + +**File location:** `tests/user-flows/bugs.md` alongside scenario files. One registry per project, not per scenario. 
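
The open → fixed → regressed lifecycle above can be sketched in a few lines. This is a hypothetical illustration, not code from the skill — the `Bug` dataclass, `update_bug_status` name, and the `Optional[int]` fix-check score are all assumptions layered on the lifecycle rules described in Phase 1:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bug:
    id: str      # sequential local ID, e.g. "B002"
    area: str    # primary area slug
    status: str  # "open" | "fixed" | "regressed"

def update_bug_status(bug: Bug, fix_check_score: Optional[int],
                      pass_threshold: int = 4) -> Bug:
    """Apply the open -> fixed -> regressed lifecycle after a run."""
    if fix_check_score is None:
        return bug  # area was not re-tested this run
    if bug.status == "open" and fix_check_score >= pass_threshold:
        bug.status = "fixed"       # Known-bug area passed its fix_check
    elif bug.status == "fixed" and fix_check_score < pass_threshold:
        bug.status = "regressed"   # previously-fixed area failed again
    return bug
```

GitHub issue syncing (closed issue + passing fix_check = fixed) would layer on top of this; it is omitted here.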
+ +### Phase 2: Per-Area Score History (Gap #2) + +Split storage by audience: humans see trends, machines store history. + +**Machine-readable history:** `tests/user-flows/score-history.json` alongside `bugs.md`: + +```json +{ + "areas": { + "checkout/cart": { + "scores": [ + { "date": "2026-02-28", "ux": 3, "quality": null, "time": 8 }, + { "date": "2026-03-01", "ux": 4, "quality": null, "time": 7 }, + { "date": "2026-03-02", "ux": 4, "quality": null, "time": 6 } + ], + "trend": "improving" + }, + "agent/search-query": { + "scores": [ + { "date": "2026-02-28", "ux": 4, "quality": 3, "time": 12 }, + { "date": "2026-03-01", "ux": 5, "quality": 4, "time": 9 } + ], + "trend": "improving" + } + } +} +``` + +**Storage:** Last 10 entries per area. Oldest drops off when 11th is recorded. One file per project (not per scenario). + +**Human-readable trends in test file:** A thin `## Area Trends` section replaces the wide history table: + +```markdown +## Area Trends + +| Area | Trend | Last Score | Delta | +|------|-------|------------|-------| +| checkout/cart | improving | 4 | +1 | +| checkout/shipping | fixed | 4 | +2 | +| browse/product-grid | stable | 5 | — | +``` + +**Trend computation:** `improving` (last 3 trending up), `stable` (variance < 0.5 over last 3), `declining` (last 3 trending down), `volatile` (variance >= 1.0 over last 3), `fixed` (previous was <= 2, current >= pass_threshold). Computed from `score-history.json`, not by parsing markdown. + +**Delta computation:** Per-area delta compares current score to previous score for that specific area in `score-history.json`. This supplements the existing top-level delta in run history. 
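
The trend rules above can be sketched as a small classifier. This is a hedged illustration: the rule precedence (checking `fixed` first, then `volatile`), strict monotonicity for improving/declining, and the handling of the unspecified middle band (variance between 0.5 and 1.0, non-monotonic) are all assumptions not pinned down by the plan:

```python
from statistics import pvariance

def compute_trend(scores, pass_threshold=4):
    """Classify a per-area score trend from score-history.json entries."""
    # "fixed": previous run was <= 2, current run meets the pass threshold
    if len(scores) >= 2 and scores[-2] <= 2 and scores[-1] >= pass_threshold:
        return "fixed"
    if len(scores) < 3:
        return "stable"  # assumption: too little history to call a trend
    last3 = scores[-3:]
    if pvariance(last3) >= 1.0:
        return "volatile"
    if last3[0] < last3[1] < last3[2]:
        return "improving"
    if last3[0] > last3[1] > last3[2]:
        return "declining"
    return "stable"  # variance < 0.5, plus the unspecified in-between cases
```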
+ +### Phase 3: Structured Skip Reasons (Gap #3) + +Add `skip_reason` field to each area result in `.user-test-last-run.json`: + +```json +{ + "slug": "compare/add-view", + "ux_score": null, + "skip_reason": "disconnect", + "time_seconds": null +} +``` + +**Enum values:** +- `null` — area was tested normally +- `"proven_spotcheck"` — Proven area, spot-checked only +- `"known_bug_open"` — Known-bug, issue still open, skipped +- `"known_bug_fixed"` — Known-bug, issue closed, ran fix_check +- `"cli_precheck_failed"` — CLI precheck for this area scored <= 2 +- `"disconnect"` — MCP disconnect interrupted this area +- `"user_skip"` — User explicitly skipped + +**Report impact:** Pass rate calculation excludes `disconnect` and `user_skip` areas. The report shows: "Pass rate: 4/5 (1 area skipped: disconnect)". + +### Phase 4: Explicit Pass Thresholds (Gap #4) + +Add `pass_threshold` to area details in test files: + +```markdown +### checkout/shipping-form +**Interactions:** Enter address, select method, see estimate +**What's tested:** Form validation + shipping logic +**pass_threshold:** 4 +``` + +```markdown +### agent/search-results +**Interactions:** Enter query, review results, refine search +**What's tested:** Result relevance and ranking quality +**scored_output:** true +**pass_threshold:** 4 +**quality_threshold:** 3 +``` + +**Defaults:** If `pass_threshold` is not set, default is 4. If `quality_threshold` is not set for `scored_output` areas, default is 3. These match the current implicit behavior but make it explicit and per-area configurable. + +**Promotion gate uses thresholds:** "2+ consecutive passes" means 2+ consecutive runs where UX >= `pass_threshold` (and Quality >= `quality_threshold` for scored_output areas). + +**Self-documenting:** The test file now contains everything needed to understand when an area graduates — no implicit knowledge required. 
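
Phases 3 and 4 combine in the pass-rate calculation. A minimal sketch, with two stated assumptions: field names mirror the `.user-test-last-run.json` shapes shown above, and areas with other skip reasons (e.g. `proven_spotcheck`) still count toward the denominator, since the plan only excludes `disconnect` and `user_skip`:

```python
EXCLUDED = {"disconnect", "user_skip"}  # per Phase 3, not counted in pass rate

def pass_rate(areas):
    """Return (passes, counted, skipped) for one run's area results."""
    passes = counted = skipped = 0
    for a in areas:
        if a.get("skip_reason") in EXCLUDED:
            skipped += 1
            continue
        counted += 1
        # Per-area thresholds from Phase 4, with the documented defaults
        ux_ok = (a.get("ux_score") or 0) >= a.get("pass_threshold", 4)
        q = a.get("quality_score")  # only present for scored_output areas
        q_ok = q is None or q >= a.get("quality_threshold", 3)
        if ux_ok and q_ok:
            passes += 1
    return passes, counted, skipped
```

A report line like "Pass rate: 1/3 (2 areas skipped)" falls straight out of the returned tuple.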
+ +### Phase 5: Queryable Qualitative Data (Gap #5) + +Tag each qualitative observation with the area slug it relates to: + +**In `.user-test-last-run.json`:** +```json +{ + "qualitative": { + "best_moment": { "area": "agent/search-query", "text": "Agent search returns highly relevant results in <2s" }, + "worst_moment": { "area": "browse/product-detail", "text": "Product cards aren't clickable — expected click-to-detail" }, + "demo_readiness": "partial", + "verdict": "Agent core is impressive but missing product-click-to-detail hurts the experience", + "context": "search excellent, product grid needs click handler" + } +} +``` + +**In `test-history.md`:** The existing `Key Finding` column already captures one-line findings. Add `Best Area` and `Worst Area` columns to enable pattern queries: + +```markdown +| Date | Areas Tested | Quality Avg | Delta | Pass Rate | Best Area | Worst Area | Demo Ready | Context | Key Finding | +``` + +**Pattern surfacing:** After 10+ runs, commit mode surfaces patterns with asymmetric thresholds: +- **Positive patterns** (high bar): "area X has been best moment in 7+ of last 10 runs" — high evidence required because this is informational, not actionable +- **Negative patterns** (moderate bar): "area X has been worst moment in 5+ of last 10 runs" — lower threshold than positive, but not 3-in-a-row (too noisy during normal development churn). Five of ten captures genuine trends while filtering out transient spikes from feature work. + +### Phase 6: Discovery-to-Regression Graduation (Gap #6) + +This is the highest-leverage change. When a browser-layer discovery is fixed and verified, offer to generate a CLI regression check. + +**Trigger:** When commit mode marks a bug as `fixed`: +1. Check if `cli_test_command` exists in the scenario frontmatter +2. If yes, offer: "Bug B002 (cards not clickable) is fixed. Generate a CLI regression check? (y/n)" +3. 
If user accepts, append to `cli_queries` in the test file: + +```yaml +cli_queries: + - query: "show me product cards" + expected: "Returns product data with clickable links or URLs" + prechecks: "browse/product-grid" + graduated_from: "B002" # links back to the bug that spawned this check +``` + +**Graduation trigger:** Manual decision (user confirms). Automatic graduation after N passes was considered but rejected — the user knows better than the system whether a CLI check can meaningfully cover a UX-discovered issue. Some discoveries are inherently browser-only (layout, animation, visual feedback). + +**CLI-ineligible bugs:** If no `cli_test_command` exists, skip the graduation offer. If the bug is purely visual (e.g., CSS layout), note "This bug is browser-only — no CLI graduation available." + +**The compounding loop this enables:** +``` +Browser discovers bug → bug filed → developer fixes → next run verifies fix + → fix confirmed → CLI regression check generated + → future regressions caught by fast CLI layer + → browser time freed for new exploration +``` + +### Phase 7: UX Opportunities (New Signal Category) + +Two distinct sections in the Phase 4 report — improvement suggestions and patterns to protect: + +**UX Opportunities** (action items — things to improve): + +``` +UX Opportunities: +| ID | Area | Priority | Status | Suggestion | +|----|------|----------|--------|-----------| +| UX001 | browse/product-grid | P1 | open | Product cards should be clickable (users expect click-to-detail) | +| UX002 | agent/search-results | P2 | open | Follow-up suggestion buttons are excellent — make more prominent | +``` + +**Good Patterns** (preservation notes — things to protect): + +``` +Good Patterns: +| Area | Pattern | First Seen | Last Confirmed | +|------|---------|------------|----------------| +| browse/filters | Filter chip with sub-filter counts is a best-practice pattern | 2026-02-28 | 2026-03-02 | +| agent/search-results | Agent follow-up buttons after 
search are excellent | 2026-02-28 | 2026-03-02 | +``` + +**Why separate sections:** P1 and P2 are action items — things to improve. Good Patterns are "don't break this" notes — a fundamentally different signal. Mixing them in one table conflates suggestions with preservation. Good Patterns also have a simpler lifecycle (confirmed each run, no status transitions). + +**Priority mapping (UX Opportunities only):** +- **P1** — Missing expected interaction (friction source) +- **P2** — Enhancement to an already-good interaction + +**UX Opportunity lifecycle:** Each entry has a `status` field: +- `open` — suggestion logged, not yet acted on +- `implemented` — the improvement was made (agent detects the change, or user marks manually) +- `wont_fix` — explicitly declined (keeps the log honest, prevents re-suggestion) + +Entries rotate: keep last 20 `open` entries. `implemented` and `wont_fix` entries age out after 30 days (they've served their purpose). + +**Good Patterns lifecycle:** Simpler — `Last Confirmed` updates each run that observes the pattern. Patterns not confirmed for 5+ runs are removed (the code changed, the pattern may no longer exist). No status field needed. + +**Dedup:** Anchored on explicit IDs, not fuzzy text matching. UX Opportunities use sequential IDs (UX001, UX002...). When the agent observes something that might duplicate an existing entry, it checks by `area slug + priority level`: if the same area already has an open entry at the same priority, the agent decides whether to update or create new — not automated text overlap matching. Good Patterns dedup on area slug only (one pattern entry per area). + +**Distinct from bugs:** Bugs are functional failures (score <= 2) or complete output failures (quality <= 1). UX Opportunities are observations at score 3-5 where the experience could be better. Good Patterns are observations at score 4-5 where the experience is already good. 
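
The ID-anchored dedup rule can be sketched as two helpers. Both function names are illustrative; the logic follows the rule above — same area plus same priority with an open entry surfaces a candidate for the agent to judge, and new entries get the next sequential ID:

```python
def next_ux_id(existing_ids):
    """Mint the next sequential ID: UX001, UX002, ..."""
    n = max((int(i[2:]) for i in existing_ids), default=0) + 1
    return f"UX{n:03d}"

def find_dedup_candidate(entries, area, priority):
    """Return an open entry at the same area + priority, if any.

    The agent then decides whether to update it or create a new entry;
    no fuzzy text matching is involved.
    """
    for e in entries:
        if (e["area"] == area and e["priority"] == priority
                and e["status"] == "open"):
            return e
    return None
```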
+ +**No GitHub issue filing:** Both sections are logged in the test file only. They feed the product backlog but don't create issue noise. The user can manually promote a UX Opportunity to an issue if they want. + +**Storage:** Two new sections in the test file: `## UX Opportunities Log` and `## Good Patterns`. + +## Technical Considerations + +### Schema Migration: v2 → v3 + +**New frontmatter fields (all optional):** +- None — all new data lives in new sections, not frontmatter + +**New test file sections:** +- `## Area Trends` — replaces Area Score History, thin summary (trend + last score + delta) +- `## UX Opportunities Log` — improvement suggestions with status lifecycle +- `## Good Patterns` — patterns worth preserving (separate from opportunities) + +**New standalone files:** +- `tests/user-flows/bugs.md` — bug registry +- `tests/user-flows/score-history.json` — full per-area score history (machine-readable) + +**Run results JSON changes:** +- `areas[].skip_reason` — new field (nullable string enum) +- `qualitative.best_moment` — changes from string to `{ area, text }` object +- `qualitative.worst_moment` — changes from string to `{ area, text }` object +- `ux_opportunities` — new array at top level (P1/P2 improvement suggestions with IDs) +- `good_patterns` — new array at top level (area-level patterns worth preserving) + +**Backward compatibility:** v2 files work unchanged. New sections are added on first v3 commit. The `qualitative` field change is breaking for `.user-test-last-run.json` consumers — but this file is ephemeral (overwritten each run, gitignored), so no migration needed. Bump `schema_version: 3` on first commit that adds new sections. + +**Migration strategy:** Same as v1→v2: fill defaults on read, upgrade on write. No separate migration step. + +### SKILL.md Line Budget + +Current: 321 lines. 
v3 additions estimated: + +| Addition | Lines in SKILL.md | Lines in references/ | +|----------|------------------|---------------------| +| Bug registry lifecycle | ~15 | ~40 (bugs-registry.md) | +| Per-area trends + score-history.json | ~5 | ~15 (in test-file-template.md) | +| Structured skip reasons | ~8 | 0 (enum in JSON schema) | +| Pass thresholds | ~5 | ~10 (in test-file-template.md) | +| Queryable qualitative | ~5 | 0 (JSON schema change) | +| Graduation mechanism | ~15 | ~30 (graduation.md) | +| UX Opportunities + Good Patterns | ~15 | ~25 (in test-file-template.md) | +| Schema v3 migration note | ~5 | 0 | +| **Total** | **~73** | **~120** | + +**Projected SKILL.md:** ~394 lines — within 500-line budget. + +**New reference file:** `references/bugs-registry.md` for bug lifecycle documentation. +**New reference file:** `references/graduation.md` for discovery-to-regression graduation. +**Updated reference file:** `references/test-file-template.md` for new sections + pass thresholds. + +**Line-count checkpoint:** After implementing step 4 (SKILL.md updates), run `wc -l < SKILL.md` before proceeding to step 5. If over 420 lines, extract UX Opportunities or graduation to their own reference files immediately — don't wait until post-implementation cleanup. + +**Graduation extraction trigger:** The graduation mechanism (Phase 6) involves conditional logic across several states (cli_test_command present?, bug type visual or functional?, user response). If it exceeds 20 lines in SKILL.md during implementation, extract to `references/graduation.md` immediately. The reference file is already planned; the question is whether graduation lives as a brief summary in SKILL.md with details in the reference, or entirely in the reference from the start. Default: start in SKILL.md, extract if it grows. + +### Two-Layer Architecture Clarification + +v2 already implements CLI-first (Phase 2.5) and Browser-second (Phase 3). 
v3 doesn't change the execution order, but the graduation mechanism (Phase 6) creates a feedback loop: + +``` +Layer 2 (Browser) → discovers issue → fix verified → graduation offered + ↓ +Layer 1 (CLI) ← new regression check added ← catches regressions fast + ↓ +Layer 2 (Browser) → freed to explore new territory +``` + +This is the compounding loop in action. Over time, the CLI layer grows and the browser layer stays focused on unknowns. + +### Open Questions Resolved + +**Q: How does bugs.md handle bugs that span multiple areas?** +A: One registry entry with primary area. Summary notes "Also affects: area-a, area-b". Each affected area's Known-bug detail references the same bug ID. + +**Q: Should UX Opportunities have priority (P1/P2/P3)?** +A: Yes, but only P1/P2. P1 = missing expected interaction, P2 = enhancement to a good interaction. The former P3 category (pattern worth preserving) was split out into the separate Good Patterns section (Phase 7). + +**Q: What's the graduation trigger?** +A: Manual — user confirms after fix is verified. The user knows whether a CLI check can meaningfully cover a UX-discovered issue. Some discoveries are inherently browser-only. + +**Q: How does the command handle an app it's never seen before?** +A: Already handled by v2 — passing a description string to `/user-test` creates a new test file from template. No separate `/user-test init` needed. The first run IS the init. + +## Acceptance Criteria + +### Phase 1: Bug Registry +- [x] `tests/user-flows/bugs.md` created on first bug filing if it doesn't exist +- [x] Bug IDs are sequential (B001, B002, ...)
+- [x] Status lifecycle works: open → fixed → regressed +- [x] Multi-area bugs have one entry with "Also affects" note +- [x] Commit mode syncs bug status with GitHub issue state +- [x] Fixed bugs are detected when Known-bug area passes fix_check (score >= area's `pass_threshold`, default 4) +- [x] Regression detection: previously-fixed area fails → new issue "Regression of #N" + bug marked regressed + +### Phase 2: Per-Area Score History +- [x] `tests/user-flows/score-history.json` created on first run, stores full per-area history +- [x] Last 10 entries per area in JSON, oldest drops at 11th +- [x] `## Area Trends` section in test file shows Trend + Last Score + Delta (human-readable summary) +- [x] Trend computed from JSON: improving/stable/declining/volatile/fixed +- [x] Per-area delta computed from JSON, not by parsing markdown + +### Phase 3: Structured Skip Reasons +- [x] `skip_reason` field present in `.user-test-last-run.json` for every area +- [x] Enum: null, proven_spotcheck, known_bug_open, known_bug_fixed, cli_precheck_failed, disconnect, user_skip +- [x] Pass rate calculation excludes disconnect and user_skip +- [x] Report displays skip count and reasons + +### Phase 4: Pass Thresholds +- [x] `pass_threshold` field supported in area details (default: 4) +- [x] `quality_threshold` field supported for scored_output areas (default: 3) +- [x] Promotion gate uses per-area thresholds +- [x] Test file is self-documenting — thresholds visible in area details + +### Phase 5: Queryable Qualitative +- [x] `best_moment` and `worst_moment` in JSON are `{ area, text }` objects +- [x] `test-history.md` has `Best Area` and `Worst Area` columns +- [x] Positive pattern detection: 7+ of last 10 runs (high bar — informational signal) +- [x] Negative pattern detection: 5+ of last 10 runs (moderate bar — actionable signal) + +### Phase 6: Graduation +- [x] After bug marked fixed, offer CLI graduation if `cli_test_command` exists +- [x] Graduated CLI query includes 
`graduated_from: "B00N"` tag +- [x] Skip graduation offer for browser-only bugs (no CLI equivalent) +- [x] Skip graduation offer if no `cli_test_command` in frontmatter + +### Phase 7: UX Opportunities + Good Patterns +- [x] `UX Opportunities` section in Phase 4 report (P1/P2 action items) +- [x] `Good Patterns` section in Phase 4 report (separate from opportunities) +- [x] UX Opportunities use sequential IDs (UX001, UX002...) with status lifecycle (open/implemented/wont_fix) +- [x] Good Patterns dedup on area slug only, `Last Confirmed` updates each run, removed after 5 runs unconfirmed +- [x] Dedup: same area + same priority = agent decides (not fuzzy text matching) +- [x] Stored in test file: `## UX Opportunities Log` (last 20 open) + `## Good Patterns` +- [x] Distinct from bugs — no GitHub issue creation +- [x] UX Opportunities triggered at score 3-5; Good Patterns triggered at score 4-5 + +### Schema & Compatibility +- [x] v2 files load without error (new sections added on first commit) +- [x] `schema_version: 3` set on first v3 commit +- [x] SKILL.md stays under 500 lines after all additions (checkpoint at step 4.5) +- [x] bugs-registry.md reference file created +- [x] graduation.md reference file created +- [x] test-file-template.md updated with Area Trends, UX Opportunities Log, Good Patterns sections +- [x] score-history.json schema documented in test-file-template.md + +### Version & Metadata +- [x] Version bumped (2.36.0 → 2.37.0) +- [x] CHANGELOG.md updated +- [x] Plugin.json and marketplace.json description counts verified + +## Implementation Sequence + +1. **Create `references/bugs-registry.md`** — bug lifecycle, multi-area handling, status transitions, fix_check threshold tied to pass_threshold +2. **Create `references/graduation.md`** — discovery-to-regression mechanism, CLI query generation, browser-only bug detection +3. 
**Update `references/test-file-template.md`** — add Area Trends section (replacing wide score history table), UX Opportunities Log + Good Patterns sections, score-history.json schema, pass_threshold/quality_threshold in area details, schema_version: 3 +4. **Update SKILL.md** — add bug registry lifecycle to Commit Mode, skip_reason to Phase 3/4, pass thresholds to promotion gate, qualitative tagging to Phase 4, graduation offer to Commit Mode, UX Opportunities + Good Patterns to Phase 4, schema v3 migration note to Phase 1 +5. **Line-count checkpoint** — run `wc -l < SKILL.md`. If over 420 lines, extract graduation or UX Opportunities to reference files before proceeding. This is a hard gate, not a suggestion. +6. **Update `.user-test-last-run.json` schema** — add skip_reason, change qualitative structure, add ux_opportunities, add good_patterns +7. **Bump metadata** — version, changelog, plugin.json, marketplace.json +8. **Validate** — SKILL.md line count, JSON validity, reference links, score-history.json schema +9. **Install locally** — copy to ~/.claude/skills/user-test/ + +## Dependencies & Risks + +| Risk | Mitigation | +|------|-----------| +| bugs.md grows unbounded | Rotation: archive entries older than 6 months to bugs-archive.md | +| score-history.json grows with many areas over many runs | Cap at 10 entries per area; one file per project. At 30 areas x 10 entries = ~300 entries — manageable JSON size | +| Graduation offers interrupt flow | Single y/n prompt after commit, not during test run. Batch all graduation offers into one prompt. | +| Pattern detection is noisy early on | Only trigger after 10+ runs. Positive patterns: 7/10 threshold. Negative patterns: 5/10 threshold. | +| UX Opportunity dedup produces false matches | Dedup anchored on area slug + priority level, not text overlap. Agent decides on conflicts — no automated fuzzy matching. 
| +| Good Patterns log bloat (agent flags everything good as a pattern) | Only log patterns at score 4-5 that represent a *deliberate design choice* (not just "page loaded"). Patterns auto-expire after 5 runs unconfirmed. | +| UX Opportunities with no lifecycle become stale | Status field (open/implemented/wont_fix). Implemented and wont_fix age out after 30 days. Open entries capped at 20. | +| schema_version: 3 migration adds sections to existing test files | Non-destructive: new sections appended, existing content preserved | +| SKILL.md approaches 400 lines | Hard gate at step 5: if over 420 lines, extract before proceeding. Graduation earmarked for early extraction (20-line trigger). | +| qualitative JSON structure change breaks external consumers | .user-test-last-run.json is gitignored and ephemeral — no external consumers expected | + +## Sources & References + +### Prior Plans +- [v1 plan: user-test browser testing skill](docs/plans/2026-02-26-feat-user-test-browser-testing-skill-plan.md) — original skill architecture (completed) +- [v2 plan: user-test skill revision](docs/plans/2026-02-28-feat-user-test-skill-revision-plan.md) — schema v2, timing, CLI mode, auto-commit (completed) + +### Internal References +- Current SKILL.md: `plugins/compound-engineering/skills/user-test/SKILL.md` (321 lines, schema v2) +- Test file template: `plugins/compound-engineering/skills/user-test/references/test-file-template.md` +- Anti-patterns: `docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md` + +### Learnings Applied +- **Monolith-to-skill split:** Reference file extraction from day one prevents SKILL.md bloat (confirmed by v2 staying at 321 lines) +- **Agent-guided state patterns:** Use agent judgment for maturity transitions, not mechanical counters (validated by 7 real runs) +- **Plugin versioning:** Always bump version, changelog, and description counts in lockstep diff --git 
a/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md b/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md new file mode 100644 index 000000000..daee34a60 --- /dev/null +++ b/docs/solutions/2026-02-26-agent-guided-state-and-mcp-resilience-patterns.md @@ -0,0 +1,63 @@ +--- +title: 'Agent-Guided State Transitions and MCP Resilience Patterns for Browser Skills' +date: 2026-02-26 +tags: [claude-code, claude-in-chrome, mcp, skill-architecture, browser-testing] +category: architecture +module: plugins/compound-engineering/skills/user-test/SKILL.md +source: deepen-plan +convergence_count: 5 +plan: .deepen-2026-02-26-feat-user-test-browser-testing-skill-plan/original_plan.md +--- + +# Agent-Guided State Transitions and MCP Resilience Patterns for Browser Skills + +## Problem + +When designing skills that track state across runs (maturity models, progression systems) and depend on external MCP tools (browser automation, API connectors), two failure modes recur: hardcoded state transition rules that override agent judgment, and generic retry logic that gives users no actionable recovery path when MCP connections fail. + +## Key Findings + +### Hardcoded state rules violate agent-native principles (5 agents converged) + +Encoding rigid rules like "3 consecutive passes = Promoted" and "any failure = reset to Uncharted" puts business logic in the skill that should be prompt-driven. A minor cosmetic issue in a well-tested area does not warrant full demotion. The fix: provide a scoring rubric with calibration anchors (concrete examples for each score level) and maturity guidance (not rigid counters), then let the agent exercise judgment on promotions and demotions based on severity and context. 
+ +### Extract content to references/ from day one (4 agents converged) + +Skills approaching the 500-line recommended limit should proactively extract templates, framework-specific patterns, and mode-specific documentation into references/ subdirectories before the first version ships. Retrofitting extraction after the skill is in use creates migration risk. Plan the directory structure at design time: SKILL.md holds execution logic (~300 lines), references/ holds reusable content (templates, patterns, mode details). + +### MCP disconnect recovery needs specific, not generic, guidance (4 agents converged) + +Chrome extension service workers go idle during extended sessions, breaking MCP connections. A generic "retry once" pattern gives users no path forward when the retry also fails. The fix: provide the specific recovery command ("/chrome Reconnect"), add backoff delay (2-3 seconds) before retry, and track cumulative disconnects to fail fast (abort after 3) rather than burning tokens on repeated failures. + +## Reusable Pattern + +For skills with state tracking: define scoring calibration anchors (what each numeric score means concretely), provide maturity guidance as a rubric, and let agents exercise judgment -- never hardcode state transition counters. For MCP-dependent skills: implement three-tier resilience (preflight availability check, mid-run retry with specific recovery instructions, graceful degradation for non-critical tool failures). 
+ +## Code Example + +```markdown +## Maturity Guidance (agent-guided, not hardcoded) +| Score | Meaning | Example | +|-------|-----------------------|--------------------------------------| +| 1 | Broken | Button unresponsive, page crashes | +| 2 | Major friction | 3+ confusing steps, error messages | +| 3 | Minor friction | Small UX issues, unclear labels | +| 4 | Smooth | Clear flow, no confusion | +| 5 | Delightful | Exceeds expectations | + +Promote to Proven: 2+ consecutive runs with no significant issues (use judgment) +Demote: Functional regression, not cosmetic issues +``` + +```markdown +## MCP Disconnect Recovery (three-tier) +1. Preflight: verify tool availability, instruct `/chrome` if missing +2. Mid-run: wait 3s, retry once, then: "Run /chrome > Reconnect extension" +3. Cumulative: abort after 3 disconnects with clear extension stability message +``` + +## References + +- plugins/compound-engineering/skills/agent-native-architecture/SKILL.md (Granularity principle: agent judgment over hardcoded logic) +- docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md (size budget enforcement pattern) +- https://code.claude.com/docs/en/chrome (extension disconnect behavior and recovery) diff --git a/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md b/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md new file mode 100644 index 000000000..ed718b87e --- /dev/null +++ b/docs/solutions/2026-02-26-monolith-to-skill-split-anti-patterns.md @@ -0,0 +1,61 @@ +--- +title: 'Monolith-to-Skill Split: Enforcement, Drift, and Shadowing Anti-Patterns' +date: 2026-02-26 +tags: [claude-code, markdown-commands, skill-md-framework, bash, node-js] +category: architecture +module: commands/deepen-plan.md +source: deepen-plan +convergence_count: 4 +plan: .deepen-sorted-wandering-parnas/original_plan.md +--- + +# Monolith-to-Skill Split: Enforcement, Drift, and Shadowing Anti-Patterns + +## Problem + +When splitting a large command file 
into a thin wrapper + SKILL.md + reference doc, three failure modes recur: size budgets creep back without enforcement, validation logic duplicated across files drifts out of sync, and stale copies of the original monolith silently shadow the new skill. + +## Key Findings + +### Size budgets require deterministic enforcement, not prose (3 agents converged) + +Stating "max 1,200 lines" in a plan is a policy wish. Without a gate that fails the pipeline, the file will grow past the budget through iterative additions -- exactly how the original monolith grew from 400 to 1,452 lines. Embed a line-count check as a validation step that runs every time the pipeline executes. + +### Legacy monolith shadowing during migration (4 agents converged) + +Claude Code resolves skills by precedence: enterprise > personal > project, with plugins namespaced. A stale 1,452-line file at `~/.claude/commands/` or `~/.claude/skills/` silently shadows the new plugin skill. Detection must be automated, check all three resolution paths, and use line count (>100) as the heuristic -- not file existence alone. + +### Dual validation paths will drift (3 agents converged) + +When validation logic appears both inline in SKILL.md and in a reference doc, the two copies inevitably diverge. The fix: pick one canonical location per validation type. Parent-critical checks (judge output schema) stay inline. Pipeline-internal checks (preservation, artifact structure) live in the reference doc only. + +## Reusable Pattern + +For any command split: (1) add a deterministic size gate that fails loudly, (2) automate legacy detection across all skill resolution paths before first run, (3) assign each validation check exactly one canonical home -- never duplicate across files. 
+
+## Code Example
+
+```bash
+# Size budget enforcement (add to pipeline validation step)
+ARCH_LINES=$(wc -l < "$DEEPEN_DIR/ARCHITECTURE.md")
+if [ "$ARCH_LINES" -gt 1200 ]; then
+  echo "FAIL: ARCHITECTURE.md is $ARCH_LINES lines (max 1200)"
+  exit 1
+fi
+
+# Legacy shadowing detection (cross-platform)
+for dir in "$HOME/.claude/commands" "$HOME/.claude/skills"; do
+  TARGET="$dir/deepen-plan.md"
+  [ -d "$dir/deepen-plan" ] && TARGET="$dir/deepen-plan/SKILL.md"
+  if [ -f "$TARGET" ]; then
+    LINES=$(grep -c '' "$TARGET" 2>/dev/null || echo 0)
+    [ "$LINES" -gt 100 ] && echo "WARN: Legacy at $TARGET ($LINES lines)"
+  fi
+done
+```
+
+## References
+
+- agent-native-architecture/references/agent-execution-patterns.md (deterministic checks over heuristic detection)
+- agent-native-architecture/SKILL.md (anti-pattern: two ways to accomplish same outcome)
+- https://code.claude.com/docs/en/skills (skill resolution precedence)
diff --git a/plugins/compound-engineering/.claude-plugin/plugin.json b/plugins/compound-engineering/.claude-plugin/plugin.json
index 9b35c5a7a..a2448673c 100644
--- a/plugins/compound-engineering/.claude-plugin/plugin.json
+++ b/plugins/compound-engineering/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "compound-engineering",
-  "version": "2.34.0",
-  "description": "AI-powered development tools. 29 agents, 22 commands, 19 skills, 1 MCP server for code review, research, design, and workflow automation.",
+  "version": "2.37.0",
+  "description": "AI-powered development tools. 29 agents, 25 commands, 20 skills, 1 MCP server for code review, research, design, and workflow automation.",
   "author": {
     "name": "Kieran Klaassen",
     "email": "kieran@every.to",
diff --git a/plugins/compound-engineering/CHANGELOG.md b/plugins/compound-engineering/CHANGELOG.md
index 6819c4847..9fd38f067 100644
--- a/plugins/compound-engineering/CHANGELOG.md
+++ b/plugins/compound-engineering/CHANGELOG.md
@@ -5,6 +5,54 @@ All notable changes to the compound-engineering plugin will be documented in thi
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.37.0] - 2026-02-28
+
+### Changed
+
+- **`user-test` skill v3** — Compounding UX intelligence system (schema v2→v3):
+  - **Bug registry (`bugs.md`):** Persistent bug tracker with open/fixed/regressed lifecycle. Sequential IDs (B001...), multi-area support, fix_check tied to area's `pass_threshold`, regression detection.
+  - **Per-area score history:** Machine-readable `score-history.json` with last 10 entries per area. Thin `Area Trends` table in test file (trend + last score + delta). Trend computation: improving/stable/declining/volatile/fixed.
+  - **Structured skip reasons:** `skip_reason` field on each area (enum: null, proven_spotcheck, known_bug_open, known_bug_fixed, cli_precheck_failed, disconnect, user_skip). Pass rate excludes disconnects.
+  - **Explicit pass thresholds:** Per-area `pass_threshold` (default 4) and `quality_threshold` (default 3) in area details. Promotion gate uses per-area thresholds.
+  - **Queryable qualitative data:** Best/worst moments tagged with area slug. Pattern surfacing after 10+ runs with asymmetric thresholds (positive: 7/10, negative: 5/10).
+  - **Discovery-to-regression graduation:** When a bug is fixed, offer to generate a CLI regression check. Manual trigger, batched prompts, browser-only detection.
+ - **UX Opportunities + Good Patterns:** Two new report sections. P1/P2 improvement suggestions with status lifecycle (open/implemented/wont_fix). Good Patterns for preserving deliberate design choices, auto-expire after 5 unconfirmed runs. + - **New reference files:** `bugs-registry.md`, `graduation.md`. Updated `test-file-template.md` with v3 sections. + +--- + +## [2.36.0] - 2026-02-28 + +### Changed + +- **`user-test` skill v2** — Major revision based on 7 rounds of real-world testing: + - **Schema migration (v1→v2):** Existing test files upgrade non-destructively on commit. Forward-compatible reader ignores unknown fields. + - **Run results persistence:** `.user-test-last-run.json` bridges `/user-test` and `/user-test-commit` sessions. Stale/missing file handling with 24h warn / 7d block. + - **Timing tracking (1A):** Wall-clock time per area, `Last Time` column in template, `Time` column in report output. + - **Qualitative summary (1B):** Best/worst moment, demo readiness, verdict with `context` note persisted to history. + - **Delta tracking (1C):** Run-over-run delta in commit history. Flags regressions worse than -0.5. + - **Explore-next-run generation (1D):** Auto-generates 2-3 items with P1/P2/P3 priority after scoring. + - **CLI mode (2A):** Optional `cli_test_command` runs agent reasoning tests without browser. Explicit `prechecks` tag for browser area overlap detection. + - **Output quality scoring (2B):** Dual UX + Quality scores for `scored_output` areas. Promotion gate: UX >= 4 AND Quality >= 3. + - **Conditional regression checks (2C):** Known-bug areas auto-check if linked issue is closed, flip to Uncharted, run fix_check. Files "Regression of #N" on failure. + - **Async wait pattern (3A):** Documented in browser-input-patterns.md. + - **Performance thresholds (3B):** Optional per-project timing grades (fast/acceptable/slow/BROKEN). + - **Iterate mode updates:** Timing variance alongside score variance, CLI iterate reset, delta vs. 
pre-session baseline. + +--- + +## [2.35.0] - 2026-02-27 + +### Added + +- **`user-test` skill** — Exploratory browser testing via claude-in-chrome MCP with quality scoring and compounding test files. Tests run in a visible Chrome window with shared login state. Features a maturity model (Proven/Uncharted/Known-bug) that compounds knowledge across runs, quality scoring rubric (1-5), area-based test decomposition, and structured issue deduplication via GitHub labels. +- **`/user-test` command** — Run browser-based user testing with quality scoring against a test file or description +- **`/user-test-iterate` command** — Run the same test scenario N times to measure consistency +- **`/user-test-commit` command** — Commit test results: update maturity map, file issues, append history +- **Reference files** — `test-file-template.md`, `browser-input-patterns.md`, `iterate-mode.md` extracted from day one per skill size budget best practices + +--- + ## [2.34.0] - 2026-02-14 ### Added diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md index ec1ad83b2..47d23096a 100644 --- a/plugins/compound-engineering/README.md +++ b/plugins/compound-engineering/README.md @@ -7,8 +7,8 @@ AI-powered development tools that get smarter with every use. 
Make each unit of | Component | Count | |-----------|-------| | Agents | 29 | -| Commands | 22 | -| Skills | 19 | +| Commands | 25 | +| Skills | 20 | | MCP Servers | 1 | ## Agents @@ -102,6 +102,9 @@ Core workflow commands use `workflows:` prefix to avoid collisions with built-in | `/resolve_todo_parallel` | Resolve todos in parallel | | `/triage` | Triage and prioritize issues | | `/test-browser` | Run browser tests on PR-affected pages | +| `/user-test` | Run browser-based user testing with quality scoring | +| `/user-test-iterate` | Run user test N times to measure consistency | +| `/user-test-commit` | Commit user-test results (maturity updates, issues, history) | | `/xcode-test` | Build and test iOS apps on simulator | | `/feature-video` | Record video walkthroughs and add to PR description | @@ -154,6 +157,7 @@ Core workflow commands use `workflows:` prefix to avoid collisions with built-in | Skill | Description | |-------|-------------| | `agent-browser` | CLI-based browser automation using Vercel's agent-browser | +| `user-test` | Exploratory browser testing via claude-in-chrome with quality scoring and compounding test files | ### Image Generation diff --git a/plugins/compound-engineering/commands/user-test-commit.md b/plugins/compound-engineering/commands/user-test-commit.md new file mode 100644 index 000000000..d6453ed92 --- /dev/null +++ b/plugins/compound-engineering/commands/user-test-commit.md @@ -0,0 +1,8 @@ +--- +name: user-test-commit +description: Commit user-test results — update test file maturity map, file issues, append history +disable-model-invocation: true +allowed-tools: Skill(user-test) +--- + +Invoke the user-test skill in commit mode for the last completed run. 
diff --git a/plugins/compound-engineering/commands/user-test-iterate.md b/plugins/compound-engineering/commands/user-test-iterate.md new file mode 100644 index 000000000..7cdea1a00 --- /dev/null +++ b/plugins/compound-engineering/commands/user-test-iterate.md @@ -0,0 +1,9 @@ +--- +name: user-test-iterate +description: Run the same user test scenario N times to measure consistency +disable-model-invocation: true +allowed-tools: Skill(user-test) +argument-hint: "[scenario-file] [n]" +--- + +Invoke the user-test skill in iterate mode for: $ARGUMENTS diff --git a/plugins/compound-engineering/commands/user-test.md b/plugins/compound-engineering/commands/user-test.md new file mode 100644 index 000000000..de2c9d29f --- /dev/null +++ b/plugins/compound-engineering/commands/user-test.md @@ -0,0 +1,9 @@ +--- +name: user-test +description: Run browser-based user testing with quality scoring and compounding test files +disable-model-invocation: true +allowed-tools: Skill(user-test) +argument-hint: "[scenario-file-or-description]" +--- + +Invoke the user-test skill for: $ARGUMENTS diff --git a/plugins/compound-engineering/skills/user-test/SKILL.md b/plugins/compound-engineering/skills/user-test/SKILL.md new file mode 100644 index 000000000..637323a01 --- /dev/null +++ b/plugins/compound-engineering/skills/user-test/SKILL.md @@ -0,0 +1,364 @@ +--- +name: user-test +description: Run browser-based user testing via claude-in-chrome MCP with quality scoring and compounding test files. Use when testing app quality, scoring interactions, tracking test maturity, or filing issues from test sessions. +argument-hint: "[scenario-file-or-description]" +disable-model-invocation: true +--- + +# User Test + +Exploratory testing in a visible Chrome window. The user watches the test +happening in real-time and can intervene if needed. Claude shares the browser's +login state — sign into the app in Chrome before running. + +For automated headless regression testing, use `/test-browser` instead. 
+ +**v1 limitation:** This skill targets localhost / local dev server apps. External +or staging URLs are not validated for deployment status — verify remote apps are +live and accessible before testing. + +## Phase 0: Preflight + +1. **Check claude-in-chrome MCP availability:** + - Call any `mcp__claude-in-chrome__*` tool (e.g., `mcp__claude-in-chrome__read_page`) + - If NOT available: display "claude-in-chrome not connected. Run `/chrome` or restart with `claude --chrome`" and abort +2. **Detect WSL:** + - Run `uname -r 2>/dev/null | grep -qi microsoft` via Bash + - If WSL detected: display "Chrome integration is not supported in WSL. Run Claude Code directly on Windows." and abort +3. **Check gh CLI:** + - Run `gh auth status` via Bash + - If not authenticated: note "gh not authenticated — issue creation will be skipped in commit mode" +4. **Validate app URL:** + - If a test file is provided and contains `app_url`, verify the URL is reachable + - Site permission errors and named pipe conflicts are handled reactively during execution (not preflight-checkable) + +## Phase 1: Load Context + +**Input:** `$ARGUMENTS` — either a path to an existing test file or a description of what to test. + +1. **Resolve test file:** + - If argument is a file path (contains `/` or ends in `.md`): + - Validate path resolves within `tests/user-flows/` (prevent directory traversal) + - Read and parse the test file + - Validate `schema_version` is present (1, 2, or 3 accepted) + - **v1 migration:** If `schema_version: 1`, fill missing columns with defaults in memory (`Last Quality` → `—`, `Last Time` → `—`, `Delta` → `—`, `Context` → empty). Do NOT rewrite the file — upgrades happen only during commit. + - **v2 migration:** If `schema_version: 2`, fill missing sections (Area Trends, UX Opportunities Log, Good Patterns) with empty tables. Fill missing Run History columns (Best Area, Worst Area) with `—`. Do NOT rewrite the file on read. 
+     - **Forward compatibility:** Ignore unknown frontmatter fields. Preserve unknown table columns on write.
+     - Extract maturity map, run history, and explore-next-run items
+   - If argument is a description string:
+     - Generate a slug from the description
+     - Check if `tests/user-flows/<slug>.md` already exists
+     - If not, create from template — see [test-file-template.md](./references/test-file-template.md)
+     - Decompose the description into areas (1-3 interactions each)
+   - If no argument:
+     - Scan `tests/user-flows/` for existing test files
+     - Present list and ask which to run, or prompt for a new description
+2. **Ensure `.gitignore` coverage:**
+   - Check that `.user-test-last-run.json` is in the project's `.gitignore`
+   - If missing, append it (this file is ephemeral run state, not source)
+   - Note: `score-history.json` and `bugs.md` are NOT gitignored — they are persistent project data
+3. **Handle corruption:**
+   - If required sections are missing or `schema_version` is absent, offer to regenerate from template
+
+## Phase 2: Setup
+
+1. **Environment sanity check:**
+   - Navigate to the app URL using `mcp__claude-in-chrome__navigate`
+   - Verify the page loaded with expected content (not an error page, stale auth redirect, or empty state)
+   - If error banners, API failures, or empty data detected: abort with "App environment issue detected — fix the app state before testing"
+2. **Authentication check:**
+   - Claude shares the browser's login state — no credential handling needed
+   - If a login page or CAPTCHA is encountered: pause and instruct "Sign in to your app in Chrome, then press Enter to continue"
+3. **Baseline screenshot:**
+   - Take a screenshot of the app's initial state for reference
+
+## Phase 2.5: CLI Testing (Optional)
+
+If the test file defines `cli_test_command` in frontmatter, run CLI queries before browser testing. CLI mode catches agent reasoning errors without browser overhead.
+
+**When `cli_test_command` is present:**
+1.
Skip Phase 0 MCP preflight (CLI doesn't need Chrome). Run `gh auth status` check only. +2. Skip Phase 2 browser setup entirely (unless browser areas also exist). +3. For each query in `cli_queries`: run the command via Bash (substituting `{query}`), capture stdout. +4. Score output quality 1-5 using the **output quality rubric** (see Scoring section). The agent evaluates whether CLI output satisfies the `expected` description **semantically** — not exact string matching. The `expected` field describes what a correct response looks like. +5. CLI results feed into the same maturity map and scoring pipeline. +6. **Browser area overlap:** If a CLI query has a `prechecks: "area-slug"` tag and scores <= 2, skip the tagged browser area with "CLI pre-check failed — skipping browser test." No `prechecks` tag = standalone CLI query, no browser areas skipped. +7. Credentials: the command inherits the shell environment. No credentials stored in the test file. + +**CLI + browser coexistence:** When both exist, run CLI first. CLI failures only skip browser areas explicitly tagged via `prechecks`. + +## Phase 3: Execute + +Test areas based on maturity status. The agent exercises judgment on area selection — these are guidelines, not rigid rules. Record a `skip_reason` for each area not fully tested (see [test-file-template.md](./references/test-file-template.md) for enum values). + +### Timing + +Record wall-clock time per area: note the timestamp before the first MCP call and after the last. Record in seconds. Timing includes async waits — slow is slow, regardless of cause. If a disconnect interrupts an area mid-test, record time as `—` (incomplete) and exclude from averages. + +### Area Selection Priority + +1. **Pick highest-priority Explore Next Run items first** (P1 > P2 > P3), not FIFO +2. **Uncharted areas:** Full investigation with batched `javascript_tool` calls. See [browser-input-patterns.md](./references/browser-input-patterns.md) for input patterns and batching tips. 
+3. **Proven areas:** Quick spot-check only (max 3 MCP calls per area). Verify the happy path still works.
+4. **Known-bug areas:** Check if the linked issue is resolved before skipping:
+   - If `gh` not authenticated: skip as normal
+   - Run `gh issue view <issue-number> --json state -q '.state'`
+   - If `closed`: flip area to Uncharted, run the `fix_check` as the first test
+   - If `open`: skip as normal, note in output
+   - If fix check fails (score <= 2): file new issue with "Regression of #N" referencing the original closed issue
+5. **If all areas are Proven:** Spot-check all, then suggest new scenarios in "Explore Next Run"
+
+### Connection Resilience
+
+1. After any MCP tool failure: wait 3 seconds (`Bash: sleep 3`)
+2. Retry the call once
+3. If retry fails: display "Extension disconnected. Run `/chrome` and select Reconnect extension"
+4. Track `disconnect_counter` for the session
+5. If `disconnect_counter >= 3`: abort with "Extension connection unstable. Check Chrome extension status and restart the session."
+
+### Modal Dialog Handling
+
+If MCP commands stop responding after triggering an action that may produce a dialog (`alert`, `confirm`, `prompt`): instruct the user to dismiss the dialog manually before continuing.
+
+### Graceful Degradation
+
+- Screenshot fails: continue, note "screenshots unavailable" in report
+- `javascript_tool` fails: fall back to individual `find`/`click` calls
+- All MCP tools fail: abort with recovery instructions
+
+## Phase 4: Score and Report
+
+### Scoring
+
+Score each area on a 1-5 scale per scored interaction unit. A scored interaction unit is one user-facing task completion (e.g., "add item to cart", "submit form"). Navigation, page loads, and setup steps are not scored individually.
+ +| Score | Meaning | Example | +|-------|---------|---------| +| 1 | Broken — cannot complete the task | Button unresponsive, page crashes | +| 2 | Completes with major friction | 3+ confusing steps, error messages | +| 3 | Completes with minor friction | Small UX issues, unclear labels | +| 4 | Smooth experience | Clear flow, no confusion | +| 5 | Delightful | Exceeds expectations, helpful feedback | + +Scores are **absolute** per this rubric. The same checkout flow should produce the same score regardless of which test scenario triggered it. + +### Output Quality Scoring (Optional) + +Areas with `scored_output: true` in their area details are scored on TWO dimensions: + +| Score | UX Meaning | Output Quality Meaning | +|-------|-----------|----------------------| +| 5 | Delightful | Exactly what an expert would produce | +| 4 | Smooth | Relevant, minor misses | +| 3 | Minor friction | Partially correct | +| 2 | Major friction | Mostly wrong | +| 1 | Broken | Completely wrong | + +Report shows both: `UX: 4/5, Quality: 3/5`. Areas without `scored_output` show UX only. + +**Aggregation:** `Quality Avg` in history = UX scores only (backward compatible). Output quality tracked separately as `Output Avg` in the report. + +**Promotion gate:** Each area's `pass_threshold` (default 4) and `quality_threshold` (default 3 for scored_output areas) define what counts as a pass. See [test-file-template.md](./references/test-file-template.md) for details. + +**Known-bug filing trigger:** UX <= 2 (functional failure) OR Quality <= 1 (completely wrong output). Files to bug registry — see [bugs-registry.md](./references/bugs-registry.md). + +### Performance Threshold Evaluation (Optional) + +If the test file defines `performance_thresholds` in frontmatter, append a timing grade to each area's assessment: `(fast)`, `(acceptable)`, `(slow)`, `(BROKEN)`. Compare each area's wall-clock time against the thresholds. 
A `broken` timing is a notable finding but does NOT affect the UX score — timing and quality are separate dimensions.
+
+### Collection Categories
+
+For each tested area, collect:
+1. **UX score** (1-5 per interaction unit)
+2. **Time** (wall-clock seconds from Phase 3 timing)
+3. **Issues found** (bugs, UX problems, accessibility gaps)
+4. **Maturity assessment** (promote, demote, or maintain current status)
+
+After all areas are scored, generate:
+5. **Qualitative summary:** best moment (tagged with area slug), worst moment (tagged with area slug), demo readiness (yes/partial/no), one-line verdict
+6. **Explore Next Run items** (2-3 items with priority P1/P2/P3):
+   - **P1** — Things that surprised you (positive or negative)
+   - **P2** — Edge cases adjacent to tested areas
+   - **P3** — Interactions started but not finished, or borderline scores (score of 3 warrants deeper investigation next run)
+7. **UX Opportunities** (P1/P2 action items for improvements observed at score 3-5)
+8. **Good Patterns** (patterns worth preserving observed at score 4-5 — deliberate design choices, not trivial successes)
+
+### Report Output
+
+Display a run summary:
+
+```
+## Run Summary: <scenario>
+
+| Area | Status | Score | Time | Assessment |
+|------|--------|-------|------|------------|
+| cart-validation | Uncharted | 4 | 8s | Ready for promotion |
+| shipping-form | Uncharted | 2 | 15s | Issue found: validation broken |
+
+Quality Avg: 3.0 | Pass Rate: 1/2 | Disconnects: 0
+
+Qualitative:
+- Best moment: Cart updates instantly on quantity change
+- Worst moment: Shipping form accepts invalid zip codes
+- Demo ready: partial
+- One-line verdict: Checkout works but shipping validation broken
+
+Explore Next Run:
+| Priority | Area | Why |
+|----------|------|-----|
+| P1 | shipping-form | Validation broken — push harder on edge cases |
+| P2 | checkout/promo-code | Adjacent to cart, untested |
+
+UX Opportunities:
+| Priority | Area | Suggestion |
+|----------|------|-----------|
+| P1 | shipping-form | Should show inline validation before submit |
+| P2 | cart-validation | Quantity stepper would be smoother than text input |
+
+Good Patterns:
+| Area | Pattern |
+|------|---------|
+| cart-validation | Cart updates instantly on quantity change |
+
+Issues to file:
+- shipping-form: Form validation accepts invalid zip codes
+```
+
+### Auto-Commit
+
+After displaying the report, **automatically proceed to Commit Mode** (below) — update the test file, append to history, and file issues. The user reviews results inline as part of the same session.
+
+**Opt-out:** If invoked with `--no-commit` or if the run was partial (interrupted before all areas scored), skip commit and display the report only. The user can run `/user-test-commit` later to commit from `.user-test-last-run.json`.
+
+**Partial run safety:** If the run is interrupted before scoring completes, do NOT produce committable output. Partial runs must not corrupt maturity state.
+
+### Run Results Persistence
+
+After Phase 4 completes (all areas scored), write `tests/user-flows/.user-test-last-run.json`:
+
+```json
+{
+  "run_timestamp": "2026-02-28T14:30:00Z",
+  "completed": true,
+  "scenario_slug": "checkout",
+  "areas": [
+    {
+      "slug": "cart-validation",
+      "ux_score": 4,
+      "quality_score": null,
+      "time_seconds": 12,
+      "skip_reason": null,
+      "assessment": "Ready for promotion",
+      "issues": []
+    }
+  ],
+  "qualitative": {
+    "best_moment": { "area": "cart-validation", "text": "Cart updates instantly on quantity change" },
+    "worst_moment": { "area": "shipping-form", "text": "Shipping form accepts invalid zip codes" },
+    "demo_readiness": "partial",
+    "verdict": "Checkout works but shipping validation broken",
+    "context": "shipping zip validation bypassed"
+  },
+  "explore_next_run": [
+    { "priority": "P1", "area": "shipping-form", "why": "Validation broken" }
+  ],
+  "ux_opportunities": [
+    { "id": "UX001", "area": "shipping-form", "priority": "P1", "suggestion": "Should show inline validation before submit" }
+  ],
+  "good_patterns": [
+    { "area": "cart-validation", "pattern": "Cart updates instantly on quantity change" }
+  ]
+}
+```
+
+- File is overwritten on each run (only last run is committable)
+- `completed: false` if the run was interrupted — commit mode will reject it
+- If Phase 4 is interrupted before writing this file, no committable output exists
+
+## Commit Mode
+
+Runs automatically after Phase 4 completes a full run. Can also be invoked standalone via `/user-test-commit` (e.g., after a `--no-commit` run or to retry a failed commit).
+
+### Load Run Results
+
+**When invoked automatically:** Use the run results already in context from Phase 4.
+
+**When invoked standalone via `/user-test-commit`:** Read `tests/user-flows/.user-test-last-run.json`. This is the single source of truth — commit mode never falls back to context window.
+
+- **Missing file:** Abort with "No run results found. Run `/user-test` first."
+- **Incomplete run:** If `completed: false`, abort with "Last run was incomplete. Run `/user-test` again for committable results."
+- **Stale (>7 days):** Abort with "Run results too old — re-run `/user-test` first."
+- **Stale (>24 hours):** Warn "Run results are from <timestamp>. Commit anyway? (y/n)."
+
+### Maturity Updates
+
+Apply maturity transitions using agent judgment and the scoring rubric:
+
+- **Promote to Proven:** After 2+ consecutive passes where UX >= area's `pass_threshold` (default 4) and Quality >= `quality_threshold` for scored_output areas (default 3), with no functional issues. A cosmetic issue in a Proven area does not warrant demotion.
+- **Demote to Uncharted:** On functional regressions or new features that change behavior. Minor CSS issues do not trigger demotion.
+- **Mark Known-bug:** When a functional issue is found and an issue is filed. Record in bug registry — see [bugs-registry.md](./references/bugs-registry.md). Skip this area in future runs until the fix is deployed.
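The gates above can be condensed into small predicates. This is a sketch for reference only — the helper names are hypothetical, and per the skill's design the agent applies these as judgment guidelines, not hardcoded transition logic:

```javascript
// Pass gate: UX against pass_threshold (default 4); scored_output areas
// must also clear quality_threshold (default 3).
function passesRun(area, run) {
  const uxOk = run.ux_score >= (area.pass_threshold ?? 4);
  const qualityOk = !area.scored_output || run.quality_score >= (area.quality_threshold ?? 3);
  return uxOk && qualityOk;
}

// Bug-filing trigger: functional failure (UX <= 2) or, for scored_output
// areas, completely wrong output (Quality <= 1).
function shouldFileBug(area, run) {
  return run.ux_score <= 2 || (Boolean(area.scored_output) && run.quality_score <= 1);
}

// Promotion candidate: 2+ consecutive passing runs. The agent still weighs
// context (cosmetic vs. functional issues) before actually promoting.
function promotionCandidate(area, recentRuns) {
  return recentRuns.length >= 2 && recentRuns.slice(-2).every((r) => passesRun(area, r));
}
```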
+ +**Partial run safety:** If a run is interrupted before scoring completes, no maturity updates are produced. + +### File Updates + +1. **Update test file maturity map and area details:** + - Write to `.tmp` file first, then rename (atomic write) + - If file is `schema_version: 1` or `2`, upgrade to v3: add missing columns and sections per [test-file-template.md](./references/test-file-template.md) + - Update area statuses, scores, timing, quality scores, and consecutive pass counts + - Update `## Area Trends` section from `score-history.json` data + - Update `## UX Opportunities Log`: add new entries with sequential IDs (UX001...), update existing entries (mark `implemented` if improvement detected), age out entries per lifecycle rules + - Update `## Good Patterns`: confirm existing patterns (update `Last Confirmed`), add new patterns, remove patterns unconfirmed for 5+ runs +2. **Update `tests/user-flows/score-history.json`:** + - Append current run's per-area scores (UX, quality, time) + - Compute trend per area from last 3 entries + - Cap at 10 entries per area (drop oldest) + - Create file if it doesn't exist +3. **Update `tests/user-flows/bugs.md`:** + - File new bugs with sequential IDs for areas with UX <= 2 or Quality <= 1 + - Mark bugs as `fixed` when Known-bug area passes fix_check (score >= `pass_threshold`) AND GitHub issue is closed + - Mark bugs as `regressed` when previously-fixed area fails again + - Create file if it doesn't exist — see [bugs-registry.md](./references/bugs-registry.md) +4. **Offer graduation** for newly-fixed bugs — see [graduation.md](./references/graduation.md) +5. **Append to `tests/user-flows/test-history.md`:** + - Add row with: date, areas tested, quality avg, delta, pass rate, best area, worst area, demo ready, context, key finding + - **Delta computation:** Compare quality avg against the most recent *completed* previous run. First run: `—`. Previous run was partial: skip to last complete run. 
Different area sets: compute over overlapping areas only; no overlap → `—`.
+   - **Delta warning:** Flag any delta worse than -0.5 in the commit output
+   - **Context field:** Brief phrase explaining *why* the verdict is what it is (e.g., "search results loading 28s"). Persists alongside verdict for future reference.
+   - **Pattern surfacing** (after 10+ runs): positive patterns need 7+ of last 10 runs as best area; negative patterns need 5+ of last 10 runs as worst area
+   - Rotation: keep last 50 entries, remove oldest when exceeding
+6. **File GitHub issues:**
+   - Each issue gets a label `user-test:<scenario>/<area>` (e.g., `user-test:checkout/cart-count`)
+   - **Duplicate detection:** `gh issue list --label "user-test:<scenario>/<area>" --state open`
+     - If match found: skip filing, note "duplicate of #N"
+     - If no match: fall back to semantic title search as secondary check
+   - Sanitize issue body content before `gh issue create`
+   - Skip gracefully if `gh` is not authenticated
+   - Never persist credentials (passwords, tokens, session IDs) in issue bodies or test files
+
+## Iterate Mode
+
+See [iterate-mode.md](./references/iterate-mode.md) for full details.
+
+N capped at 10 (default), N=0 is error, N=1 is valid.
+Reset between runs = full page reload to app entry URL.
+Partial run handling: if disconnect mid-iterate, write results for completed
+runs and report "Completed M of N runs."
+Output: per-run scores table + aggregate consistency metrics + maturity transitions.
+Results are not committed automatically — use `/user-test-commit` to apply.
+
+## Test File Template
+
+See [test-file-template.md](./references/test-file-template.md) for the template used when creating new test files, including area granularity guidelines and worked examples.
+
+## Bug Registry
+
+See [bugs-registry.md](./references/bugs-registry.md) for bug lifecycle (open/fixed/regressed), multi-area handling, and commit mode update rules.
+
+## Discovery-to-Regression Graduation
+
+See [graduation.md](./references/graduation.md) for the compounding loop: browser discoveries becoming CLI regression checks.
+
+## Browser Input Patterns
+
+See [browser-input-patterns.md](./references/browser-input-patterns.md) for React-safe input patterns, DOM check batching, file upload limitations, and modal dialog handling.
diff --git a/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md b/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md
new file mode 100644
index 000000000..0aab0e68e
--- /dev/null
+++ b/plugins/compound-engineering/skills/user-test/references/browser-input-patterns.md
@@ -0,0 +1,82 @@
+# Browser Input Patterns
+
+Patterns for interacting with web apps via `claude-in-chrome` MCP tools.
+
+## React-Safe Input
+
+React uses synthetic events and controlled components. Setting `.value` directly
+bypasses React's state management. Use the native setter pattern:
+
+```javascript
+// React-safe input via javascript_tool
+mcp__claude-in-chrome__javascript_tool({
+  code: `
+    const el = document.querySelector('input[name="email"]');
+    const setter = Object.getOwnPropertyDescriptor(
+      window.HTMLInputElement.prototype, 'value'
+    ).set;
+    setter.call(el, 'test@example.com');
+    el.dispatchEvent(new Event('input', { bubbles: true }));
+    el.dispatchEvent(new Event('change', { bubbles: true }));
+  `
+})
+```
+
+This works for `<input>`, `