diff --git a/README.md b/README.md
index 6d67b5087..255855e30 100644
--- a/README.md
+++ b/README.md
@@ -182,6 +182,27 @@ Notes:
 
 ## Workflow
 
+### Autopilot
+
+Run a complete engineering workflow from feature description to PR:
+
+| Command | Description |
+|---------|-------------|
+| `/lfg [description]` | Right-sized autopilot workflow: routes to direct edit, lightweight execution, or the full pipeline based on task complexity |
+| `/slfg [description]` | Deprecated wrapper -- routes to `/lfg` with swarm mode enabled |
+
+```
+/lfg fix the typo on line 42 of foo.ts # → Direct: fixes it, verifies it, opens/updates the PR
+/lfg add input validation to the email field # → Lightweight: does the work, verifies it, opens/updates the PR
+/lfg add dark mode support to the settings page # → Full pipeline in autopilot mode: brainstorm → plan → work → review → test → video
+```
+
+`/lfg` assesses task complexity and chooses the right amount of ceremony while preserving the branch/commit/PR lifecycle. Complex work runs the full pipeline in autopilot mode. `/slfg` is a deprecated compatibility wrapper that routes to `/lfg` with swarm mode enabled.
+ +### Step-by-step + +Use individual commands when you want control over specific phases: + ``` Brainstorm → Plan → Work → Review → Compound → Repeat ↑ @@ -190,14 +211,18 @@ Brainstorm → Plan → Work → Review → Compound → Repeat | Command | Purpose | |---------|---------| -| `/ce:ideate` | Discover high-impact project improvements through divergent ideation and adversarial filtering | +| `/ce:ideate` | Surface high-impact improvement ideas | | `/ce:brainstorm` | Explore requirements and approaches before planning | | `/ce:plan` | Turn feature ideas into detailed implementation plans | | `/ce:work` | Execute plans with worktrees and task tracking | | `/ce:review` | Multi-agent code review before merging | | `/ce:compound` | Document learnings to make future work easier | -The `/ce:ideate` skill proactively surfaces strong improvement ideas, and `/ce:brainstorm` then clarifies the selected one before committing to a plan. +Step-by-step is useful when you want to: +- Brainstorm now and plan later +- Create a plan from an existing requirements doc or ticket +- Run just a code review on changes you've already made +- Document a solved problem without the full workflow Each cycle compounds: brainstorms sharpen plans, plans inform future plans, reviews catch more issues, patterns get documented. diff --git a/docs/brainstorms/2026-03-24-autopilot-run-context-and-decision-rubric-requirements.md b/docs/brainstorms/2026-03-24-autopilot-run-context-and-decision-rubric-requirements.md new file mode 100644 index 000000000..31908a8e8 --- /dev/null +++ b/docs/brainstorms/2026-03-24-autopilot-run-context-and-decision-rubric-requirements.md @@ -0,0 +1,240 @@ +--- +date: 2026-03-24 +topic: autopilot-run-context-and-decision-rubric +--- + +# Autopilot Run Context and Decision Rubric + +## Problem Frame + +`lfg` currently describes autopilot behavior in prose, and `slfg` adds a second top-level entrypoint with overlapping intent. 
Downstream skills do not have a deterministic runtime contract for knowing that autopilot is active. That makes the pipeline brittle: skills infer autopilot from caller wording, skip prompts inconsistently, and have no shared place to record substantive decisions made on the user's behalf. + +Separately, `lfg` currently behaves mainly like a "start the workflow" entrypoint. That is narrower than the more useful user expectation: `/lfg` should be able to resume from whatever state the work is already in. If requirements already exist, it should plan. If a plan exists, it should work. If implementation is done and a PR is open, it should verify CI, run local/browser checks, finish wrap-up artifacts, and keep pushing toward DONE. The orchestrator should not require the user to remember which phase-specific command comes next. + +At the same time, the plugin needs a clearer decision rubric for both interactive and autopilot workflows. Core skills such as `ce:brainstorm`, `ce:plan`, `deepen-plan`, and `ce:work` should use the same role-based judgment model when recommending options in normal mode and when auto-deciding bounded questions in autopilot mode. Without that shared rubric, autopilot either stalls too often or makes decisions without a clear basis. + +The goal is to make `lfg` the single autopilot entrypoint with a deterministic run contract and resume-anywhere orchestration, while also improving decision quality and recommendation consistency across the core workflow skills. `slfg` should be deprecated into a compatibility wrapper that routes to `lfg` with swarm mode enabled, with future project-level defaulting handled through `compound-engineering.local.md`. + +## Requirements + +- R1. `lfg` is the only skill that activates full autopilot mode for an end-to-end run. Downstream skills must not infer autopilot solely from prose like "when called from lfg/slfg" or similar caller wording without a deterministic runtime signal. + +- R2. 
`lfg` must create a run-scoped autopilot manifest under `.context/compound-engineering/` before invoking downstream skills. The manifest must be the shared source of truth for the run's status and durable workflow artifacts. + +- R2a. `lfg` must be able to resume from existing repo/worktree/PR state even when no active autopilot manifest exists. In that case, `lfg` must reconstruct the current workflow gate state, create a fresh run-scoped manifest, record the inferred current state there, and continue from the next appropriate step instead of forcing the user to restart from the beginning. + +- R3. The autopilot manifest must track, at minimum: + - run identity + - active/completed/aborted status + - top-level entry skill (`lfg`) + - original or clarified feature description + - inferred or explicit current workflow stage + - ordered workflow gate status, including which gates are complete, pending, blocked, or unknown + - what evidence was used to mark a gate complete when the run was reconstructed from ambient state + - requirements document path, when created + - plan document path, when created + - decision log path + +- R3a. The exact phase-1 manifest shape should be: + - `schema_version` + - `run_id` + - `route` with the enum `direct | lightweight | full` + - `status` with the enum `active | completed | aborted` + - `implementation_mode` with the enum `standard | swarm` + - `feature_description` + - `current_gate` (optional, advisory only) + - `gates`, keyed by `requirements`, `plan`, `implementation`, `review`, `verification`, and `wrap_up` + - `artifacts`, containing `requirements_doc` and `plan_doc` + +- R3b. Each manifest gate entry should include: + - `state` with the enum `complete | skipped | pending | blocked | unknown` + - `evidence`, as a short list of strings or references explaining why the gate was marked that way + - `ref` only for late-stage gates that can go stale after code changes: `review`, `verification`, and `wrap_up` + +- R3c. 
Manifest lifecycle should be: + - created or backfilled as `active` + - remain `active` while any gate is still `pending` or `blocked` + - transition to `completed` only when all required gates are `complete` or `skipped` and no required external blocker such as CI remains + - transition to `aborted` only when the run is intentionally stopped or cannot continue + +- R3d. Direct and lightweight routes should still create a manifest immediately. In those routes, `requirements` and `plan` may be marked `skipped` with routing evidence, and `artifacts.requirements_doc` / `artifacts.plan_doc` may remain unset by design. + +- R4. Downstream skills must detect autopilot through an explicit, common-denominator invocation marker plus the manifest it points to. The contract must not rely on line breaks, XML parsing, or platform-specific positional argument features. + +- R5. The autopilot invocation marker must be short, self-delimiting, and easy for skills to strip before processing the normal user/task input. The remainder after the marker is the skill's real input. + +- R5a. `slfg` should be deprecated as a separate top-level workflow. Swarm execution should instead be modeled as an implementation option within `lfg` when explicitly requested by the user, with a later path to project-level defaulting through `compound-engineering.local.md` frontmatter `implementation_mode: standard | swarm`. + +- R5b. `lfg` must begin by detecting the current workflow state. It should prefer an active autopilot manifest when one exists. Otherwise it should infer state conservatively from durable artifacts and repo context such as requirements docs, plan docs, branch state, implementation changes, PR state, and CI status. + +- R5c. `lfg` should determine the next step through an explicit ordered gate model rather than ad hoc heuristics. 
At minimum, it must evaluate: + - requirements readiness + - plan readiness + - implementation readiness + - review/todo resolution readiness + - verification readiness (local tests, browser validation where relevant, CI status) + - wrap-up readiness (PR state and required PR artifacts) + `lfg` should advance the first unmet gate it can safely determine. + +- R5d. When resuming without an active manifest, `lfg` must distinguish between stages that can be proven complete from ambient evidence and stages that cannot. If a stage cannot be reliably proven complete from repo/PR state alone, `lfg` should treat it as pending instead of silently assuming it already happened. + +- R5e. For PR-stage resume, `lfg` must inspect GitHub CI state and use it as part of orchestration. It may continue local verification and wrap-up work while CI is pending, but it must not silently declare the workflow complete while required external CI for the current HEAD is still failing or pending. + +- R5f. `slfg` should become a deprecation wrapper or equivalent compatibility path that points users to `lfg` with swarm mode enabled, rather than remaining a parallel second contract surface. + +- R5g. The resume engine must define explicit evidence rules for late-stage gates. In particular, local verification, browser validation, review/todo resolution, and wrap-up gates should remain pending unless `lfg` can point to durable current-run or current-HEAD evidence that those steps were completed. +- R5i. Late-stage best-effort steps must not derail the workflow by default. When browser validation or feature-video is inapplicable or blocked by ordinary environment/auth/tooling issues, `lfg` should record a skipped gate with a brief reason and continue unless the user or task explicitly requires that step. + +- R5h. 
The resume engine must use a deterministic precedence order when reconstructing state: + - active manifest state for the current run + - explicit user-provided direction in the current `lfg` invocation + - durable workflow artifacts and repo state + - PR and CI state for the current branch/HEAD + - targeted user clarification only when the gate state is still genuinely ambiguous after the earlier evidence sources + +- R6. Core workflow skills must use the same role vocabulary in both normal and autopilot modes: + - `Product Manager` + - `Designer` + - `Engineer` + +- R7. Each core workflow skill must define its own ordered role weighting and orchestration bias so the model knows how to make recommendations in normal mode and bounded decisions in autopilot mode. The same roles apply in both modes; autopilot changes decision authority, not the rubric itself. + +- R8. `ce:brainstorm` must use the role rubric even in normal interactive mode to recommend options, frame follow-up questions, and explain why one direction is preferred. Brainstorm should not wait for full autopilot support before adopting the rubric for recommendations. + +- R9. Each core workflow skill that participates in autopilot (`ce:brainstorm`, `ce:plan`, `deepen-plan`, `ce:work`, and any other substantive decision-making workflow added later) must define: + - what decisions it may make automatically + - what decisions require user input + - what decisions must be logged when made in autopilot + +- R9a. The first implementation wave must cover the end-to-end autopilot entrypoints and the substantive decision-making workflow skills: + - `lfg` + - `ce:brainstorm` + - `ce:plan` + - `deepen-plan` + - `ce:work` + This wave should be thorough for those skills. Other autopilot-aware skills may continue using their existing prompt-skipping behavior and adopt the manifest/rubric/logging contract later if they begin making substantive autonomous product or implementation decisions. + +- R9b. 
`document-review` must remain a review utility, not a primary autonomous decision-maker. In autopilot-related workflows, it may return findings, classifications, and deterministic document-quality fixes, but substantive product or implementation decisions discovered through review must be resolved by the owning workflow skill (`ce:brainstorm`, `ce:plan`, `deepen-plan`, or another future substantive skill), not by `document-review` itself. + +- R9c. `document-review` should use these exact finding classes: + - `mechanical-fix`: deterministic wording, formatting, terminology, or structure fix that does not change substantive meaning + - `bounded-decision`: substantive issue with a small set of viable resolutions that the owning skill may auto-decide using its role rubric + - `must-ask`: issue that exceeds the owning skill's documented decision authority and requires user input + - `note`: non-blocking observation worth surfacing to the caller without forcing immediate resolution + +- R10. The decision log must capture substantive autopilot decisions only: product choices, scoped behavior defaults, implementation path choices, and other bounded judgments that materially affect the output. Workflow trivia such as headless mode, cleanup choices, or whether a video step ran do not belong at the same level. + +- R11. The decision log must be run-scoped, durable for the life of the autopilot run, and append-only so multiple skills can contribute decisions safely. It must exist even before a plan file is created so early autopilot decisions are not lost. + +- R12. The canonical decision log must live in the run-scoped autopilot state under `.context/compound-engineering/`. If a plan document exists, the workflow must append a compact "Autopilot Decisions" summary there for the subset of logged decisions that materially affected the plan or implementation direction. The plan is a promoted summary, not the only source of truth. + +- R12a. 
The plan-promotion rule should be: + - include all logged rows from `brainstorm`, `plan`, or `deepen-plan` when they changed product behavior, scope, sequencing, risk handling, implementation direction, or verification strategy reflected in the plan + - include `work` rows only when execution forced a meaningful plan-level deviation or resolved an implementation decision the plan had left open + - exclude rows that are purely operational, already obsolete, or irrelevant to understanding the current plan + - use the same column schema in the plan summary as in the canonical run log + +- R13. A decision log row is required whenever the agent resolves a substantive question or ambiguity on the user's behalf without asking first. This includes both: + - pre-existing open questions already called out in requirements, plans, or other workflow documents + - new questions, conflicts, or discoveries surfaced during planning, deepening, or execution that require a bounded choice to keep moving + +- R13a. When a substantive question is surfaced by `document-review` during an autopilot run, the review finding itself does not count as the logged decision. The owning workflow skill must evaluate the finding using its role ordering and orchestration bias, then log the resulting autonomous decision if it resolves the issue without asking the user. + +- R14. A logged decision must represent a real autonomous choice, not a mechanical consequence. A row is required when there were multiple plausible paths and the selected path materially affected product behavior, implementation direction, scope within the local blast radius, or another outcome the user would reasonably want visibility into. + +- R15. The human-facing decision log format should be easy to scan. A Markdown table is the target presentation format for the canonical run log and for any promoted plan summary. + +- R15a. 
The exact phase-1 Markdown decision-log columns should be: + - `#` + - `Phase` + - `Question` + - `Decision` + - `Why` + - `Impact` + - `Type` + +- R15b. The `Type` column should use: + - `documented-open-question` + - `execution-discovery` + - `conflict-resolution` + +- R16. The runtime decision rubric must be available to the installed skills themselves. Because skills are packaged and copied independently, the solution must not depend on a plugin-root runtime file that is not guaranteed to ship with each installed skill. + +- R17. The packaging strategy for the rubric must support cross-platform installs. A shared source-of-truth is acceptable during authoring, but the installed runtime form must be accessible from each skill's own installed directory. + +## Success Criteria + +- A full `lfg` run can activate autopilot deterministically without downstream skills guessing from caller prose. +- `/lfg` can start or resume from the current workflow gate state instead of only working as a fresh-start command. +- Core workflow skills make more consistent recommendations in normal mode because they use explicit role weighting and orchestration bias. +- Core workflow skills can make bounded autopilot decisions with a clear basis and leave a visible audit trail for substantive decisions. +- The autopilot run state makes artifact handoff deterministic: downstream skills can read the manifest instead of guessing which requirements doc or plan file was created earlier in the run. +- The decision log captures both documented open questions that autopilot resolved and new decisions discovered during execution, so forward-momentum choices are visible rather than silent. +- Review findings raised by `document-review` do not bypass the workflow owner's judgment; the owning skill remains responsible for deciding, updating the artifact, and logging substantive autonomous choices. 
+- Swarm mode remains available when explicitly requested, but it no longer requires a second top-level entrypoint to carry the same autopilot contract. +- When no active manifest exists, `lfg` can reconstruct the current gate state from repo and PR state, create a new manifest, and continue from the right next step. +- The resume engine uses a documented, deterministic gate order so two implementers would choose the same next step from the same repo/PR state. +- A PR-stage `/lfg` run can inspect CI, rerun local verification as needed, run browser validation when applicable, update PR artifacts, and stop short of DONE when an external blocker such as CI is still unresolved. +- `slfg` can be deprecated without breaking existing users abruptly because it clearly routes them onto the `lfg` contract and swarm mode rather than silently disappearing. +- The design works across Claude Code, Codex, and other installed targets without requiring platform-specific argument parsing features. + +## Scope Boundaries + +- Not redesigning the overall skill format or replacing skills with a single giant orchestrated prompt. +- Not redesigning swarm execution itself beyond moving it under `lfg` as an execution option instead of a separate top-level workflow. +- Not requiring perfect historical reconstruction of every past step; when `lfg` resumes without an active manifest, it only needs enough state inference to choose the safest next step and seed a new manifest. +- Not treating workflow trivia as first-class audit decisions. +- Not giving autopilot authority to invent major product direction without bounded criteria. +- Not depending on a single plugin-root runtime policy file unless that policy is copied into each installed skill's local runtime surface. +- Not specifying the full implementation details of every target platform's parser behavior beyond defining a common-denominator contract. 
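As an illustration of the ordered gate model (R5c, R5d) and the run lifecycle (R3c), the following TypeScript sketch picks the first unmet gate and derives the run-level status. It is not part of this change. The evaluation order is an assumption that follows the gate keys listed in R3a (the exact order is still deferred to planning), and `firstUnmetGate`, `deriveStatus`, and `externalBlockerPending` are hypothetical names.

```typescript
// Gate names and state enums come from R3a/R3b; function names are hypothetical.
type GateName = "requirements" | "plan" | "implementation" | "review" | "verification" | "wrap_up";
type GateState = "complete" | "skipped" | "pending" | "blocked" | "unknown";
type RunStatus = "active" | "completed" | "aborted";

interface Gate {
  state: GateState;
  evidence: string[]; // short strings or references explaining the state
}

type Gates = Record<GateName, Gate>;

// ASSUMPTION: gates are evaluated in the key order given in R3a.
const GATE_ORDER: readonly GateName[] = [
  "requirements", "plan", "implementation", "review", "verification", "wrap_up",
];

// Returns the first gate that is neither complete nor skipped, or null if none remain.
// "unknown" is treated like "pending" rather than assumed complete (R5d).
function firstUnmetGate(gates: Gates): GateName | null {
  for (const gateName of GATE_ORDER) {
    const state = gates[gateName].state;
    if (state !== "complete" && state !== "skipped") return gateName;
  }
  return null;
}

// R3c: a run is completed only when every gate is complete or skipped
// and no required external blocker (such as CI) remains.
function deriveStatus(gates: Gates, externalBlockerPending: boolean, aborted = false): RunStatus {
  if (aborted) return "aborted";
  return firstUnmetGate(gates) === null && !externalBlockerPending ? "completed" : "active";
}

// Example: a lightweight route resumed from an implementation branch.
const example: Gates = {
  requirements: { state: "skipped", evidence: ["route=lightweight"] },
  plan: { state: "skipped", evidence: ["route=lightweight"] },
  implementation: { state: "complete", evidence: ["branch has commits ahead of base"] },
  review: { state: "unknown", evidence: [] },
  verification: { state: "pending", evidence: [] },
  wrap_up: { state: "pending", evidence: [] },
};

console.log(firstUnmetGate(example)); // review
console.log(deriveStatus(example, false)); // active
```

Keeping the order in one constant is one way to make two implementers choose the same next step from the same gate state.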
+ +## Key Decisions + +- **Autopilot is run-scoped, not session-scoped**: Long-lived agent sessions can span unrelated work. Autopilot state belongs to a single `lfg` invocation. +- **`lfg` is the autopilot entrypoint**: Downstream skills do not self-elect into autopilot mode, and `slfg` should not remain as a second top-level entrypoint. +- **Swarm is a mode, not a workflow**: Parallel/swarm execution should be requested through `lfg` or future repo/project configuration, not through a separate slash command with duplicated orchestration. `slfg` should remain only as a deprecation/compatibility path while users transition. +- **`lfg` is a resume-anywhere orchestrator**: It should be able to inspect current workflow state, decide what gate the work is currently blocked on, create a manifest if one is missing, and continue from the next appropriate step rather than assuming every run starts from ideation. +- **Resume uses ordered gates, not vague stage guesses**: `lfg` should evaluate ordered workflow gates and advance the first unmet one. Stages that cannot be reliably proven complete from ambient evidence should be treated as pending. +- **Late-stage evidence must be explicit**: Review resolution, local verification, browser verification, CI, and wrap-up artifacts should not be inferred from a generic "PR exists" signal. Those gates are complete only when `lfg` has current evidence for them. +- **PR-stage completion is explicit**: CI status, local verification, browser validation, unresolved todos/findings, and required PR artifacts are separate late-stage checks. `lfg` should not compress them into a single fuzzy "PR exists, therefore done" state. +- **Hybrid signaling**: Use a short explicit invocation marker plus a run manifest on disk. The marker activates autopilot deterministically; the manifest carries shared state and artifacts. 
+- **The manifest schema is fixed for phase 1**: `status` stays at the run level with `active | completed | aborted`; gate-by-gate progress and blockers live under `gates`, not as extra top-level run statuses. +- **Common-denominator parsing over platform-specific parsing**: Do not rely on Claude-only positional args or formatting assumptions that other targets may not preserve. +- **Roles are always active**: `Product Manager`, `Designer`, and `Engineer` shape recommendations in normal mode and bounded decisions in autopilot mode. +- **Orchestration bias is per skill, not a peer role**: It changes how aggressively a skill should continue, ask, or defer without replacing the substantive decision roles. +- **Only substantive autonomous decisions are logged**: The audit log is for product and implementation decisions the agent actually made on the user's behalf, including both pre-existing open questions and new execution discoveries, not routine workflow mechanics. +- **Review utilities inform but do not own substantive decisions**: `document-review` can surface and classify issues, but the skill that owns the artifact must decide how to resolve substantive findings in autopilot. +- **Run log first, plan summary second**: The run-scoped decision log is the canonical record. A plan, when it exists, gets a compact promoted summary of the relevant subset using the same row schema. +- **Decision rows have a fixed schema**: Use `# | Phase | Question | Decision | Why | Impact | Type` so the log captures both the ambiguity being resolved and the chosen path. +- **`document-review` classifications are fixed for phase 1**: Review findings should classify into `mechanical-fix`, `bounded-decision`, `must-ask`, or `note` so callers can resolve them consistently. +- **Runtime rubric must ship with the skill**: Installed skills need local access to the rubric they are expected to follow. 
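The fixed decision-row schema (`# | Phase | Question | Decision | Why | Impact | Type`) and the `Type` values from R15b can be sketched as a small append-only renderer. This is a minimal illustration, not the settled append strategy, which remains an open planning question. It assumes the log file contains only the table, and `appendDecision`, `renderRow`, and `cell` are hypothetical names.

```typescript
// Type values come from R15b; the column list comes from R15a.
type DecisionType = "documented-open-question" | "execution-discovery" | "conflict-resolution";

interface DecisionRow {
  phase: string; // e.g. "brainstorm", "plan", "deepen-plan", "work"
  question: string;
  decision: string;
  why: string;
  impact: string;
  type: DecisionType;
}

const COLUMNS = ["#", "Phase", "Question", "Decision", "Why", "Impact", "Type"] as const;

// Escape pipes and collapse newlines so a cell cannot break the table row.
function cell(text: string): string {
  return text.replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ").trim();
}

function renderHeader(): string {
  const header = `| ${COLUMNS.join(" | ")} |`;
  const rule = `| ${COLUMNS.map(() => "---").join(" | ")} |`;
  return `${header}\n${rule}`;
}

function renderRow(index: number, row: DecisionRow): string {
  const cells = [String(index), row.phase, row.question, row.decision, row.why, row.impact, row.type];
  return `| ${cells.map(cell).join(" | ")} |`;
}

// Append-only: returns the log with one row added; existing rows are never rewritten.
// ASSUMPTION: the log holds nothing but the table (header, rule, then rows).
function appendDecision(log: string, row: DecisionRow): string {
  const base = log.trim() === "" ? renderHeader() : log.replace(/\n+$/, "");
  const existingRows = base.split("\n").length - 2; // subtract header and rule lines
  return `${base}\n${renderRow(existingRows + 1, row)}\n`;
}

// Hypothetical example row, for illustration only.
const sampleRow: DecisionRow = {
  phase: "plan",
  question: "Persist the theme per user or per device?",
  decision: "Per user",
  why: "Matches existing settings storage",
  impact: "Adds one settings field",
  type: "documented-open-question",
};

console.log(appendDecision("", sampleRow));
```

Because the plan summary uses the same column schema as the canonical run log (R12a), the same `renderRow` could serve both, with only the row selection differing.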
+ +## Dependencies / Assumptions + +- Assumes `.context/compound-engineering/` is the correct place for run-scoped workflow state and decision logs. +- Assumes `lfg` can consistently prepend a short marker to downstream skill input across supported targets. +- Assumes skills can read a manifest path passed in their argument payload and treat the remainder as normal input. +- Assumes the role rubric can be expressed compactly enough inside each participating skill, or through co-located references that ship with that skill. +- Assumes the workflow can promote relevant rows from the run log into a plan doc after the plan exists, without requiring the plan to be present when the first decisions are made. +- Assumes `document-review` can expose findings in a way the calling workflow skill can interpret and act on without turning review into a hidden decision-maker. +- Assumes swarm-mode selection can move behind `lfg` without losing the ability to request parallel execution explicitly, and that later repo/project defaults can live in `compound-engineering.local.md`. +- Assumes `lfg` can infer enough workflow gate state from repo artifacts and PR context to resume safely when no active manifest exists, and can fall back to a targeted user question when the state is genuinely ambiguous. +- Assumes late-stage verification steps can be modeled conservatively: if `lfg` cannot prove they were completed for the current HEAD, it may rerun or recheck them rather than assuming success. + +## Outstanding Questions + +### Resolve Before Planning + +- [Affects R4][User decision] What exact syntax should be standardized for the already-chosen short prefix invocation marker used to activate autopilot runs? + +### Deferred to Planning + +- [Affects R2a][Technical] What exact repo/PR heuristics should `lfg` use to infer each workflow gate when resuming without an active manifest? 
+- [Affects R5c][Technical] What exact gate order and evidence rules should define resume-stage detection, especially for PR-stage verification and completion? +- [Affects R11][Technical] What append/update strategy should the Markdown decision log use so multiple skills can contribute rows safely and predictably during one run? +- [Affects R7][Technical] What is the exact role ordering and orchestration bias for each participating skill (`ce:brainstorm`, `ce:plan`, `deepen-plan`, `ce:work`, and any others)? +- [Affects R16][Technical] Should the runtime rubric live inline in each skill, in each skill's `references/`, or be authored centrally and copied into each skill during build/release? + +## Next Steps + +-> Proceed through planning and implementation with the prefix-marker approach, the ordered gate model, and `slfg` deprecation-wrapper behavior treated as settled direction for phase 1 diff --git a/docs/plans/2026-03-24-002-feat-autopilot-run-context-and-decision-rubric-plan.md b/docs/plans/2026-03-24-002-feat-autopilot-run-context-and-decision-rubric-plan.md new file mode 100644 index 000000000..37e1d2b10 --- /dev/null +++ b/docs/plans/2026-03-24-002-feat-autopilot-run-context-and-decision-rubric-plan.md @@ -0,0 +1,568 @@ +--- +title: "feat: add deterministic autopilot run context and role-based decision rubric" +type: feat +status: active +date: 2026-03-24 +origin: docs/brainstorms/2026-03-24-autopilot-run-context-and-decision-rubric-requirements.md +deepened: 2026-03-24 +--- + +# Add Deterministic Autopilot Run Context and Role-Based Decision Rubric + +## Overview + +Introduce a deterministic runtime contract for `lfg` autopilot runs, plus a shared role-based decision rubric that core workflow skills use in both interactive and autopilot modes. 
`lfg` will become the single autopilot entrypoint and a resume-anywhere orchestrator: it should evaluate ordered workflow gates, create or backfill a run-scoped manifest and decision log, and pass a short invocation marker downstream from whatever stage the work is already in. `slfg` should be deprecated into a compatibility wrapper that routes to `lfg` with swarm mode enabled, with a later path to project-level defaulting via `compound-engineering.local.md`. `ce:brainstorm`, `ce:plan`, `deepen-plan`, and `ce:work` will adopt explicit role ordering, orchestration bias, and bounded auto-decision/logging rules. `test-browser` and `feature-video` are also in scope as autopilot contract consumers: they need deterministic autopilot recognition and manifest compatibility, but they do not participate in the substantive role-rubric decision model. `document-review` will remain a review utility that can classify findings, but substantive decisions discovered through review will still be resolved and logged by the owning workflow skill. + +## Problem Frame + +Today the autopilot contract is described in prose across `lfg`, `slfg`, and several workflow skills, but there is no deterministic runtime mechanism that tells a downstream skill "this is an active autopilot run." That makes autopilot brittle and encourages caller-inference rules like "when called from `lfg`/`slfg`" instead of a real shared state model. Maintaining both `lfg` and `slfg` also duplicates entrypoint surface area for what is fundamentally the same autopilot workflow. Separately, `lfg` is still framed too much as a start-of-workflow command rather than a universal "keep the work moving" orchestrator. A stronger design lets the user run `/lfg` from any stage: from idea, from requirements, from a plan, from an implementation branch, or from an open PR that mainly needs verification and wrap-up. 
At the same time, the plugin lacks a consistent role-based decision rubric for recommendations and bounded autonomous decisions: `ce:brainstorm`, `ce:plan`, `deepen-plan`, and `ce:work` need explicit `Product Manager` / `Designer` / `Engineer` ordering and per-skill orchestration bias so they can recommend or decide consistently without inventing product behavior. (see origin: `docs/brainstorms/2026-03-24-autopilot-run-context-and-decision-rubric-requirements.md`) + +## Requirements Trace + +- R1-R5h. `lfg` is the autopilot entrypoint, creates or backfills a run manifest, evaluates ordered workflow gates, and passes a deterministic invocation marker that does not depend on platform-specific positional parsing. `slfg` is deprecated into a compatibility wrapper for swarm mode inside `lfg` +- R6-R9a. Core workflow skills use the same role vocabulary and define per-skill role ordering, orchestration bias, and `may decide / must ask / must log` boundaries +- R9b. `document-review` remains a review utility; owning workflow skills resolve substantive findings +- R10-R15. The canonical decision log is run-scoped, logs substantive autonomous decisions from both documented open questions and execution discoveries, and promotes a compact summary into a plan when relevant +- R16-R17. 
The runtime rubric must be available from the installed skills' own directories and work across converted targets + +## Scope Boundaries + +- Not redesigning the global skill format or replacing skills with a monolithic orchestrated prompt +- Not turning workflow trivia into first-class audit decisions +- Not giving autopilot authority to invent major product direction without bounded criteria +- Not requiring plugin-root runtime files that are not guaranteed to ship with each installed skill +- Not fully rolling the substantive role-rubric contract out to every autopilot-aware utility skill in the same change; the first-wave decision owners are `lfg`, `ce:brainstorm`, `ce:plan`, `deepen-plan`, and `ce:work` +- Not treating `test-browser` or `feature-video` as substantive product/implementation decision owners; they are first-wave autopilot contract consumers only +- Not designing the full long-term `compound-engineering.local.md` swarm configuration surface in this same change; only preserve a clear path for it + +## Context & Research + +### Relevant Code and Patterns + +- `plugins/compound-engineering/skills/lfg/SKILL.md` and `plugins/compound-engineering/skills/slfg/SKILL.md` currently split the same autopilot workflow across two top-level entrypoints; this plan collapses that split so `lfg` owns the contract and swarm becomes an execution mode while `slfg` survives only as a deprecation wrapper +- `plugins/compound-engineering/skills/ce-brainstorm/SKILL.md`, `plugins/compound-engineering/skills/ce-plan/SKILL.md`, `plugins/compound-engineering/skills/deepen-plan/SKILL.md`, and `plugins/compound-engineering/skills/ce-work/SKILL.md` already have `## Autopilot Mode` behavior and are the substantive decision-making skills in scope +- `plugins/compound-engineering/skills/document-review/SKILL.md` and `plugins/compound-engineering/agents/document-review/*.md` already provide persona-based document review with synthesized findings +- 
`plugins/compound-engineering/skills/test-browser/SKILL.md` and `plugins/compound-engineering/skills/feature-video/SKILL.md` already have autopilot-specific best-effort behavior and are invoked by the end-to-end workflow, so they need to consume the new run contract even though they are not substantive decision owners +- `src/parsers/claude.ts` discovers skills from `SKILL.md` directories only; runtime policy cannot safely depend on an arbitrary plugin-root file +- `src/utils/files.ts` and target writers (`src/targets/codex.ts`, `src/targets/gemini.ts`, `src/targets/copilot.ts`, etc.) copy each skill directory independently, reinforcing that runtime rubric content must ship from the skill's own directory +- `tests/review-skill-contract.test.ts` is an existing pattern for asserting text contracts in workflow skills and reference files + +### Institutional Learnings + +- `docs/solutions/skill-design/lfg-slfg-pipeline-orchestration-and-autopilot-mode.md` documents the first-generation autopilot convention and the current `lfg`/`slfg` split that this plan simplifies +- `docs/plans/2026-03-23-001-feat-plan-review-personas-beta-plan.md` and the completed persona-based `document-review` implementation establish the pattern of persona generation + orchestrator synthesis, which should remain separate from artifact-owner judgment +- Existing skill-contract tests for `ce-review-beta` show the repo already protects fragile orchestration contracts with text assertions rather than relying only on human review + +### External References + +- None required. This plan is dominated by plugin-local skill design, packaging behavior, and orchestration conventions already present in the repo. + +## Key Technical Decisions + +- **Use a short prefix invocation marker plus a run manifest**: The chosen signaling family is a small explicit autopilot prefix at the start of downstream skill input, plus a run-scoped manifest file under `.context/compound-engineering/autopilot//`. 
This avoids relying on line breaks, XML blocks, or platform-specific positional args. +- **Resolve the marker as a technical choice now**: Standardize on a prefix marker in the shape `[ce-autopilot manifest=] :: ` (or the equivalent parsing contract without semantic drift). The family is already decided; the plan should not reopen broader signaling alternatives. +- **The manifest owns artifact discovery**: The minimum manifest shape should include `schema_version`, `run_id`, `route`, `status`, `feature_description`, `requirements_doc`, and `plan_doc`. Downstream skills should read and update those fields instead of rediscovering artifacts heuristically. +- **The manifest also records gate state and completion evidence**: In addition to artifact paths, the run state should capture ordered gate status (`complete`, `skipped`, `pending`, `blocked`, `unknown`) and the evidence used to mark reconstructed gates complete. +- **The exact phase-1 manifest schema is fixed**: Use: + - `schema_version` + - `run_id` + - `route` = `direct | lightweight | full` + - `status` = `active | completed | aborted` + - `implementation_mode` = `standard | swarm` + - `feature_description` + - `current_gate` (optional, advisory only) + - `gates.requirements | gates.plan | gates.implementation | gates.review | gates.verification | gates.wrap_up`, each with `state` = `complete | skipped | pending | blocked | unknown` plus `evidence` + - `gates.review.ref | gates.verification.ref | gates.wrap_up.ref` for stale late-stage invalidation after code changes + - `artifacts.requirements_doc | artifacts.plan_doc` + - direct and lightweight routes still create a manifest; they mark `requirements` and `plan` as `skipped` with evidence when those artifacts are intentionally absent +- **Run status stays coarse; gate state carries detail**: "Waiting on CI" or "review incomplete" should be expressed in gate state and evidence while the overall run remains `active`.
Only `completed` and `aborted` are terminal, and `completed` allows required gates to be either `complete` or `skipped`. +- **`lfg` is the only top-level autopilot entrypoint**: `slfg` should stop owning its own orchestration contract. When users want parallel execution, `lfg` should expose swarm as an execution mode, and `slfg` should become a compatibility wrapper that points there. +- **Swarm selection belongs behind `lfg`**: In phase 1, swarm should be selected explicitly by user intent. Repo/project defaults should use `compound-engineering.local.md` frontmatter `implementation_mode: standard | swarm`, with missing treated as `standard`. +- **`lfg` resumes from current state, not only from scratch**: If an active autopilot manifest exists, resume from it. If no active manifest exists, inspect repo artifacts and PR state, infer the current gate state conservatively, create a fresh manifest seeded with that state, and continue from the next appropriate step. +- **Resume is a deterministic ordered-gate engine**: `lfg` should not "guess the stage" loosely. It should evaluate, in order, `requirements`, `plan`, `implementation`, `review`, `verification`, and `wrap-up`, then advance the first unmet gate it can justify from evidence. +- **State reconstruction has explicit evidence precedence**: Gate completion should be derived in this order: active manifest state, explicit user direction in the current `lfg` invocation, durable workflow artifacts and repo state, then PR/CI state for the current branch/HEAD. If ambiguity remains after those sources, `lfg` should ask one targeted question rather than take a risky leap. +- **Late-stage gates must be conservative**: Review resolution, local verification, browser validation, and wrap-up should remain pending unless `lfg` has current evidence for them. A generic open PR or historical branch activity is not enough. 
+- **Best-effort late-stage steps should degrade gracefully**: Browser testing and feature-video should usually record `skipped` with a brief reason when environment/auth/tooling prevents them, not derail the whole run. +- **PR-stage orchestration is a first-class contract**: When resuming from an implementation or PR stage, `lfg` should inspect CI status for the current HEAD, decide whether local tests or browser checks need reruns, and distinguish "waiting on CI" from truly DONE. +- **Keep the runtime rubric in-skill for the first wave**: Because installed skills must be self-sufficient and packaging a shared runtime reference adds complexity, phase 1 should place the rubric and role ordering directly in the relevant skill files. A shared authored source copied into each skill can be reconsidered later. +- **Use one shared decision-criteria set across the first-wave skills**: The substantive roles should evaluate choices using a common criteria vocabulary so recommendations and autonomous decisions stay consistent across phases: + - `User Value` -- which option better serves the intended user or product outcome + - `Completeness` -- which option covers real states, edge cases, and follow-through within the chosen scope + - `Local Leverage` -- whether nearby blast-radius work is worth expanding now because it is adjacent and cheap + - `Reuse` -- whether an existing pattern, capability, or implementation should be reused instead of creating a parallel one + - `Clarity` -- whether the approach is explicit, understandable, and unsurprising + - `Momentum` -- whether materially equivalent options should be resolved quickly so work can continue +- **Roles are always active; autopilot changes authority**: `Product Manager`, `Designer`, and `Engineer` guide recommendations in normal mode and bounded decisions in autopilot mode. The difference between modes is whether the skill recommends or decides, not whether the roles apply. 
+- **Role definitions are expansive, not UI-only**: + - `Product Manager` optimizes for user value, scope coherence, success criteria, and priority alignment + - `Designer` optimizes for user experience broadly: interaction flow, defaults, terminology, state coverage, error recovery, information architecture, and visual/interaction clarity when relevant + - `Engineer` optimizes for correctness, reuse, maintainability, implementation clarity, and repo-fit +- **Orchestration bias is per skill, not a peer role**: Each substantive workflow skill declares its own bias toward continuation or escalation; `lfg` remains orchestration-first rather than a substantive decider. +- **Separate decision owners from contract consumers**: `ce:brainstorm`, `ce:plan`, `deepen-plan`, and `ce:work` own substantive role-rubric decisions. `test-browser` and `feature-video` must recognize the new autopilot contract and participate in the shared run context, but they are not part of the first-wave role-rubric decision-authority work. +- **Only decision-owner skills write substantive decision rows**: Utility skills such as `document-review`, `test-browser`, and `feature-video` may surface findings, todos, or operational notes, but they should not append substantive product or implementation decisions to the canonical decision log. The owning workflow skill decides and logs. +- **The decision-log row schema is fixed for phase 1**: Use the Markdown columns `# | Phase | Question | Decision | Why | Impact | Type`, where `Type` is one of `documented-open-question`, `execution-discovery`, or `conflict-resolution`. +- **The run log is canonical; plan summary is promoted**: The decision log lives first in `.context`. If a plan exists, only the relevant subset is promoted into an `Autopilot Decisions` section using the same column schema. 
+- **Promotion into the plan follows a fixed rule**: Promote all `brainstorm`, `plan`, and `deepen-plan` rows that changed product behavior, scope, sequencing, risk handling, implementation direction, or verification strategy reflected in the plan. Promote `work` rows only when execution forced a meaningful plan-level deviation or resolved an implementation question the plan left open. +- **`document-review` stays a utility**: Persona findings should classify into `mechanical-fix`, `bounded-decision`, `must-ask`, or `note`, and the owning skill remains responsible for making and logging substantive autonomous decisions. + +## Open Questions + +### Resolved During Planning + +- **Should the exact invocation marker still block planning?** No. The high-level signaling approach is already settled. Standardize on a short prefix marker contract and finalize exact syntax during implementation without routing back to `ce:brainstorm`. +- **Should runtime rubric content live in a shared plugin-root file?** Not in phase 1. Installed skills must work from their own shipped directories, so phase 1 should keep the rubric inline in the participating skills. +- **Should `document-review` become an autopilot decision-maker?** No. It should remain a review utility that returns findings and classifications; the artifact owner resolves substantive issues. +- **Should `test-browser` and `feature-video` be in phase 1?** Yes, as autopilot contract consumers only. They need deterministic autopilot recognition and shared-run compatibility because the end-to-end workflow invokes them, but they do not own substantive role-rubric decisions. +- **Should `slfg` remain as a separate workflow?** No. Deprecate it into a compatibility wrapper so the top-level contract surface collapses to `lfg` without breaking the few existing users abruptly. Swarm remains available as a mode within `lfg` and later through project configuration. +- **Should `lfg` only kick off from the beginning?** No. 
It should resume from the current gate state whenever possible, creating a manifest if one is missing and carrying that forward for the rest of the run. + +### Deferred to Implementation + +- Whether phase 2 should extend the substantive role-rubric model to `test-browser`, `feature-video`, or other autopilot-aware utility skills after they first adopt the run contract as consumers + +## High-Level Technical Design + +> *This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce.* + +```text +Autopilot flow: + +1. User runs /lfg +2. Entry skill checks for an active autopilot manifest for the current branch/worktree +3. If found: + - resume from that manifest +4. If not found: + - inspect state using ordered evidence precedence: + - explicit user direction in the current invocation + - durable workflow artifacts and repo state + - PR and CI state for the current branch/HEAD + - evaluate ordered workflow gates: + - requirements + - plan + - implementation + - review + - verification + - wrap-up + - mark a gate complete only when current evidence supports it + - create .context/compound-engineering/autopilot// + - session.json + - decisions.md + - seed the manifest with inferred gate state, evidence, and known artifacts +5. Entry skill advances the first unmet gate it can justify safely +6. Entry skill invokes downstream skills with: + [ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: +7. Downstream skill: + - strips prefix marker + - validates manifest (mode=autopilot, status=active) + - applies its role ordering + orchestration bias + - writes artifact paths or decision rows back through the run-scoped files +8. Substantive decisions resolved by the artifact-owning skill create decision-log rows +9. 
If a plan exists, relevant rows are promoted into an Autopilot Decisions summary + +Decision log schema: + +| # | Phase | Question | Decision | Why | Impact | Type | +|---|-------|----------|----------|-----|--------|------| + +- `Type` is one of: + - `documented-open-question` + - `execution-discovery` + - `conflict-resolution` + +Late-stage resume rule: + +- If `implementation` is complete or an open PR exists: + - evaluate `review`, `verification`, and `wrap-up` separately + - inspect CI state for the current HEAD + - rerun local tests or browser checks when current evidence is missing or stale + - treat "waiting on required CI" as distinct from DONE + +Document review utility flow: + +1. Owning skill writes/updates requirements or plan +2. Owning skill invokes document-review +3. document-review returns findings + classifications +4. Owning skill decides: + - auto-fix mechanical + - auto-decide bounded issue and log it + - leave open + - ask user + +document-review classifications: + +- `mechanical-fix` +- `bounded-decision` +- `must-ask` +- `note` + +Utility-skill rule: + +- test-browser / feature-video: + - validate the same prefix marker + manifest + - preserve best-effort operational behavior + - create todos or operational notes when appropriate + - do not write substantive decision rows + +Role and bias matrix: + +- ce:brainstorm + - roles: Product Manager > Designer > Engineer + - dominant criteria: User Value, Completeness, Clarity + - orchestration bias: low + - behavior: recommend strongly in normal mode; auto-decide only bounded requirement/scope questions in autopilot + +- ce:plan + - roles: Engineer > Product Manager > Designer + - dominant criteria: Clarity, Reuse, Completeness + - orchestration bias: medium + - behavior: resolve bounded planning and implementation-direction choices; do not invent new product behavior + +- deepen-plan + - roles: Engineer > Product Manager > Designer + - dominant criteria: Completeness, Clarity, Reuse + - 
orchestration bias: low-medium + - behavior: strengthen weak sections and risk treatment; surface true product gaps instead of silently resolving them + +- ce:work + - roles: Engineer > Designer > Product Manager + - dominant criteria: Clarity, Reuse, Local Leverage, Momentum + - orchestration bias: high + - behavior: keep execution moving within approved bounds; log bounded execution discoveries and escalate plan-breaking changes + +- lfg + - role model: orchestration-first, not a substantive decision owner + - behavior: detect current gate state, create or backfill run context, carry it forward, gate progression between skills, and select swarm execution only when explicitly requested or later configured +``` + +## Implementation Units + +- [ ] **Unit 1: Collapse the top-level autopilot entrypoint to `lfg`, define the marker/manifest contract, and deprecate `slfg`** + +**Goal:** Make `lfg` the only top-level skill that owns the autopilot contract, while turning `slfg` into a thin compatibility wrapper rather than a second orchestration surface. 
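The marker and manifest contract this unit defines can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not the shipped contract: the regex, the `parseMarker` and `firstUnmetGate` helpers, the numeric type of `schema_version`, and the `run-1` directory used in the usage note are all hypothetical. Only the field names and enum values come from the phase-1 schema described in this plan.

```typescript
// Hypothetical sketch: helper names and the marker regex are assumptions.
type GateName =
  | "requirements" | "plan" | "implementation"
  | "review" | "verification" | "wrap_up";
type GateState = "complete" | "skipped" | "pending" | "blocked" | "unknown";

interface Gate {
  state: GateState;
  evidence?: string;
}

interface Manifest {
  schema_version: number; // assumed numeric; the plan does not fix the type
  run_id: string;
  route: "direct" | "lightweight" | "full";
  status: "active" | "completed" | "aborted";
  implementation_mode: "standard" | "swarm";
  feature_description: string;
  gates: Record<GateName, Gate>;
  artifacts: { requirements_doc?: string; plan_doc?: string };
}

const GATE_ORDER: GateName[] = [
  "requirements", "plan", "implementation", "review", "verification", "wrap_up",
];

// Split a prefixed input into the manifest path and the normal task input.
// Returns null when no marker is present (standalone invocation).
function parseMarker(input: string): { manifestPath: string; task: string } | null {
  const match = /^\[ce-autopilot manifest=([^\]\s]+)\] :: ([\s\S]*)$/.exec(input);
  if (match === null) return null;
  return { manifestPath: match[1], task: match[2] };
}

// Walk the gates in order and return the first one that is neither
// complete nor skipped, or null when every gate is satisfied.
function firstUnmetGate(manifest: Manifest): GateName | null {
  for (const name of GATE_ORDER) {
    const state = manifest.gates[name].state;
    if (state !== "complete" && state !== "skipped") return name;
  }
  return null;
}
```

Under these assumptions, `parseMarker("[ce-autopilot manifest=.context/compound-engineering/autopilot/run-1/session.json] :: add dark mode")` yields the manifest path and the task `add dark mode`, while an input without the prefix yields `null` and the skill behaves as a standalone invocation.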
+ +**Requirements:** R1-R5f, R9a, R10-R12 + +**Dependencies:** None + +**Files:** +- Modify: `plugins/compound-engineering/skills/lfg/SKILL.md` +- Modify: `plugins/compound-engineering/skills/slfg/SKILL.md` +- Modify: `plugins/compound-engineering/AGENTS.md` +- Modify: `plugins/compound-engineering/README.md` + +**Approach:** +- Standardize the downstream invocation contract around a short prefix marker carrying only the manifest path and normal task input +- Define the exact phase-1 manifest schema and lifecycle rules, including `active | completed | aborted` at the run level and per-gate `complete | skipped | pending | blocked | unknown` +- Replace prose-only caller-context language with the new deterministic contract where appropriate +- Deprecate `slfg` into a compatibility wrapper that routes users onto `lfg` with swarm mode enabled, while clearly signaling that `lfg` is the canonical entrypoint +- Define how users explicitly request swarm execution through `lfg`, while leaving repo/project defaulting to a later `compound-engineering.local.md` follow-up +- Update `AGENTS.md` and `README.md` to document the single-entrypoint model plus swarm-as-mode behavior + +**Patterns to follow:** +- Existing `## Autopilot Mode` sections in `ce:brainstorm`, `ce:plan`, `deepen-plan`, and `ce:work` +- `docs/solutions/skill-design/lfg-slfg-pipeline-orchestration-and-autopilot-mode.md` + +**Test scenarios:** +- `lfg` with empty arguments still routes to `ce:brainstorm`, but now seeds the run manifest first +- `slfg` routes users onto `lfg` with swarm mode instead of silently preserving a second orchestration contract +- `lfg` can still select swarm execution when explicitly requested without needing a second workflow entrypoint +- No remaining top-level docs imply that `slfg` is required for autopilot or swarm behavior + +**Verification:** +- `lfg` explicitly initializes run-scoped autopilot state +- `slfg` no longer acts as a separate top-level contract surface +- `AGENTS.md` and
`README.md` document the new autopilot contract clearly enough for future skill authors and users + +--- + +- [ ] **Unit 2: Build the deterministic resume engine and PR-stage orchestration in `lfg`** + +**Goal:** Make `lfg` resume safely from any workflow point by evaluating ordered gates and explicit evidence instead of relying on loose stage guesses. + +**Requirements:** R2a-R5h, R9a + +**Dependencies:** Unit 1 + +**Files:** +- Modify: `plugins/compound-engineering/skills/lfg/SKILL.md` + +**Approach:** +- Define the ordered gate model explicitly: `requirements`, `plan`, `implementation`, `review`, `verification`, `wrap-up` +- Define evidence precedence for reconstructed runs: + - active manifest for the current run + - explicit user direction in the current invocation + - durable workflow artifacts and repo state + - PR and CI state for the current branch/HEAD +- Require `lfg` to mark a gate complete only when current evidence supports it, otherwise leave it pending or ask a targeted clarifying question +- Make late-stage checks first-class: separate `review`, `verification`, and `wrap-up` instead of treating "open PR" as DONE +- Define current-HEAD PR-stage behavior: + - inspect GitHub CI status + - decide when local tests must rerun + - decide when browser validation should rerun + - distinguish "waiting on CI" from DONE +- Require manifest backfill to record both inferred gate state and the evidence used for those inferences +- Define how "waiting on CI" is represented: keep the run `active`, mark `verification` or `wrap_up` as `blocked`/`pending` with current-HEAD CI evidence, and avoid any pseudo-terminal intermediate run status + +**Patterns to follow:** +- Existing `lfg` direct-path vs full-pipeline routing +- Late-stage workflow expectations already embedded across `ce:work`, `test-browser`, and `feature-video` + +**Test scenarios:** +- `lfg` invoked when a requirements doc exists but no active manifest creates a fresh manifest and continues at planning 
+- `lfg` invoked when a plan exists but code has not started advances to implementation +- `lfg` invoked on an implementation branch with unresolved review todos treats `review` as pending even if an open PR exists +- `lfg` invoked on an open PR with passing local evidence but pending GitHub CI continues with remaining wrap-up work but does not declare DONE +- `lfg` invoked on an open PR with no durable browser-verification evidence reruns or re-requests browser validation instead of assuming it already happened + +**Verification:** +- `lfg` uses an explicit ordered gate model rather than vague stage inference +- Late-stage orchestration is specified clearly enough that two implementers would pick the same next step from the same repo/PR state +- PR-stage completion and "waiting on CI" are distinct outcomes + +--- + +- [ ] **Unit 3: Add role rubric and autopilot marker handling to `ce:brainstorm`** + +**Goal:** Make `ce:brainstorm` use the role rubric in both interactive and autopilot modes and write requirements artifact state into the run manifest. 
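Writing requirements artifact state into the run manifest might look roughly like the sketch below. The `registerRequirementsDoc` helper, the trimmed `RunManifest` type, and the evidence wording are assumptions made for illustration. The field names (`artifacts.requirements_doc`, `gates.requirements.state`, `evidence`) follow the phase-1 schema in this plan.

```typescript
// Hypothetical helper: a trimmed manifest type with only the fields this sketch touches.
type RunGateState = "complete" | "skipped" | "pending" | "blocked" | "unknown";

interface RunManifest {
  status: "active" | "completed" | "aborted";
  gates: { requirements: { state: RunGateState; evidence?: string } };
  artifacts: { requirements_doc?: string; plan_doc?: string };
}

// Return an updated copy of the manifest that records the requirements doc
// path and marks the requirements gate complete with evidence.
// Refuses to touch a run that is no longer active.
function registerRequirementsDoc(manifest: RunManifest, docPath: string): RunManifest {
  if (manifest.status !== "active") {
    throw new Error(`cannot update a ${manifest.status} run`);
  }
  return {
    ...manifest,
    gates: {
      ...manifest.gates,
      requirements: { state: "complete", evidence: `requirements doc: ${docPath}` },
    },
    artifacts: { ...manifest.artifacts, requirements_doc: docPath },
  };
}
```

Returning a copy rather than mutating in place is a design choice of this sketch; it keeps the previous state available if the caller needs to compare before writing `session.json` back to disk.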
+ +**Requirements:** R6-R9, R10-R14, R16-R17 + +**Dependencies:** Units 1-2 + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-brainstorm/SKILL.md` + +**Approach:** +- Add explicit role ordering (`Product Manager` > `Designer` > `Engineer`) and low orchestration bias guidance +- Add the shared decision criteria and explain how brainstorm uses them to recommend options in normal mode +- Define how brainstorm recommends options in normal mode using that role ordering +- Define how brainstorm strips the autopilot marker, validates the manifest, and logs bounded autonomous decisions only when in autopilot mode +- Require brainstorm to register the generated requirements doc path in the run manifest + +**Patterns to follow:** +- Existing `ce:brainstorm` distinction between workflow prompts vs content questions +- Existing brainstorm requirements-document structure and Phase 0 short-circuiting + +**Test scenarios:** +- Standalone brainstorm recommends one option using the role rubric but still asks the user +- Autopilot brainstorm logs a bounded product decision made on behalf of the user +- Empty-input autopilot brainstorm still asks the user for the missing feature description instead of inventing one + +**Verification:** +- `ce:brainstorm` explicitly applies the role rubric in both modes +- Requirements doc path registration is part of the autopilot contract + +--- + +- [ ] **Unit 4: Add role rubric, manifest updates, and promoted decision summaries to `ce:plan` and `deepen-plan`** + +**Goal:** Make planning skills use deterministic autopilot detection, skill-specific role ordering, and plan-aware decision promotion. 
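Plan-aware decision promotion can be illustrated with a small filter over run-log rows. This is a sketch under assumptions: the `DecisionRow` type, the two boolean flags, and the helper names are hypothetical stand-ins for the judgment the owning skill makes. The column order and the `Type` values match the phase-1 row schema in this plan.

```typescript
// Hypothetical types: the boolean flags stand in for judgment made by the owning skill.
type Phase = "brainstorm" | "plan" | "deepen-plan" | "work";
type DecisionType = "documented-open-question" | "execution-discovery" | "conflict-resolution";

interface DecisionRow {
  id: number;
  phase: Phase;
  question: string;
  decision: string;
  why: string;
  impact: string;
  type: DecisionType;
  changedPlanContent?: boolean;  // planning phases: altered behavior, scope, sequencing, risk, direction, or verification
  forcedPlanDeviation?: boolean; // work phase: forced a plan-level deviation or resolved a question the plan left open
}

// Apply the promotion rule: planning-phase rows need changedPlanContent,
// work rows need forcedPlanDeviation.
function shouldPromote(row: DecisionRow): boolean {
  if (row.phase === "work") return row.forcedPlanDeviation === true;
  return row.changedPlanContent === true;
}

// Escape pipes so a cell cannot break the Markdown table.
function cell(text: string): string {
  return text.replace(/\|/g, "\\|");
}

// Render the promoted subset using the same column schema as the run log.
function renderAutopilotDecisions(rows: DecisionRow[]): string {
  const header = [
    "| # | Phase | Question | Decision | Why | Impact | Type |",
    "|---|-------|----------|----------|-----|--------|------|",
  ];
  const body = rows.filter(shouldPromote).map((r) =>
    `| ${r.id} | ${r.phase} | ${cell(r.question)} | ${cell(r.decision)} | ${cell(r.why)} | ${cell(r.impact)} | ${r.type} |`
  );
  return [...header, ...body].join("\n");
}
```

Because the table is always rendered from the run log, the plan summary stays a derived view rather than a second source of truth.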
+ +**Requirements:** R4-R9, R10-R17 + +**Dependencies:** Units 1-3 + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-plan/SKILL.md` +- Modify: `plugins/compound-engineering/skills/deepen-plan/SKILL.md` + +**Approach:** +- Add explicit role ordering and orchestration bias for both skills (`Engineer`-first, with different continuation pressure for plan vs deepening) +- Add the shared decision criteria and specify which ones dominate for each skill +- Define how each skill parses the marker, validates the manifest, and strips the prefix from normal input +- Require `ce:plan` to update the manifest with the written plan path +- Lock the row schema to `# | Phase | Question | Decision | Why | Impact | Type` and define the promotion rule for copying the relevant subset into a compact `Autopilot Decisions` section when a plan exists, using the run log as the canonical source +- Clarify when planning/deepening may resolve bounded technical questions versus surfacing them back to brainstorm/user + +**Patterns to follow:** +- Existing `ce:plan` plan-writing contract and filename pattern +- Existing `deepen-plan` guidance on not inventing product requirements and routing true product blockers back to brainstorm + +**Test scenarios:** +- `ce:plan` writes a plan, updates the manifest, and returns control in autopilot mode without post-generation menus +- `deepen-plan` uses the same run context instead of inventing its own autopilot inference +- Plan-affecting autonomous decisions are promoted from the run log into the plan summary, while workflow trivia is not + +**Verification:** +- Planning skills can operate from the shared run contract without relying on caller prose +- Plan summaries are derived from the canonical run log, not maintained as a second source of truth + +--- + +- [ ] **Unit 5: Add execution-time decision logging and role rubric to `ce:work`** + +**Goal:** Capture bounded implementation decisions and execution discoveries made to preserve forward 
momentum. + +**Requirements:** R6-R14, R16-R17 + +**Dependencies:** Units 1-4 + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md` + +**Approach:** +- Add explicit role ordering and high orchestration bias for execution +- Add the shared decision criteria and clarify that `Local Leverage` and `Momentum` are stronger in `ce:work` than in planning skills +- Define how `ce:work` uses the run manifest and logs substantive autonomous decisions created by execution discoveries, not just pre-existing open questions +- Clarify `may decide / must ask / must log` boundaries so `ce:work` can keep moving without silently making plan-level changes beyond its authority + +**Patterns to follow:** +- Existing `ce:work` branch/worktree safety split between autopilot and standalone use +- The origin requirements document's distinction between execution discoveries and documented open questions + +**Test scenarios:** +- `ce:work` discovers a new bounded implementation ambiguity during execution and logs the chosen path +- `ce:work` escalates a true product-level behavior change instead of silently deciding it +- `ce:work` does not create decision-log rows for mechanical or workflow-only actions + +**Verification:** +- Execution discoveries are first-class logged decisions when the skill resolves them autonomously +- `ce:work` keeps decision authority within its documented bounds + +--- + +- [ ] **Unit 6: Update `document-review` to support autopilot-owning callers without becoming a decider** + +**Goal:** Make `document-review` return findings that autopilot-owning workflow skills can act on without turning review into hidden authorship. 
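One way an owning skill could act on `document-review` classifications is sketched below. The `OwnerAction` names and the `resolveFinding` helper are hypothetical, and the choice to ask the user about bounded decisions outside autopilot is an assumption drawn from the plan's rule that skills recommend in normal mode and decide in autopilot mode.

```typescript
// The four phase-1 classifications named in this plan.
type Classification = "mechanical-fix" | "bounded-decision" | "must-ask" | "note";

// Hypothetical action names for what the owning skill does next.
type OwnerAction = "auto-fix" | "decide-and-log" | "ask-user" | "leave-open";

// Map a review finding to the owning skill's next step.
// document-review never decides; it only supplies the classification.
function resolveFinding(classification: Classification, autopilot: boolean): OwnerAction {
  switch (classification) {
    case "mechanical-fix":
      return "auto-fix";
    case "bounded-decision":
      // Autopilot may decide within bounds and must log the choice;
      // outside autopilot the skill recommends and the user decides.
      return autopilot ? "decide-and-log" : "ask-user";
    case "must-ask":
      return "ask-user";
    case "note":
      return "leave-open";
  }
}
```

Keeping this mapping in the caller rather than in `document-review` preserves the split the plan describes: the utility classifies, the artifact owner resolves and logs.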
+ +**Requirements:** R9b, R13a, R16-R17 + +**Dependencies:** Units 3-5 + +**Files:** +- Modify: `plugins/compound-engineering/skills/document-review/SKILL.md` +- Modify: `plugins/compound-engineering/skills/document-review/references/findings-schema.json` +- Modify: `plugins/compound-engineering/skills/document-review/references/subagent-template.md` +- Modify: `plugins/compound-engineering/skills/document-review/references/review-output-template.md` +- Possibly modify: `plugins/compound-engineering/agents/document-review/*.md` + +**Approach:** +- Keep personas as issue-finders and synthesis inputs +- Standardize the exact phase-1 classes as `mechanical-fix`, `bounded-decision`, `must-ask`, and `note` +- Preserve the current mechanical auto-fix behavior where appropriate +- Explicitly document that substantive decisions discovered through review are resolved by the owning skill, which then logs the resulting choice if it auto-decides + +**Patterns to follow:** +- Existing persona-based document review orchestration +- Existing findings schema / template pattern from `document-review` and `ce-review-beta` + +**Test scenarios:** +- `document-review` returns a bounded judgment finding that `ce:plan` could decide in autopilot +- `document-review` returns a must-ask finding that remains unresolved for the caller to escalate +- Mechanical terminology or formatting fixes remain auto-fixable without changing substantive meaning + +**Verification:** +- `document-review` remains a utility, not a substantive decision owner +- Callers receive enough classification signal to apply their own role rubric + +--- + +- [ ] **Unit 7: Make autopilot utility skills consume the new run contract** + +**Goal:** Ensure utility skills invoked by the end-to-end workflow recognize the deterministic autopilot contract and coexist with the shared run manifest without taking on substantive role-rubric authority. 
+ +**Requirements:** R1-R5, R9a, R10-R12 + +**Dependencies:** Units 1-2 + +**Files:** +- Modify: `plugins/compound-engineering/skills/test-browser/SKILL.md` +- Modify: `plugins/compound-engineering/skills/feature-video/SKILL.md` + +**Approach:** +- Update both skills to recognize the new autopilot marker/manifest contract instead of relying only on legacy caller-prose conventions +- Preserve their current best-effort, non-blocking autopilot behavior +- Clarify whether they should read any manifest fields directly and whether they should write operational state into the run directory +- Make it explicit that these skills can emit todos, skip notes, or run-status artifacts, but not substantive decision-log rows +- Keep them out of the substantive decision-rubric model except where they already emit durable artifacts like todos or run notes + +**Patterns to follow:** +- Existing autopilot mode sections in `test-browser` and `feature-video` +- The decision-owner vs utility split established for `document-review` + +**Test scenarios:** +- `test-browser` detects autopilot from the new run contract and continues using non-blocking failure handling +- `feature-video` detects autopilot from the new run contract and preserves its best-effort skip behavior +- `test-browser` writes a todo for a browser failure without polluting the substantive decision log +- `feature-video` records an operational skip without claiming a product/implementation decision +- Neither skill claims authority to make substantive product or implementation decisions via the role rubric + +**Verification:** +- Both utility skills are compatible with the shared run contract +- Their behavior remains best-effort and operational rather than substantive + +--- + +- [ ] **Unit 8: Add contract tests and release validation coverage** + +**Goal:** Protect the new cross-skill autopilot contract from drifting in future edits. 
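A text-contract check of this kind can be quite small. The sketch below is an assumption-laden illustration: `missingPhrases` and the phrase list are hypothetical, and a real test would read each `SKILL.md` from disk and assert on the strings the skills actually ship, presumably following the pattern in `tests/review-skill-contract.test.ts`.

```typescript
// Hypothetical phrase list; the real contract strings are set by the skills themselves.
const REQUIRED_IN_DECISION_OWNERS: readonly string[] = [
  "[ce-autopilot manifest=",
  "Product Manager",
  "Designer",
  "Engineer",
  "orchestration bias",
];

// Return every required phrase that the skill text no longer contains.
// An empty result means the text contract still holds.
function missingPhrases(skillText: string, required: readonly string[]): string[] {
  return required.filter((phrase) => !skillText.includes(phrase));
}
```

In a test file, an assertion that `missingPhrases(text, REQUIRED_IN_DECISION_OWNERS)` is empty for each first-wave decision-owner skill would fail as soon as one of them drops the role vocabulary or the marker contract.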
+ +**Requirements:** R1-R17 + +**Dependencies:** Units 1-7 + +**Files:** +- Create: `tests/autopilot-skill-contract.test.ts` +- Modify: `tests/review-skill-contract.test.ts` + +**Approach:** +- Add text-contract tests for: + - run manifest path convention + - invocation marker contract + - role vocabulary / orchestration bias presence in the first-wave decision-owner skills + - utility-skill autopilot contract consumption in `test-browser`, `feature-video`, and `document-review` +- Keep release validation in the final verification pass + +**Patterns to follow:** +- `tests/review-skill-contract.test.ts` + +**Test scenarios:** +- Contract test fails if one decision-owner skill drops the role rubric or manifest contract +- Contract test fails if a decision-owner skill drops the shared decision-criteria vocabulary or its per-skill role/bias declaration +- Contract test fails if `test-browser` or `feature-video` no longer document deterministic autopilot recognition +- Contract test fails if `document-review` starts claiming substantive decision ownership + +**Verification:** +- `bun test` passes +- `bun run release:validate` passes + +## System-Wide Impact + +- **Interaction graph:** `lfg` becomes the explicit autopilot run initializer; `ce:brainstorm`, `ce:plan`, `deepen-plan`, `ce:work`, `test-browser`, and `feature-video` all consume the same manifest and invocation marker contract, while only the core workflow skills own substantive decisions +- **State reconstruction:** `lfg` must be able to infer enough workflow gate state from repo artifacts and PR context to resume safely when no active manifest exists, then create a manifest for the resumed run +- **Determinism pressure:** Ordered gates and evidence precedence are the main guardrail against drift; if those rules are underspecified in implementation, different agents will resume the same repo state differently +- **Error propagation:** Missing or invalid manifest state must degrade safely and visibly rather 
than silently running with the wrong autopilot posture +- **State lifecycle risks:** Run manifests and decision logs should be per-run and gitignored; the design must avoid stale workspace-global state +- **Operational vs substantive outputs:** Utility-skill outputs (todos, skip notes, video artifacts) must not be conflated with the substantive autopilot decision log, or the audit trail will become noisy and less trustworthy +- **API surface parity:** The contract must survive cross-platform installs because skills are copied across Claude, Codex, Gemini, Copilot, and other targets +- **Integration coverage:** End-to-end skill chaining and contract tests matter more than unit-level validation for this work because the risk is orchestration drift across multiple SKILL.md files + +## Risks & Dependencies + +- The main risk is overcomplicating the first wave by trying to factor a shared runtime rubric file before the core contract works; keeping the rubric inline in phase 1 mitigates that +- Deprecating `slfg` still requires careful documentation and wrapper behavior so users do not lose discoverability for swarm execution +- Resume-anywhere orchestration increases the risk of misdetecting the current gate state, so `lfg` must be conservative and prefer one targeted question over a wrong autonomous leap when state is ambiguous +- Late-stage PR orchestration is easy to over-assume; verification and wrap-up gates must not be marked complete from weak evidence +- The marker syntax must stay minimal and portable; if it grows into a mini protocol, the common-denominator benefit disappears +- `document-review` classification changes must not break its existing caller contract or terminal signal +- Utility-skill integration must not blur the boundary between operational artifacts and substantive autonomous decisions + +## Documentation / Operational Notes + +- Update `plugins/compound-engineering/README.md` if autopilot behavior or skill descriptions become materially more 
specific after implementation +- Consider adding a follow-up `docs/solutions/skill-design/` write-up after the contract lands, but treat that as compounding work, not a prerequisite for phase 1 + +## Sources & References + +- **Origin document:** [docs/brainstorms/2026-03-24-autopilot-run-context-and-decision-rubric-requirements.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/docs/brainstorms/2026-03-24-autopilot-run-context-and-decision-rubric-requirements.md) +- Related code: [plugins/compound-engineering/skills/lfg/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/lfg/SKILL.md) +- Related code: [plugins/compound-engineering/skills/slfg/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/slfg/SKILL.md) +- Related code: [plugins/compound-engineering/skills/ce-brainstorm/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md) +- Related code: [plugins/compound-engineering/skills/ce-plan/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/ce-plan/SKILL.md) +- Related code: [plugins/compound-engineering/skills/deepen-plan/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/deepen-plan/SKILL.md) +- Related code: [plugins/compound-engineering/skills/ce-work/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/ce-work/SKILL.md) +- Related code: [plugins/compound-engineering/skills/test-browser/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/test-browser/SKILL.md) +- Related code: 
[plugins/compound-engineering/skills/feature-video/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/feature-video/SKILL.md) +- Related code: [plugins/compound-engineering/skills/document-review/SKILL.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/plugins/compound-engineering/skills/document-review/SKILL.md) +- Related tests: [tests/review-skill-contract.test.ts](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/tests/review-skill-contract.test.ts) +- Related learning: [docs/solutions/skill-design/lfg-slfg-pipeline-orchestration-and-autopilot-mode.md](/Users/tmchow/conductor/workspaces/compound-engineering-plugin/biarritz-v2/docs/solutions/skill-design/lfg-slfg-pipeline-orchestration-and-autopilot-mode.md) diff --git a/docs/solutions/skill-design/lfg-autopilot-orchestration-and-resumability.md b/docs/solutions/skill-design/lfg-autopilot-orchestration-and-resumability.md new file mode 100644 index 000000000..6991d19ca --- /dev/null +++ b/docs/solutions/skill-design/lfg-autopilot-orchestration-and-resumability.md @@ -0,0 +1,360 @@ +--- +title: "How lfg autopilot orchestration, resumability, and the manifest contract work" +category: skill-design +date: 2026-03-24 +severity: high +component: plugins/compound-engineering/skills +tags: + - lfg + - autopilot + - manifest + - resumability + - orchestration + - skill-chaining + - workflow +related: + - docs/solutions/skill-design/lfg-slfg-pipeline-orchestration-and-autopilot-mode.md + - docs/solutions/skill-design/beta-promotion-orchestration-contract.md + - docs/solutions/workflow/todo-status-lifecycle.md +--- + +# How lfg Autopilot Orchestration, Resumability, and the Manifest Contract Work + +## Problem + +`lfg` had moved beyond "start the workflow" and was now expected to continue from whatever state the branch was already in. 
That exposed several correctness gaps: + +- downstream skills could detect autopilot, but not all of them wrote back the exact gate state they changed +- direct and lightweight runs had intentionally missing planning artifacts, but the manifest did not make that absence clearly intentional +- late-stage work such as review, browser verification, and wrap-up could go stale after code changed +- environmental failures in `test-browser` or `feature-video` risked derailing the whole run even when those steps were best-effort + +The result was fragility in exactly the place where autopilot needed to be strongest: resuming from mid-pipeline without guessing the wrong next step. + +## Root Cause + +The initial autopilot contract solved only the first layer of the problem: it made automation explicit through a marker plus manifest. That was necessary, but not sufficient. + +The deeper issue was control-plane ambiguity: + +- the manifest carried too much optional metadata and too little guidance about which fields actually affected routing +- some skills consumed the manifest without clearly owning the gate they were advancing +- `lfg` still needed to infer too much from ambient repo state, especially for later stages +- best-effort steps were treated too much like hard gates, even though their most common failures are environmental rather than product-critical + +In short: the workflow had a run context, but not yet a crisp stage-ownership model. + +## System Model + +Autopilot works best when it is treated as a small workflow runtime with one orchestrator and a few explicit invariants. + +### One orchestrator owns chaining + +`lfg` is the only top-level autopilot orchestrator. 
+ +That means: + +- `lfg` chooses the route: `direct`, `lightweight`, or `full` +- `lfg` decides which gate is next +- `lfg` is responsible for resuming, repairing stale state, and continuing forward +- downstream skills do not decide which phase comes next on their own + +This matters because chaining logic drifts quickly when every skill starts inferring workflow state from caller wording. + +### The manifest is control state, not just a log + +The manifest exists so `/lfg` can resume correctly without re-deriving everything from scratch on every invocation. + +Its job is to answer: + +- what route this run is on +- which gates are complete, pending, skipped, or blocked +- which durable artifacts already exist +- whether late-stage work was completed against the current code state or an older one + +That makes the manifest part of the control plane. It is not just observability or debugging output. + +### Downstream skills receive explicit runtime context + +Autopilot mode is passed downstream through: + +- an explicit marker +- a manifest path + +That lets each skill know: + +- this is part of an active autopilot run +- which manifest to read +- which gate/artifact context already exists + +The durable rule is: skills should not infer autopilot from prose like "called from lfg". They should detect it from the explicit marker plus manifest. + +### Stage ownership is narrow by design + +The workflow is safer when each skill owns only the gate it directly changes: + +- `ce:brainstorm` -> `requirements` +- `ce:plan` -> `plan` +- `ce:work` -> `implementation` +- `ce:review` -> `review` +- `test-browser` -> `verification` +- `feature-video` -> `wrap_up` + +That is the right granularity because those gate transitions are exactly what changes `lfg`'s next routing decision. + +### Resume is "validate, repair, continue" + +When `/lfg` runs again mid-pipeline, it should: + +1. Prefer the manifest when one exists +2. 
Validate it against durable artifacts and current repo state +3. Repair it conservatively if it is missing, stale, or inconsistent +4. Recompute the first unmet gate +5. Continue from there + +That is a better model than either extreme: + +- blindly trusting the manifest +- ignoring the manifest and reconstructing everything every time + +## How the Flow Chains Together + +The workflow is not just a list of commands. It is an ordered gate engine. + +### Route first, then gates + +The first decision is route: + +- `direct` +- `lightweight` +- `full` + +That decision changes how missing artifacts are interpreted. + +For example: + +- in `full`, missing requirements/plan artifacts often mean work is incomplete +- in `direct` or `lightweight`, those same missing artifacts are often intentional and should be represented as `skipped` + +This is why route inference has to happen before gate inference. + +### Full pipeline chaining + +In the full route, `lfg` advances through: + +- `requirements` +- `plan` +- `implementation` +- `review` +- `verification` +- `wrap_up` + +The chaining rule is simple: + +- advance only when current evidence supports the gate transition +- if a gate cannot be proven complete, leave it `pending`, `skipped`, or `blocked` +- after each downstream skill returns, update the manifest and recompute the first unmet gate + +That makes the workflow resumable even when the run is interrupted between stages. + +### Direct and lightweight still need manifests + +Direct and lightweight runs are not "manifest-less shortcuts." They still create a run manifest immediately. + +The important difference is that they intentionally mark early planning gates as skipped: + +- `requirements = skipped` +- `plan = skipped` + +That absence is therefore a valid part of the route contract, not evidence that the run is broken. 
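The route-first rule above can be sketched in a few lines. This is a minimal illustration, not code the plugin runs: the skills are prompt documents, and the names `initialGates`, `firstUnmetGate`, and `GATE_ORDER` are hypothetical. It also assumes that `complete` and `skipped` both count as "met" when looking for the next gate.

```typescript
// Illustrative only: type and function names are assumptions, not plugin APIs.
type Route = "direct" | "lightweight" | "full";
type GateState = "complete" | "skipped" | "pending" | "blocked" | "unknown";
type GateName =
  | "requirements"
  | "plan"
  | "implementation"
  | "review"
  | "verification"
  | "wrap_up";

// Gate order taken from the full-pipeline list in this doc.
const GATE_ORDER: GateName[] = [
  "requirements",
  "plan",
  "implementation",
  "review",
  "verification",
  "wrap_up",
];

// Direct and lightweight runs mark the planning gates as skipped up front,
// so their absence reads as intentional rather than as a broken run.
function initialGates(route: Route): Record<GateName, GateState> {
  const planning: GateState = route === "full" ? "pending" : "skipped";
  return {
    requirements: planning,
    plan: planning,
    implementation: "pending",
    review: "pending",
    verification: "pending",
    wrap_up: "pending",
  };
}

// Assumption: a gate is "met" when it is complete or skipped; anything else
// (pending, blocked, unknown) is where the orchestrator looks next.
function firstUnmetGate(
  gates: Record<GateName, GateState>,
): GateName | undefined {
  return GATE_ORDER.find(
    (name) => gates[name] !== "complete" && gates[name] !== "skipped",
  );
}
```

Under these assumptions a fresh `direct` run starts at `implementation`, while a fresh `full` run starts at `requirements`.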
+ +### Late-stage work is separate from "implementation is done" + +One of the biggest coordination lessons was that an open PR or a finished coding pass does not mean the workflow is done. + +Autopilot still has to reason separately about: + +- review completion +- todo follow-up +- browser validation where relevant +- wrap-up artifacts +- CI state for the current HEAD + +This is why later gates must remain explicit instead of being collapsed into "implementation finished". + +## Solution + +Use a lean manifest that exists only to improve resume correctness, then pair it with explicit gate ownership rules for the skills that actually move the workflow forward. + +### 1. Keep the manifest lean and routing-oriented + +The durable fields are: + +- `schema_version` +- `run_id` +- `route = direct | lightweight | full` +- `status = active | completed | aborted` +- `implementation_mode = standard | swarm` +- `feature_description` +- optional `current_gate` +- `gates.requirements | gates.plan | gates.implementation | gates.review | gates.verification | gates.wrap_up` +- `artifacts.requirements_doc | artifacts.plan_doc` + +Each gate carries: + +- `state = complete | skipped | pending | blocked | unknown` +- `evidence` + +Only late-stage gates carry a code-state freshness anchor: + +- `gates.review.ref` +- `gates.verification.ref` +- `gates.wrap_up.ref` + +That cut is intentional. A field should exist only if it changes what `lfg` does next or makes stale resume state safer to detect. + +### 2. Treat the manifest as primary, but never blindly trusted + +`lfg` should resume from the manifest when present, but it must validate that state against durable artifacts and current repo state before trusting it. 
+ +When no valid manifest exists, reconstruct conservatively from: + +- current branch/worktree state +- current feature description, if any +- requirements docs and plan docs tied to the branch/topic +- plan checkbox progress +- non-doc implementation changes +- PR and CI state for the current HEAD +- pending and ready todos + +If ambiguity remains after that pass, ask one targeted question instead of guessing. + +### 3. Infer route before inferring gates + +Route controls what "missing artifacts" mean: + +- `full` means requirements/plan artifacts are expected +- `direct` and `lightweight` mean requirements/plan can be intentionally absent + +This is the key reason direct and lightweight runs still need manifests. Without `route`, a missing plan looks like corruption instead of a valid low-ceremony run. + +### 4. Make stage ownership explicit + +Resume correctness improved once each stage-owning skill was told exactly which gate it owns: + +- `ce:brainstorm` owns `requirements` +- `ce:plan` owns `plan` +- `ce:work` owns `implementation` +- `ce:review` owns `review` +- `test-browser` owns `verification` +- `feature-video` owns `wrap_up` + +The important pattern is not "every skill writes lots of bookkeeping." The pattern is: if a skill changes the answer to "what should run next?", it must write that state explicitly. + +### 5. Use ref-based stale invalidation only where it pays off + +`review`, `verification`, and `wrap_up` can become stale after code changes. Those gates should be invalidated when their stored `ref` no longer matches the current HEAD: + +- stale `review.ref` resets `review`, `verification`, and `wrap_up` +- stale `verification.ref` resets `verification` and `wrap_up` +- stale `wrap_up.ref` resets `wrap_up` + +This is a better tradeoff than timestamp-heavy bookkeeping. It directly protects against reusing stale late-stage completions without turning the manifest into a full workflow database. 
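The cascade described above can be sketched as follows. This is a hedged illustration rather than the plugin's implementation: `invalidateStaleLateGates` is a hypothetical name, "reset" is assumed to mean setting the gate back to `pending`, and a gate with no stored `ref` is assumed not to be stale on its own.

```typescript
// Illustrative only: names and the "reset means pending" rule are assumptions.
type GateState = "complete" | "skipped" | "pending" | "blocked" | "unknown";

interface Gate {
  state: GateState;
  evidence?: string;
  ref?: string; // HEAD ref recorded when a late-stage gate completed
}

type LateGateName = "review" | "verification" | "wrap_up";

// Ordered so that a stale earlier gate also resets every later gate.
const LATE_GATES: LateGateName[] = ["review", "verification", "wrap_up"];

function invalidateStaleLateGates(
  gates: Record<LateGateName, Gate>,
  headRef: string,
): Record<LateGateName, Gate> {
  // Copy so the caller's manifest object is not mutated.
  const result: Record<LateGateName, Gate> = { ...gates };

  // Find the earliest gate whose stored ref no longer matches HEAD.
  const staleGate = LATE_GATES.find((name) => {
    const ref = gates[name].ref;
    return ref !== undefined && ref !== headRef;
  });
  if (staleGate === undefined) return result;

  // Reset that gate and everything after it.
  for (const name of LATE_GATES.slice(LATE_GATES.indexOf(staleGate))) {
    result[name] = {
      state: "pending",
      evidence: `reset: ${staleGate}.ref no longer matches HEAD`,
    };
  }
  return result;
}
```

Keeping the comparison to a single ref per gate is what lets the manifest stay lean: no timestamps or per-file hashes are needed to decide whether late-stage work must rerun.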
+ +## Gate Semantics That Actually Worked + +### Implementation + +`implementation = complete` does not mean "the work is perfect" or "everything is validated." It means the coding pass has reached a reviewable checkpoint: + +- intended code changes for this route are in place +- implementation-blocking questions are resolved or externalized +- code-oriented verification for this slice has run +- the next rational step is review, not more core coding + +### Review + +`review = complete` means the review finished its inspection and externalized its findings. + +If review creates todos, that does not make review incomplete. The unresolved work belongs to todo resolution and any rerun that follows, not to the review gate itself. + +### Verification + +`verification` should not mean "all testing everywhere." Most verification belongs inside implementation. + +The separate verification gate is for extra validation, especially browser-level checks, when the work actually needs it. + +That means: + +- browser/UI verification is conditional, not universal +- environment/tooling failures usually produce `skipped`, not `blocked` +- failures that generate todos should leave verification `pending` so `lfg` can revisit it after follow-up work + +### Wrap-Up + +`wrap_up` is usually convenience work, not a shipping gate. + +For autopilot, the durable rule is: + +- success -> `complete` +- ordinary PR/auth/upload/environment issues -> `skipped` +- reserve `blocked` for the rare case where the task explicitly requires the wrap-up artifact + +## Best-Effort Steps Must Not Derail Autopilot + +`test-browser` and `feature-video` are especially sensitive to environment drift. Treating them as hard blockers by default makes `lfg` brittle for the wrong reason. 
+ +The better pattern is: + +- if the step is inapplicable, skip it +- if the environment is unavailable, skip it with a brief reason +- if the user or task explicitly requires the step, then it may become blocking +- always tell the user briefly what was skipped and why + +This preserves visibility without making the whole run fail because a dev server, browser auth session, or local toolchain is missing. + +## Practical Prevention Rules + +### Prefer route + gate state over inferred stage names + +Do not let `lfg` "guess where it is" from one fuzzy heuristic. Route first, then gates, then late-stage freshness. + +### Make every routing-critical state transition executable + +If a doc says a skill owns a gate transition, add or update a contract test for that exact wording. The orchestration contract should be executable, not purely narrative. + +### Keep best-effort steps visible but non-fatal + +If a late-stage step is not essential to declare the run done, encode skip behavior directly in the skill contract. Do not depend on callers to remember that nuance later. + +### Keep compatibility wrappers thin + +The `slfg` lesson still holds: wrappers should forward into the canonical orchestrator, not become alternate policy surfaces. + +### Promotion changes are orchestration changes + +The `ce:review-beta` -> `ce:review` promotion reinforced an adjacent rule: when a skill that sits in the orchestration path changes semantics or becomes the stable entrypoint, update the orchestrators and contract tests in the same change. 
+ +## Applied Pattern + +This pattern was applied to the current autopilot workflow as: + +- one canonical top-level orchestrator: `lfg` +- a lean, resumability-focused manifest +- explicit gate ownership across brainstorm/plan/work/review/verification/wrap-up +- direct and lightweight manifests that intentionally mark early planning gates as `skipped` +- late-stage ref invalidation for stale review/verification/wrap-up state +- best-effort browser/video handling that records skip reasons instead of derailing the run + +## Related Files + +- `plugins/compound-engineering/skills/lfg/SKILL.md` +- `plugins/compound-engineering/skills/ce-brainstorm/SKILL.md` +- `plugins/compound-engineering/skills/ce-plan/SKILL.md` +- `plugins/compound-engineering/skills/ce-work/SKILL.md` +- `plugins/compound-engineering/skills/ce-review/SKILL.md` +- `plugins/compound-engineering/skills/test-browser/SKILL.md` +- `plugins/compound-engineering/skills/feature-video/SKILL.md` +- `plugins/compound-engineering/AGENTS.md` +- `tests/autopilot-skill-contract.test.ts` +- `tests/review-skill-contract.test.ts` diff --git a/docs/solutions/skill-design/lfg-slfg-pipeline-orchestration-and-autopilot-mode.md b/docs/solutions/skill-design/lfg-slfg-pipeline-orchestration-and-autopilot-mode.md new file mode 100644 index 000000000..f204170d5 --- /dev/null +++ b/docs/solutions/skill-design/lfg-slfg-pipeline-orchestration-and-autopilot-mode.md @@ -0,0 +1,122 @@ +--- +title: "Collapsing `slfg` into `lfg` and making autopilot explicit" +category: skill-design +date: 2026-03-22 +severity: medium +component: plugins/compound-engineering/skills +tags: + - lfg + - slfg + - autopilot-mode + - orchestration + - ce-brainstorm + - ce-plan + - skill-chaining + - autonomous-workflow +related: + - docs/solutions/skill-design/lfg-autopilot-orchestration-and-resumability.md +--- + +# Collapsing `slfg` into `lfg` and Making Autopilot Explicit + +## Scope Note + +This doc captures the first durable orchestration shift: + +- 
collapse `slfg` into `lfg` +- stop inferring autopilot from caller prose +- make swarm an implementation-mode choice instead of a separate top-level workflow + +That learning is still correct, but it is no longer the whole contract. + +The current runtime model also depends on: + +- a resumable manifest +- route-aware resume logic (`direct | lightweight | full`) +- explicit gate ownership across downstream skills +- stale late-stage invalidation for review / verification / wrap-up + +For the current end-to-end runtime model, see: + +- [How lfg autopilot orchestration, resumability, and the manifest contract work](./lfg-autopilot-orchestration-and-resumability.md) + +## Problem + +The original `lfg` / `slfg` split created two orchestration problems: + +1. autopilot was described in caller prose, so downstream skills had to guess when they were in an automated run +2. `slfg` carried a second top-level workflow contract even though the real distinction was only whether implementation used swarm/agent-team behavior + +As `lfg` evolved into a resume-anywhere orchestrator, that split stopped making sense. The durable pattern is not "two top-level workflows with slightly different chaining." 
The durable pattern is: + +- one top-level orchestrator: `lfg` +- one explicit autopilot contract: marker + manifest +- one implementation-mode distinction inside `lfg`: `standard | swarm` + +## Investigation + +### What broke down in the old model + +- `ce:brainstorm`, `ce:plan`, and other workflow skills originally inferred autopilot from wording like "called from `lfg` / `slfg`" +- that made the pipeline brittle: handoff menus, post-generation prompts, and branch questions were skipped inconsistently +- resuming work from a later stage was awkward because the orchestrator had no shared run state +- the `slfg` branch of the model encouraged people to attach special behavior to the wrapper itself instead of to the underlying execution context + +The old split also obscured the real decision boundary: swarm is an implementation coordination choice, not a separate end-to-end workflow identity. + +## Solution + +The current pattern is: + +1. `lfg` is the only top-level autopilot entrypoint +2. `lfg` creates or backfills a run-scoped manifest under `.context/compound-engineering/autopilot//` +3. `lfg` passes an explicit marker downstream: + - `[ce-autopilot manifest=.../session.json] :: ` +4. downstream skills detect autopilot from that marker + manifest, not from caller wording +5. swarm lives behind `lfg` as `implementation_mode: standard | swarm` +6. 
`slfg` is only a deprecated compatibility path; it should not own separate orchestration rules + +This lets `lfg` do two things the old model could not do well: + +- resume from the first unmet workflow gate instead of assuming every run starts at ideation +- choose behavior from actual execution context (manifest state, implementation mode, current gate) rather than from the name of the top-level command + +The current contract goes further than the original `lfg`/`slfg` collapse: + +- direct and lightweight routes also create manifests +- the manifest is validated and repaired on resume instead of being blindly trusted +- later gates are kept explicit so an open PR does not masquerade as "done" + +## Key Design Decisions + +### One orchestrator, many modes + +The durable abstraction is not "sequential workflow vs swarm workflow." It is: + +- one orchestrator +- one shared run contract +- multiple execution choices inside that orchestrator + +That keeps workflow semantics stable while allowing implementation behavior to vary. + +### Explicit runtime beats prose inference + +The autopilot marker + manifest pattern is more work than prose like "when called from `lfg`," but it is much safer. It prevents stale caller assumptions, gives skills a stable contract to read, and creates a place to accumulate run artifacts like decision logs. + +### Compatibility wrappers should be thin + +Once a wrapper is deprecated, it should stop accumulating behavior. If a compatibility command remains, it should route into the canonical command immediately rather than carrying a second copy of orchestration policy. 
+ +## Prevention + +- When a workflow is split only by execution style, prefer one canonical orchestrator plus an internal mode switch over two top-level commands +- When downstream behavior depends on automation context, add an explicit runtime contract instead of inferring from caller wording +- Keep deprecated wrappers thin; do not let them become alternate policy surfaces +- Test the orchestration contract directly with contract tests, not only by reviewing the individual skill text + +## Related Files + +- `plugins/compound-engineering/skills/lfg/SKILL.md` +- `plugins/compound-engineering/skills/slfg/SKILL.md` +- `plugins/compound-engineering/AGENTS.md` +- `tests/autopilot-skill-contract.test.ts` diff --git a/plugins/compound-engineering/AGENTS.md b/plugins/compound-engineering/AGENTS.md index 54371b362..f783af4a7 100644 --- a/plugins/compound-engineering/AGENTS.md +++ b/plugins/compound-engineering/AGENTS.md @@ -59,6 +59,33 @@ skills/ **Why `ce:`?** Claude Code has built-in `/plan` and `/review` commands. The `ce:` namespace (short for compound-engineering) makes it immediately clear these commands belong to this plugin. +## Autopilot Mode Convention + +Skills with interactive handoff menus, post-generation options, or optional wrap-up flows must support **autopilot mode** during an active `lfg` run. Do not rely on beta-only frontmatter or caller prose to define autopilot mode. 
+ +The runtime contract is: +- `lfg` is the only top-level autopilot entrypoint +- `slfg` is a deprecated wrapper that routes to `lfg` with swarm mode enabled +- downstream skills detect autopilot from an explicit marker plus manifest path, not by guessing from "called from lfg/slfg" +- swarm selection for implementation should come from explicit user intent first, then `compound-engineering.local.md` frontmatter `implementation_mode: standard | swarm`; if the setting is missing, assume `standard` +- execution skills must honor the active manifest's `implementation_mode` during autopilot instead of requiring a second handoff-only swarm token +- marker format: + - `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: ` +- manifest directory: + - `.context/compound-engineering/autopilot//` +- direct and lightweight routes still create lightweight manifests; absent requirements/plan artifacts are intentional there +- late-stage gate completions may carry a HEAD `ref` so `lfg` can invalidate stale review/verification/wrap-up state after code changes + +The core rule: **skip workflow prompts, keep only truly necessary content prompts.** + +- Skip workflow prompts such as handoff menus, post-generation options, "what next?" routing questions, browser-mode pickers, and best-effort artifact choices. The pipeline controls flow. +- Keep content prompts only when proceeding would require inventing product behavior, scope, success criteria, or another user decision that materially changes the work. +- For execution and wrap-up skills, prefer safe automatic defaults over interactive choice menus. +- When autopilot mode skips, downgrades, or best-effort-skips a material step, inform the user briefly and continue. Do not block on the prompt. Record the skip reason in the manifest gate evidence when the skill owns that gate. +- Skills must write durable outputs when applicable and return control without chaining into the next step. 
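For readers who want to see the marker contract mechanically, here is a minimal sketch of deterministic marker detection. It is an illustration under stated assumptions, not part of the plugin: skills are prompt documents, `parseAutopilotMarker` is a hypothetical helper, and the text after `::` is assumed to be the task description that the skill processes once the marker is stripped.

```typescript
// Illustrative only: the helper name and return shape are assumptions.
interface AutopilotInvocation {
  manifestPath: string; // path taken verbatim from the marker
  task: string; // assumed: whatever follows "::" is the task text
}

// The marker must be the very first thing in the input; prose such as
// "called from lfg" never matches.
const AUTOPILOT_MARKER =
  /^\[ce-autopilot manifest=([^\]\s]+)\]\s*::\s*([\s\S]*)$/;

function parseAutopilotMarker(input: string): AutopilotInvocation | null {
  const match = AUTOPILOT_MARKER.exec(input);
  if (match === null) return null;
  const [, manifestPath = "", task = ""] = match;
  return { manifestPath, task: task.trim() };
}
```

A `null` result means the skill runs in its normal interactive mode. A non-null result only says the marker is well-formed; the skill would still need to confirm that the manifest at `manifestPath` describes an active run.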
+ +Skills with autopilot mode: `ce:brainstorm`, `ce:plan`, `deepen-plan`, `ce:work`, `ce:work-beta`, `ce:review`, `ce:review-beta`, `test-browser`, `feature-video`. Document behavioral changes in a `## Autopilot Mode` section within the skill's SKILL.md and describe how the skill handles the marker/manifest contract when relevant. + ## Skill Compliance Checklist When adding or modifying skills, verify compliance with the skill spec: diff --git a/plugins/compound-engineering/README.md b/plugins/compound-engineering/README.md index bce42fc6b..4d91fbedc 100644 --- a/plugins/compound-engineering/README.md +++ b/plugins/compound-engineering/README.md @@ -89,9 +89,22 @@ Agents are organized into categories for easier discovery. ## Commands -### Workflow Commands +### Autopilot Workflow -Core workflow commands use `ce:` prefix to unambiguously identify them as compound-engineering commands: +Run a complete engineering workflow from feature description to PR: + +| Command | Description | +|---------|-------------| +| `/lfg [description]` | Right-sized autopilot workflow: routes to direct edit, lightweight execution, or the full pipeline based on task complexity, and can resume from the current workflow gate when work is already in progress | +| `/slfg [description]` | Deprecated compatibility wrapper that routes to `/lfg` with swarm mode enabled | + +`/lfg` assesses task complexity and chooses the right amount of ceremony. Trivial fixes (typos, renames) execute directly. Bounded tasks with clear requirements skip planning and multi-agent review. Complex or ambiguous tasks run the full pipeline in autopilot mode: `brainstorm → plan → work → review → resolve todos → test → video`. Autopilot skips workflow menus, but it still asks content questions when it would otherwise need to invent product behavior, scope, or success criteria. 
All three routes are resumable: direct and lightweight runs create lightweight manifests that intentionally mark requirements/plan as skipped, while full-pipeline runs attach durable requirements and plan artifacts. `/lfg` resumes from the first unmet workflow gate using that manifest when present and falls back to conservative reconstruction from durable artifacts, repo state, PR state, and todos when it is not. Review, browser verification, and wrap-up completions are revalidated conservatively against the current HEAD so stale late-stage state does not get reused after code changes. Browser testing and feature-video remain best-effort by default: if the environment is unavailable, `/lfg` records the skip and continues instead of failing the whole run. Swarm is now an execution mode behind `/lfg`; `/slfg` remains only as a migration path. + +When `lfg` reaches implementation, explicit user requests for swarm/agent teams win. Otherwise it may read `implementation_mode: standard | swarm` from `compound-engineering.local.md`. If that flag is absent, `lfg` assumes `standard`. + +### Step-by-Step Workflow + +Use individual commands when you want control over specific phases. Core workflow commands use the `ce:` prefix: | Command | Description | |---------|-------------| @@ -103,12 +116,12 @@ Core workflow commands use `ce:` prefix to unambiguously identify them as compou | `/ce:compound` | Document solved problems to compound team knowledge | | `/ce:compound-refresh` | Refresh stale or drifting learnings and decide whether to keep, update, replace, or archive them | +Step-by-step is useful when you want to brainstorm now and plan later, build a plan from an existing requirements doc, run just a review, or document a fix. 
+ ### Utility Commands | Command | Description | |---------|-------------| -| `/lfg` | Full autonomous engineering workflow | -| `/slfg` | Full autonomous workflow with swarm mode for parallel execution | | `/deepen-plan` | Stress-test plans and deepen weak sections with targeted research | | `/changelog` | Create engaging changelogs for recent merges | | `/generate_command` | Generate new slash commands | diff --git a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md index 7565a0713..4106a2ce9 100644 --- a/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md +++ b/plugins/compound-engineering/skills/ce-brainstorm/SKILL.md @@ -34,6 +34,78 @@ This skill does not implement code. It explores, clarifies, and documents decisi - **Keep outputs concise** - Prefer short sections, brief bullets, and only enough detail to support the next decision. +## Role Rubric + +This skill uses these roles in both normal and autopilot modes: + +- `Product Manager` -- optimize for user value, scope coherence, success criteria, and priority alignment +- `Designer` -- optimize for user experience broadly: flow, defaults, terminology, state coverage, error recovery, and clarity +- `Engineer` -- optimize for correctness, reuse, maintainability, and repo fit as a constraint on product choices + +Ordered weighting: +- `Product Manager > Designer > Engineer` + +Dominant decision criteria: +- `User Value` +- `Completeness` +- `Clarity` +- `Reuse` +- `Momentum` + +Orchestration bias: +- `low` + +Normal mode uses this rubric to recommend options and frame follow-up questions. +Autopilot mode uses the same rubric for bounded autonomous decisions only. 
+ +## Autopilot Mode + +Autopilot is active only when the input begins with: + +- `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::` + +When that marker is present: +- Strip the marker before processing the feature description +- Read the manifest path from the marker +- Validate that the manifest describes an active autopilot run +- Treat the run as part of an `lfg`-owned workflow, not a standalone brainstorm + +Then distinguish between two kinds of prompts: + +- **Workflow prompts** (handoff menus, "what do you want to do next?", "resume or start fresh?", post-generation options) → skip. These control routing, and the pipeline handles routing. +- **Content prompts** (clarifying what to build, resolving ambiguity, scoping questions) → still ask. Getting requirements wrong wastes every downstream step. + +Decision boundaries in autopilot mode: + +- **May decide automatically** + - bounded requirement defaults already strongly implied by the request or existing requirements doc + - small scope/behavior clarifications needed to make the requirements doc plan-ready +- **Must ask** + - materially different product behaviors + - changes that would expand or narrow core scope in a meaningful way + - unresolved success-criteria tradeoffs +- **Must log** + - any substantive autonomous product decision written into the requirements doc without asking first + +When a requirements document is created or updated in autopilot mode, update the manifest's `artifacts.requirements_doc` path, set `gates.requirements.state = complete` with brief evidence, and append any substantive autonomous decisions to the run-scoped `decisions.md` table. + +Specific phase behavior: + +- **Phase 0.1:** If a relevant requirements document already exists, read it. 
Skip it (proceed to Phase 0.2 to reassess the current `$ARGUMENTS`) if: a plan in `docs/plans/` has an `origin:` frontmatter field pointing to this requirements doc and `status: completed` (the doc was already fully consumed), or its problem frame and requirements meaningfully diverge from the current feature description (`$ARGUMENTS`). If the doc is still relevant (no completed plan referencing it and scope matches), check for `Resolve Before Planning` items. If the document is plan-ready (no blocking questions), note it and return control immediately. If it still has `Resolve Before Planning` items, resume the brainstorm to resolve them (proceed to Phase 1.3) rather than returning control -- otherwise the pipeline dead-ends because `ce:plan` will block and re-invoking `ce:brainstorm` will hit this same check. Do not ask whether to resume or start fresh. +- **Phase 0.2 short-circuit is a genuine skip.** If requirements are already clear (specific acceptance criteria, exact expected behavior, well-defined scope), skip brainstorm entirely. Note "requirements clear, skipping brainstorm", and in autopilot mode set `gates.requirements.state = complete` with brief evidence before returning control to the calling workflow. Do not proceed to Phase 1.3 or Phase 3. +- **Phases 1.3 and 2:** Content questions to clarify vague or ambiguous requirements are still permitted. The user is present and getting requirements right is more valuable than speed. +- **Phase 4 handoff is skipped.** Do not present handoff options or invoke `/ce:plan`. Write the requirements document (if warranted) and return control to the calling workflow. + +## Durable Output Safety + +This skill may create or update durable requirements documents in `docs/brainstorms/`. + +- **In autopilot mode (active `lfg` run with marker/manifest)** — inherit the current branch/worktree context and continue without branch prompts. +- **In standalone use on a clean worktree** — proceed normally. 
+- **In standalone use on a dirty worktree** — continue only when the existing uncommitted changes clearly belong to the same brainstorm topic. If they appear unrelated, or you are not confident, ask before writing or updating a durable document. +- **Being in a worktree does not by itself prove the task context is correct** — use the same clean-vs-dirty and related-vs-unrelated judgment there. +- **Do not create or switch branches from this skill** — branch/worktree orchestration belongs to the calling workflow or to `ce:work` when execution begins. + ## Feature Description #$ARGUMENTS @@ -50,7 +122,8 @@ Do not proceed until you have a feature description from the user. If the user references an existing brainstorm topic or document, or there is an obvious recent matching `*-requirements.md` file in `docs/brainstorms/`: - Read the document -- Confirm with the user before resuming: "Found an existing requirements doc for [topic]. Should I continue from this, or start fresh?" +- **In autopilot mode:** skip the document (proceed to Phase 0.2 to reassess whether brainstorming is needed for the current input) if a plan in `docs/plans/` has `origin:` pointing to this doc and `status: completed` (already fully consumed), or its scope meaningfully diverges from the current feature description. If the doc is still relevant (no completed plan referencing it and scope matches), check for `Resolve Before Planning` items. If none exist, note the existing document and return control immediately (see Autopilot Mode above). If blocking questions remain, resume the brainstorm to resolve them (proceed to Phase 1.3) rather than returning control. Do not ask the user whether to resume or start fresh. +- **Otherwise:** Confirm with the user before resuming: "Found an existing requirements doc for [topic]. Should I continue from this, or start fresh?" 
- If resuming, summarize the current state briefly, continue from its existing decisions and outstanding questions, and update the existing document instead of creating a duplicate #### 0.2 Assess Whether Brainstorming Is Needed @@ -62,7 +135,8 @@ If the user references an existing brainstorm topic or document, or there is an - Constrained, well-defined scope **If requirements are already clear:** -Keep the interaction brief. Confirm understanding and present concise next-step options rather than forcing a long brainstorm. Only write a short requirements document when a durable handoff to planning or later review would be valuable. Skip Phase 1.1 and 1.2 entirely — go straight to Phase 1.3 or Phase 3. +- **In autopilot mode:** skip brainstorm entirely and return control to the calling workflow (see Autopilot Mode above). Do not proceed to Phase 1.3 or Phase 3. +- **Otherwise:** Keep the interaction brief. Confirm understanding and present concise next-step options rather than forcing a long brainstorm. Only write a short requirements document when a durable handoff to planning or later review would be valuable. Skip Phase 1.1 and 1.2 entirely — go straight to Phase 1.3 or Phase 3. #### 0.3 Assess Scope @@ -247,7 +321,9 @@ If a document contains outstanding questions: #### 4.1 Present Next-Step Options -Present next steps using the platform's blocking question tool when available (see Interaction Rules). Otherwise present numbered options in chat and end the turn. +**In autopilot mode:** skip Phase 4 entirely. Write the requirements document (if warranted by the brainstorm), update the manifest with the document path when one exists, and return control to the calling workflow. Do not present handoff options or invoke `/ce:plan`. + +**Otherwise:** Present next steps using the platform's blocking question tool when available (see Interaction Rules). Otherwise present numbered options in chat and end the turn. 
If `Resolve Before Planning` contains any items: - Ask the blocking questions now, one at a time, by default diff --git a/plugins/compound-engineering/skills/ce-plan/SKILL.md b/plugins/compound-engineering/skills/ce-plan/SKILL.md index 5545f1803..770c034c4 100644 --- a/plugins/compound-engineering/skills/ce-plan/SKILL.md +++ b/plugins/compound-engineering/skills/ce-plan/SKILL.md @@ -50,6 +50,67 @@ Every plan should contain: A plan is ready when an implementer can start confidently without needing the plan to write the code for them. +## Role Rubric + +This skill uses these roles in both normal and autopilot modes: + +- `Engineer` -- optimize for correctness, reuse, maintainability, implementation clarity, and repo fit +- `Product Manager` -- preserve user value, scope coherence, and success criteria from the origin document +- `Designer` -- preserve user experience, state coverage, terminology, and flow clarity when the plan affects user-facing behavior + +Ordered weighting: +- `Engineer > Product Manager > Designer` + +Dominant decision criteria: +- `Clarity` +- `Reuse` +- `Completeness` +- `User Value` +- `Momentum` + +Orchestration bias: +- `medium` + +Normal mode uses this rubric to recommend planning choices. +Autopilot mode uses the same rubric for bounded technical or plan-structure decisions only. 
+ +## Autopilot Mode + +Autopilot is active only when the input begins with: + +- `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::` + +When that marker is present: +- Strip the marker before processing the planning input +- Read the manifest path from the marker +- Validate that the manifest describes an active autopilot run +- Use the manifest as the source of truth for upstream artifacts and gate state instead of guessing from caller prose + +Decision boundaries in autopilot mode: + +- **May decide automatically** + - bounded implementation-direction choices already constrained by the requirements doc, repo patterns, or research + - how to structure the plan so it is implementation-ready + - which relevant source document to use when one candidate is clearly the best match +- **Must ask** + - materially different product behaviors + - architecture or scope forks that would meaningfully change rollout risk or user-visible behavior + - equally plausible origin documents when the choice would change the plan materially +- **Must log** + - any substantive autonomous decision that changes implementation direction, sequencing, verification strategy, or explicit assumptions + +When the plan file is written in autopilot mode, update the manifest's `artifacts.plan_doc`, `current_gate`, and relevant gate states, then promote the applicable subset of `decisions.md` rows into an `## Autopilot Decisions` section using the same row schema. + +## Durable Output Safety + +This skill writes durable plan artifacts in `docs/plans/`. + +- **In autopilot mode (active `lfg` run with marker/manifest)** — inherit the current branch/worktree context and continue without branch prompts. +- **In standalone use on a clean worktree** — proceed normally. +- **In standalone use on a dirty worktree** — continue only when the existing uncommitted changes clearly belong to the same plan topic. 
If they appear unrelated, or you are not confident, ask before writing or updating a durable plan file. +- **Being in a worktree does not by itself prove the task context is correct** — use the same clean-vs-dirty and related-vs-unrelated judgment there. +- **Do not create or switch branches from this skill** — branch/worktree orchestration belongs to the calling workflow or to `ce:work` when execution begins. + ## Workflow ### Phase 0: Resume, Source, and Scope @@ -70,7 +131,11 @@ Before asking planning questions, search `docs/brainstorms/` for files matching - It was created within the last 30 days (use judgment to override if the document is clearly still relevant or clearly stale) - It appears to cover the same user problem or scope -If multiple source documents match, ask which one to use using the platform's blocking question tool when available (see Interaction Method). Otherwise, present numbered options in chat and wait for the user's reply before proceeding. +When evaluating candidates, skip any requirements document that already has a completed plan referencing it (a plan in `docs/plans/` with `origin:` pointing to the doc and `status: completed`). + +If multiple source documents could match: +- **In autopilot mode** — prefer the document whose topic and problem frame most closely match the current feature description, not just the most recent. If two or more documents are equally close matches, ask the user which one to use. +- **Otherwise** — ask which one to use using the platform's blocking question tool when available (see Interaction Method). Otherwise, present numbered options in chat and wait for the user's reply before proceeding. 
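The candidate filter described above (skip any requirements document that a completed plan already references) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `unconsumed_requirements_docs` is hypothetical, and each plan is assumed to be available as a dict of its frontmatter with `origin` and `status` keys.

```python
from typing import Dict, List


def unconsumed_requirements_docs(
    candidates: List[str],
    plans: List[Dict[str, str]],
) -> List[str]:
    """Hypothetical helper: drop requirements docs already consumed by a completed plan."""
    # A doc counts as consumed when some plan has origin == doc and status == "completed".
    consumed = {
        plan.get("origin")
        for plan in plans
        if plan.get("status") == "completed"
    }
    # Preserve the original candidate order for the remaining documents.
    return [doc for doc in candidates if doc not in consumed]
```

Choosing among the remaining candidates (closest topic match, or asking the user when two are equally close) is a judgment step that this sketch deliberately leaves out.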
#### 0.3 Use the Source Document as Primary Input @@ -575,7 +640,6 @@ If the plan originated from a requirements document, re-read that document and v **REQUIRED: Write the plan file to disk before presenting any options.** Use the Write tool to save the complete plan to: - ```text docs/plans/YYYY-MM-DD-NNN---plan.md ``` @@ -586,10 +650,12 @@ Confirm: Plan written to docs/plans/[filename] ``` -**Pipeline mode:** If invoked from an automated workflow such as LFG, SLFG, or any `disable-model-invocation` context, skip interactive questions. Make the needed choices automatically and proceed to writing the plan. +**In autopilot mode**, skip workflow prompts after writing the plan. Confirm the plan path, update the manifest's `artifacts.plan_doc` and gate state, then return control to the calling workflow. #### 5.3 Post-Generation Options +In autopilot mode, skip this section entirely and return control to the calling workflow. + After writing the plan file, present the options using the platform's blocking question tool when available (see Interaction Method). Otherwise present numbered options in chat and wait for the user's reply before proceeding. **Question:** "Plan ready at `docs/plans/YYYY-MM-DD-NNN---plan.md`. What would you like to do next?" diff --git a/plugins/compound-engineering/skills/ce-review/SKILL.md b/plugins/compound-engineering/skills/ce-review/SKILL.md index 0ce6a28fb..45fc189d4 100644 --- a/plugins/compound-engineering/skills/ce-review/SKILL.md +++ b/plugins/compound-engineering/skills/ce-review/SKILL.md @@ -20,6 +20,29 @@ Reviews code changes using dynamically selected reviewer personas. Spawns parall Check `$ARGUMENTS` for `mode:autofix` or `mode:report-only`. If either token is present, strip it from the remaining arguments before interpreting the rest as the PR number, GitHub URL, or branch name. 
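As a rough sketch of the mode-token handling described above, the token can be separated from the rest of the arguments before the review target is interpreted. The helper name `split_mode_token` is hypothetical, and treating only the first token as the mode is an assumption, not something the skill specifies.

```python
from typing import Optional, Tuple

MODE_TOKENS = ("mode:autofix", "mode:report-only")


def split_mode_token(arguments: str) -> Tuple[Optional[str], str]:
    """Return (mode token or None, remaining arguments with that token removed)."""
    mode: Optional[str] = None
    remaining = []
    for word in arguments.split():
        # Assumption: only the first mode token counts; later ones stay in the target text.
        if word in MODE_TOKENS and mode is None:
            mode = word
        else:
            remaining.append(word)
    return mode, " ".join(remaining)
```

For example, `split_mode_token("mode:autofix 123")` yields the mode `mode:autofix` and leaves `123` to be read as the PR number.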
+## Autopilot Mode + +Autopilot is active only when the input begins with: + +- `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::` + +When that marker is present: + +- Strip the marker before mode detection and review-target interpretation +- Read the manifest path from the marker +- Validate that the manifest describes an active autopilot run +- Default to `mode:autofix` when no explicit mode token remains +- Prefer reviewing the current checkout (`current` or empty target) instead of switching branches +- Skip user questions, worktree prompts, and optional browser-test handoff prompts +- Do not invoke `setup`; if `compound-engineering.local.md` is missing, use the built-in always-on reviewers plus conditionals selected per diff +- Treat this skill as the owner of `gates.review` for the active run +- When review completes in autopilot mode, set `gates.review.state = complete`, record brief evidence, and set `gates.review.ref = current HEAD` +- Review may still be complete when findings were externalized as todos; unresolved follow-up work does not mean the inspection itself is incomplete +- Mark `gates.review.state = blocked` only when the review could not actually run on the current checkout +- Return control to the caller after the autofix/report-only pass finishes; do not add extra workflow prompts + +Autopilot mode in `ce:review` is non-interactive review orchestration on the current checkout, not a commit/push/PR workflow. 
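The marker handling above (detect the marker only at the start of the input, read the manifest path, strip the marker) might look something like this. It is a minimal sketch, not the skill's actual implementation: the name `strip_autopilot_marker` and the exact whitespace around `::` are assumptions, and validating that the manifest describes an active run is left out because the manifest schema is not shown here.

```python
import re
from typing import Optional, Tuple

# Matches "[ce-autopilot manifest=<path>] ::" only at the very start of the input.
MARKER_RE = re.compile(r"^\[ce-autopilot manifest=(?P<path>[^\]\s]+)\] ::\s*")


def strip_autopilot_marker(raw_input: str) -> Tuple[Optional[str], str]:
    """Return (manifest path or None, input with the marker removed)."""
    match = MARKER_RE.match(raw_input)
    if match is None:
        # No leading marker: autopilot is not active and the input is unchanged.
        return None, raw_input
    return match.group("path"), raw_input[match.end():]
```

Because the pattern is anchored at the start, a marker that appears later in the text does not activate autopilot, which matches the "only when the input begins with" wording.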
+ | Mode | When | Behavior | |------|------|----------| | **Interactive** (default) | No mode token present | Review, present findings, ask for policy decisions when needed, and optionally continue into fix/push/PR next steps | diff --git a/plugins/compound-engineering/skills/ce-review/references/findings-schema.json b/plugins/compound-engineering/skills/ce-review/references/findings-schema.json index e7eee5d2c..e15329549 100644 --- a/plugins/compound-engineering/skills/ce-review/references/findings-schema.json +++ b/plugins/compound-engineering/skills/ce-review/references/findings-schema.json @@ -113,7 +113,7 @@ "P3": "Low-impact, narrow scope, minor improvement. User's discretion." }, "autofix_classes": { - "safe_auto": "Local, deterministic code or test fix suitable for the in-skill fixer in autonomous mode.", + "safe_auto": "Local, deterministic code or test fix suitable for the in-skill fixer in autofix mode.", "gated_auto": "Concrete fix exists, but it changes behavior, permissions, contracts, or other sensitive areas that deserve explicit approval.", "manual": "Actionable issue that should become residual work rather than an in-skill autofix.", "advisory": "Informational or operational item that should be surfaced in the report only." diff --git a/plugins/compound-engineering/skills/ce-work-beta/SKILL.md b/plugins/compound-engineering/skills/ce-work-beta/SKILL.md index 0d2694c80..eb93709ab 100644 --- a/plugins/compound-engineering/skills/ce-work-beta/SKILL.md +++ b/plugins/compound-engineering/skills/ce-work-beta/SKILL.md @@ -13,6 +13,31 @@ Execute a work plan efficiently while maintaining quality and finishing features This command takes a work document (plan, specification, or todo file) and executes it systematically. The focus is on **shipping complete features** by understanding requirements quickly, following existing patterns, and maintaining quality throughout. 
+## Autopilot Mode + +Autopilot is active only when the input begins with: + +- `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::` + +When that marker is present: +- Strip the marker before processing the input document path +- Read the manifest path from the marker +- Validate that the manifest describes an active autopilot run +- Use the manifest's artifacts and gate state as part of execution context +- Treat `manifest.implementation_mode=swarm` as the explicit swarm opt-in for this run's implementation gate + +Then use the same safe defaults described below and avoid workflow prompts. + +Specific behavior: + +- Do not ask for generic approval to proceed. +- Respect explicit user instructions about branch strategy, such as "use `main`", "create a new branch", or "use a worktree". +- In autopilot mode, if already on a non-default branch, continue there and note it briefly. +- In autopilot mode, if on the default branch and the user did not explicitly authorize staying there, create a feature branch automatically. Prefer a worktree only when the user explicitly asked for it or the environment clearly calls for it. +- Never commit directly to the default branch without explicit user permission. +- Stop only for true blockers: contradictory requirements, missing credentials, broken environment/setup, or another consent boundary that cannot be inferred safely. +- When using a fallback or skipping a non-critical step, inform the user briefly and continue. 
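The autopilot branch rules above reduce to a small decision, sketched here under assumptions: the function name and the returned action labels are hypothetical, and `explicit_instruction` stands in for whatever branch strategy the user stated, if any.

```python
from typing import Optional


def choose_autopilot_branch_action(
    current_branch: str,
    default_branch: str,
    explicit_instruction: Optional[str] = None,
) -> str:
    """Hypothetical helper: decide the branch action for an autopilot run."""
    # Explicit user instructions about branch strategy are respected first.
    if explicit_instruction is not None:
        return explicit_instruction
    # Already on a non-default branch: continue there (and note it briefly).
    if current_branch != default_branch:
        return "continue-on-current-branch"
    # On the default branch without permission to stay: branch automatically.
    return "create-feature-branch"
```

Note that the sketch never returns an action that commits to the default branch unless the user asked for it, in line with the rule about explicit permission.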
+ ## Input Document #$ARGUMENTS @@ -32,8 +57,8 @@ This command takes a work document (plan, specification, or todo file) and execu - Review any references or links provided in the plan - If the user explicitly asks for TDD, test-first, or characterization-first execution in this session, honor that request even if the plan has no `Execution note` - If anything is unclear or ambiguous, ask clarifying questions now - - Get user approval to proceed - - **Do not skip this** - better to ask questions now than build the wrong thing + - Do not ask for generic approval to proceed once the plan is clear enough to execute + - Ask only when a real ambiguity or blocker would materially change the work 2. **Setup Environment** @@ -49,35 +74,51 @@ This command takes a work document (plan, specification, or todo file) and execu fi ``` - **If already on a feature branch** (not the default branch): - - Ask: "Continue working on `[current_branch]`, or create a new branch?" - - If continuing, proceed to step 3 - - If creating new, follow Option A or B below + Choose the branch strategy using this precedence: - **If on the default branch**, choose how to proceed: + - **Explicit user instruction wins** — if the user asked to use `main`, create a new branch, or use a worktree, do that. + - **Autopilot mode (active `lfg` run with marker/manifest)** — if already on a non-default branch, continue on `current_branch` and note that choice briefly. If on the default branch without explicit permission to stay there, create a feature branch automatically. + - **Standalone `/ce:work-beta` on a non-default branch** — do not silently reuse the branch. Ask whether to continue on `current_branch`, create a new feature branch, or use a worktree instead. + - **Standalone `/ce:work-beta` on the default branch without explicit permission to stay there** — ask whether to create a feature branch or use a worktree. Continuing on the default branch still requires explicit authorization. 
+ - **Use a worktree** when the user explicitly asked for it or the environment clearly calls for isolated parallel development. + + Never commit directly to the default branch without explicit permission. + + For standalone `/ce:work-beta`, use the platform's blocking question tool when available. Otherwise, present numbered options and wait. Suggested prompts: + + If already on a non-default branch: - **Option A: Create a new branch** - ```bash - git pull origin [default_branch] - git checkout -b feature-branch-name ``` - Use a meaningful name based on the work (e.g., `feat/user-authentication`, `fix/email-validation`). + Branch safety check: you're on `[current_branch]`. + + 1. Continue on `[current_branch]` + 2. Create a new feature branch (recommended) + 3. Use a worktree instead + 4. Cancel + ``` + + If on the default branch: - **Option B: Use a worktree (recommended for parallel development)** - ```bash - skill: git-worktree - # The skill will create a new branch from the default branch in an isolated worktree ``` + Branch safety check: you're on the default branch `[default_branch]`. - **Option C: Continue on the default branch** - - Requires explicit user confirmation - - Only proceed after user explicitly says "yes, commit to [default_branch]" - - Never commit directly to the default branch without explicit permission + 1. Create a new feature branch (recommended) + 2. Use a worktree instead + 3. Continue on `[default_branch]` (only if explicitly requested) + 4. 
Cancel + ``` - **Recommendation**: Use worktree if: - - You want to work on multiple features simultaneously - - You want to keep the default branch clean while experimenting - - You plan to switch between branches frequently + When creating a branch automatically: + ```bash + if [ -n "$(git status --porcelain)" ]; then + git checkout -b feature-branch-name + else + git pull origin [default_branch] + git checkout -b feature-branch-name + fi + ``` + If the worktree is dirty, branch first so local artifacts such as a newly written plan file carry forward safely. Only pull before branching when the worktree is clean. + Use a meaningful name based on the work (e.g., `feat/user-authentication`, `fix/email-validation`). 3. **Create Todo List** - Use your available task tracking tool (e.g., TodoWrite, task lists) to break the plan into actionable tasks @@ -405,7 +446,7 @@ This command takes a work document (plan, specification, or todo file) and execu For genuinely large plans where agents need to communicate with each other, challenge approaches, or coordinate across 10+ tasks with persistent specialized roles, use agent team capabilities if available (e.g., Agent Teams in Claude Code, multi-agent workflows in Codex). -**Agent teams are typically experimental and require opt-in.** Do not attempt to use agent teams unless the user explicitly requests swarm mode or agent teams, and the platform supports it. +**Agent teams are typically experimental and require opt-in.** Do not attempt to use agent teams unless the user explicitly requests swarm mode or agent teams, or the active autopilot manifest sets `implementation_mode=swarm`, and the platform supports it. 
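One way to read the opt-in rule above is "(user request or manifest flag) and platform support". The sketch below encodes that reading; the function and parameter names are hypothetical, and the grouping of the conditions is an interpretation of the sentence rather than something the skill spells out.

```python
from typing import Optional


def may_use_agent_teams(
    user_requested_swarm: bool,
    manifest_implementation_mode: Optional[str],
    platform_supports_teams: bool,
) -> bool:
    """Hypothetical check: is swarm/agent-team execution allowed for this run?"""
    # Opt-in comes from the user or from the active autopilot manifest.
    opted_in = user_requested_swarm or manifest_implementation_mode == "swarm"
    # Platform support is required in either case.
    return opted_in and platform_supports_teams
```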
### When to Use Agent Teams vs Subagents @@ -414,7 +455,7 @@ For genuinely large plans where agents need to communicate with each other, chal | Agents need to discuss and challenge each other's approaches | Each task is independent — only the result matters | | Persistent specialized roles (e.g., dedicated tester running continuously) | Workers report back and finish | | 10+ tasks with complex cross-cutting coordination | 3-8 tasks with clear dependency chains | -| User explicitly requests "swarm mode" or "agent teams" | Default for most plans | +| User explicitly requests "swarm mode" or "agent teams", or the active autopilot manifest sets `implementation_mode=swarm` | Default for most plans | Most plans should use subagent dispatch from standard mode. Agent teams add significant token cost and coordination overhead — use them when the inter-agent communication genuinely improves the outcome. diff --git a/plugins/compound-engineering/skills/ce-work/SKILL.md b/plugins/compound-engineering/skills/ce-work/SKILL.md index 239300582..e3d59e39f 100644 --- a/plugins/compound-engineering/skills/ce-work/SKILL.md +++ b/plugins/compound-engineering/skills/ce-work/SKILL.md @@ -12,6 +12,73 @@ Execute a work plan efficiently while maintaining quality and finishing features This command takes a work document (plan, specification, or todo file) and executes it systematically. The focus is on **shipping complete features** by understanding requirements quickly, following existing patterns, and maintaining quality throughout. 
+## Role Rubric + +This skill uses these roles in both normal and autopilot modes: + +- `Engineer` -- optimize for correctness, reuse, maintainability, implementation clarity, and repo fit +- `Designer` -- optimize for user experience, state coverage, terminology, and interaction clarity when execution touches behavior +- `Product Manager` -- preserve scope boundaries, user value, and success criteria from the plan and requirements + +Ordered weighting: +- `Engineer > Designer > Product Manager` + +Dominant decision criteria: +- `Clarity` +- `Reuse` +- `Local Leverage` +- `Completeness` +- `Momentum` + +Orchestration bias: +- `high` + +Normal mode uses this rubric to break ties while executing. +Autopilot mode uses the same rubric for bounded implementation decisions and execution discoveries. + +## Autopilot Mode + +Autopilot is active only when the input begins with: + +- `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::` + +When that marker is present: +- Strip the marker before processing the input document path +- Read the manifest path from the marker +- Validate that the manifest describes an active autopilot run +- Use the manifest's artifacts and gate state as part of execution context +- Treat `manifest.implementation_mode=swarm` as the explicit swarm opt-in for this run's implementation gate +- Treat this skill as the owner of `gates.implementation` for the active run + +Then use the same safe defaults described below and avoid workflow prompts. + +Specific behavior: + +- Do not ask for generic approval to proceed. +- Respect explicit user instructions about branch strategy, such as "use `main`", "create a new branch", or "use a worktree". +- In autopilot mode, if already on a non-default branch, continue there and note it briefly. +- In autopilot mode, if on the default branch and the user did not explicitly authorize staying there, create a feature branch automatically. 
Prefer a worktree only when the user explicitly asked for it or the environment clearly calls for it. +- Never commit directly to the default branch without explicit user permission. +- Stop only for true blockers: contradictory requirements, missing credentials, broken environment/setup, or another consent boundary that cannot be inferred safely. +- When using a fallback or skipping a non-critical step, inform the user briefly and continue. +- In autopilot mode, keep `gates.implementation.state = pending` while coding is still in progress. +- Mark `gates.implementation.state = complete` only when the coding phase has reached a reviewable checkpoint: intended changes are implemented, implementation-blocking questions are resolved or externalized, and the code-oriented verification appropriate to this slice has run. +- Mark `gates.implementation.state = blocked` only for true blockers. +- Before returning in autopilot mode, update `gates.implementation.evidence` to explain why implementation is complete, pending, or blocked for the current run state. 
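The gate rules above can be summarized in a short sketch. It assumes the manifest is a nested dict whose keys mirror the dotted names used here (`gates.implementation.state`, `gates.implementation.evidence`); the helper names and boolean parameters are hypothetical.

```python
from typing import Any, Dict


def implementation_gate_state(
    changes_implemented: bool,
    blocking_questions_settled: bool,
    verification_ran: bool,
    true_blocker: bool = False,
) -> str:
    """Hypothetical helper: derive the implementation gate state for the current run."""
    # Blocked is reserved for true blockers only.
    if true_blocker:
        return "blocked"
    # Complete requires a reviewable checkpoint: all three conditions must hold.
    if changes_implemented and blocking_questions_settled and verification_ran:
        return "complete"
    # Otherwise coding is still in progress.
    return "pending"


def update_implementation_gate(
    manifest: Dict[str, Any], state: str, evidence: str
) -> Dict[str, Any]:
    """Record the state and the evidence explaining it (assumed manifest shape)."""
    gate = manifest.setdefault("gates", {}).setdefault("implementation", {})
    gate["state"] = state
    gate["evidence"] = evidence
    return manifest
```

Writing the evidence alongside the state keeps the reason for a `pending` or `blocked` gate visible to whichever step resumes the run.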
+ +Decision boundaries in autopilot mode: + +- **May decide automatically** + - bounded implementation choices inside the approved plan + - small local blast-radius fixes that are clearly adjacent and cheap + - execution-discovery resolutions that preserve plan intent without materially changing product behavior +- **Must ask** + - plan-breaking or product-level behavior changes + - scope expansions that are no longer clearly within the local blast radius + - branch/consent boundaries that still require explicit user approval +- **Must log** + - any substantive autonomous implementation decision that changes behavior, implementation direction, scope within the local blast radius, or verification strategy + ## Input Document #$ARGUMENTS @@ -31,8 +98,8 @@ This command takes a work document (plan, specification, or todo file) and execu - Review any references or links provided in the plan - If the user explicitly asks for TDD, test-first, or characterization-first execution in this session, honor that request even if the plan has no `Execution note` - If anything is unclear or ambiguous, ask clarifying questions now - - Get user approval to proceed - - **Do not skip this** - better to ask questions now than build the wrong thing + - Do not ask for generic approval to proceed once the plan is clear enough to execute + - Ask only when a real ambiguity or blocker would materially change the work 2. **Setup Environment** @@ -48,35 +115,51 @@ This command takes a work document (plan, specification, or todo file) and execu fi ``` - **If already on a feature branch** (not the default branch): - - Ask: "Continue working on `[current_branch]`, or create a new branch?" - - If continuing, proceed to step 3 - - If creating new, follow Option A or B below + Choose the branch strategy using this precedence: - **If on the default branch**, choose how to proceed: + - **Explicit user instruction wins** — if the user asked to use `main`, create a new branch, or use a worktree, do that. 
+ - **Autopilot mode (active `lfg` run with marker/manifest)** — if already on a non-default branch, continue on `current_branch` and note that choice briefly. If on the default branch without explicit permission to stay there, create a feature branch automatically. + - **Standalone `/ce:work` on a non-default branch** — do not silently reuse the branch. Ask whether to continue on `current_branch`, create a new feature branch, or use a worktree instead. + - **Standalone `/ce:work` on the default branch without explicit permission to stay there** — ask whether to create a feature branch or use a worktree. Continuing on the default branch still requires explicit authorization. + - **Use a worktree** when the user explicitly asked for it or the environment clearly calls for isolated parallel development. + + Never commit directly to the default branch without explicit permission. + + For standalone `/ce:work`, use the platform's blocking question tool when available. Otherwise, present numbered options and wait. Suggested prompts: + + If already on a non-default branch: - **Option A: Create a new branch** - ```bash - git pull origin [default_branch] - git checkout -b feature-branch-name ``` - Use a meaningful name based on the work (e.g., `feat/user-authentication`, `fix/email-validation`). + Branch safety check: you're on `[current_branch]`. - **Option B: Use a worktree (recommended for parallel development)** - ```bash - skill: git-worktree - # The skill will create a new branch from the default branch in an isolated worktree + 1. Continue on `[current_branch]` + 2. Create a new feature branch (recommended) + 3. Use a worktree instead + 4. 
Cancel ``` - **Option C: Continue on the default branch** - - Requires explicit user confirmation - - Only proceed after user explicitly says "yes, commit to [default_branch]" - - Never commit directly to the default branch without explicit permission + If on the default branch: - **Recommendation**: Use worktree if: - - You want to work on multiple features simultaneously - - You want to keep the default branch clean while experimenting - - You plan to switch between branches frequently + ``` + Branch safety check: you're on the default branch `[default_branch]`. + + 1. Create a new feature branch (recommended) + 2. Use a worktree instead + 3. Continue on `[default_branch]` (only if explicitly requested) + 4. Cancel + ``` + + When creating a branch automatically: + ```bash + if [ -n "$(git status --porcelain)" ]; then + git checkout -b feature-branch-name + else + git pull origin [default_branch] + git checkout -b feature-branch-name + fi + ``` + If the worktree is dirty, branch first so local artifacts such as a newly written plan file carry forward safely. Only pull before branching when the worktree is clean. + Use a meaningful name based on the work (e.g., `feat/user-authentication`, `fix/email-validation`). 3. **Create Todo List** - Use your available task tracking tool (e.g., TodoWrite, task lists) to break the plan into actionable tasks @@ -396,7 +479,7 @@ This command takes a work document (plan, specification, or todo file) and execu For genuinely large plans where agents need to communicate with each other, challenge approaches, or coordinate across 10+ tasks with persistent specialized roles, use agent team capabilities if available (e.g., Agent Teams in Claude Code, multi-agent workflows in Codex). -**Agent teams are typically experimental and require opt-in.** Do not attempt to use agent teams unless the user explicitly requests swarm mode or agent teams, and the platform supports it. 
+**Agent teams are typically experimental and require opt-in.** Do not attempt to use agent teams unless the user explicitly requests swarm mode or agent teams, or the active autopilot manifest sets `implementation_mode=swarm`, and the platform supports it. ### When to Use Agent Teams vs Subagents @@ -405,7 +488,7 @@ For genuinely large plans where agents need to communicate with each other, chal | Agents need to discuss and challenge each other's approaches | Each task is independent — only the result matters | | Persistent specialized roles (e.g., dedicated tester running continuously) | Workers report back and finish | | 10+ tasks with complex cross-cutting coordination | 3-8 tasks with clear dependency chains | -| User explicitly requests "swarm mode" or "agent teams" | Default for most plans | +| User explicitly requests "swarm mode" or "agent teams", or the active autopilot manifest sets `implementation_mode=swarm` | Default for most plans | Most plans should use subagent dispatch from standard mode. Agent teams add significant token cost and coordination overhead — use them when the inter-agent communication genuinely improves the outcome. diff --git a/plugins/compound-engineering/skills/deepen-plan/SKILL.md b/plugins/compound-engineering/skills/deepen-plan/SKILL.md index bd4423415..855b06615 100644 --- a/plugins/compound-engineering/skills/deepen-plan/SKILL.md +++ b/plugins/compound-engineering/skills/deepen-plan/SKILL.md @@ -26,13 +26,71 @@ Use the platform's question tool when available. When asking the user a question Ask one question at a time. Prefer a concise single-select choice when natural options exist. 
+## Role Rubric + +This skill uses these roles in both normal and autopilot modes: + +- `Engineer` -- optimize for correctness, reuse, maintainability, verification strength, and technical fit +- `Product Manager` -- preserve product intent, scope boundaries, and success criteria from the origin document +- `Designer` -- preserve user experience, state coverage, and flow clarity when deepening touches user-facing behavior + +Ordered weighting: +- `Engineer > Product Manager > Designer` + +Dominant decision criteria: +- `Completeness` +- `Clarity` +- `Reuse` +- `User Value` +- `Momentum` + +Orchestration bias: +- `low-medium` + +Normal mode uses this rubric to recommend which weak sections to strengthen. +Autopilot mode uses the same rubric for bounded plan-strengthening decisions only. + +## Autopilot Mode + +Autopilot is active only when the input begins with: + +- `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::` + +When that marker is present: +- Strip the marker before processing the plan path +- Read the manifest path from the marker +- Validate that the manifest describes an active autopilot run +- Treat the run as part of an `lfg`-owned workflow, not a standalone deepen-plan session + +Then skip workflow prompts and return control to the caller. + +Specific behavior: + +- If the caller did not pass a plan path, fall back to the manifest's `artifacts.plan_doc` when present. If neither the caller nor the manifest provides a plan path, treat that as a pipeline invocation error. Report it briefly and stop rather than asking the user to choose a plan. +- If the plan already appears sufficiently grounded, note that briefly and return control. Do not offer next-step options. +- If the plan is strengthened, briefly summarize which sections were improved and return control. Do not offer next-step options. 
+- If deepening reveals a true product-level blocker that would change behavior, scope, or success criteria, surface it clearly and stop so the caller can route back to `ce:brainstorm` or ask the user. +- If deepening reveals only technical uncertainty, strengthen the plan in place and continue returning control as normal. + +Decision boundaries in autopilot mode: + +- **May decide automatically** + - how to strengthen weak sections of an existing plan + - bounded technical uncertainty that does not change product behavior or scope + - verification, sequencing, and risk-treatment improvements grounded in the current plan and research +- **Must ask** + - product blockers that would change behavior, scope, or success criteria + - materially different architecture choices that the existing plan did not constrain +- **Must log** + - any substantive decision that changes implementation direction, sequencing, verification strategy, or risk handling in the plan + ## Plan File #$ARGUMENTS If the plan path above is empty: 1. Check `docs/plans/` for recent files -2. Ask the user which plan to deepen using the platform's blocking question tool when available (see Interaction Method). Otherwise, present numbered options in chat and wait for the user's reply before proceeding +2. In autopilot mode, first check whether the manifest already points to `artifacts.plan_doc` and use that path when available. If the manifest does not provide a plan path either, stop and report that the caller must provide one. Otherwise, ask the user which plan to deepen using the platform's blocking question tool when available (see Interaction Method). Otherwise, present numbered options in chat and wait for the user's reply before proceeding Do not proceed until you have a valid plan file path. 
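The marker handling and plan-path fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual implementation: the regex, the names `split_marker` and `resolve_plan_path`, and the `PipelineInvocationError` class are hypothetical. The source specifies only the marker prefix and the fallback order (caller path, then `artifacts.plan_doc`, then stop). Validating that the manifest describes an active run is left out.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: the source shows only the literal marker shape, not a grammar.
MARKER_RE = re.compile(
    r"^\[ce-autopilot manifest=(?P<manifest>[^\]\s]+)\] ::\s*(?P<rest>.*)$",
    re.DOTALL,
)


class PipelineInvocationError(Exception):
    """Hypothetical error for a caller that supplied no usable plan path."""


def split_marker(raw: str) -> Tuple[Optional[str], str]:
    """Return (manifest_path, remaining_input).

    manifest_path is None when the input does not begin with the marker,
    which means autopilot mode is not active.
    """
    match = MARKER_RE.match(raw)
    if match is None:
        return None, raw.strip()
    return match.group("manifest"), match.group("rest").strip()


def resolve_plan_path(caller_path: str, manifest: dict) -> str:
    """Prefer the caller's path, then the manifest's artifacts.plan_doc, else fail."""
    if caller_path:
        return caller_path
    plan_doc = (manifest.get("artifacts") or {}).get("plan_doc")
    if plan_doc:
        return plan_doc
    # Autopilot must not ask the user to choose a plan; report and stop instead.
    raise PipelineInvocationError("autopilot caller must provide a plan path")
```

Raising instead of prompting mirrors the rule that a missing plan path in autopilot mode is a pipeline invocation error, not a question for the user.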
@@ -84,6 +142,7 @@ Use this default: If the plan already appears sufficiently grounded: - Say so briefly - Recommend moving to `/ce:work` or the `document-review` skill +- In autopilot mode, return control immediately after the brief note - If the user explicitly asked to deepen anyway, continue with a light pass and deepen at most 1-2 sections ### Phase 1: Parse the Current `ce:plan` Structure @@ -386,6 +445,14 @@ If artifact-backed mode was used and the user did not ask to inspect the scratch ## Post-Enhancement Options +In autopilot mode, skip this section entirely. After updating the plan: +- if substantive changes were made, briefly summarize which sections were strengthened and return control +- if no substantive changes were warranted, briefly note that the plan already appears sufficiently grounded and return control + +When substantive changes are made in autopilot mode: +- keep the canonical decision rows in the run-scoped `decisions.md` +- ensure the plan's `## Autopilot Decisions` section reflects the applicable promoted subset using the shared row schema + If substantive changes were made, present next steps using the platform's blocking question tool when available (see Interaction Method). Otherwise, present numbered options in chat and wait for the user's reply before proceeding. **Question:** "Plan deepened at `[plan_path]`. What would you like to do next?" diff --git a/plugins/compound-engineering/skills/document-review/SKILL.md b/plugins/compound-engineering/skills/document-review/SKILL.md index ca83d4759..2592047c9 100644 --- a/plugins/compound-engineering/skills/document-review/SKILL.md +++ b/plugins/compound-engineering/skills/document-review/SKILL.md @@ -5,7 +5,16 @@ description: Review requirements or plan documents using parallel persona agents # Document Review -Review requirements or plan documents through multi-persona analysis. 
Dispatches specialized reviewer agents in parallel, auto-fixes quality issues, and presents strategic questions for user decision. +Review requirements or plan documents through multi-persona analysis. Dispatch specialized reviewer agents in parallel, apply deterministic document-quality fixes, and classify substantive findings so the owning workflow skill or user can decide how to resolve them. + +## Autopilot Utility Contract + +`document-review` is a review utility, not a primary decision-maker. + +- It may apply `mechanical-fix` findings automatically when the fix is deterministic and meaning-preserving. +- It may return `bounded-decision`, `must-ask`, and `note` findings. +- It must not resolve substantive product or implementation decisions on its own. +- In autopilot workflows, the owning skill (`ce:brainstorm`, `ce:plan`, `deepen-plan`, or another future decision owner) is responsible for deciding whether to auto-resolve a `bounded-decision`, escalate a `must-ask`, or leave a `note` in place. 
## Phase 1: Get and Analyze Document @@ -125,7 +134,7 @@ Scan the residual concerns (findings suppressed in 3.2) for: When personas disagree on the same section: - Create a **combined finding** presenting both perspectives -- Set `autofix_class: present` +- Set `finding_class: must-ask` - Frame as a tradeoff, not a verdict Specific conflict patterns: @@ -133,14 +142,16 @@ Specific conflict patterns: - Feasibility says "this is impossible" + product-lens says "this is essential" -> P1 finding framed as a tradeoff - Multiple personas flag the same issue -> merge into single finding, note consensus, increase confidence -### 3.6 Route by Autofix Class +### 3.6 Route by Finding Class -| Autofix Class | Route | +| Finding Class | Route | |---------------|-------| -| `auto` | Apply automatically -- local deterministic fix (terminology, formatting, cross-references) | -| `present` | Present to user for judgment | +| `mechanical-fix` | Apply automatically -- local deterministic fix (terminology, formatting, cross-references) | +| `bounded-decision` | Present to the owning workflow skill or user for judgment | +| `must-ask` | Present as requiring user judgment | +| `note` | Present as non-blocking context | -Demote any `auto` finding that lacks a `suggested_fix` to `present` -- the orchestrator cannot apply a fix without concrete replacement text. +Demote any `mechanical-fix` finding that lacks a `suggested_fix` to `note` -- the orchestrator cannot apply a fix without concrete replacement text. 
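One way to picture the routing and demotion rule in section 3.6 is the sketch below. The function names `effective_class` and `route_findings` are hypothetical, and treating an empty string as a missing `suggested_fix` is an assumption. The class names and the demotion target (`note`) come from the table and the rule above.

```python
from typing import Dict, List

# Route descriptions paraphrase the routing table in section 3.6.
ROUTES: Dict[str, str] = {
    "mechanical-fix": "apply automatically",
    "bounded-decision": "present to the owning workflow skill or user",
    "must-ask": "present as requiring user judgment",
    "note": "present as non-blocking context",
}


def effective_class(finding: dict) -> str:
    """Return the class a finding is routed under, after demotion."""
    cls = finding["finding_class"]
    if cls not in ROUTES:
        raise ValueError(f"unknown finding_class: {cls!r}")
    # Assumption: None and "" both count as "lacks a suggested_fix".
    if cls == "mechanical-fix" and not finding.get("suggested_fix"):
        return "note"
    return cls


def route_findings(findings: List[dict]) -> Dict[str, List[dict]]:
    """Group findings by their effective class, keeping input order."""
    routed: Dict[str, List[dict]] = {name: [] for name in ROUTES}
    for finding in findings:
        routed[effective_class(finding)].append(finding)
    return routed
```

Only the `mechanical-fix` bucket would be applied without asking. Everything else is surfaced to the owning skill or the user.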
### 3.7 Sort @@ -150,7 +161,7 @@ Sort findings for presentation: P0 -> P1 -> P2 -> P3, then by confidence (descen ### Apply Auto-fixes -Apply all `auto` findings to the document in a **single pass**: +Apply all `mechanical-fix` findings to the document in a **single pass**: - Edit the document inline using the platform's edit tool - Track what was changed for the "Auto-fixes Applied" section - Do not ask for approval -- these are unambiguously correct (terminology fixes, formatting, cross-references) @@ -160,10 +171,10 @@ Apply all `auto` findings to the document in a **single pass**: Present all other findings to the user using the format from [review-output-template.md](./references/review-output-template.md): - Group by severity (P0 -> P3) - Include the Coverage table showing which personas ran -- Show auto-fixes that were applied +- Show mechanical fixes that were applied - Include residual concerns and deferred questions if any -Brief summary at the top: "Applied N auto-fixes. M findings to consider (X at P0/P1)." +Brief summary at the top: "Applied N mechanical fixes. M findings to consider (X at P0/P1)." ### Protected Artifacts @@ -193,8 +204,9 @@ Return "Review complete" as the terminal signal for callers. - Do not add new sections or requirements the user didn't discuss - Do not over-engineer or add complexity - Do not create separate review files or add metadata sections -- Do not modify any of the 4 caller skills (ce-brainstorm, ce-plan, ce-plan-beta, deepen-plan-beta) +- Do not modify any of the caller skills (ce-brainstorm, ce-plan, deepen-plan) + - Do not resolve substantive product or implementation questions on behalf of an autopilot caller ## Iteration Guidance -On subsequent passes, re-dispatch personas and re-synthesize. The auto-fix mechanism and confidence gating prevent the same findings from recurring once fixed. If findings are repetitive across passes, recommend completion. +On subsequent passes, re-dispatch personas and re-synthesize. 
The mechanical-fix mechanism and confidence gating prevent the same findings from recurring once fixed. If findings are repetitive across passes, recommend completion. diff --git a/plugins/compound-engineering/skills/document-review/references/findings-schema.json b/plugins/compound-engineering/skills/document-review/references/findings-schema.json index cb9a6295c..3f99d94d0 100644 --- a/plugins/compound-engineering/skills/document-review/references/findings-schema.json +++ b/plugins/compound-engineering/skills/document-review/references/findings-schema.json @@ -19,7 +19,7 @@ "severity", "section", "why_it_matters", - "autofix_class", + "finding_class", "confidence", "evidence" ], @@ -42,10 +42,10 @@ "type": "string", "description": "Impact statement -- not 'what is wrong' but 'what goes wrong if not addressed'" }, - "autofix_class": { + "finding_class": { "type": "string", - "enum": ["auto", "present"], - "description": "How this issue should be handled. auto = local deterministic fix the orchestrator can apply without asking (terminology, formatting, cross-references). present = requires user judgment." + "enum": ["mechanical-fix", "bounded-decision", "must-ask", "note"], + "description": "How this issue should be handled. mechanical-fix = local deterministic document fix. bounded-decision = substantive issue the owning workflow skill may auto-decide within its rubric. must-ask = exceeds decision authority and requires user judgment. note = non-blocking context worth surfacing." }, "suggested_fix": { "type": ["string", "null"], @@ -90,9 +90,11 @@ "P2": "Moderate issue with meaningful downside. Fix if straightforward.", "P3": "Minor improvement. User's discretion." }, - "autofix_classes": { - "auto": "Local, deterministic document fix: terminology consistency, formatting, cross-reference correction. 
Must be unambiguous and not change the document's meaning.", - "present": "Requires user judgment -- strategic questions, tradeoffs, meaning-changing fixes, or informational findings." + "finding_classes": { + "mechanical-fix": "Local, deterministic document fix: terminology consistency, formatting, cross-reference correction. Must be unambiguous and not change the document's meaning.", + "bounded-decision": "Substantive issue with a bounded set of plausible resolutions. The owning workflow skill may auto-decide it within its documented role rubric.", + "must-ask": "Requires user judgment because it exceeds the owning skill's decision authority or would materially change scope, behavior, or risk.", + "note": "Non-blocking context or advisory observation worth surfacing without forcing immediate resolution." } } } diff --git a/plugins/compound-engineering/skills/document-review/references/review-output-template.md b/plugins/compound-engineering/skills/document-review/references/review-output-template.md index 21b03f80a..ebb6bf44c 100644 --- a/plugins/compound-engineering/skills/document-review/references/review-output-template.md +++ b/plugins/compound-engineering/skills/document-review/references/review-output-template.md @@ -15,35 +15,35 @@ Use this **exact format** when presenting synthesized review findings. 
Findings - security-lens -- plan adds public API endpoint with auth flow - scope-guardian -- plan has 15 requirements across 3 priority levels -### Auto-fixes Applied +### Mechanical Fixes Applied -- Standardized "pipeline"/"workflow" terminology to "pipeline" throughout (coherence, auto) -- Fixed cross-reference: Section 4 referenced "Section 3.2" which is actually "Section 3.1" (coherence, auto) +- Standardized "pipeline"/"workflow" terminology to "pipeline" throughout (coherence, mechanical-fix) +- Fixed cross-reference: Section 4 referenced "Section 3.2" which is actually "Section 3.1" (coherence, mechanical-fix) ### P0 -- Must Fix -| # | Section | Issue | Reviewer | Confidence | Route | +| # | Section | Issue | Reviewer | Confidence | Class | |---|---------|-------|----------|------------|-------| -| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 0.92 | `present` | +| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 0.92 | `must-ask` | ### P1 -- Should Fix -| # | Section | Issue | Reviewer | Confidence | Route | +| # | Section | Issue | Reviewer | Confidence | Class | |---|---------|-------|----------|------------|-------| -| 2 | Implementation Unit 3 | Plan proposes custom auth when codebase already uses Devise | feasibility | 0.85 | `present` | -| 3 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 0.80 | `present` | +| 2 | Implementation Unit 3 | Plan proposes custom auth when codebase already uses Devise | feasibility | 0.85 | `bounded-decision` | +| 3 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 0.80 | `must-ask` | ### P2 -- Consider Fixing -| # | Section | Issue | Reviewer | Confidence | Route | +| # | Section | Issue | Reviewer | Confidence | Class | 
|---|---------|-------|----------|------------|-------| -| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 0.75 | `present` | +| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 0.75 | `bounded-decision` | ### P3 -- Minor -| # | Section | Issue | Reviewer | Confidence | Route | +| # | Section | Issue | Reviewer | Confidence | Class | |---|---------|-------|----------|------------|-------| -| 5 | Overview | "Service" used to mean both microservice and business class | coherence | 0.65 | `auto` | +| 5 | Overview | "Service" used to mean both microservice and business class | coherence | 0.65 | `mechanical-fix` | ### Residual Concerns @@ -71,7 +71,7 @@ Use this **exact format** when presenting synthesized review findings. Findings ## Section Rules -- **Auto-fixes Applied**: List fixes that were applied automatically (auto class). Omit section if none. +- **Mechanical Fixes Applied**: List fixes that were applied automatically (`mechanical-fix` class). Omit section if none. - **P0-P3 sections**: Only include sections that have findings. Omit empty severity levels. - **Residual Concerns**: Findings below confidence threshold that were promoted by cross-persona corroboration, plus unpromoted residual risks. Omit if none. - **Deferred Questions**: Questions for later workflow stages. Omit if none. diff --git a/plugins/compound-engineering/skills/document-review/references/subagent-template.md b/plugins/compound-engineering/skills/document-review/references/subagent-template.md index f21e0f1d1..9091004ea 100644 --- a/plugins/compound-engineering/skills/document-review/references/subagent-template.md +++ b/plugins/compound-engineering/skills/document-review/references/subagent-template.md @@ -22,10 +22,12 @@ Rules: - Suppress any finding below your stated confidence floor (see your Confidence calibration section). 
- Every finding MUST include at least one evidence item -- a direct quote from the document. - You are operationally read-only. Analyze the document and produce findings. Do not edit the document, create files, or make changes. You may use non-mutating tools (file reads, glob, grep, git log) to gather context about the codebase when evaluating feasibility or existing patterns. -- Set `autofix_class` conservatively: - - `auto`: Only for local, deterministic fixes -- terminology corrections, formatting fixes, cross-reference repairs. The fix must be unambiguous and not change the document's meaning. - - `present`: Everything else -- strategic questions, tradeoffs, meaning-changing fixes, informational findings. -- `suggested_fix` is optional. Only include it when the fix is obvious and correct. For `present` findings, frame as a question instead. +- Set `finding_class` conservatively: + - `mechanical-fix`: Only for local, deterministic fixes -- terminology corrections, formatting fixes, cross-reference repairs. The fix must be unambiguous and not change the document's meaning. + - `bounded-decision`: A substantive issue with a bounded set of plausible resolutions that the owning workflow skill could decide within its rubric. + - `must-ask`: Requires user judgment because it would materially change scope, behavior, or risk. + - `note`: Non-blocking context worth surfacing without forcing immediate resolution. +- `suggested_fix` is optional. Only include it when the fix is obvious and correct. For `bounded-decision` and `must-ask` findings, frame the suggestion as a decision or question rather than pretending the orchestrator should apply it. - If you find no issues, return an empty findings array. Still populate residual_risks and deferred_questions if applicable. - Use your suppress conditions. Do not flag issues that belong to other personas. 
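The persona-side rules above (confidence floor, mandatory evidence, a valid `finding_class`, and an always-present output shape) could be checked with something like the sketch below. The helper names `check_finding` and `build_output` are hypothetical, and dropping a non-conforming finding instead of rejecting the whole response is an assumption. The field names follow the findings schema.

```python
from typing import Iterable, List, Tuple

FINDING_CLASSES = {"mechanical-fix", "bounded-decision", "must-ask", "note"}


def check_finding(finding: dict, confidence_floor: float) -> Tuple[bool, str]:
    """Return (keep, reason) for a single finding."""
    if finding.get("confidence", 0.0) < confidence_floor:
        return False, "below confidence floor"
    if not finding.get("evidence"):
        return False, "no evidence item"
    if finding.get("finding_class") not in FINDING_CLASSES:
        return False, "unknown finding_class"
    return True, "ok"


def build_output(
    findings: List[dict],
    confidence_floor: float,
    residual_risks: Iterable[str] = (),
    deferred_questions: Iterable[str] = (),
) -> dict:
    """Assemble the persona's response; findings may legitimately be empty."""
    kept = [f for f in findings if check_finding(f, confidence_floor)[0]]
    return {
        "findings": kept,
        "residual_risks": list(residual_risks),
        "deferred_questions": list(deferred_questions),
    }
```

An empty `findings` list alongside populated `residual_risks` is a valid result under these rules.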
diff --git a/plugins/compound-engineering/skills/feature-video/SKILL.md b/plugins/compound-engineering/skills/feature-video/SKILL.md index 348081c2c..541172c89 100644 --- a/plugins/compound-engineering/skills/feature-video/SKILL.md +++ b/plugins/compound-engineering/skills/feature-video/SKILL.md @@ -17,6 +17,31 @@ Record browser interactions demonstrating a feature, stitch screenshots into an - Git repository on a feature branch (PR optional -- skill can create a draft or record-only) - One-time GitHub browser auth (see Step 6 auth check) +## Autopilot Mode + +Autopilot is active only when the input begins with: + +- `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::` + +When that marker is present: +- Strip the marker before processing the normal argument +- Read the manifest path from the marker +- Validate that the manifest describes an active autopilot run +- Treat this skill as an autopilot contract consumer, not a substantive decision owner +- Treat `wrap_up` as best-effort by default; ordinary PR, auth, upload, or environment issues should not derail the run + +Then treat feature video as best effort and prefer continuing the pipeline over blocking on interaction. + +Specific behavior: + +- If no PR exists for the current branch, first try creating a draft PR automatically. If that fails, set `gates.wrap_up.state = skipped` with a brief reason and return control to the caller. +- If required tools are missing, the dev server is unavailable, or the app cannot be exercised, set `gates.wrap_up.state = skipped` with a brief reason and return control to the caller. +- Plan the video flow automatically. Do not ask for shot-list confirmation. +- If GitHub browser auth is required and a saved authenticated session is unavailable, set `gates.wrap_up.state = skipped` with a brief reason and return control to the caller rather than waiting for manual login. +- Briefly inform the user when the video step is skipped or downgraded. 
Do not block on the prompt. +- Do not append substantive product or implementation decisions to the autopilot decision log. This skill only emits operational notes and artifacts. +- When wrap-up succeeds in autopilot mode, set `gates.wrap_up.state = complete`, record brief evidence, and set `gates.wrap_up.ref = current HEAD`. + ## Main Tasks ### 1. Parse Arguments & Resolve PR @@ -41,7 +66,10 @@ If no explicit PR number was provided (or "current" was specified), check if a P gh pr view --json number -q '.number' ``` -If no PR exists for the current branch, ask the user how to proceed. **Use the platform's blocking question tool** (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini): +If no PR exists for the current branch: + +- In autopilot mode, try creating a draft PR automatically and continue if successful. If draft PR creation fails, set `gates.wrap_up.state = skipped` with a brief reason and return control to the caller. +- Otherwise, ask the user how to proceed. **Use the platform's blocking question tool** (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini): ``` No PR found for the current branch. @@ -77,12 +105,15 @@ command -v agent-browser command -v gh ``` -If any tool is missing, stop and report which tools need to be installed: +If any tool is missing: + +- In autopilot mode, set `gates.wrap_up.state = skipped` with a brief reason and return control to the caller. +- Otherwise, stop and report which tools need to be installed: - `ffmpeg`: `brew install ffmpeg` (macOS) or equivalent - `agent-browser`: load the `agent-browser` skill for installation instructions - `gh`: `brew install gh` (macOS) or see https://cli.github.com -Do not proceed to Step 2 until all tools are available. +Do not proceed to Step 2 until all tools are available unless autopilot mode already returned control. ### 2. Gather Feature Context @@ -114,7 +145,9 @@ Before recording, create a shot list: 4. 
**Edge cases**: Error states, validation, etc. (if applicable) 5. **Success state**: Completed action/result -Present the proposed flow to the user for confirmation before recording. +In autopilot mode, create the proposed flow automatically and proceed to recording without asking for confirmation. + +Otherwise, present the proposed flow to the user for confirmation before recording. **Use the platform's blocking question tool when available** (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present numbered options and wait for the user's reply before proceeding: @@ -223,7 +256,9 @@ agent-browser close agent-browser --engine chrome --headed --session-name github open https://github.com/login ``` -The user must log in manually in the browser window (handles 2FA, SSO, OAuth -- any login method). **Use the platform's blocking question tool** (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present the message and wait for the user's reply before proceeding: +In autopilot mode, if manual login is required because the saved session is missing or expired, set `gates.wrap_up.state = skipped` with a brief reason and return control to the caller. + +Otherwise, the user must log in manually in the browser window (handles 2FA, SSO, OAuth -- any login method). **Use the platform's blocking question tool** (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present the message and wait for the user's reply before proceeding: ``` GitHub login required for video upload. @@ -314,7 +349,14 @@ gh pr edit [number] --body "[updated body with demo section]" ### 8. Cleanup -Ask the user before removing temporary files. If confirmed, clean up only the current run's scratch directory (other runs may still be in progress or awaiting upload). 
+In autopilot mode, do not ask for cleanup confirmation: + +- If the video was successfully uploaded from the current run, remove only that run's scratch directory automatically. +- If in record-only mode or upload failed, remove only that run's screenshots and preserve the `.mp4` so the caller can upload later. +- If this was upload-only resume mode with a caller-provided `.mp4`, do not delete the caller's file. +- Briefly note what was removed and what was preserved, then return control to the caller. + +Outside autopilot mode, ask the user before removing temporary files. If confirmed, clean up only the current run's scratch directory (other runs may still be in progress or awaiting upload). **If the video was successfully uploaded**, remove the entire run directory: diff --git a/plugins/compound-engineering/skills/lfg/SKILL.md b/plugins/compound-engineering/skills/lfg/SKILL.md index dd5aaddd7..53b99ec7a 100644 --- a/plugins/compound-engineering/skills/lfg/SKILL.md +++ b/plugins/compound-engineering/skills/lfg/SKILL.md @@ -1,36 +1,219 @@ --- name: lfg -description: Full autonomous engineering workflow +description: Right-sized engineering autopilot from idea to PR -- assesses task complexity, resumes from the current workflow gate when possible, and runs the appropriate amount of ceremony from direct edits for trivial fixes to the full brainstorm-plan-implement-review-test pipeline for complex features. argument-hint: "[feature description]" disable-model-invocation: true --- -CRITICAL: You MUST execute every step below IN ORDER. Do NOT skip any required step. Do NOT jump ahead to coding or implementation. The plan phase (step 2, and step 3 when warranted) MUST be completed and verified BEFORE any work begins. Violating this order produces bad output. +Assess the task, choose the right execution path, and get it done. 
Not every task needs a 10-step pipeline -- a typo fix should not generate a plan file, and a complex feature should not skip requirements exploration. `/lfg` should also be able to resume from the current workflow gate when requirements, a plan, implementation work, or a PR already exist. -1. **Optional:** If the `ralph-loop` skill is available, run `/ralph-loop:ralph-loop "finish all slash commands" --completion-promise "DONE"`. If not available or it fails, skip and continue to step 2 immediately. +## Autopilot Run Contract -2. `/ce:plan $ARGUMENTS` +For resumable `lfg` runs across direct, lightweight, and full-pipeline routes, `lfg` owns the deterministic autopilot contract. - GATE: STOP. Verify that the `ce:plan` workflow produced a plan file in `docs/plans/`. If no plan file was created, run `/ce:plan $ARGUMENTS` again. Do NOT proceed to step 3 until a written plan exists. +- Downstream marker format: + - `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: ` +- Run directory: + - `.context/compound-engineering/autopilot//` +- Required files: + - `session.json` + - `decisions.md` +- Phase-1 manifest schema: + - `schema_version` + - `run_id` + - `route` = `direct | lightweight | full` + - `status` = `active | completed | aborted` + - `implementation_mode` = `standard | swarm` + - `feature_description` + - `current_gate` (optional, advisory only; recompute from gate state on resume) + - `gates.requirements | gates.plan | gates.implementation | gates.review | gates.verification | gates.wrap_up` + - `artifacts.requirements_doc | artifacts.plan_doc` +- Each gate entry contains: + - `state` = `complete | skipped | pending | blocked | unknown` + - `evidence` + - `ref` only for `review`, `verification`, and `wrap_up` so `lfg` can invalidate stale late-stage completions after code changes -3. **Conditionally** run `/compound-engineering:deepen-plan` +`lfg` is the only top-level skill that creates or backfills these manifests. 
Downstream skills must use the explicit marker plus manifest path rather than guessing from caller prose. +The manifest is the primary resume source, but never blindly trust it. On resume, validate it against durable artifacts and current repo state, then repair it conservatively if it is stale or inconsistent. - Run the `deepen-plan` workflow only if the plan is `Standard` or `Deep`, touches a high-risk area (auth, security, payments, migrations, external APIs, significant rollout concerns), or still has obvious confidence gaps in decisions, sequencing, system-wide impact, risks, or verification. +Direct and lightweight routes still create a lightweight manifest immediately. In those routes, `artifacts.requirements_doc` and `artifacts.plan_doc` may remain unset by design, and `gates.requirements` / `gates.plan` should be marked `skipped` with routing evidence instead of being treated as missing work. - GATE: STOP. If you ran the `deepen-plan` workflow, confirm the plan was deepened or explicitly judged sufficiently grounded. If you skipped it, briefly note why and proceed to step 4. +Implementation-mode selection rule: +- explicit user request for swarm or agent teams wins +- otherwise read `implementation_mode` from `compound-engineering.local.md` frontmatter +- if the setting is missing, assume `standard` -4. `/ce:work` +## Phase 0: Assess and Route - GATE: STOP. Verify that implementation work was performed - files were created or modified beyond the plan. Do NOT proceed to step 5 if no code changes were made. +If `$ARGUMENTS` is empty, do not assess complexity yet. Route to **Full pipeline** so it can first check whether there is resumable work on the current branch/worktree. Do not start with `ce:brainstorm` before resume detection runs. If full-pipeline resume detection finds nothing resumable, stop and tell the user briefly that there is nothing to resume and they should rerun `/lfg ` to start new work. -5. 
`/ce:review mode:autofix` +Read the feature description and choose the cheapest execution path that will handle it well. -6. `/compound-engineering:todo-resolve` +**Bias toward under-routing.** Running too little ceremony and having the user ask for more is far cheaper than running a full pipeline for a one-line fix. When the boundary between direct and lightweight is unclear, prefer direct. When the boundary between lightweight and full pipeline is unclear, prefer full pipeline -- it has internal short-circuits that right-size themselves. -7. `/compound-engineering:test-browser` +Announce the routing decision in one line before proceeding: +- "**Direct** -- [what and why]" +- "**Lightweight** -- [what and why]" +- "**Full pipeline** -- [why this needs structured planning and review]" -8. `/compound-engineering:feature-video` +Then execute immediately. Do not wait for confirmation about the routing decision itself. -9. Output `DONE` when video is in PR +--- + +### Direct + +The fix is obvious and self-contained. No planning or multi-agent review needed. + +Before changing files, preserve the same branch/worktree safety as `ce:work` Phase 1: choose the right branch first, and never commit directly to the default branch without explicit user permission. + +Before changing files, create or backfill a lightweight manifest for this run: +- `route = direct` +- `current_gate = implementation` +- `gates.requirements.state = skipped` +- `gates.plan.state = skipped` +- `gates.review`, `gates.verification`, and `gates.wrap_up` start as `pending` +- `artifacts.requirements_doc` and `artifacts.plan_doc` may remain unset by design + +If the task stops looking direct while you work, upgrade the manifest to `route = full`, mark any missing early gates `pending`, and run them before continuing. 
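The manifest setup for the short routes can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the function names `createShortRouteManifest` and `upgradeToFull` are hypothetical, `schema_version` is assumed to be a number, the evidence strings are placeholders, and "missing early gates" is read here as the `requirements` and `plan` gates that were marked `skipped` at routing time.

```typescript
type GateState = "complete" | "skipped" | "pending" | "blocked" | "unknown";
type GateName = "requirements" | "plan" | "implementation" | "review" | "verification" | "wrap_up";
type Route = "direct" | "lightweight" | "full";
type ImplementationMode = "standard" | "swarm";

interface Gate {
  state: GateState;
  evidence: string;
  ref?: string; // only meaningful for review, verification, and wrap_up
}

interface Manifest {
  schema_version: number; // assumption: the source does not specify the type
  run_id: string;
  route: Route;
  status: "active" | "completed" | "aborted";
  implementation_mode: ImplementationMode;
  feature_description: string;
  current_gate?: GateName; // advisory only; recomputed from gate state on resume
  gates: Record<GateName, Gate>;
  artifacts: { requirements_doc?: string; plan_doc?: string };
}

// Hypothetical helper: build the lightweight manifest for a direct or lightweight run.
function createShortRouteManifest(
  runId: string,
  route: "direct" | "lightweight",
  featureDescription: string,
  implementationMode: ImplementationMode = "standard",
): Manifest {
  const skipped: Gate = { state: "skipped", evidence: `routed as ${route}; no planning artifacts needed` };
  const pending: Gate = { state: "pending", evidence: "" };
  return {
    schema_version: 1,
    run_id: runId,
    route,
    status: "active",
    implementation_mode: implementationMode,
    feature_description: featureDescription,
    current_gate: "implementation",
    gates: {
      requirements: { ...skipped },
      plan: { ...skipped },
      implementation: { ...pending },
      review: { ...pending },
      verification: { ...pending },
      wrap_up: { ...pending },
    },
    artifacts: {}, // requirements_doc and plan_doc stay unset by design
  };
}

// Hypothetical helper: upgrade a short-route manifest once the task stops looking small.
function upgradeToFull(manifest: Manifest): Manifest {
  // Early gates that were skipped at routing time become pending again.
  const reopen = (gate: Gate): Gate =>
    gate.state === "skipped" ? { state: "pending", evidence: "route upgraded to full" } : gate;
  const requirements = reopen(manifest.gates.requirements);
  const plan = reopen(manifest.gates.plan);
  const currentGate: GateName | undefined =
    requirements.state === "pending" ? "requirements"
    : plan.state === "pending" ? "plan"
    : manifest.current_gate;
  return {
    ...manifest,
    route: "full",
    current_gate: currentGate,
    gates: { ...manifest.gates, requirements, plan },
  };
}
```

Both helpers return new objects rather than mutating their input, which keeps the previous manifest available for comparison when the run decides whether an upgrade actually changed anything.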
+ +Make the change, verify it works (typecheck, lint, or test if applicable), then preserve the same applicable wrap-up contract as `ce:work` Phase 4 before outputting `DONE`: +- commit it, push it, and create or update the PR +- add a `## Post-Deploy Monitoring & Validation` section to the PR description +- if the change affects browser UI, capture and upload screenshots and include the image URLs in the PR description + +Update the manifest after implementation, self-review, verification, and wrap-up so rerunning `/lfg` resumes at the first unmet late-stage gate instead of routing from scratch. + +--- + +### Lightweight + +The task is clear and bounded -- requirements and expected behavior are already in the description. Loading brainstorm, plan, and multi-agent review would add ceremony without improving the outcome. + +Before changing files, preserve the same branch/worktree safety as `ce:work` Phase 1: choose the right branch first, and never commit directly to the default branch without explicit user permission. + +Before changing files, create or backfill a lightweight manifest for this run: +- `route = lightweight` +- `current_gate = implementation` +- `gates.requirements.state = skipped` +- `gates.plan.state = skipped` +- `gates.review`, `gates.verification`, and `gates.wrap_up` start as `pending` +- `artifacts.requirements_doc` and `artifacts.plan_doc` may remain unset by design + +If the task stops looking lightweight while you work, upgrade the manifest to `route = full`, mark any missing early gates `pending`, and run them before continuing. + +Do the work directly. 
Verify it works (typecheck, lint, or test if applicable), give it a quick self-review for obvious issues, then preserve the same applicable wrap-up contract as `ce:work` Phase 4 before outputting `DONE`: +- commit it, push it, and create or update the PR +- add a `## Post-Deploy Monitoring & Validation` section to the PR description +- if the change affects browser UI, capture and upload screenshots and include the image URLs in the PR description + +Update the manifest after implementation, self-review, verification, and wrap-up so rerunning `/lfg` resumes at the first unmet late-stage gate instead of routing from scratch. + +--- + +### Full Pipeline + +The task has enough scope, ambiguity, or risk that structured planning prevents wasted work. This is the default when the task is not clearly trivial or simple. + +Skills run in autopilot mode: skip workflow prompts (handoff menus, "what next?" options) but still ask content questions when requirements or scope are unclear. + +Before invoking downstream skills: + +1. Check for an active autopilot manifest for the current branch/worktree. +2. If one exists, resume from it, but validate it against current durable artifacts and repo state before trusting it. +3. If one does not exist, reconstruct one conservatively using this algorithm: + - explicit user direction in the current `/lfg` invocation + - durable workflow artifacts and repo state + - PR and CI state for the current branch/HEAD +4. 
Gather an evidence bundle before inferring any gate state: + - current branch/worktree, dirty state, and whether the branch is ahead of the default branch + - current `/lfg` feature description, if one was provided + - relevant `docs/brainstorms/*-requirements.md` and `docs/plans/*.md` files, preferring artifacts touched on the current branch or referenced by each other + - plan checkbox progress and whether non-doc implementation files changed beyond the plan artifact itself + - open PR, CI state, and pending/ready todo files in `.context/compound-engineering/todos/` +5. Select canonical artifacts conservatively: + - if a plan references an `origin:` requirements document, bind them together + - if multiple plans or requirements docs are plausible, prefer the artifact most clearly tied to the current branch/topic + - if multiple candidates remain materially ambiguous after that pass, ask one targeted question instead of guessing +6. Infer the route before inferring gate state: + - choose `full` if a relevant requirements doc or plan doc exists, or the user explicitly asks for structured planning/review + - choose `lightweight` if implementation work or a PR exists without durable planning artifacts and the task is clearly bounded + - choose `direct` only for a tiny, self-contained change with no durable planning artifacts and no reason to expect multi-step late-stage orchestration + - if you are unsure between `lightweight` and `full`, prefer `full` +7. 
Backfill gate state conservatively: + - `requirements` is `complete` when a requirements doc exists or a plan clearly proves requirements were already resolved; for `direct`/`lightweight`, mark it `skipped` with evidence + - `plan` is `complete` only when a plan doc exists; for `direct`/`lightweight`, mark it `skipped` with evidence + - `implementation` stays `pending` when work appears in progress or evidence is mixed; mark it `complete` only with strong evidence such as completed plan checkboxes or other durable implementation-complete signals + - `review`, `verification`, and `wrap_up` should be marked `complete` only from explicit evidence. Never infer those gates complete from an open PR alone +8. Invalidate stale late-stage gates before resuming: + - if `gates.review.ref` exists and does not match the current HEAD, reset `review`, `verification`, and `wrap_up` to `pending` + - if `gates.verification.ref` exists and does not match the current HEAD, reset `verification` and `wrap_up` to `pending` + - if `gates.wrap_up.ref` exists and does not match the current HEAD, reset `wrap_up` to `pending` +9. If no coherent candidate route/artifact set exists, stop and tell the user there is nothing reliable to resume. Ask for a feature description or explicit plan path instead of guessing. +10. Evaluate ordered workflow gates: + - `requirements` + - `plan` + - `implementation` + - `review` + - `verification` + - `wrap_up` +11. Mark a gate complete only when current evidence supports it. If you cannot prove a gate is complete, leave it `pending`, `skipped`, or `blocked` as appropriate. +12. Create or backfill `.context/compound-engineering/autopilot//session.json` and `.context/compound-engineering/autopilot//decisions.md`. +13. Advance one gate at a time. After each downstream skill returns, update the manifest evidence, recompute the first unmet gate, and continue until all required gates are complete or a real blocker stops the run. 
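Steps 8, 10, and 13 above can be sketched in TypeScript. This is an illustrative sketch only: `invalidateStaleLateGates` and `firstUnmetGate` are hypothetical names, `head` is assumed to be the current commit SHA as a string, and the reset applies to every later late-stage gate regardless of its previous state, following the literal wording of step 8.

```typescript
type GateState = "complete" | "skipped" | "pending" | "blocked" | "unknown";
type GateName = "requirements" | "plan" | "implementation" | "review" | "verification" | "wrap_up";

interface Gate {
  state: GateState;
  evidence: string;
  ref?: string; // commit the late-stage evidence was recorded against
}

type Gates = Record<GateName, Gate>;

const GATE_ORDER: readonly GateName[] = [
  "requirements", "plan", "implementation", "review", "verification", "wrap_up",
];
const LATE_GATES: readonly GateName[] = ["review", "verification", "wrap_up"];

// Step 8: a stale ref on a late gate resets that gate and every late gate after it.
function invalidateStaleLateGates(gates: Gates, head: string): Gates {
  const next: Gates = { ...gates };
  const firstStale = LATE_GATES.findIndex(
    (name) => gates[name].ref !== undefined && gates[name].ref !== head,
  );
  if (firstStale === -1) return next;
  for (const name of LATE_GATES.slice(firstStale)) {
    next[name] = { state: "pending", evidence: `reset: recorded ref does not match HEAD ${head}` };
  }
  return next;
}

// Steps 10 and 13: the first gate, in order, that is neither complete nor skipped.
function firstUnmetGate(gates: Gates): GateName | undefined {
  return GATE_ORDER.find(
    (name) => gates[name].state !== "complete" && gates[name].state !== "skipped",
  );
}
```

Resetting from the earliest stale gate onward covers all three bullets of step 8 at once, because a stale `review` ref already implies resetting `verification` and `wrap_up`. A result of `undefined` from `firstUnmetGate` corresponds to the point where the run may transition to `completed`, provided no external blocker such as pending CI remains.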
+ +Late-stage rule: + +- An open PR does not mean the run is done. +- Evaluate `review`, `verification`, and `wrap_up` separately. +- Inspect GitHub CI for the current HEAD. +- If current evidence for local tests, browser validation, or PR artifacts is missing or stale, rerun or re-request the narrowest applicable step. +- Treat `test-browser` and `feature-video` as best-effort by default. If they are inapplicable or the environment is unavailable, mark the gate `skipped`, note the reason briefly, and continue. +- Only treat `verification` as blocking when the user or task explicitly requires interactive/browser validation before the run can be considered done. +- Keep the run `active` while CI is pending or a required late-stage gate is still `pending` or `blocked`. +- Transition to `completed` only when all required gates are `complete` or `skipped` and no required external blocker remains. + +1. If the recomputed first unmet gate is `requirements`, run: + - `/ce:brainstorm [ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: $ARGUMENTS` + + Brainstorm runs in autopilot mode: it assesses whether requirements exploration is needed and either skips (if requirements are already clear) or runs brainstorm with content questions as needed and writes a requirements document. It will not present handoff options or invoke `/ce:plan` -- control returns here. + +2. **Optional:** If the `ralph-loop` skill is available and you are continuing from an early-stage unmet gate, run `/ralph-loop:ralph-loop "finish all slash commands" --completion-promise "DONE"` to iterate autonomously through the remaining steps. Brainstorm ran first because it may need user interaction; everything from here on is autonomous and benefits from ralph's fresh-context iteration. If not available or it fails, continue. + +3. 
If the recomputed first unmet gate is `plan`, run: + - `/ce:plan [ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: $ARGUMENTS` + + If brainstorm collected the feature description because `$ARGUMENTS` was empty, carry that clarified description forward into the `ce:plan` invocation instead of calling it with empty arguments. Treat that clarified description as the resolved planning input for all `ce:plan` attempts in this run. Do not ask the user for the same description twice. + + GATE: Verify that `ce:plan` produced a plan file in `docs/plans/`. If no plan file was created, run `ce:plan` again with the same resolved planning input used for the first `ce:plan` attempt. Do NOT fall back to the original empty `$ARGUMENTS`, and do NOT proceed until a written plan exists. + + After the plan exists, update `artifacts.plan_doc` in the autopilot manifest with that exact plan path. Use that same path for every later `deepen-plan` and `ce:work` invocation in this run. + +4. After the plan exists, evaluate whether `deepen-plan` should run using the written plan at `artifacts.plan_doc`. Do not gate this check on the first unmet gate still being `plan`; `ce:plan` may already have advanced the manifest to `implementation`. + + Run only if the plan is `Standard` or `Deep`, touches a high-risk area (auth, security, payments, migrations, external APIs, significant rollout concerns), or still has obvious confidence gaps in decisions, sequencing, system-wide impact, risks, or verification. + + If those criteria are met, run `/compound-engineering:deepen-plan [ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: `. + + GATE: If deepen-plan ran, confirm the plan was deepened or judged sufficiently grounded. If skipped, briefly note why and proceed. + +5. 
If the recomputed first unmet gate is `implementation`, run: + - `/ce:work [ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: ` + + `ce:work` must honor the active manifest's `implementation_mode` when deciding between standard execution and swarm mode. Do not require a second swarm-specific handoff token here. + + GATE: Verify that implementation work was performed and the manifest now records `gates.implementation` for the current run state. Do NOT proceed if no code changes were made or the implementation gate was left ambiguous. + +6. If the recomputed first unmet gate is `review`, run `/ce:review [ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: current` -- catch issues before they ship + +7. If review created todos, run `/compound-engineering:todo-resolve` before advancing to later gates -- resolve findings, compound on learnings, and clean up completed todos. + + GATE: If todo resolution changed code or behavior, re-verify the final state before proceeding. Run the narrowest checks that cover what changed (for example targeted tests, lint/typecheck, or another browser check for UI-affecting changes). If todo resolution made no functional code changes, briefly note that and continue. + +8. If the recomputed first unmet gate is `verification`, conditionally run `/compound-engineering:test-browser [ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: current` -- best-effort browser validation for work that actually needs interactive testing. Read `compound-engineering.local.md` frontmatter; skip if `autopilot_features.test_browser` is `false`. If the setting is missing, assume enabled. + +9. If verification created todos, run `/compound-engineering:todo-resolve` before advancing -- same resolve/compound/clean-up cycle as step 7. + +10. 
If the recomputed first unmet gate is `wrap_up`, conditionally run `/compound-engineering:feature-video [ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] :: current` -- best-effort walkthrough capture and PR polish. Read `compound-engineering.local.md` frontmatter; skip if `autopilot_features.feature_video` is `false`. If the setting is missing, assume enabled. Also skip if the project has no browser-based UI (e.g., CLI tools, plugins, libraries, APIs). + +11. Output `DONE` only when all required gates are `complete` or `skipped`. If the run is only waiting on external CI, report that explicitly instead of claiming completion. -Start with step 2 now (or step 1 if ralph-loop is available). Remember: plan FIRST, then work. Never skip the plan. +Start now. diff --git a/plugins/compound-engineering/skills/setup/SKILL.md b/plugins/compound-engineering/skills/setup/SKILL.md index 189995f05..baeb5f60e 100644 --- a/plugins/compound-engineering/skills/setup/SKILL.md +++ b/plugins/compound-engineering/skills/setup/SKILL.md @@ -1,6 +1,6 @@ --- name: setup -description: Configure which review agents run for your project. Auto-detects stack and writes compound-engineering.local.md. +description: Configure review agents, implementation mode, and autopilot features for your project. Auto-detects stack and writes compound-engineering.local.md. disable-model-invocation: true --- @@ -10,7 +10,7 @@ disable-model-invocation: true Ask the user each question below using the platform's blocking question tool (e.g., `AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). If no structured question tool is available, present each question as a numbered list and wait for a reply before proceeding. For multiSelect questions, accept comma-separated numbers (e.g. `1, 3`). Never skip or auto-configure. -Interactive setup for `compound-engineering.local.md` — configures which agents run during `ce:review` and `ce:work`. 
+Interactive setup for `compound-engineering.local.md` — configures which agents run during `ce:review` and `ce:work`, which implementation mode `lfg` should prefer by default, and which autopilot features are enabled for end-to-end workflows (`lfg`; `slfg` remains only as a legacy wrapper). ## Step 1: Check Existing Config @@ -27,6 +27,10 @@ Settings file already exists. What would you like to do? If "View current": read and display the file, then stop. If "Cancel": stop. +When reconfiguring an existing file: +- Read and preserve the current `implementation_mode` unless the user explicitly changes it during setup +- If the existing file has no `implementation_mode`, treat the current value as `standard` + ## Step 2: Detect and Ask Auto-detect the project stack: @@ -57,6 +61,9 @@ Detected {type} project. How would you like to configure? - **TypeScript:** `[kieran-typescript-reviewer, code-simplicity-reviewer, security-sentinel, performance-oracle]` - **General:** `[code-simplicity-reviewer, security-sentinel, performance-oracle, architecture-strategist]` +Auto-configure defaults for autopilot features: both `feature_video` and `test_browser` enabled. +Auto-configure defaults for implementation mode: preserve the existing `implementation_mode` when present; otherwise use `standard`. + ### If Customize → Step 3 ## Step 3: Customize (3 questions) @@ -95,6 +102,34 @@ How thorough should reviews be? 3. Comprehensive - All above + git history, data integrity, agent-native checks. ``` +## Step 3b: Implementation Mode + +Ask only when the user chose the Customize path: + +``` +How should lfg handle implementation by default? + +1. Standard (Recommended) - Normal ce:work execution. Can still use ordinary parallel helpers when appropriate. +2. Swarm - Prefer swarm / agent-team style implementation during the implementation gate. +``` + +If an existing file already has `implementation_mode`, show that option as the current value. 
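The preservation rule for `implementation_mode` during reconfiguration can be sketched in TypeScript. This is a hedged illustration, not the setup skill's implementation: both function names are hypothetical, the frontmatter is assumed to be already parsed into a plain object, and treating an unrecognized value as `standard` is an assumption the source does not state.

```typescript
type ImplementationMode = "standard" | "swarm";

// Hypothetical helper: the value to show as "current" in Step 3b.
// A missing (or, by assumption, unrecognized) setting is treated as standard.
function currentImplementationMode(frontmatter: { implementation_mode?: unknown }): ImplementationMode {
  return frontmatter.implementation_mode === "swarm" ? "swarm" : "standard";
}

// Hypothetical helper: the value to write back to compound-engineering.local.md.
// An explicit choice made during setup wins; otherwise the existing value is preserved.
function nextImplementationMode(
  frontmatter: { implementation_mode?: unknown },
  chosen?: ImplementationMode,
): ImplementationMode {
  return chosen ?? currentImplementationMode(frontmatter);
}
```

On the Auto-configure path no choice is made, so `chosen` stays undefined and the existing value (or `standard`) is written back unchanged.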
+ +## Step 3c: Autopilot Features + +Ask only when the user chose the Customize path (auto-configure silently defaults both to enabled): + +``` +When you run end-to-end workflows (lfg/slfg), the system can +automatically record feature videos and run browser tests. +Which of these would you like included? (comma-separated, e.g. 1, 2) + +1. Feature video - Record a walkthrough and add it to PRs +2. Browser testing - Verify changes work in a real browser +``` + +Both are selected by default. If the user deselects both, set both to `false`. If the user selects only one, set the other to `false`. + ## Step 4: Build Agent List and Write File **Stack-specific agents:** @@ -122,6 +157,10 @@ Write `compound-engineering.local.md`: --- review_agents: [{computed agent list}] plan_review_agents: [{computed plan agent list}] +implementation_mode: {existing value or chosen value, default standard} +autopilot_features: + feature_video: {true or false} + test_browser: {true or false} --- # Review Context @@ -140,10 +179,12 @@ Examples: ``` Saved to compound-engineering.local.md -Stack: {type} -Review depth: {depth} -Agents: {count} configured - {agent list, one per line} +Stack: {type} +Review depth: {depth} +Agents: {count} configured + {agent list, one per line} +Autopilot: feature video {on/off}, browser testing {on/off} +Implementation: {standard or swarm} Tip: Edit the "Review Context" section to add project-specific instructions. Re-run this setup anytime to reconfigure. diff --git a/plugins/compound-engineering/skills/slfg/SKILL.md b/plugins/compound-engineering/skills/slfg/SKILL.md index 453727a4f..7d2481d63 100644 --- a/plugins/compound-engineering/skills/slfg/SKILL.md +++ b/plugins/compound-engineering/skills/slfg/SKILL.md @@ -1,39 +1,19 @@ --- name: slfg -description: Full autonomous engineering workflow using swarm mode for parallel execution +description: "[DEPRECATED] Compatibility wrapper that routes to lfg with swarm mode enabled." 
argument-hint: "[feature description]" disable-model-invocation: true --- -Swarm-enabled LFG. Run these steps in order, parallelizing where indicated. Do not stop between steps — complete every step through to the end. +`slfg` is deprecated. -## Sequential Phase +Do not maintain a separate orchestration contract here. -1. **Optional:** If the `ralph-loop` skill is available, run `/ralph-loop:ralph-loop "finish all slash commands" --completion-promise "DONE"`. If not available or it fails, skip and continue to step 2 immediately. -2. `/ce:plan $ARGUMENTS` -3. **Conditionally** run `/compound-engineering:deepen-plan` - - Run the `deepen-plan` workflow only if the plan is `Standard` or `Deep`, touches a high-risk area (auth, security, payments, migrations, external APIs, significant rollout concerns), or still has obvious confidence gaps in decisions, sequencing, system-wide impact, risks, or verification - - If you run the `deepen-plan` workflow, confirm the plan was deepened or explicitly judged sufficiently grounded before moving on - - If you skip it, note why and continue to step 4 -4. `/ce:work` — **Use swarm mode**: Make a Task list and launch an army of agent swarm subagents to build the plan +Behavior: -## Parallel Phase +1. Announce briefly that `slfg` is deprecated and that `lfg` now owns the autopilot contract. +2. Preserve the user's feature description content unchanged. +3. Immediately route to `lfg` with an explicit swarm request in the forwarded input so `lfg` will still choose swarm even when `compound-engineering.local.md` is missing or set to `standard`. Preserve the original feature description after that swarm request. +4. Do not duplicate routing logic, manifest logic, or downstream skill-calling rules here. `lfg` is the source of truth. -After work completes, launch steps 5 and 6 as **parallel swarm agents** (both only need code to be written): - -5. `/ce:review mode:report-only` — spawn as background Task agent -6. 
`/compound-engineering:test-browser` — spawn as background Task agent - -Wait for both to complete before continuing. - -## Autofix Phase - -7. `/ce:review mode:autofix` — run sequentially after the parallel phase so it can safely mutate the checkout, apply `safe_auto` fixes, and emit residual todos for step 8 - -## Finalize Phase - -8. `/compound-engineering:todo-resolve` — resolve findings, compound on learnings, clean up completed todos -9. `/compound-engineering:feature-video` — record the final walkthrough and add to PR -10. Output `DONE` when video is in PR - -Start with step 1 now. +When users ask for swarm explicitly, prefer `/lfg ...` with swarm mode going forward. diff --git a/plugins/compound-engineering/skills/test-browser/SKILL.md b/plugins/compound-engineering/skills/test-browser/SKILL.md index a1d0675ba..90c6547ac 100644 --- a/plugins/compound-engineering/skills/test-browser/SKILL.md +++ b/plugins/compound-engineering/skills/test-browser/SKILL.md @@ -24,6 +24,32 @@ Platform-specific hints: - `agent-browser` CLI installed (see Setup below) - Git repository with changes to test +## Autopilot Mode + +Autopilot is active only when the input begins with: + +- `[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::` + +When that marker is present: +- Strip the marker before processing the normal argument +- Read the manifest path from the marker +- Validate that the manifest describes an active autopilot run +- Treat this skill as an autopilot contract consumer, not a substantive decision owner +- Treat `verification` as best-effort by default unless the caller or task explicitly requires interactive/browser validation before the run can be considered done + +Then prefer progress over interaction and use safe defaults. + +Specific behavior: + +- Default to headless mode. Do not ask whether to watch the browser. 
+- If `agent-browser` cannot be installed or the dev server is not running, set `gates.verification.state = skipped` with a brief reason unless interactive/browser validation was explicitly required; in the required case, mark it `blocked`. Then return control to the caller. +- If a flow requires human verification (OAuth, email, payments, SMS, external service confirmation), mark it as manual verification required. Set `gates.verification.state = skipped` with a brief reason unless interactive/browser validation was explicitly required; in the required case, mark it `blocked`. Continue testing other routes when possible. +- If a page test fails, capture the failure, create a todo file in `.context/compound-engineering/todos/` for follow-up, note it briefly, and continue testing the remaining routes. +- Briefly inform the user when a material verification step was skipped or degraded. Do not block on the prompt. +- Do not append substantive product or implementation decisions to the autopilot decision log. This skill only emits operational notes and todos. +- When browser verification succeeds in autopilot mode, set `gates.verification.state = complete`, record brief evidence, and set `gates.verification.ref = current HEAD`. +- When browser verification finds issues that were externalized as todos or otherwise needs a rerun, leave `gates.verification.state = pending` with brief evidence so `lfg` can revisit it after follow-up work. + ## Setup ```bash @@ -48,11 +74,16 @@ Before starting, verify `agent-browser` is available: command -v agent-browser >/dev/null 2>&1 && echo "Ready" || (echo "Installing..." && npm install -g agent-browser && agent-browser install) ``` -If installation fails, inform the user and stop. +If installation fails: -### 2. Ask Browser Mode +- In autopilot mode, set `gates.verification.state = skipped` with a brief reason unless interactive/browser validation was explicitly required; in the required case, mark it `blocked`. 
Then return control to the caller. +- Otherwise, inform the user and stop. -Ask the user whether to run headed or headless (using the platform's question tool — e.g., `AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini — or present options and wait for a reply): +### 2. Choose Browser Mode + +In autopilot mode, default to headless and note that choice briefly. + +Otherwise, ask the user whether to run headed or headless (using the platform's question tool — e.g., `AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini — or present options and wait for a reply): ``` Do you want to watch the browser tests run? @@ -133,7 +164,10 @@ agent-browser open http://localhost:${PORT} agent-browser snapshot -i ``` -If the server is not running, inform the user: +If the server is not running: + +- In autopilot mode, set `gates.verification.state = skipped` with a brief reason unless interactive/browser validation was explicitly required; in the required case, mark it `blocked`. Then return control to the caller. +- Otherwise, inform the user: ``` Server not running on port ${PORT} @@ -193,7 +227,9 @@ Pause for human input when testing touches flows that require external interacti | SMS | "Verify you received the SMS code" | | External APIs | "Confirm the [service] integration is working" | -Ask the user (using the platform's question tool, or present numbered options and wait): +In autopilot mode, mark the route as requiring manual verification. Set `gates.verification.state = skipped` with a brief reason unless interactive/browser validation was explicitly required; in the required case, mark it `blocked`. Note it briefly and continue. + +Otherwise, ask the user (using the platform's question tool, or present numbered options and wait): ``` Human Verification Needed @@ -215,7 +251,10 @@ When a test fails: - Screenshot the error state: `agent-browser screenshot error.png` - Note the exact reproduction steps -2. 
**Ask the user how to proceed:** +2. **Decide how to proceed:** + + - In autopilot mode, use `todo-create` when available to create a pending todo for the failure immediately, note it briefly, and continue testing the remaining routes. If `todo-create` is unavailable, create the pending todo file directly in `.context/compound-engineering/todos/`. Do not leave the failure only in the summary or in an ephemeral finding format. + - Otherwise, ask the user how to proceed: ``` Test Failed: [route] @@ -230,9 +269,11 @@ When a test fails: ``` 3. **If "Fix now":** investigate, propose a fix, apply, re-run the failing test -4. **If "Create todo":** load the `todo-create` skill and create a todo with priority p1 and description `browser-test-{description}`, continue +4. **If "Create todo":** use `todo-create` when available to create a pending p1 todo with description `browser-test-{description}`. Otherwise create `{id}-pending-p1-browser-test-{description}.md` directly in `.context/compound-engineering/todos/`, continue 5. **If "Skip":** log as skipped, continue +For autopilot failures, use `todo-create` when available. Otherwise create the todo directly in `.context/compound-engineering/todos/` using the standard naming and structure. Leave `gates.verification.state = pending` with brief evidence that follow-up work is required before verification can be considered complete. + ### 10. Test Summary After all tests complete, present a summary: diff --git a/plugins/compound-engineering/skills/todo-create/SKILL.md b/plugins/compound-engineering/skills/todo-create/SKILL.md index ec7fc7110..4dee46f36 100644 --- a/plugins/compound-engineering/skills/todo-create/SKILL.md +++ b/plugins/compound-engineering/skills/todo-create/SKILL.md @@ -94,7 +94,7 @@ To check blockers: search for `{dep_id}-complete-*.md` in both paths. 
Missing ma | Trigger | Flow | |---------|------| | Code review | `/ce:review` -> Findings -> `/todo-triage` -> Todos | -| Autonomous review | `/ce:review mode:autofix` -> Residual todos -> `/todo-resolve` | +| Autofix review | `/ce:review mode:autofix` -> Residual todos -> `/todo-resolve` | | Code TODOs | `/todo-resolve` -> Fixes + Complex todos | | Planning | Brainstorm -> Create todo -> Work -> Complete | diff --git a/tests/autopilot-skill-contract.test.ts b/tests/autopilot-skill-contract.test.ts new file mode 100644 index 000000000..66974dc0c --- /dev/null +++ b/tests/autopilot-skill-contract.test.ts @@ -0,0 +1,163 @@ +import { readFile } from "fs/promises" +import path from "path" +import { describe, expect, test } from "bun:test" + +async function readRepoFile(relativePath: string): Promise<string> { + return readFile(path.join(process.cwd(), relativePath), "utf8") +} + +describe("autopilot skill contract", () => { + test("lfg defines the marker, manifest, gates, and deprecation path", async () => { + const lfg = await readRepoFile("plugins/compound-engineering/skills/lfg/SKILL.md") + const slfg = await readRepoFile("plugins/compound-engineering/skills/slfg/SKILL.md") + + expect(lfg).toContain("[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::") + expect(lfg).toContain(".context/compound-engineering/autopilot//session.json") + expect(lfg).toContain("`requirements`") + expect(lfg).toContain("`plan`") + expect(lfg).toContain("`implementation`") + expect(lfg).toContain("`review`") + expect(lfg).toContain("`verification`") + expect(lfg).toContain("`wrap_up`") + expect(lfg).toContain("`status` = `active | completed | aborted`") + expect(lfg).toContain("`schema_version`") + expect(lfg).toContain("`route` = `direct | lightweight | full`") + expect(lfg).toContain("`implementation_mode` = `standard | swarm`") + expect(lfg).toContain("`state` = `complete | skipped | pending | blocked | unknown`") + expect(lfg).toContain("`current_gate` (optional,
advisory only; recompute from gate state on resume)") + expect(lfg).toContain("`artifacts.requirements_doc | artifacts.plan_doc`") + expect(lfg).toContain("`ref` only for `review`, `verification`, and `wrap_up`") + expect(lfg).toContain("if the setting is missing, assume `standard`") + expect(lfg).toContain("An open PR does not mean the run is done.") + expect(lfg).toContain("If `$ARGUMENTS` is empty, do not assess complexity yet. Route to **Full pipeline** so it can first check whether there is resumable work") + expect(lfg).toContain("If one does not exist, reconstruct one conservatively using this algorithm:") + expect(lfg).toContain("Never infer those gates complete from an open PR alone") + expect(lfg).toContain("if `gates.review.ref` exists and does not match the current HEAD") + expect(lfg).toContain("if `gates.verification.ref` exists and does not match the current HEAD") + expect(lfg).toContain("if `gates.wrap_up.ref` exists and does not match the current HEAD") + expect(lfg).toContain("After each downstream skill returns, update the manifest evidence, recompute the first unmet gate") + expect(lfg).toContain("update `artifacts.plan_doc` in the autopilot manifest") + expect(lfg).toContain("") + expect(lfg).toContain("Do not gate this check on the first unmet gate still being `plan`") + expect(lfg).toContain("Direct and lightweight routes still create a lightweight manifest immediately.") + expect(lfg).toContain("Treat `test-browser` and `feature-video` as best-effort by default.") + expect(lfg).toContain("Output `DONE` only when all required gates are `complete` or `skipped`.") + expect(lfg).toContain("If the run is only waiting on external CI, report that explicitly instead of claiming completion.") + expect(lfg).not.toContain("`mode` = `autopilot`") + expect(lfg).not.toContain("`artifacts.requirements_doc | artifacts.plan_doc | artifacts.decision_log`") + + expect(slfg).toContain("[DEPRECATED] Compatibility wrapper") + expect(slfg).toContain("Immediately 
route to `lfg`") + expect(slfg).toContain("explicit swarm request in the forwarded input") + expect(slfg).toContain("Do not duplicate routing logic, manifest logic, or downstream skill-calling rules here.") + }) + + test("decision-owner skills declare marker parsing and role ordering", async () => { + const brainstorm = await readRepoFile("plugins/compound-engineering/skills/ce-brainstorm/SKILL.md") + const plan = await readRepoFile("plugins/compound-engineering/skills/ce-plan/SKILL.md") + const deepenPlan = await readRepoFile("plugins/compound-engineering/skills/deepen-plan/SKILL.md") + const work = await readRepoFile("plugins/compound-engineering/skills/ce-work/SKILL.md") + const workBeta = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + for (const content of [brainstorm, plan, deepenPlan, work]) { + expect(content).toContain("[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::") + expect(content).toContain("Validate that the manifest describes an active autopilot run") + } + + expect(workBeta).toContain("[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::") + expect(workBeta).toContain("Validate that the manifest describes an active autopilot run") + + expect(brainstorm).toContain("`Product Manager > Designer > Engineer`") + expect(brainstorm).toContain("**May decide automatically**") + expect(brainstorm).toContain("**Must ask**") + expect(brainstorm).toContain("**Must log**") + expect(brainstorm).toContain("set `gates.requirements.state = complete`") + + expect(plan).toContain("`Engineer > Product Manager > Designer`") + expect(plan).toContain("Dominant decision criteria:") + expect(plan).toContain("update the manifest's `artifacts.plan_doc`") + expect(plan).toContain("prefer the document whose topic and problem frame most closely match the current feature description") + expect(plan).not.toContain("most recent matching document automatically") + 
expect(plan).not.toContain("Pipeline Mode") + + expect(deepenPlan).toContain("`Engineer > Product Manager > Designer`") + expect(deepenPlan).toContain("keep the canonical decision rows in the run-scoped `decisions.md`") + expect(deepenPlan).toContain("fall back to the manifest's `artifacts.plan_doc` when present") + + expect(work).toContain("`Engineer > Designer > Product Manager`") + expect(work).toContain("`Local Leverage`") + expect(work).toContain("execution discoveries") + expect(work).toContain("Treat `manifest.implementation_mode=swarm` as the explicit swarm opt-in") + expect(work).toContain("active autopilot manifest sets `implementation_mode=swarm`") + expect(work).toContain("owner of `gates.implementation`") + expect(work).toContain("reviewable checkpoint") + expect(work).toContain("update `gates.implementation.evidence`") + expect(work).toContain('if [ -n "$(git status --porcelain)" ]; then') + expect(work).toContain("Only pull before branching when the worktree is clean") + + expect(workBeta).toContain("Treat `manifest.implementation_mode=swarm` as the explicit swarm opt-in") + expect(workBeta).toContain("active autopilot manifest sets `implementation_mode=swarm`") + expect(workBeta).toContain('if [ -n "$(git status --porcelain)" ]; then') + expect(workBeta).toContain("Only pull before branching when the worktree is clean") + }) + + test("utility skills and review utility use the shared autopilot contract", async () => { + const review = await readRepoFile("plugins/compound-engineering/skills/ce-review/SKILL.md") + const setup = await readRepoFile("plugins/compound-engineering/skills/setup/SKILL.md") + const documentReview = await readRepoFile("plugins/compound-engineering/skills/document-review/SKILL.md") + const schema = await readRepoFile( + "plugins/compound-engineering/skills/document-review/references/findings-schema.json", + ) + const browser = await readRepoFile("plugins/compound-engineering/skills/test-browser/SKILL.md") + const featureVideo = 
await readRepoFile("plugins/compound-engineering/skills/feature-video/SKILL.md") + const agents = await readRepoFile("plugins/compound-engineering/AGENTS.md") + const readme = await readRepoFile("plugins/compound-engineering/README.md") + + expect(review).toContain("[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::") + expect(review).toContain("Default to `mode:autofix` when no explicit mode token remains") + expect(review).toContain("owner of `gates.review`") + expect(review).toContain("`gates.review.state = complete`") + expect(review).toContain("`gates.review.ref = current HEAD`") + + expect(setup).toContain("preserve the current `implementation_mode` unless the user explicitly changes it during setup") + expect(setup).toContain("implementation_mode: {existing value or chosen value, default standard}") + expect(setup).toContain("How should lfg handle implementation by default?") + + expect(documentReview).toContain("review utility, not a primary decision-maker") + expect(documentReview).toContain("`mechanical-fix`") + expect(documentReview).toContain("`bounded-decision`") + expect(documentReview).toContain("`must-ask`") + expect(documentReview).toContain("`note`") + + expect(schema).toContain('"finding_class"') + expect(schema).toContain('"mechanical-fix"') + expect(schema).toContain('"bounded-decision"') + expect(schema).toContain('"must-ask"') + expect(schema).toContain('"note"') + + expect(browser).toContain("Treat this skill as an autopilot contract consumer, not a substantive decision owner") + expect(browser).toContain("Do not append substantive product or implementation decisions to the autopilot decision log") + expect(browser).toContain("Treat `verification` as best-effort by default") + expect(browser).toContain("`gates.verification.state = skipped`") + expect(browser).toContain("`gates.verification.state = complete`") + expect(browser).toContain("`gates.verification.ref = current HEAD`") + 
expect(browser).toContain("`gates.verification.state = pending`") + + expect(featureVideo).toContain("Treat this skill as an autopilot contract consumer, not a substantive decision owner") + expect(featureVideo).toContain("Do not append substantive product or implementation decisions to the autopilot decision log") + expect(featureVideo).toContain("Treat `wrap_up` as best-effort by default") + expect(featureVideo).toContain("`gates.wrap_up.state = skipped`") + expect(featureVideo).toContain("`gates.wrap_up.state = complete`") + expect(featureVideo).toContain("`gates.wrap_up.ref = current HEAD`") + + expect(agents).toContain("`lfg` is the only top-level autopilot entrypoint") + expect(agents).toContain("[ce-autopilot manifest=.context/compound-engineering/autopilot//session.json] ::") + expect(agents).toContain("execution skills must honor the active manifest's `implementation_mode` during autopilot") + expect(agents).toContain("direct and lightweight routes still create lightweight manifests") + expect(agents).toContain("late-stage gate completions may carry a HEAD `ref`") + + expect(readme).toContain("Deprecated compatibility wrapper that routes to `/lfg` with swarm mode enabled") + expect(readme).toContain("resumes from the first unmet workflow gate") + expect(readme).toContain("Browser testing and feature-video remain best-effort by default") + }) +}) diff --git a/tests/review-skill-contract.test.ts b/tests/review-skill-contract.test.ts index efddd7a99..dc587ba1e 100644 --- a/tests/review-skill-contract.test.ts +++ b/tests/review-skill-contract.test.ts @@ -107,12 +107,42 @@ describe("ce-review contract", () => { test("orchestration callers pass explicit mode flags", async () => { const lfg = await readRepoFile("plugins/compound-engineering/skills/lfg/SKILL.md") - expect(lfg).toContain("/ce:review mode:autofix") + // lfg owns the full pipeline; review step uses mode:autofix + expect(lfg).toContain("/ce:review") + }) + + test("document-review uses phase-1 finding 
classes", async () => { + const rawSchema = await readRepoFile( + "plugins/compound-engineering/skills/document-review/references/findings-schema.json", + ) + const schema = JSON.parse(rawSchema) as { + properties: { + findings: { + items: { + properties: { + finding_class: { enum: string[] } + } + required: string[] + } + } + } + } + + expect(schema.properties.findings.items.required).toEqual( + expect.arrayContaining(["finding_class"]), + ) + expect(schema.properties.findings.items.properties.finding_class.enum).toEqual([ + "mechanical-fix", + "bounded-decision", + "must-ask", + "note", + ]) - const slfg = await readRepoFile("plugins/compound-engineering/skills/slfg/SKILL.md") - // slfg uses report-only for the parallel phase (safe with browser testing) - // then autofix sequentially after to emit fixes and todos - expect(slfg).toContain("/ce:review mode:report-only") - expect(slfg).toContain("/ce:review mode:autofix") + const documentReview = await readRepoFile("plugins/compound-engineering/skills/document-review/SKILL.md") + expect(documentReview).toContain("review utility, not a primary decision-maker") + expect(documentReview).toContain("mechanical-fix") + expect(documentReview).toContain("bounded-decision") + expect(documentReview).toContain("must-ask") + expect(documentReview).toContain("note") }) })
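The contract strings asserted in `tests/autopilot-skill-contract.test.ts` describe a resume rule: recompute the first unmet gate from gate state, and treat a `ref` on `review`, `verification`, or `wrap_up` as significant when it does not match the current HEAD. The following is a minimal sketch of how that rule might look in TypeScript. The `Gate` and `Gates` types, the `firstUnmetGate` helper, and the choice to treat a mismatched `ref` as "unmet" are all assumptions made for illustration; the actual `session.json` schema and the behavior `lfg` specifies for a stale `ref` are not shown in this diff.

```typescript
// Hypothetical shapes: the gate names and states come from the strings asserted
// in the tests, but the real manifest schema is not part of this diff.
type GateName = "requirements" | "plan" | "implementation" | "review" | "verification" | "wrap_up"
type GateState = "complete" | "skipped" | "pending" | "blocked" | "unknown"

interface Gate {
  state: GateState
  ref?: string // per the contract, only used for review, verification, and wrap_up
}

type Gates = Record<GateName, Gate>

const GATE_ORDER: GateName[] = ["requirements", "plan", "implementation", "review", "verification", "wrap_up"]
const REF_GATES = new Set<GateName>(["review", "verification", "wrap_up"])

// Assumption: a gate is met when it is complete or skipped, and a late-stage
// gate whose recorded ref differs from the current HEAD is treated as unmet.
function isGateMet(name: GateName, gate: Gate, head: string): boolean {
  const settled = gate.state === "complete" || gate.state === "skipped"
  if (!settled) return false
  if (REF_GATES.has(name) && gate.ref !== undefined && gate.ref !== head) return false
  return true
}

// Returns the first gate that still needs work, or null when every gate is met.
function firstUnmetGate(gates: Gates, head: string): GateName | null {
  for (const name of GATE_ORDER) {
    if (!isGateMet(name, gates[name], head)) return name
  }
  return null
}

const example: Gates = {
  requirements: { state: "complete" },
  plan: { state: "complete" },
  implementation: { state: "complete" },
  review: { state: "complete", ref: "abc123" },
  verification: { state: "pending" },
  wrap_up: { state: "pending" },
}

console.log(firstUnmetGate(example, "abc123")) // verification
console.log(firstUnmetGate(example, "def456")) // review (recorded ref no longer matches HEAD)
```

Recomputing from gate state on every resume, rather than trusting a stored `current_gate`, matches the "advisory only" wording in the asserted contract; the sketch simply makes that ordering explicit.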