diff --git a/.gitignore b/.gitignore index 7783391a..8c1bab36 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,4 @@ node_modules/ todos/ .worktrees .context/ +.claude/worktrees/ diff --git a/docs/brainstorms/2026-03-31-codex-delegation-requirements.md b/docs/brainstorms/2026-03-31-codex-delegation-requirements.md new file mode 100644 index 00000000..d76620b6 --- /dev/null +++ b/docs/brainstorms/2026-03-31-codex-delegation-requirements.md @@ -0,0 +1,236 @@ +--- +date: 2026-03-31 +topic: codex-delegation +--- + +# Codex Delegation Mode for ce:work + +## Problem Frame + +Users running ce:work from Claude Code (or other non-Codex agents) may want to delegate the actual code-writing to Codex. Two motivations: (1) Codex may produce better code for certain tasks, and (2) delegating token-heavy implementation work to Codex conserves tokens on the user's current model. + +PR #364 attempted this via a separate `ce-work-beta` skill with prose-based delegation instructions. The agent improvises CLI syntax each run, producing non-deterministic results confirmed as flaky in the PR author's own testing. The root cause: describing Codex CLI invocation in prose lets the agent guess differently every time. + +ce-work-beta does have a structured 7-step External Delegate Mode (environment guards, availability checks, prompt file writing, circuit breaker), but the CLI invocation step itself is prose-based, causing the non-determinism. This feature ports the useful structural elements (guards, circuit breaker pattern) while replacing prose invocations with concrete bash templates. + +> **Implementation note (2026-03-31):** The final rollout was redirected to `ce:work-beta` so stable `ce:work` remains unchanged during beta. `ce:work-beta` must be invoked manually; `ce:plan` and workflow handoffs stay on stable `ce:work` until promotion. 
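The environment guard and availability check named in the Problem Frame (specified later as R6 and R8) might be sketched as below. This is only an illustrative sketch: the function name `resolve_delegation_mode`, its stdout/stderr contract, and the message wording are hypothetical and do not come from the PR. It also assumes an empty variable counts as unset.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: decides whether delegation may run.
# Prints "delegate" or "standard" on stdout and a short reason on stderr.
resolve_delegation_mode() {
  # Environment guard: either variable being non-empty means we are already
  # running inside Codex, so delegation is disabled.
  if [ -n "${CODEX_SANDBOX:-}" ] || [ -n "${CODEX_SESSION_ID:-}" ]; then
    echo "inside Codex sandbox, using standard mode" >&2
    echo "standard"
    return 0
  fi

  # Availability check: the Codex CLI must be on PATH.
  if ! command -v codex >/dev/null 2>&1; then
    echo "Codex CLI not found, using standard mode" >&2
    echo "standard"
    return 0
  fi

  echo "delegate"
}
```

A caller could capture the decision with `mode="$(resolve_delegation_mode)"` and branch on it. Running both checks in one fixed function, rather than describing them in prose, matches the document's stated goal of deterministic invocations.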
+ +## Delegation Flow + +``` +/ce:work delegate:codex ~/plan.md + │ + ▼ +┌──────────────────────────┐ +│ Parse arguments │ +│ - Extract delegate flag │ +│ - Require plan file │ +│ - Check local.md default │ +│ - Resolution chain: │ +│ flag > local.md > off │ +└────────┬─────────────────┘ + │ + ▼ +┌──────────────────────────┐ ┌───────────────────────┐ +│ Environment guard │────>│ Notify if explicit, │ +│ $CODEX_SANDBOX set? │ yes │ use standard mode │ +│ $CODEX_SESSION_ID set? │ └───────────────────────┘ +└────────┬─────────────────┘ + │ no + ▼ +┌──────────────────────────┐ ┌───────────────────────┐ +│ Availability check │────>│ Fall back to │ +│ command -v codex │ no │ standard mode + notify│ +└────────┬─────────────────┘ └───────────────────────┘ + │ yes + ▼ +┌──────────────────────────┐ ┌───────────────────────┐ +│ Consent + mode selection │────>│ Ask: disable │ +│ work_delegate_consent set? │ no │ delegation? │ +│ Show warning + sandbox │ │ Set local.md │ +│ mode choice (yolo/full- │ └───────────────────────┘ +│ auto). Recommend yolo. │ +│ (headless: require prior) │ +└────────┬─────────────────┘ + │ accepted + ▼ +┌──────────────────────────┐ +│ Per-unit execution loop │ +│ (SERIAL, not parallel) │ +│ For each implementation │ +│ unit in the plan: │ +│ │ +│ 1. Check unit eligibility │ +│ (out-of-repo? trivial?)│ +│ -> local if ineligible │ +│ 2. Named stash snapshot │ +│ 3. Write prompt + schema │ +│ to .context/compound- │ +│ engineering/codex- │ +│ delegation/ │ +│ 4. codex exec w/ flags │ +│ 5. Classify result: │ +│ CLI fail | task fail | │ +│ verify fail | success │ +│ 6. Pass: commit, drop │ +│ stash, clean scratch │ +│ Fail: rollback, │ +│ increment ctr │ +│ 7. If 3 consecutive │ +│ failures: fall back │ +│ to standard mode │ +└──────────────────────────┘ +``` + +## Requirements + +**Activation and Configuration** + +- R1. Codex delegation is an optional mode within ce:work, not a separate skill. 
ce-work-beta is superseded: its delegation logic is replaced by this feature; its non-delegation features (e.g., Frontend Design Guidance) should be ported to ce:work as a separate concern if valuable. Disposition of ce-work-beta (delete vs. retain without delegation) is a planning decision, not a product decision. +- R2. Delegation is triggered via a resolution chain: (1) per-invocation argument wins, (2) `work_delegate` setting in `.claude/compound-engineering.local.md` is fallback, (3) hard default is `false` (off). +- R3. Canonical activation argument is `delegate:codex`. The skill also recognizes fuzzy variants: `codex mode`, `codex`, `delegate codex`, and similar intent expressions. Agent intent recognition handles the fuzzy matching — the set does not need to be exhaustively enumerated. +- R4. Canonical deactivation argument is `delegate:local`. Also recognizes fuzzy variants like `no codex`, `local mode`, `standard mode`. +- R5. Delegation only applies to structured plan execution. Ad-hoc prompts without a plan file always use standard mode regardless of the delegation setting. When delegation mode is active for a plan, each implementation unit is delegated to Codex by default. The agent may execute a unit locally in standard mode when: (a) the unit explicitly requires modifications outside the repository root, or (b) the unit is trivially small (single-file config change, simple substitution) where delegation overhead exceeds the work. The agent states which mode it's using for each unit before execution. + +**Environment Safety** + +- R6. When running inside a Codex sandbox (detected by `$CODEX_SANDBOX` or `$CODEX_SESSION_ID` environment variables), delegation is disabled and ce:work proceeds in standard mode. If the user explicitly requested delegation (via argument), emit a brief notification: "Already inside Codex sandbox — using standard mode." If delegation was only enabled via local.md default, proceed silently. +- R7. 
All delegation logic lives in the skill itself. Converters do not modify skill behavior for cross-platform compatibility — the environment guard handles platform detection at runtime. + +**Availability and Fallback** + +- R8. Before delegation, check `command -v codex`. If the Codex CLI is not on PATH, fall back to standard mode with a brief notification: "Codex CLI not found — using standard mode." +- R9. No minimum version check for now. If a future CLI change breaks delegation, the invocation fails loudly and the fix is a single bash line update. + +**Consent and Mode Selection** + +- R10. First time delegation activates in a project, show a one-time consent flow that: (1) explains what delegation does and the security implications, (2) presents the sandbox mode choice with a recommendation, and (3) records the user's decisions. The sandbox modes are: + - **yolo** (recommended): Maps to `--yolo` (`--dangerously-bypass-approvals-and-sandbox`). Full system access including network. Required for verification steps that run tests or install dependencies. Explain why this is recommended. + - **full-auto**: Maps to `--full-auto`. Workspace-write sandbox, no network access. Tests/installs that need network will fail. Suitable for pure code-writing tasks without verification dependencies. +- R11. On user acceptance, store `work_delegate_consent: true` and `work_delegate_sandbox: yolo` (or `full-auto`) in `.claude/compound-engineering.local.md`. Do not show the consent flow again for this project. +- R12. On user decline, ask whether to disable codex delegation entirely. If yes, set `work_delegate: false` in local.md and proceed in standard mode. +- R13. In headless mode, delegation proceeds only if `work_delegate_consent` is already `true` in local.md. If not set or `false`, fall back to standard mode silently. Headless runs never prompt for consent and never silently escalate to unsandboxed mode without prior interactive consent. + +**Execution Mechanism** + +- R14. 
Delegation uses concrete bash commands, not prose instructions. The exact invocation template:
+
+  ```bash
+  # Read sandbox mode from settings (default: yolo)
+  if [ "$CODEX_SANDBOX_MODE" = "full-auto" ]; then
+    SANDBOX_FLAG="--full-auto"
+  else
+    SANDBOX_FLAG="--yolo"
+  fi
+
+  codex exec \
+    "$SANDBOX_FLAG" \
+    --output-schema .context/compound-engineering/codex-delegation/result-schema.json \
+    -o .context/compound-engineering/codex-delegation/result-.json \
+    - < .context/compound-engineering/codex-delegation/prompt-.md
+  ```
+
+  The agent executes this verbatim, with no improvisation of CLI syntax.
+
+- R15. Sandbox posture defaults to `yolo` (`--yolo`, shorthand for `--dangerously-bypass-approvals-and-sandbox`) but the user may choose `full-auto` during the consent flow (R10). The choice is stored in `work_delegate_sandbox` in local.md. `yolo` is recommended because `--full-auto` blocks network access, which is required for verification steps (running tests, installing dependencies). If `full-auto` is chosen and causes repeated verification failures, the circuit breaker (R18) handles fallback.
+
+- R16. When delegation mode is active, ALL units execute serially, both delegated and locally-executed units. Git stash is a global stack; mixing parallel and serial execution on the same working tree causes stash entanglement. This means delegation mode and swarm mode (Agent Teams) are mutually exclusive. Before each delegated unit, the loop assumes a clean working tree (enforced by ce:work's Phase 1 setup and by mandatory commits after each successful unit). Snapshot the working tree via named stash: `git stash push --include-untracked -m "ce-codex-"`. On failure, rollback via `git checkout -- . && git clean -fd && git stash drop "$(git stash list | grep 'ce-codex-' | head -1 | cut -d: -f1)"`. On success, commit the changes, then drop the named stash.
+
+- R17. 
The structured prompt template is written to a file at `.context/compound-engineering/codex-delegation/prompt-.md` and redirected into `codex exec` on stdin (the `- <` form in R14) rather than passed inline on the command line, to avoid ARG_MAX limits for large CURRENT PATTERNS sections. The template includes: TASK (goal from implementation unit), FILES TO MODIFY (file list), CURRENT PATTERNS (relevant code context), APPROACH (from implementation unit), CONSTRAINTS (no git commit, restrict modifications to files within the repository root, scoped changes, line limit, mandatory result reporting), and VERIFY (test/lint commands). Prompt files are cleaned up after each successful unit.
+
+- R18. A consecutive failure counter tracks delegation failures. After 3 consecutive failures, the skill falls back to standard mode for remaining units with a notification.
+
+- R19. Failure classification uses a multi-signal approach. `codex exec` returns exit code 0 even when the task fails; the exit code only reflects CLI infrastructure, not task success.
+
+  | Category | Signal | Action |
+  |---|---|---|
+  | **CLI failure** | Exit code != 0 | Hard failure; fall back to standard mode |
+  | **Result absent** | Exit code 0, result JSON missing or malformed | Count as task failure |
+  | **Task failure** | Exit code 0, result schema `status: "failed"` | Count toward circuit breaker, rollback |
+  | **Task partial** | Exit code 0, result schema `status: "partial"` | Keep changes, report gaps to main agent |
+  | **Verify failure** | Exit code 0, `status: "completed"`, VERIFY fails | Count toward circuit breaker, rollback |
+  | **Success** | Exit code 0, `status: "completed"`, VERIFY passes | Commit, drop stash, continue |
+
+- R20. A result schema file is written alongside the prompt file. Codex is instructed via `--output-schema` to produce structured JSON conforming to this schema. The `-o` flag writes the result to `result-.json`. 
The schema:
+
+  ```json
+  {
+    "type": "object",
+    "properties": {
+      "status": { "enum": ["completed", "partial", "failed"] },
+      "files_modified": { "type": "array", "items": { "type": "string" } },
+      "issues": { "type": "array", "items": { "type": "string" } },
+      "summary": { "type": "string" }
+    },
+    "required": ["status", "files_modified", "issues", "summary"],
+    "additionalProperties": false
+  }
+  ```
+
+  The prompt CONSTRAINTS section includes mandatory result reporting instructions telling Codex it MUST fill in the schema honestly: `status: "completed"` only if all changes were made, `"partial"` if incomplete, `"failed"` if no meaningful progress. Known limitation: `--output-schema` only works with `gpt-5` family models, not `gpt-5-codex` or `codex-` prefixed models (Codex CLI bug #4181). If the result JSON is absent or malformed, classify as task failure.
+
+- R21. The prompt constraint tells Codex to restrict all modifications to files within the repository root. If Codex discovers mid-execution that it needs to modify files outside the repo root, it should complete what it can within the repo and report what it couldn't do via the result schema `issues` field. The main agent then handles the out-of-repo work in standard mode. Out-of-repo changes cannot be detected or rolled back by git stash; this is an accepted risk mitigated by the prompt constraint and per-unit pre-screening (R5).
+
+**Settings in compound-engineering.local.md**
+
+- R22. 
New YAML frontmatter keys in `.claude/compound-engineering.local.md`: + - `work_delegate`: `codex`/`false` (default: `false`) — delegation target when enabled + - `work_delegate_consent`: `true`/`false` — whether the user has completed the one-time consent flow + - `work_delegate_sandbox`: `yolo`/`full-auto` (default: `yolo`) — sandbox posture for codex exec + +## Success Criteria + +- Codex successfully implements implementation units from ce:plan output across a variety of task types (new features, bug fixes, refactors) +- CLI invocations are deterministic — no agent improvisation of shell syntax across runs +- Delegation activates only when explicitly requested (argument or local.md), only with a plan file, and never when running inside Codex +- Failed delegation rolls back cleanly via named git stash without corrupting tracked repository files +- The result schema provides reliable signal for success/failure classification +- Users who never enable delegation experience zero change in ce:work behavior + +## Scope Boundaries + +- **Not a separate skill.** ce-work-beta is superseded. This modifies ce:work directly. +- **No app-server integration.** We use bare `codex exec`, not the codex-companion.mjs app server or the codex plugin's rescue skill. The delegation pattern is fire-prompt -> wait -> inspect-result, which is exactly what `codex exec` provides. +- **No ad-hoc delegation.** Delegation only applies to structured plan execution with a plan file. Bare prompts without plans always use standard mode. +- **No minimum version gating.** Added later if a breaking CLI change actually occurs. +- **No periodic re-consent.** One acceptance per project. Version-gated or calendar-based re-consent can be added later if needed. +- **No converter changes.** The skill handles platform detection internally via environment variable checks. +- **No out-of-repo detection.** Git stash cannot protect files outside the repo. 
Defense is prompt constraint + per-unit pre-screening, not post-execution validation. +- **No timeout for v1.** Neither `codex exec` nor the most mature codex integration (osc-work) implements timeouts. Added later if users report hung processes. + +## Key Decisions + +- **Modify ce:work, not a separate skill**: Avoids skill proliferation. Users stay in their existing workflow. ce-work-beta's delegation section is superseded; its structural patterns (guards, circuit breaker) are ported. +- **`delegate:codex` namespace, not `mode:codex`**: Existing `mode:` tokens describe interaction style (headless, autofix). Delegation describes execution target. Separate namespace avoids semantic overloading. +- **Bare `codex exec` over app-server**: App server offers structured output and thread management, but requires fragile path discovery into another plugin's versioned install directory. `codex exec` is one line of bash, works identically in subagents, and does exactly what fire-and-wait delegation needs. +- **User-selected sandbox mode (yolo default, full-auto option)**: yolo is recommended because `--full-auto` blocks network access needed for test/lint commands. But users who prefer sandboxed execution can choose `full-auto`, accepting that verification may fail. The circuit breaker handles repeated failures. +- **One-time consent with mode selection**: Consent is about informed awareness, not ongoing compliance. The sandbox mode choice is part of the consent flow and persisted in local.md. +- **Per-unit delegation eligibility, not all-or-nothing**: Default is to delegate all units, but the agent pre-screens units that need out-of-repo access or are trivially small. This avoids delegating work that can't succeed in the unsandboxed environment and reduces overhead for trivial changes. 
+- **Prompt file over inline argument**: Writing prompts to `.context/compound-engineering/codex-delegation/` and redirecting the file into `codex exec` avoids the ARG_MAX limits that apply to command-line arguments, provides debugging artifacts on failure, and follows the repo's scratch space convention.
+- **Complete-and-report over error-and-rollback**: When Codex discovers it needs out-of-repo access mid-execution, it completes in-repo changes and reports what it couldn't do. Preserves useful work rather than wasting it.
+- **Plan-only delegation**: Ad-hoc prompts use standard mode. Delegation requires the structured plan decomposition to build effective prompts and provide meaningful implementation units.
+- **Serial execution for all units when delegation is active**: Git stash is a global stack. Mixing parallel and serial execution causes stash entanglement. When delegation mode is on, all units (including locally-executed ones) run serially. This makes delegation mode and swarm mode (Agent Teams) mutually exclusive, a deliberate tradeoff of parallelism for the ability to use Codex.
+- **`--output-schema` for result classification**: `codex exec` returns exit code 0 even on task failure. The structured result schema combined with VERIFY commands provides reliable success/failure signal. Prompt-enforced honest reporting plus cross-validation with VERIFY catches model misreporting.
+- **No timeout for v1**: `codex exec` has no built-in timeout, and the most mature integration (osc-work) doesn't implement one either. Added if users report hung processes.
+
+## Dependencies / Assumptions
+
+- Codex CLI `exec` subcommand with `--yolo`, `--full-auto`, `--output-schema`, `-o`, and `-m` flags remains stable
+- `--output-schema` works with `gpt-5` family models. 
Known bug #4181 breaks it for `gpt-5-codex` / `codex-` prefixed models, so delegation should use `gpt-5` family models (e.g., `gpt-5.4`)
+- `$CODEX_SANDBOX` and `$CODEX_SESSION_ID` environment variables continue to be set when running inside Codex
+- `.claude/compound-engineering.local.md` YAML frontmatter reading/writing infrastructure must be built as part of this work; no existing skill currently reads or writes these keys. This is a prerequisite, not an assumption.
+
+## Outstanding Questions
+
+### Deferred to Planning
+
+- [Affects R17][Needs research] What is the optimal prompt template structure for maximizing Codex code quality? The printing-press skill provides one template; the codex plugin's prompting skill (`gpt-5-4-prompting`) may offer insights on how to structure prompts for Codex/GPT models specifically.
+- [Affects R14][Technical] Where exactly in ce:work's Phase 2 task execution loop does the delegation branch? Need to read the current task-worker dispatch logic to identify the cleanest insertion point.
+- [Affects R18][Technical] Should the circuit breaker (3 consecutive failures) reset per-unit or persist across the entire plan execution? Per-unit is more forgiving; per-plan is more conservative.
+- [Affects R22][Technical] How does the agent parse `.claude/compound-engineering.local.md` YAML frontmatter at runtime? Is there an existing utility or must the skill instruct the agent to parse it directly via bash?
+- [Affects R20][Needs testing] How reliably does `--output-schema` constrain Codex's final response? Need to test with representative implementation prompts to validate the result classification approach. Use `--ephemeral` flag during testing to avoid session file clutter (production invocations do not use `--ephemeral`; session persistence is valuable for debugging). 
+- [Affects R20][Technical] Fallback behavior when `--output-schema` fails (wrong model family, malformed output): define the exact classification logic when the result JSON is absent. + +## Next Steps + +-> `/ce:plan` for structured implementation planning diff --git a/docs/plans/2026-03-31-001-feat-codex-delegation-plan.md b/docs/plans/2026-03-31-001-feat-codex-delegation-plan.md new file mode 100644 index 00000000..df78b806 --- /dev/null +++ b/docs/plans/2026-03-31-001-feat-codex-delegation-plan.md @@ -0,0 +1,466 @@ +--- +title: "feat: Add Codex delegation mode to ce:work" +type: feat +status: completed +date: 2026-03-31 +origin: docs/brainstorms/2026-03-31-codex-delegation-requirements.md +--- + +# feat: Add Codex delegation mode to ce:work + +## Overview + +Add an optional Codex delegation mode to ce:work that delegates code-writing to the Codex CLI (`codex exec`) using concrete bash templates. When active with a plan file, each implementation unit is sent to Codex with a structured prompt and result schema, then classified, verified, and committed or rolled back. This replaces ce-work-beta's prose-based delegation (PR #364) which caused non-deterministic CLI invocations. + +> **Implementation note (2026-03-31):** The final rollout was redirected to `ce:work-beta` so stable `ce:work` remains unchanged during beta. `ce:work-beta` must be invoked manually; `ce:plan` and other workflow handoffs remain pointed at stable `ce:work` until promotion. + +## Problem Frame + +Users running ce:work from Claude Code (or other non-Codex agents) want to delegate token-heavy implementation work to Codex — either for better code quality or token conservation. PR #364's approach failed because the agent improvised CLI syntax each run. ce-work-beta has a structured 7-step External Delegate Mode with useful patterns (environment guards, circuit breaker), but the CLI invocation step itself is prose-based. 
This plan ports the structural patterns and replaces prose invocations with concrete, tested bash templates. (see origin: docs/brainstorms/2026-03-31-codex-delegation-requirements.md) + +## Requirements Trace + +- R1. Optional mode within ce:work, not separate skill; ce-work-beta superseded +- R2. Resolution chain: argument > local.md > hard default (off) +- R3-R4. `delegate:codex` / `delegate:local` canonical tokens with bounded imperative fuzzy matching +- R5. Plan-only delegation; per-unit eligibility pre-screening (out-of-repo checks, trivial-work exclusions) +- R6-R7. Environment guard (Codex sandbox detection); skill-level logic, no converter changes +- R8-R9. Availability check; no version gating +- R10-R13. One-time consent with sandbox mode selection during interactive ce:work execution +- R14. Concrete bash invocation template (validated via live CLI testing) +- R15. User-selected sandbox: `--yolo` (default) or `--full-auto` +- R16. Serial execution for all units; delegation and swarm mode mutually exclusive; delegated execution requires a clean working tree and rolls failed units back to `HEAD` +- R17. Prompt template written to `.context/compound-engineering/codex-delegation/`; XML-tagged sections +- R18. Circuit breaker: 3 consecutive failures -> standard mode fallback +- R19. Multi-signal failure classification (CLI fail / result absent / task fail / partial / verify fail / success) +- R20. `--output-schema` for structured result JSON; known gpt-5-codex model bug +- R21. Repo-root restriction via prompt constraint; complete-and-report on out-of-repo discovery +- R22. 
Settings in `.claude/compound-engineering.local.md`: `work_delegate`, `work_delegate_consent`, `work_delegate_sandbox` + +## Scope Boundaries + +- No app-server integration (bare `codex exec` only) +- No ad-hoc delegation (plan file required) +- No minimum version gating +- No periodic re-consent +- No converter changes +- No timeout for v1 +- No out-of-repo detection (prompt constraint + pre-screening only) +- No automatic preservation of pre-existing dirty state in delegated mode +- Delegation and swarm mode (Agent Teams) are mutually exclusive + +## Context & Research + +### Relevant Code and Patterns + +- `plugins/compound-engineering/skills/ce-work/SKILL.md` — target file; Phase 1 Step 4 (execution strategy, lines 126-144) and Phase 2 Step 1 (task loop, line ~159) are the insertion points +- `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` — External Delegate Mode (lines 413-474) provides the structural pattern being ported (guards, circuit breaker, prompt file writing) +- `plugins/compound-engineering/skills/ce-review/SKILL.md` (lines 19-33) — canonical argument parsing pattern with token table, strip-before-interpret, conflict detection +- `plugins/compound-engineering/skills/ce-plan/SKILL.md` (lines 167-176, 352-356, 495) — current `Execution target: external-delegate` posture signal to remove as part of the supersession work +- `~/.claude/plugins/marketplaces/cli-printing-press/skills/printing-press/SKILL.md` — proven codex delegation via `codex exec --yolo -` with 3-failure circuit breaker +- `~/.claude/plugins/marketplaces/openai-codex/plugins/codex/skills/gpt-5-4-prompting/` — Codex prompt best practices: XML-tagged blocks, ``, ``, `` + +### Institutional Learnings + +- **Git workflow skills need explicit state machines** (`docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md`): Re-read state at each git transition; use `git status` not `git diff HEAD` for cleanliness; model non-zero exits as state 
transitions +- **Pass paths, not content, to sub-agents** (`docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md`): Orchestrator discovers paths; sub-agent reads content; instruction phrasing affects tool call count +- **Beta promotion must update callers atomically** (`docs/solutions/skill-design/beta-promotion-orchestration-contract.md`): When adding new invocation semantics, update all callers in the same PR +- **Compound-refresh mode detection** (`docs/solutions/skill-design/compound-refresh-skill-improvements.md`): Mode must be explicit opt-in via arguments, not auto-detected from environment + +## Key Technical Decisions + +- **Insertion point:** Delegation routing gate at Phase 1 Step 4 (execution strategy selection); per-unit delegation branch at Phase 2 Step 1 line ~159 ("Implement following existing conventions"). This keeps delegation as a task-level modifier within the existing execution flow rather than a separate phase. +- **Argument parsing pattern:** Follow ce:review's canonical pattern — token table, strip-before-interpret, graceful fallback. Introduce `delegate:` as a new namespace separate from `mode:`. Do not add a non-interactive mode to ce:work as part of this feature; the skill remains interactive. The `argument-hint` frontmatter gets updated. +- **Fuzzy matching boundary:** Support fuzzy activation only for imperative execution-intent phrases such as "use codex", "delegate to codex", or "codex mode". A bare mention of "codex" or prompts about Codex itself must not activate delegation. +- **Prompt template format:** XML-tagged blocks following the codex `gpt-5-4-prompting` skill's guidance — ``, ``, ``, ``, ``, ``, ``. This is more structured than printing-press's flat format and aligns with how Codex/GPT-5.4 models parse instructions. +- **Settings parsing:** No utility exists. 
The skill includes inline instructions for the agent to read `.claude/compound-engineering.local.md`, extract YAML between `---` delimiters, and interpret keys. For writing, read-modify-write with explicit handling: (1) if file doesn't exist, create it with YAML frontmatter wrapper; (2) if file exists with valid frontmatter, merge new keys preserving existing keys; (3) if file exists without frontmatter or with malformed frontmatter, prepend a valid frontmatter block and preserve existing body content below the closing `---`. Cross-platform path rewriting handled by converters (`.claude/` -> `.codex/` -> `.opencode/`). +- **Circuit breaker resets on success, persists across units:** A successful delegation resets the counter to 0. Consecutive failures accumulate across units within a single plan execution. If delegation keeps failing, it's likely environmental (codex auth, model issues), not unit-specific. +- **Delegation takes precedence over swarm:** When delegation is active, serial execution is enforced and swarm mode is suppressed. This applies even when slfg or the user explicitly requests swarm mode. Delegation is the higher-priority execution constraint because it requires serial execution. Swarm mode may be re-evaluated in the future but delegation support is more important now. +- **Delegated execution safety model:** Do not auto-stash pre-existing user changes. Delegated execution only starts from a clean working tree in the current checkout or current worktree. If the tree is dirty, stop and tell the user to commit, stash explicitly, or continue in standard mode. This makes rollback-to-`HEAD` safe and avoids hiding user data inside automation-owned stash entries. +- **Partial result policy:** Treat `status: "partial"` as a handoff, not a completed unit. Keep the diff, switch immediately to local completion for that same unit, verify and commit before moving on, and count it toward the circuit breaker. 
If local completion fails, roll the unit back to `HEAD`. +- **ce-work-beta disposition:** Port Frontend Design Guidance (lines 266-272) to ce:work as a separate Phase 2 addition. Supersede the External Delegate Mode section entirely, and remove the old `Execution target: external-delegate` execution-note contract from ce:plan / ce-work-beta in the same PR. Keep ce-work-beta otherwise intact for now — deletion is a separate cleanup task. + +## Open Questions + +### Resolved During Planning + +- **Optimal prompt template structure (R17):** XML-tagged blocks per codex `gpt-5-4-prompting` guidance. Sections: ``, ``, ``, ``, `` (includes repo-root restriction and mandatory result reporting), ``, ``. +- **Insertion point in ce:work Phase 2 (R14):** Phase 1 Step 4 for routing/strategy gate; Phase 2 Step 1 line ~159 for per-unit delegation branch. +- **Circuit breaker reset semantics (R18):** Per-plan, resetting to 0 on success. Rationale: repeated failures are likely environmental, not unit-specific. +- **How to parse local.md YAML (R22):** Inline skill instructions — agent reads the file, extracts YAML between `---` delimiters, interprets the keys. No utility exists; building a general-purpose utility is out of scope. +- **Fallback when --output-schema fails (R20):** If result JSON is absent or malformed, classify as task failure per R19. The agent proceeds to the next unit or triggers the circuit breaker. + +### Deferred to Implementation + +- **Exact prompt wording:** The XML-tagged template structure is defined; the exact prose within each section will be refined during implementation based on testing with representative plan units. +- **Consent flow UX copy:** The consent warning content (R10) — what exactly to say about `--yolo`, how to present the sandbox choice — is best refined during implementation with real interaction testing. 
+- **Frontend Design Guidance port quality:** Whether the beta's Frontend Design Guidance section ports cleanly or needs adaptation for ce:work's structure. + +## High-Level Technical Design + +> *This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce.* + +The delegation mode adds three sections to ce:work's SKILL.md: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ SKILL.md Structure (additions marked with +) │ +├─────────────────────────────────────────────────────────────┤ +│ │ +│ + ## Argument Parsing │ +│ Parse delegate:codex / delegate:local tokens │ +│ Read local.md for work_delegate fallback │ +│ Resolve delegation state: on/off + sandbox mode │ +│ │ +│ ## Phase 0: Input Triage (existing) │ +│ │ +│ ## Phase 1: Quick Start (existing) │ +│ + Step 4 modification: if delegation on + plan present, │ +│ force serial execution, block swarm mode │ +│ │ +│ ## Phase 2: Execute (existing) │ +│ + Step 1 modification: if delegation on for this unit, │ +│ branch to Codex Delegation section instead of │ +│ "implement following existing conventions" │ +│ │ +│ + ## Codex Delegation Mode │ +│ + Pre-delegation checks (env guard, availability, │ +│ consent) │ +│ + Prompt template builder (XML-tagged) │ +│ + Result schema definition │ +│ + Execution loop (exec -> classify -> │ +│ local-complete/commit/rollback-to-HEAD) │ +│ + Circuit breaker logic │ +│ │ +│ ## Phase 3: Quality Check (existing, unchanged) │ +│ ## Phase 4: Ship It (existing, unchanged) │ +│ ## Swarm Mode (existing, + mutual exclusion note) │ +│ │ +│ + ## Frontend Design Guidance (ported from ce-work-beta) │ +│ │ +└─────────────────────────────────────────────────────────────┘ +``` + +## Implementation Units + +```mermaid +graph TB + U1[Unit 1: Argument Parsing
+ Settings Reading] --> U2[Unit 2: Pre-Delegation Gates] + U2 --> U3[Unit 3: Execution Strategy Gate] + U3 --> U4[Unit 4: Delegation Artifacts] + U4 --> U5[Unit 5: Core Delegation Loop] + U5 --> U6[Unit 6: ce-work-beta Sync] +``` + +--- + +- [x] **Unit 1: Argument Parsing and Settings Reading** + +**Goal:** Add `delegate:codex` / `delegate:local` token parsing to ce:work and the resolution chain that reads local.md settings. + +**Requirements:** R2, R3, R4, R22 + +**Dependencies:** None + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md` +- Test: `tests/pipeline-review-contract.test.ts` +- Test: manual invocation testing with `delegate:codex`, `delegate:local`, and fuzzy variants + +**Approach:** +- Add an `## Argument Parsing` section immediately before the `## Phase 0: Input Triage` heading (after the opening narrative), following ce:review's canonical pattern (token table, strip-before-interpret). Cross-reference the High-Level Technical Design diagram for placement. +- Token table: `delegate:codex` (activate), `delegate:local` (deactivate), plus bounded fuzzy recognition for delegate activation phrases. Do not add `mode:headless` here; ce:work remains an interactive workflow. 
+- After token extraction, read `.claude/compound-engineering.local.md` for `work_delegate`, `work_delegate_consent`, `work_delegate_sandbox` keys +- Implement resolution chain: argument flag > local.md `work_delegate` > hard default `false` +- Store resolved delegation state (on/off) and sandbox mode in skill-level variables for downstream consumption +- Update the `argument-hint` frontmatter to include `delegate:codex` for discoverability +- Follow learning: mode must be explicit opt-in via arguments, not auto-detected (compound-refresh pattern) + +**Patterns to follow:** +- `plugins/compound-engineering/skills/ce-review/SKILL.md` lines 19-33 — token table, strip-before-interpret, conflict detection +- `plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md` line 13 — simple token stripping +- YAML frontmatter parsing: agent reads file, extracts content between `---` delimiters, interprets keys + +**Test scenarios:** +- Happy path: `delegate:codex` in arguments sets delegation on with default yolo sandbox +- Happy path: `delegate:local` in arguments sets delegation off even when local.md has `work_delegate: codex` +- Happy path: No delegate token with `work_delegate: codex` in local.md activates delegation +- Happy path: No delegate token and no local.md setting defaults to delegation off +- Edge case: `delegate:codex` combined with a plan file path — both are parsed correctly, plan path preserved +- Edge case: Fuzzy variant "use codex for this work" recognized as delegation activation +- Edge case: Bare prompt "fix codex converter bugs" does not activate delegation +- Edge case: Missing or empty local.md file — falls back to hard defaults gracefully +- Edge case: Malformed YAML frontmatter in local.md — treated as if settings are absent, not a fatal error + +**Verification:** +- Delegation state resolves correctly for all combinations of argument + local.md + default +- Plan file paths are not corrupted by token stripping +- Argument-hint frontmatter 
includes delegate:codex +- Contract tests cover the new token/wording expectations + +--- + +- [x] **Unit 2: Pre-Delegation Gates (Environment Guard + Availability + Consent)** + +**Goal:** Add the checks that run before delegation can proceed — environment detection, CLI availability, and one-time consent with sandbox mode selection. + +**Requirements:** R6, R7, R8, R10, R11, R12, R13 + +**Dependencies:** Unit 1 (delegation state must be resolved) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md` +- Test: `tests/pipeline-review-contract.test.ts` +- Test: manual invocation testing in Codex sandbox vs normal environment + +**Approach:** +- Add a `### Pre-Delegation Checks` subsection within the new Codex Delegation Mode section +- **Environment guard:** Check `$CODEX_SANDBOX` and `$CODEX_SESSION_ID`. If set, disable delegation. Notify only when user explicitly requested delegation (via argument); proceed silently when delegation was enabled via local.md default only. +- **Availability check:** `command -v codex`. If not found, fall back to standard mode with notification. 
+- **Consent flow:** If `work_delegate_consent` is not `true` in local.md: + - Show one-time warning explaining `--yolo`, present sandbox mode choice (yolo recommended, full-auto option), record decision to local.md +- **Consent decline path:** Ask whether to disable delegation entirely; if yes, set `work_delegate: false` in local.md +- Follow learning: re-read git/file state at each transition rather than caching (state machine pattern) + +**Patterns to follow:** +- ce-work-beta External Delegate Mode lines 436-445 — environment guard structure +- Platform-agnostic tool references: "Use the platform's blocking question tool (AskUserQuestion in Claude Code, request_user_input in Codex)" + +**Test scenarios:** +- Happy path: Outside Codex, CLI available, consent already granted — proceeds to delegation +- Happy path: First-time consent flow — warning shown, user accepts yolo, settings written to local.md +- Happy path: First-time consent — user chooses full-auto, setting stored correctly +- Error path: Inside Codex sandbox with explicit `delegate:codex` argument — notification emitted, falls back to standard mode +- Error path: Inside Codex sandbox with only local.md default — silent fallback, no notification +- Error path: `codex` CLI not on PATH — notification emitted, falls back to standard mode +- Error path: User declines consent — asked about disabling, if yes `work_delegate: false` set +- Edge case: Delegation enabled via local.md default on first invocation (no delegate:codex argument) — consent flow shown as normal, because R10 triggers on "first time delegation activates" regardless of activation source + +**Verification:** +- Environment guard correctly detects Codex sandbox and falls back +- Missing codex CLI produces notification and graceful fallback +- Consent state persists across invocations via local.md +- Consent flow prompts only within ce:work's existing interactive execution model + +--- + +- [x] **Unit 3: Execution Strategy Gate and Swarm 
Exclusion** + +**Goal:** Modify Phase 1 Step 4 to force serial execution when delegation is active and block swarm mode selection. + +**Requirements:** R5, R16 + +**Dependencies:** Unit 1 (delegation state) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md` +- Test: `tests/pipeline-review-contract.test.ts` +- Test: manual testing with delegation + swarm mode request + +**Approach:** +- In Phase 1 Step 4 ("Choose Execution Strategy"), add a routing gate: if delegation is active AND a plan file is present, override the strategy to serial execution +- Add explicit note that delegation mode and swarm mode (Agent Teams) are mutually exclusive +- **Delegation takes precedence over swarm mode.** When delegation is active (resolved via the resolution chain in Unit 1), serial execution is enforced and swarm mode is suppressed — even if the user or caller (e.g., slfg) requests swarm mode. Delegation requires serial execution which is mechanically incompatible with swarm. If swarm mode would otherwise activate but delegation is on, emit a notification: "Delegation mode active — serial execution enforced, swarm mode unavailable." This gate operates at the execution-strategy level (Phase 1 Step 4), after argument parsing completes. 
+- Add a brief note in the Swarm Mode section about the mutual exclusivity constraint +- Enforce plan-only delegation: if delegation is active but no plan file was provided (bare prompt), fall back to standard mode with a brief note + +**Patterns to follow:** +- Existing Phase 1 Step 4 execution strategy decision tree +- Beta promotion learning: when adding new invocation semantics, update all callers atomically + +**Test scenarios:** +- Happy path: Delegation active with plan file — serial execution enforced +- Happy path: Delegation off — existing execution strategy selection unchanged +- Edge case: Delegation active but bare prompt (no plan) — falls back to standard mode +- Edge case: slfg requests swarm mode but local.md has `work_delegate: codex` — delegation wins, serial execution enforced, swarm mode suppressed with notification +- Edge case: User explicitly passes `delegate:codex` AND requests swarm mode — delegation wins, swarm suppressed with notification + +**Verification:** +- Serial execution enforced when delegation active with a plan +- Swarm mode suppressed when delegation is active, with notification +- Bare prompts always use standard mode regardless of delegation setting +- slfg invocations with delegation enabled via local.md result in serial execution, not swarm mode + +--- + +- [x] **Unit 4: Delegation Artifacts (Prompt Template + Result Schema)** + +**Goal:** Define the prompt template builder and result schema that are written to `.context/compound-engineering/codex-delegation/` before each delegation invocation. 
+ +**Requirements:** R17, R20, R21 + +**Dependencies:** Unit 2 (consent + sandbox mode resolved) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md` +- Test: manual inspection of generated prompt files and schema + +**Approach:** +- Add a `### Prompt Template` subsection within the Codex Delegation Mode section +- Define the XML-tagged prompt structure following `gpt-5-4-prompting` best practices: + - `` — goal from implementation unit + - `` — file list from implementation unit + - `` — relevant code context (CURRENT PATTERNS) + - `` — approach from implementation unit + - `` — no git commit, repo-root restriction, scoped changes, line limit, mandatory result reporting + - `` — test/lint commands from project + - `` — the result reporting instructions (status/files_modified/issues/summary) +- Define the result schema JSON (per R20) as a static file written to `.context/compound-engineering/codex-delegation/result-schema.json` +- Include `.context/compound-engineering/codex-delegation/` directory creation as part of the setup contract +- Prompt files: `prompt-.md` — cleaned up after each successful unit +- Result files: `result-.json` — cleaned up after each successful unit +- Follow learning: pass paths, not content, to sub-agents — the prompt template includes file paths for CURRENT PATTERNS, letting codex read them + +**Patterns to follow:** +- `gpt-5-4-prompting` skill — XML-tagged blocks, ``, `` +- Printing-press skill — TASK/FILES TO MODIFY/CURRENT CODE/EXPECTED CHANGE/CONVENTIONS/CONSTRAINTS/VERIFY structure +- AGENTS.md scratch space convention: `.context/compound-engineering//` + +**Test scenarios:** +- Happy path: Prompt file generated with all XML sections populated from a plan implementation unit +- Happy path: Result schema file created as valid JSON matching the R20 schema definition +- Edge case: Implementation unit with no VERIFY commands — `` section contains fallback instruction ("Run any available test suite or lint") 
+- Edge case: Implementation unit with no CURRENT PATTERNS — `` section notes the absence rather than being empty +- Integration: Prompt file is readable by `codex exec - < prompt-file.md` — validated during brainstorm CLI testing + +**Verification:** +- Generated prompt files contain all required XML sections +- Result schema validates against the JSON schema definition in R20 +- Scratch directory created at `.context/compound-engineering/codex-delegation/` +- Files cleaned up after successful delegation + +--- + +- [x] **Unit 5: Core Delegation Execution Loop** + +**Goal:** Implement the per-unit delegation execution: clean-baseline preflight, codex exec invocation, result classification, commit or rollback-to-`HEAD`, and circuit breaker. + +**Requirements:** R14, R15, R16, R18, R19 + +**Dependencies:** Unit 3 (serial execution enforced), Unit 4 (prompt template + schema available) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md` +- Test: `tests/pipeline-review-contract.test.ts` +- Test: manual end-to-end delegation testing with a real plan file + +**Approach:** +- Add the `### Execution Loop` subsection within Codex Delegation Mode +- **Clean-baseline preflight:** Before the first delegated unit, require a clean working tree in the current checkout/worktree (`git status --short` empty). If dirty, stop and instruct the user to commit, stash explicitly, or continue in standard mode. Do not auto-stash user changes. +- **Per-unit eligibility check (R5):** Before delegating, the agent assesses whether the unit is eligible per R5: (a) does not require modifications outside the repository root, and (b) is not trivially small (single-file config change, simple substitution where delegation overhead exceeds the work). If ineligible, execute locally in standard mode and state the reason before execution. 
+- **Codex exec invocation:** The verbatim bash template from R14: + ``` + codex exec $SANDBOX_FLAG --output-schema -o - < + ``` +- **Result classification (R19):** Multi-signal approach: + 1. Exit code != 0 → CLI failure → rollback current unit to `HEAD`, then hard fall back to standard mode for all remaining units + 2. Exit code 0, result JSON missing/malformed → task failure → rollback current unit to `HEAD` + circuit breaker + 3. `status: "failed"` → task failure → rollback current unit to `HEAD` + circuit breaker + 4. `status: "partial"` → keep the diff, switch immediately to standard-mode completion for this same unit, verify + commit before moving on, count as a delegation failure for circuit-breaker purposes + 5. `status: "completed"` + VERIFY fails → verify failure → rollback current unit to `HEAD` + circuit breaker + 6. `status: "completed"` + VERIFY passes → success → commit +- **Rollback:** `git checkout -- . && git clean -fd` back to `HEAD`. This is only permitted because delegated mode starts from a clean baseline and never auto-stashes user-owned local changes. +- **Commit on success:** Mandatory commit after each successful unit (enforces clean working tree for next unit) +- **Circuit breaker (R18):** Counter persists across units within a plan execution. Resets to 0 on success. After 3 consecutive failures, fall back to standard mode for all remaining units with notification. +- **Partial success handling:** `partial` is a local handoff for the current unit, not permission to continue with a dirty tree. The main agent must finish the same unit locally, verify it, and commit before dispatching the next unit. If local completion fails, roll the unit back to `HEAD`. 
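
The classification rules and circuit breaker in the Approach list above can be read as a small decision function. The following is a minimal sketch for review purposes only, not the skill's implementation (the plan specifies these rules as SKILL.md instructions, not code). The names `classifyResult`, `advanceBreaker`, `Outcome`, and `BREAKER_LIMIT` are hypothetical, and the result JSON is assumed to carry a top-level `status` field as described for R20.

```typescript
// Hypothetical sketch of the R19 classification rules; all names are illustrative.
type Status = "completed" | "partial" | "failed";

type Outcome =
  | "cli_failure"    // rollback to HEAD, standard mode for all remaining units
  | "task_failure"   // rollback to HEAD, counts toward the circuit breaker
  | "partial"        // keep the diff, finish the same unit locally, counts as a failure
  | "verify_failure" // rollback to HEAD, counts toward the circuit breaker
  | "success";       // commit

const STATUSES: readonly Status[] = ["completed", "partial", "failed"];
const BREAKER_LIMIT = 3; // consecutive failures before falling back (R18)

// Returns null when the result JSON is absent, malformed, or has no known status.
function parseStatus(resultJson: string | null): Status | null {
  if (resultJson === null) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(resultJson);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const status = (parsed as { status?: unknown }).status;
  return STATUSES.find((s) => s === status) ?? null;
}

function classifyResult(
  exitCode: number,
  resultJson: string | null,
  verifyPassed: boolean,
): Outcome {
  if (exitCode !== 0) return "cli_failure";
  const status = parseStatus(resultJson);
  if (status === null || status === "failed") return "task_failure";
  if (status === "partial") return "partial";
  return verifyPassed ? "success" : "verify_failure";
}

// Success resets the counter; a CLI failure stops delegation immediately;
// every other outcome increments the counter and trips at BREAKER_LIMIT.
function advanceBreaker(
  consecutiveFailures: number,
  outcome: Outcome,
): { consecutiveFailures: number; keepDelegating: boolean } {
  if (outcome === "success") return { consecutiveFailures: 0, keepDelegating: true };
  if (outcome === "cli_failure") return { consecutiveFailures, keepDelegating: false };
  const next = consecutiveFailures + 1;
  return { consecutiveFailures: next, keepDelegating: next < BREAKER_LIMIT };
}
```

Keeping classification separate from the breaker update mirrors the plan's wording: the outcome decides what happens to the current unit, while the counter decides whether later units are still delegated.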
+ +**Patterns to follow:** +- ce-work-beta External Delegate Mode 7-step workflow (lines 447-465) +- Printing-press skill codex invocation + circuit breaker pattern +- Git state machine learning: re-read state at each transition; model non-zero exits as expected state transitions + +**Test scenarios:** +- Happy path: Unit delegated, codex succeeds, result schema says "completed", VERIFY passes — changes committed +- Happy path: Delegation runs inside an already-isolated clean worktree — no extra worktree required +- Happy path: Multiple units delegated serially — each starts with clean working tree after prior commit +- Happy path: Circuit breaker resets after a success following a failure +- Error path: Dirty working tree before first delegated unit — stop and ask the user to clean/stash/commit or continue in standard mode +- Error path: codex exec returns exit code != 0 — classified as CLI failure, rollback to `HEAD`, all remaining units use standard mode +- Error path: Result JSON missing after successful exit code — classified as task failure, rollback to `HEAD`, circuit breaker increment +- Error path: Result schema reports "failed" — rollback to `HEAD`, circuit breaker increment +- Error path: Result schema reports "completed" but VERIFY fails — rollback to `HEAD`, circuit breaker increment +- Error path: 3 consecutive failures — circuit breaker triggers, remaining units fall back to standard mode with notification +- Edge case: Result schema reports "partial" — changes kept, same unit completed locally, verified, and committed before the next unit +- Edge case: Unit pre-screened as ineligible (out-of-repo) — executed locally, not delegated +- Edge case: Unit pre-screened as trivially small — executed locally, not delegated +- Integration: Contract tests assert the delegated-mode clean-baseline and supersession wording stays in sync + +**Verification:** +- Delegation produces deterministic CLI invocations (no agent improvisation) +- Failed delegation rolls 
back cleanly to `HEAD` without touching pre-existing user changes +- Circuit breaker activates after 3 consecutive failures +- Partial success never advances to the next unit until the current unit is completed locally and committed +- Each successful delegation is followed by a commit before the next unit + +--- + +- [x] **Unit 6: ce-work-beta Sync (Port Non-Delegation Features + Supersede)** + +**Goal:** Port ce-work-beta's Frontend Design Guidance to ce:work, mark the old delegation section as superseded, and remove the obsolete `external-delegate` execution-note contract. + +**Requirements:** R1 + +**Dependencies:** Unit 5 (delegation fully implemented in ce:work) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-work/SKILL.md` +- Modify: `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` +- Modify: `plugins/compound-engineering/skills/ce-plan/SKILL.md` +- Test: `tests/pipeline-review-contract.test.ts` +- Test: verify Frontend Design Guidance triggers correctly in ce:work + +**Approach:** +- **Port Frontend Design Guidance** (ce-work-beta lines 266-272) to ce:work Phase 2 as a new numbered step: "For UI tasks without Figma designs, load the `frontend-design` skill before implementing" +- **Supersede ce-work-beta delegation:** Add a note at the top of ce-work-beta's External Delegate Mode section stating it is superseded by ce:work's Codex Delegation Mode. Do not delete the section — leave it as documentation of the prior approach. +- **Remove obsolete execution-note contract:** Delete `Execution target: external-delegate` guidance and examples from ce:plan, and remove ce-work-beta's activation path that consumes that tag. After this change, delegation is controlled by the ce:work resolution chain only. +- **Mixed-Model Attribution:** Port the PR attribution guidance (ce-work-beta lines 467-473) to ce:work's Codex Delegation Mode section — when some tasks are delegated and some local, the PR should credit both models. 
+- **Caller update check:** Verify no other skills still reference `Execution target: external-delegate` after the removal. Per the beta promotion learning, delete the old contract atomically rather than leaving dual semantics behind. + +**Patterns to follow:** +- ce-work-beta Frontend Design Guidance (lines 266-272) +- ce-work-beta Mixed-Model Attribution (lines 467-473) +- Beta promotion learning: update orchestration callers atomically + +**Test scenarios:** +- Happy path: UI task without Figma design in ce:work — Frontend Design Guidance triggers correctly +- Happy path: Mixed delegation/local execution — PR attribution credits both models +- Happy path: ce:plan no longer emits `Execution target: external-delegate` +- Edge case: ce-work-beta invoked directly — sees supersession note, delegation section still present for reference + +**Verification:** +- Frontend Design Guidance is functional in ce:work Phase 2 +- ce-work-beta delegation section is marked superseded +- `external-delegate` references are removed from live skills +- `bun test` and `bun run release:validate` pass because skill content changed + +## System-Wide Impact + +- **Interaction graph:** ce:work's Phase 2 task execution loop gains a delegation branch. Phase 1 Step 4 gains a routing gate. The Swarm Mode section gains a mutual exclusivity note. Phase 3 is unchanged. Phase 4 only gains mixed-model attribution guidance carried over from ce-work-beta. +- **Error propagation:** CLI failures cause rollback of the current delegated unit to `HEAD` and hard fallback to standard mode for all remaining units. Task/verify failures count toward the circuit breaker and trigger per-unit rollback. Partial success is a handoff path: finish the same unit locally, then commit before continuing. +- **State lifecycle risks:** Delegated mode now refuses to start from a dirty tree, including in an existing worktree checkout. 
This is a deliberate safety tradeoff that avoids automation-owned stash state and keeps `HEAD` rollback safe. The mandatory commit after each successful or locally-completed partial unit prevents cross-unit entanglement. +- **API surface parity:** `delegate:codex` is the new argument namespace. Converters rewrite `.claude/` paths in local.md references to platform equivalents (`.codex/`, `.opencode/`). The old `Execution target: external-delegate` contract is removed from live skills. No new ce:work-wide non-interactive mode is introduced. +- **Integration coverage:** The delegation flow crosses ce:work -> bash (codex exec) -> codex CLI -> file system (result JSON, prompt files) -> git. End-to-end testing requires a working codex CLI installation. +- **Unchanged invariants:** ce:work's existing argument handling for file paths and bare prompts is preserved. Users who never enable delegation experience zero behavioral change. Phase 3 remains unchanged; Phase 4 keeps its existing ship flow aside from mixed-model attribution guidance. 
+ +## Risks & Dependencies + +| Risk | Mitigation | +|------|------------| +| `--output-schema` only works with gpt-5 family models (bug #4181) | Document the model constraint; classify absent/malformed result JSON as task failure | +| Codex CLI flags change in future releases | Invocation is one concrete bash line — loud failure, easy to fix | +| Delegated mode stops on dirty trees, which may feel stricter than standard mode | Be explicit in the prompt: current checkout/worktree is fine, but it must be clean before delegated execution begins | +| Consent flow complexity in a skill that has no prior interactive prompting | Follow ce:review's pattern for platform-agnostic question tool usage | +| local.md YAML parsing has no utility — agent must parse inline | Provide clear parsing instructions; malformed YAML treated as absent (graceful degradation) | +| slfg interaction: swarm mode suppressed when delegation active | Delegation takes precedence; serial execution enforced. slfg users with delegation enabled will not get swarm mode — emit notification | +| `partial` results could otherwise leave the loop in an ambiguous state | Treat `partial` as local handoff for the same unit, require verify + commit before moving on, and count it toward the circuit breaker | + +## Sources & References + +- **Origin document:** [docs/brainstorms/2026-03-31-codex-delegation-requirements.md](docs/brainstorms/2026-03-31-codex-delegation-requirements.md) +- Related PR: #364 (ce-work-beta sandbox options — superseded) +- Related PR: #363 (ce-work-beta original delegation — superseded) +- Codex prompting: `~/.claude/plugins/marketplaces/openai-codex/plugins/codex/skills/gpt-5-4-prompting/` +- Printing-press pattern: `~/.claude/plugins/marketplaces/cli-printing-press/skills/printing-press/SKILL.md` +- Git state machine learning: `docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md` +- Beta promotion learning: 
`docs/solutions/skill-design/beta-promotion-orchestration-contract.md` +- Pass paths learning: `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md` diff --git a/docs/solutions/best-practices/codex-delegation-best-practices-2026-04-01.md b/docs/solutions/best-practices/codex-delegation-best-practices-2026-04-01.md new file mode 100644 index 00000000..c317fe1b --- /dev/null +++ b/docs/solutions/best-practices/codex-delegation-best-practices-2026-04-01.md @@ -0,0 +1,203 @@ +--- +title: "Codex Delegation Best Practices" +date: 2026-04-01 +category: best-practices +module: "Codex delegation / skill design" +problem_type: best_practice +component: tooling +severity: medium +applies_when: + - Designing delegation to external models (Codex, future delegates) in orchestrator skills + - Authoring or editing SKILL.md files where token cost matters + - Choosing whether to delegate plan execution or implement directly + - Writing delegation prompts for secondary agents +tags: + - codex-delegation + - token-economics + - skill-design + - batching + - orchestration-cost + - prompt-engineering + - ce-work-beta +--- + +# Codex Delegation Best Practices + +## Context + +Over six iterations of evaluation building Codex delegation into `ce-work-beta`, we collected quantitative data on the token economics of orchestrating work between Claude Code (the orchestrator) and Codex (the delegated executor). The core question: when does delegating plan units to Codex actually save Claude tokens, and what architectural patterns control the cost? + +The delegation model: `ce-work-beta` receives a plan with N implementation units, then decides whether to execute them directly (standard mode) or delegate them to Codex via `codex exec`. Delegation has a fixed orchestration overhead per batch (prompt file write, codex exec invocation, result classification, commit) of approximately 4-5k Claude tokens. Each unit of code Claude does not write saves roughly 3-5k tokens. 
The crossover depends on how many units are batched per delegation call. + +The evaluation spanned iterations 1-6, testing small (1-2 units), medium (4 units), large (7 units), and extra-large (10 units) plans in both delegation and standard modes, with real code implementation and test verification in isolated worktrees. + +--- + +## Guidance + +### Token Economics + +Delegation has a fixed orchestration cost per batch (~4-5k Claude tokens for prompt generation, codex exec, result classification, and commit) and a variable savings per unit (~3-5k Claude tokens of code-writing avoided). The crossover depends on how many units are batched per call. + +**Crossover by plan size:** + +| Plan size | Units | Delegate tokens | Standard tokens | Overhead | Verdict | +|-----------|-------|----------------|-----------------|----------|---------| +| Small (bug fix) | 1 | 51k | 38k | +34% | Not worth it for token savings | +| Small (new feature) | 1 | 63k | 42k | +50% | Not worth it for token savings | +| Medium | 4 | 54k | 53k | +2% | Marginal | +| Large | 7 | 62k | 62k | +1% | Break-even | +| Extra-large | 10 | 54k | 62k* | **-13%** | Delegation is cheaper | + +*Standard mode extrapolated from 7-unit baseline. The XL delegate cost (54k) is lower than the 7-unit standard cost (62k) because orchestration is amortized over more units per batch. + +**How it scales:** Each additional unit in a batch saves ~3-5k Claude tokens while adding zero orchestration cost. The orchestration is per-batch, not per-unit. A 10-unit plan in 2 batches costs ~8-10k in orchestration regardless of whether those batches contain 5 units or 50 lines of code each. + +**The crossover point is ~5-7 units.** Below that, orchestration overhead dominates. Above it, code-writing savings dominate. Users may still choose delegation below the crossover for cost arbitrage (Codex tokens are cheaper than Claude tokens) or coding preference. 
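
The per-batch orchestration arithmetic above can be checked with a few lines of code. This is an illustrative sketch only: `orchestrationCost` is a hypothetical helper, and the 4k-5k range is the document's own estimate of per-batch overhead, not a measured constant. It does not model the other costs that produce the observed 5-7 unit crossover.

```typescript
// Illustrative arithmetic; the per-batch range is the document's ~4-5k estimate.
const PER_BATCH_LOW = 4000;
const PER_BATCH_HIGH = 5000;

interface OrchestrationEstimate {
  batches: number;
  low: number;  // lower bound in Claude tokens
  high: number; // upper bound in Claude tokens
}

// Orchestration is paid once per codex exec call, so it scales with the
// number of batches, ceil(units / batchSize), not with the number of units.
function orchestrationCost(units: number, batchSize: number): OrchestrationEstimate {
  if (!Number.isInteger(units) || units < 0) {
    throw new RangeError("units must be a non-negative integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = Math.ceil(units / batchSize);
  return { batches, low: batches * PER_BATCH_LOW, high: batches * PER_BATCH_HIGH };
}

// A 10-unit plan in batches of 5: 2 batches, 8,000 to 10,000 tokens.
const batched = orchestrationCost(10, 5);
// The same plan delegated one unit at a time: 10 batches, 40,000 to 50,000 tokens.
const perUnit = orchestrationCost(10, 1);
console.log(batched, perUnit);
```

The gap between the two estimates is the amortization effect described above: the fixed cost is spread over every unit in the batch.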
+ +**Wall clock time cost:** Delegation is 1.7-2.2x slower due to codex exec latency: + +| Plan size | Delegate time | Standard time | Slowdown | +|-----------|---------------|---------------|----------| +| Medium (4 units) | 353s | 188s | 1.9x | +| Large (7 units) | 569s | 254s | 2.2x | +| Extra-large (10 units) | 574s | ~300s* | ~1.9x | + +**Test coverage cost:** Without explicit testing guidance in the prompt, Codex produces 15-43% fewer tests than Claude. Adding the `` section to the prompt closed this gap by ~35% on large plans (see Prompt Engineering section below). + +**Evolution across iterations:** + +| Iteration | Architecture | Medium delegate tokens | Change | +|-----------|-------------|----------------------|--------| +| 3 | Per-unit loop, all content in SKILL.md body (776 lines) | 58k | Baseline | +| 4 | Added optimizations to body (~810 lines) | 79k | +38% (worse: body growth overwhelmed savings) | +| 5 | Extracted to reference file, batched model (514 lines) | 61k | -23% from iter-4, back to baseline | +| 6 | Added `` to prompt | 54k | -7% from the iter-3 baseline (with better test quality) | + +The key lesson from iteration 4: adding content to the skill body increases cost on every tool call. Optimizations that save a few tool calls but add 50+ lines to the body can be net negative. + +### Skill Body Size is the Multiplicative Cost Driver + +The dominant formula: + +``` +total_token_cost ~ skill_body_lines x tokens_per_line x num_tool_calls +``` + +Reducing tool calls helps linearly. Reducing skill body size helps **multiplicatively** because it affects every remaining tool call for the entire session. In iteration 4, adding optimization instructions directly to the SKILL.md body caused a net token *increase* despite the optimizations being structurally sound: the larger body cost more on every subsequent tool call than the optimizations saved. + +**Threshold rule:** Move content to a reference file if it exceeds ~50 lines AND is only used in a minority of invocations. 
Keep always-needed content in the body. + +### Architecture Patterns That Reduce Cost (Ranked by Impact) + +**1. Extract conditional content to reference files.** +Moving delegation-specific content (~250 lines) from the SKILL.md body to `references/codex-delegation-workflow.md` shrank the skill from 776 to 514 lines. This saved ~15k Claude tokens per non-delegation run — a 34% body reduction affecting every tool call. The reference is loaded once, only when delegation is active. + +**2. Batch execution over per-unit execution.** +Sending all units (or groups of roughly 5) in a single `codex exec` call reduces orchestration from O(N) to O(ceil(N/batch_size)). For a 10-unit plan: 2 batches x ~4-5k = 8-10k orchestration vs 10 x 4-5k = 40-50k with per-unit delegation. + +**3. Delegate the verify/test-fix loop to Codex.** +In the original design, Codex wrote code and the orchestrator independently ran tests to verify. This doubled the verification cost — Claude re-ran the same tests Codex already ran, adding a tool call per batch and classification logic for "completed but verify failed" (a 6th signal in the result table). Moving verification into the delegation prompt ("run tests, fix failures, do not report completed unless tests pass") eliminates that round-trip. + +The safety net is the circuit breaker, not the orchestrator re-running tests. If Codex reports "completed" but the code is actually broken, the failure surfaces at one of three catch points: (1) the result schema — Codex reports "failed" or "partial" when it cannot get tests to pass, triggering rollback; (2) the circuit breaker — 3 consecutive failures disable delegation and fall back to standard mode where Claude implements with full Phase 2 testing guidance; (3) Phase 3 quality check — the full test suite runs before shipping regardless of execution mode. The orchestrator does not need to independently verify each batch because these layered catches prevent bad code from shipping. 
This is the key design insight: trust the delegate's self-report, protect against systematic failure with the circuit breaker, and verify the whole at the end. + +**4. Cache pre-delegation checks.** +Environment guard, CLI availability, and consent checks run once before the first batch, not per-unit or per-batch. These don't change mid-execution. + +**5. Batch scratch cleanup.** +Clean up `.context/` delegation artifacts at end-of-plan, not per-unit. Fewer tool calls, same outcome. + +### Plan Quality Enables Good Delegation Decisions + +Every delegation decision — whether to delegate, how to batch, what to include in the prompt — depends on what the plan file provides. The orchestrator can only be as smart as the plan it reads. + +| Plan signal | What it enables | +|-------------|----------------| +| Unit count and scope | The crossover decision (5-7 unit threshold) | +| File lists per unit | "Don't split units that share files" batching rule | +| Test scenarios per unit | Forwarded to Codex via the `` prompt section; thin plan scenarios produce thin Codex tests regardless of prompt engineering | +| Verification commands | Become the `` section; missing verification means Codex cannot confirm its own work | +| Triviality signals (Goal, Approach) | Whether delegation is considered at all ("config change" vs "recursive validation engine") | +| Dependencies between units | Batch boundary decisions for plans >5 units | + +A well-structured ce:plan output provides all of these. A hand-written requirements doc or TODO list may provide few or none — the delegation logic still works (the skill handles non-standard plans), but the decisions are less informed. For example, without explicit file lists, the batching rule cannot check for shared files; without test scenarios, the Codex prompt's `` section has nothing to supplement. + +This does not mean delegation requires ce:plan output. 
It means the quality of delegation improves proportionally with the structure of the plan. Users who invest in structured plans get smarter delegation decisions. Users with lightweight plans get delegation that works but makes conservative choices (e.g., single-batch everything, generic test guidance). + +### Prompt Engineering for Delegation Quality + +Without explicit testing guidance, Codex produces 15-43% fewer tests than Claude. Three prompt additions close this gap: + +**`` section** — Include Test Scenario Completeness guidance (happy path, edge cases, error paths, integration). This improved Codex test output by ~35% on large plans. Codex implements what the prompt asks; it does not infer quality standards from context. + +**Combined `` command** — Require running ALL test files in a single command, not per-file. Per-file verification misses cross-file contamination — observed in eval when mocked `globalThis.fetch` in one test file leaked into integration tests running in the same bun process. + +**Light system-wide check** — "If your changes touch callbacks, middleware, or event handlers, verify the interaction chain end-to-end." One sentence that catches architectural issues Codex would otherwise miss. + +### Batching Strategy + +Delegate all units in one batch. If the plan exceeds 5 units, split into batches of roughly 5 — never splitting units that share files. Skip delegation entirely if every unit is trivial. + +Between batches: report progress and continue immediately unless the user intervenes. The checkpoint exists so the user *can* steer, not so they *must*. 
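The batching rule above can be sketched in a few lines. This is a minimal illustration, not the skill's actual logic: `plan_batches`, the `(unit_id, files)` tuple shape, and the default `batch_size=5` are hypothetical names chosen for the example. It assumes units arrive in plan order with explicit file lists, and it ignores the plan's own phase boundaries.

```python
from collections import defaultdict


def plan_batches(units, batch_size=5):
    """Group units into batches of roughly `batch_size`.

    `units` is a list of (unit_id, [file paths]) in plan order
    (a hypothetical shape for this sketch). Units that share a file,
    directly or transitively, always land in the same batch.
    """
    # Plans at or under the batch size go out as a single batch.
    if len(units) <= batch_size:
        return [[unit_id for unit_id, _ in units]]

    # Union-find over unit indexes: units touching the same file are merged.
    parent = list(range(len(units)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_owner = {}
    for idx, (_, files) in enumerate(units):
        for path in files:
            if path in first_owner:
                parent[find(idx)] = find(first_owner[path])
            else:
                first_owner[path] = idx

    # Collect clusters, ordered by the first unit that appears in each.
    clusters = defaultdict(list)
    for idx, (unit_id, _) in enumerate(units):
        clusters[find(idx)].append(unit_id)

    # Greedily pack whole clusters; a cluster is never split, so a batch
    # may exceed batch_size when one cluster alone is larger than it.
    batches, current = [], []
    for members in clusters.values():
        if current and len(current) + len(members) > batch_size:
            batches.append(current)
            current = []
        current.extend(members)
    if current:
        batches.append(current)
    return batches
```

One design consequence worth noting: because clusters are packed whole, interleaved shared-file units can be pulled forward out of strict plan order, and an oversized cluster produces a batch larger than 5. Both are consistent with "roughly 5, never splitting units that share files", but a real implementation would also need to respect declared dependencies between units.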
+ +### User Choice Matters + +Users may prefer delegation even when it is not optimal for Claude token savings: + +- **Cost arbitrage** — Codex tokens may be cheaper on their usage plan +- **Coding preference** — they may prefer Codex's implementation style for certain tasks +- **Usage conservation** — they may want to conserve Claude Code usage specifically + +The `work_delegate_decision` setting (`auto`/`ask`) supports this. In `ask` mode, the skill presents a recommendation with rationale but lets the user override. When recommending against delegation: "Codex delegation active, but these are small changes where the cost of delegating outweighs having Claude Code do them." The user can still choose "Delegate to Codex anyway." + +--- + +## Why This Matters + +The naive assumption — that offloading work to a secondary agent always saves the orchestrator tokens — is wrong for small workloads and only becomes true past a specific threshold. Without this data, skill authors will either avoid delegation entirely (missing savings on large plans) or apply it universally (wasting tokens on small plans). The 5-7 unit crossover, derived from six evaluation iterations with real token counts, provides a concrete decision boundary. + +The discovery that skill body size is a multiplicative cost driver changes how skills should be authored across the entire plugin. Every line in a SKILL.md body is paid for on every tool call in the session. This makes "extract rarely-used content to reference files" one of the highest-leverage optimizations available to skill authors, and it reframes the instinct to add helpful content to a skill body as a potential anti-pattern when that content is conditional. + +--- + +## When to Apply + +- **Designing delegation in any orchestrator skill:** Use the 5-7 unit crossover as the threshold. Below it, prefer direct execution unless the user explicitly requests delegation. 
+- **Authoring or editing any SKILL.md:** Audit for conditional content blocks exceeding ~50 lines. If they apply to a minority of invocations, extract to reference files. +- **Adding optimization or guidance content to a skill:** Measure whether the added body size costs more per-call than the optimization saves. If content is only relevant to a specific execution path, it belongs in a reference file. +- **Writing delegation prompts:** Include explicit testing completeness guidance and require unified test execution. Do not assume the delegated agent will infer quality standards. +- **Choosing batch sizes:** Use batches of up to roughly 5 units, never splitting units that share files. + +--- + +## Examples + +**Skill body size impact — iteration 4 regression:** + +Iteration 3: SKILL.md at 776 lines. Medium plan (4 units) delegated cost 58k Claude tokens. +Iteration 4: Added optimization content to body, SKILL.md grew to ~810 lines. Same plan cost 79k tokens (+38%) despite fewer tool calls. The optimization content was sound but the body growth overwhelmed the savings. +Iteration 5: Extracted delegation to reference file, SKILL.md back to 514 lines. Same plan cost 61k tokens — back to iter-3 levels with more features. + +**Delegation decision examples:** + +3-unit plan, all implementation: +> Standard mode recommended. These 3 units are below the efficiency threshold. Direct execution uses fewer Claude tokens. + +8-unit plan, mixed implementation and tests: +> Delegate. Batch into [units 1-5] and [units 6-8], keeping shared-file units together. Pre-delegation checks run once. Progress reported between batches. + +4-unit plan, all config/renames: +> Skip delegation. All units are trivial — orchestration overhead exceeds any benefit. + +4-unit plan, user explicitly requests delegation: +> Delegate despite marginal economics. User preference is respected. One batch, standard flow. 
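The four decision examples above can be read as a small decision function. The sketch below is an assumption-laden illustration: `recommend_execution` and its return strings are hypothetical, and the `crossover=5` default simply takes the low end of the 5-7 unit range reported in this document rather than a value the skill defines.

```python
def recommend_execution(unit_count, all_trivial=False, user_requested=False, crossover=5):
    """Return a recommendation string for a plan (hypothetical helper).

    `crossover` is the unit count at which delegation starts to pay off;
    5 is an assumed default taken from the low end of the 5-7 range.
    """
    # An explicit user request wins, even when the economics are marginal.
    if user_requested:
        return "delegate"
    # All-trivial plans skip delegation: orchestration overhead dominates.
    if all_trivial:
        return "skip"
    # Below the crossover, direct execution uses fewer orchestrator tokens.
    if unit_count < crossover:
        return "standard"
    return "delegate"
```

Checking the explicit request first mirrors the "Delegate to Codex anyway" override; a real implementation would likely also return a rationale for the `ask` mode prompt.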
+ +--- + +## Related + +- [Codex delegation requirements](../../brainstorms/2026-03-31-codex-delegation-requirements.md) — origin requirements defining the delegation flow +- [Codex delegation implementation plan](../../plans/2026-03-31-001-feat-codex-delegation-plan.md) — implementation plan with prompt template and circuit breaker design +- [Pass paths not content to subagents](../skill-design/pass-paths-not-content-to-subagents-2026-03-26.md) — foundational token efficiency pattern for multi-agent orchestration +- [Script-first skill architecture](../skill-design/script-first-skill-architecture.md) — complementary token reduction pattern (60-75% savings by moving processing to scripts) +- [Agent-friendly CLI principles](../agent-friendly-cli-principles.md) — CLI design principles relevant to how `codex exec` is consumed diff --git a/docs/solutions/skill-design/ce-work-beta-promotion-checklist-2026-03-31.md b/docs/solutions/skill-design/ce-work-beta-promotion-checklist-2026-03-31.md new file mode 100644 index 00000000..298ed22b --- /dev/null +++ b/docs/solutions/skill-design/ce-work-beta-promotion-checklist-2026-03-31.md @@ -0,0 +1,106 @@ +--- +title: "ce:work-beta promotion needs manual-handoff cleanup and contract migration" +category: skill-design +date: 2026-03-31 +module: plugins/compound-engineering/skills +component: SKILL.md +tags: + - skill-design + - beta-testing + - workflow + - rollout-safety +severity: medium +description: "Promoting ce:work-beta requires more than copying SKILL.md content: stable handoffs, contract tests, beta-only wording, and planning neutrality must all flip together." +related: + - docs/solutions/skill-design/beta-skills-framework.md + - docs/solutions/skill-design/beta-promotion-orchestration-contract.md +--- + +## Problem + +`ce:work-beta` is intentionally a manual-invocation beta skill. 
During beta, `ce:plan`, `ce:brainstorm`, `lfg`, `slfg`, and other workflow handoffs remain pointed at stable `ce:work` so the repo does not need to support two execution paths at once. + +That means promoting `ce:work-beta` to stable is not just a content copy. The rollout flips multiple contracts at once: + +- the active implementation surface moves from `ce:work-beta` to `ce:work` +- beta-only manual invocation caveats become wrong +- planner and workflow handoffs can start acknowledging the promoted path +- tests need to assert the stable surface, not the beta surface + +If those changes do not happen together, the repo ends up teaching the wrong skill, keeping stale beta caveats, or preserving duplicate active paths that drift apart. + +## Current Beta Limitation + +During beta, the intended behavior is: + +- `ce:work-beta` contains the experimental implementation +- users invoke `ce:work-beta` manually when they want the new behavior +- `ce:plan` stays neutral and continues to offer stable `ce:work` +- workflow orchestrators stay pointed at stable `ce:work` + +This limitation is deliberate. It avoids pushing beta-specific branching into every planning and orchestration surface. + +## Promotion Checklist + +When `ce:work-beta` is ready to promote: + +1. Copy the validated implementation from `plugins/compound-engineering/skills/ce-work-beta/SKILL.md` into `plugins/compound-engineering/skills/ce-work/SKILL.md`. +2. Restore stable frontmatter on `ce:work`: + - stable `name:` + - stable description without `[BETA]` + - remove `disable-model-invocation: true` +3. Remove beta-only manual invocation wording from the promoted stable skill. +4. Rework or remove `ce:work-beta` so it no longer looks like an active parallel implementation: + - delete it, or + - reduce it to a thin redirect/deprecation note +5. Update planning and workflow handoffs atomically: + - `ce:plan` + - `ce:brainstorm` + - any other skills or workflows that recommend or invoke `ce:work` +6. 
Revisit planner wording so it can safely mention the promoted stable behavior if needed. +7. Move contract tests from the beta surface to the stable surface. +8. Re-run release validation and any workflow-level tests that exercise the handoff chain. + +## Unique Gotchas + +### Manual-invocation caveats must be removed + +The beta skill intentionally says it must be invoked manually and that handoffs remain pointed at stable `ce:work`. After promotion, that wording becomes false and will actively mislead users. + +### `ce:plan` should stay neutral during beta, then flip intentionally + +While beta is manual-only, `ce:plan` should not teach beta-only invocation details. After promotion, the planner can acknowledge the promoted stable path, but that should happen in the promotion PR, not earlier. + +### Test ownership must migrate + +During beta, contract tests should assert delegation behavior on `ce:work-beta`. After promotion, those assertions belong on `ce:work`. Copying the skill content without moving the tests leaves the wrong surface protected. + +### Do not leave two active delegation paths + +If both `ce:work` and `ce:work-beta` retain live delegation logic after promotion, they will drift. Promotion should end with exactly one canonical implementation surface. + +### Promotion is both a beta-to-stable change and an orchestration change + +This promotion is unusual because the beta skill was intentionally isolated from workflow handoffs. The promotion PR must therefore do both: + +- normal beta-to-stable file/content promotion +- workflow contract cleanup now that the stable surface can own the feature + +See `docs/solutions/skill-design/beta-promotion-orchestration-contract.md` for the caller-update principle. 
+ +## Verification + +Before merging the promotion PR, confirm: + +- stable `ce:work` contains the implementation +- `ce:work-beta` no longer reads like the active implementation path +- no beta-only manual invocation caveats remain on the stable path +- workflow handoffs point where intended +- contract tests assert the right surface +- release validation passes + +## Prevention + +- Treat `ce:work-beta` promotion as a coordinated workflow change, not just a text replacement. +- Update skill content, planner wording, workflow handoffs, and tests in the same PR. +- Leave a durable note like this one at beta time so later promotion work does not rely on memory. diff --git a/plugins/compound-engineering/AGENTS.md b/plugins/compound-engineering/AGENTS.md index 0cefbd52..e962be9e 100644 --- a/plugins/compound-engineering/AGENTS.md +++ b/plugins/compound-engineering/AGENTS.md @@ -132,7 +132,7 @@ Why: shell-heavy exploration causes avoidable permission prompts in sub-agent wo - [ ] Never instruct agents to use `find`, `ls`, `cat`, `head`, `tail`, `grep`, `rg`, `wc`, or `tree` through a shell for routine file discovery, content search, or file reading - [ ] Describe tools by capability class with platform hints — e.g., "Use the native file-search/glob tool (e.g., Glob in Claude Code)" — not by Claude Code-specific tool names alone -- [ ] When shell is the only option (e.g., `ast-grep`, `bundle show`, git commands), instruct one simple command at a time — no chaining (`&&`, `||`, `;`) and no error suppression (`2>/dev/null`, `|| true`). Simple pipes (e.g., `| jq .field`) and output redirection (e.g., `> file`) are acceptable when they don't obscure failures +- [ ] When shell is the only option (e.g., `ast-grep`, `bundle show`, git commands), instruct one simple command at a time — no action chaining (`cmd1 && cmd2`, `cmd1 ; cmd2`) and no error suppression (`2>/dev/null`, `|| true`). 
Boolean conditions within if/while guards (`[ -n "$X" ] || [ -n "$Y" ]`) are fine — that is normal conditional logic, not action chaining. Simple pipes (e.g., `| jq .field`) and output redirection (e.g., `> file`) are acceptable when they don't obscure failures - [ ] Do not encode shell recipes for routine exploration when native tools can do the job; encode intent and preferred tool classes instead - [ ] For shell-only workflows (e.g., `gh`, `git`, `bundle show`, project CLIs), explicit command examples are acceptable when they are simple, task-scoped, and not chained together diff --git a/plugins/compound-engineering/compound-engineering.local.example.md b/plugins/compound-engineering/compound-engineering.local.example.md new file mode 100644 index 00000000..32fe3447 --- /dev/null +++ b/plugins/compound-engineering/compound-engineering.local.example.md @@ -0,0 +1,12 @@ +--- +# Codex Delegation Settings +# Copy to .claude/compound-engineering.local.md in your project root. +# All settings are optional. Invalid values fall through to defaults. 
+ +# work_delegate: codex # codex | false (default: false) +# work_delegate_consent: true # true | false (default: false) +# work_delegate_sandbox: yolo # yolo | full-auto (default: yolo) +# work_delegate_decision: auto # auto | ask (default: auto) +# work_delegate_model: gpt-5.4 # any valid codex model (default: gpt-5.4) +# work_delegate_effort: high # minimal | low | medium | high | xhigh (default: high) +--- diff --git a/plugins/compound-engineering/skills/ce-plan/SKILL.md b/plugins/compound-engineering/skills/ce-plan/SKILL.md index 847e35f6..b6e18870 100644 --- a/plugins/compound-engineering/skills/ce-plan/SKILL.md +++ b/plugins/compound-engineering/skills/ce-plan/SKILL.md @@ -172,7 +172,6 @@ Look for signals such as: - The user explicitly asks for TDD, test-first, or characterization-first work - The origin document calls for test-first implementation or exploratory hardening of legacy code - Local research shows the target area is legacy, weakly tested, or historically fragile, suggesting characterization coverage before changing behavior -- The user asks for external delegation, says "use codex", "delegate mode", or mentions token conservation -- add `Execution target: external-delegate` to implementation units that are pure code writing When the signal is clear, carry it forward silently in the relevant implementation units. 
@@ -337,7 +336,7 @@ For each unit, include: - **Dependencies** - what must exist first - **Files** - exact file paths to create, modify, or test - **Approach** - key decisions, data flow, component boundaries, or integration notes -- **Execution note** - optional, only when the unit benefits from a non-default execution posture such as test-first, characterization-first, or external delegation +- **Execution note** - optional, only when the unit benefits from a non-default execution posture such as test-first or characterization-first - **Technical design** - optional pseudo-code or diagram when the unit's approach is non-obvious and prose alone would leave it ambiguous. Frame explicitly as directional guidance, not implementation specification - **Patterns to follow** - existing code or conventions to mirror - **Test scenarios** - enumerate the specific test cases the implementer should write, right-sized to the unit's complexity and risk. Consider each category below and include scenarios from every category that applies to this unit. A simple config change may need one scenario; a payment flow may need a dozen. The quality signal is specificity — each scenario should name the input, action, and expected outcome so the implementer doesn't have to invent coverage. For units with no behavioral change (pure config, scaffolding, styling), use `Test expectation: none -- [reason]` instead of leaving the field blank. @@ -353,7 +352,6 @@ Use `Execution note` sparingly. Good uses include: - `Execution note: Start with a failing integration test for the request/response contract.` - `Execution note: Add characterization coverage before modifying this legacy parser.` - `Execution note: Implement new domain behavior test-first.` -- `Execution note: Execution target: external-delegate` Do not expand units into literal `RED/GREEN/REFACTOR` substeps. 
@@ -492,7 +490,7 @@ deepened: YYYY-MM-DD # optional, set when the confidence check substantively st **Approach:** - [Key design or sequencing decision] -**Execution note:** [Optional test-first, characterization-first, external-delegate, or other execution posture signal] +**Execution note:** [Optional test-first, characterization-first, or other execution posture signal] **Technical design:** *(optional -- pseudo-code or diagram when the unit's approach is non-obvious. Directional guidance, not implementation specification.)* diff --git a/plugins/compound-engineering/skills/ce-work-beta/SKILL.md b/plugins/compound-engineering/skills/ce-work-beta/SKILL.md index 8063763f..82b76afb 100644 --- a/plugins/compound-engineering/skills/ce-work-beta/SKILL.md +++ b/plugins/compound-engineering/skills/ce-work-beta/SKILL.md @@ -2,7 +2,7 @@ name: ce:work-beta description: "[BETA] Execute work with external delegate support. Same as ce:work but includes experimental Codex delegation mode for token-conserving code implementation." disable-model-invocation: true -argument-hint: "[Plan doc path or description of work. Blank to auto use latest plan doc]" +argument-hint: "[Plan doc path or description of work. Blank to auto use latest plan doc] [delegate:codex]" --- # Work Execution Command @@ -13,10 +13,54 @@ Execute work efficiently while maintaining quality and finishing features. This command takes a work document (plan, specification, or todo file) or a bare prompt describing the work, and executes it systematically. The focus is on **shipping complete features** by understanding requirements quickly, following existing patterns, and maintaining quality throughout. +**Beta rollout note:** Invoke `ce:work-beta` manually when you want to trial Codex delegation. During the beta period, planning and workflow handoffs remain pointed at stable `ce:work` to avoid dual-path orchestration complexity. 
+ ## Input Document #$ARGUMENTS +## Argument Parsing + +Parse `$ARGUMENTS` for the following optional tokens. Strip each recognized token before interpreting the remainder as the plan file path or bare prompt. + +| Token | Example | Effect | +|-------|---------|--------| +| `delegate:codex` | `delegate:codex` | Activate Codex delegation mode for plan execution | +| `delegate:local` | `delegate:local` | Deactivate delegation even if enabled in local.md | + +All tokens are optional. When absent, fall back to the resolution chain below. + +**Fuzzy activation:** Also recognize imperative delegation-intent phrases such as "use codex", "delegate to codex", "codex mode", or "delegate mode" as equivalent to `delegate:codex`. A bare mention of "codex" in a prompt (e.g., "fix codex converter bugs") must NOT activate delegation -- only clear delegation intent triggers it. + +**Fuzzy deactivation:** Also recognize phrases such as "no codex", "local mode", "standard mode" as equivalent to `delegate:local`. + +### Settings Resolution Chain + +After extracting tokens from arguments, resolve the delegation state using this precedence chain: + +1. **Argument flag** -- `delegate:codex` or `delegate:local` from the current invocation (highest priority) +2. **local.md setting** -- Read `.claude/compound-engineering.local.md` and extract `work_delegate` from YAML frontmatter. Value `codex` activates delegation; `false` deactivates. +3. **Hard default** -- `false` (delegation off) + +To read local.md: open the file, extract content between the opening and closing `---` delimiters (YAML frontmatter), and interpret the keys. If the file is missing, empty, has malformed frontmatter, or any setting has an unrecognized value, fall through to the hard default for that setting. 
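The precedence chain above can be sketched as follows. This is an illustrative sketch only: `read_frontmatter` and `resolve_delegation` are hypothetical names, and the frontmatter reader is a deliberately naive `key: value` parser using only the standard library, not a full YAML implementation.

```python
def read_frontmatter(text):
    """Extract flat key/value pairs between the opening and closing '---'.

    Returns an empty dict when the frontmatter is missing or never closed,
    so callers fall through to defaults. Naive on purpose: no nesting,
    no quoting rules, and '#' always starts a comment.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    settings = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return settings
        body = line.split("#", 1)[0]  # drop trailing comments
        if ":" in body:
            key, value = body.split(":", 1)
            settings[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as malformed


def resolve_delegation(arg_flag, local_md_text):
    """Return (delegation_active, delegation_source) per the chain above."""
    # 1. Argument flag from the current invocation has highest priority.
    if arg_flag == "delegate:codex":
        return True, "argument"
    if arg_flag == "delegate:local":
        return False, "argument"
    # 2. local.md setting; unrecognized values fall through.
    value = read_frontmatter(local_md_text or "").get("work_delegate")
    if value == "codex":
        return True, "local.md"
    if value == "false":
        return False, "local.md"
    # 3. Hard default: delegation off.
    return False, "default"
```

Passing `None` for `local_md_text` stands in for a missing file. Returning the source alongside the boolean keeps the information the environment guard needs to decide how verbose its notification should be.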
+ +Also read from local.md when present: +- `work_delegate_consent` -- `true` or default `false` +- `work_delegate_sandbox` -- `yolo` (default) or `full-auto` +- `work_delegate_decision` -- `auto` (default) or `ask` +- `work_delegate_model` -- Codex model to use (default `gpt-5.4`). Passthrough — any valid model name accepted. +- `work_delegate_effort` -- `minimal`, `low`, `medium`, `high` (default), or `xhigh` + +Store the resolved state for downstream consumption: +- `delegation_active` -- boolean, whether delegation mode is on +- `delegation_source` -- `argument` or `local.md` or `default` -- how delegation was resolved (used by environment guard to decide notification verbosity) +- `sandbox_mode` -- `yolo` or `full-auto` (from local.md or default `yolo`) +- `consent_granted` -- boolean (from local.md `work_delegate_consent`) +- `delegate_model` -- string (from local.md or default `gpt-5.4`) +- `delegate_effort` -- string (from local.md or default `high`) + +--- + ## Execution Workflow ### Phase 0: Input Triage @@ -126,6 +170,8 @@ Determine how to proceed based on what was provided in ``. 4. **Choose Execution Strategy** + **Delegation routing gate:** If `delegation_active` is true AND the input is a plan file (not a bare prompt), read `references/codex-delegation-workflow.md` and follow its Pre-Delegation Checks and Delegation Decision flow. If all checks pass and delegation proceeds, force **serial execution** and proceed directly to Phase 2 using the workflow's batched execution loop. If any check disables delegation, fall through to the standard strategy table below. If delegation is active but the input is a bare prompt (no plan file), set `delegation_active` to false with a brief note: "Codex delegation requires a plan file -- using standard mode." and continue with the standard strategy selection below. 
+ After creating the task list, decide how to execute based on the plan's size and dependency structure: | Strategy | When to use | @@ -156,7 +202,9 @@ Determine how to proceed based on what was provided in ``. - Read any referenced files from the plan or discovered during Phase 0 - Look for similar patterns in codebase - Find existing test files for implementation files being changed (Test Discovery — see below) - - Implement following existing conventions + - If delegation_active: branch to the Codex Delegation Execution Loop + (see `references/codex-delegation-workflow.md`) + - Otherwise: implement following existing conventions - Add, update, or remove tests to match implementation changes (see Test Discovery below) - Run System-Wide Test Check (see below) - Run tests after changes @@ -385,92 +433,15 @@ Determine how to proceed based on what was provided in ``. ## Swarm Mode with Agent Teams (Optional) -For genuinely large plans where agents need to communicate with each other, challenge approaches, or coordinate across 10+ tasks with persistent specialized roles, use agent team capabilities if available (e.g., Agent Teams in Claude Code, multi-agent workflows in Codex). - -**Agent teams are typically experimental and require opt-in.** Do not attempt to use agent teams unless the user explicitly requests swarm mode or agent teams, and the platform supports it. +For genuinely large plans (10+ tasks) where agents need persistent specialized roles and inter-agent communication, read `references/swarm-mode.md`. Agent teams are experimental and require explicit user opt-in. 
-### When to Use Agent Teams vs Subagents - -| Agent Teams | Subagents (standard mode) | -|-------------|---------------------------| -| Agents need to discuss and challenge each other's approaches | Each task is independent — only the result matters | -| Persistent specialized roles (e.g., dedicated tester running continuously) | Workers report back and finish | -| 10+ tasks with complex cross-cutting coordination | 3-8 tasks with clear dependency chains | -| User explicitly requests "swarm mode" or "agent teams" | Default for most plans | - -Most plans should use subagent dispatch from standard mode. Agent teams add significant token cost and coordination overhead — use them when the inter-agent communication genuinely improves the outcome. - -### Agent Teams Workflow - -1. **Create team** — use your available team creation mechanism -2. **Create task list** — parse Implementation Units into tasks with dependency relationships -3. **Spawn teammates** — assign specialized roles (implementer, tester, reviewer) based on the plan's needs. Give each teammate the plan file path and their specific task assignments -4. **Coordinate** — the lead monitors task completion, reassigns work if someone gets stuck, and spawns additional workers as phases unblock -5. **Cleanup** — shut down all teammates, then clean up the team resources +**Mutual exclusion with Codex delegation:** Delegation mode and swarm mode are mutually exclusive. When `delegation_active` is true, emit: "Codex delegation active -- swarm mode unavailable." --- -## External Delegate Mode (Optional) - -For plans where token conservation matters, delegate code implementation to an external delegate (currently Codex CLI) while keeping planning, review, and git operations in the current agent. 
- -This mode integrates with the existing Phase 1 Step 4 strategy selection as a **task-level modifier** - the strategy (inline/serial/parallel) still applies, but the implementation step within each tagged task delegates to the external tool instead of executing directly. - -### When to Use External Delegation - -| External Delegation | Standard Mode | -|---------------------|---------------| -| Task is pure code implementation | Task requires research or exploration | -| Plan has clear acceptance criteria | Task is ambiguous or needs iteration | -| Token conservation matters (e.g., Max20 plan) | Unlimited plan or small task | -| Files to change are well-scoped | Changes span many interconnected files | - -### Enabling External Delegation - -External delegation activates when any of these conditions are met: -- The user says "use codex for this work", "delegate to codex", or "delegate mode" -- A plan implementation unit contains `Execution target: external-delegate` in its Execution note (set by ce:plan) - -The specific delegate tool is resolved at execution time. Currently the only supported delegate is Codex CLI. Future delegates can be added without changing plan files. - -### Environment Guard - -Before attempting delegation, check whether the current agent is already running inside a delegate's sandbox. Delegation from within a sandbox will fail silently or recurse. - -Check for known sandbox indicators: -- `CODEX_SANDBOX` environment variable is set -- `CODEX_SESSION_ID` environment variable is set -- The filesystem is read-only at `.git/` (Codex sandbox blocks git writes) - -If any indicator is detected, print "Already running inside a delegate sandbox - using standard mode." and proceed with standard execution for that task. - -### External Delegation Workflow - -When external delegation is active, follow this workflow for each tagged task. Do not skip delegation because a task seems "small", "simple", or "faster inline". 
The user or plan explicitly requested delegation. - -1. **Check availability** - - Verify the delegate CLI is installed. If not found, print "Delegate CLI not installed - continuing with standard mode." and proceed normally. - -2. **Build prompt** — For each task, assemble a prompt from the plan's implementation unit (Goal, Files, Approach, Conventions from project CLAUDE.md/AGENTS.md). Include rules: no git commits, no PRs, run `git status` and `git diff --stat` when done. Never embed credentials or tokens in the prompt - pass auth through environment variables. - -3. **Write prompt to file** — Save the assembled prompt to a unique temporary file to avoid shell quoting issues and cross-task races. Use a unique filename per task. - -4. **Delegate** — Run the delegate CLI, piping the prompt file via stdin (not argv expansion, which hits `ARG_MAX` on large prompts). Omit the model flag to use the delegate's default model, which stays current without manual updates. - -5. **Review diff** — After the delegate finishes, verify the diff is non-empty and in-scope. Run the project's test/lint commands. If the diff is empty or out-of-scope, fall back to standard mode for that task. - -6. **Commit** — The current agent handles all git operations. The delegate's sandbox blocks `.git/index.lock` writes, so the delegate cannot commit. Stage changes and commit with a conventional message. - -7. **Error handling** — On any delegate failure (rate limit, error, empty diff), fall back to standard mode for that task. Track consecutive failures - after 3 consecutive failures, disable delegation for remaining tasks and print "Delegate disabled after 3 consecutive failures - completing remaining tasks in standard mode." 
- -### Mixed-Model Attribution - -When some tasks are executed by the delegate and others by the current agent, use the following attribution in Phase 4: +## Codex Delegation Mode -- If all tasks used the delegate: attribute to the delegate model -- If all tasks used standard mode: attribute to the current agent's model -- If mixed: use `Generated with [CURRENT_MODEL] + [DELEGATE_MODEL] via [HARNESS]` and note which tasks were delegated in the PR description +When `delegation_active` is true after argument parsing, read `references/codex-delegation-workflow.md` for the complete delegation workflow: pre-checks, batching, prompt template, execution loop, and result classification. --- diff --git a/plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md b/plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md new file mode 100644 index 00000000..3e76a5f6 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md @@ -0,0 +1,282 @@ +# Codex Delegation Workflow + +When `delegation_active` is true, code implementation is delegated to the Codex CLI (`codex exec`) instead of being implemented directly. The orchestrating Claude Code agent retains control of planning, review, git operations, and orchestration. + +## Delegation Decision + +If `work_delegate_decision` is `ask`, present the recommendation before proceeding. + +**When recommending Codex delegation:** + +> "Codex delegation active. [N] implementation units -- delegating in one batch." +> 1. Delegate to Codex *(recommended)* +> 2. Execute with Claude Code instead + +**When recommending Codex delegation, multiple batches:** + +> "Codex delegation active. [N] implementation units -- delegating in [X] batches." +> 1. Delegate to Codex *(recommended)* +> 2. 
Execute with Claude Code instead + +**When recommending Claude Code (all units are trivial):** + +> "Codex delegation active, but these are small changes where the cost of delegating outweighs having Claude Code do them." +> 1. Execute with Claude Code *(recommended)* +> 2. Delegate to Codex anyway + +If `work_delegate_decision` is `auto` (the default), state the execution plan in one line and proceed without waiting: "Codex delegation active. Delegating [N] units in [X] batch(es)." or "Codex delegation active. All units are trivial -- executing with Claude Code." + +## Pre-Delegation Checks + +Run these checks **once before the first batch**. If any check fails, fall back to standard mode for the remainder of the plan execution. Do not re-run on subsequent batches. + +**0. Platform Gate** + +Codex delegation is only supported when the orchestrating agent is running in Claude Code. If the current session is Codex, Gemini CLI, OpenCode, or any other platform, set `delegation_active` to false and proceed in standard mode. + +**1. Environment Guard** + +Check whether the current agent is already running inside a Codex sandbox: + +```bash +if [ -n "$CODEX_SANDBOX" ] || [ -n "$CODEX_SESSION_ID" ]; then + echo "inside_sandbox=true" +else + echo "inside_sandbox=false" +fi +``` + +If `inside_sandbox` is true, delegation would recurse or fail. + +- If `delegation_source` is `argument`: emit "Already inside Codex sandbox -- using standard mode." and set `delegation_active` to false. +- If `delegation_source` is `local.md` or `default`: set `delegation_active` to false silently. + +**2. Availability Check** + +```bash +command -v codex +``` + +If the Codex CLI is not on PATH: emit "Codex CLI not found -- using standard mode." and set `delegation_active` to false. + +**3. Consent Flow** + +If `consent_granted` is not true (from local.md `work_delegate_consent`): + +Present a one-time consent warning using the platform's blocking question tool (AskUserQuestion in Claude Code). 
The consent warning explains: +- Delegation sends implementation units to `codex exec` as a structured prompt +- **yolo mode** (`--yolo`): Full system access including network. Required for verification steps that run tests or install dependencies. **Recommended.** +- **full-auto mode** (`--full-auto`): Workspace-write sandbox, no network access. + +Present the sandbox mode choice: (1) yolo (recommended), (2) full-auto. + +On acceptance: +- Write `work_delegate_consent: true` and `work_delegate_sandbox: ` to `.claude/compound-engineering.local.md` YAML frontmatter +- To write local.md: (1) if file does not exist, create it with YAML frontmatter wrapper; (2) if file exists with valid frontmatter, merge new keys preserving existing keys; (3) if file exists without frontmatter or with malformed frontmatter, prepend a valid frontmatter block and preserve existing body content below the closing `---` +- Update `consent_granted` and `sandbox_mode` in the resolved state + +On decline: +- Ask whether to disable delegation entirely for this project +- If yes: write `work_delegate: false` to local.md, set `delegation_active` to false, proceed in standard mode +- If no: set `delegation_active` to false for this invocation only, proceed in standard mode + +**Headless consent:** If running in a headless or non-interactive context, delegation proceeds only if `work_delegate_consent` is already `true` in local.md. If consent is not recorded, set `delegation_active` to false silently. + +## Batching + +Delegate all units in one batch. If the plan exceeds 5 units, split into batches at the plan's own phase boundaries, or in groups of roughly 5 -- never splitting units that share files. Skip delegation entirely if every unit is trivial. + +## Prompt Template + +At the start of delegated execution, generate a short unique run ID (e.g., 8 hex chars from a timestamp or random source). All scratch files for this invocation go under `.context/compound-engineering/codex-delegation//`. 
Create the directory if it does not exist. + +Before each batch, write a prompt file to `.context/compound-engineering/codex-delegation//prompt-batch-.md`. + +Build the prompt from the batch's implementation units using these XML-tagged sections: + +```xml + +[For a single-unit batch: Goal from the implementation unit. +For a multi-unit batch: list each unit with its Goal, stating the concrete +job, repository context, and expected end state for each.] + + + +[Combined file list from all units in the batch -- files to create, modify, or read.] + + + +[File paths from all units' "Patterns to follow" fields. If no patterns: +"No explicit patterns referenced -- follow existing conventions in the +modified files."] + + + +[For a single-unit batch: Approach from the unit. +For a multi-unit batch: list each unit's approach, noting dependencies +and suggested ordering.] + + + +- Do NOT run git commit, git push, or create PRs -- the orchestrating agent handles all git operations +- Restrict all modifications to files within the repository root +- Keep changes tightly scoped to the stated task -- avoid unrelated refactors, renames, or cleanup +- Resolve the task fully before stopping -- do not stop at the first plausible answer +- If you discover mid-execution that you need to modify files outside the repo root, complete what you can within the repo and report what you could not do via the result schema issues field + + + +Before writing tests, check whether the plan's test scenarios cover all +categories that apply to each unit. Supplement gaps before writing tests: +- Happy path: core input/output pairs from each unit's goal +- Edge cases: boundary values, empty/nil inputs, type mismatches +- Error/failure paths: invalid inputs, permission denials, downstream failures +- Integration: cross-layer scenarios that mocks alone won't prove + +Write tests that name specific inputs and expected outcomes. 
If your changes +touch code with callbacks, middleware, or event handlers, verify the +interaction chain works end-to-end. + + + +After implementing, run ALL test files together in a single command (not +per-file). Cross-file contamination (e.g., mocked globals leaking between +test files) only surfaces when tests run in the same process. If tests +fail, fix the issues and re-run until they pass. Do not report status +"completed" unless verification passes. This is your responsibility -- +the orchestrator will not re-run verification independently. + +[Test and lint commands from the project. Use the union of all units' +verification commands as a single combined invocation.] + + + +Report your result via the --output-schema mechanism. Fill in every field: +- status: "completed" ONLY if all changes were made AND verification passes, + "partial" if incomplete, "failed" if no meaningful progress +- files_modified: array of file paths you changed +- issues: array of strings describing any problems, gaps, or out-of-scope + work discovered +- summary: one-paragraph description of what was done + +``` + +## Result Schema + +Write the result schema to `.context/compound-engineering/codex-delegation//result-schema.json` once at the start of delegated execution: + +```json +{ + "type": "object", + "properties": { + "status": { "enum": ["completed", "partial", "failed"] }, + "files_modified": { "type": "array", "items": { "type": "string" } }, + "issues": { "type": "array", "items": { "type": "string" } }, + "summary": { "type": "string" } + }, + "required": ["status", "files_modified", "issues", "summary"], + "additionalProperties": false +} +``` + +Each batch's result is written to `.context/compound-engineering/codex-delegation//result-batch-.json` via the `-o` flag. On plan failure, files are left in place for debugging. 
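
A consumer of the result file may want to check it against the schema above before trusting any field. The following is a minimal sketch of such a check, not part of the workflow itself: `DelegationResult`, `isDelegationResult`, and `parseDelegationResult` are hypothetical names introduced here for illustration, and the checks simply mirror the four required properties, the `status` enum, and `additionalProperties: false`.

```typescript
// Hypothetical helper names; only the field names and allowed values come from the schema above.
type DelegationStatus = "completed" | "partial" | "failed";

interface DelegationResult {
  status: DelegationStatus;
  files_modified: string[];
  issues: string[];
  summary: string;
}

const ALLOWED_KEYS = ["status", "files_modified", "issues", "summary"];
const STATUSES = ["completed", "partial", "failed"];

// True only for arrays whose every element is a string.
function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Type guard mirroring the schema: all four fields present and typed, no extra keys.
function isDelegationResult(value: unknown): value is DelegationResult {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const record = value as Record<string, unknown>;
  // additionalProperties: false -> reject any key outside the schema.
  if (!Object.keys(record).every((key) => ALLOWED_KEYS.indexOf(key) !== -1)) {
    return false;
  }
  return (
    typeof record.status === "string" &&
    STATUSES.indexOf(record.status) !== -1 &&
    isStringArray(record.files_modified) &&
    isStringArray(record.issues) &&
    typeof record.summary === "string"
  );
}

// Returns the parsed result, or null when the text is not JSON or does not match the schema.
function parseDelegationResult(raw: string): DelegationResult | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isDelegationResult(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

Returning `null` for both unparseable and non-conforming input lets a caller treat "missing or malformed" as a single case.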
+ +**Known limitation:** `--output-schema` only works with `gpt-5` family models (e.g., `o4-mini`, `gpt-5.4`), not `gpt-5-codex` or `codex-` prefixed models (Codex CLI bug #4181). If the result JSON is absent or malformed after a successful exit code, classify as task failure. + +## Execution Loop + +Initialize a `consecutive_failures` counter at 0 before the first batch. + +**Clean-baseline preflight:** Before the first batch, verify there are no uncommitted changes to tracked files: + +```bash +git diff --quiet HEAD +``` + +This intentionally ignores untracked files. Only staged or unstaged modifications to tracked files make rollback unsafe. + +If tracked files are dirty, stop and present options: (1) commit current changes, (2) stash explicitly (`git stash push -m "pre-delegation"`), (3) continue in standard mode (sets `delegation_active` to false). Do not auto-stash user changes. + +**Delegation invocation:** For each batch: + +1. Write the prompt file using the Prompt Template above +2. Launch the Codex CLI in the **background** (no timeout ceiling): + +```bash +# Resolve sandbox flag +if [ "$SANDBOX_MODE" = "full-auto" ]; then + SANDBOX_FLAG="--full-auto" +else + SANDBOX_FLAG="--dangerously-bypass-approvals-and-sandbox" +fi + +codex exec \ + -m "" \ + -c 'model_reasoning_effort=""' \ + $SANDBOX_FLAG \ + --output-schema .context/compound-engineering/codex-delegation//result-schema.json \ + -o .context/compound-engineering/codex-delegation//result-batch-.json \ + - < .context/compound-engineering/codex-delegation//prompt-batch-.md +``` + +Run this command with `run_in_background: true` so there is no timeout ceiling. Codex batches can take 5-20+ minutes depending on plan size and test-fix iterations. + +Quoting is critical for the `-c` flag: use single quotes around the entire key=value and double quotes around the TOML string value inside. Example: `-c 'model_reasoning_effort="high"'`. 
The `-m` value does not need special quoting unless the model name contains spaces. + +Do not improvise CLI flags or modify this invocation template. + +3. **Poll for completion.** Immediately after launching, enter a foreground polling loop that checks every 10 seconds whether the result file exists. This keeps the agent's turn active so the user cannot interfere with the working tree during delegation. + +```bash +RESULT_FILE=".context/compound-engineering/codex-delegation//result-batch-.json" +for i in $(seq 1 6); do + test -s "$RESULT_FILE" && echo "DONE" && exit 0 + sleep 10 +done +echo "Waiting for Codex..." +``` + +If the output is "Waiting for Codex...", issue the same polling command again. Repeat until the result file appears. When the output is "DONE", read the result file and proceed to classification. + +**Result classification:** Codex is responsible for running verification internally and fixing failures before reporting -- the orchestrator does not re-run verification independently. + +| # | Signal | Classification | Action | +|---|--------|---------------|--------| +| 1 | Exit code != 0 | CLI failure | Rollback to HEAD. Fall back to standard mode for ALL remaining work. | +| 2 | Exit code 0, result JSON missing or malformed | Task failure | Rollback to HEAD. Increment `consecutive_failures`. | +| 3 | Exit code 0, `status: "failed"` | Task failure | Rollback to HEAD. Increment `consecutive_failures`. | +| 4 | Exit code 0, `status: "partial"` | Partial success | Keep the diff. Complete remaining work locally, verify, and commit. Increment `consecutive_failures`. | +| 5 | Exit code 0, `status: "completed"` | Success | Commit changes. Reset `consecutive_failures` to 0. | + +**Rollback procedure:** + +```bash +git checkout -- . +git clean -fd -- +``` + +Do NOT use bare `git clean -fd` without path arguments. 
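
The result-classification table above, combined with the three-failure circuit breaker, can be read as a pure function from (exit code, reported status, current counter) to an outcome. The sketch below is illustrative only: `classifyBatch`, `BatchOutcome`, and `FAILURE_LIMIT` are hypothetical names, and two details are assumptions rather than documented behavior: a CLI failure leaves the counter unchanged, and an unrecognized `status` value is treated like a malformed result.

```typescript
// Hypothetical names throughout; only the signals and actions come from the classification table.
type Classification = "cli_failure" | "task_failure" | "partial_success" | "success";

interface BatchOutcome {
  classification: Classification;
  rollback: boolean; // true -> restore the working tree to HEAD
  consecutiveFailures: number; // counter value after this batch
  delegationActive: boolean; // false -> finish remaining work in standard mode
}

const FAILURE_LIMIT = 3; // circuit-breaker threshold

// `status` is null when the result JSON is missing or malformed.
function classifyBatch(
  exitCode: number,
  status: string | null,
  consecutiveFailures: number
): BatchOutcome {
  if (exitCode !== 0) {
    // Row 1: CLI failure ends delegation for all remaining work.
    // Assumption: the counter is left as-is, since the table does not say.
    return {
      classification: "cli_failure",
      rollback: true,
      consecutiveFailures: consecutiveFailures,
      delegationActive: false,
    };
  }
  if (status === "completed") {
    // Row 5: success resets the counter.
    return {
      classification: "success",
      rollback: false,
      consecutiveFailures: 0,
      delegationActive: true,
    };
  }
  // Rows 2-4 all increment the counter and may trip the circuit breaker.
  const failures = consecutiveFailures + 1;
  const delegationActive = failures < FAILURE_LIMIT;
  if (status === "partial") {
    // Row 4: keep the diff; remaining work is completed locally.
    return {
      classification: "partial_success",
      rollback: false,
      consecutiveFailures: failures,
      delegationActive: delegationActive,
    };
  }
  // Rows 2 and 3: missing/malformed result or status "failed".
  return {
    classification: "task_failure",
    rollback: true,
    consecutiveFailures: failures,
    delegationActive: delegationActive,
  };
}
```

Keeping the decision free of side effects means the git operations (rollback, commit) stay in the orchestrator, which acts on the returned `rollback` and `delegationActive` flags.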
+ +**Commit on success:** + +```bash +git add +git commit -m "feat(): " +``` + +**Between batches** (plans split into multiple batches): Report what completed, test results, and what's next. Continue immediately unless the user intervenes -- the checkpoint exists so the user *can* steer, not so they *must*. + +**Circuit breaker:** After 3 consecutive failures, set `delegation_active` to false and emit: "Codex delegation disabled after 3 consecutive failures -- completing remaining units in standard mode." + +**Scratch cleanup:** After the last batch completes: + +```bash +rm -rf .context/compound-engineering/codex-delegation// +``` + +## Mixed-Model Attribution + +When some units are executed by Codex and others locally: +- If all units used delegation: attribute to the Codex model +- If all units used standard mode: attribute to the current agent's model +- If mixed: note which units were delegated in the PR description and credit both models diff --git a/plugins/compound-engineering/skills/ce-work-beta/references/swarm-mode.md b/plugins/compound-engineering/skills/ce-work-beta/references/swarm-mode.md new file mode 100644 index 00000000..094f40ef --- /dev/null +++ b/plugins/compound-engineering/skills/ce-work-beta/references/swarm-mode.md @@ -0,0 +1,24 @@ +# Swarm Mode with Agent Teams + +For genuinely large plans where agents need to communicate with each other, challenge approaches, or coordinate across 10+ tasks with persistent specialized roles, use agent team capabilities if available (e.g., Agent Teams in Claude Code). + +**Agent teams are typically experimental and require opt-in.** Do not attempt to use agent teams unless the user explicitly requests swarm mode or agent teams, and the platform supports it. 
+ +## When to Use Agent Teams vs Subagents + +| Agent Teams | Subagents (standard mode) | +|-------------|---------------------------| +| Agents need to discuss and challenge each other's approaches | Each task is independent -- only the result matters | +| Persistent specialized roles (e.g., dedicated tester running continuously) | Workers report back and finish | +| 10+ tasks with complex cross-cutting coordination | 3-8 tasks with clear dependency chains | +| User explicitly requests "swarm mode" or "agent teams" | Default for most plans | + +Most plans should use subagent dispatch from standard mode. Agent teams add significant token cost and coordination overhead -- use them when the inter-agent communication genuinely improves the outcome. + +## Agent Teams Workflow + +1. **Create team** -- use your available team creation mechanism +2. **Create task list** -- parse Implementation Units into tasks with dependency relationships +3. **Spawn teammates** -- assign specialized roles (implementer, tester, reviewer) based on the plan's needs. Give each teammate the plan file path and their specific task assignments +4. **Coordinate** -- the lead monitors task completion, reassigns work if someone gets stuck, and spawns additional workers as phases unblock +5. **Cleanup** -- shut down all teammates, then clean up the team resources diff --git a/plugins/compound-engineering/skills/ce-work/SKILL.md b/plugins/compound-engineering/skills/ce-work/SKILL.md index 98a7256d..d206b334 100644 --- a/plugins/compound-engineering/skills/ce-work/SKILL.md +++ b/plugins/compound-engineering/skills/ce-work/SKILL.md @@ -372,8 +372,6 @@ Determine how to proceed based on what was provided in ``. 
- Note any follow-up work needed - Suggest next steps if applicable ---- - ## Swarm Mode with Agent Teams (Optional) For genuinely large plans where agents need to communicate with each other, challenge approaches, or coordinate across 10+ tasks with persistent specialized roles, use agent team capabilities if available (e.g., Agent Teams in Claude Code, multi-agent workflows in Codex). diff --git a/tests/pipeline-review-contract.test.ts b/tests/pipeline-review-contract.test.ts index 91b138f2..94e02264 100644 --- a/tests/pipeline-review-contract.test.ts +++ b/tests/pipeline-review-contract.test.ts @@ -87,6 +87,157 @@ describe("ce:work review contract", () => { expect(beta).not.toContain("Tests pass (run project's test command)") expect(beta).not.toContain("- All tests pass") }) + + test("ce:work remains the stable non-delegating surface", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work/SKILL.md") + + expect(content).not.toContain("## Argument Parsing") + expect(content).not.toContain("## Codex Delegation Mode") + expect(content).not.toContain("delegate:codex") + }) +}) + +describe("ce:work-beta codex delegation contract", () => { + test("has argument parsing with delegate tokens", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + // Argument parsing section exists with delegation tokens + expect(content).toContain("## Argument Parsing") + expect(content).toContain("`delegate:codex`") + expect(content).toContain("`delegate:local`") + + // Resolution chain present + expect(content).toContain("### Settings Resolution Chain") + expect(content).toContain("work_delegate") + expect(content).toContain("compound-engineering.local.md") + }) + + test("argument-hint includes delegate:codex for discoverability", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + 
expect(content).toContain("argument-hint:") + expect(content).toContain("delegate:codex") + }) + + test("remains manual-invocation beta during rollout", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + expect(content).toContain("disable-model-invocation: true") + expect(content).toContain("Invoke `ce:work-beta` manually") + expect(content).toContain("planning and workflow handoffs remain pointed at stable `ce:work`") + }) + + test("SKILL.md has delegation routing stub pointing to reference", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + expect(content).toContain("## Codex Delegation Mode") + expect(content).toContain("references/codex-delegation-workflow.md") + // Delegation details are NOT in SKILL.md body — they're in the reference + expect(content).not.toContain("### Pre-Delegation Checks") + expect(content).not.toContain("### Prompt Template") + expect(content).not.toContain("### Execution Loop") + }) + + test("delegation routing gate in Phase 1 Step 4", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + const gateIdx = content.indexOf("Delegation routing gate") + const strategyTableIdx = content.indexOf("| **Inline**") + expect(gateIdx).toBeGreaterThan(0) + expect(gateIdx).toBeLessThan(strategyTableIdx) + expect(content).toContain("Codex delegation requires a plan file") + }) + + test("delegation branches in Phase 2 task loop", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + expect(content).toContain("If delegation_active: branch to the Codex Delegation Execution Loop") + }) + + test("swarm mode has mutual exclusion note", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + expect(content).toContain("Mutual exclusion with Codex 
delegation") + expect(content).toContain("swarm mode are mutually exclusive") + }) + + test("delegation reference has all required sections", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md") + + // Pre-delegation checks + expect(content).toContain("## Pre-Delegation Checks") + expect(content).toContain("Platform Gate") + expect(content).toContain("CODEX_SANDBOX") + expect(content).toContain("command -v codex") + expect(content).toContain("Consent Flow") + + // Batching + expect(content).toContain("## Batching") + + // Prompt template + expect(content).toContain("## Prompt Template") + expect(content).toContain("") + expect(content).toContain("") + expect(content).toContain("") + expect(content).toContain("the orchestrator will not re-run verification independently") + + // Result schema and execution loop + expect(content).toContain("## Result Schema") + expect(content).toContain("## Execution Loop") + expect(content).toContain("codex exec") + + // Circuit breaker + expect(content).toContain("consecutive_failures") + expect(content).toContain("3 consecutive failures") + + // Rollback safety + expect(content).toContain("git diff --quiet HEAD") + expect(content).toContain("git checkout -- .") + expect(content).toContain("Do NOT use bare `git clean -fd` without path arguments") + + // Mixed-model attribution + expect(content).toContain("## Mixed-Model Attribution") + }) + + test("delegation reference has decision prompts for ask mode", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/references/codex-delegation-workflow.md") + + expect(content).toContain("## Delegation Decision") + expect(content).toContain("work_delegate_decision") + expect(content).toContain("Execute with Claude Code instead") + expect(content).toContain("Delegate to Codex anyway") + expect(content).toContain("the cost of delegating outweighs having Claude 
Code do them") + }) + + test("settings resolution includes delegation decision setting", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + expect(content).toContain("work_delegate_decision") + expect(content).toContain("`auto`") + expect(content).toContain("`ask`") + }) + + test("has frontend design guidance ported from beta", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-work-beta/SKILL.md") + + expect(content).toContain("**Frontend Design Guidance**") + expect(content).toContain("`frontend-design` skill") + }) +}) + +describe("ce:plan remains neutral during ce:work-beta rollout", () => { + test("removes delegation-specific execution posture guidance", async () => { + const content = await readRepoFile("plugins/compound-engineering/skills/ce-plan/SKILL.md") + + // Old tag removed from execution posture signals + expect(content).not.toContain("add `Execution target: external-delegate`") + + // Old tag removed from execution note examples + expect(content).not.toContain("Execution note: Execution target: external-delegate") + + // Planner stays neutral instead of teaching beta-only invocation + expect(content).not.toContain("delegate:codex") + }) }) describe("ce:brainstorm review contract", () => {