From 1f24e902037b9421afb6103e22ea0a14d7b6586d Mon Sep 17 00:00:00 2001 From: Fotis Stamatelopoulos Date: Wed, 13 May 2026 21:06:03 -0700 Subject: [PATCH 1/3] feat(loop): harness-level missing-signals pause + retry_iteration action MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit cfcf has implicitly assumed dev + judge always run far enough to write their signal files (cfcf-iteration-signals.json / cfcf-judge-signals.json). When agents exit before doing so (hard quota cap, crash, OOM, …), cfcf silently marked the iteration failed and continued. Real gmbot dogfood: iter 10 hit a codex usage limit, both agents exited without signals, cfcf moved on and iter 11 had to re-discover the missing work. This is a harness-contract violation, not an agent-reasoning failure — the agent never ran far enough to classify anything. So the fix is harness-side and DELIBERATELY doesn't classify the root cause. Quota / crash / OOM all collapse to "agent exited without signals → can't safely continue → pause + surface to user." User reads the log file to identify the cause. Behaviour: - After parseSignalFile / parseJudgeSignals returns null (file missing OR JSON parse error OR schema validation failure — one signal), pause with pauseReason "missing_signals" rather than silently continuing. - pendingQuestions[0] is a generic-anomaly message with the log path: "Anomaly detected: agent for iteration N exited (exit code X) without writing its signals file (agent crashed or hit a usage limit). Check the log: . Resume with retry_iteration / continue / stop_loop_now." - loop.paused notification fires for any configured channel. - Working tree left as-is — no git reset, no commit attempt. On retry_iteration the branch is re-created off HEAD by the existing iteration body; any dirty changes survive in the working tree. New resume action: retry_iteration (extends item 6.25's structured pause-actions vocabulary): - Rolls iteration counter back via new decrementIteration() helper in workspaces.ts. - Pops the failed iteration's record from state.iterations. - Falls through to the regular iteration body — nextIteration() now returns the same number, the branch is re-created. - ONLY applicable to missing_signals pauses (allowlist: retry_iteration / continue / stop_loop_now). Rejected for other pause classes via existing pauseReasonAllowedActions validation. Files touched (~140 LoC): - packages/core/src/types.ts — extend ResumeAction - packages/core/src/iteration-loop.ts — pause helper, two call sites (dev + judge), retry dispatch, pauseReasonAllowedActions branch - packages/core/src/workspaces.ts — decrementIteration helper - packages/cli/src/commands/resume.ts — RESUME_ACTIONS + help - packages/web/{types.ts,api.ts,components/FeedbackForm.tsx} — mirror the new variants on the web side. pendingQuestions[0] rendering already surfaces our generic-anomaly message; no UI copy needed. Test coverage: 8 new tests (1030 total pass). Typecheck clean. Intentionally NOT in scope (future, if dogfood demands): - Root-cause classification (quota vs. crash) — would be a cosmetic label, not a behaviour change. Defer. - Auto-retry after parsed reset time — defer. - Pattern-based "anomaly type" labels in the pause text. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 121 +++++++++++++++ packages/cli/src/commands/resume.ts | 7 +- packages/core/src/iteration-loop.test.ts | 41 +++++ packages/core/src/iteration-loop.ts | 153 ++++++++++++++++++- packages/core/src/types.ts | 9 +- packages/core/src/workspaces.test.ts | 44 ++++++ packages/core/src/workspaces.ts | 19 +++ packages/web/src/api.ts | 3 +- packages/web/src/components/FeedbackForm.tsx | 14 ++ packages/web/src/types.ts | 3 +- 10 files changed, 405 insertions(+), 9 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 61a4348..9c3d449 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,127 @@ Changes are tracked via git tags. Each release tag corresponds to an entry here. ## [Unreleased] +### Added — Harness-level missing-signals pause + `retry_iteration` resume action + +cfcf has always implicitly assumed that the dev and judge agents +will run far enough to write their signals files +(`cfcf-iteration-signals.json` / `cfcf-judge-signals.json`). +Everything downstream — determination parsing, archive logic, +decision engine — branches off those files. When an agent exits +before writing them (hard quota cap, crash, OOM kill, killed +process, …), cfcf silently marked the iteration "failed" and +continued to the next one. That cost a real gmbot dogfood run: iter +10 hit a codex usage limit, dev + judge both exited without signals, +cfcf moved on to iter 11 which then had to re-discover the missing +work from scratch. + +**Architectural framing**: this is a harness-contract violation, not +an agent-reasoning failure. The agent never ran far enough to +classify anything. So the fix lives at the harness layer, not in +the agent prompts — and deliberately doesn't try to identify the +root cause. Quota / crash / OOM / killed all collapse to "agent +exited without signals → can't safely continue → pause." The user +reads the log file to identify the cause and chooses how to +recover. + +**Behaviour**: + +- After `parseSignalFile` / `parseJudgeSignals` returns `null` + (missing file, JSON parse failure, schema validation failure — + all collapsed to one signal), the loop pauses with + `pauseReason: "missing_signals"` rather than continuing silently. +- The pause sets `pendingQuestions[0]` to a generic anomaly message + pointing at the log file: *"Anomaly detected: dev agent for + iteration 10 exited (exit code 1) without writing its signals + file (agent crashed or hit a usage limit). Check the log: … + Resume with `retry_iteration` to redo this iteration, `continue` + to skip to iteration 11, or `stop_loop_now` to abandon."* +- Notification fires (`type: "loop.paused"`) so any notification + channel sees the pause immediately, not just the dashboard. +- Working tree is left as-is — no `git reset`, no commit attempt. + Any dirty changes are preserved for the user to inspect. + +**New resume action: `retry_iteration`** (extends the structured +pause-actions vocabulary from item 6.25): + +- Rolls the workspace's iteration counter back by one (new + `decrementIteration(workspaceId)` helper in `workspaces.ts`). +- Pops the failed iteration's record from `state.iterations` so + it isn't double-counted. +- Falls through to the regular iteration body, which calls + `nextIteration()` (now returns the same number that failed) and + proceeds to delete + re-create the existing branch off HEAD + (logic the loop already had for retry scenarios). Dev respawns + on a clean branch under the same iteration number. +- Applicable ONLY to `missing_signals` pauses — surfaced in the + web UI's pause-action button row + accepted on + `cfcf resume --action retry_iteration`. Rejected for other + pause classes via the existing `pauseReasonAllowedActions` + validation. +- `missing_signals` pause-action allowlist: `["retry_iteration", + "continue", "stop_loop_now"]`. No finish/refine/consult — those + all assume a meaningful iteration result to act on. + +**What this doesn't do** (intentional non-scope): + +- No root-cause classification (quota vs. crash vs. OOM). The + harness doesn't try — that requires reading the log file, + which is the user's job. Future enhancement: per-adapter + pattern matching that adds a *label* to the pause without + changing behaviour. +- No automatic retry after a parsed reset time. v1 is pause + + user resumes when ready. Matches "human on the loop, not in + it." Revisit if dogfood demands it. +- No partial-work recovery beyond "leave the working tree alone." + If the agent made dirty changes before crashing, they survive in + the working tree; the user can `git status` / `git diff` / + `git stash` as needed. On `retry_iteration` the branch is + re-created off HEAD; any branch-only commits from the failed + attempt are lost. (Working-tree changes are NOT lost.) + +**Implementation** (~140 LoC): + +- `packages/core/src/types.ts` — extend `ResumeAction` with + `"retry_iteration"`. +- `packages/core/src/iteration-loop.ts`: + - Extend `LoopState.pauseReason` with `"missing_signals"`. + - New `pauseLoopOnMissingSignals(workspace, state, iter, phase, + exitCode, logFile)` helper. Sets phase + pauseReason + + pendingQuestions, saves state, flips workspace status, + dispatches a `loop.paused` notification. + - Two new call sites: after the dev signal-parse (post-exit, + pre-commit) and after the judge signal-parse (same shape). + Both return immediately — the outer loop's `isLoopDone(state)` + check exits the loop cleanly. + - New `retry_iteration` branch in the resume-action dispatch + chain — decrements the counter, pops the failed iter record, + falls through. + - Extend `pauseReasonAllowedActions` with the + `case "missing_signals"` branch. +- `packages/core/src/workspaces.ts` — new + `decrementIteration(workspaceId)` helper (idempotent at floor 0). +- `packages/cli/src/commands/resume.ts` — add `retry_iteration` + to `RESUME_ACTIONS` + the `--action` help text. +- `packages/web/src/types.ts` + `packages/web/src/api.ts` — mirror + the new pauseReason + ResumeAction variants. +- `packages/web/src/components/FeedbackForm.tsx` — add the + `missing_signals` case to the per-pauseReason action matrix, + add `retry_iteration` to `ACTION_LABEL` + `ACTION_HELP`. The + existing `pendingQuestions[0]` rendering surfaces the + generic-anomaly message we set in the pause helper, so no + additional copy was needed. + +**Test coverage** (8 new tests, all 1030 tests pass): + +- 4 new tests in `iteration-loop.test.ts` covering the + `missing_signals` action matrix (retry/continue/stop allowed; + finish/refine/consult excluded) and the inverse — `retry_iteration` + is rejected for every other pauseReason. +- 4 new tests in `workspaces.test.ts` for `decrementIteration`: + rolls back by 1, decrement+nextIteration round-trips to the + failed number (load-bearing invariant for retry), floors at 0, + null for non-existent workspace. + ### Fixed — History tab: judge row time cell while judge is pending F.21 follow-up surfaced during v0.24.2 dogfood. The judge row's diff --git a/packages/cli/src/commands/resume.ts b/packages/cli/src/commands/resume.ts index dc10ad2..0f45f6d 100644 --- a/packages/cli/src/commands/resume.ts +++ b/packages/cli/src/commands/resume.ts @@ -26,6 +26,11 @@ const RESUME_ACTIONS = [ "stop_loop_now", "refine_plan", "consult_reflection", + // Re-spawn dev on the same iteration. Typical use: a + // `missing_signals` pause (quota cap, crash) is being unblocked + // after the underlying cause cleared. The harness rolls back the + // iteration counter and re-creates the branch. + "retry_iteration", ] as const; type ResumeAction = (typeof RESUME_ACTIONS)[number]; @@ -81,7 +86,7 @@ function buildActionOption(program: Command): Option { // eslint-disable-next-line @typescript-eslint/no-explicit-any const opt = (program as any).createOption( "--action ", - "Resume action: continue | finish_loop | stop_loop_now | refine_plan | consult_reflection", + "Resume action: continue | finish_loop | stop_loop_now | refine_plan | consult_reflection | retry_iteration", ); opt.choices(RESUME_ACTIONS as readonly string[]); opt.default("continue"); diff --git a/packages/core/src/iteration-loop.test.ts b/packages/core/src/iteration-loop.test.ts index 0651843..2801cc1 100644 --- a/packages/core/src/iteration-loop.test.ts +++ b/packages/core/src/iteration-loop.test.ts @@ -947,6 +947,47 @@ describe("pauseReasonAllowedActions (item 6.25)", () => { test("scope_complete excludes consult_reflection (no iterations to reflect on)", () => { expect(pauseReasonAllowedActions("scope_complete")).not.toContain("consult_reflection"); }); + + // harness-missing-signals: when dev or judge exit without writing + // their signals file, the iteration is in an unknown state. The + // user's three meaningful options are: redo the iter (retry), + // skip it (continue), or abandon (stop). finish_loop / + // refine_plan / consult_reflection all assume we have a + // meaningful iter result to act on; we don't. + test("missing_signals: retry_iteration + continue + stop_loop_now", () => { + expectActions(pauseReasonAllowedActions("missing_signals"), [ + "retry_iteration", + "continue", + "stop_loop_now", + ]); + }); + + test("missing_signals excludes finish_loop (no iter result to finish on)", () => { + expect(pauseReasonAllowedActions("missing_signals")).not.toContain("finish_loop"); + }); + + test("missing_signals excludes refine_plan + consult_reflection (no iter data to refine/reflect on)", () => { + expect(pauseReasonAllowedActions("missing_signals")).not.toContain("refine_plan"); + expect(pauseReasonAllowedActions("missing_signals")).not.toContain("consult_reflection"); + }); + + test("retry_iteration is ONLY applicable to missing_signals (not offered for other pauseReasons)", () => { + // The retry action is specifically the "agent exited without + // signals, redo the iter" path. It doesn't make sense for the + // other pause classes (cadence, anomaly with signals, etc. + // — those have a known iter result and use other actions). + const otherReasons = [ + undefined, + "cadence", + "anomaly", + "user_input_needed", + "max_iterations", + "scope_complete", + ] as const; + for (const reason of otherReasons) { + expect(pauseReasonAllowedActions(reason)).not.toContain("retry_iteration"); + } + }); }); // 2026-05-02: Architect SCOPE_COMPLETE readiness verdict (item 6.25 follow-up). diff --git a/packages/core/src/iteration-loop.ts b/packages/core/src/iteration-loop.ts index 2764fca..99f4a62 100644 --- a/packages/core/src/iteration-loop.ts +++ b/packages/core/src/iteration-loop.ts @@ -33,7 +33,7 @@ import { getAdapter } from "./adapters/index.js"; import { spawnProcess } from "./process-manager.js"; import { getIterationLogPath, ensureWorkspaceLogDir } from "./log-storage.js"; import * as gitManager from "./git-manager.js"; -import { nextIteration, updateWorkspace, getWorkspace } from "./workspaces.js"; +import { nextIteration, decrementIteration, updateWorkspace, getWorkspace } from "./workspaces.js"; import { runDocumentSync } from "./documenter-runner.js"; import { runReflectionSync, parseReflectionSignals } from "./reflection-runner.js"; import { runReviewSync, readinessGateBlocks } from "./architect-runner.js"; @@ -88,7 +88,14 @@ export interface LoopState { | "anomaly" | "user_input_needed" | "max_iterations" - | "scope_complete"; // item 6.25 follow-up: architect SCOPE_COMPLETE + | "scope_complete" // item 6.25 follow-up: architect SCOPE_COMPLETE + // Harness contract violation: dev or judge agent exited without + // writing its signals file. Inclusive of all root causes (crash, + // quota cap, OOM, etc.) — the harness doesn't classify, just + // pauses + surfaces. User checks the log to identify cause and + // resumes with `retry_iteration` (redo the iter) or `continue` + // (skip + move on) or `stop_loop_now`. + | "missing_signals"; /** Questions from dev/judge that need user answers */ pendingQuestions?: string[]; /** User feedback to inject into the next iteration */ @@ -753,6 +760,21 @@ export function pauseReasonAllowedActions( // - refine_plan → user adds new requirements, re-runs review return ["finish_loop", "stop_loop_now", "refine_plan"]; + case "missing_signals": + // Dev or judge agent exited without writing signals (crash, + // quota cap, OOM, …). The iteration is in an unknown state; + // the harness can't safely judge, reflect, refine, or finish + // on no data. Allowed: + // - retry_iteration → redo the failed iter (counter rolled + // back; branch re-created). Natural choice after a quota + // cap resets. + // - continue → skip the failed iter, start iter N+1 + // fresh. Use when the failed work isn't worth recovering. + // - stop_loop_now → abandon, accept partial progress. + // No finish_loop / refine_plan / consult_reflection — those + // all assume we have a meaningful iteration result to act on. + return ["retry_iteration", "continue", "stop_loop_now"]; + default: // A1 pre-loop review blocked. continue (after user edited the // Problem Pack), stop_loop_now, refine_plan (re-run architect @@ -1052,6 +1074,70 @@ async function handleStopLoopNow( ); } +/** + * Pause the loop because the dev or judge agent exited without + * writing its signals file. Harness contract: agents must produce + * `cfcf-iteration-signals.json` / `cfcf-judge-signals.json`. When + * the file is missing, unreadable, or fails schema validation, the + * harness can't safely judge / reflect / decide on no data — pause + * and surface to the user. + * + * Inclusive of all root causes (quota cap, agent crash, OOM, killed + * process, agent CLI bug). The harness deliberately doesn't try to + * classify — the user reads the log file to identify cause, then + * resumes with `retry_iteration` (most common after a quota cap + * resets) / `continue` (skip the failed iter) / `stop_loop_now`. + * + * The pause-state writes mirror the existing + * `consult_reflection`/`scope_complete` pause paths: set phase + + * pauseReason + pendingQuestions, saveLoopState, flip workspace + * status to "paused", dispatch a notification. No git surgery — + * any dirty working-tree state is preserved for the user to + * inspect; on `retry_iteration`, the iteration counter is rolled + * back and the existing branch is re-created off HEAD (the normal + * iteration body already deletes-and-recreates an existing branch). + */ +async function pauseLoopOnMissingSignals( + workspace: WorkspaceConfig, + state: LoopState, + iterationNum: number, + phase: "dev" | "judge", + exitCode: number | undefined, + logFileName: string, +): Promise { + state.phase = "paused"; + state.pauseReason = "missing_signals"; + const exitDetail = + exitCode === undefined ? "" : ` (exit code ${exitCode})`; + const message = + `Anomaly detected: ${phase} agent for iteration ${iterationNum} ` + + `exited${exitDetail} without writing its signals file (agent crashed ` + + `or hit a usage limit). Check the log: ${logFileName}. Resume with ` + + `\`retry_iteration\` to redo this iteration, \`continue\` to skip ` + + `to iteration ${iterationNum + 1}, or \`stop_loop_now\` to abandon.`; + state.pendingQuestions = [message]; + await saveLoopState(state); + await updateWorkspace(workspace.id, { status: "paused" }); + + dispatchForWorkspace( + makeEvent({ + type: "loop.paused", + title: "Loop paused: agent exited without signals", + message: `${workspace.name}: ${message}`, + workspaceId: workspace.id, + workspaceName: workspace.name, + details: { + pauseReason: "missing_signals", + phase, + iteration: iterationNum, + exitCode: exitCode ?? null, + logFile: logFileName, + }, + }), + workspace.notifications, + ); +} + /** * Handle the `finish_loop` resume action: skip the iteration spawn, * jump to the configured end-of-loop sequence (documenter when @@ -1246,6 +1332,22 @@ async function runLoop( // is handled inside the consult helper, which sets the pause state // and returns "stop" to abort runLoop). } + if (resumeAction === "retry_iteration") { + // Re-spawn dev on the same iteration after a `missing_signals` + // pause (quota cap reset, agent crash recovery, etc.). Roll the + // iteration counter back by one so the loop body's next + // `nextIteration()` call returns the SAME number that failed — + // the branch-creation logic already deletes + re-creates an + // existing branch off HEAD, so the failed attempt's branch is + // replaced cleanly. Also drop the failed iteration's record + // from state.iterations so it isn't double-counted; the retry + // produces a fresh record under the same number. + await decrementIteration(workspace.id); + if (state.iterations.length > 0) { + state.iterations.pop(); + } + // Fall through to the regular pre-loop / iteration logic. + } // --- PRE-LOOP REVIEW (item 5.1 autoReviewSpecs=true) --- // Runs before iteration 1, not on resume. When the readiness gate @@ -1664,6 +1766,29 @@ async function runLoop( const devSignals = await parseSignalFile(workspace.repoPath); iterRecord.devSignals = devSignals ?? undefined; + // Harness contract check: dev agent must write + // `cfcf-iteration-signals.json`. Absence (file missing, JSON + // parse failure, or schema validation failure — all collapsed + // to `null` by `parseSignalFile`) means we have no data to + // judge / reflect / decide on. Pause + surface to the user; + // do NOT silently treat as a failed iteration and move on. + // Inclusive of all root causes: agent crashed mid-startup, + // quota cap, OOM kill, etc. Working tree (if dirty) is left + // as-is for the user to inspect; on `retry_iteration` the + // existing branch is deleted + re-created off HEAD by the + // normal iteration body. + if (!devSignals) { + await pauseLoopOnMissingSignals( + workspace, + state, + iterationNum, + "dev", + devResult.exitCode, + iterRecord.devLogFileName, + ); + return; + } + // Commit dev work if (await gitManager.hasChanges(workspace.repoPath)) { await gitManager.commitAll( @@ -1799,9 +1924,27 @@ async function runJudgeAndDecide( const judgeSignals = await parseJudgeSignals(workspace.repoPath); iterRecord.judgeSignals = judgeSignals ?? undefined; - // If judge exited with non-zero and produced no signals, log it - if (judgeResult.exitCode !== 0 && !judgeSignals) { - iterRecord.judgeError = `Judge agent exited with code ${judgeResult.exitCode}. Check log: ${judgeLogFile}`; + // Harness contract check: judge agent must write + // `cfcf-judge-signals.json`. Same pattern as the dev check above + // — absence of signals means we have no determination to act on, + // so we pause and surface to the user rather than archive a + // stale judge-assessment.md or fall through to the decide path + // with no data. Inclusive of all root causes (crash, quota cap, + // OOM). The previous behaviour (record `iterRecord.judgeError` + // and continue silently) is replaced by the explicit pause; the + // `judgeError` field is still populated for the audit trail. + if (!judgeSignals) { + iterRecord.judgeError = + `Judge agent exited with code ${judgeResult.exitCode} without writing signals. Check log: ${judgeLogFile}`; + await pauseLoopOnMissingSignals( + workspace, + state, + iterationNum, + "judge", + judgeResult.exitCode, + iterRecord.judgeLogFileName, + ); + return; } // Commit judge work diff --git a/packages/core/src/types.ts b/packages/core/src/types.ts index 42c234f..9b7dfe7 100644 --- a/packages/core/src/types.ts +++ b/packages/core/src/types.ts @@ -585,7 +585,14 @@ export type ResumeAction = | "finish_loop" | "stop_loop_now" | "refine_plan" - | "consult_reflection"; + | "consult_reflection" + // Re-spawn dev on the same iteration after a pause (typically + // following a `missing_signals` pause from a quota cap or agent + // crash). The iteration counter is rolled back so the retry runs + // as iteration N rather than N+1; the existing branch is + // discarded + re-created off HEAD. See harness-missing-signals + // feature for the full flow. + | "retry_iteration"; /** * Architect's verdict on the Problem Pack's readiness for an iteration loop. diff --git a/packages/core/src/workspaces.test.ts b/packages/core/src/workspaces.test.ts index 68a8ae9..124e5f2 100644 --- a/packages/core/src/workspaces.test.ts +++ b/packages/core/src/workspaces.test.ts @@ -11,6 +11,7 @@ import { deleteWorkspace, validateWorkspaceRepo, nextIteration, + decrementIteration, } from "./workspaces.js"; describe("projects", () => { @@ -166,6 +167,49 @@ describe("projects", () => { }); }); + // Roll-back helper for `retry_iteration` (harness-missing-signals + // feature): when a failed iteration N is being retried, the + // counter is rolled back so the next `nextIteration()` call + // returns N again instead of N+1. + describe("decrementIteration", () => { + it("rolls back from 3 to 2", async () => { + const project = await createWorkspace({ name: "dec-test", repoPath: repoDir }); + await nextIteration(project.id); + await nextIteration(project.id); + await nextIteration(project.id); + const prev = await decrementIteration(project.id); + expect(prev).toBe(2); + + const updated = await getWorkspace(project.id); + expect(updated!.currentIteration).toBe(2); + }); + + it("decrement then nextIteration returns the same number that was rolled back", async () => { + // The retry_iteration flow's load-bearing invariant: the + // counter ends up back at the failed iter's number so the + // loop body's next `nextIteration()` call reproduces it. + const project = await createWorkspace({ name: "retry-test", repoPath: repoDir }); + await nextIteration(project.id); // -> 1 + await nextIteration(project.id); // -> 2 (this is the "failed" iter) + + await decrementIteration(project.id); + const retried = await nextIteration(project.id); + expect(retried).toBe(2); // same number as the failed attempt + }); + + it("floors at 0 (calling on a 0-counter is a no-op)", async () => { + const project = await createWorkspace({ name: "floor-test", repoPath: repoDir }); + const result = await decrementIteration(project.id); + expect(result).toBe(0); + const updated = await getWorkspace(project.id); + expect(updated!.currentIteration).toBe(0); + }); + + it("returns null for non-existent project (matches nextIteration's nullable shape)", async () => { + expect(await decrementIteration("nonexistent")).toBeNull(); + }); + }); + describe("validateWorkspaceRepo", () => { it("validates a valid git repo", async () => { const result = await validateWorkspaceRepo(repoDir); diff --git a/packages/core/src/workspaces.ts b/packages/core/src/workspaces.ts index 300de98..e4ea95e 100644 --- a/packages/core/src/workspaces.ts +++ b/packages/core/src/workspaces.ts @@ -243,6 +243,25 @@ export async function nextIteration(workspaceId: string): Promise return next; } +/** + * Roll the iteration counter back by one. Used by `retry_iteration` + * (the resume action that re-spawns dev on the same iteration after a + * `missing_signals` pause): the failed attempt was already counted via + * `nextIteration()`, so we roll back to let the loop body's next + * `nextIteration()` call return the same number. Idempotent at the + * floor — if the counter is already 0 (no iterations yet), this is a + * no-op. + */ +export async function decrementIteration(workspaceId: string): Promise { + const workspace = await getWorkspace(workspaceId); + if (!workspace) return null; + const current = workspace.currentIteration || 0; + if (current <= 0) return 0; + const prev = current - 1; + await updateWorkspace(workspaceId, { currentIteration: prev }); + return prev; +} + /** * Verify a workspace's repo path exists and is a git repo. */ diff --git a/packages/web/src/api.ts b/packages/web/src/api.ts index d1683e6..875cf97 100644 --- a/packages/web/src/api.ts +++ b/packages/web/src/api.ts @@ -302,7 +302,8 @@ export type ResumeAction = | "finish_loop" | "stop_loop_now" | "refine_plan" - | "consult_reflection"; + | "consult_reflection" + | "retry_iteration"; // harness-missing-signals: redo the failed iter export function resumeLoop( workspaceId: string, diff --git a/packages/web/src/components/FeedbackForm.tsx b/packages/web/src/components/FeedbackForm.tsx index 16bd4b0..13081c3 100644 --- a/packages/web/src/components/FeedbackForm.tsx +++ b/packages/web/src/components/FeedbackForm.tsx @@ -32,6 +32,17 @@ function pauseReasonAllowedActions(pauseReason: LoopState["pauseReason"]): Resum // configured; stop_loop_now accepts "project done"; refine_plan // re-runs the architect after the user adds new requirements. return ["finish_loop", "stop_loop_now", "refine_plan"]; + case "missing_signals": + // Harness-missing-signals: dev/judge exited without writing + // its signals file (crash, quota cap, OOM, …). No determination + // to act on. Allowed: + // - retry_iteration → redo the iter (after a quota window + // resets, typically) + // - continue → skip the failed iter + // - stop_loop_now → abandon + // No finish/refine/consult — they all need a meaningful iter + // result. + return ["retry_iteration", "continue", "stop_loop_now"]; default: // Pre-loop review block (A1) return ["continue", "stop_loop_now", "refine_plan"]; @@ -44,6 +55,7 @@ const ACTION_LABEL: Record = { stop_loop_now: "Stop loop now", refine_plan: "Refine plan", consult_reflection: "Ask Reflection to decide", + retry_iteration: "Retry iteration", }; const ACTION_HELP: Record = { @@ -57,6 +69,8 @@ const ACTION_HELP: Record = { "Run the architect (synchronously) with your feedback to update the plan, then continue with the next iteration.", consult_reflection: "Spawn reflection with your feedback as input. Reflection decides what the harness does next (continue / finish / stop / re-pause).", + retry_iteration: + "Re-spawn dev on the same iteration that just failed (the iteration counter is rolled back, the branch is re-created). Typical use: a quota cap reset, or after fixing whatever caused the agent to crash.", }; export function FeedbackForm({ diff --git a/packages/web/src/types.ts b/packages/web/src/types.ts index 07eeed4..d2d29f3 100644 --- a/packages/web/src/types.ts +++ b/packages/web/src/types.ts @@ -131,7 +131,8 @@ export interface LoopState { | "anomaly" | "user_input_needed" | "max_iterations" - | "scope_complete"; // item 6.25 follow-up: architect SCOPE_COMPLETE + | "scope_complete" // item 6.25 follow-up: architect SCOPE_COMPLETE + | "missing_signals"; // harness-missing-signals: dev/judge exited without writing signals file pendingQuestions?: string[]; userFeedback?: string; iterations: LoopIterationRecord[]; From 6f95f02091bbf688d85cc81c22613fcba37ee30e Mon Sep 17 00:00:00 2001 From: Fotis Stamatelopoulos Date: Wed, 13 May 2026 21:28:45 -0700 Subject: [PATCH 2/3] feat(loop): cfcf stop kills active subprocess + new cfcf agents reap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Real gmbot dogfood: after iter 19 reflection started at 3:41pm, the user observed the dashboard still showing "reflection running" at 8pm — 4.5 hours later. The loop state was already phase=stopped/outcome=stopped, but the reflection codex subprocess (PID 99729) was still alive with the cfcf server as its parent. Three layered bugs: 1. stopLoop() only flipped the state flag — it never signaled any subprocess. 2. The per-role state-store (reflect-state.json) was never flipped to "failed", so getActiveAgent() kept claiming reflection was alive. 3. cfcf server reap filters PPID==1 (orphans of a dead server), so it couldn't catch live-server children. Fix is three-pronged: **Fix 1**: stopLoop() now kills the active subprocess. New loopActivePhaseToRole(phase) maps loop phases to AgentRoles: - pre_loop_reviewing -> architect - dev_executing -> dev - judging -> judge - reflecting -> reflection - documenting -> documenter - else -> null (skip) For non-null roles, look up active-processes registry, call existing killProcessTree() (SIGTERM -> 1.5s -> SIGKILL). **Fix 2**: per-role state-store flip via new markStateFailed(workspaceId, reason) helpers in each runner (reflection / documenter / architect). Idempotent: no-op when no state, no-op when state already terminal. Dashboard chip clears on next poll. **Fix 3**: new cfcf agents reap command + matching API endpoints (/api/active-processes GET + POST kill). Mirrors cfcf server reap UX but scoped to live-server children, not PPID==1 orphans. Per-row y/N or --yes for non-interactive. **Hard guarantee** (per user's explicit caution): PA and HA are untouched at all three layers. They run as cfcf spec / cfcf help assistant with stdio:"inherit" OUTSIDE the cfcf server, are NOT in the active-processes registry, and AgentRole type doesn't include them. Two layers of defence: type-level + phase-mapping. **Wall-clock timeout intentionally NOT implemented**. PA proposed it; user and I both rejected. Conflicts with codex rolling-window quota recovery (~1hr legit waits); matches cfcf "human on the loop" principle. cfcf agents reap is the user-driven equivalent. Files (~340 LoC): - process-manager.ts: export killProcessTree - reflection-runner.ts / documenter-runner.ts / architect-runner.ts: add markStateFailed helpers - iteration-loop.ts: loopActivePhaseToRole helper, stopLoop kills + flips state - server/app.ts: 2 new endpoints - cli/commands/agents.ts: new file, cfcf agents reap - cli/index.ts: register Test coverage: 8 new tests (1038 total pass). Typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 153 ++++++++++++++++++++ packages/cli/src/commands/agents.ts | 152 +++++++++++++++++++ packages/cli/src/index.ts | 2 + packages/core/src/architect-runner.ts | 25 ++++ packages/core/src/documenter-runner.ts | 25 ++++ packages/core/src/iteration-loop.test.ts | 39 +++++ packages/core/src/iteration-loop.ts | 90 ++++++++++++ packages/core/src/process-manager.ts | 2 +- packages/core/src/reflection-runner.test.ts | 22 +++ packages/core/src/reflection-runner.ts | 28 ++++ packages/server/src/app.ts | 83 +++++++++++ 11 files changed, 620 insertions(+), 1 deletion(-) create mode 100644 packages/cli/src/commands/agents.ts diff --git a/CHANGELOG.md b/CHANGELOG.md index 9c3d449..29a592d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,159 @@ Changes are tracked via git tags. Each release tag corresponds to an entry here. ## [Unreleased] +### Fixed — `cfcf stop` now kills the active subprocess + clears the dashboard indicator + new `cfcf agents reap` escape hatch + +Real gmbot dogfood: after iter 19 completed and reflection started +at 3:41 pm, the user observed the dashboard still showing +"reflection running" at 8 pm — 4.5 hours later. PA's investigation +found that the loop state was `phase: stopped, outcome: stopped`, +but the reflection's codex subprocess (PID 99729) was still alive +with the cfcf server as its parent. + +**Root cause** (three layered bugs): + +1. **`stopLoop()` only flips the state flag.** It set `state.phase + = "stopped"` and saved — but never signaled any subprocess. + The main while-loop's `isStopped(state)` check would exit + cleanly at the next iteration boundary; a mid-flight subprocess + (the reflection codex) kept running because nothing told it to + stop. The runner's `await managed.result` promise got stranded. +2. **The per-role state-store row stayed `executing`.** When the + subprocess was orphaned, the reflection-runner's normal + completion path didn't fire, so `reflect-state.json`'s status + was never flipped. The dashboard's `getActiveAgent()` reads + that status; it kept claiming reflection was alive. +3. **No way to kill the stranded subprocess from cfcf.** The + existing `cfcf server reap` filters on `PPID==1` (orphans of a + dead server). PID 99729 was still parented by the live cfcf + server, so it didn't qualify. User had to find the PID + manually and `kill -9` it. + +The fix lands all three: + +**Fix 1 — `stopLoop()` kills the active subprocess.** + +`stopLoop()` now maps the loop's current phase to its in-flight +`AgentRole` via a new pure helper `loopActivePhaseToRole(phase): +AgentRole | null`: + + - `pre_loop_reviewing` → `architect` + - `dev_executing` → `dev` + - `judging` → `judge` + - `reflecting` → `reflection` + - `documenting` → `documenter` + - else → `null` (no subprocess in flight; nothing to kill) + +For non-null roles, looks up the registered process in +`active-processes.ts` and calls the existing `killProcessTree()` +(SIGTERM → 1.5s grace → SIGKILL on the process group). + +**Hard guarantee — PA and HA are untouched.** They run interactively +as `cfcf spec` / `cfcf help assistant` with `stdio: "inherit"`, +outside the cfcf server entirely. They are NOT in the +`active-processes` registry, and the registry's `AgentRole` type +(`packages/core/src/log-storage.ts:20`) doesn't include them. +Both layers of defence: type-level (can't be in the registry) + +phase-mapping (the loop never enters a phase that resolves to +"pa" or "ha"). + +**Fix 2 — Per-role state-store flip on kill.** + +After killing, `stopLoop()` calls the role-appropriate +`markStateFailed(workspaceId, reason)` helper: + + - `markReflectStateFailed` (new, in `reflection-runner.ts`) + - `markDocumentStateFailed` (new, in `documenter-runner.ts`) + - `markReviewStateFailed` (new, in `architect-runner.ts`) + +All three are idempotent: no-op when no state exists, no-op when +state is already terminal (so they're safe to call +unconditionally, and won't clobber a real completion if the +subprocess happened to finish between `stopLoop()` being called +and the kill landing). On a flip, status → `failed`, error → +`"Loop stopped by user via cfcf stop"`, `completedAt` is set. + +The dashboard's `getActiveAgent()` re-reads these state stores; the +"X running" chip clears on the next poll (~5s). + +Dev / judge don't have separate state stores — they're tracked +inside the iteration's record in `loop-state.json` itself, which +the "stopped" flip already handles. So those two roles skip the +state-flip step. + +**Fix 3 — New `cfcf agents reap` command (manual escape hatch).** + +For the rare cases where the auto-kill in `stopLoop()` doesn't +land (subprocess in uninterruptible sleep, user observes a stuck +process from a pre-fix build, etc.), or for proactive inspection: + +``` +$ cfcf agents reap # all workspaces +$ cfcf agents reap --workspace gmbot # one workspace +$ cfcf agents reap --yes # non-interactive +``` + +Mirrors `cfcf server reap`'s UX (item 6.31) but scoped to +LIVE-SERVER children rather than `PPID==1` orphans: + + - Requires the cfcf server to be running (queries the in-memory + registry). + - Lists each process: workspace, role, PID, runtime, log file. + - Per-row y/N confirmation by default; `--yes` for scripted + use. + - Per-row kill via new HTTP routes: + - `GET /api/active-processes?workspace=` (list) + - `POST /api/active-processes/:workspaceId/:role/kill` (kill + + state-flip) + - **Cannot kill PA/HA** — they're not in the registry, so they + don't show up in the list. Same hard guarantee as `stopLoop()`. + +**Why no wall-clock timeout (rejected design).** + +PA's initial proposal included a per-role timeout (e.g., kill +reflection if it runs > 4 hours). Declined deliberately: + + - The existing codex rolling-window quota recovery can take ~1 + hour. Any wall-clock heuristic risks killing legitimate recovery. + - Timeouts add config knobs that are hard to tune correctly. + - Matches cf²'s "human on the loop, not in it" principle — + the user is the one who decides when something is stuck. The + new `cfcf agents reap` is the user-driven version of the + same intent. + +**Implementation** (~340 LoC): + +- `packages/core/src/process-manager.ts` — export + `killProcessTree` (was private). +- `packages/core/src/reflection-runner.ts` — + `markReflectStateFailed(workspaceId, reason)`. +- `packages/core/src/documenter-runner.ts` — + `markDocumentStateFailed(workspaceId, reason)`. +- `packages/core/src/architect-runner.ts` — + `markReviewStateFailed(workspaceId, reason)`. +- `packages/core/src/iteration-loop.ts`: + - `loopActivePhaseToRole(phase): AgentRole | null` (exported, + pure, testable). + - `stopLoop()` extended: capture active role before phase flip, + kill subprocess, flip per-role state store. +- `packages/server/src/app.ts` — two new endpoints + (`/api/active-processes` GET + POST kill). +- `packages/cli/src/commands/agents.ts` — new file, `cfcf agents + reap` command. +- `packages/cli/src/index.ts` — register the new command. + +**Test coverage** (8 new tests, all 1038 tests pass): + +- 6 new tests in `iteration-loop.test.ts` covering + `loopActivePhaseToRole`: each of the five active phases maps to + the correct role; non-active phases (preparing / deciding / + paused / completed / failed / stopped) return null. +- 2 new tests in `reflection-runner.test.ts` covering + `markReflectStateFailed`'s idempotent no-state branch (safe to + call unconditionally, doesn't throw for unknown workspace IDs). + Document + architect runners share the same code shape; the + reflection test locks in the pattern. + ### Added — Harness-level missing-signals pause + `retry_iteration` resume action cfcf has always implicitly assumed that the dev and judge agents diff --git a/packages/cli/src/commands/agents.ts b/packages/cli/src/commands/agents.ts new file mode 100644 index 0000000..b934319 --- /dev/null +++ b/packages/cli/src/commands/agents.ts @@ -0,0 +1,152 @@ +/** + * `cfcf agents` — manage loop-spawned agent processes. + * + * Today only `reap` is implemented: list + kill subprocesses + * registered with the cfcf server. Mirrors the UX of `cfcf server + * reap` (item 6.31) but scoped differently: + * + * - `cfcf server reap` finds ORPHANS (PPID==1) — children of a + * prior cfcf server that hard-crashed and got reparented to + * init. Works without a running cfcf server. + * + * - `cfcf agents reap` lists LIVE-SERVER CHILDREN — what the + * current cfcf server is still tracking. Requires the server + * to be running. Use case: a subprocess survived a `cfcf stop` + * (rare, but the missing-signals follow-up + this command + * close that gap), or the user wants to inspect what's still + * attached to the server. + * + * **Safety**: this only ever lists / kills processes in the + * `active-processes` registry, which is scoped to loop-spawned + * agent roles (dev / judge / architect / documenter / reflection). + * PA (`cfcf spec`) and HA (`cfcf help assistant`) run interactively + * outside the cfcf server (`stdio: "inherit"`), are NOT in the + * registry, and CANNOT be killed via this command. + */ + +import type { Command } from "commander"; +import { createInterface } from "node:readline"; +import { isServerReachable, get, post } from "../client.js"; + +interface ActiveProcessSummary { + workspaceId: string; + workspaceName: string; + role: "dev" | "judge" | "architect" | "documenter" | "reflection"; + pid: number | undefined; + startedAt: string; + runtimeMs: number; + logFileName: string | null; +} + +function readLine(question: string): Promise { + const rl = createInterface({ input: process.stdin, output: process.stdout }); + return new Promise((resolve) => { + rl.question(question, (answer) => { + rl.close(); + resolve(answer.trim()); + }); + }); +} + +function formatRuntime(ms: number): string { + const seconds = Math.floor(ms / 1000); + const minutes = Math.floor(seconds / 60); + const hours = Math.floor(minutes / 60); + if (hours > 0) return `${hours}h${minutes % 60}m`; + if (minutes > 0) return `${minutes}m${seconds % 60}s`; + return `${seconds}s`; +} + +function formatRow(p: ActiveProcessSummary): string { + const pidStr = p.pid !== undefined ? `pid ${p.pid}` : "pid ?"; + const runtime = formatRuntime(p.runtimeMs); + const log = p.logFileName ? ` log=${p.logFileName}` : ""; + return ` ${p.workspaceName} / ${p.role}: ${pidStr}, running for ${runtime}${log}`; +} + +export function registerAgentsCommands(program: Command): void { + const agents = program + .command("agents") + .description("Manage running agent subprocesses"); + + agents + .command("reap") + .description("List + interactively kill active agent processes (loop-spawned only — PA/HA are untouched)") + .option("--workspace ", "Limit to a single workspace (by name or ID)") + .option("-y, --yes", "Kill without prompting (non-interactive use)") + .action(async (opts: { workspace?: string; yes?: boolean }) => { + if (!(await isServerReachable())) { + console.error( + "cfcf server is not running. Start it with: cfcf server start", + ); + console.error( + "(For orphans from a previously-crashed server, use `cfcf server reap` instead.)", + ); + process.exit(1); + } + + const query = opts.workspace + ? `?workspace=${encodeURIComponent(opts.workspace)}` + : ""; + const res = await get<{ active: ActiveProcessSummary[] }>( + `/api/active-processes${query}`, + ); + if (!res.ok) { + console.error(`Failed to list active processes: ${res.error}`); + process.exit(1); + } + + const procs = res.data!.active; + if (procs.length === 0) { + console.log( + opts.workspace + ? `No active agent processes for workspace "${opts.workspace}".` + : "No active agent processes.", + ); + return; + } + + console.log( + `Found ${procs.length} active agent process${procs.length === 1 ? "" : "es"}:`, + ); + for (const p of procs) { + console.log(formatRow(p)); + } + console.log(); + + let proceed = false; + if (opts.yes) { + proceed = true; + } else { + const answer = await readLine( + `Kill ${procs.length === 1 ? "this process" : `these ${procs.length} processes`}? [y/N]: `, + ); + proceed = /^y(es)?$/i.test(answer); + } + if (!proceed) { + console.log("Aborted. No processes killed."); + return; + } + + let killed = 0; + let failed = 0; + for (const p of procs) { + const r = await post<{ ok: boolean; error?: string }>( + `/api/active-processes/${encodeURIComponent(p.workspaceId)}/${p.role}/kill`, + {}, + ); + if (r.ok && r.data?.ok) { + console.log(` ✓ ${p.workspaceName} / ${p.role} (pid ${p.pid})`); + killed++; + } else { + console.log( + ` ✗ ${p.workspaceName} / ${p.role} (pid ${p.pid}): ${r.data?.error ?? r.error ?? "unknown error"}`, + ); + failed++; + } + } + console.log(); + console.log(`Reap complete: ${killed} signaled, ${failed} failed.`); + if (failed > 0) process.exit(1); + }); +} diff --git a/packages/cli/src/index.ts b/packages/cli/src/index.ts index 1491518..ef9a815 100644 --- a/packages/cli/src/index.ts +++ b/packages/cli/src/index.ts @@ -42,6 +42,7 @@ import { registerSelfUpdateCommand } from "./commands/self-update.js"; import { registerCompletionCommand } from "./commands/completion.js"; import { registerHelpCommand } from "./commands/help.js"; import { registerSpecCommand } from "./commands/spec.js"; +import { registerAgentsCommands } from "./commands/agents.js"; import { maybePrintUpdateBanner } from "./update-banner.js"; // --- Internal: run the server in-process --- @@ -114,6 +115,7 @@ registerRunCommand(program); registerResumeCommand(program); registerStopCommand(program); registerSpecCommand(program); +registerAgentsCommands(program); registerReviewCommand(program); registerDocumentCommand(program); registerReflectCommand(program); diff --git a/packages/core/src/architect-runner.ts b/packages/core/src/architect-runner.ts index 187d89f..0c64d9b 100644 --- a/packages/core/src/architect-runner.ts +++ b/packages/core/src/architect-runner.ts @@ -83,6 +83,31 @@ export function getReviewState(workspaceId: string): ReviewState | undefined { return reviewStore.get(workspaceId); } +/** + * Force-mark this workspace's architect/review state as `failed` + * with the given reason. Used by `stopLoop()` + `cfcf agents reap` + * to flip the dashboard's "review running" indicator off + * immediately after killing the subprocess. Mirrors + * `markReflectStateFailed` / `markDocumentStateFailed`. + * + * Idempotent: no-op when no state exists, or when the state is + * already terminal (completed / failed). Returns true if a flip + * happened. + */ +export async function markReviewStateFailed( + workspaceId: string, + reason: string, +): Promise { + const current = reviewStore.get(workspaceId); + if (!current) return false; + if (!REVIEW_ACTIVE_STATUSES.has(current.status)) return false; + current.status = "failed"; + current.error = reason; + current.completedAt = new Date().toISOString(); + await setReviewState(current); + return true; +} + /** * Server-boot hook (item F.23): load every workspace's persisted * review state into the in-memory cache. Any state still in an active diff --git a/packages/core/src/documenter-runner.ts b/packages/core/src/documenter-runner.ts index cf7ae55..d134896 100644 --- a/packages/core/src/documenter-runner.ts +++ b/packages/core/src/documenter-runner.ts @@ -82,6 +82,31 @@ export function getDocumentState(workspaceId: string): DocumentState | undefined return documentStore.get(workspaceId); } +/** + * Force-mark this workspace's documenter state as `failed` with the + * given reason. Used by `stopLoop()` + `cfcf agents reap` to flip + * the dashboard's "documenting" indicator off immediately after + * killing the subprocess. Mirrors `markReflectStateFailed` / + * `markReviewStateFailed`. + * + * Idempotent: no-op when no state exists, or when the state is + * already terminal (completed / failed). Returns true if a flip + * happened. + */ +export async function markDocumentStateFailed( + workspaceId: string, + reason: string, +): Promise { + const current = documentStore.get(workspaceId); + if (!current) return false; + if (!DOCUMENT_ACTIVE_STATUSES.has(current.status)) return false; + current.status = "failed"; + current.error = reason; + current.completedAt = new Date().toISOString(); + await setDocumentState(current); + return true; +} + /** * Server-boot hook (item F.23): hydrate the in-memory cache from disk * and clean up any state still claiming to be active. Returns the diff --git a/packages/core/src/iteration-loop.test.ts b/packages/core/src/iteration-loop.test.ts index 2801cc1..632b096 100644 --- a/packages/core/src/iteration-loop.test.ts +++ b/packages/core/src/iteration-loop.test.ts @@ -13,6 +13,7 @@ import { stopLoop, shouldRunReflection, resolveEffectiveDetermination, + loopActivePhaseToRole, type LoopState, } from "./iteration-loop.js"; import type { WorkspaceConfig, DevSignals, JudgeSignals, ReflectionSignals } from "./types.js"; @@ -1048,3 +1049,41 @@ describe("buildPreLoopBlockReason: SCOPE_COMPLETE message", () => { expect(msg).not.toContain("already implemented"); }); }); + + +// --- loopActivePhaseToRole (kill-on-stop feature) --- +// +// Maps the loop's `state.phase` to the AgentRole currently being +// awaited, so stopLoop / `cfcf agents reap` know which subprocess +// to signal. PA / HA are deliberately omitted — they run outside +// the cfcf server (`stdio: "inherit"`) and are never in the +// active-processes registry. + +describe("loopActivePhaseToRole", () => { + test("pre_loop_reviewing → architect", () => { + expect(loopActivePhaseToRole("pre_loop_reviewing")).toBe("architect"); + }); + test("dev_executing → dev", () => { + expect(loopActivePhaseToRole("dev_executing")).toBe("dev"); + }); + test("judging → judge", () => { + expect(loopActivePhaseToRole("judging")).toBe("judge"); + }); + test("reflecting → reflection (the iter-19 gmbot case)", () => { + expect(loopActivePhaseToRole("reflecting")).toBe("reflection"); + }); + test("documenting → documenter", () => { + expect(loopActivePhaseToRole("documenting")).toBe("documenter"); + }); + test("non-active phases return null (nothing to kill)", () => { + // preparing / deciding / paused / completed / failed / stopped: + // no subprocess is in flight, so stopLoop's kill loop should + // skip cleanly. + expect(loopActivePhaseToRole("preparing")).toBeNull(); + expect(loopActivePhaseToRole("deciding")).toBeNull(); + expect(loopActivePhaseToRole("paused")).toBeNull(); + expect(loopActivePhaseToRole("completed")).toBeNull(); + expect(loopActivePhaseToRole("failed")).toBeNull(); + expect(loopActivePhaseToRole("stopped")).toBeNull(); + }); +}); diff --git a/packages/core/src/iteration-loop.ts b/packages/core/src/iteration-loop.ts index 99f4a62..8958e63 100644 --- a/packages/core/src/iteration-loop.ts +++ b/packages/core/src/iteration-loop.ts @@ -849,6 +849,39 @@ export async function resumeLoop( /** * Stop a running or paused loop. */ +/** + * Map the loop's current phase to the AgentRole of the subprocess + * the harness is awaiting RIGHT NOW. Returns `null` when no + * subprocess is in flight (preparing / deciding / paused / etc.). + * + * Used by `stopLoop()` (and any future "kill the currently-running + * thing" callers) to decide what to signal. The loop awaits exactly + * one subprocess at a time, so a single role is unambiguous. + * + * Deliberately omits PA/HA — those run outside the cfcf server + * entirely (interactive `cfcf spec` / `cfcf help assistant` with + * `stdio: "inherit"`), are NOT in the `active-processes` registry, + * and must NOT be touched by a loop-level stop. + */ +export function loopActivePhaseToRole( + phase: LoopState["phase"], +): import("./log-storage.js").AgentRole | null { + switch (phase) { + case "pre_loop_reviewing": + return "architect"; + case "dev_executing": + return "dev"; + case "judging": + return "judge"; + case "reflecting": + return "reflection"; + case "documenting": + return "documenter"; + default: + return null; + } +} + export async function stopLoop(workspaceId: string): Promise { const state = await getLoopState(workspaceId); if (!state) { @@ -858,12 +891,69 @@ export async function stopLoop(workspaceId: string): Promise { throw new Error(`Loop already ended (phase: ${state.phase})`); } + // Capture the active role BEFORE flipping phase to "stopped" — + // the flip would otherwise hide which subprocess we should signal. + const activeRole = loopActivePhaseToRole(state.phase); + state.phase = "stopped"; state.outcome = "stopped"; state.completedAt = new Date().toISOString(); await saveLoopState(state); await updateWorkspace(workspaceId, { status: "stopped" }); + // Kill the subprocess the loop was awaiting + flip its per-role + // state-store row so the dashboard's "X running" indicator clears + // immediately rather than staying stuck until next server boot. + // + // Previously `stopLoop()` only flipped the state flag; mid-flight + // subprocesses (notably reflection codex on a multi-hour stall) + // kept running indefinitely because nothing signaled them. The + // loop's `isStopped(state)` check exits the while-loop at the next + // iteration boundary, but a subprocess that started before stop + // is still being awaited at that point — the await promise gets + // stranded and the subprocess outlives the server's interest in + // it. Real dogfood: gmbot iter 19 reflection ran ~5 hours past + // a `cfcf stop` because of this. + if (activeRole) { + const { getActiveProcess } = await import("./active-processes.js"); + const entry = getActiveProcess(workspaceId, activeRole); + if (entry) { + try { + const { killProcessTree } = await import("./process-manager.js"); + killProcessTree(entry.process.proc.pid); + } catch (err) { + console.warn( + `[stopLoop] killProcessTree failed for ${activeRole} (pid ${entry.process.proc.pid}): ${err instanceof Error ? err.message : String(err)}`, + ); + } + } + // Flip the per-role state-store. Each runner exposes a + // `markStateFailed` helper that's idempotent — no-op when + // there's no state (dev/judge don't have one) or when the + // state is already terminal (e.g., the subprocess died on its + // own between stop being called and the kill landing). + const reason = "Loop stopped by user via cfcf stop"; + try { + if (activeRole === "reflection") { + const { markReflectStateFailed } = await import("./reflection-runner.js"); + await markReflectStateFailed(workspaceId, reason); + } else if (activeRole === "documenter") { + const { markDocumentStateFailed } = await import("./documenter-runner.js"); + await markDocumentStateFailed(workspaceId, reason); + } else if (activeRole === "architect") { + const { markReviewStateFailed } = await import("./architect-runner.js"); + await markReviewStateFailed(workspaceId, reason); + } + // dev / judge: no separate state-store. The loop-state itself + // (now "stopped") is the audit trail; their iterRecord stays + // in state.iterations as a partial entry. + } catch (err) { + console.warn( + `[stopLoop] state-store flip failed for ${activeRole}: ${err instanceof Error ? err.message : String(err)}`, + ); + } + } + return state; } diff --git a/packages/core/src/process-manager.ts b/packages/core/src/process-manager.ts index c95c65e..705a779 100644 --- a/packages/core/src/process-manager.ts +++ b/packages/core/src/process-manager.ts @@ -159,7 +159,7 @@ export async function spawnProcess(opts: ProcessOptions): Promise { expect(validatePlanRewrite(basePlan, newPlan)).toEqual({ valid: true }); }); }); + + // markReflectStateFailed: used by stopLoop + cfcf agents reap to + // flip the dashboard's "reflection running" indicator off after + // killing the subprocess. The "with-state" branch is exercised + // end-to-end via stopLoop integration tests; here we lock in the + // idempotent / safe no-state path so callers can call it + // unconditionally without worrying about whether state exists. + describe("markReflectStateFailed (no-state branch — idempotency)", () => { + it("returns false when the workspace has no reflect state (no-op, safe to call unconditionally)", async () => { + const result = await markReflectStateFailed("workspace-id-with-no-state-xyz", "test reason"); + expect(result).toBe(false); + }); + + it("does not throw for unknown workspace ids", async () => { + // Callers in stopLoop wrap in try/catch but rely on this not + // throwing for the common case (no reflection ever spawned). + await expect( + markReflectStateFailed("totally-fake-id", "test reason"), + ).resolves.toBe(false); + }); + }); }); diff --git a/packages/core/src/reflection-runner.ts b/packages/core/src/reflection-runner.ts index bee403b..9052963 100644 --- a/packages/core/src/reflection-runner.ts +++ b/packages/core/src/reflection-runner.ts @@ -104,6 +104,34 @@ export function getReflectState(workspaceId: string): ReflectState | undefined { return reflectStore.get(workspaceId); } +/** + * Force-mark this workspace's reflection state as `failed` with the + * given reason. Used by `stopLoop()` (and the manual + * `cfcf agents reap` command) to flip the dashboard's "reflection + * running" indicator off immediately after killing the subprocess — + * without this, the in-memory state would stay "executing" and the + * `/api/activity` chip would keep claiming reflection is alive. + * + * Idempotent: no-op when the workspace has no reflect state, or + * when the existing state is already a terminal status (completed / + * failed) — won't clobber a real completion. + * + * Returns true if a flip happened, false otherwise. + */ +export async function markReflectStateFailed( + workspaceId: string, + reason: string, +): Promise { + const current = reflectStore.get(workspaceId); + if (!current) return false; + if (!REFLECT_ACTIVE_STATUSES.has(current.status)) return false; + current.status = "failed"; + current.error = reason; + current.completedAt = new Date().toISOString(); + await setReflectState(current); + return true; +} + /** * Server-boot hook (item F.23): hydrate the in-memory cache from disk * and clean up any state still claiming to be active. diff --git a/packages/server/src/app.ts b/packages/server/src/app.ts index 0518a62..40374b9 100644 --- a/packages/server/src/app.ts +++ b/packages/server/src/app.ts @@ -49,6 +49,13 @@ import { startReflection, getReflectState, stopReflection, + getActiveProcessesForWorkspace, + getAllActiveProcesses, + getActiveProcess, + killProcessTree, + markReflectStateFailed, + markDocumentStateFailed, + markReviewStateFailed, } from "@cfcf/core"; const startedAt = Date.now(); @@ -135,6 +142,82 @@ export function createApp() { return c.json({ active: items }); }); + // --- Active loop-spawned agent processes (for `cfcf agents reap`) --- + // + // List + kill subprocess registry. Distinct from `/api/activity`: + // /activity reads state files (loop-state, reflect-state, …) and + // is the dashboard's source of truth for "what's running"; this + // endpoint exposes the in-memory PID registry (`active-processes.ts`) + // so the user can identify and kill a stranded subprocess when + // the higher-level stop didn't (or to clean up after a kill that + // didn't reach the state-store flip path). + // + // PA / HA are intentionally absent — they run as `cfcf spec` / + // `cfcf help assistant` with `stdio: "inherit"` outside the cfcf + // server, are not in the registry, and cannot be killed via these + // endpoints. Safe by construction. + app.get("/api/active-processes", async (c) => { + const workspaceFilter = c.req.query("workspace"); + const all = workspaceFilter + ? getActiveProcessesForWorkspace(workspaceFilter) + : getAllActiveProcesses(); + const workspaces = await listWorkspaces(); + const wsById = new Map(workspaces.map((w) => [w.id, w])); + const now = Date.now(); + const summary = all.map((e) => ({ + workspaceId: e.workspaceId, + workspaceName: wsById.get(e.workspaceId)?.name ?? e.workspaceId, + role: e.role, + pid: e.process.proc.pid, + startedAt: e.startedAt, + runtimeMs: Math.max(0, now - new Date(e.startedAt).getTime()), + logFileName: e.logFileName ?? null, + })); + return c.json({ active: summary }); + }); + + app.post("/api/active-processes/:workspaceId/:role/kill", async (c) => { + const workspaceId = c.req.param("workspaceId"); + const role = c.req.param("role") as + | "dev" | "judge" | "architect" | "documenter" | "reflection"; + const validRoles = new Set([ + "dev", "judge", "architect", "documenter", "reflection", + ]); + if (!validRoles.has(role)) { + return c.json({ ok: false, error: `Unknown role: ${role}` }, 400); + } + const entry = getActiveProcess(workspaceId, role); + if (!entry) { + return c.json( + { ok: false, error: `No active ${role} process for workspace ${workspaceId}` }, + 404, + ); + } + const pid = entry.process.proc.pid; + try { + killProcessTree(pid); + } catch (err) { + return c.json( + { ok: false, error: `killProcessTree failed: ${err instanceof Error ? err.message : String(err)}` }, + 500, + ); + } + // Flip per-role state-store row so the dashboard clears + // immediately. Idempotent; dev/judge have no state-store so + // skip them. + const reason = "Killed by user via cfcf agents reap"; + try { + if (role === "reflection") await markReflectStateFailed(workspaceId, reason); + else if (role === "documenter") await markDocumentStateFailed(workspaceId, reason); + else if (role === "architect") await markReviewStateFailed(workspaceId, reason); + } catch (err) { + console.warn( + `[reap] state-store flip failed for ${role}: ${err instanceof Error ? err.message : String(err)}`, + ); + } + return c.json({ ok: true, killed: true, pid, role, workspaceId }); + }); + app.get("/api/status", async (c) => { const hasConfig = await configExists(); const config = hasConfig ? await readConfig() : null; From 6f4664bed8d598eb4875cfadb4b395e7e6b0e2e0 Mon Sep 17 00:00:00 2001 From: Fotis Stamatelopoulos Date: Wed, 13 May 2026 21:34:37 -0700 Subject: [PATCH 3/3] =?UTF-8?q?fix(loop):=20isStopped=20guard=20before=20D?= =?UTF-8?q?ECIDE=20=E2=80=94=20survives=20post-reflection=20cascade?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit User reported: after `cfcf stop` + `kill -9` on a stranded reflection codex, the loop CONTINUED to the next iteration. They flagged this as "probably a consequence of bug A" — correct diagnosis. The chain (Bug D, downstream of Bug A): 1. cfcf stop sets state.phase = "stopped" 2. Reflection await is mid-flight; codex alive 3. User kill -9's codex 4. Await unblocks → post-reflection processing runs (archive, commit, Clio ingest) 5. Line 2283: state.phase = "deciding" — CLOBBERS "stopped" 6. makeDecision returns "continue" → case "continue" returns 7. Outer loop's isLoopDone(state) sees "deciding" → false 8. Next iteration spawns The isStopped checks exist after dev's await (line 1852) and after judge's await (line 2011), but were missing between the reflection block and the DECIDE block. Any state.phase mutation in that window overwrites an external "stopped" flag. Fix: one isStopped guard right before "state.phase = deciding". Matches the pattern of the dev / judge guards. Closes the cascade for users on Fix 1 (kill-on-stop) too — Fix 1 makes the await unblock faster, but the same overwrite would have run without the guard. No new tests — the guard's logic is `isStopped(state)`, which is already covered exhaustively. The comment explains the rationale + the gmbot iter-19 dogfood case so future edits don't drop it. Typecheck clean. Full suite (1038 tests) passes. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 69 ++++++++++++++++++++++++++--- packages/core/src/iteration-loop.ts | 16 +++++++ 2 files changed, 80 insertions(+), 5 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 29a592d..fd528fc 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -18,7 +18,9 @@ found that the loop state was `phase: stopped, outcome: stopped`, but the reflection's codex subprocess (PID 99729) was still alive with the cfcf server as its parent. -**Root cause** (three layered bugs): +**Root cause** (four layered bugs — the fourth surfaced after the +others were fixed and the user reported the loop continuing +post-kill-9): 1. **`stopLoop()` only flips the state flag.** It set `state.phase = "stopped"` and saved — but never signaled any subprocess. @@ -36,8 +38,17 @@ with the cfcf server as its parent. dead server). PID 99729 was still parented by the live cfcf server, so it didn't qualify. User had to find the PID manually and `kill -9` it. - -The fix lands all three: +4. **External `cfcf stop` flag was overwritten by post-reflection + phase transitions.** The reflection block had no `isStopped` + guard between its `await` and the next `state.phase = "deciding"` + mutation. When the user `kill -9`'d the stranded subprocess, + the await unblocked → post-processing ran → the DECIDE block's + phase assignment clobbered `"stopped"` → loop continued. User + reported this as "the loop continued after I killed the + process" — confirming PA's hypothesis that it was a downstream + consequence of Bug A. + +The fix lands all four: **Fix 1 — `stopLoop()` kills the active subprocess.** @@ -89,7 +100,52 @@ inside the iteration's record in `loop-state.json` itself, which the "stopped" flip already handles. So those two roles skip the state-flip step. -**Fix 3 — New `cfcf agents reap` command (manual escape hatch).** +**Fix 3 — Stop signal survives the reflection post-processing cascade.** + +A subtler second-order bug surfaced in the same gmbot iter-19 trace. +After PA helped the user discover the stranded codex, they ran +`kill -9 99729` — and observed the loop *continuing* afterwards. +PA's hypothesis was right: it was Bug A's downstream consequence, +but the chain is specifically: + +1. `cfcf stop` set `state.phase = "stopped"` (correct — Fix 1 + would have killed the subprocess too, but the user was on a + pre-fix build). +2. Reflection's `await runReflectionSync(...)` was still + in flight; the codex subprocess was alive. +3. The user's external `kill -9` killed codex. +4. The await unblocked. Post-reflection processing ran (archive + the analysis, commit, ingest into Clio). +5. **`state.phase = "deciding"` (line 2283) overwrote the + externally-set `"stopped"` flag.** +6. `makeDecision()` returned its normal verdict (likely + "continue"). +7. `case "continue":` returned. +8. Back in the outer loop, `isLoopDone(state)` checked + `state.phase` — now `"deciding"`, not `"stopped"` — and + returned false. +9. **The next iteration of the while-loop started.** + +This was a genuine missing guard. The `isStopped(state)` check +exists after dev's await (iteration-loop.ts:1852) and after judge's +await (line 2011), but was missing between the reflection block +and the DECIDE block. Any phase mutation in that window clobbers +an external stop. + +**Fix**: one `if (isStopped(state)) return;` guard right before the +DECIDE block. The cascade exits cleanly back to the outer loop's +`isLoopDone(state)` check, which sees the unmodified `"stopped"` +and returns. Matches the pattern of the dev / judge guards. + +This guard also closes the cascade for users with Fix 1 enabled. +Fix 1 makes `cfcf stop` actively kill the subprocess (which then +unblocks the await), but the same overwrite-on-post-processing +cascade still ran without the guard. With Fix 3, the user's +`cfcf stop` is honoured end-to-end whether the subprocess was +killed by our `killProcessTree()` (Fix 1) or by an external +`kill -9` (user escape hatch). + +**Fix 4 — New `cfcf agents reap` command (manual escape hatch).** For the rare cases where the auto-kill in `stopLoop()` doesn't land (subprocess in uninterruptible sleep, user observes a stuck @@ -129,7 +185,7 @@ reflection if it runs > 4 hours). Declined deliberately: new `cfcf agents reap` is the user-driven version of the same intent. -**Implementation** (~340 LoC): +**Implementation** (~360 LoC): - `packages/core/src/process-manager.ts` — export `killProcessTree` (was private). @@ -144,6 +200,9 @@ reflection if it runs > 4 hours). Declined deliberately: pure, testable). - `stopLoop()` extended: capture active role before phase flip, kill subprocess, flip per-role state store. + - **New `isStopped(state)` guard** before the DECIDE block + (line ~2283) — closes the Bug-D cascade that otherwise + clobbered the stop flag on post-reflection phase transitions. - `packages/server/src/app.ts` — two new endpoints (`/api/active-processes` GET + POST kill). - `packages/cli/src/commands/agents.ts` — new file, `cfcf agents diff --git a/packages/core/src/iteration-loop.ts b/packages/core/src/iteration-loop.ts index 8958e63..4944ed2 100644 --- a/packages/core/src/iteration-loop.ts +++ b/packages/core/src/iteration-loop.ts @@ -2279,6 +2279,22 @@ async function runJudgeAndDecide( ); } + // External-stop guard. Mirrors the `isStopped(state)` checks after + // dev's await (line ~1852) and judge's await (line ~2011) — the + // reflection path was missing it, and any phase mutation here + // would clobber a `state.phase = "stopped"` that `stopLoop()` set + // while we were mid-reflection. Real dogfood: gmbot iter 19 — user + // called `cfcf stop`, then `kill -9`'d the stranded reflection + // codex; the await unblocked, post-reflection ingest ran, line + // 2283 below overwrote `"stopped"` with `"deciding"`, makeDecision + // returned `continue`, and the outer loop's `isLoopDone(state)` + // saw `"deciding"` → false → spawned iter N+1. With this guard, + // the cascade exits cleanly back to runLoop's `isLoopDone(state)` + // check, which sees the unmodified `"stopped"` and returns. + if (isStopped(state)) { + return; + } + // --- DECIDE --- state.phase = "deciding"; await saveLoopState(state);