diff --git a/CHANGELOG.md b/CHANGELOG.md index 61a4348..fd528fc 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,339 @@ Changes are tracked via git tags. Each release tag corresponds to an entry here. ## [Unreleased] +### Fixed — `cfcf stop` now kills the active subprocess + clears the dashboard indicator + new `cfcf agents reap` escape hatch + +Real gmbot dogfood: after iter 19 completed and reflection started +at 3:41 pm, the user observed the dashboard still showing +"reflection running" at 8 pm — 4.5 hours later. PA's investigation +found that the loop state was `phase: stopped, outcome: stopped`, +but the reflection's codex subprocess (PID 99729) was still alive +with the cfcf server as its parent. + +**Root cause** (four layered bugs — the fourth surfaced after the +others were fixed and the user reported the loop continuing +post-kill-9): + +1. **`stopLoop()` only flips the state flag.** It set `state.phase + = "stopped"` and saved — but never signaled any subprocess. + The main while-loop's `isStopped(state)` check would exit + cleanly at the next iteration boundary; a mid-flight subprocess + (the reflection codex) kept running because nothing told it to + stop. The runner's `await managed.result` promise got stranded. +2. **The per-role state-store row stayed `executing`.** When the + subprocess was orphaned, the reflection-runner's normal + completion path didn't fire, so `reflect-state.json`'s status + was never flipped. The dashboard's `getActiveAgent()` reads + that status; it kept claiming reflection was alive. +3. **No way to kill the stranded subprocess from cfcf.** The + existing `cfcf server reap` filters on `PPID==1` (orphans of a + dead server). PID 99729 was still parented by the live cfcf + server, so it didn't qualify. User had to find the PID + manually and `kill -9` it. +4. **External `cfcf stop` flag was overwritten by post-reflection + phase transitions.** The reflection block had no `isStopped` + guard between its `await` and the next `state.phase = "deciding"` + mutation. When the user `kill -9`'d the stranded subprocess, + the await unblocked → post-processing ran → the DECIDE block's + phase assignment clobbered `"stopped"` → loop continued. User + reported this as "the loop continued after I killed the + process" — confirming PA's hypothesis that it was a downstream + consequence of Bug A. + +The fix lands all four: + +**Fix 1 — `stopLoop()` kills the active subprocess.** + +`stopLoop()` now maps the loop's current phase to its in-flight +`AgentRole` via a new pure helper `loopActivePhaseToRole(phase): +AgentRole | null`: + + - `pre_loop_reviewing` → `architect` + - `dev_executing` → `dev` + - `judging` → `judge` + - `reflecting` → `reflection` + - `documenting` → `documenter` + - else → `null` (no subprocess in flight; nothing to kill) + +For non-null roles, looks up the registered process in +`active-processes.ts` and calls the existing `killProcessTree()` +(SIGTERM → 1.5s grace → SIGKILL on the process group). + +**Hard guarantee — PA and HA are untouched.** They run interactively +as `cfcf spec` / `cfcf help assistant` with `stdio: "inherit"`, +outside the cfcf server entirely. They are NOT in the +`active-processes` registry, and the registry's `AgentRole` type +(`packages/core/src/log-storage.ts:20`) doesn't include them. +Both layers of defence: type-level (can't be in the registry) + +phase-mapping (the loop never enters a phase that resolves to +"pa" or "ha"). + +**Fix 2 — Per-role state-store flip on kill.** + +After killing, `stopLoop()` calls the role-appropriate +`markStateFailed(workspaceId, reason)` helper: + + - `markReflectStateFailed` (new, in `reflection-runner.ts`) + - `markDocumentStateFailed` (new, in `documenter-runner.ts`) + - `markReviewStateFailed` (new, in `architect-runner.ts`) + +All three are idempotent: no-op when no state exists, no-op when +state is already terminal (so they're safe to call +unconditionally, and won't clobber a real completion if the +subprocess happened to finish between `stopLoop()` being called +and the kill landing). On a flip, status → `failed`, error → +`"Loop stopped by user via cfcf stop"`, `completedAt` is set. + +The dashboard's `getActiveAgent()` re-reads these state stores; the +"X running" chip clears on the next poll (~5s). + +Dev / judge don't have separate state stores — they're tracked +inside the iteration's record in `loop-state.json` itself, which +the "stopped" flip already handles. So those two roles skip the +state-flip step. + +**Fix 3 — Stop signal survives the reflection post-processing cascade.** + +A subtler second-order bug surfaced in the same gmbot iter-19 trace. +After PA helped the user discover the stranded codex, they ran +`kill -9 99729` — and observed the loop *continuing* afterwards. +PA's hypothesis was right: it was Bug A's downstream consequence, +but the chain is specifically: + +1. `cfcf stop` set `state.phase = "stopped"` (correct — Fix 1 + would have killed the subprocess too, but the user was on a + pre-fix build). +2. Reflection's `await runReflectionSync(...)` was still + in flight; the codex subprocess was alive. +3. The user's external `kill -9` killed codex. +4. The await unblocked. Post-reflection processing ran (archive + the analysis, commit, ingest into Clio). +5. **`state.phase = "deciding"` (line 2283) overwrote the + externally-set `"stopped"` flag.** +6. `makeDecision()` returned its normal verdict (likely + "continue"). +7. `case "continue":` returned. +8. Back in the outer loop, `isLoopDone(state)` checked + `state.phase` — now `"deciding"`, not `"stopped"` — and + returned false. +9. **The next iteration of the while-loop started.** + +This was a genuine missing guard. The `isStopped(state)` check +exists after dev's await (iteration-loop.ts:1852) and after judge's +await (line 2011), but was missing between the reflection block +and the DECIDE block. Any phase mutation in that window clobbers +an external stop. + +**Fix**: one `if (isStopped(state)) return;` guard right before the +DECIDE block. The cascade exits cleanly back to the outer loop's +`isLoopDone(state)` check, which sees the unmodified `"stopped"` +and returns. Matches the pattern of the dev / judge guards. + +This guard also closes the cascade for users with Fix 1 enabled. +Fix 1 makes `cfcf stop` actively kill the subprocess (which then +unblocks the await), but the same overwrite-on-post-processing +cascade still ran without the guard. With Fix 3, the user's +`cfcf stop` is honoured end-to-end whether the subprocess was +killed by our `killProcessTree()` (Fix 1) or by an external +`kill -9` (user escape hatch). + +**Fix 4 — New `cfcf agents reap` command (manual escape hatch).** + +For the rare cases where the auto-kill in `stopLoop()` doesn't +land (subprocess in uninterruptible sleep, user observes a stuck +process from a pre-fix build, etc.), or for proactive inspection: + +``` +$ cfcf agents reap # all workspaces +$ cfcf agents reap --workspace gmbot # one workspace +$ cfcf agents reap --yes # non-interactive +``` + +Mirrors `cfcf server reap`'s UX (item 6.31) but scoped to +LIVE-SERVER children rather than `PPID==1` orphans: + + - Requires the cfcf server to be running (queries the in-memory + registry). + - Lists each process: workspace, role, PID, runtime, log file. + - Per-row y/N confirmation by default; `--yes` for scripted + use. + - Per-row kill via new HTTP routes: + - `GET /api/active-processes?workspace=` (list) + - `POST /api/active-processes/:workspaceId/:role/kill` (kill + + state-flip) + - **Cannot kill PA/HA** — they're not in the registry, so they + don't show up in the list. Same hard guarantee as `stopLoop()`. + +**Why no wall-clock timeout (rejected design).** + +PA's initial proposal included a per-role timeout (e.g., kill +reflection if it runs > 4 hours). Declined deliberately: + + - The existing codex rolling-window quota recovery can take ~1 + hour. Any wall-clock heuristic risks killing legitimate recovery. + - Timeouts add config knobs that are hard to tune correctly. + - Matches cf²'s "human on the loop, not in it" principle — + the user is the one who decides when something is stuck. The + new `cfcf agents reap` is the user-driven version of the + same intent. + +**Implementation** (~360 LoC): + +- `packages/core/src/process-manager.ts` — export + `killProcessTree` (was private). +- `packages/core/src/reflection-runner.ts` — + `markReflectStateFailed(workspaceId, reason)`. +- `packages/core/src/documenter-runner.ts` — + `markDocumentStateFailed(workspaceId, reason)`. +- `packages/core/src/architect-runner.ts` — + `markReviewStateFailed(workspaceId, reason)`. +- `packages/core/src/iteration-loop.ts`: + - `loopActivePhaseToRole(phase): AgentRole | null` (exported, + pure, testable). + - `stopLoop()` extended: capture active role before phase flip, + kill subprocess, flip per-role state store. + - **New `isStopped(state)` guard** before the DECIDE block + (line ~2283) — closes the Bug-D cascade that otherwise + clobbered the stop flag on post-reflection phase transitions. +- `packages/server/src/app.ts` — two new endpoints + (`/api/active-processes` GET + POST kill). +- `packages/cli/src/commands/agents.ts` — new file, `cfcf agents + reap` command. +- `packages/cli/src/index.ts` — register the new command. + +**Test coverage** (8 new tests, all 1038 tests pass): + +- 6 new tests in `iteration-loop.test.ts` covering + `loopActivePhaseToRole`: each of the five active phases maps to + the correct role; non-active phases (preparing / deciding / + paused / completed / failed / stopped) return null. +- 2 new tests in `reflection-runner.test.ts` covering + `markReflectStateFailed`'s idempotent no-state branch (safe to + call unconditionally, doesn't throw for unknown workspace IDs). + Document + architect runners share the same code shape; the + reflection test locks in the pattern. + +### Added — Harness-level missing-signals pause + `retry_iteration` resume action + +cfcf has always implicitly assumed that the dev and judge agents +will run far enough to write their signals files +(`cfcf-iteration-signals.json` / `cfcf-judge-signals.json`). +Everything downstream — determination parsing, archive logic, +decision engine — branches off those files. When an agent exits +before writing them (hard quota cap, crash, OOM kill, killed +process, …), cfcf silently marked the iteration "failed" and +continued to the next one. That cost a real gmbot dogfood run: iter +10 hit a codex usage limit, dev + judge both exited without signals, +cfcf moved on to iter 11 which then had to re-discover the missing +work from scratch. + +**Architectural framing**: this is a harness-contract violation, not +an agent-reasoning failure. The agent never ran far enough to +classify anything. So the fix lives at the harness layer, not in +the agent prompts — and deliberately doesn't try to identify the +root cause. Quota / crash / OOM / killed all collapse to "agent +exited without signals → can't safely continue → pause." The user +reads the log file to identify the cause and chooses how to +recover. + +**Behaviour**: + +- After `parseSignalFile` / `parseJudgeSignals` returns `null` + (missing file, JSON parse failure, schema validation failure — + all collapsed to one signal), the loop pauses with + `pauseReason: "missing_signals"` rather than continuing silently. +- The pause sets `pendingQuestions[0]` to a generic anomaly message + pointing at the log file: *"Anomaly detected: dev agent for + iteration 10 exited (exit code 1) without writing its signals + file (agent crashed or hit a usage limit). Check the log: … + Resume with `retry_iteration` to redo this iteration, `continue` + to skip to iteration 11, or `stop_loop_now` to abandon."* +- Notification fires (`type: "loop.paused"`) so any notification + channel sees the pause immediately, not just the dashboard. +- Working tree is left as-is — no `git reset`, no commit attempt. + Any dirty changes are preserved for the user to inspect. + +**New resume action: `retry_iteration`** (extends the structured +pause-actions vocabulary from item 6.25): + +- Rolls the workspace's iteration counter back by one (new + `decrementIteration(workspaceId)` helper in `workspaces.ts`). +- Pops the failed iteration's record from `state.iterations` so + it isn't double-counted. +- Falls through to the regular iteration body, which calls + `nextIteration()` (now returns the same number that failed) and + proceeds to delete + re-create the existing branch off HEAD + (logic the loop already had for retry scenarios). Dev respawns + on a clean branch under the same iteration number. +- Applicable ONLY to `missing_signals` pauses — surfaced in the + web UI's pause-action button row + accepted on + `cfcf resume --action retry_iteration`. Rejected for other + pause classes via the existing `pauseReasonAllowedActions` + validation. +- `missing_signals` pause-action allowlist: `["retry_iteration", + "continue", "stop_loop_now"]`. No finish/refine/consult — those + all assume a meaningful iteration result to act on. + +**What this doesn't do** (intentional non-scope): + +- No root-cause classification (quota vs. crash vs. OOM). The + harness doesn't try — that requires reading the log file, + which is the user's job. Future enhancement: per-adapter + pattern matching that adds a *label* to the pause without + changing behaviour. +- No automatic retry after a parsed reset time. v1 is pause + + user resumes when ready. Matches "human on the loop, not in + it." Revisit if dogfood demands it. +- No partial-work recovery beyond "leave the working tree alone." + If the agent made dirty changes before crashing, they survive in + the working tree; the user can `git status` / `git diff` / + `git stash` as needed. On `retry_iteration` the branch is + re-created off HEAD; any branch-only commits from the failed + attempt are lost. (Working-tree changes are NOT lost.) + +**Implementation** (~140 LoC): + +- `packages/core/src/types.ts` — extend `ResumeAction` with + `"retry_iteration"`. +- `packages/core/src/iteration-loop.ts`: + - Extend `LoopState.pauseReason` with `"missing_signals"`. + - New `pauseLoopOnMissingSignals(workspace, state, iter, phase, + exitCode, logFile)` helper. Sets phase + pauseReason + + pendingQuestions, saves state, flips workspace status, + dispatches a `loop.paused` notification. + - Two new call sites: after the dev signal-parse (post-exit, + pre-commit) and after the judge signal-parse (same shape). + Both return immediately — the outer loop's `isLoopDone(state)` + check exits the loop cleanly. + - New `retry_iteration` branch in the resume-action dispatch + chain — decrements the counter, pops the failed iter record, + falls through. + - Extend `pauseReasonAllowedActions` with the + `case "missing_signals"` branch. +- `packages/core/src/workspaces.ts` — new + `decrementIteration(workspaceId)` helper (idempotent at floor 0). +- `packages/cli/src/commands/resume.ts` — add `retry_iteration` + to `RESUME_ACTIONS` + the `--action` help text. +- `packages/web/src/types.ts` + `packages/web/src/api.ts` — mirror + the new pauseReason + ResumeAction variants. +- `packages/web/src/components/FeedbackForm.tsx` — add the + `missing_signals` case to the per-pauseReason action matrix, + add `retry_iteration` to `ACTION_LABEL` + `ACTION_HELP`. The + existing `pendingQuestions[0]` rendering surfaces the + generic-anomaly message we set in the pause helper, so no + additional copy was needed. + +**Test coverage** (8 new tests, all 1030 tests pass): + +- 4 new tests in `iteration-loop.test.ts` covering the + `missing_signals` action matrix (retry/continue/stop allowed; + finish/refine/consult excluded) and the inverse — `retry_iteration` + is rejected for every other pauseReason. +- 4 new tests in `workspaces.test.ts` for `decrementIteration`: + rolls back by 1, decrement+nextIteration round-trips to the + failed number (load-bearing invariant for retry), floors at 0, + null for non-existent workspace. + ### Fixed — History tab: judge row time cell while judge is pending F.21 follow-up surfaced during v0.24.2 dogfood. The judge row's diff --git a/packages/cli/src/commands/agents.ts b/packages/cli/src/commands/agents.ts new file mode 100644 index 0000000..b934319 --- /dev/null +++ b/packages/cli/src/commands/agents.ts @@ -0,0 +1,152 @@ +/** + * `cfcf agents` — manage loop-spawned agent processes. + * + * Today only `reap` is implemented: list + kill subprocesses + * registered with the cfcf server. Mirrors the UX of `cfcf server + * reap` (item 6.31) but scoped differently: + * + * - `cfcf server reap` finds ORPHANS (PPID==1) — children of a + * prior cfcf server that hard-crashed and got reparented to + * init. Works without a running cfcf server. + * + * - `cfcf agents reap` lists LIVE-SERVER CHILDREN — what the + * current cfcf server is still tracking. Requires the server + * to be running. Use case: a subprocess survived a `cfcf stop` + * (rare, but the missing-signals follow-up + this command + * close that gap), or the user wants to inspect what's still + * attached to the server. + * + * **Safety**: this only ever lists / kills processes in the + * `active-processes` registry, which is scoped to loop-spawned + * agent roles (dev / judge / architect / documenter / reflection). + * PA (`cfcf spec`) and HA (`cfcf help assistant`) run interactively + * outside the cfcf server (`stdio: "inherit"`), are NOT in the + * registry, and CANNOT be killed via this command. + */ + +import type { Command } from "commander"; +import { createInterface } from "node:readline"; +import { isServerReachable, get, post } from "../client.js"; + +interface ActiveProcessSummary { + workspaceId: string; + workspaceName: string; + role: "dev" | "judge" | "architect" | "documenter" | "reflection"; + pid: number | undefined; + startedAt: string; + runtimeMs: number; + logFileName: string | null; +} + +function readLine(question: string): Promise { + const rl = createInterface({ input: process.stdin, output: process.stdout }); + return new Promise((resolve) => { + rl.question(question, (answer) => { + rl.close(); + resolve(answer.trim()); + }); + }); +} + +function formatRuntime(ms: number): string { + const seconds = Math.floor(ms / 1000); + const minutes = Math.floor(seconds / 60); + const hours = Math.floor(minutes / 60); + if (hours > 0) return `${hours}h${minutes % 60}m`; + if (minutes > 0) return `${minutes}m${seconds % 60}s`; + return `${seconds}s`; +} + +function formatRow(p: ActiveProcessSummary): string { + const pidStr = p.pid !== undefined ? `pid ${p.pid}` : "pid ?"; + const runtime = formatRuntime(p.runtimeMs); + const log = p.logFileName ? ` log=${p.logFileName}` : ""; + return ` ${p.workspaceName} / ${p.role}: ${pidStr}, running for ${runtime}${log}`; +} + +export function registerAgentsCommands(program: Command): void { + const agents = program + .command("agents") + .description("Manage running agent subprocesses"); + + agents + .command("reap") + .description("List + interactively kill active agent processes (loop-spawned only — PA/HA are untouched)") + .option("--workspace ", "Limit to a single workspace (by name or ID)") + .option("-y, --yes", "Kill without prompting (non-interactive use)") + .action(async (opts: { workspace?: string; yes?: boolean }) => { + if (!(await isServerReachable())) { + console.error( + "cfcf server is not running. Start it with: cfcf server start", + ); + console.error( + "(For orphans from a previously-crashed server, use `cfcf server reap` instead.)", + ); + process.exit(1); + } + + const query = opts.workspace + ? `?workspace=${encodeURIComponent(opts.workspace)}` + : ""; + const res = await get<{ active: ActiveProcessSummary[] }>( + `/api/active-processes${query}`, + ); + if (!res.ok) { + console.error(`Failed to list active processes: ${res.error}`); + process.exit(1); + } + + const procs = res.data!.active; + if (procs.length === 0) { + console.log( + opts.workspace + ? `No active agent processes for workspace "${opts.workspace}".` + : "No active agent processes.", + ); + return; + } + + console.log( + `Found ${procs.length} active agent process${procs.length === 1 ? "" : "es"}:`, + ); + for (const p of procs) { + console.log(formatRow(p)); + } + console.log(); + + let proceed = false; + if (opts.yes) { + proceed = true; + } else { + const answer = await readLine( + `Kill ${procs.length === 1 ? "this process" : `these ${procs.length} processes`}? [y/N]: `, + ); + proceed = /^y(es)?$/i.test(answer); + } + if (!proceed) { + console.log("Aborted. No processes killed."); + return; + } + + let killed = 0; + let failed = 0; + for (const p of procs) { + const r = await post<{ ok: boolean; error?: string }>( + `/api/active-processes/${encodeURIComponent(p.workspaceId)}/${p.role}/kill`, + {}, + ); + if (r.ok && r.data?.ok) { + console.log(` ✓ ${p.workspaceName} / ${p.role} (pid ${p.pid})`); + killed++; + } else { + console.log( + ` ✗ ${p.workspaceName} / ${p.role} (pid ${p.pid}): ${r.data?.error ?? r.error ?? "unknown error"}`, + ); + failed++; + } + } + console.log(); + console.log(`Reap complete: ${killed} signaled, ${failed} failed.`); + if (failed > 0) process.exit(1); + }); +} diff --git a/packages/cli/src/commands/resume.ts b/packages/cli/src/commands/resume.ts index dc10ad2..0f45f6d 100644 --- a/packages/cli/src/commands/resume.ts +++ b/packages/cli/src/commands/resume.ts @@ -26,6 +26,11 @@ const RESUME_ACTIONS = [ "stop_loop_now", "refine_plan", "consult_reflection", + // Re-spawn dev on the same iteration. Typical use: a + // `missing_signals` pause (quota cap, crash) is being unblocked + // after the underlying cause cleared. The harness rolls back the + // iteration counter and re-creates the branch. + "retry_iteration", ] as const; type ResumeAction = (typeof RESUME_ACTIONS)[number]; @@ -81,7 +86,7 @@ function buildActionOption(program: Command): Option { // eslint-disable-next-line @typescript-eslint/no-explicit-any const opt = (program as any).createOption( "--action ", - "Resume action: continue | finish_loop | stop_loop_now | refine_plan | consult_reflection", + "Resume action: continue | finish_loop | stop_loop_now | refine_plan | consult_reflection | retry_iteration", ); opt.choices(RESUME_ACTIONS as readonly string[]); opt.default("continue"); diff --git a/packages/cli/src/index.ts b/packages/cli/src/index.ts index 1491518..ef9a815 100644 --- a/packages/cli/src/index.ts +++ b/packages/cli/src/index.ts @@ -42,6 +42,7 @@ import { registerSelfUpdateCommand } from "./commands/self-update.js"; import { registerCompletionCommand } from "./commands/completion.js"; import { registerHelpCommand } from "./commands/help.js"; import { registerSpecCommand } from "./commands/spec.js"; +import { registerAgentsCommands } from "./commands/agents.js"; import { maybePrintUpdateBanner } from "./update-banner.js"; // --- Internal: run the server in-process --- @@ -114,6 +115,7 @@ registerRunCommand(program); registerResumeCommand(program); registerStopCommand(program); registerSpecCommand(program); +registerAgentsCommands(program); registerReviewCommand(program); registerDocumentCommand(program); registerReflectCommand(program); diff --git a/packages/core/src/architect-runner.ts b/packages/core/src/architect-runner.ts index 187d89f..0c64d9b 100644 --- a/packages/core/src/architect-runner.ts +++ b/packages/core/src/architect-runner.ts @@ -83,6 +83,31 @@ export function getReviewState(workspaceId: string): ReviewState | undefined { return reviewStore.get(workspaceId); } +/** + * Force-mark this workspace's architect/review state as `failed` + * with the given reason. Used by `stopLoop()` + `cfcf agents reap` + * to flip the dashboard's "review running" indicator off + * immediately after killing the subprocess. Mirrors + * `markReflectStateFailed` / `markDocumentStateFailed`. + * + * Idempotent: no-op when no state exists, or when the state is + * already terminal (completed / failed). Returns true if a flip + * happened. + */ +export async function markReviewStateFailed( + workspaceId: string, + reason: string, +): Promise { + const current = reviewStore.get(workspaceId); + if (!current) return false; + if (!REVIEW_ACTIVE_STATUSES.has(current.status)) return false; + current.status = "failed"; + current.error = reason; + current.completedAt = new Date().toISOString(); + await setReviewState(current); + return true; +} + /** * Server-boot hook (item F.23): load every workspace's persisted * review state into the in-memory cache. Any state still in an active diff --git a/packages/core/src/documenter-runner.ts b/packages/core/src/documenter-runner.ts index cf7ae55..d134896 100644 --- a/packages/core/src/documenter-runner.ts +++ b/packages/core/src/documenter-runner.ts @@ -82,6 +82,31 @@ export function getDocumentState(workspaceId: string): DocumentState | undefined return documentStore.get(workspaceId); } +/** + * Force-mark this workspace's documenter state as `failed` with the + * given reason. Used by `stopLoop()` + `cfcf agents reap` to flip + * the dashboard's "documenting" indicator off immediately after + * killing the subprocess. Mirrors `markReflectStateFailed` / + * `markReviewStateFailed`. + * + * Idempotent: no-op when no state exists, or when the state is + * already terminal (completed / failed). Returns true if a flip + * happened. + */ +export async function markDocumentStateFailed( + workspaceId: string, + reason: string, +): Promise { + const current = documentStore.get(workspaceId); + if (!current) return false; + if (!DOCUMENT_ACTIVE_STATUSES.has(current.status)) return false; + current.status = "failed"; + current.error = reason; + current.completedAt = new Date().toISOString(); + await setDocumentState(current); + return true; +} + /** * Server-boot hook (item F.23): hydrate the in-memory cache from disk * and clean up any state still claiming to be active. Returns the diff --git a/packages/core/src/iteration-loop.test.ts b/packages/core/src/iteration-loop.test.ts index 0651843..632b096 100644 --- a/packages/core/src/iteration-loop.test.ts +++ b/packages/core/src/iteration-loop.test.ts @@ -13,6 +13,7 @@ import { stopLoop, shouldRunReflection, resolveEffectiveDetermination, + loopActivePhaseToRole, type LoopState, } from "./iteration-loop.js"; import type { WorkspaceConfig, DevSignals, JudgeSignals, ReflectionSignals } from "./types.js"; @@ -947,6 +948,47 @@ describe("pauseReasonAllowedActions (item 6.25)", () => { test("scope_complete excludes consult_reflection (no iterations to reflect on)", () => { expect(pauseReasonAllowedActions("scope_complete")).not.toContain("consult_reflection"); }); + + // harness-missing-signals: when dev or judge exit without writing + // their signals file, the iteration is in an unknown state. The + // user's three meaningful options are: redo the iter (retry), + // skip it (continue), or abandon (stop). finish_loop / + // refine_plan / consult_reflection all assume we have a + // meaningful iter result to act on; we don't. + test("missing_signals: retry_iteration + continue + stop_loop_now", () => { + expectActions(pauseReasonAllowedActions("missing_signals"), [ + "retry_iteration", + "continue", + "stop_loop_now", + ]); + }); + + test("missing_signals excludes finish_loop (no iter result to finish on)", () => { + expect(pauseReasonAllowedActions("missing_signals")).not.toContain("finish_loop"); + }); + + test("missing_signals excludes refine_plan + consult_reflection (no iter data to refine/reflect on)", () => { + expect(pauseReasonAllowedActions("missing_signals")).not.toContain("refine_plan"); + expect(pauseReasonAllowedActions("missing_signals")).not.toContain("consult_reflection"); + }); + + test("retry_iteration is ONLY applicable to missing_signals (not offered for other pauseReasons)", () => { + // The retry action is specifically the "agent exited without + // signals, redo the iter" path. It doesn't make sense for the + // other pause classes (cadence, anomaly with signals, etc. + // — those have a known iter result and use other actions). + const otherReasons = [ + undefined, + "cadence", + "anomaly", + "user_input_needed", + "max_iterations", + "scope_complete", + ] as const; + for (const reason of otherReasons) { + expect(pauseReasonAllowedActions(reason)).not.toContain("retry_iteration"); + } + }); }); // 2026-05-02: Architect SCOPE_COMPLETE readiness verdict (item 6.25 follow-up). @@ -1007,3 +1049,41 @@ describe("buildPreLoopBlockReason: SCOPE_COMPLETE message", () => { expect(msg).not.toContain("already implemented"); }); }); + + +// --- loopActivePhaseToRole (kill-on-stop feature) --- +// +// Maps the loop's `state.phase` to the AgentRole currently being +// awaited, so stopLoop / `cfcf agents reap` know which subprocess +// to signal. PA / HA are deliberately omitted — they run outside +// the cfcf server (`stdio: "inherit"`) and are never in the +// active-processes registry. + +describe("loopActivePhaseToRole", () => { + test("pre_loop_reviewing → architect", () => { + expect(loopActivePhaseToRole("pre_loop_reviewing")).toBe("architect"); + }); + test("dev_executing → dev", () => { + expect(loopActivePhaseToRole("dev_executing")).toBe("dev"); + }); + test("judging → judge", () => { + expect(loopActivePhaseToRole("judging")).toBe("judge"); + }); + test("reflecting → reflection (the iter-19 gmbot case)", () => { + expect(loopActivePhaseToRole("reflecting")).toBe("reflection"); + }); + test("documenting → documenter", () => { + expect(loopActivePhaseToRole("documenting")).toBe("documenter"); + }); + test("non-active phases return null (nothing to kill)", () => { + // preparing / deciding / paused / completed / failed / stopped: + // no subprocess is in flight, so stopLoop's kill loop should + // skip cleanly. + expect(loopActivePhaseToRole("preparing")).toBeNull(); + expect(loopActivePhaseToRole("deciding")).toBeNull(); + expect(loopActivePhaseToRole("paused")).toBeNull(); + expect(loopActivePhaseToRole("completed")).toBeNull(); + expect(loopActivePhaseToRole("failed")).toBeNull(); + expect(loopActivePhaseToRole("stopped")).toBeNull(); + }); +}); diff --git a/packages/core/src/iteration-loop.ts b/packages/core/src/iteration-loop.ts index 2764fca..4944ed2 100644 --- a/packages/core/src/iteration-loop.ts +++ b/packages/core/src/iteration-loop.ts @@ -33,7 +33,7 @@ import { getAdapter } from "./adapters/index.js"; import { spawnProcess } from "./process-manager.js"; import { getIterationLogPath, ensureWorkspaceLogDir } from "./log-storage.js"; import * as gitManager from "./git-manager.js"; -import { nextIteration, updateWorkspace, getWorkspace } from "./workspaces.js"; +import { nextIteration, decrementIteration, updateWorkspace, getWorkspace } from "./workspaces.js"; import { runDocumentSync } from "./documenter-runner.js"; import { runReflectionSync, parseReflectionSignals } from "./reflection-runner.js"; import { runReviewSync, readinessGateBlocks } from "./architect-runner.js"; @@ -88,7 +88,14 @@ export interface LoopState { | "anomaly" | "user_input_needed" | "max_iterations" - | "scope_complete"; // item 6.25 follow-up: architect SCOPE_COMPLETE + | "scope_complete" // item 6.25 follow-up: architect SCOPE_COMPLETE + // Harness contract violation: dev or judge agent exited without + // writing its signals file. Inclusive of all root causes (crash, + // quota cap, OOM, etc.) — the harness doesn't classify, just + // pauses + surfaces. User checks the log to identify cause and + // resumes with `retry_iteration` (redo the iter) or `continue` + // (skip + move on) or `stop_loop_now`. + | "missing_signals"; /** Questions from dev/judge that need user answers */ pendingQuestions?: string[]; /** User feedback to inject into the next iteration */ @@ -753,6 +760,21 @@ export function pauseReasonAllowedActions( // - refine_plan → user adds new requirements, re-runs review return ["finish_loop", "stop_loop_now", "refine_plan"]; + case "missing_signals": + // Dev or judge agent exited without writing signals (crash, + // quota cap, OOM, …). The iteration is in an unknown state; + // the harness can't safely judge, reflect, refine, or finish + // on no data. Allowed: + // - retry_iteration → redo the failed iter (counter rolled + // back; branch re-created). Natural choice after a quota + // cap resets. + // - continue → skip the failed iter, start iter N+1 + // fresh. Use when the failed work isn't worth recovering. + // - stop_loop_now → abandon, accept partial progress. + // No finish_loop / refine_plan / consult_reflection — those + // all assume we have a meaningful iteration result to act on. + return ["retry_iteration", "continue", "stop_loop_now"]; + default: // A1 pre-loop review blocked. continue (after user edited the // Problem Pack), stop_loop_now, refine_plan (re-run architect @@ -827,6 +849,39 @@ export async function resumeLoop( /** * Stop a running or paused loop. */ +/** + * Map the loop's current phase to the AgentRole of the subprocess + * the harness is awaiting RIGHT NOW. Returns `null` when no + * subprocess is in flight (preparing / deciding / paused / etc.). + * + * Used by `stopLoop()` (and any future "kill the currently-running + * thing" callers) to decide what to signal. The loop awaits exactly + * one subprocess at a time, so a single role is unambiguous. + * + * Deliberately omits PA/HA — those run outside the cfcf server + * entirely (interactive `cfcf spec` / `cfcf help assistant` with + * `stdio: "inherit"`), are NOT in the `active-processes` registry, + * and must NOT be touched by a loop-level stop. + */ +export function loopActivePhaseToRole( + phase: LoopState["phase"], +): import("./log-storage.js").AgentRole | null { + switch (phase) { + case "pre_loop_reviewing": + return "architect"; + case "dev_executing": + return "dev"; + case "judging": + return "judge"; + case "reflecting": + return "reflection"; + case "documenting": + return "documenter"; + default: + return null; + } +} + export async function stopLoop(workspaceId: string): Promise { const state = await getLoopState(workspaceId); if (!state) { @@ -836,12 +891,69 @@ export async function stopLoop(workspaceId: string): Promise { throw new Error(`Loop already ended (phase: ${state.phase})`); } + // Capture the active role BEFORE flipping phase to "stopped" — + // the flip would otherwise hide which subprocess we should signal. + const activeRole = loopActivePhaseToRole(state.phase); + state.phase = "stopped"; state.outcome = "stopped"; state.completedAt = new Date().toISOString(); await saveLoopState(state); await updateWorkspace(workspaceId, { status: "stopped" }); + // Kill the subprocess the loop was awaiting + flip its per-role + // state-store row so the dashboard's "X running" indicator clears + // immediately rather than staying stuck until next server boot. + // + // Previously `stopLoop()` only flipped the state flag; mid-flight + // subprocesses (notably reflection codex on a multi-hour stall) + // kept running indefinitely because nothing signaled them. The + // loop's `isStopped(state)` check exits the while-loop at the next + // iteration boundary, but a subprocess that started before stop + // is still being awaited at that point — the await promise gets + // stranded and the subprocess outlives the server's interest in + // it. Real dogfood: gmbot iter 19 reflection ran ~5 hours past + // a `cfcf stop` because of this. + if (activeRole) { + const { getActiveProcess } = await import("./active-processes.js"); + const entry = getActiveProcess(workspaceId, activeRole); + if (entry) { + try { + const { killProcessTree } = await import("./process-manager.js"); + killProcessTree(entry.process.proc.pid); + } catch (err) { + console.warn( + `[stopLoop] killProcessTree failed for ${activeRole} (pid ${entry.process.proc.pid}): ${err instanceof Error ? err.message : String(err)}`, + ); + } + } + // Flip the per-role state-store. Each runner exposes a + // `markStateFailed` helper that's idempotent — no-op when + // there's no state (dev/judge don't have one) or when the + // state is already terminal (e.g., the subprocess died on its + // own between stop being called and the kill landing). + const reason = "Loop stopped by user via cfcf stop"; + try { + if (activeRole === "reflection") { + const { markReflectStateFailed } = await import("./reflection-runner.js"); + await markReflectStateFailed(workspaceId, reason); + } else if (activeRole === "documenter") { + const { markDocumentStateFailed } = await import("./documenter-runner.js"); + await markDocumentStateFailed(workspaceId, reason); + } else if (activeRole === "architect") { + const { markReviewStateFailed } = await import("./architect-runner.js"); + await markReviewStateFailed(workspaceId, reason); + } + // dev / judge: no separate state-store. The loop-state itself + // (now "stopped") is the audit trail; their iterRecord stays + // in state.iterations as a partial entry. + } catch (err) { + console.warn( + `[stopLoop] state-store flip failed for ${activeRole}: ${err instanceof Error ? err.message : String(err)}`, + ); + } + } + return state; } @@ -1052,6 +1164,70 @@ async function handleStopLoopNow( ); } +/** + * Pause the loop because the dev or judge agent exited without + * writing its signals file. Harness contract: agents must produce + * `cfcf-iteration-signals.json` / `cfcf-judge-signals.json`. When + * the file is missing, unreadable, or fails schema validation, the + * harness can't safely judge / reflect / decide on no data — pause + * and surface to the user. + * + * Inclusive of all root causes (quota cap, agent crash, OOM, killed + * process, agent CLI bug). The harness deliberately doesn't try to + * classify — the user reads the log file to identify cause, then + * resumes with `retry_iteration` (most common after a quota cap + * resets) / `continue` (skip the failed iter) / `stop_loop_now`. + * + * The pause-state writes mirror the existing + * `consult_reflection`/`scope_complete` pause paths: set phase + + * pauseReason + pendingQuestions, saveLoopState, flip workspace + * status to "paused", dispatch a notification. No git surgery — + * any dirty working-tree state is preserved for the user to + * inspect; on `retry_iteration`, the iteration counter is rolled + * back and the existing branch is re-created off HEAD (the normal + * iteration body already deletes-and-recreates an existing branch). + */ +async function pauseLoopOnMissingSignals( + workspace: WorkspaceConfig, + state: LoopState, + iterationNum: number, + phase: "dev" | "judge", + exitCode: number | undefined, + logFileName: string, +): Promise { + state.phase = "paused"; + state.pauseReason = "missing_signals"; + const exitDetail = + exitCode === undefined ? "" : ` (exit code ${exitCode})`; + const message = + `Anomaly detected: ${phase} agent for iteration ${iterationNum} ` + + `exited${exitDetail} without writing its signals file (agent crashed ` + + `or hit a usage limit). Check the log: ${logFileName}. Resume with ` + + `\`retry_iteration\` to redo this iteration, \`continue\` to skip ` + + `to iteration ${iterationNum + 1}, or \`stop_loop_now\` to abandon.`; + state.pendingQuestions = [message]; + await saveLoopState(state); + await updateWorkspace(workspace.id, { status: "paused" }); + + dispatchForWorkspace( + makeEvent({ + type: "loop.paused", + title: "Loop paused: agent exited without signals", + message: `${workspace.name}: ${message}`, + workspaceId: workspace.id, + workspaceName: workspace.name, + details: { + pauseReason: "missing_signals", + phase, + iteration: iterationNum, + exitCode: exitCode ?? null, + logFile: logFileName, + }, + }), + workspace.notifications, + ); +} + /** * Handle the `finish_loop` resume action: skip the iteration spawn, * jump to the configured end-of-loop sequence (documenter when @@ -1246,6 +1422,22 @@ async function runLoop( // is handled inside the consult helper, which sets the pause state // and returns "stop" to abort runLoop). } + if (resumeAction === "retry_iteration") { + // Re-spawn dev on the same iteration after a `missing_signals` + // pause (quota cap reset, agent crash recovery, etc.). Roll the + // iteration counter back by one so the loop body's next + // `nextIteration()` call returns the SAME number that failed — + // the branch-creation logic already deletes + re-creates an + // existing branch off HEAD, so the failed attempt's branch is + // replaced cleanly. Also drop the failed iteration's record + // from state.iterations so it isn't double-counted; the retry + // produces a fresh record under the same number. + await decrementIteration(workspace.id); + if (state.iterations.length > 0) { + state.iterations.pop(); + } + // Fall through to the regular pre-loop / iteration logic. + } // --- PRE-LOOP REVIEW (item 5.1 autoReviewSpecs=true) --- // Runs before iteration 1, not on resume. When the readiness gate @@ -1664,6 +1856,29 @@ async function runLoop( const devSignals = await parseSignalFile(workspace.repoPath); iterRecord.devSignals = devSignals ?? undefined; + // Harness contract check: dev agent must write + // `cfcf-iteration-signals.json`. Absence (file missing, JSON + // parse failure, or schema validation failure — all collapsed + // to `null` by `parseSignalFile`) means we have no data to + // judge / reflect / decide on. Pause + surface to the user; + // do NOT silently treat as a failed iteration and move on. + // Inclusive of all root causes: agent crashed mid-startup, + // quota cap, OOM kill, etc. Working tree (if dirty) is left + // as-is for the user to inspect; on `retry_iteration` the + // existing branch is deleted + re-created off HEAD by the + // normal iteration body. + if (!devSignals) { + await pauseLoopOnMissingSignals( + workspace, + state, + iterationNum, + "dev", + devResult.exitCode, + iterRecord.devLogFileName, + ); + return; + } + // Commit dev work if (await gitManager.hasChanges(workspace.repoPath)) { await gitManager.commitAll( @@ -1799,9 +2014,27 @@ async function runJudgeAndDecide( const judgeSignals = await parseJudgeSignals(workspace.repoPath); iterRecord.judgeSignals = judgeSignals ?? undefined; - // If judge exited with non-zero and produced no signals, log it - if (judgeResult.exitCode !== 0 && !judgeSignals) { - iterRecord.judgeError = `Judge agent exited with code ${judgeResult.exitCode}. Check log: ${judgeLogFile}`; + // Harness contract check: judge agent must write + // `cfcf-judge-signals.json`. Same pattern as the dev check above + // — absence of signals means we have no determination to act on, + // so we pause and surface to the user rather than archive a + // stale judge-assessment.md or fall through to the decide path + // with no data. Inclusive of all root causes (crash, quota cap, + // OOM). The previous behaviour (record `iterRecord.judgeError` + // and continue silently) is replaced by the explicit pause; the + // `judgeError` field is still populated for the audit trail. + if (!judgeSignals) { + iterRecord.judgeError = + `Judge agent exited with code ${judgeResult.exitCode} without writing signals. Check log: ${judgeLogFile}`; + await pauseLoopOnMissingSignals( + workspace, + state, + iterationNum, + "judge", + judgeResult.exitCode, + iterRecord.judgeLogFileName, + ); + return; } // Commit judge work @@ -2046,6 +2279,22 @@ async function runJudgeAndDecide( ); } + // External-stop guard. Mirrors the `isStopped(state)` checks after + // dev's await (line ~1852) and judge's await (line ~2011) — the + // reflection path was missing it, and any phase mutation here + // would clobber a `state.phase = "stopped"` that `stopLoop()` set + // while we were mid-reflection. Real dogfood: gmbot iter 19 — user + // called `cfcf stop`, then `kill -9`'d the stranded reflection + // codex; the await unblocked, post-reflection ingest ran, line + // 2283 below overwrote `"stopped"` with `"deciding"`, makeDecision + // returned `continue`, and the outer loop's `isLoopDone(state)` + // saw `"deciding"` → false → spawned iter N+1. With this guard, + // the cascade exits cleanly back to runLoop's `isLoopDone(state)` + // check, which sees the unmodified `"stopped"` and returns. + if (isStopped(state)) { + return; + } + // --- DECIDE --- state.phase = "deciding"; await saveLoopState(state); diff --git a/packages/core/src/process-manager.ts b/packages/core/src/process-manager.ts index c95c65e..705a779 100644 --- a/packages/core/src/process-manager.ts +++ b/packages/core/src/process-manager.ts @@ -159,7 +159,7 @@ export async function spawnProcess(opts: ProcessOptions): Promise { expect(validatePlanRewrite(basePlan, newPlan)).toEqual({ valid: true }); }); }); + + // markReflectStateFailed: used by stopLoop + cfcf agents reap to + // flip the dashboard's "reflection running" indicator off after + // killing the subprocess. The "with-state" branch is exercised + // end-to-end via stopLoop integration tests; here we lock in the + // idempotent / safe no-state path so callers can call it + // unconditionally without worrying about whether state exists. + describe("markReflectStateFailed (no-state branch — idempotency)", () => { + it("returns false when the workspace has no reflect state (no-op, safe to call unconditionally)", async () => { + const result = await markReflectStateFailed("workspace-id-with-no-state-xyz", "test reason"); + expect(result).toBe(false); + }); + + it("does not throw for unknown workspace ids", async () => { + // Callers in stopLoop wrap in try/catch but rely on this not + // throwing for the common case (no reflection ever spawned). + await expect( + markReflectStateFailed("totally-fake-id", "test reason"), + ).resolves.toBe(false); + }); + }); }); diff --git a/packages/core/src/reflection-runner.ts b/packages/core/src/reflection-runner.ts index bee403b..9052963 100644 --- a/packages/core/src/reflection-runner.ts +++ b/packages/core/src/reflection-runner.ts @@ -104,6 +104,34 @@ export function getReflectState(workspaceId: string): ReflectState | undefined { return reflectStore.get(workspaceId); } +/** + * Force-mark this workspace's reflection state as `failed` with the + * given reason. Used by `stopLoop()` (and the manual + * `cfcf agents reap` command) to flip the dashboard's "reflection + * running" indicator off immediately after killing the subprocess — + * without this, the in-memory state would stay "executing" and the + * `/api/activity` chip would keep claiming reflection is alive. + * + * Idempotent: no-op when the workspace has no reflect state, or + * when the existing state is already a terminal status (completed / + * failed) — won't clobber a real completion. + * + * Returns true if a flip happened, false otherwise. + */ +export async function markReflectStateFailed( + workspaceId: string, + reason: string, +): Promise { + const current = reflectStore.get(workspaceId); + if (!current) return false; + if (!REFLECT_ACTIVE_STATUSES.has(current.status)) return false; + current.status = "failed"; + current.error = reason; + current.completedAt = new Date().toISOString(); + await setReflectState(current); + return true; +} + /** * Server-boot hook (item F.23): hydrate the in-memory cache from disk * and clean up any state still claiming to be active. diff --git a/packages/core/src/types.ts b/packages/core/src/types.ts index 42c234f..9b7dfe7 100644 --- a/packages/core/src/types.ts +++ b/packages/core/src/types.ts @@ -585,7 +585,14 @@ export type ResumeAction = | "finish_loop" | "stop_loop_now" | "refine_plan" - | "consult_reflection"; + | "consult_reflection" + // Re-spawn dev on the same iteration after a pause (typically + // following a `missing_signals` pause from a quota cap or agent + // crash). The iteration counter is rolled back so the retry runs + // as iteration N rather than N+1; the existing branch is + // discarded + re-created off HEAD. See harness-missing-signals + // feature for the full flow. + | "retry_iteration"; /** * Architect's verdict on the Problem Pack's readiness for an iteration loop. diff --git a/packages/core/src/workspaces.test.ts b/packages/core/src/workspaces.test.ts index 68a8ae9..124e5f2 100644 --- a/packages/core/src/workspaces.test.ts +++ b/packages/core/src/workspaces.test.ts @@ -11,6 +11,7 @@ import { deleteWorkspace, validateWorkspaceRepo, nextIteration, + decrementIteration, } from "./workspaces.js"; describe("projects", () => { @@ -166,6 +167,49 @@ describe("projects", () => { }); }); + // Roll-back helper for `retry_iteration` (harness-missing-signals + // feature): when a failed iteration N is being retried, the + // counter is rolled back so the next `nextIteration()` call + // returns N again instead of N+1. + describe("decrementIteration", () => { + it("rolls back from 3 to 2", async () => { + const project = await createWorkspace({ name: "dec-test", repoPath: repoDir }); + await nextIteration(project.id); + await nextIteration(project.id); + await nextIteration(project.id); + const prev = await decrementIteration(project.id); + expect(prev).toBe(2); + + const updated = await getWorkspace(project.id); + expect(updated!.currentIteration).toBe(2); + }); + + it("decrement then nextIteration returns the same number that was rolled back", async () => { + // The retry_iteration flow's load-bearing invariant: the + // counter ends up back at the failed iter's number so the + // loop body's next `nextIteration()` call reproduces it. + const project = await createWorkspace({ name: "retry-test", repoPath: repoDir }); + await nextIteration(project.id); // -> 1 + await nextIteration(project.id); // -> 2 (this is the "failed" iter) + + await decrementIteration(project.id); + const retried = await nextIteration(project.id); + expect(retried).toBe(2); // same number as the failed attempt + }); + + it("floors at 0 (calling on a 0-counter is a no-op)", async () => { + const project = await createWorkspace({ name: "floor-test", repoPath: repoDir }); + const result = await decrementIteration(project.id); + expect(result).toBe(0); + const updated = await getWorkspace(project.id); + expect(updated!.currentIteration).toBe(0); + }); + + it("returns null for non-existent project (matches nextIteration's nullable shape)", async () => { + expect(await decrementIteration("nonexistent")).toBeNull(); + }); + }); + describe("validateWorkspaceRepo", () => { it("validates a valid git repo", async () => { const result = await validateWorkspaceRepo(repoDir); diff --git a/packages/core/src/workspaces.ts b/packages/core/src/workspaces.ts index 300de98..e4ea95e 100644 --- a/packages/core/src/workspaces.ts +++ b/packages/core/src/workspaces.ts @@ -243,6 +243,25 @@ export async function nextIteration(workspaceId: string): Promise return next; } +/** + * Roll the iteration counter back by one. Used by `retry_iteration` + * (the resume action that re-spawns dev on the same iteration after a + * `missing_signals` pause): the failed attempt was already counted via + * `nextIteration()`, so we roll back to let the loop body's next + * `nextIteration()` call return the same number. Idempotent at the + * floor — if the counter is already 0 (no iterations yet), this is a + * no-op. + */ +export async function decrementIteration(workspaceId: string): Promise { + const workspace = await getWorkspace(workspaceId); + if (!workspace) return null; + const current = workspace.currentIteration || 0; + if (current <= 0) return 0; + const prev = current - 1; + await updateWorkspace(workspaceId, { currentIteration: prev }); + return prev; +} + /** * Verify a workspace's repo path exists and is a git repo. */ diff --git a/packages/server/src/app.ts b/packages/server/src/app.ts index 0518a62..40374b9 100644 --- a/packages/server/src/app.ts +++ b/packages/server/src/app.ts @@ -49,6 +49,13 @@ import { startReflection, getReflectState, stopReflection, + getActiveProcessesForWorkspace, + getAllActiveProcesses, + getActiveProcess, + killProcessTree, + markReflectStateFailed, + markDocumentStateFailed, + markReviewStateFailed, } from "@cfcf/core"; const startedAt = Date.now(); @@ -135,6 +142,82 @@ export function createApp() { return c.json({ active: items }); }); + // --- Active loop-spawned agent processes (for `cfcf agents reap`) --- + // + // List + kill subprocess registry. Distinct from `/api/activity`: + // /activity reads state files (loop-state, reflect-state, …) and + // is the dashboard's source of truth for "what's running"; this + // endpoint exposes the in-memory PID registry (`active-processes.ts`) + // so the user can identify and kill a stranded subprocess when + // the higher-level stop didn't (or to clean up after a kill that + // didn't reach the state-store flip path). + // + // PA / HA are intentionally absent — they run as `cfcf spec` / + // `cfcf help assistant` with `stdio: "inherit"` outside the cfcf + // server, are not in the registry, and cannot be killed via these + // endpoints. Safe by construction. + app.get("/api/active-processes", async (c) => { + const workspaceFilter = c.req.query("workspace"); + const all = workspaceFilter + ? getActiveProcessesForWorkspace(workspaceFilter) + : getAllActiveProcesses(); + const workspaces = await listWorkspaces(); + const wsById = new Map(workspaces.map((w) => [w.id, w])); + const now = Date.now(); + const summary = all.map((e) => ({ + workspaceId: e.workspaceId, + workspaceName: wsById.get(e.workspaceId)?.name ?? e.workspaceId, + role: e.role, + pid: e.process.proc.pid, + startedAt: e.startedAt, + runtimeMs: Math.max(0, now - new Date(e.startedAt).getTime()), + logFileName: e.logFileName ?? null, + })); + return c.json({ active: summary }); + }); + + app.post("/api/active-processes/:workspaceId/:role/kill", async (c) => { + const workspaceId = c.req.param("workspaceId"); + const role = c.req.param("role") as + | "dev" | "judge" | "architect" | "documenter" | "reflection"; + const validRoles = new Set([ + "dev", "judge", "architect", "documenter", "reflection", + ]); + if (!validRoles.has(role)) { + return c.json({ ok: false, error: `Unknown role: ${role}` }, 400); + } + const entry = getActiveProcess(workspaceId, role); + if (!entry) { + return c.json( + { ok: false, error: `No active ${role} process for workspace ${workspaceId}` }, + 404, + ); + } + const pid = entry.process.proc.pid; + try { + killProcessTree(pid); + } catch (err) { + return c.json( + { ok: false, error: `killProcessTree failed: ${err instanceof Error ? err.message : String(err)}` }, + 500, + ); + } + // Flip per-role state-store row so the dashboard clears + // immediately. Idempotent; dev/judge have no state-store so + // skip them. + const reason = "Killed by user via cfcf agents reap"; + try { + if (role === "reflection") await markReflectStateFailed(workspaceId, reason); + else if (role === "documenter") await markDocumentStateFailed(workspaceId, reason); + else if (role === "architect") await markReviewStateFailed(workspaceId, reason); + } catch (err) { + console.warn( + `[reap] state-store flip failed for ${role}: ${err instanceof Error ? err.message : String(err)}`, + ); + } + return c.json({ ok: true, killed: true, pid, role, workspaceId }); + }); + app.get("/api/status", async (c) => { const hasConfig = await configExists(); const config = hasConfig ? await readConfig() : null; diff --git a/packages/web/src/api.ts b/packages/web/src/api.ts index d1683e6..875cf97 100644 --- a/packages/web/src/api.ts +++ b/packages/web/src/api.ts @@ -302,7 +302,8 @@ export type ResumeAction = | "finish_loop" | "stop_loop_now" | "refine_plan" - | "consult_reflection"; + | "consult_reflection" + | "retry_iteration"; // harness-missing-signals: redo the failed iter export function resumeLoop( workspaceId: string, diff --git a/packages/web/src/components/FeedbackForm.tsx b/packages/web/src/components/FeedbackForm.tsx index 16bd4b0..13081c3 100644 --- a/packages/web/src/components/FeedbackForm.tsx +++ b/packages/web/src/components/FeedbackForm.tsx @@ -32,6 +32,17 @@ function pauseReasonAllowedActions(pauseReason: LoopState["pauseReason"]): Resum // configured; stop_loop_now accepts "project done"; refine_plan // re-runs the architect after the user adds new requirements. return ["finish_loop", "stop_loop_now", "refine_plan"]; + case "missing_signals": + // Harness-missing-signals: dev/judge exited without writing + // its signals file (crash, quota cap, OOM, …). No determination + // to act on. Allowed: + // - retry_iteration → redo the iter (after a quota window + // resets, typically) + // - continue → skip the failed iter + // - stop_loop_now → abandon + // No finish/refine/consult — they all need a meaningful iter + // result. + return ["retry_iteration", "continue", "stop_loop_now"]; default: // Pre-loop review block (A1) return ["continue", "stop_loop_now", "refine_plan"]; @@ -44,6 +55,7 @@ const ACTION_LABEL: Record = { stop_loop_now: "Stop loop now", refine_plan: "Refine plan", consult_reflection: "Ask Reflection to decide", + retry_iteration: "Retry iteration", }; const ACTION_HELP: Record = { @@ -57,6 +69,8 @@ const ACTION_HELP: Record = { "Run the architect (synchronously) with your feedback to update the plan, then continue with the next iteration.", consult_reflection: "Spawn reflection with your feedback as input. Reflection decides what the harness does next (continue / finish / stop / re-pause).", + retry_iteration: + "Re-spawn dev on the same iteration that just failed (the iteration counter is rolled back, the branch is re-created). Typical use: a quota cap reset, or after fixing whatever caused the agent to crash.", }; export function FeedbackForm({ diff --git a/packages/web/src/types.ts b/packages/web/src/types.ts index 07eeed4..d2d29f3 100644 --- a/packages/web/src/types.ts +++ b/packages/web/src/types.ts @@ -131,7 +131,8 @@ export interface LoopState { | "anomaly" | "user_input_needed" | "max_iterations" - | "scope_complete"; // item 6.25 follow-up: architect SCOPE_COMPLETE + | "scope_complete" // item 6.25 follow-up: architect SCOPE_COMPLETE + | "missing_signals"; // harness-missing-signals: dev/judge exited without writing signals file pendingQuestions?: string[]; userFeedback?: string; iterations: LoopIterationRecord[];