loop: four anomaly-handling fixes — missing-signals pause, stop-kills-subprocess, agents reap, post-reflection stop guard #49
Merged
cfcf has implicitly assumed dev + judge always run far enough to
write their signal files (cfcf-iteration-signals.json /
cfcf-judge-signals.json). When agents exit before doing so (hard
quota cap, crash, OOM, …), cfcf silently marked the iteration
failed and continued. Real gmbot dogfood: iter 10 hit a codex
usage limit, both agents exited without signals, cfcf moved on and
iter 11 had to re-discover the missing work.
This is a harness-contract violation, not an agent-reasoning
failure — the agent never ran far enough to classify anything. So
the fix is harness-side and DELIBERATELY doesn't classify the
root cause. Quota / crash / OOM all collapse to "agent exited
without signals → can't safely continue → pause + surface to
user." User reads the log file to identify the cause.
Behaviour:
- After parseSignalFile / parseJudgeSignals returns null (file
missing OR JSON parse error OR schema validation failure — one
signal), pause with pauseReason "missing_signals" rather than
silently continuing.
- pendingQuestions[0] is a generic-anomaly message with the log
path: "Anomaly detected: <phase> agent for iteration N exited
(exit code X) without writing its signals file (agent crashed
or hit a usage limit). Check the log: <path>. Resume with
retry_iteration / continue / stop_loop_now."
- loop.paused notification fires for any configured channel.
- Working tree left as-is — no git reset, no commit attempt. On
retry_iteration the branch is re-created off HEAD by the
existing iteration body; any dirty changes survive in the
working tree.
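The pause behaviour above can be sketched roughly as follows. This is a minimal illustration, not the code in iteration-loop.ts: `LoopState`, `AgentPhase`, `parseSignals`, `pauseForMissingSignals`, the `"paused"` phase name, and the example log path are all assumptions made for the sketch. Only the `missing_signals` pause reason, `pendingQuestions[0]`, and the message wording come from the commit message.

```typescript
// Sketch only: LoopState, AgentPhase, parseSignals and pauseForMissingSignals
// are hypothetical names, not the real cfcf types or helpers.
type AgentPhase = "dev" | "judge";

interface LoopState {
  phase: string;
  pauseReason?: string;
  pendingQuestions: string[];
}

// Stand-in for parseSignalFile / parseJudgeSignals. A missing file (null),
// a JSON parse error, and a shape mismatch all collapse to null.
function parseSignals(raw: string | null): Record<string, unknown> | null {
  if (raw === null) return null;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return null;
    }
    return parsed as Record<string, unknown>;
  } catch {
    return null;
  }
}

// Pause instead of continuing. The working tree is deliberately not touched.
function pauseForMissingSignals(
  state: LoopState,
  agent: AgentPhase,
  iteration: number,
  exitCode: number,
  logPath: string,
): void {
  state.phase = "paused"; // assumed phase name
  state.pauseReason = "missing_signals";
  state.pendingQuestions = [
    `Anomaly detected: ${agent} agent for iteration ${iteration} exited ` +
      `(exit code ${exitCode}) without writing its signals file (agent crashed ` +
      `or hit a usage limit). Check the log: ${logPath}. Resume with ` +
      `retry_iteration / continue / stop_loop_now.`,
  ];
}

// Example: the dev agent exited before writing its signals file.
const demoState: LoopState = { phase: "dev_executing", pendingQuestions: [] };
if (parseSignals(null) === null) {
  pauseForMissingSignals(demoState, "dev", 10, 1, "/tmp/example-dev.log");
}
console.log(demoState.pauseReason);
```

The parser returns a single `null` for every failure mode, so the caller has one branch to handle and no root-cause classification to maintain.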
New resume action: retry_iteration (extends item 6.25's structured
pause-actions vocabulary):
- Rolls iteration counter back via new decrementIteration() helper
in workspaces.ts.
- Pops the failed iteration's record from state.iterations.
- Falls through to the regular iteration body — nextIteration()
now returns the same number, the branch is re-created.
- ONLY applicable to missing_signals pauses (allowlist:
retry_iteration / continue / stop_loop_now). Rejected for other
pause classes via existing pauseReasonAllowedActions validation.
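As a rough sketch of the retry path just described, assuming a simplified workspace shape: `WorkspaceState`, `IterationRecord`, `retryIteration`, and the fallback allowlist for other pause classes are hypothetical. The names `decrementIteration`, `nextIteration`, and `pauseReasonAllowedActions` appear in the commit message, but their real signatures are not shown there.

```typescript
// Sketch only: shapes and signatures are assumed, not taken from workspaces.ts.
type ResumeAction = "retry_iteration" | "continue" | "stop_loop_now";

interface IterationRecord {
  number: number;
}

interface WorkspaceState {
  currentIteration: number;
  iterations: IterationRecord[];
}

function pauseReasonAllowedActions(pauseReason: string): ResumeAction[] {
  if (pauseReason === "missing_signals") {
    return ["retry_iteration", "continue", "stop_loop_now"];
  }
  // Placeholder for every other pause class: retry_iteration is not offered.
  return ["continue", "stop_loop_now"];
}

function nextIteration(ws: WorkspaceState): number {
  ws.currentIteration += 1;
  return ws.currentIteration;
}

function decrementIteration(ws: WorkspaceState): void {
  ws.currentIteration = Math.max(0, ws.currentIteration - 1);
}

function retryIteration(ws: WorkspaceState, pauseReason: string): void {
  if (!pauseReasonAllowedActions(pauseReason).includes("retry_iteration")) {
    throw new Error(`retry_iteration is not allowed for pause reason "${pauseReason}"`);
  }
  ws.iterations.pop(); // drop the failed iteration's record
  decrementIteration(ws); // so nextIteration() hands back the same number
}

// Example: iteration 10 failed without signals and is retried.
const ws: WorkspaceState = {
  currentIteration: 10,
  iterations: [{ number: 9 }, { number: 10 }],
};
retryIteration(ws, "missing_signals");
console.log(nextIteration(ws)); // 10
```

Rolling the counter back, rather than special-casing the iteration body, lets the normal body run unchanged on the retry.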
Files touched (~140 LoC):
- packages/core/src/types.ts — extend ResumeAction
- packages/core/src/iteration-loop.ts — pause helper, two call
sites (dev + judge), retry dispatch, pauseReasonAllowedActions
branch
- packages/core/src/workspaces.ts — decrementIteration helper
- packages/cli/src/commands/resume.ts — RESUME_ACTIONS + help
- packages/web/{types.ts,api.ts,components/FeedbackForm.tsx} —
mirror the new variants on the web side. pendingQuestions[0]
rendering already surfaces our generic-anomaly message; no UI
copy needed.
Test coverage: 8 new tests (1030 total pass). Typecheck clean.
Intentionally NOT in scope (future, if dogfood demands):
- Root-cause classification (quota vs. crash) — would be a
cosmetic label, not a behaviour change. Defer.
- Auto-retry after parsed reset time — defer.
- Pattern-based "anomaly type" labels in the pause text.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real gmbot dogfood: after iter 19 reflection started at 3:41pm, the user observed the dashboard still showing "reflection running" at 8pm, 4.5 hours later. The loop state was already phase=stopped/outcome=stopped, but the reflection codex subprocess (PID 99729) was still alive with the cfcf server as its parent.
Three layered bugs:
1. stopLoop() only flipped the state flag; it never signaled any subprocess.
2. The per-role state-store (reflect-state.json) was never flipped to "failed", so getActiveAgent() kept claiming reflection was alive.
3. cfcf server reap filters PPID==1 (orphans of a dead server), so it couldn't catch live-server children.
Fix is three-pronged:
**Fix 1**: stopLoop() now kills the active subprocess. New loopActivePhaseToRole(phase) maps loop phases to AgentRoles:
- pre_loop_reviewing -> architect
- dev_executing -> dev
- judging -> judge
- reflecting -> reflection
- documenting -> documenter
- else -> null (skip)
For non-null roles, look up active-processes registry, call existing killProcessTree() (SIGTERM -> 1.5s -> SIGKILL).
**Fix 2**: per-role state-store flip via new mark<X>StateFailed(workspaceId, reason) helpers in each runner (reflection / documenter / architect). Idempotent: no-op when no state, no-op when state already terminal. Dashboard chip clears on next poll.
**Fix 3**: new cfcf agents reap command + matching API endpoints (/api/active-processes GET + POST kill). Mirrors cfcf server reap UX but scoped to live-server children, not PPID==1 orphans. Per-row y/N or --yes for non-interactive.
**Hard guarantee** (per user's explicit caution): PA and HA are untouched at all three layers. They run as cfcf spec / cfcf help assistant with stdio:"inherit" OUTSIDE the cfcf server, are NOT in the active-processes registry, and AgentRole type doesn't include them. Two layers of defence: type-level + phase-mapping.
**Wall-clock timeout intentionally NOT implemented**. PA proposed it; user and I both rejected. A timeout conflicts with codex rolling-window quota recovery (~1hr legit waits); leaving it out matches the cfcf "human on the loop" principle. cfcf agents reap is the user-driven equivalent.
Files (~340 LoC):
- process-manager.ts: export killProcessTree
- reflection-runner.ts / documenter-runner.ts / architect-runner.ts: add mark<X>StateFailed helpers
- iteration-loop.ts: loopActivePhaseToRole helper, stopLoop kills + flips state
- server/app.ts: 2 new endpoints
- cli/commands/agents.ts: new file, cfcf agents reap
- cli/index.ts: register
Test coverage: 8 new tests (1038 total pass). Typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
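Fix 1 and Fix 2 from the commit above might look roughly like this. It is a hedged sketch: the phase-to-role table is copied from the commit message, but the `AgentRole` union is inferred from that table, and `RoleState`, `markStateFailed`, the `Map`-based registry, and the injected `kill` callback are hypothetical stand-ins for the real state files, mark<X>StateFailed helpers, active-processes registry, and killProcessTree().

```typescript
// Sketch only: the union is inferred from the mapping in the commit message.
type AgentRole = "architect" | "dev" | "judge" | "reflection" | "documenter";

function loopActivePhaseToRole(phase: string): AgentRole | null {
  switch (phase) {
    case "pre_loop_reviewing":
      return "architect";
    case "dev_executing":
      return "dev";
    case "judging":
      return "judge";
    case "reflecting":
      return "reflection";
    case "documenting":
      return "documenter";
    default:
      return null; // no subprocess to kill in this phase
  }
}

// Hypothetical per-role state record. "executing" and "failed" appear in the
// PR text; "completed" is an assumed name for the other terminal status.
interface RoleState {
  status: "executing" | "completed" | "failed";
  error?: string;
}

// Idempotent flip: no-op when there is no state or it is already terminal.
function markStateFailed(state: RoleState | null, reason: string): RoleState | null {
  if (state === null) return null;
  if (state.status !== "executing") return state;
  return { ...state, status: "failed", error: reason };
}

// The registry is modelled as role -> pid and the kill routine is injected,
// so the sketch never signals a real process.
function killActiveAgent(
  phase: string,
  registry: Map<AgentRole, number>,
  kill: (pid: number) => void,
): AgentRole | null {
  const role = loopActivePhaseToRole(phase);
  if (role === null) return null;
  const pid = registry.get(role);
  if (pid !== undefined) kill(pid);
  return role;
}

// Example: stop arrives while reflection is running.
const killedPids: number[] = [];
const demoRegistry = new Map<AgentRole, number>([["reflection", 99729]]);
const stoppedRole = killActiveAgent("reflecting", demoRegistry, (pid) => {
  killedPids.push(pid);
});
console.log(stoppedRole, killedPids);
```

Because the mapping returns `null` for every phase it does not list, any process outside the five roles is unreachable from this path, which is the phase-mapping half of the PA/HA guarantee.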
…ascade
User reported: after `cfcf stop` + `kill -9` on a stranded reflection codex, the loop CONTINUED to the next iteration. They flagged this as "probably a consequence of bug A", which is the correct diagnosis.
The chain (Bug D, downstream of Bug A):
1. cfcf stop sets state.phase = "stopped"
2. Reflection await is mid-flight; codex alive
3. User kill -9's codex
4. Await unblocks → post-reflection processing runs (archive, commit, Clio ingest)
5. Line 2283: state.phase = "deciding" (CLOBBERS "stopped")
6. makeDecision returns "continue" → case "continue" returns
7. Outer loop's isLoopDone(state) sees "deciding" → false
8. Next iteration spawns
The isStopped checks exist after dev's await (line 1852) and after judge's await (line 2011), but were missing between the reflection block and the DECIDE block. Any state.phase mutation in that window overwrites an external "stopped" flag.
Fix: one isStopped guard right before "state.phase = deciding". Matches the pattern of the dev / judge guards. Closes the cascade for users on Fix 1 (kill-on-stop) too: Fix 1 makes the await unblock faster, but the same overwrite would have run without the guard.
No new tests: the guard's logic is `isStopped(state)`, which is already covered exhaustively. The comment explains the rationale + the gmbot iter-19 dogfood case so future edits don't drop it.
Typecheck clean. Full suite (1038 tests) passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
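The clobbering described in this commit, and the guard that prevents it, can be reproduced in a few lines. This is an illustrative reduction, not the real loop: `enterDecide` and its `withGuard` flag are hypothetical and exist only to contrast the two behaviours; `isStopped` is named in the commit, but its body here is assumed.

```typescript
// Sketch only: a reduced model of the post-reflection to DECIDE transition.
interface LoopState {
  phase: string;
}

// Assumed implementation; the commit names isStopped but does not show it.
function isStopped(state: LoopState): boolean {
  return state.phase === "stopped";
}

// Returns true when the loop should go on to make a decision.
function enterDecide(state: LoopState, withGuard: boolean): boolean {
  // Post-reflection processing (archive, commit, ingest) would run here.
  if (withGuard && isStopped(state)) {
    return false; // leave "stopped" intact so the outer loop can exit
  }
  state.phase = "deciding"; // without the guard this overwrites "stopped"
  return true;
}

// An external stop landed while reflection was still being awaited.
const unguarded: LoopState = { phase: "stopped" };
enterDecide(unguarded, false);

const guarded: LoopState = { phase: "stopped" };
enterDecide(guarded, true);

console.log(unguarded.phase, guarded.phase); // deciding stopped
```

The guard has to sit immediately before the phase assignment, since any earlier check leaves a window in which a stop can arrive and still be overwritten.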
Summary
Five distinct fixes across two anomaly-handling features, all surfaced by a single gmbot dogfood run. Bundled because they share infrastructure (active-processes registry, per-role state-store flips, `killProcessTree`) and the later fixes close gaps that the earlier ones exposed.

What the gmbot dogfood exposed
- … `phase: stopped`, but the reflection codex (PID 99729) was still alive with the cfcf server as its parent.
- … `kill -9 99729`, and observed the loop continuing to the next iteration despite the stop. That was the giveaway that even fixing stop's auto-kill wouldn't be sufficient: there was a downstream cascade clobbering the stop signal.

Feature 1: Harness-level missing-signals pause + `retry_iteration` resume action

`parseSignalFile()` and `parseJudgeSignals()` returning `null` (file missing, JSON parse failure, or schema validation failure, all collapsed to one signal) now pauses the loop with `pauseReason: "missing_signals"` instead of silently treating the iteration as failed and continuing.

The harness deliberately doesn't classify the root cause. Quota cap / crash / OOM / killed all collapse to "agent exited without signals → can't safely continue → pause." The user reads the log to identify the cause and chooses recovery.

New resume action `retry_iteration` re-spawns dev on the same iteration after the user unblocks the underlying cause (typically: quota window resets). Rolls the iteration counter back via new `decrementIteration()` helper; existing branch is re-created off HEAD by the normal iteration body.

Pause-action allowlist for `missing_signals`: `[retry_iteration, continue, stop_loop_now]`. No finish/refine/consult; those all assume a meaningful iteration result.

Feature 2: `cfcf stop` honours its name end-to-end

Four layered bugs:

Bug A: `stopLoop()` only flipped the state flag. It set `state.phase = "stopped"` and saved. Mid-flight subprocesses kept running because nothing signaled them. The main while-loop's `isStopped(state)` exits the next iteration boundary cleanly, but a subprocess started before stop was simply abandoned.

Bug B: The per-role state-store stayed `executing`. When the subprocess was orphaned, the runner's normal completion path didn't fire, so `reflect-state.json` was never flipped. Dashboard's `getActiveAgent()` reads that file; kept claiming reflection was alive.

Bug C: No CLI way to kill live-server children. Existing `cfcf server reap` filters on `PPID==1` (dead-server orphans). A live-server stranded subprocess didn't qualify.

Bug D: External "stopped" flag overwritten by post-reflection phase mutations. After reflection's `await` unblocked (whether from our new auto-kill or an external `kill -9`), the post-reflection code (archive, commit, Clio ingest) ran cleanly, and then `state.phase = "deciding"` clobbered the `"stopped"` flag, `makeDecision` returned, the outer loop's `isLoopDone(state)` saw `"deciding"` (not `"stopped"`) and continued to the next iteration. The `isStopped` checks existed after dev's await (`iteration-loop.ts:1852`) and judge's await (line 2011), but were missing after reflection.

Fixes (one per bug)
- `stopLoop()` maps phase → `AgentRole` via new `loopActivePhaseToRole(phase)` helper; looks up `active-processes` registry; calls existing `killProcessTree()` (SIGTERM → 1.5s → SIGKILL).
- `markReflectStateFailed` / `markDocumentStateFailed` / `markReviewStateFailed` helpers (idempotent: no-op when no state or already terminal). Called by stop + reap.
- `cfcf agents reap [--workspace <name>] [--yes]` + `GET /api/active-processes` + `POST /api/active-processes/:wsId/:role/kill`. Mirrors `cfcf server reap` UX.
- `isStopped(state)` guard before the DECIDE block (line 2283). Mirrors the dev/judge guard pattern.

PA / HA safety guarantee
PA (`cfcf spec`) and HA (`cfcf help assistant`) run interactively with `stdio: "inherit"` outside the cfcf server. They are NOT in the `active-processes` registry, and the registry's `AgentRole` type (`packages/core/src/log-storage.ts:20`) doesn't include them. Two layers of defence at type + registry. None of the kill paths in this PR can reach them.

Wall-clock timeout deliberately rejected
PA's initial design for Feature 2 included a per-role timeout. Declined: `cfcf agents reap` is the user-driven version of the same intent.

Files touched (~360 LoC + tests + CHANGELOG)
Core:
- `process-manager.ts`: export `killProcessTree`
- `reflection-runner.ts`, `documenter-runner.ts`, `architect-runner.ts`: `mark<X>StateFailed` helpers
- `iteration-loop.ts`: `loopActivePhaseToRole` helper, `stopLoop` extended, isStopped guard before DECIDE, missing-signals pause helper, `retry_iteration` dispatch, allowlist
- `workspaces.ts`: `decrementIteration` helper
- `types.ts`: extend `ResumeAction`
Server:
- `app.ts`: 2 new endpoints (`/api/active-processes`)
CLI:
- `commands/agents.ts` (new file), `index.ts` (register), `commands/resume.ts` (add `retry_iteration`)
Web:
- `types.ts`, `api.ts`, `components/FeedbackForm.tsx`: mirror new pauseReason + ResumeAction variants. Existing `pendingQuestions[0]` rendering surfaces the generic-anomaly message; no UI copy added.

Test plan
- `bun run test`: 1038 tests pass (+16 new)
- `bun run typecheck`: clean
- `cfcf agents reap --help` renders cleanly
- `cfcf stop --workspace <name>` → verify subprocess dies within ~1.5s AND dashboard chip clears.
- `cfcf agents reap` → expect "No active agent processes."
- … `chmod 000` an agent binary or similar) → verify pause with the new reason text → resume with `--action retry_iteration` → verify counter rolled back + iteration re-runs cleanly.

Three commits, cleanly separable
- `1f24e90` Feature 1 (missing-signals pause + retry_iteration)
- `6f95f02` Feature 2, Bugs A/B/C (kill on stop + agents reap)
- `6f4664b` Bug D (isStopped guard before DECIDE)

If reviewers prefer to merge incrementally, the commits are independent. Default plan is to merge as one.
🤖 Generated with Claude Code