loop: four anomaly-handling fixes — missing-signals pause, stop-kills-subprocess, agents reap, post-reflection stop guard#49

Merged
fstamatelopoulos merged 3 commits into main from iteration-7/missing-signal-pause on May 14, 2026

@fstamatelopoulos fstamatelopoulos commented May 14, 2026

Summary

Four distinct fixes across two anomaly-handling features, all surfaced by a single gmbot dogfood run. Bundled because they share infrastructure (active-processes registry, per-role state-store flips, killProcessTree) and the later fixes close gaps that the earlier ones exposed.

What the gmbot dogfood exposed

  • Iter 10: codex hit a usage limit; dev + judge exited without writing signals. cfcf treated it as "failed iteration" and silently moved on. Iter 11 then had to re-discover the missing work from scratch.
  • Iter 19: reflection started at 3:41 pm. User checked at 8 pm — dashboard still showed "reflection running." Loop state was phase: stopped, but the reflection codex (PID 99729) was still alive with the cfcf server as its parent.
  • After PA helped diagnose, user ran kill -9 99729 — and observed the loop continuing to the next iteration despite the stop. That was the giveaway that even fixing stop's auto-kill wouldn't be sufficient — there was a downstream cascade clobbering the stop signal.

Feature 1: Harness-level missing-signals pause + retry_iteration resume action

When parseSignalFile() or parseJudgeSignals() returns null (file missing, JSON parse failure, or schema validation failure, all collapsed to one signal), the loop now pauses with pauseReason: "missing_signals" instead of silently treating the iteration as failed and continuing.

The harness deliberately doesn't classify the root cause. Quota cap / crash / OOM / killed all collapse to "agent exited without signals → can't safely continue → pause." The user reads the log to identify the cause and chooses recovery.
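
A minimal sketch of that collapse, assuming a hypothetical one-field signals schema: the real cfcf schema and the body of parseSignalFile() are not shown in this PR, and `parseSignalText` and `handleAgentExit` are illustrative names only. All three failure modes map to null, and the caller pauses on null without asking why.

```typescript
// Hypothetical signal shape for illustration; not the real cfcf schema.
interface IterationSignals {
  status: string;
}

type SignalOutcome =
  | { kind: "continue"; signals: IterationSignals }
  | { kind: "pause"; pauseReason: "missing_signals" };

// `raw` is the signal file's text, or null when the file does not exist.
function parseSignalText(raw: string | null): IterationSignals | null {
  if (raw === null) return null; // file missing
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // JSON parse failure
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const status = (parsed as Record<string, unknown>).status;
  if (typeof status !== "string") return null; // schema validation failure
  return { status };
}

// The harness does not inspect the root cause: null always means pause.
function handleAgentExit(raw: string | null): SignalOutcome {
  const signals = parseSignalText(raw);
  if (signals === null) {
    return { kind: "pause", pauseReason: "missing_signals" };
  }
  return { kind: "continue", signals };
}
```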

New resume action retry_iteration re-spawns dev on the same iteration after the user unblocks the underlying cause (typically: quota window resets). Rolls the iteration counter back via new decrementIteration() helper; existing branch is re-created off HEAD by the normal iteration body.

Pause-action allowlist for missing_signals: [retry_iteration, continue, stop_loop_now]. No finish/refine/consult — those all assume a meaningful iteration result.
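
The allowlist check could look roughly like the sketch below. It assumes the excluded actions are literally named `finish`, `refine`, and `consult` (this PR only uses those as shorthand), and `validateMissingSignalsResume` is a hypothetical stand-in for the existing pauseReasonAllowedActions validation.

```typescript
// The last three names are placeholders taken from the shorthand
// "finish/refine/consult"; the real ResumeAction variants may differ.
type ResumeAction =
  | "retry_iteration"
  | "continue"
  | "stop_loop_now"
  | "finish"
  | "refine"
  | "consult";

// Allowlist for the missing_signals pause reason, as stated in this PR.
const MISSING_SIGNALS_ACTIONS: readonly ResumeAction[] = [
  "retry_iteration",
  "continue",
  "stop_loop_now",
];

interface ResumeCheck {
  ok: boolean;
  error?: string;
}

// Hypothetical validator: rejects any action that presumes a meaningful
// iteration result.
function validateMissingSignalsResume(action: ResumeAction): ResumeCheck {
  if (MISSING_SIGNALS_ACTIONS.includes(action)) {
    return { ok: true };
  }
  return {
    ok: false,
    error: `Action "${action}" is not allowed for a missing_signals pause`,
  };
}
```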

Feature 2: cfcf stop honours its name end-to-end

Four layered bugs:

Bug A: stopLoop() only flipped the state flag. It set state.phase = "stopped" and saved. Nothing signaled mid-flight subprocesses, so they kept running. The main while-loop's isStopped(state) check exits cleanly at the next iteration boundary, but a subprocess started before the stop was simply abandoned.

Bug B: The per-role state-store stayed in the executing state. Because the subprocess was orphaned, the runner's normal completion path never fired, so reflect-state.json was never flipped. The dashboard's getActiveAgent() reads that file, so it kept claiming reflection was alive.

Bug C: No CLI way to kill live-server children. The existing cfcf server reap filters on PPID==1 (dead-server orphans), so a stranded subprocess of a live server didn't qualify.

Bug D: External "stopped" flag overwritten by post-reflection phase mutations. After reflection's await unblocked (whether from our new auto-kill or an external kill -9), the post-reflection code (archive, commit, Clio ingest) ran cleanly. Then state.phase = "deciding" clobbered the "stopped" flag, makeDecision returned, and the outer loop's isLoopDone(state) saw "deciding" (not "stopped") and continued to the next iteration. The isStopped checks existed after dev's await (iteration-loop.ts:1852) and judge's await (line 2011), but were missing after reflection.

Fixes (one per bug)

| Bug | Fix |
| --- | --- |
| A | stopLoop() maps phase → AgentRole via new loopActivePhaseToRole(phase) helper; looks up active-processes registry; calls existing killProcessTree() (SIGTERM → 1.5s → SIGKILL). |
| B | New markReflectStateFailed / markDocumentStateFailed / markReviewStateFailed helpers (idempotent: no-op when no state or already terminal). Called by stop + reap. |
| C | New cfcf agents reap [--workspace <name>] [--yes] + GET /api/active-processes + POST /api/active-processes/:wsId/:role/kill. Mirrors cfcf server reap UX. |
| D | One isStopped(state) guard before the DECIDE block (line 2283). Mirrors the dev/judge guard pattern. |
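
To make the Bug A fix concrete, here is a hedged sketch of the stop path. The phase-to-role mapping follows the list in this PR's second commit message. Everything else is an assumption for illustration: the `LoopState` shape, the `"<workspaceId>:<role>"` registry key, and the injected `kill` callback that stands in for killProcessTree().

```typescript
type AgentRole = "architect" | "dev" | "judge" | "reflection" | "documenter";

// Mapping as listed in the commit message; any other phase yields null.
function loopActivePhaseToRole(phase: string): AgentRole | null {
  switch (phase) {
    case "pre_loop_reviewing":
      return "architect";
    case "dev_executing":
      return "dev";
    case "judging":
      return "judge";
    case "reflecting":
      return "reflection";
    case "documenting":
      return "documenter";
    default:
      return null;
  }
}

// Assumed shapes, not the real cfcf types.
interface LoopState {
  workspaceId: string;
  phase: string;
}
type ActiveProcessRegistry = Map<string, number>; // key -> pid

// Returns the role whose subprocess was killed, or null if none was.
function stopLoopSketch(
  state: LoopState,
  registry: ActiveProcessRegistry,
  kill: (pid: number) => void, // stand-in for killProcessTree()
): AgentRole | null {
  // Resolve the role before the phase is overwritten with "stopped".
  const role = loopActivePhaseToRole(state.phase);
  state.phase = "stopped";
  if (role === null) return null;

  const key = `${state.workspaceId}:${role}`;
  const pid = registry.get(key);
  if (pid === undefined) return null; // nothing registered for this role

  kill(pid);
  registry.delete(key);
  return role;
}
```

Because PA and HA have no entry in `AgentRole`, no phase can ever resolve to them in a sketch like this, which mirrors the type-level guarantee described in the next section.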

PA / HA safety guarantee

PA (cfcf spec) and HA (cfcf help assistant) run interactively with stdio: "inherit", outside the cfcf server. They are NOT in the active-processes registry, and the registry's AgentRole type (packages/core/src/log-storage.ts:20) doesn't include them. That gives two layers of defence, at the type level and at the registry. None of the kill paths in this PR can reach them.

Wall-clock timeout deliberately rejected

PA's initial design for Feature 2 included a per-role timeout. Declined:

  • Conflicts with codex's rolling-window quota recovery (~1hr legit waits).
  • Adds config knobs that are hard to tune.
  • Conflicts with cf²'s "human on the loop, not in it" principle. cfcf agents reap is the user-driven version of the same intent.

Files touched (~360 LoC + tests + CHANGELOG)

Core:

  • process-manager.ts: export killProcessTree
  • reflection-runner.ts, documenter-runner.ts, architect-runner.ts: mark<X>StateFailed helpers
  • iteration-loop.ts: loopActivePhaseToRole helper, stopLoop extended, isStopped guard before DECIDE, missing-signals pause helper, retry_iteration dispatch, allowlist
  • workspaces.ts: decrementIteration helper
  • types.ts: extend ResumeAction

Server: app.ts — 2 new endpoints (/api/active-processes)

CLI: commands/agents.ts (new file), index.ts (register), commands/resume.ts (add retry_iteration)

Web: types.ts, api.ts, components/FeedbackForm.tsx — mirror new pauseReason + ResumeAction variants. Existing pendingQuestions[0] rendering surfaces the generic-anomaly message; no UI copy added.

Test plan

  • bun run test — 1038 tests pass (+16 new)
  • bun run typecheck — clean
  • CLI bundles; cfcf agents reap --help renders cleanly
  • Manual smoke (post-merge):
    • Start tiny loop → wait for any agent phase → cfcf stop --workspace <name> → verify subprocess dies within ~1.5s AND dashboard chip clears.
    • On an idle workspace: cfcf agents reap → expect "No active agent processes."
    • Manufacture a missing-signals scenario (chmod 000 an agent binary or similar) → verify pause with the new reason text → resume with --action retry_iteration → verify counter rolled back + iteration re-runs cleanly.

Three commits, cleanly separable

  • 1f24e90 Feature 1 (missing-signals pause + retry_iteration)
  • 6f95f02 Feature 2, Bugs A/B/C (kill on stop + agents reap)
  • 6f4664b Bug D (isStopped guard before DECIDE)

If reviewers prefer to merge incrementally, the commits are independent. Default plan is to merge as one.

🤖 Generated with Claude Code

fstamatelopoulos and others added 2 commits May 13, 2026 21:06
cfcf has implicitly assumed dev + judge always run far enough to
write their signal files (cfcf-iteration-signals.json /
cfcf-judge-signals.json). When agents exit before doing so (hard
quota cap, crash, OOM, …), cfcf silently marked the iteration
failed and continued. Real gmbot dogfood: iter 10 hit a codex
usage limit, both agents exited without signals, cfcf moved on and
iter 11 had to re-discover the missing work.

This is a harness-contract violation, not an agent-reasoning
failure — the agent never ran far enough to classify anything. So
the fix is harness-side and DELIBERATELY doesn't classify the
root cause. Quota / crash / OOM all collapse to "agent exited
without signals → can't safely continue → pause + surface to
user." User reads the log file to identify the cause.

Behaviour:
- After parseSignalFile / parseJudgeSignals returns null (file
  missing OR JSON parse error OR schema validation failure — one
  signal), pause with pauseReason "missing_signals" rather than
  silently continuing.
- pendingQuestions[0] is a generic-anomaly message with the log
  path: "Anomaly detected: <phase> agent for iteration N exited
  (exit code X) without writing its signals file (agent crashed
  or hit a usage limit). Check the log: <path>. Resume with
  retry_iteration / continue / stop_loop_now."
- loop.paused notification fires for any configured channel.
- Working tree left as-is — no git reset, no commit attempt. On
  retry_iteration the branch is re-created off HEAD by the
  existing iteration body; any dirty changes survive in the
  working tree.

New resume action: retry_iteration (extends item 6.25's structured
pause-actions vocabulary):
- Rolls iteration counter back via new decrementIteration() helper
  in workspaces.ts.
- Pops the failed iteration's record from state.iterations.
- Falls through to the regular iteration body — nextIteration()
  now returns the same number, the branch is re-created.
- ONLY applicable to missing_signals pauses (allowlist:
  retry_iteration / continue / stop_loop_now). Rejected for other
  pause classes via existing pauseReasonAllowedActions validation.
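
A rough sketch of the rollback, under assumed shapes: the
LoopState fields and the bodies of nextIteration() and
decrementIteration() below are illustrative stand-ins, not the
real workspaces.ts code. It shows why nextIteration() hands back
the same number after retry_iteration.

```typescript
// Assumed shapes for illustration only.
interface IterationRecord {
  number: number;
}
interface LoopState {
  currentIteration: number;
  iterations: IterationRecord[];
}

// Stand-in: advance the counter and return the new iteration number.
function nextIteration(state: LoopState): number {
  state.currentIteration += 1;
  return state.currentIteration;
}

// Stand-in: roll the counter back by one, never below zero.
function decrementIteration(state: LoopState): void {
  state.currentIteration = Math.max(0, state.currentIteration - 1);
}

// retry_iteration: roll back the counter and pop the failed record so
// the regular iteration body re-runs the same iteration number.
function applyRetryIteration(state: LoopState): void {
  decrementIteration(state);
  const last = state.iterations[state.iterations.length - 1];
  // Only pop a record that belongs to the iteration being retried.
  if (last !== undefined && last.number > state.currentIteration) {
    state.iterations.pop();
  }
}
```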

Files touched (~140 LoC):
- packages/core/src/types.ts — extend ResumeAction
- packages/core/src/iteration-loop.ts — pause helper, two call
  sites (dev + judge), retry dispatch, pauseReasonAllowedActions
  branch
- packages/core/src/workspaces.ts — decrementIteration helper
- packages/cli/src/commands/resume.ts — RESUME_ACTIONS + help
- packages/web/{types.ts,api.ts,components/FeedbackForm.tsx} —
  mirror the new variants on the web side. pendingQuestions[0]
  rendering already surfaces our generic-anomaly message; no UI
  copy needed.

Test coverage: 8 new tests (1030 total pass). Typecheck clean.

Intentionally NOT in scope (future, if dogfood demands):
- Root-cause classification (quota vs. crash) — would be a
  cosmetic label, not a behaviour change. Defer.
- Auto-retry after parsed reset time — defer.
- Pattern-based "anomaly type" labels in the pause text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real gmbot dogfood: after iter 19 reflection started at 3:41pm,
the user observed the dashboard still showing "reflection running"
at 8pm — 4.5 hours later. The loop state was already
phase=stopped/outcome=stopped, but the reflection codex subprocess
(PID 99729) was still alive with the cfcf server as its parent.

Three layered bugs:
1. stopLoop() only flipped the state flag — it never signaled
   any subprocess.
2. The per-role state-store (reflect-state.json) was never
   flipped to "failed", so getActiveAgent() kept claiming
   reflection was alive.
3. cfcf server reap filters PPID==1 (orphans of a dead server),
   so it couldn't catch live-server children.

Fix is three-pronged:

**Fix 1**: stopLoop() now kills the active subprocess. New
loopActivePhaseToRole(phase) maps loop phases to AgentRoles:
- pre_loop_reviewing -> architect
- dev_executing -> dev
- judging -> judge
- reflecting -> reflection
- documenting -> documenter
- else -> null (skip)
For non-null roles, look up active-processes registry, call
existing killProcessTree() (SIGTERM -> 1.5s -> SIGKILL).

**Fix 2**: per-role state-store flip via new
mark<X>StateFailed(workspaceId, reason) helpers in each runner
(reflection / documenter / architect). Idempotent: no-op when no
state, no-op when state already terminal. Dashboard chip clears
on next poll.
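
The idempotency contract can be sketched as below. The status
values, the in-memory Map standing in for the JSON state file, and
the name markStateFailed are assumptions; the real helpers live in
each runner and their storage format is not shown here.

```typescript
// Hypothetical status values; the real state-store schema may differ.
type RunStatus = "executing" | "completed" | "failed";

interface RoleState {
  status: RunStatus;
  error?: string;
}

const TERMINAL_STATUSES: ReadonlySet<RunStatus> = new Set<RunStatus>([
  "completed",
  "failed",
]);

// Returns true only when a live state was actually flipped to "failed".
// Safe to call from both stop and reap, in any order, any number of times.
function markStateFailed(
  store: Map<string, RoleState>, // stand-in for the per-role state file
  workspaceId: string,
  reason: string,
): boolean {
  const current = store.get(workspaceId);
  if (current === undefined) return false; // no state: no-op
  if (TERMINAL_STATUSES.has(current.status)) return false; // already terminal: no-op
  store.set(workspaceId, { status: "failed", error: reason });
  return true;
}
```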

**Fix 3**: new cfcf agents reap command + matching API endpoints
(/api/active-processes GET + POST kill). Mirrors cfcf server reap
UX but scoped to live-server children, not PPID==1 orphans.
Per-row y/N or --yes for non-interactive.
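
To illustrate the scoping difference only: the sketch below
compares the two filters over a hypothetical process table. The
real cfcf agents reap reads the active-processes registry through
the API rather than scanning PPIDs, and the server PID 4242 used in
the usage note is made up.

```typescript
// Hypothetical process-table row.
interface ProcRow {
  pid: number;
  ppid: number;
}

// What cfcf server reap looks for: children re-parented to init
// because their server died.
function deadServerOrphans(rows: ProcRow[]): ProcRow[] {
  return rows.filter((row) => row.ppid === 1);
}

// What the old filter could not see: children of a server that is
// still running.
function liveServerChildren(rows: ProcRow[], serverPid: number): ProcRow[] {
  return rows.filter((row) => row.ppid === serverPid);
}
```

With rows for PID 99729 (parent 4242) and PID 500 (parent 1),
deadServerOrphans returns only PID 500, while
liveServerChildren(rows, 4242) returns only PID 99729, which is the
iter-19 case.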

**Hard guarantee** (per user's explicit caution): PA and HA are
untouched at all three layers. They run as cfcf spec / cfcf help
assistant with stdio:"inherit" OUTSIDE the cfcf server, are NOT
in the active-processes registry, and AgentRole type doesn't
include them. Two layers of defence: type-level + phase-mapping.

**Wall-clock timeout intentionally NOT implemented**. PA proposed
it; user and I both rejected. It conflicts with codex rolling-window
quota recovery (~1hr legit waits) and with the cfcf "human on the
loop" principle. cfcf agents reap is the user-driven equivalent.

Files (~340 LoC):
- process-manager.ts: export killProcessTree
- reflection-runner.ts / documenter-runner.ts / architect-runner.ts:
  add mark<X>StateFailed helpers
- iteration-loop.ts: loopActivePhaseToRole helper, stopLoop kills
  + flips state
- server/app.ts: 2 new endpoints
- cli/commands/agents.ts: new file, cfcf agents reap
- cli/index.ts: register

Test coverage: 8 new tests (1038 total pass). Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fstamatelopoulos fstamatelopoulos changed the title from "feat(loop): harness-level missing-signals pause + retry_iteration action" to "feat(loop): missing-signals pause + retry_iteration + cfcf stop kills subprocess + cfcf agents reap" on May 14, 2026
…ascade

User reported: after `cfcf stop` + `kill -9` on a stranded
reflection codex, the loop CONTINUED to the next iteration. They
flagged this as "probably a consequence of bug A" — correct
diagnosis.

The chain (Bug D, downstream of Bug A):
1. cfcf stop sets state.phase = "stopped"
2. Reflection await is mid-flight; codex alive
3. User kill -9's codex
4. Await unblocks → post-reflection processing runs (archive,
   commit, Clio ingest)
5. Line 2283: state.phase = "deciding" — CLOBBERS "stopped"
6. makeDecision returns "continue" → case "continue" returns
7. Outer loop's isLoopDone(state) sees "deciding" → false
8. Next iteration spawns

The isStopped checks exist after dev's await (line 1852) and
after judge's await (line 2011), but were missing between the
reflection block and the DECIDE block. Any state.phase mutation in
that window overwrites an external "stopped" flag.

Fix: one isStopped guard right before "state.phase = deciding".
Matches the pattern of the dev / judge guards. Closes the cascade
for users on Fix 1 (kill-on-stop) too — Fix 1 makes the await
unblock faster, but the same overwrite would have run without the
guard.
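
A stripped-down sketch of the overwrite and the guard, assuming a
three-value Phase type and a hypothetical enterDecide() wrapper
(the real code mutates state.phase inline in iteration-loop.ts).
The `guarded` flag exists only so both behaviours can be compared.

```typescript
type Phase = "reflecting" | "deciding" | "stopped";

interface LoopState {
  phase: Phase;
}

function makeState(phase: Phase): LoopState {
  return { phase };
}

function isStopped(state: LoopState): boolean {
  return state.phase === "stopped";
}

// Returns true when the loop should go on to DECIDE.
function enterDecide(state: LoopState, guarded: boolean): boolean {
  if (guarded && isStopped(state)) {
    return false; // bail out; the external "stopped" flag survives
  }
  // Without the guard this assignment overwrites an external "stopped".
  state.phase = "deciding";
  return true;
}

// cfcf stop landed while reflection was still awaiting:
const unguarded = makeState("stopped");
enterDecide(unguarded, false); // phase is now "deciding": stop is lost

const guardedState = makeState("stopped");
enterDecide(guardedState, true); // phase stays "stopped": loop can exit
```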

No new tests — the guard's logic is `isStopped(state)`, which is
already covered exhaustively. The comment explains the rationale +
the gmbot iter-19 dogfood case so future edits don't drop it.

Typecheck clean. Full suite (1038 tests) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fstamatelopoulos fstamatelopoulos changed the title from "feat(loop): missing-signals pause + retry_iteration + cfcf stop kills subprocess + cfcf agents reap" to "loop: four anomaly-handling fixes — missing-signals pause, stop-kills-subprocess, agents reap, post-reflection stop guard" on May 14, 2026
@fstamatelopoulos fstamatelopoulos merged commit 0fdb39f into main May 14, 2026
3 checks passed
@fstamatelopoulos fstamatelopoulos deleted the iteration-7/missing-signal-pause branch May 14, 2026 06:19