Skip to content

Fix #2792: auto-recover from sync divergence with HITL ack#2797

Merged
jwbron merged 5 commits into
mainfrom
egg/issue-2792/hard-reset-recovery
May 27, 2026
Merged

Fix #2792: auto-recover from sync divergence with HITL ack#2797
jwbron merged 5 commits into
mainfrom
egg/issue-2792/hard-reset-recovery

Conversation

@jwbron
Copy link
Copy Markdown
Owner

@jwbron jwbron commented May 27, 2026

Summary

Closes the recurrence loop on #2337 / #2627: when _sync_worktree_with_remote detects real divergence and the rebase autoresolve fails, the helper now reconciles inline via a hard reset and surfaces an HITL ack instead of leaving the downstream populator to fail loudly against stale state.

  • _sync_worktree_with_remote returns a new WorktreeSyncOutcome (case, hard_reset_performed, backup_ref, discarded_commit_shas). On rebase failure the helper pins HEAD under refs/egg-backup/sync-recovery/<pipeline_id>/<unix_ts>, then git reset --hard origin/<branch>. Discriminator log lines stay in place; the new terminal case is divergence_recovered_via_reset (or divergence_rebase_and_reset_failed when the reset itself fails).
  • Plan-complete sync site, phase-start sync site, and the POST /pipelines/<id>/phase/populate-contract HTTP route inspect hard_reset_performed. The first two mark the pipeline FAILED and emit a dedicated HITL keyed on context="hard_reset_recovery:<phase>"; the route surfaces the same fields in its response body.
  • Decision dispatch in routes/decisions.py routes Continue with post-reset stateresume_pipeline_after_hard_reset_ack (slim phase-restart: reset phase exec, bump run_epoch, spawn a fresh _run_pipeline thread) and Abort pipelineabort_pipeline_after_hard_reset_ack (sets status=CANCELLED, emits the event; the existing PATCH cleanup machinery handles container/worktree teardown when the operator drives it).
  • Dropped the misleading "populate-from-plan step silently failed earlier" framing from _empty_contract_hitl_question — the new wording references the auto-reconcile path without inventing a phantom earlier failure.

Test plan

  • orchestrator/tests/test_hard_reset_recovery.py (new) — backup-ref name layout, WorktreeSyncOutcome shape on non-recovery branches, HITL question/options/context, dispatch wiring (Continue/Abort/unknown), and the dropped-wording regression check.
  • orchestrator/tests/test_sync_worktree.py — added taxonomy cases for divergence_recovered_via_reset, divergence_rebase_and_reset_failed, and the backup-ref-write-failure inlined-SHAs path. Updated the two prior tests that assumed rebase failure short-circuited without a reset.
  • orchestrator/tests/test_advance_phase_thread.py — relaxed the structural regex so it accepts outcome = _sync_worktree_with_remote(...) in addition to the bare call.
  • orchestrator/tests/test_pipeline_failure_path.py — pinned the StalePipelineBranchError test's sync mock to a non-recovery WorktreeSyncOutcome so the new hard-reset HITL path doesn't fire against the mocked-out helper.
  • make lint clean (ruff + ruff-format via pre-commit).
  • 100 tests across the four files above pass locally (.venv/bin/pytest).

Out of scope (per the issue plan)

When ``_sync_worktree_with_remote`` detects real divergence and the
rebase autoresolve fails, the helper now pins local-only commits under
``refs/egg-backup/sync-recovery/<pipeline_id>/<unix_ts>`` and hard-resets
HEAD to origin so downstream populator/decision-sync reads see a
consistent worktree. Phase-boundary callers and the ``populate-contract``
HTTP route inspect the new ``WorktreeSyncOutcome`` return type; when
``hard_reset_performed`` is True they emit a dedicated two-option HITL
(``Continue with post-reset state`` / ``Abort pipeline``) keyed on
``context="hard_reset_recovery:<phase>"``. The resolve_decision dispatch
in routes/decisions.py routes Continue to a slim phase-restart helper
and Abort to a status=CANCELLED writer.

The ``_empty_contract_hitl_question`` wording no longer blames a phantom
"populate-from-plan step silently failed earlier" — when the auto-
reconcile path doesn't fire, the prose now accurately describes the
divergence without inventing an earlier failure.
Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR closes a real gap — #2337 / #2627 left silent worktree divergence in the system, and the hard-reset auto-recovery with a HITL ack is the right shape. The new WorktreeSyncOutcome is well-modelled, the backup-ref naming convention (refs/egg-backup/sync-recovery/<pid>/<ts>) is recoverable via git for-each-ref, and the context discriminator (hard_reset_recovery:<phase>) keeps the dispatch hook stable. Tests cover the outcome shape, HITL text, and dispatch routing.

However, the three emission sites and the resume helper have substantial asymmetries that mean the destructive-recovery path doesn't actually behave consistently. Several of these are blocking; they're called out below with file:line references.

Blocking

B1. resume_pipeline_after_hard_reset_ack does not clear the BRC consensus tracker, evaluator, restart counts, or health-monitor anchors — phase re-run after a post-phase hard reset will not work correctly

orchestrator/routes/pipelines.py:3341-3435

The resume helper sets phase_exec.status = PENDING for the phase that was JUST marked FAILED, bumps run_epoch, and spawns a new _run_pipeline thread. Compare with restart_phase at lines 3250-3312, which does five additional things this helper skips: peer_consensus_tracker.clear(), consensus_evaluator.clear(pipeline_id), spawner.reset_restart_counts(pipeline_id), and health_monitor.reset_agent(role) per agent role.

The docstring justifies skipping container teardown and per-agent worktree deletion ("the hard-reset path is reached after BRC consensus already completed (or never spawned containers for this run)"), which is reasonable for those — but says nothing about the BRC tracker. At the post-phase emission site (line 21978+), the resume target IS the phase whose BRC tracker holds a CONFIRMED entry, because that phase completed normally before the sync ran. With the tracker not cleared, the re-spawned phase will either:

  • Find the tracker already in a CONFIRMED terminal state and short-circuit (no actual re-work), or
  • Have agents reconnect against stale per-role entries from the previous round.

This is exactly the class of bug that #2084 was opened against in the original restart_phase path, and #2084's fix is the reset_agent loop that this helper skips. The peer_consensus_tracker lookup pattern from restart_phase (3253-3262) needs to be replicated here, along with the evaluator clear, restart-count reset, and health-monitor anchor reset.

The phase-start emission site (line 20371) is less affected because the phase had not started — but the same code path runs there, so the fix is one helper, not three.

B2. Phase-start HITL emission site does not emit pipeline.failed events; observers will miss the destructive recovery

orchestrator/routes/pipelines.py:20371-20400 vs :21978-22027

The phase-start emission flips pipeline.status = FAILED, persists, emits the HITL decision, and returns. It never calls report_pipeline_status(pipeline, event_type="pipeline.failed", ...) or _emit_pipeline_event(pipeline, "pipeline.failed").

The post-phase emission site does both (lines 22021-22026), and every other FAILED-setter in this file does too — see _empty_contract_hitl cleanup at line 22100+, and the standard phase-failure block at line 21880+. Subscribers to pipeline.failed (collaborator dashboards, the gateway's pipeline-state cache, any external observer wired to the event stream) will see phase.completed followed by silence; they will not learn the pipeline became FAILED until they poll status directly.

Make the two sites emit symmetrically.

B3. Phase-start emission site leaves phase_exec.status untouched; state is asymmetric across the two recovery paths

orchestrator/routes/pipelines.py:20387-20391 vs :21994-22002

Phase-start emission writes only pipeline.status = FAILED + pipeline.error. It does not touch phase_exec.status or phase_exec.completed_at for the current phase. Post-phase emission writes phase_execution.status = FAILED + phase_execution.error + phase_execution.completed_at = datetime.now(UTC) in addition to the pipeline-level fields.

This means a GET /pipelines/<id> after a phase-start hard reset shows pipeline.status=FAILED but phase_exec.status=PENDING (the value going in) — the per-phase view does not reflect the failure. Either both fields should be set, or neither (with the resume helper compensating); choose one and apply it to both sites.

For phase-start specifically, leaving phase_exec.status=PENDING and phase_exec.error=None is defensible (the phase hadn't started), but then the post-phase site is overwriting a COMPLETE status with FAILED, which has its own oddness — the phase did complete, the sync just broke afterward. See B1 for the resume-side implication.

B4. populate_contract route runs the destructive sync but never emits an HITL or transitions pipeline state — automation drivers checking status_code alone will miss the hard reset

orchestrator/routes/phases.py:1103-1144 and :1210-1225

The route runs _sync_worktree_with_remote(...) pre-populate (good), and surfaces hard_reset_performed, backup_ref, discarded_commit_shas in the 200 success response and in details for 404/422/500 errors. The docstring explicitly intends this to be "the operator's signal" without an HITL gate.

The asymmetry is dangerous: the other two emission sites treat a hard reset as a terminal failure that requires operator ack before any downstream work runs. This route treats it as a no-stop event — _populate_contract_from_plan runs against the post-reset state immediately, and on success returns HTTP 200 with hard_reset_performed: true somewhere deep in data. An automation caller checking only response.ok will not notice that local commits were discarded.

If the design intent is "this route IS the recovery primitive, by the time an operator calls it they accept the destructive recovery" — that's defensible, but it needs to be expressed by:

  1. Refusing to run the populator at all if hard_reset_performed is True, returning an explicit 409/422 reason; OR
  2. Emitting the same HITL the other sites emit, gating the populator on the ack; OR
  3. At minimum, making the response a non-2xx status when hard_reset_performed is True so automation cannot mistake it for success.

Surfacing the flag in the response body without changing the status code is the worst of the three options. (Note also: _populate_contract_from_plan consumes the post-reset worktree before any operator has seen the hard-reset flag — there is no gate where the operator can say "wait, don't populate.")

There's no test for the route's hard-reset surfacing behavior in test_hard_reset_recovery.py either; the test suite covers the helper, the question, and the dispatch, but does not exercise the route end-to-end.

B5. divergence_rebase_and_reset_failed is silently fall-through at every caller — the doubly-failed path lands on a divergent worktree with no recovery

orchestrator/routes/pipelines.py:7130-7149, callers at 20371, 21978, and phases.py:1115

When both rebase AND hard-reset fail, the helper returns WorktreeSyncOutcome(case="divergence_rebase_and_reset_failed", hard_reset_performed=False, ...). Every caller gates downstream behavior on outcome.hard_reset_performed — which is False here — so the failure-of-failures path is identical to "happy no-op" from the caller's perspective. The worktree is still divergent, but the pipeline continues with no signal.

I acknowledge pre-PR behavior was the same (rebase failure silently fell through), so this is not strictly a regression. But the entire PR is about closing the silent-divergence loop, and this corner case re-opens it — the helper now has a doubly-failed terminal case that the callers do not handle. At minimum, raise a typed exception on this path (analogous to PlanDraftMissingOnLocalError) so the callers can wire it into the same FAILED-cleanup flow, or surface it as a different HITL/alert. The current behavior makes the "the rebase fallback handles divergence" guarantee conditional on git reset --hard never failing, which is plausible-but-not-guaranteed on a busy worktree.

Non-blocking

N1. Backup ref name uses second-precision unix timestamp; two recoveries within the same second on the same pipeline will collide

orchestrator/routes/pipelines.py:7099 (unix_ts = int(time.time()))

git update-ref refs/egg-backup/sync-recovery/<pid>/<ts> against an existing ref will refuse to write without the old SHA (so the second write would fail and backup_ok = False, the ref points at the FIRST recovery's HEAD, and the second recovery's discarded SHAs are inlined into a WARN log only). Two recoveries in the same second is unlikely but not impossible (orchestrator restart loop, a HITL ack that races with a phase-start emission). time.time_ns() or appending a process-counter suffix avoids the collision and costs nothing.

N2. _collect_local_only_commits uses --no-commit-header, which requires Git 2.33+ (August 2021)

orchestrator/routes/pipelines.py:6656

Likely fine for the egg sandbox/orchestrator image, but worth a sanity check against the actual Git version pinned in the Dockerfile / CI base image, since a stale base would produce error: unknown option: --no-commit-header and the function would silently return () (the try/except swallows). If older Git support is needed, rev-list --pretty=format:%h %s followed by stripping commit <hash> lines achieves the same output.

N3. Test file does not cover the phase-start vs post-phase asymmetry, the resume helper's tracker behavior, or the populate-contract route surfacing

orchestrator/tests/test_hard_reset_recovery.py

The tests are well-scoped to the helper / HITL / dispatch units, but the integration concerns flagged in B1-B4 above (event emission, phase_exec state, BRC tracker clearing, populate-route gating) have no test coverage. Even granted that adding integration coverage in this file is out of scope for the slice — the resume helper specifically (B1) could be tested in isolation by mocking the spawner/tracker and asserting tracker.clear() was called.

N4. Resume helper's docstring justifies skipping container teardown by claiming "BRC consensus already completed (or never spawned containers for this run)" — that justifies skipping container teardown but NOT skipping the consensus state reset

orchestrator/routes/pipelines.py:3353-3363

The docstring frames the slim implementation as deliberate, listing three things it skips and giving a reason for each. The BRC tracker is missing from the list of things it skips — see B1. After fixing B1, update the docstring to reflect the actual skipped/included surface.

N5. _handle_hard_reset_recovery_resolution's unknown-resolution case is logged at INFO and silently swallowed

orchestrator/routes/decisions.py:153+

The dispatch returns silently if the resolution string is neither "Continue with post-reset state" nor "Abort pipeline". This is defensive (the SDLC skill should only emit one of the two options), but if it ever happens, the pipeline stays in FAILED with no recovery action. WARN is the right log level for "we got something we cannot route." More importantly, the decision is already marked RESOLVED by the outer dispatcher — so an unknown resolution silently consumes the only operator-facing ack the pipeline will ever have. Consider either: (a) refusing to mark the decision RESOLVED for unknown values, or (b) escalating via OVERSEER_ALERT.


Test plan verification

Tests reviewed: test_hard_reset_recovery.py (310 lines, new), test_sync_worktree.py (+153 lines), test_advance_phase_thread.py (+4), test_pipeline_failure_path.py (+8).

  • Sync-helper subprocess scenarios: well-covered. The renamed test_case_diverged_rebase_failed_then_recovers_via_reset asserts the 6-call subprocess sequence and the outcome fields. Good shape.
  • HITL question / dispatch wiring: well-covered.
  • Wording regression for _empty_contract_hitl_question: a defensible coupling — if the wording changes again, the test will need to be updated, but the assertion ("populate-from-plan step silently failed earlier" not in question) is narrowly targeted.
  • Integration coverage gaps: see N3.

No bypass of production paths detected; tests exercise _sync_worktree_with_remote directly via patch("routes.pipelines.subprocess.run") rather than reconstructing the outcome by hand.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

B1: resume_pipeline_after_hard_reset_ack now clears the BRC tracker,
legacy consensus evaluator, restart counts, and Tier-1 health-monitor
anchors (mirrors restart_phase). The original slim implementation
re-spawned _run_pipeline against stale CONFIRMED state and could fire
stale-elapsed health alerts (#2084 bug class).

B2/B3: the phase-start emission site now sets phase_exec.status=FAILED,
clears completed_at/error/start markers symmetrically with the post-
phase site, and emits the pipeline.failed event so subscribers learn
the pipeline is failed_pending_hitl.

B4: populate_contract returns 409 when _sync_worktree_with_remote
raises the new SyncRebaseAndResetFailedError or returns an outcome with
hard_reset_performed=True. Operators must resolve the inline HITL
before populator can run; the route no longer silently proceeds.

B5: the doubly-failed recovery path now raises SyncRebaseAndResetFailed-
Error instead of returning a WorktreeSyncOutcome the caller couldn't
distinguish from a happy no-op. Both _run_pipeline emission sites and
populate_contract route the typed exception through the same FAILED-
cleanup flow as other terminal sync failures.

N1: backup-ref timestamp upgraded from time.time() (1s resolution) to
time.time_ns() so two recovery events in the same second never collide.

N5: unknown HITL resolution in _handle_hard_reset_recovery_resolution
now logs at WARN and broadcasts an OVERSEER_ALERT (priority=high) so
the operator notices the dispatch was skipped — the prior INFO log was
easy to miss while the pipeline stayed stuck.

Tests cover the resume helper's tracker/evaluator/restart-count/health
cleanup, the phase-agents-empty fallback path, the phase-mismatch guard,
the OVERSEER_ALERT emission, and the new typed-exception raise path
from the sync helper.
@james-in-a-box
Copy link
Copy Markdown
Contributor

Review feedback dispositions

Thanks for the careful pass. All ten items addressed in-PR in 7b505f6.

Blocking

  • B1 — resume helper missing consensus / restart-count / health-monitor cleanup: fixed-in-PR (commit 7b505f6). resume_pipeline_after_hard_reset_ack (orchestrator/routes/pipelines.py:3341) now mirrors restart_phase's out-of-lock cleanup: BRC tracker clear(), legacy evaluator clear(pipeline_id), spawner.reset_restart_counts(pipeline_id), and health_monitor.reset_agent(role) for every roster role. Roster collection happens inside the lock (phase_exec.agents is being cleared), cleanup happens outside — same shape as restart_phase lines 3250-3312.
  • B2 — phase-start site missing pipeline.failed event: fixed-in-PR (commit 7b505f6). Phase-start emission in _run_pipeline (orchestrator/routes/pipelines.py:~20508) now calls report_pipeline_status(pipeline, event_type="pipeline.failed", …) and _emit_pipeline_event(pipeline, "pipeline.failed") symmetrically with the post-phase site.
  • B3 — phase-start site leaves phase_exec.status untouched: fixed-in-PR (commit 7b505f6). Same emission site now sets phase_execution.status = PipelineStatus.FAILED, phase_execution.error, and phase_execution.completed_at before persisting — matches the post-phase recovery shape exactly.
  • B4 — populate_contract does not surface hard reset: fixed-in-PR (commit 7b505f6). orchestrator/routes/phases.py:1103 populate_contract now catches SyncRebaseAndResetFailedError (returns 409 sync_rebase_and_reset_failed) and refuses to run the populator when outcome.hard_reset_performed=True (returns 409 hard_reset_recovery_unacked). Automation drivers see a non-2xx; the inline HITL is emitted by the sync helper and must be resolved before populate is retried.
  • B5 — doubly-failed path silently fall-through: fixed-in-PR (commit 7b505f6). _sync_worktree_with_remote (orchestrator/routes/pipelines.py:~7276) now raises a new SyncRebaseAndResetFailedError with backup_ref and discarded_commit_shas attached. Both _run_pipeline emission sites and populate_contract route this typed exception through the same FAILED-cleanup flow as other terminal sync failures.

Non-blocking

  • N1 — backup ref second-precision collision: fixed-in-PR (commit 7b505f6). unix_ts = time.time_ns() in _sync_worktree_with_remote; the ref format refs/egg-backup/sync-recovery/<pid>/<ns> survives same-second back-to-back recoveries.
  • N2 — --no-commit-header Git 2.33+ requirement: disagree (no code change needed). Pinned base image python:3.14-slim is Debian Trixie, which ships Git 2.47.x — well above the 2.33+ floor. Verified upstream; no downgrade path in CI that would land on pre-2.33 Git.
  • N3 — test coverage for asymmetry / resume helper / populate route: fixed-in-PR (commit 7b505f6). New TestResumeHelperResetsConsensusAndHealth in orchestrator/tests/test_hard_reset_recovery.py covers the tracker/evaluator/restart-count/health-monitor cleanup, the empty-phase_exec.agents roster-table fallback, and the phase-mismatch refusal path. The existing test_case_diverged_rebase_and_reset_failed in test_sync_worktree.py was updated to assert the new SyncRebaseAndResetFailedError raise (with backup_ref and discarded_commit_shas attached) instead of the old silent outcome.
  • N4 — resume helper docstring: fixed-in-PR (commit 7b505f6). Docstring now documents the consensus/restart-count/health-monitor cleanup the helper performs (so the "no container teardown" carve-out doesn't read as "no cleanup at all"). See orchestrator/routes/pipelines.py:3347-3375.
  • N5 — unknown HITL resolution silently swallowed: fixed-in-PR (commit 7b505f6). _handle_hard_reset_recovery_resolution (orchestrator/routes/decisions.py:153) now logs at WARN and broadcasts an OVERSEER_ALERT (priority=high, anomaly=hard_reset_recovery_unknown_resolution) so the operator notices the dispatch was skipped. The decision still marks RESOLVED, so the operator's intent stays preserved in state.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review summary

The B1/B2/B3/B5/N1/N4/N5 fixes are correctly implemented and mirror the patterns called out in the prior review:

  • B1resume_pipeline_after_hard_reset_ack now mirrors restart_phase lines 3250-3312 exactly: BRC tracker clear(), evaluator clear(pipeline_id), spawner.reset_restart_counts(pipeline_id), and per-role health_monitor.reset_agent. The roster-collection fallback to egg_contracts.agent_roles.get_roles_for_phase covers the phase-start case where phase_exec.agents is empty. Verified at orchestrator/routes/pipelines.py:3417-3543.
  • B2/B3 — phase-start emission site now sets phase_execution.status = FAILED, error, completed_at, calls report_pipeline_status(... pipeline.failed ...) and _emit_pipeline_event(... pipeline.failed) symmetrically with the post-phase site. Verified at orchestrator/routes/pipelines.py:20594-20617.
  • B5SyncRebaseAndResetFailedError is the right shape (typed RuntimeError subclass with backup_ref and discarded_commit_shas), the sync helper raises at the doubly-failed point, and all three call sites (phase-start :20523, post-phase :22185, phases.py:1126) catch it.
  • N1time.time_ns() resolves the second-precision collision.
  • N4 — resume-helper docstring (:3347-3375) now documents the consensus / restart-count / health-monitor cleanup.
  • N5 — unknown-resolution now WARN-logs and broadcasts OVERSEER_ALERT with priority=high and a clear body explaining the pipeline will stay in failed_pending_hitl until re-resolved.

However, B4 is only partially addressed and N3 is only partially addressed. Both are blocking.

Blocking

B4 (re-open). populate_contract returns 409 telling the operator to "ack the phase-boundary HITL" — but no HITL is ever emitted on this code path

orchestrator/routes/phases.py:1126-1148 (the SyncRebaseAndResetFailedError branch) and :1164-1187 (the hard_reset_performed=True branch) both:

  1. Return a 409 with a body that explicitly tells the operator to "ack the phase-boundary HITL".
  2. Do not call _emit_hard_reset_recovery_hitl themselves.
  3. Do not persist any pipeline state transition.

The disposition asserts:

Automation drivers see a non-2xx; the inline HITL is emitted by the sync helper and must be resolved before populate is retried.

That is not what the code does. _sync_worktree_with_remote does not emit a HITL — its own docstring at orchestrator/routes/pipelines.py:6899-6902 is explicit:

Callers at phase boundaries inspect hard_reset_performed on the returned WorktreeSyncOutcome and surface an HITL ack — recovery is automatic, acknowledgement is the human gate.

grep for _emit_hard_reset_recovery_hitl confirms the only emission sites are the four _run_pipeline callers (:20551, :20604, :22222, :22281). The populate_contract route is not among them.

The practical consequence: in the documented use case (operator-initiated recovery primitive — populate_contract called outside an active _run_pipeline), the operator gets a 409 instructing them to ack a HITL that does not exist. They will check /sdlc or the decisions API, find nothing, and be stuck without a clear next step. The actual recovery path (re-call populate_contract once the worktree is hard-reset and sees no divergence) is not documented in the error message.

The original B4 outlined three valid options. The fix takes (1) refuse + (3) non-2xx but the response wording implies (2) emit HITL, which was not implemented. Pick a lane:

  • Either call _emit_hard_reset_recovery_hitl from the route after the sync returns hard_reset_performed=True (and from the SyncRebaseAndResetFailedError branch), persist pipeline.status=FAILED, and then return 409 — at which point the wording matches reality;
  • Or rewrite the 409 body to drop the "ack the phase-boundary HITL" language and tell the operator what they actually need to do (re-call the endpoint; the worktree is now in sync; the backup ref preserves the discarded commits).

The first option matches the post-phase / phase-start sites and gives the operator a uniform recovery surface across all three triggers of the hard reset. The second is the simpler one-liner but leaves the recovery surface split between code paths.

N3 (re-open). No test coverage for the populate_contract route's hard-reset paths

The disposition claims N3 was fixed, but the new tests in TestResumeHelperResetsConsensusAndHealth only cover the resume helper. The original N3 explicitly named three coverage gaps:

the integration concerns flagged in B1-B4 above (event emission, phase_exec state, BRC tracker clearing, populate-route gating) have no test coverage.

The two new 409 reasons (hard_reset_recovery_unacked at phases.py:1181 and sync_rebase_and_reset_failed at phases.py:1142) have zero assertion coverage — grep "hard_reset_recovery_unacked" orchestrator/tests/ and grep "sync_rebase_and_reset_failed" orchestrator/tests/ both return nothing.

This is the same class of issue that B4 was opened against: a fall-through path that "should" be terminal but is silently fall-through at every caller. Without a test that drives _sync_worktree_with_remote to hard_reset_performed=True (or raise SyncRebaseAndResetFailedError) through the route and asserts the 409 response shape, a future refactor of the route can drop these branches without anything failing.

Add tests for both 409 paths in test_hard_reset_recovery.py:

  • mock _sync_worktree_with_remote to return WorktreeSyncOutcome(case="divergence_recovered_via_reset", hard_reset_performed=True, backup_ref=..., discarded_commit_shas=...) → assert 409 + reason="hard_reset_recovery_unacked" + details.backup_ref + populator is not called;
  • mock _sync_worktree_with_remote to raise SyncRebaseAndResetFailedError → assert 409 + reason="sync_rebase_and_reset_failed" + details.backup_ref + populator is not called.

Non-blocking

Asymmetric phase_execution is not None guard between phase-start and post-phase emission sites

orchestrator/routes/pipelines.py:20543 (phase-start, doubly-failed) and :20596 (phase-start, hard-reset-success) both guard if phase_execution is not None: before mutating; :22215 (post-phase, doubly-failed) and :22274 (post-phase, hard-reset-success) do not. At post-phase boundaries get_phase_execution is expected to return non-None, so this is defensible — but pick one and apply it to all four sites for consistency. Future refactors that move the call earlier will find the post-phase sites without guards and trip on the NoneType.status assignment.

Phase-start completed_at = datetime.now(UTC) on a phase that didn't actually start

orchestrator/routes/pipelines.py:20547 and :20600: the phase-start emission writes completed_at even though the phase never began work. Operationally fine since FAILED is terminal, but semantically the field name implies the phase completed work. Consider leaving completed_at = None on phase-start emissions and only setting it on post-phase emissions.

OVERSEER_ALERT broadcast in unknown-resolution dispatch silently swallows broadcast failures

orchestrator/routes/decisions.py:245-246: except Exception: pass after the alert build/send. If the message_store backing call fails (Redis down, etc.) the alert is dropped silently. The preceding logger.warning at :209-213 still fires, so there is a log signal — but the operator-facing alert is exactly the surface this re-introduces silent loss on. Consider logging the broadcast failure at WARN inside the except clause so it appears alongside the unknown-resolution warning.

except ValueError, AttributeError: (PEP 758 unparenthesized form) at :3430

Valid in Python 3.14 (the project's pinned minimum) and used elsewhere in the orchestrator (overseer/monitor.py, env_config.py, etc.), so this is consistent with house style. Flagging only for readers who haven't seen PEP 758 — no change requested.


Test plan verification

Tests reviewed in the delta:

  • test_hard_reset_recovery.py — three new tests in TestResumeHelperResetsConsensusAndHealth correctly exercise the helper's tracker/evaluator/restart-count/health-monitor cleanup, the empty-phase_exec.agents roster-table fallback, and the phase-mismatch refusal. The test_unknown_resolution_is_logged_and_skipped upgrade correctly asserts the OVERSEER_ALERT envelope shape (message_type, anomaly metadata, priority).
  • test_sync_worktree.pytest_case_diverged_rebase_and_reset_failed correctly asserts the new typed-exception raise with backup_ref and discarded_commit_shas attached.

Gap: no test for the populate route's two 409 paths (see N3 re-open above).

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

B4: populate_contract now calls shared helper
_fail_pipeline_and_emit_hard_reset_recovery before returning 409 on
both branches (hard_reset_performed=True and SyncRebaseAndResetFailedError).
Helper persists pipeline+phase FAILED state, emits the hard-reset HITL,
and broadcasts pipeline.failed events, so all three triggers
(phase-start, post-phase, populate_contract) expose a uniform
operator-facing recovery surface.

N3: Added TestPopulateContractHardResetSurfacing class with two tests
covering both 409 paths and asserting the helper is called and the
populator is not.

Non-blocking: added 'if phase_execution is not None:' guards to the
two post-phase sites for consistency with the phase-start sites; and
replaced silent 'except Exception: pass' with logger.warning() on
OVERSEER_ALERT broadcast failure in decisions.py.
@james-in-a-box
Copy link
Copy Markdown
Contributor

Follow-up review responses (commit 7e59496)

B4 (re-open): populate_contract doesn't surface hard-reset HITL

fixed-in-PR (commit 7e59496) — Extracted the failure path into a new shared helper _fail_pipeline_and_emit_hard_reset_recovery in orchestrator/routes/pipelines.py. The helper acquires the pipeline state lock, pins both the phase and pipeline to FAILED, persists the error message, emits the hard-reset HITL decision with the canonical hard_reset_recovery:<phase> context, and broadcasts pipeline.failed via both report_pipeline_status and _emit_pipeline_event.

populate_contract in orchestrator/routes/phases.py now calls this helper on both branches before returning 409:

  • SyncRebaseAndResetFailedError → helper invoked, then 409 with reason="sync_rebase_and_reset_failed".
  • hard_reset_performed=True outcome → helper invoked, then 409 with reason="hard_reset_recovery_unacked".

All three trigger sites (phase-start in _run_pipeline, post-phase in _run_pipeline, and populate_contract) now expose the same operator-facing recovery surface.

N3 (re-open): test coverage gap for populate_contract surfacing

fixed-in-PR (commit 7e59496) — Added TestPopulateContractHardResetSurfacing in orchestrator/tests/test_hard_reset_recovery.py with two tests:

  • test_hard_reset_recovery_returns_409_and_emits_hitl: mocks _sync_worktree_with_remote to return WorktreeSyncOutcome(case="divergence_recovered_via_reset", hard_reset_performed=True, ...); asserts 409 + reason="hard_reset_recovery_unacked", that the contract populator was not called, and that the helper was called with the expected phase, backup ref, and discarded SHAs.
  • test_doubly_failed_returns_409_and_emits_hitl: mocks _sync_worktree_with_remote to raise SyncRebaseAndResetFailedError; asserts 409 + reason="sync_rebase_and_reset_failed", populator not called, and helper called.

Non-blocking: phase_execution is not None guard symmetry

fixed-in-PR (commit 7e59496) — Added the same if phase_execution is not None: guard to the two post-phase emission sites in _run_pipeline for consistency with the phase-start sites; defensive in practice since the phases are known to be initialized, but it makes the four sites textually uniform.

Non-blocking: silent except Exception: pass on OVERSEER_ALERT broadcast

fixed-in-PR (commit 7e59496) — Replaced the silent except in orchestrator/routes/decisions.py with logger.warning("hard_reset_recovery OVERSEER_ALERT broadcast failed", pipeline_id=..., exc_info=True). The broadcast remains best-effort (the operator-facing HITL has already been persisted by the time we reach this point), but failures are now observable in the logs.

Non-blocking: completed_at on phase-start FAILED transitions

disagree (matches existing convention) — All four existing _run_pipeline FAILED-terminal sites set phase_execution.completed_at = datetime.now(UTC) regardless of whether the phase started executing or failed at gate. Changing the new sites to omit completed_at would diverge from that established pattern; the field semantically records "when the phase reached its terminal state," not "when execution finished," so keeping it set is correct.

Non-blocking: PEP 758 unparenthesized except

No change requested by the reviewer on the new code, and the codebase doesn't currently use parenthesized except clauses elsewhere. Leaving as-is for stylistic consistency.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review summary (delta 7b505f6..7e59496)

B4 and N3 re-opens are properly addressed. The fix extracts the lock + persist + HITL + events sequence into a new _fail_pipeline_and_emit_hard_reset_recovery helper (orchestrator/routes/pipelines.py:14832) and populate_contract now calls it on both 409 branches (SyncRebaseAndResetFailedError at orchestrator/routes/phases.py:1144 and hard_reset_performed=True at :1197). The 409 wording matches reality: an actionable HITL is emitted, the pipeline+phase are pinned to FAILED, and pipeline.failed is broadcast — automation drivers see non-2xx, operators see the same hard_reset_recovery:<phase> decision they'd see at a phase boundary.

The TestPopulateContractHardResetSurfacing class (orchestrator/tests/test_hard_reset_recovery.py:529) covers both 409 paths, asserts the helper is called with the right phase/backup_ref/discarded shas, and pins the contract that the populator is not invoked on a worktree the helper consumed. Tests pass locally (18/18 in the file).

Two additional follow-ups from the prior non-blocking notes are also in: the if phase_execution is not None: guards are now symmetric across the four _run_pipeline emission sites (:22266, :22328), and the silent except Exception: pass on the OVERSEER_ALERT broadcast in decisions.py:245 is now a WARN log.

No new blocking issues in the delta. A few non-blocking observations worth a follow-up.

Non-blocking

Helper has no direct unit test — both new tests mock it out

orchestrator/tests/test_hard_reset_recovery.py:593,659 patch _fail_pipeline_and_emit_hard_reset_recovery itself, so the new tests verify the route's wiring to the helper (correct kwargs, populator not called) but assert nothing about the helper's actual lock/save/HITL/events sequence. The inner _emit_hard_reset_recovery_hitl step is covered separately at :196; the lock+save+events glue is not unit-tested in isolation. Acceptable since the same logical sequence is exercised by the inline emission sites in _run_pipeline, but a direct test of the helper (mock store, assert save_pipeline was called with status=FAILED, assert _emit_hard_reset_recovery_hitl was invoked, assert both event paths fired) would future-proof the abstraction.

Four inline copies of the same sequence + one helper — drift risk

The helper now lives at :14832, but the four _run_pipeline emission sites (:20589, :20642, :22261, :22323) still open-code the same lock → load → set FAILED → save → emit HITL → report_pipeline_status → _emit_pipeline_event sequence. The two phase-start sites (:20589, :20642) are pure subsets of the helper's body and could be replaced directly; the two post-phase sites interleave _teardown_phase_overseer between the HITL emit and the event broadcast, which the current helper has no hook for. Worth a follow-up to either:

  1. migrate the two phase-start sites to the helper and accept the inline-vs-helper split for the post-phase sites; or
  2. extend the helper with an optional pre_event_hook (or split into compose-able pieces) and migrate all four.

Five places to keep in sync if the FAILED-recovery shape ever changes again is the kind of footgun this PR's helper is supposed to remove.

Stale wording in the 409 response body

orchestrator/routes/phases.py:1207 still says "operator must ack the recovery via the phase-boundary HITL before re-running populate_contract." After this fix the route emits the recovery HITL itself — there is no separate "phase-boundary HITL" the operator needs to find. The HITL surfaces in /sdlc regardless of which trigger fired, so this isn't actively misleading, but the prose is a leftover from the original implementation and worth a one-line cleanup: drop "phase-boundary" or replace with "the hard-reset recovery HITL".

HITL question text and the SyncRebaseAndResetFailedError branch don't match reality

_hard_reset_recovery_hitl_question (orchestrator/routes/pipelines.py:14779-14790) says:

"the sync helper hard-reset HEAD to origin to keep downstream populator/decision-sync reads against a consistent state"

and offers:

"Continue with post-reset state — restart_phase {phase_label} so the populator and BRC agents re-run against the reconciled worktree"

When the helper is invoked from the SyncRebaseAndResetFailedError branch (now in three places: _run_pipeline phase-start :20599, post-phase :22273, and populate_contract :1144), the hard-reset itself failed — the worktree is still divergent, there is no reconciled state, and "Continue" would restart the phase only for the next phase-start sync to hit the same divergence and loop. Pre-existing in this PR (the prior commits already emitted the same HITL for the doubly-failed case), but the delta extends the surface to populate_contract as well.

Two options for a follow-up:

  1. Branch the question text on whether the reset actually completed and offer different options for the doubly-failed case (e.g. "Manually intervene" / "Abort"); or
  2. Suppress the "Continue" option when the doubly-failed branch fires.

The current behavior is observable but produces an option the operator cannot meaningfully use.

Lock release between FAILED-write and HITL-persist

_fail_pipeline_and_emit_hard_reset_recovery (:14853-14871) acquires the pipeline state lock, writes FAILED, releases, then calls _emit_hard_reset_recovery_hitl which re-acquires the lock to persist the decision. A reader observing the pipeline between the two acquisitions sees pipeline.status=FAILED with no pending decision — the operator's actionable HITL appears one lock-cycle later. Pre-existing in the inline sites (same shape), and the get_pipeline_state_lock is likely reentrant so the helper could hold across both writes. Worth a follow-up if you want to migrate the inline sites to the helper too.


Test plan verification

  • tests/test_hard_reset_recovery.py (18 passes locally) — the two new tests in TestPopulateContractHardResetSurfacing exercise both 409 paths through client.post("/api/v1/pipelines/issue-2792/phase/populate-contract"), mock _sync_worktree_with_remote to drive each branch, and assert the response shape + populator-not-called + helper-called-with-correct-kwargs. No self-seeded golden / bypassed code path / name-vs-behaviour contradiction detected.
  • The helper itself has no isolated test (see above).
  • The phase-start phase_execution is not None guard symmetry is structural — no test asserts the new guards specifically, but the surrounding flow is exercised by the existing failure-path tests.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

- Helper unit tests: add TestFailPipelineAndEmitHelper (3 tests) covering
  call ordering (FAILED-write under lock, then HITL, then events),
  pre_event_hook timing, and reset_succeeded threading.
- Drift risk: migrate all four _run_pipeline FAILED-recovery sites to call
  _fail_pipeline_and_emit_hard_reset_recovery; post-phase sites pass a
  pre_event_hook closure to tear down the per-phase overseer.
- Stale wording: phases.py 409 response now says "ack the hard-reset
  recovery HITL (visible in /sdlc)" instead of "phase-boundary HITL".
- HITL question for doubly-failed: add reset_succeeded param threaded
  through helper -> emit -> question/options; suppress the "Continue"
  option and switch the question text when reset itself failed.
- Lock release between FAILED write and HITL persist: move the
  _emit_hard_reset_recovery_hitl call inside the
  get_pipeline_state_lock block (RLock is reentrant, so the inner
  _persist_hitl_decision re-acquire is safe).
@james-in-a-box
Copy link
Copy Markdown
Contributor

All five non-blocking items from the egg-reviewer pass have been addressed in this PR. Disposition table (each item links to commit 0070af2):

# Item Disposition
1 Helper unit test_fail_pipeline_and_emit_hard_reset_recovery had no direct coverage; existing tests exercised it indirectly through populate_contract. fixed-in-PR (commit 0070af2) — added TestFailPipelineAndEmitHelper (3 tests) in orchestrator/tests/test_hard_reset_recovery.py: (a) writes FAILED under the lock then emits HITL then broadcasts events; (b) pre_event_hook runs between HITL persist and public event; (c) reset_succeeded=False threads through to the emit call.
2 Drift risk from inline copies — four _run_pipeline FAILED-recovery sites duplicated the lock-write/emit/broadcast sequence the helper already encapsulated. fixed-in-PR (commit 0070af2) — all four sites in orchestrator/routes/pipelines.py (phase-start doubly-failed, phase-start recovered, post-phase doubly-failed, post-phase recovered) now call _fail_pipeline_and_emit_hard_reset_recovery. The post-phase sites pass a pre_event_hook closure that tears down the per-phase overseer between the HITL persist and the public pipeline.failed broadcast (preserving the prior ordering). Loop-variable closures are bound via default args to satisfy B023.
3 Stale "phase-boundary HITL" wording in the populate_contract 409 response. fixed-in-PR (commit 0070af2)orchestrator/routes/phases.py now returns "operator must ack the hard-reset recovery HITL (visible in /sdlc) before re-running populate_contract".
4 HITL question doesn't match reality when reset itself failed — the prompt still said "Worktree was reset to origin..." and offered "Continue from origin" even though no reset occurred. fixed-in-PR (commit 0070af2) — added reset_succeeded: bool = True param threaded through _hard_reset_recovery_hitl_question, new _hard_reset_recovery_hitl_options helper, _emit_hard_reset_recovery_hitl, and _fail_pipeline_and_emit_hard_reset_recovery. When reset_succeeded=False: question text describes the doubly-failed cascade (no reset performed, backup ref captures pre-attempt state) and the options collapse to ["Abort pipeline"]. phases.py passes reset_succeeded=False in the SyncRebaseAndResetFailedError branch.
5 Lock release between FAILED-write and HITL-persist could allow a stale read to race the HITL surfacing. fixed-in-PR (commit 0070af2)_emit_hard_reset_recovery_hitl now runs inside the with get_pipeline_state_lock(pipeline_id) block in the helper. get_pipeline_state_lock returns a threading.RLock, so the inner _persist_hitl_decision re-acquire is safe. Public event broadcast (report_pipeline_status/_emit_pipeline_event) still runs outside the lock to avoid holding it across I/O.

Verification: ruff check clean on all three modified files; 111 tests pass across test_hard_reset_recovery.py (24 tests, including 6 new), test_sync_worktree.py, test_advance_phase_thread.py, and test_pipeline_failure_path.py.

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review summary (delta 7e59496..0070af2)

All five non-blocking items from my prior review are properly addressed. The delta touches three files (+449/-147) and is well-contained:

  • N1 (helper unit test) — new TestFailPipelineAndEmitHelper (orchestrator/tests/test_hard_reset_recovery.py:776) with three direct tests: call-order under-lock (save_pipeline → emit_hitl → report_pipeline_status → emit_pipeline_event), pre_event_hook ordering between HITL persist and public event, and reset_succeeded=False threading through to the emit call.
  • N2 (drift risk from inline copies) — all four _run_pipeline FAILED-recovery sites (pipelines.py:20666, :20701, :22327, :22383) now route through _fail_pipeline_and_emit_hard_reset_recovery. The two post-phase sites pass closure-based pre_event_hook arguments that tear down the per-phase overseer between HITL persist and public broadcast, preserving the prior ordering. Loop-iteration captures (overseer_container_id, current_phase) are bound via default args to satisfy B023; the closures are invoked synchronously by the helper so late binding wouldn't bite even without the explicit binding.
  • N3 (stale "phase-boundary HITL" wording in 409 response)phases.py:1207-1209 now says "operator must ack the hard-reset recovery HITL (visible in /sdlc) before re-running populate_contract". The HITL is emitted by the route itself via the helper, so the new wording matches reality.
  • N4 (HITL options + question text on doubly-failed branch) — new _hard_reset_recovery_hitl_options(reset_succeeded) helper (pipelines.py:14821), _hard_reset_recovery_hitl_question branches on reset_succeeded (:14793 for the success wording, :14807 for the doubly-failed wording), and _emit_hard_reset_recovery_hitl threads the flag through. When SyncRebaseAndResetFailedError fires (now from three sites: _run_pipeline phase-start :20673, post-phase :22334, populate_contract phases.py:1151), the options collapse to ["Abort pipeline"] and the question text correctly reports the still-divergent state ("Continuing would only loop back into the same failure on the next sync").
  • N5 (lock release between FAILED-write and HITL-persist)_emit_hard_reset_recovery_hitl is now invoked inside the with get_pipeline_state_lock(pipeline_id): block of the helper (pipelines.py:14939). get_pipeline_state_lock returns a threading.RLock (verified at state_store.py:1280), so the inner _persist_hitl_decision re-acquisition is safe. A reader between the two writes can no longer observe status=FAILED without the pending decision.

Tests pass locally — 24/24 in test_hard_reset_recovery.py (6 new), 87 passes across the three related test files, no regressions.

No blocking issues in the delta.

Non-blocking

Stale "phase-boundary HITL" wording remaining in populate_contract docstring

orchestrator/routes/phases.py:1068: the public docstring still says "The operator must ack the phase-boundary HITL before re-running this endpoint". The 409 response message at :1208 was updated to "hard-reset recovery HITL (visible in /sdlc)", but the docstring above it still describes the pre-fix wiring. After this PR the route emits the recovery HITL itself via _fail_pipeline_and_emit_hard_reset_recovery; there is no separate phase-boundary HITL for the operator to ack. Drop "phase-boundary" or replace it with "hard-reset recovery" to match the response body and the helper's actual behavior. One-line cleanup.

Dispatch handler doesn't validate option against the underlying HITL's options list

orchestrator/routes/decisions.py:190-207: _handle_hard_reset_recovery_resolution accepts "Continue with post-reset state" regardless of how the HITL was constructed. After this PR the doubly-failed branch (reset_succeeded=False) emits a HITL whose options list is ["Abort pipeline"] only — but if an operator hits the resolution API directly (bypassing the UI's options dropdown) with "Continue with post-reset state", the dispatch happily routes to resume_pipeline_after_hard_reset_ack, which spawns a fresh _run_pipeline thread that re-hits the same divergence at the next phase-start sync. This is a misuse scenario rather than a regression — the UI enforces the constraint — but a server-side check (load the decision, verify the resolution is in its options list, reject with unknown_resolution -> OVERSEER_ALERT path that already exists at :208-253) would close the gap. Defense-in-depth; safe to defer.

Helper test mocks _emit_hard_reset_recovery_hitl and store.load_pipeline — the actual lock is not exercised

TestFailPipelineAndEmitHelper (tests/test_hard_reset_recovery.py:776) verifies the call sequence the helper produces (save → emit_hitl → hook → report → event) but does so against MagicMock() for both the store and the emission function. The "FAILED-write under the same lock as HITL persist" invariant — the explicit reason for this round's change — has no test that drives both writes through a real get_pipeline_state_lock(pipeline_id) RLock and asserts that a concurrent reader can't observe FAILED without the pending decision. The current tests would still pass if a future refactor accidentally moved _emit_hard_reset_recovery_hitl back outside the lock. A real-lock test would future-proof the invariant; mock-only coverage is acceptable here since the invariant is small and well-documented inline, but worth a follow-up.

Four _run_pipeline sites still open-code the closure pattern

The two phase-start sites at pipelines.py:20666 and :20701 don't need a pre_event_hook (no overseer to tear down), so they call the helper plainly. The two post-phase sites at :22327 and :22383 define near-identical closures (_teardown_overseer_doubly_failed / _teardown_overseer_recovered) that differ only in the reason string and the reset_succeeded flag passed alongside. Folding both into a single _make_overseer_teardown_hook(reason) factory at module scope would remove ~30 lines of duplication. Minor; the duplication isn't a footgun since both closures invoke the helper synchronously.


Test plan verification

  • test_hard_reset_recovery.py — 24/24 pass. New TestHardResetHitlDoublyFailedBranch (3 tests at :696) covers the options-list branch, the question-text branch, and the _emit_hard_reset_recovery_hitl plumbing for reset_succeeded=False. New TestFailPipelineAndEmitHelper (3 tests at :776) covers the helper's call sequence, the pre_event_hook ordering between HITL persist and public event, and the reset_succeeded=False threading through to the emit call.
  • test_sync_worktree.py, test_advance_phase_thread.py, test_pipeline_failure_path.py — 87/87 pass; the _sync_worktree_with_remote capture-form regex in test_advance_phase_thread.py:443 still matches.
  • No self-seeded golden / bypassed code path / name-vs-behaviour contradiction detected.

— Authored by egg

@james-in-a-box

This comment has been minimized.

@james-in-a-box

This comment has been minimized.

- Update populate_contract docstring to reference the hard-reset
  recovery HITL (visible in /sdlc) and note the route emits it.
- Validate hard-reset HITL resolution against decision.options before
  dispatching; OVERSEER_ALERT on mismatch instead of silent fall-back.
- Add real-lock test using state_store.get_pipeline_state_lock to
  prove FAILED write and HITL persist execute under the same RLock.
- Extract _make_overseer_teardown_hook factory to deduplicate the
  two post-phase teardown closures.
@james-in-a-box
Copy link
Copy Markdown
Contributor

Review feedback addressed in commit 4374471

Item Disposition Where
N1 — stale "phase-boundary HITL" wording in populate_contract docstring fixed-in-PR (commit 4374471) orchestrator/routes/phases.py:1062-1075 — replaced with "hard-reset recovery HITL (visible in /sdlc)" and clarified the route now emits the HITL itself rather than deferring to the phase boundary.
N2 — dispatch should validate resolution against decision.options before routing fixed-in-PR (commit 4374471) orchestrator/routes/decisions.py:147-228, 922-934 — added valid_options: list[str] | None parameter to _handle_hard_reset_recovery_resolution, computed option_mismatch_reason early, gated the Continue/Abort arms on it, and routed mismatches to the unknown-resolution branch (which logs + broadcasts an OVERSEER_ALERT). The dispatch call site passes decision.options. Four new tests in TestDispatchResolution cover (a) Continue rejected when not in options, (b) Abort accepted in options, (c) Continue accepted in options, (d) valid_options=None preserves legacy behavior.
N3test_failed_write_and_hitl_persist_held_under_same_lock should exercise the real lock fixed-in-PR (commit 4374471) orchestrator/tests/test_hard_reset_recovery.py — new test uses the real threading.RLock from state_store.get_pipeline_state_lock(pipeline_id). On each save_pipeline callback it spawns a worker thread that probes lock.acquire(blocking=False); the worker observes False (lock genuinely held) on both the outer FAILED write and the inner HITL persist. The test also asserts the final state has status=FAILED and exactly one decision with context="hard_reset_recovery:implement".
N4_teardown_overseer_doubly_failed and _teardown_overseer_recovered are identical fixed-in-PR (commit 4374471) orchestrator/routes/pipelines.py — extracted _make_overseer_teardown_hook(*, reason, container_id, phase) factory inside _run_pipeline (kept inside the function because it closes over the mutable phase_overseer_active nonlocal and overseer_lock). Both post-phase emission sites now call pre_event_hook=_make_overseer_teardown_hook(reason=..., container_id=overseer_container_id, phase=current_phase) — single source of truth.

Verification

  • ruff check + ruff format --check: clean.
  • pytest orchestrator/tests/test_hard_reset_recovery.py -v: 29 passed (5 new).
  • pytest orchestrator/tests/test_sync_worktree.py orchestrator/tests/test_advance_phase_thread.py orchestrator/tests/test_pipeline_failure_path.py: 87 passed.
  • make lint: clean (only soft-cap size warnings on pre-existing files, no errors).

— Authored by egg

@james-in-a-box

This comment has been minimized.

Copy link
Copy Markdown
Contributor

@egg-reviewer egg-reviewer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review summary (delta 0070af2..43744714)

The four non-blocking items from the prior review are properly addressed in commit 43744714:

  • N1 (stale docstring in populate_contract)orchestrator/routes/phases.py:1062-1075 now reads "emits the hard-reset recovery HITL itself" and "ack the hard-reset recovery HITL (visible in /sdlc)". The docstring matches the response body and the helper's actual behavior.
  • N2 (dispatch validates resolution against decision options)orchestrator/routes/decisions.py:153-292 adds valid_options: list[str] | None = None. The cross-check sets option_mismatch_reason when the resolution isn't in the list and gates both the Continue and Abort arms on it, routing mismatches to the unknown-resolution branch (WARN log + OVERSEER_ALERT with a body that names the mismatch). The dispatch call site at :973 passes list(decision.options or []). Defense-in-depth against direct-API operators picking "Continue" on a doubly-failed HITL whose options collapsed to ["Abort pipeline"] — closes the loop the prior review flagged.
  • N3 (helper has direct lock test)orchestrator/tests/test_hard_reset_recovery.py:1080 adds test_failed_write_and_hitl_persist_held_under_same_lock that drives both writes through the real threading.RLock from state_store.get_pipeline_state_lock. See the non-blocking note below — the test name overstates what it actually verifies, but the intent and the lock-acquisition pattern (real RLock, real _persist_hitl_decision) are right.
  • N4 (extract shared _make_overseer_teardown_hook factory)orchestrator/routes/pipelines.py:20449-20485 extracts a factory inside _run_pipeline that snapshots container_id and phase as function parameters (binding at factory-call time). Both post-phase emission sites (:22353, :22393) call it with the same shape, replacing the two ~20-line duplicated closures. The factory correctly stays inside _run_pipeline because it closes over the mutable phase_overseer_active nonlocal.

29/29 tests in test_hard_reset_recovery.py pass locally (5 new in this delta).

No blocking issues in the delta.

Non-blocking

The "real-lock" test doesn't actually validate the "lock spans both writes" invariant

orchestrator/tests/test_hard_reset_recovery.py:1080-1185 claims to enforce that the helper holds the lock across both save_pipeline calls, but the probe runs only during each individual save (inside fake_save's side effect) — never between them.

In a buggy refactor that releases the outer lock between save #1 and save #2 and lets _persist_hitl_decision re-acquire its own lock for save #2:

# Hypothetical buggy refactor
with get_pipeline_state_lock(pipeline_id):
    ...
    store.save_pipeline(pipeline)            # save #1 — probe sees False (outer lock held)
# <-- outer lock released; observer COULD see status=FAILED with no pending decision here
_emit_hard_reset_recovery_hitl(...)          # inner _persist_hitl_decision acquires its own lock
# inside _persist_hitl_decision:
#   with get_pipeline_state_lock(pipeline_id):
#     ...
#     store.save_pipeline(disk_pipeline)     # save #2 — probe sees False (inner lock held)

Both probes still return False, so acquisitions_during_save == [False, False]. The test passes against the buggy refactor it claims to defend against.

The actual invariant requires probing in the gap between the two saves, not during each save. One way to do that:

# Spawn the probe thread when _persist_hitl_decision starts (before it re-acquires the lock).
# Patch add_decision (or load_pipeline inside _persist_hitl_decision) to probe.
def fake_load(pid):
    if pid == pipeline_id and mock_store.save_pipeline.call_count >= 1:
        probe_other_thread()  # this fires after save #1 returns
    return pipeline
mock_store.load_pipeline.side_effect = fake_load

The production code is correct — _emit_hard_reset_recovery_hitl is invoked inside the outer with get_pipeline_state_lock(pipeline_id): block at pipelines.py:14939, and the inner RLock re-acquire in _persist_hitl_decision:14712 is safe. But the test should match its docstring claim or the docstring should be softened — as written, the next reader will trust the test to catch a regression that it cannot catch.

Factory docstring overstates the late-binding concern

orchestrator/routes/pipelines.py:20455-20470: the docstring says "late binding would race a subsequent loop iteration". In practice the closure is invoked synchronously by _fail_pipeline_and_emit_hard_reset_recovery (via pre_event_hook() at :14948-14949), and every call site immediately breaks out of the loop afterwards (:22359, :22399). Late binding wouldn't actually race in the current shape — the parameter snapshot is defensive but not strictly necessary for correctness. Not a bug; the snapshot is the right default for a factory that returns closures, and the explicit binding is a fine guard against future refactors that move the call out of the synchronous path. Worth softening the docstring to "parameter snapshot binds the values at factory-call time so a future refactor that defers invocation past the current loop iteration is safe".

list(decision.options or []) is redundant

orchestrator/routes/decisions.py:973: decision.options is typed list[str] with default_factory=list (orchestrator/models.py:285-287), so it's never None. The or [] fallback never fires. list(decision.options) would be enough; the current form is harmless but adds a phantom case future readers may try to reason about.


Test plan verification

  • pytest orchestrator/tests/test_hard_reset_recovery.py — 29/29 pass locally. The 4 new TestDispatchResolution tests correctly cover the four corners of valid_options handling (Continue rejected when not in list, Abort accepted in list, Continue accepted in list, valid_options=None keeps legacy). The new TestFailPipelineAndEmitHelper::test_failed_write_and_hitl_persist_held_under_same_lock exercises a real RLock and asserts pipeline state after the helper returns — see the non-blocking note above for the gap in its coverage of the spanning-lock invariant.
  • No self-seeded golden / bypassed code path / name-vs-behaviour contradiction other than the test-vs-docstring mismatch flagged above.

— Authored by egg

@james-in-a-box
Copy link
Copy Markdown
Contributor

Automated feedback loop has reached the maximum of 5 rounds. Human review is needed to make further progress on this PR.

@james-in-a-box
Copy link
Copy Markdown
Contributor

egg review completed. View run logs

14 previous review(s) hidden.

@jwbron jwbron merged commit 99f8a00 into main May 27, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant