Fix stuck running jobs by detecting dead review/task PIDs#176
Fix stuck running jobs by detecting dead review/task PIDs#176Abdooo2235 wants to merge 10 commits intoopenai:mainfrom
Conversation
- detect dead tracked PIDs and reconcile queued/running jobs to failed\n- persist failure state to state index and job file with clear diagnostics\n- surface error details and wait timeout context in status output\n- add regression test covering dead-PID status --wait behavior
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39b560f530
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Addressed the Codex race-condition feedback in commit 5ff5fc5. Changes made:
Validation: npm test passes (68/68). |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ff5fc5d60
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Dead-PID Handling UpdateImplemented on branch SummaryThis PR now handles stale What changed
Validationnpm test
@codex review |
|
Codex Review: Didn't find any major issues. Breezy! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Codex Review: Didn't find any major issues. More of your lovely PRs please. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
- Add isProcessAlive(pid) helper to process.mjs - Add markDeadPidJobFailed() to job-control.mjs with race guards - Call dead-PID check inside waitForSingleJobSnapshot poll loop - Surface errorMessage in pushJobDetails for failed jobs - Add regression tests: dead-PID fast-fail and concurrent-completion race Fixes openai#164.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2dd44a044
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
A Codex task can become stuck in `status: running` indefinitely with no way to recover short of manually wiping the job-store and broker session. The user-visible symptom is that `/codex:status` keeps reporting a job as running for many minutes after the underlying codex CLI worker has died, and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket. Three independent root causes combine to produce the wedge: 1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race against the client's `exitPromise`. When the underlying codex app-server (or the broker socket) dies mid-turn, no terminal event is ever delivered, so the await never resolves. Pending RPC requests are rejected by `handleExit` but the turn-completion promise is unrelated to those, so it sits forever. 2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no timeout fallback. When `captureTurn` hangs, the entire companion process hangs, and the per-job state file is never transitioned to a terminal status. 3. `app-server-broker.mjs` never listens for the underlying app-server client's exit. When the codex CLI process dies, the broker keeps running with a dead client, accepting new connections but unable to serve them. The next companion sees a stale broker.json that points at a half-dead broker. 4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do not probe whether tracked PIDs are still alive. Once a job is wedged, every status query reports it as running. Fixes: - captureTurn now races `state.completion` against `client.exitPromise`. If the client closes before a terminal event arrives, the turn is rejected with the recorded exit error and progress emits a `failed` notification. - runTrackedJob races the runner against a hard timeout (default 30 minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the `timeoutMs` option). On timeout the job transitions to `failed` with a clear error message instead of leaving the companion stuck. - The broker now subscribes to its appClient's `exitPromise`. When the underlying codex CLI exits unexpectedly the broker logs the reason, fans out a `notifications/broker/shuttingDown` event to any connected socket, tears down the unix socket and pid file, then exits with status 1. The next companion will detect a dead endpoint via `ensureBrokerSession` and respawn cleanly. - A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)` to probe pid liveness without affecting the process. - A new `markDeadPidJobFailed` and `reconcileIfDead` helper in lib/job-control.mjs reconcile any active job whose tracked PID is gone. Reconciliation now runs on every status read path, not only the `--wait` polling loop, so a single `/codex:status` call surfaces dead workers immediately. Closes openai#183 (runTrackedJob/captureTurn hang in finalizing). Refs openai#176, openai#164 (dead PID detection on status reads). Tests: - tests/process.test.mjs: 3 new cases for `isProcessAlive`. - tests/dead-pid-reconcile.test.mjs: 6 new cases covering `markDeadPidJobFailed` guards (active state, PID match) and the reconcile-on-status behaviour for both single-job and full snapshots. - tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard timeout transitions a hung job to `failed` and that successful runners still complete normally. All 11 new tests pass. The 5 failing tests under the existing suite (`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and `resolveStateDir`) are pre-existing on `main` and unrelated to this patch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect dead tracked processes throughout status wait polling, harden reconciliation race guards with PID normalization, and surface clear timeout/failure messaging. Add regression coverage for immediate timeout reconciliation and string/number PID matching. Made-with: Cursor
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b155dab283
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Repair runtime.test parse error, ensure status wait consistently reconciles dead PIDs across polling phases, and restore persisted-state preference for stale running index entries without mutating read paths broadly. Also remove duplicate failed-job error rendering and add reconciliation error breadcrumbs. Made-with: Cursor
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6132e48506
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c0d7cdac9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0b9379efa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Update: metadata consistency follow-up addressed and pushed to this PR branch. Summary of changes
Validation
Branch status
|
|
Updated PR branch with the latest fix. Commit
Validation
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 10f8528fb1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| function normalizeTrackedPid(pid) { | ||
| const numeric = Number(pid); | ||
| return Number.isFinite(numeric) && numeric > 0 ? numeric : null; | ||
| } |
There was a problem hiding this comment.
Reintroduce session-scoping task helper functions
This hunk replaces the block that previously defined getCurrentClaudeSessionId, filterJobsForCurrentClaudeSession, and findLatestResumableTaskJob, but call sites for those names still exist in resolveLatestTrackedTaskThread and handleTaskResumeCandidate. In this commit, invoking task-resume-candidate (for example node plugins/codex/scripts/codex-companion.mjs task-resume-candidate --json) now fails with ReferenceError: getCurrentClaudeSessionId is not defined, and resume-last task flows crash for the same reason. Please restore these helpers (or update the callers to a replacement API) so resume commands remain usable.
Useful? React with 👍 / 👎.
Summary
Fixes stuck Codex jobs that stayed in
queued/runningafter the tracked process died unexpectedly.Closes #164.
Problem
When a review/task process exited unexpectedly, job state could remain
runningindefinitely.That caused
status --waitto keep polling until timeout with no actionable failure reason.What Changed
failed.errorMessagein status output for easier diagnosis.status --waitbehavior.Behavior After Fix
failedinstead of remainingrunning.status --waitreturns quickly with failure state rather than timing out.Test Plan
npm teststatus --wait.