Skip to content

Fix stuck running jobs by detecting dead review/task PIDs#176

Open
Abdooo2235 wants to merge 10 commits intoopenai:mainfrom
Abdooo2235:fix/issue-164-dead-pid-detection
Open

Fix stuck running jobs by detecting dead review/task PIDs#176
Abdooo2235 wants to merge 10 commits intoopenai:mainfrom
Abdooo2235:fix/issue-164-dead-pid-detection

Conversation

@Abdooo2235
Copy link
Copy Markdown

@Abdooo2235 Abdooo2235 commented Apr 7, 2026

Summary

Fixes stuck Codex jobs that stayed in queued/running after the tracked process died unexpectedly.

Closes #164.

Problem

When a review/task process exited unexpectedly, job state could remain running indefinitely.
That caused status --wait to keep polling until timeout with no actionable failure reason.

What Changed

  • Added PID liveness detection for tracked jobs.
  • Added reconciliation that converts stale active jobs with dead PIDs to failed.
  • Persisted reconciliation to both the state index and per-job file.
  • Stored a clear failure message and completion timestamp when dead PID is detected.
  • Surfaced errorMessage in status output for easier diagnosis.
  • Included wait-timeout context in single-job status rendering.
  • Added a regression test for dead-PID plus status --wait behavior.

Behavior After Fix

  • Dead worker/reviewer PID is detected promptly.
  • Job transitions to failed instead of remaining running.
  • status --wait returns quickly with failure state rather than timing out.
  • Diagnostic reason is visible in status output and persisted job artifacts.

Test Plan

  • Ran full suite: npm test
  • Result: 67 passed, 0 failed.
  • Includes regression coverage for dead-PID reconciliation in status --wait.

- detect dead tracked PIDs and reconcile queued/running jobs to failed\n- persist failure state to state index and job file with clear diagnostics\n- surface error details and wait timeout context in status output\n- add regression test covering dead-PID status --wait behavior
@Abdooo2235 Abdooo2235 requested a review from a team April 7, 2026 16:16
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39b560f530

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Abdooo2235
Copy link
Copy Markdown
Author

Addressed the Codex race-condition feedback in commit 5ff5fc5.

Changes made:

  • Updated markDeadPidJobFailed to re-read the latest persisted job before applying failedPatch.
  • Added guards so failure is only written when the latest persisted state is still active (queued/running) and PID still matches the originally observed PID.
  • This prevents downgrading legitimately completed/failed jobs during status polling races.
  • Added regression test: "status dead-pid reconciliation does not downgrade a concurrently completed job".

Validation: npm test passes (68/68).

@Abdooo2235
Copy link
Copy Markdown
Author

Abdooo2235 commented Apr 7, 2026

Implemented follow-up fix for the race noted in Codex feedback (commit 5ff5fc5), including persisted-state + PID guard checks and a regression test.

Validation is green: npm test (68/68).

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ff5fc5d60

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Abdooo2235
Copy link
Copy Markdown
Author

Abdooo2235 commented Apr 7, 2026

Dead-PID Handling Update

Implemented on branch fix/issue-164-dead-pid-detection.

Summary

This PR now handles stale running/queued jobs whose tracked process has died by reconciling them into a consistent terminal state.

What changed

  • Added dead-PID detection and reconciliation in plugins/codex/scripts/lib/job-control.mjs.
  • Added race guards so terminal states are not overwritten during polling races.
  • Synced state.json from the latest persisted snapshot on early-return paths to prevent index/file split-brain.
  • Improved status rendering so failure details are visible in job output.
  • Added regression coverage for dead-PID timeout behavior and concurrent-completion race cases.

Validation

npm test
  • 68 passed, 0 failed.

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Breezy!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Add isProcessAlive(pid) helper to process.mjs
- Add markDeadPidJobFailed() to job-control.mjs with race guards
- Call dead-PID check inside waitForSingleJobSnapshot poll loop
- Surface errorMessage in pushJobDetails for failed jobs
- Add regression tests: dead-PID fast-fail and concurrent-completion race

Fixes openai#164.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2dd44a044

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

jonny981 added a commit to jonny981/codex-plugin-cc that referenced this pull request Apr 8, 2026
A Codex task can become stuck in `status: running` indefinitely with no
way to recover short of manually wiping the job-store and broker session.
The user-visible symptom is that `/codex:status` keeps reporting a job as
running for many minutes after the underlying codex CLI worker has died,
and the next `/codex:cancel` reports `ECONNREFUSED` on the broker socket.

Three independent root causes combine to produce the wedge:

1. `captureTurn` (lib/codex.mjs) awaits `state.completion` with no race
   against the client's `exitPromise`. When the underlying codex
   app-server (or the broker socket) dies mid-turn, no terminal event is
   ever delivered, so the await never resolves. Pending RPC requests are
   rejected by `handleExit` but the turn-completion promise is unrelated
   to those, so it sits forever.

2. `runTrackedJob` (lib/tracked-jobs.mjs) awaits the runner with no
   timeout fallback. When `captureTurn` hangs, the entire companion
   process hangs, and the per-job state file is never transitioned to a
   terminal status.

3. `app-server-broker.mjs` never listens for the underlying app-server
   client's exit. When the codex CLI process dies, the broker keeps
   running with a dead client, accepting new connections but unable to
   serve them. The next companion sees a stale broker.json that points
   at a half-dead broker.

4. Status reads (`buildStatusSnapshot` and `buildSingleJobSnapshot`) do
   not probe whether tracked PIDs are still alive. Once a job is wedged,
   every status query reports it as running.

Fixes:

- captureTurn now races `state.completion` against `client.exitPromise`.
  If the client closes before a terminal event arrives, the turn is
  rejected with the recorded exit error and progress emits a `failed`
  notification.

- runTrackedJob races the runner against a hard timeout (default 30
  minutes, override via `CODEX_COMPANION_JOB_TIMEOUT_MS` or the
  `timeoutMs` option). On timeout the job transitions to `failed` with
  a clear error message instead of leaving the companion stuck.

- The broker now subscribes to its appClient's `exitPromise`. When the
  underlying codex CLI exits unexpectedly the broker logs the reason,
  fans out a `notifications/broker/shuttingDown` event to any connected
  socket, tears down the unix socket and pid file, then exits with
  status 1. The next companion will detect a dead endpoint via
  `ensureBrokerSession` and respawn cleanly.

- A new `isProcessAlive` helper in lib/process.mjs uses `kill(pid, 0)`
  to probe pid liveness without affecting the process.

- A new `markDeadPidJobFailed` and `reconcileIfDead` helper in
  lib/job-control.mjs reconcile any active job whose tracked PID is
  gone. Reconciliation now runs on every status read path, not only the
  `--wait` polling loop, so a single `/codex:status` call surfaces dead
  workers immediately.

Closes openai#183 (runTrackedJob/captureTurn hang in finalizing).
Refs openai#176, openai#164 (dead PID detection on status reads).

Tests:

- tests/process.test.mjs: 3 new cases for `isProcessAlive`.
- tests/dead-pid-reconcile.test.mjs: 6 new cases covering
  `markDeadPidJobFailed` guards (active state, PID match) and the
  reconcile-on-status behaviour for both single-job and full snapshots.
- tests/tracked-jobs-timeout.test.mjs: 2 new cases proving the hard
  timeout transitions a hung job to `failed` and that successful
  runners still complete normally.

All 11 new tests pass. The 5 failing tests under the existing suite
(`collectReviewContext`, three `runtime.test.mjs` snapshot tests, and
`resolveStateDir`) are pre-existing on `main` and unrelated to this
patch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detect dead tracked processes throughout status wait polling, harden reconciliation race guards with PID normalization, and surface clear timeout/failure messaging. Add regression coverage for immediate timeout reconciliation and string/number PID matching.

Made-with: Cursor
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b155dab283

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Repair runtime.test parse error, ensure status wait consistently reconciles dead PIDs across polling phases, and restore persisted-state preference for stale running index entries without mutating read paths broadly. Also remove duplicate failed-job error rendering and add reconciliation error breadcrumbs.

Made-with: Cursor
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6132e48506

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c0d7cdac9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0b9379efa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Author

Abdooo2235 commented Apr 9, 2026

Update: metadata consistency follow-up addressed and pushed to this PR branch.

Summary of changes

  • Kept single-job status reconciliation writeback conditional using index diff checks in plugins/codex/scripts/lib/job-control.mjs so read-only status calls do not rewrite updatedAt or reorder recency.
  • Added fallback handling for malformed per-job JSON during status polling.
  • Added dead-PID metadata preservation assertions in tests/runtime.test.mjs so index fields remain intact, including jobClass and summary.

Validation

  • node --test tests/runtime.test.mjs
  • npm test
  • Result: all passing in this environment.

Branch status

  • Originally prepared as local commit e06618b.
  • Rebased and published to this PR branch as head commit 52e4f07.
  • PR head is now up to date with these changes.

@Abdooo2235
Copy link
Copy Markdown
Author

Abdooo2235 commented Apr 9, 2026

Updated PR branch with the latest fix.

Commit

  • 10f8528 fix: improve dead-PID reconciliation by returning a built job snapshot

Validation

  • node --test tests/runtime.test.mjs
  • npm test
  • Result: all passing in this environment

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10f8528fb1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +292 to 295
function normalizeTrackedPid(pid) {
const numeric = Number(pid);
return Number.isFinite(numeric) && numeric > 0 ? numeric : null;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reintroduce session-scoping task helper functions

This hunk replaces the block that previously defined getCurrentClaudeSessionId, filterJobsForCurrentClaudeSession, and findLatestResumableTaskJob, but call sites for those names still exist in resolveLatestTrackedTaskThread and handleTaskResumeCandidate. In this commit, invoking task-resume-candidate (for example node plugins/codex/scripts/codex-companion.mjs task-resume-candidate --json) now fails with ReferenceError: getCurrentClaudeSessionId is not defined, and resume-last task flows crash for the same reason. Please restore these helpers (or update the callers to a replacement API) so resume commands remain usable.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Review jobs stuck in 'running' after process death — no dead-PID detection

1 participant