fix(prefork): kill children that exceed per-job timeout (#81) #87

Merged
pratyush618 merged 5 commits into master from fix/prefork-timeout-watchdog on Apr 30, 2026

Conversation

@pratyush618
Collaborator

Summary

  • Hung prefork children (infinite loop, blocking syscall, deadlock) used to wedge their reader thread forever and ignore @queue.task(timeout=…). The scheduler's stale-job reaper recorded the failure in the DB but the live process kept running, permanently losing that worker slot.
  • Adds a watchdog thread that owns per-child Option<ActiveJob> slots. On deadline expiry it SIGKILLs + reaps the child, decrements in_flight, and emits JobResult::Failure { timed_out: true } with the same shape the reaper produces — so on_timeout middleware and JOB_TIMEOUT events fire identically to the thread-pool path. The slot acts as the single ownership token, so reader and watchdog can never double-complete the same job. The dispatcher's existing dead-child respawn loop brings the killed slot back on the next dispatch.
  • No protocol changes — ParentMessage.timeout_ms and ChildMessage.Failure.timed_out already existed.
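
The "single ownership token" idea in the summary can be illustrated with a small sketch. This is not the PR's Rust code: it is a Python analogue in which `Slot`, `reader_path`, and `watchdog_path` are hypothetical names, and a lock-protected optional value stands in for the per-child `Option<ActiveJob>` slot. Whichever side takes the job out of the slot first is the only one allowed to report a result.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ActiveJob:
    job_id: str
    deadline: float  # monotonic seconds; hypothetical field for this sketch


class Slot:
    """Holds at most one ActiveJob; a successful take empties the slot."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._job: Optional[ActiveJob] = None

    def put(self, job: ActiveJob) -> None:
        with self._lock:
            self._job = job

    def take(self) -> Optional[ActiveJob]:
        # Reader side: claim the job when the child reports a result.
        with self._lock:
            job, self._job = self._job, None
            return job

    def take_if_expired(self, now: float) -> Optional[ActiveJob]:
        # Watchdog side: claim the job only once its deadline has passed.
        with self._lock:
            if self._job is not None and now >= self._job.deadline:
                job, self._job = self._job, None
                return job
            return None


def reader_path(slot: Slot, out: List[Tuple[str, str]]) -> None:
    job = slot.take()
    if job is not None:
        out.append((job.job_id, "completed"))


def watchdog_path(slot: Slot, now: float, out: List[Tuple[str, str]]) -> None:
    job = slot.take_if_expired(now)
    if job is not None:
        # A real watchdog would SIGKILL and reap the child at this point.
        out.append((job.job_id, "timed_out"))


def race_once() -> List[Tuple[str, str]]:
    """Let reader and watchdog race for one job that is already past deadline."""
    slot = Slot()
    slot.put(ActiveJob("job-1", deadline=0.0))
    out: List[Tuple[str, str]] = []
    threads = [
        threading.Thread(target=reader_path, args=(slot, out)),
        threading.Thread(target=watchdog_path, args=(slot, 1.0, out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


completions = race_once()
```

Either thread may win the race, but `completions` always holds exactly one entry for `job-1`, which is the property the summary describes as "can never double-complete".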

Closes #81.

Test plan

  • cargo check/clippy --workspace clean for default, --features postgres, and --features redis
  • cargo test --workspace — all 78 tests pass
  • uv run python -m pytest tests/python/ — 464 passed, 9 skipped (incl. the pre-existing prefork basic-execution skip)
  • uv run ruff check py_src/ tests/ and uv run mypy py_src/taskito/ clean
  • New regression tests in tests/python/test_prefork.py:
    • test_prefork_kills_hung_task: a `while True: pass` task with timeout=2 is killed within 12 s and fires on_timeout
    • test_prefork_no_timeout_unaffected: a timeout=0 task runs to completion (watchdog must not kill it)
    • test_prefork_finishes_before_deadline: a task completing before its deadline returns normally
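
The behaviour these three tests check can be reproduced outside taskito with the standard library alone. The sketch below is an assumption-laden stand-in, not the project's test code: `run_with_deadline` is a hypothetical helper, it blocks in `Popen.wait` instead of using a separate watchdog thread, and it treats a timeout of 0 as "no deadline" to mirror the timeout=0 case above.

```python
import subprocess
import sys
import time


def run_with_deadline(source: str, timeout_s: float) -> dict:
    """Run `source` in a child interpreter; kill the child if the deadline passes.

    timeout_s == 0 means "no deadline" (assumption mirroring the timeout=0 test).
    """
    child = subprocess.Popen([sys.executable, "-c", source])
    start = time.monotonic()
    try:
        child.wait(timeout=timeout_s if timeout_s > 0 else None)
        timed_out = False
    except subprocess.TimeoutExpired:
        child.kill()  # sends SIGKILL on POSIX
        child.wait()  # reap the child so no zombie is left behind
        timed_out = True
    return {
        "timed_out": timed_out,
        "returncode": child.returncode,
        "elapsed": time.monotonic() - start,
    }


# Hung task: an infinite loop is killed once its deadline expires.
hung = run_with_deadline("while True: pass", 1.0)

# No timeout: the task is left alone and runs to completion.
unlimited = run_with_deadline("import time; time.sleep(0.2)", 0)

# Finishes before deadline: returns normally.
quick = run_with_deadline("pass", 10.0)
```

Here `hung["timed_out"]` is true while the other two runs report `timed_out` as false with a zero return code.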

@pratyush618 pratyush618 merged commit b52e1ce into master Apr 30, 2026
19 checks passed
@pratyush618 pratyush618 deleted the fix/prefork-timeout-watchdog branch May 2, 2026 05:09

Development

Successfully merging this pull request may close these issues.

Prefork: hung task blocks reader thread, no per-job timeout enforcement
