fix(prefork): kill children that exceed per-job timeout (#81) #87
Merged
Summary
- `@queue.task(timeout=…)`. The scheduler's stale-job reaper recorded the failure in the DB but the live process kept running, permanently losing that worker slot.
- `Option<ActiveJob>` slots. On deadline expiry it `SIGKILL`s and reaps the child, decrements `in_flight`, and emits `JobResult::Failure { timed_out: true }` with the same shape the reaper produces, so `on_timeout` middleware and `JOB_TIMEOUT` events fire identically to the thread-pool path. The slot acts as the single ownership token, so reader and watchdog can never double-complete the same job. The dispatcher's existing dead-child respawn loop brings the killed slot back on the next dispatch.
- `ParentMessage.timeout_ms` and `ChildMessage.Failure.timed_out` already existed.

Closes #81.
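The "slot as single ownership token" idea can be illustrated with a minimal Python sketch. This is not the code in this PR (the pool itself is Rust); `Slot`, `ActiveJob`, `take_if_expired`, and `demo` are hypothetical names chosen for the illustration. The assumption is that both the reader and the watchdog must atomically take the job out of the slot before reporting a result, so whichever gets there first completes the job and the other sees an empty slot.

```python
import threading
import time
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ActiveJob:
    job_id: str
    deadline: Optional[float]  # time.monotonic() value, or None for "no timeout"


class Slot:
    """Holds at most one in-flight job; whoever takes it owns its completion."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._job: Optional[ActiveJob] = None

    def put(self, job: ActiveJob) -> None:
        with self._lock:
            self._job = job

    def take(self) -> Optional[ActiveJob]:
        # Reader path: atomically empty the slot; only the first caller gets the job.
        with self._lock:
            job, self._job = self._job, None
            return job

    def take_if_expired(self, now: float) -> Optional[ActiveJob]:
        # Watchdog path: take the job only if it has a deadline that has passed.
        with self._lock:
            job = self._job
            if job is None or job.deadline is None or now < job.deadline:
                return None
            self._job = None
            return job


def demo() -> List[Tuple[str, str]]:
    """Race a reader against a watchdog on one already-expired job."""
    results: List[Tuple[str, str]] = []
    slot = Slot()
    slot.put(ActiveJob("job-1", deadline=time.monotonic() - 1.0))

    def watchdog() -> None:
        job = slot.take_if_expired(time.monotonic())
        if job is not None:
            results.append((job.job_id, "timed_out"))

    def reader() -> None:
        job = slot.take()
        if job is not None:
            results.append((job.job_id, "completed"))

    threads = [threading.Thread(target=watchdog), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(demo())
```

Which side wins the race is not deterministic, but exactly one result is recorded per job, which is the property the summary relies on.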
Test plan
- `cargo check/clippy --workspace` clean for default, `--features postgres`, and `--features redis`
- `cargo test --workspace`: all 78 tests pass
- `uv run python -m pytest tests/python/`: 464 passed, 9 skipped (incl. the pre-existing prefork basic-execution skip)
- `uv run ruff check py_src/ tests/` and `uv run mypy py_src/taskito/` clean
- `tests/python/test_prefork.py`:
  - `test_prefork_kills_hung_task`: `while True: pass` task with `timeout=2` is killed within 12 s and fires `on_timeout`
  - `test_prefork_no_timeout_unaffected`: `timeout=0` task runs to completion (watchdog must not kill it)
  - `test_prefork_finishes_before_deadline`: task completing before its deadline returns normally
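For readers who want to see the three scenarios in isolation, here is a rough stand-alone sketch using only the Python standard library. It does not use taskito's API or the real test fixtures; `run_child` is a hypothetical helper, and treating `timeout_s == 0` as "no deadline" is an assumption taken from the `timeout=0` case above. It polls a child interpreter, kills and reaps it once the deadline passes, and otherwise lets it finish.

```python
import subprocess
import sys
import time
from typing import Optional, Tuple


def run_child(code: str, timeout_s: float, poll_s: float = 0.05) -> Tuple[str, Optional[int]]:
    """Run `code` in a child interpreter and kill it if it outlives timeout_s.

    timeout_s == 0 is treated as "no deadline" (assumption for this sketch).
    """
    proc = subprocess.Popen([sys.executable, "-c", code])
    deadline = time.monotonic() + timeout_s if timeout_s > 0 else None
    while True:
        rc = proc.poll()
        if rc is not None:
            return ("finished", rc)
        if deadline is not None and time.monotonic() >= deadline:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()  # reap the child so no zombie is left behind
            return ("timed_out", proc.returncode)
        time.sleep(poll_s)


if __name__ == "__main__":
    # Hung task: busy loop that never returns, killed after the deadline.
    print(run_child("while True: pass", timeout_s=0.3))
    # No timeout configured: the task must be allowed to run to completion.
    print(run_child("import time; time.sleep(0.2)", timeout_s=0))
    # Task that finishes well before its deadline returns normally.
    print(run_child("pass", timeout_s=5))
```

The explicit `wait()` after `kill()` mirrors the "kill and reap" step in the summary: without it the killed child would linger as a zombie until the parent collects its exit status.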