Skip to content

fix(core): retry TASK_PROCESS_SIGSEGV under the user's retry policy#3552

Merged
matt-aitken merged 1 commit into
mainfrom
feature/tri-9234-retry-task_process_sigsegv-errors-respecting-user-retry
May 12, 2026
Merged

fix(core): retry TASK_PROCESS_SIGSEGV under the user's retry policy#3552
matt-aitken merged 1 commit into
mainfrom
feature/tri-9234-retry-task_process_sigsegv-errors-respecting-user-retry

Conversation

@matt-aitken
Copy link
Copy Markdown
Member

@matt-aitken matt-aitken commented May 11, 2026

Closes TRI-9234

What this changes

SIGSEGV crashes (TASK_PROCESS_SIGSEGV) will now be retried when an attempt fails, in line with the task's configured retry settings (retry.maxAttempts etc.) — the same path SIGTERM and uncaught exceptions already use. Previously SIGSEGV was hard-classified as non-retriable and failed the run on the first segfault, ignoring the user's retry policy.

Tasks without a retry policy still fail fast on the first SIGSEGV. Behaviour is unchanged for OOM kills (separate machine-bump retry path) and SIGKILL_TIMEOUT.

Deploy

Only the webapp needs to ship. The retry decision lives entirely in the webapp:

  • V2 path: internal-packages/run-engine (bundled into the webapp)
  • V1 path: apps/webapp/app/v3/services/completeAttempt.server.ts

No supervisor, CLI, SDK, or customer-task-image changes required. Customers do not need to redeploy. The @trigger.dev/core changeset is just keeping the public package in sync — the published npm version isn't what makes the fix work.

Why retry

SIGSEGV in Node tasks is frequently non-deterministic across processes:

  • Native addon races (sharp, canvas, better-sqlite3, node-rdkafka, bcrypt, …) — libuv thread-pool work stepping on V8 handles. Different heap layout / thread schedule on a fresh process → retry often succeeds.
  • JIT / GC interaction — V8 turbofan deopt or GC during a native callback. Timing-dependent.
  • Near-OOM in native code — when RSS approaches the cgroup limit, native allocations fail and poorly-written addons dereference NULL → SIGSEGV instead of clean OOM-kill.
  • Host / hardware issues — bit flips, kernel quirks. Retry lands on a different host.

The genuinely deterministic case (a user-code bug always tripping the same addon) is real, but a subset — and maxAttempts bounds the damage.

Pre-existing inconsistency this resolves

  • shouldRetryError returned false for TASK_PROCESS_SIGSEGVfail_run.
  • shouldLookupRetrySettings already listed TASK_PROCESS_SIGSEGV as retry-config-aware — but that branch was unreachable because shouldRetryError short-circuited first in retrying.ts:86-90.
  • We already retry TASK_RUN_UNCAUGHT_EXCEPTION (clearly a user-code bug) under the user's retry policy; refusing to retry SIGSEGV was the odd one out.

Test plan

  • pnpm exec vitest run test/errors.test.ts in packages/core — 26/26 pass (4 new)
  • pnpm run build --filter @trigger.dev/core
  • CI green on PR

🤖 Generated with Claude Code

SIGSEGV was hard-classified as non-retriable in shouldRetryError on the
assumption that it's always a deterministic native crash. For Node
tasks that's not reliably true — many production SIGSEGVs are flaky:

- Native addon races (sharp, canvas, better-sqlite3, node-rdkafka,
  bcrypt, etc.) — libuv thread-pool work stepping on V8 handles.
  Different heap layout / thread schedule on a fresh process,
  retry often succeeds.
- JIT / GC interaction — V8 turbofan deopt or GC during a native
  callback. Timing-dependent.
- Near-OOM in native code — when RSS approaches the cgroup limit,
  native allocations fail and poorly-written addons dereference
  NULL → SIGSEGV instead of a clean OOM-kill. A fresh process
  with cleaner memory often succeeds.
- Host / hardware issues — bit flips, kernel quirks. Retry lands
  on a different host.

The codebase was already inconsistent here: shouldLookupRetrySettings
listed SIGSEGV as retry-config-aware, but the shouldRetryError gate
short-circuited fail_run before that branch could be reached. And we
already retry TASK_RUN_UNCAUGHT_EXCEPTION — clearly a user-code bug —
under the user's retry policy, so refusing to retry SIGSEGV was the
odd one out.

Flip TASK_PROCESS_SIGSEGV from the false branch to the true branch in
shouldRetryError. The existing retrying.ts pipeline then gates the
retry on lockedRetryConfig + maxAttempts — same path SIGTERM and
uncaught-exception already use. No new code paths; tasks without a
retry policy still fail fast.

Tests added in packages/core/test/errors.test.ts lock down the new
classification alongside SIGTERM, SIGKILL_TIMEOUT, and the OOM codes
(still non-retriable here because OOM has its own machine-bump retry
path in retrying.ts that runs before shouldRetryError).

Closes TRI-9234.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 11, 2026

🦋 Changeset detected

Latest commit: cdfe334

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages
Name Type
@trigger.dev/core Patch
@trigger.dev/build Patch
trigger.dev Patch
@trigger.dev/python Patch
@trigger.dev/redis-worker Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@internal/sdk-compat-tests Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

Review Change Stack

Walkthrough

This PR makes segmentation fault (SIGSEGV) errors retryable when a task has a retry policy configured. The shouldRetryError function in errors.ts is updated to classify TASK_PROCESS_SIGSEGV as a retryable error code, matching the existing behavior for TASK_PROCESS_SIGTERM. Test coverage verifies that SIGSEGV and SIGTERM are retriable while SIGKILL, OOM, and related errors remain non-retriable. A changeset documents the patch-level change to @trigger.dev/core.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main change: reclassifying TASK_PROCESS_SIGSEGV as retriable under the user's retry policy, which is the primary modification across the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description is comprehensive and well-structured, covering the change rationale, deployment scope, technical justification, and test results.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/tri-9234-retry-task_process_sigsegv-errors-respecting-user-retry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

@matt-aitken matt-aitken merged commit 8e675a4 into main May 12, 2026
43 checks passed
@matt-aitken matt-aitken deleted the feature/tri-9234-retry-task_process_sigsegv-errors-respecting-user-retry branch May 12, 2026 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants