fix(webapp): auto-recover replication services after stream errors #3613

Open
ericallam wants to merge 2 commits into main from fix/replication-auto-recover-on-stream-error

@ericallam
Member

Summary

When the logical-replication stream errored (most commonly after a Postgres failover), the runs and sessions replication services logged the error and left the underlying client stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind.

Fix

Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own.

  • reconnect (default): re-subscribe with exponential backoff (1s → 60s cap, unlimited attempts). LogicalReplicationClient.subscribe(lastLsn) re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN.
  • exit: call process.exit(1) after a short flush window so a host supervisor (Docker restart=always, systemd, k8s) can replace the process.
  • log: preserves the old behaviour.
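
As an illustration of the reconnect schedule described above, here is a minimal sketch of a capped exponential backoff. The helper name `nextReconnectDelay`, the `BackoffOptions` type, and the doubling factor are assumptions made for this example; the description only states the 1s starting delay, the 60s cap, and that the attempt limit defaults to unlimited.

```typescript
// Hypothetical helper, not taken from the PR's source.
type BackoffOptions = {
  initialDelayMs: number; // first delay, 1s by default per the description
  maxDelayMs: number; // cap, 60s by default per the description
  maxAttempts: number; // 0 means unlimited
};

// Returns the delay before the given 1-based attempt, or undefined when the
// attempt budget is exhausted. Assumes the delay doubles on each attempt.
function nextReconnectDelay(attempt: number, opts: BackoffOptions): number | undefined {
  if (opts.maxAttempts > 0 && attempt > opts.maxAttempts) return undefined;
  // Clamp the exponent so very long outages cannot overflow to Infinity.
  const exponent = Math.min(attempt - 1, 30);
  return Math.min(opts.initialDelayMs * 2 ** exponent, opts.maxDelayMs);
}

const defaults: BackoffOptions = { initialDelayMs: 1000, maxDelayMs: 60000, maxAttempts: 0 };
const schedule = [1, 2, 3, 4, 5, 6, 7, 8].map((attempt) => nextReconnectDelay(attempt, defaults));
console.log(schedule.join(", "));
```

Under the doubling assumption the delay reaches the 60s cap on the seventh attempt and stays there, so a long outage settles into one reconnect attempt per minute.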

Per-service strategy + exit knobs are env-driven (RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY + *_EXIT_DELAY_MS, *_EXIT_CODE). Reconnect tuning is shared across both services (REPLICATION_RECONNECT_INITIAL_DELAY_MS, _MAX_DELAY_MS, _MAX_ATTEMPTS; MAX_ATTEMPTS=0 means unlimited).
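
To make the variable layout concrete, the following sketch resolves a per-service configuration from an environment map. It is a dependency-free illustration only: the review below indicates the PR declares these values with zod in env.server.ts, and the function names, the `RecoveryConfig` type, and the 5000 ms exit delay default are assumptions. The non-negative check mirrors the constraint suggested in the review.

```typescript
type ReplicationErrorRecoveryStrategy = "reconnect" | "exit" | "log";

// Hypothetical result shape for this example.
type RecoveryConfig = {
  strategy: ReplicationErrorRecoveryStrategy;
  exitDelayMs: number;
  exitCode: number;
  reconnectInitialDelayMs: number;
  reconnectMaxDelayMs: number;
  reconnectMaxAttempts: number; // 0 = unlimited
};

type Env = Record<string, string | undefined>;

const STRATEGIES: readonly ReplicationErrorRecoveryStrategy[] = ["reconnect", "exit", "log"];

// Parses an optional non-negative integer, falling back when the key is unset.
function nonNegativeInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function readRecoveryConfig(env: Env, service: "RUN" | "SESSION"): RecoveryConfig {
  const prefix = `${service}_REPLICATION`;
  const rawStrategy = env[`${prefix}_ERROR_STRATEGY`] ?? "reconnect";
  const strategy = STRATEGIES.find((candidate) => candidate === rawStrategy);
  if (strategy === undefined) {
    throw new Error(`${prefix}_ERROR_STRATEGY must be one of ${STRATEGIES.join(", ")}`);
  }
  return {
    strategy,
    // Per-service exit knobs. The 5000 ms default is an assumption.
    exitDelayMs: nonNegativeInt(env, `${prefix}_EXIT_DELAY_MS`, 5000),
    exitCode: nonNegativeInt(env, `${prefix}_EXIT_CODE`, 1),
    // Reconnect tuning is shared across both services.
    reconnectInitialDelayMs: nonNegativeInt(env, "REPLICATION_RECONNECT_INITIAL_DELAY_MS", 1000),
    reconnectMaxDelayMs: nonNegativeInt(env, "REPLICATION_RECONNECT_MAX_DELAY_MS", 60000),
    reconnectMaxAttempts: nonNegativeInt(env, "REPLICATION_RECONNECT_MAX_ATTEMPTS", 0),
  };
}

const runConfig = readRecoveryConfig(
  { RUN_REPLICATION_ERROR_STRATEGY: "exit", RUN_REPLICATION_EXIT_CODE: "2" },
  "RUN"
);
console.log(runConfig.strategy, runConfig.exitCode);
```

With this layout one service can use exit while the other keeps the default reconnect, and both share the same backoff settings.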

Test plan

Integration tests cover all three strategies by simulating a failover with pg_terminate_backend against the WAL sender:

  • reconnect — kill the backend, insert a new row, assert it lands in ClickHouse
  • exit — kill the backend, assert process.exit(1) is called
  • log — kill the backend, insert a new row, assert it does not land in ClickHouse
pnpm --filter webapp test --run runsReplicationService.errorRecovery
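
The simulated failover in the test plan above could look roughly like the sketch below, which terminates WAL sender backends listed in `pg_stat_replication`. The `Queryable` type and both function names are hypothetical; the actual test may select the backend differently. `pg_terminate_backend` and `pg_stat_replication` are standard Postgres, and terminating another session's backend typically requires superuser or `pg_signal_backend` privileges.

```typescript
// Hypothetical minimal client shape; a real test would pass its Postgres client.
type Queryable = {
  query: (text: string, values: unknown[]) => Promise<{ rows: Array<{ terminated: boolean }> }>;
};

type SqlQuery = { text: string; values: unknown[] };

// Builds the statement that kills WAL sender backends, optionally limited to
// one replication connection by its application_name.
function buildTerminateWalSendersQuery(applicationName?: string): SqlQuery {
  const base = "SELECT pg_terminate_backend(pid) AS terminated FROM pg_stat_replication";
  return applicationName === undefined
    ? { text: base, values: [] }
    : { text: `${base} WHERE application_name = $1`, values: [applicationName] };
}

// Runs the statement and returns how many backends were signalled.
async function terminateWalSenders(db: Queryable, applicationName?: string): Promise<number> {
  const { text, values } = buildTerminateWalSendersQuery(applicationName);
  const result = await db.query(text, values);
  return result.rows.filter((row) => row.terminated).length;
}
```

After the call, the replication client should observe a stream error, which is the trigger each strategy reacts to.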

When the underlying logical-replication client errored (e.g. after a
Postgres failover), the runs and sessions replication services logged
the error and left the stream stopped. The host process kept running,
the WAL backed up, and ClickHouse silently fell behind.

Both services now run a configurable recovery strategy on stream errors,
defaulting to in-process reconnect with exponential backoff so a fresh
self-hosted setup heals on its own:

- "reconnect" (default) re-subscribes via the existing subscribe(lastLsn)
  path with exponential backoff (1s -> 60s cap, unlimited attempts), which
  re-validates the publication, re-acquires the leader lock, and resumes
  from the last acknowledged LSN.
- "exit" calls process.exit after a short flush window so a host's
  supervisor (Docker restart=always, systemd, k8s, etc.) can replace the
  process.
- "log" preserves the historical behaviour.

Per-service strategy + exit knobs are env-driven via
RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus
matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared
across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS /
_MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).
@changeset-bot

changeset-bot Bot commented May 13, 2026

⚠️ No Changeset found

Latest commit: 6f8cc24

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: de0c4f7a-0c00-4e91-9071-562f9bfce70d

📥 Commits

Reviewing files that changed from the base of the PR and between 6bfd109 and 6f8cc24.

📒 Files selected for processing (3)
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: Analyze (javascript-typescript)

Walkthrough

This PR adds configurable error recovery for the runs and sessions replication services. When a logical replication stream fails (e.g., during a database failover), the system can now automatically reconnect with exponential backoff, exit to trigger an external supervisor restart, or remain stopped with logging. Environment variables control per-service strategy selection and tuning parameters. The implementation integrates into both services' lifecycle (on error, stream start, and shutdown) and is validated through integration tests that force actual replication stream failures.
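
The three lifecycle hooks mentioned in the walkthrough (on error, stream start, shutdown) can be pictured with the sketch below. It is an assumption-laden illustration, not the PR's implementation: the field layout of `ReplicationErrorRecoveryStrategy`, the injected `exit` function, the doubling backoff, and the names `createReplicationErrorRecovery`, `handle`, `notifyStreamStarted`, and `dispose` are all hypothetical. Only the `strategy`, `logger`, `reconnect`, and `isShuttingDown` dependencies are named in the review.

```typescript
// Hypothetical shapes: the review names these types but does not show their fields.
type ReplicationErrorRecoveryStrategy =
  | { type: "reconnect"; initialDelayMs: number; maxDelayMs: number; maxAttempts: number }
  | { type: "exit"; exitDelayMs: number; exitCode: number }
  | { type: "log" };

type Logger = { info: (message: string) => void; error: (message: string) => void };

type ReplicationErrorRecoveryDeps = {
  strategy: ReplicationErrorRecoveryStrategy;
  logger: Logger;
  reconnect: () => Promise<void>;
  isShuttingDown: () => boolean;
  // Assumption: exit is injected here so the sketch stays testable.
  exit: (code: number) => void;
};

function createReplicationErrorRecovery(deps: ReplicationErrorRecoveryDeps) {
  let attempt = 0;
  let timer: ReturnType<typeof setTimeout> | undefined;

  function handle(error: unknown): void {
    // Ignore errors during shutdown or while a recovery action is already pending.
    if (deps.isShuttingDown() || timer !== undefined) return;
    deps.logger.error(`Replication stream error: ${String(error)}`);

    const strategy = deps.strategy;
    if (strategy.type === "log") return; // old behaviour: log and stay stopped

    if (strategy.type === "exit") {
      timer = setTimeout(() => deps.exit(strategy.exitCode), strategy.exitDelayMs);
      return;
    }

    attempt += 1;
    if (strategy.maxAttempts > 0 && attempt > strategy.maxAttempts) {
      deps.logger.error(`Giving up after ${strategy.maxAttempts} reconnect attempts`);
      return;
    }
    const delayMs = Math.min(
      strategy.initialDelayMs * 2 ** Math.min(attempt - 1, 30),
      strategy.maxDelayMs
    );
    deps.logger.info(`Reconnecting in ${delayMs}ms (attempt ${attempt})`);
    timer = setTimeout(() => {
      timer = undefined;
      if (deps.isShuttingDown()) return;
      // A failed reconnect feeds back into handle() and schedules the next attempt.
      deps.reconnect().catch(handle);
    }, delayMs);
  }

  return {
    handle,
    // Call when the stream starts successfully so the backoff resets.
    notifyStreamStarted(): void {
      attempt = 0;
    },
    // Call on shutdown so no timer fires after teardown.
    dispose(): void {
      if (timer !== undefined) clearTimeout(timer);
      timer = undefined;
    },
  };
}
```

Resetting the attempt counter on stream start keeps one long outage from permanently inflating later delays, and clearing the timer on shutdown avoids a reconnect or exit firing after teardown.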

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding automatic error recovery to replication services after stream errors, which aligns with all file changes in the changeset. |
| Description check | ✅ Passed | The description comprehensively covers the problem, fix, all three recovery strategies, configuration details, and includes a test plan with specific commands. It fully addresses the PR template requirements. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
apps/webapp/test/runsReplicationService.errorRecovery.test.ts (2)

77-81: 🏗️ Heavy lift

Avoid mocking process.exit in this repo's tests.

This suite now depends on a global stub for the core behavior being tested. A subprocess-based assertion would exercise the real exit path without violating the repo's no-mocks test rule.

As per coding guidelines: Use vitest exclusively for testing and never mock anything - use testcontainers instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/webapp/test/runsReplicationService.errorRecovery.test.ts` around lines
77 - 81, The test currently stubs process.exit via vi.spyOn(process, "exit")
(exitSpy) which violates the no-mocks rule; instead, refactor the test to run
the scenario in a real subprocess and assert the actual process exit behavior:
spawn a child Node process to execute the code path exercised by
runsReplicationService.errorRecovery.test.ts, capture its exit code and
stdout/stderr, and replace the mock-based assertions with assertions against the
child process’s exit code and output; remove the vi.spyOn(process, "exit") mock
and any references to exitSpy and ensure teardown/cleanup of the spawned
process.
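
A subprocess-based assertion of the kind suggested above might look like the following sketch. It runs a snippet in a fresh Node.js process and inspects the real exit code instead of spying on `process.exit`. The helper name `runInSubprocess` and the inline script are hypothetical stand-ins; a real test would launch a script that starts the replication service with the exit strategy.

```typescript
import { spawnSync } from "node:child_process";

// Runs a JavaScript snippet in a fresh Node.js process and returns its exit code
// (null if the child was killed by a signal or hit the timeout).
function runInSubprocess(script: string, timeoutMs: number = 10000): number | null {
  const result = spawnSync(process.execPath, ["-e", script], {
    encoding: "utf8",
    timeout: timeoutMs,
  });
  return result.status;
}

// Hypothetical stand-in for the "exit" strategy: exit with code 1 after a short delay.
const exitCode = runInSubprocess("setTimeout(() => process.exit(1), 50);");
console.log(`child exited with code ${exitCode}`);
```

Because the exit happens in the child, the test worker itself is never at risk of being terminated, which also sidesteps the restore-ordering race raised in the inline comments.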

1-6: ⚡ Quick win

Colocate this test with RunsReplicationService.

Please move this suite beside the service under test instead of adding it under apps/webapp/test; that keeps discovery and ownership aligned with the rest of the repo conventions.

As per coding guidelines: Test files should live beside the files under test with the pattern MyService.ts -> MyService.test.ts.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/webapp/test/runsReplicationService.errorRecovery.test.ts` around lines 1
- 6, Move the test suite file runsReplicationService.errorRecovery.test.ts so it
lives alongside the RunsReplicationService implementation file (the
runsReplicationService server module) instead of under the central test
directory; update any import paths accordingly and ensure the test filename
follows the pattern runsReplicationService.test.ts to match the service name
(RunsReplicationService) so discovery and ownership follow the repo convention.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/webapp/app/env.server.ts`:
- Around line 1317-1318: The schema currently allows negative values for
recovery-related settings (e.g., RUN_REPLICATION_EXIT_DELAY_MS,
RUN_REPLICATION_EXIT_CODE and the other recovery delay/attempt/exit-code fields)
which can cause churn; update each z.coerce.number().int().default(...) to also
enforce a non-negative constraint (e.g., .min(0) or .nonnegative()) so invalid
negative values are rejected at parse time and keep the existing defaults and
types intact.

In `@apps/webapp/app/services/replicationErrorRecovery.server.ts`:
- Around line 30-39: Replace the interface declaration for
ReplicationErrorRecoveryDeps with a TypeScript type alias to follow repo
conventions; update the declaration that currently reads "export interface
ReplicationErrorRecoveryDeps { ... }" to "export type
ReplicationErrorRecoveryDeps = { ... }" preserving all fields (strategy:
ReplicationErrorRecoveryStrategy, logger: Logger, reconnect: () =>
Promise<void>, isShuttingDown: () => boolean) and any exported uses of
ReplicationErrorRecoveryDeps elsewhere in the file so type compatibility is
unchanged.

In `@apps/webapp/test/runsReplicationService.errorRecovery.test.ts`:
- Around line 128-130: The test currently restores process.exit
(exitSpy.mockRestore()) before awaiting service.shutdown(), which can race with
delayed recovery timers and terminate the test worker; change the order so you
first await service.shutdown() to let the service cancel timers and cleanup, and
only after shutdown completes call exitSpy.mockRestore() (keeping references to
exitSpy and service.shutdown in the same block) so process.exit is restored
after teardown finishes.

---

Nitpick comments:
In `@apps/webapp/test/runsReplicationService.errorRecovery.test.ts`:
- Around line 77-81: The test currently stubs process.exit via vi.spyOn(process,
"exit") (exitSpy) which violates the no-mocks rule; instead, refactor the test
to run the scenario in a real subprocess and assert the actual process exit
behavior: spawn a child Node process to execute the code path exercised by
runsReplicationService.errorRecovery.test.ts, capture its exit code and
stdout/stderr, and replace the mock-based assertions with assertions against the
child process’s exit code and output; remove the vi.spyOn(process, "exit") mock
and any references to exitSpy and ensure teardown/cleanup of the spawned
process.
- Around line 1-6: Move the test suite file
runsReplicationService.errorRecovery.test.ts so it lives alongside the
RunsReplicationService implementation file (the runsReplicationService server
module) instead of under the central test directory; update any import paths
accordingly and ensure the test filename follows the pattern
runsReplicationService.test.ts to match the service name
(RunsReplicationService) so discovery and ownership follow the repo convention.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1675814f-4ab6-4985-a244-9dfa633a1a3a

📥 Commits

Reviewing files that changed from the base of the PR and between 759214e and 6bfd109.

📒 Files selected for processing (8)
  • .server-changes/replication-error-recovery.md
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (14)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: Access environment variables through the env export of env.server.ts instead of directly accessing process.env
Use subpath exports from @trigger.dev/core package instead of importing from the root @trigger.dev/core path

Use named constants for sentinel/placeholder values (e.g. const UNSET_VALUE = '__unset__') instead of raw string literals scattered across comparisons

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
apps/webapp/**/*.server.ts

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

apps/webapp/**/*.server.ts: Never use request.signal for detecting client disconnects. Use getRequestAbortSignal() from app/services/httpAsyncStorage.server.ts instead, which is wired directly to Express res.on('close') and fires reliably
Access environment variables via env export from app/env.server.ts. Never use process.env directly
Always use findFirst instead of findUnique in Prisma queries. findUnique has an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance). findFirst is never batched and avoids this entire class of issues

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
{apps,internal-packages}/**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

Use pnpm run typecheck to verify changes in apps and internal packages (apps/*, internal-packages/*) instead of build, which proves almost nothing about correctness

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
{package.json,**/*.{ts,tsx,js}}

📄 CodeRabbit inference engine (CLAUDE.md)

Pin Zod to version 3.25.76 exactly across the entire monorepo - never use a different version or version range

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx,js}: Import from @trigger.dev/core using subpaths only, never the root export
Always import tasks from @trigger.dev/sdk, never from @trigger.dev/sdk/v3 or deprecated client.defineJob
Add crumbs to code using // @crumbs comments or // #region @crumbs blocks for debug tracing during development

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
**/*.{ts,tsx,js,jsx,json,md,css,scss}

📄 CodeRabbit inference engine (AGENTS.md)

Code formatting is enforced using Prettier. Run pnpm run format before committing

Files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
apps/webapp/**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Do not import env.server.ts directly or indirectly into test files; instead pass environment-dependent values through options/parameters to make code testable

For testable code, never import env.server.ts in test files. Pass configuration as options instead (e.g., realtimeClient.server.ts takes config as constructor arg, realtimeClientGlobal.server.ts creates singleton with env config)

Files:

  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
**/*.test.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx,js}: Use vitest exclusively for testing and never mock anything - use testcontainers instead
Place test files next to source files using the pattern MyService.ts -> MyService.test.ts

**/*.test.{ts,tsx,js}: Use vitest for unit testing and run tests with pnpm run test
Test files should live beside the files under test with descriptive describe and it blocks
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed

Files:

  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use testcontainers with redisTest, postgresTest, or containerTest from @internal/testcontainers for testing with Redis/PostgreSQL dependencies

Files:

  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
🧠 Learnings (8)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.

Applied to files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-05-05T09:38:02.512Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.

Applied to files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
📚 Learning: 2026-05-12T21:04:05.815Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3542
File: apps/webapp/app/components/sessions/v1/SessionStatus.tsx:1-3
Timestamp: 2026-05-12T21:04:05.815Z
Learning: In this Remix + TypeScript codebase, do not flag a server/client boundary violation when a file imports only types from a module matching `*.server`.

Specifically, it’s safe to import types using `import type { Foo } from "*.server"` or `import { type Foo } from "*.server"` because TypeScript erases type-only imports at compile time and they emit no JavaScript, so they won’t cross the Remix server/client bundle boundary.

Only raise the boundary concern for value imports (e.g., `import { Foo }` without `type`, or `import Foo`), since those produce JavaScript output.

Applied to files:

  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
📚 Learning: 2026-04-20T14:50:16.440Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:224-231
Timestamp: 2026-04-20T14:50:16.440Z
Learning: In Trigger.dev’s replication services (e.g., sessionsReplicationService.server.ts and runsReplicationService.server.ts), the “acknowledge-before-flush” behavior is intentional. The `_latestCommitEndLsn` should be updated at Postgres commit time and acknowledged on a periodic interval (via methods like `#acknowledgeLatestTransaction`) without waiting for ClickHouse batch flush to complete. Reviewers should not flag this as a durability/ordering bug; it is an established project-wide at-least-once delivery trade-off used across both runs and sessions replication services.

Applied to files:

  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
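A minimal sketch of the acknowledge-before-flush trade-off described in this learning. The `AckTracker` class, its method names, and the skip-if-unchanged check are assumptions made for illustration; only the idea (record the LSN at commit time, acknowledge on a timer, never wait for the ClickHouse flush) comes from the learning.

```typescript
// Hypothetical sketch, not the services' actual implementation.
class AckTracker {
  private latestCommitEndLsn: string | null = null;
  private lastAcknowledgedLsn: string | null = null;

  // acknowledge is whatever sends the LSN back to Postgres.
  constructor(private readonly acknowledge: (lsn: string) => void) {}

  // Called when a Postgres transaction commit is observed.
  onCommit(endLsn: string): void {
    this.latestCommitEndLsn = endLsn;
  }

  // Called on a periodic interval, independently of any batch flush.
  acknowledgeLatestTransaction(): void {
    const lsn = this.latestCommitEndLsn;
    if (lsn === null || lsn === this.lastAcknowledgedLsn) {
      return; // nothing new to acknowledge (assumed optimisation)
    }
    this.acknowledge(lsn);
    this.lastAcknowledgedLsn = lsn;
  }
}
```

Because the acknowledgement does not wait for the downstream flush, the delivery guarantee is whatever the project has accepted for this design, as the learning states.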
📚 Learning: 2026-04-20T15:08:49.959Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:204-215
Timestamp: 2026-04-20T15:08:49.959Z
Learning: For replication services in `apps/webapp/app/services/*ReplicationService.server.ts`, keep the `ConcurrentFlushScheduler` deduplication key shape consistent across the related services (e.g., sessions vs runs) by using the same `${item.event}_${item.session.id}` / `${item.event}_${item.run.id}` pattern. If the key format ever needs to change (such as keying only by session/run id), make the update in all related replication services together—never in just one—so deduplication behavior stays aligned across services.

Applied to files:

  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
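The key-shape rule above could be sketched as follows. The item types, the event names, and the `dedupeLatest` helper are assumptions for illustration; `ConcurrentFlushScheduler` itself is not reproduced here, and its real deduplication semantics may differ from "later item wins".

```typescript
// Assumed event names and item shapes, for illustration only.
type ReplicationEvent = "insert" | "update" | "delete";
type RunItem = { event: ReplicationEvent; run: { id: string } };
type SessionItem = { event: ReplicationEvent; session: { id: string } };

// Both services build keys with the same `${event}_${id}` pattern.
const runKey = (item: RunItem): string => `${item.event}_${item.run.id}`;
const sessionKey = (item: SessionItem): string => `${item.event}_${item.session.id}`;

// Hypothetical dedup step: keep one item per key, later items win.
function dedupeLatest<T>(items: T[], getKey: (item: T) => string): T[] {
  const byKey = new Map<string, T>();
  for (const item of items) {
    byKey.set(getKey(item), item);
  }
  return Array.from(byKey.values());
}
```

Keeping both key functions side by side makes it harder to change one pattern without the other.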
📚 Learning: 2026-05-07T12:25:18.271Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3531
File: apps/webapp/test/sentryTraceContext.server.test.ts:9-47
Timestamp: 2026-05-07T12:25:18.271Z
Learning: In the triggerdotdev/trigger.dev webapp test suite, it is acceptable to leave `createInMemoryTracing()` calls that register a global `NodeTracerProvider` without `afterEach`/`afterAll` teardown. Do not flag this as a test-ordering risk when the code follows the established pattern used across webapp tests (e.g., replication service/benchmark/backfiller tests). This is considered safe because `trace.getActiveSpan()` when called outside a `context.with(...)` block reads `AsyncLocalStorage.getStore()` (undefined when no `run()` scope exists), so it falls back to `ROOT_CONTEXT` with no attached span—regardless of which provider is registered.

Applied to files:

  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
🔇 Additional comments (5)
.server-changes/replication-error-recovery.md (1)

6-6: LGTM!

apps/webapp/app/services/runsReplicationService.server.ts (1)

32-36: LGTM!

Also applies to: 81-84, 130-130, 262-269, 275-275, 280-280, 301-302

apps/webapp/app/services/runsReplicationInstance.server.ts (1)

6-6: LGTM!

Also applies to: 73-80

apps/webapp/app/services/sessionsReplicationService.server.ts (1)

26-30: LGTM!

Also applies to: 73-76, 116-116, 243-250, 256-256, 261-261, 282-282

apps/webapp/app/services/sessionsReplicationInstance.server.ts (1)

6-6: LGTM!

Also applies to: 70-77

Comment thread apps/webapp/app/env.server.ts Outdated
Comment thread apps/webapp/app/services/replicationErrorRecovery.server.ts Outdated
Comment thread apps/webapp/test/runsReplicationService.errorRecovery.test.ts

@devin-ai-integration (bot) left a comment

Devin Review found 2 potential issues.

View 4 additional findings in Devin Review.


Comment thread apps/webapp/app/services/replicationErrorRecovery.server.ts
Comment thread apps/webapp/app/services/runsReplicationService.server.ts
Addresses PR review feedback:

- LogicalReplicationClient.subscribe() can throw before its internal
  "error" listener is wired up (notably when pg client.connect() fails
  mid-failover). The reconnect strategy's catch block only logged, so
  recovery silently stopped. Now also calls scheduleReconnect(err) — the
  pendingReconnect guard makes it idempotent if an error event was also
  emitted.
- Reject negative values for the new replication-recovery env vars and
  cap exit codes at 255.
- Convert the new ReplicationErrorRecovery{Deps,} interfaces to type
  aliases to match the repo's TypeScript style.
- Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn"
  reference (the wrapper-tracked resume LSN is what callers actually pass).
- Restore process.exit after service.shutdown() in the exit-strategy
  test so a delayed exit timer can't terminate the test worker.
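The first bullet of this commit message can be sketched roughly as below. Only `scheduleReconnect` and `pendingReconnect` are names taken from the message; `ReconnectScheduler`, `backoffDelay`, the injected timer, and the reset of the attempt counter on success are assumptions, and the defaults mirror the 1s initial delay and 60s cap from the PR summary.

```typescript
// Hypothetical sketch of an idempotent reconnect scheduler.
type Timer = (fn: () => void, ms: number) => unknown;

// attempt is 1-based: 1 -> initialMs, 2 -> 2 * initialMs, ..., capped at maxMs.
function backoffDelay(attempt: number, initialMs: number, maxMs: number): number {
  return Math.min(maxMs, initialMs * 2 ** (attempt - 1));
}

class ReconnectScheduler {
  private pendingReconnect = false;
  private attempt = 0;

  constructor(
    private readonly subscribe: () => Promise<void>,
    private readonly setTimer: Timer = (fn, ms) => setTimeout(fn, ms),
    private readonly initialMs = 1_000,
    private readonly maxMs = 60_000
  ) {}

  // Safe to call from both the "error" listener and a catch block:
  // the guard drops the second call while a reconnect is pending.
  scheduleReconnect(_err: unknown): void {
    if (this.pendingReconnect) return;
    this.pendingReconnect = true;
    this.attempt += 1;
    const delay = backoffDelay(this.attempt, this.initialMs, this.maxMs);
    this.setTimer(() => {
      this.pendingReconnect = false;
      this.subscribe().then(
        () => {
          this.attempt = 0; // assumed: reset backoff after a successful subscribe
        },
        (err) => this.scheduleReconnect(err) // subscribe threw: keep retrying
      );
    }, delay);
  }
}
```

Routing a rejected `subscribe()` back into `scheduleReconnect` is what keeps recovery going when the connection fails before any error listener is attached.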