
perf(webapp): cache task metadata in Redis for the trigger hotpath #3625

Merged: ericallam merged 7 commits into main from perf/trigger-task-cache on May 15, 2026
Conversation

@ericallam (Member) commented May 15, 2026

Summary

The trigger-task hotpath used to early-return without a DB query when a
caller passed both a queue override and a per-trigger TTL — the hottest
configuration on the trigger API. Adding triggerSource to the resolver
so the runs-list "Source" filter could distinguish STANDARD / SCHEDULED /
AGENT runs removed those early-returns, costing +2 DB queries per trigger
on non-locked calls and +1 on locked calls.

This change caches BackgroundWorkerTask metadata (ttl, triggerSource,
queueId, queueName) in Redis so the resolver can satisfy every caller
configuration with a single HGET on the warm path. PG fallback on miss
back-fills the cache.

Follow-up to #3542.
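The read path described above can be sketched like this. This is an illustrative model, not the real module: the function name is made up, a `Map` stands in for the Redis hash, and the compact `{ t, k, q, n }` field encoding is the one shown in the review comments below.

```typescript
// Illustrative sketch: warm path is a single HGET; cold path falls back to
// Postgres and back-fills the field. A Map stands in for the Redis hash at
// tr:task-meta:env:{envId}.
type TaskMetadataEntry = {
  slug: string;
  ttl: string | null;
  triggerSource: string;
  queueId: string | null;
  queueName: string;
};

function getTaskMetadata(
  hash: Map<string, string>, // stand-in for the Redis hash
  slug: string,
  loadFromPg: (slug: string) => TaskMetadataEntry | undefined
): TaskMetadataEntry | undefined {
  const raw = hash.get(slug); // warm path: one HGET
  if (raw !== undefined) {
    const p = JSON.parse(raw); // compact { t, k, q, n } encoding
    return { slug, ttl: p.t, triggerSource: p.k, queueId: p.q, queueName: p.n };
  }
  const entry = loadFromPg(slug); // cold path: PG fallback
  if (entry === undefined) return undefined;
  // Back-fill as a single-field upsert so concurrent back-fills
  // don't wipe each other's siblings.
  hash.set(
    slug,
    JSON.stringify({ t: entry.ttl, k: entry.triggerSource, q: entry.queueId, n: entry.queueName })
  );
  return entry;
}
```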

Design

Two key spaces:

  • task-meta:env:{envId} — the "current worker" view, refreshed at every
    deploy promotion. 24h safety TTL.
  • task-meta:by-worker:{workerId} — used for lockToVersion triggers.
    Immutable post-create. 30d sliding TTL so historical workers age out.

Cache writes use Lua scripts via defineCommand so DEL + HSET +
EXPIRE land atomically — concurrent readers never see the empty
intermediate state of a naive pipeline. Read-path back-fill uses
single-field upserts so concurrent back-fills don't wipe each other's
siblings.
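A minimal in-memory model of those replace semantics — names are illustrative, and the real write is a Lua script registered via `defineCommand`, not TypeScript. The point the model captures: the replace is observed as a single step, and an empty entry set still clears the key (which is what lets a zero-task deploy wipe stale data).

```typescript
// In-memory model of the atomic DEL + HSET + EXPIRE replace. Readers of
// `store` never observe a deleted-but-not-yet-repopulated key, unlike a
// naive pipeline.
type Hash = { fields: Map<string, string>; ttlSeconds: number };

function replaceHash(
  store: Map<string, Hash>,
  key: string,
  entries: Record<string, string>,
  ttlSeconds: number
): void {
  const fields = new Map(Object.entries(entries));
  if (fields.size === 0) {
    store.delete(key); // DEL with no HSET: zero-task deploys clear stale data
    return;
  }
  store.set(key, { fields, ttlSeconds }); // single observable step
}
```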

The cache lives behind its own TASK_META_CACHE_REDIS_* env-var prefix
that falls back to the default REDIS_* set, so operators can route the
cache to a dedicated Redis instance if they want.
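The fallback rule can be sketched with a hypothetical helper over a plain env record (the real webapp reads configuration through `env.server.ts`; the helper name is an assumption):

```typescript
// A TASK_META_CACHE_REDIS_* variable wins when set; otherwise the default
// REDIS_* value applies. `readWithFallback` is a hypothetical helper.
function readWithFallback(
  env: Record<string, string | undefined>,
  key: string, // e.g. "HOST", "PORT"
  prefix = "TASK_META_CACHE_REDIS_",
  fallbackPrefix = "REDIS_"
): string | undefined {
  return env[prefix + key] ?? env[fallbackPrefix + key];
}
```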

The service/instance file split (taskMetadataCache.server.ts for the
pure class, taskMetadataCacheInstance.server.ts for the env-wired
singleton) mirrors the existing runsReplicationService /
runsReplicationInstance pattern.

Test plan

  • pnpm run typecheck --filter webapp
  • pnpm run test ./test/engine/triggerTask.test.ts --run — 8
    existing tests untouched + 5 new tests covering warm cache, cold
    miss with back-fill, queue + ttl path, by-worker vs env keyspace,
    and the promotion cache write
  • End-to-end against a dev worker: registering writes both keyspaces
    with the expected TTLs, and redis-cli HGETALL "tr:task-meta:env:<envId>"
    returns the cached entries

Benchmark

Measured DefaultQueueManager.resolveQueueProperties against a real Postgres + Redis (vitest containerTest, single-host docker). 500 sequential calls and 2,000 parallel calls (concurrency=50) per scenario, request shaped as { taskId, queue: "bench-queue", ttl: "5m" } — the hot path this PR restores.

```
sequential (one in flight at a time):
[noop cache (baseline)]  n=500   mean=1.423ms  p50=1.394ms  p95=1.735ms  p99=2.629ms  max=11.100ms
[redis cache, cold   ]  n=500   mean=1.346ms  p50=1.283ms  p95=1.688ms  p99=2.463ms  max=5.058ms
[redis cache, warm   ]  n=500   mean=0.084ms  p50=0.078ms  p95=0.105ms  p99=0.156ms  max=1.129ms
speedup (warm vs baseline, sequential): 16.95x

parallel (concurrency=50):
[noop cache (baseline)]  n=2000  mean=10.069ms  p50=8.850ms  p95=14.718ms  p99=31.887ms  total=405ms  ops/s=4,940
[redis cache, warm   ]  n=2000  mean=0.614ms   p50=0.568ms  p95=1.189ms   p99=1.432ms   total=25ms   ops/s=80,389
throughput speedup (warm vs baseline, parallel): 16.27x
```

Read:

  • Warm cache cuts resolver latency 17× at p50 — from ~1.4 ms to ~78 µs per call.
  • Cold cache is on par with baseline — the extra HGET miss adds <50 µs against the two Postgres queries that follow, so the worst case is not worse than today.
  • Under burst load (50 concurrent triggers), the baseline's p99 jumps to ~32 ms as Postgres connections queue up; warm stays at ~1.4 ms. The cache moves the saturation point from ~5k ops/s (PG pool) to ~80k ops/s (single-client Redis pipelining).

Caveats: single-host docker, local Postgres + Redis, resolver-only measurement (excludes the rest of the trigger transaction). Prod adds region-local Redis RTT (~0.3–0.8 ms) which shifts warm absolute numbers up but keeps the ratio intact.

changeset-bot (Bot) commented May 15, 2026

⚠️ No Changeset found

Latest commit: 710df37

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


coderabbitai (Bot) commented May 15, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This pull request implements a Redis-backed task metadata cache to accelerate task trigger operations by avoiding per-request database lookups for task defaults. The cache stores task metadata (TTL, trigger source/task kind, queue identifiers) in two Redis hash keyspaces: one scoped by environment (for non-locked triggers) and one scoped by worker ID (for locked-to-version triggers). Environment configuration includes Redis connection settings and separate TTL values for each keyspace. The queue resolution layer (DefaultQueueManager) now reads cached metadata with Prisma fallback and automatic backfill on cache miss. Worker creation, deployment creation, and deployment promotion flows capture task metadata and synchronously populate the cache. Comprehensive integration tests verify warm cache behavior, cache-miss fallback and backfill, queue override + cached TTL interaction, by-worker keyspace routing for locked triggers, and deployment promotion cache synchronization.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 10.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4)

  • Title check ✅ — the PR title accurately describes the main change: adding Redis caching for task metadata to improve trigger hotpath performance.
  • Linked Issues check ✅ — skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ — skipped because no linked issues were found for this pull request.
  • Description check ✅ — the PR description is comprehensive, with clear objectives, design rationale, test coverage, and benchmark results demonstrating the performance improvement.


coderabbitai (Bot) left a review comment
🧹 Nitpick comments (1)
apps/webapp/app/services/taskMetadataCache.server.ts (1)

50-64: ⚡ Quick win

Add zod validation to the decode() function for schema-drift protection.

JSON.parse(raw) as EncodedEntry silently accepts any payload. If cached data drifts (field removed, TaskTriggerSource value retired, or malformed entry), decode() will produce a TaskMetadataEntry with undefined fields, surfacing bugs far downstream in the trigger hotpath. A zod schema hardens this and aligns with the repo's validation conventions.

♻️ Proposed refactor

```diff
-import type { Redis, Result, Callback } from "ioredis";
-import type { TaskTriggerSource } from "@trigger.dev/database";
-import { logger } from "./logger.server";
+import type { Redis, Result, Callback } from "ioredis";
+import { TaskTriggerSource } from "@trigger.dev/database";
+import { z } from "zod";
+import { logger } from "./logger.server";
@@
-type EncodedEntry = {
-  t: string | null;
-  k: TaskTriggerSource;
-  q: string | null;
-  n: string;
-};
+const EncodedEntrySchema = z.object({
+  t: z.string().nullable(),
+  k: z.nativeEnum(TaskTriggerSource),
+  q: z.string().nullable(),
+  n: z.string(),
+});
+type EncodedEntry = z.infer<typeof EncodedEntrySchema>;
@@
 function decode(slug: string, raw: string): TaskMetadataEntry | null {
   try {
-    const parsed = JSON.parse(raw) as EncodedEntry;
+    const parsed = EncodedEntrySchema.parse(JSON.parse(raw));
     return {
       slug,
       ttl: parsed.t,
       triggerSource: parsed.k,
       queueId: parsed.q,
       queueName: parsed.n,
     };
   } catch (error) {
     logger.error("Failed to decode task metadata cache entry", { slug, error });
     return null;
   }
 }
```

Per coding guideline: "Use zod for validation in packages/core and apps/webapp".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/webapp/app/services/taskMetadataCache.server.ts` around lines 50 - 64,
The decode function currently trusts JSON.parse and casts to EncodedEntry which
can produce undefined fields; add a zod schema (e.g., EncodedEntrySchema) that
validates the shape { t: number, k: TaskTriggerSource, q?: string | null, n?:
string | null } (use the existing TaskTriggerSource enum/union for k), import
zod and run safeParse on the parsed JSON inside decode(slug, raw); if validation
fails, logger.error with the parse errors and return null, otherwise map the
validated fields to TaskMetadataEntry (ttl <- t, triggerSource <- k, queueId <-
q, queueName <- n). Ensure you reference and validate the exact field names
t/k/q/n and use EncodedEntrySchema.safeParse instead of a raw type cast.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2d23dad3-c348-4cf6-8399-d667fe3f4f4b

📥 Commits

Reviewing files that changed from the base of the PR and between 3f6d8eb and 1b08acb.

📒 Files selected for processing (5)
  • apps/webapp/app/services/taskMetadataCache.server.ts
  • apps/webapp/app/v3/services/createBackgroundWorker.server.ts
  • apps/webapp/app/v3/services/createDeploymentBackgroundWorkerV3.server.ts
  • apps/webapp/app/v3/services/createDeploymentBackgroundWorkerV4.server.ts
  • apps/webapp/test/engine/triggerTask.test.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • apps/webapp/app/v3/services/createDeploymentBackgroundWorkerV4.server.ts
  • apps/webapp/app/v3/services/createBackgroundWorker.server.ts
  • apps/webapp/test/engine/triggerTask.test.ts
  • apps/webapp/app/v3/services/createDeploymentBackgroundWorkerV3.server.ts

ericallam added a commit that referenced this pull request May 15, 2026
…builds (#3626)

## Summary

`LocalsKey<T>` (the type returned by `locals.create()`) was branded with a
module-level `declare const __local: unique symbol`. Each such declaration
is its own nominal type, and `tshy` emits separate `.d.ts` files for the
ESM and CJS outputs — each gets its own `__local` symbol. Under certain
pnpm hoisting layouts a single TypeScript compilation can resolve
`LocalsKey` from both the ESM source path and the CJS dist path within
the same call site, producing two structurally-incompatible variants of
the same type. TS surfaces this as the misleading error:

```
Argument of type 'LocalsKey<X>' is not assignable to parameter of type
'LocalsKey<X>'. Property '[__local]' is missing in type 'LocalsKey<X>'
but required in type 'BrandLocal<X>'.
```

The error has been hitting CI on PRs opened since the chat.agent stack
landed (e.g. #3625 typecheck job), but doesn't reproduce on developer
machines where the pnpm node_modules layout was built up incrementally.

## Fix

Replace the `unique symbol` brand with an optional phantom field that
carries `T` at the type level:

```ts
// before
declare const __local: unique symbol;
type BrandLocal<T> = { [__local]: T };
export type LocalsKey<T> = BrandLocal<T> & {
  readonly id: string;
  readonly __type: unique symbol;
};

// after
export type LocalsKey<T> = {
  readonly id: string;
  readonly __type: symbol;
  /** Phantom carrier for the value type — never read at runtime. */
  readonly __valueType?: T;
};
```

The ESM and CJS `.d.ts` outputs now produce structurally identical
types,
so cross-output resolution no longer produces a mismatch. `T` is still
carried at the type level via the optional phantom field. The runtime
shape is unchanged — `manager.ts` was already casting via `as unknown`,
which is no longer needed.

## Test plan

- [ ] `pnpm run typecheck --filter @trigger.dev/core --filter
@trigger.dev/sdk`
- [ ] `pnpm run build --filter @trigger.dev/core --filter
@trigger.dev/sdk`
      (clean rebuild) — confirms the ESM and CJS dist `.d.ts` outputs
      no longer carry distinct `unique symbol` declarations
- [ ] `pnpm --filter @trigger.dev/core test test/mockTaskContext.test.ts
--run`
- [ ] `pnpm --filter @trigger.dev/sdk test test/mockChatAgent.test.ts
--run`
ericallam added 2 commits May 15, 2026 08:30
The trigger-task hotpath used to early-return without a DB query when a
caller passed both a queue override and a per-trigger TTL — the hottest
configuration on the trigger API. Adding triggerSource to the resolver
so the runs-list Source filter could distinguish STANDARD / SCHEDULED /
AGENT runs removed those early-returns, costing +2 DB queries per
trigger on non-locked calls and +1 on locked calls.

The resolver now reads BackgroundWorkerTask metadata (ttl, triggerSource,
queueId, queueName) from a Redis HASH keyed by env or by worker, with
PG fallback on miss that back-fills the cache.

Two key spaces:
- task-meta:env:{envId}        refreshed at every deploy promotion;
                               24h safety TTL.
- task-meta:by-worker:{workerId} used for lockToVersion triggers;
                               30d sliding TTL.

Cache writes use Lua scripts via defineCommand so DEL + HSET + EXPIRE
land atomically (concurrent readers never see the empty intermediate
state of a naive pipeline). Read-path back-fill uses single-field
upserts so concurrent back-fills don't wipe each other's siblings.

Configurable via a new TASK_META_CACHE_REDIS_* env-var prefix that
falls back to the default REDIS_* set, so operators can route the
cache to a dedicated Redis instance if they want.
- TaskMetadataCache: interface → type alias, matching the repo's
  'prefer type unless extending' standard.
- syncTaskMetadataCache calls in createBackgroundWorker,
  createDeploymentBackgroundWorkerV3, and createDeploymentBackgroundWorkerV4
  are now wrapped in tryCatch and log on failure. Matches the pattern in
  changeCurrentDeployment so a Redis blip can't strand a successful worker
  build or deploy promotion.
- Back-fill assertion in 'cache miss falls through to PG' now polls with a
  bounded timeout instead of a fixed sleep, removing the CI flake surface.
- Promotion assertion in 'ChangeCurrentDeploymentService promotes the env
  cache' now pre-clears the by-worker + env keys and asserts they're null
  before the call, so the test proves the service did the write.
@ericallam ericallam force-pushed the perf/trigger-task-cache branch from 1b08acb to 1a06ef6 Compare May 15, 2026 07:30
devin-ai-integration (Bot) left a comment
✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 6 additional findings.


The cache now exposes populateByCurrentWorker(envId, workerId, entries) and
setByCurrentWorker(envId, workerId, entry) which write both task-meta:env
and task-meta:by-worker keyspaces in a single Lua transaction. Drops the
syncTaskMetadataCache helper and its isCurrent boolean — each caller picks
the right method based on context (DEV worker create + deploy promotion go
through populateByCurrentWorker; V4 deploy build goes through
populateByWorker).

Two new Lua scripts via defineCommand:
- taskMetaReplaceTwoHashes: DEL + HSET + EXPIRE both keys atomically
- taskMetaSetTwoFields: HSET both keys + env-preserve-TTL +
  worker-refresh-TTL atomically

The read-path back-fill for non-locked triggers now also writes both
keyspaces atomically via setByCurrentWorker, so concurrent back-fills
can't leave one keyspace populated and the other empty.

Cache methods log+swallow Redis errors internally, so the outer tryCatch
wrappers at the four call sites are gone — a Redis blip can't break any
post-cache side effect.
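The env-preserve-TTL / worker-refresh-TTL behavior of `taskMetaSetTwoFields` can be modeled in a few lines. This is an in-memory stand-in, not the Lua script itself; the `FakeKey` shape and function signature are illustrative.

```typescript
// Model of the dual-keyspace single-field write: the env key keeps whatever
// TTL it already has, while the by-worker key's sliding TTL is refreshed.
type FakeKey = { fields: Map<string, string>; ttlSeconds: number };

const ENV_TTL = 24 * 60 * 60; // 24h safety TTL for task-meta:env:{envId}
const WORKER_TTL = 30 * 24 * 60 * 60; // 30d sliding TTL for task-meta:by-worker:{workerId}

function setTwoFields(
  store: Map<string, FakeKey>,
  envKey: string,
  workerKey: string,
  field: string,
  value: string
): void {
  const env = store.get(envKey) ?? { fields: new Map(), ttlSeconds: ENV_TTL };
  env.fields.set(field, value); // env TTL deliberately left untouched
  store.set(envKey, env);

  const worker = store.get(workerKey) ?? { fields: new Map(), ttlSeconds: 0 };
  worker.fields.set(field, value);
  worker.ttlSeconds = WORKER_TTL; // sliding TTL: refreshed on every write
  store.set(workerKey, worker);
}
```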

…worker, clear stale cache on zero-task deploys

Two CodeRabbit findings:

1. The locked + queue-override branch silently accepted a missing task slug
   on the locked worker version, leaving `taskKind` / `ttl` as undefined
   instead of erroring. Now matches the no-override branch and throws a
   ServiceValidationError if the task isn't on the locked worker.

2. The cache populate methods short-circuited on empty entries, leaving
   stale hashes in Redis when a worker registered or got promoted with
   zero tasks. The Lua scripts already handle the empty-entries case
   (DEL + no HSET), so just drop the gates at the cache class and at the
   four call sites so promotion correctly clears prior data.
@ericallam ericallam marked this pull request as ready for review May 15, 2026 08:23
Two CI failures on the prior push:

1. Typecheck TS2556 in taskMetadataCache.server.ts: the populate methods
   were spreading a mixed string[] into the Lua `defineCommand` rest
   parameter. Pass the fixed positional args (key, ttl) directly and
   spread only the variable field/value pairs.

2. Webapp shard 6: the cache test file pulled in
   ChangeCurrentDeploymentService, whose transitive import chain reaches
   triggerTaskV1.server.ts → autoIncrementCounter.server.ts which throws
   at module load when REDIS_HOST/REDIS_PORT aren't in the shard's env.
   Drop test 5 (deploy-promotion side effect) — it doesn't belong in
   triggerTask.test.ts anyway.

ericallam added 2 commits May 15, 2026 10:47
…writes

Stamp the env hash with an `__owner_worker_id` field at promotion time;
the back-fill Lua script reads it and CAS-skips the env write if it
doesn't match the worker the back-filler resolved to.

Scenario the guard closes:
1. Resolver reads env cache, misses
2. Resolver reads PG: findCurrentWorker -> A, BWT for A -> A's data
3. [promotion to B fires, atomically replaces env hash + by-worker:B]
4. Resolver fires back-fill with workerId=A
   -> env hash now stamped owner=B; A != B -> back-fill skips env write
   -> by-worker:A still written (harmless; key contains workerId, dead worker)

The by-worker keyspace doesn't need a CAS check — the key itself
contains the workerId, so stale writes land in a dead worker's keyspace
and are never read.

Devin Review flagged this as a sub-millisecond race bounded by the 24h
env TTL. Cost to close it is one extra HGET inside the Lua call (atomic
with the HSET, no extra round trip) and one extra HSET on the promotion
write.
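The CAS guard described above can be modeled like this — a `Map` stands in for the env hash, `__owner_worker_id` is the field name from the commit message, and the function name is illustrative (the real check runs inside the Lua back-fill script, atomically with the HSET):

```typescript
// Back-fill with the owner CAS guard: skip the env write when the hash is
// stamped with a different worker than the one the back-filler resolved to.
function backfillEnv(
  envHash: Map<string, string>, // stand-in for the env hash
  resolvedWorkerId: string,
  slug: string,
  encoded: string
): boolean {
  const owner = envHash.get("__owner_worker_id");
  if (owner !== undefined && owner !== resolvedWorkerId) {
    return false; // a promotion won the race; leave the env hash alone
  }
  envHash.set(slug, encoded);
  return true;
}
```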
Match main's pre-PR behavior — when a caller passes `lockToVersion` and a
queue override and the task slug isn't registered on that worker version,
fall through with `taskKind = undefined` (coalesced to "STANDARD"
downstream) and `taskTtl = undefined` instead of throwing.

This reverts the throw added earlier in this PR. The trade-off review
landed on "strict consistency check is better" twice (CodeRabbit
recommended it, Devin pushed back, we kept the throw) — but the net is
a behavioral change in the trigger API, and customers running
`lockToVersion` + a queue override against a slug that doesn't exist on
that version would start getting 4xx errors where they got silent
defaults before. Default-to-silent matches main and avoids the surprise.

The no-override branch keeps its throw because there's no queue to route
to in that case — the failure mode there is unrecoverable.
@ericallam ericallam merged commit 454f0c9 into main May 15, 2026
26 checks passed
@ericallam ericallam deleted the perf/trigger-task-cache branch May 15, 2026 10:52