
feat(mollifier): trigger burst smoothing — Phase 1 (monitoring)#3614

Open
d-cs wants to merge 41 commits into main from mollifier-phase-2

Conversation

Collaborator

@d-cs d-cs commented May 13, 2026

Summary

  • Introduce the Mollifier: a Redis-backed buffer for trigger() API calls during traffic spikes, with a per-env trip evaluator and a drainer ack-loop.
  • Phase 1 is dual-write monitoring — every mollified trigger is buffered to Redis AND continues to engine.trigger. No customer-facing behaviour change.
  • Telemetry events: mollifier.would_mollify, mollifier.buffered, mollifier.drained, plus the mollifier.decisions counter.
  • Gated behind a feature flag (default off).

Test plan

  • pnpm run test --filter @trigger.dev/redis-worker
  • pnpm run test --filter webapp -- mollifier
  • Manual: with flag off, no behaviour change vs main
  • Manual: with flag on + threshold lowered, observe mollifier.buffered + mollifier.drained log pairs with matching runId

@d-cs d-cs self-assigned this May 13, 2026

changeset-bot Bot commented May 13, 2026

🦋 Changeset detected

Latest commit: 6487461

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 32 packages
Name Type
@trigger.dev/redis-worker Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@trigger.dev/build Patch
@trigger.dev/core Patch
@trigger.dev/plugins Patch
@trigger.dev/python Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch
@trigger.dev/rbac Patch
trigger.dev Patch
references-ai-chat Patch
d3-chat Patch
references-d3-openai-agents Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@internal/redis Patch
@internal/replication Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
@internal/sdk-compat-tests Patch
references-telemetry Patch


Contributor

coderabbitai Bot commented May 13, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR implements the first two phases of a trigger burst-smoothing system ("mollifier"). It adds a Redis-backed MollifierBuffer and MollifierDrainer, Zod schemas and payload (de)serialization, a real trip evaluator and mollifier gate wired into RunEngineTriggerTaskService, OpenTelemetry metrics, worker startup drainer wiring, environment configuration and a feature flag, and package re-exports, plus comprehensive tests (unit, integration, and fuzz) validating buffer, drainer, gate, and evaluator behavior. Fail-open semantics are preserved throughout, and full activation is deferred to later phases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 9.52%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check: ⚠️ Warning. The PR description is incomplete and does not follow the required template structure; required sections and structured information are missing. Resolution: add the missing template sections (issue reference, testing details, and screenshots if applicable) so the structure matches the provided template with all checkboxes and sections.

✅ Passed checks (3 passed)

  • Title check: ✅ Passed. The title clearly and specifically summarizes the main change: introducing the Mollifier feature for trigger burst smoothing, with Phase 1 focusing on monitoring/dual-write behavior.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

Collaborator Author

d-cs commented May 14, 2026

Code review

Found 2 issues:

  1. resolveOrgFlag resolves the global feature flag, not a per-org flag. flag() is called without overrides, so it reads only the global FeatureFlag row — other call sites in the webapp (e.g. canAccessAi/canAccessQuery) explicitly pass the org's featureFlags JSON as overrides. With the current code, the variable named orgFlagEnabled and the call-site comment "if the org has the mollifier feature flag enabled" don't match behaviour: the moment the global flag flips on, every org is mollified — there's no per-org rollout.

isMollifierEnabled: () => env.MOLLIFIER_ENABLED === "1",
isShadowModeOn: () => env.MOLLIFIER_SHADOW_MODE === "1",
resolveOrgFlag: () =>
flag({ key: FEATURE_FLAG.mollifierEnabled, defaultValue: false }),
evaluator: defaultEvaluator,
logShadow: (inputs, decision) =>
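
A minimal sketch of the per-org shape described above, assuming Organization.featureFlags is a JSON object keyed by flag name (the helper name and types below are illustrative, not the webapp's actual API):

```ts
// Illustrative only: resolve the org's own override before (or instead of)
// consulting the global FeatureFlag row, so rollout stays per-org.
type OrgFeatureFlags = Record<string, unknown> | null | undefined;

function resolveOrgMollifierFlag(orgFeatureFlags: OrgFeatureFlags): boolean {
  const override = orgFeatureFlags?.["mollifierEnabled"];
  if (typeof override === "boolean") {
    return override; // per-org opt-in/opt-out wins
  }
  return false; // no override: the org is not enrolled in the rollout
}
```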

  2. evaluateGate is awaited outside the try block in the trigger hot path. evaluateGate itself has no internal try/catch around resolveOrgFlag() (a DB feature-flag read) or evaluator() (a Redis Lua call), so any transient failure once MOLLIFIER_ENABLED=1 is flipped will throw past the call site and fail the customer's trigger — turning a monitoring-only Phase 1 into a fleet-wide trigger outage on a Redis/DB blip. Suggest wrapping the gate call in a try/catch that falls back to pass_through.

};
const mollifierOutcome = await this.evaluateGate({
envId: environment.id,
orgId: environment.organizationId,
taskId,
});
try {
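
One possible shape for that fail-open wrapper, as a sketch only with illustrative types; the real outcome union and logger wiring may differ:

```ts
type GateDecision = "mollify" | "shadow_log" | "pass_through";

// Sketch: any error from the gate (feature-flag read, Redis Lua call) degrades
// to pass_through instead of failing the customer's trigger.
async function evaluateGateSafe(
  evaluateGate: () => Promise<{ decision: GateDecision }>,
  logError: (message: string, error: unknown) => void
): Promise<{ decision: GateDecision }> {
  try {
    return await evaluateGate();
  } catch (error) {
    logError("mollifier gate failed, falling back to pass_through", error);
    return { decision: "pass_through" };
  }
}
```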

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Collaborator Author

d-cs commented May 14, 2026

Nitpicks (test-mock removal) also addressed in 98c1520:

  • apps/webapp/test/mollifierGate.test.ts: dropped vi.fn. makeDeps now returns real closure-based dependencies with plain counter/array spies; the 16-row truth table and decision-log shape tests are unchanged.
  • apps/webapp/test/mollifierTripEvaluator.test.ts: removed the vi.fn/cast fakeBuffer doubles. Each test now uses redisTest from @internal/testcontainers to spin up Redis and exercise a real MollifierBuffer. The trip case uses threshold: 2 + three real evaluateTrip calls to deterministically trip via the production Lua window; the fail-open case closes the buffer up front so the next evaluateTrip hits "Connection is closed" — a real failure mode rather than a stub.

coderabbitai[bot]

This comment was marked as resolved.

Collaborator Author

d-cs commented May 14, 2026

Code review (follow-up)

Four additional issues from the earlier scan, posting now since the gate-naming fix landed:

  1. Webapp CLAUDE.md says "Always use findFirst instead of findUnique" — new test introduces a findUnique. Trivial swap.

expect(result).toBeDefined();
expect(result?.run.friendlyId).toBeDefined();
const pgRun = await prisma.taskRun.findUnique({ where: { id: result!.run.id } });
expect(pgRun).not.toBeNull();
expect(pgRun!.friendlyId).toBe(result!.run.friendlyId);

  2. Drainer is started inside init() unconditionally — workerQueue.initialize() two lines above is gated on WORKER_ENABLED === "true", but getMollifierDrainer() is not. Once MOLLIFIER_ENABLED=1 flips on, every webapp replica (including API-only ones) will spin up a polling loop + Redis connection and race for the same buffer. Consider gating on WORKER_ENABLED (or a dedicated MOLLIFIER_DRAINER_ENABLED).

if (env.WORKER_ENABLED === "true") {
  await workerQueue.initialize();
}
try {
  const drainer = getMollifierDrainer();
  if (drainer) {
    // The drainer owns a polling loop and a Redis client; let it drain
    // in-flight pops on shutdown rather than tearing the process down
    // mid-handler. Idempotent — `drainer.stop()` short-circuits if already
    // stopped, so registering on both signals is safe.
    const stopDrainer = () => {
      drainer.stop().catch((error) => {
        logger.error("Failed to stop mollifier drainer", { error });
      });
    };
    process.once("SIGTERM", stopDrainer);
    process.once("SIGINT", stopDrainer);
  }
} catch (error) {
  logger.error("Failed to initialise mollifier drainer", { error });
}
}
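
A sketch of the suggested gating (MOLLIFIER_DRAINER_ENABLED is a hypothetical dedicated switch, not an existing env var):

```ts
// Sketch only: run the drainer on worker replicas, mirroring the
// workerQueue.initialize() gate above.
const shouldRunDrainer =
  process.env.WORKER_ENABLED === "true" ||
  process.env.MOLLIFIER_DRAINER_ENABLED === "true";

if (shouldRunDrainer) {
  // ... getMollifierDrainer() + SIGTERM/SIGINT wiring from the excerpt above
}
```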

  3. includeTtl: true was added to the delayed and pending-version re-enqueue paths, but the producer-side comment in enqueueSystem.ts still explicitly warns these callers off: "Re-enqueues (waitpoint, checkpoint, delayed, pending version) must not add TTL." Either the comment needs updating to reflect the new design, or the new callers are wrong.

const timestamp = (run.queueTimestamp ?? run.createdAt).getTime() - run.priorityMs;
// Include TTL only when explicitly requested (first enqueue from trigger).
// Re-enqueues (waitpoint, checkpoint, delayed, pending version) must not add TTL.
let ttlExpiresAt: number | undefined;
if (includeTtl && run.ttl) {
const expireAt = parseNaturalLanguageDuration(run.ttl);

batchId: run.batchId ?? undefined,
skipRunLock: true,
includeTtl: true,
});

// for a worker version). Arm TTL here so the TTL system can expire it
// if it sits queued waiting on a concurrency slot.
includeTtl: true,
});
});

  4. The comment calls count a "sliding-window counter", but the Lua is INCR + PEXPIRE-on-first — a fixed window. At a window boundary this can briefly admit ~2x threshold before tripping. Additionally PSETEX trippedKey, holdMs rearms the hold TTL on every overage call, so under sustained overload the hold never elapses while traffic stays above threshold. Both may be intentional for Phase 1, but the docstring should match the implementation.

} from "./mollifierTelemetry.server";
// `count` is the *single-instance* sliding-window counter, not a fleet-wide
// aggregate. Each webapp instance maintains its own Redis key, so the fleet
// effective ceiling is `instance_count * threshold`. Phase 2 consumers must
// not treat `count` as a global rate.

numberOfKeys: 2,
lua: `
  local rateKey = KEYS[1]
  local trippedKey = KEYS[2]
  local windowMs = tonumber(ARGV[1])
  local threshold = tonumber(ARGV[2])
  local holdMs = tonumber(ARGV[3])
  local count = redis.call('INCR', rateKey)
  if count == 1 then
    redis.call('PEXPIRE', rateKey, windowMs)
  end
  if count > threshold then
    redis.call('PSETEX', trippedKey, holdMs, '1')
  end
  local tripped = redis.call('EXISTS', trippedKey)
  return {count, tripped}

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

d-cs force-pushed the mollifier-phase-2 branch from b4b5719 to f43df01 on May 14, 2026 08:54
devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

d-cs force-pushed the mollifier-phase-2 branch 2 times, most recently from 088e071 to 2a0d5ee, on May 14, 2026 12:49
devin-ai-integration[bot]

This comment was marked as resolved.

d-cs force-pushed the mollifier-phase-2 branch from e51fac9 to a9400b1 on May 14, 2026 14:09
devin-ai-integration[bot]

This comment was marked as resolved.

d-cs and others added 13 commits May 14, 2026 16:43
Redis-backed burst-smoothing layer behind MOLLIFIER_ENABLED=0 (default).
With the kill switch off, the gate short-circuits on its first env check
and production behaviour is identical to main.

@trigger.dev/redis-worker:
- MollifierBuffer: atomic Lua-backed FIFO with accept / pop / ack /
  requeue / fail + TTL. Per-env queues with HSET entry storage,
  atomic RPOP + status transition, FIFO retry ordering.
- MollifierDrainer: generic round-robin worker with concurrency cap,
  retry semantics, and a stop deadline to avoid livelock on a hung
  handler. Phase 3 will wire the handler to engine.trigger().
- Full testcontainers-backed test suite (21 tests).

apps/webapp:
- evaluateGate cascade-check (kill switch -> org feature flag ->
  shadow mode -> trip evaluator -> mollify / shadow_log / pass_through).
  Dependencies injected for testability; the trip evaluator stub
  returns { divert: false } in phase 1.
- Inserted into RunEngineTriggerTaskService.call() before
  traceEventConcern.traceRun. The mollify branch throws (unreachable
  in phase 1).
- Lazy MollifierBuffer + MollifierDrainer singletons; no Redis
  connection unless MOLLIFIER_ENABLED=1.
- 12 MOLLIFIER_* env vars (all safe defaults) and a mollifierEnabled
  feature flag in the global catalog.
- Drainer booted from worker.server.ts on first import.
- Read-fallback stub for phase 3.
- Gate cascade tests + .env loader so env.server validates in vitest
  workers.

Phase 2 will land the real trip evaluator; phase 3 will activate the
buffer-write + drain path.
…dual-write monitoring + drainer ack loop)

Phase 1 of the trigger-burst smoothing initiative. Adds the A-side trip
evaluator (atomic Lua sliding-window per env) and wires it into the trigger
hot path. When the per-org mollifierEnabled feature flag is on AND the
evaluator says divert, the canonical replay payload is buffered to Redis
(via buffer.accept) AND the trigger continues through engine.trigger —
i.e. dual-write. The drainer pops + acks (no-op handler) to prove the
dequeue mechanism works end-to-end. Operators audit by joining
mollifier.buffered (write) and mollifier.drained (consume) logs by runId.

Buffer primitives hardened:
- accept is idempotent on duplicate runId (Lua EXISTS guard)
- pop skips orphan queue references (entry HASH TTL'd while runId queued)
- fail no-ops on missing entry (no partial FAILED hash leak)
- mollifier:envs set pruned on draining pop, restored on requeue
- 16-row truth-table test enumerates the gate cascade
- BufferedTriggerPayload defines the canonical replay shape Phase 2 will
  use to invoke engine.trigger
- payload hash for audit-equivalence computed off the hot path (in the
  drainer) to avoid CPU during a spike

Regression tests in apps/webapp/test/engine/triggerTask.test.ts pin the
mollifier integration:
- validation throws BEFORE the gate runs (no orphan buffer write on
  rejected triggers)
- mollify dual-write happy path (Postgres + Redis both reflect the run)
- pass_through path does NOT call buffer.accept
- engine.trigger throwing AFTER buffer.accept leaves an orphan
  (documented behaviour — drainer auto-cleans; audit-trail surfaces it)
- idempotency-key match short-circuits BEFORE the gate is consulted
- debounce match produces an orphan (documented behaviour — Phase 2
  must lift handleDebounce upfront before buffer.accept)

Behaviour with MOLLIFIER_ENABLED=0 (default) is byte-identical to main.
With MOLLIFIER_ENABLED=1 and the flag off, only mollifier.would_mollify
logs fire (no buffer state). With the flag on, dual-write activates.

Includes two opt-in *.fuzz.test.ts suites (gated on FUZZ=1) that
randomise operation sequences against evaluateTrip and the drainer to
find timing edges. They are clearly marked TEMPORARY in their headers.
- changeset: drop "deferred" wording — phase-1 actively dual-writes + runs
  the drainer ack loop.
- worker.server.ts: wrap mollifier drainer init in try/catch + register
  SIGTERM/SIGINT handlers so the polling loop stops cleanly on shutdown.
- bufferedTriggerPayload: only serialise idempotencyKeyExpiresAt when an
  idempotencyKey is present (avoid impossible orphan-expiry payloads).
- mollifierTelemetry: narrow recordDecision reason to DecisionReason union
  to keep OTEL attribute cardinality bounded.
- mollifierGate: rename resolveOrgFlag → resolveFlag. The underlying
  FeatureFlag table is global by key, so the "org" prefix was misleading;
  per-org gating is out of scope for phase-1.
- tests: drop vi.fn mocks. mollifierGate now uses plain closure spies;
  mollifierTripEvaluator runs against a real MollifierBuffer backed by a
  redisTest container (closed client exercises the fail-open path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…stacking

Worker.init() is called per request from entry.server.tsx, so the
process.once SIGTERM/SIGINT pair added in 98c1520 would stack a fresh
listener every request under dev hot-reload (process.once only removes
after firing). Gate registration on a process-global flag, matching the
existing __worker__ pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion.featureFlags

The mollifier gate's resolveOrgFlag was a global feature-flag lookup
named as if org-scoped. Phase-1 plan and design doc both intended
per-org gating; the implementation regressed because the global
flag() helper has no orgId parameter.

Adopt the existing per-org feature-flag pattern (used by canAccessAi,
canAccessPrivateConnections, compute beta gating): pass
`Organization.featureFlags` through as `flag()` overrides. Per-org
opt-in now works admin-toggleable via the existing
Organization.featureFlags JSON column — no schema migration needed.

- mollifierGate: revert resolveFlag/flagEnabled back to
  resolveOrgFlag/orgFlagEnabled (the name now matches reality).
  GateInputs gains `orgFeatureFlags`; the default resolver passes
  them as overrides to `flag()`.
- triggerTask.server.ts: thread `environment.organization.featureFlags`
  into the gate call.
- tests: three new postgresTest cases exercise the real DB-backed
  resolveOrgFlag end-to-end, proving (a) per-org opt-in isolation,
  (b) unrelated beta flags don't bleed across, (c) per-org overrides
  take precedence over the global FeatureFlag row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nect

The unit cascade tests in mollifierGate.test.ts import the gate module,
which transitively pulls in ~/db.server. That module constructs the
prisma singleton at import time and eagerly calls $connect(), which
fails against localhost:5432 in the unit-test shard and surfaces as an
unhandled rejection that fails the whole vitest run. Mocking the module
keeps the cascade tests pure and leaves the postgresTest cases on the
testcontainer-fixture prisma untouched.
- Gate drainer init on WORKER_ENABLED so only worker replicas run the polling loop.
- Update the enqueueSystem TTL comment now that delayed/pending-version are first enqueues.
- Correct the mollifier gate docstring to describe the fixed-window counter and tripped-key rearm.
- Swap findUnique for findFirst in the trigger task test to match the webapp Prisma rule.
…eFlags

The gate's `GateInputs` now requires `orgFeatureFlags`, but the surface type used by the trigger service was still the pre-org-scope shape, so the default evaluator wasn't assignable and the call site couldn't pass the flag overrides.
…est startup

The per-org isolation suite uses `postgresTest`, which spins up a fresh Postgres testcontainer per case. On CI the 5s vitest default regularly times out on container start before the test body runs. Match the 30s `vi.setConfig` used by other postgresTest suites in this app.
…rrors

resolveOrgFlag now checks the per-org Organization.featureFlags override
in-memory before falling back to the global flag() helper, so the common
per-org enablement path resolves without a Prisma round-trip on every
trigger call. evaluateGate also wraps the flag resolution in try/catch
and fails open to false on error, mirroring the trip evaluator.
…exit

Pass a configurable timeout to drainer.stop() so SIGTERM/SIGINT can't hang
forever if an in-flight handler is wedged. Matches the precedent set by
BATCH_TRIGGER_WORKER_SHUTDOWN_TIMEOUT_MS (default 30s).
processOneFromEnv now catches buffer.pop() failures so one env's hiccup
doesn't reject Promise.all and bubble up to the loop's outer catch. The
polling loop itself wraps each runOnce in try/catch and backs off with
capped exponential delay (up to 5s) instead of exiting permanently on the
first listEnvs/pop error. Stop semantics are unchanged: only the stopping
flag breaks the loop.

Adds two regression tests using a stub buffer (no Redis container) so
fault injection is deterministic.
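
For illustration, a minimal sketch of the loop-level behaviour described above (the 5s cap comes from the commit message; the tick interval, names, and shapes are assumptions):

```ts
// Illustrative polling loop: each tick is isolated in try/catch, failures back
// off exponentially (capped at 5s), and the backoff resets on the next success.
async function pollLoop(
  runOnce: () => Promise<void>,
  isStopping: () => boolean,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms))
) {
  let consecutiveErrors = 0;
  while (!isStopping()) {
    try {
      await runOnce();
      consecutiveErrors = 0;
      await sleep(1000); // base tick interval (assumed)
    } catch {
      consecutiveErrors += 1;
      const backoffMs = Math.min(1000 * 2 ** (consecutiveErrors - 1), 5000);
      await sleep(backoffMs);
    }
  }
}
```
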
The phase-1 scaffolding referenced MollifierBuffer, getMollifierBuffer,
and deserialiseMollifierSnapshot without importing them — CI typecheck
fails with TS2304. The runtime path is gated behind MOLLIFIER_ENABLED=0
so this never produced a runtime symptom, but the types must resolve.
d-cs added 16 commits May 14, 2026 16:53
The enqueueSystem.ts comment touch-up was an unrelated drive-by during
phase-1 review and doesn't belong in this PR. Will land separately.
External changelog readers don't have context on internal phase numbering;
describe the feature itself (opt-in burst protection, default-off env vars,
shadow mode, dual-write activation) instead of "phase 1".
…iversion

The previous wording implied the buffer/drainer was active protection;
in this release they're audit-only. Spell out that no trigger calls are
diverted or rate-limited yet, and that active smoothing follows later.
MollifierEvaluateGate and MollifierGetBuffer were defined in the
consumer (triggerTask.server.ts) but described the surface of the gate
and the buffer accessor respectively. Move each to the module that
owns the underlying implementation so the type lives with the producer,
not the caller. No behavioural change.
…-redis fallback

Two operational guards for misconfigured rollouts:

1. Drop the MOLLIFIER_REDIS_* fallback to the main REDIS_* cluster.
   The mollifier writes to a dedicated Redis to keep burst traffic off
   the engine's primary queue — silently colocating with the main Redis
   when MOLLIFIER_REDIS_HOST is unset defeats the design.

2. Degrade gracefully instead of crashing the pod. If MOLLIFIER_ENABLED
   was flipped on without setting MOLLIFIER_REDIS_HOST, the buffer
   returns null (with a one-shot warn log per process) and the drainer
   no-ops. No crash loops, no failed deploys, no traffic impact —
   operators see the warn line and fix the misconfig in a follow-up
   deploy.

The drainer's previously-unreachable "env vars inconsistent" throw
becomes reachable in this degraded mode; replace it with a null return
so worker.server.ts's existing null check short-circuits cleanly.
mollifier:envs is a Redis SET that grows with the count of envs that
currently have buffered entries. Under normal operation that's small,
but an extended drainer outage can leave entries piled up across
thousands of envs — at which point runOnce would queue one
processOneFromEnv per env through pLimit, ballooning per-tick latency
and event-loop queue depth.

Cap per-tick fan-out at MOLLIFIER_DRAIN_MAX_ENVS_PER_TICK (default 500).
When the set fits within the cap, behaviour is unchanged (take all,
rotate cursor by 1 for fairness). When the set exceeds the cap, take a
rotating slice and advance the cursor by the slice size so successive
ticks sweep through the full set.

Tests use a stub buffer to drive listEnvs() deterministically with
thousands of envs without provisioning a real Redis.
…tchQueue

The MollifierDrainer's stop() was polling `isRunning` every 20ms until
the loop exited, which differs from the codebase's convention for
similar polling loops (FairQueue, BatchQueue both hold the loop promise
as a field and await it directly on stop).

Switch to the same pattern: store the loop promise on start(), then in
stop() race it against the timeout via Promise.race. With no timeout we
just await the loop directly. With a timeout the warn-and-return
behaviour is unchanged. No polling, no separate `isRunning` poll loop.

Behaviour is identical to the previous implementation, including the
hung-handler timeout path (covered by the existing
"stop returns after timeoutMs even if a handler is hung" test).
The previous chunking advanced the cursor by sliceSize each tick,
producing fixed disjoint slices like [0..3], [4..7], [0..3], ... With
that pattern env_0 was always at slice position 0 (first into pLimit)
and env_3 always at position 3 (last) — reinstating the head-of-line
bias rotation was meant to prevent.

Advance the cursor by 1 instead. Slices now overlap across consecutive
ticks (e.g. [0..3], [1..4], [2..5], ...) so every env reaches every
slot position 0..sliceSize-1 across one envs.length-tick cycle.

Drainage rate per env is unchanged: each env still appears in exactly
sliceSize of every envs.length ticks. New regression test pins the
fairness property by asserting each env touches every slot at least
once per cycle.
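
The slice/cursor mechanics, sketched (the function name and shapes are illustrative):

```ts
// Sketch: take a rotating slice of size sliceSize starting at cursor; advancing
// the cursor by 1 per tick makes consecutive slices overlap ([0..3], [1..4], ...)
// so every env reaches every slot position across one envs.length-tick cycle.
function takeRotatingSlice<T>(items: T[], cursor: number, sliceSize: number): T[] {
  if (items.length <= sliceSize) return items; // small sets: take everything
  const slice: T[] = [];
  for (let i = 0; i < sliceSize; i++) {
    const item = items[(cursor + i) % items.length];
    if (item !== undefined) slice.push(item);
  }
  return slice;
}

// per tick (illustrative): slice = takeRotatingSlice(envs, cursor, sliceSize);
//                          cursor = (cursor + 1) % envs.length;
```
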
…y envs

Adds a regression test that proves a light env (single buffered entry)
is drained within (envs.length - sliceSize + 1) ticks regardless of how
many entries the heavy envs have queued. The test uses a stub buffer
whose listEnvs/pop pair mirrors the production atomic-Lua semantic: an
env disappears from listEnvs the moment its queue empties, so the light
env exits the rotation as soon as it's popped — while the heavy envs
stay in the rotation until their thousands of entries are drained.

Together with the head-of-line fairness test this pins both fairness
properties: (1) every env touches every slice slot per cycle (no
within-slice bias), and (2) no env's drainage latency depends on the
queue depth of other envs (no across-slice starvation).
…eckedIndexedAccess

The fairness test compared popsPerTick[0][0] vs popsPerTick[1][0]
directly. Under the redis-worker package's strict tsconfig
(noUncheckedIndexedAccess implied), array index access returns T |
undefined, which trips TS2532. Destructure into named locals and use
optional chaining — same assertion, no `!` non-null soup.
1. start() resets envCursor to 0 — new behaviour. A stop+start cycle now
   begins rotation cleanly from envs[0] rather than inheriting between-
   restart cursor drift.
2. Malformed payload → non-retryable handler error path. Pins that the
   deserialise failure goes terminal without invoking the handler.
3. Ack failure after handler success — documents the current behavioural
   gap. ack() lives inside processEntry's try, so a Redis blip on ack
   routes a successfully-handled entry through the retry/terminal path.
   Phase 2's engine-replay handler will need idempotency to absorb the
   re-execution, OR ack should be lifted out of the try block.
4. start() idempotency — second call is a no-op (no doubled loop).
5. stop() idempotency — safe to call when never started or twice.
6. Loop-level backoff actually grows on consecutive runOnce failures
   and resets on first success. Distinct from per-entry retry attempts
   already covered elsewhere; this is the consecutiveErrors counter
   that drives backoffMs between ticks.

Also adds org-level fairness analogue of the existing env starvation
test: a light org (1 env, 1 entry) is not starved behind a heavy org
with many envs and many entries. The buffer doesn't track orgs as a
separate axis, so org fairness is an emergent property of env rotation
— the test pins that property explicitly.
…el fairness

Previously the drainer rotated per-env: an org with N busy envs got N
scheduling slots per tick. A noisy tenant with many envs would drain
proportionally faster than a quiet tenant with one env. Switch to
hierarchical rotation: pick orgs round-robin (capped by maxOrgsPerTick),
then pick one env per picked org (also rotating).

Implementation is drainer-side only — no buffer or Lua changes. The
drainer caches envId→orgId from popped entries; envs not yet cached are
treated as their own pseudo-org for one tick, so cold start matches the
old per-env behaviour and converges to hierarchical once cache is hot
(usually within one tick). Cache and cursors reset on start() alongside
the existing cursor reset.

API change: maxEnvsPerTick → maxOrgsPerTick on MollifierDrainerOptions,
MOLLIFIER_DRAIN_MAX_ENVS_PER_TICK → MOLLIFIER_DRAIN_MAX_ORGS_PER_TICK on
the webapp env. Same default (500). Operators tune for "typical orgs
with pending entries" rather than env count.

Trade-off: total per-tick pops drop from O(envs) to O(orgs). For an org
with N envs, each env's individual drainage rate is 1/N of what it was,
but the tenant overall is bounded the same way as a single-env tenant —
which is the fairness contract.
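
A rough sketch of the two-level pick described above (names mirror the commit message where possible; the shapes and everything else are assumptions):

```ts
// Sketch: pick up to maxOrgsPerTick orgs round-robin via orgCursor, then one env
// per picked org via a per-org cursor, so a tenant with many envs gets the same
// per-tick budget as a single-env tenant.
function pickWorkForTick(
  orgs: string[],
  envsByOrg: Map<string, string[]>,
  orgCursor: number,
  perOrgEnvCursor: Map<string, number>,
  maxOrgsPerTick: number
): Array<{ orgId: string; envId: string }> {
  const picks: Array<{ orgId: string; envId: string }> = [];
  const count = Math.min(orgs.length, maxOrgsPerTick);
  for (let i = 0; i < count; i++) {
    const orgId = orgs[(orgCursor + i) % orgs.length];
    if (orgId === undefined) continue;
    const envs = envsByOrg.get(orgId) ?? [];
    if (envs.length === 0) continue;
    const envCursor = perOrgEnvCursor.get(orgId) ?? 0;
    const envId = envs[envCursor % envs.length];
    if (envId === undefined) continue;
    perOrgEnvCursor.set(orgId, envCursor + 1);
    picks.push({ orgId, envId });
  }
  return picks;
}
```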

Tests:
- Renamed maxEnvsPerTick references throughout existing tests; old
  behaviour still holds at cold cache (each env = pseudo-org).
- New "heavy org with many envs does not dominate vs light org" pins
  the post-warm-up ~1:1 drainage ratio between a 6-env org and a 1-env
  org over a sustained 20-tick run.
- New "within an org, envs are rotated round-robin across ticks" pins
  the inner env cursor's behaviour for a single multi-env org.
- Cursor-reset test renamed and now asserts cache+cursors all reset.

Also removed an outdated test-count comment in
apps/webapp/test/engine/triggerTask.test.ts that listed "four tests"
when reality has moved on.
…uage)

The changeset accreted across the PR's evolution and ended up reading as
three deltas ("now survives", "is now two-level", "no longer scales").
On merge this is the introduction of the feature — there's no prior
state to contrast against. Rewrite as one cohesive description of what
ships.
…rness

Previously the drainer cached envId→orgId from popped entries and used a
sentinel pseudo-org for envs it hadn't seen yet. The sentinel polluted
the bucket map with fake org IDs and was a foreseeable source of bugs.

This commit moves org membership into the buffer's atomic Lua scripts.
New Redis keys, both maintained transactionally alongside per-env queues:
- mollifier:orgs — orgs with at least one queued env
- mollifier:org-envs:${orgId} — envs of that org with queued entries

acceptMollifierEntry SADDs into all three sets (envs + orgs + org-envs).
popAndMarkDraining cleans up envs+orgs+org-envs together when the queue
empties in the success branch (we know orgId from the popped entry). The
no-runId branch can't read orgId so it only cleans envs — stale org-envs
entries are bounded by env count and recovered on the next accept.
requeueMollifierEntry re-SADDs all three since the env may have just been
pruned.

The drainer now walks listOrgs() → listEnvsForOrg(org) → pop(env) with
two cursors: orgCursor across all active orgs and a per-org envCursor
for round-robin within each org. No client-side cache, no sentinel,
deterministic from the first tick.
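
The drainer-facing surface this implies, as a type sketch; the method names come from this PR, but the parameter and return shapes are assumptions:

```ts
interface MollifierBufferLike {
  accept(entry: { runId: string; envId: string; orgId: string; payload: string }): Promise<void>;
  listOrgs(): Promise<string[]>; // backed by the mollifier:orgs SET
  listEnvsForOrg(orgId: string): Promise<string[]>; // backed by mollifier:org-envs:${orgId}
  pop(envId: string): Promise<{ runId: string; orgId: string; payload: string } | null>;
  ack(runId: string): Promise<void>;
  requeue(runId: string): Promise<void>;
  fail(runId: string): Promise<void>;
}
```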

Tests updated:
- multi-org-round-robin (was multi-env-round-robin): two orgs with one
  and two envs respectively, asserts org_B drains its only env each
  tick while org_A rotates through its two.
- concurrency-cap test spreads 12 envs across 12 orgs (otherwise one
  org → one pop per tick).
- "heavy org doesn't dominate vs light org" gets explicit listOrgs /
  listEnvsForOrg from the test's env→org map; assertion tightened to
  0.7–1.5 ratio over 20 ticks.
- "within an org envs rotated round-robin" gets explicit listEnvsForOrg.
- "envCursor resets" → "rotation cursors reset"; cache is gone, only
  orgCursor and perOrgEnvCursors reset on start().
- makeStubBuffer auto-derives listOrgs/listEnvsForOrg from listEnvs
  (each env as its own org) so tests that don't care about org grouping
  don't need to provide them explicitly.

24/24 drainer tests pass, 35/35 buffer tests pass (some redis-container
flakes under full-suite load; all green in isolation). Webapp typecheck
clean.
…config

Two prior changes are reverted:

1. MOLLIFIER_REDIS_HOST (plus _PORT/_USERNAME/_PASSWORD/_TLS_DISABLED)
   regain their `.transform((v) => v ?? process.env.REDIS_*)` fallback
   to the main Redis cluster, matching the convention used elsewhere in
   the codebase for dedicated-cluster env vars. Operators who don't set
   a dedicated mollifier Redis fall back to the main one — that's the
   accepted default.

2. getMollifierBuffer() no longer degrades to disabled with a warn log
   when MOLLIFIER_ENABLED=1 but MOLLIFIER_REDIS_HOST is unset. The
   buffer initialises normally (falling back to the main Redis if
   configured), and if that fails the pod crashes loudly. Same for the
   drainer: initializeMollifierDrainer() throws "env vars inconsistent"
   if the buffer comes back null, surfacing the misconfig immediately
   rather than silently leaving entries un-drained.

Operationally: silent degradation hides config errors from operators
and produces "why are no triggers being mollified?" debugging sessions.
Loud failure surfaces the same misconfig at deploy time via the pod's
health checks.
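
The fallback convention being reinstated looks roughly like this in a zod-based env schema (the variable name comes from this PR; the exact schema shape is an assumption):

```ts
import { z } from "zod";

// Sketch: the dedicated-cluster var falls back to the main REDIS_* value when unset.
const MollifierRedisEnv = z.object({
  MOLLIFIER_REDIS_HOST: z
    .string()
    .optional()
    .transform((v) => v ?? process.env.REDIS_HOST),
});
```
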
With the drainer walking listOrgs → listEnvsForOrg → pop, the flat
mollifier:envs SET has no consumer — `mollifier:orgs` and the per-org
`mollifier:org-envs:${orgId}` SETs cover everything the drainer needs.
Removing it drops three Lua write ops per accept/pop/requeue and one
Redis key per active env.

Changes:
- Lua: acceptMollifierEntry, popAndMarkDraining, requeueMollifierEntry
  no longer touch mollifier:envs. Their KEYS arrays shrink by one.
- TS: listEnvs() method removed; only listOrgs() and listEnvsForOrg()
  remain. TS bindings updated to match the new arg shapes.
- buffer.test.ts: listEnvs() assertions converted to listEnvsForOrg(
  "org_1") so they verify the equivalent org-level membership. The
  "stale envs SET cleanup on empty-pop" test is removed (envs SET is
  gone). The "pop skips orphans" test's trailing-cleanup assertion is
  updated to document the deliberate stale-tolerance in the no-runId
  branch of popAndMarkDraining (can't read orgId without a popped
  entry, so org-envs cleanup is skipped there).
- drainer.test.ts: stub helper moved to module scope and gains an
  `eachEnvAsOwnOrg(envs)` convenience that supplies listOrgs +
  listEnvsForOrg in tests where each env is its own org. Stub helpers
  duplicated across describe blocks are removed in favour of the
  shared one.

24/24 drainer tests pass; buffer tests pass in isolation (a few timeout
under full-suite contention against the shared redis container —
unrelated to this change).
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 0 new potential issues.

View 14 additional findings in Devin Review.


d-cs and others added 4 commits May 15, 2026 11:19
…olver

`triggerTask` is the highest-throughput code path in the system and the
webapp CLAUDE.md forbids new DB queries there. The previous resolver fell
back to `flag()` (a Prisma read against `FeatureFlag`) when the org had
no `mollifierEnabled` override, which added a query to every trigger
whenever `MOLLIFIER_ENABLED=1`. The fleet-wide kill switch already lives
in `MOLLIFIER_ENABLED`; rollout is per-org via `Organization.featureFlags`
JSON, matching `canAccessAi`/`hasComputeAccess`/etc. Drop the fallback so
the resolver is purely in-memory.

Tests no longer need a postgres testcontainer or `makeFlag(prisma)`; the
per-org isolation suite now asserts directly on `Organization.featureFlags`
shape and adds a regression test for the no-override -> false contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…own deadlines

Two correctness/perf fixes on top of the phase-2 drainer:

1. `runOnce` was awaiting `listEnvsForOrg` serially before any pop ran.
   At the default `maxOrgsPerTick=500` and a ~1ms RTT, that's a ~500ms
   per-tick latency floor before `pLimit` even sees work. `Promise.all`
   over the org slice lets ioredis auto-pipeline the SMEMBERS into a
   single round-trip. Order is preserved so the org→envs pairing stays
   deterministic and `pickEnvForOrg` still rotates per org.

2. The SIGTERM handler is sync fire-and-forget: `drainer.stop({timeoutMs})`
   returns a promise that keeps the loop alive, but in cluster mode the
   primary process runs its own `GRACEFUL_SHUTDOWN_TIMEOUT` and will hit
   `process.exit(0)` independently. If the drainer's deadline exceeds
   the primary's, the drainer's "log a warning on timeout" turns into
   "hard exit with no log". Assert at boot that
   `MOLLIFIER_DRAIN_SHUTDOWN_TIMEOUT_MS <= GRACEFUL_SHUTDOWN_TIMEOUT - 1s`
   so a misconfig fails loud instead of disappearing at shutdown.
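
A sketch of fix 1: resolving the whole org slice concurrently so ioredis can pipeline the reads (shapes are illustrative):

```ts
// Promise.all preserves input order, so the org→envs pairing stays deterministic
// and the per-org rotation is unaffected.
async function listEnvsForOrgSlice(
  orgSlice: string[],
  listEnvsForOrg: (orgId: string) => Promise<string[]>
): Promise<Array<{ orgId: string; envs: string[] }>> {
  const envLists = await Promise.all(orgSlice.map((orgId) => listEnvsForOrg(orgId)));
  return orgSlice.map((orgId, i) => ({ orgId, envs: envLists[i] ?? [] }));
}
```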

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tracking

The previous wording still described the in-memory env→org cache and the
"uncached envs treated as their own pseudo-org for one tick" sentinel —
both removed when the buffer started tracking `mollifier:orgs` and
`mollifier:org-envs:${orgId}` atomically. Re-describe the drainer in
terms of the current org-walk so the published changelog matches the
shipped code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-cs d-cs marked this pull request as ready for review May 15, 2026 10:56
devin-ai-integration[bot]

This comment was marked as resolved.

Comment thread on apps/webapp/app/runEngine/services/triggerTask.server.ts (Outdated)
d-cs added 2 commits May 15, 2026 13:04
Replace each vi.fn(async handler) with a plain async closure that
records calls via captured counter/array variables. Assertions move
from handler.mock.* / toHaveBeenCalled* matchers to checks against
the captured state, e.g. handlerCalls.length / handlerCalls[0].
Functionally equivalent; aligns with the package convention of using
real testcontainers + closure-based probes (cf. mollifierGate.test.ts
and mollifierTripEvaluator.test.ts) rather than vitest fakes.
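
The closure-spy shape, roughly (illustrative; the real entry type lives in the package):

```ts
// A plain closure records calls instead of vi.fn; assertions read the array.
type Entry = { runId: string };

const handlerCalls: Entry[] = [];
const handler = async (entry: Entry) => {
  handlerCalls.push(entry);
};

// later, in the test body (illustrative):
//   expect(handlerCalls.length).toBe(1);
//   expect(handlerCalls[0]?.runId).toBe("run_123");
```
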
…g polling loop

The drainer was started inside the singleton factory, with the
shutdown-timeout-vs-GRACEFUL_SHUTDOWN_TIMEOUT reconciliation living in
worker.server.ts init() afterwards. If that validation threw, the polling
loop was already running and the SIGTERM handler registration below it
was never reached — the loop kept polling with no graceful-shutdown
path, and the singleton was cached in its running state (so subsequent
init() calls returned the same drainer and validation kept failing).

Move the timeout check into initializeMollifierDrainer() before
drainer.start(). singleton() uses ??=, so a throw inside the factory
leaves the cache slot unset and the next getMollifierDrainer() call
re-runs the factory — no half-started state, no missing SIGTERM
handler. The catch in worker.server.ts init() still logs and aborts
drainer registration on either the validation error or a Redis init
failure.
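
The ??= semantics being relied on, as a tiny sketch (names illustrative):

```ts
// If the factory throws, the assignment never happens: the slot stays undefined
// and the next call re-runs the factory, so no half-initialised singleton is cached.
let _drainer: { start(): void } | undefined;

function getDrainerSingleton(factory: () => { start(): void }) {
  return (_drainer ??= factory());
}
```
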
devin-ai-integration[bot]

This comment was marked as resolved.

d-cs added 3 commits May 15, 2026 13:23
…g polling loop

Move the MOLLIFIER_DRAIN_SHUTDOWN_TIMEOUT_MS / GRACEFUL_SHUTDOWN_TIMEOUT
reconciliation check from worker.server.ts init() into
initializeMollifierDrainer() — BEFORE drainer.start() — so a
misconfigured deploy fails loud at module-load time instead of starting
the polling loop and then throwing back at the caller before the SIGTERM
handler can be registered.

The singleton() helper uses ??=, so a throw inside the factory leaves
the cache slot unset and the next getMollifierDrainer() call re-runs the
factory. No half-started state, no missing SIGTERM handler. The catch in
worker.server.ts init() still logs and aborts drainer registration on
either the validation error or a Redis init failure — same observable
behaviour from the caller's perspective.
initializeMollifierDrainer() no longer calls drainer.start() — it
returns a configured-but-stopped drainer. worker.server.ts init() now
invokes drainer.start() AFTER the SIGTERM/SIGINT handlers are
registered, gated on the same __mollifierShutdownRegistered__ guard so
dev hot-reloads can't double-start.

Closes the residual race window between drainer.start() (previously
fired inside the singleton factory) and process.once("SIGTERM",
stopDrainer) in worker.server.ts. With construction and starting
separated, a signal landing during boot can never find the polling
loop running without a graceful-stop path.
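
The boot order this describes, sketched (the guard name mirrors the commit message; everything else is illustrative):

```ts
type Drainer = { start(): void; stop(): Promise<void> };

// Construct first, register the shutdown path behind a process-global guard,
// and only then start the polling loop.
function bootMollifierDrainer(drainer: Drainer) {
  const g = globalThis as { __mollifierShutdownRegistered__?: boolean };
  if (g.__mollifierShutdownRegistered__) return; // dev hot-reload can't double-start
  g.__mollifierShutdownRegistered__ = true;

  const stopDrainer = () => {
    drainer.stop().catch((error) => console.error("Failed to stop mollifier drainer", error));
  };
  process.once("SIGTERM", stopDrainer);
  process.once("SIGINT", stopDrainer);

  drainer.start(); // a signal landing during boot always finds a graceful-stop path
}
```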