Commit 22046d2

docs(ai-chat): document oomMachine OOM-resilience pattern
New patterns/oom-resilience.mdx covering the opt-in, recovery flow, hydrate vs degraded continuity, and tool-execute idempotency caveats. Reference table updated with the oomMachine row and a note that generic retry is not exposed on chat.agent.
1 parent 20ec724 commit 22046d2

3 files changed

Lines changed: 117 additions & 2 deletions

docs/ai-chat/patterns/oom-resilience.mdx

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
title: "OOM resilience"
sidebarTitle: "OOM resilience"
description: "Recover from out-of-memory errors mid-turn by automatically retrying the failed turn on a larger machine — without losing the in-flight user message or re-processing completed turns."
---

When a `chat.agent` turn runs out of memory, the worker process dies and everything in it is gone: the in-flight LLM call, the accumulator, any tool execution mid-flight. By default, Trigger.dev surfaces the OOM as a run failure.

Setting `oomMachine` opts the agent into automatic recovery: the failed turn re-runs on a larger machine, picks up the user message that triggered the OOM (without re-processing earlier completed turns), and produces a normal response.

## Setup

```ts
import { chat } from "@trigger.dev/sdk/ai";
import { streamText } from "ai";

// `model` is your configured language model instance.
export const myChat = chat.agent({
  id: "my-chat",
  machine: "small-1x", // default machine
  oomMachine: "medium-2x", // fallback on OOM
  run: async ({ messages, signal }) =>
    streamText({ model, messages, abortSignal: signal }),
});
```

That's the entire opt-in. With `oomMachine` set, the agent gets:

- **`retry.maxAttempts: 2`** internally — one retry for OOM only; non-OOM errors don't retry.
- **`retry.outOfMemory.machine: oomMachine`** — the fresh attempt boots on the larger machine.
- **`session.in` cursor recovery** — the new attempt skips records belonging to turns that already completed on the prior attempt and only re-runs the OOM'd turn.

`chat.agent` does not expose generic `retry` options. OOM recovery is the only retry path because retrying an LLM-driven loop on non-OOM errors tends to be expensive and side-effecting. Drop down to `chat.task` (the raw primitive) if you need richer retry semantics.
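
If you do drop down to `chat.task`, the bullets above translate into explicit retry options. A rough sketch — the task id and run body are illustrative, and `chat.task` takes standard `TaskOptions`:

```ts
import { chat } from "@trigger.dev/sdk/ai";

export const myChatTask = chat.task({
  id: "my-chat-task",
  machine: "small-1x",
  retry: {
    maxAttempts: 2, // what chat.agent sets internally
    outOfMemory: { machine: "medium-2x" }, // what `oomMachine` maps to
    // ...richer semantics (backoff, non-OOM retries) are available here
  },
  run: async (payload) => {
    // your turn loop
  },
});
```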
33+
## How recovery works
34+
35+
The recovery doesn't need any customer-side persistence to avoid duplicate processing. It uses two pieces of durable state Trigger already maintains for every chat:
36+
37+
- **`session.out`** — the durable response stream. Every successful turn writes a `trigger:turn-complete` chunk here.
38+
- **`session.in`** — the durable input stream. Every user message after the first turn lands here as a record with a server-assigned timestamp.
39+
40+
On retry boot, the SDK:
41+
42+
1. Scans `session.out` for the latest `trigger:turn-complete` chunk and reads its timestamp. Call this `T_last_complete`.
43+
2. Sets a per-stream filter on `session.in` so any record with `timestamp <= T_last_complete` is dropped before it reaches the turn loop.
44+
3. Begins normal processing. The first record that passes the filter is the message that triggered the OOM (or any newer message that arrived during the retry window).
45+
46+
Result: turns 1..N-1 are not re-processed, turn N runs on the larger machine, and the conversation continues.
47+
48+
```mermaid
sequenceDiagram
    participant User
    participant Run as chat.agent run
    participant SessionIn as session.in
    participant SessionOut as session.out

    User->>SessionIn: u2 (turn 2)
    Run->>SessionIn: read u2
    Run->>SessionOut: turn-complete (T1)
    User->>SessionIn: u3 (turn 3)
    Run->>SessionIn: read u3
    Run->>SessionOut: turn-complete (T2)
    User->>SessionIn: u4 (turn 4)
    Run->>SessionIn: read u4
    Note over Run: OOM mid-turn
    Run->>Run: ⚠️ killed
    Note over Run: Attempt 2 boots on oomMachine
    Run->>SessionOut: scan → T_last_complete = T2
    Run->>SessionIn: read with filter (ts > T2)
    SessionIn-->>Run: u2 (filtered, ts < T2)
    SessionIn-->>Run: u3 (filtered, ts < T2)
    SessionIn-->>Run: u4 (passes — the OOM'd turn)
    Run->>SessionOut: turn 4 complete
```

The scan on `session.out` is streaming and bounded in memory: each chunk is inspected and discarded one at a time, so a long-running chat doesn't bloat the retry-boot worker. Bandwidth scales linearly with `session.out` size, but only on the OOM-retry path — a rare event.
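
The three boot steps can be sketched in isolation. This is a minimal model, not the SDK's code — the chunk and record shapes are illustrative:

```typescript
// Model of the retry-boot cursor recovery described above.
type OutChunk = { type: string; timestamp: number };
type InRecord = { timestamp: number; message: string };

// Step 1: streaming scan of session.out — inspect and discard one chunk
// at a time, keeping only the latest turn-complete timestamp.
function lastTurnComplete(out: Iterable<OutChunk>): number {
  let tLast = -Infinity;
  for (const chunk of out) {
    if (chunk.type === "trigger:turn-complete") tLast = chunk.timestamp;
  }
  return tLast;
}

// Steps 2-3: filter session.in so records from completed turns are dropped
// before they reach the turn loop.
function* pendingRecords(input: Iterable<InRecord>, tLastComplete: number) {
  for (const record of input) {
    if (record.timestamp > tLastComplete) yield record; // ts <= T_last_complete dropped
  }
}

// The scenario from the diagram: turns 2-3 completed, turn 4 OOM'd.
const sessionOut: OutChunk[] = [
  { type: "trigger:turn-complete", timestamp: 10 }, // T1
  { type: "trigger:turn-complete", timestamp: 20 }, // T2
];
const sessionIn: InRecord[] = [
  { timestamp: 9, message: "u2" },
  { timestamp: 19, message: "u3" },
  { timestamp: 25, message: "u4" }, // the OOM'd turn
];

const replay = [...pendingRecords(sessionIn, lastTurnComplete(sessionOut))];
console.log(replay.map((r) => r.message)); // [ 'u4' ]
```

Only `u4` survives the filter, so the new attempt re-runs exactly the turn that OOM'd.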

## With `hydrateMessages`

If your agent uses [`hydrateMessages`](/ai-chat/lifecycle-hooks#hydratemessages) to load the durable conversation history per turn, the OOM'd turn re-runs against the full prior accumulator: the model sees `[u1, a1, u2, a2, ..., u_N]` and responds in context. This is the recommended pattern for production chats.
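
For instance, combining the two — a sketch only; the hook's exact signature and the `db` helper are illustrative, see the lifecycle-hooks page for the real shape:

```ts
import { chat } from "@trigger.dev/sdk/ai";
import { streamText } from "ai";

export const myChat = chat.agent({
  id: "my-chat",
  machine: "small-1x",
  oomMachine: "medium-2x",
  // Rebuild the full durable history each turn, so an OOM retry's
  // accumulator contains [u1, a1, ..., u_N] rather than just payload.messages.
  hydrateMessages: async ({ chatId }) => db.loadMessages(chatId),
  run: async ({ messages, signal }) =>
    streamText({ model, messages, abortSignal: signal }),
});
```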

## Without `hydrateMessages`

The retry filter still prevents duplicate processing — turns 1..N-1 aren't re-run — but the OOM'd turn's accumulator is whatever the chat.agent's default flow can rebuild from `payload.messages` (typically just the first user message of the chat). The model context is **incomplete**: it doesn't see prior assistant responses. The conversation continues, but a multi-turn OOM'd recovery may produce a less coherent reply.

If conversation continuity matters, use `hydrateMessages`.

## Tool execute idempotency

If an OOM hits mid-tool-execution, the new attempt re-runs the entire turn — including the tool call. Make tool `execute` functions idempotent or checkpoint their progress externally. Trigger doesn't roll back side effects automatically.

```ts
import { tool } from "ai";
import { z } from "zod";

// `mailer` is your email client.
export const sendEmail = tool({
  description: "Send an email",
  inputSchema: z.object({ to: z.string(), idempotencyKey: z.string() }),
  execute: async ({ to, idempotencyKey }) => {
    // Stripe-style: dedupe at the side-effect layer with a customer-supplied key.
    return await mailer.send({ to, idempotencyKey });
  },
});
```
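
The key property, in miniature. This uses an in-memory store purely for illustration — production dedupe belongs in a durable store or the provider itself, as with Stripe keys — and `sendOnce` is a hypothetical helper:

```typescript
// Side-effect-layer dedupe: the same idempotency key never delivers twice,
// even when an OOM retry re-runs the whole tool call.
const sent = new Map<string, { to: string; id: string }>();
let deliveries = 0;

function sendOnce(to: string, idempotencyKey: string) {
  const existing = sent.get(idempotencyKey);
  if (existing) return existing; // retry path: return the original result
  deliveries++;
  const result = { to, id: `email_${deliveries}` };
  sent.set(idempotencyKey, result);
  return result;
}

// Attempt 1 sends, then OOMs; attempt 2 re-runs the same tool call.
const first = sendOnce("user@example.com", "turn4-email");
const retry = sendOnce("user@example.com", "turn4-email");
console.log(deliveries, first.id === retry.id); // 1 true
```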

## Limitations

- **One OOM retry per run.** `chat.agent` sets `maxAttempts: 2`. If attempt 2 also OOMs, the run fails. Use a sufficiently large `oomMachine` to avoid this.
- **Single fallback tier.** Only one `oomMachine`. There's no "tiered retry" (small → medium → large). If you need that, drop down to `chat.task` and configure `retry` directly.
- **Non-OOM errors don't retry.** Schema errors, model-call rejections, tool throws, etc. fail the run as before. Out-of-memory is the only retry trigger.
- **Tools mid-execution are not checkpointed.** A partially-run tool re-runs from scratch on the new attempt. Make them idempotent.

## See also

- [Lifecycle hooks](/ai-chat/lifecycle-hooks) — `onChatResume` fires on every retry attempt with `phase: "preload"` or `"turn"`
- [Database persistence](/ai-chat/patterns/database-persistence) — the `hydrateMessages` pattern this builds on for full continuity

docs/ai-chat/reference.mdx

Lines changed: 2 additions & 1 deletion
@@ -36,8 +36,9 @@ Options for `chat.agent()`.
 | `onChatSuspend` | `(event: ChatSuspendEvent) => Promise<void> \| void` || Fires right before the run suspends. See [onChatSuspend](/ai-chat/lifecycle-hooks#onchatsuspend--onchatresume) |
 | `onChatResume` | `(event: ChatResumeEvent) => Promise<void> \| void` || Fires right after the run resumes from suspension |
 | `exitAfterPreloadIdle` | `boolean` | `false` | Exit run after preload idle timeout instead of suspending. See [exitAfterPreloadIdle](/ai-chat/lifecycle-hooks#exitafterpreloadidle) |
+| `oomMachine` | `MachinePresetName` || Fallback machine when an attempt fails with OOM. Setting it enables a single OOM retry on the larger machine. See [OOM resilience](/ai-chat/patterns/oom-resilience) |

-Plus all standard [TaskOptions](/tasks/overview) — `retry`, `queue`, `machine`, `maxDuration`, **`onWait`**, **`onResume`**, **`onComplete`**, and other lifecycle hooks. Those hooks use the same parameter shapes as on a normal `task()` (including `ctx`).
+Plus most standard [TaskOptions](/tasks/overview) — `queue`, `machine`, `maxDuration`, **`onWait`**, **`onResume`**, **`onComplete`**, and other lifecycle hooks. Generic `retry` is **not** exposed on `chat.agent`; use `oomMachine` for OOM recovery, or drop down to `chat.task` if you need richer retry semantics. Standard hooks use the same parameter shapes as on a normal `task()` (including `ctx`).

 ## Task context (`ctx`)

docs/docs.json

Lines changed: 2 additions & 1 deletion
@@ -130,7 +130,8 @@
         "ai-chat/patterns/code-sandbox",
         "ai-chat/patterns/human-in-the-loop",
         "ai-chat/patterns/large-payloads",
-        "ai-chat/patterns/skills"
+        "ai-chat/patterns/skills",
+        "ai-chat/patterns/oom-resilience"
       ]
     },
     "ai-chat/client-protocol",
