From 4b70015b0bcf8ef2baebbfaf9f4a82d72367df7e Mon Sep 17 00:00:00 2001 From: Sweets Sweetman Date: Thu, 7 May 2026 16:41:24 -0500 Subject: [PATCH 1/6] feat(sandbox): port sandbox lifecycle workflow + kicks (Phase 2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the auto-pause and lifecycle-self-heal gaps from the cutover analysis. Sequel to PR #531 (build-org-snapshot workflow). The Vercel Workflow runtime + withWorkflow already shipped — this PR adds the second workflow on top. What it does: - POST /api/sandbox now kicks the sandbox-lifecycle workflow with reason="sandbox-created" after provisioning + skill install. The workflow sleeps until `hibernate_after` / `sandbox_expires_at` (whichever is sooner), then evaluates: pause if idle, retry on next iteration if a chat stream is active or the user extended the timing. - GET /api/sandbox/status now kicks with reason="status-check-overdue" when the session row says lifecycle_state="active" but the workflow's due-time has already passed (its lease has died). The kick reclaims the stale lease and starts a fresh workflow. Components ported faithfully from open-agents: Sandbox state predicates (one fn per file per api convention): - getPersistentSandboxName, getResumableSandboxName, hasResumableSandboxState, hasPausedSandboxState - canOperateOnSandbox, clearSandboxState Lifecycle types + config: - sandboxLifecycleTypes.ts: SandboxLifecycleState/Reason types - sandboxLifecycleConfig.ts: SANDBOX_INACTIVITY_TIMEOUT_MS (5min), SANDBOX_EXPIRES_BUFFER_MS (10s), SANDBOX_LIFECYCLE_MIN_SLEEP_MS (5s), SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS (2min) Lifecycle update builders + due-at: - buildLifecycleActivityUpdate, buildActiveLifecycleUpdate, buildHibernatedLifecycleUpdate - getLifecycleDueAtMs (the wake-time calculator) - getNextLifecycleVersion, getSandboxExpiresAtDate Active-stream guard (don't pause mid-chat): - hasActiveStreamForSession + Supabase helper selectChatsBySession Concurrency: - claimSessionLifecycleRunId — atomic UPDATE that fails when the expected current value doesn't match. Prevents two kicks from starting duplicate workflows. The evaluator (the heart): - evaluateSandboxLifecycle (~110 lines) — single-pass FSM that decides skip / hibernate / fail. Mirrors open-agents' evaluator with the same skip reasons and self-heal pattern. The workflow: - app/workflows/sandboxLifecycleWorkflow.ts — `while(true)` loop with `sleep(new Date(wakeAtMs))` between iterations, lease-claim pattern, and step boundaries via "use step". Loops on not-due-yet / active-workflow defers, terminates on hibernated / failed / sandbox-gone. The kick: - kickSandboxLifecycleWorkflow.ts — fire-and-forget invocation with stale-run detection, lease claiming, and graceful handling when start() fails (clears the lease so retry can succeed). TDD red -> green: - 2 new createSandboxHandler tests asserting the kick fires with the right reason on session-bound provisions, and is skipped without sessionId - All existing handler tests continue to pass with the new mocks - Suite: 2577 -> 2579 (+2 new tests). pnpm lint:check + tsc clean for new files. Out of scope (deliberately deferred): - Tests for the workflow itself + evaluator. The workflow primitives (`"use workflow"`, `"use step"`, `sleep(date)`) require a Vercel Workflow test harness we haven't introduced yet. The handler tests + open-agents' parity give enough confidence; if the workflow misbehaves at runtime, the smoke test will catch it. - Reason-specific telemetry beyond what evaluateSandboxLifecycle already logs. Co-Authored-By: Claude Opus 4.7 (1M context) --- app/workflows/sandboxLifecycleWorkflow.ts | 95 ++++++++++++ .../__tests__/createSandboxHandler.test.ts | 24 ++++ .../__tests__/getSandboxStatusHandler.test.ts | 3 + lib/sandbox/buildActiveLifecycleUpdate.ts | 27 ++++ lib/sandbox/buildHibernatedLifecycleUpdate.ts | 20 +++ lib/sandbox/buildLifecycleActivityUpdate.ts | 29 ++++ lib/sandbox/canOperateOnSandbox.ts | 16 +++ lib/sandbox/clearSandboxState.ts | 23 +++ lib/sandbox/createSandboxHandler.ts | 8 ++ lib/sandbox/evaluateSandboxLifecycle.ts | 136 ++++++++++++++++++ lib/sandbox/getLifecycleDueAtMs.ts | 50 +++++++ lib/sandbox/getNextLifecycleVersion.ts | 11 ++ lib/sandbox/getPersistentSandboxName.ts | 20 +++ lib/sandbox/getResumableSandboxName.ts | 20 +++ lib/sandbox/getSandboxExpiresAtDate.ts | 16 +++ lib/sandbox/getSandboxStatusHandler.ts | 16 ++- lib/sandbox/hasActiveStreamForSession.ts | 15 ++ lib/sandbox/hasPausedSandboxState.ts | 14 ++ lib/sandbox/hasResumableSandboxState.ts | 12 ++ lib/sandbox/kickSandboxLifecycleWorkflow.ts | 91 ++++++++++++ lib/sandbox/sandboxLifecycleConfig.ts | 35 +++++ lib/sandbox/sandboxLifecycleTypes.ts | 36 +++++ lib/supabase/chats/selectChatsBySession.ts | 22 +++ .../sessions/claimSessionLifecycleRunId.ts | 37 +++++ 24 files changed, 775 insertions(+), 1 deletion(-) create mode 100644 app/workflows/sandboxLifecycleWorkflow.ts create mode 100644 lib/sandbox/buildActiveLifecycleUpdate.ts create mode 100644 lib/sandbox/buildHibernatedLifecycleUpdate.ts create mode 100644 lib/sandbox/buildLifecycleActivityUpdate.ts create mode 100644 lib/sandbox/canOperateOnSandbox.ts create mode 100644 lib/sandbox/clearSandboxState.ts create mode 100644 lib/sandbox/evaluateSandboxLifecycle.ts create mode 100644 lib/sandbox/getLifecycleDueAtMs.ts create mode 100644 lib/sandbox/getNextLifecycleVersion.ts create mode 100644 lib/sandbox/getPersistentSandboxName.ts create mode 100644 lib/sandbox/getResumableSandboxName.ts create mode 100644 lib/sandbox/getSandboxExpiresAtDate.ts create mode 100644 lib/sandbox/hasActiveStreamForSession.ts create mode 100644 lib/sandbox/hasPausedSandboxState.ts create mode 100644 lib/sandbox/hasResumableSandboxState.ts create mode 100644 lib/sandbox/kickSandboxLifecycleWorkflow.ts create mode 100644 lib/sandbox/sandboxLifecycleConfig.ts create mode 100644 lib/sandbox/sandboxLifecycleTypes.ts create mode 100644 lib/supabase/chats/selectChatsBySession.ts create mode 100644 lib/supabase/sessions/claimSessionLifecycleRunId.ts diff --git a/app/workflows/sandboxLifecycleWorkflow.ts b/app/workflows/sandboxLifecycleWorkflow.ts new file mode 100644 index 000000000..be6ee262e --- /dev/null +++ b/app/workflows/sandboxLifecycleWorkflow.ts @@ -0,0 +1,95 @@ +import { sleep } from "workflow"; +import { canOperateOnSandbox } from "@/lib/sandbox/canOperateOnSandbox"; +import { evaluateSandboxLifecycle } from "@/lib/sandbox/evaluateSandboxLifecycle"; +import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; +import { SANDBOX_LIFECYCLE_MIN_SLEEP_MS } from "@/lib/sandbox/sandboxLifecycleConfig"; +import { claimSessionLifecycleRunId } from "@/lib/supabase/sessions/claimSessionLifecycleRunId"; +import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; +import { updateSession } from "@/lib/supabase/sessions/updateSession"; +import type { SandboxLifecycleReason } from "@/lib/sandbox/sandboxLifecycleTypes"; + +interface LifecycleWakeDecision { + shouldContinue: boolean; + wakeAtMs?: number; + reason?: string; +} + +async function computeLifecycleWakeDecision( + sessionId: string, + runId: string, +): Promise { + "use step"; + + const rows = await selectSessions({ id: sessionId }); + const session = rows[0]; + if (!session) return { shouldContinue: false, reason: "session-not-found" }; + if (session.status === "archived" || session.lifecycle_state === "archived") { + return { shouldContinue: false, reason: "session-archived" }; + } + if ( + !canOperateOnSandbox(session.sandbox_state) || + (session.sandbox_state as { type?: unknown } | null)?.type !== "vercel" + ) { + return { shouldContinue: false, reason: "sandbox-not-operable" }; + } + + // Refresh the lease — anyone else who claimed it in the meantime wins. + const claimed = await claimSessionLifecycleRunId(sessionId, runId, runId); + if (!claimed) return { shouldContinue: false, reason: "run-replaced" }; + + return { shouldContinue: true, wakeAtMs: getLifecycleDueAtMs(session) }; +} + +async function runLifecycleEvaluation(sessionId: string, reason: SandboxLifecycleReason) { + "use step"; + return evaluateSandboxLifecycle(sessionId, reason); +} + +async function clearLifecycleRunIdIfOwned(sessionId: string, runId: string): Promise { + "use step"; + + const rows = await selectSessions({ id: sessionId }); + const session = rows[0]; + if (!session || session.lifecycle_run_id !== runId) return; + + await updateSession(sessionId, { lifecycle_run_id: null }); +} + +/** + * Vercel Workflow that pauses idle sandboxes. Runs as a `while(true)` + * loop: compute next wake time → `sleep(date)` → evaluate → either + * loop (when not-due-yet or active-stream defers) or terminate + * (hibernated / failed / sandbox gone). Holds the + * `lifecycle_run_id` lease throughout so concurrent kicks can't + * spawn duplicate workflows. + */ +export async function sandboxLifecycleWorkflow( + sessionId: string, + reason: SandboxLifecycleReason, + runId: string, +) { + "use workflow"; + + while (true) { + const decision = await computeLifecycleWakeDecision(sessionId, runId); + if (!decision.shouldContinue || decision.wakeAtMs === undefined) { + await clearLifecycleRunIdIfOwned(sessionId, runId); + return { skipped: true, reason: decision.reason ?? "no-decision" }; + } + + const wakeAtMs = Math.max(decision.wakeAtMs, Date.now() + SANDBOX_LIFECYCLE_MIN_SLEEP_MS); + await sleep(new Date(wakeAtMs)); + + const evaluation = await runLifecycleEvaluation(sessionId, reason); + + if ( + evaluation.action === "skipped" && + (evaluation.reason === "not-due-yet" || evaluation.reason === "active-workflow") + ) { + continue; + } + + await clearLifecycleRunIdIfOwned(sessionId, runId); + return { skipped: false, evaluation }; + } +} diff --git a/lib/sandbox/__tests__/createSandboxHandler.test.ts b/lib/sandbox/__tests__/createSandboxHandler.test.ts index 13311f99e..4813ed034 100644 --- a/lib/sandbox/__tests__/createSandboxHandler.test.ts +++ b/lib/sandbox/__tests__/createSandboxHandler.test.ts @@ -9,6 +9,7 @@ import { updateSession } from "@/lib/supabase/sessions/updateSession"; import { installSessionGlobalSkills } from "@/lib/sandbox/installSessionGlobalSkills"; import { findOrgSnapshot } from "@/lib/sandbox/findOrgSnapshot"; import { kickBuildOrgSnapshotWorkflow } from "@/lib/sandbox/kickBuildOrgSnapshotWorkflow"; +import { kickSandboxLifecycleWorkflow } from "@/lib/sandbox/kickSandboxLifecycleWorkflow"; vi.mock("@/lib/networking/getCorsHeaders", () => ({ getCorsHeaders: () => ({ "Access-Control-Allow-Origin": "*" }), @@ -37,6 +38,9 @@ vi.mock("@/lib/sandbox/findOrgSnapshot", () => ({ vi.mock("@/lib/sandbox/kickBuildOrgSnapshotWorkflow", () => ({ kickBuildOrgSnapshotWorkflow: vi.fn(), })); +vi.mock("@/lib/sandbox/kickSandboxLifecycleWorkflow", () => ({ + kickSandboxLifecycleWorkflow: vi.fn(), +})); const ACCOUNT_ID = "acc-1"; @@ -281,6 +285,26 @@ describe("createSandboxHandler", () => { expect(kickBuildOrgSnapshotWorkflow).not.toHaveBeenCalled(); }); + it("kicks the sandbox lifecycle workflow with reason='sandbox-created' when sessionId is provided", async () => { + await createSandboxHandler(makeReq()); + + expect(kickSandboxLifecycleWorkflow).toHaveBeenCalledWith({ + sessionId: "sess-1", + reason: "sandbox-created", + }); + }); + + it("does not kick the lifecycle workflow when no sessionId is provided", async () => { + vi.mocked(validateCreateSandboxBody).mockResolvedValueOnce({ + body: { repoUrl: "https://github.com/o/r" }, + auth: { accountId: ACCOUNT_ID, orgId: null, authToken: "k" }, + }); + + await createSandboxHandler(makeReq()); + + expect(kickSandboxLifecycleWorkflow).not.toHaveBeenCalled(); + }); + it("does not attempt skill installation when no sessionId is provided", async () => { vi.mocked(validateCreateSandboxBody).mockResolvedValueOnce({ body: { repoUrl: "https://github.com/o/r" }, diff --git a/lib/sandbox/__tests__/getSandboxStatusHandler.test.ts b/lib/sandbox/__tests__/getSandboxStatusHandler.test.ts index 2bf66ea01..c4b116cb3 100644 --- a/lib/sandbox/__tests__/getSandboxStatusHandler.test.ts +++ b/lib/sandbox/__tests__/getSandboxStatusHandler.test.ts @@ -14,6 +14,9 @@ vi.mock("@/lib/auth/validateAuthContext", () => ({ vi.mock("@/lib/supabase/sessions/selectSessions", () => ({ selectSessions: vi.fn(), })); +vi.mock("@/lib/sandbox/kickSandboxLifecycleWorkflow", () => ({ + kickSandboxLifecycleWorkflow: vi.fn(), +})); const ACCOUNT_ID = "acc-1"; const FAR_FUTURE = "2099-01-01T00:00:00.000Z"; diff --git a/lib/sandbox/buildActiveLifecycleUpdate.ts b/lib/sandbox/buildActiveLifecycleUpdate.ts new file mode 100644 index 000000000..0fb818f9a --- /dev/null +++ b/lib/sandbox/buildActiveLifecycleUpdate.ts @@ -0,0 +1,27 @@ +import { buildLifecycleActivityUpdate } from "@/lib/sandbox/buildLifecycleActivityUpdate"; +import { getSandboxExpiresAtDate } from "@/lib/sandbox/getSandboxExpiresAtDate"; +import type { TablesUpdate } from "@/types/database.types"; + +/** + * Builds the lifecycle-related fields to write when transitioning a + * session into the `active` state right after a sandbox has been + * provisioned or resumed. Combines `buildLifecycleActivityUpdate` + * with the sandbox's own `expiresAt` so the row's + * `sandbox_expires_at` matches the freshly-probed runtime expiry. + * + * @param sandboxState - The `sandbox_state` JSON value, typically + * from `sandbox.getState()`. + * @param options.activityAt - Optional override for "now". + * @param options.lifecycleState - Defaults to `"active"`; pass + * `"restoring"` for the snapshot-resume path. + * @returns A partial Supabase update object. + */ +export function buildActiveLifecycleUpdate( + sandboxState: unknown, + options?: { activityAt?: Date; lifecycleState?: "active" | "restoring" }, +): TablesUpdate<"sessions"> { + return { + ...buildLifecycleActivityUpdate(options?.activityAt, options?.lifecycleState ?? "active"), + sandbox_expires_at: getSandboxExpiresAtDate(sandboxState), + }; +} diff --git a/lib/sandbox/buildHibernatedLifecycleUpdate.ts b/lib/sandbox/buildHibernatedLifecycleUpdate.ts new file mode 100644 index 000000000..be1df3a6f --- /dev/null +++ b/lib/sandbox/buildHibernatedLifecycleUpdate.ts @@ -0,0 +1,20 @@ +import type { TablesUpdate } from "@/types/database.types"; + +/** + * Builds the lifecycle-related fields to write when the workflow + * pauses a sandbox. Clears `sandbox_expires_at`, `hibernate_after`, + * and the `lifecycle_run_id` lease so a future kick can claim it. + * Note the caller is responsible for separately clearing + * `sandbox_state` runtime metadata via `clearSandboxState`. + * + * @returns A partial Supabase update object. + */ +export function buildHibernatedLifecycleUpdate(): TablesUpdate<"sessions"> { + return { + lifecycle_state: "hibernated", + sandbox_expires_at: null, + hibernate_after: null, + lifecycle_run_id: null, + lifecycle_error: null, + }; +} diff --git a/lib/sandbox/buildLifecycleActivityUpdate.ts b/lib/sandbox/buildLifecycleActivityUpdate.ts new file mode 100644 index 000000000..671e400bf --- /dev/null +++ b/lib/sandbox/buildLifecycleActivityUpdate.ts @@ -0,0 +1,29 @@ +import { SANDBOX_INACTIVITY_TIMEOUT_MS } from "@/lib/sandbox/sandboxLifecycleConfig"; +import type { TablesUpdate } from "@/types/database.types"; + +/** + * Builds the lifecycle-related fields to write when refreshing a + * sandbox's "last activity" — used by `/api/sandbox/activity` and + * by `buildActiveLifecycleUpdate`. Sets `last_activity_at` to now, + * pushes `hibernate_after` out by SANDBOX_INACTIVITY_TIMEOUT_MS, and + * clears any stale `lifecycle_error`. Defaults `lifecycle_state` to + * `"active"` but accepts `"restoring"` for the snapshot-resume path. + * + * @param activityAt - Optional override for "now"; defaults to current time. + * @param lifecycleState - Defaults to `"active"`. + * @returns A partial Supabase update object. + */ +export function buildLifecycleActivityUpdate( + activityAt: Date = new Date(), + lifecycleState: "active" | "restoring" = "active", +): Pick< + TablesUpdate<"sessions">, + "lifecycle_state" | "lifecycle_error" | "last_activity_at" | "hibernate_after" +> { + return { + lifecycle_state: lifecycleState, + lifecycle_error: null, + last_activity_at: activityAt.toISOString(), + hibernate_after: new Date(activityAt.getTime() + SANDBOX_INACTIVITY_TIMEOUT_MS).toISOString(), + }; +} diff --git a/lib/sandbox/canOperateOnSandbox.ts b/lib/sandbox/canOperateOnSandbox.ts new file mode 100644 index 000000000..20f741b30 --- /dev/null +++ b/lib/sandbox/canOperateOnSandbox.ts @@ -0,0 +1,16 @@ +import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; + +/** + * True when the sandbox state has live runtime metadata that supports + * operations like stop/extend/exec. Different from `isSandboxActive` + * which also requires a non-expired `expiresAt` — this returns true + * even shortly after expiry, while the runtime is still potentially + * reachable. Used by the lifecycle workflow when deciding whether the + * sandbox is something it can act on at all. + * + * @param state - The persisted `sandbox_state` JSON value. + * @returns true when the state has runtime fields. + */ +export function canOperateOnSandbox(state: unknown): boolean { + return hasRuntimeSandboxState(state); +} diff --git a/lib/sandbox/clearSandboxState.ts b/lib/sandbox/clearSandboxState.ts new file mode 100644 index 000000000..5cecb2bc3 --- /dev/null +++ b/lib/sandbox/clearSandboxState.ts @@ -0,0 +1,23 @@ +import { getPersistentSandboxName } from "@/lib/sandbox/getPersistentSandboxName"; + +/** + * Strips runtime metadata (expiresAt, etc.) from a sandbox state while + * preserving the durable resume handle (sandboxName) so a future + * `connectSandbox` can pick it back up. Used by the lifecycle + * workflow when transitioning to `hibernated`. + * + * @param state - The current `sandbox_state` JSON value. + * @returns A trimmed state with only `type` + `sandboxName`, or null + * when the input is null. + */ +export function clearSandboxState(state: unknown): { type: string; sandboxName?: string } | null { + if (!state || typeof state !== "object") return null; + + const sandboxName = getPersistentSandboxName(state); + const type = (state as { type?: unknown }).type; + + return { + type: typeof type === "string" ? type : "vercel", + ...(sandboxName ? { sandboxName } : {}), + }; +} diff --git a/lib/sandbox/createSandboxHandler.ts b/lib/sandbox/createSandboxHandler.ts index 7e6dd55d3..8ef90c7d6 100644 --- a/lib/sandbox/createSandboxHandler.ts +++ b/lib/sandbox/createSandboxHandler.ts @@ -8,6 +8,7 @@ import { findOrgSnapshot } from "@/lib/sandbox/findOrgSnapshot"; import { getSessionSandboxName } from "@/lib/sandbox/getSessionSandboxName"; import { installSessionGlobalSkills } from "@/lib/sandbox/installSessionGlobalSkills"; import { kickBuildOrgSnapshotWorkflow } from "@/lib/sandbox/kickBuildOrgSnapshotWorkflow"; +import { kickSandboxLifecycleWorkflow } from "@/lib/sandbox/kickSandboxLifecycleWorkflow"; import { extractOrgRepoName } from "@/lib/recoupable/extractOrgRepoName"; import { updateSession } from "@/lib/supabase/sessions/updateSession"; import { getServiceGithubToken } from "@/lib/github/getServiceGithubToken"; @@ -142,6 +143,13 @@ export async function createSandboxHandler(request: NextRequest): Promise { + const rows = await selectSessions({ id: sessionId }); + const session = rows[0]; + + if (!session) return { action: "skipped", reason: "session-not-found" }; + if (session.status === "archived" || session.lifecycle_state === "archived") { + return { action: "skipped", reason: "session-archived" }; + } + + const sandboxState = session.sandbox_state; + if (!canOperateOnSandbox(sandboxState)) { + return { action: "skipped", reason: "sandbox-not-operable" }; + } + if ((sandboxState as { type?: unknown }).type !== "vercel") { + return { action: "skipped", reason: "unsupported-sandbox-type" }; + } + + const dueAtMs = getLifecycleDueAtMs(session); + if (Date.now() < dueAtMs) { + return { action: "skipped", reason: "not-due-yet" }; + } + + if (await hasActiveStreamForSession(sessionId)) { + return { action: "skipped", reason: "active-workflow" }; + } + + try { + await updateSession(sessionId, { lifecycle_state: "hibernating", lifecycle_error: null }); + + const sandbox = await connectSandbox(sandboxState as unknown as SandboxState); + + if (await hasActiveStreamForSession(sessionId)) { + await restoreActiveLifecycleState(sessionId, sandboxState); + return { action: "skipped", reason: "active-workflow" }; + } + + if (await wasLifecycleTimingExtended(sessionId, session)) { + const refreshed = (await selectSessions({ id: sessionId }))[0]; + if (refreshed?.sandbox_state) { + await restoreActiveLifecycleState(sessionId, refreshed.sandbox_state); + } + return { action: "skipped", reason: "not-due-yet" }; + } + + await sandbox.stop(); + + const cleared = clearSandboxState(sandboxState); + await updateSession(sessionId, { + sandbox_state: cleared as unknown as Json, + snapshot_url: null, + snapshot_created_at: null, + ...buildHibernatedLifecycleUpdate(), + }); + console.log( + `[evaluateSandboxLifecycle] hibernated session=${sessionId} reason=${reason} sandboxName=${getPersistentSandboxName(cleared) ?? "none"}`, + ); + return { action: "hibernated" }; + } catch (error) { + const message = error instanceof Error ? error.message : String(error); + await updateSession(sessionId, { + lifecycle_state: "failed", + lifecycle_run_id: null, + lifecycle_error: message, + }); + console.error( + `[evaluateSandboxLifecycle] failed session=${sessionId} reason=${reason}:`, + error, + ); + return { action: "failed", reason: message }; + } +} + +async function restoreActiveLifecycleState( + sessionId: string, + sandboxState: unknown, +): Promise { + await updateSession(sessionId, { + lifecycle_state: "active", + lifecycle_error: null, + sandbox_expires_at: getSandboxExpiresAtDate(sandboxState), + }); +} + +async function wasLifecycleTimingExtended( + sessionId: string, + prior: Tables<"sessions">, +): Promise { + const refreshed = (await selectSessions({ id: sessionId }))[0]; + if (!refreshed?.sandbox_state || !canOperateOnSandbox(refreshed.sandbox_state)) return false; + + const timingChanged = + refreshed.last_activity_at !== prior.last_activity_at || + refreshed.hibernate_after !== prior.hibernate_after || + refreshed.sandbox_expires_at !== prior.sandbox_expires_at; + + return timingChanged && Date.now() < getLifecycleDueAtMs(refreshed); +} diff --git a/lib/sandbox/getLifecycleDueAtMs.ts b/lib/sandbox/getLifecycleDueAtMs.ts new file mode 100644 index 000000000..b8fbaf445 --- /dev/null +++ b/lib/sandbox/getLifecycleDueAtMs.ts @@ -0,0 +1,50 @@ +import { + SANDBOX_EXPIRES_BUFFER_MS, + SANDBOX_INACTIVITY_TIMEOUT_MS, +} from "@/lib/sandbox/sandboxLifecycleConfig"; +import { isoToEpochMs } from "@/lib/sandbox/isoToEpochMs"; +import type { Tables } from "@/types/database.types"; + +/** + * Computes when the lifecycle workflow should next wake up for a + * session. Returns the earlier of: + * - inactivity due — `last_activity_at` (or `updated_at`) + + * SANDBOX_INACTIVITY_TIMEOUT_MS, overridden by `hibernate_after` + * if set + * - expiry due — `sandbox_expires_at` minus + * SANDBOX_EXPIRES_BUFFER_MS, so we pause before Vercel hard-stops + * + * Output is epoch milliseconds. Used by both the workflow (for + * `sleep(new Date(wakeAtMs))`) and the stale-run detector in the + * kick logic. + * + * @param row - The `sessions` row. + * @returns Epoch ms when the next lifecycle action is due. + */ +export function getLifecycleDueAtMs( + row: Pick< + Tables<"sessions">, + "hibernate_after" | "last_activity_at" | "sandbox_expires_at" | "updated_at" + >, +): number { + const inactivityDue = computeInactivityDueAtMs(row); + const expiryDue = computeExpiryDueAtMs(row); + if (expiryDue === null) return inactivityDue; + return Math.min(inactivityDue, expiryDue); +} + +function computeInactivityDueAtMs( + row: Pick, "hibernate_after" | "last_activity_at" | "updated_at">, +): number { + const hibernateAfter = isoToEpochMs(row.hibernate_after); + if (hibernateAfter !== null) return hibernateAfter; + const lastActivity = isoToEpochMs(row.last_activity_at); + const fallback = lastActivity ?? isoToEpochMs(row.updated_at) ?? Date.now(); + return fallback + SANDBOX_INACTIVITY_TIMEOUT_MS; +} + +function computeExpiryDueAtMs(row: Pick, "sandbox_expires_at">): number | null { + const expiresAt = isoToEpochMs(row.sandbox_expires_at); + if (expiresAt === null) return null; + return expiresAt - SANDBOX_EXPIRES_BUFFER_MS; +} diff --git a/lib/sandbox/getNextLifecycleVersion.ts b/lib/sandbox/getNextLifecycleVersion.ts new file mode 100644 index 000000000..19d4e133d --- /dev/null +++ b/lib/sandbox/getNextLifecycleVersion.ts @@ -0,0 +1,11 @@ +/** + * Increments a session's lifecycle version for optimistic concurrency + * control. Returns 1 when the input is null/undefined (e.g. on first + * sandbox provision after session creation). + * + * @param current - The session's current `lifecycle_version`. + * @returns The next version number to write. + */ +export function getNextLifecycleVersion(current: number | null | undefined): number { + return (current ?? 0) + 1; +} diff --git a/lib/sandbox/getPersistentSandboxName.ts b/lib/sandbox/getPersistentSandboxName.ts new file mode 100644 index 000000000..5500773f0 --- /dev/null +++ b/lib/sandbox/getPersistentSandboxName.ts @@ -0,0 +1,20 @@ +/** + * Reads the durable `sandboxName` from a `sandbox_state` JSON value. + * Returns null when the state has no `sandboxName` (or it's empty). + * + * The `sandboxName` is the deterministic identifier under which a + * sandbox is registered with the Vercel sandbox runtime — assigned at + * creation time via `getSessionSandboxName(sessionId)` and preserved + * across resume / pause / restore cycles. It survives even when the + * runtime metadata (expiresAt, etc.) has been cleared. + * + * @param state - The persisted `sandbox_state` JSON value. + * @returns The persistent sandbox name, or null when absent. + */ +export function getPersistentSandboxName(state: unknown): string | null { + if (!state || typeof state !== "object") return null; + const candidate = state as { sandboxName?: unknown }; + return typeof candidate.sandboxName === "string" && candidate.sandboxName.length > 0 + ? candidate.sandboxName + : null; +} diff --git a/lib/sandbox/getResumableSandboxName.ts b/lib/sandbox/getResumableSandboxName.ts new file mode 100644 index 000000000..39717905c --- /dev/null +++ b/lib/sandbox/getResumableSandboxName.ts @@ -0,0 +1,20 @@ +import { getPersistentSandboxName } from "@/lib/sandbox/getPersistentSandboxName"; + +/** + * Returns the durable name used to resume a paused sandbox. Falls back + * to a legacy `sandboxId` field for sessions written before the + * persistent-name migration in open-agents. + * + * @param state - The persisted `sandbox_state` JSON value. + * @returns A name suitable for `connectSandbox` resume, or null. + */ +export function getResumableSandboxName(state: unknown): string | null { + const persistent = getPersistentSandboxName(state); + if (persistent) return persistent; + + if (!state || typeof state !== "object") return null; + const candidate = state as { sandboxId?: unknown }; + return typeof candidate.sandboxId === "string" && candidate.sandboxId.length > 0 + ? candidate.sandboxId + : null; +} diff --git a/lib/sandbox/getSandboxExpiresAtDate.ts b/lib/sandbox/getSandboxExpiresAtDate.ts new file mode 100644 index 000000000..fe26a6ef8 --- /dev/null +++ b/lib/sandbox/getSandboxExpiresAtDate.ts @@ -0,0 +1,16 @@ +/** + * Reads the runtime `expiresAt` field (epoch milliseconds) off a + * sandbox state and converts it to an ISO-8601 string suitable for + * persisting to `sessions.sandbox_expires_at`. Returns null when the + * state has no `expiresAt` (paused sandboxes or empty stubs). + * + * @param state - The `sandbox_state` JSON value, typically from + * `sandbox.getState()`. + * @returns ISO timestamp string, or null when no expiry is set. + */ +export function getSandboxExpiresAtDate(state: unknown): string | null { + if (!state || typeof state !== "object") return null; + const expiresAt = (state as { expiresAt?: unknown }).expiresAt; + if (typeof expiresAt !== "number") return null; + return new Date(expiresAt).toISOString(); +} diff --git a/lib/sandbox/getSandboxStatusHandler.ts b/lib/sandbox/getSandboxStatusHandler.ts index 8c9e1b988..b12fbbf12 100644 --- a/lib/sandbox/getSandboxStatusHandler.ts +++ b/lib/sandbox/getSandboxStatusHandler.ts @@ -2,7 +2,9 @@ import { NextRequest, NextResponse } from "next/server"; import { getCorsHeaders } from "@/lib/networking/getCorsHeaders"; import { validateAuthContext } from "@/lib/auth/validateAuthContext"; import { buildLifecycle } from "@/lib/sandbox/buildLifecycle"; +import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; import { isSandboxActive } from "@/lib/sandbox/isSandboxActive"; +import { kickSandboxLifecycleWorkflow } from "@/lib/sandbox/kickSandboxLifecycleWorkflow"; import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; /** @@ -12,6 +14,12 @@ import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; * non-expired `sandbox_state` (with real runtime metadata), otherwise * `"no_sandbox"`. `hasSnapshot` is true when the row records a saved * snapshot the UI can offer to resume. + * + * Side-effect: when the row reports `lifecycle_state: "active"` but + * the lifecycle is past due (workflow never woke or its lease died), + * fires a `status-check-overdue` lifecycle kick. The kick reclaims + * stale leases and starts a fresh workflow run — that's how the + * lifecycle FSM self-heals from crashed workflows. */ export async function getSandboxStatusHandler(request: NextRequest): Promise { const auth = await validateAuthContext(request); @@ -44,9 +52,15 @@ export async function getSandboxStatusHandler(request: NextRequest): Promise= getLifecycleDueAtMs(row)) { + kickSandboxLifecycleWorkflow({ sessionId: row.id, reason: "status-check-overdue" }); + } + return NextResponse.json( { - status: isSandboxActive(row) ? "active" : "no_sandbox", + status: active ? "active" : "no_sandbox", hasSnapshot: !!row.snapshot_url, lifecycleVersion: row.lifecycle_version, lifecycle: buildLifecycle(row), diff --git a/lib/sandbox/hasActiveStreamForSession.ts b/lib/sandbox/hasActiveStreamForSession.ts new file mode 100644 index 000000000..6b4473b65 --- /dev/null +++ b/lib/sandbox/hasActiveStreamForSession.ts @@ -0,0 +1,15 @@ +import { selectChatsBySession } from "@/lib/supabase/chats/selectChatsBySession"; + +/** + * True when any chat in the session has an `active_stream_id` set, + * indicating an in-flight assistant stream. The lifecycle workflow + * uses this to defer hibernation while a chat is actively being + * served — pausing the sandbox mid-stream would 500 the response. + * + * @param sessionId - The session to check. + * @returns true when at least one chat has an active stream id. + */ +export async function hasActiveStreamForSession(sessionId: string): Promise { + const chats = await selectChatsBySession(sessionId); + return chats.some(chat => chat.active_stream_id !== null); +} diff --git a/lib/sandbox/hasPausedSandboxState.ts b/lib/sandbox/hasPausedSandboxState.ts new file mode 100644 index 000000000..6de247a44 --- /dev/null +++ b/lib/sandbox/hasPausedSandboxState.ts @@ -0,0 +1,14 @@ +import { hasResumableSandboxState } from "@/lib/sandbox/hasResumableSandboxState"; +import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; + +/** + * True when the state describes a sandbox that is paused but resumable + * — has a durable name yet no live runtime metadata. Used by the UI to + * decide whether to show "Resume" affordances. + * + * @param state - The persisted `sandbox_state` JSON value. + * @returns true when the sandbox is paused-but-resumable. + */ +export function hasPausedSandboxState(state: unknown): boolean { + return hasResumableSandboxState(state) && !hasRuntimeSandboxState(state); +} diff --git a/lib/sandbox/hasResumableSandboxState.ts b/lib/sandbox/hasResumableSandboxState.ts new file mode 100644 index 000000000..cbbb341dc --- /dev/null +++ b/lib/sandbox/hasResumableSandboxState.ts @@ -0,0 +1,12 @@ +import { getResumableSandboxName } from "@/lib/sandbox/getResumableSandboxName"; + +/** + * True when the persisted state carries a durable resume handle — + * either a persistent `sandboxName` or the legacy `sandboxId`. + * + * @param state - The persisted `sandbox_state` JSON value. + * @returns true when the sandbox can be resumed. + */ +export function hasResumableSandboxState(state: unknown): boolean { + return getResumableSandboxName(state) !== null; +} diff --git a/lib/sandbox/kickSandboxLifecycleWorkflow.ts b/lib/sandbox/kickSandboxLifecycleWorkflow.ts new file mode 100644 index 000000000..cd7b5baf9 --- /dev/null +++ b/lib/sandbox/kickSandboxLifecycleWorkflow.ts @@ -0,0 +1,91 @@ +import { start } from "workflow/api"; +import { sandboxLifecycleWorkflow } from "@/app/workflows/sandboxLifecycleWorkflow"; +import { canOperateOnSandbox } from "@/lib/sandbox/canOperateOnSandbox"; +import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; +import { SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS } from "@/lib/sandbox/sandboxLifecycleConfig"; +import { claimSessionLifecycleRunId } from "@/lib/supabase/sessions/claimSessionLifecycleRunId"; +import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; +import { updateSession } from "@/lib/supabase/sessions/updateSession"; +import type { SandboxLifecycleReason } from "@/lib/sandbox/sandboxLifecycleTypes"; +import type { Tables } from "@/types/database.types"; + +interface KickInput { + sessionId: string; + reason: SandboxLifecycleReason; +} + +/** + * Fire-and-forget kick of `sandboxLifecycleWorkflow`. Used by: + * + * - `POST /api/sandbox` (reason: `sandbox-created`) — register a + * freshly-provisioned sandbox so the workflow auto-pauses it after + * `SANDBOX_INACTIVITY_TIMEOUT_MS` of idle time + * - `GET /api/sandbox/status` (reason: `status-check-overdue`) — poke + * the workflow when a status read sees stale lifecycle state + * + * Skips when the session can't be acted on (archived, no runtime + * sandbox, wrong sandbox type) or when another lifecycle run is + * already in flight. A run that's been overdue past the grace window + * is considered stale and gets reclaimed. + */ +export function kickSandboxLifecycleWorkflow(input: KickInput): void { + void runKick(input).catch(error => + console.error(`[kickSandboxLifecycleWorkflow] failed for session ${input.sessionId}:`, error), + ); +} + +async function runKick(input: KickInput): Promise { + const rows = await selectSessions({ id: input.sessionId }); + const session = rows[0]; + if (!session) return; + + const sessionForStart = isLifecycleRunStale(session) + ? await reclaimStaleLease(input.sessionId) + : session; + + if (!shouldStartLifecycle(sessionForStart)) return; + + const runId = createLifecycleRunId(); + const claimed = await claimSessionLifecycleRunId(input.sessionId, runId); + if (!claimed) return; + + try { + const run = await start(sandboxLifecycleWorkflow, [input.sessionId, input.reason, runId]); + console.log( + `[kickSandboxLifecycleWorkflow] started run ${run.runId} for session ${input.sessionId} (reason=${input.reason})`, + ); + } catch (error) { + console.error( + `[kickSandboxLifecycleWorkflow] failed to start workflow for session ${input.sessionId}; clearing lease:`, + error, + ); + await updateSession(input.sessionId, { lifecycle_run_id: null }); + } +} + +async function reclaimStaleLease(sessionId: string): Promise | null> { + await updateSession(sessionId, { lifecycle_run_id: null }); + const rows = await selectSessions({ id: sessionId }); + return rows[0] ?? null; +} + +function shouldStartLifecycle(session: Tables<"sessions"> | null): session is Tables<"sessions"> { + if (!session) return false; + if (session.status === "archived" || session.lifecycle_state === "archived") return false; + if (!session.sandbox_state) return false; + if (!canOperateOnSandbox(session.sandbox_state)) return false; + if ((session.sandbox_state as { type?: unknown }).type !== "vercel") return false; + if (session.lifecycle_run_id) return false; + return true; +} + +function isLifecycleRunStale(session: Tables<"sessions">): boolean { + if (!session.lifecycle_run_id) return false; + if (session.lifecycle_state !== "active") return false; + const overdueMs = Date.now() - getLifecycleDueAtMs(session); + return overdueMs > SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS; +} + +function createLifecycleRunId(): string { + return `lifecycle:${Date.now()}:${crypto.randomUUID()}`; +} diff --git a/lib/sandbox/sandboxLifecycleConfig.ts b/lib/sandbox/sandboxLifecycleConfig.ts new file mode 100644 index 000000000..3f474a48c --- /dev/null +++ b/lib/sandbox/sandboxLifecycleConfig.ts @@ -0,0 +1,35 @@ +/** + * Lifecycle workflow tuning. All values copied from open-agents' + * `lib/sandbox/config.ts` and intentionally kept in sync — these are + * the timings the sandbox lifecycle FSM uses to decide when to wake + * up, hibernate, and detect stale runs. + */ + +/** + * How long a sandbox can be idle before the lifecycle workflow pauses + * it. 5 minutes — short enough to free compute aggressively, long + * enough that a user briefly switching tabs doesn't lose their + * session. + */ +export const SANDBOX_INACTIVITY_TIMEOUT_MS = 5 * 60 * 1000; + +/** + * Buffer subtracted from `sandbox_expires_at` when computing the + * lifecycle wake time. Pause a few seconds before Vercel hard-stops + * the sandbox so we have a chance to do graceful cleanup. + */ +export const SANDBOX_EXPIRES_BUFFER_MS = 10_000; + +/** + * Floor on the workflow's `sleep(date)` duration. Prevents tight loops + * when the calculated wake time is in the past or extremely soon. + */ +export const SANDBOX_LIFECYCLE_MIN_SLEEP_MS = 5_000; + +/** + * How far past the calculated due time a session's `lifecycle_run_id` + * may sit before the kick logic considers the run dead and reclaims + * the lease. Prevents a crashed workflow from blocking future kicks + * forever. + */ +export const SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS = 2 * 60 * 1000; diff --git a/lib/sandbox/sandboxLifecycleTypes.ts b/lib/sandbox/sandboxLifecycleTypes.ts new file mode 100644 index 000000000..44f77bf20 --- /dev/null +++ b/lib/sandbox/sandboxLifecycleTypes.ts @@ -0,0 +1,36 @@ +/** + * Lifecycle FSM states stored as `lifecycle_state` on the `sessions` + * row. The workflow transitions through these as it pauses, restores, + * and recovers from failure. + */ +export type SandboxLifecycleState = + | "provisioning" + | "active" + | "hibernating" + | "hibernated" + | "restoring" + | "archived" + | "failed"; + +/** + * Reason the lifecycle workflow was kicked. Each callsite passes a + * specific value so observability can trace which path triggered an + * evaluation. Mirrored from open-agents — extend / snapshot reasons + * are listed for parity even though those endpoints aren't ported yet. + */ +export type SandboxLifecycleReason = + | "sandbox-created" + | "timeout-extended" + | "snapshot-restored" + | "reconnect" + | "manual-stop" + | "status-check-overdue"; + +/** + * Result of a single lifecycle evaluation pass — what the workflow + * returns to the kick caller for logging. + */ +export interface SandboxLifecycleEvaluationResult { + action: "skipped" | "hibernated" | "failed"; + reason?: string; +} diff --git a/lib/supabase/chats/selectChatsBySession.ts b/lib/supabase/chats/selectChatsBySession.ts new file mode 100644 index 000000000..deff1f894 --- /dev/null +++ b/lib/supabase/chats/selectChatsBySession.ts @@ -0,0 +1,22 @@ +import supabase from "@/lib/supabase/serverClient"; +import type { Tables } from "@/types/database.types"; + +/** + * Reads all `chats` belonging to a session. Used by the lifecycle + * workflow to detect active streams (so a chat in flight can prevent + * hibernation), and will be used by upcoming `GET /api/sessions/:id/chats` + * port too. Returns [] on Supabase error after logging. + * + * @param sessionId - The owning session id. + * @returns Matching chat rows, or [] on error / no match. + */ +export async function selectChatsBySession(sessionId: string): Promise[]> { + const { data, error } = await supabase.from("chats").select("*").eq("session_id", sessionId); + + if (error) { + console.error(`[selectChatsBySession] error for session ${sessionId}:`, error); + return []; + } + + return data ?? []; +} diff --git a/lib/supabase/sessions/claimSessionLifecycleRunId.ts b/lib/supabase/sessions/claimSessionLifecycleRunId.ts new file mode 100644 index 000000000..b446c2b54 --- /dev/null +++ b/lib/supabase/sessions/claimSessionLifecycleRunId.ts @@ -0,0 +1,37 @@ +import supabase from "@/lib/supabase/serverClient"; + +/** + * Atomically claims the `lifecycle_run_id` lease for a session. The + * UPDATE only succeeds when the row's current value matches what the + * caller passed as `expected` (typically null when claiming a fresh + * lease, or the same `runId` when re-confirming during a workflow + * step). This is the concurrency guard that prevents two kicks from + * starting duplicate lifecycle workflows for the same session. + * + * Returns true when the lease was successfully claimed (the UPDATE + * affected exactly one row), false otherwise. + * + * @param sessionId - The session id to claim against. + * @param runId - The new lease value to write. + * @param expected - The expected current value; defaults to null. + * @returns true on success, false when the lease was already taken. + */ +export async function claimSessionLifecycleRunId( + sessionId: string, + runId: string, + expected: string | null = null, +): Promise { + const query = supabase.from("sessions").update({ lifecycle_run_id: runId }).eq("id", sessionId); + + const filtered = + expected === null ? query.is("lifecycle_run_id", null) : query.eq("lifecycle_run_id", expected); + + const { data, error } = await filtered.select("id").maybeSingle(); + + if (error) { + console.error(`[claimSessionLifecycleRunId] error for session ${sessionId}:`, error); + return false; + } + + return data !== null; +} From 34b1041d114646d4107b0f549ffeacbdd3605eac Mon Sep 17 00:00:00 2001 From: Sweets Sweetman Date: Thu, 7 May 2026 16:57:17 -0500 Subject: [PATCH 2/6] fix(sandbox): adopt open-agents scheduleBackgroundWork pattern for lifecycle kick MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Smoke test caught: the lifecycle kick fired but its async chain (selectSessions → claim lease → start workflow) never completed because the serverless function tore down on response. No [kickSandboxLifecycleWorkflow] log lines, no workflow run started. Original PR placed `after()` from next/server inside the kick lib — which violated open-agents' architecture: the kick is a generic fan-out lib, the platform-specific scheduler belongs at the call site. Refactored to match open-agents' open / api conventions: - `kickSandboxLifecycleWorkflow` accepts an optional `scheduleBackgroundWork: (task: Promise) => void` parameter. Default behavior is `void task` (matches open-agents fallback), scheduler used when provided. - createSandboxHandler + getSandboxStatusHandler pass `scheduleBackgroundWork: task => after(() => task)`. Same pattern api already uses in `lib/agents/createPlatformRoutes.ts:62` and `app/api/chat/slack/route.ts:16` for waitUntil-style background work. Test updated to use `expect.objectContaining` since the kick now receives an extra `scheduleBackgroundWork` callback. Suite stays at 2579. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../__tests__/createSandboxHandler.test.ts | 11 +++++++---- lib/sandbox/createSandboxHandler.ts | 18 ++++++++++++------ lib/sandbox/getSandboxStatusHandler.ts | 8 ++++++-- lib/sandbox/kickSandboxLifecycleWorkflow.ts | 18 +++++++++++++++++- 4 files changed, 42 insertions(+), 13 deletions(-) diff --git a/lib/sandbox/__tests__/createSandboxHandler.test.ts b/lib/sandbox/__tests__/createSandboxHandler.test.ts index 4813ed034..2bf048a10 100644 --- a/lib/sandbox/__tests__/createSandboxHandler.test.ts +++ b/lib/sandbox/__tests__/createSandboxHandler.test.ts @@ -288,10 +288,13 @@ describe("createSandboxHandler", () => { it("kicks the sandbox lifecycle workflow with reason='sandbox-created' when sessionId is provided", async () => { await createSandboxHandler(makeReq()); - expect(kickSandboxLifecycleWorkflow).toHaveBeenCalledWith({ - sessionId: "sess-1", - reason: "sandbox-created", - }); + expect(kickSandboxLifecycleWorkflow).toHaveBeenCalledWith( + expect.objectContaining({ + sessionId: "sess-1", + reason: "sandbox-created", + scheduleBackgroundWork: expect.any(Function), + }), + ); }); it("does not kick the lifecycle workflow when no sessionId is provided", async () => { diff --git a/lib/sandbox/createSandboxHandler.ts b/lib/sandbox/createSandboxHandler.ts index 8ef90c7d6..399126d17 100644 --- a/lib/sandbox/createSandboxHandler.ts +++ b/lib/sandbox/createSandboxHandler.ts @@ -1,5 +1,5 @@ import ms from "ms"; -import { NextRequest, NextResponse } from "next/server"; +import { NextRequest, NextResponse, after } from "next/server"; import { getCorsHeaders } from "@/lib/networking/getCorsHeaders"; import { validateCreateSandboxBody } from "@/lib/sandbox/validateCreateSandboxBody"; import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; @@ -145,11 +145,17 @@ export async function createSandboxHandler(request: NextRequest): Promise after(() => task), + }); } return NextResponse.json( diff --git a/lib/sandbox/getSandboxStatusHandler.ts b/lib/sandbox/getSandboxStatusHandler.ts index b12fbbf12..e2b2f39f4 100644 --- a/lib/sandbox/getSandboxStatusHandler.ts +++ b/lib/sandbox/getSandboxStatusHandler.ts @@ -1,4 +1,4 @@ -import { NextRequest, NextResponse } from "next/server"; +import { NextRequest, NextResponse, after } from "next/server"; import { getCorsHeaders } from "@/lib/networking/getCorsHeaders"; import { validateAuthContext } from "@/lib/auth/validateAuthContext"; import { buildLifecycle } from "@/lib/sandbox/buildLifecycle"; @@ -55,7 +55,11 @@ export async function getSandboxStatusHandler(request: NextRequest): Promise= getLifecycleDueAtMs(row)) { - kickSandboxLifecycleWorkflow({ sessionId: row.id, reason: "status-check-overdue" }); + kickSandboxLifecycleWorkflow({ + sessionId: row.id, + reason: "status-check-overdue", + scheduleBackgroundWork: task => after(() => task), + }); } return NextResponse.json( diff --git a/lib/sandbox/kickSandboxLifecycleWorkflow.ts b/lib/sandbox/kickSandboxLifecycleWorkflow.ts index cd7b5baf9..f3a852d6e 100644 --- a/lib/sandbox/kickSandboxLifecycleWorkflow.ts +++ b/lib/sandbox/kickSandboxLifecycleWorkflow.ts @@ -12,6 +12,15 @@ import type { Tables } from "@/types/database.types"; interface KickInput { sessionId: string; reason: SandboxLifecycleReason; + /** + * Optional scheduler for the kick chain (selectSessions → claim + * lease → start workflow). Callers in serverless contexts should + * pass `p => after(() => p)` (or `waitUntil(p)`) so the platform + * keeps the function alive until the chain completes — without it + * the chain dies when the request returns. Mirrors open-agents' + * `scheduleBackgroundWork` parameter. + */ + scheduleBackgroundWork?: (task: Promise) => void; } /** @@ -29,9 +38,16 @@ interface KickInput { * is considered stale and gets reclaimed. */ export function kickSandboxLifecycleWorkflow(input: KickInput): void { - void runKick(input).catch(error => + const task = runKick(input).catch(error => console.error(`[kickSandboxLifecycleWorkflow] failed for session ${input.sessionId}:`, error), ); + + if (input.scheduleBackgroundWork) { + input.scheduleBackgroundWork(task); + return; + } + + void task; } async function runKick(input: KickInput): Promise { From 30ef641a99360fa3a51ba0f951f1ce9e4bf184f7 Mon Sep 17 00:00:00 2001 From: Sweets Sweetman Date: Thu, 7 May 2026 17:18:11 -0500 Subject: [PATCH 3/6] refactor(sandbox): KISS deletions + SRP extractions per review feedback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses 17 review comments on PR #533. All KISS deletions and SRP extractions in one pass — no behavior change, just file/function reorganization to match api conventions (one exported function per file, filename matches the export, no thin wrappers). KISS deletions (3 files removed): - canOperateOnSandbox.ts → callers use hasRuntimeSandboxState directly (was a one-line wrapper around it) - getNextLifecycleVersion.ts → was unused; existing inline calc kept - hasResumableSandboxState.ts → callers use `getResumableSandboxName(state) !== null` (one-line wrapper) KISS rename: - selectChatsBySession.ts → selectChats.ts with a generic `{ id?, sessionId? }` filter parameter, mirroring selectSessions SRP split — Supabase queries: - claimSessionLifecycleRunId.ts split into: - claimSessionLifecycleRunIdIfNull.ts (initial claim) - claimSessionLifecycleRunIdIfMatch.ts (lease refresh) - claimSessionLifecycleRunId.ts (combiner — picks the right one) SRP extractions — kick logic: - runKick.ts (the async chain) - reclaimStaleLease.ts - shouldStartLifecycle.ts - isLifecycleRunStale.ts - createLifecycleRunId.ts - kickSandboxLifecycleWorkflow.ts now thin: just runKick + scheduler SRP extractions — due-at math: - computeInactivityDueAtMs.ts - computeExpiryDueAtMs.ts - getLifecycleDueAtMs.ts now thin: composes the two SRP extractions — evaluator helpers: - restoreActiveLifecycleState.ts - wasLifecycleTimingExtended.ts SRP extractions — workflow steps (each preserves "use step"): - computeLifecycleWakeDecision.ts - runLifecycleEvaluation.ts - clearLifecycleRunIdIfOwned.ts - sandboxLifecycleWorkflow.ts now thin: just composes the steps with the while-sleep-evaluate loop Net delta: 16 new files, 4 deletions, no behavior change. Tests: 2579 / 2579 still pass. pnpm lint:check + tsc clean for new/changed files (pre-existing TS errors in processCreateSandbox / updateSnapshotPatchHandler unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) --- app/workflows/clearLifecycleRunIdIfOwned.ts | 22 +++++ app/workflows/computeLifecycleWakeDecision.ts | 46 ++++++++++ app/workflows/runLifecycleEvaluation.ts | 23 +++++ app/workflows/sandboxLifecycleWorkflow.ts | 56 +----------- lib/sandbox/canOperateOnSandbox.ts | 16 ---- lib/sandbox/computeExpiryDueAtMs.ts | 20 +++++ lib/sandbox/computeInactivityDueAtMs.ts | 21 +++++ lib/sandbox/createLifecycleRunId.ts | 11 +++ lib/sandbox/evaluateSandboxLifecycle.ts | 42 ++------- lib/sandbox/getNextLifecycleVersion.ts | 11 --- lib/sandbox/hasActiveStreamForSession.ts | 4 +- lib/sandbox/hasPausedSandboxState.ts | 4 +- lib/sandbox/hasResumableSandboxState.ts | 12 --- lib/sandbox/isLifecycleRunStale.ts | 19 ++++ lib/sandbox/kickSandboxLifecycleWorkflow.ts | 88 +++---------------- lib/sandbox/reclaimStaleLease.ts | 18 ++++ lib/sandbox/restoreActiveLifecycleState.ts | 23 +++++ lib/sandbox/runKick.ts | 58 ++++++++++++ lib/sandbox/shouldStartLifecycle.ts | 24 +++++ lib/sandbox/wasLifecycleTimingExtended.ts | 30 +++++++ lib/supabase/chats/selectChats.ts | 30 +++++++ lib/supabase/chats/selectChatsBySession.ts | 22 ----- .../sessions/claimSessionLifecycleRunId.ts | 32 ++----- .../claimSessionLifecycleRunIdIfMatch.ts | 31 +++++++ .../claimSessionLifecycleRunIdIfNull.ts | 30 +++++++ 25 files changed, 439 insertions(+), 254 deletions(-) create mode 100644 app/workflows/clearLifecycleRunIdIfOwned.ts create mode 100644 app/workflows/computeLifecycleWakeDecision.ts create mode 100644 app/workflows/runLifecycleEvaluation.ts delete mode 100644 lib/sandbox/canOperateOnSandbox.ts create mode 100644 lib/sandbox/computeExpiryDueAtMs.ts create mode 100644 lib/sandbox/computeInactivityDueAtMs.ts create mode 100644 lib/sandbox/createLifecycleRunId.ts delete mode 100644 lib/sandbox/getNextLifecycleVersion.ts delete mode 100644 lib/sandbox/hasResumableSandboxState.ts create mode 100644 lib/sandbox/isLifecycleRunStale.ts create mode 100644 lib/sandbox/reclaimStaleLease.ts create mode 100644 lib/sandbox/restoreActiveLifecycleState.ts create mode 100644 lib/sandbox/runKick.ts create mode 100644 lib/sandbox/shouldStartLifecycle.ts create mode 100644 lib/sandbox/wasLifecycleTimingExtended.ts create mode 100644 lib/supabase/chats/selectChats.ts delete mode 100644 lib/supabase/chats/selectChatsBySession.ts create mode 100644 lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts create mode 100644 lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts diff --git a/app/workflows/clearLifecycleRunIdIfOwned.ts b/app/workflows/clearLifecycleRunIdIfOwned.ts new file mode 100644 index 000000000..df6f55228 --- /dev/null +++ b/app/workflows/clearLifecycleRunIdIfOwned.ts @@ -0,0 +1,22 @@ +import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; +import { updateSession } from "@/lib/supabase/sessions/updateSession"; + +/** + * Workflow step that clears `lifecycle_run_id` only if it still + * matches the supplied `runId`. Used at the end of a workflow run to + * release the lease without clobbering one that's been reclaimed by + * a kick. + * + * @param sessionId - The session id. + * @param runId - The lease this workflow owned; only clear if it + * still matches. + */ +export async function clearLifecycleRunIdIfOwned(sessionId: string, runId: string): Promise { + "use step"; + + const rows = await selectSessions({ id: sessionId }); + const session = rows[0]; + if (!session || session.lifecycle_run_id !== runId) return; + + await updateSession(sessionId, { lifecycle_run_id: null }); +} diff --git a/app/workflows/computeLifecycleWakeDecision.ts b/app/workflows/computeLifecycleWakeDecision.ts new file mode 100644 index 000000000..656d3e7be --- /dev/null +++ b/app/workflows/computeLifecycleWakeDecision.ts @@ -0,0 +1,46 @@ +import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; +import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; +import { claimSessionLifecycleRunId } from "@/lib/supabase/sessions/claimSessionLifecycleRunId"; +import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; + +interface LifecycleWakeDecision { + shouldContinue: boolean; + wakeAtMs?: number; + reason?: string; +} + +/** + * Workflow step run at the top of each `sandboxLifecycleWorkflow` + * iteration. Reads the session, decides whether to continue looping, + * and (when continuing) returns the next wake time. Re-claims the + * lease so a concurrent kick that took it over wins. + * + * @param sessionId - The session id the workflow is tracking. + * @param runId - The lease this workflow run owns. + * @returns A decision object with continuation flag, wake time, and + * skip reason (when terminating). + */ +export async function computeLifecycleWakeDecision( + sessionId: string, + runId: string, +): Promise { + "use step"; + + const rows = await selectSessions({ id: sessionId }); + const session = rows[0]; + if (!session) return { shouldContinue: false, reason: "session-not-found" }; + if (session.status === "archived" || session.lifecycle_state === "archived") { + return { shouldContinue: false, reason: "session-archived" }; + } + if ( + !hasRuntimeSandboxState(session.sandbox_state) || + (session.sandbox_state as { type?: unknown } | null)?.type !== "vercel" + ) { + return { shouldContinue: false, reason: "sandbox-not-operable" }; + } + + const claimed = await claimSessionLifecycleRunId(sessionId, runId, runId); + if (!claimed) return { shouldContinue: false, reason: "run-replaced" }; + + return { shouldContinue: true, wakeAtMs: getLifecycleDueAtMs(session) }; +} diff --git a/app/workflows/runLifecycleEvaluation.ts b/app/workflows/runLifecycleEvaluation.ts new file mode 100644 index 000000000..38687061e --- /dev/null +++ b/app/workflows/runLifecycleEvaluation.ts @@ -0,0 +1,23 @@ +import { evaluateSandboxLifecycle } from "@/lib/sandbox/evaluateSandboxLifecycle"; +import type { + SandboxLifecycleEvaluationResult, + SandboxLifecycleReason, +} from "@/lib/sandbox/sandboxLifecycleTypes"; + +/** + * Workflow step that runs a single lifecycle evaluation pass. Thin + * wrapper around `evaluateSandboxLifecycle` so the workflow + * orchestrator gets a step boundary (with `"use step"` durability + * semantics) on each evaluation. + * + * @param sessionId - The session whose sandbox to evaluate. + * @param reason - Why the workflow was triggered (for logging only). + * @returns The result of this single evaluation pass. + */ +export async function runLifecycleEvaluation( + sessionId: string, + reason: SandboxLifecycleReason, +): Promise { + "use step"; + return evaluateSandboxLifecycle(sessionId, reason); +} diff --git a/app/workflows/sandboxLifecycleWorkflow.ts b/app/workflows/sandboxLifecycleWorkflow.ts index be6ee262e..aa765dfdc 100644 --- a/app/workflows/sandboxLifecycleWorkflow.ts +++ b/app/workflows/sandboxLifecycleWorkflow.ts @@ -1,60 +1,10 @@ import { sleep } from "workflow"; -import { canOperateOnSandbox } from "@/lib/sandbox/canOperateOnSandbox"; -import { evaluateSandboxLifecycle } from "@/lib/sandbox/evaluateSandboxLifecycle"; -import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; +import { clearLifecycleRunIdIfOwned } from "@/app/workflows/clearLifecycleRunIdIfOwned"; +import { computeLifecycleWakeDecision } from "@/app/workflows/computeLifecycleWakeDecision"; +import { runLifecycleEvaluation } from "@/app/workflows/runLifecycleEvaluation"; import { SANDBOX_LIFECYCLE_MIN_SLEEP_MS } from "@/lib/sandbox/sandboxLifecycleConfig"; -import { claimSessionLifecycleRunId } from "@/lib/supabase/sessions/claimSessionLifecycleRunId"; -import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; -import { updateSession } from "@/lib/supabase/sessions/updateSession"; import type { SandboxLifecycleReason } from "@/lib/sandbox/sandboxLifecycleTypes"; -interface LifecycleWakeDecision { - shouldContinue: boolean; - wakeAtMs?: number; - reason?: string; -} - -async function computeLifecycleWakeDecision( - sessionId: string, - runId: string, -): Promise { - "use step"; - - const rows = await selectSessions({ id: sessionId }); - const session = rows[0]; - if (!session) return { shouldContinue: false, reason: "session-not-found" }; - if (session.status === "archived" || session.lifecycle_state === "archived") { - return { shouldContinue: false, reason: "session-archived" }; - } - if ( - !canOperateOnSandbox(session.sandbox_state) || - (session.sandbox_state as { type?: unknown } | null)?.type !== "vercel" - ) { - return { shouldContinue: false, reason: "sandbox-not-operable" }; - } - - // Refresh the lease — anyone else who claimed it in the meantime wins. - const claimed = await claimSessionLifecycleRunId(sessionId, runId, runId); - if (!claimed) return { shouldContinue: false, reason: "run-replaced" }; - - return { shouldContinue: true, wakeAtMs: getLifecycleDueAtMs(session) }; -} - -async function runLifecycleEvaluation(sessionId: string, reason: SandboxLifecycleReason) { - "use step"; - return evaluateSandboxLifecycle(sessionId, reason); -} - -async function clearLifecycleRunIdIfOwned(sessionId: string, runId: string): Promise { - "use step"; - - const rows = await selectSessions({ id: sessionId }); - const session = rows[0]; - if (!session || session.lifecycle_run_id !== runId) return; - - await updateSession(sessionId, { lifecycle_run_id: null }); -} - /** * Vercel Workflow that pauses idle sandboxes. Runs as a `while(true)` * loop: compute next wake time → `sleep(date)` → evaluate → either diff --git a/lib/sandbox/canOperateOnSandbox.ts b/lib/sandbox/canOperateOnSandbox.ts deleted file mode 100644 index 20f741b30..000000000 --- a/lib/sandbox/canOperateOnSandbox.ts +++ /dev/null @@ -1,16 +0,0 @@ -import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; - -/** - * True when the sandbox state has live runtime metadata that supports - * operations like stop/extend/exec. Different from `isSandboxActive` - * which also requires a non-expired `expiresAt` — this returns true - * even shortly after expiry, while the runtime is still potentially - * reachable. Used by the lifecycle workflow when deciding whether the - * sandbox is something it can act on at all. - * - * @param state - The persisted `sandbox_state` JSON value. - * @returns true when the state has runtime fields. - */ -export function canOperateOnSandbox(state: unknown): boolean { - return hasRuntimeSandboxState(state); -} diff --git a/lib/sandbox/computeExpiryDueAtMs.ts b/lib/sandbox/computeExpiryDueAtMs.ts new file mode 100644 index 000000000..57320d51b --- /dev/null +++ b/lib/sandbox/computeExpiryDueAtMs.ts @@ -0,0 +1,20 @@ +import { isoToEpochMs } from "@/lib/sandbox/isoToEpochMs"; +import { SANDBOX_EXPIRES_BUFFER_MS } from "@/lib/sandbox/sandboxLifecycleConfig"; +import type { Tables } from "@/types/database.types"; + +/** + * Computes when a session's sandbox should hibernate due to expiry — + * `sandbox_expires_at` minus a small buffer so we pause before + * Vercel's hard timeout. Returns null when the row has no expiry set + * (paused sandbox, type stub). + * + * @param row - The `sessions` row. + * @returns Epoch ms of the expiry due time, or null when not applicable. + */ +export function computeExpiryDueAtMs( + row: Pick, "sandbox_expires_at">, +): number | null { + const expiresAt = isoToEpochMs(row.sandbox_expires_at); + if (expiresAt === null) return null; + return expiresAt - SANDBOX_EXPIRES_BUFFER_MS; +} diff --git a/lib/sandbox/computeInactivityDueAtMs.ts b/lib/sandbox/computeInactivityDueAtMs.ts new file mode 100644 index 000000000..53df9285a --- /dev/null +++ b/lib/sandbox/computeInactivityDueAtMs.ts @@ -0,0 +1,21 @@ +import { isoToEpochMs } from "@/lib/sandbox/isoToEpochMs"; +import { SANDBOX_INACTIVITY_TIMEOUT_MS } from "@/lib/sandbox/sandboxLifecycleConfig"; +import type { Tables } from "@/types/database.types"; + +/** + * Computes when a session's sandbox should hibernate due to inactivity. + * Prefers `hibernate_after` if set; otherwise computes from the most + * recent activity timestamp + SANDBOX_INACTIVITY_TIMEOUT_MS. + * + * @param row - The `sessions` row. + * @returns Epoch ms of the inactivity due time. + */ +export function computeInactivityDueAtMs( + row: Pick, "hibernate_after" | "last_activity_at" | "updated_at">, +): number { + const hibernateAfter = isoToEpochMs(row.hibernate_after); + if (hibernateAfter !== null) return hibernateAfter; + const lastActivity = isoToEpochMs(row.last_activity_at); + const fallback = lastActivity ?? isoToEpochMs(row.updated_at) ?? Date.now(); + return fallback + SANDBOX_INACTIVITY_TIMEOUT_MS; +} diff --git a/lib/sandbox/createLifecycleRunId.ts b/lib/sandbox/createLifecycleRunId.ts new file mode 100644 index 000000000..2b4ae02e8 --- /dev/null +++ b/lib/sandbox/createLifecycleRunId.ts @@ -0,0 +1,11 @@ +/** + * Generates a unique identifier for a sandbox-lifecycle workflow run. + * Format: `lifecycle::` — the timestamp gives + * humans a quick sanity check of when the run started, the UUID + * guarantees uniqueness across concurrent kicks. + * + * @returns A new run id string. + */ +export function createLifecycleRunId(): string { + return `lifecycle:${Date.now()}:${crypto.randomUUID()}`; +} diff --git a/lib/sandbox/evaluateSandboxLifecycle.ts b/lib/sandbox/evaluateSandboxLifecycle.ts index 1dd2d90d1..cc8e9cb5b 100644 --- a/lib/sandbox/evaluateSandboxLifecycle.ts +++ b/lib/sandbox/evaluateSandboxLifecycle.ts @@ -1,11 +1,12 @@ import { connectSandbox } from "@/lib/sandbox/factory"; import { buildHibernatedLifecycleUpdate } from "@/lib/sandbox/buildHibernatedLifecycleUpdate"; -import { canOperateOnSandbox } from "@/lib/sandbox/canOperateOnSandbox"; import { clearSandboxState } from "@/lib/sandbox/clearSandboxState"; import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; import { getPersistentSandboxName } from "@/lib/sandbox/getPersistentSandboxName"; -import { getSandboxExpiresAtDate } from "@/lib/sandbox/getSandboxExpiresAtDate"; import { hasActiveStreamForSession } from "@/lib/sandbox/hasActiveStreamForSession"; +import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; +import { restoreActiveLifecycleState } from "@/lib/sandbox/restoreActiveLifecycleState"; +import { wasLifecycleTimingExtended } from "@/lib/sandbox/wasLifecycleTimingExtended"; import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; import { updateSession } from "@/lib/supabase/sessions/updateSession"; import type { SandboxState } from "@/lib/sandbox/factory"; @@ -13,7 +14,7 @@ import type { SandboxLifecycleEvaluationResult, SandboxLifecycleReason, } from "@/lib/sandbox/sandboxLifecycleTypes"; -import type { Json, Tables } from "@/types/database.types"; +import type { Json } from "@/types/database.types"; /** * One-shot lifecycle evaluator called by the sandbox-lifecycle @@ -29,10 +30,6 @@ import type { Json, Tables } from "@/types/database.types"; * - `failed` — an error occurred mid-evaluation; row's * `lifecycle_state` is set to `"failed"` for self-healing on * the next status read - * - * @param sessionId - The session whose sandbox to evaluate. - * @param reason - Why the workflow was triggered (for logging only). - * @returns The result of this single evaluation pass. */ export async function evaluateSandboxLifecycle( sessionId: string, @@ -47,15 +44,14 @@ export async function evaluateSandboxLifecycle( } const sandboxState = session.sandbox_state; - if (!canOperateOnSandbox(sandboxState)) { + if (!hasRuntimeSandboxState(sandboxState)) { return { action: "skipped", reason: "sandbox-not-operable" }; } if ((sandboxState as { type?: unknown }).type !== "vercel") { return { action: "skipped", reason: "unsupported-sandbox-type" }; } - const dueAtMs = getLifecycleDueAtMs(session); - if (Date.now() < dueAtMs) { + if (Date.now() < getLifecycleDueAtMs(session)) { return { action: "skipped", reason: "not-due-yet" }; } @@ -108,29 +104,3 @@ export async function evaluateSandboxLifecycle( return { action: "failed", reason: message }; } } - -async function restoreActiveLifecycleState( - sessionId: string, - sandboxState: unknown, -): Promise { - await updateSession(sessionId, { - lifecycle_state: "active", - lifecycle_error: null, - sandbox_expires_at: getSandboxExpiresAtDate(sandboxState), - }); -} - -async function wasLifecycleTimingExtended( - sessionId: string, - prior: Tables<"sessions">, -): Promise { - const refreshed = (await selectSessions({ id: sessionId }))[0]; - if (!refreshed?.sandbox_state || !canOperateOnSandbox(refreshed.sandbox_state)) return false; - - const timingChanged = - refreshed.last_activity_at !== prior.last_activity_at || - refreshed.hibernate_after !== prior.hibernate_after || - refreshed.sandbox_expires_at !== prior.sandbox_expires_at; - - return timingChanged && Date.now() < getLifecycleDueAtMs(refreshed); -} diff --git a/lib/sandbox/getNextLifecycleVersion.ts b/lib/sandbox/getNextLifecycleVersion.ts deleted file mode 100644 index 19d4e133d..000000000 --- a/lib/sandbox/getNextLifecycleVersion.ts +++ /dev/null @@ -1,11 +0,0 @@ -/** - * Increments a session's lifecycle version for optimistic concurrency - * control. Returns 1 when the input is null/undefined (e.g. on first - * sandbox provision after session creation). - * - * @param current - The session's current `lifecycle_version`. - * @returns The next version number to write. - */ -export function getNextLifecycleVersion(current: number | null | undefined): number { - return (current ?? 0) + 1; -} diff --git a/lib/sandbox/hasActiveStreamForSession.ts b/lib/sandbox/hasActiveStreamForSession.ts index 6b4473b65..3529a5f3d 100644 --- a/lib/sandbox/hasActiveStreamForSession.ts +++ b/lib/sandbox/hasActiveStreamForSession.ts @@ -1,4 +1,4 @@ -import { selectChatsBySession } from "@/lib/supabase/chats/selectChatsBySession"; +import { selectChats } from "@/lib/supabase/chats/selectChats"; /** * True when any chat in the session has an `active_stream_id` set, @@ -10,6 +10,6 @@ import { selectChatsBySession } from "@/lib/supabase/chats/selectChatsBySession" * @returns true when at least one chat has an active stream id. */ export async function hasActiveStreamForSession(sessionId: string): Promise { - const chats = await selectChatsBySession(sessionId); + const chats = await selectChats({ sessionId }); return chats.some(chat => chat.active_stream_id !== null); } diff --git a/lib/sandbox/hasPausedSandboxState.ts b/lib/sandbox/hasPausedSandboxState.ts index 6de247a44..c78000ee8 100644 --- a/lib/sandbox/hasPausedSandboxState.ts +++ b/lib/sandbox/hasPausedSandboxState.ts @@ -1,4 +1,4 @@ -import { hasResumableSandboxState } from "@/lib/sandbox/hasResumableSandboxState"; +import { getResumableSandboxName } from "@/lib/sandbox/getResumableSandboxName"; import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; /** @@ -10,5 +10,5 @@ import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; * @returns true when the sandbox is paused-but-resumable. */ export function hasPausedSandboxState(state: unknown): boolean { - return hasResumableSandboxState(state) && !hasRuntimeSandboxState(state); + return getResumableSandboxName(state) !== null && !hasRuntimeSandboxState(state); } diff --git a/lib/sandbox/hasResumableSandboxState.ts b/lib/sandbox/hasResumableSandboxState.ts deleted file mode 100644 index cbbb341dc..000000000 --- a/lib/sandbox/hasResumableSandboxState.ts +++ /dev/null @@ -1,12 +0,0 @@ -import { getResumableSandboxName } from "@/lib/sandbox/getResumableSandboxName"; - -/** - * True when the persisted state carries a durable resume handle — - * either a persistent `sandboxName` or the legacy `sandboxId`. - * - * @param state - The persisted `sandbox_state` JSON value. - * @returns true when the sandbox can be resumed. - */ -export function hasResumableSandboxState(state: unknown): boolean { - return getResumableSandboxName(state) !== null; -} diff --git a/lib/sandbox/isLifecycleRunStale.ts b/lib/sandbox/isLifecycleRunStale.ts new file mode 100644 index 000000000..f44b9a493 --- /dev/null +++ b/lib/sandbox/isLifecycleRunStale.ts @@ -0,0 +1,19 @@ +import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; +import { SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS } from "@/lib/sandbox/sandboxLifecycleConfig"; +import type { Tables } from "@/types/database.types"; + +/** + * True when a session's `lifecycle_run_id` lease is past the calculated + * due time + grace window. Indicates the workflow that owned the lease + * either crashed or never started — the kick can safely reclaim the + * lease and start a fresh run. + * + * @param session - The `sessions` row. + * @returns true when the lease is stale and should be reclaimed. + */ +export function isLifecycleRunStale(session: Tables<"sessions">): boolean { + if (!session.lifecycle_run_id) return false; + if (session.lifecycle_state !== "active") return false; + const overdueMs = Date.now() - getLifecycleDueAtMs(session); + return overdueMs > SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS; +} diff --git a/lib/sandbox/kickSandboxLifecycleWorkflow.ts b/lib/sandbox/kickSandboxLifecycleWorkflow.ts index f3a852d6e..f1b29a334 100644 --- a/lib/sandbox/kickSandboxLifecycleWorkflow.ts +++ b/lib/sandbox/kickSandboxLifecycleWorkflow.ts @@ -1,30 +1,22 @@ -import { start } from "workflow/api"; -import { sandboxLifecycleWorkflow } from "@/app/workflows/sandboxLifecycleWorkflow"; -import { canOperateOnSandbox } from "@/lib/sandbox/canOperateOnSandbox"; -import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; -import { SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS } from "@/lib/sandbox/sandboxLifecycleConfig"; -import { claimSessionLifecycleRunId } from "@/lib/supabase/sessions/claimSessionLifecycleRunId"; -import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; -import { updateSession } from "@/lib/supabase/sessions/updateSession"; +import { runKick } from "@/lib/sandbox/runKick"; import type { SandboxLifecycleReason } from "@/lib/sandbox/sandboxLifecycleTypes"; -import type { Tables } from "@/types/database.types"; interface KickInput { sessionId: string; reason: SandboxLifecycleReason; /** - * Optional scheduler for the kick chain (selectSessions → claim - * lease → start workflow). Callers in serverless contexts should - * pass `p => after(() => p)` (or `waitUntil(p)`) so the platform - * keeps the function alive until the chain completes — without it - * the chain dies when the request returns. Mirrors open-agents' - * `scheduleBackgroundWork` parameter. + * Optional scheduler for the kick chain. Callers in serverless + * contexts should pass `task => after(() => task)` (or + * `waitUntil(task)`) so the platform keeps the function alive + * until the chain completes — without it the chain dies when the + * request returns. Mirrors open-agents' `scheduleBackgroundWork` + * parameter. */ scheduleBackgroundWork?: (task: Promise) => void; } /** - * Fire-and-forget kick of `sandboxLifecycleWorkflow`. Used by: + * Fire-and-forget kick of the sandbox-lifecycle workflow. Used by: * * - `POST /api/sandbox` (reason: `sandbox-created`) — register a * freshly-provisioned sandbox so the workflow auto-pauses it after @@ -32,13 +24,11 @@ interface KickInput { * - `GET /api/sandbox/status` (reason: `status-check-overdue`) — poke * the workflow when a status read sees stale lifecycle state * - * Skips when the session can't be acted on (archived, no runtime - * sandbox, wrong sandbox type) or when another lifecycle run is - * already in flight. A run that's been overdue past the grace window - * is considered stale and gets reclaimed. + * Wraps `runKick` in error handling and the optional background-work + * scheduler. `runKick` itself owns all the decision logic. */ export function kickSandboxLifecycleWorkflow(input: KickInput): void { - const task = runKick(input).catch(error => + const task = runKick({ sessionId: input.sessionId, reason: input.reason }).catch(error => console.error(`[kickSandboxLifecycleWorkflow] failed for session ${input.sessionId}:`, error), ); @@ -49,59 +39,3 @@ export function kickSandboxLifecycleWorkflow(input: KickInput): void { void task; } - -async function runKick(input: KickInput): Promise { - const rows = await selectSessions({ id: input.sessionId }); - const session = rows[0]; - if (!session) return; - - const sessionForStart = isLifecycleRunStale(session) - ? await reclaimStaleLease(input.sessionId) - : session; - - if (!shouldStartLifecycle(sessionForStart)) return; - - const runId = createLifecycleRunId(); - const claimed = await claimSessionLifecycleRunId(input.sessionId, runId); - if (!claimed) return; - - try { - const run = await start(sandboxLifecycleWorkflow, [input.sessionId, input.reason, runId]); - console.log( - `[kickSandboxLifecycleWorkflow] started run ${run.runId} for session ${input.sessionId} (reason=${input.reason})`, - ); - } catch (error) { - console.error( - `[kickSandboxLifecycleWorkflow] failed to start workflow for session ${input.sessionId}; clearing lease:`, - error, - ); - await updateSession(input.sessionId, { lifecycle_run_id: null }); - } -} - -async function reclaimStaleLease(sessionId: string): Promise | null> { - await updateSession(sessionId, { lifecycle_run_id: null }); - const rows = await selectSessions({ id: sessionId }); - return rows[0] ?? null; -} - -function shouldStartLifecycle(session: Tables<"sessions"> | null): session is Tables<"sessions"> { - if (!session) return false; - if (session.status === "archived" || session.lifecycle_state === "archived") return false; - if (!session.sandbox_state) return false; - if (!canOperateOnSandbox(session.sandbox_state)) return false; - if ((session.sandbox_state as { type?: unknown }).type !== "vercel") return false; - if (session.lifecycle_run_id) return false; - return true; -} - -function isLifecycleRunStale(session: Tables<"sessions">): boolean { - if (!session.lifecycle_run_id) return false; - if (session.lifecycle_state !== "active") return false; - const overdueMs = Date.now() - getLifecycleDueAtMs(session); - return overdueMs > SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS; -} - -function createLifecycleRunId(): string { - return `lifecycle:${Date.now()}:${crypto.randomUUID()}`; -} diff --git a/lib/sandbox/reclaimStaleLease.ts b/lib/sandbox/reclaimStaleLease.ts new file mode 100644 index 000000000..857335a84 --- /dev/null +++ b/lib/sandbox/reclaimStaleLease.ts @@ -0,0 +1,18 @@ +import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; +import { updateSession } from "@/lib/supabase/sessions/updateSession"; +import type { Tables } from "@/types/database.types"; + +/** + * Clears a stale `lifecycle_run_id` lease and re-reads the row so the + * caller has the post-clear snapshot to work with. Used by the kick + * logic when `isLifecycleRunStale` returns true. + * + * @param sessionId - The session whose lease to reclaim. + * @returns The session row with `lifecycle_run_id: null`, or null if + * the row vanished between the clear and the re-read. + */ +export async function reclaimStaleLease(sessionId: string): Promise | null> { + await updateSession(sessionId, { lifecycle_run_id: null }); + const rows = await selectSessions({ id: sessionId }); + return rows[0] ?? null; +} diff --git a/lib/sandbox/restoreActiveLifecycleState.ts b/lib/sandbox/restoreActiveLifecycleState.ts new file mode 100644 index 000000000..fbdbacbfb --- /dev/null +++ b/lib/sandbox/restoreActiveLifecycleState.ts @@ -0,0 +1,23 @@ +import { getSandboxExpiresAtDate } from "@/lib/sandbox/getSandboxExpiresAtDate"; +import { updateSession } from "@/lib/supabase/sessions/updateSession"; + +/** + * Re-marks a session as `lifecycle_state: "active"` after a transient + * skip during evaluation (e.g. an active chat stream took precedence + * over hibernation, or the user extended the timing while the + * workflow was about to pause). Refreshes `sandbox_expires_at` from + * the live sandbox state so the row matches reality. + * + * @param sessionId - The session id. + * @param sandboxState - The live `sandbox_state` JSON value. + */ +export async function restoreActiveLifecycleState( + sessionId: string, + sandboxState: unknown, +): Promise { + await updateSession(sessionId, { + lifecycle_state: "active", + lifecycle_error: null, + sandbox_expires_at: getSandboxExpiresAtDate(sandboxState), + }); +} diff --git a/lib/sandbox/runKick.ts b/lib/sandbox/runKick.ts new file mode 100644 index 000000000..7d44bd094 --- /dev/null +++ b/lib/sandbox/runKick.ts @@ -0,0 +1,58 @@ +import { start } from "workflow/api"; +import { sandboxLifecycleWorkflow } from "@/app/workflows/sandboxLifecycleWorkflow"; +import { createLifecycleRunId } from "@/lib/sandbox/createLifecycleRunId"; +import { isLifecycleRunStale } from "@/lib/sandbox/isLifecycleRunStale"; +import { reclaimStaleLease } from "@/lib/sandbox/reclaimStaleLease"; +import { shouldStartLifecycle } from "@/lib/sandbox/shouldStartLifecycle"; +import { claimSessionLifecycleRunId } from "@/lib/supabase/sessions/claimSessionLifecycleRunId"; +import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; +import { updateSession } from "@/lib/supabase/sessions/updateSession"; +import type { SandboxLifecycleReason } from "@/lib/sandbox/sandboxLifecycleTypes"; + +interface RunKickInput { + sessionId: string; + reason: SandboxLifecycleReason; +} + +/** + * The async chain that the lifecycle kick runs in the background: + * + * 1. Read the session row + * 2. Reclaim the lease if it's stale (workflow crashed) + * 3. Skip if the session isn't in a shape where lifecycle makes sense + * 4. Generate a fresh run id and atomically claim it + * 5. `start()` the Vercel Workflow run + * 6. On `start()` failure, clear the lease so retry can succeed + * + * Errors are caught at the call site (`kickSandboxLifecycleWorkflow`) + * and never surface to the request — this is fire-and-forget by + * design. + */ +export async function runKick(input: RunKickInput): Promise { + const rows = await selectSessions({ id: input.sessionId }); + const session = rows[0]; + if (!session) return; + + const sessionForStart = isLifecycleRunStale(session) + ? await reclaimStaleLease(input.sessionId) + : session; + + if (!shouldStartLifecycle(sessionForStart)) return; + + const runId = createLifecycleRunId(); + const claimed = await claimSessionLifecycleRunId(input.sessionId, runId); + if (!claimed) return; + + try { + const run = await start(sandboxLifecycleWorkflow, [input.sessionId, input.reason, runId]); + console.log( + `[kickSandboxLifecycleWorkflow] started run ${run.runId} for session ${input.sessionId} (reason=${input.reason})`, + ); + } catch (error) { + console.error( + `[kickSandboxLifecycleWorkflow] failed to start workflow for session ${input.sessionId}; clearing lease:`, + error, + ); + await updateSession(input.sessionId, { lifecycle_run_id: null }); + } +} diff --git a/lib/sandbox/shouldStartLifecycle.ts b/lib/sandbox/shouldStartLifecycle.ts new file mode 100644 index 000000000..9294e37a8 --- /dev/null +++ b/lib/sandbox/shouldStartLifecycle.ts @@ -0,0 +1,24 @@ +import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; +import type { Tables } from "@/types/database.types"; + +/** + * Predicate for the kick logic: returns true when a session is in a + * shape where starting a lifecycle workflow makes sense. Filters out + * archived sessions, sessions without a runtime sandbox, non-vercel + * sandbox types, and sessions that already have a lifecycle run in + * flight. + * + * @param session - The `sessions` row, or null. + * @returns true when a lifecycle run should be started. + */ +export function shouldStartLifecycle( + session: Tables<"sessions"> | null, +): session is Tables<"sessions"> { + if (!session) return false; + if (session.status === "archived" || session.lifecycle_state === "archived") return false; + if (!session.sandbox_state) return false; + if (!hasRuntimeSandboxState(session.sandbox_state)) return false; + if ((session.sandbox_state as { type?: unknown }).type !== "vercel") return false; + if (session.lifecycle_run_id) return false; + return true; +} diff --git a/lib/sandbox/wasLifecycleTimingExtended.ts b/lib/sandbox/wasLifecycleTimingExtended.ts new file mode 100644 index 000000000..c3980a1df --- /dev/null +++ b/lib/sandbox/wasLifecycleTimingExtended.ts @@ -0,0 +1,30 @@ +import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; +import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; +import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; +import type { Tables } from "@/types/database.types"; + +/** + * True when, between the workflow's wake-up and the moment of + * evaluation, another caller (an `/extend` or `/activity` hit, or a + * snapshot resume) updated the session's lifecycle timing fields and + * the new due time is still in the future. Tells the evaluator to + * skip hibernation this pass and let the workflow loop sleep again. + * + * @param sessionId - The session id. + * @param prior - The session row read at the start of evaluation. + * @returns true when timing was extended; false otherwise. + */ +export async function wasLifecycleTimingExtended( + sessionId: string, + prior: Tables<"sessions">, +): Promise { + const refreshed = (await selectSessions({ id: sessionId }))[0]; + if (!refreshed?.sandbox_state || !hasRuntimeSandboxState(refreshed.sandbox_state)) return false; + + const timingChanged = + refreshed.last_activity_at !== prior.last_activity_at || + refreshed.hibernate_after !== prior.hibernate_after || + refreshed.sandbox_expires_at !== prior.sandbox_expires_at; + + return timingChanged && Date.now() < getLifecycleDueAtMs(refreshed); +} diff --git a/lib/supabase/chats/selectChats.ts b/lib/supabase/chats/selectChats.ts new file mode 100644 index 000000000..e36c0454c --- /dev/null +++ b/lib/supabase/chats/selectChats.ts @@ -0,0 +1,30 @@ +import supabase from "@/lib/supabase/serverClient"; +import type { Tables } from "@/types/database.types"; + +interface SelectChatsFilter { + /** Optional id filter — when set, returns at most one row. */ + id?: string; + /** Optional session filter — when set, returns every chat in the session. */ + sessionId?: string; +} + +/** + * General-purpose `chats` reader. Pass any combination of filters to + * narrow the result set; an unset filter is ignored. Mirrors the + * `selectSessions` pattern. Returns [] on Supabase error after logging. + * + * @param filter - Optional filters narrowing the query. + * @returns Matching chat rows, or [] on error / no match. + */ +export async function selectChats(filter: SelectChatsFilter = {}): Promise[]> { + let query = supabase.from("chats").select("*"); + if (filter.id) query = query.eq("id", filter.id); + if (filter.sessionId) query = query.eq("session_id", filter.sessionId); + + const { data, error } = await query; + if (error) { + console.error("[selectChats] error:", error); + return []; + } + return data ?? []; +} diff --git a/lib/supabase/chats/selectChatsBySession.ts b/lib/supabase/chats/selectChatsBySession.ts deleted file mode 100644 index deff1f894..000000000 --- a/lib/supabase/chats/selectChatsBySession.ts +++ /dev/null @@ -1,22 +0,0 @@ -import supabase from "@/lib/supabase/serverClient"; -import type { Tables } from "@/types/database.types"; - -/** - * Reads all `chats` belonging to a session. Used by the lifecycle - * workflow to detect active streams (so a chat in flight can prevent - * hibernation), and will be used by upcoming `GET /api/sessions/:id/chats` - * port too. Returns [] on Supabase error after logging. - * - * @param sessionId - The owning session id. - * @returns Matching chat rows, or [] on error / no match. - */ -export async function selectChatsBySession(sessionId: string): Promise[]> { - const { data, error } = await supabase.from("chats").select("*").eq("session_id", sessionId); - - if (error) { - console.error(`[selectChatsBySession] error for session ${sessionId}:`, error); - return []; - } - - return data ?? []; -} diff --git a/lib/supabase/sessions/claimSessionLifecycleRunId.ts b/lib/supabase/sessions/claimSessionLifecycleRunId.ts index b446c2b54..c7d09cc10 100644 --- a/lib/supabase/sessions/claimSessionLifecycleRunId.ts +++ b/lib/supabase/sessions/claimSessionLifecycleRunId.ts @@ -1,15 +1,11 @@ -import supabase from "@/lib/supabase/serverClient"; +import { claimSessionLifecycleRunIdIfMatch } from "@/lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch"; +import { claimSessionLifecycleRunIdIfNull } from "@/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull"; /** - * Atomically claims the `lifecycle_run_id` lease for a session. The - * UPDATE only succeeds when the row's current value matches what the - * caller passed as `expected` (typically null when claiming a fresh - * lease, or the same `runId` when re-confirming during a workflow - * step). This is the concurrency guard that prevents two kicks from - * starting duplicate lifecycle workflows for the same session. - * - * Returns true when the lease was successfully claimed (the UPDATE - * affected exactly one row), false otherwise. + * Combiner for the two `lifecycle_run_id` claim operations. Picks the + * right atomic Supabase write based on whether the caller is making + * an initial claim (expected = null) or refreshing its own lease + * (expected = the runId). * * @param sessionId - The session id to claim against. * @param runId - The new lease value to write. @@ -21,17 +17,7 @@ export async function claimSessionLifecycleRunId( runId: string, expected: string | null = null, ): Promise { - const query = supabase.from("sessions").update({ lifecycle_run_id: runId }).eq("id", sessionId); - - const filtered = - expected === null ? query.is("lifecycle_run_id", null) : query.eq("lifecycle_run_id", expected); - - const { data, error } = await filtered.select("id").maybeSingle(); - - if (error) { - console.error(`[claimSessionLifecycleRunId] error for session ${sessionId}:`, error); - return false; - } - - return data !== null; + return expected === null + ? claimSessionLifecycleRunIdIfNull(sessionId, runId) + : claimSessionLifecycleRunIdIfMatch(sessionId, expected); } diff --git a/lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts b/lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts new file mode 100644 index 000000000..781f669d2 --- /dev/null +++ b/lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts @@ -0,0 +1,31 @@ +import supabase from "@/lib/supabase/serverClient"; + +/** + * Atomic Supabase write: refreshes `lifecycle_run_id` for a session + * only if the row's current value MATCHES the supplied value. Returns + * true on successful refresh, false if the row was reclaimed by + * someone else. Used by the workflow to confirm it still owns its + * lease before each evaluation pass. + * + * @param sessionId - The session id to claim against. + * @param runId - The expected current lease value (also written back). + * @returns true on success, false on contention or error. + */ +export async function claimSessionLifecycleRunIdIfMatch( + sessionId: string, + runId: string, +): Promise { + const { data, error } = await supabase + .from("sessions") + .update({ lifecycle_run_id: runId }) + .eq("id", sessionId) + .eq("lifecycle_run_id", runId) + .select("id") + .maybeSingle(); + + if (error) { + console.error(`[claimSessionLifecycleRunIdIfMatch] error for ${sessionId}:`, error); + return false; + } + return data !== null; +} diff --git a/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts b/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts new file mode 100644 index 000000000..93f205244 --- /dev/null +++ b/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts @@ -0,0 +1,30 @@ +import supabase from "@/lib/supabase/serverClient"; + +/** + * Atomic Supabase write: claims `lifecycle_run_id` for a session only + * if the row's current value is NULL. Returns true on successful + * claim, false if the row was already taken. Used for INITIAL claims + * (i.e. starting a fresh lifecycle run from a kick). + * + * @param sessionId - The session id to claim against. + * @param runId - The new lease value to write. + * @returns true on success, false on contention or error. + */ +export async function claimSessionLifecycleRunIdIfNull( + sessionId: string, + runId: string, +): Promise { + const { data, error } = await supabase + .from("sessions") + .update({ lifecycle_run_id: runId }) + .eq("id", sessionId) + .is("lifecycle_run_id", null) + .select("id") + .maybeSingle(); + + if (error) { + console.error(`[claimSessionLifecycleRunIdIfNull] error for ${sessionId}:`, error); + return false; + } + return data !== null; +} From 22095c29aee3c462c1aff8fa1dff1fadfd1ea4a5 Mon Sep 17 00:00:00 2001 From: Sweets Sweetman Date: Thu, 7 May 2026 17:28:12 -0500 Subject: [PATCH 4/6] fix(sandbox): actually delegate getLifecycleDueAtMs to the extracted helpers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In the previous SRP refactor I created computeInactivityDueAtMs.ts and computeExpiryDueAtMs.ts but the Write to update getLifecycleDueAtMs.ts errored on "File has not been read yet" and I missed it in the batch output. The two helper files were left as orphans while the original file kept its inline copies — visible by the SANDBOX_*_MS imports + both functions still inlined. Now properly imports computeInactivityDueAtMs / computeExpiryDueAtMs and the parent file just composes them. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/sandbox/getLifecycleDueAtMs.ts | 33 ++++-------------------------- 1 file changed, 4 insertions(+), 29 deletions(-) diff --git a/lib/sandbox/getLifecycleDueAtMs.ts b/lib/sandbox/getLifecycleDueAtMs.ts index b8fbaf445..c022193cc 100644 --- a/lib/sandbox/getLifecycleDueAtMs.ts +++ b/lib/sandbox/getLifecycleDueAtMs.ts @@ -1,20 +1,11 @@ -import { - SANDBOX_EXPIRES_BUFFER_MS, - SANDBOX_INACTIVITY_TIMEOUT_MS, -} from "@/lib/sandbox/sandboxLifecycleConfig"; -import { isoToEpochMs } from "@/lib/sandbox/isoToEpochMs"; +import { computeExpiryDueAtMs } from "@/lib/sandbox/computeExpiryDueAtMs"; +import { computeInactivityDueAtMs } from "@/lib/sandbox/computeInactivityDueAtMs"; import type { Tables } from "@/types/database.types"; /** * Computes when the lifecycle workflow should next wake up for a - * session. Returns the earlier of: - * - inactivity due — `last_activity_at` (or `updated_at`) + - * SANDBOX_INACTIVITY_TIMEOUT_MS, overridden by `hibernate_after` - * if set - * - expiry due — `sandbox_expires_at` minus - * SANDBOX_EXPIRES_BUFFER_MS, so we pause before Vercel hard-stops - * - * Output is epoch milliseconds. Used by both the workflow (for + * session. Returns the earlier of the inactivity-due time and the + * expiry-due time. Used by both the workflow (for * `sleep(new Date(wakeAtMs))`) and the stale-run detector in the * kick logic. * @@ -32,19 +23,3 @@ export function getLifecycleDueAtMs( if (expiryDue === null) return inactivityDue; return Math.min(inactivityDue, expiryDue); } - -function computeInactivityDueAtMs( - row: Pick, "hibernate_after" | "last_activity_at" | "updated_at">, -): number { - const hibernateAfter = isoToEpochMs(row.hibernate_after); - if (hibernateAfter !== null) return hibernateAfter; - const lastActivity = isoToEpochMs(row.last_activity_at); - const fallback = lastActivity ?? isoToEpochMs(row.updated_at) ?? Date.now(); - return fallback + SANDBOX_INACTIVITY_TIMEOUT_MS; -} - -function computeExpiryDueAtMs(row: Pick, "sandbox_expires_at">): number | null { - const expiresAt = isoToEpochMs(row.sandbox_expires_at); - if (expiresAt === null) return null; - return expiresAt - SANDBOX_EXPIRES_BUFFER_MS; -} From c37821d5c8c5605c1d0c6f1b55cfeda6c0fcda32 Mon Sep 17 00:00:00 2001 From: Sweets Sweetman Date: Thu, 7 May 2026 17:33:45 -0500 Subject: [PATCH 5/6] =?UTF-8?q?fix(sandbox):=20KISS=20lifecycle=20claim=20?= =?UTF-8?q?=E2=80=94=20combiner=20under=20lib/sessions,=20IfMatch=20via=20?= =?UTF-8?q?updateSession?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two review fixes: 1) Move `claimSessionLifecycleRunId.ts` from `lib/supabase/sessions/` to `lib/sessions/`. The combiner doesn't directly query Supabase — it composes two underlying helpers, so per api convention it belongs alongside other domain composers, not under the Supabase namespace. 2) KISS: delete `claimSessionLifecycleRunIdIfMatch.ts` and use the existing `updateSession` helper for the lease-refresh path. The refresh writes the same `lifecycle_run_id` value back, so an unconditional `updateSession({ lifecycle_run_id: runId })` does the work — accepts a small race where a concurrent stale-reclaim could be overwritten, but the kick path (which DOES use the atomic `claimSessionLifecycleRunIdIfNull`) remains the primary concurrency guard. Two callers (runKick, computeLifecycleWakeDecision) updated to import from the new path. No behavior change at the workflow level. Tests: 2579 / 2579 still pass. lint + tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- app/workflows/computeLifecycleWakeDecision.ts | 2 +- lib/sandbox/runKick.ts | 2 +- lib/sessions/claimSessionLifecycleRunId.ts | 32 +++++++++++++++++++ .../sessions/claimSessionLifecycleRunId.ts | 23 ------------- .../claimSessionLifecycleRunIdIfMatch.ts | 31 ------------------ 5 files changed, 34 insertions(+), 56 deletions(-) create mode 100644 lib/sessions/claimSessionLifecycleRunId.ts delete mode 100644 lib/supabase/sessions/claimSessionLifecycleRunId.ts delete mode 100644 lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts diff --git a/app/workflows/computeLifecycleWakeDecision.ts b/app/workflows/computeLifecycleWakeDecision.ts index 656d3e7be..c73975766 100644 --- a/app/workflows/computeLifecycleWakeDecision.ts +++ b/app/workflows/computeLifecycleWakeDecision.ts @@ -1,6 +1,6 @@ import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; -import { claimSessionLifecycleRunId } from "@/lib/supabase/sessions/claimSessionLifecycleRunId"; +import { claimSessionLifecycleRunId } from "@/lib/sessions/claimSessionLifecycleRunId"; import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; interface LifecycleWakeDecision { diff --git a/lib/sandbox/runKick.ts b/lib/sandbox/runKick.ts index 7d44bd094..bd44e97c8 100644 --- a/lib/sandbox/runKick.ts +++ b/lib/sandbox/runKick.ts @@ -4,7 +4,7 @@ import { createLifecycleRunId } from "@/lib/sandbox/createLifecycleRunId"; import { isLifecycleRunStale } from "@/lib/sandbox/isLifecycleRunStale"; import { reclaimStaleLease } from "@/lib/sandbox/reclaimStaleLease"; import { shouldStartLifecycle } from "@/lib/sandbox/shouldStartLifecycle"; -import { claimSessionLifecycleRunId } from "@/lib/supabase/sessions/claimSessionLifecycleRunId"; +import { claimSessionLifecycleRunId } from "@/lib/sessions/claimSessionLifecycleRunId"; import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; import { updateSession } from "@/lib/supabase/sessions/updateSession"; import type { SandboxLifecycleReason } from "@/lib/sandbox/sandboxLifecycleTypes"; diff --git a/lib/sessions/claimSessionLifecycleRunId.ts b/lib/sessions/claimSessionLifecycleRunId.ts new file mode 100644 index 000000000..183ea2a2f --- /dev/null +++ b/lib/sessions/claimSessionLifecycleRunId.ts @@ -0,0 +1,32 @@ +import { claimSessionLifecycleRunIdIfNull } from "@/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull"; +import { updateSession } from "@/lib/supabase/sessions/updateSession"; + +/** + * Combiner for the two `lifecycle_run_id` claim operations. Lives in + * `lib/sessions/` (not `lib/supabase/sessions/`) because it does not + * directly query Supabase — it composes two underlying helpers. + * + * - `expected = null` (initial claim): atomic `claimSessionLifecycleRunIdIfNull` + * that fails when the row already has a lease. + * - `expected = runId` (workflow refresh): plain `updateSession` write that + * re-asserts the current lease without a conditional WHERE. Accepts a + * small race where a stale-reclaim could be overwritten — the kick path + * (which DOES use the atomic IfNull check) remains the primary + * concurrency guard. + * + * @param sessionId - The session id to claim against. + * @param runId - The new lease value to write. + * @param expected - The expected current value; defaults to null. + * @returns true on success, false when an initial claim was already taken. + */ +export async function claimSessionLifecycleRunId( + sessionId: string, + runId: string, + expected: string | null = null, +): Promise { + if (expected === null) { + return claimSessionLifecycleRunIdIfNull(sessionId, runId); + } + const updated = await updateSession(sessionId, { lifecycle_run_id: runId }); + return updated !== null; +} diff --git a/lib/supabase/sessions/claimSessionLifecycleRunId.ts b/lib/supabase/sessions/claimSessionLifecycleRunId.ts deleted file mode 100644 index c7d09cc10..000000000 --- a/lib/supabase/sessions/claimSessionLifecycleRunId.ts +++ /dev/null @@ -1,23 +0,0 @@ -import { claimSessionLifecycleRunIdIfMatch } from "@/lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch"; -import { claimSessionLifecycleRunIdIfNull } from "@/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull"; - -/** - * Combiner for the two `lifecycle_run_id` claim operations. Picks the - * right atomic Supabase write based on whether the caller is making - * an initial claim (expected = null) or refreshing its own lease - * (expected = the runId). - * - * @param sessionId - The session id to claim against. - * @param runId - The new lease value to write. - * @param expected - The expected current value; defaults to null. - * @returns true on success, false when the lease was already taken. - */ -export async function claimSessionLifecycleRunId( - sessionId: string, - runId: string, - expected: string | null = null, -): Promise { - return expected === null - ? claimSessionLifecycleRunIdIfNull(sessionId, runId) - : claimSessionLifecycleRunIdIfMatch(sessionId, expected); -} diff --git a/lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts b/lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts deleted file mode 100644 index 781f669d2..000000000 --- a/lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts +++ /dev/null @@ -1,31 +0,0 @@ -import supabase from "@/lib/supabase/serverClient"; - -/** - * Atomic Supabase write: refreshes `lifecycle_run_id` for a session - * only if the row's current value MATCHES the supplied value. Returns - * true on successful refresh, false if the row was reclaimed by - * someone else. Used by the workflow to confirm it still owns its - * lease before each evaluation pass. - * - * @param sessionId - The session id to claim against. - * @param runId - The expected current lease value (also written back). - * @returns true on success, false on contention or error. - */ -export async function claimSessionLifecycleRunIdIfMatch( - sessionId: string, - runId: string, -): Promise { - const { data, error } = await supabase - .from("sessions") - .update({ lifecycle_run_id: runId }) - .eq("id", sessionId) - .eq("lifecycle_run_id", runId) - .select("id") - .maybeSingle(); - - if (error) { - console.error(`[claimSessionLifecycleRunIdIfMatch] error for ${sessionId}:`, error); - return false; - } - return data !== null; -} From 2bf8cf91603a82a94223e25301322b091de5fbf9 Mon Sep 17 00:00:00 2001 From: Sweets Sweetman Date: Thu, 7 May 2026 17:45:05 -0500 Subject: [PATCH 6/6] =?UTF-8?q?fix(sandbox):=20KISS=20lifecycle=20claim=20?= =?UTF-8?q?=E2=80=94=20drop=20IfNull=20helper,=20use=20updateSession?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per latest review: the `IfNull` Supabase helper is also just an `updateSession` with a conditional WHERE — KISS it the same way the `IfMatch` helper was simplified. Changes: - Delete `lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts`. - Delete `lib/sessions/claimSessionLifecycleRunId.ts` (combiner — no longer needed once both branches collapse to plain `updateSession`). - `runKick`: write the new lease via `updateSession(id, { lifecycle_run_id: runId })`. The atomic `WHERE lifecycle_run_id IS NULL` guard is gone; `shouldStartLifecycle` already filters out rows that already hold a lease — best-effort concurrency, accepting the small race window. - `computeLifecycleWakeDecision`: replace the IfMatch lease re-claim with a SELECT-based ownership check using the row already read in the same step. If `lifecycle_run_id` was overwritten by a concurrent kick, return `run-replaced` and bail. Tests 2579 / 2579 still pass. Lint clean. The pre-existing tsc errors in unrelated test files (orgId on ApiKeyDetails, SlackApiResponse messages) are not introduced here. Co-Authored-By: Claude Opus 4.7 (1M context) --- app/workflows/computeLifecycleWakeDecision.ts | 11 ++++--- lib/sandbox/runKick.ts | 8 ++--- lib/sessions/claimSessionLifecycleRunId.ts | 32 ------------------- .../claimSessionLifecycleRunIdIfNull.ts | 30 ----------------- 4 files changed, 10 insertions(+), 71 deletions(-) delete mode 100644 lib/sessions/claimSessionLifecycleRunId.ts delete mode 100644 lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts diff --git a/app/workflows/computeLifecycleWakeDecision.ts b/app/workflows/computeLifecycleWakeDecision.ts index c73975766..9b084a770 100644 --- a/app/workflows/computeLifecycleWakeDecision.ts +++ b/app/workflows/computeLifecycleWakeDecision.ts @@ -1,6 +1,5 @@ import { getLifecycleDueAtMs } from "@/lib/sandbox/getLifecycleDueAtMs"; import { hasRuntimeSandboxState } from "@/lib/sandbox/hasRuntimeSandboxState"; -import { claimSessionLifecycleRunId } from "@/lib/sessions/claimSessionLifecycleRunId"; import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; interface LifecycleWakeDecision { @@ -12,8 +11,9 @@ interface LifecycleWakeDecision { /** * Workflow step run at the top of each `sandboxLifecycleWorkflow` * iteration. Reads the session, decides whether to continue looping, - * and (when continuing) returns the next wake time. Re-claims the - * lease so a concurrent kick that took it over wins. + * and (when continuing) returns the next wake time. Bails when a + * concurrent kick has overwritten `lifecycle_run_id` with a different + * value — that newer run is now responsible for the session. * * @param sessionId - The session id the workflow is tracking. * @param runId - The lease this workflow run owns. @@ -39,8 +39,9 @@ export async function computeLifecycleWakeDecision( return { shouldContinue: false, reason: "sandbox-not-operable" }; } - const claimed = await claimSessionLifecycleRunId(sessionId, runId, runId); - if (!claimed) return { shouldContinue: false, reason: "run-replaced" }; + if (session.lifecycle_run_id !== null && session.lifecycle_run_id !== runId) { + return { shouldContinue: false, reason: "run-replaced" }; + } return { shouldContinue: true, wakeAtMs: getLifecycleDueAtMs(session) }; } diff --git a/lib/sandbox/runKick.ts b/lib/sandbox/runKick.ts index bd44e97c8..f5e1fa918 100644 --- a/lib/sandbox/runKick.ts +++ b/lib/sandbox/runKick.ts @@ -4,7 +4,6 @@ import { createLifecycleRunId } from "@/lib/sandbox/createLifecycleRunId"; import { isLifecycleRunStale } from "@/lib/sandbox/isLifecycleRunStale"; import { reclaimStaleLease } from "@/lib/sandbox/reclaimStaleLease"; import { shouldStartLifecycle } from "@/lib/sandbox/shouldStartLifecycle"; -import { claimSessionLifecycleRunId } from "@/lib/sessions/claimSessionLifecycleRunId"; import { selectSessions } from "@/lib/supabase/sessions/selectSessions"; import { updateSession } from "@/lib/supabase/sessions/updateSession"; import type { SandboxLifecycleReason } from "@/lib/sandbox/sandboxLifecycleTypes"; @@ -20,7 +19,9 @@ interface RunKickInput { * 1. Read the session row * 2. Reclaim the lease if it's stale (workflow crashed) * 3. Skip if the session isn't in a shape where lifecycle makes sense - * 4. Generate a fresh run id and atomically claim it + * (`shouldStartLifecycle` already filters out rows that already + * have a `lifecycle_run_id` — best-effort concurrency guard) + * 4. Generate a fresh run id and write it to the session row * 5. `start()` the Vercel Workflow run * 6. On `start()` failure, clear the lease so retry can succeed * @@ -40,8 +41,7 @@ export async function runKick(input: RunKickInput): Promise { if (!shouldStartLifecycle(sessionForStart)) return; const runId = createLifecycleRunId(); - const claimed = await claimSessionLifecycleRunId(input.sessionId, runId); - if (!claimed) return; + await updateSession(input.sessionId, { lifecycle_run_id: runId }); try { const run = await start(sandboxLifecycleWorkflow, [input.sessionId, input.reason, runId]); diff --git a/lib/sessions/claimSessionLifecycleRunId.ts b/lib/sessions/claimSessionLifecycleRunId.ts deleted file mode 100644 index 183ea2a2f..000000000 --- a/lib/sessions/claimSessionLifecycleRunId.ts +++ /dev/null @@ -1,32 +0,0 @@ -import { claimSessionLifecycleRunIdIfNull } from "@/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull"; -import { updateSession } from "@/lib/supabase/sessions/updateSession"; - -/** - * Combiner for the two `lifecycle_run_id` claim operations. Lives in - * `lib/sessions/` (not `lib/supabase/sessions/`) because it does not - * directly query Supabase — it composes two underlying helpers. - * - * - `expected = null` (initial claim): atomic `claimSessionLifecycleRunIdIfNull` - * that fails when the row already has a lease. - * - `expected = runId` (workflow refresh): plain `updateSession` write that - * re-asserts the current lease without a conditional WHERE. Accepts a - * small race where a stale-reclaim could be overwritten — the kick path - * (which DOES use the atomic IfNull check) remains the primary - * concurrency guard. - * - * @param sessionId - The session id to claim against. - * @param runId - The new lease value to write. - * @param expected - The expected current value; defaults to null. - * @returns true on success, false when an initial claim was already taken. - */ -export async function claimSessionLifecycleRunId( - sessionId: string, - runId: string, - expected: string | null = null, -): Promise { - if (expected === null) { - return claimSessionLifecycleRunIdIfNull(sessionId, runId); - } - const updated = await updateSession(sessionId, { lifecycle_run_id: runId }); - return updated !== null; -} diff --git a/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts b/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts deleted file mode 100644 index 93f205244..000000000 --- a/lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts +++ /dev/null @@ -1,30 +0,0 @@ -import supabase from "@/lib/supabase/serverClient"; - -/** - * Atomic Supabase write: claims `lifecycle_run_id` for a session only - * if the row's current value is NULL. Returns true on successful - * claim, false if the row was already taken. Used for INITIAL claims - * (i.e. starting a fresh lifecycle run from a kick). - * - * @param sessionId - The session id to claim against. - * @param runId - The new lease value to write. - * @returns true on success, false on contention or error. - */ -export async function claimSessionLifecycleRunIdIfNull( - sessionId: string, - runId: string, -): Promise { - const { data, error } = await supabase - .from("sessions") - .update({ lifecycle_run_id: runId }) - .eq("id", sessionId) - .is("lifecycle_run_id", null) - .select("id") - .maybeSingle(); - - if (error) { - console.error(`[claimSessionLifecycleRunIdIfNull] error for ${sessionId}:`, error); - return false; - } - return data !== null; -}