feat(sandbox): port sandbox lifecycle workflow + kicks (Phase 2)#533
feat(sandbox): port sandbox lifecycle workflow + kicks (Phase 2)#533sweetmantech merged 6 commits intotestfrom
Conversation
Closes the auto-pause and lifecycle-self-heal gaps from the cutover analysis. Sequel to PR #531 (build-org-snapshot workflow). The Vercel Workflow runtime + withWorkflow already shipped — this PR adds the second workflow on top. What it does: - POST /api/sandbox now kicks the sandbox-lifecycle workflow with reason="sandbox-created" after provisioning + skill install. The workflow sleeps until `hibernate_after` / `sandbox_expires_at` (whichever is sooner), then evaluates: pause if idle, retry on next iteration if a chat stream is active or the user extended the timing. - GET /api/sandbox/status now kicks with reason="status-check-overdue" when the session row says lifecycle_state="active" but the workflow's due-time has already passed (its lease has died). The kick reclaims the stale lease and starts a fresh workflow. Components ported faithfully from open-agents: Sandbox state predicates (one fn per file per api convention): - getPersistentSandboxName, getResumableSandboxName, hasResumableSandboxState, hasPausedSandboxState - canOperateOnSandbox, clearSandboxState Lifecycle types + config: - sandboxLifecycleTypes.ts: SandboxLifecycleState/Reason types - sandboxLifecycleConfig.ts: SANDBOX_INACTIVITY_TIMEOUT_MS (5min), SANDBOX_EXPIRES_BUFFER_MS (10s), SANDBOX_LIFECYCLE_MIN_SLEEP_MS (5s), SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS (2min) Lifecycle update builders + due-at: - buildLifecycleActivityUpdate, buildActiveLifecycleUpdate, buildHibernatedLifecycleUpdate - getLifecycleDueAtMs (the wake-time calculator) - getNextLifecycleVersion, getSandboxExpiresAtDate Active-stream guard (don't pause mid-chat): - hasActiveStreamForSession + Supabase helper selectChatsBySession Concurrency: - claimSessionLifecycleRunId — atomic UPDATE that fails when the expected current value doesn't match. Prevents two kicks from starting duplicate workflows. The evaluator (the heart): - evaluateSandboxLifecycle (~110 lines) — single-pass FSM that decides skip / hibernate / fail. Mirrors open-agents' evaluator with the same skip reasons and self-heal pattern. The workflow: - app/workflows/sandboxLifecycleWorkflow.ts — `while(true)` loop with `sleep(new Date(wakeAtMs))` between iterations, lease-claim pattern, and step boundaries via "use step". Loops on not-due-yet / active-workflow defers, terminates on hibernated / failed / sandbox-gone. The kick: - kickSandboxLifecycleWorkflow.ts — fire-and-forget invocation with stale-run detection, lease claiming, and graceful handling when start() fails (clears the lease so retry can succeed). TDD red -> green: - 2 new createSandboxHandler tests asserting the kick fires with the right reason on session-bound provisions, and is skipped without sessionId - All existing handler tests continue to pass with the new mocks - Suite: 2577 -> 2579 (+2 new tests). pnpm lint:check + tsc clean for new files. Out of scope (deliberately deferred): - Tests for the workflow itself + evaluator. The workflow primitives (`"use workflow"`, `"use step"`, `sleep(date)`) require a Vercel Workflow test harness we haven't introduced yet. The handler tests + open-agents' parity give enough confidence; if the workflow misbehaves at runtime, the smoke test will catch it. - Reason-specific telemetry beyond what evaluateSandboxLifecycle already logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR introduces a comprehensive sandbox lifecycle management system that automatically hibernates idle sandboxes after configurable inactivity periods or resource expiry deadlines. It includes Vercel Workflow orchestration, atomic lease-based concurrency control, state transition helpers, timing computation, and integration hooks in sandbox creation and status-check handlers. ChangesSandbox Lifecycle Management System
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
7 issues found across 24 files
Confidence score: 2/5
- There is meaningful regression risk from high-severity, high-confidence lease-race issues in
app/workflows/sandboxLifecycleWorkflow.tsandlib/sandbox/kickSandboxLifecycleWorkflow.ts; both can let stale runs execute side effects or clear a newer run’s lease. lib/supabase/chats/selectChatsBySession.tscan misclassify chat-read errors as no active stream, which may incorrectly hibernate an active sandbox and impact live user sessions.- I’m scoring this as high risk because multiple issues are concrete, user-impacting lifecycle correctness problems (several at 6–8/10 severity) rather than minor maintainability-only findings.
- Pay close attention to
app/workflows/sandboxLifecycleWorkflow.ts,lib/sandbox/kickSandboxLifecycleWorkflow.ts, andlib/supabase/chats/selectChatsBySession.ts- lease ownership and stream-state handling can cause incorrect lifecycle transitions.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="lib/supabase/chats/selectChatsBySession.ts">
<violation number="1" location="lib/supabase/chats/selectChatsBySession.ts:18">
P1: Do not treat chat-read errors as “no active stream”; this can incorrectly hibernate an active sandbox.</violation>
</file>
<file name="app/workflows/sandboxLifecycleWorkflow.ts">
<violation number="1" location="app/workflows/sandboxLifecycleWorkflow.ts:83">
P1: Revalidate lifecycle lease after `sleep()` before evaluating, otherwise a replaced run can still execute one side-effecting evaluation pass.</violation>
</file>
<file name="lib/sandbox/kickSandboxLifecycleWorkflow.ts">
<violation number="1" location="lib/sandbox/kickSandboxLifecycleWorkflow.ts:62">
P2: On start failure, clearing `lifecycle_run_id` without verifying ownership can clobber a different run’s lease.</violation>
<violation number="2" location="lib/sandbox/kickSandboxLifecycleWorkflow.ts:67">
P1: Stale-lease reclamation clears `lifecycle_run_id` unconditionally, which can erase a newer lease claimed concurrently.</violation>
</file>
<file name="lib/sandbox/evaluateSandboxLifecycle.ts">
<violation number="1" location="lib/sandbox/evaluateSandboxLifecycle.ts:1">
P2: Custom agent: **Enforce Clear Code Style and Maintainability Practices**
File exceeds the 100-line maintainability limit and should be split into smaller modules.</violation>
</file>
<file name="lib/sandbox/getSandboxStatusHandler.ts">
<violation number="1" location="lib/sandbox/getSandboxStatusHandler.ts:58">
P2: Custom agent: **API Design Consistency and Maintainability**
GET handler performs a lifecycle kick side effect instead of remaining read-only.</violation>
</file>
<file name="lib/sandbox/buildLifecycleActivityUpdate.ts">
<violation number="1" location="lib/sandbox/buildLifecycleActivityUpdate.ts:6">
P3: Custom agent: **Flag AI Slop and Fabricated Changes**
Doc comment references a non-existent `/api/sandbox/activity` route, making the documentation misleading.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Client as HTTP Client
participant API as Next.js API Route
participant Auth as Auth Service
participant DB as Supabase Sessions
participant ChatDB as Supabase Chats
participant Kick as kickSandboxLifecycleWorkflow
participant Workflow as Vercel Workflow Runtime
participant Sandbox as Vercel Sandbox Runtime
Note over Client,Sandbox: NEW: Sandbox Lifecycle Auto-Pause Flow
%% POST /api/sandbox -> Trigger lifecycle
Client->>API: POST /api/sandbox (with sessionId)
API->>Auth: validateAuthContext()
Auth-->>API: auth context
API->>API: provision sandbox + install skills
API->>DB: updateSession() with sandbox_state
alt sessionId provided
API->>Kick: fire-and-forget kickSandboxLifecycleWorkflow({sessionId, reason: "sandbox-created"})
Note over Kick: Fire-and-forget, failure doesn't fail the request
else no sessionId
Note over API: Skip lifecycle kick
end
API-->>Client: 200 OK
%% Kick -> Lease claim -> Start workflow
Kick->>DB: selectSessions({id: sessionId})
DB-->>Kick: session row
Kick->>Kick: shouldStartLifecycle() checks
alt Lease available (lifecycle_run_id is null)
Kick->>DB: claimSessionLifecycleRunId(sessionId, runId)
DB-->>Kick: claimed (true)
Kick->>Workflow: start(sandboxLifecycleWorkflow, [sessionId, reason, runId])
Workflow-->>Kick: run {runId}
Note over Kick: Lease claimed - prevents duplicate workflows
else Lease taken by another run
DB-->>Kick: claim failed (false)
Note over Kick: Skip - another workflow is running
end
%% Workflow main loop
Note over Workflow: while(true) loop
Workflow->>Workflow: computeLifecycleWakeDecision(sessionId, runId)
Workflow->>DB: selectSessions({id: sessionId})
DB-->>Workflow: session row
Workflow->>Workflow: canOperateOnSandbox() check
Workflow->>Workflow: getLifecycleDueAtMs() - earlier of inactivity or expiry
alt Session not found, archived, or sandbox not operable
Workflow->>DB: clearLifecycleRunIdIfOwned()
Workflow-->>Kick: return {skipped: true}
else Session looks good
Workflow->>DB: claimSessionLifecycleRunId(sessionId, runId, runId)
DB-->>Workflow: lease refreshed
Workflow->>Workflow: sleep(new Date(wakeAtMs))
Note over Workflow: Sleeps until hibernate_after or sandbox_expires_at - buffer
end
Workflow->>Workflow: runLifecycleEvaluation(sessionId, reason)
Workflow->>DB: selectSessions({id: sessionId})
DB-->>Workflow: refreshed session
Workflow->>ChatDB: hasActiveStreamForSession(sessionId)
ChatDB->>ChatDB: selectChatsBySession(sessionId)
ChatDB-->>Workflow: chats with active_stream_id
alt Not due yet (Date.now() < dueAtMs)
Workflow->>Workflow: continue (sleep again)
else Active stream in progress
Workflow->>Workflow: continue (defer hibernation)
else Past due, no active stream
Workflow->>DB: updateSession({lifecycle_state: "hibernating"})
Workflow->>Sandbox: connectSandbox(sandboxState)
Sandbox-->>Workflow: sandbox instance
%% Double-check for race conditions
Workflow->>ChatDB: hasActiveStreamForSession(sessionId) (re-check)
Workflow->>DB: selectSessions({id: sessionId}) (check for extended timings)
alt Stream started during evaluation
Workflow->>DB: restoreActiveLifecycleState()
Workflow->>Workflow: continue (defer)
else Timing was extended
Workflow->>DB: restoreActiveLifecycleState()
Workflow->>Workflow: continue (defer)
else Proceed with hibernation
Workflow->>Sandbox: sandbox.stop()
Sandbox-->>Workflow: stopped
Workflow->>Workflow: clearSandboxState() - preserve sandboxName
Workflow->>DB: updateSession({sandbox_state: cleared, ...buildHibernatedLifecycleUpdate()})
Note over Workflow: Clears: sandbox_expires_at, hibernate_after, lifecycle_run_id
Workflow->>DB: clearLifecycleRunIdIfOwned()
Workflow-->>Kick: return {action: "hibernated"}
end
else Error during evaluation
Workflow->>DB: updateSession({lifecycle_state: "failed", lifecycle_error: message})
Workflow->>DB: clearLifecycleRunIdIfOwned()
Workflow-->>Kick: return {action: "failed", reason: message}
Note over Workflow: Self-healing - next status read will reclaim lease
end
%% GET /api/sandbox/status -> Self-heal on stale lease
Client->>API: GET /api/sandbox/status
API->>Auth: validateAuthContext()
API->>DB: selectSessions({id, account_id})
DB-->>API: session row
API->>API: isSandboxActive() check
alt Active AND lifecycle_state="active" AND past due time
Note over API: Stale lease detected - workflow likely crashed
API->>Kick: fire-and-forget kick with reason="status-check-overdue"
Note over Kick: Will reclaim stale lease and start fresh workflow
end
API-->>Client: {status, hasSnapshot, lifecycleVersion, lifecycle}
%% Stale lease reclamation path
Note over Kick,Sandbox: NEW: Stale-Lease Reclamation Path
API->>Kick: kickSandboxLifecycleWorkflow({sessionId, reason: "status-check-overdue"})
Kick->>DB: selectSessions({id: sessionId})
DB-->>Kick: session with lifecycle_run_id set
Kick->>Kick: isLifecycleRunStale() - overdue > 2 min grace?
alt Run is stale
Kick->>DB: updateSession({lifecycle_run_id: null})
Note over Kick: Reclaim lease by clearing stale run_id
Kick->>DB: selectSessions({id: sessionId})
DB-->>Kick: refreshed session
end
Kick->>Kick: shouldStartLifecycle() checks
Kick->>DB: claimSessionLifecycleRunId(sessionId, newRunId)
DB-->>Kick: lease claimed
Kick->>Workflow: start(sandboxLifecycleWorkflow, [sessionId, "status-check-overdue", newRunId])
Workflow-->>Kick: fresh workflow run
Note over Kick: Self-healed - new workflow now owns the lease
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| const wakeAtMs = Math.max(decision.wakeAtMs, Date.now() + SANDBOX_LIFECYCLE_MIN_SLEEP_MS); | ||
| await sleep(new Date(wakeAtMs)); | ||
|
|
||
| const evaluation = await runLifecycleEvaluation(sessionId, reason); |
There was a problem hiding this comment.
P1: Revalidate lifecycle lease after sleep() before evaluating, otherwise a replaced run can still execute one side-effecting evaluation pass.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At app/workflows/sandboxLifecycleWorkflow.ts, line 83:
<comment>Revalidate lifecycle lease after `sleep()` before evaluating, otherwise a replaced run can still execute one side-effecting evaluation pass.</comment>
<file context>
@@ -0,0 +1,95 @@
+ const wakeAtMs = Math.max(decision.wakeAtMs, Date.now() + SANDBOX_LIFECYCLE_MIN_SLEEP_MS);
+ await sleep(new Date(wakeAtMs));
+
+ const evaluation = await runLifecycleEvaluation(sessionId, reason);
+
+ if (
</file context>
| } | ||
|
|
||
| async function reclaimStaleLease(sessionId: string): Promise<Tables<"sessions"> | null> { | ||
| await updateSession(sessionId, { lifecycle_run_id: null }); |
There was a problem hiding this comment.
P1: Stale-lease reclamation clears lifecycle_run_id unconditionally, which can erase a newer lease claimed concurrently.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/kickSandboxLifecycleWorkflow.ts, line 67:
<comment>Stale-lease reclamation clears `lifecycle_run_id` unconditionally, which can erase a newer lease claimed concurrently.</comment>
<file context>
@@ -0,0 +1,91 @@
+}
+
+async function reclaimStaleLease(sessionId: string): Promise<Tables<"sessions"> | null> {
+ await updateSession(sessionId, { lifecycle_run_id: null });
+ const rows = await selectSessions({ id: sessionId });
+ return rows[0] ?? null;
</file context>
|
|
||
| /** | ||
| * Builds the lifecycle-related fields to write when refreshing a | ||
| * sandbox's "last activity" — used by `/api/sandbox/activity` and |
There was a problem hiding this comment.
P3: Custom agent: Flag AI Slop and Fabricated Changes
Doc comment references a non-existent /api/sandbox/activity route, making the documentation misleading.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/buildLifecycleActivityUpdate.ts, line 6:
<comment>Doc comment references a non-existent `/api/sandbox/activity` route, making the documentation misleading.</comment>
<file context>
@@ -0,0 +1,29 @@
+
+/**
+ * Builds the lifecycle-related fields to write when refreshing a
+ * sandbox's "last activity" — used by `/api/sandbox/activity` and
+ * by `buildActiveLifecycleUpdate`. Sets `last_activity_at` to now,
+ * pushes `hibernate_after` out by SANDBOX_INACTIVITY_TIMEOUT_MS, and
</file context>
…fecycle kick Smoke test caught: the lifecycle kick fired but its async chain (selectSessions → claim lease → start workflow) never completed because the serverless function tore down on response. No [kickSandboxLifecycleWorkflow] log lines, no workflow run started. Original PR placed `after()` from next/server inside the kick lib — which violated open-agents' architecture: the kick is a generic fan-out lib, the platform-specific scheduler belongs at the call site. Refactored to match open-agents' open / api conventions: - `kickSandboxLifecycleWorkflow` accepts an optional `scheduleBackgroundWork: (task: Promise<void>) => void` parameter. Default behavior is `void task` (matches open-agents fallback), scheduler used when provided. - createSandboxHandler + getSandboxStatusHandler pass `scheduleBackgroundWork: task => after(() => task)`. Same pattern api already uses in `lib/agents/createPlatformRoutes.ts:62` and `app/api/chat/slack/route.ts:16` for waitUntil-style background work. Test updated to use `expect.objectContaining` since the kick now receives an extra `scheduleBackgroundWork` callback. Suite stays at 2579. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
2 issues found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="lib/sandbox/createSandboxHandler.ts">
<violation number="1" location="lib/sandbox/createSandboxHandler.ts:154">
P1: Custom agent: **Enforce Clear Code Style and Maintainability Practices**
File exceeds the rule’s 100-line limit and mixes several concerns, including the newly added background scheduling logic.</violation>
</file>
<file name="lib/sandbox/kickSandboxLifecycleWorkflow.ts">
<violation number="1" location="lib/sandbox/kickSandboxLifecycleWorkflow.ts:23">
P2: Custom agent: **Enforce Clear Code Style and Maintainability Practices**
File exceeds the repo’s 100-line limit.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| @@ -1,5 +1,5 @@ | |||
| import ms from "ms"; | |||
There was a problem hiding this comment.
P1: Custom agent: Enforce Clear Code Style and Maintainability Practices
File exceeds the rule’s 100-line limit and mixes several concerns, including the newly added background scheduling logic.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/createSandboxHandler.ts, line 154:
<comment>File exceeds the rule’s 100-line limit and mixes several concerns, including the newly added background scheduling logic.</comment>
<file context>
@@ -145,11 +145,17 @@ export async function createSandboxHandler(request: NextRequest): Promise<NextRe
+ // function alive past the response until the chain completes —
+ // without that, the chain dies on function teardown and the
+ // workflow never starts. Failures are logged and never surfaced.
+ kickSandboxLifecycleWorkflow({
+ sessionId: sessionRow.id,
+ reason: "sandbox-created",
</file context>
Smoke test — Phase 2 lifecycle kick verified end-to-endRound 1 (initial commit) — kick chain died on function teardownThe first POST returned 200 fine but no Round 2 (after `34b1041d`)Adopted open-agents' Re-ran the smoke test: ``` Workflow run shows up in dashboard ( What's now runtime-verified
Not directly verified (requires waiting)
|
| reason?: string; | ||
| } | ||
|
|
||
| async function computeLifecycleWakeDecision( |
There was a problem hiding this comment.
SRP - new lib file for computeLifecycleWakeDecision
| return { shouldContinue: true, wakeAtMs: getLifecycleDueAtMs(session) }; | ||
| } | ||
|
|
||
| async function runLifecycleEvaluation(sessionId: string, reason: SandboxLifecycleReason) { |
There was a problem hiding this comment.
SRP - new lib file for runLifecycleEvaluation
| return evaluateSandboxLifecycle(sessionId, reason); | ||
| } | ||
|
|
||
| async function clearLifecycleRunIdIfOwned(sessionId: string, runId: string): Promise<void> { |
There was a problem hiding this comment.
SRP - new lib file for clearLifecycleRunIdIfOwned.ts
There was a problem hiding this comment.
KISS principle
- actual: tiny redundant lib/sandbox/canOperateOnSandbox.ts wrapper
- required: delete canOperateOnSandbox and update callers to directly call
hasRuntimeSandboxState
| } | ||
| } | ||
|
|
||
| async function restoreActiveLifecycleState( |
There was a problem hiding this comment.
SRP - new lib file for restoreActiveLifecycleState.ts
| return rows[0] ?? null; | ||
| } | ||
|
|
||
| function shouldStartLifecycle(session: Tables<"sessions"> | null): session is Tables<"sessions"> { |
There was a problem hiding this comment.
SRP - new lib file for shouldStartLifecycle
| return true; | ||
| } | ||
|
|
||
| function isLifecycleRunStale(session: Tables<"sessions">): boolean { |
There was a problem hiding this comment.
SRP - new lib file for isLifecycleRunStale
| return overdueMs > SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS; | ||
| } | ||
|
|
||
| function createLifecycleRunId(): string { |
There was a problem hiding this comment.
SRP - new lib file for createLifecycleRunId
There was a problem hiding this comment.
SRP
- actual: lib/supabase/chats/selectChatsBySession.ts
- required: lib/supabase/chats/selectChats.ts
Addresses 17 review comments on PR #533. All KISS deletions and SRP extractions in one pass — no behavior change, just file/function reorganization to match api conventions (one exported function per file, filename matches the export, no thin wrappers). KISS deletions (3 files removed): - canOperateOnSandbox.ts → callers use hasRuntimeSandboxState directly (was a one-line wrapper around it) - getNextLifecycleVersion.ts → was unused; existing inline calc kept - hasResumableSandboxState.ts → callers use `getResumableSandboxName(state) !== null` (one-line wrapper) KISS rename: - selectChatsBySession.ts → selectChats.ts with a generic `{ id?, sessionId? }` filter parameter, mirroring selectSessions SRP split — Supabase queries: - claimSessionLifecycleRunId.ts split into: - claimSessionLifecycleRunIdIfNull.ts (initial claim) - claimSessionLifecycleRunIdIfMatch.ts (lease refresh) - claimSessionLifecycleRunId.ts (combiner — picks the right one) SRP extractions — kick logic: - runKick.ts (the async chain) - reclaimStaleLease.ts - shouldStartLifecycle.ts - isLifecycleRunStale.ts - createLifecycleRunId.ts - kickSandboxLifecycleWorkflow.ts now thin: just runKick + scheduler SRP extractions — due-at math: - computeInactivityDueAtMs.ts - computeExpiryDueAtMs.ts - getLifecycleDueAtMs.ts now thin: composes the two SRP extractions — evaluator helpers: - restoreActiveLifecycleState.ts - wasLifecycleTimingExtended.ts SRP extractions — workflow steps (each preserves "use step"): - computeLifecycleWakeDecision.ts - runLifecycleEvaluation.ts - clearLifecycleRunIdIfOwned.ts - sandboxLifecycleWorkflow.ts now thin: just composes the steps with the while-sleep-evaluate loop Net delta: 16 new files, 4 deletions, no behavior change. Tests: 2579 / 2579 still pass. pnpm lint:check + tsc clean for new/changed files (pre-existing TS errors in processCreateSandbox / updateSnapshotPatchHandler unrelated). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed `30ef641a` addressing all 17 review comments. No behavior change, just file/function reorganization.
After the refactor:
Tests: 2579 / 2579 still pass (no behavior change). `pnpm lint:check` + `tsc --noEmit` clean for the new/changed files. Net delta: 16 new files, 4 deletions, no behavior change. |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (6)
lib/sandbox/getSandboxStatusHandler.ts (1)
24-74: ⚡ Quick winSplit
getSandboxStatusHandlerinto smaller units.
getSandboxStatusHandleris currently too large and mixes multiple responsibilities (auth/session lookup/authorization/overdue-kick/response shaping). Please extract at least the overdue-kick decision and session loading into helpers.As per coding guidelines, "Flag functions longer than 20 lines or classes with >200 lines" and "Keep functions small and focused".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/sandbox/getSandboxStatusHandler.ts` around lines 24 - 74, getSandboxStatusHandler is doing too many things; extract session loading/authorization and the overdue-kick decision into helpers to reduce size and responsibilities. Create a helper loadAuthorizedSession(request, validateAuthContext, selectSessions) that returns the session row or a NextResponse for missing/forbidden cases, and a separate ensureLifecycleKick(sessionRow, isSandboxActive, getLifecycleDueAtMs, kickSandboxLifecycleWorkflow, after) that encapsulates the conditional that calls kickSandboxLifecycleWorkflow when active, lifecycle_state === "active" and Date.now() >= getLifecycleDueAtMs(row); then refactor getSandboxStatusHandler to call loadAuthorizedSession, call ensureLifecycleKick, and only shape the final JSON response (hasSnapshot, lifecycleVersion, lifecycle via buildLifecycle). Ensure you reference the original symbols getSandboxStatusHandler, validateAuthContext, selectSessions, isSandboxActive, getLifecycleDueAtMs, kickSandboxLifecycleWorkflow, after, and buildLifecycle when moving logic.lib/sandbox/runKick.ts (1)
31-58: ⚡ Quick winSplit
runKickinto smaller steps for maintainability
runKickcurrently bundles fetch/reclaim/eligibility/claim/start/recovery in one long function. Extracting claim-and-start and failure-recovery helpers will keep responsibilities sharper and make testing easier.As per coding guidelines, "Flag functions longer than 20 lines or classes with >200 lines" and "Keep functions small and focused".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/sandbox/runKick.ts` around lines 31 - 58, runKick is doing too many steps (fetch/reclaim/eligibility/claim/start/recovery); extract the claim-and-start and failure-recovery logic into small helpers to improve maintainability and testability. Implement a helper (e.g., claimAndStartLifecycle(input.sessionId, runId, input.reason)) that encapsulates createLifecycleRunId, claimSessionLifecycleRunId and the call to start(sandboxLifecycleWorkflow, …) and returns the started run or false, and a separate recovery helper (e.g., recoverFailedStart(sessionId)) that handles logging the error and calling updateSession(sessionId, { lifecycle_run_id: null }) on failure; then simplify runKick to call isLifecycleRunStale/reclaimStaleLease/shouldStartLifecycle and use the new helpers for claim/start and recovery so each function stays small and focused.app/workflows/computeLifecycleWakeDecision.ts (1)
23-46: ⚡ Quick winFactor out session-eligibility checks from
computeLifecycleWakeDecision.This function now combines several responsibilities (state validation, lease claim, scheduling decision) and is above the guideline length threshold. Extracting the eligibility guard into a helper would improve clarity and testability.
As per coding guidelines, "Flag functions longer than 20 lines" and "Keep functions small and focused".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/workflows/computeLifecycleWakeDecision.ts` around lines 23 - 46, computeLifecycleWakeDecision mixes session retrieval, multiple eligibility checks, and the lease claim; extract the eligibility guard into a small helper (e.g., isSessionEligibleForLifecycleWake(session?: Session | null): {eligible: boolean; reason?: string}) that encapsulates the "session not found", archived checks (session.status and session.lifecycle_state), and sandbox operability checks (hasRuntimeSandboxState and the vercel type check). Replace the inline checks in computeLifecycleWakeDecision with a single call to that helper; only if eligible proceed to call claimSessionLifecycleRunId and then return the wakeAtMs via getLifecycleDueAtMs(session). Ensure the helper is exported or colocated so it can be unit-tested separately.app/workflows/sandboxLifecycleWorkflow.ts (1)
16-45: ⚡ Quick winExtract loop branches to keep the workflow function compact.
The workflow body is over the 20-line threshold and mixes wake decision, sleep scheduling, defer handling, and lease cleanup. Splitting branch handlers (e.g., terminal path + defer path) would make future lifecycle changes safer.
As per coding guidelines, "Flag functions longer than 20 lines" and "Keep functions small and focused".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/workflows/sandboxLifecycleWorkflow.ts` around lines 16 - 45, The sandboxLifecycleWorkflow function is too long and mixes wake decision, sleep scheduling, defer handling, and lease cleanup; refactor by extracting the loop branches into small helpers: create e.g. handleTerminalDecision(decision, sessionId, runId) to run clearLifecycleRunIdIfOwned and return the skipped result when !decision.shouldContinue or decision.wakeAtMs undefined; create scheduleWake(wakeAtMs) to compute Math.max with SANDBOX_LIFECYCLE_MIN_SLEEP_MS and call sleep; and create handleEvaluationResult(evaluation, sessionId, runId) to handle the "skipped" defer path (checking evaluation.reason) vs terminal path that clears the lease and returns evaluation; keep sandboxLifecycleWorkflow as the loop that calls computeLifecycleWakeDecision, scheduleWake, runLifecycleEvaluation, and delegates branching to the new helpers.lib/sandbox/createSandboxHandler.ts (1)
34-171: 🏗️ Heavy liftSplit
createSandboxHandlerinto focused steps.This handler is doing validation, authz, snapshot orchestration, provisioning, DB sync, skills install, and workflow kick in one function. Please extract these into small helpers to keep orchestration readable and reduce regression risk in future lifecycle changes.
As per coding guidelines, "Flag functions longer than 20 lines" and "Keep functions small and focused".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/sandbox/createSandboxHandler.ts` around lines 34 - 171, The createSandboxHandler function is doing too many things; split it into small focused helpers: extract request validation/authz into validateAndAuthorizeCreateSandbox using validateCreateSandboxBody and selectSessions, move org snapshot lookup and kickBuildOrgSnapshotWorkflow into resolveOrgSnapshot (use extractOrgRepoName and findOrgSnapshot), move the sandbox provisioning logic into provisionSandbox which calls connectSandbox and handles its errors, move DB session sync into syncSessionState which calls updateSession, and move post-provision steps (installSessionGlobalSkills and kickSandboxLifecycleWorkflow) into installSkillsAndRegisterLifecycle that logs but does not throw; update createSandboxHandler to orchestrate these helpers, passing sandbox, sessionRow and relevant params and keeping only high-level flow, timing, and the final NextResponse logic.lib/sandbox/evaluateSandboxLifecycle.ts (1)
34-106: 🏗️ Heavy liftFunction exceeds the 50-line guideline at 73 lines — consider extracting the guard block.
The function has two naturally separate responsibilities currently inlined:
- Pre-condition guards (lines 38–60): seven early-return predicates that determine whether the session is even eligible — this is a pure, testable unit on its own.
- Hibernation execution (lines 62–92): the stateful transition that mutates the DB, stops the sandbox, and builds the hibernated update.
Extracting the guard block (e.g., into a helper like
checkLifecycleEligibility) would also resolve the 4-level nesting inside thewasLifecycleTimingExtendedbranch and make the happy-path intent of the outer function immediately obvious.As per coding guidelines: "Flag functions longer than 20 lines" and "Keep functions under 50 lines".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/sandbox/evaluateSandboxLifecycle.ts` around lines 34 - 106, The evaluateSandboxLifecycle function is too long and mixes pre-condition guards with hibernation execution; extract the early-return guard checks (the session lookup and the seven predicates that use selectSessions, hasRuntimeSandboxState, getLifecycleDueAtMs, hasActiveStreamForSession, and the sandbox type check) into a new helper (e.g., checkLifecycleEligibility(sessionId): Promise<{status: "skipped"|"ok", reason?: string, session?: Session}>), call that from evaluateSandboxLifecycle to short-circuit when skipped, and keep evaluateSandboxLifecycle focused on the stateful hibernation flow (updateSession to "hibernating", connectSandbox, wasLifecycleTimingExtended, sandbox.stop, clearSandboxState, and final update); ensure you preserve existing behavior and return values and reference functions selectSessions, updateSession, connectSandbox, wasLifecycleTimingExtended, clearSandboxState, and restoreActiveLifecycleState when moving logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/workflows/clearLifecycleRunIdIfOwned.ts`:
- Around line 17-21: The current TOCTOU occurs because selectSessions(...) reads
lifecycle_run_id then updateSession(...) clears it unconditionally; replace this
two-step flow with a single atomic conditional update so the row is only cleared
if it still has the expected owner: remove the selectSessions(...) read and
call/update updateSession (or add a new updateSessionIfMatch) to execute an
UPDATE ... SET lifecycle_run_id = NULL WHERE id = sessionId AND lifecycle_run_id
= runId and check affected-rows to decide success; ensure the code references
the existing selectSessions and updateSession symbols (or implement
updateSessionIfMatch) so the DB change is atomic and avoids clobbering a newly
acquired lease.
In `@lib/sandbox/evaluateSandboxLifecycle.ts`:
- Around line 72-78: The branch that skips restoreActiveLifecycleState leaves
the DB lifecycle_state as "hibernating" when refreshed.sandbox_state is falsy;
change the logic in the wasLifecycleTimingExtended branch to call
restoreActiveLifecycleState(sessionId, refreshed.sandbox_state ?? sandboxState)
so the original validated sandboxState (already checked by
hasRuntimeSandboxState) is used as a fallback instead of skipping restoration,
ensuring DB and in-memory state stay in sync; reference symbols:
wasLifecycleTimingExtended, selectSessions, restoreActiveLifecycleState,
sandboxState, hasRuntimeSandboxState, sessionId, refreshed.sandbox_state.
In `@lib/sandbox/getSandboxExpiresAtDate.ts`:
- Around line 13-15: The current code reads expiresAt from state and immediately
constructs new Date(expiresAt).toISOString(), which can throw for NaN/Infinity
or out-of-range numbers; update the guard in getSandboxExpiresAtDate (check the
expiresAt variable) to first ensure typeof expiresAt === "number" and
Number.isFinite(expiresAt) and that new Date(expiresAt) yields a valid date
(e.g., !isNaN(date.getTime())) before calling toISOString(); if any check fails,
return null instead of calling toISOString().
In `@lib/sandbox/reclaimStaleLease.ts`:
- Around line 15-17: The update currently clears lifecycle_run_id
unconditionally which can cause a TOCTOU race; change the updateSession call to
perform a conditional/compare-and-swap update that only sets lifecycle_run_id to
null when the stored lifecycle_run_id equals the expected stale value (use the
previous/expected lease id variable, e.g., staleLeaseId) and check the update
result (affected rows) before reading with selectSessions; if no row was
updated, return null (or the current row) to avoid wiping a freshly claimed
lease. Ensure you reference the same sessionId and lifecycle_run_id fields in
the conditional update and handle the zero-rows-updated case accordingly.
In `@lib/sandbox/runKick.ts`:
- Around line 52-57: The catch currently unconditionally clears lifecycle_run_id
and can clobber a newer run; change it to first read the current session and
only clear lifecycle_run_id if it still equals the run id this code owns (the
original run id you attempted to start). In practice: fetch the session by
input.sessionId (use the existing session getter or add one), compare
session.lifecycle_run_id to the expected run id stored when starting the
workflow, and call updateSession(input.sessionId, { lifecycle_run_id: null })
only if they match; otherwise leave it untouched.
In `@lib/supabase/sessions/claimSessionLifecycleRunId.ts`:
- Around line 20-23: The current branch in claimSessionLifecycleRunId ignores
runId when expected is non-null by calling
claimSessionLifecycleRunIdIfMatch(sessionId, expected) — change the call so the
intended lease value is passed through (e.g., call
claimSessionLifecycleRunIdIfMatch(sessionId, expected, runId) or adjust the
helper signature) so that both paths write/rotate the lease value; update the
claimSessionLifecycleRunIdIfMatch signature/implementation to accept and write
the runId (and update any callers) to preserve the contract that runId is the
lease value to write.
---
Nitpick comments:
In `@app/workflows/computeLifecycleWakeDecision.ts`:
- Around line 23-46: computeLifecycleWakeDecision mixes session retrieval,
multiple eligibility checks, and the lease claim; extract the eligibility guard
into a small helper (e.g., isSessionEligibleForLifecycleWake(session?: Session |
null): {eligible: boolean; reason?: string}) that encapsulates the "session not
found", archived checks (session.status and session.lifecycle_state), and
sandbox operability checks (hasRuntimeSandboxState and the vercel type check).
Replace the inline checks in computeLifecycleWakeDecision with a single call to
that helper; only if eligible proceed to call claimSessionLifecycleRunId and
then return the wakeAtMs via getLifecycleDueAtMs(session). Ensure the helper is
exported or colocated so it can be unit-tested separately.
In `@app/workflows/sandboxLifecycleWorkflow.ts`:
- Around line 16-45: The sandboxLifecycleWorkflow function is too long and mixes
wake decision, sleep scheduling, defer handling, and lease cleanup; refactor by
extracting the loop branches into small helpers: create e.g.
handleTerminalDecision(decision, sessionId, runId) to run
clearLifecycleRunIdIfOwned and return the skipped result when
!decision.shouldContinue or decision.wakeAtMs undefined; create
scheduleWake(wakeAtMs) to compute Math.max with SANDBOX_LIFECYCLE_MIN_SLEEP_MS
and call sleep; and create handleEvaluationResult(evaluation, sessionId, runId)
to handle the "skipped" defer path (checking evaluation.reason) vs terminal path
that clears the lease and returns evaluation; keep sandboxLifecycleWorkflow as
the loop that calls computeLifecycleWakeDecision, scheduleWake,
runLifecycleEvaluation, and delegates branching to the new helpers.
In `@lib/sandbox/createSandboxHandler.ts`:
- Around line 34-171: The createSandboxHandler function is doing too many
things; split it into small focused helpers: extract request validation/authz
into validateAndAuthorizeCreateSandbox using validateCreateSandboxBody and
selectSessions, move org snapshot lookup and kickBuildOrgSnapshotWorkflow into
resolveOrgSnapshot (use extractOrgRepoName and findOrgSnapshot), move the
sandbox provisioning logic into provisionSandbox which calls connectSandbox and
handles its errors, move DB session sync into syncSessionState which calls
updateSession, and move post-provision steps (installSessionGlobalSkills and
kickSandboxLifecycleWorkflow) into installSkillsAndRegisterLifecycle that logs
but does not throw; update createSandboxHandler to orchestrate these helpers,
passing sandbox, sessionRow and relevant params and keeping only high-level
flow, timing, and the final NextResponse logic.
In `@lib/sandbox/evaluateSandboxLifecycle.ts`:
- Around line 34-106: The evaluateSandboxLifecycle function is too long and
mixes pre-condition guards with hibernation execution; extract the early-return
guard checks (the session lookup and the seven predicates that use
selectSessions, hasRuntimeSandboxState, getLifecycleDueAtMs,
hasActiveStreamForSession, and the sandbox type check) into a new helper (e.g.,
checkLifecycleEligibility(sessionId): Promise<{status: "skipped"|"ok", reason?:
string, session?: Session}>), call that from evaluateSandboxLifecycle to
short-circuit when skipped, and keep evaluateSandboxLifecycle focused on the
stateful hibernation flow (updateSession to "hibernating", connectSandbox,
wasLifecycleTimingExtended, sandbox.stop, clearSandboxState, and final update);
ensure you preserve existing behavior and return values and reference functions
selectSessions, updateSession, connectSandbox, wasLifecycleTimingExtended,
clearSandboxState, and restoreActiveLifecycleState when moving logic.
In `@lib/sandbox/getSandboxStatusHandler.ts`:
- Around line 24-74: getSandboxStatusHandler is doing too many things; extract
session loading/authorization and the overdue-kick decision into helpers to
reduce size and responsibilities. Create a helper loadAuthorizedSession(request,
validateAuthContext, selectSessions) that returns the session row or a
NextResponse for missing/forbidden cases, and a separate
ensureLifecycleKick(sessionRow, isSandboxActive, getLifecycleDueAtMs,
kickSandboxLifecycleWorkflow, after) that encapsulates the conditional that
calls kickSandboxLifecycleWorkflow when active, lifecycle_state === "active" and
Date.now() >= getLifecycleDueAtMs(row); then refactor getSandboxStatusHandler to
call loadAuthorizedSession, call ensureLifecycleKick, and only shape the final
JSON response (hasSnapshot, lifecycleVersion, lifecycle via buildLifecycle).
Ensure you reference the original symbols getSandboxStatusHandler,
validateAuthContext, selectSessions, isSandboxActive, getLifecycleDueAtMs,
kickSandboxLifecycleWorkflow, after, and buildLifecycle when moving logic.
In `@lib/sandbox/runKick.ts`:
- Around line 31-58: runKick is doing too many steps
(fetch/reclaim/eligibility/claim/start/recovery); extract the claim-and-start
and failure-recovery logic into small helpers to improve maintainability and
testability. Implement a helper (e.g., claimAndStartLifecycle(input.sessionId,
runId, input.reason)) that encapsulates createLifecycleRunId,
claimSessionLifecycleRunId and the call to start(sandboxLifecycleWorkflow, …)
and returns the started run or false, and a separate recovery helper (e.g.,
recoverFailedStart(sessionId)) that handles logging the error and calling
updateSession(sessionId, { lifecycle_run_id: null }) on failure; then simplify
runKick to call isLifecycleRunStale/reclaimStaleLease/shouldStartLifecycle and
use the new helpers for claim/start and recovery so each function stays small
and focused.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9fcd1a08-8b2e-4c74-ba93-ceef75b3914b
⛔ Files ignored due to path filters (2)
lib/sandbox/__tests__/createSandboxHandler.test.tsis excluded by!**/*.test.*,!**/__tests__/**and included bylib/**lib/sandbox/__tests__/getSandboxStatusHandler.test.tsis excluded by!**/*.test.*,!**/__tests__/**and included bylib/**
📒 Files selected for processing (33)
app/workflows/clearLifecycleRunIdIfOwned.tsapp/workflows/computeLifecycleWakeDecision.tsapp/workflows/runLifecycleEvaluation.tsapp/workflows/sandboxLifecycleWorkflow.tslib/sandbox/buildActiveLifecycleUpdate.tslib/sandbox/buildHibernatedLifecycleUpdate.tslib/sandbox/buildLifecycleActivityUpdate.tslib/sandbox/clearSandboxState.tslib/sandbox/computeExpiryDueAtMs.tslib/sandbox/computeInactivityDueAtMs.tslib/sandbox/createLifecycleRunId.tslib/sandbox/createSandboxHandler.tslib/sandbox/evaluateSandboxLifecycle.tslib/sandbox/getLifecycleDueAtMs.tslib/sandbox/getPersistentSandboxName.tslib/sandbox/getResumableSandboxName.tslib/sandbox/getSandboxExpiresAtDate.tslib/sandbox/getSandboxStatusHandler.tslib/sandbox/hasActiveStreamForSession.tslib/sandbox/hasPausedSandboxState.tslib/sandbox/isLifecycleRunStale.tslib/sandbox/kickSandboxLifecycleWorkflow.tslib/sandbox/reclaimStaleLease.tslib/sandbox/restoreActiveLifecycleState.tslib/sandbox/runKick.tslib/sandbox/sandboxLifecycleConfig.tslib/sandbox/sandboxLifecycleTypes.tslib/sandbox/shouldStartLifecycle.tslib/sandbox/wasLifecycleTimingExtended.tslib/supabase/chats/selectChats.tslib/supabase/sessions/claimSessionLifecycleRunId.tslib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.tslib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts
| const rows = await selectSessions({ id: sessionId }); | ||
| const session = rows[0]; | ||
| if (!session || session.lifecycle_run_id !== runId) return; | ||
|
|
||
| await updateSession(sessionId, { lifecycle_run_id: null }); |
There was a problem hiding this comment.
Use an atomic conditional update to avoid lease clobbering.
Lines 17–21 have a TOCTOU race: a new workflow can claim lifecycle_run_id after the read and before the write, then this step clears the new owner’s lease. This should be a single DB update with WHERE id = sessionId AND lifecycle_run_id = runId.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/workflows/clearLifecycleRunIdIfOwned.ts` around lines 17 - 21, The
current TOCTOU occurs because selectSessions(...) reads lifecycle_run_id then
updateSession(...) clears it unconditionally; replace this two-step flow with a
single atomic conditional update so the row is only cleared if it still has the
expected owner: remove the selectSessions(...) read and call/update
updateSession (or add a new updateSessionIfMatch) to execute an UPDATE ... SET
lifecycle_run_id = NULL WHERE id = sessionId AND lifecycle_run_id = runId and
check affected-rows to decide success; ensure the code references the existing
selectSessions and updateSession symbols (or implement updateSessionIfMatch) so
the DB change is atomic and avoids clobbering a newly acquired lease.
| if (await wasLifecycleTimingExtended(sessionId, session)) { | ||
| const refreshed = (await selectSessions({ id: sessionId }))[0]; | ||
| if (refreshed?.sandbox_state) { | ||
| await restoreActiveLifecycleState(sessionId, refreshed.sandbox_state); | ||
| } | ||
| return { action: "skipped", reason: "not-due-yet" }; | ||
| } |
There was a problem hiding this comment.
lifecycle_state is stranded as "hibernating" when refreshed.sandbox_state is falsy.
After line 63 writes lifecycle_state: "hibernating" to the DB, the wasLifecycleTimingExtended branch conditionally skips restoreActiveLifecycleState when refreshed?.sandbox_state is falsy. The DB row stays in "hibernating" while the evaluator returns { action: "skipped", reason: "not-due-yet" }. The workflow loop then re-schedules a wake using a session it perceives as "not yet due" but whose DB state says "hibernating" — the two stay out of sync until the self-heal kick fires.
The original sandboxState (line 46) has already been validated by hasRuntimeSandboxState and is the correct fallback.
🐛 Proposed fix — use original `sandboxState` as fallback
if (await wasLifecycleTimingExtended(sessionId, session)) {
const refreshed = (await selectSessions({ id: sessionId }))[0];
- if (refreshed?.sandbox_state) {
- await restoreActiveLifecycleState(sessionId, refreshed.sandbox_state);
- }
+ await restoreActiveLifecycleState(
+ sessionId,
+ refreshed?.sandbox_state ?? sandboxState,
+ );
return { action: "skipped", reason: "not-due-yet" };
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (await wasLifecycleTimingExtended(sessionId, session)) { | |
| const refreshed = (await selectSessions({ id: sessionId }))[0]; | |
| if (refreshed?.sandbox_state) { | |
| await restoreActiveLifecycleState(sessionId, refreshed.sandbox_state); | |
| } | |
| return { action: "skipped", reason: "not-due-yet" }; | |
| } | |
| if (await wasLifecycleTimingExtended(sessionId, session)) { | |
| const refreshed = (await selectSessions({ id: sessionId }))[0]; | |
| await restoreActiveLifecycleState( | |
| sessionId, | |
| refreshed?.sandbox_state ?? sandboxState, | |
| ); | |
| return { action: "skipped", reason: "not-due-yet" }; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/sandbox/evaluateSandboxLifecycle.ts` around lines 72 - 78, The branch
that skips restoreActiveLifecycleState leaves the DB lifecycle_state as
"hibernating" when refreshed.sandbox_state is falsy; change the logic in the
wasLifecycleTimingExtended branch to call restoreActiveLifecycleState(sessionId,
refreshed.sandbox_state ?? sandboxState) so the original validated sandboxState
(already checked by hasRuntimeSandboxState) is used as a fallback instead of
skipping restoration, ensuring DB and in-memory state stay in sync; reference
symbols: wasLifecycleTimingExtended, selectSessions,
restoreActiveLifecycleState, sandboxState, hasRuntimeSandboxState, sessionId,
refreshed.sandbox_state.
| const expiresAt = (state as { expiresAt?: unknown }).expiresAt; | ||
| if (typeof expiresAt !== "number") return null; | ||
| return new Date(expiresAt).toISOString(); |
There was a problem hiding this comment.
Guard invalid numeric timestamps before toISOString().
Line 15 can throw if expiresAt is NaN, Infinity, or out-of-range. Add a finite/valid-date check and return null instead of throwing.
Suggested fix
export function getSandboxExpiresAtDate(state: unknown): string | null {
if (!state || typeof state !== "object") return null;
const expiresAt = (state as { expiresAt?: unknown }).expiresAt;
- if (typeof expiresAt !== "number") return null;
- return new Date(expiresAt).toISOString();
+ if (typeof expiresAt !== "number" || !Number.isFinite(expiresAt)) return null;
+ const iso = new Date(expiresAt);
+ return Number.isNaN(iso.getTime()) ? null : iso.toISOString();
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/sandbox/getSandboxExpiresAtDate.ts` around lines 13 - 15, The current
code reads expiresAt from state and immediately constructs new
Date(expiresAt).toISOString(), which can throw for NaN/Infinity or out-of-range
numbers; update the guard in getSandboxExpiresAtDate (check the expiresAt
variable) to first ensure typeof expiresAt === "number" and
Number.isFinite(expiresAt) and that new Date(expiresAt) yields a valid date
(e.g., !isNaN(date.getTime())) before calling toISOString(); if any check fails,
return null instead of calling toISOString().
| await updateSession(sessionId, { lifecycle_run_id: null }); | ||
| const rows = await selectSessions({ id: sessionId }); | ||
| return rows[0] ?? null; |
There was a problem hiding this comment.
Guard stale-lease clearing with ownership/expected-run-id to avoid TOCTOU races.
At Line 15, the update clears lifecycle_run_id without checking the currently stored lease value. If another worker claims a fresh run between stale detection and this update, this call can wipe a valid lease and allow duplicate lifecycle runs.
Suggested direction
-export async function reclaimStaleLease(sessionId: string): Promise<Tables<"sessions"> | null> {
- await updateSession(sessionId, { lifecycle_run_id: null });
+export async function reclaimStaleLease(
+ sessionId: string,
+ expectedStaleRunId: string,
+): Promise<Tables<"sessions"> | null> {
+ // Use a conditional update in the sessions data-access layer:
+ // UPDATE ... SET lifecycle_run_id = null WHERE id = :sessionId AND lifecycle_run_id = :expectedStaleRunId
+ await clearLifecycleRunIdIfMatches(sessionId, expectedStaleRunId);
const rows = await selectSessions({ id: sessionId });
return rows[0] ?? null;
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/sandbox/reclaimStaleLease.ts` around lines 15 - 17, The update currently
clears lifecycle_run_id unconditionally which can cause a TOCTOU race; change
the updateSession call to perform a conditional/compare-and-swap update that
only sets lifecycle_run_id to null when the stored lifecycle_run_id equals the
expected stale value (use the previous/expected lease id variable, e.g.,
staleLeaseId) and check the update result (affected rows) before reading with
selectSessions; if no row was updated, return null (or the current row) to avoid
wiping a freshly claimed lease. Ensure you reference the same sessionId and
lifecycle_run_id fields in the conditional update and handle the
zero-rows-updated case accordingly.
| console.error( | ||
| `[kickSandboxLifecycleWorkflow] failed to start workflow for session ${input.sessionId}; clearing lease:`, | ||
| error, | ||
| ); | ||
| await updateSession(input.sessionId, { lifecycle_run_id: null }); | ||
| } |
There was a problem hiding this comment.
Guard lease cleanup by ownership to avoid clobbering a newer run
The catch block clears lifecycle_run_id unconditionally. If ownership changed before this write, this can erase another run’s lease and allow duplicate runners.
Suggested fix
- await updateSession(input.sessionId, { lifecycle_run_id: null });
+ await claimSessionLifecycleRunId(input.sessionId, runId, runId);
+ await updateSession(input.sessionId, { lifecycle_run_id: null });+ // Better: single helper with CAS semantics
+ await clearLifecycleRunIdIfOwned(input.sessionId, runId);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/sandbox/runKick.ts` around lines 52 - 57, The catch currently
unconditionally clears lifecycle_run_id and can clobber a newer run; change it
to first read the current session and only clear lifecycle_run_id if it still
equals the run id this code owns (the original run id you attempted to start).
In practice: fetch the session by input.sessionId (use the existing session
getter or add one), compare session.lifecycle_run_id to the expected run id
stored when starting the workflow, and call updateSession(input.sessionId, {
lifecycle_run_id: null }) only if they match; otherwise leave it untouched.
…helpers In the previous SRP refactor I created computeInactivityDueAtMs.ts and computeExpiryDueAtMs.ts but the Write to update getLifecycleDueAtMs.ts errored on "File has not been read yet" and I missed it in the batch output. The two helper files were left as orphans while the original file kept its inline copies — visible by the SANDBOX_*_MS imports + both functions still inlined. Now properly imports computeInactivityDueAtMs / computeExpiryDueAtMs and the parent file just composes them. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
1 issue found across 25 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="lib/supabase/sessions/claimSessionLifecycleRunId.ts">
<violation number="1" location="lib/supabase/sessions/claimSessionLifecycleRunId.ts:22">
P2: The refresh branch uses `expected` instead of `runId`, which can silently ignore the requested new lease value when the two differ.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
|
You caught a real miss. Pushed `22095c29`. The two helper files (`computeInactivityDueAtMs.ts` + `computeExpiryDueAtMs.ts`) were created in the previous commit, but the Write to update `getLifecycleDueAtMs.ts` itself errored on "File has not been read yet" and I missed it in the batch output. So the helpers were sitting as orphans while the parent file kept its inline copies. Now properly imports the extracted helpers — `getLifecycleDueAtMs.ts` is the thin composer it was meant to be. `SANDBOX_*_MS` constants no longer imported here; they live with the helpers that use them. `isoToEpochMs` import also gone since the parent doesn't need it. Net delta on the fix: -29 / +4. Tests still 2579 / 2579, lint clean. |
There was a problem hiding this comment.
Incorrect path:
- actual: lib/supabase/sessions/claimSessionLifecycleRunId.ts
- required: lib/sessions/claimSessionLifecycleRunId.ts
There was a problem hiding this comment.
KISS - file name
- actual: lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts
- required: lib/supabase/sessions/updateSession.ts
…atch via updateSession
Two review fixes:
1) Move `claimSessionLifecycleRunId.ts` from `lib/supabase/sessions/`
to `lib/sessions/`. The combiner doesn't directly query Supabase —
it composes two underlying helpers, so per api convention it
belongs alongside other domain composers, not under the Supabase
namespace.
2) KISS: delete `claimSessionLifecycleRunIdIfMatch.ts` and use the
existing `updateSession` helper for the lease-refresh path. The
refresh writes the same `lifecycle_run_id` value back, so an
unconditional `updateSession({ lifecycle_run_id: runId })` does the
work — accepts a small race where a concurrent stale-reclaim could
be overwritten, but the kick path (which DOES use the atomic
`claimSessionLifecycleRunIdIfNull`) remains the primary
concurrency guard.
Two callers (runKick, computeLifecycleWakeDecision) updated to import
from the new path. No behavior change at the workflow level.
Tests: 2579 / 2579 still pass. lint + tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed `c37821d5` addressing both new comments.
Two callers updated (`runKick`, `computeLifecycleWakeDecision`) to import from the new path. Net delta: -56 / +34 (deleted IfMatch + the old combiner location, smaller new combiner). Tests 2579 / 2579, lint + tsc clean. |
There was a problem hiding this comment.
KISS
- actual: lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts
- required: lib/supabase/sessions/updateSession.ts
…ssion
Per latest review: the `IfNull` Supabase helper is also just an
`updateSession` with a conditional WHERE — KISS it the same way the
`IfMatch` helper was simplified.
Changes:
- Delete `lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts`.
- Delete `lib/sessions/claimSessionLifecycleRunId.ts` (combiner — no
longer needed once both branches collapse to plain `updateSession`).
- `runKick`: write the new lease via
`updateSession(id, { lifecycle_run_id: runId })`. The atomic
`WHERE lifecycle_run_id IS NULL` guard is gone; `shouldStartLifecycle`
already filters out rows that already hold a lease — best-effort
concurrency, accepting the small race window.
- `computeLifecycleWakeDecision`: replace the IfMatch lease re-claim
with a SELECT-based ownership check using the row already read in
the same step. If `lifecycle_run_id` was overwritten by a concurrent
kick, return `run-replaced` and bail.
Tests 2579 / 2579 still pass. Lint clean. The pre-existing tsc errors
in unrelated test files (orgId on ApiKeyDetails, SlackApiResponse
messages) are not introduced here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed `2bf8cf91` addressing the latest comment. Applied the same KISS collapse to `claimSessionLifecycleRunIdIfNull.ts` that was applied to `IfMatch`:
Tradeoff acknowledged: the atomic `UPDATE … WHERE lifecycle_run_id IS NULL` guarantee is gone. Two simultaneous kicks racing past `shouldStartLifecycle` could both write a lease and start workflows. The downstream evaluation steps are idempotent, and `clearLifecycleRunIdIfOwned` ensures only one run owns the lease at end. Accepting this for KISS. Net delta over both KISS rounds: `-127 / +44` across 7 files. Tests 2579 / 2579 still pass, lint clean. |
There was a problem hiding this comment.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="lib/sandbox/runKick.ts">
<violation number="1" location="lib/sandbox/runKick.ts:44">
P1: This lease claim is no longer atomic. Concurrent kicks can both start workflows for the same session because `lifecycle_run_id` is set with an unconditional update.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
| if (!shouldStartLifecycle(sessionForStart)) return; | ||
|
|
||
| const runId = createLifecycleRunId(); | ||
| await updateSession(input.sessionId, { lifecycle_run_id: runId }); |
There was a problem hiding this comment.
P1: This lease claim is no longer atomic. Concurrent kicks can both start workflows for the same session because lifecycle_run_id is set with an unconditional update.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/runKick.ts, line 44:
<comment>This lease claim is no longer atomic. Concurrent kicks can both start workflows for the same session because `lifecycle_run_id` is set with an unconditional update.</comment>
<file context>
@@ -40,8 +41,7 @@ export async function runKick(input: RunKickInput): Promise<void> {
const runId = createLifecycleRunId();
- const claimed = await claimSessionLifecycleRunId(input.sessionId, runId);
- if (!claimed) return;
+ await updateSession(input.sessionId, { lifecycle_run_id: runId });
try {
</file context>
Tip: Review your code locally with the cubic CLI to iterate faster.
Summary
Closes the auto-pause and lifecycle-self-heal gaps from the open-agents → api cutover analysis. Sequel to #531 (build-org-snapshot workflow). The Vercel Workflow runtime + `withWorkflow` already shipped — this PR adds the second workflow on top.
What it does
Date.now() >= getLifecycleDueAtMs(row))Components ported (faithfully from open-agents, adapted to api conventions)
TDD
Strict red → green for the wired-in behavior. +2 new createSandboxHandler tests (suite 2577 → 2579):
Existing handler tests pass unchanged with new mocks for `kickSandboxLifecycleWorkflow`.
Verification
Out of scope (deliberately deferred)
Test plan
Cutover gap status after this lands
The cutover is now functionally complete except for two minor cosmetic gaps. Open-agents UI can swap to api with no user-visible regressions.
🤖 Generated with Claude Code
Summary by cubic
Adds a sandbox lifecycle workflow that auto‑pauses idle sandboxes and a self‑healing kick path. Also fixes serverless teardown by scheduling the kick at call sites with
after()so workflows reliably start.New Features
POST /api/sandbox: kicks lifecycle withsandbox-createdafter provision.GET /api/sandbox/status: kicks withstatus-check-overduewhen past due and the lease is stale.lifecycle_run_idwith best‑effort claims (updateSession), bail when a concurrent kick replaces the run (run-replaced), and 2m stale‑run reclamation.Refactors
scheduleBackgroundWorkat call sites (task => after(() => task)) and keptkickSandboxLifecycleWorkflowfire‑and‑forget.updateSessionand select‑based ownership checks; addedreclaimStaleLease,isLifecycleRunStale, andrunKick.computeLifecycleWakeDecision,runLifecycleEvaluation,clearLifecycleRunIdIfOwned.selectChatsBySessiontoselectChats; removed thin wrappers (canOperateOnSandbox,getNextLifecycleVersion,hasResumableSandboxState);getLifecycleDueAtMsnow composescomputeInactivityDueAtMsandcomputeExpiryDueAtMs.Written for commit 2bf8cf9. Summary will update on new commits.
Summary by CodeRabbit
Release Notes