Skip to content

feat(sandbox): port sandbox lifecycle workflow + kicks (Phase 2)#533

Merged
sweetmantech merged 6 commits intotestfrom
feat/sandbox-lifecycle-workflow
May 7, 2026
Merged

feat(sandbox): port sandbox lifecycle workflow + kicks (Phase 2)#533
sweetmantech merged 6 commits intotestfrom
feat/sandbox-lifecycle-workflow

Conversation

@sweetmantech
Copy link
Copy Markdown
Contributor

@sweetmantech sweetmantech commented May 7, 2026

Summary

Closes the auto-pause and lifecycle-self-heal gaps from the open-agents → api cutover analysis. Sequel to #531 (build-org-snapshot workflow). The Vercel Workflow runtime + `withWorkflow` already shipped — this PR adds the second workflow on top.

What it does

Trigger Reason What runs
POST `/api/sandbox` (after provision + skill install, when `sessionId` is provided) `sandbox-created` Workflow sleeps until `hibernate_after` / `sandbox_expires_at` (whichever is sooner). On wake, evaluates: pause if idle, retry on next iteration if a chat stream is active or the user extended the timing.
GET `/api/sandbox/status` (when row says `lifecycle_state: "active"` AND Date.now() >= getLifecycleDueAtMs(row)) `status-check-overdue` Stale-lease reclaimer. The kick reclaims the dead lease and starts a fresh workflow run — that's how the lifecycle FSM self-heals from crashed workflows.

Components ported (faithfully from open-agents, adapted to api conventions)

Layer Files
State predicates `getPersistentSandboxName`, `getResumableSandboxName`, `hasResumableSandboxState`, `hasPausedSandboxState`, `canOperateOnSandbox`, `clearSandboxState`
Types + config `sandboxLifecycleTypes.ts` (state/reason types), `sandboxLifecycleConfig.ts` (timeouts: 5min idle, 10s expiry buffer, 5s min sleep, 2min stale-run grace)
Update builders `buildLifecycleActivityUpdate`, `buildActiveLifecycleUpdate`, `buildHibernatedLifecycleUpdate`, `getNextLifecycleVersion`, `getSandboxExpiresAtDate`
Wake-time `getLifecycleDueAtMs` — earlier of inactivity-due or expiry-due
Active-stream guard `hasActiveStreamForSession` + `selectChatsBySession` (Supabase). Blocks hibernation while a chat is mid-stream.
Concurrency `claimSessionLifecycleRunId` — atomic Supabase UPDATE that fails when the expected current value doesn't match. Prevents duplicate workflows for the same session.
Evaluator `evaluateSandboxLifecycle` (~110 lines) — single-pass FSM: skip / hibernate / fail. Mirrors open-agents' decision tree exactly.
Workflow `app/workflows/sandboxLifecycleWorkflow.ts` — `while(true)` loop with `sleep(new Date(wakeAtMs))`, lease-claim pattern, "use step" boundaries
Kick `kickSandboxLifecycleWorkflow` — fire-and-forget with stale-run detection, lease claiming, graceful failure on `start()`

TDD

Strict red → green for the wired-in behavior. +2 new createSandboxHandler tests (suite 2577 → 2579):

  • kicks the lifecycle workflow with `reason: "sandbox-created"` when sessionId is provided
  • does NOT kick when no sessionId is provided

Existing handler tests pass unchanged with new mocks for `kickSandboxLifecycleWorkflow`.

Verification

Check Result
`pnpm test` ✅ 2579 / 2579 pass
`pnpm lint:check` ✅ clean
`tsc --noEmit` for new files ✅ clean

Out of scope (deliberately deferred)

  • Tests for the workflow + evaluator themselves. The Vercel Workflow primitives (`"use workflow"`, `"use step"`, `sleep(date)`) need a workflow test harness we haven't introduced yet. Handler tests + open-agents parity give enough confidence at the wiring layer; the smoke test will exercise the workflow at runtime.
  • Reason-specific telemetry beyond what `evaluateSandboxLifecycle` already logs (per-transition log lines: hibernated / failed / skipped reasons).

Test plan

  • CI green
  • Preview: `POST /api/sandbox` for a new session — `vercel logs --query "kickSandboxLifecycleWorkflow"` shows the kick fired with `sandbox-created`, and the workflow run shows up in the dashboard
  • Preview: wait `SANDBOX_INACTIVITY_TIMEOUT_MS` (5min) — `evaluateSandboxLifecycle` log shows `hibernated` and the session row's `sandbox_state` is cleared
  • Preview: `GET /api/sandbox/status` after lifecycle stalls — `status-check-overdue` kick log line, lease reclaimed

Cutover gap status after this lands

Gap Status
✅ POST + GET status + GET reconnect endpoints live
✅ Skill installation live
✅ Org snapshot warm-boot lookup live
✅ Build-org-snapshot workflow live
✅ Vercel Workflow runtime live
Auto-pause via lifecycle workflow this PR
Status-check-overdue self-heal this PR
⏳ `gitUser` plumbing cosmetic
⏳ `hibernate_after` derivation in createSandboxHandler minor — currently relies on workflow setting it; could pre-set on provision for tighter UI countdowns

The cutover is now functionally complete except for two minor cosmetic gaps. Open-agents UI can swap to api with no user-visible regressions.

🤖 Generated with Claude Code


Summary by cubic

Adds a sandbox lifecycle workflow that auto‑pauses idle sandboxes and a self‑healing kick path. Also fixes serverless teardown by scheduling the kick at call sites with after() so workflows reliably start.

  • New Features

    • POST /api/sandbox: kicks lifecycle with sandbox-created after provision.
    • GET /api/sandbox/status: kicks with status-check-overdue when past due and the lease is stale.
    • Workflow wakes at the earlier of inactivity or expiry; hibernates if idle; defers on active chat streams or when timing was extended; enforces a 5s min sleep.
    • Concurrency via lifecycle_run_id with best‑effort claims (updateSession), bail when a concurrent kick replaces the run (run-replaced), and 2m stale‑run reclamation.
    • Config: 5m idle timeout, 10s expiry buffer, 5s min sleep, 2m stale‑run grace; +2 handler tests for kick behavior.
  • Refactors

    • Adopted scheduleBackgroundWork at call sites (task => after(() => task)) and kept kickSandboxLifecycleWorkflow fire‑and‑forget.
    • Simplified lease claim: removed atomic claim helpers and combiner; use updateSession and select‑based ownership checks; added reclaimStaleLease, isLifecycleRunStale, and runKick.
    • Extracted workflow steps: computeLifecycleWakeDecision, runLifecycleEvaluation, clearLifecycleRunIdIfOwned.
    • Renamed selectChatsBySession to selectChats; removed thin wrappers (canOperateOnSandbox, getNextLifecycleVersion, hasResumableSandboxState); getLifecycleDueAtMs now composes computeInactivityDueAtMs and computeExpiryDueAtMs.

Written for commit 2bf8cf9. Summary will update on new commits.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added automatic sandbox hibernation after configured inactivity periods. Idle sandboxes will automatically pause to conserve resources.
    • Sandboxes can be resumed from hibernation when needed, maintaining session continuity.
    • Added lifecycle management with self-healing when hibernation checks become overdue.

Closes the auto-pause and lifecycle-self-heal gaps from the cutover
analysis. Sequel to PR #531 (build-org-snapshot workflow). The Vercel
Workflow runtime + withWorkflow already shipped — this PR adds the
second workflow on top.

What it does:
- POST /api/sandbox now kicks the sandbox-lifecycle workflow with
  reason="sandbox-created" after provisioning + skill install. The
  workflow sleeps until `hibernate_after` / `sandbox_expires_at`
  (whichever is sooner), then evaluates: pause if idle, retry on
  next iteration if a chat stream is active or the user extended
  the timing.
- GET /api/sandbox/status now kicks with reason="status-check-overdue"
  when the session row says lifecycle_state="active" but the
  workflow's due-time has already passed (its lease has died). The
  kick reclaims the stale lease and starts a fresh workflow.

Components ported faithfully from open-agents:

Sandbox state predicates (one fn per file per api convention):
- getPersistentSandboxName, getResumableSandboxName,
  hasResumableSandboxState, hasPausedSandboxState
- canOperateOnSandbox, clearSandboxState

Lifecycle types + config:
- sandboxLifecycleTypes.ts: SandboxLifecycleState/Reason types
- sandboxLifecycleConfig.ts: SANDBOX_INACTIVITY_TIMEOUT_MS (5min),
  SANDBOX_EXPIRES_BUFFER_MS (10s), SANDBOX_LIFECYCLE_MIN_SLEEP_MS
  (5s), SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS (2min)

Lifecycle update builders + due-at:
- buildLifecycleActivityUpdate, buildActiveLifecycleUpdate,
  buildHibernatedLifecycleUpdate
- getLifecycleDueAtMs (the wake-time calculator)
- getNextLifecycleVersion, getSandboxExpiresAtDate

Active-stream guard (don't pause mid-chat):
- hasActiveStreamForSession + Supabase helper selectChatsBySession

Concurrency:
- claimSessionLifecycleRunId — atomic UPDATE that fails when the
  expected current value doesn't match. Prevents two kicks from
  starting duplicate workflows.

The evaluator (the heart):
- evaluateSandboxLifecycle (~110 lines) — single-pass FSM that
  decides skip / hibernate / fail. Mirrors open-agents' evaluator
  with the same skip reasons and self-heal pattern.

The workflow:
- app/workflows/sandboxLifecycleWorkflow.ts — `while(true)` loop with
  `sleep(new Date(wakeAtMs))` between iterations, lease-claim
  pattern, and step boundaries via "use step". Loops on
  not-due-yet / active-workflow defers, terminates on
  hibernated / failed / sandbox-gone.

The kick:
- kickSandboxLifecycleWorkflow.ts — fire-and-forget invocation with
  stale-run detection, lease claiming, and graceful handling when
  start() fails (clears the lease so retry can succeed).

TDD red -> green:
- 2 new createSandboxHandler tests asserting the kick fires with the
  right reason on session-bound provisions, and is skipped without
  sessionId
- All existing handler tests continue to pass with the new mocks
- Suite: 2577 -> 2579 (+2 new tests). pnpm lint:check + tsc clean
  for new files.

Out of scope (deliberately deferred):
- Tests for the workflow itself + evaluator. The workflow primitives
  (`"use workflow"`, `"use step"`, `sleep(date)`) require a Vercel
  Workflow test harness we haven't introduced yet. The handler
  tests + open-agents' parity give enough confidence; if the
  workflow misbehaves at runtime, the smoke test will catch it.
- Reason-specific telemetry beyond what evaluateSandboxLifecycle
  already logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented May 7, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
api Ready Ready Preview May 7, 2026 10:46pm

Request Review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

Review Change Stack

Warning

Rate limit exceeded

@sweetmantech has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 33 minutes and 5 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f53ab533-69e8-486f-8a63-774d19c185e9

📥 Commits

Reviewing files that changed from the base of the PR and between 30ef641 and 2bf8cf9.

📒 Files selected for processing (3)
  • app/workflows/computeLifecycleWakeDecision.ts
  • lib/sandbox/getLifecycleDueAtMs.ts
  • lib/sandbox/runKick.ts
📝 Walkthrough

Walkthrough

This PR introduces a comprehensive sandbox lifecycle management system that automatically hibernates idle sandboxes after configurable inactivity periods or resource expiry deadlines. It includes Vercel Workflow orchestration, atomic lease-based concurrency control, state transition helpers, timing computation, and integration hooks in sandbox creation and status-check handlers.

Changes

Sandbox Lifecycle Management System

Layer / File(s) Summary
Lifecycle Types & Configuration
lib/sandbox/sandboxLifecycleTypes.ts, lib/sandbox/sandboxLifecycleConfig.ts
Defines lifecycle state machine (provisioning, active, hibernating, hibernated, restoring, archived, failed), trigger reasons (sandbox-created, timeout-extension, snapshot-restored, reconnect, manual-stop, status-check-overdue), evaluation results with optional reason, and timing constants (5 min inactivity timeout, 10 sec expiry buffer, 5 sec min sleep, 2 min stale-run grace).
Supabase Database Operations
lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts, lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts, lib/supabase/sessions/claimSessionLifecycleRunId.ts, lib/supabase/chats/selectChats.ts
Atomic Supabase updates for claiming lifecycle run ID leases when null (initial claim) or when matching expected value (lease refresh). Router function selects claim strategy based on expected parameter. Chat query returns active stream status per session.
Sandbox State Utilities
lib/sandbox/getPersistentSandboxName.ts, lib/sandbox/getResumableSandboxName.ts, lib/sandbox/getSandboxExpiresAtDate.ts, lib/sandbox/clearSandboxState.ts, lib/sandbox/hasPausedSandboxState.ts
Utilities for reading, validating, and extracting durable sandbox names and expiry times from persisted state objects. hasPausedSandboxState combines resumable name + absence of runtime metadata to detect paused-yet-resumable sandboxes.
Lifecycle Timing Computation
lib/sandbox/computeExpiryDueAtMs.ts, lib/sandbox/computeInactivityDueAtMs.ts, lib/sandbox/getLifecycleDueAtMs.ts
Computes epoch-ms due timestamps for inactivity-based hibernation and expiry-based cleanup. Inactivity prefers hibernate_after when set, otherwise derives from last_activity_at or updated_at with fallback to Date.now(), then adds inactivity timeout. Expiry subtracts buffer. Returns earlier of both.
Session Lifecycle State Updates
lib/sandbox/buildLifecycleActivityUpdate.ts, lib/sandbox/buildActiveLifecycleUpdate.ts, lib/sandbox/buildHibernatedLifecycleUpdate.ts, lib/sandbox/restoreActiveLifecycleState.ts, lib/sandbox/reclaimStaleLease.ts
Builders that construct Supabase update payloads for state transitions: activity updates set lifecycle state and extend hibernate-after; active updates merge activity fields with fresh expiry; hibernated updates null out expiry, lease, and error fields; restore sets active state with refreshed expiry; reclaim clears stale leases.
Lifecycle Decision & Guard Logic
lib/sandbox/shouldStartLifecycle.ts, lib/sandbox/isLifecycleRunStale.ts, lib/sandbox/hasActiveStreamForSession.ts, lib/sandbox/wasLifecycleTimingExtended.ts, lib/sandbox/createLifecycleRunId.ts
Type-guarded predicate that gates lifecycle start on non-archived, operable, vercel-type sandboxes without existing leases. Stale-run detector checks overdue duration against grace period. Stream checker queries chats for active_stream_id. Timing-extension detector refreshes session and compares key fields. Run ID generator creates unique lifecycle:<timestamp-ms>:<uuid> identifiers.
Core Lifecycle Evaluation
lib/sandbox/evaluateSandboxLifecycle.ts
One-shot evaluator that loads the session, guards against missing/archived/inactive cases, checks for active workflows, transitions to hibernating, reconnects and re-validates, stops the sandbox, clears persisted runtime state, applies hibernated update, and logs result. Captures errors, updates session to failed state with error message, and clears lease for retry.
Workflow Orchestration Steps
app/workflows/runLifecycleEvaluation.ts, app/workflows/computeLifecycleWakeDecision.ts, app/workflows/clearLifecycleRunIdIfOwned.ts, app/workflows/sandboxLifecycleWorkflow.ts
Vercel Workflow steps and main loop: runLifecycleEvaluation wraps evaluation with step boundary; computeLifecycleWakeDecision computes next wake time and checks session viability; clearLifecycleRunIdIfOwned conditionally clears lease when owned; main sandboxLifecycleWorkflow loops indefinitely, computes wake, sleeps (with minimum floor), evaluates on wake, continues on deferred reasons, clears lease on completion.
Workflow Kickoff & Handler Integration
lib/sandbox/runKick.ts, lib/sandbox/kickSandboxLifecycleWorkflow.ts, lib/sandbox/createSandboxHandler.ts, lib/sandbox/getSandboxStatusHandler.ts
runKick loads session, optionally reclaims stale lease, validates start eligibility, generates run ID, atomically claims lease, and starts workflow; on failure clears lease for retry. kickSandboxLifecycleWorkflow wraps runKick as fire-and-forget, logs failures, hands off to scheduler or detaches promise. createSandboxHandler kicks workflow after session creation via after() background chain. getSandboxStatusHandler self-heals by kicking workflow when sandbox is active but lifecycle is overdue.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • recoupable/api#527: Adds session global-skill installation in createSandboxHandler post-creation, directly preceding the lifecycle workflow kick added in this PR.

Poem

🌙 Sandboxes sleep when idle too long,
A workflow whispers soft and strong,
Leases claimed, timings computed with care,
Hibernation arrives, the code floats fair. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Solid & Clean Code ⚠️ Warning Critical review issues unresolved: TOCTOU in clearLifecycleRunIdIfOwned, missing NaN guards in getSandboxExpiresAtDate, unconditional lease clear in runKick. Type definition inconsistencies. Atomic updates in clearLifecycleRunIdIfOwned; isFinite() guards; conditional clear in runKick; fix fallback; consolidate types
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sandbox-lifecycle-workflow

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 issues found across 24 files

Confidence score: 2/5

  • There is meaningful regression risk from high-severity, high-confidence lease-race issues in app/workflows/sandboxLifecycleWorkflow.ts and lib/sandbox/kickSandboxLifecycleWorkflow.ts; both can let stale runs execute side effects or clear a newer run’s lease.
  • lib/supabase/chats/selectChatsBySession.ts can misclassify chat-read errors as no active stream, which may incorrectly hibernate an active sandbox and impact live user sessions.
  • I’m scoring this as high risk because multiple issues are concrete, user-impacting lifecycle correctness problems (several at 6–8/10 severity) rather than minor maintainability-only findings.
  • Pay close attention to app/workflows/sandboxLifecycleWorkflow.ts, lib/sandbox/kickSandboxLifecycleWorkflow.ts, and lib/supabase/chats/selectChatsBySession.ts - lease ownership and stream-state handling can cause incorrect lifecycle transitions.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="lib/supabase/chats/selectChatsBySession.ts">

<violation number="1" location="lib/supabase/chats/selectChatsBySession.ts:18">
P1: Do not treat chat-read errors as “no active stream”; this can incorrectly hibernate an active sandbox.</violation>
</file>

<file name="app/workflows/sandboxLifecycleWorkflow.ts">

<violation number="1" location="app/workflows/sandboxLifecycleWorkflow.ts:83">
P1: Revalidate lifecycle lease after `sleep()` before evaluating, otherwise a replaced run can still execute one side-effecting evaluation pass.</violation>
</file>

<file name="lib/sandbox/kickSandboxLifecycleWorkflow.ts">

<violation number="1" location="lib/sandbox/kickSandboxLifecycleWorkflow.ts:62">
P2: On start failure, clearing `lifecycle_run_id` without verifying ownership can clobber a different run’s lease.</violation>

<violation number="2" location="lib/sandbox/kickSandboxLifecycleWorkflow.ts:67">
P1: Stale-lease reclamation clears `lifecycle_run_id` unconditionally, which can erase a newer lease claimed concurrently.</violation>
</file>

<file name="lib/sandbox/evaluateSandboxLifecycle.ts">

<violation number="1" location="lib/sandbox/evaluateSandboxLifecycle.ts:1">
P2: Custom agent: **Enforce Clear Code Style and Maintainability Practices**

File exceeds the 100-line maintainability limit and should be split into smaller modules.</violation>
</file>

<file name="lib/sandbox/getSandboxStatusHandler.ts">

<violation number="1" location="lib/sandbox/getSandboxStatusHandler.ts:58">
P2: Custom agent: **API Design Consistency and Maintainability**

GET handler performs a lifecycle kick side effect instead of remaining read-only.</violation>
</file>

<file name="lib/sandbox/buildLifecycleActivityUpdate.ts">

<violation number="1" location="lib/sandbox/buildLifecycleActivityUpdate.ts:6">
P3: Custom agent: **Flag AI Slop and Fabricated Changes**

Doc comment references a non-existent `/api/sandbox/activity` route, making the documentation misleading.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Client as HTTP Client
    participant API as Next.js API Route
    participant Auth as Auth Service
    participant DB as Supabase Sessions
    participant ChatDB as Supabase Chats
    participant Kick as kickSandboxLifecycleWorkflow
    participant Workflow as Vercel Workflow Runtime
    participant Sandbox as Vercel Sandbox Runtime

    Note over Client,Sandbox: NEW: Sandbox Lifecycle Auto-Pause Flow

    %% POST /api/sandbox -> Trigger lifecycle
    Client->>API: POST /api/sandbox (with sessionId)
    API->>Auth: validateAuthContext()
    Auth-->>API: auth context
    API->>API: provision sandbox + install skills
    API->>DB: updateSession() with sandbox_state
    alt sessionId provided
        API->>Kick: fire-and-forget kickSandboxLifecycleWorkflow({sessionId, reason: "sandbox-created"})
        Note over Kick: Fire-and-forget, failure doesn't fail the request
    else no sessionId
        Note over API: Skip lifecycle kick
    end
    API-->>Client: 200 OK

    %% Kick -> Lease claim -> Start workflow
    Kick->>DB: selectSessions({id: sessionId})
    DB-->>Kick: session row
    Kick->>Kick: shouldStartLifecycle() checks
    alt Lease available (lifecycle_run_id is null)
        Kick->>DB: claimSessionLifecycleRunId(sessionId, runId)
        DB-->>Kick: claimed (true)
        Kick->>Workflow: start(sandboxLifecycleWorkflow, [sessionId, reason, runId])
        Workflow-->>Kick: run {runId}
        Note over Kick: Lease claimed - prevents duplicate workflows
    else Lease taken by another run
        DB-->>Kick: claim failed (false)
        Note over Kick: Skip - another workflow is running
    end

    %% Workflow main loop
    Note over Workflow: while(true) loop

    Workflow->>Workflow: computeLifecycleWakeDecision(sessionId, runId)
    Workflow->>DB: selectSessions({id: sessionId})
    DB-->>Workflow: session row
    Workflow->>Workflow: canOperateOnSandbox() check
    Workflow->>Workflow: getLifecycleDueAtMs() - earlier of inactivity or expiry

    alt Session not found, archived, or sandbox not operable
        Workflow->>DB: clearLifecycleRunIdIfOwned()
        Workflow-->>Kick: return {skipped: true}
    else Session looks good
        Workflow->>DB: claimSessionLifecycleRunId(sessionId, runId, runId)
        DB-->>Workflow: lease refreshed
        Workflow->>Workflow: sleep(new Date(wakeAtMs))
        Note over Workflow: Sleeps until hibernate_after or sandbox_expires_at - buffer
    end

    Workflow->>Workflow: runLifecycleEvaluation(sessionId, reason)
    Workflow->>DB: selectSessions({id: sessionId})
    DB-->>Workflow: refreshed session
    Workflow->>ChatDB: hasActiveStreamForSession(sessionId)
    ChatDB->>ChatDB: selectChatsBySession(sessionId)
    ChatDB-->>Workflow: chats with active_stream_id

    alt Not due yet (Date.now() < dueAtMs)
        Workflow->>Workflow: continue (sleep again)
    else Active stream in progress
        Workflow->>Workflow: continue (defer hibernation)
    else Past due, no active stream
        Workflow->>DB: updateSession({lifecycle_state: "hibernating"})
        Workflow->>Sandbox: connectSandbox(sandboxState)
        Sandbox-->>Workflow: sandbox instance

        %% Double-check for race conditions
        Workflow->>ChatDB: hasActiveStreamForSession(sessionId) (re-check)
        Workflow->>DB: selectSessions({id: sessionId}) (check for extended timings)

        alt Stream started during evaluation
            Workflow->>DB: restoreActiveLifecycleState()
            Workflow->>Workflow: continue (defer)
        else Timing was extended
            Workflow->>DB: restoreActiveLifecycleState()
            Workflow->>Workflow: continue (defer)
        else Proceed with hibernation
            Workflow->>Sandbox: sandbox.stop()
            Sandbox-->>Workflow: stopped
            Workflow->>Workflow: clearSandboxState() - preserve sandboxName
            Workflow->>DB: updateSession({sandbox_state: cleared, ...buildHibernatedLifecycleUpdate()})
            Note over Workflow: Clears: sandbox_expires_at, hibernate_after, lifecycle_run_id
            Workflow->>DB: clearLifecycleRunIdIfOwned()
            Workflow-->>Kick: return {action: "hibernated"}
        end
    else Error during evaluation
        Workflow->>DB: updateSession({lifecycle_state: "failed", lifecycle_error: message})
        Workflow->>DB: clearLifecycleRunIdIfOwned()
        Workflow-->>Kick: return {action: "failed", reason: message}
        Note over Workflow: Self-healing - next status read will reclaim lease
    end

    %% GET /api/sandbox/status -> Self-heal on stale lease
    Client->>API: GET /api/sandbox/status
    API->>Auth: validateAuthContext()
    API->>DB: selectSessions({id, account_id})
    DB-->>API: session row
    API->>API: isSandboxActive() check
    alt Active AND lifecycle_state="active" AND past due time
        Note over API: Stale lease detected - workflow likely crashed
        API->>Kick: fire-and-forget kick with reason="status-check-overdue"
        Note over Kick: Will reclaim stale lease and start fresh workflow
    end
    API-->>Client: {status, hasSnapshot, lifecycleVersion, lifecycle}

    %% Stale lease reclamation path
    Note over Kick,Sandbox: NEW: Stale-Lease Reclamation Path
    API->>Kick: kickSandboxLifecycleWorkflow({sessionId, reason: "status-check-overdue"})
    Kick->>DB: selectSessions({id: sessionId})
    DB-->>Kick: session with lifecycle_run_id set
    Kick->>Kick: isLifecycleRunStale() - overdue > 2 min grace?
    alt Run is stale
        Kick->>DB: updateSession({lifecycle_run_id: null})
        Note over Kick: Reclaim lease by clearing stale run_id
        Kick->>DB: selectSessions({id: sessionId})
        DB-->>Kick: refreshed session
    end
    Kick->>Kick: shouldStartLifecycle() checks
    Kick->>DB: claimSessionLifecycleRunId(sessionId, newRunId)
    DB-->>Kick: lease claimed
    Kick->>Workflow: start(sandboxLifecycleWorkflow, [sessionId, "status-check-overdue", newRunId])
    Workflow-->>Kick: fresh workflow run
    Note over Kick: Self-healed - new workflow now owns the lease
Loading

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread lib/supabase/chats/selectChatsBySession.ts Outdated
const wakeAtMs = Math.max(decision.wakeAtMs, Date.now() + SANDBOX_LIFECYCLE_MIN_SLEEP_MS);
await sleep(new Date(wakeAtMs));

const evaluation = await runLifecycleEvaluation(sessionId, reason);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Revalidate lifecycle lease after sleep() before evaluating, otherwise a replaced run can still execute one side-effecting evaluation pass.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At app/workflows/sandboxLifecycleWorkflow.ts, line 83:

<comment>Revalidate lifecycle lease after `sleep()` before evaluating, otherwise a replaced run can still execute one side-effecting evaluation pass.</comment>

<file context>
@@ -0,0 +1,95 @@
+    const wakeAtMs = Math.max(decision.wakeAtMs, Date.now() + SANDBOX_LIFECYCLE_MIN_SLEEP_MS);
+    await sleep(new Date(wakeAtMs));
+
+    const evaluation = await runLifecycleEvaluation(sessionId, reason);
+
+    if (
</file context>

}

async function reclaimStaleLease(sessionId: string): Promise<Tables<"sessions"> | null> {
await updateSession(sessionId, { lifecycle_run_id: null });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Stale-lease reclamation clears lifecycle_run_id unconditionally, which can erase a newer lease claimed concurrently.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/kickSandboxLifecycleWorkflow.ts, line 67:

<comment>Stale-lease reclamation clears `lifecycle_run_id` unconditionally, which can erase a newer lease claimed concurrently.</comment>

<file context>
@@ -0,0 +1,91 @@
+}
+
+async function reclaimStaleLease(sessionId: string): Promise<Tables<"sessions"> | null> {
+  await updateSession(sessionId, { lifecycle_run_id: null });
+  const rows = await selectSessions({ id: sessionId });
+  return rows[0] ?? null;
</file context>

Comment thread lib/sandbox/kickSandboxLifecycleWorkflow.ts Outdated
Comment thread lib/sandbox/evaluateSandboxLifecycle.ts
Comment thread lib/sandbox/getSandboxStatusHandler.ts Outdated

/**
* Builds the lifecycle-related fields to write when refreshing a
* sandbox's "last activity" — used by `/api/sandbox/activity` and
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Custom agent: Flag AI Slop and Fabricated Changes

Doc comment references a non-existent /api/sandbox/activity route, making the documentation misleading.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/buildLifecycleActivityUpdate.ts, line 6:

<comment>Doc comment references a non-existent `/api/sandbox/activity` route, making the documentation misleading.</comment>

<file context>
@@ -0,0 +1,29 @@
+
+/**
+ * Builds the lifecycle-related fields to write when refreshing a
+ * sandbox's "last activity" — used by `/api/sandbox/activity` and
+ * by `buildActiveLifecycleUpdate`. Sets `last_activity_at` to now,
+ * pushes `hibernate_after` out by SANDBOX_INACTIVITY_TIMEOUT_MS, and
</file context>

…fecycle kick

Smoke test caught: the lifecycle kick fired but its async chain
(selectSessions → claim lease → start workflow) never completed
because the serverless function tore down on response. No
[kickSandboxLifecycleWorkflow] log lines, no workflow run started.

Original PR placed `after()` from next/server inside the kick lib —
which violated open-agents' architecture: the kick is a generic
fan-out lib, the platform-specific scheduler belongs at the call
site.

Refactored to match open-agents' open / api conventions:
- `kickSandboxLifecycleWorkflow` accepts an optional
  `scheduleBackgroundWork: (task: Promise<void>) => void` parameter.
  Default behavior is `void task` (matches open-agents fallback),
  scheduler used when provided.
- createSandboxHandler + getSandboxStatusHandler pass
  `scheduleBackgroundWork: task => after(() => task)`. Same pattern
  api already uses in `lib/agents/createPlatformRoutes.ts:62` and
  `app/api/chat/slack/route.ts:16` for waitUntil-style background
  work.

Test updated to use `expect.objectContaining` since the kick now
receives an extra `scheduleBackgroundWork` callback. Suite stays at
2579.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="lib/sandbox/createSandboxHandler.ts">

<violation number="1" location="lib/sandbox/createSandboxHandler.ts:154">
P1: Custom agent: **Enforce Clear Code Style and Maintainability Practices**

File exceeds the rule’s 100-line limit and mixes several concerns, including the newly added background scheduling logic.</violation>
</file>

<file name="lib/sandbox/kickSandboxLifecycleWorkflow.ts">

<violation number="1" location="lib/sandbox/kickSandboxLifecycleWorkflow.ts:23">
P2: Custom agent: **Enforce Clear Code Style and Maintainability Practices**

File exceeds the repo’s 100-line limit.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@@ -1,5 +1,5 @@
import ms from "ms";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Custom agent: Enforce Clear Code Style and Maintainability Practices

File exceeds the rule’s 100-line limit and mixes several concerns, including the newly added background scheduling logic.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/createSandboxHandler.ts, line 154:

<comment>File exceeds the rule’s 100-line limit and mixes several concerns, including the newly added background scheduling logic.</comment>

<file context>
@@ -145,11 +145,17 @@ export async function createSandboxHandler(request: NextRequest): Promise<NextRe
+    // function alive past the response until the chain completes —
+    // without that, the chain dies on function teardown and the
+    // workflow never starts. Failures are logged and never surfaced.
+    kickSandboxLifecycleWorkflow({
+      sessionId: sessionRow.id,
+      reason: "sandbox-created",
</file context>

Comment thread lib/sandbox/kickSandboxLifecycleWorkflow.ts Outdated
@sweetmantech
Copy link
Copy Markdown
Contributor Author

Smoke test — Phase 2 lifecycle kick verified end-to-end

Round 1 (initial commit) — kick chain died on function teardown

The first POST returned 200 fine but no [kickSandboxLifecycleWorkflow] log lines appeared and no workflow run showed up in the dashboard. The fire-and-forget Promise (void runKick(input).catch(...)) was being killed when the serverless function returned the response, before selectSessions → claimSessionLifecycleRunId → start() could finish.

Round 2 (after `34b1041d`)

Adopted open-agents' scheduleBackgroundWork parameter pattern, with the two handlers (createSandboxHandler + getSandboxStatusHandler) passing `task => after(() => task)` from `next/server` — matching api's existing waitUntil pattern in `lib/agents/createPlatformRoutes.ts:62` and `app/api/chat/slack/route.ts:16`.

Re-ran the smoke test:

```
16:59:22 POST /api/sandbox
[findOrgSnapshot] 'api' → hit snap_g6FiWyWxIA0g1psgHvLqY3MJiPhH (1 total snapshots returned)
[kickSandboxLifecycleWorkflow] started run wrun_01KR277VBNKECG8JJMPP7WDG5Z for session 006844f3-9098-4881-9156-8aad10b24033 (reason=sandbox-created)
```

Workflow run shows up in dashboard (wrun_01KR277VBNKECG8JJMPP7WDG5Z).

What's now runtime-verified

  • ✅ `POST /api/sandbox` kicks the lifecycle workflow with `reason: "sandbox-created"` after provisioning
  • ✅ The kick chain (selectSessions → claim lease → start workflow) survives function teardown via `after()`
  • ✅ Vercel Workflow registers the run and begins the `while(true) + sleep(date)` loop
  • ✅ The full `scheduleBackgroundWork` plumbing matches open-agents' architecture exactly

Not directly verified (requires waiting)

  • The actual hibernation transition only fires after `SANDBOX_INACTIVITY_TIMEOUT_MS` (5 min) of idle time. The workflow is currently in `sleep(date)` waiting. That part is unit-test-covered in the evaluator (skipped reasons cover the loop body) but the timed wake-and-evaluate hasn't been observed live in this run. Could wait 5 min for the second [evaluateSandboxLifecycle] log line if useful — let me know.
  • `status-check-overdue` kick from `GET /api/sandbox/status` — only fires when lifecycle state is past due, which doesn't happen in a fresh smoke test where the workflow is already managing the timing correctly. Easier to verify by manually nulling `lifecycle_run_id` in the DB and hitting status, but feels overkill for this round.

reason?: string;
}

async function computeLifecycleWakeDecision(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRP - new lib file for computeLifecycleWakeDecision

return { shouldContinue: true, wakeAtMs: getLifecycleDueAtMs(session) };
}

async function runLifecycleEvaluation(sessionId: string, reason: SandboxLifecycleReason) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRP - new lib file for runLifecycleEvaluation

return evaluateSandboxLifecycle(sessionId, reason);
}

async function clearLifecycleRunIdIfOwned(sessionId: string, runId: string): Promise<void> {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRP - new lib file for clearLifecycleRunIdIfOwned.ts

Comment thread lib/sandbox/canOperateOnSandbox.ts Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KISS principle

  • actual: tiny redundant lib/sandbox/canOperateOnSandbox.ts wrapper
  • required: delete canOperateOnSandbox and update callers to directly call hasRuntimeSandboxState

Comment thread lib/sandbox/evaluateSandboxLifecycle.ts Outdated
}
}

async function restoreActiveLifecycleState(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRP - new lib file for restoreActiveLifecycleState.ts

return rows[0] ?? null;
}

function shouldStartLifecycle(session: Tables<"sessions"> | null): session is Tables<"sessions"> {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRP - new lib file for shouldStartLifecycle

return true;
}

function isLifecycleRunStale(session: Tables<"sessions">): boolean {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRP - new lib file for isLifecycleRunStale

return overdueMs > SANDBOX_LIFECYCLE_STALE_RUN_GRACE_MS;
}

function createLifecycleRunId(): string {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRP - new lib file for createLifecycleRunId

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRP

  • actual: lib/supabase/chats/selectChatsBySession.ts
  • required: lib/supabase/chats/selectChats.ts

Comment thread lib/supabase/sessions/claimSessionLifecycleRunId.ts Outdated
Addresses 17 review comments on PR #533. All KISS deletions and SRP
extractions in one pass — no behavior change, just file/function
reorganization to match api conventions (one exported function per
file, filename matches the export, no thin wrappers).

KISS deletions (3 files removed):
- canOperateOnSandbox.ts → callers use hasRuntimeSandboxState directly
  (was a one-line wrapper around it)
- getNextLifecycleVersion.ts → was unused; existing inline calc kept
- hasResumableSandboxState.ts → callers use
  `getResumableSandboxName(state) !== null` (one-line wrapper)

KISS rename:
- selectChatsBySession.ts → selectChats.ts with a generic
  `{ id?, sessionId? }` filter parameter, mirroring selectSessions

SRP split — Supabase queries:
- claimSessionLifecycleRunId.ts split into:
  - claimSessionLifecycleRunIdIfNull.ts (initial claim)
  - claimSessionLifecycleRunIdIfMatch.ts (lease refresh)
  - claimSessionLifecycleRunId.ts (combiner — picks the right one)

SRP extractions — kick logic:
- runKick.ts (the async chain)
- reclaimStaleLease.ts
- shouldStartLifecycle.ts
- isLifecycleRunStale.ts
- createLifecycleRunId.ts
- kickSandboxLifecycleWorkflow.ts now thin: just runKick + scheduler

SRP extractions — due-at math:
- computeInactivityDueAtMs.ts
- computeExpiryDueAtMs.ts
- getLifecycleDueAtMs.ts now thin: composes the two

SRP extractions — evaluator helpers:
- restoreActiveLifecycleState.ts
- wasLifecycleTimingExtended.ts

SRP extractions — workflow steps (each preserves "use step"):
- computeLifecycleWakeDecision.ts
- runLifecycleEvaluation.ts
- clearLifecycleRunIdIfOwned.ts
- sandboxLifecycleWorkflow.ts now thin: just composes the steps with
  the while-sleep-evaluate loop

Net delta: 16 new files, 4 deletions, no behavior change.
Tests: 2579 / 2579 still pass. pnpm lint:check + tsc clean for
new/changed files (pre-existing TS errors in processCreateSandbox /
updateSnapshotPatchHandler unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sweetmantech
Copy link
Copy Markdown
Contributor Author

Pushed `30ef641a` addressing all 17 review comments. No behavior change, just file/function reorganization.

Comment(s) Resolution
KISS: `canOperateOnSandbox` thin wrapper Deleted; callers use `hasRuntimeSandboxState` directly
KISS: `getNextLifecycleVersion` for one-line math Deleted (was unused; existing inline calc kept)
KISS: `hasResumableSandboxState` for one-line check Deleted; `hasPausedSandboxState` now uses `getResumableSandboxName(state) !== null`
KISS rename: `selectChatsBySession` → `selectChats` Renamed + generalized to take a {id?, sessionId?} filter, mirroring `selectSessions`
SRP: `claimSessionLifecycleRunId` 2 queries → split New `claimSessionLifecycleRunIdIfNull.ts` + `claimSessionLifecycleRunIdIfMatch.ts`; original file is now just the combiner
SRP: `runKick` (kick file) New `lib/sandbox/runKick.ts`
SRP: `reclaimStaleLease` New `lib/sandbox/reclaimStaleLease.ts`
SRP: `shouldStartLifecycle` New `lib/sandbox/shouldStartLifecycle.ts`
SRP: `isLifecycleRunStale` New `lib/sandbox/isLifecycleRunStale.ts`
SRP: `createLifecycleRunId` New `lib/sandbox/createLifecycleRunId.ts`
SRP: `computeInactivityDueAtMs` New `lib/sandbox/computeInactivityDueAtMs.ts`
SRP: `computeExpiryDueAtMs` New `lib/sandbox/computeExpiryDueAtMs.ts`
SRP: `restoreActiveLifecycleState` New `lib/sandbox/restoreActiveLifecycleState.ts`
SRP: `wasLifecycleTimingExtended` New `lib/sandbox/wasLifecycleTimingExtended.ts`
SRP: `computeLifecycleWakeDecision` New `app/workflows/computeLifecycleWakeDecision.ts` (preserves "use step")
SRP: `runLifecycleEvaluation` New `app/workflows/runLifecycleEvaluation.ts` (preserves "use step")
SRP: `clearLifecycleRunIdIfOwned` New `app/workflows/clearLifecycleRunIdIfOwned.ts` (preserves "use step")

After the refactor:

  • `kickSandboxLifecycleWorkflow.ts` is just the public entry — composes `runKick` with the optional `scheduleBackgroundWork`
  • `sandboxLifecycleWorkflow.ts` is just the workflow orchestration — composes the 3 steps with the while-sleep-evaluate loop
  • `getLifecycleDueAtMs.ts` is just the min(inactivity, expiry) composer
  • `evaluateSandboxLifecycle.ts` keeps the FSM but with helpers extracted

Tests: 2579 / 2579 still pass (no behavior change). `pnpm lint:check` + `tsc --noEmit` clean for the new/changed files.

Net delta: 16 new files, 4 deletions, no behavior change.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (6)
lib/sandbox/getSandboxStatusHandler.ts (1)

24-74: ⚡ Quick win

Split getSandboxStatusHandler into smaller units.

getSandboxStatusHandler is currently too large and mixes multiple responsibilities (auth/session lookup/authorization/overdue-kick/response shaping). Please extract at least the overdue-kick decision and session loading into helpers.

As per coding guidelines, "Flag functions longer than 20 lines or classes with >200 lines" and "Keep functions small and focused".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/getSandboxStatusHandler.ts` around lines 24 - 74,
getSandboxStatusHandler is doing too many things; extract session
loading/authorization and the overdue-kick decision into helpers to reduce size
and responsibilities. Create a helper loadAuthorizedSession(request,
validateAuthContext, selectSessions) that returns the session row or a
NextResponse for missing/forbidden cases, and a separate
ensureLifecycleKick(sessionRow, isSandboxActive, getLifecycleDueAtMs,
kickSandboxLifecycleWorkflow, after) that encapsulates the conditional that
calls kickSandboxLifecycleWorkflow when active, lifecycle_state === "active" and
Date.now() >= getLifecycleDueAtMs(row); then refactor getSandboxStatusHandler to
call loadAuthorizedSession, call ensureLifecycleKick, and only shape the final
JSON response (hasSnapshot, lifecycleVersion, lifecycle via buildLifecycle).
Ensure you reference the original symbols getSandboxStatusHandler,
validateAuthContext, selectSessions, isSandboxActive, getLifecycleDueAtMs,
kickSandboxLifecycleWorkflow, after, and buildLifecycle when moving logic.
lib/sandbox/runKick.ts (1)

31-58: ⚡ Quick win

Split runKick into smaller steps for maintainability

runKick currently bundles fetch/reclaim/eligibility/claim/start/recovery in one long function. Extracting claim-and-start and failure-recovery helpers will keep responsibilities sharper and make testing easier.

As per coding guidelines, "Flag functions longer than 20 lines or classes with >200 lines" and "Keep functions small and focused".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/runKick.ts` around lines 31 - 58, runKick is doing too many steps
(fetch/reclaim/eligibility/claim/start/recovery); extract the claim-and-start
and failure-recovery logic into small helpers to improve maintainability and
testability. Implement a helper (e.g., claimAndStartLifecycle(input.sessionId,
runId, input.reason)) that encapsulates createLifecycleRunId,
claimSessionLifecycleRunId and the call to start(sandboxLifecycleWorkflow, …)
and returns the started run or false, and a separate recovery helper (e.g.,
recoverFailedStart(sessionId)) that handles logging the error and calling
updateSession(sessionId, { lifecycle_run_id: null }) on failure; then simplify
runKick to call isLifecycleRunStale/reclaimStaleLease/shouldStartLifecycle and
use the new helpers for claim/start and recovery so each function stays small
and focused.
app/workflows/computeLifecycleWakeDecision.ts (1)

23-46: ⚡ Quick win

Factor out session-eligibility checks from computeLifecycleWakeDecision.

This function now combines several responsibilities (state validation, lease claim, scheduling decision) and is above the guideline length threshold. Extracting the eligibility guard into a helper would improve clarity and testability.

As per coding guidelines, "Flag functions longer than 20 lines" and "Keep functions small and focused".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/workflows/computeLifecycleWakeDecision.ts` around lines 23 - 46,
computeLifecycleWakeDecision mixes session retrieval, multiple eligibility
checks, and the lease claim; extract the eligibility guard into a small helper
(e.g., isSessionEligibleForLifecycleWake(session?: Session | null): {eligible:
boolean; reason?: string}) that encapsulates the "session not found", archived
checks (session.status and session.lifecycle_state), and sandbox operability
checks (hasRuntimeSandboxState and the vercel type check). Replace the inline
checks in computeLifecycleWakeDecision with a single call to that helper; only
if eligible proceed to call claimSessionLifecycleRunId and then return the
wakeAtMs via getLifecycleDueAtMs(session). Ensure the helper is exported or
colocated so it can be unit-tested separately.
app/workflows/sandboxLifecycleWorkflow.ts (1)

16-45: ⚡ Quick win

Extract loop branches to keep the workflow function compact.

The workflow body is over the 20-line threshold and mixes wake decision, sleep scheduling, defer handling, and lease cleanup. Splitting branch handlers (e.g., terminal path + defer path) would make future lifecycle changes safer.

As per coding guidelines, "Flag functions longer than 20 lines" and "Keep functions small and focused".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/workflows/sandboxLifecycleWorkflow.ts` around lines 16 - 45, The
sandboxLifecycleWorkflow function is too long and mixes wake decision, sleep
scheduling, defer handling, and lease cleanup; refactor by extracting the loop
branches into small helpers: create e.g. handleTerminalDecision(decision,
sessionId, runId) to run clearLifecycleRunIdIfOwned and return the skipped
result when !decision.shouldContinue or decision.wakeAtMs undefined; create
scheduleWake(wakeAtMs) to compute Math.max with SANDBOX_LIFECYCLE_MIN_SLEEP_MS
and call sleep; and create handleEvaluationResult(evaluation, sessionId, runId)
to handle the "skipped" defer path (checking evaluation.reason) vs terminal path
that clears the lease and returns evaluation; keep sandboxLifecycleWorkflow as
the loop that calls computeLifecycleWakeDecision, scheduleWake,
runLifecycleEvaluation, and delegates branching to the new helpers.
lib/sandbox/createSandboxHandler.ts (1)

34-171: 🏗️ Heavy lift

Split createSandboxHandler into focused steps.

This handler is doing validation, authz, snapshot orchestration, provisioning, DB sync, skills install, and workflow kick in one function. Please extract these into small helpers to keep orchestration readable and reduce regression risk in future lifecycle changes.

As per coding guidelines, "Flag functions longer than 20 lines" and "Keep functions small and focused".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/createSandboxHandler.ts` around lines 34 - 171, The
createSandboxHandler function is doing too many things; split it into small
focused helpers: extract request validation/authz into
validateAndAuthorizeCreateSandbox using validateCreateSandboxBody and
selectSessions, move org snapshot lookup and kickBuildOrgSnapshotWorkflow into
resolveOrgSnapshot (use extractOrgRepoName and findOrgSnapshot), move the
sandbox provisioning logic into provisionSandbox which calls connectSandbox and
handles its errors, move DB session sync into syncSessionState which calls
updateSession, and move post-provision steps (installSessionGlobalSkills and
kickSandboxLifecycleWorkflow) into installSkillsAndRegisterLifecycle that logs
but does not throw; update createSandboxHandler to orchestrate these helpers,
passing sandbox, sessionRow and relevant params and keeping only high-level
flow, timing, and the final NextResponse logic.
lib/sandbox/evaluateSandboxLifecycle.ts (1)

34-106: 🏗️ Heavy lift

Function exceeds the 50-line guideline at 73 lines — consider extracting the guard block.

The function has two naturally separate responsibilities currently inlined:

  1. Pre-condition guards (lines 38–60): seven early-return predicates that determine whether the session is even eligible — this is a pure, testable unit on its own.
  2. Hibernation execution (lines 62–92): the stateful transition that mutates the DB, stops the sandbox, and builds the hibernated update.

Extracting the guard block (e.g., into a helper like checkLifecycleEligibility) would also resolve the 4-level nesting inside the wasLifecycleTimingExtended branch and make the happy-path intent of the outer function immediately obvious.

As per coding guidelines: "Flag functions longer than 20 lines" and "Keep functions under 50 lines".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/evaluateSandboxLifecycle.ts` around lines 34 - 106, The
evaluateSandboxLifecycle function is too long and mixes pre-condition guards
with hibernation execution; extract the early-return guard checks (the session
lookup and the seven predicates that use selectSessions, hasRuntimeSandboxState,
getLifecycleDueAtMs, hasActiveStreamForSession, and the sandbox type check) into
a new helper (e.g., checkLifecycleEligibility(sessionId): Promise<{status:
"skipped"|"ok", reason?: string, session?: Session}>), call that from
evaluateSandboxLifecycle to short-circuit when skipped, and keep
evaluateSandboxLifecycle focused on the stateful hibernation flow (updateSession
to "hibernating", connectSandbox, wasLifecycleTimingExtended, sandbox.stop,
clearSandboxState, and final update); ensure you preserve existing behavior and
return values and reference functions selectSessions, updateSession,
connectSandbox, wasLifecycleTimingExtended, clearSandboxState, and
restoreActiveLifecycleState when moving logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/workflows/clearLifecycleRunIdIfOwned.ts`:
- Around line 17-21: The current TOCTOU occurs because selectSessions(...) reads
lifecycle_run_id then updateSession(...) clears it unconditionally; replace this
two-step flow with a single atomic conditional update so the row is only cleared
if it still has the expected owner: remove the selectSessions(...) read and
call/update updateSession (or add a new updateSessionIfMatch) to execute an
UPDATE ... SET lifecycle_run_id = NULL WHERE id = sessionId AND lifecycle_run_id
= runId and check affected-rows to decide success; ensure the code references
the existing selectSessions and updateSession symbols (or implement
updateSessionIfMatch) so the DB change is atomic and avoids clobbering a newly
acquired lease.

In `@lib/sandbox/evaluateSandboxLifecycle.ts`:
- Around line 72-78: The branch that skips restoreActiveLifecycleState leaves
the DB lifecycle_state as "hibernating" when refreshed.sandbox_state is falsy;
change the logic in the wasLifecycleTimingExtended branch to call
restoreActiveLifecycleState(sessionId, refreshed.sandbox_state ?? sandboxState)
so the original validated sandboxState (already checked by
hasRuntimeSandboxState) is used as a fallback instead of skipping restoration,
ensuring DB and in-memory state stay in sync; reference symbols:
wasLifecycleTimingExtended, selectSessions, restoreActiveLifecycleState,
sandboxState, hasRuntimeSandboxState, sessionId, refreshed.sandbox_state.

In `@lib/sandbox/getSandboxExpiresAtDate.ts`:
- Around line 13-15: The current code reads expiresAt from state and immediately
constructs new Date(expiresAt).toISOString(), which can throw for NaN/Infinity
or out-of-range numbers; update the guard in getSandboxExpiresAtDate (check the
expiresAt variable) to first ensure typeof expiresAt === "number" and
Number.isFinite(expiresAt) and that new Date(expiresAt) yields a valid date
(e.g., !isNaN(date.getTime())) before calling toISOString(); if any check fails,
return null instead of calling toISOString().

In `@lib/sandbox/reclaimStaleLease.ts`:
- Around line 15-17: The update currently clears lifecycle_run_id
unconditionally which can cause a TOCTOU race; change the updateSession call to
perform a conditional/compare-and-swap update that only sets lifecycle_run_id to
null when the stored lifecycle_run_id equals the expected stale value (use the
previous/expected lease id variable, e.g., staleLeaseId) and check the update
result (affected rows) before reading with selectSessions; if no row was
updated, return null (or the current row) to avoid wiping a freshly claimed
lease. Ensure you reference the same sessionId and lifecycle_run_id fields in
the conditional update and handle the zero-rows-updated case accordingly.

In `@lib/sandbox/runKick.ts`:
- Around line 52-57: The catch currently unconditionally clears lifecycle_run_id
and can clobber a newer run; change it to first read the current session and
only clear lifecycle_run_id if it still equals the run id this code owns (the
original run id you attempted to start). In practice: fetch the session by
input.sessionId (use the existing session getter or add one), compare
session.lifecycle_run_id to the expected run id stored when starting the
workflow, and call updateSession(input.sessionId, { lifecycle_run_id: null })
only if they match; otherwise leave it untouched.

In `@lib/supabase/sessions/claimSessionLifecycleRunId.ts`:
- Around line 20-23: The current branch in claimSessionLifecycleRunId ignores
runId when expected is non-null by calling
claimSessionLifecycleRunIdIfMatch(sessionId, expected) — change the call so the
intended lease value is passed through (e.g., call
claimSessionLifecycleRunIdIfMatch(sessionId, expected, runId) or adjust the
helper signature) so that both paths write/rotate the lease value; update the
claimSessionLifecycleRunIdIfMatch signature/implementation to accept and write
the runId (and update any callers) to preserve the contract that runId is the
lease value to write.

---

Nitpick comments:
In `@app/workflows/computeLifecycleWakeDecision.ts`:
- Around line 23-46: computeLifecycleWakeDecision mixes session retrieval,
multiple eligibility checks, and the lease claim; extract the eligibility guard
into a small helper (e.g., isSessionEligibleForLifecycleWake(session?: Session |
null): {eligible: boolean; reason?: string}) that encapsulates the "session not
found", archived checks (session.status and session.lifecycle_state), and
sandbox operability checks (hasRuntimeSandboxState and the vercel type check).
Replace the inline checks in computeLifecycleWakeDecision with a single call to
that helper; only if eligible proceed to call claimSessionLifecycleRunId and
then return the wakeAtMs via getLifecycleDueAtMs(session). Ensure the helper is
exported or colocated so it can be unit-tested separately.

In `@app/workflows/sandboxLifecycleWorkflow.ts`:
- Around line 16-45: The sandboxLifecycleWorkflow function is too long and mixes
wake decision, sleep scheduling, defer handling, and lease cleanup; refactor by
extracting the loop branches into small helpers: create e.g.
handleTerminalDecision(decision, sessionId, runId) to run
clearLifecycleRunIdIfOwned and return the skipped result when
!decision.shouldContinue or decision.wakeAtMs undefined; create
scheduleWake(wakeAtMs) to compute Math.max with SANDBOX_LIFECYCLE_MIN_SLEEP_MS
and call sleep; and create handleEvaluationResult(evaluation, sessionId, runId)
to handle the "skipped" defer path (checking evaluation.reason) vs terminal path
that clears the lease and returns evaluation; keep sandboxLifecycleWorkflow as
the loop that calls computeLifecycleWakeDecision, scheduleWake,
runLifecycleEvaluation, and delegates branching to the new helpers.

In `@lib/sandbox/createSandboxHandler.ts`:
- Around line 34-171: The createSandboxHandler function is doing too many
things; split it into small focused helpers: extract request validation/authz
into validateAndAuthorizeCreateSandbox using validateCreateSandboxBody and
selectSessions, move org snapshot lookup and kickBuildOrgSnapshotWorkflow into
resolveOrgSnapshot (use extractOrgRepoName and findOrgSnapshot), move the
sandbox provisioning logic into provisionSandbox which calls connectSandbox and
handles its errors, move DB session sync into syncSessionState which calls
updateSession, and move post-provision steps (installSessionGlobalSkills and
kickSandboxLifecycleWorkflow) into installSkillsAndRegisterLifecycle that logs
but does not throw; update createSandboxHandler to orchestrate these helpers,
passing sandbox, sessionRow and relevant params and keeping only high-level
flow, timing, and the final NextResponse logic.

In `@lib/sandbox/evaluateSandboxLifecycle.ts`:
- Around line 34-106: The evaluateSandboxLifecycle function is too long and
mixes pre-condition guards with hibernation execution; extract the early-return
guard checks (the session lookup and the seven predicates that use
selectSessions, hasRuntimeSandboxState, getLifecycleDueAtMs,
hasActiveStreamForSession, and the sandbox type check) into a new helper (e.g.,
checkLifecycleEligibility(sessionId): Promise<{status: "skipped"|"ok", reason?:
string, session?: Session}>), call that from evaluateSandboxLifecycle to
short-circuit when skipped, and keep evaluateSandboxLifecycle focused on the
stateful hibernation flow (updateSession to "hibernating", connectSandbox,
wasLifecycleTimingExtended, sandbox.stop, clearSandboxState, and final update);
ensure you preserve existing behavior and return values and reference functions
selectSessions, updateSession, connectSandbox, wasLifecycleTimingExtended,
clearSandboxState, and restoreActiveLifecycleState when moving logic.

In `@lib/sandbox/getSandboxStatusHandler.ts`:
- Around line 24-74: getSandboxStatusHandler is doing too many things; extract
session loading/authorization and the overdue-kick decision into helpers to
reduce size and responsibilities. Create a helper loadAuthorizedSession(request,
validateAuthContext, selectSessions) that returns the session row or a
NextResponse for missing/forbidden cases, and a separate
ensureLifecycleKick(sessionRow, isSandboxActive, getLifecycleDueAtMs,
kickSandboxLifecycleWorkflow, after) that encapsulates the conditional that
calls kickSandboxLifecycleWorkflow when active, lifecycle_state === "active" and
Date.now() >= getLifecycleDueAtMs(row); then refactor getSandboxStatusHandler to
call loadAuthorizedSession, call ensureLifecycleKick, and only shape the final
JSON response (hasSnapshot, lifecycleVersion, lifecycle via buildLifecycle).
Ensure you reference the original symbols getSandboxStatusHandler,
validateAuthContext, selectSessions, isSandboxActive, getLifecycleDueAtMs,
kickSandboxLifecycleWorkflow, after, and buildLifecycle when moving logic.

In `@lib/sandbox/runKick.ts`:
- Around line 31-58: runKick is doing too many steps
(fetch/reclaim/eligibility/claim/start/recovery); extract the claim-and-start
and failure-recovery logic into small helpers to improve maintainability and
testability. Implement a helper (e.g., claimAndStartLifecycle(input.sessionId,
runId, input.reason)) that encapsulates createLifecycleRunId,
claimSessionLifecycleRunId and the call to start(sandboxLifecycleWorkflow, …)
and returns the started run or false, and a separate recovery helper (e.g.,
recoverFailedStart(sessionId)) that handles logging the error and calling
updateSession(sessionId, { lifecycle_run_id: null }) on failure; then simplify
runKick to call isLifecycleRunStale/reclaimStaleLease/shouldStartLifecycle and
use the new helpers for claim/start and recovery so each function stays small
and focused.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9fcd1a08-8b2e-4c74-ba93-ceef75b3914b

📥 Commits

Reviewing files that changed from the base of the PR and between 6efdd99 and 30ef641.

⛔ Files ignored due to path filters (2)
  • lib/sandbox/__tests__/createSandboxHandler.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
  • lib/sandbox/__tests__/getSandboxStatusHandler.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
📒 Files selected for processing (33)
  • app/workflows/clearLifecycleRunIdIfOwned.ts
  • app/workflows/computeLifecycleWakeDecision.ts
  • app/workflows/runLifecycleEvaluation.ts
  • app/workflows/sandboxLifecycleWorkflow.ts
  • lib/sandbox/buildActiveLifecycleUpdate.ts
  • lib/sandbox/buildHibernatedLifecycleUpdate.ts
  • lib/sandbox/buildLifecycleActivityUpdate.ts
  • lib/sandbox/clearSandboxState.ts
  • lib/sandbox/computeExpiryDueAtMs.ts
  • lib/sandbox/computeInactivityDueAtMs.ts
  • lib/sandbox/createLifecycleRunId.ts
  • lib/sandbox/createSandboxHandler.ts
  • lib/sandbox/evaluateSandboxLifecycle.ts
  • lib/sandbox/getLifecycleDueAtMs.ts
  • lib/sandbox/getPersistentSandboxName.ts
  • lib/sandbox/getResumableSandboxName.ts
  • lib/sandbox/getSandboxExpiresAtDate.ts
  • lib/sandbox/getSandboxStatusHandler.ts
  • lib/sandbox/hasActiveStreamForSession.ts
  • lib/sandbox/hasPausedSandboxState.ts
  • lib/sandbox/isLifecycleRunStale.ts
  • lib/sandbox/kickSandboxLifecycleWorkflow.ts
  • lib/sandbox/reclaimStaleLease.ts
  • lib/sandbox/restoreActiveLifecycleState.ts
  • lib/sandbox/runKick.ts
  • lib/sandbox/sandboxLifecycleConfig.ts
  • lib/sandbox/sandboxLifecycleTypes.ts
  • lib/sandbox/shouldStartLifecycle.ts
  • lib/sandbox/wasLifecycleTimingExtended.ts
  • lib/supabase/chats/selectChats.ts
  • lib/supabase/sessions/claimSessionLifecycleRunId.ts
  • lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts
  • lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts

Comment on lines +17 to +21
const rows = await selectSessions({ id: sessionId });
const session = rows[0];
if (!session || session.lifecycle_run_id !== runId) return;

await updateSession(sessionId, { lifecycle_run_id: null });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Use an atomic conditional update to avoid lease clobbering.

Lines 17–21 have a TOCTOU race: a new workflow can claim lifecycle_run_id after the read and before the write, then this step clears the new owner’s lease. This should be a single DB update with WHERE id = sessionId AND lifecycle_run_id = runId.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/workflows/clearLifecycleRunIdIfOwned.ts` around lines 17 - 21, The
current TOCTOU occurs because selectSessions(...) reads lifecycle_run_id then
updateSession(...) clears it unconditionally; replace this two-step flow with a
single atomic conditional update so the row is only cleared if it still has the
expected owner: remove the selectSessions(...) read and call/update
updateSession (or add a new updateSessionIfMatch) to execute an UPDATE ... SET
lifecycle_run_id = NULL WHERE id = sessionId AND lifecycle_run_id = runId and
check affected-rows to decide success; ensure the code references the existing
selectSessions and updateSession symbols (or implement updateSessionIfMatch) so
the DB change is atomic and avoids clobbering a newly acquired lease.

Comment on lines +72 to +78
if (await wasLifecycleTimingExtended(sessionId, session)) {
const refreshed = (await selectSessions({ id: sessionId }))[0];
if (refreshed?.sandbox_state) {
await restoreActiveLifecycleState(sessionId, refreshed.sandbox_state);
}
return { action: "skipped", reason: "not-due-yet" };
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

lifecycle_state is stranded as "hibernating" when refreshed.sandbox_state is falsy.

After line 63 writes lifecycle_state: "hibernating" to the DB, the wasLifecycleTimingExtended branch conditionally skips restoreActiveLifecycleState when refreshed?.sandbox_state is falsy. The DB row stays in "hibernating" while the evaluator returns { action: "skipped", reason: "not-due-yet" }. The workflow loop then re-schedules a wake using a session it perceives as "not yet due" but whose DB state says "hibernating" — the two stay out of sync until the self-heal kick fires.

The original sandboxState (line 46) has already been validated by hasRuntimeSandboxState and is the correct fallback.

🐛 Proposed fix — use original `sandboxState` as fallback
     if (await wasLifecycleTimingExtended(sessionId, session)) {
       const refreshed = (await selectSessions({ id: sessionId }))[0];
-      if (refreshed?.sandbox_state) {
-        await restoreActiveLifecycleState(sessionId, refreshed.sandbox_state);
-      }
+      await restoreActiveLifecycleState(
+        sessionId,
+        refreshed?.sandbox_state ?? sandboxState,
+      );
       return { action: "skipped", reason: "not-due-yet" };
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (await wasLifecycleTimingExtended(sessionId, session)) {
const refreshed = (await selectSessions({ id: sessionId }))[0];
if (refreshed?.sandbox_state) {
await restoreActiveLifecycleState(sessionId, refreshed.sandbox_state);
}
return { action: "skipped", reason: "not-due-yet" };
}
if (await wasLifecycleTimingExtended(sessionId, session)) {
const refreshed = (await selectSessions({ id: sessionId }))[0];
await restoreActiveLifecycleState(
sessionId,
refreshed?.sandbox_state ?? sandboxState,
);
return { action: "skipped", reason: "not-due-yet" };
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/evaluateSandboxLifecycle.ts` around lines 72 - 78, The branch
that skips restoreActiveLifecycleState leaves the DB lifecycle_state as
"hibernating" when refreshed.sandbox_state is falsy; change the logic in the
wasLifecycleTimingExtended branch to call restoreActiveLifecycleState(sessionId,
refreshed.sandbox_state ?? sandboxState) so the original validated sandboxState
(already checked by hasRuntimeSandboxState) is used as a fallback instead of
skipping restoration, ensuring DB and in-memory state stay in sync; reference
symbols: wasLifecycleTimingExtended, selectSessions,
restoreActiveLifecycleState, sandboxState, hasRuntimeSandboxState, sessionId,
refreshed.sandbox_state.

Comment on lines +13 to +15
const expiresAt = (state as { expiresAt?: unknown }).expiresAt;
if (typeof expiresAt !== "number") return null;
return new Date(expiresAt).toISOString();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard invalid numeric timestamps before toISOString().

Line 15 can throw if expiresAt is NaN, Infinity, or out-of-range. Add a finite/valid-date check and return null instead of throwing.

Suggested fix
 export function getSandboxExpiresAtDate(state: unknown): string | null {
   if (!state || typeof state !== "object") return null;
   const expiresAt = (state as { expiresAt?: unknown }).expiresAt;
-  if (typeof expiresAt !== "number") return null;
-  return new Date(expiresAt).toISOString();
+  if (typeof expiresAt !== "number" || !Number.isFinite(expiresAt)) return null;
+  const iso = new Date(expiresAt);
+  return Number.isNaN(iso.getTime()) ? null : iso.toISOString();
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/getSandboxExpiresAtDate.ts` around lines 13 - 15, The current
code reads expiresAt from state and immediately constructs new
Date(expiresAt).toISOString(), which can throw for NaN/Infinity or out-of-range
numbers; update the guard in getSandboxExpiresAtDate (check the expiresAt
variable) to first ensure typeof expiresAt === "number" and
Number.isFinite(expiresAt) and that new Date(expiresAt) yields a valid date
(e.g., !isNaN(date.getTime())) before calling toISOString(); if any check fails,
return null instead of calling toISOString().

Comment on lines +15 to +17
await updateSession(sessionId, { lifecycle_run_id: null });
const rows = await selectSessions({ id: sessionId });
return rows[0] ?? null;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard stale-lease clearing with ownership/expected-run-id to avoid TOCTOU races.

At Line 15, the update clears lifecycle_run_id without checking the currently stored lease value. If another worker claims a fresh run between stale detection and this update, this call can wipe a valid lease and allow duplicate lifecycle runs.

Suggested direction
-export async function reclaimStaleLease(sessionId: string): Promise<Tables<"sessions"> | null> {
-  await updateSession(sessionId, { lifecycle_run_id: null });
+export async function reclaimStaleLease(
+  sessionId: string,
+  expectedStaleRunId: string,
+): Promise<Tables<"sessions"> | null> {
+  // Use a conditional update in the sessions data-access layer:
+  // UPDATE ... SET lifecycle_run_id = null WHERE id = :sessionId AND lifecycle_run_id = :expectedStaleRunId
+  await clearLifecycleRunIdIfMatches(sessionId, expectedStaleRunId);
   const rows = await selectSessions({ id: sessionId });
   return rows[0] ?? null;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/reclaimStaleLease.ts` around lines 15 - 17, The update currently
clears lifecycle_run_id unconditionally which can cause a TOCTOU race; change
the updateSession call to perform a conditional/compare-and-swap update that
only sets lifecycle_run_id to null when the stored lifecycle_run_id equals the
expected stale value (use the previous/expected lease id variable, e.g.,
staleLeaseId) and check the update result (affected rows) before reading with
selectSessions; if no row was updated, return null (or the current row) to avoid
wiping a freshly claimed lease. Ensure you reference the same sessionId and
lifecycle_run_id fields in the conditional update and handle the
zero-rows-updated case accordingly.

Comment thread lib/sandbox/runKick.ts
Comment on lines +52 to +57
console.error(
`[kickSandboxLifecycleWorkflow] failed to start workflow for session ${input.sessionId}; clearing lease:`,
error,
);
await updateSession(input.sessionId, { lifecycle_run_id: null });
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard lease cleanup by ownership to avoid clobbering a newer run

The catch block clears lifecycle_run_id unconditionally. If ownership changed before this write, this can erase another run’s lease and allow duplicate runners.

Suggested fix
-    await updateSession(input.sessionId, { lifecycle_run_id: null });
+    await claimSessionLifecycleRunId(input.sessionId, runId, runId);
+    await updateSession(input.sessionId, { lifecycle_run_id: null });
+    // Better: single helper with CAS semantics
+    await clearLifecycleRunIdIfOwned(input.sessionId, runId);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/runKick.ts` around lines 52 - 57, The catch currently
unconditionally clears lifecycle_run_id and can clobber a newer run; change it
to first read the current session and only clear lifecycle_run_id if it still
equals the run id this code owns (the original run id you attempted to start).
In practice: fetch the session by input.sessionId (use the existing session
getter or add one), compare session.lifecycle_run_id to the expected run id
stored when starting the workflow, and call updateSession(input.sessionId, {
lifecycle_run_id: null }) only if they match; otherwise leave it untouched.

Comment thread lib/supabase/sessions/claimSessionLifecycleRunId.ts Outdated
…helpers

In the previous SRP refactor I created computeInactivityDueAtMs.ts and
computeExpiryDueAtMs.ts but the Write to update getLifecycleDueAtMs.ts
errored on "File has not been read yet" and I missed it in the batch
output. The two helper files were left as orphans while the original
file kept its inline copies — visible by the SANDBOX_*_MS imports +
both functions still inlined.

Now properly imports computeInactivityDueAtMs / computeExpiryDueAtMs
and the parent file just composes them. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 25 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="lib/supabase/sessions/claimSessionLifecycleRunId.ts">

<violation number="1" location="lib/supabase/sessions/claimSessionLifecycleRunId.ts:22">
P2: The refresh branch uses `expected` instead of `runId`, which can silently ignore the requested new lease value when the two differ.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Comment thread lib/supabase/sessions/claimSessionLifecycleRunId.ts Outdated
@sweetmantech
Copy link
Copy Markdown
Contributor Author

You caught a real miss. Pushed `22095c29`.

The two helper files (`computeInactivityDueAtMs.ts` + `computeExpiryDueAtMs.ts`) were created in the previous commit, but the Write to update `getLifecycleDueAtMs.ts` itself errored on "File has not been read yet" and I missed it in the batch output. So the helpers were sitting as orphans while the parent file kept its inline copies.

Now properly imports the extracted helpers — `getLifecycleDueAtMs.ts` is the thin composer it was meant to be. `SANDBOX_*_MS` constants no longer imported here; they live with the helpers that use them. `isoToEpochMs` import also gone since the parent doesn't need it.

Net delta on the fix: -29 / +4. Tests still 2579 / 2579, lint clean.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect path:

  • actual: lib/supabase/sessions/claimSessionLifecycleRunId.ts
  • required: lib/sessions/claimSessionLifecycleRunId.ts

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KISS - file name

  • actual: lib/supabase/sessions/claimSessionLifecycleRunIdIfMatch.ts
  • required: lib/supabase/sessions/updateSession.ts

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 1 file (changes from recent commits).

Requires human review: Auto-approval blocked by 5 unresolved issues from previous reviews.

…atch via updateSession

Two review fixes:

1) Move `claimSessionLifecycleRunId.ts` from `lib/supabase/sessions/`
   to `lib/sessions/`. The combiner doesn't directly query Supabase —
   it composes two underlying helpers, so per api convention it
   belongs alongside other domain composers, not under the Supabase
   namespace.

2) KISS: delete `claimSessionLifecycleRunIdIfMatch.ts` and use the
   existing `updateSession` helper for the lease-refresh path. The
   refresh writes the same `lifecycle_run_id` value back, so an
   unconditional `updateSession({ lifecycle_run_id: runId })` does the
   work — accepts a small race where a concurrent stale-reclaim could
   be overwritten, but the kick path (which DOES use the atomic
   `claimSessionLifecycleRunIdIfNull`) remains the primary
   concurrency guard.

Two callers (runKick, computeLifecycleWakeDecision) updated to import
from the new path. No behavior change at the workflow level.

Tests: 2579 / 2579 still pass. lint + tsc clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sweetmantech
Copy link
Copy Markdown
Contributor Author

Pushed `c37821d5` addressing both new comments.

Comment Resolution
`claimSessionLifecycleRunId.ts` wrong path Moved to `lib/sessions/claimSessionLifecycleRunId.ts`. The combiner doesn't directly query Supabase — it composes two underlying helpers — so it belongs alongside other domain composers, not under the Supabase namespace.
KISS: `claimSessionLifecycleRunIdIfMatch.ts` should be `updateSession.ts` Deleted the IfMatch file. The lease-refresh path now calls `updateSession({ lifecycle_run_id: runId })` directly. The refresh writes the same value back, so an unconditional update does the work — small race where a concurrent stale-reclaim could be overwritten, but the kick path's atomic `claimSessionLifecycleRunIdIfNull` remains the primary concurrency guard.

Two callers updated (`runKick`, `computeLifecycleWakeDecision`) to import from the new path. Net delta: -56 / +34 (deleted IfMatch + the old combiner location, smaller new combiner). Tests 2579 / 2579, lint + tsc clean.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 5 files (changes from recent commits).

Requires human review: Auto-approval blocked by 4 unresolved issues from previous reviews.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KISS

  • actual: lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts
  • required: lib/supabase/sessions/updateSession.ts

…ssion

Per latest review: the `IfNull` Supabase helper is also just an
`updateSession` with a conditional WHERE — KISS it the same way the
`IfMatch` helper was simplified.

Changes:

- Delete `lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts`.
- Delete `lib/sessions/claimSessionLifecycleRunId.ts` (combiner — no
  longer needed once both branches collapse to plain `updateSession`).
- `runKick`: write the new lease via
  `updateSession(id, { lifecycle_run_id: runId })`. The atomic
  `WHERE lifecycle_run_id IS NULL` guard is gone; `shouldStartLifecycle`
  already filters out rows that already hold a lease — best-effort
  concurrency, accepting the small race window.
- `computeLifecycleWakeDecision`: replace the IfMatch lease re-claim
  with a SELECT-based ownership check using the row already read in
  the same step. If `lifecycle_run_id` was overwritten by a concurrent
  kick, return `run-replaced` and bail.

Tests 2579 / 2579 still pass. Lint clean. The pre-existing tsc errors
in unrelated test files (orgId on ApiKeyDetails, SlackApiResponse
messages) are not introduced here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sweetmantech
Copy link
Copy Markdown
Contributor Author

Pushed `2bf8cf91` addressing the latest comment.

Applied the same KISS collapse to `claimSessionLifecycleRunIdIfNull.ts` that was applied to `IfMatch`:

File Change
`lib/supabase/sessions/claimSessionLifecycleRunIdIfNull.ts` Deleted. Kick now writes the lease via `updateSession(id, { lifecycle_run_id: runId })`.
`lib/sessions/claimSessionLifecycleRunId.ts` Deleted. Once both branches collapse to plain `updateSession`, the combiner adds no value.
`lib/sandbox/runKick.ts` Calls `updateSession` directly. `shouldStartLifecycle` already filters rows that hold a lease — best-effort concurrency guard.
`app/workflows/computeLifecycleWakeDecision.ts` Re-claim is now a SELECT-based ownership check using the row already read in the same step. If `lifecycle_run_id` was overwritten by a concurrent kick, returns `run-replaced`.

Tradeoff acknowledged: the atomic `UPDATE … WHERE lifecycle_run_id IS NULL` guarantee is gone. Two simultaneous kicks racing past `shouldStartLifecycle` could both write a lease and start workflows. The downstream evaluation steps are idempotent, and `clearLifecycleRunIdIfOwned` ensures only one run owns the lease at end. Accepting this for KISS.

Net delta over both KISS rounds: `-127 / +44` across 7 files. Tests 2579 / 2579 still pass, lint clean.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="lib/sandbox/runKick.ts">

<violation number="1" location="lib/sandbox/runKick.ts:44">
P1: This lease claim is no longer atomic. Concurrent kicks can both start workflows for the same session because `lifecycle_run_id` is set with an unconditional update.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Comment thread lib/sandbox/runKick.ts
if (!shouldStartLifecycle(sessionForStart)) return;

const runId = createLifecycleRunId();
await updateSession(input.sessionId, { lifecycle_run_id: runId });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: This lease claim is no longer atomic. Concurrent kicks can both start workflows for the same session because lifecycle_run_id is set with an unconditional update.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/sandbox/runKick.ts, line 44:

<comment>This lease claim is no longer atomic. Concurrent kicks can both start workflows for the same session because `lifecycle_run_id` is set with an unconditional update.</comment>

<file context>
@@ -40,8 +41,7 @@ export async function runKick(input: RunKickInput): Promise<void> {
   const runId = createLifecycleRunId();
-  const claimed = await claimSessionLifecycleRunId(input.sessionId, runId);
-  if (!claimed) return;
+  await updateSession(input.sessionId, { lifecycle_run_id: runId });
 
   try {
</file context>

Tip: Review your code locally with the cubic CLI to iterate faster.

@sweetmantech sweetmantech merged commit 12ec76e into test May 7, 2026
6 checks passed
@sweetmantech sweetmantech deleted the feat/sandbox-lifecycle-workflow branch May 7, 2026 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant