feat(sandbox): port status self-heal + reconnect transient handling #535

Merged
sweetmantech merged 3 commits into test from feat/status-reconnect-parity
May 8, 2026

@sweetmantech
Contributor

@sweetmantech sweetmantech commented May 8, 2026

Closes the remaining open-agents parity gaps in the two read endpoints (status + reconnect) that the chat UI hits on session re-entry / tab refocus.

Summary

`GET /api/sandbox/status`

  • Failed-state self-heal: when runtime is alive but `lifecycle_state === "failed"`, recovers to `active` + clears `lifecycle_error` + refreshes `sandbox_expires_at`. Without it, the UI sticks on "Paused" after a transient eval hiccup.
  • `hasSnapshot` recognizes hibernated state: now true when `snapshot_url` is set, OR when `lifecycle_state === "hibernated"` AND the state has a resumable name.
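
The `hasSnapshot` rule above can be read as a small predicate. This is a minimal sketch, not the handler's actual code: the `SessionRow` shape and the `getResumableName` helper are hypothetical, and it assumes the resumable name is stored under a `sandboxName` field on the persisted state.

```typescript
// Hypothetical row shape; only snapshot_url, lifecycle_state and sandbox_state
// are named in the PR description.
interface SessionRow {
  snapshot_url: string | null;
  lifecycle_state: string | null;
  sandbox_state: unknown;
}

// Assumption: the resumable name is a non-empty string under `sandboxName`.
function getResumableName(state: unknown): string | undefined {
  if (typeof state !== "object" || state === null) return undefined;
  const name = (state as Record<string, unknown>).sandboxName;
  return typeof name === "string" && name.length > 0 ? name : undefined;
}

// snapshot_url set, OR (hibernated AND a resumable name exists).
function hasSnapshot(row: SessionRow): boolean {
  if (row.snapshot_url) return true;
  return (
    row.lifecycle_state === "hibernated" &&
    getResumableName(row.sandbox_state) !== undefined
  );
}
```

The grouping matters: a hibernated row without a resumable name still reports `false`, and an active row with a name but no snapshot does too.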

`GET /api/sandbox/reconnect`

  • Transient-error preservation: only collapses to `expired` when the probe error matches `isSandboxUnavailableError` (404 / 410 / "sandbox not found" / "sandbox is stopped" / "sandbox probe failed" / "expected a stream of command data"). Other failures (502 / connection reset / timeout) preserve runtime state and return `connected` with a `safeExpiresAt` only if it's still in the future.
  • Aggressive-cleanup gating: not-found errors drop the resume handle; other unavailable errors keep it so a future provision can reuse the name.
  • Expires sync: on success, `sandbox_expires_at` is refreshed from the live SDK state.
  • Lifecycle recovery: on success, if row was `lifecycle_state: "failed"`, recovers to `active` + clears `lifecycle_error`.
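
The permanent-versus-transient split above rests on two predicates. The sketch below illustrates the idea using the patterns quoted in this description. The real helpers may inspect a status field rather than the message text; this version assumes the status code appears in the error message, and `normalizeMessage` is a hypothetical helper.

```typescript
// Patterns quoted in the PR description. Assumption: 404 / 410 show up in the
// error message text; the real helpers may read a status property instead.
const NOT_FOUND_PATTERNS = ["404", "sandbox not found"];
const OTHER_UNAVAILABLE_PATTERNS = [
  "410",
  "sandbox is stopped",
  "sandbox probe failed",
  "expected a stream of command data",
];

// Normalize any thrown value to a lowercase message string.
function normalizeMessage(error: unknown): string {
  const raw = error instanceof Error ? error.message : String(error);
  return raw.toLowerCase();
}

// Narrow class: the sandbox is gone and cannot be resumed.
function isSandboxNotFoundError(error: unknown): boolean {
  const message = normalizeMessage(error);
  return NOT_FOUND_PATTERNS.some((pattern) => message.includes(pattern));
}

// Broader class: not-found plus the other permanent-failure patterns.
function isSandboxUnavailableError(error: unknown): boolean {
  if (isSandboxNotFoundError(error)) return true;
  const message = normalizeMessage(error);
  return OTHER_UNAVAILABLE_PATTERNS.some((pattern) => message.includes(pattern));
}
```

Plain substring matching on "404" or "410" can misfire on unrelated numbers in a message, so a production version would likely want a stricter check.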

New helpers (each its own file with a TDD red→green pass):

  • `isSandboxNotFoundError`
  • `isSandboxUnavailableError`
  • `clearSandboxResumeState`
  • `clearUnavailableSandboxState`

Test plan

  • `pnpm test` — 2622 / 2622 pass
  • `pnpm lint:check` — clean
  • `npx tsc --noEmit` — clean for changed files (pre-existing errors elsewhere unchanged)
  • Smoke: provision a session-bound sandbox, hit `/status` + `/reconnect`, verify response shape; force a failure into `lifecycle_state` and confirm self-heal

🤖 Generated with Claude Code


Summary by cubic

Adds self-heal to GET /api/sandbox/status and smarter transient handling to GET /api/sandbox/reconnect to prevent unnecessary rebuilds and stuck “Paused” states. On successful probes, refreshes persisted sandbox_state and sandbox_expires_at so timers match the runtime.

  • New Features
    • Status: If runtime is alive but lifecycle is failed, recover to active, clear lifecycle_error, and refresh sandbox_expires_at. hasSnapshot is true when snapshot_url exists or lifecycle is hibernated with a resumable sandboxName.
    • Reconnect: Only return expired for isSandboxUnavailableError (404/410/not found/stopped/probe failed/expected stream). Treat other errors as transient: keep runtime state, respond connected, and include expiresAt only if it’s still in the future. On success, refresh sandbox_state and sandbox_expires_at, and recover failed → active. Not-found drops the resume handle; other unavailable errors keep it.
    • Helpers: isSandboxNotFoundError, isSandboxUnavailableError, clearSandboxResumeState, clearUnavailableSandboxState, getStateExpiresAt.

Written for commit bec3c4a. Summary will update on new commits.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved sandbox reconnection handling with enhanced error recovery capabilities.
    • Better distinction between temporary and permanent sandbox unavailability errors.
    • Added automatic recovery mechanism for failed sandbox sessions to restore active state.

Closes the remaining open-agents parity gaps in the two read endpoints
that the chat UI hits on session re-entry / tab refocus.

**Status handler (`GET /api/sandbox/status`)**
- **Failed-state self-heal**: when `hasRuntimeSandboxState` matches but
  `lifecycle_state === "failed"`, recovers to `active` + clears
  `lifecycle_error` + refreshes `sandbox_expires_at`. Without this,
  the UI gets stuck on "Paused" after a transient lifecycle eval
  hiccup even though the runtime is still alive.
- **`hasSnapshot` recognizes hibernated state**: now true when
  `snapshot_url` is set, OR when `lifecycle_state === "hibernated"` AND
  the state has a resumable name. Previously only checked `snapshot_url`,
  so paused-but-resumable sessions reported `hasSnapshot: false`.

**Reconnect handler (`GET /api/sandbox/reconnect`)**
- **Transient-error preservation**: only collapses to `expired` when
  the probe error matches a known permanent-failure pattern
  (`isSandboxUnavailableError`: 404 / 410 / "sandbox not found" /
  "sandbox is stopped" / "sandbox probe failed" / "expected a stream
  of command data"). Anything else (502 / connection reset / timeout)
  is treated as transient: runtime state is preserved, response is
  `connected` with a conservative `safeExpiresAt` (only forwarded if
  still in the future). This avoids forcing a full sandbox rebuild
  on a flaky network.
- **Aggressive cleanup gating**: not-found errors drop the resume
  handle (sandbox is gone-gone, can't be brought back), but other
  unavailable errors keep it via `clearUnavailableSandboxState` so a
  future provision can reuse the name.
- **Expires sync**: on a successful probe, `sandbox_expires_at` is
  refreshed from the live SDK state — without it the FE timer drifts
  from reality.
- **Lifecycle recovery**: on a successful probe, if the row was in
  `lifecycle_state: "failed"`, recovers to `active` + clears
  `lifecycle_error`.
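
The probe-failure branch described above reduces to one decision. The sketch below is an illustration rather than the handler's code: `resolveProbeFailure`, its parameter names, and the `ProbeFailureResult` type are hypothetical, and the classifier is passed in so the block stays self-contained.

```typescript
type ProbeFailureResult =
  | { status: "expired" }
  | { status: "connected"; expiresAt?: number };

// `isUnavailable` stands in for isSandboxUnavailableError; `persistedExpiresAt`
// is the epoch-ms expiry last stored for the sandbox (hypothetical names).
function resolveProbeFailure(
  error: unknown,
  persistedExpiresAt: number | undefined,
  isUnavailable: (error: unknown) => boolean,
  now: number = Date.now(),
): ProbeFailureResult {
  // Permanent failure: collapse to expired.
  if (isUnavailable(error)) return { status: "expired" };

  // Transient failure: stay connected, forward the expiry only if it is
  // still in the future.
  const safeExpiresAt =
    persistedExpiresAt !== undefined && persistedExpiresAt > now
      ? persistedExpiresAt
      : undefined;

  return safeExpiresAt === undefined
    ? { status: "connected" }
    : { status: "connected", expiresAt: safeExpiresAt };
}
```

Taking `now` as a parameter keeps the future-only check deterministic in tests.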

**New helpers (each its own SRP file with a vitest red→green pass):**
- `isSandboxNotFoundError` — 404 / sandbox-not-found patterns
- `isSandboxUnavailableError` — broader permanent-failure dispatcher
- `clearSandboxResumeState` — collapses state to just `{ type }`
- `clearUnavailableSandboxState` — picks between resume-clear and
  state-clear based on the error class

Tests: 2622 / 2622 pass. Lint + tsc clean for changed files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel
Contributor

vercel Bot commented May 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| api | Ready | Preview | May 8, 2026 2:02am |


@coderabbitai

coderabbitai Bot commented May 8, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 923a81af-270c-4d4e-8055-524571c8ca60

📥 Commits

Reviewing files that changed from the base of the PR and between 8631b8e and bec3c4a.

⛔ Files ignored due to path filters (2)
  • lib/sandbox/__tests__/getSandboxReconnectHandler.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
  • lib/sandbox/__tests__/getStateExpiresAt.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
📒 Files selected for processing (2)
  • lib/sandbox/getSandboxReconnectHandler.ts
  • lib/sandbox/getStateExpiresAt.ts
📝 Walkthrough

Walkthrough

This PR adds error classification predicates, conditional state-cleanup helpers, and improvements to sandbox reconnect and status handlers. It enables distinguishing permanent sandbox unavailability from transient failures, implementing lifecycle self-healing, and expanding resumable session eligibility logic.

Changes

Sandbox State Resilience and Error Recovery

| Layer / File(s) | Summary |
| --- | --- |
| **Error Detection and Classification**<br>`lib/sandbox/isSandboxNotFoundError.ts`, `lib/sandbox/isSandboxUnavailableError.ts` | Adds `isSandboxNotFoundError()` and `isSandboxUnavailableError()` predicates that normalize error messages and match against specific substring patterns to classify permanent sandbox unavailability. |
| **State Cleanup Strategies**<br>`lib/sandbox/clearSandboxResumeState.ts`, `lib/sandbox/clearUnavailableSandboxState.ts` | Implements `clearSandboxResumeState()` to extract and sanitize the `type` discriminator from persisted state (defaulting to `"vercel"`), and `clearUnavailableSandboxState()` which conditionally routes between resume-aware or standard cleanup based on error classification. |
| **Reconnect Handler Probe and Recovery**<br>`lib/sandbox/getSandboxReconnectHandler.ts` | On successful probe, refreshes session `sandbox_state`, `sandbox_expires_at`, and recovers `lifecycle_state` from `"failed"` back to `"active"`. On probe failure, distinguishes transient errors (returns `connected` with safe future-only expiration) from unavailable errors (clears state via helper and returns `expired` with hibernated lifecycle). Adds `getStateExpiresAt()` helper to safely extract expiration timestamps. |
| **Status Handler Self-Healing**<br>`lib/sandbox/getSandboxStatusHandler.ts` | Introduces `effectiveRow` recovery: when sandbox is active but `lifecycle_state` is `"failed"`, attempts to update the session back to `"active"` and recompute expiration from persisted state. Expands snapshot/resumable eligibility to return true for saved snapshots or hibernated sessions with derivable sandbox names. Uses recovered state for lifecycle kick conditions and response payloads. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • recoupable/api#525: Both PRs modify lib/sandbox/getSandboxReconnectHandler.ts and probe handling / session state update logic for sandbox reconnection.
  • recoupable/api#533: Both PRs modify sandbox status and reconnect handlers with state-clearing helpers and runtime-state cleanup logic.

Poem

🏜️ When probes collide with errors grand,

We heal the state, both close and expand—

Transient whispers from the dark,

Hibernation leaves its mark. ✨

Recovery flows through every lane,

Till sandboxes wake again. 🚀

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Solid & Clean Code | ✅ Passed | New files follow SRP naming. Helper functions are focused and small. Code uses composition instead of duplication. Nesting ≤2 levels, excellent documentation. No OCP violations. |



@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
lib/sandbox/getSandboxReconnectHandler.ts (2)

72-92: 💤 Low value

Soft DRY nudge: lifecycle-recovery shape is now duplicated across status + reconnect handlers.

Both handlers now own the "if lifecycle_state === 'failed', flip to active, clear lifecycle_error, refresh sandbox_expires_at" pattern. The data sources differ (status uses persisted row.sandbox_state, reconnect uses live refreshedState), so a full extraction would need a small input shape — but a tiny recoverFailedLifecycle({ row, expiresAtSource }) helper would centralize the FSM-side knowledge ("what fields constitute failed→active recovery") in one place.

Optional, but worth a thought before a third site needs the same dance.

As per coding guidelines: "Extract shared logic into reusable utilities following Don't Repeat Yourself (DRY) principle".
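
One possible shape for the suggested helper, as a sketch only: `recoverFailedLifecycle` is the name floated in this comment, but the `LifecycleRow` and `RecoveryPatch` types, the `expiresAtMs` parameter, and the choice of `Date | null` for `sandbox_expires_at` are assumptions.

```typescript
// Hypothetical minimal row shape needed for the recovery decision.
interface LifecycleRow {
  lifecycle_state: string | null;
  lifecycle_error: string | null;
}

// The fields that make up a failed -> active recovery.
interface RecoveryPatch {
  lifecycle_state: "active";
  lifecycle_error: null;
  sandbox_expires_at: Date | null;
}

// Returns the recovery patch, or null when the row is not in "failed".
// `expiresAtMs` is whichever source the caller trusts: the persisted state for
// the status handler, the live refreshed state for the reconnect handler.
function recoverFailedLifecycle(input: {
  row: LifecycleRow;
  expiresAtMs: number | undefined;
}): RecoveryPatch | null {
  if (input.row.lifecycle_state !== "failed") return null;
  return {
    lifecycle_state: "active",
    lifecycle_error: null,
    sandbox_expires_at:
      input.expiresAtMs === undefined ? null : new Date(input.expiresAtMs),
  };
}
```

Each handler would pass a non-null patch to `updateSession` and skip the write otherwise, which keeps the recovery fields defined in one place.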

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/getSandboxReconnectHandler.ts` around lines 72 - 92, Extract the
repeated "failed → active" recovery logic into a small helper (e.g.,
recoverFailedLifecycle) that accepts the source inputs used here (row and an
expiresAt candidate like refreshedExpiresAt or persisted sandbox_expires_at) and
returns two things: the DB update patch (fields to pass into updateSession) and
the lifecycle patch (fields to merge into lifecycle in the ReconnectBody).
Replace the inline checks of refreshedState, recoverFailed and
refreshedExpiresAt in getSandboxReconnectHandler (the block that builds
updateSession payload and the lifecycle spread in the ReconnectBody using
buildLifecycle) with a call to this helper so both updateSession and lifecycle
composition use the single canonical recovery shape. Ensure the helper is reused
by the other status/reconnect handler to remove duplication.

24-28: ⚡ Quick win

Consider extracting getStateExpiresAt into its own file.

Per the repo's lib/**/*.ts SRP rule ("one exported function per file; each file should do one thing well"), this local helper is genuinely reusable — it's the read-side mirror of getSandboxExpiresAtDate for the numeric expiresAt shape on sandbox_state. Promoting it to lib/sandbox/getStateExpiresAt.ts gives you isolated unit tests and a discoverable home next to its sibling.

Not a blocker — the helper is correctly scoped where it lives — but it's a low-effort win the moment a second caller wants the same extraction.

As per coding guidelines: "Apply Single Responsibility Principle (SRP): one exported function per file; each file should do one thing well".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/getSandboxReconnectHandler.ts` around lines 24 - 28, Extract the
local helper getStateExpiresAt into its own module: create a new file exporting
the function (e.g., getStateExpiresAt.ts) and move the implementation there,
keep the exact logic (type-check state, read expiresAt, return number or
undefined), then update the original file to import and use the exported
getStateExpiresAt; ensure the new module is unit-testable and add an index or
export as needed so callers (and the sibling getSandboxExpiresAtDate) can import
it.
lib/sandbox/getSandboxStatusHandler.ts (1)

64-72: 💤 Low value

Consider logging when self-heal write fails.

If updateSession returns falsy (RLS denial, transient DB hiccup, race with another writer), we silently fall back to row and the response will still report lifecycle_state: "failed" — exactly the "stuck on Paused" state this self-heal was meant to fix. A console.warn here would make the failure observable without changing behavior.

🪵 Proposed observability nudge
   let effectiveRow = row;
   if (active && row.lifecycle_state === "failed") {
     const recovered = await updateSession(row.id, {
       lifecycle_state: "active",
       lifecycle_error: null,
       sandbox_expires_at: getSandboxExpiresAtDate(row.sandbox_state),
     });
-    if (recovered) effectiveRow = recovered;
+    if (recovered) {
+      effectiveRow = recovered;
+    } else {
+      console.warn(
+        `[getSandboxStatusHandler] self-heal failed for ${row.id}: updateSession returned no row`,
+      );
+    }
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/sandbox/getSandboxStatusHandler.ts` around lines 64 - 72, When the
self-heal write via updateSession fails (i.e., updateSession(row.id, ...)
returns a falsy value), add an observable warning log so the failure is visible;
inside the block that currently sets effectiveRow = recovered, add an else path
that emits a console.warn (or the module's logger) including identifying info
like row.id, row.lifecycle_state, and sandbox_state (or mention
getSandboxExpiresAtDate(row.sandbox_state)) to indicate the attempted recovery
failed and we are falling back to the original row.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 46ac9299-02dd-439a-8d04-4a2e2b7cf55e

📥 Commits

Reviewing files that changed from the base of the PR and between 0c51c14 and 8631b8e.

⛔ Files ignored due to path filters (6)
  • lib/sandbox/__tests__/clearSandboxResumeState.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
  • lib/sandbox/__tests__/clearUnavailableSandboxState.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
  • lib/sandbox/__tests__/getSandboxReconnectHandler.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
  • lib/sandbox/__tests__/getSandboxStatusHandler.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
  • lib/sandbox/__tests__/isSandboxNotFoundError.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
  • lib/sandbox/__tests__/isSandboxUnavailableError.test.ts is excluded by !**/*.test.*, !**/__tests__/** and included by lib/**
📒 Files selected for processing (6)
  • lib/sandbox/clearSandboxResumeState.ts
  • lib/sandbox/clearUnavailableSandboxState.ts
  • lib/sandbox/getSandboxReconnectHandler.ts
  • lib/sandbox/getSandboxStatusHandler.ts
  • lib/sandbox/isSandboxNotFoundError.ts
  • lib/sandbox/isSandboxUnavailableError.ts

lifecycle: ReturnType<typeof buildLifecycle>;
}

function getStateExpiresAt(state: unknown): number | undefined {
Contributor Author

SRP - new lib file for getStateExpiresAt.ts

sweetmantech and others added 2 commits May 7, 2026 21:00
Per review: the inline helper in getSandboxReconnectHandler.ts is its
own concern (read epoch-ms expiresAt off a sandbox state) and belongs
in a dedicated file alongside the other state predicates.

Adds `lib/sandbox/getStateExpiresAt.ts` with a focused test (numeric
match, non-numeric reject, null/scalar guard). Reconnect handler now
imports from the new path; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sweetmantech
Contributor Author

Pushed `361922b5` addressing the comment.

Extracted `getStateExpiresAt` from inside `getSandboxReconnectHandler.ts` into its own SRP file at `lib/sandbox/getStateExpiresAt.ts` with a focused test (numeric match / non-numeric reject / null & scalar guard). Reconnect handler now imports from the new path; no behavior change.
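
For reference, a reconstruction of what the extracted helper likely looks like. The signature comes from the review diff; the body is inferred from the description (type-check the state, read `expiresAt`, return a number or `undefined`) and may differ from the committed file.

```typescript
// Read an epoch-ms `expiresAt` off an untyped sandbox state.
function getStateExpiresAt(state: unknown): number | undefined {
  // Null and scalar guard: only objects can carry the field.
  if (typeof state !== "object" || state === null) return undefined;
  const expiresAt = (state as { expiresAt?: unknown }).expiresAt;
  // Non-numeric values are rejected rather than coerced.
  return typeof expiresAt === "number" ? expiresAt : undefined;
}
```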

Follow-up `bec3c4af` widens an unrelated tsc cast in the reconnect test to keep `tsc --noEmit` clean.

Smoke test on `8631b8ea` preview:

| Step | Result |
| --- | --- |
| `POST /api/sessions` | session `04157291-4531-4ca2-90d2-d2ef44a0760f` ✓ |
| `POST /api/sandbox` (vercel/next.js) | ready in 100s ✓ |
| `GET /api/sandbox/status` | `status: active`, `hasSnapshot: false`, `lifecycleVersion: 1`, `sandboxExpiresAt: 1778207323686` ✓ |
| `GET /api/sandbox/reconnect` | `status: connected`, `hasSnapshot: false`, `expiresAt: 1778207323915`, lifecycle `sandboxExpiresAt: 1778207323915` ✓ |

The reconnect's `sandboxExpiresAt` is 229ms newer than status's — that's the new live-state expires-sync working end-to-end.

Failed-state self-heal and transient-error preservation aren't easy to exercise from preview without DB tampering or upstream fault injection, but they have full unit-test coverage and the response shapes for the happy paths are clean.

@sweetmantech sweetmantech merged commit 881d9d2 into test May 8, 2026
6 checks passed
@sweetmantech sweetmantech deleted the feat/status-reconnect-parity branch May 8, 2026 02:03