feat(sandbox): port status self-heal + reconnect transient handling#535
feat(sandbox): port status self-heal + reconnect transient handling#535sweetmantech merged 3 commits intotestfrom
Conversation
Closes the remaining open-agents parity gaps in the two read endpoints
that the chat UI hits on session re-entry / tab refocus.
**Status handler (`GET /api/sandbox/status`)**
- **Failed-state self-heal**: when `hasRuntimeSandboxState` matches but
`lifecycle_state === "failed"`, recovers to `active` + clears
`lifecycle_error` + refreshes `sandbox_expires_at`. Without this,
the UI gets stuck on "Paused" after a transient lifecycle eval
hiccup even though the runtime is still alive.
- **`hasSnapshot` recognizes hibernated state**: now true when
`snapshot_url` is set OR `lifecycle_state === "hibernated"` AND the
state has a resumable name. Previously only checked `snapshot_url`,
so paused-but-resumable sessions reported `hasSnapshot: false`.
**Reconnect handler (`GET /api/sandbox/reconnect`)**
- **Transient-error preservation**: only collapses to `expired` when
the probe error matches a known permanent-failure pattern
(`isSandboxUnavailableError`: 404 / 410 / "sandbox not found" /
"sandbox is stopped" / "sandbox probe failed" / "expected a stream
of command data"). Anything else (502 / connection reset / timeout)
is treated as transient: runtime state is preserved, response is
`connected` with a conservative `safeExpiresAt` (only forwarded if
still in the future). This avoids forcing a full sandbox rebuild
on a flaky network.
- **Aggressive cleanup gating**: not-found errors drop the resume
handle (sandbox is gone-gone, can't be brought back), but other
unavailable errors keep it via `clearUnavailableSandboxState` so a
future provision can reuse the name.
- **Expires sync**: on a successful probe, `sandbox_expires_at` is
refreshed from the live SDK state — without it the FE timer drifts
from reality.
- **Lifecycle recovery**: on a successful probe, if the row was in
`lifecycle_state: "failed"`, recovers to `active` + clears
`lifecycle_error`.
**New helpers (each its own SRP file with a vitest red→green pass):**
- `isSandboxNotFoundError` — 404 / sandbox-not-found patterns
- `isSandboxUnavailableError` — broader permanent-failure dispatcher
- `clearSandboxResumeState` — collapses state to just `{ type }`
- `clearUnavailableSandboxState` — picks between resume-clear and
state-clear based on the error class
Tests: 2622 / 2622 pass. Lint + tsc clean for changed files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (2)
📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR adds error classification predicates, conditional state-cleanup helpers, and improvements to sandbox reconnect and status handlers. It enables distinguishing permanent sandbox unavailability from transient failures, implementing lifecycle self-healing, and expanding resumable session eligibility logic. ChangesSandbox State Resilience and Error Recovery
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
lib/sandbox/getSandboxReconnectHandler.ts (2)
72-92: 💤 Low valueSoft DRY nudge: lifecycle-recovery shape is now duplicated across status + reconnect handlers.
Both handlers now own the "if
lifecycle_state === 'failed', flip toactive, clearlifecycle_error, refreshsandbox_expires_at" pattern. The data sources differ (status uses persistedrow.sandbox_state, reconnect uses liverefreshedState), so a full extraction would need a small input shape — but a tinyrecoverFailedLifecycle({ row, expiresAtSource })helper would centralize the FSM-side knowledge ("what fields constitute failed→active recovery") in one place.Optional, but worth a thought before a third site needs the same dance.
As per coding guidelines: "Extract shared logic into reusable utilities following Don't Repeat Yourself (DRY) principle".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/sandbox/getSandboxReconnectHandler.ts` around lines 72 - 92, Extract the repeated "failed → active" recovery logic into a small helper (e.g., recoverFailedLifecycle) that accepts the source inputs used here (row and an expiresAt candidate like refreshedExpiresAt or persisted sandbox_expires_at) and returns two things: the DB update patch (fields to pass into updateSession) and the lifecycle patch (fields to merge into lifecycle in the ReconnectBody). Replace the inline checks of refreshedState, recoverFailed and refreshedExpiresAt in getSandboxReconnectHandler (the block that builds updateSession payload and the lifecycle spread in the ReconnectBody using buildLifecycle) with a call to this helper so both updateSession and lifecycle composition use the single canonical recovery shape. Ensure the helper is reused by the other status/reconnect handler to remove duplication.
24-28: ⚡ Quick winConsider extracting
getStateExpiresAtinto its own file.Per the repo's
lib/**/*.tsSRP rule ("one exported function per file; each file should do one thing well"), this local helper is genuinely reusable — it's the read-side mirror ofgetSandboxExpiresAtDatefor the numericexpiresAtshape onsandbox_state. Promoting it tolib/sandbox/getStateExpiresAt.tsgives you isolated unit tests and a discoverable home next to its sibling.Not a blocker — the helper is correctly scoped where it lives — but it's a low-effort win the moment a second caller wants the same extraction.
As per coding guidelines: "Apply Single Responsibility Principle (SRP): one exported function per file; each file should do one thing well".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/sandbox/getSandboxReconnectHandler.ts` around lines 24 - 28, Extract the local helper getStateExpiresAt into its own module: create a new file exporting the function (e.g., getStateExpiresAt.ts) and move the implementation there, keep the exact logic (type-check state, read expiresAt, return number or undefined), then update the original file to import and use the exported getStateExpiresAt; ensure the new module is unit-testable and add an index or export as needed so callers (and the sibling getSandboxExpiresAtDate) can import it.lib/sandbox/getSandboxStatusHandler.ts (1)
64-72: 💤 Low valueConsider logging when self-heal write fails.
If
updateSessionreturns falsy (RLS denial, transient DB hiccup, race with another writer), we silently fall back torowand the response will still reportlifecycle_state: "failed"— exactly the "stuck on Paused" state this self-heal was meant to fix. Aconsole.warnhere would make the failure observable without changing behavior.🪵 Proposed observability nudge
let effectiveRow = row; if (active && row.lifecycle_state === "failed") { const recovered = await updateSession(row.id, { lifecycle_state: "active", lifecycle_error: null, sandbox_expires_at: getSandboxExpiresAtDate(row.sandbox_state), }); - if (recovered) effectiveRow = recovered; + if (recovered) { + effectiveRow = recovered; + } else { + console.warn( + `[getSandboxStatusHandler] self-heal failed for ${row.id}: updateSession returned no row`, + ); + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/sandbox/getSandboxStatusHandler.ts` around lines 64 - 72, When the self-heal write via updateSession fails (i.e., updateSession(row.id, ...) returns a falsy value), add an observable warning log so the failure is visible; inside the block that currently sets effectiveRow = recovered, add an else path that emits a console.warn (or the module's logger) including identifying info like row.id, row.lifecycle_state, and sandbox_state (or mention getSandboxExpiresAtDate(row.sandbox_state)) to indicate the attempted recovery failed and we are falling back to the original row.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@lib/sandbox/getSandboxReconnectHandler.ts`:
- Around line 72-92: Extract the repeated "failed → active" recovery logic into
a small helper (e.g., recoverFailedLifecycle) that accepts the source inputs
used here (row and an expiresAt candidate like refreshedExpiresAt or persisted
sandbox_expires_at) and returns two things: the DB update patch (fields to pass
into updateSession) and the lifecycle patch (fields to merge into lifecycle in
the ReconnectBody). Replace the inline checks of refreshedState, recoverFailed
and refreshedExpiresAt in getSandboxReconnectHandler (the block that builds
updateSession payload and the lifecycle spread in the ReconnectBody using
buildLifecycle) with a call to this helper so both updateSession and lifecycle
composition use the single canonical recovery shape. Ensure the helper is reused
by the other status/reconnect handler to remove duplication.
- Around line 24-28: Extract the local helper getStateExpiresAt into its own
module: create a new file exporting the function (e.g., getStateExpiresAt.ts)
and move the implementation there, keep the exact logic (type-check state, read
expiresAt, return number or undefined), then update the original file to import
and use the exported getStateExpiresAt; ensure the new module is unit-testable
and add an index or export as needed so callers (and the sibling
getSandboxExpiresAtDate) can import it.
In `@lib/sandbox/getSandboxStatusHandler.ts`:
- Around line 64-72: When the self-heal write via updateSession fails (i.e.,
updateSession(row.id, ...) returns a falsy value), add an observable warning log
so the failure is visible; inside the block that currently sets effectiveRow =
recovered, add an else path that emits a console.warn (or the module's logger)
including identifying info like row.id, row.lifecycle_state, and sandbox_state
(or mention getSandboxExpiresAtDate(row.sandbox_state)) to indicate the
attempted recovery failed and we are falling back to the original row.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 46ac9299-02dd-439a-8d04-4a2e2b7cf55e
⛔ Files ignored due to path filters (6)
lib/sandbox/__tests__/clearSandboxResumeState.test.tsis excluded by!**/*.test.*,!**/__tests__/**and included bylib/**lib/sandbox/__tests__/clearUnavailableSandboxState.test.tsis excluded by!**/*.test.*,!**/__tests__/**and included bylib/**lib/sandbox/__tests__/getSandboxReconnectHandler.test.tsis excluded by!**/*.test.*,!**/__tests__/**and included bylib/**lib/sandbox/__tests__/getSandboxStatusHandler.test.tsis excluded by!**/*.test.*,!**/__tests__/**and included bylib/**lib/sandbox/__tests__/isSandboxNotFoundError.test.tsis excluded by!**/*.test.*,!**/__tests__/**and included bylib/**lib/sandbox/__tests__/isSandboxUnavailableError.test.tsis excluded by!**/*.test.*,!**/__tests__/**and included bylib/**
📒 Files selected for processing (6)
lib/sandbox/clearSandboxResumeState.tslib/sandbox/clearUnavailableSandboxState.tslib/sandbox/getSandboxReconnectHandler.tslib/sandbox/getSandboxStatusHandler.tslib/sandbox/isSandboxNotFoundError.tslib/sandbox/isSandboxUnavailableError.ts
| lifecycle: ReturnType<typeof buildLifecycle>; | ||
| } | ||
|
|
||
| function getStateExpiresAt(state: unknown): number | undefined { |
There was a problem hiding this comment.
SRP - new lib file for getStateExpiresAt.ts
Per review: the inline helper in getSandboxReconnectHandler.ts is its own concern (read epoch-ms expiresAt off a sandbox state) and belongs in a dedicated file alongside the other state predicates. Adds `lib/sandbox/getStateExpiresAt.ts` with a focused test (numeric match, non-numeric reject, null/scalar guard). Reconnect handler now imports from the new path; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed `361922b5` addressing the comment. Extracted `getStateExpiresAt` from inside `getSandboxReconnectHandler.ts` into its own SRP file at `lib/sandbox/getStateExpiresAt.ts` with a focused test (numeric match / non-numeric reject / null & scalar guard). Reconnect handler now imports from the new path; no behavior change. Follow-up `bec3c4af` widens an unrelated tsc cast in the reconnect test to keep `tsc --noEmit` clean. Smoke test on `8631b8ea` preview:
The reconnect's `sandboxExpiresAt` is 229ms newer than status's — that's the new live-state expires-sync working end-to-end. Failed-state self-heal and transient-error preservation aren't easy to exercise from preview without DB tampering or upstream fault injection, but they have full unit-test coverage and the response shapes for the happy paths are clean. |
Closes the remaining open-agents parity gaps in the two read endpoints (status + reconnect) that the chat UI hits on session re-entry / tab refocus.
Summary
`GET /api/sandbox/status`
`GET /api/sandbox/reconnect`
New helpers (each its own file with a TDD red→green pass):
Test plan
🤖 Generated with Claude Code
Summary by cubic
Adds self-heal to
GET /api/sandbox/statusand smarter transient handling toGET /api/sandbox/reconnectto prevent unnecessary rebuilds and stuck “Paused” states. On successful probes, refreshes persistedsandbox_stateandsandbox_expires_atso timers match the runtime.failed, recover toactive, clearlifecycle_error, and refreshsandbox_expires_at.hasSnapshotis true whensnapshot_urlexists or lifecycle ishibernatedwith a resumablesandboxName.expiredforisSandboxUnavailableError(404/410/not found/stopped/probe failed/expected stream). Treat other errors as transient: keep runtime state, respondconnected, and includeexpiresAtonly if it’s still in the future. On success, refreshsandbox_stateandsandbox_expires_at, and recoverfailed→active. Not-found drops the resume handle; other unavailable errors keep it.isSandboxNotFoundError,isSandboxUnavailableError,clearSandboxResumeState,clearUnavailableSandboxState,getStateExpiresAt.Written for commit bec3c4a. Summary will update on new commits.
Summary by CodeRabbit
Release Notes