Skip to content

feat: instrument sandbox startup timing#8

Open
sweetmantech wants to merge 1 commit into
mainfrom
feat/sandbox-startup-timing
Open

feat: instrument sandbox startup timing#8
sweetmantech wants to merge 1 commit into
mainfrom
feat/sandbox-startup-timing

Conversation

@sweetmantech
Copy link
Copy Markdown
Contributor

@sweetmantech sweetmantech commented Apr 22, 2026

Summary

  • Goal: figure out which phase of new-sandbox startup dominates the observed 60+s wait (Rostrum Pacific org), before committing to per-org base snapshots or another fix.
  • Adds two structured log streams that print step_ms (this phase) and elapsed_ms (since start), so the dominant cost surfaces immediately when grepping Vercel function logs.

Log streams

[sandbox-handler-timing]apps/web/lib/sandbox/create-sandbox-handler.ts

  • auth_and_db_lookups — Privy session + requireOwnedSession + getGitHubAccount
  • connect_sandbox — full connectSandbox call (covers VM provision + clone + git config)
  • update_session_state — DB write of new sandbox state
  • install_global_skills — only when the session has globalSkillRefs
  • handler_total — full request duration

[sandbox-timing]packages/sandbox/vercel/sandbox.ts

  • vm_provision (with mode: base_snapshot|sdk_git|restore_snapshot|empty) — Firecracker bootstrap
  • git_clone (with repo) — manual git clone into /vercel/sandbox
  • git_remote_set_url — embedding the auth token in the origin remote
  • git_config_usergit config user.name/email
  • git_checkout_new_branch (with branch) — when isNewBranch is set
  • after_start_hook — only when an afterStart hook is registered
  • create_total — full VercelSandbox.create duration

Each log line includes the sandbox name (= session id) so you can filter to a single create.

How to read the logs

After deploying this branch, click an org once. In the Vercel runtime logs for POST /api/sandbox, grep sandbox-timing\\|sandbox-handler-timing and you'll see something like:

[sandbox-timing] { name: "...", phase: "vm_provision", mode: "base_snapshot", step_ms: 8120, elapsed_ms: 8120 }
[sandbox-timing] { name: "...", phase: "git_clone", repo: "https://github.com/recoupable/org-...", step_ms: 41200, elapsed_ms: 49320 }
[sandbox-timing] { name: "...", phase: "git_remote_set_url", step_ms: 290, elapsed_ms: 49610 }
[sandbox-timing] { name: "...", phase: "git_config_user", step_ms: 540, elapsed_ms: 50150 }
[sandbox-timing] { name: "...", phase: "create_total", step_ms: 0, elapsed_ms: 50150 }
[sandbox-handler-timing] { sessionId: "...", phase: "connect_sandbox", step_ms: 50180, elapsed_ms: 50420 }

The step_ms values tell you exactly where the time is going.

Test plan

  • Merge, deploy, click "Rostrum Pacific" once
  • Grep Vercel logs for sandbox-timing and sandbox-handler-timing
  • Confirm whether git_clone is in fact dominant (hypothesis), or whether vm_provision is comparable
  • Decide: ship per-org base snapshots vs shallow-clone tweak vs something else
  • Once root cause is fixed and verified, revert this PR (logs are intentionally noisy)

🤖 Generated with Claude Code


Summary by cubic

Adds structured per-step timing logs for sandbox startup to pinpoint the 60s+ bottleneck. Makes the slow phase obvious in Vercel logs so we can pick the right fix.

  • New Features
    • [sandbox-handler-timing] in apps/web/lib/sandbox/create-sandbox-handler.ts: logs auth/DB lookups, connectSandbox, session state update, global skill install, total. Includes sessionId, step_ms, elapsed_ms.
    • [sandbox-timing] in packages/sandbox/vercel/sandbox.ts: logs VM provision (mode: base_snapshot|sdk_git|restore_snapshot|empty), git clone/config steps, optional hooks, total. Includes name, step_ms, elapsed_ms.

Written for commit 7983a5d. Summary will update on new commits.

Summary by CodeRabbit

  • Chores
    • Enhanced internal performance monitoring and timing instrumentation for sandbox creation workflows to track step-by-step duration metrics and provisioning states.

Adds structured per-step timing logs so we can identify which phase
dominates the 60+s new-sandbox startup before committing to a fix.

Two log streams:
- [sandbox-handler-timing] (apps/web): auth_and_db_lookups, connect_sandbox,
  update_session_state, install_global_skills, handler_total
- [sandbox-timing] (packages/sandbox): vm_provision (with mode), git_clone,
  git_remote_set_url, git_config_user, git_checkout_new_branch,
  after_start_hook, create_total

Each log line includes step_ms (this phase) and elapsed_ms (since start)
so the dominant cost surfaces immediately in Vercel function logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
open-agents Ready Ready Preview Apr 22, 2026 4:43pm

Request Review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough

Walkthrough

Introduced request-scoped timing instrumentation in the sandbox creation handler and SDK provisioning process. Added step-by-step timing logs at key control-flow points to measure elapsed time between phases and total duration, without modifying core functionality or public API signatures.

Changes

Cohort / File(s) Summary
Handler timing instrumentation
apps/web/lib/sandbox/create-sandbox-handler.ts
Added logTiming helper function and timing logs at control-flow checkpoints: after auth/DB lookups, after connectSandbox, after updateSession, and after installSessionGlobalSkills. Final timing calculation now references handlerStart instead of startTime.
Sandbox provisioning timing instrumentation
packages/sandbox/vercel/sandbox.ts
Added step-by-step timing logs during VercelSandbox.create() with logTiming helper. Logs record intervals between phases and track provisioning mode (restore_snapshot, base_snapshot, sdk_git, or empty). Timing captured after git operations, branch creation, hooks, and total creation duration.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Poem

🐰 Tick-tock, step by step,
Timing logs in every rep,
From provisioning to the final call,
Now we measure it all! ⏱️

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: instrument sandbox startup timing' directly describes the main change—adding timing instrumentation to sandbox startup—which is the core objective of the PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sandbox-startup-timing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
apps/web/lib/sandbox/create-sandbox-handler.ts (2)

120-145: ⚠️ Potential issue | 🟡 Minor

Emit connect_sandbox timing when provisioning fails.

If connectSandbox spends 60+ seconds in provision/clone and then throws, this returns before logging the phase, hiding the latency this PR is meant to isolate.

📈 Proposed fix
   try {
     sandbox = await connectSandbox({
       state: {
         type: "vercel",
         ...(sandboxName ? { sandboxName } : {}),
         source,
@@
       },
     });
   } catch (error) {
+    logTiming("connect_sandbox", sessionId, { success: false });
     console.error("connectSandbox failed:", error);
     return Response.json(
       { error: "Failed to provision sandbox" },
       { status: 502 },
     );
   }
-  logTiming("connect_sandbox", sessionId);
+  logTiming("connect_sandbox", sessionId, { success: true });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/lib/sandbox/create-sandbox-handler.ts` around lines 120 - 145, The
connectSandbox call currently logs timing only on success, so if it throws you
return early and lose the "connect_sandbox" latency; update the error path to
emit the timing before returning by calling logTiming("connect_sandbox",
sessionId) in the catch (or move the timing call into a finally block that
always runs). Change the catch in create-sandbox-handler.ts (around
connectSandbox) to invoke logTiming("connect_sandbox", sessionId) prior to
Response.json return (or refactor to try/catch/finally so logTiming always
executes).

160-174: ⚠️ Potential issue | 🟡 Minor

Record the global-skills phase even on failure.

A slow failing installSessionGlobalSkills call is swallowed by the catch block and only appears as an unlabeled gap before handler_total.

📈 Proposed fix
     if (sessionRecord) {
+      let globalSkillsInstalled = false;
       try {
         await installSessionGlobalSkills({
           sessionRecord,
           sandbox,
         });
-        logTiming("install_global_skills", sessionId, {
-          skillCount: sessionRecord.globalSkillRefs?.length ?? 0,
-        });
+        globalSkillsInstalled = true;
       } catch (error) {
         console.error(
           `Failed to install global skills for session ${sessionRecord.id}:`,
           error,
         );
       }
+      logTiming("install_global_skills", sessionId, {
+        skillCount: sessionRecord.globalSkillRefs?.length ?? 0,
+        success: globalSkillsInstalled,
+      });
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/lib/sandbox/create-sandbox-handler.ts` around lines 160 - 174, The
global-skills timing is only recorded on success, so move or duplicate the
logTiming call for the "install_global_skills" phase so it always runs
(including on failure) for the given sessionId; inside the try set
logTiming("install_global_skills", sessionId, { skillCount:
sessionRecord.globalSkillRefs?.length ?? 0, success: true }) and in the catch
call logTiming("install_global_skills", sessionId, { skillCount:
sessionRecord.globalSkillRefs?.length ?? 0, success: false, error: String(error)
}) before rethrowing or swallowing, referencing installSessionGlobalSkills,
sessionRecord and sessionId to locate the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/web/lib/sandbox/create-sandbox-handler.ts`:
- Around line 29-43: The timing log uses lastStep for step_ms causing the
"handler_total" phase to show only the last interval; update the logTiming
function so when phase === "handler_total" it computes step_ms using
handlerStart instead of lastStep (i.e., step_ms = now - handlerStart) while
other phases keep step_ms = now - lastStep; adjust the computation in logTiming
(referencing logTiming, handlerStart, lastStep, and the "handler_total" phase)
and ensure the same change is applied to the duplicate occurrence noted around
lines 183-184.

In `@packages/sandbox/vercel/sandbox.ts`:
- Around line 516-526: The timing helper logTiming currently computes step_ms as
now - lastStep which makes the "create_total" phase report only the last
segment; change logTiming so when phase === "create_total" (or when a boolean
total flag) it sets step_ms to now - createStart instead of now - lastStep,
while keeping elapsed_ms = now - createStart for all phases; update references
to lastStep and createStart in the logTiming closure (and any other create_total
logTiming call sites) so the "create_total" entry reflects the full
VercelSandbox.create() duration.
- Line 615: The log currently passes the raw repository URL (source.url) into
logTiming("git_clone"), which can expose private repo info or embedded
credentials; instead compute and log a non-reversible fingerprint of the URL
(e.g., SHA256/SHA1 digest truncated) and/or the sandbox name, and pass that
fingerprint to logTiming("git_clone") rather than source.url; ensure you
reference source.url only to compute the hash and do not persist or log the raw
URL anywhere in the code paths around logTiming("git_clone").

---

Outside diff comments:
In `@apps/web/lib/sandbox/create-sandbox-handler.ts`:
- Around line 120-145: The connectSandbox call currently logs timing only on
success, so if it throws you return early and lose the "connect_sandbox"
latency; update the error path to emit the timing before returning by calling
logTiming("connect_sandbox", sessionId) in the catch (or move the timing call
into a finally block that always runs). Change the catch in
create-sandbox-handler.ts (around connectSandbox) to invoke
logTiming("connect_sandbox", sessionId) prior to Response.json return (or
refactor to try/catch/finally so logTiming always executes).
- Around line 160-174: The global-skills timing is only recorded on success, so
move or duplicate the logTiming call for the "install_global_skills" phase so it
always runs (including on failure) for the given sessionId; inside the try set
logTiming("install_global_skills", sessionId, { skillCount:
sessionRecord.globalSkillRefs?.length ?? 0, success: true }) and in the catch
call logTiming("install_global_skills", sessionId, { skillCount:
sessionRecord.globalSkillRefs?.length ?? 0, success: false, error: String(error)
}) before rethrowing or swallowing, referencing installSessionGlobalSkills,
sessionRecord and sessionId to locate the code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c9cd694f-3449-4118-ac00-66376f89551f

📥 Commits

Reviewing files that changed from the base of the PR and between 66abefb and 7983a5d.

📒 Files selected for processing (2)
  • apps/web/lib/sandbox/create-sandbox-handler.ts
  • packages/sandbox/vercel/sandbox.ts

Comment on lines +29 to +43
const logTiming = (
phase: string,
sessionId: string | undefined,
extra?: Record<string, unknown>,
) => {
const now = Date.now();
console.log("[sandbox-handler-timing]", {
sessionId: sessionId ?? "unknown",
phase,
step_ms: now - lastStep,
elapsed_ms: now - handlerStart,
...extra,
});
lastStep = now;
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Make handler_total.step_ms represent the full request duration.

handler_total currently uses lastStep, so step_ms is only the final gap since the previous phase. Use handlerStart for this phase to keep the “total” line unambiguous.

📏 Proposed fix
   const logTiming = (
     phase: string,
     sessionId: string | undefined,
     extra?: Record<string, unknown>,
+    stepStart = lastStep,
   ) => {
     const now = Date.now();
     console.log("[sandbox-handler-timing]", {
       sessionId: sessionId ?? "unknown",
       phase,
-      step_ms: now - lastStep,
+      step_ms: now - stepStart,
       elapsed_ms: now - handlerStart,
       ...extra,
     });
     lastStep = now;
   };
   const readyMs = Date.now() - handlerStart;
-  logTiming("handler_total", sessionId, { readyMs });
+  logTiming("handler_total", sessionId, { readyMs }, handlerStart);

Also applies to: 183-184

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/web/lib/sandbox/create-sandbox-handler.ts` around lines 29 - 43, The
timing log uses lastStep for step_ms causing the "handler_total" phase to show
only the last interval; update the logTiming function so when phase ===
"handler_total" it computes step_ms using handlerStart instead of lastStep
(i.e., step_ms = now - handlerStart) while other phases keep step_ms = now -
lastStep; adjust the computation in logTiming (referencing logTiming,
handlerStart, lastStep, and the "handler_total" phase) and ensure the same
change is applied to the duplicate occurrence noted around lines 183-184.

Comment on lines +516 to +526
const logTiming = (phase: string, extra?: Record<string, unknown>) => {
const now = Date.now();
console.log("[sandbox-timing]", {
name: name ?? "unnamed",
phase,
step_ms: now - lastStep,
elapsed_ms: now - createStart,
...extra,
});
lastStep = now;
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Make create_total.step_ms represent the total duration.

logTiming("create_total") currently reports step_ms since the previous logged phase, not the full VercelSandbox.create() duration. That can under-report the total when consumers group by phase and read step_ms.

📏 Proposed fix
-    const logTiming = (phase: string, extra?: Record<string, unknown>) => {
+    const logTiming = (
+      phase: string,
+      extra?: Record<string, unknown>,
+      stepStart = lastStep,
+    ) => {
       const now = Date.now();
       console.log("[sandbox-timing]", {
         name: name ?? "unnamed",
         phase,
-        step_ms: now - lastStep,
+        step_ms: now - stepStart,
         elapsed_ms: now - createStart,
         ...extra,
       });
       lastStep = now;
     };
-    logTiming("create_total");
+    logTiming("create_total", undefined, createStart);

Also applies to: 724-724

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/sandbox/vercel/sandbox.ts` around lines 516 - 526, The timing helper
logTiming currently computes step_ms as now - lastStep which makes the
"create_total" phase report only the last segment; change logTiming so when
phase === "create_total" (or when a boolean total flag) it sets step_ms to now -
createStart instead of now - lastStep, while keeping elapsed_ms = now -
createStart for all phases; update references to lastStep and createStart in the
logTiming closure (and any other create_total logTiming call sites) so the
"create_total" entry reflects the full VercelSandbox.create() duration.

`Failed to clone repository '${source.url}' (exit code ${cloneResult.exitCode})`,
);
}
logTiming("git_clone", { repo: source.url });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid logging raw repository URLs.

source.url can disclose private repo identity and may include embedded credentials if a caller passes an authenticated URL. The sandbox name already provides correlation, so omit the repo or log a non-reversible fingerprint instead.

🛡️ Proposed fix
-      logTiming("git_clone", { repo: source.url });
+      logTiming("git_clone");
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
logTiming("git_clone", { repo: source.url });
logTiming("git_clone");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/sandbox/vercel/sandbox.ts` at line 615, The log currently passes the
raw repository URL (source.url) into logTiming("git_clone"), which can expose
private repo info or embedded credentials; instead compute and log a
non-reversible fingerprint of the URL (e.g., SHA256/SHA1 digest truncated)
and/or the sandbox name, and pass that fingerprint to logTiming("git_clone")
rather than source.url; ensure you reference source.url only to compute the hash
and do not persist or log the raw URL anywhere in the code paths around
logTiming("git_clone").

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/sandbox/vercel/sandbox.ts">

<violation number="1" location="packages/sandbox/vercel/sandbox.ts:615">
P2: Avoid logging the raw repo URL here; it can leak embedded credentials if callers provide a credentialed GitHub URL.</violation>
</file>

You're on the cubic free plan with 20 free PR reviews remaining this month. Upgrade for unlimited reviews.

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

`Failed to clone repository '${source.url}' (exit code ${cloneResult.exitCode})`,
);
}
logTiming("git_clone", { repo: source.url });
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Avoid logging the raw repo URL here; it can leak embedded credentials if callers provide a credentialed GitHub URL.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/sandbox/vercel/sandbox.ts, line 615:

<comment>Avoid logging the raw repo URL here; it can leak embedded credentials if callers provide a credentialed GitHub URL.</comment>

<file context>
@@ -592,6 +612,7 @@ ${hostLine}${portLines}${runtimeEnvLine}`;
           `Failed to clone repository '${source.url}' (exit code ${cloneResult.exitCode})`,
         );
       }
+      logTiming("git_clone", { repo: source.url });
     }
 
</file context>
Suggested change
logTiming("git_clone", { repo: source.url });
logTiming("git_clone");
Fix with Cubic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant