fix: harden desktop app lifecycle, install path, and runtime cleanup#151
jeffreyliimerch merged 6 commits into main
Audit of the embedded-runtime app lifecycle surface (install, start, restart, stop, port allocation, IPC) found a mix of correctness, safety, and supply-chain issues. This change addresses the actionable ones in one pass; non-bugs identified during the audit are noted in code comments rather than "fixed".
Lifecycle / process management
- `stopShellLifecycleAppTarget` no longer issues a redundant second kill via a leftover `else if` branch
- `killAllocatedPortListeners` now keys on the composite `mapKey` instead of the bare `appId`, so multi-workspace deployments don't slowly leak `shellLifecyclePorts` entries
- `ensureAppRunning`'s "already healthy" fast path requires the executor to actually be tracking a child process for shell-managed apps; an orphan listener triggers `killPortListeners` + a fresh restart
- Health monitor restart path now runs `stopApp` + `killPortListeners` before `ensureAppRunning` so a half-dead zombie can't keep the port bound
- `restart_attempts` is now persisted on `app_builds` (state-store migration) so a perpetual crash-loop trips the circuit breaker across runtime restarts instead of looping forever
- `attachTrackedPipeConsumers` drains tracked stdout/stderr into a 64 KiB tail kept in a WeakMap so apps can't stall on a full pipe; `getTrackedProcessTails` exposes the tail for forensics
- `runSpawn` captures stdout/stderr with a 1 MiB cap to avoid OOM from runaway lifecycle commands
- Periodic orphan-process cleanup runs every ~5 min from the health monitor, in addition to the existing startup pass
- `AppLifecycleExecutorLike` gains an optional `isTrackingApp()` method (test doubles can omit it; the production executor implements it via `isShellLifecycleAppTracked`)
Install / supply-chain
- `/api/v1/apps/install-archive` serialises concurrent installs for the same `(workspaceId, appId)` via `appInstallTasks`, returning 409 instead of racing on the filesystem and corrupting `app.runtime.yaml`
- `tar.x` gains a filter that rejects absolute paths and `..` segments, drops uid/gid metadata via `portable: true`, and strips the executable bit from regular files (defense-in-depth on top of `strict: true`)
- `runLifecycleSetup` defaults `NPM_CONFIG_IGNORE_SCRIPTS=true` so `npm install` dependency lifecycle scripts can't run as a supply-chain RCE vector; `npm run build` and similar explicit scripts still work
- `isAllowedArchiveUrl` now parses both the request URL and the prefix and compares protocol/host/pathname, closing the `github.com.attacker.com` lookalike-suffix vector
- `mcp_registry` server entries gain a `started_at` timestamp, bumped only on the post-start path, so MCP clients watching `workspace.yaml` can invalidate cached SSE streams when an app restarts
- Health probe timeout grows with the configured `interval_s` instead of being hard-coded at 3s
Desktop IPC / download
- New `assertSafeWorkspaceId` / `assertSafeAppId` guards (`SAFE_ID_REGEX`: alnum + `._-`) reject empty strings, path separators, control chars, and over-long ids
- `workspaceDirectoryPath` validates AND re-resolves the joined path so a renderer can't smuggle `..` past the workspace root
- `installAppFromCatalog`, `removeInstalledApp`, `getWorkspaceLifecycle`, `activateWorkspace`, `deleteWorkspace` all sanitise their ids on entry
- `downloadAppArchive` rewritten on a `stream/promises` pipeline with `Readable.fromWeb`, an `AbortController`-based 5 min timeout, a 500 MiB hard cap, and best-effort cleanup of partial archives on failure
- The Composio polling loop hoists the 5-minute window into named constants and reports the configured timeout in its error string
Notes (verified-not-bug, kept as-is)
- `withAppStartOperation` cleanup is correct: JS is single-threaded and there is no `await` between `get()` and `set()`, so the slot can never be overwritten while a promise is in-flight
- `allocateAppPort` is idempotent per `(workspaceId, appId-key)`; restart loops do not leak SQLite port rows
- `resolveWorkspaceApp` already throws `WorkspaceAppsError` 404 when `workspace.yaml` is missing; no silent null path
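The `isAllowedArchiveUrl` change above (parsed-URL prefix comparison) can be sketched roughly as follows; the function shape and parameter names here are assumptions, not the PR's exact code:

```typescript
// Sketch of a parsed-URL prefix check. A raw string startsWith on
// "https://github.com" would also accept
// "https://github.com.attacker.com/…"; comparing the parsed protocol,
// host, and pathname prefix closes that lookalike-suffix vector.
function isAllowedArchiveUrl(rawUrl: string, allowedPrefixes: string[]): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // unparseable URLs are rejected outright
  }
  return allowedPrefixes.some((prefix) => {
    let allowed: URL;
    try {
      allowed = new URL(prefix);
    } catch {
      return false; // malformed allowlist entries never match
    }
    return (
      url.protocol === allowed.protocol &&
      url.host === allowed.host &&
      url.pathname.startsWith(allowed.pathname)
    );
  });
}
```

Comparing components rather than strings also means a port or userinfo smuggled into the request URL fails the `host` equality check.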
Follow-up to the previous hardening commit. Two fixes:
1. Revert `NPM_CONFIG_IGNORE_SCRIPTS=true` default in runLifecycleSetup
and the exec-bit strip in tar.x. Both were overreach: TanStack Start
module apps legitimately depend on postinstall scripts (esbuild,
better-sqlite3, swc) and on executable shebang scripts under
node_modules/.bin (vite, tsup). Turning either off made fresh
workspace creates fail to build any app. Supply-chain concerns for
marketplace apps still have defense-in-depth via the parsed-URL
allowlist and the tar path-escape filter.
2. Add end-to-end install-path forensic logging.
runAppSetup (the setup path actually used by ensureAppRunning)
- Writes <appDir>/.holaboss/logs/setup-{timestamp}.log and a
setup.latest.log mirror with full STDOUT + STDERR, exit code,
command, and timestamps
- Captures stdout (previously discarded) in addition to stderr
- Emits structured pino events (app.setup.start/completed/failed/
timeout/exception) via app.log
- Appends NDJSON events to <appDir>/.holaboss/logs/events.ndjson
for cheap tailing
- Embeds the log path in the app_builds.error field and the thrown
Error so failure messages always point at the log
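The NDJSON event trail described above can be sketched as a one-object-per-line append (the directory layout follows the commit message; the helper name and field names are assumptions):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Append one structured JSON event per line under
// <appDir>/.holaboss/logs/events.ndjson so the trail stays cheap to
// tail or grep. Helper name and field names are illustrative.
function appendSetupEvent(
  appDir: string,
  event: string,
  fields: Record<string, unknown> = {},
): string {
  const logDir = path.join(appDir, ".holaboss", "logs");
  fs.mkdirSync(logDir, { recursive: true });
  const file = path.join(logDir, "events.ndjson");
  fs.appendFileSync(
    file,
    JSON.stringify({ ts: new Date().toISOString(), event, ...fields }) + "\n",
  );
  return file;
}
```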
ensureAppRunning
- Structured pino events at every stage: start, resolved, setup
gate, orphan detected, spawn, started, failed variants
/api/v1/apps/install-archive
- Structured events at begin, download start/complete/failed,
extract start/complete/failed, yaml missing, registered,
ensure-running success/failed
New endpoint: GET /api/v1/apps/:appId/setup-log?workspace_id=...
- Returns log tail (configurable byte size, default 32 KiB) +
last 50 NDJSON events for UI/CLI debugging without ssh
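The tail-plus-last-events shape of the endpoint's response can be sketched as two small pure helpers (the 32 KiB and 50-event defaults follow the text above; the helper names are assumptions):

```typescript
// Keep only the last maxBytes of the log, counted in bytes rather than
// characters so the cap holds for multi-byte UTF-8 output.
function tailBytes(content: string, maxBytes: number = 32 * 1024): string {
  const buf = Buffer.from(content, "utf8");
  return buf.subarray(Math.max(0, buf.length - maxBytes)).toString("utf8");
}

// Parse the most recent `limit` events out of an NDJSON trail.
function lastEvents(ndjson: string, limit: number = 50): Array<Record<string, unknown>> {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .slice(-limit) // keep only the most recent events
    .map((line) => JSON.parse(line) as Record<string, unknown>);
}
```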
Desktop createWorkspace
- stageLog()/stageError() helpers emit structured JSON lines to
main-process stdout at each stage: materialize_template,
runtime_post_workspaces, workspace_dir_resolved, apply_template,
copy_local_node_modules, activate_workspace
- Greppable prefix `[holaboss.createWorkspace]` for triage
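A minimal sketch of the `stageLog` breadcrumb helper (the prefix and stage names come from this commit message; the exact signature is an assumption):

```typescript
// Emit one greppable structured line per createWorkspace stage to
// main-process stdout. Prefix is fixed so `grep "[holaboss.createWorkspace]"`
// recovers the whole trail; the helper shape is illustrative.
function stageLog(stage: string, fields: Record<string, unknown> = {}): string {
  const line = `[holaboss.createWorkspace] ${JSON.stringify({ stage, ...fields })}`;
  console.log(line);
  return line;
}
```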
Test plan
- runtime/api-server: 385/385 passing (updated the workspace-timeout
assertion to allow the new log-path suffix in the error message)
- runtime/api-server: typecheck clean
- desktop: typecheck clean
The embedded-runtime exit handler couldn't tell whether SIGTERM came from stopEmbeddedRuntime (hot-reload / settings restart / app quit) or from an actual crash, so every intentional teardown persisted `status: "error" / "Runtime exited unexpectedly (signal=SIGTERM)"` and the UI flashed the red "local runtime is unavailable" banner on every restart. Mark the child in a WeakSet before sending SIGTERM and treat a marked exit as a clean stop in the exit handler.
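The mark-before-signal pattern can be sketched like this (names are illustrative; exit codes are elided for brevity). The key ordering point is that the child joins the WeakSet before SIGTERM is sent, so the exit handler's membership check already reflects intent:

```typescript
// Structural stand-in for a ChildProcess; only kill() is needed here.
type RuntimeChild = { kill(signal: string): boolean };

// Children in this set were stopped on purpose; their SIGTERM exit is
// treated as a clean stop, not a crash. WeakSet avoids pinning exited
// children in memory.
const intentionalStops = new WeakSet<object>();

function stopEmbeddedRuntime(child: RuntimeChild): void {
  intentionalStops.add(child); // mark BEFORE signalling
  child.kill("SIGTERM");
}

function classifyExit(child: RuntimeChild, signal: string | null): "clean" | "crash" {
  if (signal === "SIGTERM" && intentionalStops.has(child)) return "clean";
  return signal === null ? "clean" : "crash"; // simplified: ignores exit codes
}
```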
Two bugs combined to turn marketplace app installs into a runtime crash-loop on the desktop build:
1. `buildPortListenerKillCommand`'s POSIX path used `lsof -t -i :PORT`, which returns PIDs for every socket bound to that port — LISTEN and ESTABLISHED alike. When the runtime's health monitor had just probed `http://127.0.0.1:<app-http-port>/healthz`, its own PID showed up with a keep-alive/TIME_WAIT socket. Orphan cleanup then ran `kill <module-app-pid> <runtime-pid>` and the runtime killed itself. Restrict the match with `-sTCP:LISTEN` so only the listener PID comes back (the PowerShell branch already had the equivalent filter).
2. `workspace-runtime-plan`, `workspace-mcp-sidecar`, `ts-runner`, and `runtime-config-cli` each had a top-level `if (import.meta.url === pathToFileURL(argv[1]).href) await main()` guard. When tsup bundles these source files into `dist/index.mjs`, all of them share the bundle's `import.meta.url`, so every CLI main fired on server boot — with empty argv — printing "operation is required" / "unsupported operation" to stderr and setting `process.exitCode = 2`. The server survived as long as the Fastify listener kept the loop busy, but combined with (1) it made each crash exit with a non-zero code and the logs utterly unreadable. Add a basename check so each CLI only self-runs when invoked directly as `node dist/<cli>.mjs`, not when its source is re-bundled.
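The second fix reduces to a basename comparison; a rough sketch (the helper name is an assumption, the guard condition follows the description above):

```typescript
import { basename } from "node:path";

// The old guard compared import.meta.url against pathToFileURL(argv[1]).href,
// which is true for EVERY CLI bundled into the same dist/index.mjs. The fix
// additionally requires the invoked file to be this CLI's own bundle name,
// so importing the bundle never triggers main().
function shouldSelfRun(argv1: string | undefined, cliBundleName: string): boolean {
  return argv1 !== undefined && basename(argv1) === cliBundleName;
}
```

Usage at each CLI's entry point would look like `if (shouldSelfRun(process.argv[1], "runtime-config-cli.mjs")) await main();`.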
I reviewed this for merge safety with focus on runtime regressions and UI impact. I would not merge as-is yet. I found two install-path regressions:
Other than that, the rest of the sampled runtime and desktop changes looked reasonable. I did not find a separate UI blocker in the desktop-side changes. Validation run on the PR head:
Recommendation: move the install lock so it wraps the full install flow, including the already-installed early return and the …
Two install-path regressions flagged in PR review:
1. The "already installed" 409 early return sat outside the try/finally, so after one failed reinstall the (workspaceId, appId) pair stayed pinned with "install already in progress" until the runtime restarted.
2. The lock was claimed after `await downloadArchiveToTemp(...)`, so two concurrent archive_url requests both passed the in-flight guard, both downloaded, and both reached tar extraction.
Move `appInstallTasks.set` to run synchronously right after the in-flight check — before body validation, download, and the appDir existence check — and wrap the full flow (through ensureAppRunning) in a single try/finally so every exit path releases the lock and cleans up the temp file.
Add two regression tests that reproduce the reviewer's scenarios:
- Sequential "already installed" 409s must both report "already installed" (not "install already in progress"), proving the lock is released.
- Two concurrent archive_url installs against a delayed fixture server must resolve to exactly one 200 and one 409 "install already in progress", proving the download is inside the lock.
Both tests fail against the pre-fix code for the exact symptoms above.
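The locking discipline described above can be sketched as a small wrapper (names follow the PR text; the wrapper shape itself is an assumption). The two properties that matter are: the slot is claimed synchronously, with no await between the check and the set, and it is released in a single finally that every exit path passes through:

```typescript
// Per-(workspaceId, appId) in-flight install slots.
const appInstallTasks = new Map<string, true>();

async function withInstallLock<T>(
  workspaceId: string,
  appId: string,
  run: () => Promise<T>,
): Promise<T> {
  const key = `${workspaceId}:${appId}`;
  if (appInstallTasks.has(key)) {
    throw new Error("install already in progress"); // surfaces as HTTP 409
  }
  appInstallTasks.set(key, true); // synchronous: no await between has() and set()
  try {
    return await run(); // download, extract, register, ensureAppRunning…
  } finally {
    appInstallTasks.delete(key); // released on success, throw, or timeout
  }
}
```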
Summary
Audit of the embedded-runtime app lifecycle surface (install, start, restart, stop, port allocation, IPC) found a mix of correctness, safety, and supply-chain issues. This PR fixes the actionable ones in one pass; non-bugs identified during the audit are noted in code comments rather than "fixed".
Audit hit rate: 17 real bugs out of 23 reported issues. The other 6 are documented inline as verified-not-bug (single-threaded JS guarantees, idempotent state-store APIs, existing throws the audit missed).
Two additional runtime crashes surfaced while testing marketplace installs and are fixed in the last two commits (236ca1f, c8f0e4b) — see Runtime crash fixes (post-audit) below.
Changes by category
Lifecycle / process management
- `stopShellLifecycleAppTarget` — removed the redundant second `killTrackedProcess` via a leftover `else if` branch
- `killAllocatedPortListeners` — now keys on the composite `mapKey` instead of the bare `appId`, so multi-workspace deployments don't slowly leak `shellLifecyclePorts` entries (bonus bug found while fixing C1)
- `ensureAppRunning` — the "already healthy" fast path requires the executor to actually be tracking a child process for shell-managed apps. Orphan listeners trigger `killPortListeners` + a fresh restart so we don't trust phantom ports
- Health monitor restart path — `stopApp` + `killPortListeners` before `ensureAppRunning` so a half-dead zombie can't keep the port bound on the new spawn
- `restart_attempts` — now persisted on `app_builds` (state-store migration) so a perpetual crash-loop trips the circuit breaker across runtime restarts instead of looping forever
- `attachTrackedPipeConsumers` — keeps a 64 KiB tail in a WeakMap so apps can't stall on a full pipe; `getTrackedProcessTails` exposes the tail for forensics
- `runSpawn` — captures stdout/stderr with a 1 MiB cap to avoid OOM from runaway lifecycle commands
- `AppLifecycleExecutorLike` — gains optional `isTrackingApp()` (test doubles can omit; the production executor implements via `isShellLifecycleAppTracked`)
Install / supply-chain
- `/api/v1/apps/install-archive` — serialises concurrent installs for the same `(workspaceId, appId)` via `appInstallTasks`, returning 409 instead of racing on the filesystem and corrupting `app.runtime.yaml`
- `tar.x` filter — rejects absolute paths and `..` segments, drops uid/gid via `portable: true` (the exec-bit strip was reverted because it broke `node_modules/.bin/*` shebangs in prebuilt archives)
- `runLifecycleSetup` — intentionally does NOT default `NPM_CONFIG_IGNORE_SCRIPTS=true`; TanStack Start module apps (better-sqlite3, esbuild, swc) legitimately rely on dependency postinstall scripts. Decision documented inline
- `isAllowedArchiveUrl` — now parses both the request URL and the prefix and compares protocol/host/pathname, closing the `github.com.attacker.com` lookalike-suffix bypass
- `mcp_registry` server entries — gain a `started_at` timestamp, bumped only on the post-start path, so MCP clients watching `workspace.yaml` can invalidate cached SSE streams when an app restarts
- Health probe timeout — grows with the configured `interval_s` instead of being hard-coded at 3s
Desktop IPC / download
- `assertSafeWorkspaceId` / `assertSafeAppId` — new guards (`SAFE_ID_REGEX`: alnum + `._-`) reject empty strings, path separators, control chars, and over-long ids
- `workspaceDirectoryPath` — validates AND re-resolves the joined path so a renderer can't smuggle `..` past the workspace root
- `installAppFromCatalog`, `removeInstalledApp`, `getWorkspaceLifecycle`, `activateWorkspace`, `deleteWorkspace` — all sanitise their ids on entry
- `downloadAppArchive` — rewritten on a `stream/promises` pipeline with `Readable.fromWeb`, an `AbortController`-based 5 min timeout, a 500 MiB hard cap, and best-effort cleanup of partial archives on failure
Forensic logging (post-audit)
- `runAppSetup` / `ensureAppRunning` / `install-archive` — all emit structured pino events, and every setup run writes `<appDir>/.holaboss/logs/setup-<iso>.log` + `setup.latest.log` + an `events.ndjson` event trail. A new `GET /api/v1/apps/:appId/setup-log?workspace_id=…&bytes=…` endpoint returns the latest log tail + last 50 events, and error messages embed the log path so they show up in the Electron renderer
- `createWorkspace` desktop path — `stageLog` / `stageError` helpers emit `[holaboss.createWorkspace] {...JSON...}` breadcrumbs across every stage (materialize, template apply, local node_modules copy, activate)
Runtime crash fixes (post-audit)
- fix(desktop): distinguish intentional runtime stops from crashes (236ca1f) — the embedded-runtime exit handler couldn't tell whether `SIGTERM` came from `stopEmbeddedRuntime` (hot-reload / settings restart / app quit) or from an actual crash, so every intentional teardown persisted `status: "error" / "Runtime exited unexpectedly (signal=SIGTERM)"` and the UI flashed the red "local runtime is unavailable" banner on every dev reload. Mark the child in a WeakSet before sending `SIGTERM` and treat a marked exit as a clean stop.
- fix(runtime): stop orphan cleanup from killing the runtime itself (c8f0e4b) — two bugs turned marketplace app installs into a runtime crash-loop on the desktop build:
  - `buildPortListenerKillCommand`'s POSIX path used `lsof -t -i :PORT`, which returns PIDs for every socket bound to that port — LISTEN and ESTABLISHED alike. When the runtime had just probed `http://127.0.0.1:<app-http-port>/healthz`, its own PID showed up with a keep-alive socket, and orphan cleanup ran `kill <module-app-pid> <runtime-pid>`. Restrict with `-sTCP:LISTEN` so only the listener PID comes back (the PowerShell branch already had the equivalent filter).
  - `workspace-runtime-plan`, `workspace-mcp-sidecar`, `ts-runner`, and `runtime-config-cli` each had a top-level `if (import.meta.url === pathToFileURL(argv[1]).href) await main()` guard. When tsup bundles these files into `dist/index.mjs`, all of them share the bundle's `import.meta.url`, so every CLI main fired on server boot with empty argv — printing "operation is required" to stderr and setting `process.exitCode = 2`. Added a basename check so each CLI only self-runs when invoked directly as `node dist/<cli>.mjs`.
Notes (verified-not-bug, kept as-is)
- `withAppStartOperation` cleanup is correct: JS is single-threaded and there is no `await` between `get()` and `set()`, so the slot can never be overwritten while a promise is in-flight
- `allocateAppPort` is idempotent per `(workspaceId, appId-key)`; restart loops do not leak SQLite port rows
- `resolveWorkspaceApp` already throws `WorkspaceAppsError` 404 when `workspace.yaml` is missing; no silent null path
- `spawn(..., { shell: true })` on POSIX uses `/bin/sh`, not zsh; bash-isms in `lifecycle.start` are an app-author contract, not a runtime bug
Test plan
- `runtime/api-server` — 385 tests pass, typecheck clean
- `runtime/state-store` — 49 tests pass (new `restart_attempts` column + migration covered)
- `desktop` — `npm run typecheck` clean