fix: harden desktop app lifecycle, install path, and runtime cleanup#151

Merged
jeffreyliimerch merged 6 commits into main from fix/desktop-app-lifecycle-hardening
Apr 14, 2026
Conversation

@jotyy
Contributor

@jotyy jotyy commented Apr 14, 2026

Summary

Audit of the embedded-runtime app lifecycle surface (install, start, restart, stop, port allocation, IPC) found a mix of correctness, safety, and supply-chain issues. This PR fixes the actionable ones in one pass; non-bugs identified during the audit are noted in code comments rather than "fixed".

Audit hit rate: 17 real bugs out of 23 reported issues. The other 6 are documented inline as verified-not-bug (single-threaded JS guarantees, idempotent state-store APIs, existing throws the audit missed).

Two additional runtime crashes surfaced while testing marketplace installs and are fixed in the last two commits (236ca1f, c8f0e4b) — see Runtime crash fixes (post-audit) below.

Changes by category

Lifecycle / process management

  • stopShellLifecycleAppTarget — removed a redundant second killTrackedProcess call issued via a leftover else if branch
  • killAllocatedPortListeners — now keys on the composite mapKey instead of bare appId, so multi-workspace deployments don't slowly leak shellLifecyclePorts entries (bonus bug found while fixing C1)
  • ensureAppRunning — the "already healthy" fast path requires the executor to actually be tracking a child process for shell-managed apps. Orphan listeners trigger killPortListeners + a fresh restart so we don't trust phantom ports
  • Health monitor — restart path now stopApp + killPortListeners before ensureAppRunning so a half-dead zombie can't keep the port bound on the new spawn
  • restart_attempts — now persisted on app_builds (state-store migration) so a perpetual crash-loop trips the circuit breaker across runtime restarts instead of looping forever
  • Periodic orphan cleanup — runs every ~5 min from the health monitor in addition to the existing startup pass
  • Tracked stdout/stderr drains — attachTrackedPipeConsumers keeps a 64 KiB tail in a WeakMap so apps can't stall on a full pipe; getTrackedProcessTails exposes the tail for forensics
  • runSpawn — captures stdout/stderr with a 1 MiB cap to avoid OOM from runaway lifecycle commands
  • AppLifecycleExecutorLike — gains optional isTrackingApp() (test doubles can omit; production executor implements via isShellLifecycleAppTracked)
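The bounded-tail drain above can be sketched roughly as follows. `PipeTail` and `TAIL_LIMIT` are illustrative names, not the real `attachTrackedPipeConsumers` internals (which key tails per child process in a WeakMap); the point is that data is always consumed — so the child never blocks on a full pipe — while memory stays capped:

```typescript
// Illustrative sketch of a bounded stdout/stderr tail (not the real implementation).
const TAIL_LIMIT = 64 * 1024; // 64 KiB, matching the cap described in the PR

class PipeTail {
  private chunks: Buffer[] = [];
  private size = 0;

  push(chunk: Buffer): void {
    this.chunks.push(chunk);
    this.size += chunk.length;
    // Drop whole chunks from the front until we are back under the cap.
    // The data is consumed either way, so a chatty child can't stall on backpressure.
    while (this.size > TAIL_LIMIT && this.chunks.length > 1) {
      this.size -= this.chunks.shift()!.length;
    }
    // A single oversized chunk is trimmed to its last TAIL_LIMIT bytes.
    if (this.size > TAIL_LIMIT) {
      const only = this.chunks[0];
      this.chunks[0] = only.subarray(only.length - TAIL_LIMIT);
      this.size = TAIL_LIMIT;
    }
  }

  tail(): string {
    return Buffer.concat(this.chunks).toString("utf8");
  }
}
```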

Install / supply-chain

  • /api/v1/apps/install-archive — serialises concurrent installs for the same (workspaceId, appId) via appInstallTasks, returning 409 instead of racing on the filesystem and corrupting app.runtime.yaml
  • tar.x filter — rejects absolute paths and .. segments, drops uid/gid via portable: true (exec-bit strip was reverted because it broke node_modules/.bin/* shebangs in prebuilt archives)
  • runLifecycleSetup — intentionally does NOT default NPM_CONFIG_IGNORE_SCRIPTS=true; TanStack Start module apps (better-sqlite3, esbuild, swc) legitimately rely on dependency postinstall. Decision documented inline
  • isAllowedArchiveUrl — now parses both the request URL and the prefix and compares protocol/host/pathname, closing the github.com.attacker.com lookalike-suffix bypass
  • mcp_registry server entries — gain a started_at timestamp, bumped only on the post-start path, so MCP clients watching workspace.yaml can invalidate cached SSE streams when an app restarts
  • Health probe timeout — grows with the configured interval_s instead of being hard-coded at 3s
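The lookalike-suffix fix amounts to comparing parsed URL components instead of doing a raw string prefix match. A minimal sketch, assuming a single allowed prefix (the real isAllowedArchiveUrl may take its allowlist from configuration):

```typescript
// Sketch of the parsed-URL allowlist check described above (illustrative signature).
function isAllowedArchiveUrl(rawUrl: string, allowedPrefix: string): boolean {
  let url: URL;
  let prefix: URL;
  try {
    url = new URL(rawUrl);
    prefix = new URL(allowedPrefix);
  } catch {
    return false; // unparseable input is never allowed
  }
  // Compare structured components rather than string-prefixing the raw URL,
  // so "https://github.com.attacker.com/..." cannot pass as "https://github.com/...".
  return (
    url.protocol === prefix.protocol &&
    url.host === prefix.host &&
    url.pathname.startsWith(prefix.pathname)
  );
}
```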

Desktop IPC / download

  • assertSafeWorkspaceId / assertSafeAppId — new guards (SAFE_ID_REGEX: alnum + ._-) reject empty strings, path separators, control chars, and over-long ids
  • workspaceDirectoryPath — validates AND re-resolves the joined path so a renderer can't smuggle .. past the workspace root
  • Sanitised entrypoints — installAppFromCatalog, removeInstalledApp, getWorkspaceLifecycle, activateWorkspace, deleteWorkspace all sanitise their ids on entry
  • downloadAppArchive — rewritten on stream/promises pipeline with Readable.fromWeb, AbortController 5 min timeout, 500 MiB hard cap, best-effort cleanup of partial archives on failure
  • Composio polling — hoists the 5-minute window into named constants and reports the configured timeout in its error string
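The id guards can be sketched like this. The regex shape follows the description (alnum plus `._-`); the alnum-first-character rule and the 128-char limit are assumptions for illustration — the first-char rule conveniently rejects `.` and `..` as ids:

```typescript
// Sketch of assertSafeWorkspaceId / assertSafeAppId (bounds are assumed, not from the PR).
const SAFE_ID_REGEX = /^[A-Za-z0-9][A-Za-z0-9._-]*$/;
const MAX_ID_LENGTH = 128; // assumed "over-long" threshold

function assertSafeId(kind: string, id: unknown): string {
  if (typeof id !== "string" || id.length === 0 || id.length > MAX_ID_LENGTH) {
    throw new Error(`${kind} id must be a non-empty string of at most ${MAX_ID_LENGTH} chars`);
  }
  // Rejects path separators, "..", control characters, and anything outside alnum + ._-
  if (!SAFE_ID_REGEX.test(id)) {
    throw new Error(`${kind} id contains disallowed characters`);
  }
  return id;
}
```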

Forensic logging (post-audit)

  • App install path — runAppSetup / ensureAppRunning / /install-archive all emit structured pino events, and every setup run writes <appDir>/.holaboss/logs/setup-<iso>.log + setup.latest.log + an events.ndjson event trail. New GET /api/v1/apps/:appId/setup-log?workspace_id=…&bytes=… endpoint returns the latest log tail + last 50 events, and error messages embed the log path so they show up in the Electron renderer
  • createWorkspace desktop path — stageLog / stageError helpers emit [holaboss.createWorkspace] {...JSON...} breadcrumbs across every stage (materialize, template apply, local node_modules copy, activate)
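Reading "last 50 events" out of an NDJSON trail, as the setup-log endpoint is described as doing, reduces to a small pure function. This is an illustrative sketch (the real endpoint also tails the raw log file by byte count); the malformed-line fallback is an assumption to keep forensics usable even when the last line was cut mid-write:

```typescript
// Sketch of extracting the last N events from an NDJSON event trail.
function lastEvents(ndjson: string, limit = 50): unknown[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .slice(-limit)
    .map((line) => {
      try {
        return JSON.parse(line);
      } catch {
        return { malformed: line }; // keep a truncated trailing line visible rather than dropping it
      }
    });
}
```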

Runtime crash fixes (post-audit)

  • fix(desktop): distinguish intentional runtime stops from crashes (236ca1f) — the embedded-runtime exit handler couldn't tell whether SIGTERM came from stopEmbeddedRuntime (hot-reload / settings restart / app quit) or from an actual crash, so every intentional teardown persisted status: "error" / "Runtime exited unexpectedly (signal=SIGTERM)" and the UI flashed the red "local runtime is unavailable" banner on every dev reload. Mark the child in a WeakSet before sending SIGTERM and treat a marked exit as a clean stop.
  • fix(runtime): stop orphan cleanup from killing the runtime itself (c8f0e4b) — two bugs turned marketplace app installs into a runtime crash-loop on the desktop build:
    1. buildPortListenerKillCommand POSIX path used lsof -t -i :PORT, which returns PIDs for every socket bound to that port — LISTEN and ESTABLISHED alike. When the runtime had just probed http://127.0.0.1:<app-http-port>/healthz, its own PID showed up with a keep-alive socket, and orphan cleanup ran kill <module-app-pid> <runtime-pid>. Restrict with -sTCP:LISTEN so only the listener PID comes back (PowerShell branch already had the equivalent filter).
    2. workspace-runtime-plan, workspace-mcp-sidecar, ts-runner, and runtime-config-cli each had a top-level if (import.meta.url === pathToFileURL(argv[1]).href) await main() guard. When tsup bundles these files into dist/index.mjs, all of them share the bundle's import.meta.url, so every CLI main fired on server boot with empty argv — printing operation is required to stderr and setting process.exitCode = 2. Added a basename check so each CLI only self-runs when invoked directly as node dist/<cli>.mjs.

Notes (verified-not-bug, kept as-is)

  • withAppStartOperation cleanup is correct: JS single-threaded + no await between get() and set() means the slot can never be overwritten while a promise is in-flight
  • allocateAppPort is idempotent per (workspaceId, appId-key); restart loops do not leak SQLite port rows
  • resolveWorkspaceApp already throws WorkspaceAppsError 404 when workspace.yaml is missing; no silent null path
  • node spawn(..., { shell: true }) on POSIX uses /bin/sh, not zsh; bash-isms in lifecycle.start are an app-author contract, not a runtime bug

Test plan

  • runtime/api-server — 385 tests pass, typecheck clean
  • runtime/state-store — 49 tests pass (new restart_attempts column + migration covered)
  • desktop — npm run typecheck clean
  • Manual: dev desktop hot-reload no longer flashes "Runtime exited unexpectedly" banner
  • Manual: installing an app from the marketplace no longer crash-loops the embedded runtime

@jotyy jotyy requested a review from jeffreyliimerch as a code owner April 14, 2026 02:31
@jotyy jotyy closed this Apr 14, 2026
@jotyy jotyy reopened this Apr 14, 2026
jotyy added 4 commits April 14, 2026 02:41
Audit of the embedded-runtime app lifecycle surface (install, start,
restart, stop, port allocation, IPC) found a mix of correctness, safety,
and supply-chain issues. This change addresses the actionable ones in
one pass; non-bugs identified during the audit are noted in code
comments rather than "fixed".

Lifecycle / process management
- stopShellLifecycleAppTarget no longer issues a redundant second kill
  via a leftover else-if branch
- killAllocatedPortListeners now keys on the composite mapKey instead
  of the bare appId, so multi-workspace deployments don't slowly leak
  shellLifecyclePorts entries
- ensureAppRunning's "already healthy" fast path requires the executor
  to actually be tracking a child process for shell-managed apps; an
  orphan listener triggers killPortListeners + a fresh restart
- Health monitor restart path now stopApp + killPortListeners before
  ensureAppRunning so a half-dead zombie can't keep the port bound
- restart_attempts is now persisted on app_builds (state-store
  migration) so a perpetual crash-loop trips the circuit breaker
  across runtime restarts instead of looping forever
- attachTrackedPipeConsumers drains tracked stdout/stderr to a 64 KiB
  WeakMap tail so apps can't stall on a full pipe; getTrackedProcessTails
  exposes the tail for forensics
- runSpawn captures stdout/stderr with a 1 MiB cap to avoid OOM from
  runaway lifecycle commands
- Periodic orphan-process cleanup runs every ~5 min from the health
  monitor in addition to the existing startup pass
- AppLifecycleExecutorLike gains an optional isTrackingApp() method
  (test doubles can omit it; production executor implements it via
  isShellLifecycleAppTracked)

Install / supply-chain
- /api/v1/apps/install-archive serialises concurrent installs for the
  same (workspaceId, appId) via appInstallTasks, returning 409 instead
  of racing on the filesystem and corrupting app.runtime.yaml
- tar.x gains a filter that rejects absolute paths and ".." segments,
  drops uid/gid metadata via portable: true, and strips the executable
  bit from regular files (defense-in-depth on top of strict: true)
- runLifecycleSetup defaults NPM_CONFIG_IGNORE_SCRIPTS=true so npm
  install dependency lifecycle scripts can't run as a supply-chain RCE
  vector; npm run build and similar explicit scripts still work
- isAllowedArchiveUrl now parses both the request URL and the prefix
  and compares protocol/host/pathname, closing the
  github.com.attacker.com lookalike-suffix vector
- mcp_registry server entries gain a started_at timestamp, bumped only
  on the post-start path, so MCP clients watching workspace.yaml can
  invalidate cached SSE streams when an app restarts
- Health probe timeout grows with the configured interval_s instead of
  being hard-coded at 3s

Desktop IPC / download
- New assertSafeWorkspaceId / assertSafeAppId guards (SAFE_ID_REGEX:
  alnum + ._-) reject empty strings, path separators, control chars,
  and over-long ids
- workspaceDirectoryPath validates AND re-resolves the joined path so
  a renderer can't smuggle ".." past the workspace root
- installAppFromCatalog, removeInstalledApp, getWorkspaceLifecycle,
  activateWorkspace, deleteWorkspace all sanitise their ids on entry
- downloadAppArchive rewritten on stream/promises pipeline with
  Readable.fromWeb, AbortController-based 5 min timeout, 500 MiB hard
  cap, and best-effort cleanup of partial archives on failure
- Composio polling loop hoists the 5-minute window into named constants
  and reports the configured timeout in its error string

Notes (verified-not-bug, kept as-is)
- withAppStartOperation cleanup is correct: JS single-threaded + no
  await between get() and set() means the slot can never be overwritten
  while a promise is in-flight
- allocateAppPort is idempotent per (workspaceId, appId-key); restart
  loops do not leak SQLite port rows
- resolveWorkspaceApp already throws WorkspaceAppsError 404 when
  workspace.yaml is missing; no silent null path
Follow-up to the previous hardening commit. Two fixes:

1. Revert `NPM_CONFIG_IGNORE_SCRIPTS=true` default in runLifecycleSetup
   and the exec-bit strip in tar.x. Both were overreach: TanStack Start
   module apps legitimately depend on postinstall scripts (esbuild,
   better-sqlite3, swc) and on executable shebang scripts under
   node_modules/.bin (vite, tsup). Turning either off made fresh
   workspace creates fail to build any app. Supply-chain concerns for
   marketplace apps still have defense-in-depth via the parsed-URL
   allowlist and the tar path-escape filter.

2. Add end-to-end install-path forensic logging.

   runAppSetup (the setup path actually used by ensureAppRunning)
   - Writes <appDir>/.holaboss/logs/setup-{timestamp}.log and a
     setup.latest.log mirror with full STDOUT + STDERR, exit code,
     command, and timestamps
   - Captures stdout (previously discarded) in addition to stderr
   - Emits structured pino events (app.setup.start/completed/failed/
     timeout/exception) via app.log
   - Appends NDJSON events to <appDir>/.holaboss/logs/events.ndjson
     for cheap tailing
   - Embeds the log path in the app_builds.error field and the thrown
     Error so failure messages always point at the log

   ensureAppRunning
   - Structured pino events at every stage: start, resolved, setup
     gate, orphan detected, spawn, started, failed variants

   /api/v1/apps/install-archive
   - Structured events at begin, download start/complete/failed,
     extract start/complete/failed, yaml missing, registered,
     ensure-running success/failed

   New endpoint: GET /api/v1/apps/:appId/setup-log?workspace_id=...
   - Returns log tail (configurable byte size, default 32 KiB) +
     last 50 NDJSON events for UI/CLI debugging without ssh

   Desktop createWorkspace
   - stageLog()/stageError() helpers emit structured JSON lines to
     main-process stdout at each stage: materialize_template,
     runtime_post_workspaces, workspace_dir_resolved, apply_template,
     copy_local_node_modules, activate_workspace
   - Greppable prefix `[holaboss.createWorkspace]` for triage

Test plan
- runtime/api-server: 385/385 passing (updated the workspace-timeout
  assertion to allow the new log-path suffix in the error message)
- runtime/api-server: typecheck clean
- desktop: typecheck clean
The embedded-runtime exit handler couldn't tell whether SIGTERM came from
stopEmbeddedRuntime (hot-reload / settings restart / app quit) or from an
actual crash, so every intentional teardown persisted
`status: "error" / "Runtime exited unexpectedly (signal=SIGTERM)"` and the
UI flashed the red "local runtime is unavailable" banner on every restart.

Mark the child in a WeakSet before sending SIGTERM and treat a marked exit
as a clean stop in the exit handler.
Two bugs combined turned marketplace app installs into a runtime
crash-loop on the desktop build:

1. `buildPortListenerKillCommand` POSIX path used `lsof -t -i :PORT`,
   which returns PIDs for every socket bound to that port — LISTEN and
   ESTABLISHED alike. When the runtime's health monitor had just probed
   `http://127.0.0.1:<app-http-port>/healthz`, its own PID showed up
   with a keep-alive/TIME_WAIT socket. Orphan cleanup then ran
   `kill <module-app-pid> <runtime-pid>` and the runtime killed itself.
   Restrict the match with `-sTCP:LISTEN` so only the listener PID
   comes back (the PowerShell branch already had the equivalent filter).

2. `workspace-runtime-plan`, `workspace-mcp-sidecar`, `ts-runner` and
   `runtime-config-cli` each had a top-level
   `if (import.meta.url === pathToFileURL(argv[1]).href) await main()`
   guard. When tsup bundles these source files into `dist/index.mjs`,
   all of them share the bundle's `import.meta.url`, so every CLI main
   fired on server boot — with empty argv — printing "operation is
   required" / "unsupported operation" to stderr and setting
   `process.exitCode = 2`. The server survived as long as the Fastify
   listener kept the loop busy, but combined with (1) it made each
   crash exit with a non-zero code and the logs utterly unreadable.
   Add a basename check so each CLI only self-runs when invoked
   directly as `node dist/<cli>.mjs`, not when its source is re-bundled.
@jotyy jotyy force-pushed the fix/desktop-app-lifecycle-hardening branch from c8f0e4b to d3a24d1 on April 14, 2026 06:41
@jeffreyliimerch
Collaborator

I reviewed this for merge safety with focus on runtime regressions and UI impact.

I would not merge as-is yet. I found two install-path regressions:

  1. runtime/api-server/src/app.ts:4332-4348,4524-4533
    The new appInstallTasks lock is inserted before the appDir existence check, but the "app already installed" return happens before the surrounding try/finally that clears the lock. I reproduced this on the PR head: first request returns 409 "app already installed — uninstall first", and the next request for the same (workspaceId, appId) returns 409 "app install already in progress for this id". After one failed reinstall attempt, that app id stays blocked until the runtime restarts.

  2. runtime/api-server/src/app.ts:4262-4299,4332-4338
    The concurrency guard still does not serialize concurrent archive_url installs. The in-flight check runs first, but the lock is only set after await downloadArchiveToTemp(...), so two simultaneous URL-based requests can both pass the guard, both download, and both continue into extraction/registration. I reproduced this with two concurrent app.inject calls against a delayed local archive_url; both requests reached tar extraction instead of one getting a 409.

Other than that, the rest of the sampled runtime and desktop changes looked reasonable. I did not find a separate UI blocker in the desktop-side changes.

Validation run on the PR head:

  • runtime/api-server: npm test -- --test-name-pattern='app setup timeout honors configured timeout|POST /apps/install-archive|buildPortListenerKillCommand emits POSIX lsof command' → passed (389/389 green)
  • runtime/api-server: npm run typecheck → passed
  • desktop: tsc --noEmit → passed

Recommendation: move the install lock so it wraps the full install flow, including the already-installed early return and the archive_url download path, then this should be ready for another pass.

jotyy and others added 2 commits April 14, 2026 05:17
Two install-path regressions flagged in PR review:

1. The "already installed" 409 early return sat outside the try/finally,
   so after one failed reinstall the (workspaceId, appId) pair stayed
   pinned with "install already in progress" until the runtime restarted.

2. The lock was claimed after `await downloadArchiveToTemp(...)`, so two
   concurrent archive_url requests both passed the in-flight guard, both
   downloaded, and both reached tar extraction.

Move `appInstallTasks.set` to run synchronously right after the in-flight
check — before body validation, download, appDir existence check — and
wrap the full flow (through ensureAppRunning) in a single try/finally so
every exit path releases the lock and cleans up the temp file.

Add two regression tests that reproduce the reviewer's scenarios:
- Sequential "already installed" 409s must both report "already installed"
  (not "install already in progress"), proving the lock is released.
- Two concurrent archive_url installs against a delayed fixture server
  must resolve to exactly one 200 and one 409 "install already in
  progress", proving the download is inside the lock.

Both tests fail against the pre-fix code for the exact symptoms above.
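The locking pattern this commit describes — claim synchronously right after the in-flight check, release in a finally covering every exit path — looks roughly like this. It is a simplified sketch: the PR's appInstallTasks is a Map on a Fastify handler, here reduced to a Set and plain result objects, and doInstall stands in for the whole validate/download/extract/register/start flow:

```typescript
// Sketch of the serialised install handler (simplified; not the real route code).
const appInstallTasks = new Set<string>();

async function installArchive(
  workspaceId: string,
  appId: string,
  doInstall: () => Promise<void>,
): Promise<{ status: number; message: string }> {
  const key = `${workspaceId}/${appId}`;
  // Check and claim happen in the same synchronous step, before any await,
  // so two concurrent requests for the same key cannot both pass the guard.
  if (appInstallTasks.has(key)) {
    return { status: 409, message: "app install already in progress for this id" };
  }
  appInstallTasks.add(key);
  try {
    await doInstall(); // validation, download, extraction, registration, start
    return { status: 200, message: "installed" };
  } catch (err) {
    return { status: 500, message: err instanceof Error ? err.message : String(err) };
  } finally {
    appInstallTasks.delete(key); // released on success, failure, and early throws alike
  }
}
```

The finally is what fixes regression (1): a failed install no longer pins the key; claiming before any await is what fixes regression (2).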
@jeffreyliimerch jeffreyliimerch merged commit f91324a into main Apr 14, 2026
11 checks passed
@jeffreyliimerch jeffreyliimerch deleted the fix/desktop-app-lifecycle-hardening branch April 14, 2026 11:12