fix: harden desktop app lifecycle, install path, and runtime cleanup#151

Merged
jeffreyliimerch merged 6 commits into main from fix/desktop-app-lifecycle-hardening
Apr 14, 2026
Conversation

@jotyy
Contributor

@jotyy jotyy commented Apr 14, 2026

Summary

Audit of the embedded-runtime app lifecycle surface (install, start, restart, stop, port allocation, IPC) found a mix of correctness, safety, and supply-chain issues. This PR fixes the actionable ones in one pass; non-bugs identified during the audit are noted in code comments rather than "fixed".

Audit hit rate: 17 real bugs out of 23 reported issues. The other 6 are documented inline as verified-not-bug (single-threaded JS guarantees, idempotent state-store APIs, existing throws the audit missed).

Two additional runtime crashes surfaced while testing marketplace installs and are fixed in the last two commits (236ca1f, c8f0e4b) — see Runtime crash fixes (post-audit) below.

Changes by category

Lifecycle / process management

  • stopShellLifecycleAppTarget — removed a redundant second killTrackedProcess call issued via a leftover else if branch
  • killAllocatedPortListeners — now keys on the composite mapKey instead of bare appId, so multi-workspace deployments don't slowly leak shellLifecyclePorts entries (bonus bug found while fixing C1)
  • ensureAppRunning — the "already healthy" fast path requires the executor to actually be tracking a child process for shell-managed apps. Orphan listeners trigger killPortListeners + a fresh restart so we don't trust phantom ports
  • Health monitor — restart path now stopApp + killPortListeners before ensureAppRunning so a half-dead zombie can't keep the port bound on the new spawn
  • restart_attempts — now persisted on app_builds (state-store migration) so a perpetual crash-loop trips the circuit breaker across runtime restarts instead of looping forever
  • Periodic orphan cleanup — runs every ~5 min from the health monitor in addition to the existing startup pass
  • Tracked stdout/stderr drains — attachTrackedPipeConsumers keeps a 64 KiB tail in a WeakMap so apps can't stall on a full pipe; getTrackedProcessTails exposes the tail for forensics
  • runSpawn — captures stdout/stderr with a 1 MiB cap to avoid OOM from runaway lifecycle commands
  • AppLifecycleExecutorLike — gains optional isTrackingApp() (test doubles can omit; production executor implements via isShellLifecycleAppTracked)
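The bounded-tail drain above can be sketched roughly as follows. `PipeTail` and `TAIL_LIMIT` are illustrative names, not the real `attachTrackedPipeConsumers` internals (which key tails per child process in a WeakMap); the point is that data is always consumed — so the child never blocks on a full pipe — while memory stays capped:

```typescript
// Illustrative sketch of a bounded stdout/stderr tail (not the real implementation).
const TAIL_LIMIT = 64 * 1024; // 64 KiB, matching the cap described in the PR

class PipeTail {
  private chunks: Buffer[] = [];
  private size = 0;

  push(chunk: Buffer): void {
    this.chunks.push(chunk);
    this.size += chunk.length;
    // Drop whole chunks from the front until we are back under the cap.
    // The data is consumed either way, so a chatty child can't stall on backpressure.
    while (this.size > TAIL_LIMIT && this.chunks.length > 1) {
      this.size -= this.chunks.shift()!.length;
    }
    // A single oversized chunk is trimmed to its last TAIL_LIMIT bytes.
    if (this.size > TAIL_LIMIT) {
      const only = this.chunks[0];
      this.chunks[0] = only.subarray(only.length - TAIL_LIMIT);
      this.size = TAIL_LIMIT;
    }
  }

  tail(): string {
    return Buffer.concat(this.chunks).toString("utf8");
  }
}
```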

Install / supply-chain

  • /api/v1/apps/install-archive — serialises concurrent installs for the same (workspaceId, appId) via appInstallTasks, returning 409 instead of racing on the filesystem and corrupting app.runtime.yaml
  • tar.x filter — rejects absolute paths and .. segments, drops uid/gid via portable: true (exec-bit strip was reverted because it broke node_modules/.bin/* shebangs in prebuilt archives)
  • runLifecycleSetup — intentionally does NOT default NPM_CONFIG_IGNORE_SCRIPTS=true; TanStack Start module apps (better-sqlite3, esbuild, swc) legitimately rely on dependency postinstall. Decision documented inline
  • isAllowedArchiveUrl — now parses both the request URL and the prefix and compares protocol/host/pathname, closing the github.com.attacker.com lookalike-suffix bypass
  • mcp_registry server entries — gain a started_at timestamp, bumped only on the post-start path, so MCP clients watching workspace.yaml can invalidate cached SSE streams when an app restarts
  • Health probe timeout — grows with the configured interval_s instead of being hard-coded at 3s
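The lookalike-suffix fix amounts to comparing parsed URL components instead of doing a raw string prefix match. A minimal sketch, assuming a single allowed prefix (the real isAllowedArchiveUrl may take its allowlist from configuration):

```typescript
// Sketch of the parsed-URL allowlist check described above (illustrative signature).
function isAllowedArchiveUrl(rawUrl: string, allowedPrefix: string): boolean {
  let url: URL;
  let prefix: URL;
  try {
    url = new URL(rawUrl);
    prefix = new URL(allowedPrefix);
  } catch {
    return false; // unparseable input is never allowed
  }
  // Compare structured components rather than string-prefixing the raw URL,
  // so "https://github.com.attacker.com/..." cannot pass as "https://github.com/...".
  return (
    url.protocol === prefix.protocol &&
    url.host === prefix.host &&
    url.pathname.startsWith(prefix.pathname)
  );
}
```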

Desktop IPC / download

  • assertSafeWorkspaceId / assertSafeAppId — new guards (SAFE_ID_REGEX: alnum + ._-) reject empty strings, path separators, control chars, and over-long ids
  • workspaceDirectoryPath — validates AND re-resolves the joined path so a renderer can't smuggle .. past the workspace root
  • Sanitised entrypoints — installAppFromCatalog, removeInstalledApp, getWorkspaceLifecycle, activateWorkspace, deleteWorkspace all sanitise their ids on entry
  • downloadAppArchive — rewritten on stream/promises pipeline with Readable.fromWeb, AbortController 5 min timeout, 500 MiB hard cap, best-effort cleanup of partial archives on failure
  • Composio polling — hoists the 5-minute window into named constants and reports the configured timeout in its error string
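The id guards can be sketched like this. The regex shape follows the description (alnum plus `._-`); the alnum-first-character rule and the 128-char limit are assumptions for illustration — the first-char rule conveniently rejects `.` and `..` as ids:

```typescript
// Sketch of assertSafeWorkspaceId / assertSafeAppId (bounds are assumed, not from the PR).
const SAFE_ID_REGEX = /^[A-Za-z0-9][A-Za-z0-9._-]*$/;
const MAX_ID_LENGTH = 128; // assumed "over-long" threshold

function assertSafeId(kind: string, id: unknown): string {
  if (typeof id !== "string" || id.length === 0 || id.length > MAX_ID_LENGTH) {
    throw new Error(`${kind} id must be a non-empty string of at most ${MAX_ID_LENGTH} chars`);
  }
  // Rejects path separators, "..", control characters, and anything outside alnum + ._-
  if (!SAFE_ID_REGEX.test(id)) {
    throw new Error(`${kind} id contains disallowed characters`);
  }
  return id;
}
```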

Forensic logging (post-audit)

  • App install path — runAppSetup / ensureAppRunning / /install-archive all emit structured pino events, and every setup run writes <appDir>/.holaboss/logs/setup-<iso>.log + setup.latest.log + an events.ndjson event trail. New GET /api/v1/apps/:appId/setup-log?workspace_id=…&bytes=… endpoint returns the latest log tail + last 50 events, and error messages embed the log path so they show up in the Electron renderer
  • createWorkspace desktop path — stageLog / stageError helpers emit [holaboss.createWorkspace] {...JSON...} breadcrumbs across every stage (materialize, template apply, local node_modules copy, activate)
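Reading "last 50 events" out of an NDJSON trail, as the setup-log endpoint is described as doing, reduces to a small pure function. This is an illustrative sketch (the real endpoint also tails the raw log file by byte count); the malformed-line fallback is an assumption to keep forensics usable even when the last line was cut mid-write:

```typescript
// Sketch of extracting the last N events from an NDJSON event trail.
function lastEvents(ndjson: string, limit = 50): unknown[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .slice(-limit)
    .map((line) => {
      try {
        return JSON.parse(line);
      } catch {
        return { malformed: line }; // keep a truncated trailing line visible rather than dropping it
      }
    });
}
```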

Runtime crash fixes (post-audit)

  • fix(desktop): distinguish intentional runtime stops from crashes (236ca1f) — the embedded-runtime exit handler couldn't tell whether SIGTERM came from stopEmbeddedRuntime (hot-reload / settings restart / app quit) or from an actual crash, so every intentional teardown persisted status: "error" / "Runtime exited unexpectedly (signal=SIGTERM)" and the UI flashed the red "local runtime is unavailable" banner on every dev reload. Mark the child in a WeakSet before sending SIGTERM and treat a marked exit as a clean stop.
  • fix(runtime): stop orphan cleanup from killing the runtime itself (c8f0e4b) — two bugs turned marketplace app installs into a runtime crash-loop on the desktop build:
    1. buildPortListenerKillCommand POSIX path used lsof -t -i :PORT, which returns PIDs for every socket bound to that port — LISTEN and ESTABLISHED alike. When the runtime had just probed http://127.0.0.1:<app-http-port>/healthz, its own PID showed up with a keep-alive socket, and orphan cleanup ran kill <module-app-pid> <runtime-pid>. Restrict with -sTCP:LISTEN so only the listener PID comes back (PowerShell branch already had the equivalent filter).
    2. workspace-runtime-plan, workspace-mcp-sidecar, ts-runner, and runtime-config-cli each had a top-level if (import.meta.url === pathToFileURL(argv[1]).href) await main() guard. When tsup bundles these files into dist/index.mjs, all of them share the bundle's import.meta.url, so every CLI main fired on server boot with empty argv — printing operation is required to stderr and setting process.exitCode = 2. Added a basename check so each CLI only self-runs when invoked directly as node dist/<cli>.mjs.

Notes (verified-not-bug, kept as-is)

  • withAppStartOperation cleanup is correct: JS single-threaded + no await between get() and set() means the slot can never be overwritten while a promise is in-flight
  • allocateAppPort is idempotent per (workspaceId, appId-key); restart loops do not leak SQLite port rows
  • resolveWorkspaceApp already throws WorkspaceAppsError 404 when workspace.yaml is missing; no silent null path
  • node spawn(..., { shell: true }) on POSIX uses /bin/sh, not zsh; bash-isms in lifecycle.start are an app-author contract, not a runtime bug

Test plan

  • runtime/api-server — 385 tests pass, typecheck clean
  • runtime/state-store — 49 tests pass (new restart_attempts column + migration covered)
  • desktop — npm run typecheck clean
  • Manual: dev desktop hot-reload no longer flashes "Runtime exited unexpectedly" banner
  • Manual: installing an app from the marketplace no longer crash-loops the embedded runtime

@jotyy jotyy requested a review from jeffreyliimerch as a code owner April 14, 2026 02:31
@jotyy jotyy closed this Apr 14, 2026
@jotyy jotyy reopened this Apr 14, 2026
jotyy added 4 commits April 14, 2026 02:41
Audit of the embedded-runtime app lifecycle surface (install, start,
restart, stop, port allocation, IPC) found a mix of correctness, safety,
and supply-chain issues. This change addresses the actionable ones in
one pass; non-bugs identified during the audit are noted in code
comments rather than "fixed".

Lifecycle / process management
- stopShellLifecycleAppTarget no longer issues a redundant second kill
  via a leftover else-if branch
- killAllocatedPortListeners now keys on the composite mapKey instead
  of the bare appId, so multi-workspace deployments don't slowly leak
  shellLifecyclePorts entries
- ensureAppRunning's "already healthy" fast path requires the executor
  to actually be tracking a child process for shell-managed apps; an
  orphan listener triggers killPortListeners + a fresh restart
- Health monitor restart path now stopApp + killPortListeners before
  ensureAppRunning so a half-dead zombie can't keep the port bound
- restart_attempts is now persisted on app_builds (state-store
  migration) so a perpetual crash-loop trips the circuit breaker
  across runtime restarts instead of looping forever
- attachTrackedPipeConsumers drains tracked stdout/stderr to a 64 KiB
  WeakMap tail so apps can't stall on a full pipe; getTrackedProcessTails
  exposes the tail for forensics
- runSpawn captures stdout/stderr with a 1 MiB cap to avoid OOM from
  runaway lifecycle commands
- Periodic orphan-process cleanup runs every ~5 min from the health
  monitor in addition to the existing startup pass
- AppLifecycleExecutorLike gains an optional isTrackingApp() method
  (test doubles can omit it; production executor implements it via
  isShellLifecycleAppTracked)

Install / supply-chain
- /api/v1/apps/install-archive serialises concurrent installs for the
  same (workspaceId, appId) via appInstallTasks, returning 409 instead
  of racing on the filesystem and corrupting app.runtime.yaml
- tar.x gains a filter that rejects absolute paths and ".." segments,
  drops uid/gid metadata via portable: true, and strips the executable
  bit from regular files (defense-in-depth on top of strict: true)
- runLifecycleSetup defaults NPM_CONFIG_IGNORE_SCRIPTS=true so npm
  install dependency lifecycle scripts can't run as a supply-chain RCE
  vector; npm run build and similar explicit scripts still work
- isAllowedArchiveUrl now parses both the request URL and the prefix
  and compares protocol/host/pathname, closing the
  github.com.attacker.com lookalike-suffix vector
- mcp_registry server entries gain a started_at timestamp, bumped only
  on the post-start path, so MCP clients watching workspace.yaml can
  invalidate cached SSE streams when an app restarts
- Health probe timeout grows with the configured interval_s instead of
  being hard-coded at 3s

Desktop IPC / download
- New assertSafeWorkspaceId / assertSafeAppId guards (SAFE_ID_REGEX:
  alnum + ._-) reject empty strings, path separators, control chars,
  and over-long ids
- workspaceDirectoryPath validates AND re-resolves the joined path so
  a renderer can't smuggle ".." past the workspace root
- installAppFromCatalog, removeInstalledApp, getWorkspaceLifecycle,
  activateWorkspace, deleteWorkspace all sanitise their ids on entry
- downloadAppArchive rewritten on stream/promises pipeline with
  Readable.fromWeb, AbortController-based 5 min timeout, 500 MiB hard
  cap, and best-effort cleanup of partial archives on failure
- Composio polling loop hoists the 5-minute window into named constants
  and reports the configured timeout in its error string

Notes (verified-not-bug, kept as-is)
- withAppStartOperation cleanup is correct: JS single-threaded + no
  await between get() and set() means the slot can never be overwritten
  while a promise is in-flight
- allocateAppPort is idempotent per (workspaceId, appId-key); restart
  loops do not leak SQLite port rows
- resolveWorkspaceApp already throws WorkspaceAppsError 404 when
  workspace.yaml is missing; no silent null path
Follow-up to the previous hardening commit. Two fixes:

1. Revert `NPM_CONFIG_IGNORE_SCRIPTS=true` default in runLifecycleSetup
   and the exec-bit strip in tar.x. Both were overreach: TanStack Start
   module apps legitimately depend on postinstall scripts (esbuild,
   better-sqlite3, swc) and on executable shebang scripts under
   node_modules/.bin (vite, tsup). Turning either off made fresh
   workspace creates fail to build any app. Supply-chain concerns for
   marketplace apps still have defense-in-depth via the parsed-URL
   allowlist and the tar path-escape filter.

2. Add end-to-end install-path forensic logging.

   runAppSetup (the setup path actually used by ensureAppRunning)
   - Writes <appDir>/.holaboss/logs/setup-{timestamp}.log and a
     setup.latest.log mirror with full STDOUT + STDERR, exit code,
     command, and timestamps
   - Captures stdout (previously discarded) in addition to stderr
   - Emits structured pino events (app.setup.start/completed/failed/
     timeout/exception) via app.log
   - Appends NDJSON events to <appDir>/.holaboss/logs/events.ndjson
     for cheap tailing
   - Embeds the log path in the app_builds.error field and the thrown
     Error so failure messages always point at the log

   ensureAppRunning
   - Structured pino events at every stage: start, resolved, setup
     gate, orphan detected, spawn, started, failed variants

   /api/v1/apps/install-archive
   - Structured events at begin, download start/complete/failed,
     extract start/complete/failed, yaml missing, registered,
     ensure-running success/failed

   New endpoint: GET /api/v1/apps/:appId/setup-log?workspace_id=...
   - Returns log tail (configurable byte size, default 32 KiB) +
     last 50 NDJSON events for UI/CLI debugging without ssh

   Desktop createWorkspace
   - stageLog()/stageError() helpers emit structured JSON lines to
     main-process stdout at each stage: materialize_template,
     runtime_post_workspaces, workspace_dir_resolved, apply_template,
     copy_local_node_modules, activate_workspace
   - Greppable prefix `[holaboss.createWorkspace]` for triage

Test plan
- runtime/api-server: 385/385 passing (updated the workspace-timeout
  assertion to allow the new log-path suffix in the error message)
- runtime/api-server: typecheck clean
- desktop: typecheck clean
The embedded-runtime exit handler couldn't tell whether SIGTERM came from
stopEmbeddedRuntime (hot-reload / settings restart / app quit) or from an
actual crash, so every intentional teardown persisted
`status: "error" / "Runtime exited unexpectedly (signal=SIGTERM)"` and the
UI flashed the red "local runtime is unavailable" banner on every restart.

Mark the child in a WeakSet before sending SIGTERM and treat a marked exit
as a clean stop in the exit handler.
Two bugs combined turned marketplace app installs into a runtime
crash-loop on the desktop build:

1. `buildPortListenerKillCommand` POSIX path used `lsof -t -i :PORT`,
   which returns PIDs for every socket bound to that port — LISTEN and
   ESTABLISHED alike. When the runtime's health monitor had just probed
   `http://127.0.0.1:<app-http-port>/healthz`, its own PID showed up
   with a keep-alive/TIME_WAIT socket. Orphan cleanup then ran
   `kill <module-app-pid> <runtime-pid>` and the runtime killed itself.
   Restrict the match with `-sTCP:LISTEN` so only the listener PID
   comes back (the PowerShell branch already had the equivalent filter).

2. `workspace-runtime-plan`, `workspace-mcp-sidecar`, `ts-runner` and
   `runtime-config-cli` each had a top-level
   `if (import.meta.url === pathToFileURL(argv[1]).href) await main()`
   guard. When tsup bundles these source files into `dist/index.mjs`,
   all of them share the bundle's `import.meta.url`, so every CLI main
   fired on server boot — with empty argv — printing "operation is
   required" / "unsupported operation" to stderr and setting
   `process.exitCode = 2`. The server survived as long as the Fastify
   listener kept the loop busy, but combined with (1) it made each
   crash exit with a non-zero code and the logs utterly unreadable.
   Add a basename check so each CLI only self-runs when invoked
   directly as `node dist/<cli>.mjs`, not when its source is re-bundled.
@jotyy jotyy force-pushed the fix/desktop-app-lifecycle-hardening branch from c8f0e4b to d3a24d1 on April 14, 2026 06:41
@jeffreyliimerch
Collaborator

I reviewed this for merge safety with focus on runtime regressions and UI impact.

I would not merge as-is yet. I found two install-path regressions:

  1. runtime/api-server/src/app.ts:4332-4348,4524-4533
    The new appInstallTasks lock is inserted before the appDir existence check, but the "app already installed" return happens before the surrounding try/finally that clears the lock. I reproduced this on the PR head: first request returns 409 "app already installed — uninstall first", and the next request for the same (workspaceId, appId) returns 409 "app install already in progress for this id". After one failed reinstall attempt, that app id stays blocked until the runtime restarts.

  2. runtime/api-server/src/app.ts:4262-4299,4332-4338
    The concurrency guard still does not serialize concurrent archive_url installs. The in-flight check runs first, but the lock is only set after await downloadArchiveToTemp(...), so two simultaneous URL-based requests can both pass the guard, both download, and both continue into extraction/registration. I reproduced this with two concurrent app.inject calls against a delayed local archive_url; both requests reached tar extraction instead of one getting a 409.

Other than that, the rest of the sampled runtime and desktop changes looked reasonable. I did not find a separate UI blocker in the desktop-side changes.

Validation run on the PR head:

  • runtime/api-server: npm test -- --test-name-pattern='app setup timeout honors configured timeout|POST /apps/install-archive|buildPortListenerKillCommand emits POSIX lsof command' → passed (389/389 green)
  • runtime/api-server: npm run typecheck → passed
  • desktop: tsc --noEmit → passed

Recommendation: move the install lock so it wraps the full install flow, including the already-installed early return and the archive_url download path, then this should be ready for another pass.

jotyy and others added 2 commits April 14, 2026 05:17
Two install-path regressions flagged in PR review:

1. The "already installed" 409 early return sat outside the try/finally,
   so after one failed reinstall the (workspaceId, appId) pair stayed
   pinned with "install already in progress" until the runtime restarted.

2. The lock was claimed after `await downloadArchiveToTemp(...)`, so two
   concurrent archive_url requests both passed the in-flight guard, both
   downloaded, and both reached tar extraction.

Move `appInstallTasks.set` to run synchronously right after the in-flight
check — before body validation, download, appDir existence check — and
wrap the full flow (through ensureAppRunning) in a single try/finally so
every exit path releases the lock and cleans up the temp file.

Add two regression tests that reproduce the reviewer's scenarios:
- Sequential "already installed" 409s must both report "already installed"
  (not "install already in progress"), proving the lock is released.
- Two concurrent archive_url installs against a delayed fixture server
  must resolve to exactly one 200 and one 409 "install already in
  progress", proving the download is inside the lock.

Both tests fail against the pre-fix code for the exact symptoms above.
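The locking pattern this commit describes — claim synchronously right after the in-flight check, release in a finally covering every exit path — looks roughly like this. It is a simplified sketch: the PR's appInstallTasks is a Map on a Fastify handler, here reduced to a Set and plain result objects, and doInstall stands in for the whole validate/download/extract/register/start flow:

```typescript
// Sketch of the serialised install handler (simplified; not the real route code).
const appInstallTasks = new Set<string>();

async function installArchive(
  workspaceId: string,
  appId: string,
  doInstall: () => Promise<void>,
): Promise<{ status: number; message: string }> {
  const key = `${workspaceId}/${appId}`;
  // Check and claim happen in the same synchronous step, before any await,
  // so two concurrent requests for the same key cannot both pass the guard.
  if (appInstallTasks.has(key)) {
    return { status: 409, message: "app install already in progress for this id" };
  }
  appInstallTasks.add(key);
  try {
    await doInstall(); // validation, download, extraction, registration, start
    return { status: 200, message: "installed" };
  } catch (err) {
    return { status: 500, message: err instanceof Error ? err.message : String(err) };
  } finally {
    appInstallTasks.delete(key); // released on success, failure, and early throws alike
  }
}
```

The finally is what fixes regression (1): a failed install no longer pins the key; claiming before any await is what fixes regression (2).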
@jeffreyliimerch jeffreyliimerch merged commit f91324a into main Apr 14, 2026
11 checks passed
@jeffreyliimerch jeffreyliimerch deleted the fix/desktop-app-lifecycle-hardening branch April 14, 2026 11:12