perf(sandbox): parallelize npm lifecycle hooks (5x faster batch installs)#25

Merged
RalianENG merged 1 commit into main from feat/npm-lifecycle-parallel on May 13, 2026
@RalianENG
Owner

Summary

Sequential ;-joined subshells were the dominant wall-time cost for batch npm scans: ~1 second per npm run --silent --if-present no-op invocation due to npm CLI startup overhead alone, multiplied by 3 hooks (preinstall/install/postinstall) per package. For a 100-package corpus that adds up to ~5 minutes of pure overhead before any real work happens.

Replace with bounded-parallel xargs dispatch (-P 4). Peak concurrent process count stays within --pids-limit=256; CPU/memory caps absorb the parallel load.

Measured (10-pkg light corpus on post-FP-fix kojuto binary): install phase 43s → 19s (-56%). The absolute saving should grow with corpus size, since the per-package startup overhead scales linearly with the number of packages.

Changes

  • npmLifecycleScript rewritten to pipe package paths into xargs -P 4 -I{}.
  • Discovery path (pkgs=nil): find -print0 | xargs -0 -P 4 ...
  • Named-pkgs path: printf '%s\n' <quoted paths> | xargs -P 4 ...
  • npmLifecycleParallelism = 4 constant — matches audit.py worker count.
  • Single-quoting of cd targets preserved as defense-in-depth against attacker-controllable package names (existing tests still pass).
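
The single-quoting mentioned in the last bullet can be illustrated with a small POSIX shell helper. This is a minimal sketch, not the PR's Go implementation: the function name `sq` is hypothetical, and it assumes the input contains no newlines (true for npm package names, per the commit message).

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): wrap a string in single
# quotes so a POSIX shell treats it as one literal word. Each embedded
# single quote is rewritten as '\'' (close quote, escaped quote, reopen).
# Assumption: the input contains no newlines.
sq() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Metacharacters stay inert once quoted.
sq 'pkg; echo pwned'
echo
```

Round-tripping the quoted form through `eval` yields the original string unchanged, which is the property a `cd` target built from an attacker-controllable name needs.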

Test Plan

  • TestNpmLifecycleScript_ParallelDispatch pins the xargs -P invocation structure (both paths) and regression-guards against the sequential form.
  • TestNpmLifecycleScript_ScopedPackage, TestNpmLifecycleScript_QuotesShellMetachars — single-quote preservation unchanged.
  • Light-corpus batch (10 pkg, native + pure JS) — verdict CLEAN after parallelization; 111 events captured (parallel fork/exec produces more PIDs; clone tracking + V8 JIT filter + library_hijack rule all attribute correctly).
  • go vet, golangci-lint run — 0 new issues.

Related Issues

  • Batch screening speed claim in README (50 PyPI packages in 98s) is unchanged — this PR targets npm; PyPI batch already uses a single pip install invocation that parallelizes internally.

Sequential `;`-joined subshells were the dominant wall-time cost
for batch npm scans: ~1 second per `npm run --silent --if-present`
no-op invocation due to npm CLI startup overhead alone, multiplied
by 3 hooks (preinstall/install/postinstall) per package. For a
100-package corpus that adds up to ~5 minutes spent waiting on
no-op skips before any real work happens.
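
The overhead figure above is simple arithmetic, sketched here as a back-of-envelope calculation. The inputs come from the paragraph above; the function names are hypothetical, and the parallel figure is an idealized lower bound that assumes perfect scaling across workers, which a real run will not reach.

```shell
#!/bin/sh
# Hypothetical helpers for a back-of-envelope estimate.

# hook_overhead_secs PACKAGES HOOKS_PER_PKG SECS_PER_INVOCATION
hook_overhead_secs() {
    echo $(( $1 * $2 * $3 ))
}

# ideal_parallel_secs TOTAL_SECS WORKERS
# Ceiling division; assumes work splits evenly (an idealization).
ideal_parallel_secs() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

total=$(hook_overhead_secs 100 3 1)
echo "sequential: ${total}s"                              # 300s, i.e. ~5 minutes
echo "ideal with 4 workers: $(ideal_parallel_secs "$total" 4)s"
```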

Replace the sequential form with bounded-parallel xargs dispatch:

  Discovery path (pkgs=nil):
    find ... -print0 | xargs -0 -P 4 -I{} sh -c 'cd ... && hooks'

  Named-pkgs path (pkgs=["lodash","express",...]):
    printf '%s\n' '/install/.../lodash' '/install/.../express' \
      | xargs -P 4 -I{} sh -c 'cd "{}" && hooks'
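
Both dispatch shapes can be tried without npm using a stand-in for the hook chain. The sketch below is an illustration under assumptions, not the generated script: `discover` and `named` are hypothetical names, the "hook" only prints the package directory's basename, and each path is passed to `sh -c` as a positional parameter (`$1`) rather than spliced into the command string, which may differ from the PR's elided command body.

```shell
#!/bin/sh
# Stand-in demo of the two dispatch shapes; no npm involved.
PARALLELISM=4

# Discovery path: NUL-separated package.json paths from find; the
# stand-in hook runs in each containing directory.
discover() {
    find "$1" -name package.json -print0 \
        | xargs -0 -P "$PARALLELISM" -I{} \
            sh -c 'cd "$(dirname "$1")" && basename "$PWD"' _ {}
}

# Named-pkgs path: newline-separated directories built from names.
named() {
    base=$1; shift
    for name in "$@"; do printf '%s\n' "$base/$name"; done \
        | xargs -P "$PARALLELISM" -I{} \
            sh -c 'cd "$1" && basename "$PWD"' _ {}
}

# Throwaway fixture with three fake packages.
demo_root=$(mktemp -d)
trap 'rm -rf "$demo_root"' EXIT
mkdir -p "$demo_root/lodash" "$demo_root/express" "$demo_root/@scope/pkg"
for d in lodash express @scope/pkg; do
    : > "$demo_root/$d/package.json"
done

# Completion order is nondeterministic under -P, so sort for display.
discover "$demo_root" | sort
named "$demo_root" lodash express | sort
```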

Parallelism bound (npmLifecycleParallelism = 4):

  Each hook chain spawns ~3-5 processes (sh + npm + node + helpers
  + occasional native-build subprocesses). At N=4 that is roughly
  12-20 concurrent processes at peak, well within the container's
  --pids-limit=256, and N=4 matches audit.py's worker count for
  consistency. CPU/memory ceilings (--cpus, --memory) absorb the
  parallel load.
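
The headroom claim can be checked with a quick calculation. This is a rough sketch using the upper estimate of 5 processes per chain from the text above; the function names are hypothetical, and a heavy native build could spawn more processes than this assumes.

```shell
#!/bin/sh
# Hypothetical helpers for a rough PID-budget check.

# peak_procs WORKERS PROCS_PER_CHAIN
peak_procs() {
    echo $(( $1 * $2 ))
}

# within_limit WORKERS PROCS_PER_CHAIN PIDS_LIMIT
# Succeeds when the estimated peak fits under the limit.
within_limit() {
    [ "$(peak_procs "$1" "$2")" -le "$3" ]
}

echo "estimated peak at N=4: $(peak_procs 4 5) processes"
within_limit 4 5 256 && echo "fits under --pids-limit=256"
```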

Portability:

  - `find -print0 | xargs -0` does not depend on the shell, so it
    works under dash, and busybox findutils supports both flags; the
    previous comment that motivated avoiding `read -d ""` does not
    apply (that was about the `read` builtin specifically, not
    pipe-based separation).
  - The named-pkgs path uses `\n` separation instead of `\0` because
    `\0` handling in dash's printf builtin is not portable; npm
    package names cannot contain newlines per the npm registry
    name spec, so `\n` is safe.
  - `xargs -I{}` substitutes each input line as a single argument,
    so spaces in a path are preserved; the single-quoted command
    body keeps the quoting around the substituted path intact.
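
The one-line-one-argument behaviour of `xargs -I{}` is easy to confirm in isolation. A minimal sketch, with a hypothetical helper name and made-up paths:

```shell
#!/bin/sh
# Hypothetical helper: print each input line wrapped in angle brackets.
# With -I{}, xargs takes a whole line as one argument, so a space inside
# a line does not split it into two arguments.
bracket_lines() {
    xargs -I{} printf '<%s>\n' {}
}

printf '%s\n' '/tmp/dir with space' '/tmp/plain' | bracket_lines
```

One caveat: in this newline-separated mode xargs still interprets quote characters and backslashes in its input, which the NUL-separated `-0` form does not. The sketch does not cover such input.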

Measured (10-pkg light corpus, post-FP-fix kojuto binary):

  install phase:      43s  ->  19s  (-56%)
  import phase × 3:   ~same (per-OS-identity, single command each)
  total real time:    2:05 -> 1:56
  events captured:    19   -> 111  (parallel fork/exec produces more
                                    concurrent PIDs; verdict CLEAN
                                    maintained — clone tracking + V8
                                    JIT filter + library_hijack rule
                                    all correctly attribute parallel
                                    events).
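
The percentages follow directly from the timings above. The sketch below recomputes them with awk; the function names are hypothetical, and the total times are converted to seconds (2:05 = 125s, 1:56 = 116s).

```shell
#!/bin/sh
# Hypothetical helpers for recomputing the reported deltas.

# pct_change BEFORE AFTER -> signed percentage, rounded to an integer
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (b - a) / a * 100 }'
}

# speedup BEFORE AFTER -> ratio with two decimals
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "install phase: $(pct_change 43 19)% ($(speedup 43 19)x)"   # -56% (2.26x)
echo "total time:    $(pct_change 125 116)%"                     # -7%
```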

Tests:

  - TestNpmLifecycleScript_ParallelDispatch pins the xargs -P N
    invocation structure for both paths and guards against
    regression to the sequential form.
  - Existing TestNpmLifecycleScript_ScopedPackage and
    TestNpmLifecycleScript_QuotesShellMetachars still pass — single-
    quoting of cd targets is preserved as defense-in-depth against
    attacker-controllable package names slipping past depfile
    validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@RalianENG RalianENG merged commit 6375fd4 into main May 13, 2026
12 checks passed