Skip to content

fix(sandbox): surface fingerprint-erasure and probe-script failures#22

Merged
RalianENG merged 3 commits into
mainfrom
fix/sandbox-fail-loud
May 9, 2026
Merged

fix(sandbox): surface fingerprint-erasure and probe-script failures#22
RalianENG merged 3 commits into
mainfrom
fix/sandbox-fail-loud

Conversation

@RalianENG
Copy link
Copy Markdown
Owner

Summary

Fix a silent-fail class of bug in the post-start sandbox preparation that could produce false-clean verdicts. When eraseFingerprints, plantHoneypotFiles, restoreLocalBin, or the probe-script writers failed at the docker exec layer, the error was discarded (_ = cmd.Run()). The scan would continue against a partially-prepared sandbox — /.dockerenv still present, honeypot files missing, or an import phase running without its OS-spoofing wrapper — and any sandbox-aware payload that simply detected the gap and stayed dormant would surface as clean. The detection contract ("clean means no malicious behavior was observed") was therefore not enforced.

Changes

  • dockerExecRoot / dockerWriteFile now return errors, wrapped with the command (or path) for diagnosis. Honeypot content is omitted from the error to avoid spilling generated credentials into logs.
  • restoreLocalBin, eraseFingerprints, plantHoneypotFiles, WriteProbeScripts, and WriteProbeScriptsMulti now fail-fast on the first underlying error.
  • Start / StartPaused route through a shared prepareSandboxState helper that propagates any setup failure.
  • The three call sites in cmd/root.go (batch screening, eBPF probe, container-strace probe) abort the scan on prep failure rather than continuing.
  • Fail-loud regression tests added (TestEraseFingerprints_FailLoud, TestPlantHoneypotFiles_FailLoud, TestRestoreLocalBin_FailLoud, TestWriteProbeScripts_FailLoud, TestWriteProbeScriptsMulti_FailLoud) — they pin the contract that a failing docker command must surface as an error. Existing happy-path tests updated to assert nil.
  • CHANGELOG entry added under [Unreleased] / Fixed.

Test Plan

  • Unit tests pass (go test ./... -short -race -count=1)
  • go vet ./... clean
  • gofmt -l . clean
  • New *_FailLoud regression tests cover every affected function
  • Manual end-to-end: scan a known-clean package and a known-malicious package; verify happy-path behavior is unchanged and that a deliberately-broken docker exec (e.g. by revoking the container ID mid-run) now produces an error instead of clean
  • Documentation updated: CHANGELOG [Unreleased] / Fixed entry added

Related Issues

N/A — surfaced during a v1.0 readiness review; not tracked in an existing issue.

eraseFingerprints, plantHoneypotFiles, restoreLocalBin, WriteProbeScripts,
and WriteProbeScriptsMulti previously discarded the result of every
docker-exec call (`_ = cmd.Run()`). A failure here — e.g. /.dockerenv
removal rejected by a corrupted runtime, or a probe-script write that
silently dropped — would let the scan continue against a partially
prepared sandbox: /.dockerenv intact, honeypot files missing, or an
import phase that runs without its OS-spoofing wrapper. Sandbox-aware
malware that simply detected the gap and stayed dormant would surface
as a "clean" verdict — breaking the detection contract that "clean
means no malicious behavior was observed."

This change makes the entire post-start setup fail loud:

- dockerExecRoot / dockerWriteFile return errors wrapped with the
  command (or path) for diagnosis. Honeypot content is omitted from
  the error to avoid spilling generated credentials into logs.
- The three setup helpers and both probe-script writers fail-fast on
  the first underlying error.
- StartPaused and Start route through a shared prepareSandboxState
  helper that propagates any setup failure. Callers in cmd/root.go
  abort the scan rather than reporting clean.
- TestEraseFingerprints_FailLoud / TestPlantHoneypotFiles_FailLoud /
  TestRestoreLocalBin_FailLoud / TestWriteProbeScripts*_FailLoud pin
  the contract: a failing docker command must surface as an error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 73.52941% with 27 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
internal/sandbox/sandbox.go 78.94% 10 Missing and 10 partials ⚠️
cmd/root.go 0.00% 7 Missing ⚠️

📢 Thoughts on this report? Let us know!

RalianENG and others added 2 commits May 9, 2026 02:06
Real-world scan of `six` failed with the previous commit's fail-loud
change:

    Error: starting sandbox: erasing sandbox fingerprints: \
      docker exec rm -f /.dockerenv: exit status 1

Root cause: the rootfs is mounted --read-only, so `rm -f /.dockerenv`
inside the container has never succeeded. The original silent-fail
hid the bug; surfacing the error exposed it. Anti-fingerprinting was
effectively never applied for /.dockerenv since --read-only was
introduced.

Fix: bind-mount an empty regular file from the host over /.dockerenv
at container creation time. The mask file lives in the per-scan
seccomp temp dir (already cleaned up by Cleanup), is created by
writeSeccompProfile, and is wired in via
`--mount=type=bind,src=...,dst=/.dockerenv,readonly`. Sandbox-aware
payloads that read /.dockerenv now see empty content. A regular
empty file is used (not /dev/null) so stat().S_ISREG() still returns
true — a char device would itself be a fingerprint.

eraseFingerprints had only this one operation, so the function and
its two tests are removed; prepareSandboxState now calls only
restoreLocalBin and plantHoneypotFiles. SECURITY.md is corrected to
describe the actual mechanism. TestContainerArgs gains an assertion
that the bind-mount flag and host-side mask file are present.

Path-existence checks (`os.path.exists("/.dockerenv")`) still see
something at the path; --runtime=runsc remains the recommendation
for defeating those.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- gocritic appendCombine: combine the seccomp and dockerenv-mask
  appends into a single args = append(...) call (introduced when
  the bind-mount line was added in the previous commit).
- goconst: extract "test-container" into testContainerID — the new
  *_FailLoud tests pushed the literal's count from 3 to 7.

No behaviour change. Net lint count drops by one vs. the pre-series
baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RalianENG RalianENG merged commit e98021b into main May 9, 2026
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant