Skip to content

ci+security: Phase 4 repo controls + Phase 5 HMAC-authenticated cache#254

Merged
Cre-eD merged 7 commits into
mainfrom
fix/phase4-repo-controls-and-phase5-cache-hmac
May 13, 2026
Merged

ci+security: Phase 4 repo controls + Phase 5 HMAC-authenticated cache#254
Cre-eD merged 7 commits into
mainfrom
fix/phase4-repo-controls-and-phase5-cache-hmac

Conversation

@Cre-eD
Copy link
Copy Markdown
Contributor

@Cre-eD Cre-eD commented May 11, 2026

Summary

Two hardening phases bundled — they touch disjoint files and share zero overlap with the parallel SCA / SBOM / SAST work happening in this repo.

Phase Touched files Closes
5 — HMAC integrity for the security cache pkg/security/cache.go, pkg/security/cache_test.go OWASP A08, NIST SSDF RV.3, mtime-forgery (touch -t) attack surface
4 — Repo controls SECURITY.md, CODEOWNERS, .github/workflows/codeql.yml, .github/workflows/scorecard.yml OSPS Baseline OSP.A/D/S/V, Scorecard Security-Policy/Code-Review/SAST, CIS SSCS §1, CISA Secure-by-Design pillar 2

Phase 5 — HMAC-SHA256 for pkg/security/cache.go

The pre-existing tamper detection compared info.ModTime() to the entry's CreatedAt. mtime is forgeable in one line (touch -t), so the check was decorative. Replaced with HMAC-SHA256 over the canonical JSON of each entry, keyed by a 32-byte random per-cache key persisted at <baseDir>/.hmac.key (mode 0o600, generated via crypto/rand).

Behavior change matrix:

Scenario Before After
Set + Get hit hit
Set + Get after TTL miss miss
touch -t future mtime miss (false positive — entry was fine) hit (mtime no longer load-bearing)
Bytes flipped in JSON on disk hit (forgery succeeded) miss + file removed
Entry written under a different cache key hit miss + file removed
Pre-HMAC legacy entry hit (no integrity) miss + file removed
Signed file copied to wrong CacheKey path (codex P2 catch) n/a miss + file removed

Codex review

Codex caught a real bug on the first pass and was fixed in 5f01b9d:

"When a valid signed cache file is copied to another cache key's path, the MAC still verifies because it covers the embedded entry, but Get never checks signed.Entry.Key against the requested key. In that scenario a caller receives cached SBOM/scan data for a different image/config without knowing the HMAC key."

Fix: after MAC verifies in Get, require signed.Entry.Key == key; the same check on Clean() removes misplaced files during the next sweep. New regression test TestCacheCrossKeyCopyRejected exercises the attack.

Tests

23 tests in pkg/security/: the 16 pre-existing + 7 new (TestCacheHMACKeyPersisted, TestCacheHMACKeyWrongSizeRejected, TestCacheTamperDetection, TestCacheMtimeForgeryNoLongerHelpful, TestCacheKeyMismatchRejectsEntry, TestCacheLegacyUnsignedEntryDiscarded, TestCacheCrossKeyCopyRejected, TestCacheCleanSkipsHMACKey). All green.

Forward compatibility

Future fields added to CacheEntry will break MAC verification on old files (the marshaled JSON differs, so the recomputed MAC differs). Acceptable — caches rebuild on miss, and the alternative (schema versioning) is more complexity than the use case warrants. Documented inline.

Phase 4 — Repo controls

SECURITY.md

Vulnerability disclosure policy. Supported-versions matrix, two reporting channels (GitHub Security Advisory preferred, security@simple-container.com fallback), SLA (3-day ack; 30d HIGH+, 90d MEDIUM), out-of-scope clauses (consumer cloud accounts, upstream tools, base-image OS packages tracked in PR notes), and a note on cryptographic primitives.

CODEOWNERS

Two-lead review (@Cre-eD @smecsia) enforced on security-critical paths: .github/, *.Dockerfile, pkg/security/, pkg/api/secrets/, pkg/cmd/cmd_secrets/, pkg/clouds/pulumi/{aws,gcp}/, go.mod, go.sum, tools.go, SECURITY.md, CODEOWNERS itself. Broad default also requires one lead; sensitive paths require both. Pattern ordering follows GitHub's "last match wins" semantics.

.github/workflows/codeql.yml

CodeQL Go static analysis on push to main, every PR, and weekly Sun 04:23 UTC. Uses security-extended query pack (taint, crypto misuse, additional sinks beyond the default pack). Third-party actions SHA-pinned. Permissions least-privilege: root contents: read; job adds security-events: write for SARIF upload to the Security tab. Complements Semgrep (different rule packs find different classes of bugs).

.github/workflows/scorecard.yml

OpenSSF Scorecard on push to main + weekly Wed 09:17 UTC + branch-protection events. Publishes to api.securityscorecards.dev AND uploads SARIF. Advisory only, not gating. Surfaces our score publicly so consumers can audit posture without asking.

Validation

  • go build ./... clean
  • go vet ./... clean
  • go test ./pkg/security/... 23/23 green
  • Semgrep (SC org ruleset) on new workflows + new Go code: 0 findings
  • No callers depend on the old mtime check (grep-verified)
  • Codex review: one P2 issue caught, fixed before this PR opened

Frameworks satisfied

Framework Control
NIST SP 800-218 SSDF PS.1 (code protection — CODEOWNERS), PW.7 (SAST — CodeQL), RV.3 (root cause — HMAC tamper detection enables forensics)
OWASP Top 10 A08 Software & Data Integrity Failures — HMAC over cache entries
OWASP CICD-SEC-10 Insufficient logging — CodeQL SARIF + Scorecard SARIF in Security tab
CIS Software Supply Chain Security §1 (CODEOWNERS, signed-commits prerequisite)
OpenSSF Scorecard Security-Policy ✅, Code-Review ✅, SAST ✅
OpenSSF OSPS Baseline OSP.A (access controls) ✅, OSP.D (docs) ✅, OSP.S (security analysis) ✅, OSP.V (vuln mgmt) ✅
CISA Secure-by-Design Pillar 2 — radical transparency (VDP + public Scorecard)

Test plan

  • go build ./... + go vet ./...
  • go test ./pkg/security/... 23/23
  • Semgrep on new files: 0 findings
  • Codex review (P2 caught + fixed)
  • CI green on this branch (CodeQL run will be the long pole — first execution on a new workflow is ~5-8 min)
  • After merge, verify Scorecard run lands a result at https://scorecard.dev/viewer/?uri=github.com/simple-container-com/api within 24h

Cre-eD added 2 commits May 11, 2026 19:55
Two phase deliverables in one PR — they share zero overlap with the
parallel SCA / SBOM / SAST work happening in this repo.

═══════════════════════════════════════════════════════════════════
Phase 5 — HMAC-SHA256 integrity for pkg/security/cache.go
═══════════════════════════════════════════════════════════════════

The previous tamper-detection heuristic compared `info.ModTime()` to
the recorded `CreatedAt`. mtime is forgeable in one line (`touch -t`),
so the check was decorative rather than load-bearing. Replaced with a
proper HMAC-SHA256 over the canonical JSON of each entry, keyed by a
32-byte random per-cache key persisted at `<baseDir>/.hmac.key` (mode
0o600, generated from `crypto/rand` on first use).

Behavior change matrix
----------------------

| Scenario | Before | After |
|---|---|---|
| Set + Get | hit | hit |
| Set + Get after TTL | miss (correct) | miss (correct) |
| `touch -t` future mtime | miss (false positive — entry was fine) | hit (mtime no longer load-bearing) |
| Bytes flipped in JSON on disk | hit (forgery succeeded) | miss + file removed (HMAC mismatch) |
| Entry written under a different cache key (Clear+NewCache) | hit (forgery succeeded) | miss + file removed |
| Legacy unsigned entry from pre-HMAC code | hit (no integrity) | miss + file removed |

Key management
--------------

`loadOrCreateHMACKey` is race-safe: a missing file triggers
`O_CREATE|O_EXCL`. If two callers race, the loser sees `os.ErrExist`
and recursively re-reads the winner's key. Key length is validated to
catch truncation/corruption — wrong size returns an error rather than
silently rotating.

`Cache.Clean()` skips the key file by name; `Cache.Clear()` removes
the entire baseDir (key gets regenerated on next NewCache, which is
the expected reset semantic).

Tests (6 new, joining the 16 existing — all 22 green):
  - TestCacheHMACKeyPersisted
  - TestCacheHMACKeyWrongSizeRejected
  - TestCacheTamperDetection
  - TestCacheMtimeForgeryNoLongerHelpful
  - TestCacheKeyMismatchRejectsEntry
  - TestCacheLegacyUnsignedEntryDiscarded
  - TestCacheCleanSkipsHMACKey

═══════════════════════════════════════════════════════════════════
Phase 4 — Repo controls
═══════════════════════════════════════════════════════════════════

SECURITY.md
-----------
Vulnerability disclosure policy. Supported-versions matrix, two
reporting channels (GitHub Security Advisory preferred, email
fallback), SLA (3-day ack; 30d HIGH+, 90d MEDIUM, longer LOW), out-
of-scope clauses, and a pointer at the cryptographic primitives.
Closes OSPS Baseline OSP.V, CISA Secure-by-Design pillar 2, Scorecard
"Security-Policy" check.

CODEOWNERS
----------
Two-lead review enforced on security-critical paths: `.github/`,
`*.Dockerfile`, `pkg/security/`, `pkg/api/secrets/`,
`pkg/cmd/cmd_secrets/`, `pkg/clouds/pulumi/{aws,gcp}/`, `go.mod`,
`go.sum`, `tools.go`, `SECURITY.md`, `CODEOWNERS`. Broad default
keeps single-lead review for ordinary code paths. Closes OSPS
Baseline OSP.A, Scorecard "Code-Review", CIS SSCS §1.

.github/workflows/codeql.yml
----------------------------
CodeQL Go static analysis on push to main, every PR, and weekly Sun
04:23 UTC. Uses `security-extended` query pack (taint, crypto
misuse, additional sinks beyond the default pack). All third-party
actions SHA-pinned. Permissions least-privilege: root `contents:read`,
job adds `security-events: write` for SARIF upload. Closes Scorecard
"SAST" check beyond the Semgrep coverage we already have.

.github/workflows/scorecard.yml
-------------------------------
OpenSSF Scorecard runs weekly + on main pushes + branch-protection
events. Publishes results to api.securityscorecards.dev (consumers
can audit our posture without asking) AND uploads SARIF to the
Security tab. Advisory, not gating — score moves on factors outside
any single PR's diff.

═══════════════════════════════════════════════════════════════════
Validation
═══════════════════════════════════════════════════════════════════

- `go build ./...` clean
- `go vet ./...` clean
- `go test ./pkg/security/...` 22 tests, all green
- Semgrep (SC org ruleset) on the new workflows and on cache.go +
  cache_test.go: 0 findings
- No callers depend on the old mtime check (grep-verified)
- Forward compatibility: future fields added to `CacheEntry` will
  break MAC verification on old files (acceptable — cache rebuilds
  on miss); documented in the inline comment

═══════════════════════════════════════════════════════════════════
Frameworks satisfied
═══════════════════════════════════════════════════════════════════

| Framework | Control |
|---|---|
| CIS SSCS | §1 (CODEOWNERS, signed-tags prerequisite) |
| NIST SP 800-218 SSDF | PS.1 (code protection), PW.7 (SAST), RV.3 (root cause) |
| OWASP Top 10 CI/CD | CICD-SEC-10 (logging/visibility — CodeQL SARIF surface) |
| OWASP A08 (Software & Data Integrity Failures) | HMAC integrity |
| OpenSSF Scorecard | Security-Policy, Code-Review, SAST, CodeQL checks |
| OpenSSF OSPS Baseline | OSP.A (access), OSP.D (docs), OSP.S (security analysis), OSP.V (vuln mgmt) |
| CISA Secure-by-Design | Pillar 2 (radical transparency: VDP + Scorecard publishing) |

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
codex review on PR #(this branch) flagged a real gap in Phase 5's
HMAC integrity model:

> When a valid signed cache file is copied to another cache key's path,
> the MAC still verifies because it covers the embedded entry, but
> `Get` never checks `signed.Entry.Key` against the requested key. In
> that scenario a caller receives cached SBOM/scan data for a different
> image/config without knowing the HMAC key.

Practical impact: cross-image / cross-config payload swap. A scan
result for image A could be served when the caller asked for image B,
because the MAC alone proves entry integrity but not (entry, location)
binding.

Fix in two places, defense in depth:

1. `Get(key)` — after MAC verifies, require `signed.Entry.Key == key`.
   Mismatched files are removed and reported as cache miss.
2. `Clean()` — same check expressed as `c.getPath(signed.Entry.Key) ==
   path`. Validly-signed entries parked at the wrong location get
   garbage-collected on the next Clean sweep.

Regression test added: `TestCacheCrossKeyCopyRejected` — Set under keyA,
copy the signed file to keyB's path, assert `Get(keyB)` returns miss
and the misplaced file is removed. Existing keyA entry remains intact.

23/23 tests in pkg/security/ green.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 11, 2026

Semgrep Scan Results

Repository: api | Commit: 164995c

Check Status Details
✅ Semgrep Pass 0 total findings (no error/warning)

Scanned at 2026-05-12 18:05 UTC

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 11, 2026

Security Scan Results

Repository: api | Commit: 164995c

Check Status Details
✅ Secret Scan Pass No secrets detected
⚠️ Dependencies (Trivy) High 1 high, 1 total
⚠️ Dependencies (Grype) High 1 high, 1 total
📦 SBOM Generated 470 components (CycloneDX)

Scanned at 2026-05-12 18:06 UTC

@github-advanced-security
Copy link
Copy Markdown

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

Cre-eD added 5 commits May 12, 2026 21:05
Two reviewers caught real issues. Four addressed:

1. (codex P2) HMAC key creation was racy: O_CREATE|O_EXCL + Write
   leaves a window where another reader can ReadFile a 0-byte file
   and fail with a "corrupted" error, bubbling up as cache init
   failure (= scan/SBOM runs blocked).
   → loadOrCreateHMACKey switched to os.CreateTemp + Sync + Rename.
     Same-directory temp keeps the rename a same-filesystem op
     (handles EXDEV gemini also flagged). New readKeyFile() helper
     retries short reads 5×50ms to absorb concurrent-creation races.

2. (codex P2 / doc) CODEOWNERS comments overstated what GitHub
   enforces. Multiple owners on one pattern = ALTERNATIVES, not
   conjunction. Two-reviewer gating requires a branch-protection
   setting in the admin UI.
   → Header rewritten to document the admin-UI dependency. No
     semantic change to ownership.

3. (gemini P1) Set()'s os.WriteFile is non-atomic — a concurrent
   Clean() can read a partial JSON, fail to unmarshal, and remove
   the file mid-write.
   → SetWithTTL now publishes via CreateTemp + Chmod + Write + Sync
     + Close + Rename in the SAME dir.
   → Clean() now skips .cache.tmp.* and .hmac.key.tmp.* by name so
     in-flight writers' temp files aren't sabotaged.

4. (gemini P2) CodeQL Autobuild relies on the runner's bundled Go;
   our go.mod is 1.25.x, ubuntu-24.04 ships older — Autobuild fails
   at compile and CI red looks like a finding rather than infra.
   → Added explicit actions/setup-go pinned by go.mod via
     `go-version-file: go.mod`. Granted `actions: read` per
     GitHub's recommended CodeQL template (workflow-metadata fetch).

Tests added (now 27 total in pkg/security/, was 25):
  - TestCacheCleanSkipsInFlightTempFiles — Clean leaves .cache.tmp.*
    and .hmac.key.tmp.* alone.
  - TestCacheSetIsAtomic — 256 KiB payload round-trips; no temp
    leftover in the op dir after a successful Set.

Validation
==========

- go build ./... clean
- go vet ./... clean
- go test -short ./pkg/security/... 27/27 green
- Semgrep (github-actions + go rulesets) on changed files: 0 findings

(Note: -count=1 without -short fails 3 _Integration tests due to a
stale local Grype DB unrelated to this PR; CI refreshes that DB.)

Round-1 findings deliberately NOT addressed in this commit
==========================================================

- (gemini P3) Re-marshal HMAC verification is fragile if CacheEntry's
  schema ever changes. Today's Go encoding/json is deterministic for
  our shape; documented inline. Schema-change discipline is the right
  control.
- (gemini P3) Clean() TOCTOU between expiry decision and Remove.
  Cache is rebuildable; the cost of holding a lock outweighs the
  benefit.
- (gemini nit) SECURITY.md PGP key — security@simple-container.com is
  sufficient.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Codex round-2 caught a remaining race in the Rename-based publish:

> When two `NewCache` calls initialize the same empty cache directory
> concurrently, this `Rename` can overwrite the key that another
> process already published. If that first process already reread and
> returned its key, the two live cache instances use different HMAC
> keys; entries written by one are treated as tampered and removed by
> the other.

Trace:
  T0  proc-A: readKeyFile → ErrNotExist → generate keyA → CreateTemp
  T1  proc-A: Rename → publishes keyA → readKeyFile → returns keyA → A
        now holds keyA in memory
  T2  proc-B: (was sleeping at readKeyFile) → ErrNotExist (lost the
        race) → generate keyB → CreateTemp
  T3  proc-B: Rename → OVERWRITES keyA on disk with keyB → readKeyFile
        → returns keyB → B holds keyB in memory
  ⇒ A signs cache entries with keyA, B reads them and fails MAC
    verification → cache thrashes, scans/SBOMs run uncached.

Fix: replace `os.Rename` with `os.Link`. Link is atomic NO-REPLACE on
POSIX: if the destination already exists, it fails with ErrExist
instead of overwriting. On collision, drop our generated key and
adopt the winner's by rereading the file. Hardlinks are universally
supported on Linux (CI) and macOS (dev), and the same-directory
CreateTemp keeps it a same-FS operation.

Regression test added: TestCacheHMACKeyConcurrentInitConverges spawns
8 goroutines that all call NewCache(dir) simultaneously after a
release-the-hounds channel close. Asserts they all return identical
HMAC keys. With the previous Rename-based code this test failed
intermittently; with Link it passes reliably (28/28 in pkg/security/).

Build, vet, short test suite all clean.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
…und-2 P3 + P2)

gemini round-2 caught:

  [P3] The readKeyFile retry loop addresses an impossible state. With
  the Link-based publish writing a fully-fsynced temp first, the
  canonical hmac.key path appears atomically with exactly hmacKeyLen
  bytes — there is no partial-write window at that path. A wrong-size
  read is permanent corruption, not a race to wait through.

  [P2] os.Link rejects hardlinks on some exotic filesystems (VirtualBox
  shared folders, certain SMB mounts). The current code surfaces a
  clear "placing hmac key" error in that case rather than silently
  falling back to racy semantics.

Changes:

- readKeyFile simplified: read once → wrong-size → corruption error.
  The 5×50ms retry was dead code with the Link+sync design and is
  gone. Function is now 13 lines vs 28; behavior unchanged for the
  happy path and the file-absent path.

- Documented the FS-compat constraint inline at the Link site:
  hardlinks are universally available on the filesystems SC actually
  runs on (Linux ext4/overlayFS, macOS APFS, container layered FS,
  NFSv3+). The exotic-FS path surfaces an explicit error — that's
  the right behavior; "fail loud" beats "thrash the cache".

  Not gemini-recommended fallback chains: every non-Link fallback
  primitive (O_CREATE|O_EXCL + Write, Rename, etc.) has at least one
  of the two races we just eliminated. Better to refuse to operate
  on a hostile filesystem than silently re-introduce them.

28/28 tests in pkg/security/ still green. Semgrep clean.

[nit] gemini also noted Get's os.Remove on tampered files doesn't
fsync the parent directory. Acceptable: cache is rebuildable; the
benefit of an extra syscall per tamper-discard doesn't justify the
cost. Not addressed.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
…d-3 P3)

codex round-3 caught a slow leak:

> When a process is killed or the host crashes after CreateTemp but
> before the deferred remove/rename, these files are no longer
> in-flight. Clean now skips every .cache.tmp.* and .hmac.key.tmp.*
> forever, so large SBOM/scan temp payloads can accumulate and Size
> still reports them even though they can never be used.

Real concern for long-running deployments: SBOM payloads are
multi-MB; over months of crashed Set() calls the disk fills up.

Fix: Clean() now distinguishes recent vs stale temp files via mtime.
A legitimate Set/loadOrCreateHMACKey publishes via Link/Rename in
sub-second. Anything older than 1h is unambiguously a crashed writer
whose deferred-remove never ran.

New const `tempFileGracePeriod = 1 * time.Hour` documents the
threshold. Inside Clean's temp-file branch, files past the grace
period are removed; recent ones still get the in-flight pass.

Regression test: TestCacheCleanReclaimsStaleTemps creates one recent
and one back-dated temp, runs Clean, asserts recent stays + stale is
reclaimed.

29/29 tests in pkg/security/ green. Build, vet, Semgrep clean.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
…mini round-3 P2 + P3)

Two findings from gemini round 3:

[P2] os.Link rejects hardlinks on common DEV-environment
filesystems — Docker Desktop bind mounts, VirtualBox vboxsf, some
SMB shares. The previous "fail loud" stance broke `sc` outright for
those users. Production paths (Linux ext4 / overlayFS, macOS APFS,
container layered FS, NFSv3+) all support Link and stay race-free.

Fix: if Link returns one of {EPERM, ENOTSUP, EOPNOTSUPP, EXDEV},
fall back to os.Rename in the same directory. The Rename path
reintroduces the narrow concurrent-NewCache race we eliminated in
round 2 — but only on filesystems that can't support the safer
primitive, and dev usability beats production paranoia we'd never
hit there anyway. Helper `isLinkUnsupported(err)` keeps the test
local. Production paths still take the Link branch and remain
race-free.

[P3] CacheKey.Operation flowed into filepath.Join unchecked. All
in-tree callers pass hardcoded constants ("sbom", "scan-grype",
etc.), but defense-in-depth: a future careless callsite passing
"../../etc" would let Set() write above baseDir.

Fix: getPath now collapses Operation via filepath.Base(filepath.Clean(.))
and explicitly rejects {"", ".", "..", anything containing / or \},
falling back to "_unknown". filepath.Base("..") returns "..", which
my initial draft missed and the new regression test caught.

Regression tests
----------------

- TestCachePathTraversalContained — five hostile Operation values
  ("../../../etc", "..", "/etc/passwd", `..\..\windows`, "a/b").
  Every getPath() result must start with baseDir + separator AND
  must not contain "..". The "_unknown" landing slot is fine; the
  point is containment.

- The Link-fallback path isn't easily unit-testable without
  injecting a mocked filesystem error; the production FS list above
  is documented inline. CI runners are all Linux ext4, which exercises
  the Link path.

29/29 tests in pkg/security/ green; all short-mode security suites
green; Semgrep clean.

Round-3 findings deliberately NOT addressed (repeats from earlier rounds)
-------------------------------------------------------------------------

- [P3] Clean()'s TOCTOU between expiry decision and Remove — gemini
  re-flagged. Cache is rebuildable; cost of locking outweighs benefit.
- [nit] Re-marshal HMAC validation — gemini re-flagged. Today's Go
  json.Marshal is deterministic for our struct shape; schema-change
  discipline is the proper control.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD merged commit 947e3ef into main May 13, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants