From 17f69738a9e6026d4d69394282f7bfc3622da283 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 22:17:42 -0400 Subject: [PATCH 01/13] docs: implementation plan for per-layer ZFS clone chain (issue #4) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- doc/plans/2026-05-01-zfs-g-layer-chain.md | 407 ++++++++++++++++++++++ doc/plans/README.md | 5 +- 2 files changed, 410 insertions(+), 2 deletions(-) create mode 100644 doc/plans/2026-05-01-zfs-g-layer-chain.md diff --git a/doc/plans/2026-05-01-zfs-g-layer-chain.md b/doc/plans/2026-05-01-zfs-g-layer-chain.md new file mode 100644 index 0000000..7c6d88e --- /dev/null +++ b/doc/plans/2026-05-01-zfs-g-layer-chain.md @@ -0,0 +1,407 @@ +# ZFS Backend Plan G: Per-layer ZFS Clone Chain (opt-in) + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** When `ENROOT_STORAGE_BACKEND=zfs` AND `ENROOT_ZFS_LAYER_CHAIN=y`, populate the Docker template cache via a per-layer `zfs clone` chain instead of a single merged extract. Each registry layer becomes its own `/.layers/` dataset, layered as cloned descendants of the layer below. The leaf is then cloned into `/.templates/` to preserve the Plan F template shape. With the flag unset, Plan F's single-merge path runs unchanged. + +**Why:** Plan F's single-merge design re-extracts every layer for every distinct image. For HPC and CI hosts that pull many images sharing a Debian/Alpine/CUDA base, this wastes disk, CPU, and bandwidth. Per-layer chains buy back: + +1. Cross-image layer dedup at the dataset level — two images sharing a base store the base bytes once. (`dedup=on` recovers this in Plan F at ~5 GB RAM per TB; per-layer datasets dedup for free.) +2. Cheap incremental re-pull — when the top layer of a tag changes, only that layer is rebuilt; lower-layer datasets are reused. +3. Layer-granular cache invalidation and inspection (`zfs list -r /.layers` shows the chain). +4. Quota accounting that matches intuition (shared layers count once). +5. Aligns with Docker's own `zfs` storage driver. + +Tradeoff: shell-side whiteout/opaque-dir merging is required because each at-rest layer dataset must contain the merged-up-to-this-layer rootfs (overlayfs only does the merge at mount time). The kernel's overlay engine is *not* the merge engine in this path — `enroot-aufs2ovlfs` already converted whiteouts to overlayfs form during `_prepare_layers`, but we have to apply them ourselves between clone steps. + +**Architecture:** `docker::_prepare_layers` already extracts each registry layer into a per-layer directory (`1/`, `2/`, … `N/`) and runs `enroot-aufs2ovlfs` on each, producing overlayfs-style trees: `mknod 0:0` char devices for whiteouts and `trusted.overlay.opaque=y` xattrs for opaque dirs. With `ENROOT_SET_USER_XATTRS=y` (already set on the load path) we also get a parallel `user.overlay.opaque=y` for unprivileged paths. 
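
As a concrete reference, these markers can be inspected by hand on a prepared layer directory (illustrative commands only; `3/` stands in for one of the extracted layer dirs and is not part of the plan's code):

```sh
# Whiteouts: enroot-aufs2ovlfs replaces AUFS .wh.<name> entries with 0:0 char devices.
find 3/ -type c -exec stat -c '%n %t:%T' {} +
# Opaque dirs: the overlayfs marker xattr sits on the directory itself.
getfattr -R -h --absolute-names -n trusted.overlay.opaque 3/ 2>/dev/null
# Unprivileged twin, written when ENROOT_SET_USER_XATTRS=y (the load path always sets it).
getfattr -R -h --absolute-names -n user.overlay.opaque 3/ 2>/dev/null
```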

For Plan G, after `_prepare_layers` returns, instead of one overlay-mount + tar-pipe into a single template (Plan F), we walk the layer list bottom-up:

```
.layers/<digest-1>       ← zfs create -u                         (apply layer 1 contents)
.layers/<digest-2>       ← zfs clone .layers/<digest-1>@done     (apply layer 2 on top)
.layers/<digest-3>       ← zfs clone .layers/<digest-2>@done     (apply layer 3 on top)
…
.layers/<digest-N>       ← zfs clone .layers/<digest-N-1>@done   (apply layer N on top)
.templates/<cache_key>   ← zfs clone .layers/<digest-N>@done     (clone leaf as template)
```

Each `.layers/<digest>` dataset's `@done` snapshot is reused on subsequent imports of any image whose chain prefix matches. The final `.templates/<cache_key>@pristine` snapshot is identical in shape to Plan F's, so `zfs::clone_container`, the pointer-format import path, eviction recovery, `enroot export`, and `enroot import zfs://` all work unchanged.

**Why no `zfs promote`:** Promoting the leaf into the templates dataset inverts the chain — layer datasets become clones of the template, which then owns the data. That works for one image but produces a complex, image-private topology that defeats the whole point of cross-image sharing. Plan G keeps layer datasets as immutable origins and templates as ordinary clones. The simple invariant: *layers are shared and never mutated; templates are per-image clones; ZFS refuses to destroy a layer dataset while any descendant clone exists, so layer GC is automatic.*

**Coexistence with Plan F:**

- `ENROOT_ZFS_LAYER_CHAIN=` (unset, empty, or anything but `y`): Plan F's `_install_template_from_layers` runs unchanged. Default behavior preserved byte-for-byte.
- `ENROOT_ZFS_LAYER_CHAIN=y`: dispatch to chain mode. Same dispatch is hit from `docker::load` and from `_pull_and_install_template` (the puller used by pointer-format import and eviction recovery), so all callers see chain-mode templates when the flag is on.
- The fast path "template `@pristine` already exists, reuse it" is hit *before* the chain/no-chain dispatch. Templates produced under one mode are reused under the other without rebuild — only the *fill* mechanism differs on miss.

**Depends on:** Plans A, B, F (template lifecycle, sweep, ENOSPC retry shape are reused).

**Prerequisite host setup:** Same as Plan F. ZFS user delegation must include `clone`. `promote` is **not** required (Plan G doesn't promote). Whiteout/xattr work runs inside `enroot-nsenter --user --remap-root --mount`, same as Plan F's merge step.

**Test images:** `docker://alpine` (1 layer, smoke); `docker://debian:stable-slim` (multi-layer with whiteouts); `docker://python:3.12-slim` and `docker://node:20-slim` (debian-bookworm-based — base layer must be physically shared).

---

## Files

- **Modify:** `src/storage_zfs.sh` — add `zfs::layer_chain_active`, `zfs::_install_layer_chain`, `zfs::_apply_layer_payload`, dispatch in `zfs::docker_install_from_layers`.
- **Modify:** `src/docker.sh` (`docker::_prepare_layers`, `docker::load`) — make the ordered layer-digest list available to chain-mode callers; thread it into `zfs::docker_install_from_layers`.
- **Modify:** `doc/zfs.md` — document `ENROOT_ZFS_LAYER_CHAIN`, the `.layers/` namespace, dedup semantics, GC notes.
- **Modify:** `CLAUDE.md` — flip the active-design-proposals line.

`docker::configure`, `docker::_download`, the existing dir-backend overlay path, Plan F's `_install_template_from_layers`, the pointer-format paths, and `zfs::clone_container` are **not** modified.

---

### Task 1: Add `zfs::layer_chain_active` predicate

A small gate that callers use before opting into chain mode.
+ +**Files:** +- Modify: `src/storage_zfs.sh` (append, near `zfs::pointer_format_active`) + +- [ ] **Step 1.1: Add the predicate** + +```bash +# Returns 0 iff the ZFS backend is active AND ENROOT_ZFS_LAYER_CHAIN=y. +# Callers gate the per-layer-clone-chain template-fill path on this. +# Default-off; the unset / "" / "n" cases all fall through to Plan F's +# single-merge path, preserving byte-for-byte behavior. +zfs::layer_chain_active() { + zfs::enabled || return 1 + [ "${ENROOT_ZFS_LAYER_CHAIN-}" = "y" ] +} +``` + +- [ ] **Step 1.2: Commit** + +```sh +git add src/storage_zfs.sh +git commit -s -m "storage_zfs: add zfs::layer_chain_active predicate" +``` + +--- + +### Task 2: Plumb layer digests out of `docker::_prepare_layers` + +Plan G needs the ordered list of registry layer content-digests as cache keys for `/.layers/`. Today `_prepare_layers` only emits `config\nlayer_count\n` on stdout. Existing callers (Plan F's `_install_template_from_layers`, `docker::import`, `docker::load`'s dir branch) read exactly two lines via `common::read`; they must keep working unchanged. + +The least-invasive plumbing: a new function `docker::_load_layer_digests` that re-runs the *cheap* parts of `_download` (parse the manifest, list digests) without re-extracting. The layers are already cached at `${ENROOT_CACHE_PATH}/` from `_prepare_layers`'s pull, so we don't pay any registry roundtrip — the manifest is also already cached in the token dir. The chain installer calls this helper to get the ordered digest list. + +Alternative considered and rejected: extending `_prepare_layers` stdout with extra lines after the existing two. Existing callers consume the pipe with `{ common::read; common::read; }` which leaves trailing lines unread, triggering SIGPIPE on the producer's printf and a non-zero exit under `pipefail`. Adding a side-channel fd works but complicates calling conventions. A standalone helper that re-parses the cached manifest is simplest and contained. + +**Files:** +- Modify: `src/docker.sh` + +- [ ] **Step 2.1: Add `docker::_load_layer_digests`** + +Append after `docker::_prepare_layers` in `src/docker.sh`. The helper takes the same five inputs (`user registry image tag arch`) and prints one digest per line, in stack order (base first, top last). Implementation reuses `docker::_download`'s manifest-parsing path or directly reads the cached manifest. + +- [ ] **Step 2.2: Commit** + +```sh +git add src/docker.sh +git commit -s -m "docker: add _load_layer_digests helper for chain-mode callers" +``` + +--- + +### Task 3: Add `zfs::_apply_layer_payload` + +The bash payload that runs inside `enroot-nsenter --user --remap-root --mount` to apply one layer dir on top of one target dir. Returns a string suitable for `bash -c "${payload}"` from the chain installer. + +Three phases: +1. **Opaque-dir clearing.** Walk the layer's directories; for each with `trusted.overlay.opaque=y` xattr, `rm -rf` the children (not the dir itself) of the corresponding dir in the target. +2. **Whiteout deletion.** For each char device 0:0 in the layer, `rm -rf` the corresponding path in the target. +3. **Content tar-pipe.** Tar the layer's contents into the target with `--xattrs --xattrs-include='*' --acls`, excluding char devices via an exclude list built from phase 2. + +Why a payload string: the chain installer launches one `enroot-nsenter` per layer (or batches them), and the inside-userns work is straightforward bash. 
Keeping it as a single payload string avoids per-layer fork overhead beyond the necessary `enroot-nsenter` wrapping.

**Files:**
- Modify: `src/storage_zfs.sh`

- [ ] **Step 3.1: Add the payload generator**

```bash
# Generates the bash payload that applies one layer dir's whiteouts and
# contents on top of one target dir. Designed to be passed to
# `enroot-nsenter --user --remap-root --mount bash -c`.
#
# Pre-conditions on inputs (caller responsibility):
# - layer_dir was extracted by docker::_prepare_layers and processed by
#   enroot-aufs2ovlfs, so whiteouts are mknod 0:0 char devices and
#   opaque dirs carry trusted.overlay.opaque=y (and user.overlay.opaque=y
#   when ENROOT_SET_USER_XATTRS=y was set, which the load path always does).
# - target_dir already contains the merged contents of all layers below
#   this one.
# - Both paths are absolute and well-formed (no embedded quotes/spaces in
#   the digest-keyed dataset paths the chain installer produces).
zfs::_apply_layer_payload() {
    local -r layer_dir="$1" target_dir="$2"
    cat <<PAYLOAD
set -e
cd '${layer_dir}'
# Phase 1: opaque-dir clearing — trusted.overlay.opaque=y on a layer dir
# means "clear the matching target dir before applying this layer".
getfattr -R -h --absolute-names -n trusted.overlay.opaque . 2>/dev/null \\
    | awk -F': ' '/^# file:/ { print substr(\$0, 9) }' \\
    | while IFS= read -r d; do
        rel=\${d#./}
        [ "\${rel}" = "." ] || [ -z "\${rel}" ] && rel=""
        find '${target_dir}'/"\${rel}" -mindepth 1 -maxdepth 1 -exec rm -rf {} + 2>/dev/null || :
      done
# Phase 2: whiteout deletion — char device 0:0 in layer means "delete in target".
find . -type c | while IFS= read -r wh; do
    rm -rf '${target_dir}'/"\${wh#./}"
done
# Phase 3: copy non-whiteout contents over.
find . -type c -printf '%P\\n' > /tmp/excludes.\$\$
tar -C . --exclude-from=/tmp/excludes.\$\$ --xattrs --xattrs-include='*' --acls -cpf - . \\
    | tar -C '${target_dir}' --xattrs --xattrs-include='*' --acls -xpf -
rm -f /tmp/excludes.\$\$
PAYLOAD
}
```

- [ ] **Step 3.2: Commit**

```sh
git add src/storage_zfs.sh
git commit -s -m "storage_zfs: add _apply_layer_payload generator"
```

---

### Task 4: Add `zfs::_install_layer_chain`

The full chain-build-and-template-install lifecycle. Designed to be a drop-in replacement for `zfs::_install_template_from_layers` when chain mode is active. Same input contract (`cache_key` = image-config-sha256, `layer_count`, `unpriv`); also takes the layer-digest list as an array. Same output contract: prints the template dataset path on stdout.

Chain build (idempotent per-layer):

1. For i = 1..N:
   - If `<store>/.layers/<digest_i>@done` exists, reuse — go to next layer.
   - Else: race-safe create. Try `zfs create -u <store>/.layers/<digest_1>.tmp` (i=1) or `zfs clone -o canmount=noauto <store>/.layers/<digest_i-1>@done <store>/.layers/<digest_i>.tmp` (i≥2). On EEXIST, wait for `<store>/.layers/<digest_i>@done` (timeout 600s).
   - Mount the `.tmp` via `enroot-zfs-mount`. Run `enroot-nsenter --user --remap-root --mount bash -c "$(zfs::_apply_layer_payload layer_dir mountpoint)"`. On failure, mirror Plan B's ENOSPC retry (sweep warm templates, retry once; on second failure destroy `.tmp` and abort).
   - Unmount, `zfs rename .tmp → final`, snapshot `@done`, `set readonly=on`, set `enroot:layer-digest=<digest_i>` and `enroot:imported`.

Template install (matches Plan F shape):

2. `zfs clone -o canmount=noauto <store>/.layers/<digest_N>@done <store>/.templates/<cache_key>` (with the standard `.tmp`-then-rename race protection, identical to Plan F).
3. Snapshot `@pristine`, `set readonly=on`, stamp metadata. Done — caller (or `clone_container`) takes it from here.

**Files:**
- Modify: `src/storage_zfs.sh`

- [ ] **Step 4.1: Add chain installer**

Append after `zfs::_install_template_from_dir` in `src/storage_zfs.sh`.
Inputs: + +``` +$1 cache_key - image-config-sha256 +$2 layer_count - the N from _prepare_layers +$3 unpriv - "y" or "" — passed through to enroot-nsenter +$4..$(3+N) - layer digests in stack order, base first +``` + +Print the resulting template dataset path on stdout (no trailing newline). Sweeps templates and runs the layer-apply ENOSPC retry on each layer. + +- [ ] **Step 4.2: Commit** + +```sh +git add src/storage_zfs.sh +git commit -s -m "storage_zfs: add _install_layer_chain" +``` + +--- + +### Task 5: Dispatch chain mode in `docker_install_from_layers` and `_pull_and_install_template` + +`docker::load` and the pointer-import / eviction-recovery paths both go through the install helpers. Both need to opt into chain mode when active. + +**Files:** +- Modify: `src/storage_zfs.sh` + +- [ ] **Step 5.1: Make `docker_install_from_layers` chain-mode-aware** + +Currently: + +```bash +zfs::docker_install_from_layers() { + local -r cache_key="$1" layer_count="$2" unpriv="$3" name="$4" + local template + template=$(zfs::_install_template_from_layers "${cache_key}" "${layer_count}" "${unpriv}") + zfs::clone_container "${template}" "${name}" +} +``` + +Add an optional 5th-onwards argument: layer digests (variadic). When `zfs::layer_chain_active`, route through `_install_layer_chain` with the digest list; otherwise fall back to `_install_template_from_layers`. The dispatch falls back gracefully if the caller didn't pass digests (e.g. older internal callers): chain mode silently degrades to single-merge. + +```bash +zfs::docker_install_from_layers() { + local -r cache_key="$1" layer_count="$2" unpriv="$3" name="$4" + shift 4 + local template + if zfs::layer_chain_active && [ "$#" -ge 1 ]; then + template=$(zfs::_install_layer_chain "${cache_key}" "${layer_count}" "${unpriv}" "$@") + else + template=$(zfs::_install_template_from_layers "${cache_key}" "${layer_count}" "${unpriv}") + fi + zfs::clone_container "${template}" "${name}" +} +``` + +- [ ] **Step 5.2: Pass layer digests from `docker::load`** + +In `src/docker.sh` `docker::load`'s ZFS branch (the `if zfs::enabled` block currently calling `zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}"`), prepend a digest-list capture under chain mode and append it to the call: + +```bash +if zfs::enabled; then + if zfs::layer_chain_active; then + local layer_digests=() + readarray -t layer_digests < <(docker::_load_layer_digests "${user}" "${registry}" "${image}" "${tag}" "${arch}") + zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" "${layer_digests[@]}" + else + zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" + fi +else + # existing dir-backend overlay-mount + tar-pipe … +fi +``` + +- [ ] **Step 5.3: Pass layer digests from `zfs::_pull_and_install_template`** + +In `src/storage_zfs.sh`, the puller already calls `zfs::_install_template_from_layers` directly (it bypasses `docker_install_from_layers` because it doesn't clone — only fills the cache for the pointer-import / eviction-recovery flow). 
Mirror the dispatch: + +```bash +if zfs::layer_chain_active; then + local layer_digests=() + readarray -t layer_digests < <(docker::_load_layer_digests "${user}" "${registry}" "${image}" "${tag}" "${arch}") + zfs::_install_layer_chain "${config}" "${layer_count}" "${unpriv}" "${layer_digests[@]}" > /dev/null +else + zfs::_install_template_from_layers "${config}" "${layer_count}" "${unpriv}" > /dev/null +fi +``` + +- [ ] **Step 5.4: Commit** + +```sh +git add src/storage_zfs.sh src/docker.sh +git commit -s -m "storage_zfs: dispatch chain mode in docker_install_from_layers and pull path" +``` + +--- + +### Task 6: Verify on smoke-test cluster + +The compute nodes already share `/var/lib/enroot` over a delegated ZFS pool. Build a `.deb` locally, push to `spark-f2ff`, run the smoke checks below, then revert per CLAUDE.md. + +- [ ] **Step 6.1: Single-layer alpine, chain mode** + +```sh +sudo systemd-run --user --pty --setenv=ENROOT_ZFS_LAYER_CHAIN=y \ + enroot import -o /tmp/a.sqsh docker://alpine +sudo enroot create -n a /tmp/a.sqsh +sudo enroot start a cat /etc/os-release | head -1 +sudo zfs list -r tank/enroot/data/.layers # should show one layer dataset +sudo enroot remove -f a; rm -f /tmp/a.sqsh +``` + +Expected: load + start succeed; `.layers/` shows one dataset with the layer's digest. + +- [ ] **Step 6.2: Multi-layer debian, whiteouts** + +```sh +sudo ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/d.sqsh docker://debian:stable-slim +sudo enroot create -n d /tmp/d.sqsh +sudo enroot start d cat /etc/os-release | grep PRETTY +sudo find /var/lib/enroot/d -name '.wh.*' | head -3 # must be empty +sudo enroot remove -f d; rm -f /tmp/d.sqsh +``` + +Expected: container starts; no AUFS whiteouts leak through (the conversion is intact); chain has multiple layer datasets. + +- [ ] **Step 6.3: Cross-image base-layer dedup** + +```sh +sudo ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/p.sqsh docker://python:3.12-slim +sudo ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/n.sqsh docker://node:20-slim +sudo zfs list -r tank/enroot/data/.layers -o name,used,referenced +``` + +Expected: the base bookworm layer appears once; `python:3-slim` and `node:20-slim` chains share that dataset (visible in `zfs list -t all` as multiple clones of the same `@done`). Block-level sharing visible via `referenced` ≫ `used` on the shared dataset. + +- [ ] **Step 6.4: Plan F regression — flag unset** + +```sh +sudo enroot import -o /tmp/u.sqsh docker://ubuntu:24.04 # ENROOT_ZFS_LAYER_CHAIN unset +sudo enroot create -n u /tmp/u.sqsh +sudo zfs list -r tank/enroot/data/.layers # must NOT have created new datasets here +sudo enroot remove -f u; rm -f /tmp/u.sqsh +``` + +Expected: Plan F's single-merge behavior; `.layers/` either doesn't exist or is unchanged from prior chain-mode runs. + +- [ ] **Step 6.5: Pyxis end-to-end with chain mode** + +```sh +ssh spark-f2ff 'sudo zfs destroy -r tank/enroot/data/.layers 2>/dev/null || :; \ + sudo zfs destroy -r tank/enroot/data/.templates 2>/dev/null || :' +ENROOT_ZFS_LAYER_CHAIN=y srun -N1 -w spark-f2ff --container-image=docker://debian:stable-slim cat /etc/os-release +ENROOT_ZFS_LAYER_CHAIN=y srun -N1 -w spark-f2ff --container-image=docker://debian:stable-slim hostname +``` + +Expected: first invocation pays the layer-extract cost once; second is sub-second (template-cache hit). + +--- + +### Task 7: Documentation and PR + +**Files:** +- Modify: `doc/zfs.md` — add an `ENROOT_ZFS_LAYER_CHAIN` section under tunables; flip status note to mention Plan G. 
+- Modify: `CLAUDE.md` — update active-design-proposals line. + +- [ ] **Step 7.1: Document and commit** + +```sh +git add doc/zfs.md CLAUDE.md +git commit -s -m "Mark Plan G (per-layer ZFS clone chain) as implemented" +git push -u origin feature/zfs-g-layer-chain +gh pr create --repo zeroae/enroot --base zenroot/main --head feature/zfs-g-layer-chain \ + --title "Plan G: per-layer ZFS clone chain for enroot load docker:// (opt-in)" \ + --body "Closes #4." +``` + +--- + +## Self-review checklist + +- [ ] Default-off: `ENROOT_ZFS_LAYER_CHAIN=` (unset/empty) leaves Plan F's `_install_template_from_layers` path unchanged. Verified at T6.4. +- [ ] Final template shape matches Plan F's (`/.templates/@pristine`, readonly, metadata-stamped), so `zfs::clone_container`, pointer-format import, eviction recovery, `enroot export`, and `enroot import zfs://` all keep working. +- [ ] Chain installer covers both load (T5.2) and pull (T5.3) entry points, so chain mode applies to direct `enroot create docker://` AND to the pointer-import / eviction-recovery paths from #13/#14. +- [ ] Whiteouts and opaque dirs handled (T3 phases 1+2). `enroot-aufs2ovlfs`'s overlayfs output is the input to phase 1/2, so AUFS edge cases that aufs2ovlfs already rejects (`.wh..wh.foo`) stay rejected. +- [ ] Race-safe per-layer via `.tmp` lock (T4). Concurrent imports of different images sharing a layer collapse onto the same dataset; loser waits for `@done`. +- [ ] ENOSPC retry mirrors Plan B's pattern (T4): sweep warm templates, retry once, abort with `.tmp` cleanup on second failure. +- [ ] Layer datasets are immortal until manually swept; ZFS refuses `zfs destroy ` while any descendant clone exists, so layers are GC-protected for free as long as any template references them. + +## Known limitations + +- **No automated `.layers/` GC.** When the last template referencing a base layer is evicted, the layer dataset survives. ZFS will refuse to destroy it while clones exist; once it's standalone, an admin can `zfs destroy ` manually. A follow-up plan can extend Plan B's `eviction_candidates` to layers (same shape: layer is evictable iff it has no clones). +- **No promote.** Layer datasets are clones-of-clones; the deepest leaf chain has N+1 levels of indirection. ZFS handles this fine performance-wise (snapshots are flat at the block layer), but `zfs list -t all` shows the chain. +- **No cross-host layer replication.** `zfs send` per-layer would be a sensible follow-up but is out of scope here. +- **No migration tool** between Plan F single-merge and Plan G chain caches. Switching the flag mid-life is transparent to users (existing templates remain valid) but the on-disk shape diverges. +- **Whiteout-replay is shell.** `getfattr -R` + `find -type c` + `tar`. Slower than the kernel's overlay engine (which Plan F uses) on a per-layer basis, but the work scales with layer size, not image count, and is paid once per unique layer across all images that use it. + +## Out of scope + +- Replacing Plan F's single-merge path. Plan G is purely additive. +- Cross-host layer replication via `zfs send`. +- Migration tooling between merged-template and per-layer-chain caches. +- Automated layer-dataset GC (manual `zfs destroy` works today). + +## Execution Handoff + +Same options as Plan A. diff --git a/doc/plans/README.md b/doc/plans/README.md index 58b11f7..58d3095 100644 --- a/doc/plans/README.md +++ b/doc/plans/README.md @@ -10,15 +10,16 @@ Plans for landing the optional ZFS storage backend designed in [`../zfs.md`](../ | D. 
`zfs://` URI transport — `enroot load zfs://host/NAME`, `enroot export NAME zfs://host` | [2026-04-29-zfs-d-zfs-uri.md](2026-04-29-zfs-d-zfs-uri.md) | A, C | | E. Ephemeral start ZFS path — substitute `squashfuse + overlay` with throwaway clone | [2026-04-29-zfs-e-ephemeral-start.md](2026-04-29-zfs-e-ephemeral-start.md) | A | | F. Docker layer-stack ZFS path — lift `ENROOT_NATIVE_OVERLAYFS=y` requirement on ZFS hosts | [2026-04-29-zfs-f-docker-load.md](2026-04-29-zfs-f-docker-load.md) | A | +| G. Per-layer ZFS clone chain (opt-in `ENROOT_ZFS_LAYER_CHAIN=y`) — cross-image layer dedup at the dataset level | [2026-05-01-zfs-g-layer-chain.md](2026-05-01-zfs-g-layer-chain.md) | F | ``` A ─┬─> B ├─> C ─> D ├─> E - └─> F + └─> F ─> G ``` -Recommended landing order: **A → E → F → B → C → D**. A is the foundation; E/F give the most user-visible wins next; B improves cache economics; C/D add transport options. +Recommended landing order: **A → E → F → B → C → D → G**. A is the foundation; E/F give the most user-visible wins next; B improves cache economics; C/D add transport options; G is an opt-in optimization on top of F. ## Conventions used by these plans From 3e2ae2d2e1c3249e1030f90eb7efae62fe1caca3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 22:17:57 -0400 Subject: [PATCH 02/13] storage_zfs: add zfs::layer_chain_active predicate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- src/storage_zfs.sh | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/src/storage_zfs.sh b/src/storage_zfs.sh index 01a327a..ff076e2 100644 --- a/src/storage_zfs.sh +++ b/src/storage_zfs.sh @@ -164,6 +164,15 @@ zfs::image_sha256() { # 19 bytes (`enroot-zfs-image:v1`) and dispatches to the pointer path. readonly zfs_pointer_magic="enroot-zfs-image:v1" +# Returns 0 iff the ZFS backend is active AND ENROOT_ZFS_LAYER_CHAIN=y. +# Callers gate the per-layer-clone-chain template-fill path on this. The +# default-off behavior (unset / "" / anything but "y") preserves Plan F's +# single-merge path byte-for-byte. +zfs::layer_chain_active() { + zfs::enabled || return 1 + [ "${ENROOT_ZFS_LAYER_CHAIN-}" = "y" ] +} + # Returns 0 if the ZFS backend is active AND ENROOT_ZFS_IMPORT_FORMAT is # unset or set to "pointer". Returns 1 otherwise (e.g. "squashfs" opt-out # or dir backend). Callers gate the new pointer-import path on this. From 96dec1734c5440a811655c816c5199c2e4d4254f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 22:19:45 -0400 Subject: [PATCH 03/13] docker: side-emit layer digests to ./.layers in _prepare_layers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- src/docker.sh | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/docker.sh b/src/docker.sh index 06fae9b..ff362f8 100644 --- a/src/docker.sh +++ b/src/docker.sh @@ -326,6 +326,13 @@ docker::_prepare_layers() ( zstd -q -d -o config "${ENROOT_CACHE_PATH}/${config}" docker::configure "${PWD}/0" config "${arch}" + # Side-emit the ordered layer-digest list to ./.layers (one per line, base + # first, top last). The ZFS chain-mode path (Plan G) reads this back to + # build the per-layer dataset chain. Plan F and dir-backend callers ignore + # the file; it lives in the caller's per-call mktmpdir so it gets cleaned + # up alongside the rest of the extraction temp dir. 
+ printf "%s\n" "${layers[@]}" > .layers + printf "%s\n%s\n" "${config}" "${#layers[@]}" ) From 7e41551c4737e4227e1601f7cc5faccc636b5937 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 22:22:14 -0400 Subject: [PATCH 04/13] storage_zfs: add _apply_layer_payload generator and layer-chain constants MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- src/storage_zfs.sh | 55 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/src/storage_zfs.sh b/src/storage_zfs.sh index ff076e2..f8d4ab8 100644 --- a/src/storage_zfs.sh +++ b/src/storage_zfs.sh @@ -19,6 +19,8 @@ source "${ENROOT_LIBRARY_PATH}/common.sh" readonly zfs_template_subdir=".templates" readonly zfs_pristine_snap="pristine" readonly zfs_ephemeral_subdir=".ephemeral" +readonly zfs_layers_subdir=".layers" +readonly zfs_layer_done_snap="done" # Returns 0 if the ZFS storage backend is configured, 1 otherwise. zfs::enabled() { @@ -737,6 +739,59 @@ zfs::send_stream() { fi } +# Generates the bash payload that applies one already-extracted layer +# directory (post enroot-aufs2ovlfs, so whiteouts are mknod 0:0 char +# devices and opaque dirs carry trusted.overlay.opaque=y) on top of an +# already-merged target directory. Designed to be passed to +# `enroot-nsenter --user --remap-root --mount bash -c`. +# +# Two placeholders @@LAYER@@ / @@TARGET@@ are sed-substituted at +# generation time; both paths come from ZFS dataset mountpoints whose +# names derive from regex-validated digests + ENROOT_DATA_PATH, so they +# can't contain shell metacharacters. The payload itself uses single +# quotes around the substituted paths and double quotes around the +# loop-local `${var}` interpolations so a path containing whitespace +# (rare but legal in mountpoints) does not break the apply. +zfs::_apply_layer_payload() { + local -r layer_dir="$1" target_dir="$2" + sed -e "s#@@LAYER@@#${layer_dir}#g" -e "s#@@TARGET@@#${target_dir}#g" <<'PAYLOAD' +set -euo pipefail +mount --make-rprivate / +cd '@@LAYER@@' + +# Phase 1: opaque-dir clearing. trusted.overlay.opaque=y on a layer dir +# means "ignore everything from the parent in this dir"; we replicate +# that by clearing the corresponding target dir's children before +# layering this layer's contents on top. +getfattr -R -h --absolute-names -n trusted.overlay.opaque . 2>/dev/null \ + | awk -F': ' 'sub(/^# file: /, "")' \ + | while IFS= read -r d; do + rel="${d#./}" + find '@@TARGET@@/'"${rel}" -mindepth 1 -maxdepth 1 -exec rm -rf {} + 2>/dev/null || : + done + +# Phase 2: whiteout deletion. Each char-device 0:0 in the layer encodes +# "this path is removed in this layer". Be defensive — only treat 0:0 +# devices as whiteouts; any non-0:0 char dev (legitimate but unusual) +# is left for phase 3 to copy forward. +find . -type c | while IFS= read -r wh; do + [ "$(stat -c '%t-%T' "${wh}" 2>/dev/null)" = "0-0" ] || continue + rm -rf '@@TARGET@@/'"${wh#./}" +done + +# Phase 3: tar-pipe non-whiteout contents into the target. xattrs +# (overlayfs opaque markers, capability bits, SELinux labels) and ACLs +# are preserved. Char devices are excluded — both the 0:0 whiteouts we +# already actioned in phase 2 and any other char devs (which would not +# be expected in Docker images post extraction). +find . -type c -printf '%P\n' > /tmp/.enroot-excludes.$$ +tar -C . --exclude-from=/tmp/.enroot-excludes.$$ \ + --xattrs --xattrs-include='*' --acls -cpf - . 
\ + | tar -C '@@TARGET@@' --xattrs --xattrs-include='*' --acls -xpf - +rm -f /tmp/.enroot-excludes.$$ +PAYLOAD +} + # Materializes the merged Docker rootfs into a ZFS template (cached by # cache_key). Designed to be called from docker::load (or the pointer-import # flow) AFTER docker::_prepare_layers has populated the cwd with extracted, From 6e9faa99c70e6d473430fbc66fe994598a9b24c6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 22:23:39 -0400 Subject: [PATCH 05/13] storage_zfs: add _build_layer and _install_layer_chain MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- src/storage_zfs.sh | 175 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 175 insertions(+) diff --git a/src/storage_zfs.sh b/src/storage_zfs.sh index f8d4ab8..49f8009 100644 --- a/src/storage_zfs.sh +++ b/src/storage_zfs.sh @@ -941,6 +941,181 @@ zfs::_install_template_from_dir() { printf "%s" "${template}" } +# Builds one layer dataset on top of prev_layer (or as a base if prev_layer +# is empty). Idempotent: if /.layers/@done already exists, no +# work is done. Race-safe via a per-digest .tmp dataset lock; losers wait +# for @done. ENOSPC during apply triggers a single warm-template-eviction +# retry, mirroring Plan B's pattern. +# +# Inputs: +# $1 digest - the layer's content digest (cache key under .layers/) +# $2 prev_layer - parent dataset name (empty for the base layer) +# $3 layer_dir - extracted-layer directory in cwd (1, 2, ..., N from +# docker::_prepare_layers' parallel extraction step) +# $4 unpriv - "y" or "" — passed through to enroot-nsenter +zfs::_build_layer() { + local -r digest="$1" prev_layer="$2" layer_dir="$3" unpriv="$4" + local store layer tmp snap mountpoint payload + local create_ok= i=0 + + store=$(zfs::store_dataset) + layer="${store}/${zfs_layers_subdir}/${digest}" + tmp="${layer}.tmp" + snap="${layer}@${zfs_layer_done_snap}" + + # Cache hit: already built. + if zfs list -H -t snapshot "${snap}" > /dev/null 2>&1; then + return + fi + + # Try to win the lock. Base layers create-from-scratch; non-base layers + # clone the previous layer's @done. canmount=noauto avoids ZFS auto-mount + # (which would need CAP_SYS_ADMIN) — we mount via enroot-zfs-mount below. + if [ -z "${prev_layer}" ]; then + zfs create -u "${tmp}" 2> /dev/null && create_ok=y + else + zfs clone -o canmount=noauto "${prev_layer}@${zfs_layer_done_snap}" "${tmp}" 2> /dev/null && create_ok=y + fi + + if [ -z "${create_ok}" ]; then + # Lost the race or stale .tmp. Wait briefly for another writer to + # finalize @done; on timeout, surface for manual cleanup. + while ! zfs list -H -t snapshot "${snap}" > /dev/null 2>&1; do + sleep 1 + ((i++ < 600)) || common::err "Timed out waiting for layer ${digest:0:12} (stale ${tmp}?)" + done + return + fi + + # Clones inherit readonly=on from the parent's snapshot; we need to write + # the layer's contents into the .tmp dataset before snapshotting, so flip + # it back off here. This is unprivileged-safe: 'zfs allow' includes the + # readonly property in the standard delegation set. + zfs set readonly=off "${tmp}" 2> /dev/null || : + if ! enroot-zfs-mount "${tmp}" 2> /dev/null; then + zfs destroy "${tmp}" 2> /dev/null || : + common::err "failed to mount layer ${digest:0:12}" + fi + mountpoint=$(zfs get -H -o value mountpoint "${tmp}") + common::log INFO "Building layer ${digest:0:12}..." + + payload=$(zfs::_apply_layer_payload "${PWD}/${layer_dir}" "${mountpoint}") + if ! 
enroot-nsenter ${unpriv:+--user} --mount --remap-root bash -c "${payload}"; then + common::log WARN "Layer apply failed; evicting all warm templates and retrying" + ENROOT_TEMPLATE_WARM_SECONDS=0 zfs::sweep_templates + enroot-nsenter ${unpriv:+--user} --mount --remap-root bash -c "${payload}" \ + || { zfs destroy -r "${tmp}" 2> /dev/null || :; \ + common::err "Failed to apply layer ${digest:0:12} even after evicting warm templates"; } + fi + + enroot-zfs-mount --unmount "${tmp}" 2> /dev/null || : + zfs rename "${tmp}" "${layer}" + enroot-zfs-mount "${layer}" 2> /dev/null || : + zfs snapshot "${snap}" + zfs set readonly=on "${layer}" 2> /dev/null || : + zfs set "enroot:layer-digest=${digest}" "${layer}" 2> /dev/null || : + zfs set "enroot:imported=$(date -u +%FT%TZ)" "${layer}" 2> /dev/null || : + enroot-zfs-mount --unmount "${layer}" 2> /dev/null || : +} + +# Materializes the merged Docker rootfs into a ZFS template (cached by +# cache_key) by building a per-layer clone chain under /.layers/. +# Drop-in replacement for _install_template_from_layers when chain mode +# (ENROOT_ZFS_LAYER_CHAIN=y) is active. Designed to be called from +# docker::load (or _pull_and_install_template) AFTER docker::_prepare_layers +# has populated the cwd with extracted, whiteout-converted layer +# directories 0/, 1/, ..., N/ and written the digest list to ./.layers. +# +# The leaf of the layer chain is cloned into /.templates/ +# with @pristine snapshot, so the resulting template is shape-compatible +# with Plan F templates: clone_container, the pointer-format flow, eviction +# recovery, and zfs:// transport all work unchanged. +# +# Inputs: +# $1 cache_key - sha256 of the image config blob +# $2 layer_count - the N from _prepare_layers +# $3 unpriv - "y" or "" — passed through to enroot-nsenter +# $4..$(3+N) - layer digests in stack order, base first, top last +# +# Outputs: prints the template dataset path on stdout (no trailing newline). +# +# Atomicity: per-layer races resolved via .tmp dataset locks +# (see _build_layer); the final template is created via the same .tmp +# pattern as Plan F's _install_template_from_layers, so concurrent +# imports of the same image collapse onto one builder. +zfs::_install_layer_chain() { + local -r cache_key="$1" layer_count="$2" unpriv="$3" + shift 3 + local -a digests=("$@") + local store template tmp snap prev_layer leaf_layer + local i wait_i=0 + + if [ "${#digests[@]}" -ne "${layer_count}" ]; then + common::err "_install_layer_chain: digest count (${#digests[@]}) != layer_count (${layer_count})" + fi + + store=$(zfs::store_dataset) + template="${store}/${zfs_template_subdir}/${cache_key}" + tmp="${template}.tmp" + snap="${template}@${zfs_pristine_snap}" + + zfs::sweep_templates + + # Ensure parent containers exist without auto-mounting them (mount(2) + # needs CAP_SYS_ADMIN; the helper below applies it via the +caps file + # capability). + zfs create -u "${store}/${zfs_template_subdir}" 2> /dev/null || : + enroot-zfs-mount "${store}/${zfs_template_subdir}" 2> /dev/null || : + zfs create -u "${store}/${zfs_layers_subdir}" 2> /dev/null || : + enroot-zfs-mount "${store}/${zfs_layers_subdir}" 2> /dev/null || : + + # Fast path: template already cached — nothing to do. + if zfs list -H -t snapshot "${snap}" > /dev/null 2>&1; then + common::log INFO "Reusing cached template ${cache_key:0:12}" + zfs::touch_template "${template}" + printf "%s" "${template}" + return + fi + + # Build the chain bottom-up. 
_build_layer is idempotent on cache hit, + # so a partial earlier chain (e.g. base layers reused from another + # image) costs only the missing top layers. + prev_layer="" + for ((i=0; i /dev/null; then + # The clone needs no contents work — it's already the merged + # rootfs. Mount it just long enough to validate, then snapshot. + if ! enroot-zfs-mount "${tmp}" 2> /dev/null; then + zfs destroy "${tmp}" 2> /dev/null || : + common::err "failed to mount template clone of layer leaf" + fi + enroot-zfs-mount --unmount "${tmp}" 2> /dev/null || : + zfs rename "${tmp}" "${template}" + enroot-zfs-mount "${template}" 2> /dev/null || : + zfs snapshot "${snap}" + zfs set readonly=on "${template}" 2> /dev/null || : + zfs set "enroot:imported=$(date -u +%FT%TZ)" "${template}" 2> /dev/null || : + enroot-zfs-mount --unmount "${template}" 2> /dev/null || : + zfs::touch_template "${template}" + else + # Lost the race or stale .tmp — wait for @pristine. + while ! zfs list -H -t snapshot "${snap}" > /dev/null 2>&1; do + sleep 1 + ((wait_i++ < 600)) || common::err "Timed out waiting for chain template: ${template}" + done + fi + + printf "%s" "${template}" +} + # Import flow for docker:// URIs when the ZFS backend is active and the # pointer format is selected. Pulls layers (via docker::_prepare_layers), # fetches the manifest digest (via docker::digest), populates the From 31e53a2ae1ab337804709c8fa7270bdd9041fde4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 22:24:10 -0400 Subject: [PATCH 06/13] storage_zfs: dispatch chain mode in docker_install_from_layers and pull path MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- src/docker.sh | 8 +++++++- src/storage_zfs.sh | 20 ++++++++++++++++++-- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/src/docker.sh b/src/docker.sh index ff362f8..e396cad 100644 --- a/src/docker.sh +++ b/src/docker.sh @@ -552,7 +552,13 @@ docker::load() ( fi if zfs::enabled; then - zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" + if zfs::layer_chain_active; then + local -a layer_digests=() + readarray -t layer_digests < .layers + zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" "${layer_digests[@]}" + else + zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" + fi else # Create a mount namespace and overlay mount mkdir -p rootfs "${name}" diff --git a/src/storage_zfs.sh b/src/storage_zfs.sh index 49f8009..db2f13c 100644 --- a/src/storage_zfs.sh +++ b/src/storage_zfs.sh @@ -870,8 +870,18 @@ zfs::_install_template_from_layers() { # (docker::load) use. zfs::docker_install_from_layers() { local -r cache_key="$1" layer_count="$2" unpriv="$3" name="$4" + shift 4 local template - template=$(zfs::_install_template_from_layers "${cache_key}" "${layer_count}" "${unpriv}") + # Variadic remaining args are layer digests (base first, top last); + # passed when the caller wants chain mode. Chain mode also requires + # ENROOT_ZFS_LAYER_CHAIN=y; if either gate fails we transparently + # fall back to Plan F's single-merge path. This keeps the dispatch + # safe for callers that do not yet know about chain mode. 
+ if zfs::layer_chain_active && [ "$#" -ge 1 ]; then + template=$(zfs::_install_layer_chain "${cache_key}" "${layer_count}" "${unpriv}" "$@") + else + template=$(zfs::_install_template_from_layers "${cache_key}" "${layer_count}" "${unpriv}") + fi zfs::clone_container "${template}" "${name}" } @@ -1292,7 +1302,13 @@ zfs::_pull_and_install_template() ( unpriv=y fi - zfs::_install_template_from_layers "${config}" "${layer_count}" "${unpriv}" > /dev/null + if zfs::layer_chain_active; then + local -a layer_digests=() + readarray -t layer_digests < .layers + zfs::_install_layer_chain "${config}" "${layer_count}" "${unpriv}" "${layer_digests[@]}" > /dev/null + else + zfs::_install_template_from_layers "${config}" "${layer_count}" "${unpriv}" > /dev/null + fi printf "%s" "${config}" ) From ebd0699e6db2fdb1343899831a9be85289ab0950 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 22:25:55 -0400 Subject: [PATCH 07/13] Mark Plan G (per-layer ZFS clone chain) as implemented MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- CLAUDE.md | 2 +- doc/zfs.md | 14 +++++++++++++- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 0fe01e6..6c7681a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -70,7 +70,7 @@ When debugging container behavior, the order is: image `/etc/{rc,fstab,environme ## Active design proposals - **`doc/zfs.md`** — optional ZFS storage backend (`ENROOT_STORAGE_BACKEND=zfs`). Replaces `unsquashfs`-per-create with extract-once-then-`zfs clone`. Adds a `.zfs` (zfs send stream) image format and a `zfs://host/NAME` transport scheme alongside today's `.sqsh`. Introduces a shared template cache with a live/warm/cold lifecycle (knobs: `ENROOT_TEMPLATE_WARM_SECONDS`, `ENROOT_TEMPLATE_PRESSURE_THRESHOLD`; eviction is implicit on `create`, no daemon, no `enroot gc` command). Default backend (`dir`) is unchanged. -- **`doc/plans/`** — six implementation plans (A–F) breaking the ZFS backend into independently-landable slices. Start with `doc/plans/README.md` for the index and recommended landing order (A → E → F → B → C → D). Plans add a new sourced module `src/storage_zfs.sh` (under a `zfs::` namespace) and branch in `src/runtime.sh`, `src/docker.sh` on `ENROOT_STORAGE_BACKEND`. **All six plans merged** on `zenroot/main` (PRs [#1](https://github.com/zeroae/enroot/pull/1), [#2](https://github.com/zeroae/enroot/pull/2), [#3](https://github.com/zeroae/enroot/pull/3), [#5](https://github.com/zeroae/enroot/pull/5), [#7](https://github.com/zeroae/enroot/pull/7), and Plan D in review). +- **`doc/plans/`** — implementation plans for the ZFS backend, broken into independently-landable slices. Start with `doc/plans/README.md` for the index and recommended landing order (A → E → F → B → C → D → G). Plans add a new sourced module `src/storage_zfs.sh` (under a `zfs::` namespace) and branch in `src/runtime.sh`, `src/docker.sh` on `ENROOT_STORAGE_BACKEND`. **Plans A–F merged** on `zenroot/main` (PRs [#1](https://github.com/zeroae/enroot/pull/1), [#2](https://github.com/zeroae/enroot/pull/2), [#3](https://github.com/zeroae/enroot/pull/3), [#5](https://github.com/zeroae/enroot/pull/5), [#7](https://github.com/zeroae/enroot/pull/7), [#8](https://github.com/zeroae/enroot/pull/8)). Plan G (per-layer clone chain, opt-in via `ENROOT_ZFS_LAYER_CHAIN=y`, [issue #4](https://github.com/zeroae/enroot/issues/4)) layered on top of F. 
## Conventions diff --git a/doc/zfs.md b/doc/zfs.md index 1c175d0..ebbbd2d 100644 --- a/doc/zfs.md +++ b/doc/zfs.md @@ -1,6 +1,6 @@ # ZFS storage backend -This document describes an optional ZFS-aware mode for the enroot container store. **All six plans (A–F) are implemented.** When `ENROOT_STORAGE_BACKEND=zfs`: `enroot create`, `enroot remove`, ephemeral `enroot start `, and `enroot load docker://...` all use ZFS datasets, with a shared template cache that survives `enroot remove` (warm) for `ENROOT_TEMPLATE_WARM_SECONDS` and gets pressure-evicted LRU once the templates dataset crosses `ENROOT_TEMPLATE_PRESSURE_THRESHOLD` of its quota. `enroot create` accepts both `.sqsh` and `.zfs` (zfs send stream) inputs; `enroot export --format=zfs` produces the latter. The `zfs://[USER@]HOST/NAME` URI scheme transports containers between enroot hosts over SSH (`enroot load zfs://...` to pull, `enroot export NAME zfs://...` to push). The default storage backend (plain directories under `ENROOT_DATA_PATH`) is unchanged and remains the only option on hosts without ZFS. +This document describes an optional ZFS-aware mode for the enroot container store. **All six plans (A–F) are implemented; Plan G adds an opt-in per-layer clone chain on top of F.** When `ENROOT_STORAGE_BACKEND=zfs`: `enroot create`, `enroot remove`, ephemeral `enroot start `, and `enroot load docker://...` all use ZFS datasets, with a shared template cache that survives `enroot remove` (warm) for `ENROOT_TEMPLATE_WARM_SECONDS` and gets pressure-evicted LRU once the templates dataset crosses `ENROOT_TEMPLATE_PRESSURE_THRESHOLD` of its quota. `enroot create` accepts both `.sqsh` and `.zfs` (zfs send stream) inputs; `enroot export --format=zfs` produces the latter. The `zfs://[USER@]HOST/NAME` URI scheme transports containers between enroot hosts over SSH (`enroot load zfs://...` to pull, `enroot export NAME zfs://...` to push). The default storage backend (plain directories under `ENROOT_DATA_PATH`) is unchanged and remains the only option on hosts without ZFS. ## Motivation @@ -15,6 +15,7 @@ The ZFS backend is an *alternative storage driver*, in the same spirit as Docker | `ENROOT_STORAGE_BACKEND` | `dir` | `dir` = today's behavior. `zfs` = use ZFS datasets for the container store. | | `ENROOT_TEMPLATE_WARM_SECONDS` | `604800` (7 days) | How long a template with no clones remains evictable only under pressure. `0` = evict immediately when refcount reaches zero (refcount-only). `inf` = never auto-evict. | | `ENROOT_TEMPLATE_PRESSURE_THRESHOLD` | `0.80` | Templates dataset quota fraction above which routine `create`s start evicting warm templates. Soft signal; the ZFS quota is the hard wall. | +| `ENROOT_ZFS_LAYER_CHAIN` | unset | When `y` AND backend is `zfs`, populate the Docker template cache via a per-layer `zfs clone` chain under `/.layers/` instead of a single merged extract. Cross-image base layers are physically shared on disk (a debian-bookworm base used by both `python:slim` and `node:slim` is stored once). Default off — leaves Plan F's single-merge path unchanged. | When `ENROOT_STORAGE_BACKEND=zfs`, `ENROOT_DATA_PATH` must be the mountpoint of a ZFS dataset that the unprivileged user has been granted permission on (see [Admin setup](#admin-setup)). 
@@ -26,6 +27,17 @@ ${pool}/${dataset}/templates/@pristine # snapshot taken after extracti ${pool}/${dataset}// # clones of @pristine, the user's containers ``` +When `ENROOT_ZFS_LAYER_CHAIN=y`, an additional `.layers/` namespace appears under the same store; templates become clones of the chain leaf instead of being filled by a single merged extract: + +``` +${pool}/${dataset}/.layers/ # one per distinct registry layer +${pool}/${dataset}/.layers/@done # snapshot taken after layer apply +${pool}/${dataset}/.templates/ # zfs clone of the chain leaf @done +${pool}/${dataset}/.templates/@pristine +``` + +Each layer dataset is `zfs clone`d from the previous layer's `@done`, so two images sharing a base layer (e.g. `python:3.12-slim` and `node:20-slim`, both built on `debian:bookworm-slim`) physically share the base bytes. Layer datasets are immutable origins; ZFS refuses to destroy a layer while any descendant clone exists, so layer GC is automatic once all referencing templates are evicted. + Mountpoints follow the dataset hierarchy under `ENROOT_DATA_PATH`. Templates are not user-visible — `enroot list` only enumerates `/` clones. Templates have `readonly=on`; clones inherit the property override on `start -w`. The `templates` dataset is shared across all users on the host. Its quota and properties are admin-controlled (see below). From 5b0d37b566416089c916e0a04cd27cd125869274 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 23:11:24 -0400 Subject: [PATCH 08/13] deb: recommend attr (provides getfattr, required by chain mode) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- pkg/deb/control | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pkg/deb/control b/pkg/deb/control index 26b9950..1f51cc7 100644 --- a/pkg/deb/control +++ b/pkg/deb/control @@ -29,7 +29,7 @@ Depends: ${shlibs:Depends}, ${misc:Depends}, # tar, # util-linux, # ncurses-bin -Recommends: pigz +Recommends: pigz, attr Suggests: libnvidia-container-tools, squashfuse, fuse-overlayfs Description: Unprivileged container sandboxing utility A simple yet powerful tool to turn traditional container/OS images into From 3f7e3af642a0a435d55e672ac4bfb0eb3c488f02 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Fri, 1 May 2026 23:11:25 -0400 Subject: [PATCH 09/13] storage_zfs: fix Plan G chain order and apply synthetic 0/ to template MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Smoke testing on a 3-layer image (node:20-alpine) caught two bugs in the chain installer: 1. Inverted iteration. docker::_download reverses the manifest's layer order via jq's `reverse`, so digests[0] is the TOP layer and digests[N-1] is the BASE. The original `for i in 0..N-1` loop treated digests[0] as the base, building the chain upside-down and producing a leaf that contained only the top-layer's diffs (e.g. 5.4M for what should have been a 70M merged node:20-alpine rootfs). Iterating from N-1 down to 0 puts BASE first in the zfs hierarchy and the TOP at the leaf. 2. Missing synthetic config layer. docker::_prepare_layers populates a directory 0/ via docker::configure with the per-image /etc/{rc,fstab,environment} derived from the image config blob; Plan F's overlay mount stacks 0:1:2:...:N so 0/ ends up on top. The chain installer ignored 0/ entirely, so containers created via chain mode were missing /etc/rc and the merged fstab entries. 
Now applied as a final tar-pipe step on top of the leaf clone during template finalization, before snapshotting @pristine. Also tighten the apply payload: - getfattr returns non-zero when no files match the requested xattr; with set -euo pipefail in the payload that aborted the whole apply on alpine (no opaque dirs). Capture to a temp file with `|| true`. - Drop tar's --acls. Default ZFS datasets have acltype=off, which makes POSIX ACL set/get fail with "Operation not supported" warnings even when the source has no ACLs. Docker images effectively never depend on ACLs, and xattrs (overlayfs opaque markers, capability bits, SELinux labels) are still preserved. Signed-off-by: Patrick Sodré --- src/storage_zfs.sh | 78 +++++++++++++++++++++++++++++++++------------- 1 file changed, 56 insertions(+), 22 deletions(-) diff --git a/src/storage_zfs.sh b/src/storage_zfs.sh index db2f13c..2d95602 100644 --- a/src/storage_zfs.sh +++ b/src/storage_zfs.sh @@ -762,13 +762,16 @@ cd '@@LAYER@@' # Phase 1: opaque-dir clearing. trusted.overlay.opaque=y on a layer dir # means "ignore everything from the parent in this dir"; we replicate # that by clearing the corresponding target dir's children before -# layering this layer's contents on top. -getfattr -R -h --absolute-names -n trusted.overlay.opaque . 2>/dev/null \ - | awk -F': ' 'sub(/^# file: /, "")' \ +# layering this layer's contents on top. getfattr exits non-zero when no +# matches are found, so the result is captured to a temp file with +# || true to keep set -e + pipefail happy. +getfattr -R -h --absolute-names -n trusted.overlay.opaque . 2>/dev/null > /tmp/.enroot-opq.$$ || true +awk -F': ' 'sub(/^# file: /, "")' /tmp/.enroot-opq.$$ \ | while IFS= read -r d; do rel="${d#./}" find '@@TARGET@@/'"${rel}" -mindepth 1 -maxdepth 1 -exec rm -rf {} + 2>/dev/null || : done +rm -f /tmp/.enroot-opq.$$ # Phase 2: whiteout deletion. Each char-device 0:0 in the layer encodes # "this path is removed in this layer". Be defensive — only treat 0:0 @@ -780,14 +783,17 @@ find . -type c | while IFS= read -r wh; do done # Phase 3: tar-pipe non-whiteout contents into the target. xattrs -# (overlayfs opaque markers, capability bits, SELinux labels) and ACLs -# are preserved. Char devices are excluded — both the 0:0 whiteouts we +# (overlayfs opaque markers, capability bits, SELinux labels) are +# preserved. Char devices are excluded — both the 0:0 whiteouts we # already actioned in phase 2 and any other char devs (which would not -# be expected in Docker images post extraction). +# be expected in Docker images post extraction). POSIX ACLs are not +# preserved because (a) ZFS datasets default to acltype=off which +# makes `tar --acls` fail with noisy warnings even when the source has +# no ACLs, and (b) Docker images effectively never depend on ACLs. find . -type c -printf '%P\n' > /tmp/.enroot-excludes.$$ tar -C . --exclude-from=/tmp/.enroot-excludes.$$ \ - --xattrs --xattrs-include='*' --acls -cpf - . \ - | tar -C '@@TARGET@@' --xattrs --xattrs-include='*' --acls -xpf - + --xattrs --xattrs-include='*' -cpf - . \ + | tar -C '@@TARGET@@' --xattrs --xattrs-include='*' -xpf - rm -f /tmp/.enroot-excludes.$$ PAYLOAD } @@ -1036,16 +1042,28 @@ zfs::_build_layer() { # has populated the cwd with extracted, whiteout-converted layer # directories 0/, 1/, ..., N/ and written the digest list to ./.layers. 
# -# The leaf of the layer chain is cloned into /.templates/ -# with @pristine snapshot, so the resulting template is shape-compatible -# with Plan F templates: clone_container, the pointer-format flow, eviction -# recovery, and zfs:// transport all work unchanged. +# The leaf of the layer chain is cloned into /.templates/, +# the per-image synthetic config layer 0/ (rc/fstab/environment generated +# by docker::configure) is applied on top, and the result is snapshotted +# as @pristine. The template is therefore shape-compatible with Plan F +# templates: clone_container, the pointer-format flow, eviction recovery, +# and zfs:// transport all work unchanged. +# +# Layer ordering: docker::_download reverses the registry's manifest order, +# so digests[0] is the TOP layer (e.g. node binary) and digests[N-1] is +# the BASE layer (e.g. alpine root). docker::_prepare_layers' parallel +# extraction puts each digest into directory `i+1` (1-based), so dir 1 = +# digests[0] = TOP and dir N = digests[N-1] = BASE. We build the chain +# BASE-first (iterating i=N-1 down to 0) so the leaf @done snapshot +# contains the fully merged rootfs and matches Plan F's overlay-mount +# output (lowerdir=0:1:2:...:N stacks 0 on top of 1 on top of 2 ...). # # Inputs: # $1 cache_key - sha256 of the image config blob # $2 layer_count - the N from _prepare_layers # $3 unpriv - "y" or "" — passed through to enroot-nsenter -# $4..$(3+N) - layer digests in stack order, base first, top last +# $4..$(3+N) - layer digests with docker::_download's reversed +# convention: digests[0] = TOP, digests[N-1] = BASE # # Outputs: prints the template dataset path on stdout (no trailing newline). # @@ -1087,26 +1105,42 @@ zfs::_install_layer_chain() { return fi - # Build the chain bottom-up. _build_layer is idempotent on cache hit, - # so a partial earlier chain (e.g. base layers reused from another - # image) costs only the missing top layers. + # Build the chain BASE-first (iterate from digests[N-1] = BASE to + # digests[0] = TOP). _build_layer is idempotent on @done cache hit, + # so re-pulling an image whose lower layers are already cached costs + # only the new top layers. prev_layer="" - for ((i=0; i=0; i--)); do zfs::_build_layer "${digests[i]}" "${prev_layer}" "$((i+1))" "${unpriv}" prev_layer="${store}/${zfs_layers_subdir}/${digests[i]}" done leaf_layer="${prev_layer}" - # Final: clone the leaf as the user-visible template. canmount=noauto - # so we control mount via the helper. Same .tmp-then-rename race - # protection as Plan F's _install_template_from_layers. + # Final: clone the leaf as the user-visible template, then apply the + # per-image synthetic config layer 0/ on top. canmount=noauto so we + # control mount via the helper. Same .tmp-then-rename race protection + # as Plan F's _install_template_from_layers. if zfs clone -o canmount=noauto "${leaf_layer}@${zfs_layer_done_snap}" "${tmp}" 2> /dev/null; then - # The clone needs no contents work — it's already the merged - # rootfs. Mount it just long enough to validate, then snapshot. + zfs set readonly=off "${tmp}" 2> /dev/null || : if ! enroot-zfs-mount "${tmp}" 2> /dev/null; then zfs destroy "${tmp}" 2> /dev/null || : common::err "failed to mount template clone of layer leaf" fi + local tmp_mountpoint + tmp_mountpoint=$(zfs get -H -o value mountpoint "${tmp}") + + # Apply the synthetic config layer (dir 0/, populated by + # docker::configure with /etc/{rc,fstab,environment} derived from + # the image config). 
It has no whiteouts, so the apply payload + # degenerates to a tar-pipe — but reusing the same payload keeps + # xattr-handling consistent with the registry layers. + local config_payload + config_payload=$(zfs::_apply_layer_payload "${PWD}/0" "${tmp_mountpoint}") + if ! enroot-nsenter ${unpriv:+--user} --mount --remap-root bash -c "${config_payload}"; then + zfs destroy -r "${tmp}" 2> /dev/null || : + common::err "Failed to apply synthetic config layer to template ${cache_key:0:12}" + fi + enroot-zfs-mount --unmount "${tmp}" 2> /dev/null || : zfs rename "${tmp}" "${template}" enroot-zfs-mount "${template}" 2> /dev/null || : From 65fb5215d32252c9d42342271cfdf263a28f3bd0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Sat, 2 May 2026 08:12:46 -0400 Subject: [PATCH 10/13] plans: sync Task 2 with sidecar-file plumbing decision MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Patrick Sodré --- doc/plans/2026-05-01-zfs-g-layer-chain.md | 28 ++++++++++++++--------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/doc/plans/2026-05-01-zfs-g-layer-chain.md b/doc/plans/2026-05-01-zfs-g-layer-chain.md index 7c6d88e..1b0c44c 100644 --- a/doc/plans/2026-05-01-zfs-g-layer-chain.md +++ b/doc/plans/2026-05-01-zfs-g-layer-chain.md @@ -85,26 +85,32 @@ git commit -s -m "storage_zfs: add zfs::layer_chain_active predicate" --- -### Task 2: Plumb layer digests out of `docker::_prepare_layers` +### Task 2: Side-emit layer digests from `docker::_prepare_layers` -Plan G needs the ordered list of registry layer content-digests as cache keys for `/.layers/`. Today `_prepare_layers` only emits `config\nlayer_count\n` on stdout. Existing callers (Plan F's `_install_template_from_layers`, `docker::import`, `docker::load`'s dir branch) read exactly two lines via `common::read`; they must keep working unchanged. +Plan G needs the ordered list of registry layer content-digests as cache keys for `/.layers/`. Today `_prepare_layers` only emits `config\nlayer_count\n` on stdout. Existing callers (Plan F's `_install_template_from_layers`, `docker::import`, `docker::load`'s dir branch) read exactly two lines via `common::read` and must keep working unchanged. -The least-invasive plumbing: a new function `docker::_load_layer_digests` that re-runs the *cheap* parts of `_download` (parse the manifest, list digests) without re-extracting. The layers are already cached at `${ENROOT_CACHE_PATH}/` from `_prepare_layers`'s pull, so we don't pay any registry roundtrip — the manifest is also already cached in the token dir. The chain installer calls this helper to get the ordered digest list. +Adding extra stdout lines is risky: a 2-line consumer closes the pipe after its second `read`, causing SIGPIPE on the producer's third printf. Under `set -euo pipefail` in the producer subshell that surfaces as a non-zero exit, breaking existing callers. -Alternative considered and rejected: extending `_prepare_layers` stdout with extra lines after the existing two. Existing callers consume the pipe with `{ common::read; common::read; }` which leaves trailing lines unread, triggering SIGPIPE on the producer's printf and a non-zero exit under `pipefail`. Adding a side-channel fd works but complicates calling conventions. A standalone helper that re-parses the cached manifest is simplest and contained. +The simplest fix: have `_prepare_layers` write the digest list to a sidecar file `./.layers` in its own cwd. 
Every caller already runs `_prepare_layers` inside a fresh `common::mktmpdir enroot` directory and `common::chdir`s into it, so the sidecar lives inside the per-call temp dir and is cleaned up by the caller's existing EXIT trap. Plan G's chain-mode caller does `readarray -t digests < .layers` after `_prepare_layers` returns. Non-chain callers simply ignore the file. **Files:** - Modify: `src/docker.sh` -- [ ] **Step 2.1: Add `docker::_load_layer_digests`** +- [ ] **Step 2.1: Have `_prepare_layers` write `./.layers`** -Append after `docker::_prepare_layers` in `src/docker.sh`. The helper takes the same five inputs (`user registry image tag arch`) and prints one digest per line, in stack order (base first, top last). Implementation reuses `docker::_download`'s manifest-parsing path or directly reads the cached manifest. +In `docker::_prepare_layers`, after `_download` has populated `${layers[@]}` and before the existing `printf "%s\n%s\n"` final output, add: + +```bash +printf "%s\n" "${layers[@]}" > .layers +``` + +The file is one digest per line in stack order (base first, top last). It sits in the caller's temp dir, gets removed when the temp dir is. - [ ] **Step 2.2: Commit** ```sh git add src/docker.sh -git commit -s -m "docker: add _load_layer_digests helper for chain-mode callers" +git commit -s -m "docker: side-emit layer digests to ./.layers in _prepare_layers" ``` --- @@ -256,13 +262,13 @@ zfs::docker_install_from_layers() { - [ ] **Step 5.2: Pass layer digests from `docker::load`** -In `src/docker.sh` `docker::load`'s ZFS branch (the `if zfs::enabled` block currently calling `zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}"`), prepend a digest-list capture under chain mode and append it to the call: +In `src/docker.sh` `docker::load`'s ZFS branch (the `if zfs::enabled` block currently calling `zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}"`), read the sidecar `./.layers` written by `_prepare_layers` and pass the digests through under chain mode: ```bash if zfs::enabled; then if zfs::layer_chain_active; then local layer_digests=() - readarray -t layer_digests < <(docker::_load_layer_digests "${user}" "${registry}" "${image}" "${tag}" "${arch}") + readarray -t layer_digests < .layers zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" "${layer_digests[@]}" else zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" @@ -274,12 +280,12 @@ fi - [ ] **Step 5.3: Pass layer digests from `zfs::_pull_and_install_template`** -In `src/storage_zfs.sh`, the puller already calls `zfs::_install_template_from_layers` directly (it bypasses `docker_install_from_layers` because it doesn't clone — only fills the cache for the pointer-import / eviction-recovery flow). Mirror the dispatch: +In `src/storage_zfs.sh`, the puller already calls `zfs::_install_template_from_layers` directly (it bypasses `docker_install_from_layers` because it doesn't clone — only fills the cache for the pointer-import / eviction-recovery flow). It also runs `_prepare_layers` inside its own `common::mktmpdir`+`chdir` block, so the same `./.layers` sidecar is available. 
Mirror the dispatch: ```bash if zfs::layer_chain_active; then local layer_digests=() - readarray -t layer_digests < <(docker::_load_layer_digests "${user}" "${registry}" "${image}" "${tag}" "${arch}") + readarray -t layer_digests < .layers zfs::_install_layer_chain "${config}" "${layer_count}" "${unpriv}" "${layer_digests[@]}" > /dev/null else zfs::_install_template_from_layers "${config}" "${layer_count}" "${unpriv}" > /dev/null From 901c7e79e58fb6072a4fef7b52a000a81178b26b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Wed, 6 May 2026 05:54:43 -0400 Subject: [PATCH 11/13] docs: document Plan G's docker://-only scope MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Plan G applies only to registry-pulled docker:// URIs. Daemon-local URIs (dockerd://, podman://) take a separate path that uses \`${engine} export | tar -x\` (flat rootfs) instead of layer tarballs, so there is no per-layer structure for the chain installer to consume. Spell this out in the plan's Coexistence section and the user-facing knob description in doc/zfs.md, plus add a future-work note in Out of scope describing what bringing chain mode to daemon URIs would require (switching to \`${engine} save\` and parsing manifest.json). The current code already silently no-ops for daemon URIs; this is a docs-only commit clarifying the boundary. Signed-off-by: Patrick Sodré --- doc/plans/2026-05-01-zfs-g-layer-chain.md | 8 ++++++++ doc/zfs.md | 2 +- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/doc/plans/2026-05-01-zfs-g-layer-chain.md b/doc/plans/2026-05-01-zfs-g-layer-chain.md index 1b0c44c..427260b 100644 --- a/doc/plans/2026-05-01-zfs-g-layer-chain.md +++ b/doc/plans/2026-05-01-zfs-g-layer-chain.md @@ -37,6 +37,8 @@ Each `.layers/` dataset's `@done` snapshot is reused on subsequent impor - `ENROOT_ZFS_LAYER_CHAIN=y`: dispatch to chain mode. Same dispatch is hit from `docker::load` and from `_pull_and_install_template` (the puller used by pointer-format import and eviction recovery), so all callers see chain-mode templates when the flag is on. - The fast path "template `@pristine` already exists, reuse it" is hit *before* the chain/no-chain dispatch. Templates produced under one mode are reused under the other without rebuild — only the *fill* mechanism differs on miss. +**Scope: `docker://` URIs only.** Plan G applies to registry-pulled images that go through `docker::_prepare_layers` (which produces the per-layer directories Plan G chains over). Daemon-local URIs (`dockerd://`, `podman://`) are *silently unaffected* by `ENROOT_ZFS_LAYER_CHAIN=y` — they go through `zfs::_extract_and_install_from_daemon`, which uses `${engine} export | tar -x` to produce a single flat rootfs (the daemon has already merged the layers internally; `docker export` is a flatten operation, not a layer-preserving one). That path goes through `zfs::_install_template_from_dir` and stays untouched. Bringing chain mode to daemon URIs is feasible but requires switching from `docker export` to `docker save` (which writes a tar archive containing per-layer tarballs plus a `manifest.json` describing layer order) — see Out of scope below. + **Depends on:** Plans A, B, F (template lifecycle, sweep, ENOSPC retry shape are reused). **Prerequisite host setup:** Same as Plan F. ZFS user delegation must include `clone`. `promote` is **not** required (Plan G doesn't promote). Whiteout/xattr work runs inside `enroot-nsenter --user --remap-root --mount`, same as Plan F's merge step. 
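The Scope note above hinges on the flatten-vs-layer-preserving distinction between `docker export` and `docker save`. A minimal local sketch of the difference (the image name is only an example; neither command is part of the plan's code):

```sh
# docker export flattens: only plain rootfs paths come out
docker create --name probe alpine
docker export probe | tar -tf - | head
docker rm probe

# docker save preserves layers: manifest.json plus one layer.tar per layer
docker save alpine | tar -tf - | head
```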
@@ -407,6 +409,12 @@ gh pr create --repo zeroae/enroot --base zenroot/main --head feature/zfs-g-layer - Cross-host layer replication via `zfs send`. - Migration tooling between merged-template and per-layer-chain caches. - Automated layer-dataset GC (manual `zfs destroy` works today). +- **Chain mode for `dockerd://` / `podman://` URIs.** Daemon-URI imports use `${engine} export | tar -x` which flattens the layered image into a single tarball before extraction — there is no per-layer directory structure for the chain installer to consume. Adding chain support here would require: + 1. Switching the daemon path from `${engine} export` to `${engine} save` (which writes a tar archive containing one `/layer.tar` per layer plus a `manifest.json` describing the order). + 2. Parsing `manifest.json` to recover the layer-digest list. + 3. Extracting each layer tarball into a directory parallel to what `docker::_prepare_layers` produces, then dispatching to `_install_layer_chain`. + 4. Constructing a synthetic `0/` from the daemon's image config (`${engine} inspect`'s output). + This is a real follow-up plan, not a one-line addition. It also slightly changes the daemon contract — `docker save` requires more disk (full image tar before extraction) than `docker export` (streamed). For now `ENROOT_ZFS_LAYER_CHAIN=y` is a documented no-op for daemon URIs. ## Execution Handoff diff --git a/doc/zfs.md b/doc/zfs.md index ebbbd2d..461621a 100644 --- a/doc/zfs.md +++ b/doc/zfs.md @@ -15,7 +15,7 @@ The ZFS backend is an *alternative storage driver*, in the same spirit as Docker | `ENROOT_STORAGE_BACKEND` | `dir` | `dir` = today's behavior. `zfs` = use ZFS datasets for the container store. | | `ENROOT_TEMPLATE_WARM_SECONDS` | `604800` (7 days) | How long a template with no clones remains evictable only under pressure. `0` = evict immediately when refcount reaches zero (refcount-only). `inf` = never auto-evict. | | `ENROOT_TEMPLATE_PRESSURE_THRESHOLD` | `0.80` | Templates dataset quota fraction above which routine `create`s start evicting warm templates. Soft signal; the ZFS quota is the hard wall. | -| `ENROOT_ZFS_LAYER_CHAIN` | unset | When `y` AND backend is `zfs`, populate the Docker template cache via a per-layer `zfs clone` chain under `/.layers/` instead of a single merged extract. Cross-image base layers are physically shared on disk (a debian-bookworm base used by both `python:slim` and `node:slim` is stored once). Default off — leaves Plan F's single-merge path unchanged. | +| `ENROOT_ZFS_LAYER_CHAIN` | unset | When `y` AND backend is `zfs`, populate the Docker template cache via a per-layer `zfs clone` chain under `/.layers/` instead of a single merged extract. Cross-image base layers are physically shared on disk (a debian-bookworm base used by both `python:slim` and `node:slim` is stored once). Applies to `docker://` URIs only; `dockerd://` and `podman://` always go through the daemon-flat-export path and are unaffected. Default off — leaves Plan F's single-merge path unchanged. | When `ENROOT_STORAGE_BACKEND=zfs`, `ENROOT_DATA_PATH` must be the mountpoint of a ZFS dataset that the unprivileged user has been granted permission on (see [Admin setup](#admin-setup)). 
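As a usage sketch of the knob documented above (the `tank/enroot/data` dataset path is only a placeholder borrowed from the smoke-test steps elsewhere in this series; the image tags are Plan G's own test images):

```sh
export ENROOT_STORAGE_BACKEND=zfs
export ENROOT_ZFS_LAYER_CHAIN=y

enroot import -o /tmp/python.sqsh docker://python:3.12-slim
enroot create -n python /tmp/python.sqsh
enroot import -o /tmp/node.sqsh docker://node:20-slim
enroot create -n node /tmp/node.sqsh

# the shared debian-bookworm base layer should show up once under .layers/,
# as the origin of both images' chains
zfs list -r -d 1 -o name,origin tank/enroot/data/.layers
```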
From 17b8bac8dd6a5a93b49b40448d0d869c64e8ee97 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Wed, 6 May 2026 06:52:04 -0400 Subject: [PATCH 12/13] docs: implementation plan for chain mode on daemon URIs (Plan H) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Plan H extends Plan G's per-layer zfs clone chain to dockerd:// and podman:// URIs via \`${engine} save\` (preserves per-layer tarballs + manifest.json describing layer order) instead of \`${engine} export\` (flattens). Layer digests are content-addressed via sha256 of each layer.tar, so the same image won't share .layers/ datasets across docker:// and dockerd:// sources (registry blobs are compressed; save's layer.tar is uncompressed — same content, different sha) but multiple local daemon images with shared base layers DO dedup at the .layers/ level. Default-off — flag unset leaves the existing flat-export daemon path byte-for-byte unchanged. Same dispatch shape as Plan G (chain-mode gate on ENROOT_ZFS_LAYER_CHAIN=y, branched in import_daemon_pointer and create_from_pointer's recovery arm). Signed-off-by: Patrick Sodré --- doc/plans/2026-05-06-zfs-h-daemon-chain.md | 382 +++++++++++++++++++++ doc/plans/README.md | 5 +- 2 files changed, 385 insertions(+), 2 deletions(-) create mode 100644 doc/plans/2026-05-06-zfs-h-daemon-chain.md diff --git a/doc/plans/2026-05-06-zfs-h-daemon-chain.md b/doc/plans/2026-05-06-zfs-h-daemon-chain.md new file mode 100644 index 0000000..2489da9 --- /dev/null +++ b/doc/plans/2026-05-06-zfs-h-daemon-chain.md @@ -0,0 +1,382 @@ +# ZFS Backend Plan H: Per-layer Chain for Daemon URIs (`dockerd://`, `podman://`) + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Extend Plan G's per-layer ZFS clone chain (`ENROOT_ZFS_LAYER_CHAIN=y`) to `dockerd://` and `podman://` URIs so daemon-local images get the same cross-image base-layer dedup that registry-pulled `docker://` images get today. Preserves Plan G's default-off behavior — when the flag is unset (or set to anything but `y`), daemon imports continue to use the existing `${engine} export | tar -x` flat path unchanged. + +**Why:** Plan G's reuse story ("a debian-bookworm base used by both `python:slim` and `node:slim` is stored once") only kicks in for `docker://`. Sites that build images locally (CI runners, devboxes) and import via `dockerd://` / `podman://` see no on-disk dedup — every image gets its own full template even when 80% of the rootfs is shared with another local image. For HPC clusters with engine-local image registries this matters as much as the registry case. + +## Architecture + +The daemon-URI flat-export path (`zfs::_extract_and_install_from_daemon` in Plan F's daemon-pointer follow-up) uses: + +``` +${engine} create --name +${engine} export | tar -x -C rootfs/ # flattens; no per-layer information survives +``` + +`docker export` (and `podman export`) is a *flatten* operation — the daemon walks the layered overlay and produces a single rootfs tarball. We can't recover the layer structure from that output. + +`docker save` (and `podman save`) is a *layer-preserving* operation. 
Its output is a tar archive containing: + +``` +manifest.json # Docker Image Format v1.2 — describes layer order +.json # the image config blob +/ # one dir per layer + layer.tar # the layer's content as an uncompressed tar + json + VERSION +/ + ... +repositories # legacy index (ignored) +``` + +Inside `manifest.json`: +```json +[{ + "Config": ".json", + "RepoTags": ["myimage:tag"], + "Layers": [ + "/layer.tar", // BASE first + "/layer.tar", + ... // TOP last + ] +}] +``` + +Plan H's daemon-chain installer: + +1. `${engine} save ` to a streaming pipe (no temp tar archive on disk; pipe straight into `tar -x` at a temp dir). +2. Parse `manifest.json` from the extracted tree → `config` blob path + ordered `Layers` list. +3. For each `/layer.tar`: + - `sha256sum layer.tar` → cache key for `/.layers/`. + - Extract the tarball into directory `i/` (1-based, BASE = 1, TOP = N) — same convention `docker::_prepare_layers` uses for the registry path. + - Run `enroot-aufs2ovlfs i/` to convert AUFS whiteouts to overlayfs form. +4. Build the synthetic config layer `0/` from the parsed `.json` (or equivalently `${engine} inspect`) using `docker::configure` — identical to the registry path. +5. Call `zfs::_install_layer_chain "${cache_key}" "${layer_count}" "${unpriv}" "${digests[@]}"`. + +The cache key (`cache_key`) stays the image-config-sha256 (same as today's daemon path) so a daemon image and the same image pulled via `docker://` *can* share a template if their config shas happen to match. They won't share `.layers/` datasets across sources because `sha256(save-layer.tar)` ≠ `sha256(registry-blob)` (the registry blob is compressed; `docker save` emits uncompressed tarballs); but within the daemon path, multiple local images with shared base layers will dedup at the `.layers/` level. + +### Coexistence + +- **Flag off (`ENROOT_ZFS_LAYER_CHAIN=` unset/empty):** existing flat `${engine} export | tar -x → _install_template_from_dir` path runs unchanged. Default behavior preserved byte-for-byte. +- **Flag on, daemon URI:** Plan H's `_save_and_install_from_daemon` path runs. Higher peak disk (`docker save` writes a tarball pipe; the per-layer extraction needs ~2× the merged image size at peak vs. flat-export's 1×) but cheaper on subsequent imports of images with shared layers. +- **Flag on, `docker://` URI:** Plan G runs unchanged (this plan does not touch the registry path). +- **Fast path:** the existing "template `@pristine` already exists, reuse it" check runs *before* the chain dispatch, so a daemon image whose config-sha is already cached (from any earlier import via any source) skips the save entirely. + +## Files + +- **Modify:** `src/storage_zfs.sh` — add `zfs::_save_and_install_from_daemon` (chain-mode counterpart of `_extract_and_install_from_daemon`), wire chain-mode dispatch into `import_daemon_pointer` and `create_from_pointer`'s daemon recovery branch. +- **Modify:** `doc/zfs.md`, `CLAUDE.md` — flip the `ENROOT_ZFS_LAYER_CHAIN` knob description to drop the "docker:// only" caveat. + +`docker.sh`, `runtime.sh`, the existing flat-export daemon path, Plan G's helpers (`_apply_layer_payload`, `_build_layer`, `_install_layer_chain`), and `clone_container` are **not** modified. + +**Depends on:** Plans A, B, F (daemon pointer follow-up), G. + +**Prerequisite host setup:** Same as Plan G plus `${engine} save` permission. Docker Desktop and rootless podman both support `save`. Smoke target: spark-ctrl (Pi 5, OpenZFS 2.4.1, 3.75G test pool) with a docker daemon installed. 
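+
+Before diving into Task 1, the `docker save` layout described above is easy to inspect by hand. A quick sketch (the image name is only an example, and plain `jq` stands in for the `common::jq` wrapper the helper below uses):
+
+```sh
+mkdir -p /tmp/saved
+docker save alpine:3.21 | tar -x -C /tmp/saved
+jq -r '.[0].Config' /tmp/saved/manifest.json     # image config blob path
+jq -r '.[0].Layers[]' /tmp/saved/manifest.json   # layer.tar paths, BASE first
+sha256sum /tmp/saved/*/layer.tar                 # content-addressed cache keys
+```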
+ +--- + +### Task 1: Add `zfs::_save_and_install_from_daemon` + +**Files:** +- Modify: `src/storage_zfs.sh` — append after `_extract_and_install_from_daemon`. + +- [ ] **Step 1.1: Add the helper** + +Subshell function (parens) so cwd and EXIT trap stay scoped, matching `_extract_and_install_from_daemon`. Inputs match the existing helper (`uri`, `arch`); outputs the resolved image-config-sha256 on stdout (same contract). + +```bash +zfs::_save_and_install_from_daemon() ( + local -r uri="$1" arch="$2" + local image= tmpdir= engine= cache_key= unpriv= image_id= + local config_blob= layer_count= i= + local -a layer_paths=() digests=() + + set -euo pipefail + + case "${uri}" in + dockerd://*) engine="docker" ;; + podman://*) engine="podman" ;; + *) common::err "_save_and_install_from_daemon: not a daemon URI: ${uri}" ;; + esac + + common::checkcmd jq sha256sum "${engine}" tar + + local -r reg_image="[[:alnum:]/._:-]+" + if [[ "${uri}" =~ ^[[:alpha:]]+://(${reg_image})$ ]]; then + image="${BASH_REMATCH[1]}" + else + common::err "Invalid image reference: ${uri}" + fi + + image_id=$("${engine}" inspect --format '{{.Id}}' "${image}") \ + || common::err "${engine} inspect ${image} failed" + [[ "${image_id}" =~ ^sha256:[0-9a-f]{64}$ ]] \ + || common::err "${engine} returned unexpected image ID: ${image_id}" + cache_key="${image_id#sha256:}" + + trap 'common::rmall "${tmpdir}" 2> /dev/null' EXIT + tmpdir=$(common::mktmpdir enroot) + common::chdir "${tmpdir}" + + common::log INFO "Saving ${engine} image and extracting layers..." NL + "${engine}" save "${image}" | tar -x + + [ -f manifest.json ] \ + || common::err "${engine} save did not produce manifest.json (unsupported image format?)" + + # manifest.json shape: [{"Config": "...", "Layers": ["/layer.tar", ...]}] + config_blob=$(common::jq -r '.[0].Config' manifest.json) + [ -n "${config_blob}" ] && [ -f "${config_blob}" ] \ + || common::err "manifest.json missing Config blob: ${config_blob}" + + readarray -t layer_paths < <(common::jq -r '.[0].Layers[]' manifest.json) + layer_count="${#layer_paths[@]}" + [ "${layer_count}" -gt 0 ] \ + || common::err "manifest.json declares no Layers" + + # Per-layer extraction: dir 1/ = layer_paths[0] = BASE (manifest.json + # ordering puts BASE first, TOP last — opposite of docker::_download's + # reversed convention). Compute digests as sha256 of each layer.tar so + # cache keys are content-addressed regardless of the engine's + # internal layer-id format (legacy v1 uses random IDs; newer formats + # use content addresses). + common::log INFO "Computing layer digests and extracting..." + digests=() + for ((i=0; i config + docker::configure "${PWD}/0" config "${arch}" + + if [ "${EUID}" -ne 0 ]; then + unpriv=y + fi + + # Plan G's chain installer expects digests in "TOP first, BASE last" + # ordering (the reversed convention of docker::_download). Reverse + # ours to match. _install_layer_chain iterates N-1 down to 0 to + # build BASE-first on disk, so the on-disk leaf is digests[0] = TOP. + local -a digests_reversed=() + for ((i=layer_count-1; i>=0; i--)); do + digests_reversed+=("${digests[i]}") + done + + # Wait — directory numbering must match the digest at the same chain + # position. Plan G's _build_layer takes (digest, prev_layer, layer_dir). + # With our reversed digests array (digests_reversed[0]=TOP=layer_count, + # digests_reversed[N-1]=BASE=1), we need layer_dir to follow the + # same reversal so digests_reversed[k] always lines up with the + # directory whose tarball produced that digest. 
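+    # (Worked example, hypothetical three-layer image: extracted dirs start
+    #  as 1=BASE 2=MID 3=TOP and digests_reversed = [TOP, MID, BASE]; after
+    #  the rename below, dirs read 1=TOP 2=MID 3=BASE, so digests_reversed[k]
+    #  and dir k+1 describe the same layer, matching the registry-path
+    #  convention.)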
+ # + # Easier: rename our extracted directories to match the registry + # path's "dir 1 = TOP" convention. Walk i=1..N (currently 1=BASE), + # rename "i/" to "tmp_i/" and then "tmp_i/" to "(N-i+1)/". One pass. + for ((i=1; i<=layer_count; i++)); do + mv "${i}" "tmp_${i}" + done + for ((i=1; i<=layer_count; i++)); do + mv "tmp_${i}" "$((layer_count - i + 1))" + done + + zfs::_install_layer_chain "${cache_key}" "${layer_count}" "${unpriv}" "${digests_reversed[@]}" > /dev/null + + printf "%s" "${cache_key}" +) +``` + +- [ ] **Step 1.2: Commit** + +```sh +git add src/storage_zfs.sh +git commit -s -m "storage_zfs: add _save_and_install_from_daemon for chain-mode daemon imports" +``` + +--- + +### Task 2: Wire chain-mode dispatch in `import_daemon_pointer` and `create_from_pointer` + +Two daemon entry points: the initial import (`import_daemon_pointer`) and the eviction-recovery re-pull inside `create_from_pointer`. Both today call `_extract_and_install_from_daemon`. Add a single dispatch in each based on `zfs::layer_chain_active`. + +**Files:** +- Modify: `src/storage_zfs.sh` + +- [ ] **Step 2.1: `import_daemon_pointer` dispatch** + +Before: + +```bash +cache_key=$(zfs::_extract_and_install_from_daemon "${uri}" "${arch}") +``` + +After: + +```bash +if zfs::layer_chain_active; then + cache_key=$(zfs::_save_and_install_from_daemon "${uri}" "${arch}") +else + cache_key=$(zfs::_extract_and_install_from_daemon "${uri}" "${arch}") +fi +``` + +- [ ] **Step 2.2: `create_from_pointer` daemon-recovery dispatch** + +Replace the existing `dockerd://*|podman://*)` arm in the recovery `case`: + +```bash +dockerd://*|podman://*) + if zfs::layer_chain_active; then + fresh_config_sha=$(zfs::_save_and_install_from_daemon "${uri}" "${arch}") + else + fresh_config_sha=$(zfs::_extract_and_install_from_daemon "${uri}" "${arch}") + fi + ;; +``` + +- [ ] **Step 2.3: Commit** + +```sh +git add src/storage_zfs.sh +git commit -s -m "storage_zfs: dispatch chain mode in daemon-pointer import and recovery" +``` + +--- + +### Task 3: Verify on smoke cluster + +Spark-ctrl as today (Plan G's smoke target), with a docker daemon installed locally (or use the existing daemon if present). The DGX Spark compute nodes also work but pool space is larger. + +- [ ] **Step 3.1: Single-image daemon import, chain mode** + +```sh +sudo docker pull alpine:3.21 +sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/a.sqsh dockerd://alpine:3.21' +sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot create -n a /tmp/a.sqsh' +sudo zfs list -r tank/enroot/data +sudo cat /var/lib/enroot/a/etc/os-release | head -3 +``` + +Expected: `.layers/` dataset(s) created (one per layer in the saved image), `.templates/` exists, rootfs has alpine os-release + synthetic `0/` files. + +- [ ] **Step 3.2: Cross-image base-layer dedup, daemon path** + +```sh +sudo docker pull python:3.13-alpine3.21 +sudo docker pull node:22-alpine3.21 +for img in python:3.13-alpine3.21 node:22-alpine3.21; do + sudo bash -c "ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/${img%%:*}.sqsh dockerd://${img}" + sudo bash -c "ENROOT_ZFS_LAYER_CHAIN=y enroot create -n ${img%%:*} /tmp/${img%%:*}.sqsh" +done +sudo zfs list -r -d 1 -o name,origin tank/enroot/data/.layers +``` + +Expected: alpine 3.21 base layer dataset exists once and is the origin of TWO chains (one from python, one from node), same Y-shape as Plan G's `docker://` smoke. 
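+
+Illustratively, the listing should have this Y-shape (dataset names and digests below are placeholders, not captured output):
+
+```
+NAME                                          ORIGIN
+tank/enroot/data/.layers/<alpine-base-sha>    -
+tank/enroot/data/.layers/<python-top-sha>     tank/enroot/data/.layers/<alpine-base-sha>@done
+tank/enroot/data/.layers/<node-top-sha>       tank/enroot/data/.layers/<alpine-base-sha>@done
+```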
+ +- [ ] **Step 3.3: Plan G regression — `docker://` still works in chain mode** + +```sh +sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/r.sqsh docker://redis:alpine3.21' +sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot create -n r /tmp/r.sqsh' +``` + +Expected: works as before. `_pull_and_install_template` still uses `_install_layer_chain`. No regression. + +- [ ] **Step 3.4: Flat-export regression — daemon path with flag unset** + +```sh +sudo enroot remove -f a python node r +sudo zfs list -H -o name -r tank/enroot/data | tail -n +2 | tac | xargs -r -n1 sudo zfs destroy -r +sudo bash -c 'enroot import -o /tmp/a.sqsh dockerd://alpine:3.21' +sudo zfs list -r tank/enroot/data # NO .layers/ dataset expected +sudo bash -c 'enroot create -n a /tmp/a.sqsh' +``` + +Expected: no `.layers/` namespace; `_install_template_from_dir` populates `.templates/` directly, identical to today's flat-export behavior. + +- [ ] **Step 3.5: Eviction recovery via daemon-chain re-save** + +```sh +# Create + then destroy template, keep layer-cache +sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/a.sqsh dockerd://alpine:3.21' +sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot create -n a /tmp/a.sqsh' +sudo enroot remove -f a +sudo zfs list -H -o name -r tank/enroot/data/.templates | tail -n +2 | xargs -r -n1 sudo zfs destroy -r + +# Re-create from the cached pointer file — recovery path triggers _save_and_install +sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot create -n a /tmp/a.sqsh' +sudo zfs list -r tank/enroot/data +``` + +Expected: `Re-pulling from dockerd://alpine:3.21` log line; layer datasets reused (no `Building layer` messages); template re-cloned from chain leaf. + +--- + +### Task 4: Documentation + +**Files:** +- Modify: `doc/zfs.md` — drop the "docker:// only" caveat from the `ENROOT_ZFS_LAYER_CHAIN` knob description; mention the disk-pressure tradeoff (`${engine} save` peak disk ~2× flat-export). +- Modify: `CLAUDE.md` — append a note that Plan H is implemented under PR #?? if relevant. +- Modify: `doc/plans/README.md` — add Plan H row, dependency `G`, recommended landing position after G. + +- [ ] **Step 4.1: Commit and PR** + +```sh +git add doc/zfs.md CLAUDE.md doc/plans/README.md +git commit -s -m "Mark Plan H (per-layer chain for daemon URIs) as implemented" +git push -u origin feature/zfs-h-daemon-chain +gh pr create --repo zeroae/enroot --base zenroot/main --head feature/zfs-h-daemon-chain \ + --title "Plan H: per-layer ZFS clone chain for dockerd:// / podman:// URIs" +``` + +--- + +## Self-review checklist + +- [ ] Default-off: `ENROOT_ZFS_LAYER_CHAIN=` unset leaves the daemon flat-export path unchanged. T3.4 covers this. +- [ ] No regression on `docker://` chain mode (Plan G). T3.3 covers this. +- [ ] Daemon-cross-image dedup works at the `.layers/` level when local images share base layer content. T3.2 covers this. +- [ ] Eviction recovery re-uses cached layer datasets and only rebuilds the template clone-of-leaf. T3.5 covers this. +- [ ] Layer-digest computation is content-addressed (sha256 of `layer.tar`) so engine-specific layer-id formats (legacy v1 random IDs vs newer content-addressed) don't matter for the cache key. +- [ ] Synthetic `0/` is built from the same `${engine} inspect` output as the existing daemon path, so per-image config (rc/fstab/environment) matches what users get today. 
+ +## Known limitations + +- **No cross-source dedup with `docker://`.** A layer pulled via `docker://alpine` and the same layer extracted via `dockerd://alpine` produce different `.layers/` datasets because the registry blob is gzip/zstd-compressed and `docker save`'s `layer.tar` is uncompressed. Same content, different sha256. We could add a "compressed-and-uncompressed sha both stored as user properties" recovery scheme, but it'd be substantial added complexity for a thin slice of cases. +- **`docker save` disk pressure.** Streaming `${engine} save | tar -x` avoids the full saved-tar landing on disk, but the per-layer `tar -xf /layer.tar` step does need each `layer.tar` to be on disk during extraction (it's not a pipe). Peak disk is ~1× the saved-tar size (which is roughly the same as the merged image size, since `docker save` doesn't compress). For very large images (multi-GB ML containers) this matters. Today's flat-export streams in one pass with no intermediate; Plan H trades that for cross-image dedup. +- **Engine compatibility.** Plan H assumes Docker Image Format v1.1+ (`manifest.json` at archive root, `Layers` array as paths to `layer.tar` files). `podman save` defaults to docker format; OCI archive format (`podman save --format oci-archive`) has a different layout and is **not supported**. Document the assumption. +- **Image config sha vs registry config sha.** Daemon-side `${engine} inspect '{{.Id}}'` is the daemon's image-id; for images pulled from a registry it usually matches the registry's image-config-sha256, but for locally-built images (Dockerfile) it's daemon-local. That's the same situation as today's flat-export path. + +## Out of scope + +- Replacing the flat-export path (Plan H is purely additive; flag unset preserves byte-for-byte behavior). +- Cross-source `.layers/` dedup between `docker://` and `dockerd://`. +- OCI archive format support (`podman save --format oci-archive`). +- Streaming `${engine} save` directly into per-layer extraction without an intermediate tar dump (tar tools don't natively support "extract just this nested archive from a stream" — would require a custom parser). +- Per-layer-aware digest stamping (e.g. `enroot:image-source=daemon|registry` properties to avoid surprising users when the same logical image has two different layer chains depending on import path). Useful diagnostic but not load-bearing. + +## Execution Handoff + +Same options as Plan A. diff --git a/doc/plans/README.md b/doc/plans/README.md index 58d3095..6d29a1c 100644 --- a/doc/plans/README.md +++ b/doc/plans/README.md @@ -11,15 +11,16 @@ Plans for landing the optional ZFS storage backend designed in [`../zfs.md`](../ | E. Ephemeral start ZFS path — substitute `squashfuse + overlay` with throwaway clone | [2026-04-29-zfs-e-ephemeral-start.md](2026-04-29-zfs-e-ephemeral-start.md) | A | | F. Docker layer-stack ZFS path — lift `ENROOT_NATIVE_OVERLAYFS=y` requirement on ZFS hosts | [2026-04-29-zfs-f-docker-load.md](2026-04-29-zfs-f-docker-load.md) | A | | G. Per-layer ZFS clone chain (opt-in `ENROOT_ZFS_LAYER_CHAIN=y`) — cross-image layer dedup at the dataset level | [2026-05-01-zfs-g-layer-chain.md](2026-05-01-zfs-g-layer-chain.md) | F | +| H. Extend chain mode to `dockerd://` / `podman://` URIs via `${engine} save` | [2026-05-06-zfs-h-daemon-chain.md](2026-05-06-zfs-h-daemon-chain.md) | G | ``` A ─┬─> B ├─> C ─> D ├─> E - └─> F ─> G + └─> F ─> G ─> H ``` -Recommended landing order: **A → E → F → B → C → D → G**. 
A is the foundation; E/F give the most user-visible wins next; B improves cache economics; C/D add transport options; G is an opt-in optimization on top of F. +Recommended landing order: **A → E → F → B → C → D → G → H**. A is the foundation; E/F give the most user-visible wins next; B improves cache economics; C/D add transport options; G is an opt-in optimization on top of F; H extends G's reuse story to daemon-local images. ## Conventions used by these plans From cc167830adf3151c9677ffda2c68b9a01ba14fe7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Patrick=20Sodr=C3=A9?= Date: Wed, 6 May 2026 06:52:44 -0400 Subject: [PATCH 13/13] Revert "docs: implementation plan for chain mode on daemon URIs (Plan H)" This reverts commit 17b8bac8dd6a5a93b49b40448d0d869c64e8ee97. --- doc/plans/2026-05-06-zfs-h-daemon-chain.md | 382 --------------------- doc/plans/README.md | 5 +- 2 files changed, 2 insertions(+), 385 deletions(-) delete mode 100644 doc/plans/2026-05-06-zfs-h-daemon-chain.md diff --git a/doc/plans/2026-05-06-zfs-h-daemon-chain.md b/doc/plans/2026-05-06-zfs-h-daemon-chain.md deleted file mode 100644 index 2489da9..0000000 --- a/doc/plans/2026-05-06-zfs-h-daemon-chain.md +++ /dev/null @@ -1,382 +0,0 @@ -# ZFS Backend Plan H: Per-layer Chain for Daemon URIs (`dockerd://`, `podman://`) - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Extend Plan G's per-layer ZFS clone chain (`ENROOT_ZFS_LAYER_CHAIN=y`) to `dockerd://` and `podman://` URIs so daemon-local images get the same cross-image base-layer dedup that registry-pulled `docker://` images get today. Preserves Plan G's default-off behavior — when the flag is unset (or set to anything but `y`), daemon imports continue to use the existing `${engine} export | tar -x` flat path unchanged. - -**Why:** Plan G's reuse story ("a debian-bookworm base used by both `python:slim` and `node:slim` is stored once") only kicks in for `docker://`. Sites that build images locally (CI runners, devboxes) and import via `dockerd://` / `podman://` see no on-disk dedup — every image gets its own full template even when 80% of the rootfs is shared with another local image. For HPC clusters with engine-local image registries this matters as much as the registry case. - -## Architecture - -The daemon-URI flat-export path (`zfs::_extract_and_install_from_daemon` in Plan F's daemon-pointer follow-up) uses: - -``` -${engine} create --name -${engine} export | tar -x -C rootfs/ # flattens; no per-layer information survives -``` - -`docker export` (and `podman export`) is a *flatten* operation — the daemon walks the layered overlay and produces a single rootfs tarball. We can't recover the layer structure from that output. - -`docker save` (and `podman save`) is a *layer-preserving* operation. Its output is a tar archive containing: - -``` -manifest.json # Docker Image Format v1.2 — describes layer order -.json # the image config blob -/ # one dir per layer - layer.tar # the layer's content as an uncompressed tar - json - VERSION -/ - ... -repositories # legacy index (ignored) -``` - -Inside `manifest.json`: -```json -[{ - "Config": ".json", - "RepoTags": ["myimage:tag"], - "Layers": [ - "/layer.tar", // BASE first - "/layer.tar", - ... // TOP last - ] -}] -``` - -Plan H's daemon-chain installer: - -1. 
`${engine} save ` to a streaming pipe (no temp tar archive on disk; pipe straight into `tar -x` at a temp dir). -2. Parse `manifest.json` from the extracted tree → `config` blob path + ordered `Layers` list. -3. For each `/layer.tar`: - - `sha256sum layer.tar` → cache key for `/.layers/`. - - Extract the tarball into directory `i/` (1-based, BASE = 1, TOP = N) — same convention `docker::_prepare_layers` uses for the registry path. - - Run `enroot-aufs2ovlfs i/` to convert AUFS whiteouts to overlayfs form. -4. Build the synthetic config layer `0/` from the parsed `.json` (or equivalently `${engine} inspect`) using `docker::configure` — identical to the registry path. -5. Call `zfs::_install_layer_chain "${cache_key}" "${layer_count}" "${unpriv}" "${digests[@]}"`. - -The cache key (`cache_key`) stays the image-config-sha256 (same as today's daemon path) so a daemon image and the same image pulled via `docker://` *can* share a template if their config shas happen to match. They won't share `.layers/` datasets across sources because `sha256(save-layer.tar)` ≠ `sha256(registry-blob)` (the registry blob is compressed; `docker save` emits uncompressed tarballs); but within the daemon path, multiple local images with shared base layers will dedup at the `.layers/` level. - -### Coexistence - -- **Flag off (`ENROOT_ZFS_LAYER_CHAIN=` unset/empty):** existing flat `${engine} export | tar -x → _install_template_from_dir` path runs unchanged. Default behavior preserved byte-for-byte. -- **Flag on, daemon URI:** Plan H's `_save_and_install_from_daemon` path runs. Higher peak disk (`docker save` writes a tarball pipe; the per-layer extraction needs ~2× the merged image size at peak vs. flat-export's 1×) but cheaper on subsequent imports of images with shared layers. -- **Flag on, `docker://` URI:** Plan G runs unchanged (this plan does not touch the registry path). -- **Fast path:** the existing "template `@pristine` already exists, reuse it" check runs *before* the chain dispatch, so a daemon image whose config-sha is already cached (from any earlier import via any source) skips the save entirely. - -## Files - -- **Modify:** `src/storage_zfs.sh` — add `zfs::_save_and_install_from_daemon` (chain-mode counterpart of `_extract_and_install_from_daemon`), wire chain-mode dispatch into `import_daemon_pointer` and `create_from_pointer`'s daemon recovery branch. -- **Modify:** `doc/zfs.md`, `CLAUDE.md` — flip the `ENROOT_ZFS_LAYER_CHAIN` knob description to drop the "docker:// only" caveat. - -`docker.sh`, `runtime.sh`, the existing flat-export daemon path, Plan G's helpers (`_apply_layer_payload`, `_build_layer`, `_install_layer_chain`), and `clone_container` are **not** modified. - -**Depends on:** Plans A, B, F (daemon pointer follow-up), G. - -**Prerequisite host setup:** Same as Plan G plus `${engine} save` permission. Docker Desktop and rootless podman both support `save`. Smoke target: spark-ctrl (Pi 5, OpenZFS 2.4.1, 3.75G test pool) with a docker daemon installed. - ---- - -### Task 1: Add `zfs::_save_and_install_from_daemon` - -**Files:** -- Modify: `src/storage_zfs.sh` — append after `_extract_and_install_from_daemon`. - -- [ ] **Step 1.1: Add the helper** - -Subshell function (parens) so cwd and EXIT trap stay scoped, matching `_extract_and_install_from_daemon`. Inputs match the existing helper (`uri`, `arch`); outputs the resolved image-config-sha256 on stdout (same contract). 
- -```bash -zfs::_save_and_install_from_daemon() ( - local -r uri="$1" arch="$2" - local image= tmpdir= engine= cache_key= unpriv= image_id= - local config_blob= layer_count= i= - local -a layer_paths=() digests=() - - set -euo pipefail - - case "${uri}" in - dockerd://*) engine="docker" ;; - podman://*) engine="podman" ;; - *) common::err "_save_and_install_from_daemon: not a daemon URI: ${uri}" ;; - esac - - common::checkcmd jq sha256sum "${engine}" tar - - local -r reg_image="[[:alnum:]/._:-]+" - if [[ "${uri}" =~ ^[[:alpha:]]+://(${reg_image})$ ]]; then - image="${BASH_REMATCH[1]}" - else - common::err "Invalid image reference: ${uri}" - fi - - image_id=$("${engine}" inspect --format '{{.Id}}' "${image}") \ - || common::err "${engine} inspect ${image} failed" - [[ "${image_id}" =~ ^sha256:[0-9a-f]{64}$ ]] \ - || common::err "${engine} returned unexpected image ID: ${image_id}" - cache_key="${image_id#sha256:}" - - trap 'common::rmall "${tmpdir}" 2> /dev/null' EXIT - tmpdir=$(common::mktmpdir enroot) - common::chdir "${tmpdir}" - - common::log INFO "Saving ${engine} image and extracting layers..." NL - "${engine}" save "${image}" | tar -x - - [ -f manifest.json ] \ - || common::err "${engine} save did not produce manifest.json (unsupported image format?)" - - # manifest.json shape: [{"Config": "...", "Layers": ["/layer.tar", ...]}] - config_blob=$(common::jq -r '.[0].Config' manifest.json) - [ -n "${config_blob}" ] && [ -f "${config_blob}" ] \ - || common::err "manifest.json missing Config blob: ${config_blob}" - - readarray -t layer_paths < <(common::jq -r '.[0].Layers[]' manifest.json) - layer_count="${#layer_paths[@]}" - [ "${layer_count}" -gt 0 ] \ - || common::err "manifest.json declares no Layers" - - # Per-layer extraction: dir 1/ = layer_paths[0] = BASE (manifest.json - # ordering puts BASE first, TOP last — opposite of docker::_download's - # reversed convention). Compute digests as sha256 of each layer.tar so - # cache keys are content-addressed regardless of the engine's - # internal layer-id format (legacy v1 uses random IDs; newer formats - # use content addresses). - common::log INFO "Computing layer digests and extracting..." - digests=() - for ((i=0; i config - docker::configure "${PWD}/0" config "${arch}" - - if [ "${EUID}" -ne 0 ]; then - unpriv=y - fi - - # Plan G's chain installer expects digests in "TOP first, BASE last" - # ordering (the reversed convention of docker::_download). Reverse - # ours to match. _install_layer_chain iterates N-1 down to 0 to - # build BASE-first on disk, so the on-disk leaf is digests[0] = TOP. - local -a digests_reversed=() - for ((i=layer_count-1; i>=0; i--)); do - digests_reversed+=("${digests[i]}") - done - - # Wait — directory numbering must match the digest at the same chain - # position. Plan G's _build_layer takes (digest, prev_layer, layer_dir). - # With our reversed digests array (digests_reversed[0]=TOP=layer_count, - # digests_reversed[N-1]=BASE=1), we need layer_dir to follow the - # same reversal so digests_reversed[k] always lines up with the - # directory whose tarball produced that digest. - # - # Easier: rename our extracted directories to match the registry - # path's "dir 1 = TOP" convention. Walk i=1..N (currently 1=BASE), - # rename "i/" to "tmp_i/" and then "tmp_i/" to "(N-i+1)/". One pass. 
- for ((i=1; i<=layer_count; i++)); do - mv "${i}" "tmp_${i}" - done - for ((i=1; i<=layer_count; i++)); do - mv "tmp_${i}" "$((layer_count - i + 1))" - done - - zfs::_install_layer_chain "${cache_key}" "${layer_count}" "${unpriv}" "${digests_reversed[@]}" > /dev/null - - printf "%s" "${cache_key}" -) -``` - -- [ ] **Step 1.2: Commit** - -```sh -git add src/storage_zfs.sh -git commit -s -m "storage_zfs: add _save_and_install_from_daemon for chain-mode daemon imports" -``` - ---- - -### Task 2: Wire chain-mode dispatch in `import_daemon_pointer` and `create_from_pointer` - -Two daemon entry points: the initial import (`import_daemon_pointer`) and the eviction-recovery re-pull inside `create_from_pointer`. Both today call `_extract_and_install_from_daemon`. Add a single dispatch in each based on `zfs::layer_chain_active`. - -**Files:** -- Modify: `src/storage_zfs.sh` - -- [ ] **Step 2.1: `import_daemon_pointer` dispatch** - -Before: - -```bash -cache_key=$(zfs::_extract_and_install_from_daemon "${uri}" "${arch}") -``` - -After: - -```bash -if zfs::layer_chain_active; then - cache_key=$(zfs::_save_and_install_from_daemon "${uri}" "${arch}") -else - cache_key=$(zfs::_extract_and_install_from_daemon "${uri}" "${arch}") -fi -``` - -- [ ] **Step 2.2: `create_from_pointer` daemon-recovery dispatch** - -Replace the existing `dockerd://*|podman://*)` arm in the recovery `case`: - -```bash -dockerd://*|podman://*) - if zfs::layer_chain_active; then - fresh_config_sha=$(zfs::_save_and_install_from_daemon "${uri}" "${arch}") - else - fresh_config_sha=$(zfs::_extract_and_install_from_daemon "${uri}" "${arch}") - fi - ;; -``` - -- [ ] **Step 2.3: Commit** - -```sh -git add src/storage_zfs.sh -git commit -s -m "storage_zfs: dispatch chain mode in daemon-pointer import and recovery" -``` - ---- - -### Task 3: Verify on smoke cluster - -Spark-ctrl as today (Plan G's smoke target), with a docker daemon installed locally (or use the existing daemon if present). The DGX Spark compute nodes also work but pool space is larger. - -- [ ] **Step 3.1: Single-image daemon import, chain mode** - -```sh -sudo docker pull alpine:3.21 -sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/a.sqsh dockerd://alpine:3.21' -sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot create -n a /tmp/a.sqsh' -sudo zfs list -r tank/enroot/data -sudo cat /var/lib/enroot/a/etc/os-release | head -3 -``` - -Expected: `.layers/` dataset(s) created (one per layer in the saved image), `.templates/` exists, rootfs has alpine os-release + synthetic `0/` files. - -- [ ] **Step 3.2: Cross-image base-layer dedup, daemon path** - -```sh -sudo docker pull python:3.13-alpine3.21 -sudo docker pull node:22-alpine3.21 -for img in python:3.13-alpine3.21 node:22-alpine3.21; do - sudo bash -c "ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/${img%%:*}.sqsh dockerd://${img}" - sudo bash -c "ENROOT_ZFS_LAYER_CHAIN=y enroot create -n ${img%%:*} /tmp/${img%%:*}.sqsh" -done -sudo zfs list -r -d 1 -o name,origin tank/enroot/data/.layers -``` - -Expected: alpine 3.21 base layer dataset exists once and is the origin of TWO chains (one from python, one from node), same Y-shape as Plan G's `docker://` smoke. - -- [ ] **Step 3.3: Plan G regression — `docker://` still works in chain mode** - -```sh -sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/r.sqsh docker://redis:alpine3.21' -sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot create -n r /tmp/r.sqsh' -``` - -Expected: works as before. `_pull_and_install_template` still uses `_install_layer_chain`. 
No regression. - -- [ ] **Step 3.4: Flat-export regression — daemon path with flag unset** - -```sh -sudo enroot remove -f a python node r -sudo zfs list -H -o name -r tank/enroot/data | tail -n +2 | tac | xargs -r -n1 sudo zfs destroy -r -sudo bash -c 'enroot import -o /tmp/a.sqsh dockerd://alpine:3.21' -sudo zfs list -r tank/enroot/data # NO .layers/ dataset expected -sudo bash -c 'enroot create -n a /tmp/a.sqsh' -``` - -Expected: no `.layers/` namespace; `_install_template_from_dir` populates `.templates/` directly, identical to today's flat-export behavior. - -- [ ] **Step 3.5: Eviction recovery via daemon-chain re-save** - -```sh -# Create + then destroy template, keep layer-cache -sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/a.sqsh dockerd://alpine:3.21' -sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot create -n a /tmp/a.sqsh' -sudo enroot remove -f a -sudo zfs list -H -o name -r tank/enroot/data/.templates | tail -n +2 | xargs -r -n1 sudo zfs destroy -r - -# Re-create from the cached pointer file — recovery path triggers _save_and_install -sudo bash -c 'ENROOT_ZFS_LAYER_CHAIN=y enroot create -n a /tmp/a.sqsh' -sudo zfs list -r tank/enroot/data -``` - -Expected: `Re-pulling from dockerd://alpine:3.21` log line; layer datasets reused (no `Building layer` messages); template re-cloned from chain leaf. - ---- - -### Task 4: Documentation - -**Files:** -- Modify: `doc/zfs.md` — drop the "docker:// only" caveat from the `ENROOT_ZFS_LAYER_CHAIN` knob description; mention the disk-pressure tradeoff (`${engine} save` peak disk ~2× flat-export). -- Modify: `CLAUDE.md` — append a note that Plan H is implemented under PR #?? if relevant. -- Modify: `doc/plans/README.md` — add Plan H row, dependency `G`, recommended landing position after G. - -- [ ] **Step 4.1: Commit and PR** - -```sh -git add doc/zfs.md CLAUDE.md doc/plans/README.md -git commit -s -m "Mark Plan H (per-layer chain for daemon URIs) as implemented" -git push -u origin feature/zfs-h-daemon-chain -gh pr create --repo zeroae/enroot --base zenroot/main --head feature/zfs-h-daemon-chain \ - --title "Plan H: per-layer ZFS clone chain for dockerd:// / podman:// URIs" -``` - ---- - -## Self-review checklist - -- [ ] Default-off: `ENROOT_ZFS_LAYER_CHAIN=` unset leaves the daemon flat-export path unchanged. T3.4 covers this. -- [ ] No regression on `docker://` chain mode (Plan G). T3.3 covers this. -- [ ] Daemon-cross-image dedup works at the `.layers/` level when local images share base layer content. T3.2 covers this. -- [ ] Eviction recovery re-uses cached layer datasets and only rebuilds the template clone-of-leaf. T3.5 covers this. -- [ ] Layer-digest computation is content-addressed (sha256 of `layer.tar`) so engine-specific layer-id formats (legacy v1 random IDs vs newer content-addressed) don't matter for the cache key. -- [ ] Synthetic `0/` is built from the same `${engine} inspect` output as the existing daemon path, so per-image config (rc/fstab/environment) matches what users get today. - -## Known limitations - -- **No cross-source dedup with `docker://`.** A layer pulled via `docker://alpine` and the same layer extracted via `dockerd://alpine` produce different `.layers/` datasets because the registry blob is gzip/zstd-compressed and `docker save`'s `layer.tar` is uncompressed. Same content, different sha256. We could add a "compressed-and-uncompressed sha both stored as user properties" recovery scheme, but it'd be substantial added complexity for a thin slice of cases. 
-- **`docker save` disk pressure.** Streaming `${engine} save | tar -x` avoids the full saved-tar landing on disk, but the per-layer `tar -xf /layer.tar` step does need each `layer.tar` to be on disk during extraction (it's not a pipe). Peak disk is ~1× the saved-tar size (which is roughly the same as the merged image size, since `docker save` doesn't compress). For very large images (multi-GB ML containers) this matters. Today's flat-export streams in one pass with no intermediate; Plan H trades that for cross-image dedup. -- **Engine compatibility.** Plan H assumes Docker Image Format v1.1+ (`manifest.json` at archive root, `Layers` array as paths to `layer.tar` files). `podman save` defaults to docker format; OCI archive format (`podman save --format oci-archive`) has a different layout and is **not supported**. Document the assumption. -- **Image config sha vs registry config sha.** Daemon-side `${engine} inspect '{{.Id}}'` is the daemon's image-id; for images pulled from a registry it usually matches the registry's image-config-sha256, but for locally-built images (Dockerfile) it's daemon-local. That's the same situation as today's flat-export path. - -## Out of scope - -- Replacing the flat-export path (Plan H is purely additive; flag unset preserves byte-for-byte behavior). -- Cross-source `.layers/` dedup between `docker://` and `dockerd://`. -- OCI archive format support (`podman save --format oci-archive`). -- Streaming `${engine} save` directly into per-layer extraction without an intermediate tar dump (tar tools don't natively support "extract just this nested archive from a stream" — would require a custom parser). -- Per-layer-aware digest stamping (e.g. `enroot:image-source=daemon|registry` properties to avoid surprising users when the same logical image has two different layer chains depending on import path). Useful diagnostic but not load-bearing. - -## Execution Handoff - -Same options as Plan A. diff --git a/doc/plans/README.md b/doc/plans/README.md index 6d29a1c..58d3095 100644 --- a/doc/plans/README.md +++ b/doc/plans/README.md @@ -11,16 +11,15 @@ Plans for landing the optional ZFS storage backend designed in [`../zfs.md`](../ | E. Ephemeral start ZFS path — substitute `squashfuse + overlay` with throwaway clone | [2026-04-29-zfs-e-ephemeral-start.md](2026-04-29-zfs-e-ephemeral-start.md) | A | | F. Docker layer-stack ZFS path — lift `ENROOT_NATIVE_OVERLAYFS=y` requirement on ZFS hosts | [2026-04-29-zfs-f-docker-load.md](2026-04-29-zfs-f-docker-load.md) | A | | G. Per-layer ZFS clone chain (opt-in `ENROOT_ZFS_LAYER_CHAIN=y`) — cross-image layer dedup at the dataset level | [2026-05-01-zfs-g-layer-chain.md](2026-05-01-zfs-g-layer-chain.md) | F | -| H. Extend chain mode to `dockerd://` / `podman://` URIs via `${engine} save` | [2026-05-06-zfs-h-daemon-chain.md](2026-05-06-zfs-h-daemon-chain.md) | G | ``` A ─┬─> B ├─> C ─> D ├─> E - └─> F ─> G ─> H + └─> F ─> G ``` -Recommended landing order: **A → E → F → B → C → D → G → H**. A is the foundation; E/F give the most user-visible wins next; B improves cache economics; C/D add transport options; G is an opt-in optimization on top of F; H extends G's reuse story to daemon-local images. +Recommended landing order: **A → E → F → B → C → D → G**. A is the foundation; E/F give the most user-visible wins next; B improves cache economics; C/D add transport options; G is an opt-in optimization on top of F. ## Conventions used by these plans