diff --git a/CLAUDE.md b/CLAUDE.md
index 0fe01e6..6c7681a 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -70,7 +70,7 @@ When debugging container behavior, the order is: image `/etc/{rc,fstab,environme
 ## Active design proposals
 
 - **`doc/zfs.md`** — optional ZFS storage backend (`ENROOT_STORAGE_BACKEND=zfs`). Replaces `unsquashfs`-per-create with extract-once-then-`zfs clone`. Adds a `.zfs` (zfs send stream) image format and a `zfs://host/NAME` transport scheme alongside today's `.sqsh`. Introduces a shared template cache with a live/warm/cold lifecycle (knobs: `ENROOT_TEMPLATE_WARM_SECONDS`, `ENROOT_TEMPLATE_PRESSURE_THRESHOLD`; eviction is implicit on `create`, no daemon, no `enroot gc` command). Default backend (`dir`) is unchanged.
-- **`doc/plans/`** — six implementation plans (A–F) breaking the ZFS backend into independently-landable slices. Start with `doc/plans/README.md` for the index and recommended landing order (A → E → F → B → C → D). Plans add a new sourced module `src/storage_zfs.sh` (under a `zfs::` namespace) and branch in `src/runtime.sh`, `src/docker.sh` on `ENROOT_STORAGE_BACKEND`. **All six plans merged** on `zenroot/main` (PRs [#1](https://github.com/zeroae/enroot/pull/1), [#2](https://github.com/zeroae/enroot/pull/2), [#3](https://github.com/zeroae/enroot/pull/3), [#5](https://github.com/zeroae/enroot/pull/5), [#7](https://github.com/zeroae/enroot/pull/7), and Plan D in review).
+- **`doc/plans/`** — implementation plans for the ZFS backend, broken into independently-landable slices. Start with `doc/plans/README.md` for the index and recommended landing order (A → E → F → B → C → D → G). Plans add a new sourced module `src/storage_zfs.sh` (under a `zfs::` namespace) and branch in `src/runtime.sh`, `src/docker.sh` on `ENROOT_STORAGE_BACKEND`. **Plans A–F merged** on `zenroot/main` (PRs [#1](https://github.com/zeroae/enroot/pull/1), [#2](https://github.com/zeroae/enroot/pull/2), [#3](https://github.com/zeroae/enroot/pull/3), [#5](https://github.com/zeroae/enroot/pull/5), [#7](https://github.com/zeroae/enroot/pull/7), [#8](https://github.com/zeroae/enroot/pull/8)). Plan G (per-layer clone chain, opt-in via `ENROOT_ZFS_LAYER_CHAIN=y`, [issue #4](https://github.com/zeroae/enroot/issues/4)) is layered on top of F.
 
 ## Conventions
diff --git a/doc/plans/2026-05-01-zfs-g-layer-chain.md b/doc/plans/2026-05-01-zfs-g-layer-chain.md
new file mode 100644
index 0000000..427260b
--- /dev/null
+++ b/doc/plans/2026-05-01-zfs-g-layer-chain.md
@@ -0,0 +1,421 @@
+# ZFS Backend Plan G: Per-layer ZFS Clone Chain (opt-in)
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** When `ENROOT_STORAGE_BACKEND=zfs` AND `ENROOT_ZFS_LAYER_CHAIN=y`, populate the Docker template cache via a per-layer `zfs clone` chain instead of a single merged extract. Each registry layer becomes its own `<store>/.layers/<digest>` dataset, layered as cloned descendants of the layer below. The leaf is then cloned into `<store>/.templates/<cache_key>` to preserve the Plan F template shape. With the flag unset, Plan F's single-merge path runs unchanged.
+
+**Why:** Plan F's single-merge design re-extracts every layer for every distinct image. For HPC and CI hosts that pull many images sharing a Debian/Alpine/CUDA base, this wastes disk, CPU, and bandwidth. Per-layer chains buy back:
+
+1. 
Cross-image layer dedup at the dataset level — two images sharing a base store the base bytes once. (`dedup=on` recovers this in Plan F at ~5 GB RAM per TB; per-layer datasets dedup for free.)
2. Cheap incremental re-pull — when the top layer of a tag changes, only that layer is rebuilt; lower-layer datasets are reused.
3. Layer-granular cache invalidation and inspection (`zfs list -r <store>/.layers` shows the chain).
4. Quota accounting that matches intuition (shared layers count once).
5. Aligns with Docker's own `zfs` storage driver.

Tradeoff: shell-side whiteout/opaque-dir merging is required because each at-rest layer dataset must contain the merged-up-to-this-layer rootfs (overlayfs only does the merge at mount time). The kernel's overlay engine is *not* the merge engine in this path — `enroot-aufs2ovlfs` already converted whiteouts to overlayfs form during `_prepare_layers`, but we have to apply them ourselves between clone steps.

**Architecture:** `docker::_prepare_layers` already extracts each registry layer into a per-layer directory (`1/`, `2/`, … `N/`) and runs `enroot-aufs2ovlfs` on each, producing overlayfs-style trees: `mknod 0:0` char devices for whiteouts and `trusted.overlay.opaque=y` xattrs for opaque dirs. With `ENROOT_SET_USER_XATTRS=y` (already set on the load path) we also get a parallel `user.overlay.opaque=y` for unprivileged paths.

For Plan G, after `_prepare_layers` returns, instead of one overlay-mount + tar-pipe into a single template (Plan F), we walk the layer list bottom-up:

```
.layers/<d1>           ← zfs create -u                  (apply layer 1 contents)
.layers/<d2>           ← zfs clone .layers/<d1>@done    (apply layer 2 on top)
.layers/<d3>           ← zfs clone .layers/<d2>@done    (apply layer 3 on top)
…
.layers/<dN>           ← zfs clone .layers/<dN-1>@done  (apply layer N on top)
.templates/<cache_key> ← zfs clone .layers/<dN>@done    (clone leaf as template)
```

Each `.layers/<digest>` dataset's `@done` snapshot is reused on subsequent imports of any image whose chain prefix matches. The final `.templates/<cache_key>@pristine` snapshot is identical in shape to Plan F's, so `zfs::clone_container`, the pointer-format import path, eviction recovery, `enroot export`, and `enroot import zfs://` all work unchanged.

**Why no `zfs promote`:** Promoting the leaf into the templates dataset inverts the chain — layer datasets become clones of the template, which then owns the data. That works for one image but produces a complex, image-private topology that defeats the whole point of cross-image sharing. Plan G keeps layer datasets as immutable origins and templates as ordinary clones. The simple invariant: *layers are shared and never mutated; templates are per-image clones; ZFS refuses to destroy a layer dataset while any descendant clone exists, so a layer can never be reclaimed out from under a template that still uses it.*

**Coexistence with Plan F:**

- `ENROOT_ZFS_LAYER_CHAIN=` (unset, empty, or anything but `y`): Plan F's `_install_template_from_layers` runs unchanged. Default behavior preserved byte-for-byte.
- `ENROOT_ZFS_LAYER_CHAIN=y`: dispatch to chain mode. Same dispatch is hit from `docker::load` and from `_pull_and_install_template` (the puller used by pointer-format import and eviction recovery), so all callers see chain-mode templates when the flag is on.
- The fast path "template `@pristine` already exists, reuse it" is hit *before* the chain/no-chain dispatch. Templates produced under one mode are reused under the other without rebuild — only the *fill* mechanism differs on miss.
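
The invariant is directly observable from the ZFS CLI. A hypothetical inspection session (dataset names invented for illustration — only the `origin` property and the destroy refusal are load-bearing here):

```sh
# A chain-mode template is an ordinary clone of the leaf layer's @done:
zfs get -H -o value origin tank/enroot/data/.templates/<cache_key>
#   -> tank/enroot/data/.layers/<leaf-digest>@done
# Destroying a shared layer is refused while anything still descends from it:
zfs destroy -r tank/enroot/data/.layers/<base-digest>
#   -> cannot destroy: filesystem has dependent clones
```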

**Scope: `docker://` URIs only.** Plan G applies to registry-pulled images that go through `docker::_prepare_layers` (which produces the per-layer directories Plan G chains over). Daemon-local URIs (`dockerd://`, `podman://`) are *silently unaffected* by `ENROOT_ZFS_LAYER_CHAIN=y` — they go through `zfs::_extract_and_install_from_daemon`, which uses `${engine} export | tar -x` to produce a single flat rootfs (the daemon has already merged the layers internally; `docker export` is a flatten operation, not a layer-preserving one). That path goes through `zfs::_install_template_from_dir` and stays untouched. Bringing chain mode to daemon URIs is feasible but requires switching from `docker export` to `docker save` (which writes a tar archive containing per-layer tarballs plus a `manifest.json` describing layer order) — see Out of scope below.

**Depends on:** Plans A, B, F (template lifecycle, sweep, ENOSPC retry shape are reused).

**Prerequisite host setup:** Same as Plan F. ZFS user delegation must include `clone`. `promote` is **not** required (Plan G doesn't promote). Whiteout/xattr work runs inside `enroot-nsenter --user --remap-root --mount`, same as Plan F's merge step.

**Test images:** `docker://alpine` (1 layer, smoke); `docker://debian:stable-slim` (multi-layer with whiteouts); `docker://python:3.12-slim` and `docker://node:20-slim` (debian-bookworm-based — base layer must be physically shared).

---

## Files

- **Modify:** `src/storage_zfs.sh` — add `zfs::layer_chain_active`, `zfs::_install_layer_chain`, `zfs::_apply_layer_payload`, dispatch in `zfs::docker_install_from_layers`.
- **Modify:** `src/docker.sh` (`docker::_prepare_layers`, `docker::load`) — side-emit the ordered layer-digest list to a `./.layers` sidecar file (see Task 2); thread it into `zfs::docker_install_from_layers` when chain mode is active.
- **Modify:** `doc/zfs.md` — document `ENROOT_ZFS_LAYER_CHAIN`, the `.layers/` namespace, dedup semantics, GC notes.
- **Modify:** `CLAUDE.md` — flip the active-design-proposals line.

`docker::configure`, `docker::_download`, the existing dir-backend overlay path, Plan F's `_install_template_from_layers`, the pointer-format paths, and `zfs::clone_container` are **not** modified.

---

### Task 1: Add `zfs::layer_chain_active` predicate

A small gate that callers use before opting into chain mode.

**Files:**
- Modify: `src/storage_zfs.sh` (append, near `zfs::pointer_format_active`)

- [ ] **Step 1.1: Add the predicate**

```bash
# Returns 0 iff the ZFS backend is active AND ENROOT_ZFS_LAYER_CHAIN=y.
# Callers gate the per-layer-clone-chain template-fill path on this.
# Default-off; the unset / "" / "n" cases all fall through to Plan F's
# single-merge path, preserving byte-for-byte behavior.
zfs::layer_chain_active() {
    zfs::enabled || return 1
    [ "${ENROOT_ZFS_LAYER_CHAIN-}" = "y" ]
}
```

- [ ] **Step 1.2: Commit**

```sh
git add src/storage_zfs.sh
git commit -s -m "storage_zfs: add zfs::layer_chain_active predicate"
```

---

### Task 2: Side-emit layer digests from `docker::_prepare_layers`

Plan G needs the ordered list of registry layer content-digests as cache keys for the `<store>/.layers/<digest>` datasets. Today `_prepare_layers` only emits `config\nlayer_count\n` on stdout. Existing callers (Plan F's `_install_template_from_layers`, `docker::import`, `docker::load`'s dir branch) read exactly two lines via `common::read` and must keep working unchanged. 
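
For reference, the consumer shape that must not break looks roughly like this (a sketch — the exact `common::read` call sites in `src/docker.sh` may differ in detail):

```bash
# Hypothetical two-line consumer of docker::_prepare_layers' stdout
# (argument names invented for illustration):
docker::_prepare_layers "${image}" "${arch}" | {
    common::read -r config        # line 1: image-config sha256 (the cache key)
    common::read -r layer_count   # line 2: N
    # The pipe is dropped here — a third producer line would write into a
    # closed pipe and raise SIGPIPE, the hazard described next.
}
```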

Adding extra stdout lines is risky: a 2-line consumer closes the pipe after its second `read`, causing SIGPIPE on the producer's third printf. Under `set -euo pipefail` in the producer subshell that surfaces as a non-zero exit, breaking existing callers.

The simplest fix: have `_prepare_layers` write the digest list to a sidecar file `./.layers` in its own cwd. Every caller already runs `_prepare_layers` inside a fresh `common::mktmpdir enroot` directory and `common::chdir`s into it, so the sidecar lives inside the per-call temp dir and is cleaned up by the caller's existing EXIT trap. Plan G's chain-mode caller does `readarray -t digests < .layers` after `_prepare_layers` returns. Non-chain callers simply ignore the file.

**Files:**
- Modify: `src/docker.sh`

- [ ] **Step 2.1: Have `_prepare_layers` write `./.layers`**

In `docker::_prepare_layers`, after `_download` has populated `${layers[@]}` and before the existing `printf "%s\n%s\n"` final output, add:

```bash
printf "%s\n" "${layers[@]}" > .layers
```

The file is one digest per line in `_download`'s order (top layer first, base last — `_download` reverses the registry manifest order, so the chain installer walks the list backwards; see Task 4). It sits in the caller's temp dir, gets removed when the temp dir is.

- [ ] **Step 2.2: Commit**

```sh
git add src/docker.sh
git commit -s -m "docker: side-emit layer digests to ./.layers in _prepare_layers"
```

---

### Task 3: Add `zfs::_apply_layer_payload`

The bash payload that runs inside `enroot-nsenter --user --remap-root --mount` to apply one layer dir on top of one target dir. Returns a string suitable for `bash -c "${payload}"` from the chain installer.

Three phases:
1. **Opaque-dir clearing.** Walk the layer's directories; for each with `trusted.overlay.opaque=y` xattr, `rm -rf` the children (not the dir itself) of the corresponding dir in the target.
2. **Whiteout deletion.** For each char device 0:0 in the layer, `rm -rf` the corresponding path in the target.
3. **Content tar-pipe.** Tar the layer's contents into the target with `--xattrs --xattrs-include='*' --acls`, excluding char devices via an exclude list built from phase 2.

Why a payload string: the chain installer launches one `enroot-nsenter` per layer (or batches them), and the inside-userns work is straightforward bash. Keeping it as a single payload string avoids per-layer fork overhead beyond the necessary `enroot-nsenter` wrapping.

**Files:**
- Modify: `src/storage_zfs.sh`

- [ ] **Step 3.1: Add the payload generator**

```bash
# Generates the bash payload that applies one layer dir's whiteouts and
# contents on top of one target dir. Designed to be passed to
# `enroot-nsenter --user --remap-root --mount bash -c`.
#
# Pre-conditions on inputs (caller responsibility):
# - layer_dir was extracted by docker::_prepare_layers and processed by
#   enroot-aufs2ovlfs, so whiteouts are mknod 0:0 char devices and
#   opaque dirs carry trusted.overlay.opaque=y (and user.overlay.opaque=y
#   when ENROOT_SET_USER_XATTRS=y was set, which the load path always does).
# - target_dir already contains the merged contents of all layers below
#   this one.
# - Both paths are absolute and well-formed (no embedded quotes/spaces in
#   the digest-keyed dataset paths the chain installer produces).
zfs::_apply_layer_payload() {
    local -r layer_dir="$1" target_dir="$2"
    cat <<PAYLOAD
cd '${layer_dir}'
# Phase 1: opaque-dir clearing — clear the target children of each opaque dir.
getfattr -R -h --absolute-names -n trusted.overlay.opaque . 2>/dev/null \\
  | awk -F': ' '/^# file:/ { print substr(\$0, 9) }' \\
  | while IFS= read -r d; do
      rel=\${d#./}
      [ "\${rel}" = "." ] || [ -z "\${rel}" ] && rel=""
      find '${target_dir}'/"\${rel}" -mindepth 1 -maxdepth 1 -exec rm -rf {} + 2>/dev/null || :
    done
# Phase 2: whiteout deletion — char device 0:0 in layer means "delete in target".
find . -type c | while IFS= read -r wh; do
  rm -rf '${target_dir}'/"\${wh#./}"
done
# Phase 3: copy non-whiteout contents over.
find . -type c -printf '%P\\n' > /tmp/excludes.\$\$
tar -C . --exclude-from=/tmp/excludes.\$\$ --xattrs --xattrs-include='*' --acls -cpf - . \\
  | tar -C '${target_dir}' --xattrs --xattrs-include='*' --acls -xpf -
rm -f /tmp/excludes.\$\$
PAYLOAD
}
```

- [ ] **Step 3.2: Commit**

```sh
git add src/storage_zfs.sh
git commit -s -m "storage_zfs: add _apply_layer_payload generator"
```

---

### Task 4: Add `zfs::_install_layer_chain`

The full chain-build-and-template-install lifecycle. Designed to be a drop-in replacement for `zfs::_install_template_from_layers` when chain mode is active. Same input contract (`cache_key` = image-config-sha256, `layer_count`, `unpriv`); also takes the layer-digest list as an array. Same output contract: prints the template dataset path on stdout.

Chain build (idempotent per-layer):

1. For i = 1..N (chain position, base first):
   - If `<store>/.layers/<digest_i>@done` exists, reuse — go to next layer.
   - Else: race-safe create. Try `zfs create -u <store>/.layers/<digest_i>.tmp` (i=1) or `zfs clone -o canmount=noauto <store>/.layers/<digest_i-1>@done <store>/.layers/<digest_i>.tmp` (i≥2). On EEXIST, wait for `<store>/.layers/<digest_i>@done` (timeout 600s).
   - Mount the `.tmp` via `enroot-zfs-mount`. Run `enroot-nsenter --user --remap-root --mount bash -c "$(zfs::_apply_layer_payload layer_dir mountpoint)"`. On failure, mirror Plan B's ENOSPC retry (sweep warm templates, retry once; on second failure destroy `.tmp` and abort).
   - Unmount, `zfs rename .tmp → final`, snapshot `@done`, `set readonly=on`, set `enroot:layer-digest=<digest_i>` and `enroot:imported`.

Template install (matches Plan F shape):

2. `zfs clone -o canmount=noauto <store>/.layers/<digest_N>@done <store>/.templates/<cache_key>` (with the standard `.tmp`-then-rename race protection, identical to Plan F).
3. Snapshot `@pristine`, `set readonly=on`, stamp metadata. Done — caller (or `clone_container`) takes it from here.

**Files:**
- Modify: `src/storage_zfs.sh`

- [ ] **Step 4.1: Add chain installer**

Append after `zfs::_install_template_from_dir` in `src/storage_zfs.sh`. Inputs:

```
$1 cache_key   - image-config-sha256
$2 layer_count - the N from _prepare_layers
$3 unpriv      - "y" or "" — passed through to enroot-nsenter
$4..$(3+N)     - layer digests in _download order, top first, base last
```

Print the resulting template dataset path on stdout (no trailing newline). Sweeps templates and runs the layer-apply ENOSPC retry on each layer.

- [ ] **Step 4.2: Commit**

```sh
git add src/storage_zfs.sh
git commit -s -m "storage_zfs: add _install_layer_chain"
```

---

### Task 5: Dispatch chain mode in `docker_install_from_layers` and `_pull_and_install_template`

`docker::load` and the pointer-import / eviction-recovery paths both go through the install helpers. Both need to opt into chain mode when active.

**Files:**
- Modify: `src/storage_zfs.sh`

- [ ] **Step 5.1: Make `docker_install_from_layers` chain-mode-aware**

Currently:

```bash
zfs::docker_install_from_layers() {
    local -r cache_key="$1" layer_count="$2" unpriv="$3" name="$4"
    local template
    template=$(zfs::_install_template_from_layers "${cache_key}" "${layer_count}" "${unpriv}")
    zfs::clone_container "${template}" "${name}"
}
```

Add an optional 5th-onwards argument: layer digests (variadic). 
When `zfs::layer_chain_active`, route through `_install_layer_chain` with the digest list; otherwise fall back to `_install_template_from_layers`. The dispatch falls back gracefully if the caller didn't pass digests (e.g. older internal callers): chain mode silently degrades to single-merge. + +```bash +zfs::docker_install_from_layers() { + local -r cache_key="$1" layer_count="$2" unpriv="$3" name="$4" + shift 4 + local template + if zfs::layer_chain_active && [ "$#" -ge 1 ]; then + template=$(zfs::_install_layer_chain "${cache_key}" "${layer_count}" "${unpriv}" "$@") + else + template=$(zfs::_install_template_from_layers "${cache_key}" "${layer_count}" "${unpriv}") + fi + zfs::clone_container "${template}" "${name}" +} +``` + +- [ ] **Step 5.2: Pass layer digests from `docker::load`** + +In `src/docker.sh` `docker::load`'s ZFS branch (the `if zfs::enabled` block currently calling `zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}"`), read the sidecar `./.layers` written by `_prepare_layers` and pass the digests through under chain mode: + +```bash +if zfs::enabled; then + if zfs::layer_chain_active; then + local layer_digests=() + readarray -t layer_digests < .layers + zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" "${layer_digests[@]}" + else + zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" + fi +else + # existing dir-backend overlay-mount + tar-pipe … +fi +``` + +- [ ] **Step 5.3: Pass layer digests from `zfs::_pull_and_install_template`** + +In `src/storage_zfs.sh`, the puller already calls `zfs::_install_template_from_layers` directly (it bypasses `docker_install_from_layers` because it doesn't clone — only fills the cache for the pointer-import / eviction-recovery flow). It also runs `_prepare_layers` inside its own `common::mktmpdir`+`chdir` block, so the same `./.layers` sidecar is available. Mirror the dispatch: + +```bash +if zfs::layer_chain_active; then + local layer_digests=() + readarray -t layer_digests < .layers + zfs::_install_layer_chain "${config}" "${layer_count}" "${unpriv}" "${layer_digests[@]}" > /dev/null +else + zfs::_install_template_from_layers "${config}" "${layer_count}" "${unpriv}" > /dev/null +fi +``` + +- [ ] **Step 5.4: Commit** + +```sh +git add src/storage_zfs.sh src/docker.sh +git commit -s -m "storage_zfs: dispatch chain mode in docker_install_from_layers and pull path" +``` + +--- + +### Task 6: Verify on smoke-test cluster + +The compute nodes already share `/var/lib/enroot` over a delegated ZFS pool. Build a `.deb` locally, push to `spark-f2ff`, run the smoke checks below, then revert per CLAUDE.md. + +- [ ] **Step 6.1: Single-layer alpine, chain mode** + +```sh +sudo systemd-run --user --pty --setenv=ENROOT_ZFS_LAYER_CHAIN=y \ + enroot import -o /tmp/a.sqsh docker://alpine +sudo enroot create -n a /tmp/a.sqsh +sudo enroot start a cat /etc/os-release | head -1 +sudo zfs list -r tank/enroot/data/.layers # should show one layer dataset +sudo enroot remove -f a; rm -f /tmp/a.sqsh +``` + +Expected: load + start succeed; `.layers/` shows one dataset with the layer's digest. 
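
An optional follow-up assertion for this step (a sketch — the dataset names depend on the digests actually pulled): under chain mode the cached template must be a *clone*, with `origin` pointing at the leaf layer's `@done` snapshot; an origin of `-` would mean Plan F's single-merge path filled it instead.

```sh
sudo zfs get -H -o value origin tank/enroot/data/.templates/<config-sha>
# expected: tank/enroot/data/.layers/<layer-digest>@done
```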
- [ ] **Step 6.2: Multi-layer debian, whiteouts**

```sh
sudo ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/d.sqsh docker://debian:stable-slim
sudo enroot create -n d /tmp/d.sqsh
sudo enroot start d cat /etc/os-release | grep PRETTY
sudo find /var/lib/enroot/d -name '.wh.*' | head -3   # must be empty
sudo enroot remove -f d; rm -f /tmp/d.sqsh
```

Expected: container starts; no AUFS whiteouts leak through (the conversion is intact); chain has multiple layer datasets.

- [ ] **Step 6.3: Cross-image base-layer dedup**

```sh
sudo ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/p.sqsh docker://python:3.12-slim
sudo ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/n.sqsh docker://node:20-slim
sudo zfs list -r tank/enroot/data/.layers -o name,used,referenced
```

Expected: the base bookworm layer appears once; `python:3.12-slim` and `node:20-slim` chains share that dataset (visible in `zfs list -t all` as multiple clones of the same `@done`). Block-level sharing visible via `referenced` ≫ `used` on the shared dataset.

- [ ] **Step 6.4: Plan F regression — flag unset**

```sh
sudo enroot import -o /tmp/u.sqsh docker://ubuntu:24.04   # ENROOT_ZFS_LAYER_CHAIN unset
sudo enroot create -n u /tmp/u.sqsh
sudo zfs list -r tank/enroot/data/.layers   # must NOT have created new datasets here
sudo enroot remove -f u; rm -f /tmp/u.sqsh
```

Expected: Plan F's single-merge behavior; `.layers/` either doesn't exist or is unchanged from prior chain-mode runs.

- [ ] **Step 6.5: Pyxis end-to-end with chain mode**

```sh
ssh spark-f2ff 'sudo zfs destroy -r tank/enroot/data/.layers 2>/dev/null || :; \
  sudo zfs destroy -r tank/enroot/data/.templates 2>/dev/null || :'
ENROOT_ZFS_LAYER_CHAIN=y srun -N1 -w spark-f2ff --container-image=docker://debian:stable-slim cat /etc/os-release
ENROOT_ZFS_LAYER_CHAIN=y srun -N1 -w spark-f2ff --container-image=docker://debian:stable-slim hostname
```

Expected: first invocation pays the layer-extract cost once; second is sub-second (template-cache hit).

---

### Task 7: Documentation and PR

**Files:**
- Modify: `doc/zfs.md` — add an `ENROOT_ZFS_LAYER_CHAIN` section under tunables; flip status note to mention Plan G.
- Modify: `CLAUDE.md` — update active-design-proposals line.

- [ ] **Step 7.1: Document and commit**

```sh
git add doc/zfs.md CLAUDE.md
git commit -s -m "Mark Plan G (per-layer ZFS clone chain) as implemented"
git push -u origin feature/zfs-g-layer-chain
gh pr create --repo zeroae/enroot --base zenroot/main --head feature/zfs-g-layer-chain \
  --title "Plan G: per-layer ZFS clone chain for enroot load docker:// (opt-in)" \
  --body "Closes #4."
```

---

## Self-review checklist

- [ ] Default-off: `ENROOT_ZFS_LAYER_CHAIN=` (unset/empty) leaves Plan F's `_install_template_from_layers` path unchanged. Verified at T6.4.
- [ ] Final template shape matches Plan F's (`<store>/.templates/<cache_key>@pristine`, readonly, metadata-stamped), so `zfs::clone_container`, pointer-format import, eviction recovery, `enroot export`, and `enroot import zfs://` all keep working.
- [ ] Chain installer covers both load (T5.2) and pull (T5.3) entry points, so chain mode applies to direct `enroot create docker://` AND to the pointer-import / eviction-recovery paths from #13/#14.
- [ ] Whiteouts and opaque dirs handled (T3 phases 1+2). `enroot-aufs2ovlfs`'s overlayfs output is the input to phase 1/2, so AUFS edge cases that aufs2ovlfs already rejects (`.wh..wh.foo`) stay rejected.
- [ ] Race-safe per-layer via `.tmp` lock (T4). 
Concurrent imports of different images sharing a layer collapse onto the same dataset; loser waits for `@done`.
- [ ] ENOSPC retry mirrors Plan B's pattern (T4): sweep warm templates, retry once, abort with `.tmp` cleanup on second failure.
- [ ] Layer datasets are immortal until manually swept; ZFS refuses `zfs destroy <layer>` while any descendant clone exists, so layers are GC-protected for free as long as any template references them.

## Known limitations

- **No automated `.layers/` GC.** When the last template referencing a base layer is evicted, the layer dataset survives. ZFS will refuse to destroy it while clones exist; once it's standalone, an admin can `zfs destroy <layer>` manually. A follow-up plan can extend Plan B's `eviction_candidates` to layers (same shape: layer is evictable iff it has no clones).
- **No promote.** Layer datasets are clones-of-clones; the deepest leaf chain has N+1 levels of indirection. ZFS handles this fine performance-wise (snapshots are flat at the block layer), but `zfs list -t all` shows the chain.
- **No cross-host layer replication.** `zfs send` per-layer would be a sensible follow-up but is out of scope here.
- **No migration tool** between Plan F single-merge and Plan G chain caches. Switching the flag mid-life is transparent to users (existing templates remain valid) but the on-disk shape diverges.
- **Whiteout-replay is shell.** `getfattr -R` + `find -type c` + `tar`. Slower than the kernel's overlay engine (which Plan F uses) on a per-layer basis, but the work scales with layer size, not image count, and is paid once per unique layer across all images that use it.

## Out of scope

- Replacing Plan F's single-merge path. Plan G is purely additive.
- Cross-host layer replication via `zfs send`.
- Migration tooling between merged-template and per-layer-chain caches.
- Automated layer-dataset GC (manual `zfs destroy` works today).
- **Chain mode for `dockerd://` / `podman://` URIs.** Daemon-URI imports use `${engine} export | tar -x` which flattens the layered image into a single tarball before extraction — there is no per-layer directory structure for the chain installer to consume. Adding chain support here would require:
  1. Switching the daemon path from `${engine} export` to `${engine} save` (which writes a tar archive containing one `<layer-id>/layer.tar` per layer plus a `manifest.json` describing the order).
  2. Parsing `manifest.json` to recover the layer-digest list.
  3. Extracting each layer tarball into a directory parallel to what `docker::_prepare_layers` produces, then dispatching to `_install_layer_chain`.
  4. Constructing a synthetic `0/` from the daemon's image config (`${engine} inspect`'s output).
  This is a real follow-up plan, not a one-line addition. It also slightly changes the daemon contract — `docker save` requires more disk (full image tar before extraction) than `docker export` (streamed). For now `ENROOT_ZFS_LAYER_CHAIN=y` is a documented no-op for daemon URIs.

## Execution Handoff

Same options as Plan A.
diff --git a/doc/plans/README.md b/doc/plans/README.md
index 58b11f7..58d3095 100644
--- a/doc/plans/README.md
+++ b/doc/plans/README.md
@@ -10,15 +10,16 @@ Plans for landing the optional ZFS storage backend designed in [`../zfs.md`](../zfs.md)
 | D. `zfs://` URI transport — `enroot load zfs://host/NAME`, `enroot export NAME zfs://host` | [2026-04-29-zfs-d-zfs-uri.md](2026-04-29-zfs-d-zfs-uri.md) | A, C |
 | E. Ephemeral start ZFS path — substitute `squashfuse + overlay` with throwaway clone | [2026-04-29-zfs-e-ephemeral-start.md](2026-04-29-zfs-e-ephemeral-start.md) | A |
 | F. Docker layer-stack ZFS path — lift `ENROOT_NATIVE_OVERLAYFS=y` requirement on ZFS hosts | [2026-04-29-zfs-f-docker-load.md](2026-04-29-zfs-f-docker-load.md) | A |
+| G. Per-layer ZFS clone chain (opt-in `ENROOT_ZFS_LAYER_CHAIN=y`) — cross-image layer dedup at the dataset level | [2026-05-01-zfs-g-layer-chain.md](2026-05-01-zfs-g-layer-chain.md) | F |
 
 ```
 A ─┬─> B
   ├─> C ─> D
   ├─> E
-  └─> F
+  └─> F ─> G
 ```
 
-Recommended landing order: **A → E → F → B → C → D**. A is the foundation; E/F give the most user-visible wins next; B improves cache economics; C/D add transport options.
+Recommended landing order: **A → E → F → B → C → D → G**. A is the foundation; E/F give the most user-visible wins next; B improves cache economics; C/D add transport options; G is an opt-in optimization on top of F.
 
 ## Conventions used by these plans
diff --git a/doc/zfs.md b/doc/zfs.md
index 1c175d0..461621a 100644
--- a/doc/zfs.md
+++ b/doc/zfs.md
@@ -1,6 +1,6 @@
 # ZFS storage backend
 
-This document describes an optional ZFS-aware mode for the enroot container store. **All six plans (A–F) are implemented.** When `ENROOT_STORAGE_BACKEND=zfs`: `enroot create`, `enroot remove`, ephemeral `enroot start <image>`, and `enroot load docker://...` all use ZFS datasets, with a shared template cache that survives `enroot remove` (warm) for `ENROOT_TEMPLATE_WARM_SECONDS` and gets pressure-evicted LRU once the templates dataset crosses `ENROOT_TEMPLATE_PRESSURE_THRESHOLD` of its quota. `enroot create` accepts both `.sqsh` and `.zfs` (zfs send stream) inputs; `enroot export --format=zfs` produces the latter. The `zfs://[USER@]HOST/NAME` URI scheme transports containers between enroot hosts over SSH (`enroot load zfs://...` to pull, `enroot export NAME zfs://...` to push). The default storage backend (plain directories under `ENROOT_DATA_PATH`) is unchanged and remains the only option on hosts without ZFS.
+This document describes an optional ZFS-aware mode for the enroot container store. **All six plans (A–F) are implemented; Plan G adds an opt-in per-layer clone chain on top of F.** When `ENROOT_STORAGE_BACKEND=zfs`: `enroot create`, `enroot remove`, ephemeral `enroot start <image>`, and `enroot load docker://...` all use ZFS datasets, with a shared template cache that survives `enroot remove` (warm) for `ENROOT_TEMPLATE_WARM_SECONDS` and gets pressure-evicted LRU once the templates dataset crosses `ENROOT_TEMPLATE_PRESSURE_THRESHOLD` of its quota. `enroot create` accepts both `.sqsh` and `.zfs` (zfs send stream) inputs; `enroot export --format=zfs` produces the latter. The `zfs://[USER@]HOST/NAME` URI scheme transports containers between enroot hosts over SSH (`enroot load zfs://...` to pull, `enroot export NAME zfs://...` to push). The default storage backend (plain directories under `ENROOT_DATA_PATH`) is unchanged and remains the only option on hosts without ZFS.
 
 ## Motivation
 
@@ -15,6 +15,7 @@ The ZFS backend is an *alternative storage driver*, in the same spirit as Docker
 | `ENROOT_STORAGE_BACKEND` | `dir` | `dir` = today's behavior. `zfs` = use ZFS datasets for the container store. |
 | `ENROOT_TEMPLATE_WARM_SECONDS` | `604800` (7 days) | How long a template with no clones remains evictable only under pressure. `0` = evict immediately when refcount reaches zero (refcount-only). `inf` = never auto-evict. |
 | `ENROOT_TEMPLATE_PRESSURE_THRESHOLD` | `0.80` | Templates dataset quota fraction above which routine `create`s start evicting warm templates. Soft signal; the ZFS quota is the hard wall. |
+| `ENROOT_ZFS_LAYER_CHAIN` | unset | When `y` AND backend is `zfs`, populate the Docker template cache via a per-layer `zfs clone` chain under `${pool}/${dataset}/.layers/` instead of a single merged extract. Cross-image base layers are physically shared on disk (a debian-bookworm base used by both `python:slim` and `node:slim` is stored once). Applies to `docker://` URIs only; `dockerd://` and `podman://` always go through the daemon-flat-export path and are unaffected. Default off — leaves Plan F's single-merge path unchanged. |
 
 When `ENROOT_STORAGE_BACKEND=zfs`, `ENROOT_DATA_PATH` must be the mountpoint of a ZFS dataset that the unprivileged user has been granted permission on (see [Admin setup](#admin-setup)).
 
@@ -26,6 +27,17 @@ ${pool}/${dataset}/templates/<key>@pristine # snapshot taken after extracti
 ${pool}/${dataset}/<name>/                  # clones of @pristine, the user's containers
 ```
 
+When `ENROOT_ZFS_LAYER_CHAIN=y`, an additional `.layers/` namespace appears under the same store; templates become clones of the chain leaf instead of being filled by a single merged extract:
+
+```
+${pool}/${dataset}/.layers/<digest>       # one per distinct registry layer
+${pool}/${dataset}/.layers/<digest>@done  # snapshot taken after layer apply
+${pool}/${dataset}/.templates/<key>       # zfs clone of the chain leaf @done
+${pool}/${dataset}/.templates/<key>@pristine
+```
+
+Each layer dataset is `zfs clone`d from the previous layer's `@done`, so two images sharing a base layer (e.g. `python:3.12-slim` and `node:20-slim`, both built on `debian:bookworm-slim`) physically share the base bytes. Layer datasets are immutable origins; ZFS refuses to destroy a layer while any descendant clone exists, so a live template can never lose a layer out from under it. There is no automated layer GC: once the last referencing template is evicted, the standalone layer dataset is reclaimed manually (see the plan's known limitations).
+
 Mountpoints follow the dataset hierarchy under `ENROOT_DATA_PATH`. Templates are not user-visible — `enroot list` only enumerates `<name>` clones. Templates have `readonly=on`; clones inherit the property override on `start -w`. The `templates` dataset is shared across all users on the host. Its quota and properties are admin-controlled (see below).
 
diff --git a/pkg/deb/control b/pkg/deb/control
index 26b9950..1f51cc7 100644
--- a/pkg/deb/control
+++ b/pkg/deb/control
@@ -29,7 +29,7 @@ Depends: ${shlibs:Depends}, ${misc:Depends},
 # tar,
 # util-linux,
 # ncurses-bin
-Recommends: pigz
+Recommends: pigz, attr
 Suggests: libnvidia-container-tools, squashfuse, fuse-overlayfs
 Description: Unprivileged container sandboxing utility
  A simple yet powerful tool to turn traditional container/OS images into
diff --git a/src/docker.sh b/src/docker.sh
index 06fae9b..e396cad 100644
--- a/src/docker.sh
+++ b/src/docker.sh
@@ -326,6 +326,13 @@ docker::_prepare_layers() (
     zstd -q -d -o config "${ENROOT_CACHE_PATH}/${config}"
     docker::configure "${PWD}/0" config "${arch}"
+    # Side-emit the ordered layer-digest list to ./.layers (one per line, top
+    # first, base last — docker::_download's reversed manifest order; see
+    # zfs::_install_layer_chain). The ZFS chain-mode path (Plan G) reads this
+    # back to build the per-layer dataset chain. Plan F and dir-backend
+    # callers ignore the file; it lives in the caller's per-call mktmpdir so
+    # it gets cleaned up alongside the rest of the extraction temp dir.
+ printf "%s\n" "${layers[@]}" > .layers + printf "%s\n%s\n" "${config}" "${#layers[@]}" ) @@ -545,7 +552,13 @@ docker::load() ( fi if zfs::enabled; then - zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" + if zfs::layer_chain_active; then + local -a layer_digests=() + readarray -t layer_digests < .layers + zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" "${layer_digests[@]}" + else + zfs::docker_install_from_layers "${config}" "${layer_count}" "${unpriv}" "${name}" + fi else # Create a mount namespace and overlay mount mkdir -p rootfs "${name}" diff --git a/src/storage_zfs.sh b/src/storage_zfs.sh index ef04b67..06d8e87 100644 --- a/src/storage_zfs.sh +++ b/src/storage_zfs.sh @@ -19,6 +19,8 @@ source "${ENROOT_LIBRARY_PATH}/common.sh" readonly zfs_template_subdir=".templates" readonly zfs_pristine_snap="pristine" readonly zfs_ephemeral_subdir=".ephemeral" +readonly zfs_layers_subdir=".layers" +readonly zfs_layer_done_snap="done" # Returns 0 if the ZFS storage backend is configured, 1 otherwise. zfs::enabled() { @@ -164,6 +166,15 @@ zfs::image_sha256() { # 19 bytes (`enroot-zfs-image:v1`) and dispatches to the pointer path. readonly zfs_pointer_magic="enroot-zfs-image:v1" +# Returns 0 iff the ZFS backend is active AND ENROOT_ZFS_LAYER_CHAIN=y. +# Callers gate the per-layer-clone-chain template-fill path on this. The +# default-off behavior (unset / "" / anything but "y") preserves Plan F's +# single-merge path byte-for-byte. +zfs::layer_chain_active() { + zfs::enabled || return 1 + [ "${ENROOT_ZFS_LAYER_CHAIN-}" = "y" ] +} + # Returns 0 if the ZFS backend is active AND ENROOT_ZFS_IMPORT_FORMAT is # unset or set to "pointer". Returns 1 otherwise (e.g. "squashfs" opt-out # or dir backend). Callers gate the new pointer-import path on this. @@ -744,6 +755,65 @@ zfs::send_stream() { fi } +# Generates the bash payload that applies one already-extracted layer +# directory (post enroot-aufs2ovlfs, so whiteouts are mknod 0:0 char +# devices and opaque dirs carry trusted.overlay.opaque=y) on top of an +# already-merged target directory. Designed to be passed to +# `enroot-nsenter --user --remap-root --mount bash -c`. +# +# Two placeholders @@LAYER@@ / @@TARGET@@ are sed-substituted at +# generation time; both paths come from ZFS dataset mountpoints whose +# names derive from regex-validated digests + ENROOT_DATA_PATH, so they +# can't contain shell metacharacters. The payload itself uses single +# quotes around the substituted paths and double quotes around the +# loop-local `${var}` interpolations so a path containing whitespace +# (rare but legal in mountpoints) does not break the apply. +zfs::_apply_layer_payload() { + local -r layer_dir="$1" target_dir="$2" + sed -e "s#@@LAYER@@#${layer_dir}#g" -e "s#@@TARGET@@#${target_dir}#g" <<'PAYLOAD' +set -euo pipefail +mount --make-rprivate / +cd '@@LAYER@@' + +# Phase 1: opaque-dir clearing. trusted.overlay.opaque=y on a layer dir +# means "ignore everything from the parent in this dir"; we replicate +# that by clearing the corresponding target dir's children before +# layering this layer's contents on top. getfattr exits non-zero when no +# matches are found, so the result is captured to a temp file with +# || true to keep set -e + pipefail happy. +getfattr -R -h --absolute-names -n trusted.overlay.opaque . 
2>/dev/null > /tmp/.enroot-opq.$$ || true +awk -F': ' 'sub(/^# file: /, "")' /tmp/.enroot-opq.$$ \ + | while IFS= read -r d; do + rel="${d#./}" + find '@@TARGET@@/'"${rel}" -mindepth 1 -maxdepth 1 -exec rm -rf {} + 2>/dev/null || : + done +rm -f /tmp/.enroot-opq.$$ + +# Phase 2: whiteout deletion. Each char-device 0:0 in the layer encodes +# "this path is removed in this layer". Be defensive — only treat 0:0 +# devices as whiteouts; any non-0:0 char dev (legitimate but unusual) +# is left for phase 3 to copy forward. +find . -type c | while IFS= read -r wh; do + [ "$(stat -c '%t-%T' "${wh}" 2>/dev/null)" = "0-0" ] || continue + rm -rf '@@TARGET@@/'"${wh#./}" +done + +# Phase 3: tar-pipe non-whiteout contents into the target. xattrs +# (overlayfs opaque markers, capability bits, SELinux labels) are +# preserved. Char devices are excluded — both the 0:0 whiteouts we +# already actioned in phase 2 and any other char devs (which would not +# be expected in Docker images post extraction). POSIX ACLs are not +# preserved because (a) ZFS datasets default to acltype=off which +# makes `tar --acls` fail with noisy warnings even when the source has +# no ACLs, and (b) Docker images effectively never depend on ACLs. +find . -type c -printf '%P\n' > /tmp/.enroot-excludes.$$ +tar -C . --exclude-from=/tmp/.enroot-excludes.$$ \ + --xattrs --xattrs-include='*' -cpf - . \ + | tar -C '@@TARGET@@' --xattrs --xattrs-include='*' -xpf - +rm -f /tmp/.enroot-excludes.$$ +PAYLOAD +} + # Materializes the merged Docker rootfs into a ZFS template (cached by # cache_key). Designed to be called from docker::load (or the pointer-import # flow) AFTER docker::_prepare_layers has populated the cwd with extracted, @@ -822,8 +892,18 @@ zfs::_install_template_from_layers() { # (docker::load) use. zfs::docker_install_from_layers() { local -r cache_key="$1" layer_count="$2" unpriv="$3" name="$4" + shift 4 local template - template=$(zfs::_install_template_from_layers "${cache_key}" "${layer_count}" "${unpriv}") + # Variadic remaining args are layer digests (base first, top last); + # passed when the caller wants chain mode. Chain mode also requires + # ENROOT_ZFS_LAYER_CHAIN=y; if either gate fails we transparently + # fall back to Plan F's single-merge path. This keeps the dispatch + # safe for callers that do not yet know about chain mode. + if zfs::layer_chain_active && [ "$#" -ge 1 ]; then + template=$(zfs::_install_layer_chain "${cache_key}" "${layer_count}" "${unpriv}" "$@") + else + template=$(zfs::_install_template_from_layers "${cache_key}" "${layer_count}" "${unpriv}") + fi zfs::clone_container "${template}" "${name}" } @@ -893,6 +973,209 @@ zfs::_install_template_from_dir() { printf "%s" "${template}" } +# Builds one layer dataset on top of prev_layer (or as a base if prev_layer +# is empty). Idempotent: if /.layers/@done already exists, no +# work is done. Race-safe via a per-digest .tmp dataset lock; losers wait +# for @done. ENOSPC during apply triggers a single warm-template-eviction +# retry, mirroring Plan B's pattern. 
+# +# Inputs: +# $1 digest - the layer's content digest (cache key under .layers/) +# $2 prev_layer - parent dataset name (empty for the base layer) +# $3 layer_dir - extracted-layer directory in cwd (1, 2, ..., N from +# docker::_prepare_layers' parallel extraction step) +# $4 unpriv - "y" or "" — passed through to enroot-nsenter +zfs::_build_layer() { + local -r digest="$1" prev_layer="$2" layer_dir="$3" unpriv="$4" + local store layer tmp snap mountpoint payload + local create_ok= i=0 + + store=$(zfs::store_dataset) + layer="${store}/${zfs_layers_subdir}/${digest}" + tmp="${layer}.tmp" + snap="${layer}@${zfs_layer_done_snap}" + + # Cache hit: already built. + if zfs list -H -t snapshot "${snap}" > /dev/null 2>&1; then + return + fi + + # Try to win the lock. Base layers create-from-scratch; non-base layers + # clone the previous layer's @done. canmount=noauto avoids ZFS auto-mount + # (which would need CAP_SYS_ADMIN) — we mount via enroot-zfs-mount below. + if [ -z "${prev_layer}" ]; then + zfs create -u "${tmp}" 2> /dev/null && create_ok=y + else + zfs clone -o canmount=noauto "${prev_layer}@${zfs_layer_done_snap}" "${tmp}" 2> /dev/null && create_ok=y + fi + + if [ -z "${create_ok}" ]; then + # Lost the race or stale .tmp. Wait briefly for another writer to + # finalize @done; on timeout, surface for manual cleanup. + while ! zfs list -H -t snapshot "${snap}" > /dev/null 2>&1; do + sleep 1 + ((i++ < 600)) || common::err "Timed out waiting for layer ${digest:0:12} (stale ${tmp}?)" + done + return + fi + + # Clones inherit readonly=on from the parent's snapshot; we need to write + # the layer's contents into the .tmp dataset before snapshotting, so flip + # it back off here. This is unprivileged-safe: 'zfs allow' includes the + # readonly property in the standard delegation set. + zfs set readonly=off "${tmp}" 2> /dev/null || : + if ! enroot-zfs-mount "${tmp}" 2> /dev/null; then + zfs destroy "${tmp}" 2> /dev/null || : + common::err "failed to mount layer ${digest:0:12}" + fi + mountpoint=$(zfs get -H -o value mountpoint "${tmp}") + common::log INFO "Building layer ${digest:0:12}..." + + payload=$(zfs::_apply_layer_payload "${PWD}/${layer_dir}" "${mountpoint}") + if ! enroot-nsenter ${unpriv:+--user} --mount --remap-root bash -c "${payload}"; then + common::log WARN "Layer apply failed; evicting all warm templates and retrying" + ENROOT_TEMPLATE_WARM_SECONDS=0 zfs::sweep_templates + enroot-nsenter ${unpriv:+--user} --mount --remap-root bash -c "${payload}" \ + || { zfs destroy -r "${tmp}" 2> /dev/null || :; \ + common::err "Failed to apply layer ${digest:0:12} even after evicting warm templates"; } + fi + + enroot-zfs-mount --unmount "${tmp}" 2> /dev/null || : + zfs rename "${tmp}" "${layer}" + enroot-zfs-mount "${layer}" 2> /dev/null || : + zfs snapshot "${snap}" + zfs set readonly=on "${layer}" 2> /dev/null || : + zfs set "enroot:layer-digest=${digest}" "${layer}" 2> /dev/null || : + zfs set "enroot:imported=$(date -u +%FT%TZ)" "${layer}" 2> /dev/null || : + enroot-zfs-mount --unmount "${layer}" 2> /dev/null || : +} + +# Materializes the merged Docker rootfs into a ZFS template (cached by +# cache_key) by building a per-layer clone chain under /.layers/. +# Drop-in replacement for _install_template_from_layers when chain mode +# (ENROOT_ZFS_LAYER_CHAIN=y) is active. 
Designed to be called from +# docker::load (or _pull_and_install_template) AFTER docker::_prepare_layers +# has populated the cwd with extracted, whiteout-converted layer +# directories 0/, 1/, ..., N/ and written the digest list to ./.layers. +# +# The leaf of the layer chain is cloned into /.templates/, +# the per-image synthetic config layer 0/ (rc/fstab/environment generated +# by docker::configure) is applied on top, and the result is snapshotted +# as @pristine. The template is therefore shape-compatible with Plan F +# templates: clone_container, the pointer-format flow, eviction recovery, +# and zfs:// transport all work unchanged. +# +# Layer ordering: docker::_download reverses the registry's manifest order, +# so digests[0] is the TOP layer (e.g. node binary) and digests[N-1] is +# the BASE layer (e.g. alpine root). docker::_prepare_layers' parallel +# extraction puts each digest into directory `i+1` (1-based), so dir 1 = +# digests[0] = TOP and dir N = digests[N-1] = BASE. We build the chain +# BASE-first (iterating i=N-1 down to 0) so the leaf @done snapshot +# contains the fully merged rootfs and matches Plan F's overlay-mount +# output (lowerdir=0:1:2:...:N stacks 0 on top of 1 on top of 2 ...). +# +# Inputs: +# $1 cache_key - sha256 of the image config blob +# $2 layer_count - the N from _prepare_layers +# $3 unpriv - "y" or "" — passed through to enroot-nsenter +# $4..$(3+N) - layer digests with docker::_download's reversed +# convention: digests[0] = TOP, digests[N-1] = BASE +# +# Outputs: prints the template dataset path on stdout (no trailing newline). +# +# Atomicity: per-layer races resolved via .tmp dataset locks +# (see _build_layer); the final template is created via the same .tmp +# pattern as Plan F's _install_template_from_layers, so concurrent +# imports of the same image collapse onto one builder. +zfs::_install_layer_chain() { + local -r cache_key="$1" layer_count="$2" unpriv="$3" + shift 3 + local -a digests=("$@") + local store template tmp snap prev_layer leaf_layer + local i wait_i=0 + + if [ "${#digests[@]}" -ne "${layer_count}" ]; then + common::err "_install_layer_chain: digest count (${#digests[@]}) != layer_count (${layer_count})" + fi + + store=$(zfs::store_dataset) + template="${store}/${zfs_template_subdir}/${cache_key}" + tmp="${template}.tmp" + snap="${template}@${zfs_pristine_snap}" + + zfs::sweep_templates + + # Ensure parent containers exist without auto-mounting them (mount(2) + # needs CAP_SYS_ADMIN; the helper below applies it via the +caps file + # capability). + zfs create -u "${store}/${zfs_template_subdir}" 2> /dev/null || : + enroot-zfs-mount "${store}/${zfs_template_subdir}" 2> /dev/null || : + zfs create -u "${store}/${zfs_layers_subdir}" 2> /dev/null || : + enroot-zfs-mount "${store}/${zfs_layers_subdir}" 2> /dev/null || : + + # Fast path: template already cached — nothing to do. + if zfs list -H -t snapshot "${snap}" > /dev/null 2>&1; then + common::log INFO "Reusing cached template ${cache_key:0:12}" + zfs::touch_template "${template}" + printf "%s" "${template}" + return + fi + + # Build the chain BASE-first (iterate from digests[N-1] = BASE to + # digests[0] = TOP). _build_layer is idempotent on @done cache hit, + # so re-pulling an image whose lower layers are already cached costs + # only the new top layers. 
+ prev_layer="" + for ((i=layer_count-1; i>=0; i--)); do + zfs::_build_layer "${digests[i]}" "${prev_layer}" "$((i+1))" "${unpriv}" + prev_layer="${store}/${zfs_layers_subdir}/${digests[i]}" + done + leaf_layer="${prev_layer}" + + # Final: clone the leaf as the user-visible template, then apply the + # per-image synthetic config layer 0/ on top. canmount=noauto so we + # control mount via the helper. Same .tmp-then-rename race protection + # as Plan F's _install_template_from_layers. + if zfs clone -o canmount=noauto "${leaf_layer}@${zfs_layer_done_snap}" "${tmp}" 2> /dev/null; then + zfs set readonly=off "${tmp}" 2> /dev/null || : + if ! enroot-zfs-mount "${tmp}" 2> /dev/null; then + zfs destroy "${tmp}" 2> /dev/null || : + common::err "failed to mount template clone of layer leaf" + fi + local tmp_mountpoint + tmp_mountpoint=$(zfs get -H -o value mountpoint "${tmp}") + + # Apply the synthetic config layer (dir 0/, populated by + # docker::configure with /etc/{rc,fstab,environment} derived from + # the image config). It has no whiteouts, so the apply payload + # degenerates to a tar-pipe — but reusing the same payload keeps + # xattr-handling consistent with the registry layers. + local config_payload + config_payload=$(zfs::_apply_layer_payload "${PWD}/0" "${tmp_mountpoint}") + if ! enroot-nsenter ${unpriv:+--user} --mount --remap-root bash -c "${config_payload}"; then + zfs destroy -r "${tmp}" 2> /dev/null || : + common::err "Failed to apply synthetic config layer to template ${cache_key:0:12}" + fi + + enroot-zfs-mount --unmount "${tmp}" 2> /dev/null || : + zfs rename "${tmp}" "${template}" + enroot-zfs-mount "${template}" 2> /dev/null || : + zfs snapshot "${snap}" + zfs set readonly=on "${template}" 2> /dev/null || : + zfs set "enroot:imported=$(date -u +%FT%TZ)" "${template}" 2> /dev/null || : + enroot-zfs-mount --unmount "${template}" 2> /dev/null || : + zfs::touch_template "${template}" + else + # Lost the race or stale .tmp — wait for @pristine. + while ! zfs list -H -t snapshot "${snap}" > /dev/null 2>&1; do + sleep 1 + ((wait_i++ < 600)) || common::err "Timed out waiting for chain template: ${template}" + done + fi + + printf "%s" "${template}" +} + # Import flow for docker:// URIs when the ZFS backend is active and the # pointer format is selected. Pulls layers (via docker::_prepare_layers), # fetches the manifest digest (via docker::digest), populates the @@ -1069,7 +1352,13 @@ zfs::_pull_and_install_template() ( unpriv=y fi - zfs::_install_template_from_layers "${config}" "${layer_count}" "${unpriv}" > /dev/null + if zfs::layer_chain_active; then + local -a layer_digests=() + readarray -t layer_digests < .layers + zfs::_install_layer_chain "${config}" "${layer_count}" "${unpriv}" "${layer_digests[@]}" > /dev/null + else + zfs::_install_template_from_layers "${config}" "${layer_count}" "${unpriv}" > /dev/null + fi printf "%s" "${config}" )