
Plan G: per-layer ZFS clone chain for enroot load docker:// (opt-in)#17

Open
sodre wants to merge 11 commits into zenroot/main from feature/zfs-g-layer-chain

Conversation


sodre commented May 2, 2026

Closes #4.

Adds an opt-in per-layer zfs clone chain mode for the Docker template cache. With ENROOT_ZFS_LAYER_CHAIN=y, two images sharing a registry layer digest physically share the bytes on disk; re-pulling an image after a top-layer-only change reuses the cached lower-layer datasets.

Layout

<store>/.layers/<layer-digest>             # one per distinct registry layer (origin)
<store>/.layers/<layer-digest>@done        # snapshot taken after layer apply
<store>/.templates/<image-config-sha>      # zfs clone of the chain leaf @done
<store>/.templates/<image-config-sha>@pristine

Each layer dataset is zfs cloned from the previous layer's @done, and the overlayfs whiteouts (mknod 0:0) and opaque-dir markers (trusted.overlay.opaque=y) are replayed in shell on top of the cloned target: overlayfs only performs that merge at mount time, but a chain stored at rest needs it baked in. The chain leaf is then cloned into .templates/<config_sha>, the per-image synthetic 0/ config layer (rc/fstab/environment from docker::configure) is applied on top, and the result is snapshotted as @pristine, so the existing zfs::clone_container, pointer-format, eviction-recovery, and zfs:// paths all work unchanged.
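Concretely, one chain step can be pictured like this. This is a hedged sketch, not the actual zfs::_build_layer / zfs::_apply_layer_payload code: the dataset paths, ${diffdir}, and ${dest} are illustrative stand-ins, and the real payload does more error handling.

```sh
# Sketch of one chain step (illustrative names, not the real enroot helpers).
# ${diffdir} holds the extracted layer diff in overlayfs at-rest form:
#   - a 0:0 char device at <path>        => delete <path> from the lower tree
#   - trusted.overlay.opaque=y on a dir  => hide the lower dir's inherited contents
parent="tank/enroot/.layers/${prev_digest}"
layer="tank/enroot/.layers/${digest}"

zfs clone "${parent}@done" "${layer}"
dest=$(zfs get -H -o value mountpoint "${layer}")

# 1. Replay whiteouts: drop every path shadowed by a 0:0 char device.
find "${diffdir}" -type c -print0 | while IFS= read -r -d '' wh; do
    rm -rf -- "${dest:?}/${wh#"${diffdir}"/}"
done

# 2. Replay opaque dirs: clear what the marked directory inherited from below.
( cd "${diffdir}" && getfattr -R -m trusted.overlay.opaque . 2>/dev/null || true ) \
    | sed -n 's|^# file: ||p' | while IFS= read -r dir; do
        find "${dest:?}/${dir}" -mindepth 1 -delete 2>/dev/null || true
    done

# 3. Copy the diff itself, preserving xattrs but leaving the 0:0 devices behind.
( cd "${diffdir}" && find . ! -type c -print0 ) \
    | tar --null --no-recursion -T - --xattrs --xattrs-include='*' -cpf - \
    | tar -C "${dest}" --xattrs --xattrs-include='*' -xpf -

zfs snapshot "${layer}@done"
```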

Why no zfs promote

The issue mentions promote as one option for flattening the chain. We don't promote: promoting inverts the chain (layers become clones of the template), which works for a single image but produces a complex, image-private topology that defeats the cross-image sharing goal. Plan G keeps layers as immutable origins; ZFS refuses to destroy a layer while any descendant clone exists, so layer GC happens automatically once all referencing templates are evicted.
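To make that GC behaviour concrete, this is plain ZFS semantics rather than enroot code; the dataset name is illustrative and the exact error wording may differ across ZFS versions:

```sh
# While any template still clones this layer's @done, the destroy is refused,
# with an error along the lines of "snapshot has dependent clones":
zfs destroy -r tank/enroot/.layers/<digest>

# Once the last referencing template has been evicted, the same command
# succeeds and the layer's space is reclaimed.
```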

What's added

| File | Change |
| --- | --- |
| `src/storage_zfs.sh` | `zfs::layer_chain_active`, `zfs::_apply_layer_payload`, `zfs::_build_layer`, `zfs::_install_layer_chain`; chain-mode dispatch in `docker_install_from_layers` and `_pull_and_install_template`. |
| `src/docker.sh` | `_prepare_layers` side-emits the ordered layer-digest list to `./.layers` in its temp cwd; `docker::load`'s ZFS branch reads it back when chain mode is active. |
| `pkg/deb/control` | Recommends `attr` (provides `getfattr`, required by chain-mode opaque-dir handling). |
| `doc/zfs.md`, `CLAUDE.md` | Document the new knob, store layout, and dedup semantics. |
| `doc/plans/2026-05-01-zfs-g-layer-chain.md` | Implementation plan, mirrors the Plans A–F structure. |
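The handoff between the two files is small; roughly (a sketch with assumed variable names, not the literal code):

```sh
# docker::_prepare_layers side (sketch): record the ordered digest list next to
# the extracted layer directories so the ZFS branch can rebuild the chain.
printf '%s\n' "${layer_digests[@]}" > ./.layers   # layer_digests[] is illustrative

# docker::load ZFS branch (sketch): only consult the list when chain mode is on.
if [ "${ENROOT_ZFS_LAYER_CHAIN:-}" = "y" ] && [ -s ./.layers ]; then
    mapfile -t digests < ./.layers
    # ...hand ${digests[@]} to the chain installer...
fi
```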

Coexistence with Plan F

- ENROOT_ZFS_LAYER_CHAIN= (unset/empty/anything but y): Plan F's single-merge _install_template_from_layers runs unchanged.
- ENROOT_ZFS_LAYER_CHAIN=y: chain mode. The same dispatch hits both docker::load (direct create) and _pull_and_install_template (used by pointer-format import and eviction recovery).
- The fast path ("template @pristine already exists, reuse it") runs before the chain dispatch, so templates produced under either mode are reused under the other without a rebuild; the dispatch order is sketched below.
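A minimal sketch of that dispatch order; the @pristine-check helper name and the argument lists are assumptions, only the function names from the table above are real:

```sh
# Order of checks when installing a docker:// image into the ZFS store (sketch):
if zfs_template_pristine_exists "${config_sha}"; then        # hypothetical helper
    zfs::clone_container "${config_sha}" "${container}"       # fast path: reuse
elif [ "${ENROOT_ZFS_LAYER_CHAIN:-}" = "y" ]; then
    zfs::_install_layer_chain "${config_sha}" "${digests[@]}"          # Plan G
else
    zfs::_install_template_from_layers "${config_sha}" "${layers[@]}"  # Plan F
fi
```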

Smoke results (spark-ctrl, Pi 5 / Debian 13 / OpenZFS 2.4.1, 3.75G test pool)

| Test | Result |
| --- | --- |
| Single-layer alpine, chain mode | ✓ pointer file written, layer dataset created, template clones the leaf, rootfs has os-release + /etc/{rc,fstab,environment} |
| Multi-layer node:20-alpine | ✓ 3 layer datasets in BASE→TOP order, leaf REFERs the full 69.9M merged tree, /usr/local/bin/node (102M binary) present in the container rootfs |
| Multi-layer python:3.13-alpine | ✓ 3 layer datasets, clones-of-clones visible in zfs list (3070388042c6: 1.03M USED / 19.3M REFER, pure dedup) |
| Whiteout/opaque sanity | ✓ no .wh.* AUFS files leak through, no char-device whiteouts in the final rootfs |
| Layer reuse after template eviction | ✓ destroy templates + containers, keep layers; re-create from pointer takes 1.7s with no "Building layer" messages, just the final clone |
| Plan F regression (flag unset) | ✓ no .layers/ namespace created, _install_template_from_layers runs as before |

Smoke testing also flagged two bugs that were fixed in 3f7e3af:

- Inverted chain iteration order: docker::_download reverses the manifest, so digests[0] is the TOP layer, not the BASE (corrected loop sketched below).
- Missing synthetic 0/ config layer apply on the leaf: Plan F's overlay mount stacks 0:1:…:N with 0/ on top; the chain installer needed an explicit final tar-pipe of 0/ onto the template.
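For reference, the corrected walk runs from the end of the digest array back to index 0. Variable names and the zfs::_build_layer argument order here are illustrative:

```sh
# Sketch of the corrected iteration.  docker::_download reverses the manifest,
# so digests[0] is the TOP layer and digests[n-1] the BASE; the chain has to be
# built starting from the end of the array.
n=${#digests[@]}
prev=""
for (( i = n - 1; i >= 0; i-- )); do
    zfs::_build_layer "${digests[i]}" "${prev}"   # clone ${prev}@done, apply diff i
    prev=${digests[i]}
done
leaf=${prev}   # topmost layer; the template is cloned from its @done
```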

Plus one packaging fix: attr is now Recommended (was Suggested), since getfattr is required for chain-mode opaque-dir handling and Suggests is not auto-installed.

🤖 Generated with Claude Code

sodre added 9 commits May 1, 2026 22:17
Smoke testing on a 3-layer image (node:20-alpine) caught two bugs in the
chain installer:

1. Inverted iteration. docker::_download reverses the manifest's layer
   order via jq's `reverse`, so digests[0] is the TOP layer and
   digests[N-1] is the BASE. The original `for i in 0..N-1` loop treated
   digests[0] as the base, building the chain upside-down and producing
   a leaf that contained only the top-layer's diffs (e.g. 5.4M for what
   should have been a 70M merged node:20-alpine rootfs). Iterating from
   N-1 down to 0 puts BASE first in the zfs hierarchy and the TOP at
   the leaf.

2. Missing synthetic config layer. docker::_prepare_layers populates a
   directory 0/ via docker::configure with the per-image
   /etc/{rc,fstab,environment} derived from the image config blob; Plan
   F's overlay mount stacks 0:1:2:...:N so 0/ ends up on top. The chain
   installer ignored 0/ entirely, so containers created via chain mode
   were missing /etc/rc and the merged fstab entries. Now applied as a
   final tar-pipe step on top of the leaf clone during template
   finalization, before snapshotting @pristine.

Also tighten the apply payload:
- getfattr returns non-zero when no files match the requested xattr;
  with set -euo pipefail in the payload that aborted the whole apply on
  alpine (no opaque dirs). Capture to a temp file with `|| true`.
- Drop tar's --acls. Default ZFS datasets have acltype=off, which makes
  POSIX ACL set/get fail with "Operation not supported" warnings even
  when the source has no ACLs. Docker images effectively never depend
  on ACLs, and xattrs (overlayfs opaque markers, capability bits,
  SELinux labels) are still preserved.

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
@sodre sodre marked this pull request as ready for review May 2, 2026 03:12

sodre commented May 2, 2026

Cross-image dedup verified on disk (acceptance criterion #1 from issue #4):

Pulled python:3.13-alpine3.21 then node:22-alpine3.21 (both pinned to the same alpine base) on a fresh store. Origin tree (zfs list -r -d 1 -o name,origin):

c2fe130f4aab (alpine 3.21 base, 5.11M USED, ORIGIN)
├── 1c6063f559a3 → 838d25d4769a → 3517b1771ef3   (python:3.13-alpine3.21)
└── 720ee653d3d4 → 055ee03d01c9 → 7f3c333e617d   (node:22-alpine3.21)

Both chains branch from c2fe130f4aab…@done. The 5.11M alpine base is stored once on disk; ZFS refuses to destroy it ("snapshot has clones") for as long as either the python or the node chain still references it.
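For anyone reproducing the check, the full listing is one command (the store dataset path is assumed here as tank/enroot; substitute your pool layout):

```sh
# Show each layer dataset with its origin snapshot; a shared base appears as
# the common ORIGIN of multiple chains.
zfs list -r -d 1 -o name,origin,used,refer tank/enroot/.layers
```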


sodre commented May 2, 2026

Two more acceptance criteria covered.

Concurrent pull of the same image — race-safe

ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/n1.sqsh docker://node:22-alpine3.21 & p1=$!
ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/n2.sqsh docker://node:22-alpine3.21 & p2=$!
wait $p1 $p2

Result: both processes returned 0; only one set of [INFO] Building layer ... messages printed (4 messages, one per node:22-alpine3.21 layer); on disk: exactly 4 layer datasets, no .tmp orphans, both containers usable. The per-layer <digest>.tmp lock collapsed both invocations onto a single builder; the loser waited for @done silently.
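The pattern behind that result is build-or-wait on a per-layer lock. A sketch follows; the <digest>.tmp and @done naming comes from the PR, while the function name, arguments, and dataset paths are illustrative:

```sh
# Sketch of the build-or-wait pattern around the per-layer <digest>.tmp lock.
build_layer() {                    # illustrative stand-in for zfs::_build_layer
    local digest=$1 parent=$2
    local layer="tank/enroot/.layers/${digest}"

    # Already built by any earlier or concurrent pull: pure cache hit.
    zfs list -t snapshot -H -o name "${layer}@done" >/dev/null 2>&1 && return 0

    # Creating the .tmp dataset is atomic pool-wide, so it doubles as the lock
    # (the base layer would use `zfs create` instead of a clone).
    if zfs clone "${parent}@done" "${layer}.tmp" 2>/dev/null; then
        # Winner: apply the layer diff onto ${layer}.tmp, then publish it.
        # ... apply payload ...
        zfs rename "${layer}.tmp" "${layer}"
        zfs snapshot "${layer}@done"
    else
        # Loser: another process holds the lock; wait for its @done to appear.
        until zfs list -t snapshot -H -o name "${layer}@done" >/dev/null 2>&1; do
            sleep 1
        done
    fi
}
```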

Lower-layer reuse on a second pull that shares a base

When node:22-alpine3.21 is imported after python:3.13-alpine3.21, the chain installer logged 3 Building layer messages — not 4 — because alpine 3.21's c2fe130f4aab@done was already in the cache from python. The node chain branched directly off that snapshot via zfs clone, so the second pull paid the layer-build cost only for node-specific content.

This generalizes to the issue's "top-layer-only re-pull" case: when a docker tag is republished with only the top digest changed, every cached lower-layer <digest>@done is a hit and _build_layer skips immediately to the clone-on-top step.
