
Plan G: per-layer ZFS clone chain for enroot load docker:// (opt-in)#17

Open
sodre wants to merge 11 commits into zenroot/main from feature/zfs-g-layer-chain

Conversation


sodre commented May 2, 2026

Closes #4.

Adds an opt-in per-layer zfs clone chain mode for the Docker template cache. With ENROOT_ZFS_LAYER_CHAIN=y, two images sharing a registry layer digest physically share the bytes on disk; re-pulling an image after a top-layer-only change reuses the cached lower-layer datasets.

Layout

<store>/.layers/<layer-digest>             # one per distinct registry layer (origin)
<store>/.layers/<layer-digest>@done        # snapshot taken after layer apply
<store>/.templates/<image-config-sha>      # zfs clone of the chain leaf @done
<store>/.templates/<image-config-sha>@pristine

Each layer dataset is zfs cloned from the previous layer's @done, and the overlayfs whiteouts (mknod 0:0) and opaque-dir markers (trusted.overlay.opaque=y) are replayed in shell on top of the cloned target: overlayfs only performs that merge at mount time, but a chain stored at rest needs it baked in. The chain leaf is then cloned into .templates/<config_sha>, the per-image synthetic 0/ config layer (rc/fstab/environment from docker::configure) is applied on top, and the result is snapshotted as @pristine, so the existing zfs::clone_container, pointer-format, eviction-recovery, and zfs:// paths all work unchanged.
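Concretely, one chain step can be pictured like this. This is a hedged sketch, not the actual zfs::_build_layer / zfs::_apply_layer_payload code: the dataset paths, ${diffdir}, and ${dest} are illustrative stand-ins, and the real payload does more error handling.

```sh
# Sketch of one chain step (illustrative names, not the real enroot helpers).
# ${diffdir} holds the extracted layer diff in overlayfs at-rest form:
#   - a 0:0 char device at <path>        => delete <path> from the lower tree
#   - trusted.overlay.opaque=y on a dir  => hide the lower dir's inherited contents
parent="tank/enroot/.layers/${prev_digest}"
layer="tank/enroot/.layers/${digest}"

zfs clone "${parent}@done" "${layer}"
dest=$(zfs get -H -o value mountpoint "${layer}")

# 1. Replay whiteouts: drop every path shadowed by a 0:0 char device.
find "${diffdir}" -type c -print0 | while IFS= read -r -d '' wh; do
    rm -rf -- "${dest:?}/${wh#"${diffdir}"/}"
done

# 2. Replay opaque dirs: clear what the marked directory inherited from below.
( cd "${diffdir}" && getfattr -R -m trusted.overlay.opaque . 2>/dev/null || true ) \
    | sed -n 's|^# file: ||p' | while IFS= read -r dir; do
        find "${dest:?}/${dir}" -mindepth 1 -delete 2>/dev/null || true
    done

# 3. Copy the diff itself, preserving xattrs but leaving the 0:0 devices behind.
( cd "${diffdir}" && find . ! -type c -print0 ) \
    | tar --null --no-recursion -T - --xattrs --xattrs-include='*' -cpf - \
    | tar -C "${dest}" --xattrs --xattrs-include='*' -xpf -

zfs snapshot "${layer}@done"
```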

Why no zfs promote

The issue mentions promote as one option for flattening the chain. We don't promote: promoting inverts the chain (layers become clones of the template), which works for a single image but produces a complex, image-private topology that defeats the cross-image sharing goal. Plan G keeps layers as immutable origins; ZFS refuses to destroy a layer while any descendant clone exists, so layer GC happens automatically once all referencing templates are evicted.
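To make that GC behaviour concrete, this is plain ZFS semantics rather than enroot code; the dataset name is illustrative and the exact error wording may differ across ZFS versions:

```sh
# While any template still clones this layer's @done, the destroy is refused,
# with an error along the lines of "snapshot has dependent clones":
zfs destroy -r tank/enroot/.layers/<digest>

# Once the last referencing template has been evicted, the same command
# succeeds and the layer's space is reclaimed.
```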

What's added

| File | Change |
| --- | --- |
| `src/storage_zfs.sh` | `zfs::layer_chain_active`, `zfs::_apply_layer_payload`, `zfs::_build_layer`, `zfs::_install_layer_chain`; chain-mode dispatch in `docker_install_from_layers` and `_pull_and_install_template`. |
| `src/docker.sh` | `_prepare_layers` side-emits the ordered layer-digest list to `./.layers` in its temp cwd; `docker::load`'s ZFS branch reads it back when chain mode is active. |
| `pkg/deb/control` | Recommends `attr` (provides `getfattr`, required by chain-mode opaque-dir handling). |
| `doc/zfs.md`, `CLAUDE.md` | Document the new knob, store layout, and dedup semantics. |
| `doc/plans/2026-05-01-zfs-g-layer-chain.md` | Implementation plan, mirrors the Plans A–F structure. |
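The handoff between the two files is small; roughly (a sketch with assumed variable names, not the literal code):

```sh
# docker::_prepare_layers side (sketch): record the ordered digest list next to
# the extracted layer directories so the ZFS branch can rebuild the chain.
printf '%s\n' "${layer_digests[@]}" > ./.layers   # layer_digests[] is illustrative

# docker::load ZFS branch (sketch): only consult the list when chain mode is on.
if [ "${ENROOT_ZFS_LAYER_CHAIN:-}" = "y" ] && [ -s ./.layers ]; then
    mapfile -t digests < ./.layers
    # ...hand ${digests[@]} to the chain installer...
fi
```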

Coexistence with Plan F

- ENROOT_ZFS_LAYER_CHAIN= (unset/empty/anything but y): Plan F's single-merge _install_template_from_layers runs unchanged.
- ENROOT_ZFS_LAYER_CHAIN=y: chain mode. The same dispatch hits both docker::load (direct create) and _pull_and_install_template (used by pointer-format import and eviction recovery).
- The fast path ("template @pristine already exists, reuse it") runs before the chain dispatch, so templates produced under either mode are reused under the other without a rebuild; the dispatch order is sketched below.
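A minimal sketch of that dispatch order; the @pristine-check helper name and the argument lists are assumptions, only the function names from the table above are real:

```sh
# Order of checks when installing a docker:// image into the ZFS store (sketch):
if zfs_template_pristine_exists "${config_sha}"; then        # hypothetical helper
    zfs::clone_container "${config_sha}" "${container}"       # fast path: reuse
elif [ "${ENROOT_ZFS_LAYER_CHAIN:-}" = "y" ]; then
    zfs::_install_layer_chain "${config_sha}" "${digests[@]}"          # Plan G
else
    zfs::_install_template_from_layers "${config_sha}" "${layers[@]}"  # Plan F
fi
```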

Smoke results (spark-ctrl, Pi 5 / Debian 13 / OpenZFS 2.4.1, 3.75G test pool)

| Test | Result |
| --- | --- |
| Single-layer alpine, chain mode | ✓ pointer file written, layer dataset created, template clones the leaf, rootfs has os-release + /etc/{rc,fstab,environment} |
| Multi-layer node:20-alpine | ✓ 3 layer datasets in BASE→TOP order, leaf REFERs the full 69.9M merged tree, /usr/local/bin/node (102M binary) present in the container rootfs |
| Multi-layer python:3.13-alpine | ✓ 3 layer datasets, clones-of-clones visible in zfs list (3070388042c6: 1.03M USED / 19.3M REFER, pure dedup) |
| Whiteout/opaque sanity | ✓ no .wh.* AUFS files leak through, no char-device whiteouts in the final rootfs |
| Layer reuse after template eviction | ✓ destroy templates + containers, keep layers; re-create from pointer takes 1.7s with no "Building layer" messages, just the final clone |
| Plan F regression (flag unset) | ✓ no .layers/ namespace created, _install_template_from_layers runs as before |

Smoke testing also flagged two bugs that were fixed in 3f7e3af:

- Inverted chain iteration order: docker::_download reverses the manifest, so digests[0] is the TOP layer, not the BASE (corrected loop sketched below).
- Missing synthetic 0/ config layer apply on the leaf: Plan F's overlay mount stacks 0:1:…:N with 0/ on top; the chain installer needed an explicit final tar-pipe of 0/ onto the template.
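For reference, the corrected walk runs from the end of the digest array back to index 0. Variable names and the zfs::_build_layer argument order here are illustrative:

```sh
# Sketch of the corrected iteration.  docker::_download reverses the manifest,
# so digests[0] is the TOP layer and digests[n-1] the BASE; the chain has to be
# built starting from the end of the array.
n=${#digests[@]}
prev=""
for (( i = n - 1; i >= 0; i-- )); do
    zfs::_build_layer "${digests[i]}" "${prev}"   # clone ${prev}@done, apply diff i
    prev=${digests[i]}
done
leaf=${prev}   # topmost layer; the template is cloned from its @done
```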

Plus one packaging fix: attr is now Recommended (was Suggested), since getfattr is required for chain-mode opaque-dir handling and Suggests is not auto-installed.

🤖 Generated with Claude Code

sodre added 9 commits May 1, 2026 22:17
Smoke testing on a 3-layer image (node:20-alpine) caught two bugs in the
chain installer:

1. Inverted iteration. docker::_download reverses the manifest's layer
   order via jq's `reverse`, so digests[0] is the TOP layer and
   digests[N-1] is the BASE. The original `for i in 0..N-1` loop treated
   digests[0] as the base, building the chain upside-down and producing
   a leaf that contained only the top-layer's diffs (e.g. 5.4M for what
   should have been a 70M merged node:20-alpine rootfs). Iterating from
   N-1 down to 0 puts BASE first in the zfs hierarchy and the TOP at
   the leaf.

2. Missing synthetic config layer. docker::_prepare_layers populates a
   directory 0/ via docker::configure with the per-image
   /etc/{rc,fstab,environment} derived from the image config blob; Plan
   F's overlay mount stacks 0:1:2:...:N so 0/ ends up on top. The chain
   installer ignored 0/ entirely, so containers created via chain mode
   were missing /etc/rc and the merged fstab entries. Now applied as a
   final tar-pipe step on top of the leaf clone during template
   finalization, before snapshotting @pristine.

Also tighten the apply payload:
- getfattr returns non-zero when no files match the requested xattr;
  with set -euo pipefail in the payload that aborted the whole apply on
  alpine (no opaque dirs). Capture to a temp file with `|| true`.
- Drop tar's --acls. Default ZFS datasets have acltype=off, which makes
  POSIX ACL set/get fail with "Operation not supported" warnings even
  when the source has no ACLs. Docker images effectively never depend
  on ACLs, and xattrs (overlayfs opaque markers, capability bits,
  SELinux labels) are still preserved.

Signed-off-by: Patrick Sodré <patrick@zero-ae.com>
@sodre sodre marked this pull request as ready for review May 2, 2026 03:12

sodre commented May 2, 2026

Cross-image dedup verified on disk (acceptance criterion #1 from issue #4):

Pulled python:3.13-alpine3.21 then node:22-alpine3.21 (both pinned to the same alpine base) on a fresh store. Origin tree (zfs list -r -d 1 -o name,origin):

c2fe130f4aab (alpine 3.21 base, 5.11M USED, ORIGIN)
├── 1c6063f559a3 → 838d25d4769a → 3517b1771ef3   (python:3.13-alpine3.21)
└── 720ee653d3d4 → 055ee03d01c9 → 7f3c333e617d   (node:22-alpine3.21)

Both chains branch from c2fe130f4aab…@done. The 5.11M alpine base is stored once on disk; ZFS refuses to destroy it ("snapshot has clones") for as long as either the python or the node chain still references it.
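For anyone reproducing the check, the full listing is one command (the store dataset path is assumed here as tank/enroot; substitute your pool layout):

```sh
# Show each layer dataset with its origin snapshot; a shared base appears as
# the common ORIGIN of multiple chains.
zfs list -r -d 1 -o name,origin,used,refer tank/enroot/.layers
```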


sodre commented May 2, 2026

Two more acceptance criteria covered.

Concurrent pull of the same image — race-safe

ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/n1.sqsh docker://node:22-alpine3.21 & p1=$!
ENROOT_ZFS_LAYER_CHAIN=y enroot import -o /tmp/n2.sqsh docker://node:22-alpine3.21 & p2=$!
wait $p1 $p2

Result: both processes returned 0; only one set of [INFO] Building layer ... messages printed (4 messages, one per node:22-alpine3.21 layer); on disk: exactly 4 layer datasets, no .tmp orphans, both containers usable. The per-layer <digest>.tmp lock collapsed both invocations onto a single builder; the loser waited for @done silently.
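The pattern behind that result is build-or-wait on a per-layer lock. A sketch follows; the <digest>.tmp and @done naming comes from the PR, while the function name, arguments, and dataset paths are illustrative:

```sh
# Sketch of the build-or-wait pattern around the per-layer <digest>.tmp lock.
build_layer() {                    # illustrative stand-in for zfs::_build_layer
    local digest=$1 parent=$2
    local layer="tank/enroot/.layers/${digest}"

    # Already built by any earlier or concurrent pull: pure cache hit.
    zfs list -t snapshot -H -o name "${layer}@done" >/dev/null 2>&1 && return 0

    # Creating the .tmp dataset is atomic pool-wide, so it doubles as the lock
    # (the base layer would use `zfs create` instead of a clone).
    if zfs clone "${parent}@done" "${layer}.tmp" 2>/dev/null; then
        # Winner: apply the layer diff onto ${layer}.tmp, then publish it.
        # ... apply payload ...
        zfs rename "${layer}.tmp" "${layer}"
        zfs snapshot "${layer}@done"
    else
        # Loser: another process holds the lock; wait for its @done to appear.
        until zfs list -t snapshot -H -o name "${layer}@done" >/dev/null 2>&1; do
            sleep 1
        done
    fi
}
```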

Lower-layer reuse on a second pull that shares a base

When node:22-alpine3.21 is imported after python:3.13-alpine3.21, the chain installer logged 3 Building layer messages — not 4 — because alpine 3.21's c2fe130f4aab@done was already in the cache from python. The node chain branched directly off that snapshot via zfs clone, so the second pull paid the layer-build cost only for node-specific content.

This generalizes to the issue's "top-layer-only re-pull" case: when a docker tag is republished with only the top digest changed, every cached lower-layer <digest>@done is a hit and _build_layer skips immediately to the clone-on-top step.
