Follow-up to #3 (Plan F).
Plan F's `enroot load docker://` path materializes the merged image into a single ZFS template per image (cached by image config digest). That covers most cases but loses several properties that the per-layer-clone-chain approach (mirroring Docker's own `zfs` storage driver) would give us. Plan G is the opt-in mode that adds the per-layer path alongside Plan F's single-merge path, gated on a config flag like `ENROOT_ZFS_LAYER_CHAIN=y`.
Why it's worth doing
For HPC / CI hosts that pull many images sharing common bases, the current single-merge design wastes disk and CPU. Per-layer chains would buy back:
- Cross-image layer dedup at the dataset level. Two images sharing a Debian base store the base bytes once instead of twice. Block-level `dedup=on` recovers this in Plan F's design but at ~5–6 GB RAM per TB indexed; per-layer datasets dedup for free.
- Incremental re-pull cost. When `alpine:3.21` replaces `alpine:3.20`, only the changed top layers are re-extracted; lower-layer datasets are reused. Plan F re-merges the whole stack.
- Layer-granular cache invalidation. A poisoned layer can be `zfs destroy`ed in isolation; Plan F throws out the whole template.
- Native ZFS introspection. `zfs list -t all` shows the layer chain; per-layer `zfs send` becomes a sensible cross-host replication primitive.
- Quota accounting matches intuition. `quota=200G` on `<store>/.templates` reflects shared layers once, not multiplied by the number of images that use them.
- Aligns with Docker's `zfs` storage driver shape, which ops people already know.
What it costs
- Whiteout/opaque-dir merging in shell. Overlayfs whiteouts are character-device files (`mknod <name> c 0 0`); opaque-dir markers are `trusted.overlay.opaque=y` xattrs. Without the kernel's overlay engine doing the merge, we need to apply these manually during each clone-extract step; the edge-case surface is real (see the marker sketch after this list).
- ~5–15× more dataset objects per image: `zfs list` clutter; more bookkeeping.
- Per-layer atomic locks (multiple `.tmp` datasets racing).
- `zfs promote` (or chain-preservation alternative) to flatten the leaf into a standalone template — depends on user delegations.
- More complex cache invalidation logic (decide what to destroy when a leaf is reaped vs. when a shared lower layer is reaped).
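For concreteness, this is roughly what those markers look like in an extracted layer directory. A minimal inspection sketch, assuming the overlayfs forms (i.e. after `enroot-aufs2ovlfs` has run); `layer_dir` is a placeholder path:

```sh
layer_dir=/tmp/layer.extracted   # placeholder: one extracted, converted layer

# Overlayfs whiteouts: 0:0 character devices named after the path they delete.
find "${layer_dir}" -type c -exec stat -c '%n %t:%T' {} +   # expect "... 0:0"

# Opaque-dir markers: an xattr on the directory itself; trusted.* xattrs are
# only readable with privilege, hence the sudo.
sudo getfattr -R --absolute-names -d -m '^trusted\.overlay\.opaque$' "${layer_dir}"
```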
Sketch of design
`zfs::docker_install_from_layers` (in `src/storage_zfs.sh`) gains a check: if `ENROOT_ZFS_LAYER_CHAIN=y`, dispatch to `zfs::docker_install_chain` instead. The new function (sketched after the list below):
- For each layer in stack order, hash the layer tarball (already in `${ENROOT_CACHE_PATH}/<digest>` from `_prepare_layers`) → cache key.
- If `<store>/.layers/<digest>@done` exists, reuse it; else `zfs clone parent@done <new>` (or `zfs create` for the base), apply the layer's whiteouts + extracted contents, `zfs snapshot @done`.
- Final leaf is the merged image; `zfs promote` it into `<store>/.templates/<image-config-sha>` to flatten the chain into a standalone template.
- Existing `zfs::clone_container` then clones the template for the user.
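A rough shape of that loop, as a sketch only: locking, error cleanup, and mountpoint handling are elided (default `/<dataset>` mountpoints assumed), and `zfs::apply_layer_whiteouts` is the helper described next.

```sh
# Sketch: layers are passed as digests in stack order; the extracted,
# aufs2ovlfs-converted layer contents are assumed to sit under
# ${ENROOT_CACHE_PATH}/<digest> (exact layout per Plan F's _prepare_layers).
zfs::docker_install_chain() {
    local store="$1" image_sha="$2"; shift 2
    local parent="" digest dataset

    for digest in "$@"; do
        dataset="${store}/.layers/${digest}"
        # Reuse a finished layer dataset if a previous pull already built it.
        if ! zfs list -H -t snapshot "${dataset}@done" > /dev/null 2>&1; then
            if [ -n "${parent}" ]; then
                zfs clone "${parent}@done" "${dataset}"
            else
                zfs create -p "${dataset}"
            fi
            zfs::apply_layer_whiteouts "${ENROOT_CACHE_PATH}/${digest}" "/${dataset}"
            zfs snapshot "${dataset}@done"
        fi
        parent="${dataset}"
    done

    # The leaf is the fully merged image; clone it into .templates and promote it
    # so the template, not the layer chain, owns the shared history. Whether a
    # single promote is enough is the chain-preservation question noted above.
    zfs clone "${parent}@done" "${store}/.templates/${image_sha}"
    zfs promote "${store}/.templates/${image_sha}"
}
```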
Whiteout application for step 2 needs a helper like `zfs::apply_layer_whiteouts <layer_dir> <target>`:
- For each `*.wh.foo` (AUFS) or 0:0 char device named `foo` (overlayfs) in `layer_dir`: `rm -rf "${target}/foo"`.
- For each `.wh..wh..opq` or `trusted.overlay.opaque=y` dir: clear the children of the corresponding dir in the target.
- Then `cp -a` (or `tar | tar`) the rest of `layer_dir` over `target`.
`enroot-aufs2ovlfs` already converts to overlayfs whiteouts in-place, so the helper only needs the overlayfs forms. Worth confirming the exact char-device format it produces.
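Under those assumptions (overlayfs-form markers only, privileged execution so `trusted.*` xattrs are readable), a minimal sketch of the helper; the tar flags are the ones proposed in the open questions below.

```sh
# Sketch: apply one layer's whiteouts/opaque markers to an already-populated
# target tree, then copy the layer contents over it. layer_dir holds the
# extracted layer (after enroot-aufs2ovlfs); target is the mounted clone's root.
zfs::apply_layer_whiteouts() {
    local layer_dir="$1" target="$2"
    local marker rel
    local -a whiteouts=()

    # 1. Whiteouts: a 0:0 character device named "foo" means "delete foo".
    while IFS= read -r -d '' marker; do
        [ "$(stat -c '%t:%T' "${marker}")" = "0:0" ] || continue
        rel="${marker#"${layer_dir}"/}"
        whiteouts+=("${rel}")
        rm -rf -- "${target:?}/${rel}"
    done < <(find "${layer_dir}" -type c -print0)

    # 2. Opaque dirs: trusted.overlay.opaque=y means "replace, don't merge",
    #    so drop whatever the lower layers put at that path.
    while IFS= read -r -d '' marker; do
        [ "$(getfattr --only-values --absolute-names \
               -n trusted.overlay.opaque "${marker}" 2>/dev/null)" = "y" ] || continue
        rel="${marker#"${layer_dir}"/}"
        rm -rf -- "${target:?}/${rel}"
    done < <(find "${layer_dir}" -mindepth 1 -type d -print0)

    # 3. Copy the layer over the target, preserving xattrs/ACLs.
    tar -C "${layer_dir}" --xattrs --xattrs-include='*' --selinux --acls -cf - . \
        | tar -C "${target}" --xattrs --xattrs-include='*' --selinux --acls -xf -

    # 4. The whiteout device nodes themselves must not survive in the merged tree.
    for rel in "${whiteouts[@]}"; do
        rm -f -- "${target:?}/${rel}"
    done
}
```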
Coexistence with Plan F
- Default behavior unchanged: Plan F's single-merge path stays the default; `ENROOT_ZFS_LAYER_CHAIN=y` opts into the per-layer chain (guard sketched after this list).
- Both paths populate `<store>/.templates/<sha>` — same shape, same `zfs::clone_container` for the user. Only the fill mechanism differs.
- A site can switch on/off without migration; existing single-merge templates remain valid.
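The opt-in itself can be a small guard at the top of Plan F's existing entry point; a sketch, assuming `zfs::docker_install_from_layers` keeps its current argument shape:

```sh
# In zfs::docker_install_from_layers: fall through to the single-merge path
# unless the site explicitly enabled the layer chain.
if [ "${ENROOT_ZFS_LAYER_CHAIN:-}" = "y" ]; then
    zfs::docker_install_chain "$@"
    return
fi
```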
Open questions
- Where do layer datasets live? `<store>/.layers/<sha>` (parallel to `.templates`) keeps the layer cache separate from per-image templates. Avoids confusion when an admin scans `.templates`.
- `zfs promote` permissions: needs `promote` in the user's `zfs allow`. Document in the admin recipe (example after this list).
- Should we GC unused layer datasets when no template references them? Plan B's sweep mechanism could be extended to `.layers/` with the same warm/cold logic.
- How does Plan B's warm/cold eviction interact? Templates and layers should probably share the same lifecycle policy, but layers' lifecycle is "no template references" rather than "no clones."
- xattr propagation under `cp -a` vs `tar`: `tar --xattrs --xattrs-include='*' --selinux --acls` is the safer pipe; `enroot-aufs2ovlfs` likely emits these correctly already.
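For the `promote` delegation, the admin-recipe entry would presumably look something like this (group and dataset names are placeholders; `promote` is the permission the chain path adds on top of whatever the store already delegates):

```sh
# Delegate the operations the per-layer path performs on the store dataset.
zfs allow -g enroot-users clone,create,destroy,mount,promote,snapshot tank/enroot
```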
Acceptance criteria
- Two distinct images sharing a base layer (e.g. `python:3-slim` and `node:20-slim`, both Debian-based) store the shared layer once on disk under `<store>/.layers/<digest>` (see the check sketched after this list).
- Re-pulling an image after a top-layer-only update reuses the cached lower-layer datasets (verify with timing + a `zfs list` snapshot before/after).
- Whiteouts and opaque dirs from real Docker images merge correctly (verified against `python:3-slim`, `nginx`, and `cuda` base images).
- Concurrent `enroot load` of the same image is race-safe (same `.tmp` lock pattern as Plan F).
- `ENROOT_ZFS_LAYER_CHAIN=` (unset) leaves Plan F's behavior unchanged.
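For the first two criteria, a quick manual check on a test host might look like this (the store dataset name is a placeholder; image references are the ones from the criteria above):

```sh
# Pull two Debian-based images; the second should skip the shared base layers.
time enroot load docker://python:3-slim
time enroot load docker://node:20-slim

# Each shared layer digest should appear exactly once, with its real size in USED/REFER.
zfs list -r -o name,used,refer tank/enroot/.layers

# origin shows clone parentage, so templates/layers built on the same base
# should point into the same chain of @done snapshots.
zfs list -r -o name,origin tank/enroot/.templates tank/enroot/.layers
```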
Out of scope
- Replacing Plan F's single-merge path. Plan G is purely additive.
- Cross-host layer replication via `zfs send`. Natural follow-up, but tracked separately.
- Migration tooling between merged-template and per-layer-chain caches.