
Plan G: per-layer ZFS clone chain for enroot load docker:// (opt-in) #4

@sodre

Description

Follow-up to #3 (Plan F).

Plan F's enroot load docker:// path materializes the merged image into a single ZFS template per image (cached by image config digest). That covers most cases but loses several properties that the per-layer-clone-chain approach (mirroring Docker's own zfs storage driver) would give us. Plan G is the opt-in mode that adds the per-layer path alongside Plan F's single-merge path, gated on a config flag like ENROOT_ZFS_LAYER_CHAIN=y.

Why it's worth doing

For HPC / CI hosts that pull many images sharing common bases, the current single-merge design wastes disk and CPU. Per-layer chains would buy back:

  1. Cross-image layer dedup at the dataset level. Two images sharing a Debian base store the base bytes once instead of twice. Block-level dedup=on would recover this in Plan F's design, but at ~5–6 GB of RAM per TB indexed; per-layer datasets give the same effect for free.
  2. Incremental re-pull cost. When alpine:3.21 replaces alpine:3.20, only the changed top layers are re-extracted; lower-layer datasets are reused. Plan F re-merges the whole stack.
  3. Layer-granular cache invalidation. A poisoned layer can be zfs destroyed in isolation; Plan F throws out the whole template.
  4. Native ZFS introspection. zfs list -t all shows the layer chain; zfs send per-layer becomes a sensible cross-host replication primitive.
  5. Quota accounting matches intuition. quota=200G on <store>/.templates reflects shared layers once, not multiplied by the number of images that use them.
  6. Aligns with Docker's zfs storage driver shape, which ops people already know.

What it costs

  • Whiteout/opaque-dir merging in shell. Overlayfs whiteouts are character-device files (mknod c 0 0); opaque-dir markers are trusted.overlay.opaque=y xattrs. Without the kernel's overlay engine doing the merge, we need to apply these manually during each clone-extract step. Real edge-case surface.
  • ~5–15× more dataset objects per image. zfs list clutter; more bookkeeping.
  • Per-layer atomic locks (multiple .tmp datasets racing).
  • zfs promote (or chain-preservation alternative) to flatten the leaf into a standalone template — depends on user delegations.
  • More complex cache invalidation logic (decide what to destroy when a leaf is reaped vs. when a shared lower layer is reaped).

Sketch of design

zfs::docker_install_from_layers (in src/storage_zfs.sh) gains a check: if ENROOT_ZFS_LAYER_CHAIN=y, it dispatches to a new zfs::docker_install_chain instead (sketched after the list). The new function:

  1. For each layer in stack order, hash the layer tarball (already in ${ENROOT_CACHE_PATH}/<digest> from _prepare_layers) → cache key.
  2. If <store>/.layers/<digest>@done exists, reuse; else zfs clone parent@done <new> (or zfs create for the base), apply layer's whiteouts + extracted contents, zfs snapshot @done.
  3. Final leaf is the merged image; zfs promote it into <store>/.templates/<image-config-sha> to flatten the chain into a standalone template.
  4. Existing zfs::clone_container then clones the template for the user.
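
A minimal sketch of what zfs::docker_install_chain could look like under those steps. The argument convention, the ENROOT_ZFS_STORE variable, and the scratch-extraction step are assumptions for illustration, not existing enroot code; zfs::apply_layer_whiteouts is the helper sketched in the next section.

    # Illustrative sketch only: argument convention, ENROOT_ZFS_STORE, and the
    # scratch extraction are assumptions; Plan F's .tmp/lock pattern and error
    # handling are omitted for brevity.
    zfs::docker_install_chain() {
        local -r image_sha="$1"; shift        # image config digest -> template name
        local -a layers=("$@")                # layer tarball digests, in stack order
        local parent="" digest dataset mnt layer_dir

        for digest in "${layers[@]}"; do
            dataset="${ENROOT_ZFS_STORE}/.layers/${digest}"
            if ! zfs list -H -t snapshot "${dataset}@done" > /dev/null 2>&1; then
                if [ -z "${parent}" ]; then
                    zfs create "${dataset}"                   # base layer
                else
                    zfs clone "${parent}@done" "${dataset}"   # child of the previous layer
                fi
                mnt="$(zfs get -H -o value mountpoint "${dataset}")"

                # Scratch-extract the cached tarball; the real path would run
                # enroot-aufs2ovlfs here so whiteouts reach the helper in overlayfs form.
                layer_dir="$(mktemp -d)"
                tar --extract --file "${ENROOT_CACHE_PATH}/${digest}" \
                    --directory "${layer_dir}" --xattrs --xattrs-include='*' --acls
                zfs::apply_layer_whiteouts "${layer_dir}" "${mnt}"
                rm -rf "${layer_dir}"

                zfs snapshot "${dataset}@done"
            fi
            parent="${dataset}"
        done

        # Step 3: promote a clone of the leaf under .templates so the template no
        # longer depends on the layer chain (needs 'promote' in the user's zfs allow).
        local -r template="${ENROOT_ZFS_STORE}/.templates/${image_sha}"
        zfs clone "${parent}@done" "${template}"
        zfs promote "${template}"
    }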

Whiteout application for step 2 needs a helper like zfs::apply_layer_whiteouts <layer_dir> <target> (sketched after the note below):

  • For each AUFS whiteout .wh.foo or overlayfs whiteout (0:0 char device named foo) in layer_dir: rm -rf "${target}/foo".
  • For each dir containing a .wh..wh..opq marker (AUFS) or carrying trusted.overlay.opaque=y (overlayfs): clear the children of the corresponding dir in target.
  • Then cp -a (or tar | tar) the rest of layer_dir over target.

enroot-aufs2ovlfs already converts AUFS whiteouts to the overlayfs form in-place, so the helper only needs to handle the overlayfs forms. Worth confirming the exact char-device format it produces.
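
A minimal sketch of that helper under the same assumptions (overlayfs-form whiteouts only; GNU tar, findutils, and attr's getfattr available). The opaque-dir handling and the removal of markers before the copy are illustrative choices rather than settled design:

    # Illustrative sketch only: assumes layer_dir is a scratch extraction whose
    # AUFS whiteouts have already been converted by enroot-aufs2ovlfs.
    zfs::apply_layer_whiteouts() {
        local -r layer_dir="$1" target="$2"
        local path rel

        # Opaque directories: trusted.overlay.opaque=y means "hide the lower dir
        # entirely", so clear the corresponding directory in the target.
        while IFS= read -r -d '' path; do
            if getfattr --only-values -n trusted.overlay.opaque "${path}" 2> /dev/null | grep -qx y; then
                rel="${path#"${layer_dir}"}"
                rm -rf "${target:?}${rel:?}"
                mkdir -p "${target}${rel}"   # recreated; tar below restores mode/owner
            fi
        done < <(find "${layer_dir}" -mindepth 1 -type d -print0)

        # Whiteouts: a 0:0 character device named foo deletes foo in the target.
        while IFS= read -r -d '' path; do
            if [ "$(stat -c '%t:%T' "${path}")" = "0:0" ]; then
                rel="${path#"${layer_dir}"}"
                rm -rf "${target:?}${rel:?}"
                rm -f "${path}"              # drop the marker so it is not copied
            fi
        done < <(find "${layer_dir}" -type c -print0)

        # Overlay the remaining contents, preserving xattrs/ACLs (the tar pipe
        # from the open questions below, rather than cp -a).
        (cd "${layer_dir}" && tar --create --file - --xattrs --acls .) |
            (cd "${target}" && tar --extract --file - --xattrs --xattrs-include='*' --acls)
    }

Note that reading trusted.* xattrs generally requires privilege; if the unprivileged path ends up carrying the marker as user.overlay.opaque instead, only the attribute name in the getfattr call changes.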

Coexistence with Plan F

  • Default behavior unchanged: Plan F's single-merge path stays the default.
  • ENROOT_ZFS_LAYER_CHAIN=y opts into per-layer.
  • Both paths populate <store>/.templates/<sha> — same shape, same zfs::clone_container for the user. Only the fill mechanism differs.
  • A site can switch on/off without migration; existing single-merge templates remain valid.
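
For illustration, the opt-in would look roughly like this (image names arbitrary; enroot load docker:// is the path this issue extends):

    # Default: Plan F single-merge template, behavior unchanged.
    enroot load docker://python:3-slim

    # Opt in to the per-layer clone chain for this pull only.
    ENROOT_ZFS_LAYER_CHAIN=y enroot load docker://node:20-slim

    # Either way the result lands under <store>/.templates/<image-config-sha>,
    # so zfs::clone_container behaves identically afterwards.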

Open questions

  • Where do layer datasets live? <store>/.layers/<sha> (parallel to .templates) keeps the layer cache separate from per-image templates. Avoids confusion when an admin scans .templates.
  • zfs promote permissions: needs promote in the user's zfs allow. Document in the admin recipe.
  • Should we GC unused layer datasets when no template references them? Plan B's sweep mechanism could be extended to .layers/ with the same warm/cold logic.
  • How does this interact with Plan B's warm/cold eviction? Templates and layers should probably share the same lifecycle policy, but a layer's expiry condition is "no template references it" rather than "no clones exist."
  • xattr propagation under cp -a vs tar: tar --xattrs --xattrs-include='*' --selinux --acls is the safer pipe; enroot-aufs2ovlfs likely emits these correctly already.
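
One hedged way to settle that question empirically on an extracted sample layer (src/ and the scratch directories are placeholders):

    # Copy the same tree both ways and diff the resulting xattr dumps.
    mkdir -p via-cp via-tar
    cp -a src/. via-cp/
    (cd src && tar --create --file - --xattrs --selinux --acls .) |
        (cd via-tar && tar --extract --file - --xattrs --xattrs-include='*' --selinux --acls)

    (cd via-cp  && getfattr -R -d -m - . 2> /dev/null) > cp.xattrs
    (cd via-tar && getfattr -R -d -m - . 2> /dev/null) > tar.xattrs
    diff cp.xattrs tar.xattrs   # empty output means both preserved the same xattrs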

Acceptance criteria

  • Two distinct images sharing a base layer (e.g. python:3-slim and node:20-slim, both Debian-based) store the shared layer once on disk under <store>/.layers/<digest>.
  • Re-pulling an image after a top-layer-only update reuses the cached lower-layer datasets (verify with timing + zfs list snapshot before/after).
  • Whiteouts and opaque dirs from real Docker images merge correctly (verified against python:3-slim, nginx, cuda base images).
  • Concurrent enroot load of the same image is race-safe (same .tmp lock pattern as Plan F).
  • ENROOT_ZFS_LAYER_CHAIN= (unset) leaves Plan F's behavior unchanged.
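
Hedged sketches of how the first two criteria could be checked on a test host; tank/enroot stands in for the actual store dataset:

    STORE=tank/enroot   # example pool/dataset, site-specific

    # Shared-base criterion: the common layer dataset appears once under .layers
    # and its space is not duplicated per image.
    zfs list -r -o name,used,refer "${STORE}/.layers" "${STORE}/.templates"

    # Re-pull criterion: record the layer snapshots, re-pull, and diff; only the
    # changed top layers should appear as new.
    zfs list -H -r -t snapshot -o name "${STORE}/.layers" > layers.before
    ENROOT_ZFS_LAYER_CHAIN=y enroot load docker://alpine:3.21
    zfs list -H -r -t snapshot -o name "${STORE}/.layers" > layers.after
    diff layers.before layers.after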

Out of scope

  • Replacing Plan F's single-merge path. Plan G is purely additive.
  • Cross-host layer replication via zfs send. Natural follow-up but tracked separately.
  • Migration tooling between merged-template and per-layer-chain caches.
