
Plan G: per-layer ZFS clone chain for enroot load docker:// (opt-in) #4

@sodre

Description

Follow-up to #3 (Plan F).

Plan F's enroot load docker:// path materializes the merged image into a single ZFS template per image (cached by image config digest). That covers most cases but loses several properties that the per-layer-clone-chain approach (mirroring Docker's own zfs storage driver) would give us. Plan G is the opt-in mode that adds the per-layer path alongside Plan F's single-merge path, gated on a config flag like ENROOT_ZFS_LAYER_CHAIN=y.

Why it's worth doing

For HPC / CI hosts that pull many images sharing common bases, the current single-merge design wastes disk and CPU. Per-layer chains would buy back:

  1. Cross-image layer dedup at the dataset level. Two images sharing a Debian base store the base bytes once instead of twice. Block-level dedup=on would recover this in Plan F's design, but at ~5–6 GB of RAM per TB indexed; per-layer datasets give the same effect for free.
  2. Incremental re-pull cost. When alpine:3.21 replaces alpine:3.20, only the changed top layers are re-extracted; lower-layer datasets are reused. Plan F re-merges the whole stack.
  3. Layer-granular cache invalidation. A poisoned layer can be zfs destroyed in isolation; Plan F throws out the whole template.
  4. Native ZFS introspection. zfs list -t all shows the layer chain; zfs send per-layer becomes a sensible cross-host replication primitive.
  5. Quota accounting matches intuition. quota=200G on <store>/.templates reflects shared layers once, not multiplied by the number of images that use them.
  6. Aligns with Docker's zfs storage driver shape, which ops people already know.

What it costs

  • Whiteout/opaque-dir merging in shell. Overlayfs whiteouts are character-device files (mknod c 0 0); opaque-dir markers are trusted.overlay.opaque=y xattrs. Without the kernel's overlay engine doing the merge, we need to apply these manually during each clone-extract step. Real edge-case surface.
  • ~5–15× more dataset objects per image. zfs list clutter; more bookkeeping.
  • Per-layer atomic locks (multiple .tmp datasets racing).
  • zfs promote (or chain-preservation alternative) to flatten the leaf into a standalone template — depends on user delegations.
  • More complex cache invalidation logic (decide what to destroy when a leaf is reaped vs. when a shared lower layer is reaped).

Sketch of design

zfs::docker_install_from_layers (in src/storage_zfs.sh) gains a check: if ENROOT_ZFS_LAYER_CHAIN=y, it dispatches to a new zfs::docker_install_chain instead (sketched after the list). The new function:

  1. For each layer in stack order, hash the layer tarball (already in ${ENROOT_CACHE_PATH}/<digest> from _prepare_layers) → cache key.
  2. If <store>/.layers/<digest>@done exists, reuse; else zfs clone parent@done <new> (or zfs create for the base), apply layer's whiteouts + extracted contents, zfs snapshot @done.
  3. Final leaf is the merged image; zfs promote it into <store>/.templates/<image-config-sha> to flatten the chain into a standalone template.
  4. Existing zfs::clone_container then clones the template for the user.
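
A minimal sketch of what zfs::docker_install_chain could look like under those steps. The argument convention, the ENROOT_ZFS_STORE variable, and the scratch-extraction step are assumptions for illustration, not existing enroot code; zfs::apply_layer_whiteouts is the helper sketched in the next section.

    # Illustrative sketch only: argument convention, ENROOT_ZFS_STORE, and the
    # scratch extraction are assumptions; Plan F's .tmp/lock pattern and error
    # handling are omitted for brevity.
    zfs::docker_install_chain() {
        local -r image_sha="$1"; shift        # image config digest -> template name
        local -a layers=("$@")                # layer tarball digests, in stack order
        local parent="" digest dataset mnt layer_dir

        for digest in "${layers[@]}"; do
            dataset="${ENROOT_ZFS_STORE}/.layers/${digest}"
            if ! zfs list -H -t snapshot "${dataset}@done" > /dev/null 2>&1; then
                if [ -z "${parent}" ]; then
                    zfs create "${dataset}"                   # base layer
                else
                    zfs clone "${parent}@done" "${dataset}"   # child of the previous layer
                fi
                mnt="$(zfs get -H -o value mountpoint "${dataset}")"

                # Scratch-extract the cached tarball; the real path would run
                # enroot-aufs2ovlfs here so whiteouts reach the helper in overlayfs form.
                layer_dir="$(mktemp -d)"
                tar --extract --file "${ENROOT_CACHE_PATH}/${digest}" \
                    --directory "${layer_dir}" --xattrs --xattrs-include='*' --acls
                zfs::apply_layer_whiteouts "${layer_dir}" "${mnt}"
                rm -rf "${layer_dir}"

                zfs snapshot "${dataset}@done"
            fi
            parent="${dataset}"
        done

        # Step 3: promote a clone of the leaf under .templates so the template no
        # longer depends on the layer chain (needs 'promote' in the user's zfs allow).
        local -r template="${ENROOT_ZFS_STORE}/.templates/${image_sha}"
        zfs clone "${parent}@done" "${template}"
        zfs promote "${template}"
    }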

Whiteout application for step 2 needs a helper like zfs::apply_layer_whiteouts <layer_dir> <target> (sketched after the note below):

  • For each AUFS whiteout .wh.foo or overlayfs whiteout (0:0 char device named foo) in layer_dir: rm -rf "${target}/foo".
  • For each dir containing a .wh..wh..opq marker (AUFS) or carrying trusted.overlay.opaque=y (overlayfs): clear the children of the corresponding dir in target.
  • Then cp -a (or tar | tar) the rest of layer_dir over target.

enroot-aufs2ovlfs already converts AUFS whiteouts to the overlayfs form in-place, so the helper only needs to handle the overlayfs forms. Worth confirming the exact char-device format it produces.
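
A minimal sketch of that helper under the same assumptions (overlayfs-form whiteouts only; GNU tar, findutils, and attr's getfattr available). The opaque-dir handling and the removal of markers before the copy are illustrative choices rather than settled design:

    # Illustrative sketch only: assumes layer_dir is a scratch extraction whose
    # AUFS whiteouts have already been converted by enroot-aufs2ovlfs.
    zfs::apply_layer_whiteouts() {
        local -r layer_dir="$1" target="$2"
        local path rel

        # Opaque directories: trusted.overlay.opaque=y means "hide the lower dir
        # entirely", so clear the corresponding directory in the target.
        while IFS= read -r -d '' path; do
            if getfattr --only-values -n trusted.overlay.opaque "${path}" 2> /dev/null | grep -qx y; then
                rel="${path#"${layer_dir}"}"
                rm -rf "${target:?}${rel:?}"
                mkdir -p "${target}${rel}"   # recreated; tar below restores mode/owner
            fi
        done < <(find "${layer_dir}" -mindepth 1 -type d -print0)

        # Whiteouts: a 0:0 character device named foo deletes foo in the target.
        while IFS= read -r -d '' path; do
            if [ "$(stat -c '%t:%T' "${path}")" = "0:0" ]; then
                rel="${path#"${layer_dir}"}"
                rm -rf "${target:?}${rel:?}"
                rm -f "${path}"              # drop the marker so it is not copied
            fi
        done < <(find "${layer_dir}" -type c -print0)

        # Overlay the remaining contents, preserving xattrs/ACLs (the tar pipe
        # from the open questions below, rather than cp -a).
        (cd "${layer_dir}" && tar --create --file - --xattrs --acls .) |
            (cd "${target}" && tar --extract --file - --xattrs --xattrs-include='*' --acls)
    }

Note that reading trusted.* xattrs generally requires privilege; if the unprivileged path ends up carrying the marker as user.overlay.opaque instead, only the attribute name in the getfattr call changes.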

Coexistence with Plan F

  • Default behavior unchanged: Plan F's single-merge path stays the default.
  • ENROOT_ZFS_LAYER_CHAIN=y opts into per-layer.
  • Both paths populate <store>/.templates/<sha> — same shape, same zfs::clone_container for the user. Only the fill mechanism differs.
  • A site can switch on/off without migration; existing single-merge templates remain valid.
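
For illustration, the opt-in would look roughly like this (image names arbitrary; enroot load docker:// is the path this issue extends):

    # Default: Plan F single-merge template, behavior unchanged.
    enroot load docker://python:3-slim

    # Opt in to the per-layer clone chain for this pull only.
    ENROOT_ZFS_LAYER_CHAIN=y enroot load docker://node:20-slim

    # Either way the result lands under <store>/.templates/<image-config-sha>,
    # so zfs::clone_container behaves identically afterwards.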

Open questions

  • Where do layer datasets live? <store>/.layers/<sha> (parallel to .templates) keeps the layer cache separate from per-image templates. Avoids confusion when an admin scans .templates.
  • zfs promote permissions: needs promote in the user's zfs allow. Document in the admin recipe.
  • Should we GC unused layer datasets when no template references them? Plan B's sweep mechanism could be extended to .layers/ with the same warm/cold logic.
  • How does this interact with Plan B's warm/cold eviction? Templates and layers should probably share the same lifecycle policy, but a layer's expiry condition is "no template references it" rather than "no clones exist."
  • xattr propagation under cp -a vs tar: tar --xattrs --xattrs-include='*' --selinux --acls is the safer pipe; enroot-aufs2ovlfs likely emits these correctly already.
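
One hedged way to settle that question empirically on an extracted sample layer (src/ and the scratch directories are placeholders):

    # Copy the same tree both ways and diff the resulting xattr dumps.
    mkdir -p via-cp via-tar
    cp -a src/. via-cp/
    (cd src && tar --create --file - --xattrs --selinux --acls .) |
        (cd via-tar && tar --extract --file - --xattrs --xattrs-include='*' --selinux --acls)

    (cd via-cp  && getfattr -R -d -m - . 2> /dev/null) > cp.xattrs
    (cd via-tar && getfattr -R -d -m - . 2> /dev/null) > tar.xattrs
    diff cp.xattrs tar.xattrs   # empty output means both preserved the same xattrs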

Acceptance criteria

  • Two distinct images sharing a base layer (e.g. python:3-slim and node:20-slim, both Debian-based) store the shared layer once on disk under <store>/.layers/<digest>.
  • Re-pulling an image after a top-layer-only update reuses the cached lower-layer datasets (verify with timing + zfs list snapshot before/after).
  • Whiteouts and opaque dirs from real Docker images merge correctly (verified against python:3-slim, nginx, cuda base images).
  • Concurrent enroot load of the same image is race-safe (same .tmp lock pattern as Plan F).
  • ENROOT_ZFS_LAYER_CHAIN= (unset) leaves Plan F's behavior unchanged.
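
Hedged sketches of how the first two criteria could be checked on a test host; tank/enroot stands in for the actual store dataset:

    STORE=tank/enroot   # example pool/dataset, site-specific

    # Shared-base criterion: the common layer dataset appears once under .layers
    # and its space is not duplicated per image.
    zfs list -r -o name,used,refer "${STORE}/.layers" "${STORE}/.templates"

    # Re-pull criterion: record the layer snapshots, re-pull, and diff; only the
    # changed top layers should appear as new.
    zfs list -H -r -t snapshot -o name "${STORE}/.layers" > layers.before
    ENROOT_ZFS_LAYER_CHAIN=y enroot load docker://alpine:3.21
    zfs list -H -r -t snapshot -o name "${STORE}/.layers" > layers.after
    diff layers.before layers.after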

Out of scope

  • Replacing Plan F's single-merge path. Plan G is purely additive.
  • Cross-host layer replication via zfs send. Natural follow-up but tracked separately.
  • Migration tooling between merged-template and per-layer-chain caches.
