ZFS template ENOSPC retry: tighten end-to-end verification

Follow-up to #5 (Plan B). The ENOSPC retry path in `zfs::ensure_template` and `zfs::docker_install_from_layers` is structurally correct and partially verified, but it has never been observed end-to-end in a single test run. This issue tracks closing that gap.

## What's verified today

PR #5 documents the following tests passing on a loopback ZFS pool (Linux 6.12.75, aarch64, zfs-2.4.1):

- `unsquashfs` exits non-zero on `Disk quota exceeded` (the `EDQUOT` message, which this issue groups under ENOSPC; verified with `quota=3M`, smaller than a single template's extracted size).
- The retry path catches the failure: `[WARN] Extraction failed; evicting all warm templates and retrying` is printed.
- The sweep itself works under `WARM_SECONDS=0` (Plan B Task 6 tests).
- The retry attempt fires (a second `unsquashfs` invocation runs).
- Final-error path fires when retry also fails: `[ERROR] Extraction failed even after evicting warm templates`, `.tmp` dataset is destroyed, no orphan container.

## What's not yet verified

**Success after retry**, in one continuous run: first attempt hits ENOSPC → sweep evicts a warm template that frees enough space → retry attempt succeeds → final state has the new template installed.

The pieces are individually correct (the retry attempt is the identical command to the first), but observing the *transition* in a single run requires three quota constraints to line up:

1. The new template's extracted size must be ≤ available-after-eviction.
2. Two templates' combined extracted size must exceed the quota (so the first attempt fails).
3. ZFS refuses `quota=N` if `N < current usage`, so the quota has to be set up while the warm template is fresh enough to fit AND tight enough that adding a second pushes over.

I tried several payload sizes and quotas on a 3.75G test pool and either both fit (no ENOSPC) or neither fit (no warm eviction recovery).
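
The three constraints above reduce to a window for the quota. As a rough sketch, assuming exactly one warm template of size `W` and one new template of size `T` (both taken from the dataset's `used` bytes), and ignoring metadata overhead and delayed quota enforcement, a quota `Q` can only work if `max(W, T) <= Q < W + T`. The `quota_window` helper below is hypothetical and not part of the codebase; it only prints that window.

```bash
# Hypothetical helper: print the range of quotas (in bytes) that satisfies the
# three constraints for one warm template (W) and one new template (T).
# Assumes both sizes are `used` bytes; ignores metadata overhead.
# Example: quota_window 104857600 125829120   (100 MiB warm, 120 MiB new)
quota_window() {
    local warm=$1 new=$2 lo hi
    # Lower bound: the quota must hold the warm template when it is set
    # (constraint 3) and the new template alone after eviction (constraint 1).
    if [ "$warm" -gt "$new" ]; then lo=$warm; else lo=$new; fi
    # Upper bound (exclusive): both templates together must not fit
    # (constraint 2).
    hi=$((warm + new))
    echo "min=$lo max_exclusive=$hi midpoint=$(( (lo + hi) / 2 ))"
}
```

On paper the window is never empty, so the "both fit or neither fit" results probably come from overhead and enforcement slack rather than from the arithmetic. Picking a quota near the midpoint leaves margin on both sides.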

## Also untested: Docker variant ENOSPC behavior

The `.sqsh` path uses `unsquashfs`, which surfaces ENOSPC immediately as a non-zero exit. The Docker path uses `tar | tar` inside `enroot-nsenter` and exhibited a different failure mode in one test: the receiving `tar` hung instead of exiting with an error, because ZFS quota visibility into the writing process is delayed by transaction-group commit timing. Since the pipeline never returned, the retry's `if !` guard never fired, and the merge command had to be killed with SIGKILL.

If this proves flaky in production, options include:

- Wrap the receiving `tar` with a poller that watches the dataset's `available` property and aborts the pipe with `SIGPIPE` once it hits zero.
- Pre-flight: run `zfs::sweep_templates` more aggressively before the merge if `under_pressure` is borderline (say, `>= threshold - 10`).
- Add a hard timeout around the merge command and treat timeout-with-no-progress as ENOSPC.
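
The hard-timeout option could look roughly like the sketch below. It assumes GNU coreutils `timeout` is available; `run_with_deadline` is a hypothetical name, and the return value 28 is an arbitrary choice (it happens to be ENOSPC's errno on Linux). The sketch only bounds wall-clock time and does not implement the no-progress check.

```bash
# Hypothetical wrapper: run a command under a deadline and report a timeout
# as an ordinary failure so the caller's `if !` guard can fire.
# Assumes GNU coreutils `timeout`.
run_with_deadline() {
    local limit=$1 rc
    shift
    # SIGTERM at the deadline, then SIGKILL 5 seconds later if still running.
    timeout --kill-after=5 "$limit" "$@"
    rc=$?
    case $rc in
        124|137)
            # 124: timed out; 137: had to be killed with SIGKILL (128 + 9).
            # Note: 137 can also mean something else killed the command.
            echo "[WARN] Command exceeded ${limit}s; treating as ENOSPC" >&2
            return 28
            ;;
        *)
            return "$rc"
            ;;
    esac
}
```

A pipeline such as `tar | tar` would have to be wrapped in `sh -c '...'` (or similar) so that the deadline covers both sides.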

## Suggested verification approaches

1. **Dedicated tiny pool.** Create a 64MB loopback file → 64MB pool. Tighter knobs available; smaller payloads exercise quotas with less ambiguity.
2. **Synthetic failure injection.** Replace `unsquashfs` (and the merge command) with a wrapper script that exits non-zero on first call and exits zero on second call. Verifies the success-after-retry transition without depending on real ENOSPC behavior. Useful as a unit-style check.
3. **`reservation` instead of `quota`.** A reservation on a sibling dataset (`zfs set reservation=N`) can squeeze the templates dataset's available bytes deterministically, sidestepping the "can't shrink quota below current usage" rule.
4. **Concurrent-extraction race against the quota.** N workers each extract a unique template against a quota that fits half of them. Forces sweep + retry under contention.
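
For approach 2, a minimal sketch of the failure-injection shim is below. Both functions are hypothetical: `flaky` fails on its first call and succeeds afterwards, and `retry_once` is only a stand-in for the retry flow described in this issue (one attempt, one sweep, one retry), not the real code in `zfs::ensure_template`.

```bash
# Hypothetical test shim: fail on the first invocation, succeed afterwards.
# The call count lives in a file so it survives across separate processes.
flaky() {
    local state=$1 count
    count=$(cat "$state" 2>/dev/null || true)
    count=$(( ${count:-0} + 1 ))
    echo "$count" > "$state"
    if [ "$count" -eq 1 ]; then
        echo "flaky: simulated ENOSPC on call $count" >&2
        return 1
    fi
    return 0
}

# Stand-in for the retry flow described in this issue (not the real code).
retry_once() {
    if ! "$@"; then
        echo "[WARN] Extraction failed; evicting all warm templates and retrying" >&2
        : # the real path would run the sweep here
        if ! "$@"; then
            echo "[ERROR] Extraction failed even after evicting warm templates" >&2
            return 1
        fi
    fi
    return 0
}
```

Running `retry_once flaky "$(mktemp)"` exercises the success-after-retry transition. To drive the real code path, the same counter logic could live in an executable named `unsquashfs` placed earlier on `PATH`; that wiring is not shown here.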

## Acceptance criteria

- A reproducible test recipe (in `doc/zfs.md` admin notes or a script in `pkg/` if/when test infra lands) that triggers each of these in a single run:
  - First attempt ENOSPC → sweep → retry → success → new template installed.
  - First attempt ENOSPC → sweep evicts nothing → retry → second ENOSPC → final error → `.tmp` cleaned.
  - Docker `tar | tar` variant: the receiving `tar` exits non-zero on ENOSPC instead of hanging.
- A short note in `doc/zfs.md` documenting the test recipe so admins can verify their own pool sizing produces the expected behavior.
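
As a possible starting point for the scripted recipe, the setup for the first scenario could use the reservation approach (item 3 under suggested approaches). The sketch below is an assumption-heavy illustration: the helper names, the `squeeze` dataset, and the `$pool/templates` layout are placeholders, and it prints the `zfs` commands unless `RUN=1` is set.

```bash
# Hypothetical setup helper for the success-after-retry scenario.
# Prints the zfs commands by default; set RUN=1 to execute them (needs root).
# Pool and dataset names are placeholders.
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi; }

# Compute the sibling reservation that leaves `target` bytes available,
# given the templates dataset's current `available` bytes.
squeeze_bytes() {
    local avail=$1 target=$2
    if [ "$target" -ge "$avail" ]; then
        echo "target must be below current available" >&2
        return 1
    fi
    echo $((avail - target))
}

squeeze_templates() {
    local pool=$1 avail=$2 target=$3 bytes
    bytes=$(squeeze_bytes "$avail" "$target") || return 1
    run zfs create "$pool/squeeze"
    run zfs set "reservation=$bytes" "$pool/squeeze"
    # Read back what the templates dataset can actually use.
    run zfs get -Hp -o value available "$pool/templates"
}
```

With a warm template of size `W` already installed and a new template of size `T`, the target would need to satisfy `T - W <= target < T`, assuming eviction frees the full `W`. The read-back at the end is there because the resulting `available` value may not match the arithmetic exactly.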

## Out of scope

- Implementing a more sophisticated retry policy (e.g. multiple sweep aggressiveness levels, exponential backoff). The current single-retry behavior matches the plan; this issue is about *verification*, not redesign.
