Status: Deferred from v0.3. The design and scaffolding are in the repo (see "What's already here" below) so the work can be picked up cleanly if/when the cost-benefit changes. The reason for deferral is in "Why deferred" below.
Goal
Cut BRANCH pause-window to ~30 ms regardless of source memory size, by replacing "pause source, write full memory.bin, resume" with "pause source, register WP on its memory, resume; children inherit a memfd view and the WP handler resolves source's post-fork writes lazily."
Motivating measurement: bench/pause-window/RESULTS-v0.2.md. Today's pause is storage-bound: 163 ms (tmpfs) to 4.26 s (SATA SSD) for 513 MiB, scaling linearly with source memory. v0.2.5's prewarm (PR #100) flattens the cold/warm ratio but doesn't change the absolute floor.
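As a rough aid for reading those numbers, the sketch below turns the two measured pauses into an implied write throughput and projects the pause for other memory sizes. It assumes strictly linear scaling, as stated above; the function names are illustrative and do not come from the bench harness.

```rust
/// Effective write throughput (MiB/s) implied by one measured pause.
fn throughput_mib_per_s(mem_mib: f64, pause_ms: f64) -> f64 {
    mem_mib / (pause_ms / 1000.0)
}

/// Projected pause (ms) for a different source size, assuming the pause
/// scales linearly with source memory.
fn projected_pause_ms(mem_mib: f64, measured_mib: f64, measured_pause_ms: f64) -> f64 {
    measured_pause_ms * (mem_mib / measured_mib)
}

fn main() {
    // Figures quoted from bench/pause-window/RESULTS-v0.2.md: 513 MiB source.
    let tmpfs = throughput_mib_per_s(513.0, 163.0);
    let sata = throughput_mib_per_s(513.0, 4260.0);
    println!("implied throughput: tmpfs {tmpfs:.0} MiB/s, SATA SSD {sata:.0} MiB/s");

    // Hypothetical 4 GiB source on the same SATA SSD.
    let pause = projected_pause_ms(4096.0, 513.0, 4260.0);
    println!("projected pause for 4 GiB on SATA SSD: {:.1} s", pause / 1000.0);
}
```

A fixed ~30 ms target is independent of `mem_mib`, which is the whole point of the goal: the projection above grows without bound, the target does not.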
What's already here (scaffolding from v0.3 cycle)
| Artifact | Purpose | Notes |
|---|---|---|
| docs/design/userfaultfd.md | Full architectural design | Honest about the open question on source-divergence sync |
| crates/forkd-uffd/ | Workspace crate with Firecracker UDS handshake parser | Linux-only; tested over socketpair(2); no event loop yet |
| crates/forkd-vmm/src/lib.rs MemoryBackend::Userfault | Enum variant + ForkOpts field | bail!s in restore_many_with so no caller can rely on it |
| firecracker-patch/v0.3-memfd-backend.patch | First-cut patch against Firecracker v1.10.1 | Not compile-tested; reviewed against real upstream source |
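For orientation, the "variant exists but refuses to run" pattern described for MemoryBackend::Userfault can be pictured roughly as follows. This is a std-only sketch, not the crate's code: the `File` variant name, the `memory_backend` field name, and the `restore_many_with` signature are assumptions made for illustration, and the real function reports the error with `bail!` rather than a `String`.

```rust
/// Which mechanism backs a child's guest memory.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum MemoryBackend {
    /// Hypothetical name for today's path: restore from a memory.bin file.
    #[default]
    File,
    /// Reserved for the userfaultfd live-fork design; not wired up.
    Userfault,
}

#[derive(Debug, Default)]
struct ForkOpts {
    /// Hypothetical field name.
    memory_backend: MemoryBackend,
}

/// Stand-in for the real restore entry point. It returns the number of
/// children it would restore, or an error for the unimplemented backend.
fn restore_many_with(opts: &ForkOpts, count: usize) -> Result<usize, String> {
    match opts.memory_backend {
        MemoryBackend::File => Ok(count),
        // Fail loudly so no caller can come to depend on a half-built path.
        MemoryBackend::Userfault => {
            Err("MemoryBackend::Userfault is not implemented yet".to_string())
        }
    }
}

fn main() {
    let opts = ForkOpts { memory_backend: MemoryBackend::Userfault };
    println!("{:?}", restore_many_with(&opts, 4));
    println!("{:?}", restore_many_with(&ForkOpts::default(), 4));
}
```

Keeping the variant in the public enum while rejecting it at the entry point lets the API shape settle without shipping any behavior.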
Why deferred
- Architecture isn't closed. memfd injection is one piece; the harder piece is what happens to source's post-fork writes (its private CoW pages diverge from the memfd, so children see boot state, not source's current state). The prior art cited most often (MITOSIS, NFork, CodeSandbox) solves different problems: cross-host RDMA fork, kernel page-table sharing, and cold-snapshot startup, respectively. None of them is a drop-in for "instant BRANCH from a long-running source."
- Firecracker fork is real maintenance. Forking and patching costs ~1 week up front plus permanent upkeep: keep a deeplethe/firecracker repo, rebase the patch on every upstream tag, run our own musl-via-docker CI, publish per-version releases, and weaken the "uses official firecracker" trust story.
- Cheaper alternatives haven't been exhausted. Things that don't require a Firecracker patch and could land in v0.3 instead:
  - Diff snapshots: Firecracker already supports enable_diff_snapshots: true + track_dirty_pages. The second and every later BRANCH from the same source only writes pages dirtied since the last snapshot. Typically a 5-10x speedup for repeated fan-out. ETA 3-5 days. Likely the biggest single win available.
  - NVMe + io_uring snapshot writer: pure forkd-side, no Firecracker change. SSD pause should drop from ~4 s to ~400 ms.
  - Parallel snapshot writer: multi-threaded pwrite of memory.bin. Requires Firecracker cooperation; might or might not be doable without a patch.
  - Pre-emptive background snapshot: write source's memory to a staging file continuously; at BRANCH, only flush what's been dirtied since the last flush. Bounds the pause to the work accumulated over one flush interval.
Together these get us most of the perceived win (sub-second BRANCH on commodity SSD) without any firecracker work.
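The common idea behind the diff-snapshot and background-snapshot options is dirty-page tracking: only pages written since the previous flush need to hit storage. The following is a toy in-process model of that bookkeeping, not Firecracker's implementation; `TrackedMemory`, `take_diff`, and `apply_diff` are names invented for this sketch.

```rust
const PAGE_SIZE: usize = 4096;

/// Guest memory plus a one-flag-per-page dirty bitmap.
struct TrackedMemory {
    bytes: Vec<u8>,
    dirty: Vec<bool>,
}

impl TrackedMemory {
    fn new(num_pages: usize) -> Self {
        // Every page starts dirty so the first snapshot is a full one.
        Self { bytes: vec![0; num_pages * PAGE_SIZE], dirty: vec![true; num_pages] }
    }

    /// A guest write: update the byte and mark its page dirty.
    fn write(&mut self, addr: usize, value: u8) {
        self.bytes[addr] = value;
        self.dirty[addr / PAGE_SIZE] = true;
    }

    /// Collect (page index, contents) for every dirty page and reset the bitmap.
    fn take_diff(&mut self) -> Vec<(usize, Vec<u8>)> {
        let mut out = Vec::new();
        for (i, d) in self.dirty.iter_mut().enumerate() {
            if *d {
                let start = i * PAGE_SIZE;
                out.push((i, self.bytes[start..start + PAGE_SIZE].to_vec()));
                *d = false;
            }
        }
        out
    }
}

/// Layer a diff on top of a base image, as a restore would.
fn apply_diff(base: &mut [u8], diff: &[(usize, Vec<u8>)]) {
    for (i, page) in diff {
        let start = i * PAGE_SIZE;
        base[start..start + PAGE_SIZE].copy_from_slice(page);
    }
}

fn main() {
    let mut mem = TrackedMemory::new(8);
    let full = mem.take_diff();
    mem.write(PAGE_SIZE + 10, 0xAA);
    mem.write(5 * PAGE_SIZE, 0xBB);
    let diff = mem.take_diff();
    println!("first snapshot: {} pages, second: {} pages", full.len(), diff.len());

    let mut base = vec![0u8; 8 * PAGE_SIZE];
    apply_diff(&mut base, &diff);
    println!("restored image matches source: {}", base == mem.bytes);
}
```

In this model the pause cost of the second BRANCH is proportional to the number of dirtied pages rather than to total memory, which is where the expected speedup for repeated fan-out comes from.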
Revival criteria
This stops being deferred when at least two of:
What gets touched if revived
- crates/forkd-uffd/: add the UFFDIO_REGISTER / UFFDIO_COPY / UFFDIO_WP event loop on top of today's handshake parser.
- firecracker-patch/: refresh the patch against the current Firecracker tag, get it compile-tested, fork the upstream repo.
- crates/forkd-vmm/src/lib.rs: wire MemoryBackend::Userfault to actually spawn the handler, create the memfd, and send it across the UDS instead of bail!.
- docs/ROADMAP.md, docs/design/userfaultfd.md: update from "deferred" back to "in flight."
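To make the event-loop item more concrete, here is a toy model of what a write-protect handler could do. It makes no real userfaultfd calls and assumes one possible resolution of the open source-divergence question: on the source's first write to a protected page, the handler saves the fork-time contents for children, then lifts protection. `LiveFork` and its methods are invented for this sketch.

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;

struct LiveFork {
    /// The source VM's live memory, which keeps changing after BRANCH.
    source: Vec<u8>,
    /// Pages still write-protected since BRANCH.
    write_protected: Vec<bool>,
    /// Fork-time copies of pages the source has since overwritten.
    preserved: HashMap<usize, Vec<u8>>,
}

impl LiveFork {
    /// BRANCH: protect every page; nothing is copied during the pause.
    fn branch(source: Vec<u8>) -> Self {
        let pages = source.len() / PAGE_SIZE;
        Self { source, write_protected: vec![true; pages], preserved: HashMap::new() }
    }

    /// A source write. The first write to a protected page plays the role
    /// of a WP fault: save the old contents, unprotect, then let it proceed.
    fn source_write(&mut self, addr: usize, value: u8) {
        let page = addr / PAGE_SIZE;
        if self.write_protected[page] {
            let start = page * PAGE_SIZE;
            self.preserved.insert(page, self.source[start..start + PAGE_SIZE].to_vec());
            self.write_protected[page] = false;
        }
        self.source[addr] = value;
    }

    /// A child read always observes fork-time state.
    fn child_read(&self, addr: usize) -> u8 {
        let page = addr / PAGE_SIZE;
        match self.preserved.get(&page) {
            Some(copy) => copy[addr % PAGE_SIZE],
            None => self.source[addr],
        }
    }
}

fn main() {
    let mut fork = LiveFork::branch(vec![7u8; 4 * PAGE_SIZE]);
    fork.source_write(5000, 9);
    println!("source sees {}, child sees {}", fork.source[5000], fork.child_read(5000));
    println!("pages copied lazily: {}", fork.preserved.len());
}
```

The copy cost moves from the pause window to the source's first write per page, so the pause itself covers only protection setup; whether that fits the memfd design is exactly the question docs/design/userfaultfd.md leaves open.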
Related work shipped in the v0.3 cycle anyway
- MemoryBackend enum + scaffolding (no behavior change, but the API shape is stable for future use).
- The forkd-uffd handshake parser is a real reusable component; even without the full live-fork architecture, it's the right way to talk to a userfaultfd-using firecracker.