v0.4+ candidate: live-fork via memfd-backed source RAM + uffd_wp (deferred from v0.3) #101

@WaylandYang

Status: Deferred from v0.3. The design and scaffolding are in the repo (see "What's already here" below) so the work can be picked up cleanly if/when the cost-benefit changes. The reason for deferral is in "Why deferred" below.

Goal

Cut BRANCH pause-window to ~30 ms regardless of source memory size, by replacing "pause source, write full memory.bin, resume" with "pause source, register WP on its memory, resume; children inherit a memfd view and the WP handler resolves source's post-fork writes lazily."

Motivating measurement: bench/pause-window/RESULTS-v0.2.md. Today's pause is storage-bound: 163 ms (tmpfs) to 4.26 s (SATA SSD) for 513 MiB, scaling linearly with source memory. v0.2.5's prewarm (PR #100) flattens the cold/warm ratio but doesn't change the absolute floor.
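To make the "storage-bound, scaling linearly" claim concrete, here is a small back-of-the-envelope sketch. It derives the effective write throughput implied by the two measurements above and projects the pause for a larger source. The 4 GiB source size is a hypothetical chosen for illustration, not a benchmarked configuration, and the projection assumes the linear scaling holds outside the measured range.

```rust
/// Effective snapshot write throughput (MiB/s) implied by one pause measurement.
fn throughput_mib_s(size_mib: f64, pause_ms: f64) -> f64 {
    size_mib / (pause_ms / 1000.0)
}

/// Projected pause (ms) for another source size, assuming pause scales
/// linearly with memory size at the same throughput.
fn projected_pause_ms(size_mib: f64, throughput_mib_s: f64) -> f64 {
    size_mib / throughput_mib_s * 1000.0
}

fn main() {
    // Measurements quoted from bench/pause-window/RESULTS-v0.2.md (513 MiB source).
    let measured = [("tmpfs", 163.0), ("SATA SSD", 4260.0)];
    // Hypothetical larger source, for illustration only.
    let hypothetical_mib = 4096.0;

    for (name, pause_ms) in measured {
        let tput = throughput_mib_s(513.0, pause_ms);
        let projected = projected_pause_ms(hypothetical_mib, tput);
        println!(
            "{name}: ~{tput:.0} MiB/s effective, projected pause for 4 GiB: ~{:.1} s",
            projected / 1000.0
        );
    }
}
```

The measured numbers work out to roughly 3 GiB/s on tmpfs and roughly 120 MiB/s on SATA SSD, which is why the absolute floor depends on the storage device rather than on anything forkd does around the write.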

What's already here (scaffolding from v0.3 cycle)

| Artifact | Purpose | Notes |
| --- | --- | --- |
| docs/design/userfaultfd.md | Full architectural design | Honest about the open question on source-divergence sync |
| crates/forkd-uffd/ | Workspace crate with Firecracker UDS handshake parser | Linux-only; tested over socketpair(2); no event loop yet |
| crates/forkd-vmm/src/lib.rs MemoryBackend::Userfault | Enum variant + ForkOpts field | bail!s in restore_many_with so no caller can rely on it |
| firecracker-patch/v0.3-memfd-backend.patch | First-cut patch against Firecracker v1.10.1 | Not compile-tested; reviewed against real upstream source |
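For readers unfamiliar with the "variant exists but bails" pattern in the scaffolding, a minimal sketch follows. Only the names MemoryBackend::Userfault, ForkOpts, and restore_many_with come from this issue. The File variant, the field names, the signature, and the String error type are assumptions made for illustration (the real code uses bail!), so treat this as a shape, not the code in crates/forkd-vmm/src/lib.rs.

```rust
use std::path::PathBuf;

/// Hypothetical shape of the backend selector.
#[derive(Debug, Clone, PartialEq)]
enum MemoryBackend {
    /// Assumed name for today's path: children restore from memory.bin on disk.
    File,
    /// Reserved for the live-fork path; not implemented.
    Userfault,
}

/// Hypothetical options struct; field names are illustrative.
#[derive(Debug, Clone)]
struct ForkOpts {
    memory_backend: MemoryBackend,
    snapshot_dir: PathBuf,
}

/// Stand-in for restore_many_with: rejects the unimplemented backend up
/// front so no caller can silently depend on it.
fn restore_many_with(opts: &ForkOpts, count: usize) -> Result<Vec<PathBuf>, String> {
    match opts.memory_backend {
        MemoryBackend::Userfault => Err(
            "MemoryBackend::Userfault is scaffolding only; live-fork is deferred".to_string(),
        ),
        MemoryBackend::File => Ok((0..count)
            .map(|i| opts.snapshot_dir.join(format!("child-{i}")))
            .collect()),
    }
}

fn main() {
    let mut opts = ForkOpts {
        memory_backend: MemoryBackend::File,
        snapshot_dir: PathBuf::from("/tmp/forkd-snap"),
    };
    println!("file backend: {:?}", restore_many_with(&opts, 2));

    opts.memory_backend = MemoryBackend::Userfault;
    println!("userfault backend: {:?}", restore_many_with(&opts, 2));
}
```

Failing early keeps the API shape stable while making it impossible to ship a caller that assumes the handler exists.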

Why deferred

  1. Architecture isn't closed. memfd injection is one piece; the harder piece is what happens to source's post-fork writes (its private CoW pages diverge from the memfd, so children see boot state, not source's current state). The prior art that's most often cited — MITOSIS, NFork, CodeSandbox — each solves a different problem (cross-host RDMA fork, kernel page-table sharing, cold-snapshot startup). None of them is a drop-in for "instant BRANCH from a long-running source."

  2. Firecracker fork is real maintenance. Forking and patching costs ~1 week up front plus a permanent tax: keep a deeplethe/firecracker repo, rebase the patch on every upstream tag, run our own musl-via-docker CI, publish per-version releases, and weaken the "uses official firecracker" trust story.

  3. Cheaper alternatives haven't been exhausted. Things that don't require a firecracker patch and could land in v0.3 instead:

    • Diff snapshots — firecracker already supports enable_diff_snapshots: true + track_dirty_pages. The second-and-after BRANCH from the same source only writes pages dirtied since the last snapshot. Typically 5-10x speedup for repeated fan-out. ETA 3-5 days. Likely the biggest single win available.
    • NVMe + io_uring snapshot writer — pure forkd-side, no firecracker change. SSD pause should drop from ~4 s to ~400 ms.
    • Parallel snapshot writer — multi-threaded pwrite of the memory.bin. Requires firecracker cooperation; might or might not be doable without a patch.
    • Pre-emptive background snapshot — write source's memory to a staging file continuously; at BRANCH, only flush what's dirtied since the last flush. Bounds the pause to the flush interval.

    Together these get us most of the perceived win (sub-second BRANCH on commodity SSD) without any firecracker work.
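The idea shared by the diff-snapshot and background-snapshot options is that a repeat BRANCH only rewrites pages dirtied since the last write. A minimal sketch of that write path is below. It is not Firecracker's implementation: the function name, the fixed 4 KiB page size, and the Vec-of-bools dirty map are all assumptions for illustration, and a real writer would take the dirty bitmap from the VMM.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

const PAGE_SIZE: usize = 4096; // assumed page size for this sketch

/// Writes only the pages flagged in `dirty` from `guest_mem` into the memory
/// file, each at its own offset, leaving clean pages on disk untouched.
/// Returns the number of pages written.
fn write_dirty_pages(path: &Path, guest_mem: &[u8], dirty: &[bool]) -> io::Result<usize> {
    assert_eq!(guest_mem.len(), dirty.len() * PAGE_SIZE);
    // Open without truncating so pages from the previous snapshot survive.
    let file = OpenOptions::new().write(true).create(true).open(path)?;
    file.set_len(guest_mem.len() as u64)?;

    let mut written = 0;
    for (idx, is_dirty) in dirty.iter().enumerate() {
        if *is_dirty {
            let off = idx * PAGE_SIZE;
            file.write_all_at(&guest_mem[off..off + PAGE_SIZE], off as u64)?;
            written += 1;
        }
    }
    Ok(written)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join(format!("forkd-diff-sketch-{}.bin", std::process::id()));
    let mut mem = vec![0xAAu8; 4 * PAGE_SIZE];

    // First BRANCH: every page counts as dirty, so this is a full write.
    let full = write_dirty_pages(&path, &mem, &[true; 4])?;

    // The source then touches one page; the second BRANCH writes only that page.
    mem[2 * PAGE_SIZE] = 0x55;
    let diff = write_dirty_pages(&path, &mem, &[false, false, true, false])?;

    let on_disk = std::fs::read(&path)?;
    std::fs::remove_file(&path)?;
    assert_eq!(on_disk, mem); // file matches current source memory
    println!("full snapshot wrote {full} pages, diff snapshot wrote {diff}");
    Ok(())
}
```

With this shape the pause scales with the number of dirty pages rather than with total source memory, which is where the expected speedup for repeated fan-out comes from.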

Revival criteria

This stops being deferred when at least two of the following hold:

  • Diff snapshots + NVMe + parallel writer have shipped and been benchmarked, and the remaining pause floor is still the limiting factor on a real user workflow we care about.
  • A specific user (forkd Hub user, downstream project, paper co-author) commits to using the live-fork path. Maintenance cost has a beneficiary.
  • We've sketched the source-divergence sync mechanism end-to-end (concrete enough to be a 10-page paper section, not just a paragraph) and it survives a careful review.
  • Firecracker upstream accepts external-memfd injection (would remove the fork-maintenance cost entirely).

What gets touched if revived

  • crates/forkd-uffd/: add the UFFDIO_REGISTER / UFFDIO_COPY / UFFDIO_WRITEPROTECT event loop on top of today's handshake parser.
  • firecracker-patch/: refresh the patch against the current firecracker tag, get it compile-tested, fork the upstream repo.
  • crates/forkd-vmm/src/lib.rs: wire MemoryBackend::Userfault to actually spawn the handler, create the memfd, and send it across the UDS instead of calling bail!.
  • docs/ROADMAP.md, docs/design/userfaultfd.md: update from "deferred" back to "in flight."

Related work shipped in the v0.3 cycle anyway

  • MemoryBackend enum + scaffolding (no behavior change, but the API shape is stable for future use).
  • The forkd-uffd handshake parser is a real reusable component; even without the full live-fork architecture, it's the right way to talk to a userfaultfd-using firecracker.
