From 94448e0596b3fd7d2d8b08eb8a750197de1ff660 Mon Sep 17 00:00:00 2001 From: Wayland Yang Date: Thu, 21 May 2026 16:15:32 +0800 Subject: [PATCH] probe: futex tracing pinpoints 3 idle waiters parked on snapshot worker MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Third follow-up to PR #128. Adds bpftrace probe-futex-trace.sh that aggregates per-uaddr futex wait time (sys_enter/sys_exit_futex correlated by tid) during a BRANCH window, plus a cross-reference against /proc/\$pid/maps so we can place each uaddr in FC's address space. Findings (one 8 s window over a 153 ms BRANCH): @wait_ns[0x...7648, 137]: 152.35 ms (3 calls) @wait_ns[0x...8c88, 137]: 152.49 ms (2 calls) @wait_ns[0x...8d08, 137]: 152.56 ms (2 calls) @wait_ns[0x...52a0, 137]: 1.14 ms (6 calls) Op 137 = FUTEX_WAIT_BITSET_PRIVATE. The 3 hot futexes each accumulated ~152 ms wait ≈ the entire pause window. They live in anonymous heap mappings adjacent to FC binary's .bss (Rust heap- allocated synchronization primitives — parking_lot::Mutex / Condvar inner words). Interpretation: this is NOT contention on a hot mutex. It's 3 idle threads parked on the snapshot worker's completion signal, each sleeping the entire pause window then waking when the worker finishes. The contention is the *symptom*; the real bottleneck is whatever the snapshot worker does single-threaded. Revised #118 implication (third pass): - Phase 2 (io_uring) addresses ~2 % of the window — narrows further - Phase 3 (1 s tick) compounds nothing about the futexes — the 3 waiters are passive - Next operational step: build FC with DWARF symbols + perf-record flamegraph the snapshot worker thread → find the per-snapshot growing loop inside vmm::persist::create_snapshot Refs #118. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../PROBE-multi-branch-anomaly.md | 91 +++++++++++++ bench/pause-window/probe-futex-trace.sh | 120 ++++++++++++++++++ 2 files changed, 211 insertions(+) create mode 100644 bench/pause-window/probe-futex-trace.sh diff --git a/bench/pause-window/PROBE-multi-branch-anomaly.md b/bench/pause-window/PROBE-multi-branch-anomaly.md index 93d1430..f726ab8 100644 --- a/bench/pause-window/PROBE-multi-branch-anomaly.md +++ b/bench/pause-window/PROBE-multi-branch-anomaly.md @@ -221,6 +221,95 @@ The pause-time growth has at least three contributors, not just user-space CPU: contention is fixable, may cut multi-BRANCH pause growth without touching the IO path at all. +## Follow-up: futex tracing (2026-05-21) + +Acting on the previous follow-up's "futex contention is the most +operational signal", added one more probe: bpftrace on +`tracepoint:syscalls:sys_enter_futex` / `sys_exit_futex` with +`args->uaddr` and `args->op` captured. Per-futex wait time and call +count aggregated. + +### Result (8 s window covering one 153 ms BRANCH) + +``` +@wait_ns[0x79fa2c8775c8, 128]: 3.9k ns +@wait_ns[0x79fa2c878c88, 129]: 4.8k ns +... +@wait_ns[0x79fa2c8752a0, 137]: 1.14 ms ← 6 calls +@wait_ns[0x79fa2c887648, 137]: 152.35 ms ← 3 calls +@wait_ns[0x79fa2c878c88, 137]: 152.49 ms ← 2 calls +@wait_ns[0x79fa2c878d08, 137]: 152.56 ms ← 2 calls +``` + +Three different futexes each accumulated **~152 ms of wait time** +during the BRANCH window. Op 137 is `FUTEX_WAIT_BITSET_PRIVATE`. + +The wait time per futex (~152 ms) ≈ the entire pause window. So: +**3 separate threads each sat blocked on a different futex for the +entire BRANCH duration**, then woke when the snapshot worker +finished. + +### Where the futex addresses live (cross-ref `/proc/$pid/maps`) + +``` +0x79fa2c843000-0x79fa2c876000 rw-p firecracker (.data/.bss) +0x79fa2c876000-0x79fa2c879000 rw-p [anon, immediately after .bss] +0x79fa2c879000-0x79fa2c87d000 rw-p [anon] +... +0x79fa2c885000-0x79fa2c88f000 rw-p [anon] +``` + +Three of the hot futex addresses (`0x79fa...8c88`, `8d08`, `7648`) +fall in **anonymous heap mappings adjacent to FC's writable +section** — consistent with Rust heap allocations holding the inner +`AtomicU32` of a `parking_lot::Mutex` / `std::sync::Condvar`. + +The 4th (`0x79fa2c8752a0`, 6 calls × 1.14 ms total, lower amplitude) +falls in **FC binary's `.data`/`.bss`** — could be a `static`/ +`OnceCell`-held mutex, possibly `KVM_FD` or similar global. + +### Why this matters + +The 3 hot futexes pattern looks like a producer-consumer: +1 snapshot-worker thread doing the actual write, with 3 other +threads (vCPU pause acknowledger? vhost reaper? event-loop?) all +sleeping on `WAIT_BITSET_PRIVATE` for the snapshot worker to signal +completion. If the snapshot worker's wall time grows with snapshot +count (it does), all 3 waiters' wait time grows in lockstep. + +This means **the contention is not the cause of slowness** — it's +the *symptom*. Eliminating the contention wouldn't speed anything +up; the bottleneck is whatever the snapshot worker is doing +single-threaded. + +### Implications for #118 — third revision + +The original Phase 2/3 scope was wrong; the first probe corrected to +"user-space CPU"; this one corrects further to "user-space CPU in +the snapshot worker, with 3 idle waiters parked on its completion +futex". + +Operational next steps to actually identify the snapshot-worker +loop: + +1. **Build a debug-symbols Firecracker** (`cargo build --release --features dwarf-symbols` or equivalent) so `perf record` / `bpftrace ustack(perf)` can resolve user-space frames to Rust function names. +2. **`perf record -F 99 -p $FC_PID -g`** during a slow BRANCH window + (now that the dev box's `linux-tools-6.14.0-36-generic` ships + without `perf`, this requires building perf from kernel source + OR using a kernel version that has it). Flame graph the worker + thread's stack. +3. **Cross-check against Firecracker's + `vmm::persist::create_snapshot`** source. The function is ~21 KB + of compiled code; if there's a per-snapshot growing data + structure (memory region list, device descriptor vec, dirty + bitmap walk), it should jump out. + +### Files + +- `bench/pause-window/probe-futex-trace.sh` — the bpftrace script + that produced the data above. Writes /tmp/futex-trace-*.txt plus + /tmp/futex-trace-*.txt.maps for uaddr cross-reference. + ## Files - `bench/pause-window/probe-multi-branch-strace.sh` — the original @@ -230,4 +319,6 @@ The pause-time growth has at least three contributors, not just user-space CPU: - `bench/pause-window/probe-syscall-poll.sh` — /proc/$pid/syscall poll loop (200 Hz); did not pinpoint a single syscall (consistent with the off-CPU finding). +- `bench/pause-window/probe-futex-trace.sh` — bpftrace futex aggregator + (see "Follow-up: futex tracing" above for results). - This document. diff --git a/bench/pause-window/probe-futex-trace.sh b/bench/pause-window/probe-futex-trace.sh new file mode 100644 index 0000000..14b4440 --- /dev/null +++ b/bench/pause-window/probe-futex-trace.sh @@ -0,0 +1,120 @@ +#!/usr/bin/env bash +# probe-futex-trace.sh — bpftrace the firecracker process's futex calls +# during a slow BRANCH to identify which specific futex(es) accumulate +# wait time. +# +# Follows up on PROBE-multi-branch-anomaly.md: thread-level probe found +# 17/250 in-kernel-sleep samples in futex_wait_queue (the only signal +# besides the parked vCPU). This script aggregates per-uaddr wait +# duration so we can: +# 1. Confirm that futex contention scales with snapshot count +# 2. Identify the specific futex (memory address; later cross-referenced +# to a symbol/data structure via /proc/$pid/maps) +# +# Output: histogram of (uaddr, op) → total wait nanoseconds. +set -euo pipefail + +FORKD_URL=${FORKD_URL:-http://127.0.0.1:8889} +FORKD_TOKEN=${FORKD_TOKEN:-$(cat "${FORKD_TOKEN_FILE:-/etc/forkd/token}" 2>/dev/null || echo "")} +TAG=${TAG:-coding-agent-fork-prewarm-v1} +WARMUP_BRANCHES=${WARMUP_BRANCHES:-6} +GAP_SECS=${GAP_SECS:-3} +OUT="/tmp/futex-trace-$(date +%s).txt" +auth=(-H "Authorization: Bearer $FORKD_TOKEN") + +echo "[probe] output → $OUT" >&2 + +# Spawn source +spawn=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"snapshot_tag\":\"$TAG\",\"n\":1,\"per_child_netns\":true}" \ + "$FORKD_URL/v1/sandboxes") +sb_id=$(echo "$spawn" | jq -r '.[0].id') +fc_pid=$(echo "$spawn" | jq -r '.[0].pid') +echo "[probe] sandbox=$sb_id fc_pid=$fc_pid" >&2 +sleep 2 + +# Warmup into the slow regime +for i in $(seq 1 "$WARMUP_BRANCHES"); do + sleep "$GAP_SECS" + resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"tag\":\"w-$i-$(date +%s%N)\",\"diff\":true}" \ + "$FORKD_URL/v1/sandboxes/$sb_id/branch") + echo "[probe] warmup BRANCH $i: pause_ms=$(echo "$resp" | jq -r .pause_ms)" >&2 +done + +# Launch bpftrace — capture entry/exit of futex(2) and aggregate wait time +# per (uaddr, op) tuple. +# Note: `args` field in tracepoint:syscalls:sys_enter_futex has uaddr, op. +# FUTEX_WAIT_PRIVATE = 128 (most common), FUTEX_WAKE_PRIVATE = 129. +echo "[probe] starting bpftrace (8 s window covers the next BRANCH)" >&2 +sudo bpftrace -e " +tracepoint:syscalls:sys_enter_futex +/ pid == $fc_pid / +{ + @start[tid] = nsecs; + @uaddr[tid] = args->uaddr; + @op[tid] = args->op; +} + +tracepoint:syscalls:sys_exit_futex +/ pid == $fc_pid && @start[tid] != 0 / +{ + \$d = nsecs - @start[tid]; + @wait_ns[@uaddr[tid], @op[tid]] = sum(\$d); + @wait_count[@uaddr[tid], @op[tid]] = count(); + delete(@start[tid]); + delete(@uaddr[tid]); + delete(@op[tid]); +} + +interval:s:8 { exit(); } +" > "$OUT" 2>&1 & +bp_pid=$! +sleep 0.5 + +# Fire the slow BRANCH while bpftrace is recording +sleep "$GAP_SECS" +echo "[probe] firing profiled BRANCH" >&2 +t0_ns=$(date +%s%N) +resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \ + -d "{\"tag\":\"prof-$(date +%s%N)\",\"diff\":true}" \ + "$FORKD_URL/v1/sandboxes/$sb_id/branch") +t1_ns=$(date +%s%N) +echo "[probe] profiled BRANCH: pause_ms=$(echo "$resp" | jq -r .pause_ms) wall=$(( (t1_ns - t0_ns) / 1000000 ))ms" >&2 + +wait "$bp_pid" 2>/dev/null || true + +# Save FC's /proc/$pid/maps so we can later annotate which mapping the +# top uaddr falls in. +sudo cp /proc/$fc_pid/maps "$OUT.maps" 2>/dev/null || true + +# Cleanup +curl -fsS -X DELETE "${auth[@]}" "$FORKD_URL/v1/sandboxes/$sb_id" > /dev/null || true + +echo "" >&2 +echo "===== top futexes by total wait time =====" >&2 +# bpftrace prints @wait_ns[u64, u64]: N — sort by N. +awk ' +/^@wait_ns\[/ { + # line looks like "@wait_ns[0x7fab12345678, 128]: 1234567890" + match($0, /\[([0-9]+|0x[0-9a-fA-F]+), ([0-9]+)\]:\s*([0-9]+)/, m) + if (m[3] > 0) { + printf " %s op=%s wait_ms=%.2f\n", m[1], m[2], m[3]/1e6 + } +} +' "$OUT" | sort -k4 -t= -n -r | head -10 >&2 + +echo "" >&2 +echo "===== top futexes by call count =====" >&2 +awk ' +/^@wait_count\[/ { + match($0, /\[([0-9]+|0x[0-9a-fA-F]+), ([0-9]+)\]:\s*([0-9]+)/, m) + if (m[3] > 0) { + printf " %s op=%s calls=%s\n", m[1], m[2], m[3] + } +} +' "$OUT" | sort -k4 -t= -n -r | head -10 >&2 + +echo "" >&2 +echo "Raw trace: $OUT" >&2 +echo "FC memory map: $OUT.maps" >&2