Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
91 changes: 91 additions & 0 deletions bench/pause-window/PROBE-multi-branch-anomaly.md
Original file line number Diff line number Diff line change
Expand Up @@ -221,6 +221,95 @@ The pause-time growth has at least three contributors, not just user-space CPU:
contention is fixable, may cut multi-BRANCH pause growth without
touching the IO path at all.

## Follow-up: futex tracing (2026-05-21)

Acting on the previous follow-up's "futex contention is the most
operational signal", added one more probe: bpftrace on
`tracepoint:syscalls:sys_enter_futex` / `sys_exit_futex` with
`args->uaddr` and `args->op` captured. Per-futex wait time and call
count aggregated.

### Result (8 s window covering one 153 ms BRANCH)

```
@wait_ns[0x79fa2c8775c8, 128]: 3.9k ns
@wait_ns[0x79fa2c878c88, 129]: 4.8k ns
...
@wait_ns[0x79fa2c8752a0, 137]: 1.14 ms ← 6 calls
@wait_ns[0x79fa2c887648, 137]: 152.35 ms ← 3 calls
@wait_ns[0x79fa2c878c88, 137]: 152.49 ms ← 2 calls
@wait_ns[0x79fa2c878d08, 137]: 152.56 ms ← 2 calls
```

Three different futexes each accumulated **~152 ms of wait time**
during the BRANCH window. Op 137 is `FUTEX_WAIT_BITSET_PRIVATE`.

The wait time per futex (~152 ms) ≈ the entire pause window. So:
**3 separate threads each sat blocked on a different futex for the
entire BRANCH duration**, then woke when the snapshot worker
finished.

### Where the futex addresses live (cross-ref `/proc/$pid/maps`)

```
0x79fa2c843000-0x79fa2c876000 rw-p firecracker (.data/.bss)
0x79fa2c876000-0x79fa2c879000 rw-p [anon, immediately after .bss]
0x79fa2c879000-0x79fa2c87d000 rw-p [anon]
...
0x79fa2c885000-0x79fa2c88f000 rw-p [anon]
```

Three of the hot futex addresses (`0x79fa...8c88`, `8d08`, `7648`)
fall in **anonymous heap mappings adjacent to FC's writable
section** — consistent with Rust heap allocations holding the inner
`AtomicU32` of a `parking_lot::Mutex` / `std::sync::Condvar`.

The 4th (`0x79fa2c8752a0`, 6 calls × 1.14 ms total, lower amplitude)
falls in **FC binary's `.data`/`.bss`** — could be a `static`/
`OnceCell`-held mutex, possibly `KVM_FD` or similar global.

### Why this matters

The 3 hot futexes pattern looks like a producer-consumer:
1 snapshot-worker thread doing the actual write, with 3 other
threads (vCPU pause acknowledger? vhost reaper? event-loop?) all
sleeping on `WAIT_BITSET_PRIVATE` for the snapshot worker to signal
completion. If the snapshot worker's wall time grows with snapshot
count (it does), all 3 waiters' wait time grows in lockstep.

This means **the contention is not the cause of slowness** — it's
the *symptom*. Eliminating the contention wouldn't speed anything
up; the bottleneck is whatever the snapshot worker is doing
single-threaded.

### Implications for #118 — third revision

The original Phase 2/3 scope was wrong; the first probe corrected to
"user-space CPU"; this one corrects further to "user-space CPU in
the snapshot worker, with 3 idle waiters parked on its completion
futex".

Operational next steps to actually identify the snapshot-worker
loop:

1. **Build a debug-symbols Firecracker** (`cargo build --release --features dwarf-symbols` or equivalent) so `perf record` / `bpftrace ustack(perf)` can resolve user-space frames to Rust function names.
2. **`perf record -F 99 -p $FC_PID -g`** during a slow BRANCH window
(now that the dev box's `linux-tools-6.14.0-36-generic` ships
without `perf`, this requires building perf from kernel source
OR using a kernel version that has it). Flame graph the worker
thread's stack.
3. **Cross-check against Firecracker's
`vmm::persist::create_snapshot`** source. The function is ~21 KB
of compiled code; if there's a per-snapshot growing data
structure (memory region list, device descriptor vec, dirty
bitmap walk), it should jump out.

### Files

- `bench/pause-window/probe-futex-trace.sh` — the bpftrace script
that produced the data above. Writes /tmp/futex-trace-*.txt plus
/tmp/futex-trace-*.txt.maps for uaddr cross-reference.

## Files

- `bench/pause-window/probe-multi-branch-strace.sh` — the original
Expand All @@ -230,4 +319,6 @@ The pause-time growth has at least three contributors, not just user-space CPU:
- `bench/pause-window/probe-syscall-poll.sh` — /proc/$pid/syscall
poll loop (200 Hz); did not pinpoint a single syscall (consistent
with the off-CPU finding).
- `bench/pause-window/probe-futex-trace.sh` — bpftrace futex aggregator
(see "Follow-up: futex tracing" above for results).
- This document.
120 changes: 120 additions & 0 deletions bench/pause-window/probe-futex-trace.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
#!/usr/bin/env bash
# probe-futex-trace.sh — bpftrace the firecracker process's futex calls
# during a slow BRANCH to identify which specific futex(es) accumulate
# wait time.
#
# Follows up on PROBE-multi-branch-anomaly.md: thread-level probe found
# 17/250 in-kernel-sleep samples in futex_wait_queue (the only signal
# besides the parked vCPU). This script aggregates per-uaddr wait
# duration so we can:
# 1. Confirm that futex contention scales with snapshot count
# 2. Identify the specific futex (memory address; later cross-referenced
# to a symbol/data structure via /proc/$pid/maps)
#
# Output: histogram of (uaddr, op) → total wait nanoseconds.
set -euo pipefail

FORKD_URL=${FORKD_URL:-http://127.0.0.1:8889}
FORKD_TOKEN=${FORKD_TOKEN:-$(cat "${FORKD_TOKEN_FILE:-/etc/forkd/token}" 2>/dev/null || echo "")}
TAG=${TAG:-coding-agent-fork-prewarm-v1}
WARMUP_BRANCHES=${WARMUP_BRANCHES:-6}
GAP_SECS=${GAP_SECS:-3}
OUT="/tmp/futex-trace-$(date +%s).txt"
auth=(-H "Authorization: Bearer $FORKD_TOKEN")

echo "[probe] output → $OUT" >&2

# Spawn source
spawn=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \
-d "{\"snapshot_tag\":\"$TAG\",\"n\":1,\"per_child_netns\":true}" \
"$FORKD_URL/v1/sandboxes")
sb_id=$(echo "$spawn" | jq -r '.[0].id')
fc_pid=$(echo "$spawn" | jq -r '.[0].pid')
echo "[probe] sandbox=$sb_id fc_pid=$fc_pid" >&2
sleep 2

# Warmup into the slow regime
for i in $(seq 1 "$WARMUP_BRANCHES"); do
sleep "$GAP_SECS"
resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \
-d "{\"tag\":\"w-$i-$(date +%s%N)\",\"diff\":true}" \
"$FORKD_URL/v1/sandboxes/$sb_id/branch")
echo "[probe] warmup BRANCH $i: pause_ms=$(echo "$resp" | jq -r .pause_ms)" >&2
done

# Launch bpftrace — capture entry/exit of futex(2) and aggregate wait time
# per (uaddr, op) tuple.
# Note: `args` field in tracepoint:syscalls:sys_enter_futex has uaddr, op.
# FUTEX_WAIT_PRIVATE = 128 (most common), FUTEX_WAKE_PRIVATE = 129.
echo "[probe] starting bpftrace (8 s window covers the next BRANCH)" >&2
sudo bpftrace -e "
tracepoint:syscalls:sys_enter_futex
/ pid == $fc_pid /
{
@start[tid] = nsecs;
@uaddr[tid] = args->uaddr;
@op[tid] = args->op;
}

tracepoint:syscalls:sys_exit_futex
/ pid == $fc_pid && @start[tid] != 0 /
{
\$d = nsecs - @start[tid];
@wait_ns[@uaddr[tid], @op[tid]] = sum(\$d);
@wait_count[@uaddr[tid], @op[tid]] = count();
delete(@start[tid]);
delete(@uaddr[tid]);
delete(@op[tid]);
}

interval:s:8 { exit(); }
" > "$OUT" 2>&1 &
bp_pid=$!
sleep 0.5

# Fire the slow BRANCH while bpftrace is recording
sleep "$GAP_SECS"
echo "[probe] firing profiled BRANCH" >&2
t0_ns=$(date +%s%N)
resp=$(curl -fsS "${auth[@]}" -H "Content-Type: application/json" \
-d "{\"tag\":\"prof-$(date +%s%N)\",\"diff\":true}" \
"$FORKD_URL/v1/sandboxes/$sb_id/branch")
t1_ns=$(date +%s%N)
echo "[probe] profiled BRANCH: pause_ms=$(echo "$resp" | jq -r .pause_ms) wall=$(( (t1_ns - t0_ns) / 1000000 ))ms" >&2

wait "$bp_pid" 2>/dev/null || true

# Save FC's /proc/$pid/maps so we can later annotate which mapping the
# top uaddr falls in.
sudo cp /proc/$fc_pid/maps "$OUT.maps" 2>/dev/null || true

# Cleanup
curl -fsS -X DELETE "${auth[@]}" "$FORKD_URL/v1/sandboxes/$sb_id" > /dev/null || true

echo "" >&2
echo "===== top futexes by total wait time =====" >&2
# bpftrace prints @wait_ns[u64, u64]: N — sort by N.
awk '
/^@wait_ns\[/ {
# line looks like "@wait_ns[0x7fab12345678, 128]: 1234567890"
match($0, /\[([0-9]+|0x[0-9a-fA-F]+), ([0-9]+)\]:\s*([0-9]+)/, m)
if (m[3] > 0) {
printf " %s op=%s wait_ms=%.2f\n", m[1], m[2], m[3]/1e6
}
}
' "$OUT" | sort -k4 -t= -n -r | head -10 >&2

echo "" >&2
echo "===== top futexes by call count =====" >&2
awk '
/^@wait_count\[/ {
match($0, /\[([0-9]+|0x[0-9a-fA-F]+), ([0-9]+)\]:\s*([0-9]+)/, m)
if (m[3] > 0) {
printf " %s op=%s calls=%s\n", m[1], m[2], m[3]
}
}
' "$OUT" | sort -k4 -t= -n -r | head -10 >&2

echo "" >&2
echo "Raw trace: $OUT" >&2
echo "FC memory map: $OUT.maps" >&2
Loading