probe: multi-BRANCH anomaly — thread-level attribution refinement #140

Merged
WaylandYang merged 2 commits into main from probe/multi-branch-thread-attribution
May 21, 2026

Conversation

@WaylandYang

Summary

Follow-up to #128. The original `strace -c` probe couldn't distinguish user-space CPU from off-CPU blocked-waiting. This pass added bpftrace + per-thread `/proc/$pid/task/*/stack` polling and substantially refined the picture.

TL;DR — original "user-space CPU" claim was too strong

bpftrace at 199 Hz on the FC pid during a slow BRANCH: only ~18 samples in 1.6 s out of ~320 expected = FC is off-CPU ~94 % of the BRANCH window.
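
The off-CPU estimate above is plain sample-count arithmetic. A minimal sketch of it follows; the helper name `off_cpu_pct` is hypothetical, and the calculation assumes one fully busy CPU would yield `hz * seconds` samples.

```shell
#!/usr/bin/env bash
# off_cpu_pct SAMPLES HZ SECONDS
# Estimate the share of a window a process spent off-CPU, given how many
# profiler samples landed on it. Assumes one busy CPU == HZ * SECONDS samples.
off_cpu_pct() {
  local samples=$1 hz=$2 seconds=$3
  awk -v s="$samples" -v hz="$hz" -v t="$seconds" 'BEGIN {
    expected = hz * t                      # samples if on-CPU the whole time
    printf "%.0f\n", (1 - s / expected) * 100
  }'
}

# Numbers quoted in this PR: 18 samples at 199 Hz over a 1.6 s BRANCH.
off_cpu_pct 18 199 1.6   # prints 94
```

With 199 Hz × 1.6 s ≈ 318 expected samples, 18 hits correspond to roughly 6 % on-CPU, which is where the ~94 % off-CPU figure comes from.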

What FC's 5 threads do during a slow BRANCH (kernel-sleep histogram)

```
ep_poll           x90   main thread (HTTP idle)
[kvm]             x88   vCPU thread parked in kvm_vcpu_halt (the pause working)
vhost_task_fn     x50   vhost-net (idle)
futex_wait_queue  x17   ← actual signal: thread blocked on userspace futex
submit_bio_wait   x3    snapshot writer waiting on block IO
jbd2_log_wait     x2    ext4 journal commit
```
```

Full kvm stack: `kvm_vcpu_block ← kvm_vcpu_halt ← ... ← __x64_sys_ioctl`
Full futex stack: `futex_wait_queue ← futex_wait ← do_futex ← __x64_sys_futex` (kernel can't say which futex from a static stack).
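
For readers who want to reproduce a histogram like the one above, here is a rough sketch of a per-thread stack poller. It is not the shipped probe script: the function names `top_frame` and `poll_stacks` are hypothetical, and it buckets by innermost symbol rather than by module name. Reading `/proc/<pid>/task/<tid>/stack` normally requires root.

```shell
#!/usr/bin/env bash
# top_frame FILE
# Print the innermost kernel symbol from a /proc/<pid>/task/<tid>/stack dump
# (lines look like "[<0>] ep_poll+0x3e4/0x420"). An empty dump is reported as
# "(empty)", which this write-up reads as the thread being in user mode.
top_frame() {
  awk 'NR == 1 { sym = $2; sub(/\+.*/, "", sym) }
       END     { print (sym == "" ? "(empty)" : sym) }' "$1"
}

# poll_stacks PID ROUNDS [INTERVAL_SECONDS]
# Sample every thread of PID once per round and print a histogram of top
# frames, most frequent first. The default interval of 30 ms matches the
# polling rate mentioned in the commit message.
poll_stacks() {
  local pid=$1 rounds=$2 interval=${3:-0.03} f i
  for ((i = 0; i < rounds; i++)); do
    for f in /proc/"$pid"/task/*/stack; do
      # Threads can exit between glob and read, so ignore read errors.
      [ -r "$f" ] && top_frame "$f" 2>/dev/null
    done
    sleep "$interval"
  done | sort | uniq -c | sort -rn
}
```

A run during a slow BRANCH might look like `poll_stacks "$FC_PID" 50`, where `FC_PID` is whatever variable holds the FC process id.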

Revised picture — 3 contributors

  1. Userspace futex contention — a worker waits on a mutex; lock hold-time may scale with accumulated snapshot count
  2. ext4 journal / block IO writeback — ~2 % of the window
  3. User-space CPU on the snapshot worker: ~70 % of off-CPU thread-poll points returned an empty kernel stack (thread was in user mode). FC's static-pie release build lacks frame pointers, so bpftrace can't unwind or symbolize its user stacks. Attributing this time needs DWARF unwinding (perf, when available) or a debug FC rebuild.

Implications for #118 Phase 2/3 (refined)

  • Phase 2 (`io_uring` writer) addresses contributor 2 only (~2 % of the window), which narrows the win further
  • Phase 3 (1 s pre-emptive tick) would compound the futex contention if the snapshot worker holds the contended lock. Identifying the lock should be a prereq.
  • New candidate work: bpftrace on `tracepoint:syscalls:sys_enter_futex` with `args.uaddr` capture, correlated with the waker's stack → if contention is fixable, may cut pause growth without touching the IO path.
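
The futex-identification idea in the last bullet could be sketched as below. This is an illustration, not one of the shipped scripts: the function name `futex_probe_program`, the map name `@futex` and the `FC_PID` variable are hypothetical. It uses the `args.uaddr` spelling from this PR; older bpftrace releases expect `args->uaddr` instead. Because the release build has no frame pointers, the sketch keys on thread id and futex command rather than on user stacks.

```shell
#!/usr/bin/env bash
# futex_probe_program PID
# Emit a bpftrace program that counts futex() entries for one process, keyed
# by futex address, calling thread, and command. "op & 127" strips
# FUTEX_PRIVATE_FLAG (128) and FUTEX_CLOCK_REALTIME (256), leaving the
# command: 0 = FUTEX_WAIT, 1 = FUTEX_WAKE.
futex_probe_program() {
  local pid=$1
  cat <<EOF
tracepoint:syscalls:sys_enter_futex
/pid == $pid/
{
  @futex[args.uaddr, tid, args.op & 127] = count();
}
EOF
}

# Only attach when a target is supplied, e.g.: sudo FC_PID=1234 bash probe.sh
if [ -n "${FC_PID:-}" ]; then
  bpftrace -e "$(futex_probe_program "$FC_PID")"
fi
```

In the resulting map, a `uaddr` that appears with command 0 for one `tid` and command 1 for another points at the waiter and the waker of the same lock, which is the correlation the bullet asks for.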

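Files

  • `bench/pause-window/probe-bpftrace-fc.sh`: user-stack sampler
  • `bench/pause-window/probe-syscall-poll.sh`: /proc/syscall poll loop
  • `bench/pause-window/PROBE-multi-branch-anomaly.md`: "Follow-up" section appended with the refined picture and revised #118 scope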

Refs #118.

🤖 Generated with Claude Code

WaylandYang and others added 2 commits May 21, 2026 15:14
Two polish items on snapshot management:

1. `forkd images` table output revamped:
   - new columns: MEMORY (memory.bin size) and CREATED (relative age)
   - dynamic TAG column width
   - sorted most-recent-first instead of alphabetical
   - footer with snapshot count + total bytes

   Before:
     TAG                               SIZE  ROOTFS?
     coding-agent-fork-prewarm-v1   2.4 GiB  yes

   After:
     TAG                                  SIZE      MEMORY     CREATED  ROOTFS
     coding-agent-fork-prewarm-v1      2.4 GiB   512.0 MiB     3d ago  yes
     python-numpy                      1.8 GiB   512.0 MiB    12h ago  yes

     2 snapshots · 4.2 GiB total

2. New `forkd rmi <TAG>...` subcommand (docker-style):
   - tries DELETE /v1/snapshots/:tag first (clean: daemon removes
     registry entry + on-disk files atomically)
   - falls back to direct disk removal when the daemon is unreachable
     or doesn't know the tag (404)
   - reports source per tag: "(daemon)", "(disk)", or
     "(disk (daemon unreachable))"

   Examples:
     forkd rmi pyagent
     forkd rmi pyagent langgraph python-numpy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to PR #128. The original probe used `strace -c` on the
whole FC process; that can't distinguish user-space CPU from
off-CPU blocked-waiting. This pass added two more tools and refined
the picture significantly:

  - bpftrace profile:hz:199 on the FC pid → ~18 samples in 1.6 s of
    BRANCH out of ~320 expected, i.e. FC is off-CPU ~94 % of the
    BRANCH window. Original "user-space CPU" claim was too strong.

  - /proc/$pid/task/*/stack polled at 30 ms across all FC threads →
    top kernel-sleep frames during the slow window:

      ep_poll          x90  (main thread, idle)
      [kvm]            x88  (vCPU thread parked in kvm_vcpu_halt; the
                             pause working as designed)
      vhost_task_fn    x50  (vhost-net, idle)
      futex_wait_queue x17  ← actual signal: thread blocked on userspace futex
      submit_bio_wait  x3   (snapshot writer waiting on block IO)
      jbd2_log_wait    x2   (ext4 journal commit)

    Full kvm stack: kvm_vcpu_block ← kvm_vcpu_halt ← ... ← __x64_sys_ioctl
    Full futex stack: futex_wait_queue ← futex_wait ← do_futex ← __x64_sys_futex
    (kernel can't tell *which* futex from a static stack.)

Revised picture (3 contributors, not just user CPU):

  1. Userspace futex contention — a worker waits on a mutex; lock
     hold-time may scale with accumulated snapshot count
  2. ext4 journal / block IO writeback (~2 % of off-CPU)
  3. User-space CPU on the snapshot worker (~70 % of off-CPU time
     returned empty kernel stack = thread was in user mode; FC's
     static-pie release build has no frame pointers so bpftrace
     can't symbolize)

Implications for #118 Phase 2/3:
  - Phase 2 (io_uring) addresses #2 only (~2 % of the window)
  - Phase 3 (1 s tick) may compound the futex contention; need to
    identify the lock first
  - New candidate work: bpftrace on tracepoint:syscalls:sys_enter_futex
    with args.uaddr capture to identify the specific futex

Ships:
  - bench/pause-window/probe-bpftrace-fc.sh — user-stack sampler
  - bench/pause-window/probe-syscall-poll.sh — /proc/syscall poll loop
  - bench/pause-window/PROBE-multi-branch-anomaly.md — "Follow-up"
    section appended with the refined picture and revised #118 scope

Refs #118.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WaylandYang WaylandYang merged commit c400b06 into main May 21, 2026
2 checks passed
@WaylandYang WaylandYang deleted the probe/multi-branch-thread-attribution branch May 21, 2026 07:42