h2: harden against potential deadlock when frame exceeds rx buffer capacity by cmoyes-jump · Pull Request #8993 · firedancer-io/firedancer

cmoyes-jump · 2026-03-20T21:28:41Z

Fix a potential deadlock in the H2 "all or nothing" frame consumption path: if a non-DATA frame's total size (header + payload) exceeds the rx buffer capacity, rx_suppress would wait forever for data that can never arrive, deadlocking the connection. Note that production buffers are currently sized properly, so this protects against a future misconfig.
Add guard in fd_h2_rx1 that detects this condition and issues FD_H2_ERR_INTERNAL instead of endlessly waiting and deadlocking.
Size test/fuzz buffer to match max_frame_size and clamp max_frame_size to buffer capacity so the existing FRAME_SIZE check catches oversized frames before they reach the buffer guard.
Fix fuzzer loop to handle the one-step gap between conn error detection (SEND_GOAWAY) and teardown (GOAWAY sent, conn reset).
New unit tests in test_h2_conn.c covering buffer guard edge cases.

Previously, this invariant was only implicit. The H2 "all or nothing" path silently assumes bufsz >= max_frame_size + 9. Nothing enforced that. Today's production configs are fine (2MB buffer >> 16KB max frame). If someone later shrinks the buffer or bumps max_frame_size, the deadlock becomes active.

This hardening change makes the buffer/frame-size relationship an explicit, enforced contract (the clamp + the FD_TEST + the runtime guard), rather than just a coincidence.

Copilot

Pull request overview

This PR hardens the HTTP/2 receive path against a misconfiguration that could otherwise deadlock the connection when a non-DATA frame can never fully fit into the RX ring buffer. It also updates the bundle client test/fuzz harnesses and adds unit/regression coverage to make the buffer/frame-size contract explicit.

Changes:

Add a runtime guard in fd_h2_rx1 to fail fast with FD_H2_ERR_INTERNAL when a non-DATA frame’s total size exceeds RX buffer capacity.
Update bundle client test/fuzz setup to size/clamp buffers/max_frame_size consistently and adjust the fuzzer loop to tolerate the GOAWAY→DEAD/reset gap.
Add unit tests and binary fixtures covering the oversized-frame edge cases and ensuring correct GOAWAY lifecycle behavior.

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
src/waltz/h2/fd_h2_conn.c	Adds the RX buffer-capacity guard in the non-DATA “all-or-nothing” consumption path to prevent deadlock.
src/waltz/h2/test_h2_conn.c	Adds unit tests for oversized frames vs buffer capacity, frame-size precedence, and GOAWAY lifecycle behavior.
src/disco/bundle/test_bundle_common.c	Adjusts test gRPC/H2 buffer sizing and clamps `max_frame_size` to buffer capacity to surface issues earlier.
src/disco/bundle/fuzz_bundle_client.c	Updates fuzz loop to handle RX buffer-full conditions by stepping state machine once more and asserting buffer drain/reset.
src/disco/bundle/fixtures/regression_oversized_settings.bin	Adds a regression fixture exercising oversized SETTINGS behavior.
src/disco/bundle/fixtures/regression_many_small_frames.bin	Adds a regression fixture with many small frames to stress RX buffering/state progression.

src/waltz/h2/test_h2_conn.c

Copilot · 2026-03-20T21:33:32Z

src/waltz/h2/test_h2_conn.c

+  /* (c) Frame at max_frame_size+1 -> FD_H2_ERR_FRAME_SIZE fires first
+     When max_frame_size is properly clamped to bufsz - sizeof(hdr),
+     a frame with payload > max_frame_size should be rejected by the
+     FRAME_SIZE check (line 627), not the buffer guard (line 671). */


These comments call out specific line numbers in fd_h2_conn.c (e.g., "line 627" / "line 671"). Line-number references tend to rot quickly and can confuse future debugging when the file changes. Consider rewording to reference the named checks/branches instead (FRAME_SIZE check vs buffer-capacity guard in fd_h2_rx1).

src/waltz/h2/test_h2_conn.c

github-actions · 2026-03-20T21:35:13Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.134266 s`	`0.136816 s`	`1.899%` ✅
backtest `mainnet-406545575-perf` snapshot load	`4.942 s`	`3.798 s`	`-23.149%` ✅
backtest `mainnet-406545575-perf` total elapsed	`134.266173 s`	`136.815602 s`	`1.899%` ✅
`firedancer mem` usage with `mainnet.toml`	`1105.43 GiB`	`1105.43 GiB`	`0.000%` ✅

Copilot

Pull request overview

Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.

src/waltz/h2/test_h2_conn.c

Copilot · 2026-03-20T21:46:02Z

src/waltz/h2/test_h2_conn.c

+  /* (c) Frame at max_frame_size+1 -> FD_H2_ERR_FRAME_SIZE fires first
+     When max_frame_size is properly clamped to bufsz - sizeof(hdr),
+     a frame with payload > max_frame_size should be rejected by the
+     FRAME_SIZE check (line 627), not the buffer guard (line 671). */


These test comments reference specific line numbers in fd_h2_rx1 (e.g. “line 627” / “line 671”), which will go stale as the file changes. Consider describing the checks by name/condition instead of embedding source line numbers.

github-actions · 2026-03-20T21:47:20Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.112041 s`	`0.112417 s`	`0.336%` ✅
backtest `mainnet-406545575-perf` snapshot load	`4.051 s`	`2.511 s`	`-38.015%` ✅
backtest `mainnet-406545575-perf` total elapsed	`112.040606 s`	`112.417108 s`	`0.336%` ✅
`firedancer mem` usage with `mainnet.toml`	`1105.43 GiB`	`1105.43 GiB`	`0.000%` ✅

github-actions · 2026-03-20T22:00:59Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.120737 s`	`0.120694 s`	`-0.036%` ✅
backtest `mainnet-406545575-perf` snapshot load	`5.196 s`	`3.536 s`	`-31.948%` ✅
backtest `mainnet-406545575-perf` total elapsed	`120.736775 s`	`120.693963 s`	`-0.035%` ✅
`firedancer mem` usage with `mainnet.toml`	`1105.43 GiB`	`1105.43 GiB`	`0.000%` ✅

Copilot

Pull request overview

Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.

src/waltz/h2/fd_h2_conn.c

src/disco/bundle/fuzz_bundle_client.c

github-actions · 2026-03-20T22:20:40Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.112152 s`	`0.11211 s`	`-0.037%` ✅
backtest `mainnet-406545575-perf` snapshot load	`4.076 s`	`2.534 s`	`-37.831%` ✅
backtest `mainnet-406545575-perf` total elapsed	`112.151754 s`	`112.110309 s`	`-0.037%` ✅
`firedancer mem` usage with `mainnet.toml`	`1105.43 GiB`	`1105.43 GiB`	`0.000%` ✅

github-actions · 2026-03-23T22:35:56Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.133932 s`	`0.135763 s`	`1.367%` ✅
backtest `mainnet-406545575-perf` snapshot load	`5.13 s`	`3.809 s`	`-25.750%` ✅
backtest `mainnet-406545575-perf` total elapsed	`133.932069 s`	`135.762647 s`	`1.367%` ✅
`firedancer mem` usage with `mainnet.toml`	`1110.43 GiB`	`1110.43 GiB`	`0.000%` ✅

Copilot

Pull request overview

Copilot reviewed 4 out of 6 changed files in this pull request and generated 1 comment.

Copilot · 2026-03-25T21:34:57Z

src/waltz/h2/test_h2_conn.c

+/* test_h2_buffer_guard exercises the buffer guard at fd_h2_conn.c:671
+   and related frame size checks.
+
+   Background: Non-DATA frames are consumed "all or nothing", meaning
+   the entire frame (header + payload) must fit in the rx ring buffer
+   at once.  If a frame's total size exceeds the buffer capacity, it
+   can never be consumed, causing a deadlock.  The buffer guard
+   detects this and issues a conn error instead. */


The new test comments hard-code source line numbers (e.g., fd_h2_conn.c:671, line 627, line 654). These will drift as fd_h2_conn.c changes and can mislead future debugging. Prefer referencing the function/logic (e.g., fd_h2_rx1 FRAME_SIZE check / buffer-capacity guard) without embedding specific line numbers, or derive them from symbols if needed.

github-actions · 2026-03-25T21:37:47Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.122436 s`	`0.123281 s`	`0.690%` ✅
backtest `mainnet-406545575-perf` snapshot load	`3.361 s`	`2.937 s`	`-12.615%` ✅
backtest `mainnet-406545575-perf` total elapsed	`122.436312 s`	`123.280865 s`	`0.690%` ✅
`firedancer mem` usage with `mainnet.toml`	`1095.43 GiB`	`1095.43 GiB`	`0.000%` ✅

github-actions · 2026-03-26T23:42:40Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.143066 s`	`0.143341 s`	`0.192%` ✅
backtest `mainnet-406545575-perf` snapshot load	`5.185 s`	`3.568 s`	`-31.186%` ✅
backtest `mainnet-406545575-perf` total elapsed	`143.06556 s`	`143.340864 s`	`0.192%` ✅
`firedancer mem` usage with `mainnet.toml`	`1090.43 GiB`	`1090.43 GiB`	`0.000%` ✅

Copilot

Pull request overview

Copilot reviewed 106 out of 110 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/flamenco/runtime/fd_bank.h:207

The PR description focuses on an H2 RX buffer deadlock hardening, but this change set also includes large, unrelated refactors (e.g., fd_bank/fd_banks layout and API changes, leader schedule API changes, wsamples/shred dest changes, tower reconcile signature changes). This scope mismatch makes the PR substantially harder to review and increases risk; consider splitting the non-H2 refactors into separate PRs or updating the PR description to explicitly cover them and the associated test/rollout plan.

Copilot · 2026-03-27T20:45:07Z

src/choreo/tower/fd_tower.c

  ulong            prev_vote_slot = fd_tower_peek_tail_const( tower )->slot;
-  fd_tower_blk_t * prev_vote_fork = fd_tower_blocks_query( blocks, prev_vote_slot );
+  fd_tower_blk_t * prev_vote_fork = fd_tower_blocks_query( blocks, prev_vote_slot ); /* must exist */



fd_tower_vote_and_reset now unconditionally reads prev_vote_fork->voted_block_id and looks it up in ghost, but voted_block_id may be unset (e.g., when the tower was reconciled from the vote account before the corresponding fd_tower_blk_t existed, or any case where voted==0). This can lead to querying ghost with a zero/garbage hash and incorrect behavior or a crash downstream. Consider restoring a defensive path: if prev_vote_fork is NULL or prev_vote_fork->voted==0, initialize voted_block_id from replayed_block_id (and set voted=1) or otherwise bail out safely.

Suggested change

if( FD_UNLIKELY( prev_vote_fork && !prev_vote_fork->voted ) ) {

prev_vote_fork->voted_block_id = prev_vote_fork->replayed_block_id;

prev_vote_fork->voted = 1;

}

github-actions · 2026-03-27T20:51:00Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.122418 s`	`0.122536 s`	`0.096%` ✅
backtest `mainnet-406545575-perf` snapshot load	`3.425 s`	`2.907 s`	`-15.124%` ✅
backtest `mainnet-406545575-perf` total elapsed	`122.41846 s`	`122.536302 s`	`0.096%` ✅
`firedancer mem` usage with `mainnet.toml`	`1090.43 GiB`	`1090.43 GiB`	`0.000%` ✅

github-actions · 2026-03-27T21:50:21Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-406545575-perf` per slot	`0.135426 s`	`0.13747 s`	`1.509%` ✅
backtest `mainnet-406545575-perf` snapshot load	`3.644 s`	`3.119 s`	`-14.407%` ✅
backtest `mainnet-406545575-perf` total elapsed	`135.425609 s`	`137.46975 s`	`1.509%` ✅
`firedancer mem` usage with `mainnet.toml`	`1090.43 GiB`	`1090.43 GiB`	`0.000%` ✅

cmoyes-jump requested review from mjain-jump, mmcgee-jump, ripatel-fd and topointon-jump March 20, 2026 21:28

cmoyes-jump self-assigned this Mar 20, 2026

Copilot AI review requested due to automatic review settings March 20, 2026 21:28

Copilot started reviewing on behalf of cmoyes-jump March 20, 2026 21:29 View session

cmoyes-jump changed the title ~~h2: fix potential deadlock when frame exceeds rx buffer capacity~~ h2: harden against potential deadlock when frame exceeds rx buffer capacity Mar 20, 2026

Copilot AI reviewed Mar 20, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings March 20, 2026 21:41

Copilot started reviewing on behalf of cmoyes-jump March 20, 2026 21:42 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

cmoyes-jump force-pushed the cmoyes/h2bufferguard branch from b5ce0c3 to 2d38dad Compare March 20, 2026 21:55

Copilot AI review requested due to automatic review settings March 20, 2026 22:15

Copilot started reviewing on behalf of cmoyes-jump March 20, 2026 22:15 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

src/waltz/h2/fd_h2_conn.c Show resolved Hide resolved

src/disco/bundle/fuzz_bundle_client.c Show resolved Hide resolved

cmoyes-jump force-pushed the cmoyes/h2bufferguard branch from 9f78865 to 552e239 Compare March 23, 2026 22:29

Copilot AI review requested due to automatic review settings March 25, 2026 21:31

cmoyes-jump force-pushed the cmoyes/h2bufferguard branch from 552e239 to 783bc5c Compare March 25, 2026 21:31

Copilot started reviewing on behalf of cmoyes-jump March 25, 2026 21:31 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings March 27, 2026 20:40

cmoyes-jump requested a review from ptaffet-jump as a code owner March 27, 2026 20:40

cmoyes-jump requested review from 0x0ece, alpeng-jump, anwayde, emwang-jump, jherrera-jump and lidatong as code owners March 27, 2026 20:40

Copilot started reviewing on behalf of cmoyes-jump March 27, 2026 20:40 View session

Copilot AI reviewed Mar 27, 2026

View reviewed changes

cmoyes-jump added 3 commits March 27, 2026 21:42

bundle buffer guard

9ee6f86

fuzzer crit avoidance

15f59b0

test cases

c58cc34

cmoyes-jump force-pushed the cmoyes/h2bufferguard branch from bdd6e74 to c58cc34 Compare March 27, 2026 21:43

+  if( FD_UNLIKELY( prev_vote_fork && !prev_vote_fork->voted ) ) {
+    prev_vote_fork->voted_block_id = prev_vote_fork->replayed_block_id;
+    prev_vote_fork->voted          = 1;
+  }

Conversation

cmoyes-jump commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Mar 20, 2026

Performance Measurements ⏳

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 20, 2026

Performance Measurements ⏳

Uh oh!

github-actions bot commented Mar 20, 2026

Performance Measurements ⏳

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 20, 2026

Performance Measurements ⏳

Uh oh!

github-actions bot commented Mar 23, 2026

Performance Measurements ⏳

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 25, 2026

Performance Measurements ⏳

Uh oh!

github-actions bot commented Mar 26, 2026

Performance Measurements ⏳

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 27, 2026

Performance Measurements ⏳

Uh oh!

github-actions bot commented Mar 27, 2026

Performance Measurements ⏳

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cmoyes-jump commented Mar 20, 2026 •

edited

Loading