Skip to content

h2: harden against potential deadlock when frame exceeds rx buffer capacity#8993

Open
cmoyes-jump wants to merge 3 commits intomainfrom
cmoyes/h2bufferguard
Open

h2: harden against potential deadlock when frame exceeds rx buffer capacity#8993
cmoyes-jump wants to merge 3 commits intomainfrom
cmoyes/h2bufferguard

Conversation

@cmoyes-jump
Copy link
Copy Markdown
Contributor

@cmoyes-jump cmoyes-jump commented Mar 20, 2026

  • Fix a potential deadlock in the H2 "all or nothing" frame consumption path: if a non-DATA frame's total size (header + payload) exceeds the rx buffer capacity, rx_suppress would wait forever for data that can never arrive, deadlocking the connection. Note that production buffers are currently sized properly, so this protects against a future misconfig.
  • Add guard in fd_h2_rx1 that detects this condition and issues FD_H2_ERR_INTERNAL instead of endlessly waiting and deadlocking.
  • Size test/fuzz buffer to match max_frame_size and clamp max_frame_size to buffer capacity so the existing FRAME_SIZE check catches oversized frames before they reach the buffer guard.
  • Fix fuzzer loop to handle the one-step gap between conn error detection (SEND_GOAWAY) and teardown (GOAWAY sent, conn reset).
  • New unit tests in test_h2_conn.c covering buffer guard edge cases.

Previously, this invariant was only implicit. The H2 "all or nothing" path silently assumes bufsz >= max_frame_size + 9. Nothing enforced that. Today's production configs are fine (2MB buffer >> 16KB max frame). If someone later shrinks the buffer or bumps max_frame_size, the deadlock becomes active.

This hardening change makes the buffer/frame-size relationship an explicit, enforced contract (the clamp + the FD_TEST + the runtime guard), rather than just a coincidence.

@cmoyes-jump cmoyes-jump self-assigned this Mar 20, 2026
Copilot AI review requested due to automatic review settings March 20, 2026 21:28
@cmoyes-jump cmoyes-jump changed the title h2: fix potential deadlock when frame exceeds rx buffer capacity h2: harden against potential deadlock when frame exceeds rx buffer capacity Mar 20, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the HTTP/2 receive path against a misconfiguration that could otherwise deadlock the connection when a non-DATA frame can never fully fit into the RX ring buffer. It also updates the bundle client test/fuzz harnesses and adds unit/regression coverage to make the buffer/frame-size contract explicit.

Changes:

  • Add a runtime guard in fd_h2_rx1 to fail fast with FD_H2_ERR_INTERNAL when a non-DATA frame’s total size exceeds RX buffer capacity.
  • Update bundle client test/fuzz setup to size/clamp buffers/max_frame_size consistently and adjust the fuzzer loop to tolerate the GOAWAY→DEAD/reset gap.
  • Add unit tests and binary fixtures covering the oversized-frame edge cases and ensuring correct GOAWAY lifecycle behavior.

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/waltz/h2/fd_h2_conn.c Adds the RX buffer-capacity guard in the non-DATA “all-or-nothing” consumption path to prevent deadlock.
src/waltz/h2/test_h2_conn.c Adds unit tests for oversized frames vs buffer capacity, frame-size precedence, and GOAWAY lifecycle behavior.
src/disco/bundle/test_bundle_common.c Adjusts test gRPC/H2 buffer sizing and clamps max_frame_size to buffer capacity to surface issues earlier.
src/disco/bundle/fuzz_bundle_client.c Updates fuzz loop to handle RX buffer-full conditions by stepping state machine once more and asserting buffer drain/reset.
src/disco/bundle/fixtures/regression_oversized_settings.bin Adds a regression fixture exercising oversized SETTINGS behavior.
src/disco/bundle/fixtures/regression_many_small_frames.bin Adds a regression fixture with many small frames to stress RX buffering/state progression.

Comment on lines +392 to +395
/* (c) Frame at max_frame_size+1 -> FD_H2_ERR_FRAME_SIZE fires first
When max_frame_size is properly clamped to bufsz - sizeof(hdr),
a frame with payload > max_frame_size should be rejected by the
FRAME_SIZE check (line 627), not the buffer guard (line 671). */
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments call out specific line numbers in fd_h2_conn.c (e.g., "line 627" / "line 671"). Line-number references tend to rot quickly and can confuse future debugging when the file changes. Consider rewording to reference the named checks/branches instead (FRAME_SIZE check vs buffer-capacity guard in fd_h2_rx1).

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.134266 s 0.136816 s 1.899%
backtest mainnet-406545575-perf snapshot load 4.942 s 3.798 s -23.149%
backtest mainnet-406545575-perf total elapsed 134.266173 s 136.815602 s 1.899%
firedancer mem usage with mainnet.toml 1105.43 GiB 1105.43 GiB 0.000%

Copilot AI review requested due to automatic review settings March 20, 2026 21:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.

Comment on lines +393 to +396
/* (c) Frame at max_frame_size+1 -> FD_H2_ERR_FRAME_SIZE fires first
When max_frame_size is properly clamped to bufsz - sizeof(hdr),
a frame with payload > max_frame_size should be rejected by the
FRAME_SIZE check (line 627), not the buffer guard (line 671). */
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test comments reference specific line numbers in fd_h2_rx1 (e.g. “line 627” / “line 671”), which will go stale as the file changes. Consider describing the checks by name/condition instead of embedding source line numbers.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.112041 s 0.112417 s 0.336%
backtest mainnet-406545575-perf snapshot load 4.051 s 2.511 s -38.015%
backtest mainnet-406545575-perf total elapsed 112.040606 s 112.417108 s 0.336%
firedancer mem usage with mainnet.toml 1105.43 GiB 1105.43 GiB 0.000%

@cmoyes-jump cmoyes-jump force-pushed the cmoyes/h2bufferguard branch from b5ce0c3 to 2d38dad Compare March 20, 2026 21:55
@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.120737 s 0.120694 s -0.036%
backtest mainnet-406545575-perf snapshot load 5.196 s 3.536 s -31.948%
backtest mainnet-406545575-perf total elapsed 120.736775 s 120.693963 s -0.035%
firedancer mem usage with mainnet.toml 1105.43 GiB 1105.43 GiB 0.000%

Copilot AI review requested due to automatic review settings March 20, 2026 22:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 6 changed files in this pull request and generated 2 comments.

@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.112152 s 0.11211 s -0.037%
backtest mainnet-406545575-perf snapshot load 4.076 s 2.534 s -37.831%
backtest mainnet-406545575-perf total elapsed 112.151754 s 112.110309 s -0.037%
firedancer mem usage with mainnet.toml 1105.43 GiB 1105.43 GiB 0.000%

@cmoyes-jump cmoyes-jump force-pushed the cmoyes/h2bufferguard branch from 9f78865 to 552e239 Compare March 23, 2026 22:29
@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.133932 s 0.135763 s 1.367%
backtest mainnet-406545575-perf snapshot load 5.13 s 3.809 s -25.750%
backtest mainnet-406545575-perf total elapsed 133.932069 s 135.762647 s 1.367%
firedancer mem usage with mainnet.toml 1110.43 GiB 1110.43 GiB 0.000%

Copilot AI review requested due to automatic review settings March 25, 2026 21:31
@cmoyes-jump cmoyes-jump force-pushed the cmoyes/h2bufferguard branch from 552e239 to 783bc5c Compare March 25, 2026 21:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 6 changed files in this pull request and generated 1 comment.

Comment on lines +282 to +289
/* test_h2_buffer_guard exercises the buffer guard at fd_h2_conn.c:671
and related frame size checks.

Background: Non-DATA frames are consumed "all or nothing", meaning
the entire frame (header + payload) must fit in the rx ring buffer
at once. If a frame's total size exceeds the buffer capacity, it
can never be consumed, causing a deadlock. The buffer guard
detects this and issues a conn error instead. */
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test comments hard-code source line numbers (e.g., fd_h2_conn.c:671, line 627, line 654). These will drift as fd_h2_conn.c changes and can mislead future debugging. Prefer referencing the function/logic (e.g., fd_h2_rx1 FRAME_SIZE check / buffer-capacity guard) without embedding specific line numbers, or derive them from symbols if needed.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.122436 s 0.123281 s 0.690%
backtest mainnet-406545575-perf snapshot load 3.361 s 2.937 s -12.615%
backtest mainnet-406545575-perf total elapsed 122.436312 s 123.280865 s 0.690%
firedancer mem usage with mainnet.toml 1095.43 GiB 1095.43 GiB 0.000%

@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.143066 s 0.143341 s 0.192%
backtest mainnet-406545575-perf snapshot load 5.185 s 3.568 s -31.186%
backtest mainnet-406545575-perf total elapsed 143.06556 s 143.340864 s 0.192%
firedancer mem usage with mainnet.toml 1090.43 GiB 1090.43 GiB 0.000%

Copilot AI review requested due to automatic review settings March 27, 2026 20:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 106 out of 110 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/flamenco/runtime/fd_bank.h:207

  • The PR description focuses on an H2 RX buffer deadlock hardening, but this change set also includes large, unrelated refactors (e.g., fd_bank/fd_banks layout and API changes, leader schedule API changes, wsamples/shred dest changes, tower reconcile signature changes). This scope mismatch makes the PR substantially harder to review and increases risk; consider splitting the non-H2 refactors into separate PRs or updating the PR description to explicitly cover them and the associated test/rollout plan.

ulong prev_vote_slot = fd_tower_peek_tail_const( tower )->slot;
fd_tower_blk_t * prev_vote_fork = fd_tower_blocks_query( blocks, prev_vote_slot );
fd_tower_blk_t * prev_vote_fork = fd_tower_blocks_query( blocks, prev_vote_slot ); /* must exist */

Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fd_tower_vote_and_reset now unconditionally reads prev_vote_fork->voted_block_id and looks it up in ghost, but voted_block_id may be unset (e.g., when the tower was reconciled from the vote account before the corresponding fd_tower_blk_t existed, or any case where voted==0). This can lead to querying ghost with a zero/garbage hash and incorrect behavior or a crash downstream. Consider restoring a defensive path: if prev_vote_fork is NULL or prev_vote_fork->voted==0, initialize voted_block_id from replayed_block_id (and set voted=1) or otherwise bail out safely.

Suggested change
if( FD_UNLIKELY( prev_vote_fork && !prev_vote_fork->voted ) ) {
prev_vote_fork->voted_block_id = prev_vote_fork->replayed_block_id;
prev_vote_fork->voted = 1;
}

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.122418 s 0.122536 s 0.096%
backtest mainnet-406545575-perf snapshot load 3.425 s 2.907 s -15.124%
backtest mainnet-406545575-perf total elapsed 122.41846 s 122.536302 s 0.096%
firedancer mem usage with mainnet.toml 1090.43 GiB 1090.43 GiB 0.000%

@cmoyes-jump cmoyes-jump force-pushed the cmoyes/h2bufferguard branch from bdd6e74 to c58cc34 Compare March 27, 2026 21:43
@github-actions
Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-406545575-perf per slot 0.135426 s 0.13747 s 1.509%
backtest mainnet-406545575-perf snapshot load 3.644 s 3.119 s -14.407%
backtest mainnet-406545575-perf total elapsed 135.425609 s 137.46975 s 1.509%
firedancer mem usage with mainnet.toml 1090.43 GiB 1090.43 GiB 0.000%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants