h2: harden against potential deadlock when frame exceeds rx buffer capacity#8993
h2: harden against potential deadlock when frame exceeds rx buffer capacity#8993cmoyes-jump wants to merge 3 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens the HTTP/2 receive path against a misconfiguration that could otherwise deadlock the connection when a non-DATA frame can never fully fit into the RX ring buffer. It also updates the bundle client test/fuzz harnesses and adds unit/regression coverage to make the buffer/frame-size contract explicit.
Changes:
- Add a runtime guard in
fd_h2_rx1to fail fast withFD_H2_ERR_INTERNALwhen a non-DATA frame’s total size exceeds RX buffer capacity. - Update bundle client test/fuzz setup to size/clamp buffers/
max_frame_sizeconsistently and adjust the fuzzer loop to tolerate the GOAWAY→DEAD/reset gap. - Add unit tests and binary fixtures covering the oversized-frame edge cases and ensuring correct GOAWAY lifecycle behavior.
Reviewed changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/waltz/h2/fd_h2_conn.c | Adds the RX buffer-capacity guard in the non-DATA “all-or-nothing” consumption path to prevent deadlock. |
| src/waltz/h2/test_h2_conn.c | Adds unit tests for oversized frames vs buffer capacity, frame-size precedence, and GOAWAY lifecycle behavior. |
| src/disco/bundle/test_bundle_common.c | Adjusts test gRPC/H2 buffer sizing and clamps max_frame_size to buffer capacity to surface issues earlier. |
| src/disco/bundle/fuzz_bundle_client.c | Updates fuzz loop to handle RX buffer-full conditions by stepping state machine once more and asserting buffer drain/reset. |
| src/disco/bundle/fixtures/regression_oversized_settings.bin | Adds a regression fixture exercising oversized SETTINGS behavior. |
| src/disco/bundle/fixtures/regression_many_small_frames.bin | Adds a regression fixture with many small frames to stress RX buffering/state progression. |
| /* (c) Frame at max_frame_size+1 -> FD_H2_ERR_FRAME_SIZE fires first | ||
| When max_frame_size is properly clamped to bufsz - sizeof(hdr), | ||
| a frame with payload > max_frame_size should be rejected by the | ||
| FRAME_SIZE check (line 627), not the buffer guard (line 671). */ |
There was a problem hiding this comment.
These comments call out specific line numbers in fd_h2_conn.c (e.g., "line 627" / "line 671"). Line-number references tend to rot quickly and can confuse future debugging when the file changes. Consider rewording to reference the named checks/branches instead (FRAME_SIZE check vs buffer-capacity guard in fd_h2_rx1).
Performance Measurements ⏳
|
| /* (c) Frame at max_frame_size+1 -> FD_H2_ERR_FRAME_SIZE fires first | ||
| When max_frame_size is properly clamped to bufsz - sizeof(hdr), | ||
| a frame with payload > max_frame_size should be rejected by the | ||
| FRAME_SIZE check (line 627), not the buffer guard (line 671). */ |
There was a problem hiding this comment.
These test comments reference specific line numbers in fd_h2_rx1 (e.g. “line 627” / “line 671”), which will go stale as the file changes. Consider describing the checks by name/condition instead of embedding source line numbers.
Performance Measurements ⏳
|
b5ce0c3 to
2d38dad
Compare
Performance Measurements ⏳
|
Performance Measurements ⏳
|
9f78865 to
552e239
Compare
Performance Measurements ⏳
|
552e239 to
783bc5c
Compare
| /* test_h2_buffer_guard exercises the buffer guard at fd_h2_conn.c:671 | ||
| and related frame size checks. | ||
|
|
||
| Background: Non-DATA frames are consumed "all or nothing", meaning | ||
| the entire frame (header + payload) must fit in the rx ring buffer | ||
| at once. If a frame's total size exceeds the buffer capacity, it | ||
| can never be consumed, causing a deadlock. The buffer guard | ||
| detects this and issues a conn error instead. */ |
There was a problem hiding this comment.
The new test comments hard-code source line numbers (e.g., fd_h2_conn.c:671, line 627, line 654). These will drift as fd_h2_conn.c changes and can mislead future debugging. Prefer referencing the function/logic (e.g., fd_h2_rx1 FRAME_SIZE check / buffer-capacity guard) without embedding specific line numbers, or derive them from symbols if needed.
Performance Measurements ⏳
|
Performance Measurements ⏳
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 106 out of 110 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/flamenco/runtime/fd_bank.h:207
- The PR description focuses on an H2 RX buffer deadlock hardening, but this change set also includes large, unrelated refactors (e.g., fd_bank/fd_banks layout and API changes, leader schedule API changes, wsamples/shred dest changes, tower reconcile signature changes). This scope mismatch makes the PR substantially harder to review and increases risk; consider splitting the non-H2 refactors into separate PRs or updating the PR description to explicitly cover them and the associated test/rollout plan.
| ulong prev_vote_slot = fd_tower_peek_tail_const( tower )->slot; | ||
| fd_tower_blk_t * prev_vote_fork = fd_tower_blocks_query( blocks, prev_vote_slot ); | ||
| fd_tower_blk_t * prev_vote_fork = fd_tower_blocks_query( blocks, prev_vote_slot ); /* must exist */ | ||
|
|
There was a problem hiding this comment.
fd_tower_vote_and_reset now unconditionally reads prev_vote_fork->voted_block_id and looks it up in ghost, but voted_block_id may be unset (e.g., when the tower was reconciled from the vote account before the corresponding fd_tower_blk_t existed, or any case where voted==0). This can lead to querying ghost with a zero/garbage hash and incorrect behavior or a crash downstream. Consider restoring a defensive path: if prev_vote_fork is NULL or prev_vote_fork->voted==0, initialize voted_block_id from replayed_block_id (and set voted=1) or otherwise bail out safely.
| if( FD_UNLIKELY( prev_vote_fork && !prev_vote_fork->voted ) ) { | |
| prev_vote_fork->voted_block_id = prev_vote_fork->replayed_block_id; | |
| prev_vote_fork->voted = 1; | |
| } |
Performance Measurements ⏳
|
bdd6e74 to
c58cc34
Compare
Performance Measurements ⏳
|
DATAframe's total size (header + payload) exceeds the rx buffer capacity,rx_suppresswould wait forever for data that can never arrive, deadlocking the connection. Note that production buffers are currently sized properly, so this protects against a future misconfig.fd_h2_rx1that detects this condition and issuesFD_H2_ERR_INTERNALinstead of endlessly waiting and deadlocking.max_frame_sizeand clampmax_frame_sizeto buffer capacity so the existingFRAME_SIZEcheck catches oversized frames before they reach the buffer guard.SEND_GOAWAY) and teardown (GOAWAYsent, conn reset).Previously, this invariant was only implicit. The H2 "all or nothing" path silently assumes
bufsz >= max_frame_size + 9. Nothing enforced that. Today's production configs are fine (2MB buffer >> 16KB max frame). If someone later shrinks the buffer or bumpsmax_frame_size, the deadlock becomes active.This hardening change makes the buffer/frame-size relationship an explicit, enforced contract (the clamp + the
FD_TEST+ the runtime guard), rather than just a coincidence.