Skip to content

fix(bitswap): serve late-arriving peers in HaveSession#459

Open
cong-or wants to merge 4 commits into
dariusc93:libp2p-nextfrom
cong-or:fix/bitswap-serve-late-arriving-peers
Open

fix(bitswap): serve late-arriving peers in HaveSession#459
cong-or wants to merge 4 commits into
dariusc93:libp2p-nextfrom
cong-or:fix/bitswap-serve-late-arriving-peers

Conversation

@cong-or
Copy link
Copy Markdown

@cong-or cong-or commented Jan 12, 2026

Summary

Fixes #458

When multiple peers request the same block simultaneously, late-arriving peers were never served. This caused only 2-3 out of 5+ simultaneous requesters to receive the block.

Root Cause

Two issues in HaveSession:

  1. need_block(): When a peer called need_block() while the session was already in Block state (bytes ready), the waker was not called. The session never got polled again to serve the new peer.

  2. poll_next(): After serving all currently-known peers, the session transitioned to Complete or Idle, discarding the block bytes. Late-arriving peers (delayed by DHT provider discovery latency) could never be served.

Changes

  • need_block(): Wake the session when already in Block state so poll_next runs again to serve newly-added peers
  • poll_next(): Keep Block state instead of completing or transitioning to Idle, preserving bytes in memory for late-arriving requests. Session cleanup is delegated to the session manager's timeout.

Test Results

Tested with 6-node P2P mesh network:

Metric Before After
Nodes receiving block 2/5 5/5
Success rate 40% 100%

  When multiple peers request the same block simultaneously, late-arriving
  peers (those calling need_block() after the session already has the block)
  were never served because the session wasn't being woken up.

  - Wake session when already in Block state so poll_next serves new peers
  - Keep Block state instead of completing, preserving bytes for late requests

  Fixes dariusc93#458
Comment on lines -701 to +731
// Since we have no more peers who want the block, we will finalize the session
this.state = HaveSessionState::Complete;
this.want.clear();
this.send_dont_have.clear();
return Poll::Ready(Some(HaveSessionEvent::Cancelled));
// All currently-known peers have been served, but don't complete the session yet.
//
// More peers may request this block after a slight delay (e.g., due to network
// latency in DHT provider discovery). By keeping the Block state with bytes in
// memory, we allow need_block() to wake us again when late-arriving peers request
// the same block. This fixes the issue where only the first few simultaneous
// requesters would receive the block.
//
// The session will eventually be cleaned up by the session manager's timeout
// or explicit cancellation, not by completing here.
this.waker = Some(cx.waker().clone());
return Poll::Pending;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only a couple of places where the session would be canceled, so without this here, it would possibly lead to high memory usage here if there is fetching a lot of blocks until then. We could setup a timeout per session before cancellation happens and mark the state differently so it will still state it has the block without holding it in memory, maybe until A) the session is explicitly canceled; or B) the session is canceled by timeout; or C) the block is removed out of the block store (which may require some form of signaling).

Thoughts?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about the memory — hadn’t considered what happens when lots of blocks are being fetched.

I’m thinking Option 2 makes the most sense here: drop the bytes after serving everyone, but remember that we have the block. If a late peer shows up, we just re-fetch from the repo (should be fast since it’s local).

So basically:

  • Add a new Waiting { timer } state that holds no block bytes
  • After all known peers get BlockSent, transition from Block → Waiting and start the timer
  • The have: Some(true) field already tracks that we have the block, so that stays
  • If a late peer calls need_block() while we’re in Waiting, transition to GetBlock to re-fetch from repo, then serve them
  • If the timer expires with no new requests, emit Cancelled so the session gets cleaned up from the have_session map

This way memory gets freed right away instead of sitting there for the whole timeout, and late peers still get served.

Let me know if that works or if you’d prefer a different approach. Also happy to hear thoughts on timeout duration — was thinking somewhere around 30–60 seconds?

  Reproduces issue dariusc93#458: when multiple peers request the same block
  simultaneously, only 1-2 receive it due to HaveSession state bugs.

  Test setup:
  - 1 provider with block
  - 4 requesters firing get() concurrently

  Results:
  - Without fix: ~1/4 succeed
  - With fix: 4/4 succeed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bitswap provider ignores concurrent block requests from multiple peers

2 participants