fix(bitswap): serve late-arriving peers in HaveSession#459
Conversation
When multiple peers request the same block simultaneously, late-arriving peers (those calling need_block() after the session already has the block) were never served because the session wasn't being woken up. - Wake session when already in Block state so poll_next serves new peers - Keep Block state instead of completing, preserving bytes for late requests Fixes dariusc93#458
| // Since we have no more peers who want the block, we will finalize the session | ||
| this.state = HaveSessionState::Complete; | ||
| this.want.clear(); | ||
| this.send_dont_have.clear(); | ||
| return Poll::Ready(Some(HaveSessionEvent::Cancelled)); | ||
| // All currently-known peers have been served, but don't complete the session yet. | ||
| // | ||
| // More peers may request this block after a slight delay (e.g., due to network | ||
| // latency in DHT provider discovery). By keeping the Block state with bytes in | ||
| // memory, we allow need_block() to wake us again when late-arriving peers request | ||
| // the same block. This fixes the issue where only the first few simultaneous | ||
| // requesters would receive the block. | ||
| // | ||
| // The session will eventually be cleaned up by the session manager's timeout | ||
| // or explicit cancellation, not by completing here. | ||
| this.waker = Some(cx.waker().clone()); | ||
| return Poll::Pending; |
There was a problem hiding this comment.
There are only a couple of places where the session would be canceled, so without this here, it would possibly lead to high memory usage here if there is fetching a lot of blocks until then. We could setup a timeout per session before cancellation happens and mark the state differently so it will still state it has the block without holding it in memory, maybe until A) the session is explicitly canceled; or B) the session is canceled by timeout; or C) the block is removed out of the block store (which may require some form of signaling).
Thoughts?
There was a problem hiding this comment.
Good point about the memory — hadn’t considered what happens when lots of blocks are being fetched.
I’m thinking Option 2 makes the most sense here: drop the bytes after serving everyone, but remember that we have the block. If a late peer shows up, we just re-fetch from the repo (should be fast since it’s local).
So basically:
- Add a new
Waiting { timer }state that holds no block bytes - After all known peers get
BlockSent, transition fromBlock → Waitingand start the timer - The
have: Some(true)field already tracks that we have the block, so that stays - If a late peer calls
need_block()while we’re inWaiting, transition toGetBlockto re-fetch from repo, then serve them - If the timer expires with no new requests, emit
Cancelledso the session gets cleaned up from thehave_sessionmap
This way memory gets freed right away instead of sitting there for the whole timeout, and late peers still get served.
Let me know if that works or if you’d prefer a different approach. Also happy to hear thoughts on timeout duration — was thinking somewhere around 30–60 seconds?
Reproduces issue dariusc93#458: when multiple peers request the same block simultaneously, only 1-2 receive it due to HaveSession state bugs. Test setup: - 1 provider with block - 4 requesters firing get() concurrently Results: - Without fix: ~1/4 succeed - With fix: 4/4 succeed
Summary
Fixes #458
When multiple peers request the same block simultaneously, late-arriving peers were never served. This caused only 2-3 out of 5+ simultaneous requesters to receive the block.
Root Cause
Two issues in
HaveSession:need_block(): When a peer calledneed_block()while the session was already inBlockstate (bytes ready), the waker was not called. The session never got polled again to serve the new peer.poll_next(): After serving all currently-known peers, the session transitioned toCompleteorIdle, discarding the block bytes. Late-arriving peers (delayed by DHT provider discovery latency) could never be served.Changes
need_block(): Wake the session when already inBlockstate sopoll_nextruns again to serve newly-added peerspoll_next(): KeepBlockstate instead of completing or transitioning toIdle, preserving bytes in memory for late-arriving requests. Session cleanup is delegated to the session manager's timeout.Test Results
Tested with 6-node P2P mesh network: