280 changes: 280 additions & 0 deletions SPEC/9002_BBR.md
@@ -0,0 +1,280 @@
# RFC 9002 + BBRv3 Congestion Control

## Status

**Phases 1–4 complete (Apr 2026).** BBRv3 is selectable via
`ConnectionConfig.congestion_control = .bbr` alongside the existing
`.cubic` (default) and `.newreno` controllers.

Reference: `draft-cardwell-iccrg-bbr-congestion-control-03`.

## Architecture

### CC abstraction (`src/quic/congestion.zig`)

```
pub const Algorithm = enum { newreno, cubic, bbr };
pub const CongestionControl = union(Algorithm) {
    newreno: NewReno,
    cubic: Cubic,
    bbr: Bbr,
};
```

Tagged union with `inline else` dispatch — zero allocation, exhaustive
switch protection at compile time. Replaces the previous hardcoded
`cc: Cubic` field on `Connection`.
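
As a minimal sketch of what `inline else` dispatch looks like, the following uses hypothetical stand-in controllers (their fields, initial windows, and growth rules are invented for illustration and are not the real `NewReno`/`Cubic`/`Bbr`). Only the union shape and the `init(algorithm)` entry point come from this document.

```zig
const std = @import("std");

// Hypothetical stand-ins: the real controllers carry far more state.
const NewReno = struct {
    cwnd: u64 = 12_000,
    pub fn onPacketAcked(self: *NewReno, bytes: u64) void {
        self.cwnd += bytes;
    }
};
const Cubic = struct {
    cwnd: u64 = 12_000,
    pub fn onPacketAcked(self: *Cubic, bytes: u64) void {
        self.cwnd += bytes / 2;
    }
};
const Bbr = struct {
    cwnd: u64 = 12_000,
    pub fn onPacketAcked(self: *Bbr, bytes: u64) void {
        // In this sketch BBR sets cwnd from its model, not per-packet acks.
        _ = self;
        _ = bytes;
    }
};

pub const Algorithm = enum { newreno, cubic, bbr };

pub const CongestionControl = union(Algorithm) {
    newreno: NewReno,
    cubic: Cubic,
    bbr: Bbr,

    pub fn init(algo: Algorithm) CongestionControl {
        return switch (algo) {
            .newreno => CongestionControl{ .newreno = NewReno{} },
            .cubic => CongestionControl{ .cubic = Cubic{} },
            .bbr => CongestionControl{ .bbr = Bbr{} },
        };
    }

    // `inline else` expands to one prong per variant at compile time, so a
    // variant that lacks the method is a compile error, not a runtime one.
    pub fn onPacketAcked(self: *CongestionControl, bytes: u64) void {
        switch (self.*) {
            inline else => |*impl| impl.onPacketAcked(bytes),
        }
    }

    pub fn congestionWindow(self: *const CongestionControl) u64 {
        return switch (self.*) {
            inline else => |impl| impl.cwnd,
        };
    }
};
```

Adding a fourth algorithm under this scheme would mean adding one enum tag and one union field; every dispatcher picks it up without further edits.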

### Batched ACK API

```
pub const AckContext = struct {
    now: i64,
    bytes_in_flight: u64,
    persistent_congestion: bool,
    earliest_lost_sent_time: ?i64,
    ce_byte_count: u64,
    rate_sample: ?delivery_rate.RateSample = null,
};

pub fn onAckBatch(self: *CongestionControl, ctx: *const AckContext) void;
```

Per-packet `onPacketAcked` still fires inside the connection's existing
ACK-processing loop (where per-packet stream/MTU bookkeeping lives).
`onAckBatch` carries batch-level signals: loss summary, ECN-CE bytes,
and the latest delivery-rate sample. NewReno/Cubic implement
`onAckBatch` as a thin shim over their existing
`onCongestionEvent`/`onPersistentCongestion` methods. BBR consumes the
whole context.
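
The "thin shim" for loss-based controllers could look roughly like the sketch below. `LossBased`, its window constants, and the bodies of `onCongestionEvent`/`onPersistentCongestion` are hypothetical; `rate_sample` is left out of the context because this controller ignores it. The CE branch passes `ctx.now` as a placeholder send time, which is an assumption of this sketch rather than something the document specifies.

```zig
const std = @import("std");

pub const AckContext = struct {
    now: i64,
    bytes_in_flight: u64,
    persistent_congestion: bool,
    earliest_lost_sent_time: ?i64,
    ce_byte_count: u64,
};

// Hypothetical loss-based controller standing in for NewReno/Cubic.
const LossBased = struct {
    cwnd: u64 = 40_000,
    min_cwnd: u64 = 2_400,
    recovery_start: ?i64 = null,

    fn onCongestionEvent(self: *LossBased, sent_time: i64, now: i64) void {
        // Packets sent before recovery began belong to the same event.
        if (self.recovery_start) |start| {
            if (sent_time <= start) return;
        }
        self.recovery_start = now;
        self.cwnd = @max(self.cwnd / 2, self.min_cwnd);
    }

    fn onPersistentCongestion(self: *LossBased) void {
        self.cwnd = self.min_cwnd;
        self.recovery_start = null;
    }

    // The shim: translate batch-level signals into the existing callbacks.
    pub fn onAckBatch(self: *LossBased, ctx: *const AckContext) void {
        if (ctx.persistent_congestion) {
            self.onPersistentCongestion();
            return;
        }
        if (ctx.earliest_lost_sent_time) |sent_time| {
            self.onCongestionEvent(sent_time, ctx.now);
        } else if (ctx.ce_byte_count > 0) {
            // Any CE bytes count as one congestion event (placeholder time).
            self.onCongestionEvent(ctx.now, ctx.now);
        }
    }
};
```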

### Pacer

```
pub fn setBandwidth(self: *Pacer, cwnd: u64, rtt_stats: *const RttStats) void;
pub fn setPacingRate(self: *Pacer, bytes_per_second: u64) void;
```

NewReno/Cubic use `setBandwidth` (cwnd/RTT-derived rate). BBR computes
its own pacing rate from `pacing_gain × max_bw` and writes it directly
via `setPacingRate`. The connection calls
`cc.updatePacer(&pacer, &rtt_stats)` which dispatches the right path.
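
A sketch of the two pacer paths, under stated assumptions: `RttStats.smoothed_rtt_us`, the `bytes_per_second` field, and the plain `cwnd / RTT` formula (with no extra pacing gain) are all hypothetical simplifications. Only the three function names and the dispatch idea are taken from the text above.

```zig
const std = @import("std");

const RttStats = struct { smoothed_rtt_us: u64 };

const Pacer = struct {
    bytes_per_second: u64 = 0,

    // Loss-based path: derive the rate from cwnd and RTT.
    pub fn setBandwidth(self: *Pacer, cwnd: u64, rtt_stats: *const RttStats) void {
        const rtt_us = @max(rtt_stats.smoothed_rtt_us, 1); // avoid divide-by-zero
        self.bytes_per_second = cwnd * 1_000_000 / rtt_us;
    }

    // Model-based path: the controller already knows the rate it wants.
    pub fn setPacingRate(self: *Pacer, bytes_per_second: u64) void {
        self.bytes_per_second = bytes_per_second;
    }
};

const Cubic = struct {
    cwnd: u64,
    pub fn updatePacer(self: *const Cubic, pacer: *Pacer, rtt_stats: *const RttStats) void {
        pacer.setBandwidth(self.cwnd, rtt_stats);
    }
};

const Bbr = struct {
    pacing_rate: u64, // pacing_gain × max_bw, computed elsewhere
    pub fn updatePacer(self: *const Bbr, pacer: *Pacer, rtt_stats: *const RttStats) void {
        _ = rtt_stats; // BBR does not need the RTT here
        pacer.setPacingRate(self.pacing_rate);
    }
};

const CongestionControl = union(enum) {
    cubic: Cubic,
    bbr: Bbr,

    pub fn updatePacer(self: *const CongestionControl, pacer: *Pacer, rtt_stats: *const RttStats) void {
        switch (self.*) {
            inline else => |*impl| impl.updatePacer(pacer, rtt_stats),
        }
    }
};
```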

### Delivery-rate sampler (`src/quic/delivery_rate.zig`)

**Off by default — pay-for-what-you-use.** `PacketHandler.rate_sampling_enabled`
is `false` unless `cc_algorithm == .bbr`. When off, both
`onPacketSent` (4-store snapshot stamp) and the per-acked-packet u128
divide are skipped entirely, behind a single predictable branch on a
flag constant for the connection's lifetime. Loopback HTTP/3 bench
confirms zero measurable regression vs the pre-BBR codebase when
Cubic/NewReno is selected.

**Snapshot fields live inline on `SentPacket`** rather than in a
side-allocated table. This costs ~24 B per in-flight packet
unconditionally (the work is gated, the storage isn't). At 1000
in-flight packets × 1000 connections that's ~24 MB — fine for typical
servers, potentially worth revisiting at 100K+ connections. The
considered alternative — a separate `?AutoArrayHashMapUnmanaged(u64,
BbrSnapshot)` allocated only when BBR is on — was rejected: it splits
the per-packet state across two structures with synchronized
lifetimes, adds an OOM failure path on `onPacketSent`, and complicates
the sampler API for a memory win that does not show up in current
workloads.

Per-packet snapshot fields on `SentPacket`:

```
delivered_at_send: u64
delivered_time_at_send: i64
first_sent_time_at_send: i64
is_app_limited_at_send: bool
```

Stamped in `PacketHandler.onPacketSent`; consumed in
`PacketHandler.onAckReceived` to produce a `RateSample` per acked
packet. The highest-numbered acked packet's sample is exposed via
`pkt_handler.latest_rate_sample` and threaded into `AckContext`.
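
To make the stamp-then-sample flow concrete, here is a simplified sampler sketch. It follows the general delivery-rate estimation approach (interval is the larger of the send-side and ack-side elapsed time); the trimmed `SentPacket`, the microsecond time unit, and the two-argument `onPacketAcked` (the real call also passes the latest RTT) are assumptions made for illustration.

```zig
const std = @import("std");

// Hypothetical, trimmed-down SentPacket: only what the sampler touches.
const SentPacket = struct {
    size: u64,
    time_sent: i64, // microseconds (assumed unit)
    delivered_at_send: u64 = 0,
    delivered_time_at_send: i64 = 0,
    first_sent_time_at_send: i64 = 0,
    is_app_limited_at_send: bool = false,
};

const RateSample = struct {
    delivery_rate: u64, // bytes per second
    is_app_limited: bool,
};

const RateSampler = struct {
    delivered: u64 = 0, // total bytes acked so far
    delivered_time: i64 = 0, // time of the most recent ack
    first_sent_time: i64 = 0, // send time that opens the current interval
    app_limited: bool = false,

    pub fn onPacketSent(self: *RateSampler, pkt: *SentPacket, bytes_in_flight: u64, now: i64) void {
        if (bytes_in_flight == 0) {
            // Leaving idle: start a fresh measurement interval.
            self.first_sent_time = now;
            self.delivered_time = now;
        }
        pkt.delivered_at_send = self.delivered;
        pkt.delivered_time_at_send = self.delivered_time;
        pkt.first_sent_time_at_send = self.first_sent_time;
        pkt.is_app_limited_at_send = self.app_limited;
    }

    pub fn onPacketAcked(self: *RateSampler, pkt: *const SentPacket, now: i64) ?RateSample {
        self.delivered += pkt.size;
        self.delivered_time = now;
        self.first_sent_time = pkt.time_sent;

        const send_elapsed = pkt.time_sent - pkt.first_sent_time_at_send;
        const ack_elapsed = now - pkt.delivered_time_at_send;
        const interval = @max(send_elapsed, ack_elapsed);
        if (interval <= 0) return null; // zero-interval: no usable sample

        const interval_us: u64 = @intCast(interval);
        // Widen to u128 so bytes × 1e6 cannot overflow before the divide.
        const delta: u128 = self.delivered - pkt.delivered_at_send;
        const rate: u64 = @intCast(delta * 1_000_000 / interval_us);
        return RateSample{ .delivery_rate = rate, .is_app_limited = pkt.is_app_limited_at_send };
    }
};
```

For a single 1200-byte packet acked 10 ms after an idle restart, this yields 120,000 bytes per second.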

## BBRv3 implementation (`src/quic/bbr.zig`)

Architecture follows picoquic's `bbr.c` pattern: a two-stage pipeline
inside `onAckBatch`:

1. **`updateModelAndState`** — round detection, recovery exit, latest
round signals, max-BW filter, min-RTT filter, ACK aggregation
(`extra_acked`), ECN-alpha EWMA, loss-event gate (`BBRLossThresh`),
lower-bounds (`bw_lo`, `inflight_lo`), and state transitions.
2. **`updateControlParameters`** — `pacing_rate`, `send_quantum`, cwnd.

State transitions are isolated in `enterDrain`, `enterProbeBwDown`,
`enterProbeBwCruise`, `enterProbeBwRefill`, `enterProbeBwUp`, so each
transition's side effects can be audited in one place.

### Model

- **`max_bw`**: max-filtered over the last 4 rounds. App-limited samples
are skipped *only if they would lower the estimate* (otherwise we'd
miss bandwidth growth during partial idle periods).
- **`bw_lo`**: per-round lower bound, decays toward `bw_latest` (the
latest round's max). Reset at REFILL entry. `boundedBw() = min(max_bw,
bw_lo)` is the rate used for pacing.
- **`min_rtt`**: min over a 10s window. Crucially, the **stamp** that
gates ProbeRTT entry (`probe_rtt_min_stamp`) is *separate* from the
per-sample stamp — it only advances when a *new* minimum is observed
or the window expires. This fixes the prior bug where every sample at
the current min refreshed the stamp and prevented ProbeRTT from ever
firing.
- **`extra_acked`**: ACK-aggregation budget, max-filtered over 10
rounds. Added to `max_inflight` so we don't underutilize wifi/LTE
links that batch ACKs.
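
The `probe_rtt_min_stamp` rule is easy to get subtly wrong, so here is a small sketch of it in isolation. `MinRttTracker` and its layout are hypothetical, and the separate 10s filter window is omitted; the 5s interval is the one quoted in the state-machine notes.

```zig
const std = @import("std");

const probe_rtt_interval_us: i64 = 5_000_000; // 5 s

// Hypothetical tracker holding only the ProbeRTT-gating state.
const MinRttTracker = struct {
    min_rtt_us: u64 = std.math.maxInt(u64),
    probe_rtt_min_stamp: i64 = 0,

    /// Feeds one RTT sample. Returns true when the ProbeRTT interval had
    /// already expired before this sample arrived.
    pub fn update(self: *MinRttTracker, rtt_us: u64, now: i64) bool {
        const expired = now - self.probe_rtt_min_stamp > probe_rtt_interval_us;
        // Strictly-less comparison: a sample that merely equals the current
        // minimum must not refresh the stamp, or ProbeRTT never fires.
        if (rtt_us < self.min_rtt_us or expired) {
            self.min_rtt_us = rtt_us;
            self.probe_rtt_min_stamp = now;
        }
        return expired;
    }
};
```

With `<=` in place of `<`, a steady stream of samples at the current minimum would keep pushing the stamp forward, which is the failure mode described above.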

### State machine

Event-driven, not timer-driven:

```
Startup ──bw plateau / loss > BBRLossThresh / RTT excess─▶ Drain
Drain ──inflight ≤ BDP─▶ ProbeBW_Down

ProbeBW_Down ──inflight ≤ (1-headroom)·BDP─▶ ProbeBW_Cruise
ProbeBW_Cruise ──≥1 round elapsed─▶ ProbeBW_Refill
ProbeBW_Refill ──one round (resets bw_lo, inflight_lo)─▶ ProbeBW_Up
ProbeBW_Up ──inflight ≥ inflight_hi / >2 rounds─▶ ProbeBW_Down

(any non-Startup state) ──min_rtt stale (5s)─▶ ProbeRTT
ProbeRTT ──200ms drained─▶ ProbeBW_Down (or Startup if !filled_pipe)
```

- **Startup**: pacing_gain = 2.885, cwnd_gain = 2.885. Two exit paths:
bw plateau (≥5/4 growth fails for 3 rounds), or loss > `BBRLossThresh`
(2% of inflight).
- **Drain**: pacing_gain = 0.346, cwnd_gain = 2.885. Exit when inflight ≤ BDP.
- **ProbeBW**: four sub-states with distinct semantics:
  - **Down** (gain 0.9): drain inflight to leave 15% headroom.
  - **Cruise** (gain 1.0, cwnd-headroom 0.15): hold steady, fair to
    competing flows.
  - **Refill** (gain 1.0): reset `bw_lo` and `inflight_lo` so Up can
    probe upward; lasts one round.
  - **Up** (gain 1.25, cwnd_gain 2.25): raise `inflight_hi` slope-style
    via `BBRRaiseInflightHiSlope` so steady-state throughput grows.
- **ProbeRTT**: gated on `probe_rtt_min_stamp + 5s`. Drops cwnd to 4×MSS
for 200ms, then resumes ProbeBW_Down (or Startup if pipe never filled).
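
The ProbeBW cycle and the per-state pacing gains can be summarized as a pure function, sketched below. The `State` and `Signals` types and the function names are hypothetical; the gains and transition conditions are the ones listed above, except the ProbeRTT pacing gain of 1.0, which this document does not state and is assumed here.

```zig
const std = @import("std");

pub const State = enum {
    startup,
    drain,
    probe_bw_down,
    probe_bw_cruise,
    probe_bw_refill,
    probe_bw_up,
    probe_rtt,
};

pub fn pacingGain(state: State) f64 {
    return switch (state) {
        .startup => 2.885,
        .drain => 0.346,
        .probe_bw_down => 0.9,
        .probe_bw_cruise, .probe_bw_refill => 1.0,
        .probe_bw_up => 1.25,
        .probe_rtt => 1.0, // assumption: not specified in this document
    };
}

// Hypothetical bundle of the signals the ProbeBW transitions look at.
pub const Signals = struct {
    inflight: u64,
    bdp: u64,
    inflight_hi: u64,
    rounds_in_state: u32,
};

pub fn nextProbeBwState(state: State, s: Signals) State {
    return switch (state) {
        // Down ends once inflight has drained to 85% of BDP (15% headroom).
        .probe_bw_down => if (s.inflight * 100 <= s.bdp * 85) State.probe_bw_cruise else state,
        .probe_bw_cruise => if (s.rounds_in_state >= 1) State.probe_bw_refill else state,
        .probe_bw_refill => if (s.rounds_in_state >= 1) State.probe_bw_up else state,
        .probe_bw_up => if (s.inflight >= s.inflight_hi or s.rounds_in_state > 2) State.probe_bw_down else state,
        else => state, // Startup, Drain, ProbeRTT are handled elsewhere
    };
}
```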

### Loss / ECN response

- **Loss** (`checkLossEvent` + `BBRLossThresh`): `inflight_hi`
reduction fires only when `lost_bytes > 2% × prior_bytes_in_flight
AND lost_bytes > 3 × MTU`. Without this gate, a single packet loss
would permanently cap throughput at 70% of BDP. Also enters recovery
state for one round of packet conservation.
- **ECN** (`updateEcnAlpha` + `BBRExcessiveEcnCE`): EWMA `alpha =
frac/16 + 15·alpha_prev/16` (gain 1/16). Reduces `inflight_hi` when
`alpha > 0.5`. Smoothing prevents a single CE-marked batch from
triggering reduction.
- **Persistent congestion**: preserves the BW model. Halves cwnd, resets
`bw_lo`/`inflight_lo`, exits recovery. Does NOT throw away `max_bw`
or `min_rtt` — those are minutes of measurement.
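
The two gates above reduce to a few lines of arithmetic. The sketch below is illustrative only: `lossTooHigh` and the free-function form of `updateEcnAlpha` are hypothetical shapes, and `frac` is taken to be CE-marked bytes over acked bytes, which is an assumption.

```zig
const std = @import("std");

/// Loss gate: both conditions must hold before `inflight_hi` is reduced.
pub fn lossTooHigh(lost_bytes: u64, prior_bytes_in_flight: u64, mtu: u64) bool {
    // lost > 2% of prior inflight, written without floating point.
    const over_thresh = lost_bytes * 100 > prior_bytes_in_flight * 2;
    return over_thresh and lost_bytes > 3 * mtu;
}

/// ECN EWMA with gain 1/16: alpha = frac/16 + 15·alpha_prev/16.
pub fn updateEcnAlpha(alpha_prev: f64, ce_bytes: u64, acked_bytes: u64) f64 {
    if (acked_bytes == 0) return alpha_prev;
    const ce: f64 = @floatFromInt(ce_bytes);
    const acked: f64 = @floatFromInt(acked_bytes);
    const frac = ce / acked;
    return frac / 16.0 + 15.0 * alpha_prev / 16.0;
}
```

Starting from `alpha = 0` with every acked byte CE-marked, the EWMA needs 11 consecutive updates to cross 0.5, since `1 - (15/16)^10 ≈ 0.476` and `1 - (15/16)^11 ≈ 0.508`. That is the smoothing effect the ECN bullet describes.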

### Recovery

- On the first loss in a round (loss-too-high), enter recovery; `cwnd ≤
inflight_lo` for one round (packet conservation approximation).
- On PTO, save `cwnd` and enter recovery; first ack lifts `inflight_hi`
to at least the saved value (BBR recomputes the actual cwnd from its
model).
- Exit when `largest_acked_pn ≥ recovery_start_pn`.

### Pacing rate, send_quantum, cwnd

```
pacing_rate = pacing_gain × boundedBw × (1 - 1% margin)
send_quantum = clamp(pacing_rate × 1ms, 2·MTU, 64KB)
cwnd = max_inflight = bdp × cwnd_gain + extra_acked
(bounded by inflight_hi above, inflight_lo below; floor 4·MTU)
```

CRUISE applies a 15% headroom subtraction; ProbeRTT clamps to 4·MTU.
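
A worked sketch of these formulas in integer arithmetic follows. Gains are passed in thousandths to avoid floating point; that encoding, the 1200-byte `mtu`, reading "64KB" as 65,536 bytes, and the function names are all assumptions of this sketch. The `inflight_lo` bound and the CRUISE/ProbeRTT adjustments are left out.

```zig
const std = @import("std");

const mtu: u64 = 1200; // assumed datagram size
const max_quantum: u64 = 64 * 1024; // "64KB" read as 65,536 bytes

/// pacing_rate = pacing_gain × boundedBw × (1 - 1% margin), in bytes/second.
pub fn pacingRate(bounded_bw: u64, pacing_gain_milli: u64) u64 {
    return bounded_bw * pacing_gain_milli / 1000 * 99 / 100;
}

/// send_quantum = clamp(pacing_rate × 1ms, 2·MTU, 64KB).
pub fn sendQuantum(pacing_rate: u64) u64 {
    return std.math.clamp(pacing_rate / 1000, 2 * mtu, max_quantum);
}

/// max_inflight = bdp × cwnd_gain + extra_acked, capped by inflight_hi,
/// with a 4·MTU floor.
pub fn maxInflight(
    bounded_bw: u64,
    min_rtt_us: u64,
    cwnd_gain_milli: u64,
    extra_acked: u64,
    inflight_hi: u64,
) u64 {
    const bdp = bounded_bw * min_rtt_us / 1_000_000;
    const target = bdp * cwnd_gain_milli / 1000 + extra_acked;
    return @max(@min(target, inflight_hi), 4 * mtu);
}
```

For example, at a bounded bandwidth of 10 MB/s with gain 1.0, the pacing rate is 9.9 MB/s and the send quantum is 9,900 bytes; with a 50 ms `min_rtt` the BDP is 500,000 bytes.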

### Path migration

`Bbr.onPathChange()` resets the entire struct (path's BW/RTT model is
no longer valid), returning to Startup with a fresh estimator.

## Wiring

### `ConnectionConfig`

```
congestion_control: congestion.Algorithm = .cubic,
```

### `Connection`

```
cc: congestion.CongestionControl = .{ .cubic = ... },
cc_algorithm: congestion.Algorithm = .cubic, // preserved across migrations
```

Migration resets use `congestion.CongestionControl.init(self.cc_algorithm)`
so the user-chosen CC survives address rebinding.

### Per-packet ECN CE attribution

ACK_ECN counters are aggregate (no per-packet attribution available
from the wire). The implementation distributes the CE-count delta
across the ECT(0) bytes acked in the same batch:

```
ce_byte_count = newly_acked_ect0_bytes * ce_delta / newly_acked_ect0
```

This matches quic-go's approach. NewReno/Cubic treat any `ce_byte_count
> 0` as a single congestion event (legacy behavior); BBR uses the byte
count for its alpha threshold check.
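
As a small worked example of the attribution formula, the sketch below assumes that `newly_acked_ect0` is the number of newly acked ECT(0) packets, so each CE mark is charged at the batch's average packet size. That reading, the function name, and the parameter names are assumptions.

```zig
const std = @import("std");

/// Estimate how many acked bytes were CE-marked, given only the aggregate
/// CE-counter delta from the ACK_ECN frame.
pub fn ceByteCount(acked_ect0_bytes: u64, acked_ect0_packets: u64, ce_delta: u64) u64 {
    if (acked_ect0_packets == 0) return 0; // nothing to attribute against
    return acked_ect0_bytes * ce_delta / acked_ect0_packets;
}
```

With 10 newly acked packets totalling 12,000 bytes and a CE delta of 2, the estimate is 2,400 bytes.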

## Caveats

- **Draft revision drift.** Pinned to `draft-cardwell-iccrg-bbr-congestion-control-03`;
  behavior may diverge from later revisions of the draft.
- **Simplified `BBRRaiseInflightHiSlope`.** The draft prescribes a
per-ack slope-based growth; we use a per-round `inflight_hi +=
MTU × bw_probe_up_rounds` approximation. Same direction, slightly
less precise.
- **No CRUISE randomized timer / Reno coexistence quota.** Picoquic uses
random timers and a `BBR_BDP_packets`-based forced-probe to share
fairly with Reno flows. We use a simpler "≥1 round elapsed" rule.
Acceptable on links that don't share bottleneck with Reno; revisit if
fairness measurements show starvation.
- **No qlog instrumentation yet** for BBR-specific state (state
transitions, max_bw, min_rtt, pacing_gain, inflight_hi/lo). Follow-up.
- **No interop CLI flag yet.** Interop binaries still use the default
Cubic. Adding `--cc bbr` pass-through is a small follow-up.

## Test coverage

- `congestion.zig` — 6 union-dispatch parity tests + the existing
NewReno/Cubic suites.
- `delivery_rate.zig` — 4 sampler tests (steady stream, app-limited
flag propagation, reordering, zero-interval).
- `bbr.zig` — 19 tests covering: initial state, max-BW filter +
app-limited handling, **`BBRLossThresh` gating both directions**,
ProbeRTT entry actually firing (with the `probe_rtt_min_stamp` fix),
ProbeRTT exit, all 4 ProbeBW sub-state transitions,
**`inflight_hi` growth in Up**, ECN EWMA crossing threshold,
**persistent congestion preserves `max_bw`**, PTO recovery exit lifts
`inflight_hi`, `send_quantum` scaling, CRUISE headroom, bdp uses
bounded bw, app-limited handling, pacing margin, path change.
- `connection.zig` — 3 tests confirming algorithm selection via
`ConnectionConfig.congestion_control`.

Total: 551/551 tests pass.

## Files

- `src/quic/congestion.zig` — union, AckContext, NewReno, Cubic, Pacer
- `src/quic/bbr.zig` — Bbr state machine
- `src/quic/delivery_rate.zig` — RateSampler, RateSample
- `src/quic/ack_handler.zig` — SentPacket sampler fields, PacketHandler
rate_sampler
- `src/quic/connection.zig` — cc, cc_algorithm, AckContext call sites,
config wiring
1 change: 1 addition & 0 deletions SPEC/STATUS.md
@@ -276,6 +276,7 @@
| 7.8 | Under-utilizing the Congestion Window | ✅ Done | app_limited flag suppresses cwnd growth in NewReno + CUBIC |
| B | NewReno Pseudocode | ✅ Done | Matches appendix B |
| - | CUBIC (RFC 8312) | ✅ Done | Default CC algorithm, fast convergence |
| - | BBRv3 (draft-cardwell-iccrg-03) | ✅ Done | Selectable via `ConnectionConfig.congestion_control = .bbr`. See [9002_BBR.md](./9002_BBR.md) |

### Summary — RFC 9002

48 changes: 46 additions & 2 deletions src/quic/ack_handler.zig
@@ -10,6 +10,7 @@ const frame_mod = @import("frame.zig");
const Frame = frame_mod.Frame;
const AckRange = frame_mod.AckRange;
const MAX_ACK_RANGES = frame_mod.MAX_ACK_RANGES;
const delivery_rate = @import("delivery_rate.zig");

/// Encryption level / packet number space.
pub const EncLevel = enum(u2) {
@@ -77,6 +78,15 @@ pub const SentPacket = struct {
/// Whether this packet contains DATAGRAM frames (for stats tracking on loss).
has_datagram: bool = false,

// ── Delivery-rate sampler snapshots (RFC 9002 §B / draft-cardwell §3.2) ──
// Filled in by RateSampler.onPacketSent at the moment this packet is
// transmitted. Consumed by the sampler when the packet is acked to
// produce a RateSample.
delivered_at_send: u64 = 0,
delivered_time_at_send: i64 = 0,
first_sent_time_at_send: i64 = 0,
is_app_limited_at_send: bool = false,

/// Record a stream frame carried by this packet.
pub fn addStreamFrame(self: *SentPacket, info: StreamFrameInfo) void {
if (self.stream_frame_count < MAX_STREAM_FRAMES_PER_PACKET) {
@@ -501,6 +511,18 @@ pub const PacketHandler = struct {
bytes_in_flight: u64 = 0,
pto_count: u32 = 0,
next_pn: [3]u64 = .{ 0, 0, 0 },
/// Delivery-rate sampler (RFC 9002 §B). Stamps per-packet snapshots on
/// send and produces RateSamples on ACK. Consumed by BBR; ignored by
/// NewReno/Cubic.
rate_sampler: delivery_rate.RateSampler = .{},
/// Latest delivery-rate sample produced during the most recent
/// `onAckReceived` call (null if no acked packets / no usable sample,
/// or if rate sampling is disabled).
latest_rate_sample: ?delivery_rate.RateSample = null,
/// Whether the rate sampler should run on send/ack. Off by default —
/// enable only when the active CC needs delivery-rate samples (BBR).
/// Off, the per-send stamp and per-ack u128 divide are skipped entirely.
rate_sampling_enabled: bool = false,

pub fn init(allocator: Allocator) PacketHandler {
return .{
@@ -536,8 +558,15 @@ pub const PacketHandler = struct {
return self.sent[idx].largest_acked;
}

pub fn onPacketSent(self: *PacketHandler, pkt: SentPacket) !void {
pub fn onPacketSent(self: *PacketHandler, pkt_in: SentPacket) !void {
var pkt = pkt_in;
const idx = @intFromEnum(pkt.enc_level);
// Stamp delivery-rate snapshot fields only when sampling is enabled.
// The branch is predictable (flag is constant for the connection's
// lifetime), and skipping saves 5 stores per send when CC ≠ BBR.
if (self.rate_sampling_enabled) {
self.rate_sampler.onPacketSent(&pkt, self.bytes_in_flight, pkt.time_sent);
}
try self.sent[idx].onPacketSent(pkt);
if (pkt.in_flight) {
self.bytes_in_flight += pkt.size;
@@ -576,8 +605,16 @@ pub const PacketHandler = struct {
);

// ACK-of-ACK pruning (RFC 9000 §13.2.4): when an acked packet contained
// our ACK frame, prune received ranges below that ACK's largest_ack
// our ACK frame, prune received ranges below that ACK's largest_ack.
// Also feed the delivery-rate sampler (when enabled) and capture the
// latest sample from the highest-numbered acked packet (RFC 9002 §B
// uses the sample produced by the highest-acked packet as
// representative for the ACK).
var max_ack_of_ack: ?u64 = null;
var highest_acked_pn: ?u64 = null;
self.latest_rate_sample = null;
const sampling_on = self.rate_sampling_enabled;
const rtt_for_sample = self.rtt_stats.latest_rtt;
for (result.acked.constSlice()) |pkt| {
if (pkt.in_flight) {
self.bytes_in_flight -|= pkt.size;
@@ -587,6 +624,13 @@ pub const PacketHandler = struct {
max_ack_of_ack = la;
}
}
if (sampling_on) {
const sample = self.rate_sampler.onPacketAcked(&pkt, rtt_for_sample, now);
if (highest_acked_pn == null or pkt.pn > highest_acked_pn.?) {
highest_acked_pn = pkt.pn;
self.latest_rate_sample = sample;
}
}
}
if (max_ack_of_ack) |prune_below| {
self.recv[idx].pruneAckedRanges(prune_below);