66 changes: 66 additions & 0 deletions SPEC/CLOCKS.md
@@ -0,0 +1,66 @@
# Clock contract

quic-zig uses two clock sources internally. Most of the codebase reads
`std.time.nanoTimestamp()` (REALTIME); the user-space pacer is the single
exception — it runs on `CLOCK_MONOTONIC` via `clock.monoNanos()`.

This split is intentional. Reading this page once should be enough to avoid
introducing a cross-clock comparison bug on a future change.

## Who uses what

| Subsystem | Clock | Source | Why |
|-----------|-------|--------|-----|
| Loss detection (PTO, RTT) | REALTIME | `std.time.nanoTimestamp()` | Compares timestamps it produced itself; absolute drift is irrelevant. |
| Idle timeout | REALTIME | `std.time.nanoTimestamp()` | Same — only the delta `now − last_activity` matters. |
| Stateless reset / token expiry | REALTIME | `std.time.nanoTimestamp()` | Long-horizon validity windows; wall-clock alignment is fine. |
| qlog timestamps | REALTIME | `std.time.nanoTimestamp()` | Wall-clock is what humans expect when reading traces. |
| Datagram receive timestamps | REALTIME | `std.time.nanoTimestamp()` | Compared only to other REALTIME values within the same connection. |
| **Pacer** (`Pacer.last_sent_time`, `timeUntilSend`, `onPacketSent`) | **MONOTONIC** | `clock.monoNanos()` | Budget replenishment math (`elapsed = now − last_sent_time`) breaks if a wall-clock jump (NTP step correction, manual time change) makes `elapsed` negative or huge. |

## The single boundary

`Connection.nextTimeoutNs()` is the only function that crosses the boundary.
It folds the pacer's next-send time into a deadline that the event loop
compares against REALTIME-based deadlines (loss timer, idle timer, ack alarm).

The conversion happens inline at `connection.zig:3793`:

```zig
const now_realtime: i64 = @intCast(std.time.nanoTimestamp());
const now_mono: i64 = clock.monoNanos();
const elapsed = now_mono - self.pacer.last_sent_time; // duration on MONO
// ... compute delay (a duration, clock-agnostic) ...
const pacer_deadline = now_realtime + delay; // anchor on REALTIME
```

We compute the *duration* on the monotonic clock (where the pacer's state
lives) and add it to a REALTIME `now` so the resulting deadline is comparable
to the other deadlines the event loop collects. The result is a REALTIME
timestamp, never a MONOTONIC one — that boundary stays inside this function.
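
The same pattern can be sketched as a pure function, which makes the role of each clock easy to test in isolation. This is a minimal sketch, not the code in `connection.zig`: `PacerView`, `pacerDeadline`, and `SHIFT` are hypothetical names, the shift of 20 is taken from the `>> 20` in the diff, and capping the replenished budget at `max_burst` is an assumption about lines the diff does not show.

```zig
const std = @import("std");

/// Assumed fixed-point shift for `bandwidth_shifted` (from the `>> 20` in the diff).
const SHIFT = 20;

/// Hypothetical read-only view of the pacer fields the estimate needs.
const PacerView = struct {
    budget: u64,
    max_burst: u64,
    max_datagram_size: u64,
    bandwidth_shifted: u64,
    last_sent_time: i64, // MONOTONIC nanoseconds
};

/// Returns a REALTIME deadline, or null when the pacer would not block.
fn pacerDeadline(p: PacerView, now_realtime: i64, now_mono: i64) ?i64 {
    if (p.bandwidth_shifted == 0) return null;

    // Duration: both operands are MONOTONIC readings.
    const elapsed = now_mono - p.last_sent_time;

    var budget = p.budget;
    if (p.last_sent_time > 0 and elapsed > 0) {
        const earned = (p.bandwidth_shifted *| @as(u64, @intCast(elapsed))) >> SHIFT;
        budget = @min(budget +| earned, p.max_burst); // cap is an assumption
    }
    if (budget >= p.max_datagram_size) return null;

    // The delay is a plain duration, so it can be anchored on REALTIME.
    const deficit = p.max_datagram_size - budget;
    const delay: i64 = @intCast((deficit << SHIFT) / p.bandwidth_shifted);
    return now_realtime + delay;
}

test "deadline depends on the MONOTONIC delta, not its epoch" {
    const now_realtime: i64 = 1_700_000_000_000_000_000;
    var p = PacerView{
        .budget = 0,
        .max_burst = 10_000,
        .max_datagram_size = 1200,
        .bandwidth_shifted = 1 << SHIFT, // 1 byte per nanosecond
        .last_sent_time = 5_000,
    };
    // 200 ns elapsed earns 200 bytes; the 1000-byte deficit takes 1000 ns.
    const a = pacerDeadline(p, now_realtime, 5_200);
    try std.testing.expectEqual(@as(?i64, now_realtime + 1_000), a);

    // Shifting the MONOTONIC epoch leaves the REALTIME deadline unchanged.
    p.last_sent_time = 9_000_000;
    const b = pacerDeadline(p, now_realtime, 9_000_200);
    try std.testing.expectEqual(a, b);
}
```

Running `zig test` on the file exercises the sketch. The second assertion is the point of the boundary: only the monotonic difference enters the result, so the two clocks are never compared directly.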

## Rules for future changes

1. **Adding a new pacer call site:** pass `now_mono` (or call `clock.monoNanos()` fresh). Never pass a `nanoTimestamp()` value.
2. **Reading `pacer.last_sent_time` from outside the Pacer:** treat it as MONOTONIC. Subtract it from another MONOTONIC value to get a duration. Never compare to a REALTIME timestamp.
3. **Adding a new clock-using subsystem:** default to REALTIME. Switch to MONOTONIC only if the subsystem hands timestamps to the kernel (e.g., a future `SCM_TXTIME` cmsg) or is genuinely sensitive to wall-clock jumps.
4. **Mixing in a single deadline computation:** allowed only when computing a *duration* on one clock and anchoring the deadline on another (the `nextTimeoutNs` pattern above). Document why in a comment.
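
One way to have the compiler enforce rules 1, 2, and 4 is to give each clock its own wrapper type. The sketch below is illustrative only: `MonoNs` and `RealNs` are hypothetical types that do not exist in quic-zig, which passes plain `i64` values today.

```zig
const std = @import("std");

/// Hypothetical wrapper for a CLOCK_MONOTONIC reading in nanoseconds.
const MonoNs = struct {
    ns: i64,

    /// Duration between two MONOTONIC readings (rule 2).
    fn since(self: MonoNs, earlier: MonoNs) i64 {
        return self.ns - earlier.ns;
    }
};

/// Hypothetical wrapper for a REALTIME reading in nanoseconds.
const RealNs = struct {
    ns: i64,

    /// Duration between two REALTIME readings.
    fn since(self: RealNs, earlier: RealNs) i64 {
        return self.ns - earlier.ns;
    }

    /// Anchor a clock-agnostic duration on REALTIME (rule 4).
    fn plus(self: RealNs, duration_ns: i64) RealNs {
        return .{ .ns = self.ns + duration_ns };
    }
};

test "durations cross the boundary, timestamps do not" {
    const last_sent = MonoNs{ .ns = 1_000 };
    const now_mono = MonoNs{ .ns = 1_750 };
    const now_real = RealNs{ .ns = 1_700_000_000_000_000_000 };
    const earlier_real = RealNs{ .ns = 1_699_999_999_999_999_000 };

    const elapsed = now_mono.since(last_sent);
    try std.testing.expectEqual(@as(i64, 750), elapsed);
    try std.testing.expectEqual(@as(i64, 1_000), now_real.since(earlier_real));

    const deadline = now_real.plus(elapsed);
    try std.testing.expectEqual(@as(i64, 1_700_000_000_000_000_750), deadline.ns);

    // `now_mono.since(now_real)` would be rejected at compile time:
    // the parameter type is MonoNs, not RealNs.
}
```

The trade-off is extra wrapping at every call site, so this is a design option to weigh rather than a recommendation.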

## Why not migrate everything to MONOTONIC

- Loss detection, PTO, and idle timeout are all *delta-based* — they don't care which clock as long as the timestamps in a single comparison agree. They've worked correctly on REALTIME since day one and changing them adds risk for no gain.
- qlog readers and external tooling expect wall-clock timestamps.
- Token-validity windows are conceptually wall-clock (a 1-day token means 24 wall-clock hours).
- The single subsystem that genuinely needed monotonic semantics (the pacer) is now isolated.

## Why the pacer specifically

- `Pacer.replenish` computes `elapsed = now - last_sent_time` and turns it into bytes of budget. If the wall clock jumps backward by 10 seconds (NTP step correction, manual time change), `elapsed` goes negative and the pacer either refuses to send or floods, depending on signedness handling.
- A forward jump credits the pacer with phantom bandwidth, briefly defeating congestion control.
- `CLOCK_MONOTONIC` never steps in either direction, so both failure modes are avoided (except where `monoNanos()` falls back to `nanoTimestamp()`, see Files).
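
The two failure modes can be reproduced with a small worked example. This is a sketch under stated assumptions, not the body of `Pacer.replenish` (which this diff does not show): `earnedBytes` is a hypothetical helper, the shift of 20 mirrors the `>> 20` used in `nextTimeoutNs`, and the `max_burst` cap and the 14,720-byte value are assumptions.

```zig
const std = @import("std");

/// Assumed fixed-point shift for `bandwidth_shifted`.
const SHIFT = 20;

/// Hypothetical helper: bytes of budget earned over `elapsed_ns`.
/// A non-positive elapsed earns nothing; the result is capped at `max_burst`.
fn earnedBytes(bandwidth_shifted: u64, elapsed_ns: i64, max_burst: u64) u64 {
    if (elapsed_ns <= 0) return 0;
    const bytes = (bandwidth_shifted *| @as(u64, @intCast(elapsed_ns))) >> SHIFT;
    return @min(bytes, max_burst);
}

test "what a wall-clock jump does to the budget" {
    const bw: u64 = 1 << SHIFT; // 1 byte per nanosecond
    const max_burst: u64 = 14_720; // illustrative value

    // Normal case: 1000 ns of real progress earns 1000 bytes.
    try std.testing.expectEqual(@as(u64, 1_000), earnedBytes(bw, 1_000, max_burst));

    // Backward jump of 10 s: elapsed is negative, so nothing is earned
    // and the pacer stalls until the clock catches up.
    try std.testing.expectEqual(@as(u64, 0), earnedBytes(bw, -10 * std.time.ns_per_s, max_burst));

    // Forward jump of 10 s: a full burst is credited at once, although
    // no real time has passed.
    try std.testing.expectEqual(max_burst, earnedBytes(bw, 10 * std.time.ns_per_s, max_burst));
}
```

With a clock that cannot step, only the first case can occur, which is why the guard alone is not a substitute for the clock change.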

## Files

- `src/quic/clock.zig` — defines `monoNanos()` (Linux/macOS via `clock_gettime`, Windows fallback to `nanoTimestamp()`).
- `src/quic/congestion.zig` — `Pacer` doc comment names the contract.
- `src/quic/connection.zig` — three pacer call sites in `send()` use `now_mono`; `nextTimeoutNs` handles the boundary conversion.
32 changes: 30 additions & 2 deletions SPEC/interop-results.md
@@ -1,9 +1,37 @@
# Interop Test Results

Date: 2026-03-24
Zig version: 0.15.2, quic-go interop image `martenseemann/quic-go-interop:latest`, webtransport-go interop image `martenseemann/webtransport-go-interop:latest`
Date: 2026-04-15 (supersedes 2026-03-24 baseline below)
Zig version: 0.15.2, quic-go interop image `martenseemann/quic-go-interop:latest`, neqo interop image `ghcr.io/mozilla/neqo-qns:latest`, webtransport-go interop image `martenseemann/webtransport-go-interop:latest`
Build: Docker interop image from `interop/runner/Dockerfile`, `zig build -Doptimize=ReleaseSafe`

## 2026-04-15: UDP send-path optimizations (`sendmmsg` + pacer hardening)

This work is inspired by Cloudflare's "Accelerating UDP packet transmission
for QUIC" post, narrowed to the techniques that fit a real-time WebTransport
workload (small datagrams, latency-sensitive). The larger throughput-oriented
optimizations (UDP GSO, `SO_TXTIME` kernel pacing) were prototyped, validated,
and then reverted; see "Cloudflare optimizations: what we kept and why" in
`SPEC/STATUS.md` before revisiting them.

### Send-path toggles
| Feature | Default | Env var | Notes |
|---------|---------|---------|-------|
| `sendmmsg` batching | on (Linux) | `QUIC_ZIG_NO_SENDMMSG=1` disables | one syscall per ECN-mark run |
| User-space pacer | on | `QUIC_ZIG_NO_PACING=1` disables | bisection escape hatch |
| Pacer clock | always `CLOCK_MONOTONIC` | n/a | NTP-skew resilience |
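
The toggle semantics can be sketched as a pure function. For `QUIC_ZIG_NO_PACING` the rule (unset, empty, or `"0"` keeps the feature on; any other value disables it) comes from `isPacingDisabled()` in `connection.zig`; that `QUIC_ZIG_NO_SENDMMSG` is parsed the same way is an assumption, and `killSwitchSet` is a hypothetical name.

```zig
const std = @import("std");

/// Hypothetical helper: true when a kill-switch variable asks to disable
/// the feature. `raw` is the variable's value, or null when it is unset.
fn killSwitchSet(raw: ?[]const u8) bool {
    const v = raw orelse return false;
    return !(v.len == 0 or std.mem.eql(u8, v, "0"));
}

test "kill-switch values" {
    try std.testing.expect(!killSwitchSet(null)); // unset: feature stays on
    try std.testing.expect(!killSwitchSet("")); // empty: feature stays on
    try std.testing.expect(!killSwitchSet("0")); // default exported by run_endpoint.sh
    try std.testing.expect(killSwitchSet("1")); // documented way to disable
    try std.testing.expect(killSwitchSet("true")); // any other value also disables
}
```

Note that only the exact string `"0"` is treated as off, so a value such as `"00"` or `"false"` would disable the feature under this rule.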

### Matrix (sequential run, `handshake,transfer,chacha20,multiplexing,longrtt,http3,keyupdate`)

| | quic-go (server/client) | neqo (server/client) |
|---------------------------|-------------------------|----------------------|
| quic-zig server ← peer client | **7/7 PASS** | **7/7 PASS** |
| quic-zig client → peer server | **7/7 PASS** | **6-7/7 PASS** |

Zero regressions against the 2026-03-24 baseline recorded below. The
zig-client → neqo-server flake on `keyupdate`/`chacha20` predates this work.

## 2026-03-24 baseline (pre-optimization)

## Functional Interop Matrix

### QUIC / HTTP/3 (`quic-go`)
4 changes: 4 additions & 0 deletions interop/runner/run_endpoint.sh
@@ -4,6 +4,10 @@ set -e
# Setup routing for the simulated network
source /setup.sh

# Optimization toggles — both on by default; set to 1 to disable for bisection.
export QUIC_ZIG_NO_SENDMMSG="${QUIC_ZIG_NO_SENDMMSG:-0}"
export QUIC_ZIG_NO_PACING="${QUIC_ZIG_NO_PACING:-0}"

# Determine if this is a WebTransport test case
is_wt_test() {
case "$TESTCASE" in
28 changes: 28 additions & 0 deletions src/quic/clock.zig
@@ -0,0 +1,28 @@
const std = @import("std");
const builtin = @import("builtin");

/// Read `CLOCK_MONOTONIC` in nanoseconds.
///
/// The Pacer uses this clock so its `last_sent_time` deltas are immune to
/// wall-clock jumps (NTP step corrections, manual clock changes). Loss
/// detection, PTO, and idle-timeout code paths continue to use
/// `std.time.nanoTimestamp()` (REALTIME); those only compare timestamps to
/// each other within short horizons where the gap matters but the absolute
/// drift does not.
pub fn monoNanos() i64 {
    // On Windows there is no POSIX CLOCK_MONOTONIC; fall back to the default
    // `nanoTimestamp()` so the pacer still works. The fallback is wall-clock
    // time, so it offers no protection against clock jumps.
    if (comptime builtin.os.tag == .windows) {
        return @intCast(std.time.nanoTimestamp());
    }
    // Same wall-clock fallback if the monotonic read fails.
    const ts = std.posix.clock_gettime(.MONOTONIC) catch {
        return @intCast(std.time.nanoTimestamp());
    };
    return @as(i64, ts.sec) * std.time.ns_per_s + @as(i64, ts.nsec);
}

test "monoNanos is non-decreasing" {
    const a = monoNanos();
    const b = monoNanos();
    try std.testing.expect(b >= a);
}
8 changes: 7 additions & 1 deletion src/quic/congestion.zig
@@ -421,14 +421,20 @@ fn icbrt(x: u64) u64 {
/// Pacer for spacing out packet sends to avoid bursts.
///
/// Uses a token bucket algorithm similar to quic-go's pacer.
///
/// All timestamp arguments (`now` in `onPacketSent`, `timeUntilSend`, and
/// `replenish`) MUST be on `CLOCK_MONOTONIC` — callers obtain them via
/// `clock.monoNanos()`. The monotonic clock makes budget replenishment
/// immune to wall-clock jumps (NTP slews, manual time changes). Mixing
/// clock sources across calls would silently corrupt budget math.
pub const Pacer = struct {
/// Available budget in bytes.
budget: u64,

/// Max burst size in bytes.
max_burst: u64,

/// Last time a packet was sent (nanoseconds).
/// Last time a packet was sent (CLOCK_MONOTONIC nanoseconds).
last_sent_time: i64 = 0,

/// Bandwidth in bytes per nanosecond, left-shifted by BANDWIDTH_SHIFT for precision.
51 changes: 40 additions & 11 deletions src/quic/connection.zig
@@ -24,6 +24,24 @@ const stateless_reset = @import("stateless_reset.zig");
const ecn = @import("ecn.zig");
const qlog = @import("qlog.zig");
const quic_lb = @import("quic_lb.zig");
const clock = @import("clock.zig");

/// Bisection kill switch for the user-space pacer.
/// When `QUIC_ZIG_NO_PACING=1` (or any non-empty non-"0" value) is set in the
/// environment, `conn.send()` and `nextTimeoutNs()` behave as if the pacer
/// never blocks. `Pacer.onPacketSent` and `setBandwidth` continue to run so
/// bisection can be toggled without polluting CC state.
var pacing_disabled_cache: ?bool = null;

fn isPacingDisabled() bool {
    if (pacing_disabled_cache) |v| return v;
    const v = blk: {
        const raw = std.posix.getenv("QUIC_ZIG_NO_PACING") orelse break :blk false;
        break :blk !(raw.len == 0 or std.mem.eql(u8, raw, "0"));
    };
    pacing_disabled_cache = v;
    return v;
}

pub const State = enum(u8) {
first_flight = 0,
@@ -2753,6 +2771,9 @@ pub const Connection = struct {
if (self.state == .draining or self.state == .terminated) return 0;

const now: i64 = @intCast(std.time.nanoTimestamp());
// Pacer runs on CLOCK_MONOTONIC for NTP-skew resilience; other
// subsystems stay on REALTIME (they only compare deltas).
const now_mono: i64 = clock.monoNanos();

// Closing: retransmit saved close packet on each incoming packet (RFC 9000 §10.2.1)
if (self.state == .closing) {
@@ -2818,11 +2839,13 @@ pub const Connection = struct {
return try self.sendAckOnly(out_buf, now);
}

// Check if pacer allows sending
// Exception: PTO probes bypass pacing (RFC 9002 §6.2.4)
// Note: ACK-only path above bypasses pacer per RFC 9002 §7.7
if (self.pto_probe_pending == 0) {
const pacer_delay = self.pacer.timeUntilSend(now);
// Pacer gate. Returning 0 here is how the event loop breaks out of
// its burst send loop; the next send time is then surfaced via
// `nextTimeoutNs()` so libxev wakes us when the pacer has budget again.
// Exceptions: PTO probes bypass pacing (RFC 9002 §6.2.4); the ACK-only
// path above bypasses it per RFC 9002 §7.7.
if (self.pto_probe_pending == 0 and !isPacingDisabled()) {
const pacer_delay = self.pacer.timeUntilSend(now_mono);
if (pacer_delay > 0) {
return 0;
}
@@ -2934,7 +2957,7 @@ pub const Connection = struct {
self.pto_probe_pending -|= 1;
self.paths[self.active_path_idx].bytes_sent += bytes_written;
self.total_packets_sent += 1;
self.pacer.onPacketSent(bytes_written, now);
self.pacer.onPacketSent(bytes_written, now_mono);
self.last_packet_sent_time = now;

// If more PTO probes are pending, re-queue stream data + crypto data
@@ -3770,10 +3793,16 @@ pub const Connection = struct {

// Pacer: if the pacer has bandwidth set (active transfer), include its
// next-send time so the event loop wakes up promptly to send more data.
if (self.pacer.bandwidth_shifted > 0 and self.state == .connected) {
const now: i64 = @intCast(std.time.nanoTimestamp());
// Estimate pacer delay without mutating: budget is replenished by elapsed time
const elapsed = now - self.pacer.last_sent_time;
// Skipped when pacing is disabled via the env kill switch.
//
// The pacer stores `last_sent_time` on CLOCK_MONOTONIC; the deadline we
// return must be comparable to the REALTIME-based deadlines collected
// above, so compute the *delay* on the monotonic clock and add it to
// the REALTIME `now`.
if (self.pacer.bandwidth_shifted > 0 and self.state == .connected and !isPacingDisabled()) {
const now_realtime: i64 = @intCast(std.time.nanoTimestamp());
const now_mono: i64 = clock.monoNanos();
const elapsed = now_mono - self.pacer.last_sent_time;
var budget = self.pacer.budget;
if (self.pacer.last_sent_time > 0 and elapsed > 0) {
const replenished = (self.pacer.bandwidth_shifted *| @as(u64, @intCast(elapsed))) >> 20;
@@ -3782,7 +3811,7 @@ pub const Connection = struct {
if (budget < self.pacer.max_datagram_size) {
const deficit = self.pacer.max_datagram_size - budget;
const delay: i64 = @intCast((deficit << 20) / self.pacer.bandwidth_shifted);
const pacer_deadline = now + delay;
const pacer_deadline = now_realtime + delay;
if (earliest == null or pacer_deadline < earliest.?) {
earliest = pacer_deadline;
}