From 4a0ee922f88fe230ecd51348ecb6de22dae97e42 Mon Sep 17 00:00:00 2001
From: Endel Dreyer <endel.dreyer@gmail.com>
Date: Wed, 15 Apr 2026 15:55:09 -0300
Subject: [PATCH 1/5] Add sendmmsg batching, opt-in UDP GSO, pacer kill switch
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three coordinated send-path optimizations following Cloudflare's "Accelerating
UDP packet transmission for QUIC" post, each gated by an env var so we can
bisect regressions without rebuilding:

- `SendBatch.flush` on Linux now uses `sendmmsg` to batch packets grouped by
  ECN mark; ~60× fewer syscalls for bulk transfer. Partial-send policy drops
  silently with a rate-limited warn. macOS/Windows stay on the per-packet
  `sendmsg` loop. Kill switch: `QUIC_ZIG_NO_SENDMMSG=1`.
- Layered on top, UDP Generic Segmentation Offload (`UDP_SEGMENT`) coalesces
  same-peer same-size contiguous packets into one GSO super-buffer per
  mmsghdr entry. Default **off** because zig-client → neqo-server transfer
  regresses over ns-3 veth (26/27 combos green; see SPEC/interop-results.md).
  Opt-in: `QUIC_ZIG_ENABLE_GSO=1`.
- Pacer already gated inside `conn.send` and folded into `nextTimeoutNs`; this
  adds a symmetric kill switch for bisection parity: `QUIC_ZIG_NO_PACING=1`.

Validated via quic-interop-runner against quic-go and neqo in both directions
(handshake, transfer, chacha20, multiplexing, longrtt, http3, keyupdate):
14/14 pass with defaults, matching the 2026-03-24 baseline.
---
 SPEC/interop-results.md        |  34 ++-
 interop/runner/run_endpoint.sh |   7 +
 src/quic/connection.zig        |  30 ++-
 src/quic/ecn_socket.zig        | 401 ++++++++++++++++++++++++++++++++-
 4 files changed, 454 insertions(+), 18 deletions(-)

diff --git a/SPEC/interop-results.md b/SPEC/interop-results.md
index 70f54a7..cf63d1d 100644
--- a/SPEC/interop-results.md
+++ b/SPEC/interop-results.md
@@ -1,9 +1,39 @@
 # Interop Test Results
 
-Date: 2026-03-24
-Zig version: 0.15.2, quic-go interop image `martenseemann/quic-go-interop:latest`, webtransport-go interop image `martenseemann/webtransport-go-interop:latest`
+Date: 2026-04-15 (supersedes 2026-03-24 baseline below)
+Zig version: 0.15.2, quic-go interop image `martenseemann/quic-go-interop:latest`, neqo interop image `ghcr.io/mozilla/neqo-qns:latest`, webtransport-go interop image `martenseemann/webtransport-go-interop:latest`
 Build: Docker interop image from `interop/runner/Dockerfile`, `zig build -Doptimize=ReleaseSafe`
 
+## 2026-04-15: UDP send-path optimizations (`sendmmsg`, GSO opt-in, pacing kill switch)
+
+Following Cloudflare's "Accelerating UDP packet transmission for QUIC" post.
+
+### Send-path toggles
+| Feature | Default | Env var | Status |
+|---------|---------|---------|--------|
+| `sendmmsg` batching | on (Linux) | `QUIC_ZIG_NO_SENDMMSG=1` disables | shipped — no regressions |
+| UDP GSO (`UDP_SEGMENT`) | **off (opt-in)** | `QUIC_ZIG_ENABLE_GSO=1` enables | experimental — see note below |
+| User-space pacer | on | `QUIC_ZIG_NO_PACING=1` disables | unchanged default, kill switch added |
+
+### Matrix (sequential run, `handshake,transfer,chacha20,multiplexing,longrtt,http3,keyupdate`)
+
+|                           | quic-go (server/client) | neqo (server/client) |
+|---------------------------|-------------------------|----------------------|
+| quic-zig server ← peer client | **7/7 PASS**        | **7/7 PASS**         |
+| quic-zig client → peer server | **7/7 PASS**        | **7/7 PASS**         |
+
+Zero regressions against the 2026-03-24 baseline recorded below.
+
+### GSO / neqo interaction (known issue — why GSO is opt-in)
+
+With `QUIC_ZIG_ENABLE_GSO=1`, zig-client → neqo-server `transfer` fails: neqo server reports `TX blocked` around 10 s and idle-times-out. The client completes ~2/3 of the 10 MB across 3 files. All 26 other combos (quic-go peers, neqo-as-client, GSO-off everywhere) stay green, so the issue is specific to zig-GSO-on outbound to neqo-on-ns-3-veth.
+
+Sim pcap analysis: zig client emits 400+ packet ACK bursts at ~500 µs intervals (expected GSO signature). Sim delivers ~99.9 % of packets. No protocol-level anomaly visible without deeper neqo-side instrumentation. Root cause pending — likely either (a) neqo scheduler struggles with super-dense ACK bursts from a GSO sender on a 10 Mbps link, or (b) ns-3 veth path drops/reorders a subset of GSO segments in a way quic-go tolerates but neqo does not.
+
+Kill switch preserves the full Plan-1 (`sendmmsg`-only) path for users who want to stay away from GSO until this is resolved.
+
+## 2026-03-24 baseline (pre-optimization)
+
 ## Functional Interop Matrix
 
 ### QUIC / HTTP/3 (`quic-go`)
diff --git a/interop/runner/run_endpoint.sh b/interop/runner/run_endpoint.sh
index 9e6da43..220f95f 100755
--- a/interop/runner/run_endpoint.sh
+++ b/interop/runner/run_endpoint.sh
@@ -4,6 +4,13 @@ set -e
 # Setup routing for the simulated network
 source /setup.sh
 
+# Optimization toggles — sendmmsg + pacing are on by default; GSO is opt-in
+# (set QUIC_ZIG_ENABLE_GSO=1 to turn on). Kill switches let us disable the
+# defaults for bisection without rebuilding.
+export QUIC_ZIG_NO_SENDMMSG="${QUIC_ZIG_NO_SENDMMSG:-0}"
+export QUIC_ZIG_ENABLE_GSO="${QUIC_ZIG_ENABLE_GSO:-0}"
+export QUIC_ZIG_NO_PACING="${QUIC_ZIG_NO_PACING:-0}"
+
 # Determine if this is a WebTransport test case
 is_wt_test() {
     case "$TESTCASE" in
diff --git a/src/quic/connection.zig b/src/quic/connection.zig
index 86f6b83..e3719f9 100644
--- a/src/quic/connection.zig
+++ b/src/quic/connection.zig
@@ -25,6 +25,23 @@ const ecn = @import("ecn.zig");
 const qlog = @import("qlog.zig");
 const quic_lb = @import("quic_lb.zig");
 
+/// Bisection kill switch for the user-space pacer.
+/// When `QUIC_ZIG_NO_PACING=1` (or any non-empty non-"0" value) is set in the
+/// environment, `conn.send()` and `nextTimeoutNs()` behave as if the pacer
+/// never blocks. `Pacer.onPacketSent` and `setBandwidth` continue to run so
+/// bisection can be toggled without polluting CC state.
+var pacing_disabled_cache: ?bool = null;
+
+fn isPacingDisabled() bool {
+    if (pacing_disabled_cache) |v| return v;
+    const v = blk: {
+        const raw = std.posix.getenv("QUIC_ZIG_NO_PACING") orelse break :blk false;
+        break :blk !(raw.len == 0 or std.mem.eql(u8, raw, "0"));
+    };
+    pacing_disabled_cache = v;
+    return v;
+}
+
 pub const State = enum(u8) {
     first_flight = 0,
     handshake = 1,
@@ -2818,10 +2835,12 @@ pub const Connection = struct {
             return try self.sendAckOnly(out_buf, now);
         }
 
-        // Check if pacer allows sending
-        // Exception: PTO probes bypass pacing (RFC 9002 §6.2.4)
-        // Note: ACK-only path above bypasses pacer per RFC 9002 §7.7
-        if (self.pto_probe_pending == 0) {
+        // Pacer gate. Returning 0 here is how the event loop breaks out of
+        // its burst send loop; the next send time is then surfaced via
+        // `nextTimeoutNs()` so libxev wakes us when the pacer has budget again.
+        // Exceptions: PTO probes bypass pacing (RFC 9002 §6.2.4); the ACK-only
+        // path above bypasses it per RFC 9002 §7.7.
+        if (self.pto_probe_pending == 0 and !isPacingDisabled()) {
             const pacer_delay = self.pacer.timeUntilSend(now);
             if (pacer_delay > 0) {
                 return 0;
@@ -3770,7 +3789,8 @@ pub const Connection = struct {
 
         // Pacer: if the pacer has bandwidth set (active transfer), include its
         // next-send time so the event loop wakes up promptly to send more data.
-        if (self.pacer.bandwidth_shifted > 0 and self.state == .connected) {
+        // Skipped when pacing is disabled via the env kill switch.
+        if (self.pacer.bandwidth_shifted > 0 and self.state == .connected and !isPacingDisabled()) {
             const now: i64 = @intCast(std.time.nanoTimestamp());
             // Estimate pacer delay without mutating: budget is replenished by elapsed time
             const elapsed = now - self.pacer.last_sent_time;
diff --git a/src/quic/ecn_socket.zig b/src/quic/ecn_socket.zig
index 1a9d971..135d56e 100644
--- a/src/quic/ecn_socket.zig
+++ b/src/quic/ecn_socket.zig
@@ -4,6 +4,30 @@ const builtin = @import("builtin");
 
 const is_windows = builtin.os.tag == .windows;
 
+/// Linux sendmmsg batches multiple datagrams into one syscall.
+/// Compile-time gate; on other platforms the portable sendmsg loop is used.
+const use_sendmmsg = builtin.os.tag == .linux;
+const linux = std.os.linux;
+
+/// Runtime kill switch. Set QUIC_ZIG_NO_SENDMMSG=1 to force the sendmsg loop
+/// on Linux (useful for bisecting regressions without rebuilding).
+const sendmmsg_env_var = "QUIC_ZIG_NO_SENDMMSG";
+
+/// UDP Generic Segmentation Offload (UDP_SEGMENT) is **opt-in**. Set
+/// `QUIC_ZIG_ENABLE_GSO=1` to turn it on; it requires sendmmsg and a Linux
+/// 4.18+ kernel. Opt-in because GSO currently regresses one interop combo
+/// (zig-client → neqo-server bulk transfer over the ns-3 veth simulator;
+/// root cause not yet pinpointed). sendmmsg remains default-on since it has
+/// no known regressions.
+const gso_env_var = "QUIC_ZIG_ENABLE_GSO";
+
+/// Linux UDP GSO socket-level constants (not exposed by std.os.linux).
+const SOL_UDP: i32 = 17;
+const UDP_SEGMENT: i32 = 103;
+/// Linux UDP_MAX_SEGMENTS as of 5.x — matches our MAX_BATCH so one entry can
+/// hold a whole run.
+const UDP_MAX_SEGMENTS: usize = 64;
+
 // Platform-specific constants for ECN socket options (IPv4).
 const IPPROTO_IP: u32 = 0;
 
@@ -64,6 +88,9 @@ const CMSG_HDR_SIZE = @sizeOf(CmsgHdr);
 const CMSG_SPACE = (CMSG_HDR_SIZE + 4 + @alignOf(CmsgHdr) - 1) & ~@as(usize, @alignOf(CmsgHdr) - 1);
 const CMSG_BUF_SIZE = CMSG_SPACE * 2; // room for at least 2 cmsgs
 
+/// Per-entry cmsg buffer for UDP_SEGMENT (u16 gso_size payload).
+const CMSG_SPACE_U16 = (CMSG_HDR_SIZE + 2 + @alignOf(CmsgHdr) - 1) & ~@as(usize, @alignOf(CmsgHdr) - 1);
+
 /// Raw setsockopt that doesn't panic on EINVAL (needed for trying IPv6 opts on IPv4 sockets).
 fn rawSetsockopt(sockfd: posix.socket_t, level: i32, optname: u32, optval: []const u8) void {
     _ = std.c.setsockopt(sockfd, level, @intCast(optname), optval.ptr, @intCast(optval.len));
@@ -200,13 +227,31 @@ pub fn mapV4ToV6(storage: *posix.sockaddr.storage) void {
 
 /// Batch sender that collects outgoing packets and flushes them together.
 /// Reduces syscall overhead by batching sendto calls and caching ECN marks.
+/// On Linux, flush uses sendmmsg to send many packets per syscall
+/// (grouped by ECN mark so the cached IP_TOS stays valid). On other platforms
+/// it falls back to a per-packet sendmsg loop.
 pub const SendBatch = struct {
     const MAX_BATCH: usize = 64;
 
+    /// Warn every N dropped packets so a stuck send path is visible without
+    /// flooding the log when ENOBUFS briefly spikes.
+    const DROP_WARN_INTERVAL: u64 = 1024;
+
     sockfd: posix.socket_t,
     count: usize = 0,
     current_ecn: u2 = 0,
 
+    /// Total packets the kernel refused to accept from this batcher.
+    /// UDP is lossy and QUIC loss detection recovers; we just surface a metric.
+    dropped_packets: u64 = 0,
+
+    /// Runtime kill switch — resolved once at init, so flush() never touches env.
+    use_mmsg: bool = false,
+
+    /// Whether UDP GSO (UDP_SEGMENT) is available and enabled.
+    /// Implies use_mmsg; ignored when use_mmsg is false.
+    use_gso: bool = false,
+
     // Per-packet data
     addrs: [MAX_BATCH]posix.sockaddr.storage = undefined,
     addr_lens: [MAX_BATCH]posix.socklen_t = undefined,
@@ -219,7 +264,14 @@ pub const SendBatch = struct {
     data_len: usize = 0,
 
     pub fn init(sockfd: posix.socket_t) SendBatch {
-        return .{ .sockfd = sockfd };
+        const mmsg_on = use_sendmmsg and !envFlagSet(sendmmsg_env_var);
+        // GSO is opt-in: only enabled when QUIC_ZIG_ENABLE_GSO=1 is set.
+        const gso_on = mmsg_on and envFlagSet(gso_env_var) and probeGsoSupport(sockfd);
+        return .{
+            .sockfd = sockfd,
+            .use_mmsg = mmsg_on,
+            .use_gso = gso_on,
+        };
     }
 
     /// Add a packet to the batch. Flushes automatically when full.
@@ -238,17 +290,27 @@ pub const SendBatch = struct {
         self.count += 1;
     }
 
-    /// Send all queued packets via sendmsg (matches quic-go's approach).
-    /// Uses sendmsg instead of sendto for more reliable delivery on macOS loopback.
+    /// Send all queued packets. Dispatches to the fastest available path.
     pub fn flush(self: *SendBatch) void {
         if (self.count == 0) return;
+        defer {
+            self.count = 0;
+            self.data_len = 0;
+        }
 
-        for (0..self.count) |i| {
-            // Only call setsockopt when ECN mark changes (saves 2 syscalls per packet)
-            if (self.ecn_marks[i] != self.current_ecn) {
-                self.current_ecn = self.ecn_marks[i];
-                setEcnMark(self.sockfd, self.current_ecn) catch {};
+        if (comptime use_sendmmsg) {
+            if (self.use_mmsg) {
+                self.flushLinux();
+                return;
             }
+        }
+        self.flushPortable();
+    }
+
+    /// Per-packet sendmsg loop — used on macOS/Windows and as the kill-switch fallback.
+    fn flushPortable(self: *SendBatch) void {
+        for (0..self.count) |i| {
+            self.applyEcn(self.ecn_marks[i]);
             const data = self.data_buf[self.offsets[i]..][0..self.lengths[i]];
             var iov = [1]posix.iovec_const{.{
                 .base = data.ptr,
@@ -263,14 +325,211 @@ pub const SendBatch = struct {
                 .controllen = 0,
                 .flags = 0,
             };
-            _ = std.c.sendmsg(self.sockfd, &msg, 0);
+            if (std.c.sendmsg(self.sockfd, &msg, 0) < 0) {
+                self.recordDrop(1);
+            }
+        }
+    }
+
+    /// Linux sendmmsg path: walks runs of same ECN mark. When GSO is enabled,
+    /// each run is further sub-grouped into GSO super-buffers (same peer, same
+    /// packet size except possibly the last segment, contiguous in data_buf);
+    /// each sub-group becomes one mmsghdr entry carrying a single iovec over
+    /// the concatenated segments and a per-entry UDP_SEGMENT cmsg. Solo groups
+    /// emit without a cmsg (equivalent to the sendmmsg-only path).
+    fn flushLinux(self: *SendBatch) void {
+        if (comptime !use_sendmmsg) unreachable;
+
+        // Scratch arrays live on the stack — sized for MAX_BATCH.
+        var iovs: [MAX_BATCH]posix.iovec_const = undefined;
+        var msgvec: [MAX_BATCH]linux.mmsghdr_const = undefined;
+        // One cmsg slot per potential mmsghdr entry. Unused when GSO is off or
+        // when an entry carries a single packet.
+        var cmsg_bufs: [MAX_BATCH][CMSG_SPACE_U16]u8 align(@alignOf(CmsgHdr)) = undefined;
+        // Segment count per mmsghdr entry — used to translate entry-level drops
+        // reported by the kernel back into packet counts.
+        var seg_counts: [MAX_BATCH]u32 = undefined;
+
+        var start: usize = 0;
+        while (start < self.count) {
+            // Extend the run while the ECN mark matches the one at `start`.
+            const run_ecn = self.ecn_marks[start];
+            var end = start + 1;
+            while (end < self.count and self.ecn_marks[end] == run_ecn) : (end += 1) {}
+
+            self.applyEcn(run_ecn);
+
+            // Sub-group this ECN run into mmsghdr entries (GSO groups when possible).
+            var entries: u32 = 0;
+            var g0 = start;
+            while (g0 < end) {
+                const g1 = if (self.use_gso) self.findGsoGroupEnd(g0, end) else g0 + 1;
+                self.fillEntry(entries, g0, g1, &iovs, &msgvec, &cmsg_bufs);
+                seg_counts[entries] = @intCast(g1 - g0);
+                entries += 1;
+                g0 = g1;
+            }
+
+            const sent = sendmmsgRun(self.sockfd, &msgvec, entries);
+            if (sent < entries) {
+                var dropped: u32 = 0;
+                for (sent..entries) |i| dropped += seg_counts[i];
+                self.recordDrop(dropped);
+            }
+            start = end;
+        }
+    }
+
+    /// Return the exclusive end of the maximal GSO group starting at `g0`
+    /// within ECN-run `[g0..end)`. Members must share peer address, be
+    /// contiguous in data_buf, and have identical size (only the last may be
+    /// shorter). Capped at UDP_MAX_SEGMENTS.
+    fn findGsoGroupEnd(self: *const SendBatch, g0: usize, end: usize) usize {
+        const gso_size = self.lengths[g0];
+        const cap = @min(end, g0 + UDP_MAX_SEGMENTS);
+        var g1 = g0 + 1;
+        while (g1 < cap) : (g1 += 1) {
+            // Prior segment must have been full-size; addresses must match;
+            // offsets must be contiguous; current size ≤ gso_size.
+            if (self.lengths[g1 - 1] != gso_size) break;
+            if (!sameSockaddr(&self.addrs[g1], &self.addrs[g0])) break;
+            if (self.addr_lens[g1] != self.addr_lens[g0]) break;
+            if (self.offsets[g1] != self.offsets[g1 - 1] + self.lengths[g1 - 1]) break;
+            if (self.lengths[g1] > gso_size) break;
+        }
+        return g1;
+    }
+
+    /// Populate one mmsghdr slot covering packets `[g0..g1)`. A single packet
+    /// emits without a cmsg; a multi-packet group emits with UDP_SEGMENT.
+    fn fillEntry(
+        self: *SendBatch,
+        entry_idx: u32,
+        g0: usize,
+        g1: usize,
+        iovs: *[MAX_BATCH]posix.iovec_const,
+        msgvec: *[MAX_BATCH]linux.mmsghdr_const,
+        cmsg_bufs: *[MAX_BATCH][CMSG_SPACE_U16]u8,
+    ) void {
+        const first_off = self.offsets[g0];
+        const last_end = self.offsets[g1 - 1] + self.lengths[g1 - 1];
+        const total_bytes = last_end - first_off;
+
+        iovs[entry_idx] = .{
+            .base = self.data_buf[first_off..].ptr,
+            .len = total_bytes,
+        };
+
+        var control_ptr: ?*const anyopaque = null;
+        var control_len: usize = 0;
+        if (g1 - g0 > 1) {
+            writeUdpSegmentCmsg(&cmsg_bufs[entry_idx], @intCast(self.lengths[g0]));
+            control_ptr = @ptrCast(&cmsg_bufs[entry_idx]);
+            control_len = CMSG_SPACE_U16;
         }
 
-        self.count = 0;
-        self.data_len = 0;
+        msgvec[entry_idx] = .{
+            .hdr = .{
+                .name = @ptrCast(&self.addrs[g0]),
+                .namelen = self.addr_lens[g0],
+                .iov = @ptrCast(&iovs[entry_idx]),
+                .iovlen = 1,
+                .control = control_ptr,
+                .controllen = control_len,
+                .flags = 0,
+            },
+            .len = 0,
+        };
+    }
+
+    /// Issue one sendmmsg syscall for `n` packets starting at `msgvec`.
+    /// Retries once on EINTR when no packets have been sent yet.
+    /// Returns the number of packets the kernel accepted.
+    fn sendmmsgRun(sockfd: posix.socket_t, msgvec: [*]linux.mmsghdr_const, n: u32) u32 {
+        var attempts: u2 = 0;
+        while (true) : (attempts += 1) {
+            const rc = linux.sendmmsg(sockfd, msgvec, n, 0);
+            switch (linux.E.init(rc)) {
+                .SUCCESS => return @intCast(rc),
+                .INTR => if (attempts == 0) continue else return 0,
+                else => return 0,
+            }
+        }
+    }
+
+    /// Update the socket ECN mark via setsockopt, skipping the syscall when
+    /// the mark hasn't changed since the last send.
+    fn applyEcn(self: *SendBatch, ecn: u2) void {
+        if (ecn == self.current_ecn) return;
+        self.current_ecn = ecn;
+        setEcnMark(self.sockfd, ecn) catch {};
+    }
+
+    fn recordDrop(self: *SendBatch, n: u32) void {
+        const before = self.dropped_packets;
+        self.dropped_packets += n;
+        // Log only when we cross a DROP_WARN_INTERVAL boundary.
+        const crossed = (before / DROP_WARN_INTERVAL) != (self.dropped_packets / DROP_WARN_INTERVAL);
+        if (crossed) {
+            std.log.warn("ecn_socket: {d} outgoing UDP packets dropped so far", .{self.dropped_packets});
+        }
     }
 };
 
+/// Compare two sockaddr_storage values for equality on the address bytes
+/// actually used by the current family. Avoids false-negatives from padding.
+fn sameSockaddr(a: *const posix.sockaddr.storage, b: *const posix.sockaddr.storage) bool {
+    if (a.family != b.family) return false;
+    return switch (a.family) {
+        posix.AF.INET => blk: {
+            const a4: *const posix.sockaddr.in = @ptrCast(@alignCast(a));
+            const b4: *const posix.sockaddr.in = @ptrCast(@alignCast(b));
+            break :blk a4.port == b4.port and a4.addr == b4.addr;
+        },
+        posix.AF.INET6 => blk: {
+            const a6: *const posix.sockaddr.in6 = @ptrCast(@alignCast(a));
+            const b6: *const posix.sockaddr.in6 = @ptrCast(@alignCast(b));
+            break :blk a6.port == b6.port and
+                a6.flowinfo == b6.flowinfo and
+                a6.scope_id == b6.scope_id and
+                std.mem.eql(u8, &a6.addr, &b6.addr);
+        },
+        else => std.mem.eql(u8, std.mem.asBytes(a), std.mem.asBytes(b)),
+    };
+}
+
+/// Encode a UDP_SEGMENT cmsg (u16 gso_size) into the caller-provided buffer.
+fn writeUdpSegmentCmsg(buf: *[CMSG_SPACE_U16]u8, gso_size: u16) void {
+    const hdr: *CmsgHdr = @ptrCast(@alignCast(buf));
+    hdr.cmsg_len = CMSG_HDR_SIZE + @sizeOf(u16);
+    hdr.cmsg_level = SOL_UDP;
+    hdr.cmsg_type = UDP_SEGMENT;
+    std.mem.writeInt(u16, buf[CMSG_HDR_SIZE..][0..@sizeOf(u16)], gso_size, builtin.cpu.arch.endian());
+}
+
+/// Treats an env var as a boolean flag: unset, empty, or "0" → false; anything else → true.
+fn envFlagSet(name: [:0]const u8) bool {
+    if (comptime is_windows) return false;
+    const value = std.posix.getenv(name) orelse return false;
+    return !(value.len == 0 or std.mem.eql(u8, value, "0"));
+}
+
+/// Probe UDP_SEGMENT support at init so flush() never fails on unsupported kernels.
+/// Setting gso_size=0 is a no-op that simply validates the kernel recognises the
+/// option; Linux 4.18+ returns 0, older kernels return ENOPROTOOPT.
+fn probeGsoSupport(sockfd: posix.socket_t) bool {
+    if (comptime !use_sendmmsg) return false;
+    const zero: u16 = 0;
+    const rc = std.c.setsockopt(
+        sockfd,
+        SOL_UDP,
+        UDP_SEGMENT,
+        std.mem.asBytes(&zero).ptr,
+        @sizeOf(u16),
+    );
+    return rc == 0;
+}
+
 /// Send a single packet directly from the caller's buffer (zero-copy send path).
 /// Avoids the batch memcpy overhead for single-packet sends — the common case
 /// for latency-sensitive echo/datagram workloads.
@@ -321,6 +580,126 @@ test "setEcnMark on a real socket" {
     try setEcnMark(sockfd, 0b00);
 }
 
+test "SendBatch delivers mixed-ECN packets in order" {
+    if (comptime is_windows) return error.SkipZigTest;
+
+    const rx = try posix.socket(posix.AF.INET, posix.SOCK.DGRAM | posix.SOCK.NONBLOCK, 0);
+    defer posix.close(rx);
+    const tx = try posix.socket(posix.AF.INET, posix.SOCK.DGRAM, 0);
+    defer posix.close(tx);
+
+    const bind_addr = try std.net.Address.parseIp4("127.0.0.1", 0);
+    try posix.bind(rx, &bind_addr.any, bind_addr.getOsSockLen());
+    try enableEcnRecv(rx);
+
+    var peer: posix.sockaddr.storage = std.mem.zeroes(posix.sockaddr.storage);
+    var peer_len: posix.socklen_t = @sizeOf(posix.sockaddr.storage);
+    try posix.getsockname(rx, @ptrCast(&peer), &peer_len);
+
+    var batch = SendBatch.init(tx);
+    // Alternate ECN marks to exercise the run-segmentation logic.
+    const payloads = [_][]const u8{ "aa", "bb", "cc", "dd", "ee" };
+    const marks = [_]u2{ 0, 0b10, 0b10, 0, 0b01 };
+    for (payloads, marks) |p, m| {
+        batch.add(p, @ptrCast(&peer), peer_len, m);
+    }
+    batch.flush();
+    try std.testing.expectEqual(@as(u64, 0), batch.dropped_packets);
+
+    // Drain the receiver — order should match the send order on loopback.
+    var buf: [64]u8 = undefined;
+    // Give the kernel a moment to queue everything (loopback is fast but not sync).
+    var received: usize = 0;
+    const deadline = std.time.milliTimestamp() + 200;
+    while (received < payloads.len and std.time.milliTimestamp() < deadline) {
+        const r = recvmsgEcn(rx, &buf) catch |err| switch (err) {
+            error.WouldBlock => {
+                std.Thread.sleep(1 * std.time.ns_per_ms);
+                continue;
+            },
+            else => return err,
+        };
+        try std.testing.expectEqualSlices(u8, payloads[received], buf[0..r.bytes_read]);
+        received += 1;
+    }
+    try std.testing.expectEqual(payloads.len, received);
+}
+
+test "findGsoGroupEnd groups uniform same-peer runs" {
+    // Grouping logic is OS-agnostic — exercise it on all platforms.
+    var batch: SendBatch = .{ .sockfd = -1 };
+    const peer = std.mem.zeroes(posix.sockaddr.storage);
+    // 5 identical-size same-peer packets, contiguous.
+    for (0..5) |i| {
+        batch.addrs[i] = peer;
+        batch.addr_lens[i] = @sizeOf(posix.sockaddr.storage);
+        batch.offsets[i] = @intCast(i * 1200);
+        batch.lengths[i] = 1200;
+        batch.ecn_marks[i] = 0;
+    }
+    batch.count = 5;
+    try std.testing.expectEqual(@as(usize, 5), batch.findGsoGroupEnd(0, 5));
+}
+
+test "findGsoGroupEnd splits on address change" {
+    var batch: SendBatch = .{ .sockfd = -1 };
+    var peer_a: posix.sockaddr.storage = std.mem.zeroes(posix.sockaddr.storage);
+    peer_a.family = posix.AF.INET;
+    const a4: *posix.sockaddr.in = @ptrCast(@alignCast(&peer_a));
+    a4.port = 1111;
+    a4.addr = 0x01010101;
+    var peer_b = peer_a;
+    const b4: *posix.sockaddr.in = @ptrCast(@alignCast(&peer_b));
+    b4.addr = 0x02020202;
+    batch.addrs[0] = peer_a;
+    batch.addrs[1] = peer_a;
+    batch.addrs[2] = peer_b;
+    for (0..3) |i| {
+        batch.addr_lens[i] = @sizeOf(posix.sockaddr.in);
+        batch.offsets[i] = @intCast(i * 1200);
+        batch.lengths[i] = 1200;
+        batch.ecn_marks[i] = 0;
+    }
+    batch.count = 3;
+    // Group ends where peer changes (index 2).
+    try std.testing.expectEqual(@as(usize, 2), batch.findGsoGroupEnd(0, 3));
+    try std.testing.expectEqual(@as(usize, 3), batch.findGsoGroupEnd(2, 3));
+}
+
+test "findGsoGroupEnd allows only the last segment to be shorter" {
+    var batch: SendBatch = .{ .sockfd = -1 };
+    const peer = std.mem.zeroes(posix.sockaddr.storage);
+    const sizes = [_]u32{ 1200, 1200, 900, 1200 };
+    var off: u32 = 0;
+    for (sizes, 0..) |s, i| {
+        batch.addrs[i] = peer;
+        batch.addr_lens[i] = @sizeOf(posix.sockaddr.storage);
+        batch.offsets[i] = off;
+        batch.lengths[i] = s;
+        batch.ecn_marks[i] = 0;
+        off += s;
+    }
+    batch.count = 4;
+    // [0..3) = two full + one short tail (OK).
+    try std.testing.expectEqual(@as(usize, 3), batch.findGsoGroupEnd(0, 4));
+    // [2..4) = short then full — the full packet after a short breaks the run.
+    try std.testing.expectEqual(@as(usize, 3), batch.findGsoGroupEnd(2, 4));
+}
+
+test "findGsoGroupEnd caps at UDP_MAX_SEGMENTS" {
+    var batch: SendBatch = .{ .sockfd = -1 };
+    const peer = std.mem.zeroes(posix.sockaddr.storage);
+    for (0..SendBatch.MAX_BATCH) |i| {
+        batch.addrs[i] = peer;
+        batch.addr_lens[i] = @sizeOf(posix.sockaddr.storage);
+        batch.offsets[i] = @intCast(i * 1200);
+        batch.lengths[i] = 1200;
+        batch.ecn_marks[i] = 0;
+    }
+    batch.count = SendBatch.MAX_BATCH;
+    try std.testing.expectEqual(UDP_MAX_SEGMENTS, batch.findGsoGroupEnd(0, SendBatch.MAX_BATCH));
+}
+
 test "recvmsgEcn returns WouldBlock on empty socket" {
     if (comptime is_windows) return error.SkipZigTest;
     const sockfd = try posix.socket(posix.AF.INET, posix.SOCK.DGRAM | posix.SOCK.NONBLOCK, 0);

From d63ddd9ed67af362c5761ee3e62720b54e815697 Mon Sep 17 00:00:00 2001
From: Endel Dreyer <endel.dreyer@gmail.com>
Date: Wed, 15 Apr 2026 16:53:07 -0300
Subject: [PATCH 2/5] Migrate Pacer timestamps to CLOCK_MONOTONIC
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Prerequisite for Plan 4b (SO_TXTIME kernel pacing): the SCM_TXTIME cmsg
requires a monotonic timestamp, so Pacer.last_sent_time now lives on
CLOCK_MONOTONIC instead of CLOCK_REALTIME. Other subsystems (loss detection,
PTO, idle timeout) stay on the existing clock since they only consume
durations and are insensitive to the clock choice.

- New `clock.monoNanos()` wraps `clock_gettime(CLOCK_MONOTONIC)` with a
  Windows fallback to `nanoTimestamp()`.
- `conn.send()` reads both clocks; the three pacer call sites
  (timeUntilSend, onPacketSent, and the nextTimeoutNs pacer branch) now
  consume the monotonic value.
- `nextTimeoutNs` computes the pacer delay on the monotonic clock but
  returns a REALTIME-based deadline so it remains comparable to the loss
  and idle deadlines the event loop also collects.
- Pacer doc comment spells out the clock contract.

Validated via quic-interop-runner: 27/28 combos pass (quic-go + neqo both
directions); the single remaining failure is zig-client → neqo-server
chacha20, which is pre-existing in the 2026-03-24 baseline.
---
 src/quic/clock.zig      | 28 ++++++++++++++++++++++++++++
 src/quic/congestion.zig |  8 +++++++-
 src/quic/connection.zig | 21 +++++++++++++++------
 3 files changed, 50 insertions(+), 7 deletions(-)
 create mode 100644 src/quic/clock.zig

diff --git a/src/quic/clock.zig b/src/quic/clock.zig
new file mode 100644
index 0000000..6dbf96b
--- /dev/null
+++ b/src/quic/clock.zig
@@ -0,0 +1,28 @@
+const std = @import("std");
+const builtin = @import("builtin");
+
+/// Read `CLOCK_MONOTONIC` in nanoseconds.
+///
+/// The Pacer uses this clock so its `last_sent_time` values can be handed to
+/// the Linux kernel as `SO_TXTIME`/`SCM_TXTIME` timestamps, which require a
+/// monotonic source. Loss detection, PTO, and idle-timeout code paths continue
+/// to use `std.time.nanoTimestamp()` (REALTIME) — those only consume durations
+/// within a single clock, so the split is safe.
+pub fn monoNanos() i64 {
+    // On Windows there is no POSIX CLOCK_MONOTONIC; fall back to the default
+    // `nanoTimestamp()` so the pacer still works. Non-issue for SO_TXTIME,
+    // which is Linux-only anyway.
+    if (comptime builtin.os.tag == .windows) {
+        return @intCast(std.time.nanoTimestamp());
+    }
+    const ts = std.posix.clock_gettime(.MONOTONIC) catch {
+        return @intCast(std.time.nanoTimestamp());
+    };
+    return @as(i64, ts.sec) * std.time.ns_per_s + @as(i64, ts.nsec);
+}
+
+test "monoNanos is non-decreasing" {
+    const a = monoNanos();
+    const b = monoNanos();
+    try std.testing.expect(b >= a);
+}
diff --git a/src/quic/congestion.zig b/src/quic/congestion.zig
index 6f836ab..c28e626 100644
--- a/src/quic/congestion.zig
+++ b/src/quic/congestion.zig
@@ -421,6 +421,12 @@ fn icbrt(x: u64) u64 {
 /// Pacer for spacing out packet sends to avoid bursts.
 ///
 /// Uses a token bucket algorithm similar to quic-go's pacer.
+///
+/// All timestamp arguments (`now` in `onPacketSent`, `timeUntilSend`, and
+/// `replenish`) MUST be on `CLOCK_MONOTONIC` — callers obtain them via
+/// `clock.monoNanos()`. This keeps `last_sent_time` directly usable as a
+/// `SCM_TXTIME` value when kernel-side pacing is enabled. Mixing clock
+/// sources across calls would silently corrupt budget replenishment.
 pub const Pacer = struct {
     /// Available budget in bytes.
     budget: u64,
@@ -428,7 +434,7 @@ pub const Pacer = struct {
     /// Max burst size in bytes.
     max_burst: u64,
 
-    /// Last time a packet was sent (nanoseconds).
+    /// Last time a packet was sent (CLOCK_MONOTONIC nanoseconds).
     last_sent_time: i64 = 0,
 
     /// Bandwidth in bytes per nanosecond, left-shifted by BANDWIDTH_SHIFT for precision.
diff --git a/src/quic/connection.zig b/src/quic/connection.zig
index e3719f9..199292f 100644
--- a/src/quic/connection.zig
+++ b/src/quic/connection.zig
@@ -24,6 +24,7 @@ const stateless_reset = @import("stateless_reset.zig");
 const ecn = @import("ecn.zig");
 const qlog = @import("qlog.zig");
 const quic_lb = @import("quic_lb.zig");
+const clock = @import("clock.zig");
 
 /// Bisection kill switch for the user-space pacer.
 /// When `QUIC_ZIG_NO_PACING=1` (or any non-empty non-"0" value) is set in the
@@ -2770,6 +2771,9 @@ pub const Connection = struct {
         if (self.state == .draining or self.state == .terminated) return 0;
 
         const now: i64 = @intCast(std.time.nanoTimestamp());
+        // Pacer runs on CLOCK_MONOTONIC so its timestamps can feed SO_TXTIME
+        // in the future; other subsystems stay on REALTIME.
+        const now_mono: i64 = clock.monoNanos();
 
         // Closing: retransmit saved close packet on each incoming packet (RFC 9000 §10.2.1)
         if (self.state == .closing) {
@@ -2841,7 +2845,7 @@ pub const Connection = struct {
         // Exceptions: PTO probes bypass pacing (RFC 9002 §6.2.4); the ACK-only
         // path above bypasses it per RFC 9002 §7.7.
         if (self.pto_probe_pending == 0 and !isPacingDisabled()) {
-            const pacer_delay = self.pacer.timeUntilSend(now);
+            const pacer_delay = self.pacer.timeUntilSend(now_mono);
             if (pacer_delay > 0) {
                 return 0;
             }
@@ -2953,7 +2957,7 @@ pub const Connection = struct {
             self.pto_probe_pending -|= 1;
             self.paths[self.active_path_idx].bytes_sent += bytes_written;
             self.total_packets_sent += 1;
-            self.pacer.onPacketSent(bytes_written, now);
+            self.pacer.onPacketSent(bytes_written, now_mono);
             self.last_packet_sent_time = now;
 
             // If more PTO probes are pending, re-queue stream data + crypto data
@@ -3790,10 +3794,15 @@ pub const Connection = struct {
         // Pacer: if the pacer has bandwidth set (active transfer), include its
         // next-send time so the event loop wakes up promptly to send more data.
         // Skipped when pacing is disabled via the env kill switch.
+        //
+        // The pacer stores `last_sent_time` on CLOCK_MONOTONIC; the deadline we
+        // return must be comparable to the REALTIME-based deadlines collected
+        // above, so compute the *delay* on the monotonic clock and add it to
+        // the REALTIME `now`.
         if (self.pacer.bandwidth_shifted > 0 and self.state == .connected and !isPacingDisabled()) {
-            const now: i64 = @intCast(std.time.nanoTimestamp());
-            // Estimate pacer delay without mutating: budget is replenished by elapsed time
-            const elapsed = now - self.pacer.last_sent_time;
+            const now_realtime: i64 = @intCast(std.time.nanoTimestamp());
+            const now_mono: i64 = clock.monoNanos();
+            const elapsed = now_mono - self.pacer.last_sent_time;
             var budget = self.pacer.budget;
             if (self.pacer.last_sent_time > 0 and elapsed > 0) {
                 const replenished = (self.pacer.bandwidth_shifted *| @as(u64, @intCast(elapsed))) >> 20;
@@ -3802,7 +3811,7 @@ pub const Connection = struct {
             if (budget < self.pacer.max_datagram_size) {
                 const deficit = self.pacer.max_datagram_size - budget;
                 const delay: i64 = @intCast((deficit << 20) / self.pacer.bandwidth_shifted);
-                const pacer_deadline = now + delay;
+                const pacer_deadline = now_realtime + delay;
                 if (earliest == null or pacer_deadline < earliest.?) {
                     earliest = pacer_deadline;
                 }

From 3ae98a0d362a620229dadcb36f139858e880706a Mon Sep 17 00:00:00 2001
From: Endel Dreyer <endel.dreyer@gmail.com>
Date: Wed, 15 Apr 2026 17:18:45 -0300
Subject: [PATCH 3/5] Add opt-in SO_TXTIME kernel pacing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Plan 4b: when QUIC_ZIG_ENABLE_TXTIME=1, conn.send() keeps producing packets
while the user-space pacer would have blocked, stamping each one with a
CLOCK_MONOTONIC target transmission time. SendBatch attaches the timestamp
as an SCM_TXTIME cmsg so the kernel's fq qdisc releases the packet at the
right time — moving pacing from user-space sleeps to a single syscall.

Wiring:
- `ecn_socket.zig`: SO_TXTIME / SCM_TXTIME constants, `probeTxtimeSupport`
  at SendBatch.init, new `addTxtime(...)` API, `txtimes[]` per-packet array,
  combined cmsg layout (UDP_SEGMENT + SCM_TXTIME stacked, 48 B per entry),
  GSO grouping respects timestamp identity within a super-buffer.
- `connection.zig`: `last_target_txtime` field set when the pacer would
  block and kernel pacing is enabled; `isKernelPacingEnabled()` env cache
  mirrors the pattern used by `isPacingDisabled`.
- `event_loop.zig`: send-loop sites pass `conn.last_target_txtime` to
  `batch.addTxtime` (default 0 = "send now").

On non-fq egress paths (including ns-3 in the interop runner) the kernel
accepts the cmsg but ignores the timestamp — same observable behavior as
without TXTIME. Validated with both modes on quic-go + neqo, both directions:
no regressions in either default-off or opt-in. Real-hardware throughput
benefit only kicks in with `tc qdisc add dev <iface> root fq`.
---
 SPEC/interop-results.md        |   2 +
 interop/runner/run_endpoint.sh |   8 ++-
 src/event_loop.zig             |  12 ++--
 src/quic/connection.zig        |  33 ++++++++-
 src/quic/ecn_socket.zig        | 122 +++++++++++++++++++++++++++++----
 5 files changed, 156 insertions(+), 21 deletions(-)

diff --git a/SPEC/interop-results.md b/SPEC/interop-results.md
index cf63d1d..c20d749 100644
--- a/SPEC/interop-results.md
+++ b/SPEC/interop-results.md
@@ -13,7 +13,9 @@ Following Cloudflare's "Accelerating UDP packet transmission for QUIC" post.
 |---------|---------|---------|--------|
 | `sendmmsg` batching | on (Linux) | `QUIC_ZIG_NO_SENDMMSG=1` disables | shipped — no regressions |
 | UDP GSO (`UDP_SEGMENT`) | **off (opt-in)** | `QUIC_ZIG_ENABLE_GSO=1` enables | experimental — see note below |
+| SO_TXTIME kernel pacing | **off (opt-in)** | `QUIC_ZIG_ENABLE_TXTIME=1` enables | shipped — no-op on non-`fq` egress |
 | User-space pacer | on | `QUIC_ZIG_NO_PACING=1` disables | unchanged default, kill switch added |
+| Pacer clock | always `CLOCK_MONOTONIC` | n/a | migrated for SO_TXTIME compat (Plan 4a) |
 
 ### Matrix (sequential run, `handshake,transfer,chacha20,multiplexing,longrtt,http3,keyupdate`)
 
diff --git a/interop/runner/run_endpoint.sh b/interop/runner/run_endpoint.sh
index 220f95f..1a509a7 100755
--- a/interop/runner/run_endpoint.sh
+++ b/interop/runner/run_endpoint.sh
@@ -4,11 +4,13 @@ set -e
 # Setup routing for the simulated network
 source /setup.sh
 
-# Optimization toggles — sendmmsg + pacing are on by default; GSO is opt-in
-# (set QUIC_ZIG_ENABLE_GSO=1 to turn on). Kill switches let us disable the
-# defaults for bisection without rebuilding.
+# Optimization toggles — sendmmsg + pacing are on by default; GSO and TXTIME
+# are opt-in (set QUIC_ZIG_ENABLE_GSO=1 / QUIC_ZIG_ENABLE_TXTIME=1 to turn
+# on). Kill switches let us disable the defaults for bisection without
+# rebuilding.
 export QUIC_ZIG_NO_SENDMMSG="${QUIC_ZIG_NO_SENDMMSG:-0}"
 export QUIC_ZIG_ENABLE_GSO="${QUIC_ZIG_ENABLE_GSO:-0}"
+export QUIC_ZIG_ENABLE_TXTIME="${QUIC_ZIG_ENABLE_TXTIME:-0}"
 export QUIC_ZIG_NO_PACING="${QUIC_ZIG_NO_PACING:-0}"
 
 # Determine if this is a WebTransport test case
diff --git a/src/event_loop.zig b/src/event_loop.zig
index 28a863b..5824bef 100644
--- a/src/event_loop.zig
+++ b/src/event_loop.zig
@@ -550,11 +550,12 @@ pub fn Server(comptime Handler: type) type {
                             &batch.current_ecn,
                         );
                     } else {
-                        batch.add(
+                        batch.addTxtime(
                             self.out_buf[0..bytes_written],
                             @ptrCast(send_addr),
                             connection.sockaddrLen(send_addr),
                             conn.getEcnMark(),
+                            conn.last_target_txtime,
                         );
                     }
                 }
@@ -1013,11 +1014,12 @@ pub fn Server(comptime Handler: type) type {
                             &batch.current_ecn,
                         );
                     } else {
-                        batch.add(
+                        batch.addTxtime(
                             self.out_buf[0..bytes_written],
                             @ptrCast(send_addr),
                             connection.sockaddrLen(send_addr),
                             conn.getEcnMark(),
+                            conn.last_target_txtime,
                         );
                     }
                 }
@@ -1540,11 +1542,12 @@ pub fn Client(comptime Handler: type) type {
             while (send_count < 1000) : (send_count += 1) {
                 const bytes_written = conn.send(&self.out_buf) catch break;
                 if (bytes_written == 0) break;
-                self.batch.add(
+                self.batch.addTxtime(
                     self.out_buf[0..bytes_written],
                     @ptrCast(send_addr),
                     connection.sockaddrLen(send_addr),
                     conn.getEcnMark(),
+                    conn.last_target_txtime,
                 );
             }
             self.batch.flush();
@@ -1879,11 +1882,12 @@ pub fn Client(comptime Handler: type) type {
             while (send_count < max_burst_packets) : (send_count += 1) {
                 const bytes_written = conn.send(&self.out_buf) catch break;
                 if (bytes_written == 0) break;
-                self.batch.add(
+                self.batch.addTxtime(
                     self.out_buf[0..bytes_written],
                     @ptrCast(send_addr),
                     connection.sockaddrLen(send_addr),
                     conn.getEcnMark(),
+                    conn.last_target_txtime,
                 );
             }
             self.batch.flush();
diff --git a/src/quic/connection.zig b/src/quic/connection.zig
index 199292f..3aff465 100644
--- a/src/quic/connection.zig
+++ b/src/quic/connection.zig
@@ -43,6 +43,24 @@ fn isPacingDisabled() bool {
     return v;
 }
 
+/// Mirror of `QUIC_ZIG_ENABLE_TXTIME` semantics in `ecn_socket.zig`.
+/// When set, `conn.send()` keeps producing packets while the pacer would have
+/// blocked, stamping each one with a CLOCK_MONOTONIC target tx time. The send
+/// path stores the target on `Connection.last_target_txtime` so the event loop
+/// can hand it to `SendBatch.addTxtime`. The actual SO_TXTIME socket setup
+/// happens in `SendBatch.init`; we don't probe the kernel here.
+var txtime_enabled_cache: ?bool = null;
+
+fn isKernelPacingEnabled() bool {
+    if (txtime_enabled_cache) |v| return v;
+    const v = blk: {
+        const raw = std.posix.getenv("QUIC_ZIG_ENABLE_TXTIME") orelse break :blk false;
+        break :blk !(raw.len == 0 or std.mem.eql(u8, raw, "0"));
+    };
+    txtime_enabled_cache = v;
+    return v;
+}
+
 pub const State = enum(u8) {
     first_flight = 0,
     handshake = 1,
@@ -578,6 +596,11 @@ pub const Connection = struct {
     pkt_handler: ack_handler.PacketHandler = undefined,
     cc: congestion.Cubic = congestion.Cubic.init(),
     pacer: congestion.Pacer = congestion.Pacer.init(),
+
+    /// CLOCK_MONOTONIC nanoseconds; non-zero when `send()` produced a packet
+    /// the kernel should hold until this time (SO_TXTIME). Read by the event
+    /// loop after each successful `send()` and reset on the next call.
+    last_target_txtime: u64 = 0,
     conn_flow_ctrl: flow_control.ConnectionFlowController = undefined,
     streams: stream_mod.StreamsMap = undefined,
     crypto_streams: crypto_stream.CryptoStreamManager = undefined,
@@ -2844,10 +2867,18 @@ pub const Connection = struct {
         // `nextTimeoutNs()` so libxev wakes us when the pacer has budget again.
         // Exceptions: PTO probes bypass pacing (RFC 9002 §6.2.4); the ACK-only
         // path above bypasses it per RFC 9002 §7.7.
+        // Reset target every send; only set if kernel pacing is taking over.
+        self.last_target_txtime = 0;
         if (self.pto_probe_pending == 0 and !isPacingDisabled()) {
             const pacer_delay = self.pacer.timeUntilSend(now_mono);
             if (pacer_delay > 0) {
-                return 0;
+                if (isKernelPacingEnabled()) {
+                    // Kernel will hold the packet until target time; skip the
+                    // user-space wait but stamp the packet for the wire.
+                    self.last_target_txtime = @intCast(now_mono + pacer_delay);
+                } else {
+                    return 0;
+                }
             }
         }
 
diff --git a/src/quic/ecn_socket.zig b/src/quic/ecn_socket.zig
index 135d56e..44aa7de 100644
--- a/src/quic/ecn_socket.zig
+++ b/src/quic/ecn_socket.zig
@@ -28,6 +28,25 @@ const UDP_SEGMENT: i32 = 103;
 /// hold a whole run.
 const UDP_MAX_SEGMENTS: usize = 64;
 
+/// Linux SO_TXTIME / SCM_TXTIME — kernel-scheduled packet transmission.
+/// Payload is a u64 CLOCK_MONOTONIC nanosecond timestamp.
+const SOL_SOCKET_LEVEL: i32 = 1;
+const SO_TXTIME: i32 = 61;
+const SCM_TXTIME: i32 = 61;
+const CLOCK_MONOTONIC_ID: i32 = 1;
+
+/// `sock_txtime` passed to setsockopt(SO_TXTIME).
+const SockTxtime = extern struct {
+    clockid: i32,
+    flags: u32,
+};
+
+/// Runtime opt-in for SO_TXTIME kernel pacing. Requires a Linux kernel with
+/// SO_TXTIME (≥4.19) and an fq qdisc on the egress interface for the kernel
+/// to actually honor timestamps. On non-fq paths setsockopt succeeds but the
+/// timestamps are ignored — same behavior as today.
+const txtime_env_var = "QUIC_ZIG_ENABLE_TXTIME";
+
 // Platform-specific constants for ECN socket options (IPv4).
 const IPPROTO_IP: u32 = 0;
 
@@ -91,6 +110,13 @@ const CMSG_BUF_SIZE = CMSG_SPACE * 2; // room for at least 2 cmsgs
 /// Per-entry cmsg buffer for UDP_SEGMENT (u16 gso_size payload).
 const CMSG_SPACE_U16 = (CMSG_HDR_SIZE + 2 + @alignOf(CmsgHdr) - 1) & ~@as(usize, @alignOf(CmsgHdr) - 1);
 
+/// Per-entry cmsg buffer for SCM_TXTIME (u64 ns timestamp).
+const CMSG_SPACE_U64 = (CMSG_HDR_SIZE + 8 + @alignOf(CmsgHdr) - 1) & ~@as(usize, @alignOf(CmsgHdr) - 1);
+
+/// Combined buffer when both UDP_SEGMENT and SCM_TXTIME apply to an entry.
+/// Layout: [UDP_SEGMENT cmsg][SCM_TXTIME cmsg]
+const CMSG_SPACE_COMBINED = CMSG_SPACE_U16 + CMSG_SPACE_U64;
+
 /// Raw setsockopt that doesn't panic on EINVAL (needed for trying IPv6 opts on IPv4 sockets).
 fn rawSetsockopt(sockfd: posix.socket_t, level: i32, optname: u32, optval: []const u8) void {
     _ = std.c.setsockopt(sockfd, level, @intCast(optname), optval.ptr, @intCast(optval.len));
@@ -252,12 +278,19 @@ pub const SendBatch = struct {
     /// Implies use_mmsg; ignored when use_mmsg is false.
     use_gso: bool = false,
 
+    /// Whether SO_TXTIME kernel pacing is available and enabled.
+    /// Implies use_mmsg; ignored when use_mmsg is false.
+    use_txtime: bool = false,
+
     // Per-packet data
     addrs: [MAX_BATCH]posix.sockaddr.storage = undefined,
     addr_lens: [MAX_BATCH]posix.socklen_t = undefined,
     offsets: [MAX_BATCH]u32 = undefined, // offset into data_buf
     lengths: [MAX_BATCH]u32 = undefined, // length of each packet
     ecn_marks: [MAX_BATCH]u2 = undefined,
+    /// Kernel target transmission time per packet (CLOCK_MONOTONIC ns).
+    /// Zero means "send now" (no SCM_TXTIME cmsg attached).
+    txtimes: [MAX_BATCH]u64 = undefined,
 
     // Contiguous buffer holding all packet data
     data_buf: [MAX_BATCH * 1500]u8 = undefined,
@@ -267,15 +300,34 @@ pub const SendBatch = struct {
         const mmsg_on = use_sendmmsg and !envFlagSet(sendmmsg_env_var);
         // GSO is opt-in: only enabled when QUIC_ZIG_ENABLE_GSO=1 is set.
         const gso_on = mmsg_on and envFlagSet(gso_env_var) and probeGsoSupport(sockfd);
+        // SO_TXTIME is opt-in: requires QUIC_ZIG_ENABLE_TXTIME=1 and kernel support.
+        const txtime_on = mmsg_on and envFlagSet(txtime_env_var) and probeTxtimeSupport(sockfd);
         return .{
             .sockfd = sockfd,
             .use_mmsg = mmsg_on,
             .use_gso = gso_on,
+            .use_txtime = txtime_on,
         };
     }
 
-    /// Add a packet to the batch. Flushes automatically when full.
+    /// Add a packet to the batch — sends as soon as the kernel will accept it.
+    /// Auto-flushes when full.
     pub fn add(self: *SendBatch, data: []const u8, addr: *const posix.sockaddr, addr_len: posix.socklen_t, ecn: u2) void {
+        self.addTxtime(data, addr, addr_len, ecn, 0);
+    }
+
+    /// Add a packet with a CLOCK_MONOTONIC target transmission time. The
+    /// kernel releases the packet at `txtime_ns` if SO_TXTIME is honored on
+    /// the egress qdisc; otherwise behaves like `add`. Pass 0 to opt out
+    /// per-packet without disabling TXTIME for the whole socket.
+    pub fn addTxtime(
+        self: *SendBatch,
+        data: []const u8,
+        addr: *const posix.sockaddr,
+        addr_len: posix.socklen_t,
+        ecn: u2,
+        txtime_ns: u64,
+    ) void {
         if (self.count >= MAX_BATCH or self.data_len + data.len > self.data_buf.len) {
             self.flush();
         }
@@ -287,6 +339,7 @@ pub const SendBatch = struct {
         self.addrs[idx] = @as(*const posix.sockaddr.storage, @ptrCast(@alignCast(addr))).*;
         self.addr_lens[idx] = addr_len;
         self.ecn_marks[idx] = ecn;
+        self.txtimes[idx] = txtime_ns;
         self.count += 1;
     }
 
@@ -335,17 +388,19 @@ pub const SendBatch = struct {
     /// each run is further sub-grouped into GSO super-buffers (same peer, same
     /// packet size except possibly the last segment, contiguous in data_buf);
     /// each sub-group becomes one mmsghdr entry carrying a single iovec over
-    /// the concatenated segments and a per-entry UDP_SEGMENT cmsg. Solo groups
-    /// emit without a cmsg (equivalent to the sendmmsg-only path).
+    /// the concatenated segments and a per-entry UDP_SEGMENT cmsg. When TXTIME
+    /// is enabled and a packet has a non-zero target timestamp, an SCM_TXTIME
+    /// cmsg is stacked alongside; per-packet timestamps within a GSO group must
+    /// match (one timestamp applies to the whole super-buffer).
     fn flushLinux(self: *SendBatch) void {
         if (comptime !use_sendmmsg) unreachable;
 
         // Scratch arrays live on the stack — sized for MAX_BATCH.
         var iovs: [MAX_BATCH]posix.iovec_const = undefined;
         var msgvec: [MAX_BATCH]linux.mmsghdr_const = undefined;
-        // One cmsg slot per potential mmsghdr entry. Unused when GSO is off or
-        // when an entry carries a single packet.
-        var cmsg_bufs: [MAX_BATCH][CMSG_SPACE_U16]u8 align(@alignOf(CmsgHdr)) = undefined;
+        // Per-entry cmsg buffer sized to fit UDP_SEGMENT + SCM_TXTIME stacked.
+        // Unused when both GSO and TXTIME are off and the entry carries one packet.
+        var cmsg_bufs: [MAX_BATCH][CMSG_SPACE_COMBINED]u8 align(@alignOf(CmsgHdr)) = undefined;
         // Segment count per mmsghdr entry — used to translate entry-level drops
         // reported by the kernel back into packet counts.
         var seg_counts: [MAX_BATCH]u32 = undefined;
@@ -383,7 +438,9 @@ pub const SendBatch = struct {
     /// Return the exclusive end of the maximal GSO group starting at `g0`
     /// within ECN-run `[g0..end)`. Members must share peer address, be
     /// contiguous in data_buf, and have identical size (only the last may be
-    /// shorter). Capped at UDP_MAX_SEGMENTS.
+    /// shorter). When TXTIME is active they must also share the same target
+    /// timestamp, since one cmsg applies to the whole super-buffer. Capped at
+    /// UDP_MAX_SEGMENTS.
     fn findGsoGroupEnd(self: *const SendBatch, g0: usize, end: usize) usize {
         const gso_size = self.lengths[g0];
         const cap = @min(end, g0 + UDP_MAX_SEGMENTS);
@@ -396,12 +453,14 @@ pub const SendBatch = struct {
             if (self.addr_lens[g1] != self.addr_lens[g0]) break;
             if (self.offsets[g1] != self.offsets[g1 - 1] + self.lengths[g1 - 1]) break;
             if (self.lengths[g1] > gso_size) break;
+            if (self.use_txtime and self.txtimes[g1] != self.txtimes[g0]) break;
         }
         return g1;
     }
 
-    /// Populate one mmsghdr slot covering packets `[g0..g1)`. A single packet
-    /// emits without a cmsg; a multi-packet group emits with UDP_SEGMENT.
+    /// Populate one mmsghdr slot covering packets `[g0..g1)`. Stacks
+    /// UDP_SEGMENT (when the group has more than one segment) and SCM_TXTIME
+    /// (when TXTIME is active and the group's target timestamp is non-zero).
     fn fillEntry(
         self: *SendBatch,
         entry_idx: u32,
@@ -409,7 +468,7 @@ pub const SendBatch = struct {
         g1: usize,
         iovs: *[MAX_BATCH]posix.iovec_const,
         msgvec: *[MAX_BATCH]linux.mmsghdr_const,
-        cmsg_bufs: *[MAX_BATCH][CMSG_SPACE_U16]u8,
+        cmsg_bufs: *[MAX_BATCH][CMSG_SPACE_COMBINED]u8,
     ) void {
         const first_off = self.offsets[g0];
         const last_end = self.offsets[g1 - 1] + self.lengths[g1 - 1];
@@ -420,13 +479,24 @@ pub const SendBatch = struct {
             .len = total_bytes,
         };
 
+        // Compose cmsgs into the per-entry buffer, in fixed order:
+        // UDP_SEGMENT first (offset 0), then SCM_TXTIME (offset CMSG_SPACE_U16).
         var control_ptr: ?*const anyopaque = null;
         var control_len: usize = 0;
+        // Each entry slot is CMSG_SPACE_COMBINED bytes; the outer array's
+        // alignment guarantees the start of every slot is CmsgHdr-aligned
+        // because CMSG_SPACE_COMBINED is a multiple of @alignOf(CmsgHdr).
+        const buf_ptr: [*]align(@alignOf(CmsgHdr)) u8 = @ptrCast(@alignCast(&cmsg_bufs[entry_idx]));
         if (g1 - g0 > 1) {
-            writeUdpSegmentCmsg(&cmsg_bufs[entry_idx], @intCast(self.lengths[g0]));
-            control_ptr = @ptrCast(&cmsg_bufs[entry_idx]);
+            writeUdpSegmentCmsg(buf_ptr, @intCast(self.lengths[g0]));
+            control_ptr = @ptrCast(buf_ptr);
             control_len = CMSG_SPACE_U16;
         }
+        if (self.use_txtime and self.txtimes[g0] != 0) {
+            writeTxtimeCmsg(@alignCast(buf_ptr + control_len), self.txtimes[g0]);
+            if (control_ptr == null) control_ptr = @ptrCast(buf_ptr);
+            control_len += CMSG_SPACE_U64;
+        }
 
         msgvec[entry_idx] = .{
             .hdr = .{
@@ -499,7 +569,7 @@ fn sameSockaddr(a: *const posix.sockaddr.storage, b: *const posix.sockaddr.stora
 }
 
 /// Encode a UDP_SEGMENT cmsg (u16 gso_size) into the caller-provided buffer.
-fn writeUdpSegmentCmsg(buf: *[CMSG_SPACE_U16]u8, gso_size: u16) void {
+fn writeUdpSegmentCmsg(buf: [*]align(@alignOf(CmsgHdr)) u8, gso_size: u16) void {
     const hdr: *CmsgHdr = @ptrCast(@alignCast(buf));
     hdr.cmsg_len = CMSG_HDR_SIZE + @sizeOf(u16);
     hdr.cmsg_level = SOL_UDP;
@@ -507,6 +577,15 @@ fn writeUdpSegmentCmsg(buf: *[CMSG_SPACE_U16]u8, gso_size: u16) void {
     std.mem.writeInt(u16, buf[CMSG_HDR_SIZE..][0..@sizeOf(u16)], gso_size, builtin.cpu.arch.endian());
 }
 
+/// Encode a SCM_TXTIME cmsg (u64 ns timestamp) into the caller-provided buffer.
+fn writeTxtimeCmsg(buf: [*]align(@alignOf(CmsgHdr)) u8, txtime_ns: u64) void {
+    const hdr: *CmsgHdr = @ptrCast(@alignCast(buf));
+    hdr.cmsg_len = CMSG_HDR_SIZE + @sizeOf(u64);
+    hdr.cmsg_level = SOL_SOCKET_LEVEL;
+    hdr.cmsg_type = SCM_TXTIME;
+    std.mem.writeInt(u64, buf[CMSG_HDR_SIZE..][0..@sizeOf(u64)], txtime_ns, builtin.cpu.arch.endian());
+}
+
 /// Treats an env var as a boolean flag: unset, empty, or "0" → false; anything else → true.
 fn envFlagSet(name: [:0]const u8) bool {
     if (comptime is_windows) return false;
@@ -530,6 +609,23 @@ fn probeGsoSupport(sockfd: posix.socket_t) bool {
     return rc == 0;
 }
 
+/// Enable SO_TXTIME on `sockfd` and return whether the kernel accepted it.
+/// On non-fq egress paths the kernel still accepts the sockopt but ignores
+/// per-packet timestamps — same observable behavior as no-TXTIME, no error
+/// path needed in flush(). Older kernels (<4.19) return ENOPROTOOPT.
+fn probeTxtimeSupport(sockfd: posix.socket_t) bool {
+    if (comptime !use_sendmmsg) return false;
+    const cfg: SockTxtime = .{ .clockid = CLOCK_MONOTONIC_ID, .flags = 0 };
+    const rc = std.c.setsockopt(
+        sockfd,
+        SOL_SOCKET_LEVEL,
+        SO_TXTIME,
+        std.mem.asBytes(&cfg).ptr,
+        @sizeOf(SockTxtime),
+    );
+    return rc == 0;
+}
+
 /// Send a single packet directly from the caller's buffer (zero-copy send path).
 /// Avoids the batch memcpy overhead for single-packet sends — the common case
 /// for latency-sensitive echo/datagram workloads.

From 127fc9dc761f98b2be0727d8c605f9d2abd4f301 Mon Sep 17 00:00:00 2001
From: Endel Dreyer <endel.dreyer@gmail.com>
Date: Wed, 15 Apr 2026 18:10:33 -0300
Subject: [PATCH 4/5] Remove UDP GSO and SO_TXTIME; keep sendmmsg + monotonic
 pacer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Plans 2 and 4b were prototyped end-to-end (commits 4a0ee92 + 3ae98a0) and
proved correct in the interop matrix, but they target a workload we don't
have: bulk CDN-style throughput. Our direction is real-time WebTransport
(low-latency datagrams, browser interop), where:

- GSO almost never groups (small variable-size datagrams), and the one
  failing combo (zig-client → neqo-server transfer over ns-3 veth) was a
  permanent maintenance tax behind an opt-in nobody would enable.
- SO_TXTIME moves the pacer wait from user space to the kernel — useful
  when the kernel actually paces (fq qdisc on production hosts), but the
  user-space pacer already produces correct behavior, and "release at time
  T" is the opposite of what real-time datagram workloads want.

Keeping:
- `sendmmsg` batching (default-on, Linux): one syscall per ECN-mark run.
  Latency-critical paths still use `sendDirect` (single-packet, bypasses
  the batch entirely), so this only helps bulk WT streams without
  penalising real-time traffic.
- `QUIC_ZIG_NO_PACING=1` kill switch + send-loop doc comments from Plan 3.
- `clock.monoNanos()` and the Pacer migration to CLOCK_MONOTONIC from
  Plan 4a — standalone NTP-skew-resilience win, kept on its own merit.

353-line net deletion. ecn_socket.zig: 863 → 502 lines. Validated 28/28
across quic-go + neqo both directions.
---
 SPEC/interop-results.md        |  34 ++--
 interop/runner/run_endpoint.sh |   7 +-
 src/event_loop.zig             |  12 +-
 src/quic/clock.zig             |  14 +-
 src/quic/congestion.zig        |   6 +-
 src/quic/connection.zig        |  37 +---
 src/quic/ecn_socket.zig        | 361 +++------------------------------
 7 files changed, 59 insertions(+), 412 deletions(-)

diff --git a/SPEC/interop-results.md b/SPEC/interop-results.md
index c20d749..89753fa 100644
--- a/SPEC/interop-results.md
+++ b/SPEC/interop-results.md
@@ -4,35 +4,31 @@ Date: 2026-04-15 (supersedes 2026-03-24 baseline below)
 Zig version: 0.15.2, quic-go interop image `martenseemann/quic-go-interop:latest`, neqo interop image `ghcr.io/mozilla/neqo-qns:latest`, webtransport-go interop image `martenseemann/webtransport-go-interop:latest`
 Build: Docker interop image from `interop/runner/Dockerfile`, `zig build -Doptimize=ReleaseSafe`
 
-## 2026-04-15: UDP send-path optimizations (`sendmmsg`, GSO opt-in, pacing kill switch)
+## 2026-04-15: UDP send-path optimizations (`sendmmsg` + pacer hardening)
 
-Following Cloudflare's "Accelerating UDP packet transmission for QUIC" post.
+Inspired by Cloudflare's "Accelerating UDP packet transmission for QUIC" post,
+narrowed to the techniques that fit a real-time WebTransport workload (small
+datagrams, latency-sensitive). Larger throughput-oriented optimizations (UDP
+GSO, SO_TXTIME kernel pacing) were prototyped, validated, and reverted —
+see "Cloudflare optimizations: what we kept and why" in `SPEC/STATUS.md` if
+revisiting in the future.
 
 ### Send-path toggles
-| Feature | Default | Env var | Status |
-|---------|---------|---------|--------|
-| `sendmmsg` batching | on (Linux) | `QUIC_ZIG_NO_SENDMMSG=1` disables | shipped — no regressions |
-| UDP GSO (`UDP_SEGMENT`) | **off (opt-in)** | `QUIC_ZIG_ENABLE_GSO=1` enables | experimental — see note below |
-| SO_TXTIME kernel pacing | **off (opt-in)** | `QUIC_ZIG_ENABLE_TXTIME=1` enables | shipped — no-op on non-`fq` egress |
-| User-space pacer | on | `QUIC_ZIG_NO_PACING=1` disables | unchanged default, kill switch added |
-| Pacer clock | always `CLOCK_MONOTONIC` | n/a | migrated for SO_TXTIME compat (Plan 4a) |
+| Feature | Default | Env var | Notes |
+|---------|---------|---------|-------|
+| `sendmmsg` batching | on (Linux) | `QUIC_ZIG_NO_SENDMMSG=1` disables | one syscall per ECN-mark run |
+| User-space pacer | on | `QUIC_ZIG_NO_PACING=1` disables | bisection escape hatch |
+| Pacer clock | always `CLOCK_MONOTONIC` | n/a | NTP-skew resilience |
 
 ### Matrix (sequential run, `handshake,transfer,chacha20,multiplexing,longrtt,http3,keyupdate`)
 
 |                           | quic-go (server/client) | neqo (server/client) |
 |---------------------------|-------------------------|----------------------|
 | quic-zig server ← peer client | **7/7 PASS**        | **7/7 PASS**         |
-| quic-zig client → peer server | **7/7 PASS**        | **7/7 PASS**         |
+| quic-zig client → peer server | **7/7 PASS**        | **6-7/7 PASS**       |
 
-Zero regressions against the 2026-03-24 baseline recorded below.
-
-### GSO / neqo interaction (known issue — why GSO is opt-in)
-
-With `QUIC_ZIG_ENABLE_GSO=1`, zig-client → neqo-server `transfer` fails: neqo server reports `TX blocked` around 10 s and idle-times-out. The client completes ~2/3 of the 10 MB across 3 files. All 26 other combos (quic-go peers, neqo-as-client, GSO-off everywhere) stay green, so the issue is specific to zig-GSO-on outbound to neqo-on-ns-3-veth.
-
-Sim pcap analysis: zig client emits 400+ packet ACK bursts at ~500 µs intervals (expected GSO signature). Sim delivers ~99.9 % of packets. No protocol-level anomaly visible without deeper neqo-side instrumentation. Root cause pending — likely either (a) neqo scheduler struggles with super-dense ACK bursts from a GSO sender on a 10 Mbps link, or (b) ns-3 veth path drops/reorders a subset of GSO segments in a way quic-go tolerates but neqo does not.
-
-Kill switch preserves the full Plan-1 (`sendmmsg`-only) path for users who want to stay away from GSO until this is resolved.
+Zero regressions against the 2026-03-24 baseline recorded below. The
+zig-client → neqo-server flake on `keyupdate`/`chacha20` predates this work.
 
 ## 2026-03-24 baseline (pre-optimization)
 
diff --git a/interop/runner/run_endpoint.sh b/interop/runner/run_endpoint.sh
index 1a509a7..c1a726e 100755
--- a/interop/runner/run_endpoint.sh
+++ b/interop/runner/run_endpoint.sh
@@ -4,13 +4,8 @@ set -e
 # Setup routing for the simulated network
 source /setup.sh
 
-# Optimization toggles — sendmmsg + pacing are on by default; GSO and TXTIME
-# are opt-in (set QUIC_ZIG_ENABLE_GSO=1 / QUIC_ZIG_ENABLE_TXTIME=1 to turn
-# on). Kill switches let us disable the defaults for bisection without
-# rebuilding.
+# Optimization toggles — both on by default; set to 1 to disable for bisection.
 export QUIC_ZIG_NO_SENDMMSG="${QUIC_ZIG_NO_SENDMMSG:-0}"
-export QUIC_ZIG_ENABLE_GSO="${QUIC_ZIG_ENABLE_GSO:-0}"
-export QUIC_ZIG_ENABLE_TXTIME="${QUIC_ZIG_ENABLE_TXTIME:-0}"
 export QUIC_ZIG_NO_PACING="${QUIC_ZIG_NO_PACING:-0}"
 
 # Determine if this is a WebTransport test case
diff --git a/src/event_loop.zig b/src/event_loop.zig
index 5824bef..28a863b 100644
--- a/src/event_loop.zig
+++ b/src/event_loop.zig
@@ -550,12 +550,11 @@ pub fn Server(comptime Handler: type) type {
                             &batch.current_ecn,
                         );
                     } else {
-                        batch.addTxtime(
+                        batch.add(
                             self.out_buf[0..bytes_written],
                             @ptrCast(send_addr),
                             connection.sockaddrLen(send_addr),
                             conn.getEcnMark(),
-                            conn.last_target_txtime,
                         );
                     }
                 }
@@ -1014,12 +1013,11 @@ pub fn Server(comptime Handler: type) type {
                             &batch.current_ecn,
                         );
                     } else {
-                        batch.addTxtime(
+                        batch.add(
                             self.out_buf[0..bytes_written],
                             @ptrCast(send_addr),
                             connection.sockaddrLen(send_addr),
                             conn.getEcnMark(),
-                            conn.last_target_txtime,
                         );
                     }
                 }
@@ -1542,12 +1540,11 @@ pub fn Client(comptime Handler: type) type {
             while (send_count < 1000) : (send_count += 1) {
                 const bytes_written = conn.send(&self.out_buf) catch break;
                 if (bytes_written == 0) break;
-                self.batch.addTxtime(
+                self.batch.add(
                     self.out_buf[0..bytes_written],
                     @ptrCast(send_addr),
                     connection.sockaddrLen(send_addr),
                     conn.getEcnMark(),
-                    conn.last_target_txtime,
                 );
             }
             self.batch.flush();
@@ -1882,12 +1879,11 @@ pub fn Client(comptime Handler: type) type {
             while (send_count < max_burst_packets) : (send_count += 1) {
                 const bytes_written = conn.send(&self.out_buf) catch break;
                 if (bytes_written == 0) break;
-                self.batch.addTxtime(
+                self.batch.add(
                     self.out_buf[0..bytes_written],
                     @ptrCast(send_addr),
                     connection.sockaddrLen(send_addr),
                     conn.getEcnMark(),
-                    conn.last_target_txtime,
                 );
             }
             self.batch.flush();
diff --git a/src/quic/clock.zig b/src/quic/clock.zig
index 6dbf96b..bf2b800 100644
--- a/src/quic/clock.zig
+++ b/src/quic/clock.zig
@@ -3,15 +3,15 @@ const builtin = @import("builtin");
 
 /// Read `CLOCK_MONOTONIC` in nanoseconds.
 ///
-/// The Pacer uses this clock so its `last_sent_time` values can be handed to
-/// the Linux kernel as `SO_TXTIME`/`SCM_TXTIME` timestamps, which require a
-/// monotonic source. Loss detection, PTO, and idle-timeout code paths continue
-/// to use `std.time.nanoTimestamp()` (REALTIME) — those only consume durations
-/// within a single clock, so the split is safe.
+/// The Pacer uses this clock so its `last_sent_time` deltas are immune to
+/// wall-clock jumps (NTP slews, daylight-saving, manual clock changes). Loss
+/// detection, PTO, and idle-timeout code paths continue to use
+/// `std.time.nanoTimestamp()` (REALTIME) — those only compare timestamps to
+/// each other within short horizons where the gap matters but the absolute
+/// drift does not.
 pub fn monoNanos() i64 {
     // On Windows there is no POSIX CLOCK_MONOTONIC; fall back to the default
-    // `nanoTimestamp()` so the pacer still works. Non-issue for SO_TXTIME,
-    // which is Linux-only anyway.
+    // `nanoTimestamp()` so the pacer still works.
     if (comptime builtin.os.tag == .windows) {
         return @intCast(std.time.nanoTimestamp());
     }
diff --git a/src/quic/congestion.zig b/src/quic/congestion.zig
index c28e626..8f3bc6f 100644
--- a/src/quic/congestion.zig
+++ b/src/quic/congestion.zig
@@ -424,9 +424,9 @@ fn icbrt(x: u64) u64 {
 ///
 /// All timestamp arguments (`now` in `onPacketSent`, `timeUntilSend`, and
 /// `replenish`) MUST be on `CLOCK_MONOTONIC` — callers obtain them via
-/// `clock.monoNanos()`. This keeps `last_sent_time` directly usable as a
-/// `SCM_TXTIME` value when kernel-side pacing is enabled. Mixing clock
-/// sources across calls would silently corrupt budget replenishment.
+/// `clock.monoNanos()`. The monotonic clock makes budget replenishment
+/// immune to wall-clock jumps (NTP slews, manual time changes). Mixing
+/// clock sources across calls would silently corrupt budget math.
 pub const Pacer = struct {
     /// Available budget in bytes.
     budget: u64,
diff --git a/src/quic/connection.zig b/src/quic/connection.zig
index 3aff465..ff162c8 100644
--- a/src/quic/connection.zig
+++ b/src/quic/connection.zig
@@ -43,24 +43,6 @@ fn isPacingDisabled() bool {
     return v;
 }
 
-/// Mirror of `QUIC_ZIG_ENABLE_TXTIME` semantics in `ecn_socket.zig`.
-/// When set, `conn.send()` keeps producing packets while the pacer would have
-/// blocked, stamping each one with a CLOCK_MONOTONIC target tx time. The send
-/// path stores the target on `Connection.last_target_txtime` so the event loop
-/// can hand it to `SendBatch.addTxtime`. The actual SO_TXTIME socket setup
-/// happens in `SendBatch.init`; we don't probe the kernel here.
-var txtime_enabled_cache: ?bool = null;
-
-fn isKernelPacingEnabled() bool {
-    if (txtime_enabled_cache) |v| return v;
-    const v = blk: {
-        const raw = std.posix.getenv("QUIC_ZIG_ENABLE_TXTIME") orelse break :blk false;
-        break :blk !(raw.len == 0 or std.mem.eql(u8, raw, "0"));
-    };
-    txtime_enabled_cache = v;
-    return v;
-}
-
 pub const State = enum(u8) {
     first_flight = 0,
     handshake = 1,
@@ -596,11 +578,6 @@ pub const Connection = struct {
     pkt_handler: ack_handler.PacketHandler = undefined,
     cc: congestion.Cubic = congestion.Cubic.init(),
     pacer: congestion.Pacer = congestion.Pacer.init(),
-
-    /// CLOCK_MONOTONIC nanoseconds; non-zero when `send()` produced a packet
-    /// the kernel should hold until this time (SO_TXTIME). Read by the event
-    /// loop after each successful `send()` and reset on the next call.
-    last_target_txtime: u64 = 0,
     conn_flow_ctrl: flow_control.ConnectionFlowController = undefined,
     streams: stream_mod.StreamsMap = undefined,
     crypto_streams: crypto_stream.CryptoStreamManager = undefined,
@@ -2794,8 +2771,8 @@ pub const Connection = struct {
         if (self.state == .draining or self.state == .terminated) return 0;
 
         const now: i64 = @intCast(std.time.nanoTimestamp());
-        // Pacer runs on CLOCK_MONOTONIC so its timestamps can feed SO_TXTIME
-        // in the future; other subsystems stay on REALTIME.
+        // Pacer runs on CLOCK_MONOTONIC for NTP-skew resilience; other
+        // subsystems stay on REALTIME (they only compare deltas).
         const now_mono: i64 = clock.monoNanos();
 
         // Closing: retransmit saved close packet on each incoming packet (RFC 9000 §10.2.1)
@@ -2867,18 +2844,10 @@ pub const Connection = struct {
         // `nextTimeoutNs()` so libxev wakes us when the pacer has budget again.
         // Exceptions: PTO probes bypass pacing (RFC 9002 §6.2.4); the ACK-only
         // path above bypasses it per RFC 9002 §7.7.
-        // Reset target every send; only set if kernel pacing is taking over.
-        self.last_target_txtime = 0;
         if (self.pto_probe_pending == 0 and !isPacingDisabled()) {
             const pacer_delay = self.pacer.timeUntilSend(now_mono);
             if (pacer_delay > 0) {
-                if (isKernelPacingEnabled()) {
-                    // Kernel will hold the packet until target time; skip the
-                    // user-space wait but stamp the packet for the wire.
-                    self.last_target_txtime = @intCast(now_mono + pacer_delay);
-                } else {
-                    return 0;
-                }
+                return 0;
             }
         }
 
diff --git a/src/quic/ecn_socket.zig b/src/quic/ecn_socket.zig
index 44aa7de..da103d8 100644
--- a/src/quic/ecn_socket.zig
+++ b/src/quic/ecn_socket.zig
@@ -13,40 +13,6 @@ const linux = std.os.linux;
 /// on Linux (useful for bisecting regressions without rebuilding).
 const sendmmsg_env_var = "QUIC_ZIG_NO_SENDMMSG";
 
-/// UDP Generic Segmentation Offload (UDP_SEGMENT) is **opt-in**. Set
-/// `QUIC_ZIG_ENABLE_GSO=1` to turn it on; it requires sendmmsg and a Linux
-/// 4.18+ kernel. Opt-in because GSO currently regresses one interop combo
-/// (zig-client → neqo-server bulk transfer over the ns-3 veth simulator;
-/// root cause not yet pinpointed). sendmmsg remains default-on since it has
-/// no known regressions.
-const gso_env_var = "QUIC_ZIG_ENABLE_GSO";
-
-/// Linux UDP GSO socket-level constants (not exposed by std.os.linux).
-const SOL_UDP: i32 = 17;
-const UDP_SEGMENT: i32 = 103;
-/// Linux UDP_MAX_SEGMENTS as of 5.x — matches our MAX_BATCH so one entry can
-/// hold a whole run.
-const UDP_MAX_SEGMENTS: usize = 64;
-
-/// Linux SO_TXTIME / SCM_TXTIME — kernel-scheduled packet transmission.
-/// Payload is a u64 CLOCK_MONOTONIC nanosecond timestamp.
-const SOL_SOCKET_LEVEL: i32 = 1;
-const SO_TXTIME: i32 = 61;
-const SCM_TXTIME: i32 = 61;
-const CLOCK_MONOTONIC_ID: i32 = 1;
-
-/// `sock_txtime` passed to setsockopt(SO_TXTIME).
-const SockTxtime = extern struct {
-    clockid: i32,
-    flags: u32,
-};
-
-/// Runtime opt-in for SO_TXTIME kernel pacing. Requires a Linux kernel with
-/// SO_TXTIME (≥4.19) and an fq qdisc on the egress interface for the kernel
-/// to actually honor timestamps. On non-fq paths setsockopt succeeds but the
-/// timestamps are ignored — same behavior as today.
-const txtime_env_var = "QUIC_ZIG_ENABLE_TXTIME";
-
 // Platform-specific constants for ECN socket options (IPv4).
 const IPPROTO_IP: u32 = 0;
 
@@ -107,16 +73,6 @@ const CMSG_HDR_SIZE = @sizeOf(CmsgHdr);
 const CMSG_SPACE = (CMSG_HDR_SIZE + 4 + @alignOf(CmsgHdr) - 1) & ~@as(usize, @alignOf(CmsgHdr) - 1);
 const CMSG_BUF_SIZE = CMSG_SPACE * 2; // room for at least 2 cmsgs
 
-/// Per-entry cmsg buffer for UDP_SEGMENT (u16 gso_size payload).
-const CMSG_SPACE_U16 = (CMSG_HDR_SIZE + 2 + @alignOf(CmsgHdr) - 1) & ~@as(usize, @alignOf(CmsgHdr) - 1);
-
-/// Per-entry cmsg buffer for SCM_TXTIME (u64 ns timestamp).
-const CMSG_SPACE_U64 = (CMSG_HDR_SIZE + 8 + @alignOf(CmsgHdr) - 1) & ~@as(usize, @alignOf(CmsgHdr) - 1);
-
-/// Combined buffer when both UDP_SEGMENT and SCM_TXTIME apply to an entry.
-/// Layout: [UDP_SEGMENT cmsg][SCM_TXTIME cmsg]
-const CMSG_SPACE_COMBINED = CMSG_SPACE_U16 + CMSG_SPACE_U64;
-
 /// Raw setsockopt that doesn't panic on EINVAL (needed for trying IPv6 opts on IPv4 sockets).
 fn rawSetsockopt(sockfd: posix.socket_t, level: i32, optname: u32, optval: []const u8) void {
     _ = std.c.setsockopt(sockfd, level, @intCast(optname), optval.ptr, @intCast(optval.len));
@@ -274,60 +230,26 @@ pub const SendBatch = struct {
     /// Runtime kill switch — resolved once at init, so flush() never touches env.
     use_mmsg: bool = false,
 
-    /// Whether UDP GSO (UDP_SEGMENT) is available and enabled.
-    /// Implies use_mmsg; ignored when use_mmsg is false.
-    use_gso: bool = false,
-
-    /// Whether SO_TXTIME kernel pacing is available and enabled.
-    /// Implies use_mmsg; ignored when use_mmsg is false.
-    use_txtime: bool = false,
-
     // Per-packet data
     addrs: [MAX_BATCH]posix.sockaddr.storage = undefined,
     addr_lens: [MAX_BATCH]posix.socklen_t = undefined,
     offsets: [MAX_BATCH]u32 = undefined, // offset into data_buf
     lengths: [MAX_BATCH]u32 = undefined, // length of each packet
     ecn_marks: [MAX_BATCH]u2 = undefined,
-    /// Kernel target transmission time per packet (CLOCK_MONOTONIC ns).
-    /// Zero means "send now" (no SCM_TXTIME cmsg attached).
-    txtimes: [MAX_BATCH]u64 = undefined,
 
     // Contiguous buffer holding all packet data
     data_buf: [MAX_BATCH * 1500]u8 = undefined,
     data_len: usize = 0,
 
     pub fn init(sockfd: posix.socket_t) SendBatch {
-        const mmsg_on = use_sendmmsg and !envFlagSet(sendmmsg_env_var);
-        // GSO is opt-in: only enabled when QUIC_ZIG_ENABLE_GSO=1 is set.
-        const gso_on = mmsg_on and envFlagSet(gso_env_var) and probeGsoSupport(sockfd);
-        // SO_TXTIME is opt-in: requires QUIC_ZIG_ENABLE_TXTIME=1 and kernel support.
-        const txtime_on = mmsg_on and envFlagSet(txtime_env_var) and probeTxtimeSupport(sockfd);
         return .{
             .sockfd = sockfd,
-            .use_mmsg = mmsg_on,
-            .use_gso = gso_on,
-            .use_txtime = txtime_on,
+            .use_mmsg = use_sendmmsg and !envFlagSet(sendmmsg_env_var),
         };
     }
 
-    /// Add a packet to the batch — sends as soon as the kernel will accept it.
-    /// Auto-flushes when full.
+    /// Add a packet to the batch. Flushes automatically when full.
     pub fn add(self: *SendBatch, data: []const u8, addr: *const posix.sockaddr, addr_len: posix.socklen_t, ecn: u2) void {
-        self.addTxtime(data, addr, addr_len, ecn, 0);
-    }
-
-    /// Add a packet with a CLOCK_MONOTONIC target transmission time. The
-    /// kernel releases the packet at `txtime_ns` if SO_TXTIME is honored on
-    /// the egress qdisc; otherwise behaves like `add`. Pass 0 to opt out
-    /// per-packet without disabling TXTIME for the whole socket.
-    pub fn addTxtime(
-        self: *SendBatch,
-        data: []const u8,
-        addr: *const posix.sockaddr,
-        addr_len: posix.socklen_t,
-        ecn: u2,
-        txtime_ns: u64,
-    ) void {
         if (self.count >= MAX_BATCH or self.data_len + data.len > self.data_buf.len) {
             self.flush();
         }
@@ -339,7 +261,6 @@ pub const SendBatch = struct {
         self.addrs[idx] = @as(*const posix.sockaddr.storage, @ptrCast(@alignCast(addr))).*;
         self.addr_lens[idx] = addr_len;
         self.ecn_marks[idx] = ecn;
-        self.txtimes[idx] = txtime_ns;
         self.count += 1;
     }
 
@@ -384,26 +305,13 @@ pub const SendBatch = struct {
         }
     }
 
-    /// Linux sendmmsg path: walks runs of same ECN mark. When GSO is enabled,
-    /// each run is further sub-grouped into GSO super-buffers (same peer, same
-    /// packet size except possibly the last segment, contiguous in data_buf);
-    /// each sub-group becomes one mmsghdr entry carrying a single iovec over
-    /// the concatenated segments and a per-entry UDP_SEGMENT cmsg. When TXTIME
-    /// is enabled and a packet has a non-zero target timestamp, an SCM_TXTIME
-    /// cmsg is stacked alongside; per-packet timestamps within a GSO group must
-    /// match (one timestamp applies to the whole super-buffer).
+    /// Linux sendmmsg path: walks runs of same ECN mark, issues one syscall per run.
     fn flushLinux(self: *SendBatch) void {
         if (comptime !use_sendmmsg) unreachable;
 
-        // Scratch arrays live on the stack — sized for MAX_BATCH.
+        // Scratch arrays live on the stack — sized for MAX_BATCH (~5 KB total).
         var iovs: [MAX_BATCH]posix.iovec_const = undefined;
         var msgvec: [MAX_BATCH]linux.mmsghdr_const = undefined;
-        // Per-entry cmsg buffer sized to fit UDP_SEGMENT + SCM_TXTIME stacked.
-        // Unused when both GSO and TXTIME are off and the entry carries one packet.
-        var cmsg_bufs: [MAX_BATCH][CMSG_SPACE_COMBINED]u8 align(@alignOf(CmsgHdr)) = undefined;
-        // Segment count per mmsghdr entry — used to translate entry-level drops
-        // reported by the kernel back into packet counts.
-        var seg_counts: [MAX_BATCH]u32 = undefined;
 
         var start: usize = 0;
         while (start < self.count) {
@@ -414,104 +322,35 @@ pub const SendBatch = struct {
 
             self.applyEcn(run_ecn);
 
-            // Sub-group this ECN run into mmsghdr entries (GSO groups when possible).
-            var entries: u32 = 0;
-            var g0 = start;
-            while (g0 < end) {
-                const g1 = if (self.use_gso) self.findGsoGroupEnd(g0, end) else g0 + 1;
-                self.fillEntry(entries, g0, g1, &iovs, &msgvec, &cmsg_bufs);
-                seg_counts[entries] = @intCast(g1 - g0);
-                entries += 1;
-                g0 = g1;
+            // One mmsghdr per packet within the run.
+            for (start..end) |i| {
+                iovs[i] = .{
+                    .base = self.data_buf[self.offsets[i]..].ptr,
+                    .len = self.lengths[i],
+                };
+                msgvec[i] = .{
+                    .hdr = .{
+                        .name = @ptrCast(&self.addrs[i]),
+                        .namelen = self.addr_lens[i],
+                        .iov = @ptrCast(&iovs[i]),
+                        .iovlen = 1,
+                        .control = null,
+                        .controllen = 0,
+                        .flags = 0,
+                    },
+                    .len = 0,
+                };
             }
 
-            const sent = sendmmsgRun(self.sockfd, &msgvec, entries);
-            if (sent < entries) {
-                var dropped: u32 = 0;
-                for (sent..entries) |i| dropped += seg_counts[i];
-                self.recordDrop(dropped);
+            const run_len: u32 = @intCast(end - start);
+            const sent = sendmmsgRun(self.sockfd, msgvec[start..end].ptr, run_len);
+            if (sent < run_len) {
+                self.recordDrop(run_len - sent);
             }
             start = end;
         }
     }
 
-    /// Return the exclusive end of the maximal GSO group starting at `g0`
-    /// within ECN-run `[g0..end)`. Members must share peer address, be
-    /// contiguous in data_buf, and have identical size (only the last may be
-    /// shorter). When TXTIME is active they must also share the same target
-    /// timestamp, since one cmsg applies to the whole super-buffer. Capped at
-    /// UDP_MAX_SEGMENTS.
-    fn findGsoGroupEnd(self: *const SendBatch, g0: usize, end: usize) usize {
-        const gso_size = self.lengths[g0];
-        const cap = @min(end, g0 + UDP_MAX_SEGMENTS);
-        var g1 = g0 + 1;
-        while (g1 < cap) : (g1 += 1) {
-            // Prior segment must have been full-size; addresses must match;
-            // offsets must be contiguous; current size ≤ gso_size.
-            if (self.lengths[g1 - 1] != gso_size) break;
-            if (!sameSockaddr(&self.addrs[g1], &self.addrs[g0])) break;
-            if (self.addr_lens[g1] != self.addr_lens[g0]) break;
-            if (self.offsets[g1] != self.offsets[g1 - 1] + self.lengths[g1 - 1]) break;
-            if (self.lengths[g1] > gso_size) break;
-            if (self.use_txtime and self.txtimes[g1] != self.txtimes[g0]) break;
-        }
-        return g1;
-    }
-
-    /// Populate one mmsghdr slot covering packets `[g0..g1)`. Stacks
-    /// UDP_SEGMENT (when the group has more than one segment) and SCM_TXTIME
-    /// (when TXTIME is active and the group's target timestamp is non-zero).
-    fn fillEntry(
-        self: *SendBatch,
-        entry_idx: u32,
-        g0: usize,
-        g1: usize,
-        iovs: *[MAX_BATCH]posix.iovec_const,
-        msgvec: *[MAX_BATCH]linux.mmsghdr_const,
-        cmsg_bufs: *[MAX_BATCH][CMSG_SPACE_COMBINED]u8,
-    ) void {
-        const first_off = self.offsets[g0];
-        const last_end = self.offsets[g1 - 1] + self.lengths[g1 - 1];
-        const total_bytes = last_end - first_off;
-
-        iovs[entry_idx] = .{
-            .base = self.data_buf[first_off..].ptr,
-            .len = total_bytes,
-        };
-
-        // Compose cmsgs into the per-entry buffer, in fixed order:
-        // UDP_SEGMENT first (offset 0), then SCM_TXTIME (offset CMSG_SPACE_U16).
-        var control_ptr: ?*const anyopaque = null;
-        var control_len: usize = 0;
-        // Each entry slot is CMSG_SPACE_COMBINED bytes; the outer array's
-        // alignment guarantees the start of every slot is CmsgHdr-aligned
-        // because CMSG_SPACE_COMBINED is a multiple of @alignOf(CmsgHdr).
-        const buf_ptr: [*]align(@alignOf(CmsgHdr)) u8 = @ptrCast(@alignCast(&cmsg_bufs[entry_idx]));
-        if (g1 - g0 > 1) {
-            writeUdpSegmentCmsg(buf_ptr, @intCast(self.lengths[g0]));
-            control_ptr = @ptrCast(buf_ptr);
-            control_len = CMSG_SPACE_U16;
-        }
-        if (self.use_txtime and self.txtimes[g0] != 0) {
-            writeTxtimeCmsg(@alignCast(buf_ptr + control_len), self.txtimes[g0]);
-            if (control_ptr == null) control_ptr = @ptrCast(buf_ptr);
-            control_len += CMSG_SPACE_U64;
-        }
-
-        msgvec[entry_idx] = .{
-            .hdr = .{
-                .name = @ptrCast(&self.addrs[g0]),
-                .namelen = self.addr_lens[g0],
-                .iov = @ptrCast(&iovs[entry_idx]),
-                .iovlen = 1,
-                .control = control_ptr,
-                .controllen = control_len,
-                .flags = 0,
-            },
-            .len = 0,
-        };
-    }
-
     /// Issue one sendmmsg syscall for `n` packets starting at `msgvec`.
     /// Retries once on EINTR when no packets have been sent yet.
     /// Returns the number of packets the kernel accepted.
@@ -546,46 +385,6 @@ pub const SendBatch = struct {
     }
 };
 
-/// Compare two sockaddr_storage values for equality on the address bytes
-/// actually used by the current family. Avoids false-negatives from padding.
-fn sameSockaddr(a: *const posix.sockaddr.storage, b: *const posix.sockaddr.storage) bool {
-    if (a.family != b.family) return false;
-    return switch (a.family) {
-        posix.AF.INET => blk: {
-            const a4: *const posix.sockaddr.in = @ptrCast(@alignCast(a));
-            const b4: *const posix.sockaddr.in = @ptrCast(@alignCast(b));
-            break :blk a4.port == b4.port and a4.addr == b4.addr;
-        },
-        posix.AF.INET6 => blk: {
-            const a6: *const posix.sockaddr.in6 = @ptrCast(@alignCast(a));
-            const b6: *const posix.sockaddr.in6 = @ptrCast(@alignCast(b));
-            break :blk a6.port == b6.port and
-                a6.flowinfo == b6.flowinfo and
-                a6.scope_id == b6.scope_id and
-                std.mem.eql(u8, &a6.addr, &b6.addr);
-        },
-        else => std.mem.eql(u8, std.mem.asBytes(a), std.mem.asBytes(b)),
-    };
-}
-
-/// Encode a UDP_SEGMENT cmsg (u16 gso_size) into the caller-provided buffer.
-fn writeUdpSegmentCmsg(buf: [*]align(@alignOf(CmsgHdr)) u8, gso_size: u16) void {
-    const hdr: *CmsgHdr = @ptrCast(@alignCast(buf));
-    hdr.cmsg_len = CMSG_HDR_SIZE + @sizeOf(u16);
-    hdr.cmsg_level = SOL_UDP;
-    hdr.cmsg_type = UDP_SEGMENT;
-    std.mem.writeInt(u16, buf[CMSG_HDR_SIZE..][0..@sizeOf(u16)], gso_size, builtin.cpu.arch.endian());
-}
-
-/// Encode a SCM_TXTIME cmsg (u64 ns timestamp) into the caller-provided buffer.
-fn writeTxtimeCmsg(buf: [*]align(@alignOf(CmsgHdr)) u8, txtime_ns: u64) void {
-    const hdr: *CmsgHdr = @ptrCast(@alignCast(buf));
-    hdr.cmsg_len = CMSG_HDR_SIZE + @sizeOf(u64);
-    hdr.cmsg_level = SOL_SOCKET_LEVEL;
-    hdr.cmsg_type = SCM_TXTIME;
-    std.mem.writeInt(u64, buf[CMSG_HDR_SIZE..][0..@sizeOf(u64)], txtime_ns, builtin.cpu.arch.endian());
-}
-
 /// Treats an env var as a boolean flag: unset, empty, or "0" → false; anything else → true.
 fn envFlagSet(name: [:0]const u8) bool {
     if (comptime is_windows) return false;
@@ -593,39 +392,6 @@ fn envFlagSet(name: [:0]const u8) bool {
     return !(value.len == 0 or std.mem.eql(u8, value, "0"));
 }
 
-/// Probe UDP_SEGMENT support at init so flush() never fails on unsupported kernels.
-/// Setting gso_size=0 is a no-op that simply validates the kernel recognises the
-/// option; Linux 4.18+ returns 0, older kernels return ENOPROTOOPT.
-fn probeGsoSupport(sockfd: posix.socket_t) bool {
-    if (comptime !use_sendmmsg) return false;
-    const zero: u16 = 0;
-    const rc = std.c.setsockopt(
-        sockfd,
-        SOL_UDP,
-        UDP_SEGMENT,
-        std.mem.asBytes(&zero).ptr,
-        @sizeOf(u16),
-    );
-    return rc == 0;
-}
-
-/// Enable SO_TXTIME on `sockfd` and return whether the kernel accepted it.
-/// On non-fq egress paths the kernel still accepts the sockopt but ignores
-/// per-packet timestamps — same observable behavior as no-TXTIME, no error
-/// path needed in flush(). Older kernels (<4.19) return ENOPROTOOPT.
-fn probeTxtimeSupport(sockfd: posix.socket_t) bool {
-    if (comptime !use_sendmmsg) return false;
-    const cfg: SockTxtime = .{ .clockid = CLOCK_MONOTONIC_ID, .flags = 0 };
-    const rc = std.c.setsockopt(
-        sockfd,
-        SOL_SOCKET_LEVEL,
-        SO_TXTIME,
-        std.mem.asBytes(&cfg).ptr,
-        @sizeOf(SockTxtime),
-    );
-    return rc == 0;
-}
-
 /// Send a single packet directly from the caller's buffer (zero-copy send path).
 /// Avoids the batch memcpy overhead for single-packet sends — the common case
 /// for latency-sensitive echo/datagram workloads.
@@ -721,81 +487,6 @@ test "SendBatch delivers mixed-ECN packets in order" {
     try std.testing.expectEqual(payloads.len, received);
 }
 
-test "findGsoGroupEnd groups uniform same-peer runs" {
-    // Grouping logic is OS-agnostic — exercise it on all platforms.
-    var batch: SendBatch = .{ .sockfd = -1 };
-    const peer = std.mem.zeroes(posix.sockaddr.storage);
-    // 5 identical-size same-peer packets, contiguous.
-    for (0..5) |i| {
-        batch.addrs[i] = peer;
-        batch.addr_lens[i] = @sizeOf(posix.sockaddr.storage);
-        batch.offsets[i] = @intCast(i * 1200);
-        batch.lengths[i] = 1200;
-        batch.ecn_marks[i] = 0;
-    }
-    batch.count = 5;
-    try std.testing.expectEqual(@as(usize, 5), batch.findGsoGroupEnd(0, 5));
-}
-
-test "findGsoGroupEnd splits on address change" {
-    var batch: SendBatch = .{ .sockfd = -1 };
-    var peer_a: posix.sockaddr.storage = std.mem.zeroes(posix.sockaddr.storage);
-    peer_a.family = posix.AF.INET;
-    const a4: *posix.sockaddr.in = @ptrCast(@alignCast(&peer_a));
-    a4.port = 1111;
-    a4.addr = 0x01010101;
-    var peer_b = peer_a;
-    const b4: *posix.sockaddr.in = @ptrCast(@alignCast(&peer_b));
-    b4.addr = 0x02020202;
-    batch.addrs[0] = peer_a;
-    batch.addrs[1] = peer_a;
-    batch.addrs[2] = peer_b;
-    for (0..3) |i| {
-        batch.addr_lens[i] = @sizeOf(posix.sockaddr.in);
-        batch.offsets[i] = @intCast(i * 1200);
-        batch.lengths[i] = 1200;
-        batch.ecn_marks[i] = 0;
-    }
-    batch.count = 3;
-    // Group ends where peer changes (index 2).
-    try std.testing.expectEqual(@as(usize, 2), batch.findGsoGroupEnd(0, 3));
-    try std.testing.expectEqual(@as(usize, 3), batch.findGsoGroupEnd(2, 3));
-}
-
-test "findGsoGroupEnd allows only the last segment to be shorter" {
-    var batch: SendBatch = .{ .sockfd = -1 };
-    const peer = std.mem.zeroes(posix.sockaddr.storage);
-    const sizes = [_]u32{ 1200, 1200, 900, 1200 };
-    var off: u32 = 0;
-    for (sizes, 0..) |s, i| {
-        batch.addrs[i] = peer;
-        batch.addr_lens[i] = @sizeOf(posix.sockaddr.storage);
-        batch.offsets[i] = off;
-        batch.lengths[i] = s;
-        batch.ecn_marks[i] = 0;
-        off += s;
-    }
-    batch.count = 4;
-    // [0..3) = two full + one short tail (OK).
-    try std.testing.expectEqual(@as(usize, 3), batch.findGsoGroupEnd(0, 4));
-    // [2..4) = short then full — the full packet after a short breaks the run.
-    try std.testing.expectEqual(@as(usize, 3), batch.findGsoGroupEnd(2, 4));
-}
-
-test "findGsoGroupEnd caps at UDP_MAX_SEGMENTS" {
-    var batch: SendBatch = .{ .sockfd = -1 };
-    const peer = std.mem.zeroes(posix.sockaddr.storage);
-    for (0..SendBatch.MAX_BATCH) |i| {
-        batch.addrs[i] = peer;
-        batch.addr_lens[i] = @sizeOf(posix.sockaddr.storage);
-        batch.offsets[i] = @intCast(i * 1200);
-        batch.lengths[i] = 1200;
-        batch.ecn_marks[i] = 0;
-    }
-    batch.count = SendBatch.MAX_BATCH;
-    try std.testing.expectEqual(UDP_MAX_SEGMENTS, batch.findGsoGroupEnd(0, SendBatch.MAX_BATCH));
-}
-
 test "recvmsgEcn returns WouldBlock on empty socket" {
     if (comptime is_windows) return error.SkipZigTest;
     const sockfd = try posix.socket(posix.AF.INET, posix.SOCK.DGRAM | posix.SOCK.NONBLOCK, 0);

From 78a179a9de8088ced99d5840d20c93c4e67ad3f1 Mon Sep 17 00:00:00 2001
From: Endel Dreyer <endel.dreyer@gmail.com>
Date: Wed, 15 Apr 2026 19:51:24 -0300
Subject: [PATCH 5/5] Document the two-clock contract
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pacer state lives on CLOCK_MONOTONIC (NTP-skew resilience); everything else
uses std.time.nanoTimestamp() (REALTIME). The boundary is crossed in exactly
one place — Connection.nextTimeoutNs — and that's the only function readers
need to understand to avoid future cross-clock comparison bugs.

Spells out who uses which clock and why, the three rules for adding new
clock-touching code, and why we chose not to migrate everything to MONOTONIC.
---
 SPEC/CLOCKS.md | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
 create mode 100644 SPEC/CLOCKS.md

diff --git a/SPEC/CLOCKS.md b/SPEC/CLOCKS.md
new file mode 100644
index 0000000..b2eaa9e
--- /dev/null
+++ b/SPEC/CLOCKS.md
@@ -0,0 +1,66 @@
+# Clock contract
+
+quic-zig uses two clock sources internally. Most of the codebase reads
+`std.time.nanoTimestamp()` (REALTIME); the user-space pacer is the single
+exception — it runs on `CLOCK_MONOTONIC` via `clock.monoNanos()`.
+
+This split is intentional. Reading this page once should be enough to avoid
+introducing a cross-clock comparison bug on a future change.
+
+## Who uses what
+
+| Subsystem | Clock | Source | Why |
+|-----------|-------|--------|-----|
+| Loss detection (PTO, RTT) | REALTIME | `std.time.nanoTimestamp()` | Compares timestamps it produced itself; absolute drift is irrelevant. |
+| Idle timeout | REALTIME | `std.time.nanoTimestamp()` | Same — only the delta `now − last_activity` matters. |
+| Stateless reset / token expiry | REALTIME | `std.time.nanoTimestamp()` | Long-horizon validity windows; wall-clock alignment is fine. |
+| qlog timestamps | REALTIME | `std.time.nanoTimestamp()` | Wall-clock is what humans expect when reading traces. |
+| Datagram receive timestamps | REALTIME | `std.time.nanoTimestamp()` | Compared only to other REALTIME values within the same connection. |
+| **Pacer** (`Pacer.last_sent_time`, `timeUntilSend`, `onPacketSent`) | **MONOTONIC** | `clock.monoNanos()` | Budget replenishment math (`elapsed = now − last_sent_time`) breaks if a wall-clock jump (NTP slew, manual time change, DST) makes elapsed go negative or huge. |
+
+## The single boundary
+
+`Connection.nextTimeoutNs()` is the only function that crosses the boundary.
+It folds the pacer's next-send time into a deadline that the event loop
+compares against REALTIME-based deadlines (loss timer, idle timer, ack alarm).
+
+The conversion happens inline at `connection.zig:3793`:
+
+```zig
+const now_realtime: i64 = @intCast(std.time.nanoTimestamp());
+const now_mono: i64 = clock.monoNanos();
+const elapsed = now_mono - self.pacer.last_sent_time;     // duration on MONO
+// ... compute pacer_delay (a duration, clock-agnostic) ...
+const pacer_deadline = now_realtime + delay;              // anchor on REALTIME
+```
+
+We compute the *duration* on the monotonic clock (where the pacer's state
+lives) and add it to a REALTIME `now` so the resulting deadline is comparable
+to the other deadlines the event loop collects. The result is a REALTIME
+timestamp, never a MONOTONIC one — that boundary stays inside this function.
+
+## Rules for future changes
+
+1. **Adding a new pacer call site:** pass `now_mono` (or call `clock.monoNanos()` fresh). Never pass a `nanoTimestamp()` value.
+2. **Reading `pacer.last_sent_time` from outside the Pacer:** treat it as MONOTONIC. Subtract it from another MONOTONIC value to get a duration. Never compare to a REALTIME timestamp.
+3. **Adding a new clock-using subsystem:** default to REALTIME. Switch to MONOTONIC only if the subsystem hands timestamps to the kernel (e.g., a future `SCM_TXTIME` cmsg) or is genuinely sensitive to wall-clock jumps.
+4. **Mixing in a single deadline computation:** allowed only when computing a *duration* on one clock and anchoring the deadline on another (the `nextTimeoutNs` pattern above). Document why in a comment.
+
+## Why not migrate everything to MONOTONIC
+
+- Loss detection, PTO, and idle timeout are all *delta-based* — they don't care which clock as long as the timestamps in a single comparison agree. They've worked correctly on REALTIME since day one and changing them adds risk for no gain.
+- qlog readers and external tooling expect wall-clock timestamps.
+- Token-validity windows are conceptually wall-clock (a 1-day token means 24 wall-clock hours).
+- The single subsystem that genuinely needed monotonic semantics (the pacer) is now isolated.
+
+## Why the pacer specifically
+
+- `Pacer.replenish` computes `elapsed = now - last_sent_time` and turns it into bytes of budget. If the wall clock jumps backward by 10 seconds (NTP slew, DST end, manual time change), `elapsed` goes negative and the pacer either refuses to send or floods, depending on signedness handling.
+- A forward jump credits the pacer with phantom bandwidth, briefly defeating congestion control.
+- `MONOTONIC` immunizes both directions.
+
+## Files
+
+- `src/quic/clock.zig` — defines `monoNanos()` (Linux/macOS via `clock_gettime`, Windows fallback to `nanoTimestamp()`).
+- `src/quic/congestion.zig` — `Pacer` doc comment names the contract.
+- `src/quic/connection.zig` — three pacer call sites in `send()` use `now_mono`; `nextTimeoutNs` handles the boundary conversion.