Latency-aware routing for VPC networks

## Problem

Today, traffic between VPC workloads follows the shortest available path through the network fabric — but the shortest path isn't always the fastest. Customers running latency-sensitive workloads (real-time APIs, databases, video streaming) have no way to express that their traffic should prefer low-latency paths, and no visibility into the actual latency their traffic experiences.

As the platform scales across regions and availability zones, the gap between "shortest path" and "lowest latency path" will grow. Customers with strict SLA requirements need confidence that their traffic is routed optimally — not just reachably.

## Desired Outcome

VPC customers should be able to:

1. **See the latency** their traffic experiences across the network — between workloads, across clusters, and through connectors
2. **Express latency preferences** as simple policies on their VPC (e.g., "optimize for low latency" or "keep latency under 10ms")
3. **Get automatic path optimization** — the platform should continuously measure network conditions and route traffic over the best-performing paths without manual intervention
4. **Differentiate by tier** — premium VPC tiers could receive latency-optimized routing as a value-added capability
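
As a rough illustration of item 2, the two example policies ("optimize for low latency" and "keep latency under 10ms") could be modeled as a small value type. Everything named here (`LatencyMode`, `LatencyPolicy`, `max_latency_ms`, `is_satisfied`) is hypothetical and not an existing platform API; it is a sketch of one possible shape, not a proposal for the final resource schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LatencyMode(Enum):
    """Hypothetical policy modes; names are assumptions for illustration."""
    DEFAULT = "default"    # shortest-path routing, as today
    OPTIMIZE = "optimize"  # prefer the lowest-latency path available
    BOUNDED = "bounded"    # keep latency under a stated ceiling


@dataclass(frozen=True)
class LatencyPolicy:
    mode: LatencyMode = LatencyMode.DEFAULT
    max_latency_ms: Optional[float] = None

    def __post_init__(self) -> None:
        # A ceiling is required for BOUNDED and meaningless otherwise.
        if self.mode is LatencyMode.BOUNDED:
            if self.max_latency_ms is None or self.max_latency_ms <= 0:
                raise ValueError("bounded mode needs a positive max_latency_ms")
        elif self.max_latency_ms is not None:
            raise ValueError("max_latency_ms only applies to bounded mode")


def is_satisfied(policy: LatencyPolicy, measured_ms: float) -> bool:
    """Report whether a measured latency meets the policy's stated bound."""
    if policy.mode is LatencyMode.BOUNDED:
        return measured_ms <= policy.max_latency_ms
    return True  # DEFAULT and OPTIMIZE carry no hard ceiling


under_10ms = LatencyPolicy(LatencyMode.BOUNDED, max_latency_ms=10.0)
```

Keeping the bound separate from the mode leaves room for the open question of whether this belongs on the VPC itself or on a separate policy resource.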

## Why This Matters

- **Competitive differentiation**: To our knowledge, no major cloud provider currently exposes tenant-controllable latency-aware routing as a first-class VPC feature
- **Revenue enabler**: Latency SLA tiers (standard, low-latency, ultra-low-latency) create natural pricing differentiation
- **Operational simplicity**: Measured, automated path selection replaces manual traffic engineering
- **Data sovereignty**: Latency constraints tend to favor geographically local paths, which can complement compliance requirements, although a latency bound alone does not guarantee that traffic stays within a jurisdiction
- **Multi-region readiness**: As the platform expands across regions, latency-aware routing prevents cross-region detours that degrade user experience

## How This Could Work

Research into modern network path selection suggests a layered approach:

- **Measure**: Active delay probes between nodes continuously track real-time latency across all network paths. [STAMP (RFC 8762)](https://datatracker.ietf.org/doc/html/rfc8762) provides proven mechanisms for this, with [Segment Routing extensions (RFC 9503)](https://datatracker.ietf.org/doc/html/rfc9503), covering both SR-MPLS and SRv6, that let probes be steered along the same segment lists as real traffic.
- **Advertise**: Measured latency data is distributed alongside routing information so the control plane has a complete picture of network performance. Standards like [BGP-LS TE Performance Metrics (RFC 8571)](https://datatracker.ietf.org/doc/html/rfc8571) and [IS-IS TE Metric Extensions (RFC 8570)](https://datatracker.ietf.org/doc/html/rfc8570) define how delay values propagate through the routing system.
- **Policy**: Customers express intent through simple VPC-level settings. The platform translates "optimize for latency" into the appropriate path selection constraints. The [SR Policy Architecture (RFC 9256)](https://datatracker.ietf.org/doc/html/rfc9256) defines how a "Color" value maps high-level intent to specific forwarding paths, and [IGP Flexible Algorithm (RFC 9350)](https://datatracker.ietf.org/doc/html/rfc9350) enables the network to compute delay-optimized topologies automatically.
- **Steer**: The existing [SRv6 (RFC 8986)](https://datatracker.ietf.org/doc/html/rfc8986) traffic engineering capability in the Galactic data plane can route tenant traffic through specific paths — what's needed is the intelligence layer that picks the right path based on measured delay.
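
To make the Measure step concrete: a STAMP reply carries the sender's transmit timestamp together with the reflector's receive and transmit timestamps, so the sender can subtract the reflector's processing time from the observed round trip. The sketch below shows that arithmetic plus a simple smoothing step. The type and function names are hypothetical, and the smoothing weight of 1/8 is an assumption chosen for illustration, not something the RFCs prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTimestamps:
    """Four timestamps of one two-way probe exchange, in seconds."""
    t1_sender_tx: float     # sender transmits the probe
    t2_reflector_rx: float  # reflector receives it
    t3_reflector_tx: float  # reflector sends the reply
    t4_sender_rx: float     # sender receives the reply


def round_trip_delay_ms(ts: ProbeTimestamps) -> float:
    """Round-trip network delay, excluding reflector processing time.

    (T4 - T1) and (T3 - T2) are each taken on a single clock, so the
    result does not depend on the two nodes' clocks being synchronized.
    """
    total = ts.t4_sender_rx - ts.t1_sender_tx
    reflector_hold = ts.t3_reflector_tx - ts.t2_reflector_rx
    return (total - reflector_hold) * 1000.0


def smooth(previous_ms: float, sample_ms: float, alpha: float = 0.125) -> float:
    """Exponentially weighted moving average to damp probe-to-probe jitter."""
    return (1.0 - alpha) * previous_ms + alpha * sample_ms


# The reflector's clock is deliberately offset from the sender's here.
sample = ProbeTimestamps(100.0, 500.002, 500.0025, 100.0045)
rtt_ms = round_trip_delay_ms(sample)  # roughly 4 ms
```

One-way delay, which is what the IGP and BGP-LS metric extensions advertise per link, additionally requires synchronized clocks or an assumption of path symmetry; this sketch does not cover that.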

This builds naturally on the existing BGP control plane and SRv6 data plane architecture. The core insight from research into protocols like [DDM (Delay Driven Multipath)](https://rfd.shared.oxide.computer/rfd/0347) is that **delay is a simple, effective, and universal signal** for path quality: it folds congestion, distance, and link health into a single measurable value. Google's [Swift](https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/) ("Delay is Simple and Effective for Congestion Control in the Datacenter", SIGCOMM 2020) supports the same signal at hyperscaler scale, although it applies delay to congestion control rather than to path selection.
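
If delay is the path-quality signal, path selection reduces to a shortest-path computation over measured link delays, optionally rejected when the result exceeds a customer's bound. The following is a minimal, centralized sketch using Dijkstra's algorithm. The function name, the graph representation, and the topology are all made up for illustration; a real deployment might instead rely on distributed computation such as Flex-Algo.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# node -> neighbor -> measured one-way link delay in milliseconds
Graph = Dict[str, Dict[str, float]]


def lowest_delay_path(graph: Graph, src: str, dst: str,
                      max_delay_ms: Optional[float] = None
                      ) -> Optional[Tuple[float, List[str]]]:
    """Return (total_delay_ms, path), or None if dst is unreachable
    or the best path exceeds max_delay_ms."""
    best: Dict[str, float] = {src: 0.0}
    prev: Dict[str, str] = {}
    queue: List[Tuple[float, str]] = [(0.0, src)]
    while queue:
        delay, node = heapq.heappop(queue)
        if delay > best.get(node, float("inf")):
            continue  # stale queue entry, a better delay was already found
        if node == dst:
            break
        for neighbor, link_delay in graph.get(node, {}).items():
            candidate = delay + link_delay
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if dst not in best:
        return None
    if max_delay_ms is not None and best[dst] > max_delay_ms:
        return None
    # Walk predecessors back from dst to rebuild the path.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return best[dst], path


# Hypothetical topology: the one-hop path A->D is "shortest" but slowest.
delays: Graph = {
    "A": {"D": 9.0, "B": 2.0},
    "B": {"C": 2.0},
    "C": {"D": 2.0},
    "D": {},
}
result = lowest_delay_path(delays, "A", "D")  # (6.0, ["A", "B", "C", "D"])
```

In this topology the three-hop path wins at 6 ms over the direct 9 ms link, which is the "shortest is not always fastest" gap described in the Problem section. Passing `max_delay_ms=5.0` returns `None`, signalling that no path can meet the bound.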

## Related Standards & Background Reading

**SRv6 Data Plane & Traffic Engineering**
- [RFC 8986 — SRv6 Network Programming](https://datatracker.ietf.org/doc/html/rfc8986) — Galactic's data plane; defines how SRv6 encodes forwarding instructions in IPv6 headers
- [RFC 9256 — SR Policy Architecture](https://datatracker.ietf.org/doc/html/rfc9256) — How "Color" values map tenant intent to specific forwarding paths
- [RFC 9350 — IGP Flexible Algorithm](https://datatracker.ietf.org/doc/html/rfc9350) — Enables delay-optimized path computation without a central controller
- [RFC 9843 — Flex-Algo Bandwidth, Delay, Metrics and Constraints](https://datatracker.ietf.org/doc/html/rfc9843) — Extends Flex-Algo with max-delay and min-bandwidth link pruning

**Delay Measurement**
- [RFC 8762 — STAMP](https://datatracker.ietf.org/doc/html/rfc8762) — How nodes measure real-time latency between each other
- [RFC 9503 — STAMP for Segment Routing](https://datatracker.ietf.org/doc/html/rfc9503) — Extends STAMP to measure delay along specific SRv6 segment lists
- [RFC 8570 — IS-IS TE Metric Extensions](https://datatracker.ietf.org/doc/html/rfc8570) — How measured delay is flooded through the routing protocol
- [RFC 8571 — BGP-LS TE Performance Metrics](https://datatracker.ietf.org/doc/html/rfc8571) — How delay metrics reach SDN controllers via BGP

**Performance-Aware Routing**
- [draft-ietf-idr-performance-routing](https://datatracker.ietf.org/doc/draft-ietf-idr-performance-routing/) — Adds latency as a BGP best-path selection criterion
- [draft-ietf-idr-bgp-car — Color-Aware Routing](https://datatracker.ietf.org/doc/draft-ietf-idr-bgp-car/) — Extends intent-aware routing across domain boundaries

**Policy & Intent Frameworks**
- [RFC 9315 — Intent-Based Networking](https://datatracker.ietf.org/doc/html/rfc9315) — Formal vocabulary for declarative, outcome-driven network policies
- [RFC 9543 — IETF Network Slicing Framework](https://datatracker.ietf.org/doc/html/rfc9543) — Framework for network slices with SLO guarantees (latency, bandwidth, loss)

**Prior Art & Research**
- [Oxide RFD 347 — DDM (Delay Driven Multipath)](https://rfd.shared.oxide.computer/rfd/0347) — Novel protocol using real-time delay as the primary routing signal
- [Google Espresso (SIGCOMM 2017)](https://research.google/pubs/taking-the-edge-off-with-espresso-scale-reliability-and-programmability-for-global-internet-peering/) — How Google's peering edge serves over 22% of its traffic to the Internet using real-time performance measurement
- [Swift — Delay is Simple and Effective (SIGCOMM 2020)](https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/) — Validates delay as a simple, effective signal at datacenter scale

## Open Questions

- What latency granularity do customers actually need? (per-VPC, per-workload, per-connection?)
- Should latency preferences be a VPC-level setting or expressed through a separate policy resource?
- How should latency data be exposed to customers? (metrics dashboard, API, status on VPC resource?)
- What is the right default — should all VPCs get basic latency optimization, or is it opt-in?
