fix(ateom-gvisor): restore link-scope routes before gateway routes by brandonrjacobs · Pull Request #2 · coreweave/substrate

brandonrjacobs · 2026-05-21T15:07:00Z

TL;DR

restoreLink in ateom-gvisor iterates the saved routes in the order
returned by netlink.RouteList and calls netlink.RouteReplace on each
one. On Cilium clusters (and any CNI that assigns a /32 pod IP with a
link-scope route to a gateway in the same prefix), this order causes the
kernel to reject the default-gateway route with ENETUNREACH
("Network is unreachable") because the link-scope route to the gateway
hasn't been installed yet.

Fix: stable-sort the route list so routes with no gateway are restored
before routes that have a gateway. No-op on CNIs where the gateway is
already in a connected /N subnet of the pod's address (kindnet, classic
kubenet, etc.).

Problem

On a Cilium-based Kubernetes cluster, every attempt to resume an actor
fails on the first POST with:

HTTP/1.1 500 Internal Server Error
error resuming actor my-counter-1: rpc error: code = Unknown desc =
while calling ateom.RunWorkload:
rpc error: code = Unknown desc =
while restoring eth0 in interior netns:
while executing function in target netns:
while restoring eth0 routes and addresses in interior netns:
while restoring route 0: network is unreachable

Every subsequent POST then fails with a follow-on error, because the
first attempt has already moved eth0 out of the worker pod's netns:

while calling ateom.RunWorkload: rpc error: code = Unknown desc =
while getting netlink link for eth0: Link not found

Root cause

When ateom-gvisor starts, it scrapeLinks the worker pod's eth0 so
it can later recreate the same IPs and routes inside the interior
(gVisor) netns. On a Cilium pod, the saved state looks like:

addresses: [10.8.0.201/32, fe80::…]
routes:
  [0] 0.0.0.0/0 via 10.8.0.157 dev eth0    ← default route (gateway = .157)
  [1] 10.8.0.157/32 dev eth0 scope link    ← link-scope route to the gateway
  [2] fe80::/64 dev eth0

This is the standard Cilium pod-routing layout: each pod gets a /32,
and the "gateway" is reachable only via a link-scope /32 route
(Cilium's eBPF datapath handles ARP for it on the lxc-side veth). See
Cilium routing concepts
for background on the eBPF-managed datapath.

On first RunWorkload, restoreLink iterates info.Routes in the
order netlink.RouteList returned them and calls
netlink.RouteReplace on each. Index 0 is the default route. The
kernel's fib_check_nh_v4_gw function (net/ipv4/fib_semantics.c)
validates the route's nexthop by running a fib_lookup on the gateway IP:

struct flowi4 fl4 = { .daddr = nh->fib_nh_gw4, ... };
err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE);
if (err) {
    NL_SET_ERR_MSG(extack, "Nexthop has invalid gateway");
    goto out;
}

Because the link-scope route to 10.8.0.157 hasn't been installed
yet, fib_lookup fails and the kernel returns -ENETUNREACH. The Go
netlink call surfaces that errno as "network is unreachable",
exactly what we see.

After this failure, eth0 has already been LinkSetNsFd-moved into
the interior netns (that happens before restoreLink runs). It's
now in a half-set-up state in the interior netns and gone from the
outer pod's netns, which is why subsequent resume attempts return
"Link not found".

This does not reproduce on the kind setup because kindnet gives pods a
normal connected subnet route (10.244.X.0/24 dev eth0 scope link),
so the gateway falls under that connected route and the default
route's fib_lookup succeeds regardless of order. The Cilium-style
/32 pod IP with a /32 link-scope gateway makes the order
load-bearing.

Fix

Sort info.Routes so routes with no gateway (link-scope direct
routes) are installed before routes that have a gateway. This
guarantees the kernel can resolve any gateway's nexthop at the moment
the gatewayed route is installed.

sort.SliceStable preserves the relative order of routes within the
two buckets, so IPv6 link-local routes and other no-gateway entries
stay in the order netlink.RouteList returned them.

Why this is safe for non-Cilium CNIs

- kindnet / classic kubenet: the pod has IP/24 and a connected
  route subnet/24 dev eth0 scope link. After sorting, the link-scope
  subnet route is installed first (same as today), then the default
  route, which still works fine.
- Flannel host-gw: similar. The pod has a connected route
  covering its subnet before the default route, so the sort is a
  no-op in effect.
- Calico: pods typically get a /32 address with a link-scope route
  to a link-local gateway (169.254.1.1), which is the same shape as
  the Cilium case. The sort should help there rather than be a
  no-op; this was not tested in this PR.
- CNIs that use RTNH_F_ONLINK ("onlink") routes: these bypass
  fib_check_nh_v4_gw's gateway lookup entirely. Order doesn't matter
  for them.

Reproduction

On a Cilium-based Kubernetes cluster (tested on Linux 6.8,
Kubernetes v1.35):

1. Install Substrate using the standard install path.
2. Deploy the counter demo.
3. kubectl ate create actor my-counter-1 --template ate-demo-counter/counter
4. kubectl port-forward -n ate-system svc/atenet-router 8000:80 &
5. curl -X POST -H "Host: my-counter-1.actors.resources.substrate.ate.dev" -i http://localhost:8000/

Expected (before fix): 500 with "network is unreachable" on the
first POST, "Link not found" on subsequent POSTs.
Expected (after fix): 200 OK, counter increments across
suspend/resume cycles.

Verification

Captured route list before the patch (from atelet logs, formatted):

Routes: [
  {Dst: 0.0.0.0/0,        Gw: 10.8.0.157, Scope: 0},
  {Dst: 10.8.0.157/32,    Gw: <nil>,      Scope: 253 (RT_SCOPE_LINK)},
  {Dst: fe80::/64,        Gw: <nil>,      Scope: 0}
]

Restore order after the patch (from Restoring route log lines):

Restoring route   dst=10.8.0.157/32   gateway=<nil>      ← was index 1, now first
Restoring route   dst=fe80::/64        gateway=<nil>      ← was index 2
Restoring route   dst=0.0.0.0/0        gateway=10.8.0.157 ← was index 0, now last

Counter resumed and incremented correctly across sequential POSTs,
confirming the actor's network namespace was restored intact.

Caveats / follow-ups (not in this PR)

- Cilium clusters separately need the actor's interior netns to be
  able to resolve its gateway's MAC. Even after the routes restore
  correctly, the first packet the actor sends still has to resolve
  the gateway via ARP, and that ARP doesn't reach Cilium's eBPF
  responder when sent from the moved-into interior netns. The
  right fix is for ateom-gvisor to install a permanent neighbor
  entry inside the interior netns pointing at the host-side veth
  peer's MAC, mirroring how Cilium would handle the request via
  eBPF for a normal pod. That work belongs in a follow-up PR: it
  needs to capture the peer MAC before the netns move and NeighSet
  it after the route restore.

  As a stopgap during testing we enabled proxy_arp=1 on the
  host-side lxc* veths via a DaemonSet so the host kernel
  answers the ARP. This works but is not the right long-term
  fix: it bypasses Cilium's intended eBPF-only resolution path
  and could let any pod whose eth0 setup deviates from Cilium's
  defaults resolve neighbors through the host kernel rather than
  the datapath. We recommend the permanent-neighbor approach
  in the follow-up PR instead.

  The kind cluster setup script enables proxy_arp host-wide
  (hack/create-kind-cluster.sh), which is fine for a local
  development cluster but not appropriate for a Cilium-managed
  production cluster.
- restoreLink does not currently handle topologically dependent
  route chains (e.g. route A's gateway requires route B's gateway
  to already be installed). The two-bucket sort here handles the
  common case (link-scope first, then gateway). If a CNI ever
  produces a longer dependency chain, a real topological sort would
  be needed.
- Failure cleanup is asymmetric: if restoreLink fails, eth0
  has already been moved into the interior netns and is left there.
  The pod is then unusable until restarted because the next attempt
  sees no eth0 in the worker netns. Worth a follow-up to either
  roll the link back on failure, or to do the netns move after a
  successful route validation.

Test plan

- go build ./cmd/servers/ateom-gvisor succeeds.
- Reproduce the failure on a Cilium-based v1.35 cluster: the first
  POST returns "network is unreachable".
- Apply patch, redeploy ateom-gvisor image, redeploy counter
  demo to pick up new image.
- Create actor, POST repeatedly: counter increments across
  suspend/resume cycles.
- Regression-test on kind cluster (kindnet CNI): counter demo
  still works end-to-end.

…rate#24)

Incorporate Agent Executor as a demonstrative example of a distributed agent runtime and harness built on Agent Substrate.

ISSUE=None

Fixes #<issue_number_goes_here>
> It's a good idea to open an issue first for discussion.

- [ ] Tests pass
- [ ] Appropriate changes to documentation are included in the PR

Co-authored-by: Maya Wang <mymaya@google.com>

…trate#22)

## Summary

`s3Client.PutObject` called `CreateBucket` before every upload. Convenient against local dev backends like rustfs/minio where buckets may not pre-exist, but against managed S3-compatible backends the caller typically lacks `s3:CreateBucket`, so each snapshot upload paid for a 403 in added latency and audit-log noise.

For local kind dev, add a one-shot Job alongside the rustfs Deployment that creates the `ate-snapshots` bucket once at install time.

## Validation

Verified locally with a kind cluster:

- Applied namespace + `manifests/ate-install/kind/rustfs.yaml`
- `rustfs-bucket-init` Job retried once while rustfs came up, then created `ate-snapshots`
- `aws s3api list-buckets` against rustfs returns `ate-snapshots`

---------

Co-authored-by: Benjamin Elder <bentheelder@google.com>

…agent-substrate#37)

- Add a "Local (kind)" subsection alongside the existing GKE tracing recipe.
- Add a note explaining why `kubectl get actor` and `kubectl get worker` return nothing (they live in valkey, not as CRDs).
- Add output-column glossaries for `kubectl ate get actor` and `kubectl ate get worker`.
- Add a "Logs" section covering the `kubectl ate logs actors <id>` form.

Fixes #<issue_number_goes_here>
> It's a good idea to open an issue first for discussion.

- [ ] Tests pass
- [x] Appropriate changes to documentation are included in the PR

Signed-off-by: Davanum Srinivas <davanum@gmail.com>

CNIs that assign a /32 pod IP with a link-scope route to a gateway in the same prefix (Cilium being the headline case) save the default route before the link-scope route in netlink.RouteList output. restoreLink iterated in that order, causing the kernel's fib_check_nh_v4_gw to reject the default route with ENETUNREACH ("network is unreachable") because the route to the gateway had not been installed yet. Stable-sort the saved routes so no-gateway routes are installed first. No-op on CNIs where the gateway is covered by a connected subnet route already (kindnet, classic kubenet) because the existing order already satisfies the predicate.

thockin and others added 7 commits May 20, 2026 14:00

Remove merge markers from LICENSE (agent-substrate#14)

87a0a0a

This is my fault. I hang my head in shame.

declare initial kubernetes version support intent (agent-substrate#25)

b80031d

Discussed this morning with other maintainers. We know we need to test more with 1.36 (see agent-substrate#8), this is just the declaration of intent to start.

Fix atelet otel kustomize (agent-substrate#39)

ef436e9

We were doing it for `ate-apiserver`, but not `atelet`. This results in `atelet` failing to ingest its own metrics. Fixes agent-substrate#38

brandonrjacobs force-pushed the bjacobs/fix-cilium-route-restore-order branch from 48cd849 to 43cc146 Compare May 21, 2026 15:40

brandonrjacobs mentioned this pull request May 21, 2026: fix(ateom-gvisor): pin gateway MAC via permanent neighbor entries #3 (Open, 4 tasks)

brandonrjacobs force-pushed the bjacobs/fix-cilium-route-restore-order branch from 43cc146 to 41c6eb2 Compare May 21, 2026 16:17

brandonrjacobs wants to merge 8 commits into main from bjacobs/fix-cilium-route-restore-order
