fix(ateom-gvisor): restore link-scope routes before gateway routes#2

Open
brandonrjacobs wants to merge 8 commits into main from bjacobs/fix-cilium-route-restore-order

@brandonrjacobs brandonrjacobs commented May 21, 2026

TL;DR

restoreLink in ateom-gvisor iterates the saved routes in the order
returned by netlink.RouteList and calls netlink.RouteReplace on each
one. On Cilium clusters (and any CNI that assigns a /32 pod IP with a
link-scope route to a gateway in the same prefix), this order causes the
kernel to reject the default-gateway route with ENETUNREACH
("Network is unreachable") because the link-scope route to the gateway
hasn't been installed yet.

Fix: stable-sort the route list so routes with no gateway are restored
before routes that have a gateway. No-op on CNIs where the gateway is
already in a connected /N subnet of the pod's address (kindnet, classic
kubenet, etc.).

Problem

On a Cilium-based Kubernetes cluster, every attempt to resume an actor
fails on the first POST with:

HTTP/1.1 500 Internal Server Error
error resuming actor my-counter-1: rpc error: code = Unknown desc =
while calling ateom.RunWorkload:
rpc error: code = Unknown desc =
while restoring eth0 in interior netns:
while executing function in target netns:
while restoring eth0 routes and addresses in interior netns:
while restoring route 0: network is unreachable

Every subsequent POST then fails with a follow-on error, because the
first attempt has already moved eth0 out of the worker pod's netns:

while calling ateom.RunWorkload: rpc error: code = Unknown desc =
while getting netlink link for eth0: Link not found

Root cause

When ateom-gvisor starts, it runs scrapeLinks on the worker pod's eth0 so
it can later recreate the same IPs and routes inside the interior
(gVisor) netns. On a Cilium pod, the saved state looks like:

addresses: [10.8.0.201/32, fe80::…]
routes:
  [0] 0.0.0.0/0 via 10.8.0.157 dev eth0    ← default route (gateway = .157)
  [1] 10.8.0.157/32 dev eth0 scope link    ← link-scope route to the gateway
  [2] fe80::/64 dev eth0

This is the standard Cilium pod-routing layout: each pod gets a /32,
and the "gateway" is reachable only via a link-scope /32 route
(Cilium's eBPF datapath handles ARP for it on the lxc-side veth). See
Cilium routing concepts
for background on the eBPF-managed datapath.

On first RunWorkload, restoreLink iterates info.Routes in the
order netlink.RouteList returned them and calls
netlink.RouteReplace on each. Index 0 is the default route. The
kernel's fib_check_nh_v4_gw function (net/ipv4/fib_semantics.c)
validates this by running a fib_lookup on the gateway IP:

struct flowi4 fl4 = { .daddr = nh->fib_nh_gw4, ... };
err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE);
if (err) {
    NL_SET_ERR_MSG(extack, "Nexthop has invalid gateway");
    goto out;
}

Because the link-scope route to 10.8.0.157 hasn't been installed
yet, fib_lookup fails and the kernel returns -ENETUNREACH, which
netlink surfaces as "network is unreachable": exactly what we see.

After this failure, eth0 has already been LinkSetNsFd-moved into
the interior netns (that happens before restoreLink runs). It's
now in a half-set-up state in the interior netns and gone from the
outer pod's netns, which is why subsequent resume attempts return
"Link not found".

This does not reproduce on the kind setup because kindnet gives pods a
normal connected subnet route (10.244.X.0/24 dev eth0 scope link),
so the gateway falls under that connected route and the default
route's fib_lookup succeeds regardless of order. The Cilium-style
/32 pod IP with a /32 link-scope gateway makes the order
load-bearing.
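
The order dependence described above can be reproduced with a toy model. The sketch below is not the kernel's logic: it assumes a single simplified rule (a gatewayed route is accepted only if an already-installed direct route covers the gateway address). The names route, toyFIB, restoreAll, and savedCiliumRoutes are hypothetical and exist only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errNetUnreach stands in for the kernel's ENETUNREACH in this toy model.
var errNetUnreach = errors.New("network is unreachable")

// route is a simplified, hypothetical stand-in for a saved route.
type route struct {
	dst *net.IPNet // destination prefix
	gw  net.IP     // nil for a direct (no-gateway) route
}

// toyFIB models one assumed rule only: a gatewayed route is accepted only if
// an already-installed direct route covers the gateway address.
type toyFIB struct {
	installed []route
}

func (f *toyFIB) add(r route) error {
	if r.gw != nil && !f.coversDirectly(r.gw) {
		return errNetUnreach
	}
	f.installed = append(f.installed, r)
	return nil
}

func (f *toyFIB) coversDirectly(ip net.IP) bool {
	for _, r := range f.installed {
		if r.gw == nil && r.dst.Contains(ip) {
			return true
		}
	}
	return false
}

func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

// savedCiliumRoutes mirrors the saved state shown in this PR description.
func savedCiliumRoutes() []route {
	return []route{
		{dst: mustCIDR("0.0.0.0/0"), gw: net.ParseIP("10.8.0.157")},
		{dst: mustCIDR("10.8.0.157/32")},
		{dst: mustCIDR("fe80::/64")},
	}
}

// restoreAll installs routes in slice order and reports the first failure.
func restoreAll(routes []route) error {
	fib := &toyFIB{}
	for i, r := range routes {
		if err := fib.add(r); err != nil {
			return fmt.Errorf("while restoring route %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	saved := savedCiliumRoutes()
	fmt.Println(restoreAll(saved)) // prints "while restoring route 0: network is unreachable"

	reordered := []route{saved[1], saved[2], saved[0]}
	fmt.Println(restoreAll(reordered)) // prints "<nil>"
}
```

In this model, a kindnet-style list would succeed in either order once the connected subnet route is present, which matches the observation that the failure only shows up with the /32 layout.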

Fix

Sort info.Routes so routes with no gateway (link-scope direct
routes) are installed before routes that have a gateway. This
guarantees the kernel can resolve any gateway's nexthop at the moment
the gatewayed route is installed.

sort.SliceStable preserves the relative order of routes within the
two buckets, so IPv6 link-local routes and other no-gateway entries
stay in the order Cilium provided them.
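
A minimal sketch of the two-bucket stable sort follows. It is not the patch itself: savedRoute is a hypothetical stand-in that models only the destination and gateway, and sortDirectFirst and exampleRoutes are illustrative names. The comparison function treats "no gateway" as sorting before "has gateway" and leaves everything else to sort.SliceStable.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// savedRoute is a hypothetical stand-in for a saved route entry; only the
// fields the sort looks at are modelled.
type savedRoute struct {
	Dst string
	Gw  net.IP // nil means "no gateway" (direct route)
}

// sortDirectFirst reorders routes in place so that every route without a
// gateway precedes every route with one. sort.SliceStable keeps the original
// relative order inside each of the two groups.
func sortDirectFirst(routes []savedRoute) {
	sort.SliceStable(routes, func(i, j int) bool {
		return routes[i].Gw == nil && routes[j].Gw != nil
	})
}

// exampleRoutes mirrors the captured route list from this PR description.
func exampleRoutes() []savedRoute {
	return []savedRoute{
		{Dst: "0.0.0.0/0", Gw: net.ParseIP("10.8.0.157")},
		{Dst: "10.8.0.157/32"},
		{Dst: "fe80::/64"},
	}
}

func main() {
	routes := exampleRoutes()
	sortDirectFirst(routes)
	for _, r := range routes {
		fmt.Printf("dst=%s gateway=%v\n", r.Dst, r.Gw)
	}
}
```

If the input already has every no-gateway route ahead of every gatewayed route, the comparison never asks for a swap and the slice comes back unchanged.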

Why this is safe for non-Cilium CNIs

  • kindnet / classic kubenet: the pod has IP/24 and a connected
    route subnet/24 dev eth0 scope link. After sorting, the link-scope
    subnet route is installed first (same as today), then the default
    route, which still works fine.
  • Calico / Flannel host-gw: similar — pod has a connected route
    covering its subnet before the default route. Sort is a no-op in
    effect.
  • CNIs that use RTNH_F_ONLINK ("onlink") routes: these bypass
    fib_check_nh_v4_gw's gateway lookup entirely. Order doesn't matter
    for them.

Reproduction

On a Cilium-based Kubernetes cluster (tested on Linux 6.8,
Kubernetes v1.35):

  1. Install Substrate using the standard install path.
  2. Deploy the counter demo.
  3. kubectl ate create actor my-counter-1 --template ate-demo-counter/counter
  4. kubectl port-forward -n ate-system svc/atenet-router 8000:80 &
  5. curl -X POST -H "Host: my-counter-1.actors.resources.substrate.ate.dev" -i http://localhost:8000/

Expected (before fix): 500 with "network is unreachable" on the
first POST, "Link not found" on subsequent POSTs.
Expected (after fix): 200 OK, counter increments across
suspend/resume cycles.

Verification

Captured route list before the patch (from atelet logs, formatted):

Routes: [
  {Dst: 0.0.0.0/0,        Gw: 10.8.0.157, Scope: 0},
  {Dst: 10.8.0.157/32,    Gw: <nil>,      Scope: 253 (RT_SCOPE_LINK)},
  {Dst: fe80::/64,        Gw: <nil>,      Scope: 0}
]

Restore order after the patch (from Restoring route log lines):

Restoring route   dst=10.8.0.157/32   gateway=<nil>      ← was index 1, now first
Restoring route   dst=fe80::/64        gateway=<nil>      ← was index 2
Restoring route   dst=0.0.0.0/0        gateway=10.8.0.157 ← was index 0, now last

Counter resumed and incremented correctly across sequential POSTs,
confirming the actor's network namespace was restored intact.

Caveats / follow-ups (not in this PR)

  • Cilium clusters separately need the actor's interior netns to be
    able to resolve its gateway's MAC. Even after the routes restore
    correctly, the first packet the actor sends still has to resolve
    the gateway via ARP, and that ARP doesn't reach Cilium's eBPF
    responder when sent from the moved-into interior netns. The
    right fix is for ateom-gvisor to install a permanent neighbor
    entry inside the interior netns pointing at the host-side veth
    peer's MAC, mirroring how Cilium would handle the request via
    eBPF for a normal pod. That work belongs in a follow-up PR: it
    needs to capture the peer MAC before the netns move and NeighSet
    it after the route restore.

    As a stopgap during testing we enabled proxy_arp=1 on the
    host-side lxc* veths via a DaemonSet so the host kernel
    answers the ARP. This works but is not the right long-term
    fix: it bypasses Cilium's intended eBPF-only resolution path
    and could let any pod whose eth0 setup deviates from Cilium's
    defaults resolve neighbors through the host kernel rather than
    the datapath. Recommend using the permanent-neighbor approach
    in the follow-up PR instead.

    The kind cluster setup script enables proxy_arp host-wide
    (hack/create-kind-cluster.sh), which is fine for a local
    development cluster but not appropriate for a Cilium-managed
    production cluster.

  • restoreLink does not currently handle topologically-dependent
    route chains (e.g. route A's gateway requires route B's gateway
    to already be installed). The two-bucket sort here handles the
    common case (link-scope first, then gateway). If a CNI ever
    produces a longer dependency chain, a real topological sort would
    be needed.

  • Failure cleanup is asymmetric: if restoreLink fails, eth0
    has already been moved into the interior netns and is left there.
    The pod is then unusable until restarted because the next attempt
    sees no eth0 in the worker netns. Worth a follow-up to either
    roll the link back on failure, or to do the netns move after a
    successful route validation.
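
For the dependency-chain caveat above, a dependency-aware ordering could look roughly like the sketch below. It is not part of this PR, and it assumes a simplified rule for illustration: a gatewayed route may be installed once any already-placed route's prefix contains its gateway. The kernel's actual nexthop checks are stricter. The names route, orderByDependency, and chainExample are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// route is a simplified, hypothetical stand-in for a saved route.
type route struct {
	dst *net.IPNet
	gw  net.IP // nil for a direct (no-gateway) route
}

func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

// covered reports whether any already-placed route's prefix contains ip.
func covered(placed []route, ip net.IP) bool {
	for _, r := range placed {
		if r.dst.Contains(ip) {
			return true
		}
	}
	return false
}

// orderByDependency repeatedly sweeps the input, placing every route whose
// gateway is absent or already covered, until all routes are placed or a
// sweep makes no progress.
func orderByDependency(routes []route) ([]route, error) {
	ordered := make([]route, 0, len(routes))
	placed := make([]bool, len(routes))
	for len(ordered) < len(routes) {
		progress := false
		for i, r := range routes {
			if placed[i] {
				continue
			}
			if r.gw == nil || covered(ordered, r.gw) {
				ordered = append(ordered, r)
				placed[i] = true
				progress = true
			}
		}
		if !progress {
			return nil, errors.New("unresolvable gateway dependency")
		}
	}
	return ordered, nil
}

// chainExample is an invented three-route chain: A depends on B, B on C.
func chainExample() []route {
	return []route{
		{dst: mustCIDR("192.168.0.0/16"), gw: net.ParseIP("10.0.1.1")}, // A
		{dst: mustCIDR("10.0.1.0/24"), gw: net.ParseIP("10.8.0.157")},  // B
		{dst: mustCIDR("10.8.0.157/32")},                               // C
	}
}

func main() {
	ordered, err := orderByDependency(chainExample())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range ordered {
		fmt.Printf("dst=%s gateway=%v\n", r.dst, r.gw)
	}
}
```

On this invented chain a two-bucket sort would place A before B, whereas the sweep yields C, B, A. Running such an ordering pass before the netns move would also give an early signal that a saved route list cannot be restored, which relates to the failure-cleanup caveat.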

Test plan

  • go build ./cmd/servers/ateom-gvisor succeeds.
  • Reproduce failure on a Cilium-based v1.35 cluster — get
    "network is unreachable".
  • Apply patch, redeploy ateom-gvisor image, redeploy counter
    demo to pick up new image.
  • Create actor, POST repeatedly — counter increments across
    suspend/resume cycles.
  • Regression-test on kind cluster (kindnet CNI) — counter demo
    still works end-to-end.

thockin and others added 7 commits May 20, 2026 14:00
This is my fault.  I hang my head in shame.
…rate#24)

Incorporate Agent Executor as a demonstrative example of a distributed
agent runtime and harness built on Agent Substrate.

ISSUE=None

Fixes #<issue_number_goes_here>

> It's a good idea to open an issue first for discussion.

- [ ] Tests pass
- [ ] Appropriate changes to documentation are included in the PR

Co-authored-by: Maya Wang <mymaya@google.com>
Discussed this morning with other maintainers.

We know we need to test more with 1.36 (see agent-substrate#8), this is just the
declaration of intent to start.
…trate#22)

## Summary

`s3Client.PutObject` called `CreateBucket` before every upload.
Convenient against local dev backends like rustfs/minio where buckets
may not pre-exist, but against managed S3-compatible backends the caller
typically lacks `s3:CreateBucket`, so each snapshot upload paid for a
403 in added latency and audit-log noise.

For local kind dev, add a one-shot Job alongside the rustfs Deployment
that creates the `ate-snapshots` bucket once at install time.

## Validation

Verified locally with a kind cluster:

- Applied namespace + `manifests/ate-install/kind/rustfs.yaml`
- `rustfs-bucket-init` Job retried once while rustfs came up, then
created `ate-snapshots`
- `aws s3api list-buckets` against rustfs returns `ate-snapshots`

---------

Co-authored-by: Benjamin Elder <bentheelder@google.com>
…agent-substrate#37)

- Add a "Local (kind)" subsection alongside the existing GKE tracing
recipe.
- Add a note explaining why `kubectl get actor` and `kubectl get worker`
return nothing (they live in valkey, not as CRDs).
- Add output-column glossaries for `kubectl ate get actor` and `kubectl
ate get worker`.
- Add a "Logs" section covering the `kubectl ate logs actors <id>` form.

Fixes #<issue_number_goes_here>

> It's a good idea to open an issue first for discussion.

- [ ] Tests pass
- [x] Appropriate changes to documentation are included in the PR

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
We were doing it for `ate-apiserver`, but not `atelet`. This results in
`atelet` failing to ingest its own metrics.

Fixes agent-substrate#38
Before
<img width="2560" height="447" alt="2026-05-21_08-00-39"
src="https://github.com/user-attachments/assets/3242ec19-9783-4727-80b3-24136ff5a3fc"
/>

After
<img width="2600" height="807" alt="2026-05-21_08-00-57"
src="https://github.com/user-attachments/assets/c6892106-a9de-428a-b41d-6fdcbc7c39bd"
/>


> It's a good idea to open an issue first for discussion.

- [x] Tests pass
- [x] Appropriate changes to documentation are included in the PR
CNIs that assign a /32 pod IP with a link-scope route to a gateway in
the same prefix (Cilium being the headline case) save the default
route before the link-scope route in netlink.RouteList output.
restoreLink iterated in that order, causing the kernel's
fib_check_nh_v4_gw to reject the default route with ENETUNREACH
("network is unreachable") because the route to the gateway had not
been installed yet.

Stable-sort the saved routes so no-gateway routes are installed
first. No-op on CNIs where the gateway is covered by a connected
subnet route already (kindnet, classic kubenet) because the existing
order already satisfies the predicate.
@brandonrjacobs brandonrjacobs force-pushed the bjacobs/fix-cilium-route-restore-order branch from 43cc146 to 41c6eb2 Compare May 21, 2026 16:17
7 participants