fix(ateom-gvisor): restore link-scope routes before gateway routes#2
Open
brandonrjacobs wants to merge 8 commits into
This is my fault. I hang my head in shame.
…rate#24) Incorporate Agent Executor as a demonstrative example of a distributed agent runtime and harness built on Agent Substrate. ISSUE=None Fixes #<issue_number_goes_here> > It's a good idea to open an issue first for discussion. - [ ] Tests pass - [ ] Appropriate changes to documentation are included in the PR Co-authored-by: Maya Wang <mymaya@google.com>
Discussed this morning with other maintainers. We know we need to test more with 1.36 (see agent-substrate#8), this is just the declaration of intent to start.
…trate#22) ## Summary `s3Client.PutObject` called `CreateBucket` before every upload. Convenient against local dev backends like rustfs/minio where buckets may not pre-exist, but against managed S3-compatible backends the caller typically lacks `s3:CreateBucket`, so each snapshot upload paid for a 403 in added latency and audit-log noise. For local kind dev, add a one-shot Job alongside the rustfs Deployment that creates the `ate-snapshots` bucket once at install time. ## Validation Verified locally with a kind cluster: - Applied namespace + `manifests/ate-install/kind/rustfs.yaml` - `rustfs-bucket-init` Job retried once while rustfs came up, then created `ate-snapshots` - `aws s3api list-buckets` against rustfs returns `ate-snapshots` --------- Co-authored-by: Benjamin Elder <bentheelder@google.com>
…agent-substrate#37) - Add a "Local (kind)" subsection alongside the existing GKE tracing recipe. - Add a note explaining why `kubectl get actor` and `kubectl get worker` return nothing (they live in valkey, not as CRDs). - Add output-column glossaries for `kubectl ate get actor` and `kubectl ate get worker`. - Add a "Logs" section covering the `kubectl ate logs actors <id>` form. Fixes #<issue_number_goes_here> > It's a good idea to open an issue first for discussion. - [ ] Tests pass - [x] Appropriate changes to documentation are included in the PR Signed-off-by: Davanum Srinivas <davanum@gmail.com>
We were doing it for `ate-apiserver`, but not `atelet`. This results in `atelet` failing to ingest its own metrics. Fixes agent-substrate#38
Before <img width="2560" height="447" alt="2026-05-21_08-00-39" src="https://github.com/user-attachments/assets/3242ec19-9783-4727-80b3-24136ff5a3fc" /> After <img width="2600" height="807" alt="2026-05-21_08-00-57" src="https://github.com/user-attachments/assets/c6892106-a9de-428a-b41d-6fdcbc7c39bd" /> > It's a good idea to open an issue first for discussion. - [x] Tests pass - [x] Appropriate changes to documentation are included in the PR
CNIs that assign a /32 pod IP with a link-scope route to a gateway in
the same prefix (Cilium being the headline case) list the default
route before the link-scope route in netlink.RouteList output.
restoreLink iterated in that order, causing the kernel's
fib_check_nh_v4_gw to reject the default route with ENETUNREACH
("network is unreachable") because the route to the gateway had not
been installed yet.
Stable-sort the saved routes so no-gateway routes are installed
first. No-op on CNIs where the gateway is covered by a connected
subnet route already (kindnet, classic kubenet) because the existing
order already satisfies the predicate.
TL;DR
`restoreLink` in `ateom-gvisor` iterates the saved routes in the order
returned by `netlink.RouteList` and calls `netlink.RouteReplace` on each
one. On Cilium clusters (and any CNI that assigns a /32 pod IP with a
link-scope route to a gateway in the same prefix), this order causes the
kernel to reject the default-gateway route with `ENETUNREACH`
("Network is unreachable") because the link-scope route to the gateway
hasn't been installed yet.
Fix: stable-sort the route list so routes with no gateway are restored
before routes that have a gateway. No-op on CNIs where the gateway is
already in a connected /N subnet of the pod's address (kindnet, classic
kubenet, etc.).
Problem
On a Cilium-based Kubernetes cluster, every attempt to resume an actor
fails on the first POST with:
Every subsequent POST then fails with a follow-on error, because the
first attempt has already moved `eth0` out of the worker pod's netns:
Root cause
When `ateom-gvisor` starts, it `scrapeLink`s the worker pod's `eth0` so
it can later recreate the same IPs and routes inside the interior
(gVisor) netns. On a Cilium pod, the saved state looks like:
This is the standard Cilium pod-routing layout: each pod gets a `/32`,
and the "gateway" is reachable only via a link-scope `/32` route
(Cilium's eBPF datapath handles ARP for it on the lxc-side veth). See
Cilium routing concepts for background on the eBPF-managed datapath.
On first `RunWorkload`, `restoreLink` iterates `info.Routes` in the
order `netlink.RouteList` returned them and calls `netlink.RouteReplace`
on each. Index 0 is the default route. The kernel's
`fib_check_nh_v4_gw` function (`net/ipv4/fib_semantics.c`) validates the
route's gateway by running a `fib_lookup` on the gateway IP:
Because the link-scope route to `10.8.0.157` hasn't been installed
yet, `fib_lookup` fails, the kernel returns `-ENETUNREACH`, and
iproute2/netlink renders it as "Network is unreachable", which is
exactly what we see.
By the time this failure occurs, `eth0` has already been
`LinkSetNsFd`-moved into the interior netns (that happens before
`restoreLink` runs). It's now in a half-set-up state in the interior
netns and gone from the outer pod's netns, which is why subsequent
resume attempts return "Link not found".
This does not reproduce on the kind setup because kindnet gives pods a
normal connected subnet route (`10.244.X.0/24 dev eth0 scope link`),
so the gateway falls under that connected route and the default
route's `fib_lookup` succeeds regardless of order. The Cilium-style
`/32` pod IP with a `/32` link-scope gateway makes the order
load-bearing.
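The ordering dependency described above can be reproduced with a toy routing table. This is a minimal sketch, not the kernel's logic: the `route` and `fib` types, `installAll`, and `sampleRoutes` are hypothetical names invented for illustration, and the only rule modelled is that a gatewayed route is rejected unless an already-installed no-gateway route covers its gateway. The gateway address `10.8.0.157` is taken from the description; everything else is assumed.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// errNetUnreach stands in for the kernel's ENETUNREACH in this toy model.
var errNetUnreach = errors.New("network is unreachable")

// route is a hypothetical, simplified route: a destination prefix plus an
// optional gateway (the zero netip.Addr means "no gateway").
type route struct {
	Dst netip.Prefix
	Gw  netip.Addr
}

// fib is a toy routing table. It models only the "gateway must already be
// reachable" rule, not the real kernel FIB.
type fib struct{ routes []route }

// reachable reports whether an installed no-gateway route covers gw.
func (f *fib) reachable(gw netip.Addr) bool {
	for _, r := range f.routes {
		if !r.Gw.IsValid() && r.Dst.Contains(gw) {
			return true
		}
	}
	return false
}

// add installs r, rejecting a gatewayed route whose gateway is not yet
// covered by an installed no-gateway route.
func (f *fib) add(r route) error {
	if r.Gw.IsValid() && !f.reachable(r.Gw) {
		return errNetUnreach
	}
	f.routes = append(f.routes, r)
	return nil
}

// installAll installs routes in the given order and stops at the first error.
func installAll(routes []route) error {
	f := &fib{}
	for _, r := range routes {
		if err := f.add(r); err != nil {
			return fmt.Errorf("route %s: %w", r.Dst, err)
		}
	}
	return nil
}

// sampleRoutes returns a default route via 10.8.0.157 and the /32
// link-scope route to that gateway.
func sampleRoutes() (def, link route) {
	gw := netip.MustParseAddr("10.8.0.157")
	def = route{Dst: netip.MustParsePrefix("0.0.0.0/0"), Gw: gw}
	link = route{Dst: netip.PrefixFrom(gw, 32)}
	return def, link
}

func main() {
	def, link := sampleRoutes()
	fmt.Println("default first:", installAll([]route{def, link}))
	fmt.Println("link-scope first:", installAll([]route{link, def}))
}
```

In this model the first call fails on the default route and the second succeeds, which mirrors the difference between the unsorted and sorted restore order.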
Fix
Sort `info.Routes` so routes with no gateway (link-scope direct
routes) are installed before routes that have a gateway. This
guarantees the kernel can resolve any gateway's nexthop at the moment
the gatewayed route is installed.
`sort.SliceStable` preserves the relative order of routes within the
two buckets, so IPv6 link-local routes and other no-gateway entries
stay in the order Cilium provided them.
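A minimal sketch of the two-bucket stable sort follows. It is not the patch itself: the `route` struct is a hypothetical stand-in carrying only a label and a gateway, `sortNoGatewayFirst` and `sampleRoutes` are illustrative names, and the sample entries are assumed values rather than captured output.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// route is a hypothetical stand-in for the fields of a saved route that
// matter for ordering; the real code works on the saved netlink routes.
type route struct {
	Dst string // destination, kept as a string for readability
	Gw  net.IP // nil means "no gateway" (link-scope direct route)
}

// sortNoGatewayFirst moves routes without a gateway ahead of routes with
// one, keeping the original relative order inside each group.
func sortNoGatewayFirst(routes []route) {
	sort.SliceStable(routes, func(i, j int) bool {
		return routes[i].Gw == nil && routes[j].Gw != nil
	})
}

// sampleRoutes returns illustrative routes with the default route first.
func sampleRoutes() []route {
	return []route{
		{Dst: "default", Gw: net.ParseIP("10.8.0.157")},
		{Dst: "10.8.0.157/32"},
		{Dst: "fe80::/64"},
	}
}

func main() {
	routes := sampleRoutes()
	sortNoGatewayFirst(routes)
	for _, r := range routes {
		fmt.Println(r.Dst)
	}
}
```

The comparison only returns true when a no-gateway route is compared against a gatewayed one, so entries within the same bucket are never reordered.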
Why this is safe for non-Cilium CNIs
- `IP/24` and a connected route `subnet/24 dev eth0 scope link`. After
  sorting, the link-scope subnet route is installed first (same as
  today), then the default route, which still works fine.
- covering its subnet before the default route. Sort is a no-op in
  effect.
- `RTNH_F_ONLINK` ("onlink") routes: these bypass
  `fib_check_nh_v4_gw`'s gateway lookup entirely. Order doesn't matter
  for them.
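One way to check the "no-op in effect" claim for a given route list is to ask whether it already satisfies the sort predicate. The sketch below assumes a simplified `route` type with a `HasGw` flag; `alreadyOrdered`, `kindnetRoutes`, and `ciliumRoutes` are hypothetical helpers, and the addresses are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// route is a hypothetical, simplified route used only for this check.
type route struct {
	Dst   string
	HasGw bool
}

// alreadyOrdered reports whether no gatewayed route precedes a no-gateway
// route, i.e. whether the stable sort would leave the slice unchanged.
func alreadyOrdered(routes []route) bool {
	return sort.SliceIsSorted(routes, func(i, j int) bool {
		return !routes[i].HasGw && routes[j].HasGw
	})
}

// kindnetRoutes models a connected subnet route followed by the default.
func kindnetRoutes() []route {
	return []route{
		{Dst: "10.244.1.0/24"},
		{Dst: "default", HasGw: true},
	}
}

// ciliumRoutes models the default route listed before the /32 gateway route.
func ciliumRoutes() []route {
	return []route{
		{Dst: "default", HasGw: true},
		{Dst: "10.8.0.157/32"},
	}
}

func main() {
	fmt.Println("kindnet already ordered:", alreadyOrdered(kindnetRoutes()))
	fmt.Println("cilium already ordered:", alreadyOrdered(ciliumRoutes()))
}
```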
Reproduction
On a Cilium-based Kubernetes cluster (tested on Linux 6.8,
Kubernetes v1.35):
- `kubectl ate create actor my-counter-1 --template ate-demo-counter/counter`
- `kubectl port-forward -n ate-system svc/atenet-router 8000:80 &`
- `curl -X POST -H "Host: my-counter-1.actors.resources.substrate.ate.dev" -i http://localhost:8000/`

Expected (before fix): `500` with "network is unreachable" on the
first POST, "Link not found" on subsequent POSTs.
Expected (after fix): `200 OK`, counter increments across
suspend/resume cycles.
Verification
Captured route list before the patch (from atelet logs, formatted):
Restore order after the patch (from `Restoring route` log lines):
Counter resumed and incremented correctly across sequential POSTs,
confirming the actor's network namespace was restored intact.
Caveats / follow-ups (not in this PR)
Cilium clusters separately need the actor's interior netns to be
able to resolve its gateway's MAC. Even after the routes restore
correctly, the first packet the actor sends still has to resolve
the gateway via ARP, and that ARP doesn't reach Cilium's eBPF
responder when sent from the moved-into interior netns. The
right fix is for `ateom-gvisor` to install a permanent neighbor
entry inside the interior netns pointing at the host-side veth
peer's MAC, mirroring how Cilium would handle the request via
eBPF for a normal pod. That work belongs in a follow-up PR: it
needs to capture the peer MAC before the netns move and
`NeighSet` it after the route restore.
As a stopgap during testing we enabled `proxy_arp=1` on the
host-side `lxc*` veths via a DaemonSet so the host kernel
answers the ARP. This works but is not the right long-term
fix: it bypasses Cilium's intended eBPF-only resolution path
and could let any pod whose eth0 setup deviates from Cilium's
defaults resolve neighbors through the host kernel rather than
the datapath. Recommend using the permanent-neighbor approach
in the follow-up PR instead.
The kind cluster setup script enables proxy_arp host-wide
(`hack/create-kind-cluster.sh`), which is fine for a local
development cluster but not appropriate for a Cilium-managed
production cluster.
`restoreLink` does not currently handle topologically-dependent
route chains (e.g. route A's gateway requires route B's gateway
to already be installed). The two-bucket sort here handles the
common case (link-scope first, then gateway). If a CNI ever
produces a longer dependency chain, a real topological sort would
be needed.
Failure cleanup is asymmetric: if `restoreLink` fails, `eth0` has
already been moved into the interior netns and is left there.
The pod is then unusable until restarted because the next attempt
sees no `eth0` in the worker netns. Worth a follow-up to either
roll the link back on failure, or to do the netns move after a
successful route validation.
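For the longer dependency chains mentioned in the caveats, a dependency-aware ordering could look roughly like the sketch below. It is only an illustration under stated assumptions: `route`, `covered`, `installOrder`, and `sampleChain` are hypothetical names, the `192.168.5.x` addresses are invented, and the rule "a gateway is resolvable once any earlier route covers it" is a simplification that does not claim to match what the kernel accepts.

```go
package main

import (
	"fmt"
	"net/netip"
)

// route is a hypothetical, simplified route; a zero Gw means no gateway.
type route struct {
	Dst netip.Prefix
	Gw  netip.Addr
}

// covered reports whether addr falls inside the destination of any route
// already placed in the install order.
func covered(placed []route, addr netip.Addr) bool {
	for _, r := range placed {
		if r.Dst.Contains(addr) {
			return true
		}
	}
	return false
}

// installOrder returns routes ordered so that every gateway is covered by
// an earlier route. It returns an error if a pass makes no progress, which
// means some gateway can never be resolved.
func installOrder(routes []route) ([]route, error) {
	ordered := make([]route, 0, len(routes))
	pending := append([]route(nil), routes...)
	for len(pending) > 0 {
		var next []route
		for _, r := range pending {
			if !r.Gw.IsValid() || covered(ordered, r.Gw) {
				ordered = append(ordered, r)
			} else {
				next = append(next, r)
			}
		}
		if len(next) == len(pending) {
			return nil, fmt.Errorf("%d route(s) have unresolvable gateways", len(next))
		}
		pending = next
	}
	return ordered, nil
}

// sampleChain returns an invented three-route chain in worst-case order:
// the default route depends on a subnet route, which depends on a /32.
func sampleChain() []route {
	return []route{
		{Dst: netip.MustParsePrefix("0.0.0.0/0"), Gw: netip.MustParseAddr("192.168.5.1")},
		{Dst: netip.MustParsePrefix("192.168.5.0/24"), Gw: netip.MustParseAddr("10.8.0.157")},
		{Dst: netip.MustParsePrefix("10.8.0.157/32")},
	}
}

func main() {
	ordered, err := installOrder(sampleChain())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range ordered {
		fmt.Println(r.Dst)
	}
}
```

Routes that become resolvable in the same pass keep their original relative order, so for the two-level Cilium case this produces the same result as the stable sort.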
Test plan
`go build ./cmd/servers/ateom-gvisor` succeeds.
"network is unreachable".
`ateom-gvisor` image, redeploy counter demo to pick up new image.
suspend/resume cycles.
still works end-to-end.