fix(ateom-gvisor): pin gateway MAC via permanent neighbor entries #3
Open
brandonrjacobs wants to merge 1 commit into
On CNIs whose ARP resolution is delivered by an eBPF program on the host-side veth (Cilium being the headline case), the interior gVisor netns can't reliably re-ARP for its gateway after eth0 is moved into it. The kernel's neighbor cache is empty in the fresh netns and the eBPF responder doesn't necessarily see ARP frames originating there.

Capture (IP, MAC) pairs for each gateway from the kernel's neighbor table at ateom-gvisor startup -- triggering resolution if needed -- and install them inside the interior netns as NUD_PERMANENT entries. The actor then reaches its gateway without re-running ARP.

This replaces an earlier host-level workaround that flipped net.ipv4.conf.lxc*.proxy_arp=1 on every Cilium veth. That sysctl opens the host kernel as an alternate ARP responder for every pod, which is at minimum a layering violation of Cilium's eBPF-only datapath and could permit isolation bypasses for pods whose eth0 setup deviates from Cilium's defaults.
TL;DR
Replace the host-level `proxy_arp=1` workaround for Cilium clusters with an in-`ateom-gvisor` permanent-neighbor approach: capture the (gateway IP, MAC) pairs the kernel resolves at startup, then install them inside the interior gVisor netns as `NUD_PERMANENT` entries. The actor never has to ARP, and we no longer touch the host's sysctl.
Stacked on top of #2 (route-sort fix). Review that one first.
Why not just leave `proxy_arp=1`?
A reviewer raised a fair concern about the proxy_arp workaround:
Looking at the upstream Cilium CNI plugin and the eBPF program `bpf_lxc.c` that runs `tail_handle_arp` on the host-side veth ingress, the picture is:
eth0; the CNI just adds the IP and routes.
ingress hook, eBPF synthesizes an ARP reply with the host's MAC,
and traffic is steered by eBPF thereafter.
`proxy_arp` flag is not what makes this work: eBPF runs before the kernel IP stack ever sees the ARP.
We don't have a verified explanation for why our interior-netns case needs `proxy_arp=1` to succeed. The most likely theory is that the moved-into netns + gVisor's `--network=sandbox` packet plumbing (AF_PACKET) prevents the ARP from reaching `tail_handle_arp` the same way a normal pod's does. Whatever the exact mechanism, enabling proxy_arp host-wide is the wrong layer to fix it at.
The fix
At ateom-gvisor startup, after scraping the routes:
UDP socket and `connect`-ing to it. The socket sends no packets but triggers a route lookup + neighbor probe.
`netlink.NeighList`) and record (IP, MAC) pairs for the resolved gateways. Best-effort: unresolved gateways are logged and skipped.
At RunWorkload time, after `restoreLink` installs addresses and routes, install each saved neighbor inside the interior netns with `State: NUD_PERMANENT`. The actor's first packet uses the cached MAC; no ARP is ever sent from the interior netns.
Why this preserves Cilium isolation
to use — we're transcribing it, not inventing it.
interior netns. It's not visible to other pods.
veth where Cilium's eBPF datapath enforces policy and routing as
normal.
Files changed
`cmd/servers/ateom-gvisor/ateom-gvisor.go`:
- `SaveNeigh` type, added to `SaveLinkInfo.Neighbors`.
- `scrapeGatewayNeighbors` helper called from `do()` after `scrapeLink`.
- `probeNeighbor` helper that triggers kernel resolution via a no-op UDP `connect`.
- `restoreLink`: install captured neighbors as `NUD_PERMANENT` after routes are in place.
Test plan
`go build ./cmd/servers/ateom-gvisor` succeeds.
DaemonSet, reset `net.ipv4.conf.*.proxy_arp` to 0 on every interface, redeploy counter demo. Verified end-to-end: ateom-gvisor logged `Captured gateway neighbor ip=10.8.0.157 mac=96:a5:c5:e4:42:4a` at startup, fresh actor resumed successfully and incremented the counter across 3 POSTs.
netns directly (couldn't `nsenter` from a debug pod because ateom-gvisor's interior netns is bound under its own filesystem; needs a node-shell or atelet exec path).
kindnet pods resolve their gateway via the normal subnet path.