fix(ateom-gvisor): pin gateway MAC via permanent neighbor entries#3

Open
brandonrjacobs wants to merge 1 commit into bjacobs/fix-cilium-route-restore-order from bjacobs/ateom-permanent-neigh

brandonrjacobs (Collaborator) commented May 21, 2026

TL;DR

Replace the host-level proxy_arp=1 workaround for Cilium clusters with
an in-ateom-gvisor permanent-neighbor approach: capture the
(gateway IP, MAC) pairs the kernel resolves at startup, then install
them inside the interior gVisor netns as NUD_PERMANENT entries. The
actor never has to ARP, and we no longer touch the host's sysctl.

Stacked on top of #2 (route-sort fix). Review that one first.

Why not just leave proxy_arp=1?

A reviewer raised a fair concern about the proxy_arp workaround:

By flipping proxy_arp=1 on every lxc* veth host-wide, we open
the host kernel as an alternate ARP responder for any pod whose
setup deviates from Cilium's defaults. Cilium's eBPF datapath is
the intended sole arbiter of pod-to-gateway resolution; the sysctl
bypasses it.

Looking at the upstream Cilium CNI
plugin and the eBPF program bpf_lxc.c
that runs tail_handle_arp on the host-side veth ingress, the picture
is:

  • Cilium does not install a permanent neighbor entry on pod-side
    eth0; the CNI just adds the IP and routes.
  • For normal pods, the pod's ARP request reaches the host-side veth
    ingress hook, eBPF synthesizes an ARP reply with the host's MAC,
    and traffic is steered by eBPF thereafter.
  • The kernel's proxy_arp flag is not what makes this work — eBPF
    runs before the kernel IP stack ever sees the ARP.

We don't have a verified explanation for why our interior-netns case
needs proxy_arp=1 to succeed. The most likely theory is that moving
eth0 into the interior netns, combined with gVisor's
--network=sandbox packet plumbing (AF_PACKET), prevents the ARP
request from reaching tail_handle_arp the way a normal pod's does.
Whatever the exact mechanism, enabling proxy_arp host-wide fixes it
at the wrong layer.

The fix

At ateom-gvisor startup, after scraping the routes:

  1. Walk the routes and collect every distinct gateway IP.
  2. For each gateway, force the kernel to resolve its MAC by opening a
    UDP socket and calling connect on the gateway address. The connect
    sends no packets but triggers a route lookup + neighbor probe.
  3. Read back the kernel's neighbor table (netlink.NeighList) and
    record (IP, MAC) pairs for the resolved gateways. Best-effort:
    unresolved gateways are logged and skipped.
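
The three startup steps above can be sketched roughly as follows. This is a minimal illustration using only the Go standard library, not the code in this PR: the `route` and `neighbor` types, the function names other than `probeNeighbor`, and the discard port 9 are all assumptions made for the example, and the real helper reads the table through netlink.NeighList rather than a canned slice. Whether a bare UDP connect is enough to populate the neighbor table on a given kernel is taken from the description above and not verified here.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// route is a hypothetical stand-in for one scraped route.
type route struct {
	Dst string
	Gw  string // empty for on-link routes
}

// neighbor is a hypothetical stand-in for one neighbor-table row.
type neighbor struct {
	IP  string
	MAC string // empty or malformed while unresolved
}

// distinctGateways returns each valid gateway IP once, in first-seen order.
func distinctGateways(routes []route) []string {
	seen := make(map[string]bool)
	var out []string
	for _, r := range routes {
		ip := net.ParseIP(r.Gw)
		if ip == nil {
			continue // on-link route or unparsable gateway
		}
		key := ip.String()
		if !seen[key] {
			seen[key] = true
			out = append(out, key)
		}
	}
	return out
}

// probeNeighbor connects a UDP socket to the gateway and closes it.
// A UDP connect sends no datagram; port 9 is an arbitrary choice here.
func probeNeighbor(gw string) error {
	conn, err := net.DialTimeout("udp", net.JoinHostPort(gw, "9"), time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

// resolvedGateways keeps table rows that match a gateway and carry a
// parsable MAC. Unresolved gateways are simply left out (best-effort).
func resolvedGateways(gws []string, table []neighbor) []neighbor {
	want := make(map[string]bool, len(gws))
	for _, gw := range gws {
		want[gw] = true
	}
	var out []neighbor
	for _, n := range table {
		if _, err := net.ParseMAC(n.MAC); err != nil {
			continue
		}
		if want[n.IP] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	routes := []route{
		{Dst: "0.0.0.0/0", Gw: "10.8.0.157"},
		{Dst: "10.8.0.157/32"}, // on-link, no gateway
	}
	gws := distinctGateways(routes)
	// In the real flow, probeNeighbor(gw) would run for each gateway here,
	// followed by a neighbor-table dump. A canned table stands in for it.
	table := []neighbor{{IP: "10.8.0.157", MAC: "96:a5:c5:e4:42:4a"}}
	for _, n := range resolvedGateways(gws, table) {
		fmt.Printf("captured gateway neighbor ip=%s mac=%s\n", n.IP, n.MAC)
	}
}
```

Keeping the gateway collection and the table filter as pure functions makes the best-effort behavior easy to unit-test without a live netns.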

At RunWorkload time, after restoreLink installs addresses and
routes, install each saved neighbor inside the interior netns with
State: NUD_PERMANENT. The actor's first packet uses the cached
MAC; no ARP is ever sent from the interior netns.
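
A rough sketch of the restore-side step, assuming a `SaveNeigh` shape of two strings (the PR names the type but its fields are not shown here). The netlink call is hidden behind a hypothetical `neighSetter` interface so the example compiles with the standard library alone; in the actual code this would presumably be a netlink neighbor-set call made inside the interior netns. The constant mirrors NUD_PERMANENT (0x80) from the Linux headers.

```go
package main

import (
	"fmt"
	"net"
)

// nudPermanent mirrors NUD_PERMANENT (0x80) from linux/neighbour.h.
const nudPermanent = 0x80

// SaveNeigh is an assumed shape for a captured (IP, MAC) pair.
type SaveNeigh struct {
	IP  string
	MAC string
}

// neighEntry is a hypothetical, simplified neighbor record.
type neighEntry struct {
	LinkIndex    int
	State        int
	IP           net.IP
	HardwareAddr net.HardwareAddr
}

// neighSetter is a hypothetical seam over the netlink call that
// creates or replaces a neighbor entry.
type neighSetter interface {
	NeighSet(e neighEntry) error
}

// installPermanent installs each saved neighbor as a permanent entry on
// the given link and returns how many succeeded. Malformed or rejected
// entries are logged and skipped rather than failing the restore.
func installPermanent(s neighSetter, linkIndex int, saved []SaveNeigh) int {
	installed := 0
	for _, n := range saved {
		ip := net.ParseIP(n.IP)
		mac, err := net.ParseMAC(n.MAC)
		if ip == nil || err != nil {
			fmt.Printf("skipping malformed neighbor ip=%q mac=%q\n", n.IP, n.MAC)
			continue
		}
		entry := neighEntry{LinkIndex: linkIndex, State: nudPermanent, IP: ip, HardwareAddr: mac}
		if err := s.NeighSet(entry); err != nil {
			fmt.Printf("failed to install neighbor ip=%s: %v\n", n.IP, err)
			continue
		}
		installed++
	}
	return installed
}

// recorder is a fake neighSetter that remembers what it was given.
type recorder struct{ entries []neighEntry }

func (r *recorder) NeighSet(e neighEntry) error {
	r.entries = append(r.entries, e)
	return nil
}

func main() {
	rec := &recorder{}
	saved := []SaveNeigh{{IP: "10.8.0.157", MAC: "96:a5:c5:e4:42:4a"}}
	n := installPermanent(rec, 2, saved)
	fmt.Printf("installed %d permanent neighbor(s)\n", n)
}
```

Running this after addresses and routes are restored matters because the neighbor entry is bound to a link; the ordering shown in the description (restoreLink first, neighbors second) is what the sketch assumes.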

Why this preserves Cilium isolation

  • We never set any host-level sysctl.
  • The MAC we install is the one Cilium's eBPF program told the kernel
    to use — we're transcribing it, not inventing it.
  • The permanent neighbor entry lives only inside the per-actor
    interior netns. It's not visible to other pods.
  • Egress packets from the actor still pass through the host-side
    veth where Cilium's eBPF datapath enforces policy and routing as
    normal.

Files changed

  • cmd/servers/ateom-gvisor/ateom-gvisor.go:
    • New SaveNeigh type, added to SaveLinkInfo.Neighbors.
    • New scrapeGatewayNeighbors helper called from do() after
      scrapeLink.
    • New probeNeighbor helper that triggers kernel resolution via a
      no-op UDP connect.
    • In restoreLink, install captured neighbors as NUD_PERMANENT
      after routes are in place.

Test plan

  • go build ./cmd/servers/ateom-gvisor succeeds.
  • Deploy to a Cilium-managed cluster, remove the proxy_arp
    DaemonSet, reset net.ipv4.conf.*.proxy_arp to 0 on every
    interface, and redeploy the counter demo. Verified end-to-end:
    at startup ateom-gvisor logged
    Captured gateway neighbor ip=10.8.0.157 mac=96:a5:c5:e4:42:4a,
    and a fresh actor resumed successfully and incremented the
    counter across 3 POSTs.
  • Confirm the saved neighbor entry is present in the interior
    netns directly (couldn't nsenter from a debug pod —
    ateom-gvisor's interior netns is bound under its own
    filesystem; needs a node-shell or atelet exec path).
  • Regression-test on kind (kindnet CNI) — should be a no-op since
    kindnet pods resolve their gateway via the normal subnet path.

brandonrjacobs force-pushed the bjacobs/ateom-permanent-neigh branch from 866179b to d29cfc6 on May 21, 2026 16:14

On CNIs whose ARP resolution is delivered by an eBPF program on the
host-side veth (Cilium being the headline case), the interior gVisor
netns can't reliably re-ARP for its gateway after eth0 is moved into
it. The kernel's neighbor cache is empty in the fresh netns and the
eBPF responder doesn't necessarily see ARP frames originating there.

Capture (IP, MAC) pairs for each gateway from the kernel's neighbor
table at ateom-gvisor startup -- triggering resolution if needed --
and install them inside the interior netns as NUD_PERMANENT entries.
The actor then reaches its gateway without re-running ARP.

This replaces an earlier host-level workaround that flipped
net.ipv4.conf.lxc*.proxy_arp=1 on every Cilium veth. That sysctl
opens the host kernel as an alternate ARP responder for every pod,
which is at minimum a layering violation of Cilium's eBPF-only
datapath and could permit isolation bypasses for pods whose eth0
setup deviates from Cilium's defaults.

brandonrjacobs force-pushed the bjacobs/fix-cilium-route-restore-order branch from 43cc146 to 41c6eb2 on May 21, 2026 16:17
brandonrjacobs force-pushed the bjacobs/ateom-permanent-neigh branch from d29cfc6 to 1efb345 on May 21, 2026 16:17