Skip to content

fix: tolerate missing Lease discovery on project control planes#161

Closed
mattdjenkinson wants to merge 3 commits into
mainfrom
fix/optional-lease-watch
Closed

fix: tolerate missing Lease discovery on project control planes#161
mattdjenkinson wants to merge 3 commits into
mainfrom
fix/optional-lease-watch

Conversation

@mattdjenkinson
Copy link
Copy Markdown
Contributor

@mattdjenkinson mattdjenkinson commented May 14, 2026

Summary

connector and iroh-dns both register Watches(&coordinationv1.Lease{}, …) and call cl.Get(&coordinationv1.Lease{}, …) inside Reconcile. multicluster-runtime rejects the entire controller/cluster pair if any watch can't be wired; controller-runtime's cached client also routes through the cluster's REST mapper. On project control planes that omit coordination.k8s.io from API discovery, both paths fail with no matches for kind "Lease" in version "coordination.k8s.io/v1"Connector.status.conditions[Ready] freezes, observedGeneration lags generation, and the operator loops a ~5 s engage-retry storm per affected PCP.

Currently staging-only (~221 affected PCPs). Root cause is upstream in milo: milo-os/milo#616 — the DiscoveryContextFilter feature gate drops built-in groups from project-context aggregated discovery responses because the filter unmarshals legacy APIGroupList JSON into an APIGroupDiscoveryList struct and emits an empty list. Prod runs with the gate disabled, so it doesn't hit this today.

This PR makes NSO defensive regardless. Even after the milo bug is fixed, this protects against any future downstream apiserver that mis-handles discovery.

Changes

  • internal/controller/source/optional.go (new)OptionalKind wraps mcsource.Kind. On ForCluster, it probes the cluster's REST mapper; on apimeta.IsNoMatchError it returns shouldEngage=false so multicluster-runtime cleanly skips that source for that cluster. Other mapper errors still propagate.
  • connector_controller.go, iroh_dns_controller.go — both switch from .Watches(…Lease…).Complete(r) to .Build(r) plus c.MultiClusterWatch(OptionalKind(…Lease…)). The builder's Watches always constructs mcsource.Kind so we bypass it for Lease only.
  • ConnectorReconciler.LeaseClient hook + ensureConnectorLease — Lease access in the reconciler body now goes through k8s.io/client-go/kubernetes/typed/coordination/v1. The typed client encodes its own group/version and never asks the REST mapper, so a missing discovery entry no longer breaks cl.Get(Lease). Production constructs the client lazily from cluster.GetConfig(); tests inject a kubefake.NewSimpleClientset to avoid needing a real REST config.
  • Periodic RequeueAfterconnectorLeaseReady now sets RequeueAfter: leaseDuration + jitter on every not-ready path, and IrohDNSReconciler.Reconcile returns RequeueAfter: requeueInterval() for every iroh-eligible Connector. Healthy clusters retain event-driven (sub-second) latency on Lease renew; degraded clusters converge at the lease cadence (≈30 s) instead of never.
  • Unit testsoptional_test.go covers the three branches (no-match degrades, healthy mapper delegates, other errors propagate); the existing TestConnectorReconcile cases were updated to inject the fake Lease clientset.

Two minor trade-offs worth noting:

  1. The typed Lease client bypasses controller-runtime's cache. Each reconcile makes one extra direct GET for the Lease. At ~200 connectors × ~4 reconciles/min ≈ 13 GET/s — negligible apiserver load.
  2. Both reconcilers now requeue at the lease cadence even when the watch is firing normally. Redundant for healthy clusters but cheap, and removes any silent stall path.

Staging validation (already deployed via infra#2448)

Two-commit rollout against the affected 221 PCPs.

After commit 3578c30 (engage fix):

  • Zero failed to engage / get informer failed log pairs across all 3 replicas
  • optional watch unavailable on cluster, degrading to no-op fires once per cluster at engagement, then silent
  • iroh-dns reconcile rate 0/s → 0.97/s (was completely stalled; now flowing)
  • Controllers engage, but connector had a residual 0.1–0.5/s Reconciler error rate from cl.Get(Lease) still hitting the RESTMapper

After commit 2bedb7a (typed-client fix):

  • connector Reconciler error rate ~0/s (single 0.003/s blip on one pod = noise)
  • Per-pod memory 0.4–0.5 GiB (was OOMKilling at 4 GiB every ~30 min pre-fix)
  • 58 min post-rollout, zero restarts on the new ReplicaSet
  • matt-jenkinson-yz0y92 (one of the affected projects from the issue repro) confirmed reconciling cleanly in logs at 15:23:57Z+

Test plan

  • go build ./..., go vet ./..., go test ./... all clean locally
  • New unit tests cover the OptionalKind branches and the typed Lease client injection
  • Staging deploy validated above

Fixes #160. Related: milo-os/milo#616 (upstream root cause).

The connector and iroh-dns reconcilers both registered a watch on
coordination.k8s.io/v1.Lease. multicluster-runtime engages every
source against every cluster at runtime; if any source's REST mapper
returns no-match the whole controller/cluster pair is rejected and
never reconciles. In staging that's the state of 221 project control
planes — they serve Lease at the raw URL but omit the group from
discovery, so connector status freezes and the operator burns a 5s
retry storm per affected cluster.

Add an OptionalKind wrapper that probes the REST mapper at engage
time and disengages the watch (shouldEngage=false) when the GVK has
no match, letting the rest of the controller carry on. Other mapper
errors still propagate so transient discovery failures retry.

The Lease watch is wired via the builder's Watches API which always
constructs an mcsource.Kind; there's no seam to inject a custom
source. Build the controller, then call MultiClusterWatch directly
with the optional wrapper.

Without the Lease watch, reconciles still need to converge. Extend
connectorLeaseReady to set RequeueAfter on every not-ready path so
freshness is re-evaluated at the lease cadence, and have the
iroh-dns reconcile always requeue at the lease duration for
iroh-routed connectors so sibling handover still happens within
≈ one duration when the watch is degraded.

Fixes #160
The optional Lease watch in 3578c30 stops engagement failing on
project control planes that omit coordination.k8s.io from API
discovery, but the connector reconciler body still hit the same
RESTMapper through cl.Get(&coordinationv1.Lease{}, …) and
controllerutil.CreateOrUpdate. On every reconcile against an
affected cluster, both paths returned
  no matches for kind "Lease" in version "coordination.k8s.io/v1"
which surfaced as Reconciler errors and prevented Connector.Ready
from ever transitioning to True even though the agent was renewing
the lease at the raw URL.

Route Lease access through k8s.io/client-go/kubernetes/typed/coordination/v1.
The typed client encodes its own group/version so it does not depend
on the cluster's REST mapper. Production lazy-builds one from the
cluster's REST config; tests inject the typed fake clientset via a
new LeaseClient hook on the reconciler.

ensureConnectorLease replaces controllerutil.CreateOrUpdate for the
same reason — preserves the no-op-when-unchanged behaviour via
DeepEqual before issuing an Update so the reconcile cadence doesn't
turn into a write loop.
@scotwells
Copy link
Copy Markdown
Contributor

@mattdjenkinson interesting side-effect of the discovery stuff! Let's add leases back to projects from a discovery perspective to fix this.

@mattdjenkinson
Copy link
Copy Markdown
Contributor Author

@scotwells want me to close this and just look at the Milo fix or keep this as well?

@scotwells
Copy link
Copy Markdown
Contributor

@mattdjenkinson lets close this and fix on the milo side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants