fix: tolerate missing Lease discovery on project control planes#161
Closed
mattdjenkinson wants to merge 3 commits into
Closed
fix: tolerate missing Lease discovery on project control planes#161mattdjenkinson wants to merge 3 commits into
mattdjenkinson wants to merge 3 commits into
Conversation
The connector and iroh-dns reconcilers both registered a watch on coordination.k8s.io/v1.Lease. multicluster-runtime engages every source against every cluster at runtime; if any source's REST mapper returns no-match the whole controller/cluster pair is rejected and never reconciles. In staging that's the state of 221 project control planes — they serve Lease at the raw URL but omit the group from discovery, so connector status freezes and the operator burns a 5s retry storm per affected cluster. Add an OptionalKind wrapper that probes the REST mapper at engage time and disengages the watch (shouldEngage=false) when the GVK has no match, letting the rest of the controller carry on. Other mapper errors still propagate so transient discovery failures retry. The Lease watch is wired via the builder's Watches API which always constructs an mcsource.Kind; there's no seam to inject a custom source. Build the controller, then call MultiClusterWatch directly with the optional wrapper. Without the Lease watch, reconciles still need to converge. Extend connectorLeaseReady to set RequeueAfter on every not-ready path so freshness is re-evaluated at the lease cadence, and have the iroh-dns reconcile always requeue at the lease duration for iroh-routed connectors so sibling handover still happens within ≈ one duration when the watch is degraded. Fixes #160
The optional Lease watch in 3578c30 stops engagement failing on project control planes that omit coordination.k8s.io from API discovery, but the connector reconciler body still hit the same RESTMapper through cl.Get(&coordinationv1.Lease{}, …) and controllerutil.CreateOrUpdate. On every reconcile against an affected cluster, both paths returned no matches for kind "Lease" in version "coordination.k8s.io/v1" which surfaced as Reconciler errors and prevented Connector.Ready from ever transitioning to True even though the agent was renewing the lease at the raw URL. Route Lease access through k8s.io/client-go/kubernetes/typed/coordination/v1. The typed client encodes its own group/version so it does not depend on the cluster's REST mapper. Production lazy-builds one from the cluster's REST config; tests inject the typed fake clientset via a new LeaseClient hook on the reconciler. ensureConnectorLease replaces controllerutil.CreateOrUpdate for the same reason — preserves the no-op-when-unchanged behaviour via DeepEqual before issuing an Update so the reconcile cadence doesn't turn into a write loop.
Contributor
|
@mattdjenkinson interesting side-effect of the discovery stuff! Let's add leases back to projects from a discovery perspective to fix this. |
Contributor
Author
|
@scotwells want me to close this and just look at the Milo fix or keep this as well? |
Contributor
|
@mattdjenkinson lets close this and fix on the milo side. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
connectorandiroh-dnsboth registerWatches(&coordinationv1.Lease{}, …)and callcl.Get(&coordinationv1.Lease{}, …)insideReconcile. multicluster-runtime rejects the entire controller/cluster pair if any watch can't be wired; controller-runtime's cached client also routes through the cluster's REST mapper. On project control planes that omitcoordination.k8s.iofrom API discovery, both paths fail withno matches for kind "Lease" in version "coordination.k8s.io/v1"—Connector.status.conditions[Ready]freezes,observedGenerationlagsgeneration, and the operator loops a ~5 s engage-retry storm per affected PCP.Currently staging-only (~221 affected PCPs). Root cause is upstream in milo: milo-os/milo#616 — the
DiscoveryContextFilterfeature gate drops built-in groups from project-context aggregated discovery responses because the filter unmarshals legacyAPIGroupListJSON into anAPIGroupDiscoveryListstruct and emits an empty list. Prod runs with the gate disabled, so it doesn't hit this today.This PR makes NSO defensive regardless. Even after the milo bug is fixed, this protects against any future downstream apiserver that mis-handles discovery.
Changes
internal/controller/source/optional.go(new) —OptionalKindwrapsmcsource.Kind. OnForCluster, it probes the cluster's REST mapper; onapimeta.IsNoMatchErrorit returnsshouldEngage=falseso multicluster-runtime cleanly skips that source for that cluster. Other mapper errors still propagate.connector_controller.go,iroh_dns_controller.go— both switch from.Watches(…Lease…).Complete(r)to.Build(r)plusc.MultiClusterWatch(OptionalKind(…Lease…)). The builder'sWatchesalways constructsmcsource.Kindso we bypass it for Lease only.ConnectorReconciler.LeaseClienthook +ensureConnectorLease— Lease access in the reconciler body now goes throughk8s.io/client-go/kubernetes/typed/coordination/v1. The typed client encodes its own group/version and never asks the REST mapper, so a missing discovery entry no longer breakscl.Get(Lease). Production constructs the client lazily fromcluster.GetConfig(); tests inject akubefake.NewSimpleClientsetto avoid needing a real REST config.RequeueAfter—connectorLeaseReadynow setsRequeueAfter: leaseDuration + jitteron every not-ready path, andIrohDNSReconciler.ReconcilereturnsRequeueAfter: requeueInterval()for every iroh-eligible Connector. Healthy clusters retain event-driven (sub-second) latency on Lease renew; degraded clusters converge at the lease cadence (≈30 s) instead of never.optional_test.gocovers the three branches (no-match degrades, healthy mapper delegates, other errors propagate); the existingTestConnectorReconcilecases were updated to inject the fake Lease clientset.Two minor trade-offs worth noting:
Staging validation (already deployed via infra#2448)
Two-commit rollout against the affected 221 PCPs.
After commit
3578c30(engage fix):failed to engage/get informer failedlog pairs across all 3 replicasoptional watch unavailable on cluster, degrading to no-opfires once per cluster at engagement, then silentiroh-dnsreconcile rate 0/s → 0.97/s (was completely stalled; now flowing)connectorhad a residual 0.1–0.5/sReconciler errorrate fromcl.Get(Lease)still hitting the RESTMapperAfter commit
2bedb7a(typed-client fix):connectorReconciler errorrate ~0/s (single 0.003/s blip on one pod = noise)matt-jenkinson-yz0y92(one of the affected projects from the issue repro) confirmed reconciling cleanly in logs at 15:23:57Z+Test plan
go build ./...,go vet ./...,go test ./...all clean locallyOptionalKindbranches and the typed Lease client injectionFixes #160. Related: milo-os/milo#616 (upstream root cause).