
connector/iroh-dns controllers fail to engage on 221 PCPs missing coordination.k8s.io discovery → Connector.Ready frozen indefinitely; 5s retry loop likely driving OOMKill #160

@mattdjenkinson

Summary

In staging, the connector and iroh-dns controllers in network-services-operator fail to engage for every project control plane whose apiserver does not advertise coordination.k8s.io/v1 in API discovery. Both controllers register a Watches(&coordinationv1.Lease{}, …), and mcController.Engage rejects the whole controller/cluster pair when any watch can't be wired. The result is that for affected clusters:

  • Connector.status.conditions[Ready] is frozen at the value set the first time the controller successfully engaged (often the creation time, when the Lease hadn't been renewed yet, so Ready=False(ConnectorNotReady)).
  • IrohDNSPublished similarly never updates.
  • metadata.generation advances but status.conditions[*].observedGeneration stays behind.
  • HTTP/Gateway/etc. controllers in the same operator binary continue to reconcile that same cluster normally (because they don't watch Lease).

User-visible effect: Datum Connect desktop agents heartbeat correctly every ~15s (Lease spec.renewTime is fresh) but the Datum Cloud UI reports the Connector as offline indefinitely.

There is also a secondary OOMKilled loop on controller-manager (every ~30 min per replica, see "Operator stability" below) which makes the problem worse — every restart re-attempts the failing Engage and re-grows whatever caches are leaking.

Reproduction

  1. In staging, install Datum Connect desktop and create a Connector in a project whose PCP apiserver only advertises networking.datumapis.com in /apis (i.e. doesn't advertise coordination.k8s.io).
  2. Verify the agent is patching spec.renewTime on the connector's Lease.
  3. Observe Connector.status.conditions[Ready] = False(ConnectorNotReady) indefinitely. observedGeneration lags metadata.generation.

Evidence — affected staging cluster matt-jenkinson-yz0y92

Connector datum-connect-ff98k (uid e5b7d11f-945d-4432-8b21-6f66d93f3e5a, project namespace default).

Connector status (via kubectl get connector ... -o jsonpath=...)

generation=2  resourceVersion=1138668215
Accepted=True(Accepted)                    obs=1  @ 1970-01-01T00:00:00Z
Ready=False(ConnectorNotReady)             obs=1  @ 2026-03-07T15:20:14Z   ← creation time
IrohDNSPublished=False(DeferredToOwner)    obs=1  @ 2026-05-01T20:11:43Z   ← last operator status write

Note observedGeneration=1 everywhere despite generation=2. The controller has never observed the current generation.
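The staleness check implied by the status dump above can be sketched in a few lines of Go. This is a minimal illustration, not the operator's code: `Condition` is a simplified stand-in for a Kubernetes status condition, and `staleConditions` is a hypothetical helper name.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes status
// condition; only the fields needed for the staleness check are kept.
type Condition struct {
	Type               string
	ObservedGeneration int64
}

// staleConditions returns the types of all conditions whose
// observedGeneration is behind the object's metadata.generation.
func staleConditions(generation int64, conds []Condition) []string {
	var stale []string
	for _, c := range conds {
		if c.ObservedGeneration < generation {
			stale = append(stale, c.Type)
		}
	}
	return stale
}

func main() {
	// Values from the status dump above: generation=2, obs=1 on every condition.
	conds := []Condition{
		{Type: "Accepted", ObservedGeneration: 1},
		{Type: "Ready", ObservedGeneration: 1},
		{Type: "IrohDNSPublished", ObservedGeneration: 1},
	}
	fmt.Println(staleConditions(2, conds)) // prints [Accepted Ready IrohDNSPublished]
}
```

A check like this could be run across all Connectors to list the ones no controller has observed at their current generation.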

metadata.managedFields confirms that the last status write by manager: manager (the operator) happened at 2026-05-01T20:11:43Z; there has been none since. Datum Desktop (the agent) continues to patch status.connectionDetails and renew the Lease.

Lease (coordination.k8s.io/v1 default/datum-connect-ff98k)

Fetched via kubectl get --raw … (kubectl's normal discovery returns error: the server doesn't have a resource type "leases" — see "PCP discovery omission" below — but the resource is reachable directly):

metadata:
  ownerReferences:
  - apiVersion: networking.datumapis.com/v1alpha1
    kind: Connector
    name: datum-connect-ff98k
    uid: e5b7d11f-945d-4432-8b21-6f66d93f3e5a
    controller: true
    blockOwnerDeletion: true
spec:
  leaseDurationSeconds: 30
  renewTime: 2026-05-14T13:19:58.587566Z   # less than 15s old at fetch

Lease is healthy: correct ownerRef, fresh renewTime, valid duration.
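For reference, "fresh" here means the renew time plus the lease duration is still in the future. A minimal Go sketch of that check follows. The function names and the 15-second fetch offset are assumptions for illustration; the operator's actual readiness logic is not shown in this issue.

```go
package main

import (
	"fmt"
	"time"
)

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339Nano, s)
	if err != nil {
		panic(err)
	}
	return t
}

// leaseFresh reports whether a Lease renewed at renewTime with the given
// leaseDurationSeconds is still inside its validity window at time now.
// Hypothetical helper, not taken from the operator source.
func leaseFresh(renewTime time.Time, leaseDurationSeconds int32, now time.Time) bool {
	expiry := renewTime.Add(time.Duration(leaseDurationSeconds) * time.Second)
	return now.Before(expiry)
}

func main() {
	renew := mustParse("2026-05-14T13:19:58.587566Z") // spec.renewTime from the Lease above
	now := renew.Add(15 * time.Second)                // assumed fetch time, 15s after renewal

	fmt.Println(leaseFresh(renew, 30, now))                       // prints true
	fmt.Println(leaseFresh(renew, 30, renew.Add(45*time.Second))) // prints false
}
```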

Operator-side logs

The connector/iroh-dns controllers have never reconciled matt-jenkinson-yz0y92 on either of the most recent boots of the leader pod (the boots where they did reconcile httpproxy/gateway for that exact cluster):

$ kubectl -n datum-system logs <leader-pod> --tail=200000 \
    | grep -E '"controller":"(connector|iroh-dns)"' | grep matt-jenkinson
$ kubectl -n datum-system logs <leader-pod> --previous --tail=200000 \
    | grep -E '"controller":"(connector|iroh-dns)"' | grep matt-jenkinson
(empty)

For the same boots, the httpproxy and gateway controllers reconcile the same cluster normally — they don't Watches(&coordinationv1.Lease{}, …), so their Engage isn't rejected.

Root-cause log line (the bug)

For matt-jenkinson-yz0y92 specifically, the error repeats every ~5 seconds in a retry loop on the leader pod:

2026-05-14T13:18:40Z ERROR get informer failed
  {"cluster": "/matt-jenkinson-yz0y92", "source": "kind",
   "error": "no matches for kind \"Lease\" in version \"coordination.k8s.io/v1\""}
2026-05-14T13:18:40Z ERROR cluster-sharding-coordinator failed to engage
  {"cluster": "/matt-jenkinson-yz0y92",
   "error": "failed to watch for cluster \"/matt-jenkinson-yz0y92\":
             no matches for kind \"Lease\" in version \"coordination.k8s.io/v1\""}
2026-05-14T13:18:45Z ERROR get informer failed   { … same … }
2026-05-14T13:18:45Z ERROR cluster-sharding-coordinator failed to engage   { … same … }
2026-05-14T13:18:50Z ERROR get informer failed   { … same … }
2026-05-14T13:18:50Z ERROR cluster-sharding-coordinator failed to engage   { … same … }
…

221 unique project clusters are in this state on staging (counted via grep "failed to engage" | grep -oE '"cluster": "/[^"]+"' | sort -u | wc -l across both boots of all three replicas). At one retry every 5 seconds, each affected cluster produces ~720 error pairs (~1,440 log entries) per hour, which likely also contributes to the OOMKill loop below.
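The log-volume estimate can be worked through as follows. This is a back-of-the-envelope sketch that assumes a fixed 5-second retry interval, two log entries per failed attempt (as in the excerpt above), and a single pod retrying all 221 clusters; the helper name is hypothetical.

```go
package main

import "fmt"

// errorPairsPerHour returns how many failed engage attempts (each logging a
// "get informer failed" + "failed to engage" pair) one cluster produces per
// hour at a fixed retry interval.
func errorPairsPerHour(retryIntervalSeconds int) int {
	return 3600 / retryIntervalSeconds
}

func main() {
	const affectedClusters = 221 // count reported above
	const entriesPerAttempt = 2  // the two ERROR entries per retry

	pairs := errorPairsPerHour(5)
	perCluster := pairs * entriesPerAttempt
	fleet := perCluster * affectedClusters

	fmt.Println(pairs)      // prints 720
	fmt.Println(perCluster) // prints 1440
	fmt.Println(fleet)      // prints 318240
}
```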

A small sample of the 221 affected clusters (alphabetical prefix only):

/aaaaaa-d4qxk8
/asdf-6283wa
/asdasd
/e2e-shared-project-1776-{0bjzou,74u9w6,gp25oj,kaghqp,knraaw}
/e2e-shared-project-1777-x69inr
/e2e-test-dns-project-17-jawcmp
/hiyahya-4vrcph
/jacob-test-project-ybdzjo
/jbjjjhji-jm8yi3
/jose-{project-pt1wpv,sirugu}
/matt-jenkinson-yz0y92
/molla-{9rnjfm,otoke-4baody}
/new-project-6x6sz1
/osca-slo-test-r5h1r7
/personal-project-{2119b055,6527428a,759543f8,aeef86da,be933431, …many more}
/tdaly-v20250703-yxt7b6
/test-{delete-n4ccjo,elzw4o,fathom-project-twjb2l,project-{1-fscxij,6z9bj6,quota-8mmaoj,w4t25q},queue-{2-7v9vy1,hupwwe,project-56chve},quota-yu4nc4}
/test{1-clhv7m,2-gbod84,123-xvposa}
/testing-rkuax5

PCP discovery omission

The Datum project-control-plane apiserver only advertises networking.datumapis.com in discovery:

$ kubectl api-resources --api-group=coordination.k8s.io
(empty)
$ kubectl api-resources --api-group=networking.datumapis.com
connectoradvertisements        networking.datumapis.com/v1alpha1   ConnectorAdvertisement
connectorclasses               networking.datumapis.com/v1alpha1   ConnectorClass
connectors                     networking.datumapis.com/v1alpha1   Connector
…

But the underlying Lease resource is reachable via direct path:

$ kubectl get --raw "/apis/coordination.k8s.io/v1/namespaces/default/leases/datum-connect-ff98k"
{ "kind": "Lease", "apiVersion": "coordination.k8s.io/v1", … }

So this is a discovery omission, not a case of the Lease resource actually being absent. The Datum Connect desktop agent uses kube-rs, which constructs the request URL directly instead of going through discovery, so it can renew leases without issue.
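The difference between the two client behaviours can be illustrated with a small Go sketch. Both helper names are hypothetical. The path layout is the standard one for namespaced resources in a named API group, and the advertised group list mirrors the api-resources output above.

```go
package main

import "fmt"

// namespacedResourcePath builds the REST path of a namespaced object in a
// named API group. A client that does this never needs to consult discovery.
func namespacedResourcePath(group, version, namespace, resource, name string) string {
	return fmt.Sprintf("/apis/%s/%s/namespaces/%s/%s/%s", group, version, namespace, resource, name)
}

// groupAdvertised reports whether group is present in the list of groups a
// server returns from discovery. Discovery-based clients stop here when it
// is false, even if the direct path would succeed.
func groupAdvertised(advertised []string, group string) bool {
	for _, g := range advertised {
		if g == group {
			return true
		}
	}
	return false
}

func main() {
	// What the PCP apiserver advertises, per the output above.
	advertised := []string{"networking.datumapis.com"}

	fmt.Println(groupAdvertised(advertised, "coordination.k8s.io")) // prints false
	fmt.Println(namespacedResourcePath("coordination.k8s.io", "v1", "default", "leases", "datum-connect-ff98k"))
	// prints /apis/coordination.k8s.io/v1/namespaces/default/leases/datum-connect-ff98k
}
```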

Operator stability

Concurrent issue compounding the above: all three replicas of network-services-operator-controller-manager are OOMKilled every ~30 minutes (exit code 137, 4Gi memory limit). Image ghcr.io/datum-cloud/network-services-operator:v0.0.0-main-20260512-182158. Restart counts as of 2026-05-14T13:30Z:

network-services-operator-controller-manager-67ff7d4f66-cj8kd   36 restarts in 17h
network-services-operator-controller-manager-67ff7d4f66-j29gb   32 restarts in 17h
network-services-operator-controller-manager-67ff7d4f66-vj4bf   33 restarts in 17h

Probably a separate bug (memory growth scaling with the number of project clusters / retried engagements). Even if the engage failure above were fixed, the OOM loop would still cause stalls.

Suggested directions

For the engage failure (the primary blocker):

  • Make the Lease watch optional: catch the discovery error and either continue without it, or schedule a periodic re-attempt without failing the whole Engage for the controller. Today, every controller that watches Lease is "all or nothing" per cluster.
  • Alternatively, on the PCP apiserver side, advertise coordination.k8s.io/v1 in /apis discovery — since the resource is already reachable, just hidden from clients that do API discovery (kubectl, controller-runtime).
  • Either fix would also restore correct reconciles for the iroh-dns controller in these clusters.
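The first direction (an optional Lease watch) could look roughly like the sketch below. It is deliberately self-contained and does not use controller-runtime: the `watch` type, `engage` function and `errNoKindMatch` sentinel are all hypothetical stand-ins. In real code the "no matches for kind" check would more likely be `meta.IsNoMatchError` from k8s.io/apimachinery, assuming the multicluster runtime surfaces that error in a form it recognizes.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoKindMatch is a hypothetical sentinel standing in for the discovery
// error seen in the logs ("no matches for kind ... in version ...").
var errNoKindMatch = errors.New("no matches for kind")

// watch describes one watch a controller wants on a cluster.
type watch struct {
	kind     string
	optional bool         // optional watches may be missing from discovery
	start    func() error // wires the informer; stubbed in this sketch
}

// engage wires every watch for one cluster. An optional watch whose kind is
// missing from discovery is recorded in the returned slice for a later
// re-attempt instead of failing the whole engagement.
func engage(watches []watch) ([]string, error) {
	var deferred []string
	for _, w := range watches {
		err := w.start()
		if err == nil {
			continue
		}
		if w.optional && errors.Is(err, errNoKindMatch) {
			deferred = append(deferred, w.kind)
			continue
		}
		return deferred, fmt.Errorf("failed to watch %s: %w", w.kind, err)
	}
	return deferred, nil
}

func main() {
	leaseMissing := func() error {
		return fmt.Errorf("%w \"Lease\" in version \"coordination.k8s.io/v1\"", errNoKindMatch)
	}
	watches := []watch{
		{kind: "Connector", start: func() error { return nil }},
		{kind: "Lease", optional: true, start: leaseMissing},
	}
	deferred, err := engage(watches)
	fmt.Println(deferred, err) // prints [Lease] <nil>
}
```

With this shape the controller still reconciles the cluster on Connector events, and the deferred kinds can be retried on a slower schedule than the current 5-second loop.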

For the OOMKill loop — needs a separate investigation; probably worth a pprof heap dump on a healthy-but-near-OOM pod.
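One way to capture such a heap dump from inside a Go process is the standard `runtime/pprof` package. The sketch below is generic: the `writeHeapProfile` helper and the output path are assumptions, and how (or whether) the operator exposes profiling today is not stated in this issue.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile writes a heap profile of the current process to path.
// Hypothetical helper; the result can be inspected with `go tool pprof`.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // run a GC first so the profile reflects up-to-date statistics
	return pprof.WriteHeapProfile(f)
}

func main() {
	const path = "heap.pprof" // assumed output location
	if err := writeHeapProfile(path); err != nil {
		fmt.Fprintln(os.Stderr, "heap profile failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote", path)
}
```

Comparing two profiles taken some minutes apart on the same pod would help show which allocations grow with the number of retried engagements.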

Workaround

None on the agent side. Restarting the operator briefly resurrects reconciles for clusters that do engage, but the affected clusters never recover within a pod's lifetime.

Metadata

Labels

bug (Something isn't working)
