connector/iroh-dns controllers fail to engage on 221 PCPs missing coordination.k8s.io discovery → Connector.Ready frozen indefinitely; 5s retry loop likely driving OOMKill

## Summary

In staging, the `connector` and `iroh-dns` controllers in `network-services-operator` fail to engage for every project control plane whose apiserver does not advertise `coordination.k8s.io/v1` in API discovery. Both controllers register a `Watches(&coordinationv1.Lease{}, …)`, and `mcController.Engage` rejects the whole controller/cluster pair when any watch can't be wired. The result is that for affected clusters:

- `Connector.status.conditions[Ready]` is frozen at the value set the first time the controller successfully engaged (often the creation time, when the Lease hadn't been renewed yet, so `Ready=False(ConnectorNotReady)`).
- `IrohDNSPublished` similarly never updates.
- `metadata.generation` advances but `status.conditions[*].observedGeneration` stays behind.
- HTTP/Gateway/etc. controllers in the same operator binary continue to reconcile that same cluster normally (because they don't watch Lease).

User-visible effect: Datum Connect desktop agents heartbeat correctly every ~15s (Lease `spec.renewTime` is fresh) but the Datum Cloud UI reports the Connector as offline indefinitely.

There is also a secondary `OOMKilled` loop on `controller-manager` (every ~30 min per replica, see "Operator stability" below) which makes the problem worse — every restart re-attempts the failing Engage and re-grows whatever caches are leaking.

## Reproduction

1. In staging, install Datum Connect desktop and create a Connector in a project whose PCP apiserver only advertises `networking.datumapis.com` in `/apis` (i.e. doesn't advertise `coordination.k8s.io`).
2. Verify the agent is patching `spec.renewTime` on the connector's Lease.
3. Observe `Connector.status.conditions[Ready] = False(ConnectorNotReady)` indefinitely. `observedGeneration` lags `metadata.generation`.

## Evidence — affected staging cluster `matt-jenkinson-yz0y92`

Connector `datum-connect-ff98k` (uid `e5b7d11f-945d-4432-8b21-6f66d93f3e5a`, project namespace `default`).

### Connector status (via `kubectl get connector ... -o jsonpath=...`)

```
generation=2  resourceVersion=1138668215
Accepted=True(Accepted)                    obs=1  @ 1970-01-01T00:00:00Z
Ready=False(ConnectorNotReady)             obs=1  @ 2026-03-07T15:20:14Z   ← creation time
IrohDNSPublished=False(DeferredToOwner)    obs=1  @ 2026-05-01T20:11:43Z   ← last operator status write
```

Note `observedGeneration=1` everywhere despite `generation=2`. The controller has never observed the current generation.
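
The staleness check described above can be sketched as follows. This is a minimal illustration, not operator code: the `Condition` struct is a hypothetical stand-in that mirrors only the fields of `metav1.Condition` needed here, and `StaleConditions` is an assumed helper name.

```go
package main

// Condition is a hypothetical stand-in that mirrors only the fields of
// metav1.Condition needed for this check.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	ObservedGeneration int64
}

// StaleConditions returns the types of all conditions whose
// observedGeneration is behind the object's metadata.generation,
// i.e. conditions the controller has not refreshed for the current spec.
func StaleConditions(generation int64, conds []Condition) []string {
	var stale []string
	for _, c := range conds {
		if c.ObservedGeneration < generation {
			stale = append(stale, c.Type)
		}
	}
	return stale
}
```

Fed the values from the status dump (generation 2, every condition at `obs=1`), this reports all three conditions as stale.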

`metadata.managedFields` confirms no `manager: manager` (operator) status write has happened since 2026-05-01T20:11:43Z. Datum Desktop (the agent) continues to patch `status.connectionDetails` and renew the Lease.

### Lease (`coordination.k8s.io/v1` `default/datum-connect-ff98k`)

Fetched via `kubectl get --raw …` (kubectl's normal discovery returns `error: the server doesn't have a resource type "leases"` — see "PCP discovery omission" below — but the resource is reachable directly):

```yaml
metadata:
  ownerReferences:
  - apiVersion: networking.datumapis.com/v1alpha1
    kind: Connector
    name: datum-connect-ff98k
    uid: e5b7d11f-945d-4432-8b21-6f66d93f3e5a
    controller: true
    blockOwnerDeletion: true
spec:
  leaseDurationSeconds: 30
  renewTime: 2026-05-14T13:19:58.587566Z   # less than 15s old at fetch
```

Lease is healthy: correct ownerRef, fresh renewTime, valid duration.
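
For reference, "fresh" here means the usual Lease rule: the lease is live while `renewTime + leaseDurationSeconds` is still in the future. The sketch below assumes that rule; how the connector controller actually derives `Ready` from the Lease is not shown in this issue, and both function names are hypothetical.

```go
package main

import "time"

// ParseRenewTime parses an RFC 3339 timestamp such as the
// microsecond-precision value found in a Lease's spec.renewTime.
func ParseRenewTime(s string) (time.Time, error) {
	return time.Parse(time.RFC3339Nano, s)
}

// LeaseExpired reports whether a Lease with the given renewTime and
// leaseDurationSeconds has lapsed at the instant now.
// Assumption: the lease is live until renewTime + duration, inclusive.
func LeaseExpired(renewTime time.Time, leaseDurationSeconds int32, now time.Time) bool {
	expiry := renewTime.Add(time.Duration(leaseDurationSeconds) * time.Second)
	return now.After(expiry)
}
```

With the Lease above (renewed at `13:19:58.587566Z`, 30s duration), any check made before `13:20:28.587566Z` sees a live lease, so a controller that engaged would have had no reason to keep reporting `ConnectorNotReady`.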

### Operator-side logs

The connector/iroh-dns controllers have **never** reconciled `matt-jenkinson-yz0y92` on either of the two most recent boots of the leader pod, even though the httpproxy and gateway controllers did reconcile that exact cluster during the same boots:

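```bash
# -E is required: plain grep treats ( | ) literally, so the pattern could never match.
# The optional space tolerates both `"key":"value"` and `"key": "value"` log encodings.
$ kubectl -n datum-system logs <leader-pod> --tail=200000 \
    | grep -E '"controller": ?"(connector|iroh-dns)"' | grep matt-jenkinson
$ kubectl -n datum-system logs <leader-pod> --previous --tail=200000 \
    | grep -E '"controller": ?"(connector|iroh-dns)"' | grep matt-jenkinson
(empty)
```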

For the same boots, the `httpproxy` and `gateway` controllers reconcile the same cluster normally — they don't `Watches(&coordinationv1.Lease{}, …)`, so their Engage isn't rejected.

### Root-cause log line (the bug)

For `matt-jenkinson-yz0y92` specifically, the error repeats **every ~5 seconds in a retry loop** on the leader pod:

```
2026-05-14T13:18:40Z ERROR get informer failed
  {"cluster": "/matt-jenkinson-yz0y92", "source": "kind",
   "error": "no matches for kind \"Lease\" in version \"coordination.k8s.io/v1\""}
2026-05-14T13:18:40Z ERROR cluster-sharding-coordinator failed to engage
  {"cluster": "/matt-jenkinson-yz0y92",
   "error": "failed to watch for cluster \"/matt-jenkinson-yz0y92\":
             no matches for kind \"Lease\" in version \"coordination.k8s.io/v1\""}
2026-05-14T13:18:45Z ERROR get informer failed   { … same … }
2026-05-14T13:18:45Z ERROR cluster-sharding-coordinator failed to engage   { … same … }
2026-05-14T13:18:50Z ERROR get informer failed   { … same … }
2026-05-14T13:18:50Z ERROR cluster-sharding-coordinator failed to engage   { … same … }
…
```

**221 unique project clusters** are in this state on staging (counted via `grep "failed to engage" | grep -oE '"cluster": "/[^"]+"' | sort -u | wc -l` across both boots of all three replicas). At one retry every 5 seconds, each affected cluster generates ~720 error pairs (~1,440 log lines) per hour, which likely also contributes to the OOMKill loop below.
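
The log-volume estimate works out as below. This is a rough sketch that assumes every affected cluster retries on the same fixed 5s cadence on a single pod and that each retry emits exactly the two error lines shown above; the function names are illustrative only.

```go
package main

// RetriesPerHour returns how many attempts fit into one hour when an
// operation is retried at a fixed interval given in whole seconds.
func RetriesPerHour(intervalSeconds int) int {
	if intervalSeconds <= 0 {
		return 0
	}
	return 3600 / intervalSeconds
}

// LogLinesPerHour estimates the hourly error-log volume for a number of
// clusters that each emit linesPerRetry lines on every failed retry.
func LogLinesPerHour(clusters, linesPerRetry, intervalSeconds int) int {
	return clusters * linesPerRetry * RetriesPerHour(intervalSeconds)
}
```

Under those assumptions, `LogLinesPerHour(1, 2, 5)` is 1,440 for a single cluster and `LogLinesPerHour(221, 2, 5)` is 318,240 for the whole affected set.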

A small sample of the 221 affected clusters (alphabetical prefix only):

```
/aaaaaa-d4qxk8
/asdf-6283wa
/asdasd
/e2e-shared-project-1776-{0bjzou,74u9w6,gp25oj,kaghqp,knraaw}
/e2e-shared-project-1777-x69inr
/e2e-test-dns-project-17-jawcmp
/hiyahya-4vrcph
/jacob-test-project-ybdzjo
/jbjjjhji-jm8yi3
/jose-{project-pt1wpv,sirugu}
/matt-jenkinson-yz0y92
/molla-{9rnjfm,otoke-4baody}
/new-project-6x6sz1
/osca-slo-test-r5h1r7
/personal-project-{2119b055,6527428a,759543f8,aeef86da,be933431, …many more}
/tdaly-v20250703-yxt7b6
/test-{delete-n4ccjo,elzw4o,fathom-project-twjb2l,project-{1-fscxij,6z9bj6,quota-8mmaoj,w4t25q},queue-{2-7v9vy1,hupwwe,project-56chve},quota-yu4nc4}
/test{1-clhv7m,2-gbod84,123-xvposa}
/testing-rkuax5
```

## PCP discovery omission

The Datum project-control-plane apiserver only advertises `networking.datumapis.com` in discovery:

```
$ kubectl api-resources --api-group=coordination.k8s.io
(empty)
$ kubectl api-resources --api-group=networking.datumapis.com
connectoradvertisements        networking.datumapis.com/v1alpha1   ConnectorAdvertisement
connectorclasses               networking.datumapis.com/v1alpha1   ConnectorClass
connectors                     networking.datumapis.com/v1alpha1   Connector
…
```

But the underlying Lease resource **is** reachable via direct path:

```
$ kubectl get --raw "/apis/coordination.k8s.io/v1/namespaces/default/leases/datum-connect-ff98k"
{ "kind": "Lease", "apiVersion": "coordination.k8s.io/v1", … }
```

So this is a discovery omission, not a real "Lease isn't there" condition. The Datum Connect desktop agent uses kube-rs, which builds the request URL for typed resources directly rather than consulting discovery, and is therefore able to renew leases without issue.
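
To make the distinction concrete, here is a hedged sketch of what a discovery-dependent client effectively asks: is the group/version listed in the `/apis` document? It assumes the unaggregated `APIGroupList` JSON shape (as returned by `kubectl get --raw /apis`); the function name is hypothetical and this is not how controller-runtime's REST mapper is actually implemented.

```go
package main

import "encoding/json"

// apiGroupList models only the parts of an APIGroupList that are
// needed to look up an advertised group/version.
type apiGroupList struct {
	Groups []struct {
		Name     string `json:"name"`
		Versions []struct {
			GroupVersion string `json:"groupVersion"`
		} `json:"versions"`
	} `json:"groups"`
}

// AdvertisesGroupVersion reports whether the given /apis discovery
// document (JSON-encoded APIGroupList) lists groupVersion,
// for example "coordination.k8s.io/v1".
func AdvertisesGroupVersion(apisJSON []byte, groupVersion string) (bool, error) {
	var list apiGroupList
	if err := json.Unmarshal(apisJSON, &list); err != nil {
		return false, err
	}
	for _, g := range list.Groups {
		for _, v := range g.Versions {
			if v.GroupVersion == groupVersion {
				return true, nil
			}
		}
	}
	return false, nil
}
```

Against an affected PCP this would return `false` for `coordination.k8s.io/v1` even though the direct GET above succeeds, which is the mismatch that trips the Lease watch.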

## Operator stability

Concurrent issue compounding the above: all three replicas of `network-services-operator-controller-manager` are `OOMKilled` every ~30 minutes (exit code 137, `memory: 4Gi` limit). Image `ghcr.io/datum-cloud/network-services-operator:v0.0.0-main-20260512-182158`. Restart counts as of `2026-05-14T13:30Z`:

```
network-services-operator-controller-manager-67ff7d4f66-cj8kd   36 restarts in 17h
network-services-operator-controller-manager-67ff7d4f66-j29gb   32 restarts in 17h
network-services-operator-controller-manager-67ff7d4f66-vj4bf   33 restarts in 17h
```

This is probably a separate bug (memory growth that scales with the number of project clusters or retried engagements). Even if the engage failure above were fixed, the OOM loop would still cause reconcile stalls.

## Suggested directions

For the engage failure (the primary blocker):

- Make the `Lease` watch optional: catch the discovery error and either continue without it, or schedule a periodic re-attempt without failing the whole `Engage` for the controller. Today, every controller that watches Lease is "all or nothing" per cluster.
- Alternatively, on the PCP apiserver side, advertise `coordination.k8s.io/v1` in `/apis` discovery — since the resource is already reachable, just hidden from clients that do API discovery (kubectl, controller-runtime).
- Either fix would also restore correct reconciles for the iroh-dns controller in these clusters.
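
The first option could look roughly like the sketch below. It is stdlib-only and entirely hypothetical: `Watch`, `EngageWatches`, and `ErrKindNotDiscovered` are illustrative names, not existing operator or multicluster-runtime APIs. In real code the error check would presumably be `meta.IsNoMatchError` from `k8s.io/apimachinery/pkg/api/meta` rather than a sentinel error.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrKindNotDiscovered is a stand-in for the apimachinery
// "no matches for kind" error seen in the logs above.
var ErrKindNotDiscovered = errors.New("no matches for kind")

// Watch is a hypothetical description of one watch a controller
// wants to wire up for a cluster.
type Watch struct {
	Kind     string
	Optional bool
	Start    func() error
}

// EngageWatches starts every watch in order. A failing required watch
// aborts the engage, as happens today. An optional watch whose kind is
// missing from discovery is returned in deferred so the caller can
// re-attempt it later without failing the controller/cluster pair.
// Simplification: watches started before an abort are not stopped.
func EngageWatches(watches []Watch) (deferred []Watch, err error) {
	for _, w := range watches {
		startErr := w.Start()
		if startErr == nil {
			continue
		}
		if w.Optional && errors.Is(startErr, ErrKindNotDiscovered) {
			deferred = append(deferred, w)
			continue
		}
		return nil, fmt.Errorf("failed to watch %s: %w", w.Kind, startErr)
	}
	return deferred, nil
}
```

With the `Lease` watch marked optional, the controller would still engage and reconcile on `Connector` events, and a caller could retry the deferred watch on a slower schedule than the current 5s loop. Whether reconciling without Lease events yields a correct `Ready` condition is a design question this sketch does not answer.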

For the OOMKill loop — needs a separate investigation; probably worth a pprof heap dump on a healthy-but-near-OOM pod.

## Workaround

None on the agent side. Restarting the operator briefly resurrects reconciles for clusters that *do* engage, but the affected clusters never recover within a pod's lifetime.

