fix: use SSA for HTTPProxy child resource writes to eliminate 409 races#170

Draft
drewr wants to merge 1 commit into main from fix/httpproxy-ssa-child-resources

@drewr drewr commented May 22, 2026

Problem

Closes #166. Supersedes #169 (which treated the symptom rather than the cause).

When a tunnel (HTTPProxy) is created, the .Owns() watch chain fires as each child resource (Gateway, HTTPRoute, EndpointSlice) is created, queuing concurrent reconciles of the same HTTPProxy. Those reconciles all run the same CreateOrUpdate (read → compute → write) path, race on the resource version, and produce a burst of 409 Conflict errors. Controller-runtime then applies exponential backoff; after ~15 failures the wait reaches 3-4 minutes, silencing the operator until the next periodic tick.

19:33:47  HTTPProxy created → reconcile fires
19:33:47  reconcile creates Gateway → Owns() watch fires → reconcile #2 queued
19:33:47  reconcile creates HTTPRoute → Owns() watch fires → reconcile #3 queued
19:33:47–52  goroutines #2, #3, … race to Update → burst of 409s → exp. backoff
19:37:37  next tick: Programmed=True finally set  (+3m47s)

Fix

Replace CreateOrUpdate with Server-Side Apply (client.Apply + ForceOwnership) for Gateway, HTTPRoute, HTTPRouteFilter, and EndpointSlice.

With SSA, each reconcile sends its full desired state under the same field manager, with no resourceVersion precondition on the write. Concurrent applies of identical state converge on the API server: the second (and third, fourth…) apply is a no-op rather than a 409, so the resource-version race that CreateOrUpdate hit does not arise.

Additional improvements from SSA:

  • Reduced watch churn: an idempotent apply that produces no server-side change does not increment the resource version, so no watch event fires. The .Owns() re-queue that triggered the race in the first place largely disappears for steady-state reconciles.
  • BackendCertHostnameAnnotation simplified: the annotation is already set or absent on each desired EndpointSlice in collectDesiredResources. SSA + ForceOwnership adds it when present and removes it when absent — no explicit delete() call needed. The per-field merge logic inside the CreateOrUpdate mutate function is removed.
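
The add-when-present, remove-when-absent behaviour for the annotation can be sketched with a toy model of a single field manager. Everything here is hypothetical: toyObject, its apply method, and the annotation key are illustrative only, and the model tracks just the keys this one manager has applied.

```go
package main

import "fmt"

// certHostnameKey is a hypothetical annotation key used only in this sketch.
const certHostnameKey = "example.test/backend-cert-hostname"

// toyObject models an object's annotations plus the set of keys that
// our single field manager has applied before.
type toyObject struct {
	annotations map[string]string
	managed     map[string]bool
}

func newToyObject() *toyObject {
	return &toyObject{annotations: map[string]string{}, managed: map[string]bool{}}
}

// apply writes every declared key and drops keys that this manager applied
// earlier but no longer declares. No explicit delete is needed by the caller.
func (o *toyObject) apply(desired map[string]string) {
	for k := range o.managed {
		if _, declared := desired[k]; !declared {
			delete(o.annotations, k)
			delete(o.managed, k)
		}
	}
	for k, v := range desired {
		o.annotations[k] = v
		o.managed[k] = true
	}
}

func main() {
	obj := newToyObject()
	obj.apply(map[string]string{certHostnameKey: "backend.example.test"})
	fmt.Println("after first apply: ", obj.annotations)

	// The desired state no longer carries the annotation, so it disappears.
	obj.apply(map[string]string{})
	fmt.Println("after second apply:", obj.annotations)
}
```

The caller only ever states the desired annotations; removal falls out of the ownership bookkeeping rather than a per-field merge.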

Gateway listener hostname note

The gateway controller sets listener hostnames via plain Update (no SSA field manager). In SSA terms those hostnames are unmanaged. ForceOwnership on a list item (matched by merge key name) takes ownership of all fields within that item, which would clear unmanaged fields not in the apply payload.

To avoid losing those hostnames, the single Get we already need for ownership-conflict detection also reads any hostname the gateway controller has already written, and we carry it forward in the apply payload. This becomes unnecessary once the gateway controller adopts SSA itself (a natural follow-up).
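
A minimal sketch of the carry-forward step, assuming listeners are matched by name. The listener type and carryForwardHostnames are hypothetical simplifications for illustration, not the Gateway API types or the code in this PR.

```go
package main

import "fmt"

// listener is a trimmed-down, hypothetical stand-in for a Gateway listener.
type listener struct {
	Name     string
	Port     int
	Hostname *string // nil means "not set"
}

// carryForwardHostnames returns a copy of desired in which any listener that
// leaves Hostname unset inherits the hostname of the live listener with the
// same name. The input slices are not modified.
func carryForwardHostnames(desired, live []listener) []listener {
	byName := make(map[string]*string, len(live))
	for _, l := range live {
		if l.Hostname != nil {
			byName[l.Name] = l.Hostname
		}
	}
	out := make([]listener, len(desired))
	copy(out, desired)
	for i := range out {
		if out[i].Hostname != nil {
			continue
		}
		if h, ok := byName[out[i].Name]; ok {
			hostname := *h // copy so the payload does not alias the live object
			out[i].Hostname = &hostname
		}
	}
	return out
}

func main() {
	host := "tunnel.example.test"
	live := []listener{{Name: "https", Port: 443, Hostname: &host}}
	desired := []listener{{Name: "https", Port: 443}, {Name: "http", Port: 80}}

	for _, l := range carryForwardHostnames(desired, live) {
		if l.Hostname != nil {
			fmt.Printf("%s:%d hostname=%s\n", l.Name, l.Port, *l.Hostname)
		} else {
			fmt.Printf("%s:%d hostname unset\n", l.Name, l.Port)
		}
	}
}
```

The resulting slice is what would go into the apply payload, so a hostname written by the other controller is restated rather than dropped.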

Testing

All existing TestHTTPProxyReconcile subtests pass, including address_and_hostname_propagation which exercises the hostname carry-forward path. No new test was added — the correctness is structural: concurrent SSA applies of the same state cannot produce a 409.

go test ./internal/controller/ -count=1 -timeout=120s
ok  go.datum.net/network-services-operator/internal/controller  1.49s

Related

The HTTPProxy controller used CreateOrUpdate (read-modify-write) for Gateway,
HTTPRoute, and EndpointSlice child resources. When multiple reconcile goroutines
ran simultaneously — triggered by the .Owns() watch chain firing as each child
was created — they raced on the resource version, producing a burst of 409
Conflict errors. Controller-runtime then applied exponential backoff; after ~15
consecutive failures the wait reached 3-4 minutes, silencing the controller
until its next periodic tick.

Root cause: concurrent goroutines doing Get → Update is inherently racy.
Requeuing quickly on conflict treats the symptom, not the cause.

Fix: replace CreateOrUpdate with Server-Side Apply (client.Apply +
ForceOwnership) for Gateway, HTTPRoute, HTTPRouteFilter, and EndpointSlice.
Concurrent goroutines applying the same desired state with the same field
manager are deduplicated by the API server — the second apply is a no-op
rather than a 409.

Additional properties of SSA that improve the controller:
- An idempotent apply that produces no server-side change fires no watch event,
  naturally reducing re-queue churn from the .Owns() watches.
- The BackendCertHostnameAnnotation on EndpointSlice is already set (or absent)
  by collectDesiredResources; SSA + ForceOwnership ensures it is added when
  present and removed when absent without an explicit delete call.
- The complex read-modify-write in the EndpointSlice mutate function is removed.

Gateway listener hostname note: the gateway controller sets listener hostnames
via plain Update (no SSA field manager), making them unmanaged in SSA terms.
ForceOwnership on a list item would clear unmanaged fields not in the apply
payload. We carry forward any hostname the gateway controller has already set in
the single Get we retain for ownership-conflict detection. This becomes
unnecessary once the gateway controller adopts SSA itself.

Closes: network-services-operator#166
@drewr drewr requested a review from scotwells May 22, 2026 21:17
Development

Successfully merging this pull request may close these issues.

HTTPProxy reconcile backs off 3-4 min after 409 conflict burst at tunnel creation
