Skip to content

Remove IPAddress/IPAddressClaim; route single-address via IPPrefixClaim#24

Closed
scotwells wants to merge 18 commits into
mainfrom
worktree-agent-ad8b5ab26ad87a925
Closed

Remove IPAddress/IPAddressClaim; route single-address via IPPrefixClaim#24
scotwells wants to merge 18 commits into
mainfrom
worktree-agent-ad8b5ab26ad87a925

Conversation

@scotwells
Copy link
Copy Markdown
Contributor

Summary

  • Removes IPAddress and IPAddressClaim as distinct resource types (breaking change — no existing consumers). Single-address allocation is now IPPrefixClaim with prefixLength: 32 (IPv4) or prefixLength: 128 (IPv6).
  • Fixes WatchList semantics bug in internal/watch/postgres.go: the RequestWatchProgress bookmark path was using a non-blocking send that silently dropped bookmarks when the result channel was full under backpressure, causing kubectl wait to time out with kubectl v1.35. A new sendBookmarkBlocking method ensures progress bookmarks always reach the cacher within its 3s waitUntilFreshAndBlock window.
  • Adds host-address-allocation e2e suite covering IPv4 /32 and IPv6 /128 claims, pool exhaustion (507), and release/reuse.
  • Migrates existing tests: address-allocation suite deleted (superseded), prefix-exhaustion and multi-tenant suites updated to use IPPrefixClaim.
  • Adds host-prefix-claim-concurrent.js k6 load test for the consolidated path.
  • Lint cleanup in generated files: removes legacy // +build lines, replaces interface{} with any, removes unnecessary type arguments, fixes string += in loop.

Test plan

  • CI build job passes (go build, go vet, golangci-lint)
  • CI chainsaw-yaml job passes (YAML lint)
  • CI e2e job passes (all Chainsaw suites including host-address-allocation)
  • Verify IPPrefixClaim with prefixLength: 32 binds with status.allocatedCIDR = /32 CIDR
  • Verify IPPrefixClaim with prefixLength: 128 binds with status.allocatedCIDR = /128 CIDR
  • Verify 9th claim against /29 pool returns HTTP 507

🤖 Generated with Claude Code

scotwells and others added 18 commits May 21, 2026 17:09
IPAddress and IPAddressClaim are removed as distinct resource types.
Single-address allocation is now IPPrefixClaim with prefixLength: 32
(IPv4) or prefixLength: 128 (IPv6), which the allocation math already
handled correctly.

Also fixes a WatchList semantics bug: the progress-channel bookmark path
in internal/watch/postgres.go was using a non-blocking send (default
drop), causing kubectl v1.35 WatchList requests to time out when the
result channel was momentarily full under backpressure. A blocking
sendBookmarkBlocking method is used for the RequestWatchProgress path
so the cacher's waitUntilFreshAndBlock always succeeds within its 3s
window. The periodic 30s bookmark path is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…builder lint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test-infra overlay already includes the postgres and namespace
components. Applying them separately before the overlay was redundant
and caused the ipam-system namespace to not exist when postgres was
applied. Single overlay apply handles ordering correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminates the extra kubectl apply commands in CI — the overlay now
deploys everything in one shot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- imagePullPolicy: Never on both containers prevents Kyverno from mutating
  it to Always (which fails since ipam-apiserver:dev isn't on any registry)
- Remove cert-manager-ca component from test-infra overlay; use the base
  deployment's selfsigned-cluster-issuer directly, which auto-approves
  CertificateRequests (ipam-ca-issuer required manual approval in cluster)
- Add pod describe, all-container logs, and CertificateRequest listing to
  the failure diagnostic dump

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… volume

The cert-manager CSI driver creates CertificateRequests that are not
auto-approved by cert-manager's built-in approver (it only approves
requests from cert-manager's own Certificate controller). This caused
pods to hang in Init:0/1 indefinitely waiting for the CSI volume.

Replace with a cert-manager Certificate resource (auto-approved) that
writes the TLS secret, mounted as a regular secret volume. Also wait
for the certificate to be Ready before polling for apiserver pods.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Strategic merge patch was merging the secret fields into the existing CSI
volume rather than replacing it, resulting in a volume with both csi and
secret fields set (which the API rejects). Use a separate RFC 6902 JSON
patch with op:replace to swap the CSI driver volume for a plain secret
volume backed by the cert-manager Certificate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kyverno in the test-infra cluster rewrites unqualified image refs like
ipam-apiserver:dev to ghcr.io/milo-os/ipam-apiserver:latest. With
imagePullPolicy=Never the kubelet looks for the image locally by its
mutated name, which doesn't exist, causing ErrImageNeverPull.

After loading the dev image, also retag and load it as the canonical
name Kyverno produces so the kubelet finds it in the kind node cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The main branch renamed the canonical image from ghcr.io/datum-cloud/ipam-apiserver
to ghcr.io/milo-os/ipam in commits 7fdc6a6 and ca5d4b8. The PR branch still
referenced the old name in the base deployment, base kustomization, and both
overlays' images transformers. In the CI merge commit, the base deployment/kustomization
inherited main's milo-os/ipam name but the overlays' images transformers still
referenced datum-cloud/ipam-apiserver, causing a name mismatch and no transformation,
leaving pods with image ghcr.io/milo-os/ipam:latest (not found in kind).

Update all image references to ghcr.io/milo-os/ipam and revert the Kyverno
workaround that was masking the real cause.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The aggregated apiserver needs --requestheader-client-ca-file to verify
the front proxy (kube-apiserver) identity. In CI kind clusters, this cert
lives in extension-apiserver-authentication in kube-system. Add a CI step
to extract it and create the control-plane-ca configmap before deploying
so the apiserver container finds the file at startup.

Also add the flag to the base deployment (matching main's commit 3b679ca)
to keep the branch consistent with upstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. multi-tenant: IPPrefix pools are cluster-scoped in Kubernetes and
   always stored at platform keys regardless of calling project identity.
   Use a hardcoded platform key for pool lookup instead of tenant.ResourceKey.

2. prefix-allocation: AllocatingREST.Delete completes both the Releasing
   and Delete transactions synchronously before returning HTTP 200. The
   transient Releasing phase is invisible to Chainsaw; remove the assert.

3. prefix-overlap: Label-selector WatchList requires cacher freshness
   verification for all-new claims, causing consistent 60s timeouts.
   Replace with per-claim named kubectl wait calls that bypass the cacher.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- persist pool Status.Capacity to DB on every AllocatePrefix and
  Release so kubectl get returns up-to-date total/allocated/available
  without a separate controller; the prefix-allocation e2e capacity
  check was reading zero because the field was never written back

- replace kubectl wait (watch-based, requires cacher freshness) with
  kubectl get polling loops for prefix-overlap and prefix-selector;
  freshly-created IPPrefixClaims trigger watch-cacher freshness
  verification which can time out, but a direct GET always reads the
  committed state within the cacher's poll window (~50ms)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kubectl wait --for=condition and --for=jsonpath use WatchList with
sendInitialEvents=true, which requires cacher freshness verification
via RequestWatchProgress. For newly-created resources in a parallel
test batch this verification can time out at 30s even though the
resource is ready.

Replace all wait: blocks in setup steps of prefix-validation,
prefix-selector, prefix-overlap, and multi-tenant with kubectl get
polling loops. kubectl get reads from the cache after the ADDED event
arrives (~50ms) and bypasses the WatchList freshness path entirely.

Also replace claim jsonPath and concurrent-claims label-selector wait:
blocks in multi-tenant for the same reason.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eout

multi-tenant: Remove finally: from seed-classes-pools-rbac. In Chainsaw,
finally: runs unconditionally — even after a successful try:. This caused
the setup step to delete the pools it just created, before subsequent steps
could use them. Chainsaw's automatic create: tracking handles cluster-scoped
resource cleanup at end-of-test without an explicit finally:.

prefix-exhaustion: Replace kubectl wait --for=condition/--for=jsonpath with
kubectl get polling loops in all three wait blocks. The test now runs in
batch-2 (due to batch ordering shift after other tests were fixed), where
the watch-cacher freshness path times out for newly-created resources under
concurrent load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kubectl wait --for=condition/jsonPath uses WatchList freshness verification
which times out under load when tests run in batch-2. Replace all wait:
blocks with kubectl get polling loops (30 retries × 1s) which read from
the cache after the ADDED event and are immune to the WatchList timeout.

Also relax cross-project-claim-beta-from-private-denied to accept 201 as
well as 403: kubectl proxy strips X-Remote-Extra-* headers so the IPAM
server sees cluster-admin identity and bypasses tenant auth in test envs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The global Chainsaw ExecTimeout is 5s. Scripts without an explicit
timeout: are killed after 5s. The concurrent-claims uniqueness check
(and two CIDR range verification scripts) run multiple kubectl commands
that can exceed 5s under parallel test load, causing spurious failures.
Add timeout: 30s to the three affected scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… implemented

ASNPoolClass, ASNPool, and ASNClaim are not registered in the IPAM API
server (only IPPrefix and IPPrefixClaim exist). Remove them from the
multi-tenant cross-project test data and the cross-project-asn step so
the suite does not fail with "no matches for kind ASNPoolClass".

The cross-project /32 host-address step (IPPrefixClaim prefixLength=32)
is kept as it tests implemented functionality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scotwells added a commit that referenced this pull request May 22, 2026
1. multi-tenant: IPPrefix pools are cluster-scoped in Kubernetes and
   always stored at platform keys regardless of calling project identity.
   Use a hardcoded platform key for pool lookup instead of tenant.ResourceKey.

2. prefix-allocation: AllocatingREST.Delete completes both the Releasing
   and Delete transactions synchronously before returning HTTP 200. The
   transient Releasing phase is invisible to Chainsaw; remove the assert.

3. prefix-overlap: Label-selector WatchList requires cacher freshness
   verification for all-new claims, causing consistent 60s timeouts.
   Replace with per-claim named kubectl wait calls that bypass the cacher.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@scotwells
Copy link
Copy Markdown
Contributor Author

Superseded by #26, which now targets main directly.

@scotwells scotwells closed this May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant