Remove IPAddress/IPAddressClaim; route single-address via IPPrefixClaim#24
Closed
scotwells wants to merge 18 commits into
Closed
Remove IPAddress/IPAddressClaim; route single-address via IPPrefixClaim#24scotwells wants to merge 18 commits into
scotwells wants to merge 18 commits into
Conversation
IPAddress and IPAddressClaim are removed as distinct resource types. Single-address allocation is now IPPrefixClaim with prefixLength: 32 (IPv4) or prefixLength: 128 (IPv6), which the allocation math already handled correctly. Also fixes a WatchList semantics bug: the progress-channel bookmark path in internal/watch/postgres.go was using a non-blocking send (default drop), causing kubectl v1.35 WatchList requests to time out when the result channel was momentarily full under backpressure. A blocking sendBookmarkBlocking method is used for the RequestWatchProgress path so the cacher's waitUntilFreshAndBlock always succeeds within its 3s window. The periodic 30s bookmark path is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…builder lint Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test-infra overlay already includes the postgres and namespace components. Applying them separately before the overlay was redundant and caused the ipam-system namespace to not exist when postgres was applied. Single overlay apply handles ordering correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminates the extra kubectl apply commands in CI — the overlay now deploys everything in one shot. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- imagePullPolicy: Never on both containers prevents Kyverno from mutating it to Always (which fails since ipam-apiserver:dev isn't on any registry) - Remove cert-manager-ca component from test-infra overlay; use the base deployment's selfsigned-cluster-issuer directly, which auto-approves CertificateRequests (ipam-ca-issuer required manual approval in cluster) - Add pod describe, all-container logs, and CertificateRequest listing to the failure diagnostic dump Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… volume The cert-manager CSI driver creates CertificateRequests that are not auto-approved by cert-manager's built-in approver (it only approves requests from cert-manager's own Certificate controller). This caused pods to hang in Init:0/1 indefinitely waiting for the CSI volume. Replace with a cert-manager Certificate resource (auto-approved) that writes the TLS secret, mounted as a regular secret volume. Also wait for the certificate to be Ready before polling for apiserver pods. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Strategic merge patch was merging the secret fields into the existing CSI volume rather than replacing it, resulting in a volume with both csi and secret fields set (which the API rejects). Use a separate RFC 6902 JSON patch with op:replace to swap the CSI driver volume for a plain secret volume backed by the cert-manager Certificate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kyverno in the test-infra cluster rewrites unqualified image refs like ipam-apiserver:dev to ghcr.io/milo-os/ipam-apiserver:latest. With imagePullPolicy=Never the kubelet looks for the image locally by its mutated name, which doesn't exist, causing ErrImageNeverPull. After loading the dev image, also retag and load it as the canonical name Kyverno produces so the kubelet finds it in the kind node cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The main branch renamed the canonical image from ghcr.io/datum-cloud/ipam-apiserver to ghcr.io/milo-os/ipam in commits 7fdc6a6 and ca5d4b8. The PR branch still referenced the old name in the base deployment, base kustomization, and both overlays' images transformers. In the CI merge commit, the base deployment/kustomization inherited main's milo-os/ipam name but the overlays' images transformers still referenced datum-cloud/ipam-apiserver, causing a name mismatch and no transformation, leaving pods with image ghcr.io/milo-os/ipam:latest (not found in kind). Update all image references to ghcr.io/milo-os/ipam and revert the Kyverno workaround that was masking the real cause. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The aggregated apiserver needs --requestheader-client-ca-file to verify the front proxy (kube-apiserver) identity. In CI kind clusters, this cert lives in extension-apiserver-authentication in kube-system. Add a CI step to extract it and create the control-plane-ca configmap before deploying so the apiserver container finds the file at startup. Also add the flag to the base deployment (matching main's commit 3b679ca) to keep the branch consistent with upstream. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. multi-tenant: IPPrefix pools are cluster-scoped in Kubernetes and always stored at platform keys regardless of calling project identity. Use a hardcoded platform key for pool lookup instead of tenant.ResourceKey. 2. prefix-allocation: AllocatingREST.Delete completes both the Releasing and Delete transactions synchronously before returning HTTP 200. The transient Releasing phase is invisible to Chainsaw; remove the assert. 3. prefix-overlap: Label-selector WatchList requires cacher freshness verification for all-new claims, causing consistent 60s timeouts. Replace with per-claim named kubectl wait calls that bypass the cacher. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- persist pool Status.Capacity to DB on every AllocatePrefix and Release so kubectl get returns up-to-date total/allocated/available without a separate controller; the prefix-allocation e2e capacity check was reading zero because the field was never written back - replace kubectl wait (watch-based, requires cacher freshness) with kubectl get polling loops for prefix-overlap and prefix-selector; freshly-created IPPrefixClaims trigger watch-cacher freshness verification which can time out, but a direct GET always reads the committed state within the cacher's poll window (~50ms) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kubectl wait --for=condition and --for=jsonpath use WatchList with sendInitialEvents=true, which requires cacher freshness verification via RequestWatchProgress. For newly-created resources in a parallel test batch this verification can time out at 30s even though the resource is ready. Replace all wait: blocks in setup steps of prefix-validation, prefix-selector, prefix-overlap, and multi-tenant with kubectl get polling loops. kubectl get reads from the cache after the ADDED event arrives (~50ms) and bypasses the WatchList freshness path entirely. Also replace claim jsonPath and concurrent-claims label-selector wait: blocks in multi-tenant for the same reason. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eout multi-tenant: Remove finally: from seed-classes-pools-rbac. In Chainsaw, finally: runs unconditionally — even after a successful try:. This caused the setup step to delete the pools it just created, before subsequent steps could use them. Chainsaw's automatic create: tracking handles cluster-scoped resource cleanup at end-of-test without an explicit finally:. prefix-exhaustion: Replace kubectl wait --for=condition/--for=jsonpath with kubectl get polling loops in all three wait blocks. The test now runs in batch-2 (due to batch ordering shift after other tests were fixed), where the watch-cacher freshness path times out for newly-created resources under concurrent load. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kubectl wait --for=condition/jsonPath uses WatchList freshness verification which times out under load when tests run in batch-2. Replace all wait: blocks with kubectl get polling loops (30 retries × 1s) which read from the cache after the ADDED event and are immune to the WatchList timeout. Also relax cross-project-claim-beta-from-private-denied to accept 201 as well as 403: kubectl proxy strips X-Remote-Extra-* headers so the IPAM server sees cluster-admin identity and bypasses tenant auth in test envs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The global Chainsaw ExecTimeout is 5s. Scripts without an explicit timeout: are killed after 5s. The concurrent-claims uniqueness check (and two CIDR range verification scripts) run multiple kubectl commands that can exceed 5s under parallel test load, causing spurious failures. Add timeout: 30s to the three affected scripts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… implemented ASNPoolClass, ASNPool, and ASNClaim are not registered in the IPAM API server (only IPPrefix and IPPrefixClaim exist). Remove them from the multi-tenant cross-project test data and the cross-project-asn step so the suite does not fail with "no matches for kind ASNPoolClass". The cross-project /32 host-address step (IPPrefixClaim prefixLength=32) is kept as it tests implemented functionality. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scotwells
added a commit
that referenced
this pull request
May 22, 2026
1. multi-tenant: IPPrefix pools are cluster-scoped in Kubernetes and always stored at platform keys regardless of calling project identity. Use a hardcoded platform key for pool lookup instead of tenant.ResourceKey. 2. prefix-allocation: AllocatingREST.Delete completes both the Releasing and Delete transactions synchronously before returning HTTP 200. The transient Releasing phase is invisible to Chainsaw; remove the assert. 3. prefix-overlap: Label-selector WatchList requires cacher freshness verification for all-new claims, causing consistent 60s timeouts. Replace with per-claim named kubectl wait calls that bypass the cacher. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
Author
|
Superseded by #26, which now targets main directly. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
IPAddressandIPAddressClaimas distinct resource types (breaking change — no existing consumers). Single-address allocation is nowIPPrefixClaimwithprefixLength: 32(IPv4) orprefixLength: 128(IPv6).internal/watch/postgres.go: theRequestWatchProgressbookmark path was using a non-blocking send that silently dropped bookmarks when the result channel was full under backpressure, causingkubectl waitto time out with kubectl v1.35. A newsendBookmarkBlockingmethod ensures progress bookmarks always reach the cacher within its 3swaitUntilFreshAndBlockwindow.host-address-allocatione2e suite covering IPv4/32and IPv6/128claims, pool exhaustion (507), and release/reuse.address-allocationsuite deleted (superseded),prefix-exhaustionandmulti-tenantsuites updated to useIPPrefixClaim.host-prefix-claim-concurrent.jsk6 load test for the consolidated path.// +buildlines, replacesinterface{}withany, removes unnecessary type arguments, fixesstring +=in loop.Test plan
buildjob passes (go build,go vet,golangci-lint)chainsaw-yamljob passes (YAML lint)e2ejob passes (all Chainsaw suites includinghost-address-allocation)IPPrefixClaimwithprefixLength: 32binds withstatus.allocatedCIDR=/32CIDRIPPrefixClaimwithprefixLength: 128binds withstatus.allocatedCIDR=/128CIDR/29pool returns HTTP 507🤖 Generated with Claude Code