📝 Walkthrough

Implements ProjectQuota custom resource support for Kubernetes with a complete ingestion, storage, and reconciliation pipeline. Extends the VM source interfaces to track deleted servers and VM creation timestamps. Adds an HTTP API handler for quota requests, Prometheus monitoring, and a controller that reconciles quota CRDs against active VMs and committed resources.

Changes

ProjectQuota Feature Implementation
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant HTTPHandler as Quota API Handler
    participant KubeAPI as Kubernetes API
    participant Controller as Quota Controller
    participant VMSource as VM Source
    participant Metrics as Prometheus

    Client->>HTTPHandler: PUT /quota/{projectID}
    HTTPHandler->>HTTPHandler: Validate request (method, enable flag)
    HTTPHandler->>HTTPHandler: Decode ServiceQuotaRequest
    HTTPHandler->>KubeAPI: Get ProjectQuota CRD
    alt CRD not found
        HTTPHandler->>KubeAPI: Create ProjectQuota CRD
    else CRD exists
        HTTPHandler->>KubeAPI: Update ProjectQuota Spec
    end
    HTTPHandler->>Metrics: Record request metric (204)
    HTTPHandler-->>Client: 204 No Content

    Controller->>KubeAPI: Watch ProjectQuota changes
    Controller->>VMSource: List all VMs
    Controller->>VMSource: List committed resources
    Controller->>Controller: Compute TotalUsage per project/AZ
    Controller->>Controller: Compute PaygUsage (TotalUsage - CommittedUsage)
    Controller->>KubeAPI: Update ProjectQuota Status with retry
    Controller->>Metrics: Record usage gauges (total, payg, committed)
    Controller->>Metrics: Record reconcile duration
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
follow up of #783
Actionable comments posted: 8
🧹 Nitpick comments (3)
internal/scheduling/reservations/commitments/api/quota_monitor.go (1)
29-33: ⚡ Quick win: Pre-initialize the 503 series too.
`HandleQuota` has an explicit service-disabled path, so if we're pre-creating the common status-code series here, `503` should be included as well. Otherwise dashboards/alerts won't see that label until the first disabled request happens.

Suggested diff

```diff
-	for _, statusCode := range []string{"204", "400", "405", "500"} {
+	for _, statusCode := range []string{"204", "400", "405", "500", "503"} {
 		m.requestCounter.WithLabelValues(statusCode)
 		m.requestDuration.WithLabelValues(statusCode)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/api/quota_monitor.go` around lines 29 - 33, Pre-create the 503 status series alongside the others so dashboards/alerts see service-disabled metrics before the first disabled request; update the loop that pre-initializes status codes (the array passed to the for loop that calls m.requestCounter.WithLabelValues and m.requestDuration.WithLabelValues) to include "503" as well—this affects the same block used by HandleQuota's service-disabled path and ensures the "503" label exists from startup.internal/scheduling/reservations/failover/vm_source_test.go (1)
403-405: ⚡ Quick win: Add unit cases for the new deleted-server paths.

This mock change only restores interface compliance; `IsServerActive`, `GetDeletedVMInfo`, and `CreatedAt` propagation in `DBVMSource` still have no assertions here. A few table-driven cases for `DELETED`, `nil`, deleted-server flavor lookup, and `CreatedAt` would lock down the new quota path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/vm_source_test.go` around lines 403 - 405, Add table-driven unit tests exercising the new deleted-server code paths: update tests around mockNovaReader.GetDeletedServerByID and DBVMSource to include cases for (1) a DELETED deleted-server result, (2) GetDeletedServerByID returning nil, (3) deleted-server that requires flavor lookup (ensure flavor ID/name resolution happens), and (4) propagation of CreatedAt from the DeletedServer into the DB VM. For each case assert DBVMSource behavior and side-effects by checking IsServerActive and GetDeletedVMInfo are invoked as expected and that the resulting VM record's CreatedAt (and quota decision path) matches the deleted-server input.internal/scheduling/reservations/commitments/api/quota_test.go (1)
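For illustration, one possible table-driven shape for these cases; the case fields, test name, and mock wiring below are assumptions, not the actual harness in vm_source_test.go:

```go
// Sketch only: tc fields and the mock hookup are hypothetical.
func TestDBVMSource_DeletedServerPaths(t *testing.T) {
	cases := []struct {
		name          string
		serverStatus  string // e.g. "DELETED"
		serverFound   bool   // false => GetDeletedServerByID returns nil
		wantActive    bool
		wantCreatedAt string
	}{
		{name: "deleted server found", serverStatus: "DELETED", serverFound: true, wantCreatedAt: "2024-01-01T00:00:00Z"},
		{name: "deleted server missing", serverFound: false},
		{name: "deleted server needs flavor lookup", serverStatus: "DELETED", serverFound: true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			// Wire mockNovaReader so GetDeletedServerByID reflects tc, then assert:
			//   - IsServerActive reports tc.wantActive,
			//   - GetDeletedVMInfo resolves the flavor where applicable,
			//   - the resulting VM's CreatedAt equals tc.wantCreatedAt.
		})
	}
}
```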
352-354: ⚡ Quick win: Use `testlib.Ptr` instead of a local pointer helper.

`boolPtr()` duplicates the shared pointer helper pattern and drifts from the repo's test convention. As per coding guidelines:

`**/*_test.go`: Use `testlib.Ptr` for test cases that require pointer values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/commitments/api/quota_test.go` around lines 352 - 354, The helper function boolPtr(b bool) is duplicating a shared pointer helper; remove boolPtr and replace its usages with the shared test helper testlib.Ptr(b) (update imports to include testlib if missing) so tests follow the repo convention; search for the function name boolPtr and replace calls like boolPtr(true) / boolPtr(false) with testlib.Ptr(true) / testlib.Ptr(false) and delete the boolPtr function definition.
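The swap itself is mechanical; this assumes `testlib.Ptr` is the repo's generic pointer helper (signature presumed, not verified here):

```go
// Assuming a shared helper of the form testlib.Ptr[T any](v T) *T.
enable := testlib.Ptr(true) // replaces the local boolPtr(true)
```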
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f410d3c9-ddee-4e79-9ca4-43f2b4447fd6
📒 Files selected for processing (22)
- .claude/settings.local.json
- api/v1alpha1/project_quota_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- helm/library/cortex/files/crds/cortex.cloud_projectquotas.yaml
- internal/scheduling/external/nova.go
- internal/scheduling/reservations/commitments/api/handler.go
- internal/scheduling/reservations/commitments/api/info.go
- internal/scheduling/reservations/commitments/api/info_test.go
- internal/scheduling/reservations/commitments/api/quota.go
- internal/scheduling/reservations/commitments/api/quota_monitor.go
- internal/scheduling/reservations/commitments/api/quota_test.go
- internal/scheduling/reservations/commitments/config.go
- internal/scheduling/reservations/commitments/usage.go
- internal/scheduling/reservations/failover/integration_test.go
- internal/scheduling/reservations/failover/vm_source.go
- internal/scheduling/reservations/failover/vm_source_test.go
- internal/scheduling/reservations/quota/config.go
- internal/scheduling/reservations/quota/context.go
- internal/scheduling/reservations/quota/controller.go
- internal/scheduling/reservations/quota/controller_test.go
- internal/scheduling/reservations/quota/integration_test.go
- internal/scheduling/reservations/quota/metrics.go
```json
{
  "permissions": {
    "allow": [
      "Read(//root/**)",
      "Bash(go doc:*)"
    ]
  }
}
```
Tighten Claude permissions to avoid overbroad secret access.
`Read(//root/**)` grants read access to everything under `//root/`, which is far too permissive and could expose credentials/secrets to the model/tooling. Please apply least privilege by scoping the read allow rule to only the specific directories/files needed for this PR, and/or ensure this settings.local.json truly remains local-only (e.g., gitignored) and isn't relied on in CI/shared environments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/settings.local.json around lines 1 - 8, The permissions block in
settings.local.json is too broad—replace the Read(//root/**) rule with
least-privilege paths that the PR actually needs (e.g.,
Read(//root/path/to/needed/dir) or explicit filenames) and remove any wildcard
that grants repository‑wide secret access; ensure the "permissions" object only
contains allowed actions for specific resources, and make sure
settings.local.json is listed in .gitignore or otherwise prevented from being
committed/used in CI/shared environments so the local-only settings aren’t
relied upon in shared workflows.
```go
// Extract project/domain metadata if available
var projectName, domainID, domainName string
if meta, ok := req.ProjectMetadata.Unpack(); ok {
	// Consistency check: metadata UUID must match URL path UUID
	if meta.UUID != "" && meta.UUID != projectID {
		log.Info("project UUID mismatch", "urlProjectID", projectID, "metadataUUID", meta.UUID)
		api.quotaError(w, http.StatusBadRequest, fmt.Sprintf("Project UUID mismatch: URL has %q but metadata has %q", projectID, meta.UUID), startTime)
		return
	}
	projectName = meta.Name
	domainID = meta.Domain.UUID
	domainName = meta.Domain.Name
}
```
Reject requests that cannot satisfy the CRD's required domainID.
ProjectMetadata is optional here, but ProjectQuotaSpec.DomainID is required by the new type/CRD. When metadata is absent you fall through to create/update with an empty DomainID, which turns a client error into an API-server validation 500.
Suggested fix

```diff
 if meta, ok := req.ProjectMetadata.Unpack(); ok {
 	// Consistency check: metadata UUID must match URL path UUID
 	if meta.UUID != "" && meta.UUID != projectID {
 		log.Info("project UUID mismatch", "urlProjectID", projectID, "metadataUUID", meta.UUID)
 		api.quotaError(w, http.StatusBadRequest, fmt.Sprintf("Project UUID mismatch: URL has %q but metadata has %q", projectID, meta.UUID), startTime)
 		return
 	}
 	projectName = meta.Name
 	domainID = meta.Domain.UUID
 	domainName = meta.Domain.Name
 }
+if domainID == "" {
+	api.quotaError(w, http.StatusBadRequest, "missing domain UUID in project metadata", startTime)
+	return
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/commitments/api/quota.go` around lines 72 -
84, The code currently allows creating/updating a ProjectQuota with an empty
DomainID when req.ProjectMetadata is absent; change the logic in the block
around req.ProjectMetadata.Unpack() so that if Unpack() returns false you
immediately reject the request with api.quotaError(..., http.StatusBadRequest,
"missing required domainID in project metadata", startTime), and also validate
that after unpack the derived domainID (variable domainID) is non-empty—if it's
empty return a 400 similarly; keep the existing UUID consistency check and use
the same api.quotaError pattern for these new validation failures.
```go
// Create or update ProjectQuota CRD
crdName := projectQuotaCRDName(projectID)
ctx := r.Context()

var existing v1alpha1.ProjectQuota
err = api.client.Get(ctx, client.ObjectKey{Name: crdName}, &existing)
if err != nil {
	if !apierrors.IsNotFound(err) {
		// Real error
		log.Error(err, "failed to get existing ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to check existing quota: %v", err), startTime)
		return
	}
	// Not found — create new
	pq := &v1alpha1.ProjectQuota{
		ObjectMeta: metav1.ObjectMeta{
			Name: crdName,
		},
		Spec: v1alpha1.ProjectQuotaSpec{
			ProjectID:   projectID,
			ProjectName: projectName,
			DomainID:    domainID,
			DomainName:  domainName,
			Quota:       specQuota,
		},
	}
	if err := api.client.Create(ctx, pq); err != nil {
		log.Error(err, "failed to create ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to create quota: %v", err), startTime)
		return
	}
	log.V(1).Info("created ProjectQuota", "name", crdName, "projectID", projectID, "resources", len(specQuota))
} else {
	// Update existing
	existing.Spec.Quota = specQuota
	if projectName != "" {
		existing.Spec.ProjectName = projectName
	}
	if domainID != "" {
		existing.Spec.DomainID = domainID
	}
	if domainName != "" {
		existing.Spec.DomainName = domainName
	}
	if err := api.client.Update(ctx, &existing); err != nil {
		log.Error(err, "failed to update ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to update quota: %v", err), startTime)
		return
	}
	log.V(1).Info("updated ProjectQuota", "name", crdName, "projectID", projectID, "resources", len(specQuota))
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
printf '\n== ProjectQuota references ==\n'
rg -n -C2 'ProjectQuota' --type go api internal
printf '\n== Status writers ==\n'
rg -n -C2 'Status\(\)\.(Update|Patch)\(' --type go api internal
printf '\n== Quota handler write path ==\n'
rg -n -C3 'HandleQuota|\.Create\(|\.Update\(|\.Patch\(' --type go internal/scheduling/reservations/commitments/api
```

Repository: cobaltcore-dev/cortex
Length of output: 50377

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
printf '\n=== Quota Handler Code (lines 111-162) ===\n'
sed -n '111,162p' internal/scheduling/reservations/commitments/api/quota.go
printf '\n=== Controller Status Update Pattern ===\n'
sed -n '530,545p' internal/scheduling/reservations/quota/controller.go
sed -n '780,800p' internal/scheduling/reservations/quota/controller.go
printf '\n=== Checking for retry.RetryOnConflict in handler ===\n'
rg -n 'RetryOnConflict' internal/scheduling/reservations/commitments/api/
```

Repository: cobaltcore-dev/cortex
Length of output: 2932
Make the ProjectQuota write path conflict-safe.
This raw Get → Create/Update sequence races with the quota controller updating ProjectQuota status on the same object. When the controller reconciles concurrently, the handler's stale resourceVersion will fail the Update with a conflict error, turning valid PUT requests into spurious 409/500 responses.
Wrap the spec write in a retry-on-conflict loop, or use controllerutil.CreateOrUpdate() to handle version conflicts automatically.
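A minimal sketch of one conflict-safe shape, combining both suggestions; it assumes the handler's existing client and the PR's v1alpha1 types, and the specQuota parameter type below is hypothetical:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// upsertProjectQuota retries the whole Get+mutate+write cycle on conflicts, so a
// concurrent status update by the quota controller no longer surfaces as 409/500.
// Sketch only: the specQuota type is assumed, not taken from the PR.
func upsertProjectQuota(ctx context.Context, c client.Client, crdName, projectID, projectName, domainID, domainName string, specQuota map[string]int64) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pq := &v1alpha1.ProjectQuota{ObjectMeta: metav1.ObjectMeta{Name: crdName}}
		_, err := controllerutil.CreateOrUpdate(ctx, c, pq, func() error {
			// CreateOrUpdate re-Gets the object first, so this mutation always
			// starts from a fresh resourceVersion.
			pq.Spec.ProjectID = projectID
			pq.Spec.Quota = specQuota
			if projectName != "" {
				pq.Spec.ProjectName = projectName
			}
			if domainID != "" {
				pq.Spec.DomainID = domainID
			}
			if domainName != "" {
				pq.Spec.DomainName = domainName
			}
			return nil
		})
		return err
	})
}
```

Note that controllerutil.CreateOrUpdate alone does not retry; the retry.RetryOnConflict wrapper is what absorbs a conflicting concurrent write.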
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/commitments/api/quota.go` around lines 111 -
160, The current Get→Create/Update flow for ProjectQuota (crdName via
projectQuotaCRDName) races with concurrent reconciles and can fail with update
conflicts; modify the write path to be conflict-safe by wrapping the update
logic in a retry-on-conflict loop (e.g., retry.RetryOnConflict) or by using
controllerutil.CreateOrUpdate on v1alpha1.ProjectQuota to compute and apply spec
changes atomically; ensure you preserve setting Spec.Quota, Spec.ProjectName,
Spec.DomainID and Spec.DomainName and log/create errors as before but retry on
conflicts instead of returning a 409/500 immediately.
```go
// EnableQuotaAPI controls whether the quota API endpoint is active.
// When false, the endpoint will return HTTP 503 Service Unavailable.
// This can be used as an emergency switch if quota persistence is causing issues.
EnableQuotaAPI bool `json:"committedResourceEnableQuotaAPI"`
```
EnableQuotaAPI does not actually default to true in the ApplyDefaults() path.
Adding the field here with DefaultConfig().EnableQuotaAPI = true is not enough on its own: an omitted bool still stays false when config is unmarshaled into a zero-value Config and then passed through ApplyDefaults(). That means the new quota API can come up disabled even when operators never set this flag.
Please make the defaulting strategy unambiguous for bools here as well, e.g. by using *bool for tri-state config or by ensuring config decoding starts from DefaultConfig() before overrides are applied.
Also applies to: 111-111
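A sketch of option (A), the pointer approach; field and method names follow the review comment, and the surrounding struct is abbreviated:

```go
// Tri-state flag: nil means "operator did not set it", so ApplyDefaults can
// distinguish an omitted value from an explicit false.
type Config struct {
	EnableQuotaAPI *bool `json:"committedResourceEnableQuotaAPI,omitempty"`
	// ... other fields elided ...
}

func (c *Config) ApplyDefaults() {
	if c.EnableQuotaAPI == nil {
		enabled := true
		c.EnableQuotaAPI = &enabled
	}
}
```

Callers would then gate the endpoint on `*cfg.EnableQuotaAPI` once defaults have been applied.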
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/commitments/config.go` around lines 61 - 64,
The bool field EnableQuotaAPI in the Config struct currently defaults to true
only in DefaultConfig() but remains false when config is unmarshaled into a
zero-value Config and then passed through ApplyDefaults(); to fix this either
(A) change the field to a pointer (*bool EnableQuotaAPI) and update
ApplyDefaults() to set it to pointer(true) when nil, or (B) ensure
decoding/UNMARSHAL logic constructs config by starting from DefaultConfig() and
then overlays file/flags onto it before calling ApplyDefaults(); update
ApplyDefaults(), DefaultConfig(), and any decoding flow that constructs Config
(referencing EnableQuotaAPI, ApplyDefaults, DefaultConfig) so the effective
default is unambiguous and tests reflect the chosen approach.
```go
// CreatedAt is the ISO 8601 timestamp when the VM was created in Nova.
// Used by the quota controller to distinguish new VMs from migrations.
CreatedAt string
```
Propagate CreatedAt through the hypervisor-trust path.
CreatedAt is now part of the VM contract for quota reconciliation, but ListVMsOnHypervisors(..., true) rebuilds VMs in buildVMsFromHypervisors() without copying it from pgVM. Callers on that path will always see an empty timestamp and can misclassify migrations as new VMs.
Suggested fix

```diff
 vm := VM{
 	UUID:              inst.ID,
 	FlavorName:        pgVM.FlavorName,
 	ProjectID:         pgVM.ProjectID,
 	CurrentHypervisor: hv.Name, // Use hypervisor CRD location, not postgres
 	AvailabilityZone:  pgVM.AvailabilityZone,
+	CreatedAt:         pgVM.CreatedAt,
 	Resources:         pgVM.Resources,
 	FlavorExtraSpecs:  pgVM.FlavorExtraSpecs,
 }
```

Also applies to: 52-58
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/failover/vm_source.go` around lines 29 - 31,
The VM CreatedAt field is not propagated when VMs are rebuilt via
buildVMsFromHypervisors called by ListVMsOnHypervisors(..., true), causing
callers to see an empty timestamp; update buildVMsFromHypervisors to copy
pgVM.CreatedAt into the new VM struct (and any other rebuild paths that
construct VM instances from pgVM) so CreatedAt flows through the
hypervisor-trust path; locate uses of pgVM and the VM constructor/assignment in
buildVMsFromHypervisors and add copying of the CreatedAt field (and mirror the
same change in the other rebuild block referenced around the same area).
```go
func (c *QuotaController) isVMNewSinceLastReconcile(ctx context.Context, vm *failover.VM) bool {
	if vm.CreatedAt == "" {
		// No creation time available -- be conservative, skip increment.
		// The next full reconcile will pick it up.
		return false
	}

	// Look up the ProjectQuota for this VM's project
	crdName := "quota-" + vm.ProjectID
	var pq v1alpha1.ProjectQuota
	if err := c.Get(ctx, client.ObjectKey{Name: crdName}, &pq); err != nil {
		// If we can't find the ProjectQuota, skip (full reconcile will handle it)
		return false
	}

	if pq.Status.LastReconcileAt == nil {
		// No full reconcile has run yet -- skip incremental updates
		return false
	}

	// Parse the VM's creation time and compare with last reconcile
	vmCreatedAt, err := time.Parse("2006-01-02T15:04:05Z", vm.CreatedAt)
	if err != nil {
		// Try alternative format with timezone offset
		vmCreatedAt, err = time.Parse(time.RFC3339, vm.CreatedAt)
		if err != nil {
			// Cannot parse -- be conservative, skip
			return false
		}
	}

	return vmCreatedAt.After(pq.Status.LastReconcileAt.Time)
```
Keep a dedicated full-reconcile watermark for incremental add detection.
isVMNewSinceLastReconcile() treats Status.LastReconcileAt as "last full reconcile", but both the PAYG-only path and applyDeltaAndUpdateStatus() overwrite that field on every status update. After any intervening CR update or incremental delta, a VM created after the last full sweep can be misclassified as already counted and never added until the next periodic reconcile.
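A sketch of the proposed split; LastFullReconcileAt is the reviewer's suggested field, not one that exists in the PR yet (metav1 = k8s.io/apimachinery/pkg/apis/meta/v1):

```go
type ProjectQuotaStatus struct {
	// LastReconcileAt may move on every status write, full or incremental.
	LastReconcileAt *metav1.Time `json:"lastReconcileAt,omitempty"`
	// LastFullReconcileAt is written only by the periodic full reconcile and
	// serves as the watermark for incremental add detection.
	LastFullReconcileAt *metav1.Time `json:"lastFullReconcileAt,omitempty"`
}
```

isVMNewSinceLastReconcile would then bail out while LastFullReconcileAt is nil and otherwise compare `vmCreatedAt.After(pq.Status.LastFullReconcileAt.Time)`.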
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/quota/controller.go` around lines 396 - 427,
The code incorrectly uses ProjectQuota.Status.LastReconcileAt as the "last full
reconcile" watermark; introduce and use a dedicated field (e.g.
ProjectQuota.Status.LastFullReconcileAt) that is only updated by the full
periodic reconcile, update isVMNewSinceLastReconcile to compare VM creation time
against LastFullReconcileAt instead of LastReconcileAt, and change the PAYG-only
path and applyDeltaAndUpdateStatus() so they no longer overwrite
LastFullReconcileAt (they may continue to update LastReconcileAt for any status
change). Ensure the new field is set only in the full reconcile code path and
that isVMNewSinceLastReconcile references that new field when deciding whether a
VM is new.
```go
func (c *QuotaController) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		Named("quota-controller").
		// Watch ProjectQuota for spec changes (Limes pushes quota)
		For(&v1alpha1.ProjectQuota{}).
		// Watch CommittedResource for status changes (UsedAmount updates)
		Watches(
			&v1alpha1.CommittedResource{},
			handler.EnqueueRequestsFromMapFunc(c.mapCRToProjectQuota),
			builder.WithPredicates(crUsedAmountChangePredicate()),
		).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: 1,
		}).
		Complete(c)
```
🧩 Analysis chain

🏁 Script executed:

```shell
fd -t f "controller.go" internal/scheduling
```

Repository: cobaltcore-dev/cortex
Length of output: 811

🏁 Script executed:

```shell
wc -l internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 122

🏁 Script executed:

```shell
sed -n '554,570p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 627

🏁 Script executed:

```shell
sed -n '1,50p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 1874

🏁 Script executed:

```shell
grep -n "func (c \*QuotaController) Reconcile" internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 348

🏁 Script executed:

```shell
sed -n '162,250p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 3263

🏁 Script executed:

```shell
sed -n '400,500p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 3198

🏁 Script executed:

```shell
sed -n '500,550p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 1421

🏁 Script executed:

```shell
grep -n "func (c \*QuotaController) updateProjectQuotaStatusWithRetry" internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 130

🏁 Script executed:

```shell
sed -n '779,830p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 1904

🏁 Script executed:

```shell
grep -n "predicate\|Predicate\|WithPredicates" internal/scheduling/reservations/quota/controller.go | head -20
```

Repository: cobaltcore-dev/cortex
Length of output: 641

🏁 Script executed:

```shell
sed -n '554,575p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 876

🏁 Script executed:

```shell
sed -n '889,920p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 1290

🏁 Script executed:

```shell
sed -n '162,170p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 523

🏁 Script executed:

```shell
sed -n '571,595p' internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 1019

🏁 Script executed:

```shell
rg "For\(&" internal/scheduling --type go -A 3 | head -50
```

Repository: cobaltcore-dev/cortex
Length of output: 424
Add a predicate to filter status-only updates from the ProjectQuota watch.
The For(&v1alpha1.ProjectQuota{}) watch has no predicate and will enqueue on status changes. Since both Reconcile() (via updateProjectQuotaStatusWithRetry) and the incremental path (via applyDeltaAndUpdateStatus) update Status.LastReconcileAt, each reconciliation immediately triggers the next one, causing unnecessary API server churn. Add a predicate that only reacts to spec/generation changes, similar to crUsedAmountChangePredicate() used for CommittedResource.
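A sketch of the suggested predicate, assuming the PR's v1alpha1 types; note that controller-runtime's built-in predicate.GenerationChangedPredicate{} covers the common case too, since status-only writes don't bump metadata.generation:

```go
import (
	"reflect"

	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// projectQuotaSpecChangePredicate allows updates through only when the
// generation or the spec actually changed, filtering status-only writes.
func projectQuotaSpecChangePredicate() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldPQ, okOld := e.ObjectOld.(*v1alpha1.ProjectQuota)
			newPQ, okNew := e.ObjectNew.(*v1alpha1.ProjectQuota)
			if !okOld || !okNew {
				return false
			}
			return oldPQ.GetGeneration() != newPQ.GetGeneration() ||
				!reflect.DeepEqual(oldPQ.Spec, newPQ.Spec)
		},
		// Create/Delete/Generic events fall through to predicate.Funcs defaults (allow).
	}
}
```

Wired in as `For(&v1alpha1.ProjectQuota{}, builder.WithPredicates(projectQuotaSpecChangePredicate()))`.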
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/quota/controller.go` around lines 554 - 568,
The ProjectQuota watch in SetupWithManager currently has no predicate and thus
enqueues on status-only updates; add a predicate (e.g.
projectQuotaSpecChangePredicate) to the For(&v1alpha1.ProjectQuota{}) watch that
filters out status-only changes by returning true only when Generation or Spec
changed (similar logic to crUsedAmountChangePredicate()). Implement
projectQuotaSpecChangePredicate() to compare old and new objects and only allow
events where old.GetGeneration() != new.GetGeneration() or a deep-equal check of
Spec differs, then pass it into builder.WithPredicates(...) for the ProjectQuota
watch in SetupWithManager.
```go
func (m *QuotaMetrics) RecordUsage(projectID, az, resource string, totalUsage, paygUsage, crUsage int64) {
	if m == nil {
		return
	}
	m.totalUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(totalUsage))
	m.paygUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(paygUsage))
	m.crUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(crUsage))
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, locate and read the metrics.go file
fd metrics.go internal/scheduling
```

Repository: cobaltcore-dev/cortex
Length of output: 194

🏁 Script executed:

```shell
# Get an overview of the file to understand its structure
wc -l internal/scheduling/reservations/quota/metrics.go
```

Repository: cobaltcore-dev/cortex
Length of output: 118

🏁 Script executed:

```shell
# Read the entire file to understand the QuotaMetrics structure and any cleanup logic
cat -n internal/scheduling/reservations/quota/metrics.go
```

Repository: cobaltcore-dev/cortex
Length of output: 3740

🏁 Script executed:

```shell
# Search for any deletion or cleanup logic in the quota module
rg "DeleteLabelValues|Delete\(|Reset\(" internal/scheduling/reservations/quota/
```

Repository: cobaltcore-dev/cortex
Length of output: 389

🏁 Script executed:

```shell
# Search for calls to RecordUsage to understand the lifecycle and context
rg "RecordUsage" internal/scheduling/reservations/quota/
```

Repository: cobaltcore-dev/cortex
Length of output: 478

🏁 Script executed:

```shell
# Check if there are any other metric management methods in related quota files
fd "\.go$" internal/scheduling/reservations/quota/ --exec grep -l "GaugeVec\|Delete" {}
```

Repository: cobaltcore-dev/cortex
Length of output: 285

🏁 Script executed:

```shell
# Find the line where RecordUsage is called and read context around it
rg -B 10 -A 10 "c\.Metrics\.RecordUsage" internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 869

🏁 Script executed:

```shell
# Get more context on the overall structure of the controller
wc -l internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 122

🏁 Script executed:

```shell
# Search for any cleanup or eviction logic in the controller
rg "evict|cleanup|remove.*label|forget" internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 47

🏁 Script executed:

```shell
# Find where these usage maps are built to understand what entries they contain
rg -B 5 "for resourceName, total := range totalUsage" internal/scheduling/reservations/quota/controller.go
```

Repository: cobaltcore-dev/cortex
Length of output: 469

🏁 Script executed:

```shell
# Look for where totalUsage, paygUsage, crUsage are populated
rg "totalUsage\s*:=|totalUsage\s*=" internal/scheduling/reservations/quota/controller.go | head -20
```

Repository: cobaltcore-dev/cortex
Length of output: 125

🏁 Script executed:

```shell
# Check if there's any cleanup logic for deleted projects
rg "reconcil|Reconcil" internal/scheduling/reservations/quota/controller.go | head -15
```

Repository: cobaltcore-dev/cortex
Length of output: 828

🏁 Script executed:

```shell
# Search for DeleteLabelValues patterns in other metric files to see if this is handled elsewhere
rg "DeleteLabelValues" internal/scheduling/
```

Repository: cobaltcore-dev/cortex
Length of output: 47

🏁 Script executed:

```shell
# Check other metric files in the codebase for cardinality management patterns
fd "metrics\.go$" internal/scheduling/ --exec grep -l "GaugeVec" {}
```

Repository: cobaltcore-dev/cortex
Length of output: 115

🏁 Script executed:

```shell
# Look at another metrics file to see if there are deletion patterns
cat -n internal/scheduling/reservations/commitments/api/change_commitments_metrics.go | head -100
```

Repository: cobaltcore-dev/cortex
Length of output: 1688
Add cleanup for removed usage label sets.
These gauges only ever Set() values. When a project/resource/AZ disappears, the old series is never deleted, so Prometheus will keep exporting stale usage and the project_id cardinality will grow monotonically over time. Add DeleteLabelValues() calls when projects or resources are removed to prevent unbounded metric cardinality growth.
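A minimal sketch of the cleanup, assuming all-zero usage is the signal that a project/AZ/resource was removed (the controller may instead need an explicit removed-keys set):

```go
// Sketch only: drop the series instead of exporting a stale zero.
func (m *QuotaMetrics) RecordUsage(projectID, az, resource string, totalUsage, paygUsage, crUsage int64) {
	if m == nil {
		return
	}
	if totalUsage == 0 && paygUsage == 0 && crUsage == 0 {
		// Treating "all zero" as "removed" is an assumption here.
		m.totalUsageGauge.DeleteLabelValues(projectID, az, resource)
		m.paygUsageGauge.DeleteLabelValues(projectID, az, resource)
		m.crUsageGauge.DeleteLabelValues(projectID, az, resource)
		return
	}
	m.totalUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(totalUsage))
	m.paygUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(paygUsage))
	m.crUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(crUsage))
}
```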
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/quota/metrics.go` around lines 71 - 77, In
QuotaMetrics.RecordUsage, avoid leaking metric label series by deleting label
values when a usage is removed: if totalUsage, paygUsage and crUsage are all
zero (or otherwise indicate the resource/project/AZ was removed), call
m.totalUsageGauge.DeleteLabelValues(projectID, az, resource),
m.paygUsageGauge.DeleteLabelValues(...), and
m.crUsageGauge.DeleteLabelValues(...) instead of Set; otherwise continue to Set
the gauges as currently implemented. Ensure you reference the existing symbols
RecordUsage, totalUsageGauge, paygUsageGauge, crUsageGauge and use
DeleteLabelValues to prevent unbounded cardinality growth.