
feat: Qutoa calc payg #796

Open
umswmayj wants to merge 2 commits into main from qutoa_calc_payg

Conversation

@umswmayj
Collaborator

@umswmayj umswmayj commented May 4, 2026

No description provided.

@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Implements ProjectQuota custom resource support for Kubernetes with a complete ingestion, storage, and reconciliation pipeline. Extends VM source interfaces to track deleted servers and VM creation timestamps. Adds HTTP API handler for quota requests, Prometheus monitoring, and a controller reconciling quota CRDs against active VMs and committed resources.

Changes

ProjectQuota Feature Implementation

Layer / File(s) Summary
Data Shape & API Types
api/v1alpha1/project_quota_types.go
Defines ResourceQuota, ResourceQuotaUsage, ProjectQuotaSpec, ProjectQuotaStatus, ProjectQuota, and ProjectQuotaList types with kubebuilder markers for cluster-scoped CR, status subresource, and print columns.
Code Generation
api/v1alpha1/zz_generated.deepcopy.go
Autogenerated DeepCopy/DeepCopyInto methods for all new types, including deep-copy logic for nested maps (PerAZ, quota maps) and condition slices.
CRD Schema Definition
helm/library/cortex/files/crds/cortex.cloud_projectquotas.yaml
OpenAPI schema for ProjectQuota with spec fields (domain/project IDs, optional names, quota map per resource with per-AZ breakdown) and status fields (usage maps, reconcile timestamp, conditions).
VM Source Infrastructure
internal/scheduling/reservations/failover/vm_source.go, internal/scheduling/external/nova.go
Extended VMSource interface with IsServerActive and GetDeletedVMInfo methods; added DeletedVMInfo struct for quota decrements; added VM.CreatedAt timestamp; extended NovaReaderInterface with GetDeletedServerByID; implemented all methods on DBVMSource and NovaReader.
VM Source Tests
internal/scheduling/reservations/failover/vm_source_test.go, internal/scheduling/reservations/failover/integration_test.go
Updated mockNovaReader and MockVMSource to implement new interface methods; extended failover integration test mocks.
Quota API Handler
internal/scheduling/reservations/commitments/api/quota.go, internal/scheduling/reservations/commitments/api/quota_monitor.go
Transformed HandleQuota from no-op to full request handler: validates PUT method, checks EnableQuotaAPI flag, extracts project UUID, decodes liquid.ServiceQuotaRequest, validates metadata, converts quotas with overflow checks, creates/updates ProjectQuota CRD, records Prometheus metrics via new QuotaAPIMonitor.
API Configuration & Wiring
internal/scheduling/reservations/commitments/api/handler.go, internal/scheduling/reservations/commitments/config.go, internal/scheduling/reservations/commitments/api/info.go, internal/scheduling/reservations/commitments/api/usage.go
Added quotaMonitor field to HTTPAPI and registered with Prometheus; added EnableQuotaAPI config boolean (default true); updated RAM resource topology/quota conditionals in buildServiceInfo based on handlesCommitments flag; updated buildUsageResponse to initialize per-AZ RAM quotas for commitment-accepting groups.
API Tests
internal/scheduling/reservations/commitments/api/quota_test.go, internal/scheduling/reservations/commitments/api/info_test.go
Added comprehensive HandleQuota tests covering error cases (method validation, API disablement, request validation, UUID mismatch) and create/update flows with CRD persistence; extended info_test.go assertions for topology and quota fields per flavor group.
Quota Controller Core
internal/scheduling/reservations/quota/controller.go, internal/scheduling/reservations/quota/config.go, internal/scheduling/reservations/quota/context.go, internal/scheduling/reservations/quota/metrics.go
Implemented three-path reconciliation: (1) periodic full reconciliation recomputing total usage from all VMs and deriving PAYG usage from committed resources, (2) watch-triggered single-quota reconciliation recomputing PAYG when committed resources change, (3) incremental HV-diff reconciliation delta-updating usage for added/removed VMs; added config with 5-minute default reconcile interval and CR state filter; added context helpers for request IDs and logging; added quota metrics (usage gauges, reconcile duration, result counters).
Controller Helpers & Logic
internal/scheduling/reservations/quota/controller.go (computation methods)
Implemented computeTotalUsage aggregating VMs per project/AZ/resource; computeCRUsage summing committed resources by state; derivePaygUsage subtracting CR from total with clamp; buildFlavorToGroupMap; vmResourceUnits converting VM specs to commitment units; incrementUsage/decrementUsage with decrement clamp; updateProjectQuotaStatusWithRetry persisting status with conflict-retry; event mapping and predicates for watch-based triggers.
Controller Tests
internal/scheduling/reservations/quota/controller_test.go
Added 11 unit tests covering computation helpers (computeTotalUsage, computeCRUsage, derivePaygUsage, etc.), usage delta accumulation, reconciliation with nil state, and VM accumulation (unknown/known flavors with creation-time guards).
Integration Tests
internal/scheduling/reservations/quota/integration_test.go
Added comprehensive scenario-driven integration test suite with 20+ test cases covering: baseline full reconciliation, periodic idempotency, incremental add/remove/migration/deletion, unknown-flavor skipping, drift correction, multi-project sequences, and PaygUsage updates on CR changes; includes test framework (scenario builders, action execution, verification helpers).
Configuration
.claude/settings.local.json
Added Claude IDE local settings granting Read(//root/**) and Bash(go doc:*) permissions.

Sequence Diagram

sequenceDiagram
    participant Client
    participant HTTPHandler as Quota API Handler
    participant KubeAPI as Kubernetes API
    participant Controller as Quota Controller
    participant VMSource as VM Source
    participant Metrics as Prometheus

    Client->>HTTPHandler: PUT /quota/{projectID}
    HTTPHandler->>HTTPHandler: Validate request (method, enable flag)
    HTTPHandler->>HTTPHandler: Decode ServiceQuotaRequest
    HTTPHandler->>KubeAPI: Get ProjectQuota CRD
    alt CRD not found
        HTTPHandler->>KubeAPI: Create ProjectQuota CRD
    else CRD exists
        HTTPHandler->>KubeAPI: Update ProjectQuota Spec
    end
    HTTPHandler->>Metrics: Record request metric (204)
    HTTPHandler-->>Client: 204 No Content

    Controller->>KubeAPI: Watch ProjectQuota changes
    Controller->>VMSource: List all VMs
    Controller->>VMSource: List committed resources
    Controller->>Controller: Compute TotalUsage per project/AZ
    Controller->>Controller: Compute PaygUsage (TotalUsage - CommittedUsage)
    Controller->>KubeAPI: Update ProjectQuota Status with retry
    Controller->>Metrics: Record usage gauges (total, payg, committed)
    Controller->>Metrics: Record reconcile duration

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • auhlig
  • PhilippMatthes
  • juliusclausnitzer

Poem

🐰 A rabbit hops through quota lands
CRDs in place, controller stands
From VM deletion to PaygUsage born,
Reconciliation metrics dawn,
Tracking projects with grace and care!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Description check (⚠️ Warning): No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. Resolution: add a detailed description explaining the changes, including the new ProjectQuota CRD, quota controller, and PAYG usage calculation logic.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 49.02%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check (❓ Inconclusive): The title 'feat: Qutoa calc payg' contains a typo ('Qutoa' instead of 'Quota') and is vague. It hints at quota and PAYG (pay-as-you-go) calculations but doesn't convey the substantial changes across 20+ files, including new Kubernetes types, controllers, and quota management infrastructure. Resolution: fix the typo and describe the main change more specifically, e.g. 'feat: Implement quota controller with PAYG usage calculation'.

✅ Passed checks (2 passed)

  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.




@umswmayj
Collaborator Author

umswmayj commented May 4, 2026

Follow-up of #783


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (3)
internal/scheduling/reservations/commitments/api/quota_monitor.go (1)

29-33: ⚡ Quick win

Pre-initialize the 503 series too.

HandleQuota has an explicit service-disabled path, so if we're pre-creating the common status-code series here, 503 should be included as well. Otherwise dashboards/alerts won't see that label until the first disabled request happens.

Suggested diff
-	for _, statusCode := range []string{"204", "400", "405", "500"} {
+	for _, statusCode := range []string{"204", "400", "405", "500", "503"} {
 		m.requestCounter.WithLabelValues(statusCode)
 		m.requestDuration.WithLabelValues(statusCode)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/commitments/api/quota_monitor.go` around
lines 29 - 33, Pre-create the 503 status series alongside the others so
dashboards/alerts see service-disabled metrics before the first disabled
request; update the loop that pre-initializes status codes (the array passed to
the for loop that calls m.requestCounter.WithLabelValues and
m.requestDuration.WithLabelValues) to include "503" as well—this affects the
same block used by HandleQuota's service-disabled path and ensures the "503"
label exists from startup.
internal/scheduling/reservations/failover/vm_source_test.go (1)

403-405: ⚡ Quick win

Add unit cases for the new deleted-server paths.

This mock change only restores interface compliance; IsServerActive, GetDeletedVMInfo, and CreatedAt propagation in DBVMSource still have no assertions here. A few table-driven cases for DELETED, nil, deleted-server flavor lookup, and CreatedAt would lock down the new quota path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/failover/vm_source_test.go` around lines 403
- 405, Add table-driven unit tests exercising the new deleted-server code paths:
update tests around mockNovaReader.GetDeletedServerByID and DBVMSource to
include cases for (1) a DELETED deleted-server result, (2) GetDeletedServerByID
returning nil, (3) deleted-server that requires flavor lookup (ensure flavor
ID/name resolution happens), and (4) propagation of CreatedAt from the
DeletedServer into the DB VM. For each case assert DBVMSource behavior and
side-effects by checking IsServerActive and GetDeletedVMInfo are invoked as
expected and that the resulting VM record's CreatedAt (and quota decision path)
matches the deleted-server input.
internal/scheduling/reservations/commitments/api/quota_test.go (1)

352-354: ⚡ Quick win

Use testlib.Ptr instead of a local pointer helper.

boolPtr() duplicates the shared pointer helper pattern and drifts from the repo's test convention.

As per coding guidelines, **/*_test.go: Use testlib.Ptr for test cases that require pointer values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/commitments/api/quota_test.go` around lines
352 - 354, The helper function boolPtr(b bool) is duplicating a shared pointer
helper; remove boolPtr and replace its usages with the shared test helper
testlib.Ptr(b) (update imports to include testlib if missing) so tests follow
the repo convention; search for the function name boolPtr and replace calls like
boolPtr(true) / boolPtr(false) with testlib.Ptr(true) / testlib.Ptr(false) and
delete the boolPtr function definition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/settings.local.json:
- Around line 1-8: The permissions block in settings.local.json is too
broad—replace the Read(//root/**) rule with least-privilege paths that the PR
actually needs (e.g., Read(//root/path/to/needed/dir) or explicit filenames) and
remove any wildcard that grants repository‑wide secret access; ensure the
"permissions" object only contains allowed actions for specific resources, and
make sure settings.local.json is listed in .gitignore or otherwise prevented
from being committed/used in CI/shared environments so the local-only settings
aren’t relied upon in shared workflows.

In `@internal/scheduling/reservations/commitments/api/quota.go`:
- Around line 72-84: The code currently allows creating/updating a ProjectQuota
with an empty DomainID when req.ProjectMetadata is absent; change the logic in
the block around req.ProjectMetadata.Unpack() so that if Unpack() returns false
you immediately reject the request with api.quotaError(...,
http.StatusBadRequest, "missing required domainID in project metadata",
startTime), and also validate that after unpack the derived domainID (variable
domainID) is non-empty—if it's empty return a 400 similarly; keep the existing
UUID consistency check and use the same api.quotaError pattern for these new
validation failures.
- Around line 111-160: The current Get→Create/Update flow for ProjectQuota
(crdName via projectQuotaCRDName) races with concurrent reconciles and can fail
with update conflicts; modify the write path to be conflict-safe by wrapping the
update logic in a retry-on-conflict loop (e.g., retry.RetryOnConflict) or by
using controllerutil.CreateOrUpdate on v1alpha1.ProjectQuota to compute and
apply spec changes atomically; ensure you preserve setting Spec.Quota,
Spec.ProjectName, Spec.DomainID and Spec.DomainName and log/create errors as
before but retry on conflicts instead of returning a 409/500 immediately.

In `@internal/scheduling/reservations/commitments/config.go`:
- Around line 61-64: The bool field EnableQuotaAPI in the Config struct
currently defaults to true only in DefaultConfig() but remains false when config
is unmarshaled into a zero-value Config and then passed through ApplyDefaults();
to fix this either (A) change the field to a pointer (*bool EnableQuotaAPI) and
update ApplyDefaults() to set it to pointer(true) when nil, or (B) ensure
decoding/UNMARSHAL logic constructs config by starting from DefaultConfig() and
then overlays file/flags onto it before calling ApplyDefaults(); update
ApplyDefaults(), DefaultConfig(), and any decoding flow that constructs Config
(referencing EnableQuotaAPI, ApplyDefaults, DefaultConfig) so the effective
default is unambiguous and tests reflect the chosen approach.

In `@internal/scheduling/reservations/failover/vm_source.go`:
- Around line 29-31: The VM CreatedAt field is not propagated when VMs are
rebuilt via buildVMsFromHypervisors called by ListVMsOnHypervisors(..., true),
causing callers to see an empty timestamp; update buildVMsFromHypervisors to
copy pgVM.CreatedAt into the new VM struct (and any other rebuild paths that
construct VM instances from pgVM) so CreatedAt flows through the
hypervisor-trust path; locate uses of pgVM and the VM constructor/assignment in
buildVMsFromHypervisors and add copying of the CreatedAt field (and mirror the
same change in the other rebuild block referenced around the same area).

In `@internal/scheduling/reservations/quota/controller.go`:
- Around line 396-427: The code incorrectly uses
ProjectQuota.Status.LastReconcileAt as the "last full reconcile" watermark;
introduce and use a dedicated field (e.g.
ProjectQuota.Status.LastFullReconcileAt) that is only updated by the full
periodic reconcile, update isVMNewSinceLastReconcile to compare VM creation time
against LastFullReconcileAt instead of LastReconcileAt, and change the PAYG-only
path and applyDeltaAndUpdateStatus() so they no longer overwrite
LastFullReconcileAt (they may continue to update LastReconcileAt for any status
change). Ensure the new field is set only in the full reconcile code path and
that isVMNewSinceLastReconcile references that new field when deciding whether a
VM is new.
- Around line 554-568: The ProjectQuota watch in SetupWithManager currently has
no predicate and thus enqueues on status-only updates; add a predicate (e.g.
projectQuotaSpecChangePredicate) to the For(&v1alpha1.ProjectQuota{}) watch that
filters out status-only changes by returning true only when Generation or Spec
changed (similar logic to crUsedAmountChangePredicate()). Implement
projectQuotaSpecChangePredicate() to compare old and new objects and only allow
events where old.GetGeneration() != new.GetGeneration() or a deep-equal check of
Spec differs, then pass it into builder.WithPredicates(...) for the ProjectQuota
watch in SetupWithManager.

In `@internal/scheduling/reservations/quota/metrics.go`:
- Around line 71-77: In QuotaMetrics.RecordUsage, avoid leaking metric label
series by deleting label values when a usage is removed: if totalUsage,
paygUsage and crUsage are all zero (or otherwise indicate the
resource/project/AZ was removed), call
m.totalUsageGauge.DeleteLabelValues(projectID, az, resource),
m.paygUsageGauge.DeleteLabelValues(...), and
m.crUsageGauge.DeleteLabelValues(...) instead of Set; otherwise continue to Set
the gauges as currently implemented. Ensure you reference the existing symbols
RecordUsage, totalUsageGauge, paygUsageGauge, crUsageGauge and use
DeleteLabelValues to prevent unbounded cardinality growth.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f410d3c9-ddee-4e79-9ca4-43f2b4447fd6

📥 Commits

Reviewing files that changed from the base of the PR and between a29bbcd and 6619176.

📒 Files selected for processing (22)
  • .claude/settings.local.json
  • api/v1alpha1/project_quota_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • helm/library/cortex/files/crds/cortex.cloud_projectquotas.yaml
  • internal/scheduling/external/nova.go
  • internal/scheduling/reservations/commitments/api/handler.go
  • internal/scheduling/reservations/commitments/api/info.go
  • internal/scheduling/reservations/commitments/api/info_test.go
  • internal/scheduling/reservations/commitments/api/quota.go
  • internal/scheduling/reservations/commitments/api/quota_monitor.go
  • internal/scheduling/reservations/commitments/api/quota_test.go
  • internal/scheduling/reservations/commitments/config.go
  • internal/scheduling/reservations/commitments/usage.go
  • internal/scheduling/reservations/failover/integration_test.go
  • internal/scheduling/reservations/failover/vm_source.go
  • internal/scheduling/reservations/failover/vm_source_test.go
  • internal/scheduling/reservations/quota/config.go
  • internal/scheduling/reservations/quota/context.go
  • internal/scheduling/reservations/quota/controller.go
  • internal/scheduling/reservations/quota/controller_test.go
  • internal/scheduling/reservations/quota/integration_test.go
  • internal/scheduling/reservations/quota/metrics.go

Comment on lines +1 to +8
{
  "permissions": {
    "allow": [
      "Read(//root/**)",
      "Bash(go doc:*)"
    ]
  }
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten Claude permissions to avoid overbroad secret access.

Read(//root/**) grants read access to everything under //root/**, which is potentially too permissive and could expose credentials/secrets to the model/tooling. Please apply least-privilege by scoping the read allow rule to only the specific directories/files needed for this PR, and/or ensure this settings.local.json truly remains local-only (e.g., gitignored) and isn’t relied on in CI/shared environments.
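For reference, a least-privilege variant of the settings file might look like this; the narrowed path is purely illustrative and should be scoped to whatever the PR actually needs:

```json
{
  "permissions": {
    "allow": [
      "Read(//root/cortex/internal/**)",
      "Bash(go doc:*)"
    ]
  }
}
```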


Comment on lines +72 to +84
// Extract project/domain metadata if available
var projectName, domainID, domainName string
if meta, ok := req.ProjectMetadata.Unpack(); ok {
	// Consistency check: metadata UUID must match URL path UUID
	if meta.UUID != "" && meta.UUID != projectID {
		log.Info("project UUID mismatch", "urlProjectID", projectID, "metadataUUID", meta.UUID)
		api.quotaError(w, http.StatusBadRequest, fmt.Sprintf("Project UUID mismatch: URL has %q but metadata has %q", projectID, meta.UUID), startTime)
		return
	}
	projectName = meta.Name
	domainID = meta.Domain.UUID
	domainName = meta.Domain.Name
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject requests that cannot satisfy the CRD's required domainID.

ProjectMetadata is optional here, but ProjectQuotaSpec.DomainID is required by the new type/CRD. When metadata is absent you fall through to create/update with an empty DomainID, which turns a client error into an API-server validation 500.

Suggested fix
 	if meta, ok := req.ProjectMetadata.Unpack(); ok {
 		// Consistency check: metadata UUID must match URL path UUID
 		if meta.UUID != "" && meta.UUID != projectID {
 			log.Info("project UUID mismatch", "urlProjectID", projectID, "metadataUUID", meta.UUID)
 			api.quotaError(w, http.StatusBadRequest, fmt.Sprintf("Project UUID mismatch: URL has %q but metadata has %q", projectID, meta.UUID), startTime)
 			return
 		}
 		projectName = meta.Name
 		domainID = meta.Domain.UUID
 		domainName = meta.Domain.Name
 	}
+	if domainID == "" {
+		api.quotaError(w, http.StatusBadRequest, "missing domain UUID in project metadata", startTime)
+		return
+	}

Comment on lines +111 to +160
// Create or update ProjectQuota CRD
crdName := projectQuotaCRDName(projectID)
ctx := r.Context()

var existing v1alpha1.ProjectQuota
err = api.client.Get(ctx, client.ObjectKey{Name: crdName}, &existing)
if err != nil {
	if !apierrors.IsNotFound(err) {
		// Real error
		log.Error(err, "failed to get existing ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to check existing quota: %v", err), startTime)
		return
	}
	// Not found — create new
	pq := &v1alpha1.ProjectQuota{
		ObjectMeta: metav1.ObjectMeta{
			Name: crdName,
		},
		Spec: v1alpha1.ProjectQuotaSpec{
			ProjectID:   projectID,
			ProjectName: projectName,
			DomainID:    domainID,
			DomainName:  domainName,
			Quota:       specQuota,
		},
	}
	if err := api.client.Create(ctx, pq); err != nil {
		log.Error(err, "failed to create ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to create quota: %v", err), startTime)
		return
	}
	log.V(1).Info("created ProjectQuota", "name", crdName, "projectID", projectID, "resources", len(specQuota))
} else {
	// Update existing
	existing.Spec.Quota = specQuota
	if projectName != "" {
		existing.Spec.ProjectName = projectName
	}
	if domainID != "" {
		existing.Spec.DomainID = domainID
	}
	if domainName != "" {
		existing.Spec.DomainName = domainName
	}
	if err := api.client.Update(ctx, &existing); err != nil {
		log.Error(err, "failed to update ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to update quota: %v", err), startTime)
		return
	}
	log.V(1).Info("updated ProjectQuota", "name", crdName, "projectID", projectID, "resources", len(specQuota))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== ProjectQuota references ==\n'
rg -n -C2 'ProjectQuota' --type go api internal

printf '\n== Status writers ==\n'
rg -n -C2 'Status\(\)\.(Update|Patch)\(' --type go api internal

printf '\n== Quota handler write path ==\n'
rg -n -C3 'HandleQuota|\.Create\(|\.Update\(|\.Patch\(' --type go internal/scheduling/reservations/commitments/api

Repository: cobaltcore-dev/cortex

Length of output: 50377


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n=== Quota Handler Code (lines 111-162) ===\n'
sed -n '111,162p' internal/scheduling/reservations/commitments/api/quota.go

printf '\n=== Controller Status Update Pattern ===\n'
sed -n '530,545p' internal/scheduling/reservations/quota/controller.go
sed -n '780,800p' internal/scheduling/reservations/quota/controller.go

printf '\n=== Checking for retry.RetryOnConflict in handler ===\n'
rg -n 'RetryOnConflict' internal/scheduling/reservations/commitments/api/

Repository: cobaltcore-dev/cortex

Length of output: 2932


Make the ProjectQuota write path conflict-safe.

This raw Get → Create/Update sequence races with the quota controller, which updates ProjectQuota status on the same object. When the controller reconciles concurrently, the handler's stale resourceVersion will fail the Update with a conflict error, turning valid PUT requests into spurious 409/500 responses.

Wrap the spec write in a retry-on-conflict loop, or use controllerutil.CreateOrUpdate() to handle version conflicts automatically.
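
In isolation, the retry-on-conflict shape looks like this. In the real handler it would be `retry.RetryOnConflict` from k8s.io/client-go (or `controllerutil.CreateOrUpdate` from controller-runtime); the `errConflict` value and `retryOnConflict` helper below are stdlib stand-ins that mimic that API's contract:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for Kubernetes' 409 Conflict (stale resourceVersion).
var errConflict = errors.New("conflict: object was modified")

// retryOnConflict mirrors the shape of retry.RetryOnConflict: re-run fn,
// which must re-read the latest object itself, until it stops conflicting
// or the attempt budget is exhausted.
func retryOnConflict(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err // success, or a non-conflict error worth surfacing
		}
	}
	return err
}

func main() {
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++
		if attempts < 3 {
			return errConflict // simulate two concurrent status updates
		}
		return nil // a fresh read sees a new resourceVersion; update succeeds
	})
	fmt.Println(err == nil, attempts) // true 3
}
```

The key detail carried over from the real API: the closure must perform the Get again on each attempt, otherwise every retry reuses the same stale object and conflicts forever.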


Comment on lines +61 to +64
// EnableQuotaAPI controls whether the quota API endpoint is active.
// When false, the endpoint will return HTTP 503 Service Unavailable.
// This can be used as an emergency switch if quota persistence is causing issues.
EnableQuotaAPI bool `json:"committedResourceEnableQuotaAPI"`

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

EnableQuotaAPI does not actually default to true in the ApplyDefaults() path.

Adding the field here with DefaultConfig().EnableQuotaAPI = true is not enough on its own: an omitted bool still stays false when config is unmarshaled into a zero-value Config and then passed through ApplyDefaults(). That means the new quota API can come up disabled even when operators never set this flag.

Please make the defaulting strategy unambiguous for bools here as well, e.g. by using *bool for tri-state config or by ensuring config decoding starts from DefaultConfig() before overrides are applied.

Also applies to: 111-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/commitments/config.go` around lines 61 - 64,
The bool field EnableQuotaAPI in the Config struct currently defaults to true
only in DefaultConfig() but remains false when config is unmarshaled into a
zero-value Config and then passed through ApplyDefaults(); to fix this either
(A) change the field to a pointer (*bool EnableQuotaAPI) and update
ApplyDefaults() to set it to pointer(true) when nil, or (B) ensure
decoding/UNMARSHAL logic constructs config by starting from DefaultConfig() and
then overlays file/flags onto it before calling ApplyDefaults(); update
ApplyDefaults(), DefaultConfig(), and any decoding flow that constructs Config
(referencing EnableQuotaAPI, ApplyDefaults, DefaultConfig) so the effective
default is unambiguous and tests reflect the chosen approach.

Comment on lines +29 to +31
// CreatedAt is the ISO 8601 timestamp when the VM was created in Nova.
// Used by the quota controller to distinguish new VMs from migrations.
CreatedAt string

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate CreatedAt through the hypervisor-trust path.

CreatedAt is now part of the VM contract for quota reconciliation, but ListVMsOnHypervisors(..., true) rebuilds VMs in buildVMsFromHypervisors() without copying it from pgVM. Callers on that path will always see an empty timestamp and can misclassify migrations as new VMs.

Suggested fix
 			vm := VM{
 				UUID:              inst.ID,
 				FlavorName:        pgVM.FlavorName,
 				ProjectID:         pgVM.ProjectID,
 				CurrentHypervisor: hv.Name, // Use hypervisor CRD location, not postgres
 				AvailabilityZone:  pgVM.AvailabilityZone,
+				CreatedAt:         pgVM.CreatedAt,
 				Resources:         pgVM.Resources,
 				FlavorExtraSpecs:  pgVM.FlavorExtraSpecs,
 			}

Also applies to: 52-58

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/failover/vm_source.go` around lines 29 - 31,
The VM CreatedAt field is not propagated when VMs are rebuilt via
buildVMsFromHypervisors called by ListVMsOnHypervisors(..., true), causing
callers to see an empty timestamp; update buildVMsFromHypervisors to copy
pgVM.CreatedAt into the new VM struct (and any other rebuild paths that
construct VM instances from pgVM) so CreatedAt flows through the
hypervisor-trust path; locate uses of pgVM and the VM constructor/assignment in
buildVMsFromHypervisors and add copying of the CreatedAt field (and mirror the
same change in the other rebuild block referenced around the same area).

Comment on lines +396 to +427
func (c *QuotaController) isVMNewSinceLastReconcile(ctx context.Context, vm *failover.VM) bool {
if vm.CreatedAt == "" {
// No creation time available -- be conservative, skip increment.
// The next full reconcile will pick it up.
return false
}

// Look up the ProjectQuota for this VM's project
crdName := "quota-" + vm.ProjectID
var pq v1alpha1.ProjectQuota
if err := c.Get(ctx, client.ObjectKey{Name: crdName}, &pq); err != nil {
// If we can't find the ProjectQuota, skip (full reconcile will handle it)
return false
}

if pq.Status.LastReconcileAt == nil {
// No full reconcile has run yet -- skip incremental updates
return false
}

// Parse the VM's creation time and compare with last reconcile
vmCreatedAt, err := time.Parse("2006-01-02T15:04:05Z", vm.CreatedAt)
if err != nil {
// Try alternative format with timezone offset
vmCreatedAt, err = time.Parse(time.RFC3339, vm.CreatedAt)
if err != nil {
// Cannot parse -- be conservative, skip
return false
}
}

return vmCreatedAt.After(pq.Status.LastReconcileAt.Time)

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep a dedicated full-reconcile watermark for incremental add detection.

isVMNewSinceLastReconcile() treats Status.LastReconcileAt as "last full reconcile", but both the PAYG-only path and applyDeltaAndUpdateStatus() overwrite that field on every status update. After any intervening CR update or incremental delta, a VM created after the last full sweep can be misclassified as already counted and never added until the next periodic reconcile.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/quota/controller.go` around lines 396 - 427,
The code incorrectly uses ProjectQuota.Status.LastReconcileAt as the "last full
reconcile" watermark; introduce and use a dedicated field (e.g.
ProjectQuota.Status.LastFullReconcileAt) that is only updated by the full
periodic reconcile, update isVMNewSinceLastReconcile to compare VM creation time
against LastFullReconcileAt instead of LastReconcileAt, and change the PAYG-only
path and applyDeltaAndUpdateStatus() so they no longer overwrite
LastFullReconcileAt (they may continue to update LastReconcileAt for any status
change). Ensure the new field is set only in the full reconcile code path and
that isVMNewSinceLastReconcile references that new field when deciding whether a
VM is new.

Comment on lines +554 to +568
func (c *QuotaController) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
Named("quota-controller").
// Watch ProjectQuota for spec changes (Limes pushes quota)
For(&v1alpha1.ProjectQuota{}).
// Watch CommittedResource for status changes (UsedAmount updates)
Watches(
&v1alpha1.CommittedResource{},
handler.EnqueueRequestsFromMapFunc(c.mapCRToProjectQuota),
builder.WithPredicates(crUsedAmountChangePredicate()),
).
WithOptions(controller.Options{
MaxConcurrentReconciles: 1,
}).
Complete(c)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "controller.go" internal/scheduling

Repository: cobaltcore-dev/cortex

Length of output: 811


🏁 Script executed:

wc -l internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 122


🏁 Script executed:

sed -n '554,570p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 627


🏁 Script executed:

sed -n '1,50p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1874


🏁 Script executed:

grep -n "func (c \*QuotaController) Reconcile" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 348


🏁 Script executed:

sed -n '162,250p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 3263


🏁 Script executed:

sed -n '400,500p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 3198


🏁 Script executed:

sed -n '500,550p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1421


🏁 Script executed:

grep -n "func (c \*QuotaController) updateProjectQuotaStatusWithRetry" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 130


🏁 Script executed:

sed -n '779,830p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1904


🏁 Script executed:

grep -n "predicate\|Predicate\|WithPredicates" internal/scheduling/reservations/quota/controller.go | head -20

Repository: cobaltcore-dev/cortex

Length of output: 641


🏁 Script executed:

sed -n '554,575p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 876


🏁 Script executed:

sed -n '889,920p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1290


🏁 Script executed:

sed -n '162,170p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 523


🏁 Script executed:

sed -n '571,595p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1019


🏁 Script executed:

rg "For\(&" internal/scheduling --type go -A 3 | head -50

Repository: cobaltcore-dev/cortex

Length of output: 424


Add a predicate to filter status-only updates from the ProjectQuota watch.

The For(&v1alpha1.ProjectQuota{}) watch has no predicate and will enqueue on status changes. Since both Reconcile() (via updateProjectQuotaStatusWithRetry) and the incremental path (via applyDeltaAndUpdateStatus) update Status.LastReconcileAt, each reconciliation immediately triggers the next one, causing unnecessary API server churn. Add a predicate that only reacts to spec/generation changes, similar to crUsedAmountChangePredicate() used for CommittedResource.
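controller-runtime already ships predicate.GenerationChangedPredicate for exactly this, since the API server bumps metadata.generation only on spec writes, never on status-only or metadata-only updates; wiring it would look like `For(&v1alpha1.ProjectQuota{}, builder.WithPredicates(predicate.GenerationChangedPredicate{}))`. The core comparison it performs on Update events can be sketched in plain Go:

```go
package main

import "fmt"

// specChanged mirrors what predicate.GenerationChangedPredicate (from
// sigs.k8s.io/controller-runtime/pkg/predicate) checks on Update events:
// status subresource writes leave metadata.generation untouched, so a
// generation comparison filters out the self-triggered churn.
func specChanged(oldGeneration, newGeneration int64) bool {
	return oldGeneration != newGeneration
}

func main() {
	fmt.Println(specChanged(3, 4)) // spec edited: enqueue → true
	fmt.Println(specChanged(4, 4)) // status-only update: skip → false
}
```

This breaks the reconcile → status write → re-enqueue loop without losing any spec-driven events from Limes.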

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/quota/controller.go` around lines 554 - 568,
The ProjectQuota watch in SetupWithManager currently has no predicate and thus
enqueues on status-only updates; add a predicate (e.g.
projectQuotaSpecChangePredicate) to the For(&v1alpha1.ProjectQuota{}) watch that
filters out status-only changes by returning true only when Generation or Spec
changed (similar logic to crUsedAmountChangePredicate()). Implement
projectQuotaSpecChangePredicate() to compare old and new objects and only allow
events where old.GetGeneration() != new.GetGeneration() or a deep-equal check of
Spec differs, then pass it into builder.WithPredicates(...) for the ProjectQuota
watch in SetupWithManager.

Comment on lines +71 to +77
func (m *QuotaMetrics) RecordUsage(projectID, az, resource string, totalUsage, paygUsage, crUsage int64) {
if m == nil {
return
}
m.totalUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(totalUsage))
m.paygUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(paygUsage))
m.crUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(crUsage))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the metrics.go file
fd metrics.go internal/scheduling

Repository: cobaltcore-dev/cortex

Length of output: 194


🏁 Script executed:

# Get an overview of the file to understand its structure
wc -l internal/scheduling/reservations/quota/metrics.go

Repository: cobaltcore-dev/cortex

Length of output: 118


🏁 Script executed:

# Read the entire file to understand the QuotaMetrics structure and any cleanup logic
cat -n internal/scheduling/reservations/quota/metrics.go

Repository: cobaltcore-dev/cortex

Length of output: 3740


🏁 Script executed:

# Search for any deletion or cleanup logic in the quota module
rg "DeleteLabelValues|Delete\(|Reset\(" internal/scheduling/reservations/quota/

Repository: cobaltcore-dev/cortex

Length of output: 389


🏁 Script executed:

# Search for calls to RecordUsage to understand the lifecycle and context
rg "RecordUsage" internal/scheduling/reservations/quota/

Repository: cobaltcore-dev/cortex

Length of output: 478


🏁 Script executed:

# Check if there are any other metric management methods in related quota files
fd "\.go$" internal/scheduling/reservations/quota/ --exec grep -l "GaugeVec\|Delete" {}

Repository: cobaltcore-dev/cortex

Length of output: 285


🏁 Script executed:

# Find the line where RecordUsage is called and read context around it
rg -B 10 -A 10 "c\.Metrics\.RecordUsage" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 869


🏁 Script executed:

# Get more context on the overall structure of the controller
wc -l internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 122


🏁 Script executed:

# Search for any cleanup or eviction logic in the controller
rg "evict|cleanup|remove.*label|forget" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 47


🏁 Script executed:

# Find where these usage maps are built to understand what entries they contain
rg -B 5 "for resourceName, total := range totalUsage" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 469


🏁 Script executed:

# Look for where totalUsage, paygUsage, crUsage are populated
rg "totalUsage\s*:=|totalUsage\s*=" internal/scheduling/reservations/quota/controller.go | head -20

Repository: cobaltcore-dev/cortex

Length of output: 125


🏁 Script executed:

# Check if there's any cleanup logic for deleted projects
rg "reconcil|Reconcil" internal/scheduling/reservations/quota/controller.go | head -15

Repository: cobaltcore-dev/cortex

Length of output: 828


🏁 Script executed:

# Search for DeleteLabelValues patterns in other metric files to see if this is handled elsewhere
rg "DeleteLabelValues" internal/scheduling/

Repository: cobaltcore-dev/cortex

Length of output: 47


🏁 Script executed:

# Check other metric files in the codebase for cardinality management patterns
fd "metrics\.go$" internal/scheduling/ --exec grep -l "GaugeVec" {}

Repository: cobaltcore-dev/cortex

Length of output: 115


🏁 Script executed:

# Look at another metrics file to see if there are deletion patterns
cat -n internal/scheduling/reservations/commitments/api/change_commitments_metrics.go | head -100

Repository: cobaltcore-dev/cortex

Length of output: 1688


Add cleanup for removed usage label sets.

These gauges only ever Set() values. When a project/resource/AZ disappears, the old series is never deleted, so Prometheus will keep exporting stale usage and the project_id cardinality will grow monotonically over time. Add DeleteLabelValues() calls when projects or resources are removed to prevent unbounded metric cardinality growth.
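A stdlib sketch of the intended lifecycle (the real code would call DeleteLabelValues on the three *prometheus.GaugeVec fields; seriesKey, gaugeSet, and syncUsage here are illustrative stand-ins): diff the previously exported label sets against the current reconcile result and drop the ones that disappeared.

```go
package main

import "fmt"

// seriesKey stands in for a gauge's label tuple (project_id, az, resource).
type seriesKey struct{ project, az, resource string }

// gaugeSet emulates a *prometheus.GaugeVec just enough to show the
// lifecycle: Set on every reconcile, delete when the label set vanishes.
type gaugeSet map[seriesKey]float64

// syncUsage overwrites exported series with the current usage map and
// removes any series absent from it -- the analogue of calling
// GaugeVec.DeleteLabelValues for removed projects, resources, or AZs.
func syncUsage(g gaugeSet, current map[seriesKey]float64) {
	for k := range g {
		if _, ok := current[k]; !ok {
			delete(g, k) // stale series: stop exporting it
		}
	}
	for k, v := range current {
		g[k] = v
	}
}

func main() {
	g := gaugeSet{
		{"p1", "az1", "cores"}: 8,
		{"p2", "az1", "cores"}: 4, // project p2 was deleted since last sweep
	}
	syncUsage(g, map[seriesKey]float64{{"p1", "az1", "cores"}: 12})
	fmt.Println(len(g), g[seriesKey{"p1", "az1", "cores"}]) // → 1 12
}
```

Tracking the previously recorded keys (or calling Reset() before re-Setting the full snapshot each reconcile) keeps the exported cardinality bounded by the live projects.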

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/quota/metrics.go` around lines 71 - 77, In
QuotaMetrics.RecordUsage, avoid leaking metric label series by deleting label
values when a usage is removed: if totalUsage, paygUsage and crUsage are all
zero (or otherwise indicate the resource/project/AZ was removed), call
m.totalUsageGauge.DeleteLabelValues(projectID, az, resource),
m.paygUsageGauge.DeleteLabelValues(...), and
m.crUsageGauge.DeleteLabelValues(...) instead of Set; otherwise continue to Set
the gauges as currently implemented. Ensure you reference the existing symbols
RecordUsage, totalUsageGauge, paygUsageGauge, crUsageGauge and use
DeleteLabelValues to prevent unbounded cardinality growth.
