
feat: Qutoa calc payg #796

Open
umswmayj wants to merge 2 commits into main from qutoa_calc_payg

Conversation

@umswmayj
Collaborator

@umswmayj umswmayj commented May 4, 2026

No description provided.

@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Implements ProjectQuota custom resource support for Kubernetes with a complete ingestion, storage, and reconciliation pipeline. Extends VM source interfaces to track deleted servers and VM creation timestamps. Adds HTTP API handler for quota requests, Prometheus monitoring, and a controller reconciling quota CRDs against active VMs and committed resources.

Changes

ProjectQuota Feature Implementation

Layer / File(s) Summary
Data Shape & API Types
api/v1alpha1/project_quota_types.go
Defines ResourceQuota, ResourceQuotaUsage, ProjectQuotaSpec, ProjectQuotaStatus, ProjectQuota, and ProjectQuotaList types with kubebuilder markers for cluster-scoped CR, status subresource, and print columns.
Code Generation
api/v1alpha1/zz_generated.deepcopy.go
Autogenerated DeepCopy/DeepCopyInto methods for all new types, including deep-copy logic for nested maps (PerAZ, quota maps) and condition slices.
CRD Schema Definition
helm/library/cortex/files/crds/cortex.cloud_projectquotas.yaml
OpenAPI schema for ProjectQuota with spec fields (domain/project IDs, optional names, quota map per resource with per-AZ breakdown) and status fields (usage maps, reconcile timestamp, conditions).
VM Source Infrastructure
internal/scheduling/reservations/failover/vm_source.go, internal/scheduling/external/nova.go
Extended VMSource interface with IsServerActive and GetDeletedVMInfo methods; added DeletedVMInfo struct for quota decrements; added VM.CreatedAt timestamp; extended NovaReaderInterface with GetDeletedServerByID; implemented all methods on DBVMSource and NovaReader.
VM Source Tests
internal/scheduling/reservations/failover/vm_source_test.go, internal/scheduling/reservations/failover/integration_test.go
Updated mockNovaReader and MockVMSource to implement new interface methods; extended failover integration test mocks.
Quota API Handler
internal/scheduling/reservations/commitments/api/quota.go, internal/scheduling/reservations/commitments/api/quota_monitor.go
Transformed HandleQuota from no-op to full request handler: validates PUT method, checks EnableQuotaAPI flag, extracts project UUID, decodes liquid.ServiceQuotaRequest, validates metadata, converts quotas with overflow checks, creates/updates ProjectQuota CRD, records Prometheus metrics via new QuotaAPIMonitor.
API Configuration & Wiring
internal/scheduling/reservations/commitments/api/handler.go, internal/scheduling/reservations/commitments/config.go, internal/scheduling/reservations/commitments/api/info.go, internal/scheduling/reservations/commitments/api/usage.go
Added quotaMonitor field to HTTPAPI and registered with Prometheus; added EnableQuotaAPI config boolean (default true); updated RAM resource topology/quota conditionals in buildServiceInfo based on handlesCommitments flag; updated buildUsageResponse to initialize per-AZ RAM quotas for commitment-accepting groups.
API Tests
internal/scheduling/reservations/commitments/api/quota_test.go, internal/scheduling/reservations/commitments/api/info_test.go
Added comprehensive HandleQuota tests covering error cases (method validation, API disablement, request validation, UUID mismatch) and create/update flows with CRD persistence; extended info_test.go assertions for topology and quota fields per flavor group.
Quota Controller Core
internal/scheduling/reservations/quota/controller.go, internal/scheduling/reservations/quota/config.go, internal/scheduling/reservations/quota/context.go, internal/scheduling/reservations/quota/metrics.go
Implemented three-path reconciliation: (1) periodic full reconciliation recomputing total usage from all VMs and deriving PAYG usage from committed resources, (2) watch-triggered single-quota reconciliation recomputing PAYG when committed resources change, (3) incremental HV-diff reconciliation delta-updating usage for added/removed VMs; added config with 5-minute default reconcile interval and CR state filter; added context helpers for request IDs and logging; added quota metrics (usage gauges, reconcile duration, result counters).
Controller Helpers & Logic
internal/scheduling/reservations/quota/controller.go (computation methods)
Implemented computeTotalUsage aggregating VMs per project/AZ/resource; computeCRUsage summing committed resources by state; derivePaygUsage subtracting CR from total with clamp; buildFlavorToGroupMap; vmResourceUnits converting VM specs to commitment units; incrementUsage/decrementUsage with decrement clamp; updateProjectQuotaStatusWithRetry persisting status with conflict-retry; event mapping and predicates for watch-based triggers.
Controller Tests
internal/scheduling/reservations/quota/controller_test.go
Added 11 unit tests covering computation helpers (computeTotalUsage, computeCRUsage, derivePaygUsage, etc.), usage delta accumulation, reconciliation with nil state, and VM accumulation (unknown/known flavors with creation-time guards).
Integration Tests
internal/scheduling/reservations/quota/integration_test.go
Added comprehensive scenario-driven integration test suite with 20+ test cases covering: baseline full reconciliation, periodic idempotency, incremental add/remove/migration/deletion, unknown-flavor skipping, drift correction, multi-project sequences, and PaygUsage updates on CR changes; includes test framework (scenario builders, action execution, verification helpers).
Configuration
.claude/settings.local.json
Added Claude IDE local settings granting Read(//root/**) and Bash(go doc:*) permissions.

Sequence Diagram

sequenceDiagram
    participant Client
    participant HTTPHandler as Quota API Handler
    participant KubeAPI as Kubernetes API
    participant Controller as Quota Controller
    participant VMSource as VM Source
    participant Metrics as Prometheus

    Client->>HTTPHandler: PUT /quota/{projectID}
    HTTPHandler->>HTTPHandler: Validate request (method, enable flag)
    HTTPHandler->>HTTPHandler: Decode ServiceQuotaRequest
    HTTPHandler->>KubeAPI: Get ProjectQuota CRD
    alt CRD not found
        HTTPHandler->>KubeAPI: Create ProjectQuota CRD
    else CRD exists
        HTTPHandler->>KubeAPI: Update ProjectQuota Spec
    end
    HTTPHandler->>Metrics: Record request metric (204)
    HTTPHandler-->>Client: 204 No Content

    Controller->>KubeAPI: Watch ProjectQuota changes
    Controller->>VMSource: List all VMs
    Controller->>VMSource: List committed resources
    Controller->>Controller: Compute TotalUsage per project/AZ
    Controller->>Controller: Compute PaygUsage (TotalUsage - CommittedUsage)
    Controller->>KubeAPI: Update ProjectQuota Status with retry
    Controller->>Metrics: Record usage gauges (total, payg, committed)
    Controller->>Metrics: Record reconcile duration

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • auhlig
  • PhilippMatthes
  • juliusclausnitzer

Poem

🐰 A rabbit hops through quota lands
CRDs in place, controller stands
From VM deletion to PaygUsage born,
Reconciliation metrics dawn,
Tracking projects with grace and care!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Description check (⚠️ Warning): No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. Resolution: add a detailed description explaining the changes, including the new ProjectQuota CRD, quota controller, and PAYG usage calculation logic.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 49.02%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check (❓ Inconclusive): The title 'feat: Qutoa calc payg' contains a typo ('Qutoa' instead of 'Quota') and is vague. It hints at quota and PAYG (pay-as-you-go) calculations but doesn't convey the substantial changes across 20+ files, including new Kubernetes types, controllers, and quota management infrastructure. Resolution: fix the typo and describe the main change more specifically, e.g. 'feat: Implement quota controller with PAYG usage calculation'.

✅ Passed checks (2 passed)

  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.




@umswmayj
Collaborator Author

umswmayj commented May 4, 2026

Follow-up of #783


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (3)
internal/scheduling/reservations/commitments/api/quota_monitor.go (1)

29-33: ⚡ Quick win

Pre-initialize the 503 series too.

HandleQuota has an explicit service-disabled path, so if we're pre-creating the common status-code series here, 503 should be included as well. Otherwise dashboards/alerts won't see that label until the first disabled request happens.

Suggested diff
-	for _, statusCode := range []string{"204", "400", "405", "500"} {
+	for _, statusCode := range []string{"204", "400", "405", "500", "503"} {
 		m.requestCounter.WithLabelValues(statusCode)
 		m.requestDuration.WithLabelValues(statusCode)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/commitments/api/quota_monitor.go` around
lines 29 - 33, Pre-create the 503 status series alongside the others so
dashboards/alerts see service-disabled metrics before the first disabled
request; update the loop that pre-initializes status codes (the array passed to
the for loop that calls m.requestCounter.WithLabelValues and
m.requestDuration.WithLabelValues) to include "503" as well—this affects the
same block used by HandleQuota's service-disabled path and ensures the "503"
label exists from startup.
internal/scheduling/reservations/failover/vm_source_test.go (1)

403-405: ⚡ Quick win

Add unit cases for the new deleted-server paths.

This mock change only restores interface compliance; IsServerActive, GetDeletedVMInfo, and CreatedAt propagation in DBVMSource still have no assertions here. A few table-driven cases for DELETED, nil, deleted-server flavor lookup, and CreatedAt would lock down the new quota path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/failover/vm_source_test.go` around lines 403
- 405, Add table-driven unit tests exercising the new deleted-server code paths:
update tests around mockNovaReader.GetDeletedServerByID and DBVMSource to
include cases for (1) a DELETED deleted-server result, (2) GetDeletedServerByID
returning nil, (3) deleted-server that requires flavor lookup (ensure flavor
ID/name resolution happens), and (4) propagation of CreatedAt from the
DeletedServer into the DB VM. For each case assert DBVMSource behavior and
side-effects by checking IsServerActive and GetDeletedVMInfo are invoked as
expected and that the resulting VM record's CreatedAt (and quota decision path)
matches the deleted-server input.
internal/scheduling/reservations/commitments/api/quota_test.go (1)

352-354: ⚡ Quick win

Use testlib.Ptr instead of a local pointer helper.

boolPtr() duplicates the shared pointer helper pattern and drifts from the repo's test convention.

As per coding guidelines, **/*_test.go: Use testlib.Ptr for test cases that require pointer values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/commitments/api/quota_test.go` around lines
352 - 354, The helper function boolPtr(b bool) is duplicating a shared pointer
helper; remove boolPtr and replace its usages with the shared test helper
testlib.Ptr(b) (update imports to include testlib if missing) so tests follow
the repo convention; search for the function name boolPtr and replace calls like
boolPtr(true) / boolPtr(false) with testlib.Ptr(true) / testlib.Ptr(false) and
delete the boolPtr function definition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/settings.local.json:
- Around line 1-8: The permissions block in settings.local.json is too
broad—replace the Read(//root/**) rule with least-privilege paths that the PR
actually needs (e.g., Read(//root/path/to/needed/dir) or explicit filenames) and
remove any wildcard that grants repository‑wide secret access; ensure the
"permissions" object only contains allowed actions for specific resources, and
make sure settings.local.json is listed in .gitignore or otherwise prevented
from being committed/used in CI/shared environments so the local-only settings
aren’t relied upon in shared workflows.

In `@internal/scheduling/reservations/commitments/api/quota.go`:
- Around line 72-84: The code currently allows creating/updating a ProjectQuota
with an empty DomainID when req.ProjectMetadata is absent; change the logic in
the block around req.ProjectMetadata.Unpack() so that if Unpack() returns false
you immediately reject the request with api.quotaError(...,
http.StatusBadRequest, "missing required domainID in project metadata",
startTime), and also validate that after unpack the derived domainID (variable
domainID) is non-empty—if it's empty return a 400 similarly; keep the existing
UUID consistency check and use the same api.quotaError pattern for these new
validation failures.
- Around line 111-160: The current Get→Create/Update flow for ProjectQuota
(crdName via projectQuotaCRDName) races with concurrent reconciles and can fail
with update conflicts; modify the write path to be conflict-safe by wrapping the
update logic in a retry-on-conflict loop (e.g., retry.RetryOnConflict) or by
using controllerutil.CreateOrUpdate on v1alpha1.ProjectQuota to compute and
apply spec changes atomically; ensure you preserve setting Spec.Quota,
Spec.ProjectName, Spec.DomainID and Spec.DomainName and log/create errors as
before but retry on conflicts instead of returning a 409/500 immediately.

In `@internal/scheduling/reservations/commitments/config.go`:
- Around line 61-64: The bool field EnableQuotaAPI in the Config struct
currently defaults to true only in DefaultConfig() but remains false when config
is unmarshaled into a zero-value Config and then passed through ApplyDefaults();
to fix this either (A) change the field to a pointer (*bool EnableQuotaAPI) and
update ApplyDefaults() to set it to pointer(true) when nil, or (B) ensure
decoding/UNMARSHAL logic constructs config by starting from DefaultConfig() and
then overlays file/flags onto it before calling ApplyDefaults(); update
ApplyDefaults(), DefaultConfig(), and any decoding flow that constructs Config
(referencing EnableQuotaAPI, ApplyDefaults, DefaultConfig) so the effective
default is unambiguous and tests reflect the chosen approach.

In `@internal/scheduling/reservations/failover/vm_source.go`:
- Around line 29-31: The VM CreatedAt field is not propagated when VMs are
rebuilt via buildVMsFromHypervisors called by ListVMsOnHypervisors(..., true),
causing callers to see an empty timestamp; update buildVMsFromHypervisors to
copy pgVM.CreatedAt into the new VM struct (and any other rebuild paths that
construct VM instances from pgVM) so CreatedAt flows through the
hypervisor-trust path; locate uses of pgVM and the VM constructor/assignment in
buildVMsFromHypervisors and add copying of the CreatedAt field (and mirror the
same change in the other rebuild block referenced around the same area).

In `@internal/scheduling/reservations/quota/controller.go`:
- Around line 396-427: The code incorrectly uses
ProjectQuota.Status.LastReconcileAt as the "last full reconcile" watermark;
introduce and use a dedicated field (e.g.
ProjectQuota.Status.LastFullReconcileAt) that is only updated by the full
periodic reconcile, update isVMNewSinceLastReconcile to compare VM creation time
against LastFullReconcileAt instead of LastReconcileAt, and change the PAYG-only
path and applyDeltaAndUpdateStatus() so they no longer overwrite
LastFullReconcileAt (they may continue to update LastReconcileAt for any status
change). Ensure the new field is set only in the full reconcile code path and
that isVMNewSinceLastReconcile references that new field when deciding whether a
VM is new.
- Around line 554-568: The ProjectQuota watch in SetupWithManager currently has
no predicate and thus enqueues on status-only updates; add a predicate (e.g.
projectQuotaSpecChangePredicate) to the For(&v1alpha1.ProjectQuota{}) watch that
filters out status-only changes by returning true only when Generation or Spec
changed (similar logic to crUsedAmountChangePredicate()). Implement
projectQuotaSpecChangePredicate() to compare old and new objects and only allow
events where old.GetGeneration() != new.GetGeneration() or a deep-equal check of
Spec differs, then pass it into builder.WithPredicates(...) for the ProjectQuota
watch in SetupWithManager.

In `@internal/scheduling/reservations/quota/metrics.go`:
- Around line 71-77: In QuotaMetrics.RecordUsage, avoid leaking metric label
series by deleting label values when a usage is removed: if totalUsage,
paygUsage and crUsage are all zero (or otherwise indicate the
resource/project/AZ was removed), call
m.totalUsageGauge.DeleteLabelValues(projectID, az, resource),
m.paygUsageGauge.DeleteLabelValues(...), and
m.crUsageGauge.DeleteLabelValues(...) instead of Set; otherwise continue to Set
the gauges as currently implemented. Ensure you reference the existing symbols
RecordUsage, totalUsageGauge, paygUsageGauge, crUsageGauge and use
DeleteLabelValues to prevent unbounded cardinality growth.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f410d3c9-ddee-4e79-9ca4-43f2b4447fd6

📥 Commits

Reviewing files that changed from the base of the PR and between a29bbcd and 6619176.

📒 Files selected for processing (22)
  • .claude/settings.local.json
  • api/v1alpha1/project_quota_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • helm/library/cortex/files/crds/cortex.cloud_projectquotas.yaml
  • internal/scheduling/external/nova.go
  • internal/scheduling/reservations/commitments/api/handler.go
  • internal/scheduling/reservations/commitments/api/info.go
  • internal/scheduling/reservations/commitments/api/info_test.go
  • internal/scheduling/reservations/commitments/api/quota.go
  • internal/scheduling/reservations/commitments/api/quota_monitor.go
  • internal/scheduling/reservations/commitments/api/quota_test.go
  • internal/scheduling/reservations/commitments/config.go
  • internal/scheduling/reservations/commitments/usage.go
  • internal/scheduling/reservations/failover/integration_test.go
  • internal/scheduling/reservations/failover/vm_source.go
  • internal/scheduling/reservations/failover/vm_source_test.go
  • internal/scheduling/reservations/quota/config.go
  • internal/scheduling/reservations/quota/context.go
  • internal/scheduling/reservations/quota/controller.go
  • internal/scheduling/reservations/quota/controller_test.go
  • internal/scheduling/reservations/quota/integration_test.go
  • internal/scheduling/reservations/quota/metrics.go

Comment on lines +1 to +8
{
  "permissions": {
    "allow": [
      "Read(//root/**)",
      "Bash(go doc:*)"
    ]
  }
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten Claude permissions to avoid overbroad secret access.

Read(//root/**) grants read access to everything under //root/**, which is potentially too permissive and could expose credentials/secrets to the model/tooling. Please apply least-privilege by scoping the read allow rule to only the specific directories/files needed for this PR, and/or ensure this settings.local.json truly remains local-only (e.g., gitignored) and isn’t relied on in CI/shared environments.
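For reference, a least-privilege variant of the settings file might look like this; the narrowed path is purely illustrative and should be scoped to whatever the PR actually needs:

```json
{
  "permissions": {
    "allow": [
      "Read(//root/cortex/internal/**)",
      "Bash(go doc:*)"
    ]
  }
}
```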


Comment on lines +72 to +84
// Extract project/domain metadata if available
var projectName, domainID, domainName string
if meta, ok := req.ProjectMetadata.Unpack(); ok {
	// Consistency check: metadata UUID must match URL path UUID
	if meta.UUID != "" && meta.UUID != projectID {
		log.Info("project UUID mismatch", "urlProjectID", projectID, "metadataUUID", meta.UUID)
		api.quotaError(w, http.StatusBadRequest, fmt.Sprintf("Project UUID mismatch: URL has %q but metadata has %q", projectID, meta.UUID), startTime)
		return
	}
	projectName = meta.Name
	domainID = meta.Domain.UUID
	domainName = meta.Domain.Name
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject requests that cannot satisfy the CRD's required domainID.

ProjectMetadata is optional here, but ProjectQuotaSpec.DomainID is required by the new type/CRD. When metadata is absent you fall through to create/update with an empty DomainID, which turns a client error into an API-server validation 500.

Suggested fix
 	if meta, ok := req.ProjectMetadata.Unpack(); ok {
 		// Consistency check: metadata UUID must match URL path UUID
 		if meta.UUID != "" && meta.UUID != projectID {
 			log.Info("project UUID mismatch", "urlProjectID", projectID, "metadataUUID", meta.UUID)
 			api.quotaError(w, http.StatusBadRequest, fmt.Sprintf("Project UUID mismatch: URL has %q but metadata has %q", projectID, meta.UUID), startTime)
 			return
 		}
 		projectName = meta.Name
 		domainID = meta.Domain.UUID
 		domainName = meta.Domain.Name
 	}
+	if domainID == "" {
+		api.quotaError(w, http.StatusBadRequest, "missing domain UUID in project metadata", startTime)
+		return
+	}

Comment on lines +111 to +160
// Create or update ProjectQuota CRD
crdName := projectQuotaCRDName(projectID)
ctx := r.Context()

var existing v1alpha1.ProjectQuota
err = api.client.Get(ctx, client.ObjectKey{Name: crdName}, &existing)
if err != nil {
	if !apierrors.IsNotFound(err) {
		// Real error
		log.Error(err, "failed to get existing ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to check existing quota: %v", err), startTime)
		return
	}
	// Not found — create new
	pq := &v1alpha1.ProjectQuota{
		ObjectMeta: metav1.ObjectMeta{
			Name: crdName,
		},
		Spec: v1alpha1.ProjectQuotaSpec{
			ProjectID:   projectID,
			ProjectName: projectName,
			DomainID:    domainID,
			DomainName:  domainName,
			Quota:       specQuota,
		},
	}
	if err := api.client.Create(ctx, pq); err != nil {
		log.Error(err, "failed to create ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to create quota: %v", err), startTime)
		return
	}
	log.V(1).Info("created ProjectQuota", "name", crdName, "projectID", projectID, "resources", len(specQuota))
} else {
	// Update existing
	existing.Spec.Quota = specQuota
	if projectName != "" {
		existing.Spec.ProjectName = projectName
	}
	if domainID != "" {
		existing.Spec.DomainID = domainID
	}
	if domainName != "" {
		existing.Spec.DomainName = domainName
	}
	if err := api.client.Update(ctx, &existing); err != nil {
		log.Error(err, "failed to update ProjectQuota", "name", crdName)
		api.quotaError(w, http.StatusInternalServerError, fmt.Sprintf("Failed to update quota: %v", err), startTime)
		return
	}
	log.V(1).Info("updated ProjectQuota", "name", crdName, "projectID", projectID, "resources", len(specQuota))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== ProjectQuota references ==\n'
rg -n -C2 'ProjectQuota' --type go api internal

printf '\n== Status writers ==\n'
rg -n -C2 'Status\(\)\.(Update|Patch)\(' --type go api internal

printf '\n== Quota handler write path ==\n'
rg -n -C3 'HandleQuota|\.Create\(|\.Update\(|\.Patch\(' --type go internal/scheduling/reservations/commitments/api

Repository: cobaltcore-dev/cortex

Length of output: 50377


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n=== Quota Handler Code (lines 111-162) ===\n'
sed -n '111,162p' internal/scheduling/reservations/commitments/api/quota.go

printf '\n=== Controller Status Update Pattern ===\n'
sed -n '530,545p' internal/scheduling/reservations/quota/controller.go
sed -n '780,800p' internal/scheduling/reservations/quota/controller.go

printf '\n=== Checking for retry.RetryOnConflict in handler ===\n'
rg -n 'RetryOnConflict' internal/scheduling/reservations/commitments/api/

Repository: cobaltcore-dev/cortex

Length of output: 2932


Make the ProjectQuota write path conflict-safe.

This raw Get → Create/Update sequence races with the quota controller, which updates ProjectQuota status on the same object. When the controller reconciles concurrently, the handler's stale resourceVersion will fail the Update with a conflict error, turning valid PUT requests into spurious 409/500 responses.

Wrap the spec write in a retry-on-conflict loop, or use controllerutil.CreateOrUpdate() to handle version conflicts automatically.
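
In isolation, the retry-on-conflict shape looks like this. In the real handler it would be `retry.RetryOnConflict` from k8s.io/client-go (or `controllerutil.CreateOrUpdate` from controller-runtime); the `errConflict` value and `retryOnConflict` helper below are stdlib stand-ins that mimic that API's contract:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for Kubernetes' 409 Conflict (stale resourceVersion).
var errConflict = errors.New("conflict: object was modified")

// retryOnConflict mirrors the shape of retry.RetryOnConflict: re-run fn,
// which must re-read the latest object itself, until it stops conflicting
// or the attempt budget is exhausted.
func retryOnConflict(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err // success, or a non-conflict error worth surfacing
		}
	}
	return err
}

func main() {
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++
		if attempts < 3 {
			return errConflict // simulate two concurrent status updates
		}
		return nil // a fresh read sees a new resourceVersion; update succeeds
	})
	fmt.Println(err == nil, attempts) // true 3
}
```

The key detail carried over from the real API: the closure must perform the Get again on each attempt, otherwise every retry reuses the same stale object and conflicts forever.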


Comment on lines +61 to +64
// EnableQuotaAPI controls whether the quota API endpoint is active.
// When false, the endpoint will return HTTP 503 Service Unavailable.
// This can be used as an emergency switch if quota persistence is causing issues.
EnableQuotaAPI bool `json:"committedResourceEnableQuotaAPI"`

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

EnableQuotaAPI does not actually default to true in the ApplyDefaults() path.

Adding the field here with DefaultConfig().EnableQuotaAPI = true is not enough on its own: an omitted bool still stays false when config is unmarshaled into a zero-value Config and then passed through ApplyDefaults(). That means the new quota API can come up disabled even when operators never set this flag.

Please make the defaulting strategy unambiguous for bools here as well, e.g. by using *bool for tri-state config or by ensuring config decoding starts from DefaultConfig() before overrides are applied.

Also applies to: 111-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/commitments/config.go` around lines 61 - 64,
The bool field EnableQuotaAPI in the Config struct currently defaults to true
only in DefaultConfig() but remains false when config is unmarshaled into a
zero-value Config and then passed through ApplyDefaults(); to fix this either
(A) change the field to a pointer (*bool EnableQuotaAPI) and update
ApplyDefaults() to set it to pointer(true) when nil, or (B) ensure
decoding/UNMARSHAL logic constructs config by starting from DefaultConfig() and
then overlays file/flags onto it before calling ApplyDefaults(); update
ApplyDefaults(), DefaultConfig(), and any decoding flow that constructs Config
(referencing EnableQuotaAPI, ApplyDefaults, DefaultConfig) so the effective
default is unambiguous and tests reflect the chosen approach.

Comment on lines +29 to +31
// CreatedAt is the ISO 8601 timestamp when the VM was created in Nova.
// Used by the quota controller to distinguish new VMs from migrations.
CreatedAt string

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate CreatedAt through the hypervisor-trust path.

CreatedAt is now part of the VM contract for quota reconciliation, but ListVMsOnHypervisors(..., true) rebuilds VMs in buildVMsFromHypervisors() without copying it from pgVM. Callers on that path will always see an empty timestamp and can misclassify migrations as new VMs.

Suggested fix
 			vm := VM{
 				UUID:              inst.ID,
 				FlavorName:        pgVM.FlavorName,
 				ProjectID:         pgVM.ProjectID,
 				CurrentHypervisor: hv.Name, // Use hypervisor CRD location, not postgres
 				AvailabilityZone:  pgVM.AvailabilityZone,
+				CreatedAt:         pgVM.CreatedAt,
 				Resources:         pgVM.Resources,
 				FlavorExtraSpecs:  pgVM.FlavorExtraSpecs,
 			}

Also applies to: 52-58

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/failover/vm_source.go` around lines 29 - 31,
The VM CreatedAt field is not propagated when VMs are rebuilt via
buildVMsFromHypervisors called by ListVMsOnHypervisors(..., true), causing
callers to see an empty timestamp; update buildVMsFromHypervisors to copy
pgVM.CreatedAt into the new VM struct (and any other rebuild paths that
construct VM instances from pgVM) so CreatedAt flows through the
hypervisor-trust path; locate uses of pgVM and the VM constructor/assignment in
buildVMsFromHypervisors and add copying of the CreatedAt field (and mirror the
same change in the other rebuild block referenced around the same area).

Comment on lines +396 to +427
func (c *QuotaController) isVMNewSinceLastReconcile(ctx context.Context, vm *failover.VM) bool {
if vm.CreatedAt == "" {
// No creation time available -- be conservative, skip increment.
// The next full reconcile will pick it up.
return false
}

// Look up the ProjectQuota for this VM's project
crdName := "quota-" + vm.ProjectID
var pq v1alpha1.ProjectQuota
if err := c.Get(ctx, client.ObjectKey{Name: crdName}, &pq); err != nil {
// If we can't find the ProjectQuota, skip (full reconcile will handle it)
return false
}

if pq.Status.LastReconcileAt == nil {
// No full reconcile has run yet -- skip incremental updates
return false
}

// Parse the VM's creation time and compare with last reconcile
vmCreatedAt, err := time.Parse("2006-01-02T15:04:05Z", vm.CreatedAt)
if err != nil {
// Try alternative format with timezone offset
vmCreatedAt, err = time.Parse(time.RFC3339, vm.CreatedAt)
if err != nil {
// Cannot parse -- be conservative, skip
return false
}
}

return vmCreatedAt.After(pq.Status.LastReconcileAt.Time)

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep a dedicated full-reconcile watermark for incremental add detection.

isVMNewSinceLastReconcile() treats Status.LastReconcileAt as "last full reconcile", but both the PAYG-only path and applyDeltaAndUpdateStatus() overwrite that field on every status update. After any intervening CR update or incremental delta, a VM created after the last full sweep can be misclassified as already counted and never added until the next periodic reconcile.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/quota/controller.go` around lines 396 - 427,
The code incorrectly uses ProjectQuota.Status.LastReconcileAt as the "last full
reconcile" watermark; introduce and use a dedicated field (e.g.
ProjectQuota.Status.LastFullReconcileAt) that is only updated by the full
periodic reconcile, update isVMNewSinceLastReconcile to compare VM creation time
against LastFullReconcileAt instead of LastReconcileAt, and change the PAYG-only
path and applyDeltaAndUpdateStatus() so they no longer overwrite
LastFullReconcileAt (they may continue to update LastReconcileAt for any status
change). Ensure the new field is set only in the full reconcile code path and
that isVMNewSinceLastReconcile references that new field when deciding whether a
VM is new.

Comment on lines +554 to +568
func (c *QuotaController) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
Named("quota-controller").
// Watch ProjectQuota for spec changes (Limes pushes quota)
For(&v1alpha1.ProjectQuota{}).
// Watch CommittedResource for status changes (UsedAmount updates)
Watches(
&v1alpha1.CommittedResource{},
handler.EnqueueRequestsFromMapFunc(c.mapCRToProjectQuota),
builder.WithPredicates(crUsedAmountChangePredicate()),
).
WithOptions(controller.Options{
MaxConcurrentReconciles: 1,
}).
Complete(c)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "controller.go" internal/scheduling

Repository: cobaltcore-dev/cortex

Length of output: 811


🏁 Script executed:

wc -l internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 122


🏁 Script executed:

sed -n '554,570p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 627


🏁 Script executed:

sed -n '1,50p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1874


🏁 Script executed:

grep -n "func (c \*QuotaController) Reconcile" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 348


🏁 Script executed:

sed -n '162,250p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 3263


🏁 Script executed:

sed -n '400,500p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 3198


🏁 Script executed:

sed -n '500,550p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1421


🏁 Script executed:

grep -n "func (c \*QuotaController) updateProjectQuotaStatusWithRetry" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 130


🏁 Script executed:

sed -n '779,830p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1904


🏁 Script executed:

grep -n "predicate\|Predicate\|WithPredicates" internal/scheduling/reservations/quota/controller.go | head -20

Repository: cobaltcore-dev/cortex

Length of output: 641


🏁 Script executed:

sed -n '554,575p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 876


🏁 Script executed:

sed -n '889,920p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1290


🏁 Script executed:

sed -n '162,170p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 523


🏁 Script executed:

sed -n '571,595p' internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 1019


🏁 Script executed:

rg "For\(&" internal/scheduling --type go -A 3 | head -50

Repository: cobaltcore-dev/cortex

Length of output: 424


Add a predicate to filter status-only updates from the ProjectQuota watch.

The For(&v1alpha1.ProjectQuota{}) watch has no predicate and will enqueue on status changes. Since both Reconcile() (via updateProjectQuotaStatusWithRetry) and the incremental path (via applyDeltaAndUpdateStatus) update Status.LastReconcileAt, each reconciliation immediately triggers the next one, causing unnecessary API server churn. Add a predicate that only reacts to spec/generation changes, similar to crUsedAmountChangePredicate() used for CommittedResource.
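controller-runtime already ships predicate.GenerationChangedPredicate for exactly this, since the API server bumps metadata.generation only on spec writes, never on status-only or metadata-only updates; wiring it would look like `For(&v1alpha1.ProjectQuota{}, builder.WithPredicates(predicate.GenerationChangedPredicate{}))`. The core comparison it performs on Update events can be sketched in plain Go:

```go
package main

import "fmt"

// specChanged mirrors what predicate.GenerationChangedPredicate (from
// sigs.k8s.io/controller-runtime/pkg/predicate) checks on Update events:
// status subresource writes leave metadata.generation untouched, so a
// generation comparison filters out the self-triggered churn.
func specChanged(oldGeneration, newGeneration int64) bool {
	return oldGeneration != newGeneration
}

func main() {
	fmt.Println(specChanged(3, 4)) // spec edited: enqueue → true
	fmt.Println(specChanged(4, 4)) // status-only update: skip → false
}
```

This breaks the reconcile → status write → re-enqueue loop without losing any spec-driven events from Limes.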

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/quota/controller.go` around lines 554 - 568,
The ProjectQuota watch in SetupWithManager currently has no predicate and thus
enqueues on status-only updates; add a predicate (e.g.
projectQuotaSpecChangePredicate) to the For(&v1alpha1.ProjectQuota{}) watch that
filters out status-only changes by returning true only when Generation or Spec
changed (similar logic to crUsedAmountChangePredicate()). Implement
projectQuotaSpecChangePredicate() to compare old and new objects and only allow
events where old.GetGeneration() != new.GetGeneration() or a deep-equal check of
Spec differs, then pass it into builder.WithPredicates(...) for the ProjectQuota
watch in SetupWithManager.

Comment on lines +71 to +77
func (m *QuotaMetrics) RecordUsage(projectID, az, resource string, totalUsage, paygUsage, crUsage int64) {
if m == nil {
return
}
m.totalUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(totalUsage))
m.paygUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(paygUsage))
m.crUsageGauge.WithLabelValues(projectID, az, resource).Set(float64(crUsage))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the metrics.go file
fd metrics.go internal/scheduling

Repository: cobaltcore-dev/cortex

Length of output: 194


🏁 Script executed:

# Get an overview of the file to understand its structure
wc -l internal/scheduling/reservations/quota/metrics.go

Repository: cobaltcore-dev/cortex

Length of output: 118


🏁 Script executed:

# Read the entire file to understand the QuotaMetrics structure and any cleanup logic
cat -n internal/scheduling/reservations/quota/metrics.go

Repository: cobaltcore-dev/cortex

Length of output: 3740


🏁 Script executed:

# Search for any deletion or cleanup logic in the quota module
rg "DeleteLabelValues|Delete\(|Reset\(" internal/scheduling/reservations/quota/

Repository: cobaltcore-dev/cortex

Length of output: 389


🏁 Script executed:

# Search for calls to RecordUsage to understand the lifecycle and context
rg "RecordUsage" internal/scheduling/reservations/quota/

Repository: cobaltcore-dev/cortex

Length of output: 478


🏁 Script executed:

# Check if there are any other metric management methods in related quota files
fd "\.go$" internal/scheduling/reservations/quota/ --exec grep -l "GaugeVec\|Delete" {}

Repository: cobaltcore-dev/cortex

Length of output: 285


🏁 Script executed:

# Find the line where RecordUsage is called and read context around it
rg -B 10 -A 10 "c\.Metrics\.RecordUsage" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 869


🏁 Script executed:

# Get more context on the overall structure of the controller
wc -l internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 122


🏁 Script executed:

# Search for any cleanup or eviction logic in the controller
rg "evict|cleanup|remove.*label|forget" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 47


🏁 Script executed:

# Find where these usage maps are built to understand what entries they contain
rg -B 5 "for resourceName, total := range totalUsage" internal/scheduling/reservations/quota/controller.go

Repository: cobaltcore-dev/cortex

Length of output: 469


🏁 Script executed:

# Look for where totalUsage, paygUsage, crUsage are populated
rg "totalUsage\s*:=|totalUsage\s*=" internal/scheduling/reservations/quota/controller.go | head -20

Repository: cobaltcore-dev/cortex

Length of output: 125


🏁 Script executed:

# Check if there's any cleanup logic for deleted projects
rg "reconcil|Reconcil" internal/scheduling/reservations/quota/controller.go | head -15

Repository: cobaltcore-dev/cortex

Length of output: 828


🏁 Script executed:

# Search for DeleteLabelValues patterns in other metric files to see if this is handled elsewhere
rg "DeleteLabelValues" internal/scheduling/

Repository: cobaltcore-dev/cortex

Length of output: 47


🏁 Script executed:

# Check other metric files in the codebase for cardinality management patterns
fd "metrics\.go$" internal/scheduling/ --exec grep -l "GaugeVec" {}

Repository: cobaltcore-dev/cortex

Length of output: 115


🏁 Script executed:

# Look at another metrics file to see if there are deletion patterns
cat -n internal/scheduling/reservations/commitments/api/change_commitments_metrics.go | head -100

Repository: cobaltcore-dev/cortex

Length of output: 1688


Add cleanup for removed usage label sets.

These gauges only ever Set() values. When a project/resource/AZ disappears, the old series is never deleted, so Prometheus will keep exporting stale usage and the project_id cardinality will grow monotonically over time. Add DeleteLabelValues() calls when projects or resources are removed to prevent unbounded metric cardinality growth.
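A stdlib sketch of the intended lifecycle (the real code would call DeleteLabelValues on the three *prometheus.GaugeVec fields; seriesKey, gaugeSet, and syncUsage here are illustrative stand-ins): diff the previously exported label sets against the current reconcile result and drop the ones that disappeared.

```go
package main

import "fmt"

// seriesKey stands in for a gauge's label tuple (project_id, az, resource).
type seriesKey struct{ project, az, resource string }

// gaugeSet emulates a *prometheus.GaugeVec just enough to show the
// lifecycle: Set on every reconcile, delete when the label set vanishes.
type gaugeSet map[seriesKey]float64

// syncUsage overwrites exported series with the current usage map and
// removes any series absent from it -- the analogue of calling
// GaugeVec.DeleteLabelValues for removed projects, resources, or AZs.
func syncUsage(g gaugeSet, current map[seriesKey]float64) {
	for k := range g {
		if _, ok := current[k]; !ok {
			delete(g, k) // stale series: stop exporting it
		}
	}
	for k, v := range current {
		g[k] = v
	}
}

func main() {
	g := gaugeSet{
		{"p1", "az1", "cores"}: 8,
		{"p2", "az1", "cores"}: 4, // project p2 was deleted since last sweep
	}
	syncUsage(g, map[seriesKey]float64{{"p1", "az1", "cores"}: 12})
	fmt.Println(len(g), g[seriesKey{"p1", "az1", "cores"}]) // → 1 12
}
```

Tracking the previously recorded keys (or calling Reset() before re-Setting the full snapshot each reconcile) keeps the exported cardinality bounded by the live projects.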

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/scheduling/reservations/quota/metrics.go` around lines 71 - 77, In
QuotaMetrics.RecordUsage, avoid leaking metric label series by deleting label
values when a usage is removed: if totalUsage, paygUsage and crUsage are all
zero (or otherwise indicate the resource/project/AZ was removed), call
m.totalUsageGauge.DeleteLabelValues(projectID, az, resource),
m.paygUsageGauge.DeleteLabelValues(...), and
m.crUsageGauge.DeleteLabelValues(...) instead of Set; otherwise continue to Set
the gauges as currently implemented. Ensure you reference the existing symbols
RecordUsage, totalUsageGauge, paygUsageGauge, crUsageGauge and use
DeleteLabelValues to prevent unbounded cardinality growth.
