Add FlavorGroupCapacity CRD and Capacity Controller #728
juliusclausnitzer wants to merge 19 commits into main from
Conversation
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
📝 Walkthrough

Adds a new FlavorGroupCapacity CRD and deepcopy code; implements a periodic capacity reconciliation controller that probes an external scheduler, computes/sums capacities and instances, exposes Prometheus metrics, wires an optional manager runnable, updates Helm CRDs/RBAC/pipeline, extends a nova filter option, and adds tests.

Changes

Capacity feature (single cohesive cohort)
Sequence Diagram

sequenceDiagram
participant Manager as Manager
participant Controller as CapacityController
participant K8s as KubernetesAPI
participant Scheduler as ExternalScheduler
participant Prom as PrometheusMonitor
Manager->>Controller: Start(ctx)
loop every ReconcileInterval
Controller->>K8s: List FlavorGroups
K8s-->>Controller: FlavorGroups
Controller->>K8s: List Hypervisors
K8s-->>Controller: Hypervisors
Controller->>Controller: Derive AZs
loop per (FlavorGroup, AZ)
Controller->>Scheduler: ScheduleReservation(TotalPipeline, flavor)
Scheduler-->>Controller: hosts & slots (total)
Controller->>Scheduler: ScheduleReservation(PlaceablePipeline, flavor)
Scheduler-->>Controller: hosts & slots (placeable)
Controller->>K8s: List CommittedResources (filtered)
K8s-->>Controller: CommittedResources
Controller->>Controller: Sum committed capacity (slots)
Controller->>K8s: Create or Patch FlavorGroupCapacity.status
K8s-->>Controller: persisted
end
end
Prom->>K8s: List FlavorGroupCapacities
K8s-->>Prom: CRDs
Prom->>Prom: Export labeled gauges
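The walkthrough and diagram describe a timer-driven runnable rather than a watch-based reconciler. A rough sketch of that shape in Go follows; `reconcileAll` and `ReconcileInterval` echo names used in the review comments below, while the struct fields and wiring are assumptions for illustration, not the PR's actual code.

```go
// Illustrative sketch of a periodic capacity runnable; only reconcileAll and
// ReconcileInterval are names taken from the review, the rest is assumed.
package capacity

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

type CapacityController struct {
	Client            client.Client
	ReconcileInterval time.Duration
}

// Start satisfies manager.Runnable: it loops until the manager's context is
// cancelled and runs one full reconcile pass per tick.
func (c *CapacityController) Start(ctx context.Context) error {
	ticker := time.NewTicker(c.ReconcileInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			if err := c.reconcileAll(ctx); err != nil {
				log.FromContext(ctx).Error(err, "capacity reconcile failed")
			}
		}
	}
}

func (c *CapacityController) reconcileAll(ctx context.Context) error {
	// List FlavorGroups and Hypervisors, derive AZs, probe the external
	// scheduler per (group, AZ), and patch FlavorGroupCapacity status.
	return nil
}
```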
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

✅ Passed checks (3)
@coderabbitai review
✅ Actions performed

Review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (2)
internal/scheduling/reservations/capacity/controller_test.go (1)
370-371: Variable `hv1Obj` shadows the imported package `hv1`.

The variable name `hv1Obj` is fine, but note that `hv1` is also an imported package alias (line 14: `hv1 "github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1"`). While the `Obj` suffix distinguishes it, consider using a more distinct name like `hvA` and `hvB` for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/capacity/controller_test.go` around lines 370-371: the variable name hv1Obj shadows the imported package alias hv1 and can cause confusion; rename the local hypervisor variables created via newHypervisor (e.g., hv1Obj and hv2Obj) to clearer, distinct identifiers like hvA and hvB (or hostA/hostB) throughout the test to avoid any collision with the hv1 package import, and update all references in the test accordingly.
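As a tiny illustration of the rename the nitpick asks for: the import alias below is quoted from the comment, while `newHypervisor` and the `Hypervisor` type usage are stand-ins for the real test helpers.

```go
package capacity_test

import (
	"testing"

	hv1 "github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1"
)

// newHypervisor stands in for the test helper referenced in the review.
func newHypervisor(name string) *hv1.Hypervisor {
	hv := &hv1.Hypervisor{}
	hv.Name = name
	return hv
}

func TestDistinctHypervisorNames(t *testing.T) {
	// Renamed from hv1Obj/hv2Obj so the locals can no longer be mistaken for
	// the hv1 package alias used for package references like hv1.Hypervisor.
	hvA := newHypervisor("host-a")
	hvB := newHypervisor("host-b")
	if hvA.Name == hvB.Name {
		t.Fatal("expected distinct hypervisor names")
	}
}
```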
internal/scheduling/reservations/capacity/metrics.go (1)

70-104: Consider adding a scrape timeout to prevent hanging during long list operations.

Using `context.Background()` means the `List` call has no timeout bound. While Prometheus scrapes have their own timeouts, the client call itself could hang indefinitely if the API server is slow.

Optional: Add context with timeout

```diff
 func (m *Monitor) Collect(ch chan<- prometheus.Metric) {
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+	defer cancel()
+
 	var list v1alpha1.FlavorGroupCapacityList
-	if err := m.client.List(context.Background(), &list); err != nil {
+	if err := m.client.List(ctx, &list); err != nil {
 		log.Error(err, "failed to list FlavorGroupCapacity CRDs for metrics")
 		return
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/capacity/metrics.go` around lines 70 - 104, The Collect method uses context.Background() for m.client.List which can hang; change it to use a cancellable context with a timeout (e.g., ctx, cancel := context.WithTimeout(context.Background(), <reasonableDuration>); defer cancel()) before calling m.client.List, pass ctx to m.client.List, and ensure any early return still defers cancel; keep the rest of Monitor.Collect logic (resetting gauges and calling Collect) unchanged and handle the List error as before (logging the error).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/reservations/capacity/controller.go`:
- Around line 120-124: The code currently calls sumCommittedCapacity(ctx,
groupName, az, ...) inside the inner (group,az) loop (see committedCapacity
usage) causing a full cluster List() per pair; instead, in reconcileAll() call
sumCommittedCapacity once to produce a pre-aggregated map keyed by flavor group
and AZ (e.g., map[string]map[string]int64 or a composite-key map) and compute
all committed capacities there, then change reconcileOne() signature to accept
that precomputed map (or the per-group value) and remove the inner calls to
sumCommittedCapacity; update all call sites to use the preloaded value and
delete the duplicated List() usage (also replace the occurrences referenced
around lines 230-258) so the controller does a single cluster List() per
reconcile cycle.
- Around line 189-223: The current logic calls ScheduleReservation for a
single-instance probe (ScheduleReservation) then treats each resp.Hosts as if it
represents full slot capacity by adding memCap.Value()/smallestFlavorBytes to
capacity, which overcounts; instead either (A) change the scheduler call to
request/return per-host slot counts (extend ScheduleReservation response to
include per-host slot counts) and sum those, or (B) compute slot counts locally
from authoritative capacity in hv.Status (use hv.Status.EffectiveCapacity
falling back to hv.Status.Capacity, lookup hv1.ResourceMemory, call
memCap.Value()/smallestFlavorBytes per host found in resp.Hosts and add that
exact quotient as integer slots) rather than treating a single eligible host as
one full host of slots; update the loop over resp.Hosts (and any uses of
capacity/TotalCapacity/TotalPlaceable) to use the chosen approach (modify
ScheduleReservation signature and its callers if using A, or replace the current
host-to-slot conversion with the local per-host division using hv,
EffectiveCapacity, smallestFlavorBytes if using B).
- Around line 127-175: The status numeric fields are unconditionally overwritten
with probe results even when probes failed (totalErr/placeableErr), causing
transient failures to set capacities to zero; change the update logic in the
controller (around where existing.Status.TotalCapacity, TotalHosts,
TotalPlaceable, PlaceableHosts, TotalInstances, CommittedCapacity are assigned)
to only assign the fields that come from a successful probe (e.g., if totalErr
== nil then set TotalCapacity/TotalHosts/TotalInstances/CommittedCapacity; if
placeableErr == nil then set TotalPlaceable/PlaceableHosts), leaving existing
values intact when a probe failed, while still updating the Condition
(freshCondition) and patching via client.Status().Patch as before.
---
Nitpick comments:
In `@internal/scheduling/reservations/capacity/controller_test.go`:
- Around line 370-371: The variable name hv1Obj shadows the imported package
alias hv1 and can cause confusion; rename the local hypervisor variables created
via newHypervisor (e.g., hv1Obj and hv2Obj) to clearer, distinct identifiers
like hvA and hvB (or hostA/hostB) throughout the test to avoid any collision
with the hv1 package import and update all references in the test accordingly.
In `@internal/scheduling/reservations/capacity/metrics.go`:
- Around line 70-104: The Collect method uses context.Background() for
m.client.List which can hang; change it to use a cancellable context with a
timeout (e.g., ctx, cancel := context.WithTimeout(context.Background(),
<reasonableDuration>); defer cancel()) before calling m.client.List, pass ctx to
m.client.List, and ensure any early return still defers cancel; keep the rest of
Monitor.Collect logic (resetting gauges and calling Collect) unchanged and
handle the List error as before (logging the error).
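To make the second inline prompt above (controller.go around lines 189-223) more concrete, here is a minimal sketch of its option (B), deriving slot counts locally from each host's memory. The `hv1.ResourceMemory` lookup and the EffectiveCapacity/Capacity fallback follow the prompt; the ResourceList-style capacity map and the function signature are assumptions.

```go
// Sketch of option (B): convert each eligible host into whole slots of the
// smallest flavor instead of counting one eligible host as one slot bucket.
// Field and map shapes are assumed; only the approach follows the prompt.
package capacity

import (
	hv1 "github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1"
)

func slotsForHosts(hosts []string, hypervisors map[string]hv1.Hypervisor, smallestFlavorBytes int64) int64 {
	var slots int64
	for _, name := range hosts { // hosts returned by the placement probe
		hv, ok := hypervisors[name]
		if !ok {
			continue
		}
		// Prefer EffectiveCapacity and fall back to Capacity, per the prompt.
		capSet := hv.Status.EffectiveCapacity
		if capSet == nil {
			capSet = hv.Status.Capacity
		}
		memCap, ok := capSet[hv1.ResourceMemory]
		if !ok {
			continue
		}
		// Whole slots of the smallest flavor that fit into this host's memory.
		slots += memCap.Value() / smallestFlavorBytes
	}
	return slots
}
```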
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e34d4385-6776-4196-8bae-895e723c3dac
📒 Files selected for processing (10)
- api/v1alpha1/flavor_group_capacity_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- cmd/manager/main.go
- helm/bundles/cortex-nova/values.yaml
- helm/library/cortex/files/crds/cortex.cloud_flavorgroupcapacities.yaml
- helm/library/cortex/templates/rbac/role.yaml
- internal/scheduling/reservations/capacity/config.go
- internal/scheduling/reservations/capacity/controller.go
- internal/scheduling/reservations/capacity/controller_test.go
- internal/scheduling/reservations/capacity/metrics.go
```go
committedCapacity, committedErr := c.sumCommittedCapacity(ctx, groupName, az, smallestFlavorBytes)
if committedErr != nil {
	log.Error(committedErr, "failed to sum committed capacity", "flavorGroup", groupName, "az", az)
	committedCapacity = 0
}
```
This does a full CommittedResource list once per flavor-group/AZ pair.
sumCommittedCapacity() performs a cluster-wide List() and is called from the inner (group, az) loop. One reconcile cycle therefore scales with len(flavorGroups) * len(azs) full-list API calls, which will get expensive quickly on a periodic controller. Please preload and aggregate committed capacity once in reconcileAll(), then pass the precomputed value into reconcileOne().
Also applies to: 230-258
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/capacity/controller.go` around lines 120 -
124, The code currently calls sumCommittedCapacity(ctx, groupName, az, ...)
inside the inner (group,az) loop (see committedCapacity usage) causing a full
cluster List() per pair; instead, in reconcileAll() call sumCommittedCapacity
once to produce a pre-aggregated map keyed by flavor group and AZ (e.g.,
map[string]map[string]int64 or a composite-key map) and compute all committed
capacities there, then change reconcileOne() signature to accept that
precomputed map (or the per-group value) and remove the inner calls to
sumCommittedCapacity; update all call sites to use the preloaded value and
delete the duplicated List() usage (also replace the occurrences referenced
around lines 230-258) so the controller does a single cluster List() per
reconcile cycle.
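As a rough sketch of the preloading this comment asks for — a single CommittedResources list per cycle, aggregated by flavor group and AZ — the snippet below uses an assumed module path, list type, and spec field names (`CommittedResourcesList`, `Spec.FlavorGroup`, `Spec.AvailabilityZone`, `Spec.Memory`); only the one-List-then-bucket structure is the point.

```go
// Sketch of the suggested preload: one cluster-wide List per reconcile cycle,
// bucketed by (flavor group, AZ). Type and field names are assumptions.
package capacity

import (
	"context"

	"github.com/cobaltcore-dev/cortex/api/v1alpha1" // module path assumed
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// committedKey identifies one (flavor group, availability zone) pair.
type committedKey struct {
	Group string
	AZ    string
}

func preloadCommittedCapacity(ctx context.Context, c client.Client, smallestFlavorBytes int64) (map[committedKey]int64, error) {
	var list v1alpha1.CommittedResourcesList
	if err := c.List(ctx, &list); err != nil {
		return nil, err
	}
	sums := make(map[committedKey]int64)
	for _, cr := range list.Items {
		key := committedKey{Group: cr.Spec.FlavorGroup, AZ: cr.Spec.AvailabilityZone}
		// Express committed memory in slots of the smallest flavor, mirroring
		// what sumCommittedCapacity is described to do today.
		sums[key] += cr.Spec.Memory.Value() / smallestFlavorBytes
	}
	return sums, nil
}
```

reconcileOne could then take this map (or the per-group value) instead of listing CommittedResources itself.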
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1alpha1/flavor_group_capacity_types.go`:
- Around line 60-63: Add kubebuilder list-map markers to the Conditions field so
the CRD emits x-kubernetes-list-type: map and x-kubernetes-list-map-keys for
proper server-side apply merging: above the existing Conditions
[]metav1.Condition `json:"conditions,omitempty" ...` add the comments "//
+listType=map" and "// +listMapKey=type" (keeping the existing "//
+kubebuilder:validation:Optional") so the generator treats the list as a map
keyed by Condition.Type; apply the same change to every Condition field across
api/v1alpha1 (reservation, pipeline, kpi, knowledge, history, descheduling,
decision, committed_resource, datasource) and then regenerate the CRD artifacts.
In `@internal/scheduling/nova/plugins/filters/filter_has_enough_capacity.go`:
- Around line 78-92: The code skips subtracting VM allocations when
s.Options.IgnoreAllocations is true but still applies the later
committed-reservation adjustment that assumes allocations were removed, causing
overstated free capacity; modify the committed-reservation adjustment logic in
filter_has_enough_capacity.go (the same area that uses freeResourcesByHost and
hv.Status.Allocation) to be conditional on !s.Options.IgnoreAllocations (or to
use a copy of pre-allocation free totals when computing reservation blocking) so
that when IgnoreAllocations is true you do not subtract running allocations from
committed reservations; ensure the change references IgnoreAllocations,
freeResourcesByHost, and hv.Status.Allocation so the two adjustments remain
consistent.
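To illustrate the consistency the last prompt is after, here is a deliberately simplified model: plain ints stand in for the filter's real Options, freeResourcesByHost, and hv.Status.Allocation structures, and only the pattern of keying both deductions off the same flag follows the review.

```go
// Toy model only: both adjustments are driven by the same IgnoreAllocations
// flag, so committed reservations block their full size whenever running
// allocations were not subtracted from free capacity.
package main

import "fmt"

type opts struct{ IgnoreAllocations bool }

type hostMem struct {
	capacity  int64 // total memory
	allocated int64 // memory of running VMs
	committed int64 // memory held by committed reservations
}

func freeCapacity(o opts, h hostMem) int64 {
	free := h.capacity
	blocked := h.committed
	if !o.IgnoreAllocations {
		// Allocations are subtracted from free capacity, so only the part of
		// the committed reservations not yet realized as allocations blocks.
		free -= h.allocated
		blocked -= h.allocated
		if blocked < 0 {
			blocked = 0
		}
	}
	// The reported issue: subtracting allocations from the committed amount
	// here even when they were never subtracted from free capacity overstates
	// the result.
	free -= blocked
	if free < 0 {
		free = 0
	}
	return free
}

func main() {
	h := hostMem{capacity: 512, allocated: 192, committed: 128}
	fmt.Println(freeCapacity(opts{IgnoreAllocations: false}, h)) // 320
	fmt.Println(freeCapacity(opts{IgnoreAllocations: true}, h))  // 384, not 512
}
```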
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3edbf5df-73fb-4440-b186-4a37ccf1abfe
📒 Files selected for processing (7)
api/v1alpha1/flavor_group_capacity_types.gocmd/manager/main.gohelm/bundles/cortex-nova/templates/pipelines_kvm.yamlhelm/bundles/cortex-nova/values.yamlhelm/library/cortex/files/crds/cortex.cloud_flavorgroupcapacities.yamlinternal/scheduling/nova/plugins/filters/filter_has_enough_capacity.gointernal/scheduling/reservations/capacity/controller_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- helm/bundles/cortex-nova/values.yaml
- cmd/manager/main.go
- internal/scheduling/reservations/capacity/controller_test.go
small comments:
CRD format
Actionable comments posted: 1
♻️ Duplicate comments (1)
api/v1alpha1/flavor_group_capacity_types.go (1)
68-70: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add list-map markers for `status.conditions` to preserve SSA merge semantics.

Line 70 defines `Conditions` without `// +listType=map` and `// +listMapKey=type`, so server-side apply treats conditions as a plain list and replaces it instead of merging by condition type.

Suggested fix

```diff
 // The current status conditions of the FlavorGroupCapacity.
 // +kubebuilder:validation:Optional
+// +listType=map
+// +listMapKey=type
 Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
```

```bash
#!/bin/bash
set -euo pipefail
rg -n -C2 'Conditions\s+\[\]metav1\.Condition|listType|listMapKey' api/v1alpha1/flavor_group_capacity_types.go
rg -n -C3 'conditions:|x-kubernetes-list-type|x-kubernetes-list-map-keys' helm/library/cortex/files/crds/cortex.cloud_flavorgroupcapacities.yaml
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/v1alpha1/flavor_group_capacity_types.go` around lines 68 - 70, The Conditions field on the FlavorGroupCapacity status is missing list-map markers so SSA will replace the whole list instead of merging by condition type; update the Conditions declaration in the FlavorGroupCapacity status (the Conditions []metav1.Condition field) to include the Kubernetes markers // +listType=map and // +listMapKey=type (above the Conditions line) so server-side-apply and CRD generation treat it as a map keyed by the Condition.Type for proper merge semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/scheduling/reservations/capacity/metrics.go`:
- Around line 75-77: The List call in Collect uses context.Background() which
can block; replace it with a bounded timeout context (e.g., ctx, cancel :=
context.WithTimeout(ctx, timeout) and defer cancel()) and pass that ctx to
m.client.List; update the Collect function to define a sensible timeout constant
or configurable value (for example 1–5s), ensure defer cancel() is called before
the List returns, and keep the existing error handling (log.Error(err, ...);
return) intact when m.client.List(ctx, &list) fails.
---
Duplicate comments:
In `@api/v1alpha1/flavor_group_capacity_types.go`:
- Around line 68-70: The Conditions field on the FlavorGroupCapacity status is
missing list-map markers so SSA will replace the whole list instead of merging
by condition type; update the Conditions declaration in the FlavorGroupCapacity
status (the Conditions []metav1.Condition field) to include the Kubernetes
markers // +listType=map and // +listMapKey=type (above the Conditions line) so
server-side-apply and CRD generation treat it as a map keyed by the
Condition.Type for proper merge semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2480e9c9-6ef3-4a20-aa38-8e4f57d5877e
📒 Files selected for processing (9)
- api/v1alpha1/flavor_group_capacity_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- cmd/manager/main.go
- helm/bundles/cortex-nova/values.yaml
- helm/library/cortex/files/crds/cortex.cloud_flavorgroupcapacities.yaml
- internal/scheduling/nova/plugins/filters/filter_has_enough_capacity.go
- internal/scheduling/reservations/capacity/controller.go
- internal/scheduling/reservations/capacity/controller_test.go
- internal/scheduling/reservations/capacity/metrics.go
✅ Files skipped from review due to trivial changes (1)
- cmd/manager/main.go
🚧 Files skipped from review as they are similar to previous changes (5)
- helm/bundles/cortex-nova/values.yaml
- api/v1alpha1/zz_generated.deepcopy.go
- internal/scheduling/reservations/capacity/controller_test.go
- internal/scheduling/reservations/capacity/controller.go
- internal/scheduling/nova/plugins/filters/filter_has_enough_capacity.go
Test Coverage Report

Test Coverage 📊: 69.1%
No description provided.