feat: CR usage controller - moving CR usage calculation into CRD status, for API usage and Quota usage #800
Conversation
📝 Walkthrough

Extended CommittedResource status for rollback/retry (AcceptedSpec, ConsecutiveFailures) and per-resource usage (UsedResources, AssignedInstances), added a UsageReconciler controller, a Nova images datasource and image model, migrated the server table to v3, added generation-aware API polling with exponential backoff, and updated tests/CRD/helm accordingly.

Changes

- CommittedResource Controller, Usage, API & Tests
- Nova Images Datasource, Server v3 Migration & Extractors
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant API as Change<br/>Commitments API
participant CR as Committed<br/>Resource CRD
participant Controller as Committed<br/>Resource Controller
participant Reconciler as Usage<br/>Reconciler
participant UsageDB as Usage<br/>DB / Postgres
API->>CR: Create/Update CommittedResource (with spec)
CR-->>API: CR created (generation N)
API->>CR: Poll CR (generation-aware)
alt CR Ready for generation N
CR-->>API: Ready=True observed -> API returns success
else Timeout or Rejected
API->>CR: Rollback spec -> uses RetryOnConflict to restore AcceptedSpec
end
Controller->>CR: Reconcile spec -> try place Reservations
alt Placement fails and AllowRejection=false
Controller->>Controller: increment ConsecutiveFailures and RequeueAfter (backoff)
else Placement succeeds
Controller->>CR: set AcceptedSpec and reset ConsecutiveFailures
end
Reconciler->>UsageDB: Query project VMs (joins images to get os_type)
Reconciler->>Reconciler: Build VM→commitment assignments
Reconciler->>CR: Patch Status.AssignedInstances and Status.UsedResources (memory/cpu)
```
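For readers skimming the diagram, the generation-aware polling step might look roughly like the sketch below. This is a minimal illustration, not the PR's actual code: the helper name `waitForGeneration`, the closure-based fetch, and the backoff constants are assumptions; only the idea — trust a Ready condition solely when its observed generation matches the generation just written, and back off exponentially between polls — comes from the walkthrough.

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// fetchConditions stands in for a client.Get on the CommittedResource and
// returns its current status conditions.
type fetchConditions func(ctx context.Context) ([]metav1.Condition, error)

// waitForGeneration polls until Ready=True is observed for the expected
// generation, doubling the poll interval up to a cap.
func waitForGeneration(ctx context.Context, fetch fetchConditions, generation int64) error {
	backoff := 250 * time.Millisecond
	const maxBackoff = 8 * time.Second
	for {
		conds, err := fetch(ctx)
		if err != nil {
			return err
		}
		ready := apimeta.FindStatusCondition(conds, "Ready")
		// An older generation's Ready=True must not be mistaken for success.
		if ready != nil && ready.Status == metav1.ConditionTrue && ready.ObservedGeneration >= generation {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for generation %d: %w", generation, ctx.Err())
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```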
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 7
🧹 Nitpick comments (5)
internal/knowledge/datasources/plugins/openstack/nova/nova_api.go (2)
89-97: 💤 Low value — Implicit precondition: `GetAllImages` panics if the datasource type is not `Images`.
`api.glance` is only assigned when `api.conf.Type == NovaDatasourceTypeImages`. Any code path that calls `GetAllImages` on a non-Images instance will hit a nil-pointer dereference inside `glanceimages.List(api.glance, ...)`. Today the sync router enforces this, but it's an easy regression. Consider either initializing Glance unconditionally (cheap — same provider/availability), or returning a clear error from `GetAllImages` when `api.glance == nil`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api.go` around lines 89 - 97, GetAllImages can panic because api.glance is only set when api.conf.Type == v1alpha1.NovaDatasourceTypeImages; add an explicit nil-check at the start of GetAllImages that returns a clear error (e.g., "glance client not initialized: datasource type is not Images") if api.glance == nil, and/or alternatively initialize api.glance unconditionally in the constructor code shown (the openstack.NewImageV2 call) so api.glance is always non-nil; reference GetAllImages, api.glance, NovaDatasourceTypeImages, and the glanceimages.List call when locating where to add the guard or the unconditional initialization.
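A self-contained sketch of the nil-guard option. The `novaAPI` and `glanceClient` shapes here are stand-ins for the repo's types, not the actual code:

```go
package sketch

import "errors"

// glanceClient stands in for the real Glance service client; it stays nil
// when the datasource is not configured for image syncing.
type glanceClient struct{}

type novaAPI struct {
	glance *glanceClient
}

var errGlanceNotInitialized = errors.New(
	"glance client not initialized: datasource type is not Images")

// GetAllImages fails fast with a clear error instead of dereferencing a nil
// client deep inside Glance pagination.
func (api *novaAPI) GetAllImages() ([]string, error) {
	if api.glance == nil {
		return nil, errGlanceNotInitialized
	}
	// Real code would paginate over glanceimages.List(api.glance, ...) here.
	return nil, nil
}
```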
459-479: ⚡ Quick win — Add request-timer instrumentation for parity with other `Get*` methods.
Every other `Get*` method in this file wraps the API call(s) with `api.mon.RequestTimer.WithLabelValues(label)` for Prometheus tracing. `GetAllImages` skips it, leaving Glance pagination unmonitored. Adding the same wrapper keeps observability uniform across datasources.

♻️ Proposed change
```diff
 func (api *novaAPI) GetAllImages(ctx context.Context) ([]Image, error) {
+	label := Image{}.TableName()
+	slog.Info("fetching nova data", "label", label)
+	if api.mon.RequestTimer != nil {
+		hist := api.mon.RequestTimer.WithLabelValues(label)
+		timer := prometheus.NewTimer(hist)
+		defer timer.ObserveDuration()
+	}
 	var result []Image
 	opts := glanceimages.ListOpts{Limit: 1000}
 	err := glanceimages.List(api.glance, opts).EachPage(ctx, func(_ context.Context, page pagination.Page) (bool, error) {
 		imgs, err := glanceimages.ExtractImages(page)
 		if err != nil {
 			return false, err
 		}
 		for _, img := range imgs {
 			result = append(result, Image{
 				ID:     img.ID,
 				OSType: deriveOSType(img.Properties, img.Tags),
 			})
 		}
 		return true, nil
 	})
 	if err != nil {
 		return nil, fmt.Errorf("failed to list Glance images: %w", err)
 	}
+	slog.Info("fetched", "label", label, "count", len(result))
 	return result, nil
 }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api.go` around lines 459 - 479, GetAllImages is missing Prometheus request-timer instrumentation; wrap the Glance call with the same timer pattern used by other Get* methods: call api.mon.RequestTimer.WithLabelValues("glance_images").StartTimer() (or the existing label naming convention used elsewhere), assign to a timer variable and defer timer.ObserveDuration(), then run the glanceimages.List(...).EachPage call as before inside that timed scope; update function novaAPI.GetAllImages to include this StartTimer/ObserveDuration pairing.

internal/knowledge/datasources/plugins/openstack/nova/nova_types.go (1)
161-168: 💤 Low value — Silent failure when image JSON has unexpected shape.
If the `image` field is a JSON object but unmarshaling into `imageMap` fails (e.g., a future Nova response with a non-string `id`), the error is silently dropped and `ImageRef` is left empty. That makes such servers indistinguishable from volume-booted servers downstream and harder to diagnose. Consider returning the error (consistent with the flavor/fault handling above) or at least logging it.

♻️ Proposed change
```diff
-	if err := json.Unmarshal(aux.Image, &imageMap); err == nil {
-		s.ImageRef = imageMap.ID
-	}
+	if err := json.Unmarshal(aux.Image, &imageMap); err != nil {
+		return err
+	}
+	s.ImageRef = imageMap.ID
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_types.go` around lines 161 - 168, The code currently swallows JSON unmarshal errors for aux.Image, leaving s.ImageRef empty; instead, when json.Unmarshal(aux.Image, &imageMap) returns an error you should not drop it—either return the error from the surrounding UnmarshalJSON (or the equivalent parsing function) or log it consistently with the earlier flavor/fault handling. Modify the block that parses aux.Image so that on err != nil you return fmt.Errorf("failed to parse image JSON: %w", err) (or call the same logger used elsewhere) rather than ignoring the error, keeping the existing assignment to s.ImageRef when successful (symbols: aux.Image, imageMap, s.ImageRef).

internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go (1)
59-61: ⚡ Quick win — Add a dedicated `TestNovaSyncer_SyncImages` test to match the pattern of every other sync method.
Every existing sync method has a corresponding named test (`TestNovaSyncer_SyncServers`, `SyncHypervisors`, `SyncFlavors`, etc.), but `SyncAllImages` has none. The mock is already wired up — the test would be a one-to-one copy of the existing pattern.

✅ Proposed test for `SyncAllImages`
```go
func TestNovaSyncer_SyncImages(t *testing.T) {
	dbEnv := testlibDB.SetupDBEnv(t)
	testDB := db.DB{DbMap: dbEnv.DbMap}
	defer dbEnv.Close()
	mon := datasources.Monitor{}
	syncer := &NovaSyncer{
		DB:   testDB,
		Mon:  mon,
		Conf: v1alpha1.NovaDatasource{Type: v1alpha1.NovaDatasourceTypeImages},
		API:  &mockNovaAPI{},
	}

	ctx := t.Context()
	if err := syncer.Init(ctx); err != nil {
		t.Fatalf("failed to init nova syncer: %v", err)
	}
	n, err := syncer.SyncAllImages(ctx)
	if err != nil {
		t.Fatalf("expected no error, got %v", err)
	}
	if n != 1 {
		t.Fatalf("expected 1 image, got %d", n)
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go` around lines 59 - 61, Add a new unit test named TestNovaSyncer_SyncImages that mirrors the pattern used by TestNovaSyncer_SyncServers/SyncHypervisors/SyncFlavors: instantiate the NovaSyncer with the mockNovaAPI (which implements GetAllImages returning the sample image), call the sync method (e.g., NovaSyncer.SyncAllImages or the syncer method that triggers image syncing), and assert that the returned/side-effect results match the expected image list and no error occurred; use the same setup/teardown, mock injection, and assertion style as the other sync tests so it fits the existing test suite conventions.

internal/scheduling/reservations/commitments/usage_reconciler_monitor.go (1)
18-55: ⚡ Quick win — `NewUsageReconcilerMonitor` returns a value type, but `Describe`/`Collect` have pointer receivers — only `*UsageReconcilerMonitor` satisfies `prometheus.Collector`.
Any caller that tries to use the returned value directly as a `prometheus.Collector` (e.g., `prometheus.Register(monitor)` where `monitor := NewUsageReconcilerMonitor()`) will get a compile error. The caller is forced to take the address explicitly (`&monitor`), which is non-obvious given a constructor name.
The idiomatic fix is to return `*UsageReconcilerMonitor`:

♻️ Proposed fix

```diff
-func NewUsageReconcilerMonitor() UsageReconcilerMonitor {
-	m := UsageReconcilerMonitor{
+func NewUsageReconcilerMonitor() *UsageReconcilerMonitor {
+	m := &UsageReconcilerMonitor{
 		reconcileDuration: prometheus.NewHistogramVec(prometheus.HistogramOpts{
 			...
 		}, []string{"result"}),
 		...
 	}
 	// Pre-initialize result labels so metrics appear before the first reconcile.
 	m.reconcileDuration.WithLabelValues("success")
 	m.reconcileDuration.WithLabelValues("error")
 	return m
 }
```

Please verify how `cmd/manager/main.go` uses the returned value — if it already passes `&monitor` to both `prometheus.Register` and `NewUsageReconciler`, this is a style concern; if not, it's a compile error waiting to happen.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/commitments/usage_reconciler_monitor.go` around lines 18 - 55, NewUsageReconcilerMonitor currently returns a value (UsageReconcilerMonitor) but the methods Describe and Collect have pointer receivers, so only *UsageReconcilerMonitor implements prometheus.Collector; change NewUsageReconcilerMonitor to return *UsageReconcilerMonitor and allocate/return a pointer (e.g., &UsageReconcilerMonitor{...}) so callers can pass it directly to prometheus.Register and NewUsageReconciler without taking the address; then verify usages in cmd/manager/main.go (prometheus.Register(...) and NewUsageReconciler(...)) to ensure they accept the pointer and update any call sites that expect a value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
- Around line 471-477: The slowdown gate should only apply to the generation
that actually failed: change the condition in the
committed_resource_controller.go block that currently checks
cr.Status.ConsecutiveFailures >= maxConsecutiveFailuresForSlowdown to also
require the failure is for the current generation (e.g.
cr.Status.FailureGeneration == cr.Generation or
cr.Status.FailedObservedGeneration == cr.ObjectMeta.Generation). Also update the
code that increments/reset ConsecutiveFailures (the failure-recording logic) to
set FailureGeneration = cr.Generation when you increment failures and to reset
ConsecutiveFailures and FailureGeneration when the spec/generation advances (so
new generations start with a fresh count). This ensures suppression only occurs
for the generation that produced the failures.
In `@internal/scheduling/reservations/commitments/usage_reconciler.go`:
- Around line 46-49: Early-return on non-active commitment states (checked via
cr.Spec.State against CommitmentStatusConfirmed and CommitmentStatusGuaranteed)
leaves stale cr.Status.AssignedInstances and cr.Status.UsedResources intact;
instead, when the commitment is not active (i.e., states
planned/pending/superseded/expired), clear/reset cr.Status.AssignedInstances and
cr.Status.UsedResources to empty/zero and persist the change via the status
update path (e.g., r.Status().Update(ctx, cr)) before returning so stale
assignments are not reported as committed usage. Ensure this logic is placed in
the same reconciliation branch that currently returns early after the
cr.Spec.State check.
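A rough sketch of the early-return fix this prompt describes — the field names and the `r.Status().Update` call follow the prompt's wording and are assumptions about the repo's types, not verified code:

```go
// Inside Reconcile, before the active-state processing:
if cr.Spec.State != v1alpha1.CommitmentStatusConfirmed &&
	cr.Spec.State != v1alpha1.CommitmentStatusGuaranteed {
	// Non-active commitment (planned/pending/superseded/expired): clear any
	// stale usage so it is no longer reported as committed usage.
	if len(cr.Status.AssignedInstances) > 0 || len(cr.Status.UsedResources) > 0 {
		cr.Status.AssignedInstances = nil
		cr.Status.UsedResources = nil
		if err := r.Status().Update(ctx, cr); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```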
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: be511876-1fdb-427f-959d-63e23f2e391a
📒 Files selected for processing (33)
- api/v1alpha1/committed_resource_types.go
- api/v1alpha1/datasource_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- cmd/manager/main.go
- helm/bundles/cortex-nova/templates/datasources.yaml
- helm/bundles/cortex-nova/values.yaml
- helm/library/cortex/files/crds/cortex.cloud_committedresources.yaml
- internal/knowledge/datasources/plugins/openstack/controller_test.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_api.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_sync.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_types.go
- internal/knowledge/extractor/plugins/compute/libvirt_domain_cpu_steal_pct.sql
- internal/knowledge/extractor/plugins/compute/vm_host_residency.sql
- internal/knowledge/extractor/plugins/compute/vm_life_span.sql
- internal/knowledge/extractor/plugins/compute/vrops_hostsystem_resolver.sql
- internal/knowledge/extractor/plugins/compute/vrops_project_noisiness.sql
- internal/scheduling/reservations/commitments/api/change_commitments.go
- internal/scheduling/reservations/commitments/api/change_commitments_test.go
- internal/scheduling/reservations/commitments/api/report_usage_test.go
- internal/scheduling/reservations/commitments/api/usage_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/config.go
- internal/scheduling/reservations/commitments/field_index.go
- internal/scheduling/reservations/commitments/reservation_manager.go
- internal/scheduling/reservations/commitments/reservation_manager_test.go
- internal/scheduling/reservations/commitments/syncer.go
- internal/scheduling/reservations/commitments/usage.go
- internal/scheduling/reservations/commitments/usage_internals_test.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
- internal/scheduling/reservations/commitments/usage_reconciler_monitor.go
- tools/visualize-committed-resources/main.go
```go
usageReconcilerMonitor := commitments.NewUsageReconcilerMonitor()
metrics.Registry.MustRegister(&usageReconcilerMonitor)
usageReconcilerConf := commitmentsConfig.UsageReconciler
usageReconcilerConf.ApplyDefaults()
if err := (&commitments.UsageReconciler{
	Client:  multiclusterClient,
	Conf:    usageReconcilerConf,
	UsageDB: commitmentsUsageDB,
	Monitor: usageReconcilerMonitor,
}).SetupWithManager(mgr, multiclusterClient); err != nil {
```
Fail fast when usage reconciliation has no datasource.
commitmentsUsageDB stays nil when commitmentsConfig.DatasourceName is empty, but this block still registers UsageReconciler with that nil dependency. That leaves the new status-based usage path enabled in a state where it cannot compute any usage data.
Suggested guard
```diff
+	if commitmentsUsageDB == nil {
+		setupLog.Error(nil,
+			"committed-resource-reservations-controller requires commitments datasourceName for usage reconciliation")
+		os.Exit(1)
+	}
+
 	usageReconcilerMonitor := commitments.NewUsageReconcilerMonitor()
 	metrics.Registry.MustRegister(&usageReconcilerMonitor)
 	usageReconcilerConf := commitmentsConfig.UsageReconciler
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if commitmentsUsageDB == nil {
	setupLog.Error(nil,
		"committed-resource-reservations-controller requires commitments datasourceName for usage reconciliation")
	os.Exit(1)
}
usageReconcilerMonitor := commitments.NewUsageReconcilerMonitor()
metrics.Registry.MustRegister(&usageReconcilerMonitor)
usageReconcilerConf := commitmentsConfig.UsageReconciler
usageReconcilerConf.ApplyDefaults()
if err := (&commitments.UsageReconciler{
	Client:  multiclusterClient,
	Conf:    usageReconcilerConf,
	UsageDB: commitmentsUsageDB,
	Monitor: usageReconcilerMonitor,
}).SetupWithManager(mgr, multiclusterClient); err != nil {
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cmd/manager/main.go` around lines 562 - 571, The code registers a
UsageReconciler even when commitmentsUsageDB is nil (i.e. no datasource
configured), leaving the feature enabled but non-functional; add a guard before
creating/registering UsageReconciler and its UsageReconcilerMonitor that checks
commitmentsConfig.DatasourceName and commitmentsUsageDB and fail fast (return an
error or log.Fatalf) with a clear message if the datasource is not configured;
specifically, before calling commitments.NewUsageReconcilerMonitor(), verify
commitmentsConfig.DatasourceName != "" and commitmentsUsageDB != nil and only
then create the monitor and call
(&commitments.UsageReconciler{...}).SetupWithManager(mgr, multiclusterClient).
```go
// toCommittedResourceCRD creates a minimal CommittedResource CRD for this commitment.
// Used by the test setup to pre-populate the CR objects that the usage reconciler writes status into.
func (tr *UsageTestReservation) toCommittedResourceCRD() *v1alpha1.CommittedResource {
	amount := resource.MustParse(strconv.FormatInt(tr.Flavor.MemoryMB*int64(tr.Count), 10) + "Mi")
	return &v1alpha1.CommittedResource{
		ObjectMeta: metav1.ObjectMeta{Name: "cr-" + tr.CommitmentID},
		Spec: v1alpha1.CommittedResourceSpec{
			CommitmentUUID:   tr.CommitmentID,
			ProjectID:        tr.ProjectID,
			DomainID:         "test-domain",
			AvailabilityZone: tr.AZ,
			FlavorGroupName:  tr.Flavor.Group,
			ResourceType:     v1alpha1.CommittedResourceTypeMemory,
			State:            v1alpha1.CommitmentStatusConfirmed,
			Amount:           amount,
		},
	}
}
```
Preserve StartTime in the synthetic CommittedResource.
The usage path now reads reconciled CommittedResource state, but this helper drops the commitment start time. Tests like “fill oldest commitment first” therefore stop exercising the ordering input they are asserting about.
Suggested fix
```diff
 func (tr *UsageTestReservation) toCommittedResourceCRD() *v1alpha1.CommittedResource {
 	amount := resource.MustParse(strconv.FormatInt(tr.Flavor.MemoryMB*int64(tr.Count), 10) + "Mi")
-	return &v1alpha1.CommittedResource{
+	cr := &v1alpha1.CommittedResource{
 		ObjectMeta: metav1.ObjectMeta{Name: "cr-" + tr.CommitmentID},
 		Spec: v1alpha1.CommittedResourceSpec{
 			CommitmentUUID:   tr.CommitmentID,
 			ProjectID:        tr.ProjectID,
 			DomainID:         "test-domain",
 			AvailabilityZone: tr.AZ,
 			FlavorGroupName:  tr.Flavor.Group,
 			ResourceType:     v1alpha1.CommittedResourceTypeMemory,
 			State:            v1alpha1.CommitmentStatusConfirmed,
 			Amount:           amount,
 		},
 	}
+	if !tr.StartTime.IsZero() {
+		cr.Spec.StartTime = &metav1.Time{Time: tr.StartTime}
+	}
+	return cr
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// toCommittedResourceCRD creates a minimal CommittedResource CRD for this commitment.
// Used by the test setup to pre-populate the CR objects that the usage reconciler writes status into.
func (tr *UsageTestReservation) toCommittedResourceCRD() *v1alpha1.CommittedResource {
	amount := resource.MustParse(strconv.FormatInt(tr.Flavor.MemoryMB*int64(tr.Count), 10) + "Mi")
	cr := &v1alpha1.CommittedResource{
		ObjectMeta: metav1.ObjectMeta{Name: "cr-" + tr.CommitmentID},
		Spec: v1alpha1.CommittedResourceSpec{
			CommitmentUUID:   tr.CommitmentID,
			ProjectID:        tr.ProjectID,
			DomainID:         "test-domain",
			AvailabilityZone: tr.AZ,
			FlavorGroupName:  tr.Flavor.Group,
			ResourceType:     v1alpha1.CommittedResourceTypeMemory,
			State:            v1alpha1.CommitmentStatusConfirmed,
			Amount:           amount,
		},
	}
	if !tr.StartTime.IsZero() {
		cr.Spec.StartTime = &metav1.Time{Time: tr.StartTime}
	}
	return cr
}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/scheduling/reservations/commitments/api/report_usage_test.go` around
lines 906 - 923, The helper toCommittedResourceCRD on UsageTestReservation drops
the commitment StartTime causing ordering-sensitive tests to break; update
toCommittedResourceCRD to copy the reservation's StartTime into the
CommittedResource spec (set Spec.StartTime = tr.StartTime or the appropriate
metav1.Time wrapper used by v1alpha1.CommittedResourceSpec) so synthetic CRDs
preserve the original commitment start time used by the reconciler (reference
UsageTestReservation and toCommittedResourceCRD).
```go
cr := &v1alpha1.CommittedResource{
	ObjectMeta: metav1.ObjectMeta{Name: "cr-" + uuid},
	Spec: v1alpha1.CommittedResourceSpec{
		CommitmentUUID:   uuid,
		ProjectID:        r.Spec.CommittedResourceReservation.ProjectID,
		DomainID:         "test-domain",
		AvailabilityZone: r.Spec.AvailabilityZone,
		FlavorGroupName:  r.Spec.CommittedResourceReservation.ResourceGroup,
		ResourceType:     v1alpha1.CommittedResourceTypeMemory,
		State:            v1alpha1.CommitmentStatusConfirmed,
		Amount:           amount,
	},
```
Carry the reservation time window into the synthetic CommittedResource.
These cases are asserting expired/future commitment behavior, but the CR reconciled here never gets StartTime or EndTime. With the new status-based flow, that means the test is no longer feeding the reconciler the same activation window data that production uses.
Suggested fix
```diff
 cr := &v1alpha1.CommittedResource{
 	ObjectMeta: metav1.ObjectMeta{Name: "cr-" + uuid},
 	Spec: v1alpha1.CommittedResourceSpec{
 		CommitmentUUID:   uuid,
 		ProjectID:        r.Spec.CommittedResourceReservation.ProjectID,
 		DomainID:         "test-domain",
 		AvailabilityZone: r.Spec.AvailabilityZone,
 		FlavorGroupName:  r.Spec.CommittedResourceReservation.ResourceGroup,
 		ResourceType:     v1alpha1.CommittedResourceTypeMemory,
 		State:            v1alpha1.CommitmentStatusConfirmed,
 		Amount:           amount,
 	},
 }
+if r.Spec.StartTime != nil {
+	cr.Spec.StartTime = r.Spec.StartTime.DeepCopy()
+}
+if r.Spec.EndTime != nil {
+	cr.Spec.EndTime = r.Spec.EndTime.DeepCopy()
+}
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/scheduling/reservations/commitments/api/usage_test.go` around lines
335 - 346, The synthetic CommittedResource created as cr lacks the reservation
time window, so add the reservation's StartTime and EndTime into cr.Spec to
mirror production activation windows; set cr.Spec.StartTime and cr.Spec.EndTime
from the test reservation (e.g., r.Spec.CommittedResourceReservation.StartTime
and r.Spec.CommittedResourceReservation.EndTime or r.Spec.StartTime/EndTime as
appropriate) so the reconciler sees the same activation window when exercising
CommittedResource handling.
```go
if cr.Status.AcceptedSpec != nil {
	// Use the full accepted spec snapshot: ensures rollback targets the exact previously-accepted
	// placement (AZ, amount, project, domain) even if the current spec has been mutated.
	tempCR := v1alpha1.CommittedResource{Spec: *cr.Status.AcceptedSpec}
	state, err = FromCommittedResource(tempCR)
} else {
	// Legacy fallback: AcceptedSpec not yet populated (CR accepted before this field existed).
	state, err = FromCommittedResource(*cr)
	if err == nil {
		state.TotalMemoryBytes = cr.Status.AcceptedAmount.Value()
	}
}
```
The legacy rollback path restores the new spec, not the last accepted one.
When AcceptedSpec is still nil, this fallback reconstructs state from the current mutated spec and only patches the amount back. For CRs that were already accepted before this field existed, the first failed update after rollout can roll back into the wrong AZ/project/domain/flavor group instead of the previously accepted placement.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@internal/scheduling/reservations/commitments/committed_resource_controller.go`
around lines 352 - 363, The legacy fallback currently calls
FromCommittedResource(*cr) which uses the possibly-mutated current spec;
instead, when cr.Status.AcceptedSpec == nil build a temporary CommittedResource
spec that preserves placement/selector/amount from the last-accepted status:
create tempCR := v1alpha1.CommittedResource{Spec: cr.Spec} then overwrite the
placement-related fields (AZ/availability zone, project, domain, flavor/group
and the resource amount) from the corresponding status fields (e.g. use
cr.Status.AcceptedAmount.Value() and the status fields that record accepted
AZ/project/domain/flavor) before calling state, err =
FromCommittedResource(tempCR); keep the existing state.TotalMemoryBytes =
cr.Status.AcceptedAmount.Value() assignment after successful
FromCommittedResource. This ensures FromCommittedResource uses the last-accepted
placement rather than the current mutated CR.
```go
commitmentsByAZFG, err := calc.buildCommitmentCapacityMap(ctx, logger, cr.Spec.ProjectID)
if err != nil {
	r.Monitor.reconcileDuration.WithLabelValues("error").Observe(time.Since(start).Seconds())
	return ctrl.Result{}, err
}
if len(commitmentsByAZFG) == 0 {
	return ctrl.Result{RequeueAfter: cooldown}, nil
}
```
cores commitments never enter the status write path.
The reconcile set is derived from Reservation objects, and resourceType=cores explicitly does not create Reservation slots. As a result, CPU commitments never reach writeUsageStatus(), so their status.usedResources stays empty even though this PR moves quota/API usage into CR status.
Also applies to: 108-121
Test Coverage Report: 📊 68.8%
Actionable comments posted: 4
🧹 Nitpick comments (2)
internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go (1)
276-306: ⚡ Quick win — Cover the `Sync()` dispatch for image datasources too.
This test proves `SyncAllImages()` works, but the controller path goes through `NovaSyncer.Sync()`. Switching this assertion to `syncer.Sync(ctx)` would also protect the new `NovaDatasourceTypeImages` branch while keeping the DB checks below unchanged.

Minimal change

```diff
-	n, err := syncer.SyncAllImages(ctx)
+	n, err := syncer.Sync(ctx)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go` around lines 276 - 306, Replace the direct call to SyncAllImages with the controller dispatch path by invoking NovaSyncer.Sync(ctx) in TestNovaSyncer_SyncImages so the NovaDatasourceTypeImages branch is exercised; keep the same assertions (expect no error, expect count n == 1) and the subsequent DB checks for Image table unchanged, referencing the test function TestNovaSyncer_SyncImages and the NovaSyncer methods Sync and SyncAllImages and the NovaDatasource{Type: v1alpha1.NovaDatasourceTypeImages} configuration so the test still validates the DB contents for img-1 and OSType windows8Server64Guest.

internal/knowledge/datasources/plugins/openstack/nova/nova_api.go (1)
494-523: ⚡ Quick win — Add focused tests for `deriveOSType`.
This helper now defines the contract for `vmware_ostype`, `ostype:` tags, ambiguous-tag fallback, and `"unknown"`, but the provided tests only mock a precomputed `OSType`. A small table-driven unit test here would lock down the behavior before CR usage depends on it.

Example cases worth pinning down

```go
func TestDeriveOSType(t *testing.T) {
	tests := []struct {
		name       string
		properties map[string]any
		tags       []string
		want       string
	}{
		{
			name:       "vmware_ostype wins",
			properties: map[string]any{"vmware_ostype": "windows8Server64Guest"},
			want:       "windows8Server64Guest",
		},
		{
			name: "single ostype tag",
			tags: []string{"ostype:rhel7_64Guest"},
			want: "rhel7_64Guest",
		},
		{
			name: "multiple ostype tags are ambiguous",
			tags: []string{"ostype:rhel7_64Guest", "ostype:windows8Server64Guest"},
			want: "unknown",
		},
		{
			name: "fallback to unknown",
			want: "unknown",
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := deriveOSType(tt.properties, tt.tags); got != tt.want {
				t.Fatalf("deriveOSType() = %q, want %q", got, tt.want)
			}
		})
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api.go` around lines 494 - 523, Add focused table-driven unit tests for deriveOSType to lock its contract: create TestDeriveOSType that exercises (1) properties with a non-empty string "vmware_ostype" wins, (2) single "ostype:..." tag returns that value, (3) multiple "ostype:..." tags produce "unknown" (ambiguous), (4) no info returns "unknown", and also cover a non-string or empty "vmware_ostype" value falling back to tags; implement using a tests slice with t.Run and compare deriveOSType(tt.properties, tt.tags) to the expected want.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4d1aa827-b4c6-4cb0-a435-bb5de63e0a40
📒 Files selected for processing (6)
- cmd/manager/main.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_api.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/manager/main.go
```go
if !allReady {
	// Reservation controller hasn't processed all slots yet; Reservation watch will re-enqueue.
	return ctrl.Result{}, r.setNotReady(ctx, cr, v1alpha1.CommittedResourceReasonReserving, "waiting for reservation placement")
}
```
Reset the failure counter once apply succeeds and you’re only waiting on child readiness.
After a few AllowRejection=false failures, this path keeps the old ConsecutiveFailures value until final acceptance. If ApplyCommitmentState has already succeeded and the reconcile just returns "waiting for reservation placement", the Reservation watch at Lines 477-481 can still suppress the ready-event requeue for the current generation, while this branch returns no RequeueAfter. That leaves the CR stuck in Reserving even though the child reservations are progressing normally.
Reset the counter before switching into the passive waiting state, or make the slowdown gate depend on an explicit “last reconcile ended in placement failure” marker instead of the raw counter.
Also applies to: 190-193, 271-279, 474-482
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@internal/scheduling/reservations/commitments/committed_resource_controller.go`
around lines 134 - 137, The code leaves the
CommittedResource.Status.ConsecutiveFailures count unchanged when
ApplyCommitmentState succeeded but the controller returns early to wait for
child readiness (e.g., the branch that returns ctrl.Result{} with
setNotReady(reason Reserving, "waiting for reservation placement")); update the
reconcile flow so that after a successful ApplyCommitmentState (or any code path
where placement succeeded and we are only passively waiting for child resources)
you reset or clear Status.ConsecutiveFailures (or set a dedicated
lastPlacementFailed boolean) before calling setNotReady or returning without
RequeueAfter; specifically adjust the logic around ApplyCommitmentState, the
setNotReady calls in the Reserving branches and similar branches noted (also the
other branches at the same pattern) to ensure the slowdown gate no longer
observes stale consecutive failure counts.
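A compact sketch of counter semantics that would satisfy both this comment and the earlier generation-gating finding. The `FailureGeneration` field, the helper names, and the threshold constant are assumptions drawn from the prompts, not the repo's actual identifiers:

```go
package sketch

// Assumed status fields; the real ones live in api/v1alpha1.
type commitmentStatus struct {
	ConsecutiveFailures int32
	FailureGeneration   int64
}

const maxConsecutiveFailuresForSlowdown = 3 // threshold assumed for illustration

// recordPlacementFailure stamps failures with the generation they belong to,
// so a new spec generation starts with a fresh count.
func recordPlacementFailure(s *commitmentStatus, generation int64) {
	if s.FailureGeneration != generation {
		s.ConsecutiveFailures = 0
		s.FailureGeneration = generation
	}
	s.ConsecutiveFailures++
}

// clearPlacementFailures runs once placement succeeds and the controller is
// only waiting passively on child readiness.
func clearPlacementFailures(s *commitmentStatus) {
	s.ConsecutiveFailures = 0
	s.FailureGeneration = 0
}

// shouldSlowDown fires only for the generation that actually failed.
func shouldSlowDown(s commitmentStatus, generation int64) bool {
	return s.FailureGeneration == generation &&
		s.ConsecutiveFailures >= maxConsecutiveFailuresForSlowdown
}
```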
```diff
 // rollbackToAccepted restores child Reservations to match Status.AcceptedSpec.
 // AcceptedSpec is a full snapshot of the spec at the last successful reconcile, so rollback always
 // targets the correct AZ, amount, project, domain — even when the current spec has been mutated.
 // If neither is set (CR was never accepted), all child Reservations are deleted.
 func (r *CommittedResourceController) rollbackToAccepted(ctx context.Context, logger logr.Logger, cr *v1alpha1.CommittedResource) error {
-	if cr.Status.AcceptedAmount == nil {
+	if cr.Status.AcceptedSpec == nil && cr.Status.AcceptedAmount == nil {
 		return r.deleteChildReservations(ctx, cr)
 	}
 	knowledge := &reservations.FlavorGroupKnowledgeClient{Client: r.Client}
 	flavorGroups, err := knowledge.GetAllFlavorGroups(ctx, nil)
 	if err != nil {
 		// Can't compute the rollback target — fall back to full delete rather than leaving
-		// a partial state that's inconsistent with the unknown AcceptedAmount.
+		// a partial state that's inconsistent with the unknown accepted state.
 		logger.Error(err, "flavor knowledge unavailable during rollback, deleting all child reservations")
 		return r.deleteChildReservations(ctx, cr)
 	}
-	state, err := FromCommittedResource(*cr)
+	var state *CommitmentState
+	if cr.Status.AcceptedSpec != nil {
+		// Use the full accepted spec snapshot: ensures rollback targets the exact previously-accepted
+		// placement (AZ, amount, project, domain) even if the current spec has been mutated.
+		tempCR := v1alpha1.CommittedResource{Spec: *cr.Status.AcceptedSpec}
+		state, err = FromCommittedResource(tempCR)
+	} else {
+		// AcceptedSpec not yet populated (CR accepted before this field existed).
+		// We cannot determine the previous placement, so delete all child reservations.
+		// The controller will recreate them on the next reconcile using the current spec.
+		logger.Info("AcceptedSpec missing during rollback, deleting child reservations for controller repair")
+		return r.deleteChildReservations(ctx, cr)
```
Deleting child reservations here drops migrated accepted CRs with no recovery path.
When AcceptedSpec is still nil, rollbackToAccepted now deletes all child reservations. For already-accepted CRs created before this status field existed, the first rejected update on the current generation empties their active reservations, and isRejectedForGeneration() then prevents the “next reconcile” from re-applying the spec. So this fallback is not just conservative rollback; it can permanently deprovision an accepted commitment until the spec changes again.
Please preserve the last accepted state in this branch instead of deleting it outright, or derive the rollback target from existing accepted state before entering the rejected-generation short-circuit.
Also applies to: 380-385
```go
for _, group := range commitmentsByAZFG {
	for _, state := range group {
		if err := r.writeUsageStatus(ctx, state, now); err != nil {
			logger.Error(err, "failed to write usage status", "commitmentUUID", state.CommitmentUUID)
			writeErr = err
		} else {
			written++
			totalAssigned += len(state.AssignedInstances)
			// Observe status age: how long ago was it last reconciled before this run.
			if cr.Status.LastUsageReconcileAt != nil {
				r.Monitor.statusAge.Observe(now.Time.Sub(cr.Status.LastUsageReconcileAt.Time).Seconds())
			}
		}
	}
}
```
statusAge histogram is over-sampled per reconcile.
The Observe call sits inside the inner loop and reads cr.Status.LastUsageReconcileAt from the outer reconciled CR. For a project with N commitments, every reconcile records N identical samples (same delta, same outer timestamp), skewing the histogram toward whichever project has the most commitments. Move the observation outside the loop (one sample per reconcile), or compute the delta from each target.Status.LastUsageReconcileAt if you actually want per-CR status age.
📊 Proposed fix: observe once per reconcile
```diff
+	var prevAge *float64
+	if cr.Status.LastUsageReconcileAt != nil {
+		d := now.Time.Sub(cr.Status.LastUsageReconcileAt.Time).Seconds()
+		prevAge = &d
+	}
 	for _, group := range commitmentsByAZFG {
 		for _, state := range group {
 			if err := r.writeUsageStatus(ctx, state, now); err != nil {
 				logger.Error(err, "failed to write usage status", "commitmentUUID", state.CommitmentUUID)
 				writeErr = err
 			} else {
 				written++
 				totalAssigned += len(state.AssignedInstances)
-				// Observe status age: how long ago was it last reconciled before this run.
-				if cr.Status.LastUsageReconcileAt != nil {
-					r.Monitor.statusAge.Observe(now.Time.Sub(cr.Status.LastUsageReconcileAt.Time).Seconds())
-				}
 			}
 		}
 	}
+	if writeErr == nil && prevAge != nil {
+		r.Monitor.statusAge.Observe(*prevAge)
+	}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
var prevAge *float64
if cr.Status.LastUsageReconcileAt != nil {
	d := now.Time.Sub(cr.Status.LastUsageReconcileAt.Time).Seconds()
	prevAge = &d
}
for _, group := range commitmentsByAZFG {
	for _, state := range group {
		if err := r.writeUsageStatus(ctx, state, now); err != nil {
			logger.Error(err, "failed to write usage status", "commitmentUUID", state.CommitmentUUID)
			writeErr = err
		} else {
			written++
			totalAssigned += len(state.AssignedInstances)
		}
	}
}
if writeErr == nil && prevAge != nil {
	r.Monitor.statusAge.Observe(*prevAge)
}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/scheduling/reservations/commitments/usage_reconciler.go` around
lines 117 - 131, The statusAge histogram is being observed inside the inner loop
(for _, state := range group) using cr.Status.LastUsageReconcileAt, causing
identical samples to be recorded once per commitment; move the single
observation out of the inner loops so you call r.Monitor.statusAge.Observe(...)
exactly once per reconcile (using
now.Time.Sub(cr.Status.LastUsageReconcileAt.Time).Seconds()) after the loops
complete (or alternatively change to use each
state.Target.Status.LastUsageReconcileAt if you intend per-commitment samples);
update the block around writeUsageStatus/state loop (and variables
written/totalAssigned) so observation happens once per reconcile rather than
inside writeUsageStatus loop.
```go
	}

	r.Monitor.reconcileDuration.WithLabelValues("success").Observe(time.Since(start).Seconds())
	r.Monitor.assignedInstances.Set(float64(totalAssigned))
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Inspect the monitor definition to confirm whether assignedInstances has a project label.
fd -t f usage_reconciler_monitor.go --exec cat
rg -nP --type=go -C3 'assignedInstances\b'
```

Repository: cobaltcore-dev/cortex
Length of output: 7129
assignedInstances.Set() overwrites per-project values across reconciles.
The gauge has no label dimensions and is set to a per-project total at each reconcile run. With multiple projects reconciling sequentially, the metric ends up reflecting only the last reconciled project's value rather than a cluster-wide aggregate as the help text suggests ("across all active commitments").
Add a project_id label to track per-project assignments, or redesign to aggregate across projects (e.g., using a counter with separate reset logic). Confirm the intended semantics with UsageReconcilerMonitor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/scheduling/reservations/commitments/usage_reconciler.go` at line
137, The metric usage_reconciler.assignedInstances currently calls
assignedInstances.Set(float64(totalAssigned)) and overwrites values across
projects; update UsageReconcilerMonitor so assignedInstances is either created
with a project_id label and Set is called with that label for each project's
totalAssigned, or change the metric to an aggregate (e.g., maintain a
cluster-wide sum in the reconciler and Set once) depending on intended
semantics—add project_id as a label on the GaugeVec and call
assignedInstances.WithLabelValues(projectID).Set(...) (or implement cluster
aggregation and remove per-project sets); confirm which approach matches
UsageReconcilerMonitor semantics and apply consistently across reconciles.
Moves CR usage computation out of the API handler and into a dedicated reconciler that persists results in CRD status, making usage data available to both the LIQUID API and quota controller.
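As a rough picture of the status surface this creates — field names are taken from the walkthrough, while the authoritative Go types live in `api/v1alpha1/committed_resource_types.go` and may differ in shape:

```go
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// CommittedResourceSpec is elided here; see api/v1alpha1 for the real type.
type CommittedResourceSpec struct{}

// CommittedResourceStatus, illustrative only.
type CommittedResourceStatus struct {
	// Snapshot of the last successfully accepted spec, used for rollback.
	AcceptedSpec *CommittedResourceSpec `json:"acceptedSpec,omitempty"`
	// Failed placement attempts for the current generation; drives backoff.
	ConsecutiveFailures int32 `json:"consecutiveFailures,omitempty"`
	// Per-resource usage (e.g. memory, cpu) written by the usage reconciler,
	// consumed by the LIQUID API and the quota controller.
	UsedResources map[string]resource.Quantity `json:"usedResources,omitempty"`
	// Instances currently counted against this commitment (shape assumed).
	AssignedInstances []string `json:"assignedInstances,omitempty"`
}
```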