[DNM] Service abort upstream followup #468
jacob-anders wants to merge 2 commits into openshift:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Adds logic to abort ongoing firmware servicing when firmware specs are cleared, updates servicing state handling in the ironic provisioner, adds watches for firmware resources in the BareMetalHost reconciler, and simplifies a controller helper that sets status conditions.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 2 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jacob-anders. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/provisioner/ironic/servicing_test.go (1)
89-99: Model "spec removed" explicitly in abort tests. These two abort cases currently keep `unprepared` false, so they validate abort on mere absence of a spec rather than a removal transition. Consider setting `unprepared: true` for these abort scenarios (and optionally adding a "no change + no spec" ServiceWait case that continues).
🧪 Suggested test adjustment
```diff
 {
     name: "serviceFail state(abort - no firmware spec)",
@@
     skipConfig:           true,
+    unprepared:           true,
     expectedStarted:      true,
     expectedRequestAfter: 10,
     expectedDirty:        true,
 },
@@
 {
     name: "serviceWait state(abort - no firmware spec)",
@@
     skipConfig:           true,
+    unprepared:           true,
     expectedStarted:      true,
     expectedRequestAfter: 10,
     expectedDirty:        true,
 },
```
Also applies to: 145-154
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/provisioner/ironic/servicing_test.go` around lines 89 - 99, The abort test cases like the one named "serviceFail state(abort - no firmware spec)" currently validate absence of a spec but do not model a spec removal transition; update those test case structs in servicing_test.go to set unprepared: true so the test models an explicit "spec removed" abort transition (apply the same change to the other abort case around the 145-154 block), and optionally add a ServiceWait case that covers "no change + no spec" if you want to assert continuation behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/provisioner/ironic/servicing.go`:
- Around line 112-114: The abort condition currently triggers on
!data.HasFirmwareSpec in ServiceFail and ServiceWait which is too broad; change
both checks to use the explicit "spec removed this reconcile" signal
(data.Unprepared) instead of HasFirmwareSpec so an in-progress servicing flow
isn't aborted when HasFirmwareSpec is false but the flow is already running;
update the conditional(s) that call abortServicing(ctx, ironicNode) in the
ServiceFail and ServiceWait functions to check data.Unprepared (and only abort
when true), leaving abortServicing and surrounding logic unchanged.
---
Nitpick comments:
In `@pkg/provisioner/ironic/servicing_test.go`:
- Around line 89-99: The abort test cases like the one named "serviceFail
state(abort - no firmware spec)" currently validate absence of a spec but do not
model a spec removal transition; update those test case structs in
servicing_test.go to set unprepared: true so the test models an explicit "spec
removed" abort transition (apply the same change to the other abort case around
the 145-154 block), and optionally add a ServiceWait case that covers "no change
+ no spec" if you want to assert continuation behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1eafebc9-d220-490d-895d-4ca38c0dbc6a
📒 Files selected for processing (4)
- internal/controller/metal3.io/baremetalhost_controller.go
- pkg/provisioner/ironic/servicing.go
- pkg/provisioner/ironic/servicing_test.go
- pkg/provisioner/provisioner.go
```go
if !data.HasFirmwareSpec {
	return p.abortServicing(ctx, ironicNode)
}
```
Abort condition is too broad in ServiceFail / ServiceWait.
On Line 112 and Line 144, abort is triggered solely by !data.HasFirmwareSpec. HasFirmwareSpec can be false while an already-started servicing flow is still progressing (see internal/controller/metal3.io/baremetalhost_controller.go, Line 1538-1540 where status is synchronized early). That can cause unintended aborts without the user having removed the spec.
Use an explicit “spec removed this reconcile” signal; unprepared is already available and maps to detected changes.
🔧 Proposed fix
```diff
-	if !data.HasFirmwareSpec {
+	if unprepared && !data.HasFirmwareSpec {
 		return p.abortServicing(ctx, ironicNode)
 	}
@@
-	if !data.HasFirmwareSpec {
+	if unprepared && !data.HasFirmwareSpec {
 		return p.abortServicing(ctx, ironicNode)
 	}
```
Also applies to: 144-146
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/provisioner/ironic/servicing.go` around lines 112 - 114, The abort
condition currently triggers on !data.HasFirmwareSpec in ServiceFail and
ServiceWait which is too broad; change both checks to use the explicit "spec
removed this reconcile" signal (data.Unprepared) instead of HasFirmwareSpec so
an in-progress servicing flow isn't aborted when HasFirmwareSpec is false but
the flow is already running; update the conditional(s) that call
abortServicing(ctx, ironicNode) in the ServiceFail and ServiceWait functions to
check data.Unprepared (and only abort when true), leaving abortServicing and
surrounding logic unchanged.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/provisioner/ironic/power_test.go (1)
521-524: Add a feature-gate negative test for `GetHealth()`. All cases set `MaxVersion: 109`, so the `HasHealthAPI() == false` path is still untested. Add one case with `MaxVersion < 109` and assert empty health.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/provisioner/ironic/power_test.go` around lines 521 - 524, add a negative test case exercising the HasHealthAPI()==false path by setting prov.availableFeatures = clients.AvailableFeatures{MaxVersion: 108} (or any value < 109) before calling prov.GetHealth(t.Context()), and assert that the returned health is empty (e.g., equals the zero-value/empty struct or nil as your test framework expects). Place this case alongside the existing cases in the table/loop so it runs with the same setup that calls prov.GetHealth, referencing the same prov variable and GetHealth(t.Context()) call used in the other tests.
pkg/provisioner/ironic/clients/features_test.go (1)
32-51: Consider restoring explicit coverage for the `1.93` branch. There's no case for `MaxVersion` in [93, 94], so the `HasVirtualMediaGetAPI()` path to "1.93" is currently unverified.
Verify each finding against the current code and only fix it if needed. In `@pkg/provisioner/ironic/clients/features_test.go` around lines 32 - 51, Add an explicit table-driven test case covering the 1.93 branch by inserting one or two entries in the test table for HasVirtualMediaGetAPI() using the fields struct with MaxVersion set to 93 (and optionally 94) and expecting the string "1.93"; this ensures the test in features_test.go exercises the code path that returns "1.93" for MaxVersion in [93,94].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/provisioner/ironic/power_test.go`:
- Around line 477-496: The test references provisioner constants
(provisioner.HealthOK, provisioner.HealthWarning, provisioner.HealthCritical)
but the provisioner package is not imported; add an import for the provisioner
package in the test file so those identifiers resolve, then run `go test` to
ensure the file compiles; locate usages in power_test.go around the table cases
that construct testserver.NewIronic nodes and update the imports block to
include the package that defines provisioner.
---
Nitpick comments:
In `@pkg/provisioner/ironic/clients/features_test.go`:
- Around line 32-51: Add an explicit table-driven test case covering the 1.93
branch by inserting one or two entries in the test table for
HasVirtualMediaGetAPI() using the fields struct with MaxVersion set to 93 (and
optionally 94) and expecting the string "1.93"; this ensures the test in
features_test.go exercises the code path that returns "1.93" for MaxVersion in
[93,94].
In `@pkg/provisioner/ironic/power_test.go`:
- Around line 521-524: Add a negative test case exercising the
HasHealthAPI()==false path by setting prov.availableFeatures =
clients.AvailableFeatures{MaxVersion: 108} (or any value < 109) before calling
prov.GetHealth(t.Context()), and assert that the returned health is empty (e.g.,
equals the zero-value/empty struct or nil as your test framework expects). Place
this case alongside the existing cases in the table/loop so it runs with the
same setup that calls prov.GetHealth, referencing the same prov variable and
GetHealth(t.Context()) call used in the other tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7b30a3cf-a9e7-4c92-83fb-4b2be1b5e47b
📒 Files selected for processing (4)
- pkg/provisioner/ironic/clients/features.go
- pkg/provisioner/ironic/clients/features_test.go
- pkg/provisioner/ironic/ironic.go
- pkg/provisioner/ironic/power_test.go
Force-pushed 49981e4 to ae3c006 (Compare)
Force-pushed ae3c006 to 35954c5 (Compare)
Force-pushed 4d83b46 to 7fdfbaf (Compare)
Force-pushed 37ff226 to a46282a (Compare)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/metal3.io/hostfirmwaresettings_controller.go (1)
248-275: ⚠️ Potential issue | 🟠 Major: Unnecessary status writes on every reconcile due to uninitialized conditions slice.
`updateStatus()` creates `newStatus` with a zero-initialized `Conditions` field (empty slice). When `setCondition()` calls `meta.SetStatusCondition(&status.Conditions, newCondition)` on this empty slice, it always appends and returns `true`, causing `dirty` to be set true even when the final conditions match the persisted ones. This triggers status updates and refreshes `LastUpdated` on every reconcile cycle, regardless of actual changes.
Initialize `newStatus.Conditions` from the persisted conditions before the first `setCondition()` call:
Suggested fix
```diff
 func (r *HostFirmwareSettingsReconciler) updateStatus(ctx context.Context, info *rInfo, settings metal3api.SettingsMap, schema *metal3api.FirmwareSchema) (err error) {
 	dirty := false
 	var newStatus metal3api.HostFirmwareSettingsStatus
 	newStatus.Settings = make(metal3api.SettingsMap)
+	newStatus.Conditions = append([]metav1.Condition(nil), info.hfs.Status.Conditions...)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/metal3.io/hostfirmwaresettings_controller.go` around lines 248 - 275, The status reconciliation is creating newStatus with an empty Conditions slice so the first call to setCondition (which uses meta.SetStatusCondition) always appends and marks the status dirty; to fix, initialize newStatus.Conditions from the persisted status before any setCondition calls (e.g., copy info.hostFirmwareSettings.Status.Conditions into newStatus.Conditions at the start of updateStatus or right before the first setCondition), ensuring subsequent meta.SetStatusCondition comparisons operate on the existing slice and only set dirty when conditions actually change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@internal/controller/metal3.io/hostfirmwaresettings_controller.go`:
- Around line 248-275: The status reconciliation is creating newStatus with an
empty Conditions slice so the first call to setCondition (which uses
meta.SetStatusCondition) always appends and marks the status dirty; to fix,
initialize newStatus.Conditions from the persisted status before any
setCondition calls (e.g., copy info.hostFirmwareSettings.Status.Conditions into
newStatus.Conditions at the start of updateStatus or right before the first
setCondition), ensuring subsequent meta.SetStatusCondition comparisons operate
on the existing slice and only set dirty when conditions actually change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 42a60767-374c-446e-8c51-9a4ecc08d12e
📒 Files selected for processing (1)
internal/controller/metal3.io/hostfirmwaresettings_controller.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/metal3.io/baremetalhost_controller.go`:
- Around line 1491-1492: clearErrorWithStatus(info.host,
metal3api.OperationalStatusOK) mutates host status but its return value is
ignored, so the controller may not persist the change; replace the bare
actionComplete{} return with returning the update action that marks the host as
modified (use the same pattern as other paths that use actionUpdate{}, e.g., the
path at line ~1601) by capturing the modified host result from
clearErrorWithStatus (or otherwise marking info.host dirty) and returning
actionUpdate{} so the status change is persisted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bc5ad162-d893-41a9-8f87-07cd8dcf8643
📒 Files selected for processing (2)
- internal/controller/metal3.io/baremetalhost_controller.go
- pkg/provisioner/provisioner.go
✅ Files skipped from review due to trivial changes (1)
- pkg/provisioner/provisioner.go
```go
clearErrorWithStatus(info.host, metal3api.OperationalStatusOK)
return actionComplete{}
```
Status changes may not be persisted.
clearErrorWithStatus modifies the host status but its return value is ignored. The bare actionComplete{} likely doesn't signal dirty, so the status changes (clearing error, setting OperationalStatusOK) won't be saved. Other code paths (e.g., line 1601) wrap in actionUpdate{} when the host is modified.
🐛 Proposed fix
```diff
-	info.log.Info("successfully recovered from servicing error")
-	clearErrorWithStatus(info.host, metal3api.OperationalStatusOK)
-	return actionComplete{}
+	info.log.Info("successfully recovered from servicing error")
+	clearErrorWithStatus(info.host, metal3api.OperationalStatusOK)
+	return actionUpdate{actionComplete{}}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/controller/metal3.io/baremetalhost_controller.go` around lines 1491
- 1492, clearErrorWithStatus(info.host, metal3api.OperationalStatusOK) mutates
host status but its return value is ignored, so the controller may not persist
the change; replace the bare actionComplete{} return with returning the update
action that marks the host as modified (use the same pattern as other paths that
use actionUpdate{}, e.g., the path at line ~1601) by capturing the modified host
result from clearErrorWithStatus (or otherwise marking info.host dirty) and
returning actionUpdate{} so the status change is persisted.
Force-pushed 1ea4e90 to 2b6083a (Compare)
The setCondition function in the HostFirmwareSettings controller was checking whether the condition changed by comparing against the old info.hfs.Status.Conditions rather than the newly updated status. When only ObservedGeneration changed but the condition status remained the same, the function returned false (no change), causing the status update to be skipped. This left ObservedGeneration permanently stale, blocking the BMH controller's generation-based consistency checks.

Fix by using the return value of meta.SetStatusCondition, which correctly detects ObservedGeneration changes. Also remove the unused info parameter from setCondition since it is no longer needed.

Assisted-By: Claude Opus 4.6 (cherry picked from commit 218158a)
When a user clears HFC/HFS specs to abort a failed servicing operation, the BMH controller calls getHostFirmwareSettings/Components which check that metadata.generation matches status.conditions.observedGeneration. Since the HFS/HFC sub-controllers may not have reconciled yet after the spec change, this generation mismatch returns an error that blocks the abort path, leaving the host stuck in ServicingError with exponential backoff.

Fix with two complementary changes:

1. Add a fast-path in doServiceIfNeeded: when in ServicingError state, check specs directly via lightweight r.Get() calls (which don't validate generation) before the generation-sensitive functions. If specs are cleared, abort immediately via prov.Service() and clear the error/operational status on success.

2. Watch HostFirmwareSettings and HostFirmwareComponents for generation changes (spec modifications), mapping events to the corresponding BMH. This ensures the BMH controller reconciles promptly when specs change, rather than waiting for error-state exponential backoff.

Assisted-By: Claude Opus 4.6 (cherry picked from commit b235782)
Force-pushed 2b6083a to c383955 (Compare)
@jacob-anders: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
PR meant for cluster-bot test
/hold