
[DNM] Service abort upstream followup#468

Open
jacob-anders wants to merge 2 commits into openshift:main from jacob-anders:service-abort-upstream-followup

Conversation


@jacob-anders jacob-anders commented Mar 13, 2026

PR meant for cluster-bot test
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 13, 2026

coderabbitai bot commented Mar 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds logic to abort ongoing firmware servicing when firmware specs are cleared, updates servicing state handling in the ironic provisioner, adds watches for firmware resources in the BareMetalHost reconciler, and simplifies a controller helper that sets status conditions.

Changes

Cohort / File(s) Summary
BareMetalHost controller
internal/controller/metal3.io/baremetalhost_controller.go
Fast-path in doServiceIfNeeded to abort servicing when firmware specs are removed; handles abort results (Dirty, errors, recovery) and clears host error on recovery. Added watches for HostFirmwareSettings and HostFirmwareComponents (GenerationChangedPredicate) and map function to enqueue related BareMetalHost.
Ironic provisioner servicing logic
pkg/provisioner/ironic/servicing.go
Added abortServicing helper invoked when firmware spec is absent to move node to TargetAbort. Reworked Service handling: separated Servicing vs ServiceWait cases, added pre-abort checks when spec cleared, adjusted logging and requeue behavior.
Provisioner API docs (comment only)
pkg/provisioner/provisioner.go
Clarified comment for ServicingData.HasFirmwareSpec to distinguish “no updates needed” from “user cleared spec”; no API or behavior changes.
HostFirmwareSettings controller helper
internal/controller/metal3.io/hostfirmwaresettings_controller.go
Removed info parameter from setCondition call sites and simplified setCondition implementation to always call meta.SetStatusCondition(&status.Conditions, newCondition) without reading current object state.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings, 1 inconclusive)

  • Stable And Deterministic Test Names (❌ Error): Test case uses a dynamically constructed name via fmt.Sprintf with the variables baseline and baselineVersionString, violating the stable test naming requirement. Resolution: replace the dynamic test name with a static string like "MaxVersion < 89 return microversion 1.89", independent of variable values.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 7.14%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Structure And Quality (⚠️ Warning): The new test functions TestGetHealth and TestComputeHealthyCondition are missing meaningful failure messages on assertions, violating codebase quality and consistency standards. Resolution: add meaningful failure messages to all assertions in TestGetHealth and TestComputeHealthyCondition, following the pattern established in the adjacent TestComputeConditions function.
  • Title check (❓ Inconclusive): The title '[DNM] Service abort upstream followup' is vague. While '[DNM]' indicates 'Do Not Merge', 'upstream followup' conveys little about the core functionality being implemented. Resolution: consider a more descriptive title, such as 'Add Ironic health API support with microversion 1.109 negotiation' or 'Implement GetHealth provisioner method for BMC health reporting'.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.

@openshift-ci openshift-ci bot requested review from honza and hroyrh March 13, 2026 01:10

openshift-ci bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jacob-anders
Once this PR has been reviewed and has the lgtm label, please assign honza for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/provisioner/ironic/servicing_test.go (1)

89-99: Model “spec removed” explicitly in abort tests.

These two abort cases currently keep unprepared false, so they validate abort on mere absence of spec rather than a removal transition. Consider setting unprepared: true for these abort scenarios (and optionally adding a “no change + no spec” ServiceWait case that continues).

🧪 Suggested test adjustment
 		{
 			name: "serviceFail state(abort - no firmware spec)",
@@
 			skipConfig:           true,
+			unprepared:           true,
 			expectedStarted:      true,
 			expectedRequestAfter: 10,
 			expectedDirty:        true,
 		},
@@
 		{
 			name: "serviceWait state(abort - no firmware spec)",
@@
 			skipConfig:           true,
+			unprepared:           true,
 			expectedStarted:      true,
 			expectedRequestAfter: 10,
 			expectedDirty:        true,
 		},

Also applies to: 145-154


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1eafebc9-d220-490d-895d-4ca38c0dbc6a

📥 Commits

Reviewing files that changed from the base of the PR and between b659d2f and 183367a.

📒 Files selected for processing (4)
  • internal/controller/metal3.io/baremetalhost_controller.go
  • pkg/provisioner/ironic/servicing.go
  • pkg/provisioner/ironic/servicing_test.go
  • pkg/provisioner/provisioner.go

Comment on lines +112 to +114
if !data.HasFirmwareSpec {
return p.abortServicing(ctx, ironicNode)
}

⚠️ Potential issue | 🟠 Major

Abort condition is too broad in ServiceFail / ServiceWait.

On Line 112 and Line 144, abort is triggered solely by !data.HasFirmwareSpec. HasFirmwareSpec can be false while an already-started servicing flow is still progressing (see internal/controller/metal3.io/baremetalhost_controller.go, Line 1538-1540 where status is synchronized early). That can cause unintended aborts without a user removing spec.

Use an explicit “spec removed this reconcile” signal; unprepared is already available and maps to detected changes.

🔧 Proposed fix
-		if !data.HasFirmwareSpec {
+		if unprepared && !data.HasFirmwareSpec {
 			return p.abortServicing(ctx, ironicNode)
 		}
@@
-		if !data.HasFirmwareSpec {
+		if unprepared && !data.HasFirmwareSpec {
 			return p.abortServicing(ctx, ironicNode)
 		}

Also applies to: 144-146



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/provisioner/ironic/power_test.go (1)

521-524: Add a feature-gate negative test for GetHealth().

All cases set MaxVersion: 109, so the HasHealthAPI()==false path is still untested. Add one case with MaxVersion < 109 and assert empty health.

pkg/provisioner/ironic/clients/features_test.go (1)

32-51: Consider restoring explicit coverage for the 1.93 branch.

There’s no case for MaxVersion in [93,94], so HasVirtualMediaGetAPI() path to "1.93" is currently unverified.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/provisioner/ironic/power_test.go`:
- Around line 477-496: The test references provisioner constants
(provisioner.HealthOK, provisioner.HealthWarning, provisioner.HealthCritical)
but the provisioner package is not imported; add an import for the provisioner
package in the test file so those identifiers resolve, then run `go test` to
ensure the file compiles; locate usages in power_test.go around the table cases
that construct testserver.NewIronic nodes and update the imports block to
include the package that defines provisioner.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7b30a3cf-a9e7-4c92-83fb-4b2be1b5e47b

📥 Commits

Reviewing files that changed from the base of the PR and between 183367a and 49981e4.

📒 Files selected for processing (4)
  • pkg/provisioner/ironic/clients/features.go
  • pkg/provisioner/ironic/clients/features_test.go
  • pkg/provisioner/ironic/ironic.go
  • pkg/provisioner/ironic/power_test.go

@jacob-anders jacob-anders force-pushed the service-abort-upstream-followup branch from 49981e4 to ae3c006 on March 13, 2026 12:17
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2026
@jacob-anders jacob-anders force-pushed the service-abort-upstream-followup branch from ae3c006 to 35954c5 on March 17, 2026 07:02
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2026
@jacob-anders jacob-anders force-pushed the service-abort-upstream-followup branch from 4d83b46 to 7fdfbaf on March 20, 2026 06:23
@jacob-anders jacob-anders force-pushed the service-abort-upstream-followup branch from 37ff226 to a46282a on March 31, 2026 06:57

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/metal3.io/hostfirmwaresettings_controller.go (1)

248-275: ⚠️ Potential issue | 🟠 Major

Unnecessary status writes on every reconcile due to uninitialized conditions slice.

updateStatus() creates newStatus with a zero-initialized Conditions field (empty slice). When setCondition() calls meta.SetStatusCondition(&status.Conditions, newCondition) on this empty slice, it always appends and returns true, causing dirty to be set true even when the final conditions match the persisted ones. This triggers status updates and refreshes LastUpdated on every reconcile cycle, regardless of actual changes.

Initialize newStatus.Conditions from the persisted conditions before the first setCondition() call:

Suggested fix
 func (r *HostFirmwareSettingsReconciler) updateStatus(ctx context.Context, info *rInfo, settings metal3api.SettingsMap, schema *metal3api.FirmwareSchema) (err error) {
 	dirty := false
 	var newStatus metal3api.HostFirmwareSettingsStatus
 	newStatus.Settings = make(metal3api.SettingsMap)
+	newStatus.Conditions = append([]metav1.Condition(nil), info.hfs.Status.Conditions...)

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 42a60767-374c-446e-8c51-9a4ecc08d12e

📥 Commits

Reviewing files that changed from the base of the PR and between ae3c006 and a46282a.

📒 Files selected for processing (1)
  • internal/controller/metal3.io/hostfirmwaresettings_controller.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bc5ad162-d893-41a9-8f87-07cd8dcf8643

📥 Commits

Reviewing files that changed from the base of the PR and between a46282a and 1ea4e90.

📒 Files selected for processing (2)
  • internal/controller/metal3.io/baremetalhost_controller.go
  • pkg/provisioner/provisioner.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/provisioner/provisioner.go

Comment on lines +1491 to +1492
clearErrorWithStatus(info.host, metal3api.OperationalStatusOK)
return actionComplete{}

⚠️ Potential issue | 🟠 Major

Status changes may not be persisted.

clearErrorWithStatus modifies the host status but its return value is ignored. The bare actionComplete{} likely doesn't signal dirty, so the status changes (clearing error, setting OperationalStatusOK) won't be saved. Other code paths (e.g., line 1601) wrap in actionUpdate{} when the host is modified.

🐛 Proposed fix
-		info.log.Info("successfully recovered from servicing error")
-		clearErrorWithStatus(info.host, metal3api.OperationalStatusOK)
-		return actionComplete{}
+		info.log.Info("successfully recovered from servicing error")
+		clearErrorWithStatus(info.host, metal3api.OperationalStatusOK)
+		return actionUpdate{actionComplete{}}

@jacob-anders jacob-anders force-pushed the service-abort-upstream-followup branch from 1ea4e90 to 2b6083a on April 1, 2026 06:29
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2026
The setCondition function in the HostFirmwareSettings controller was
checking whether the condition changed by comparing against the old
info.hfs.Status.Conditions rather than the newly updated status. When
only ObservedGeneration changed but the condition status remained the
same, the function returned false (no change), causing the status
update to be skipped. This left ObservedGeneration permanently stale,
blocking the BMH controller's generation-based consistency checks.

Fix by using the return value of meta.SetStatusCondition, which
correctly detects ObservedGeneration changes. Also remove the unused
info parameter from setCondition since it is no longer needed.

Assisted-By: Claude Opus 4.6
(cherry picked from commit 218158a)
When a user clears HFC/HFS specs to abort a failed servicing operation,
the BMH controller calls getHostFirmwareSettings/Components which check
that metadata.generation matches status.conditions.observedGeneration.
Since the HFS/HFC sub-controllers may not have reconciled yet after the
spec change, this generation mismatch returns an error that blocks the
abort path, leaving the host stuck in ServicingError with exponential
backoff.

Fix with two complementary changes:

1. Add a fast-path in doServiceIfNeeded: when in ServicingError state,
   check specs directly via lightweight r.Get() calls (which don't
   validate generation) before the generation-sensitive functions. If
   specs are cleared, abort immediately via prov.Service() and clear
   the error/operational status on success.

2. Watch HostFirmwareSettings and HostFirmwareComponents for generation
   changes (spec modifications), mapping events to the corresponding
   BMH. This ensures the BMH controller reconciles promptly when specs
   change, rather than waiting for error-state exponential backoff.

Assisted-By: Claude Opus 4.6
(cherry picked from commit b235782)
@jacob-anders jacob-anders force-pushed the service-abort-upstream-followup branch from 2b6083a to c383955 on April 1, 2026 06:49
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2026

openshift-ci bot commented Apr 1, 2026

@jacob-anders: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/vendor — commit 2b6083a (required); rerun with /test vendor

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

