Skip to content

[DNM] Retest original service-abort code to troubleshoot issues with the final version accepted upstream#472

Open
jacob-anders wants to merge 6 commits intoopenshift:mainfrom
jacob-anders:service-abort-three
Open

[DNM] Retest original service-abort code to troubleshoot issues with the final version accepted upstream#472
jacob-anders wants to merge 6 commits intoopenshift:mainfrom
jacob-anders:service-abort-three

Conversation

@jacob-anders
Copy link
Copy Markdown

/hold

jacob-anders and others added 2 commits March 20, 2026 20:06
Generated-By: Claude Code Sonnet 4
Signed-off-by: Jacob Anders <janders@redhat.com>
(cherry picked from commit c40d0ad)
(cherry picked from commit b628dd5)
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 20, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a3820a22-64c1-4b12-b760-6b73e24d7999

📥 Commits

Reviewing files that changed from the base of the PR and between 3e8c53d and 41620b3.

📒 Files selected for processing (1)
  • internal/controller/metal3.io/hostfirmwaresettings_controller.go

Walkthrough

Controller splits firmware-spec detection into separate settings/components flags, adds a ServicingError recovery fast-path, and watches subresource generation; provisioner builds service steps earlier, tightens abort logic to require both spec flags absent and clears Ironic maintenance on abort; tests updated to new semantics.

Changes

Cohort / File(s) Summary
Servicing data struct
pkg/provisioner/provisioner.go
Replaced single HasFirmwareSpec with HasFirmwareSettingsSpec and HasFirmwareComponentsSpec.
Controller: firmware spec detection & recovery
internal/controller/metal3.io/baremetalhost_controller.go
Added independent Get checks for HostFirmwareSettings and HostFirmwareComponents to set the two new flags; implemented ServicingError recovery fast-path that attempts to abort servicing when specs are cleared; updated getHostFirmwareSettings/getHostFirmwareComponents to return (dirty=false, nil, nil) for absent/empty subresources; extended watches to enqueue BareMetalHost on subresource generation changes.
Provisioner servicing logic
pkg/provisioner/ironic/servicing.go
Builds firmware serviceSteps up front via buildServiceSteps; abortServicing clears Ironic maintenance if set before abort; abort conditions now require both settings and components flags to be false; consolidated Servicing/ServiceWait handling and adjusted logging/requeue semantics.
Servicing tests
pkg/provisioner/ironic/servicing_test.go
Removed tests asserting aborts based on the old single-flag logic; adjusted test setup (BMC URL scheme change) and removed obsolete HasFirmwareSpec assignments.
HostFirmwareSettings controller
internal/controller/metal3.io/hostfirmwaresettings_controller.go
Refactored setCondition signature to remove info *rInfo and simplified condition setting logic; updated call sites accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from honza and iurygregory March 20, 2026 10:14
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jacob-anders
Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/metal3.io/baremetalhost_controller.go (1)

1483-1507: ⚠️ Potential issue | 🔴 Critical

Fix error handling to distinguish transient failures from missing specs.

Lines 1484–1485 and 1506–1507 currently treat all r.Get() errors identically, converting any failure (transient or permanent) into HasFirmware*Spec == false. This risks triggering unintended aborts when retrieval fails temporarily. Other calls in this file properly use k8serrors.IsNotFound() to distinguish missing resources from transient errors; apply the same pattern here:

Suggested fix
 		hfsExists := &metal3api.HostFirmwareSettings{}
-		hfsExistsErr := r.Get(ctx, info.request.NamespacedName, hfsExists)
-		servicingData.HasFirmwareSettingsSpec = (hfsExistsErr == nil && len(hfsExists.Spec.Settings) > 0)
+		if err := r.Get(ctx, info.request.NamespacedName, hfsExists); err != nil {
+			if !k8serrors.IsNotFound(err) {
+				return actionError{fmt.Errorf("could not load host firmware settings: %w", err)}
+			}
+		} else if len(hfsExists.Spec.Settings) > 0 {
+			servicingData.HasFirmwareSettingsSpec = true
+		}
 
 		hfcExists := &metal3api.HostFirmwareComponents{}
-		hfcExistsErr := r.Get(ctx, info.request.NamespacedName, hfcExists)
-		servicingData.HasFirmwareComponentsSpec = (hfcExistsErr == nil && len(hfcExists.Spec.Updates) > 0)
+		if err := r.Get(ctx, info.request.NamespacedName, hfcExists); err != nil {
+			if !k8serrors.IsNotFound(err) {
+				return actionError{fmt.Errorf("could not load host firmware components: %w", err)}
+			}
+		} else if len(hfcExists.Spec.Updates) > 0 {
+			servicingData.HasFirmwareComponentsSpec = true
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/metal3.io/baremetalhost_controller.go` around lines 1483
- 1507, The r.Get() calls that set hfsExistsErr and hfcExistsErr should
distinguish NotFound from other errors: instead of treating any error as "no
spec", check
k8serrors.IsNotFound(hfsExistsErr)/k8serrors.IsNotFound(hfcExistsErr) and set
servicingData.HasFirmwareSettingsSpec and
servicingData.HasFirmwareComponentsSpec to false only for NotFound; for any
other non-nil error return an actionError wrapping the retrieval error (similar
to existing patterns in this file). Update the HostFirmwareSettings lookup
(hfsExists/hfsExistsErr) and the HostFirmwareComponents existence check
(hfcExists/hfcExistsErr) accordingly so transient API errors cause an immediate
actionError rather than silently marking the spec absent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/metal3.io/baremetalhost_controller.go`:
- Around line 1481-1485: The HasFirmwareSettingsSpec flag is currently set only
from HostFirmwareSettings; also inspect the legacy BareMetalHost.spec.firmware
and fold its presence into the same signal so legacy in-flight updates aren't
aborted: after the existing r.Get for hfsExists (hfsExistsErr) also fetch the
BareMetalHost object (or use the existing info.bmh if available) and treat
servicingData.HasFirmwareSettingsSpec as true if either (hfsExistsErr == nil &&
len(hfsExists.Spec.Settings) > 0) OR the BareMetalHost.Spec.Firmware is non-nil
and contains settings (e.g., check Spec.Firmware or its fields are non-empty);
update the assignment to servicingData.HasFirmwareSettingsSpec accordingly using
those identifiers (hfsExists, hfsExistsErr, BareMetalHost.Spec.Firmware or
info.bmh).

In `@pkg/provisioner/ironic/servicing_test.go`:
- Around line 166-174: The test fixture fills prepData.ActualFirmwareSettings
and prepData.TargetFirmwareSettings when !tc.skipConfig but does not set the
corresponding presence flags, so the code takes the abort path; update the
fixture so that inside the if !tc.skipConfig block you also set the
spec-presence flags (e.g., prepData.HasFirmwareActualSpec = true and
prepData.HasFirmwareTargetSpec = true) so the ServiceFail/ServiceWait cases
exercise the retry/wait behavior; make these assignments alongside
prepData.ActualFirmwareSettings and prepData.TargetFirmwareSettings.

In `@pkg/provisioner/ironic/servicing.go`:
- Around line 113-118: The call to buildServiceSteps is currently executed
before Service() handles abort branches, which can prevent abortServicing() from
running if step construction fails; move the buildServiceSteps(bmcAccess, data)
call out of the eager path and only invoke it inside the start/restart branch
where servicing actually begins (leave ServiceFail/Servicing/ServiceWait and
abort handling to run without calling buildServiceSteps), so errors from
buildServiceSteps do not block abortServicing() and you avoid rebuilding steps
twice.

---

Outside diff comments:
In `@internal/controller/metal3.io/baremetalhost_controller.go`:
- Around line 1483-1507: The r.Get() calls that set hfsExistsErr and
hfcExistsErr should distinguish NotFound from other errors: instead of treating
any error as "no spec", check
k8serrors.IsNotFound(hfsExistsErr)/k8serrors.IsNotFound(hfcExistsErr) and set
servicingData.HasFirmwareSettingsSpec and
servicingData.HasFirmwareComponentsSpec to false only for NotFound; for any
other non-nil error return an actionError wrapping the retrieval error (similar
to existing patterns in this file). Update the HostFirmwareSettings lookup
(hfsExists/hfsExistsErr) and the HostFirmwareComponents existence check
(hfcExists/hfcExistsErr) accordingly so transient API errors cause an immediate
actionError rather than silently marking the spec absent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: de99871a-c1ef-46a1-b7b4-6ff3a5516ae2

📥 Commits

Reviewing files that changed from the base of the PR and between fb6a6d5 and 3375ac6.

📒 Files selected for processing (4)
  • internal/controller/metal3.io/baremetalhost_controller.go
  • pkg/provisioner/ironic/servicing.go
  • pkg/provisioner/ironic/servicing_test.go
  • pkg/provisioner/provisioner.go

Comment on lines +1481 to +1485
// Track if HFS spec has actual settings - check independently since getHostFirmwareSettings
// returns nil when no changes even if object exists
hfsExists := &metal3api.HostFirmwareSettings{}
hfsExistsErr := r.Get(info.ctx, info.request.NamespacedName, hfsExists)
servicingData.HasFirmwareSettingsSpec = (hfsExistsErr == nil && len(hfsExists.Spec.Settings) > 0)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Keep legacy BareMetalHost.spec.firmware represented in HasFirmwareSettingsSpec.

Lines 1465-1468 still start servicing from the legacy firmware field, but this flag is now derived only from HostFirmwareSettings. After a service starts and status is synced, fwDirty drops to false; on the next Servicing / ServiceWait reconcile, pkg/provisioner/ironic/servicing.go sees both spec flags false and aborts an in-flight legacy firmware update as if the user had removed it. Please fold the legacy firmware-config presence into this signal as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/metal3.io/baremetalhost_controller.go` around lines 1481
- 1485, The HasFirmwareSettingsSpec flag is currently set only from
HostFirmwareSettings; also inspect the legacy BareMetalHost.spec.firmware and
fold its presence into the same signal so legacy in-flight updates aren't
aborted: after the existing r.Get for hfsExists (hfsExistsErr) also fetch the
BareMetalHost object (or use the existing info.bmh if available) and treat
servicingData.HasFirmwareSettingsSpec as true if either (hfsExistsErr == nil &&
len(hfsExists.Spec.Settings) > 0) OR the BareMetalHost.Spec.Firmware is non-nil
and contains settings (e.g., check Spec.Firmware or its fields are non-empty);
update the assignment to servicingData.HasFirmwareSettingsSpec accordingly using
those identifiers (hfsExists, hfsExistsErr, BareMetalHost.Spec.Firmware or
info.bmh).

Comment on lines 166 to 174
if !tc.skipConfig {
host.Spec.BMC.Address = "bios-test://test.bmc/"
host.Spec.BMC.Address = "raid-test://test.bmc/"
prepData.ActualFirmwareSettings = metal3api.SettingsMap{
"Answer": "unknown",
}
prepData.TargetFirmwareSettings = metal3api.DesiredSettingsMap{
"Answer": intstr.FromInt(42),
}
prepData.HasFirmwareSpec = true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Set the new spec-presence flag in these configured fixtures.

When !tc.skipConfig, this fixture now fills TargetFirmwareSettings / ActualFirmwareSettings but leaves both HasFirmware*Spec flags false. The remaining ServiceFail / ServiceWait cases will hit the abort path instead of the retry/wait behavior they expect.

Suggested fix
 			if !tc.skipConfig {
 				host.Spec.BMC.Address = "raid-test://test.bmc/"
 				prepData.ActualFirmwareSettings = metal3api.SettingsMap{
 					"Answer": "unknown",
 				}
 				prepData.TargetFirmwareSettings = metal3api.DesiredSettingsMap{
 					"Answer": intstr.FromInt(42),
 				}
+				prepData.HasFirmwareSettingsSpec = true
 			}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if !tc.skipConfig {
host.Spec.BMC.Address = "bios-test://test.bmc/"
host.Spec.BMC.Address = "raid-test://test.bmc/"
prepData.ActualFirmwareSettings = metal3api.SettingsMap{
"Answer": "unknown",
}
prepData.TargetFirmwareSettings = metal3api.DesiredSettingsMap{
"Answer": intstr.FromInt(42),
}
prepData.HasFirmwareSpec = true
}
if !tc.skipConfig {
host.Spec.BMC.Address = "raid-test://test.bmc/"
prepData.ActualFirmwareSettings = metal3api.SettingsMap{
"Answer": "unknown",
}
prepData.TargetFirmwareSettings = metal3api.DesiredSettingsMap{
"Answer": intstr.FromInt(42),
}
prepData.HasFirmwareSettingsSpec = true
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/provisioner/ironic/servicing_test.go` around lines 166 - 174, The test
fixture fills prepData.ActualFirmwareSettings and
prepData.TargetFirmwareSettings when !tc.skipConfig but does not set the
corresponding presence flags, so the code takes the abort path; update the
fixture so that inside the if !tc.skipConfig block you also set the
spec-presence flags (e.g., prepData.HasFirmwareActualSpec = true and
prepData.HasFirmwareTargetSpec = true) so the ServiceFail/ServiceWait cases
exercise the retry/wait behavior; make these assignments alongside
prepData.ActualFirmwareSettings and prepData.TargetFirmwareSettings.

Comment on lines +113 to +118
// Check if there are any pending updates
serviceSteps, err := p.buildServiceSteps(bmcAccess, data)
if err != nil {
result, err = operationFailed(err.Error())
return result, started, err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't gate abort handling on buildServiceSteps().

This new eager call returns before the ServiceFail / Servicing / ServiceWait branches. If step construction fails while the user is trying to abort by clearing specs, Service() never reaches abortServicing() and the host stays stuck in the servicing error path. Keeping step generation on the start/restart path also avoids rebuilding the same steps twice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/provisioner/ironic/servicing.go` around lines 113 - 118, The call to
buildServiceSteps is currently executed before Service() handles abort branches,
which can prevent abortServicing() from running if step construction fails; move
the buildServiceSteps(bmcAccess, data) call out of the eager path and only
invoke it inside the start/restart branch where servicing actually begins (leave
ServiceFail/Servicing/ServiceWait and abort handling to run without calling
buildServiceSteps), so errors from buildServiceSteps do not block
abortServicing() and you avoid rebuilding steps twice.

When a user clears HFC/HFS specs to abort a failed servicing operation,
the BMH controller calls getHostFirmwareSettings/Components which check
that metadata.generation matches status.observedGeneration. Since the
sub-controllers may not have reconciled yet after the spec was cleared,
this generation mismatch causes an error that blocks the abort/recovery
path indefinitely, leaving the host stuck in servicing failed with
exponential backoff.

Add a fast-path check early in doServiceIfNeeded: when in ServicingError
state, directly check if specs are empty via lightweight r.Get() calls
(which don't validate generation) before falling through to the
generation-sensitive getHostFirmwareSettings/Components functions. If
specs are cleared, proceed directly to the abort via prov.Service().

Also removes debug logging from the previous iteration.

Assisted-By: Claude Opus 4.6
When a BareMetalHost is in a servicing error state with exponential
backoff, clearing the HostFirmwareComponents or HostFirmwareSettings
spec would not trigger a BMH reconcile because the BMH controller did
not watch those resources. The controller would sit in backoff for
hours before discovering the specs were cleared.

Add watches on HostFirmwareSettings and HostFirmwareComponents with a
GenerationChangedPredicate so that spec changes (which bump generation)
trigger an immediate BMH reconcile. Status-only updates from the
sub-controllers are filtered out since they don't change generation.

Assisted-By: Claude Opus 4.6
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/controller/metal3.io/baremetalhost_controller.go (1)

1469-1480: ⚠️ Potential issue | 🔴 Critical

Preserve legacy BareMetalHost.spec.firmware in the “spec exists” signal

The current logic still derives HasFirmwareSettingsSpec only from HostFirmwareSettings, and the servicing-error fast-path also only checks HFS/HFC specs. If a host is using legacy spec.firmware, in-flight servicing can still be aborted as “spec removed” once the other flags are false.

Proposed fix
@@
-    specsExist := false
+    // Include legacy firmware spec so in-flight legacy servicing is not treated as removed.
+    specsExist := info.host.Spec.Firmware != nil
@@
-    servicingData.HasFirmwareSettingsSpec = (hfsExistsErr == nil && len(hfsExists.Spec.Settings) > 0)
+    legacyFirmwareSpecExists := info.host.Spec.Firmware != nil
+    servicingData.HasFirmwareSettingsSpec = legacyFirmwareSpecExists ||
+        (hfsExistsErr == nil && len(hfsExists.Spec.Settings) > 0)

Also applies to: 1521-1525

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/metal3.io/baremetalhost_controller.go` around lines 1469
- 1480, The "specsExist" calculation and the servicing-error fast-path must also
consider legacy BareMetalHost.spec.firmware: when computing specsExist
(currently using liveFirmwareSettingsAllowed, liveFirmwareUpdatesAllowed,
hfsCheck and hfcCheck fetched via r.Get with info.request.NamespacedName), also
r.Get the BareMetalHost (e.g., into a bareMetalHost variable of type
metal3api.BareMetalHost) and set specsExist = true if
bareMetalHost.Spec.Firmware is non-nil/has meaningful content; apply the same
check in the other identical block referenced around the servicing-error
fast-path so legacy spec.firmware prevents treating the spec as removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/controller/metal3.io/baremetalhost_controller.go`:
- Around line 1469-1480: The "specsExist" calculation and the servicing-error
fast-path must also consider legacy BareMetalHost.spec.firmware: when computing
specsExist (currently using liveFirmwareSettingsAllowed,
liveFirmwareUpdatesAllowed, hfsCheck and hfcCheck fetched via r.Get with
info.request.NamespacedName), also r.Get the BareMetalHost (e.g., into a
bareMetalHost variable of type metal3api.BareMetalHost) and set specsExist =
true if bareMetalHost.Spec.Firmware is non-nil/has meaningful content; apply the
same check in the other identical block referenced around the servicing-error
fast-path so legacy spec.firmware prevents treating the spec as removed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 45cf887c-3802-48e0-827b-8de11467abe7

📥 Commits

Reviewing files that changed from the base of the PR and between 37bc8af and 3e8c53d.

📒 Files selected for processing (1)
  • internal/controller/metal3.io/baremetalhost_controller.go

The HFS controller's setCondition had a custom dirty-check that only
compared the condition Status field (True/False/Unknown) against the
old conditions, ignoring ObservedGeneration. When the HFS spec
generation changed without altering the condition outcome, the status
update was skipped and ObservedGeneration stayed stale. This caused the
BMH controller to block indefinitely on the generation mismatch check
in getHostFirmwareSettings, preventing power-on after a reboot.

Align with the HFC controller's setUpdatesCondition by returning
meta.SetStatusCondition directly, which compares all condition fields
including ObservedGeneration.

Assisted-By: Claude Opus 4.6
Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 31, 2026

@jacob-anders: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/unit 41620b3 link true /test unit

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant