[DNM] Retest original service-abort code to troubleshoot issues with the final version accepted upstream#472
Conversation
This reverts commit aa17d5a.
Walkthrough

The controller splits firmware-spec detection into separate settings/components flags, adds a ServicingError recovery fast-path, and watches subresource generation. The provisioner builds service steps earlier, tightens abort logic to require both spec flags absent, and clears Ironic maintenance on abort. Tests are updated to the new semantics.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jacob-anders. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing…
Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/metal3.io/baremetalhost_controller.go (1)
1483-1507: ⚠️ Potential issue | 🔴 Critical — Fix error handling to distinguish transient failures from missing specs.

Lines 1484–1485 and 1506–1507 currently treat all `r.Get()` errors identically, converting any failure (transient or permanent) into `HasFirmware*Spec == false`. This risks triggering unintended aborts when retrieval fails temporarily. Other calls in this file properly use `k8serrors.IsNotFound()` to distinguish missing resources from transient errors; apply the same pattern here.

Suggested fix:

```diff
 hfsExists := &metal3api.HostFirmwareSettings{}
-hfsExistsErr := r.Get(ctx, info.request.NamespacedName, hfsExists)
-servicingData.HasFirmwareSettingsSpec = (hfsExistsErr == nil && len(hfsExists.Spec.Settings) > 0)
+if err := r.Get(ctx, info.request.NamespacedName, hfsExists); err != nil {
+	if !k8serrors.IsNotFound(err) {
+		return actionError{fmt.Errorf("could not load host firmware settings: %w", err)}
+	}
+} else if len(hfsExists.Spec.Settings) > 0 {
+	servicingData.HasFirmwareSettingsSpec = true
+}
 hfcExists := &metal3api.HostFirmwareComponents{}
-hfcExistsErr := r.Get(ctx, info.request.NamespacedName, hfcExists)
-servicingData.HasFirmwareComponentsSpec = (hfcExistsErr == nil && len(hfcExists.Spec.Updates) > 0)
+if err := r.Get(ctx, info.request.NamespacedName, hfcExists); err != nil {
+	if !k8serrors.IsNotFound(err) {
+		return actionError{fmt.Errorf("could not load host firmware components: %w", err)}
+	}
+} else if len(hfcExists.Spec.Updates) > 0 {
+	servicingData.HasFirmwareComponentsSpec = true
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/metal3.io/baremetalhost_controller.go` around lines 1483 - 1507, The r.Get() calls that set hfsExistsErr and hfcExistsErr should distinguish NotFound from other errors: instead of treating any error as "no spec", check k8serrors.IsNotFound(hfsExistsErr)/k8serrors.IsNotFound(hfcExistsErr) and set servicingData.HasFirmwareSettingsSpec and servicingData.HasFirmwareComponentsSpec to false only for NotFound; for any other non-nil error return an actionError wrapping the retrieval error (similar to existing patterns in this file). Update the HostFirmwareSettings lookup (hfsExists/hfsExistsErr) and the HostFirmwareComponents existence check (hfcExists/hfcExistsErr) accordingly so transient API errors cause an immediate actionError rather than silently marking the spec absent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: de99871a-c1ef-46a1-b7b4-6ff3a5516ae2
📒 Files selected for processing (4)
internal/controller/metal3.io/baremetalhost_controller.go
pkg/provisioner/ironic/servicing.go
pkg/provisioner/ironic/servicing_test.go
pkg/provisioner/provisioner.go
```go
// Track if HFS spec has actual settings - check independently since getHostFirmwareSettings
// returns nil when no changes even if object exists
hfsExists := &metal3api.HostFirmwareSettings{}
hfsExistsErr := r.Get(info.ctx, info.request.NamespacedName, hfsExists)
servicingData.HasFirmwareSettingsSpec = (hfsExistsErr == nil && len(hfsExists.Spec.Settings) > 0)
```
Keep legacy `BareMetalHost.spec.firmware` represented in `HasFirmwareSettingsSpec`.

Lines 1465-1468 still start servicing from the legacy firmware field, but this flag is now derived only from `HostFirmwareSettings`. After a service starts and status is synced, `fwDirty` drops to false; on the next Servicing / ServiceWait reconcile, `pkg/provisioner/ironic/servicing.go` sees both spec flags false and aborts an in-flight legacy firmware update as if the user had removed it. Please fold the legacy firmware-config presence into this signal as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/controller/metal3.io/baremetalhost_controller.go` around lines 1481
- 1485, The HasFirmwareSettingsSpec flag is currently set only from
HostFirmwareSettings; also inspect the legacy BareMetalHost.spec.firmware and
fold its presence into the same signal so legacy in-flight updates aren't
aborted: after the existing r.Get for hfsExists (hfsExistsErr) also fetch the
BareMetalHost object (or use the existing info.bmh if available) and treat
servicingData.HasFirmwareSettingsSpec as true if either (hfsExistsErr == nil &&
len(hfsExists.Spec.Settings) > 0) OR the BareMetalHost.Spec.Firmware is non-nil
and contains settings (e.g., check Spec.Firmware or its fields are non-empty);
update the assignment to servicingData.HasFirmwareSettingsSpec accordingly using
those identifiers (hfsExists, hfsExistsErr, BareMetalHost.Spec.Firmware or
info.bmh).
```diff
 if !tc.skipConfig {
-	host.Spec.BMC.Address = "bios-test://test.bmc/"
+	host.Spec.BMC.Address = "raid-test://test.bmc/"
 	prepData.ActualFirmwareSettings = metal3api.SettingsMap{
 		"Answer": "unknown",
 	}
 	prepData.TargetFirmwareSettings = metal3api.DesiredSettingsMap{
 		"Answer": intstr.FromInt(42),
 	}
-	prepData.HasFirmwareSpec = true
 }
```
Set the new spec-presence flag in these configured fixtures.
When !tc.skipConfig, this fixture now fills TargetFirmwareSettings / ActualFirmwareSettings but leaves both HasFirmware*Spec flags false. The remaining ServiceFail / ServiceWait cases will hit the abort path instead of the retry/wait behavior they expect.
Suggested fix:

```diff
 if !tc.skipConfig {
 	host.Spec.BMC.Address = "raid-test://test.bmc/"
 	prepData.ActualFirmwareSettings = metal3api.SettingsMap{
 		"Answer": "unknown",
 	}
 	prepData.TargetFirmwareSettings = metal3api.DesiredSettingsMap{
 		"Answer": intstr.FromInt(42),
 	}
+	prepData.HasFirmwareSettingsSpec = true
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current:

```go
if !tc.skipConfig {
	host.Spec.BMC.Address = "bios-test://test.bmc/"
	host.Spec.BMC.Address = "raid-test://test.bmc/"
	prepData.ActualFirmwareSettings = metal3api.SettingsMap{
		"Answer": "unknown",
	}
	prepData.TargetFirmwareSettings = metal3api.DesiredSettingsMap{
		"Answer": intstr.FromInt(42),
	}
	prepData.HasFirmwareSpec = true
}
```

Suggested:

```go
if !tc.skipConfig {
	host.Spec.BMC.Address = "raid-test://test.bmc/"
	prepData.ActualFirmwareSettings = metal3api.SettingsMap{
		"Answer": "unknown",
	}
	prepData.TargetFirmwareSettings = metal3api.DesiredSettingsMap{
		"Answer": intstr.FromInt(42),
	}
	prepData.HasFirmwareSettingsSpec = true
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/provisioner/ironic/servicing_test.go` around lines 166 - 174, The test
fixture fills prepData.ActualFirmwareSettings and
prepData.TargetFirmwareSettings when !tc.skipConfig but does not set the
corresponding presence flags, so the code takes the abort path; update the
fixture so that inside the if !tc.skipConfig block you also set the
spec-presence flags (e.g., prepData.HasFirmwareActualSpec = true and
prepData.HasFirmwareTargetSpec = true) so the ServiceFail/ServiceWait cases
exercise the retry/wait behavior; make these assignments alongside
prepData.ActualFirmwareSettings and prepData.TargetFirmwareSettings.
pkg/provisioner/ironic/servicing.go (Outdated)
```go
// Check if there are any pending updates
serviceSteps, err := p.buildServiceSteps(bmcAccess, data)
if err != nil {
	result, err = operationFailed(err.Error())
	return result, started, err
}
```
Don't gate abort handling on buildServiceSteps().
This new eager call returns before the ServiceFail / Servicing / ServiceWait branches. If step construction fails while the user is trying to abort by clearing specs, Service() never reaches abortServicing() and the host stays stuck in the servicing error path. Keeping step generation on the start/restart path also avoids rebuilding the same steps twice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/provisioner/ironic/servicing.go` around lines 113 - 118, The call to
buildServiceSteps is currently executed before Service() handles abort branches,
which can prevent abortServicing() from running if step construction fails; move
the buildServiceSteps(bmcAccess, data) call out of the eager path and only
invoke it inside the start/restart branch where servicing actually begins (leave
ServiceFail/Servicing/ServiceWait and abort handling to run without calling
buildServiceSteps), so errors from buildServiceSteps do not block
abortServicing() and you avoid rebuilding steps twice.
When a user clears HFC/HFS specs to abort a failed servicing operation, the BMH controller calls getHostFirmwareSettings/Components, which check that metadata.generation matches status.observedGeneration. Since the sub-controllers may not have reconciled yet after the spec was cleared, this generation mismatch causes an error that blocks the abort/recovery path indefinitely, leaving the host stuck in servicing failed with exponential backoff.

Add a fast-path check early in doServiceIfNeeded: when in the ServicingError state, directly check whether the specs are empty via lightweight r.Get() calls (which don't validate generation) before falling through to the generation-sensitive getHostFirmwareSettings/Components functions. If the specs are cleared, proceed directly to the abort via prov.Service().

Also removes debug logging from the previous iteration.

Assisted-By: Claude Opus 4.6
When a BareMetalHost is in a servicing error state with exponential backoff, clearing the HostFirmwareComponents or HostFirmwareSettings spec would not trigger a BMH reconcile because the BMH controller did not watch those resources. The controller would sit in backoff for hours before discovering the specs were cleared.

Add watches on HostFirmwareSettings and HostFirmwareComponents with a GenerationChangedPredicate so that spec changes (which bump generation) trigger an immediate BMH reconcile. Status-only updates from the sub-controllers are filtered out since they don't change generation.

Assisted-By: Claude Opus 4.6
♻️ Duplicate comments (1)
internal/controller/metal3.io/baremetalhost_controller.go (1)
1469-1480: ⚠️ Potential issue | 🔴 Critical — Preserve legacy `BareMetalHost.spec.firmware` in the "spec exists" signal

The current logic still derives `HasFirmwareSettingsSpec` only from `HostFirmwareSettings`, and the servicing-error fast-path also only checks HFS/HFC specs. If a host is using legacy `spec.firmware`, in-flight servicing can still be aborted as "spec removed" once the other flags are false.

Proposed fix:

```diff
@@
-	specsExist := false
+	// Include legacy firmware spec so in-flight legacy servicing is not treated as removed.
+	specsExist := info.host.Spec.Firmware != nil
@@
-	servicingData.HasFirmwareSettingsSpec = (hfsExistsErr == nil && len(hfsExists.Spec.Settings) > 0)
+	legacyFirmwareSpecExists := info.host.Spec.Firmware != nil
+	servicingData.HasFirmwareSettingsSpec = legacyFirmwareSpecExists ||
+		(hfsExistsErr == nil && len(hfsExists.Spec.Settings) > 0)
```

Also applies to: 1521-1525
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/metal3.io/baremetalhost_controller.go` around lines 1469 - 1480, The "specsExist" calculation and the servicing-error fast-path must also consider legacy BareMetalHost.spec.firmware: when computing specsExist (currently using liveFirmwareSettingsAllowed, liveFirmwareUpdatesAllowed, hfsCheck and hfcCheck fetched via r.Get with info.request.NamespacedName), also r.Get the BareMetalHost (e.g., into a bareMetalHost variable of type metal3api.BareMetalHost) and set specsExist = true if bareMetalHost.Spec.Firmware is non-nil/has meaningful content; apply the same check in the other identical block referenced around the servicing-error fast-path so legacy spec.firmware prevents treating the spec as removed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 45cf887c-3802-48e0-827b-8de11467abe7
📒 Files selected for processing (1)
internal/controller/metal3.io/baremetalhost_controller.go
The HFS controller's setCondition had a custom dirty-check that only compared the condition Status field (True/False/Unknown) against the old conditions, ignoring ObservedGeneration. When the HFS spec generation changed without altering the condition outcome, the status update was skipped and ObservedGeneration stayed stale. This caused the BMH controller to block indefinitely on the generation mismatch check in getHostFirmwareSettings, preventing power-on after a reboot.

Align with the HFC controller's setUpdatesCondition by returning meta.SetStatusCondition directly, which compares all condition fields including ObservedGeneration.

Assisted-By: Claude Opus 4.6
Signed-off-by: Jacob Anders <jacob-anders-dev@proton.me>
@jacob-anders: The following test failed, say…

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/hold