
Validate version skew policy compliance on cluster upgrades#923

Open
kke wants to merge 1 commit into main from version-skew-validation

Conversation

@kke
Contributor

@kke kke commented Aug 12, 2025

#835

When upgrading, validate that the new version does not violate kube's version skew policy or upgrade path requirements.

Waits for the new kube-proxy version to roll out after controller upgrades before starting worker upgrades.

@kke kke added the enhancement New feature or request label Aug 12, 2025
Comment thread phase/validate_hosts.go Outdated
return nil
}

if !delta.Consecutive {
Member


What does Consecutive mean here, X+1 minor version?

Contributor Author


1.29.3 => 1.30.0 ✅
1.29.3 => 1.31.0 ❌
1.29.3 => 2.0.0 ✅
1.29.3 => 2.1.0 ❌

...but

1.29.3 => 1.30.1 ❌ (this is probably too strict and should be allowed)

Contributor Author


Allowing jumps of minor+2 (1.29 => 1.31) would be ok by kube I suppose.

Contributor Author


Confused 🤷‍♂️

kubeadm docs state explicitly that upgrading more than one minor at a time is not supported.

It seems it's kube-apiserver, kube-controller-manager, and kube-scheduler that do not support upgrades that skip minors.

In theory, workers could lag 3 minors behind but I don't see how that would happen with k0sctl unless you add previously unmanaged nodes and proceed to upgrade from 1.28 to 1.31.

Major upgrades are not supported at all, but I guess that is something for the far future.

Contributor Author


Essentially it boils down to what k0s supports.

@kke
Contributor Author

kke commented Aug 21, 2025

It could make sense to wait for workers to get the new kube-proxy after the control plane is upgraded before proceeding to upgrade them.

@kke kke changed the title from "Validate version consicutiveness on upgrade" to "Validate version consecutiveness on upgrade" Aug 22, 2025
@kke kke changed the title from "Validate version consecutiveness on upgrade" to "Validate version skew policy compliance on cluster upgrades" Aug 29, 2025
@kke kke force-pushed the version-skew-validation branch from ec260b7 to ccd0855 on September 3, 2025 10:12
@kke kke force-pushed the version-skew-validation branch from ccd0855 to 5c47051 on September 11, 2025 08:19
@kke kke force-pushed the version-skew-validation branch from 93a7c0f to 99c7eda on March 9, 2026 08:30
@kke kke requested a review from Copilot March 9, 2026 08:31
Signed-off-by: Kimmo Lehto <klehto@mirantis.com>
@kke kke force-pushed the version-skew-validation branch from 99c7eda to 5ae5df1 on March 9, 2026 08:35

Copilot AI left a comment


Pull request overview

Adds upgrade-time validation and sequencing to prevent unsupported Kubernetes/k0s version leaps and to ensure kube-proxy has rolled out before worker upgrades begin.

Changes:

  • Add version-skew validation during ValidateFacts, blocking unsupported upgrade jumps unless --force is used.
  • Add a kube-proxy DaemonSet rollout wait gate before starting worker upgrades.
  • Update smoke tests to exercise the new “blocked without --force” behavior and to apply --force where needed.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File summary:
smoke-test/smoke-upgrade.sh: Expects upgrade to fail without --force, then succeed with --force.
smoke-test/smoke-dryrun.sh: Adjusts the apply helper to accept multiple flags and uses --force for upgrade paths.
pkg/node/statusfunc.go: Adds DaemonSet rollout verification helpers and a kube-proxy-specific rollout check.
phase/validate_facts.go: Introduces version-skew validation for hosts requiring upgrade, with a --force bypass.
phase/upgrade_workers.go: Waits for kube-proxy rollout (unless --no-wait) before upgrading workers.


Comment on lines +32 to +36
if ../k0sctl apply --config "${K0SCTL_CONFIG}" --debug; then
  echo "Expected failure when applying without --force"
  exit 1
fi


Copilot AI Mar 9, 2026


The test currently accepts any failure from the first k0sctl apply as an "expected" version-skew rejection. That can hide unrelated regressions (e.g., config or connectivity failures) because it doesn't assert the error cause. Consider capturing the output and checking it contains the version-skew policy error text before treating the failure as expected.

Suggested change
if ../k0sctl apply --config "${K0SCTL_CONFIG}" --debug; then
  echo "Expected failure when applying without --force"
  exit 1
fi
upgrade_output_file="$(mktemp)"
if ../k0sctl apply --config "${K0SCTL_CONFIG}" --debug >"${upgrade_output_file}" 2>&1; then
  echo "Expected failure when applying without --force"
  rm -f "${upgrade_output_file}"
  exit 1
fi
# Verify that the failure was due to version skew policy
if ! grep -qi "version skew" "${upgrade_output_file}"; then
  echo "k0sctl apply failed for an unexpected reason; expected version skew error. Output:"
  cat "${upgrade_output_file}"
  rm -f "${upgrade_output_file}"
  exit 1
fi
rm -f "${upgrade_output_file}"

Comment thread phase/validate_facts.go
Comment on lines +37 to +39
if err := p.validateVersionSkew(); err != nil {
return err
}

Copilot AI Mar 9, 2026


validateVersionSkew adds new upgrade-blocking behavior, but phase/validate_facts_test.go currently only covers downgrade/defaulting/nodeLocalLoadBalancing. Please add unit tests for the new skew rules (controller >1 minor, worker >3 minor, and --force bypass) to prevent regressions.

Comment thread phase/upgrade_workers.go
Comment on lines +96 to +121
// Upgrade worker hosts in parallel in 10% chunks
concurrentUpgrades := int(math.Floor(float64(len(p.hosts)) * float64(p.Config.Spec.Options.Concurrency.WorkerDisruptionPercent) / 100))
if concurrentUpgrades == 0 {
concurrentUpgrades = 1
}
concurrentUpgrades = min(concurrentUpgrades, p.Config.Spec.Options.Concurrency.Limit)

// Wait once for kube-proxy to be at desired version across the cluster.
if !p.IsWet() {
p.DryMsg(p.leader, "wait for kube-proxy to be at the desired version (cluster-wide)")
} else if !NoWait { // honor --no-wait
log.Infof("waiting for kube-proxy cluster-wide roll-out")
if err := retry.WithDefaultTimeout(ctx, node.KubeProxyRolledOutFunc(p.leader)); err != nil {
return fmt.Errorf("kube-proxy did not reach the desired version: %w", err)
}
}

log.Infof("Upgrading max %d workers in parallel", concurrentUpgrades)
return p.hosts.BatchedParallelEach(ctx, concurrentUpgrades,
p.start,
p.cordonWorker,
p.drainWorker,
p.upgradeWorker,
p.uncordonWorker,
p.finish,
)

Copilot AI Mar 9, 2026


This block is indented with spaces instead of the gofmt-standard tabs used throughout the rest of the file. Please run gofmt (or adjust indentation) to keep formatting consistent and avoid noisy diffs/CI style failures.

Comment thread pkg/node/statusfunc.go
func buildLabelSelector(labels map[string]string) string {
// Simple AND of matchLabels (k=v,k2=v2,...)
if len(labels) == 0 {
return ""

Copilot AI Mar 9, 2026


DaemonSetRolledOutFunc only builds a selector from spec.selector.matchLabels. If matchLabels is empty/nil (e.g., the DaemonSet uses matchExpressions), buildLabelSelector returns an empty string and the kubectl call ends up listing all pods in the namespace, which can lead to incorrect rollout results/timeouts. Consider either supporting matchExpressions in the JSON struct/selector builder, or returning an explicit error when matchLabels is empty.

Suggested change
return ""
// Returning an empty selector would cause kubectl to list all pods in the namespace,
// which is incorrect if the DaemonSet uses only matchExpressions. Instead, return a
// selector that matches no pods to avoid false rollout success.
log.Warn("buildLabelSelector: empty matchLabels; using a selector that matches no pods (likely matchExpressions-only selector)")
return "k0sctl-selector=no-matching-labels"

@kke
Contributor Author

kke commented Mar 9, 2026

Waiting for k0sproject/k0s#7173

@twz123
Member

twz123 commented Mar 11, 2026

Waiting for k0sproject/k0s#7173

I don't expect this to land. Looks like fire-and-forget AI slop 🤷


Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants