Validate version skew policy compliance on cluster upgrades#923
Conversation
| return nil | ||
| } | ||
| if !delta.Consecutive { |
What does Consecutive mean here, X+1 minor version?
1.29.3 => 1.30.0 ✅
1.29.3 => 1.31.0 ❌
1.29.3 => 2.0.0 ✅
1.29.3 => 2.1.0 ❌
...but
1.29.3 => 1.30.1 ❌ (this is probably too strict and should be allowed)
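The "consecutive" semantics discussed above can be sketched as a small helper. This is a hypothetical illustration, not k0sctl's actual code; it follows the relaxed reading suggested in the comment, where the target patch level does not matter (so `1.29.3 => 1.30.1` is allowed):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "1.29.3" into major, minor and patch integers.
func parseVersion(v string) (major, minor, patch int, err error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) != 3 {
		return 0, 0, 0, fmt.Errorf("invalid version %q", v)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		if nums[i], err = strconv.Atoi(p); err != nil {
			return 0, 0, 0, fmt.Errorf("invalid version %q: %w", v, err)
		}
	}
	return nums[0], nums[1], nums[2], nil
}

// isConsecutive reports whether the upgrade from -> to moves at most one
// minor version forward: either the next minor of the same major (any
// patch level, per the relaxed reading) or the .0 minor of the next major.
func isConsecutive(from, to string) bool {
	fMaj, fMin, _, err := parseVersion(from)
	if err != nil {
		return false
	}
	tMaj, tMin, _, err := parseVersion(to)
	if err != nil {
		return false
	}
	switch {
	case tMaj == fMaj:
		return tMin == fMin+1
	case tMaj == fMaj+1:
		return tMin == 0
	default:
		return false
	}
}

func main() {
	for _, c := range [][2]string{
		{"1.29.3", "1.30.0"}, // allowed: next minor
		{"1.29.3", "1.31.0"}, // blocked: skips a minor
		{"1.29.3", "2.0.0"},  // allowed: next major, minor 0
		{"1.29.3", "2.1.0"},  // blocked: jumps into the next major's later minor
		{"1.29.3", "1.30.1"}, // allowed under the relaxed patch rule
	} {
		fmt.Printf("%s => %s: %v\n", c[0], c[1], isConsecutive(c[0], c[1]))
	}
}
```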
Allowing jumps of minor+2 (1.29 => 1.31) would be ok by kube I suppose.
Confused 🤷‍♂️
kubeadm docs state explicitly that upgrading more than one minor at a time is not supported.
It seems it's the kube-apiserver, controller manager and scheduler that do not support upgrades that skip minors.
In theory, workers could lag 3 minors behind but I don't see how that would happen with k0sctl unless you add previously unmanaged nodes and proceed to upgrade from 1.28 to 1.31.
Major upgrades are not supported at all, but I guess that is something for the far future.
Essentially it boils down to what k0s supports.
It could make sense to wait for workers to get the new kube-proxy after the control plane is upgraded before proceeding to upgrade them.
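The "wait for kube-proxy" idea boils down to checking DaemonSet roll-out completion. A minimal sketch of that condition, roughly what `kubectl rollout status daemonset` evaluates; the `dsStatus` struct and `rolledOut` helper are hypothetical stand-ins, not k0sctl's actual types:

```go
package main

import "fmt"

// dsStatus mirrors the fields of a DaemonSet's .status (plus
// .metadata.generation) that matter for deciding whether a roll-out
// has completed. Hypothetical struct, not k0sctl's actual type.
type dsStatus struct {
	DesiredNumberScheduled int
	UpdatedNumberScheduled int
	NumberAvailable        int
	ObservedGeneration     int64
	Generation             int64 // from .metadata.generation
}

// rolledOut reports whether the controller has observed the latest spec
// and every desired pod runs the updated template and is available.
func rolledOut(s dsStatus) bool {
	return s.ObservedGeneration >= s.Generation &&
		s.UpdatedNumberScheduled >= s.DesiredNumberScheduled &&
		s.NumberAvailable >= s.DesiredNumberScheduled
}

func main() {
	inProgress := dsStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2, NumberAvailable: 3, ObservedGeneration: 2, Generation: 2}
	done := dsStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberAvailable: 3, ObservedGeneration: 2, Generation: 2}
	fmt.Println(rolledOut(inProgress), rolledOut(done))
}
```

A retry loop polling this condition (as the PR does via `retry.WithDefaultTimeout`) is what gates the worker upgrades.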
Force-pushed ec260b7 to ccd0855
Force-pushed ccd0855 to 5c47051
Force-pushed 93a7c0f to 99c7eda
Signed-off-by: Kimmo Lehto <klehto@mirantis.com>
Force-pushed 99c7eda to 5ae5df1
Pull request overview
Adds upgrade-time validation and sequencing to prevent unsupported Kubernetes/k0s version leaps and to ensure kube-proxy has rolled out before worker upgrades begin.
Changes:
- Add version-skew validation during `ValidateFacts`, blocking unsupported upgrade jumps unless `--force` is used.
- Add a kube-proxy DaemonSet rollout wait gate before starting worker upgrades.
- Update smoke tests to exercise the new "blocked without `--force`" behavior and to apply `--force` where needed.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| smoke-test/smoke-upgrade.sh | Expects upgrade to fail without --force, then succeed with --force. |
| smoke-test/smoke-dryrun.sh | Adjusts apply helper to accept multiple flags and uses --force for upgrade paths. |
| pkg/node/statusfunc.go | Adds DaemonSet rollout verification helpers and kube-proxy-specific rollout check. |
| phase/validate_facts.go | Introduces version-skew validation for hosts requiring upgrade, with --force bypass. |
| phase/upgrade_workers.go | Waits for kube-proxy rollout (unless --no-wait) before upgrading workers. |
| if ../k0sctl apply --config "${K0SCTL_CONFIG}" --debug; then | ||
| echo "Expected failure when applying without --force" | ||
| exit 1 | ||
| fi | ||
The test currently accepts any failure from the first k0sctl apply as an "expected" version-skew rejection. That can hide unrelated regressions (e.g., config or connectivity failures) because it doesn't assert the error cause. Consider capturing the output and checking it contains the version-skew policy error text before treating the failure as expected.
| if ../k0sctl apply --config "${K0SCTL_CONFIG}" --debug; then | |
| echo "Expected failure when applying without --force" | |
| exit 1 | |
| fi | |
| upgrade_output_file="$(mktemp)" | |
| if ../k0sctl apply --config "${K0SCTL_CONFIG}" --debug >"${upgrade_output_file}" 2>&1; then | |
| echo "Expected failure when applying without --force" | |
| rm -f "${upgrade_output_file}" | |
| exit 1 | |
| fi | |
| # Verify that the failure was due to version skew policy | |
| if ! grep -qi "version skew" "${upgrade_output_file}"; then | |
| echo "k0sctl apply failed for an unexpected reason; expected version skew error. Output:" | |
| cat "${upgrade_output_file}" | |
| rm -f "${upgrade_output_file}" | |
| exit 1 | |
| fi | |
| rm -f "${upgrade_output_file}" |
| if err := p.validateVersionSkew(); err != nil { | ||
| return err | ||
| } |
validateVersionSkew adds new upgrade-blocking behavior, but phase/validate_facts_test.go currently only covers downgrade/defaulting/nodeLocalLoadBalancing. Please add unit tests for the new skew rules (controller >1 minor, worker >3 minor, and --force bypass) to prevent regressions.
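The requested unit tests could follow the usual Go table-driven pattern. This is only a sketch: `validateSkew` below is a hypothetical stand-in for the rule under test (controllers may move at most one minor, workers may lag up to three, `--force` bypasses both), not the actual `validateVersionSkew` signature in `phase/validate_facts.go`:

```go
package main

import "fmt"

// validateSkew is a stand-in for the skew rule: controllers may move at
// most one minor per upgrade, workers may lag up to three minors, and
// force bypasses both checks. Hypothetical signature for illustration.
func validateSkew(minorJump int, isController bool, force bool) error {
	if force {
		return nil
	}
	limit := 3 // workers may lag up to 3 minors behind the control plane
	if isController {
		limit = 1
	}
	if minorJump > limit {
		return fmt.Errorf("version skew of %d minors exceeds limit %d", minorJump, limit)
	}
	return nil
}

func main() {
	// Table-driven cases mirroring the review request.
	cases := []struct {
		name       string
		jump       int
		controller bool
		force      bool
		wantErr    bool
	}{
		{"controller +1 minor ok", 1, true, false, false},
		{"controller +2 minors blocked", 2, true, false, true},
		{"worker +3 minors ok", 3, false, false, false},
		{"worker +4 minors blocked", 4, false, false, true},
		{"--force bypasses the check", 2, true, true, false},
	}
	for _, c := range cases {
		err := validateSkew(c.jump, c.controller, c.force)
		if (err != nil) != c.wantErr {
			fmt.Printf("FAIL %s: got %v\n", c.name, err)
			continue
		}
		fmt.Printf("ok   %s\n", c.name)
	}
}
```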
| // Upgrade worker hosts in parallel, in 10% chunks | ||
| concurrentUpgrades := int(math.Floor(float64(len(p.hosts)) * float64(p.Config.Spec.Options.Concurrency.WorkerDisruptionPercent) / 100)) | ||
| if concurrentUpgrades == 0 { | ||
| concurrentUpgrades = 1 | ||
| } | ||
| concurrentUpgrades = min(concurrentUpgrades, p.Config.Spec.Options.Concurrency.Limit) | ||
| // Wait once for kube-proxy to be at desired version across the cluster. | ||
| if !p.IsWet() { | ||
| p.DryMsg(p.leader, "wait for kube-proxy to be at the desired version (cluster-wide)") | ||
| } else if !NoWait { // honor --no-wait | ||
| log.Infof("waiting for kube-proxy cluster-wide roll-out") | ||
| if err := retry.WithDefaultTimeout(ctx, node.KubeProxyRolledOutFunc(p.leader)); err != nil { | ||
| return fmt.Errorf("kube-proxy did not reach the desired version: %w", err) | ||
| } | ||
| } | ||
| log.Infof("Upgrading max %d workers in parallel", concurrentUpgrades) | ||
| return p.hosts.BatchedParallelEach(ctx, concurrentUpgrades, | ||
| p.start, | ||
| p.cordonWorker, | ||
| p.drainWorker, | ||
| p.upgradeWorker, | ||
| p.uncordonWorker, | ||
| p.finish, | ||
| ) | ||
This block is indented with spaces instead of the gofmt-standard tabs used throughout the rest of the file. Please run gofmt (or adjust indentation) to keep formatting consistent and avoid noisy diffs/CI style failures.
| func buildLabelSelector(labels map[string]string) string { | ||
| // Simple AND of matchLabels (k=v,k2=v2,...) | ||
| if len(labels) == 0 { | ||
| return "" |
DaemonSetRolledOutFunc only builds a selector from spec.selector.matchLabels. If matchLabels is empty/nil (e.g., the DaemonSet uses matchExpressions), buildLabelSelector returns an empty string and the kubectl call ends up listing all pods in the namespace, which can lead to incorrect rollout results/timeouts. Consider either supporting matchExpressions in the JSON struct/selector builder, or returning an explicit error when matchLabels is empty.
| return "" | |
| // Returning an empty selector would cause kubectl to list all pods in the namespace, | |
| // which is incorrect if the DaemonSet uses only matchExpressions. Instead, return a | |
| // selector that matches no pods to avoid false rollout success. | |
| log.Warn("buildLabelSelector: empty matchLabels; using a selector that matches no pods (likely matchExpressions-only selector)") | |
| return "k0sctl-selector=no-matching-labels" |
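An alternative to the sentinel-selector suggestion above is to support `matchExpressions` directly and return an error rather than an empty selector. A minimal sketch, assuming a local `matchExpression` struct (the real `LabelSelector` type lives in k8s.io/apimachinery, and this is not the PR's actual code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// matchExpression mirrors the shape of a LabelSelectorRequirement
// (key, operator, values). Hypothetical local struct for illustration.
type matchExpression struct {
	Key      string
	Operator string // In, NotIn, Exists, DoesNotExist
	Values   []string
}

// buildLabelSelector converts matchLabels plus matchExpressions into a
// kubectl -l selector string, returning an error instead of an empty
// selector so callers never accidentally list every pod in the namespace.
func buildLabelSelector(labels map[string]string, exprs []matchExpression) (string, error) {
	var parts []string
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output regardless of map order
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%s", k, labels[k]))
	}
	for _, e := range exprs {
		switch e.Operator {
		case "In":
			parts = append(parts, fmt.Sprintf("%s in (%s)", e.Key, strings.Join(e.Values, ",")))
		case "NotIn":
			parts = append(parts, fmt.Sprintf("%s notin (%s)", e.Key, strings.Join(e.Values, ",")))
		case "Exists":
			parts = append(parts, e.Key)
		case "DoesNotExist":
			parts = append(parts, "!"+e.Key)
		default:
			return "", fmt.Errorf("unsupported selector operator %q", e.Operator)
		}
	}
	if len(parts) == 0 {
		return "", fmt.Errorf("selector has neither matchLabels nor matchExpressions")
	}
	return strings.Join(parts, ","), nil
}

func main() {
	s, err := buildLabelSelector(
		map[string]string{"k8s-app": "kube-proxy"},
		[]matchExpression{{Key: "tier", Operator: "In", Values: []string{"node"}}},
	)
	fmt.Println(s, err)
}
```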
Waiting for k0sproject/k0s#7173
I don't expect this to land. Looks like fire-and-forget AI slop 🤷 |
#835
When upgrading, validate that the new version does not violate kube's version skew policy or upgrade path requirements.
Waits for the new kube-proxy version to roll out after controller upgrades before starting worker upgrades.