OCPBUGS-78990: Bump 1.35.3 to master#2633
OCPBUGS-78990: Bump 1.35.3 to master#2633jubittajohn wants to merge 23 commits intoopenshift:masterfrom
Conversation
This PR fixes the flaky TestApplyCRDuringCRDFinalization test that was failing intermittently on slower systems (s390x architecture, race detector builds). The root cause was a race condition where the test would attempt to apply a CR immediately after requesting CRD deletion, without waiting for the CRD to actually enter the terminating state. The fix explicitly waits for the CRD to have the Terminating condition set to True before attempting the apply.
NewSimpleClientset was marked as deprecated when NewClientset was introduced. This has caused some confusion: - Not all packages have NewClientset (kubernetes#135980). - Tests that work with NewSimpleClientset fail when switched to NewClientset (kubernetes#136327) because of missing CRD support (kubernetes#126850). It doesn't seem burdensome to keep NewSimpleClientset around forever. Some unit tests may even prefer to use it when they don't need server-side apply (less overhead). Therefore there is no need to deprecate it. This avoids churn in the eco system because contributors no longer create PRs "because the linter complains about the usage of a deprecated function".
…update logic Co-authored-by: Pohly <patrick.ohly@intel.com>
…-pick-of-#135567-upstream-release-1.35 [release1.35]Automated cherry pick of kubernetes#135567: Fix flaky TestApplyCRDuringCRDFinalization test
…binding slice first
1.35: add dockerized go cache chmod to `make clean`
we only use the rules in the master branch since we don't need rules.yaml, we don't have two places to match, so we can drop the golang version entirely from this file bump .go-version alone will be sufficient* on release branches after kubernetes#136954 * ignoring e2e images like agnhost, which will require follow-up PRs ...
…k-of-#137253-upstream-release-1.35 Automated cherry pick of kubernetes#137253: DRA: start scheduler after creating binding/non-binding slicesin Basicflow
The test uses an invalid image to induce a pull error. The previous image name 'some-image-that-doesnt-exist' causes slow DNS/registry resolution on some environments (especially metal), leading to 30s timeouts. Using 'localhost/some-image-that-does-not-exist' makes the pull fail instantly since there is no registry on localhost, avoiding flaky timeouts.
…-of-#136455-origin-release-1.35 Automated cherry pick of kubernetes#136455: fake client-go: un-deprecate NewSimpleClientset
If /var/lib/kubelet is MS_SHARED mountpoint, all the mountpoints under /var/lib/kubelet will have duplicate one. When `kubeadm reset -f` is executed, it will try to umount one path twice. However, they are in the peer group. Once we umount one path, the duplicate one will be umounted as well. So, in this case, we should ignore EINVAL error. Signed-off-by: Wei Fu <fuweid89@gmail.com> (cherry picked from commit 2634261) Signed-off-by: Wei Fu <fuweid89@gmail.com>
…-pick-of-#135611-upstream-release-1.35 [release-1.35]Automated cherry pick of kubernetes#135611: Fix flake TestDeviceTaintRule test
…-of-#137251-upstream-release-1.35 Automated cherry pick of kubernetes#137251: kubeadm: do not add learner member to etcd client endpoints
drop publishing rules from dependencies.yaml on release branch
…y-pick-of-#137252-upstream-release-1.35 Automated cherry pick of kubernetes#137252: Use localhost Image Reference in PodObservedGenerationTracking E2E Test
[release-1.35] cmd/kubeadm: ignore EINVAL error during unmount
Kubernetes official release v1.35.3
|
@jubittajohn: This pull request references Jira Issue OCPBUGS-78990, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThis pull request includes release documentation updates for v1.35.2, conditional permission fixes in build scripts, etcd endpoint handling changes for learner members, device taint eviction controller enhancements with delayed work scheduling, and widespread API documentation updates to fake clientset and code generator templates regarding field tracking support. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 Trivy (0.69.3)Trivy execution failed: Unknown error Comment Tip CodeRabbit can use OpenGrep to find security vulnerabilities and bugs across 17+ programming languages.OpenGrep is compatible with Semgrep configurations. Add an |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jubittajohn The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
CHANGELOG/CHANGELOG-1.35.md (1)
679-680:⚠️ Potential issue | 🟡 MinorFix malformed markdown in two changelog bullets.
At Line 679 and Line 680, each bullet is missing a closing
)before the[SIG ...]suffix, which can break rendering/parsing of this section.Suggested fix
-- Fixes a bug where `MutatingAdmissionPolicy` would fail to apply to objects with duplicate list items (like env vars). ([`#135560`](https://github.com/kubernetes/kubernetes/pull/135560), [`@lalitc375`](https://github.com/lalitc375) [SIG API Machinery] +- Fixes a bug where `MutatingAdmissionPolicy` would fail to apply to objects with duplicate list items (like env vars). ([`#135560`](https://github.com/kubernetes/kubernetes/pull/135560), [`@lalitc375`](https://github.com/lalitc375)) [SIG API Machinery] -- K8s.io/client-go: Fixes a regression in 1.34+ which prevented informers from using configured Transformer functions. ([`#135580`](https://github.com/kubernetes/kubernetes/pull/135580), [`@serathius`](https://github.com/serathius) [SIG API Machinery] +- K8s.io/client-go: Fixes a regression in 1.34+ which prevented informers from using configured Transformer functions. ([`#135580`](https://github.com/kubernetes/kubernetes/pull/135580), [`@serathius`](https://github.com/serathius)) [SIG API Machinery]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CHANGELOG/CHANGELOG-1.35.md` around lines 679 - 680, Two changelog bullets referencing "MutatingAdmissionPolicy" and "K8s.io/client-go" are missing a closing parenthesis before their trailing "[SIG API Machinery]" suffix; update each bullet so the PR link + author group is properly closed (add the missing ")" after the author link in the "([`#135560`]..., [`@lalitc375`]...)" and "([`#135580`]..., [`@serathius`]...)" groups) so the markdown renders correctly.
🧹 Nitpick comments (2)
build/common.sh (1)
343-353: Update stale comment to match the new dockerized cache handling.Line 347 says dockerized builds don’t need this chmod, but Lines 351-353 now do exactly that. Please align the comment to avoid future confusion.
Suggested diff
- # We don't need to do this at all for dockerized builds + # Dockerized builds can also leave go/cache non-writable on the host. + # Handle that path explicitly as well.As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@build/common.sh` around lines 343 - 353, Update the stale comment above the chmod blocks so it accurately describes current behavior: change the sentence that claims "We don't need to do this at all for dockerized builds" to reflect that both local and dockerized cache directories are now explicitly made writable; reference the existing variables/paths LOCAL_OUTPUT_ROOT/local/go/cache and LOCAL_OUTPUT_ROOT/dockerized/go/cache and ensure the comment explains why only these cache paths (not the whole output root) are chmodded.pkg/controller/devicetainteviction/device_taint_eviction_test.go (1)
1998-2002: Remove the unconditionalfmt.Printlnfrom the test loop.This will spam stdout on every run and makes CI output harder to read. If the trace is still useful, route it through
tContext.Logfinstead.Suggested cleanup
for _, item := range tContext.mockQueue.State().Later { - fmt.Println(item.Item, item.Duration) tContext.mockQueue.CancelAfter(item.Item) tContext.mockQueue.AddAfter(item.Item, item.Duration-state.advance) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/devicetainteviction/device_taint_eviction_test.go` around lines 1998 - 2002, Remove the unconditional fmt.Println in the test loop that iterates over tContext.mockQueue.State().Later; instead either delete that print or replace it with a test-friendly logger call like tContext.Logf, keeping the existing CancelAfter and AddAfter calls (tContext.mockQueue.CancelAfter(item.Item) and tContext.mockQueue.AddAfter(item.Item, item.Duration-state.advance)) unchanged so test output no longer spams stdout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/devicetainteviction/device_taint_eviction.go`:
- Around line 1286-1288: The current approach only enqueues the rule with
tc.workqueue.Add(workItemForRule(newRule)) but does not guarantee its processing
runs before pod eviction work items; to fix, perform the required rule-status
update synchronously (call the controller method that updates rule status
directly, e.g., tc.syncRuleStatus or the same status-update routine used by the
worker) before releasing the mutex/enqueuing pods, or implement explicit
prioritization/serialization (e.g., a dedicated rule-priority queue or a
blocking wait until the rule-status work item is processed) instead of relying
on workqueue order; update both places where workItemForRule(newRule) is used
(the block around the tc.logger.V(5).Info call and the similar section at the
later occurrence) to ensure the status transition happens deterministically
before any pod-eviction work runs.
In
`@staging/src/k8s.io/apiextensions-apiserver/test/integration/finalization_test.go`:
- Around line 188-200: The current wait on the CRD Terminating condition (using
wait.PollUntilContextTimeout + apiextensionshelpers.IsCRDConditionTrue on the
fetched crd) only ensures the prerequisite is set, but the create path can still
race and observe stale state; wrap the Apply/create call (the code that attempts
to create the test CR) in a retry loop (e.g. wait.PollUntilContextTimeout or
wait.PollImmediateUntil) that repeatedly calls the same Apply/create until the
operation fails due to the CRD being terminating (the behavior under test),
rather than performing a single Apply; locate the Apply invocation in this test
and replace the single attempt with a short-interval poll that asserts the Apply
returns the expected blocked/error result before proceeding.
---
Outside diff comments:
In `@CHANGELOG/CHANGELOG-1.35.md`:
- Around line 679-680: Two changelog bullets referencing
"MutatingAdmissionPolicy" and "K8s.io/client-go" are missing a closing
parenthesis before their trailing "[SIG API Machinery]" suffix; update each
bullet so the PR link + author group is properly closed (add the missing ")"
after the author link in the "([`#135560`]..., [`@lalitc375`]...)" and
"([`#135580`]..., [`@serathius`]...)" groups) so the markdown renders correctly.
---
Nitpick comments:
In `@build/common.sh`:
- Around line 343-353: Update the stale comment above the chmod blocks so it
accurately describes current behavior: change the sentence that claims "We don't
need to do this at all for dockerized builds" to reflect that both local and
dockerized cache directories are now explicitly made writable; reference the
existing variables/paths LOCAL_OUTPUT_ROOT/local/go/cache and
LOCAL_OUTPUT_ROOT/dockerized/go/cache and ensure the comment explains why only
these cache paths (not the whole output root) are chmodded.
In `@pkg/controller/devicetainteviction/device_taint_eviction_test.go`:
- Around line 1998-2002: Remove the unconditional fmt.Println in the test loop
that iterates over tContext.mockQueue.State().Later; instead either delete that
print or replace it with a test-friendly logger call like tContext.Logf, keeping
the existing CancelAfter and AddAfter calls
(tContext.mockQueue.CancelAfter(item.Item) and
tContext.mockQueue.AddAfter(item.Item, item.Duration-state.advance)) unchanged
so test output no longer spams stdout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e35431c4-4c05-4c7d-b9a0-3346d696384b
⛔ Files ignored due to path filters (2)
staging/src/k8s.io/sample-apiserver/pkg/generated/clientset/versioned/fake/clientset_generated.gois excluded by!**/generated/**staging/src/k8s.io/sample-controller/pkg/generated/clientset/versioned/fake/clientset_generated.gois excluded by!**/generated/**
📒 Files selected for processing (22)
CHANGELOG/CHANGELOG-1.35.mdbuild/common.shbuild/dependencies.yamlcmd/kubeadm/app/cmd/phases/reset/unmount_linux.gocmd/kubeadm/app/util/etcd/etcd.goopenshift-hack/images/hyperkube/Dockerfile.rhelpkg/controller/devicetainteviction/device_taint_eviction.gopkg/controller/devicetainteviction/device_taint_eviction_test.gostaging/src/k8s.io/apiextensions-apiserver/examples/client-go/pkg/client/clientset/versioned/fake/clientset_generated.gostaging/src/k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset/fake/clientset_generated.gostaging/src/k8s.io/apiextensions-apiserver/test/integration/finalization_test.gostaging/src/k8s.io/client-go/kubernetes/fake/clientset_generated.gostaging/src/k8s.io/code-generator/cmd/client-gen/generators/fake/generator_fake_for_clientset.gostaging/src/k8s.io/code-generator/examples/HyphenGroup/clientset/versioned/fake/clientset_generated.gostaging/src/k8s.io/code-generator/examples/MixedCase/clientset/versioned/fake/clientset_generated.gostaging/src/k8s.io/code-generator/examples/apiserver/clientset/versioned/fake/clientset_generated.gostaging/src/k8s.io/code-generator/examples/crd/clientset/versioned/fake/clientset_generated.gostaging/src/k8s.io/code-generator/examples/single/clientset/versioned/fake/clientset_generated.gostaging/src/k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset/fake/clientset_generated.gostaging/src/k8s.io/metrics/pkg/client/clientset/versioned/fake/clientset_generated.gotest/e2e/node/pods.gotest/integration/dra/binding_conditions_test.go
💤 Files with no reviewable changes (4)
- staging/src/k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset/fake/clientset_generated.go
- staging/src/k8s.io/metrics/pkg/client/clientset/versioned/fake/clientset_generated.go
- staging/src/k8s.io/code-generator/examples/apiserver/clientset/versioned/fake/clientset_generated.go
- build/dependencies.yaml
| // Update the status at least once, immediately and before evicting any pods. | ||
| tc.logger.V(5).Info("Adding immediate status update because of new rule", "deviceTaintRule", klog.KObj(newRule)) | ||
| tc.workqueue.Add(workItemForRule(newRule)) |
There was a problem hiding this comment.
Queueing the rule item first does not guarantee the status update runs first.
These lines only change enqueue order. Once Run() uses multiple workers, a pod work item can still execute before the rule-status item after the mutex is released, so the promised “before evicting any pods” transition remains racy. The test workaround in pkg/controller/devicetainteviction/device_taint_eviction_test.go Lines 2429-2431 confirms the same ordering problem. If that sequencing matters, it needs explicit serialization/prioritization instead of relying on shared-queue order alone.
Also applies to: 1301-1307
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/devicetainteviction/device_taint_eviction.go` around lines
1286 - 1288, The current approach only enqueues the rule with
tc.workqueue.Add(workItemForRule(newRule)) but does not guarantee its processing
runs before pod eviction work items; to fix, perform the required rule-status
update synchronously (call the controller method that updates rule status
directly, e.g., tc.syncRuleStatus or the same status-update routine used by the
worker) before releasing the mutex/enqueuing pods, or implement explicit
prioritization/serialization (e.g., a dedicated rule-priority queue or a
blocking wait until the rule-status work item is processed) instead of relying
on workqueue order; update both places where workItemForRule(newRule) is used
(the block around the tc.logger.V(5).Info call and the similar section at the
later occurrence) to ensure the status transition happens deterministically
before any pod-eviction work runs.
| // Wait for the CRD to have the Terminating condition set to True. | ||
| // The handler checks IsCRDConditionTrue(crd, apiextensionsv1.Terminating) to block | ||
| // CR creation, and this condition is set asynchronously by the CRD finalizer controller | ||
| // after it observes the DeletionTimestamp. Without this wait, the Apply could succeed | ||
| // if it races ahead of the controller setting the condition. | ||
| err = wait.PollUntilContextTimeout(t.Context(), 100*time.Millisecond, wait.ForeverTestTimeout, true, func(ctx context.Context) (bool, error) { | ||
| crd, err := apiExtensionClient.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, noxuDefinition.Name, metav1.GetOptions{}) | ||
| if err != nil { | ||
| return false, err | ||
| } | ||
| return apiextensionshelpers.IsCRDConditionTrue(crd, apiextensionsv1.Terminating), nil | ||
| }) | ||
| require.NoError(t, err, "timed out waiting for CRD Terminating condition to be set") |
There was a problem hiding this comment.
Single Apply can still race the terminating-state propagation.
Line 193 only waits for a direct CRD Get to see Terminating=True. The create path observes CRD state asynchronously, so Line 208 can still hit the handler before that update is visible there and intermittently succeed. The old retry loop synchronized on the behavior under test; this version only synchronizes on a prerequisite.
Suggested fix
- _, err = noxuResourceClient.Apply(t.Context(), name, instance, metav1.ApplyOptions{DryRun: []string{"All"}, FieldManager: "manager"})
- wantErr := `create not allowed while custom resource definition is terminating`
- require.ErrorContains(t, err, wantErr)
+ wantErr := `create not allowed while custom resource definition is terminating`
+ var applyErr error
+ err = wait.PollUntilContextTimeout(t.Context(), 100*time.Millisecond, wait.ForeverTestTimeout, true, func(ctx context.Context) (bool, error) {
+ _, applyErr = noxuResourceClient.Apply(ctx, name, instance, metav1.ApplyOptions{DryRun: []string{"All"}, FieldManager: "manager"})
+ return applyErr != nil, nil
+ })
+ require.NoError(t, err)
+ require.ErrorContains(t, applyErr, wantErr)Also applies to: 208-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@staging/src/k8s.io/apiextensions-apiserver/test/integration/finalization_test.go`
around lines 188 - 200, The current wait on the CRD Terminating condition (using
wait.PollUntilContextTimeout + apiextensionshelpers.IsCRDConditionTrue on the
fetched crd) only ensures the prerequisite is set, but the create path can still
race and observe stale state; wrap the Apply/create call (the code that attempts
to create the test CR) in a retry loop (e.g. wait.PollUntilContextTimeout or
wait.PollImmediateUntil) that repeatedly calls the same Apply/create until the
operation fails due to the CRD being terminating (the behavior under test),
rather than performing a single Apply; locate the Apply invocation in this test
and replace the single attempt with a short-interval poll that asserts the Apply
returns the expected blocked/error result before proceeding.
|
/retest |
|
/test e2e-aws-ovn-runc |
|
@jubittajohn: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Bump release-4.22/master from v1.35.2 from v1.35.3