fix: pre-populate rule cache on first NodeReconciler pass to prevent bootstrap-only taint bypass#254
Conversation
✅ Deploy Preview for node-readiness-controller canceled.
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Karman580 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @Karman580. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
|
…bootstrap-only taint bypass RuleReadinessController maintains an in-memory ruleCache populated lazily as RuleReconciler processes each NodeReadinessRule. NodeReconciler derives applicable rules exclusively from this cache. Both controllers start consuming their work queues concurrently after informer sync, creating a race where node events are processed against an empty or partially-warm cache. For continuous enforcement this window is self-healing. For bootstrap-only rules it is permanent: if a node is evaluated before its applicable rule reaches the cache the taint is never applied, the bootstrap completion annotation is never written, and the admission gate is silently bypassed. Add ensureCacheWarmed to RuleReadinessController. On the first call it lists all NodeReadinessRules from the already-synced informer cache and adds any rule not yet present in ruleCache. Subsequent calls are no-ops guarded by an atomic.Bool. Existing cache entries (including rules already marked for deletion by RuleReconciler) are intentionally preserved so that in-flight state is not overwritten. Call ensureCacheWarmed at the top of NodeReconciler.Reconcile before any rule evaluation takes place. Add two integration tests covering the cold-start scenario: - verifies that a bootstrap-only taint is applied even when ruleCache is empty at reconcile time (rule discovered via warm-up) - verifies that the warm-up is not repeated on subsequent reconciles
f9aba8b to
5c83d8f
Compare
|
recheck |
Description
RuleReadinessControllermaintains an in-memoryruleCachepopulated lazily asRuleReconcilerprocesses eachNodeReadinessRule.NodeReconcilerderives applicable rules exclusively from this cache. Both controllers start consuming their work queues concurrently after informer sync, creating a race where node events are processed against an empty or partially-warm cache.For
continuousenforcement this window is self-healing — the next condition, taint, or label change re-triggers evaluation and the controller converges. Forbootstrap-onlyrules it is a permanent correctness gap: if a node is evaluated before its applicable rule reaches the cache, the taint is never applied,markBootstrapCompletedis never reached, the bootstrap completion annotation is never written, and the intended admission gate is silently bypassed with no error surfaced.Root Cause
ensureCacheWarmeddid not exist. TheruleCachewas built entirely as a side effect ofRuleReconciler.Reconcile(). Although the controller-runtime manager guarantees informers are synced before controllers start, there was no mechanism to seedruleCachefrom the already-available informer data beforeNodeReconcilerbegan processing events.Fix
Add
ensureCacheWarmedtoRuleReadinessController. On the first call it lists allNodeReadinessRulesfrom the already-synced informer cache (no live API round-trip) and adds any rule not yet present inruleCache. Subsequent calls are no-ops guarded by anatomic.Bool. Existing cache entries — including rules already marked for deletion byRuleReconciler— are intentionally preserved so that in-flight state is not overwritten.ensureCacheWarmedis called at the top ofNodeReconciler.Reconcilebefore any rule evaluation takes place.Related Issue
Fixes #253
Type of Change
/kind bug
Testing
cold-start rule cache warm-upintegration test context innode_controller_test.gowith two cases:ruleCacheis empty at reconcile time (the rule is discovered via warm-up).cacheWarmedremains true).internal/controller73.5% → 73.9%.Checklist
make testpassesmake lintpassesDoes this PR introduce a user-facing change?