Skip to content

processNodeAgainstAllRules can write an empty NodeEvaluation that fails CRD validation and drops FailedNodes updates #217

@sahitya-chandra

Description

@sahitya-chandra

What happened

processNodeAgainstAllRules in internal/controller/node_controller.go always runs a status patch after evaluateRuleForNode, even when the evaluation returned an error. The patch builds the new NodeEvaluation entry by scanning the in-memory rule's Status.NodeEvaluations for one with the current node's name:

https://github.com/kubernetes-sigs/node-readiness-controller/blob/main/internal/controller/node_controller.go#L159-L206

currEval := readinessv1alpha1.NodeEvaluation{}
for _, eval := range rule.Status.NodeEvaluations {
    if eval.NodeName == node.Name {
        currEval = eval
        break
    }
}

found := false
for i := range latestRule.Status.NodeEvaluations {
    if latestRule.Status.NodeEvaluations[i].NodeName == node.Name {
        latestRule.Status.NodeEvaluations[i] = currEval
        found = true
        break
    }
}
if !found {
    latestRule.Status.NodeEvaluations = append(
        latestRule.Status.NodeEvaluations,
        currEval,
    )
}

If evaluateRuleForNode exits early on an error (taint patch failure such as a non-conflict API error or exhausted conflict retries), it never reaches updateNodeEvaluationStatus, so the in-memory rule.Status.NodeEvaluations has no entry for this node when the cache copy didn't already contain one. currEval then stays the zero value NodeEvaluation{NodeName: \"\"}, and the loop above either:

  • appends that empty entry to latestRule.Status.NodeEvaluations, or
  • overwrites a previously valid entry with it.

NodeEvaluation.NodeName is annotated with +kubebuilder:validation:MinLength=1 and a hostname pattern in api/v1alpha1/nodereadinessrule_types.go, so the API server rejects the entire Status().Patch with HTTP 422. RetryOnConflict doesn't retry on validation errors, so the patch fails permanently for this iteration.

The same patch carries the FailedNodes update built up on lines 192-203 - i.e. the very record an operator needs to debug the taint failure. So the failure path leaves both the NodeEvaluation and the FailedNodes entry unwritten, and the only signal is a log line on the controller pod.

What you expected to happen

When evaluateRuleForNode fails:

  • the NodeEvaluation status for the node should not be overwritten with an empty entry, and
  • the FailedNodes update should still be persisted so operators can see the failure via kubectl get nodereadinessrule -o yaml.

How to reproduce

  1. Create a NodeReadinessRule that targets a node and apply the rule so the controller starts evaluating.
  2. Cause addTaintBySpec / removeTaintBySpec to return a non-conflict error persistently for one node (e.g. RBAC denied for nodes patch, or simulate via webhook). The taint operation will exhaust retries and evaluateRuleForNode will return an error.
  3. Observe controller logs:
    • "Failed to evaluate rule for node" is logged.
    • Then "Failed to update rule status after node evaluation" is logged with a 422 from the API server about nodeName not matching the required pattern / minimum length.
  4. kubectl get nodereadinessrule <name> -o yaml shows no FailedNodes entry for the node, even though the controller hit a failure.

Suggested fix

When currEval.NodeName == \"\" (i.e. the in-memory rule had no evaluation for this node), skip the NodeEvaluation upsert and only update FailedNodes. Keeping the existing valid NodeEvaluation untouched is the correct behavior on a failed evaluation, and it lets the FailedNodes update through.

/kind bug

Metadata

Metadata

Labels

kind/bugCategorizes issue or PR as related to a bug.

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions