
fix: consider only load on available nodes for utilisation#288

Merged
dtnyn merged 2 commits into atlassian:master from tomwwright:fix/untainted-pods-capacity-calculation
Mar 31, 2026

Conversation

@tomwwright
Contributor

@tomwwright tomwwright commented Mar 26, 2026

Important

This PR is an alternative implementation to fix the issue described in #285

What 🎯

Introduces a node group option exclude_tainted_node_pods that configures Escalator to ignore the resource demand of pods on nodes that are unavailable for scheduling: nodes that are tainted, force tainted, or cordoned.

Unscheduled pods continue to count towards demand calculation.

Why 💭

Escalator currently computes utilisation as follows:

 resources used on nodes (untainted, tainted, cordoned)
---------------------------------------------------------
         resources available on untainted nodes

Limiting the denominator to untainted nodes while the numerator counts all nodes causes utilisation to fluctuate as Escalator taints and untaints nodes.

As described in #285

We observed this in production on a cluster running ~92 bare-metal nodes (m6id.metal, 128 vCPU each). During a demand transition:

  • Utilisation dropped to 47.8%, triggering tainting at 3 nodes per 30-second cycle
  • Over 11 minutes, 24 nodes were tainted
  • The shrinking denominator caused a 26-point utilisation spike in a single scan cycle (48% to 74.5%) with no change in actual demand
  • Escalator reversed and untainted 22 of the 24 nodes, but 12 EC2 instances had already been terminated and replaced
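The shrinking-denominator effect can be sketched with round numbers. The figures below are illustrative, not the exact production values; only the node size (128 vCPU) and node count (92) come from the incident above:

```go
package main

import "fmt"

// utilisation models Escalator's current formula: resource demand across
// ALL nodes divided by the capacity of untainted nodes only.
func utilisation(usedAllNodes, untaintedCapacity float64) float64 {
	return usedAllNodes / untaintedCapacity * 100
}

func main() {
	// Illustrative: 92 nodes of 128 vCPU each, demand held constant.
	const perNodeCPU = 128.0
	demand := 5632.0 // vCPU requested across the whole node group

	before := utilisation(demand, 92*perNodeCPU) // all nodes in service
	after := utilisation(demand, 68*perNodeCPU)  // 24 nodes tainted

	fmt.Printf("before: %.1f%% after: %.1f%%\n", before, after)
	// before: 47.8% after: 64.7%
}
```

Demand never changes, yet reported utilisation jumps ~17 points purely because tainting shrank the denominator while the numerator still counted load on the tainted nodes.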

Whereas #285 fixed the issue by expanding the denominator, this PR fixes it by filtering the numerator.

The rationale is that we want to measure utilisation from the scheduler's perspective. The scheduler's view of capacity covers only "in service" nodes where it can place pods, so tainted and cordoned nodes are not relevant, and neither is the current level of demand on those nodes.

      resources used on untainted nodes
---------------------------------------------
   resources available on untainted nodes

If a tainted or cordoned node is highly utilised or underutilised, that fact will enter the equation when and if Escalator returns the node to service.
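The numerator filter can be sketched as follows. The types and function here are simplified stand-ins, not Escalator's actual structures; the helper mirrors the role k8s.FilterPodsByNode plays in this PR, including keeping unscheduled pods in the demand calculation:

```go
package main

import "fmt"

// Node and Pod are simplified stand-ins for the Kubernetes API types.
type Node struct {
	Name string
}

type Pod struct {
	Name     string
	NodeName string // empty if the pod is not yet scheduled
}

// filterPodsByNode keeps pods bound to one of the given (untainted) nodes,
// plus unscheduled pods, which still count towards demand.
func filterPodsByNode(pods []Pod, nodes []Node) []Pod {
	onNode := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		onNode[n.Name] = true
	}
	var kept []Pod
	for _, p := range pods {
		if p.NodeName == "" || onNode[p.NodeName] {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	untainted := []Node{{Name: "node-a"}}
	pods := []Pod{
		{Name: "p1", NodeName: "node-a"},
		{Name: "p2", NodeName: "node-b-tainted"}, // excluded from demand
		{Name: "p3"},                             // pending, still counted
	}
	fmt.Println(len(filterPodsByNode(pods, untainted))) // 2
}
```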



@atlassian-cla-bot

Thank you for your submission! Like many open source projects, we ask that you sign our CLA (Contributor License Agreement) before we can accept your contribution.
If your email is listed below, please ensure that you sign the CLA with the same email address.

The following users still need to sign our CLA:
❌tomwwright
❌tomwwright-atlassian

Already signed the CLA? To re-check, try refreshing the page.

@tomwwright tomwwright marked this pull request as ready for review March 26, 2026 06:30
}))
}
}
controller, nodeGroupsState := buildController(t, nodes, pods, false)
Collaborator

should this be

controller, nodeGroupsState := buildController(t, nodes, pods, true)

instead?
since this mentions "flag enabled - accurate utilisation, no scale up". At a cursory glance this looks like it would fail: the 50 pods (25 on untainted + 25 on tainted) produce 10000 CPU / 10000 untainted capacity = 100%, which triggers scale-up.
The assertion is assert.Equal(t, 0, nodesDelta), so this test would actually fail as written?

Contributor Author

Thanks, will review

Contributor Author

Great catch, something went astray during a refactor!

Comment thread pkg/controller/controller.go Outdated
// Filter to pods on untainted nodes
if nodeGroup.Opts.ExcludeUnavailableNodePods {
untaintedPods := k8s.FilterPodsByNode(pods, untaintedNodes, true)
log.WithField("nodegroup", nodegroup).Infof("pods total: %v (%v excluded on unavailable nodes)", len(pods), len(pods)-len(untaintedPods))
Collaborator

Kinda same comment as https://github.com/atlassian/escalator/pull/285/changes#r2979812367
Should this be at info level? This log will run on every tick and could be very noisy. Could it be changed to debug level, or only logged when len(pods) - len(untaintedPods) > 0?

Contributor Author

This is the existing log line, which was already at info level, just enriched with the exclusion data (see the else branch) - does that change your thoughts on it?

metrics.NodeGroupNodesUntainted.WithLabelValues(nodegroup).Set(float64(len(untaintedNodes)))
metrics.NodeGroupNodesTainted.WithLabelValues(nodegroup).Set(float64(len(taintedNodes)))
metrics.NodeGroupNodesForceTainted.WithLabelValues(nodegroup).Set(float64(len(forceTaintedNodes)))
metrics.NodeGroupPods.WithLabelValues(nodegroup).Set(float64(len(pods)))
Collaborator

After the feature flag this will count only pods on untainted nodes, which is the correct behaviour; however, it potentially means we lose visibility of pods on the tainted nodes.

I would suggest adding a separate metric for pods on tainted nodes.
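The numbers such a metric would report can be produced by a simple partition of the pods by node set. This stdlib-only sketch shows the counts a hypothetical gauge (e.g. a NodeGroupPodsOnTaintedNodes, following the existing metrics naming, which is an assumption here, not code from this PR) would be Set with each tick:

```go
package main

import "fmt"

// countPodsByNodeSet splits pods by whether the node each is bound to is
// in the tainted set. These two counts would feed separate gauges so that
// pods on tainted nodes stay visible after the feature flag is enabled.
func countPodsByNodeSet(podNodes []string, tainted map[string]bool) (onTainted, onUntainted int) {
	for _, n := range podNodes {
		if tainted[n] {
			onTainted++
		} else {
			onUntainted++
		}
	}
	return
}

func main() {
	tainted := map[string]bool{"node-b": true}
	podNodes := []string{"node-a", "node-a", "node-b"} // node each pod runs on
	t, u := countPodsByNodeSet(podNodes, tainted)
	fmt.Println(t, u) // 1 2
}
```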

Contributor Author

Will do, good thinking

@tomwwright tomwwright force-pushed the fix/untainted-pods-capacity-calculation branch from 9b22bc7 to 7f253f2 on March 29, 2026 23:44
@tomwwright tomwwright requested a review from dtnyn March 29, 2026 23:48
@dtnyn dtnyn closed this Mar 31, 2026
@dtnyn dtnyn reopened this Mar 31, 2026
@tomwwright tomwwright force-pushed the fix/untainted-pods-capacity-calculation branch from 7f253f2 to 5097d3f on March 31, 2026 00:31
@dtnyn dtnyn merged commit 35cad53 into atlassian:master Mar 31, 2026
3 checks passed