
fix: consider only load on available nodes for utilisation#288

Merged
dtnyn merged 2 commits into atlassian:master from tomwwright:fix/untainted-pods-capacity-calculation
Mar 31, 2026

Conversation

@tomwwright
Contributor

@tomwwright tomwwright commented Mar 26, 2026

Important

This PR is an alternative implementation to fix the issue described in #285

What 🎯

Introduces a node group option exclude_tainted_node_pods that configures Escalator to ignore the resource demand of pods on nodes that are unavailable for scheduling: nodes that are tainted, force tainted, or cordoned.

Unscheduled pods continue to count towards demand calculation.

Why 💭

Escalator currently computes utilisation as follows:

 resources used on nodes (untainted, tainted, cordoned)
---------------------------------------------------------
         resources available on untainted nodes

Limiting the denominator to untainted nodes while the numerator counts all nodes causes utilisation to fluctuate as Escalator taints and untaints nodes.

As described in #285

We observed this in production on a cluster running ~92 bare-metal nodes (m6id.metal, 128 vCPU each). During a demand transition:

  • Utilisation dropped to 47.8%, triggering tainting at 3 nodes per 30-second cycle
  • Over 11 minutes, 24 nodes were tainted
  • The shrinking denominator caused a 26-point utilisation spike in a single scan cycle (48% to 74.5%) with no change in actual demand
  • Escalator reversed and untainted 22 of the 24 nodes, but 12 EC2 instances had already been terminated and replaced
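The shrinking-denominator effect can be sketched with round numbers. The figures below are illustrative, not the exact production values; only the node size (128 vCPU) and node count (92) come from the incident above:

```go
package main

import "fmt"

// utilisation models Escalator's current formula: resource demand across
// ALL nodes divided by the capacity of untainted nodes only.
func utilisation(usedAllNodes, untaintedCapacity float64) float64 {
	return usedAllNodes / untaintedCapacity * 100
}

func main() {
	// Illustrative: 92 nodes of 128 vCPU each, demand held constant.
	const perNodeCPU = 128.0
	demand := 5632.0 // vCPU requested across the whole node group

	before := utilisation(demand, 92*perNodeCPU) // all nodes in service
	after := utilisation(demand, 68*perNodeCPU)  // 24 nodes tainted

	fmt.Printf("before: %.1f%% after: %.1f%%\n", before, after)
	// before: 47.8% after: 64.7%
}
```

Demand never changes, yet reported utilisation jumps ~17 points purely because tainting shrank the denominator while the numerator still counted load on the tainted nodes.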

Whereas #285 fixed the issue by expanding the denominator, this PR fixes it by filtering the numerator.

The rationale is that we want to measure utilisation from the scheduler's perspective. The scheduler's view of capacity covers only "in service" nodes where it can place pods, so tainted and cordoned nodes are not relevant, and neither is the current level of demand on those nodes.

      resources used on untainted nodes
---------------------------------------------
   resources available on untainted nodes

If a tainted or cordoned node is highly utilised or underutilised, that fact will enter the equation when and if Escalator returns the node to service.
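The numerator filter can be sketched as follows. The types and function here are simplified stand-ins, not Escalator's actual structures; the helper mirrors the role k8s.FilterPodsByNode plays in this PR, including keeping unscheduled pods in the demand calculation:

```go
package main

import "fmt"

// Node and Pod are simplified stand-ins for the Kubernetes API types.
type Node struct {
	Name string
}

type Pod struct {
	Name     string
	NodeName string // empty if the pod is not yet scheduled
}

// filterPodsByNode keeps pods bound to one of the given (untainted) nodes,
// plus unscheduled pods, which still count towards demand.
func filterPodsByNode(pods []Pod, nodes []Node) []Pod {
	onNode := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		onNode[n.Name] = true
	}
	var kept []Pod
	for _, p := range pods {
		if p.NodeName == "" || onNode[p.NodeName] {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	untainted := []Node{{Name: "node-a"}}
	pods := []Pod{
		{Name: "p1", NodeName: "node-a"},
		{Name: "p2", NodeName: "node-b-tainted"}, // excluded from demand
		{Name: "p3"},                             // pending, still counted
	}
	fmt.Println(len(filterPodsByNode(pods, untainted))) // 2
}
```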



@atlassian-cla-bot

Thank you for your submission! Like many open source projects, we ask that you sign our CLA (Contributor License Agreement) before we can accept your contribution.
If your email is listed below, please ensure that you sign the CLA with the same email address.

The following users still need to sign our CLA:
❌tomwwright
❌tomwwright-atlassian

Already signed the CLA? To re-check, try refreshing the page.

@tomwwright tomwwright marked this pull request as ready for review March 26, 2026 06:30
}))
}
}
controller, nodeGroupsState := buildController(t, nodes, pods, false)
Collaborator

should this be

controller, nodeGroupsState := buildController(t, nodes, pods, true)

instead?
since this mentions "flag enabled - accurate utilisation, no scale up". At a cursory glance this looks like it would fail: the 50 pods (25 on untainted + 25 on tainted) produce 10000 CPU / 10000 untainted capacity = 100%, which triggers scale-up.
The assertion is assert.Equal(t, 0, nodesDelta), so this test would actually fail as written?

Contributor Author

Thanks, will review

Contributor Author

Great catch, something went astray during a refactor!

Comment thread pkg/controller/controller.go Outdated
// Filter to pods on untainted nodes
if nodeGroup.Opts.ExcludeUnavailableNodePods {
untaintedPods := k8s.FilterPodsByNode(pods, untaintedNodes, true)
log.WithField("nodegroup", nodegroup).Infof("pods total: %v (%v excluded on unavailable nodes)", len(pods), len(pods)-len(untaintedPods))
Collaborator

Kinda same comment as https://github.com/atlassian/escalator/pull/285/changes#r2979812367
Should this be at info level? This log will run on every tick and could be very noisy. Could it be changed to debug level, or only logged when len(pods) - len(untaintedPods) > 0?

Contributor Author

This is the existing log line, which was already at info level, just enriched with the exclusion data (see the else branch) - does that change your thoughts on it?

metrics.NodeGroupNodesUntainted.WithLabelValues(nodegroup).Set(float64(len(untaintedNodes)))
metrics.NodeGroupNodesTainted.WithLabelValues(nodegroup).Set(float64(len(taintedNodes)))
metrics.NodeGroupNodesForceTainted.WithLabelValues(nodegroup).Set(float64(len(forceTaintedNodes)))
metrics.NodeGroupPods.WithLabelValues(nodegroup).Set(float64(len(pods)))
Collaborator

After the feature flag this will count only pods on untainted nodes, which is the correct behaviour; however, it potentially means we lose visibility of pods on the tainted nodes.

I would suggest adding a separate metric for pods on tainted nodes.
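The numbers such a metric would report can be produced by a simple partition of the pods by node set. This stdlib-only sketch shows the counts a hypothetical gauge (e.g. a NodeGroupPodsOnTaintedNodes, following the existing metrics naming, which is an assumption here, not code from this PR) would be Set with each tick:

```go
package main

import "fmt"

// countPodsByNodeSet splits pods by whether the node each is bound to is
// in the tainted set. These two counts would feed separate gauges so that
// pods on tainted nodes stay visible after the feature flag is enabled.
func countPodsByNodeSet(podNodes []string, tainted map[string]bool) (onTainted, onUntainted int) {
	for _, n := range podNodes {
		if tainted[n] {
			onTainted++
		} else {
			onUntainted++
		}
	}
	return
}

func main() {
	tainted := map[string]bool{"node-b": true}
	podNodes := []string{"node-a", "node-a", "node-b"} // node each pod runs on
	t, u := countPodsByNodeSet(podNodes, tainted)
	fmt.Println(t, u) // 1 2
}
```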

Contributor Author

Will do, good thinking

@tomwwright tomwwright force-pushed the fix/untainted-pods-capacity-calculation branch from 9b22bc7 to 7f253f2 on March 29, 2026 23:44
@tomwwright tomwwright requested a review from dtnyn March 29, 2026 23:48
@dtnyn dtnyn closed this Mar 31, 2026
@dtnyn dtnyn reopened this Mar 31, 2026
@tomwwright tomwwright force-pushed the fix/untainted-pods-capacity-calculation branch from 7f253f2 to 5097d3f on March 31, 2026 00:31
@dtnyn dtnyn merged commit 35cad53 into atlassian:master Mar 31, 2026
3 checks passed