fix: improve NodePool weight fallback and add pre-launch circuit breaker for quota/capacity errors #1433

Open

GuetaHen wants to merge 2 commits into Azure:main from GuetaHen:feature/quota-aware-nodepool-fallback

Conversation

@GuetaHen commented Feb 17, 2026

Fixes #1323

Summary

This PR addresses a critical issue where Karpenter fails to fall back to lower-weight NodePools when higher-weight pools encounter quota exhaustion or capacity outages during large-scale scheduling.

Problem

When a high-weight NodePool's SKU family hits Azure quota limits or capacity outages:

  1. The ICE cache TTL (3 min default) for non-zero quota errors was too short, causing offerings to recycle and be retried before the scheduler could exhaust all SKUs and fall through to lower-weight pools
  2. Regional quota errors updated no cache at all, leading to infinite retry loops
  3. During large scale-up events (e.g., 5000 cores), all NodeClaims launched simultaneously would hit the same failed SKUs, wasting dozens of Azure API calls

Changes

Commit 1: Fix ICE cache gaps for quota errors (commonerrorhandlers.go)

  • Non-zero SKU family quota errors now use 15-minute TTL (was 3 min default), preventing the recycling loop
  • Regional quota errors now update the ICE cache with 30-minute TTL before returning InsufficientCapacityError
  • Added SKUFamilyQuotaNonZeroTTL and RegionalQuotaExhaustedTTL constants
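
The TTL differentiation above can be sketched as a minimal in-memory cache. This is an illustrative sketch, not the PR's implementation: the constant names `SKUFamilyQuotaNonZeroTTL` and `RegionalQuotaExhaustedTTL` and their values (15 min, 30 min) come from this PR's description, while the cache type, key format, and method names are hypothetical stand-ins for Karpenter's `unavailableOfferings` cache.

```go
package main

import (
	"fmt"
	"time"
)

// TTLs per quota-error class. The default 3 min was too short for the scheduler
// to exhaust a high-weight NodePool's SKUs before offerings recycled.
const (
	DefaultUnavailableTTL     = 3 * time.Minute  // previous default for all quota errors
	SKUFamilyQuotaNonZeroTTL  = 15 * time.Minute // non-zero SKU family quota exhaustion
	RegionalQuotaExhaustedTTL = 30 * time.Minute // regional quota exhaustion
)

// iceEntry marks an offering unavailable until its deadline passes.
type iceEntry struct{ until time.Time }

// iceCache is a minimal stand-in for the ICE (Insufficient Capacity Error) cache.
type iceCache struct{ entries map[string]iceEntry }

func newICECache() *iceCache { return &iceCache{entries: map[string]iceEntry{}} }

// MarkUnavailable records an offering (hypothetical sku/zone/capacity-type key)
// with a TTL chosen by the caller based on the error class.
func (c *iceCache) MarkUnavailable(key string, ttl time.Duration) {
	c.entries[key] = iceEntry{until: time.Now().Add(ttl)}
}

// IsUnavailable reports whether the offering is still inside its TTL window.
func (c *iceCache) IsUnavailable(key string) bool {
	e, ok := c.entries[key]
	return ok && time.Now().Before(e.until)
}

func main() {
	cache := newICECache()
	// A regional quota error now updates the cache before the error is returned,
	// so later scheduling loops skip the exhausted offering instead of retrying it.
	cache.MarkUnavailable("Standard_D4s_v5/westus2-1/on-demand", RegionalQuotaExhaustedTTL)
	fmt.Println(cache.IsUnavailable("Standard_D4s_v5/westus2-1/on-demand")) // true
	fmt.Println(cache.IsUnavailable("Standard_D2s_v3/westus2-1/on-demand")) // false
}
```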

Commit 2: Pre-launch circuit breaker (offerings.go, vminstance.go, aksmachineinstance.go)

  • New PreLaunchFilter() re-checks instance types against the live ICE cache at launch time
  • New NewLiveCacheAvailabilityCheck() helper avoids code duplication between VM and AKS Machine providers
  • Acts as a circuit breaker: once the first failure updates the cache, subsequent NodeClaims skip the failed SKU immediately
  • Fail-open design: if cache lookup fails, all instance types remain available
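
The circuit-breaker behavior can be sketched as below. This is a hypothetical sketch of what `offerings.PreLaunchFilter()` does per the description above, not the PR's code: the function signature, the `availabilityCheck` type, and the SKU names are assumptions; only the fail-open semantics and the "re-check the live cache before launching" idea come from the PR.

```go
package main

import "fmt"

// availabilityCheck reports whether an offering key is currently marked
// unavailable in the live ICE cache; nil means the cache could not be consulted.
type availabilityCheck func(key string) bool

// preLaunchFilter drops instance types the live ICE cache already knows are
// unavailable, right before the Azure API call. Fail-open: if no usable check
// is provided, every candidate stays launchable.
func preLaunchFilter(candidates []string, unavailable availabilityCheck) []string {
	if unavailable == nil {
		return candidates // fail-open: never block launches on a cache failure
	}
	out := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if !unavailable(c) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Once the first NodeClaim's launch failure has marked a SKU in the cache,
	// subsequent NodeClaims skip it without making a doomed Azure API call.
	failed := map[string]bool{"Standard_D4s_v5": true}
	check := func(k string) bool { return failed[k] }
	fmt.Println(preLaunchFilter([]string{"Standard_D4s_v5", "Standard_D2s_v3"}, check))
}
```

The fail-open branch matches the design note above: a broken cache lookup degrades to the old behavior (redundant API calls) rather than blocking scale-up entirely.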

Impact

| Scenario | Before | After |
| --- | --- | --- |
| Quota exhausted, fallback needed | Infinite recycling loop | Immediate fallback after 15 min cache |
| Regional quota hit | No cache update, repeated failures | 30 min cache, clean fallback |
| 5000-core scale-up | ~78 wasted API calls | ~1-3 wasted calls (race window only) |

Tests

  • 3 new tests for error handler changes (regional quota on-demand/spot, SKU family non-zero TTL)
  • 11 new tests for PreLaunchFilter (nil fail-open, all/partial/no availability, multi-zone, spot/on-demand, large-scale circuit breaker, sequential progressive filtering scenario)

All existing tests pass. No new Azure API dependencies.

Hen Goueta (from Dev Box) added 2 commits February 17, 2026 12:29
…ota errors

This change addresses the issue where Karpenter fails to fall back to lower-weight
NodePools when higher-weight pools are quota-exhausted. Three specific gaps are fixed:

1. SKU family quota (non-zero limit): Previously used the default 3-minute TTL,
   causing offerings to recycle back into the available pool before the scheduler
   could exhaust all SKUs in the high-weight NodePool and fall through to lower-weight
   pools. Now uses 15-minute TTL (SKUFamilyQuotaNonZeroTTL).

2. Regional quota exhausted: Previously returned InsufficientCapacityError without
   updating the ICE cache, so subsequent scheduling loops would keep selecting instance
   types from the exhausted capacity type. Now marks offerings as unavailable in the
   cache with a 30-minute TTL before returning the error.

3. Added new TTL constants: SKUFamilyQuotaNonZeroTTL (15m) and RegionalQuotaExhaustedTTL
   (30m) to differentiate between transient and persistent quota exhaustion.

Tests added:
- Regional quota exceeded for on-demand marks all zones unavailable
- Regional quota exceeded for spot marks all spot unavailable
- SKU family quota non-zero limit uses longer TTL to prevent recycling

Signed-off-by: Hicham Engoueta <hengoueta@microsoft.com>
… calls

Adds a PreLaunchFilter that re-checks instance types against the live ICE
(Insufficient Capacity Error) cache right before making Azure API calls.

This acts as a circuit breaker during large-scale scheduling (e.g., 5000 cores):
when the scheduler creates many NodeClaims simultaneously, the first VM creation
failure updates the ICE cache, and subsequent NodeClaims skip the failed SKU
immediately instead of making redundant API calls that are guaranteed to fail.

Without this: ~78 wasted API calls, all fail with AllocationFailed/quota errors.
With this: first ~5-10 calls may fail (race window), then remaining skip instantly.

Integration:
- New function: offerings.PreLaunchFilter() with fail-open design
- Called in both DefaultVMProvider.BeginCreate() and DefaultAKSMachineProvider.BeginCreate()
- Uses the existing ICE cache (unavailableOfferings) from the error handler
- No new Azure API calls — purely in-memory cache re-check

Signed-off-by: Hicham Engoueta <hengoueta@microsoft.com>
@GuetaHen (Author)

@microsoft-github-policy-service agree company="Microsoft"

@GuetaHen (Author)

This PR fixes #1323

Development

Successfully merging this pull request may close: Improve handling on quota failures (or avoid them altogether)