fix: improve NodePool weight fallback and add pre-launch circuit breaker for quota/capacity errors #1433

Open

GuetaHen wants to merge 2 commits into Azure:main from GuetaHen:feature/quota-aware-nodepool-fallback

Conversation

@GuetaHen commented Feb 17, 2026

Fixes #1323

Summary

This PR addresses a critical issue where Karpenter fails to fall back to lower-weight NodePools when higher-weight pools encounter quota exhaustion or capacity outages during large-scale scheduling.

Problem

When a high-weight NodePool's SKU family hits Azure quota limits or capacity outages:

  1. The ICE cache TTL (3 min default) for non-zero quota errors was too short, causing offerings to recycle and be retried before the scheduler could exhaust all SKUs and fall through to lower-weight pools
  2. Regional quota errors updated no cache at all, leading to infinite retry loops
  3. During large scale-up events (e.g., 5000 cores), all NodeClaims launched simultaneously would hit the same failed SKUs, wasting dozens of Azure API calls

Changes

Commit 1: Fix ICE cache gaps for quota errors (commonerrorhandlers.go)

  • Non-zero SKU family quota errors now use 15-minute TTL (was 3 min default), preventing the recycling loop
  • Regional quota errors now update the ICE cache with 30-minute TTL before returning InsufficientCapacityError
  • Added SKUFamilyQuotaNonZeroTTL and RegionalQuotaExhaustedTTL constants
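
The TTL differentiation above can be sketched as a minimal in-memory cache. This is an illustrative sketch, not the PR's implementation: the constant names `SKUFamilyQuotaNonZeroTTL` and `RegionalQuotaExhaustedTTL` and their values (15 min, 30 min) come from this PR's description, while the cache type, key format, and method names are hypothetical stand-ins for Karpenter's `unavailableOfferings` cache.

```go
package main

import (
	"fmt"
	"time"
)

// TTLs per quota-error class. The default 3 min was too short for the scheduler
// to exhaust a high-weight NodePool's SKUs before offerings recycled.
const (
	DefaultUnavailableTTL     = 3 * time.Minute  // previous default for all quota errors
	SKUFamilyQuotaNonZeroTTL  = 15 * time.Minute // non-zero SKU family quota exhaustion
	RegionalQuotaExhaustedTTL = 30 * time.Minute // regional quota exhaustion
)

// iceEntry marks an offering unavailable until its deadline passes.
type iceEntry struct{ until time.Time }

// iceCache is a minimal stand-in for the ICE (Insufficient Capacity Error) cache.
type iceCache struct{ entries map[string]iceEntry }

func newICECache() *iceCache { return &iceCache{entries: map[string]iceEntry{}} }

// MarkUnavailable records an offering (hypothetical sku/zone/capacity-type key)
// with a TTL chosen by the caller based on the error class.
func (c *iceCache) MarkUnavailable(key string, ttl time.Duration) {
	c.entries[key] = iceEntry{until: time.Now().Add(ttl)}
}

// IsUnavailable reports whether the offering is still inside its TTL window.
func (c *iceCache) IsUnavailable(key string) bool {
	e, ok := c.entries[key]
	return ok && time.Now().Before(e.until)
}

func main() {
	cache := newICECache()
	// A regional quota error now updates the cache before the error is returned,
	// so later scheduling loops skip the exhausted offering instead of retrying it.
	cache.MarkUnavailable("Standard_D4s_v5/westus2-1/on-demand", RegionalQuotaExhaustedTTL)
	fmt.Println(cache.IsUnavailable("Standard_D4s_v5/westus2-1/on-demand")) // true
	fmt.Println(cache.IsUnavailable("Standard_D2s_v3/westus2-1/on-demand")) // false
}
```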

Commit 2: Pre-launch circuit breaker (offerings.go, vminstance.go, aksmachineinstance.go)

  • New PreLaunchFilter() re-checks instance types against the live ICE cache at launch time
  • New NewLiveCacheAvailabilityCheck() helper avoids code duplication between VM and AKS Machine providers
  • Acts as a circuit breaker: once the first failure updates the cache, subsequent NodeClaims skip the failed SKU immediately
  • Fail-open design: if cache lookup fails, all instance types remain available
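
The circuit-breaker behavior can be sketched as below. This is a hypothetical sketch of what `offerings.PreLaunchFilter()` does per the description above, not the PR's code: the function signature, the `availabilityCheck` type, and the SKU names are assumptions; only the fail-open semantics and the "re-check the live cache before launching" idea come from the PR.

```go
package main

import "fmt"

// availabilityCheck reports whether an offering key is currently marked
// unavailable in the live ICE cache; nil means the cache could not be consulted.
type availabilityCheck func(key string) bool

// preLaunchFilter drops instance types the live ICE cache already knows are
// unavailable, right before the Azure API call. Fail-open: if no usable check
// is provided, every candidate stays launchable.
func preLaunchFilter(candidates []string, unavailable availabilityCheck) []string {
	if unavailable == nil {
		return candidates // fail-open: never block launches on a cache failure
	}
	out := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if !unavailable(c) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Once the first NodeClaim's launch failure has marked a SKU in the cache,
	// subsequent NodeClaims skip it without making a doomed Azure API call.
	failed := map[string]bool{"Standard_D4s_v5": true}
	check := func(k string) bool { return failed[k] }
	fmt.Println(preLaunchFilter([]string{"Standard_D4s_v5", "Standard_D2s_v3"}, check))
}
```

The fail-open branch matches the design note above: a broken cache lookup degrades to the old behavior (redundant API calls) rather than blocking scale-up entirely.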

Impact

| Scenario | Before | After |
| --- | --- | --- |
| Quota exhausted, fallback needed | Infinite recycling loop | Immediate fallback after 15 min cache |
| Regional quota hit | No cache update, repeated failures | 30 min cache, clean fallback |
| 5000-core scale-up | ~78 wasted API calls | ~1-3 wasted calls (race window only) |

Tests

  • 3 new tests for error handler changes (regional quota on-demand/spot, SKU family non-zero TTL)
  • 11 new tests for PreLaunchFilter (nil fail-open, all/partial/no availability, multi-zone, spot/on-demand, large-scale circuit breaker, sequential progressive filtering scenario)

All existing tests pass. No new Azure API dependencies.

Hen Goueta (from Dev Box) added 2 commits February 17, 2026 12:29
…ota errors

This change addresses the issue where Karpenter fails to fall back to lower-weight
NodePools when higher-weight pools are quota-exhausted. Three specific gaps are fixed:

1. SKU family quota (non-zero limit): Previously used the default 3-minute TTL,
   causing offerings to recycle back into the available pool before the scheduler
   could exhaust all SKUs in the high-weight NodePool and fall through to lower-weight
   pools. Now uses 15-minute TTL (SKUFamilyQuotaNonZeroTTL).

2. Regional quota exhausted: Previously returned InsufficientCapacityError without
   updating the ICE cache, so subsequent scheduling loops would keep selecting instance
   types from the exhausted capacity type. Now marks offerings as unavailable in the
   cache with a 30-minute TTL before returning the error.

3. Added new TTL constants: SKUFamilyQuotaNonZeroTTL (15m) and RegionalQuotaExhaustedTTL
   (30m) to differentiate between transient and persistent quota exhaustion.

Tests added:
- Regional quota exceeded for on-demand marks all zones unavailable
- Regional quota exceeded for spot marks all spot unavailable
- SKU family quota non-zero limit uses longer TTL to prevent recycling

Signed-off-by: Hicham Engoueta <hengoueta@microsoft.com>
… calls

Adds a PreLaunchFilter that re-checks instance types against the live ICE
(Insufficient Capacity Error) cache right before making Azure API calls.

This acts as a circuit breaker during large-scale scheduling (e.g., 5000 cores):
when the scheduler creates many NodeClaims simultaneously, the first VM creation
failure updates the ICE cache, and subsequent NodeClaims skip the failed SKU
immediately instead of making redundant API calls that are guaranteed to fail.

Without this: ~78 wasted API calls, all fail with AllocationFailed/quota errors.
With this: first ~5-10 calls may fail (race window), then remaining skip instantly.

Integration:
- New function: offerings.PreLaunchFilter() with fail-open design
- Called in both DefaultVMProvider.BeginCreate() and DefaultAKSMachineProvider.BeginCreate()
- Uses the existing ICE cache (unavailableOfferings) from the error handler
- No new Azure API calls — purely in-memory cache re-check

Signed-off-by: Hicham Engoueta <hengoueta@microsoft.com>
@GuetaHen (Author)

@microsoft-github-policy-service agree company="Microsoft"

@GuetaHen (Author)

This PR fixes #1323

Development

Successfully merging this pull request may close: Improve handling on quota failures (or avoid them altogether)