Enable EKS provisioned control plane (4XL tier) #450
Open
huydhn wants to merge 3 commits into
The standard EKS control plane is constantly throttling the production
cluster. Analysis of 7-day Mimir metrics (2026-04-15) showed:
- POST pods requests throttled at ~0.4 req/s (0.36% of traffic,
~34K rejected requests/day) — 429s present in 100% of samples
- Peak API rate: 638 req/s (p99: 478, p95: 321, avg: 130)
- Cluster scale: up to 171 nodes / 4,220 pods / 223 node creations/hr
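As a quick sanity check on the figures above, the daily rejection count follows directly from the throttle rate. This is a minimal sketch, not part of the PR: the constants are copied from the bullets, the helper names are illustrative, and the share is assumed to be taken against the 130 req/s average.

```python
# Sanity check of the throttling figures quoted above.
# Constants come from the bullets; helper names are illustrative only.

SECONDS_PER_DAY = 24 * 60 * 60


def rejected_per_day(throttled_req_per_s: float) -> float:
    """Convert a sustained rate of rejected requests into a daily count."""
    return throttled_req_per_s * SECONDS_PER_DAY


def throttled_rate(total_req_per_s: float, throttled_share: float) -> float:
    """Rate of rejected requests implied by a share of total traffic."""
    return total_req_per_s * throttled_share


if __name__ == "__main__":
    # ~0.4 req/s of 429 responses sustained over a full day
    print(f"{rejected_per_day(0.4):,.0f} rejected requests/day")
    # 0.36% of the average API rate (assumed to be the 130 req/s figure)
    print(f"{throttled_rate(130, 0.0036):.2f} throttled req/s")
```

Both results land close to the rounded values in the bullets (~34K/day and ~0.4 req/s).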
The XL provisioned tier ($1.75/hr, ~$15K/yr) provides:
- 2,000 API concurrency seats (10x standard ~200 seats)
- 167 pods/s scheduling rate
- 99.99% SLA (up from 99.95%)
4XL ($6.90/hr, ~$61K/yr) was considered but ruled out — the current
concurrency deficit is a few dozen seats, not thousands. Pod scheduling
rate caps at 400/s for both 4XL and 8XL, so the extra $46K/yr buys no
practical benefit at current scale.
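The cost comparison can be reproduced from the hourly prices. This is a rough sketch that assumes an always-on control plane and a 365-day year; the dictionary keys are labels chosen for illustration, not provider identifiers.

```python
# Annualize the hourly tier prices quoted above.
# Assumption: billed every hour of a 365-day year.

HOURS_PER_YEAR = 24 * 365

# Hourly prices copied from the description; keys are illustrative labels.
TIER_HOURLY_USD = {"XL": 1.75, "4XL": 6.90}


def annual_cost(hourly_usd: float) -> float:
    """Yearly cost of a tier billed at a flat hourly rate."""
    return hourly_usd * HOURS_PER_YEAR


def annual_delta(tier_a: str, tier_b: str) -> float:
    """Extra yearly cost of tier_a over tier_b."""
    return annual_cost(TIER_HOURLY_USD[tier_a]) - annual_cost(TIER_HOURLY_USD[tier_b])


if __name__ == "__main__":
    for name, hourly in TIER_HOURLY_USD.items():
        print(f"{name}: ${annual_cost(hourly):,.0f}/yr")
    print(f"4XL over XL: ${annual_delta('4XL', 'XL'):,.0f}/yr")
```

Under these assumptions the totals come out in the same range as the rounded ~$15K, ~$61K, and ~$46K figures above.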
Implementation wires a new `control_plane_scaling_tier` variable from
clusters.yaml through cluster-config.py and the tofu module chain.
Uses a dynamic block so clusters on "standard" (the default) emit no
control_plane_scaling_config block, keeping existing behavior unchanged.
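The "emit nothing on standard" behavior can be pictured as producing a list that a dynamic block iterates over: empty for the default tier, one entry otherwise. The sketch below is hypothetical and does not reproduce cluster-config.py; the function name and the `tier` key are assumptions made for illustration.

```python
# Hypothetical sketch of the tier handling; not the actual cluster-config.py.
from typing import Any

DEFAULT_TIER = "standard"


def scaling_config_blocks(cluster: dict[str, Any]) -> list[dict[str, str]]:
    """Return the items a dynamic block would iterate over.

    An empty list means no control_plane_scaling_config block is emitted,
    so clusters that never set the variable keep their existing behavior.
    """
    tier = cluster.get("control_plane_scaling_tier", DEFAULT_TIER)
    if tier == DEFAULT_TIER:
        return []
    # The "tier" key name is an assumption for illustration.
    return [{"tier": tier}]


if __name__ == "__main__":
    # Cluster without the variable: falls back to the default, emits nothing.
    print(scaling_config_blocks({"vpc_cidr": "10.0.0.0/16"}))
    # Cluster opting in to a provisioned tier: emits exactly one block.
    print(scaling_config_blocks({"control_plane_scaling_tier": "tier-xl"}))
```

Keeping the default as an explicit value means an unset variable and an explicit "standard" behave identically.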
Enabled on both staging and production — deploy staging first to
validate the tofu plan/apply before rolling to production.
huydhn
added a commit
that referenced
this pull request
Apr 15, 2026
The control_plane_scaling_config block (needed for EKS provisioned control plane in #450) was added in hashicorp/aws v6.23.0. The v5.x constraint would cause tofu plan to fail with an unsupported block error. Bumps all four EKS terraform roots (backend, eks, vpc, harbor). Signed-off-by: Huy Do <huydhn@gmail.com>
github-merge-queue Bot
pushed a commit
that referenced
this pull request
Apr 16, 2026
The control_plane_scaling_config block (needed for EKS provisioned control plane in #450) was added in hashicorp/aws v6.23.0. The v5.x constraint would cause tofu plan to fail with an unsupported block error. Bumps all four EKS terraform roots (backend, eks, vpc, harbor). In short, this allows #450 to be deployed via tofu. Signed-off-by: Huy Do <huydhn@gmail.com>
jeanschmidt
reviewed
Apr 16, 2026
    base_node_max_unavailable_percentage: 100 # all-at-once for staging
    single_nat_gateway: true # cost optimization for staging
    vpc_cidr: "10.0.0.0/16"
    control_plane_scaling_tier: "tier-xl" # validate provisioned control plane before production
Contributor
ARC staging can use standard
jeanschmidt
reviewed
Apr 16, 2026
    # seats (10x standard) and 167 pods/s scheduling — ample headroom.
    # 4XL (8,000 seats, $5.1K/mo) was ruled out as overkill given the
    # modest concurrency deficit.
    control_plane_scaling_tier: "tier-xl"
jeanschmidt
reviewed
Apr 16, 2026
    - ossci_gha_terraform
    base:
    vpc_cidr: "10.1.0.0/16"
    # Provisioned control plane XL ($1.75/hr, ~$15K/yr).
Contributor
Don't add unnecessary comments
jeanschmidt
requested changes
Apr 16, 2026
Contributor
jeanschmidt
left a comment
Higher availability for production, lower for staging
Signed-off-by: Huy Do <huydhn@gmail.com>
jeanschmidt
approved these changes
Apr 22, 2026
Q: This currently updates both staging and production. Do we need this for staging?
A: No, we don't need it for staging.
Deployment plan