
Enable EKS provisioned control plane (4XL tier) #450

Open
huydhn wants to merge 3 commits into main from eks-provisioned-control-plane-xl


@huydhn huydhn commented Apr 15, 2026

The standard EKS control plane is constantly throttling the production cluster. Analysis of 7-day Mimir metrics (2026-04-15) showed:

  • POST pods requests throttled at ~0.4 req/s (0.36% of traffic, ~34K rejected requests/day) — 429s present in 100% of samples
  • Peak API rate: 638 req/s (p99: 478, p95: 321, avg: 130)
  • Cluster scale: up to 171 nodes / 4,220 pods / 223 node creations/hr
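As a sanity check on the first bullet, the daily rejection count follows directly from the sustained throttle rate. A minimal sketch using only the figures quoted above; the function name is illustrative and not part of the PR:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rejected_per_day(throttled_req_per_s: float) -> float:
    """Convert a sustained throttle rate (req/s) into rejected requests per day."""
    return throttled_req_per_s * SECONDS_PER_DAY


# A sustained ~0.4 req/s works out to roughly 34.5K rejections per day,
# in line with the ~34K/day figure quoted above.
print(round(rejected_per_day(0.4)))
```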

The provisioned tiers compare as follows:

┌───────────────────────┬────────────────┬──────────────────────────────────────┐
│        Metric         │ XL             │              4XL (new)               │
├───────────────────────┼────────────────┼──────────────────────────────────────┤
│ Hourly cost           │ ~$1.65–1.75/hr │ ~$6.90/hr                            │
├───────────────────────┼────────────────┼──────────────────────────────────────┤
│ Annual cost           │ ~$15K/yr       │ ~$60K/yr                             │
├───────────────────────┼────────────────┼──────────────────────────────────────┤
│ API concurrency seats │ 1,700–2,000    │ 6,800–8,000 (depends on EKS version) │
├───────────────────────┼────────────────┼──────────────────────────────────────┤
│ Pod scheduling rate   │ 167 pods/s     │ 400 pods/s                           │
├───────────────────────┼────────────────┼──────────────────────────────────────┤
│ SLA                   │ 99.99%         │ 99.99%                               │
└───────────────────────┴────────────────┴──────────────────────────────────────┘
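The annual figures in the table can be reproduced from the hourly rates. A small sketch, assuming a 365-day year of continuous use and taking the upper XL rate of $1.75/hr; the helper name is illustrative:

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost(hourly_usd: float) -> float:
    """Annual cost in USD for a tier billed continuously at the given hourly rate."""
    return hourly_usd * HOURS_PER_YEAR


# Hourly rates taken from the table above.
for tier, hourly in {"XL": 1.75, "4XL": 6.90}.items():
    print(f"{tier}: ~${annual_cost(hourly) / 1000:.0f}K/yr")
```

Both results round to the table's ~$15K/yr and ~$60K/yr.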

The implementation wires a new control_plane_scaling_tier variable from clusters.yaml through cluster-config.py and the tofu module chain. It uses a dynamic block so that clusters on "standard" (the default) emit no control_plane_scaling_config block, which keeps existing behavior unchanged.
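The "emit nothing for standard" behavior could be sketched as follows. This is not the actual code in cluster-config.py: the function name, the dictionary shape, and the `tier` key inside the block are assumptions made for illustration.

```python
from typing import Optional

DEFAULT_TIER = "standard"


def scaling_config(cluster: dict) -> Optional[dict]:
    """Return settings for a control_plane_scaling_config block, or None.

    Hypothetical helper: None means no block is emitted, so clusters that
    omit the variable or set it to "standard" keep their existing config.
    """
    tier = cluster.get("control_plane_scaling_tier", DEFAULT_TIER)
    if tier == DEFAULT_TIER:
        return None
    # Assumed block shape: a single "tier" attribute.
    return {"tier": tier}


print(scaling_config({}))
print(scaling_config({"control_plane_scaling_tier": "tier-xl"}))
```

On the tofu side, a dynamic block iterating over a zero- or one-element collection would give the same on/off effect.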

Q: This currently updates both staging and production. Do we need this for staging?

A: No, we don't need it for staging.

Deployment plan

  1. tofu plan on staging — confirm it shows only ~ update in-place on aws_eks_cluster.this with the control_plane_scaling_config addition. No other resources should change.
  2. tofu apply on staging — watch the ScalingTierConfigUpdate complete (several minutes).
  3. Verify staging — confirm API server is responsive, pods are scheduling normally.
  4. Repeat for production.
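The check in step 1 could be automated against the machine-readable plan. A rough sketch, assuming the plan was saved and exported with `tofu plan -out=plan.bin` followed by `tofu show -json plan.bin > plan.json` (the file names are placeholders) and that the cluster resource is `aws_eks_cluster.this`, possibly nested under a module:

```python
def unexpected_changes(plan: dict) -> list:
    """Return (address, actions) pairs for every planned change other than
    an in-place update of the aws_eks_cluster.this resource."""
    problems = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        # Unchanged resources and data-source reads are not real changes.
        if actions in (["no-op"], ["read"]):
            continue
        is_cluster = rc.get("type") == "aws_eks_cluster" and rc.get("name") == "this"
        if is_cluster and actions == ["update"]:
            continue
        problems.append((rc["address"], actions))
    return problems


# Usage, with the exported plan loaded from disk:
#   import json
#   with open("plan.json") as f:
#       print(unexpected_changes(json.load(f)))
```

An empty list matches the expectation in step 1. It does not confirm that the update itself is limited to control_plane_scaling_config, so that part of the diff still needs a manual look.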

The standard EKS control plane is constantly throttling the production
cluster.  Analysis of 7-day Mimir metrics (2026-04-15) showed:

  - POST pods requests throttled at ~0.4 req/s (0.36% of traffic,
    ~34K rejected requests/day) — 429s present in 100% of samples
  - Peak API rate: 638 req/s (p99: 478, p95: 321, avg: 130)
  - Cluster scale: up to 171 nodes / 4,220 pods / 223 node creations/hr

The XL provisioned tier ($1.75/hr, ~$15K/yr) provides:

  - 2,000 API concurrency seats (10x standard ~200 seats)
  - 167 pods/s scheduling rate
  - 99.99% SLA (up from 99.95%)

4XL ($6.90/hr, ~$61K/yr) was considered but ruled out — the current
concurrency deficit is a few dozen seats, not thousands.  Pod scheduling
rate caps at 400/s for both 4XL and 8XL, so the extra $46K/yr buys no
practical benefit at current scale.

Implementation wires a new `control_plane_scaling_tier` variable from
clusters.yaml through cluster-config.py and the tofu module chain.
Uses a dynamic block so clusters on "standard" (the default) emit no
control_plane_scaling_config block, keeping existing behavior unchanged.

Enabled on both staging and production — deploy staging first to
validate the tofu plan/apply before rolling to production.
@huydhn huydhn requested a review from jeanschmidt April 15, 2026 09:51
huydhn added a commit that referenced this pull request Apr 15, 2026
The control_plane_scaling_config block (needed for EKS provisioned
control plane in #450) was added in hashicorp/aws v6.23.0. The v5.x
constraint would cause tofu plan to fail with an unsupported block
error. Bumps all four EKS terraform roots (backend, eks, vpc, harbor).

Signed-off-by: Huy Do <huydhn@gmail.com>
github-merge-queue Bot pushed a commit that referenced this pull request Apr 16, 2026
The control_plane_scaling_config block (needed for EKS provisioned
control plane in #450) was added in hashicorp/aws v6.23.0. The v5.x
constraint would cause tofu plan to fail with an unsupported block
error. Bumps all four EKS terraform roots (backend, eks, vpc, harbor).

In short, this allows #450 to be deployed via tofu.

Signed-off-by: Huy Do <huydhn@gmail.com>
Comment thread osdc/clusters.yaml Outdated
base_node_max_unavailable_percentage: 100 # all-at-once for staging
single_nat_gateway: true # cost optimization for staging
vpc_cidr: "10.0.0.0/16"
control_plane_scaling_tier: "tier-xl" # validate provisioned control plane before production

ARC staging can use standard

Comment thread osdc/clusters.yaml Outdated
# seats (10x standard) and 167 pods/s scheduling — ample headroom.
# 4XL (8,000 seats, $5.1K/mo) was ruled out as overkill given the
# modest concurrency deficit.
control_plane_scaling_tier: "tier-xl"

use 4XL

Comment thread osdc/clusters.yaml Outdated
- ossci_gha_terraform
base:
vpc_cidr: "10.1.0.0/16"
# Provisioned control plane XL ($1.75/hr, ~$15K/yr).

Don't add unnecessary comments


@jeanschmidt jeanschmidt left a comment


Higher availability for production, lower for staging

Signed-off-by: Huy Do <huydhn@gmail.com>
@huydhn huydhn requested a review from jeanschmidt April 16, 2026 01:13
@huydhn huydhn changed the title Enable EKS provisioned control plane (XL tier) Enable EKS provisioned control plane (4XL tier) Apr 16, 2026
