Allow per-cluster max_runners overrides for arc-runners-X scale sets#630

Open
huydhn wants to merge 1 commit into pytorch:main from huydhn:cleanup-h100-pet-instances-uw1

@huydhn
Contributor

@huydhn huydhn commented May 26, 2026

arc-runners-X/defs/*.yaml now own per-cluster capacity directly. max_runners accepts either an int (baseline applied to every cluster — backwards-compatible with the ~30 existing CPU/GPU defs and the B200 defs) or a mapping:

max_runners:
  default: 8                  # required baseline
  arc-cbr-production-uw1: 48  # per-cluster override

Each arc-runners-h100 def adopts the mapping form. arc-cbr-production-uw1 inherits the 6 H100 nodes freed by pytorch-gha-infra#1180, so its scale sets render at def×6 (48 / 24 / 12 / 6). us-east-2 keeps the def baselines (unchanged).
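The int-or-mapping rule described above can be sketched in a few lines of Python. This is an illustration only, not the code in this PR: the function name `resolve_max_runners` and its error messages are hypothetical, and the actual generator script may structure this differently.

```python
from collections.abc import Mapping
from typing import Union

# Either a plain int or a mapping with a required "default" key.
MaxRunners = Union[int, Mapping[str, int]]


def resolve_max_runners(max_runners: MaxRunners, cluster: str) -> int:
    """Return the max_runners value that applies to `cluster` (hypothetical helper)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_runners, bool):
        raise TypeError("max_runners must be an int or a mapping")
    # Int form: one baseline applied to every cluster.
    if isinstance(max_runners, int):
        return max_runners
    # Mapping form: per-cluster override, falling back to the required default.
    if isinstance(max_runners, Mapping):
        if "default" not in max_runners:
            raise ValueError("max_runners mapping requires a 'default' key")
        return max_runners.get(cluster, max_runners["default"])
    raise TypeError("max_runners must be an int or a mapping")


h100_def = {"default": 8, "arc-cbr-production-uw1": 48}
print(resolve_max_runners(h100_def, "arc-cbr-production-uw1"))  # 48
print(resolve_max_runners(h100_def, "arc-cbr-production"))      # 8
print(resolve_max_runners(8, "arc-cbr-production-uw1"))         # 8
```

Falling back to `default` for any cluster without an override means existing int-only defs and clusters with no entry behave exactly as before.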

Capacity stays co-located with the runner shape rather than split between clusters.yaml and the def — addresses @jeanschmidt's review feedback on the earlier clusters.yaml-based approach.

Test plan

  • pytest modules/arc-runners/scripts/python/test_generate_runners.py — 5 new tests covering the mapping form, plus all existing tests green.
  • End-to-end render against both clusters confirms maxRunners: 48 / 24 / 12 / 6 (uw1) and 8 / 4 / 2 / 1 (us-east-2 baseline).
  • Deploy arc-runners-h100 to arc-cbr-production-uw1; confirm maxRunners = 48 / 24 / 12 / 6.
  • Deploy arc-runners-h100/-b200 to arc-cbr-production; confirm maxRunners is unchanged.
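As a quick sanity check on the numbers in the test plan, the uw1 values are the def baselines multiplied by the 6 H100 nodes in that cluster. The names below (`BASELINES`, `UW1_H100_NODES`, `scaled`) are hypothetical and exist only for this illustration.

```python
# Def baselines for the four arc-runners-h100 scale sets, as listed in the test plan.
BASELINES = [8, 4, 2, 1]
# H100 nodes available in arc-cbr-production-uw1, per the PR description.
UW1_H100_NODES = 6


def scaled(baselines, node_count):
    """Scale each baseline by the number of reserved nodes."""
    return [b * node_count for b in baselines]


uw1 = scaled(BASELINES, UW1_H100_NODES)
print(" / ".join(str(n) for n in uw1))  # 48 / 24 / 12 / 6
```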

@huydhn huydhn requested a review from jeanschmidt as a code owner May 26, 2026 19:12
@huydhn huydhn force-pushed the cleanup-h100-pet-instances-uw1 branch from 019d838 to 46f1210 Compare May 26, 2026 19:20
@huydhn huydhn changed the title from "Scale arc-runners-h100 max_runners by reserved-node count in us-west-1" to "Size arc-runners-X by reserved-node count from nodepools-X" May 26, 2026
@huydhn huydhn force-pushed the cleanup-h100-pet-instances-uw1 branch from 46f1210 to 81c3e0d Compare May 26, 2026 19:27
@huydhn huydhn changed the title from "Size arc-runners-X by reserved-node count from nodepools-X" to "Allow per-cluster max_runners overrides for arc-runners-X scale sets" May 26, 2026
Comment thread osdc/clusters.yaml Outdated
# arc-runners-h100/defs/*.yaml are sized for 1 reserved 8-GPU node, but
# this cluster holds 6 (see nodepools-h100.capacity_reservation_ids above).
# Override each scale set's max_runners to def×6.
max_runners:
Contributor

IMO max_runners should be defined in the nodepool.

Can't we do this in the nodepool and duplicate them? That might make more sense.

Contributor Author

@huydhn huydhn May 26, 2026

Let me try that and see how it looks. I think I know what you mean.

@huydhn huydhn force-pushed the cleanup-h100-pet-instances-uw1 branch from 81c3e0d to 94b46e1 Compare May 27, 2026 06:17
@huydhn huydhn requested a review from jeanschmidt May 27, 2026 06:20
@huydhn huydhn force-pushed the cleanup-h100-pet-instances-uw1 branch 2 times, most recently from 03172e6 to c92ac31 Compare May 27, 2026 06:22
max_runners now accepts either an int (baseline applied to every cluster,
unchanged for the ~30 existing CPU/GPU defs and the B200 defs) or a
mapping with a required `default` baseline plus per-cluster overrides:

  max_runners:
    default: 8
    arc-cbr-production-uw1: 48

Each arc-runners-h100 def adopts the mapping form. arc-cbr-production-uw1
inherits the 6 H100 nodes freed by pytorch-gha-infra#1180, so each scale
set is sized to def×6 (48 / 24 / 12 / 6). us-east-2 keeps the def baselines.

Capacity info stays co-located with the runner shape rather than spread
between clusters.yaml and the def — addresses review feedback on the
earlier clusters.yaml-based approach.
@huydhn huydhn force-pushed the cleanup-h100-pet-instances-uw1 branch from c92ac31 to 45c2b6e Compare May 27, 2026 06:23
2 participants