cluster-config: add runner-image-modules subcommand and just recipes #634
jeanschmidt wants to merge 1 commit into
tofu plan — arc-cbr-production✅ Plan succeeded · commit Plan output |
tofu plan — arc-cbr-production-uw1✅ Plan succeeded · commit Plan output |
malfet left a comment
Please explain whether the new tests are actually validated in CI.
Also, I'm confused about the RUNNER_IMAGE_CONSUMER_MODULES concept. What is it? Why do you want to filter out some modules but not others?
CONFIG_PATH = Path(os.environ.get("CLUSTERS_YAML", Path(__file__).resolve().parent.parent / "clusters.yaml"))

# Modules whose generated manifests embed `runner_image_tag` from clusters.yaml.
# Authoritative source: modules/arc-runners/scripts/python/generate_runners.py is
If this is the authoritative source, why doesn't this file import the list from it? Otherwise, what will keep them in sync?
I just don't know a reliable way to check whether a module will deploy ARC scalesets; the best I can come up with is to keep a list of modules and check them on the cluster.
We could, maybe, force modules that create an ARC scaleset to define a configuration, or to have a file, but then the problem just moves from one place to another...
Any ideas here?
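For reference, the "have a file" option mentioned above could look roughly like the sketch below. Everything in it is an assumption for illustration: the marker file name `.consumes-runner-image`, the helper `discover_consumers`, and the directory layout are hypothetical and do not exist in this PR.

```python
import tempfile
from pathlib import Path

# Hypothetical marker file: a module directory containing it declares that
# its generated manifests embed runner_image_tag.
MARKER = ".consumes-runner-image"


def discover_consumers(modules_root: Path) -> list[str]:
    """Return the sorted names of module directories that contain the marker file."""
    return sorted(
        d.name
        for d in modules_root.iterdir()
        if d.is_dir() and (d / MARKER).is_file()
    )


if __name__ == "__main__":
    # Build a throwaway modules tree to show the discovery in action.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("arc-runners", "monitoring"):
            (root / name).mkdir()
        (root / "arc-runners" / MARKER).touch()
        print(discover_consumers(root))
```

As noted in the reply, this only moves the bookkeeping from one list into per-module files; it does not prove that a module actually deploys a scaleset.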
# Print the AWS region for a cluster (single source of truth: clusters.yaml)
region cluster:
    @export CLUSTERS_YAML="{{CLUSTERS_YAML}}"; uv run {{CFG}} {{cluster}} region
        assert "monitoring" in lines
        assert "buildkit" in lines

    def test_runner_image_modules_staging(self, capsys):
I looked at the logs in https://github.com/pytorch/ci-infra/actions/runs/26486182516/job/77994041549?pr=634 and couldn't find those tests being executed. If they are not run in CI, what's the point of adding new tests?
Hi, we don't print every test that we run in CI, only the red ones and the coverage per file (the gate is all tests green and at least 97% coverage on all Python files).
So for osdc/scripts/cluster_config.py the coverage should be at 98%:
https://github.com/pytorch/ci-infra/actions/runs/26486182516/job/77994041549?pr=634#step:6:96
Do you believe we should print all tests we run?
Stack from ghstack (oldest at bottom):
Impact: OSDC tooling for cluster config introspection
Risk: low
What
Adds a new `runner-image-modules` subcommand to `cluster-config.py` that returns the comma-separated list of enabled modules consuming `runner_image_tag`. Exposes `region` and `runner-image-modules` as `just` recipes so workflows and operators can call them uniformly.
Why
Upcoming Renovate-driven runner image auto-update needs a single source of truth for which modules to redeploy when `runner_image_tag` bumps. CI workflows also need a uniform way to derive the AWS region for a cluster without hardcoding values.
How
- `RUNNER_IMAGE_CONSUMER_MODULES` lists the modules whose generated manifests embed `runner_image_tag` (arc-runners and its b200/h100 delegates).
- `runner-image-modules` filters the cluster's enabled modules through that list and prints comma-joined output.
- `just region` and `just runner-image-modules` thin-wrap the existing cluster-config commands so downstream tooling has one entrypoint.
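The filtering step described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the module names in the set and the function name `runner_image_modules` are assumptions, since the description only says "arc-runners and its b200/h100 delegates".

```python
from typing import Iterable

# Assumed module names for illustration only; the real list lives in
# osdc/scripts/cluster-config.py and may use different names.
RUNNER_IMAGE_CONSUMER_MODULES = frozenset(
    {"arc-runners", "arc-runners-b200", "arc-runners-h100"}
)


def runner_image_modules(enabled_modules: Iterable[str]) -> str:
    """Comma-join the enabled modules that consume runner_image_tag.

    The order of the cluster's enabled-module list is preserved.
    """
    return ",".join(
        m for m in enabled_modules if m in RUNNER_IMAGE_CONSUMER_MODULES
    )


if __name__ == "__main__":
    # Hypothetical enabled-module list for a cluster.
    enabled = ["monitoring", "arc-runners", "buildkit"]
    print(runner_image_modules(enabled))
```

With these assumed names, a cluster that enables `monitoring`, `arc-runners`, and `buildkit` would yield `arc-runners`, and a cluster with no consumer modules would yield an empty string.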
Changes
- `osdc/scripts/cluster-config.py`: new subcommand + consumer list.
- `osdc/scripts/test_cluster_config.py`: unit tests for the new path.
- `osdc/justfile`: two new recipes.
Testing
- `cd osdc && just test` to run unit tests.
- `just region arc-staging` and `just runner-image-modules arc-staging` for manual sanity.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>