Enable Alloy clustering to reduce DPM by ~50% by yangw-dev · Pull Request #522 · pytorch/ci-infra

yangw-dev · 2026-04-30T01:26:52Z

Summary

Add alloy.clustering.enabled: true to helm values. This tells the Alloy helm chart to:

Create a headless service (alloy-cluster) for peer discovery
Pass --cluster.join-addresses=alloy-cluster to Alloy pods

With this, the two Alloy replicas discover each other and split scrape targets roughly 50/50, instead of each replica independently scraping every target, which doubles DPM (data points per minute).

This is a one-line change — no changes to controller.type (stays as Deployment), deploy.sh, or smoke tests.
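The target split described above can be illustrated with a small simulation. This is only a sketch, not Alloy's actual clustering implementation: it uses rendezvous hashing as a stand-in for whatever assignment scheme Alloy uses internally, and the pod names (`alloy-0`, `alloy-1`) and target names are hypothetical.

```python
import hashlib
from collections import Counter


def owner(target: str, peers: list[str]) -> str:
    """Return the single peer responsible for a target (rendezvous hashing)."""
    def score(peer: str) -> str:
        # Highest hash of (peer, target) wins; every peer computes the same answer.
        return hashlib.sha256(f"{peer}|{target}".encode()).hexdigest()
    return max(peers, key=score)


def total_scrapes(targets: list[str], peers: list[str], clustered: bool) -> int:
    """Count scrapes performed per scrape interval across all peers."""
    if clustered:
        # Each target is owned, and therefore scraped, by exactly one peer.
        return len(targets)
    # Without peer discovery every peer scrapes every target.
    return len(targets) * len(peers)


peers = ["alloy-0", "alloy-1"]                   # hypothetical pod names
targets = [f"target-{i}" for i in range(1000)]   # hypothetical scrape targets
split = Counter(owner(t, peers) for t in targets)

print(dict(split))  # roughly even split between the two peers
print(total_scrapes(targets, peers, clustered=False),
      total_scrapes(targets, peers, clustered=True))
```

The point of the sketch is that once peers know about each other, every target has exactly one owner, so the total number of scrapes per interval halves with two replicas.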

Root cause

The Alloy River config already had clustering { enabled = true } on servicemonitors and podmonitors, but the Helm chart-level value alloy.clustering.enabled was not set. Without it, the chart doesn't create the headless service or pass --cluster.join-addresses, so the pods can't discover each other. Logs showed: "no peer discovery configured: both join and discover peers are empty".

Expected impact

DPM per series: ~2.0 → ~1.0
Total DPM: ~50% reduction
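The expected numbers follow from simple arithmetic, sketched below. The 60-second scrape interval is an assumption (it is not stated in this PR, though it is consistent with the ~2.0 figure for two replicas); the function name is illustrative.

```python
def dpm_per_series(scrape_interval_s: float, scrapers_per_target: int) -> float:
    """Data points per minute for one series, given how many replicas scrape it."""
    return (60.0 / scrape_interval_s) * scrapers_per_target


SCRAPE_INTERVAL_S = 60  # assumption: not stated in the PR

# Before: both replicas scrape every target independently.
before = dpm_per_series(SCRAPE_INTERVAL_S, scrapers_per_target=2)
# After: clustering assigns each target to exactly one replica.
after = dpm_per_series(SCRAPE_INTERVAL_S, scrapers_per_target=1)
reduction = 1 - after / before

print(f"DPM per series: {before} -> {after} ({reduction:.0%} reduction)")
```

Because every series is affected the same way, the per-series halving carries over to total DPM.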

Test plan

Deploy to staging, verify clustering logs show peer discovery
Verify DPM reduction in Grafana Cloud dashboard
Deploy to prod

github-actions · 2026-04-30T01:27:54Z

tofu plan — arc-cbr-production

✅ Plan succeeded · commit 0ac63d60 · run log

Plan output

Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role]
data.aws_availability_zones.available: Reading...
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-node-role]
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=8115d61b-1bc1-49ad-b5a3-e8f88fc50cb1]
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.eks.data.aws_caller_identity.current: Reading...
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0a126b1613758a408]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-2]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNMSO5RRNP]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-eks-secrets]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role-20260308084936685500000002]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role-20260308084936681500000001]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936813000000004]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936816800000005]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260316204739334600000001]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936734100000003]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 0s [id=ami-009f1fe7d56695348]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-03eb66e57d13af64b]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3-20260308084938596600000006]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-084ed6fc52db22c39]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0610564f678f81c5f]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-07ac52a1aa741f267]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-06a70b2818e270ed8]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-023207cd15e79c81a]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0701693364b79c021]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-0078fd5c0f6bc05eb]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0545d26e4a1d0ba89]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-04682fc890bfd4630]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0ce6f1dcb7208cad8]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-0aa6ea5c845170545]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-04d9bba8d43569bbf]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0d2591f24cba79e7b]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production:kube-proxy]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-090bac79dddc5b77f]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-07e2274170282eb8c]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-086e3e66fe238d459]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-0f34cc1aafea8fd16]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production:pytorch-arc-cbr-production-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/70AA0C12C21E1A843313EF1BDE82D29A]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-000d05ecec7d4b66e]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0777285eddd2bacd1]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0f623a6fa9d7bde45]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=2255203180]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-00dacd13031b1f5de]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0ec9764e9015e972e]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-08ccb8cfe4bfa80d7]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production:coredns]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role-2026030809125522790000000d]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry-2026030809125509320000000c]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-ebs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change-KarpenterScheduledChange]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.subnet_karpenter_discovery["subnet-0ce6f1dcb7208cad8"]: Refreshing state... [id=subnet-0ce6f1dcb7208cad8,karpenter.sh/discovery]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-03b965bcc0c037434,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0545d26e4a1d0ba89"]: Refreshing state... [id=subnet-0545d26e4a1d0ba89,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-04682fc890bfd4630"]: Refreshing state... [id=subnet-04682fc890bfd4630,karpenter.sh/discovery]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-karpenter-controller]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller-20260308154648023000000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-053d2ed886d9ac92d]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role]
aws_security_group.efs: Refreshing state... [id=sg-099ef6309262a93fd]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role-20260330040250456800000003]
aws_efs_mount_target.pypi_cache["subnet-0545d26e4a1d0ba89"]: Refreshing state... [id=fsmt-05b0a0d538bd49c8e]
aws_efs_mount_target.pypi_cache["subnet-0ce6f1dcb7208cad8"]: Refreshing state... [id=fsmt-01378a00a07852987]
aws_efs_mount_target.pypi_cache["subnet-04682fc890bfd4630"]: Refreshing state... [id=fsmt-0743bba60c50ed499]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role-20260403211352439500000002]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-efs-csi-driver]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role-20260403211352357700000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

1. Add alloy.clustering.enabled=true: tells the helm chart to create a headless service and pass --cluster.join-addresses, enabling peer discovery between replicas. DPM drops from ~2x to ~1x per series.
2. Increase memory limit from 1Gi to 2Gi: prod has more scrape targets than staging, and clustering adds memberlist overhead. Without this, prod Alloy pods OOMKill.

No changes to controller type (stays as Deployment), deploy.sh, or smoke tests.

Authored with Claude.

yangw-dev temporarily deployed to osdc-staging April 30, 2026 01:26 — with GitHub Actions Inactive

yangw-dev force-pushed the elainewy/fix-alloy-clustering-dpm branch from ff487e2 to 44f0e0c Compare April 30, 2026 01:55

yangw-dev temporarily deployed to osdc-staging April 30, 2026 01:55 — with GitHub Actions Inactive

yangw-dev wants to merge 1 commit into main from elainewy/fix-alloy-clustering-dpm
