Enable Alloy clustering to reduce DPM by ~50% #522

Open
yangw-dev wants to merge 1 commit into main from elainewy/fix-alloy-clustering-dpm

@yangw-dev

Summary

Add alloy.clustering.enabled: true to helm values. This tells the Alloy helm chart to:

  1. Create a headless service (alloy-cluster) for peer discovery
  2. Pass --cluster.join-addresses=alloy-cluster to Alloy pods

With this, the two Alloy replicas discover each other and distribute scrape targets ~50/50, instead of both independently scraping all targets (doubling DPM).

This is a one-line change: controller.type (stays as Deployment), deploy.sh, and the smoke tests are not modified.
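
As a rough illustration, the change described above might look like this in the Helm values file. Only alloy.clustering.enabled comes from this PR; the controller block is shown for context, and its exact keys and values (including the replica count, taken from "the two Alloy replicas") are assumptions about the existing file, not copied from it.

```yaml
# Sketch only: surrounding keys are assumed, not taken from the repo.
alloy:
  clustering:
    enabled: true    # the single line this PR adds

controller:
  type: deployment   # unchanged by this PR (assumed spelling of the existing value)
  replicas: 2        # assumed from "the two Alloy replicas"
```

If the values are nested under a parent chart key in this repo, the same setting would sit one level deeper.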

Root cause

The Alloy River config already set clustering { enabled = true } on the servicemonitors and podmonitors components, but the chart-level Helm value alloy.clustering.enabled was not set. Without it, the chart neither creates the headless service nor passes --cluster.join-addresses, so the pods cannot discover each other. The logs showed: "no peer discovery configured: both join and discover peers are empty".

Expected impact

  • DPM per series: ~2.0 → ~1.0
  • Total DPM: ~50% reduction
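
A back-of-the-envelope check of these figures, assuming a 60-second scrape interval (not stated in this PR; inferred from the ~2.0 DPM figure with two replicas). Here R is the number of replicas scraping a given target and T is the scrape interval in seconds; both symbols are introduced only for this sketch.

```latex
\mathrm{DPM}_{\text{series}} = R \cdot \frac{60}{T}
\qquad
\text{without clustering: } 2 \cdot \frac{60}{60} = 2.0
\qquad
\text{with clustering: } 1 \cdot \frac{60}{60} = 1.0
```

Because every series drops from two writers to one, the total DPM falls by roughly the same factor, which matches the ~50% estimate.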

Test plan

  • Deploy to staging, verify clustering logs show peer discovery
  • Verify DPM reduction in Grafana Cloud dashboard
  • Deploy to prod

@github-actions

github-actions Bot commented Apr 30, 2026

tofu plan — arc-cbr-production

✅ Plan succeeded · commit 0ac63d60 · run log

Plan output
Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role]
data.aws_availability_zones.available: Reading...
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-node-role]
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=8115d61b-1bc1-49ad-b5a3-e8f88fc50cb1]
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.eks.data.aws_caller_identity.current: Reading...
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0a126b1613758a408]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-2]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNMSO5RRNP]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-eks-secrets]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role-20260308084936685500000002]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role-20260308084936681500000001]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936813000000004]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936816800000005]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260316204739334600000001]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role-20260308084936734100000003]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 0s [id=ami-009f1fe7d56695348]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-03eb66e57d13af64b]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3-20260308084938596600000006]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-084ed6fc52db22c39]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0610564f678f81c5f]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-07ac52a1aa741f267]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-06a70b2818e270ed8]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-023207cd15e79c81a]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0701693364b79c021]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-0078fd5c0f6bc05eb]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0545d26e4a1d0ba89]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-04682fc890bfd4630]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0ce6f1dcb7208cad8]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-0aa6ea5c845170545]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-04d9bba8d43569bbf]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0d2591f24cba79e7b]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production:kube-proxy]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-090bac79dddc5b77f]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-07e2274170282eb8c]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-086e3e66fe238d459]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-0f34cc1aafea8fd16]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production:pytorch-arc-cbr-production-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/70AA0C12C21E1A843313EF1BDE82D29A]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-000d05ecec7d4b66e]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0777285eddd2bacd1]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0f623a6fa9d7bde45]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=2255203180]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-00dacd13031b1f5de]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0ec9764e9015e972e]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-08ccb8cfe4bfa80d7]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production:coredns]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role-2026030809125522790000000d]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry-2026030809125509320000000c]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-ebs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change-KarpenterScheduledChange]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.subnet_karpenter_discovery["subnet-0ce6f1dcb7208cad8"]: Refreshing state... [id=subnet-0ce6f1dcb7208cad8,karpenter.sh/discovery]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-03b965bcc0c037434,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0545d26e4a1d0ba89"]: Refreshing state... [id=subnet-0545d26e4a1d0ba89,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-04682fc890bfd4630"]: Refreshing state... [id=subnet-04682fc890bfd4630,karpenter.sh/discovery]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-karpenter-controller]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller-20260308154648023000000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-053d2ed886d9ac92d]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role]
aws_security_group.efs: Refreshing state... [id=sg-099ef6309262a93fd]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role-20260330040250456800000003]
aws_efs_mount_target.pypi_cache["subnet-0545d26e4a1d0ba89"]: Refreshing state... [id=fsmt-05b0a0d538bd49c8e]
aws_efs_mount_target.pypi_cache["subnet-0ce6f1dcb7208cad8"]: Refreshing state... [id=fsmt-01378a00a07852987]
aws_efs_mount_target.pypi_cache["subnet-04682fc890bfd4630"]: Refreshing state... [id=fsmt-0743bba60c50ed499]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role-20260403211352439500000002]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-efs-csi-driver]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role-20260403211352357700000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

1. Add alloy.clustering.enabled=true — tells the helm chart to create
   a headless service and pass --cluster.join-addresses, enabling peer
   discovery between replicas. DPM drops from ~2x to ~1x per series.

2. Increase memory limit from 1Gi to 2Gi — prod has more scrape targets
   than staging, and clustering adds memberlist overhead. Without this,
   prod Alloy pods OOMKill.

No changes to controller type (stays as Deployment), deploy.sh,
or smoke tests.
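
A minimal sketch of how the two settings above could appear together in the Helm values. The alloy.resources path is where the upstream Alloy chart takes container resources, but whether this repo's values file uses that exact path, or also sets requests, is an assumption.

```yaml
# Sketch only: key layout assumed from the upstream Alloy chart, not copied from the repo.
alloy:
  clustering:
    enabled: true    # change 1: enable peer discovery between replicas
  resources:
    limits:
      memory: 2Gi    # change 2: raised from 1Gi
```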

Authored with Claude.
@yangw-dev yangw-dev force-pushed the elainewy/fix-alloy-clustering-dpm branch from ff487e2 to 44f0e0c Compare April 30, 2026 01:55