Grant EC2 instance role cloudwatch:PutMetricData so memory/swap alarms can leave INSUFFICIENT_DATA #965

@jakebromberg

Description

Spun out of #937 as the smallest standalone unit. Without this, host-level CloudWatch memory and swap alarms can never page: the agent is running and publishing every minute, but every push gets AccessDenied, so the alarms have no data to evaluate. The detection gap in the 2026-05-17 outage was this permission gap, not a missing alarm.

Current state

  • CloudWatch agent (amazon-cloudwatch-agent) is running on wxyc-ec2 and attempting to push 1-minute system metrics.
  • Agent log shows AccessDenied on every PutMetricData call.
  • The intended alarms (memory pressure, swap usage) sit in INSUFFICIENT_DATA permanently.
  • During the 2026-05-17 15-hour outage the alarms never fired despite host memory pressure being the root cause. The synthetic-DJ canary did fire 21h before the OOM, but its single SNS subscriber was OOO; the host-side alarms were the intended second leg.
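
To confirm the state described above before changing anything, something like the following could be used. It is a rough sketch, not a tested runbook: the role name in ROLE_ARN is a hypothetical placeholder (the issue does not name the role), and `aws iam simulate-principal-policy` needs the caller to hold iam:SimulatePrincipalPolicy. The journalctl check has to run on the host itself.

```shell
#!/usr/bin/env bash
# ASSUMPTION: EXAMPLE-INSTANCE-ROLE is a placeholder, not the real role name.
ROLE_ARN="${ROLE_ARN:-arn:aws:iam::503977661500:role/EXAMPLE-INSTANCE-ROLE}"

# Ask IAM whether the given role may call cloudwatch:PutMetricData.
# Prints the decision: allowed, implicitDeny, or explicitDeny.
check_put_metric_data() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names cloudwatch:PutMetricData \
    --query 'EvaluationResults[0].EvalDecision' \
    --output text
}

# Count AccessDenied lines the agent logged in the last hour (run on the host).
# grep -c exits non-zero when the count is 0, so the exit status is ignored.
count_access_denied() {
  journalctl -u amazon-cloudwatch-agent --since "1 hour ago" --no-pager \
    | grep -c 'AccessDenied' || true
}

# Nothing runs unless RUN=1 is set, so the file can be sourced safely.
if [ "${RUN:-0}" = "1" ]; then
  echo "PutMetricData decision: $(check_put_metric_data "$ROLE_ARN")"
  echo "AccessDenied lines in the last hour: $(count_access_denied)"
fi
```

If the gap is as described, the first line should report implicitDeny and the second a non-zero count.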

Fix

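The EC2 instance role used by wxyc-ec2 needs cloudwatch:PutMetricData granted. The AWS-managed CloudWatchAgentServerPolicy covers it (plus ec2:DescribeTags and the specific logs actions the agent needs for log push, not a blanket logs:*). Either attaching that managed policy or adding a narrower inline statement is fine, whichever the existing IAM convention prefers.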

Needs WXYC infra account creds (account 503977661500, us-east-1). Doable from the AWS Console or aws iam attach-role-policy.
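
For the CLI route, the two shapes mentioned above might look roughly like this. Treat it as a sketch under assumptions: ROLE_NAME and the inline policy name AllowPutMetricData are hypothetical placeholders, and only one of the two functions would be used depending on the IAM convention. The managed policy ARN is the standard AWS one.

```shell
#!/usr/bin/env bash
# ASSUMPTION: placeholder role name; substitute the real wxyc-ec2 instance role.
ROLE_NAME="${ROLE_NAME:-EXAMPLE-INSTANCE-ROLE}"
POLICY_ARN="arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"

# Option A: attach the AWS-managed agent policy to the role.
attach_agent_policy() {
  aws iam attach-role-policy --role-name "$1" --policy-arn "$POLICY_ARN"
}

# Option B: a minimal inline policy granting only PutMetricData.
# PutMetricData does not support resource-level permissions, hence "*".
minimal_policy_document() {
  cat <<'JSON'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }
  ]
}
JSON
}

# ASSUMPTION: "AllowPutMetricData" is a made-up inline policy name.
put_minimal_inline_policy() {
  aws iam put-role-policy --role-name "$1" \
    --policy-name "AllowPutMetricData" \
    --policy-document "$(minimal_policy_document)"
}

# Confirm what is attached afterwards (managed policies only).
list_attached_policies() {
  aws iam list-attached-role-policies --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' --output text
}

# Nothing runs unless RUN=1 is set, so the file can be sourced safely.
if [ "${RUN:-0}" = "1" ]; then
  attach_agent_policy "$ROLE_NAME" && list_attached_policies "$ROLE_NAME"
fi
```

Option A is the lower-effort choice and also covers log push; Option B is narrower if least privilege is preferred and only metrics are needed.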

Acceptance

  • Policy attached to the EC2 instance role.
  • Within 5 min of attach, journalctl -u amazon-cloudwatch-agent stops showing AccessDenied.
  • aws cloudwatch describe-alarms --alarm-names <memory> <swap> shows both transition out of INSUFFICIENT_DATA.
  • Stress the host (stress-ng --vm 1 --vm-bytes 1G --timeout 120s) and verify the memory alarm fires to SNS within roughly one 1-minute metric period.
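
The alarm-state check in the list above could be scripted along these lines. This is a sketch only: the alarm names are hypothetical placeholders (the issue leaves them as <memory> and <swap>), and the 5-minute polling window is an assumption borrowed from the log check; the real transition time depends on each alarm's period and evaluation settings.

```shell
#!/usr/bin/env bash
# ASSUMPTION: placeholder alarm names; substitute the real memory/swap alarms.
MEMORY_ALARM="${MEMORY_ALARM:-EXAMPLE-memory-alarm}"
SWAP_ALARM="${SWAP_ALARM:-EXAMPLE-swap-alarm}"

# Print one "name<TAB>state" line per alarm.
alarm_states() {
  aws cloudwatch describe-alarms --alarm-names "$@" \
    --query 'MetricAlarms[].[AlarmName,StateValue]' --output text
}

# Succeed only if at least one alarm was returned and none of them
# is still in INSUFFICIENT_DATA.
alarms_have_data() {
  states="$(alarm_states "$@")" || return 1
  [ -n "$states" ] || return 1
  if printf '%s\n' "$states" | grep -q 'INSUFFICIENT_DATA'; then
    return 1
  fi
  return 0
}

# Poll every 30 seconds, up to 10 times (about 5 minutes in total).
wait_for_data() {
  tries=0
  while [ "$tries" -lt 10 ]; do
    if alarms_have_data "$@"; then
      return 0
    fi
    tries=$((tries + 1))
    sleep 30
  done
  return 1
}

# Nothing runs unless RUN=1 is set, so the file can be sourced safely.
if [ "${RUN:-0}" = "1" ]; then
  if wait_for_data "$MEMORY_ALARM" "$SWAP_ALARM"; then
    alarm_states "$MEMORY_ALARM" "$SWAP_ALARM"
  else
    echo "alarms still in INSUFFICIENT_DATA (or not found)" >&2
    exit 1
  fi
fi
```

Once both alarms report OK, the stress-ng step can be run and the same alarm_states call used to watch the memory alarm move to ALARM.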

Why a sub-issue, not inline

#937's body is a full RCA covering the semantic-index uvicorn OOM, the worker-bound persistence work (semantic-index#318), the IAM permission gap, and the canary escalation gap. The first two have homes; the IAM permission gap did not, hence this issue.

Labels: bug, status:ready