Spun out of #937 as the smallest standalone unit. Without this, host-level CloudWatch memory and swap alarms can never page — the agent is running and publishing every minute, but every push gets AccessDenied and the alarm has no data to evaluate. The 2026-05-17 outage's detection gap was this, not a missing alarm.
Current state
- CloudWatch agent (amazon-cloudwatch-agent) is running on wxyc-ec2 and pushing 1-minute system metrics.
- Agent log shows AccessDenied on every PutMetricData call.
- The intended alarms (memory pressure, swap usage) sit in INSUFFICIENT_DATA permanently.
- During the 2026-05-17 15-hour outage the alarms never fired despite host memory pressure being the root cause. The synthetic-DJ canary did fire 21h before the OOM, but its single SNS subscriber was OOO; the host-side alarms were the intended second leg.
Fix
The EC2 instance role used by wxyc-ec2 needs cloudwatch:PutMetricData granted. The AWS-managed CloudWatchAgentServerPolicy covers it (plus ec2:DescribeTags and the logs: actions needed for log push); attaching that policy or adding a narrower inline statement is fine, whichever the IAM convention prefers.
Needs WXYC infra account creds (account 503977661500, us-east-1). Doable from the AWS Console or aws iam attach-role-policy.
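A minimal sketch of both shapes of the grant, wrapped as shell functions so nothing runs until called. The role name is an argument because this issue does not name the instance role, and the inline policy name cloudwatch-put-metric-data is a made-up placeholder; adjust both to the real IAM convention.

```shell
# Option A: attach the AWS-managed policy to the instance role.
# $1 is the role name (not given in this issue, so the caller supplies it).
attach_agent_policy() {
  aws iam attach-role-policy \
    --role-name "$1" \
    --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
}

# Option B: grant only cloudwatch:PutMetricData through an inline policy.
# The policy name below is a hypothetical placeholder.
put_metric_data_policy() {
  aws iam put-role-policy \
    --role-name "$1" \
    --policy-name cloudwatch-put-metric-data \
    --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"cloudwatch:PutMetricData","Resource":"*"}]}'
}

# List the managed policies attached to the role, to confirm option A applied.
list_attached_policies() {
  aws iam list-attached-role-policies --role-name "$1" --output text
}
```

With WXYC infra creds loaded, something like `attach_agent_policy <role-name>` followed by `list_attached_policies <role-name>` should be enough for option A. Option B is the narrower grant and leaves out the tag and log permissions that the managed policy bundles.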
Acceptance
- journalctl -u amazon-cloudwatch-agent stops showing AccessDenied.
- aws cloudwatch describe-alarms --alarm-names <memory> <swap> shows both transition out of INSUFFICIENT_DATA.
- Under synthetic memory load (stress-ng --vm 1 --vm-bytes 1G --timeout 120s), verify the memory alarm fires to SNS within the 1-min granularity.
Why a sub-issue, not in line
#937's body is a full RCA covering the semantic-index uvicorn OOM, the worker-bound persistence work (semantic-index#318), the IAM permission gap, and the canary escalation gap. The first two have homes; this didn't.
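The acceptance checks could be scripted roughly as follows. This is a sketch, not a tested runbook: the alarm names are passed as arguments because the issue only gives them as the <memory> and <swap> placeholders, and the function names are invented for illustration.

```shell
# Print "name<TAB>state" for each named alarm in us-east-1.
alarm_states() {
  aws cloudwatch describe-alarms \
    --region us-east-1 \
    --alarm-names "$@" \
    --query 'MetricAlarms[].[AlarmName,StateValue]' \
    --output text
}

# Succeed only if at least one alarm came back and none is INSUFFICIENT_DATA.
alarms_have_data() {
  states=$(alarm_states "$@") || return 1
  [ -n "$states" ] || return 1
  ! printf '%s\n' "$states" | grep -q 'INSUFFICIENT_DATA'
}

# On the host, the remaining checks are manual:
#   journalctl -u amazon-cloudwatch-agent --since '10 minutes ago' --no-pager | grep AccessDenied
#   stress-ng --vm 1 --vm-bytes 1G --timeout 120s
```

A call such as `alarms_have_data <memory> <swap>` returns non-zero while either alarm is still starved of data, so it can gate the stress-ng step.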