Add DevOps Agent Operator for automated workload failure investigation with AWS DevOps Agent by cawcaw253 · Pull Request #36 · aws-samples/kr-tech-blog-sample-code

cawcaw253 · 2026-03-06T10:50:15Z

Summary

This PR adds a Kubernetes operator that automatically detects EKS pod failures, collects comprehensive diagnostics, and integrates with AWS DevOps Agent for automated investigation.

Features

5-Layer Failure Detection: Whitelist-based detection covering pod status, container states, exit codes, pod phase, and scheduling conditions
Comprehensive Data Collection: Captures pod manifests, logs (current/previous), events, and node-level diagnostics via SSM
Multi-Output Support: Exports to S3 and CloudWatch Logs, and triggers the AWS DevOps Agent webhook with HMAC authentication
Timeout-Based Detection: Identifies stuck states (ContainerCreating, Unschedulable) after a configurable grace period
Investigation Runbooks: Provides step-by-step guides for CloudWatch Logs and S3-based investigation
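
To illustrate how the layered and timeout-based detection described above could fit together, here is a minimal sketch. It is not the code in internal/controller/: the `PodSnapshot` and `ContainerState` types, the `Detect` function, the whitelist contents, and the ordering of the five layers are all assumptions made for illustration. A real operator would read these fields from the Kubernetes API objects instead.

```go
package main

import (
	"fmt"
	"time"
)

// ContainerState is a hypothetical, simplified view of one container's status.
type ContainerState struct {
	Reason     string // waiting or terminated reason, e.g. "CrashLoopBackOff"
	Terminated bool
	ExitCode   int
}

// PodSnapshot is a hypothetical, simplified view of a pod.
type PodSnapshot struct {
	StatusReason  string        // pod-level status reason, e.g. "Evicted"
	Phase         string        // "Pending", "Running", "Succeeded", "Failed"
	Unschedulable bool          // PodScheduled is False with reason Unschedulable
	Age           time.Duration // time since the pod was created
	Containers    []ContainerState
}

// Assumed whitelists; the operator's actual lists are not given in this PR text.
var podReasons = map[string]bool{"Evicted": true}
var containerReasons = map[string]bool{
	"CrashLoopBackOff": true,
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
	"OOMKilled":        true,
}

// Detect checks the layers in priority order and returns the first match.
// Stuck states only count as failures once the pod is older than grace.
func Detect(p PodSnapshot, grace time.Duration) (string, bool) {
	stuck := p.Age > grace

	// Layer 1: pod status reason.
	if podReasons[p.StatusReason] {
		return "pod status: " + p.StatusReason, true
	}
	// Layer 2: container states, including the ContainerCreating timeout.
	for _, c := range p.Containers {
		if containerReasons[c.Reason] {
			return "container state: " + c.Reason, true
		}
		if c.Reason == "ContainerCreating" && stuck {
			return "container state: ContainerCreating (timeout)", true
		}
	}
	// Layer 3: non-zero exit codes of terminated containers.
	for _, c := range p.Containers {
		if c.Terminated && c.ExitCode != 0 {
			return fmt.Sprintf("exit code: %d", c.ExitCode), true
		}
	}
	// Layer 4: pod phase.
	if p.Phase == "Failed" {
		return "pod phase: Failed", true
	}
	// Layer 5: scheduling conditions, also subject to the grace period.
	if p.Unschedulable && stuck {
		return "scheduling: Unschedulable (timeout)", true
	}
	return "", false
}

func main() {
	pod := PodSnapshot{
		Phase:      "Pending",
		Age:        10 * time.Minute,
		Containers: []ContainerState{{Reason: "ContainerCreating"}},
	}
	reason, failed := Detect(pod, 5*time.Minute)
	fmt.Println(failed, reason)
}
```

Returning on the first match keeps one reason per pod, which matches the "priority" wording in the commit message. Applying the grace period only to ContainerCreating and Unschedulable avoids flagging pods that are simply still starting.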

Files Added

containers/devops-agent-operator/: Complete operator implementation
- cmd/main.go: Operator entry point with controller setup
- internal/controller/: Pod failure detector and reconciler
- internal/collector/: Log and SSM-based node diagnostics collector
- internal/output/: S3, CloudWatch Logs, and webhook clients
- internal/config/: Configuration management
- runbooks/: Example investigation guides, one per data source, to be configured in AWS DevOps Agent
- examples/: Kubernetes manifests for deployment
- test/: Unit tests and e2e test suite
- docs/: Architecture and pod lifecycle documentation
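
As a rough illustration of what HMAC authentication for the webhook client in internal/output/ might involve, the sketch below signs a request body and verifies the signature. It makes several assumptions: the hash function (SHA-256), the hex encoding, the `Sign` and `Verify` helper names, and the sample payload are all hypothetical. The signature format that AWS DevOps Agent actually expects is not stated in this PR description and should be taken from its documentation.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign returns the hex-encoded HMAC-SHA256 of body under secret.
// The choice of SHA-256 and hex encoding is an assumption.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the MAC and compares it in constant time.
func Verify(secret, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret") // placeholder, not a real credential
	body := []byte(`{"pod":"default/example","reason":"CrashLoopBackOff"}`)

	sig := Sign(secret, body)
	fmt.Println("signature:", sig)
	fmt.Println("valid:", Verify(secret, body, sig))
}
```

Using `hmac.Equal` instead of a plain string comparison avoids leaking timing information when a receiver checks the signature.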

feat: add devops-agent-operator for EKS pod failure detection

4746be7

- Detect pod failures with 5-layer priority logic
- Collect pod/node diagnostics via SSM
- Export to S3, CloudWatch Logs, and trigger aws devops agent webhook
- Include investigation runbooks and tests

cawcaw253 requested review from comeddy, masangbeom and mateon01 as code owners March 6, 2026 10:50

mateon01 self-assigned this Mar 12, 2026

cawcaw253 wants to merge 1 commit into aws-samples:main from cawcaw253:feat/devops-agent-operator