Add DevOps Agent Operator for automated workload failure investigation with AWS DevOps Agent #36

Open
cawcaw253 wants to merge 1 commit into aws-samples:main from cawcaw253:feat/devops-agent-operator
Conversation

@cawcaw253

Summary

This PR adds a Kubernetes operator that automatically detects EKS pod failures, collects comprehensive diagnostics, and integrates with AWS DevOps Agent for automated investigation.

Features

  • 5-Layer Failure Detection: Whitelist-based detection covering pod status, container states, exit codes, pod phase, and scheduling conditions
  • Comprehensive Data Collection: Captures pod manifests, logs (current/previous), events, and node-level diagnostics via SSM
  • Multi-Output Support: Exports to S3 and CloudWatch Logs, and triggers the AWS DevOps Agent webhook with HMAC authentication
  • Timeout-Based Detection: Identifies stuck states (ContainerCreating, Unschedulable) after a configurable grace period
  • Investigation Runbooks: Provides step-by-step guides for CloudWatch Logs and S3-based investigation
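
To illustrate how whitelist-based and timeout-based detection might fit together, here is a minimal sketch. It is not the operator's actual code: `PodSnapshot`, `Detect`, `defaultGrace`, the reason whitelist, and the order of the checks are all assumptions made for this example, and a simplified struct stands in for the real Kubernetes pod status types.

```go
package main

import (
	"fmt"
	"time"
)

// defaultGrace is a hypothetical grace period; the PR only says it is configurable.
const defaultGrace = 5 * time.Minute

// PodSnapshot is a simplified, hypothetical stand-in for the pod status
// fields a detector might inspect.
type PodSnapshot struct {
	Phase         string        // e.g. "Pending", "Running", "Failed"
	WaitingReason string        // e.g. "CrashLoopBackOff", "ContainerCreating"
	ExitCode      int           // last terminated exit code, 0 if none
	Unschedulable bool          // scheduler reported the pod as Unschedulable
	Age           time.Duration // time spent in the current state
}

// failureReasons is an illustrative whitelist, not the operator's real list.
var failureReasons = map[string]bool{
	"CrashLoopBackOff": true,
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
}

// Detect checks the layers in an assumed priority order and returns a
// short reason label plus true when the pod is considered failed.
func Detect(p PodSnapshot, grace time.Duration) (string, bool) {
	switch {
	case failureReasons[p.WaitingReason]:
		return "waiting:" + p.WaitingReason, true
	case p.ExitCode != 0:
		return fmt.Sprintf("exit-code:%d", p.ExitCode), true
	case p.Phase == "Failed":
		return "phase:Failed", true
	case p.Age <= grace:
		// Still inside the grace period: stuck states are not reported yet.
		return "", false
	case p.Unschedulable:
		return "stuck:Unschedulable", true
	case p.WaitingReason == "ContainerCreating":
		return "stuck:ContainerCreating", true
	}
	return "", false
}

func main() {
	pod := PodSnapshot{Phase: "Pending", WaitingReason: "ContainerCreating", Age: 2 * defaultGrace}
	reason, failed := Detect(pod, defaultGrace)
	fmt.Println(reason, failed)
}
```

In this sketch, states that are failures on their own are reported immediately, while ContainerCreating and Unschedulable are only reported once the grace period has elapsed, so normal startup is not flagged.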

Files Added

  • containers/devops-agent-operator/: Complete operator implementation
    • cmd/main.go: Operator entry point with controller setup
    • internal/controller/: Pod failure detector and reconciler
    • internal/collector/: Log and SSM-based node diagnostics collector
    • internal/output/: S3, CloudWatch Logs, and webhook clients
    • internal/config/: Configuration management
    • runbooks/: Investigation guides for AWS DevOps Agent
    • examples/: Kubernetes manifests for deployment
    • test/: Unit tests and e2e test suite
    • docs/: Architecture and pod lifecycle documentation
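
For readers unfamiliar with HMAC-authenticated webhooks such as the one the webhook client in internal/output/ triggers, the following is a generic sketch of signing and verifying a payload with HMAC-SHA256 using Go's standard library. The `Sign` and `Verify` helpers, the `timestamp:body` message layout, and the hex encoding are assumptions for illustration only. The signature format that AWS DevOps Agent actually expects is not described in this PR and may differ.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign returns the hex-encoded HMAC-SHA256 of "timestamp:body".
// The message layout is hypothetical, not the documented webhook format.
func Sign(secret []byte, timestamp string, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(timestamp))
	mac.Write([]byte(":"))
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the signature and compares it in constant time,
// which avoids leaking information through comparison timing.
func Verify(secret []byte, timestamp string, body []byte, signature string) bool {
	expected := Sign(secret, timestamp, body)
	return hmac.Equal([]byte(expected), []byte(signature))
}

func main() {
	secret := []byte("example-secret") // placeholder, never hard-code real secrets
	body := []byte(`{"pod":"demo"}`)
	sig := Sign(secret, "1700000000", body)
	fmt.Println(Verify(secret, "1700000000", body, sig))
}
```

Including a timestamp in the signed message is a common way to let the receiver reject replayed requests. Whether the operator does this is not stated here.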

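Summary (translated from Korean)

Adds a Kubernetes Operator that automatically detects EKS Pod failures, collects comprehensive diagnostic data, and integrates with AWS DevOps Agent to support automated failure investigation.

Key Features

  • 5-Layer Failure Detection: Whitelist-based detection covering Pod status, container states, exit codes, Pod phase, and scheduling conditions
  • Comprehensive Data Collection: Collects Pod manifests, logs (current/previous), events, and node-level diagnostics via SSM
  • Multi-Output Support: Stores data in S3 and CloudWatch Logs, and triggers the AWS DevOps Agent webhook with HMAC authentication
  • Timeout-Based Detection: Identifies stuck states (ContainerCreating, Unschedulable) after a configurable grace period
  • Investigation Runbooks: Provides step-by-step guides for CloudWatch Logs and S3-based investigation

Files Added

  • containers/devops-agent-operator/: Complete Operator implementation
    • cmd/main.go: Operator entry point, including controller setup
    • internal/controller/: Pod failure detector and reconciler
    • internal/collector/: Log and SSM-based node diagnostics collector
    • internal/output/: S3, CloudWatch Logs, and webhook clients
    • internal/config/: Configuration management
    • runbooks/: Example investigation guides for the various data sources to be specified in AWS DevOps Agent
    • examples/: Kubernetes manifests for deployment
    • test/: Unit tests and e2e test suite
    • docs/: Architecture and Pod lifecycle documentation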

- Detect pod failures with 5-layer priority logic
- Collect pod/node diagnostics via SSM
- Export to S3 and CloudWatch Logs, and trigger the AWS DevOps Agent webhook
- Include investigation runbooks and tests
@mateon01 self-assigned this on Mar 12, 2026
