🚀 Production-Grade EKS Platform with GitOps & Full Observability

Terraform Kubernetes ArgoCD AWS EKS License: MIT

Portfolio Project by Swanand Awatade
Cloud & DevOps Engineer | AWS | Kubernetes | Terraform | GitOps | Observability


🚀 Overview

This project demonstrates a production-grade Amazon EKS platform built using Terraform, GitOps, and a full observability stack.

It is designed to reflect real-world platform engineering practices used in production cloud environments, including:

  • Infrastructure provisioning with Terraform
  • GitOps-based application delivery using ArgoCD
  • Cluster observability with Prometheus, Grafana, and CloudWatch
  • Security-first CI/CD workflows with automated scanning
  • Reusable and modular infrastructure design

This repository is intended as a portfolio-grade reference implementation for modern AWS + Kubernetes + DevOps engineering.


✨ Key Features

  • Multi-AZ Amazon EKS deployment
  • Reusable Terraform modules for VPC and EKS
  • GitOps deployment model using ArgoCD
  • Prometheus + Grafana monitoring stack
  • CloudWatch log forwarding
  • Trivy-based security scanning
  • GitHub Actions CI/CD pipelines
  • PagerDuty / Alertmanager integration
  • Dev / Prod environment separation
  • Production-style runbooks and architecture docs

🏗️ Architecture

This platform follows a GitOps-based deployment model:

  1. Infrastructure is provisioned using Terraform
  2. Applications are deployed using ArgoCD
  3. Monitoring stack (Prometheus + Grafana) tracks cluster health
  4. Logs and operational telemetry are shipped to CloudWatch
  5. CI/CD pipelines enforce automation and security controls

This architecture is designed to simulate how a production Kubernetes platform is provisioned, deployed, observed, and operated.


🖼️ Architecture Diagram

📌 The architecture diagram has not been uploaded yet. Once the image is added at docs/images/architecture.png, reference it in the README with this line:

![Architecture Diagram](docs/images/architecture.png)


🛠️ Tech Stack

| Category | Tools / Services |
| --- | --- |
| Cloud Provider | AWS (EKS, VPC, IAM, ECR, CloudWatch, SNS, S3) |
| Infrastructure as Code | Terraform 1.7+, AWS Provider 5.x |
| Container Orchestration | Kubernetes 1.29, Helm 3.x |
| GitOps | ArgoCD 2.10 |
| Monitoring | Prometheus, Grafana, kube-state-metrics |
| Logging | AWS CloudWatch Container Insights, Fluent Bit |
| Security Scanning | Trivy (image + IaC) |
| CI/CD | GitHub Actions |
| Alerting | PagerDuty, Alertmanager |
| State Backend | S3 + DynamoDB |

📁 Project Structure

eks-gitops-observability/
├── README.md
├── .github/
│   └── workflows/
│       ├── terraform-plan.yml
│       ├── terraform-apply.yml
│       └── trivy-image-scan.yml
├── terraform/
│   ├── modules/
│   │   ├── vpc/
│   │   ├── eks/
│   │   └── iam/
│   └── environments/
│       ├── dev/
│       └── prod/
├── k8s/
│   ├── base/
│   └── overlays/
│       ├── dev/
│       └── prod/
├── argocd/
│   ├── apps/
│   └── projects/
├── monitoring/
│   ├── prometheus/
│   │   ├── values.yaml
│   │   └── alert-rules/
│   └── grafana/
│       └── dashboards/
└── docs/
    ├── architecture.md
    ├── runbook-incident-response.md
    └── cost-breakdown.md

✅ Prerequisites

Before getting started, ensure you have:

  • AWS CLI configured with valid IAM credentials
  • Terraform >= 1.7
  • kubectl
  • Helm >= 3.x
  • argocd CLI (optional)

Required AWS IAM Permissions

Your deployment role/user should have access to:

  • eks:*
  • ec2:*
  • iam:*
  • s3:*
  • ecr:*
  • cloudwatch:*

⚠️ For production use, a least-privilege CI/CD IAM role is strongly recommended.


🚀 Quick Start

1. Clone the repository

git clone https://github.com/swanand18/eks-gitops-observability.git
cd eks-gitops-observability

2. Bootstrap Terraform backend (one-time)

aws s3api create-bucket \
  --bucket swanand-eks-terraform-state \
  --region ap-south-1 \
  --create-bucket-configuration LocationConstraint=ap-south-1

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-south-1

3. Deploy the Dev environment

cd terraform/environments/dev
terraform init
terraform plan -out=tfplan
terraform apply tfplan

4. Configure kubectl

aws eks update-kubeconfig \
  --region ap-south-1 \
  --name eks-dev-cluster

kubectl get nodes

5. Install ArgoCD

kubectl create namespace argocd

kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

kubectl apply -f argocd/projects/
kubectl apply -f argocd/apps/

6. Deploy observability stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f monitoring/prometheus/values.yaml

🏗️ Terraform Infrastructure

VPC Module

This module provisions a production-style VPC with:

  • 3 public subnets (for NAT Gateways / ALBs)
  • 3 private subnets (for EKS worker nodes)
  • Internet Gateway + route tables
  • NAT Gateways across AZs
  • VPC Flow Logs

EKS Module

The EKS module provisions:

  • EKS control plane
  • Managed node groups
  • OIDC provider for IRSA
  • EKS add-ons:
    • vpc-cni
    • coredns
    • kube-proxy
    • aws-ebs-csi-driver

IAM Module

Provides IAM roles for:

  • AWS Load Balancer Controller
  • Cluster Autoscaler
  • Fluent Bit
  • External DNS

Example Terraform Usage

module "vpc" {
  source             = "../../modules/vpc"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["ap-south-1a", "ap-south-1b", "ap-south-1c"]
  cluster_name       = "eks-prod-cluster"
  environment        = "prod"
}

🔄 GitOps with ArgoCD

All Kubernetes workloads are deployed using ArgoCD, following a GitOps-first operating model.

Benefits of this approach

  • No manual production deployments
  • Declarative infrastructure and app delivery
  • Drift detection and self-healing
  • Version-controlled operational changes

App-of-Apps Pattern

argocd/apps/
├── app-of-apps.yaml
├── frontend.yaml
├── backend-api.yaml
└── monitoring.yaml

Deployment Strategy

  • Auto-sync enabled for non-production environments
  • Manual sync recommended for production
  • Self-heal enabled
  • Pruning controlled for safety
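
The sync settings above could be expressed roughly as follows. This is a minimal sketch, not the repository's actual manifest: the application name `backend-api-dev`, the `default` project, the `k8s/overlays/dev` path, and the `backend` namespace are assumptions chosen to match the directory layout shown earlier.

```yaml
# Hypothetical ArgoCD Application for a dev workload (names and paths are illustrative).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: backend-api-dev          # assumed name
  namespace: argocd
spec:
  project: default               # assumed; the repo keeps its own projects under argocd/projects/
  source:
    repoURL: https://github.com/swanand18/eks-gitops-observability.git
    targetRevision: main
    path: k8s/overlays/dev       # assumed to hold the dev manifests
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD itself runs in
    namespace: backend           # assumed target namespace
  syncPolicy:
    automated:
      selfHeal: true             # revert manual changes made directly in the cluster
      prune: false               # do not automatically delete resources removed from Git
    syncOptions:
      - CreateNamespace=true
```

For production, leaving out the `automated` block keeps the application on manual sync, so changes are applied only when an operator triggers them.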

📊 Observability Stack

The monitoring layer is built using Prometheus, Grafana, and CloudWatch.

Included Components

  • Prometheus
  • Grafana
  • kube-state-metrics
  • Alertmanager
  • Fluent Bit
  • CloudWatch Container Insights

Example Monitoring Coverage

  • Node CPU / memory utilization
  • Pod health and restart tracking
  • Deployment health
  • Application metrics
  • ArgoCD sync visibility
  • Cluster logging

Alerting Flow

PrometheusRule → Alertmanager → PagerDuty / Slack / Email
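
One way to wire the Alertmanager to PagerDuty hop is through the `alertmanager.config` value of the kube-prometheus-stack chart. The excerpt below is a sketch under assumptions: the `severity` label convention, the receiver names, and the placeholder routing key are illustrative and not taken from the repository's `values.yaml`.

```yaml
# Hypothetical excerpt for monitoring/prometheus/values.yaml (kube-prometheus-stack).
alertmanager:
  config:
    route:
      receiver: "null"                 # default: alerts matching no child route are dropped
      group_by: ["alertname", "namespace"]
      routes:
        - receiver: pagerduty
          matchers:
            - severity="critical"      # assumed label convention on alert rules
    receivers:
      - name: "null"
      - name: pagerduty
        pagerduty_configs:
          # Placeholder: supply the real Events API v2 key from a secret, never commit it.
          - routing_key: REPLACE_WITH_PAGERDUTY_INTEGRATION_KEY
```

Slack or email receivers would follow the same pattern with `slack_configs` or `email_configs` entries.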

Example Alerts

  • Pod crash loops
  • Node memory pressure
  • Deployment replica mismatch
  • Persistent volume usage threshold
  • Node readiness issues
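
As an illustration of the first alert in the list, a crash-loop rule might look like the sketch below. The rule name, threshold, and the `release` label are assumptions; the label only matters if the Prometheus Operator is configured to select rules by Helm release name, which is the chart's usual default.

```yaml
# Hypothetical rule, e.g. for monitoring/prometheus/alert-rules/ (file name not from the repo).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-crash-loop
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed: matches the Helm release so the rule is picked up
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # True when a container's restart count rose by more than 3 over the last 15 minutes;
          # the alert fires once that has held for 5 minutes.
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```

The `kube_pod_container_status_restarts_total` metric is exported by kube-state-metrics, which is part of the stack listed above.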

🔒 Security Scanning

This project includes security-first CI/CD validation using Trivy.

Scanning Coverage

  • Terraform IaC misconfigurations
  • Container image vulnerabilities
  • Pull request validation gates

Example GitHub Actions Scan

- name: Trivy IaC Scan
  uses: aquasecurity/trivy-action@master
  with:
    scan-type: config
    scan-ref: ./terraform
    severity: CRITICAL,HIGH
    exit-code: 1
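
Container images can be scanned with the same action. The step below is a sketch only: the ECR image reference is a placeholder, and the repository's actual `trivy-image-scan.yml` may be structured differently.

```yaml
# Hypothetical image scan step; the image reference is a placeholder.
- name: Trivy Image Scan
  uses: aquasecurity/trivy-action@master
  with:
    scan-type: image
    image-ref: 123456789012.dkr.ecr.ap-south-1.amazonaws.com/backend-api:latest   # placeholder
    severity: CRITICAL,HIGH
    ignore-unfixed: true     # skip findings that have no fixed version yet
    exit-code: 1             # fail the job when matching findings exist
```

Pinning the action to a released tag instead of `master` makes scan results more reproducible across runs.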

Security Design Principles

  • Shift-left security checks
  • Automated blocking of critical findings
  • IaC validation before infrastructure changes
  • Production-style security pipeline behavior

🔁 CI/CD Pipeline

GitHub Actions is used for CI/CD automation.

Pipeline Coverage

  • Terraform plan on pull requests
  • Terraform apply on merge to main
  • Security scans in CI
  • Image scanning workflows
  • Infrastructure validation

Example Workflow Files

.github/workflows/
├── terraform-plan.yml
├── terraform-apply.yml
└── trivy-image-scan.yml
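
A plan-on-pull-request workflow could be sketched as follows. This is not the repository's actual file: the `AWS_ROLE_ARN` secret name, the pinned Terraform version, and the choice of the dev environment directory are assumptions.

```yaml
# Hypothetical .github/workflows/terraform-plan.yml; not the repository's actual file.
name: Terraform Plan
on:
  pull_request:
    paths:
      - "terraform/**"

permissions:
  id-token: write      # needed to assume an AWS role via GitHub OIDC
  contents: read

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/environments/dev
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}   # assumed secret name
          aws-region: ap-south-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.7.5"                    # assumed; any 1.7+ release
      - run: terraform init -input=false
      - run: terraform validate
      - run: terraform plan -input=false -no-color
```

Assuming a role through OIDC avoids storing long-lived AWS access keys in repository secrets, which fits the least-privilege recommendation in the prerequisites.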

🚨 Alerting & Incident Response

This repository includes operational alerting patterns and incident response references.

Integrated / Planned Alerting

  • PagerDuty
  • Alertmanager
  • Slack / Email escalation

Incident Runbook

See:

docs/runbook-incident-response.md

This adds a realistic SRE / platform operations angle to the project.


💰 Cost Considerations

This project also reflects cost-aware infrastructure design.

| Resource | Estimated Monthly Cost (ap-south-1) |
| --- | --- |
| EKS Control Plane | ~$73 |
| EC2 On-Demand Nodes | ~$50 |
| EC2 Spot Nodes | ~$15 |
| NAT Gateways | ~$100 |
| EBS Volumes | ~$10 |
| Estimated Dev Total | ~$248/month |

Cost Optimization Notes

  • Spot node groups reduce worker node cost
  • A single NAT Gateway can reduce dev cost, at the expense of cross-AZ redundancy
  • The monitoring stack should be right-sized for each environment's scale

🚀 Future Improvements

Potential enhancements for this platform include:

  • Cluster Autoscaler integration
  • Karpenter-based node provisioning
  • External Secrets Operator
  • AWS Load Balancer Controller setup
  • OPA / Kyverno policy enforcement
  • Service mesh integration
  • Multi-cluster GitOps expansion

📖 Lessons Learned

1. GitOps improves operational consistency

Managing workloads declaratively reduces configuration drift and improves repeatability.

2. IRSA is essential

IAM Roles for Service Accounts provide cleaner and safer AWS access patterns inside Kubernetes.

3. Observability should be built in, not added later

Prometheus, Grafana, and logs should be part of the platform from day one.

4. Security belongs in the delivery pipeline

Infrastructure and image scanning should happen before changes reach production.

5. Platform engineering is about repeatability

The real value comes from building reusable, understandable, and operable infrastructure.


👨‍💻 Author

Swanand Awatade
Cloud & DevOps Engineer
📍 Pune, India
📧 swanand.awatade@gmail.com
🔗 LinkedIn | GitHub


📄 License

This project is licensed under the MIT License.
See the LICENSE file for more details.
