Portfolio Project by Swanand Awatade
Cloud & DevOps Engineer | AWS | Kubernetes | Terraform | GitOps | Observability
## Overview

This project demonstrates a production-grade Amazon EKS platform built using Terraform, GitOps, and a full observability stack.
It is designed to reflect real-world platform engineering practices used in production cloud environments, including:
- Infrastructure provisioning with Terraform
- GitOps-based application delivery using ArgoCD
- Cluster observability with Prometheus, Grafana, and CloudWatch
- Security-first CI/CD workflows with automated scanning
- Reusable and modular infrastructure design
This repository is intended as a portfolio-grade reference implementation for modern AWS + Kubernetes + DevOps engineering.
## Key Features

- Multi-AZ Amazon EKS deployment
- Reusable Terraform modules for VPC and EKS
- GitOps deployment model using ArgoCD
- Prometheus + Grafana monitoring stack
- CloudWatch log forwarding
- Trivy-based security scanning
- GitHub Actions CI/CD pipelines
- PagerDuty / Alertmanager integration
- Dev / Prod environment separation
- Production-style runbooks and architecture docs
## Table of Contents

- Overview
- Key Features
- Architecture
- Architecture Diagram
- Tech Stack
- Project Structure
- Prerequisites
- Quick Start
- Terraform Infrastructure
- GitOps with ArgoCD
- Observability Stack
- Security Scanning
- CI/CD Pipeline
- Alerting & Incident Response
- Cost Considerations
- Future Improvements
- Lessons Learned
- Author
- License
## Architecture

This platform follows a GitOps-based deployment model:
- Infrastructure is provisioned using Terraform
- Applications are deployed using ArgoCD
- Monitoring stack (Prometheus + Grafana) tracks cluster health
- Logs and operational telemetry are shipped to CloudWatch
- CI/CD pipelines enforce automation and security controls
This architecture is designed to simulate how a production Kubernetes platform is provisioned, deployed, observed, and operated.
## Architecture Diagram

> 📌 Add your architecture image at `docs/images/architecture.png`, then reference it in the README with:
> `![Architecture Diagram](docs/images/architecture.png)`
## Tech Stack

| Category | Tools / Services |
|---|---|
| Cloud Provider | AWS (EKS, VPC, IAM, ECR, CloudWatch, SNS, S3) |
| Infrastructure as Code | Terraform 1.7+, AWS Provider 5.x |
| Container Orchestration | Kubernetes 1.29, Helm 3.x |
| GitOps | ArgoCD 2.10 |
| Monitoring | Prometheus, Grafana, kube-state-metrics |
| Logging | AWS CloudWatch Container Insights, Fluent Bit |
| Security Scanning | Trivy (image + IaC) |
| CI/CD | GitHub Actions |
| Alerting | PagerDuty, Alertmanager |
| State Backend | S3 + DynamoDB |
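The S3 + DynamoDB state backend above is wired into each environment's Terraform configuration via a `backend` block. A minimal sketch, reusing the bucket and table names from the Quick Start (the `key` layout is an assumption):

```hcl
terraform {
  backend "s3" {
    bucket         = "swanand-eks-terraform-state"
    key            = "dev/terraform.tfstate"   # assumption: one state key per environment
    region         = "ap-south-1"
    dynamodb_table = "terraform-state-lock"    # enables state locking
    encrypt        = true
  }
}
```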
## Project Structure

```
eks-gitops-observability/
├── README.md
├── .github/
│   └── workflows/
│       ├── terraform-plan.yml
│       ├── terraform-apply.yml
│       └── trivy-image-scan.yml
├── terraform/
│   ├── modules/
│   │   ├── vpc/
│   │   ├── eks/
│   │   └── iam/
│   └── environments/
│       ├── dev/
│       └── prod/
├── k8s/
│   ├── base/
│   └── overlays/
│       ├── dev/
│       └── prod/
├── argocd/
│   ├── apps/
│   └── projects/
├── monitoring/
│   ├── prometheus/
│   │   ├── values.yaml
│   │   └── alert-rules/
│   └── grafana/
│       └── dashboards/
└── docs/
    ├── architecture.md
    ├── runbook-incident-response.md
    └── cost-breakdown.md
```

## Prerequisites

Before getting started, ensure you have:
- AWS CLI configured with valid IAM credentials
- Terraform >= 1.7
- kubectl
- Helm >= 3.x
- argocd CLI (optional)
Your deployment role/user should have access to:
`eks:*`, `ec2:*`, `iam:*`, `s3:*`, `ecr:*`, `cloudwatch:*`
> ⚠️ For production use, a least-privilege CI/CD IAM role is strongly recommended.
## Quick Start

### 1. Clone the repository

```bash
git clone https://github.com/swanand18/eks-gitops-observability.git
cd eks-gitops-observability
```

### 2. Create the Terraform state backend

```bash
aws s3api create-bucket \
  --bucket swanand-eks-terraform-state \
  --region ap-south-1 \
  --create-bucket-configuration LocationConstraint=ap-south-1

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-south-1
```

### 3. Provision the infrastructure

```bash
cd terraform/environments/dev
terraform init
terraform plan -out=tfplan
terraform apply tfplan
```

### 4. Configure kubectl

```bash
aws eks update-kubeconfig \
  --region ap-south-1 \
  --name eks-dev-cluster

kubectl get nodes
```

### 5. Install ArgoCD

```bash
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

kubectl apply -f argocd/projects/
kubectl apply -f argocd/apps/
```

### 6. Install the monitoring stack

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f monitoring/prometheus/values.yaml
```

## Terraform Infrastructure

### VPC Module

This module provisions a production-style VPC with:
- 3 public subnets (for NAT Gateways / ALBs)
- 3 private subnets (for EKS worker nodes)
- Internet Gateway + route tables
- NAT Gateways across AZs
- VPC Flow Logs
### EKS Module

The EKS module provisions:
- EKS control plane
- Managed node groups
- OIDC provider for IRSA
- EKS add-ons:
  - `vpc-cni`
  - `coredns`
  - `kube-proxy`
  - `aws-ebs-csi-driver`
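These add-ons can be managed declaratively in Terraform as well. A minimal sketch (the cluster name is an assumption; in practice add-on versions are usually pinned):

```hcl
resource "aws_eks_addon" "core" {
  for_each = toset([
    "vpc-cni",
    "coredns",
    "kube-proxy",
    "aws-ebs-csi-driver",
  ])

  cluster_name = "eks-dev-cluster" # assumption: matches the dev cluster name
  addon_name   = each.value

  # Resolve drift in favor of the managed add-on configuration on updates
  resolve_conflicts_on_update = "OVERWRITE"
}
```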
### IAM Module

Provides IAM roles for:
- AWS Load Balancer Controller
- Cluster Autoscaler
- Fluent Bit
- External DNS
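Each of these roles is consumed inside the cluster via IRSA, by annotating the workload's service account with the role ARN. An illustrative sketch for Fluent Bit (account ID, role name, and namespace are assumptions):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
  annotations:
    # assumption: the role created by the IAM module for Fluent Bit
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/eks-dev-fluent-bit
```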
### Example Usage

```hcl
module "vpc" {
  source = "../../modules/vpc"

  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["ap-south-1a", "ap-south-1b", "ap-south-1c"]
  cluster_name       = "eks-prod-cluster"
  environment        = "prod"
}
```

## GitOps with ArgoCD

All Kubernetes workloads are deployed using ArgoCD, following a GitOps-first operating model.
- No manual production deployments
- Declarative infrastructure and app delivery
- Drift detection and self-healing
- Version-controlled operational changes
```
argocd/apps/
├── app-of-apps.yaml
├── frontend.yaml
├── backend-api.yaml
└── monitoring.yaml
```

- Auto-sync enabled for non-production environments
- Manual sync recommended for production
- Self-heal enabled
- Pruning controlled for safety
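The sync behavior above maps onto the `syncPolicy` of an ArgoCD `Application`. A minimal sketch for a dev app (the app name, manifest path, and target namespace are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/swanand18/eks-gitops-observability.git
    targetRevision: main
    path: k8s/overlays/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: frontend
  syncPolicy:
    automated:
      prune: true     # controlled pruning of removed resources
      selfHeal: true  # revert out-of-band changes to the Git state
```

For production apps, the `automated` block would typically be omitted in favor of manual syncs.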
## Observability Stack

The monitoring layer is built using Prometheus, Grafana, and CloudWatch.
**Components:**

- Prometheus
- Grafana
- kube-state-metrics
- Alertmanager
- Fluent Bit
- CloudWatch Container Insights
**Coverage:**

- Node CPU / memory utilization
- Pod health and restart tracking
- Deployment health
- Application metrics
- ArgoCD sync visibility
- Cluster logging
Alert flow: `PrometheusRule → Alertmanager → PagerDuty / Slack / Email`

Alerts cover:
- Pod crash loops
- Node memory pressure
- Deployment replica mismatch
- Persistent volume usage threshold
- Node readiness issues
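An alert such as the pod crash-loop case above is expressed as a `PrometheusRule` picked up by the kube-prometheus-stack. A sketch (the threshold, window, and labels are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-crashloop
  namespace: monitoring
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # assumption: fire when a container restarts more than 3 times in 15m
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
```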
## Security Scanning

This project includes security-first CI/CD validation using Trivy.
- Terraform IaC misconfigurations
- Container image vulnerabilities
- Pull request validation gates
```yaml
- name: Trivy IaC Scan
  uses: aquasecurity/trivy-action@master
  with:
    scan-type: config
    scan-ref: ./terraform
    severity: CRITICAL,HIGH
    exit-code: 1
```

- Shift-left security checks
- Automated blocking of critical findings
- IaC validation before infrastructure changes
- Production-style security pipeline behavior
## CI/CD Pipeline

GitHub Actions is used for CI/CD automation.
- Terraform plan on pull requests
- Terraform apply on merge to main
- Security scans in CI
- Image scanning workflows
- Infrastructure validation
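The plan-on-pull-request flow above can be sketched as a workflow like this (job layout is an assumption; a real pipeline would also add AWS credentials, typically via OIDC):

```yaml
name: terraform-plan
on:
  pull_request:
    paths:
      - "terraform/**"

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/environments/dev
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform init -input=false
      - name: Terraform Plan
        run: terraform plan -input=false -no-color
```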
```
.github/workflows/
├── terraform-plan.yml
├── terraform-apply.yml
└── trivy-image-scan.yml
```

## Alerting & Incident Response

This repository includes operational alerting patterns and incident response references.
- PagerDuty
- Alertmanager
- Slack / Email escalation
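Routing alerts to PagerDuty is handled in the Alertmanager configuration. A minimal sketch (the routing key is a placeholder and would be injected from a secret, not committed):

```yaml
route:
  receiver: pagerduty
  group_by: ["alertname", "namespace"]

receivers:
  - name: pagerduty
    pagerduty_configs:
      # assumption: PagerDuty Events API v2 integration key
      - routing_key: <pagerduty-integration-key>
        severity: critical
```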
See: `docs/runbook-incident-response.md`
This adds a realistic SRE / platform operations angle to the project.
## Cost Considerations

This project also reflects cost-aware infrastructure design.
| Resource | Estimated Monthly Cost (ap-south-1) |
|---|---|
| EKS Control Plane | ~$73 |
| EC2 On-Demand Nodes | ~$50 |
| EC2 Spot Nodes | ~$15 |
| NAT Gateways | ~$100 |
| EBS Volumes | ~$10 |
| Estimated Dev Total | ~$248/month |
- Spot node groups reduce worker node cost
- A single NAT Gateway can reduce dev costs (at the expense of AZ redundancy)
- The monitoring stack should be right-sized for the environment's scale
## Future Improvements

Potential enhancements for this platform include:
- Cluster Autoscaler integration
- Karpenter-based node provisioning
- External Secrets Operator
- AWS Load Balancer Controller setup
- OPA / Kyverno policy enforcement
- Service mesh integration
- Multi-cluster GitOps expansion
## Lessons Learned

- **GitOps pays off quickly.** Managing workloads declaratively reduces configuration drift and improves repeatability.
- **IRSA is the right access model.** IAM Roles for Service Accounts provide cleaner and safer AWS access patterns inside Kubernetes.
- **Observability is not an afterthought.** Prometheus, Grafana, and logs should be part of the platform from day one.
- **Shift security left.** Infrastructure and image scanning should happen before changes reach production.
- **Simplicity wins.** The real value comes from building reusable, understandable, and operable infrastructure.
## Author

**Swanand Awatade**
Cloud & DevOps Engineer
📍 Pune, India
📧 swanand.awatade@gmail.com
🔗 LinkedIn | GitHub
## License

This project is licensed under the MIT License.
See the LICENSE file for more details.
