Portfolio Project by Swanand Awatade
Cloud & DevOps Engineer | AWS | Kubernetes | Terraform | GitOps | Observability
## Overview

This project demonstrates a production-grade Amazon EKS platform built using Terraform, GitOps, and a full observability stack.
It is designed to reflect real-world platform engineering practices used in production cloud environments, including:
- Infrastructure provisioning with Terraform
- GitOps-based application delivery using ArgoCD
- Cluster observability with Prometheus, Grafana, and CloudWatch
- Security-first CI/CD workflows with automated scanning
- Reusable and modular infrastructure design
This repository is intended as a portfolio-grade reference implementation for modern AWS + Kubernetes + DevOps engineering.
## Key Features

- Multi-AZ Amazon EKS deployment
- Reusable Terraform modules for VPC and EKS
- GitOps deployment model using ArgoCD
- Prometheus + Grafana monitoring stack
- CloudWatch log forwarding
- Trivy-based security scanning
- GitHub Actions CI/CD pipelines
- PagerDuty / Alertmanager integration
- Dev / Prod environment separation
- Production-style runbooks and architecture docs
## Table of Contents

- Overview
- Key Features
- Architecture
- Architecture Diagram
- Tech Stack
- Project Structure
- Prerequisites
- Quick Start
- Terraform Infrastructure
- GitOps with ArgoCD
- Observability Stack
- Security Scanning
- CI/CD Pipeline
- Alerting & Incident Response
- Cost Considerations
- Future Improvements
- Lessons Learned
- Author
- License
## Architecture

This platform follows a GitOps-based deployment model:
- Infrastructure is provisioned using Terraform
- Applications are deployed using ArgoCD
- Monitoring stack (Prometheus + Grafana) tracks cluster health
- Logs and operational telemetry are shipped to CloudWatch
- CI/CD pipelines enforce automation and security controls
This architecture is designed to simulate how a production Kubernetes platform is provisioned, deployed, observed, and operated.
## Architecture Diagram

> 📌 Add your architecture image at `docs/images/architecture.png`, then reference it in the README with:
> `![Architecture Diagram](docs/images/architecture.png)`
## Tech Stack

| Category | Tools / Services |
|---|---|
| Cloud Provider | AWS (EKS, VPC, IAM, ECR, CloudWatch, SNS, S3) |
| Infrastructure as Code | Terraform 1.7+, AWS Provider 5.x |
| Container Orchestration | Kubernetes 1.29, Helm 3.x |
| GitOps | ArgoCD 2.10 |
| Monitoring | Prometheus, Grafana, kube-state-metrics |
| Logging | AWS CloudWatch Container Insights, Fluent Bit |
| Security Scanning | Trivy (image + IaC) |
| CI/CD | GitHub Actions |
| Alerting | PagerDuty, Alertmanager |
| State Backend | S3 + DynamoDB |
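The S3 + DynamoDB state backend above is wired into each environment's Terraform configuration via a `backend` block. A minimal sketch, reusing the bucket and table names from the Quick Start (the `key` layout is an assumption):

```hcl
terraform {
  backend "s3" {
    bucket         = "swanand-eks-terraform-state"
    key            = "dev/terraform.tfstate"   # assumption: one state key per environment
    region         = "ap-south-1"
    dynamodb_table = "terraform-state-lock"    # enables state locking
    encrypt        = true
  }
}
```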
## Project Structure

```
eks-gitops-observability/
├── README.md
├── .github/
│   └── workflows/
│       ├── terraform-plan.yml
│       ├── terraform-apply.yml
│       └── trivy-image-scan.yml
├── terraform/
│   ├── modules/
│   │   ├── vpc/
│   │   ├── eks/
│   │   └── iam/
│   └── environments/
│       ├── dev/
│       └── prod/
├── k8s/
│   ├── base/
│   └── overlays/
│       ├── dev/
│       └── prod/
├── argocd/
│   ├── apps/
│   └── projects/
├── monitoring/
│   ├── prometheus/
│   │   ├── values.yaml
│   │   └── alert-rules/
│   └── grafana/
│       └── dashboards/
└── docs/
    ├── architecture.md
    ├── runbook-incident-response.md
    └── cost-breakdown.md
```

## Prerequisites

Before getting started, ensure you have:
- AWS CLI configured with valid IAM credentials
- Terraform >= 1.7
- kubectl
- Helm >= 3.x
- argocd CLI (optional)
Your deployment role/user should have access to:
`eks:*`, `ec2:*`, `iam:*`, `s3:*`, `ecr:*`, `cloudwatch:*`
> ⚠️ For production use, a least-privilege CI/CD IAM role is strongly recommended.
## Quick Start

### 1. Clone the repository

```bash
git clone https://github.com/swanand18/eks-gitops-observability.git
cd eks-gitops-observability
```

### 2. Create the Terraform state backend

```bash
aws s3api create-bucket \
  --bucket swanand-eks-terraform-state \
  --region ap-south-1 \
  --create-bucket-configuration LocationConstraint=ap-south-1

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-south-1
```

### 3. Provision the infrastructure

```bash
cd terraform/environments/dev
terraform init
terraform plan -out=tfplan
terraform apply tfplan
```

### 4. Configure kubectl

```bash
aws eks update-kubeconfig \
  --region ap-south-1 \
  --name eks-dev-cluster

kubectl get nodes
```

### 5. Install ArgoCD

```bash
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

kubectl apply -f argocd/projects/
kubectl apply -f argocd/apps/
```

### 6. Install the monitoring stack

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f monitoring/prometheus/values.yaml
```

## Terraform Infrastructure

### VPC Module

This module provisions a production-style VPC with:
- 3 public subnets (for NAT Gateways / ALBs)
- 3 private subnets (for EKS worker nodes)
- Internet Gateway + route tables
- NAT Gateways across AZs
- VPC Flow Logs
### EKS Module

The EKS module provisions:
- EKS control plane
- Managed node groups
- OIDC provider for IRSA
- EKS add-ons:
  - `vpc-cni`
  - `coredns`
  - `kube-proxy`
  - `aws-ebs-csi-driver`
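These add-ons can be managed declaratively in Terraform as well. A minimal sketch (the cluster name is an assumption; in practice add-on versions are usually pinned):

```hcl
resource "aws_eks_addon" "core" {
  for_each = toset([
    "vpc-cni",
    "coredns",
    "kube-proxy",
    "aws-ebs-csi-driver",
  ])

  cluster_name = "eks-dev-cluster" # assumption: matches the dev cluster name
  addon_name   = each.value

  # Resolve drift in favor of the managed add-on configuration on updates
  resolve_conflicts_on_update = "OVERWRITE"
}
```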
### IAM Module

Provides IAM roles for:
- AWS Load Balancer Controller
- Cluster Autoscaler
- Fluent Bit
- External DNS
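Each of these roles is consumed inside the cluster via IRSA, by annotating the workload's service account with the role ARN. An illustrative sketch for Fluent Bit (account ID, role name, and namespace are assumptions):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
  annotations:
    # assumption: the role created by the IAM module for Fluent Bit
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/eks-dev-fluent-bit
```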
### Example Usage

```hcl
module "vpc" {
  source = "../../modules/vpc"

  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["ap-south-1a", "ap-south-1b", "ap-south-1c"]
  cluster_name       = "eks-prod-cluster"
  environment        = "prod"
}
```

## GitOps with ArgoCD

All Kubernetes workloads are deployed using ArgoCD, following a GitOps-first operating model.
- No manual production deployments
- Declarative infrastructure and app delivery
- Drift detection and self-healing
- Version-controlled operational changes
```
argocd/apps/
├── app-of-apps.yaml
├── frontend.yaml
├── backend-api.yaml
└── monitoring.yaml
```

- Auto-sync enabled for non-production environments
- Manual sync recommended for production
- Self-heal enabled
- Pruning controlled for safety
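The sync behavior above maps onto the `syncPolicy` of an ArgoCD `Application`. A minimal sketch for a dev app (the app name, manifest path, and target namespace are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/swanand18/eks-gitops-observability.git
    targetRevision: main
    path: k8s/overlays/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: frontend
  syncPolicy:
    automated:
      prune: true     # controlled pruning of removed resources
      selfHeal: true  # revert out-of-band changes to the Git state
```

For production apps, the `automated` block would typically be omitted in favor of manual syncs.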
## Observability Stack

The monitoring layer is built using Prometheus, Grafana, and CloudWatch.
**Components:**

- Prometheus
- Grafana
- kube-state-metrics
- Alertmanager
- Fluent Bit
- CloudWatch Container Insights
**Coverage:**

- Node CPU / memory utilization
- Pod health and restart tracking
- Deployment health
- Application metrics
- ArgoCD sync visibility
- Cluster logging
Alert flow: `PrometheusRule → Alertmanager → PagerDuty / Slack / Email`

Alerts cover:
- Pod crash loops
- Node memory pressure
- Deployment replica mismatch
- Persistent volume usage threshold
- Node readiness issues
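An alert such as the pod crash-loop case above is expressed as a `PrometheusRule` picked up by the kube-prometheus-stack. A sketch (the threshold, window, and labels are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-crashloop
  namespace: monitoring
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # assumption: fire when a container restarts more than 3 times in 15m
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
```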
## Security Scanning

This project includes security-first CI/CD validation using Trivy.
- Terraform IaC misconfigurations
- Container image vulnerabilities
- Pull request validation gates
```yaml
- name: Trivy IaC Scan
  uses: aquasecurity/trivy-action@master
  with:
    scan-type: config
    scan-ref: ./terraform
    severity: CRITICAL,HIGH
    exit-code: 1
```

- Shift-left security checks
- Automated blocking of critical findings
- IaC validation before infrastructure changes
- Production-style security pipeline behavior
## CI/CD Pipeline

GitHub Actions is used for CI/CD automation.
- Terraform plan on pull requests
- Terraform apply on merge to main
- Security scans in CI
- Image scanning workflows
- Infrastructure validation
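The plan-on-pull-request flow above can be sketched as a workflow like this (job layout is an assumption; a real pipeline would also add AWS credentials, typically via OIDC):

```yaml
name: terraform-plan
on:
  pull_request:
    paths:
      - "terraform/**"

jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/environments/dev
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform init -input=false
      - name: Terraform Plan
        run: terraform plan -input=false -no-color
```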
```
.github/workflows/
├── terraform-plan.yml
├── terraform-apply.yml
└── trivy-image-scan.yml
```

## Alerting & Incident Response

This repository includes operational alerting patterns and incident response references.
- PagerDuty
- Alertmanager
- Slack / Email escalation
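Routing alerts to PagerDuty is handled in the Alertmanager configuration. A minimal sketch (the routing key is a placeholder and would be injected from a secret, not committed):

```yaml
route:
  receiver: pagerduty
  group_by: ["alertname", "namespace"]

receivers:
  - name: pagerduty
    pagerduty_configs:
      # assumption: PagerDuty Events API v2 integration key
      - routing_key: <pagerduty-integration-key>
        severity: critical
```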
See: `docs/runbook-incident-response.md`
This adds a realistic SRE / platform operations angle to the project.
## Cost Considerations

This project also reflects cost-aware infrastructure design.
| Resource | Estimated Monthly Cost (ap-south-1) |
|---|---|
| EKS Control Plane | ~$73 |
| EC2 On-Demand Nodes | ~$50 |
| EC2 Spot Nodes | ~$15 |
| NAT Gateways | ~$100 |
| EBS Volumes | ~$10 |
| Estimated Dev Total | ~$248/month |
- Spot node groups reduce worker node cost
- A single NAT Gateway can reduce dev costs (at the expense of AZ redundancy)
- The monitoring stack should be right-sized for the environment's scale
## Future Improvements

Potential enhancements for this platform include:
- Cluster Autoscaler integration
- Karpenter-based node provisioning
- External Secrets Operator
- AWS Load Balancer Controller setup
- OPA / Kyverno policy enforcement
- Service mesh integration
- Multi-cluster GitOps expansion
## Lessons Learned

- **GitOps pays off quickly.** Managing workloads declaratively reduces configuration drift and improves repeatability.
- **IRSA is the right access model.** IAM Roles for Service Accounts provide cleaner and safer AWS access patterns inside Kubernetes.
- **Observability is not an afterthought.** Prometheus, Grafana, and logs should be part of the platform from day one.
- **Shift security left.** Infrastructure and image scanning should happen before changes reach production.
- **Simplicity wins.** The real value comes from building reusable, understandable, and operable infrastructure.
## Author

**Swanand Awatade**
Cloud & DevOps Engineer
📍 Pune, India
📧 swanand.awatade@gmail.com
🔗 LinkedIn | GitHub
## License

This project is licensed under the MIT License.
See the LICENSE file for more details.
