An AI-driven operations platform that transforms Kubernetes monitoring signals into actionable, guardrailed remediation decisions.
This project integrates observability systems, agentic reasoning, and Kubernetes-native controls to reduce manual incident response time, improve consistency of root-cause analysis, and make remediation safer through policy gates and execution safeguards.
For full extended project content, see README-detailed.md.
Modern platform operations teams receive high volumes of infrastructure alerts, but most alerts still require manual triage, context collection, and repetitive runbook execution. This creates alert fatigue, delayed recovery, and inconsistent incident handling quality.
This platform addresses that gap with an end-to-end AIOps workflow:
- ingest alerts in real time
- correlate metrics and logs automatically
- perform RCA using deterministic rules plus LLM and RAG context
- apply controlled remediation actions with Kubernetes safety checks
- persist incident history for auditability and future learning
The result is a practical, production-style operating model for self-healing Kubernetes workloads.
At a high level, the system works as follows:
- Prometheus rules detect abnormal workload behavior.
- Alertmanager sends firing alerts to the AI engine webhook.
- A LangGraph multi-agent pipeline analyzes the incident context.
- Policy and guardrails decide whether remediation can be executed.
- Actions are applied through Kubernetes APIs (or safely skipped).
- Incident artifacts are persisted and optionally pushed to Discord.
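The ingestion step above can be sketched as a small pure function. This is a minimal illustration, assuming the standard Alertmanager webhook body shape; the record fields (`alertname`, `namespace`, `pod`, `severity`, `summary`) are illustrative, not the engine's exact schema.

```python
# Minimal sketch of alert ingestion: flatten an Alertmanager webhook
# payload into one incident record per firing alert. Resolved alerts
# are skipped because they need no remediation.

def extract_incidents(payload: dict) -> list[dict]:
    incidents = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # only act on currently-firing alerts
        labels = alert.get("labels", {})
        incidents.append({
            "alertname": labels.get("alertname", "unknown"),
            "namespace": labels.get("namespace", "default"),
            "pod": labels.get("pod"),
            "severity": labels.get("severity", "none"),
            "summary": alert.get("annotations", {}).get("summary", ""),
        })
    return incidents
```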
Core behavior implemented:
- Alert ingestion via POST /alerts
- Multi-agent orchestration: monitor -> rca -> remediate -> report (with fallback)
- Metrics and logs retrieval from Prometheus and Loki
- RAG-backed incident memory with Chroma
- Guardrailed remediation execution:
  - restart pod
  - scale deployment
  - increase memory limit and restart pod
  - rollback deployment (with retry-threshold safety checks)
- Persistent incident storage:
  - JSONL history
  - Markdown incident reports
- Operational dashboard for incident and remediation visibility
Stress App -> Prometheus Rules -> Alertmanager -> AI Engine (/alerts)
-> monitor/rca/remediate/report
-> Kubernetes API (guardrailed actions)
-> Incident Store (JSONL + Markdown + Chroma)
-> Discord (optional)
For full architecture diagrams and component-level explanation, see docs/architecture.md.
- Combines alert labels, live metrics, and recent logs
- Uses LLM output with deterministic post-guardrails
- Enriches RCA with similarity retrieval from past incidents
- Falls back safely when signal quality is low or model output is invalid
- Supports policy-based action gating per alert category
- Enforces confidence thresholds and allowed-action maps
- Applies cooldown and retry-window controls to reduce flapping
- Restricts execution with namespace and action allowlists
- Integrates HPA-aware scaling behavior and rollback safety checks
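The gating controls above can be sketched as a single policy check. The policy shape (per-alert allowed actions and confidence thresholds) and action names are assumptions; the real engine's configuration format may differ.

```python
# Sketch of a policy gate: per-alert-type allowed-action map,
# confidence threshold, and namespace allowlist, evaluated in order.

POLICY = {
    "HighPodCPUUsage": {"actions": {"scale_deployment"}, "min_confidence": 0.7},
    "PodOOMKilled": {"actions": {"increase_memory_limit", "restart_pod"},
                     "min_confidence": 0.8},
}
NAMESPACE_ALLOWLIST = {"default"}

def gate_action(alert: str, action: str, namespace: str,
                confidence: float) -> tuple[bool, str]:
    policy = POLICY.get(alert)
    if policy is None:
        return False, "no policy for alert type"
    if namespace not in NAMESPACE_ALLOWLIST:
        return False, "namespace not allowlisted"
    if action not in policy["actions"]:
        return False, "action not allowed for this alert"
    if confidence < policy["min_confidence"]:
        return False, "confidence below threshold"
    return True, "approved"
```

Evaluating the cheap deterministic checks before the confidence check keeps rejections explainable: every denial carries a specific reason for the incident timeline.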
- Agent-level fallback chain on exceptions
- Non-blocking handling of Prometheus, Loki, RAG, and notification failures
- Traceable incident timeline with remediation attempt outcomes
- Durable incident artifacts for audits and postmortems
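The non-blocking handling of enrichment failures can be sketched with a small wrapper; the `query_prometheus` call in the usage comment is hypothetical.

```python
# Sketch of the non-blocking pattern: enrichment sources (Prometheus,
# Loki, RAG, notifications) are wrapped so a failure degrades the
# incident context instead of aborting the whole pipeline.
import logging

logger = logging.getLogger("ai-engine")

def safe_enrich(fetch, source: str, default=None):
    """Call a fetcher; on any exception, log a warning and return a default."""
    try:
        return fetch()
    except Exception as exc:
        logger.warning("%s enrichment failed, continuing: %s", source, exc)
        return default

# Usage (hypothetical fetcher): context stays partial rather than
# the analysis failing outright.
# metrics = safe_enrich(lambda: query_prometheus(alert), "prometheus", default={})
```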
- Kubernetes (Minikube)
- Docker
- Helm
- Jenkins
- Prometheus
- Alertmanager
- Grafana
- Loki
- Promtail
- FastAPI
- LangGraph
- LangChain
- Ollama
- ChromaDB (RAG memory)
- Streamlit
- Python
- Kubernetes Python Client
aiops-agentic-platform/
├── ai-engine/
├── app/
├── dashboard/
├── docs/
├── grafana/
├── jenkins/
├── k8s/
├── README.md
└── README-detailed.md
minikube start -p aiops --driver=docker --cpus=4 --memory=6144
minikube addons enable metrics-server -p aiops
minikube addons enable ingress -p aiops
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace
helm upgrade --install loki grafana/loki \
-n monitoring -f k8s/loki/loki-values.yaml
helm upgrade --install promtail grafana/promtail \
-n monitoring -f k8s/loki/promtail-values.yaml
kubectl apply -f k8s/stress-app-deployment.yaml
kubectl apply -f k8s/stress-app-service.yaml
kubectl apply -f k8s/stress-app-hpa.yaml
kubectl apply -f k8s/ai-engine-rbac.yaml
kubectl apply -f k8s/ai-engine-incidents-pvc.yaml
kubectl apply -f k8s/ai-engine-deployment.yaml
kubectl apply -f k8s/ai-engine-service.yaml
kubectl apply -f k8s/alerts/cpu-alert.yaml
kubectl apply -f k8s/alerts/loki-alerts.yaml
kubectl create secret generic alertmanager-monitoring-kube-prometheus-alertmanager \
--from-file=alertmanager.yaml=k8s/alertmanager/alertmanager.yaml \
-n monitoring --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart statefulset alertmanager-monitoring-kube-prometheus-alertmanager -n monitoring
kubectl get pods -n default
kubectl get pods -n monitoring
kubectl get prometheusrule -n monitoring
Keep real webhook values out of Git.
kubectl -n default create secret generic ai-engine-discord-webhook \
--from-literal=webhook-url='YOUR_REAL_WEBHOOK_URL' \
--dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/ai-engine -n default
kubectl rollout status deployment/ai-engine -n default --timeout=300s
Implemented alert classes include:
- HighPodCPUUsage
- HighMemoryUsage
- PodCrashLoop
- PodCrashLoopBackOff
- PodOOMKilled
- PodImagePullBackOff
- PodImagePullBackOffPersistent
- PodErrImagePull
- PodCreateContainerConfigError
- PodNotReadyTooLong
Decisioning combines:
- rule-based heuristics for deterministic baselines
- LLM RCA output when available
- guardrail overrides for safety-critical patterns
- alert-type policy maps with confidence thresholds
Execution is constrained by runtime policy controls in the AI engine:
- action allowlist
- namespace allowlist
- auto-remediation modes: off, dry-run, safe-auto
- cooldown and retry-window enforcement
- retry-limit protection
- HPA-aware scaling boundaries
- image-pull rollback retry threshold checks
This design keeps remediation useful while reducing accidental or unstable changes.
Core manifests are maintained under k8s:
- AI engine deployment/service/RBAC/PVC
- stress app deployment/service/HPA
- dashboard deployment/service/ingress
- alert rules and alertmanager webhook routing
Primary runtime services:
- AI Engine: k8s/ai-engine-service.yaml
- Stress App: k8s/stress-app-service.yaml
- Dashboard: k8s/dashboard-service.yaml
Base service URL (in cluster):
Available endpoints:
- GET /
- POST /alerts
- POST /analyze
- POST /remediate
- GET /incidents
- GET /incidents/{incident_id}
- GET /incidents/remediations
- GET /diagnostics/rag
- Source: dashboard/app.py
- In-cluster API base: AIOPS_API_BASE_URL=http://ai-engine.default.svc.cluster.local:8000
Run locally:
cd dashboard
pip install -r requirements.txt
streamlit run app.py
Pipeline definition: jenkins/Jenkinsfile
Pipeline stages:
- Checkout
- Quality Gates: Static Validation
- Build Docker Image
- Run Tests
- Push Docker Image
- Deploy to Kubernetes
- Smoke Check + Contract Gate
Required Jenkins credential:
- dockerhub-pass
Webhook secret provisioning is intentionally managed outside Jenkins.
- Never commit secrets (plain or base64-encoded) to the repository
- Keep k8s/discord-webhook-secret.yaml as placeholder template
- Rotate exposed webhook credentials immediately
- Prefer secret injection at runtime via kubectl or external secret manager
- Full extended project content: README-detailed.md
- Detailed architecture and flow analysis: docs/architecture.md