Infrastructure engineer focused on Kubernetes reliability, observability stacks, and PostgreSQL HA operations. 20+ years in tech — 14 as a .NET developer and tech lead, 6+ in DevOps and cloud infrastructure.
Based in Vietnam 🇻🇳 · Remote worldwide · GMT+7
- Kubernetes — Production clusters on GKE, kubeadm, and Rancher, in the cloud and on-premises, including offline-tolerant shipboard environments that depend on satellite connectivity.
- Observability — Prometheus, Grafana, Istio service mesh with distributed tracing.
- PostgreSQL — HA clusters with Patroni, streaming replication, failover testing, performance tuning.
- IaC — Terraform, Ansible, Helm. Infrastructure treated the same as application code: PR reviews, tested pipelines, no manual snowflakes.
- CI/CD — Jenkins, GitLab CI, GitHub Actions, Codefresh. Pre-deployment validation, staged rollouts, observability-driven rollback gates.
🛡️ cronguard — Kubernetes operator (Go)
SLO-style observability for Kubernetes CronJobs. Wraps batch/v1.CronJob workloads with a CronJobMonitor CRD that declares per-CronJob SLOs (schedule, max duration, max consecutive failures, missed-run tolerance), surfaces them as cronguard_* Prometheus metrics, and ships a default PrometheusRule with five alerts and runbooks.
- ~1,500 lines of Go on kubebuilder v4 + controller-runtime
- Coverage: 82.9% controller · 95.0% schedule · 95.8% history · 94.9% metrics
- Multi-arch image · Helm chart on OCI + GitHub Pages · listed on Artifact Hub
- kind-based e2e on every push + nightly schedule
Production-grade Prometheus alerting rules for Kubernetes, PostgreSQL/Patroni, and SLO burn rates, with runbooks.
Covers:
- Pod crash-loop, OOM, PVC fill-up, deployment rollout stuck
- Patroni cluster health, replication lag, XID wraparound
- Multi-window SLO burn rate (Google SRE method)
- Node disk, network, clock skew
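
The multi-window burn-rate check can be illustrated with a little arithmetic. The sketch below is a standalone Go illustration, not part of the rule set: `BurnRate` and `ShouldAlert` are hypothetical names, the error ratios are made-up inputs, and the 14.4 threshold is the fast-burn value from the Google SRE Workbook (1h long window paired with a 5m short window, for a 30-day SLO period).

```go
package main

import "fmt"

// BurnRate reports how fast the error budget is being consumed.
// A value of 1 means the budget lasts exactly the SLO period.
func BurnRate(errorRatio, sloTarget float64) float64 {
	return errorRatio / (1 - sloTarget)
}

// ShouldAlert applies the multi-window rule: the long window shows the burn
// is significant, the short window shows it is still happening.
// Both must exceed the threshold.
func ShouldAlert(longRatio, shortRatio, sloTarget, threshold float64) bool {
	return BurnRate(longRatio, sloTarget) > threshold &&
		BurnRate(shortRatio, sloTarget) > threshold
}

func main() {
	const slo = 0.999 // 99.9% availability target (illustrative)

	// Hypothetical error ratios over the 1h and 5m windows.
	fmt.Println(ShouldAlert(0.02, 0.03, slo, 14.4))  // both windows burning fast
	fmt.Println(ShouldAlert(0.02, 0.001, slo, 14.4)) // short window has recovered
}
```

Requiring the short window as well means the alert clears soon after the error rate recovers, rather than firing for the rest of the long window.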

| Category | Tools |
|---|---|
| Orchestration | Kubernetes (GKE · kubeadm · Rancher) · Docker · Helm |
| Cloud | GCP · AWS · DigitalOcean · Yandex.Cloud · Alibaba Cloud |
| Observability | Prometheus · Grafana · Istio · ELK · Dynatrace |
| Databases | PostgreSQL · Patroni · MS SQL · Oracle |
| IaC | Terraform · Ansible |
| CI/CD | Jenkins · GitLab CI · GitHub Actions · Codefresh |
| Scripting | Python · Bash · Go |

| Achievement | Result |
|---|---|
| Production release failures | −90% (from ~10 to ~1/year) |
| System uptime | 99.8% for cruise operations |
| Cloud migration | Zero downtime · −30% cost |
| CI/CD speed | −75% deployment time |
| IAM security incidents | −60% after RBAC reorganization |
| PostgreSQL HA | 99.95% uptime · <30s failover |
- 📬 dmitry0983@gmail.com
- 💬 Telegram: @dmazhukov

