Your cluster broke. Your dashboards are green. Users are paging. One command. Root cause. 30 seconds.
Quick Start · How It Works · Features · Kubernetes · Docs
It's 3am. PagerDuty fires. Latency is up, error budget is burning, and every dashboard you own is green.
- Prometheus says CPU and memory look fine.
- Datadog APM says your app is healthy.
- The Grafana panels your SRE spent a weekend building - all green.
That's because every tool you have watches your application. Nothing is watching the kernel.
flowchart TB
subgraph Stack["YOUR K8S STACK"]
App["Workload Pods<br/>(Node, Python, Go, Java)"]
Runtime["Container Runtime<br/>(containerd, CRI-O)"]
OS["Node Kernel<br/>(Linux)"]
HW["Nodes / EC2 / GCE"]
end
subgraph Tools["WHO WATCHES WHAT"]
APM["Datadog · New Relic<br/>Prometheus · Grafana"]
CRun["Pixie · Tetragon<br/>Inspektor Gadget"]
Kerno["<b>KERNO</b><br/><i>eBPF kernel tracing</i>"]
Bare["(nobody)"]
end
App -.watched by.-> APM
Runtime -.watched by.-> CRun
OS -.watched by.-> Kerno
HW -.watched by.-> Bare
style Kerno fill:#e94560,stroke:#fff,color:#fff,stroke-width:3px
style App fill:#0f3460,stroke:#16213e,color:#fff
style Runtime fill:#16213e,stroke:#533483,color:#fff
style OS fill:#1a1a2e,stroke:#e94560,color:#fff
style HW fill:#533483,stroke:#16213e,color:#fff
style APM fill:#16213e,stroke:#0f3460,color:#ccc
style CRun fill:#16213e,stroke:#0f3460,color:#ccc
style Bare fill:#16213e,stroke:#0f3460,color:#888
The kernel is where the pain actually lives - disk throttling, TCP retransmits, OOM kills, scheduler contention, FD leaks. The kernel knows minutes before your dashboards. Hours before your users.
Kerno runs as a DaemonSet on every node, streams kernel signals through eBPF with microsecond overhead, and turns them into a diagnostic report that reads like a doctor's note.
One command. 30 seconds later:
$ kubectl -n kerno-system exec ds/kerno -- kerno doctor
██╗ ██╗███████╗██████╗ ███╗ ██╗ ██████╗
██║ ██╔╝██╔════╝██╔══██╗████╗ ██║██╔═══██╗ Production Incident Report
█████╔╝ █████╗ ██████╔╝██╔██╗ ██║██║ ██║ ──────────────────────────
██╔═██╗ ██╔══╝ ██╔══██╗██║╚██╗██║██║ ██║ Cluster prod-us-east-1
██║ ██╗███████╗██║ ██║██║ ╚████║╚██████╔╝ Node ip-10-0-3-47
╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═══╝ ╚═════╝ Kernel 6.8.0-aws · x86_64
Window 30.0s · 12,481 events
❚❚ TRIAGE ────────────────────────────────────────────────────────── 30s
● ● ● 2 critical · 1 warning · 0 info
─────────────────────────────────────────────────────────────────────────
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ● CRITICAL · Disk I/O Bottleneck SLO 🔥 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Pod payments/checkout-api-7f4b8c-x9k2p
Volume /var/lib/postgres (EBS gp3, 3,000 IOPS)
Signal fsync p99 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 280 ms
threshold ▇▇▇▇▇▇▇▇▇▇▇▇▇ 200 ms
Cause EBS volume saturated · fsync blocking commit log writes
Impact checkout p99 ↑ 8.2× · connection pool 94% saturated
Fix → bump EBS IOPS 3000 → 6000
→ or move commit log to local NVMe
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ● CRITICAL · TCP Retransmit Storm SLO 🔥 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Namespace ingress-nginx
Pod nginx-ingress-controller-qz8hv
Signal retransmits ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.3 %
threshold ▇▇▇▇ 2.0 %
Cause packet loss on eth0 · likely upstream NLB path
Impact every external request at risk of latency spike
Fix → kubectl exec ... -- ethtool -S eth0 | grep error
→ check AWS NLB target group health
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ▲ WARNING · Scheduler Contention ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Node ip-10-0-3-47 (c6i.2xlarge · 8 vCPU)
Signal runqueue p99 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 18 ms
threshold ▇▇▇▇▇ 5 ms
Cause noisy neighbor · 14 pods competing for 8 vCPUs
Fix → reduce pod density or set CPU requests
→ kubectl top pods --sort-by=cpu
▼ RECOMMENDED ORDER ──────────────────────────────────────────────────
1. [ NOW ] Disk I/O Bottleneck payments/checkout-api
2. [ NOW ] TCP Retransmit Storm ingress-nginx
3. [ 5min ] Scheduler Contention ip-10-0-3-47
kerno doctor --ai for root-cause analysis
kerno doctor --output json for runbooks / Slack bots
kerno predict to surface failures before they page you
That's the entire debugging loop - from page to root cause - in a single command.
| Tool | Watches | K8s-Native | Incident Report | SLO Mapping | AI Analysis | Install Time |
|---|---|---|---|---|---|---|
| Prometheus + Grafana | Application | Partial | No | No | No | Hours |
| Datadog APM | Application | Partial | No | Partial | Yes | Hours |
| Cilium Tetragon | Security | Yes | No | No | No | Minutes |
| Inspektor Gadget | Container | Yes | No | No | No | Minutes |
| Pixie | Application | Yes | No | No | No | Minutes |
| Kerno | Kernel | Yes | Yes | Yes | Yes | < 1 min |
Kerno is the only eBPF tool in the Kubernetes ecosystem that produces a ranked, human-readable incident report - not a firehose of events, not another dashboard, not a query language to learn.
- Kubernetes 1.25+ with nodes running kernel 5.8+ (BTF enabled - all major managed offerings qualify: EKS, GKE, AKS, DOKS, Linode, Civo)
- `kubectl` with cluster-admin (to create the namespace + RBAC)
# Helm
helm install kerno ./deploy/helm/kerno \
-n kerno-system --create-namespace
# Or raw manifests
kubectl apply -f deploy/k8s/
That's it. Within 30 seconds Kerno is running as a DaemonSet on every node.
# Diagnose the whole cluster - 30 seconds of real kernel data
kubectl -n kerno-system exec ds/kerno -- kerno doctor
# JSON output for on-call runbooks / Slack bots
kubectl -n kerno-system exec ds/kerno -- kerno doctor --output json
# With AI-powered root cause analysis
kubectl -n kerno-system set env ds/kerno KERNO_AI_API_KEY=sk-...
kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai
# Scrape Prometheus metrics locally
kubectl -n kerno-system port-forward ds/kerno 9090:9090
curl localhost:9090/metrics
ServiceMonitor support is built-in when the Prometheus Operator is installed.
For standalone Linux servers outside Kubernetes:
# One-liner install
curl -sfL https://raw.githubusercontent.com/lowplane/kerno/main/scripts/install.sh | sudo bash
sudo kerno doctor
# Pin a specific version: --version v0.1.0
Or run as a long-lived systemd service with Prometheus metrics:
curl -sfL https://raw.githubusercontent.com/lowplane/kerno/main/scripts/install.sh | sudo bash -s -- --daemon
journalctl -u kerno -f
Or run directly via Docker:
docker run --rm --privileged --pid=host \
-v /sys/kernel/debug:/sys/kernel/debug:ro \
-v /sys/kernel/btf:/sys/kernel/btf:ro \
-v /sys/fs/bpf:/sys/fs/bpf \
-v /proc:/proc:ro \
ghcr.io/lowplane/kerno:v0.1.0 doctor
Multi-arch images (linux/amd64, linux/arm64) published to GHCR on every release.
Kerno is designed from day one to run as a Kubernetes DaemonSet. One pod per node, one eBPF agent per kernel, zero API server load.
flowchart TB
subgraph Cluster["Kubernetes Cluster"]
direction TB
subgraph Node1["Worker Node 1"]
K1["Kerno Pod<br/><i>DaemonSet</i>"]
W1["Workload Pods"]
end
subgraph Node2["Worker Node 2"]
K2["Kerno Pod<br/><i>DaemonSet</i>"]
W2["Workload Pods"]
end
subgraph Node3["Worker Node N"]
K3["Kerno Pod<br/><i>DaemonSet</i>"]
W3["Workload Pods"]
end
end
K1 -->|:9090/metrics| Prom["Prometheus"]
K2 -->|:9090/metrics| Prom
K3 -->|:9090/metrics| Prom
Prom --> GF["Grafana"]
K1 -.enriches.-> W1
K2 -.enriches.-> W2
K3 -.enriches.-> W3
style K1 fill:#e94560,stroke:#fff,color:#fff
style K2 fill:#e94560,stroke:#fff,color:#fff
style K3 fill:#e94560,stroke:#fff,color:#fff
style Prom fill:#0f3460,stroke:#fff,color:#fff
style GF fill:#16213e,stroke:#fff,color:#fff
style W1 fill:#533483,stroke:#fff,color:#fff
style W2 fill:#533483,stroke:#fff,color:#fff
style W3 fill:#533483,stroke:#fff,color:#fff
Kerno tags every finding with pod, namespace, node, and workload labels. No client-go informers, no watch connections - Kerno reads /var/lib/kubelet/pods directly, so even a failing API server doesn't blind the agent. Exactly when you need it most.
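The PID → cgroup → pod step can be sketched in a few lines of Go. This is an illustrative reconstruction, not Kerno's actual code: the regex assumes the standard kubelet cgroup layouts (cgroupfs `pod<uid>`, systemd `…-pod<uid>.slice`, where the systemd driver writes the UID with underscores).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// podUIDPattern matches the pod UID segment the kubelet embeds in a
// container's cgroup path, for both the cgroupfs driver ("pod<uid>")
// and the systemd driver ("…-pod<uid_with_underscores>.slice").
var podUIDPattern = regexp.MustCompile(
	`pod([0-9a-f]{8}[_-][0-9a-f]{4}[_-][0-9a-f]{4}[_-][0-9a-f]{4}[_-][0-9a-f]{12})`)

// podUIDFromCgroup extracts a pod UID from a cgroup path as read from
// /proc/<pid>/cgroup. Returns "" when the path is not a pod cgroup.
func podUIDFromCgroup(cgroupPath string) string {
	m := podUIDPattern.FindStringSubmatch(cgroupPath)
	if m == nil {
		return ""
	}
	// The systemd cgroup driver replaces dashes with underscores.
	return strings.ReplaceAll(m[1], "_", "-")
}

func main() {
	p := "/kubepods.slice/kubepods-burstable.slice/" +
		"kubepods-burstable-pod3f4b8c12_9d2e_4a61_b1c7_0a58e1f2d9aa.slice/" +
		"cri-containerd-deadbeef.scope"
	fmt.Println(podUIDFromCgroup(p)) // → 3f4b8c12-9d2e-4a61-b1c7-0a58e1f2d9aa
}
```

With the UID in hand, the matching directory under /var/lib/kubelet/pods yields pod metadata without ever touching the API server.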
| Mount | Why |
|---|---|
| `/sys/kernel/debug` | tracepoints, kprobes |
| `/sys/kernel/btf` | CO-RE type resolution |
| `/sys/fs/bpf` | BPF map pinning |
| `/proc` | PID → cgroup → pod resolution |
| `/sys/fs/cgroup` | container resource accounting |
| `/sys/class/net` | per-interface TCP counters |
| `/sys/block` | per-device disk stats |
- Runs with the minimum capabilities needed: `CAP_BPF`, `CAP_PERFMON`, `CAP_SYS_PTRACE`, `CAP_NET_ADMIN`, `CAP_DAC_READ_SEARCH` (not `CAP_SYS_ADMIN` for the hot path).
- Read-only root filesystem; `ProtectSystem=strict` via systemd on bare metal.
- No outbound network calls. AI integration is opt-in and goes through your configured provider only.
image:
repository: ghcr.io/lowplane/kerno
tag: v0.1.0
resources:
requests: { cpu: 100m, memory: 128Mi }
limits: { cpu: "1", memory: 512Mi }
prometheus:
enabled: true
port: 9090
serviceMonitor: # Prometheus Operator
enabled: true
interval: 15s
nodeSelector:
monitoring: "true"
kubectl -n kerno-system get ds kerno
kubectl -n kerno-system logs -l app.kubernetes.io/name=kerno
kubectl -n kerno-system exec ds/kerno -- kerno doctor
Kerno runs as a lightweight Go agent with six tiny eBPF programs attached to stable tracepoints. When kerno doctor runs, it collects 30 seconds of real kernel data, evaluates 11 diagnostic rules deterministically, and emits a ranked incident report. No sampling. No guesswork. No query language.
flowchart TB
subgraph Kernel["KERNEL SPACE · eBPF Programs"]
direction LR
P1["syscall<br/>latency"]
P2["tcp<br/>monitor"]
P3["oom<br/>track"]
P4["disk<br/>io"]
P5["sched<br/>delay"]
P6["fd<br/>track"]
end
RB[("Ring Buffers<br/>256KB per program<br/>zero-copy mmap")]
subgraph UserSpace["USER SPACE · Go"]
direction TB
Loader["BPF Loaders<br/>cilium/ebpf"]
Collector["Collectors<br/>percentile aggregation"]
Signals[("Signals Snapshot<br/>single source of truth")]
Adapter["Environment Adapter<br/>k8s · systemd · bare metal"]
end
subgraph Outputs["OUTPUTS"]
direction TB
Doctor["Doctor Engine<br/>11 diagnostic rules"]
AI["AI Layer <i>(optional)</i><br/>root cause analysis"]
Prom["Prometheus<br/>/metrics :9090"]
CLI["Terminal<br/>pretty · JSON"]
end
P1 & P2 & P3 & P4 & P5 & P6 --> RB
RB --> Loader
Loader --> Collector
Collector --> Signals
Adapter -.enriches.-> Signals
Signals --> Doctor
Signals --> Prom
Doctor --> AI
AI --> CLI
Doctor --> CLI
classDef kernel fill:#1a1a2e,stroke:#e94560,color:#fff,stroke-width:2px
classDef user fill:#0f3460,stroke:#16213e,color:#fff,stroke-width:2px
classDef output fill:#16213e,stroke:#533483,color:#fff,stroke-width:2px
classDef buffer fill:#533483,stroke:#e94560,color:#fff,stroke-width:3px
classDef ai fill:#e94560,stroke:#fff,color:#fff,stroke-width:2px
class P1,P2,P3,P4,P5,P6 kernel
class Loader,Collector,Signals,Adapter user
class Doctor,Prom,CLI output
class RB buffer
class AI ai
- Deterministic first. The rule engine is pure Go, testable, and runs whether AI is on or off. Every finding has a clear cause, threshold, and fix.
- Zero-copy hot path. Kernel events land in eBPF ring buffers and are drained via `mmap` - microsecond overhead, no serialization cost.
- No API server load. Pod enrichment reads the kubelet's local pod manifests. The agent survives API server outages - the moment you need it most.
- AI is a post-processor. Optional. Opt-in. Never touches the hot path. The deterministic engine always runs; AI enriches, it never replaces.
- Graceful degradation. If an eBPF program fails to load on a weird kernel, that collector is skipped with a clear warning. The rest keep working.
sequenceDiagram
participant K as Kernel<br/>(eBPF)
participant R as Ring Buffer
participant C as Collectors
participant D as Doctor Engine
participant A as AI Layer
participant U as On-call Engineer
K->>R: syscall/tcp/oom/io events
Note over K,R: Zero-copy, microsecond overhead
R->>C: drain events
C->>C: aggregate into p50/p95/p99
C->>D: Signals snapshot
D->>D: evaluate 11 rules
alt AI enabled
D->>A: findings + signals
A->>A: correlate + explain
A->>U: incident report + root cause
else AI disabled
D->>U: deterministic incident report
end
Kerno runs 11 deterministic rules against every snapshot. Every rule is explainable, configurable, and covered by tests.
| # | Rule | Triggers When | Severity |
|---|---|---|---|
| 1 | Disk I/O Bottleneck | fsync p99 > 50ms or write p99 > 200ms | WARN / CRIT |
| 2 | OOM Kill Occurred | Any OOM event in window | CRIT |
| 3 | TCP Retransmit Storm | Retransmit rate > 2% | CRIT |
| 4 | TCP RTT Degradation | RTT p99 > 10ms | WARN |
| 5 | Scheduler Contention | Runqueue delay p99 > 5ms | WARN / CRIT |
| 6 | FD Leak | FD growth > 10/sec sustained | WARN (with ETA) |
| 7 | Syscall Latency High | Any syscall p99 > 100ms | WARN / CRIT |
| 8 | OOM Imminent | Memory > 90% + positive growth | WARN / CRIT (with ETA) |
| 9 | Syscall Error Rate | Error rate > 1% per syscall | WARN / CRIT |
| 10 | Memory Pressure | RSS usage > 90% | WARN |
| 11 | Network Latency | Connection RTT > 100ms | WARN |
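A rule like #1 reduces to a pure function from a signals snapshot to a severity, which is what makes the engine deterministic and testable. An illustrative Go sketch - the `Signals` struct is hypothetical, and the warning thresholds come from the table while the critical thresholds echo the sample config:

```go
package main

import "fmt"

type Severity int

const (
	OK Severity = iota
	Warn
	Crit
)

func (s Severity) String() string { return [...]string{"OK", "WARN", "CRIT"}[s] }

// Signals is a minimal stand-in for the snapshot a collector produces.
type Signals struct {
	FsyncP99Ns float64
	WriteP99Ns float64
}

// diskIOBottleneck mirrors rule #1: WARN when fsync p99 > 50ms or
// write p99 > 200ms; CRIT at 4x those thresholds (the CRIT cutoffs
// here are illustrative, not Kerno's documented values).
func diskIOBottleneck(s Signals) Severity {
	const ms = 1e6 // nanoseconds per millisecond
	switch {
	case s.FsyncP99Ns > 200*ms || s.WriteP99Ns > 800*ms:
		return Crit
	case s.FsyncP99Ns > 50*ms || s.WriteP99Ns > 200*ms:
		return Warn
	}
	return OK
}

func main() {
	// The 280ms fsync p99 from the sample report above trips CRIT.
	fmt.Println(diskIOBottleneck(Signals{FsyncP99Ns: 280e6})) // → CRIT
}
```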
# The golden command
kubectl -n kerno-system exec ds/kerno -- kerno doctor
# Quick 10-second check
kubectl -n kerno-system exec ds/kerno -- kerno doctor --duration 10s
# JSON for CI/CD, runbooks, Slack bots (non-zero exit on critical)
kubectl -n kerno-system exec ds/kerno -- kerno doctor --output json --exit-code
# AI-powered root cause analysis
kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai
# Explain a kernel error (no root, no cluster needed)
kerno explain "BUG: kernel NULL pointer dereference"
dmesg | tail -5 | kerno explain
# Predict failures before they page you
kubectl -n kerno-system exec ds/kerno -- kerno predict --snapshots 5 --interval 15s
# Every syscall event streaming
kubectl -n kerno-system exec ds/kerno -- kerno trace syscall
# Only syscalls from a specific pod's PID
kubectl -n kerno-system exec ds/kerno -- kerno trace syscall --pid 1234
# Postgres disk writes over 5ms
kubectl -n kerno-system exec ds/kerno -- kerno trace disk --process postgres --op write --threshold 5ms
# Scheduler delays over 10ms
kubectl -n kerno-system exec ds/kerno -- kerno trace sched --threshold 10ms
# TCP connections with retransmits
kubectl -n kerno-system exec ds/kerno -- kerno watch tcp --retransmits
# Any OOM kill, with pod context
kubectl -n kerno-system exec ds/kerno -- kerno watch oom --alert
# Processes leaking FDs
kubectl -n kerno-system exec ds/kerno -- kerno watch fd --threshold 10
The DaemonSet exposes 16 metrics at :9090/metrics. ServiceMonitor is included when the Prometheus Operator is installed.
View all 16 metrics
| Metric | Type | What It Measures |
|---|---|---|
| `kerno_syscall_duration_nanoseconds` | Summary | Syscall latency (p50, p95, p99) |
| `kerno_syscall_total` | Counter | Total syscall events |
| `kerno_tcp_rtt_nanoseconds` | Summary | TCP round-trip time |
| `kerno_tcp_retransmits_total` | Counter | TCP retransmissions |
| `kerno_tcp_connections_total` | Counter | TCP connection events |
| `kerno_oom_kills_total` | Counter | OOM kill events |
| `kerno_disk_io_duration_nanoseconds` | Summary | Disk I/O latency |
| `kerno_disk_io_bytes_total` | Counter | Disk I/O bytes |
| `kerno_sched_delay_nanoseconds` | Summary | CPU run queue delay |
| `kerno_fd_open_total` | Counter | FD open operations |
| `kerno_fd_close_total` | Counter | FD close operations |
| `kerno_collector_events_total` | Counter | Events per collector |
| `kerno_collector_errors_total` | Counter | Errors per collector |
| `kerno_bpf_programs_loaded` | Gauge | Loaded eBPF programs |
| `kerno_info` | Gauge | Build version |
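The counters pair naturally with Prometheus alerting. A hypothetical PrometheusRule - the alert name is made up, and dividing retransmits by connection events is a rough proxy for the 2% retransmit threshold, not a true retransmit percentage:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kerno-alerts
  namespace: kerno-system
spec:
  groups:
    - name: kerno
      rules:
        - alert: KernoTCPRetransmitStorm
          expr: |
            rate(kerno_tcp_retransmits_total[5m])
              / rate(kerno_tcp_connections_total[5m]) > 0.02
          for: 5m
          labels:
            severity: critical
```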
Health endpoints: /healthz and /readyz return JSON status.
Environment auto-detection. Kerno picks one of three adapters and enriches every event - no configuration required:
- Kubernetes (in-cluster token present) → pod, namespace, node, deployment
- Systemd (PID 1 is systemd) → unit, slice, scope
- Bare metal → hostname, cgroup path
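The three-way detection can be approximated like this - the probe order and file paths are illustrative, not Kerno's actual logic:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// detectEnvironment picks an adapter the same way the list above
// describes: in-cluster token first, then systemd, else bare metal.
func detectEnvironment() string {
	// In-cluster service account token => Kubernetes adapter.
	if _, err := os.Stat("/var/run/secrets/kubernetes.io/serviceaccount/token"); err == nil {
		return "kubernetes"
	}
	// PID 1 being systemd => systemd adapter.
	if comm, err := os.ReadFile("/proc/1/comm"); err == nil &&
		strings.TrimSpace(string(comm)) == "systemd" {
		return "systemd"
	}
	return "baremetal"
}

func main() {
	fmt.Println(detectEnvironment())
}
```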
AI (optional). The AI layer runs after the deterministic rule engine - it correlates cross-signals and explains root causes, it never replaces rules. Three providers (Anthropic, OpenAI, Ollama for air-gapped), three privacy modes (full / redacted / summary), TTL cache + token-bucket rate limiting, graceful fallback to a deterministic template on failure. No LLM SDK dependencies - pure net/http.
kubectl -n kerno-system set env ds/kerno \
KERNO_AI_API_KEY=sk-... \
KERNO_AI_PROVIDER=anthropic
kubectl -n kerno-system exec ds/kerno -- kerno doctor --ai
Kerno works with zero configuration. For custom setups, mount a config.yaml or use KERNO_* env vars:
log_level: info
collectors:
syscall_latency: true
tcp_monitor: true
oom_track: true
disk_io: true
sched_delay: true
fd_track: true
doctor:
duration: 30s
thresholds:
syscall_p99_warning_ns: 100000000 # 100ms
syscall_p99_critical_ns: 500000000 # 500ms
tcp_retransmit_pct: 2.0 # 2%
oom_memory_pct: 90.0 # 90%
disk_p99_warning_ns: 50000000 # 50ms
disk_p99_critical_ns: 200000000 # 200ms
sched_delay_warning_ns: 5000000 # 5ms
sched_delay_critical_ns: 20000000 # 20ms
fd_growth_per_sec: 10.0
prometheus:
enabled: true
addr: ":9090"
ai:
enabled: false
provider: anthropic
privacy_mode: summary
Precedence: CLI flags > environment variables (KERNO_*) > config file > defaults.
See TODO.md for the full plan. Headlines:
- v0.1 - DaemonSet, 6 collectors, 11 rules, Prometheus, AI post-processor - shipped
- v0.2 - CRD for cluster-wide incident policies, OpenTelemetry OTLP export, Grafana dashboards
- v0.3 - historical incident replay, SLO-linked alerts, Slack / PagerDuty integrations
- v1.0 - multi-cluster control plane, managed offering (Lowplane Cloud)
# Requirements: Go 1.25+
# Optional for real eBPF: clang 14+, libbpf-dev, llvm
make build # Build binary (uses BPF stubs - no clang needed)
make bpf # Compile eBPF C programs
make test # Run unit tests
make test-race # Run with race detector
make lint # golangci-lint
make check # Full CI check: vet + test + lint
make docker     # Build Docker image
Contributions welcome. See CONTRIBUTING.md for:
- Development setup and prerequisites
- Commit message conventions (Conventional Commits)
- Code review process
- DCO sign-off requirement
For security reports, see SECURITY.md.
Apache License 2.0 - see LICENSE.