A Kubernetes operator that enables zero-downtime live migration of applications using CRIU (Checkpoint/Restore In Userspace) with Object Storage integration. Designed specifically for Spot/Preemptible instances.
This operator provides:
- Automatic Migration: Detects spot instance interruptions and automatically migrates workloads
- Incremental Checkpoints: Regular pre-checkpoints with S3 direct upload (zero disk I/O)
- Object Storage Integration: Stores checkpoints in S3/MinIO/GCS for cross-node migration
- Lazy Page Loading: Fast restore with async prefetch and hot VMA priority seeding
- Write Profiler: userfaultfd write-protect (uffd-wp) based dirty page tracking for adaptive checkpointing
- Deadline Scheduler: F_op feasibility model for deadline-driven pre-dumps within spot termination windows
- Experiment Data Collection: Automatic upload of all raw CRIU logs and per-fault metrics to S3
- Ablation Control: Fine-grained feature flags for systematic performance evaluation
- Kubernetes Native: CRD-based API with familiar kubectl workflows
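The incremental-checkpoint feature above boils down to a simple decision rule, sketched here as a minimal illustration (`nextDumpKind` is a hypothetical helper; the controller's real policy also weighs the memory-change threshold):

```go
package main

import "fmt"

// nextDumpKind sketches the incremental-checkpoint policy: pre-dumps run at
// the configured interval, and a full dump is forced once the chain grows
// past maxCheckpointChainDepth (names mirror the CRD fields; the controller's
// actual logic may differ).
func nextDumpKind(chainDepth, maxDepth int) string {
	if chainDepth >= maxDepth {
		return "full" // restart the chain with a self-contained checkpoint
	}
	return "pre-dump" // incremental: only pages dirtied since the parent
}

func main() {
	fmt.Println(nextDumpKind(3, 10))  // pre-dump
	fmt.Println(nextDumpKind(10, 10)) // full
}
```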
┌─────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Migration Controller │ │
│ │ - Reconciles MigratableApp resources │ │
│ │ - Orchestrates migrations │ │
│ │ - Manages Pod lifecycle │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Node Monitor (DaemonSet) │ │
│ │ - Detects spot interruptions │ │
│ │ - Triggers migrations │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Application Pod │ │
│ │ ┌──────────────┐ ┌────────────────────┐ │ │
│ │ │ App Container│ │ CRIU Agent Sidecar │ │ │
│ │ │ │ │ - gRPC Server │ │ │
│ │ │ your-app │◄──│ - Checkpoint │ │ │
│ │ │ │ │ - Restore │ │ │
│ │ └──────────────┘ └────────────────────┘ │ │
│ └────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Object Storage (S3) │
└─────────────────────────────────────────────────────────┘
The operator uses a "sleep infinity" pattern to avoid checkpointing the container's PID 1 process directly:
- Pod starts with `sleep infinity` as PID 1 (specified in the MigratableApp spec)
- Agent launches the actual application via `nsenter` during restore
- CRIU only checkpoints the child process, not PID 1
Benefits:
- PID 1 (sleep) remains unchanged across migrations
- Avoids complications with container runtime expectations
- Maintains namespace sharing for kubelet compatibility
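The launch step could be sketched as follows. This is a minimal illustration, not the agent's actual code: `nsenterCommand` is a hypothetical helper, and the exact set of `nsenter` flags the agent passes is an assumption.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// nsenterCommand builds a command that launches the application inside the
// namespaces of the pod's PID 1 ("sleep infinity"), so the app runs as a
// child process while PID 1 stays untouched. Flag set is illustrative.
func nsenterCommand(targetPID int, appArgv []string) *exec.Cmd {
	args := []string{
		"--target", strconv.Itoa(targetPID), // join PID 1's namespaces
		"--mount", "--uts", "--ipc", "--net", "--pid",
		"--",
	}
	args = append(args, appArgv...)
	return exec.Command("nsenter", args...)
}

func main() {
	cmd := nsenterCommand(1, []string{"python", "app.py"})
	fmt.Println(cmd.Args)
}
```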
The operator uses CRIU's `--join-ns mnt` feature to handle Kubernetes-injected mounts:
Challenge: Kubernetes injects various mounts into containers:
- `/dev/termination-log`, `/etc/hosts`, `/etc/resolv.conf`, `/etc/hostname`
- ConfigMap/Secret volumes
- Service account tokens

Solution: Join the target pod's existing mount namespace instead of restoring it:
- Dump: Mark specific mounts as external (`--external mnt[path]:id`)
- Restore: Use `--join-ns mnt:/proc/1/ns/mnt` to join the target's mount namespace
- Result: Target pod's mounts (managed by kubelet) are used directly
CRIU Bug Fix: Fixed a bug in CRIU 4.0 where `--join-ns mnt` was not working correctly. See CRIU_JOIN_NS_MNT_BUG_FIX.md for details.
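The dump/restore flag construction implied by the steps above can be sketched like this. It is illustrative only: `dumpArgs` and `restoreArgs` are hypothetical helpers, and the agent's real flag set is larger.

```go
package main

import "fmt"

// dumpArgs marks kubelet-injected mounts as external so CRIU does not try
// to dump them itself.
func dumpArgs(pid int, externalMounts map[string]string) []string {
	args := []string{"dump", "-t", fmt.Sprint(pid), "--tcp-established"}
	for path, id := range externalMounts {
		args = append(args, "--external", fmt.Sprintf("mnt[%s]:%s", path, id))
	}
	return args
}

// restoreArgs joins the target pod's existing mount namespace instead of
// rebuilding the mount tree from the checkpoint.
func restoreArgs() []string {
	return []string{"restore", "--join-ns", "mnt:/proc/1/ns/mnt"}
}

func main() {
	fmt.Println(dumpArgs(42, map[string]string{"/etc/hosts": "hosts"}))
	fmt.Println(restoreArgs())
}
```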
During Dump:
- Upload ALL checkpoint files to S3, including `pages-*.img`
- Even though pages are served via the page-server during migration, they must be in S3 for the lazy-pages daemon

During Restore:
- Download only metadata files from S3 (core, mm, files, etc.)
- Skip downloading `pages-*.img` (too large; pages are loaded on demand)
- Lazy-pages daemon fetches pages from S3 as needed
Benefit: Fast restore startup time (~1-2 seconds) with on-demand page loading
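The restore-time download filter can be sketched as a single predicate. This is a hedged sketch: `shouldDownloadOnRestore` is a hypothetical helper, and the agent's real filter may cover more file classes.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// shouldDownloadOnRestore reports whether a checkpoint file must be fetched
// eagerly during restore. Page images are skipped: the lazy-pages daemon
// serves them on demand from S3.
func shouldDownloadOnRestore(key string) bool {
	name := path.Base(key)
	isPages := strings.HasPrefix(name, "pages-") && strings.HasSuffix(name, ".img")
	return !isPages
}

func main() {
	for _, k := range []string{"ckpt/core-1.img", "ckpt/mm-1.img", "ckpt/pages-1.img"} {
		fmt.Println(k, shouldDownloadOnRestore(k))
	}
}
```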
- Regular S3: Uses IAM roles or public access (no credentials needed in CRIU command)
- Express One Zone: Requires explicit credentials (`--aws-access-key`, `--aws-secret-key`)
- Agent conditionally includes credentials based on storage type
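The conditional credential injection might look like the sketch below. The flag names follow the CRIU S3 fork's `--aws-access-key`/`--aws-secret-key` mentioned above, but the `"s3-express"` type string and the helper itself are assumptions.

```go
package main

import "fmt"

// appendS3Credentials adds explicit credential flags only for storage types
// that require them (Express One Zone); regular S3 relies on IAM roles or
// public access, so no credentials appear in the CRIU command line.
func appendS3Credentials(args []string, storageType, key, secret string) []string {
	if storageType == "s3-express" { // hypothetical type name
		args = append(args, "--aws-access-key", key, "--aws-secret-key", secret)
	}
	return args
}

func main() {
	fmt.Println(appendS3Credentials([]string{"dump"}, "s3", "AK", "SK"))
	fmt.Println(appendS3Credentials([]string{"dump"}, "s3-express", "AK", "SK"))
}
```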
- Go: 1.25.3+ (required for building)
- Docker: For building container images
- Protobuf Compiler: `protoc` (for generating gRPC code)
- controller-gen: For generating CRD manifests
- kubectl: For deploying to Kubernetes
- Kubernetes: v1.20+
- Container Runtime: containerd (with CRIU support) or CRI-O
- Object Storage: S3, MinIO, or GCS
- Linux Kernel: 4.x+ (with CRIU support)
# Install Go 1.25.3 or later
# Ubuntu/Debian example:
wget https://go.dev/dl/go1.25.3.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.25.3.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
export PATH=$PATH:$(go env GOPATH)/bin
source /etc/profile  # Or add to ~/.bashrc

# Install protobuf compiler
sudo apt update && sudo apt install -y protobuf-compiler
# Install Go tools
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest

cd kubernetes_integration
go mod download
go mod tidy

# Generate protobuf code
./scripts/generate-proto.sh
# Or generate manually:
export PATH=$PATH:$(go env GOPATH)/bin
protoc \
--go_out=. \
--go_opt=paths=source_relative \
--go-grpc_out=. \
--go-grpc_opt=paths=source_relative \
pkg/proto/agent.proto

# Generate CRD manifests
make manifests
# This creates:
# - config/crd/migration.io_migratableapps.yaml
# - config/rbac/role.yaml

# Build all binaries
make build
# Output:
# - bin/agent (CRIU Agent)
# - bin/controller (Migration Controller)
# - bin/node-monitor (Node Monitor)

# Download CRIU binary and build all images
make docker-build
# This will:
# 1. Download CRIU binary from S3
# 2. Build agent image (with CRIU)
# 3. Build controller image
# 4. Build node-monitor image
# Images created:
# - 192.168.0.253:5000/criu-agent:latest
# - 192.168.0.253:5000/criu-migration-controller:latest
# - 192.168.0.253:5000/criu-node-monitor:latest

# Push all images to registry
make docker-push
# Or customize registry:
make docker-push REGISTRY=your-registry.com/yourorg

# Full build from scratch
source /etc/profile
cd kubernetes_integration
# 1. Install dependencies
go mod tidy
# 2. Generate code
./scripts/generate-proto.sh
make manifests
# 3. Build binaries
make build
# 4. Build and push Docker images
make docker-push

Prerequisites:
- Kubernetes cluster running (v1.20+)
- `kubectl` configured to access the cluster
- Docker images pushed to registry (or use `make docker-push`)
# 1. Install CRDs
make install
# 2. Deploy namespace, RBAC, controller and monitor
make deploy
# Or with custom registry:
make deploy REGISTRY=192.168.0.253:5000
# 3. Create storage credentials (see below)

Note: The `make deploy` command automatically substitutes the correct image references based on the REGISTRY variable. If you built and pushed images with a custom registry (e.g., `make docker-push REGISTRY=192.168.0.253:5000`), use the same REGISTRY value when deploying.
kubectl apply -f config/crd/migration.io_migratableapps.yaml

# This will create:
# - Namespace: migration-system
# - ServiceAccount: migration-controller
# - ClusterRole and ClusterRoleBinding
# - Leader election Role and RoleBinding
kubectl apply -f config/rbac/rbac.yaml

# Deploy the controller deployment and node-monitor daemonset
kubectl apply -f config/manager/manager.yaml

# Check if pods are running
kubectl get pods -n migration-system
# Expected output:
# NAME READY STATUS RESTARTS AGE
# migration-controller-xxxxxxxxxx-xxxxx 1/1 Running 0 30s
# node-monitor-xxxxx 1/1 Running 0 30s
# node-monitor-yyyyy 1/1 Running 0 30s

Important: Create the secret in the migration-system namespace (where the controller is deployed). The controller automatically injects these credentials into all MigratableApp pods, regardless of which namespace they run in.
For AWS S3:
kubectl create secret generic s3-credentials \
--from-literal=AWS_ACCESS_KEY_ID=your-access-key \
--from-literal=AWS_SECRET_ACCESS_KEY=your-secret-key \
-n migration-system

For MinIO:
kubectl create secret generic s3-credentials \
--from-literal=AWS_ACCESS_KEY_ID=minioadmin \
--from-literal=AWS_SECRET_ACCESS_KEY=minioadmin \
-n migration-system

Note: Only one secret in the migration-system namespace is needed; MigratableApps in any namespace will use it.
# example-app.yaml
apiVersion: migration.io/v1alpha1
kind: MigratableApp
metadata:
  name: my-web-app
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: app
        image: python:3.9-slim
        command: ["python", "-c"]
        args:
        - |
          import time
          counter = 0
          while True:
              counter += 1
              print(f"Counter: {counter}")
              time.sleep(5)
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
  checkpointPolicy:
    interval: "30s"
    autoAdjust: true
    memoryThresholdMB: 100
    maxCheckpointChainDepth: 10
  migrationPolicy:
    autoMigrate: true
    preferOnDemand: true
    migrationTimeoutSeconds: 300
  storage:
    type: s3
    bucket: my-checkpoint-bucket
    endpoint: http://minio.default.svc.cluster.local:9000
    region: us-east-1
    credentialsSecret: s3-credentials

kubectl apply -f example-app.yaml

# Watch MigratableApp status
kubectl get mapp my-web-app -w
# Get detailed status
kubectl describe mapp my-web-app
# View logs
kubectl logs -l migration.io/app=my-web-app -c criu-agent
kubectl logs -l migration.io/app=my-web-app -c app

# View checkpoint information
kubectl get mapp my-web-app -o jsonpath='{.status.checkpointStatus}' | jq
# Output example:
# {
# "lastCheckpointID": "abc123-1234567890",
# "lastCheckpointTime": "2024-11-05T08:00:00Z",
# "checkpointChainDepth": 3,
# "checkpointChainRoot": "xyz789-1234567890"
# }

kubectl get mapp my-web-app -o jsonpath='{.status.migrationHistory}' | jq
# Output example:
# [
# {
# "fromNode": "node-1",
# "toNode": "node-2",
# "timestamp": "2024-11-05T08:05:00Z",
# "reason": "spot-interrupt",
# "duration": "15.2s",
# "success": true
# }
# ]

# Add migration trigger annotation
POD_NAME=$(kubectl get pod -l migration.io/app=my-web-app -o jsonpath='{.items[0].metadata.name}')
kubectl annotate pod $POD_NAME migration.io/trigger=requested
kubectl annotate pod $POD_NAME migration.io/reason=manual

checkpointPolicy:
  # Interval between pre-checkpoints
  interval: "30s"
  # Automatically adjust interval based on memory changes
  autoAdjust: true
  # Trigger checkpoint when memory changes exceed this threshold (MB)
  memoryThresholdMB: 100
  # Maximum checkpoint chain depth before full checkpoint
  maxCheckpointChainDepth: 10

migrationPolicy:
  # Enable automatic migration on spot interrupt
  autoMigrate: true
  # Node selector for migration target
  targetNodeSelector:
    node-type: on-demand
  # Prefer on-demand nodes over spot
  preferOnDemand: true
  # Migration timeout (seconds)
  migrationTimeoutSeconds: 300

AWS S3:

storage:
  type: s3
  bucket: my-bucket
  region: us-east-1
  credentialsSecret: aws-credentials

MinIO:

storage:
  type: minio
  bucket: my-bucket
  endpoint: http://minio.default.svc.cluster.local:9000
  region: us-east-1
  credentialsSecret: minio-credentials

GCS:

storage:
  type: gcs
  bucket: my-bucket
  credentialsSecret: gcs-credentials

# Development
make help # Show all available targets
make generate # Generate protobuf and deepcopy code
make fmt # Format Go code
make vet # Run Go vet
make test # Run tests
# Build
make build # Build binaries (agent, controller, node-monitor)
# Docker
make download-criu # Download CRIU binary from S3
make docker-build # Build Docker images (includes download-criu)
make docker-push # Build and push Docker images
# Deployment
make manifests # Generate CRD and RBAC manifests
make install # Install CRDs to cluster
make uninstall # Uninstall CRDs from cluster
make deploy # Deploy controller and monitor
make undeploy # Remove controller and monitor
# Dependencies
make controller-gen # Install controller-gen
make protoc-gen-go # Install protoc-gen-go
make protoc-gen-go-grpc # Install protoc-gen-go-grpc

# Use custom CRIU binary URL
make docker-build CRIU_URL=https://your-server.com/criu

# Build and push with custom registry
make docker-push REGISTRY=your-registry.com/yourorg
# Deploy with same custom registry
make deploy REGISTRY=your-registry.com/yourorg
# Complete workflow:
make docker-push REGISTRY=192.168.0.253:5000
make deploy REGISTRY=192.168.0.253:5000

The REGISTRY variable affects:
- docker-build/docker-push: Sets the image tags for building and pushing
- deploy: Automatically replaces image references in deployment YAML before applying to cluster
Edit the Makefile:
AGENT_IMG ?= $(REGISTRY)/criu-agent:v1.0.0
CONTROLLER_IMG ?= $(REGISTRY)/criu-migration-controller:v1.0.0
MONITOR_IMG ?= $(REGISTRY)/criu-node-monitor:v1.0.0

Problem: `go: command not found`
# Install Go and add to PATH
export PATH=$PATH:/usr/local/go/bin
export PATH=$PATH:$(go env GOPATH)/bin
source /etc/profile

Problem: `protoc: command not found`
# Install protobuf compiler
sudo apt install -y protobuf-compiler

Problem: `controller-gen: command not found`
# Install controller-gen
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest

Problem: CRIU download fails
# Check CRIU URL and download manually
curl -L -o criu/criu https://mhsong-criu-s3-data.s3.us-west-2.amazonaws.com/criu
chmod +x criu/criu

Problem: Go version mismatch in Docker
# Dockerfiles use golang:1.25.3-alpine
# Make sure go.mod requires go >= 1.25.1

Problem: Agent connection failed
# Check agent pod logs
kubectl logs <pod-name> -c criu-agent
# Verify agent is running
kubectl exec <pod-name> -c criu-agent -- ps aux | grep agent

Problem: Checkpoint failed
# Check CRIU logs in the pod
kubectl exec <pod-name> -c criu-agent -- ls /checkpoints
kubectl exec <pod-name> -c criu-agent -- cat /checkpoints/<dump-id>/criu.log
# Verify CRIU is available
kubectl exec <pod-name> -c criu-agent -- criu check --all

Problem: Migration timeout
# Increase migration timeout
kubectl edit mapp <app-name>
# Set spec.migrationPolicy.migrationTimeoutSeconds to a higher value

kubernetes_integration/
├── api/v1alpha1/ # CRD API definitions
├── cmd/ # Main applications
│ ├── agent/ # CRIU Agent
│ ├── controller/ # Migration Controller
│ └── node-monitor/ # Node Monitor
├── pkg/ # Libraries
│ ├── agent/ # Agent implementation
│ ├── controller/ # Controller implementation
│ ├── scheduler/ # Checkpoint scheduler
│ ├── monitor/ # Spot monitor
│ └── proto/ # gRPC definitions
├── config/ # Kubernetes manifests
│ ├── crd/ # CRD definitions
│ ├── rbac/ # RBAC configs
│ ├── manager/ # Controller deployment
│ └── samples/ # Example applications
├── deploy/ # Dockerfiles
│ ├── agent/
│ ├── controller/
│ └── node-monitor/
├── scripts/ # Build scripts
├── Makefile # Build automation
└── README.md # This file
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Apache License 2.0
- Fixed: Multiple critical issues preventing re-migration (gen0 → gen1 → gen2)
- Major Changes:
- S3 path consistency: Use MigratableApp name instead of pod name across all generations
- SOURCE_POD_IP injection: Proper lazy-pages connection for re-migration
- Generation number tracking: Fixed via Downward API annotation reading
- PID layout consistency: Added PID booster init container for reproducible PIDs
- Enhanced namespace handling: Comprehensive external mount detection and mapping
- Robust lazy-pages lifecycle: Proper readiness detection and health checks
- Results: Successful multi-generation migration with 43-file checkpoint chains
- Performance: ~7s restore time with lazy-pages, continuous pre-checkpoints working
- Commit: c07b93f
- Fixed: TCP health check killing page-server prematurely
- Solution: Removed TCP dial from the `waitForPageServerReady()` function
- Impact: Stable zero-downtime migrations achieved
- Performance: 1.8s restore time, 15.96s total migration time
- Details: See CRIU_MIGRATION_OPERATOR_DOCS.md
- Fixed: CRIU 4.0 `--join-ns mnt` not working correctly
- Solution: Clear `root_ns_mask` for joined namespaces in `prepare_namespace_before_tasks()`
- Impact: Successful mount namespace handling in Kubernetes
- Details: See CRIU_JOIN_NS_MNT_BUG_FIX.md
- S3 Direct Upload: CRIU `--object-storage-upload` for zero-disk-I/O dumps
- Write Profiler (uffd-wp): Auto-start dirty page tracking via userfaultfd write-protect
  - ptrace syscall injection for uffd creation in the target process
  - Heat classification: theta=0.3, N=3 consecutive intervals, 5s scan
  - Automatic cleanup before CRIU dump and reinit after
- Hot VMA Integration:
  - Pre-dump: `--exclude-range` for hot VMAs (skip frequently written regions)
  - Final dump: `hot-vmas.json` uploaded to S3 for lazy-pages prefetch seeding
- Async Prefetch: `--async-prefetch --prefetch-workers N` for parallel page fetching
- Ablation Control: `semiSyncIOV` and `hotVMASeed` flags for the 5-mode experiment
- Deadline Scheduler: F_op feasibility model for deadline-driven pre-dumps
- Per-fault Metrics: Lazy-pages log parsing (stall times, S3 vs cache, pages per fault)
- Log Upload: `logUpload: true` uploads all raw CRIU logs to S3 for experiment collection
- Path-style S3: `--object-storage-path-style` for MinIO compatibility
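The heat-classification rule above (theta=0.3, N=3 consecutive intervals, 5s scan) can be sketched as follows. This is a minimal sketch of the stated rule; `heatClassifier` and its bookkeeping are illustrative, not the profiler's actual implementation.

```go
package main

import "fmt"

// heatClassifier marks a memory region hot once its dirty ratio exceeds
// theta for n consecutive scan intervals.
type heatClassifier struct {
	theta  float64        // dirty-ratio threshold (0.3 in the defaults above)
	n      int            // consecutive intervals required (3)
	streak map[uint64]int // region ID -> current streak length
}

func newHeatClassifier(theta float64, n int) *heatClassifier {
	return &heatClassifier{theta: theta, n: n, streak: map[uint64]int{}}
}

// observe records one scan (every 5s) for a region and reports whether the
// region is now classified hot.
func (h *heatClassifier) observe(region uint64, dirtyRatio float64) bool {
	if dirtyRatio > h.theta {
		h.streak[region]++
	} else {
		h.streak[region] = 0 // the streak must be consecutive
	}
	return h.streak[region] >= h.n
}

func main() {
	h := newHeatClassifier(0.3, 3)
	fmt.Println(h.observe(1, 0.5)) // streak 1: not hot yet
	fmt.Println(h.observe(1, 0.6)) // streak 2: not hot yet
	fmt.Println(h.observe(1, 0.4)) // streak 3: hot
}
```

Hot regions found this way would then feed `--exclude-range` at pre-dump time and `hot-vmas.json` for prefetch seeding.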
- Architecture — component overview, data flow, pod structure
- Configuration Reference — all CRD fields, env vars, examples
- Webhook Injection — annotation-based sidecar injection for existing Deployments
- Log Upload — experiment data collection setup
- E2E Verification — full test results on QEMU cluster
- Migration Strategies — full vs lazy-storage vs lazy-direct vs lazy-hybrid
- Write Profiler — uffd-wp dirty page tracking details
- Resolved Issues — past bugs and fixes
- CRIU Documentation
- ddps-lab/criu-s3 — CRIU fork with S3 object storage support
- Kubernetes Operator Pattern
- controller-runtime
For questions or support:
- GitHub: github.com/ddps-lab/criu-migration-operator
- Issues: GitHub Issues