Complete reference for all distributed execution launchers supported by madengine.
madengine provides unified support for multiple distributed frameworks, enabling seamless execution across training and inference workloads on both Kubernetes and SLURM clusters.
| Launcher | Type | Use Case | K8s | SLURM | Multi-Node |
|---|---|---|---|---|---|
| torchrun | Training | PyTorch DDP/FSDP training | ✅ | ✅ | ✅ |
| DeepSpeed | Training | ZeRO optimization training | ✅ | ✅ | ✅ |
| Megatron-LM | Training | Large-scale transformer training | ✅ | ✅ | ✅ |
| TorchTitan | Training | LLM pre-training (FSDP2+TP+PP) | ✅ | ✅ | ✅ |
| Primus | Training | Megatron/TorchTitan/Jax via Primus config | ✅ | ✅ | ✅ |
| vLLM | Inference | High-throughput LLM serving | ✅ | ✅ | ✅ |
| SGLang | Inference | Fast LLM inference | ✅ | ✅ | ✅ |
| SGLang Disaggregated | Inference | Large-scale disaggregated inference | ✅ | ✅ | ✅ (min 3) |
{
"distributed": {
"launcher": "torchrun",
"nnodes": 2,
"nproc_per_node": 8
}
}
# Build with configuration
madengine build --tags my_model \
--additional-context-file config.json
# Deploy to K8s or SLURM
madengine run --manifest-file build_manifest.json
Purpose: Standard PyTorch distributed training with DDP/FSDP
When to Use:
- ✅ Multi-GPU/multi-node PyTorch training
- ✅ Data Parallel or Fully Sharded Data Parallel
- ✅ Standard distributed training patterns
Configuration:
{
"distributed": {
"launcher": "torchrun",
"nnodes": 2,
"nproc_per_node": 8,
"master_port": 29500
}
}
Features:
- Automatic rank assignment
- NCCL backend for GPU communication
- Elastic training support
- Compatible with all PyTorch models
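Automatic rank assignment typically follows the usual torchrun convention: each process has a local rank within its node and a global rank across the whole job. The sketch below shows that arithmetic, assuming every node runs the same number of processes and node ranks are fixed; `global_rank` is an illustrative helper, not part of madengine or PyTorch.

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a worker, assuming each node runs nproc_per_node processes."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


# With the configuration above (nnodes=2, nproc_per_node=8):
world_size = 2 * 8
ranks = [global_rank(node, local, 8) for node in range(2) for local in range(8)]
```

Here `ranks` covers 0 through 15, so the first process on the second node has global rank 8.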
Examples:
- K8s: examples/k8s-configs/minimal/torchrun-multi-gpu-minimal.json
- SLURM: examples/slurm-configs/minimal/torchrun-multi-node-minimal.json
Purpose: Memory-efficient training with ZeRO optimization
When to Use:
- ✅ Large models that don't fit in GPU memory
- ✅ ZeRO optimization stages (ZeRO-1, ZeRO-2, ZeRO-3)
- ✅ Gradient accumulation and mixed precision
Configuration:
{
"distributed": {
"launcher": "deepspeed",
"nnodes": 4,
"nproc_per_node": 8
}
}
Features:
- ZeRO memory optimization
- Pipeline parallelism
- Gradient accumulation
- Mixed precision training
- Automatic hostfile generation (K8s)
Architecture:
- Uses its own launcher (not torchrun)
- Manages process spawning internally
- Requires DeepSpeed config file in model script
Examples:
- SLURM: examples/slurm-configs/basic/04-multi-node-advanced.json
Purpose: Large-scale transformer model training
When to Use:
- ✅ GPT, BERT, T5 style transformers
- ✅ Tensor and pipeline parallelism
- ✅ Very large models (70B+ parameters)
Configuration:
{
"distributed": {
"launcher": "megatron",
"nnodes": 4,
"nproc_per_node": 8
}
}
Features:
- Tensor parallelism across GPUs
- Pipeline parallelism across nodes
- Optimized for transformer architectures
- Built on top of torchrun
- Automatic TP/PP size configuration
Availability:
- ✅ K8s: Fully supported (dedicated launcher)
- ✅ SLURM: Fully supported
Examples:
- K8s: examples/k8s-configs/minimal/megatron-lm-minimal.json
- K8s Multi-node: examples/k8s-configs/basic/megatron-lm-multi-node-basic.json
- SLURM: examples/slurm-configs/minimal/megatron-lm-minimal.json
- SLURM Multi-node: examples/slurm-configs/basic/09-megatron-lm-multi-node.json
Environment Variables (automatically set by launcher):
# Megatron-Core standard variables
TENSOR_MODEL_PARALLEL_SIZE # Tensor parallelism (GPUs per node)
PIPELINE_MODEL_PARALLEL_SIZE # Pipeline parallelism (typically = nnodes)
CONTEXT_PARALLEL_SIZE # Context parallelism (default: 1)
Note: The launcher automatically configures:
- Single-node: TP only (PP=1)
- Multi-node: TP across GPUs + PP across nodes
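The single-node and multi-node rules above can be sketched as a small function. This is an illustration of the described behavior (TP equal to GPUs per node, PP equal to the node count), not madengine's actual implementation; the function name is hypothetical.

```python
def megatron_parallel_sizes(nnodes: int, nproc_per_node: int) -> dict:
    """Derive the Megatron environment values described above (assumed rule)."""
    tensor_parallel = nproc_per_node                # TP spans the GPUs of one node
    pipeline_parallel = nnodes if nnodes > 1 else 1  # PP spans nodes; 1 on a single node
    return {
        "TENSOR_MODEL_PARALLEL_SIZE": tensor_parallel,
        "PIPELINE_MODEL_PARALLEL_SIZE": pipeline_parallel,
        "CONTEXT_PARALLEL_SIZE": 1,                  # documented default
    }


single_node = megatron_parallel_sizes(nnodes=1, nproc_per_node=8)
multi_node = megatron_parallel_sizes(nnodes=4, nproc_per_node=8)
```

For the 4-node, 8-GPU configuration shown earlier, this gives TP=8 and PP=4.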
Purpose: Production LLM pre-training with multi-dimensional parallelism
Reference: pytorch/torchtitan
When to Use:
- ✅ Llama 3.1 (8B to 405B) pre-training
- ✅ Multi-dimensional parallelism (FSDP2 + TP + PP + CP)
- ✅ Production-scale LLM training
Configuration:
{
"distributed": {
"launcher": "torchtitan",
"nnodes": 4,
"nproc_per_node": 8
}
}
Parallelism Strategies:
- FSDP2: Fully Sharded Data Parallel v2 for parameter sharding
- TP: Tensor Parallel - split model layers across GPUs
- PP: Pipeline Parallel - split model stages across nodes
- CP: Context Parallel - distributed context processing
Features:
- Uses torchrun as underlying launcher
- Configured via TOML files
- Automatic parallelism detection
- Float8 and MXFP8 support
- Gradient accumulation
- Distributed checkpointing
Environment Variables:
TORCHTITAN_TENSOR_PARALLEL_SIZE=8
TORCHTITAN_PIPELINE_PARALLEL_SIZE=4
TORCHTITAN_FSDP_ENABLED=1
TORCHTITAN_CONTEXT_PARALLEL_SIZE=1
Single vs Multi-Node:
- Single-node: TP only across GPUs
- Multi-node: TP + PP + FSDP2 combined
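When TP, PP, and CP are combined, their degrees have to divide the total GPU count, and whatever remains is the data-parallel (FSDP2) degree. A minimal sketch of that bookkeeping, assuming a homogeneous cluster; `data_parallel_degree` is an illustrative helper, not a TorchTitan or madengine API.

```python
def data_parallel_degree(nnodes: int, nproc_per_node: int,
                         tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """GPUs left for data parallelism after TP, PP, and CP are accounted for."""
    world_size = nnodes * nproc_per_node
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"tp*pp*cp = {model_parallel} does not divide world size {world_size}"
        )
    return world_size // model_parallel


# Values from the environment variables above on 4 nodes x 8 GPUs:
dp_full = data_parallel_degree(4, 8, tp=8, pp=4, cp=1)
# Same cluster with pipeline parallelism turned off:
dp_no_pp = data_parallel_degree(4, 8, tp=8, pp=1, cp=1)
```

With TP=8 and PP=4 on 32 GPUs the data-parallel degree is 1; dropping PP to 1 leaves a degree of 4 for FSDP2 sharding.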
Examples:
- K8s: examples/k8s-configs/minimal/torchtitan-single-node-minimal.json
- SLURM: examples/slurm-configs/minimal/torchtitan-single-node-minimal.json
Model Configuration (TOML):
[model]
name = "llama3"
flavor = "8B"
[training]
tensor_parallel_degree = 8
pipeline_parallel_degree = 1
batch_size = 1
seq_len = 8192
Purpose: Unified pretrain entry for Megatron-LM, TorchTitan, and Jax/MaxText via Primus experiment YAML.
When to Use:
- Run Primus example configs (e.g. examples/megatron/configs/MI300X/*.yaml) via madengine
- Single image plus config path; scheduling and tools/metrics from madengine
Configuration:
{
"distributed": {
"launcher": "primus",
"nnodes": 2,
"nproc_per_node": 8,
"primus": {
"config_path": "examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml",
"cli_extra": "",
"backend": "megatron"
}
}
}
Optional primus.backend (e.g. MaxText, megatron) emits export BACKEND=... before your model script when path-based detection is not enough. If omitted, madengine infers BACKEND from the model name when it follows the pattern primus_pretrain/<launcher>_<arch>_... (e.g. primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain → torchtitan), matching scripts/primus_pretrain/get_models_json.py in MAD-internal.
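The naming convention described above can be illustrated with a short parser. This sketch only mirrors the documented pattern; it is not the code in get_models_json.py or madengine, and `infer_backend` is a hypothetical name.

```python
from typing import Optional


def infer_backend(model_name: str) -> Optional[str]:
    """Return the <launcher> part of primus_pretrain/<launcher>_<arch>_..., if present."""
    prefix = "primus_pretrain/"
    if not model_name.startswith(prefix):
        return None
    remainder = model_name[len(prefix):]
    launcher, separator, _ = remainder.partition("_")
    # Require a non-empty launcher followed by an underscore
    return launcher if separator and launcher else None


backend = infer_backend("primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain")
no_match = infer_backend("my_model")
```

For the example name from the text, `backend` is "torchtitan"; names outside the primus_pretrain/ pattern yield None, which is the case where setting primus.backend explicitly would be needed.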
Features:
- Launcher sets PRIMUS_ROOT (default /workspace/Primus on K8s after init), PRIMUS_CONFIG_PATH, optional PRIMUS_CLI_EXTRA, and optional BACKEND (from primus.backend or inferred from primus_pretrain/... model names); no MAD_MULTI_NODE_RUNNER
- Model script (e.g. run.sh) sets EXP and calls Primus run_pretrain.sh
- NNODES, NODE_RANK, MASTER_ADDR, etc. set by madengine job template
- Use with MAD-Internal Primus submodule and scripts/primus_pretrain/run.sh
Container image: Prefer docker/primus.ubuntu.amd.Dockerfile with COPY scripts/Primus/ /workspace/Primus/ and PRIMUS_ROOT=/workspace/Primus. On Kubernetes, the Job’s emptyDir hides image files under /workspace; madengine bundles scripts/Primus/examples/... into the ConfigMap as Primus/examples/... so the init container recreates /workspace/Primus. run.sh resolves PRIMUS_ROOT in that order (see script comments).
Examples:
- SLURM: examples/slurm-configs/minimal/primus-minimal.json
- K8s: examples/k8s-configs/minimal/primus-minimal.json
- K8s (Primus vs upstream workload API, MaxText caveats, TorchTitan/Megatron/MaxText sample JSON): examples/k8s-configs/README.md, section Primus on Kubernetes
Purpose: High-throughput LLM inference serving
Reference: vllm-project/vllm
When to Use:
- ✅ LLM inference with high throughput
- ✅ Continuous batching
- ✅ PagedAttention for memory efficiency
Configuration:
{
"distributed": {
"launcher": "vllm",
"nnodes": 2,
"nproc_per_node": 4
}
}
Features:
- Continuous batching for high throughput
- PagedAttention memory optimization
- Tensor Parallelism support
- Ray for distributed coordination
- No torchrun needed (manages own processes)
Architecture:
- Single-node: TP across GPUs, no Ray
- Multi-node (K8s): Data Parallelism with independent replicas per pod
- Multi-node (SLURM): TP + PP with Ray cluster
Environment Variables:
VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_PIPELINE_PARALLEL_SIZE=1
VLLM_DISTRIBUTED_BACKEND="auto" # or "ray" for multi-node
Examples:
- K8s: examples/k8s-configs/minimal/vllm-single-node-minimal.json
- SLURM: examples/slurm-configs/minimal/vllm-single-node-minimal.json
Purpose: Fast LLM inference with structured generation
Reference: sgl-project/sglang
When to Use:
- ✅ Structured LLM generation
- ✅ Fast inference with caching
- ✅ OpenAI-compatible API
Configuration:
{
"distributed": {
"launcher": "sglang",
"nnodes": 2,
"nproc_per_node": 4
}
}
Features:
- Native launcher (sglang.launch_server)
- RadixAttention for prefix caching
- Tensor Parallelism
- Ray for distributed execution
- No torchrun needed
Architecture:
- Single-node: TP across GPUs
- Multi-node: Native multi-node support with Ray
Environment Variables:
SGLANG_TENSOR_PARALLEL_SIZE=4
SGLANG_PIPELINE_PARALLEL_SIZE=1
Examples:
- K8s: examples/k8s-configs/minimal/sglang-single-node-minimal.json
- SLURM: examples/slurm-configs/basic/07-sglang-single-node.json
Purpose: Large-scale disaggregated LLM inference with specialized prefill/decode clusters
Reference: sgl-project/sglang | Mooncake Framework
When to Use:
- ✅ Large-scale LLM inference (multi-node clusters)
- ✅ Optimized resource allocation (separate prefill/decode)
- ✅ High-throughput production deployments
- ✅ Workload-specific optimization (tune prefill/decode ratio)
Architecture:
SGLang Disaggregated separates inference into specialized node pools:
┌─────────────────────────────────────────────────┐
│ SGLang Disaggregated Cluster │
├─────────────────────────────────────────────────┤
│ Node 0: Proxy (Load Balancer) │
│ Nodes 1-P: Prefill Servers (~40%) │
│ Nodes P+1-N: Decode Servers (~60%) │
│ │
│ Communication: Mooncake (KV cache transfer) │
└─────────────────────────────────────────────────┘
Configuration:
{
"distributed": {
"launcher": "sglang-disagg",
"nnodes": 5,
"nproc_per_node": 8,
"sglang_disagg": {
"prefill_nodes": 2,
"decode_nodes": 2
}
}
}
Minimum Requirements:
- Nodes: Minimum 3 nodes (1 proxy + 1 prefill + 1 decode)
- GPUs: Minimum 1 GPU per node (for tensor parallelism)
- Network: High-speed interconnect (InfiniBand recommended for production)
Node Roles:
- Proxy Node (Rank 0): Load balancer, request router (mini_lb)
- Prefill Nodes: Process input prompts, generate KV cache
- Decode Nodes: Receive KV cache, generate output tokens
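The role layout in the diagram (rank 0 is the proxy, the next ranks are prefill servers, the rest are decode servers) can be expressed as a small lookup. This is a sketch of the described ordering only; `node_role` is an illustrative name, not a madengine or SGLang function.

```python
def node_role(rank: int, prefill_nodes: int, decode_nodes: int) -> str:
    """Map a node rank to its role: 0 = proxy, then prefill nodes, then decode nodes."""
    total_nodes = 1 + prefill_nodes + decode_nodes
    if not 0 <= rank < total_nodes:
        raise ValueError(f"rank {rank} is outside 0..{total_nodes - 1}")
    if rank == 0:
        return "proxy"
    if rank <= prefill_nodes:
        return "prefill"
    return "decode"


# The 5-node configuration above: 1 proxy + 2 prefill + 2 decode
roles = [node_role(rank, prefill_nodes=2, decode_nodes=2) for rank in range(5)]
```

For that configuration, `roles` is proxy, prefill, prefill, decode, decode in rank order.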
Automatic Split (Default):
- Targets a roughly 40/60 prefill/decode split of the non-proxy nodes
- Formula: prefill = max(1, (nnodes - 1) * 2 // 5); decode = nnodes - 1 - prefill
| Total Nodes | Proxy | Prefill (% of total) | Decode (% of total) |
|---|---|---|---|
| 3 | 1 | 1 (33%) | 1 (33%) |
| 5 | 1 | 1 (20%) | 3 (60%) |
| 7 | 1 | 2 (29%) | 4 (57%) |
| 11 | 1 | 4 (36%) | 6 (55%) |
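To see what the automatic split yields for a given cluster size, the formula can be evaluated directly. The sketch below applies the formula exactly as written above and assumes the remaining non-proxy nodes become decode nodes; `auto_split` is a hypothetical helper name.

```python
from typing import Tuple


def auto_split(nnodes: int) -> Tuple[int, int]:
    """Return (prefill_nodes, decode_nodes) using the documented formula."""
    if nnodes < 3:
        raise ValueError("requires minimum 3 nodes")
    workers = nnodes - 1                     # one node is reserved for the proxy
    prefill = max(1, workers * 2 // 5)       # integer (floor) division
    decode = workers - prefill
    return prefill, decode


splits = {n: auto_split(n) for n in (3, 5, 7, 11)}
```

Because of the floor division, small clusters lean toward decode: 5 nodes give 1 prefill and 3 decode, while 11 nodes give the 4/6 split that matches the 40/60 target exactly.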
Custom Split (NEW Feature!):
Override automatic split based on workload characteristics:
{
"distributed": {
"launcher": "sglang-disagg",
"nnodes": 7,
"nproc_per_node": 8,
"sglang_disagg": {
"prefill_nodes": 4,
"decode_nodes": 2
}
}
}
Custom Split Use Cases:
| Workload Type | Recommended Split | Example (7 nodes) |
|---|---|---|
| Long prompts (code gen) | 60% prefill | prefill: 4, decode: 2 |
| Long outputs (creative) | 30% prefill | prefill: 2, decode: 4 |
| Balanced (default) | 40% prefill | Omit sglang_disagg |
| Document processing | 50% prefill | prefill: 3, decode: 3 |
Validation Rules:
- prefill_nodes >= 1
- decode_nodes >= 1
- prefill_nodes + decode_nodes + 1 == nnodes
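A custom split can be checked against these rules before submitting a job. The following is a minimal sketch of such a pre-flight check; the function name and the message wording are illustrative and may differ from what madengine reports.

```python
from typing import List


def validate_split(nnodes: int, prefill_nodes: int, decode_nodes: int) -> List[str]:
    """Return a list of rule violations; an empty list means the split is valid."""
    errors = []
    if prefill_nodes < 1:
        errors.append("prefill_nodes must be >= 1")
    if decode_nodes < 1:
        errors.append("decode_nodes must be >= 1")
    if prefill_nodes + decode_nodes + 1 != nnodes:
        errors.append("prefill_nodes + decode_nodes + 1 must equal nnodes")
    return errors


ok = validate_split(nnodes=7, prefill_nodes=4, decode_nodes=2)   # 4 + 2 + 1 == 7
bad = validate_split(nnodes=7, prefill_nodes=4, decode_nodes=3)  # 4 + 3 + 1 == 8
```

The first call returns an empty list; the second reports the node-count mismatch.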
Features:
- Disaggregated prefill/decode architecture
- Mooncake framework for KV cache transfer
- Automatic or custom node role assignment
- RadixAttention for KV cache efficiency
- Ray cluster coordination
- No torchrun needed (manages own processes)
Environment Variables (K8s):
POD_INDEX=${JOB_COMPLETION_INDEX} # Pod index for role assignment
TOTAL_PODS=5 # Total number of pods
PREFILL_COUNT=2 # Number of prefill nodes
DECODE_COUNT=2 # Number of decode nodes
TP_SIZE=8 # Tensor parallel size
Environment Variables (SLURM):
SGLANG_DISAGG_MODE="enabled"
SGLANG_DISAGG_PREFILL_NODES=2
SGLANG_DISAGG_DECODE_NODES=2
SGLANG_DISAGG_TOTAL_NODES=5
SGLANG_TP_SIZE=8
SGLANG_NODE_RANK=${SLURM_PROCID}
SGLANG_NODE_IPS="10.0.0.1,10.0.0.2,..."
Examples:
- K8s Minimal: examples/k8s-configs/minimal/sglang-disagg-minimal.json
- K8s Basic: examples/k8s-configs/basic/sglang-disagg-multi-node-basic.json
- K8s Custom: examples/k8s-configs/basic/sglang-disagg-custom-split.json
- SLURM Minimal: examples/slurm-configs/minimal/sglang-disagg-minimal.json
- SLURM Basic: examples/slurm-configs/basic/sglang-disagg-multi-node.json
- SLURM Custom: examples/slurm-configs/basic/sglang-disagg-custom-split.json
Comparison: SGLang vs SGLang Disaggregated:
| Feature | SGLang | SGLang Disaggregated |
|---|---|---|
| Architecture | Unified | Separated prefill/decode |
| Min Nodes | 1 | 3 |
| Node Types | Same for all | Specialized (proxy/prefill/decode) |
| KV Transfer | In-memory | Mooncake framework |
| Load Balancer | Ray | mini_lb (dedicated) |
| Best For | General inference | Large-scale clusters |
| Optimization | General | Workload-specific tuning |
Production Considerations:
- Install Mooncake: Full framework with RDMA support
- Configure Network: InfiniBand/RoCE for high-speed KV transfer
- Setup etcd: For distributed coordination
- Monitor Metrics: Track prefill latency, decode throughput, queue depths
- Tune Split: Adjust prefill/decode ratio based on workload
Performance Tuning:
# Start with automatic split
madengine run --tags model --config minimal-config.json
# Monitor bottleneck (prefill latency vs decode throughput)
# If prefill is bottleneck → increase prefill nodes
# If decode is bottleneck → increase decode nodes
# Apply custom split
madengine run --tags model --config custom-split-config.json
Troubleshooting:
- "requires minimum 3 nodes"
  - Solution: Set nnodes >= 3
- "prefill_nodes + decode_nodes + 1 must equal nnodes"
  - Solution: Verify math in custom split configuration
- Pod/Node stuck in Init
  - K8s: Check headless service creation
  - SLURM: Verify node IP discovery
- High KV cache transfer latency
  - Enable RDMA/InfiniBand
  - Configure Mooncake transfer backend
  - Check network connectivity
| Feature | torchrun | DeepSpeed | Megatron-LM | TorchTitan | Primus |
|---|---|---|---|---|---|
| Data Parallel | ✅ DDP | ✅ ZeRO | ✅ | ✅ FSDP2 | via config |
| Tensor Parallel | ❌ | ❌ | ✅ | ✅ | via config |
| Pipeline Parallel | ❌ | ✅ | ✅ | ✅ | via config |
| Memory Efficiency | Medium | High (ZeRO) | High | Very High | config-driven |
| Ease of Use | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Model Size | Small-Medium | Medium-Large | Very Large | Very Large | config-driven |
| K8s Support | ✅ | ✅ | ✅ | ✅ | ✅ |
| SLURM Support | ✅ | ✅ | ✅ | ✅ | ✅ |
| Feature | vLLM | SGLang | SGLang Disaggregated |
|---|---|---|---|
| Throughput | Very High | High | Very High |
| Memory Efficiency | PagedAttention | RadixAttention | RadixAttention + Mooncake |
| Batching | Continuous | Continuous | Continuous |
| API | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible |
| Structured Gen | Limited | ✅ Native | ✅ Native |
| Multi-Node | ✅ Ray | ✅ Ray | ✅ Ray + mini_lb |
| Architecture | Unified | Unified | Disaggregated |
| Min Nodes | 1 | 1 | 3 |
| Specialization | ❌ | ❌ | ✅ Prefill/Decode |
| Custom Split | ❌ | ❌ | ✅ |
| K8s Support | ✅ | ✅ | ✅ |
| SLURM Support | ✅ | ✅ | ✅ |
Training Workloads:
Small models (< 1B) → torchrun
Medium models (1B-10B) → DeepSpeed or torchrun
Large models (10B-70B) → TorchTitan or Megatron-LM
Very large (70B+) → TorchTitan with full parallelism
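The training guidance above can be written as a simple lookup. This sketch assumes the size bands are half-open (for example, exactly 10B falls into the "large" band), which the guide does not specify; the function name is illustrative.

```python
from typing import List


def suggest_training_launchers(params_billion: float) -> List[str]:
    """Map a model size in billions of parameters to the launchers suggested above."""
    if params_billion < 1:
        return ["torchrun"]
    if params_billion < 10:
        return ["deepspeed", "torchrun"]
    if params_billion < 70:
        return ["torchtitan", "megatron"]
    return ["torchtitan"]  # very large models: TorchTitan with full parallelism


small = suggest_training_launchers(0.5)
medium = suggest_training_launchers(7)
very_large = suggest_training_launchers(405)
```

The returned strings are the values accepted by distributed.launcher, so the result can be dropped straight into a configuration file.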
Inference Workloads:
High throughput → vLLM or SGLang Disaggregated
Structured generation → SGLang or SGLang Disaggregated
Memory constrained → vLLM (PagedAttention)
Large-scale clusters (5+) → SGLang Disaggregated
Workload-specific tuning → SGLang Disaggregated
GPU Count Guidelines:
{
"k8s": {
"gpu_count": 8
},
"distributed": {
"nnodes": 4,
"nproc_per_node": 8
}
}
Here k8s.gpu_count matches nproc_per_node, and the job uses 4 × 8 = 32 GPUs in total.
Memory Recommendations:
- torchrun: 16GB per GPU minimum
- DeepSpeed: 32GB per GPU (ZeRO-3)
- TorchTitan: 64GB+ per GPU (large models)
- vLLM: 32GB per GPU (depends on model size)
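The GPU count guideline (k8s.gpu_count should equal nproc_per_node) is easy to check programmatically before a run. The sketch below assumes the configuration has already been loaded into a dictionary with the keys shown in the example; `total_gpus` is a hypothetical helper.

```python
def total_gpus(config: dict) -> int:
    """Check gpu_count against nproc_per_node and return the total GPU count."""
    distributed = config["distributed"]
    per_node = distributed["nproc_per_node"]
    gpu_count = config.get("k8s", {}).get("gpu_count")
    # Only K8s configurations carry gpu_count; skip the check when it is absent
    if gpu_count is not None and gpu_count != per_node:
        raise ValueError(
            f"k8s.gpu_count ({gpu_count}) must match nproc_per_node ({per_node})"
        )
    return distributed["nnodes"] * per_node


config = {
    "k8s": {"gpu_count": 8},
    "distributed": {"nnodes": 4, "nproc_per_node": 8},
}
gpus = total_gpus(config)
```

For the example configuration this yields 32 GPUs.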
Kubernetes:
- Automatic headless service creation
- Pod discovery via DNS
- Uses JOB_COMPLETION_INDEX for rank
SLURM:
- Uses SLURM environment variables
- Automatic node discovery
- Network interface configuration
NNODES=4 # Number of nodes
NPROC_PER_NODE=8 # GPUs per node
NODE_RANK=0 # Current node rank (0-based)
MASTER_ADDR=master.local # Master node address
MASTER_PORT=29500 # Master communication port
torchrun:
MAD_MULTI_NODE_RUNNER="torchrun --nnodes=4 --nproc_per_node=8 ..."
DeepSpeed:
MAD_MULTI_NODE_RUNNER="deepspeed --num_gpus=8 --hostfile=/tmp/hostfile ..."
Megatron-LM:
# Megatron-Core standard environment variables
TENSOR_MODEL_PARALLEL_SIZE=8 # Tensor parallelism size
PIPELINE_MODEL_PARALLEL_SIZE=4 # Pipeline parallelism size
CONTEXT_PARALLEL_SIZE=1 # Context parallelism size
MAD_MULTI_NODE_RUNNER="torchrun ..." # Uses torchrun (SLURM only)
TorchTitan:
TORCHTITAN_TENSOR_PARALLEL_SIZE=8
TORCHTITAN_PIPELINE_PARALLEL_SIZE=4
TORCHTITAN_FSDP_ENABLED=1
MAD_MULTI_NODE_RUNNER="torchrun ..."
Primus:
PRIMUS_ROOT=/workspace/Primus # default; K8s overlays Primus/examples from ConfigMap here
PRIMUS_CONFIG_PATH="examples/megatron/configs/MI300X/..."
PRIMUS_CLI_EXTRA="" # optional
BACKEND="megatron" # optional; from distributed.primus.backend or model-name inference
# No MAD_MULTI_NODE_RUNNER (model script calls Primus run_pretrain.sh)
vLLM:
VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_DISTRIBUTED_BACKEND="ray"
# No MAD_MULTI_NODE_RUNNER (vLLM manages processes)
SGLang:
SGLANG_TENSOR_PARALLEL_SIZE=4
NCCL_INIT_ADDR="master:29500"
# No MAD_MULTI_NODE_RUNNER (SGLang manages processes)
SGLang Disaggregated:
SGLANG_DISAGG_MODE="enabled"
SGLANG_DISAGG_PREFILL_NODES=2
SGLANG_DISAGG_DECODE_NODES=2
SGLANG_DISAGG_TOTAL_NODES=5
SGLANG_TP_SIZE=8
SGLANG_NODE_RANK=${SLURM_PROCID}
# No MAD_MULTI_NODE_RUNNER (SGLang disagg manages processes)
1. Launcher Not Found
Error: Unknown launcher type 'xyz'
Solution: Use one of: torchrun, deepspeed, megatron, torchtitan, primus, vllm, sglang, sglang-disagg
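A typo in distributed.launcher can be caught before building by checking the value against the supported list. This is a sketch of such a pre-flight check, assuming the configuration is already loaded as a dictionary; the helper name is illustrative, and the error text simply echoes the message shown above.

```python
SUPPORTED_LAUNCHERS = (
    "torchrun", "deepspeed", "megatron", "torchtitan",
    "primus", "vllm", "sglang", "sglang-disagg",
)


def check_launcher(config: dict) -> str:
    """Return the configured launcher, or raise if it is not in the supported list."""
    launcher = config.get("distributed", {}).get("launcher")
    if launcher not in SUPPORTED_LAUNCHERS:
        raise ValueError(
            f"Unknown launcher type {launcher!r}; "
            f"use one of: {', '.join(SUPPORTED_LAUNCHERS)}"
        )
    return launcher


launcher = check_launcher({"distributed": {"launcher": "sglang-disagg"}})
```

Note that the names are exact: "megatron" rather than "megatron-lm", and "sglang-disagg" with a hyphen.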
2. Multi-Node Communication Fails
Error: Connection timeout to master node
Solutions:
- Check network connectivity between nodes
- Verify MASTER_ADDR is correct
- Ensure firewall allows MASTER_PORT
- For K8s: Check headless service created
3. GPU Visibility Issues
Error: Expected 8 GPUs but found 0
Solutions:
- Verify gpu_count matches nproc_per_node
- Check GPU resource name (amd.com/gpu vs nvidia.com/gpu)
- Ensure ROCm/CUDA drivers installed
4. Ray Cluster Issues (vLLM/SGLang)
Error: Ray cluster failed to start
Solutions:
- Clean existing Ray processes: ray stop --force
- Check port 6379 is available
- Verify network interface configuration
- For multi-node: ensure pods can communicate
madengine provides $MAD_MULTI_NODE_RUNNER for launchers that are started through a generated command (torchrun, DeepSpeed, Megatron-LM, TorchTitan):
#!/bin/bash
# Your model script
# For torchrun/deepspeed/megatron/torchtitan
$MAD_MULTI_NODE_RUNNER your_training_script.py --args
# For primus (no MAD_MULTI_NODE_RUNNER; use run.sh → run_pretrain.sh; PRIMUS_* / BACKEND set by madengine)
# For vLLM/sglang (no MAD_MULTI_NODE_RUNNER)
python your_inference_script.py --args
madengine automatically:
- Detects launcher from distributed.launcher field
- Sets up appropriate environment variables
- Generates launcher-specific commands
- Creates multi-node infrastructure (K8s services, SLURM env)
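To make the "generates launcher-specific commands" step concrete for the torchrun case, the sketch below assembles a runner string from the common environment variables. It uses standard torchrun flags, but the exact command madengine generates is not shown in this guide and may differ; `build_torchrun_command` is a hypothetical helper.

```python
import os
import shlex
from typing import Mapping, Optional


def build_torchrun_command(env: Optional[Mapping[str, str]] = None) -> str:
    """Assemble a torchrun prefix from NNODES, NPROC_PER_NODE, NODE_RANK, MASTER_*."""
    env = os.environ if env is None else env
    args = [
        "torchrun",
        f"--nnodes={env['NNODES']}",
        f"--nproc_per_node={env['NPROC_PER_NODE']}",
        f"--node_rank={env['NODE_RANK']}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env['MASTER_PORT']}",
    ]
    # Quote each argument so the result is safe to paste into a shell script
    return " ".join(shlex.quote(arg) for arg in args)


example_env = {
    "NNODES": "4",
    "NPROC_PER_NODE": "8",
    "NODE_RANK": "0",
    "MASTER_ADDR": "master.local",
    "MASTER_PORT": "29500",
}
runner = build_torchrun_command(example_env)
```

The resulting string plays the same role as $MAD_MULTI_NODE_RUNNER in the script above: it is placed in front of the training script and its arguments.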
AMD MI300X:
{
"context": {
"env_vars": {
"PYTORCH_TUNABLEOP_ENABLED": "1",
"NCCL_IB_DISABLE": "0",
"NCCL_NET_GDR_LEVEL": "5"
}
}
}
NVIDIA H100/A100:
{
"context": {
"env_vars": {
"NCCL_ALGO": "Ring",
"NCCL_PROTO": "Simple",
"CUDA_DEVICE_MAX_CONNECTIONS": "1"
}
}
}
- Deployment Guide - Kubernetes and SLURM deployment
- K8s Configuration Guide
- SLURM Configuration Guide