Distributed Launchers Guide

Complete reference for all distributed execution launchers supported by madengine.

Overview

madengine provides unified support for multiple distributed frameworks, enabling seamless execution across training and inference workloads on both Kubernetes and SLURM clusters.

Supported Launchers

Launcher	Type	Use Case	K8s	SLURM	Multi-Node
torchrun	Training	PyTorch DDP/FSDP training	✅	✅	✅
DeepSpeed	Training	ZeRO optimization training	✅	✅	✅
Megatron-LM	Training	Large-scale transformer training	✅	✅	✅
TorchTitan	Training	LLM pre-training (FSDP2+TP+PP)	✅	✅	✅
Primus	Training	Megatron/TorchTitan/Jax via Primus config	✅	✅	✅
vLLM	Inference	High-throughput LLM serving	✅	✅	✅
SGLang	Inference	Fast LLM inference	✅	✅	✅
SGLang Disaggregated	Inference	Large-scale disaggregated inference	✅	✅	✅ (min 3)

Quick Start

Basic Configuration

{
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 2,
    "nproc_per_node": 8
  }
}

Deployment

# Build with configuration
madengine build --tags my_model \
  --additional-context-file config.json

# Deploy to K8s or SLURM
madengine run --manifest-file build_manifest.json

Launcher Details

1. torchrun (PyTorch Distributed)

Purpose: Standard PyTorch distributed training with DDP/FSDP

When to Use:

✅ Multi-GPU/multi-node PyTorch training
✅ Data Parallel or Fully Sharded Data Parallel
✅ Standard distributed training patterns

Configuration:

{
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 2,
    "nproc_per_node": 8,
    "master_port": 29500
  }
}

Features:

Automatic rank assignment
NCCL backend for GPU communication
Elastic training support
Compatible with all PyTorch models

Examples:

K8s: examples/k8s-configs/minimal/torchrun-multi-gpu-minimal.json
SLURM: examples/slurm-configs/minimal/torchrun-multi-node-minimal.json

2. DeepSpeed

Purpose: Memory-efficient training with ZeRO optimization

When to Use:

✅ Large models that don't fit in GPU memory
✅ ZeRO optimization stages (ZeRO-1, ZeRO-2, ZeRO-3)
✅ Gradient accumulation and mixed precision

Configuration:

{
  "distributed": {
    "launcher": "deepspeed",
    "nnodes": 4,
    "nproc_per_node": 8
  }
}

Features:

ZeRO memory optimization
Pipeline parallelism
Gradient accumulation
Mixed precision training
Automatic hostfile generation (K8s)

Architecture:

Uses its own launcher (not torchrun)
Manages process spawning internally
Requires DeepSpeed config file in model script

Examples:

SLURM: examples/slurm-configs/basic/04-multi-node-advanced.json

3. Megatron-LM

Purpose: Large-scale transformer model training

When to Use:

✅ GPT, BERT, T5 style transformers
✅ Tensor and pipeline parallelism
✅ Very large models (70B+ parameters)

Configuration:

{
  "distributed": {
    "launcher": "megatron",
    "nnodes": 4,
    "nproc_per_node": 8
  }
}

Features:

Tensor parallelism across GPUs
Pipeline parallelism across nodes
Optimized for transformer architectures
Built on top of torchrun
Automatic TP/PP size configuration

Availability:

✅ K8s: Fully supported (dedicated launcher)
✅ SLURM: Fully supported

Examples:

K8s: examples/k8s-configs/minimal/megatron-lm-minimal.json
K8s Multi-node: examples/k8s-configs/basic/megatron-lm-multi-node-basic.json
SLURM: examples/slurm-configs/minimal/megatron-lm-minimal.json
SLURM Multi-node: examples/slurm-configs/basic/09-megatron-lm-multi-node.json

Environment Variables (automatically set by launcher):

# Megatron-Core standard variables
TENSOR_MODEL_PARALLEL_SIZE    # Tensor parallelism (GPUs per node)
PIPELINE_MODEL_PARALLEL_SIZE  # Pipeline parallelism (typically = nnodes)
CONTEXT_PARALLEL_SIZE         # Context parallelism (default: 1)

Note: The launcher automatically configures:

Single-node: TP only (PP=1)
Multi-node: TP across GPUs + PP across nodes

4. TorchTitan

Purpose: Production LLM pre-training with multi-dimensional parallelism

Reference: pytorch/torchtitan

When to Use:

✅ Llama 3.1 (8B to 405B) pre-training
✅ Multi-dimensional parallelism (FSDP2 + TP + PP + CP)
✅ Production-scale LLM training

Configuration:

{
  "distributed": {
    "launcher": "torchtitan",
    "nnodes": 4,
    "nproc_per_node": 8
  }
}

Parallelism Strategies:

FSDP2: Fully Sharded Data Parallel v2 for parameter sharding
TP: Tensor Parallel - split model layers across GPUs
PP: Pipeline Parallel - split model stages across nodes
CP: Context Parallel - distributed context processing

Features:

Uses torchrun as underlying launcher
Configured via TOML files
Automatic parallelism detection
Float8 and MXFP8 support
Gradient accumulation
Distributed checkpointing

Environment Variables:

TORCHTITAN_TENSOR_PARALLEL_SIZE=8
TORCHTITAN_PIPELINE_PARALLEL_SIZE=4
TORCHTITAN_FSDP_ENABLED=1
TORCHTITAN_CONTEXT_PARALLEL_SIZE=1

Single vs Multi-Node:

Single-node: TP only across GPUs
Multi-node: TP + PP + FSDP2 combined

Examples:

K8s: examples/k8s-configs/minimal/torchtitan-single-node-minimal.json
SLURM: examples/slurm-configs/minimal/torchtitan-single-node-minimal.json

Model Configuration (TOML):

[model]
name = "llama3"
flavor = "8B"

[training]
tensor_parallel_degree = 8
pipeline_parallel_degree = 1
batch_size = 1
seq_len = 8192

5. Primus

Purpose: Unified pretrain entry for Megatron-LM, TorchTitan, and Jax/MaxText via Primus experiment YAML.

When to Use:

Run Primus example configs (e.g. examples/megatron/configs/MI300X/*.yaml) via madengine
Single image plus config path; scheduling and tools/metrics from madengine

Configuration:

{
  "distributed": {
    "launcher": "primus",
    "nnodes": 2,
    "nproc_per_node": 8,
    "primus": {
      "config_path": "examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml",
      "cli_extra": "",
      "backend": "megatron"
    }
  }
}

Optional primus.backend (e.g. MaxText, megatron) emits export BACKEND=... before your model script when path-based detection is not enough. If omitted, madengine infers BACKEND from the model name when it follows primus_pretrain/<launcher>_<arch>_... (e.g. primus_pretrain/torchtitan_MI300X_qwen3_4B-pretrain → torchtitan), matching scripts/primus_pretrain/get_models_json.py in MAD-internal.

Features:

Launcher sets PRIMUS_ROOT (default /workspace/Primus on K8s after init), PRIMUS_CONFIG_PATH, optional PRIMUS_CLI_EXTRA, and optional BACKEND (from primus.backend or inferred from primus_pretrain/... model names); no MAD_MULTI_NODE_RUNNER
Model script (e.g. run.sh) sets EXP and calls Primus run_pretrain.sh
NNODES, NODE_RANK, MASTER_ADDR, etc. set by madengine job template
Use with MAD-Internal Primus submodule and scripts/primus_pretrain/run.sh

Container image: Prefer docker/primus.ubuntu.amd.Dockerfile with COPY scripts/Primus/ /workspace/Primus/ and PRIMUS_ROOT=/workspace/Primus. On Kubernetes, the Job’s emptyDir hides image files under /workspace; madengine bundles scripts/Primus/examples/... into the ConfigMap as Primus/examples/... so the init container recreates /workspace/Primus. run.sh resolves PRIMUS_ROOT in that order (see script comments).

Examples:

SLURM: examples/slurm-configs/minimal/primus-minimal.json
K8s: examples/k8s-configs/minimal/primus-minimal.json
K8s (Primus vs upstream workload API, MaxText caveats, TorchTitan/Megatron/MaxText sample JSON): examples/k8s-configs/README.md section Primus on Kubernetes

6. vLLM

Purpose: High-throughput LLM inference serving

Reference: vllm-project/vllm

When to Use:

✅ LLM inference with high throughput
✅ Continuous batching
✅ PagedAttention for memory efficiency

Configuration:

{
  "distributed": {
    "launcher": "vllm",
    "nnodes": 2,
    "nproc_per_node": 4
  }
}

Features:

Continuous batching for high throughput
PagedAttention memory optimization
Tensor Parallelism support
Ray for distributed coordination
No torchrun needed (manages own processes)

Architecture:

Single-node: TP across GPUs, no Ray
Multi-node (K8s): Data Parallelism with independent replicas per pod
Multi-node (SLURM): TP + PP with Ray cluster

Environment Variables:

VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_PIPELINE_PARALLEL_SIZE=1
VLLM_DISTRIBUTED_BACKEND="auto"  # or "ray" for multi-node

Examples:

K8s: examples/k8s-configs/minimal/vllm-single-node-minimal.json
SLURM: examples/slurm-configs/minimal/vllm-single-node-minimal.json

7. SGLang

Purpose: Fast LLM inference with structured generation

Reference: sgl-project/sglang

When to Use:

✅ Structured LLM generation
✅ Fast inference with caching
✅ OpenAI-compatible API

Configuration:

{
  "distributed": {
    "launcher": "sglang",
    "nnodes": 2,
    "nproc_per_node": 4
  }
}

Features:

Native launcher (sglang.launch_server)
RadixAttention for prefix caching
Tensor Parallelism
Ray for distributed execution
No torchrun needed

Architecture:

Single-node: TP across GPUs
Multi-node: Native multi-node support with Ray

Environment Variables:

SGLANG_TENSOR_PARALLEL_SIZE=4
SGLANG_PIPELINE_PARALLEL_SIZE=1

Examples:

K8s: examples/k8s-configs/minimal/sglang-single-node-minimal.json
SLURM: examples/slurm-configs/basic/07-sglang-single-node.json

8. SGLang Disaggregated (NEW!)

Purpose: Large-scale disaggregated LLM inference with specialized prefill/decode clusters

Reference: sgl-project/sglang | Mooncake Framework

When to Use:

✅ Large-scale LLM inference (multi-node clusters)
✅ Optimized resource allocation (separate prefill/decode)
✅ High-throughput production deployments
✅ Workload-specific optimization (tune prefill/decode ratio)

Architecture:

SGLang Disaggregated separates inference into specialized node pools:

┌─────────────────────────────────────────────────┐
│         SGLang Disaggregated Cluster            │
├─────────────────────────────────────────────────┤
│  Node 0:        Proxy (Load Balancer)           │
│  Nodes 1-P:     Prefill Servers (~40%)          │
│  Nodes P+1-N:   Decode Servers (~60%)           │
│                                                 │
│  Communication: Mooncake (KV cache transfer)    │
└─────────────────────────────────────────────────┘

Configuration:

{
  "distributed": {
    "launcher": "sglang-disagg",
    "nnodes": 5,
    "nproc_per_node": 8,
    "sglang_disagg": {
      "prefill_nodes": 2,
      "decode_nodes": 2
    }
  }
}

Minimum Requirements:

Nodes: Minimum 3 nodes (1 proxy + 1 prefill + 1 decode)
GPUs: Minimum 1 GPU per node (for tensor parallelism)
Network: High-speed interconnect (InfiniBand recommended for production)

Node Roles:

Proxy Node (Rank 0): Load balancer, request router (mini_lb)
Prefill Nodes: Process input prompts, generate KV cache
Decode Nodes: Receive KV cache, generate output tokens

Automatic Split (Default):

Uses 40/60 golden ratio for prefill/decode
Formula: prefill = max(1, (nnodes - 1) * 2 // 5)

Total Nodes	Proxy	Prefill	Decode
3	1	1 (33%)	1 (33%)
5	1	2 (40%)	2 (40%)
7	1	2 (29%)	4 (57%)
11	1	4 (40%)	6 (60%)

Custom Split (NEW Feature!):

Override automatic split based on workload characteristics:

{
  "distributed": {
    "launcher": "sglang-disagg",
    "nnodes": 7,
    "nproc_per_node": 8,
    "sglang_disagg": {
      "prefill_nodes": 4,
      "decode_nodes": 2
    }
  }
}

Custom Split Use Cases:

Workload Type	Recommended Split	Example (7 nodes)
Long prompts (code gen)	60% prefill	`prefill: 4, decode: 2`
Long outputs (creative)	30% prefill	`prefill: 2, decode: 4`
Balanced (default)	40% prefill	Omit sglang_disagg
Document processing	50% prefill	`prefill: 3, decode: 3`

Validation Rules:

prefill_nodes >= 1
decode_nodes >= 1
prefill_nodes + decode_nodes + 1 == nnodes

Features:

Disaggregated prefill/decode architecture
Mooncake framework for KV cache transfer
Automatic or custom node role assignment
RadixAttention for KV cache efficiency
Ray cluster coordination
No torchrun needed (manages own processes)

Environment Variables (K8s):

POD_INDEX=${JOB_COMPLETION_INDEX}  # Pod index for role assignment
TOTAL_PODS=5                        # Total number of pods
PREFILL_COUNT=2                     # Number of prefill nodes
DECODE_COUNT=2                      # Number of decode nodes
TP_SIZE=8                           # Tensor parallel size

Environment Variables (SLURM):

SGLANG_DISAGG_MODE="enabled"
SGLANG_DISAGG_PREFILL_NODES=2
SGLANG_DISAGG_DECODE_NODES=2
SGLANG_DISAGG_TOTAL_NODES=5
SGLANG_TP_SIZE=8
SGLANG_NODE_RANK=${SLURM_PROCID}
SGLANG_NODE_IPS="10.0.0.1,10.0.0.2,..."

Examples:

K8s Minimal: examples/k8s-configs/minimal/sglang-disagg-minimal.json
K8s Basic: examples/k8s-configs/basic/sglang-disagg-multi-node-basic.json
K8s Custom: examples/k8s-configs/basic/sglang-disagg-custom-split.json
SLURM Minimal: examples/slurm-configs/minimal/sglang-disagg-minimal.json
SLURM Basic: examples/slurm-configs/basic/sglang-disagg-multi-node.json
SLURM Custom: examples/slurm-configs/basic/sglang-disagg-custom-split.json

Comparison: SGLang vs SGLang Disaggregated:

Feature	SGLang	SGLang Disaggregated
Architecture	Unified	Separated prefill/decode
Min Nodes	1	3
Node Types	Same for all	Specialized (proxy/prefill/decode)
KV Transfer	In-memory	Mooncake framework
Load Balancer	Ray	mini_lb (dedicated)
Best For	General inference	Large-scale clusters
Optimization	General	Workload-specific tuning

Production Considerations:

Install Mooncake: Full framework with RDMA support
Configure Network: InfiniBand/RoCE for high-speed KV transfer
Setup etcd: For distributed coordination
Monitor Metrics: Track prefill latency, decode throughput, queue depths
Tune Split: Adjust prefill/decode ratio based on workload

Performance Tuning:

# Start with automatic split
madengine run --tags model --config minimal-config.json

# Monitor bottleneck (prefill latency vs decode throughput)
# If prefill is bottleneck → increase prefill nodes
# If decode is bottleneck → increase decode nodes

# Apply custom split
madengine run --tags model --config custom-split-config.json

Troubleshooting:

"requires minimum 3 nodes"
- Solution: Set nnodes >= 3
"prefill_nodes + decode_nodes + 1 must equal nnodes"
- Solution: Verify math in custom split configuration
Pod/Node stuck in Init
- K8s: Check headless service creation
- SLURM: Verify node IP discovery
High KV cache transfer latency
- Enable RDMA/InfiniBand
- Configure Mooncake transfer backend
- Check network connectivity

Comparison Matrix

Training Launchers

Feature	torchrun	DeepSpeed	Megatron-LM	TorchTitan	Primus
Data Parallel	✅ DDP	✅ ZeRO	✅	✅ FSDP2	via config
Tensor Parallel	❌	❌	✅	✅	via config
Pipeline Parallel	❌	✅	✅	✅	via config
Memory Efficiency	Medium	High (ZeRO)	High	Very High	config-driven
Ease of Use	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐	⭐⭐⭐⭐	⭐⭐⭐
Model Size	Small-Medium	Medium-Large	Very Large	Very Large	config-driven
K8s Support	✅	✅	✅	✅	✅
SLURM Support	✅	✅	✅	✅	✅

Inference Launchers

Feature	vLLM	SGLang	SGLang Disaggregated
Throughput	Very High	High	Very High
Memory Efficiency	PagedAttention	RadixAttention	RadixAttention + Mooncake
Batching	Continuous	Continuous	Continuous
API	OpenAI-compatible	OpenAI-compatible	OpenAI-compatible
Structured Gen	Limited	✅ Native	✅ Native
Multi-Node	✅ Ray	✅ Ray	✅ Ray + mini_lb
Architecture	Unified	Unified	Disaggregated
Min Nodes	1	1	3
Specialization	❌	❌	✅ Prefill/Decode
Custom Split	❌	❌	✅
K8s Support	✅	✅	✅
SLURM Support	✅	✅	✅

Configuration Best Practices

1. Launcher Selection

Training Workloads:

Small models (< 1B)        → torchrun
Medium models (1B-10B)     → DeepSpeed or torchrun
Large models (10B-70B)     → TorchTitan or Megatron-LM
Very large (70B+)          → TorchTitan with full parallelism

Inference Workloads:

High throughput            → vLLM or SGLang Disaggregated
Structured generation      → SGLang or SGLang Disaggregated
Memory constrained         → vLLM (PagedAttention)
Large-scale clusters (5+)  → SGLang Disaggregated
Workload-specific tuning   → SGLang Disaggregated

2. Resource Allocation

GPU Count Guidelines:

{
  "k8s": {
    "gpu_count": 8  // Matches nproc_per_node
  },
  "distributed": {
    "nnodes": 4,
    "nproc_per_node": 8  // Total: 32 GPUs
  }
}

Memory Recommendations:

torchrun: 16GB per GPU minimum
DeepSpeed: 32GB per GPU (ZeRO-3)
TorchTitan: 64GB+ per GPU (large models)
vLLM: 32GB per GPU (depends on model size)

3. Multi-Node Setup

Kubernetes:

Automatic headless service creation
Pod discovery via DNS
Uses JOB_COMPLETION_INDEX for rank

SLURM:

Uses SLURM environment variables
Automatic node discovery
Network interface configuration

Environment Variables

Common Variables (All Launchers)

NNODES=4                    # Number of nodes
NPROC_PER_NODE=8           # GPUs per node
NODE_RANK=0                 # Current node rank (0-based)
MASTER_ADDR=master.local    # Master node address
MASTER_PORT=29500           # Master communication port

Launcher-Specific

torchrun:

MAD_MULTI_NODE_RUNNER="torchrun --nnodes=4 --nproc_per_node=8 ..."

DeepSpeed:

MAD_MULTI_NODE_RUNNER="deepspeed --num_gpus=8 --hostfile=/tmp/hostfile ..."

Megatron-LM:

# Megatron-Core standard environment variables
TENSOR_MODEL_PARALLEL_SIZE=8         # Tensor parallelism size
PIPELINE_MODEL_PARALLEL_SIZE=4       # Pipeline parallelism size
CONTEXT_PARALLEL_SIZE=1              # Context parallelism size
MAD_MULTI_NODE_RUNNER="torchrun ..."  # Uses torchrun (SLURM only)

TorchTitan:

TORCHTITAN_TENSOR_PARALLEL_SIZE=8
TORCHTITAN_PIPELINE_PARALLEL_SIZE=4
TORCHTITAN_FSDP_ENABLED=1
MAD_MULTI_NODE_RUNNER="torchrun ..."

Primus:

PRIMUS_ROOT=/workspace/Primus          # default; K8s overlays Primus/examples from ConfigMap here
PRIMUS_CONFIG_PATH="examples/megatron/configs/MI300X/..."
PRIMUS_CLI_EXTRA=""                   # optional
BACKEND="megatron"                    # optional; from distributed.primus.backend or model-name inference
# No MAD_MULTI_NODE_RUNNER (model script calls Primus run_pretrain.sh)

vLLM:

VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_DISTRIBUTED_BACKEND="ray"
# No MAD_MULTI_NODE_RUNNER (vLLM manages processes)

SGLang:

SGLANG_TENSOR_PARALLEL_SIZE=4
NCCL_INIT_ADDR="master:29500"
# No MAD_MULTI_NODE_RUNNER (SGLang manages processes)

SGLang Disaggregated:

SGLANG_DISAGG_MODE="enabled"
SGLANG_DISAGG_PREFILL_NODES=2
SGLANG_DISAGG_DECODE_NODES=2
SGLANG_DISAGG_TOTAL_NODES=5
SGLANG_TP_SIZE=8
SGLANG_NODE_RANK=${SLURM_PROCID}
# No MAD_MULTI_NODE_RUNNER (SGLang disagg manages processes)

Troubleshooting

Common Issues

1. Launcher Not Found

Error: Unknown launcher type 'xyz'

Solution: Use one of: torchrun, deepspeed, megatron, torchtitan, primus, vllm, sglang, sglang-disagg

2. Multi-Node Communication Fails

Error: Connection timeout to master node

Solutions:

Check network connectivity between nodes
Verify MASTER_ADDR is correct
Ensure firewall allows MASTER_PORT
For K8s: Check headless service created

3. GPU Visibility Issues

Error: Expected 8 GPUs but found 0

Solutions:

Verify gpu_count matches nproc_per_node
Check GPU resource name (amd.com/gpu vs nvidia.com/gpu)
Ensure ROCm/CUDA drivers installed

4. Ray Cluster Issues (vLLM/SGLang)

Error: Ray cluster failed to start

Solutions:

Clean existing Ray processes: ray stop --force
Check port 6379 is available
Verify network interface configuration
For multi-node: ensure pods can communicate

Advanced Topics

Custom Launcher Scripts

madengine provides $MAD_MULTI_NODE_RUNNER for frameworks that use torchrun:

#!/bin/bash
# Your model script

# For torchrun/deepspeed/megatron/torchtitan
$MAD_MULTI_NODE_RUNNER your_training_script.py --args

# For primus (no MAD_MULTI_NODE_RUNNER; use run.sh → run_pretrain.sh; PRIMUS_* / BACKEND set by madengine)

# For vLLM/sglang (no MAD_MULTI_NODE_RUNNER)
python your_inference_script.py --args

Launcher Detection

madengine automatically:

Detects launcher from distributed.launcher field
Sets up appropriate environment variables
Generates launcher-specific commands
Creates multi-node infrastructure (K8s services, SLURM env)

Performance Optimization

AMD MI300X:

{
  "context": {
    "env_vars": {
      "PYTORCH_TUNABLEOP_ENABLED": "1",
      "NCCL_IB_DISABLE": "0",
      "NCCL_NET_GDR_LEVEL": "5"
    }
  }
}

NVIDIA H100/A100:

{
  "context": {
    "env_vars": {
      "NCCL_ALGO": "Ring",
      "NCCL_PROTO": "Simple",
      "CUDA_DEVICE_MAX_CONNECTIONS": "1"
    }
  }
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distributed Launchers Guide

Overview

Supported Launchers

Quick Start

Basic Configuration

Deployment

Launcher Details

1. torchrun (PyTorch Distributed)

2. DeepSpeed

3. Megatron-LM

4. TorchTitan

5. Primus

6. vLLM

7. SGLang

8. SGLang Disaggregated (NEW!)

Comparison Matrix

Training Launchers

Inference Launchers

Configuration Best Practices

1. Launcher Selection

2. Resource Allocation

3. Multi-Node Setup

Environment Variables

Common Variables (All Launchers)

Launcher-Specific

Troubleshooting

Common Issues

Advanced Topics

Custom Launcher Scripts

Launcher Detection

Performance Optimization

References

Official Documentation

madengine Documentation

Example Configurations

FilesExpand file tree

launchers.md

Latest commit

History

launchers.md

File metadata and controls

Distributed Launchers Guide

Overview

Supported Launchers

Quick Start

Basic Configuration

Deployment

Launcher Details

1. torchrun (PyTorch Distributed)

2. DeepSpeed

3. Megatron-LM

4. TorchTitan

5. Primus

6. vLLM

7. SGLang

8. SGLang Disaggregated (NEW!)

Comparison Matrix

Training Launchers

Inference Launchers

Configuration Best Practices

1. Launcher Selection

2. Resource Allocation

3. Multi-Node Setup

Environment Variables

Common Variables (All Launchers)

Launcher-Specific

Troubleshooting

Common Issues

Advanced Topics

Custom Launcher Scripts

Launcher Detection

Performance Optimization

References

Official Documentation

madengine Documentation

Example Configurations