PrimeIntellect-ai · mikasenghaas · May 22, 2026
diff --git a/README.md b/README.md
@@ -52,7 +52,7 @@ With `[model] impl = "auto"` (the default), the trainer selects that custom stac
 | GLM-5 (`glm_moe_dsa`) | `zai-org/GLM-5`, `zai-org/GLM-5-FP8` | yes | ✅ | ✅ |
 | Qwen3 MoE (`qwen3_moe`) | `Qwen/Qwen3-30B-A3B`, … | yes | ✅ | ✅ |
 | Qwen3.5 MoE (`qwen3_5_moe`) | `Qwen/Qwen3.5-35B-A3B`, … | yes | ✅ | ✅ |
-| Qwen3 / Qwen3.5 VLMs | [multimodal.md](docs/multimodal.md) (`qwen3_vl`, `qwen3_5`, `qwen3_5_moe`) | MoE only on MoE VLMs | MoE only | ✅ |
+| Qwen3 / Qwen3.5 VLMs | see [advanced.md](docs/advanced.md#vision-language-models) (`qwen3_vl`, `qwen3_5`, `qwen3_5_moe`) | MoE only on MoE VLMs | MoE only | ✅ |
 | Poolside Laguna (`laguna`) | `poolside/Laguna-XS.2` | yes | ✅ | ✅ |
 | MiniMax M2 (`minimax_m2`) | `MiniMax/MiniMax-M2` | yes | ✅ | ✅ |
 | Nemotron H (`nemotron_h`) | `nvidia/Nemotron-3-Nano-30B-A3B`, `nvidia/Nemotron-3-Super-120B-A12B`, … | yes | ✅ | ❌ |
@@ -217,17 +217,13 @@ These guides are designed to be run from a Slurm cluster but can also be adapted
 
 Check out the [docs](docs) directory for in-depth guides on how to use PRIME-RL.
 
-- [**Entrypoints**](docs/entrypoints.md) - Overview of the main components (orchestrator, trainer, inference) and how to run SFT, RL, and evals
-- [**Configs**](docs/configs.md) - Configuration system using TOML files, CLI arguments, and environment variables
-- [**Environments**](docs/environments.md) - Installing and using verifiers environments from the Environments Hub
-- [**Async Training**](docs/async.md) - Understanding asynchronous off-policy training and step semantics
-- [**Logging**](docs/logging.md) - Logging with loguru, torchrun, and Weights & Biases
-- [**Checkpointing**](docs/checkpointing.md) - Saving and resuming training from checkpoints
-- [**Benchmarking**](docs/benchmarking.md) - Performance benchmarking and throughput measurement
-- [**Deployment**](docs/deployment.md) - Training deployment on single-GPU, multi-GPU, and multi-node clusters
-- [**Memory Usage**](docs/memory_usage.md) - Techniques for reducing memory usage (activation checkpointing, offloading, EP, CP, LoRA, etc.)
-- [**Troubleshooting**](docs/troubleshooting.md) - Common issues and their solutions
-- [**Multimodal**](docs/multimodal.md) - Training VLMs like Qwen3-VL
+- [**Overview**](docs/overview.md) - Architecture, install, and a copy-pasteable end-to-end RL run
+- [**Configuration**](docs/configuration.md) - TOML composition, CLI overrides, env vars, validation
+- [**Training**](docs/training.md) - RL, SFT, evals, checkpointing, observability, rules of thumb
+- [**Scaling**](docs/scaling.md) - Single-GPU through multi-node, FSDP/EP/CP, SLURM, benchmarking
+- [**Algorithms**](docs/algorithms.md) - Async/off-policy training, the AIPO loss, advantage and filter plugins, trajectory merging
+- [**Advanced**](docs/advanced.md) - Custom modeling, multimodal training, LoRA, multi-tenant training
+- [**Development**](docs/development.md) - Test suite, pre-commit hooks, adding a new model
 
 ## Contributing
 
@@ -249,28 +245,11 @@ uv run pre-commit install
 
 ### Tests
 
-Run the full test suite 
-
-```bash
-uv run pytest -v
-```
-
-To run unit tests, run
-
-```bash
-uv run pytest tests/unit -v
-```
-
-To run integration tests, run
-
-```bash
-uv run pytest tests/integration -v
-```
-
-To run CPU-only tests, use the inverse of the `gpu` marker:
-
 ```bash
-uv run pytest -v -m "not gpu"
+uv run pytest -v                    # everything
+uv run pytest tests/unit -v         # unit only
+uv run pytest tests/integration -v  # integration only
+uv run pytest -v -m "not gpu"       # CPU-only (inverse of the gpu marker)
 ```
 
 ## License

diff --git a/configs/debug/training_modes/README.md b/configs/debug/training_modes/README.md
@@ -44,4 +44,4 @@ uv run rl @ configs/debug/training_modes/sft_lora.toml
 uv run rl @ configs/debug/training_modes/sft_external.toml
 ```
 
-See [docs/training_modes.md](../../docs/training_modes.md) for what each mode does.
+See [docs/training.md](../../docs/training.md#training-modes-rl--opd--sft-via-orchestrator) for what each mode does.
diff --git a/docs/advanced.md b/docs/advanced.md
@@ -0,0 +1,147 @@
+# Advanced
+
+This page covers the specialized features layered on top of the core training stack: our custom model implementations (with expert parallelism, EP, for MoE families and context parallelism, CP, for long-context training), multimodal training, LoRA training, multi-tenant training, and disaggregated prefill/decode inference. For developer-side workflows (adding new model architectures, debugging modeling code at small scale), see [Development](development.md).
+
+## Table of Contents
+
+- [Custom Modeling](#custom-modeling)
+  - [Expert Parallelism Backends](#expert-parallelism-backends)
+- [Multimodal Training](#multimodal-training)
+  - [Supported Families](#supported-families)
+  - [Enabling VLM Mode](#enabling-vlm-mode)
+  - [Limitations](#limitations)
+- [LoRA Training](#lora-training)
+- [Multi-Tenant Training](#multi-tenant-training)
+- [Disaggregated Prefill/Decode Inference](#disaggregated-prefilldecode-inference)
+
+## Custom Modeling
+
+`prime-rl` ships custom optimized model implementations for several MoE families. With `model.impl = "auto"` (default) the trainer picks the custom path when the HF config type is registered, falling back to plain HF otherwise. To force one:
+
+```toml
+[trainer.model]
+impl = "custom"        # or "hf" to force the HF path
+```
+
+| Family (HF config type) | Example HF checkpoints | EP | CP |
+|---|---|---|---|
+| GLM-5 (`glm_moe_dsa`) | `zai-org/GLM-5`, `zai-org/GLM-5-FP8` | ✅ | ✅ |
+| Qwen3 MoE | `Qwen/Qwen3-30B-A3B`, … | ✅ | ✅ |
+| Qwen3.5 MoE | `Qwen/Qwen3.5-35B-A3B`, … | ✅ | ✅ |
+| Qwen3 / Qwen3.5 VLMs | see [Multimodal training](#multimodal-training) | MoE only | ✅ |
+| Laguna | `poolside/Laguna-XS.2` | ✅ | ✅ |
+| MiniMax M2 | `MiniMax/MiniMax-M2` | ✅ | ✅ |
+| Nemotron H | `nvidia/Nemotron-3-Nano-30B-A3B`, … | ✅ | ❌ |
+| Trinity (AFMoE) | `arcee-ai/Trinity-Mini`, … | ✅ | ✅ |
+| GLM-4 / GLM-4.5 / INTELLECT-3 | `THUDM/GLM-4-9B-0414`, `zai-org/GLM-4.5`, `PrimeIntellect/INTELLECT-3`, … | ✅ | ✅ |
+| GPT-OSS (HF MoE) | `openai/gpt-oss-20b`, `openai/gpt-oss-120b` | ❌ | ✅ |
+
+The custom path enables EP, selective activation checkpointing, FP8 training (`model.fp8 = true`, requires SM90+), and faster MoE kernels (`moe_use_grouped_mm = true`, default). Forcing `impl = "hf"` is mostly useful when debugging — it's slower and disables most MoE-specific knobs.
+
+### Expert Parallelism Backends
+
+`model.ep_comm_backend` picks the all-to-all kernel used for EP dispatch/combine:
+
+- **`torch`** (default): TorchTitan's all-to-all collective. Works everywhere, no extra install.
+- **`deepep`**: Custom kernels from DeepEP. Faster, but requires a DeepEP build (`bash scripts/install_deep_gemm.sh`, `bash scripts/install_ep_kernels.sh`) and tuning of `deepep_num_sms` (default 20) and `deepep_token_chunk_size` for your hardware.
+
+DeepEP intranode dispatch derives the RDMA channel count as `deepep_num_sms / 2`. A lower SM count leaves more SMs for compute; a higher count speeds up dispatch. Useful starting points: 16–24 SMs on H100, 20–40 on B200.
+
+When you enable DeepEP, gradient clipping is auto-disabled (`optim.max_norm` set to `None`) because the kernels don't currently support it.
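+
+As a starting point, a DeepEP setup might look like the sketch below. It assumes that `deepep_num_sms` lives in the same table as `ep_comm_backend`, and that `[trainer.model]` is the right table for your entrypoint (as in the `impl` example above). Verify both against the config reference before relying on it.
+
+```toml
+[trainer.model]
+ep_comm_backend = "deepep"  # default is "torch"
+deepep_num_sms = 20         # documented default; placement next to ep_comm_backend is assumed
+```
+
+With 20 SMs, the `deepep_num_sms / 2` rule above gives 10 channels.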
+
+## Multimodal Training
+
+### Supported Families
+
+The built-in VLM registry covers:
+
+| Family | `model_type` | Vision attr | LM attr |
+|---|---|---|---|
+| Qwen3-VL | `qwen3_vl` | `model.visual` | `model.language_model` |
+| Qwen3-VL MoE | `qwen3_vl_moe` | `model.visual` | `model.language_model` |
+| Qwen3.5 | `qwen3_5` | `model.visual` | `model.language_model` |
+| Qwen3.5-MoE | `qwen3_5_moe` | `model.visual` | `model.language_model` |
+
+For a model not in the table, look up the attribute paths on the loaded HF model with `model.named_children()` and set them under `[model.vlm]` directly.
+
+### Enabling VLM Mode
+
+Add a `[model.vlm]` table and set both dtypes to bfloat16:
+
+```toml
+[model]
+name = "Qwen/Qwen3-VL-4B-Instruct"
+optimization_dtype = "bfloat16"
+reduce_dtype = "bfloat16"
+
+[model.vlm]
+vision_encoder_attr = "model.visual"
+language_model_attr = "model.language_model"
+# freeze_vision_encoder = true  # default; set false to fine-tune the encoder
+```
+
+A bad attribute path errors immediately — no silent fallbacks. The weight-broadcast key prefix is derived as `{language_model_attr}.layers.` automatically.
+
+To add a new model family permanently, append an entry to `VLM_REGISTRY` in `src/prime_rl/utils/vlm.py`.
+
+### Limitations
+
+- **Vision encoder frozen by default.** Set `freeze_vision_encoder = false` to fine-tune it; in that case it's FSDP-sharded per block. The combination `freeze_vision_encoder = false` + LoRA is rejected by a config validator — LoRA freezes everything non-adapter, so unfreezing the encoder under LoRA would be a silent no-op.
+- **No multimodal-safe truncation.** Token sequences are truncated to `seq_len`, but `pixel_values` and `image_grid_thw` pass through unchanged. If a sample's tokens overflow, image tokens may get dropped while image tensors still describe the full image set. Set `seq_len` to cover your longest sample.
+- **bfloat16 mandatory.** The trainer config validator refuses any other `optimization_dtype` / `reduce_dtype` for VLMs — vLLM serves VLMs in bfloat16 and a mismatch breaks the importance ratio.
+- **Higher KL mismatch with multi-image inputs.** Expect noisier `mismatch_kl` than in text-only runs; this comes from minor numerical differences between the trainer's and vLLM's image processing.
+- **Images aren't logged to monitors.** Sample logging captures the prompt text but not the actual images.
+
+## LoRA Training
+
+LoRA is enabled by adding `[model.lora]`:
+
+```toml
+[model.lora]
+rank = 16
+alpha = 32
+dropout = 0.0
+```
+
+`target_modules` defaults to a reasonable cross-family set (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, `experts`, plus a few latent-projection names for Nemotron). Unknown names are silently ignored, so the defaults work across architectures. Add architecture-specific names to extend coverage (e.g. `in_proj` / `out_proj` for Mamba).
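+
+For example, extending coverage to Mamba projections might look like the sketch below. It assumes that `target_modules` is a list of strings under `[model.lora]` and that setting it replaces the default list rather than appending to it, so the named defaults are repeated. Neither assumption is confirmed here, and the Nemotron latent-projection names are left out because they are not spelled out above.
+
+```toml
+[model.lora]
+rank = 16
+alpha = 32
+# Assumed: an explicit list replaces the defaults, so the named defaults are repeated.
+target_modules = [
+  "q_proj", "k_proj", "v_proj", "o_proj",
+  "gate_proj", "up_proj", "down_proj", "experts",
+  "in_proj", "out_proj",  # Mamba projections
+]
+```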
+
+LoRA is supported across SFT and RL. For RL, `weight_broadcast.type = "nccl"` is **not** supported with LoRA — use the default filesystem transport. To save the raw adapter alongside the merged HF weights:
+
+```toml
+[ckpt.weights]
+save_adapter_separately = true
+```
+
+LoRA pairs naturally with [multi-tenant training](#multi-tenant-training) — each tenant gets its own adapter and the backbone is shared across all of them in trainer memory.
+
+## Multi-Tenant Training
+
+Multi-tenant training lets a single trainer + inference deployment serve many concurrent LoRA "tenants" — each a fully isolated run with its own orchestrator, LoRA adapter, optimizer, scheduler, checkpoints, and progress tracking — sharing the same backbone weights and the same vLLM server. This is the topology behind hosted training on the [Prime Intellect platform (Lab)](https://app.primeintellect.ai). The trainer-side implementation is the `MultiRunManager` singleton, enabled by setting `trainer.max_concurrent_runs > 1`. For the full API surface, see [`src/prime_rl/trainer/runs/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/src/prime_rl/trainer/runs).
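+
+A minimal sketch of the trainer-side switch, assuming the key is set in the `[trainer]` table of an RL config; the value 4 is illustrative, not a recommendation:
+
+```toml
+[trainer]
+max_concurrent_runs = 4  # any value > 1 enables multi-tenant mode
+```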
+
+## Disaggregated Prefill/Decode Inference
+
+For large MoE serving, splitting prefill and decode onto separate vLLM groups can substantially improve throughput. Pick the prefill:decode ratio based on workload shape:
+
+| Workload | P:D ratio | Why |
+|---|---|---|
+| Agentic (SWE, Lean) | 3:1 | Long growing contexts → prefill-heavy |
+| Non-agentic (math, chat) | 1:2 | Short prompts, long generations → decode-heavy |
+
+Example config: [`examples/glm5_pd_disag/rl.toml`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/examples/glm5_pd_disag/rl.toml) — full RL run on `GLM-5` with P/D disaggregation behind a `vllm-router`, FP8 inference, and NCCL weight broadcast (see the [README](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/glm5_pd_disag) for the launch story).
+
+Monitor live queue depths to detect imbalance:
+
+```bash
+curl -s http://<prefill_node>:8100/metrics | grep num_requests_waiting
+curl -s http://<decode_node>:8200/metrics | grep num_requests_waiting
+```
+
+If requests queue up on prefill while decode sits idle, add prefill nodes (and vice versa).
+
+**UCX 1.19 requirement.** NVSHMEM needs UCX ≥ 1.19 for multi-GPU CUDA. Most clusters ship UCX 1.17 via HPC-X, which manifests as `cuStreamCreate: invalid device context` errors during DeepEP internode dispatch. Check with `/opt/hpcx/ucx/bin/ucx_info -v` and, if needed, build from source:
+
+```bash
+salloc -N 1 --gres=gpu:1 bash -c 'bash scripts/install_nixl_from_source.sh'
+```
+
+The script writes UCX 1.19 to `third_party/ucx/`; the bundled sbatch templates prepend it to `LD_LIBRARY_PATH` so it overrides the system version.