diff --git a/README.md b/README.md
index 5b7654868d..e6321c9006 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,7 @@ With `[model] impl = "auto"` (the default), the trainer selects that custom stac
 | GLM-5 (`glm_moe_dsa`) | `zai-org/GLM-5`, `zai-org/GLM-5-FP8` | yes | ✅ | ✅ |
 | Qwen3 MoE (`qwen3_moe`) | `Qwen/Qwen3-30B-A3B`, … | yes | ✅ | ✅ |
 | Qwen3.5 MoE (`qwen3_5_moe`) | `Qwen/Qwen3.5-35B-A3B`, … | yes | ✅ | ✅ |
-| Qwen3 / Qwen3.5 VLMs | [multimodal.md](docs/multimodal.md) (`qwen3_vl`, `qwen3_5`, `qwen3_5_moe`) | MoE only on MoE VLMs | MoE only | ✅ |
+| Qwen3 / Qwen3.5 VLMs | see [advanced.md](docs/advanced.md#vision-language-models) (`qwen3_vl`, `qwen3_5`, `qwen3_5_moe`) | MoE only on MoE VLMs | MoE only | ✅ |
 | Poolside Laguna (`laguna`) | `poolside/Laguna-XS.2` | yes | ✅ | ✅ |
 | MiniMax M2 (`minimax_m2`) | `MiniMax/MiniMax-M2` | yes | ✅ | ✅ |
 | Nemotron H (`nemotron_h`) | `nvidia/Nemotron-3-Nano-30B-A3B`, `nvidia/Nemotron-3-Super-120B-A12B`, … | yes | ✅ | ❌ |
@@ -217,17 +217,13 @@ These guides are designed to be run from a Slurm cluster but can also be adapted
 
 Check out the [docs](docs) directory for in-depth guides on how to use PRIME-RL.
 
-- [**Entrypoints**](docs/entrypoints.md) - Overview of the main components (orchestrator, trainer, inference) and how to run SFT, RL, and evals
-- [**Configs**](docs/configs.md) - Configuration system using TOML files, CLI arguments, and environment variables
-- [**Environments**](docs/environments.md) - Installing and using verifiers environments from the Environments Hub
-- [**Async Training**](docs/async.md) - Understanding asynchronous off-policy training and step semantics
-- [**Logging**](docs/logging.md) - Logging with loguru, torchrun, and Weights & Biases
-- [**Checkpointing**](docs/checkpointing.md) - Saving and resuming training from checkpoints
-- [**Benchmarking**](docs/benchmarking.md) - Performance benchmarking and throughput measurement
-- [**Deployment**](docs/deployment.md) - Training deployment on single-GPU, multi-GPU, and multi-node clusters
-- [**Memory Usage**](docs/memory_usage.md) - Techniques for reducing memory usage (activation checkpointing, offloading, EP, CP, LoRA, etc.)
-- [**Troubleshooting**](docs/troubleshooting.md) - Common issues and their solutions
-- [**Multimodal**](docs/multimodal.md) - Training VLMs like Qwen3-VL
+- [**Overview**](docs/overview.md) - Architecture, install, and a copy-pasteable end-to-end RL run
+- [**Configuration**](docs/configuration.md) - TOML composition, CLI overrides, env vars, validation
+- [**Training**](docs/training.md) - RL, SFT, evals, checkpointing, observability, rules of thumb
+- [**Scaling**](docs/scaling.md) - Single-GPU through multi-node, FSDP/EP/CP, SLURM, benchmarking
+- [**Algorithms**](docs/algorithms.md) - Async/off-policy training, the AIPO loss, advantage and filter plugins, trajectory merging
+- [**Advanced**](docs/advanced.md) - Custom modeling, multimodal training, LoRA, multi-tenant training
+- [**Development**](docs/development.md) - Test suite, pre-commit hooks, adding a new model
 
 ## Contributing
 
@@ -249,28 +245,11 @@ uv run pre-commit install
 
 ### Tests
 
-Run the full test suite 
-
-```bash
-uv run pytest -v
-```
-
-To run unit tests, run
-
-```bash
-uv run pytest tests/unit -v
-```
-
-To run integration tests, run
-
-```bash
-uv run pytest tests/integration -v
-```
-
-To run CPU-only tests, use the inverse of the `gpu` marker:
-
 ```bash
-uv run pytest -v -m "not gpu"
+uv run pytest -v                    # everything
+uv run pytest tests/unit -v         # unit only
+uv run pytest tests/integration -v  # integration only
+uv run pytest -v -m "not gpu"       # CPU-only (inverse of the gpu marker)
 ```
 
 ## License
diff --git a/configs/debug/training_modes/README.md b/configs/debug/training_modes/README.md
index 67c5450947..96ccebb009 100644
--- a/configs/debug/training_modes/README.md
+++ b/configs/debug/training_modes/README.md
@@ -44,4 +44,4 @@ uv run rl @ configs/debug/training_modes/sft_lora.toml
 uv run rl @ configs/debug/training_modes/sft_external.toml
 ```
 
-See [docs/training_modes.md](../../docs/training_modes.md) for what each mode does.
+See [docs/training.md](../../docs/training.md#training-modes-rl--opd--sft-via-orchestrator) for what each mode does.
diff --git a/docs/advanced.md b/docs/advanced.md
new file mode 100644
index 0000000000..f8f6d2ccb1
--- /dev/null
+++ b/docs/advanced.md
@@ -0,0 +1,147 @@
+# Advanced
+
+This page covers the specialized features layered on top of the core training stack: our custom model implementations (expert parallelism, EP, for MoE families; context parallelism, CP, for long-context training), multimodal training, LoRA training, multi-tenant training, and disaggregated prefill/decode inference. For developer-side workflows (adding new model architectures, debugging modeling code at small scale), see [Development](development.md).
+
+## Table of Contents
+
+- [Custom Modeling](#custom-modeling)
+  - [Expert Parallelism Backends](#expert-parallelism-backends)
+- [Multimodal Training](#multimodal-training)
+  - [Supported Families](#supported-families)
+  - [Enabling VLM Mode](#enabling-vlm-mode)
+  - [Limitations](#limitations)
+- [LoRA Training](#lora-training)
+- [Multi-Tenant Training](#multi-tenant-training)
+- [Disaggregated Prefill/Decode Inference](#disaggregated-prefilldecode-inference)
+
+## Custom Modeling
+
+`prime-rl` ships custom optimized model implementations for several MoE families. With `model.impl = "auto"` (default) the trainer picks the custom path when the HF config type is registered, falling back to plain HF otherwise. To force one:
+
+```toml
+[trainer.model]
+impl = "custom"        # or "hf" to force the HF path
+```
+
+| Family | Example HF checkpoints | EP | CP |
+|---|---|---|---|
+| GLM-5 (`glm_moe_dsa`) | `zai-org/GLM-5`, `zai-org/GLM-5-FP8` | ✅ | ✅ |
+| Qwen3 MoE | `Qwen/Qwen3-30B-A3B`, … | ✅ | ✅ |
+| Qwen3.5 MoE | `Qwen/Qwen3.5-35B-A3B`, … | ✅ | ✅ |
+| Qwen3 / Qwen3.5 VLMs | see [Multimodal training](#multimodal-training) | MoE only | ✅ |
+| Laguna | `poolside/Laguna-XS.2` | ✅ | ✅ |
+| MiniMax M2 | `MiniMax/MiniMax-M2` | ✅ | ✅ |
+| Nemotron H | `nvidia/Nemotron-3-Nano-30B-A3B`, … | ✅ | ❌ |
+| Trinity (AFMoE) | `arcee-ai/Trinity-Mini`, … | ✅ | ✅ |
+| GLM-4 / GLM-4.5 / INTELLECT-3 | `THUDM/GLM-4-9B-0414`, `zai-org/GLM-4.5`, `PrimeIntellect/INTELLECT-3`, … | ✅ | ✅ |
+| GPT-OSS (HF MoE) | `openai/gpt-oss-20b`, `openai/gpt-oss-120b` | ❌ | ✅ |
+
+The custom path enables EP, selective activation checkpointing, FP8 training (`model.fp8 = true`, requires SM90+), and faster MoE kernels (`moe_use_grouped_mm = true`, default). Forcing `impl = "hf"` is mostly useful when debugging — it's slower and disables most MoE-specific knobs.
+
+### Expert Parallelism Backends
+
+`model.ep_comm_backend` picks the all-to-all kernel used for EP dispatch/combine:
+
+- **`torch`** (default): TorchTitan's all-to-all collective. Works everywhere, no extra install.
+- **`deepep`**: Custom kernels from DeepEP. Faster, but requires a DeepEP build (`bash scripts/install_deep_gemm.sh`, `bash scripts/install_ep_kernels.sh`) and tuning of `deepep_num_sms` (default 20) and `deepep_token_chunk_size` for your hardware.
+
+DeepEP intranode dispatch derives the RDMA channel count as `deepep_num_sms / 2`. A lower SM count leaves more SMs for compute; a higher one speeds up dispatch. Useful starting points: 16–24 SMs on H100, 20–40 on B200.
+
+When you enable DeepEP, gradient clipping is auto-disabled (`optim.max_norm` set to `None`) because the kernels don't currently support it.
+
+## Multimodal Training
+
+### Supported Families
+
+The built-in VLM registry covers:
+
+| Family | `model_type` | Vision attr | LM attr |
+|---|---|---|---|
+| Qwen3-VL | `qwen3_vl` | `model.visual` | `model.language_model` |
+| Qwen3-VL MoE | `qwen3_vl_moe` | `model.visual` | `model.language_model` |
+| Qwen3.5 | `qwen3_5` | `model.visual` | `model.language_model` |
+| Qwen3.5-MoE | `qwen3_5_moe` | `model.visual` | `model.language_model` |
+
+For a model not in the table, look up the attribute paths on the loaded HF model with `model.named_children()` and set them under `[model.vlm]` directly.
+
+### Enabling VLM Mode
+
+Add `[model.vlm]` and bfloat16 dtypes:
+
+```toml
+[model]
+name = "Qwen/Qwen3-VL-4B-Instruct"
+optimization_dtype = "bfloat16"
+reduce_dtype = "bfloat16"
+
+[model.vlm]
+vision_encoder_attr = "model.visual"
+language_model_attr = "model.language_model"
+# freeze_vision_encoder = true  # default; set false to fine-tune the encoder
+```
+
+A bad attribute path errors immediately — no silent fallbacks. The weight-broadcast key prefix is derived as `{language_model_attr}.layers.` automatically.
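+
+The fail-fast lookup and the derived prefix described above can be pictured with a minimal sketch. The helper names `resolve_attr` and `broadcast_key_prefix` and the toy object are hypothetical and written only for illustration; they are not the trainer's actual validation code, which may differ.
+
+```python
+from functools import reduce
+from types import SimpleNamespace
+
+
+def resolve_attr(root: object, dotted_path: str) -> object:
+    """Follow a dotted attribute path; raises AttributeError on the first miss."""
+    return reduce(getattr, dotted_path.split("."), root)
+
+
+def broadcast_key_prefix(language_model_attr: str) -> str:
+    """Build the prefix stated in the text: `{language_model_attr}.layers.`"""
+    return f"{language_model_attr}.layers."
+
+
+# Toy stand-in for a loaded HF model (hypothetical structure, not a real checkpoint).
+toy = SimpleNamespace(
+    model=SimpleNamespace(visual="vision-tower", language_model="decoder")
+)
+
+vision = resolve_attr(toy, "model.visual")
+prefix = broadcast_key_prefix("model.language_model")
+```
+
+A misspelled path such as `"model.vision"` raises `AttributeError` in this sketch instead of silently resolving to something else, which mirrors the no-fallback behavior described above.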
+
+To add a new model family permanently, append an entry to `VLM_REGISTRY` in `src/prime_rl/utils/vlm.py`.
+
+### Limitations
+
+- **Vision encoder frozen by default.** Set `freeze_vision_encoder = false` to fine-tune it; in that case it's FSDP-sharded per block. The combination `freeze_vision_encoder = false` + LoRA is rejected by a config validator — LoRA freezes everything non-adapter, so unfreezing the encoder under LoRA would be a silent no-op.
+- **No multimodal-safe truncation.** Token sequences are truncated to `seq_len`, but `pixel_values` and `image_grid_thw` pass through unchanged. If a sample's tokens overflow, image tokens may get dropped while image tensors still describe the full image set. Set `seq_len` to cover your longest sample.
+- **bfloat16 mandatory.** The trainer config validator refuses any other `optimization_dtype` / `reduce_dtype` for VLMs — vLLM serves VLMs in bfloat16 and a mismatch breaks the importance ratio.
+- **Higher KL mismatch with multi-image inputs.** Expect noisier `mismatch_kl` than text-only; this is from minor numerical differences between the trainer's and vLLM's image processing.
+- **Images aren't logged to monitors.** Sample logging captures the prompt text but not the actual images.
+
+## LoRA Training
+
+LoRA is enabled by adding `[model.lora]`:
+
+```toml
+[model.lora]
+rank = 16
+alpha = 32
+dropout = 0.0
+```
+
+`target_modules` defaults to a reasonable cross-family set (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, `experts`, plus a few latent-projection names for Nemotron). Unknown names are silently ignored, so the defaults work across architectures. Add architecture-specific names to extend coverage (e.g. `in_proj` / `out_proj` for Mamba).
+
+LoRA is supported across SFT and RL. For RL, `weight_broadcast.type = "nccl"` is **not** supported with LoRA — use the default filesystem transport. To save the raw adapter alongside the merged HF weights:
+
+```toml
+[ckpt.weights]
+save_adapter_separately = true
+```
+
+LoRA pairs naturally with [multi-tenant training](#multi-tenant-training) — each tenant gets its own adapter and the backbone is shared across all of them in trainer memory.
+
+## Multi-Tenant Training
+
+Multi-tenant training lets a single trainer + inference deployment serve many concurrent LoRA "tenants" — each a fully isolated run with its own orchestrator, LoRA adapter, optimizer, scheduler, checkpoints, and progress tracking — sharing the same backbone weights and the same vLLM server. This is the topology behind hosted training on the [Prime Intellect platform (Lab)](https://app.primeintellect.ai). The trainer-side implementation is the `MultiRunManager` singleton, enabled by setting `trainer.max_concurrent_runs > 1`. For the full API surface, see [`src/prime_rl/trainer/runs/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/src/prime_rl/trainer/runs).
+
+## Disaggregated Prefill/Decode Inference
+
+For large MoE serving, splitting prefill and decode onto separate vLLM groups can substantially improve throughput. Pick the prefill:decode ratio based on workload shape:
+
+| Workload | P:D ratio | Why |
+|---|---|---|
+| Agentic (SWE, Lean) | 3:1 | Long growing contexts → prefill-heavy |
+| Non-agentic (math, chat) | 1:2 | Short prompts, long generations → decode-heavy |
+
+Example config: [`examples/glm5_pd_disag/rl.toml`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/examples/glm5_pd_disag/rl.toml) — full RL run on `GLM-5` with P/D disaggregation behind a `vllm-router`, FP8 inference, and NCCL weight broadcast (see the [README](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/glm5_pd_disag) for the launch story).
+
+Monitor live queue depths to detect imbalance:
+
+```bash
+curl -s http://<prefill_node>:8100/metrics | grep num_requests_waiting
+curl -s http://<decode_node>:8200/metrics | grep num_requests_waiting
+```
+
+If requests queue up on prefill while decode sits idle, add prefill nodes; in the opposite case, add decode nodes.
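+
+For a scripted version of this check, the sketch below parses the text returned by the two `/metrics` endpoints and applies the rule of thumb. It assumes Prometheus-style lines of the form `name{labels} value` with no trailing timestamp, and it treats a waiting count of zero as "idle"; both are simplifying assumptions. The function names and the sample payloads are made up for illustration.
+
+```python
+def waiting_requests(metrics_text: str) -> float:
+    """Sum every sample whose metric name contains `num_requests_waiting`."""
+    total = 0.0
+    for line in metrics_text.splitlines():
+        line = line.strip()
+        if not line or line.startswith("#"):
+            continue  # skip blank lines and HELP/TYPE comments
+        name_and_labels, _, value = line.rpartition(" ")
+        if "num_requests_waiting" in name_and_labels:
+            total += float(value)
+    return total
+
+
+def suggest(prefill_waiting: float, decode_waiting: float) -> str:
+    """Apply the rule of thumb: scale the side that is queuing."""
+    if prefill_waiting > 0 and decode_waiting == 0:
+        return "add prefill nodes"
+    if decode_waiting > 0 and prefill_waiting == 0:
+        return "add decode nodes"
+    return "balanced or inconclusive"
+
+
+# Made-up sample payloads, for illustration only.
+prefill_sample = (
+    "# TYPE vllm:num_requests_waiting gauge\n"
+    'vllm:num_requests_waiting{model_name="m"} 7.0\n'
+)
+decode_sample = 'vllm:num_requests_waiting{model_name="m"} 0.0\n'
+
+advice = suggest(waiting_requests(prefill_sample), waiting_requests(decode_sample))
+```
+
+A single scrape is noisy, so in practice you would sample both endpoints several times before resizing either group.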
+
+**UCX 1.19 requirement.** NVSHMEM needs UCX ≥ 1.19 for multi-GPU CUDA. Most clusters ship UCX 1.17 via HPC-X, which manifests as `cuStreamCreate: invalid device context` errors during DeepEP internode dispatch. Check with `/opt/hpcx/ucx/bin/ucx_info -v` and, if needed, build from source:
+
+```bash
+salloc -N 1 --gres=gpu:1 bash -c 'bash scripts/install_nixl_from_source.sh'
+```
+
+The script writes UCX 1.19 to `third_party/ucx/`; the bundled sbatch templates prepend it to `LD_LIBRARY_PATH` so it overrides the system version.
diff --git a/docs/algorithms.md b/docs/algorithms.md
new file mode 100644
index 0000000000..fdd5b6e2da
--- /dev/null
+++ b/docs/algorithms.md
@@ -0,0 +1,304 @@
+# Algorithms
+
+This page covers the math and the configurable algorithmic components: how off-policy training works, the default loss and advantage functions, how to plug in your own, the filters applied between rollout and training, and how multi-turn rollouts get merged into training samples.
+
+## Table of Contents
+
+- [Async / Off-Policy Training](#async--off-policy-training)
+- [Loss](#loss)
+  - [Default Loss](#default-loss)
+  - [Custom Loss](#custom-loss)
+- [Advantage](#advantage)
+  - [Default Advantage](#default-advantage)
+  - [Custom Advantage](#custom-advantage)
+- [Filters](#filters)
+- [Difficulty Pools](#difficulty-pools)
+- [Online Difficulty Filtering](#online-difficulty-filtering)
+- [Multi-Turn Trajectories](#multi-turn-trajectories)
+  - [Extension Property](#extension-property)
+  - [Best-Effort Interleaving](#best-effort-interleaving)
+  - [Renderers](#renderers)
+  - [Discontinuous Trajectories](#discontinuous-trajectories)
+
+## Async / Off-Policy Training
+
+`prime-rl` is asynchronous by default. The trainer and inference always run one step overlapped: while the trainer is producing $\pi_n$ from rollouts at step $n$, inference is already generating the rollouts for step $n+1$ using $\pi_{n-1}$. With matched trainer and inference step times this produces fully-overlapped pipeline parallelism — neither side ever idles.
+
+![Async pipeline: trainer step n produces $\theta_n$, inference at step n samples with $\theta_{n-1}$](assets/async-pipeline.png)
+
+At step $n = 1, 2, 3, \dots$:
+
+- **Trainer** produces policy $\pi_n$ with weights $\theta_n$ from rollouts $(x_n, y_n)$.
+- **Inference** produces rollouts $(x_n, y_n)$ from policy $\pi_{\max(0,\,n-1)}$.
+
+Policy indices start at 0, so the gap also holds at startup: at step 1 inference samples from $\pi_0$, and from then on it stays exactly one step behind the trainer.
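+
+The mapping between a step and the policy that generated its rollouts is just the $\max(0,\,n-1)$ from the bullets above. As a small sketch (the function name is hypothetical):
+
+```python
+def inference_policy_index(step: int) -> int:
+    """Index of the policy whose rollouts are consumed at `step`."""
+    return max(0, step - 1)
+
+
+for step in (1, 2, 3):
+    print(
+        f"step {step}: trainer produces pi_{step}, "
+        f"rollouts sampled from pi_{inference_policy_index(step)}"
+    )
+```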
+
+## Loss
+
+### Default Loss
+
+The default RL loss is a DPPO policy-gradient term combined with a KL regularizer similar to Kimi-K2.5. For each prompt $x_j$ we sample a group of $G$ rollouts $\{y_i\}_{i=1}^G$, score them to get $s_i$, then optimize:
+
+$$
+\mathcal{L}(\theta) = -\,\mathcal{J}_{\text{PG}}(\theta) \;+\; \tau_{KL}\,\mathcal{L}_{KL}(\theta)
+$$
+
+where the policy-gradient term is
+
+$$
+\mathcal{J}_{\text{PG}}(\theta)
+= \frac{1}{\sum_{j,i} |y_i^{(j)}|}
+\sum_{j,i,t}
+\min\!\left(\frac{\pi(y_{i,t}^{(j)}\mid x_j, y_{i,<t}^{(j)})}{\mu(y_{i,t}^{(j)}\mid x_j, y_{i,<t}^{(j)})}, \delta\right) \hat{A}^{(j)}_{i,t}
+$$
+
+and the KL regularizer penalizes drift between trainer and inference policies via the squared log importance ratio:
+
+$$
+\mathcal{L}_{KL}(\theta) = \frac{1}{\sum_{j,i} |y_i^{(j)}|}
+\sum_{j,i,t} \log^2\!\left(\frac{\pi(y_{i,t}^{(j)}\mid x_j, y_{i,<t}^{(j)})}{\mu(y_{i,t}^{(j)}\mid x_j, y_{i,<t}^{(j)})}\right).
+$$
+
+$\mu$ is the policy that generated the rollout (inference), $\pi$ is the current policy (trainer), $\hat{A}_{i,t}$ is the token-level advantage, $\delta$ is the importance-sampling clipping ratio, and $\tau_{KL}$ is the KL temperature. The `min` clamps the importance ratio from above so a stale rollout assigning very low probability to a high-reward token doesn't produce a runaway gradient.
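+
+To make the two terms concrete, here is a minimal sketch that evaluates the formulas above on a flat list of tokens in plain Python. It only computes the scalar value; it is not the trainer's implementation, and it leaves out the DPPO masking and the `adv_tau` scaling listed in the knobs below. The function name and the value `delta = 2.0` are arbitrary choices for illustration.
+
+```python
+import math
+
+
+def default_loss_sketch(trainer_logprobs, inference_logprobs, advantages, delta, kl_tau):
+    """Evaluate L = -J_PG + tau_KL * L_KL, averaged over all tokens."""
+    n = len(trainer_logprobs)
+    pg = 0.0
+    kl = 0.0
+    for lp_pi, lp_mu, adv in zip(trainer_logprobs, inference_logprobs, advantages):
+        log_ratio = lp_pi - lp_mu  # log(pi / mu) for this token
+        pg += min(math.exp(log_ratio), delta) * adv  # ratio clamped from above
+        kl += log_ratio**2  # squared log importance ratio
+    return -(pg / n) + kl_tau * (kl / n)
+
+
+# On-policy tokens: ratio is 1 and the KL term vanishes, so the loss is -mean(advantage).
+on_policy = default_loss_sketch([-1.0, -2.0], [-1.0, -2.0], [1.0, 0.5], delta=2.0, kl_tau=1e-3)
+
+# A ratio of 4 is clamped to delta = 2, so with kl_tau = 0 the loss is -2.0.
+clipped = default_loss_sketch([math.log(4.0)], [0.0], [1.0], delta=2.0, kl_tau=0.0)
+```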
+
+The knobs (under `[trainer.loss]` with `type = "default"`):
+
+| Knob | Default | What it does |
+|---|---|---|
+| `dppo_mask_low` / `dppo_mask_high` | 0.2 / 0.2 | Lower / upper thresholds for DPPO-style token-level masking. |
+| `adv_tau` | 1.0 | Temperature on the advantage term. Set to 0 for pure distillation (no RL signal). |
+| `kl_tau` | 1e-3 | Temperature on the KL regularizer. Set to 0 to disable. |
+
+The trainer dispatches automatically based on the batch's training mode (set by the orchestrator via `orchestrator.training_mode`):
+
+- `rl` mode → DPPO + KL with the advantage signal.
+- `opd` mode → KL distillation against the teacher's per-token logprobs. The teacher must be a vLLM server (it's the only one that exposes `prompt_logprobs`).
+- `sft` mode → standard token-level NLL on teacher-generated rollouts.
+
+Set `[trainer.loss] type = "default"` and configure via the knobs above. SFT and OPD modes ignore the policy-gradient–specific fields.
+
+### Custom Loss
+
+The loss is computed **per sequence**: you write a function that takes one sequence's tensors and returns a scalar loss. The trainer iterates and aggregates.
+
+```python
+# my_module.py
+import torch
+from prime_rl.trainer.rl.loss import LossInputs, LossOutputs
+
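+def ppo_clip_loss(inputs: LossInputs, clip_eps: float = 0.2) -> LossOutputs:
+    # Per-token importance ratio pi / mu, computed from log-probs.
+    ratio = torch.exp(inputs.trainer_logprobs - inputs.inference_logprobs)
+    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
+    # PPO's pessimistic surrogate: elementwise min of the unclipped and clipped terms.
+    surr1 = ratio * inputs.advantages
+    surr2 = clipped * inputs.advantages
+    # Negate (the trainer minimizes) and sum over tokens selected by the loss mask.
+    loss = -torch.min(surr1, surr2)[inputs.loss_mask].sum()
+    return LossOutputs(
+        loss=loss,
+        metrics={
+            # Fraction of masked tokens whose ratio was clipped.
+            "clip_frac": (ratio != clipped)[inputs.loss_mask].float().mean(),
+        },
+    )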
+```
+
+Wire it up:
+
+```toml
+[trainer.loss]
+type = "custom"
+import_path = "my_module.ppo_clip_loss"
+kwargs = { clip_eps = 0.2 }
+```
+
+The dataclasses:
+
+```python
+@dataclass
+class LossInputs:
+    trainer_logprobs: Float[Tensor, "seq"]      # current policy
+    inference_logprobs: Float[Tensor, "seq"]    # rollout-time policy
+    teacher_logprobs: Float[Tensor, "seq"] | None  # only set in OPD mode
+    advantages: Float[Tensor, "seq"]
+    loss_mask: Bool[Tensor, "seq"]
+
+@dataclass
+class LossOutputs:
+    loss: Float[Tensor, ""]
+    metrics: dict[str, Tensor]
+```
+
+Anything you put in `metrics` is averaged across sequences and logged with the other trainer metrics.
+
+## Advantage
+
+### Default Advantage
+
+The default advantage is per-group reward minus per-group baseline (DR-GRPO without std normalization). For each prompt's group of `group_size` rollouts, every token in rollout $i$ receives advantage $s_i - \bar{s}$ where $\bar{s}$ is the group mean.
+
+This is intentionally simple — it does the right thing for most envs. Switch to a [custom advantage](#custom-advantage) when you need group-aware shaping that depends on trajectory metadata (sub-agent rollouts, relative-rank shaping, …).
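+
+As a worked example of the group-mean baseline, the sketch below computes $s_i - \bar{s}$ for one group. The function name is hypothetical; in the real pipeline each value is then applied to every token of the corresponding rollout.
+
+```python
+from statistics import fmean
+
+
+def group_mean_advantages(rewards: list[float]) -> list[float]:
+    """Reward minus the group mean, with no std normalization."""
+    baseline = fmean(rewards)
+    return [r - baseline for r in rewards]
+
+
+print(group_mean_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
+```
+
+A group in which every rollout gets the same reward yields all-zero advantages, which is the case the `zero_advantage` filter and online difficulty filtering are meant to catch.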
+
+Two built-in **length penalties** can be layered on top of any advantage to discourage rambling:
+
+- `[orchestrator.length_penalty] type = "tokens"` — penalizes long completions in tokens, with configurable target and slope.
+- `[orchestrator.length_penalty] type = "turns"` — penalizes long multi-turn rollouts by turn count.
+
+
+### Custom Advantage
+
+Advantages are computed **per group**. You write a function that takes one group of rollouts and returns one advantage scalar per rollout. The orchestrator handles groups of varying size automatically — partial-group training kicks in when some rollouts in a group errored.
+
+```python
+# my_module.py
+import statistics
+from prime_rl.orchestrator.advantage import AdvantageInputs, AdvantageOutputs
+
+def normalized_advantage(inputs: AdvantageInputs, eps: float = 1e-8) -> AdvantageOutputs:
+    rewards = [r["reward"] for r in inputs.rollouts]
+    mean = statistics.fmean(rewards)
+    std = statistics.pstdev(rewards) if len(rewards) > 1 else 0.0
+    return AdvantageOutputs(advantages=[(r - mean) / (std + eps) for r in rewards])
+```
+
+```toml
+[orchestrator.advantage]
+type = "custom"
+import_path = "my_module.normalized_advantage"
+kwargs = { eps = 1e-8 }
+```
+
+`AdvantageInputs.rollouts` is a list of `verifiers.RolloutOutput`, so you have access to the full rollout (turns, tool calls, custom metadata) — not just the reward. Use this for anything reward-shaping-like that needs trajectory context.
+
+## Filters
+
+Filters drop rollouts between scoring and training. Built-ins (composable):
+
+| Filter | Effect |
+|---|---|
+| `gibberish` | Drops rollouts whose mean log-prob falls below a threshold, which is usually a sign of degenerate output. |
+| `repetition` | Drops rollouts with high n-gram repetition. |
+| `zero_advantage` | Drops rollouts whose advantage is zero, so the trainer doesn't waste tokens on them. |
+
+The default `[orchestrator]` config already includes all three filters with their defaults. To override, set `filters` explicitly — the list replaces the defaults wholesale:
+
+```toml
+[[orchestrator.filters]]
+type = "zero_advantage"
+
+[[orchestrator.filters]]
+type = "repetition"
+threshold = 0.4
+```
+
+Filtered rollouts still appear in W&B distributions, just not in the trainer batch — useful for spotting whether filtering is doing its job.
+
+## Difficulty Pools
+
+Difficulty pools gradually retire problems the model has solved or never solves. After each rollout, the average reward across a problem's group is compared to two thresholds:
+
+- `buffer.easy_threshold` — at or above this, the problem moves into the `easy` pool and is no longer sampled.
+- `buffer.hard_threshold` — at or below this, the problem moves into the `hard` pool and is no longer sampled.
+- Otherwise the problem stays in `normal` and remains in the sampling rotation.
+
+Pool assignments persist across checkpoints (`easy_examples.jsonl` / `hard_examples.jsonl` under each step's orchestrator checkpoint). When you resume — or want to broaden the curriculum mid-run — `buffer.easy_fraction` / `buffer.hard_fraction` randomly lift that fraction of pooled problems back into `normal` so they re-enter sampling.
+
+```toml
+[orchestrator.buffer]
+easy_threshold = 0.95
+hard_threshold = 0.05
+easy_fraction = 0.0   # default; bump on resume to bring some easy problems back
+hard_fraction = 0.0   # default; bump on resume to bring some hard problems back
+```
+
+Watch `pool/{env}/{easy,normal,hard}` (current pool ratios) and `evicted_examples/{env}/{easy,hard}` (per-step eviction rate).
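+
+The threshold rule above amounts to a three-way classification of a problem's group-average reward. The sketch below restates it; the function name is hypothetical, and checking the easy threshold first is an assumption about precedence that the rule itself does not specify.
+
+```python
+def assign_pool(mean_reward: float, easy_threshold: float, hard_threshold: float) -> str:
+    """Classify a problem by its group-average reward."""
+    if mean_reward >= easy_threshold:
+        return "easy"  # retired: the model already solves it
+    if mean_reward <= hard_threshold:
+        return "hard"  # retired: the model never solves it
+    return "normal"  # stays in the sampling rotation
+
+
+pools = [assign_pool(r, easy_threshold=0.95, hard_threshold=0.05) for r in (1.0, 0.5, 0.0)]
+```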
+
+## Online Difficulty Filtering
+
+Online difficulty filtering (ODF) drops collapsed-advantage groups on the way *into* the buffer. Set `buffer.online_difficulty_filtering = true` (default `false`) to enable:
+
+- Average reward across the group is **0.0** (every rollout failed) → drop the group, count under `filtered_rollouts/{env}/hard`.
+- Average reward **1.0** (every rollout succeeded) → drop, count under `filtered_rollouts/{env}/easy`.
+- Otherwise → into the buffer.
+
+These are exactly the groups whose within-group advantage collapses to zero — DR-GRPO produces no gradient signal for them, so the trainer would burn step time on tokens it can't learn from.
+
+```toml
+[orchestrator.buffer]
+online_difficulty_filtering = true
+```
+
+**Tradeoff: trainer stability vs. inference speed.** With ODF on, every rollout that reaches the trainer carries non-zero advantage — each trainer step's effective batch is predictable and the gradient signal is denser. The cost is paid on the inference side: rollouts get produced and then thrown away, so the orchestrator has to oversample to keep the trainer fed. If the orchestrator is your bottleneck (`time/wait_for_batch` high on the trainer), ODF can starve the loop. Bump `orchestrator.oversampling_factor` so inference produces enough groups per step to absorb the drops.
+
+ODF is orthogonal to the [pools](#difficulty-pools): ODF reacts to the *current* group's reward distribution, the pools track the *running* per-problem average. Many configs use both — ODF for per-step density, pools for long-horizon curriculum cleanup.
+
+## Multi-Turn Trajectories
+
+Multi-turn rollouts (tool use, browser environments, long conversations) used to be stitched into a single fake "single-turn" sample, which silently corrupted the importance ratio when chat templates didn't roundtrip. Since [`verifiers` v0.1.8](https://github.com/PrimeIntellect-ai/verifiers/releases/tag/v0.1.8), `prime-rl` records each LLM request/response as an independent **trajectory step** and merges them at training time using best-effort interleaving — with [renderers](#renderers) as the mechanism that keeps the merge safe by construction.
+
+### Extension Property
+
+A sequence of trajectory steps has the **extension property** when each successive step's prompt contains all previous prompts and completions as an exact prefix. The trainer relies on this property — when it holds:
+
+- Multiple steps merge into one training sample.
+- Compute scales as $O(T)$ in the trajectory length.
+
+When it breaks (chat template strips past thinking, environment compacts context, an agent hands off to a sub-agent, etc.), the trainer starts a new training sample from that step:
+
+- Graceful fallback to multiple samples — no corrupted data.
+- Worst case (every step breaks extension) is $O(T^2)$.
+
+### Best-Effort Interleaving
+
+Concretely:
+
+```
+5-step trajectory where extension breaks at step 4:
+
+steps 1–3: extension holds   → merged into Sample 1
+step 4:    extension breaks  (e.g. thinking stripped from history)
+steps 4–5: extension holds   → merged into Sample 2
+
+result: 2 training samples instead of 5
+```
+
+The orchestrator enforces an **exact prefix invariant**: the prompt at turn $t$ must be the concatenation of all prior messages, token for token as the LLM originally generated them. If turn 2's prompt is `U1, A1', U2` with `A1' ≠ A1`, the orchestrator can't safely merge: training on either `A1` or `A1'` produces logprob drift between trainer and inference. Starting a fresh sample is the only correct behavior, so that's what happens.
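+
+The merge rule can be sketched over token ids as follows. This is a simplified illustration, not the orchestrator's code: it keeps only the token stream and ignores the loss mask and per-token logprobs that a real merge would also carry. The function name and the toy ids are hypothetical.
+
+```python
+def merge_steps(steps):
+    """Merge (prompt_ids, completion_ids) steps into as few samples as possible.
+
+    A step extends the running sample only if its prompt starts with every
+    token accumulated so far; otherwise a new sample begins at that step.
+    """
+    samples = []
+    current = None
+    for prompt_ids, completion_ids in steps:
+        extends = current is not None and list(prompt_ids[: len(current)]) == current
+        if current is not None and not extends:
+            samples.append(current)  # extension broke: close the running sample
+        current = list(prompt_ids) + list(completion_ids)
+    if current is not None:
+        samples.append(current)
+    return samples
+
+
+# Five steps; the prompt of step 4 rewrites history, so extension breaks there.
+steps = [
+    ([1], [2]),
+    ([1, 2, 3], [4]),
+    ([1, 2, 3, 4, 5], [6]),
+    ([1, 3, 5, 7], [8]),
+    ([1, 3, 5, 7, 8, 9], [10]),
+]
+samples = merge_steps(steps)
+```
+
+Here `samples` holds two token sequences, matching the two training samples in the diagram above.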
+
+### Renderers
+
+Best-effort interleaving works because the renderer guarantees the exact-prefix invariant *by construction* — it never re-renders prior turns, so it can't lose tokens to chat-template normalization, BPE retokenization drift, or thinking stripping. A renderer turns a model's chat template into a Python object that can:
+
+- `render_ids(messages)` — tokenize messages to ids the inference engine accepts.
+- `parse_response(completion_ids)` — recover structured `(content, reasoning_content, tool_calls)` from sampled ids.
+- `bridge_to_next_turn(prev_prompt_ids, prev_completion_ids, new_messages)` — extend the previous turn's tokens verbatim with the new environment turn, instead of re-rendering history.
+
+When `bridge_to_next_turn` succeeds, the trainer sees the exact token stream the sampler produced; when it can't be proven safe (e.g. the renderer is `DefaultRenderer` and the template's stop sequence is unknown), it returns `None` and the orchestrator falls back to a full re-render — which triggers the new-sample fallback above.
+
+A common source of breakage in the absence of a hand-coded renderer is models like Qwen3 whose chat templates strip past `<think>` blocks across user turns:
+
+```python
+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
+messages = [
+    {"role": "user", "content": "U1"},
+    {"role": "assistant", "content": "<think>R1</think>A1"},
+    {"role": "user", "content": "U2"},
+]
+tok.apply_chat_template(messages[:1], tokenize=False)
+# <|im_start|>user
+# U1<|im_end|>
+
+tok.apply_chat_template(messages, tokenize=False)
+# <|im_start|>user\nU1<|im_end|>\n<|im_start|>assistant\nA1<|im_end|>\n<|im_start|>user\nU2<|im_end|>
+# (the <think>R1</think> from the first assistant message is gone)
+```
+
+Hand-coded renderers ship for `qwen3`, `qwen3-vl`, `qwen3.5`, `glm5`, `glm4.5`, `minimax-m2`, `deepseek-v3`, `kimi-k2`, `kimi-k2.5`, `nemotron-3`, `gpt-oss`; anything else falls back to `DefaultRenderer` (a generic `apply_chat_template` wrapper). Pick one via:
+
+```toml
+[orchestrator.renderer]
+name = "auto"   # detect from tokenizer; pass an explicit name for fine-tunes
+```
+
+For the full design rationale (failure modes ruled out, empirical token-identity comparison against `apply_chat_template`, when to write a hand-coded renderer), see [the renderers writeup on the Prime Intellect blog](https://www.primeintellect.ai/blog/renderers) — the canonical reference.
+
+### Discontinuous Trajectories
+
+Some envs are discontinuous by design — e.g. a main agent delegating to a sub-agent and getting back only a summarized result, not the sub-agent's whole conversation. Best-effort interleaving handles this naturally: each agent's contiguous turns merge, the handoff starts a new sample. The trainer never sees fabricated extension where there is none.
diff --git a/docs/assets/architecture.png b/docs/assets/architecture.png
index da160eb8b0..080a7e2710 100644
Binary files a/docs/assets/architecture.png and b/docs/assets/architecture.png differ
diff --git a/docs/assets/async-pipeline.png b/docs/assets/async-pipeline.png
new file mode 100644
index 0000000000..2eca7c0aa7
Binary files /dev/null and b/docs/assets/async-pipeline.png differ
diff --git a/docs/assets/rollout-timeline.png b/docs/assets/rollout-timeline.png
new file mode 100644
index 0000000000..0b118b5fbf
Binary files /dev/null and b/docs/assets/rollout-timeline.png differ
diff --git a/docs/assets/two-step-off-policy.png b/docs/assets/two-step-off-policy.png
deleted file mode 100644
index 366f1c26d3..0000000000
Binary files a/docs/assets/two-step-off-policy.png and /dev/null differ
diff --git a/docs/async.md b/docs/async.md
deleted file mode 100644
index f8eae7942e..0000000000
--- a/docs/async.md
+++ /dev/null
@@ -1,39 +0,0 @@
-# Asynchronous Training
-
-PRIME-RL implements asynchronous off-policy training, instead of the traditional synchronous on-policy training. The trainer and orchestrator/inference always run one step overlapped: while the trainer is producing $\pi_n$ from rollouts at step $n$, inference is already generating the rollouts for step $n+1$ using $\pi_{n-1}$. With trainer and inference step timings being equal, this allows to run without any idle time on either side.
-
-![Two-Step Off-Policy Training](assets/two-step-off-policy.png)
-
-## Loss Objective
-
-We adopt a loss objective capable of handling the natural distribution shift caused by the off-policy nature of the training. By default, we use a token-level loss variant of the [AIPO](https://arxiv.org/abs/2505.24034) training objective introduced in Llama-RL,
-but omit the entropy and KL loss terms.
-
-At each step, we sample $N$ prompts from our dataset. For
-each prompt $x$, we sample a group of rollouts $\{y_i\}^G_{i=1}$
-and use a verifier to assign scores $s_i$ to each $y_i$.
-Then, the optimization objective is given by
-
-$$
-\mathcal{J}_{\text{AIPO}}(\theta)
-= \frac{1}{\sum_{j=1}^N \sum_{i=1}^G |y_i^{(j)}|}
-\sum_{j=1}^N 
-\sum_{i=1}^G 
-\sum_{t=1}^{|y_i^{(j)}|}
-\min\left(
-\frac{\pi(y^{(j)}_{i,t}\mid x_j, y^{(j)}_{i,<t})}{\mu(y^{(j)}_{i,t}\mid x_j, y^{(j)}_{i,<t})},
-\delta
-\right)\hat{A}^{(j)}_{i,t}
-$$
-
-where $\mu$ refers to the policy that generated the rollout, $\pi$ refers to the current policy, $\hat{A}_{i,t}$ is the token-level advantage, and $\delta$ is the importance sampling clipping ratio.
-
-
-## Step Semantics
-
-PRIME-RL uses a global training step $n=1,2,3,\dots$ that is used to tag artifacts:
-
-- **Trainer**: Produces policy $\pi_n$ with weights $\theta_n$ from rollouts $(x_n, y_n)$
-- **Inference**: Produces rollouts $(x_n, y_n)$ from policy $\pi_{\max(0,\,n-1)}$
-
-We use 0-indexed steps to cleanly indicate that at each step, inference is exactly one step behind the trainer.
diff --git a/docs/benchmarking.md b/docs/benchmarking.md
deleted file mode 100644
index f75b03b23a..0000000000
--- a/docs/benchmarking.md
+++ /dev/null
@@ -1,79 +0,0 @@
-# Benchmarking
-
-We provide a convenient way to benchmark the performance, mainly measured in throughput and MFU, of the inference engine and trainer using the `--bench` flag. It will run each module in isolation for a few steps and log performance benchmark results in a rich table to the console.
-
-## SFT
-
-Benchmark on the default fake data configuration
-
-```bash
-uv run sft ... --data.type fake --bench
-```
-
-Benchmark with variable-length, instead of fixed-length, fake data to more closely simulate real data.
-
-```bash
-uv run sft ... --data.type fake --data.length variable --bench
-```
-
-Benchmark different batch configurations, i.e. the (micro) batch size and sequence length
-
-```bash
-uv run sft ... --data.type fake --data.seq-len 4096 --data.batch-size 64 --data.micro-batch-size 2 --bench
-```
-
-Benchmark against a real dataset
-
-```bash
-uv run sft ... --data.name PrimeIntellect/Reverse-Text-SFT --bench
-```
-
-Benchmark against a training configuration
-
-```bash
-uv run sft @ path/to/config.toml --bench
-```
-
-## RL
-
-### Trainer
-
-Benchmark on a fake data loader
-
-```bash
-uv run trainer ... --data.fake --bench
-```
-
-Benchmark different batch configurations, i.e. the (micro) batch size and sequence length
-
-```bash
-uv run trainer ... --model.seq-len 4096 --data.fake.batch-size 64 --data.fake.micro-batch-size 2 --bench
-```
-
-*Note that it is not yet possible to benchmark the RL trainer against real data when running it in isolation.*
-
-### Inference
-
-To benchmark the inference engine in isolation, start the inference server with the correct configuration file and run the orchestrator with the `--bench` flag.
-
-```bash
-uv run inference @ path/to/config.toml
-```
-
-```bash
-uv run orchestrator @ path/to/config.toml --bench
-```
-
-*Note that it is not yet possible to benchmark the inference engine against fake data.*
-
-## Trainer + Inference
-
-To benchmark the full RL training, you can add the `--bench` flag to your RL entrypoint. This will benchmark the RL trainer against fake data and the inference engine against real data from the orchestrator.
-
-```bash
-uv run rl   \
-  --trainer @ path/to/train.toml  \
-  --orchestrator @ path/to/orch.toml \
-  --inference @ path/to/infer.toml \
-  --bench
-```
\ No newline at end of file
diff --git a/docs/bring-your-own-algorithms.md b/docs/bring-your-own-algorithms.md
deleted file mode 100644
index a81549cacd..0000000000
--- a/docs/bring-your-own-algorithms.md
+++ /dev/null
@@ -1,142 +0,0 @@
-# Bring Your Own Algorithms
-
-Prime-RL supports custom implementations for key algorithmic components, allowing you to experiment with different RL objectives and techniques.
-
-## 1. Custom Loss Functions
-
-The loss is computed **per-sequence** (per-sample). You provide a function that computes the loss for a single sequence, and the framework handles iteration and aggregation.
-
-### Interface
-
-```python
-from prime_rl.trainer.rl.loss import LossInputs, LossOutputs
-
-def my_custom_loss(inputs: LossInputs, **kwargs) -> LossOutputs:
-    ...
-```
-
-#### LossInputs
-
-```python
-@dataclass
-class LossInputs:
-    trainer_logprobs: Float[Tensor, "seq"]      # Log probs from current policy
-    inference_logprobs: Float[Tensor, "seq"]    # Log probs from reference policy
-    teacher_logprobs: Float[Tensor, "seq"] | None  # Optional teacher log probs
-    advantages: Float[Tensor, "seq"]            # Per-token advantages
-    loss_mask: Bool[Tensor, "seq"]              # Mask for valid tokens
-```
-
-#### LossOutputs
-
-```python
-@dataclass
-class LossOutputs:
-    loss: Float[Tensor, ""]         # Scalar loss for this sequence
-    metrics: dict[str, Tensor]      # Metrics to log
-```
-
-### Example: PPO Clipped Loss
-
-```python
-import torch
-from prime_rl.trainer.rl.loss import LossInputs, LossOutputs
-
-def ppo_clip_loss(inputs: LossInputs, clip_eps: float = 0.2) -> LossOutputs:
-    ratio = torch.exp(inputs.trainer_logprobs - inputs.inference_logprobs)
-    clipped_ratio = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
-
-    surr1 = ratio * inputs.advantages
-    surr2 = clipped_ratio * inputs.advantages
-
-    loss = -torch.min(surr1, surr2)[inputs.loss_mask].sum()
-
-    return LossOutputs(
-        loss=loss,
-        metrics={"clip_frac": (ratio != clipped_ratio)[inputs.loss_mask].float().mean()},
-    )
-```
-
-### Configuration
-
-```toml
-[loss]
-type = "custom"
-import_path = "my_module.ppo_clip_loss"
-kwargs = { clip_eps = 0.2 }
-```
-
----
-
-## 2. Custom Advantage Functions
-
-Advantages are computed **per-group** (one example × N rollouts). You provide a function that computes advantages for a single group; the framework calls it once per group and stitches the results back together. Groups may have fewer than `group_size` rollouts when some rollouts in the group errored (partial-group training).
-
-### Interface
-
-```python
-from prime_rl.orchestrator.advantage import AdvantageInputs, AdvantageOutputs
-
-def my_custom_advantage(inputs: AdvantageInputs, **kwargs) -> AdvantageOutputs:
-    ...
-```
-
-#### AdvantageInputs
-
-```python
-@dataclass
-class AdvantageInputs:
-    # All rollouts for a single example (one group).
-    rollouts: list[vf.RolloutOutput]
-```
-
-Each `vf.RolloutOutput` carries the full rollout (`reward`, `trajectory`, etc.), so custom advantages can read any metadata they need (e.g. completion-token counts, turn counts, tool calls).
-
-#### AdvantageOutputs
-
-```python
-@dataclass
-class AdvantageOutputs:
-    advantages: list[float]   # one entry per rollout in the input group
-```
-
-### Example: Normalized Advantage
-
-```python
-import statistics
-from prime_rl.orchestrator.advantage import AdvantageInputs, AdvantageOutputs
-
-def normalized_advantage(inputs: AdvantageInputs, eps: float = 1e-8) -> AdvantageOutputs:
-    """Normalize advantages to zero mean and unit variance within the group."""
-    rewards = [r["reward"] for r in inputs.rollouts]
-    mean = statistics.fmean(rewards)
-    std = statistics.pstdev(rewards) if len(rewards) > 1 else 0.0
-    return AdvantageOutputs(advantages=[(r - mean) / (std + eps) for r in rewards])
-```
-
-### Configuration
-
-```toml
-[advantage]
-type = "custom"
-import_path = "my_module.normalized_advantage"
-kwargs = { eps = 1e-8 }
-```
-
----
-
-## Default Implementations
-
-If no custom function is specified:
-
-- **Loss**: Uses `default_loss_fn` (masked importance sampling with KL against the inference policy, and optional masking strategies)
-- **Advantage**: Uses `default_advantage_fn` (reward minus per-example baseline, a.k.a. DR-GRPO without std normalization)
-
-See `LossConfig` and `AdvantageConfig` for available parameters.
-
-## Tips
-
-- Your functions receive structured inputs via dataclasses with jaxtyping annotations
-- Return metrics as scalars or 1D tensors - they'll be aggregated automatically
-- Use the `loss_mask` / tensor shapes to handle variable-length sequences
-- Test your custom functions with the provided test patterns before training
diff --git a/docs/checkpointing.md b/docs/checkpointing.md
deleted file mode 100644
index ce929a2f57..0000000000
--- a/docs/checkpointing.md
+++ /dev/null
@@ -1,57 +0,0 @@
-# Checkpointing
-
-Checkpointing is non-standard due to trainer/orchestrator separation and natural asynchrony.
-
-- SFT+RL Trainer: Checkpoints FSDP model shard (using DCP), optimizer and scheduler state, and progress (training step, total samples, total tokens)
-- Orchestrator: Checkpoints orchestrator progress (training step, total tokens, total samples, total problems)
-- Inference: Inference is stateless. Upon restart, the orchestrator will reload the correct weights into the inference engine. No checkpointing is required.
-
-The default checkpoint directory is `checkpoints` and each checkpoint step will live in a step subdirectory, i.e. `checkpoints/step_{step}`.
-
-Checkpointing is configured with the config key `--ckpt`. You can specify the interval (`--ckpt.interval`), whether to save checkpoints asynchronously (`--ckpt.save-async`), how many recent step checkpoints to keep on disk (`--ckpt.keep-last`), and an interval at which checkpoints are kept permanently (`--ckpt.keep-interval`). By default, checkpointing is disabled to save disk space.
-
-## SFT
-
-Let's split the reverse text training SFT example, which does 40 steps by default, into two runs of 20 steps each. 
-
-First, run the first 20 steps with the `--ckpt` flag appended. This enables the default checkpoint configuration, which writes only the final checkpoint to disk and no intermediate checkpoints.
-
-```bash
-uv run sft ... --max-steps 20 --ckpt
-```
-
-Then, to resume the training from step 20, run the following command
-
-```bash
-uv run sft ... --max-steps 40 --ckpt.resume-step 20
-```
-
-## RL
-
-Similarly, let's split the reverse text training RL example, which does 20 steps by default, into two runs of 10 steps each. 
-
-First, start the inference server. It can stay running across restarts as the orchestrator will automatically send the right checkpoint to the inference server when resuming.
-
-```bash
-uv run inference ...
-```
-
-Then, run the first 10 steps and write the final checkpoint to disk
-
-```bash
-uv run rl \
-  --trainer @ path/to/train.toml \
-  --orchestrator @ path/to/orch.toml \
-  --max-steps 10 \
-  --ckpt
-```
-
-And finally, resume the training to do the remaining 10 steps
-
-```bash
-uv run rl \
-  --trainer @ path/to/train.toml \
-  --orchestrator @ path/to/orch.toml \
-  --max-steps 20 \
-  --ckpt.resume-step 10
-```
diff --git a/docs/configs.md b/docs/configs.md
deleted file mode 100644
index ea384a7c5c..0000000000
--- a/docs/configs.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Configs
-
-We use `pydantic-settings` with some custom functionality for configuring runs. We support the following sources, in this order of precedence:
-
-1. **Command-line arguments**: Pass (nested) arguments as `--key.subkey value` to the script. For example, to set the model name, set `--model.name <model-name>`
-
-2. **Config files**: You can pass TOML config files using the `@` prefix. For example, to set a config, run `uv run inference @ path/to/config.toml`. (*You have to leave a space between the `@` and the config file*)
-
-3. **Environment variables**: You can set environment variables to override the config values. All environment variables must be prefixed with `PRIME_` and use the `__` delimiter to nest the keys. For example, to set the model name you can run `export PRIME_MODEL__NAME=Qwen/Qwen3-0.6B`.
-
-4. **Defaults**: For almost all config arguments, we have a default value which will be used if no other source is provided.
-
-In general we recommend setting configurations via config files to define reproducible experiments and use command-line arguments to override the config values to run variants of the same experiment. Environment variables are usually only used in production settings to communicate with the [Prime Protocol](https://github.com/PrimeIntellect-ai/protocol) worker. In most cases, you should not need to use environment variables.
-
-The precedence order matters when multiple sources configure the same argument. For example, in the following setup, every source defines a model name
-
-```toml
-# qwen8b.toml
-[model]
-name = "Qwen/Qwen3-8B"
-```
-
-```toml
-# qwen14b.toml
-[model]
-name = "Qwen/Qwen-14B"
-```
-
-```bash
-PRIME_MODEL__NAME=Qwen/Qwen3-4B uv run ... @ qwen8b.toml @ qwen14b.toml --model.name Qwen/Qwen3-32B
-```
-
-In this example, the CLI argument `--model.name Qwen/Qwen3-32B` will take precedence and the script will use `Qwen/Qwen3-32B` as the model name. If the CLI argument wasn't set, then the second config file would take precedence and the script would use `Qwen/Qwen-14B` as the model name. If the second config file wasn't set, then the first config file would take precedence and the script would use `Qwen/Qwen3-8B` as the model name. Finally, if the first config file wasn't set, then the environment variable would take precedence and the script would use `Qwen/Qwen3-4B` as the model name. If the environment variable wasn't set, then the default value would be used and the script would use `Qwen/Qwen3-0.6B` as the model name.
diff --git a/docs/configuration.md b/docs/configuration.md
new file mode 100644
index 0000000000..077e7429ee
--- /dev/null
+++ b/docs/configuration.md
@@ -0,0 +1,207 @@
+# Configuration
+
+Every `prime-rl` entrypoint uses [`pydantic-config`](https://github.com/PrimeIntellect-ai/pydantic-config): TOML files for reproducible base configs, CLI flags for one-off overrides.
+
+> **AI agents working in this repo:** the equivalent runbook is at [`skills/configs/SKILL.md`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/skills/configs/SKILL.md), with extra runtime hints (where config classes live, validator conventions, the trainer-side `token_export` flag) that aren't surfaced here.
+
+## Table of Contents
+
+- [Sources and Precedence](#sources-and-precedence)
+- [TOML Composition](#toml-composition)
+- [CLI Overrides](#cli-overrides)
+- [Inspecting and Validating](#inspecting-and-validating)
+- [Syntax](#syntax)
+  - [Booleans](#booleans)
+  - [Lists](#lists)
+  - [Dicts](#dicts)
+  - [Optional Sub-Configs](#optional-sub-configs)
+  - [None](#none)
+  - [Discriminated Unions](#discriminated-unions)
+  - [Environments (`[[orchestrator.train.env]]`)](#environments-orchestratortrainenv)
+- [Examples](#examples)
+
+## Sources and Precedence
+
+Field values come from three sources — Pydantic defaults, TOML files (passed with `@`), and CLI flags. They're layered in this order, with later sources winning:
+
+1. **Defaults** declared on the Pydantic model.
+2. **TOML files** passed with `@`, left to right — later files override earlier ones.
+3. **CLI flags** in dotted, kebab-case form (`--model.name`).
+
+## TOML Composition
+
+The `@` token introduces a TOML file. Multiple `@` arguments compose left-to-right, deep-merged — unset fields in an overlay keep the base value:
+
+```bash
+uv run rl @ examples/reverse_text/rl.toml                      # one file
+uv run rl @ base.toml @ overlay.toml                           # left to right
+uv run rl --trainer @ trainer.toml --orchestrator @ orch.toml  # per-section
+uv run rl @ base.toml --trainer @ trainer.toml                 # mixed
+```
+
+> Mind the space: `@ path/to/x.toml`, not `@path/to/x.toml`.
+
+## CLI Overrides
+
+CLI flags mirror the TOML tree using dots:
+
+```bash
+--max-steps 50                              # top-level
+--model.name Qwen/Qwen3-4B                  # nested
+--trainer.optim.lr 1e-5                     # double-nested
+--inference.parallel.tp 4
+```
+
+> Field names are snake_case in TOML (`max_model_len`) and kebab-case on the CLI (`--max-model-len`).
+
+> Renamed fields keep their old name as a validation alias — e.g. `rollouts_per_example` is still accepted in TOML and CLI after being renamed to `group_size`. Mixing the two names across sources is safe.
+
+## Inspecting and Validating
+
+```bash
+uv run rl --help                                       # full schema
+uv run rl @ rl.toml --dry-run --output-dir /tmp/check  # write resolved configs
+```
+
+## Syntax
+
+### Booleans
+
+CLI uses paired flags: bare `--flag` sets `True`, `--no-flag` sets `False`. TOML must be explicit:
+
+```bash
+uv run rl @ rl.toml --clean-output-dir       # True
+uv run rl @ rl.toml --no-clean-output-dir    # False
+```
+
+```toml
+clean_output_dir = true
+```
+
+### Lists
+
+CLI accepts space-separated values or a JSON literal. TOML uses an array literal. Both forms target the same field:
+
+```bash
+uv run rl @ rl.toml --trainer.model.lora.target-modules q_proj k_proj v_proj
+uv run rl @ rl.toml --trainer.model.lora.target-modules '["q_proj", "k_proj", "v_proj"]'
+```
+
+```toml
+[trainer.model.lora]
+target_modules = ["q_proj", "k_proj", "v_proj"]
+```
+
+Overlay TOMLs **replace** lists wholesale — an overlay that wants to add one item must still spell out the full list. For arrays of tables (e.g. environments), see [Environments](#environments-orchestratortrainenv).
+
+### Dicts
+
+CLI takes a JSON literal. TOML uses a table or inline-table. CLI dicts deep-merge with TOML dicts — CLI keys win on conflict but don't wipe the file's keys:
+
+```bash
+uv run rl @ rl.toml --orchestrator.train.env.0.args \
+  '{"dataset_name": "openai/gsm8k", "dataset_subset": "main"}'
+```
+
+```toml
+[[orchestrator.train.env]]
+args = { dataset_name = "openai/gsm8k", dataset_subset = "main" }
+```
+
+### Optional Sub-Configs
+
+Many sub-configs are typed `SomeConfig | None`. Two patterns enable them:
+
+- **Bare flag with defaults**: `--model.compile` or, in TOML, an empty section `[model.compile]`. The sub-config materializes with all-default values.
+- **Enable and set fields together**: `--model.compile.fullgraph` (CLI) or any populated `[model.compile]` table (TOML).
+
+To **disable** a sub-config that's on by default, use `--no-<name>` on the CLI or assign the string `"None"` in TOML (see [None](#none)). This is how `[ckpt]`, `[model.lora]`, `[model.compile]`, `[trainer.wandb]`, etc. are turned on and off.
+
+### None
+
+TOML has no `null`. Use the string `"None"`, which the loader coerces:
+
+```toml
+[inference.model]
+max_model_len = "None"
+```
+
+On the CLI: `--inference.model.max-model-len None`.
+
+### Discriminated Unions
+
+Loss, advantage, optimizer, scheduler, weight broadcast transport, and several others are discriminated unions. Set the `type` field to pick a variant:
+
+```toml
+[trainer.optim]
+type = "muon"
+lr = 1e-5
+mu = 0.95
+```
+
+Omit `type` to keep the default variant.
+
+### Environments (`[[orchestrator.train.env]]`)
+
+Training environments are an array of tables — set one per env, optionally with sampling weights:
+
+```toml
+[[orchestrator.train.env]]
+id = "math-env"
+name = "gsm8k"
+args = { dataset_name = "openai/gsm8k", dataset_subset = "main" }
+
+[[orchestrator.train.env]]
+id = "reverse-text"
+ratio = 0.25  # 25% of batches; remaining 75% goes to math-env
+
+[[orchestrator.eval.env]]
+id = "math-env"
+name = "gsm8k-eval"
+args = { dataset_name = "openai/gsm8k", dataset_subset = "main" }
+```
+
+`args` is forwarded verbatim to the environment's `load_environment(**args)`.
+
+The same `id` can appear multiple times across train and eval (or with different `args`) — useful for evaluating on a held-out split of the env you're training on, or comparing two configurations of the same env side by side. When `id` is reused, set a distinct `name` on each entry; `name` defaults to `id` and must be unique across all envs in the same group.
+
+## Examples
+
+The shipped end-to-end examples in [`examples/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples) are the canonical, kept-up-to-date references — the rest of the repo's TOMLs (under `configs/`) are CI- and debug-internal and may drift. Each example directory has its own README with the full launch story.
+
+**Basic** (1–8 GPUs):
+
+- [**Reverse Text**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/reverse_text) — `Qwen3-0.6B` reversing a chunk of text. Tiny single-turn SFT + RL; runs on a single consumer GPU in minutes.
+- [**Wordle**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/wordle) — `Qwen3-1.7B` playing Wordle. Multi-turn SFT + RL; 2–4 H100s.
+- [**Alphabet Sort**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/alphabet_sort) — `Qwen3-4B-Instruct-2507` sorting names alphabetically. Multi-turn LoRA RL without SFT warmup; one H100.
+- [**Wiki Search**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/wiki_search) — `Qwen3-4B-Instruct-2507` answering trivia by web-searching Wikipedia. Multi-turn with tool use.
+- [**Hendrycks Sanity**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/hendrycks_sanity) — `DeepSeek-R1-Distill-Qwen-1.5B` on a filtered MATH subset. Useful for algorithm ablations.
+
+**Advanced** (32–2048 GPUs, SLURM):
+
+- [**Qwen 3 30B – A3B Math**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/qwen30b_math) — `Qwen3-30B-A3B` on hard math.
+- [**Qwen 3 30B – A3B SWE**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/qwen30b_swe) — `Qwen3-30B-A3B` on hard SWE.
+- [**INTELLECT-3.1**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/Intellect-3.1) — reproduces our INTELLECT-3.1 training run.
+- [**MiniMax-M2.5 SWE**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/minimax_m2.5_swe) — `MiniMax-M2.5` on agentic SWE.
+- [**High-throughput GLM-5**](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/glm5_pd_disag) — `GLM-5` with P/D disaggregation and FP8 inference.
+
+### Worked Example: Compose, Override, Dry-Run
+
+Start from a shipped base config, override two fields on the CLI, and dry-run:
+
+```bash
+uv run rl @ examples/reverse_text/rl.toml \
+  --wandb.name my-experiment \
+  --trainer.optim.lr 5e-6 \
+  --output-dir /tmp/reverse-dry \
+  --dry-run
+```
+
+Then inspect the resolved config:
+
+```bash
+ls /tmp/reverse-dry/configs/
+# rl.toml  trainer.toml  orchestrator.toml  inference.toml
+```
+
+Each per-process TOML reflects the final, validated configuration that the actual run would consume — exactly what each process sees when started standalone (`uv run trainer @ /tmp/reverse-dry/configs/trainer.toml`, etc.). This is the easiest way to bisect a misbehaving config: dry-run a known-good base, dry-run your overlay, diff the two.
diff --git a/docs/deployment.md b/docs/deployment.md
deleted file mode 100644
index 38d911b122..0000000000
--- a/docs/deployment.md
+++ /dev/null
@@ -1,299 +0,0 @@
-# Deployment
-
-You can deploy PRIME-RL on a single GPU and larger multi-node clusters.
-
-## SFT
-
-### Single-GPU
-
-For training on a single GPU, no communication orchestration is required and you can choose whether to start your trainer using our trainer entrypoint or using `torchrun`.
-
-To start with our `sft` entrypoint
-
-```bash
-uv run sft ...
-```
-
-To do the same thing, but using `torchrun`
-
-```bash
-uv run torchrun src/prime_rl/trainer/sft/train.py ...
-```
-
-### Multi-GPU
-
-For training on multiple GPUs, use `torchrun` with the `--nproc-per-node` flag.
-
-```bash
-uv run torchrun \
-  --local-rank-filter 0 \
-  --nproc-per-node 8 \
-  src/prime_rl/trainer/sft/train.py ...
-```
-
-*The `--local-rank-filter` flag is used to only log the logs from the master rank, as detailed in [logging](logging.md).*
-
-### Multi-Node
-
-For training on multiple nodes, use `torchrun` with the `--nnodes`, `--node-rank`, and `--rdzv-endpoint` flags.
-
-First, decide which node will be your head node and find a reachable private IP address for it. If your nodes are not colocated, you will likely need to set up a VPN (e.g. [Tailscale](https://tailscale.com)) for the nodes to reach each other.
-
-(*Skip this step if the default network interface is sufficient.*) Make sure to set the network interface for GLOO and NCCL to one that allows all nodes to reach each other.
-
-```bash
-# On both nodes
-export GLOO_SOCKET_IFNAME=...
-export NCCL_SOCKET_IFNAME=...
-```
- 
-Then, configure the rendezvous endpoint to allow the nodes to find each other. Here, `MASTER_ADDR` is the private IP address of the head node and `MASTER_PORT` is a free port on the head node, typically port 29500 for `torchrun`.
-
-```bash
-# On both nodes
-export MASTER_ADDR=...
-export MASTER_PORT=...
-```
-
-Then, on the head node, run
-
-```bash
-# On node 0
-uv run torchrun \
-  --nnodes 2 \
-  --node-rank 0 \
-  --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT \
-  --local-rank-filter 0 \
-  --nproc-per-node 8 \
-  src/prime_rl/trainer/sft/train.py ...
-```
-
-And on the second node, run
-
-```bash
-# On node 1
-uv run torchrun \
-  --nnodes 2 \
-  --node-rank 1 \
-  --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT \
-  --local-rank-filter 0 \
-  --nproc-per-node 8 \
-  src/prime_rl/trainer/sft/train.py ...
-```
-
-### SLURM
-
-See the dedicated [SLURM guide](slurm.md).
-
-## Inference
-
-For SLURM-based inference deployment, see the [SLURM guide](slurm.md#inference-examples). Each node runs an independent vLLM replica — no manual coordination needed.
-
-For manual multi-node deployment without SLURM, we rely on vLLM's multi-node data parallel deployment primitives ([docs](https://docs.vllm.ai/en/v0.10.0/serving/data_parallel_deployment.html)).
-
-First, decide which node will be your head node and find a reachable private IP address for it. If your nodes are not colocated, you will likely need to set up a VPN (e.g. [Tailscale](https://tailscale.com)) for the nodes to reach each other.
-
-(*Skip this step if the default network interface is sufficient.*) Make sure to set the network interface for GLOO and NCCL to one that allows all nodes to reach each other.
-
-```bash
-# On both nodes
-export GLOO_SOCKET_IFNAME=...
-export NCCL_SOCKET_IFNAME=...
-```
- 
-Then, configure the data parallel address as the private IP address of the head node.
-
-```bash
-# On both nodes
-export DATA_PARALLEL_ADDRESS=...
-export DATA_PARALLEL_RPC_PORT=...
-```
-
-To run TP=4 and DP=4 with DP ranks 0 and 1 on the head node and DP ranks 2 and 3 on the second node, run
-
-```bash
-# On node 0
-uv run inference \
-	--data-parallel-size 4 \
-	--tensor-parallel-size 4 \
-	--data-parallel-size-local 2 \
-	--data-parallel-address $DATA_PARALLEL_ADDRESS \
-	--data-parallel-rpc-port $DATA_PARALLEL_RPC_PORT
-```
-
-```bash
-# On node 1
-uv run inference \
-	--data-parallel-size 4 \
-	--tensor-parallel-size 4 \
-	--data-parallel-size-local 2 \
-	--data-parallel-address $DATA_PARALLEL_ADDRESS \
-	--data-parallel-rpc-port $DATA_PARALLEL_RPC_PORT \
-	--data-parallel-start-rank 2 \
-	--headless
-```
-
-## RL
-
-### Single-GPU Training
-
-If you only have access to a single GPU, you may still be able to run small RL experiments. To do so, configure your inference server to use only a fraction of the available memory to leave some space for the trainer.
-
-For example, to run an RL training on a single GPU while using 50% of the available memory for the inference server, run
-
-```bash
-bash scripts/tmux.sh
-```
-
-```bash
-uv run rl \
-  --trainer @ path/to/train.toml \
-  --orchestrator @ path/to/orch.toml \
-  --inference @ path/to/infer.toml \
-  --trainer-gpu-ids 0 \
-  --inference-gpu-ids 0 \
-  --inference.gpu-memory-utilization 0.5
-```
-
-*Make sure to tune the `--gpu-memory-utilization` value such that you have enough GPU memory for the RL trainer.* 
-
-You can also set this up by starting each submodule manually.
-
-```bash
-# Run this in the `Inference` pane
-uv run inference @ path/to/infer.toml --gpu-memory-utilization 0.5
-```
-
-```bash
-# Run this in the `Orchestrator` pane
-uv run orchestrator @ path/to/orch.toml
-```
-
-```bash
-# Run this in the `Trainer` pane
-uv run trainer @ path/to/train.toml
-```
-
-### Multi-GPU Training
-
-For single-node training, we recommend using the `rl` entrypoint to conveniently start all components, i.e. the inference server, the orchestrator, and the trainer. 
-
-By default, the inference server starts on GPU ID 0 and the trainer on GPU ID 1.
-
-```bash
-uv run rl \
-  --trainer @ path/to/train.toml \
-  --orchestrator @ path/to/orch.toml \
-  --inference @ path/to/infer.toml \
-```
-
-You can configure the GPU IDs to use for the inference server and the trainer. For example, to run the inference server on GPU IDs 0-5 with data parallelism and the trainer on GPU IDs 6-7
-
-```bash
-uv run rl \
-  --trainer @ path/to/train.toml \
-  --orchestrator @ path/to/orch.toml \
-  --inference @ path/to/infer.toml \
-  --inference-gpu-ids 0,1,2,3,4,5 \
-  --trainer-gpu-ids 6,7 \
-  --inference.parallel.dp 6
-```
-
-### Parallel Experiments
-
-For quick ablations, it can be more efficient to parallelize experiments within a node (e.g. split your GPUs to run two experiments in parallel). For example, if you have access to 4 GPUs and your experiment fits on 2 GPUs, you can parallelize two experiments as follows:
-
-Start the first experiment in a tmux session `exp1` with outputs directory `outputs1`. Specify it both in the tmux script, as well as in the start command (*will use the first 2 GPUs*)
-
-```bash
-bash scripts/tmux.sh -s exp1 -o outputs1
-```
-
-```bash
-# Run this in the `Trainer` pane
-uv run rl \
-  --trainer @ path/to/train.toml \
-  --orchestrator @ path/to/orch.toml \
-  --inference @ path/to/infer.toml \
-  --output-dir outputs1
-```
-
-For the second experiment, start a second tmux session named `exp2` with outputs directory `outputs2`. In addition, specify a new server port for the inference engine and orchestrator (*will use the last 2 GPUs*)
-
-```bash
-bash scripts/tmux.sh -s exp2 -o outputs2
-```
-
-```bash
-# Run this in the `Trainer` pane
-uv run rl \
-  --trainer @ path/to/train.toml \
-  --orchestrator @ path/to/orch.toml \
-  --inference @ path/to/infer.toml \
-  --inference-gpu-ids 2 \
-  --trainer-gpu-ids 3 \
-  --inference.server.port 8001 \
-  --orchestrator.client.base-url http://localhost:8001/v1 \
-  --output-dir outputs2
-```
-
-### Multi-Node Training
-
-> We currently require a shared file system for multi-node RL training.
-
-To facilitate multi-node RL training, ensure that all nodes have access to a shared file system and that the node that will run the inference server is reachable from the orchestrator via a private or public IP address. Then, set the following environment variables on all nodes:
-
-```bash
-# On all nodes
-export OUTPUT_DIR=...               # Path to directory in shared file system
-export INFERENCE_SERVER_IP=...      # Reachable IP address of the inference node
-export INFERENCE_SERVER_API_KEY=... # API key for the inference server
-```
-
-Then, start the inference server on one node.
-
-```bash
-# On one node
-uv run inference ... \
-    --api-key $INFERENCE_SERVER_API_KEY --parallel ...
-```
-
-Then, start a single orchestrator
-
-```bash
-# On either node
-uv run orchestrator ... \
-    --client.base-url http://$INFERENCE_SERVER_IP:8000/v1 \
-    --client.api-key-var INFERENCE_SERVER_API_KEY \
-    --output-dir $OUTPUT_DIR
-```
-
-Finally, start the trainer on the other node as described in the [Trainer](#trainer) section.
-
-```bash
-# On other node
-uv run torchrun \
-    --nproc-per-node 8 \
-    --local-rank-filter 0 \
-    src/prime_rl/trainer/rl/train.py ... \
-    --output-dir $OUTPUT_DIR
-```
-
-Of course, you can further scale up the number of nodes used by the trainer and inference server, as described in the sections above. However, make sure that there is only a single orchestrator instance.
-
-### SLURM
-
-See the dedicated [SLURM guide](slurm.md).
-
-## Kubernetes
-
-For deployments on Kubernetes clusters, PRIME-RL provides a Helm chart that manages the entire training infrastructure including orchestrator, trainer, and inference components with automatic pod scheduling, GPU allocation, and shared storage.
-
-See the dedicated [Kubernetes guide](kubernetes.md) for complete documentation including:
-
-- Prerequisites and setup
-- Quick start examples
-- Component architecture
-- Scaling and distributed training
-- Configuration options
-- Troubleshooting
diff --git a/docs/development.md b/docs/development.md
new file mode 100644
index 0000000000..23bc983a6f
--- /dev/null
+++ b/docs/development.md
@@ -0,0 +1,131 @@
+# Development
+
+This page covers workflows for developing on `prime-rl` itself — running the test suite, contributing changes, and adding new model architectures with the small-scale tooling we use to iterate on MoE families without booting up a 100B+ run.
+
+## Table of Contents
+
+- [Test Suite](#test-suite)
+  - [Layout](#layout)
+  - [Running Tests Locally](#running-tests-locally)
+  - [CI Workflows](#ci-workflows)
+  - [Markers](#markers)
+- [Pre-Commit Hooks](#pre-commit-hooks)
+- [Adding a New Model](#adding-a-new-model)
+  - [Implement the Modeling Code](#implement-the-modeling-code)
+  - [Register a Mini Preset](#register-a-mini-preset)
+  - [Run the Smoke Test](#run-the-smoke-test)
+
+## Test Suite
+
+The test suite is split into three tiers (unit, integration, and nightly), which are run by the three CI workflows listed under [CI Workflows](#ci-workflows).
+
+### Layout
+
+- **`tests/unit/`** — fast-running, hermetic tests for isolated logic: config parsing and validation, advantage / loss / scheduler / packer math, individual dataset paths, model-conversion roundtrips, etc. Tests that need a GPU are tagged with the `gpu` marker.
+- **`tests/integration/`** — full-stack RL/SFT runs on a tiny model end-to-end through inference + orchestrator + trainer.
+- **`tests/nightly/`** — runs the configs in [`examples/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples) every night to catch regressions in the shipped examples.
+
+### Running Tests Locally
+
+```bash
+uv run pytest -v                                           # everything
+uv run pytest tests/unit -v                                # unit only
+uv run pytest tests/integration -v                         # integration only
+uv run pytest -v -m "not gpu"                              # CPU-only subset (mirrors CPU CI)
+uv run pytest -v -m gpu                                    # GPU-only subset
+uv run pytest tests/integration/test_reverse_text.py -vvs  # one specific scenario
+```
+
+### CI Workflows
+
+| Workflow | Trigger | What runs | Where |
+|---|---|---|---|
+| [`cpu_tests.yaml`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/.github/workflows/cpu_tests.yaml) | every PR + push to `main` | `pytest tests/unit -m "not gpu"`, plus a slim-wheel install check that `prime-rl-configs` imports cleanly without heavy deps (no torch / vllm / transformers / wandb / verifiers / datasets / liger / loguru in `sys.modules`) | `ubuntu-latest` |
+| [`gpu_tests.yaml`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/.github/workflows/gpu_tests.yaml) | every non-draft PR + push to `main` | `pytest tests/unit -m gpu`, plus a matrix of named integration scenarios (`reverse_text`, `reverse_text_sft`, `reverse_text_lora`, `reverse_text_moe`, `reverse_text_multi_run`, `reverse_text_rl_opd`, `reverse_text_rl_sft`, `reverse_text_sft_lora`, `alphabet_sort`, `benchmark_regression`) | self-hosted GPU runners (`vm`, `4xa6000`) |
+| [`nightly_tests.yaml`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/.github/workflows/nightly_tests.yaml) | 03:00 PST daily + manual `workflow_dispatch` (single-file filter optional) | every file in `tests/nightly/`, one matrix job per file | `research-cluster` |
+
+The GPU + Nightly workflows skip drafts — open the PR as **Draft** until you're ready to consume CI compute, then mark it ready for review to trigger the GPU matrix.
+
+### Markers
+
+Two pytest markers are declared in `pyproject.toml`, which also sets `addopts = "--strict-markers"` so that any undeclared marker raises an error:
+
+- `gpu` — gate a test that needs CUDA. CPU CI uses `-m "not gpu"`; the GPU unit job uses `-m gpu`.
+- `slow` — gate a test that's expensive enough you'd usually skip it locally. Deselect with `-m "not slow"`.
+
+## Pre-Commit Hooks
+
+Install the [pre-commit](https://pre-commit.com) hooks before your first commit so that ruff's check and format steps run automatically on staged Python files:
+
+```bash
+uv run pre-commit install
+```
+
+## Adding a New Model
+
+Bringing up a new model family is three steps: implement the modeling code, register a mini preset, and run the smoke test. The preset and smoke test let you iterate on the modeling code at ~0.5B scale on 1–2 GPUs instead of paying the cost of the full-size model — useful for catching bugs in modeling code, state-dict conversions, and pipeline integration before scaling.
+
+### Implement the Modeling Code
+
+Drop the modeling code under `src/prime_rl/trainer/models/<arch>/` (HF-compatible config, modeling, and weight conversion). Mirror the layout of an existing family — `glm4_moe/` or `qwen3_moe/` are good starting points.
+
+### Register a Mini Preset
+
+Add an entry to [`scripts/mini_moe.py`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/scripts/mini_moe.py) so the smoke-test workflow can build a ~0.5B test model in your architecture. The preset names the config class, picks small dimensions, and wires up the HF + PrimeRL model classes plus a tokenizer source:
+
+```python
+ARCH_PRESETS = {
+    "glm4_moe": {
+        "config_class": Glm4MoeConfig,
+        "config_kwargs": dict(hidden_size=1024, num_hidden_layers=24, n_routed_experts=8, ...),
+        "hf_model_class": HFGlm4MoeForCausalLM,
+        "prime_model_class": PrimeRLGlm4MoeForCausalLM,
+        "tokenizer_source": "THUDM/GLM-4-9B-0414",
+    },
+    # add your arch here
+}
+```
+
+### Run the Smoke Test
+
+Build the mini model. This creates a ~543M-parameter GLM-4 MoE (1024 hidden, 24 layers, 8 experts) with random weights, copies the tokenizer from the original GLM-4 model, and verifies the HF↔PrimeRL roundtrip is lossless:
+
+```bash
+uv run python scripts/mini_moe.py --arch glm4_moe --output-dir ./mini-glm-moe
+```
+
+To re-verify the roundtrip after a modeling-code change without re-creating the model:
+
+```bash
+uv run python scripts/mini_moe.py --arch glm4_moe --output-dir ./mini-glm-moe --verify-only
+```
+
+Warm up the random-weight mini model with SFT on reverse-text so KL divergence becomes meaningful in the RL phase. Loss drops from ~12 to ~2.5 — the output won't be coherent, but the distribution is non-trivial. A pre-built SFT'd checkpoint lives at [samsja/mini-glm-moe](https://huggingface.co/samsja/mini-glm-moe) if you want to skip this step:
+
+```bash
+uv run sft @ configs/debug/moe/sft/train.toml \
+  --model.name ./mini-glm-moe \
+  --data.name PrimeIntellect/Reverse-Text-SFT \
+  --data.type null \
+  --max_steps 200 \
+  --optim.lr 1e-4 \
+  --ckpt.weights
+```
+
+Then run the full RL stack on reverse-text:
+
+```bash
+uv run rl @ configs/ci/integration/reverse_text_moe/start.toml \
+  --model.name samsja/mini-glm-moe \
+  --trainer.model.impl custom \
+  --inference.gpu-memory-utilization 0.7 \
+  --inference.model.max-model-len 2048
+```
+
+What to look for:
+
+- **No crashes.** Validates the full inference + orchestrator + trainer pipeline end-to-end.
+- **Finite, non-zero KL.** Confirms the reference distribution is meaningful.
+- **Loss reasonable.** Not NaN, not stuck.
+
+Don't expect reward to climb meaningfully in 20 steps on a random model.
diff --git a/docs/disaggregated-inference.md b/docs/disaggregated-inference.md
deleted file mode 100644
index 65f5dacf84..0000000000
--- a/docs/disaggregated-inference.md
+++ /dev/null
@@ -1,91 +0,0 @@
-# Disaggregated Prefill/Decode Inference
-
-Run MoE models with separate prefill and decode node groups for higher throughput.
-
-## Quick Start
-
-See [`configs/glm5_disagg_inference/inference.toml`](../configs/glm5_disagg_inference/inference.toml) for an example config.
-
-```bash
-uv run inference @ configs/glm5_disagg_inference/inference.toml --output-dir /data/$USER/outputs
-```
-
-## Prefill/Decode Ratio
-
-| Workload | Recommended ratio (P:D) | Why |
-|---|---|---|
-| Agentic (SWE, Lean) | **3:1** | Long growing contexts → prefill-heavy |
-| Non-agentic (math, chat) | **1:2** | Short prompts, long generations → decode-heavy |
-
-Monitor live queue depths:
-```bash
-curl -s http://<prefill_node>:8100/metrics | grep num_requests_waiting
-curl -s http://<decode_node>:8200/metrics | grep num_requests_waiting
-```
-
-If prefill has queued requests and decode has zero, add more prefill nodes (and vice versa).
-
-For historical averages (cumulative over the entire run), query the histogram metrics:
-```bash
-# Average queue time per request (seconds)
-curl -s http://<node>:<port>/metrics | awk '
-  /request_queue_time_seconds_sum\{/  { sum += $2 }
-  /request_queue_time_seconds_count\{/ { count += $2 }
-  END { if (count > 0) printf "avg queue: %.2fs (%d requests)\n", sum/count, count }
-'
-
-# Average prefill/decode compute time
-curl -s http://<node>:<port>/metrics | awk '
-  /request_prefill_time_seconds_sum\{/  { ps += $2 }
-  /request_prefill_time_seconds_count\{/ { pc += $2 }
-  /request_decode_time_seconds_sum\{/   { ds += $2 }
-  /request_decode_time_seconds_count\{/  { dc += $2 }
-  END {
-    if (pc > 0) printf "avg prefill: %.2fs\n", ps/pc
-    if (dc > 0) printf "avg decode:  %.2fs\n", ds/dc
-  }
-'
-```
-
-Other useful metrics on the `/metrics` endpoint:
-- `vllm:e2e_request_latency_seconds` — end-to-end latency
-- `vllm:kv_cache_usage_perc` — KV cache memory pressure
-- `vllm:nixl_xfer_time_seconds` — NIXL KV transfer duration
-- `vllm:nixl_bytes_transferred` — bytes per KV transfer
-
-## UCX 1.19
-
-NVSHMEM requires UCX >= 1.19 for multi-GPU CUDA support. Most clusters ship UCX 1.17 (via HPC-X), which causes `cuStreamCreate: invalid device context` errors during DeepEP internode dispatch.
-
-**Check your version:**
-```bash
-/opt/hpcx/ucx/bin/ucx_info -v | head -1
-# If < 1.19, you need to build from source
-```
-
-**Build UCX 1.19 (run once on a GPU node):**
-```bash
-salloc -N 1 --gres=gpu:1 bash -c 'bash scripts/install_nixl_from_source.sh'
-```
-
-This installs UCX 1.19 to `prime-rl/third_party/ucx/`. The sbatch template automatically adds it to `LD_LIBRARY_PATH`, overriding the system version.
-
-## Troubleshooting
-
-### `DeepEP error: timeout (dispatch CPU)`
-NVSHMEM internode communication failing. Check:
-1. UCX version >= 1.19? (`third_party/ucx/bin/ucx_info -v`)
-2. NVSHMEM libs reachable at `/tmp/deepep_build/nvshmem/lib/`? If not:
-   ```bash
-   ssh <node> 'mkdir -p /tmp/deepep_build/nvshmem && \
-       ln -sfn <venv>/lib/python3.12/site-packages/nvidia/nvshmem/lib \
-       /tmp/deepep_build/nvshmem/lib'
-   ```
-3. IBGDA driver enabled? `ssh <node> 'cat /proc/driver/nvidia/params | grep EnableStreamMemOPs'` should show `1`.
-
-### Router healthy but requests hang
-NIXL side channel not running on prefill. Check:
-```bash
-ssh <prefill_node> 'ss -tlnp sport ge :5600 sport le :5608 | grep -c LISTEN'
-# Should show 8 (one per DP rank). If 0, check logs for UCX/NVSHMEM errors.
-```
diff --git a/docs/entrypoints.md b/docs/entrypoints.md
deleted file mode 100644
index 6a97a60b5b..0000000000
--- a/docs/entrypoints.md
+++ /dev/null
@@ -1,67 +0,0 @@
-# Entrypoints
-
-## RL
-
-The main usecase of PRIME-RL is RL training. Three main abstractions facilitate RL training: the **orchestrator**, the **trainer**, and the **inference** service.
-
-![Architecture](assets/architecture.png)
-
-### Orchestrator
-
-The orchestrator is a lightweight CPU process that handles the core data and scheduling logic, serving as an intermediary between the trainer and inference service with bidirectional relays. In one direction, it collects rollouts from the inference server, assembles them into packed batches, and dispatches them to the trainer; in the other direction, it relays updated model weights from the trainer to the inference service. The orchestrator utilizes `verifiers` environments to abstract multi-turn rollout generation and scoring. Each training and evaluation environment is exposed as a `vf.EnvServer` as a sidecar to the orchestrator process (default) or as a standalone process (e.g. used in hosted training to run environments in containers).
-
-### Trainer
-
-The trainer is responsible for producing an updated policy model given rollouts and advantages. We use FSDP2 as the backend with compatibility for any HuggingFace (HF) model. For some models we also provide custom implementations, mostly for performance reasons. FSDP shards model parameters, gradients, and optimizer states, allowing training large models with data parallelism and minimal GPU memory footprint. We support a variety of popular training objectives, such as GRPO, GSPO, OPO, RLOO and [CISPO](https://arxiv.org/abs/2506.13585). The trainer is inspired by [`torchtitan`](https://github.com/pytorch/torchtitan) and relies on native PyTorch features to implement advanced parallelism techniques, such as tensor, context or expert parallelism.
-
-### Inference
-
-The inference service in its simplest form is a standard OpenAI-compatible server with a vLLM backend. The API specification is extended with a custom `update_weights` endpoint to reload model weights from a HF-compatible checkpoint on disk. Otherwise, we rely on vLLM's optimized kernels, parallelism strategies, and scheduling for fast rollout generation. Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines with a shared request pool, allowing operation across multiple clusters and straightforward integration of alternative inference engines (e.g. SGLang, Tokasaurus). We also heavily rely on native data parallelism in vLLM (also available in SGLang) for orchestrating the fleet of nodes dedicated to inference.
-
-### RL
-
-For doing RL training all components need to be started. One can do this manually:
-
-```bash
-uv run inference ...
-```
-
-```bash
-uv run orchestrator ...
-```
-
-```bash
-uv run trainer ...
-```
-
-Or, alternatively on a single node, use the `rl` entrypoint to start all components.
-
-```bash
-uv run rl \
-    --trainer @ path/to/train.toml \
-    --orchestrator @ path/to/orch.toml \
-    --inference @ path/to/infer.toml \
-    ...
-```
-
-For more details on multi-node deployment options, see the [deployment](deployment.md) documentation and see the [examples](examples) for concrete training configurations. To see all available configuration options, run `uv run rl --help`.
-
-## SFT
-
-We provide a fairly straight-forward SFT trainer which is capable of fine-tuning any conversational model on multi-turn conversation with tool calling. It shares a lot of components with the RL trainer, such as the modeling code, parallelism techniques, checkpoint format, logger, etc. which ensures a seamless post-training workflow.
-
-To start an SFT training, you need to prepare a conversational dataset in either [prompt-completion format](https://huggingface.co/docs/trl/en/dataset_formats#prompt-completion) or raw `messages` format. If `messages` is provided, the trainer interprets the full conversation as a single sample with an empty prompt and applies role-based loss masking across the whole chat. If both `messages` and `prompt` / `completion` are present, `messages` takes precedence. Single-turn fine-tuning should be compatible with the chat templates of most models. However, to properly handle loss masking, we require that the tokenizer's chat template satisfies a prefix property: the tokenization of any conversation prefix must be a prefix of the tokenization of the full conversation. For instance, tokenizing message 1 should yield a token sequence that forms a prefix of tokenizing messages 1 and 2, which in turn should be a prefix of tokenizing messages 1, 2, 3, and so forth. An example of a chat template that *does not* satisfy this property is Qwen3's chat template, as it strips away past think sections.
-
-On a single GPU, start the training with the `sft` entrypoint
-
-```bash
-uv run sft ...
-```
-
-If you have access to multiple GPUs, use [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) with `--nproc-per-node` to start the training. 
-
-```bash
-uv run torchrun --nproc-per-node 8 src/prime_rl/trainer/sft/train.py ...
-```
-
-For more details on multi-node deployment options, see the [deployment](deployment.md) documentation and see the [examples](examples) for concrete training configurations. To see all available configuration options, run `uv run sft --help`.
diff --git a/docs/environments.md b/docs/environments.md
deleted file mode 100644
index 69fe15e625..0000000000
--- a/docs/environments.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# Environments
-
-PRIME-RL can train and evaluate in any [`verifiers`](https://github.com/willccbb/verifiers) environments. To train in a new environment, simply install it from the [Environment Hub](https://app.primeintellect.ai/dashboard/environments) or install a local environment.
-
-## Installation
-
-You can explore the installation options using
-
-```bash
-prime env info <owner>/<name>
-```
-
-To install an environment temporarily
-
-```bash
-prime env install <owner>/<name>
-# Or: uv pip install <name> --extra-index-url https://hub.primeintellect.ai/<owner>/simple/
-```
-
-To install a local environment
-
-```bash
-uv pip install -e path/to/env
-```
-
-To verify your installation
-
-```bash
-uv run python -c "import <name>"
-```
-
-For more details on environments, see our Environments Hub documentation [here](https://docs.primeintellect.ai/tutorials-environments/environments).
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
deleted file mode 100644
index aa76871f8c..0000000000
--- a/docs/index.md
+++ /dev/null
@@ -1,16 +0,0 @@
-# Docs
-
-This directory maintains the documentation for PRIME-RL. It is organized into the following sections:
-
-- [**Entrypoints**](entrypoints.md) - Overview of the main components (orchestrator, trainer, inference) and how to run SFT, RL, and evals
-- [**Configs**](configs.md) - Configuration system using TOML files, CLI arguments, and environment variables
-- [**Environments**](environments.md) - Installing and using verifiers environments from the Environments Hub
-- [**Async Training**](async.md) - Understanding asynchronous off-policy training and step semantics
-- [**Logging**](logging.md) - Logging with loguru, torchrun, and Weights & Biases
-- [**Platform Monitoring**](platform-monitoring.md) - Register runs on the Prime Intellect platform and stream training metrics
-- [**MultiRunManager**](multi_run_manager.md) - Multi-run training with the MultiRunManager object for concurrent LoRA adapters
-- [**Checkpointing**](checkpointing.md) - Saving and resuming training from checkpoints
-- [**Benchmarking**](benchmarking.md) - Performance benchmarking and throughput measurement
-- [**Deployment**](deployment.md) - Training deployment on single-GPU, multi-GPU, and multi-node clusters
-- [**Kubernetes**](kubernetes.md) - Deploying PRIME-RL on Kubernetes with Helm
-- [**Troubleshooting**](troubleshooting.md) - Common issues and their solutions
\ No newline at end of file
diff --git a/docs/kubernetes.md b/docs/kubernetes.md
deleted file mode 100644
index f718f1df01..0000000000
--- a/docs/kubernetes.md
+++ /dev/null
@@ -1,308 +0,0 @@
-# Kubernetes
-
-This guide covers deploying PRIME-RL training infrastructure on Kubernetes clusters using the provided Helm chart.
-
-## Prerequisites
-
-- Kubernetes cluster with GPU nodes
-- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html) installed
-- [Helm 3.x](https://helm.sh/docs/intro/install/) installed
-- Storage class that supports `ReadWriteMany` (e.g., NFS, CephFS, or cloud provider storage)
-
-### Verify Prerequisites
-
-```bash
-# Check Helm installation
-helm version
-
-# Check GPU operator
-kubectl get pods -n gpu-operator
-
-# Check available storage classes
-kubectl get storageclass
-```
-
-## Quick Start
-
-### 1. Deploy
-
-```bash
-# Deploy with a release name
-helm install my-exp ./k8s/prime-rl -f ./k8s/prime-rl/examples/reverse-text.yaml
-
-# Or with defaults (no example-specific config)
-helm install my-exp ./k8s/prime-rl --set trainer.replicas=3 --set inference.replicas=2
-```
-
-### 2. Verify deployment
-
-```bash
-# Check pod status
-kubectl get pods -l app.kubernetes.io/instance=my-exp
-
-# Should show 3 pods:
-# my-exp-orchestrator-0
-# my-exp-inference-0
-# my-exp-trainer-0
-```
-
-### 3. Run training
-
-```bash
-# Exec into trainer
-kubectl exec -it my-exp-trainer-0 -- bash
-
-# Inside the pod, run training
-cd /data
-uv run trainer @ /app/examples/reverse_text/configs/train.toml
-```
-
-### 4. Monitor progress
-
-```bash
-# Get logs
-kubectl logs my-exp-trainer-0
-
-# Follow logs in real-time
-kubectl logs -f my-exp-trainer-0
-```
-
-## Available Examples
-
-The chart includes pre-configured values for each example:
-
-### reverse-text (Small - 1 GPU)
-
-```bash
-helm install my-exp ./k8s/prime-rl -f ./k8s/prime-rl/examples/reverse-text.yaml
-```
-
-- Model: Qwen3-0.6B
-- GPUs: 1 per component
-- Runs on consumer GPUs (RTX 3090/4090)
-- **Note:** You can use any release name - the chart automatically configures service URLs
-
-## Configuration
-
-### Storage Configuration
-
-By default, the chart creates a 1TB PVC with NFS storage. To customize:
-
-```yaml
-# custom-values.yaml
-storage:
-  storageClassName: my-storage-class
-  size: 500Gi
-```
-
-Deploy with custom storage:
-
-```bash
-helm install my-release ./k8s/prime-rl -f custom-values.yaml
-```
-
-### GPU Configuration
-
-Adjust GPU count per component:
-
-```yaml
-# custom-gpu.yaml
-inference:
-  gpu:
-    count: 4  # Use 4 GPUs for inference
-
-trainer:
-  gpu:
-    count: 2  # Use 2 GPUs for training
-```
-
-### Resource Limits
-
-Customize memory and CPU:
-
-```yaml
-# custom-resources.yaml
-trainer:
-  resources:
-    requests:
-      memory: "64Gi"
-      cpu: "16"
-    limits:
-      memory: "128Gi"
-      cpu: "32"
-```
-
-### Secrets (Optional)
-
-For W&B and HuggingFace authentication:
-
-```bash
-# Create secret
-kubectl create secret generic prime-rl-secrets \
-  --from-literal=wandb-api-key=YOUR_WANDB_KEY \
-  --from-literal=hf-token=YOUR_HF_TOKEN
-
-# Enable in values
-helm install my-release ./k8s/prime-rl \
-  --set config.secrets.enabled=true \
-  --set config.secrets.name=prime-rl-secrets
-```
-
-## Common Operations
-
-### Deploy a new experiment
-
-```bash
-# With example config
-helm install my-exp ./k8s/prime-rl -f ./k8s/prime-rl/examples/reverse-text.yaml
-
-# With custom settings
-helm install my-exp ./k8s/prime-rl --set trainer.replicas=10 --set inference.replicas=5
-```
-
-### Exec into pods
-
-```bash
-# Exec into trainer-0
-kubectl exec -it my-exp-trainer-0 -- bash
-
-# Exec into specific trainer pod
-kubectl exec -it my-exp-trainer-3 -- bash
-
-# Exec into inference
-kubectl exec -it my-exp-inference-0 -- bash
-```
-
-### View logs
-
-```bash
-# Get logs from trainer-0
-kubectl logs my-exp-trainer-0
-
-# Follow logs in real-time
-kubectl logs -f my-exp-trainer-2
-
-# Get logs from all trainers
-kubectl logs -l app.kubernetes.io/instance=my-exp,role=trainer
-```
-
-### List all pods
-
-```bash
-# List pods for specific experiment
-kubectl get pods -l app.kubernetes.io/instance=my-exp
-
-# List all prime-rl pods
-kubectl get pods -l app=prime-rl
-```
-
-## Architecture
-
-### Components
-
-The chart deploys three main components (all using StatefulSets):
-
-1. **Orchestrator** (StatefulSet) - Coordinates training workflow
-   - Always 1 replica: `prime-rl-orchestrator-0`
-   - No GPU required
-   - Communicates with trainer and inference
-
-2. **Inference** (StatefulSet) - Runs vLLM inference server
-   - Scalable replicas with stable pod names: `prime-rl-inference-0`, `prime-rl-inference-1`, ...
-   - Each pod gets predictable DNS: `prime-rl-inference-0.prime-rl-inference-headless.default.svc.cluster.local`
-   - Requires GPU(s)
-   - Serves model predictions
-
-3. **Trainer** (StatefulSet) - Runs SFT or RL training
-   - Scalable replicas with stable pod names: `prime-rl-trainer-0`, `prime-rl-trainer-1`, ...
-   - Each pod gets predictable DNS: `prime-rl-trainer-0.prime-rl-trainer-headless.default.svc.cluster.local`
-   - Requires GPU(s)
-   - Updates model weights on shared storage
-
-**Why StatefulSets for all components?**
-
-- **Consistent naming**: All pods have predictable names (`orchestrator-0`, `trainer-0`, `trainer-1`, ...)
-- **Stable networking**: Each pod gets its own DNS hostname via headless service
-- **Required for distributed training**: PyTorch/vLLM need to discover peers by stable hostname
-- **Clean naming**: No random pod suffixes, easier to identify and debug
-
-### Shared Storage
-
-All components mount the same PVC at `/data` for:
-
-- Model checkpoint sharing
-- Training data
-- Experiment outputs
-
-This is **required** for coordinating weight updates between trainer and inference.
-
-## Environment Variables
-
-Each pod has these K8s environment variables set:
-
-- `$POD_NAME` - Full pod name (e.g., `my-exp-trainer-3`)
-- `$POD_IP` - Pod IP address
-- `$STATEFUL_REPLICAS` - Total number of replicas for that component
-- `$HEADLESS_SERVICE` - DNS name for peer discovery (e.g., `my-exp-trainer-headless.default.svc.cluster.local`)
-- `$INFERENCE_URL` - Full URL to the first inference pod (available in orchestrator and trainer pods)
-
-For distributed training, extract the rank from the pod name:
-
-```bash
-# Extract ordinal from pod name
-RANK=$(echo $POD_NAME | grep -o '[0-9]*$')  # e.g., "my-exp-trainer-3" -> "3"
-
-# Use in torchrun
-torchrun \
-  --nnodes=$STATEFUL_REPLICAS \
-  --node-rank=$RANK \
-  --nproc-per-node=8 \
-  --rdzv-endpoint=my-exp-trainer-0.$HEADLESS_SERVICE:29501 \
-  src/prime_rl/trainer/sft/train.py @ configs/train.toml
-```
-
-## Troubleshooting
-
-### Can't access shared storage
-
-Verify PVC is bound:
-
-```bash
-kubectl get pvc prime-rl-shared-data
-# STATUS should be "Bound"
-```
-
-Check mount inside pod:
-
-```bash
-kubectl exec -it prime-rl-trainer-xxx -- df -h /data
-```
-
-### Pod stuck in Pending
-
-Check if GPU resources are available:
-
-```bash
-kubectl describe pod my-exp-trainer-0
-```
-
-Look for events like `Insufficient nvidia.com/gpu`.
-
-### Inference server not responding
-
-Check if the inference pod is ready:
-
-```bash
-kubectl get pods -l role=inference
-kubectl logs my-exp-inference-0
-```
-
-## Uninstalling
-
-```bash
-# Remove the Helm release
-helm uninstall my-exp
-
-# Delete PVC (data will be lost!)
-kubectl delete pvc prime-rl-shared-data
-```
diff --git a/docs/logging.md b/docs/logging.md
deleted file mode 100644
index cbd7e881f2..0000000000
--- a/docs/logging.md
+++ /dev/null
@@ -1,86 +0,0 @@
-# Logging
-
-prime-rl uses [loguru](https://loguru.readthedocs.io/en/stable/) for logging with a global logger pattern. All logs are captured at the deployment level (stdout/stderr redirection for local, `tee` for SLURM) under `{output_dir}/logs/`. For RL training, we recommend streaming logs into tmux panes (as set up by `tmux.sh`).
-
-## Logger Architecture
-
-### `setup_logger` and `get_logger`
-
-We use a **singleton pattern** with a module-level global logger instance (`_LOGGER`).
-
-```python
-from prime_rl.utils.logger import setup_logger, get_logger
-
-# At entrypoint - call ONCE
-logger = setup_logger("info")
-
-# Anywhere else in codebase
-logger = get_logger()
-logger.info("Hello world")
-```
-
-**How it works:**
-
-1. **`get_logger()`** - Returns the global logger instance. Always works — if `setup_logger` hasn't been called yet, it initializes a default logger automatically. Safe to call from any module at any time.
-
-2. **`setup_logger(log_level)`** - Configures (or reconfigures) the global logger:
-   - Creates an isolated loguru `Logger` instance (not the default `loguru.logger`) to prevent third-party code from hijacking our logs
-   - Adds a stdout handler with colorized output (or JSON output if `json_logging=True`)
-   - Can be called multiple times — cleans up the previous logger before creating a new one
-
-3. **`reset_logger()`** - Resets the global logger to `None`:
-   - Used in subprocesses that inherit parent state (e.g., env workers)
-   - Used in tests between test cases
-
-## Log File Structure
-
-Logs are captured at the deployment level — the entrypoint redirects subprocess stdout/stderr to files (local) or `tee` captures them (SLURM). The structure is consistent across deployment types: `logs/trainer.log` and `logs/inference.log` always exist, regardless of whether the run is local or multi-node SLURM.
-
-### Local (single node)
-
-```
-{output_dir}/logs/
-├── trainer.log                  # trainer stdout (rank 0 only)
-├── orchestrator.log             # orchestrator stdout
-├── inference.log                # vLLM inference server stdout
-├── trainer/
-│   └── torchrun/                # per-rank stdout/stderr (all ranks)
-└── envs/
-    ├── train/{env_name}/
-    │   ├── env_server.log
-    │   └── env_worker_{id}.log
-    └── eval/{env_name}/
-        └── ...
-```
-
-### SLURM multi-node
-
-```
-{output_dir}/logs/
-├── trainer.log                  -> trainer/node_0.log (symlink)
-├── inference.log                -> inference/node_0.log (symlink)
-├── orchestrator.log             # orchestrator stdout
-├── trainer/
-│   ├── node_0.log               # per-node trainer output (rank 0 only)
-│   ├── node_1.log
-│   └── torchrun/                # per-rank stdout/stderr (all ranks)
-├── inference/
-│   ├── node_0.log               # per-node inference output
-│   ├── node_1.log
-│   └── router_0.log             # vllm-router per replica
-└── envs/
-    └── ...
-```
-
-Environment logs live under `logs/envs/train/{env_name}/` and `logs/envs/eval/{env_name}/`. Env log verbosity is controlled by `orchestrator.log.vf_level`.
-
-Only rank 0 output is shown in `trainer.log`. Per-rank logs from all ranks are available under `logs/trainer/torchrun/{rdzv_id}/attempt_0/{rank}/{stdout,stderr}.log`, written by torchrun's `--log-dir`.
-
-## tmux helper (`scripts/tmux.sh`)
-
-`scripts/tmux.sh` sets up a tmux session for RL runs with **four panes**:
-
-- **Trainer**: follows `{output_dir}/logs/trainer.log`
-- **Orchestrator**: follows `{output_dir}/logs/orchestrator.log`
-- **Envs**: follows `{output_dir}/logs/envs/*/*/*.log`
-- **Inference**: follows `{output_dir}/logs/inference.log`
diff --git a/docs/memory_usage.md b/docs/memory_usage.md
deleted file mode 100644
index b36c117254..0000000000
--- a/docs/memory_usage.md
+++ /dev/null
@@ -1,132 +0,0 @@
-# Reducing memory usage
-
-While most of our parallelism techniques in prime-rl are designed to scale training up (FSDP, EP, CP, ...), we also provide many tools to scale training down that allow training large MoE models on a limited amount of GPUs.
-
-These techniques target the trainer part of prime-rl.
-
-
-## TLDR: config to use for maximum memory usage reduction with correct throughput
-
-```toml
-[trainer.model]
-impl = "custom"
-attn = "flash_attention_2"
-fused_lm_head_token_chunk_size = 1024
-ep = 8
-cp = 2
-optim_cpu_offload = true
-
-[trainer.model.compile]
-
-[trainer.model.ac]
-freq = 1
-
-[trainer.model.ac_offloading]
-max_inflight_activations = 1
-```
-
-## Activation checkpointing
-
-Activation checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass, trading compute for memory.
-
-To enable it, use:
-
-```toml
-[trainer.model.ac]
-freq = 1
-```
-
-`freq` controls how often layers are checkpointed: every `freq` layers. Lower values yield lower memory usage (e.g. `freq = 1` checkpoints every layer).
-
-## Activation offloading
-
-Activation offloading offloads the activations to CPU to reduce the memory usage of the trainer. It can be used in combination with activation checkpointing.
-
-To enable it, use:
-
-```toml
-[trainer.model.ac]
-freq = 1
-
-[trainer.model.ac_offloading]
-max_inflight_activations = 5
-```
-
-## Chunk loss
-
-Chunk loss splits the loss computation into smaller chunks to reduce the memory usage of the trainer.
-
-To enable it, use:
-
-```toml
-[trainer.model]
-fused_lm_head_token_chunk_size = auto
-```
-
-
-## Expert parallelism
-
-While expert parallelism splits the weights of the experts across all GPUs like FSDP, using EP still reduces memory usage by reducing the communication size and therefore the FSDP buffer.
-
-EP is only available for models with MoE layers using the custom model implementation.
-
-```
-[trainer.model]
-impl = "custom"
-ep = 8
-```
-
-## Context parallelism
-
-Context parallelism splits the context into smaller chunks to reduce the memory usage of the activations. We don't advise using CP across multiple nodes (i.e., increasing CP beyond 8).
-
-CP is only available for certain models and only with the custom model implementation.
-
-```
-[trainer.model]
-impl = "custom"
-cp = 2
-```
-
-We recommend CP 2 or CP 4 for most 128K sequence length training runs. Can be pushed to 8.
-
-
-## torch compile
-
-Enabling torch.compile can reduce the memory usage for certain model architectures, especially MoE with the custom model implementation.
-
-```
-[trainer.model.compile]
-```
-
-## CPU Optimizer offloading
-
-Offloading the optimizer states to CPU can reduce the memory usage of the trainer significantly, especially at low GPU counts where the optimizer states take a lot of memory as they won't be sharded enough.
-
-In RL, in contrast with pretraining, we end up with many gradient accumulation steps, so the cost of offloading the optimizer states is not as high as in pretraining, and indeed barely noticeable.
-
-```
-[trainer.optim]
-optim_cpu_offload = true
-```
-
-## :warning: FSDP CPU offloading
-
-FSDP CPU offloading offloads the parameters, gradients, and optimizer states to CPU to reduce the memory usage of the trainer.
-
-This will make training significantly slower and is not recommended most of the time.
-
-```
-[trainer.model]
-fsdp_cpu_offload = true
-```
-
-## :warning: Lora training
-
-LoRA training significantly reduces the memory usage of the trainer at the cost of smaller gradient updates.
-
-```
-[trainer.model.lora]
-rank = 8
-```
-
diff --git a/docs/metrics.md b/docs/metrics.md
deleted file mode 100644
index bf6a785a1a..0000000000
--- a/docs/metrics.md
+++ /dev/null
@@ -1,55 +0,0 @@
-# Metrics
-
-## W&B
-
-For most runs we recommend logging metrics to [W&B](https://wandb.ai). Before enabling W&B, make sure that you have an account and are logged in.
-
-```bash
-uv run wandb login
-# Or set `export WANDB_API_KEY=...`
-```
-
-### SFT
-
-Logging to W&B is disabled by default. Enable the default configuration with `--wandb`
-
-```bash
-uv run sft ... --wandb
-```
-
-This will log to the `prime-rl` project with a random run name. You can specify which project and name to log to 
-
-```bash
-uv run sft ... --wandb.project my-project --wandb.name my-run
-```
-
-The same settings also work for multi-node training with `torchrun`. Note, that we only log global metrics from the master rank (e.g. the all-reduced loss)
-
-```bash
-uv run torchrun --nproc-per-node 8 ...  --wandb
-```
-
-### RL
-
-For RL training, both the trainer and orchestrator log to W&B as separate runs. Again, logging to W&B is disabled by default. Enable the default configuration with `--wandb`
-
-```bash
-uv run rl ... --wandb
-```
-
-This will log to the `prime-rl` project with a random run name. The trainer run is suffixed with `-trainer` and the orchestrator run is suffixed with `-orchestrator`. You can specify which project and name to log to using the same flags as for SFT.
-
-```bash
-uv run rl ... --wandb.project my-project --wandb.name my-run
-```
-
-For the RL trainer, we support logging samples (e.g. prompt, completion, reward, advantage for selected rollouts) and distributions (e.g. reward, advantage, entropy distributions) as W&B tables using the `wandb.log-extras` subconfig. If W&B is setup, this is enabled by default and will log for the RL trainer and orchestrator every 10 steps.
-
-You can configure this on the trainer and orchestrator separately. For example, to only log samples on the orchestrator every 50 steps, but not distribution on either
-
-```bash
-uv run rl  ... \
-  --no-trainer.wandb.log-extras.distributions \
-  --orchestrator.wandb.log-extras.interval 50
-```
-
diff --git a/docs/mint.json b/docs/mint.json
index 25437b7a31..361d71cef1 100644
--- a/docs/mint.json
+++ b/docs/mint.json
@@ -4,20 +4,13 @@
         {
             "group": "PRIME-RL",
             "pages": [
-                "index",
-                "entrypoints",
-                "configs",
-                "training_modes",
-                "environments",
-                "async",
-                "logging",
-                "multi_run_manager",
-                "checkpointing",
-                "benchmarking",
-                "deployment",
-                "kubernetes",
-                "testing-moe-at-small-scale",
-                "troubleshooting"
+                "overview",
+                "configuration",
+                "training",
+                "scaling",
+                "algorithms",
+                "advanced",
+                "development"
             ]
         }
     ]
diff --git a/docs/multi_run_manager.md b/docs/multi_run_manager.md
deleted file mode 100644
index bef6a6f566..0000000000
--- a/docs/multi_run_manager.md
+++ /dev/null
@@ -1,244 +0,0 @@
-# MultiRunManager
-
-The `MultiRunManager` object is a global singleton that manages the parameters and components for multiple concurrent training runs within a single trainer process.
-This allows multiple orchestrator deployments to share the same trainer.
-
-When `max_concurrent_runs > 1`, the trainer can train multiple runs in parallel. Each run:
-
-- Has its own LoRA adapter parameters
-- Has its own optimizer and scheduler
-- Saves its own checkpoints
-- Tracks its own training progress (step, tokens, samples)
-- Loads its own orchestrator configuration
-
-The `MultiRunManager` object provides:
-
-- **Bidirectional mapping** between run IDs (e.g., `run_abc123`) and run indices (0, 1, 2, ...)
-- **Progress tracking** per run (step count, total tokens, total samples)
-- **Configuration management** for orchestrator configs
-- **Distributed synchronization** across ranks via the PyTorch distributed store
-- **LoRA module registration** for multi-adapter parameter management
-- **Creation hooks** for initializing per-run resources (optimizers, schedulers)
-- **Run eviction** for removing runs that are misbehaving
-
-## **Initialization and run discovery**
-
-The `MultiRunManager` singleton is set up at the start of training:
-
-```python
-from prime_rl.trainer.runs import setup_multi_run_manager, get_multi_run_manager
-
-# Initialize with output directory and max concurrent runs
-setup_multi_run_manager(output_dir=Path("outputs/my-experiment"), max_runs=4)
-
-# Get the singleton instance anywhere in the codebase
-multi_run_manager = get_multi_run_manager()
-```
-
-Each run's directory follows this structure:
-
-```
-{output_dir}/
-├── run_abc123/
-│   ├── control/
-│   │   ├── orch.toml                    # Orchestrator configuration
-│   │   ├── config_validation_error.txt  # Config validation errors (if any)
-│   │   └── evicted.txt                  # Eviction reason (if evicted)
-│   ├── checkpoints/
-│   │   └── step_100/          # Orchestrator checkpoints
-│   ├── rollouts/
-│   │   └── step_100/          # Rollouts
-│   └── broadcast/
-│       └── step_100/          # Broadcasted weights for inference
-├── run_def456/
-│   └── ...
-└── ...
-
-```
-
-Runs are discovered by scanning the output directory for the pattern `run_*`. Each run must contain a valid orchestrator config at `{run_dir}/control/orch.toml` before they are added to the active runs otherwise they are ignored. When the maximum number of runs is reached, new `run_*` directories will not be picked up until old ones are deleted.
-
-```python
-# Master rank scans for new/deleted runs
-multi_run_manager.discover_runs()
-
-# All ranks synchronize state (must be called after discover_runs)
-multi_run_manager.synchronize_state()
-```
-
-The `discover_runs()` method (master only):
-
-1. Scans the output directory for `run_*` directories
-2. Filters out evicted runs (those with `control/evicted.txt`)
-3. Detects new runs and deleted runs
-4. Calls `forgotten_hook` for deleted runs (master only)
-5. Loads and validates the orchestrator config for each new run
-6. Updates internal mappings and data structures
-7. Calls `discovered_hook` for new runs (master only)
-
-The `synchronize_state()` method (all ranks):
-
-1. Master broadcasts run state to all ranks via the distributed store
-2. Non-master ranks catch up by calling internal `_delete_run_data` / `_create_run_data`
-3. All ranks execute `deletion_hook` for deleted runs
-4. All ranks execute `creation_hook` for new runs (e.g., optimizer setup, LoRA parameter reset)
-
-## Run Eviction
-
-The master proc on the trainer can evict a run using the `evict_run(idx: int, reason: str)` method.
-This is useful when the trainer detects an issue with a run that requires it to be stopped (e.g., invalid data, resource constraints, or policy violations).
-
-```python
-# Evict a run by its index (master only)
-multi_run_manager.evict_run(idx=0, reason="Run exceeded memory limits")
-```
-
-The `evict_run()` method (master only):
-
-1. Writes the eviction reason to `{run_dir}/control/evicted.txt`
-2. Logs a warning with the eviction details
-3. The run is **not** immediately removed from the manager's data structures
-
-The eviction takes effect through two mechanisms:
-
-**On the trainer side:**
-- The next `discover_runs()` call will filter out the evicted run (it checks for `evicted.txt`)
-- The run will then be treated as deleted, triggering forgotten/deletion hooks
-- The run index is returned to the unused pool
-
-**On the orchestrator side:**
-- The orchestrator checks for `evicted.txt` at the start of each iteration in its main loop
-- If found, it raises a `RuntimeError` with the eviction reason, causing the orchestrator to exit
-- This surfaces the eviction reason to the user
-- The orchestrator also self-evicts by writing `evicted.txt` if a training batch has no learning signal (all rollouts filtered out) on `MAX_EMPTY_BATCH_ATTEMPTS` (3) consecutive attempts
-
-## LoRA Module Registration
-
-LoRA modules register themselves with `MultiRunManager` for parameter management:
-
-```python
-# In apply_lora_to_model()
-lora_module = MultiLoRALinear(
-    base_layer=base_module,
-    rank=config.rank,
-    n_adapters=get_multi_run_manager().max_runs,
-    ...
-)
-lora_module.register_with_runs(get_multi_run_manager(), module_name)
-
-```
-
-The `MultiRunManager` object then exposes:
-
-```python
-# Get parameters for a specific run (used by optimizer creation)
-multi_run_manager.get_named_parameters_for_run(idx)
-
-# Get state dict for a specific run (used by weight broadcast)
-multi_run_manager.get_state_dict_for_run(idx)
-
-# Reset parameters for a new run
-multi_run_manager.reset_run_parameters(idx)
-
-```
-
-## Hooks
-
-The `MultiRunManager` object supports several types of hooks for different lifecycle events.
-Deletion hooks are always called before creation hooks.
-
-```mermaid
-flowchart TD
-    subgraph master["Rank 0 (Master)"]
-        discover["discover_runs()"]
-        forgotten["forgotten_hooks"]
-        validation["config_validation_hooks"]
-        discovered["discovered_hooks"]
-
-        discover --> forgotten
-        forgotten --> validation
-        validation --> discovered
-        discovered --> discover
-    end
-
-    subgraph rank1["Rank 1"]
-        wait1["waiting..."]
-    end
-
-    subgraph rankN["Rank N"]
-        waitN["waiting..."]
-    end
-
-    discovered --> barrier
-    wait1 --> barrier
-    waitN --> barrier
-
-    barrier[["synchronize_state()"]]
-
-    barrier --> deletion["deletion_hooks"]
-    deletion --> creation["creation_hooks"]
-
-    style barrier fill:#fff9c4
-```
-
-### Hook Registration
-
-```python
-# These hooks are only called on the master as only master uses `discover_runs()`
-# These hooks are thus only relevant to master only components (packer)
-multi_run_manager.register_discovered_hook(callback)
-multi_run_manager.register_forgotten_hook(callback)
-
-# These hooks are executed by all ranks in the order they were added during `synchronize_state()`
-# This ensures DTensor creations and other distributed operations happen together
-# Calling torch.dist.barrier() in a hook here should work
-multi_run_manager.register_creation_hook(callback)
-multi_run_manager.register_deletion_hook(callback)
-
-# These hooks validate the orchestrator config when runs are discovered:
-multi_run_manager.register_config_validation_hook(callback)
-```
-
-The callback signatures:
-
-```python
-def discovered_callback(idx: int, run_id: str, config: OrchestratorConfig) -> None:
-    """Called when a new run is discovered (master only).
-
-    Args:
-        idx: The run's index (0 to max_runs-1)
-        run_id: The run's ID (e.g., "run_abc123")
-        config: The orchestrator config for the run
-    """
-    # Example: Set the scaling factor for the run
-    multi_run_manager.scaling_factors[idx] = config.model.lora.alpha / config.model.lora.rank
-
-def forgotten_callback(idx: int, run_id: str) -> None:
-    """Called when a run is forgotten/removed (master only).
-
-    Args:
-        idx: The run's index (0 to max_runs-1)
-        run_id: The run's ID (e.g., "run_abc123")
-    """
-    pass
-
-def callback(idx: int, run_id: str) -> None:
-    """Called when a run is created/deleted.
-
-    Args:
-        idx: The run's index (0 to max_runs-1)
-        run_id: The run's ID (e.g., "run_abc123")
-    """
-    pass
-
-def config_validation_callback(config: OrchestratorConfig) -> tuple[bool, str]:
-    """Validate an orchestrator config.
-
-    Args:
-        config: The orchestrator config to validate
-
-    Returns:
-        (is_valid, error_message): If invalid, error_message is written to config dir
-    """
-    return True, ""
-```
diff --git a/docs/multimodal.md b/docs/multimodal.md
deleted file mode 100644
index 092869d922..0000000000
--- a/docs/multimodal.md
+++ /dev/null
@@ -1,60 +0,0 @@
-# Multimodal (VLM) Support
-
-Prime-RL supports training vision-language models (VLMs) like Qwen3-VL.
-
-## VLM Configuration
-
-### Supported Models
-
-The built-in registry supports these model families out of the box:
-
-| Model Family | model_type | Vision Encoder | Language Model |
-|-------------|------------|---------------|----------------|
-| Qwen3-VL | `qwen3_vl` | `model.visual` | `model.language_model` |
-| Qwen3.5 | `qwen3_5` | `model.visual` | `model.language_model` |
-| Qwen3.5-MoE | `qwen3_5_moe` | `model.visual` | `model.language_model` |
-
-Enable VLM mode by adding a `[model.vlm]` section. Both fields are required — they tell prime-rl where the vision encoder and language model live on the model object:
-
-```toml
-[model]
-name = "Qwen/Qwen3-VL-4B-Instruct"
-
-[model.vlm]
-vision_encoder_attr = "model.visual"
-language_model_attr = "model.language_model"
-```
-
-For the registered models in the table above, use the attrs shown there. For custom VLMs, check your model's structure with `model.named_children()`.
-
-Both fields are dotted attribute paths resolved on the loaded model. A bad path raises a `ValueError` immediately — there are no silent fallbacks.
-
-The weight key prefix for NCCL broadcasting is derived automatically as `{language_model_attr}.layers.`.
-
-To add permanent support for a new model family, add an entry to `VLM_REGISTRY` in `src/prime_rl/utils/vlm.py`.
-
-## Current Limitations
-
-- **Vision encoder is frozen by default**: The vision encoder is frozen during training by default. Set `freeze_vision_encoder = false` in `[model.vlm]` to make it trainable. When unfrozen, the vision encoder is FSDP-sharded per-block for proper gradient flow. Note: this has no effect when using LoRA.
-
-- **No multimodal-safe truncation**: Token sequences are truncated to `seq_len`, but `pixel_values` and `image_grid_thw` are passed through unchanged. If a multimodal sample exceeds `seq_len`, image tokens can be dropped while image tensors still describe the full set of images. Ensure `seq_len` covers your longest VLM samples.
-
-- **Optimization dtype must be bfloat16**: Set `optimization_dtype = "bfloat16"` and `reduce_dtype = "bfloat16"` in your trainer config.
-
-- **Higher KL mismatch with multi-image inputs**: VLM training exhibits higher KL mismatch compared to text-only, especially with multiple images.
-
-- **Images are not logged**: The images the VLM sees during training are not logged to monitors.
-
-## How Multi-Turn VLM RL Training Works
-
-VLM rollouts go through the renderer-backed TITO client (`orchestrator.use_renderer = true`, the default and required for VLMs). The renderer owns the HuggingFace processor per-slot and emits multimodal tensors alongside tokens.
-
-1. **Render**: For each trajectory step, the renderer tokenizes messages and emits per-image multimodal tensors (e.g. `pixel_values`, `image_grid_thw` for Qwen3-VL) as `multi_modal_data`.
-2. **Pack**: `interleave_rollout` concatenates the per-image tensors emitted across a sample's merged step range into a single `mm_kwargs` dict on the `TrainingSample`. Per-token `mm_token_type_ids` (0=text, 1=image, 2=video) come from `renderer.mm_token_type_id_map`.
-3. **Forward**: The trainer `**`-unpacks `mm_kwargs` into the model's `forward` signature, so any VLM whose HF processor and forward agree on kwarg names works without touching the transport.
-
-Each multimodal sample becomes its own micro-batch during training (no packing) since image tensor sizes vary.
-
-## vLLM Configuration
-
-`VLLM_WORKER_MULTIPROC_METHOD=spawn` is required for VLM inference. This is set automatically when using `uv run rl @ ...`, but if you start the vLLM server yourself, make sure this environment variable is set.
diff --git a/docs/overview.md b/docs/overview.md
new file mode 100644
index 0000000000..39ed30ba12
--- /dev/null
+++ b/docs/overview.md
@@ -0,0 +1,44 @@
+# Overview
+
+`prime-rl` is a framework for large-scale, asynchronous reinforcement learning of large language models. It is designed to be easy to use and hackable, yet capable of training 1T+-parameter MoE models on 1000+ GPU clusters.
+
+## Architecture
+
+A `prime-rl` RL run is three cooperating processes:
+
+![Architecture](assets/architecture.png)
+
+- **Inference** — vLLM-backed server (or fleet) holding the current policy. The orchestrator drives rollouts through the token-in `/v1/generate` route via the [`renderers`](https://github.com/PrimeIntellect-ai/renderers) package (OpenAI-compatible chat/completions routes are also exposed for external clients). Supports data + tensor + expert parallelism (with `deepep` and `flashinfer` all-to-all backends and EPLB), FP8 inference, prefill/decode disaggregation behind a `vllm-router`, CPU KV-cache offload, and *router replay* (the routed-expert mask is returned to the trainer for FP8 MoE numerical parity). Weights are pushed in place through a custom `update_weights` endpoint over filesystem or NCCL transports.
+- **Orchestrator** — Lightweight CPU process that owns the data plane across many [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) training and eval environments. Each env runs in an isolated subprocess with a variable-size pool of env workers for scalability. The orchestrator drives multi-turn rollouts against the inference fleet (tool use, browsers, sandboxes, long horizons) without re-tokenizing across turns, computes advantages, packs the rollouts into training batches, and relays new weights from trainer to inference.
+- **Trainer** — FSDP2 process group that consumes packed rollouts and steps the optimizer. We ship optimized custom modeling code for many MoE / dense / VLM families that unlocks advanced trainer parallelism — expert parallelism (EP, with DeepEP kernels) and context parallelism (CP) for long-sequence training — plus selective activation checkpointing, FP8 training on Hopper+, LoRA, and multi-tenant training (many concurrent LoRA tenants sharing one trainer + inference deployment).
+
+The three processes communicate through configurable transports — by default the trainer↔orchestrator rollout link uses the local filesystem, and weight broadcast uses the filesystem (or NCCL for synchronous setups). Swap to ZMQ for multi-host setups without shared storage. See [Scaling](scaling.md) for the deployment options.
+
+## Installation
+
+```bash
+curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime-rl/main/scripts/install.sh | bash
+```
+
+The script clones the repo, initializes the [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) / [`renderers`](https://github.com/PrimeIntellect-ai/renderers) / [`research-environments`](https://github.com/PrimeIntellect-ai/research-environments) submodules, installs `uv`, and runs `uv sync --all-extras`. For manual setup, MoE-only installs (DeepGEMM / DeepEP / NIXL), or troubleshooting, see the [README](https://github.com/PrimeIntellect-ai/prime-rl#setup).
+
+You need at least one NVIDIA GPU (RTX 3090/4090/5090, A100, H100, H200, or B200). Single-GPU runs are supported for debugging; production RL is typically 1× inference node + 1+ trainer nodes.
+
+## Quick Run
+
+Train an SFT-warmed `Qwen3-0.6B` on the `reverse-text` task — the env is bundled with the [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) submodule so no separate install is needed. This config ships in the repo and runs on two GPUs (one for inference, one for the trainer):
+
+```bash
+uv run rl @ examples/reverse_text/rl.toml
+```
+
+The `rl` entrypoint reads `examples/reverse_text/rl.toml`, splits it into per-process sub-configs, picks GPU 0 for inference and GPU 1 for the trainer, launches all three processes, and tees their stdout into `outputs/logs/{trainer,orchestrator,inference}.log`. Within a minute the trainer should log `step 1` and a reward sample; after 20 steps the run completes and final HF-compatible weights land at `outputs/weights/step_20`.
+
+## Documentation
+
+- **[Configuration](configuration.md)** — TOML composition, CLI overrides, dry-run.
+- **[Training](training.md)** — Launch and observe RL and SFT runs.
+- **[Scaling](scaling.md)** — Single-GPU through multi-node clusters via FSDP / EP / CP and SLURM.
+- **[Algorithms](algorithms.md)** — Async semantics, loss / advantage / filter plugins, trajectory merging.
+- **[Advanced](advanced.md)** — Custom modeling, multimodal, LoRA, multi-tenant, P/D inference.
+- **[Development](development.md)** — Test suite, pre-commit hooks, adding a new model.
diff --git a/docs/platform-monitoring.md b/docs/platform-monitoring.md
deleted file mode 100644
index 31bcfe312b..0000000000
--- a/docs/platform-monitoring.md
+++ /dev/null
@@ -1,48 +0,0 @@
-# Platform Monitoring
-
-Use `orchestrator.prime_monitor` to register a run on the Prime Intellect platform and stream training metrics, samples, and distributions.
-
-> **Internal-only for now:** external run registration is currently only enabled for internal / allowlisted teams.
-
-## Prerequisites
-
-You need a Prime API key with `rft:write` scope.
-
-Use the CLI:
-
-```bash
-prime login
-```
-
-Or set an environment variable directly:
-
-```bash
-export PRIME_API_KEY=pit_...
-```
-
-## Minimal config
-
-```toml
-[orchestrator.prime_monitor]
-run_name = "my-experiment"
-```
-
-You can also override from the CLI:
-
-```bash
-uv run rl @ config.toml --orchestrator.prime_monitor.run_name "my-experiment"
-```
-
-## Troubleshooting
-
-### `API key not found`
-
-Set the env var from `api_key_var` or run:
-
-```bash
-prime login
-```
-
-### `External training runs are not enabled for this team`
-
-Your team is not allowlisted yet. This feature is currently internal-only.
diff --git a/docs/scaling.md b/docs/scaling.md
new file mode 100644
index 0000000000..15b39f2bfc
--- /dev/null
+++ b/docs/scaling.md
@@ -0,0 +1,268 @@
+# Scaling
+
+This page covers how to scale `prime-rl` from a single GPU to a 1000-GPU cluster: single-node and multi-node deployments, FSDP / expert parallelism / context parallelism, and throughput benchmarking. For knobs that fit on one box, see [Training](training.md) first. For prefill/decode disaggregated inference, see [Advanced](advanced.md#disaggregated-prefilldecode-inference).
+
+## Table of Contents
+
+- [Single-Node vs. Multi-Node Deployment](#single-node-vs-multi-node-deployment)
+  - [Single-Node](#single-node)
+    - [RL Placement](#rl-placement)
+    - [SFT and Torchrun](#sft-and-torchrun)
+  - [Multi-Node](#multi-node)
+- [Parallelism Knobs](#parallelism-knobs)
+  - [FSDP](#fsdp)
+  - [Expert Parallelism](#expert-parallelism)
+  - [Context Parallelism](#context-parallelism)
+  - [Activation Checkpointing and Offloading](#activation-checkpointing-and-offloading)
+  - [Optimizer Offloading](#optimizer-offloading)
+  - [LM Head Chunking](#lm-head-chunking)
+- [Memory-Tight Recipe](#memory-tight-recipe)
+- [SLURM](#slurm)
+  - [Activation](#activation)
+  - [`[deployment]` Block](#deployment-block)
+  - [Examples](#examples)
+  - [Custom Templates](#custom-templates)
+- [Benchmarking](#benchmarking)
+
+## Single-Node vs. Multi-Node Deployment
+
+The `rl`, `sft`, and `inference` entrypoints all accept a `[deployment]` block (`type = "single_node"` or `"multi_node"`) that picks how the trainer / orchestrator / inference processes are placed across hardware. **Single-node** runs locally; **multi-node** currently goes through [SLURM](#slurm) — the launcher writes an sbatch script that places inference replicas, the orchestrator, and the trainer with the right rendezvous endpoints, IPs, ports, and shared-filesystem paths wired in.
+
+### Single-Node
+
+#### RL Placement
+
+`rl` defaults to 1 trainer GPU and 1 inference GPU. To give inference 6 GPUs with data parallelism and the trainer the remaining 2 on an 8-GPU node:
+
+```bash
+uv run rl @ rl.toml \
+  --deployment.num-infer-gpus 6 \
+  --deployment.num-train-gpus 2 \
+  --inference.parallel.dp 6
+```
+
+The launcher allocates GPUs in order from `CUDA_VISIBLE_DEVICES` (or all visible GPUs): inference first, trainer next, teacher last. To target a specific physical subset, pin `CUDA_VISIBLE_DEVICES` before launching.
+
+For quick A/B ablations on the same node, run two RL instances side-by-side in separate tmux sessions, each pinned to half the GPUs and a separate inference port:
+
+```bash
+# session 1, GPUs 0–1, default port 8000
+bash scripts/tmux.sh -s exp1 -o outputs/exp1
+CUDA_VISIBLE_DEVICES=0,1 uv run rl @ rl.toml --output-dir outputs/exp1
+
+# session 2, GPUs 2–3, port 8001
+bash scripts/tmux.sh -s exp2 -o outputs/exp2
+CUDA_VISIBLE_DEVICES=2,3 uv run rl @ rl.toml \
+  --inference.server.port 8001 \
+  --orchestrator.client.base-url http://localhost:8001/v1 \
+  --output-dir outputs/exp2
+```
+
+#### SFT and Torchrun
+
+`uv run sft` handles distributed launch internally. To scale from 1 to N GPUs, set the deployment GPU count (or just let it pick up `WORLD_SIZE`). For non-default layouts, the manual equivalent is:
+
+```bash
+uv run torchrun \
+  --nproc-per-node 8 \
+  --local-ranks-filter 0 \
+  src/prime_rl/trainer/sft/train.py @ sft.toml
+```
+
+`--local-ranks-filter 0` keeps console output to rank 0 only; per-rank stdout/stderr is still captured in `<output_dir>/logs/trainer/torchrun/`.
+
+### Multi-Node
+
+Multi-node deployments (RL or SFT) are launched via [SLURM](#slurm) — set `[deployment] type = "multi_node"` plus the matching `[slurm]` block, and the launcher writes the sbatch script that places inference, orchestrator, and trainer across the requested nodes with the inter-process wiring set up correctly. See [SLURM § Examples](#examples) for full configs.
+
+## Parallelism Knobs
+
+### FSDP
+
+FSDP2 is the default model sharding strategy. By default the trainer fully shards parameters, gradients, and optimizer state across the data-parallel mesh. Tweakable knobs:
+
+| Knob | Effect |
+|---|---|
+| `trainer.model.dp_replicate` | Number of data-parallel **replicas**; parameters are replicated across replicas and sharded within each one. Set to 2 to run 2-way DP replication × FSDP sharding within each replica, which is useful for very large clusters where pure FSDP communication dominates. |
+| `trainer.model.reshard_after_forward` | If `true` (default), parameters are resharded after the forward pass to free memory; the backward pass re-gathers. Set `false` to keep params resident — faster but more memory. |
+| `trainer.model.fsdp_cpu_offload` | Offload params + grads + optimizer state to CPU. Big memory win, large throughput hit. |
+| `trainer.model.optim_cpu_offload` | Offload only optimizer state. Mid-ground — small throughput cost, decent memory savings, especially at low GPU count. |
+
+### Expert Parallelism
+
+EP shards MoE expert weights across the EP mesh, dramatically reducing the FSDP communication volume per layer. EP is only available with the custom model implementation (`model.impl = "custom"` or `"auto"` for supported families).
+
+```toml
+[trainer.model]
+impl = "custom"
+ep = 8                     # EP degree; must divide num_experts
+ep_comm_backend = "torch"  # or "deepep"
+```
+
+`ep_comm_backend = "deepep"` uses DeepEP's custom dispatch/combine kernels for speed, with two extra knobs (`deepep_num_sms`, `deepep_token_chunk_size`) — tune on your hardware.
+
+### Context Parallelism
+
+CP shards a single sequence across multiple GPUs along the token dimension, which reduces per-GPU activation memory for long-context training. Only available with the custom impl and flash-attention.
+
+```toml
+[trainer.model]
+impl = "custom"
+attn = "flash_attention_2"   # or fa3 / fa4
+cp = 2                       # CP degree
+cp_style = "ring"            # "ulysses" for non-FA kernels
+```
+
+### Activation Checkpointing and Offloading
+
+| Knob | Memory ↓ | Throughput ↓ |
+|---|---|---|
+| `trainer.model.ac` | large | ~25% |
+| `trainer.model.ac.mode = "selective"` | medium | small |
+| `trainer.model.ac_offloading` | extra | a bit more |
+
+Enable selective AC (custom impl only) for the best memory/throughput tradeoff:
+
+```toml
+[trainer.model.ac]
+mode = "selective"
+targets = ["norm", "attn_proj"]  # see Reference for the full list per architecture
+```
+
+### Optimizer Offloading
+
+Offloading optimizer states to CPU is a near-free memory win at low GPU counts:
+
+```toml
+[trainer.optim]
+# any optimizer type
+type = "adamw"
+
+[trainer.model]
+optim_cpu_offload = true
+```
+
+Mutually exclusive with `fsdp_cpu_offload`. Also incompatible with `trainer.max_concurrent_runs > 1` (multi-tenant training). Muon doesn't support `fsdp_cpu_offload` but does support `optim_cpu_offload`.
+
+### LM Head Chunking
+
+The vanilla LM head materializes a `[batch * seq, vocab]` logits tensor on every step — a major memory tax when the vocabulary is large (often >100K). `fused_lm_head_token_chunk_size` swaps in a custom fused linear + logprob/entropy kernel that streams through `chunk_size` tokens at a time, avoiding the materialization:
+
+```toml
+[trainer.model]
+fused_lm_head_token_chunk_size = "auto"     # picks 8192 for RL
+# or explicit:
+# fused_lm_head_token_chunk_size = 1024     # smaller = lower memory, more launches
+# fused_lm_head_token_chunk_size = "disabled"  # default; vanilla LM head
+```
+
+`auto` is a safe starting point for RL. Drop the chunk size further when peak memory is still tight (e.g. with very long sequences); raise it to amortize kernel-launch overhead. Only available with `model.impl = "custom"`, and currently RL-only — the SFT trainer rejects integer values.
+
+## Memory-Tight Recipe
+
+The kitchen-sink config for fitting a large MoE model on a limited number of GPUs at acceptable throughput:
+
+```toml
+[trainer.model]
+impl = "custom"
+fused_lm_head_token_chunk_size = 1024
+ep = 8
+cp = 2
+optim_cpu_offload = true
+
+[trainer.model.compile]
+
+[trainer.model.ac]
+freq = 1
+
+[trainer.model.ac_offloading]
+max_inflight_activations = 1
+```
+
+This recipe combines every memory lever: FSDP + EP shard the weights, CP shards the activations along the token dimension, AC and AC offloading shrink the activation footprint, the fused LM head chunks the loss computation, `torch.compile` reduces fragmentation, and optimizer offloading moves optimizer state off the GPU. Apply the knobs selectively, since each one has a throughput cost.
+
+## SLURM
+
+The `rl`, `sft`, and `inference` entrypoints all submit to SLURM when a `[slurm]` table is present — there's no separate entrypoint.
+
+### Activation
+
+A SLURM config is usually a thin overlay that adds `[slurm]` (and `[deployment]` for multi-node) on top of a base config. Configs are composed left-to-right via the `@` CLI syntax — see [Configuration § TOML Composition](configuration.md#toml-composition):
+
+```toml
+# my_slurm.toml
+output_dir = "/shared/outputs/my-rl"
+
+[slurm]
+job_name = "my-rl-run"
+```
+
+Launch:
+
+```bash
+uv run rl @ base_rl.toml @ my_slurm.toml             # submits via sbatch
+uv run rl @ base_rl.toml @ my_slurm.toml --dry-run   # writes the sbatch script + resolved config, exits
+```
+
+### `[deployment]` Block
+
+`[deployment]` is a discriminated union picked by `type` — `single_node` or `multi_node` for RL/SFT, with an extra disaggregated variant for inference. RL multi-node:
+
+```toml
+[deployment]
+type = "multi_node"
+num_train_nodes = 2
+num_infer_nodes = 1
+gpus_per_node = 8                # default
+nodes_per_fsdp_group = 1         # optional — controls FSDP island size
+```
+
+SFT multi-node:
+
+```toml
+[deployment]
+type = "multi_node"
+num_nodes = 2
+gpus_per_node = 8
+```
+
+### Examples
+
+Full multi-node configs ship in [`examples/multinode/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/multinode):
+
+- [`rl.toml`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/examples/multinode/rl.toml) — two-node RL run with NCCL weight broadcast on a 30B MoE student.
+- [`sft.toml`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/examples/multinode/sft.toml) — two-node SFT against the same model.
+
+For inference-only multi-node, set `[deployment] type = "multi_node"` on an inference TOML — each node runs an independent vLLM replica (TP and DP must fit within one node), and the launcher prints one URL per node. Front the URLs with a router or point clients at any of them.
+
+### Custom Templates
+
+For unusual partitions, module loads, or environment setup, supply your own Jinja2 template:
+
+```bash
+uv run rl @ my_config.toml --slurm.template-path path/to/my_template.sbatch.j2
+```
+
+The default templates live under [`src/prime_rl/templates/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/src/prime_rl/templates) — copy one as a starting point.
+
+## Benchmarking
+
+Every entrypoint supports a `--bench` flag that runs a few warm-up + measurement steps with fake data and prints a rich-formatted throughput / MFU table:
+
+```bash
+# SFT trainer alone
+uv run sft @ sft.toml --bench
+uv run sft ... --data.type fake --data.length variable --bench   # variable-length fake data
+
+# RL trainer alone (no inference involved)
+uv run trainer @ train.toml --data.fake --bench
+
+# Inference alone — start the server normally, then bench the orchestrator
+uv run inference @ infer.toml
+uv run orchestrator @ orch.toml --bench
+
+# Full RL stack (trainer with fake data, inference with real data from orchestrator)
+uv run rl @ rl.toml --bench
+```
+
+Persist results with `--bench.output-json`. Use this to compare parallelism configs before committing a multi-day run.
diff --git a/docs/slurm.md b/docs/slurm.md
deleted file mode 100644
index dc60431153..0000000000
--- a/docs/slurm.md
+++ /dev/null
@@ -1,297 +0,0 @@
-# SLURM
-
-The `rl`, `sft`, and `inference` entrypoints all have built-in SLURM support. Adding a `[slurm]` section to your config switches from local execution to SLURM job submission — no separate entrypoint needed.
-
-## Quick Start
-
-```bash
-# Local run
-uv run rl @ examples/reverse_text/rl.toml
-
-# SLURM run — overlay slurm_rl.toml on top of the base config
-uv run rl @ examples/reverse_text/rl.toml @ examples/reverse_text/slurm_rl.toml
-```
-
-The SLURM config is a thin overlay that adds `[slurm]` (and optionally `[deployment]`) on top of a base config. Configs are composed left-to-right via the `@` CLI syntax — see [`configs.md`](configs.md) for details.
-
-```toml
-# examples/reverse_text/slurm_rl.toml
-output_dir = "outputs/reverse-text-rl"
-
-[slurm]
-job_name = "reverse-text-rl"
-```
-
-## How it works
-
-When `[slurm]` is present, the entrypoint:
-
-1. Resolves the full config
-2. Renders a SLURM batch script from a Jinja2 template
-3. Writes the script and resolved config to `{output_dir}/`
-4. Submits via `sbatch` (or prints the script with `--slurm.dry-run`)
-
-For **single-node** jobs, the entire config is dumped to a TOML file and the template simply runs `uv run rl @` or `uv run sft @` on the allocated node.
-
-For **multi-node** jobs, sub-configs are written separately and `srun` dispatches processes across nodes.
-
-## Configuration
-
-### `[slurm]` — Job submission (shared between RL and SFT)
-
-| Field | Description | Default |
-|---|---|---|
-| `job_name` | SLURM job name | `"prime-rl"` |
-| `project_dir` | Path to the project root on the cluster | `"."` |
-| `template_path` | Path to a custom Jinja2 template | auto-selected |
-| `partition` | SLURM partition | `"cluster"` |
-| `nodelist` | Comma-separated list of specific nodes to run on (`--nodelist`) | `None` |
-| `exclude` | Comma-separated list of nodes to exclude (`--exclude`) | `None` |
-| `account` | SLURM account to charge (`--account`) | `None` |
-| `time` | Maximum wall time, e.g. `"24:00:00"` (`--time`) | `None` |
-| `pre_run_command` | Shell command to run on head node after env setup, before starting the job (e.g. cleanup) | `None` |
-
-### `[deployment]` — Node and GPU allocation
-
-**RL** uses a discriminated union with `type = "single_node"` (default) or `type = "multi_node"`:
-
-| Field | single_node | multi_node |
-|---|---|---|
-| `gpus_per_node` | Number of GPUs per node (default: 8) | Same |
-| `num_train_gpus` | Training GPUs | — |
-| `num_infer_gpus` | Inference GPUs | — |
-| `num_train_nodes` | — | Training nodes |
-| `num_infer_nodes` | — | Inference nodes |
-| `nodes_per_fsdp_group` | — | Nodes per FSDP island (optional) |
-
-**SFT** follows the same pattern but only has training nodes:
-
-| Field | single_node | multi_node |
-|---|---|---|
-| `gpus_per_node` | Number of GPUs per node (default: 8) | Same |
-| `num_gpus` | Number of GPUs (default: 1) | — |
-| `num_nodes` | — | Training nodes (default: 2) |
-| `nodes_per_fsdp_group` | — | Nodes per FSDP island (optional) |
-
-**Inference** runs independent vLLM replicas per node:
-
-| Field | single_node | multi_node |
-|---|---|---|
-| `gpus_per_node` | Number of GPUs per node (default: 8) | Same |
-| `num_nodes` | — | Number of inference nodes (default: 1) |
-
-The SLURM template is auto-selected based on `deployment.type`. You can override it with `slurm.template_path`.
-
-### Constraints
-
-- `output_dir` should be explicitly set when using SLURM (defaults to `"outputs"`)
-- Multi-node deployment requires `[slurm]` to be set
-
----
-
-## RL Examples
-
-### Single-node SLURM
-
-The simplest case: run on a single allocated node. No `[deployment]` needed — defaults to `single_node`.
-
-```toml
-output_dir = "/shared/outputs/my-rl-run"
-
-[slurm]
-job_name = "my-rl-run"
-```
-
-### Multi-node SLURM (Hendrycks Math)
-
-```toml
-output_dir = "outputs/rl-math-moe"
-max_steps = 500
-seq_len = 2048
-
-[slurm]
-job_name = "hendrycks-math-rl-moe"
-
-[deployment]
-type = "multi_node"
-num_train_nodes = 1
-num_infer_nodes = 1
-
-[weight_broadcast]
-type = "nccl"
-
-[model]
-name = "Qwen/Qwen3-30B-A3B-Thinking-2507"
-
-[trainer.model]
-impl = "custom"
-attn = "flash_attention_3"
-optim_cpu_offload = true
-
-[trainer.model.ac_offloading]
-max_inflight_activations = 5
-
-[trainer.model.ac]
-freq = 1
-
-[orchestrator]
-batch_size = 512
-group_size = 16
-
-[orchestrator.train.sampling]
-max_completion_tokens = 2048
-
-[[orchestrator.train.env]]
-id = "math-env"
-name = "hendrycks-math"
-args = { dataset_name = "PrimeIntellect/Hendrycks-Math", dataset_subset = "default" }
-
-[inference.parallel]
-tp = 4
-dp = 2
-```
-
-See [`examples/multinode/rl.toml`](../examples/multinode/rl.toml) for the full example.
-
----
-
-## SFT Examples
-
-### Single-node SLURM
-
-```toml
-output_dir = "/shared/outputs/my-sft-run"
-
-[slurm]
-job_name = "my-sft-run"
-```
-
-### Multi-node SLURM (MoE SFT)
-
-```toml
-output_dir = "outputs/sft-moe-math"
-max_steps = 500
-
-[slurm]
-job_name = "sft-moe-math"
-
-[deployment]
-type = "multi_node"
-num_nodes = 2
-
-[model]
-name = "Qwen/Qwen3-30B-A3B-Thinking-2507"
-impl = "custom"
-attn = "flash_attention_3"
-optim_cpu_offload = true
-
-[model.ac_offloading]
-max_inflight_activations = 5
-
-[model.ac]
-freq = 1
-
-[data]
-type = "sft"
-name = "PrimeIntellect/INTELLECT-3-SFT-10K"
-subsets = ["default"]
-splits = ["math"]
-batch_size = 128
-seq_len = 8192
-```
-
-See [`examples/multinode/sft.toml`](../examples/multinode/sft.toml) for the full example.
-
----
-
-## Inference Examples
-
-### Single-node SLURM
-
-Run a vLLM server on a single allocated node:
-
-```toml
-# my_inference.toml
-output_dir = "/shared/outputs/my-inference"
-
-[model]
-name = "Qwen/Qwen3-8B"
-
-[parallel]
-tp = 8
-
-[slurm]
-job_name = "my-inference"
-```
-
-```bash
-uv run inference @ my_inference.toml
-```
-
-### Multi-node SLURM
-
-Each node runs an independent vLLM replica. TP and DP must fit within a single node — there is no cross-node parallelism.
-
-```toml
-output_dir = "/shared/outputs/my-inference"
-
-[model]
-name = "PrimeIntellect/INTELLECT-3-RL-600"
-
-[parallel]
-tp = 4
-dp = 2
-
-[deployment]
-type = "multi_node"
-num_nodes = 4
-
-[slurm]
-job_name = "my-inference"
-```
-
-After submission, the SLURM template prints the inference URLs for all nodes (one per node).
-
-### Dry run
-
-Use `--dry-run` (or `dry_run = true` in TOML) to generate the sbatch script without submitting:
-
-```bash
-uv run inference @ config.toml --dry-run
-```
-
----
-
-## Custom SLURM Templates
-
-The default templates handle standard setups with InfiniBand detection, environment setup, and `srun`-based process dispatch. For advanced use cases (custom partitions, account settings, module loads, etc.), provide your own Jinja2 template:
-
-```bash
-uv run rl @ my_config.toml --slurm.template-path path/to/my_template.sbatch.j2
-```
-
-See [`src/prime_rl/templates/`](../src/prime_rl/templates/) for the default templates as a starting point.
-
-## Monitoring
-
-After submission, logs are available at:
-
-```bash
-# All deployment types (trainer.log and inference.log are symlinks for multi-node)
-tail -F {output_dir}/logs/trainer.log
-tail -F {output_dir}/logs/orchestrator.log
-tail -F {output_dir}/logs/inference.log
-
-# Multi-node: per-node logs
-tail -F {output_dir}/logs/trainer/node_*.log
-tail -F {output_dir}/logs/inference/node_*.log
-
-# Multi-node inference: per-replica router logs
-tail -F {output_dir}/logs/inference/router_*.log
-```
-
-For convenience, a tmux launcher sets up a session with all log streams:
-
-```bash
-bash scripts/tmux.sh my-rl-job /shared/outputs/my-rl-job
-```
diff --git a/docs/testing-moe-at-small-scale.md b/docs/testing-moe-at-small-scale.md
deleted file mode 100644
index ba2ca048f9..0000000000
--- a/docs/testing-moe-at-small-scale.md
+++ /dev/null
@@ -1,113 +0,0 @@
-# Testing MoE at Small Scale
-
-When working on MoE architectures (GLM-4, Kimi, etc.), you can't iterate on a 100B+ parameter model locally. This guide shows how to create a small (~0.5B) MoE model with the same architecture, run SFT to warm it up, and run RL on it — all on 1-2 GPUs.
-
-The goal isn't performance. It's catching bugs in modeling code, state dict conversions, and training pipeline integration before running at scale.
-
-## Overview
-
-1. **Create + verify** a mini model with random weights and check HF <-> PrimeRL roundtrip
-2. **SFT** to give it a non-trivial distribution
-3. **RL** on reverse-text to validate the full pipeline
-
-## Prerequisites
-
-- At least 1 GPU for steps 1-2, 2 GPUs for step 3 (RL)
-- Architecture presets are defined in `scripts/mini_moe.py`
-
-## Step 1: Create and verify the mini model
-
-```bash
-uv run python scripts/mini_moe.py --arch glm4_moe --output-dir ./mini-glm-moe
-```
-
-This creates a ~543M parameter GLM-4 MoE (1024 hidden, 24 layers, 8 experts) with random weights, copies the tokenizer from the original GLM-4 model, then verifies that:
-- Logits match between HF and PrimeRL implementations (`convert_to_prime`)
-- The HF -> PrimeRL -> HF roundtrip is lossless (`convert_to_hf`)
-
-To re-run verification only (e.g. after a modeling code change):
-
-```bash
-uv run python scripts/mini_moe.py --arch glm4_moe --output-dir ./mini-glm-moe --verify-only
-```
-
-## Step 2: SFT warmup
-
-Using the existing debug MoE SFT config with overrides for real data:
-
-```bash
-uv run sft @ configs/debug/moe/sft/train.toml \
-    --model.name ./mini-glm-moe \
-    --data.name PrimeIntellect/Reverse-Text-SFT \
-    --data.type null \
-    --max_steps 200 \
-    --optim.lr 1e-4 \
-    --ckpt.weights
-```
-
-This fine-tunes on [PrimeIntellect/Reverse-Text-SFT](https://huggingface.co/datasets/PrimeIntellect/Reverse-Text-SFT) for 200 steps. Loss should drop from ~12 to ~2.5. The model won't be coherent, but it will have a non-trivial distribution so KL divergence is meaningful during RL.
-
-The latest weight checkpoint is saved under `outputs/weights/step_<N>`. You can verify the roundtrip on it:
-
-```bash
-uv run python scripts/mini_moe.py --arch glm4_moe --output-dir outputs/weights/step_200 --verify-only
-```
-
-A pre-built SFT'd model is available at [samsja/mini-glm-moe](https://huggingface.co/samsja/mini-glm-moe).
-
-## Step 3: RL (reverse-text)
-
-Requires 2 GPUs (one for inference, one for training).
-
-```bash
-uv run rl @ configs/ci/integration/rl/start.toml \
-    --model.name samsja/mini-glm-moe \
-    --trainer.model.impl custom \
-    --inference.gpu-memory-utilization 0.7 \
-    --inference.model.max-model-len 2048
-```
-
-Or to use the checkpoint from step 2:
-
-```bash
-uv run rl @ configs/ci/integration/rl/start.toml \
-    --model.name outputs/weights/step_200 \
-    --trainer.model.impl custom \
-    --inference.gpu-memory-utilization 0.7 \
-    --inference.model.max-model-len 2048
-```
-
-What to look for:
-- **Training runs without crashing** — validates the full pipeline (inference server, orchestrator, trainer)
-- **KL divergence is non-zero and finite** — confirms the reference model distribution is working
-- **Loss is reasonable** — not NaN, not stuck at a constant value
-
-Don't expect the reward to go up meaningfully in 20 steps on a random model.
-
-## Adding a new architecture
-
-To test a new MoE architecture (e.g., Kimi2.5):
-
-1. Add modeling code under `src/prime_rl/trainer/models/<arch>/`
-2. Add a preset to `scripts/mini_moe.py` with the config class, small dimensions, HF model class, PrimeRL model class, and tokenizer source
-3. Run steps 1-3 above with `--arch <your_arch>`
-
-The preset defines the small config:
-
-```python
-ARCH_PRESETS = {
-    "glm4_moe": {
-        "config_class": Glm4MoeConfig,
-        "config_kwargs": dict(
-            hidden_size=1024,
-            num_hidden_layers=24,
-            n_routed_experts=8,
-            # ...
-        ),
-        "hf_model_class": HFGlm4MoeForCausalLM,
-        "prime_model_class": PrimeRLGlm4MoeForCausalLM,
-        "tokenizer_source": "THUDM/GLM-4-9B-0414",
-    },
-    # Add your new arch here
-}
-```
diff --git a/docs/training.md b/docs/training.md
new file mode 100644
index 0000000000..5c0d42716a
--- /dev/null
+++ b/docs/training.md
@@ -0,0 +1,327 @@
+# Training
+
+This page covers everything you need to launch, observe, checkpoint, and recover a `prime-rl` training run — the RL trainer, the SFT trainer, and the related on-policy distillation mode. For multi-node and cluster layouts, see [Scaling](scaling.md). For the loss math and algorithm knobs, see [Algorithms](algorithms.md).
+
+> **AI agents working in this repo:** the equivalent runbooks are at [`skills/training/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/skills/training) — top-level routing in [`skills/training/SKILL.md`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/skills/training/SKILL.md), launch details in [`skills/training/start-run/SKILL.md`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/skills/training/start-run/SKILL.md), and check-in / restart procedures in [`skills/training/monitor-run/SKILL.md`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/skills/training/monitor-run/SKILL.md).
+
+## Table of Contents
+
+- [Entrypoints](#entrypoints)
+- [RL Trainer](#rl-trainer)
+  - [Launch](#launch)
+  - [Useful Knobs](#useful-knobs)
+  - [Training Modes (RL / OPD / SFT)](#training-modes-rl--opd--sft)
+  - [Important Metrics](#important-metrics)
+- [SFT Trainer](#sft-trainer)
+  - [Dataset Format](#dataset-format)
+  - [Launch](#launch-1)
+  - [SFT-Specific Knobs](#sft-specific-knobs)
+  - [Important Metrics](#important-metrics-1)
+- [Checkpointing](#checkpointing)
+  - [Enabling Checkpoints](#enabling-checkpoints)
+  - [Resuming a Run](#resuming-a-run)
+  - [Serving Checkpoints](#serving-checkpoints)
+- [Observability](#observability)
+  - [Log Files](#log-files)
+  - [Console Output](#console-output)
+  - [Weights & Biases](#weights--biases)
+  - [Platform Monitoring](#platform-monitoring)
+- [Rules of Thumb](#rules-of-thumb)
+
+## Entrypoints
+
+| Command | Purpose | Notes |
+|---|---|---|
+| `uv run rl` | Wraps the trainer, orchestrator, and inference server in one launch from a merged TOML. | The default for any RL run. Runs locally for single-node experiments; submits to SLURM for single- or multi-node when `[slurm]` is set (see [Scaling § SLURM](scaling.md#slurm)). |
+| `uv run sft` | Supervised fine-tuning on a HF dataset. | Launches torchrun internally; never call torchrun directly. |
+| `uv run inference` | vLLM server. | Always use this entrypoint over `vllm serve` — it adds `/update_weights`, `/load_lora_adapter`, and `/init_broadcaster`. |
+| `uv run trainer` | Standalone trainer process group. | Use only when launching the trainer separately from the orchestrator (e.g. multi-node RL without the `rl` wrapper). |
+| `uv run orchestrator` | Standalone orchestrator process. | Pair with a separately-launched trainer + inference. |
+
+## RL Trainer
+
+### Launch
+
+The minimal RL run trains an SFT-warmed `Qwen3-0.6B` on the `reverse-text` task — the env is bundled with the [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) submodule, so nothing else needs to be installed:
+
+```bash
+uv run rl @ examples/reverse_text/rl.toml
+```
+
+### Useful Knobs
+
+A condensed view of the knobs you'll most often tune. For trainer-side parallelism, sampling, optimizer, and loss knobs see [Scaling](scaling.md) and [Algorithms](algorithms.md).
+
+**Data and algorithm:**
+
+| Knob | What it does |
+|---|---|
+| `orchestrator.batch_size` | Tasks per trainer step. |
+| `orchestrator.group_size` | Rollouts generated per task. |
+| `orchestrator.max_off_policy_steps` | How many distinct policies may have contributed to one rollout before it's discarded (default 8). The main off-policy dial on long agentic rollouts: raise it for throughput, lower it to stay closer to on-policy. Watch `errored_rollouts` and `mismatch_kl/all/mean` when tuning. |
+| `orchestrator.training_mode` | `rl` (default), `opd`, or `sft`. See [Training modes](#training-modes-rl--opd--sft). |
+| `[[orchestrator.train.env]]` | Training environments. List multiple tables for multi-env training; weight them via `ratio`. See [Configuration § Environments](configuration.md#environments-orchestratortrainenv). |
+| `[[orchestrator.eval.env]]` + `orchestrator.eval.interval` | Eval environments and cadence (default every 100 steps). |
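+
+As a rough illustration of how these knobs combine, here is a config sketch. The values are illustrative rather than recommendations; going by the definitions in the table, `batch_size` tasks times `group_size` rollouts gives the number of rollouts consumed per trainer step:
+
+```toml
+[orchestrator]
+batch_size = 128            # tasks per trainer step (illustrative value)
+group_size = 16             # rollouts per task: 128 * 16 = 2048 rollouts per step
+max_off_policy_steps = 8    # the documented default
+training_mode = "rl"        # the documented default
+```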
+
+**Monitoring:**
+
+| Knob | What it does |
+|---|---|
+| `log.level` | Process log level for trainer + orchestrator (`info` default; falls back to `$PRIME_LOG_LEVEL`). Set per-process via `trainer.log.level` / `orchestrator.log.level`, or globally on the `rl` entrypoint to propagate to both. |
+| `orchestrator.log.vf_level` | Env-worker / [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) log level (`info` default; `debug` is noisy but useful for env debugging). |
+| `--wandb` (+ `--wandb.project`, `--wandb.name`) | Enable Weights & Biases logging. See [Weights & Biases](#weights--biases). |
+| `--orchestrator.prime-monitor` | Stream metrics to the Prime Intellect platform (Prime Lab). See [Platform monitoring](#platform-monitoring). |
+
+**Run management:**
+
+| Knob | What it does |
+|---|---|
+| `--clean-output-dir` | Wipe `<output_dir>` before starting. Useful when re-running an experiment with the same name during iteration. |
+| `--output-dir outputs/<name>` | Per-run output directory. Always set this when running more than one experiment in parallel. |
+| `--max-steps N` | Stop after `N` trainer steps. Overrides the config value. |
+| `--dry-run` | Resolve + validate the full config, write per-process TOMLs to `<output_dir>/configs/`, and exit without launching. The fastest way to debug a misbehaving config. |
+
+### Training Modes (RL / OPD / SFT)
+
+The RL entrypoint supports three training modes, switched via `orchestrator.training_mode`:
+
+| Mode | Student | Teacher | Use case |
+|---|---|---|---|
+| `rl` | Required | Forbidden | Standard RL |
+| `opd` | Required | Required, must be vLLM (needs `prompt_logprobs`) | [On-policy distillation](https://thinkingmachines.ai/blog/on-policy-distillation/): student generates rollouts, trainer minimizes KL to teacher logprobs |
+| `sft` | Required | Required, any OpenAI-compatible endpoint | Hard-distill: teacher generates rollouts, student trains on them |
+
+The `rl` entrypoint only manages student-policy inference. For OPD and (local-vLLM) SFT, start the teacher inference server manually and point `[orchestrator.teacher.client]` at it:
+
+```bash
+CUDA_VISIBLE_DEVICES=1 uv run inference \
+  --model.name <teacher> --server.port 8001
+```
+
+The standalone `uv run sft` entrypoint is the more traditional SFT path — pure dataset-based, no teacher, no orchestrator. Use `orchestrator.training_mode = "sft"` only when you want a teacher to generate the supervision on the fly.
+
+### Important Metrics
+
+Pulled from the console logs and mirrored to W&B.
+
+**Progress** (orchestrator):
+
+- `reward/{all,env}/mean` — main signal. Should trend upward over hundreds of steps.
+- `seq_len/{all,env}/mean` and `is_truncated/{all,env}/mean` — rollout length and truncation rate.
+- `num_turns/{all,env}/mean` — for multi-turn envs.
+- `empty_rollouts/{all,env}`, `errored_rollouts/{all,env}` — non-zero is fine in small numbers; sustained > 5% is a smell.
+- `eval/{env}/{avg@k,pass@k}` — eval scores when `[orchestrator.eval]` is set.
+
+**Stability** (trainer):
+
+- `mismatch_kl/{all,env}/{mean,std,max}` — KL between trainer's current policy and the (older) inference policy that generated the rollouts. A sustained, growing mean is the early-warning sign for off-policy collapse.
+- `entropy/{all,env}/mean` — too low means mode-collapse; too high means the model isn't committing.
+- `masked_advantage_{positive,negative}/mean` — fraction of DPPO-masked tokens, split by sign.
+- `optim/grad_norm` — spikes precede divergence; check the loss config or lower the LR.
+
+**Performance** (trainer + orchestrator step independently):
+
+| Source | Metric | Reading |
+|---|---|---|
+| trainer | `time/wait_for_batch` | **high → orchestrator bottleneck** |
+| orchestrator | `time/wait_for_ckpt` | **high → trainer bottleneck** |
+
+## SFT Trainer
+
+`uv run sft` runs supervised fine-tuning from a HF dataset. It shares model loaders, FSDP setup, checkpointing, and the chat-template plumbing with the RL trainer, so a typical workflow is _SFT → RL → SFT → …_ without any reformatting.
+
+### Dataset Format
+
+Two accepted layouts:
+
+- **Prompt-completion**: a HF dataset with `prompt` and `completion` columns ([TRL format](https://huggingface.co/docs/trl/en/dataset_formats#prompt-completion)). The trainer masks out the prompt and computes loss only over the completion.
+- **Messages**: a HF dataset with a single `messages` column containing a list of chat turns. The trainer interprets the whole conversation as one sample, applies role-based loss masking, and trains over all assistant turns.
+
+If both columns are present, `messages` takes precedence.
+
+**Tool definitions.** For tool-use SFT, add a `tools` column (OpenAI function-calling format) or `tool_defs` ([`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) rollout format). Each row's value can be either a list of dicts or a JSON-encoded string of a list — both are accepted, and `tool_defs` rows are auto-converted to OAI shape before being passed into the chat template's `tools=...` argument. The `chat_template_kwargs` column, if present, is forwarded verbatim into `apply_chat_template`.
+
+**Position-dependent chat templates.** Multi-turn SFT under the default tokenization path (`build_incremental_token_mask`) requires that the tokenization of the first _k_ turns of a conversation be a prefix of the tokenization of all _n ≥ k_ turns. Qwen3's upstream template _violates_ this: it strips past `<think>` blocks across user turns, which silently corrupts the loss mask. Two fixes:
+
+- **Enable the renderer** (`use_renderer = true`, recommended). The [`renderers`](algorithms.md#renderers) package owns tokenization end-to-end and is robust to position-dependent templates. Hand-coded renderers ship for Qwen3, Qwen3.5, GLM-5, GLM-4.5, Kimi K2/K2.5, MiniMax M2, DeepSeek V3, Nemotron 3, GPT-OSS. Not supported for VLMs.
+- **Patched chat template** — the prime-rl–patched checkpoints (e.g. `PrimeIntellect/Qwen3-0.6B`, used in `examples/reverse_text/sft.toml`) ship a chat template that preserves thinking. Or supply your own.
+
+See [Algorithms § Multi-Turn Trajectories](algorithms.md#multi-turn-trajectories) for the full picture.
+
+### Launch
+
+The minimal SFT run trains `Qwen3-0.6B` on the `reverse-text` SFT dataset:
+
+```bash
+uv run sft @ examples/reverse_text/sft.toml --wandb
+```
+
+Multi-GPU and multi-node use torchrun under the hood (the `sft` entrypoint manages this for you — see [Scaling § SFT and Torchrun](scaling.md#sft-and-torchrun) for non-default layouts; multi-node SFT goes through [SLURM](scaling.md#slurm)).
+
+### SFT-Specific Knobs
+
+| Knob | What it controls |
+|---|---|
+| `data.name` | HF dataset name or local path |
+| `data.batch_size` | Tokens per trainer step (packed) |
+| `data.seq_len` | Per-sample sequence length |
+| `loss_mask.*` | Which roles contribute to loss (system / user / assistant / tool). |
+| `val.interval` | Run validation every N steps; `val.data` mirrors `data` |
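+
+A minimal sketch of how these knobs might look in an SFT config. The table names `[data]` and `[val]` follow the dotted knob names above, and the numbers are illustrative, not tuned defaults:
+
+```toml
+[data]
+name = "PrimeIntellect/Reverse-Text-SFT"   # HF dataset name or local path
+batch_size = 128                            # illustrative value
+seq_len = 8192                              # illustrative value
+
+[val]
+interval = 50                               # validate every 50 steps (illustrative)
+# `val.data` mirrors `[data]` and is omitted from this sketch
+```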
+
+### Important Metrics
+
+Pulled from the console log and mirrored to W&B.
+
+**Progress and loss:**
+
+- `loss/mean` — main signal. Should decrease through the run.
+- `val/loss` — validation loss when `[val]` is set, logged every `val.interval` steps.
+- `progress/epoch`, `progress/num_samples`, `progress/num_tokens` — dataset progress.
+- `progress/<subset>/ratio_{samples,tokens}` — when training on multiple HF subsets/splits, the realized mixing ratio.
+
+**Stability and optimization:**
+
+- `optim/grad_norm` — spikes precede divergence.
+- `optim/lr`, `optim/zero_grad_ratio` — LR schedule and the fraction of params that received zero gradients (high → dead path or wrong loss masking).
+- For MoE: `max_vio/mean` (load-balancing violation), `routing_confidence/mean` — both are logged when non-zero.
+
+**Performance:**
+
+| Metric | Reading |
+|---|---|
+| `perf/throughput`, `perf/throughput_per_gpu` | tokens/s overall and per GPU |
+| `perf/mfu` | MFU |
+| `perf/peak_memory` | peak GPU memory (GiB) |
+| `time/step`, `time/forward_backward`, `time/save_ckpt` | step breakdown |
+
+## Checkpointing
+
+Checkpointing is split across processes because the orchestrator and trainer can be on different machines and on different steps at any given time. Inference is stateless.
+
+| Process | What's saved | Where |
+|---|---|---|
+| Trainer | FSDP-sharded model (DCP), optimizer, scheduler, progress | `<output_dir>/checkpoints/step_N/trainer/` |
+| Orchestrator | Step counter, total tokens / samples / problems | `<output_dir>/checkpoints/step_N/orchestrator/` |
+| Inference | _nothing_ — re-pushed from the latest checkpoint on restart | n/a |
+| Trainer (HF weights) | HF-compatible weight snapshot for serving | `<output_dir>/weights/step_N/` |
+
+### Enabling Checkpoints
+
+Checkpointing is **off by default** to save disk. Enable it with `--ckpt`:
+
+```bash
+uv run rl @ rl.toml --ckpt                              # default: end-of-training only
+uv run rl @ rl.toml --ckpt.interval 25                  # every 25 steps
+uv run rl @ rl.toml --ckpt.interval 25 --ckpt.keep-last 3  # rolling window of 3
+uv run rl @ rl.toml --ckpt.interval 25 --ckpt.keep-interval 100  # …plus permanent every 100
+```
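+
+For settings you want to keep across launches, the same options can go in the config file. This is a sketch, assuming the CLI flags map onto a top-level `[ckpt]` table with kebab-case flags becoming snake_case keys (the way `--orchestrator.prime-monitor` corresponds to `[orchestrator.prime_monitor]`); check the resolved config from `--dry-run` before relying on it:
+
+```toml
+[ckpt]
+interval = 25        # save every 25 steps
+keep_last = 3        # rolling window of the 3 most recent checkpoints
+keep_interval = 100  # additionally keep every 100th step permanently
+```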
+
+### Resuming a Run
+
+Re-run the same launch command and pass `--ckpt.resume-step <N>` (or `-1` for "latest"). Make sure `--max-steps` is at least the target final step, not the remaining delta:
+
+```bash
+# First run: steps 0–10
+uv run rl @ rl.toml --max-steps 10 --ckpt
+
+# Resume: continue to step 20
+uv run rl @ rl.toml --max-steps 20 --ckpt.resume-step 10
+```
+
+### Serving Checkpoints
+
+HF-compatible weight snapshots are written under `<output_dir>/weights/step_N/` whenever a full checkpoint runs (or you can write weights-only via `--ckpt.weights-only` for cheaper snapshots). Upload directly:
+
+```bash
+uv run hf upload <user>/<model>-RL outputs/weights/step_100
+```
+
+For LoRA runs, set `ckpt.weights.save_adapter_separately = true` to also write the raw adapter alongside the merged weights — useful when serving the adapter through a separate `/load_lora_adapter` call.
+
+## Observability
+
+### Log Files
+
+The launcher tees every process's stdout/stderr into `<output_dir>/logs/`. The full layout (single-node runs skip the `node_*.log` and `router_*.log` files):
+
+```
+<output_dir>/logs/
+├── trainer.log                  # rank 0 only; symlink → trainer/node_0.log on multi-node
+├── orchestrator.log             # single instance, single file
+├── inference.log                # symlink → inference/node_0.log on multi-node
+├── trainer/
+│   ├── node_*.log               # per-node trainer stdout (multi-node only)
+│   └── torchrun/<rdzv>/attempt_0/<rank>/{stdout,stderr}.log   # per-rank
+├── inference/
+│   ├── node_*.log               # per-node inference stdout (multi-node only)
+│   └── router_*.log             # vllm-router per replica (multi-node only)
+└── envs/{train,eval}/<env_name>/
+    ├── env_server.log
+    └── env_worker_<id>.log
+```
+
+Env worker logs are the first place to look for env-side errors (most user code lives there). Verbosity is controlled by `orchestrator.log.vf_level`. For multi-rank trainer debugging, drop into `logs/trainer/torchrun/<rdzv>/attempt_0/<rank>/{stdout,stderr}.log` — verbose and per-rank.
+
+Live tailing from a single point (works on the head node for multi-node runs over a shared filesystem):
+
+```bash
+tail -F <output_dir>/logs/{trainer,orchestrator,inference}.log
+tail -F <output_dir>/logs/trainer/node_*.log     # multi-node only
+tail -F <output_dir>/logs/inference/router_*.log # multi-node only
+```
+
+### Console Output
+
+`scripts/tmux.sh` opens a 4-pane tmux session that follows `trainer.log`, `orchestrator.log`, `inference.log`, and the union of env worker logs. Start it before launching:
+
+```bash
+bash scripts/tmux.sh
+# then in the Launcher window:
+uv run rl @ ... --output-dir outputs/my-run
+```
+
+Pass `-s <session>` and `-o <output_dir>` to run multiple parallel experiments side-by-side in different sessions. The helper also works on a SLURM head node — `bash scripts/tmux.sh my-rl-job /shared/outputs/my-rl-job`.
+
+### Weights & Biases
+
+W&B is off by default. Enable with `--wandb`:
+
+```bash
+uv run rl @ rl.toml --wandb                               # default project, random name
+uv run rl @ rl.toml --wandb.project my-proj --wandb.name run-42
+```
+
+By default (`wandb.shared = true`) the trainer and orchestrator log into a **single shared W&B run**, so all metrics from both processes land in one place. Set `wandb.shared = false` (or pass `--no-wandb.shared`) to fall back to the legacy split — two runs suffixed `-trainer` and `-orchestrator`. Shared mode requires the W&B SDK ≥ 0.19.9 and is incompatible with `wandb.offline = true`.
+
+By default, every 10 steps each process also logs a sample of prompts/completions (with rewards and advantages) and reward/advantage/entropy distributions as W&B tables. Tune via `--wandb.log-extras.interval` and `--wandb.log-extras.sample-ratio`, or disable subsets:
+
+```bash
+uv run rl @ rl.toml --wandb \
+  --orchestrator.wandb.log-extras.interval 50 \
+  --no-trainer.wandb.log-extras.distributions
+```
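+
+The same settings can presumably live in the config file instead of on the command line. The fragment below is a sketch that assumes CLI flags map onto TOML tables the way `--orchestrator.prime-monitor` corresponds to `[orchestrator.prime_monitor]` in the Platform Monitoring section (dots become tables, hyphens become underscores). The table and key names are inferred from the flags above, not taken from the config schema, so verify them before relying on them:
+
+```toml
+# Assumed TOML equivalent of the flags above (names inferred, not verified).
+[wandb]
+project = "my-proj"
+name = "run-42"
+
+[orchestrator.wandb.log_extras]
+interval = 50            # log tables every 50 steps instead of every 10
+
+[trainer.wandb.log_extras]
+distributions = false    # skip the reward/advantage/entropy distribution tables
+```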
+
+### Platform Monitoring
+
+Register a run on the Prime Intellect platform (Prime Lab) and stream training metrics, samples, and distributions to the platform dashboard. The bare flag uses the defaults:
+
+```bash
+uv run rl @ rl.toml --orchestrator.prime-monitor
+```
+
+Or set it in TOML:
+
+```toml
+[orchestrator.prime_monitor]
+run_name = "my-experiment"
+```
+
+Requires `PRIME_API_KEY` (set via `prime login` or env var) and an allowlisted team. Currently internal-only.
+
+## Rules of Thumb
+
+- **Start small.** Run `examples/reverse_text/rl.toml` end-to-end on 2 GPUs before scaling. If the smoke run finishes cleanly, your install is good.
+- **Batch size ≥ 64.** Smaller batches give noisy gradient estimates, and the trainer's per-step overhead dominates throughput. 64 is the practical floor; 128–512 is the range for quick ablations; production RL often runs at 1024+.
+- **Group size ≥ 8.** Bigger groups (`orchestrator.group_size`) make it more likely that a task produces a mix of high- and low-reward rollouts, which is what gives the trainer a usable signal — if all rollouts in a group succeed or all fail, the within-group advantage collapses to zero and the trainer learns nothing from that task. Bigger groups also tighten advantage normalization. 8 is the floor; 16–32 is common.
+- **Pin `output_dir` per run.** Sharing a directory across runs will mix rollouts and break resumes. `--output-dir outputs/<unique-name>` is the simplest discipline.
+- **Use `--dry-run` before SLURM.** Validators (e.g. CP needs flash-attention) fail immediately in a dry run; without one, the same failure only surfaces after the job has waited in the queue.
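+
+To see why uniform groups carry no signal in the group-size rule above, consider a mean-baseline group advantage (an assumption for illustration; the exact advantage the trainer uses may differ). For a group of $G$ rollouts with rewards $r_1, \dots, r_G$:
+
+$$
+A_i = r_i - \frac{1}{G} \sum_{j=1}^{G} r_j
+$$
+
+If every rollout in the group receives the same reward $r$, then $A_i = r - r = 0$ for all $i$, so the group contributes no gradient. With binary rewards and an independent per-rollout success probability $0 < p < 1$, a group is uniform with probability $p^G + (1 - p)^G$, which shrinks as $G$ grows.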
diff --git a/docs/training_modes.md b/docs/training_modes.md
deleted file mode 100644
index 76809e63df..0000000000
--- a/docs/training_modes.md
+++ /dev/null
@@ -1,40 +0,0 @@
-# Training Modes
-
-PRIME-RL supports three training modes through our RL trainer, selected via `training_mode`:
-
-- **`rl`** — reinforcement learning: student generates rollouts, no teacher
-- **`opd`** — [on-policy distillation](https://thinkingmachines.ai/blog/on-policy-distillation/): students generates rollouts, train to minimize the KL divergence between the student and teacher's logprobs for each token in the rollout
-- **`sft`** — supervised fine-tuning on teacher-generated rollouts
-
-> Note: PRIME-RL also has a dedicated `sft` entrypoint for more traditional supervised fine-tuning from a HF dataset. When using the `sft` training mode on the orchestrator, teacher rollouts are generated on-the-fly and used for training.
-
-The mode determines who generates rollouts, what role the teacher plays, and what must be configured.
-
-| Mode | Student | Teacher |
-|---|---|---|
-| `rl` | required | forbidden |
-| `opd` | required | required (local vLLM) |
-| `sft` | required | required (any OAI-compatible endpoint) |
-
-**SFT vs OPD teachers** differ in what the orchestrator asks of them. SFT only calls `/v1/chat/completions` to generate rollouts — any OpenAI-compatible endpoint works (PI inference, OpenAI, Anthropic, a local vLLM). OPD additionally needs token-level logprobs scored over the student's tokens, which today only vLLM's `/inference/v1/generate` with `prompt_logprobs` exposes — so the OPD teacher must be a vLLM server.
-
-### Reference configs
-
-Debug-scale configs for all three modes (and LoRA variants) live in [`configs/debug/training_modes/`](../configs/debug/training_modes/):
-
-- `rl.toml` / `opd.toml` / `opd_lora.toml`
-- `sft.toml` / `sft_lora.toml` (local vLLM teacher)
-- `sft_external.toml` (PI inference teacher)
-
-See [`configs/debug/training_modes/README.md`](../configs/debug/training_modes/README.md) for run commands.
-
-## Parameter reference
-
-| Parameter | Default | Description |
-|-----------|---------|-------------|
-| `training_mode` | `"rl"` | One of `rl`, `opd`, `sft`. Propagates to `orchestrator.training_mode` and (for sft) `trainer.loss.type`. |
-| `trainer.loss.teacher_tau` | `0.0` | Distillation strength. Must be `> 0` in OPD. |
-| `trainer.loss.adv_tau` | `1.0` | Weight for the RL advantage signal. Set `0` for pure distillation. |
-| `orchestrator.verification.enabled` | `true` | Enable/disable verification. Set to `false` for pure distillation with `adv_tau = 0`. |
-
-> Note: the `rl` entrypoint only manages student-policy inference. For OPD and (local) SFT, start the teacher inference server manually (e.g. `CUDA_VISIBLE_DEVICES=1 uv run inference --model.name <teacher> --server.port 8001`) and point `[orchestrator.teacher.client]` at it. See [`configs/debug/training_modes/README.md`](../configs/debug/training_modes/README.md) for a full example.
diff --git a/docs/trajectories.md b/docs/trajectories.md
deleted file mode 100644
index d5d7eee5e4..0000000000
--- a/docs/trajectories.md
+++ /dev/null
@@ -1,96 +0,0 @@
-# Trajectories
-
-Verifiers [v0.1.8](https://github.com/PrimeIntellect-ai/verifiers/releases/tag/v0.1.8) introduced trajectory-based rollouts, where each LLM request/response pair in a multi-turn interaction is recorded as an independent step. For details on the design decision, check the detailed [design document](https://github.com/PrimeIntellect-ai/verifiers/blob/main/notes/TRAJECTORIES.md) in the verifiers repository.
-
-## Best-Effort Interleaved Rollouts
-
-PRIME-RL uses a best-effort interleaving strategy that automatically merges consecutive trajectory steps when possible, and starts a new training sample when the extension property breaks.
-
-### The Extension Property
-
-A sequence of trajectory steps has the **extension property** when each successive step's prompt contains all previous prompts and completions as a prefix. When this holds:
-- Multiple steps can be merged into a single training sample
-- Compute scales as O(T) for a trajectory of length T
-
-When extension breaks (e.g., due to context compaction or thinking being stripped):
-- A new training sample is started from that step
-- Compute scales as O(T²) in the worst case (every step breaks extension)
-
-### How It Works
-
-```
-5-step trajectory where extension breaks at step 4:
-
-Steps 1-3: extension holds → merged into Sample 1
-Step 4: extension breaks (e.g., thinking stripped from history)
-Steps 4-5: extension holds → merged into Sample 2
-
-Result: 2 training samples instead of 5
-```
-
-This approach gives you the best of both worlds:
-- When extension holds: O(T) compute, single merged sample
-- When extension breaks: graceful fallback, no corrupted data
-- Mixed scenarios: optimal merging where possible
-
-### The Exact Prefix Invariant
-
-Interleaving enforces a strict invariant:
-
-> The prompt at turn $t$ must be the exact concatenation of prior messages exactly as the LLM originally generated them
-
-We call this the "exact prefix" invariant. For example, at turn 2, the LLM should see U1,A1,U2 as the prompt, where U1 exactly matches the user message in turn 1 and A1 exactly matches the produced assistant message in turn 1. Any violation of this invariant will result in downstream problems when computing the importance sampling ratio during training.
-
-For example, assume that at turn 2 the prompt is U1,A1',U2 where A1' varies from A1. In this scenario it is not clear whether to add A1 or A1' to the interleaved rollout:
-- If we add A1', the logprobs from turn 1 might be off because the inference LLM produced A1 but the trainer LLM is computing logprobs for A1'
-- If we add A1, the logprobs from turn 2 might be off because the inference LLM is attending to A1' but the trainer LLM is attending to A1
-
-When the invariant is violated (extension breaks), PRIME-RL automatically starts a new training sample rather than producing corrupted data.
-
-### Arbitrary Chat Templates
-
-There exist chat templates which add, modify, or remove tokens across turns. One good example is the chat template of the Qwen3-series of models, which strips thinking across user turns.
-
-```python
-from transformers import AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
-
-messages = [
-    {"role": "user", "content": "U1"},
-    {"role": "assistant", "content": "<think>R1</think>A1"},
-    {"role": "user", "content": "U2"},
-]
-
-print(tokenizer.apply_chat_template(messages[:1], tokenize=False))
-# <|im_start|>user
-# U1<|im_end|>
-
-print(tokenizer.apply_chat_template(messages, tokenize=False))
-# <|im_start|>user
-# U1<|im_end|>
-# <|im_start|>assistant
-# A1<|im_end|>
-# <|im_start|>user
-# U2<|im_end|>
-```
-
-The chat template automatically strips away past thinking sections across user turns, which is often referred to as "interleaved thinking". Many chat templates, such as GLM or MiniMax, implement this approach.
-
-With best-effort interleaving, PRIME-RL handles this gracefully: when the thinking is stripped and the prefix no longer matches, a new training sample is started automatically.
-
-### Discontinuous Trajectories by Design
-
-Some multi-turn environments are intentionally discontinuous. For example, in a sub-agent calling scenario:
-
-1. Main agent receives a task and decides to delegate to a sub-agent
-2. Sub-agent runs independently (possibly multiple turns with its own context)
-3. Control returns to main agent with only the sub-agent's final result
-
-The main agent's trajectory is discontinuous because the sub-agent's internal conversation isn't part of its context. When the main agent resumes, its prompt doesn't extend the previous turn - it contains a summarized result instead.
-
-Best-effort interleaving handles this naturally: each agent's contiguous turns get merged, but the handoff between agents starts a new sample.
-
-## Deprecated: Branching Mode
-
-The `--trajectory-strategy branching` option is deprecated. The best-effort interleaving strategy now handles all cases automatically, falling back to separate samples (equivalent to branching) when the extension property breaks.
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
deleted file mode 100644
index e2c1d68d2f..0000000000
--- a/docs/troubleshooting.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# Troubleshooting
-
-> My API keeps timing out.
-
-We already set much larger timeout limits for the API clients that we use for training and evals. If you still encounter API timeout or connection errors, then this may be caused by your OS limiting the number of open file descriptors. Try increasing the maximum number of open files with
-
-```bash
-ulimit -n 32000
-```
-
-> I'm getting CUDA out of memory errors.
-
-Assuming this is happening on the RL or SFT trainer, you can try the following:
-- Use full activation checkpointing (`--model.ac`)
-- Reduce the micro batch size (`--data.micro-batch-size`) and sequence length (`--data.seq-len`)
-- (*Experimental*) Use context parallelism with `--model.cp`
-
-> I cannot pass my TOML config file
-
-Check that you *did* leave a whitespace between the `@` and the config file (e.g. `uv run ... @ path/to/config.toml` instead of `uv run ... @path/to/config.toml`). Also, make sure that your TOML config matches the configuration schema. If not, the Pydantic error message (which arguably is quite ugly) will hopefully point you in the right direction.
-