Draft
67 commits
127dc89
docs: rewrite into 8 task-oriented pages with auto-generated reference
mikasenghaas May 22, 2026
1220bc5
docs: fix stale claims found in source verification pass
mikasenghaas May 22, 2026
154de2e
ci: enforce docs/reference.md stays in sync with configs
mikasenghaas May 22, 2026
b234055
chore: rename pre-commit hook id to docs-reference
mikasenghaas May 22, 2026
5aa8765
docs(overview): tighten landing page
mikasenghaas May 23, 2026
6c6c70e
docs(configuration): trim and reorganize
mikasenghaas May 23, 2026
e031e19
docs(training): drop redundant OpenAI-compatible prefix on vLLM server
mikasenghaas May 23, 2026
1d50afe
docs(training): trim and restructure per review
mikasenghaas May 23, 2026
59a0d69
docs: trim scaling layout table and move multi-node logs to training
mikasenghaas May 23, 2026
0afd6b8
docs(algorithms): add Renderers section, trim renderer prose
mikasenghaas May 23, 2026
9cff8f4
docs: refresh FAQ + training W&B claim + recommend prime eval
mikasenghaas May 23, 2026
b305b86
docs(faqs): drop the CP <= 8 recommendation
mikasenghaas May 23, 2026
5b9898d
docs(faqs): drop the vLLM log-quieting and KV-cache pressure Q&As
mikasenghaas May 23, 2026
dc40f71
docs(faqs): drop the Environments Hub install Q&A
mikasenghaas May 23, 2026
47671be
docs(faqs): drop the Models and environments section
mikasenghaas May 23, 2026
3b77e8c
Merge remote-tracking branch 'origin/main' into chore/docs-revamp
mikasenghaas May 25, 2026
1bb3350
docs(overview): second tightening pass
mikasenghaas May 25, 2026
9ec03bb
docs(configuration): second tightening pass
mikasenghaas May 25, 2026
6f73062
docs(training): second tightening pass
mikasenghaas May 25, 2026
aa912b5
docs(training): SFT trainer metrics + tools column + minor renames
mikasenghaas May 25, 2026
232791e
docs(training): rename 'Saving HF weights for serving' -> 'Serving ch…
mikasenghaas May 25, 2026
059a2c9
docs: cross-link the agent skills from training + configuration
mikasenghaas May 25, 2026
86bd5cd
docs(training): fold Important metrics into RL + SFT trainer sections
mikasenghaas May 25, 2026
9d5844a
docs(training): drop Prometheus and BetterStack subsection
mikasenghaas May 25, 2026
ebcc9b8
docs(training): tighten Useful knobs language
mikasenghaas May 25, 2026
d5b7ac8
docs(training): trim RL Performance table to the bottleneck signals
mikasenghaas May 25, 2026
b7b5ab0
docs(training): refresh chat-template prefix-property paragraph
mikasenghaas May 25, 2026
b40e36f
docs(training): fix SFT data row (data.name, drop default data.type)
mikasenghaas May 25, 2026
27ce099
docs(training): drop loss/nan_count from SFT trainer metrics
mikasenghaas May 25, 2026
22c17a8
docs(advanced): drop the Environments section
mikasenghaas May 25, 2026
bd3b345
docs(advanced): rename "MoE models" -> "Custom modeling" + "LoRA" -> …
mikasenghaas May 25, 2026
714af81
docs(advanced): drop 'Poolside' prefix from Laguna family row
mikasenghaas May 25, 2026
ae7d036
docs: rename Vision-language models -> Multimodal training
mikasenghaas May 25, 2026
3eca088
docs(advanced): drop the DeepEP grad-clip-disable tradeoff aside
mikasenghaas May 25, 2026
ae12a2e
docs(advanced): fix the freeze_vision_encoder + LoRA claim
mikasenghaas May 25, 2026
d9866b7
docs(advanced): drop the Multi-turn training subsection
mikasenghaas May 25, 2026
b72f12a
docs(advanced): rename Multi-run manager -> Multi-tenant training
mikasenghaas May 25, 2026
6ee5c5c
docs(advanced): trim multi-tenant training to user-facing surface only
mikasenghaas May 25, 2026
fa06df7
docs: add a Development page; move "Testing MoE at small scale" off A…
mikasenghaas May 25, 2026
6a98d2e
docs(algorithms): fold Renderers into Multi-turn trajectories
mikasenghaas May 25, 2026
858d428
docs(algorithms): drop max_async_level tuning + reframe as fixed one-…
mikasenghaas May 25, 2026
0b7daac
docs(development): add Test suite + Pre-commit hooks sections
mikasenghaas May 25, 2026
414e3e9
docs(algorithms): fold Length penalties into Default advantage
mikasenghaas May 25, 2026
00f2c8f
docs(advanced): add Difficulty pools + Online difficulty filtering
mikasenghaas May 25, 2026
8d61759
docs: relocate Difficulty pools + ODF from Advanced to Algorithms
mikasenghaas May 25, 2026
c8a64b3
docs(development): split MoE-debug recipe; promote "Adding a new arch…
mikasenghaas May 25, 2026
b39704b
docs(development): tighten Debugging MoE subsections
mikasenghaas May 25, 2026
9390893
docs: rename Worked example -> Examples; update reference generator
mikasenghaas May 25, 2026
21d4296
docs(training): correct batch-size rule of thumb
mikasenghaas May 25, 2026
7e7ad00
docs: relocate "Disaggregated prefill/decode inference" to Advanced
mikasenghaas May 25, 2026
ce4637e
docs(scaling): correct the Single GPU section
mikasenghaas May 25, 2026
b43311b
docs(scaling): collapse Single GPU / multi-GPU / Multi-node manual in…
mikasenghaas May 25, 2026
0cbaff9
docs: drop Kubernetes coverage for now
mikasenghaas May 25, 2026
3c9509b
docs(development): tighten test-suite layout bullets
mikasenghaas May 25, 2026
094ad23
docs(development): drop the nightly 24h-timeout / research-cluster aside
mikasenghaas May 25, 2026
09bf4df
docs(development): polish 'Adding a new architecture' prose
mikasenghaas May 25, 2026
7e62dc3
ci(docs-reference): add minimal GITHUB_TOKEN permissions
mikasenghaas May 25, 2026
3e2216e
docs: refresh architecture + async diagrams
mikasenghaas May 25, 2026
a9c3efa
docs: normalize package mentions to [\`pkg\`](github-url)
mikasenghaas May 26, 2026
2d0ce61
docs(overview): drop the async-by-default + loss summary paragraph
mikasenghaas May 26, 2026
9fc78cf
docs(overview): tighten 'Where to go next' to one-line pitches
mikasenghaas May 26, 2026
17d24ee
docs(readme): collapse test-suite commands into one code block
mikasenghaas May 26, 2026
ac2b0e5
docs: review-pass cleanups (max_async_level fallout, anchor fix, smok…
mikasenghaas May 26, 2026
f6bff7d
docs: review-pass cleanup (drop FAQs, Title Case, smell fixes)
mikasenghaas May 26, 2026
6ebf87d
Merge remote-tracking branch 'origin/main' into chore/docs-revamp
mikasenghaas May 26, 2026
1b5fcab
docs: tightening pass after main merge
mikasenghaas May 26, 2026
e10a1df
docs: drop the auto-generated reference page and its tooling
mikasenghaas May 26, 2026
45 changes: 12 additions & 33 deletions README.md
@@ -52,7 +52,7 @@ With `[model] impl = "auto"` (the default), the trainer selects that custom stac
| GLM-5 (`glm_moe_dsa`) | `zai-org/GLM-5`, `zai-org/GLM-5-FP8` | yes | ✅ | ✅ |
| Qwen3 MoE (`qwen3_moe`) | `Qwen/Qwen3-30B-A3B`, … | yes | ✅ | ✅ |
| Qwen3.5 MoE (`qwen3_5_moe`) | `Qwen/Qwen3.5-35B-A3B`, … | yes | ✅ | ✅ |
| Qwen3 / Qwen3.5 VLMs | [multimodal.md](docs/multimodal.md) (`qwen3_vl`, `qwen3_5`, `qwen3_5_moe`) | MoE only on MoE VLMs | MoE only | ✅ |
| Qwen3 / Qwen3.5 VLMs | see [advanced.md](docs/advanced.md#multimodal-training) (`qwen3_vl`, `qwen3_5`, `qwen3_5_moe`) | MoE only on MoE VLMs | MoE only | ✅ |
| Poolside Laguna (`laguna`) | `poolside/Laguna-XS.2` | yes | ✅ | ✅ |
| MiniMax M2 (`minimax_m2`) | `MiniMax/MiniMax-M2` | yes | ✅ | ✅ |
| Nemotron H (`nemotron_h`) | `nvidia/Nemotron-3-Nano-30B-A3B`, `nvidia/Nemotron-3-Super-120B-A12B`, … | yes | ✅ | ❌ |
@@ -217,17 +217,13 @@ These guides are designed to be run from a Slurm cluster but can also be adapted

Check out the [docs](docs) directory for in-depth guides on how to use PRIME-RL.

- [**Entrypoints**](docs/entrypoints.md) - Overview of the main components (orchestrator, trainer, inference) and how to run SFT, RL, and evals
- [**Configs**](docs/configs.md) - Configuration system using TOML files, CLI arguments, and environment variables
- [**Environments**](docs/environments.md) - Installing and using verifiers environments from the Environments Hub
- [**Async Training**](docs/async.md) - Understanding asynchronous off-policy training and step semantics
- [**Logging**](docs/logging.md) - Logging with loguru, torchrun, and Weights & Biases
- [**Checkpointing**](docs/checkpointing.md) - Saving and resuming training from checkpoints
- [**Benchmarking**](docs/benchmarking.md) - Performance benchmarking and throughput measurement
- [**Deployment**](docs/deployment.md) - Training deployment on single-GPU, multi-GPU, and multi-node clusters
- [**Memory Usage**](docs/memory_usage.md) - Techniques for reducing memory usage (activation checkpointing, offloading, EP, CP, LoRA, etc.)
- [**Troubleshooting**](docs/troubleshooting.md) - Common issues and their solutions
- [**Multimodal**](docs/multimodal.md) - Training VLMs like Qwen3-VL
- [**Overview**](docs/overview.md) - Architecture, install, and a copy-pasteable end-to-end RL run
- [**Configuration**](docs/configuration.md) - TOML composition, CLI overrides, env vars, validation
- [**Training**](docs/training.md) - RL, SFT, evals, checkpointing, observability, rules of thumb
- [**Scaling**](docs/scaling.md) - Single-GPU through multi-node, FSDP/EP/CP, SLURM, benchmarking
- [**Algorithms**](docs/algorithms.md) - Async/off-policy training, the AIPO loss, advantage and filter plugins, trajectory merging
- [**Advanced**](docs/advanced.md) - Custom modeling, multimodal training, LoRA, multi-tenant training
- [**Development**](docs/development.md) - Test suite, pre-commit hooks, adding a new model

## Contributing

@@ -249,28 +245,11 @@ uv run pre-commit install

### Tests

Run the full test suite

```bash
uv run pytest -v
```

To run unit tests, run

```bash
uv run pytest tests/unit -v
```

To run integration tests, run

```bash
uv run pytest tests/integration -v
```

To run CPU-only tests, use the inverse of the `gpu` marker:

```bash
uv run pytest -v -m "not gpu"
uv run pytest -v # everything
uv run pytest tests/unit -v # unit only
uv run pytest tests/integration -v # integration only
uv run pytest -v -m "not gpu" # CPU-only (inverse of the gpu marker)
```

## License
2 changes: 1 addition & 1 deletion configs/debug/training_modes/README.md
@@ -44,4 +44,4 @@ uv run rl @ configs/debug/training_modes/sft_lora.toml
uv run rl @ configs/debug/training_modes/sft_external.toml
```

See [docs/training_modes.md](../../docs/training_modes.md) for what each mode does.
See [docs/training.md](../../docs/training.md#training-modes-rl--opd--sft-via-orchestrator) for what each mode does.
147 changes: 147 additions & 0 deletions docs/advanced.md
@@ -0,0 +1,147 @@
# Advanced

This page covers the specialized features layered on top of the core training stack: our custom model implementations (with EP for MoE families and CP for long-context training), multimodal training, LoRA training, multi-tenant training, and disaggregated prefill/decode inference. For developer-side workflows (adding new model architectures, debugging modeling code at small scale), see [Development](development.md).

## Table of Contents

- [Custom Modeling](#custom-modeling)
- [Expert Parallelism Backends](#expert-parallelism-backends)
- [Multimodal Training](#multimodal-training)
- [Supported Families](#supported-families)
- [Enabling VLM Mode](#enabling-vlm-mode)
- [Limitations](#limitations)
- [LoRA Training](#lora-training)
- [Multi-Tenant Training](#multi-tenant-training)
- [Disaggregated Prefill/Decode Inference](#disaggregated-prefilldecode-inference)

## Custom Modeling

`prime-rl` ships custom optimized model implementations for several MoE families. With `model.impl = "auto"` (default) the trainer picks the custom path when the HF config type is registered, falling back to plain HF otherwise. To force one:

```toml
[trainer.model]
impl = "custom" # or "hf" to force the HF path
```

| Family | Example checkpoints | EP | CP |
|---|---|---|---|
| GLM-5 (`glm_moe_dsa`) | `zai-org/GLM-5`, `zai-org/GLM-5-FP8` | ✅ | ✅ |
| Qwen3 MoE | `Qwen/Qwen3-30B-A3B`, … | ✅ | ✅ |
| Qwen3.5 MoE | `Qwen/Qwen3.5-35B-A3B`, … | ✅ | ✅ |
| Qwen3 / Qwen3.5 VLMs | see [Multimodal training](#multimodal-training) | MoE only | ✅ |
| Laguna | `poolside/Laguna-XS.2` | ✅ | ✅ |
| MiniMax M2 | `MiniMax/MiniMax-M2` | ✅ | ✅ |
| Nemotron H | `nvidia/Nemotron-3-Nano-30B-A3B`, … | ✅ | ❌ |
| Trinity (AFMoE) | `arcee-ai/Trinity-Mini`, … | ✅ | ✅ |
| GLM-4 / GLM-4.5 / INTELLECT-3 | `THUDM/GLM-4-9B-0414`, `zai-org/GLM-4.5`, `PrimeIntellect/INTELLECT-3`, … | ✅ | ✅ |
| GPT-OSS (HF MoE) | `openai/gpt-oss-20b`, `openai/gpt-oss-120b` | ❌ | ✅ |

The custom path enables EP, selective activation checkpointing, FP8 training (`model.fp8 = true`, requires SM90+), and faster MoE kernels (`moe_use_grouped_mm = true`, default). Forcing `impl = "hf"` is mostly useful when debugging — it's slower and disables most MoE-specific knobs.
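The `auto` selection rule described above can be sketched in a few lines. This is a minimal illustration, not the trainer's actual code: `resolve_impl` and `CUSTOM_CONFIG_TYPES` are hypothetical names, the registry holds only a handful of config types mentioned in the README, and raising when `custom` is forced on an unregistered type is an assumption.

```python
# Hypothetical registry: a few HF config types that have a custom implementation.
CUSTOM_CONFIG_TYPES = {"glm_moe_dsa", "qwen3_moe", "qwen3_5_moe", "laguna", "minimax_m2", "nemotron_h"}


def resolve_impl(requested: str, config_type: str, registry=CUSTOM_CONFIG_TYPES) -> str:
    """Return the implementation ("custom" or "hf") to use for a given HF config type."""
    if requested not in {"auto", "custom", "hf"}:
        raise ValueError(f"unknown impl {requested!r}")
    if requested == "auto":
        # Prefer the custom path when the config type is registered, else plain HF.
        return "custom" if config_type in registry else "hf"
    if requested == "custom" and config_type not in registry:
        # Assumption: forcing the custom path for an unregistered type is an error.
        raise ValueError(f"no custom implementation registered for {config_type!r}")
    return requested


print(resolve_impl("auto", "qwen3_moe"))          # registered type
print(resolve_impl("auto", "unregistered_type"))  # falls back to HF
```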

### Expert Parallelism Backends

`model.ep_comm_backend` picks the all-to-all kernel used for EP dispatch/combine:

- **`torch`** (default): TorchTitan's all-to-all collective. Works everywhere, no extra install.
- **`deepep`**: Custom kernels from DeepEP. Faster, but requires a DeepEP build (`bash scripts/install_deep_gemm.sh`, `bash scripts/install_ep_kernels.sh`) and tuning `deepep_num_sms` (default 20) and `deepep_token_chunk_size` for your hardware.

DeepEP intranode dispatch derives the RDMA channel count as `deepep_num_sms / 2`. A lower SM count leaves more SMs free for compute; a higher count speeds up dispatch. Useful starting points: 16–24 SMs on H100, 20–40 on B200.

When you enable DeepEP, gradient clipping is auto-disabled (`optim.max_norm` set to `None`) because the kernels don't currently support it.
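To make the two DeepEP side effects concrete, here is a small sketch of how they could be applied to a settings object. `EPSettings` and `apply_deepep_rules` are hypothetical names, the `max_norm = 1.0` default is an assumption, and integer division is assumed for the channel count.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class EPSettings:
    ep_comm_backend: str = "torch"
    deepep_num_sms: int = 20
    max_norm: Optional[float] = 1.0  # hypothetical default clipping norm


def apply_deepep_rules(settings: EPSettings) -> Tuple[EPSettings, Optional[int]]:
    """Return the adjusted settings and the derived channel count (None for torch)."""
    if settings.ep_comm_backend != "deepep":
        return settings, None
    # Channel count is half the SM budget (integer division assumed here).
    channels = settings.deepep_num_sms // 2
    # Gradient clipping is switched off because the kernels do not support it.
    return replace(settings, max_norm=None), channels


resolved, channels = apply_deepep_rules(EPSettings(ep_comm_backend="deepep"))
print(channels, resolved.max_norm)  # prints "10 None"
```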

## Multimodal Training

### Supported Families

The built-in VLM registry covers:

| Family | `model_type` | Vision attr | LM attr |
|---|---|---|---|
| Qwen3-VL | `qwen3_vl` | `model.visual` | `model.language_model` |
| Qwen3-VL MoE | `qwen3_vl_moe` | `model.visual` | `model.language_model` |
| Qwen3.5 | `qwen3_5` | `model.visual` | `model.language_model` |
| Qwen3.5-MoE | `qwen3_5_moe` | `model.visual` | `model.language_model` |

For a model not in the table, look up the attribute paths on the loaded HF model with `model.named_children()` and set them under `[model.vlm]` directly.

### Enabling VLM Mode

Add `[model.vlm]` and bfloat16 dtypes:

```toml
[model]
name = "Qwen/Qwen3-VL-4B-Instruct"
optimization_dtype = "bfloat16"
reduce_dtype = "bfloat16"

[model.vlm]
vision_encoder_attr = "model.visual"
language_model_attr = "model.language_model"
# freeze_vision_encoder = true # default; set false to fine-tune the encoder
```

A bad attribute path errors immediately — no silent fallbacks. The weight-broadcast key prefix is derived as `{language_model_attr}.layers.` automatically.

To add a new model family permanently, append an entry to `VLM_REGISTRY` in `src/prime_rl/utils/vlm.py`.
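The fail-fast attribute lookup and the derived broadcast prefix can be illustrated as follows. This is a sketch under stated assumptions: `resolve_attr` and `broadcast_key_prefix` are hypothetical helpers, and the toy object stands in for a loaded HF model.

```python
from types import SimpleNamespace


def resolve_attr(root, dotted_path: str):
    """Walk a dotted attribute path, failing loudly on the first missing part."""
    obj = root
    for part in dotted_path.split("."):
        if not hasattr(obj, part):
            raise AttributeError(f"{dotted_path!r}: no attribute {part!r} on {type(obj).__name__}")
        obj = getattr(obj, part)
    return obj


def broadcast_key_prefix(language_model_attr: str) -> str:
    """Derive the weight-broadcast key prefix from the language model path."""
    return f"{language_model_attr}.layers."


# Toy stand-in for a loaded model; real models expose nn.Module children instead.
toy = SimpleNamespace(model=SimpleNamespace(visual="vision-encoder", language_model="language-model"))

print(resolve_attr(toy, "model.visual"))
print(broadcast_key_prefix("model.language_model"))
```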

### Limitations

- **Vision encoder frozen by default.** Set `freeze_vision_encoder = false` to fine-tune it; in that case it's FSDP-sharded per block. The combination `freeze_vision_encoder = false` + LoRA is rejected by a config validator — LoRA freezes everything non-adapter, so unfreezing the encoder under LoRA would be a silent no-op.
- **No multimodal-safe truncation.** Token sequences are truncated to `seq_len`, but `pixel_values` and `image_grid_thw` pass through unchanged. If a sample's tokens overflow, image tokens may get dropped while image tensors still describe the full image set. Set `seq_len` to cover your longest sample.
- **bfloat16 mandatory.** The trainer config validator refuses any other `optimization_dtype` / `reduce_dtype` for VLMs — vLLM serves VLMs in bfloat16 and a mismatch breaks the importance ratio.
- **Higher KL mismatch with multi-image inputs.** Expect noisier `mismatch_kl` than text-only; this is from minor numerical differences between the trainer's and vLLM's image processing.
- **Images aren't logged to monitors.** Sample logging captures the prompt text but not the actual images.
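Two of the limitations above are enforced by config validation. A rough sketch of what such checks might look like follows; `validate_vlm_config` is a hypothetical function and the error messages are illustrative, not the ones the trainer emits.

```python
from typing import List


def validate_vlm_config(
    optimization_dtype: str,
    reduce_dtype: str,
    freeze_vision_encoder: bool = True,
    lora_enabled: bool = False,
) -> List[str]:
    """Collect validation errors for a VLM config; an empty list means it is valid."""
    errors = []
    # VLMs require bfloat16 for both dtypes to match how vLLM serves them.
    for name, value in (("optimization_dtype", optimization_dtype), ("reduce_dtype", reduce_dtype)):
        if value != "bfloat16":
            errors.append(f"{name} must be 'bfloat16' for VLMs, got {value!r}")
    # LoRA freezes all non-adapter weights, so unfreezing the encoder would do nothing.
    if lora_enabled and not freeze_vision_encoder:
        errors.append("freeze_vision_encoder = false cannot be combined with LoRA")
    return errors


print(validate_vlm_config("bfloat16", "bfloat16"))
print(validate_vlm_config("float32", "bfloat16", freeze_vision_encoder=False, lora_enabled=True))
```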

## LoRA Training

LoRA is enabled by adding `[model.lora]`:

```toml
[model.lora]
rank = 16
alpha = 32
dropout = 0.0
```

`target_modules` defaults to a reasonable cross-family set (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`, `experts`, plus a few latent-projection names for Nemotron). Unknown names are silently ignored, so the defaults work across architectures. Add architecture-specific names to extend coverage (e.g. `in_proj` / `out_proj` for Mamba).
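As an illustration of why unknown target names are harmless, the sketch below selects modules whose last name component appears in the target list. Matching on the final dotted component is an assumption about how targets are resolved, `select_lora_modules` is a hypothetical helper, and the module names are made up. The scale shown is the standard LoRA convention of `alpha / rank`, which may differ from what the trainer applies.

```python
DEFAULT_TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "experts"]


def select_lora_modules(module_names, target_modules):
    """Keep modules whose last dotted component is a target; unmatched targets are ignored."""
    targets = set(target_modules)
    return [name for name in module_names if name.rsplit(".", 1)[-1] in targets]


# Made-up module names for a small attention + MLP block.
names = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.o_proj",
    "layers.0.mlp.down_proj",
    "layers.0.input_layernorm",
]

# "in_proj" matches nothing here, so it has no effect on the selection.
selected = select_lora_modules(names, DEFAULT_TARGETS + ["in_proj"])
scale = 32 / 16  # alpha / rank under the standard LoRA convention

print(selected)
print(scale)
```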

LoRA is supported across SFT and RL. For RL, `weight_broadcast.type = "nccl"` is **not** supported with LoRA — use the default filesystem transport. To save the raw adapter alongside the merged HF weights:

```toml
[ckpt.weights]
save_adapter_separately = true
```

LoRA pairs naturally with [multi-tenant training](#multi-tenant-training) — each tenant gets its own adapter and the backbone is shared across all of them in trainer memory.

## Multi-Tenant Training

Multi-tenant training lets a single trainer + inference deployment serve many concurrent LoRA "tenants" — each a fully isolated run with its own orchestrator, LoRA adapter, optimizer, scheduler, checkpoints, and progress tracking — sharing the same backbone weights and the same vLLM server. This is the topology behind hosted training on the [Prime Intellect platform (Lab)](https://app.primeintellect.ai). The trainer-side implementation is the `MultiRunManager` singleton, enabled by setting `trainer.max_concurrent_runs > 1`. For the full API surface, see [`src/prime_rl/trainer/runs/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/src/prime_rl/trainer/runs).

## Disaggregated Prefill/Decode Inference

For large MoE serving, splitting prefill and decode onto separate vLLM groups can substantially improve throughput. Pick the prefill:decode ratio based on workload shape:

| Workload | P:D ratio | Why |
|---|---|---|
| Agentic (SWE, Lean) | 3:1 | Long growing contexts → prefill-heavy |
| Non-agentic (math, chat) | 1:2 | Short prompts, long generations → decode-heavy |

Example config: [`examples/glm5_pd_disag/rl.toml`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/examples/glm5_pd_disag/rl.toml) — full RL run on `GLM-5` with P/D disaggregation behind a `vllm-router`, FP8 inference, and NCCL weight broadcast (see the [README](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/examples/glm5_pd_disag) for the launch story).

Monitor live queue depths to detect imbalance:

```bash
curl -s http://<prefill_node>:8100/metrics | grep num_requests_waiting
curl -s http://<decode_node>:8200/metrics | grep num_requests_waiting
```

If requests queue up on prefill while decode sits idle, add prefill nodes (and vice versa).
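The same check can be scripted once the metrics text has been fetched. The sketch below parses Prometheus-style text and suggests which side to grow; `waiting_requests` and `rebalance_hint` are hypothetical helpers, the sample lines and the threshold are illustrative, and it assumes each metric line ends with its value (no trailing timestamp).

```python
def waiting_requests(metrics_text: str) -> float:
    """Sum all num_requests_waiting samples in Prometheus-style metrics text."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "num_requests_waiting" in line:
            total += float(line.rsplit(" ", 1)[-1])
    return total


def rebalance_hint(prefill_waiting: float, decode_waiting: float, threshold: float = 1.0) -> str:
    """Suggest which group to grow based on where requests are queuing."""
    if prefill_waiting >= threshold and decode_waiting < threshold:
        return "add prefill nodes"
    if decode_waiting >= threshold and prefill_waiting < threshold:
        return "add decode nodes"
    return "balanced"


# Illustrative samples, not captured from a real server.
prefill_text = '# TYPE vllm:num_requests_waiting gauge\nvllm:num_requests_waiting{model_name="m"} 7.0\n'
decode_text = 'vllm:num_requests_waiting{model_name="m"} 0.0\n'

print(rebalance_hint(waiting_requests(prefill_text), waiting_requests(decode_text)))
```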

**UCX 1.19 requirement.** NVSHMEM needs UCX ≥ 1.19 for multi-GPU CUDA. Most clusters ship UCX 1.17 via HPC-X, which manifests as `cuStreamCreate: invalid device context` errors during DeepEP internode dispatch. Check with `/opt/hpcx/ucx/bin/ucx_info -v` and, if needed, build from source:

```bash
salloc -N 1 --gres=gpu:1 bash -c 'bash scripts/install_nixl_from_source.sh'
```

The script writes UCX 1.19 to `third_party/ucx/`; the bundled sbatch templates prepend it to `LD_LIBRARY_PATH` so it overrides the system version.
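When automating the version check, compare versions numerically rather than as strings (as strings, `"1.9"` sorts after `"1.19"`). A minimal sketch, assuming the version appears somewhere in the `ucx_info -v` output as `major.minor[.patch]`; `parse_version` and `ucx_is_new_enough` are hypothetical helpers and the sample strings are illustrative.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Extract the first major.minor[.patch] version found in the text."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found")
    # A missing patch component is treated as 0.
    return tuple(int(group or 0) for group in match.groups())


def ucx_is_new_enough(text: str, minimum: Tuple[int, int, int] = (1, 19, 0)) -> bool:
    """Return True when the reported version meets the minimum (tuple comparison)."""
    return parse_version(text) >= minimum


print(ucx_is_new_enough("Library version: 1.17.0"))
print(ucx_is_new_enough("Library version: 1.19.0"))
```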