docs: big revamp by mikasenghaas · Pull Request #2602 · PrimeIntellect-ai/prime-rl

mikasenghaas · 2026-05-22T23:47:42Z

Summary

Replaces 22 small/uneven docs files with 8 longer pages modeled on the verifiers docs — each page opens with a TOC, content is grouped by task (configure, train, scale, …) instead of by feature, and headings use Title Case for tone consistency with the verifiers docs.
Adds docs/reference.md, a single auto-generated field-by-field reference covering every entrypoint config (rl, sft, trainer, orchestrator, inference). Walks list-typed sub-configs (e.g. [[orchestrator.train.env]]), discriminated unions, and Optional[BaseConfig] fields; types render as code spans.
Adds scripts/generate_docs_reference.py (the generator), .github/workflows/docs-reference.yaml (CI guard that fails the build on drift), and a docs-reference pre-commit hook (regenerates on staged config changes). Regenerate manually with uv run python scripts/generate_docs_reference.py.
Refreshes the architecture and async-pipeline diagrams under docs/assets/.
Updates mint.json nav and the top-level README.md docs index.

New Structure

| Page | Role |
| --- | --- |
| `overview.md` | Architecture diagram, three-process tour, install, one runnable RL command, docs index |
| `configuration.md` | TOML composition, CLI overrides, dry-run, syntax (booleans, lists, dicts, optional sub-configs, discriminated unions, env arrays), examples tour |
| `training.md` | RL + SFT trainer entrypoints, useful knobs, training modes (RL / OPD / SFT), checkpointing, observability, rules of thumb |
| `scaling.md` | Single-node vs. multi-node deployment, parallelism knobs (FSDP / EP / CP / AC / optimizer offloading / LM head chunking), memory-tight recipe, SLURM, benchmarking |
| `algorithms.md` | Async / off-policy training, default + custom loss / advantage / filters, difficulty pools, online difficulty filtering, multi-turn trajectories with renderers |
| `advanced.md` | Custom modeling, multimodal training, LoRA training, multi-tenant training, disaggregated prefill/decode inference |
| `development.md` | Test suite (unit / integration / nightly), pre-commit hooks, adding a new model |
| `reference.md` | Auto-generated field-by-field reference for every entrypoint config |

Old → New Content Mapping

| Old | New home |
| --- | --- |
| `index.md`, `entrypoints.md` | `overview.md` + `training.md` |
| `configs.md`, `environments.md` | `configuration.md` |
| `training_modes.md` | `training.md` § Training Modes |
| `async.md`, `bring-your-own-algorithms.md`, `trajectories.md` | `algorithms.md` |
| `logging.md`, `metrics.md`, `platform-monitoring.md`, `checkpointing.md` | `training.md` § Observability + Checkpointing |
| `deployment.md`, `slurm.md`, `benchmarking.md`, `memory_usage.md` | `scaling.md` |
| `disaggregated-inference.md` | `advanced.md` § Disaggregated Prefill/Decode Inference |
| `multimodal.md`, `multi_run_manager.md` | `advanced.md` (Multimodal Training, Multi-Tenant Training) |
| `testing-moe-at-small-scale.md` | `development.md` § Adding a New Model |
| `troubleshooting.md`, `kubernetes.md` | Dropped (`troubleshooting` folded into per-page prose; `kubernetes` deferred) |

🤖 Generated with Claude Code

Replaces 22 small/uneven docs files with 8 longer pages modeled on the verifiers docs: each page opens with a TOC, content is grouped by task (configure, train, scale, …) instead of by feature, and a single auto-generated reference page covers every config field. New pages: overview, configuration, training, scaling, algorithms, advanced, faqs, reference. reference.md is generated by scripts/generate_docs_reference.py from the Pydantic config models; regenerate with `uv run python scripts/generate_docs_reference.py`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- algorithms.md: max_async_level default is 1 (not 2); default loss includes the Kimi-K2.5 KL regularizer (was wrongly claimed to drop it); update the formula to show the full L = -PG + tau_KL * KL form; filter table is `[[orchestrator.filters]]` (plural)
- training.md: checkpoint paths nest under `checkpoints/step_N/{trainer, orchestrator}/` rather than separate hierarchies; --inference-gpu-ids / --trainer-gpu-ids don't exist — use --deployment.num-{infer,train}-gpus and pin physical GPUs via CUDA_VISIBLE_DEVICES; update max_async_level prose to match the new default
- scaling.md: same GPU-flag fix throughout the single/multi-GPU examples; correct the claim that Muon + optim_cpu_offload is unsupported (only fsdp_cpu_offload is blocked)
- configuration.md: there is no generic PRIME_* env var override mechanism in pydantic-config — rewrite the env vars section to list the specific named vars that individual fields read as defaults
- advanced.md: add the qwen3_vl_moe entry to the VLM registry table; the small-scale MoE RL config lives at configs/ci/integration/reverse_text_moe/start.toml, not .../rl/
- faqs.md: update the max_async_level Q&A to match the new default

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two safety nets so the auto-generated reference can't silently drift from the Pydantic config models:

- Pre-commit hook (local): re-runs scripts/generate_docs_reference.py whenever a config class or the generator itself is staged. If the generated file changes, pre-commit fails the commit so the contributor re-stages the regenerated reference.
- GitHub Actions (CI): a small workflow runs the generator and `git diff --exit-code docs/reference.md`. Catches anyone who bypassed the pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
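The drift guard described above boils down to "regenerate, then compare against what is committed". A minimal sketch of that comparison follows; the function names are hypothetical and this is not the code in scripts/generate_docs_reference.py or the workflow, which rely on the real generator plus `git diff --exit-code`.

```python
from pathlib import Path


def reference_has_drifted(committed_path: Path, regenerated_text: str) -> bool:
    """Return True when the committed file no longer matches freshly generated text."""
    if not committed_path.exists():
        # A missing reference file counts as drift: it still has to be generated.
        return True
    return committed_path.read_text(encoding="utf-8") != regenerated_text


def exit_code_for(committed_path: Path, regenerated_text: str) -> int:
    """Mirror the `git diff --exit-code` convention: 0 when clean, 1 on drift."""
    return 1 if reference_has_drifted(committed_path, regenerated_text) else 0
```

Both the local hook and the CI job can share one check like this; the only difference is when it runs (on staged config changes versus on every push).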

- Quick run now uses examples/reverse_text/rl.toml; the env is bundled with the verifiers submodule so no prime env install is needed, and the tmux helper is documented elsewhere instead of duplicated here
- Architecture bullets advertise the SOTA features per process: vLLM multi-node + FP8 + P/D disaggregation for inference; FSDP2 + EP (incl. DeepEP) + CP + selective AC + FP8 + LoRA + multi-run for the trainer
- Drop the "Use prime-rl when you want to" bullets and the CPU-only SFT smoke check — the landing page reads cleaner without them

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop the env-vars section entirely (and the precedence callout that referenced it) — the page is now strictly TOML + CLI. The few named env vars that individual fields read as defaults are out of scope for the config docs and stay in the per-feature pages (training.md, etc.)
- Drop the entrypoint enumeration and the W&B/output-dir recommendation blurb in the intro
- Reword "@ introduces a TOML file" so the sentence doesn't lead with an inline code token; convert the "Mind the space" hint to a blockquote
- Drop --output-dir from the convenience-flag list (it's just another override, not a special flag)
- Note that --dry-run is available on rl, sft, and inference only — the standalone trainer and orchestrator configs don't have a dry_run field
- Split "Booleans, None, and lists" into one section each, matching the pydantic-config README style; add a Dicts section
- Drop the "Prefer --ckpt" convention bullet — checkpointing is covered in training.md and didn't belong in the config conventions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vLLM is an OpenAI-compatible server by default; the prefix in the entrypoints table was just noise. Other mentions of "OpenAI-compatible" describe the API surface or third-party endpoints and stay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Quick-start now uses examples/reverse_text/rl.toml; drop the prime env install and tmux preamble (covered elsewhere)
- Add a "Useful CLI flags" subsection: --ckpt, --wandb, --orchestrator.prime-monitor (Prime Lab), --clean-output-dir, --output-dir, --max-steps, --dry-run
- Mention the env-server / env-worker fan-out in the orchestrator bullet under "What each process does at runtime"
- Restrict the Key knobs table to orchestrator-only args; drop max_async_level, max_completion_tokens, inference, trainer rows; rename rollouts_per_example row to lead with "Group size"
- SFT Launch now uses examples/reverse_text/sft.toml; drop the CPU fake-data smoke alternative
- "two distillation modes" -> "three training modes" (rl/opd/sft)
- Drop the long-run checkpoint-combo recommendation
- Drop the trainer+orchestrator lockstep note from Resuming a run
- Swap order: Platform monitoring now appears before Prometheus + BetterStack under Observability; show --orchestrator.prime-monitor CLI invocation
- Rename "Metrics that matter" -> "Important metrics"; drop the live vLLM curl snippet
- Drop "Eyeball the reward distribution", "Match inference.parallel.tp", and "Set max_async_level deliberately" rules of thumb
- Add new rules of thumb: batch size >= 64; group size >= 8 with the reasoning that all-succeed / all-fail groups give the trainer no signal because the within-group advantage collapses
- Drop the Common Issues section entirely

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
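The group-size rule of thumb can be checked with a few lines of arithmetic. The sketch below assumes a plain mean-baseline advantage (reward minus the group mean, no standard-deviation scaling); the helper name is hypothetical and this is not prime-rl's advantage code.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered advantage within one group of rollouts for the same task."""
    mean = sum(rewards) / len(rewards)
    return [reward - mean for reward in rewards]


# A mixed group carries signal: successes get positive advantage, failures negative.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

# All-fail and all-succeed groups collapse to zeros, so they contribute no gradient.
all_fail = group_advantages([0.0] * 8)
all_succeed = group_advantages([1.0] * 8)
```

Larger groups make it less likely that every rollout for a task lands on the same reward, which is the reasoning behind the "group size >= 8" guidance.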

- scaling.md: drop the 1-GPU row and the "Production MoE with long contexts" row from the "Choosing a layout" table; the disaggregated prefill/decode page section is still findable via its own H2
- scaling.md: drop the trailing "Multi-node logs" section (heading + TOC entry); the content now lives next to single-node log layout
- training.md: fold the multi-node tree into "Log files" with the single-node skip note inlined; add live-tail recipes and the per-rank torchrun debug note; mention the tmux helper works on a SLURM head node

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- New Renderers section explains why best-effort interleaving works: the renderer guarantees the exact-prefix invariant by construction via bridge_to_next_turn. Lists the renderer API surface and the hand-coded model coverage
- Drop the verifiers trajectories-design-note link from Discontinuous trajectories and the --trajectory-strategy branching deprecation
- Drop preserve_all_thinking workaround mentions from algorithms.md and faqs.md (reference.md still documents the fields)
- Leave a TODO(blog-post-url) for the PI site writeup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

faqs.md:

- Drop the "override an env var in TOML" Q&A (matches the configuration page where env vars are no longer documented as a generic override)
- Drop the "max_async_level" Q&A; replace with a max_off_policy_steps Q&A — the more impactful knob to tune on long agentic rollouts
- Drop the outdated "two W&B runs per RL job" Q&A; default is shared now (wandb.shared = true)
- Drop the SFT-section Q&As that referenced preserve_all_thinking or were too thin to keep
- Switch the "evaluate without training" recipe from vf-eval to prime eval run (the Prime CLI is the recommended entrypoint)

training.md:

- Fix the W&B section to describe the new default: shared single run (wandb.shared = true), with the legacy split as opt-out
- Add max_off_policy_steps to the Key knobs table
- Switch the eval example from vf-eval to prime eval run

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # docs/bring-your-own-algorithms.md # docs/slurm.md # docs/training_modes.md

- Inference bullet now leads with the local default (token-in /v1/generate via renderers; OpenAI-compatible routes called out as the external-client path) and adds DP/TP/EP with deepep + flashinfer all-to-all backends + EPLB, P/D disaggregation behind vllm-router, CPU KV-cache offload, and router replay (FP8 MoE numerical-parity feature). Weight broadcast is filesystem or NCCL.
- Orchestrator bullet now leads with "owns the data plane across many verifiers training and eval environments" plus the per-env isolated subprocess + variable-size env-worker pool.
- Trainer bullet drops "torchrun-launched" and surfaces the custom modeling code as the enabler for advanced trainer parallelism (EP with DeepEP, CP for long sequences).
- Drop the [AIPO] link in the async paragraph (off-policy-aware PG + KL regularizer, no paper handle); also drop the "AIPO loss" mention from the Algorithms blurb in "Where to go next" so the page is internally consistent.
- Quick-run command is now bare: uv run rl @ examples/reverse_text/rl.toml (no --wandb.* / --ckpt).
- Drop the trailing scaling pointer (Scaling is already linked in "Where to go next").

Co-authored-by: Cursor <cursoragent@cursor.com>

- Drop the entrypoint-splitting paragraph ([trainer] / [orchestrator] / [inference] table lifting); covered elsewhere.
- Rename "TOML files and composition" -> "TOML composition", and "Special syntax" -> "Syntax".
- Open "Sources and precedence" by naming the three sources (Pydantic defaults, TOML files, CLI flags) up front, then layering them.
- Drop the "(-- is a kebab-case marker)" parenthetical from CLI overrides; turn the snake/kebab note into a callout.
- Drop the --help / --dry-run convenience-flag block and the "--dry-run is the single most useful debugging tool" prose; the bash example is enough.
- Reorder Syntax subsections to mirror the pydantic-config README: Booleans -> Lists -> Dicts -> Optional sub-configs -> None -> Discriminated unions -> Environments. None moves down and is cross-linked from "disabling an optional sub-config".
- Booleans example swapped from --ckpt (which is itself an optional sub-config) to --clean-output-dir (a real bool = False field), showing both --flag and --no-flag forms.
- Lists / Dicts now show TOML and CLI on the *same* field name so the mapping is obvious (target_modules for lists, env.0.args for dicts), and add the "lists are replaced wholesale" overlay note + "dicts deep-merge across sources" detail.
- Add a callout on validation aliases (rollouts_per_example still works after the rename to group_size) — only material gap vs the pydantic-config README that's relevant to end users.
- Worked example: --dry-run is now the final flag.
- Drop the Conventions section.

Co-authored-by: Cursor <cursoragent@cursor.com>
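The "lists are replaced wholesale" and "dicts deep-merge across sources" rules can be illustrated with a small overlay function. This is a sketch of the stated semantics only, not pydantic-config's implementation; the `overlay` helper and the keys and values in the example are placeholders.

```python
def overlay(base: dict, override: dict) -> dict:
    """Layer `override` on top of `base`: dicts merge recursively, everything else is replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Dicts deep-merge, so keys only present in the lower layer survive.
            merged[key] = overlay(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale by the higher-precedence source.
            merged[key] = value
    return merged


# Three sources, lowest to highest precedence (placeholder values).
defaults = {"target_modules": ["q_proj", "v_proj"], "args": {"split": "train", "seed": 0}}
toml_file = {"target_modules": ["o_proj"], "args": {"seed": 1}}
cli_flags = {"args": {"limit": 100}}

resolved = overlay(overlay(defaults, toml_file), cli_flags)
```

Here `resolved["target_modules"]` is just `["o_proj"]` (the default list is gone), while `resolved["args"]` keeps `split` from the defaults, `seed` from the TOML layer, and `limit` from the CLI layer.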

- Rename "RL training" -> "RL trainer" and "SFT training" -> "SFT trainer" (and update the page intro accordingly).
- Entrypoints table: clarify that `uv run rl` wraps the trainer, orchestrator, and inference server in one launch — runs locally for single-node experiments and submits to SLURM for single- or multi-node when [slurm] is set. Drop the trailing "rl is a convenience wrapper" paragraph.
- RL trainer Launch: minimal command is now bare `uv run rl @ examples/reverse_text/rl.toml` (no flags). Drop the GPU-placement paragraph + multi-GPU example (covered in scaling.md).
- Replace "Useful CLI flags" + "Key knobs" + "What each process does at runtime" with one consolidated "Useful knobs" section split into three sub-tables: data-and-algorithm, monitoring, run management.
- Add training environments ([[orchestrator.train.env]] for multi-env training)
- Add eval environments ([[orchestrator.eval.env]] + orchestrator.eval.interval)
- Add monitoring entries: orchestrator.log.vf_level, --wandb, --orchestrator.prime-monitor
- Move Training modes from a top-level section into RL trainer as a subsection (it's RL-entrypoint-specific).
- Drop the standalone Evaluations section — eval syntax is covered in configuration.md and the eval-knobs row in Useful knobs links to `prime eval` for one-off evals.
- Drop the optimization_dtype / reduce_dtype rule of thumb.

Co-authored-by: Cursor <cursoragent@cursor.com>

- Drop "-via-orchestrator" from the Training modes heading and the internal/cross-doc anchors. The mode value is just `sft` and the short title reads cleaner.
- Drop "and the tmux helper" from the Console output subsection title; the tmux helper is still documented in the section body.
- Important metrics is now split into RL trainer and SFT trainer subsections so the SFT-only metrics (loss/mean, val/loss, progress/{epoch,num_samples,num_tokens}, optim/zero_grad_ratio, per-subset mixing ratios, MoE max_vio + routing_confidence, perf/peak_memory + the time/* breakdown) are documented.
- SFT Dataset format gains a Tool definitions paragraph: rows can carry a `tools` column (OAI function-calling format) or `tool_defs` (verifiers rollout format), as either a list of dicts or a JSON-encoded string. `tool_defs` is auto-converted to OAI shape before being passed into the chat template's `tools=...` argument. `chat_template_kwargs` rows pass through verbatim.

Co-authored-by: Cursor <cursoragent@cursor.com>

…eckpoints' Co-authored-by: Cursor <cursoragent@cursor.com>

Adds a callout under the intro of training.md and configuration.md pointing at the equivalent skill files for AI agents working in this repo:

- training.md -> skills/training/SKILL.md (top-level routing) + skills/training/start-run/SKILL.md (launch details) + skills/training/monitor-run/SKILL.md (check-in / restart).
- configuration.md -> skills/configs/SKILL.md.

The skills aren't part of the published Mintlify nav, so the links go to GitHub blob URLs.

Co-authored-by: Cursor <cursoragent@cursor.com>

The standalone "## Important metrics" section is gone. Each trainer subsection now ends with its own "### Important metrics" covering only the metrics relevant to that flow:

- RL trainer / Important metrics: reward + rollout signals from the orchestrator, mismatch_kl + entropy + grad_norm from the trainer, and the trainer/orchestrator/vLLM performance grid.
- SFT trainer / Important metrics: loss/mean, val/loss, progress counters, optim signals, MoE max_vio + routing_confidence, and the perf/{throughput,mfu,peak_memory} + time/* breakdown.

TOC updated to point at #important-metrics (RL) and #important-metrics-1 (SFT) — Mintlify de-duplicates with the same -N suffix scheme it already uses for the two Launch subsections.

Co-authored-by: Cursor <cursoragent@cursor.com>

Not ready for end-user docs yet. Both knobs (trainer.metrics_server.port, trainer.heartbeat.url) are still in reference.md for anyone who needs them, just no narrative coverage. Co-authored-by: Cursor <cursoragent@cursor.com>

- Use "task" not "prompt" for the conceptual unit ingested by the orchestrator (rows for batch_size and group_size, plus the matching Rules of thumb wording).
- group_size: drop the "Used for advantage normalization and pass@k estimation" trailer; the row name is enough.
- max_off_policy_steps: rename "throughput-vs-noise dial" to just "off-policy dial".
- eval row: drop the "Scores land in trainer logs and W&B as eval/{env}/{avg@k,pass@k}" + `prime eval` trailer; keep the row scoped to what the knob does.
- Add log.level to the Monitoring table (trainer/orchestrator process log level, $PRIME_LOG_LEVEL fallback, per-process or global on rl).
- Drop --ckpt from Run management; checkpointing has its own section.

Co-authored-by: Cursor <cursoragent@cursor.com>

Drop the throughput/MFU/async-lag/KV-cache rows from the RL trainer Performance table — they're either generic perf metrics already covered by the SFT trainer's perf table or vLLM-internal. The two remaining rows (time/wait_for_batch, time/wait_for_ckpt) are the useful diagnostic — they tell you which side is the bottleneck. Co-authored-by: Cursor <cursoragent@cursor.com>

The old wording recommended only patched checkpoints / a custom chat template as the fix for position-dependent templates. The renderer-based path landed in SFT (use_renderer flag) and is now the primary recommended fix, but it's still off by default (use_renderer: bool = False on SFTConfig), so the patched-checkpoint path also still works — and is what examples/reverse_text/sft.toml uses today via PrimeIntellect/Qwen3-0.6B.

Rewrite the paragraph to cover both fixes:

- Renderer path: use_renderer = true, lists the hand-coded renderers, calls out the VLM unsupported case.
- Patched template path: the prime-rl-patched checkpoint or a user-supplied template that preserves thinking.

Cross-link both to Algorithms § Renderers and § Multi-turn trajectories.

Co-authored-by: Cursor <cursoragent@cursor.com>

- data.type = "sft" is the discriminated-union default for SFTConfig.data, so users don't need to spell it out.
- The dataset path field is data.name, not data.path. Confirmed against SFTDataConfig.name in packages/prime-rl-configs/src/prime_rl/configs/sft.py.

Co-authored-by: Cursor <cursoragent@cursor.com>

Renderers existed as a top-level section because they were added in a follow-up, but conceptually they're the mechanism that makes best-effort interleaving safe — splitting them apart forced a forward-reference and duplicated the "exact-prefix invariant" framing.

- Demote "## Renderers" to "### Renderers" inside "## Multi-turn trajectories", placed between Best-effort interleaving and Discontinuous trajectories.
- Move the Qwen3 thinking-stripping example from Best-effort interleaving into Renderers — it's the failure case the renderer fixes, so it reads better adjacent to the family list / config block.
- Drop the "Workaround: use a chat template that preserves thinking" trailer; the patched-checkpoint workaround for SFT is already documented in training.md and isn't relevant in the orchestrator context (use_renderer defaults to true).
- Open Multi-turn trajectories with a forward-link to Renderers so the reader knows the safety mechanism is coming.
- TOC updated; both #multi-turn-trajectories and #renderers anchors preserved (only their nesting changes), so the existing cross-links from training.md and faqs.md keep working.

Co-authored-by: Cursor <cursoragent@cursor.com>

…step

max_async_level is being deprecated as a user-facing knob (hardcoded to 1; matching reframe in docs/async.md on feat/deprecate-max-async-level). Update algorithms.md so the long-form treatment matches.

- Drop the "### Tuning max_async_level" subsection (the k=0/1/2/>=3 table) and the NCCL-needs-max_async_level=1 line that follows it — both become vacuous when k is fixed at 1.
- Reword the Async / off-policy training intro to describe the one-step overlap directly instead of "up to k steps where k = max_async_level".
- Step semantics: rho_inference is now pi_{max(0, n-1)}, with prose "inference is exactly one step behind the trainer" replacing the generic "gap is at most k steps".
- Drop the Tuning entry from the page TOC.

Other docs (overview.md intro, training.md rule of thumb, faqs.md two entries, scaling.md NCCL-example comment) still mention max_async_level and are now stale; will clean those up in the next turn unless flagged otherwise.

Co-authored-by: Cursor <cursoragent@cursor.com>

Pull in the testing- and contributor-workflow content from README and the GitHub Actions configs so contributors don't have to dig through .github/workflows/ to figure out what runs where.

New sections in development.md:

- Test suite
  - Layout: tests/unit/, tests/integration/, tests/nightly/ — what each tier is for, with the actual file names contributors will encounter.
  - Running tests locally: pytest one-liners (everything, unit-only, integration-only, -m "not gpu", -m gpu, single file).
  - CI workflows: a 3-row table covering cpu_tests.yaml, gpu_tests.yaml (matrix list pulled verbatim from the workflow), and nightly_tests.yaml — including the trigger conditions (cpu = always, gpu = non-draft, nightly = scheduled + workflow_dispatch) and the runners (ubuntu-latest, vm/4xa6000, research-cluster).
  - Markers: gpu + slow, both declared with --strict-markers in pyproject.toml.
- Pre-commit hooks
  - uv run pre-commit install
  - Currently configured hooks: ruff check/format and docs-reference (regenerator that fails the commit if docs/reference.md would drift).

Plus update the Development pitch in overview.md "Where to go next" and README.md docs index to mention the new scope.

Co-authored-by: Cursor <cursoragent@cursor.com>

Length penalties are configured under [orchestrator.length_penalty] and layer on top of *any* advantage function — they're conceptually not a standalone advantage variant. Move them into Default advantage where readers will see the option while reading about advantages. Drive-by: clean up the parenthetical hint in Default advantage. The old version listed "length penalties tied to turn count" as a reason to write a custom advantage, which conflicts with the actual turn-count length penalty being a built-in. New wording points length-penalty users at the now-adjacent built-in section, and points custom-advantage users at trajectory-metadata-driven shaping (sub-agents, relative-rank, …). TOC entry for #length-penalties dropped; no external doc links to that anchor. Co-authored-by: Cursor <cursoragent@cursor.com>

Covers the two complementary buffer-side mechanisms for keeping the trainer batch high-signal. Verified each claim against src/prime_rl/orchestrator/buffer.py and packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py.

- Difficulty pools (buffer.easy_threshold / hard_threshold + easy_fraction / hard_fraction): per-problem running-average reward is compared to the thresholds; problems hitting either bound move to easy/hard pool and stop being sampled. Pool assignments persist across checkpoints (easy_examples.jsonl / hard_examples.jsonl); *_fraction lifts a fraction of pooled problems back into normal on resume / start.
- Online difficulty filtering (buffer.online_difficulty_filtering bool): groups whose avg reward is exactly 0.0 or 1.0 are dropped from the buffer because their within-group advantage is zero (DR-GRPO produces no signal). Counted under filtered_rollouts/{env}/{easy,hard} for visibility.
- The tradeoff bit the user asked for explicitly: with ODF on each trainer step's effective batch is dense + predictable, but the orchestrator pays for the throw-away rollouts and may need a higher oversampling_factor; if time/wait_for_batch is already high on the trainer, ODF can starve the loop.
- ODF is orthogonal to the pools — ODF reacts to the current group's reward distribution, pools track the running per-problem average. Configs often use both.

Plus a one-line page-intro update + TOC entries.

Co-authored-by: Cursor <cursoragent@cursor.com>
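The two mechanisms above can be sketched side by side. Both functions are hypothetical illustrations, not the code in src/prime_rl/orchestrator/buffer.py; in particular, the comparison direction and inclusivity in `assign_pool` (average at or above `easy_threshold` means easy, at or below `hard_threshold` means hard) are assumptions.

```python
def keep_group(rewards: list[float]) -> bool:
    """Online difficulty filtering: drop groups whose average reward is exactly 0.0 or 1.0."""
    average = sum(rewards) / len(rewards)
    return average not in (0.0, 1.0)


def assign_pool(running_avg: float, easy_threshold: float, hard_threshold: float) -> str:
    """Difficulty pools: route a problem by its running-average reward (assumed semantics)."""
    if running_avg >= easy_threshold:
        return "easy"    # solved too reliably; stops being sampled
    if running_avg <= hard_threshold:
        return "hard"    # almost never solved; stops being sampled
    return "normal"      # still informative; keeps being sampled
```

The orthogonality noted above shows up in the inputs: `keep_group` only sees the current group's rewards, while `assign_pool` only sees the per-problem running average.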

Both are buffer-side controls over what reaches the trainer — same conceptual category as Filters and the loss/advantage knobs. Algorithms is the right home; Advanced is for orthogonal feature sets (LoRA, multi-tenant, custom modeling, multimodal). Promote each to its own top-level section.

- algorithms.md: new "## Difficulty pools" and "## Online difficulty filtering" sections, inserted between Filters and Multi-turn trajectories. TOC updated.
- advanced.md: section + TOC entries + page-intro mention removed.

Content unchanged from the previous commit on Advanced — same threshold / fraction / persistence claims, same ODF tradeoff explanation, same cross-link to oversampling_factor. ODF section back-links to Difficulty pools for the orthogonality note.

Co-authored-by: Cursor <cursoragent@cursor.com>

…itecture"

- "Adding a new architecture" promoted to its own top-level section, placed *before* the renamed Debugging MoE recipe so the natural reading order is "here's how to wire up a new arch -> here's how to smoke-test it". Step 3 of the recipe now forward-links to Debugging MoE instead of "the three steps above".
- "Testing MoE at small scale" -> "Debugging MoE". The page is already titled Development, so the qualifier was redundant.
- Subsection titles drop the "Step N:" prefixes (the page-level TOC already implies sequence) and switch to sentence case for consistency with the rest of the docs:
  - Step 1: build and verify a mini model -> Build and verify a mini model
  - Step 2: SFT warmup -> SFT warmup (already correct)
  - Step 3: RL on reverse-text -> RL on reverse-text (already correct)
- TOC updated; README docs index pitch swaps "small-scale MoE testing" -> "debugging MoE" to match the new section title.

Co-authored-by: Cursor <cursoragent@cursor.com>

- "### Build and verify a mini model" -> "### Create mini model". The roundtrip-verify body covers the "and verify" half; the shorter title is enough.
- Merge "### SFT warmup" + "### RL on reverse-text" into one "### Smoketest training" subsection. The two were always run together as a single end-to-end smoke test (warmup so KL is meaningful, then RL stack), so a single subsection with two code blocks reads better than two artificially-split ones.
- TOC updated. No external doc links to the old sub-anchors.

Co-authored-by: Cursor <cursoragent@cursor.com>

User-facing docs no longer reference configs/ directly — examples/ is the only "we keep this up to date" surface, the rest is CI- and debug-internal:

- configuration.md: launch line + worked-example switched to examples/reverse_text/rl.toml. Section renamed "Worked example" -> "Examples" with a curated tour of the 10 README examples (basic 1-8 GPU + advanced SLURM tiers); the compose / override / dry-run walkthrough lives as a "### Worked example" subsection.
- training.md: drop the "Debug configs for all variants ship under configs/debug/training_modes/" pointer in the Training modes section. The prose already explains how to set the mode.
- scaling.md: P/D inference now points at examples/glm5_pd_disag/rl.toml (with a link to its README) instead of the configs/-side inference-only TOML.
- faqs.md: install-verify and smoke-test recipes both switch from configs/debug/sft and configs/gsm8k to examples/reverse_text.

Reference generator (scripts/generate_docs_reference.py):

- Drop "from the Pydantic config models" from the page header.
- Move the regenerate command + structural notes from the header to a new "## About this page" footer.
- Wrap the Type column in code spans so list[int], int | None, etc. render as code instead of plain text. fmt_type now emits literal `|` (GFM accepts pipes inside code spans inside table cells; no escaping needed).
- Walk list-of-BaseModel fields. Previously orchestrator.train.env / orchestrator.eval.env / orchestrator.filters were rendered as one row showing the default repr; their leaf fields never showed up. New _list_inner_models() detects both list[X] (single model) and list[Annotated[Union[A | B], discriminator]] (discriminated union of list items, e.g. filters: list[FilterConfig]). Index placeholder rendered as <n> to match the CLI form (--orchestrator.train.env.0.id ...).

Regenerate reference.md: +5k chars, mostly the new env/filter list-item subsections that were missing before.

development.md still references two CI-tested configs/ paths (configs/debug/moe/sft/train.toml, configs/ci/integration/reverse_text_moe/start.toml) — those are validated by the reverse_text_moe GPU integration test on every PR, so they don't risk drifting. Flagging in case the user wants those swapped too.

Co-authored-by: Cursor <cursoragent@cursor.com>

128-512 is the range for quick ablations, not production. Production RL often runs at 1024+. Co-authored-by: Cursor <cursoragent@cursor.com>

P/D disaggregation is a feature you opt into for large-MoE serving, not a step on the single-GPU -> 1000-GPU scaling ladder. It pairs naturally with Custom modeling / multi-tenant / multimodal as a specialized inference topology, so Advanced is the right home.

- scaling.md: drop the section + TOC entry + "disaggregated prefill/decode inference" from the page intro. Page intro now forward-links to Advanced for users who came in for P/D.
- advanced.md: append the section after Multi-tenant training, unchanged content (P:D ratio table, glm5_pd_disag example link, queue-depth monitoring snippet, UCX 1.19 build-from-source note). TOC + page-intro list updated.
- overview.md "Where to go next": drop disagg from the Scaling bullet, add to the Advanced bullet.

Anchor preserved (#disaggregated-prefilldecode-inference) — no external doc links to it survived the move check.

Co-authored-by: Cursor <cursoragent@cursor.com>

The previous claim "trainer and inference server can share a GPU" via the rl launcher was wrong. Verified against src/prime_rl/entrypoints/rl.py:86-99: the launcher partitions visible GPUs strictly (inference 0..N-1, trainer N..N+M-1) and raises ValueError when total_requested_gpus > len(physical_gpu_ids). Setting CUDA_VISIBLE_DEVICES=0 + --num-infer-gpus 1 + --num-train-gpus 1 makes total=2, visible=1, validation fails before anything launches.

What actually works for single-GPU RL is the manual three-pane launch: each of uv run {inference,orchestrator,trainer} is an independent process with no cross-process GPU validation, so pinning CUDA_VISIBLE_DEVICES=0 on inference *and* trainer lets them share the same physical GPU.

- Drop the misleading `uv run rl` recipe.
- Promote the manual three-pane recipe to the canonical single-GPU path, with CUDA_VISIBLE_DEVICES=0 spelled out on both the inference and trainer panes.
- Lead with SFT (where single-GPU is the default and just works).
- Add an explicit "single-GPU RL is for debugging only" caveat.

Co-authored-by: Cursor <cursoragent@cursor.com>
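The strict partitioning described above can be sketched as follows. This mirrors only the behaviour stated in the commit message (inference takes the first N visible GPUs, the trainer the next M, and over-subscription raises ValueError); the function name and signature are hypothetical, not the code in src/prime_rl/entrypoints/rl.py.

```python
def partition_gpus(
    physical_gpu_ids: list[int], num_infer_gpus: int, num_train_gpus: int
) -> tuple[list[int], list[int]]:
    """Split visible GPUs into disjoint inference and trainer sets."""
    total_requested_gpus = num_infer_gpus + num_train_gpus
    if total_requested_gpus > len(physical_gpu_ids):
        raise ValueError(
            f"requested {total_requested_gpus} GPUs but only {len(physical_gpu_ids)} are visible"
        )
    inference_ids = physical_gpu_ids[:num_infer_gpus]                    # 0..N-1
    trainer_ids = physical_gpu_ids[num_infer_gpus:total_requested_gpus]  # N..N+M-1
    return inference_ids, trainer_ids
```

With one visible GPU and one GPU requested for each role, the total is 2 against 1 visible, so the check fails before anything launches. That is why sharing a GPU needs the manual three-pane launch rather than the launcher.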

…to one umbrella

- New umbrella section "## Single-node vs. multi-node deployment" framing the [deployment] discriminated union as the user-facing knob: single_node runs locally; multi_node currently goes through SLURM. Subsections nest beneath:
  - `### Single GPU` (unchanged content from "## Single GPU")
  - `### Single-node multi-GPU` (unchanged from "## Single-node multi-GPU")
    - `#### RL placement`
    - `#### SFT and torchrun`
  - `### Multi-node` (new short pointer to `## SLURM` with two cross-links to the existing RL / SFT-and-inference examples)
- The umbrella section opens with a callout that manual multi-node launches are technically possible but reimplement what the SLURM launcher does (the user's preferred framing).
- Drop "## Choosing a layout" entirely (the new umbrella section conveys the same routing more naturally + the layout table was going stale).
- Drop "## Multi-node (manual)" entirely (RL training, SFT training, Multi-node inference subsections all gone). Anyone who needs the manual recipe can replicate what the SLURM templates do.

Cross-link fixes:

- training.md SFT § Launch line previously pointed at scaling.md#sft-training (under "## Multi-node (manual)"). Now points at scaling.md#sft-and-torchrun for non-default single-node layouts and scaling.md#slurm for multi-node.
- faqs.md "Multi-node without SLURM or K8s?" answer updated: from "yes, see [Scaling § Multi-node (manual)]" to "not currently documented; technically possible but reimplements the SLURM launcher".

Page intro adjusted to match the new structure ("multi-node SLURM and Kubernetes deployments" -> "single-node and multi-node deployments").

Co-authored-by: Cursor <cursoragent@cursor.com>
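For readers unfamiliar with the term, a discriminated union like [deployment] can be sketched with the standard library as below. This is only an illustration of the pattern, not the project's Pydantic config: the `type` tag key, the `num_nodes` field, and the names `parse_deployment` and `launcher_for` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class SingleNode:
    """Variant that runs locally."""


@dataclass
class MultiNode:
    """Variant that goes through SLURM (per the commit message)."""
    num_nodes: int = 2  # hypothetical field, for illustration only


Deployment = Union[SingleNode, MultiNode]


def parse_deployment(raw: dict) -> Deployment:
    """Pick the variant from a tag key (assumed here to be "type")."""
    kind = raw.get("type")
    fields = {k: v for k, v in raw.items() if k != "type"}
    if kind == "single_node":
        return SingleNode(**fields)
    if kind == "multi_node":
        return MultiNode(**fields)
    raise ValueError(f"unknown deployment type: {kind!r}")


def launcher_for(deployment: Deployment) -> str:
    """Route each variant to the launcher the docs describe for it."""
    return "local" if isinstance(deployment, SingleNode) else "slurm"


if __name__ == "__main__":
    print(launcher_for(parse_deployment({"type": "single_node"})))
    print(launcher_for(parse_deployment({"type": "multi_node", "num_nodes": 4})))
```

The point of the pattern is that one tag value selects both the set of valid fields and the launch path, which is why the section frames [deployment] as the single user-facing knob.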

The K8s Helm chart at k8s/prime-rl/ still ships, but the user-facing docs are dropping coverage until the chart and the matching guide are re-validated together.

- scaling.md: drop the "## Kubernetes" section + TOC entry. Page intro already covered (was reworded earlier in the restructure).
- overview.md "Where to go next": drop "Kubernetes guides" from the Scaling bullet.
- README.md docs index: drop "Kubernetes" from the Scaling bullet.

The two passing-mention "k8s" / "Kubernetes" lines in README (Overview features list, Advanced Training Examples adaptability note) are left as-is: they describe codebase capability, not docs coverage. Reference.md still mentions Kubernetes liveness probes in an auto-generated field docstring; that's source-side, out of scope for this pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

- Drop the example test filenames from the tests/integration/ bullet: they're a moving list and not the point.
- Reframe the tests/nightly/ bullet around what it does (runs the examples/ configs to catch regressions) instead of listing the individual nightly tests by name.

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

- Replace the bare 'To add (e.g.) Kimi 2.5:' opener with a one-line framing of the two-step contract: implement modeling code, register a mini preset for smoke-testing.
- Bold the leading verb on each numbered step so the structure reads as a checklist.
- Step 1 now nudges readers at glm4_moe/ and qwen3_moe/ as templates for the modeling code.
- Step 2 explains *what* the preset is for ('build a ~0.5B test model in your architecture') rather than just listing fields. Path now links to scripts/mini_moe.py.
- Step 3 says what the smoke-test actually exercises (roundtrip + SFT + RL stack) so users know what 'smoke-test' means here.

Co-authored-by: Cursor <cursoragent@cursor.com>

CodeQL alert actions/missing-workflow-permissions (security/code-scanning/19) flagged the new workflow for relying on the repo's default GITHUB_TOKEN permissions. The workflow only checks out code (contents: read), syncs deps via uv, runs the doc generator, and runs git diff. None of that needs write scope on any resource type. Pin to contents: read at the workflow level, the explicit minimum that satisfies the rule.

Co-authored-by: Cursor <cursoragent@cursor.com>
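The drift check this workflow performs (regenerate, then fail if the result differs from what is committed) needs read access only. A minimal sketch of the idea follows; it swaps `git diff` for Python's `difflib` so it runs without a repository, and the function names `reference_drift` and `check` are hypothetical.

```python
import difflib


def reference_drift(committed: str, regenerated: str) -> list[str]:
    """Return unified-diff lines between committed and regenerated docs."""
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="docs/reference.md (committed)",
            tofile="docs/reference.md (regenerated)",
            lineterm="",
        )
    )


def check(committed: str, regenerated: str) -> int:
    """Exit-code style result: 0 when in sync, 1 when the docs drifted."""
    diff = reference_drift(committed, regenerated)
    for line in diff:
        print(line)
    return 1 if diff else 0


if __name__ == "__main__":
    # Stand-in strings; a real check would read the file and run the generator.
    print(check("# Reference\nfield_a: int", "# Reference\nfield_a: int"))
    print(check("# Reference\nfield_a: int", "# Reference\nfield_a: str"))
```

Nothing in this comparison writes to the repository, which matches the commit's reasoning that `contents: read` is sufficient.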

- assets/architecture.png: replace with the new diagram (trainer + orchestrator + inference deployment, GPU layout per process, data + scheduling + weight-broadcast arrows). Was 511k 96dpi, now 111k @ 200dpi from architecture.pdf.
- assets/two-step-off-policy.png removed; assets/async-pipeline.png replaces it with the cleaner one-step-overlap diagram (trainer steps g_0..g_n above, inference samples with theta_{n-1} below). algorithms.md image reference + alt text updated to match the post-deprecation "one-step overlap" framing.
- assets/rollout-timeline.png added but not yet referenced. It's a continuous-time view showing rollouts spanning policy boundaries (policies pi_{i-2}, pi_{i-1}, pi_i on the x-axis with rollout bars crossing the boundaries); that's the picture behind max_off_policy_steps, not max_async_level.

Want me to drop it into algorithms.md (e.g. above the off-policy / max_off_policy discussion) or save it for later?

Co-authored-by: Cursor <cursoragent@cursor.com>

User-facing prose mentions of verifiers / renderers / research-environments / pydantic-config now consistently render as code-spans with a github link to the package.

Affected lines:

- overview.md: orchestrator-bullet [verifiers](url) (dropped the bare-text variant), install paragraph (linked all three submodules individually), quick-run paragraph.
- algorithms.md: "Since [`verifiers` v0.1.8]..." (added backticks around the package name in the release-tag link).
- training.md: SFT "verifiers submodule" launch line, useful-knobs vf_level row, tool-defs paragraph.

Intentionally left alone:

- algorithms.md `verifiers.RolloutOutput` (dotted-path code ref; whole expression already in code).
- algorithms.md / training.md [renderers](#renderers) / [Renderers](#renderers) (internal anchors to the in-page section, more useful than the github repo for a reader inside the doc).
- algorithms.md "Hand-coded renderers ship for ..." and "the renderers writeup on the PI blog": generic prose, not a package mention.
- reference.md docstring-sourced mentions ("verifiers package", "renderers package", etc.): those come from Pydantic field docstrings; would need source-side edits + regen.

Co-authored-by: Cursor <cursoragent@cursor.com>

Out of date with the max_async_level deprecation and duplicates content already covered properly in algorithms.md (one-step overlap framing + the AIPO/KL loss math). Architecture bullets above already mention async semantics where relevant. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Was four separate code blocks separated by one-line prose; now a single bash block with inline-comment annotations for each variant. Reads cleaner and matches the test-suite block in development.md. Co-authored-by: Cursor <cursoragent@cursor.com>

…e-test)

Self-review found a handful of small inconsistencies and stale claims that survived earlier passes.

Stale max_async_level references (now hardcoded to 1, so any prose that treats it as tunable is wrong):

- faqs.md: "If growing, drop max_async_level or LR" -> drop max_off_policy_steps. NCCL FAQ no longer says "requires max_async_level=1"; that constraint is vacuous.
- training.md Rules of thumb: NCCL/max_async_level=1 dropped from the dry-run validator example; CP / flash-attention example stays.
- scaling.md NCCL example: drop the trailing "# synchronous; max_async_level forced to 1" comment.

Other accuracy / consistency fixes:

- faqs.md hardware FAQ: drop "you can co-locate both on a single GPU"; verified earlier that the rl launcher rejects this, and the manual three-pane recipe is the actual single-GPU path. Cross-link to Scaling § Single GPU.
- faqs.md max_off_policy_steps wording: "throughput-vs-noise knob" -> "off-policy dial" to match algorithms.md / training.md.
- training.md sft.data.loss_mask cross-link: anchor was reference.md#sft-data (the discriminated-union heading); the loss_mask sub-table actually lives at #sft-data-sft-loss-mask.
- development.md "Smoketest training" -> "Smoke-test training" in the section title and TOC anchor, matching the hyphenated verb form ("smoke-test the new architecture") used in the page body.

Co-authored-by: Cursor <cursoragent@cursor.com>

Four asks rolled in:

1. **Drop docs/faqs.md entirely.** Removed the file plus all its cross-references: docs/mint.json nav, README.md docs index, and docs/overview.md "Documentation" list. The standalone Q&A page wasn't pulling its weight against the verifiers tone reference.
2. **Title Case all headings** across the user-facing pages so the visual style matches deps/verifiers/docs/* (which uses "Hosted Training" / "Performance Trade-offs" etc., not sentence case). Anchors are slug-based (lowercase + hyphen), so internal #links survive the case flip; only the link text in cross-doc "§ Section" references needed updating (training.md → Algorithms § Multi-Turn Trajectories, training.md → Scaling § SFT and Torchrun, scaling.md → Configuration § TOML Composition, scaling.md → SLURM § RL Example / SFT and Inference Examples).
3. **Rename overview.md "Where to go next" → "Documentation"** for symmetry with the verifiers landing page.
4. **Smell fixes**:
   a. algorithms.md had two near-identically-named subsections ("### The default loss" under Async with the loss math, and "### Default loss" under Loss with the mode dispatch). Collapse: the loss math now lives under "## Loss > ### Default Loss" together with the rl/opd/sft mode-dispatch bullets; the Async section is just intro + step semantics. One source of truth.
   b. reference.md auto-generated docstring mentions of "verifiers package" / "renderers package" / "renderers library" / "renderers.parsers" / "Registered verifiers environment ID" now render as [`pkg`](github-url) links. Source-side edits in packages/prime-rl-configs/src/prime_rl/configs/{shared,orchestrator,sft}.py, then regenerated reference.md.

Net: faqs.md deletion (-198) dominates; everything else is small churn for case + anchor consistency.

Co-authored-by: Cursor <cursoragent@cursor.com>
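Why the anchors survive the case flip can be seen with a small slug function. The sketch below approximates GitHub-style heading slugs (lowercase, drop punctuation, spaces to hyphens); the docs site's actual slugger is not shown in this PR and may differ in edge cases, and `slugify` is a hypothetical name.

```python
import re


def slugify(heading: str) -> str:
    """Approximate a heading anchor: lowercase, strip punctuation, hyphenate."""
    slug = heading.strip().lower()
    # Keep word characters, hyphens, and spaces; drop "/", ".", and the like.
    slug = re.sub(r"[^\w\- ]", "", slug)
    return slug.replace(" ", "-")


if __name__ == "__main__":
    for sentence_case, title_case in [
        ("The default loss", "The Default Loss"),
        ("SFT and torchrun", "SFT and Torchrun"),
        (
            "Disaggregated prefill/decode inference",
            "Disaggregated Prefill/Decode Inference",
        ),
    ]:
        # Both spellings lowercase to the same anchor.
        print(slugify(sentence_case), slugify(sentence_case) == slugify(title_case))
```

Under this approximation the last heading yields `disaggregated-prefilldecode-inference`, the anchor quoted in the earlier disaggregation commit, in either casing.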

# Conflicts:
#	docs/async.md

Restructure + accuracy fixes that were piled up locally during the SSH-signing block earlier today.

- configuration.md: note that reusing an env id requires a unique name; drop the "See each environment's README on the Hub" tail.
- development.md: collapse "Adding a New Architecture" + "Debugging MoE" (with its Create Mini Model + Smoke-Test Training subsections) into a single "## Adding a New Model" with three subsections: Implement the Modeling Code, Register a Mini Preset, Run the Smoke Test. Page intro updated.
- scaling.md:
  - Drop "## Single GPU" subsection; rename "## Single-Node Multi-GPU" -> "## Single-Node" so "Single-Node vs. Multi-Node Deployment" has just two clean children.
  - Drop the manual-multi-node prose.
  - Drop the [slurm] field reference table; rename the section to "## [deployment] Block" and link to reference.md.
  - Collapse the two example subsections into one "## Examples" with pointers to examples/multinode/{rl,sft}.toml.
  - Rename "## CPU Optimizer Offload" -> "## Optimizer Offloading".
  - Add new "## LM Head Chunking" subsection covering the fused_lm_head_token_chunk_size knob.
  - Drop torchrun phrasing in the SFT scaling subsection, the CP recommendation paragraph, "(offloads checkpoints to CPU)" parenthetical, "RL gradient-accumulation amortization" line, attn = "flash_attention_2" from the memory-tight recipe, dry-run invaluable line.
- algorithms.md: drop the Step Semantics subsection heading (body stays); replace the AIPO link with a plain "DPPO + KL similar to Kimi-K2.5" framing.
- advanced.md: drop the "## Custom vs HF Implementations" subsection heading; the body becomes the lead content under "## Custom Modeling".
- overview.md + README.md: pitch updates ("adding a new architecture" -> "adding a new model"; FAQs row already gone).

Co-authored-by: Cursor <cursoragent@cursor.com>

Removes everything in the reference-generation pipeline:

- docs/reference.md: the 197k-char auto-generated field reference.
- scripts/generate_docs_reference.py: the generator that walks Pydantic config trees, discriminated unions, list-of-models, etc.
- .github/workflows/docs-reference.yaml: the CI guard that ran the generator and failed on git diff drift.
- .pre-commit-config.yaml: drops the local docs-reference hook that re-ran the generator on staged config-class edits.

Knock-on cleanup of cross-references:

- mint.json: drop the "reference" page from nav.
- overview.md / README.md: drop the Reference bullet from the Documentation list / docs index.
- configuration.md: drop the two trailing "See [Reference] ..." pointers (under Discriminated Unions and at the end of the Examples section).
- training.md: drop the "for the full field reference see [Reference]" tail in Useful Knobs; the loss_mask row no longer cross-links and just lists the four roles.
- scaling.md: drop the [Reference § trainer.model] DeepEP pointer and the [Reference § slurm] pointer in the [deployment] block.
- algorithms.md: drop the [Reference § orchestrator length penalties] line.
- development.md: simplify the Pre-Commit Hooks section. Only the ruff hook remains, so the configured-hooks bullet list goes away.

Co-authored-by: Cursor <cursoragent@cursor.com>

mikasenghaas and others added 4 commits May 22, 2026 23:47

chore: rename pre-commit hook id to docs-reference

b234055

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-advanced-security AI found potential problems May 22, 2026


Comment thread .github/workflows/docs-reference.yaml Fixed

mikasenghaas and others added 25 commits May 23, 2026 00:17

docs(faqs): drop the CP <= 8 recommendation

b305b86

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(faqs): drop the vLLM log-quieting and KV-cache pressure Q&As

5b9898d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(faqs): drop the Environments Hub install Q&A

dc40f71

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(faqs): drop the Models and environments section

47671be

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into chore/docs-revamp

3b77e8c

# Conflicts:
#	docs/bring-your-own-algorithms.md
#	docs/slurm.md
#	docs/training_modes.md

docs(training): rename 'Saving HF weights for serving' -> 'Serving ch…

232791e

…eckpoints' Co-authored-by: Cursor <cursoragent@cursor.com>

docs(training): drop Prometheus and BetterStack subsection

9d5844a

Not ready for end-user docs yet. Both knobs (trainer.metrics_server.port, trainer.heartbeat.url) are still in reference.md for anyone who needs them, just no narrative coverage. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(training): drop loss/nan_count from SFT trainer metrics

27ce099

Co-authored-by: Cursor <cursoragent@cursor.com>

mikasenghaas and others added 25 commits May 25, 2026 23:06

docs(training): correct batch-size rule of thumb

21d4296

128-512 is the range for quick ablations, not production. Production RL often runs at 1024+. Co-authored-by: Cursor <cursoragent@cursor.com>

docs(development): drop the nightly 24h-timeout / research-cluster aside

094ad23

Co-authored-by: Cursor <cursoragent@cursor.com>

docs(overview): tighten 'Where to go next' to one-line pitches

9fc78cf

Co-authored-by: Cursor <cursoragent@cursor.com>

mikasenghaas changed the title ~~docs: rewrite into 8 task-oriented pages with auto-generated reference~~ docs: big revamp May 26, 2026

mikasenghaas and others added 3 commits May 26, 2026 02:03

Merge remote-tracking branch 'origin/main' into chore/docs-revamp

6ebf87d

# Conflicts:
#	docs/async.md

mikasenghaas requested a review from samsja May 26, 2026 05:41


docs: big revamp #2602

mikasenghaas wants to merge 67 commits into main from chore/docs-revamp


2 participants
