docs: big revamp #2602
Draft
mikasenghaas wants to merge 67 commits into
Replaces 22 small/uneven docs files with 8 longer pages modeled on the verifiers docs: each page opens with a TOC, content is grouped by task (configure, train, scale, …) instead of by feature, and a single auto-generated reference page covers every config field. New pages: overview, configuration, training, scaling, algorithms, advanced, faqs, reference. reference.md is generated by scripts/generate_docs_reference.py from the Pydantic config models; regenerate with `uv run python scripts/generate_docs_reference.py`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- algorithms.md: max_async_level default is 1 (not 2); default loss
includes the Kimi-K2.5 KL regularizer (was wrongly claimed to drop
it); update the formula to show the full L = -PG + tau_KL * KL form;
filter table is `[[orchestrator.filters]]` (plural)
- training.md: checkpoint paths nest under `checkpoints/step_N/{trainer,
orchestrator}/` rather than separate hierarchies; --inference-gpu-ids
/ --trainer-gpu-ids don't exist — use --deployment.num-{infer,train}-
gpus and pin physical GPUs via CUDA_VISIBLE_DEVICES; update
max_async_level prose to match the new default
- scaling.md: same GPU-flag fix throughout the single/multi-GPU
examples; correct the claim that Muon + optim_cpu_offload is
unsupported (only fsdp_cpu_offload is blocked)
- configuration.md: there is no generic PRIME_* env var override
mechanism in pydantic-config — rewrite the env vars section to list
the specific named vars that individual fields read as defaults
- advanced.md: add the qwen3_vl_moe entry to the VLM registry table;
the small-scale MoE RL config lives at
configs/ci/integration/reverse_text_moe/start.toml, not .../rl/
- faqs.md: update the max_async_level Q&A to match the new default
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two safety nets so the auto-generated reference can't silently drift from the Pydantic config models:
- Pre-commit hook (local): re-runs scripts/generate_docs_reference.py whenever a config class or the generator itself is staged. If the generated file changes, pre-commit fails the commit so the contributor re-stages the regenerated reference.
- GitHub Actions (CI): a small workflow runs the generator and `git diff --exit-code docs/reference.md`. Catches anyone who bypassed the pre-commit hook.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
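The CI check described above boils down to "regenerate, then fail if the result differs from the committed file". A minimal Python sketch of that comparison follows. `drift_exit_code` is a hypothetical helper for illustration only; it is not part of scripts/generate_docs_reference.py or the workflow, which (per the commit message) runs `git diff --exit-code` instead.

```python
import difflib


def drift_exit_code(committed: str, regenerated: str) -> int:
    """Return 0 if the two texts match, 1 otherwise (same convention as `git diff --exit-code`)."""
    diff = list(
        difflib.unified_diff(
            committed.splitlines(keepends=True),
            regenerated.splitlines(keepends=True),
            fromfile="docs/reference.md (committed)",
            tofile="docs/reference.md (regenerated)",
        )
    )
    # Print the diff so the contributor can see which entries drifted.
    for line in diff:
        print(line, end="")
    return 1 if diff else 0
```

A non-zero return value is what would make either the pre-commit hook or the CI job fail.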
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Quick run now uses examples/reverse_text/rl.toml; the env is bundled with the verifiers submodule so no prime env install is needed, and the tmux helper is documented elsewhere instead of duplicated here
- Architecture bullets advertise the SOTA features per process: vLLM multi-node + FP8 + P/D disaggregation for inference; FSDP2 + EP (incl. DeepEP) + CP + selective AC + FP8 + LoRA + multi-run for the trainer
- Drop the "Use prime-rl when you want to" bullets and the CPU-only SFT smoke check; the landing page reads cleaner without them
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop the env-vars section entirely (and the precedence callout that referenced it); the page is now strictly TOML + CLI. The few named env vars that individual fields read as defaults are out of scope for the config docs and stay in the per-feature pages (training.md, etc.)
- Drop the entrypoint enumeration and the W&B/output-dir recommendation blurb in the intro
- Reword "@ introduces a TOML file" so the sentence doesn't lead with an inline code token; convert the "Mind the space" hint to a blockquote
- Drop --output-dir from the convenience-flag list (it's just another override, not a special flag)
- Note that --dry-run is available on rl, sft, and inference only; the standalone trainer and orchestrator configs don't have a dry_run field
- Split "Booleans, None, and lists" into one section each, matching the pydantic-config README style; add a Dicts section
- Drop the "Prefer --ckpt" convention bullet; checkpointing is covered in training.md and didn't belong in the config conventions
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vLLM is an OpenAI-compatible server by default; the prefix in the entrypoints table was just noise. Other mentions of "OpenAI-compatible" describe the API surface or third-party endpoints and stay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Quick-start now uses examples/reverse_text/rl.toml; drop the prime env install and tmux preamble (covered elsewhere)
- Add a "Useful CLI flags" subsection: --ckpt, --wandb, --orchestrator.prime-monitor (Prime Lab), --clean-output-dir, --output-dir, --max-steps, --dry-run
- Mention the env-server / env-worker fan-out in the orchestrator bullet under "What each process does at runtime"
- Restrict the Key knobs table to orchestrator-only args; drop max_async_level, max_completion_tokens, inference, trainer rows; rename rollouts_per_example row to lead with "Group size"
- SFT Launch now uses examples/reverse_text/sft.toml; drop the CPU fake-data smoke alternative
- "two distillation modes" -> "three training modes" (rl/opd/sft)
- Drop the long-run checkpoint-combo recommendation
- Drop the trainer+orchestrator lockstep note from Resuming a run
- Swap order: Platform monitoring now appears before Prometheus + BetterStack under Observability; show --orchestrator.prime-monitor CLI invocation
- Rename "Metrics that matter" -> "Important metrics"; drop the live vLLM curl snippet
- Drop "Eyeball the reward distribution", "Match inference.parallel.tp", and "Set max_async_level deliberately" rules of thumb
- Add new rules of thumb: batch size >= 64; group size >= 8 with the reasoning that all-succeed / all-fail groups give the trainer no signal because the within-group advantage collapses
- Drop the Common Issues section entirely
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- scaling.md: drop the 1-GPU row and the "Production MoE with long contexts" row from the "Choosing a layout" table; the disaggregated prefill/decode page section is still findable via its own H2
- scaling.md: drop the trailing "Multi-node logs" section (heading + TOC entry); the content now lives next to single-node log layout
- training.md: fold the multi-node tree into "Log files" with the single-node skip note inlined; add live-tail recipes and the per-rank torchrun debug note; mention the tmux helper works on a SLURM head node
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New Renderers section explains why best-effort interleaving works: the renderer guarantees the exact-prefix invariant by construction via bridge_to_next_turn. Lists the renderer API surface and the hand-coded model coverage
- Drop the verifiers trajectories-design-note link from Discontinuous trajectories and the --trajectory-strategy branching deprecation
- Drop preserve_all_thinking workaround mentions from algorithms.md and faqs.md (reference.md still documents the fields)
- Leave a TODO(blog-post-url) for the PI site writeup
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
faqs.md:
- Drop the "override an env var in TOML" Q&A (matches the configuration page where env vars are no longer documented as a generic override)
- Drop the "max_async_level" Q&A; replace with a max_off_policy_steps Q&A, the more impactful knob to tune on long agentic rollouts
- Drop the outdated "two W&B runs per RL job" Q&A; default is shared now (wandb.shared = true)
- Drop the SFT-section Q&As that referenced preserve_all_thinking or were too thin to keep
- Switch the "evaluate without training" recipe from vf-eval to prime eval run (the Prime CLI is the recommended entrypoint)
training.md:
- Fix the W&B section to describe the new default: shared single run (wandb.shared = true), with the legacy split as opt-out
- Add max_off_policy_steps to the Key knobs table
- Switch the eval example from vf-eval to prime eval run
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
# docs/bring-your-own-algorithms.md
# docs/slurm.md
# docs/training_modes.md
- Inference bullet now leads with the local default (token-in /v1/generate via renderers; OpenAI-compatible routes called out as the external-client path) and adds DP/TP/EP with deepep + flashinfer all-to-all backends + EPLB, P/D disaggregation behind vllm-router, CPU KV-cache offload, and router replay (FP8 MoE numerical-parity feature). Weight broadcast is filesystem or NCCL.
- Orchestrator bullet now leads with "owns the data plane across many verifiers training and eval environments" plus the per-env isolated subprocess + variable-size env-worker pool.
- Trainer bullet drops "torchrun-launched" and surfaces the custom modeling code as the enabler for advanced trainer parallelism (EP with DeepEP, CP for long sequences).
- Drop the [AIPO] link in the async paragraph (off-policy-aware PG + KL regularizer, no paper handle); also drop the "AIPO loss" mention from the Algorithms blurb in "Where to go next" so the page is internally consistent.
- Quick-run command is now bare: uv run rl @ examples/reverse_text/rl.toml (no --wandb.* / --ckpt).
- Drop the trailing scaling pointer (Scaling is already linked in "Where to go next").
Co-authored-by: Cursor <cursoragent@cursor.com>
- Drop the entrypoint-splitting paragraph ([trainer] / [orchestrator] / [inference] table lifting); covered elsewhere.
- Rename "TOML files and composition" -> "TOML composition", and "Special syntax" -> "Syntax".
- Open "Sources and precedence" by naming the three sources (Pydantic defaults, TOML files, CLI flags) up front, then layering them.
- Drop the "(-- is a kebab-case marker)" parenthetical from CLI overrides; turn the snake/kebab note into a callout.
- Drop the --help / --dry-run convenience-flag block and the "--dry-run is the single most useful debugging tool" prose; the bash example is enough.
- Reorder Syntax subsections to mirror the pydantic-config README: Booleans -> Lists -> Dicts -> Optional sub-configs -> None -> Discriminated unions -> Environments. None moves down and is cross-linked from "disabling an optional sub-config".
- Booleans example swapped from --ckpt (which is itself an optional sub-config) to --clean-output-dir (a real bool = False field), showing both --flag and --no-flag forms.
- Lists / Dicts now show TOML and CLI on the *same* field name so the mapping is obvious (target_modules for lists, env.0.args for dicts), and add the "lists are replaced wholesale" overlay note + "dicts deep-merge across sources" detail.
- Add a callout on validation aliases (rollouts_per_example still works after the rename to group_size); only material gap vs the pydantic-config README that's relevant to end users.
- Worked example: --dry-run is now the final flag.
- Drop the Conventions section.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Rename "RL training" -> "RL trainer" and "SFT training" -> "SFT
trainer" (and update the page intro accordingly).
- Entrypoints table: clarify that `uv run rl` wraps the trainer,
orchestrator, and inference server in one launch — runs locally
for single-node experiments and submits to SLURM for single- or
multi-node when [slurm] is set. Drop the trailing "rl is a
convenience wrapper" paragraph.
- RL trainer Launch: minimal command is now bare
`uv run rl @ examples/reverse_text/rl.toml` (no flags). Drop the
GPU-placement paragraph + multi-GPU example (covered in scaling.md).
- Replace "Useful CLI flags" + "Key knobs" + "What each process does
at runtime" with one consolidated "Useful knobs" section split into
three sub-tables: data-and-algorithm, monitoring, run management.
- Add training environments ([[orchestrator.train.env]] for
multi-env training)
- Add eval environments ([[orchestrator.eval.env]] +
orchestrator.eval.interval)
- Add monitoring entries: orchestrator.log.vf_level, --wandb,
--orchestrator.prime-monitor
- Move Training modes from a top-level section into RL trainer as
a subsection (it's RL-entrypoint-specific).
- Drop the standalone Evaluations section — eval syntax is covered
in configuration.md and the eval-knobs row in Useful knobs links
to `prime eval` for one-off evals.
- Drop the optimization_dtype / reduce_dtype rule of thumb.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Drop "-via-orchestrator" from the Training modes heading and the
internal/cross-doc anchors. The mode value is just `sft` and the
short title reads cleaner.
- Drop "and the tmux helper" from the Console output subsection
title; the tmux helper is still documented in the section body.
- Important metrics is now split into RL trainer and SFT trainer
subsections so the SFT-only metrics (loss/mean, val/loss,
progress/{epoch,num_samples,num_tokens}, optim/zero_grad_ratio,
per-subset mixing ratios, MoE max_vio + routing_confidence,
perf/peak_memory + the time/* breakdown) are documented.
- SFT Dataset format gains a Tool definitions paragraph: rows can
carry a `tools` column (OAI function-calling format) or
`tool_defs` (verifiers rollout format), as either a list of dicts
or a JSON-encoded string. `tool_defs` is auto-converted to OAI
shape before being passed into the chat template's `tools=...`
argument. `chat_template_kwargs` rows pass through verbatim.
Co-authored-by: Cursor <cursoragent@cursor.com>
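The "list of dicts or JSON-encoded string" handling described for the `tools` / `tool_defs` columns can be sketched as below. `load_tools_column` is a hypothetical name, not the function prime-rl uses. The `tool_defs` to OAI conversion is deliberately left out because the verifiers rollout shape is not spelled out in this commit message.

```python
import json


def load_tools_column(value):
    """Normalize a `tools` / `tool_defs` cell to a list of dicts (or None if absent)."""
    if value is None:
        return None
    if isinstance(value, str):
        # JSON-encoded form, e.g. '[{"type": "function", ...}]'
        value = json.loads(value)
    if not isinstance(value, list) or not all(isinstance(t, dict) for t in value):
        raise TypeError("expected a list of dicts or a JSON-encoded list of dicts")
    return value
```

Under this sketch, both `load_tools_column([{"type": "function"}])` and `load_tools_column('[{"type": "function"}]')` yield the same list, which is what would then be handed to the chat template's `tools=...` argument.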
…eckpoints' Co-authored-by: Cursor <cursoragent@cursor.com>
Adds a callout under the intro of training.md and configuration.md pointing at the equivalent skill files for AI agents working in this repo:
- training.md -> skills/training/SKILL.md (top-level routing) + skills/training/start-run/SKILL.md (launch details) + skills/training/monitor-run/SKILL.md (check-in / restart).
- configuration.md -> skills/configs/SKILL.md.
The skills aren't part of the published Mintlify nav, so the links go to GitHub blob URLs.
Co-authored-by: Cursor <cursoragent@cursor.com>
The standalone "## Important metrics" section is gone. Each
trainer subsection now ends with its own "### Important metrics"
covering only the metrics relevant to that flow:
- RL trainer / Important metrics: reward + rollout signals from the
orchestrator, mismatch_kl + entropy + grad_norm from the trainer,
and the trainer/orchestrator/vLLM performance grid.
- SFT trainer / Important metrics: loss/mean, val/loss, progress
counters, optim signals, MoE max_vio + routing_confidence, and
the perf/{throughput,mfu,peak_memory} + time/* breakdown.
TOC updated to point at #important-metrics (RL) and
#important-metrics-1 (SFT) — Mintlify de-duplicates with the same
-N suffix scheme it already uses for the two Launch subsections.
Co-authored-by: Cursor <cursoragent@cursor.com>
Not ready for end-user docs yet. Both knobs (trainer.metrics_server.port, trainer.heartbeat.url) are still in reference.md for anyone who needs them, just no narrative coverage. Co-authored-by: Cursor <cursoragent@cursor.com>
- Use "task" not "prompt" for the conceptual unit ingested by the
orchestrator (rows for batch_size and group_size, plus the matching
Rules of thumb wording).
- group_size: drop the "Used for advantage normalization and pass@k
estimation" trailer; the row name is enough.
- max_off_policy_steps: rename "throughput-vs-noise dial" to just
"off-policy dial".
- eval row: drop the "Scores land in trainer logs and W&B as
eval/{env}/{avg@k,pass@k}" + `prime eval` trailer; keep the row
scoped to what the knob does.
- Add log.level to the Monitoring table (trainer/orchestrator process
log level, $PRIME_LOG_LEVEL fallback, per-process or global on rl).
- Drop --ckpt from Run management; checkpointing has its own section.
Co-authored-by: Cursor <cursoragent@cursor.com>
Drop the throughput/MFU/async-lag/KV-cache rows from the RL trainer Performance table — they're either generic perf metrics already covered by the SFT trainer's perf table or vLLM-internal. The two remaining rows (time/wait_for_batch, time/wait_for_ckpt) are the useful diagnostic — they tell you which side is the bottleneck. Co-authored-by: Cursor <cursoragent@cursor.com>
The old wording recommended only patched checkpoints / a custom chat template as the fix for position-dependent templates. The renderer-based path landed in SFT (use_renderer flag) and is now the primary recommended fix, but it's still off by default (use_renderer: bool = False on SFTConfig), so the patched-checkpoint path also still works, and is what examples/reverse_text/sft.toml uses today via PrimeIntellect/Qwen3-0.6B.
Rewrite the paragraph to cover both fixes:
- Renderer path: use_renderer = true, lists the hand-coded renderers, calls out the VLM unsupported case.
- Patched template path: the prime-rl-patched checkpoint or a user-supplied template that preserves thinking.
Cross-link both to Algorithms § Renderers and § Multi-turn trajectories.
Co-authored-by: Cursor <cursoragent@cursor.com>
- data.type = "sft" is the discriminated-union default for SFTConfig.data, so users don't need to spell it out.
- The dataset path field is data.name, not data.path. Confirmed against SFTDataConfig.name in packages/prime-rl-configs/src/prime_rl/configs/sft.py.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Renderers existed as a top-level section because they were added in a follow-up, but conceptually they're the mechanism that makes best-effort interleaving safe; splitting them apart forced a forward-reference and duplicated the "exact-prefix invariant" framing.
- Demote "## Renderers" to "### Renderers" inside "## Multi-turn trajectories", placed between Best-effort interleaving and Discontinuous trajectories.
- Move the Qwen3 thinking-stripping example from Best-effort interleaving into Renderers; it's the failure case the renderer fixes, so it reads better adjacent to the family list / config block.
- Drop the "Workaround: use a chat template that preserves thinking" trailer; the patched-checkpoint workaround for SFT is already documented in training.md and isn't relevant in the orchestrator context (use_renderer defaults to true).
- Open Multi-turn trajectories with a forward-link to Renderers so the reader knows the safety mechanism is coming.
- TOC updated; both #multi-turn-trajectories and #renderers anchors preserved (only their nesting changes), so the existing cross-links from training.md and faqs.md keep working.
Co-authored-by: Cursor <cursoragent@cursor.com>
…step
max_async_level is being deprecated as a user-facing knob (hardcoded
to 1; matching reframe in docs/async.md on feat/deprecate-max-async-
level). Update algorithms.md so the long-form treatment matches.
- Drop the "### Tuning max_async_level" subsection (the k=0/1/2/>=3
table) and the NCCL-needs-max_async_level=1 line that follows it
— both become vacuous when k is fixed at 1.
- Reword the Async / off-policy training intro to describe the
one-step overlap directly instead of "up to k steps where
k = max_async_level".
- Step semantics: rho_inference is now pi_{max(0, n-1)}, with prose
"inference is exactly one step behind the trainer" replacing the
generic "gap is at most k steps".
- Drop the Tuning entry from the page TOC.
Other docs (overview.md intro, training.md rule of thumb, faqs.md
two entries, scaling.md NCCL-example comment) still mention
max_async_level and are now stale; will clean those up in the next
turn unless flagged otherwise.
Co-authored-by: Cursor <cursoragent@cursor.com>
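The step semantics in the commit above (rollouts for trainer step n are sampled by pi_{max(0, n-1)}) can be written out as a tiny helper. This is an illustrative sketch only; `inference_policy_step` is a made-up name and not a prime-rl function.

```python
def inference_policy_step(n: int) -> int:
    """Index of the policy whose weights sample the rollouts consumed at trainer step n."""
    # Step 0 has no earlier policy, so it uses the initial weights (index 0).
    return max(0, n - 1)


# (trainer step, policy step used for sampling) for the first few steps
lag_table = [(n, inference_policy_step(n)) for n in range(4)]
```

Apart from step 0, the gap between the two columns is always exactly one step, which is the "one-step overlap" the reworded intro describes.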
Pull in the testing- and contributor-workflow content from README and
the GitHub Actions configs so contributors don't have to dig through
.github/workflows/ to figure out what runs where.
New sections in development.md:
- Test suite
  - Layout: tests/unit/, tests/integration/, tests/nightly/ (what each tier is for, with the actual file names contributors will encounter).
  - Running tests locally: pytest one-liners (everything, unit-only, integration-only, -m "not gpu", -m gpu, single file).
  - CI workflows: a 3-row table covering cpu_tests.yaml, gpu_tests.yaml (matrix list pulled verbatim from the workflow), and nightly_tests.yaml, including the trigger conditions (cpu = always, gpu = non-draft, nightly = scheduled + workflow_dispatch) and the runners (ubuntu-latest, vm/4xa6000, research-cluster).
  - Markers: gpu + slow, both declared with --strict-markers in pyproject.toml.
- Pre-commit hooks
  - uv run pre-commit install
  - Currently configured hooks: ruff check/format and docs-reference (regenerator that fails the commit if docs/reference.md would drift).
Plus update the Development pitch in overview.md "Where to go next"
and README.md docs index to mention the new scope.
Co-authored-by: Cursor <cursoragent@cursor.com>
Length penalties are configured under [orchestrator.length_penalty] and layer on top of *any* advantage function — they're conceptually not a standalone advantage variant. Move them into Default advantage where readers will see the option while reading about advantages. Drive-by: clean up the parenthetical hint in Default advantage. The old version listed "length penalties tied to turn count" as a reason to write a custom advantage, which conflicts with the actual turn-count length penalty being a built-in. New wording points length-penalty users at the now-adjacent built-in section, and points custom-advantage users at trajectory-metadata-driven shaping (sub-agents, relative-rank, …). TOC entry for #length-penalties dropped; no external doc links to that anchor. Co-authored-by: Cursor <cursoragent@cursor.com>
Covers the two complementary buffer-side mechanisms for keeping the
trainer batch high-signal. Verified each claim against
src/prime_rl/orchestrator/buffer.py and packages/prime-rl-configs/
src/prime_rl/configs/orchestrator.py.
- Difficulty pools (buffer.easy_threshold / hard_threshold +
easy_fraction / hard_fraction): per-problem running-average reward
is compared to the thresholds; problems hitting either bound move
to easy/hard pool and stop being sampled. Pool assignments persist
across checkpoints (easy_examples.jsonl / hard_examples.jsonl);
*_fraction lifts a fraction of pooled problems back into normal on
resume / start.
- Online difficulty filtering (buffer.online_difficulty_filtering
bool): groups whose avg reward is exactly 0.0 or 1.0 are dropped
from the buffer because their within-group advantage is zero
(DR-GRPO produces no signal). Counted under filtered_rollouts/
{env}/{easy,hard} for visibility.
- The tradeoff bit the user asked for explicitly: with ODF on each
trainer step's effective batch is dense + predictable, but the
orchestrator pays for the throw-away rollouts and may need a
higher oversampling_factor; if time/wait_for_batch is already
high on the trainer, ODF can starve the loop.
- ODF is orthogonal to the pools — ODF reacts to the current
group's reward distribution, pools track the running per-problem
average. Configs often use both.
Plus a one-line page-intro update + TOC entries.
Co-authored-by: Cursor <cursoragent@cursor.com>
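The reason all-0.0 or all-1.0 groups carry no signal can be seen with a small sketch. It assumes a mean-centered within-group advantage and rewards in [0, 1]; `group_advantages` and `keep_group` are hypothetical names and are not taken from src/prime_rl/orchestrator/buffer.py.

```python
from statistics import mean


def group_advantages(rewards):
    """Mean-centered within-group advantages (assumed form; no std normalization)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def keep_group(rewards):
    """Hypothetical ODF check: drop groups whose average reward is exactly 0.0 or 1.0."""
    return mean(rewards) not in (0.0, 1.0)


all_pass = [1.0, 1.0, 1.0, 1.0]
mixed = [1.0, 0.0, 1.0, 0.0]
```

For `all_pass` every advantage is zero, so the group contributes nothing to the gradient and is dropped; `mixed` has non-zero advantages and is kept. The cost noted in the commit is that the rollouts for dropped groups were still generated.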
Both are buffer-side controls over what reaches the trainer (same conceptual category as Filters and the loss/advantage knobs). Algorithms is the right home; Advanced is for orthogonal feature sets (LoRA, multi-tenant, custom modeling, multimodal). Promote each to its own top-level section.
- algorithms.md: new "## Difficulty pools" and "## Online difficulty filtering" sections, inserted between Filters and Multi-turn trajectories. TOC updated.
- advanced.md: section + TOC entries + page-intro mention removed.
Content unchanged from the previous commit on Advanced: same threshold / fraction / persistence claims, same ODF tradeoff explanation, same cross-link to oversampling_factor. ODF section back-links to Difficulty pools for the orthogonality note.
Co-authored-by: Cursor <cursoragent@cursor.com>
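A rough sketch of the difficulty-pool bookkeeping follows. Several details are assumptions, not facts from the source: the running average is a plain mean over all observed rewards, "hitting a bound" means `>= easy_threshold` or `<= hard_threshold`, and the class name `DifficultyTracker` is invented for illustration.

```python
from collections import defaultdict


class DifficultyTracker:
    """Tracks a per-problem running-average reward and assigns a pool."""

    def __init__(self, easy_threshold=None, hard_threshold=None):
        self.easy_threshold = easy_threshold
        self.hard_threshold = hard_threshold
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, problem_id, reward):
        """Record one reward and return the pool the problem now belongs to."""
        self._sum[problem_id] += reward
        self._count[problem_id] += 1
        avg = self._sum[problem_id] / self._count[problem_id]
        # Comparison direction is an assumption of this sketch.
        if self.easy_threshold is not None and avg >= self.easy_threshold:
            return "easy"
        if self.hard_threshold is not None and avg <= self.hard_threshold:
            return "hard"
        return "normal"
```

Unlike online difficulty filtering, which looks only at the current group, this state accumulates across steps, which is why the two mechanisms are described as orthogonal.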
…itecture"
- "Adding a new architecture" promoted to its own top-level section, placed *before* the renamed Debugging MoE recipe so the natural reading order is "here's how to wire up a new arch -> here's how to smoke-test it". Step 3 of the recipe now forward-links to Debugging MoE instead of "the three steps above".
- "Testing MoE at small scale" -> "Debugging MoE". The page is already titled Development, so the qualifier was redundant.
- Subsection titles drop the "Step N:" prefixes (the page-level TOC already implies sequence) and switch to sentence case for consistency with the rest of the docs:
  - Step 1: build and verify a mini model -> Build and verify a mini model
  - Step 2: SFT warmup -> SFT warmup (already correct)
  - Step 3: RL on reverse-text -> RL on reverse-text (already correct)
- TOC updated; README docs index pitch swaps "small-scale MoE testing" -> "debugging MoE" to match the new section title.
Co-authored-by: Cursor <cursoragent@cursor.com>
- "### Build and verify a mini model" -> "### Create mini model". The roundtrip-verify body covers the "and verify" half; the shorter title is enough.
- Merge "### SFT warmup" + "### RL on reverse-text" into one "### Smoketest training" subsection. The two were always run together as a single end-to-end smoke test (warmup so KL is meaningful, then RL stack), so a single subsection with two code blocks reads better than two artificially-split ones.
- TOC updated. No external doc links to the old sub-anchors.
Co-authored-by: Cursor <cursoragent@cursor.com>
User-facing docs no longer reference configs/ directly; examples/ is the only "we keep this up to date" surface, the rest is CI- and debug-internal:
- configuration.md: launch line + worked-example switched to examples/reverse_text/rl.toml. Section renamed "Worked example" -> "Examples" with a curated tour of the 10 README examples (basic 1-8 GPU + advanced SLURM tiers); the compose / override / dry-run walkthrough lives as a "### Worked example" subsection.
- training.md: drop the "Debug configs for all variants ship under configs/debug/training_modes/" pointer in the Training modes section. The prose already explains how to set the mode.
- scaling.md: P/D inference now points at examples/glm5_pd_disag/rl.toml (with a link to its README) instead of the configs/-side inference-only TOML.
- faqs.md: install-verify and smoke-test recipes both switch from configs/debug/sft and configs/gsm8k to examples/reverse_text.
Reference generator (scripts/generate_docs_reference.py):
- Drop "from the Pydantic config models" from the page header.
- Move the regenerate command + structural notes from the header to a new "## About this page" footer.
- Wrap the Type column in code spans so list[int], int | None, etc. render as code instead of plain text. fmt_type now emits literal `|` (GFM accepts pipes inside code spans inside table cells; no escaping needed).
- Walk list-of-BaseModel fields. Previously orchestrator.train.env / orchestrator.eval.env / orchestrator.filters were rendered as one row showing the default repr; their leaf fields never showed up. New _list_inner_models() detects both list[X] (single model) and list[Annotated[Union[A | B], discriminator]] (discriminated union of list items, e.g. filters: list[FilterConfig]). Index placeholder rendered as <n> to match the CLI form (--orchestrator.train.env.0.id ...).
Regenerate reference.md: +5k chars, mostly the new env/filter list-item subsections that were missing before.
development.md still references two CI-tested configs/ paths (configs/debug/moe/sft/train.toml, configs/ci/integration/ reverse_text_moe/start.toml) — those are validated by the reverse_text_moe GPU integration test on every PR, so they don't risk drifting. Flagging in case the user wants those swapped too. Co-authored-by: Cursor <cursoragent@cursor.com>
128-512 is the range for quick ablations, not production. Production RL often runs at 1024+. Co-authored-by: Cursor <cursoragent@cursor.com>
P/D disaggregation is a feature you opt into for large-MoE serving, not a step on the single-GPU -> 1000-GPU scaling ladder. It pairs naturally with Custom modeling / multi-tenant / multimodal as a specialized inference topology, so Advanced is the right home.
- scaling.md: drop the section + TOC entry + "disaggregated prefill/decode inference" from the page intro. Page intro now forward-links to Advanced for users who came in for P/D.
- advanced.md: append the section after Multi-tenant training, unchanged content (P:D ratio table, glm5_pd_disag example link, queue-depth monitoring snippet, UCX 1.19 build-from-source note). TOC + page-intro list updated.
- overview.md "Where to go next": drop disagg from the Scaling bullet, add to the Advanced bullet.
Anchor preserved (#disaggregated-prefilldecode-inference); no external doc links to it survived the move check.
Co-authored-by: Cursor <cursoragent@cursor.com>
The previous claim "trainer and inference server can share a GPU"
via the rl launcher was wrong. Verified against
src/prime_rl/entrypoints/rl.py:86-99: the launcher partitions
visible GPUs strictly (inference 0..N-1, trainer N..N+M-1) and
raises ValueError when total_requested_gpus > len(physical_gpu_ids).
Setting CUDA_VISIBLE_DEVICES=0 + --num-infer-gpus 1 +
--num-train-gpus 1 makes total=2, visible=1, validation fails
before anything launches.
What actually works for single-GPU RL is the manual three-pane
launch: each of uv run {inference,orchestrator,trainer} is an
independent process with no cross-process GPU validation, so
pinning CUDA_VISIBLE_DEVICES=0 on inference *and* trainer lets
them share the same physical GPU.
- Drop the misleading `uv run rl` recipe.
- Promote the manual three-pane recipe to the canonical single-GPU
path, with CUDA_VISIBLE_DEVICES=0 spelled out on both the
inference and trainer panes.
- Lead with SFT (where single-GPU is the default and just works).
- Add an explicit "single-GPU RL is for debugging only" caveat.
Co-authored-by: Cursor <cursoragent@cursor.com>
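The strict partitioning described above (inference on 0..N-1, trainer on N..N+M-1, ValueError when more GPUs are requested than are visible) can be sketched as follows. This is a reconstruction from the commit message, not a copy of src/prime_rl/entrypoints/rl.py; the function name and error text are assumptions.

```python
def partition_gpus(physical_gpu_ids, num_infer_gpus, num_train_gpus):
    """Split visible GPUs into disjoint inference and trainer sets."""
    total_requested_gpus = num_infer_gpus + num_train_gpus
    if total_requested_gpus > len(physical_gpu_ids):
        # Hypothetical message; the real launcher's wording may differ.
        raise ValueError(
            f"requested {total_requested_gpus} GPUs but only "
            f"{len(physical_gpu_ids)} are visible"
        )
    infer_ids = physical_gpu_ids[:num_infer_gpus]
    train_ids = physical_gpu_ids[num_infer_gpus:total_requested_gpus]
    return infer_ids, train_ids
```

With one visible GPU and one GPU requested for each role, the check fails before anything launches, which is why the single-GPU RL recipe has to go through the three separately launched processes instead.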
…to one umbrella
- New umbrella section "## Single-node vs. multi-node deployment" framing the [deployment] discriminated union as the user-facing knob: single_node runs locally; multi_node currently goes through SLURM. Subsections nest beneath:
  - ### Single GPU (unchanged content from "## Single GPU")
  - ### Single-node multi-GPU (unchanged from "## Single-node multi-GPU")
    - #### RL placement
    - #### SFT and torchrun
  - ### Multi-node (new short pointer to ## SLURM with two cross-links to the existing RL / SFT-and-inference examples)
- The umbrella section opens with a callout that manual multi-node
launches are technically possible but reimplement what the SLURM
launcher does — the user's preferred framing.
- Drop "## Choosing a layout" entirely (the new umbrella section
conveys the same routing more naturally + the layout table was
going stale).
- Drop "## Multi-node (manual)" entirely (RL training, SFT training,
Multi-node inference subsections all gone). Anyone who needs the
manual recipe can replicate what the SLURM templates do.
Cross-link fixes:
- training.md SFT § Launch line previously pointed at
scaling.md#sft-training (under "## Multi-node (manual)"). Now
points at scaling.md#sft-and-torchrun for non-default single-node
layouts and scaling.md#slurm for multi-node.
- faqs.md "Multi-node without SLURM or K8s?" answer updated: from
"yes, see [Scaling § Multi-node (manual)]" to "not currently
documented; technically possible but reimplements the SLURM
launcher".
Page intro adjusted to match the new structure ("multi-node SLURM
and Kubernetes deployments" -> "single-node and multi-node
deployments").
Co-authored-by: Cursor <cursoragent@cursor.com>
The K8s Helm chart at k8s/prime-rl/ still ships, but the user-facing docs are dropping coverage until the chart and the matching guide are re-validated together.
- scaling.md: drop the "## Kubernetes" section + TOC entry. Page intro already covered (was reworded earlier in the restructure).
- overview.md "Where to go next": drop "Kubernetes guides" from the Scaling bullet.
- README.md docs index: drop "Kubernetes" from the Scaling bullet.
The two passing-mention "k8s" / "Kubernetes" lines in README (Overview features list, Advanced Training Examples adaptability note) are left as-is; they describe codebase capability, not docs coverage. Reference.md still mentions Kubernetes liveness probes in an auto-generated field docstring; that's source-side, out of scope for this pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Drop the example test filenames from the tests/integration/ bullet;
they're a moving list and not the point.
- Reframe the tests/nightly/ bullet around what it does (runs the
examples/ configs to catch regressions) instead of listing the
individual nightly tests by name.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace the bare 'To add (e.g.) Kimi 2.5:' opener with a one-line
framing of the two-step contract: implement modeling code,
register a mini preset for smoke-testing.
- Bold the leading verb on each numbered step so the structure
reads as a checklist.
- Step 1 now nudges readers at glm4_moe/ and qwen3_moe/ as
templates for the modeling code.
- Step 2 explains *what* the preset is for ('build a ~0.5B test
model in your architecture') rather than just listing fields.
Path now links to scripts/mini_moe.py.
- Step 3 says what the smoke-test actually exercises (roundtrip +
SFT + RL stack) so users know what 'smoke-test' means here.
Co-authored-by: Cursor <cursoragent@cursor.com>
CodeQL alert actions/missing-workflow-permissions
(security/code-scanning/19) flagged the new workflow for relying on the
repo's default GITHUB_TOKEN permissions. The workflow only checks out
code (contents: read), syncs deps via uv, runs the doc generator, and
runs git diff. None of that needs write scope on any resource type. Pin
to contents: read at the workflow level, the explicit minimum that
satisfies the rule.
Co-authored-by: Cursor <cursoragent@cursor.com>
- assets/architecture.png: replace with the new diagram (trainer
+ orchestrator + inference deployment, GPU layout per process,
data + scheduling + weight-broadcast arrows). Was 511k 96dpi,
now 111k @ 200dpi from architecture.pdf.
- assets/two-step-off-policy.png removed; assets/async-pipeline.png
replaces it with the cleaner one-step-overlap diagram (trainer
steps g_0..g_n above, inference samples with theta_{n-1} below).
algorithms.md image reference + alt text updated to match the
post-deprecation "one-step overlap" framing.
- assets/rollout-timeline.png added but not yet referenced. It's a
continuous-time view showing rollouts spanning policy boundaries
(policies pi_{i-2}, pi_{i-1}, pi_i on the x-axis with rollout
bars crossing the boundaries) — that's the picture behind
max_off_policy_steps, not max_async_level. Want me to drop it
into algorithms.md (e.g. above the off-policy / max_off_policy
discussion) or save it for later?
Co-authored-by: Cursor <cursoragent@cursor.com>
User-facing prose mentions of verifiers / renderers / research-
environments / pydantic-config now consistently render as code-spans
with a github link to the package. Affected lines:
- overview.md: orchestrator-bullet [verifiers](url) (dropped the
bare-text variant), install paragraph (linked all three submodules
individually), quick-run paragraph.
- algorithms.md: "Since [`verifiers` v0.1.8]..." (added backticks
around the package name in the release-tag link).
- training.md: SFT "verifiers submodule" launch line, useful-knobs
vf_level row, tool-defs paragraph.
Intentionally left alone:
- algorithms.md `verifiers.RolloutOutput` (dotted-path code ref;
whole expression already in code).
- algorithms.md / training.md [renderers](#renderers) / [Renderers]
(#renderers) (internal anchors to the in-page section, more useful
than the github repo for a reader inside the doc).
- algorithms.md "Hand-coded renderers ship for ..." and "the
renderers writeup on the PI blog" — generic prose, not a package
mention.
- reference.md docstring-sourced mentions ("verifiers package",
"renderers package", etc.) — those come from Pydantic field
docstrings; would need source-side edits + regen.
Co-authored-by: Cursor <cursoragent@cursor.com>
Out of date with the max_async_level deprecation and duplicates content
already covered properly in algorithms.md (one-step overlap framing +
the AIPO/KL loss math). Architecture bullets above already mention
async semantics where relevant.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Was four separate code blocks separated by one-line prose; now a single
bash block with inline-comment annotations for each variant. Reads
cleaner and matches the test-suite block in development.md.
Co-authored-by: Cursor <cursoragent@cursor.com>
…e-test)
Self-review found a handful of small inconsistencies and stale claims
that survived earlier passes.
Stale max_async_level references (now hardcoded to 1, so any prose
that treats it as tunable is wrong):
- faqs.md: "If growing, drop max_async_level or LR" -> drop
max_off_policy_steps. NCCL FAQ no longer says "requires
max_async_level=1" — that constraint is vacuous.
- training.md Rules of thumb: NCCL/max_async_level=1 dropped from
the dry-run validator example; CP / flash-attention example
stays.
- scaling.md NCCL example: drop the trailing
"# synchronous; max_async_level forced to 1" comment.
Other accuracy / consistency fixes:
- faqs.md hardware FAQ: drop "you can co-locate both on a single
GPU" — verified earlier that the rl launcher rejects this; the
manual three-pane recipe is the actual single-GPU path. Cross-
link to Scaling § Single GPU.
- faqs.md max_off_policy_steps wording: "throughput-vs-noise knob"
-> "off-policy dial" to match algorithms.md / training.md.
- training.md sft.data.loss_mask cross-link: anchor was
reference.md#sft-data (the discriminated-union heading); the
loss_mask sub-table actually lives at #sft-data-sft-loss-mask.
- development.md "Smoketest training" -> "Smoke-test training" in
the section title and TOC anchor, matching the hyphenated verb
form ("smoke-test the new architecture") used in the page body.
Co-authored-by: Cursor <cursoragent@cursor.com>
Four asks rolled in:
1. **Drop docs/faqs.md entirely.** Removed the file plus all its
cross-references: docs/mint.json nav, README.md docs index, and
docs/overview.md "Documentation" list. The standalone Q&A page
wasn't pulling its weight against the verifiers tone reference.
2. **Title Case all headings** across the user-facing pages so the
visual style matches deps/verifiers/docs/* (which uses
"Hosted Training" / "Performance Trade-offs" etc., not sentence
case). Anchors are slug-based (lowercase + hyphen), so internal
#links survive the case flip — only the link text in cross-doc
"§ Section" references needed updating (training.md → Algorithms
§ Multi-Turn Trajectories, training.md → Scaling § SFT and
Torchrun, scaling.md → Configuration § TOML Composition,
scaling.md → SLURM § RL Example / SFT and Inference Examples).
3. **Rename overview.md "Where to go next" → "Documentation"** for
symmetry with the verifiers landing page.
4. **Smell fixes**:
a. algorithms.md had two near-identically-named subsections ("###
The default loss" under Async with the loss math, and "###
Default loss" under Loss with the mode dispatch). Collapse:
the loss math now lives under "## Loss > ### Default Loss"
together with the rl/opd/sft mode-dispatch bullets; the Async
section is just intro + step semantics. One source of truth.
b. reference.md auto-generated docstring mentions of "verifiers
package" / "renderers package" / "renderers library" /
"renderers.parsers" / "Registered verifiers environment ID"
now render as [`pkg`](github-url) links. Source-side edits in
packages/prime-rl-configs/src/prime_rl/configs/{shared,
orchestrator,sft}.py, then regenerated reference.md.
Net: faqs.md deletion (-198) dominates; everything else is small
churn for case + anchor consistency.
Co-authored-by: Cursor <cursoragent@cursor.com>
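The claim in item 2 above, that slug-based anchors survive the Title Case flip, can be checked with a small sketch. The `github_slug` helper below is hypothetical and only approximates the common GitHub-style rule (lowercase, drop punctuation, hyphenate spaces); the renderer actually used for these docs may differ in edge cases.

```python
import re


def github_slug(heading: str) -> str:
    """Approximate a Markdown heading anchor (assumed GitHub-style rule)."""
    # Strip the leading '#' markers and surrounding whitespace, then lowercase.
    text = heading.strip().lstrip("#").strip().lower()
    # Drop everything except word characters, hyphens, and spaces.
    text = re.sub(r"[^\w\- ]", "", text)
    # Spaces become hyphens; existing hyphens are kept.
    return text.replace(" ", "-")


# Sentence-case and Title Case variants of the same heading.
before = "## Multi-turn trajectories"
after = "## Multi-Turn Trajectories"

print(github_slug(before))
print(github_slug(after))
print(github_slug(before) == github_slug(after))
```

Because the slug is lowercased before comparison, only the visible "§ Section" link text needs updating after the flip, not the `#anchor` part of the link.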
# Conflicts: # docs/async.md
Restructure + accuracy fixes that were piled up locally during the
SSH-signing block earlier today.
- configuration.md: note that reusing an env id requires a unique
name; drop the "See each environment's README on the Hub" tail.
- development.md: collapse "Adding a New Architecture" + "Debugging
MoE" (with its Create Mini Model + Smoke-Test Training subsections)
into a single "## Adding a New Model" with three subsections —
Implement the Modeling Code, Register a Mini Preset, Run the
Smoke Test. Page intro updated.
- scaling.md: drop "## Single GPU" subsection; rename "## Single-Node
Multi-GPU" -> "## Single-Node" so "Single-Node vs. Multi-Node
Deployment" has just two clean children. Drop the manual-multi-
node prose. Drop the [slurm] field reference table; rename the
section to "## [deployment] Block" and link to reference.md.
Collapse the two example subsections into one "## Examples" with
pointers to examples/multinode/{rl,sft}.toml. Rename "## CPU
Optimizer Offload" -> "## Optimizer Offloading". Add new "## LM
Head Chunking" subsection covering the fused_lm_head_token_chunk_
size knob. Drop torchrun phrasing in the SFT scaling subsection,
the CP recommendation paragraph, "(offloads checkpoints to CPU)"
parenthetical, "RL gradient-accumulation amortization" line,
attn = "flash_attention_2" from the memory-tight recipe, dry-run
invaluable line.
- algorithms.md: drop the Step Semantics subsection heading (body
stays); replace the AIPO link with a plain "DPPO + KL similar to
Kimi-K2.5" framing.
- advanced.md: drop the "## Custom vs HF Implementations"
subsection heading; the body becomes the lead content under
"## Custom Modeling".
- overview.md + README.md: pitch updates ("adding a new
architecture" -> "adding a new model"; FAQs row already gone).
Co-authored-by: Cursor <cursoragent@cursor.com>
Removes everything in the reference-generation pipeline:
- docs/reference.md: the 197k-char auto-generated field reference.
- scripts/generate_docs_reference.py: the generator that walks
Pydantic config trees, discriminated unions, list-of-models, etc.
- .github/workflows/docs-reference.yaml: the CI guard that ran the
generator and failed on git diff drift.
- .pre-commit-config.yaml: drops the local docs-reference hook that
re-ran the generator on staged config-class edits.
Knock-on cleanup of cross-references:
- mint.json: drop the "reference" page from nav.
- overview.md / README.md: drop the Reference bullet from the
Documentation list / docs index.
- configuration.md: drop the two trailing "See [Reference] ..."
pointers (under Discriminated Unions and at the end of the Examples
section).
- training.md: drop the "for the full field reference see [Reference]"
tail in Useful Knobs; the loss_mask row no longer cross-links and just
lists the four roles.
- scaling.md: drop the [Reference § trainer.model] DeepEP pointer and
the [Reference § slurm] pointer in the [deployment] block.
- algorithms.md: drop the [Reference § orchestrator length penalties]
line.
- development.md: simplify the Pre-Commit Hooks section; only the ruff
hook remains, so the configured-hooks bullet list goes away.
Co-authored-by: Cursor <cursoragent@cursor.com>
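For readers unfamiliar with what a config-tree walk of this kind does, here is a minimal sketch. It is not the removed script: stdlib dataclasses stand in for the Pydantic models so the example runs without dependencies, the `OptimConfig` / `EnvConfig` / `TrainerConfig` classes and the `walk` helper are hypothetical, and discriminated unions and optional sub-configs are left out.

```python
from dataclasses import MISSING, dataclass, field, fields, is_dataclass
from typing import get_args, get_origin, get_type_hints


# Hypothetical config classes, used only for illustration.
@dataclass
class OptimConfig:
    lr: float = 1e-5


@dataclass
class EnvConfig:
    id: str = "example-env"


@dataclass
class TrainerConfig:
    max_steps: int = 100
    optim: OptimConfig = field(default_factory=OptimConfig)
    env: list[EnvConfig] = field(default_factory=list)


def walk(cls, prefix):
    """Return one Markdown table row per leaf field, recursing into nested configs."""
    rows = []
    hints = get_type_hints(cls)
    for f in fields(cls):
        tp = hints[f.name]
        path = f"{prefix}.{f.name}"
        if is_dataclass(tp):
            # Nested config: recurse with a dotted path.
            rows += walk(tp, path)
        elif get_origin(tp) is list and is_dataclass(get_args(tp)[0]):
            # List of configs: render the TOML array-of-tables form.
            rows += walk(get_args(tp)[0], f"[[{path}]]")
        else:
            # Leaf field: resolve the default, if any.
            if f.default is not MISSING:
                default = f.default
            elif f.default_factory is not MISSING:
                default = f.default_factory()
            else:
                default = "required"
            rows.append(f"| `{path}` | `{tp.__name__}` | `{default!r}` |")
    return rows


for row in walk(TrainerConfig, "trainer"):
    print(row)
```

A drift guard like the deleted CI workflow would then regenerate this output and compare it with the committed file, failing when the two differ.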
Summary
- `docs/reference.md`, a single auto-generated field-by-field reference covering every entrypoint config (`rl`, `sft`, `trainer`, `orchestrator`, `inference`). Walks list-typed sub-configs (e.g. `[[orchestrator.train.env]]`), discriminated unions, and `Optional[BaseConfig]` fields; types render as code spans.
- `scripts/generate_docs_reference.py` (the generator), `.github/workflows/docs-reference.yaml` (CI guard that fails the build on drift), and a `docs-reference` pre-commit hook (regenerates on staged config changes). Regenerate manually with `uv run python scripts/generate_docs_reference.py`.
- `docs/assets/`.
- `mint.json` nav and the top-level `README.md` docs index.

New Structure
- `overview.md`
- `configuration.md`
- `training.md`
- `scaling.md`
- `algorithms.md`
- `advanced.md`
- `development.md`
- `reference.md`

Old → New Content Mapping

| Old | New |
| --- | --- |
| `index.md`, `entrypoints.md` | `overview.md` + `training.md` |
| `configs.md`, `environments.md` | `configuration.md` |
| `training_modes.md` | `training.md` § Training Modes |
| `async.md`, `bring-your-own-algorithms.md`, `trajectories.md` | `algorithms.md` |
| `logging.md`, `metrics.md`, `platform-monitoring.md`, `checkpointing.md` | `training.md` § Observability + Checkpointing |
| `deployment.md`, `slurm.md`, `benchmarking.md`, `memory_usage.md` | `scaling.md` |
| `disaggregated-inference.md` | `advanced.md` § Disaggregated Prefill/Decode Inference |
| `multimodal.md`, `multi_run_manager.md` | `advanced.md` (Multimodal Training, Multi-Tenant Training) |
| `testing-moe-at-small-scale.md` | `development.md` § Adding a New Model |
| `troubleshooting.md`, `kubernetes.md` | (`troubleshooting` folded into per-page prose; `kubernetes` deferred) |

🤖 Generated with Claude Code