docs: big revamp #2602

Draft
mikasenghaas wants to merge 67 commits into main from chore/docs-revamp

Conversation

mikasenghaas (Member) commented May 22, 2026

Summary

  • Replaces 22 small/uneven docs files with 8 longer pages modeled on the verifiers docs — each page opens with a TOC, content is grouped by task (configure, train, scale, …) instead of by feature, and headings use Title Case for tone consistency with the verifiers docs.
  • Adds docs/reference.md, a single auto-generated field-by-field reference covering every entrypoint config (rl, sft, trainer, orchestrator, inference). Walks list-typed sub-configs (e.g. [[orchestrator.train.env]]), discriminated unions, and Optional[BaseConfig] fields; types render as code spans.
  • Adds scripts/generate_docs_reference.py (the generator), .github/workflows/docs-reference.yaml (CI guard that fails the build on drift), and a docs-reference pre-commit hook (regenerates on staged config changes). Regenerate manually with uv run python scripts/generate_docs_reference.py.
  • Refreshes the architecture and async-pipeline diagrams under docs/assets/.
  • Updates mint.json nav and the top-level README.md docs index.

New Structure

| Page | Role |
| --- | --- |
| overview.md | Architecture diagram, three-process tour, install, one runnable RL command, docs index |
| configuration.md | TOML composition, CLI overrides, dry-run, syntax (booleans, lists, dicts, optional sub-configs, discriminated unions, env arrays), examples tour |
| training.md | RL + SFT trainer entrypoints, useful knobs, training modes (RL / OPD / SFT), checkpointing, observability, rules of thumb |
| scaling.md | Single-node vs. multi-node deployment, parallelism knobs (FSDP / EP / CP / AC / optimizer offloading / LM head chunking), memory-tight recipe, SLURM, benchmarking |
| algorithms.md | Async / off-policy training, default + custom loss / advantage / filters, difficulty pools, online difficulty filtering, multi-turn trajectories with renderers |
| advanced.md | Custom modeling, multimodal training, LoRA training, multi-tenant training, disaggregated prefill/decode inference |
| development.md | Test suite (unit / integration / nightly), pre-commit hooks, adding a new model |
| reference.md | Auto-generated field-by-field reference for every entrypoint config |

Old → New Content Mapping

| Old | New home |
| --- | --- |
| index.md, entrypoints.md | overview.md + training.md |
| configs.md, environments.md | configuration.md |
| training_modes.md | training.md § Training Modes |
| async.md, bring-your-own-algorithms.md, trajectories.md | algorithms.md |
| logging.md, metrics.md, platform-monitoring.md, checkpointing.md | training.md § Observability + Checkpointing |
| deployment.md, slurm.md, benchmarking.md, memory_usage.md | scaling.md |
| disaggregated-inference.md | advanced.md § Disaggregated Prefill/Decode Inference |
| multimodal.md, multi_run_manager.md | advanced.md (Multimodal Training, Multi-Tenant Training) |
| testing-moe-at-small-scale.md | development.md § Adding a New Model |
| troubleshooting.md, kubernetes.md | Dropped (troubleshooting folded into per-page prose; kubernetes deferred) |

🤖 Generated with Claude Code

mikasenghaas and others added 4 commits May 22, 2026 23:47
Replaces 22 small/uneven docs files with 8 longer pages modeled on the
verifiers docs: each page opens with a TOC, content is grouped by task
(configure, train, scale, …) instead of by feature, and a single
auto-generated reference page covers every config field.

New pages: overview, configuration, training, scaling, algorithms,
advanced, faqs, reference. reference.md is generated by
scripts/generate_docs_reference.py from the Pydantic config models;
regenerate with `uv run python scripts/generate_docs_reference.py`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- algorithms.md: max_async_level default is 1 (not 2); default loss
  includes the Kimi-K2.5 KL regularizer (was wrongly claimed to drop
  it); update the formula to show the full L = -PG + tau_KL * KL form;
  filter table is `[[orchestrator.filters]]` (plural)
- training.md: checkpoint paths nest under `checkpoints/step_N/{trainer,
  orchestrator}/` rather than separate hierarchies; --inference-gpu-ids
  / --trainer-gpu-ids don't exist — use --deployment.num-{infer,train}-
  gpus and pin physical GPUs via CUDA_VISIBLE_DEVICES; update
  max_async_level prose to match the new default
- scaling.md: same GPU-flag fix throughout the single/multi-GPU
  examples; correct the claim that Muon + optim_cpu_offload is
  unsupported (only fsdp_cpu_offload is blocked)
- configuration.md: there is no generic PRIME_* env var override
  mechanism in pydantic-config — rewrite the env vars section to list
  the specific named vars that individual fields read as defaults
- advanced.md: add the qwen3_vl_moe entry to the VLM registry table;
  the small-scale MoE RL config lives at
  configs/ci/integration/reverse_text_moe/start.toml, not .../rl/
- faqs.md: update the max_async_level Q&A to match the new default

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two safety nets so the auto-generated reference can't silently
drift from the Pydantic config models:

- Pre-commit hook (local): re-runs scripts/generate_docs_reference.py
  whenever a config class or the generator itself is staged. If the
  generated file changes, pre-commit fails the commit so the contributor
  re-stages the regenerated reference.
- GitHub Actions (CI): a small workflow runs the generator and
  `git diff --exit-code docs/reference.md`. Catches anyone who bypassed
  the pre-commit hook.
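
As a rough illustration of the drift guard described above, the check amounts to regenerating the reference text and comparing it with the committed file. This is only a sketch, not the actual workflow or generator: `reference_is_stale` and `guard` are hypothetical names, and the generated text is passed in as a plain string instead of being produced by scripts/generate_docs_reference.py.

```python
from pathlib import Path


def reference_is_stale(generated: str, reference_path: Path) -> bool:
    """Return True when the committed file differs from freshly generated text."""
    if not reference_path.exists():
        # Assumption: a missing reference file is treated as drift.
        return True
    return reference_path.read_text(encoding="utf-8") != generated


def guard(generated: str, reference_path: Path) -> int:
    """Mimic the exit code of `git diff --exit-code`: 0 in sync, 1 on drift."""
    return 1 if reference_is_stale(generated, reference_path) else 0
```

A pre-commit hook or CI step could call `guard(...)` and pass the result to `sys.exit`, so a stale docs/reference.md fails the commit or the build.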

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread .github/workflows/docs-reference.yaml Fixed
mikasenghaas and others added 25 commits May 23, 2026 00:17
- Quick run now uses examples/reverse_text/rl.toml; the env is bundled
  with the verifiers submodule so no prime env install is needed, and
  the tmux helper is documented elsewhere instead of duplicated here
- Architecture bullets advertise the SOTA features per process: vLLM
  multi-node + FP8 + P/D disaggregation for inference; FSDP2 + EP
  (incl. DeepEP) + CP + selective AC + FP8 + LoRA + multi-run for the
  trainer
- Drop the "Use prime-rl when you want to" bullets and the CPU-only
  SFT smoke check — the landing page reads cleaner without them

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop the env-vars section entirely (and the precedence callout that
  referenced it) — the page is now strictly TOML + CLI. The few named
  env vars that individual fields read as defaults are out of scope for
  the config docs and stay in the per-feature pages (training.md, etc.)
- Drop the entrypoint enumeration and the W&B/output-dir recommendation
  blurb in the intro
- Reword "@ introduces a TOML file" so the sentence doesn't lead with
  an inline code token; convert the "Mind the space" hint to a
  blockquote
- Drop --output-dir from the convenience-flag list (it's just another
  override, not a special flag)
- Note that --dry-run is available on rl, sft, and inference only —
  the standalone trainer and orchestrator configs don't have a dry_run
  field
- Split "Booleans, None, and lists" into one section each, matching
  the pydantic-config README style; add a Dicts section
- Drop the "Prefer --ckpt" convention bullet — checkpointing is
  covered in training.md and didn't belong in the config conventions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vLLM is an OpenAI-compatible server by default; the prefix in the
entrypoints table was just noise. Other mentions of "OpenAI-compatible"
describe the API surface or third-party endpoints and stay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Quick-start now uses examples/reverse_text/rl.toml; drop the prime
  env install and tmux preamble (covered elsewhere)
- Add a "Useful CLI flags" subsection: --ckpt, --wandb, --orchestrator.
  prime-monitor (Prime Lab), --clean-output-dir, --output-dir,
  --max-steps, --dry-run
- Mention the env-server / env-worker fan-out in the orchestrator
  bullet under "What each process does at runtime"
- Restrict the Key knobs table to orchestrator-only args; drop
  max_async_level, max_completion_tokens, inference, trainer rows;
  rename rollouts_per_example row to lead with "Group size"
- SFT Launch now uses examples/reverse_text/sft.toml; drop the CPU
  fake-data smoke alternative
- "two distillation modes" -> "three training modes" (rl/opd/sft)
- Drop the long-run checkpoint-combo recommendation
- Drop the trainer+orchestrator lockstep note from Resuming a run
- Swap order: Platform monitoring now appears before Prometheus +
  BetterStack under Observability; show --orchestrator.prime-monitor
  CLI invocation
- Rename "Metrics that matter" -> "Important metrics"; drop the live
  vLLM curl snippet
- Drop "Eyeball the reward distribution", "Match inference.parallel.tp",
  and "Set max_async_level deliberately" rules of thumb
- Add new rules of thumb: batch size >= 64; group size >= 8 with the
  reasoning that all-succeed / all-fail groups give the trainer no
  signal because the within-group advantage collapses
- Drop the Common Issues section entirely
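
The group-size rule of thumb above rests on a small piece of arithmetic, which the sketch below tries to make concrete. It assumes a mean-centred within-group baseline (each rollout's advantage is its reward minus the group mean); the default advantage in prime-rl may differ in detail, and `group_advantages` is a hypothetical helper, not a library function.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-centred advantages for one group of rollouts on the same task."""
    if not rewards:
        raise ValueError("a group needs at least one rollout")
    mean = sum(rewards) / len(rewards)
    # Each rollout is scored relative to its own group's average reward.
    return [r - mean for r in rewards]


def has_signal(rewards: list[float]) -> bool:
    """A group contributes gradient signal only if some advantage is non-zero."""
    return any(a != 0.0 for a in group_advantages(rewards))
```

With a group of 8 where every rollout succeeds (or every rollout fails), all advantages are zero and `has_signal` returns False; a single differing reward is enough to make the group informative, which is why larger groups help on tasks that are mostly solved or mostly failed.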

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- scaling.md: drop the 1-GPU row and the "Production MoE with long
  contexts" row from the "Choosing a layout" table; the disaggregated
  prefill/decode page section is still findable via its own H2
- scaling.md: drop the trailing "Multi-node logs" section (heading +
  TOC entry); the content now lives next to single-node log layout
- training.md: fold the multi-node tree into "Log files" with the
  single-node skip note inlined; add live-tail recipes and the
  per-rank torchrun debug note; mention the tmux helper works on a
  SLURM head node

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New Renderers section explains why best-effort interleaving works:
  the renderer guarantees the exact-prefix invariant by construction
  via bridge_to_next_turn. Lists the renderer API surface and the
  hand-coded model coverage
- Drop the verifiers trajectories-design-note link from Discontinuous
  trajectories and the --trajectory-strategy branching deprecation
- Drop preserve_all_thinking workaround mentions from algorithms.md
  and faqs.md (reference.md still documents the fields)
- Leave a TODO(blog-post-url) for the PI site writeup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
faqs.md:
- Drop the "override an env var in TOML" Q&A (matches the configuration
  page where env vars are no longer documented as a generic override)
- Drop the "max_async_level" Q&A; replace with a max_off_policy_steps
  Q&A — the more impactful knob to tune on long agentic rollouts
- Drop the outdated "two W&B runs per RL job" Q&A; default is shared
  now (wandb.shared = true)
- Drop the SFT-section Q&As that referenced preserve_all_thinking or
  were too thin to keep
- Switch the "evaluate without training" recipe from vf-eval to prime
  eval run (the Prime CLI is the recommended entrypoint)

training.md:
- Fix the W&B section to describe the new default: shared single run
  (wandb.shared = true), with the legacy split as opt-out
- Add max_off_policy_steps to the Key knobs table
- Switch the eval example from vf-eval to prime eval run

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	docs/bring-your-own-algorithms.md
#	docs/slurm.md
#	docs/training_modes.md
- Inference bullet now leads with the local default (token-in
  /v1/generate via renderers; OpenAI-compatible routes called out
  as the external-client path) and adds DP/TP/EP with deepep +
  flashinfer all-to-all backends + EPLB, P/D disaggregation behind
  vllm-router, CPU KV-cache offload, and router replay (FP8 MoE
  numerical-parity feature). Weight broadcast is filesystem or NCCL.
- Orchestrator bullet now leads with "owns the data plane across
  many verifiers training and eval environments" plus the per-env
  isolated subprocess + variable-size env-worker pool.
- Trainer bullet drops "torchrun-launched" and surfaces the custom
  modeling code as the enabler for advanced trainer parallelism
  (EP with DeepEP, CP for long sequences).
- Drop the [AIPO] link in the async paragraph (off-policy-aware PG
  + KL regularizer, no paper handle); also drop the "AIPO loss"
  mention from the Algorithms blurb in "Where to go next" so the
  page is internally consistent.
- Quick-run command is now bare: uv run rl @ examples/reverse_text/rl.toml
  (no --wandb.* / --ckpt).
- Drop the trailing scaling pointer (Scaling is already linked in
  "Where to go next").

Co-authored-by: Cursor <cursoragent@cursor.com>
- Drop the entrypoint-splitting paragraph ([trainer] / [orchestrator]
  / [inference] table lifting); covered elsewhere.
- Rename "TOML files and composition" -> "TOML composition", and
  "Special syntax" -> "Syntax".
- Open "Sources and precedence" by naming the three sources (Pydantic
  defaults, TOML files, CLI flags) up front, then layering them.
- Drop the "(-- is a kebab-case marker)" parenthetical from CLI
  overrides; turn the snake/kebab note into a callout.
- Drop the --help / --dry-run convenience-flag block and the
  "--dry-run is the single most useful debugging tool" prose; the
  bash example is enough.
- Reorder Syntax subsections to mirror the pydantic-config README:
  Booleans -> Lists -> Dicts -> Optional sub-configs -> None ->
  Discriminated unions -> Environments. None moves down and is
  cross-linked from "disabling an optional sub-config".
- Booleans example swapped from --ckpt (which is itself an optional
  sub-config) to --clean-output-dir (a real bool = False field),
  showing both --flag and --no-flag forms.
- Lists / Dicts now show TOML and CLI on the *same* field name so
  the mapping is obvious (target_modules for lists, env.0.args for
  dicts), and add the "lists are replaced wholesale" overlay note
  + "dicts deep-merge across sources" detail.
- Add a callout on validation aliases (rollouts_per_example still
  works after the rename to group_size) — only material gap vs the
  pydantic-config README that's relevant to end users.
- Worked example: --dry-run is now the final flag.
- Drop the Conventions section.
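
The overlay rules mentioned above ("lists are replaced wholesale", "dicts deep-merge across sources") can be sketched in a few lines. This is an illustration of the semantics as described in the commit message, not the pydantic-config implementation; `merge_layer` and `resolve` are hypothetical names, and every key in the usage example except `target_modules` is made up.

```python
from copy import deepcopy
from typing import Any


def merge_layer(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Overlay one config source on another without mutating either."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Dicts deep-merge: keys missing from the overlay survive.
            merged[key] = merge_layer(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale by the later source.
            merged[key] = deepcopy(value)
    return merged


def resolve(*layers: dict[str, Any]) -> dict[str, Any]:
    """Apply sources in order, e.g. defaults, then TOML files, then CLI flags."""
    result: dict[str, Any] = {}
    for layer in layers:
        result = merge_layer(result, layer)
    return result
```

For example, `resolve({"lora": {"rank": 8, "target_modules": ["q_proj", "k_proj"]}}, {"lora": {"target_modules": ["v_proj"]}})` keeps `rank` from the first layer but ends up with only `["v_proj"]` in `target_modules`.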

Co-authored-by: Cursor <cursoragent@cursor.com>
- Rename "RL training" -> "RL trainer" and "SFT training" -> "SFT
  trainer" (and update the page intro accordingly).
- Entrypoints table: clarify that `uv run rl` wraps the trainer,
  orchestrator, and inference server in one launch — runs locally
  for single-node experiments and submits to SLURM for single- or
  multi-node when [slurm] is set. Drop the trailing "rl is a
  convenience wrapper" paragraph.
- RL trainer Launch: minimal command is now bare
  `uv run rl @ examples/reverse_text/rl.toml` (no flags). Drop the
  GPU-placement paragraph + multi-GPU example (covered in scaling.md).
- Replace "Useful CLI flags" + "Key knobs" + "What each process does
  at runtime" with one consolidated "Useful knobs" section split into
  three sub-tables: data-and-algorithm, monitoring, run management.
  - Add training environments ([[orchestrator.train.env]] for
    multi-env training)
  - Add eval environments ([[orchestrator.eval.env]] +
    orchestrator.eval.interval)
  - Add monitoring entries: orchestrator.log.vf_level, --wandb,
    --orchestrator.prime-monitor
- Move Training modes from a top-level section into RL trainer as
  a subsection (it's RL-entrypoint-specific).
- Drop the standalone Evaluations section — eval syntax is covered
  in configuration.md and the eval-knobs row in Useful knobs links
  to `prime eval` for one-off evals.
- Drop the optimization_dtype / reduce_dtype rule of thumb.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Drop "-via-orchestrator" from the Training modes heading and the
  internal/cross-doc anchors. The mode value is just `sft` and the
  short title reads cleaner.
- Drop "and the tmux helper" from the Console output subsection
  title; the tmux helper is still documented in the section body.
- Important metrics is now split into RL trainer and SFT trainer
  subsections so the SFT-only metrics (loss/mean, val/loss,
  progress/{epoch,num_samples,num_tokens}, optim/zero_grad_ratio,
  per-subset mixing ratios, MoE max_vio + routing_confidence,
  perf/peak_memory + the time/* breakdown) are documented.
- SFT Dataset format gains a Tool definitions paragraph: rows can
  carry a `tools` column (OAI function-calling format) or
  `tool_defs` (verifiers rollout format), as either a list of dicts
  or a JSON-encoded string. `tool_defs` is auto-converted to OAI
  shape before being passed into the chat template's `tools=...`
  argument. `chat_template_kwargs` rows pass through verbatim.
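
To make the "list of dicts or JSON-encoded string" part of the Tool definitions paragraph concrete, here is a minimal sketch of normalizing a row's `tools` value before it is handed to a chat template. `load_tools` is a hypothetical helper, not the prime-rl code, and it deliberately skips the `tool_defs` to OAI conversion, whose exact shape is not spelled out here.

```python
import json
from typing import Any


def load_tools(value: Any) -> list[dict[str, Any]]:
    """Accept a `tools` column as a list of dicts or a JSON-encoded string."""
    if value is None:
        # Assumption: rows without tools yield an empty list.
        return []
    if isinstance(value, str):
        value = json.loads(value)
    if not isinstance(value, list) or not all(isinstance(t, dict) for t in value):
        raise ValueError("expected a list of tool dicts")
    return value
```

Both `load_tools([{"type": "function", "function": {"name": "add"}}])` and the same payload passed through `json.dumps` return the same list, so downstream code only has to handle one shape.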

Co-authored-by: Cursor <cursoragent@cursor.com>
…eckpoints'

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds a callout under the intro of training.md and configuration.md
pointing at the equivalent skill files for AI agents working in
this repo:

- training.md -> skills/training/SKILL.md (top-level routing) +
  skills/training/start-run/SKILL.md (launch details) +
  skills/training/monitor-run/SKILL.md (check-in / restart).
- configuration.md -> skills/configs/SKILL.md.

The skills aren't part of the published Mintlify nav, so the links
go to GitHub blob URLs.

Co-authored-by: Cursor <cursoragent@cursor.com>
The standalone "## Important metrics" section is gone. Each
trainer subsection now ends with its own "### Important metrics"
covering only the metrics relevant to that flow:

- RL trainer / Important metrics: reward + rollout signals from the
  orchestrator, mismatch_kl + entropy + grad_norm from the trainer,
  and the trainer/orchestrator/vLLM performance grid.
- SFT trainer / Important metrics: loss/mean, val/loss, progress
  counters, optim signals, MoE max_vio + routing_confidence, and
  the perf/{throughput,mfu,peak_memory} + time/* breakdown.

TOC updated to point at #important-metrics (RL) and
#important-metrics-1 (SFT) — Mintlify de-duplicates with the same
-N suffix scheme it already uses for the two Launch subsections.

Co-authored-by: Cursor <cursoragent@cursor.com>
Not ready for end-user docs yet. Both knobs (trainer.metrics_server.port,
trainer.heartbeat.url) are still in reference.md for anyone who needs
them, just no narrative coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Use "task" not "prompt" for the conceptual unit ingested by the
  orchestrator (rows for batch_size and group_size, plus the matching
  Rules of thumb wording).
- group_size: drop the "Used for advantage normalization and pass@k
  estimation" trailer; the row name is enough.
- max_off_policy_steps: rename "throughput-vs-noise dial" to just
  "off-policy dial".
- eval row: drop the "Scores land in trainer logs and W&B as
  eval/{env}/{avg@k,pass@k}" + `prime eval` trailer; keep the row
  scoped to what the knob does.
- Add log.level to the Monitoring table (trainer/orchestrator process
  log level, $PRIME_LOG_LEVEL fallback, per-process or global on rl).
- Drop --ckpt from Run management; checkpointing has its own section.

Co-authored-by: Cursor <cursoragent@cursor.com>
Drop the throughput/MFU/async-lag/KV-cache rows from the RL trainer
Performance table — they're either generic perf metrics already
covered by the SFT trainer's perf table or vLLM-internal. The two
remaining rows (time/wait_for_batch, time/wait_for_ckpt) are the
useful diagnostic — they tell you which side is the bottleneck.

Co-authored-by: Cursor <cursoragent@cursor.com>
The old wording recommended only patched checkpoints / a custom
chat template as the fix for position-dependent templates. The
renderer-based path landed in SFT (use_renderer flag) and is now
the primary recommended fix, but it's still off by default
(use_renderer: bool = False on SFTConfig), so the patched-
checkpoint path also still works — and is what
examples/reverse_text/sft.toml uses today via PrimeIntellect/
Qwen3-0.6B.

Rewrite the paragraph to cover both fixes:
- Renderer path: set use_renderer = true; lists the hand-coded
  renderers and calls out that VLMs are unsupported.
- Patched template path: the prime-rl-patched checkpoint or a
  user-supplied template that preserves thinking.

Cross-link both to Algorithms § Renderers and § Multi-turn
trajectories.

Co-authored-by: Cursor <cursoragent@cursor.com>
- data.type = "sft" is the discriminated-union default for
  SFTConfig.data, so users don't need to spell it out.
- The dataset path field is data.name, not data.path. Confirmed
  against SFTDataConfig.name in packages/prime-rl-configs/src/
  prime_rl/configs/sft.py.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
mikasenghaas and others added 25 commits May 25, 2026 23:06
Renderers existed as a top-level section because they were added in
a follow-up, but conceptually they're the mechanism that makes
best-effort interleaving safe — splitting them apart forced a
forward-reference and duplicated the "exact-prefix invariant" framing.

- Demote "## Renderers" to "### Renderers" inside "## Multi-turn
  trajectories", placed between Best-effort interleaving and
  Discontinuous trajectories.
- Move the Qwen3 thinking-stripping example from Best-effort
  interleaving into Renderers — it's the failure case the renderer
  fixes, so it reads better adjacent to the family list / config
  block.
- Drop the "Workaround: use a chat template that preserves thinking"
  trailer; the patched-checkpoint workaround for SFT is already
  documented in training.md and isn't relevant in the orchestrator
  context (use_renderer defaults to true).
- Open Multi-turn trajectories with a forward-link to Renderers so
  the reader knows the safety mechanism is coming.
- TOC updated; both #multi-turn-trajectories and #renderers anchors
  preserved (only their nesting changes), so the existing cross-
  links from training.md and faqs.md keep working.

Co-authored-by: Cursor <cursoragent@cursor.com>
…step

max_async_level is being deprecated as a user-facing knob (hardcoded
to 1; matching reframe in docs/async.md on feat/deprecate-max-async-
level). Update algorithms.md so the long-form treatment matches.

- Drop the "### Tuning max_async_level" subsection (the k=0/1/2/>=3
  table) and the NCCL-needs-max_async_level=1 line that follows it
  — both become vacuous when k is fixed at 1.
- Reword the Async / off-policy training intro to describe the
  one-step overlap directly instead of "up to k steps where
  k = max_async_level".
- Step semantics: rho_inference is now pi_{max(0, n-1)}, with prose
  "inference is exactly one step behind the trainer" replacing the
  generic "gap is at most k steps".
- Drop the Tuning entry from the page TOC.
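
The step semantics above (rollouts for trainer step n come from pi_{max(0, n-1)}) reduce to a one-line rule. The helper below is a hypothetical illustration of that rule only; it does not reflect how the orchestrator actually tracks policy versions.

```python
def inference_policy_step(trainer_step: int) -> int:
    """Policy version that generated the rollouts consumed at `trainer_step`."""
    if trainer_step < 0:
        raise ValueError("trainer_step must be non-negative")
    # Inference is exactly one step behind, except at step 0 where the
    # initial policy is the only one available.
    return max(0, trainer_step - 1)
```

So steps 0 and 1 both train on rollouts from the initial policy, and from then on step n trains on rollouts from policy n - 1.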

Other docs (overview.md intro, training.md rule of thumb, faqs.md
two entries, scaling.md NCCL-example comment) still mention
max_async_level and are now stale; will clean those up in the next
turn unless flagged otherwise.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pull in the testing- and contributor-workflow content from README and
the GitHub Actions configs so contributors don't have to dig through
.github/workflows/ to figure out what runs where.

New sections in development.md:

- Test suite
  - Layout: tests/unit/, tests/integration/, tests/nightly/ — what
    each tier is for, with the actual file names contributors will
    encounter.
  - Running tests locally: pytest one-liners (everything, unit-only,
    integration-only, -m "not gpu", -m gpu, single file).
  - CI workflows: a 3-row table covering cpu_tests.yaml,
    gpu_tests.yaml (matrix list pulled verbatim from the workflow),
    and nightly_tests.yaml — including the trigger conditions
    (cpu = always, gpu = non-draft, nightly = scheduled +
    workflow_dispatch) and the runners (ubuntu-latest, vm/4xa6000,
    research-cluster).
  - Markers: gpu + slow, both declared with --strict-markers in
    pyproject.toml.
- Pre-commit hooks
  - uv run pre-commit install
  - Currently configured hooks: ruff check/format and
    docs-reference (regenerator that fails the commit if
    docs/reference.md would drift).

Plus update the Development pitch in overview.md "Where to go next"
and README.md docs index to mention the new scope.

Co-authored-by: Cursor <cursoragent@cursor.com>
Length penalties are configured under [orchestrator.length_penalty]
and layer on top of *any* advantage function — they're conceptually
not a standalone advantage variant. Move them into Default advantage
where readers will see the option while reading about advantages.

Drive-by: clean up the parenthetical hint in Default advantage. The
old version listed "length penalties tied to turn count" as a reason
to write a custom advantage, which conflicts with the actual
turn-count length penalty being a built-in. New wording points
length-penalty users at the now-adjacent built-in section, and
points custom-advantage users at trajectory-metadata-driven shaping
(sub-agents, relative-rank, …).

TOC entry for #length-penalties dropped; no external doc links to
that anchor.

Co-authored-by: Cursor <cursoragent@cursor.com>
Covers the two complementary buffer-side mechanisms for keeping the
trainer batch high-signal. Verified each claim against
src/prime_rl/orchestrator/buffer.py and packages/prime-rl-configs/
src/prime_rl/configs/orchestrator.py.

- Difficulty pools (buffer.easy_threshold / hard_threshold +
  easy_fraction / hard_fraction): per-problem running-average reward
  is compared to the thresholds; problems hitting either bound move
  to easy/hard pool and stop being sampled. Pool assignments persist
  across checkpoints (easy_examples.jsonl / hard_examples.jsonl);
  *_fraction lifts a fraction of pooled problems back into normal on
  resume / start.
- Online difficulty filtering (buffer.online_difficulty_filtering
  bool): groups whose avg reward is exactly 0.0 or 1.0 are dropped
  from the buffer because their within-group advantage is zero
  (DR-GRPO produces no signal). Counted under filtered_rollouts/
  {env}/{easy,hard} for visibility.
- The tradeoff, which the user asked for explicitly: with ODF on, each
  trainer step's effective batch is dense and predictable, but the
  orchestrator pays for the throw-away rollouts and may need a
  higher oversampling_factor; if time/wait_for_batch is already
  high on the trainer, ODF can starve the loop.
- ODF is orthogonal to the pools — ODF reacts to the current
  group's reward distribution, pools track the running per-problem
  average. Configs often use both.
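
The two mechanisms above can be sketched side by side to show why they are orthogonal: one looks at a problem's running-average reward, the other at a single group's rewards. This is a simplified illustration, not the code in src/prime_rl/orchestrator/buffer.py; `assign_pool` and `keep_group` are hypothetical names, and the inclusive comparisons against the thresholds are an assumption.

```python
def assign_pool(avg_reward: float, easy_threshold: float, hard_threshold: float) -> str:
    """Difficulty pools: classify a problem by its running-average reward."""
    if avg_reward >= easy_threshold:
        return "easy"  # pooled, no longer sampled
    if avg_reward <= hard_threshold:
        return "hard"  # pooled, no longer sampled
    return "normal"


def keep_group(rewards: list[float]) -> bool:
    """Online difficulty filtering: drop groups whose mean reward is exactly 0.0 or 1.0."""
    avg = sum(rewards) / len(rewards)
    # Such groups have zero within-group advantage, so they carry no signal.
    return avg not in (0.0, 1.0)
```

A problem with a running average of 0.5 stays in the normal pool, yet an individual group for it can still come back all-correct and be dropped by `keep_group`; that is the sense in which configs often use both.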

Plus a one-line page-intro update + TOC entries.

Co-authored-by: Cursor <cursoragent@cursor.com>
Both are buffer-side controls over what reaches the trainer — same
conceptual category as Filters and the loss/advantage knobs.
Algorithms is the right home; Advanced is for orthogonal feature
sets (LoRA, multi-tenant, custom modeling, multimodal). Promote each
to its own top-level section.

- algorithms.md: new "## Difficulty pools" and "## Online difficulty
  filtering" sections, inserted between Filters and Multi-turn
  trajectories. TOC updated.
- advanced.md: section + TOC entries + page-intro mention removed.

Content unchanged from the previous commit on Advanced — same
threshold / fraction / persistence claims, same ODF tradeoff
explanation, same cross-link to oversampling_factor. ODF section
back-links to Difficulty pools for the orthogonality note.

Co-authored-by: Cursor <cursoragent@cursor.com>
…itecture"

- "Adding a new architecture" promoted to its own top-level section,
  placed *before* the renamed Debugging MoE recipe so the natural
  reading order is "here's how to wire up a new arch -> here's how
  to smoke-test it". Step 3 of the recipe now forward-links to
  Debugging MoE instead of "the three steps above".
- "Testing MoE at small scale" -> "Debugging MoE". The page is
  already titled Development, so the qualifier was redundant.
- Subsection titles drop the "Step N:" prefixes (the page-level TOC
  already implies sequence) and switch to sentence case for
  consistency with the rest of the docs:
  - Step 1: build and verify a mini model -> Build and verify a mini model
  - Step 2: SFT warmup -> SFT warmup (already correct)
  - Step 3: RL on reverse-text -> RL on reverse-text (already correct)
- TOC updated; README docs index pitch swaps "small-scale MoE
  testing" -> "debugging MoE" to match the new section title.

Co-authored-by: Cursor <cursoragent@cursor.com>
- "### Build and verify a mini model" -> "### Create mini model".
  The roundtrip-verify body covers the "and verify" half; the
  shorter title is enough.
- Merge "### SFT warmup" + "### RL on reverse-text" into one
  "### Smoketest training" subsection. The two were always run
  together as a single end-to-end smoke test (warmup so KL is
  meaningful, then RL stack), so a single subsection with two code
  blocks reads better than two artificially-split ones.
- TOC updated. No external doc links to the old sub-anchors.

Co-authored-by: Cursor <cursoragent@cursor.com>
User-facing docs no longer reference configs/ directly — examples/
is the only "we keep this up to date" surface, the rest is CI- and
debug-internal:

- configuration.md: launch line + worked-example switched to
  examples/reverse_text/rl.toml. Section renamed "Worked example"
  -> "Examples" with a curated tour of the 10 README examples
  (basic 1-8 GPU + advanced SLURM tiers); the compose / override /
  dry-run walkthrough lives as a "### Worked example" subsection.
- training.md: drop the "Debug configs for all variants ship under
  configs/debug/training_modes/" pointer in the Training modes
  section. The prose already explains how to set the mode.
- scaling.md: P/D inference now points at examples/glm5_pd_disag/
  rl.toml (with a link to its README) instead of the configs/-side
  inference-only TOML.
- faqs.md: install-verify and smoke-test recipes both switch from
  configs/debug/sft and configs/gsm8k to examples/reverse_text.

Reference generator (scripts/generate_docs_reference.py):
- Drop "from the Pydantic config models" from the page header.
- Move the regenerate command + structural notes from the header to
  a new "## About this page" footer.
- Wrap the Type column in code spans so list[int], int | None, etc.
  render as code instead of plain text. fmt_type now emits literal
  `|` (GFM accepts pipes inside code spans inside table cells; no
  escaping needed).
- Walk list-of-BaseModel fields. Previously orchestrator.train.env
  / orchestrator.eval.env / orchestrator.filters were rendered as
  one row showing the default repr; their leaf fields never showed
  up. New _list_inner_models() detects both list[X] (single model)
  and list[Annotated[Union[A, B], discriminator]] (discriminated
  union of list items, e.g. filters: list[FilterConfig]). Index
  placeholder rendered as <n> to match the CLI form
  (--orchestrator.train.env.0.id ...).

Regenerate reference.md: +5k chars, mostly the new env/filter
list-item subsections that were missing before.

development.md still references two CI-tested configs/ paths
(configs/debug/moe/sft/train.toml, configs/ci/integration/
reverse_text_moe/start.toml) — those are validated by the
reverse_text_moe GPU integration test on every PR, so they don't
risk drifting. Flagging in case the user wants those swapped too.

Co-authored-by: Cursor <cursoragent@cursor.com>
128-512 is the range for quick ablations, not production. Production
RL often runs at 1024+.

Co-authored-by: Cursor <cursoragent@cursor.com>
P/D disaggregation is a feature you opt into for large-MoE serving,
not a step on the single-GPU -> 1000-GPU scaling ladder. It pairs
naturally with Custom modeling / multi-tenant / multimodal as a
specialized inference topology, so Advanced is the right home.

- scaling.md: drop the section + TOC entry + "disaggregated
  prefill/decode inference" from the page intro. Page intro now
  forward-links to Advanced for users who came in for P/D.
- advanced.md: append the section after Multi-tenant training,
  unchanged content (P:D ratio table, glm5_pd_disag example link,
  queue-depth monitoring snippet, UCX 1.19 build-from-source note).
  TOC + page-intro list updated.
- overview.md "Where to go next": drop disagg from the Scaling
  bullet, add to the Advanced bullet.

Anchor preserved (#disaggregated-prefilldecode-inference) — no
external doc links to it survived the move check.

Co-authored-by: Cursor <cursoragent@cursor.com>
The previous claim "trainer and inference server can share a GPU"
via the rl launcher was wrong. Verified against
src/prime_rl/entrypoints/rl.py:86-99: the launcher partitions
visible GPUs strictly (inference 0..N-1, trainer N..N+M-1) and
raises ValueError when total_requested_gpus > len(physical_gpu_ids).
Setting CUDA_VISIBLE_DEVICES=0 + --num-infer-gpus 1 +
--num-train-gpus 1 makes total=2, visible=1, so validation fails
before anything launches.
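
The partitioning behavior described above can be sketched like this. It is a simplified reimplementation for illustration, not the code in rl.py: the function name `partition_gpus` and the error message are assumptions, while `total_requested_gpus` and `physical_gpu_ids` follow the names quoted above.

```python
def partition_gpus(physical_gpu_ids, num_infer_gpus, num_train_gpus):
    """Split visible GPUs strictly: inference first, trainer after, no sharing."""
    total_requested_gpus = num_infer_gpus + num_train_gpus
    if total_requested_gpus > len(physical_gpu_ids):
        # Hypothetical message; the real launcher's wording may differ.
        raise ValueError(
            f"requested {total_requested_gpus} GPUs, "
            f"but only {len(physical_gpu_ids)} are visible"
        )
    infer_gpu_ids = physical_gpu_ids[:num_infer_gpus]
    train_gpu_ids = physical_gpu_ids[num_infer_gpus:total_requested_gpus]
    return infer_gpu_ids, train_gpu_ids
```

Under this model, one visible GPU with one inference GPU and one trainer GPU requested raises before any process starts, which matches the failure described above.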

What actually works for single-GPU RL is the manual three-pane
launch: each of uv run {inference,orchestrator,trainer} is an
independent process with no cross-process GPU validation, so
pinning CUDA_VISIBLE_DEVICES=0 on inference *and* trainer lets
them share the same physical GPU.

- Drop the misleading `uv run rl` recipe.
- Promote the manual three-pane recipe to the canonical single-GPU
  path, with CUDA_VISIBLE_DEVICES=0 spelled out on both the
  inference and trainer panes.
- Lead with SFT (where single-GPU is the default and just works).
- Add an explicit "single-GPU RL is for debugging only" caveat.

Co-authored-by: Cursor <cursoragent@cursor.com>
…to one umbrella

- New umbrella section "## Single-node vs. multi-node deployment"
  framing the [deployment] discriminated union as the user-facing
  knob: single_node runs locally; multi_node currently goes through
  SLURM. Subsections nest beneath:
  - ### Single GPU (unchanged content from "## Single GPU")
  - ### Single-node multi-GPU (unchanged from "## Single-node multi-GPU")
    - #### RL placement
    - #### SFT and torchrun
  - ### Multi-node (new short pointer to ## SLURM with two cross-
    links to the existing RL / SFT-and-inference examples)
- The umbrella section opens with a callout that manual multi-node
  launches are technically possible but reimplement what the SLURM
  launcher does — the user's preferred framing.
- Drop "## Choosing a layout" entirely (the new umbrella section
  conveys the same routing more naturally + the layout table was
  going stale).
- Drop "## Multi-node (manual)" entirely (RL training, SFT training,
  Multi-node inference subsections all gone). Anyone who needs the
  manual recipe can replicate what the SLURM templates do.

Cross-link fixes:

- training.md SFT § Launch line previously pointed at
  scaling.md#sft-training (under "## Multi-node (manual)"). Now
  points at scaling.md#sft-and-torchrun for non-default single-node
  layouts and scaling.md#slurm for multi-node.
- faqs.md "Multi-node without SLURM or K8s?" answer updated: from
  "yes, see [Scaling § Multi-node (manual)]" to "not currently
  documented; technically possible but reimplements the SLURM
  launcher".

Page intro adjusted to match the new structure ("multi-node SLURM
and Kubernetes deployments" -> "single-node and multi-node
deployments").

Co-authored-by: Cursor <cursoragent@cursor.com>
The K8s Helm chart at k8s/prime-rl/ still ships, but the user-facing
docs are dropping coverage until the chart and the matching guide
are re-validated together.

- scaling.md: drop the "## Kubernetes" section + TOC entry. The page
  intro was already handled (reworded earlier in the restructure).
- overview.md "Where to go next": drop "Kubernetes guides" from
  the Scaling bullet.
- README.md docs index: drop "Kubernetes" from the Scaling bullet.

The two passing-mention "k8s" / "Kubernetes" lines in README
(Overview features list, Advanced Training Examples adaptability
note) are left as-is — they describe codebase capability, not docs
coverage. Reference.md still mentions Kubernetes liveness probes in
an auto-generated field docstring; that's source-side, out of scope
for this pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Drop the example test filenames from the tests/integration/
  bullet — they're a moving list and not the point.
- Reframe the tests/nightly/ bullet around what it does (runs the
  examples/ configs to catch regressions) instead of listing the
  individual nightly tests by name.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace the bare 'To add (e.g.) Kimi 2.5:' opener with a one-line
  framing of the two-step contract: implement modeling code,
  register a mini preset for smoke-testing.
- Bold the leading verb on each numbered step so the structure
  reads as a checklist.
- Step 1 now points readers to glm4_moe/ and qwen3_moe/ as
  templates for the modeling code.
- Step 2 explains *what* the preset is for ('build a ~0.5B test
  model in your architecture') rather than just listing fields.
  Path now links to scripts/mini_moe.py.
- Step 3 says what the smoke-test actually exercises (roundtrip +
  SFT + RL stack) so users know what 'smoke-test' means here.

Co-authored-by: Cursor <cursoragent@cursor.com>
CodeQL alert actions/missing-workflow-permissions
(security/code-scanning/19) flagged the new workflow for relying on
the repo's default GITHUB_TOKEN permissions.

The workflow only checks out code (contents: read), syncs deps via
uv, runs the doc generator, and runs git diff. None of that needs
write scope on any resource type. Pin to contents: read at the
workflow level — explicit minimum that satisfies the rule.
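
The drift check itself needs only read access, as a small sketch makes clear. This uses difflib as a stand-in for the `git diff` step the workflow runs; the function name `reference_drift` and the file labels are assumptions for illustration.

```python
import difflib


def reference_drift(committed: str, regenerated: str) -> list:
    """Return unified-diff lines; an empty list means the docs are in sync."""
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="reference.md (committed)",
            tofile="reference.md (regenerated)",
            lineterm="",
        )
    )
```

A CI guard in this style would fail the build whenever the returned list is non-empty, without ever writing to the repository.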

Co-authored-by: Cursor <cursoragent@cursor.com>
- assets/architecture.png: replace with the new diagram (trainer
  + orchestrator + inference deployment, GPU layout per process,
  data + scheduling + weight-broadcast arrows). Was 511k at 96dpi,
  now 111k at 200dpi, rendered from architecture.pdf.
- assets/two-step-off-policy.png removed; assets/async-pipeline.png
  replaces it with the cleaner one-step-overlap diagram (trainer
  steps g_0..g_n above, inference samples with theta_{n-1} below).
  algorithms.md image reference + alt text updated to match the
  post-deprecation "one-step overlap" framing.
- assets/rollout-timeline.png added but not yet referenced. It's a
  continuous-time view showing rollouts spanning policy boundaries
  (policies pi_{i-2}, pi_{i-1}, pi_i on the x-axis with rollout
  bars crossing the boundaries) — that's the picture behind
  max_off_policy_steps, not max_async_level. Want me to drop it
  into algorithms.md (e.g. above the off-policy / max_off_policy
  discussion) or save it for later?

Co-authored-by: Cursor <cursoragent@cursor.com>
User-facing prose mentions of verifiers / renderers / research-
environments / pydantic-config now consistently render as code-spans
with a github link to the package. Affected lines:

- overview.md: orchestrator-bullet [verifiers](url) (dropped the
  bare-text variant), install paragraph (linked all three submodules
  individually), quick-run paragraph.
- algorithms.md: "Since [`verifiers` v0.1.8]..." (added backticks
  around the package name in the release-tag link).
- training.md: SFT "verifiers submodule" launch line, useful-knobs
  vf_level row, tool-defs paragraph.

Intentionally left alone:
- algorithms.md `verifiers.RolloutOutput` (dotted-path code ref;
  whole expression already in code).
- algorithms.md / training.md [renderers](#renderers) / [Renderers]
  (#renderers) (internal anchors to the in-page section, more useful
  than the github repo for a reader inside the doc).
- algorithms.md "Hand-coded renderers ship for ..." and "the
  renderers writeup on the PI blog" — generic prose, not a package
  mention.
- reference.md docstring-sourced mentions ("verifiers package",
  "renderers package", etc.) — those come from Pydantic field
  docstrings; would need source-side edits + regen.

Co-authored-by: Cursor <cursoragent@cursor.com>
Out of date with the max_async_level deprecation and duplicates
content already covered properly in algorithms.md (one-step overlap
framing + the AIPO/KL loss math). Architecture bullets above
already mention async semantics where relevant.

Co-authored-by: Cursor <cursoragent@cursor.com>
Was four separate code blocks separated by one-line prose; now a
single bash block with inline-comment annotations for each variant.
Reads cleaner and matches the test-suite block in development.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
…e-test)

Self-review found a handful of small inconsistencies and stale claims
that survived earlier passes.

Stale max_async_level references (now hardcoded to 1, so any prose
that treats it as tunable is wrong):

- faqs.md: "If growing, drop max_async_level or LR" -> drop
  max_off_policy_steps. NCCL FAQ no longer says "requires
  max_async_level=1" — that constraint is vacuous.
- training.md Rules of thumb: NCCL/max_async_level=1 dropped from
  the dry-run validator example; CP / flash-attention example
  stays.
- scaling.md NCCL example: drop the trailing
  "# synchronous; max_async_level forced to 1" comment.

Other accuracy / consistency fixes:

- faqs.md hardware FAQ: drop "you can co-locate both on a single
  GPU" — verified earlier that the rl launcher rejects this; the
  manual three-pane recipe is the actual single-GPU path. Cross-
  link to Scaling § Single GPU.
- faqs.md max_off_policy_steps wording: "throughput-vs-noise knob"
  -> "off-policy dial" to match algorithms.md / training.md.
- training.md sft.data.loss_mask cross-link: anchor was
  reference.md#sft-data (the discriminated-union heading); the
  loss_mask sub-table actually lives at #sft-data-sft-loss-mask.
- development.md "Smoketest training" -> "Smoke-test training" in
  the section title and TOC anchor, matching the hyphenated verb
  form ("smoke-test the new architecture") used in the page body.

Co-authored-by: Cursor <cursoragent@cursor.com>
Four asks rolled in:

1. **Drop docs/faqs.md entirely.** Removed the file plus all its
   cross-references: docs/mint.json nav, README.md docs index, and
   docs/overview.md "Documentation" list. The standalone Q&A page
   wasn't pulling its weight against the verifiers tone reference.

2. **Title Case all headings** across the user-facing pages so the
   visual style matches deps/verifiers/docs/* (which uses
   "Hosted Training" / "Performance Trade-offs" etc., not sentence
   case). Anchors are slug-based (lowercase + hyphen), so internal
   #links survive the case flip — only the link text in cross-doc
   "§ Section" references needed updating (training.md → Algorithms
   § Multi-Turn Trajectories, training.md → Scaling § SFT and
   Torchrun, scaling.md → Configuration § TOML Composition,
   scaling.md → SLURM § RL Example / SFT and Inference Examples).

3. **Rename overview.md "Where to go next" → "Documentation"** for
   symmetry with the verifiers landing page.

4. **Smell fixes**:
   a. algorithms.md had two near-identically-named subsections ("###
      The default loss" under Async with the loss math, and "###
      Default loss" under Loss with the mode dispatch). Collapse:
      the loss math now lives under "## Loss > ### Default Loss"
      together with the rl/opd/sft mode-dispatch bullets; the Async
      section is just intro + step semantics. One source of truth.
   b. reference.md auto-generated docstring mentions of "verifiers
      package" / "renderers package" / "renderers library" /
      "renderers.parsers" / "Registered verifiers environment ID"
      now render as [`pkg`](github-url) links. Source-side edits in
      packages/prime-rl-configs/src/prime_rl/configs/{shared,
      orchestrator,sft}.py, then regenerated reference.md.

Net: faqs.md deletion (-198) dominates; everything else is small
churn for case + anchor consistency.
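
Why the case flip leaves anchors intact can be seen from a simplified slug function. This is an approximation of GitHub-style heading slugs, not the exact rule used by the docs renderer; `slugify` is a hypothetical helper.

```python
import re


def slugify(heading: str) -> str:
    """Approximate a heading anchor: lowercase, drop punctuation, hyphenate spaces."""
    text = heading.strip().lower()
    # Keep word characters, hyphens, and spaces; drop everything else.
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")
```

Because the heading is lowercased first, "SFT and torchrun" and "SFT and Torchrun" map to the same anchor, so only the visible link text in cross-doc references has to change.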

Co-authored-by: Cursor <cursoragent@cursor.com>
@mikasenghaas mikasenghaas changed the title docs: rewrite into 8 task-oriented pages with auto-generated reference docs: big revamp May 26, 2026
mikasenghaas and others added 3 commits May 26, 2026 02:03
Restructure + accuracy fixes that were piled up locally during the
SSH-signing block earlier today.

- configuration.md: note that reusing an env id requires a unique
  name; drop the "See each environment's README on the Hub" tail.
- development.md: collapse "Adding a New Architecture" + "Debugging
  MoE" (with its Create Mini Model + Smoke-Test Training subsections)
  into a single "## Adding a New Model" with three subsections —
  Implement the Modeling Code, Register a Mini Preset, Run the
  Smoke Test. Page intro updated.
- scaling.md: drop "## Single GPU" subsection; rename "## Single-Node
  Multi-GPU" -> "## Single-Node" so "Single-Node vs. Multi-Node
  Deployment" has just two clean children. Drop the manual-multi-
  node prose. Drop the [slurm] field reference table; rename the
  section to "## [deployment] Block" and link to reference.md.
  Collapse the two example subsections into one "## Examples" with
  pointers to examples/multinode/{rl,sft}.toml. Rename "## CPU
  Optimizer Offload" -> "## Optimizer Offloading". Add new "## LM
  Head Chunking" subsection covering the fused_lm_head_token_chunk_
  size knob. Drop torchrun phrasing in the SFT scaling subsection,
  the CP recommendation paragraph, "(offloads checkpoints to CPU)"
  parenthetical, "RL gradient-accumulation amortization" line,
  attn = "flash_attention_2" from the memory-tight recipe, dry-run
  invaluable line.
- algorithms.md: drop the Step Semantics subsection heading (body
  stays); replace the AIPO link with a plain "DPPO + KL similar to
  Kimi-K2.5" framing.
- advanced.md: drop the "## Custom vs HF Implementations"
  subsection heading; the body becomes the lead content under
  "## Custom Modeling".
- overview.md + README.md: pitch updates ("adding a new
  architecture" -> "adding a new model"; FAQs row already gone).

Co-authored-by: Cursor <cursoragent@cursor.com>
Removes everything in the reference-generation pipeline:

- docs/reference.md — the 197k-char auto-generated field reference.
- scripts/generate_docs_reference.py — the generator that walks
  Pydantic config trees, discriminated unions, list-of-models, etc.
- .github/workflows/docs-reference.yaml — the CI guard that ran the
  generator and failed on git diff drift.
- .pre-commit-config.yaml — drops the local docs-reference hook
  that re-ran the generator on staged config-class edits.

Knock-on cleanup of cross-references:

- mint.json: drop the "reference" page from nav.
- overview.md / README.md: drop the Reference bullet from the
  Documentation list / docs index.
- configuration.md: drop the two trailing "See [Reference] ..."
  pointers (under Discriminated Unions and at the end of the
  Examples section).
- training.md: drop the "for the full field reference see
  [Reference]" tail in Useful Knobs; the loss_mask row no longer
  cross-links and just lists the four roles.
- scaling.md: drop the [Reference § trainer.model] DeepEP pointer
  and the [Reference § slurm] pointer in the [deployment] block.
- algorithms.md: drop the [Reference § orchestrator length
  penalties] line.
- development.md: simplify the Pre-Commit Hooks section — only the
  ruff hook remains, so the configured-hooks bullet list goes away.

Co-authored-by: Cursor <cursoragent@cursor.com>
@mikasenghaas mikasenghaas requested a review from samsja May 26, 2026 05:41