RLVR: Reinforcement Learning with Verifiable Rewards

A comprehensive framework for training general-purpose AI agents using programmatically verifiable rewards across 2,697 diverse domains. This project provides the taxonomy, verification infrastructure, dataset registry, and training protocol for scaling from current frontier models toward AGI-level breadth and depth.

The Core Idea

Current AI training hits a bottleneck: human feedback doesn't scale. RLVR eliminates this by using programmatic verification — if you can automatically check whether an answer is correct, you can generate unlimited training signal with zero human involvement.

DeepSeek-R1 showed this works for math and code. We extend it to every verifiable domain on the internet: science, engineering, games, language understanding, agent tasks, security, medicine, law, and more.

Three-Stage Theory of Intelligence

We organize training around a universal pattern of skill acquisition:

Stage	Phase	What It Develops	Domains
Stage 1	Rule Recognition	Pattern inference, structural reasoning, formal logic	~200 synthetic & formal domains
Stage 2	System Mastery	Language, code, math, science, vision, agency	~1,500 applied domains
Stage 3	Capability Climbing	Expert performance in specialized applications	~1,000 expert domains

This staging is motivated by research on Neural Cellular Automata (NCA) pre-training (Lee et al., 2025), which reported that purely synthetic data with zero linguistic content outperformed 10x as much Common Crawl data at preparing models for language, code, and math.

What's Included

13 Rust Verifiers (227 tests)

Production-quality verifiers that are deterministic, fast, and validated against real data:

Verifier	Type	Tests	Technique
`math_numerical`	Number extraction + comparison	26	Regex extraction, GSM8K `####` format, `\boxed{}`, tolerance
`math_equivalence`	Symbolic equivalence	18	LaTeX normalization, fraction→decimal, numerical evaluation
`exact_match`	Normalized string matching	27	Article removal, punctuation stripping, F1 scoring
`instruction_following`	Constraint satisfaction	21	15 constraint types (word count, format, include/exclude)
`json_schema`	Schema validation	20	Type checking, required fields, ranges, patterns, nesting
`code_execution`	Subprocess sandbox	16	Python execution, stdin/stdout, function-call harness, timeout
`sudoku`	Grid constraints	16	Row/column/box uniqueness, given-respect, partial credit
`chemical_equation`	Atom balance	15	Formula parsing with parentheses, coefficient handling
`regex_synthesis`	Compile + test	13	Full-string anchoring, positive/negative example checking
`date_time`	Chrono computation	13	Days-between, day-of-week, leap year, date arithmetic
`unit_conversion`	Lookup + compute	12	60+ unit pairs, temperature special cases
`sql_execution`	SQLite execution	11	In-memory DB, result set comparison, order-insensitive
`graph_properties`	Graph algorithms	10	Dijkstra, connected components, topological sort, coloring
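
As a rough illustration of the `math_numerical` row, the sketch below extracts a final answer from a GSM8K-style `####` marker, a `\boxed{}` expression, or the last number in the text, then compares it within a tolerance. The function names (`parse_number`, `extract_number`, `score_numeric`) and the relative-tolerance rule are assumptions made for this example, not the code in `src/verifiers/math_numerical.rs`, and it uses plain string scanning rather than regex to stay dependency-free.

```rust
/// Parse one candidate token as a finite number, ignoring `$`, thousands
/// separators, and trailing sentence punctuation.
fn parse_number(raw: &str) -> Option<f64> {
    let cleaned: String = raw
        .trim()
        .trim_end_matches(|c: char| c == '.' || c == ',')
        .chars()
        .filter(|c| *c != ',' && *c != '$')
        .collect();
    // Rust parses "inf" and "NaN" as floats, so reject non-finite values.
    cleaned.parse::<f64>().ok().filter(|v| v.is_finite())
}

/// Extract the final numeric answer from a model response.
fn extract_number(response: &str) -> Option<f64> {
    // 1. GSM8K convention: the first token after the last "####".
    if let Some(idx) = response.rfind("####") {
        return response[idx + 4..]
            .split_whitespace()
            .next()
            .and_then(parse_number);
    }
    // 2. LaTeX convention: contents of the last \boxed{...} (no nested braces).
    if let Some(idx) = response.rfind("\\boxed{") {
        let rest = &response[idx + 7..];
        return rest.find('}').and_then(|end| parse_number(&rest[..end]));
    }
    // 3. Fallback: the last whitespace-separated token that parses as a number.
    response.split_whitespace().rev().find_map(parse_number)
}

/// 1.0 if the extracted number is within a relative tolerance, else 0.0.
/// (Assumed rule: tolerance scales with max(|expected|, 1).)
fn score_numeric(response: &str, expected: f64, rel_tol: f64) -> f64 {
    match extract_number(response) {
        Some(got) if (got - expected).abs() <= rel_tol * expected.abs().max(1.0) => 1.0,
        _ => 0.0,
    }
}
```

Nested expressions such as `\boxed{\frac{1}{2}}` are deliberately out of scope here; in the table above that case belongs to `math_equivalence`.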

2,697 Domain Environments

Every domain has a concrete verification mechanism, dataset sources, and reconstruction notes. Domains are organized into the following major categories (domain counts are approximate):

Code & Software (~680 domains): Code generation/repair/translation in 50+ languages, SQL (window functions through DuckDB), every major algorithm (Dijkstra through FFT), design patterns, 54 Exercism tracks, 15 Rosetta Code tasks, SWE-bench variants, HumanEval/MBPP/APPS/CodeContests, DevOps configs (Terraform/K8s/Docker/Helm/ArgoCD/Jenkins/GitLab CI), web development (React/Vue components, OAuth, WebSockets), security (CSRF/XSS/input validation), data structures (BST through Bloom filters), all 10 Advent of Code years
Mathematics (~180 domains): Competition math (AIME/AMC/Putnam/IMO), formal proofs (Lean 4/Coq/Isabelle), symbolic algebra, every calculus topic (limits through surface integrals), number theory, combinatorics, probability, statistics (hypothesis testing through ANOVA), game theory (Nash equilibria), Markov chains, SDEs, 57 MMLU subjects
Logic & Formal Methods (~50 domains): SAT/SMT/ATP/ITP, modal/temporal logic, BDDs, Petri nets, automata construction, Datalog, resolution refutation, 8 formal grammar tasks (CFG through LALR)
Science & Engineering (~350 domains): Physics (projectile motion through Maxwell's equations), chemistry (all organic reaction types, TDC drug discovery, 11 MoleculeNet tasks), biology (protein fitness/stability/solubility, genomics, CAFA), materials science (band gap through defect formation), engineering (HVAC through lightning protection), astronomy (Kepler through eclipse prediction)
Language & Knowledge (~450 domains): GLUE/SuperGLUE/XTREME individual tasks, 152 BIG-Bench tasks, 50 FLORES translation pairs, 20 WMT language pairs, QA (DROP/CosmosQA/StrategyQA), NLI, NER, relation extraction, clinical NLP (15 n2c2/BioCreative tasks), retrieval (MS MARCO/BEIR/MTEB), financial QA (FinQA/TAT-QA), legal (CUAD/LexGLUE)
Games & Interactive (~500 domains): 80+ Atari games, 16 Procgen games, 49 OpenSpiel games, 49 Meta-World tasks, 28 DM Control tasks, 50 RLBench tasks, 16 PettingZoo environments, chess/Go/Shogi, 30+ logic puzzles (Sudoku through Masyu), board games (Mancala through Terraforming Mars), card games (Poker through Skat)
Agent & Tool Use (~100 domains): ALFRED/BEHAVIOR household tasks, WebArena/VisualWebArena, OSWorld, BabyAI levels, MiniHack levels, GAIA multi-tool, autonomous driving (nuPlan/Waymo), drone navigation, dexterous manipulation, D4RL offline RL tasks, Safety-Gymnasium constrained RL
Vision & Multimodal (~200 domains): MMMU/MM-Vet/SEED-Bench, ImageNet variants (V2/R/A/Sketch), COCO detection/panoptic, fine-grained recognition (birds/flowers/cars/aircraft), medical imaging (CheXpert/ISIC/retinal), 3D (ScanNet/ShapeNet/ModelNet), video understanding (Kinetics/Something-Something/DAVIS)
Audio & Speech (~80 domains): LibriSpeech/CommonVoice in 50 languages, SUPERB benchmark, MusicCaps, beat tracking, chord recognition, speaker identification, source separation (MUSDB18)
ML & Data Science (~100 domains): 25+ Kaggle competitions, M4/M5 forecasting, NAS-Bench, fairness/bias (BBQ/WinoBias/CrowS), adversarial robustness, calibration, 30+ UCI datasets, safety benchmarks (HarmBench/XSTest/ETHICS)
Expert & Professional (~150 domains): Medical (MedQA/PubMedQA/PathVQA/CheXpert), legal (CUAD/CaseHOLD/LexGLUE), financial (FinQA/options pricing/portfolio optimization), engineering calculations (pipe sizing through sprinkler design), sports scoring (14 sports), nutrition, construction, clinical trials
Miscellaneous (~50 domains): Geographic knowledge, unit conversions, calendar systems (Hebrew/Islamic/Mayan), encoding (Braille/Morse/semaphore), checksums, number theory curiosities (Collatz/happy numbers)

Dataset Registry (~6M problems + unlimited procedural generation)

Category	Datasets	Problems	Size
Math	GSM8K, MMLU	~24K	162MB
Code	HumanEval, MBPP, WikiSQL	~82K	25MB
QA	SQuAD, TriviaQA, HotpotQA, CommonsenseQA, COPA, WikiTQ	~265K	700MB
Fact Verification	FEVER, TabFact	~263K	31MB
NLI/Classification	SNLI, MultiNLI, ANLI, SST-2, AG News	~1.25M	331MB
Commonsense/Science	HellaSwag, PIQA, Winogrande, ARC, OpenBookQA, SciQ	~83K	666MB
Medical	MedQA (USMLE)	~10K	15MB
Logic/Games	Lichess puzzles, Sudoku	~4M	348MB
Instruction	IFEval	541	<1MB

Seven procedural generators produce unlimited training data at 10K-100K problems/second for: unit conversion, date/time, chemical equations, regex, JSON schema, instruction constraints, and graph problems.
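
To make the idea of procedural generation concrete, here is a minimal sketch of a unit-conversion generator that emits a prompt together with its ground-truth answer. Everything in it is assumed for illustration: the `Lcg` seeded generator, the `Problem` struct, the four-entry `LENGTH_UNITS` table, and the prompt wording. The project's real generators cover far more unit pairs plus temperature special cases.

```rust
/// Tiny deterministic PRNG (a 64-bit LCG) so a seed reproduces its problems.
struct Lcg(u64);

impl Lcg {
    /// Next pseudo-random value in 0..n.
    fn below(&mut self, n: u32) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as u32) % n
    }
}

/// A generated problem paired with its ground-truth answer.
struct Problem {
    prompt: String,
    answer: f64,
}

/// Length units and their size in metres (a small illustrative table).
const LENGTH_UNITS: [(&str, f64); 4] =
    [("m", 1.0), ("km", 1000.0), ("cm", 0.01), ("mi", 1609.344)];

/// Convert between two known units, or None if either unit is unknown.
fn convert(value: f64, from: &str, to: &str) -> Option<f64> {
    let factor = |unit: &str| {
        LENGTH_UNITS
            .iter()
            .find(|(u, _)| *u == unit)
            .map(|(_, f)| *f)
    };
    Some(value * factor(from)? / factor(to)?)
}

/// Generate one conversion problem between two distinct units.
fn generate(rng: &mut Lcg) -> Problem {
    let i = rng.below(4) as usize;
    let j = (i + 1 + rng.below(3) as usize) % 4; // always differs from i
    let value = f64::from(1 + rng.below(1000));
    let (from, to) = (LENGTH_UNITS[i].0, LENGTH_UNITS[j].0);
    Problem {
        prompt: format!("Convert {value} {from} to {to}."),
        answer: value * LENGTH_UNITS[i].1 / LENGTH_UNITS[j].1,
    }
}
```

Because the answer is computed at generation time from the same table the verifier would use, every generated problem is verifiable by construction, and a fixed seed makes a training batch reproducible.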

Architecture

                    ┌──────────────────┐
                    │  Policy (VLM)    │
                    └────────┬─────────┘
                             │ generates response
                    ┌────────▼─────────┐
                    │  GRPO Trainer    │
                    │  (Python/TRL)    │
                    └────────┬─────────┘
                             │ sends (task, response)
                    ┌────────▼─────────┐
                    │ Verifier Server  │
                    │ (Rust, HTTP)     │
                    │ 13 verifiers     │
                    │ 2,697 domains    │
                    └────────┬─────────┘
                             │ returns score ∈ [0,1]
                    ┌────────▼─────────┐
                    │ Curriculum Ctrl  │
                    │ difficulty∈[1,10]│
                    │ domain mixing    │
                    └──────────────────┘

The Rust verifier runs as a separate HTTP service, so any trainer that can issue HTTP requests can use it regardless of implementation language. This departs from existing RLVR setups (DeepSeek-R1, veRL, OpenRLHF, TRL), which typically use inline Python reward functions with regex-based extraction.
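
In the diagram above, the verifier's scores flow back to a GRPO trainer. GRPO samples a group of responses per task and turns each score into an advantage relative to the group's mean and standard deviation. The sketch below shows that normalization step only. It is written in Rust to match the rest of this repository even though the trainer itself is Python/TRL, and the function name `group_advantages`, the population standard deviation, and the `EPS` constant are assumptions made for the example.

```rust
/// Convert one group's verifier scores (each in [0, 1]) into
/// group-relative advantages: (score - mean) / (std + EPS).
fn group_advantages(scores: &[f64]) -> Vec<f64> {
    if scores.is_empty() {
        return Vec::new();
    }
    // Assumed small constant to avoid dividing by zero when all scores match.
    const EPS: f64 = 1e-6;

    let n = scores.len() as f64;
    let mean = scores.iter().sum::<f64>() / n;
    // Population variance; some implementations use the sample variance instead.
    let variance = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    let std = variance.sqrt();

    scores.iter().map(|s| (s - mean) / (std + EPS)).collect()
}
```

When every response in a group gets the same score, all advantages are zero and the group contributes no gradient signal, which is one reason a curriculum controller would want to keep task difficulty in a range where the policy sometimes succeeds and sometimes fails.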

Project Structure

src/
  main.rs                    # Entry point
  verifiers/
    mod.rs                   # VerifyResult type + Verifier trait
    math_numerical.rs        # 26 tests — numeric extraction & comparison
    math_equivalence.rs      # 18 tests — symbolic math equivalence
    exact_match.rs           # 27 tests — normalized string matching
    instruction_following.rs # 21 tests — constraint satisfaction
    json_schema.rs           # 20 tests — schema validation
    code_execution.rs        # 16 tests — sandboxed code execution
    sudoku.rs                # 16 tests — sudoku grid verification
    chemical_equation.rs     # 15 tests — chemical equation balancing
    regex_synthesis.rs       # 13 tests — regex synthesis verification
    date_time.rs             # 13 tests — date/time computation
    unit_conversion.rs       # 12 tests — physical unit conversion
    sql_execution.rs         # 11 tests — SQL query verification
    graph_properties.rs      # 10 tests — graph algorithm verification
  datasets/
    mod.rs                   # Dataset loading traits
    gsm8k.rs                 # GSM8K parser
    registry.rs              # Dataset registry
wiki/
  overview.md                # High-level thesis and taxonomy
  index.md                   # Master index of all 2,697 domains
  domains/                   # One page per domain (2,697 files)
  concepts/                  # Cross-cutting concepts
    verification-types.md    # Taxonomy of verification mechanisms
    reward-shaping.md        # Reward design principles
    dataset-scaling.md       # Scaling strategies
  synthesis/                 # Analysis pages
    intelligence-hierarchy.md # Three-stage theory deep dive
    harness-architecture.md  # Training harness design
    domain-matrix.md         # Feature matrix across domains
    scaling-roadmap.md       # Path to AGI
    dataset-sources.md       # Internet dataset catalog
    pretraining-sft-sources.md # Upstream pipeline data
    exhaustive-audit.md      # Domain coverage audit
raw/
  datasets/                  # 33 downloaded datasets (2.3GB)
  papers/                    # Source papers

Quick Start

# Run all verifier tests
cargo test

# Run a specific verifier's tests
cargo test math_numerical
cargo test sudoku

Verification Principles

Every verifier satisfies four properties:

Deterministic: Same input always produces the same score
Zero false positives: A score of 1.0 means the answer is definitively correct
Tested against real data: Every verifier has anti-hardcoding tests validated on real benchmark problems
Fast: Verification completes in milliseconds (except code execution, which is bounded by a subprocess timeout)
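
The first two properties can be illustrated with a Sudoku check. The sketch below is deterministic (a pure function of its inputs) and returns exactly 1.0 only when all 27 rows, columns, and boxes are valid and every given is respected; anything else gets partial credit strictly below 1.0. The names `Grid`, `unit_ok`, and `score_sudoku`, the use of 0 for empty cells, and the partial-credit formula are assumptions for this example rather than the contents of `src/verifiers/sudoku.rs`.

```rust
/// 9x9 grid; 0 marks an empty cell in the puzzle.
type Grid = [[u8; 9]; 9];

/// True if the nine cells contain each digit 1-9 exactly once.
fn unit_ok(cells: impl Iterator<Item = u8>) -> bool {
    let mut seen = [false; 10];
    let mut count = 0;
    for v in cells {
        if v == 0 || v > 9 || seen[v as usize] {
            return false;
        }
        seen[v as usize] = true;
        count += 1;
    }
    count == 9
}

/// Score a proposed solution against the original puzzle.
/// Returns 0.0 if any given is overwritten, otherwise the fraction of the
/// 27 units (rows, columns, boxes) that are valid.
fn score_sudoku(puzzle: &Grid, solution: &Grid) -> f64 {
    // Given-respect: the solution may not change any pre-filled cell.
    for r in 0..9 {
        for c in 0..9 {
            if puzzle[r][c] != 0 && puzzle[r][c] != solution[r][c] {
                return 0.0;
            }
        }
    }
    let mut valid_units = 0u32;
    for i in 0..9 {
        if unit_ok((0..9).map(|c| solution[i][c])) {
            valid_units += 1; // row i
        }
        if unit_ok((0..9).map(|r| solution[r][i])) {
            valid_units += 1; // column i
        }
        let (br, bc) = (3 * (i / 3), 3 * (i % 3));
        if unit_ok((0..9).map(|k| solution[br + k / 3][bc + k % 3])) {
            valid_units += 1; // box i
        }
    }
    f64::from(valid_units) / 27.0
}
```

Since the score equals 1.0 only when all 27 units pass, partial credit can never be mistaken for a correct solution.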

Verification Types

Type	Mechanism	Example Domains
Exact match	Output matches known answer	Math, QA, classification
Execution-based	Output is tested by running it	Code, SQL, regex
Simulation-based	Physics/logic simulator verifies	Circuit design, robotics
Constraint satisfaction	Hard constraints must be met	Sudoku, scheduling
Diff-based	Structural comparison to reference	AST, DOM, image similarity
Rule-based	Formal rules check correctness	Chess legality, grammar
Outcome-based	Downstream result determines correctness	Game win, task completion

Hypothetical or "LLM-as-judge" rewards are NOT used. Every domain has a concrete, implementable verification function.
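
One way to picture "a concrete, implementable verification function" is a shared trait that every verification type implements. The sketch below shows an exact-match verifier behind such a trait. The project structure lists a `VerifyResult` type and a `Verifier` trait in `src/verifiers/mod.rs`, but their actual fields and signatures are not shown in this README, so the definitions here are hypothetical, as are `ExactMatch` and `normalize`.

```rust
/// Hypothetical result type; the real definition may differ.
#[derive(Debug, PartialEq)]
struct VerifyResult {
    score: f64,
    reason: String,
}

/// Hypothetical trait shape: compare a response to a known answer.
trait Verifier {
    fn verify(&self, response: &str, expected: &str) -> VerifyResult;
}

/// Lowercase, strip punctuation, drop English articles, collapse whitespace.
fn normalize(text: &str) -> String {
    text.to_lowercase()
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect::<String>()
        .split_whitespace()
        .filter(|w| !matches!(*w, "a" | "an" | "the"))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Exact-match verification on normalized strings.
struct ExactMatch;

impl Verifier for ExactMatch {
    fn verify(&self, response: &str, expected: &str) -> VerifyResult {
        let (got, want) = (normalize(response), normalize(expected));
        if got == want {
            VerifyResult { score: 1.0, reason: "normalized match".to_string() }
        } else {
            VerifyResult {
                score: 0.0,
                reason: format!("expected \"{want}\", got \"{got}\""),
            }
        }
    }
}
```

With this shape, execution-based or constraint-based checkers would implement the same trait, and a server could hold them as `Box<dyn Verifier>` values keyed by domain.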

Related Work

DeepSeek-R1 (2025): RLVR on math + code → strong reasoning. We extend from 2-5 domains to 2,697.
NCA Pre-Pre-Training (Lee et al., 2025): Synthetic CA data develops transferable attention mechanisms. Our Stage 1 formalizes this.
AlphaProof / AlphaGeometry (DeepMind, 2024): RLVR for formal mathematics. We extend to the full reasoning spectrum.
veRL, OpenRLHF, TRL: Open-source RL frameworks with inline Python verifiers. Our Rust verifier server with 227 validated tests is a significant infrastructure improvement.

License

MIT
