Project-Prevail/TruthSeekingGym-Code-Public

TruthSeekingGym

A unified framework for evaluating and training language models on truth-seeking behavior.

Overview

TruthSeekingGym provides infrastructure for both evaluating how well models seek truth and training models to improve truth-seeking behavior. It offers a consistent interface across API models, local models, batch APIs, Claude Code agents, and even humans.

Core Abstractions

  • Policies — Unified interface for all model types: OpenAI (GPT-4.1, GPT-5, o3, o4-mini), Anthropic (Claude), Google (Gemini), local HuggingFace models, human input, and Claude Code agents
  • Domains — Problem sets with verifiable answers: research analysis, forecasting, debate evaluation (ChangeMyView), conceptual reasoning
  • Evaluation Paradigms — Multiple experimental setups for operationalizing "truth-seeking":
    • Ground-truth accuracy: Does the model reach correct conclusions?
    • Martingale property: Are belief updates unpredictable from prior beliefs? (predictable updates suggest bias)
    • Sycophantic reasoning: Does reasoning quality degrade when the user expresses an opinion?
    • Mutual predictability: Does knowing a model's answers on some questions help predict its answers on others?
    • World-in-the-loop: Are the model's claims useful for making accurate predictions about the world?
    • Qualitative judgment: Does reasoning exhibit originality, curiosity, and willingness to challenge assumptions?
  • Reasoning Modes — Generation strategies: direct inference, chain-of-thought, self-debate, bootstrap interview, length-controlled generation
  • Graders — Reward functions for training: Python-based (Brier scores) and LLM-based evaluation
  • Trainers — Training strategies: supervised fine-tuning (SFT), reinforcement learning (RL), few-shot in-context learning
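The martingale check above amounts to asking whether belief updates are predictable from prior beliefs. Below is a minimal sketch of that idea using a regression on synthetic data; the actual implementation lives in core/algo/martingale.py and is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
priors = rng.uniform(0.2, 0.8, size=500)  # synthetic prior beliefs

# Unbiased updater: belief changes are mean-zero noise, independent of the prior.
unbiased_updates = rng.normal(0.0, 0.05, size=500)
# Biased updater: systematically pulls extreme priors back toward 0.5,
# so the update is predictable from the prior.
biased_updates = 0.5 * (0.5 - priors) + rng.normal(0.0, 0.05, size=500)

def update_slope(priors, updates):
    """Regression slope of update on prior: near 0 for a martingale,
    clearly nonzero when updates are predictable (suggesting bias)."""
    slope, _intercept = np.polyfit(priors, updates, 1)
    return slope

print(f"unbiased slope: {update_slope(priors, unbiased_updates):.3f}")
print(f"biased slope:   {update_slope(priors, biased_updates):.3f}")
```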

Supported Policies

The following policies are supported via create_policy_from_string(). Pass the string in the "Policy String" column to create a policy.

| Policy String | Provider | Model Type | Notes |
|---|---|---|---|
| human | N/A | Special | CLI-based human input |
| claude-code | N/A | Special | Claude Code agent integration |
| HuggingFace model ID | HuggingFace/Local | LocalModel | e.g., Qwen/Qwen3-235B-A22B-Thinking-2507 |
| Path from data/models/ | Local | LocalModel | Relative path starting from data/models/ |
| gemini-embedding-001 | Google | Embedding | Requires USE_RAY=1 |
| Qwen/Qwen3-Embedding-8B | Local | Embedding | Local SGLang-based |
| Qwen/Qwen3-Embedding-4B | Local | Embedding | Local SGLang-based |
| Qwen/Qwen3-Embedding-0.6B | Local | Embedding | Local SGLang-based |
| gpt-4.1-nano | OpenAI | API | |
| gpt-4.1-mini | OpenAI | API | |
| gpt-4.1 | OpenAI | API | |
| gpt-5 | OpenAI | API | |
| gpt-5-mini | OpenAI | API | |
| gpt-5-nano | OpenAI | API | |
| gpt-o3 | OpenAI | API | Alias for o3 |
| o3 | OpenAI | API | |
| o3-2025-04-16 | OpenAI | API | |
| gpt-o4-mini | OpenAI | API | Alias for o4-mini |
| o4-mini | OpenAI | API | |
| o4-mini-2025-04-16 | OpenAI | API | |
| gpt-4o | OpenAI | API | |
| deepseek-v3 | Together/DeepSeek | API | |
| llama-4-scout | Together/Meta | API | |
| llama-4-maverick | Together/Meta | API | |
| claude-sonnet-4 | Anthropic | API | |
| claude-opus-4 | Anthropic | API | |
| claude-opus-4.1 | Anthropic | API | |
| claude-3-5-haiku | Anthropic | API | |
| deepseek-r1 | Together/DeepSeek | API | |
| gemma-3-27b-it | Together/Google | API | |
| gemma-3-12b-it | Together/Google | API | Via OpenRouter only |
| gemma-3-4b-it | Together/Google | API | Via OpenRouter only |
| gemma-2-27b-it | Together/Google | API | |
| gemma-3n-e4b-it | Together/Google | API | |
| llama-3-1-8b-instruct | Together/Meta | API | |
| qwen-3-235b-a22b-instruct | Together/Qwen | API | |
| qwen-3-235b-a22b-thinking | Together/Qwen | API | |
| qwen-3-235b-a22b | Together/Qwen | API | |
| qwen-3-32b | Together/Qwen | API | |
| qwen-3-14b | Together/Qwen | API | |
| qwen-3-14b-base | Together/Qwen | API | Direct provider only |
| qwen-3-8b | Together/Qwen | API | |
| qwen-3-8b-base | Together/Qwen | API | Direct provider only |
| qwen-2-5-7b | Together/Qwen | API | |
| mistral-small-3.1-24b-instruct | Together/Mistral | API | Via OpenRouter only |
| mistral-small-24b-instruct-2501 | Together/Mistral | API | Direct provider only |
| kimi-k2 | Together/Moonshot | API | |
| gemini-2.0-flash | Google | API | |
| gemini-2.5-flash | Google | API | Via OpenRouter only |
| gemini-2.5-pro | Google | API | |

Notes:

  • Some models are only available via OpenRouter (when USE_OPENROUTER=1) or direct provider access
  • LocalModel entries accept either HuggingFace-hosted model IDs or relative paths from data/models/ for locally saved models
  • Trained models saved in data/models/ are automatically detected and loaded

Adding New Models

To add support for a new model, edit utils/policy_utils.py and add an entry to the candidate_policies dictionary:

# For OpenRouter mode (USE_OPENROUTER=1):
candidate_policies = {
    # ... existing entries ...
    "your-model-name": ("provider/model-id", "your-model-name", "openrouter provider"),
    # Example: "gpt-4.1-mini": ("openai/gpt-4.1-mini", "gpt-4.1-mini", "openrouter openai"),
}

# For direct provider mode (USE_OPENROUTER=0):
candidate_policies = {
    # ... existing entries ...
    "your-model-name": ("exact-api-model-id", "your-model-name", "provider"),
    # Example: "gpt-4.1-mini": ("gpt-4.1-mini-2025-04-14", "gpt-4.1-mini", "openai"),
}

Each entry is a tuple of (api_model_id, colloquial_name, provider):

  • api_model_id: The exact model ID used in API calls
  • colloquial_name: The short name used for display and file naming
  • provider: The provider string (openai, anthropic, together, google, or openrouter <provider> for OpenRouter)

For local models, pass the HuggingFace model ID directly to create_policy_from_string(); no configuration is needed.
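As an illustration of how this string resolution could work, here is a hypothetical, simplified sketch; the actual logic in utils/policy_utils.py handles more cases (local paths, embedding models, Ray acceleration, and so on).

```python
# Hypothetical, simplified resolution logic. The two dictionary entries below
# mirror IDs that appear in this README; everything else is illustrative.
candidate_policies = {
    "gpt-4.1-mini": ("gpt-4.1-mini-2025-04-14", "gpt-4.1-mini", "openai"),
    "o4-mini": ("o4-mini-2025-04-16", "o4-mini", "openai"),
}

def resolve_policy(name: str) -> tuple[str, str, str]:
    """Return (api_model_id, colloquial_name, provider) for a policy string."""
    if name in candidate_policies:
        return candidate_policies[name]
    # Anything unrecognized is treated as a HuggingFace model ID for LocalModel.
    return (name, name.split("/")[-1], "local")

print(resolve_policy("o4-mini"))
print(resolve_policy("Qwen/Qwen3-8B"))
```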

Infrastructure

  • Ray parallelization for high-throughput workloads (100k+ tokens/second)
  • Batch APIs for 50% cost reduction (24-48 hour latency)
  • Multi-GPU training with automatic DeepSpeed ZeRO-2/3 detection
  • Full async support across inference and training

Installation and Usage

Prerequisites

  • Python 3.10+
  • Git LFS (for data submodule)

Installation

# Clone with submodules
git clone --recurse-submodules https://github.com/Project-Prevail/TruthSeekingGym-Code-Public.git
cd TruthSeekingGym-Code-Public

# Initialize data submodule (HuggingFace dataset)
cd data && git checkout main && cd ..

# Install dependencies (using uv recommended)
uv venv && uv pip install -e . -e lib/safety_tooling

# Or with pip
pip install -e . -e lib/safety_tooling

Git will automatically fetch the data from the HuggingFace repository.

API Keys Setup

Create lib/safety_tooling/.env with your API keys:

OPENROUTER_API_KEY=your_key
OPENAI_API_KEY=your_key
ANTHROPIC_API_KEY=your_key

Web GUI

cd web
./start.sh

This starts the web GUI, from which you can configure and launch evaluation and training runs.

Quick Start (CLI)

# Minimal example: evaluate one model on one domain
ALGO_NAMES=GroundTruthAccuracy \
DOMAIN_NAMES=Research \
REASONING_MODE_NAMES=DirectInference \
POLICY_LIST=gpt-4.1-mini \
USE_OPENROUTER=1 \
NUM_TRAJECTORIES=5 \
python -m scripts.run_reasoning
# Full example: evaluate, analyze, then train
# Evaluation
export ALGO_NAMES=MartingaleStrategy
export DOMAIN_NAMES=Research,Forecasting
export REASONING_MODE_NAMES=DirectInference,ChainOfThought
export POLICY_LIST_MODE=frontier # evaluate all frontier models
export USE_RAY=1 # use Ray for parallelization
export USE_OPENROUTER=1 # use OpenRouter for model routing (required for Gemini models)
python -m scripts.run_reasoning

# Analysis
export ANALYZERS=all 
export DIR_NAME=run-XXXXX  # Use your run directory
python -m scripts.run_analyzers

# Training
export SFT_USING_EVAL_RESULTS=1
export RL_USING_DOMAIN=1
export DIR_NAME=run-XXXXX  # For SFT/FEWSHOT: directory with trajectory scores
export DOMAIN_NAMES=Forecasting  # For RL: domains to train on
python -m scripts.run_trainers

The evaluation script produces reasoning-trajectories-raw.json and bias-eval-results-*.json in the run directory within the data/runs folder. The training script uses trajectory scores from analyzers or domains directly for RL training.

Each run directory is named run-{RUN_ID}. After a run finishes, you may rename the part after run- to something more descriptive, which changes the run ID. You can also set the run ID explicitly via the environment variable RUN_ID=xxx; if a directory with that run ID already exists, data from the previous run is loaded and analyzed without re-executing the reasoning.

You may use the DIR_NAME environment variable to set where run directories are stored, relative to the data/runs folder; this location may contain multiple runs, nested recursively.

Low-Level Abstractions

The framework integrates TianyiQ/LMPortal as the lower-level abstraction.

Example 1: Basic Flexible Inference

The infer() method is the recommended way to do inference - it accepts multiple input types and returns appropriate outputs:

from utils.policy_utils import create_policy_from_string

# Create a policy (automatically detects provider)
policy = create_policy_from_string("o4-mini")

# Simple string inference
response = policy.infer("What is the capital of France?")
print(response)  # Returns: str

# Or with history
response = policy.infer([
    {"role": "user", "content": "What is 2+2?"}
])
print(response)  # Returns: str

# Getting logprobs of held-out response
conversation_logprobs = policy.logprobs_single([
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
])
prompt_logprobs = policy.logprobs_single([
    {"role": "user", "content": "What is 2+2?"},
])
print(conversation_logprobs - prompt_logprobs)  # Returns: float

Example 2: Inference with Problems and Domains

The flexible infer() method can directly work with Problems and Domains:

from core.domain.forecasting import Forecasting
from utils.policy_utils import create_policy_from_string

policy = create_policy_from_string("o4-mini")
domain = Forecasting()

# Infer from a single problem
problem = domain.sample_problems(n=1)[0]
result = policy.infer(problem.to_sample())
print(f"Question: {result.history[0]['content']}")
print(f"Answer: {result.output}")  # Returns: SingleSample

# Infer directly from domain (samples 1 problem automatically)
result = policy.infer(domain)
print(result)  # Returns: SingleSample

Example 3: Batch Flexible Inference

The infer_many() method handles batch inference with flexible input types:

from utils.policy_utils import create_policy_from_string
from core.domain.conceptual import Conceptual

policy = create_policy_from_string("o4-mini")
domain = Conceptual()

# Batch inference from multiple problems
problems = domain.sample_problems(n=3)
samples = [p.to_sample() for p in problems]
results = policy.infer_many(samples)
for result in results:
    print(f"Q: {result.history[0]['content']}")
    print(f"A: {result.output}")
# Returns: list[SingleSample]

# Or directly from domain with count
results = policy.infer_many((domain, 5))  # Sample 5 problems from domain
print(f"Generated {len(results)} responses")  # Returns: list[SingleSample]

Example 4: Working with Domains

from core.domain.forecasting import Forecasting

# Load domain
domain = Forecasting()

# Sample problems
problems = domain.sample_problems(n=5, split="train")

for problem in problems:
    print(f"Q: {problem.question}")
    if hasattr(problem, "correct_option"):
        print(f"Answer: {problem.options[problem.correct_option]}")

    # Convert problem to Sample for inference
    sample = problem.to_sample()
    print(f"Sample history: {sample.history}")

Example 5: Human-AI Dialogue

Create interactive dialogues between human and AI policies:

from utils.policy_utils import create_policy_from_string

# Create policies
human = create_policy_from_string("human")
ai = create_policy_from_string("o4-mini")

# Start dialogue
history = []
for turn in range(3):
    # Human turn
    human_msg = human.infer_from_history(history)
    history.append({"role": "user", "content": human_msg})
    print(f"Human: {human_msg}")

    # AI turn
    ai_msg = ai.infer_from_history(history)
    history.append({"role": "assistant", "content": ai_msg})
    print(f"AI: {ai_msg}")

Example 6: Claude Code Agent Inference

Use Claude Code agents for complex reasoning tasks:

from utils.policy_utils import create_policy_from_string

# Create Claude Code agent policy
agent = create_policy_from_string("claude-code")

# Infer with code execution capabilities
result = agent.infer("Write a Python function to calculate fibonacci numbers and test it with n=10")
print(f"Agent response: {result}")

Example 7: Supervised Fine-Tuning

SFT trainer accepts list[SingleSample] directly.

from utils.policy_utils import create_policy_from_string
from core.policy.schema import SingleSample
from core.trainer.sft import SFTTrainer, SFTConfig

# Prepare training data
samples = [
    SingleSample(
        history=[{"role": "user", "content": "What is 2+2?"}],
        output="4",
    ),
    SingleSample(
        history=[{"role": "user", "content": "What is the capital of France?"}],
        output="Paris",
    ),
    # ... more samples
]

# Create trainer
config = SFTConfig(
    num_epochs=2,
    learning_rate=1e-5,
    validation_strategy="train"  # split from training set
)
trainer = SFTTrainer(config)

# Train (creates new policy, doesn't modify original)
base_policy = create_policy_from_string("gpt-4o")
trained_policy = trainer.train(
    policy=base_policy,
    samples=samples
)

Example 8: Few-Shot Learning

Few-shot trainer also accepts list[SingleSample].

from utils.policy_utils import create_policy_from_string
from core.policy.schema import SingleSample
from core.trainer.fewshot import FewShotTrainer

# Prepare few-shot examples
examples = [
    SingleSample(
        history=[{"role": "user", "content": "Translate to French: Hello"}],
        output="Bonjour",
    ),
    SingleSample(
        history=[{"role": "user", "content": "Translate to French: Goodbye"}],
        output="Au revoir",
    ),
]

# Create policy with few-shot examples
trainer = FewShotTrainer()
base_policy = create_policy_from_string("o4-mini")
fewshot_policy = trainer.train(
    policy=base_policy,
    samples=examples
)

# Now use the policy with in-context examples
response = fewshot_policy.infer("Translate to French: Thank you")
print(response)

Example 9: Reinforcement Learning with Graders

from core.domain.forecasting import Forecasting
from core.trainer.rl import RLTrainer, RLConfig
from core.grader.python_brier import PythonBrierGrader
from utils.policy_utils import create_policy_from_string

# Setup
domain = Forecasting()
problems = domain.sample_problems(n=100, split="train")

# Create grader and trainer
grader = PythonBrierGrader()
config = RLConfig(num_epochs=3, learning_rate=1e-6, kl_coef=0.1)
trainer = RLTrainer(config)

# Train with RL
base_policy = create_policy_from_string("o4-mini")
trained_policy = trainer.train(
    policy=base_policy,
    problem_list=problems,
    grader=grader
)

Example 10: End-to-End Workflow

Complete workflow from domain to inference to training, using self-labeled training as an example:

from core.domain.conceptual import Conceptual
from utils.policy_utils import create_policy_from_string
from core.trainer.sft import SFTTrainer, SFTConfig

# 1. Load domain and sample problems
domain = Conceptual()
problems = domain.sample_problems(n=10, split="train")

# 2. Generate responses with base policy
policy = create_policy_from_string("o4-mini")
samples = [p.to_sample() for p in problems]
results = policy.infer_many(samples)

# 3. Use results as training data
trainer = SFTTrainer(SFTConfig(num_epochs=1))
trained_policy = trainer.train(policy=policy, samples=results)

# 4. Test trained policy
test_problem = domain.sample_problems(n=1, split="test")[0]
response = trained_policy.infer(test_problem.to_sample())
print(f"Q: {response.history[0]['content']}")
print(f"A: {response.output}")

Example 11: Async Inference and Training Across Multiple Domains

Run inference and training on multiple domains in parallel.

import asyncio
from core.domain.conceptual import Conceptual
from core.domain.forecasting import Forecasting
from utils.policy_utils import create_policy_from_string
from core.trainer.sft import SFTTrainer

policy = create_policy_from_string("o4-mini")

async def process_domain(domain, policy, trainer):
    """Infer and train on a single domain"""
    # Generate training data
    problems = domain.sample_problems(n=5, split="train")
    samples = [p.to_sample() for p in problems]
    results = await asyncio.gather(*[policy.infer_async(s) for s in samples])

    # Train and return
    return await trainer.train_async(policy=policy, samples=results)

async def main():
    trainer = SFTTrainer()

    # Process multiple domains in parallel
    domains = [Conceptual(), Forecasting()]
    trained_policies = await asyncio.gather(
        *[process_domain(d, policy, trainer) for d in domains]
    )

    print(f"Trained {len(trained_policies)} policies in parallel")

asyncio.run(main())

Everything else in this library is asynchronous as well; the snippet above is just one example. Note that it is strongly recommended to instantiate policies (whether through the create_policy_from_string interface or through policy classes such as LocalModel) outside of asynchronous contexts, to avoid potential event loop issues.

Example 12: Local Model Training with Multi-GPU

import asyncio

from utils.policy_utils import create_policy_from_string
from core.trainer.sft import SFTTrainer, SFTConfig
from core.policy.schema import SingleSample

# Create local model (automatically uses all available GPUs).
# As noted above, instantiate policies outside of async contexts.
policy = create_policy_from_string("meta-llama/Llama-3.2-1B-Instruct")

# Prepare samples
samples = [
    SingleSample(
        history=[{"role": "user", "content": "Hello"}],
        output="Hi there!",
    ),
    # ... more samples
]

async def main():
    # Train with DeepSpeed ZeRO-2 (automatic)
    trainer = SFTTrainer(SFTConfig(num_epochs=2))
    trained_model = await trainer.train_async(
        policy=policy,
        samples=samples
    )
    return trained_model

asyncio.run(main())

Core Components Documentation

Algorithms (core/algo/)

  • accuracy.py - Provides GroundTruthAccuracy for evaluating model accuracy against ground truth (previously coupled with MartingaleStrategy, now independent)
  • mutualpredict.py - Provides MutualPredictStrategy for measuring mutual predictability between models, as described in Wen et al. (2025)
  • sycoreason.py - Provides SycophanticReasoning for measuring sycophancy towards user opinion, including both expressed opinion and IQ change due to user (dis)agreement
  • qualitative.py - Provides QualitativeJudge for evaluating free-form text responses, where a judge (arbitrary Policy, can be human) looks for truth-seeking qualities in the response
  • martingale.py - Implements martingale-based strategies for bias evaluation and correction (He et al., 2025)
  • worldintheloop.py - Implements "world-in-the-loop" evaluation strategies, where we test the helpfulness of model outputs for predicting real-world observations gathered by an investigation agent
  • graderwrapper.py - Provides GraderWrapper for wrapping any Grader (model-based or Python-based) as a DebiasStrategy for evaluation

Algorithm Configuration

All evaluation strategies now expose a typed AlgoConfig accessible via strategy.get_config() and serialized in bias result files under the config field.

Config values are set via environment variables - see the Environment Variables section for more details.

Implemented configs:

  • GroundTruthAccuracyConfig (metric, mode)
  • MartingaleConfig (belief_change_type, sample_granularity, regularization_type, informative_switch, informative_coef)
  • MutualPredictConfig (predictor_choice, predicted_choice, target_question_choice, context_question_choice, k_context, trials_per_sample, scoring, predictee_behavior_examples)
  • QualitativeJudgeConfig (instruction, red_team_mode, include_self_answer, few_shot_strategy, few_shot_source_path, example_recomputation_rounds, few_shot_bootstrap_rounds, bootstrap_respect_permutations)
  • SycophanticReasoningConfig (metric, mode)
  • GraderWrapperConfig (grader_spec, grader_type, grader_model)

Reasoning Modes (core/reasoning/)

  • direct.py - Full response as one single step. Reasoning models have two steps: reasoning and response
  • cot.py - Chain of Thought reasoning implementation
  • debate.py - Self-debate reasoning implementation where models argue different positions
  • bootstrap.py - Bootstrap Interview mode that asks auxiliary questions before the main question to build reasoning capacity
  • length_control.py - Length-controlled reasoning mode for controlling response verbosity

Reasoning Configuration

All reasoning modes now support typed ReasoningConfig subclasses for configuration:

  • DirectConfig - Configuration for direct inference mode
  • CoTConfig - Configuration for Chain of Thought mode
  • DebateConfig - Configuration for self-debate mode (num_turns)
  • BootstrapInterviewConfig - Configuration for bootstrap interview mode (num_auxiliary_questions, auxiliary_mode, instruction_types)
  • LengthControlConfig - Configuration for length-controlled reasoning mode

Analysis (core/analyzer/)

  • performance_comparison.py - Comparing performance between different policies, and testing the soundness of performance scores
  • evaluation_relationship.py - Correlational and causal relationship between scores from different evaluation algorithms
  • causal_attribution.py - How different features causally contribute to the evaluation score
  • cross_setup_agreement.py - Under the same evaluation algorithm, compare scores that the same policy/trajectory get from different evaluation setups
  • token_level_evidence.py - Token-level analysis, e.g. visualizing how much each token contributes to the final score
  • trajectory_score.py - Aggregates and sorts trajectories by score for training pipeline
  • training_causal_effect.py - Analyzes causal effects of training interventions
  • results_catalog.py - Cataloging all run files to be displayed in the web app

Analysis Framework

  • Data models:
    • RunFiles: shared and per-eval-strategy file references in a run directory
    • Run: mirrors a single runs catalog entry; holds all co-existing eval strategies' configs
    • Setup: a condition grouping of runs with identical (algo, domain, reasoning_mode, system_prompt)
  • Analyzer interface (scale-aware methods; analyzers implement any subset):
    • analyze_trajectory(trajectory, *, run, eval_strategies)
    • analyze_run(run, *, eval_strategies)
    • analyze_setup(setup, *, eval_strategies)
    • analyze_across_setups(setups, *, eval_strategies)
  • Orchestration utilities (utils/analyzer_utils.py):
    • discover_run(run_dir|bias_file) → Run
    • discover_runs_in_batch_dir(batch_dir) → Run[]
    • group_setups(runs) → Setup[]
    • collect_eval_strategies_from_runs(runs) → list of AlgoStrRepr
    • run_all_analyzers_for_two_batches(dir_a, dir_b, clean_output=True)
    • run_analyzers_from_env() entrypoint used by scripts/run_analyzers.py
  • Output path policy:
    • data/analysis/<Analyzer>/<algo>/<domain>/<mode>/<prompt>/<AlgoStrRepr>/<model or ALL>/...
  • Running analyzers:
    • ANALYZERS=all DIR_NAME=run-XXX python -m scripts.run_analyzers
    • ANALYZERS=ResultsCatalogAnalyzer,CausalAttributionAnalyzer,CrossSetupAgreementAnalyzer DIR_NAME=run-XXX python -m scripts.run_analyzers

Domains (core/domain/)

  • research.py - Self-curated dataset of research questions, with an easy answer and a hard answer
  • conceptual.py - 31 very thorny conceptual or philosophical questions, meant to test the ability to (1) deconfuse, and (2) think outside the Overton window
  • forecasting.py - Forecasting domain for prediction tasks using Metaculus and Polymarket data
  • cmvbinary.py - ChangeMyView domain with binary opinion change evaluation
  • cmvfreeform.py - ChangeMyView domain with free-form opinion change evaluation
  • openreview.py - OpenReview domain for academic paper evaluation
  • intellectual.py - Intellectual demonstration domain for testing reasoning depth
  • wildchat.py - WildChat domain for evaluating on real user conversations

Policies (core/policy/)

All policies share a unified interface with flexible input handling—infer() and infer_many() accept strings, message lists, or dialogue objects.

  • apimodel.py - Standard API-based model interface for external LLM services
  • raymodel.py - API-based model accelerated with Ray-based parallelization (outperforms apimodel.py at throughputs above 100,000 tokens/s)
  • batchmodel.py - API-based model using the batch API to save costs. Costs are reduced by 50%, but one full run can take up to 48 hours. Recommended only when pooling many runs together
  • localmodel.py - Locally deployed model with full logprob access using SGLang backend
  • human.py - Implements the Human policy class, where conversations are shown on the command line and the human user types responses
  • claudecode.py - Claude Code integration for interactive coding assistance

Graders (core/grader/)

  • schema.py - Base Grader abstract class and factory functions for creating graders
  • python_brier.py - Extracts \finalBeliefProb{X} patterns and calculates Brier scores
  • model_brier.py - Uses LLMs to extract beliefs and calculate Brier scores
  • python_grader.py - Base class for Python-based graders executed on OpenAI servers
  • model_grader.py - Base class for model-based graders using LLMs for evaluation

Training Pipeline (core/trainer/)

The training pipeline supports supervised fine-tuning (SFT), reinforcement learning (RL), and few-shot in-context learning approaches to improve model performance based on trajectory-score pairs from evaluation runs.

Training Strategies

  • sft.py - Supervised Fine-Tuning trainer that selects top-scoring trajectories and fine-tunes models on them

    • Supports OpenAI and Together AI fine-tuning APIs for API models
    • Supports local fine-tuning with trl and deepspeed for LocalModel
    • Configurable via SFT_TOP_PERCENTAGE, SFT_NUM_EPOCHS, SFT_LEARNING_RATE, etc.
    • Validation Support: Automatic validation set creation with three strategies:
      • none: No validation (default)
      • train: Split a portion from training set
      • gt: Use ground-truth aligned samples only
  • rl.py - Reinforcement Learning trainer using custom graders with OpenAI's RL API

    • Supports both Python and model-based graders for reward calculation
    • Configurable via RL_NUM_EPOCHS, RL_LEARNING_RATE, etc.
    • Uses graders defined in core/grader/ for scoring model outputs
  • fewshot.py - Few-shot trainer that formats top trajectories as in-context examples

    • Creates new policies with prepended few-shot examples
    • Randomly selects one sample per trajectory to avoid duplicates
    • Configurable via FEWSHOT_TOP_COUNT, FEWSHOT_TOP_PERCENTAGE

Training Architecture

The training system follows a clean separation of concerns:

  1. Trainers (core/trainer/) - Handle data compilation and selection

    • Load trajectory-score pairs from analyzer outputs
    • Select top trajectories based on configuration
    • Convert trajectories to appropriate training formats
    • Validation Management: Automatic validation set creation with intelligent sizing
    • Ground Truth Filtering: Filter validation samples to only include correctly aligned beliefs
  2. Policies (core/policy/) - Handle model management and training execution

    • Implement train_sft() for supervised fine-tuning with validation support
    • Implement add_few_shot_examples() for in-context learning
    • Use deep_copy() utility for creating trained policy instances with proper naming and metadata
    • Real-time Monitoring: Display training/validation losses during fine-tuning
    • WandB Integration: Log comprehensive training metrics including losses, learning rate, gradient norms
  3. Reasoning Modes (core/reasoning/) - Handle trajectory-to-sample conversion

    • Each reasoning mode implements trajectory_to_samples() to convert trajectories to training samples
    • Respects the trainable flag on reasoning steps to exclude system/non-trainable content

Training Workflow

  1. Generate trajectories and scores: Run evaluation with desired algorithms to produce trajectory-score pairs

  2. Run analyzers: Use TrajectoryScoreAnalyzer to aggregate and sort trajectories by score

  3. Train policies: The training pipeline automatically detects trajectory score files and trains policies:

    # After running evaluation and analysis
    export TRAINERS=SFTTrainer,FewShotTrainer  # or "all"
    export SFT_TOP_PERCENTAGE=0.1  # Train on top 10%
    export FEWSHOT_TOP_COUNT=100   # Use top 100 examples
    python -m scripts.run_trainers
  4. Trained model storage: Trained models are saved in data/models/ with unique names:

    • Format: {base_name}-{training_type}-{hash}
    • Example: gpt-4.1-mini-sft-a36b1f2c3f84
    • Metadata saved alongside including training configuration and source files
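The naming scheme above can be sketched as follows. This is an illustration only: hashing the JSON-serialized training configuration is an assumption, not necessarily what the pipeline actually hashes.

```python
import hashlib
import json

def trained_model_name(base_name: str, training_type: str, config: dict) -> str:
    """Illustrative sketch of the {base_name}-{training_type}-{hash} format.
    Hash input (the serialized config) is an assumption for this example."""
    payload = json.dumps(config, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"{base_name}-{training_type}-{digest}"

print(trained_model_name("gpt-4.1-mini", "sft", {"num_epochs": 2}))
```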

Configuration

Training behavior is controlled via environment variables and TrainingConfig:

  • TRAINERS: Which trainers to use (SFTTrainer, FewShotTrainer, or "all")
  • VALIDATION_STRATEGY: Validation set strategy ("none", "train", "gt") (default: "none")
    • none: No validation set
    • train: Split validation from training set
    • gt: Use ground-truth aligned samples only for validation
  • SFT_TOP_PERCENTAGE: Percentage of top trajectories for SFT (default: 0.1)
  • SFT_NUM_EPOCHS: Number of training epochs (default: 2)
  • SFT_LEARNING_RATE: Learning rate for fine-tuning (default: 1e-5)
  • FEWSHOT_TOP_COUNT: Maximum examples for few-shot (default: 100)
  • FEWSHOT_TOP_PERCENTAGE: Percentage for few-shot selection (default: 0.1)
  • WANDB_API_KEY: Optional WandB API key for training metrics logging

The system uses the minimum of count and percentage limits for few-shot to avoid excessive context length.
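A sketch of that selection rule, with defaults mirroring FEWSHOT_TOP_COUNT=100 and FEWSHOT_TOP_PERCENTAGE=0.1 (the function name is illustrative):

```python
def fewshot_limit(num_trajectories: int, top_count: int = 100,
                  top_percentage: float = 0.1) -> int:
    """Take the smaller of the absolute cap and the percentage cap,
    keeping the few-shot context from growing too long."""
    return min(top_count, int(num_trajectories * top_percentage))

print(fewshot_limit(5000))  # → 100 (percentage cap of 500 exceeds the count cap)
print(fewshot_limit(400))   # → 40 (10% of 400 is below the count cap)
```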

Codebase structure

core

Contains the core logic of the project.

  • core/algo: Contains the debiasing algorithms (e.g. Martingale, justified flipping)
  • core/analyzer: Contains the result analyzers (e.g. PerformanceComparisonAnalyzer, EvaluationRelationshipAnalyzer)
  • core/domain: Contains the problem domains (e.g. forecasting)
  • core/policy: Contains the policy models (e.g. API model, Local LLMs)
  • core/reasoning: Contains the reasoning modes (e.g. Self-Debate, CoT)

Each of these components has a schema file that describes the components' inputs and outputs, accompanied by a number of subclass files that implement the schema for specific algorithms/domains/policies/reasoning modes.

scripts

Contains scripts for data fetching, processing, and analysis.

  • scripts/run_reasoning.py: Contains the script for producing reasoning trajectories given any combination of reasoning modes, domains, policies, evaluation algorithms, and analyzers.
  • scripts/data/*: Contains the scripts for data fetching and organization.
  • scripts/misc/*: Contains all other scripts of long-standing value.
  • scripts/legacy/*: Contains all deprecated scripts that are only kept for backward compatibility.

utils

Contains utility functions for the project.

  • utils/policy_utils.py: Contains the utility functions for policy model creation and other policy-related operations.
  • utils/io_utils.py: Contains the utility functions for input/output operations, including the handling of JSON formatting.
  • utils/async_utils.py: Contains the utility functions for asynchronous operations.
  • utils/stats_utils.py: Contains tools for statistical analysis and plotting.
  • utils/nlp_utils.py: Contains tools for natural language processing.
  • utils/path_utils.py: Expands the PATH variable to include all levels of the project directory.
  • utils/analyzer_utils.py: Contains utility functions for calling analyzers.
  • utils/debate_processing_utils.py: Contains the utility functions for processing the debate data.
  • utils/judge_manipulation_utils.py: Contains the utility functions for manipulating the judge policy's belief, most useful for SelfDebate.
  • utils/killall.sh: Contains the script for killing all GPU processes, useful for LocalModel.
  • utils/templates/*: Contains prompt templates.

Environment Variables

The codebase uses various environment variables for configuration. Set them as needed before running experiments.

This section lists all supported environment variables, grouped into Features, Execution, and Experimental categories, followed by algorithm-specific groups.

Features

These variables control which features to evaluate. They are typically set from the UI selection.

  • ALGO_NAMES: Evaluation strategy(-ies) to use
    • Options: GroundTruthAccuracy, MartingaleStrategy, WorldInTheLoop, QualitativeJudge, MutualPredictStrategy, GraderWrapper
    • Required: Yes
  • DOMAIN_NAMES: Domain name(s) to evaluate
    • Options: Forecasting, OpenReview, CMVBinary, CMVFreeForm, Research, Conceptual, IntellectualDemonstration, WildChat
    • Required: Yes
  • REASONING_MODE_NAMES: Reasoning mode name(s)
    • Options: DirectInference, ChainOfThought, SelfDebate, BootstrapInterview, LengthControl
    • Required: Yes
  • POLICY_LIST: Policies to evaluate (comma-separated)
  • SYSTEM_PROMPT: System prompt(s) to test (comma-separated)
  • ANALYZERS: Result analyzers to run
    • Options: PerformanceComparisonAnalyzer, EvaluationRelationshipAnalyzer, CausalAttributionAnalyzer, CrossSetupAgreementAnalyzer, TokenLevelEvidenceAnalyzer, TrajectoryScoreAnalyzer, TrainingCausalEffectAnalyzer, ResultsCatalogAnalyzer

Execution

These variables control how the evaluation runs are executed.

  • NUM_TRAJECTORIES: Number of trajectories to generate per combination (default: differs by algorithm and domain)
    • Type: Number
  • DIR_NAME: Directory name for output; use "/" to indicate a subdirectory
    • Default: Generates timestamp-based name
  • DEBUG: Debug level (0=none, 1=basic, 2=verbose)
    • Options: 0, 1, 2
    • Default: 0
  • SAVE_TO_FILE: Back up untruncated console logs to file
    • Type: Boolean (0 or 1)
    • Default: false
  • USE_RAY: Use Ray for distributed processing of API calls
    • Type: Boolean (0 or 1)
    • Default: true
  • USE_OPENROUTER: Use OpenRouter for model routing (requires USE_RAY=true)
    • Type: Boolean (0 or 1)
    • Default: true
  • USE_BATCH: Use provider-specific batch APIs (requires USE_RAY=false)
    • Type: Boolean (0 or 1)
    • Default: false
  • PARALLEL_BATCH: Use async parallelism across runs
    • Type: Boolean (0 or 1)
    • Default: true
  • MAX_WORKERS: Maximum number of Ray workers
    • Type: Number
  • RERUN_INCOMPLETE: Rerun experiments that contain < NUM_TRAJECTORIES trajectories
    • Type: Boolean (0 or 1)
    • Default: true
  • RECOMPUTE_RESULTS: Whether to recompute and overwrite the existing final result JSON file
    • Type: Boolean (0 or 1)
    • Default: false
  • RECOMPUTE_TRAJECTORIES: When to recompute trajectories JSON file
    • Options: never, missing, always
      • "never": Never recompute trajectories, always use existing ones
      • "missing": Only generate trajectories if they don't exist
      • "always": Always regenerate trajectories (expensive)
    • Default: missing
  • RECOMPUTE_BELIEFS: When to recompute beliefs JSON file
    • Options: never, missing, always
      • "never": Never recompute beliefs, always use existing ones
      • "missing": Only measure beliefs if they don't exist
      • "always": Always remeasure beliefs (recommended for experimenting with different algorithms)
    • Default: missing
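As a sketch of how the caching flags interact: to reuse cached trajectories while re-measuring beliefs (e.g. when experimenting with a different algorithm), one might set:

```shell
# Keep existing trajectories, but always re-measure beliefs
export RECOMPUTE_TRAJECTORIES=never
export RECOMPUTE_BELIEFS=always
export RERUN_INCOMPLETE=0
python -m scripts.run_reasoning
```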

API Keys

  • OPENROUTER_API_KEY: OpenRouter API key
  • HUGGINGFACE_API_KEY: HuggingFace API key
  • TOGETHER_API_KEY: TogetherAI API key
  • OPENAI_API_KEY: OpenAI API key
  • ANTHROPIC_API_KEY: Anthropic API key
  • GOOGLE_API_KEY: Google API key
  • WANDB_API_KEY: Weights & Biases API key for training metrics logging

Debug

  • SHOW_PROGRESS: Show progress bars
    • Type: Boolean (0 or 1)
    • Default: true

Performance

  • NO_RETRY: Disable retry mechanism for API calls
    • Type: Boolean (0 or 1)
    • Default: false

Experimental

These variables control experimental features and algorithm-specific behaviors.

  • JUDGE_POLICY_NAMES: Judge policy names (comma-separated)
    • Dynamically set based on selected algorithms
  • TEMPERATURE: Model temperature for non-Ray API models
    • Type: Number
    • Default: 0.25
  • PRESENCE_PENALTY: Presence penalty for non-Ray API models
    • Type: Number
    • Default: 0.0
  • POLICY_LIST_MODE: Canonical policy list, overrides POLICY_LIST
    • Options: frontier, legacy, neurips
      • "frontier": Current default policy list (gpt-4.1, gpt-o3, deepseek-v3, llama-4, claude-sonnet-4, etc.)
      • "legacy": Legacy policy list (subset of frontier models)
      • "neurips": All 21 policies from batch-neurips directory including -confirmatory/-critical variants
  • FORBIDDEN_MODELS: Exclude policies whose names contain any of the given substrings (comma-separated)

Belief Measurement

  • DISABLE_SYSTEM_PROMPT_IN_BELIEF_MEASUREMENT: Disable system prompts for judge policies during belief measurement
    • Type: Boolean (0 or 1)
    • Default: true
  • USE_FIXED_JUDGE: Use a fixed judge for evaluation instead of the evaluated policy itself
    • Type: Boolean (0 or 1)
    • Default: true
  • OBJECTIVE_BELIEF: Judge estimates beliefs from the standpoint of the evaluated policy
    • Type: Boolean (0 or 1)
    • Default: true
  • USE_PER_TRAJ_BELIEF_MEASURE: Use per-trajectory belief measurement instead of per-step
    • Type: Boolean (0 or 1)
    • Default: true
  • DECOUPLE_TRAJECTORY_BELIEFS: Save belief measurement and trajectories in separate files
    • Type: Boolean (0 or 1)
    • Default: true
      • 0: Legacy mode - trajectories and beliefs stored together in reasoning-trajectories.json
      • 1: Decoupled mode - raw trajectories in reasoning-trajectories-raw.json, beliefs in reasoning-beliefs-{algorithm}.json

World-in-the-Loop

Available when WorldInTheLoop algorithm is selected.

  • WITL_RECOMPUTE_POLICY: Which World-in-the-Loop components to recompute
    • Options: investigation, uplift, upliftblanket, all
    • Default: all
  • FORECASTER_TEMPLATE: Template to use for presenting investigation results to forecaster
    • Options: vanilla, rephrase, toolcall
      • "vanilla" - Simple prediction prompt with direct investigation result as assistant response
      • "rephrase" - Rephrases investigation results to remove style familiarity, includes tool usage explanations
      • "toolcall" - Most natural approach: forecaster retrieves investigation report through formal tool call API, mentioning that Claude Code completed the investigation
    • Default: vanilla
  • WITL_PER_TOKEN_UPLIFT: Use per-token uplift calculation
    • Type: Boolean (0 or 1)
    • Default: true

Qualitative Judge

Available when QualitativeJudge algorithm is selected.

  • QUALITATIVE_JUDGE_USE_FEW_SHOT: Use few-shot examples in qualitative judge
    • Type: Boolean (0 or 1)
    • Default: true
  • QUAL_RED_TEAM_MODE: Qualitative judge red team mode
    • Options: none, red, red_blue, red_blue_resolution
    • Default: red_blue_resolution
  • QUAL_FEW_SHOT_PERMUTE: Permute few-shot examples
    • Type: Boolean (0 or 1)
    • Default: true
  • QUAL_EXAMPLE_RECOMPUTATION_ROUNDS: Example recomputation rounds
    • Type: Number
    • Default: 1
  • QUAL_FEW_SHOT_BOOTSTRAP_ROUNDS: Few-shot bootstrap rounds
    • Type: Number
    • Default: 1
  • QUAL_BOOTSTRAP_RESPECT_PERMUTE: Respect permutation in bootstrap
    • Type: Boolean (0 or 1)
    • Default: false
  • FEW_SHOT_SOURCE_PATH: Few-shot source path
    • Default: data/questions/conceptual_human_examples.json
  • FEW_SHOT_BOOTSTRAP_SOURCE_PATH: Few-shot bootstrap source path

Bootstrap Interview

Available when BootstrapInterview reasoning mode is selected.

  • BOOTSTRAP_NUM_AUXILIARY: Number of auxiliary questions to ask before the main question
    • Type: Number
    • Default: 3
  • BOOTSTRAP_MODE: Method for selecting auxiliary questions
    • Options: fixed_sequence, iid, stationary_ood, llm_preset, llm_adaptive
      • "fixed_sequence": Use predefined list of truth-seeking questions
      • "iid": Sample from same domain as the main question
      • "stationary_ood": Sample from other specified domains
      • "llm_preset": LLM generates all auxiliary questions at once
      • "llm_adaptive": LLM generates questions adaptively based on conversation
    • Default: iid
  • BOOTSTRAP_OOD_DOMAINS: Domains to sample from for stationary_ood mode (comma-separated)
    • Example: Research,Forecasting
  • BOOTSTRAP_GENERATOR_POLICY: Policy to use for generating auxiliary questions in LLM modes
    • Example: gpt-4.1-mini
  • BOOTSTRAP_INSTRUCTION_TYPES: Instruction types for LLM generation (comma-separated)
    • Options: curriculum, contradiction_seeking, synergistic, socratic
      • "curriculum": Build progressively in complexity
      • "contradiction_seeking": Focus on eliciting contradictions
      • "synergistic": Establish cross-domain connections
      • "socratic": Use Socratic questioning to uncover assumptions
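Putting these variables together, a hypothetical out-of-distribution bootstrap interview run could look like:

```shell
# Ask 3 auxiliary questions sampled from other domains before the main question
export REASONING_MODE_NAMES=BootstrapInterview
export BOOTSTRAP_MODE=stationary_ood
export BOOTSTRAP_NUM_AUXILIARY=3
export BOOTSTRAP_OOD_DOMAINS=Research,Forecasting
python -m scripts.run_reasoning
```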

GraderWrapper

Available when GraderWrapper algorithm is selected.

  • GRADER_SPEC: Grader specification for GraderWrapper (JSON string or env var name)
    • Type: String
    • Used to instantiate an arbitrary grader via create_grader_from_spec
  • GRADER_TYPE: Type of grader to use (if not using GRADER_SPEC)
    • Options: python_brier, model_brier, model
    • Used as fallback if GRADER_SPEC not provided
  • GRADER_MODEL: Model to use for model-based grading
    • Type: String (e.g., o1-mini, gpt-4)
    • Used when GRADER_TYPE is model_brier or model
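For reference, the python_brier grader family is built around the Brier score. The sketch below shows the standard binary Brier score, not the repo's implementation, and the function name is ours:

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast in [0, 1] and a
    binary outcome in {0, 1}. Lower is better; an uninformed forecast
    of 0.5 always scores 0.25."""
    return (forecast - outcome) ** 2

print(brier_score(0.9, 1))  # ~0.01: confident and correct
print(brier_score(0.9, 0))  # ~0.81: confident and wrong
```

Used as a training reward, the score is typically negated so that better-calibrated forecasts earn higher reward.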

Mutual Predictability

Available when MutualPredictStrategy algorithm is selected.

  • MP_PREDICTOR_CHOICE: MutualPredictStrategy predictor policy choice
    • Options: evaluated, random_non_evaluated, random_any
    • Default: random_non_evaluated
  • MP_PREDICTED_CHOICE: MutualPredictStrategy predicted policy choice
    • Options: evaluated, random_non_evaluated, random_any
    • Default: evaluated
  • MP_TARGET_QUESTION: MutualPredictStrategy target question choice
    • Options: evaluated_question, random_non_evaluated_question, random_any_question
    • Default: evaluated_question
  • MP_CONTEXT_QUESTIONS: MutualPredictStrategy context question choice
    • Options: evaluated_question, k_random_non_evaluated
    • Default: k_random_non_evaluated
  • MP_K_CONTEXT: Number of context questions
    • Type: Number
    • Default: 3
  • MP_TRIALS_PER_SAMPLE: Trials per sample to average
    • Type: Number
    • Default: 3
  • MP_SCORING: MutualPredictStrategy scoring method
    • Options: uplift, conditional_only, judge_consistency
    • Default: uplift
  • MP_PREDICTEE_BEHAVIOR_EXAMPLES: Number of predictee behavioral examples
    • Type: Number
    • Default: 0
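For instance, the default-like configuration spelled out explicitly (all values shown are the documented defaults):

```shell
export ALGO_NAMES=MutualPredictStrategy
export MP_PREDICTOR_CHOICE=random_non_evaluated
export MP_PREDICTED_CHOICE=evaluated
export MP_TARGET_QUESTION=evaluated_question
export MP_CONTEXT_QUESTIONS=k_random_non_evaluated
export MP_K_CONTEXT=3
export MP_SCORING=uplift
python -m scripts.run_reasoning
```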

Performance Comparison Analyzer

Available when PerformanceComparisonAnalyzer is selected.

  • REMOVE_N_OUTGROUP_SETUP: Number of outgroup data points (with most different setup names) to remove from per-setup performance comparison
    • Type: Number
    • Default: 0
  • REMOVE_N_OUTGROUP_ACROSS_SETUPS: Number of outgroup data points (with most different setup names) to remove from across-setups performance comparison
    • Type: Number
    • Default: 0
  • GROUP_MODE_SETUP: How to group data points in the per-setup performance comparison ("none", "qwen", "substr_KEY1=VALUE1_...")
    • Default: none
  • GROUP_MODE_ACROSS_SETUPS: How to group data points in the across-setups performance comparison ("none", "qwen", "substr_KEY1=VALUE1_...")
    • Default: none
  • ASSIGN_X_COORDS_SETUP: How to assign x coordinates when plotting per-setup performance comparison
    • Options: none, lexical, lexical_and_group, size, size_and_group, size_group_and_random
    • Default: none
  • ASSIGN_X_COORDS_ACROSS_SETUPS: How to assign x coordinates when plotting across-setups performance comparison
    • Options: none, lexical, lexical_and_group, size, size_and_group, size_group_and_random
    • Default: none
  • ELO_CORR_X_TICKS_SETUP: How to assign x ticks when plotting per-setup performance comparison ("names" or list of string labels)
    • Default: none
  • ELO_CORR_X_TICKS_ACROSS_SETUPS: How to assign x ticks when plotting across-setups performance comparison ("names" or list of string labels)
    • Default: none
  • ELO_CORR_X_LABEL_SETUP: X-axis label for per-setup performance comparison
    • Default: Setup Index
  • ELO_CORR_X_LABEL_ACROSS_SETUPS: X-axis label for across-setups performance comparison
  • ELO_CORR_Y_COORDS_SETUP: How to transform y coordinates in the per-setup performance comparison ("none", "negate", "log", comma-separated list of python math functions to apply)
    • Default: none
  • ELO_CORR_Y_COORDS_ACROSS_SETUPS: How to transform y coordinates in the across-setups performance comparison ("none", "negate", "log", comma-separated list of python math functions to apply)
    • Default: none

Misc Features and Notes

File Structure

When decoupling is enabled, the following files are used:

  • reasoning-trajectories-raw.json - Raw reasoning content without belief measurements
  • reasoning-beliefs-{algorithm}.json - Belief measurements for specific algorithms
  • reasoning-trajectories.json - Legacy format (maintained for backward compatibility)

The system automatically detects and loads legacy files when decoupling is enabled, ensuring full backward compatibility.
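Downstream code might resolve the right file like this (a sketch; the file names follow the convention above, but the JSON schema inside is an assumption):

```python
import json
import os

def load_beliefs(run_dir: str, algorithm: str) -> dict:
    """Prefer the decoupled per-algorithm belief file; fall back to the
    legacy combined trajectories file if it is absent."""
    decoupled = os.path.join(run_dir, f"reasoning-beliefs-{algorithm}.json")
    legacy = os.path.join(run_dir, "reasoning-trajectories.json")
    path = decoupled if os.path.exists(decoupled) else legacy
    with open(path) as f:
        return json.load(f)
```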

Partial Results Recomputation

The evaluation framework supports selective recomputation of incomplete or failed evaluations through two environment variables:

RECOMPUTE_RESULTS

When RECOMPUTE_RESULTS=1, the system will:

  • Load existing bias-eval-results-[ALGO_NAME].json files instead of skipping them
  • Pass the existing results to the algorithm's compute_loss_async method
  • Allow algorithms to selectively recompute missing or failed components
  • Overwrite the existing results file with updated evaluations

This is particularly useful for:

  • Completing interrupted World-in-the-Loop evaluations
  • Rerunning failed investigations due to network issues or data access problems
  • Adding missing uplift calculations to existing evaluations

WITL_RECOMPUTE_POLICY (World-in-the-Loop only)

When used with RECOMPUTE_RESULTS=1, this controls what gets recomputed in World-in-the-Loop evaluations:

  • "all" (default): Recompute everything (task sampling, investigation, uplift) for trajectories with any missing fields (or missing entire trajectory)
  • "investigation": Only recompute investigation results (NOT subsequent uplift calculations) for trajectories with missing investigation results (or missing entire trajectory)
  • "uplift": Only recompute uplift rewards for trajectories that have investigation results but missing uplift values
  • "upliftblanket": Recompute uplift rewards for all trajectories that have investigation results (including those with existing uplift values)
# Recompute all missing components in an existing World-in-the-Loop evaluation
RECOMPUTE_RESULTS=1 WITL_RECOMPUTE_POLICY=all python -m scripts.run_reasoning

# Only fill in missing investigation results (useful after data access issues / investigator agent API issues are resolved)  
RECOMPUTE_RESULTS=1 WITL_RECOMPUTE_POLICY=investigation python -m scripts.run_reasoning

# Only compute missing uplift values (when investigations completed but forecaster failed)
RECOMPUTE_RESULTS=1 WITL_RECOMPUTE_POLICY=uplift python -m scripts.run_reasoning

# Recompute all uplift values (useful when forecaster model or parameters changed)
RECOMPUTE_RESULTS=1 WITL_RECOMPUTE_POLICY=upliftblanket python -m scripts.run_reasoning

Local Model Choice

For evaluation strategies that involve log probabilities, we deploy local models to compute them.

Model availability depends on support in SGLang, which we use for deployment. As of Aug 12, 2025, we have tested the following local models.

Models tested to work:

  • Qwen/Qwen3-30B-A3B-Instruct-2507 (30B, LMArena #23; base available)
  • Qwen/Qwen3-0.6B (0.6B)
  • deepseek-v3 (685B, LMArena #37; base available; FP8 supported)
  • zai-org/GLM-4.5-Air (110B, LMArena #23 with reasoning; base available; FP8 supported)
  • mistral-small-3.2-24b-instruct-2503 (24B, LMArena #96; base available)
  • Llama-3.2-1B-Instruct (1B, LMArena #196; base available)

Note that for mutual predictability/WITL, smaller and weaker models may sometimes work better as the forecaster/judge.

Mutual Predictability (7-axis framework)

Key idea: measure conditional predictability via log probabilities with configurable axes. A single evaluated policy’s answers over many questions are scored by how much a predictor policy improves likelihood assignment to a target answer when given configurable context.

  • Core class: MutualPredictStrategy in core/algo/mutualpredict.py
  • Config: MutualPredictConfig with axes:
    • Predictor (Axis 2): EVALUATED_POLICY | RANDOM_NON_EVALUATED | RANDOM_ANY
    • Predicted (Axis 3): EVALUATED_POLICY | RANDOM_NON_EVALUATED | RANDOM_ANY
    • Target question (Axis 4): EVALUATED_QUESTION | RANDOM_NON_EVALUATED_QUESTION | RANDOM_ANY_QUESTION
    • Context questions (Axis 5): EVALUATED_QUESTION | K_RANDOM_NON_EVALUATED (k hyperparam)
    • Context responses (Axis 6): the evaluated policy’s answers (fixed)
    • Scoring (Axis 7): UPLIFT (Δcond−Δbase), CONDITIONAL_ONLY, or JUDGE_CONSISTENCY
    • Trials: trials_per_sample to average randomness
    • Predictee behavior examples: predictee_behavior_examples (default 0) adds a clearly labeled block of the target policy’s past Q/A to condition the predictor; 0 adds nothing.

Defaults reproduce the previous behavior (cross-question context from the evaluated policy; uplift scoring).
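On the scoring axis, one plausible reading of UPLIFT as a plain-Python sketch (illustrative only; the repo's exact computation may differ):

```python
def uplift(logp_with_context: float, logp_baseline: float) -> float:
    """How much the predictor's log-probability of the target answer
    improves once the evaluated policy's context answers are shown.
    Positive means the context made the answer more predictable."""
    return logp_with_context - logp_baseline

print(uplift(-1.1, -2.3) > 0)  # True: context helped
```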

Minimal example (legacy-like behavior)

from core.algo.mutualpredict import (
    MutualPredictStrategy, MutualPredictConfig,
    PredictorChoice, PredictedChoice,
    TargetQuestionChoice, ContextQuestionChoice,
    ScoringMethod,
)

config = MutualPredictConfig(
    predictor_choice=PredictorChoice.RANDOM_NON_EVALUATED,
    predicted_choice=PredictedChoice.EVALUATED_POLICY,
    target_question_choice=TargetQuestionChoice.EVALUATED_QUESTION,
    context_question_choice=ContextQuestionChoice.K_RANDOM_NON_EVALUATED,
    k_context=3,
    trials_per_sample=3,
    scoring=ScoringMethod.UPLIFT,
    predictee_behavior_examples=0,  # default
)

algo = MutualPredictStrategy(judge_policies=[judge_model], config=config)
loss, details = await algo.compute_loss_async(samples=reasoning_trajectories)

Peer-prediction-like setup (same question, cross-model), with predictee examples

config = MutualPredictConfig(
    predictor_choice=PredictorChoice.RANDOM_ANY,
    predicted_choice=PredictedChoice.RANDOM_ANY,
    target_question_choice=TargetQuestionChoice.EVALUATED_QUESTION,
    context_question_choice=ContextQuestionChoice.EVALUATED_QUESTION,
    trials_per_sample=5,
    scoring=ScoringMethod.UPLIFT,
    predictee_behavior_examples=3,
)

algo = MutualPredictStrategy(judge_policies=predictor_pool, config=config)
loss, details = await algo.compute_loss_async(
    samples=reasoning_trajectories,
    participant_policies=predictor_pool,   # non-evaluated participants
    reasoning_mode=reasoning_mode,         # to generate target answers for non-evaluated policies
    domain=domain,
)

Notes

  • Predictor and predicted must differ; trials where they coincide are skipped automatically.
  • Exact duplicate answers between context and target are skipped.
  • When random choices are used on any axis, set trials_per_sample > 1 for stable averages.
  • If predicted_choice != EVALUATED_POLICY or predictee_behavior_examples > 0 with non-evaluated predictee, provide reasoning_mode and domain to generate/cache answers.

Qualitative Judge (prompt design & interaction pipeline)

Key class: QualitativeJudge in core/algo/qualitative.py

  • What it does: Grades reasoning trajectories for truth-seeking quality via a judge policy (default: gpt-o3).
  • Design axes (switchable):
    • Instruction/rubric: Controls criteria text injected into prompts.
    • Judge self-answer: Judge first answers the question to calibrate “easy vs hard insights”.
    • Adversarial depth (red_team_mode): none | red | red_blue | red_blue_resolution.
    • Few-shot strategy: disabled | static (fixed order) | permuted (all permutations × recomputation rounds).
    • Example recomputation rounds (M): Recompute example components across rounds to reduce variance.
    • Few-shot source: Path to initial examples.
    • Iterative few-shot bootstrapping (T rounds): Each round uses current few-shot set as context to grade candidates from samples and adds ≥ max(1, old_count) new examples (≈ doubling per round).
    • Robust parsing: Extracts JSON from imperfect LLM outputs with balanced-brace search and fallbacks.

Defaults mirror the prior behavior: self-answer enabled, full red_blue_resolution, few-shot enabled with permutations, M=1, no bootstrapping.

Environment variables:

  • Core
    • QUAL_INSTRUCTION: Override rubric text (default: built-in rubric)
    • QUAL_INCLUDE_SELF_ANSWER=1|0 (default 1)
    • QUAL_RED_TEAM_MODE in {none, red, red_blue, red_blue_resolution} (default red_blue_resolution)
  • Few-shot
    • QUALITATIVE_JUDGE_USE_FEW_SHOT=1|0 (default 1)
    • QUAL_FEW_SHOT_PERMUTE=1|0 (default 1 → permuted; 0 → static)
    • QUAL_FEW_SHOT_SOURCE=path (default data/questions/conceptual_human_examples.json)
    • QUAL_EXAMPLE_RECOMPUTATION_ROUNDS=M (default 1; also respects legacy EXAMPLE_RECOMPUTATION_ROUNDS)
  • Iterative few-shot bootstrapping
    • QUAL_FEW_SHOT_BOOTSTRAP_ROUNDS=T (default 0)
    • QUAL_BOOTSTRAP_RESPECT_PERMUTE=1|0 (default 0)
    • QUAL_FEW_SHOT_BOOTSTRAP_SOURCE=path (default None)
      • Expected to be a file containing reasoning trajectories, or a directory containing such files at any depth. Must be supplied if QUAL_FEW_SHOT_BOOTSTRAP_ROUNDS > 0.

Minimal usage (CLI):

# Default qualitative judging (self-answer + red-blue-resolution + permuted few-shot)
export ALGO_NAME=QualitativeJudge
python -m scripts.run_reasoning

Configure adversarial depth and few-shot behavior:

# Red + Blue + Resolution with 2 recomputation rounds and permuted few-shot
export ALGO_NAME=QualitativeJudge
export QUAL_RED_TEAM_MODE=red_blue_resolution
export QUALITATIVE_JUDGE_USE_FEW_SHOT=1
export QUAL_FEW_SHOT_PERMUTE=1
export QUAL_EXAMPLE_RECOMPUTATION_ROUNDS=2
python -m scripts.run_reasoning

Disable few-shot, run red-team only:

export ALGO_NAME=QualitativeJudge
export QUALITATIVE_JUDGE_USE_FEW_SHOT=0
export QUAL_RED_TEAM_MODE=red
python -m scripts.run_reasoning

Iterative few-shot bootstrapping (doubles examples each round using samples as candidates):

export ALGO_NAME=QualitativeJudge
export QUAL_FEW_SHOT_BOOTSTRAP_ROUNDS=2
# Optional: keep permuted context during bootstrapping rounds
export QUAL_BOOTSTRAP_RESPECT_PERMUTE=1
python -m scripts.run_reasoning

Python API (advanced):

from core.algo.qualitative import QualitativeJudge

algo = QualitativeJudge(judge_policies=[judge_model])  # env vars control axes
loss, details = await algo.compute_loss_async(samples=reasoning_trajectories)

Multi-GPU Training with DeepSpeed and Accelerate

The LocalModel class supports distributed training across multiple GPUs.

The system automatically detects available GPUs and configures training appropriately:

  • Single GPU: Standard training
  • Multiple GPUs: Distributed Data Parallel (DDP) training
  • With DeepSpeed: ZeRO optimization stages 2 or 3

Two DeepSpeed configurations are provided:

  • data/config/deepspeed_zero2.json: ZeRO Stage 2 (recommended for most cases)
  • data/config/deepspeed_zero3.json: ZeRO Stage 3 (for very large models)

A series of Accelerate configurations are provided:

  • data/config/accelerate_config_1node_{N}gpu.yaml: Pre-configured for single-node, N-GPU setup

Environment Variables

Control multi-GPU behavior with these environment variables:

# Force single GPU usage (useful for debugging)
export FORCE_SINGLE_GPU=1

# Disable DeepSpeed (use regular DDP)
export DISABLE_DEEPSPEED=1

# Control concurrent local model instances
export LOCALMODEL_MAX_CONCURRENT=2

Basic Usage (Automatic Configuration)

The system automatically detects and uses available GPUs:

from utils.policy_utils import create_policy_from_string

# Create model - will auto-detect GPUs
model = create_policy_from_string("meta-llama/Llama-3.2-1B-Instruct")

# Train with SFT - automatically uses all available GPUs
trained_model = await model.train_sft_async(
    samples=training_samples,
    validation_samples=validation_samples,
)

Using Accelerate Launch

For explicit control over distributed training:

# Launch with accelerate (uses config file for single-node, single-GPU setup)
TRAINER_TYPE=rl POLICY_LIST_MODE=Qwen/Qwen3-0.6B DOMAIN_NAME=Forecasting GRADER_TYPE=python_brier RL_NUM_SAMPLES=100 VALIDATION_STRATEGY=train accelerate launch --config_file data/config/accelerate_config_1node_1gpu.yaml -m scripts.run_trainers

# Or configure interactively
accelerate config
TRAINER_TYPE=rl POLICY_LIST_MODE=Qwen/Qwen3-0.6B DOMAIN_NAME=Forecasting GRADER_TYPE=python_brier RL_NUM_SAMPLES=100 VALIDATION_STRATEGY=train accelerate launch -m scripts.run_trainers

SFT Training with Multi-GPU

from core.trainer.sft import SFTTrainer, SFTConfig

# Configure SFT for multi-GPU
config = SFTConfig(
    num_epochs=2,
    learning_rate=2e-5,
    batch_size=4,  # Per-device batch size
    gradient_accumulation_steps=2,
)

trainer = SFTTrainer(config)
trained_policy = await trainer.train_async(
    policy=model,
    trajectory_score_files=["path/to/trajectories.json"],
)

RL Training with Multi-GPU

from core.trainer.rl import RLTrainer, RLConfig
from core.grader.python_brier import PythonBrierGrader

# Configure RL for multi-GPU
config = RLConfig(
    num_epochs=2,
    learning_rate=1e-6,
    batch_size=2,  # Per-device batch size
    kl_coef=0.1,
)

trainer = RLTrainer(config)
grader = PythonBrierGrader()

trained_policy = await trainer.train_async(
    policy=model,
    problem_list=problems,
    grader=grader,
)

About

Truth-seeking evaluation and training for language models.
