Project-Prevail/TruthSeekingGym-Code-Public

TruthSeekingGym

A unified framework for evaluating and training language models on truth-seeking behavior.

Overview

TruthSeekingGym provides infrastructure for both evaluating how well models seek truth and training models to improve truth-seeking behavior. It offers a consistent interface across API models, local models, batch APIs, Claude Code agents, and even humans.

Core Abstractions

  • Policies — Unified interface for all model types: OpenAI (GPT-4.1, GPT-5, o3, o4-mini), Anthropic (Claude), Google (Gemini), local HuggingFace models, human input, and Claude Code agents
  • Domains — Problem sets with verifiable answers: research analysis, forecasting, debate evaluation (ChangeMyView), conceptual reasoning
  • Evaluation Paradigms — Multiple experimental setups for operationalizing "truth-seeking":
    • Ground-truth accuracy: Does the model reach correct conclusions?
    • Martingale property: Are belief updates unpredictable from prior beliefs? (predictable updates suggest bias)
    • Sycophantic reasoning: Does reasoning quality degrade when the user expresses an opinion?
    • Mutual predictability: Does knowing a model's answers on some questions help predict its answers on others?
    • World-in-the-loop: Are the model's claims useful for making accurate predictions about the world?
    • Qualitative judgment: Does reasoning exhibit originality, curiosity, and willingness to challenge assumptions?
  • Reasoning Modes — Generation strategies: direct inference, chain-of-thought, self-debate, bootstrap interview, length-controlled generation
  • Graders — Reward functions for training: Python-based (Brier scores) and LLM-based evaluation
  • Trainers — Training strategies: supervised fine-tuning (SFT), reinforcement learning (RL), few-shot in-context learning
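The martingale check above amounts to asking whether belief updates are predictable from prior beliefs. Below is a minimal sketch of that idea using a regression on synthetic data; the actual implementation lives in core/algo/martingale.py and is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
priors = rng.uniform(0.2, 0.8, size=500)  # synthetic prior beliefs

# Unbiased updater: belief changes are mean-zero noise, independent of the prior.
unbiased_updates = rng.normal(0.0, 0.05, size=500)
# Biased updater: systematically pulls extreme priors back toward 0.5,
# so the update is predictable from the prior.
biased_updates = 0.5 * (0.5 - priors) + rng.normal(0.0, 0.05, size=500)

def update_slope(priors, updates):
    """Regression slope of update on prior: near 0 for a martingale,
    clearly nonzero when updates are predictable (suggesting bias)."""
    slope, _intercept = np.polyfit(priors, updates, 1)
    return slope

print(f"unbiased slope: {update_slope(priors, unbiased_updates):.3f}")
print(f"biased slope:   {update_slope(priors, biased_updates):.3f}")
```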

Supported Policies

The following policies are supported via create_policy_from_string(). Pass the string in the "Policy String" column to create a policy.

| Policy String | Provider | Model Type | Notes |
|---|---|---|---|
| human | N/A | Special | CLI-based human input |
| claude-code | N/A | Special | Claude Code agent integration |
| HuggingFace model ID | HuggingFace/Local | LocalModel | e.g., Qwen/Qwen3-235B-A22B-Thinking-2507 |
| Path from data/models/ | Local | LocalModel | Relative path starting from data/models/ |
| gemini-embedding-001 | Google | Embedding | Requires USE_RAY=1 |
| Qwen/Qwen3-Embedding-8B | Local | Embedding | Local SGLang-based |
| Qwen/Qwen3-Embedding-4B | Local | Embedding | Local SGLang-based |
| Qwen/Qwen3-Embedding-0.6B | Local | Embedding | Local SGLang-based |
| gpt-4.1-nano | OpenAI | API | |
| gpt-4.1-mini | OpenAI | API | |
| gpt-4.1 | OpenAI | API | |
| gpt-5 | OpenAI | API | |
| gpt-5-mini | OpenAI | API | |
| gpt-5-nano | OpenAI | API | |
| gpt-o3 | OpenAI | API | Alias for o3 |
| o3 | OpenAI | API | |
| o3-2025-04-16 | OpenAI | API | |
| gpt-o4-mini | OpenAI | API | Alias for o4-mini |
| o4-mini | OpenAI | API | |
| o4-mini-2025-04-16 | OpenAI | API | |
| gpt-4o | OpenAI | API | |
| deepseek-v3 | Together/DeepSeek | API | |
| llama-4-scout | Together/Meta | API | |
| llama-4-maverick | Together/Meta | API | |
| claude-sonnet-4 | Anthropic | API | |
| claude-opus-4 | Anthropic | API | |
| claude-opus-4.1 | Anthropic | API | |
| claude-3-5-haiku | Anthropic | API | |
| deepseek-r1 | Together/DeepSeek | API | |
| gemma-3-27b-it | Together/Google | API | |
| gemma-3-12b-it | Together/Google | API | Via OpenRouter only |
| gemma-3-4b-it | Together/Google | API | Via OpenRouter only |
| gemma-2-27b-it | Together/Google | API | |
| gemma-3n-e4b-it | Together/Google | API | |
| llama-3-1-8b-instruct | Together/Meta | API | |
| qwen-3-235b-a22b-instruct | Together/Qwen | API | |
| qwen-3-235b-a22b-thinking | Together/Qwen | API | |
| qwen-3-235b-a22b | Together/Qwen | API | |
| qwen-3-32b | Together/Qwen | API | |
| qwen-3-14b | Together/Qwen | API | |
| qwen-3-14b-base | Together/Qwen | API | Direct provider only |
| qwen-3-8b | Together/Qwen | API | |
| qwen-3-8b-base | Together/Qwen | API | Direct provider only |
| qwen-2-5-7b | Together/Qwen | API | |
| mistral-small-3.1-24b-instruct | Together/Mistral | API | Via OpenRouter only |
| mistral-small-24b-instruct-2501 | Together/Mistral | API | Direct provider only |
| kimi-k2 | Together/Moonshot | API | |
| gemini-2.0-flash | Google | API | |
| gemini-2.5-flash | Google | API | Via OpenRouter only |
| gemini-2.5-pro | Google | API | |

Notes:

  • Some models are only available via OpenRouter (when USE_OPENROUTER=1) or direct provider access
  • LocalModel entries accept either HuggingFace-hosted model IDs or relative paths from data/models/ for locally saved models
  • Trained models saved in data/models/ are automatically detected and loaded

Adding New Models

To add support for a new model, edit utils/policy_utils.py and add an entry to the candidate_policies dictionary:

# For OpenRouter mode (USE_OPENROUTER=1):
candidate_policies = {
    # ... existing entries ...
    "your-model-name": ("provider/model-id", "your-model-name", "openrouter provider"),
    # Example: "gpt-4.1-mini": ("openai/gpt-4.1-mini", "gpt-4.1-mini", "openrouter openai"),
}

# For direct provider mode (USE_OPENROUTER=0):
candidate_policies = {
    # ... existing entries ...
    "your-model-name": ("exact-api-model-id", "your-model-name", "provider"),
    # Example: "gpt-4.1-mini": ("gpt-4.1-mini-2025-04-14", "gpt-4.1-mini", "openai"),
}

Each entry is a tuple of (api_model_id, colloquial_name, provider):

  • api_model_id: The exact model ID used in API calls
  • colloquial_name: The short name used for display and file naming
  • provider: The provider string (openai, anthropic, together, google, or openrouter <provider> for OpenRouter)

For local models, pass the HuggingFace model ID directly to create_policy_from_string(); no configuration is needed.
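As an illustration of how this string resolution could work, here is a hypothetical, simplified sketch; the actual logic in utils/policy_utils.py handles more cases (local paths, embedding models, Ray acceleration, and so on).

```python
# Hypothetical, simplified resolution logic. The two dictionary entries below
# mirror IDs that appear in this README; everything else is illustrative.
candidate_policies = {
    "gpt-4.1-mini": ("gpt-4.1-mini-2025-04-14", "gpt-4.1-mini", "openai"),
    "o4-mini": ("o4-mini-2025-04-16", "o4-mini", "openai"),
}

def resolve_policy(name: str) -> tuple[str, str, str]:
    """Return (api_model_id, colloquial_name, provider) for a policy string."""
    if name in candidate_policies:
        return candidate_policies[name]
    # Anything unrecognized is treated as a HuggingFace model ID for LocalModel.
    return (name, name.split("/")[-1], "local")

print(resolve_policy("o4-mini"))
print(resolve_policy("Qwen/Qwen3-8B"))
```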

Infrastructure

  • Ray parallelization for high-throughput workloads (100k+ tokens/second)
  • Batch APIs for 50% cost reduction (24-48 hour latency)
  • Multi-GPU training with automatic DeepSpeed ZeRO-2/3 detection
  • Full async support across inference and training

Installation and Usage

Prerequisites

  • Python 3.10+
  • Git LFS (for data submodule)

Installation

# Clone with submodules
git clone --recurse-submodules https://github.com/Project-Prevail/TruthSeekingGym-Code-Public.git
cd TruthSeekingGym-Code-Public

# Initialize data submodule (HuggingFace dataset)
cd data && git checkout main && cd ..

# Install dependencies (using uv recommended)
uv venv && uv pip install -e . -e lib/safety_tooling

# Or with pip
pip install -e . -e lib/safety_tooling

Git will automatically fetch the data from the HuggingFace repository.

API Keys Setup

Create lib/safety_tooling/.env with your API keys:

OPENROUTER_API_KEY=your_key
OPENAI_API_KEY=your_key
ANTHROPIC_API_KEY=your_key

Web GUI

cd web
./start.sh

This starts the web GUI, from which you can configure and launch evaluation and training runs.

Quick Start (CLI)

# Minimal example: evaluate one model on one domain
ALGO_NAMES=GroundTruthAccuracy \
DOMAIN_NAMES=Research \
REASONING_MODE_NAMES=DirectInference \
POLICY_LIST=gpt-4.1-mini \
USE_OPENROUTER=1 \
NUM_TRAJECTORIES=5 \
python -m scripts.run_reasoning
# Full example: evaluate, analyze, then train
# Evaluation
export ALGO_NAMES=MartingaleStrategy
export DOMAIN_NAMES=Research,Forecasting
export REASONING_MODE_NAMES=DirectInference,ChainOfThought
export POLICY_LIST_MODE=frontier # evaluate all frontier models
export USE_RAY=1 # use Ray for parallelization
export USE_OPENROUTER=1 # use OpenRouter for model routing (required for Gemini models)
python -m scripts.run_reasoning

# Analysis
export ANALYZERS=all 
export DIR_NAME=run-XXXXX  # Use your run directory
python -m scripts.run_analyzers

# Training
export SFT_USING_EVAL_RESULTS=1
export RL_USING_DOMAIN=1
export DIR_NAME=run-XXXXX  # For SFT/FEWSHOT: directory with trajectory scores
export DOMAIN_NAMES=Forecasting  # For RL: domains to train on
python -m scripts.run_trainers

The evaluation script produces reasoning-trajectories-raw.json and bias-eval-results-*.json in the run directory within the data/runs folder. The training script uses trajectory scores from analyzers or domains directly for RL training.

Each run directory is named run-{RUN_ID}. After a run finishes, you may rename the part after run- to something more descriptive, which changes the run ID. You can also set the run ID explicitly via the environment variable RUN_ID=xxx; if a directory with that run ID already exists, data from the previous run is loaded and analyzed without re-executing the reasoning.

You may use the DIR_NAME environment variable to set where run directories are stored, relative to the data/runs folder; this location may contain multiple runs, nested recursively.

Low-Level Abstractions

The framework integrates TianyiQ/LMPortal as the lower-level abstraction.

Example 1: Basic Flexible Inference

The infer() method is the recommended way to do inference - it accepts multiple input types and returns appropriate outputs:

from utils.policy_utils import create_policy_from_string

# Create a policy (automatically detects provider)
policy = create_policy_from_string("o4-mini")

# Simple string inference
response = policy.infer("What is the capital of France?")
print(response)  # Returns: str

# Or with history
response = policy.infer([
    {"role": "user", "content": "What is 2+2?"}
])
print(response)  # Returns: str

# Getting logprobs of held-out response
conversation_logprobs = policy.logprobs_single([
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
])
prompt_logprobs = policy.logprobs_single([
    {"role": "user", "content": "What is 2+2?"},
])
print(conversation_logprobs - prompt_logprobs)  # Returns: float

Example 2: Inference with Problems and Domains

The flexible infer() method can directly work with Problems and Domains:

from core.domain.forecasting import Forecasting
from utils.policy_utils import create_policy_from_string

policy = create_policy_from_string("o4-mini")
domain = Forecasting()

# Infer from a single problem
problem = domain.sample_problems(n=1)[0]
result = policy.infer(problem.to_sample())
print(f"Question: {result.history[0]['content']}")
print(f"Answer: {result.output}")  # Returns: SingleSample

# Infer directly from domain (samples 1 problem automatically)
result = policy.infer(domain)
print(result)  # Returns: SingleSample

Example 3: Batch Flexible Inference

The infer_many() method handles batch inference with flexible input types:

from utils.policy_utils import create_policy_from_string
from core.domain.conceptual import Conceptual

policy = create_policy_from_string("o4-mini")
domain = Conceptual()

# Batch inference from multiple problems
problems = domain.sample_problems(n=3)
samples = [p.to_sample() for p in problems]
results = policy.infer_many(samples)
for result in results:
    print(f"Q: {result.history[0]['content']}")
    print(f"A: {result.output}")
# Returns: list[SingleSample]

# Or directly from domain with count
results = policy.infer_many((domain, 5))  # Sample 5 problems from domain
print(f"Generated {len(results)} responses")  # Returns: list[SingleSample]

Example 4: Working with Domains

from core.domain.forecasting import Forecasting

# Load domain
domain = Forecasting()

# Sample problems
problems = domain.sample_problems(n=5, split="train")

for problem in problems:
    print(f"Q: {problem.question}")
    if hasattr(problem, "correct_option"):
        print(f"Answer: {problem.options[problem.correct_option]}")

    # Convert problem to Sample for inference
    sample = problem.to_sample()
    print(f"Sample history: {sample.history}")

Example 5: Human-AI Dialogue

Create interactive dialogues between human and AI policies:

from utils.policy_utils import create_policy_from_string

# Create policies
human = create_policy_from_string("human")
ai = create_policy_from_string("o4-mini")

# Start dialogue
history = []
for turn in range(3):
    # Human turn
    human_msg = human.infer_from_history(history)
    history.append({"role": "user", "content": human_msg})
    print(f"Human: {human_msg}")

    # AI turn
    ai_msg = ai.infer_from_history(history)
    history.append({"role": "assistant", "content": ai_msg})
    print(f"AI: {ai_msg}")

Example 6: Claude Code Agent Inference

Use Claude Code agents for complex reasoning tasks:

from utils.policy_utils import create_policy_from_string

# Create Claude Code agent policy
agent = create_policy_from_string("claude-code")

# Infer with code execution capabilities
result = agent.infer("Write a Python function to calculate fibonacci numbers and test it with n=10")
print(f"Agent response: {result}")

Example 7: Supervised Fine-Tuning

SFT trainer accepts list[SingleSample] directly.

from utils.policy_utils import create_policy_from_string
from core.policy.schema import SingleSample
from core.trainer.sft import SFTTrainer, SFTConfig

# Prepare training data
samples = [
    SingleSample(
        history=[{"role": "user", "content": "What is 2+2?"}],
        output="4",
    ),
    SingleSample(
        history=[{"role": "user", "content": "What is the capital of France?"}],
        output="Paris",
    ),
    # ... more samples
]

# Create trainer
config = SFTConfig(
    num_epochs=2,
    learning_rate=1e-5,
    validation_strategy="train"  # split from training set
)
trainer = SFTTrainer(config)

# Train (creates new policy, doesn't modify original)
base_policy = create_policy_from_string("gpt-4o")
trained_policy = trainer.train(
    policy=base_policy,
    samples=samples
)

Example 8: Few-Shot Learning

Few-shot trainer also accepts list[SingleSample].

from utils.policy_utils import create_policy_from_string
from core.policy.schema import SingleSample
from core.trainer.fewshot import FewShotTrainer

# Prepare few-shot examples
examples = [
    SingleSample(
        history=[{"role": "user", "content": "Translate to French: Hello"}],
        output="Bonjour",
    ),
    SingleSample(
        history=[{"role": "user", "content": "Translate to French: Goodbye"}],
        output="Au revoir",
    ),
]

# Create policy with few-shot examples
trainer = FewShotTrainer()
base_policy = create_policy_from_string("o4-mini")
fewshot_policy = trainer.train(
    policy=base_policy,
    samples=examples
)

# Now use the policy with in-context examples
response = fewshot_policy.infer("Translate to French: Thank you")
print(response)

Example 9: Reinforcement Learning with Graders

from core.domain.forecasting import Forecasting
from core.trainer.rl import RLTrainer, RLConfig
from core.grader.python_brier import PythonBrierGrader
from utils.policy_utils import create_policy_from_string

# Setup
domain = Forecasting()
problems = domain.sample_problems(n=100, split="train")

# Create grader and trainer
grader = PythonBrierGrader()
config = RLConfig(num_epochs=3, learning_rate=1e-6, kl_coef=0.1)
trainer = RLTrainer(config)

# Train with RL
base_policy = create_policy_from_string("o4-mini")
trained_policy = trainer.train(
    policy=base_policy,
    problem_list=problems,
    grader=grader
)

Example 10: End-to-End Workflow

Complete workflow from domain to inference to training, using self-labeled training as an example:

from core.domain.conceptual import Conceptual
from utils.policy_utils import create_policy_from_string
from core.trainer.sft import SFTTrainer, SFTConfig

# 1. Load domain and sample problems
domain = Conceptual()
problems = domain.sample_problems(n=10, split="train")

# 2. Generate responses with base policy
policy = create_policy_from_string("o4-mini")
samples = [p.to_sample() for p in problems]
results = policy.infer_many(samples)

# 3. Use results as training data
trainer = SFTTrainer(SFTConfig(num_epochs=1))
trained_policy = trainer.train(policy=policy, samples=results)

# 4. Test trained policy
test_problem = domain.sample_problems(n=1, split="test")[0]
response = trained_policy.infer(test_problem.to_sample())
print(f"Q: {response.history[0]['content']}")
print(f"A: {response.output}")

Example 11: Async Inference and Training Across Multiple Domains

Run inference and training on multiple domains in parallel.

import asyncio
from core.domain.conceptual import Conceptual
from core.domain.forecasting import Forecasting
from utils.policy_utils import create_policy_from_string
from core.trainer.sft import SFTTrainer

policy = create_policy_from_string("o4-mini")

async def process_domain(domain, policy, trainer):
    """Infer and train on a single domain"""
    # Generate training data
    problems = domain.sample_problems(n=5, split="train")
    samples = [p.to_sample() for p in problems]
    results = await asyncio.gather(*[policy.infer_async(s) for s in samples])

    # Train and return
    return await trainer.train_async(policy=policy, samples=results)

async def main():
    trainer = SFTTrainer()

    # Process multiple domains in parallel
    domains = [Conceptual(), Forecasting()]
    trained_policies = await asyncio.gather(
        *[process_domain(d, policy, trainer) for d in domains]
    )

    print(f"Trained {len(trained_policies)} policies in parallel")

asyncio.run(main())

Everything else in this library is asynchronous as well; the snippet above is just one example. Note that it is strongly recommended to instantiate policies (whether through the create_policy_from_string interface or through policy classes such as LocalModel) outside of asynchronous contexts, to avoid potential event loop issues.

Example 12: Local Model Training with Multi-GPU

import asyncio

from utils.policy_utils import create_policy_from_string
from core.trainer.sft import SFTTrainer, SFTConfig
from core.policy.schema import SingleSample

# Create local model (automatically uses all available GPUs).
# As noted above, instantiate policies outside of async contexts.
policy = create_policy_from_string("meta-llama/Llama-3.2-1B-Instruct")

# Prepare samples
samples = [
    SingleSample(
        history=[{"role": "user", "content": "Hello"}],
        output="Hi there!",
    ),
    # ... more samples
]

async def main():
    # Train with DeepSpeed ZeRO-2 (automatic)
    trainer = SFTTrainer(SFTConfig(num_epochs=2))
    trained_model = await trainer.train_async(
        policy=policy,
        samples=samples
    )
    return trained_model

asyncio.run(main())

Core Components Documentation

Algorithms (core/algo/)

  • accuracy.py - Provides GroundTruthAccuracy for evaluating model accuracy against ground truth (previously coupled with MartingaleStrategy, now independent)
  • mutualpredict.py - Provides MutualPredictStrategy for measuring mutual predictability between models, as described in Wen et al. (2025)
  • sycoreason.py - Provides SycophanticReasoning for measuring sycophancy towards user opinion, including both expressed opinion and IQ change due to user (dis)agreement
  • qualitative.py - Provides QualitativeJudge for evaluating free-form text responses, where a judge (arbitrary Policy, can be human) looks for truth-seeking qualities in the response
  • martingale.py - Implements martingale-based strategies for bias evaluation and correction (He et al., 2025)
  • worldintheloop.py - Implements "world-in-the-loop" evaluation strategies, where we test the helpfulness of model outputs for predicting real-world observations gathered by an investigation agent
  • graderwrapper.py - Provides GraderWrapper for wrapping any Grader (model-based or Python-based) as a DebiasStrategy for evaluation

Algorithm Configuration

All evaluation strategies now expose a typed AlgoConfig accessible via strategy.get_config() and serialized in bias result files under the config field.

Config values are set via environment variables - see the Environment Variables section for more details.

Implemented configs:

  • GroundTruthAccuracyConfig (metric, mode)
  • MartingaleConfig (belief_change_type, sample_granularity, regularization_type, informative_switch, informative_coef)
  • MutualPredictConfig (predictor_choice, predicted_choice, target_question_choice, context_question_choice, k_context, trials_per_sample, scoring, predictee_behavior_examples)
  • QualitativeJudgeConfig (instruction, red_team_mode, include_self_answer, few_shot_strategy, few_shot_source_path, example_recomputation_rounds, few_shot_bootstrap_rounds, bootstrap_respect_permutations)
  • SycophanticReasoningConfig (metric, mode)
  • GraderWrapperConfig (grader_spec, grader_type, grader_model)

Reasoning Modes (core/reasoning/)

  • direct.py - Full response as one single step. Reasoning models have two steps: reasoning and response
  • cot.py - Chain of Thought reasoning implementation
  • debate.py - Self-debate reasoning implementation where models argue different positions
  • bootstrap.py - Bootstrap Interview mode that asks auxiliary questions before the main question to build reasoning capacity
  • length_control.py - Length-controlled reasoning mode for controlling response verbosity

Reasoning Configuration

All reasoning modes now support typed ReasoningConfig subclasses for configuration:

  • DirectConfig - Configuration for direct inference mode
  • CoTConfig - Configuration for Chain of Thought mode
  • DebateConfig - Configuration for self-debate mode (num_turns)
  • BootstrapInterviewConfig - Configuration for bootstrap interview mode (num_auxiliary_questions, auxiliary_mode, instruction_types)
  • LengthControlConfig - Configuration for length-controlled reasoning mode

Analysis (core/analyzer/)

  • performance_comparison.py - Comparing performance between different policies, and testing the soundness of performance scores
  • evaluation_relationship.py - Correlational and causal relationship between scores from different evaluation algorithms
  • causal_attribution.py - How different features causally contribute to the evaluation score
  • cross_setup_agreement.py - Under the same evaluation algorithm, compare scores that the same policy/trajectory get from different evaluation setups
  • token_level_evidence.py - Token-level analysis, e.g. visualizing how much each token contributes to the final score
  • trajectory_score.py - Aggregates and sorts trajectories by score for training pipeline
  • training_causal_effect.py - Analyzes causal effects of training interventions
  • results_catalog.py - Cataloging all run files to be displayed in the web app

Analysis Framework

  • Data models:
    • RunFiles: shared and per-eval-strategy file references in a run directory
    • Run: mirrors a single runs catalog entry; holds all co-existing eval strategies' configs
    • Setup: a condition grouping of runs with identical (algo, domain, reasoning_mode, system_prompt)
  • Analyzer interface (scale-aware methods; analyzers implement any subset):
    • analyze_trajectory(trajectory, *, run, eval_strategies)
    • analyze_run(run, *, eval_strategies)
    • analyze_setup(setup, *, eval_strategies)
    • analyze_across_setups(setups, *, eval_strategies)
  • Orchestration utilities (utils/analyzer_utils.py):
    • discover_run(run_dir|bias_file) → Run
    • discover_runs_in_batch_dir(batch_dir) → Run[]
    • group_setups(runs) → Setup[]
    • collect_eval_strategies_from_runs(runs) → list of AlgoStrRepr
    • run_all_analyzers_for_two_batches(dir_a, dir_b, clean_output=True)
    • run_analyzers_from_env() entrypoint used by scripts/run_analyzers.py
  • Output path policy:
    • data/analysis/<Analyzer>/<algo>/<domain>/<mode>/<prompt>/<AlgoStrRepr>/<model or ALL>/...
  • Running analyzers:
    • ANALYZERS=all DIR_NAME=run-XXX python -m scripts.run_analyzers
    • ANALYZERS=ResultsCatalogAnalyzer,CausalAttributionAnalyzer,CrossSetupAgreementAnalyzer DIR_NAME=run-XXX python -m scripts.run_analyzers

Domains (core/domain/)

  • research.py - Self-curated dataset of research questions, with an easy answer and a hard answer
  • conceptual.py - 31 very thorny conceptual or philosophical questions, meant to test the ability to (1) deconfuse, and (2) think outside the Overton window
  • forecasting.py - Forecasting domain for prediction tasks using Metaculus and Polymarket data
  • cmvbinary.py - ChangeMyView domain with binary opinion change evaluation
  • cmvfreeform.py - ChangeMyView domain with free-form opinion change evaluation
  • openreview.py - OpenReview domain for academic paper evaluation
  • intellectual.py - Intellectual demonstration domain for testing reasoning depth
  • wildchat.py - WildChat domain for evaluating on real user conversations

Policies (core/policy/)

All policies share a unified interface with flexible input handling—infer() and infer_many() accept strings, message lists, or dialogue objects.

  • apimodel.py - Standard API-based model interface for external LLM services
  • raymodel.py - API-based model accelerated with Ray-based parallelization (outperforms apimodel.py at throughputs above 100,000 tokens/s)
  • batchmodel.py - API-based model using the batch API to save costs. Costs are reduced by 50%, but one full run can take up to 48 hours. Recommended only when pooling many runs together
  • localmodel.py - Locally deployed model with full logprob access using SGLang backend
  • human.py - Implements the Human policy class, where conversations are shown on the command line and the human user types responses
  • claudecode.py - Claude Code integration for interactive coding assistance

Graders (core/grader/)

  • schema.py - Base Grader abstract class and factory functions for creating graders
  • python_brier.py - Extracts \finalBeliefProb{X} patterns and calculates Brier scores
  • model_brier.py - Uses LLMs to extract beliefs and calculate Brier scores
  • python_grader.py - Base class for Python-based graders executed on OpenAI servers
  • model_grader.py - Base class for model-based graders using LLMs for evaluation

Training Pipeline (core/trainer/)

The training pipeline supports supervised fine-tuning (SFT), reinforcement learning (RL), and few-shot in-context learning approaches to improve model performance based on trajectory-score pairs from evaluation runs.

Training Strategies

  • sft.py - Supervised Fine-Tuning trainer that selects top-scoring trajectories and fine-tunes models on them

    • Supports OpenAI and Together AI fine-tuning APIs for API models
    • Supports local fine-tuning with trl and deepspeed for LocalModel
    • Configurable via SFT_TOP_PERCENTAGE, SFT_NUM_EPOCHS, SFT_LEARNING_RATE, etc.
    • Validation Support: Automatic validation set creation with three strategies:
      • none: No validation (default)
      • train: Split a portion from training set
      • gt: Use ground-truth aligned samples only
  • rl.py - Reinforcement Learning trainer using custom graders with OpenAI's RL API

    • Supports both Python and model-based graders for reward calculation
    • Configurable via RL_NUM_EPOCHS, RL_LEARNING_RATE, etc.
    • Uses graders defined in core/grader/ for scoring model outputs
  • fewshot.py - Few-shot trainer that formats top trajectories as in-context examples

    • Creates new policies with prepended few-shot examples
    • Randomly selects one sample per trajectory to avoid duplicates
    • Configurable via FEWSHOT_TOP_COUNT, FEWSHOT_TOP_PERCENTAGE

Training Architecture

The training system follows a clean separation of concerns:

  1. Trainers (core/trainer/) - Handle data compilation and selection

    • Load trajectory-score pairs from analyzer outputs
    • Select top trajectories based on configuration
    • Convert trajectories to appropriate training formats
    • Validation Management: Automatic validation set creation with intelligent sizing
    • Ground Truth Filtering: Filter validation samples to only include correctly aligned beliefs
  2. Policies (core/policy/) - Handle model management and training execution

    • Implement train_sft() for supervised fine-tuning with validation support
    • Implement add_few_shot_examples() for in-context learning
    • Use deep_copy() utility for creating trained policy instances with proper naming and metadata
    • Real-time Monitoring: Display training/validation losses during fine-tuning
    • WandB Integration: Log comprehensive training metrics including losses, learning rate, gradient norms
  3. Reasoning Modes (core/reasoning/) - Handle trajectory-to-sample conversion

    • Each reasoning mode implements trajectory_to_samples() to convert trajectories to training samples
    • Respects the trainable flag on reasoning steps to exclude system/non-trainable content

Training Workflow

  1. Generate trajectories and scores: Run evaluation with desired algorithms to produce trajectory-score pairs

  2. Run analyzers: Use TrajectoryScoreAnalyzer to aggregate and sort trajectories by score

  3. Train policies: The training pipeline automatically detects trajectory score files and trains policies:

    # After running evaluation and analysis
    export TRAINERS=SFTTrainer,FewShotTrainer  # or "all"
    export SFT_TOP_PERCENTAGE=0.1  # Train on top 10%
    export FEWSHOT_TOP_COUNT=100   # Use top 100 examples
    python -m scripts.run_trainers
  4. Trained model storage: Trained models are saved in data/models/ with unique names:

    • Format: {base_name}-{training_type}-{hash}
    • Example: gpt-4.1-mini-sft-a36b1f2c3f84
    • Metadata saved alongside including training configuration and source files
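The naming scheme above can be sketched as follows. This is an illustration only: hashing the JSON-serialized training configuration is an assumption, not necessarily what the pipeline actually hashes.

```python
import hashlib
import json

def trained_model_name(base_name: str, training_type: str, config: dict) -> str:
    """Illustrative sketch of the {base_name}-{training_type}-{hash} format.
    Hash input (the serialized config) is an assumption for this example."""
    payload = json.dumps(config, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"{base_name}-{training_type}-{digest}"

print(trained_model_name("gpt-4.1-mini", "sft", {"num_epochs": 2}))
```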

Configuration

Training behavior is controlled via environment variables and TrainingConfig:

  • TRAINERS: Which trainers to use (SFTTrainer, FewShotTrainer, or "all")
  • VALIDATION_STRATEGY: Validation set strategy ("none", "train", "gt") (default: "none")
    • none: No validation set
    • train: Split validation from training set
    • gt: Use ground-truth aligned samples only for validation
  • SFT_TOP_PERCENTAGE: Percentage of top trajectories for SFT (default: 0.1)
  • SFT_NUM_EPOCHS: Number of training epochs (default: 2)
  • SFT_LEARNING_RATE: Learning rate for fine-tuning (default: 1e-5)
  • FEWSHOT_TOP_COUNT: Maximum examples for few-shot (default: 100)
  • FEWSHOT_TOP_PERCENTAGE: Percentage for few-shot selection (default: 0.1)
  • WANDB_API_KEY: Optional WandB API key for training metrics logging

The system uses the minimum of count and percentage limits for few-shot to avoid excessive context length.
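A sketch of that selection rule, with defaults mirroring FEWSHOT_TOP_COUNT=100 and FEWSHOT_TOP_PERCENTAGE=0.1 (the function name is illustrative):

```python
def fewshot_limit(num_trajectories: int, top_count: int = 100,
                  top_percentage: float = 0.1) -> int:
    """Take the smaller of the absolute cap and the percentage cap,
    keeping the few-shot context from growing too long."""
    return min(top_count, int(num_trajectories * top_percentage))

print(fewshot_limit(5000))  # → 100 (percentage cap of 500 exceeds the count cap)
print(fewshot_limit(400))   # → 40 (10% of 400 is below the count cap)
```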

Codebase structure

core

Contains the core logic of the project.

  • core/algo: Contains the debiasing algorithms (e.g. Martingale, justified flipping)
  • core/analyzer: Contains the result analyzers (e.g. PerformanceComparisonAnalyzer, EvaluationRelationshipAnalyzer)
  • core/domain: Contains the problem domains (e.g. forecasting)
  • core/policy: Contains the policy models (e.g. API model, Local LLMs)
  • core/reasoning: Contains the reasoning modes (e.g. Self-Debate, CoT)

Each of these components has a schema file that describes the components' inputs and outputs, accompanied by a number of subclass files that implement the schema for specific algorithms/domains/policies/reasoning modes.

scripts

Contains scripts for data fetching, processing, and analysis.

  • scripts/run_reasoning.py: Contains the script for producing reasoning trajectories given any combination of reasoning modes, domains, policies, evaluation algorithms, and analyzers.
  • scripts/data/*: Contains the scripts for data fetching and organization.
  • scripts/misc/*: Contains all other scripts of long-standing value.
  • scripts/legacy/*: Contains all deprecated scripts that are only kept for backward compatibility.

utils

Contains utility functions for the project.

  • utils/policy_utils.py: Contains the utility functions for policy model creation and other policy-related operations.
  • utils/io_utils.py: Contains the utility functions for input/output operations, including the handling of JSON formatting.
  • utils/async_utils.py: Contains the utility functions for asynchronous operations.
  • utils/stats_utils.py: Contains tools for statistical analysis and plotting.
  • utils/nlp_utils.py: Contains tools for natural language processing.
  • utils/path_utils.py: Expands the PATH variable to include all levels of the project directory.
  • utils/analyzer_utils.py: Contains utility functions for calling analyzers.
  • utils/debate_processing_utils.py: Contains the utility functions for processing the debate data.
  • utils/judge_manipulation_utils.py: Contains the utility functions for manipulating the judge policy's belief, most useful for SelfDebate.
  • utils/killall.sh: Contains the script for killing all GPU processes, useful for LocalModel.
  • utils/templates/*: Contains prompt templates.

Environment Variables

The codebase uses various environment variables for configuration. Set them as needed before running experiments.

This section lists all supported environment variables, grouped into Features, Execution, and Experimental categories, followed by algorithm-specific groups.

Features

These variables control which features to evaluate. They are typically set from the UI selection.

  • ALGO_NAMES: Evaluation strategy(-ies) to use
    • Options: GroundTruthAccuracy, MartingaleStrategy, WorldInTheLoop, QualitativeJudge, MutualPredictStrategy, GraderWrapper
    • Required: Yes
  • DOMAIN_NAMES: Domain name(s) to evaluate
    • Options: Forecasting, OpenReview, CMVBinary, CMVFreeForm, Research, Conceptual, IntellectualDemonstration, WildChat
    • Required: Yes
  • REASONING_MODE_NAMES: Reasoning mode name(s)
    • Options: DirectInference, ChainOfThought, SelfDebate, BootstrapInterview, LengthControl
    • Required: Yes
  • POLICY_LIST: Policies to evaluate (comma-separated)
  • SYSTEM_PROMPT: System prompt(s) to test (comma-separated)
  • ANALYZERS: Result analyzers to run
    • Options: PerformanceComparisonAnalyzer, EvaluationRelationshipAnalyzer, CausalAttributionAnalyzer, CrossSetupAgreementAnalyzer, TokenLevelEvidenceAnalyzer, TrajectoryScoreAnalyzer, TrainingCausalEffectAnalyzer, ResultsCatalogAnalyzer

Execution

These variables control how the evaluation runs are executed.

  • NUM_TRAJECTORIES: Number of trajectories to generate per combination (default: differs by algorithm and domain)
    • Type: Number
  • DIR_NAME: Directory name for output; use "/" to indicate a subdirectory
    • Default: Generates timestamp-based name
  • DEBUG: Debug level (0=none, 1=basic, 2=verbose)
    • Options: 0, 1, 2
    • Default: 0
  • SAVE_TO_FILE: Back up untruncated console logs to file
    • Type: Boolean (0 or 1)
    • Default: false
  • USE_RAY: Use Ray for distributed processing of API calls
    • Type: Boolean (0 or 1)
    • Default: true
  • USE_OPENROUTER: Use OpenRouter for model routing (requires USE_RAY=true)
    • Type: Boolean (0 or 1)
    • Default: true
  • USE_BATCH: Use provider-specific batch APIs (requires USE_RAY=false)
    • Type: Boolean (0 or 1)
    • Default: false
  • PARALLEL_BATCH: Use async parallelism across runs
    • Type: Boolean (0 or 1)
    • Default: true
  • MAX_WORKERS: Maximum number of Ray workers
    • Type: Number
  • RERUN_INCOMPLETE: Rerun experiments that contain < NUM_TRAJECTORIES trajectories
    • Type: Boolean (0 or 1)
    • Default: true
  • RECOMPUTE_RESULTS: Whether to recompute and overwrite the existing final result JSON file
    • Type: Boolean (0 or 1)
    • Default: false
  • RECOMPUTE_TRAJECTORIES: When to recompute trajectories JSON file
    • Options: never, missing, always
      • "never": Never recompute trajectories, always use existing ones
      • "missing": Only generate trajectories if they don't exist
      • "always": Always regenerate trajectories (expensive)
    • Default: missing
  • RECOMPUTE_BELIEFS: When to recompute beliefs JSON file
    • Options: never, missing, always
      • "never": Never recompute beliefs, always use existing ones
      • "missing": Only measure beliefs if they don't exist
      • "always": Always remeasure beliefs (recommended for experimenting with different algorithms)
    • Default: missing
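As a sketch of how the caching flags interact: to reuse cached trajectories while re-measuring beliefs (e.g. when experimenting with a different algorithm), one might set:

```shell
# Keep existing trajectories, but always re-measure beliefs
export RECOMPUTE_TRAJECTORIES=never
export RECOMPUTE_BELIEFS=always
export RERUN_INCOMPLETE=0
python -m scripts.run_reasoning
```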

API Keys

  • OPENROUTER_API_KEY: OpenRouter API key
  • HUGGINGFACE_API_KEY: HuggingFace API key
  • TOGETHER_API_KEY: TogetherAI API key
  • OPENAI_API_KEY: OpenAI API key
  • ANTHROPIC_API_KEY: Anthropic API key
  • GOOGLE_API_KEY: Google API key
  • WANDB_API_KEY: Weights & Biases API key for training metrics logging

Debug

  • SHOW_PROGRESS: Show progress bars
    • Type: Boolean (0 or 1)
    • Default: true

Performance

  • NO_RETRY: Disable retry mechanism for API calls
    • Type: Boolean (0 or 1)
    • Default: false

Experimental

These variables control experimental features and algorithm-specific behaviors.

  • JUDGE_POLICY_NAMES: Judge policy names (comma-separated)
    • Dynamically set based on selected algorithms
  • TEMPERATURE: Model temperature for non-Ray API models
    • Type: Number
    • Default: 0.25
  • PRESENCE_PENALTY: Presence penalty for non-Ray API models
    • Type: Number
    • Default: 0.0
  • POLICY_LIST_MODE: Canonical policy list, overrides POLICY_LIST
    • Options: frontier, legacy, neurips
      • "frontier": Current default policy list (gpt-4.1, gpt-o3, deepseek-v3, llama-4, claude-sonnet-4, etc.)
      • "legacy": Legacy policy list (subset of frontier models)
      • "neurips": All 21 policies from batch-neurips directory including -confirmatory/-critical variants
  • FORBIDDEN_MODELS: Exclude policies whose names contain any of the given substrings (comma-separated)

Belief Measurement

  • DISABLE_SYSTEM_PROMPT_IN_BELIEF_MEASUREMENT: Disable system prompts for judge policies during belief measurement
    • Type: Boolean (0 or 1)
    • Default: true
  • USE_FIXED_JUDGE: Use a fixed judge for evaluation instead of the evaluated policy itself
    • Type: Boolean (0 or 1)
    • Default: true
  • OBJECTIVE_BELIEF: Judge estimates beliefs from the standpoint of the evaluated policy
    • Type: Boolean (0 or 1)
    • Default: true
  • USE_PER_TRAJ_BELIEF_MEASURE: Use per-trajectory belief measurement instead of per-step
    • Type: Boolean (0 or 1)
    • Default: true
  • DECOUPLE_TRAJECTORY_BELIEFS: Save belief measurement and trajectories in separate files
    • Type: Boolean (0 or 1)
    • Default: true
      • 0: Legacy mode - trajectories and beliefs stored together in reasoning-trajectories.json
      • 1: Decoupled mode - raw trajectories in reasoning-trajectories-raw.json, beliefs in reasoning-beliefs-{algorithm}.json

World-in-the-Loop

Available when WorldInTheLoop algorithm is selected.

  • WITL_RECOMPUTE_POLICY: Which World-in-the-Loop components to recompute
    • Options: investigation, uplift, upliftblanket, all
    • Default: all
  • FORECASTER_TEMPLATE: Template to use for presenting investigation results to forecaster
    • Options: vanilla, rephrase, toolcall
      • "vanilla" - Simple prediction prompt with direct investigation result as assistant response
      • "rephrase" - Rephrases investigation results to remove style familiarity, includes tool usage explanations
      • "toolcall" - Most natural approach: forecaster retrieves investigation report through formal tool call API, mentioning that Claude Code completed the investigation
    • Default: vanilla
  • WITL_PER_TOKEN_UPLIFT: Use per-token uplift calculation
    • Type: Boolean (0 or 1)
    • Default: true

Qualitative Judge

Available when QualitativeJudge algorithm is selected.

  • QUALITATIVE_JUDGE_USE_FEW_SHOT: Use few-shot examples in qualitative judge
    • Type: Boolean (0 or 1)
    • Default: true
  • QUAL_RED_TEAM_MODE: Qualitative judge red team mode
    • Options: none, red, red_blue, red_blue_resolution
    • Default: red_blue_resolution
  • QUAL_FEW_SHOT_PERMUTE: Permute few-shot examples
    • Type: Boolean (0 or 1)
    • Default: true
  • QUAL_EXAMPLE_RECOMPUTATION_ROUNDS: Example recomputation rounds
    • Type: Number
    • Default: 1
  • QUAL_FEW_SHOT_BOOTSTRAP_ROUNDS: Few-shot bootstrap rounds
    • Type: Number
    • Default: 1
  • QUAL_BOOTSTRAP_RESPECT_PERMUTE: Respect permutation in bootstrap
    • Type: Boolean (0 or 1)
    • Default: false
  • FEW_SHOT_SOURCE_PATH: Few-shot source path
    • Default: data/questions/conceptual_human_examples.json
  • FEW_SHOT_BOOTSTRAP_SOURCE_PATH: Few-shot bootstrap source path

Bootstrap Interview

Available when BootstrapInterview reasoning mode is selected.

  • BOOTSTRAP_NUM_AUXILIARY: Number of auxiliary questions to ask before the main question
    • Type: Number
    • Default: 3
  • BOOTSTRAP_MODE: Method for selecting auxiliary questions
    • Options: fixed_sequence, iid, stationary_ood, llm_preset, llm_adaptive
      • "fixed_sequence": Use predefined list of truth-seeking questions
      • "iid": Sample from same domain as the main question
      • "stationary_ood": Sample from other specified domains
      • "llm_preset": LLM generates all auxiliary questions at once
      • "llm_adaptive": LLM generates questions adaptively based on conversation
    • Default: iid
  • BOOTSTRAP_OOD_DOMAINS: Domains to sample from for stationary_ood mode (comma-separated)
    • Example: Research,Forecasting
  • BOOTSTRAP_GENERATOR_POLICY: Policy to use for generating auxiliary questions in LLM modes
    • Example: gpt-4.1-mini
  • BOOTSTRAP_INSTRUCTION_TYPES: Instruction types for LLM generation (comma-separated)
    • Options: curriculum, contradiction_seeking, synergistic, socratic
      • "curriculum": Build progressively in complexity
      • "contradiction_seeking": Focus on eliciting contradictions
      • "synergistic": Establish cross-domain connections
      • "socratic": Use Socratic questioning to uncover assumptions
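Putting these variables together, a hypothetical out-of-distribution bootstrap interview run could look like:

```shell
# Ask 3 auxiliary questions sampled from other domains before the main question
export REASONING_MODE_NAMES=BootstrapInterview
export BOOTSTRAP_MODE=stationary_ood
export BOOTSTRAP_NUM_AUXILIARY=3
export BOOTSTRAP_OOD_DOMAINS=Research,Forecasting
python -m scripts.run_reasoning
```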

GraderWrapper

Available when GraderWrapper algorithm is selected.

  • GRADER_SPEC: Grader specification for GraderWrapper (JSON string or env var name)
    • Type: String
    • Used to instantiate an arbitrary grader via create_grader_from_spec
  • GRADER_TYPE: Type of grader to use (if not using GRADER_SPEC)
    • Options: python_brier, model_brier, model
    • Used as fallback if GRADER_SPEC not provided
  • GRADER_MODEL: Model to use for model-based grading
    • Type: String (e.g., o1-mini, gpt-4)
    • Used when GRADER_TYPE is model_brier or model
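For reference, the python_brier grader family is built around the Brier score. The sketch below shows the standard binary Brier score, not the repo's implementation, and the function name is ours:

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast in [0, 1] and a
    binary outcome in {0, 1}. Lower is better; an uninformed forecast
    of 0.5 always scores 0.25."""
    return (forecast - outcome) ** 2

print(brier_score(0.9, 1))  # ~0.01: confident and correct
print(brier_score(0.9, 0))  # ~0.81: confident and wrong
```

Used as a training reward, the score is typically negated so that better-calibrated forecasts earn higher reward.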

Mutual Predictability

Available when MutualPredictStrategy algorithm is selected.

  • MP_PREDICTOR_CHOICE: MutualPredictStrategy predictor policy choice
    • Options: evaluated, random_non_evaluated, random_any
    • Default: random_non_evaluated
  • MP_PREDICTED_CHOICE: MutualPredictStrategy predicted policy choice
    • Options: evaluated, random_non_evaluated, random_any
    • Default: evaluated
  • MP_TARGET_QUESTION: MutualPredictStrategy target question choice
    • Options: evaluated_question, random_non_evaluated_question, random_any_question
    • Default: evaluated_question
  • MP_CONTEXT_QUESTIONS: MutualPredictStrategy context question choice
    • Options: evaluated_question, k_random_non_evaluated
    • Default: k_random_non_evaluated
  • MP_K_CONTEXT: Number of context questions
    • Type: Number
    • Default: 3
  • MP_TRIALS_PER_SAMPLE: Trials per sample to average
    • Type: Number
    • Default: 3
  • MP_SCORING: MutualPredictStrategy scoring method
    • Options: uplift, conditional_only, judge_consistency
    • Default: uplift
  • MP_PREDICTEE_BEHAVIOR_EXAMPLES: Number of predictee behavioral examples
    • Type: Number
    • Default: 0
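For instance, the default-like configuration spelled out explicitly (all values shown are the documented defaults):

```shell
export ALGO_NAMES=MutualPredictStrategy
export MP_PREDICTOR_CHOICE=random_non_evaluated
export MP_PREDICTED_CHOICE=evaluated
export MP_TARGET_QUESTION=evaluated_question
export MP_CONTEXT_QUESTIONS=k_random_non_evaluated
export MP_K_CONTEXT=3
export MP_SCORING=uplift
python -m scripts.run_reasoning
```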

Performance Comparison Analyzer

Available when PerformanceComparisonAnalyzer is selected.

  • REMOVE_N_OUTGROUP_SETUP: Number of outgroup data points (with most different setup names) to remove from per-setup performance comparison
    • Type: Number
    • Default: 0
  • REMOVE_N_OUTGROUP_ACROSS_SETUPS: Number of outgroup data points (with most different setup names) to remove from across-setups performance comparison
    • Type: Number
    • Default: 0
  • GROUP_MODE_SETUP: How to group data points in the per-setup performance comparison ("none", "qwen", "substr_KEY1=VALUE1_...")
    • Default: none
  • GROUP_MODE_ACROSS_SETUPS: How to group data points in the across-setups performance comparison ("none", "qwen", "substr_KEY1=VALUE1_...")
    • Default: none
  • ASSIGN_X_COORDS_SETUP: How to assign x coordinates when plotting per-setup performance comparison
    • Options: none, lexical, lexical_and_group, size, size_and_group, size_group_and_random
    • Default: none
  • ASSIGN_X_COORDS_ACROSS_SETUPS: How to assign x coordinates when plotting across-setups performance comparison
    • Options: none, lexical, lexical_and_group, size, size_and_group, size_group_and_random
    • Default: none
  • ELO_CORR_X_TICKS_SETUP: How to assign x ticks when plotting per-setup performance comparison ("names" or list of string labels)
    • Default: none
  • ELO_CORR_X_TICKS_ACROSS_SETUPS: How to assign x ticks when plotting across-setups performance comparison ("names" or list of string labels)
    • Default: none
  • ELO_CORR_X_LABEL_SETUP: X-axis label for per-setup performance comparison
    • Default: Setup Index
  • ELO_CORR_X_LABEL_ACROSS_SETUPS: X-axis label for across-setups performance comparison
  • ELO_CORR_Y_COORDS_SETUP: How to transform y coordinates in the per-setup performance comparison ("none", "negate", "log", comma-separated list of python math functions to apply)
    • Default: none
  • ELO_CORR_Y_COORDS_ACROSS_SETUPS: How to transform y coordinates in the across-setups performance comparison ("none", "negate", "log", comma-separated list of python math functions to apply)
    • Default: none

Misc Features and Notes

File Structure

When decoupling is enabled, the following files are used:

  • reasoning-trajectories-raw.json - Raw reasoning content without belief measurements
  • reasoning-beliefs-{algorithm}.json - Belief measurements for specific algorithms
  • reasoning-trajectories.json - Legacy format (maintained for backward compatibility)

The system automatically detects and loads legacy files when decoupling is enabled, ensuring full backward compatibility.
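Downstream code might resolve the right file like this (a sketch; the file names follow the convention above, but the JSON schema inside is an assumption):

```python
import json
import os

def load_beliefs(run_dir: str, algorithm: str) -> dict:
    """Prefer the decoupled per-algorithm belief file; fall back to the
    legacy combined trajectories file if it is absent."""
    decoupled = os.path.join(run_dir, f"reasoning-beliefs-{algorithm}.json")
    legacy = os.path.join(run_dir, "reasoning-trajectories.json")
    path = decoupled if os.path.exists(decoupled) else legacy
    with open(path) as f:
        return json.load(f)
```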

Partial Results Recomputation

The evaluation framework supports selective recomputation of incomplete or failed evaluations through two environment variables:

RECOMPUTE_RESULTS

When RECOMPUTE_RESULTS=1, the system will:

  • Load existing bias-eval-results-[ALGO_NAME].json files instead of skipping them
  • Pass the existing results to the algorithm's compute_loss_async method
  • Allow algorithms to selectively recompute missing or failed components
  • Overwrite the existing results file with updated evaluations

This is particularly useful for:

  • Completing interrupted World-in-the-Loop evaluations
  • Rerunning failed investigations due to network issues or data access problems
  • Adding missing uplift calculations to existing evaluations

WITL_RECOMPUTE_POLICY (World-in-the-Loop only)

When used with RECOMPUTE_RESULTS=1, this controls what gets recomputed in World-in-the-Loop evaluations:

  • "all" (default): Recompute everything (task sampling, investigation, uplift) for trajectories with any missing fields (or missing entire trajectory)
  • "investigation": Only recompute investigation results (NOT subsequent uplift calculations) for trajectories with missing investigation results (or missing entire trajectory)
  • "uplift": Only recompute uplift rewards for trajectories that have investigation results but missing uplift values
  • "upliftblanket": Recompute uplift rewards for all trajectories that have investigation results (including those with existing uplift values)
# Recompute all missing components in an existing World-in-the-Loop evaluation
RECOMPUTE_RESULTS=1 WITL_RECOMPUTE_POLICY=all python -m scripts.run_reasoning

# Only fill in missing investigation results (useful after data access issues / investigator agent API issues are resolved)  
RECOMPUTE_RESULTS=1 WITL_RECOMPUTE_POLICY=investigation python -m scripts.run_reasoning

# Only compute missing uplift values (when investigations completed but forecaster failed)
RECOMPUTE_RESULTS=1 WITL_RECOMPUTE_POLICY=uplift python -m scripts.run_reasoning

# Recompute all uplift values (useful when forecaster model or parameters changed)
RECOMPUTE_RESULTS=1 WITL_RECOMPUTE_POLICY=upliftblanket python -m scripts.run_reasoning

Local Model Choice

For evaluation strategies that involve log probabilities, we deploy local models to compute them.

Model availability depends on support in SGLang, which we use for deployment. As of Aug 12, 2025, we have tested the following local models.

Models tested to work:

  • Qwen/Qwen3-30B-A3B-Instruct-2507 (30B, LMArena #23; base available)
  • Qwen/Qwen3-0.6B (0.6B)
  • deepseek-v3 (685B, LMArena #37; base available; FP8 supported)
  • zai-org/GLM-4.5-Air (110B, LMArena #23 with reasoning; base available; FP8 supported)
  • mistral-small-3.2-24b-instruct-2503 (24B, LMArena #96; base available)
  • Llama-3.2-1B-Instruct (1B, LMArena #196; base available)

Note that for mutual predictability/WITL, smaller and weaker models may sometimes work better as the forecaster/judge.

Mutual Predictability (7-axis framework)

Key idea: measure conditional predictability via log probabilities with configurable axes. A single evaluated policy’s answers over many questions are scored by how much a predictor policy improves likelihood assignment to a target answer when given configurable context.

  • Core class: MutualPredictStrategy in core/algo/mutualpredict.py
  • Config: MutualPredictConfig with axes:
    • Predictor (Axis 2): EVALUATED_POLICY | RANDOM_NON_EVALUATED | RANDOM_ANY
    • Predicted (Axis 3): EVALUATED_POLICY | RANDOM_NON_EVALUATED | RANDOM_ANY
    • Target question (Axis 4): EVALUATED_QUESTION | RANDOM_NON_EVALUATED_QUESTION | RANDOM_ANY_QUESTION
    • Context questions (Axis 5): EVALUATED_QUESTION | K_RANDOM_NON_EVALUATED (k hyperparam)
    • Context responses (Axis 6): the evaluated policy’s answers (fixed)
    • Scoring (Axis 7): UPLIFT (Δcond−Δbase), CONDITIONAL_ONLY, or JUDGE_CONSISTENCY
    • Trials: trials_per_sample to average randomness
    • Predictee behavior examples: predictee_behavior_examples (default 0) adds a clearly labeled block of the target policy’s past Q/A to condition the predictor; 0 adds nothing.

Defaults reproduce the previous behavior (cross-question context from the evaluated policy; uplift scoring).
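On the scoring axis, one plausible reading of UPLIFT as a plain-Python sketch (illustrative only; the repo's exact computation may differ):

```python
def uplift(logp_with_context: float, logp_baseline: float) -> float:
    """How much the predictor's log-probability of the target answer
    improves once the evaluated policy's context answers are shown.
    Positive means the context made the answer more predictable."""
    return logp_with_context - logp_baseline

print(uplift(-1.1, -2.3) > 0)  # True: context helped
```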

Minimal example (legacy-like behavior)

from core.algo.mutualpredict import (
    MutualPredictStrategy, MutualPredictConfig,
    PredictorChoice, PredictedChoice,
    TargetQuestionChoice, ContextQuestionChoice,
    ScoringMethod,
)

config = MutualPredictConfig(
    predictor_choice=PredictorChoice.RANDOM_NON_EVALUATED,
    predicted_choice=PredictedChoice.EVALUATED_POLICY,
    target_question_choice=TargetQuestionChoice.EVALUATED_QUESTION,
    context_question_choice=ContextQuestionChoice.K_RANDOM_NON_EVALUATED,
    k_context=3,
    trials_per_sample=3,
    scoring=ScoringMethod.UPLIFT,
    predictee_behavior_examples=0,  # default
)

algo = MutualPredictStrategy(judge_policies=[judge_model], config=config)
loss, details = await algo.compute_loss_async(samples=reasoning_trajectories)

Peer-prediction-like setup (same question, cross-model), with predictee examples

config = MutualPredictConfig(
    predictor_choice=PredictorChoice.RANDOM_ANY,
    predicted_choice=PredictedChoice.RANDOM_ANY,
    target_question_choice=TargetQuestionChoice.EVALUATED_QUESTION,
    context_question_choice=ContextQuestionChoice.EVALUATED_QUESTION,
    trials_per_sample=5,
    scoring=ScoringMethod.UPLIFT,
    predictee_behavior_examples=3,
)

algo = MutualPredictStrategy(judge_policies=predictor_pool, config=config)
loss, details = await algo.compute_loss_async(
    samples=reasoning_trajectories,
    participant_policies=predictor_pool,   # non-evaluated participants
    reasoning_mode=reasoning_mode,         # to generate target answers for non-evaluated policies
    domain=domain,
)

Notes

  • Predictor and predicted must differ; trials where they coincide are skipped automatically.
  • Exact duplicate answers between context and target are skipped.
  • When random choices are used on any axis, set trials_per_sample > 1 for stable averages.
  • If predicted_choice != EVALUATED_POLICY or predictee_behavior_examples > 0 with non-evaluated predictee, provide reasoning_mode and domain to generate/cache answers.

Qualitative Judge (prompt design & interaction pipeline)

Key class: QualitativeJudge in core/algo/qualitative.py

  • What it does: Grades reasoning trajectories for truth-seeking quality via a judge policy (default: gpt-o3).
  • Design axes (switchable):
    • Instruction/rubric: Controls criteria text injected into prompts.
    • Judge self-answer: Judge first answers the question to calibrate “easy vs hard insights”.
    • Adversarial depth (red_team_mode): none | red | red_blue | red_blue_resolution.
    • Few-shot strategy: disabled | static (fixed order) | permuted (all permutations × recomputation rounds).
    • Example recomputation rounds (M): Recompute example components across rounds to reduce variance.
    • Few-shot source: Path to initial examples.
    • Iterative few-shot bootstrapping (T rounds): Each round uses current few-shot set as context to grade candidates from samples and adds ≥ max(1, old_count) new examples (≈ doubling per round).
    • Robust parsing: Extracts JSON from imperfect LLM outputs with balanced-brace search and fallbacks.

Defaults mirror the prior behavior: self-answer enabled, full red_blue_resolution, few-shot enabled with permutations, M=1, no bootstrapping.

Environment variables:

  • Core
    • QUAL_INSTRUCTION: Override rubric text (default: built-in rubric)
    • QUAL_INCLUDE_SELF_ANSWER=1|0 (default 1)
    • QUAL_RED_TEAM_MODE in {none, red, red_blue, red_blue_resolution} (default red_blue_resolution)
  • Few-shot
    • QUALITATIVE_JUDGE_USE_FEW_SHOT=1|0 (default 1)
    • QUAL_FEW_SHOT_PERMUTE=1|0 (default 1 → permuted; 0 → static)
    • QUAL_FEW_SHOT_SOURCE=path (default data/questions/conceptual_human_examples.json)
    • QUAL_EXAMPLE_RECOMPUTATION_ROUNDS=M (default 1; also respects legacy EXAMPLE_RECOMPUTATION_ROUNDS)
  • Iterative few-shot bootstrapping
    • QUAL_FEW_SHOT_BOOTSTRAP_ROUNDS=T (default 0)
    • QUAL_BOOTSTRAP_RESPECT_PERMUTE=1|0 (default 0)
    • QUAL_FEW_SHOT_BOOTSTRAP_SOURCE=path (default None)
      • Expected to be a file containing reasoning trajectories, or a directory containing such files at any depth. Must be supplied if QUAL_FEW_SHOT_BOOTSTRAP_ROUNDS > 0.

Minimal usage (CLI):

# Default qualitative judging (self-answer + red-blue-resolution + permuted few-shot)
export ALGO_NAME=QualitativeJudge
python -m scripts.run_reasoning

Configure adversarial depth and few-shot behavior:

# Red + Blue + Resolution with 2 recomputation rounds and permuted few-shot
export ALGO_NAME=QualitativeJudge
export QUAL_RED_TEAM_MODE=red_blue_resolution
export QUALITATIVE_JUDGE_USE_FEW_SHOT=1
export QUAL_FEW_SHOT_PERMUTE=1
export QUAL_EXAMPLE_RECOMPUTATION_ROUNDS=2
python -m scripts.run_reasoning

Disable few-shot, run red-team only:

export ALGO_NAME=QualitativeJudge
export QUALITATIVE_JUDGE_USE_FEW_SHOT=0
export QUAL_RED_TEAM_MODE=red
python -m scripts.run_reasoning

Iterative few-shot bootstrapping (doubles examples each round using samples as candidates):

export ALGO_NAME=QualitativeJudge
export QUAL_FEW_SHOT_BOOTSTRAP_ROUNDS=2
# Optional: keep permuted context during bootstrapping rounds
export QUAL_BOOTSTRAP_RESPECT_PERMUTE=1
python -m scripts.run_reasoning

Python API (advanced):

from core.algo.qualitative import QualitativeJudge

algo = QualitativeJudge(judge_policies=[judge_model])  # env vars control axes
loss, details = await algo.compute_loss_async(samples=reasoning_trajectories)

Multi-GPU Training with DeepSpeed and Accelerate

The LocalModel class supports distributed training across multiple GPUs.

The system automatically detects available GPUs and configures training appropriately:

  • Single GPU: Standard training
  • Multiple GPUs: Distributed Data Parallel (DDP) training
  • With DeepSpeed: ZeRO optimization stages 2 or 3

Two DeepSpeed configurations are provided:

  • data/config/deepspeed_zero2.json: ZeRO Stage 2 (recommended for most cases)
  • data/config/deepspeed_zero3.json: ZeRO Stage 3 (for very large models)

A series of Accelerate configurations are provided:

  • data/config/accelerate_config_1node_{N}gpu.yaml: Pre-configured for single-node, N-GPU setup

Environment Variables

Control multi-GPU behavior with these environment variables:

# Force single GPU usage (useful for debugging)
export FORCE_SINGLE_GPU=1

# Disable DeepSpeed (use regular DDP)
export DISABLE_DEEPSPEED=1

# Control concurrent local model instances
export LOCALMODEL_MAX_CONCURRENT=2

Basic Usage (Automatic Configuration)

The system automatically detects and uses available GPUs:

from utils.policy_utils import create_policy_from_string

# Create model - will auto-detect GPUs
model = create_policy_from_string("meta-llama/Llama-3.2-1B-Instruct")

# Train with SFT - automatically uses all available GPUs
trained_model = await model.train_sft_async(
    samples=training_samples,
    validation_samples=validation_samples,
)

Using Accelerate Launch

For explicit control over distributed training:

# Launch with accelerate (uses config file for single-node, single-GPU setup)
TRAINER_TYPE=rl POLICY_LIST_MODE=Qwen/Qwen3-0.6B DOMAIN_NAME=Forecasting GRADER_TYPE=python_brier RL_NUM_SAMPLES=100 VALIDATION_STRATEGY=train accelerate launch --config_file data/config/accelerate_config_1node_1gpu.yaml -m scripts.run_trainers

# Or configure interactively
accelerate config
TRAINER_TYPE=rl POLICY_LIST_MODE=Qwen/Qwen3-0.6B DOMAIN_NAME=Forecasting GRADER_TYPE=python_brier RL_NUM_SAMPLES=100 VALIDATION_STRATEGY=train accelerate launch -m scripts.run_trainers

SFT Training with Multi-GPU

from core.trainer.sft import SFTTrainer, SFTConfig

# Configure SFT for multi-GPU
config = SFTConfig(
    num_epochs=2,
    learning_rate=2e-5,
    batch_size=4,  # Per-device batch size
    gradient_accumulation_steps=2,
)

trainer = SFTTrainer(config)
trained_policy = await trainer.train_async(
    policy=model,
    trajectory_score_files=["path/to/trajectories.json"],
)

RL Training with Multi-GPU

from core.trainer.rl import RLTrainer, RLConfig
from core.grader.python_brier import PythonBrierGrader

# Configure RL for multi-GPU
config = RLConfig(
    num_epochs=2,
    learning_rate=1e-6,
    batch_size=2,  # Per-device batch size
    kl_coef=0.1,
)

trainer = RLTrainer(config)
grader = PythonBrierGrader()

trained_policy = await trainer.train_async(
    policy=model,
    problem_list=problems,
    grader=grader,
)

About

Truth-seeking evaluation and training for language models.
