
# LLM Self-Forecasting Framework

Research project investigating how well LLMs can predict their own behavior in multi-turn and agentic scenarios. Part of the SPAR stream at MATS (Mar-May 2026). The approach is black-box: all experiments use API calls, and no white-box access is needed.

## Research Phases

  1. How well can LMs predict their own behavior? — Baseline self-prediction accuracy across multi-turn and agentic tasks.
  2. Can we intervene to improve self-prediction? — Post-training and finetuning to boost self-forecasting ability.
  3. How well can LMs predict behavior under update steps? — Predicting how RL training will affect their own behavior.
  4. Can we intervene to improve prediction under update steps? — Training models to better anticipate the effects of RL on themselves.

Phase 1.5 (runs in parallel): Can the LM obfuscate its self-predictive ability? This phase focuses on distinguishing deception from inability.

## Framework Architecture

The codebase provides a general-purpose framework with abstract base classes. Others can add their own self-forecasting tasks by subclassing `Task` and `Judge`.

### Core Abstractions

- `Task` (ABC): Defines the scenario: system prompts for the agent and the simulated user, conversation logic, and the forecast elicitation prompt.
- `Judge` (ABC): Evaluates forecast quality. Two tiers:
  - Tier A (Similarity): How well did the forecast match the actual outcome?
  - Tier B (Outcome): How beneficial was the conversation/action for the user?
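The two abstractions might look roughly like this. This is a minimal sketch: the method names, signatures, and docstrings are illustrative assumptions, not the repository's actual API.

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """A self-forecasting scenario (names here are illustrative)."""

    @abstractmethod
    def run_conversation(self) -> list[dict]:
        """Run the agent/simulated-user conversation; return the transcript."""

    @abstractmethod
    def forecast_prompt(self, partial_transcript: list[dict]) -> str:
        """Build the prompt asking the agent to predict the rest."""


class Judge(ABC):
    """Scores a forecast against what actually happened."""

    @abstractmethod
    def score(self, forecast: str, actual: list[dict]) -> float:
        """Return a quality score (e.g. a 1-10 similarity or a reward)."""
```

Concrete tasks and judges then subclass these and fill in the scenario-specific logic.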

### Current Task Implementations

| Task | Description | Evaluation |
| --- | --- | --- |
| AI Psychosis | The LLM converses with a simulated user exhibiting psychosis tendencies. Multi-turn. | Tier A: LLM-rated similarity (1-10). Tier B: conversation benefit rating (1-10). |
| Competitive Programming | The model attempts Codeforces-style problems. It can solve, give a wrong answer, or abstain. | Tier A: categorical match (correct/wrong/abstain). Tier B: reward (+1 correct, +0.5 abstain, 0 wrong). |
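The Tier B reward for the competitive-programming task is simple enough to state as code. A sketch, with the function name assumed:

```python
def coding_reward(outcome: str) -> float:
    """Map a categorical outcome to the Tier B reward:
    +1 for a correct solution, +0.5 for abstaining, 0 for a wrong answer.
    """
    rewards = {"correct": 1.0, "abstain": 0.5, "wrong": 0.0}
    return rewards[outcome]
```

Note that abstaining earns more than answering wrongly, so a model that can predict its own failures is rewarded for calibrated refusal.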

## Experiment Flow

1. Run the actual conversation via `task.run_conversation()`.
2. Show the agent partial context and ask it to self-forecast the rest.
3. Tier A judge: compare the forecast against the actual outcome.
4. Tier B judge: rate conversation quality / calculate the reward.
5. Save all results as JSON.
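The steps above can be sketched end-to-end. The function name, argument list, result fields, and fixed output filename are illustrative assumptions (the real orchestrator in `code/experiment.py` will differ in detail):

```python
import json
from pathlib import Path


def run_experiment(task, tier_a_judge, tier_b_judge, forecast_after, agent):
    # 1. Run the real conversation.
    transcript = task.run_conversation()
    # 2. Show the agent partial context and elicit a self-forecast.
    partial = transcript[:forecast_after]
    forecast = agent(task.forecast_prompt(partial))
    # 3-4. Judge forecast similarity (Tier A) and outcome quality (Tier B).
    result = {
        "forecast": forecast,
        "tier_a": tier_a_judge.score(forecast, transcript),
        "tier_b": tier_b_judge.score(forecast, transcript),
    }
    # 5. Persist the record as JSON.
    Path("results").mkdir(exist_ok=True)
    Path("results/run.json").write_text(json.dumps(result, indent=2))
    return result
```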

## Setup

```shell
python3 -m venv .venv
.venv/bin/pip install openai pydantic python-dotenv
cp .env.example .env
# Edit .env and add your OpenRouter API key
```

## Usage

```shell
# Run the AI Psychosis task (5 turns, forecast after turn 2)
.venv/bin/python -m code.run --task psychosis --turns 5

# Run the Competitive Programming task
.venv/bin/python -m code.run --task competitive_programming --problem-index 0

# Specify model and parameters
.venv/bin/python -m code.run --task psychosis --turns 3 --model google/gemma-3-27b-it --forecast-after 1
```

Results are saved as JSON in `results/`.
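Because results are plain JSON files, they can be loaded back for analysis with only the standard library. A sketch; the helper name is an assumption, and the fields inside each file depend on the task:

```python
import json
from pathlib import Path


def load_results(results_dir: str = "results") -> list[dict]:
    """Load every saved experiment record from the results directory."""
    return [
        json.loads(path.read_text())
        for path in sorted(Path(results_dir).glob("*.json"))
    ]
```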

## Adding a New Task

1. Create a new file in `code/tasks/`.
2. Subclass `Task` and implement the abstract methods.
3. Register it in `code/tasks/__init__.py` by adding it to `TASK_REGISTRY`.
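Concretely, the registry pattern might look like this. A sketch only: `MyTask` is hypothetical, and the exact shape of `TASK_REGISTRY` is assumed from the steps above.

```python
class MyTask:  # would subclass Task from code/base.py in the real repo
    """A new self-forecasting scenario (illustrative stand-in)."""

    def run_conversation(self):
        # Scenario-specific conversation logic goes here.
        return [{"role": "assistant", "content": "..."}]


# In code/tasks/__init__.py, the CLI resolves --task names via the registry:
TASK_REGISTRY = {
    "my_task": MyTask,  # alongside "psychosis" and "competitive_programming"
}

task = TASK_REGISTRY["my_task"]()  # how run.py would instantiate the task
```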

## File Structure

```
code/
├── llm_client.py        # Thin OpenAI-compatible wrapper for OpenRouter
├── base.py              # Abstract base classes: Task, Judge, data types
├── config.py            # Top-level config (models, turns, temperatures)
├── experiment.py        # Orchestrator: run → forecast → judge → save
├── run.py               # CLI entrypoint
├── tasks/
│   ├── psychosis.py     # AI Psychosis task
│   └── competitive_programming.py  # Codeforces task
└── judges/
    ├── similarity.py    # Tier A: predicted vs actual similarity
    └── outcome.py       # Tier B: conversation outcome quality
```

## Team

SPAR Self-Forecasting team — GitHub org
