Research project investigating how well LLMs can predict their own behavior in multi-turn and agentic scenarios. Part of the SPAR stream at MATS (Mar-May 2026). Black-box approach — all experiments use API calls, no white-box access needed.
- How well can LMs predict their own behavior? — Baseline self-prediction accuracy across multi-turn and agentic tasks.
- Can we intervene to improve self-prediction? — Post-training and finetuning to boost self-forecasting ability.
- How well can LMs predict behavior under update steps? — Predicting how RL training will affect their own behavior.
- Can we intervene to improve prediction under update steps? — Training models to better anticipate the effects of RL on themselves.
Phase 1.5 (parallel): Can the LM obfuscate its self-predictive ability? Distinguishing deception from inability.
The codebase provides a general-purpose framework with abstract base classes. Others can add their own self-forecasting tasks by subclassing Task and Judge.
Task(ABC): Defines the scenario — system prompts for agent and simulated user, conversation logic, and forecast elicitation prompt.Judge(ABC): Evaluates forecast quality. Two tiers:- Tier A (Similarity): How well did the forecast match the actual outcome?
- Tier B (Outcome): How beneficial was the conversation/action for the user?
| Task | Description | Evaluation |
|---|---|---|
| AI Psychosis | LLM converses with a simulated user exhibiting psychosis tendencies. Multi-turn. | Tier A: LLM-rated similarity (1-10). Tier B: conversation benefit rating (1-10). |
| Competitive Programming | Model attempts Codeforces-style problems. Can solve, give wrong answer, or abstain. | Tier A: categorical match (correct/wrong/abstain). Tier B: reward (+1 correct, +0.5 abstain, 0 wrong). |
- Run the actual conversation via
task.run_conversation() - Show the agent partial context, ask it to self-forecast the rest
- Tier A judge: compare forecast vs actual outcome
- Tier B judge: rate conversation quality / calculate reward
- Save all results as JSON
python3 -m venv .venv
.venv/bin/pip install openai pydantic python-dotenv
cp .env.example .env
# Edit .env and add your OpenRouter API key# Run AI Psychosis task (5 turns, forecast after turn 2)
.venv/bin/python -m code.run --task psychosis --turns 5
# Run Competitive Programming task
.venv/bin/python -m code.run --task competitive_programming --problem-index 0
# Specify model and parameters
.venv/bin/python -m code.run --task psychosis --turns 3 --model google/gemma-3-27b-it --forecast-after 1Results are saved as JSON in results/.
- Create a new file in
code/tasks/ - Subclass
Taskand implement the abstract methods - Register it in
code/tasks/__init__.pyby adding toTASK_REGISTRY
code/
├── llm_client.py # Thin OpenAI-compatible wrapper for OpenRouter
├── base.py # Abstract base classes: Task, Judge, data types
├── config.py # Top-level config (models, turns, temperatures)
├── experiment.py # Orchestrator: run → forecast → judge → save
├── run.py # CLI entrypoint
├── tasks/
│ ├── psychosis.py # AI Psychosis task
│ └── competitive_programming.py # Codeforces task
└── judges/
├── similarity.py # Tier A: predicted vs actual similarity
└── outcome.py # Tier B: conversation outcome quality
SPAR Self-Forecasting team — GitHub org