Can AI reason about and forecast the trajectory of an ongoing war before it transitions into history?
This is the code repository for the paper "When AI Navigates the Fog of War". We present a temporally grounded benchmark that evaluates whether frontier LLMs can reason about an unfolding geopolitical conflict using only information available at each moment in time.
📄 Paper | 🖥️ Website | 🤗 Dataset
We construct 11 critical temporal nodes spanning the early stages of the 2026 Middle East conflict (Feb 27 – Mar 6, 2026), along with 42 node-specific verifiable questions and 5 general exploratory questions. At each time point, models receive only news articles published before the event and must reason about what happens next. This design substantially mitigates training-data leakage concerns, as the conflict unfolded after the training cutoff of current frontier models.
| Node | Date | Event | Theme | Theme Description |
|---|---|---|---|---|
| T0 | Feb 27 | Operation Epic Fury | I | Initial Outbreak |
| T1 | Feb 28 | Israeli-US Strikes | I | Initial Outbreak |
| T2 | Feb 28 | Iranian Strikes | I | Initial Outbreak |
| T3 | Mar 1 | Two Missiles towards British Bases on Cyprus | II | Threshold Crossings |
| T4 | Mar 1 | Oil Refiner and Oil Tanker Was Attacked | III | Economic Shockwaves |
| T5 | Mar 2 | Qatar Halts Energy Production | III | Economic Shockwaves |
| T6 | Mar 2 | Natanz Nuclear Facility Damaged | II | Threshold Crossings |
| T7 | Mar 3 | U.S. Begins Evacuation of Citizens from the Middle East | II | Threshold Crossings |
| T8 | Mar 3 | Nine Countries Involved and Israeli Ground Invasion | II | Threshold Crossings |
| T9 | Mar 3 | Mojtaba Khamenei Becomes Supreme Leader | IV | Political Signaling |
| T10 | Mar 6 | Iranian Apology to Neighboring Countries | IV | Political Signaling |
- 🧠 Current state-of-the-art LLMs often show strong strategic reasoning, attending to underlying incentives, deterrence pressures, and material constraints rather than surface political rhetoric.
- ⚖️ This capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments.
- 📈 Model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation.
All inference is routed through OpenRouter, a unified API gateway for frontier LLMs.
| Model | Provider |
|---|---|
openai/gpt-5.4 |
OpenAI |
qwen/qwen3.5-35b-a3b |
Qwen |
google/gemini-3.1-flash-lite-preview |
|
anthropic/claude-sonnet-4.6 |
Anthropic |
moonshotai/kimi-k2.5 |
Moonshot |
minimax/minimax-m2.5 |
MiniMax |
pip install -r requirements.txt📦 Data is automatically downloaded from HuggingFace on first run.
🔑 API Key: All API calls go through OpenRouter. Sign up for a free account and get your API key, then create ../war-prediction-LLMs/config.json (please note that only paid account could use the full context of models):
{
"OPENROUTER_API_KEY": "your-openrouter-key"
}# Audit data quality
python src/audit_data.py
# Dry run (verify prompts, no API calls)
python src/run_predictions.py --dry-run
# Run single model
python src/run_predictions.py --models openai/gpt-5.4 --time-points T3
# Full benchmark (all models, all time points)
bash run_all.shwar-test/
├── README.md
├── requirements.txt
├── run_all.sh # Run full benchmark (all models)
├── test_gpt.sh # Quick test on single model
├── src/
│ ├── config.py # API key, model list, constants, HF data loading
│ ├── context_builder.py # Article filtering by cutoff datetime
│ ├── prompt_builder.py # System + user prompt construction
│ ├── response_parser.py # LLM JSON response parsing
│ └── run_predictions.py # Main inference pipeline
├── assets/ # Images
└── dataset/ # HuggingFace dataset card & parquet
@misc{li2026ainavigatesfogwar,
title={When AI Navigates the Fog of War},
author={Ming Li and Xirui Li and Tianyi Zhou},
year={2026},
eprint={2603.16642},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.16642},
}