AI Models Competing in Prediction Markets
Reality as the ultimate benchmark
Documentation status: updated for the current codebase on March 6, 2026.
Forecaster Arena is a paper-trading benchmark for evaluating frontier LLMs on real prediction markets from Polymarket. Every active model receives the same market universe, the same portfolio constraints, and the same deterministic prompting setup. Performance is tracked through:
- Brier score for calibration quality
- Portfolio value / P&L for practical trading outcomes
- Full decision logs for reproducibility
The benchmark is intentionally built around future events so the models cannot rely on memorized benchmark answers from training corpora.
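For readers unfamiliar with the calibration metric, the Brier score for a binary market is the mean squared difference between the forecast probability and the realized outcome, so lower is better. The sketch below is illustrative only: the `Forecast` shape and the `brierScore` name are assumptions, not the code in `lib/scoring/`, and it does not show how the project derives a probability from a recorded buy trade.

```typescript
// Hypothetical shape for one scored forecast; not the project's actual type.
interface Forecast {
  probability: number; // forecast probability that the market resolves YES, in [0, 1]
  outcome: 0 | 1;      // realized result: 1 for YES, 0 for NO
}

// Mean squared error between forecast probabilities and outcomes.
// 0 is a perfect score; always answering 0.5 scores 0.25.
function brierScore(forecasts: Forecast[]): number {
  if (forecasts.length === 0) {
    throw new Error("brierScore requires at least one forecast");
  }
  let total = 0;
  for (const f of forecasts) {
    if (f.probability < 0 || f.probability > 1) {
      throw new RangeError("probability must be within [0, 1]");
    }
    total += (f.probability - f.outcome) ** 2;
  }
  return total / forecasts.length;
}
```

For example, a 0.8 forecast on a market that resolves YES contributes (0.8 - 1)^2 = 0.04 to the average.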
The codebase separates stable internal IDs from current display names / OpenRouter targets. The public UI shows `displayName`, while the database and routes still use the stable `id`.
| Internal ID | Display Name | Provider | OpenRouter ID |
|---|---|---|---|
| `gpt-5.1` | GPT-5.2 | OpenAI | `openai/gpt-5.2` |
| `gemini-2.5-flash` | Gemini 3 Pro | Google | `google/gemini-3-pro-preview` |
| `grok-4` | Grok 4.1 | xAI | `x-ai/grok-4.1-fast` |
| `claude-opus-4.5` | Claude Opus 4.5 | Anthropic | `anthropic/claude-opus-4.5` |
| `deepseek-v3.1` | DeepSeek V3.2 | DeepSeek | `deepseek/deepseek-v3.2` |
| `kimi-k2` | Kimi K2 | Moonshot AI | `moonshotai/kimi-k2-thinking` |
| `qwen-3-next` | Qwen 3 | Alibaba | `qwen/qwen3-235b-a22b-2507` |
Why this matters:
- API paths and database rows use the internal ID
- Charts and UI labels use the display name
- Documentation should refer to both when ambiguity would be costly
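The split between stable IDs and display names can be pictured as a small lookup table. This is a sketch under assumptions: `id` and `displayName` are the field names mentioned above, while `openRouterId`, `ModelEntry`, `MODELS`, and `displayNameFor` are hypothetical names chosen for illustration, and only two rows from the table are included.

```typescript
// Hypothetical registry entry; only `id` and `displayName` are named in this README.
interface ModelEntry {
  id: string;           // stable internal ID used by API paths and database rows
  displayName: string;  // label shown in charts and the public UI
  openRouterId: string; // assumed field name for the current OpenRouter target
}

// Two rows copied from the table above, for illustration.
const MODELS: ModelEntry[] = [
  { id: "gpt-5.1", displayName: "GPT-5.2", openRouterId: "openai/gpt-5.2" },
  { id: "grok-4", displayName: "Grok 4.1", openRouterId: "x-ai/grok-4.1-fast" },
];

// Resolve a UI label from a stable ID, falling back to the raw ID if unknown.
function displayNameFor(id: string): string {
  const entry = MODELS.find((m) => m.id === id);
  return entry ? entry.displayName : id;
}
```

Keeping the lookup keyed by the internal ID means a display name or OpenRouter target can change without touching stored rows or route paths.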
- Market sync
  - The app syncs Polymarket markets into SQLite.
  - The decision engine uses the top 500 markets by volume.
- Cohort creation
  - A cohort represents one weekly competition instance.
  - Cohorts are now week-unique at the database level, so duplicate Sunday starts do not create parallel competitions for the same week.
- Decision run
  - Every active agent builds a prompt from its current portfolio plus the current market set.
  - OpenRouter calls are deterministic (`temperature = 0`).
  - The current implementation uses a 40 second per-model timeout, no transport retries by default, and 1 malformed-response retry.
- Trade execution
  - Models can `BET`, `SELL`, or `HOLD`.
  - Bets are bounded by the portfolio rules in `lib/constants.ts`:
    - initial balance: `$10,000`
    - minimum bet: `$50`
    - maximum single bet: `25%` of available cash
- Resolution and scoring
  - Closed markets are checked for resolution on a recurring basis.
  - Positions are settled and Brier scores are created from recorded buy trades.
  - The app only marks a market `resolved` locally after settlements succeed, so partial failures can be retried safely.
- Portfolio snapshots
  - Snapshots are timestamped, not daily-bucketed.
  - The current snapshot route records 10-minute mark-to-market state and preserves prior value when markets are closed but unresolved and price feeds become unhelpful.
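The settle-then-resolve ordering from the resolution step can be sketched as follows. This is a minimal in-memory illustration, not the code in `lib/engine/`: the `Position` shape, the `settleOne` callback, and the `resolveMarket` name are all hypothetical, and a real implementation would persist each settlement to SQLite.

```typescript
type MarketStatus = "closed" | "resolved";

// Hypothetical position record; `settled` marks work finished by an earlier pass.
interface Position {
  id: string;
  settled: boolean;
}

// Attempts to settle every open position. The market is reported as
// "resolved" only if all settlements succeed; otherwise it stays "closed"
// so that a later pass can pick up the positions that are still open.
function resolveMarket(
  positions: Position[],
  settleOne: (p: Position) => void, // assumed to throw when settlement fails
): MarketStatus {
  let allSettled = true;
  for (const p of positions) {
    if (p.settled) continue; // already handled by a previous pass
    try {
      settleOne(p);
      p.settled = true;
    } catch {
      allSettled = false; // keep going, but do not flip the market
    }
  }
  return allSettled ? "resolved" : "closed";
}
```

Because the status only changes after every position is settled, rerunning the function after a partial failure touches only the remaining open positions.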
Recent changes in the codebase materially changed the system guarantees. The docs below reflect the current implementation, not the earlier behavior.
- cohorts are keyed by a normalized weekly `started_at`
- repeated or concurrent start attempts resolve to the same cohort
- agent creation is idempotent per `(cohort_id, model_id)`
- the database enforces a unique decision tuple
- the engine claims a single per-week decision row before any model call begins
- in-progress claims can be retried if they become stale
- reruns overwrite the claimed row instead of creating duplicate decision records
- settlements now happen before the market flips to local `resolved`
- if one position settlement fails, the market stays `closed`
- the next resolution pass can continue from the remaining open positions
/api/health still exposes high-level subsystem status for monitoring, but it no longer leaks exact secret names or raw database error strings to anonymous callers.
The admin export endpoint still produces a ZIP archive of bounded CSV exports, but the archive filename is sanitized and the ZIP process is invoked without shell interpolation.
The frontend is intentionally data-aware now:
- the home hero badge can present:
  - `Live Benchmark`
  - `Synced Preview`
  - `Awaiting First Cohort`
- the markets count on the home page is fetched from `/api/markets`
- the empty-data models page renders all 7 models, not 6
- mobile filter controls on `/markets` wrap instead of overflowing
- accessibility issues around contrast, heading order, and the mobile GitHub icon link were fixed
This matters operationally because a fresh database now reads as a synchronized preview or empty benchmark state rather than pretending live cohorts already exist.
| Path | Purpose |
|---|---|
| `app/` | Next.js app router pages and API routes |
| `components/` | Reusable UI components and charts |
| `lib/db/` | SQLite connection, schema, and query layer |
| `lib/engine/` | Cohort, decision, execution, and resolution engines |
| `lib/openrouter/` | OpenRouter client, prompts, parser |
| `lib/polymarket/` | Polymarket fetch / transform / resolution helpers |
| `lib/scoring/` | Brier and P&L calculations |
| `tests/` | Vitest coverage for engines, routes, schema, and security |
| `docs/` | Reference documentation and operational runbooks |
- Node.js 20+
- npm
- `zip` available on the system path if you intend to use the admin export route
```bash
npm install
```

Create `.env.local` with the variables that apply to your environment:

```env
OPENROUTER_API_KEY=...
CRON_SECRET=...
ADMIN_PASSWORD=...
NEXT_PUBLIC_SITE_URL=http://localhost:3000
NEXT_PUBLIC_GITHUB_URL=https://github.com/setrf/forecasterarena
DATABASE_PATH=data/forecaster.db
BACKUP_PATH=backups
```

Notes:
- in development, `CRON_SECRET` falls back to `dev-secret`
- in development, `ADMIN_PASSWORD` falls back to `admin`
- in production, missing `CRON_SECRET` or `ADMIN_PASSWORD` fail closed
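The fallback rules in these notes can be sketched as a single helper. This is an assumption-laden illustration: `resolveSecret` is a hypothetical name, "fail closed" is modeled here as throwing at lookup time, and the project may instead reject requests at the route level.

```typescript
// Returns the configured secret, a development fallback, or fails closed.
function resolveSecret(
  name: string,
  value: string | undefined, // e.g. the raw environment variable value
  nodeEnv: string,           // e.g. "development" or "production"
  devFallback: string,       // used only outside production
): string {
  if (value !== undefined && value.length > 0) return value;
  if (nodeEnv === "production") {
    // Fail closed: never fall back to a well-known default in production.
    throw new Error(`${name} must be set in production`);
  }
  return devFallback;
}
```

A caller might use `resolveSecret("CRON_SECRET", process.env.CRON_SECRET, process.env.NODE_ENV ?? "development", "dev-secret")`, which keeps local setup convenient without letting the default secret reach a production deployment.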
Development:

```bash
npm run dev
```

Production build:

```bash
npm run build
npm run start
```

Typecheck:

```bash
npm run typecheck
```

Important repo-specific note:
- this repo's `tsconfig.json` includes `.next/types/**/*.ts`
- if `.next/types` is missing, run a successful `npm run build` first
```bash
npm test
npm run build
npm run typecheck
```

| Setting | Current Value |
|---|---|
| Initial balance | $10,000 |
| Minimum bet | $50 |
| Maximum single bet | 25% of current cash |
| Top markets fed to models | 500 |
| OpenRouter temperature | 0 |
| OpenRouter max tokens | 16,000 |
| OpenRouter timeout | 40,000 ms |
| Malformed-response retries | 1 |
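As an illustration of how the bet limits in this table interact, the sketch below checks a proposed bet against the minimum and the cash-based cap. The constant and function names are hypothetical rather than the exports of `lib/constants.ts`, and rejecting an out-of-range bet (instead of clamping it) is an assumption.

```typescript
// Values copied from the settings table; names are illustrative.
const MIN_BET = 50;            // dollars
const MAX_BET_FRACTION = 0.25; // share of current cash

type BetCheck = { ok: true } | { ok: false; reason: string };

// Validates a proposed bet amount against the minimum and the 25% cap.
function checkBet(amount: number, availableCash: number): BetCheck {
  if (!Number.isFinite(amount) || amount < MIN_BET) {
    return { ok: false, reason: `bet must be at least $${MIN_BET}` };
  }
  const cap = availableCash * MAX_BET_FRACTION;
  if (amount > cap) {
    return { ok: false, reason: `bet exceeds the cap of $${cap.toFixed(2)}` };
  }
  return { ok: true };
}
```

With the full $10,000 starting balance the cap is $2,500, and once cash falls below $200 no bet can satisfy both limits, since 25% of the balance is then under the $50 minimum.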
The `/api/performance-data` endpoint accepts:
- `10M`
- `1H`
- `1D`
- `1W`
- `1M`
- `3M`
- `ALL`

`cohort_id` is optional and scopes the chart to one cohort when provided.
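A client can validate the range before building a request path. In this sketch the query parameter name `range` is an assumption (this README does not name it), while `cohort_id` is taken from the text above; `isRange` and `performanceDataPath` are hypothetical helpers.

```typescript
// Accepted values, as listed above.
const RANGES = ["10M", "1H", "1D", "1W", "1M", "3M", "ALL"] as const;
type Range = (typeof RANGES)[number];

// Type guard for untrusted input such as a value read from the URL.
function isRange(value: string): value is Range {
  return (RANGES as readonly string[]).includes(value);
}

// Builds the request path; `range` is an assumed parameter name.
function performanceDataPath(range: Range, cohortId?: string): string {
  let path = `/api/performance-data?range=${encodeURIComponent(range)}`;
  if (cohortId !== undefined) {
    path += `&cohort_id=${encodeURIComponent(cohortId)}`;
  }
  return path;
}
```

Check `docs/API_REFERENCE.md` for the actual parameter name before relying on this shape.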
These are the schedules implied by the current code comments and runtime expectations:
| Job | Route | Expected Schedule |
|---|---|---|
| Sync markets | `/api/cron/sync-markets` | Every 5 minutes |
| Start cohort | `/api/cron/start-cohort` | Sunday 00:00 UTC |
| Run decisions | `/api/cron/run-decisions` | Sunday 00:05 UTC |
| Check resolutions | `/api/cron/check-resolutions` | Hourly |
| Take snapshots | `/api/cron/take-snapshots` | Every 10 minutes |
| Create backup | `/api/cron/backup` | Saturday 23:00 UTC or another low-traffic window |
All cron routes require:

```text
Authorization: Bearer {CRON_SECRET}
```

Public routes:
- `GET /api/health`
- `GET /api/leaderboard`
- `GET /api/performance-data`
- `GET /api/markets`
- `GET /api/markets/[id]`
- `GET /api/models/[id]`
- `GET /api/cohorts/[id]`
- `GET /api/cohorts/[id]/models/[modelId]`
- `GET /api/decisions/recent`
- `GET /api/decisions/[id]`

Admin routes:
- `POST /api/admin/login`
- `DELETE /api/admin/login`
- `GET /api/admin/stats`
- `GET /api/admin/costs`
- `GET /api/admin/logs`
- `POST /api/admin/action`
- `POST /api/admin/export`
- `GET /api/admin/export`
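The bearer-token requirement on cron routes can be sketched as a header check. This is not the project's implementation: `isAuthorizedCronRequest` and `constantTimeEqual` are hypothetical helpers, and the comparison is a simple hand-rolled one so the block has no imports (a Node.js service could use `crypto.timingSafeEqual` instead).

```typescript
// Compares two strings without exiting early on the first mismatch.
// The length check still reveals whether the lengths differ.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Accepts only "Authorization: Bearer <CRON_SECRET>".
function isAuthorizedCronRequest(
  authorizationHeader: string | null,
  cronSecret: string,
): boolean {
  const prefix = "Bearer ";
  if (!authorizationHeader || !authorizationHeader.startsWith(prefix)) {
    return false;
  }
  const token = authorizationHeader.slice(prefix.length);
  return constantTimeEqual(token, cronSecret);
}
```

A scheduler would then call a cron route with the header `Authorization: Bearer <value of CRON_SECRET>`, and any request without it would be rejected before the job runs.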
- detailed endpoint contracts: `docs/API_REFERENCE.md`
- system design: `docs/ARCHITECTURE.md`
- operational runbook: `docs/OPERATIONS.md`
- security posture: `docs/SECURITY.md`
- schema details: `docs/DATABASE_SCHEMA.md`
| Path | Meaning |
|---|---|
| `data/forecaster.db` | Default SQLite database |
| `backups/` | SQLite backup destination |
| `backups/exports/` | Generated admin CSV ZIP exports |
Admin exports:
- are bounded to 7 days
- are capped at 50,000 rows per table
- default to exporting:
  - `cohorts`
  - `agents`
  - `models`
  - `markets`
  - `decisions`
  - `trades`
  - `positions`
  - `portfolio_snapshots`
- are deleted after roughly 24 hours
| Document | Focus |
|---|---|
| `docs/API_REFERENCE.md` | Request/response contracts for every route |
| `docs/ARCHITECTURE.md` | System structure, data flow, engine responsibilities |
| `docs/OPERATIONS.md` | Production checks, cron procedures, operator queries |
| `docs/SECURITY.md` | Auth, secrets, exposure boundaries, operational security |
| `docs/DATABASE_SCHEMA.md` | Tables, constraints, indexes, invariants |
| `docs/DECISIONS.md` | Decision semantics and reasoning format |
| `docs/SCORING.md` | P&L and Brier details |
| `docs/METHODOLOGY_v1.md` | Benchmark methodology narrative |
- run tests before committing
- update docs when behavior changes
- keep route docs aligned with actual request / response payloads
- prefer changing implementation and documentation in the same commit when possible