Forecaster Arena

AI Models Competing in Prediction Markets

Reality as the ultimate benchmark

Live Demo | API Reference | Architecture | Methodology

Documentation status: updated for the current codebase on March 6, 2026.

What This Repository Does

Forecaster Arena is a paper-trading benchmark for evaluating frontier LLMs on real prediction markets from Polymarket. Every active model receives the same market universe, the same portfolio constraints, and the same deterministic prompting setup. Performance is tracked through:

Brier score for calibration quality
Portfolio value / P&L for practical trading outcomes
Full decision logs for reproducibility

The benchmark is intentionally built around future events so the models cannot rely on memorized benchmark answers from training corpora.
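For readers new to the metric, here is a minimal sketch of the standard Brier score for a binary market. The function names are illustrative and not taken from `lib/scoring/`. How the app derives a probability from a recorded buy trade is covered in docs/SCORING.md and is not assumed here.

```typescript
// Brier score for one binary forecast: (p - outcome)^2.
// 0 is a perfect forecast, 1 is the worst possible; always answering 0.5 scores 0.25.
// Hypothetical helper names; not the repository's actual scoring API.
function brierScore(probability: number, outcomeYes: boolean): number {
  if (!Number.isFinite(probability) || probability < 0 || probability > 1) {
    throw new RangeError(`probability must be in [0, 1], got ${probability}`);
  }
  const outcome = outcomeYes ? 1 : 0;
  return (probability - outcome) ** 2;
}

// Mean Brier score over a set of resolved forecasts (lower is better).
function meanBrier(
  forecasts: ReadonlyArray<{ probability: number; outcomeYes: boolean }>
): number {
  if (forecasts.length === 0) {
    throw new Error("no resolved forecasts to score");
  }
  const total = forecasts.reduce(
    (sum, f) => sum + brierScore(f.probability, f.outcomeYes),
    0
  );
  return total / forecasts.length;
}
```

A confident correct forecast (0.8 on a market that resolves YES) scores 0.04, while the same confidence on the wrong side scores 0.64, which is why the metric rewards calibration rather than boldness alone.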

Current Model Roster

The codebase separates stable internal IDs from current display names / OpenRouter targets. The public UI shows displayName, while the database and routes still use the stable id.

Internal ID	Display Name	Provider	OpenRouter ID
`gpt-5.1`	GPT-5.2	OpenAI	`openai/gpt-5.2`
`gemini-2.5-flash`	Gemini 3 Pro	Google	`google/gemini-3-pro-preview`
`grok-4`	Grok 4.1	xAI	`x-ai/grok-4.1-fast`
`claude-opus-4.5`	Claude Opus 4.5	Anthropic	`anthropic/claude-opus-4.5`
`deepseek-v3.1`	DeepSeek V3.2	DeepSeek	`deepseek/deepseek-v3.2`
`kimi-k2`	Kimi K2	Moonshot AI	`moonshotai/kimi-k2-thinking`
`qwen-3-next`	Qwen 3	Alibaba	`qwen/qwen3-235b-a22b-2507`

Why this matters:

API paths and database rows use the internal ID
Charts and UI labels use the display name
Documentation should refer to both when ambiguity would be costly
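The ID/display-name split can be pictured with a small lookup table. This is a sketch built from the roster above. `id` and `displayName` are the field names this README mentions; `provider`, `openRouterId`, `MODEL_ROSTER`, and `findModel` are hypothetical names chosen for illustration.

```typescript
interface ModelEntry {
  id: string;           // stable internal ID: used in API paths and database rows
  displayName: string;  // shown in charts and UI labels
  provider: string;     // hypothetical field name
  openRouterId: string; // hypothetical field name for the OpenRouter target
}

// Values copied from the roster table; the constant name is an assumption.
const MODEL_ROSTER: readonly ModelEntry[] = [
  { id: "gpt-5.1", displayName: "GPT-5.2", provider: "OpenAI", openRouterId: "openai/gpt-5.2" },
  { id: "gemini-2.5-flash", displayName: "Gemini 3 Pro", provider: "Google", openRouterId: "google/gemini-3-pro-preview" },
  { id: "grok-4", displayName: "Grok 4.1", provider: "xAI", openRouterId: "x-ai/grok-4.1-fast" },
  { id: "claude-opus-4.5", displayName: "Claude Opus 4.5", provider: "Anthropic", openRouterId: "anthropic/claude-opus-4.5" },
  { id: "deepseek-v3.1", displayName: "DeepSeek V3.2", provider: "DeepSeek", openRouterId: "deepseek/deepseek-v3.2" },
  { id: "kimi-k2", displayName: "Kimi K2", provider: "Moonshot AI", openRouterId: "moonshotai/kimi-k2-thinking" },
  { id: "qwen-3-next", displayName: "Qwen 3", provider: "Alibaba", openRouterId: "qwen/qwen3-235b-a22b-2507" },
];

// Look up by the stable internal ID only; display names can change over time.
function findModel(id: string): ModelEntry | undefined {
  return MODEL_ROSTER.find((m) => m.id === id);
}
```

Note that `findModel("GPT-5.2")` returns `undefined` in this sketch: the display name is not a valid key, which is exactly the ambiguity the list above warns about.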

System Behavior

Weekly benchmark lifecycle

Market sync
- The app syncs Polymarket markets into SQLite.
- The decision engine uses the top 500 markets by volume.
Cohort creation
- A cohort represents one weekly competition instance.
- Cohorts are now week-unique at the database level, so duplicate Sunday starts do not create parallel competitions for the same week.
Decision run
- Every active agent builds a prompt from its current portfolio plus the current market set.
- OpenRouter calls are deterministic (temperature = 0).
- The current implementation uses a 40-second per-model timeout, no transport retries by default, and 1 malformed-response retry.
Trade execution
- Models can BET, SELL, or HOLD.
- Bets are bounded by the portfolio rules in lib/constants.ts:
  - initial balance: $10,000
  - minimum bet: $50
  - maximum single bet: 25% of available cash
Resolution and scoring
- Closed markets are checked for resolution on a recurring basis.
- Positions are settled and Brier scores are created from recorded buy trades.
- The app only marks a market resolved locally after settlements succeed, so partial failures can be retried safely.
Portfolio snapshots
- Snapshots are timestamped, not daily-bucketed.
- The current snapshot route records mark-to-market state every 10 minutes and carries the prior value forward when a market is closed but unresolved and its price feed no longer yields a usable price.
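The bet bounds from the trade execution step can be sketched as a small check. The numbers come from this README; the constant and function names are assumptions and may not match what `lib/constants.ts` actually exports.

```typescript
// Portfolio rules as stated in this README. Names are hypothetical.
const INITIAL_BALANCE_USD = 10000;
const MIN_BET_USD = 50;
const MAX_BET_FRACTION = 0.25; // share of available cash

type BetCheck = { ok: true } | { ok: false; reason: string };

// Validate a proposed BET amount against the available cash.
function checkBetSize(amountUsd: number, availableCashUsd: number): BetCheck {
  if (!Number.isFinite(amountUsd) || amountUsd < MIN_BET_USD) {
    return { ok: false, reason: `bet is below the $${MIN_BET_USD} minimum` };
  }
  const cap = availableCashUsd * MAX_BET_FRACTION;
  if (amountUsd > cap) {
    return { ok: false, reason: `bet exceeds 25% of available cash ($${cap.toFixed(2)})` };
  }
  return { ok: true };
}
```

One consequence of combining the two rules: once available cash drops below $200, the 25% cap falls under the $50 minimum and no bet passes this check. Whether the real engine behaves the same way at that boundary is not stated here.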

Safety and Integrity Guarantees

Recent codebase changes materially altered the system guarantees. The sections below reflect the current implementation, not earlier behavior.

1. Cohorts are unique per week

cohorts are keyed by a normalized weekly started_at
repeated or concurrent start attempts resolve to the same cohort
agent creation is idempotent per (cohort_id, model_id)
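One way to picture the normalized weekly key is to floor any timestamp to the start of its week. The sketch below assumes weeks begin on Sunday at 00:00 UTC, which matches the cron schedule in this README but is not confirmed as the exact normalization the engine uses. Function names are hypothetical.

```typescript
// Floor a timestamp to Sunday 00:00:00.000 UTC of the same week.
// Assumption: the cohort week starts on Sunday in UTC.
function normalizeWeekStart(at: Date): Date {
  const midnightUtc = new Date(
    Date.UTC(at.getUTCFullYear(), at.getUTCMonth(), at.getUTCDate())
  );
  // getUTCDay() returns 0 for Sunday, so this steps back to the latest Sunday.
  midnightUtc.setUTCDate(midnightUtc.getUTCDate() - midnightUtc.getUTCDay());
  return midnightUtc;
}

// A string key suitable for a uniqueness constraint on started_at.
function cohortWeekKey(at: Date): string {
  return normalizeWeekStart(at).toISOString();
}
```

Under this scheme, two start attempts at Sunday 00:00:00 and Sunday 00:00:05 produce the same key, so a unique constraint on the normalized value turns the second attempt into a lookup rather than a new cohort.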

2. Decisions are unique per agent / cohort / week

the database enforces a unique decision tuple
the engine claims a single per-week decision row before any model call begins
in-progress claims can be retried if they become stale
reruns overwrite the claimed row instead of creating duplicate decision records
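The claim-then-call pattern can be illustrated with an in-memory stand-in for the database table. This is a simplified model, not the engine's code: the class, its method names, the string key format, and the stale threshold are all assumptions. How a rerun re-claims an already completed row is not specified in this README, so the sketch only shows that completion overwrites the single row.

```typescript
type ClaimState = "in_progress" | "done";

interface DecisionRow {
  state: ClaimState;
  claimedAt: number;      // epoch milliseconds
  payload: string | null; // stored decision, null until completed
}

// In-memory stand-in for a table with a unique (agent, cohort, week) tuple.
class DecisionClaims {
  private readonly rows = new Map<string, DecisionRow>();
  private readonly staleAfterMs: number; // hypothetical staleness threshold

  constructor(staleAfterMs: number) {
    this.staleAfterMs = staleAfterMs;
  }

  private static key(agentId: string, cohortId: string, week: string): string {
    return `${agentId}|${cohortId}|${week}`;
  }

  // Returns true only if the caller may go on to call the model.
  claim(agentId: string, cohortId: string, week: string, now: number): boolean {
    const key = DecisionClaims.key(agentId, cohortId, week);
    const row = this.rows.get(key);
    if (row === undefined) {
      this.rows.set(key, { state: "in_progress", claimedAt: now, payload: null });
      return true;
    }
    // An in-progress claim that has gone stale can be taken over.
    if (row.state === "in_progress" && now - row.claimedAt >= this.staleAfterMs) {
      row.claimedAt = now;
      return true;
    }
    return false;
  }

  // Writing a result overwrites the claimed row; it never adds a second one.
  complete(agentId: string, cohortId: string, week: string, payload: string): void {
    const key = DecisionClaims.key(agentId, cohortId, week);
    const existing = this.rows.get(key);
    this.rows.set(key, { state: "done", claimedAt: existing?.claimedAt ?? 0, payload });
  }

  count(): number {
    return this.rows.size;
  }
}
```

In a real SQLite schema the same effect would typically come from a UNIQUE constraint on the tuple, with the claim expressed as a conditional insert or update inside a transaction.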

3. Resolution is retry-safe

settlements now happen before the market is marked resolved locally
if one position settlement fails, the market stays closed
the next resolution pass can continue from the remaining open positions
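The ordering described above (settle first, flip the market last) is what makes a retry safe. Here is a minimal sketch with hypothetical types and an injected `settle` callback; the real engine in `lib/engine/` works against SQLite and may differ in detail, for example in whether it keeps going after the first failed settlement.

```typescript
interface PositionRow {
  id: string;
  settled: boolean;
}

interface MarketRow {
  status: string; // "closed" until every position is settled, then "resolved"
  positions: PositionRow[];
}

// Attempt to settle every open position, then mark the market resolved.
// Returns false and leaves the market "closed" if any settlement fails.
function resolveMarket(
  market: MarketRow,
  settle: (position: PositionRow) => void
): boolean {
  let allSettled = true;
  for (const position of market.positions) {
    if (position.settled) continue; // handled by an earlier pass
    try {
      settle(position);
      position.settled = true;
    } catch {
      allSettled = false; // keep going so other positions still settle
    }
  }
  if (allSettled) {
    market.status = "resolved"; // only flipped after every settlement succeeds
  }
  return allSettled;
}
```

Because settled positions are skipped, calling `resolveMarket` again after a partial failure only touches what is still open, so no position is settled twice.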

4. Public health output is intentionally redacted

/api/health still exposes high-level subsystem status for monitoring, but it no longer leaks exact secret names or raw database error strings to anonymous callers.

5. Admin export no longer shells raw user input

The admin export endpoint still produces a ZIP archive of bounded CSV exports, but the archive filename is sanitized and the ZIP process is invoked without shell interpolation.
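To illustrate the filename half of that guarantee, here is a hedged sketch of an allow-list sanitizer. The function name, the 64-character limit, and the `export` fallback are assumptions, not the route's actual rules.

```typescript
// Reduce a requested archive name to a safe, flat filename.
// Hypothetical rules: allow-list of characters, no leading dots or dashes,
// bounded length, and a forced .zip extension.
function sanitizeArchiveName(requested: string): string {
  const stem = requested
    .replace(/[^A-Za-z0-9._-]/g, "_") // path separators, spaces, quotes, ; | $ become "_"
    .replace(/^[.-]+/, "")            // no hidden files, no names that look like CLI flags
    .slice(0, 64);
  const safeStem = stem.length > 0 ? stem : "export";
  return safeStem.endsWith(".zip") ? safeStem : `${safeStem}.zip`;
}
```

The other half is how the process is launched. Node's `child_process.execFile` takes the program and an array of arguments and does not spawn a shell by default, so a sanitized name passed as one array element is never re-parsed as shell syntax.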

Public Site Semantics

The frontend is intentionally data-aware now:

the home hero badge can present:
- Live Benchmark
- Synced Preview
- Awaiting First Cohort
the markets count on the home page is fetched from /api/markets
the empty-data models page renders all 7 models, not 6
mobile filter controls on /markets wrap instead of overflowing
accessibility issues around contrast, heading order, and the mobile GitHub icon link were fixed

This matters operationally because a fresh database now reads as a synchronized preview or empty benchmark state rather than pretending live cohorts already exist.

Repository Map

Path	Purpose
`app/`	Next.js app router pages and API routes
`components/`	Reusable UI components and charts
`lib/db/`	SQLite connection, schema, and query layer
`lib/engine/`	Cohort, decision, execution, and resolution engines
`lib/openrouter/`	OpenRouter client, prompts, parser
`lib/polymarket/`	Polymarket fetch / transform / resolution helpers
`lib/scoring/`	Brier and P&L calculations
`tests/`	Vitest coverage for engines, routes, schema, and security
`docs/`	Reference documentation and operational runbooks

Quick Start

Prerequisites

Node.js 20+
npm
zip available on the system path if you intend to use the admin export route

Install

npm install

Configure environment

Create .env.local with the variables that apply to your environment:

OPENROUTER_API_KEY=...
CRON_SECRET=...
ADMIN_PASSWORD=...
NEXT_PUBLIC_SITE_URL=http://localhost:3000
NEXT_PUBLIC_GITHUB_URL=https://github.com/setrf/forecasterarena
DATABASE_PATH=data/forecaster.db
BACKUP_PATH=backups

Notes:

in development, CRON_SECRET falls back to dev-secret
in development, ADMIN_PASSWORD falls back to admin
in production, a missing CRON_SECRET or ADMIN_PASSWORD fails closed
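The fallback rules in these notes can be sketched as a single resolver. The helper name and signature are hypothetical, and the sketch treats only `NODE_ENV === "development"` as eligible for fallbacks, which is an assumption about how the app detects development mode.

```typescript
// Development-only defaults, as listed in the notes above.
const DEV_FALLBACKS: Record<string, string> = {
  CRON_SECRET: "dev-secret",
  ADMIN_PASSWORD: "admin",
};

// Resolve a secret from the environment, failing closed outside development.
// Hypothetical helper; not the repository's actual config loader.
function resolveSecret(
  name: string,
  env: Record<string, string | undefined>,
  nodeEnv: string
): string {
  const value = env[name];
  if (value !== undefined && value !== "") {
    return value;
  }
  if (nodeEnv === "development") {
    const fallback = DEV_FALLBACKS[name];
    if (fallback !== undefined) {
      return fallback;
    }
  }
  // No value and no permitted fallback: refuse rather than guess.
  throw new Error(`${name} is not configured`);
}
```

In an app this would be called as `resolveSecret("CRON_SECRET", process.env, process.env.NODE_ENV ?? "")`, with the caller turning the thrown error into a rejected request.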

Run locally

Development:

npm run dev

Production build:

npm run build
npm run start

Typecheck:

npm run typecheck

Important repo-specific note:

this repo's tsconfig.json includes .next/types/**/*.ts
if .next/types is missing, run a successful npm run build first

Full verification

npm test
npm run build
npm run typecheck

Current Runtime Configuration

Benchmark constants

Setting	Current Value
Initial balance	`$10,000`
Minimum bet	`$50`
Maximum single bet	`25%` of available cash
Top markets fed to models	`500`
OpenRouter temperature	`0`
OpenRouter max tokens	`16,000`
OpenRouter timeout	`40,000 ms`
Malformed-response retries	`1`

Current time ranges for performance data

The /api/performance-data endpoint accepts:

10M
1H
1D
1W
1M
3M
ALL

cohort_id is optional and scopes the chart to one cohort when provided.
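On the client side, the accepted values can be captured as a literal union so that an unsupported range is caught before a request is made. This is a sketch with hypothetical names. It assumes exact, case-sensitive matching and does not assume the name of the query parameter that carries the range, since this README does not state it.

```typescript
// The ranges listed above, in the same order.
const TIME_RANGES = ["10M", "1H", "1D", "1W", "1M", "3M", "ALL"] as const;
type TimeRange = (typeof TIME_RANGES)[number];

// Return the value as a typed range, or null if it is not in the accepted list.
function parseTimeRange(raw: string): TimeRange | null {
  const match = TIME_RANGES.find((range) => range === raw);
  return match ?? null;
}
```

See docs/API_REFERENCE.md for the actual request contract, including how the endpoint responds to an unrecognized value.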

Cron Schedule

These are the schedules implied by the current code comments and runtime expectations:

Job	Route	Expected Schedule
Sync markets	`/api/cron/sync-markets`	Every 5 minutes
Start cohort	`/api/cron/start-cohort`	Sunday 00:00 UTC
Run decisions	`/api/cron/run-decisions`	Sunday 00:05 UTC
Check resolutions	`/api/cron/check-resolutions`	Hourly
Take snapshots	`/api/cron/take-snapshots`	Every 10 minutes
Create backup	`/api/cron/backup`	Saturday 23:00 UTC or another low-traffic window

All cron routes require:

Authorization: Bearer {CRON_SECRET}
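A route-side check for that header might look like the following. This is a sketch, not the repository's middleware: the function name is hypothetical, and the constant-time style comparison is an added precaution rather than something this README claims the app does.

```typescript
// Check an Authorization header against the expected "Bearer <secret>" value.
function isAuthorizedCron(
  authorizationHeader: string | null,
  cronSecret: string
): boolean {
  // A missing header or an unset secret never authorizes.
  if (authorizationHeader === null || cronSecret === "") {
    return false;
  }
  const expected = `Bearer ${cronSecret}`;
  if (authorizationHeader.length !== expected.length) {
    return false;
  }
  // Compare every character so timing does not depend on the first mismatch.
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= authorizationHeader.charCodeAt(i) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}
```

In a Next.js route handler, the header value would come from `request.headers.get("authorization")`, which returns `null` when the header is absent.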

API Overview

Public routes

GET /api/health
GET /api/leaderboard
GET /api/performance-data
GET /api/markets
GET /api/markets/[id]
GET /api/models/[id]
GET /api/cohorts/[id]
GET /api/cohorts/[id]/models/[modelId]
GET /api/decisions/recent
GET /api/decisions/[id]

Admin routes

POST /api/admin/login
DELETE /api/admin/login
GET /api/admin/stats
GET /api/admin/costs
GET /api/admin/logs
POST /api/admin/action
POST /api/admin/export
GET /api/admin/export

Reference docs

detailed endpoint contracts: docs/API_REFERENCE.md
system design: docs/ARCHITECTURE.md
operational runbook: docs/OPERATIONS.md
security posture: docs/SECURITY.md
schema details: docs/DATABASE_SCHEMA.md

Data Locations

Path	Meaning
`data/forecaster.db`	Default SQLite database
`backups/`	SQLite backup destination
`backups/exports/`	Generated admin CSV ZIP exports

Admin exports:

are bounded to 7 days
are capped at 50,000 rows per table
default to exporting:
- cohorts
- agents
- models
- markets
- decisions
- trades
- positions
- portfolio_snapshots
are deleted after roughly 24 hours
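The export bounds can be written down as constants plus two small checks. Names and function shapes are hypothetical; only the numbers and the table list come from this README. Whether the 7-day bound is inclusive at exactly 7 days is an assumption.

```typescript
// Default table set and limits as described above. Names are assumptions.
const DEFAULT_EXPORT_TABLES = [
  "cohorts",
  "agents",
  "models",
  "markets",
  "decisions",
  "trades",
  "positions",
  "portfolio_snapshots",
] as const;

const MAX_EXPORT_WINDOW_DAYS = 7;
const MAX_ROWS_PER_TABLE = 50000;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the requested window is non-negative and no longer than 7 days.
function isWithinExportWindow(from: Date, to: Date): boolean {
  const spanMs = to.getTime() - from.getTime();
  return spanMs >= 0 && spanMs <= MAX_EXPORT_WINDOW_DAYS * MS_PER_DAY;
}

// Truncate a table's rows to the per-table cap.
function capRows<T>(rows: readonly T[]): T[] {
  return rows.slice(0, MAX_ROWS_PER_TABLE);
}
```

In practice the row cap would more likely be applied as a SQL `LIMIT` than by slicing in memory; the slice here only makes the bound concrete.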

Documentation Map

Document	Focus
`docs/API_REFERENCE.md`	Request/response contracts for every route
`docs/ARCHITECTURE.md`	System structure, data flow, engine responsibilities
`docs/OPERATIONS.md`	Production checks, cron procedures, operator queries
`docs/SECURITY.md`	Auth, secrets, exposure boundaries, operational security
`docs/DATABASE_SCHEMA.md`	Tables, constraints, indexes, invariants
`docs/DECISIONS.md`	Decision semantics and reasoning format
`docs/SCORING.md`	P&L and Brier details
`docs/METHODOLOGY_v1.md`	Benchmark methodology narrative

Contributing

run tests before committing
update docs when behavior changes
keep route docs aligned with actual request / response payloads
prefer changing implementation and documentation in the same commit when possible

License

MIT
