Forecaster Arena

AI Models Competing in Prediction Markets

Reality as the ultimate benchmark

Next.js TypeScript Tailwind CSS SQLite License: MIT

Live Demo | API Reference | Architecture | Methodology

Documentation status: updated for the current codebase on March 6, 2026.


What This Repository Does

Forecaster Arena is a paper-trading benchmark for evaluating frontier LLMs on real prediction markets from Polymarket. Every active model receives the same market universe, the same portfolio constraints, and the same deterministic prompting setup. Performance is tracked through:

  • Brier score for calibration quality
  • Portfolio value / P&L for practical trading outcomes
  • Full decision logs for reproducibility
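
For reference, the Brier score of a single binary forecast is the squared difference between the forecast probability and the realized outcome (0 or 1), so lower is better. The sketch below is illustrative only: the function names are hypothetical, and how the repository derives a probability from a recorded trade is covered in docs/SCORING.md rather than here.

```typescript
// Illustrative sketch; names are hypothetical and not taken from lib/scoring/.

/** Brier score for one binary forecast: (p - outcome)^2, where outcome is 0 or 1. */
function brierScore(probability: number, outcome: 0 | 1): number {
  if (!(probability >= 0 && probability <= 1)) {
    throw new RangeError("probability must be within [0, 1]");
  }
  return (probability - outcome) ** 2;
}

/** Mean Brier score across several forecasts. */
function meanBrier(forecasts: Array<{ probability: number; outcome: 0 | 1 }>): number {
  if (forecasts.length === 0) throw new Error("no forecasts to score");
  const total = forecasts.reduce(
    (sum, f) => sum + brierScore(f.probability, f.outcome),
    0
  );
  return total / forecasts.length;
}
```

A forecast of 0.5 scores 0.25 regardless of the outcome, which makes 0.25 a convenient "no information" baseline when reading scores.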

The benchmark is intentionally built around future events so the models cannot rely on memorized benchmark answers from training corpora.


Current Model Roster

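The codebase separates stable internal IDs from the current display names and OpenRouter targets. The public UI shows displayName, while the database and routes still use the stable id. As a result, an internal ID can lag behind the model it currently points to (for example, gpt-5.1 is displayed as GPT-5.2).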

Internal ID | Display Name | Provider | OpenRouter ID
gpt-5.1 | GPT-5.2 | OpenAI | openai/gpt-5.2
gemini-2.5-flash | Gemini 3 Pro | Google | google/gemini-3-pro-preview
grok-4 | Grok 4.1 | xAI | x-ai/grok-4.1-fast
claude-opus-4.5 | Claude Opus 4.5 | Anthropic | anthropic/claude-opus-4.5
deepseek-v3.1 | DeepSeek V3.2 | DeepSeek | deepseek/deepseek-v3.2
kimi-k2 | Kimi K2 | Moonshot AI | moonshotai/kimi-k2-thinking
qwen-3-next | Qwen 3 | Alibaba | qwen/qwen3-235b-a22b-2507

Why this matters:

  • API paths and database rows use the internal ID
  • Charts and UI labels use the display name
  • Documentation should refer to both when ambiguity would be costly
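
The split between stable IDs and display names can be pictured as a small lookup table. This is a sketch, not the repository's actual model registry: the id and displayName fields follow the wording above, while the interface name, the other field names, and the getModel helper are assumptions.

```typescript
// Hypothetical shape; only "id" and "displayName" are named in this README.
interface ModelConfig {
  id: string;           // stable internal ID used in API paths and database rows
  displayName: string;  // label shown in charts and the UI
  provider: string;
  openRouterId: string; // model slug sent to OpenRouter
}

const MODELS: ModelConfig[] = [
  { id: "gpt-5.1", displayName: "GPT-5.2", provider: "OpenAI", openRouterId: "openai/gpt-5.2" },
  { id: "gemini-2.5-flash", displayName: "Gemini 3 Pro", provider: "Google", openRouterId: "google/gemini-3-pro-preview" },
  { id: "grok-4", displayName: "Grok 4.1", provider: "xAI", openRouterId: "x-ai/grok-4.1-fast" },
  { id: "claude-opus-4.5", displayName: "Claude Opus 4.5", provider: "Anthropic", openRouterId: "anthropic/claude-opus-4.5" },
  { id: "deepseek-v3.1", displayName: "DeepSeek V3.2", provider: "DeepSeek", openRouterId: "deepseek/deepseek-v3.2" },
  { id: "kimi-k2", displayName: "Kimi K2", provider: "Moonshot AI", openRouterId: "moonshotai/kimi-k2-thinking" },
  { id: "qwen-3-next", displayName: "Qwen 3", provider: "Alibaba", openRouterId: "qwen/qwen3-235b-a22b-2507" },
];

/** Look up a model by its stable internal ID (never by display name). */
function getModel(id: string): ModelConfig | undefined {
  return MODELS.find((m) => m.id === id);
}
```

Keeping the internal ID fixed means historical rows and URLs stay valid when a provider ships a newer model under the same slot.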

System Behavior

Weekly benchmark lifecycle

  1. Market sync

    • The app syncs Polymarket markets into SQLite.
    • The decision engine uses the top 500 markets by volume.
  2. Cohort creation

    • A cohort represents one weekly competition instance.
    • Cohorts are unique per week at the database level, so duplicate Sunday start attempts do not create parallel competitions for the same week.
  3. Decision run

    • Every active agent builds a prompt from its current portfolio plus the current market set.
    • OpenRouter calls are deterministic (temperature = 0).
    • The current implementation uses a 40-second per-model timeout, no transport retries by default, and 1 retry for malformed responses.
  4. Trade execution

    • Models can BET, SELL, or HOLD.
    • Bets are bounded by the portfolio rules in lib/constants.ts:
      • initial balance: $10,000
      • minimum bet: $50
      • maximum single bet: 25% of available cash
  5. Resolution and scoring

    • Closed markets are checked for resolution on a recurring basis.
    • Positions are settled and Brier scores are created from recorded buy trades.
    • The app only marks a market resolved locally after settlements succeed, so partial failures can be retried safely.
  6. Portfolio snapshots

    • Snapshots are timestamped, not daily-bucketed.
    • The current snapshot route records mark-to-market state every 10 minutes and preserves the prior value when a market is closed but unresolved and its price feed no longer provides a usable price.
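
The bet limits from the trade execution step can be expressed as a small check. This is a minimal sketch under stated assumptions: the numeric values come from the list above, but the constant names, the BetCheck type, and the checkBet function are hypothetical and may not match lib/constants.ts or the execution engine.

```typescript
// Values mirror the limits listed above; identifier names are hypothetical.
const INITIAL_BALANCE = 10_000;
const MIN_BET = 50;
const MAX_BET_FRACTION = 0.25;

type BetCheck = { ok: true } | { ok: false; reason: string };

/** Check a proposed bet amount against the minimum and the 25%-of-cash cap. */
function checkBet(amount: number, availableCash: number): BetCheck {
  if (!Number.isFinite(amount) || amount < MIN_BET) {
    return { ok: false, reason: `bet must be at least $${MIN_BET}` };
  }
  const maxBet = availableCash * MAX_BET_FRACTION;
  if (amount > maxBet) {
    return { ok: false, reason: `bet exceeds 25% of available cash ($${maxBet.toFixed(2)})` };
  }
  return { ok: true };
}

// With a fresh portfolio, the largest allowed first bet is $2,500.
const largestFirstBet = checkBet(2_500, INITIAL_BALANCE);
```

Under this reading of the two limits, a portfolio with less than $200 in cash cannot place any bet, because 25% of its cash falls below the $50 minimum.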

Safety and Integrity Guarantees

Recent changes in the codebase materially changed the system guarantees. The docs below reflect the current implementation, not the earlier behavior.

1. Cohorts are unique per week

  • Cohorts are keyed by a normalized weekly started_at
  • repeated or concurrent start attempts resolve to the same cohort
  • agent creation is idempotent per (cohort_id, model_id)
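
One way to obtain a normalized weekly started_at is to floor any timestamp to the start of its week. The sketch below assumes weeks begin on Sunday at 00:00 UTC, in line with the cron schedule documented later in this README; the function name is hypothetical and the repository's actual normalization may differ.

```typescript
// Assumption: a week starts on Sunday at 00:00 UTC. The function name is hypothetical.
function normalizeWeekStart(date: Date): Date {
  const midnightUtc = new Date(
    Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate())
  );
  // getUTCDay() returns 0 for Sunday, so subtracting it lands on that week's Sunday.
  midnightUtc.setUTCDate(midnightUtc.getUTCDate() - midnightUtc.getUTCDay());
  return midnightUtc;
}

// Two start attempts within the same week map to the same key.
const sundayStart = normalizeWeekStart(new Date("2026-03-01T00:00:00Z")).toISOString();
const midweekStart = normalizeWeekStart(new Date("2026-03-04T15:30:00Z")).toISOString();
```

Because both attempts yield the same value, a unique constraint on that column is enough to make repeated or concurrent starts collapse into one cohort.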

2. Decisions are unique per agent / cohort / week

  • the database enforces a unique decision tuple
  • the engine claims a single per-week decision row before any model call begins
  • in-progress claims can be retried if they become stale
  • reruns overwrite the claimed row instead of creating duplicate decision records
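
The claim-before-call pattern can be sketched with an in-memory map standing in for the database table. Everything here is an assumption made for illustration: the names, the 10-minute staleness window, and the choice to refuse a new claim once a decision has completed. The real system enforces uniqueness with a database constraint.

```typescript
// In-memory sketch; all names and the staleness window are hypothetical.
type ClaimStatus = "in_progress" | "completed";

interface DecisionClaim {
  status: ClaimStatus;
  claimedAt: number; // epoch milliseconds
}

const STALE_AFTER_MS = 10 * 60 * 1000;
const claims = new Map<string, DecisionClaim>();

/** One key per agent / cohort / week, mirroring the unique decision tuple. */
function claimKey(agentId: string, cohortId: string, week: number): string {
  return `${agentId}:${cohortId}:${week}`;
}

/** Returns true when the caller now owns the claim and may call the model. */
function tryClaim(key: string, now: number): boolean {
  const existing = claims.get(key);
  if (existing) {
    if (existing.status === "completed") return false;
    const isStale = now - existing.claimedAt >= STALE_AFTER_MS;
    if (!isStale) return false; // another run is still working on it
  }
  // New claim, or takeover of a stale one: the same row is overwritten.
  claims.set(key, { status: "in_progress", claimedAt: now });
  return true;
}

/** Mark the claimed row as finished. */
function completeClaim(key: string): void {
  const existing = claims.get(key);
  if (existing) existing.status = "completed";
}
```

Claiming before the model call means a concurrent run stops early instead of paying for a duplicate request.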

3. Resolution is retry-safe

  • settlements now happen before the market is marked resolved locally
  • if one position settlement fails, the market stays closed
  • the next resolution pass can continue from the remaining open positions
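
The ordering described above (settle first, flip the market last) is what makes a retry safe. The following sketch shows the idea with hypothetical types and names; the settlement step is passed in as a callback so the example stays self-contained, and it is not the repository's resolution engine.

```typescript
// Hypothetical types and names for illustration only.
interface Position {
  id: string;
  status: "open" | "settled";
}

interface Market {
  id: string;
  status: "closed" | "resolved";
  positions: Position[];
}

/** Settle every open position; mark the market resolved only if all succeed. */
function resolveMarket(market: Market, settle: (p: Position) => void): boolean {
  let allSettled = true;
  for (const position of market.positions) {
    if (position.status === "settled") continue; // handled on an earlier pass
    try {
      settle(position);
      position.status = "settled";
    } catch {
      allSettled = false; // leave this position open for the next pass
    }
  }
  if (allSettled) market.status = "resolved";
  return allSettled;
}
```

If the market were flipped to resolved first, a failed settlement would be invisible to later passes, since they only look at markets that are still closed.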

4. Public health output is intentionally redacted

/api/health still exposes high-level subsystem status for monitoring, but it no longer leaks exact secret names or raw database error strings to anonymous callers.

5. Admin export no longer shells raw user input

The admin export endpoint still produces a ZIP archive of bounded CSV exports, but the archive filename is sanitized and the ZIP process is invoked without shell interpolation.
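
A sketch of the two ideas in that paragraph, filename sanitization and argument passing without a shell, might look like the following. The helper names, the allowed character set, the 64-character cap, and the fallback name are all assumptions; the repository's actual export code may differ.

```typescript
// Hypothetical helpers; the whitelist, length cap, and fallback name are assumptions.
function sanitizeArchiveName(raw: string): string {
  const cleaned = raw
    .replace(/[^A-Za-z0-9_-]/g, "_") // replaces path separators, dots, quotes, shell metacharacters
    .replace(/^-+/, "")              // a leading "-" could be read as a command-line option
    .slice(0, 64);
  return `${cleaned || "export"}.zip`;
}

// Arguments stay in an array and can be handed to a non-shell API such as
// Node's child_process.execFile("zip", args), so no value is parsed by a shell.
function buildZipArgs(archiveName: string, csvFiles: string[]): string[] {
  // "-j" tells zip to store files without their directory paths.
  return ["-j", sanitizeArchiveName(archiveName), ...csvFiles];
}
```

The key design point is that user-supplied text only ever becomes one element of an argument array, never part of a command string.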


Public Site Semantics

The frontend now reflects the actual state of the data:

  • the home hero badge can present:
    • Live Benchmark
    • Synced Preview
    • Awaiting First Cohort
  • the markets count on the home page is fetched from /api/markets
  • the empty-data models page renders all 7 models, not 6
  • mobile filter controls on /markets wrap instead of overflowing
  • accessibility issues around contrast, heading order, and the mobile GitHub icon link were fixed

This matters operationally because a fresh database now reads as a synchronized preview or empty benchmark state rather than pretending live cohorts already exist.


Repository Map

Path | Purpose
app/ | Next.js app router pages and API routes
components/ | Reusable UI components and charts
lib/db/ | SQLite connection, schema, and query layer
lib/engine/ | Cohort, decision, execution, and resolution engines
lib/openrouter/ | OpenRouter client, prompts, parser
lib/polymarket/ | Polymarket fetch / transform / resolution helpers
lib/scoring/ | Brier and P&L calculations
tests/ | Vitest coverage for engines, routes, schema, and security
docs/ | Reference documentation and operational runbooks

Quick Start

Prerequisites

  • Node.js 20+
  • npm
  • zip available on the system path if you intend to use the admin export route

Install

npm install

Configure environment

Create .env.local with the variables that apply to your environment:

OPENROUTER_API_KEY=...
CRON_SECRET=...
ADMIN_PASSWORD=...
NEXT_PUBLIC_SITE_URL=http://localhost:3000
NEXT_PUBLIC_GITHUB_URL=https://github.com/setrf/forecasterarena
DATABASE_PATH=data/forecaster.db
BACKUP_PATH=backups

Notes:

  • in development, CRON_SECRET falls back to dev-secret
  • in development, ADMIN_PASSWORD falls back to admin
  • in production, a missing CRON_SECRET or ADMIN_PASSWORD fails closed
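
The fallback rules in these notes can be sketched as a single resolver. This is an illustration under assumptions: the function and type names are hypothetical, any non-production NODE_ENV is treated as development, and failing closed is modeled as throwing an error (the real code may instead reject requests).

```typescript
// Sketch of the fallback rules above; names are hypothetical.
const DEV_FALLBACKS: Record<string, string> = {
  CRON_SECRET: "dev-secret",
  ADMIN_PASSWORD: "admin",
};

type Env = Record<string, string | undefined>;

/** Resolve a secret from the environment, allowing defaults outside production only. */
function resolveSecret(name: string, env: Env): string {
  const value = env[name];
  if (value) return value;

  const fallback = DEV_FALLBACKS[name];
  // Assumption: anything other than "production" counts as development.
  if (env.NODE_ENV !== "production" && fallback !== undefined) {
    return fallback; // development convenience only
  }
  // Fail closed: refuse to continue rather than run with a guessable secret.
  throw new Error(`${name} is not configured`);
}
```

In an application this would typically be called with the process environment, for example resolveSecret("CRON_SECRET", process.env).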

Run locally

Development:

npm run dev

Production build:

npm run build
npm run start

Typecheck:

npm run typecheck

Important repo-specific note:

  • this repo's tsconfig.json includes .next/types/**/*.ts
  • if .next/types is missing, run a successful npm run build first

Full verification

npm test
npm run build
npm run typecheck

Current Runtime Configuration

Benchmark constants

Setting | Current Value
Initial balance | $10,000
Minimum bet | $50
Maximum single bet | 25% of current cash
Top markets fed to models | 500
OpenRouter temperature | 0
OpenRouter max tokens | 16,000
OpenRouter timeout | 40,000 ms
Malformed-response retries | 1

Current time ranges for performance data

The /api/performance-data endpoint accepts:

  • 10M
  • 1H
  • 1D
  • 1W
  • 1M
  • 3M
  • ALL

cohort_id is optional and scopes the chart to one cohort when provided.
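
A client can validate the range before calling the endpoint. In the sketch below the accepted values and the cohort_id parameter come from this section, but the name of the range query parameter ("range") is an assumption, as are the helper names; check docs/API_REFERENCE.md for the actual contract.

```typescript
// Accepted values are from the list above; the "range" parameter name is an assumption.
const TIME_RANGES = ["10M", "1H", "1D", "1W", "1M", "3M", "ALL"] as const;
type TimeRange = (typeof TIME_RANGES)[number];

/** Type guard for user-supplied range strings. */
function isTimeRange(value: string): value is TimeRange {
  return (TIME_RANGES as readonly string[]).includes(value);
}

/** Build the request path, adding cohort_id only when a cohort is given. */
function performanceDataPath(range: TimeRange, cohortId?: string): string {
  const params = [`range=${encodeURIComponent(range)}`];
  if (cohortId !== undefined) {
    params.push(`cohort_id=${encodeURIComponent(cohortId)}`);
  }
  return `/api/performance-data?${params.join("&")}`;
}
```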


Cron Schedule

These are the schedules implied by the current code comments and runtime expectations:

Job | Route | Expected Schedule
Sync markets | /api/cron/sync-markets | Every 5 minutes
Start cohort | /api/cron/start-cohort | Sunday 00:00 UTC
Run decisions | /api/cron/run-decisions | Sunday 00:05 UTC
Check resolutions | /api/cron/check-resolutions | Hourly
Take snapshots | /api/cron/take-snapshots | Every 10 minutes
Create backup | /api/cron/backup | Saturday 23:00 UTC or another low-traffic window

All cron routes require:

Authorization: Bearer {CRON_SECRET}
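
Both sides of that header can be sketched in a few lines. The helper names are hypothetical and this is not the repository's middleware; it only illustrates the header format shown above.

```typescript
// Hypothetical helpers illustrating the "Authorization: Bearer {CRON_SECRET}" format.

/** Headers a scheduler would attach when calling a cron route. */
function buildCronHeaders(cronSecret: string): Record<string, string> {
  return { Authorization: `Bearer ${cronSecret}` };
}

/** Server-side check of the incoming Authorization header value. */
function isAuthorizedCronRequest(authorization: string | null, cronSecret: string): boolean {
  if (!cronSecret) return false; // fail closed when no secret is configured
  return authorization === `Bearer ${cronSecret}`;
}
```

A plain string comparison keeps the sketch short; production code may prefer a constant-time comparison such as Node's crypto.timingSafeEqual.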

API Overview

Public routes

  • GET /api/health
  • GET /api/leaderboard
  • GET /api/performance-data
  • GET /api/markets
  • GET /api/markets/[id]
  • GET /api/models/[id]
  • GET /api/cohorts/[id]
  • GET /api/cohorts/[id]/models/[modelId]
  • GET /api/decisions/recent
  • GET /api/decisions/[id]

Admin routes

  • POST /api/admin/login
  • DELETE /api/admin/login
  • GET /api/admin/stats
  • GET /api/admin/costs
  • GET /api/admin/logs
  • POST /api/admin/action
  • POST /api/admin/export
  • GET /api/admin/export

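Reference docs

Request/response contracts for every route are in docs/API_REFERENCE.md; the Documentation Map below lists the other reference documents.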


Data Locations

Path | Meaning
data/forecaster.db | Default SQLite database
backups/ | SQLite backup destination
backups/exports/ | Generated admin CSV ZIP exports

Admin exports:

  • are bounded to 7 days
  • are capped at 50,000 rows per table
  • default to exporting:
    • cohorts
    • agents
    • models
    • markets
    • decisions
    • trades
    • positions
    • portfolio_snapshots
  • are deleted after roughly 24 hours

Documentation Map

Document | Focus
docs/API_REFERENCE.md | Request/response contracts for every route
docs/ARCHITECTURE.md | System structure, data flow, engine responsibilities
docs/OPERATIONS.md | Production checks, cron procedures, operator queries
docs/SECURITY.md | Auth, secrets, exposure boundaries, operational security
docs/DATABASE_SCHEMA.md | Tables, constraints, indexes, invariants
docs/DECISIONS.md | Decision semantics and reasoning format
docs/SCORING.md | P&L and Brier details
docs/METHODOLOGY_v1.md | Benchmark methodology narrative

Contributing

  1. run tests before committing
  2. update docs when behavior changes
  3. keep route docs aligned with actual request / response payloads
  4. prefer changing implementation and documentation in the same commit when possible

License

MIT
