Polymarket Data Downloader & Processor

This project contains Python scripts to download historical market, event, and price history data from the Polymarket APIs and process it into usable TSV formats.

Quick Start: Master Script

run_all.py is a master script that runs all the individual scripts in the correct order with comprehensive logging. This is the recommended way to run the entire pipeline.

Features

  • Runs all scripts in the correct sequence automatically
  • Captures and logs all stdout/stderr from each script
  • Creates timestamped master log files
  • Allows skipping individual steps if data already exists
  • Configurable folder names and parameters
  • Provides detailed execution summary

Basic Usage

# Run the entire pipeline with default settings
python run_all.py

# Run with custom directories and settings
python run_all.py \
    --market-data-dir market_data \
    --event-details-dir event_details \
    --market-details-dir market_details \
    --price-history-dir price_history \
    --status closed \
    --workers 10

Skip Options

If you've already completed some steps, you can skip them:

# Skip downloading markets and events (already have the data)
python run_all.py \
    --skip-download-markets \
    --skip-download-events \
    --skip-process-task1

Logging

  • Master log file: logs/run_all_YYYYMMDD_HHMMSS.log (contains all output)
  • Individual script logs: logs/download_markets.log, logs/download_event_details.log, etc.
  • All stdout and stderr from each script are captured and logged

Full Help

python run_all.py --help

Individual Scripts

If you prefer to run the scripts individually or need more control, each script is described below. Minimal sketches of each script's core logic follow the list (see "Implementation Sketches").

  1. download_markets.py: Fetches market data in batches based on status (e.g., closed) from the Gamma API and saves each batch as a .jsonl file in the specified output directory.

    • Purpose: Downloads the initial set of market overview data.
    • Output Directory: Contains files like markets_offset_0_limit_20.jsonl.
    • Resume: Automatically detects the last successfully saved batch and resumes downloading from the next offset.
    • Example Command:
      python download_markets.py --output-dir market_data --status closed
  2. download_event_details.py: Scans the market .jsonl files generated by the first script, extracts all unique event IDs mentioned, and then fetches the full details for each event ID using the Gamma API /events/{id} endpoint. Saves each event's details into a separate JSON file.

    • Purpose: Downloads detailed data for every unique event associated with the downloaded markets.
    • Input Directory: The directory containing the market .jsonl files (e.g., market_data).
    • Output Directory: Contains files like event_12345.json.
    • Resume: Checks for existing event files and only downloads details for events not already present.
    • Parallelism: Uses multiple workers (default 8) to speed up downloads.
    • Example Command:
      python download_event_details.py --market-data-dir market_data --output-dir event_details --workers 10
  3. download_price_history.py: Scans individual market detail JSON files (market_{id}.json), extracts the first CLOB token ID (assumed to be the "Yes" outcome), and fetches the price history time series for that token from the CLOB API (/prices-history). Saves the raw JSON response for each market.

    • Purpose: Downloads the raw price history for the "Yes" outcome of each market.
    • Input Directory: Directory containing individual market detail JSON files (e.g., market_details - typically created by Task 1 of process_data.py). Use --market-details-dir.
    • Output Directory: Contains files like price_history_yes_12345.json. Use --output-dir.
    • Resume: Checks for existing price history files and only downloads data for markets not already present.
    • Parallelism: Uses multiple workers (default 8) to speed up downloads.
    • Example Command:
      python download_price_history.py --market-details-dir market_details --output-dir price_history --workers 10
  4. process_data.py: Processes the downloaded market, event, and price history data.

    • Task 1 (Optional): Saves each market from the .jsonl files into an individual market_{id}.json file for easier access (--market-output-dir). Required if download_price_history.py needs these files as input.
    • Task 2: Reads the market .jsonl files and the corresponding event detail JSONs. It also checks the downloaded price history files (--price-history-dir). It then creates two separate TSV files:
      • Markets TSV (--market-tsv-output): Contains one row per market, with all market_* prefixed columns. Includes a market_event_ids column (comma-separated string of event IDs) and a market_downloaded_pricehistory_nonempty column (True/False) indicating if the corresponding price history file was found and contained data.
      • Events TSV (--event-tsv-output): Contains one row per unique event encountered across all processed markets, with all event_* prefixed columns.
    • Task 3 (Optional): Reads the price history JSON files (--price-history-dir). For each file with a non-empty history list, it creates a new TSV file (timeseries_{id}.tsv) in the specified output directory (--timeseries-output-dir) containing timestamp and price columns.
    • Inputs: Market data directory (--market-data-dir), event details directory (--event-details-dir), price history directory (--price-history-dir, required for Tasks 2 & 3).
    • Outputs: Individual market JSON directory (for Task 1), Markets TSV file, Events TSV file (for Task 2), Timeseries TSV directory (for Task 3).
    • Example Command (All Tasks):
      python process_data.py \
          --market-data-dir market_data \
          --event-details-dir event_details \
          --price-history-dir price_history \
          --market-output-dir market_details \
          --market-tsv-output polymarket_markets.tsv \
          --event-tsv-output polymarket_events.tsv \
          --timeseries-output-dir timeseries_data
    • Example Command (Only Task 2 & 3 - TSV Creation):
      python process_data.py \
          --market-data-dir market_data \
          --event-details-dir event_details \
          --price-history-dir price_history \
          --market-tsv-output polymarket_markets.tsv \
          --event-tsv-output polymarket_events.tsv \
          --timeseries-output-dir timeseries_data \
          --skip-task1
    • Example Command (Only Task 3 - Timeseries TSVs):
      python process_data.py \
          --market-data-dir market_data `# Still needed for arg parser even if task skipped` \
          --event-details-dir event_details `# Still needed for arg parser even if task skipped` \
          --price-history-dir price_history \
          --timeseries-output-dir timeseries_data \
          --skip-task1 --skip-task2
  5. analyze_price_data.py: Analyzes all price history JSON files (e.g., from price_history/). It calculates statistics for each file, such as the number of data points, mean price, standard deviation, time range, and timestamp delta characteristics. It also identifies potential issues like empty files, constant prices, or formatting errors.

    • Purpose: To assess the quality and characteristics of downloaded price history files and produce a structured summary for further processing or filtering.
    • Input Directory: Assumes JSON files are in a price_history/ directory relative to where the script is run.
    • Outputs:
      • analysis_summary.txt: A human-readable text file summarizing the analysis for each file and providing global statistics across all files.
      • analysis_results.json: A JSON file containing a list of detailed analysis dictionaries for each processed file. This file is intended for programmatic use, for example, by filter_price_data.py.
    • Parallelism: Uses multiprocessing to speed up the analysis when handling many files.
    • Example Command:
      python analyze_price_data.py
  6. filter_price_data.py: Filters the analyzed price history data based on user-defined criteria. It reads the analysis_results.json file generated by analyze_price_data.py.

    • Purpose: To select a subset of price history files that meet specific quality or characteristic thresholds.
    • Input:
      • analysis_results.json: The JSON file output by analyze_price_data.py.
      • Filtering criteria: Defined directly within the filter_criteria dictionary in the script.
    • Outputs:
      • Prints a list of filenames that meet the specified criteria to the console.
      • filtered_filenames.txt: A text file containing the list of filenames that passed the filters, one filename per line.
    • Customization: Users should modify the filter_criteria dictionary within the script to set their desired thresholds for metrics like minimum/maximum number of data points, mean price range, standard deviation range, issues to exclude, or maximum allowed time delta between points.
    • Example Command:
      python filter_price_data.py
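
Implementation Sketches

The sketches below illustrate the core logic of each script described above. They are simplified illustrations, not excerpts from the repository: any field name, query parameter, or payload shape flagged in the comments is an assumption rather than a detail confirmed by the code.

Resuming download_markets.py from the last saved batch. This assumes the Gamma /markets endpoint pages with limit/offset parameters and returns a JSON list:

# Find the highest offset already saved on disk and continue from there.
import json
import os
import re

import requests

OUT_DIR = "market_data"
LIMIT = 20
BASE_URL = "https://gamma-api.polymarket.com/markets"

def next_offset(out_dir, limit):
    """Return the offset after the last saved batch file (0 if none exist)."""
    pattern = re.compile(r"markets_offset_(\d+)_limit_\d+\.jsonl$")
    offsets = [int(m.group(1)) for f in os.listdir(out_dir) if (m := pattern.match(f))]
    return max(offsets) + limit if offsets else 0

os.makedirs(OUT_DIR, exist_ok=True)
offset = next_offset(OUT_DIR, LIMIT)
resp = requests.get(BASE_URL, params={"limit": LIMIT, "offset": offset, "closed": "true"})  # query params assumed
resp.raise_for_status()
with open(os.path.join(OUT_DIR, f"markets_offset_{offset}_limit_{LIMIT}.jsonl"), "w") as fh:
    for market in resp.json():  # response assumed to be a JSON list of markets
        fh.write(json.dumps(market) + "\n")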
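
Extracting unique event IDs and fetching them in parallel, as download_event_details.py describes. The events field name inside each market row is an assumption about the Gamma payload:

import glob
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

def unique_event_ids(market_data_dir):
    """Collect every distinct event ID mentioned in the market .jsonl files."""
    ids = set()
    for path in glob.glob(f"{market_data_dir}/*.jsonl"):
        with open(path) as fh:
            for line in fh:
                market = json.loads(line)
                for event in market.get("events", []):  # field name assumed
                    ids.add(str(event["id"]))
    return ids

def fetch_event(event_id, out_dir):
    out_path = out_dir / f"event_{event_id}.json"
    if out_path.exists() and out_path.stat().st_size > 0:
        return  # resume: skip events that already downloaded successfully
    resp = requests.get(f"https://gamma-api.polymarket.com/events/{event_id}")
    resp.raise_for_status()
    out_path.write_text(resp.text)

out_dir = Path("event_details")
out_dir.mkdir(exist_ok=True)
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda eid: fetch_event(eid, out_dir), unique_event_ids("market_data")))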
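
The per-market flow of download_price_history.py: take the first CLOB token ID from a market_{id}.json file and query the CLOB /prices-history endpoint. The clobTokenIds field and the interval parameter are assumptions:

import json
from pathlib import Path

import requests

CLOB_URL = "https://clob.polymarket.com/prices-history"

def first_token_id(market_file):
    """Return the first CLOB token ID (assumed to be the "Yes" outcome)."""
    market = json.loads(market_file.read_text())
    raw = market.get("clobTokenIds")  # field name assumed; may be a JSON-encoded string
    token_ids = json.loads(raw) if isinstance(raw, str) else (raw or [])
    return token_ids[0] if token_ids else None

def download_history(market_file, out_dir):
    market_id = market_file.stem.removeprefix("market_")
    out_path = out_dir / f"price_history_yes_{market_id}.json"
    if out_path.exists() and out_path.stat().st_size > 0:
        return  # resume: skip markets already fetched
    token_id = first_token_id(market_file)
    if token_id is None:
        return
    resp = requests.get(CLOB_URL, params={"market": token_id, "interval": "max"})  # params assumed
    resp.raise_for_status()
    out_path.write_text(resp.text)  # save the raw JSON response

out_dir = Path("price_history")
out_dir.mkdir(exist_ok=True)
for market_file in Path("market_details").glob("market_*.json"):
    download_history(market_file, out_dir)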
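
process_data.py's Task 3: converting each non-empty price history JSON into a timestamp/price TSV. The {"history": [{"t": ..., "p": ...}]} payload shape is an assumption about the CLOB response:

import csv
import json
from pathlib import Path

def write_timeseries_tsv(history_file, out_dir):
    """Write timeseries_{id}.tsv for one price history file; skip empty histories."""
    data = json.loads(history_file.read_text())
    points = data.get("history") or []  # payload shape assumed
    if not points:
        return False
    market_id = history_file.stem.removeprefix("price_history_yes_")
    with open(out_dir / f"timeseries_{market_id}.tsv", "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["timestamp", "price"])
        for point in points:
            writer.writerow([point["t"], point["p"]])
    return True

out_dir = Path("timeseries_data")
out_dir.mkdir(exist_ok=True)
for history_file in Path("price_history").glob("price_history_yes_*.json"):
    write_timeseries_tsv(history_file, out_dir)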
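
The per-file statistics analyze_price_data.py computes, with multiprocessing omitted for brevity. The issue labels and payload shape are assumptions:

import json
import statistics
from pathlib import Path

def analyze_file(path):
    """Compute summary statistics for one price history file and flag issues."""
    result = {"filename": path.name, "issues": []}
    try:
        points = json.loads(path.read_text()).get("history") or []
    except json.JSONDecodeError:
        result["issues"].append("format_error")
        return result
    if not points:
        result["issues"].append("empty")
        return result
    prices = [p["p"] for p in points]
    times = [p["t"] for p in points]
    deltas = [b - a for a, b in zip(times, times[1:])]
    result.update(
        num_points=len(points),
        mean_price=statistics.mean(prices),
        std_price=statistics.pstdev(prices),
        time_range=(min(times), max(times)),
        max_delta=max(deltas) if deltas else 0,
    )
    if len(set(prices)) == 1:
        result["issues"].append("constant_price")  # e.g., a flat 0.5 series
    return result

results = [analyze_file(p) for p in Path("price_history").glob("*.json")]
Path("analysis_results.json").write_text(json.dumps(results))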
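
The filtering pass in filter_price_data.py: load analysis_results.json and keep files that satisfy a filter_criteria dictionary. The criteria keys mirror the sketch above; the names in the actual script may differ:

import json

filter_criteria = {  # edit these thresholds directly, as the script expects
    "min_points": 50,
    "mean_price_range": (0.02, 0.98),
    "min_std": 0.01,
    "exclude_issues": {"empty", "constant_price", "format_error"},
}

with open("analysis_results.json") as fh:
    results = json.load(fh)

kept = [
    r["filename"] for r in results
    if r.get("num_points", 0) >= filter_criteria["min_points"]
    and filter_criteria["mean_price_range"][0] <= r.get("mean_price", 0) <= filter_criteria["mean_price_range"][1]
    and r.get("std_price", 0) >= filter_criteria["min_std"]
    and not (set(r.get("issues", [])) & filter_criteria["exclude_issues"])
]

print("\n".join(kept))
with open("filtered_filenames.txt", "w") as fh:  # one passing filename per line
    fh.write("\n".join(kept) + "\n")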

Data Flow

Recommended: Use run_all.py to execute all steps automatically. See the "Quick Start: Master Script" section above.

Manual Execution (if you prefer to run scripts individually):

  1. Run download_markets.py to fetch market data based on status (e.g., closed) from the Gamma API and save each batch as a .jsonl file in the specified output directory.
  2. Run download_event_details.py to scan the market .jsonl files, extract all unique event IDs mentioned, and fetch the full details for each event ID using the Gamma API /events/{id} endpoint. Save each event's details into a separate JSON file.
  3. Run process_data.py Task 1 to save each market from the .jsonl files into an individual market_{id}.json file (required for price history download).
  4. Run download_price_history.py to scan individual market detail JSON files, extract the first CLOB token ID (assumed to be the "Yes" outcome), and fetch the price history time series for that token from the CLOB API (/prices-history). Save the raw JSON response for each market.
  5. Run process_data.py Tasks 2 & 3 to process the downloaded data:
    • Task 2: Creates Markets TSV and Events TSV files with all market and event data.
    • Task 3 (Optional): Creates individual timeseries TSV files for each market with price history.
  6. Run analyze_price_data.py to analyze all price history JSON files. It calculates statistics for each file and identifies potential issues. The script produces two outputs: a human-readable text file summarizing the analysis for each file and a JSON file containing a list of detailed analysis dictionaries for each processed file.
  7. Run filter_price_data.py to filter the analyzed price history data based on user-defined criteria. It reads the analysis_results.json file generated by analyze_price_data.py and prints a list of filenames that meet the specified criteria to the console. It also saves the list of filenames that passed the filters to a text file.

Error Handling & Recovery

Common Errors

The scripts may encounter errors during execution. Here are the common types:

Event Details Errors

  • DNS Resolution Failures: Failed to resolve 'gamma-api.polymarket.com'
    • Cause: Temporary network connectivity or DNS issues
    • Recovery: Automatic - script will retry on next run
    • Action: Simply re-run the script; it will skip successful downloads and retry failed ones

Price History Errors

  • HTTP 400 Bad Request: 400 Client Error: Bad Request
    • Cause: Invalid token ID, market has no price history, or API endpoint issue
    • Recovery: Partial - some may be recoverable, others indicate missing data
    • Action: Re-run 2-3 times; if still failing, the data likely doesn't exist

Automatic Recovery

All download scripts implement automatic resume functionality:

  1. First Run: Downloads all data; some items may fail due to network issues
  2. Second Run: Automatically skips successfully downloaded files, only retries failed ones
  3. Third Run: Same behavior - only retries what's still missing

Recovery Strategy: After 2-3 runs:

  • Recoverable errors (network issues) will be fixed
  • Persistent failures likely indicate missing data (event doesn't exist, market has no history, etc.)

The scripts check for existing files before downloading:

  • Event details: Checks if event_{id}.json exists and is non-empty
  • Price history: Checks if price_history_yes_{market_id}.json exists and is non-empty

Logging Behavior

Individual Script Logs

  • Location: logs/download_markets.log, logs/download_event_details.log, etc.
  • Behavior: APPENDS to existing files
  • Effect: Each run adds to the log file, preserving full history

Master Log File (run_all.py)

  • Default: Creates new timestamped file each run: logs/run_all_YYYYMMDD_HHMMSS.log
  • Custom: If you specify --log-file custom.log, it will APPEND to that file
  • Effect: Default behavior preserves separate logs per run; custom names accumulate logs
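
A minimal sketch of the two logging modes, assuming nothing about run_all.py's internals beyond the filename pattern:

import logging
from datetime import datetime
from pathlib import Path

Path("logs").mkdir(exist_ok=True)

# Master log: a fresh timestamped file per run (the default behavior).
master = logging.FileHandler(f"logs/run_all_{datetime.now():%Y%m%d_%H%M%S}.log", mode="w")

# Individual script log: mode="a" appends, preserving history across runs.
script_log = logging.FileHandler("logs/download_markets.log", mode="a")

logger = logging.getLogger("pipeline")
logger.setLevel(logging.INFO)
logger.addHandler(master)
logger.addHandler(script_log)
logger.info("Step started")  # written to both files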

Getting Final Summaries

Individual Script Summaries

Each script logs a summary at the end of execution. Look for lines like:

--- Event Details Downloader Script Finished ---
Total unique IDs found: 29537
Already existed / Skipped: 25000
Attempted to fetch: 4537
Successfully fetched & saved: 4500
Fetch errors: 37

Check the individual log files:

  • logs/download_markets.log
  • logs/download_event_details.log
  • logs/download_price_history.log
  • logs/process_data_task1.log

Master Pipeline Summary

When using run_all.py, a comprehensive summary is logged at the end:

================================================================================
PIPELINE EXECUTION SUMMARY
================================================================================
Step Results:
  Download Markets                              ✓ SUCCESS
  Download Event Details                        ✓ SUCCESS
  Process Data - Task 1                         ✓ SUCCESS
  Download Price History                        ✓ SUCCESS
  Process Data - Tasks 2 & 3                    ✓ SUCCESS
  Analyze Price Data                            ✓ SUCCESS
  Filter Price Data                             ✓ SUCCESS

Overall Status: ✓ ALL STEPS COMPLETED SUCCESSFULLY
Master log file: logs/run_all_20251121_165610.log
================================================================================

The summary is in:

  • The master log file (default: logs/run_all_YYYYMMDD_HHMMSS.log)
  • Console output (if running interactively)

Requirements

  • Python 3.x
  • requests library (pip install requests)

Notes

  1. During exploration we found that roughly 36k markets have non-empty time series, but not all of them are valid; many contain only a constant 0.5 price, so some cleanup is needed.
  2. Many fields, including categories, appear to be empty or uninformative. Some could be filled in using LLMs or other techniques.

TODOs

  1. Investigate GraphQL for better-quality data or missing fields: https://thegraph.com/docs/en/subgraphs/guides/polymarket/
  2. Data cleanup and further exploration.
  3. Explore other API endpoints.

Relevant links

  1. https://github.com/DominiqueBuob/polymarket_analysis_v1?tab=readme-ov-file
  2. https://goldsky.com/blog/polymarket-dataset
  3. https://github.com/Mr-Slope/Polymarket-Autocorrelation/tree/main
  4. https://docs.polymarket.com/
