This project contains Python scripts to download historical market, event, and price history data from the Polymarket APIs and process it into usable TSV formats.
`run_all.py` is a master script that runs all the individual scripts in the correct order with comprehensive logging. This is the recommended way to run the entire pipeline.
- Runs all scripts in the correct sequence automatically
- Captures and logs all stdout/stderr from each script
- Creates timestamped master log files
- Allows skipping individual steps if data already exists
- Configurable folder names and parameters
- Provides detailed execution summary
```bash
# Run the entire pipeline with default settings
python run_all.py

# Run with custom directories and settings
python run_all.py \
    --market-data-dir market_data \
    --event-details-dir event_details \
    --market-details-dir market_details \
    --price-history-dir price_history \
    --status closed \
    --workers 10
```

If you've already completed some steps, you can skip them:
```bash
# Skip downloading markets and events (already have the data)
python run_all.py \
    --skip-download-markets \
    --skip-download-events \
    --skip-process-task1
```

Logging:
- Master log file: `logs/run_all_YYYYMMDD_HHMMSS.log` (contains all output)
- Individual script logs: `logs/download_markets.log`, `logs/download_event_details.log`, etc.
- All stdout and stderr from each script is captured and logged
To see all available options:

```bash
python run_all.py --help
```

If you prefer to run scripts individually or need more control, here are the individual scripts:
- `download_markets.py`: Fetches market data in batches based on status (e.g., `closed`) from the Gamma API and saves each batch as a `.jsonl` file in the specified output directory.
  - Purpose: Downloads the initial set of market overview data.
  - Output Directory: Contains files like `markets_offset_0_limit_20.jsonl`.
  - Resume: Automatically detects the last successfully saved batch and resumes downloading from the next offset.
  - Example Command:
    ```bash
    python download_markets.py --output-dir market_data --status closed
    ```
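  The batch loop amounts to paginating the Gamma API with `limit`/`offset` until an empty page comes back. A minimal sketch, assuming the `/markets` endpoint and a `closed` query parameter (the exact parameters the script sends may differ):

  ```python
  import json
  import requests

  GAMMA_MARKETS_URL = "https://gamma-api.polymarket.com/markets"

  def fetch_batch(offset, limit=20):
      # Query parameter names are assumptions; --status closed maps to
      # some status filter on the API side.
      resp = requests.get(
          GAMMA_MARKETS_URL,
          params={"limit": limit, "offset": offset, "closed": "true"},
          timeout=30,
      )
      resp.raise_for_status()
      return resp.json()

  offset = 0
  while True:
      batch = fetch_batch(offset)
      if not batch:  # an empty page marks the end of the listing
          break
      # One JSON object per line, matching the markets_offset_*_limit_*.jsonl files
      with open(f"markets_offset_{offset}_limit_20.jsonl", "w") as f:
          for market in batch:
              f.write(json.dumps(market) + "\n")
      offset += len(batch)
  ```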
- `download_event_details.py`: Scans the market `.jsonl` files generated by the first script, extracts all unique event IDs mentioned, and then fetches the full details for each event ID using the Gamma API `/events/{id}` endpoint. Saves each event's details into a separate JSON file.
  - Purpose: Downloads detailed data for every unique event associated with the downloaded markets.
  - Input Directory: The directory containing the market `.jsonl` files (e.g., `market_data`).
  - Output Directory: Contains files like `event_12345.json`.
  - Resume: Checks for existing event files and only downloads details for events not already present.
  - Parallelism: Uses multiple workers (default 8) to speed up downloads.
  - Example Command:
    ```bash
    python download_event_details.py --market-data-dir market_data --output-dir event_details --workers 10
    ```
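  Conceptually, the script collects event IDs from the `.jsonl` files and fans the downloads out over a thread pool. A sketch, assuming each market references its events under an `events` key with an `id` field (the actual field names may differ):

  ```python
  import json
  from concurrent.futures import ThreadPoolExecutor
  from pathlib import Path

  import requests

  def collect_event_ids(market_data_dir):
      """Scan every market .jsonl file and collect the unique event IDs mentioned."""
      ids = set()
      for path in Path(market_data_dir).glob("*.jsonl"):
          for line in path.read_text().splitlines():
              market = json.loads(line)
              for event in market.get("events", []):  # "events" field name assumed
                  ids.add(str(event["id"]))
      return ids

  def fetch_event(event_id, out_dir):
      out_path = out_dir / f"event_{event_id}.json"
      if out_path.exists() and out_path.stat().st_size > 0:
          return  # resume: skip events already downloaded
      resp = requests.get(f"https://gamma-api.polymarket.com/events/{event_id}", timeout=30)
      resp.raise_for_status()
      out_path.write_text(resp.text)

  out_dir = Path("event_details")
  out_dir.mkdir(exist_ok=True)
  with ThreadPoolExecutor(max_workers=8) as pool:  # --workers controls the pool size
      for event_id in collect_event_ids("market_data"):
          pool.submit(fetch_event, event_id, out_dir)
  ```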
- `download_price_history.py`: Scans individual market detail JSON files (`market_{id}.json`), extracts the first CLOB token ID (assumed to be the "Yes" outcome), and fetches the price history time series for that token from the CLOB API (`/prices-history`). Saves the raw JSON response for each market.
  - Purpose: Downloads the raw price history for the "Yes" outcome of each market.
  - Input Directory: Directory containing individual market detail JSON files (e.g., `market_details`, typically created by Task 1 of `process_data.py`). Use `--market-details-dir`.
  - Output Directory: Contains files like `price_history_yes_12345.json`. Use `--output-dir`.
  - Resume: Checks for existing price history files and only downloads data for markets not already present.
  - Parallelism: Uses multiple workers (default 8) to speed up downloads.
  - Example Command:
    ```bash
    python download_price_history.py --market-details-dir market_details --output-dir price_history --workers 10
    ```
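  The per-market download boils down to reading the first CLOB token ID from the market JSON and querying the CLOB API. A sketch, assuming the market stores its token IDs under `clobTokenIds` (sometimes a JSON-encoded string) and that `/prices-history` takes the token via a `market` parameter; both details are assumptions:

  ```python
  import json
  from pathlib import Path
  from typing import Optional

  import requests

  CLOB_PRICES_URL = "https://clob.polymarket.com/prices-history"

  def first_token_id(market_file: Path) -> Optional[str]:
      market = json.loads(market_file.read_text())
      token_ids = market.get("clobTokenIds")  # field name is an assumption
      if isinstance(token_ids, str):  # sometimes stored as a JSON-encoded string
          token_ids = json.loads(token_ids)
      return token_ids[0] if token_ids else None  # first ID assumed to be "Yes"

  def download_history(market_file: Path, out_dir: Path) -> None:
      token_id = first_token_id(market_file)
      if token_id is None:
          return
      market_id = market_file.stem.replace("market_", "")
      resp = requests.get(
          CLOB_PRICES_URL,
          params={"market": token_id, "interval": "max"},  # parameter names assumed
          timeout=30,
      )
      resp.raise_for_status()
      # Save the raw JSON response, as described above
      (out_dir / f"price_history_yes_{market_id}.json").write_text(resp.text)
  ```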
- `process_data.py`: Processes the downloaded market, event, and price history data.
  - Task 1 (Optional): Saves each market from the `.jsonl` files into an individual `market_{id}.json` file for easier access (`--market-output-dir`). Required if `download_price_history.py` needs these files as input.
  - Task 2: Reads the market `.jsonl` files and the corresponding event detail JSONs. It also checks the downloaded price history files (`--price-history-dir`). It then creates two separate TSV files:
    - Markets TSV (`--market-tsv-output`): Contains one row per market, with all `market_*` prefixed columns. Includes a `market_event_ids` column (comma-separated string of event IDs) and a `market_downloaded_pricehistory_nonempty` column (True/False) indicating whether the corresponding price history file was found and contained data.
    - Events TSV (`--event-tsv-output`): Contains one row per unique event encountered across all processed markets, with all `event_*` prefixed columns.
  - Task 3 (Optional): Reads the price history JSON files (`--price-history-dir`). For each file with a non-empty `history` list, it creates a new TSV file (`timeseries_{id}.tsv`) in the specified output directory (`--timeseries-output-dir`) containing `timestamp` and `price` columns.
  - Inputs: Market data directory (`--market-data-dir`), event details directory (`--event-details-dir`), price history directory (`--price-history-dir`, required for Tasks 2 & 3).
  - Outputs: Individual market JSON directory (Task 1), Markets TSV file and Events TSV file (Task 2), timeseries TSV directory (Task 3).
  - Example Command (All Tasks):
    ```bash
    python process_data.py \
        --market-data-dir market_data \
        --event-details-dir event_details \
        --price-history-dir price_history \
        --market-output-dir market_details \
        --market-tsv-output polymarket_markets.tsv \
        --event-tsv-output polymarket_events.tsv \
        --timeseries-output-dir timeseries_data
    ```
  - Example Command (Only Tasks 2 & 3 - TSV Creation):
    ```bash
    python process_data.py \
        --market-data-dir market_data \
        --event-details-dir event_details \
        --price-history-dir price_history \
        --market-tsv-output polymarket_markets.tsv \
        --event-tsv-output polymarket_events.tsv \
        --timeseries-output-dir timeseries_data \
        --skip-task1
    ```
  - Example Command (Only Task 3 - Timeseries TSVs):
    ```bash
    python process_data.py \
        --market-data-dir market_data `# Still needed for arg parser even if task skipped` \
        --event-details-dir event_details `# Still needed for arg parser even if task skipped` \
        --price-history-dir price_history \
        --timeseries-output-dir timeseries_data \
        --skip-task1 --skip-task2
    ```
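  As an illustration of Task 3, here is a minimal sketch that converts one price history JSON into a `timeseries_{id}.tsv` file; the `t`/`p` field names inside the `history` list are assumptions based on the CLOB response format:

  ```python
  import csv
  import json
  from pathlib import Path

  def write_timeseries_tsv(history_file: Path, out_dir: Path) -> None:
      """Convert one price history JSON into a two-column TSV (field names assumed)."""
      data = json.loads(history_file.read_text())
      history = data.get("history", [])
      if not history:
          return  # skipped, mirroring the script's non-empty check
      market_id = history_file.stem.replace("price_history_yes_", "")
      out_path = out_dir / f"timeseries_{market_id}.tsv"
      with out_path.open("w", newline="") as f:
          writer = csv.writer(f, delimiter="\t")
          writer.writerow(["timestamp", "price"])
          for point in history:
              writer.writerow([point["t"], point["p"]])
  ```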
- `analyze_price_data.py`: Analyzes all price history JSON files (e.g., from `price_history/`). It calculates statistics for each file, such as the number of data points, mean price, standard deviation, time range, and timestamp delta characteristics. It also identifies potential issues like empty files, constant prices, or formatting errors.
  - Purpose: To assess the quality and characteristics of downloaded price history files and produce a structured summary for further processing or filtering.
  - Input Directory: Assumes JSON files are in a `price_history/` directory relative to where the script is run.
  - Outputs:
    - `analysis_summary.txt`: A human-readable text file summarizing the analysis for each file and providing global statistics across all files.
    - `analysis_results.json`: A JSON file containing a list of detailed analysis dictionaries for each processed file. This file is intended for programmatic use, for example by `filter_price_data.py`.
  - Parallelism: Uses multiprocessing to speed up the analysis when handling many files.
  - Example Command:
    ```bash
    python analyze_price_data.py
    ```
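  The per-file analysis reduces to descriptive statistics over the price and timestamp series. A minimal sketch of the kind of summary dictionary the script produces (the exact metric names and issue labels here are assumptions):

  ```python
  import json
  import statistics
  from pathlib import Path

  def analyze_file(path: Path) -> dict:
      """Compute summary statistics for one price history file (field names assumed)."""
      try:
          history = json.loads(path.read_text()).get("history", [])
      except json.JSONDecodeError:
          return {"filename": path.name, "issues": ["format_error"]}
      if not history:
          return {"filename": path.name, "issues": ["empty"]}
      prices = [point["p"] for point in history]
      times = [point["t"] for point in history]
      deltas = [b - a for a, b in zip(times, times[1:])]
      issues = []
      if len(set(prices)) == 1:
          issues.append("constant_price")  # e.g. a series stuck at 0.5
      return {
          "filename": path.name,
          "num_points": len(prices),
          "mean_price": statistics.mean(prices),
          "std_price": statistics.stdev(prices) if len(prices) > 1 else 0.0,
          "time_range": (times[0], times[-1]),
          "max_delta": max(deltas) if deltas else None,
          "issues": issues,
      }
  ```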
- `filter_price_data.py`: Filters the analyzed price history data based on user-defined criteria. It reads the `analysis_results.json` file generated by `analyze_price_data.py`.
  - Purpose: To select a subset of price history files that meet specific quality or characteristic thresholds.
  - Inputs:
    - `analysis_results.json`: The JSON file output by `analyze_price_data.py`.
    - Filtering criteria: Defined directly within the `filter_criteria` dictionary in the script.
  - Outputs:
    - Prints a list of filenames that meet the specified criteria to the console.
    - `filtered_filenames.txt`: A text file containing the list of filenames that passed the filters, one filename per line.
  - Customization: Modify the `filter_criteria` dictionary within the script to set the desired thresholds for metrics such as minimum/maximum number of data points, mean price range, standard deviation range, issues to exclude, or maximum allowed time delta between points.
  - Example Command:
    ```bash
    python filter_price_data.py
    ```
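  Filtering is then a single pass over `analysis_results.json` applying the thresholds from `filter_criteria`. A sketch with hypothetical criteria keys (the real dictionary in the script may use different names):

  ```python
  import json

  # Hypothetical thresholds; edit filter_criteria in the script to match your needs
  filter_criteria = {
      "min_points": 100,
      "max_mean_price": 0.99,
      "min_std_price": 0.01,
      "excluded_issues": {"empty", "constant_price", "format_error"},
  }

  with open("analysis_results.json") as f:
      results = json.load(f)

  passed = [
      r["filename"]
      for r in results
      if r.get("num_points", 0) >= filter_criteria["min_points"]
      and r.get("mean_price", 1.0) <= filter_criteria["max_mean_price"]
      and r.get("std_price", 0.0) >= filter_criteria["min_std_price"]
      and not filter_criteria["excluded_issues"] & set(r.get("issues", []))
  ]

  with open("filtered_filenames.txt", "w") as f:
      f.write("\n".join(passed))
  print(f"{len(passed)} files passed the filters")
  ```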
Recommended: Use `run_all.py` to execute all steps automatically. See the "Quick Start: Master Script" section above.
Manual Execution (if you prefer to run scripts individually):
- Run `download_markets.py` to fetch market data based on status (e.g., `closed`) from the Gamma API and save each batch as a `.jsonl` file in the specified output directory.
- Run `download_event_details.py` to scan the market `.jsonl` files, extract all unique event IDs mentioned, and fetch the full details for each event ID using the Gamma API `/events/{id}` endpoint. Save each event's details into a separate JSON file.
- Run `process_data.py` Task 1 to save each market from the `.jsonl` files into an individual `market_{id}.json` file (required for price history download).
- Run `download_price_history.py` to scan individual market detail JSON files, extract the first CLOB token ID (assumed to be the "Yes" outcome), and fetch the price history time series for that token from the CLOB API (`/prices-history`). Save the raw JSON response for each market.
- Run `process_data.py` Tasks 2 & 3 to process the downloaded data:
  - Task 2: Creates Markets TSV and Events TSV files with all market and event data.
  - Task 3 (Optional): Creates individual timeseries TSV files for each market with price history.
- Run `analyze_price_data.py` to analyze all price history JSON files. It calculates statistics for each file and identifies potential issues, producing a human-readable text file summarizing the analysis and a JSON file containing detailed analysis dictionaries for each processed file.
- Run `filter_price_data.py` to filter the analyzed price history data based on user-defined criteria. It reads the `analysis_results.json` file generated by `analyze_price_data.py`, prints the filenames that meet the criteria to the console, and also saves them to a text file.
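Taken together, the manual sequence looks roughly like the following sketch; the Task 1-only invocation of `process_data.py` in particular is an assumption pieced together from the flags documented above:

```bash
python download_markets.py --output-dir market_data --status closed
python download_event_details.py --market-data-dir market_data --output-dir event_details --workers 8

# Task 1 only: write individual market_{id}.json files (flags assumed)
python process_data.py --market-data-dir market_data --event-details-dir event_details \
    --market-output-dir market_details --skip-task2

python download_price_history.py --market-details-dir market_details --output-dir price_history --workers 8

# Tasks 2 & 3: build the TSVs and per-market timeseries files
python process_data.py --market-data-dir market_data --event-details-dir event_details \
    --price-history-dir price_history --market-tsv-output polymarket_markets.tsv \
    --event-tsv-output polymarket_events.tsv --timeseries-output-dir timeseries_data --skip-task1

python analyze_price_data.py
python filter_price_data.py
```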
The scripts may encounter errors during execution. Here are the common types:
- DNS Resolution Failures: `Failed to resolve 'gamma-api.polymarket.com'`
  - Cause: Temporary network connectivity or DNS issues
  - Recovery: Automatic; the script will retry on the next run
  - Action: Simply re-run the script; it will skip successful downloads and retry failed ones
- HTTP 400 Bad Request: `400 Client Error: Bad Request`
  - Cause: Invalid token ID, a market with no price history, or an API endpoint issue
  - Recovery: Partial; some requests may succeed on retry, while others indicate missing data
  - Action: Re-run 2-3 times; if the error persists, the data likely doesn't exist
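A sketch of how a download script can distinguish these two cases (hypothetical helper; the real scripts may structure this differently). Leaving the output file missing on failure is what makes the next run retry automatically:

```python
import requests

def fetch_json(url, params=None):
    """Return parsed JSON, or None on a failure the next run should retry."""
    try:
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.ConnectionError:
        # DNS resolution / network failure: recoverable, retried on the next run
        return None
    except requests.exceptions.HTTPError as err:
        if err.response is not None and err.response.status_code == 400:
            # Bad Request: invalid token ID or no price history for this market
            return None
        raise
```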
All download scripts implement automatic resume functionality:
- First Run: Downloads all data; some items may fail due to network issues
- Second Run: Automatically skips successfully downloaded files, only retries failed ones
- Third Run: Same behavior - only retries what's still missing
Recovery Strategy: After 2-3 runs:
- ✅ Recoverable errors (network issues) will be fixed
- ❌ Persistent failures likely indicate missing data (event doesn't exist, market has no history, etc.)
The scripts check for existing files before downloading:
- Event details: Checks whether `event_{id}.json` exists and is non-empty
- Price history: Checks whether `price_history_yes_{market_id}.json` exists and is non-empty
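The check itself is simple; a minimal sketch (the helper name is hypothetical):

```python
from pathlib import Path

def already_downloaded(path: Path) -> bool:
    # A file counts as done only if it exists and is non-empty,
    # so empty or truncated files are retried on the next run.
    return path.exists() and path.stat().st_size > 0
```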
Individual script logs:
- Location: `logs/download_markets.log`, `logs/download_event_details.log`, etc.
- Behavior: Appends to existing files
- Effect: Each run adds to the log file, preserving the full history

Master log (`run_all.py`):
- Default: Creates a new timestamped file each run: `logs/run_all_YYYYMMDD_HHMMSS.log`
- Custom: If you specify `--log-file custom.log`, output is appended to that file
- Effect: The default behavior keeps separate logs per run; custom log names accumulate output across runs
Each script logs a summary at the end of execution. Look for lines like:
```
--- Event Details Downloader Script Finished ---
Total unique IDs found: 29537
Already existed / Skipped: 25000
Attempted to fetch: 4537
Successfully fetched & saved: 4500
Fetch errors: 37
```
Check the individual log files:
- `logs/download_markets.log`
- `logs/download_event_details.log`
- `logs/download_price_history.log`
- `logs/process_data_task1.log`
When using `run_all.py`, a comprehensive summary is logged at the end:
```
================================================================================
PIPELINE EXECUTION SUMMARY
================================================================================
Step Results:
  Download Markets            ✓ SUCCESS
  Download Event Details      ✓ SUCCESS
  Process Data - Task 1       ✓ SUCCESS
  Download Price History      ✓ SUCCESS
  Process Data - Tasks 2 & 3  ✓ SUCCESS
  Analyze Price Data          ✓ SUCCESS
  Filter Price Data           ✓ SUCCESS

Overall Status: ✓ ALL STEPS COMPLETED SUCCESSFULLY
Master log file: logs/run_all_20251121_165610.log
================================================================================
```
The summary is written to:
- The master log file (default: `logs/run_all_YYYYMMDD_HHMMSS.log`)
- Console output (if running interactively)
- Python 3.x
- `requests` library (`pip install requests`)
- During exploration, we found that about 36k markets have non-empty time series, but not all of them are valid; many contain only a constant 0.5 price, so some cleanup is needed.
- Many fields, including categories, appear to be useless or empty. Some of these could be filled in using LLMs or other techniques.
- Check whether GraphQL provides better-quality data or missing fields: https://thegraph.com/docs/en/subgraphs/guides/polymarket/
- Data cleanup and further exploration.
- Other endpoints.