A data pipeline for collecting, processing, and analyzing BoardGameGeek game data using BigQuery.
This project collects board game data from BoardGameGeek's API, stores raw responses in BigQuery, and processes them into a normalized data warehouse for analysis.
Fetch New Games (src/pipeline/fetch_new_games.py)
- Retrieves new board game IDs from BoardGameGeek
- Fetches API responses for unfetched games
- Processes responses into normalized tables
- Runs daily at 6 AM UTC
Refresh Old Games (src/pipeline/refresh_old_games.py)
- Refreshes stale game data based on publication year
- Recent games (0-2 years): refreshed weekly
- Established games (2-5 years): refreshed monthly
- Classic games (5-10 years): refreshed quarterly
- Vintage games (10+ years): refreshed bi-annually
- Runs daily at 7 AM UTC
graph TD
A[BGG XML API] -->|Fetch Game IDs| B[thing_ids Table]
B -->|Unfetched IDs| C[Response Fetcher]
C -->|Rate-Limited Requests| A
C -->|Store Raw Data| D[raw_responses Table]
C -->|Track Fetch| E[fetched_responses Table]
E -->|Unprocessed Records| F[Response Processor]
D -->|Raw XML| F
F -->|Normalized Data| G[BigQuery Warehouse]
F -->|Track Processing| H[processed_responses Table]
Raw Dataset (raw)
thing_ids: Game ID registryraw_responses: Raw API responsesfetched_responses: Fetch trackingprocessed_responses: Processing trackingrequest_log: API request audit logfetch_in_progress: Prevents duplicate concurrent fetches
Core Dataset (core)
games: Core game datacategories,mechanics,families: Dimension tablesdesigners,artists,publishers: Creator tablesrankings,player_counts: Metrics tables
Analytics Dataset (analytics) - Managed by Dataform
games_active: View of latest game data (deduped by game_id)games_features: Denormalized table with computed columns:hurdle: Binary flag for games with 25+ ratingsgeek_rating,complexity,rating: Renamed metricslog_users_rated: Log-transformed user count- Aggregated arrays for categories, mechanics, publishers, designers, artists, families
All infrastructure is managed via Terraform in the terraform/ directory.
- Cloud Run Jobs: Two jobs execute the pipelines daily
bgg-fetch-new-games: 1 vCPU, 2GB memorybgg-refresh-old-games: 1 vCPU, 2GB memory
- Dataform: Analytics transformations run via GitHub Actions workflow
- GitHub Actions: Triggers Cloud Run jobs on schedule, deploys on merge to main, runs Dataform
- Cloud Build: Builds and deploys Docker images
- Python 3.12+
- UV package manager
- Google Cloud project with:
- Cloud Run API
- Cloud Build API
- BigQuery API
- Service account with BigQuery Data Editor, Cloud Run Invoker roles
- Clone the repository:
git clone https://github.com/phenrickson/bgg-data-warehouse.git
cd bgg-data-warehouse- Install UV and dependencies:
# Install UV (see https://docs.astral.sh/uv/getting-started/installation/)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # Unix/macOS
.venv\Scripts\activate # Windows
uv sync- Configure environment:
cp .env.example .env
# Edit .env with your GCP_PROJECT_ID, ENVIRONMENT, BGG_API_TOKEN- Configure GitHub repository secrets:
SERVICE_ACCOUNT_KEY: GCP service account key JSONGCP_PROJECT_ID: Google Cloud project IDBGG_API_TOKEN: BoardGameGeek API token
# Fetch new games
uv run python -m src.pipeline.fetch_new_games
# Refresh old games
uv run python -m src.pipeline.refresh_old_games
# Run tests
uv run pytestgcloud run jobs execute bgg-fetch-new-games-prod --region us-central1 --wait
gcloud run jobs execute bgg-refresh-old-games-prod --region us-central1 --waitA Streamlit dashboard is available for exploring the data:
streamlit run src/visualization/dashboard.pyThe dashboard is also deployed to Cloud Run and accessible via the URL output by the deploy workflow.
This project uses semantic versioning. When changes are merged to main with a version bump in pyproject.toml, a GitHub Action automatically creates a corresponding git tag.
See CHANGELOG.md for version history.
MIT License