
PR #573: Live MLB game logs refresh (tnestico/mlb_scraper pattern) — daily 4:15 AM (#442)

Merged
jaayslaughter-cpu merged 1 commit into main from pr-573-game-logs-refresh on May 15, 2026

Conversation

Owner

jaayslaughter-cpu commented May 15, 2026

Changes

New file: game_logs_refresh.py

Daily refresh of mlb_batting_logs.csv and mlb_pitching_logs.csv using the MLB Stats API (same source as tnestico/mlb_scraper).

Architecture (inspired by mlb_scraper):

  1. Fetch completed game IDs from MLB Stats API schedule for last 7 days
  2. Parallel-fetch /api/v1/game/{id}/boxscore (8 threads)
  3. Parse batter rows: mlbam_id, player, date, starter, home_runs, h_1b, h_2b, h_3b, b_ab, b_pa, b_runs, b_rbi, b_k
  4. Parse pitcher rows: mlbam_id, player, date, starter, outs, strikeouts, earnedruns, walks, hits
  5. Upsert into new Postgres tables live_batting_logs + live_pitching_logs (auto-created)
  6. Rebuild CSVs from full Postgres history — keeps static files fresh

First-run behavior: Seeds Postgres from existing static CSVs (9,819 batting + 4,007 pitching rows) before upserts.

orchestrator.py

  • New job_game_logs_refresh at 4:15 AM PT daily (after savant_refresh at 4:00 AM)
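The scheduling itself lives in orchestrator.py, whose internals aren't shown in this PR; as a hypothetical sketch, computing the next 4:15 AM PT run with only the standard library could look like:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

PT = ZoneInfo("America/Los_Angeles")

def next_run(now: datetime, hour: int = 4, minute: int = 15) -> datetime:
    """Return the next `hour:minute` Pacific time strictly after `now`.

    `now` must be timezone-aware; the result is in Pacific time.
    """
    local = now.astimezone(PT)
    candidate = local.replace(hour=hour, minute=minute,
                              second=0, microsecond=0)
    if candidate <= local:
        candidate += timedelta(days=1)  # already past today's slot
    return candidate
```

An orchestrator loop would sleep until `next_run(...)`, fire the job, and repeat.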

requirements_army.txt

  • git+https://github.com/tnestico/mlb_scraper.git — available for future pitch-level enrichment
  • polars>=0.20.0 — required by mlb_scraper

Why

Static CSVs were last updated manually. This keeps 2026 batting/pitching logs current through yesterday's games, ensuring team_form_layer.py and other consumers always see fresh data.


Summary by cubic

Automates a daily MLB batting and pitching game-log refresh at 4:15 AM PT using the MLB Stats API, keeping 2026 CSVs and Postgres tables up to date. This keeps team_form_layer.py and other consumers current through yesterday’s games.

  • New Features

    • Added game_logs_refresh.py: fetches last 7 days of completed games, parallel boxscore pulls, parses batter/pitcher stats, upserts to live_batting_logs/live_pitching_logs, and rebuilds data/stats/2026/mlb_batting_logs.csv and data/stats/2026/mlb_pitching_logs.csv.
    • Scheduled game_logs_refresh in orchestrator.py (daily 4:15 AM PT; runs after savant_refresh).
    • First run seeds Postgres from existing CSVs.
  • Dependencies

    • Added git+https://github.com/tnestico/mlb_scraper.git and polars>=0.20.0.

Written for commit c39dc1b. Summary will update on new commits.


coderabbitai Bot commented May 15, 2026

Warning

Rate limit exceeded

@jaayslaughter-cpu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 21 minutes and 44 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f1a571a-90d3-4dee-9e4c-3702534af514

📥 Commits

Reviewing files that changed from the base of the PR and between 951b5d4 and c39dc1b.

📒 Files selected for processing (3)
  • game_logs_refresh.py
  • orchestrator.py
  • requirements_army.txt



deepsource-io Bot commented May 15, 2026

DeepSource Code Review

We reviewed changes in 951b5d4...c39dc1b on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

See full review on DeepSource ↗

PR Report Card: Overall Grade · Security · Reliability · Complexity · Hygiene (grade badges not captured in this export)

Code Review Summary

| Analyzer | Updated (UTC) | Details |
| --- | --- | --- |
| Docker | May 15, 2026 5:10 a.m. | Review ↗ |
| JavaScript | May 15, 2026 5:10 a.m. | Review ↗ |
| Python | May 15, 2026 5:10 a.m. | Review ↗ |
| SQL | May 15, 2026 5:10 a.m. | Review ↗ |
| Secrets | May 15, 2026 5:10 a.m. | Review ↗ |

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.

@codacy-production

Not up to standards ⛔

🔴 Issues: 1 critical · 3 high · 2 medium

Alerts: ⚠ 6 issues (gate: ≤ 0 issues of at least minor severity)

Results: 6 new issues

| Category | Results |
| --- | --- |
| ErrorProne | 3 high |
| Security | 1 critical · 2 medium |

View in Codacy

🟢 Metrics: Complexity 60

View in Codacy

TIP This summary will be updated as you push new changes.


gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new script, game_logs_refresh.py, and a corresponding scheduled job in orchestrator.py to automate the daily retrieval and storage of MLB batting and pitching logs for the 2026 season. The implementation includes parallel fetching from the MLB Stats API and synchronization between a Postgres database and local CSV files. Feedback highlights a bug in the inningsPitched parsing logic that fails on whole numbers and suggests optimizing database performance by replacing row-by-row insertions with bulk operations in the upsert and seeding functions.

Comment thread game_logs_refresh.py
Comment on lines +247 to +248
innings, thirds = ip_str.split(".")
outs = int(innings) * 3 + int(thirds)

high

The current logic for parsing inningsPitched raises a ValueError when the string has no decimal point (e.g., "5" instead of "5.0"); the except block then credits 0 outs for those full-inning outings.

Suggested change
innings, thirds = ip_str.split(".")
outs = int(innings) * 3 + int(thirds)
parts = ip_str.split(".")
innings = int(parts[0])
thirds = int(parts[1]) if len(parts) > 1 else 0
outs = innings * 3 + thirds
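Packaged as a standalone helper, the suggested fix might look like this (hypothetical function name; the PR's actual code is inline). Note the fractional digit in an innings-pitched string counts thirds of an inning, not tenths:

```python
def ip_to_outs(ip_str: str) -> int:
    """Convert an inningsPitched string ("5", "5.0", "5.2") to total outs.

    The digit after the decimal point counts outs (thirds of an inning),
    so "5.2" means 5 full innings plus 2 outs = 17 outs.
    """
    parts = ip_str.split(".")
    innings = int(parts[0])
    thirds = int(parts[1]) if len(parts) > 1 else 0
    return innings * 3 + thirds
```

This handles both the "5" and "5.0" forms that the original `split(".")` unpacking choked on.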

Comment thread game_logs_refresh.py
Comment on lines +112 to +132
    sql = """
    INSERT INTO live_batting_logs
        (mlbam_id, player, game_date, starter,
         home_runs, h_1b, h_2b, h_3b, b_ab, b_pa, b_runs, b_rbi, b_k)
    VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
    ON CONFLICT (mlbam_id, game_date) DO UPDATE SET
        player=EXCLUDED.player, starter=EXCLUDED.starter,
        home_runs=EXCLUDED.home_runs, h_1b=EXCLUDED.h_1b,
        h_2b=EXCLUDED.h_2b, h_3b=EXCLUDED.h_3b,
        b_ab=EXCLUDED.b_ab, b_pa=EXCLUDED.b_pa,
        b_runs=EXCLUDED.b_runs, b_rbi=EXCLUDED.b_rbi, b_k=EXCLUDED.b_k
    """
    with conn.cursor() as cur:
        for r in rows:
            cur.execute(sql, (
                r["mlbam_id"], r["player"], r["date"], r["starter"],
                r["home_runs"], r["h_1b"], r["h_2b"], r["h_3b"],
                r["b_ab"], r["b_pa"], r["b_runs"], r["b_rbi"], r["b_k"],
            ))
    conn.commit()
    return len(rows)

medium

Performing row-by-row inserts in a loop is inefficient. Using psycopg2.extras.execute_values allows for bulk upserts, which significantly reduces database round-trips and improves performance.

    from psycopg2.extras import execute_values
    sql = """
    INSERT INTO live_batting_logs
        (mlbam_id, player, game_date, starter,
         home_runs, h_1b, h_2b, h_3b, b_ab, b_pa, b_runs, b_rbi, b_k)
    VALUES %s
    ON CONFLICT (mlbam_id, game_date) DO UPDATE SET
        player=EXCLUDED.player, starter=EXCLUDED.starter,
        home_runs=EXCLUDED.home_runs, h_1b=EXCLUDED.h_1b,
        h_2b=EXCLUDED.h_2b, h_3b=EXCLUDED.h_3b,
        b_ab=EXCLUDED.b_ab, b_pa=EXCLUDED.b_pa,
        b_runs=EXCLUDED.b_runs, b_rbi=EXCLUDED.b_rbi, b_k=EXCLUDED.b_k
    """
    data = [
        (
            r["mlbam_id"], r["player"], r["date"], r["starter"],
            r["home_runs"], r["h_1b"], r["h_2b"], r["h_3b"],
            r["b_ab"], r["b_pa"], r["b_runs"], r["b_rbi"], r["b_k"]
        )
        for r in rows
    ]
    with conn.cursor() as cur:
        execute_values(cur, sql, data)
    conn.commit()
    return len(rows)

Comment thread game_logs_refresh.py
Comment on lines +138 to +157
    sql = """
    INSERT INTO live_pitching_logs
        (mlbam_id, player, game_date, starter,
         outs, strikeouts, earnedruns, walks, hits)
    VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)
    ON CONFLICT (mlbam_id, game_date) DO UPDATE SET
        player=EXCLUDED.player, starter=EXCLUDED.starter,
        outs=EXCLUDED.outs, strikeouts=EXCLUDED.strikeouts,
        earnedruns=EXCLUDED.earnedruns, walks=EXCLUDED.walks,
        hits=EXCLUDED.hits
    """
    with conn.cursor() as cur:
        for r in rows:
            cur.execute(sql, (
                r["mlbam_id"], r["player"], r["date"], r["starter"],
                r["outs"], r["strikeouts"], r["earnedruns"],
                r["walks"], r["hits"],
            ))
    conn.commit()
    return len(rows)

medium

Similar to the batting upsert, this pitching upsert should use bulk operations for better efficiency.

    from psycopg2.extras import execute_values
    sql = """
    INSERT INTO live_pitching_logs
        (mlbam_id, player, game_date, starter,
         outs, strikeouts, earnedruns, walks, hits)
    VALUES %s
    ON CONFLICT (mlbam_id, game_date) DO UPDATE SET
        player=EXCLUDED.player, starter=EXCLUDED.starter,
        outs=EXCLUDED.outs, strikeouts=EXCLUDED.strikeouts,
        earnedruns=EXCLUDED.earnedruns, walks=EXCLUDED.walks,
        hits=EXCLUDED.hits
    """
    data = [
        (
            r["mlbam_id"], r["player"], r["date"], r["starter"],
            r["outs"], r["strikeouts"], r["earnedruns"],
            r["walks"], r["hits"]
        )
        for r in rows
    ]
    with conn.cursor() as cur:
        execute_values(cur, sql, data)
    conn.commit()
    return len(rows)

Comment thread game_logs_refresh.py
Comment on lines +294 to +328
def _seed_from_csv(conn) -> None:
    """If Postgres tables are empty, seed from existing CSVs."""
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM live_batting_logs")
        count = cur.fetchone()[0]
    if count > 0:
        return

    logger.info("[GameLogs] Seeding Postgres from existing CSVs...")
    for table, path, cols, pk_col in [
        ("live_batting_logs", _BATTING_CSV, _BATTING_COLS, "game_date"),
        ("live_pitching_logs", _PITCHING_CSV, _PITCHING_COLS, "game_date"),
    ]:
        if not path.exists():
            continue
        rows_inserted = 0
        with path.open("r", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            with conn.cursor() as cur:
                for row in reader:
                    placeholders = ", ".join(["%s"] * len(cols))
                    col_names = ", ".join(
                        [c if c != "date" else "game_date" for c in cols]
                    )
                    values = [row.get(c, row.get("date" if c == "date" else c, None))
                              for c in cols]
                    cur.execute(
                        f"INSERT INTO {table} ({col_names}) VALUES ({placeholders}) "
                        f"ON CONFLICT DO NOTHING",
                        values,
                    )
                    rows_inserted += 1
        conn.commit()
        logger.info("[GameLogs] Seeded %d rows into %s", rows_inserted, table)


medium

Seeding nearly 10,000 rows from CSV using individual INSERT statements is extremely slow. Refactoring this to use bulk insertion with execute_values will drastically improve the first-run experience.

def _seed_from_csv(conn) -> None:
    """If Postgres tables are empty, seed from existing CSVs."""
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM live_batting_logs")
        count = cur.fetchone()[0]
    if count > 0:
        return

    logger.info("[GameLogs] Seeding Postgres from existing CSVs...")
    from psycopg2.extras import execute_values
    for table, path, cols in [
        ("live_batting_logs",  _BATTING_CSV,  _BATTING_COLS),
        ("live_pitching_logs", _PITCHING_CSV, _PITCHING_COLS),
    ]:
        if not path.exists():
            continue
        with path.open("r", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            rows = list(reader)
            if not rows:
                continue
            col_names = ", ".join([c if c != "date" else "game_date" for c in cols])
            sql = f"INSERT INTO {table} ({col_names}) VALUES %s ON CONFLICT DO NOTHING"
            data = [tuple(row.get(c) for c in cols) for row in rows]
            with conn.cursor() as cur:
                execute_values(cur, sql, data)
        conn.commit()
        logger.info("[GameLogs] Seeded %d rows into %s", len(rows), table)

@jaayslaughter-cpu jaayslaughter-cpu merged commit 52e9c24 into main May 15, 2026
7 of 9 checks passed
Contributor

ecc-tools Bot commented May 15, 2026

ECC bundle files are already tracked in this repository. Skipping generation of another bundle PR.

Comment thread game_logs_refresh.py
from __future__ import annotations

import csv
import io

Unused import io


An object has been imported but is not used anywhere in the file.
It should either be used or the import should be removed.

Comment thread game_logs_refresh.py
starter_batter_id = batting_order[0] if batting_order else None
starter_pitcher_id = pitching_order[0] if pitching_order else None

for pid_str, pdata in players.items():

Unused variable 'pid_str'


An unused variable takes up space in the code, and can lead to confusion, and it should be removed. If this variable is necessary, name the variable _ to indicate that it will be unused, or start the name with unused or _unused.

Comment thread game_logs_refresh.py

# ── Batting ───────────────────────────────────────────────────────
if pid in batters:
    b = stats.get("batting", {}).get("summary", None)

Unused variable 'b'


An unused variable takes up space in the code, and can lead to confusion, and it should be removed. If this variable is necessary, name the variable _ to indicate that it will be unused, or start the name with unused or _unused.

Comment thread game_logs_refresh.py
return

logger.info("[GameLogs] Seeding Postgres from existing CSVs...")
for table, path, cols, pk_col in [

Unused variable 'pk_col'


An unused variable takes up space in the code, and can lead to confusion, and it should be removed. If this variable is necessary, name the variable _ to indicate that it will be unused, or start the name with unused or _unused.
