fix: restore market history category snapshots by rvac-bucky · Pull Request #274 · ambicuity/New-Grad-Jobs

rvac-bucky · 2026-05-20T11:22:44Z

Root cause

save_market_history() was only reading the legacy job["categories"] list, but enriched jobs now store category data in the singular job["category"] object. That left new market-history snapshots with an empty categories map.

Fix

Aggregate category counts from job["category"]["id"]
Keep the legacy categories list as a fallback for older shapes
Validate category values before counting so malformed scraped data cannot create bad keys
Preserve existing snapshot schema and retention behavior

Regression protection

Added focused coverage around:

singular category aggregation
preference for the enriched singular field over legacy fallback
legacy categories fallback
malformed category payloads
deterministic snapshot schema and retention behavior

Validation

pre-commit on touched files
pytest tests/test_save_market_history.py -q -o addopts='' -> 25 passed
make test -> 721 passed
GitHub Actions and CodeRabbit were green on the superseded branch before opening this clean PR

Supersedes

Supersedes #252. That PR was technically clean, but GitHub still displayed stale CHANGES_REQUESTED from old bot reviews on obsolete commits. This replacement PR carries the same scoped fix from a clean branch so the review state is not polluted by historical bot feedback.

Fixes #228

Summary by CodeRabbit

Improvements
- Enhanced job category data processing to support multiple data formats
- Strengthened data validation for job information handling
- Extended test coverage for category aggregation logic to ensure reliability

coderabbitai · 2026-05-20T11:23:01Z

Warning

Rate limit exceeded

@rvac-bucky has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 24 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7113e375-802b-4079-88f2-ca871f356e7b

📥 Commits

Reviewing files that changed from the base of the PR and between 63f04fd and 41c7da7.

📒 Files selected for processing (2)

scripts/update_jobs.py
tests/test_save_market_history.py

📝 Walkthrough

Walkthrough

This PR fixes a bug where market-history.json snapshots contained empty category counts despite non-empty jobs. The category aggregation logic in save_market_history now correctly handles the enriched job["category"]["id"] shape while maintaining backward compatibility with legacy job["categories"] lists. The fix includes comprehensive test coverage for category preference, fallback behavior, and edge cases, along with standardized UTF-8 encoding across the test suite.

Changes

Category aggregation and test coverage

Layer / File(s)	Summary
Category aggregation implementation `scripts/update_jobs.py`	`Iterator` type imported; `save_market_history` now normalizes and accumulates category IDs from `job["category"]["id"]` when present, falling back to legacy `job["categories"]` entries.
Category aggregation test coverage `tests/test_save_market_history.py`	New test cases validate enriched `category.id` aggregation, preference for enriched singular category over legacy lists, fallback when enriched category is absent, disregard of legacy categories when enriched category exists (even if malformed), handling of missing category data, and robustness against malformed shapes.
Test file encoding standardization `tests/test_save_market_history.py`	All test methods updated to read `market-history.json` with explicit `encoding="utf-8"` parameter.
JSON termination cleanup `docs/predictions.json`	File closing brace adjusted for clean termination.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A rabbit hops through history's row,
Categories now begin to flow—
Enriched fields dance, legacy stays,
Tests ensure the data's ablaze,
UTF-8 strings in files so clean,
The finest snapshots ever seen!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: restore market history category snapshots' directly addresses the main change: fixing empty category snapshots by restoring category population in market-history.json.
Linked Issues check	✅ Passed	The pull request fully addresses issue `#228` objectives: aggregates from job['category']['id'] with legacy fallback, adds comprehensive regression tests for enriched/legacy/malformed shapes, and preserves snapshot schema.
Out of Scope Changes check	✅ Passed	All changes are in-scope: update_jobs.py logic fix and type import, test file expansion for category scenarios, and JSON formatting adjustment are directly related to restoring category snapshot functionality.
Docstring Coverage	✅ Passed	Docstring coverage is 96.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

📋 Issue Planner

Built with CodeRabbit's Coding Plans for faster development and fewer bugs.

View plan used: #228

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request updates the category counting logic in scripts/update_jobs.py to prioritize a new singular category field over the legacy categories list, utilizing a new nested helper function and comprehensive unit tests. Feedback suggests moving the helper function to the module level for better maintainability and refining the fallback logic to ensure legacy data is used if the new category field is present but invalid.

gemini-code-assist · 2026-05-20T11:28:47Z

+    def iter_category_ids(job: Dict[str, Any]) -> Iterator[str]:
+        if 'category' in job:
+            category_id = get_nested_value(job, 'category.id')
+            if isinstance(category_id, str):
+                normalized = category_id.strip()
+                if normalized:
+                    yield normalized
+            return
+
+        legacy_categories = job.get('categories', [])
+        if not isinstance(legacy_categories, list):
+            return
+
+        for category in legacy_categories:
+            if isinstance(category, str):
+                normalized = category.strip()
+                if normalized:
+                    yield normalized



Defining iter_category_ids inside save_market_history causes it to be redefined on every call. While this function is currently only called once per script execution, moving it to the module level is a better practice for maintainability and testability, and it avoids unnecessary overhead if the script's entry points ever change.

gemini-code-assist · 2026-05-20T11:28:47Z

+        if 'category' in job:
+            category_id = get_nested_value(job, 'category.id')
+            if isinstance(category_id, str):
+                normalized = category_id.strip()
+                if normalized:
+                    yield normalized
+            return


The current logic assumes that if the category key exists, it is the exclusive source of truth, even if it is None or lacks a valid id. If an upstream scraper partially fails and leaves a null category while categories still contains valid legacy data, this job will be recorded with no category. Consider falling back to legacy_categories if category_id is not found or invalid within the category object.

References

Ensure that invalid inputs or states are safely handled in all cases, providing appropriate fallbacks where data integrity might be compromised.

ambicuity · 2026-05-26T17:41:03Z

Posting a post-closure review since this work intersects a real, currently-live production bug. No agenda about the close decision — just leaving the technical record so it's findable when someone revisits this area.

The underlying bug is still live on `main`

scripts/update_jobs.py:2531-2533 reads the wrong key:

category_counts = Counter()
for job in jobs:
    for category in job.get('categories', []):   # ← plural list, never populated
        category_counts[category] += 1

But enrich_jobs() at scripts/update_jobs.py:2400 writes the singular form and never produces a categories list:

category = categorize_job(title, description)
job['category'] = category    # singular dict {id, name, emoji}

So category_counts is always empty by the time the snapshot is written. Verified against the current published artifact:

$ python3 -c "import json; d=json.load(open('docs/market-history.json')); \
  [print(s['date'], 'total=', s['total_jobs'], 'categories=', s['categories']) \
   for s in d['snapshots'][-5:]]"
2026-05-22 total= 1332 categories= {}
2026-05-23 total= 1328 categories= {}
2026-05-24 total= 1402 categories= {}
2026-05-25 total= 1380 categories= {}
2026-05-26 total= 1332 categories= {}

All 81 snapshots in docs/market-history.json have categories: {} — the category breakdown that drives any downstream "growing/declining categories" analytics has been silently dead for the entire retention window.

The fix in this PR was correct

The diff replaces the wrong-key read with a small generator:

def iter_category_ids(job: Dict[str, Any]) -> Iterator[str]:
    category_id = get_nested_value(job, 'category.id')
    if isinstance(category_id, str):
        normalized = category_id.strip()
        if normalized:
            yield normalized
            return
    legacy_categories = job.get('categories', [])
    if not isinstance(legacy_categories, list):
        return
    for category in legacy_categories:
        if isinstance(category, str):
            normalized = category.strip()
            if normalized:
                yield normalized

What this gets right that's worth preserving on a future attempt:

Reads from the singular category.id first (the post-enrichment shape).
Falls back to legacy categories only when the singular path yields nothing — important for partial-scrape recovery (matches gemini-code-assist's second comment).
Validates element types (isinstance(..., str)) and strips empties so malformed upstream data can't poison the Counter keys.
Placed at module level (not nested inside save_market_history), satisfying gemini-code-assist's first comment.

AI reviewer feedback was fully addressed

gemini-code-assist (63f04fdd, "move helper to module level" + "fall back when category exists but invalid"): the final commit 41c7da7e ("fix: handle invalid market history categories") places the helper at module level and adds test_falls_back_to_legacy_categories_when_category_payload_is_invalid exercising the four invalid shapes ({'id': None}, {}, {'id': ''}, None). Verified in the diff.
CodeRabbit: rate-limited at review time, but the CodeRabbit status check finished SUCCESS. No outstanding actionable comments.

CI state at close

All 6 checks were SUCCESS, mergeStateStatus: CLEAN, mergeable: MERGEABLE. The PR also already includes the docs/predictions.json trailing-newline fix that has since been re-applied via #298.

Test coverage shipped with the fix

tests/test_save_market_history.py adds five regression tests covering exactly the failure modes that matter here:

test_counts_categories_correctly_from_singular_category_field — the path that's broken on main today.
test_prefers_singular_category_field_over_legacy_categories_list — guards against double-counting if both shapes ever coexist.
test_falls_back_to_legacy_categories_list_when_category_missing — keeps old fixtures working.
test_falls_back_to_legacy_categories_when_category_payload_is_invalid — partial-scrape recovery.
Plus the existing snapshot-schema / retention assertions remain green.

The PR's own validation log (make test → 721 passed) checks out against the current suite as well.

Suggested next step

Whenever someone next touches this area, the cheapest path is probably to cherry-pick 63f04fdd + 41c7da7e into a fresh branch off current main (the only conflict will be the docs/predictions.json line that landed via #298, which auto-resolves). The fix is ~23 lines of production code, the tests are already written, and the production artifact is empty today.

If there was a specific concern with this branch that I missed, happy to address it in a redo.

@rvac-bucky

## Why `docs/market-history.json` has been silently empty-on-categories for the entire 90-day retention window. The aggregation step in `save_market_history()` reads the wrong key. `scripts/update_jobs.py` (pre-fix): ```python for job in jobs: for category in job.get('categories', []): # plural list, never populated post-enrichment category_counts[category] += 1 ``` But `enrich_jobs()` writes the singular form and never creates a `categories` list: ```python category = categorize_job(title, description) job['category'] = category # {'id': 'software_engineering', 'name': ..., 'emoji': ...} ``` Reproduced against the live artifact: ``` $ python3 -c "import json; d=json.load(open('docs/market-history.json')); \ [print(s['date'], 'cats=', len(s['categories'])) for s in d['snapshots'][-5:]]" 2026-05-22 cats= 0 2026-05-23 cats= 0 2026-05-24 cats= 0 2026-05-25 cats= 0 2026-05-26 cats= 0 ``` All 81 retained snapshots show `categories: {}`. The category breakdown that feeds any "growing/declining categories" analytic has been dead. ## What New module-level helper `iter_category_ids(job)` (`scripts/update_jobs.py:2464`): - Reads from `category.id` first (the enriched shape). - Falls back to the legacy `categories` list **only** when the singular path is absent or invalid — handles `None`, `{}`, `{'id': None}`, `{'id': ''}`, non-list `categories` values, and non-string elements. - Strips whitespace and skips empty values so malformed upstream data can't poison the Counter. `save_market_history()` switches to the new helper. ## Tests `tests/test_save_market_history.py` adds four targeted regressions: - `test_counts_categories_correctly_from_singular_category_field` — the path that's broken on main today. - `test_prefers_singular_category_field_over_legacy_categories_list` — guards against double-counting if both shapes coexist. - `test_falls_back_to_legacy_categories_list_when_category_missing` — keeps old fixtures working. - `test_falls_back_to_legacy_categories_when_category_payload_is_invalid` — partial-scrape recovery against four invalid singular shapes. Existing snapshot-schema / retention / determinism assertions stay green. ## Validation - `pytest tests/test_save_market_history.py`: 25 passed. - `pytest` (full suite): 722 passed. - `pre-commit run --all-files`: clean. - End-to-end smoke (run real `enrich_jobs()` → `save_market_history()` on synthetic raw jobs, inspect resulting `docs/market-history.json`): ``` before fix: categories: {} after fix: categories: {'software_engineering': 2, 'data_ml': 2} ``` Counts equal the enriched job count, as expected. ## Provenance This is the same scoped fix from #274 (closed without merge despite green CI), rebased cleanly onto current `main`. Original work by @rvac-bucky — author attribution preserved in the commit. The `docs/predictions.json` conflict from #274 doesn't re-occur because #298 already landed the trailing-newline normalization. Closes #228  ## Summary by CodeRabbit * **Refactor** * Improved job category data normalization to support both legacy and enriched data formats, enhancing system robustness and backward compatibility. * **Tests** * Updated test suite to validate consistent category data handling and standardized tier naming conventions.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/ambicuity/New-Grad-Jobs/pull/299?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)   --------- Co-authored-by: rvac-bucky <263012179+rvac-bucky@users.noreply.github.com>

fix: restore market history category snapshots

63f04fd

rvac-bucky requested a review from ambicuity as a code owner May 20, 2026 11:22

adamsbucky00-sys mentioned this pull request May 20, 2026

fix: restore market history category snapshots #252

Closed

gemini-code-assist Bot reviewed May 20, 2026

View reviewed changes

fix: handle invalid market history categories

41c7da7

ambicuity closed this May 26, 2026

ambicuity mentioned this pull request May 26, 2026

fix: restore market history category snapshots #299

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: restore market history category snapshots#274

fix: restore market history category snapshots#274
rvac-bucky wants to merge 2 commits into
ambicuity:mainfrom
rvac-bucky:fix/228-market-history-categories-clean

rvac-bucky commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

ambicuity commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rvac-bucky commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Fix

Regression protection

Validation

Supersedes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

ambicuity commented May 26, 2026

The underlying bug is still live on main

The fix in this PR was correct

AI reviewer feedback was fully addressed

CI state at close

Test coverage shipped with the fix

Suggested next step

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rvac-bucky commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

The underlying bug is still live on `main`