Skip to content

fix: restore market history category snapshots#274

Closed
rvac-bucky wants to merge 2 commits into
ambicuity:mainfrom
rvac-bucky:fix/228-market-history-categories-clean
Closed

fix: restore market history category snapshots#274
rvac-bucky wants to merge 2 commits into
ambicuity:mainfrom
rvac-bucky:fix/228-market-history-categories-clean

Conversation

@rvac-bucky
Copy link
Copy Markdown
Contributor

@rvac-bucky rvac-bucky commented May 20, 2026

Root cause

save_market_history() was only reading the legacy job["categories"] list, but enriched jobs now store category data in the singular job["category"] object. That left new market-history snapshots with an empty categories map.

Fix

  • Aggregate category counts from job["category"]["id"]
  • Keep the legacy categories list as a fallback for older shapes
  • Validate category values before counting so malformed scraped data cannot create bad keys
  • Preserve existing snapshot schema and retention behavior

Regression protection

Added focused coverage around:

  • singular category aggregation
  • preference for the enriched singular field over legacy fallback
  • legacy categories fallback
  • malformed category payloads
  • deterministic snapshot schema and retention behavior

Validation

  • pre-commit on touched files
  • pytest tests/test_save_market_history.py -q -o addopts='' -> 25 passed
  • make test -> 721 passed
  • GitHub Actions and CodeRabbit were green on the superseded branch before opening this clean PR

Supersedes

Supersedes #252. That PR was technically clean, but GitHub still displayed stale CHANGES_REQUESTED from old bot reviews on obsolete commits. This replacement PR carries the same scoped fix from a clean branch so the review state is not polluted by historical bot feedback.

Fixes #228

Summary by CodeRabbit

  • Improvements
    • Enhanced job category data processing to support multiple data formats
    • Strengthened data validation for job information handling
    • Extended test coverage for category aggregation logic to ensure reliability

Review Change Stack

@rvac-bucky rvac-bucky requested a review from ambicuity as a code owner May 20, 2026 11:22
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Warning

Rate limit exceeded

@rvac-bucky has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 24 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7113e375-802b-4079-88f2-ca871f356e7b

📥 Commits

Reviewing files that changed from the base of the PR and between 63f04fd and 41c7da7.

📒 Files selected for processing (2)
  • scripts/update_jobs.py
  • tests/test_save_market_history.py
📝 Walkthrough

Walkthrough

This PR fixes a bug where market-history.json snapshots contained empty category counts despite non-empty jobs. The category aggregation logic in save_market_history now correctly handles the enriched job["category"]["id"] shape while maintaining backward compatibility with legacy job["categories"] lists. The fix includes comprehensive test coverage for category preference, fallback behavior, and edge cases, along with standardized UTF-8 encoding across the test suite.

Changes

Category aggregation and test coverage

Layer / File(s) Summary
Category aggregation implementation
scripts/update_jobs.py
Iterator type imported; save_market_history now normalizes and accumulates category IDs from job["category"]["id"] when present, falling back to legacy job["categories"] entries.
Category aggregation test coverage
tests/test_save_market_history.py
New test cases validate enriched category.id aggregation, preference for enriched singular category over legacy lists, fallback when enriched category is absent, disregard of legacy categories when enriched category exists (even if malformed), handling of missing category data, and robustness against malformed shapes.
Test file encoding standardization
tests/test_save_market_history.py
All test methods updated to read market-history.json with explicit encoding="utf-8" parameter.
JSON termination cleanup
docs/predictions.json
File closing brace adjusted for clean termination.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A rabbit hops through history's row,
Categories now begin to flow—
Enriched fields dance, legacy stays,
Tests ensure the data's ablaze,
UTF-8 strings in files so clean,
The finest snapshots ever seen!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: restore market history category snapshots' directly addresses the main change: fixing empty category snapshots by restoring category population in market-history.json.
Linked Issues check ✅ Passed The pull request fully addresses issue #228 objectives: aggregates from job['category']['id'] with legacy fallback, adds comprehensive regression tests for enriched/legacy/malformed shapes, and preserves snapshot schema.
Out of Scope Changes check ✅ Passed All changes are in-scope: update_jobs.py logic fix and type import, test file expansion for category scenarios, and JSON formatting adjustment are directly related to restoring category snapshot functionality.
Docstring Coverage ✅ Passed Docstring coverage is 96.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📋 Issue Planner

Built with CodeRabbit's Coding Plans for faster development and fewer bugs.

View plan used: #228

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the category counting logic in scripts/update_jobs.py to prioritize a new singular category field over the legacy categories list, utilizing a new nested helper function and comprehensive unit tests. Feedback suggests moving the helper function to the module level for better maintainability and refining the fallback logic to ensure legacy data is used if the new category field is present but invalid.

Comment thread scripts/update_jobs.py Outdated
Comment on lines 2444 to 2462
def iter_category_ids(job: Dict[str, Any]) -> Iterator[str]:
if 'category' in job:
category_id = get_nested_value(job, 'category.id')
if isinstance(category_id, str):
normalized = category_id.strip()
if normalized:
yield normalized
return

legacy_categories = job.get('categories', [])
if not isinstance(legacy_categories, list):
return

for category in legacy_categories:
if isinstance(category, str):
normalized = category.strip()
if normalized:
yield normalized

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Defining iter_category_ids inside save_market_history causes it to be redefined on every call. While this function is currently only called once per script execution, moving it to the module level is a better practice for maintainability and testability, and it avoids unnecessary overhead if the script's entry points ever change.

Comment thread scripts/update_jobs.py Outdated
Comment on lines +2445 to +2451
if 'category' in job:
category_id = get_nested_value(job, 'category.id')
if isinstance(category_id, str):
normalized = category_id.strip()
if normalized:
yield normalized
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current logic assumes that if the category key exists, it is the exclusive source of truth, even if it is None or lacks a valid id. If an upstream scraper partially fails and leaves a null category while categories still contains valid legacy data, this job will be recorded with no category. Consider falling back to legacy_categories if category_id is not found or invalid within the category object.

References
  1. Ensure that invalid inputs or states are safely handled in all cases, providing appropriate fallbacks where data integrity might be compromised.

@ambicuity ambicuity closed this May 26, 2026
@ambicuity
Copy link
Copy Markdown
Owner

Posting a post-closure review since this work intersects a real, currently-live production bug. No agenda about the close decision — just leaving the technical record so it's findable when someone revisits this area.

The underlying bug is still live on main

scripts/update_jobs.py:2531-2533 reads the wrong key:

category_counts = Counter()
for job in jobs:
    for category in job.get('categories', []):   # ← plural list, never populated
        category_counts[category] += 1

But enrich_jobs() at scripts/update_jobs.py:2400 writes the singular form and never produces a categories list:

category = categorize_job(title, description)
job['category'] = category    # singular dict {id, name, emoji}

So category_counts is always empty by the time the snapshot is written. Verified against the current published artifact:

$ python3 -c "import json; d=json.load(open('docs/market-history.json')); \
  [print(s['date'], 'total=', s['total_jobs'], 'categories=', s['categories']) \
   for s in d['snapshots'][-5:]]"
2026-05-22 total= 1332 categories= {}
2026-05-23 total= 1328 categories= {}
2026-05-24 total= 1402 categories= {}
2026-05-25 total= 1380 categories= {}
2026-05-26 total= 1332 categories= {}

All 81 snapshots in docs/market-history.json have categories: {} — the category breakdown that drives any downstream "growing/declining categories" analytics has been silently dead for the entire retention window.

The fix in this PR was correct

The diff replaces the wrong-key read with a small generator:

def iter_category_ids(job: Dict[str, Any]) -> Iterator[str]:
    category_id = get_nested_value(job, 'category.id')
    if isinstance(category_id, str):
        normalized = category_id.strip()
        if normalized:
            yield normalized
            return
    legacy_categories = job.get('categories', [])
    if not isinstance(legacy_categories, list):
        return
    for category in legacy_categories:
        if isinstance(category, str):
            normalized = category.strip()
            if normalized:
                yield normalized

What this gets right that's worth preserving on a future attempt:

  • Reads from the singular category.id first (the post-enrichment shape).
  • Falls back to legacy categories only when the singular path yields nothing — important for partial-scrape recovery (matches gemini-code-assist's second comment).
  • Validates element types (isinstance(..., str)) and strips empties so malformed upstream data can't poison the Counter keys.
  • Placed at module level (not nested inside save_market_history), satisfying gemini-code-assist's first comment.

AI reviewer feedback was fully addressed

  • gemini-code-assist (63f04fdd, "move helper to module level" + "fall back when category exists but invalid"): the final commit 41c7da7e ("fix: handle invalid market history categories") places the helper at module level and adds test_falls_back_to_legacy_categories_when_category_payload_is_invalid exercising the four invalid shapes ({'id': None}, {}, {'id': ''}, None). Verified in the diff.
  • CodeRabbit: rate-limited at review time, but the CodeRabbit status check finished SUCCESS. No outstanding actionable comments.

CI state at close

All 6 checks were SUCCESS, mergeStateStatus: CLEAN, mergeable: MERGEABLE. The PR also already includes the docs/predictions.json trailing-newline fix that has since been re-applied via #298.

Test coverage shipped with the fix

tests/test_save_market_history.py adds five regression tests covering exactly the failure modes that matter here:

  • test_counts_categories_correctly_from_singular_category_field — the path that's broken on main today.
  • test_prefers_singular_category_field_over_legacy_categories_list — guards against double-counting if both shapes ever coexist.
  • test_falls_back_to_legacy_categories_list_when_category_missing — keeps old fixtures working.
  • test_falls_back_to_legacy_categories_when_category_payload_is_invalid — partial-scrape recovery.
  • Plus the existing snapshot-schema / retention assertions remain green.

The PR's own validation log (make test → 721 passed) checks out against the current suite as well.

Suggested next step

Whenever someone next touches this area, the cheapest path is probably to cherry-pick 63f04fdd + 41c7da7e into a fresh branch off current main (the only conflict will be the docs/predictions.json line that landed via #298, which auto-resolves). The fix is ~23 lines of production code, the tests are already written, and the production artifact is empty today.

If there was a specific concern with this branch that I missed, happy to address it in a redo.

ambicuity added a commit that referenced this pull request May 26, 2026
## Why

`docs/market-history.json` has been silently empty-on-categories for the
entire 90-day retention window. The aggregation step in
`save_market_history()` reads the wrong key.

`scripts/update_jobs.py` (pre-fix):
```python
for job in jobs:
    for category in job.get('categories', []):   # plural list, never populated post-enrichment
        category_counts[category] += 1
```

But `enrich_jobs()` writes the singular form and never creates a
`categories` list:
```python
category = categorize_job(title, description)
job['category'] = category    # {'id': 'software_engineering', 'name': ..., 'emoji': ...}
```

Reproduced against the live artifact:
```
$ python3 -c "import json; d=json.load(open('docs/market-history.json')); \
  [print(s['date'], 'cats=', len(s['categories'])) for s in d['snapshots'][-5:]]"
2026-05-22 cats= 0
2026-05-23 cats= 0
2026-05-24 cats= 0
2026-05-25 cats= 0
2026-05-26 cats= 0
```

All 81 retained snapshots show `categories: {}`. The category breakdown
that feeds any "growing/declining categories" analytic has been dead.

## What

New module-level helper `iter_category_ids(job)`
(`scripts/update_jobs.py:2464`):

- Reads from `category.id` first (the enriched shape).
- Falls back to the legacy `categories` list **only** when the singular
path is absent or invalid — handles `None`, `{}`, `{'id': None}`,
`{'id': ''}`, non-list `categories` values, and non-string elements.
- Strips whitespace and skips empty values so malformed upstream data
can't poison the Counter.

`save_market_history()` switches to the new helper.

## Tests

`tests/test_save_market_history.py` adds four targeted regressions:

- `test_counts_categories_correctly_from_singular_category_field` — the
path that's broken on main today.
- `test_prefers_singular_category_field_over_legacy_categories_list` —
guards against double-counting if both shapes coexist.
- `test_falls_back_to_legacy_categories_list_when_category_missing` —
keeps old fixtures working.
-
`test_falls_back_to_legacy_categories_when_category_payload_is_invalid`
— partial-scrape recovery against four invalid singular shapes.

Existing snapshot-schema / retention / determinism assertions stay
green.

## Validation

- `pytest tests/test_save_market_history.py`: 25 passed.
- `pytest` (full suite): 722 passed.
- `pre-commit run --all-files`: clean.
- End-to-end smoke (run real `enrich_jobs()` → `save_market_history()`
on synthetic raw jobs, inspect resulting `docs/market-history.json`):
  ```
  before fix: categories: {}
  after  fix: categories: {'software_engineering': 2, 'data_ml': 2}
  ```
  Counts equal the enriched job count, as expected.

## Provenance

This is the same scoped fix from #274 (closed without merge despite
green CI), rebased cleanly onto current `main`. Original work by
@rvac-bucky — author attribution preserved in the commit. The
`docs/predictions.json` conflict from #274 doesn't re-occur because #298
already landed the trailing-newline normalization.

Closes #228

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Improved job category data normalization to support both legacy and
enriched data formats, enhancing system robustness and backward
compatibility.

* **Tests**
* Updated test suite to validate consistent category data handling and
standardized tier naming conventions.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/ambicuity/New-Grad-Jobs/pull/299?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: rvac-bucky <263012179+rvac-bucky@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: market-history category breakdown is always empty despite non-empty jobs

2 participants