fix: restore market history category snapshots#274
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR fixes a bug where ChangesCategory aggregation and test coverage
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📋 Issue PlannerBuilt with CodeRabbit's Coding Plans for faster development and fewer bugs. View plan used: ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the category counting logic in scripts/update_jobs.py to prioritize a new singular category field over the legacy categories list, utilizing a new nested helper function and comprehensive unit tests. Feedback suggests moving the helper function to the module level for better maintainability and refining the fallback logic to ensure legacy data is used if the new category field is present but invalid.
| def iter_category_ids(job: Dict[str, Any]) -> Iterator[str]: | ||
| if 'category' in job: | ||
| category_id = get_nested_value(job, 'category.id') | ||
| if isinstance(category_id, str): | ||
| normalized = category_id.strip() | ||
| if normalized: | ||
| yield normalized | ||
| return | ||
|
|
||
| legacy_categories = job.get('categories', []) | ||
| if not isinstance(legacy_categories, list): | ||
| return | ||
|
|
||
| for category in legacy_categories: | ||
| if isinstance(category, str): | ||
| normalized = category.strip() | ||
| if normalized: | ||
| yield normalized | ||
|
|
There was a problem hiding this comment.
Defining iter_category_ids inside save_market_history causes it to be redefined on every call. While this function is currently only called once per script execution, moving it to the module level is a better practice for maintainability and testability, and it avoids unnecessary overhead if the script's entry points ever change.
| if 'category' in job: | ||
| category_id = get_nested_value(job, 'category.id') | ||
| if isinstance(category_id, str): | ||
| normalized = category_id.strip() | ||
| if normalized: | ||
| yield normalized | ||
| return |
There was a problem hiding this comment.
The current logic assumes that if the category key exists, it is the exclusive source of truth, even if it is None or lacks a valid id. If an upstream scraper partially fails and leaves a null category while categories still contains valid legacy data, this job will be recorded with no category. Consider falling back to legacy_categories if category_id is not found or invalid within the category object.
References
- Ensure that invalid inputs or states are safely handled in all cases, providing appropriate fallbacks where data integrity might be compromised.
|
Posting a post-closure review since this work intersects a real, currently-live production bug. No agenda about the close decision — just leaving the technical record so it's findable when someone revisits this area. The underlying bug is still live on
|
## Why
`docs/market-history.json` has been silently empty-on-categories for the
entire 90-day retention window. The aggregation step in
`save_market_history()` reads the wrong key.
`scripts/update_jobs.py` (pre-fix):
```python
for job in jobs:
for category in job.get('categories', []): # plural list, never populated post-enrichment
category_counts[category] += 1
```
But `enrich_jobs()` writes the singular form and never creates a
`categories` list:
```python
category = categorize_job(title, description)
job['category'] = category # {'id': 'software_engineering', 'name': ..., 'emoji': ...}
```
Reproduced against the live artifact:
```
$ python3 -c "import json; d=json.load(open('docs/market-history.json')); \
[print(s['date'], 'cats=', len(s['categories'])) for s in d['snapshots'][-5:]]"
2026-05-22 cats= 0
2026-05-23 cats= 0
2026-05-24 cats= 0
2026-05-25 cats= 0
2026-05-26 cats= 0
```
All 81 retained snapshots show `categories: {}`. The category breakdown
that feeds any "growing/declining categories" analytic has been dead.
## What
New module-level helper `iter_category_ids(job)`
(`scripts/update_jobs.py:2464`):
- Reads from `category.id` first (the enriched shape).
- Falls back to the legacy `categories` list **only** when the singular
path is absent or invalid — handles `None`, `{}`, `{'id': None}`,
`{'id': ''}`, non-list `categories` values, and non-string elements.
- Strips whitespace and skips empty values so malformed upstream data
can't poison the Counter.
`save_market_history()` switches to the new helper.
## Tests
`tests/test_save_market_history.py` adds four targeted regressions:
- `test_counts_categories_correctly_from_singular_category_field` — the
path that's broken on main today.
- `test_prefers_singular_category_field_over_legacy_categories_list` —
guards against double-counting if both shapes coexist.
- `test_falls_back_to_legacy_categories_list_when_category_missing` —
keeps old fixtures working.
-
`test_falls_back_to_legacy_categories_when_category_payload_is_invalid`
— partial-scrape recovery against four invalid singular shapes.
Existing snapshot-schema / retention / determinism assertions stay
green.
## Validation
- `pytest tests/test_save_market_history.py`: 25 passed.
- `pytest` (full suite): 722 passed.
- `pre-commit run --all-files`: clean.
- End-to-end smoke (run real `enrich_jobs()` → `save_market_history()`
on synthetic raw jobs, inspect resulting `docs/market-history.json`):
```
before fix: categories: {}
after fix: categories: {'software_engineering': 2, 'data_ml': 2}
```
Counts equal the enriched job count, as expected.
## Provenance
This is the same scoped fix from #274 (closed without merge despite
green CI), rebased cleanly onto current `main`. Original work by
@rvac-bucky — author attribution preserved in the commit. The
`docs/predictions.json` conflict from #274 doesn't re-occur because #298
already landed the trailing-newline normalization.
Closes #228
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Refactor**
* Improved job category data normalization to support both legacy and
enriched data formats, enhancing system robustness and backward
compatibility.
* **Tests**
* Updated test suite to validate consistent category data handling and
standardized tier naming conventions.
<!-- review_stack_entry_start -->
[](https://app.coderabbit.ai/change-stack/ambicuity/New-Grad-Jobs/pull/299?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)
<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: rvac-bucky <263012179+rvac-bucky@users.noreply.github.com>
Root cause
save_market_history()was only reading the legacyjob["categories"]list, but enriched jobs now store category data in the singularjob["category"]object. That left new market-history snapshots with an emptycategoriesmap.Fix
job["category"]["id"]categorieslist as a fallback for older shapesRegression protection
Added focused coverage around:
categoryaggregationcategoriesfallbackValidation
pre-commiton touched filespytest tests/test_save_market_history.py -q -o addopts=''-> 25 passedmake test-> 721 passedSupersedes
Supersedes #252. That PR was technically clean, but GitHub still displayed stale
CHANGES_REQUESTEDfrom old bot reviews on obsolete commits. This replacement PR carries the same scoped fix from a clean branch so the review state is not polluted by historical bot feedback.Fixes #228
Summary by CodeRabbit