fix: restore market history category snapshots#299
Conversation
`save_market_history()` was only reading the legacy `job['categories']` list,
but enriched jobs (post `enrich_jobs()`) only carry the singular
`job['category']` dict. The result: every snapshot in `docs/market-history.json`
has `categories: {}`. Verified against the published artifact — all 81
snapshots in the current 90-day retention window show empty category maps
despite ~1,300 enriched jobs per day.
Changes:
- New module-level `iter_category_ids(job)` helper that yields category IDs
from the enriched `category.id` shape first, and falls back to the legacy
`categories` list when the singular path is absent or invalid (handles
`None`, `{}`, `{'id': None}`, `{'id': ''}`, non-list legacy values).
- `save_market_history()` switches to the new helper, so post-enrichment
jobs are counted correctly and legacy fixtures keep working.
Tests (`tests/test_save_market_history.py`):
- `test_counts_categories_correctly_from_singular_category_field` — the path
that's broken on main today.
- `test_prefers_singular_category_field_over_legacy_categories_list` — guards
against double-counting if both shapes coexist on the same job.
- `test_falls_back_to_legacy_categories_list_when_category_missing` — keeps
old fixtures working.
- `test_falls_back_to_legacy_categories_when_category_payload_is_invalid` —
partial-scrape recovery (empty / null / missing-id category dicts).
This is the same fix from #274 (closed), rebased onto current `main` so the
`docs/predictions.json` conflict resolves cleanly. Original work by
@rvac-bucky.
Closes #228
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR adds a category ID normalization helper function ( ChangesCategory ID normalization and test schema alignment
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a new helper function iter_category_ids in scripts/update_jobs.py to extract and normalize category IDs from both enriched and legacy job category data, updating save_market_history to utilize it. It also cleans up file-reading operations in tests/test_save_market_history.py and adds comprehensive test coverage for the new category resolution logic. The review feedback suggests a valuable optimization and defensive guard to ensure job is a dictionary and to replace the helper get_nested_value with direct dictionary lookups for better performance.
| category_id = get_nested_value(job, 'category.id') | ||
| if isinstance(category_id, str): | ||
| normalized = category_id.strip() | ||
| if normalized: | ||
| yield normalized | ||
| return |
There was a problem hiding this comment.
Optimization & Defensive Guard
- Defensive Programming: Added a guard check to ensure
jobis a dictionary before attempting any operations. This prevents potentialAttributeErrorcrashes if a malformed orNonejob payload is passed. - Performance Optimization: Replaced the generic
get_nested_value(job, 'category.id')helper with a direct dictionary lookup. Since this function is executed in a loop over all jobs (potentially thousands), avoiding the overhead of string splitting and dynamic type checking insideget_nested_valuesignificantly improves performance.
| category_id = get_nested_value(job, 'category.id') | |
| if isinstance(category_id, str): | |
| normalized = category_id.strip() | |
| if normalized: | |
| yield normalized | |
| return | |
| if not isinstance(job, dict): | |
| return | |
| category = job.get('category') | |
| category_id = category.get('id') if isinstance(category, dict) else None | |
| if isinstance(category_id, str): | |
| normalized = category_id.strip() | |
| if normalized: | |
| yield normalized | |
| return |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Fixes broken category aggregation in the market history snapshots by counting from the enriched job["category"]["id"] shape (with a safe legacy fallback), restoring meaningful docs/market-history.json category breakdowns used for trend analytics.
Changes:
- Add
iter_category_ids(job)helper to normalize and safely extract category IDs from eitherjob["category"]["id"]or legacyjob["categories"]. - Update
save_market_history()to useiter_category_ids()for category counting. - Expand
tests/test_save_market_history.pywith targeted regressions for singular-category counting, preference rules, and fallback behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| scripts/update_jobs.py | Adds a category-ID iterator helper and updates market-history aggregation to read enriched category data correctly. |
| tests/test_save_market_history.py | Adds regressions covering the fixed aggregation path and fallback/validation cases. |
| def test_counts_categories_correctly_from_singular_category_field(self): | ||
| """Categories are counted from the enriched singular category field.""" | ||
| jobs = [ | ||
| {'company': 'Google', 'category': {'id': 'software_engineering'}, 'company_tier': {'tier': 'faang-plus'}}, | ||
| {'company': 'Meta', 'category': {'id': 'software_engineering'}, 'company_tier': {'tier': 'faang-plus'}}, | ||
| {'company': 'Stripe', 'category': {'id': 'data_ml'}, 'company_tier': {'tier': 'unicorn'}}, |
Addresses two AI-reviewer findings on #299: - Gemini: add `isinstance(job, dict)` guard at the top of `iter_category_ids()` so a malformed non-dict job in the list cannot crash the helper's fallback path (`job.get('categories', [])`). The category extraction is already null-safe via `get_nested_value`, but the legacy fallback was not. - Copilot: standardize the tier IDs used throughout `tests/test_save_market_history.py` to `faang_plus` to match what `get_company_tier()` actually emits (verified against the live `docs/market-history.json`, which only ever contains `faang_plus`, `unicorn`, and `other`). The fictional `public-tech` tier was also replaced with `other`. Adds `test_iter_category_ids_ignores_non_dict_job` to lock in the new defensive guard. Verified locally: 723 tests pass; end-to-end smoke (enriched-shape jobs -> save_market_history) yields the expected populated categories breakdown and `faang_plus`/`unicorn` tier counts.
## Why
The "RECENT COMMITS" panel in the right-side detail of `#contributors`
was rendering deterministic mock SHAs and made-up messages like
`refactor(ui): keyboard nav` for every selected dev. Anyone landing on
jobs.riteshrana.engineer/#contributors and clicking a contributor saw
fabricated history that had nothing to do with that person.
## What
Replaces the local mock generator in `docs/terminal/contributors.jsx`
with a `useRecentCommits()` hook that fetches `GET
/repos/{repo}/commits?author={handle}` from the GitHub REST API on
selection. A module-level cache keyed by handle dedupes within a session
so re-clicking a contributor doesn't re-hit the API.
Each row now shows:
- real 7-char SHA
- actual commit subject (first line, truncated to 80 chars)
- human-readable "ago" derived from the commit author date
- linked to the commit on GitHub
States surfaced explicitly (no silent failures):
- `loading from github…` while in-flight
- `no commits authored by @{handle} in this repo` for co-author-only
contributors (e.g. @15045 / 小吴 — co-author on #296 but never the
author).
- `could not load commits (...)` with the failure reason on
rate-limit/network failure.
Cache-bust `?v=7` → `?v=8` on all `docs/index.html` bundles so existing
visitors pick up the new code.
## Verification (local — `python3 -m http.server 8765`, Playwright)
| Contributor | Result |
|---|---|
| @ambicuity | Real commits — #299, #300, #301 (19m, 27m, 42m ago) |
| @rvac-bucky | Real commits — #265, #247, #166, #193 (2h, 2mo …) |
| @15045 | Empty-state message (co-author only) |
| @mitre88 | Empty-state message (no direct authored commits) |
Screenshot saved to `contributors-real-commits.png` (not committed).
Why
docs/market-history.jsonhas been silently empty-on-categories for the entire 90-day retention window. The aggregation step insave_market_history()reads the wrong key.scripts/update_jobs.py(pre-fix):But
enrich_jobs()writes the singular form and never creates acategorieslist:Reproduced against the live artifact:
All 81 retained snapshots show
categories: {}. The category breakdown that feeds any "growing/declining categories" analytic has been dead.What
New module-level helper
iter_category_ids(job)(scripts/update_jobs.py:2464):category.idfirst (the enriched shape).categorieslist only when the singular path is absent or invalid — handlesNone,{},{'id': None},{'id': ''}, non-listcategoriesvalues, and non-string elements.save_market_history()switches to the new helper.Tests
tests/test_save_market_history.pyadds four targeted regressions:test_counts_categories_correctly_from_singular_category_field— the path that's broken on main today.test_prefers_singular_category_field_over_legacy_categories_list— guards against double-counting if both shapes coexist.test_falls_back_to_legacy_categories_list_when_category_missing— keeps old fixtures working.test_falls_back_to_legacy_categories_when_category_payload_is_invalid— partial-scrape recovery against four invalid singular shapes.Existing snapshot-schema / retention / determinism assertions stay green.
Validation
pytest tests/test_save_market_history.py: 25 passed.pytest(full suite): 722 passed.pre-commit run --all-files: clean.enrich_jobs()→save_market_history()on synthetic raw jobs, inspect resultingdocs/market-history.json):Provenance
This is the same scoped fix from #274 (closed without merge despite green CI), rebased cleanly onto current
main. Original work by @rvac-bucky — author attribution preserved in the commit. Thedocs/predictions.jsonconflict from #274 doesn't re-occur because #298 already landed the trailing-newline normalization.Closes #228
Summary by CodeRabbit
Refactor
Tests