fix: restore market history category snapshots by ambicuity · Pull Request #299 · ambicuity/New-Grad-Jobs

ambicuity · 2026-05-26T17:49:46Z

Why

docs/market-history.json has silently recorded an empty categories map in every snapshot for the entire 90-day retention window. The aggregation step in save_market_history() reads the wrong key.

scripts/update_jobs.py (pre-fix):

for job in jobs:
    for category in job.get('categories', []):   # plural list, never populated post-enrichment
        category_counts[category] += 1

But enrich_jobs() writes the singular form and never creates a categories list:

category = categorize_job(title, description)
job['category'] = category    # {'id': 'software_engineering', 'name': ..., 'emoji': ...}

Reproduced against the live artifact:

$ python3 -c "import json; d=json.load(open('docs/market-history.json')); \
  [print(s['date'], 'cats=', len(s['categories'])) for s in d['snapshots'][-5:]]"
2026-05-22 cats= 0
2026-05-23 cats= 0
2026-05-24 cats= 0
2026-05-25 cats= 0
2026-05-26 cats= 0

All 81 retained snapshots show categories: {}. The category breakdown that feeds any "growing/declining categories" analytic has been dead.
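
The mismatch can be reproduced in isolation. The following is a minimal sketch, not the repository code: the job dicts are made up, and `count_plural` / `count_singular` are hypothetical names for the pre-fix read and a simplified singular read.

```python
from collections import Counter

# Hypothetical enriched jobs: only the singular 'category' dict is present,
# mirroring what enrich_jobs() is described as writing.
jobs = [
    {"company": "A", "category": {"id": "software_engineering"}},
    {"company": "B", "category": {"id": "data_ml"}},
]


def count_plural(jobs):
    """Pre-fix style aggregation: reads the plural 'categories' list."""
    counts = Counter()
    for job in jobs:
        for category in job.get("categories", []):  # key is never present
            counts[category] += 1
    return dict(counts)


def count_singular(jobs):
    """Simplified singular read: counts job['category']['id'] when valid."""
    counts = Counter()
    for job in jobs:
        category = job.get("category")
        if isinstance(category, dict) and isinstance(category.get("id"), str):
            counts[category["id"]] += 1
    return dict(counts)


print(count_plural(jobs))    # {}
print(count_singular(jobs))  # {'software_engineering': 1, 'data_ml': 1}
```

Because `job.get('categories', [])` silently returns an empty list, the loop never raises; it just produces an empty map, which is why the gap went unnoticed.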

What

New module-level helper iter_category_ids(job) (scripts/update_jobs.py:2464):

Reads from category.id first (the enriched shape).
Falls back to the legacy categories list only when the singular path is absent or invalid — handles None, {}, {'id': None}, {'id': ''}, non-list categories values, and non-string elements.
Strips whitespace and skips empty values so malformed upstream data can't poison the Counter.

save_market_history() switches to the new helper.
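
The behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: it uses direct dictionary lookups and includes the `isinstance(job, dict)` guard discussed in the review further down this thread, and `count_categories` is a hypothetical stand-in for the counting loop inside `save_market_history()`.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator


def iter_category_ids(job: Any) -> Iterator[str]:
    """Yield normalized category IDs, preferring the enriched shape."""
    if not isinstance(job, dict):
        return  # ignore malformed (non-dict) job payloads

    # Enriched shape: job['category'] is a dict carrying a string 'id'.
    category = job.get("category")
    category_id = category.get("id") if isinstance(category, dict) else None
    if isinstance(category_id, str) and category_id.strip():
        yield category_id.strip()
        return  # singular field wins, so the two shapes never double-count

    # Legacy shape: job['categories'] is a list of string IDs.
    legacy = job.get("categories")
    if not isinstance(legacy, list):
        return
    for item in legacy:
        if isinstance(item, str) and item.strip():
            yield item.strip()


def count_categories(jobs: Iterable[Any]) -> Dict[str, int]:
    """Hypothetical stand-in for the counting loop in save_market_history()."""
    counts: Counter = Counter()
    for job in jobs:
        for category_id in iter_category_ids(job):
            counts[category_id] += 1
    return dict(counts)


jobs = [
    {"category": {"id": " software_engineering "}},            # enriched, padded
    {"category": {"id": "data_ml"}, "categories": ["other"]},  # both shapes
    {"category": {"id": ""}, "categories": ["data_ml", 7]},    # invalid singular
    {"categories": "not-a-list"},                              # malformed legacy
    None,                                                      # non-dict job
]
print(count_categories(jobs))  # {'software_engineering': 1, 'data_ml': 2}
```

Returning right after the singular ID is yielded is what keeps a job that carries both shapes from being counted twice.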

Tests

tests/test_save_market_history.py adds four targeted regressions:

test_counts_categories_correctly_from_singular_category_field — the path that's broken on main today.
test_prefers_singular_category_field_over_legacy_categories_list — guards against double-counting if both shapes coexist.
test_falls_back_to_legacy_categories_list_when_category_missing — keeps old fixtures working.
test_falls_back_to_legacy_categories_when_category_payload_is_invalid — partial-scrape recovery against four invalid singular shapes.
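
As a rough illustration of what the precedence and fallback regressions check, here is a self-contained sketch. The helper is restated in compact form so the block runs on its own, the test names are shortened and hypothetical, and the four invalid shapes are assumed to be the ones listed in the description (`None`, `{}`, `{'id': None}`, `{'id': ''}`); the real tests exercise `save_market_history()` and may differ.

```python
def iter_category_ids(job):
    """Compact stand-in for the helper; not the repository code."""
    if not isinstance(job, dict):
        return
    category = job.get("category")
    raw = category.get("id") if isinstance(category, dict) else None
    if isinstance(raw, str) and raw.strip():
        yield raw.strip()
        return
    legacy = job.get("categories")
    for item in legacy if isinstance(legacy, list) else []:
        if isinstance(item, str) and item.strip():
            yield item.strip()


# Assumed set of invalid singular payloads (taken from the description).
INVALID_SINGULAR_PAYLOADS = [None, {}, {"id": None}, {"id": ""}]


def test_prefers_singular_over_legacy():
    job = {"category": {"id": "data_ml"}, "categories": ["software_engineering"]}
    assert list(iter_category_ids(job)) == ["data_ml"]


def test_falls_back_when_category_missing():
    job = {"categories": ["software_engineering"]}
    assert list(iter_category_ids(job)) == ["software_engineering"]


def test_falls_back_when_category_payload_is_invalid():
    for payload in INVALID_SINGULAR_PAYLOADS:
        job = {"category": payload, "categories": ["data_ml"]}
        assert list(iter_category_ids(job)) == ["data_ml"], payload


# Run directly so the sketch works without a test runner.
test_prefers_singular_over_legacy()
test_falls_back_when_category_missing()
test_falls_back_when_category_payload_is_invalid()
```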

Existing snapshot-schema / retention / determinism assertions stay green.

Validation

pytest tests/test_save_market_history.py: 25 passed.
pytest (full suite): 722 passed.
pre-commit run --all-files: clean.
End-to-end smoke (run real enrich_jobs() → save_market_history() on synthetic raw jobs, inspect resulting docs/market-history.json):
```
before fix: categories: {}
after  fix: categories: {'software_engineering': 2, 'data_ml': 2}
```
Counts equal the enriched job count, as expected.

Provenance

This is the same scoped fix from #274 (closed without merge despite green CI), rebased cleanly onto current main. Original work by @rvac-bucky — author attribution preserved in the commit. The docs/predictions.json conflict from #274 doesn't re-occur because #298 already landed the trailing-newline normalization.

Closes #228

Summary by CodeRabbit

Refactor
- Improved job category data normalization to support both legacy and enriched data formats, enhancing system robustness and backward compatibility.
Tests
- Updated test suite to validate consistent category data handling and standardized tier naming conventions.

@rvac-bucky

`save_market_history()` was only reading the legacy `job['categories']` list, but enriched jobs (post `enrich_jobs()`) only carry the singular `job['category']` dict. The result: every snapshot in `docs/market-history.json` has `categories: {}`. Verified against the published artifact: all 81 snapshots in the current 90-day retention window show empty category maps despite ~1,300 enriched jobs per day.

Changes:
- New module-level `iter_category_ids(job)` helper that yields category IDs from the enriched `category.id` shape first, and falls back to the legacy `categories` list when the singular path is absent or invalid (handles `None`, `{}`, `{'id': None}`, `{'id': ''}`, non-list legacy values).
- `save_market_history()` switches to the new helper, so post-enrichment jobs are counted correctly and legacy fixtures keep working.

Tests (`tests/test_save_market_history.py`):
- `test_counts_categories_correctly_from_singular_category_field`: the path that's broken on main today.
- `test_prefers_singular_category_field_over_legacy_categories_list`: guards against double-counting if both shapes coexist on the same job.
- `test_falls_back_to_legacy_categories_list_when_category_missing`: keeps old fixtures working.
- `test_falls_back_to_legacy_categories_when_category_payload_is_invalid`: partial-scrape recovery (empty / null / missing-id category dicts).

This is the same fix from #274 (closed), rebased onto current `main` so the `docs/predictions.json` conflict resolves cleanly. Original work by @rvac-bucky. Closes #228

coderabbitai · 2026-05-26T17:49:53Z

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c6c5a849-2b5e-42a3-8fd9-ff35c553da15

📥 Commits

Reviewing files that changed from the base of the PR and between c6b3699 and 470cf3e.

📒 Files selected for processing (2)

scripts/update_jobs.py
tests/test_save_market_history.py

📝 Walkthrough

The PR adds a category ID normalization helper function (iter_category_ids()) that extracts category identifiers from either enriched job data (category.id) or legacy job data (categories list). The market history snapshot logic is updated to use this helper for category counting. The entire test suite is revised to align with new tier naming (faang_plus) and to explicitly test category schema precedence and fallback behavior, while standardizing file I/O with UTF-8 encoding throughout.

Changes

Category ID normalization and test schema alignment

| Layer / File(s) | Summary |
|---|---|
| Category ID iteration helper `scripts/update_jobs.py` | Adds `iter_category_ids(job)` function to normalize category IDs from enriched (`category.id`) or legacy (`categories` list) job data shapes. Updates module imports to include `Iterator`. |
| Market history category counting integration `scripts/update_jobs.py` | Updates `save_market_history` to compute `category_counts` by iterating `iter_category_ids(job)` instead of directly accessing legacy `job.get('categories', [])`. |
| Category counting precedence and robustness tests `tests/test_save_market_history.py` | Reworks `TestCategoryAndTierCounting` class to comprehensively test enriched `category.id` counting, preference of enriched over legacy `categories`, fallback to legacy when enriched is missing, and robust handling of invalid/malformed enriched payloads. Updates tier test data to `faang_plus` throughout. |
| Basic snapshot structure and metadata tests `tests/test_save_market_history.py` | Updates snapshot structure, date range metadata, and date format tests to use `faang_plus` tier naming and adds explicit UTF-8 encoding to all snapshot file reads. |
| Company aggregation and metrics tests `tests/test_save_market_history.py` | Updates tier defaults, company ranking, unique company count, and average jobs-per-company calculations to use `faang_plus` tier inputs and UTF-8 file encoding. |
| History retention and snapshot ordering tests `tests/test_save_market_history.py` | Updates 90-day retention, existing snapshot updates, and snapshot sorting tests to use `faang_plus` tier inputs and UTF-8 file encoding. |
| File handling robustness tests `tests/test_save_market_history.py` | Updates directory creation, JSON corruption recovery, empty jobs, missing fields, and performance tests to use `faang_plus` tier inputs and consistent UTF-8 file encoding. |
| Determinism and schema stability tests `tests/test_save_market_history.py` | Updates retention boundary and snapshot schema determinism tests to use `faang_plus` tier inputs and expected tier keys, ensuring snapshot keying and aggregation remain deterministic. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

size/L

Poem

🐰 Category IDs now dance in dual forms,
Enriched or legacy, both transform,
UTF-8 files stay true and clean,
The finest schema that has been seen!
Tests march forth with faang_plus cheer—
Determinism reigns, no doubts here. 🎯

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: restore market history category snapshots' clearly and directly summarizes the main change: restoring category snapshot functionality in market history by fixing the aggregation logic to use the enriched job category data instead of legacy lists. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.55% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request introduces a new helper function iter_category_ids in scripts/update_jobs.py to extract and normalize category IDs from both enriched and legacy job category data, updating save_market_history to utilize it. It also cleans up file-reading operations in tests/test_save_market_history.py and adds comprehensive test coverage for the new category resolution logic. The review feedback suggests a valuable optimization and defensive guard to ensure job is a dictionary and to replace the helper get_nested_value with direct dictionary lookups for better performance.

gemini-code-assist · 2026-05-26T17:50:36Z

+    category_id = get_nested_value(job, 'category.id')
+    if isinstance(category_id, str):
+        normalized = category_id.strip()
+        if normalized:
+            yield normalized
+            return


Optimization & Defensive Guard

Defensive Programming: Added a guard check to ensure job is a dictionary before attempting any operations. This prevents potential AttributeError crashes if a malformed or None job payload is passed.

Performance Optimization: Replaced the generic get_nested_value(job, 'category.id') helper with a direct dictionary lookup. Since this function is executed in a loop over all jobs (potentially thousands), avoiding the overhead of string splitting and dynamic type checking inside get_nested_value significantly improves performance.

Suggested change

-    category_id = get_nested_value(job, 'category.id')
-    if isinstance(category_id, str):
-        normalized = category_id.strip()
-        if normalized:
-            yield normalized
-            return
+    if not isinstance(job, dict):
+        return
+    category = job.get('category')
+    category_id = category.get('id') if isinstance(category, dict) else None
+    if isinstance(category_id, str):
+        normalized = category_id.strip()
+        if normalized:
+            yield normalized
+            return

codecov · 2026-05-26T17:51:05Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Copilot

Pull request overview

Fixes broken category aggregation in the market history snapshots by counting from the enriched job["category"]["id"] shape (with a safe legacy fallback), restoring meaningful docs/market-history.json category breakdowns used for trend analytics.

Changes:

Add iter_category_ids(job) helper to normalize and safely extract category IDs from either job["category"]["id"] or legacy job["categories"].
Update save_market_history() to use iter_category_ids() for category counting.
Expand tests/test_save_market_history.py with targeted regressions for singular-category counting, preference rules, and fallback behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| scripts/update_jobs.py | Adds a category-ID iterator helper and updates market-history aggregation to read enriched category data correctly. |
| tests/test_save_market_history.py | Adds regressions covering the fixed aggregation path and fallback/validation cases. |

+    def test_counts_categories_correctly_from_singular_category_field(self):
+        """Categories are counted from the enriched singular category field."""
+        jobs = [
+            {'company': 'Google', 'category': {'id': 'software_engineering'}, 'company_tier': {'tier': 'faang-plus'}},
+            {'company': 'Meta', 'category': {'id': 'software_engineering'}, 'company_tier': {'tier': 'faang-plus'}},
+            {'company': 'Stripe', 'category': {'id': 'data_ml'}, 'company_tier': {'tier': 'unicorn'}},


Addresses two AI-reviewer findings on #299:
- Gemini: add `isinstance(job, dict)` guard at the top of `iter_category_ids()` so a malformed non-dict job in the list cannot crash the helper's fallback path (`job.get('categories', [])`). The category extraction is already null-safe via `get_nested_value`, but the legacy fallback was not.
- Copilot: standardize the tier IDs used throughout `tests/test_save_market_history.py` to `faang_plus` to match what `get_company_tier()` actually emits (verified against the live `docs/market-history.json`, which only ever contains `faang_plus`, `unicorn`, and `other`). The fictional `public-tech` tier was also replaced with `other`.

Adds `test_iter_category_ids_ignores_non_dict_job` to lock in the new defensive guard.

Verified locally: 723 tests pass; end-to-end smoke (enriched-shape jobs -> save_market_history) yields the expected populated categories breakdown and `faang_plus`/`unicorn` tier counts.

@15045

## Why

The "RECENT COMMITS" panel in the right-side detail of `#contributors` was rendering deterministic mock SHAs and made-up messages like `refactor(ui): keyboard nav` for every selected dev. Anyone landing on jobs.riteshrana.engineer/#contributors and clicking a contributor saw fabricated history that had nothing to do with that person.

## What

Replaces the local mock generator in `docs/terminal/contributors.jsx` with a `useRecentCommits()` hook that fetches `GET /repos/{repo}/commits?author={handle}` from the GitHub REST API on selection. A module-level cache keyed by handle dedupes within a session so re-clicking a contributor doesn't re-hit the API.

Each row now shows:
- real 7-char SHA
- actual commit subject (first line, truncated to 80 chars)
- human-readable "ago" derived from the commit author date
- linked to the commit on GitHub

States surfaced explicitly (no silent failures):
- `loading from github…` while in-flight
- `no commits authored by @{handle} in this repo` for co-author-only contributors (e.g. @15045 / 小吴, co-author on #296 but never the author).
- `could not load commits (...)` with the failure reason on rate-limit/network failure.

Cache-bust `?v=7` → `?v=8` on all `docs/index.html` bundles so existing visitors pick up the new code.

## Verification (local: `python3 -m http.server 8765`, Playwright)

| Contributor | Result |
|---|---|
| @ambicuity | Real commits: #299, #300, #301 (19m, 27m, 42m ago) |
| @rvac-bucky | Real commits: #265, #247, #166, #193 (2h, 2mo …) |
| @15045 | Empty-state message (co-author only) |
| @mitre88 | Empty-state message (no direct authored commits) |

Screenshot saved to `contributors-real-commits.png` (not committed).

Copilot AI review requested due to automatic review settings May 26, 2026 17:49

Copilot started reviewing on behalf of ambicuity May 26, 2026 17:49 View session

gemini-code-assist Bot reviewed May 26, 2026

View reviewed changes

Copilot AI reviewed May 26, 2026

View reviewed changes

ambicuity merged commit 01b5ff3 into main May 26, 2026
7 checks passed

ambicuity deleted the fix/228-market-history-categories branch May 26, 2026 18:43

ambicuity mentioned this pull request May 26, 2026

fix(contributors): show real GitHub commits per contributor #302

Merged

ambicuity mentioned this pull request May 26, 2026

[Bug]: /favicon.ico returns 404 (no favicon configured) #305

Closed

ambicuity merged 2 commits into main from fix/228-market-history-categories
