fix: restore market history category snapshots#299

Merged: ambicuity merged 2 commits into main from fix/228-market-history-categories on May 26, 2026

Conversation

@ambicuity (Owner) commented May 26, 2026

Why

Every snapshot in docs/market-history.json has had an empty categories map for the entire 90-day retention window, and nothing flagged it. The aggregation step in save_market_history() reads the wrong key.

scripts/update_jobs.py (pre-fix):

for job in jobs:
    for category in job.get('categories', []):   # plural list, never populated post-enrichment
        category_counts[category] += 1

But enrich_jobs() writes the singular form and never creates a categories list:

category = categorize_job(title, description)
job['category'] = category    # {'id': 'software_engineering', 'name': ..., 'emoji': ...}
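
The mismatch can be reproduced in isolation. This is a minimal sketch that assumes only the two shapes quoted above; the job literals are invented for illustration and are not taken from the repository.

```python
from collections import Counter

# Illustrative jobs in the post-enrichment shape: a singular 'category' dict
# and no 'categories' list. The values are made up for this sketch.
jobs = [
    {"title": "Backend Engineer", "category": {"id": "software_engineering"}},
    {"title": "ML Engineer", "category": {"id": "data_ml"}},
]

# Pre-fix aggregation: iterates the plural key, which enriched jobs never carry.
category_counts = Counter()
for job in jobs:
    for category in job.get("categories", []):
        category_counts[category] += 1

print(dict(category_counts))  # {}
```

Because `job.get('categories', [])` quietly returns the empty default, the loop body never runs and no error is raised, which is consistent with the bug going unnoticed.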

Reproduced against the live artifact:

$ python3 -c "import json; d=json.load(open('docs/market-history.json')); \
  [print(s['date'], 'cats=', len(s['categories'])) for s in d['snapshots'][-5:]]"
2026-05-22 cats= 0
2026-05-23 cats= 0
2026-05-24 cats= 0
2026-05-25 cats= 0
2026-05-26 cats= 0

All 81 retained snapshots show categories: {}. The category breakdown that feeds any "growing/declining categories" analytic has been dead.
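
The one-liner above could also be kept as a small reusable check. The sketch below assumes only the schema visible in that command (a top-level `snapshots` list whose entries have `date` and `categories`); the function names are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def empty_category_snapshots(history: dict) -> list[str]:
    """Return the dates of snapshots whose 'categories' map is empty or missing."""
    return [
        snap.get("date", "<no date>")
        for snap in history.get("snapshots", [])
        if not snap.get("categories")
    ]


def check_file(path: str = "docs/market-history.json") -> list[str]:
    """Load the artifact from disk and report the dates with no category data."""
    with Path(path).open(encoding="utf-8") as fh:
        return empty_category_snapshots(json.load(fh))


# Inline sample so the sketch runs without the real artifact.
sample = {
    "snapshots": [
        {"date": "2026-05-25", "categories": {}},
        {"date": "2026-05-26", "categories": {"data_ml": 2}},
    ]
}
print(empty_category_snapshots(sample))  # ['2026-05-25']
```

A check like this could run in CI against the published file so that an all-empty breakdown is flagged instead of accumulating silently.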

What

New module-level helper iter_category_ids(job) (scripts/update_jobs.py:2464):

  • Reads from category.id first (the enriched shape).
  • Falls back to the legacy categories list only when the singular path is absent or invalid — handles None, {}, {'id': None}, {'id': ''}, non-list categories values, and non-string elements.
  • Strips whitespace and skips empty values so malformed upstream data can't poison the Counter.

save_market_history() switches to the new helper.
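
The behaviour described in the bullets can be sketched as follows. This is a reconstruction from the description and from the snippet quoted in review, not the merged code: the reviewed version reads `category.id` through the repository's `get_nested_value` helper, which is replaced here by direct dictionary lookups so the example is self-contained, and the non-dict guard reflects the follow-up commit.

```python
from collections import Counter
from typing import Any, Iterator


def iter_category_ids(job: Any) -> Iterator[str]:
    """Yield normalized category IDs, preferring the enriched singular shape."""
    if not isinstance(job, dict):
        return

    # Enriched shape: job['category'] == {'id': 'software_engineering', ...}
    category = job.get("category")
    category_id = category.get("id") if isinstance(category, dict) else None
    if isinstance(category_id, str):
        normalized = category_id.strip()
        if normalized:
            yield normalized
            return  # singular path wins; do not also count the legacy list

    # Legacy shape: job['categories'] == ['software_engineering', ...]
    legacy = job.get("categories")
    if not isinstance(legacy, list):
        return
    for item in legacy:
        if isinstance(item, str) and item.strip():
            yield item.strip()


jobs = [
    {"category": {"id": "software_engineering"}, "categories": ["data_ml"]},
    {"category": {"id": ""}, "categories": [" data_ml ", None, ""]},
    {"categories": "not-a-list"},
]
category_counts = Counter(cid for job in jobs for cid in iter_category_ids(job))
print(dict(category_counts))  # {'software_engineering': 1, 'data_ml': 1}
```

Writing the helper as a generator keeps the call site in `save_market_history()` a plain loop over IDs, whichever shape a given job happens to carry.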

Tests

tests/test_save_market_history.py adds four targeted regressions:

  • test_counts_categories_correctly_from_singular_category_field — the path that's broken on main today.
  • test_prefers_singular_category_field_over_legacy_categories_list — guards against double-counting if both shapes coexist.
  • test_falls_back_to_legacy_categories_list_when_category_missing — keeps old fixtures working.
  • test_falls_back_to_legacy_categories_when_category_payload_is_invalid — partial-scrape recovery against four invalid singular shapes.

Existing snapshot-schema / retention / determinism assertions stay green.
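
For readers without the test file at hand, the precedence and fallback regressions might look roughly like this. It is a sketch under assumptions: `category_ids` is a hypothetical compact stand-in for the helper under test, and the four invalid payloads are presumed to be the ones listed under What.

```python
def category_ids(job):
    """Compact stand-in for the helper under test (hypothetical, for this sketch only)."""
    category = job.get("category") if isinstance(job, dict) else None
    cid = category.get("id") if isinstance(category, dict) else None
    if isinstance(cid, str) and cid.strip():
        return [cid.strip()]
    legacy = job.get("categories") if isinstance(job, dict) else None
    if not isinstance(legacy, list):
        return []
    return [c.strip() for c in legacy if isinstance(c, str) and c.strip()]


# Invalid singular shapes, assumed to match the four named in the description.
INVALID_CATEGORY_PAYLOADS = [None, {}, {"id": None}, {"id": ""}]


def test_prefers_singular_over_legacy():
    # Both shapes present: only the enriched ID may be counted.
    job = {"category": {"id": "software_engineering"}, "categories": ["data_ml"]}
    assert category_ids(job) == ["software_engineering"]


def test_falls_back_when_singular_is_invalid():
    # Each invalid singular payload must defer to the legacy list.
    for payload in INVALID_CATEGORY_PAYLOADS:
        job = {"category": payload, "categories": ["data_ml"]}
        assert category_ids(job) == ["data_ml"], payload


test_prefers_singular_over_legacy()
test_falls_back_when_singular_is_invalid()
```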

Validation

  • pytest tests/test_save_market_history.py: 25 passed.
  • pytest (full suite): 722 passed.
  • pre-commit run --all-files: clean.
  • End-to-end smoke (run real enrich_jobs() → save_market_history() on synthetic raw jobs, inspect resulting docs/market-history.json):
    before fix: categories: {}
    after  fix: categories: {'software_engineering': 2, 'data_ml': 2}
    
    Counts equal the enriched job count, as expected.

Provenance

This is the same scoped fix from #274 (closed without merge despite green CI), rebased cleanly onto current main. Original work by @rvac-bucky — author attribution preserved in the commit. The docs/predictions.json conflict from #274 doesn't re-occur because #298 already landed the trailing-newline normalization.

Closes #228

Summary by CodeRabbit

  • Refactor

    • Improved job category data normalization to support both legacy and enriched data formats, enhancing system robustness and backward compatibility.
  • Tests

    • Updated test suite to validate consistent category data handling and standardized tier naming conventions.

`save_market_history()` was only reading the legacy `job['categories']` list,
but enriched jobs (post `enrich_jobs()`) only carry the singular
`job['category']` dict. The result: every snapshot in `docs/market-history.json`
has `categories: {}`. Verified against the published artifact — all 81
snapshots in the current 90-day retention window show empty category maps
despite ~1,300 enriched jobs per day.

Changes:
- New module-level `iter_category_ids(job)` helper that yields category IDs
  from the enriched `category.id` shape first, and falls back to the legacy
  `categories` list when the singular path is absent or invalid (handles
  `None`, `{}`, `{'id': None}`, `{'id': ''}`, non-list legacy values).
- `save_market_history()` switches to the new helper, so post-enrichment
  jobs are counted correctly and legacy fixtures keep working.

Tests (`tests/test_save_market_history.py`):
- `test_counts_categories_correctly_from_singular_category_field` — the path
  that's broken on main today.
- `test_prefers_singular_category_field_over_legacy_categories_list` — guards
  against double-counting if both shapes coexist on the same job.
- `test_falls_back_to_legacy_categories_list_when_category_missing` — keeps
  old fixtures working.
- `test_falls_back_to_legacy_categories_when_category_payload_is_invalid` —
  partial-scrape recovery (empty / null / missing-id category dicts).

This is the same fix from #274 (closed), rebased onto current `main` so the
`docs/predictions.json` conflict resolves cleanly. Original work by
@rvac-bucky.

Closes #228
Copilot AI review requested due to automatic review settings May 26, 2026 17:49

coderabbitai Bot commented May 26, 2026

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c6c5a849-2b5e-42a3-8fd9-ff35c553da15

📥 Commits

Reviewing files that changed from the base of the PR and between c6b3699 and 470cf3e.

📒 Files selected for processing (2)
  • scripts/update_jobs.py
  • tests/test_save_market_history.py

📝 Walkthrough

The PR adds a category ID normalization helper function (iter_category_ids()) that extracts category identifiers from either enriched job data (category.id) or legacy job data (categories list). The market history snapshot logic is updated to use this helper for category counting. The entire test suite is revised to align with new tier naming (faang_plus) and to explicitly test category schema precedence and fallback behavior, while standardizing file I/O with UTF-8 encoding throughout.

Changes

Category ID normalization and test schema alignment

| Layer | File(s) | Summary |
|---|---|---|
| Category ID iteration helper | scripts/update_jobs.py | Adds iter_category_ids(job) function to normalize category IDs from enriched (category.id) or legacy (categories list) job data shapes. Updates module imports to include Iterator. |
| Market history category counting integration | scripts/update_jobs.py | Updates save_market_history to compute category_counts by iterating iter_category_ids(job) instead of directly accessing legacy job.get('categories', []). |
| Category counting precedence and robustness tests | tests/test_save_market_history.py | Reworks TestCategoryAndTierCounting class to comprehensively test enriched category.id counting, preference of enriched over legacy categories, fallback to legacy when enriched is missing, and robust handling of invalid/malformed enriched payloads. Updates tier test data to faang_plus throughout. |
| Basic snapshot structure and metadata tests | tests/test_save_market_history.py | Updates snapshot structure, date range metadata, and date format tests to use faang_plus tier naming and adds explicit UTF-8 encoding to all snapshot file reads. |
| Company aggregation and metrics tests | tests/test_save_market_history.py | Updates tier defaults, company ranking, unique company count, and average jobs-per-company calculations to use faang_plus tier inputs and UTF-8 file encoding. |
| History retention and snapshot ordering tests | tests/test_save_market_history.py | Updates 90-day retention, existing snapshot updates, and snapshot sorting tests to use faang_plus tier inputs and UTF-8 file encoding. |
| File handling robustness tests | tests/test_save_market_history.py | Updates directory creation, JSON corruption recovery, empty jobs, missing fields, and performance tests to use faang_plus tier inputs and consistent UTF-8 file encoding. |
| Determinism and schema stability tests | tests/test_save_market_history.py | Updates retention boundary and snapshot schema determinism tests to use faang_plus tier inputs and expected tier keys, ensuring snapshot keying and aggregation remain deterministic. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

size/L

Poem

🐰 Category IDs now dance in dual forms,
Enriched or legacy, both transform,
UTF-8 files stay true and clean,
The finest schema that has been seen!
Tests march forth with faang_plus cheer—
Determinism reigns, no doubts here. 🎯

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: restore market history category snapshots' clearly and directly summarizes the main change: restoring category snapshot functionality in market history by fixing the aggregation logic to use the enriched job category data instead of legacy lists. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.55% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces a new helper function iter_category_ids in scripts/update_jobs.py to extract and normalize category IDs from both enriched and legacy job category data, updating save_market_history to utilize it. It also cleans up file-reading operations in tests/test_save_market_history.py and adds comprehensive test coverage for the new category resolution logic. The review feedback suggests a valuable optimization and defensive guard to ensure job is a dictionary and to replace the helper get_nested_value with direct dictionary lookups for better performance.

Comment thread scripts/update_jobs.py
Comment on lines +2466 to +2471
    category_id = get_nested_value(job, 'category.id')
    if isinstance(category_id, str):
        normalized = category_id.strip()
        if normalized:
            yield normalized
            return

medium

Optimization & Defensive Guard

  1. Defensive Programming: Added a guard check to ensure job is a dictionary before attempting any operations. This prevents potential AttributeError crashes if a malformed or None job payload is passed.
  2. Performance Optimization: Replaced the generic get_nested_value(job, 'category.id') helper with a direct dictionary lookup. Since this function is executed in a loop over all jobs (potentially thousands), avoiding the overhead of string splitting and dynamic type checking inside get_nested_value significantly improves performance.
Suggested change

    - category_id = get_nested_value(job, 'category.id')
    - if isinstance(category_id, str):
    -     normalized = category_id.strip()
    -     if normalized:
    -         yield normalized
    -         return
    + if not isinstance(job, dict):
    +     return
    + category = job.get('category')
    + category_id = category.get('id') if isinstance(category, dict) else None
    + if isinstance(category_id, str):
    +     normalized = category_id.strip()
    +     if normalized:
    +         yield normalized
    +         return
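
To see how the two lookups relate, the sketch below compares a dotted-path helper with the direct lookup from the suggestion. The `get_nested_value` defined here is a hypothetical stand-in written for this example; the repository's real helper is not shown on this page and may behave differently.

```python
from typing import Any


def get_nested_value(obj: Any, path: str, default: Any = None) -> Any:
    """Hypothetical stand-in for the repo helper: walk a dotted path through dicts."""
    current = obj
    for key in path.split("."):
        if not isinstance(current, dict):
            return default
        current = current.get(key, default)
    return current


def direct_category_id(job: Any) -> Any:
    """Direct lookup as in the suggested change, including the non-dict guard."""
    if not isinstance(job, dict):
        return None
    category = job.get("category")
    return category.get("id") if isinstance(category, dict) else None


samples = [
    {"category": {"id": "data_ml"}},
    {"category": None},
    {"category": {}},
    {},
    None,         # malformed job entry
    "not-a-job",  # malformed job entry
]
for job in samples:
    # Both lookups agree on every sample, including the malformed ones.
    assert get_nested_value(job, "category.id") == direct_category_id(job)
```

If the real helper is null-safe in the same way, the two forms return the same ID for every input, and the practical value of the `isinstance(job, dict)` guard lies in protecting the later `job.get('categories', [])` fallback. Any speed difference would need to be measured on the actual job list rather than assumed.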

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copilot AI left a comment


Pull request overview

Fixes broken category aggregation in the market history snapshots by counting from the enriched job["category"]["id"] shape (with a safe legacy fallback), restoring meaningful docs/market-history.json category breakdowns used for trend analytics.

Changes:

  • Add iter_category_ids(job) helper to normalize and safely extract category IDs from either job["category"]["id"] or legacy job["categories"].
  • Update save_market_history() to use iter_category_ids() for category counting.
  • Expand tests/test_save_market_history.py with targeted regressions for singular-category counting, preference rules, and fallback behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| scripts/update_jobs.py | Adds a category-ID iterator helper and updates market-history aggregation to read enriched category data correctly. |
| tests/test_save_market_history.py | Adds regressions covering the fixed aggregation path and fallback/validation cases. |

Comment on lines +128 to +133
def test_counts_categories_correctly_from_singular_category_field(self):
    """Categories are counted from the enriched singular category field."""
    jobs = [
        {'company': 'Google', 'category': {'id': 'software_engineering'}, 'company_tier': {'tier': 'faang-plus'}},
        {'company': 'Meta', 'category': {'id': 'software_engineering'}, 'company_tier': {'tier': 'faang-plus'}},
        {'company': 'Stripe', 'category': {'id': 'data_ml'}, 'company_tier': {'tier': 'unicorn'}},

Addresses two AI-reviewer findings on #299:

- Gemini: add `isinstance(job, dict)` guard at the top of `iter_category_ids()`
  so a malformed non-dict job in the list cannot crash the helper's fallback
  path (`job.get('categories', [])`). The category extraction is already
  null-safe via `get_nested_value`, but the legacy fallback was not.
- Copilot: standardize the tier IDs used throughout
  `tests/test_save_market_history.py` to `faang_plus` to match what
  `get_company_tier()` actually emits (verified against the live
  `docs/market-history.json`, which only ever contains `faang_plus`,
  `unicorn`, and `other`). The fictional `public-tech` tier was also
  replaced with `other`.

Adds `test_iter_category_ids_ignores_non_dict_job` to lock in the new
defensive guard.

Verified locally: 723 tests pass; end-to-end smoke
(enriched-shape jobs -> save_market_history) yields the expected
populated categories breakdown and `faang_plus`/`unicorn` tier counts.
@ambicuity ambicuity merged commit 01b5ff3 into main May 26, 2026
7 checks passed
@ambicuity ambicuity deleted the fix/228-market-history-categories branch May 26, 2026 18:43
ambicuity added a commit that referenced this pull request May 26, 2026
## Why

The "RECENT COMMITS" panel in the right-side detail of `#contributors`
was rendering deterministic mock SHAs and made-up messages like
`refactor(ui): keyboard nav` for every selected dev. Anyone landing on
jobs.riteshrana.engineer/#contributors and clicking a contributor saw
fabricated history that had nothing to do with that person.

## What

Replaces the local mock generator in `docs/terminal/contributors.jsx`
with a `useRecentCommits()` hook that fetches `GET
/repos/{repo}/commits?author={handle}` from the GitHub REST API on
selection. A module-level cache keyed by handle dedupes within a session
so re-clicking a contributor doesn't re-hit the API.

Each row now shows:
- real 7-char SHA
- actual commit subject (first line, truncated to 80 chars)
- human-readable "ago" derived from the commit author date
- linked to the commit on GitHub
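
The row fields listed above can be illustrated independently of the JSX hook. The Python sketch below is not the hook's code: it assumes a commit object shaped like the GitHub REST "list commits" response (`sha`, `commit.message`, `commit.author.date`, `html_url`), and the function names and "ago" thresholds are hypothetical choices made for this example.

```python
from datetime import datetime, timezone


def short_ago(then: datetime, now: datetime) -> str:
    """Compact relative age such as '19m', '2h', '3d', '2mo' (thresholds are assumptions)."""
    seconds = max(0, int((now - then).total_seconds()))
    minutes = seconds // 60
    if minutes < 60:
        return f"{minutes}m"
    hours = minutes // 60
    if hours < 24:
        return f"{hours}h"
    days = hours // 24
    if days < 30:
        return f"{days}d"
    return f"{days // 30}mo"


def format_commit_row(commit: dict, now: datetime) -> dict:
    """Build one display row: 7-char SHA, truncated subject, age, and link."""
    subject = commit["commit"]["message"].split("\n", 1)[0][:80]
    authored = datetime.fromisoformat(
        commit["commit"]["author"]["date"].replace("Z", "+00:00")
    )
    return {
        "sha": commit["sha"][:7],
        "subject": subject,
        "ago": short_ago(authored, now),
        "url": commit["html_url"],
    }


# Made-up commit object for illustration only.
sample = {
    "sha": "0123456789abcdef0123456789abcdef01234567",
    "html_url": "https://github.com/example/repo/commit/0123456",
    "commit": {
        "message": "fix: example subject\n\nLonger body text.",
        "author": {"date": "2026-05-26T18:00:00Z"},
    },
}
now = datetime(2026, 5, 26, 18, 19, tzinfo=timezone.utc)
row = format_commit_row(sample, now)
print(row["sha"], row["subject"], row["ago"])  # 0123456 fix: example subject 19m
```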

States surfaced explicitly (no silent failures):
- `loading from github…` while in-flight
- `no commits authored by @{handle} in this repo` for co-author-only
contributors (e.g. @15045 / 小吴 — co-author on #296 but never the
author).
- `could not load commits (...)` with the failure reason on
rate-limit/network failure.

Cache-bust `?v=7` → `?v=8` on all `docs/index.html` bundles so existing
visitors pick up the new code.

## Verification (local — `python3 -m http.server 8765`, Playwright)

| Contributor | Result |
|---|---|
| @ambicuity | Real commits — #299, #300, #301 (19m, 27m, 42m ago) |
| @rvac-bucky | Real commits — #265, #247, #166, #193 (2h, 2mo …) |
| @15045 | Empty-state message (co-author only) |
| @mitre88 | Empty-state message (no direct authored commits) |

Screenshot saved to `contributors-real-commits.png` (not committed).

Development

Successfully merging this pull request may close these issues.

[Bug]: market-history category breakdown is always empty despite non-empty jobs

3 participants