fix: replace deprecated datetime.utcnow() with datetime.now(timezone.utc) by ambicuity · Pull Request #292 · ambicuity/New-Grad-Jobs

ambicuity · 2026-05-24T02:41:40Z

Summary

Complete the timezone-aware migration by replacing ALL deprecated datetime.utcnow() calls with datetime.now(timezone.utc).

Follow-up to PR #290 and PR #291.

Changes

scripts/update_jobs.py: RSS feed timestamp (line 2950)
tests/conftest.py: 3 test fixture calls
tests/test_filter.py: 11 test fixture calls
tests/test_rss.py: 3 test fixture calls

Benefits

Eliminates all 104 deprecation warnings from test suite
Uses modern Python 3.12+ recommended approach
Consistent timezone-aware datetime handling throughout codebase
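
The replacement described above can be sketched as follows. This is a minimal illustration, not the repository's code: `utc_now` is a hypothetical helper name, and the warning filter is only there to show that the timezone-aware call is not the deprecated one.

```python
import warnings
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime.

    Hypothetical helper; the PR replaces calls inline rather than
    necessarily introducing a function like this.
    """
    return datetime.now(timezone.utc)


# Promote DeprecationWarning to an error: the aware call still succeeds,
# whereas datetime.utcnow() would raise here on Python 3.12+.
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    stamp = utc_now()

print(stamp.tzinfo is not None)  # True: the value carries its UTC offset
```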

Testing

All 718 tests pass with zero deprecation warnings.

Replace the glassmorphism job board with the Bloomberg/REPL terminal aesthetic from the design package: dense tabular layout, amber accent, JetBrains Mono, hash-routed HIRING ↔ CONTRIBUTORS tabs.

- New docs/terminal/ holds the React+Babel SPA (UMD CDN, no build step)
- app.jsx, dashboard.jsx, contributors.jsx copied from the design
- data.jsx adapts docs/jobs.json onto the dashboard's expected shape (de-duplicates ids, synthesizes 90-day deadline window, maps category → role type, tier → company size)
- contributors-data.jsx adapts docs/contributors.json (all-contributors format) onto the contributors view shape
- docs/index.html rewritten as the design's terminal shell, preserves the GoatCounter analytics snippet and SEO meta tags
- docs/contributors.html replaced with a meta+JS redirect to index.html#contributors so external links continue to resolve
- Delete docs/stats.html and docs/stats.js (out of scope for the new nav; market-history.json kept because scripts still consume it)
- Surgical edit to dashboard.jsx APPLY button / Enter-key handler and contributors.jsx VIEW ON GITHUB button so they open real URLs

Remove the .slice(0, 7) cap on the HIRING NOW left-rail widget so every company in the filtered set is listed. Surface the total in the header ("HIRING NOW 199") and wrap the rows in a 220px-tall scroller so the filter chips, deadline histogram, and company list all stay visible at the same time instead of the rail turning into one long list.

Replace the 0/0/0/0 repo card stats and the synthesized "commits=120 for everyone" leaderboard placeholders with real values from the public GitHub API:

- GET /repos/:owner/:repo → stars / forks / issues / license
- GET /repos/:owner/:repo/contributors → per-login commit counts
- GET /search/issues?type:pr+state:open → open PR count

Cached in localStorage for 1h to stay polite under the 60 req/hr/IP unauthenticated limit. Falls back to the existing placeholders if any endpoint fails or rate-limits.

LoC totals are still derived (no /stats/contributors call, because that endpoint is async/202 and slow); multipliers tuned to ~25 added / 8 deleted per commit so the NET LOC stat reads "+306k / -98k" instead of the previous inflated "+1.31M / -480k".

Caveats left as-is (all-contributors data doesn't carry them, and the fixes would each cost an extra per-user API call):

- region: '—' for everyone
- since: '2025' for everyone (first-commit date)
- last: '—' for non-owners (last-event timestamp)
- Recent-commits panel and activity sparkline remain synthesized

The terminal frontend's COMP widget rendered '—' for every job because docs/jobs.json had no compensation field. Pay-transparency laws in CA, NY, CO, WA and others mean a meaningful fraction of US new-grad listings include explicit salary ranges in the description body, which the scraper already fetches and uses for categorization but then drops.

New extract_compensation() helper regex-parses USD ranges out of the description before truncation, with sanity bounds (30k–600k, max span 250k) and a blocklist that strips funding/valuation/market-cap spans to suppress the most common false positives ("Series B raised $50M", etc.).

Wired into Greenhouse, Lever, and JobSpy fetchers. JobSpy also prefers the structured min_amount/max_amount columns that the jobspy library exposes on Indeed/LinkedIn listings before falling back to regex. Workday continues to ship comp=None until we add a per-job detail fetch.

The new 'comp' field is passed through generate_jobs_json() and the frontend adapter (docs/terminal/data.jsx) now reads it via compTuple() into the [low, high] thousands tuple the design's fmtComp() expects.

19 regex cases (CA-law / em-dash / k-suffix / USD-prefix / single-value / funding-then-real-salary / inverted-range / sanity-bound rejections / blocklist rejections) plus 2 generate_jobs_json passes; 44 tests pass.
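
The extraction approach described in this commit could look roughly like the sketch below. The function name `extract_compensation_sketch`, the regex, and the blocklist pattern are all assumptions written for illustration; only the sanity bounds (30k to 600k, maximum span 250k) and the idea of stripping funding sentences come from the commit message. Rejecting inverted ranges is also an assumption, since the message does not say how they are handled.

```python
import re
from typing import Optional, Tuple

# Illustrative amount pattern: "$120,000", "$120,000.00", "$120k" (assumption).
_AMOUNT = r"\$\s?(\d{1,3}(?:,\d{3})+|\d{2,3})(?:\.\d{2})?\s?(k)?"
# Connectors: hyphen, en dash, em dash, or the word "to".
_RANGE = re.compile(
    _AMOUNT + r"\s*(?:[-\u2013\u2014]|to)\s*" + _AMOUNT, re.IGNORECASE
)
# Illustrative blocklist: drop whole sentences that talk about funding.
_BLOCKLIST = re.compile(
    r"[^.]*\b(?:raised|funding|valuation|market cap)\b[^.]*\.?", re.IGNORECASE
)


def _to_dollars(number: str, k_suffix: Optional[str]) -> int:
    value = int(number.replace(",", ""))
    return value * 1000 if k_suffix else value


def extract_compensation_sketch(text: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) in USD for the first plausible salary range."""
    cleaned = _BLOCKLIST.sub(" ", text)
    for match in _RANGE.finditer(cleaned):
        low = _to_dollars(match.group(1), match.group(2))
        high = _to_dollars(match.group(3), match.group(4))
        if low > high:
            continue  # assumption: inverted ranges are skipped, not swapped
        # Sanity bounds taken from the commit message.
        if low >= 30_000 and high <= 600_000 and high - low <= 250_000:
            return low, high
    return None


text = "Series B raised $50M. The base salary range is $120,000 - $180,000."
print(extract_compensation_sketch(text))  # (120000, 180000)
```

Applying the blocklist before matching means a funding figure earlier in the text cannot shadow the real salary range that follows it.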

Inspection of zero-hit boards (Chime, Anthropic, Databricks, Pinterest) showed two real gaps in the original extractor:

1. Chime and several others publish ranges as "$X and up to $Y" rather than "$X - $Y". 8/9 Chime new-grad listings now match (was 0/9).
2. Greenhouse and Lever return HTML in `content`; salary spans like `<strong>$120,000</strong> - <strong>$180,000</strong>` previously missed because the regex saw the surrounding markup. Now strip tags and unescape HTML entities (for example the encoded forms of $, –, and the non-breaking space) before matching.

Also accept .00 cents on dollar amounts ($140,000.00) and the "through" connector. Expanded test cases from 19 to 24.

End-to-end pipeline against 20 Greenhouse boards: 31% new-grad coverage (67/210 jobs), up from 28%. Per-company hit rates: Brex 100%, Discord 100%, Chime 88%, LaunchDarkly 75%, Affirm 69%, Asana 60%, Twilio 46%.

Greenhouse public API confirmed to have NO structured compensation field on either the list or per-job endpoint; description-body regex remains the only signal. Zero-hit companies (Anthropic, Databricks, Pinterest, Robinhood, Airbnb, Figma, Roblox, Samsara) don't disclose numeric ranges in their description body and would require per-job HTML scraping of a separate compensation tab. Honest `—` for those.

49/49 tests pass.

Comp extraction was silently zero across all production scrapes because config.yml lists Greenhouse boards as e.g.

  https://boards-api.greenhouse.io/v1/boards/affirm/jobs

The Greenhouse Job Board API omits the description body unless the caller passes ?content=true. Without descriptions, extract_compensation always returned None. The symptoms looked like the regex was broken, but the regex never saw any text at all (verified by tracing one Affirm job through fetch → dedup → filter → enrich → generate_jobs_json).

fetch_greenhouse_jobs now appends the flag if it's missing.

End-to-end on a fresh local scrape: 33% of new-grad jobs (255/770) now ship with extracted comp, up from 0% with the same regex. Real values verified in the browser:

  Vercel       SWE, Agent         $232–348k
  PlanetScale  SWE, InfoSec       $140–320k
  Vercel       SWE, Next.js       $208–312k
  Chime        Assoc Gen Counsel  $217–300k

Two follow-ups deferred:

- Lever's salaryRange field is documented but rarely populated in practice; description-body regex would help but only 25 Lever jobs survive the new-grad filter, a small absolute gain.
- Workday's list endpoint omits descriptions entirely; would need a per-job detail fetch (~3× request count). 0/90 today.

Regression test added: TestGreenhouseFetcherContentFlag asserts the auto-append, preserves existing query params, doesn't double-append, and end-to-end verifies comp populates from returned content.

53 tests pass (28 in this file, 25 in test_generate_jobs_json).
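
The auto-append behavior that the regression test checks (add the flag, keep existing query params, never double-append) can be sketched with the standard library alone. `ensure_content_flag` is a hypothetical name; the actual logic lives inside fetch_greenhouse_jobs and may be written differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ensure_content_flag(url: str) -> str:
    """Append content=true to a board URL unless a content param exists."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "content" for key, _ in query):
        query.append(("content", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://boards-api.greenhouse.io/v1/boards/affirm/jobs"
print(ensure_content_flag(base))
# https://boards-api.greenhouse.io/v1/boards/affirm/jobs?content=true
```

Parsing the query string instead of concatenating text keeps the function idempotent, so calling it on an already-flagged URL returns the URL unchanged.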

… boards

Comprehensive cleanup of `config.yml` based on head-probing all 139 Greenhouse boards, all 5 Lever boards, and the 3 stale Workday tenants called out as 404 in the last scrape log.

Removed 39 Greenhouse boards (HTTP 404 on `/v1/boards/<slug>/jobs`): Alchemy, Chainalysis, Chronosphere, Circle, Cityblock Health, Cloudinary, Coinbase, Color Health, Column, Cruise, Deel, Devoted Health, Getaround, Goat, Grammarly, Hims & Hers, Jam City, Kraken, Lever, LightStep, Lime, Modern Treasury, Niantic, Paxos, Pipe, Pulumi, Railway, Retool, Ro, Runway, Shield AI, Skydio, Supercell, Tecton, Tempus, Veeva, Via, Zoox, Zynga.

Removed 3 Lever boards (return 200 OK with empty job list): Netflix, Plaid, Atlassian, all migrated to custom careers sites.

Removed 3 Workday tenants (stale Site_IDs, HTTP 404): Home Depot, Nike, Visa.

Added 3 Greenhouse boards (verified live with non-trivial job counts): Anduril Industries (1944 jobs), Block (161), xAI (220).

Verification scrape after this change:

- 957 new-grad jobs from 90 unique companies (was 770 from 87 companies)
- Greenhouse 842 (was 655) · Workday 90 · Lever 25
- Zero HTTP 404 errors in scrape log (was 42)
- Net: +187 new-grad jobs, +24% coverage

New companies in scrape output:

- Anduril Industries: 160 new-grad jobs
- xAI: 14 new-grad jobs
- Block: 13 new-grad jobs

See `docs/removed-companies.md` for the full per-company audit log and the list of high-value employers that cannot be added because they use custom careers sites (Apple, Meta, Google, OpenAI, Anthropic, quant firms, etc.).

38 Workday tenants (Microsoft, JPMorgan, Goldman Sachs, Morgan Stanley, Oracle, Salesforce, ServiceNow, Cisco, AMD, Qualcomm, Capital One, Fidelity, Intuit, Lockheed Martin, etc.) consistently return HTTP 422 with an empty error message. The existing retry path re-acquires the CSRF token from `https://{host}/` and retries, then fails identically.

Probed Microsoft, Oracle, and Goldman Sachs across five candidate endpoints (bare root GET, site-path GET, client_check POST, jobs GET, jobs POST). None of them issue the `CALYPSO_CSRF_TOKEN` cookie. The only cookie consistently set is `PLAY_SESSION`, which alone is not sufficient to authenticate a subsequent `/jobs` POST.

The picture is also tenant-cluster-dependent: Microsoft's wd10 cluster uses a different first-party cookie (`vps-cke`) than the wd1/wd5 clusters used by other tenants, suggesting Workday has rolled out infrastructure changes that the current HTTP-only scraper can't accommodate.

A real fix probably requires headless-browser bootstrapping per tenant on first request (playwright is already a dependency for tests). That's multi-day work and an architectural change to fetch_workday_jobs. This commit ships only the investigation document so the next person who picks it up has the empirical findings to start from.

Affected: 38 Workday tenants currently silently return 0 jobs after a single retry. The 2 HTTP 401 failures (TikTok, AbbVie) are separate credential issues, also out of scope.

Wider candidate probe across AI infra, devtools, fintech, healthtech, and dev-experience companies. 10 new live Greenhouse boards added: Glean (180 listings), BridgeBio (93), Together AI (56), Cribl (54), Tailscale (50), Sweetgreen (43), Recursion (34), Maven Clinic (28), Squarespace (27), Pulley (4).

Verification scrape:

- 1007 new-grad jobs from 95 unique companies (was 957/90)
- Greenhouse 891 (was 842), Workday 91 (was 90), Lever 25 unchanged
- Net +50 new-grad jobs, +5 companies, +29 comp values

Per-board new-grad yield: Glean 31, Tailscale 12, Together AI 4, Cribl and Squarespace 1 each. BridgeBio/Sweetgreen/Recursion/Maven/Pulley returned 0 new-grad listings this cycle but have non-empty boards that may surface candidates over time.

Probed but already in config: Reddit, Gusto, Carta, PagerDuty, Webflow, Lattice, Materialize.

… Mistral, Perplexity

Ashby is a modern ATS used by many AI labs and devtools companies that the existing Greenhouse/Lever/Workday set cannot reach. Adds a new fetch_ashby_jobs that hits the public Posting API:

  https://api.ashbyhq.com/posting-api/job-board/<slug>?includeCompensation=true

Unlike Greenhouse/Lever, Ashby returns structured compensation directly in the response. The fetcher prefers Ashby's compensationTiers when present (currency=USD, interval=per-year, sanity-bounded), falling back to the description-body regex extractor for the rest.

25 verified-live boards configured: OpenAI, Crusoe, Mistral AI, Notion, Cohere, Sierra, LangChain, Cursor, Lovable, Perplexity, Baseten, Ashby, Supabase, Sentry, Modal, Attio, Campfire, Vapi, Linear, Qualified, Browserbase, Anyscale, Pinecone, Weaviate, Turbopuffer.

Verification scrape:

- 1103 new-grad jobs from 107 unique companies (was 1007 / 95)
- Greenhouse 891, Workday 91, Ashby 96, Lever 25
- 311 with extracted comp (was 293)

Top of the comp-sorted table is now AI-lab-dominant:

  Notion  $220–350k  Software Engineer, Mobile AI, iOS
  Vercel  $232–348k  Software Engineer, Agent
  xAI     $200–340k  Infrastructure Security Engineer
  OpenAI  $230–325k  Data Scientist, Safety
  OpenAI  $293–325k  Data Scientist, Core Experimentation
  Notion  $272–320k  Software Engineer, AI Capture

ALSO fixes a latent HTTP_SESSION bug: Accept-Encoding advertised 'br' (Brotli) which Python's requests does not auto-decode. Ashby's Cloudflare layer preferred br when offered, yielding un-parseable bodies and 0/25 success on first run. Dropping br from Accept-Encoding (leaving gzip + deflate) makes responses decode-on-arrival like every other ATS already in the pipeline.

5 new unit tests for fetch_ashby_jobs covering: ?includeCompensation auto-append, structured-comp extraction, regex fallback, sanity-bound rejection of internship stipends, and non-USD currency skipping.

82 tests pass.
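
The Accept-Encoding change amounts to no longer advertising `br`. A minimal sketch of that header edit as a pure string function is shown below; `drop_brotli` is a hypothetical helper, and the commit may simply hardcode the new header value on HTTP_SESSION instead.

```python
def drop_brotli(accept_encoding: str) -> str:
    """Remove the 'br' coding from an Accept-Encoding header value."""
    codings = [coding.strip() for coding in accept_encoding.split(",")]
    kept = [
        coding
        for coding in codings
        # Compare only the coding name, ignoring any ";q=" weight.
        if coding and coding.split(";")[0].strip().lower() != "br"
    ]
    return ", ".join(kept)


print(drop_brotli("gzip, deflate, br"))  # gzip, deflate
```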

…er, ElevenLabs, ...

Second-pass Ashby probe surfaced 18 more verified-live boards: Snowflake (423 raw), ElevenLabs (139), Decagon (108), Plaid (88), Commure (85), Suno (43), Docker (42), Astronomer (27), Poolside (16), Cradle Bio (11), Airbyte (9), Railway (9), Warp (8), Statsig (7), Reka (6), Stytch (5), Prefect (4), Runway (4).

Two were previously in docs/removed-companies.md:

- Plaid was removed from Lever (zero jobs); migrated to Ashby.
- Runway was removed from Greenhouse as `runwayml` (404); on Ashby the slug is `runway`.

Verification scrape:

- 1134 new-grad jobs from 115 unique companies (was 1103 / 107)
- Greenhouse 891, Ashby 127 (was 96), Workday 91, Lever 25
- 314 with extracted comp (was 311)

Per-board new-grad yield from this pass: Snowflake 12, Plaid 7, Suno 3, Commure 3, Astronomer 2, Docker 2, Airbyte 1, Decagon 1. The other 10 boards had non-empty totals but zero passed the new-grad filter this cycle.

removed-companies.md updated with both Ashby passes and the Plaid / Runway re-discovery note.

The left-rail "REMOTE" filter (remote / hybrid / onsite chips) was visually present but functionally dead: docs/terminal/data.jsx hardcoded `rmt: 'onsite'` for every job, so clicking "remote" or "hybrid" always filtered to zero results.

Add a deriveRmt(j) helper that matches the location and title text:

- 'hybrid' wins when "\bhybrid\b" appears (covers "Hybrid - SF", "San Francisco (Hybrid)", "Hybrid remote", etc.)
- 'remote' wins next when "\bremote\b" appears (without hybrid)
- otherwise 'onsite'

After this, classification across the current 1134-job scrape: 928 onsite · 179 remote · 27 hybrid

Spot-checks:

- Twilio "Remote - US" → remote ✓
- Vercel "Hybrid - San Francisco" → hybrid ✓
- Tanium "Addison TX (Hybrid)" → hybrid ✓
- Waymo "Mountain View, California" → onsite ✓
- Anduril "Costa Mesa, California" → onsite ✓

Clicking the remote chip in the browser now narrows the table to "179 / 1134 results", wired end-to-end.

The eventual right move is to add a structured `is_remote` / `workplace_type` field to jobs.json at scrape time (Ashby exposes both as first-class fields; Greenhouse/Lever need the same text-detection). That's a follow-up; this commit unblocks the UX today.
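
The precedence rules for deriveRmt can be illustrated with a short Python transliteration. The real helper is JavaScript in docs/terminal/data.jsx; `derive_rmt` and its two-argument signature are assumptions made for this sketch, while the two word-boundary patterns and their order come from the commit message.

```python
import re

_HYBRID = re.compile(r"\bhybrid\b", re.IGNORECASE)
_REMOTE = re.compile(r"\bremote\b", re.IGNORECASE)


def derive_rmt(location: str, title: str = "") -> str:
    """Classify a job as 'hybrid', 'remote', or 'onsite' from its text."""
    text = f"{location} {title}"
    if _HYBRID.search(text):  # hybrid is checked first, so it wins over remote
        return "hybrid"
    if _REMOTE.search(text):
        return "remote"
    return "onsite"


print(derive_rmt("Remote - US"))                # remote
print(derive_rmt("Addison TX (Hybrid)"))        # hybrid
print(derive_rmt("Mountain View, California"))  # onsite
```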

The 8 role chips (swe / ml / data / frontend / backend / infra / security / mobile) were rendered but four of them (frontend, backend, security, mobile) never matched any job because the scraper only classifies into 6 broad categories and the adapter mapped them as:

  software_engineering → SWE (where backend/frontend/mobile hide)
  data_ml → ML
  data_engineering → DATA
  infrastructure_sre → INFRA (where security hides)
  hardware → INFRA (no HW chip existed at all)
  other → SWE

Two changes:

1. New deriveType(j) in data.jsx pattern-matches the job title before falling back to the category id. Order matters (HW > SEC > ML > DATA > MOBILE > FE > BE > INFRA) so "embedded systems engineer" wins HW, "cloud security engineer" wins SEC, "ML platform engineer" wins ML.
2. dashboard.jsx grows a 9th chip, `hardware`, and the TYPE_LABEL gains HW='hardware'. The hardware category in jobs.json (Anduril, SpaceX, Northrop, etc.) is no longer silently merged into infra.

Distribution across the live 1134-job scrape: SWE 632 · INFRA 205 · DATA 77 · SEC 77 · ML 54 · BE 43 · HW 27 · MOBILE 11 · FE 8

Browser-verified click behavior: each chip narrows the table to the correct count (e.g. hardware → 27 / 1134 results; frontend → 8 / 1134).
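
The ordered title matching can be sketched in Python as a first-match-wins rule list. Only the priority order and the three example titles come from the commit message; the regex patterns, the `derive_type` name, and the assumption that the `hardware` category now falls back to HW are illustrative guesses, not the contents of data.jsx.

```python
import re

# (label, pattern) pairs in priority order; the first match wins.
# The patterns themselves are hypothetical.
_TITLE_RULES = [
    ("HW", r"\b(embedded|hardware|firmware|fpga)\b"),
    ("SEC", r"\bsecurity\b"),
    ("ML", r"\b(ml|machine learning)\b"),
    ("DATA", r"\bdata\b"),
    ("MOBILE", r"\b(mobile|ios|android)\b"),
    ("FE", r"\b(frontend|front[- ]end)\b"),
    ("BE", r"\b(backend|back[- ]end)\b"),
    ("INFRA", r"\b(infrastructure|sre|devops|platform|cloud)\b"),
]
_COMPILED = [(label, re.compile(p, re.IGNORECASE)) for label, p in _TITLE_RULES]

# Category fallback; mapping hardware to HW is an assumption.
_CATEGORY_FALLBACK = {
    "software_engineering": "SWE",
    "data_ml": "ML",
    "data_engineering": "DATA",
    "infrastructure_sre": "INFRA",
    "hardware": "HW",
    "other": "SWE",
}


def derive_type(title: str, category: str) -> str:
    """Match the title against ordered rules, else fall back to category."""
    for label, pattern in _COMPILED:
        if pattern.search(title):
            return label
    return _CATEGORY_FALLBACK.get(category, "SWE")


print(derive_type("Cloud Security Engineer", "infrastructure_sre"))  # SEC
```

Because "cloud" would also satisfy the INFRA rule, the example shows why SEC has to be tested earlier in the list.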

…UT THE ROLE

The right-panel "ABOUT THE ROLE" widget was rendering a templated fallback ("Anduril Industries is hiring for Software Engineer, Lattice C2 UI in … Posted via Greenhouse.") for every job because the scraper captured each posting's description body but dropped it before serializing docs/jobs.json. The frontend adapter had no real text to read so it built the placeholder.

Plumb the real description through:

1. New clean_description(text, max_chars=1200) helper near the compensation extractor. Strips HTML and decodes entities, collapses whitespace, clips to 1200 chars on a word boundary with a trailing ellipsis. Keeps jobs.json under ~3 MB for a 1000-job scrape.
2. Replace all 6 `description[:500] if description else ''` truncation sites (Greenhouse, Lever, JobSpy, Ashby, two GraphQL paths) with clean_description(description). Workday remains description='' because its list endpoint doesn't expose the field.
3. generate_jobs_json() emits 'description': job.get('description', '').
4. docs/terminal/data.jsx mapJob prefers j.description (when ≥ 60 chars) over the synthetic "hiring for X in Y" fallback.

ALSO fixes a latent _strip_html ordering bug. Greenhouse serves its content field as DOUBLE-encoded HTML (`&lt;div&gt;` rather than `<div>`). The previous implementation stripped tags first (no-op because no literal '<' present) then unescaped, leaving literal HTML tags in the final output. Now it unescapes first then strips tags.

Verification scrape:

- 1129 new-grad jobs, 1041 with real description (92% coverage)
- Greenhouse 892/892, Ashby 124/124, Lever 25/25, Workday 0/88
- jobs.json size: 2.19 MB (up from 0.98 MB)
- avg description length: 1192 chars, zero HTML residue

Browser-verified: clicking Anduril Lattice C2 UI in the table now shows the actual posting copy in ABOUT THE ROLE.

Also bumped ?v=4 cache-buster on the .jsx <script> tags in index.html so the description-shape change forces browsers to re-fetch the bundle. Bump v on future data-shape changes.

6 new TestCleanDescription cases. 113 total tests pass.
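
The unescape-then-strip ordering and the word-boundary clip can be sketched as below. `clean_description_sketch` and its tag regex are assumptions for illustration; only the order of operations, the 1200-character default, and the trailing ellipsis are taken from the commit message.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")


def clean_description_sketch(text: str, max_chars: int = 1200) -> str:
    """Decode entities, strip tags, collapse whitespace, clip on a word."""
    if not text:
        return ""
    unescaped = html.unescape(text)      # entity-encoded tags become real tags
    no_tags = _TAG.sub(" ", unescaped)   # now the tag stripper can see them
    collapsed = re.sub(r"\s+", " ", no_tags).strip()
    if len(collapsed) <= max_chars:
        return collapsed
    clipped = collapsed[:max_chars]
    head, sep, _ = clipped.rpartition(" ")
    if sep:                              # back up to the last full word
        clipped = head
    return clipped + "\u2026"            # trailing ellipsis


print(clean_description_sketch("&lt;p&gt;Build   things&lt;/p&gt;"))
# Build things
```

Running the two steps in the opposite order would leave `<p>` and `</p>` in the output, which is the bug the commit describes.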

…on't blow out the panel

Long postings (e.g. the Riot Games "Technical Producer II – Machine Learning" job at ~1200 chars) were stretching the detail panel by 400+ px, pushing STACK / REQUIREMENTS / SIMILAR ROLES / APPLY out of the fold.

Constrain the description block:

  maxHeight: 160px   (≈ 5–6 lines at the new font size)
  overflowY: auto    (scoped scrollbar, not the whole panel)
  fontSize: 11.5     (tighter density to fit more text)
  paddingRight: 8    (room for the scrollbar gutter)

Browser-verified on the Riot Technical Producer II posting: rendered height 160px · content scrollHeight 392px → internal scroll; panel sections below remain visible at the page fold.

Bumped ?v=5 cache-buster on the .jsx script tags so the layout change takes effect on next page load.

… terminal"

User-facing changes:

- Top-bar badge changes from NG to NGJ (the only brand mark in the chrome).
- Drop the "newgrad.sh" wordmark and "// new-grad job terminal" subtitle next to the badge.
- Page <title> + og:title now read "NGJ · New Grad Jobs".
- Boot/loading splash drops the newgrad.sh prefix.

Internal cleanup:

- Comment in docs/terminal/data.jsx no longer references newgrad.sh.

Verified in browser: no "newgrad.sh", no "new-grad job terminal", no solo "NG" badge remains anywhere on the page. ?v=6 cache-bust ensures the bundle re-fetches.

README:

- Rewrite "Data Sources": Greenhouse 113, Ashby 43 (new), Workday 57, Lever 2 (down from 5), JobSpy. Google Careers removed (deprecated).
- Rewrite "Key Features" to describe the NGJ terminal frontend (JetBrains Mono, dense tabular, 9-category ROLE filter, REMOTE filter, real compensation + about-the-role widgets).
- Add a "docs/jobs.json schema" table, the first time the public output shape is documented. Covers the new `comp` and `description` fields.
- Refresh "Companies Monitored": purge the 39 dead Greenhouse names pruned in this branch, add the 13 new Greenhouse names, enumerate all 43 Ashby boards, list highlight Workday tenants with a link to config.yml for the full set.

JOB_SCRAPING_APIS:

- Promote the "Existing APIs (Maintained)" section into structured entries with endpoint + status + known issues (Greenhouse content=true, Workday 422 link to investigation doc).
- New Ashby section (43 boards, structured compensation, Brotli Accept-Encoding gotcha).

No script regenerates these prose sections (verified by grep across scripts/ and .github/workflows/), so the edits are stable across scheduled scraper runs.

…p-aware graphql test

Four pre-merge CI failures, all addressed:

1. test_config.py: hardcoded threshold of 200 configured companies. The counter only summed greenhouse + lever + workday; the new 43-board Ashby section wasn't counted. After adding Ashby: 113 + 2 + 57 + 43 = 215 companies, above the 200 floor.
2. tests/test_display_count_copy.py + tests/test_stats_predictions_contract.py: referenced docs/stats.html and docs/stats.js, both deleted in this PR (the new NGJ two-tab nav doesn't include a stats view). Removed the whole test_stats_predictions_contract.py file (every test in it exercises the now-deleted stats page) and made test_display_count_copy skip non-existent surfaces gracefully.
3. tests/test_graphql_fetch.py asserted len(description) == 500. The clip ceiling moved from 500 to 1200 in 6c8b585 (description wiring). Updated the test to use a 2000-char input and assert <= 1201 (clip) plus the "…" suffix.
4. Pre-commit auto-fixes: isort reorders `import json` after `import html` alphabetically; end-of-file-fixer adds a trailing newline to docs/predictions.json. Both applied.

Local: 692 tests pass.

Pre-commit's end-of-file-fixer hook keeps flagging this file because the scraper's auto-commit on main writes it without a final newline. One-character fix to make the hook green.

The top-bar "● LIVE 16 MAY 2026 · 14:32 PT" was a static string that never moved (docs/terminal/app.jsx). Replace it with <LiveStamp />:

* Reads window.NGJOBS_META.generated_at (set by data.jsx from docs/jobs.json's meta block, refreshed by every scrape).
* Renders in the viewer's local timezone via Intl.DateTimeFormat with timeZoneName: 'short'; the static "PT" suffix becomes whatever abbreviation the visitor's environment reports (PDT / IST / GMT / …).
* Polls every 30s plus once on NGJOBS_READY so the chip transitions live → stale (amber dot, STALE label) when generated_at is more than 15 minutes old (three missed 5-min scrape cycles is a real failure signal). Renders "● OFFLINE —" when generated_at is missing entirely rather than fabricating a date.

The pure formatter lives in docs/terminal/live-stamp-format.js as a UMD-style IIFE so it works as a <script> tag in the browser and can be loaded into a Node VM for unit testing. tests/terminal/test_live_stamp.cjs covers fresh / stale / boundary / offline / Intl-throws cases.

index.html bumps the bundle cache-bust to v=7 and loads live-stamp-format.js before the Babel scripts so window.formatLiveStamp is defined when <LiveStamp /> renders.

Browser smoke-test (Playwright): chip rendered "● STALE May 17, 2026, 22:47 CDT" against the current (19h-old) docs/jobs.json, and "● LIVE May 18, 2026, 12:17 CDT" when fed a fresh generated_at.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
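
The live / stale / offline decision can be sketched in Python as a pure function of the timestamp and the current time. The real formatter is JavaScript in docs/terminal/live-stamp-format.js; `live_stamp_state` is a hypothetical name, and treating an unparseable or offset-less timestamp the way shown here is an assumption. The 15-minute threshold and the strict "more than" comparison come from the commit message.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(minutes=15)


def live_stamp_state(generated_at: Optional[str], now: datetime) -> str:
    """Return 'live', 'stale', or 'offline' for a scrape timestamp."""
    if not generated_at:
        return "offline"
    try:
        stamp = datetime.fromisoformat(generated_at)
    except ValueError:
        return "offline"  # assumption: bad input is shown as offline
    if stamp.tzinfo is None:
        # Assumption: a timestamp without an offset is interpreted as UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return "stale" if now - stamp > STALE_AFTER else "live"


now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
print(live_stamp_state("2026-05-18T11:50:00+00:00", now))  # live
print(live_stamp_state("2026-05-17T17:00:00+00:00", now))  # stale
print(live_stamp_state(None, now))                         # offline
```

Passing `now` in as an argument keeps the function deterministic, which is what makes boundary cases (exactly 15 minutes old) straightforward to test.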

## Summary

- When Greenhouse list endpoint returns empty `content` for a job, fetch individual job details from `/{job_id}` endpoint
- Only enriches jobs that need it (performance-conscious)
- Increases `clean_description` max_chars from 1200 to 3000
- Adds `description_html` (raw HTML) and `full_description` (complete cleaned text) to jobs.json
- Gracefully handles individual fetch failures (job returned with empty description)

## Changes

- `scripts/update_jobs.py`: Added `fetch_greenhouse_job_detail()`, modified `fetch_greenhouse_jobs()` to enrich empty-content jobs, updated `generate_jobs_json()` with new fields
- `tests/test_enrichment.py`: Added `TestGreenhouseDescriptionEnrichment` with 5 tests
- `tests/test_graphql_fetch.py`: Updated for new 3000-char limit

## Testing

```bash
pytest tests/test_enrichment.py::TestGreenhouseDescriptionEnrichment -v  # 5 passed
pytest tests/ -v  # 697 passed
flake8 scripts/ --select=E9,F63,F7,F82  # clean
```

Fixes #275

…hment changes


The enrichment code adds full_description (complete text up to 50k chars) to jobs.json. This commit updates the frontend to prefer full_description over the truncated description field, so users see complete job details including full requirements and qualifications.

Previously only Greenhouse jobs had description_html populated. This commit adds description_html to all ATS fetchers so the frontend can show full descriptions for all job sources.

- Ashby: description_html from descriptionHtml field
- Lever: description_html from description field
- JobSpy: description_html from description field

This ensures the frontend's full_description preference works for all job sources, not just Greenhouse.

…eholder

The REQUIREMENTS section was showing hardcoded placeholder text ('BS / MS in CS or equivalent', 'Proficiency in —') instead of the actual requirements extracted from job descriptions.

Now uses extractRequirements() to parse requirements from the full_description field and displays them in the REQUIREMENTS section. Falls back to a message when requirements cannot be extracted.

Fixes the 'Proficiency in —' placeholder issue.

# Conflicts: # docs/predictions.json

The previous extractRequirements function only worked for ~17% of jobs because it stripped HTML and lost list structure. Now:

1. Parses HTML directly to find <li> elements
2. Uses broader header patterns (WHAT WE LOOKING FOR, ABOUT YOU, etc.)
3. Falls back to <li> items when no requirements section found
4. Better filtering of garbage items
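
The idea of reading `<li>` elements directly, rather than stripping HTML first, can be sketched in Python with the standard-library parser. The real extractRequirements is JavaScript in dashboard.jsx; the class and function names here are hypothetical, and the length bounds (longer than 10, shorter than 300 characters) are borrowed from the filter shown in the review diff on this PR.

```python
from html.parser import HTMLParser
from typing import List


class _ListItemCollector(HTMLParser):
    """Collect the text content of top-level <li> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[str] = []
        self._depth = 0
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "li" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse internal whitespace before storing the item.
                self.items.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def list_items(description_html: str, min_len: int = 10, max_len: int = 300) -> List[str]:
    """Return <li> texts whose length falls strictly inside the bounds."""
    parser = _ListItemCollector()
    parser.feed(description_html)
    parser.close()
    return [item for item in parser.items if min_len < len(item) < max_len]


sample = "<h3>About you</h3><ul><li>BS in CS or equivalent experience</li><li>Python</li></ul>"
print(list_items(sample))  # ['BS in CS or equivalent experience']
```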

# Conflicts: # docs/terminal/dashboard.jsx


Fixes #289

Changed datetime.now().isoformat() to datetime.now(timezone.utc).isoformat() for all timestamp generation in update_jobs.py.

This ensures timestamps include the UTC timezone marker (+00:00), allowing the frontend to correctly convert and display times in the viewer's local timezone. Previously, naive datetime strings were parsed by JavaScript's Date.parse() as local time, causing the LIVE indicator to show incorrect times.

Files changed:

- scripts/update_jobs.py: 5 timestamp calls updated
- tests/: 3 test files updated to expect timezone-aware format
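
The difference in the serialized strings is easy to see with fixed values. This is a small illustration with made-up timestamps, not code from update_jobs.py.

```python
from datetime import datetime, timezone

naive = datetime(2026, 5, 24, 2, 41, 40)
aware = datetime(2026, 5, 24, 2, 41, 40, tzinfo=timezone.utc)

print(naive.isoformat())  # 2026-05-24T02:41:40
print(aware.isoformat())  # 2026-05-24T02:41:40+00:00

# The aware string round-trips with its offset intact.
parsed = datetime.fromisoformat(aware.isoformat())
print(parsed.utcoffset())  # 0:00:00
```

Only the second string tells a consumer which instant is meant; the first leaves the interpretation to whatever parses it.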

Complete the timezone fix by updating remaining naive datetime.now() calls:

- scripts/update_jobs.py: 3 date-only comparison calls (lines 2500, 2566, 2698)
- tests/test_enrichment.py: 6 test fixture calls
- tests/test_save_market_history.py: 12 test fixture calls

This ensures consistency across the entire codebase and prevents potential day-boundary edge cases when running on non-UTC machines.
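
A day-boundary edge case of the kind mentioned above can be reproduced with a fixed instant. The UTC-5 offset and the variable names are chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# One instant, viewed from UTC and from a machine five hours behind UTC.
instant = datetime(2026, 5, 24, 2, 41, tzinfo=timezone.utc)
utc_minus_5 = timezone(timedelta(hours=-5))

print(instant.date())                          # 2026-05-24
print(instant.astimezone(utc_minus_5).date())  # 2026-05-23
```

A date-only comparison built on the local clock would therefore disagree with one built on UTC for several hours each day, which is the inconsistency the change removes.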

…utc)

Eliminates all deprecated datetime.utcnow() calls:

- scripts/update_jobs.py: RSS feed timestamp (line 2950)
- tests/conftest.py: 3 test fixture calls
- tests/test_filter.py: 11 test fixture calls
- tests/test_rss.py: 3 test fixture calls

This completes the timezone-aware migration and eliminates all 104 deprecation warnings from the test suite.

coderabbitai · 2026-05-24T02:41:46Z

Warning

Review limit reached

@ambicuity, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 35 minutes and 13 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more review capacity refills, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7326d563-6a6a-4e5c-8001-019a15f993bd

📥 Commits

Reviewing files that changed from the base of the PR and between 30c4359 and 3786ea9.

📒 Files selected for processing (9)

docs/terminal/dashboard.jsx
scripts/update_jobs.py
tests/conftest.py
tests/test_enrichment.py
tests/test_filter.py
tests/test_generate_jobs_json.py
tests/test_predict_hiring_trends.py
tests/test_rss.py
tests/test_save_market_history.py


codecov · 2026-05-24T02:42:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

gemini-code-assist

Code Review

This pull request refines the requirement extraction logic in dashboard.jsx by implementing header-based and HTML list-item parsing. It also standardizes the Python codebase and test suite to use timezone-aware datetimes (timezone.utc) instead of naive now() or utcnow() calls. A review comment identifies a redundant stopword check in the requirement filtering logic, as the existing length constraint already excludes the specified short words.

gemini-code-assist · 2026-05-24T02:48:31Z

+    const endMatch = afterHeader.match(sectionEnd);
+    const section = endMatch ? afterHeader.substring(0, endMatch.index) : afterHeader.substring(0, 1000);
+    const items = section.split(/(?:^|\s)[•\-\*]\s*|(?:^|\s)\d+\.\s*|(?:^|\s)[a-z]\)\s*/);
+    const validItems = items.map(s => s.trim()).filter(s => s.length > 10 && s.length < 300 && !s.match(/^(?:and|or|the|a|an|is|are|was|were|be|been|being|have|has|had|do|does|did|will|would|could|should|may|might|can|shall)$/i));


The stopword regex check is redundant here because the filter already requires s.length > 10. None of the words in the stopword list (e.g., 'and', 'should', 'might') exceed 6 characters in length, so they are already excluded by the length constraint.

Suggested change

const validItems = items.map(s => s.trim()).filter(s => s.length > 10 && s.length < 300 && !s.match(/^(?:and|or|the|a|an|is|are|was|were|be|been|being|have|has|had|do|does|did|will|would|could|should|may|might|can|shall)$/i));

const validItems = items.map(s => s.trim()).filter(s => s.length > 10 && s.length < 300);
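
The reviewer's claim can be checked mechanically. The Python sketch below rebuilds the stopword list from the regex in the diff and compares the two filters on a few made-up items; the function names and sample strings are hypothetical.

```python
import re

STOPWORDS = (
    "and|or|the|a|an|is|are|was|were|be|been|being|have|has|had|do|does|did|"
    "will|would|could|should|may|might|can|shall"
).split("|")
STOPWORD_RE = re.compile("^(?:" + "|".join(STOPWORDS) + ")$", re.IGNORECASE)


def with_stopword_check(items):
    trimmed = [s.strip() for s in items]
    return [s for s in trimmed if 10 < len(s) < 300 and not STOPWORD_RE.match(s)]


def length_only(items):
    trimmed = [s.strip() for s in items]
    return [s for s in trimmed if 10 < len(s) < 300]


longest = max(STOPWORDS, key=len)
print(longest, len(longest))  # should 6

sample = ["and", "  should ", "5+ years of Python experience", "BS in Computer Science"]
print(with_stopword_check(sample) == length_only(sample))  # True
```

Because the regex is anchored at both ends, it can only match strings of at most 6 characters, all of which the `length > 10` condition has already rejected.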

ambicuity and others added 30 commits May 16, 2026 13:33

fix(ci): add trailing newline to docs/predictions-status.json

5bd147e

Pre-commit's end-of-file-fixer hook keeps flagging this file because the scraper's auto-commit on main writes it without a final newline. One-character fix to make the hook green.

Merge main into feat/terminal-redesign: resolve conflicts, keep enric…

70d9831

…hment changes

Merge remote-tracking branch 'origin/main' into feat/terminal-redesign

2b34f93

fix: ensure predictions.json ends with newline

df09efc

Merge remote-tracking branch 'origin/main' into feat/terminal-redesign

66226ba

Merge remote-tracking branch 'origin/main' into feat/terminal-redesign

40b62e4

# Conflicts: # docs/predictions.json

ambicuity added 7 commits May 23, 2026 20:25

fix: ensure predictions.json ends with newline

85a040e

Merge remote-tracking branch 'origin/main' into feat/terminal-redesign

c318246

# Conflicts: # docs/terminal/dashboard.jsx

Copilot AI review requested due to automatic review settings May 24, 2026 02:41

Copilot started reviewing on behalf of ambicuity May 24, 2026 02:41 View session

ambicuity merged commit 70732c2 into main May 24, 2026
7 of 8 checks passed

ambicuity deleted the fix/timezone-aware-timestamps branch May 24, 2026 02:43

Copilot AI reviewed May 24, 2026


gemini-code-assist Bot reviewed May 24, 2026

ambicuity merged 37 commits into main from fix/timezone-aware-timestamps
