
fix: replace deprecated datetime.utcnow() with datetime.now(timezone.utc) #292

Merged
ambicuity merged 37 commits into main from fix/timezone-aware-timestamps on May 24, 2026

Conversation

@ambicuity
Owner

Summary

Complete the timezone-aware migration by replacing ALL deprecated datetime.utcnow() calls with datetime.now(timezone.utc).

Follow-up to PR #290 and PR #291.

Changes

  • scripts/update_jobs.py: RSS feed timestamp (line 2950)
  • tests/conftest.py: 3 test fixture calls
  • tests/test_filter.py: 11 test fixture calls
  • tests/test_rss.py: 3 test fixture calls

Benefits

  • Eliminates all 104 deprecation warnings from test suite
  • Uses datetime.now(timezone.utc), the replacement recommended since Python 3.12 deprecated datetime.utcnow()
  • Consistent timezone-aware datetime handling throughout codebase

Testing

All 718 tests pass with zero deprecation warnings.
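
For reference, the replacement pattern amounts to a one-line helper. The sketch below is illustrative only: `utc_now` is a hypothetical name, not something this PR adds to scripts/update_jobs.py.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Timezone-aware stand-in for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


stamp = utc_now()
# An aware datetime carries tzinfo, so its ISO string ends in an explicit offset.
print(stamp.tzinfo is timezone.utc)
print(stamp.isoformat())
```

Unlike the value returned by `datetime.utcnow()`, the result here is an aware datetime, so comparing it against a naive datetime raises a `TypeError` instead of silently mixing time zones.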

ambicuity and others added 30 commits May 16, 2026 13:33
Replace the glassmorphism job board with the Bloomberg/REPL terminal
aesthetic from the design package: dense tabular layout, amber accent,
JetBrains Mono, hash-routed HIRING ↔ CONTRIBUTORS tabs.

- New docs/terminal/ holds the React+Babel SPA (UMD CDN, no build step)
  - app.jsx, dashboard.jsx, contributors.jsx copied from the design
  - data.jsx adapts docs/jobs.json onto the dashboard's expected shape
    (de-duplicates ids, synthesizes 90-day deadline window, maps category
    → role type, tier → company size)
  - contributors-data.jsx adapts docs/contributors.json (all-contributors
    format) onto the contributors view shape
- docs/index.html rewritten as the design's terminal shell, preserves the
  GoatCounter analytics snippet and SEO meta tags
- docs/contributors.html replaced with a meta+JS redirect to
  index.html#contributors so external links continue to resolve
- Delete docs/stats.html and docs/stats.js (out of scope for the new nav;
  market-history.json kept because scripts still consume it)
- Surgical edit to dashboard.jsx APPLY button / Enter-key handler and
  contributors.jsx VIEW ON GITHUB button so they open real urls

Remove the .slice(0, 7) cap on the HIRING NOW left-rail widget so every
company in the filtered set is listed. Surface the total in the header
("HIRING NOW 199") and wrap the rows in a 220px-tall scroller so the
filter chips, deadline histogram, and company list all stay visible at
the same time instead of the rail turning into one long list.

Replace the 0/0/0/0 repo card stats and the synthesized "commits=120 for
everyone" leaderboard placeholders with real values from the public
GitHub API:

- GET /repos/:owner/:repo            → stars / forks / issues / license
- GET /repos/:owner/:repo/contributors → per-login commit counts
- GET /search/issues?type:pr+state:open → open PR count

Cached in localStorage for 1h to stay polite under the 60 req/hr/IP
unauthenticated limit. Falls back to the existing placeholders if any
endpoint fails or rate-limits.

LoC totals are still derived (no /stats/contributors call — that endpoint
is async/202 and slow); multipliers tuned to ~25 added / 8 deleted per
commit so the NET LOC stat reads "+306k / -98k" instead of the previous
inflated "+1.31M / -480k".

Caveats left as-is (all-contributors data doesn't carry them, and the
fixes would each cost an extra per-user API call):
- region: '—' for everyone
- since: '2025' for everyone (first-commit date)
- last:  '—' for non-owners (last-event timestamp)
- Recent-commits panel and activity sparkline remain synthesized

The terminal frontend's COMP widget rendered '—' for every job because
docs/jobs.json had no compensation field. Pay-transparency laws in CA,
NY, CO, WA and others mean a meaningful fraction of US new-grad listings
include explicit salary ranges in the description body — which the
scraper already fetches and uses for categorization but then drops.

New extract_compensation() helper regex-parses USD ranges out of the
description before truncation, with sanity bounds (30k–600k, max span
250k) and a blocklist that strips funding/valuation/market-cap spans to
suppress the most common false positives ("Series B raised $50M", etc.).

Wired into Greenhouse, Lever, and JobSpy fetchers. JobSpy also prefers
the structured min_amount/max_amount columns that the jobspy library
exposes on Indeed/LinkedIn listings before falling back to regex.
Workday continues to ship comp=None until we add a per-job detail fetch.

The new 'comp' field is passed through generate_jobs_json() and the
frontend adapter (docs/terminal/data.jsx) now reads it via compTuple()
into the [low, high] thousands tuple the design's fmtComp() expects.

19 regex cases (CA-law / em-dash / k-suffix / USD-prefix / single-value /
funding-then-real-salary / inverted-range / sanity-bound rejections /
blocklist rejections) plus 2 generate_jobs_json passes — 44 tests pass.
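
A minimal sketch of how an extractor along these lines could work. The name `extract_compensation_sketch`, both regexes, and the blocklist keywords are assumptions for illustration; only the sanity bounds (30k to 600k, max span 250k) come from the commit message, and skipping inverted ranges is a guess at the real behavior.

```python
import re
from typing import Optional, Tuple

# Hypothetical blocklist: drop whole sentences that mention funding-style figures.
_BLOCKLIST = re.compile(
    r"[^.]*\b(?:raised|funding|valuation|market cap)\b[^.]*\.?", re.I
)
# One dollar amount: "$120,000", "$140,000.00", or "$140k".
_AMOUNT = r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?\s?(k)?"
# Two amounts joined by a hyphen, en dash, em dash, or "to".
_RANGE = re.compile(_AMOUNT + r"\s*(?:-|\u2013|\u2014|to)\s*" + _AMOUNT, re.I)


def _to_dollars(number: str, k_suffix: Optional[str]) -> int:
    value = int(number.replace(",", ""))
    return value * 1000 if k_suffix else value


def extract_compensation_sketch(text: str) -> Optional[Tuple[int, int]]:
    """Return (low, high) in dollars for the first plausible salary range."""
    cleaned = _BLOCKLIST.sub(" ", text)
    for match in _RANGE.finditer(cleaned):
        low = _to_dollars(match.group(1), match.group(2))
        high = _to_dollars(match.group(3), match.group(4))
        if low > high:
            continue  # assumption: inverted ranges are skipped
        if low >= 30_000 and high <= 600_000 and high - low <= 250_000:
            return low, high
    return None


print(extract_compensation_sketch("Base salary: $120,000 - $180,000 per year."))
print(extract_compensation_sketch("Rate: $15 - $20 per hour"))
```

Removing blocklisted sentences before matching keeps the range regex simple: it never has to know what a funding round looks like.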
Inspection of zero-hit boards (Chime, Anthropic, Databricks, Pinterest)
showed two real gaps in the original extractor:

1. Chime and several others publish ranges as "$X and up to $Y" rather
   than "$X - $Y". 8/9 Chime new-grad listings now match (was 0/9).
2. Greenhouse and Lever return HTML in `content`; salary spans like
   `<strong>$120,000</strong> - <strong>$180,000</strong>` previously
   missed because the regex saw the surrounding markup. Now strip tags
   and unescape HTML entities (&#36;, &ndash;, &nbsp;) before matching.

Also accept .00 cents on dollar amounts ($140,000.00) and the "through"
connector. Expanded test cases from 19 to 24.
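
The two fixes above can be sketched as a normalization pass followed by a wider connector list. The helper names and the exact patterns are hypothetical; the commit message confirms only the behavior (strip tags, unescape entities, accept "and up to", "through", and `.00` cents).

```python
import html
import re
from typing import Optional, Tuple

_TAGS = re.compile(r"<[^>]+>")
_CONNECTOR = r"(?:-|\u2013|\u2014|to|through|and up to)"
_MONEY = r"\$(\d[\d,]*)(?:\.\d{2})?"
_RANGE = re.compile(_MONEY + r"\s*" + _CONNECTOR + r"\s*" + _MONEY, re.I)


def normalize_for_salary_match(raw: str) -> str:
    """Turn an HTML fragment into plain text the range regex can read."""
    text = html.unescape(raw)    # &#36; becomes $, &nbsp; becomes U+00A0
    text = _TAGS.sub(" ", text)  # drop markup such as <strong>
    return re.sub(r"\s+", " ", text).strip()


def find_range(raw: str) -> Optional[Tuple[int, int]]:
    match = _RANGE.search(normalize_for_salary_match(raw))
    if match is None:
        return None
    low, high = (int(group.replace(",", "")) for group in match.groups())
    return low, high


markup = "<strong>&#36;120,000</strong>&nbsp;&ndash;&nbsp;<strong>&#36;180,000</strong>"
print(find_range(markup))
print(find_range("$95,000 and up to $120,000.00"))
```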

End-to-end pipeline against 20 Greenhouse boards: 31% new-grad coverage
(67/210 jobs), up from 28%. Per-company hit rates: Brex 100%, Discord
100%, Chime 88%, LaunchDarkly 75%, Affirm 69%, Asana 60%, Twilio 46%.

Greenhouse public API confirmed to have NO structured compensation
field on either the list or per-job endpoint — description-body regex
remains the only signal. Zero-hit companies (Anthropic, Databricks,
Pinterest, Robinhood, Airbnb, Figma, Roblox, Samsara) don't disclose
numeric ranges in their description body and would require per-job
HTML scraping of a separate compensation tab. Honest `—` for those.

49/49 tests pass.

Comp extraction was silently zero across all production scrapes because
config.yml lists Greenhouse boards as e.g.

  https://boards-api.greenhouse.io/v1/boards/affirm/jobs

The Greenhouse Job Board API omits the description body unless the
caller passes ?content=true. Without descriptions, extract_compensation
always returned None — the symptoms looked like the regex was broken,
but the regex never saw any text at all (verified by tracing one Affirm
job through fetch → dedup → filter → enrich → generate_jobs_json).

fetch_greenhouse_jobs now appends the flag if it's missing. End-to-end
on a fresh local scrape: 33% of new-grad jobs (255/770) now ship with
extracted comp, up from 0% with the same regex. Real values verified
in the browser:

  Vercel        SWE, Agent              $232–348k
  PlanetScale   SWE, InfoSec            $140–320k
  Vercel        SWE, Next.js            $208–312k
  Chime         Assoc Gen Counsel       $217–300k

Two follow-ups deferred:
  - Lever's salaryRange field is documented but rarely populated in
    practice; description-body regex would help but only 25 Lever jobs
    survive new-grad filter — small absolute gain.
  - Workday's list endpoint omits descriptions entirely; would need a
    per-job detail fetch (~3× request count). 0/90 today.

Regression test added: TestGreenhouseFetcherContentFlag asserts the
auto-append, preserves existing query params, doesn't double-append,
and end-to-end verifies comp populates from returned content.

53 tests pass (28 in this file, 25 in test_generate_jobs_json).
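
The auto-append behavior the regression test asserts can be sketched with the standard library. `ensure_content_flag` is a hypothetical name; the actual logic lives inside fetch_greenhouse_jobs and may be written differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ensure_content_flag(url: str) -> str:
    """Append content=true unless the URL already carries a content parameter."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "content" for key, _ in query):
        query.append(("content", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://boards-api.greenhouse.io/v1/boards/affirm/jobs"
print(ensure_content_flag(base))
print(ensure_content_flag(base + "?page=2"))
print(ensure_content_flag(base + "?content=true"))
```

Parsing the query string rather than concatenating `?content=true` is what lets the helper preserve existing parameters and stay idempotent.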
… boards

Comprehensive cleanup of `config.yml` based on head-probing all 139
Greenhouse boards, all 5 Lever boards, and the 3 stale Workday tenants
called out as 404 in the last scrape log.

Removed 39 Greenhouse boards (HTTP 404 on `/v1/boards/<slug>/jobs`):
  Alchemy, Chainalysis, Chronosphere, Circle, Cityblock Health,
  Cloudinary, Coinbase, Color Health, Column, Cruise, Deel, Devoted
  Health, Getaround, Goat, Grammarly, Hims & Hers, Jam City, Kraken,
  Lever, LightStep, Lime, Modern Treasury, Niantic, Paxos, Pipe,
  Pulumi, Railway, Retool, Ro, Runway, Shield AI, Skydio, Supercell,
  Tecton, Tempus, Veeva, Via, Zoox, Zynga.

Removed 3 Lever boards (return 200 OK with empty job list):
  Netflix, Plaid, Atlassian — all migrated to custom careers sites.

Removed 3 Workday tenants (stale Site_IDs, HTTP 404):
  Home Depot, Nike, Visa.

Added 3 Greenhouse boards (verified live with non-trivial job counts):
  Anduril Industries (1944 jobs), Block (161), xAI (220).

Verification scrape after this change:
  957 new-grad jobs from 90 unique companies (was 770 from 87 companies)
  Greenhouse 842 (was 655) · Workday 90 · Lever 25
  Zero HTTP 404 errors in scrape log (was 42)
  Net: +187 new-grad jobs, +24% coverage

New companies in scrape output:
  Anduril Industries: 160 new-grad jobs
  xAI: 14 new-grad jobs
  Block: 13 new-grad jobs

See `docs/removed-companies.md` for the full per-company audit log and
the list of high-value employers that cannot be added because they use
custom careers sites (Apple, Meta, Google, OpenAI, Anthropic, quant
firms, etc.).

38 Workday tenants (Microsoft, JPMorgan, Goldman Sachs, Morgan Stanley,
Oracle, Salesforce, ServiceNow, Cisco, AMD, Qualcomm, Capital One,
Fidelity, Intuit, Lockheed Martin, etc.) consistently return HTTP 422
with an empty error message. The existing retry path re-acquires the
CSRF token from `https://{host}/` and retries — fails identically.

Probed Microsoft, Oracle, and Goldman Sachs across five candidate
endpoints (bare root GET, site-path GET, client_check POST, jobs GET,
jobs POST). None of them issue the `CALYPSO_CSRF_TOKEN` cookie. The
only cookie consistently set is `PLAY_SESSION`, which alone is not
sufficient to authenticate a subsequent `/jobs` POST.

The picture is also tenant-cluster-dependent — Microsoft's wd10 cluster
uses a different first-party cookie (`vps-cke`) than the wd1/wd5 clusters
used by other tenants, suggesting Workday has rolled out infrastructure
changes that the current HTTP-only scraper can't accommodate.

A real fix probably requires headless-browser bootstrapping per tenant
on first request (playwright is already a dependency for tests). That's
multi-day work and an architectural change to fetch_workday_jobs.

This commit ships only the investigation document so the next person
who picks it up has the empirical findings to start from.

Affected: 38 Workday tenants currently silently return 0 jobs after a
single retry. The 2 HTTP 401 failures (TikTok, AbbVie) are separate
credential issues, also out of scope.

Wider candidate probe across AI infra, devtools, fintech, healthtech,
and dev-experience companies. 10 new live Greenhouse boards added:

  Glean (180 listings), BridgeBio (93), Together AI (56), Cribl (54),
  Tailscale (50), Sweetgreen (43), Recursion (34), Maven Clinic (28),
  Squarespace (27), Pulley (4).

Verification scrape:
  1007 new-grad jobs from 95 unique companies (was 957/90)
  Greenhouse 891 (was 842), Workday 91 (was 90), Lever 25 unchanged
  Net +50 new-grad jobs, +5 companies, +29 comp values

Per-board new-grad yield: Glean 31, Tailscale 12, Together AI 4, Cribl
and Squarespace 1 each. BridgeBio/Sweetgreen/Recursion/Maven/Pulley
returned 0 new-grad listings this cycle but have non-empty boards that
may surface candidates over time.

Probed but already in config: Reddit, Gusto, Carta, PagerDuty, Webflow,
Lattice, Materialize.
… Mistral, Perplexity

Ashby is a modern ATS used by many AI labs and devtools companies that
the existing Greenhouse/Lever/Workday set cannot reach. Adds a new
fetch_ashby_jobs that hits the public Posting API:

  https://api.ashbyhq.com/posting-api/job-board/<slug>?includeCompensation=true

Unlike Greenhouse/Lever, Ashby returns structured compensation directly
in the response. The fetcher prefers Ashby's compensationTiers when
present (currency=USD, interval=per-year, sanity-bounded), falling back
to the description-body regex extractor for the rest.

25 verified-live boards configured: OpenAI, Crusoe, Mistral AI, Notion,
Cohere, Sierra, LangChain, Cursor, Lovable, Perplexity, Baseten, Ashby,
Supabase, Sentry, Modal, Attio, Campfire, Vapi, Linear, Qualified,
Browserbase, Anyscale, Pinecone, Weaviate, Turbopuffer.

Verification scrape:
  1103 new-grad jobs from 107 unique companies (was 1007 / 95)
  Greenhouse 891, Workday 91, Ashby 96, Lever 25
  311 with extracted comp (was 293)

Top of the comp-sorted table is now AI-lab-dominant:
  Notion         $220–350k  Software Engineer, Mobile AI, iOS
  Vercel         $232–348k  Software Engineer, Agent
  xAI            $200–340k  Infrastructure Security Engineer
  OpenAI         $230–325k  Data Scientist, Safety
  OpenAI         $293–325k  Data Scientist, Core Experimentation
  Notion         $272–320k  Software Engineer, AI Capture

ALSO fixes a latent HTTP_SESSION bug: Accept-Encoding advertised 'br'
(Brotli), which Python's requests decodes only when an optional Brotli
package is installed. Ashby's Cloudflare layer preferred br when offered,
yielding un-parseable bodies and 0/25 success on first run. Dropping br
from Accept-Encoding (leaving gzip + deflate) makes responses
decode-on-arrival like every other ATS already in the pipeline.

5 new unit tests for fetch_ashby_jobs covering: ?includeCompensation
auto-append, structured-comp extraction, regex fallback, sanity-bound
rejection of internship stipends, and non-USD currency skipping.

82 tests pass.
…er, ElevenLabs, ...

Second-pass Ashby probe surfaced 18 more verified-live boards:

  Snowflake (423 raw), ElevenLabs (139), Decagon (108), Plaid (88),
  Commure (85), Suno (43), Docker (42), Astronomer (27), Poolside (16),
  Cradle Bio (11), Airbyte (9), Railway (9), Warp (8), Statsig (7),
  Reka (6), Stytch (5), Prefect (4), Runway (4).

Two were previously in docs/removed-companies.md:
- Plaid was removed from Lever (zero jobs); migrated to Ashby.
- Runway was removed from Greenhouse as `runwayml` (404); on Ashby
  the slug is `runway`.

Verification scrape:
  1134 new-grad jobs from 115 unique companies (was 1103 / 107)
  Greenhouse 891, Ashby 127 (was 96), Workday 91, Lever 25
  314 with extracted comp (was 311)

Per-board new-grad yield from this pass:
  Snowflake 12, Plaid 7, Suno 3, Commure 3, Astronomer 2, Docker 2,
  Airbyte 1, Decagon 1. The other 10 boards had non-empty totals but
  zero passed the new-grad filter this cycle.

removed-companies.md updated with both Ashby passes and the Plaid /
Runway re-discovery note.

The left-rail "REMOTE" filter (remote / hybrid / onsite chips) was
visually present but functionally dead: docs/terminal/data.jsx hardcoded
`rmt: 'onsite'` for every job, so clicking "remote" or "hybrid" always
filtered to zero results.

Add a deriveRmt(j) helper that matches the location and title text:
  - 'hybrid' wins when "\bhybrid\b" appears (covers "Hybrid - SF",
    "San Francisco (Hybrid)", "Hybrid remote", etc.)
  - 'remote' wins next when "\bremote\b" appears (without hybrid)
  - otherwise 'onsite'

After this, classification across the current 1134-job scrape:
  928 onsite · 179 remote · 27 hybrid

Spot-checks:
  Twilio "Remote - US"                    → remote ✓
  Vercel "Hybrid - San Francisco"         → hybrid ✓
  Tanium "Addison TX (Hybrid)"            → hybrid ✓
  Waymo "Mountain View, California"       → onsite ✓
  Anduril "Costa Mesa, California"        → onsite ✓
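
The precedence rule can be rendered in a few lines. The real deriveRmt helper is JavaScript in docs/terminal/data.jsx; this is a Python sketch of the same logic, and the function signature is an assumption.

```python
import re

_HYBRID = re.compile(r"\bhybrid\b", re.I)
_REMOTE = re.compile(r"\bremote\b", re.I)


def derive_rmt(location: str, title: str = "") -> str:
    """Classify a job as hybrid, remote, or onsite from its location and title."""
    text = f"{location} {title}"
    if _HYBRID.search(text):  # hybrid is checked first, so "Hybrid remote" is hybrid
        return "hybrid"
    if _REMOTE.search(text):
        return "remote"
    return "onsite"


for location in ("Remote - US", "Hybrid - San Francisco", "Mountain View, California"):
    print(location, "->", derive_rmt(location))
```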

Clicking the remote chip in the browser now narrows the table to
"179 / 1134 results" — wired end-to-end.

The eventual right move is to add a structured `is_remote` /
`workplace_type` field to jobs.json at scrape time (Ashby exposes both
as first-class fields; Greenhouse/Lever need the same text-detection).
That's a follow-up; this commit unblocks the UX today.

The 8 role chips (swe / ml / data / frontend / backend / infra /
security / mobile) were rendered but four of them — frontend, backend,
security, mobile — never matched any job because the scraper only
classifies into 6 broad categories and the adapter mapped them as:

  software_engineering → SWE     (where backend/frontend/mobile hide)
  data_ml              → ML
  data_engineering     → DATA
  infrastructure_sre   → INFRA   (where security hides)
  hardware             → INFRA   (no HW chip existed at all)
  other                → SWE

Two changes:

1. New deriveType(j) in data.jsx pattern-matches the job title before
   falling back to the category id. Order matters (HW > SEC > ML > DATA
   > MOBILE > FE > BE > INFRA) so "embedded systems engineer" wins HW,
   "cloud security engineer" wins SEC, "ML platform engineer" wins ML.

2. dashboard.jsx grows a 9th chip — `hardware` — and the TYPE_LABEL
   gains HW='hardware'. The hardware category in jobs.json (Anduril,
   SpaceX, Northrop, etc.) is no longer silently merged into infra.

Distribution across the live 1134-job scrape:
  SWE 632 · INFRA 205 · DATA 77 · SEC 77 · ML 54 · BE 43 · HW 27 · MOBILE 11 · FE 8

Browser-verified click behavior: each chip narrows the table to the
correct count (e.g. hardware → 27 / 1134 results; frontend → 8 / 1134).
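
A Python sketch of the ordered title match, to show why rule order matters. The real deriveType is JavaScript in data.jsx; the regex patterns below are illustrative guesses, and mapping the `hardware` category to HW in the fallback is an assumption based on the new chip.

```python
import re

# Order mirrors the commit message: HW > SEC > ML > DATA > MOBILE > FE > BE > INFRA.
_TITLE_RULES = [
    ("HW", r"\b(?:hardware|embedded|firmware)\b"),
    ("SEC", r"\bsecurity\b"),
    ("ML", r"\b(?:ml|machine learning)\b"),
    ("DATA", r"\bdata\b"),
    ("MOBILE", r"\b(?:mobile|ios|android)\b"),
    ("FE", r"\b(?:frontend|front-end)\b"),
    ("BE", r"\b(?:backend|back-end)\b"),
    ("INFRA", r"\b(?:infrastructure|platform|cloud|sre)\b"),
]
_CATEGORY_FALLBACK = {
    "software_engineering": "SWE",
    "data_ml": "ML",
    "data_engineering": "DATA",
    "infrastructure_sre": "INFRA",
    "hardware": "HW",
    "other": "SWE",
}


def derive_type(title: str, category: str) -> str:
    """First matching title rule wins; otherwise fall back to the category id."""
    for label, pattern in _TITLE_RULES:
        if re.search(pattern, title, re.I):
            return label
    return _CATEGORY_FALLBACK.get(category, "SWE")


print(derive_type("Cloud Security Engineer", "infrastructure_sre"))
print(derive_type("ML Platform Engineer", "data_ml"))
```

Because the list is scanned top to bottom, "cloud security engineer" is claimed by SEC before the INFRA rule ever sees the word "cloud".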
…UT THE ROLE

The right-panel "ABOUT THE ROLE" widget was rendering a templated
fallback ("Anduril Industries is hiring for Software Engineer, Lattice
C2 UI in … Posted via Greenhouse.") for every job because the scraper
captured each posting's description body but dropped it before
serializing docs/jobs.json. The frontend adapter had no real text to
read so it built the placeholder.

Plumb the real description through:

1. New clean_description(text, max_chars=1200) helper near the
   compensation extractor. Strips HTML and decodes entities, collapses
   whitespace, clips to 1200 chars on a word boundary with a trailing
   ellipsis. Keeps jobs.json under ~3 MB for a 1000-job scrape.

2. Replace all 6 `description[:500] if description else ''` truncation
   sites (Greenhouse, Lever, JobSpy, Ashby, two GraphQL paths) with
   clean_description(description). Workday remains description='' —
   its list endpoint doesn't expose the field.

3. generate_jobs_json() emits 'description': job.get('description', '').

4. docs/terminal/data.jsx mapJob prefers j.description (when ≥ 60 chars)
   over the synthetic "hiring for X in Y" fallback.
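
A minimal sketch of the cleaning step described in item 1, assuming a regex-based tag strip. The suffix `_sketch` marks this as an illustration; the real clean_description may differ in details such as how it picks the word boundary.

```python
import html
import re


def clean_description_sketch(text: str, max_chars: int = 1200) -> str:
    """Strip HTML, decode entities, collapse whitespace, clip on a word boundary."""
    if not text:
        return ""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Back up to the last space unless the cut already falls between words.
    if text[max_chars] != " " and " " in clipped:
        clipped = clipped[: clipped.rfind(" ")]
    return clipped.rstrip() + "…"


print(clean_description_sketch("<p>Hello&nbsp;<b>world</b></p>"))
long_text = clean_description_sketch("word " * 500)
print(len(long_text), long_text[-6:])
```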

ALSO fixes a latent _strip_html ordering bug. Greenhouse serves its
content field as DOUBLE-encoded HTML (&lt;div&gt; rather than <div>).
The previous implementation stripped tags first (no-op because no
literal '<' present) then unescaped — leaving literal HTML tags in the
final output. Now it unescapes first then strips tags.
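
The ordering bug is easy to reproduce in isolation. Both function names below are hypothetical and the tag regex is an assumption; the point is only the order of the two steps.

```python
import html
import re

_TAGS = re.compile(r"<[^>]+>")


def strip_then_unescape(raw: str) -> str:
    # Old order: no literal "<" exists yet, so the strip is a no-op.
    return html.unescape(_TAGS.sub("", raw))


def unescape_then_strip(raw: str) -> str:
    # New order: entities become real tags first, then the tags are removed.
    return _TAGS.sub("", html.unescape(raw))


encoded = "&lt;div&gt;Build things&lt;/div&gt;"
print(strip_then_unescape(encoded))
print(unescape_then_strip(encoded))
```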

Verification scrape:
  1129 new-grad jobs, 1041 with real description (92% coverage)
  Greenhouse 892/892, Ashby 124/124, Lever 25/25, Workday 0/88
  jobs.json size: 2.19 MB (up from 0.98 MB)
  avg description length: 1192 chars, zero HTML residue

Browser-verified: clicking Anduril Lattice C2 UI in the table now shows
the actual posting copy in ABOUT THE ROLE.

Also bumped ?v=4 cache-buster on the .jsx <script> tags in index.html
so the description-shape change forces browsers to re-fetch the bundle.
Bump v on future data-shape changes.

6 new TestCleanDescription cases. 113 total tests pass.
…on't blow out the panel

Long postings (e.g. the Riot Games "Technical Producer II – Machine
Learning" job at ~1200 chars) were stretching the detail panel by 400+
px, pushing STACK / REQUIREMENTS / SIMILAR ROLES / APPLY out of the
fold.  Constrain the description block:

  maxHeight: 160px        (≈ 5–6 lines at the new font size)
  overflowY: auto         (scoped scrollbar, not the whole panel)
  fontSize: 11.5          (tighter density to fit more text)
  paddingRight: 8         (room for the scrollbar gutter)

Browser-verified on the Riot Technical Producer II posting:
  rendered height 160px · content scrollHeight 392px → internal scroll
  panel sections below remain visible at the page fold.

Bumped ?v=5 cache-buster on the .jsx script tags so the layout change
takes effect on next page load.
… terminal"

User-facing changes:
- Top-bar badge changes from NG to NGJ (the only brand mark in the chrome).
- Drop the "newgrad.sh" wordmark and "// new-grad job terminal" subtitle
  next to the badge.
- Page <title> + og:title now read "NGJ · New Grad Jobs".
- Boot/loading splash drops the newgrad.sh prefix.

Internal cleanup:
- Comment in docs/terminal/data.jsx no longer references newgrad.sh.

Verified in browser: no "newgrad.sh", no "new-grad job terminal", no
solo "NG" badge remains anywhere on the page. ?v=6 cache-bust ensures
the bundle re-fetches.

README:
- Rewrite "Data Sources" — Greenhouse 113, Ashby 43 (new), Workday 57,
  Lever 2 (down from 5), JobSpy. Google Careers removed (deprecated).
- Rewrite "Key Features" to describe the NGJ terminal frontend
  (JetBrains Mono, dense tabular, 9-category ROLE filter, REMOTE filter,
  real compensation + about-the-role widgets).
- Add a "docs/jobs.json schema" table — first time the public output
  shape is documented. Covers the new `comp` and `description` fields.
- Refresh "Companies Monitored" — purge the 39 dead Greenhouse names
  pruned in this branch, add the 13 new Greenhouse names, enumerate
  all 43 Ashby boards, list highlight Workday tenants with a link to
  config.yml for the full set.

JOB_SCRAPING_APIS:
- Promote the "Existing APIs (Maintained)" section into structured
  entries with endpoint + status + known issues (Greenhouse content=true,
  Workday 422 link to investigation doc).
- New Ashby section (43 boards, structured compensation, Brotli
  Accept-Encoding gotcha).

No script regenerates these prose sections (verified by grep across
scripts/ and .github/workflows/), so the edits are stable across
scheduled scraper runs.
…p-aware graphql test

Four pre-merge CI failures, all addressed:

1. test_config.py: hardcoded threshold of 200 configured companies. The
   counter only summed greenhouse + lever + workday; the new 43-board
   Ashby section wasn't counted. After adding Ashby: 113 + 2 + 57 + 43 =
   215 companies, above the 200 floor.

2. tests/test_display_count_copy.py + tests/test_stats_predictions_contract.py:
   referenced docs/stats.html and docs/stats.js, both deleted in this PR
   (the new NGJ two-tab nav doesn't include a stats view). Removed the
   whole test_stats_predictions_contract.py file (every test in it
   exercises the now-deleted stats page) and made test_display_count_copy
   skip non-existent surfaces gracefully.

3. tests/test_graphql_fetch.py asserted len(description) == 500. The
   clip ceiling moved from 500 to 1200 in 6c8b585 (description wiring).
   Updated the test to use a 2000-char input and assert <= 1201 (clip)
   plus the "…" suffix.

4. Pre-commit auto-fixes: isort reorders `import json` after `import html`
   alphabetically; end-of-file-fixer adds a trailing newline to
   docs/predictions.json. Both applied.

Local: 692 tests pass.

Pre-commit's end-of-file-fixer hook keeps flagging this file because
the scraper's auto-commit on main writes it without a final newline.
One-character fix to make the hook green.

The top-bar "● LIVE 16 MAY 2026 · 14:32 PT" was a static string that
never moved (docs/terminal/app.jsx). Replace it with <LiveStamp />:

* Reads window.NGJOBS_META.generated_at (set by data.jsx from
  docs/jobs.json's meta block, refreshed by every scrape).
* Renders in the viewer's local timezone via Intl.DateTimeFormat with
  timeZoneName: 'short' — the static "PT" suffix becomes whatever
  abbreviation the visitor's environment reports (PDT / IST / GMT / …).
* Polls every 30s plus once on NGJOBS_READY so the chip transitions
  live → stale (amber dot, STALE label) when generated_at is more than
  15 minutes old — three missed 5-min scrape cycles is a real failure
  signal. Renders "● OFFLINE —" when generated_at is missing entirely
  rather than fabricating a date.
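
The state rule in the bullets above reduces to a small pure function. The shipped formatter is JavaScript (live-stamp-format.js); this is a Python sketch of the same decision, and treating a naive timestamp as UTC is an assumption made here, not something the commit states.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(minutes=15)  # three missed 5-minute scrape cycles


def live_state(generated_at: Optional[str], now: datetime) -> str:
    """Return LIVE, STALE, or OFFLINE for a generated_at ISO timestamp."""
    if not generated_at:
        return "OFFLINE"
    try:
        stamp = datetime.fromisoformat(generated_at)
    except ValueError:
        return "OFFLINE"
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return "STALE" if now - stamp > STALE_AFTER else "LIVE"


now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
print(live_state("2026-05-18T11:50:00+00:00", now))
print(live_state("2026-05-17T17:00:00+00:00", now))
print(live_state(None, now))
```

Passing `now` in as an argument, rather than reading the clock inside the function, is what makes the boundary cases testable.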

The pure formatter lives in docs/terminal/live-stamp-format.js as a
UMD-style IIFE so it works as a <script> tag in the browser and can be
loaded into a Node VM for unit testing. tests/terminal/test_live_stamp.cjs
covers fresh / stale / boundary / offline / Intl-throws cases.

index.html bumps the bundle cache-bust to v=7 and loads live-stamp-format.js
before the Babel scripts so window.formatLiveStamp is defined when
<LiveStamp /> renders.

Browser smoke-test (Playwright): chip rendered "● STALE  May 17, 2026,
22:47 CDT" against the current (19h-old) docs/jobs.json, and "● LIVE
May 18, 2026, 12:17 CDT" when fed a fresh generated_at.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

- When Greenhouse list endpoint returns empty `content` for a job, fetch
individual job details from `/{job_id}` endpoint
- Only enriches jobs that need it (performance-conscious)
- Increases `clean_description` max_chars from 1200 to 3000
- Adds `description_html` (raw HTML) and `full_description` (complete
cleaned text) to jobs.json
- Gracefully handles individual fetch failures (job returned with empty
description)

## Changes

- `scripts/update_jobs.py`: Added `fetch_greenhouse_job_detail()`,
modified `fetch_greenhouse_jobs()` to enrich empty-content jobs, updated
`generate_jobs_json()` with new fields
- `tests/test_enrichment.py`: Added
`TestGreenhouseDescriptionEnrichment` with 5 tests
- `tests/test_graphql_fetch.py`: Updated for new 3000-char limit

## Testing

```bash
pytest tests/test_enrichment.py::TestGreenhouseDescriptionEnrichment -v  # 5 passed
pytest tests/ -v  # 697 passed
flake8 scripts/ --select=E9,F63,F7,F82  # clean
```

Fixes #275
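
The enrichment rule (fetch details only for jobs with empty content, and tolerate per-job failures) can be sketched with the network call injected as a parameter. `enrich_empty_content` and the `fetch_detail` callable are hypothetical stand-ins; the PR's own functions are fetch_greenhouse_jobs and fetch_greenhouse_job_detail, whose signatures are not shown here.

```python
from typing import Callable, Dict, List, Optional


def enrich_empty_content(
    jobs: List[Dict],
    fetch_detail: Callable[[int], Optional[Dict]],
) -> List[Dict]:
    """Fill in content for jobs that lack it; leave everything else untouched."""
    for job in jobs:
        if job.get("content"):
            continue  # already has a description, so no extra request
        try:
            detail = fetch_detail(job["id"])
        except Exception:
            detail = None  # one failed fetch must not sink the whole board
        job["content"] = (detail or {}).get("content", "")
    return jobs


# Fake detail endpoint: job 2 resolves, job 3 raises KeyError.
fake_details = {2: {"content": "<p>Full text</p>"}}
jobs = enrich_empty_content(
    [{"id": 1, "content": "kept"}, {"id": 2, "content": ""}, {"id": 3}],
    fake_details.__getitem__,
)
print([job["content"] for job in jobs])
```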
The enrichment code adds full_description (complete text up to 50k chars)
to jobs.json. This commit updates the frontend to prefer full_description
over the truncated description field, so users see complete job details
including full requirements and qualifications.

Previously only Greenhouse jobs had description_html populated.
This commit adds description_html to all ATS fetchers so the
frontend can show full descriptions for all job sources.

- Ashby: description_html from descriptionHtml field
- Lever: description_html from description field
- JobSpy: description_html from description field

This ensures the frontend's full_description preference works
for all job sources, not just Greenhouse.
…eholder

The REQUIREMENTS section was showing hardcoded placeholder text
('BS / MS in CS or equivalent', 'Proficiency in —') instead of
the actual requirements extracted from job descriptions.

Now uses extractRequirements() to parse requirements from the
full_description field and displays them in the REQUIREMENTS section.
Falls back to a message when requirements cannot be extracted.

Fixes the 'Proficiency in —' placeholder issue.

ambicuity added 7 commits May 23, 2026 20:25
The previous extractRequirements function only worked for ~17% of jobs
because it stripped HTML and lost list structure. Now:

1. Parses HTML directly to find <li> elements
2. Uses broader header patterns (WHAT WE LOOKING FOR, ABOUT YOU, etc.)
3. Falls back to <li> items when no requirements section found
4. Better filtering of garbage items
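
Step 1 and the fallback in step 3 come down to collecting `<li>` text from the raw HTML. The real extractRequirements is JavaScript in dashboard.jsx; the sketch below shows the same idea in Python with the standard-library parser. The class and function names are hypothetical, and the length bounds are borrowed from the filter in dashboard.jsx.

```python
from html.parser import HTMLParser
from typing import List


class _ListItemCollector(HTMLParser):
    """Collects the text of every top-level <li> element."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[str] = []
        self._depth = 0
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "li" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                self.items.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_list_items(description_html: str) -> List[str]:
    parser = _ListItemCollector()
    parser.feed(description_html)
    parser.close()
    # Keep items longer than 10 and shorter than 300 characters.
    return [item for item in parser.items if 10 < len(item) < 300]


sample = "<h3>About you</h3><ul><li>3+ years of <b>Python</b> experience</li><li>OK</li></ul>"
print(extract_list_items(sample))
```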
Fixes #289

Changed datetime.now().isoformat() to datetime.now(timezone.utc).isoformat()
for all timestamp generation in update_jobs.py. This ensures timestamps include
the UTC timezone marker (+00:00), allowing the frontend to correctly convert
and display times in the viewer's local timezone.

Previously, naive datetime strings were parsed by JavaScript's Date.parse()
as local time, causing the LIVE indicator to show incorrect times.

Files changed:
- scripts/update_jobs.py: 5 timestamp calls updated
- tests/: 3 test files updated to expect timezone-aware format
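
The difference between the two string forms is visible directly in Python. The timestamp below is an arbitrary example value, not one taken from the scraper.

```python
from datetime import datetime, timezone

moment = datetime(2026, 5, 24, 2, 41, tzinfo=timezone.utc)

aware_text = moment.isoformat()                       # carries the +00:00 marker
naive_text = moment.replace(tzinfo=None).isoformat()  # same digits, no offset

print(aware_text)
print(naive_text)
# Parsing the naive string back yields no zone information at all,
# so a consumer has to guess which zone was meant.
print(datetime.fromisoformat(naive_text).tzinfo)
```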
Complete the timezone fix by updating remaining naive datetime.now() calls:
- scripts/update_jobs.py: 3 date-only comparison calls (lines 2500, 2566, 2698)
- tests/test_enrichment.py: 6 test fixture calls
- tests/test_save_market_history.py: 12 test fixture calls

This ensures consistency across the entire codebase and prevents
potential day-boundary edge cases when running on non-UTC machines.
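
A worked example of the day-boundary case, assuming a machine five hours behind UTC (the offset is hypothetical): late in the local evening, the local calendar date and the UTC calendar date disagree.

```python
from datetime import datetime, timedelta, timezone

# 23:30 on a machine five hours behind UTC.
local_zone = timezone(timedelta(hours=-5))
instant = datetime(2026, 5, 23, 23, 30, tzinfo=local_zone)

local_date = instant.date()                         # calendar date on the machine
utc_date = instant.astimezone(timezone.utc).date()  # calendar date in UTC

print(local_date, utc_date)
```

A date-only comparison that mixes the two would be off by one day for the last five hours of every local day.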
…utc)

Eliminates all deprecated datetime.utcnow() calls:
- scripts/update_jobs.py: RSS feed timestamp (line 2950)
- tests/conftest.py: 3 test fixture calls
- tests/test_filter.py: 11 test fixture calls
- tests/test_rss.py: 3 test fixture calls

This completes the timezone-aware migration and eliminates all 104
deprecation warnings from the test suite.
Copilot AI review requested due to automatic review settings May 24, 2026 02:41
@coderabbitai

coderabbitai Bot commented May 24, 2026

Warning

Review limit reached

@ambicuity, we couldn't start this review because you've used your available PR reviews for now.

Your plan currently allows 1 review/hour. Refill in 35 minutes and 13 seconds.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7326d563-6a6a-4e5c-8001-019a15f993bd

📥 Commits

Reviewing files that changed from the base of the PR and between 30c4359 and 3786ea9.

📒 Files selected for processing (9)
  • docs/terminal/dashboard.jsx
  • scripts/update_jobs.py
  • tests/conftest.py
  • tests/test_enrichment.py
  • tests/test_filter.py
  • tests/test_generate_jobs_json.py
  • tests/test_predict_hiring_trends.py
  • tests/test_rss.py
  • tests/test_save_market_history.py

@codecov

codecov Bot commented May 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@ambicuity ambicuity merged commit 70732c2 into main May 24, 2026
7 of 8 checks passed
@ambicuity ambicuity deleted the fix/timezone-aware-timestamps branch May 24, 2026 02:43

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refines the requirement extraction logic in dashboard.jsx by implementing header-based and HTML list-item parsing. It also standardizes the Python codebase and test suite to use timezone-aware datetimes (timezone.utc) instead of naive now() or utcnow() calls. A review comment identifies a redundant stopword check in the requirement filtering logic, as the existing length constraint already excludes the specified short words.

```js
const endMatch = afterHeader.match(sectionEnd);
const section = endMatch ? afterHeader.substring(0, endMatch.index) : afterHeader.substring(0, 1000);
const items = section.split(/(?:^|\s)[•\-\*]\s*|(?:^|\s)\d+\.\s*|(?:^|\s)[a-z]\)\s*/);
const validItems = items.map(s => s.trim()).filter(s => s.length > 10 && s.length < 300 && !s.match(/^(?:and|or|the|a|an|is|are|was|were|be|been|being|have|has|had|do|does|did|will|would|could|should|may|might|can|shall)$/i));
```
Contributor


medium

The stopword regex check is redundant here because the filter already requires s.length > 10. None of the words in the stopword list (e.g., 'and', 'should', 'might') exceed 6 characters in length, so they are already excluded by the length constraint.

Suggested change

```diff
- const validItems = items.map(s => s.trim()).filter(s => s.length > 10 && s.length < 300 && !s.match(/^(?:and|or|the|a|an|is|are|was|were|be|been|being|have|has|had|do|does|did|will|would|could|should|may|might|can|shall)$/i));
+ const validItems = items.map(s => s.trim()).filter(s => s.length > 10 && s.length < 300);
```
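
The redundancy claim can be checked mechanically. This short Python check, written for illustration and not part of the review, measures the longest word in the stopword list taken from the regex.

```python
STOPWORDS = (
    "and|or|the|a|an|is|are|was|were|be|been|being|have|has|had|"
    "do|does|did|will|would|could|should|may|might|can|shall"
).split("|")

longest = max(STOPWORDS, key=len)
print(longest, len(longest))

# The regex is anchored (^...$), so it can only match a string that is
# exactly one stopword. Every such string fails the length > 10 test first.
print(all(len(word) <= 10 for word in STOPWORDS))
```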


2 participants