feat(scraper): add support for weeks, months, and 'just posted' date formats#295
Conversation
|
Warning Review limit reached
Your plan currently allows 1 review/hour. Refill in 42 minutes and 22 seconds. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more review capacity refills, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than trial, open-source, and free plans. In all cases, review capacity refills continuously over time. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR extends the ChangesRelative Date Format Expansion
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request expands the normalize_date_string function in scripts/update_jobs.py to support a wider range of relative date formats, including 'Just posted', 'Weeks Ago', and 'Months Ago'. A review comment identifies that the logic added for 'Active X Days Ago' is redundant because the existing 'days ago' regex already captures these cases.
| # Handle "Active X Days Ago" (some job boards use "Active" instead of "Posted") | ||
| active_match = re.search(r'active\s+(\d+)\s*days?\s+ago', posted_at_lower) | ||
| if active_match: | ||
| days = int(active_match.group(1)) | ||
| return (now - timedelta(days=days)).strftime('%Y-%m-%d') |
There was a problem hiding this comment.
This block is redundant and will never be executed. The existing days_match logic at line 2240 already uses the regex r'(\d+)\s*days?\s+ago', which captures the "X days ago" pattern regardless of whether it is preceded by "Posted" or "Active" (since re.search matches anywhere in the string). Because the implementation here is identical to the earlier block, this addition is unnecessary.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/update_jobs.py`:
- Around line 2265-2269: The current logic uses timedelta(days=months * 30)
which can drift by several days; replace this with calendar-accurate month
subtraction using dateutil.relativedelta: import relativedelta via "from
dateutil.relativedelta import relativedelta" and change the return to "(now -
relativedelta(months=months)).strftime('%Y-%m-%d')". Update the code path that
handles months (variables months_match, posted_at_lower, now) and ensure
dateutil is added to requirements if not present; alternatively, if you cannot
add the dependency, document the 30-day approximation in the function docstring
mentioning the possible +/- days drift.
- Around line 2271-2275: Remove the unreachable "Active X Days Ago" regex block:
the earlier pattern r'(\d+)\s*days?\s+ago' already matches "active 5 days ago"
and returns, so delete the lines that define active_match, its regex
r'active\s+(\d+)\s*days?\s+ago', and the subsequent return; locate this dead
code in the same function that computes posted_at using posted_at_lower and now
(look for the existing r'(\d+)\s*days?\s+ago' match and the variable names
active_match, posted_at_lower, days) and remove only that redundant block to
keep behavior identical.
- Around line 2259-2263: Extract the hardcoded week-to-day conversion (the
literal 7) into a module-level constant (e.g., DAYS_PER_WEEK) alongside the
other constants in scripts/update_jobs.py, then replace the literal in the weeks
handling block (the code that uses weeks_match and computes timedelta(days=weeks
* 7)) with timedelta(days=weeks * DAYS_PER_WEEK) so the conversion is named and
reusable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9080e92e-a03a-4278-b23c-ac5f273b8de8
📒 Files selected for processing (1)
scripts/update_jobs.py
ambicuity
left a comment
There was a problem hiding this comment.
Approving after thorough local verification.
Scope verified: Extends normalize_date_string() to handle "Just posted" / "Recently", "X Weeks Ago", "X Months Ago" formats — fixes a real bug where these formats fell through to the raw-string return and got filtered downstream.
AI reviewer feedback (CodeRabbit + Gemini): all addressed in commit 6434af1.
- Extracted
DAYS_PER_WEEK/DAYS_PER_MONTHmodule-level constants — done. - Replaced
timedelta(days=months * 30)withrelativedelta(months=months)for calendar-accurate month arithmetic — done. - Removed the unreachable "Active X Days Ago" block; the earlier
(\d+)\s*days?\s+agoregex already handles it (verified with real input — see test results below).
Local verification on PR head (6434af1):
- All 22 hand-written real-data assertions pass (regression + new behaviors + edge cases): "Posted 1 Month Ago" → exactly 1 calendar month back via
relativedelta, "Active 5 Days Ago" → correctly resolved via the existing days regex, "Recently"/"Just now"/"Just posted" → today, "3 wks ago" / "6 mos ago" abbreviations work. - Full repo test suite: 718 passed locally (Python 3.13.5).
- CI
test_unicorn_companyfailed once due to Python hash-randomization picking a unicorn that is also FAANG+ (next(iter(UNICORNS))is non-deterministic andget_company_tierchecksFAANG_PLUSbeforeUNICORNS). This is a pre-existing flaky test unrelated to this PR's diff (which only touchesnormalize_date_string). Re-run succeeded.
All 6 status checks now green, mergeStateStatus: CLEAN. LGTM.
…NG_PLUS (#298) Closes #297. ## Summary - `test_unicorn_company` flaked on PR #295's CI with `assert 'faang_plus' == 'unicorn'`. Root cause: `get_company_tier()` checks `FAANG_PLUS` before `UNICORNS`, and the test used `next(iter(UNICORNS))` — which depends on `PYTHONHASHSEED` and occasionally lands on a company that is in both sets. - Pick deterministically from `sorted(UNICORNS - FAANG_PLUS)` so the chosen company is guaranteed to resolve to `'unicorn'`. - Apply the same sort-for-determinism hardening to `test_finance_sector_detected`, `test_defense_sector_detected`, and `test_company_can_overlap_tier_and_sector` so a future tier-precedence change cannot reintroduce flakiness. - Use `pytest.skip(...)` instead of a silent `if ...:` branch so that an empty config category is loudly visible. ## Test plan - [x] `pytest tests/test_enrichment.py -v` — 93 passed. - [x] Stress-tested across 10 `PYTHONHASHSEED` values (0, 1, 7, 13, 42, 99, 314, 1234, 9999, 12345): 7/7 of the `TestGetCompanyTier` cases pass every time. - [x] Full repo suite (`pytest`): 718 passed. ## Risk Test-only change. No production code touched.
Enhanced
normalize_date_string()to handle additional date formats commonly returned by job boards:These formats appear on Indeed, LinkedIn, and other job listing platforms but were previously falling through to the
raw-string return path, which caused some jobs to be unnecessarily filtered out.
Summary by CodeRabbit