feat(scraper): add support for weeks, months, and 'just posted' date formats #295

Merged: ambicuity merged 2 commits into ambicuity:main from whbbupt:feat/add-date-parsing on May 26, 2026

@whbbupt whbbupt commented May 24, 2026

Enhanced normalize_date_string() to handle additional date formats commonly returned by job boards:

  • "Just posted" / "Recently" → today
  • "X weeks ago" → X * 7 days ago
  • "X months ago" → X * 30 days ago
  • "Active X days ago" → X days ago

These formats appear on Indeed, LinkedIn, and other job listing platforms but were previously falling through to the
raw-string return path, which caused some jobs to be unnecessarily filtered out.
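
The mapping above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/update_jobs.py: the function name `normalize_relative_date`, the regexes, and the injectable `now` parameter are assumptions made for this example, and the 30-day month mirrors the approximation listed in this description.

```python
import re
from datetime import datetime, timedelta


# Hypothetical stand-in for normalize_date_string(); the name, regexes, and
# the explicit `now` argument are illustrative assumptions.
def normalize_relative_date(posted_at: str, now: datetime) -> str:
    text = posted_at.strip().lower()

    # "Just posted" / "Recently" -> today
    if text in ("just posted", "recently"):
        return now.strftime("%Y-%m-%d")

    # "X days ago"; re.search matches anywhere, so a leading word is ignored
    days = re.search(r"(\d+)\s*days?\s+ago", text)
    if days:
        return (now - timedelta(days=int(days.group(1)))).strftime("%Y-%m-%d")

    # "X weeks ago" -> X * 7 days ago
    weeks = re.search(r"(\d+)\s*weeks?\s+ago", text)
    if weeks:
        return (now - timedelta(days=int(weeks.group(1)) * 7)).strftime("%Y-%m-%d")

    # "X months ago" -> X * 30 days ago (approximation)
    months = re.search(r"(\d+)\s*months?\s+ago", text)
    if months:
        return (now - timedelta(days=int(months.group(1)) * 30)).strftime("%Y-%m-%d")

    # Unrecognized input falls through unchanged (the raw-string return path)
    return posted_at
```

Passing `now` in explicitly is a choice made for the sketch so that the results can be checked against a fixed date.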

Summary by CodeRabbit

  • Bug Fixes
    • Improved date format recognition for job postings to support additional human-readable date expressions. The system now correctly interprets "Recently", "Just posted", "Just now", and relative time phrases like "Posted X Weeks Ago" and "Posted X Months Ago", ensuring more accurate chronological sorting and display.

@whbbupt whbbupt requested a review from ambicuity as a code owner May 24, 2026 07:20
coderabbitai Bot commented May 24, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cb0e1988-e430-4317-bdf9-b989260e0788

📥 Commits

Reviewing files that changed from the base of the PR and between cf71786 and 6434af1.

📒 Files selected for processing (1)
  • scripts/update_jobs.py
📝 Walkthrough

This PR extends the normalize_date_string() function in scripts/update_jobs.py to parse additional relative date formats commonly found in job posting timestamps. The changes add support for immediate-posting phrases ("just posted," "recently"), week and month offset expressions, and "active" day counts, with corresponding docstring examples.

Changes

Relative Date Format Expansion

| Layer / File(s) | Summary |
|---|---|
| Relative date format support (scripts/update_jobs.py) | Added parsing branches for "just posted/now/recently" (mapped to today), weeks-ago expressions (weeks × 7 days), months-ago expressions (months × 30 days), and "active X days ago" (X days). Docstring examples extended to document the newly supported phrases. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • ambicuity/New-Grad-Jobs#161: Both PRs expand relative-date parsing in normalize_date_string() to recognize additional timestamp formats ("hours ago"/"minutes ago" in #161, vs. weeks/months/active days/"just posted" in this PR).

Suggested labels

size/S

Suggested reviewers

  • ambicuity

Poem

🐰 With timestamps now speaking in weeks and in moons,
And "just posted" ringing across all the job postings,
The parsing grows wiser, no longer in cartoon—
Fresh dates in human words, our script now well-knowing!
hops past the merge button

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly summarizes the main change: adding support for additional date formats (weeks, months, and 'just posted') in the date normalization function. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request expands the normalize_date_string function in scripts/update_jobs.py to support a wider range of relative date formats, including 'Just posted', 'Weeks Ago', and 'Months Ago'. A review comment identifies that the logic added for 'Active X Days Ago' is redundant because the existing 'days ago' regex already captures these cases.

Comment thread scripts/update_jobs.py Outdated
Comment on lines +2271 to +2275
    # Handle "Active X Days Ago" (some job boards use "Active" instead of "Posted")
    active_match = re.search(r'active\s+(\d+)\s*days?\s+ago', posted_at_lower)
    if active_match:
        days = int(active_match.group(1))
        return (now - timedelta(days=days)).strftime('%Y-%m-%d')
medium

This block is redundant and will never be executed. The existing days_match logic at line 2240 already uses the regex r'(\d+)\s*days?\s+ago', which captures the "X days ago" pattern regardless of whether it is preceded by "Posted" or "Active" (since re.search matches anywhere in the string). Because the implementation here is identical to the earlier block, this addition is unnecessary.
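
The reachability claim can be checked in isolation. The snippet below is a small sketch that applies the two regexes quoted in this thread to a few sample strings; the sample inputs are made up for illustration.

```python
import re

DAYS_RE = re.compile(r'(\d+)\s*days?\s+ago')             # earlier, generic pattern
ACTIVE_RE = re.compile(r'active\s+(\d+)\s*days?\s+ago')  # pattern added in this PR

samples = ["active 5 days ago", "posted 1 day ago", "3 days ago"]

for s in samples:
    generic = DAYS_RE.search(s)
    active = ACTIVE_RE.search(s)
    # Every string the "active" pattern matches is also matched by the
    # generic pattern, with the same captured day count.
    if active:
        assert generic is not None
        assert generic.group(1) == active.group(1)
```

Because the generic branch runs first and returns, any input that would reach the "active" branch has already been handled.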

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/update_jobs.py`:
- Around line 2265-2269: The current logic uses timedelta(days=months * 30)
which can drift by several days; replace this with calendar-accurate month
subtraction using dateutil.relativedelta: import relativedelta via "from
dateutil.relativedelta import relativedelta" and change the return to "(now -
relativedelta(months=months)).strftime('%Y-%m-%d')". Update the code path that
handles months (variables months_match, posted_at_lower, now) and ensure
dateutil is added to requirements if not present; alternatively, if you cannot
add the dependency, document the 30-day approximation in the function docstring
mentioning the possible +/- days drift.
- Around line 2271-2275: Remove the unreachable "Active X Days Ago" regex block:
the earlier pattern r'(\d+)\s*days?\s+ago' already matches "active 5 days ago"
and returns, so delete the lines that define active_match, its regex
r'active\s+(\d+)\s*days?\s+ago', and the subsequent return; locate this dead
code in the same function that computes posted_at using posted_at_lower and now
(look for the existing r'(\d+)\s*days?\s+ago' match and the variable names
active_match, posted_at_lower, days) and remove only that redundant block to
keep behavior identical.
- Around line 2259-2263: Extract the hardcoded week-to-day conversion (the
literal 7) into a module-level constant (e.g., DAYS_PER_WEEK) alongside the
other constants in scripts/update_jobs.py, then replace the literal in the weeks
handling block (the code that uses weeks_match and computes timedelta(days=weeks
* 7)) with timedelta(days=weeks * DAYS_PER_WEEK) so the conversion is named and
reusable.
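
To make the drift concrete, the sketch below compares the 30-day approximation with calendar-month subtraction. It uses only the standard library: `subtract_months` is a hypothetical helper written for this example as a stand-in for `relativedelta(months=...)`, clamping the day to the last day of shorter months, which is how relativedelta is documented to behave. The constant `DAYS_PER_MONTH` and the sample date are likewise assumptions for illustration.

```python
import calendar
from datetime import date, timedelta

DAYS_PER_WEEK = 7
DAYS_PER_MONTH = 30  # approximation; illustrative constant name


def subtract_months(d: date, months: int) -> date:
    # Hypothetical stdlib stand-in for relativedelta(months=months).
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    # Clamp the day when the target month is shorter (e.g. Mar 31 -> Feb 28).
    return date(year, month, min(d.day, last_day))


today = date(2026, 5, 24)
approx = today - timedelta(days=3 * DAYS_PER_MONTH)  # 90 days back
exact = subtract_months(today, 3)                    # 3 calendar months back
drift = (approx - exact).days                        # difference in days
```

For this particular date the two results differ by one day; the gap grows or shrinks depending on which months are crossed.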
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9080e92e-a03a-4278-b23c-ac5f273b8de8

📥 Commits

Reviewing files that changed from the base of the PR and between 94eb071 and cf71786.

📒 Files selected for processing (1)
  • scripts/update_jobs.py


@ambicuity ambicuity left a comment

Approving after thorough local verification.

Scope verified: Extends normalize_date_string() to handle "Just posted" / "Recently", "X Weeks Ago", "X Months Ago" formats — fixes a real bug where these formats fell through to the raw-string return and got filtered downstream.

AI reviewer feedback (CodeRabbit + Gemini): all addressed in commit 6434af1.

  1. Extracted DAYS_PER_WEEK / DAYS_PER_MONTH module-level constants — done.
  2. Replaced timedelta(days=months * 30) with relativedelta(months=months) for calendar-accurate month arithmetic — done.
  3. Removed the unreachable "Active X Days Ago" block; the earlier (\d+)\s*days?\s+ago regex already handles it (verified with real input — see test results below).

Local verification on PR head (6434af1):

  • All 22 hand-written real-data assertions pass (regression + new behaviors + edge cases): "Posted 1 Month Ago" → exactly 1 calendar month back via relativedelta, "Active 5 Days Ago" → correctly resolved via the existing days regex, "Recently"/"Just now"/"Just posted" → today, "3 wks ago" / "6 mos ago" abbreviations work.
  • Full repo test suite: 718 passed locally (Python 3.13.5).
  • CI test_unicorn_company failed once due to Python hash-randomization picking a unicorn that is also FAANG+ (next(iter(UNICORNS)) is non-deterministic and get_company_tier checks FAANG_PLUS before UNICORNS). This is a pre-existing flaky test unrelated to this PR's diff (which only touches normalize_date_string). Re-run succeeded.
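
As an illustration of how abbreviations such as "wks" and "mos" might be accepted alongside the full words, here is a minimal sketch. The pattern and the helper name `parse_relative` are assumptions for this example; the regexes actually used in scripts/update_jobs.py are not shown in this thread.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: accepts "week(s)"/"wk(s)" and "month(s)"/"mo(s)".
RELATIVE_RE = re.compile(r"(\d+)\s*(weeks?|wks?|months?|mos?)\s+ago")


def parse_relative(text: str) -> Optional[Tuple[int, str]]:
    """Return (count, unit) with unit normalized to 'week' or 'month'."""
    match = RELATIVE_RE.search(text.lower())
    if not match:
        return None
    count, unit = int(match.group(1)), match.group(2)
    return count, ("week" if unit.startswith("w") else "month")
```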

All 6 status checks now green, mergeStateStatus: CLEAN. LGTM.

@ambicuity ambicuity merged commit abdbdfc into ambicuity:main May 26, 2026
7 of 8 checks passed
ambicuity added a commit that referenced this pull request May 26, 2026
…NG_PLUS (#298)

Closes #297.

## Summary

- `test_unicorn_company` flaked on PR #295's CI with `assert
'faang_plus' == 'unicorn'`. Root cause: `get_company_tier()` checks
`FAANG_PLUS` before `UNICORNS`, and the test used `next(iter(UNICORNS))`
— which depends on `PYTHONHASHSEED` and occasionally lands on a company
that is in both sets.
- Pick deterministically from `sorted(UNICORNS - FAANG_PLUS)` so the
chosen company is guaranteed to resolve to `'unicorn'`.
- Apply the same sort-for-determinism hardening to
`test_finance_sector_detected`, `test_defense_sector_detected`, and
`test_company_can_overlap_tier_and_sector` so a future tier-precedence
change cannot reintroduce flakiness.
- Use `pytest.skip(...)` instead of a silent `if ...:` branch so that an
empty config category is loudly visible.
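
A minimal sketch of the failure mode and the deterministic pick described above. The company names, set contents, and the `"other"` fallback are made up for illustration; only the precedence order (FAANG_PLUS checked before UNICORNS) and the `sorted(UNICORNS - FAANG_PLUS)` selection come from this summary.

```python
# Illustrative data only: company names and set contents are made up.
FAANG_PLUS = {"Alpha", "Bravo"}
UNICORNS = {"Bravo", "Charlie", "Delta"}  # "Bravo" is in both sets


def get_company_tier(company: str) -> str:
    # Mirrors the precedence described above: FAANG_PLUS is checked first.
    if company in FAANG_PLUS:
        return "faang_plus"
    if company in UNICORNS:
        return "unicorn"
    return "other"  # assumed fallback for this sketch


# Flaky: which element comes first depends on string hashing, so this can
# land on "Bravo", which resolves to 'faang_plus'.
# company = next(iter(UNICORNS))

# Deterministic: exclude the overlap, then sort.
candidates = sorted(UNICORNS - FAANG_PLUS)
company = candidates[0]
assert get_company_tier(company) == "unicorn"
```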

## Test plan

- [x] `pytest tests/test_enrichment.py -v` — 93 passed.
- [x] Stress-tested across 10 `PYTHONHASHSEED` values (0, 1, 7, 13, 42,
99, 314, 1234, 9999, 12345): 7/7 of the `TestGetCompanyTier` cases pass
every time.
- [x] Full repo suite (`pytest`): 718 passed.
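
One way such a hash-seed stress test could be run is sketched below. It is not the procedure used for this commit: the child program, the sample set, and the helper name `picks_for_seed` are assumptions for illustration. Each child interpreter is started with a different `PYTHONHASHSEED` and reports both the order-dependent pick and the sorted pick.

```python
import os
import subprocess
import sys

# Child program: prints the hash-order-dependent pick and the sorted pick.
CHILD = (
    "s = {'Bravo', 'Charlie', 'Delta'}; "
    "print(next(iter(s)), sorted(s)[0])"
)


def picks_for_seed(seed: int) -> tuple:
    # Run a fresh interpreter so the seed actually affects string hashing.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    ).stdout.split()
    return out[0], out[1]


SEEDS = [0, 1, 7, 13, 42]
results = [picks_for_seed(seed) for seed in SEEDS]
# The sorted pick is identical for every seed; the iter() pick may vary.
sorted_picks = {sorted_pick for _, sorted_pick in results}
```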

## Risk

Test-only change. No production code touched.