[Enhancement] Update web scraping API #24

## Enhancement Request Template

### **Is your enhancement request related to a problem? Please describe.**
Our current scraping implementation uses `fetch()` + `cheerio` inside a Next.js API route deployed on Vercel. This method previously worked but is now consistently failing because websites like Peerlist and Medium have introduced stronger bot protections such as **Cloudflare Browser Challenges**.

These protections return:
- `403 Forbidden`
- `cf-mitigated: challenge`
- Unrendered HTML shells with no usable content

Because our `fetch()`-based route on Vercel's serverless/edge runtime **does not run inside a real browser environment**, it cannot execute the challenge JavaScript and therefore cannot solve Cloudflare challenges, so scraping fails.
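
The failure signature listed above can be checked explicitly, so that challenge responses are logged separately from ordinary errors. This is a minimal sketch, not existing project code: the helper name `isCloudflareChallenge` is hypothetical, and it assumes a `fetch()`-style response whose `headers` object exposes `get()`. It keys on the `cf-mitigated` header because a bare `403` can have other causes.

```js
// Hypothetical helper: returns true when a fetch()-style response carries
// the `cf-mitigated: challenge` header described above.
function isCloudflareChallenge(response) {
  const mitigated = (response.headers.get("cf-mitigated") || "").toLowerCase();
  return mitigated === "challenge";
}
```

In the existing route this could be called right after `fetch()` to decide whether to report a bot-protection failure instead of parsing an empty HTML shell.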

### **Describe the enhancement you'd like**
Introduce a dedicated scraping service using **AWS Lambda + Playwright** (or Chromium). Lambda can run a real headless browser, allowing it to:
- Execute JavaScript
- Solve Cloudflare JS challenges
- Load fully-rendered HTML
- Scrape dynamic websites reliably

Our Next.js API routes will call this Lambda function instead of attempting to scrape directly from Vercel.
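
A rough sketch of what the Next.js side of that call might look like. Everything named here is an assumption for illustration: the helper `fetchRenderedHtml`, the `SCRAPER_LAMBDA_URL` environment variable (an HTTPS endpoint in front of the Lambda), and the `{ url }` request / `{ html }` response shapes. The `fetchImpl` parameter exists only so the helper can be exercised without a network call.

```js
// Hypothetical helper for a Next.js API route.
// SCRAPER_LAMBDA_URL is an assumed env var pointing at the Lambda's HTTPS endpoint.
async function fetchRenderedHtml(targetUrl, fetchImpl = fetch) {
  const endpoint = process.env.SCRAPER_LAMBDA_URL;
  if (!endpoint) throw new Error("SCRAPER_LAMBDA_URL is not set");

  // Ask the scraping service to render the page (assumed request shape).
  const res = await fetchImpl(endpoint, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ url: targetUrl }),
  });
  if (!res.ok) throw new Error(`Scraper returned HTTP ${res.status}`);

  // Assumed response shape: { html: "<rendered page>" }.
  const { html } = await res.json();
  return html;
}
```

The returned HTML can then be passed to the existing `cheerio` extraction unchanged.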

### **Describe alternatives you've considered**
1. **Adding browser-like headers**  
   Attempted custom `User-Agent`, cookies, and `Accept` headers. Cloudflare still blocks the requests.

2. **Scraping directly on Vercel**  
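   Not viable for us: the Edge runtime cannot launch a browser binary, and serverless functions have bundle-size and execution-time limits that make running Playwright with a full browser impractical.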

3. **Using proxies**  
   Rotating proxies do not solve Cloudflare's JavaScript challenge.

4. **Using third-party scraping APIs**  
   They work but add ongoing costs and provide less control. A Lambda-based scraper is more flexible and scalable.

### **Possible Implementation Details**
- Create an AWS Lambda function using:
  - `playwright-core`
  - `@sparticuz/chromium` (Lambda-optimized Chromium build)
- Lambda loads the target URL in a real browser:
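  ```js
  const playwright = require("playwright-core");
  const chromium = require("@sparticuz/chromium");

  // Launch the Lambda-compatible Chromium build.
  const browser = await playwright.chromium.launch({
    args: chromium.args,
    executablePath: await chromium.executablePath(),
    headless: chromium.headless,
  });

  const page = await browser.newPage();
  // Wait for network activity to settle so client-rendered content is present.
  await page.goto(url, { waitUntil: "networkidle" });
  const html = await page.content();
  // Release the browser so it does not linger in the Lambda container.
  await browser.close();
  ```
- Lambda returns the fully rendered HTML to our Next.js API.
- Next.js extracts text using `cheerio` as before, but now from complete, rendered HTML.

This lets requests get past Cloudflare's JavaScript challenge and should restore consistent scraping.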
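
To make the Lambda side more concrete, here is one possible shape for the handler wrapper. It is a sketch under assumptions, not a finished design: `createHandler` and `renderPage` are hypothetical names, `renderPage(url)` stands in for the Playwright launch/goto/content steps shown above, and the event is assumed to be an HTTP-style event carrying a JSON string `body` of the form `{ "url": "..." }`.

```js
// Hypothetical factory: wraps a renderPage(url) function (for example the
// Playwright steps above) in a Lambda-style handler.
function createHandler(renderPage) {
  const json = (statusCode, payload) => ({
    statusCode,
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  });

  return async function handler(event) {
    // Parse and validate the requested URL (assumed request shape: { url }).
    let url;
    try {
      url = new URL(JSON.parse(event.body || "{}").url);
    } catch {
      return json(400, { error: "A valid 'url' is required" });
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") {
      return json(400, { error: "Only http(s) URLs are supported" });
    }

    // Render the page and return the HTML (assumed response shape: { html }).
    try {
      const html = await renderPage(url.href);
      return json(200, { html });
    } catch (err) {
      return json(502, { error: String((err && err.message) || err) });
    }
  };
}
```

Passing `renderPage` in as a parameter keeps the browser-specific code separate from request handling, so the validation and error paths can be tested without launching Chromium.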

### **Additional context**
- Peerlist and Medium recently updated their bot protection systems.
- Cloudflare challenges require JavaScript execution and browser fingerprinting.
- AWS Lambda supports full browser automation, making it a suitable scraping backend.
- This change will significantly improve scraping reliability and reduce API failures.

### **Optional Sections**
- **Priority:** High
- **Are you willing to submit a PR for this enhancement?** Yes
- **Does this enhancement require changes in documentation?** Yes. The scraping architecture documentation must be updated to reflect the new Lambda-based workflow.
