[Enhancement] Update web scraping API #24

@suryanshsingh2001

Enhancement Request Template

Is your enhancement request related to a problem? Please describe.

Our current scraping implementation uses fetch() + cheerio inside a Next.js API route deployed on Vercel. This method previously worked but is now consistently failing because websites like Peerlist and Medium have introduced stronger bot protections such as Cloudflare Browser Challenges.

These protections return:

  • 403 Forbidden
  • cf-mitigated: challenge
  • Unrendered HTML shells with no usable content

Because fetch() only downloads the response and never executes the page's JavaScript, our Vercel serverless/edge functions cannot complete a Cloudflare challenge, so scraping fails.
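
The failure signals listed above could be detected before the response is handed to cheerio. The following is a minimal sketch, not existing project code: `isCloudflareChallenge` is a hypothetical helper name, and treating a 403 with a `server: cloudflare` header as a challenge is a heuristic assumption rather than a documented guarantee.

```javascript
// Heuristic check for a Cloudflare challenge response.
// Accepts any fetch-style Response (Node 18+ or the Vercel runtime).
function isCloudflareChallenge(response) {
  // Strong signal: the header this issue reports seeing.
  const mitigated = response.headers.get("cf-mitigated");
  if (mitigated !== null && mitigated.toLowerCase() === "challenge") {
    return true;
  }
  // Weaker signal (assumption): a 403 that was served by Cloudflare.
  // Origins also return 403 for their own reasons, so this can misfire.
  const server = response.headers.get("server") || "";
  return response.status === 403 && server.toLowerCase().includes("cloudflare");
}
```

A route could call this right after `fetch()` and return a clear "blocked by bot protection" error instead of parsing an empty HTML shell.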

Describe the enhancement you'd like

Introduce a dedicated scraping service using AWS Lambda + Playwright driving headless Chromium. Lambda can run a real headless browser, allowing it to:

  • Execute JavaScript
  • Solve Cloudflare JS challenges
  • Load fully-rendered HTML
  • Scrape dynamic websites reliably

Our Next.js API routes will call this Lambda function instead of attempting to scrape directly from Vercel.
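
As a rough illustration of that call, here is a sketch of a helper a Next.js API route might use. Everything named here is an assumption for illustration: `fetchRenderedHtml`, the placeholder endpoint, the `{ url }` request body, and the `{ html }` response shape are not defined anywhere in this issue.

```javascript
// Placeholder endpoint (assumption); the real value would come from configuration.
const DEFAULT_SCRAPER_ENDPOINT = "https://example.invalid/scrape";

// Asks the scraping service to render `targetUrl` and returns the HTML string.
// `fetchImpl` is injectable so the helper can be tested without network access.
async function fetchRenderedHtml(targetUrl, options = {}) {
  const {
    endpoint = DEFAULT_SCRAPER_ENDPOINT,
    fetchImpl = fetch,
    timeoutMs = 30000,
  } = options;

  const response = await fetchImpl(endpoint, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ url: targetUrl }),
    // Abort if the browser render takes longer than the allowed time.
    signal: AbortSignal.timeout(timeoutMs),
  });

  if (!response.ok) {
    throw new Error(`Scraper service responded with status ${response.status}`);
  }

  const payload = await response.json();
  if (typeof payload.html !== "string") {
    throw new Error("Scraper service response did not include an html string");
  }
  return payload.html;
}
```

The returned string can then be passed to the existing cheerio parsing code unchanged.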

Describe alternatives you've considered

  1. Adding browser-like headers
    Attempted custom User-Agent, cookies, and Accept headers. Cloudflare still blocks the requests.

  2. Scraping directly on Vercel
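    Not practical for us: the edge runtime cannot launch a browser at all, and the bundle-size and execution-time limits on serverless functions make shipping a full Chromium build alongside the app difficult.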

  3. Using proxies
    Rotating proxies do not solve Cloudflare's JavaScript challenge.

  4. Using third-party scraping APIs
    They work but add ongoing costs and provide less control. A Lambda-based scraper is more flexible and scalable.

Possible Implementation Details

  • Create an AWS Lambda function using:
    • playwright-core
    • @sparticuz/chromium (Lambda-optimized Chromium build)
  • Lambda loads the target URL in a real browser:
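    const playwright = require("playwright-core");
    const chromium = require("@sparticuz/chromium");

    const browser = await playwright.chromium.launch({
      args: chromium.args,
      executablePath: await chromium.executablePath(),
      headless: true, // Playwright expects a boolean here
    });

    let html;
    try {
      const page = await browser.newPage();
      // Wait for network activity to settle so client-rendered content is present.
      await page.goto(url, { waitUntil: "networkidle" });
      html = await page.content();
    } finally {
      await browser.close(); // always release the browser, even if navigation fails
    }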

Lambda returns the fully-rendered HTML to our Next.js API.
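
One way the Lambda side could be structured is sketched below. It is an illustration under assumptions, not a settled design: `createHandler` and `jsonResponse` are hypothetical names, the request is assumed to arrive as a JSON body of the form `{ "url": "..." }` (as with a Lambda Function URL or API Gateway proxy event), and the browser launcher is injected so the request handling can be checked without a real Chromium.

```javascript
// Builds a Lambda-style handler around an injected `launchBrowser` function.
function createHandler(launchBrowser) {
  return async function handler(event) {
    let targetUrl;
    try {
      const body = JSON.parse(event.body || "{}");
      targetUrl = new URL(body.url); // throws on a missing or malformed URL
    } catch {
      return jsonResponse(400, { error: "Body must be JSON with a valid url" });
    }
    // Only fetch web pages; reject file:, ftp:, and similar schemes.
    if (targetUrl.protocol !== "http:" && targetUrl.protocol !== "https:") {
      return jsonResponse(400, { error: "Only http and https URLs are allowed" });
    }

    const browser = await launchBrowser();
    try {
      const page = await browser.newPage();
      await page.goto(targetUrl.href, { waitUntil: "networkidle" });
      return jsonResponse(200, { html: await page.content() });
    } catch (err) {
      // Navigation or rendering failed; report it as an upstream error.
      const message = err && err.message ? err.message : String(err);
      return jsonResponse(502, { error: message });
    } finally {
      await browser.close(); // release the browser on success and failure
    }
  };
}

// Formats a proxy-style response object with a JSON body.
function jsonResponse(statusCode, payload) {
  return {
    statusCode,
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Wiring with the packages named above (shown as a comment, not executed here):
// const playwright = require("playwright-core");
// const chromium = require("@sparticuz/chromium");
// exports.handler = createHandler(async () =>
//   playwright.chromium.launch({
//     args: chromium.args,
//     executablePath: await chromium.executablePath(),
//     headless: true,
//   })
// );
```

Validating the URL before launching the browser keeps malformed requests cheap. Lambda caps synchronous response payloads at 6 MB, so very large pages may need trimming before they are returned.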

Next.js extracts text using cheerio as before, but now with real, complete HTML.

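This should let requests pass Cloudflare's JavaScript challenges and restore consistent scraping, although headless browsers can still be detected, so it is not guaranteed to work for every site.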

Additional context

Peerlist and Medium recently updated their bot protection systems.

Cloudflare challenges require JavaScript execution and browser fingerprinting.

AWS Lambda supports full browser automation, making it a suitable scraping backend.

This change will significantly improve scraping reliability and reduce API failures.

Optional Sections

Priority: High

Are you willing to submit a PR for this enhancement? Yes

Does this enhancement require changes in documentation? Yes — the scraping architecture must be updated to reflect the new Lambda-based workflow.
