🔥 Flarecrawl CLI

Cloudflare Browser Run CLI — Firecrawl-compatible, cost-efficient at scale.

CLI that wraps Cloudflare's Browser Run API with the same command structure as Firecrawl. Supports scraping, crawling, URL discovery, screenshots, PDFs, and AI-powered data extraction — all running on Cloudflare's headless Chromium infrastructure. Now with direct Chrome DevTools Protocol (CDP) access for persistent browser sessions, real-time debugging, and authenticated scraping. Cost-efficient alternative for high-volume use cases (free 10 min/day, then $0.09/hr).

Recent Updates

Version	Date	Changes
v0.22.0	2026-04-21	Secure credential storage. OS keyring via `flarecrawl[secure]` (Forma protocol §07). Auto-migrates legacy plaintext config.json. Priority: env > keyring > .env > legacy. 1112 tests
v0.21.0	2026-04-20	Auth + crawl fixes. `--browser-cookies` on scrape/interact/design (was videos-only). `--session` on crawl. `--ignore-robots` made actionable. Live test suite for design extraction
v0.20.0	2026-04-20	Actionable CDP errors. `_enrich_cdp_error` detects bot detection, timeouts, redirects, network errors, auth failures and appends suggestions. CHANGELOG.md restored as source of truth
v0.19.0	2026-04-19	Video discovery. `flarecrawl videos` finds video URLs (mp4, webm, m3u8, YouTube, Vimeo embeds, OpenGraph, JSON-LD). `--export-cookies` for yt-dlp Netscape cookie format. Works behind login with --session/--interactive. Pipe to yt-dlp for downloading
v0.18.0	2026-04-18	Security hardening. Path-traversal fix on `--resume JOB_ID`, blocks non-http(s) URLs, caps robots.txt/sitemap downloads. Crawl loop refactored. PEP 561 `py.typed` marker. 1027 tests

For older releases, see CHANGELOG.md.

Why Flarecrawl?

	Firecrawl	Flarecrawl
Pricing	Per-page credits ($99/100K)	Time-based (free 10 min/day, then $0.09/hr)
Crossover	Cheaper below ~8K pages/mo	11x cheaper at 100K pages/mo
JS rendering	Yes	Yes (headless Chromium on edge)
Persistent browser sessions	No	Yes (`--cdp --keep-alive`)
Live browser debugging	No	Yes (`--live-view` via Chrome DevTools)
Interactive auth (OAuth/2FA/CAPTCHA)	No	Yes (`--interactive` — login in DevTools)
Form filling	No	Yes (`interact` command with human-like timing)
Session recordings	No	Yes (`--record` — rrweb format, 30-day retention)
WebMCP tool discovery	No	Yes (`webmcp discover/call`)
AI extraction	Spark models	Workers AI (included)
Agent-safe sanitisation	No	Yes (`--agent-safe` — 13 sanitisers, DeepMind taxonomy)
Paywall bypass	Limited	Multi-strategy cascade (`--paywall --stealth`)
Content negotiation	No	Yes (auto `Accept: text/markdown`, zero browser time)
PDF generation	No	Yes
Web search	Yes	Yes (Jina Search API)
Cookie persistence	No	Yes (`--save-cookies`/`--load-cookies`)
Real HAR capture	No	Yes (CDP `Network.enable`)
Custom CDP backends	No	Yes (`FLARECRAWL_CDP_ENDPOINT` — Oxylabs, Bright Data, local Chrome)
Batch mode	Limited	Yes (file input, NDJSON output, up to 50 workers)
Direct HTTP spider	Single-engine	`flarecrawl spider` — direct HTTP, no browser cost, 50-100 concurrent, built for millions of URLs
Resume interrupted crawls	No	Yes (`--resume JOB_ID`) — kill the process, come back hours later, pick up exactly where you left off
Weekly-refresh mode	No	Yes (`--refresh-days 7`) — only re-fetch pages that changed since last run; unchanged = zero browser cost
Fair multi-domain scheduling	No	Yes — round-robin so one fast site can't starve 53,999 others
Adaptive politeness	No	Yes (`--adaptive-delay`) — automatically slows on slow hosts, honours `Retry-After`
Auto-retry with backoff	Limited	Yes — per-URL retry budget, exp backoff, dead-letter inspection
Circuit breaker per domain	No	Yes — pauses domains after 10 consecutive failures
robots.txt compliance	Partial	Yes (protego, with per-host crawl-delay)
URL canonicalisation	No	Yes — strips `utm_*`, `gclid`, `fbclid`, etc. so tracking-param permutations don't duplicate crawls
OpenTelemetry tracing	No	Yes (`--tracing console\|json\|otlp`)
Self-hosted	Yes	Cloudflare infrastructure (or any CDP backend)
Favicon extraction	Via branding	Dedicated command
Branding extraction	Yes	Not yet

Different pricing models suit different use cases. Flarecrawl's time-based pricing is particularly cost-efficient for high-volume crawls.

Use Cases

AI agent perception layer

Flarecrawl is often the first thing an AI agent sees when it reads the web. --agent-safe sanitises scraped content against adversarial attacks (AI Agent Traps, Google DeepMind 2026) before it enters an LLM context window or RAG pipeline. 13 sanitisers defend against hidden text injection, prompt injection, and semantic manipulation — with findings reported in structured JSON metadata.

Scraping behind authentication

The hardest scraping problem isn't parsing HTML — it's getting past the login page. --interactive opens a real browser in Chrome DevTools where you complete OAuth flows, solve CAPTCHAs, or handle 2FA manually. Cookies are auto-saved and reused on subsequent scrapes. No more extracting tokens from browser DevTools and pasting them into headers.

High-volume content extraction

Batch mode with up to 50 parallel workers (--workers 50) scrapes thousands of pages concurrently. Combine with --paywall (multi-strategy extraction cascade), --stealth (browser TLS fingerprinting), and --agent-safe for production pipelines that handle the real web — paywalls, bot detection, and adversarial content included.

Site monitoring and change detection

--diff compares current content against the cached version. Combine with --har for network-level visibility into what changed. --record saves full session recordings for audit trails. Use --keep-alive for cost-efficient repeated checks on the same site.

Documentation and knowledge base building

flarecrawl download saves entire sites as markdown files. flarecrawl discover finds every URL via sitemaps, RSS feeds, and link crawling. Content negotiation (Accept: text/markdown) fetches server-rendered markdown directly from compatible sites — zero browser time, higher quality output.

API and structured data extraction

flarecrawl extract uses Cloudflare Workers AI for natural language data extraction ("Get all product names and prices"). flarecrawl schema extracts LD+JSON, OpenGraph, and Twitter Cards. flarecrawl openapi discovers and downloads API specifications.

CDP: Persistent Browser Control

Flarecrawl v0.14.1 adds direct Chrome DevTools Protocol (CDP) access via Cloudflare's Browser Run WebSocket endpoint. This is an opt-in mode (--cdp) that gives you a persistent, controllable browser session instead of fire-and-forget REST calls.

REST (default) — each command spins up a fresh browser, navigates, extracts, and destroys. Stateless and cheap. Good for 90% of scraping.

CDP (--cdp) — you get a live browser session that stays open. Navigate, interact, inspect, then navigate again — all within the same browser context. The browser remembers cookies, localStorage, and DOM state between operations.

When to use CDP

Scenario	Why CDP
Scraping behind login	`--interactive` opens DevTools, you log in manually (OAuth, 2FA, CAPTCHA), cookies auto-saved for future scrapes
Debugging failed scrapes	`--live-view` opens Chrome DevTools pointed at the remote browser — see console errors, inspect DOM, watch network
Complex JS execution	`--js-eval` via CDP uses real `Runtime.evaluate` — async/await works, returns typed objects, not the REST addScriptTag hack
Multi-step workflows	`--keep-alive 60` holds the browser open — scrape, screenshot, extract without re-navigating (1/3 the browser time cost)
Network debugging	`--har` captures real browser network traffic (redirects, blocked resources, timing waterfall), not just flarecrawl API metadata
Audit trails	`--record` saves rrweb session recordings — replay exactly what the browser did

CDP examples

# Interactive auth: login in DevTools, cookies saved, then scrape
flarecrawl scrape https://private-app.example.com --interactive --json

# Debug a scrape in real-time
flarecrawl scrape https://broken-site.com --live-view

# Proper JS execution (async, typed returns)
flarecrawl scrape https://spa.example.com --cdp --js-eval "await fetch('/api/data').then(r => r.json())"

# Persistent session: navigate once, do multiple things
flarecrawl scrape https://example.com --cdp --keep-alive 120 \
  --save-cookies session.json --har traffic.har --json

# Record a session for debugging later
flarecrawl scrape https://example.com --record --record-output debug.json

# Reuse saved cookies for authenticated scraping
flarecrawl scrape https://private-app.example.com --load-cookies session.json --cdp --json

# Manage CDP sessions
flarecrawl cdp sessions --json
flarecrawl cdp close

Install CDP support

CDP requires the websockets package (optional dependency):

uv pip install websockets
# or with extras
uv pip install flarecrawl[cdp]

Without websockets, all non-CDP features work normally. The --cdp flag will print a clear install instruction if the dependency is missing.

Quick Start

git clone https://github.com/0xDarkMatter/flarecrawl.git
cd flarecrawl
uv tool install --editable .
flarecrawl auth login
flarecrawl scrape https://example.com
flarecrawl scrape https://example.com --json | jq '.data.content'

Key Commands

Command	Description
`flarecrawl scrape URL`	Scrape page to markdown (or html/links/images/summary)
`flarecrawl crawl URL --wait --limit N`	Crawl site with async job system
`flarecrawl download URL --limit N`	Save site pages to disk as markdown/html
`flarecrawl extract PROMPT --urls URL`	AI-powered structured data extraction
`flarecrawl fetch URL -o file`	Content-type aware download (binary, JSON, HTML)
`flarecrawl openapi URL --probe`	Discover OpenAPI/Swagger specs on a site
`flarecrawl screenshot URL -o page.png`	Capture full or partial page screenshots
`flarecrawl pdf URL -o page.pdf`	Render page as PDF
`flarecrawl map URL`	List all links on a page
`flarecrawl discover URL`	Discover URLs via sitemaps, feeds, and links
`flarecrawl search QUERY`	Web search via Jina API
`flarecrawl schema URL`	Extract LD+JSON, OpenGraph, Twitter Cards
`flarecrawl favicon URL`	Extract favicon/icon URLs
`flarecrawl session list`	Manage saved cookie sessions

Install

git clone https://github.com/0xDarkMatter/flarecrawl.git
cd flarecrawl
uv tool install --editable .

Secure credential storage (recommended)

By default, flarecrawl stores credentials in ~/.config/flarecrawl/config.json (plaintext). For OS-native keyring storage:

uv tool install --editable . --with keyring
# or: uv pip install flarecrawl[secure]

With keyring installed, tokens move to:

Windows: Credential Manager
macOS: Keychain
Linux: Secret Service / GNOME Keyring / KWallet

Existing plaintext credentials are migrated automatically on first use.

Authentication

1. Create an API token

Go to https://dash.cloudflare.com/profile/api-tokens
Click Create Token
Select Create Custom Token
Configure:
- Token name: Flarecrawl (or anything)
- Permissions: Account → Browser Rendering → Edit
- Account Resources: Include → your account
Click Continue to summary → Create Token
Copy the token (shown only once)

2. Find your Account ID

Go to https://dash.cloudflare.com
Click any domain (or the account overview)
Look in the right sidebar under Account ID
Copy the 32-character hex string

3. Authenticate

# Interactive — opens browser to Cloudflare dashboard for guided setup
flarecrawl auth login

# Non-interactive (CI/CD)
flarecrawl auth login --account-id YOUR_ACCOUNT_ID --token YOUR_TOKEN

4. Verify

flarecrawl auth status
flarecrawl --status   # Shows auth + pricing info

Environment Variables (CI/CD)

export FLARECRAWL_ACCOUNT_ID="your-account-id"
export FLARECRAWL_API_TOKEN="your-api-token"

Commands

scrape — Fetch page content

# Default: markdown output to stdout
flarecrawl scrape https://example.com

# HTML output
flarecrawl scrape https://example.com --format html

# JSON envelope (for piping)
flarecrawl scrape https://example.com --json

# Multiple URLs (scraped concurrently)
flarecrawl scrape https://a.com https://b.com https://c.com --json

# Batch mode: file input with NDJSON output and configurable workers
flarecrawl scrape --batch urls.txt --workers 5

# From a file of URLs (backward-compatible alias for --batch)
flarecrawl scrape --input urls.txt --json

# With timing info
flarecrawl scrape https://example.com --timing

# Filter JSON fields
flarecrawl scrape https://example.com --json --fields url,content

# Extract links only
flarecrawl scrape https://example.com --format links --json

# Take screenshot via scrape
flarecrawl scrape https://example.com --screenshot -o page.png

# Wait for JS rendering (SPAs, Swagger UIs)
flarecrawl scrape https://example.com --js

# Bypass response cache
flarecrawl scrape https://example.com --no-cache

# Custom page load strategy
flarecrawl scrape https://example.com --wait-until networkidle2

Formats: markdown (default), html, links, screenshot, json (AI extraction)

HTTP Basic Auth

All commands support --auth user:password for sites protected by HTTP Basic Auth:

flarecrawl scrape https://intranet.example.com --auth admin:secret
flarecrawl crawl https://intranet.example.com --wait --limit 50 --auth admin:secret
flarecrawl download https://intranet.example.com --limit 20 --auth user:pass
flarecrawl screenshot https://intranet.example.com --auth user:pass -o page.png

Content filtering

# Strip nav/header/footer, keep main article content
flarecrawl scrape https://example.com --only-main-content

# Keep only specific CSS selectors
flarecrawl scrape https://example.com --include-tags "article,.post"

# Remove specific elements
flarecrawl scrape https://example.com --exclude-tags "nav,footer,.sidebar"

Custom headers & mobile

# Custom HTTP headers
flarecrawl scrape https://example.com --headers "Accept-Language: fr"
flarecrawl scrape https://example.com --headers '{"X-Api-Key": "abc123"}'

# Custom User-Agent (identify your crawler, or try bypassing paywalls)
flarecrawl scrape https://example.com --user-agent "MyBot/1.0 (contact@example.com)"
flarecrawl scrape https://paywalled.example.com --user-agent "Googlebot/2.1"

# Mobile device emulation (iPhone 14 Pro viewport)
flarecrawl scrape https://example.com --mobile
flarecrawl screenshot https://example.com --mobile -o mobile.png

Images, summaries & structured data

# Extract all image URLs from a page
flarecrawl scrape https://example.com --format images --json

# AI-powered content summary
flarecrawl scrape https://example.com --format summary --json

# Extract LD+JSON, OpenGraph, Twitter Cards
flarecrawl scrape https://example.com --format schema --json

# Dedicated schema command with type filtering
flarecrawl schema https://example.com --json
flarecrawl schema https://example.com --type ld-json --json
flarecrawl schema https://example.com --type opengraph --json

Webhooks

# POST crawl results to a URL when complete
flarecrawl crawl https://example.com --wait --limit 10 --webhook https://hooks.example.com/crawl

# With custom headers (e.g. auth token)
flarecrawl crawl https://example.com --wait --limit 10 \
  --webhook https://hooks.example.com/crawl \
  --webhook-headers "Authorization: Bearer token123"

CSS selector extraction & JS execution

# Extract content from specific CSS selector
flarecrawl scrape https://example.com --selector "main" --json

# Wait for a CSS element before capturing (SPAs, lazy-load)
flarecrawl scrape https://example.com --wait-for-selector ".loaded" --json

# Run JavaScript and return the result
flarecrawl scrape https://example.com --js-eval "document.title" --json
flarecrawl scrape https://example.com --js-eval "document.querySelectorAll('a').length" --json

Stdin piping & HAR capture

# Process local HTML without API call
cat page.html | flarecrawl scrape --stdin --only-main-content
curl https://example.com | flarecrawl scrape --stdin --format schema --json

# Save request metadata to HAR file
flarecrawl scrape https://example.com --har requests.har --json

# Save raw HTML alongside output (for archival/reprocessing)
flarecrawl scrape https://example.com --backup-dir ./html-backup
flarecrawl download https://example.com --limit 20 --backup-dir ./html-backup

URL discovery

# Discover all URLs via sitemaps, RSS feeds, and page links
flarecrawl discover https://example.com --json

# Sitemaps only
flarecrawl discover https://example.com --no-feed --no-links --json

# With limit
flarecrawl discover https://example.com --limit 100 --json

Cookie banner removal, language, archive fallback

# Remove cookie banners, GDPR modals, newsletter popups
flarecrawl scrape https://eu-site.example.com --magic

# Request content in a specific language
flarecrawl scrape https://example.com --language de

# Fallback to Internet Archive if page returns 404
flarecrawl scrape https://dead-link.example.com --archived

Enhanced content extraction

Multi-strategy content extraction that applies per-site optimisations before falling back to standard browser rendering. Useful for sites that serve content in SSR HTML but hide it with client-side JavaScript, or sites behind bot detection that block default request patterns.

# Enhanced extraction with multi-strategy cascade
flarecrawl scrape https://example.com/article --paywall

# JSON output shows which extraction strategy worked
flarecrawl scrape https://example.com/article --paywall --json
# metadata.source indicates the strategy used

# Batch mode
flarecrawl scrape --batch urls.txt --paywall --workers 5

# No Cloudflare auth required (direct HTTP strategies)
# With auth: falls through to browser rendering if direct strategies fail

Strategies are tried in order of speed and cost. Direct HTTP strategies consume zero browser time. When Cloudflare auth is configured, per-site header optimisations are also applied to browser rendering requests as a fallback.

Optional dependency for improved compatibility: pip install curl_cffi

Stealth mode

Opt-in browser TLS fingerprint impersonation for direct HTTP requests. Uses curl_cffi to send requests with a real Safari/Chrome TLS handshake, avoiding bot detection systems that fingerprint JA3/JA4 hashes.

# Stealth mode for content negotiation and direct fetches
flarecrawl scrape https://example.com --stealth

# Combine with paywall extraction
flarecrawl scrape https://example.com --paywall --stealth --json

# Batch mode
flarecrawl scrape --batch urls.txt --stealth --workers 5

Requires: pip install curl_cffi

Without --stealth, requests use Python's default TLS stack (httpx/OpenSSL) which is identifiable by bot detection systems. With --stealth, the TLS Client Hello is indistinguishable from a real Safari browser.

Content cleanup

Markdown output is automatically cleaned of common ad placeholders, share buttons, newsletter prompts, copyright lines, and navigation chrome. No flag needed - applied to all markdown output by default.

For HTML output, use --clean to strip ad/promo DOM elements:

# Strip ad containers, social share widgets, cookie banners from HTML
flarecrawl scrape https://example.com --format html --clean --json

Agent-safe mode

Sanitise scraped content against adversarial AI agent traps. See Agent Safety for full details.

flarecrawl scrape https://example.com --agent-safe --json
flarecrawl scrape https://example.com --agent-safe --paywall --stealth --json
flarecrawl crawl https://example.com --wait --limit 50 --agent-safe

Web search

Search the web and optionally scrape each result. Uses Jina Search API.

# Search and get results as JSON
flarecrawl search "python web scraping" --json

# Search and scrape each result through the normal pipeline
flarecrawl search "topic" --scrape --limit 5 --paywall --json

# Pipeline: search -> extract URLs -> batch scrape
flarecrawl search "query" --json | jq -r '.data[].url' > urls.txt
flarecrawl scrape --batch urls.txt --paywall --stealth --workers 5

Requires JINA_API_KEY env var (free at https://jina.ai/api-key). With --scrape, all scrape flags apply (--paywall, --stealth, --only-main-content, --clean).

Proxy support

Route all HTTP requests through a proxy. Affects CLI-to-Cloudflare API connections and direct HTTP (stealth, negotiate, paywall). Does NOT affect the Cloudflare browser-to-target connection (CF's browser uses its own IP).

# HTTP proxy
flarecrawl scrape https://example.com --proxy http://proxy:8080

# SOCKS5 proxy
flarecrawl scrape https://example.com --proxy socks5://localhost:9050

# Set default via env var
export FLARECRAWL_PROXY=socks5://localhost:9050
flarecrawl scrape https://example.com   # uses env var proxy

Supported on all commands: scrape, crawl, download, extract, search.

Per-site rules

Customisable per-site header rules for enhanced extraction. Default rules ship with the package; user overrides are loaded from ~/.config/flarecrawl/rules.yaml.

# List all loaded rules
flarecrawl rules list --json

# Show rules for a domain
flarecrawl rules show www.nytimes.com

# Add a custom rule
flarecrawl rules add example.com --referer https://www.google.com/ --cookie "auth=1"

# Show file paths
flarecrawl rules path

Rules use Ladder-compatible YAML format:

- domain: www.example.com
  headers:
    Referer: "https://www.google.com/"
    Cookie: "session=abc"
- domains:
    - a.com
    - b.com
  headers:
    Referer: "https://t.co/x?amp=1"

Markdown content negotiation

Sites on Cloudflare (Pro+) can serve markdown directly via Accept: text/markdown content negotiation. Flarecrawl auto-detects this on every scrape — when a site supports it, content is fetched via a simple HTTP GET instead of headless Chromium.

# Auto-detect (default) — tries content negotiation first
flarecrawl scrape https://blog.cloudflare.com/some-post

# Force browser rendering (skip negotiation)
flarecrawl scrape https://blog.cloudflare.com/some-post --no-negotiate

# JSON output shows the source
flarecrawl scrape https://blog.cloudflare.com/some-post --json
# metadata.source: "content-negotiation" (no browser) or "browser-rendering"
# metadata.markdownTokens: 1234 (from x-markdown-tokens header)
# metadata.contentSignal: {"ai-train": "yes", ...}

Benefits when negotiation succeeds:

Faster — ~100-200ms vs 2-3s for browser rendering
Cheaper — zero browser time consumed
Higher quality — server-side conversion by the site owner
Domain cached — one probe per domain, batch-friendly

Change tracking

# Compare current content against cached version
flarecrawl scrape https://example.com --diff --json

crawl — Crawl a website

# Start crawl and wait for results
flarecrawl crawl https://example.com --wait --limit 50

# With progress indicator
flarecrawl crawl https://example.com --wait --progress --limit 100

# Fire and forget (returns job ID)
flarecrawl crawl https://example.com --limit 50

# Check status of running crawl
flarecrawl crawl JOB_ID --status

# Get results from completed crawl
flarecrawl crawl JOB_ID

# Filter paths
flarecrawl crawl https://docs.example.com --wait --limit 200 \
  --include-paths "/docs,/api" --exclude-paths "/zh,/ja"

# Stream results as NDJSON (one record per line)
flarecrawl crawl JOB_ID --ndjson --fields url,markdown

# Skip JS rendering for faster crawl
flarecrawl crawl https://example.com --wait --limit 100 --no-render

# Follow subdomains
flarecrawl crawl https://example.com --wait --allow-subdomains

# Save to file
flarecrawl crawl https://example.com --wait --limit 50 -o results.json

map — Discover URLs

# List all links on a page
flarecrawl map https://example.com

# JSON output
flarecrawl map https://example.com --json

# Include subdomains
flarecrawl map https://example.com --include-subdomains

# Limit results
flarecrawl map https://example.com --limit 20 --json

download — Save site to disk

# Download as markdown files to .flarecrawl/
flarecrawl download https://docs.example.com --limit 50

# Download as HTML
flarecrawl download https://example.com --limit 20 --format html

# Filter paths
flarecrawl download https://docs.example.com --limit 100 \
  --include-paths "/docs" --exclude-paths "/changelog"

# Skip confirmation prompt
flarecrawl download https://example.com --limit 10 -y

Files are saved to .flarecrawl/<domain>/ with sanitized filenames.

extract — AI-powered data extraction

# Extract structured data with a natural language prompt
flarecrawl extract "Get all product names and prices" \
  --urls https://shop.example.com --json

# With JSON schema for structured output
flarecrawl extract "Extract article metadata" \
  --urls https://blog.example.com \
  --schema '{"type":"json_schema","schema":{"type":"object","properties":{"title":{"type":"string"},"date":{"type":"string"}}}}'

# Schema from file
flarecrawl extract "Extract data" --urls https://example.com --schema-file schema.json

# Multiple URLs
flarecrawl extract "Get page title" --urls https://a.com,https://b.com --json

# Batch mode: parallel extraction with NDJSON output
flarecrawl extract "Get page title" --batch urls.txt --workers 5

Uses Cloudflare Workers AI for extraction (no additional cost).

screenshot — Capture web pages

# Default: saves to screenshot.png
flarecrawl screenshot https://example.com

# Custom output path
flarecrawl screenshot https://example.com -o hero.png

# Full page
flarecrawl screenshot https://example.com -o full.png --full-page

# Specific element
flarecrawl screenshot https://example.com --selector "main" -o main.png

# Custom viewport
flarecrawl screenshot https://example.com --width 1440 --height 900 -o wide.png

# JPEG format
flarecrawl screenshot https://example.com --format jpeg -o page.jpg

# JSON output (base64 encoded)
flarecrawl screenshot https://example.com --json

pdf — Render pages as PDF

# Default: saves to page.pdf
flarecrawl pdf https://example.com

# Custom output
flarecrawl pdf https://example.com -o report.pdf

# Landscape A4
flarecrawl pdf https://example.com -o report.pdf --landscape --format a4

# JSON output (base64 encoded)
flarecrawl pdf https://example.com --json

favicon — Extract favicon URL

# Get the best (largest) favicon
flarecrawl favicon https://example.com

# Show all found icons
flarecrawl favicon https://example.com --all

# JSON output
flarecrawl favicon https://example.com --json
flarecrawl favicon https://example.com --all --json

Renders the page, parses <link rel="icon">, <link rel="apple-touch-icon">, and related tags. Returns the largest icon found. Falls back to /favicon.ico if no <link> tags found.

usage — Track browser time

# Show today's usage and history
flarecrawl usage

# JSON output
flarecrawl usage --json

Tracks the X-Browser-Ms-Used header from each API response locally. Free tier is 600,000ms (10 minutes) per day.

auth — Authentication

flarecrawl auth login                    # Interactive
flarecrawl auth login --account-id ID --token TOKEN  # Non-interactive
flarecrawl auth status                   # Human-readable
flarecrawl auth status --json            # Machine-readable
flarecrawl auth logout                   # Clear credentials

cache — Response cache management

flarecrawl cache status                  # Show entries, size, path
flarecrawl cache status --json           # Machine-readable
flarecrawl cache clear                   # Remove all cached responses

Responses are cached for 1 hour by default. Use --no-cache on any command to bypass.

negotiate — Domain capability cache

flarecrawl negotiate status              # Show domains that support text/markdown
flarecrawl negotiate status --json       # Machine-readable
flarecrawl negotiate clear               # Reset domain cache

Tracks which domains respond to Accept: text/markdown. Positive results cached 7 days, negative 24 hours.

Performance features

Response caching — 1-hour TTL, saves redundant browser renders
Connection pooling — persistent httpx session with HTTP/2 support
Resource rejection — skips images/fonts/media/stylesheets for text extraction
JS rendering — opt-in via --js flag (waits for networkidle0)

Environment variables

Variable	Default	Description
`FLARECRAWL_ACCOUNT_ID`	—	Cloudflare account ID
`FLARECRAWL_API_TOKEN`	—	Cloudflare API token
`FLARECRAWL_CACHE_TTL`	3600	Cache TTL in seconds
`FLARECRAWL_MAX_RETRIES`	3	Max retry attempts
`FLARECRAWL_MAX_WORKERS`	10	Max parallel workers
`FLARECRAWL_TIMEOUT`	120	Request timeout in seconds
`FLARECRAWL_PROXY`	-	Default proxy URL (http/https/socks5)
`JINA_API_KEY`	-	Jina API key for `search` command

Firecrawl Compatibility

Flarecrawl follows the same command structure as the firecrawl CLI:

firecrawl command	flarecrawl equivalent	Notes
`firecrawl scrape URL`	`flarecrawl scrape URL`	Same flags
`firecrawl scrape URL1 URL2`	`flarecrawl scrape URL1 URL2`	Concurrent
`firecrawl crawl URL --wait`	`flarecrawl crawl URL --wait`	Same flags
`firecrawl map URL`	`flarecrawl map URL`	Same flags
`firecrawl download URL`	`flarecrawl download URL`	Saves to `.flarecrawl/`
`firecrawl agent PROMPT`	`flarecrawl extract PROMPT`	Uses Workers AI
`firecrawl credit-usage`	`flarecrawl usage`	Local tracking
`firecrawl search QUERY`	`flarecrawl search QUERY`	Via Jina Search API
`firecrawl --status`	`flarecrawl --status`	Same

What's different

search command — uses Jina Search API (requires JINA_API_KEY), with --scrape to scrape results
extract instead of agent — same concept, different name to avoid confusion
favicon command — bonus: extract favicon/apple-touch-icon URLs from pages
schema command — bonus: extract LD+JSON, OpenGraph, Twitter Cards
PDF command — bonus: Cloudflare supports PDF rendering, Firecrawl doesn't
Output directory — .flarecrawl/ instead of .firecrawl/
--only-main-content — client-side via BeautifulSoup (Firecrawl uses server-side extraction)

Output Format

All --json output follows a consistent envelope:

{
  "data": { ... },
  "meta": { "format": "markdown", "count": 1 }
}

Errors:

{
  "error": { "code": "AUTH_REQUIRED", "message": "Not authenticated..." }
}

Exit Codes

Code	Meaning	Action
0	Success	Continue
1	Error	Check stderr for details
2	Auth required	Run `flarecrawl auth login`
3	Not found	Check job ID
4	Validation	Fix arguments
5	Forbidden	Check token permissions
7	Rate limited	Wait and retry

Batch & Parallel

Commands that operate on multiple URLs support batch mode with configurable parallelism.

Batch input (`--batch`)

# Plain text file (one URL per line, # comments supported)
flarecrawl scrape --batch urls.txt --workers 5

# JSON array
flarecrawl scrape --batch urls.json

# NDJSON (one JSON object per line)
flarecrawl extract "Get title" --batch urls.ndjson --workers 3

Input format is auto-detected: starts with [ → JSON array, starts with { → NDJSON, otherwise plain text.

Batch output

Batch mode outputs NDJSON (one JSON object per line) with index correlation:

{"index": 0, "status": "ok", "data": {"url": "https://a.com", "content": "...", "elapsed": 1.2}}
{"index": 1, "status": "error", "error": {"code": "TIMEOUT", "message": "Request timed out..."}}
{"index": 2, "status": "ok", "data": {"url": "https://c.com", "content": "...", "elapsed": 0.8}}

Results are sorted by index. Failed URLs don't stop processing — errors are reported inline.

Workers

Flag	Default	Max	Notes
`--workers` / `-w`	3	10	Matches CF paid tier concurrency limit

# Conservative (free tier: 3 concurrent browsers)
flarecrawl scrape --batch urls.txt --workers 3

# Aggressive (paid tier: up to 10)
flarecrawl scrape --batch urls.txt --workers 10

Supported commands

Command	`--batch`	`--workers`	Notes
`scrape`	Yes	Yes	Also supports `--input` (alias)
`extract`	Yes	Yes	Supplements `--urls`
`crawl`	No	No	Has its own async job system
`screenshot`	No	No	Single URL
`pdf`	No	No	Single URL

Advanced Usage

Raw JSON body passthrough

Every command supports --body to send a raw JSON payload directly to the CF API, bypassing all flag processing:

flarecrawl scrape --body '{
  "url": "https://example.com",
  "gotoOptions": {"waitUntil": "networkidle0", "timeout": 60000},
  "rejectResourceTypes": ["image", "media"]
}' --json

Piping and chaining

# Map URLs then batch scrape them
flarecrawl map https://docs.example.com --json | \
  jq -r '.data[]' | head -10 > urls.txt
flarecrawl scrape --batch urls.txt --workers 5

# Crawl and extract just the markdown
flarecrawl crawl https://example.com --wait --limit 10 --json | \
  jq -r '.data.records[] | select(.status=="completed") | .markdown'

# Stream crawl results through jq
flarecrawl crawl JOB_ID --ndjson --fields url,markdown | \
  jq -r '.url + "\t" + (.markdown | length | tostring)'

Retry behavior

Requests automatically retry up to 3 times on HTTP 429 (rate limited), 502, and 503 errors with exponential backoff. Timeouts also trigger retries.

Agent Safety

Flarecrawl is often the perception layer for AI agent workflows - scraped content flows directly into LLM context windows and RAG systems. The --agent-safe flag sanitises content against adversarial attacks before it reaches the consuming agent.

Background

This implementation is informed by AI Agent Traps (Franklin, Tomasev, Jacobs, Leibo, Osindero - Google DeepMind, March 2026), which identifies six categories of adversarial attacks against autonomous AI agents navigating the web:

Category	Target	Flarecrawl Defence
Content Injection	Agent perception	Defended - hidden text, comments, attributes, unicode tricks, iframes, hidden inputs, CSS class hiding, meta tags
Semantic Manipulation	Agent reasoning	Defended - urgency clusters and authority claims flagged (not removed)
Behavioural Control	Agent actions	Defended - prompt injection patterns detected and stripped
Cognitive State	Agent memory/RAG	Upstream - content provenance via `metadata.source`
Systemic	Multi-agent networks	Upstream - outside scraping layer
Human-in-the-Loop	Human supervisor	Upstream - outside scraping layer

Usage

# Sanitise content for safe agent consumption
flarecrawl scrape https://example.com --agent-safe --json

# Combine with extraction flags
flarecrawl scrape https://example.com --agent-safe --paywall --stealth --json

# Batch mode
flarecrawl scrape --batch urls.txt --agent-safe --workers 5

# All commands support --agent-safe
flarecrawl crawl https://example.com --wait --limit 50 --agent-safe
flarecrawl download https://docs.example.com --limit 50 --agent-safe
flarecrawl extract "Get products" --urls https://shop.example.com --agent-safe --json

# Stdin pipe (sanitises local HTML)
cat page.html | flarecrawl scrape --stdin --agent-safe --json

Two-phase sanitisation pipeline

Content passes through 13 sanitisers in two phases:

Phase 1 (HTML) - 8 sanitisers run on the DOM before markdown conversion:

Sanitiser	Attack Vector	Action
Hidden text	CSS hiding (`display:none`, `opacity:0`, off-screen, `clip-path`, `color:transparent`, etc.)	Removed
HTML comments	Instruction-like content in `<!-- -->` comments	Removed
Suspicious attributes	Injection patterns in `data-*`, `aria-label`, `alt`, `title`	Removed
Unicode tricks	Zero-width characters, bidirectional text overrides	Removed
Hidden iframes	`<iframe src="...">` with external content	Removed
Hidden inputs	`<input type="hidden" value="...">` with instruction patterns	Removed
CSS class hiding	`.d-none`, `.hidden`, `[hidden]` attribute with injection content	Removed
Meta injection	Custom `<meta>` tags with adversarial content (standard tags preserved)	Removed

Phase 2 (Text) - 5 sanitisers run on markdown after conversion:

Sanitiser	Attack Vector	Action
Prompt injection	"ignore previous instructions", "SYSTEM:", role-play, delimiter injection	Removed
Semantic manipulation	Urgency clusters, authority claims	Flagged only
Homoglyph evasion	Cyrillic/Greek lookalike characters bypassing patterns	Removed
Markdown exfiltration	Image URLs with suspicious query params (IP addresses, exfil params)	Flagged only
HTML entity evasion	`Ignore` entity-encoded injection patterns	Removed

False positive prevention

Technique	How it works
Short-line bias	Prompt injection only strips lines <200 chars - articles about injection are preserved
Accessibility preservation	`.sr-only`/`.visually-hidden` text kept unless it matches injection patterns
Standard meta allowlist	`description`, `og:`, `twitter:`, `charset` meta tags never touched
Mixed-script detection	Homoglyph normalisation only on lines with both Latin and Cyrillic/Greek chars
Threshold gates	Hidden elements need 20+ chars; hidden inputs need 50+ chars + pattern match
Flag vs remove	Semantic manipulation and exfiltration are flagged, never removed

JSON output

When --json is used, metadata.agentSafety reports what was detected:

{
  "metadata": {
    "agentSafety": {
      "sanitised": true,
      "findings": [
        {"category": "content_injection", "severity": "high",
         "description": "Hidden text via CSS (2 elements)", "action": "removed", "count": 2},
        {"category": "prompt_injection", "severity": "high",
         "description": "Prompt injection patterns removed (1 lines)", "action": "removed", "count": 1},
        {"category": "semantic_manipulation", "severity": "medium",
         "description": "Urgency language clusters (1 lines)", "action": "flagged", "count": 1}
      ],
      "stats": {
        "removed": 3,
        "flagged": 1,
        "byCategory": {
          "content_injection": 2,
          "prompt_injection": 1,
          "semantic_manipulation": 1
        }
      }
    }
  }
}

Extensibility

New attack vectors are added by writing a function and decorating it:

from flarecrawl.sanitise import register_html, register_text, Finding

@register_html
def sanitise_new_vector(soup):
    """HTML-level sanitiser - mutates soup, returns findings."""
    # ... detection and removal logic ...
    return [Finding(category="content_injection", severity="high",
                    description="New vector (N elements)", action="removed", count=n)]

@register_text
def sanitise_new_text_vector(text):
    """Text-level sanitiser - returns (cleaned_text, findings)."""
    # ... detection logic ...
    return cleaned_text, findings

Add a corpus fixture to tests/corpus/attacks/ and the parametrised test suite picks it up automatically.

Test corpus

The tests/corpus/ directory contains 61 fixture files that serve as both test data and a living catalogue of known attack vectors:

attacks/ (45 files) - adversarial fixtures across 12 categories
benign/ (16 files) - false-positive traps: security articles, responsive CSS, ARIA accessibility, admin UIs, medical urgency, i18n content, code tutorials, system docs, journalism

Parametrised tests validate that every attack file produces findings and every benign file produces zero removals.

Pricing Details

Tier	Browser Time	Concurrent Browsers
Free	10 min/day	Up to 120 (default)
Paid	10 hr/month included, then $0.09/hr	Up to 120 (default), higher on request

Browser time is shared between REST API calls and Workers bindings. Track your usage with flarecrawl usage.

Project Structure

flarecrawl/
├── pyproject.toml              # Package config
├── AGENTS.md                   # AI agent context
├── README.md                   # This file
├── src/flarecrawl/
│   ├── __init__.py             # Version
│   ├── authcrawl.py            # Authenticated BFS crawler (session cookie propagation)
│   ├── batch.py                # Batch processing (parse + parallel workers)
│   ├── cache.py                # File-based response cache
│   ├── cli.py                  # Typer CLI (all commands)
│   ├── cdp.py                  # CDP WebSocket client (persistent sessions, DevTools Protocol)
│   ├── client.py               # CF Browser Run REST API client (httpx pooling, HTTP/2)
│   ├── config.py               # Credentials, usage tracking, env-var config, session storage
│   ├── cookies.py              # Cookie loading (Puppeteer/Netscape/Chrome), conversion, validation
│   ├── design.py               # Design system extraction (tokens, coherence scoring, formatting)
│   ├── extract.py              # HTML extraction (main content, images, schema, tags)
│   ├── fetch.py                # Content-type aware download (binary, JSON, HTML)
│   ├── negotiate.py            # Markdown content negotiation (Accept: text/markdown)
│   ├── openapi.py              # OpenAPI/Swagger spec discovery and validation
│   ├── paywall.py              # Paywall bypass cascade (SSR, Referer, Wayback, Jina)
│   ├── rules.py                # Per-site YAML rulesets (load, merge, cache)
│   ├── search.py               # Web search via Jina Search API
│   ├── sanitise.py             # Agent-safety sanitisation (hidden text, injection, manipulation)
│   ├── stealth.py              # Stealth HTTP (curl_cffi TLS impersonation)
│   └── default_rules.yaml      # Shipped per-site header rules
└── tests/
    ├── conftest.py             # Test fixtures
    ├── corpus.py               # Feature test corpus (80 live tests x 8 sites)
    ├── corpus/                 # Agent-safety attack/benign fixture corpus (48 files)
    │   ├── attacks/            # Adversarial fixtures (35 files, 6 categories)
    │   └── benign/             # False-positive traps (13 files)
    ├── test_batch.py           # Batch module tests
    ├── test_cache.py           # Cache module tests
    ├── test_cli.py             # CLI tests
    ├── test_client.py          # Client tests
    ├── test_design.py          # Design extraction tests
    ├── test_extract.py         # Extract module tests
    ├── test_paywall.py         # Paywall bypass tests
    ├── test_rules.py           # Per-site rules tests
    ├── test_cdp.py             # CDP client tests (69 tests)
    ├── test_sanitise.py        # Agent-safety tests (137 tests + corpus validation)
    └── test_search.py          # Search module tests

Development

# Install dev dependencies
uv tool install --editable . --with pytest --with ruff

# Install CDP support (optional — needed for --cdp, interact, webmcp)
uv pip install websockets

# Run unit tests (723 tests, no network)
pytest tests/ -v

# Run live tests against public sites (needs CF auth)
PYTHONPATH=src pytest tests/live/ -v -m live

# Run CDP live tests (needs CF auth + websockets)
PYTHONPATH=src pytest tests/live/ -v -m cdp

# Lint
ruff check src/

# Reinstall after changes
uv tool install --editable .

Forma Protocol

This tool follows the Forma Protocol.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 167 Commits
.claude		.claude
.forma		.forma
docs		docs
src/flarecrawl		src/flarecrawl
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation