CrawlForge is a distributed web crawling system designed to efficiently collect and process web data at scale.
It leverages asynchronous workers, a Redis-based URL frontier, and a decoupled pipeline for crawling, content extraction, and embedding generation. This architecture enables scalable processing across multiple processes or machines while maintaining clear separation between I/O-bound and compute-bound tasks.
The project demonstrates core backend and data engineering concepts such as distributed systems, asynchronous networking, queue-based task distribution, and data pipelines. It is designed to be extensible for applications like dataset generation, search indexing, and semantic retrieval systems.
CrawlForge uses a three-stage distributed pipeline with Redis as the central broker and FAISS for semantic search:
┌──────────────────────────────────────────────────────────────┐
│                            Redis                             │
│          (URL Queue + Content Queue + Visited Set)           │
└──────────────┬─────────────────────────┬─────────────────────┘
               │                         │
      ┌────────▼───────┐       ┌─────────▼──────────┐
      │  Crawler Svc   │       │    Embedder Svc    │
      │                │       │                    │
      │ - Pop URL      │       │ - Pop content      │
      │ - Fetch page   │       │ - Generate embed   │
      │ - Extract text │       │ - Write JSONL      │
      │ - Push content │       │                    │
      └────────────────┘       └─────────┬──────────┘
                                         │
                                 ┌───────▼────────┐
                                 │ embedded_data  │
                                 │    .jsonl      │
                                 └───────┬────────┘
                                         │
                                 ┌───────▼────────┐
                                 │  FAISS Index   │
                                 └───────┬────────┘
                                         │ (index reloaded every 30s)
                                 ┌───────▼────────┐
                                 │  API Service   │
                                 │  (port 8000)   │
                                 │ - /crawl (POST)│
                                 │ - /search (GET)│
                                 │ - /reload(POST)│
                                 └───────┬────────┘
                                         │
                                      Client
- Crawl Phase: Seed URLs → URL queue → Crawler service fetches pages concurrently
- Parse Phase: HTML parsed → main text extracted → pushed to content queue
- Embed Phase: Embedder service reads content queue → generates 384-dim vectors → writes to data/embedded_data.jsonl
- Index Phase: Every 30s, the API reloads the FAISS index from embedded_data.jsonl
- Search Phase: User queries API → semantic search via FAISS → ranked results returned
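The phases above can be sketched end-to-end in a few lines. This in-process sketch substitutes asyncio.Queue for the Redis queues and stubs out fetching and embedding (fetch_page and embed here are hypothetical stand-ins, not the real modules), so it runs without any external services; in CrawlForge proper, the crawler and embedder are separate processes that communicate through Redis.

```python
import asyncio

async def fetch_page(url: str) -> str:
    # Stand-in for the aiohttp fetch + main-content extraction step.
    return f"text extracted from {url}"

def embed(text: str) -> list[float]:
    # Stand-in for a 384-dim sentence-transformers model.
    return [float(len(text)), 0.0]

async def crawler(url_q: asyncio.Queue, content_q: asyncio.Queue) -> None:
    while not url_q.empty():
        url = await url_q.get()              # Crawl phase: pop a frontier URL
        content = await fetch_page(url)      # Parse phase: extract main text
        await content_q.put((url, content))  # hand off to the embedder

async def embedder(content_q: asyncio.Queue, out: list) -> None:
    while not content_q.empty():
        url, text = await content_q.get()    # Embed phase: pop extracted text
        out.append({"url": url, "embedding": embed(text)})

async def main() -> list:
    url_q, content_q, out = asyncio.Queue(), asyncio.Queue(), []
    for u in ["https://example.com/a", "https://example.com/b"]:
        url_q.put_nowait(u)                  # seed the URL frontier
    await crawler(url_q, content_q)
    await embedder(content_q, out)
    return out

records = asyncio.run(main())
print(len(records), records[0]["url"])
```

Because the stages only share queues, each can be scaled or replaced independently — the same property the Redis-backed version has across machines.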
- URL Queue (Redis): Stores frontier URLs to crawl
- Content Queue (Redis): Stores extracted text from pages
- crawled_data.jsonl: Raw HTML + metadata from crawler (optional storage)
- embedded_data.jsonl: Embeddings + metadata for semantic search
- FAISS Index: In-memory vector index for fast similarity search
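A minimal sketch of the JSONL flow between the embedder and the index builder. The field names (url, text, embedding) are illustrative; the authoritative schema is whatever jsonl_writer.py actually emits.

```python
import json, os, tempfile

# One JSON object per line: append-friendly, streamable, no schema migration
# needed when new records arrive while the API is serving.
records = [
    {"url": "https://example.com", "text": "hello world", "embedding": [0.1, 0.2]},
    {"url": "https://example.org", "text": "semantic search", "embedding": [0.3, 0.4]},
]

path = os.path.join(tempfile.mkdtemp(), "embedded_data.jsonl")

with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")      # embedder side: append one line per page

with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]  # index side: re-read the whole file

print(len(loaded))
```

The periodic FAISS reload works the same way: re-read the file, rebuild the index, swap it in.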
crawlforge/
│
├── pyproject.toml
├── docker-compose.yml
├── Dockerfile
├── DEMO.md
├── README.md
│
├── data/                          # Persistent volume (synced from containers)
│   ├── crawled_data.jsonl
│   └── embedded_data.jsonl
│
└── src/crawlforge/
    ├── main.py                    # Crawler entry point
    ├── seed_url.py                # Interactive URL seeding
    │
    ├── api/
    │   └── search_api.py          # FastAPI server (/crawl, /search, /reload)
    │
    ├── crawler/
    │   └── fetcher.py             # Async HTTP fetch with User-Agent
    │
    ├── parser/
    │   ├── html_parser.py         # Extract links, title
    │   └── content_extractor.py   # Extract main content
    │
    ├── queue/
    │   └── redis_queue.py         # Redis URL queue operations
    │
    ├── scheduler/
    │   └── domain_scheduler.py    # Rate limiting per domain
    │
    ├── storage/
    │   └── jsonl_writer.py        # Async JSONL file writing
    │
    ├── ml/
    │   ├── embedding_model.py     # sentence-transformers wrapper
    │   ├── embedding_worker.py    # Async embedding processing
    │   ├── build_index.py         # Load embeddings → FAISS
    │   ├── vector_store.py        # FAISS wrapper
    │   └── search.py              # CLI semantic search
    │
    └── utils/
        └── url_utils.py           # URL normalization, validation
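As one example of the modules above: a URL frontier only dedupes correctly if URLs are canonicalized before they hit the visited set. The sketch below shows the kind of normalization a url_utils.py-style helper typically performs — lowercased scheme and host, default ports and fragments dropped, trailing slashes trimmed. The actual rules in the module may differ.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Illustrative canonicalization so equivalent URLs dedupe to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Drop default ports so http://x:80/ and http://x/ map to the same key.
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    # Fragments never reach the server, so they are dropped entirely.
    return urlunsplit((scheme, host, path, parts.query, ""))

print(normalize_url("HTTP://Example.COM:80/Docs/#intro"))
```

Canonicalizing before the Redis SADD/SISMEMBER check is what keeps the crawler from fetching the same page under five different spellings.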
Language & Async
- Python 3.12
- asyncio, aiohttp
Web Parsing
- BeautifulSoup
- readability-lxml
Infrastructure
- Redis (URL frontier & queue)
- FastAPI (search API)
- Docker & Docker Compose
ML/Search
- sentence-transformers (embeddings)
- FAISS (vector indexing)
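To make the FAISS role concrete: semantic search is nearest-neighbor lookup over embedding vectors. The brute-force cosine search below shows the same operation on toy 3-dim vectors with an invented corpus; FAISS's flat inner-product index produces the same ranking on L2-normalized vectors, just at 384 dimensions and at scale.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy corpus: URL -> embedding (illustrative values, not real model output).
corpus = {
    "https://example.com/ml":   [0.9, 0.1, 0.0],
    "https://example.com/cats": [0.0, 0.2, 0.9],
}

def search(query_vec: list[float], k: int = 1) -> list[str]:
    # Rank every stored vector by similarity to the query, return top-k URLs.
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [url for url, _ in ranked[:k]]

print(search([1.0, 0.0, 0.0]))
```

Brute force is O(n) per query; FAISS exists to make this fast over millions of vectors via optimized flat scans and approximate index structures.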
Clone the repository:
git clone https://github.com/SxxAq/crawlforge.git
cd crawlforge
Start all services:
mkdir -p data
docker compose up
In another terminal, queue a URL:
curl -X POST http://localhost:8000/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Machine_learning"}'
Wait 30-60 seconds for embeddings, then search:
curl "http://localhost:8000/search?query=machine+learning"
See DEMO.md for a step-by-step walkthrough.
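The same calls can be made from Python. The endpoints, port, and {"url": ...} body mirror the curl examples above; everything else in this sketch is an assumption. Requests are only constructed here — sending them requires the stack to be up.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8000"  # API service port from docker-compose

def crawl_request(url: str) -> Request:
    # POST /crawl with a JSON body, matching the curl example.
    body = json.dumps({"url": url}).encode()
    return Request(f"{BASE}/crawl", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

def search_url(query: str) -> str:
    # GET /search with the query string URL-encoded.
    return f"{BASE}/search?" + urlencode({"query": query})

req = crawl_request("https://en.wikipedia.org/wiki/Machine_learning")
print(req.method, req.full_url)
print(search_url("machine learning"))
# To actually send: urllib.request.urlopen(req)  (needs docker compose up)
```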
Install dependencies:
uv sync
Start Redis (separate terminal):
redis-server
Run the crawler:
python -m crawlforge.main
Run the embedder (separate terminal):
python -m crawlforge.ml.embedding_worker
Run the API (separate terminal):
uvicorn crawlforge.api.search_api:app --reload