A search engine built from scratch -- crawler, inverted index, BM25 ranking, and PageRank.
Spider is a full-text search engine with a concurrent web crawler written in Go, an inverted index stored in SQLite, and a search frontend built with Next.js. It combines BM25 relevance scoring with PageRank authority scoring to rank results drawn from 345,000+ indexed postings.
- Web crawler -- concurrent crawler built with goroutines, respects robots.txt
- Inverted index -- term-to-document mapping for fast full-text lookup
- BM25 ranking -- probabilistic relevance scoring for search results
- PageRank -- link-graph authority scoring to boost high-quality pages
- Spell correction -- "Did you mean?" suggestions using edit distance
- Page viewer -- preview cached page content directly from search results
- In-memory caching -- 60-second TTL cache for sub-second repeat queries
- 345,000+ postings indexed across thousands of crawled pages
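The "Did you mean?" suggestions above can be sketched with a standard Levenshtein edit distance against the indexed vocabulary. This is a minimal illustration; the function names and the distance threshold are assumptions, not the project's actual API.

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-character edits
// (insertions, deletions, substitutions) needed to turn a into b,
// using the classic two-row dynamic-programming table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// min is the built-in added in Go 1.21 (the README's minimum).
			curr[j] = min(curr[j-1]+1, prev[j]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// suggest picks the vocabulary term closest to the query, if any term
// is within maxDist edits.
func suggest(query string, vocab []string, maxDist int) (string, bool) {
	best, bestDist := "", maxDist+1
	for _, term := range vocab {
		if d := levenshtein(query, term); d < bestDist {
			best, bestDist = term, d
		}
	}
	return best, bestDist <= maxDist
}

func main() {
	vocab := []string{"search", "engine", "crawler", "index"}
	s, ok := suggest("serch", vocab, 2)
	fmt.Println(s, ok) // search true
}
```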
A concurrent web crawler using goroutines for parallel page fetching. It respects robots.txt, extracts text content and hyperlinks, and stores raw page data in SQLite.
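The worker-pool shape of such a crawler can be sketched as below. This is a simplified skeleton, not the project's code: `fetch` stands in for the real HTTP GET + robots.txt check + SQLite write, and the frontier is a buffered channel shared by the goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl runs nWorkers goroutines over a shared frontier, deduplicating
// URLs and bounding crawl depth. fetch returns a page's out-links; in a
// real crawler it would perform the HTTP request, honor robots.txt, and
// persist the page body.
func crawl(seed string, depth, nWorkers int, fetch func(url string) []string) []string {
	type job struct {
		url   string
		depth int
	}
	var (
		mu      sync.Mutex
		seen    = map[string]bool{seed: true}
		visited []string
		wg      sync.WaitGroup
	)
	jobs := make(chan job, 1024) // frontier; sized for small crawls in this sketch
	jobs <- job{seed, 0}
	wg.Add(1) // one pending job

	for i := 0; i < nWorkers; i++ {
		go func() {
			for j := range jobs {
				links := fetch(j.url) // network work happens outside the lock
				mu.Lock()
				visited = append(visited, j.url)
				if j.depth < depth {
					for _, l := range links {
						if !seen[l] {
							seen[l] = true
							wg.Add(1)
							jobs <- job{l, j.depth + 1}
						}
					}
				}
				mu.Unlock()
				wg.Done()
			}
		}()
	}
	wg.Wait()
	close(jobs)
	return visited
}

func main() {
	graph := map[string][]string{"a": {"b", "c"}, "b": {"c"}}
	pages := crawl("a", 2, 4, func(u string) []string { return graph[u] })
	fmt.Println(len(pages)) // 3
}
```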
Processes crawled pages to build an inverted index. Computes term frequencies for each document and constructs the link graph used for PageRank computation.
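The core of the indexing step — tokenize each body, count term frequencies, append postings — can be illustrated in a few lines. Tokenization here is just lowercased whitespace splitting; the real indexer presumably does more.

```go
package main

import (
	"fmt"
	"strings"
)

// Posting records how often a term occurs in one document.
type Posting struct {
	DocID int
	Freq  int
}

// buildIndex tokenizes each document body and accumulates an inverted
// index: term -> list of (docID, term frequency) postings.
func buildIndex(docs map[int]string) map[string][]Posting {
	index := make(map[string][]Posting)
	for docID, body := range docs {
		tf := make(map[string]int)
		for _, tok := range strings.Fields(strings.ToLower(body)) {
			tf[tok]++
		}
		for term, freq := range tf {
			index[term] = append(index[term], Posting{DocID: docID, Freq: freq})
		}
	}
	return index
}

func main() {
	docs := map[int]string{
		1: "go search engine",
		2: "search the web search",
	}
	idx := buildIndex(docs)
	fmt.Println(len(idx["search"])) // 2: "search" appears in both documents
}
```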
Iterative PageRank computation over the link graph. Authority scores are stored per page and used as a boost factor during search ranking.
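The iterative (power-method) computation follows the standard recurrence: `rank[p] = (1-d)/N + d * Σ rank[q]/outdegree(q)` over pages `q` linking to `p`, with dangling pages redistributing their rank uniformly. A minimal sketch:

```go
package main

import "fmt"

// pagerank iterates the PageRank recurrence over a directed link graph
// (page -> out-links) with damping factor d. Dangling pages (no
// out-links) spread their rank evenly across all pages.
func pagerank(links map[int][]int, iters int, d float64) map[int]float64 {
	n := float64(len(links))
	rank := make(map[int]float64, len(links))
	for p := range links {
		rank[p] = 1 / n
	}
	for i := 0; i < iters; i++ {
		next := make(map[int]float64, len(links))
		for p := range links {
			next[p] = (1 - d) / n // teleportation term
		}
		dangling := 0.0
		for p, outs := range links {
			if len(outs) == 0 {
				dangling += rank[p]
				continue
			}
			share := rank[p] / float64(len(outs))
			for _, q := range outs {
				next[q] += d * share
			}
		}
		for p := range next {
			next[p] += d * dangling / n
		}
		rank = next
	}
	return rank
}

func main() {
	// A symmetric 3-cycle converges to 1/3 per page.
	links := map[int][]int{1: {2}, 2: {3}, 3: {1}}
	r := pagerank(links, 50, 0.85)
	fmt.Printf("%.3f\n", r[1]) // 0.333
}
```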
The search API scores results using a two-phase approach:
- Lightweight scoring -- BM25 scores computed using only the postings table (no JOINs)
- Top-K enrichment -- page details fetched only for the top 60 candidates
- Combined ranking -- BM25 relevance + PageRank authority boost
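The two phases above can be sketched as follows. The BM25 formula is standard (Okapi weighting with `k1=1.2`, `b=0.75`); how the PageRank boost is combined with the BM25 score is an assumption here — a simple weighted sum stands in for whatever the real API does.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Posting is one row of the postings table for a query term.
type Posting struct {
	DocID int
	Freq  int
}

// bm25 scores one term/document pair. df is the term's document
// frequency, n the corpus size, dl/avgdl the document and average
// document lengths in tokens.
func bm25(tf, df, n int, dl, avgdl float64) float64 {
	k1, b := 1.2, 0.75
	idf := math.Log(1 + (float64(n)-float64(df)+0.5)/(float64(df)+0.5))
	f := float64(tf)
	return idf * f * (k1 + 1) / (f + k1*(1-b+b*dl/avgdl))
}

// rank mirrors the two-phase approach: phase 1 scores every posting
// with BM25 alone; phase 2 keeps only the top-K candidates and applies
// the PageRank boost (the 0.5 weight is illustrative).
func rank(postings []Posting, df, n int, docLen map[int]float64, avgdl float64,
	pageRank map[int]float64, k int) []int {
	scores := make(map[int]float64)
	for _, p := range postings {
		scores[p.DocID] += bm25(p.Freq, df, n, docLen[p.DocID], avgdl)
	}
	ids := make([]int, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
	if len(ids) > k {
		ids = ids[:k] // enrichment: fetch page details only for these
	}
	sort.Slice(ids, func(i, j int) bool {
		return scores[ids[i]]+0.5*pageRank[ids[i]] > scores[ids[j]]+0.5*pageRank[ids[j]]
	})
	return ids
}

func main() {
	docLen := map[int]float64{1: 100, 2: 100, 3: 100}
	pr := map[int]float64{1: 0.1, 2: 0.9, 3: 0.1}
	fmt.Println(rank([]Posting{{1, 3}, {2, 1}, {3, 1}}, 3, 1000, docLen, 100, pr, 60))
}
```

Scoring from the postings table alone (phase 1) avoids per-row JOINs; only the surviving top-K rows pay for the page lookup.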
An in-memory cache (60s TTL, max 200 entries) eliminates redundant database round-trips.
| Table | Purpose |
|---|---|
| pages | URL, title, body text, PageRank score |
| terms | Unique terms from the corpus |
| postings | Term-to-page mapping with frequency |
| links | Directed edges for the link graph |
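The four tables above might be created along these lines; column names and types are assumptions for illustration, not the project's exact DDL:

```sql
CREATE TABLE pages (
    id       INTEGER PRIMARY KEY,
    url      TEXT UNIQUE NOT NULL,
    title    TEXT,
    body     TEXT,
    pagerank REAL DEFAULT 0
);
CREATE TABLE terms (
    id   INTEGER PRIMARY KEY,
    term TEXT UNIQUE NOT NULL
);
CREATE TABLE postings (
    term_id INTEGER REFERENCES terms(id),
    page_id INTEGER REFERENCES pages(id),
    freq    INTEGER NOT NULL,
    PRIMARY KEY (term_id, page_id)
);
CREATE TABLE links (
    src INTEGER REFERENCES pages(id),
    dst INTEGER REFERENCES pages(id),
    PRIMARY KEY (src, dst)
);
```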
Hosted on Turso Cloud for edge-distributed access with libSQL.
| Component | Technology |
|---|---|
| Crawler | Go (goroutines, net/http) |
| Indexer | Go |
| PageRank | Go (iterative computation) |
| Search API | TypeScript / Next.js |
| Database | SQLite / Turso Cloud (libSQL) |
| Frontend | Next.js / React |
| Ranking | BM25 + PageRank |
- Go 1.21+
- Node.js (v18 or later)
- A Turso database (or local SQLite file)
```sh
cd crawler
go run . --seed "https://example.com" --depth 3
```

```sh
cd web
npm install
npm run dev
```

The search interface will be available at http://localhost:3000/spider.
| Metric | Value |
|---|---|
| Indexed postings | 345,000+ |
| Search latency | < 1 second |
| Cache TTL | 60 seconds |
| Top-K candidates | 60 per query |
This project is licensed under the MIT License.