Spider

A search engine built from scratch -- crawler, inverted index, BM25 ranking, and PageRank.

Spider is a full-text search engine with a concurrent web crawler written in Go, an inverted index stored in SQLite, and a search frontend built with Next.js. It ranks results by combining BM25 relevance scores with PageRank authority scores, over an index of 345,000+ postings.

Features

Web crawler -- concurrent crawler built with goroutines, respects robots.txt
Inverted index -- term-to-document mapping for fast full-text lookup
BM25 ranking -- probabilistic relevance scoring for search results
PageRank -- link-graph authority scoring to boost high-quality pages
Spell correction -- "Did you mean?" suggestions using edit distance
Page viewer -- preview cached page content directly from search results
In-memory caching -- 60-second TTL cache for sub-second repeat queries
Index size -- 345,000+ postings across thousands of crawled pages
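
As an illustration of the spell-correction feature, here is a minimal Go sketch of "Did you mean?" built on Levenshtein edit distance. The helper names `editDistance` and `suggest`, the in-memory vocabulary, and the distance threshold are assumptions made for this example; the project's own implementation may differ.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance returns the Levenshtein distance between a and b, counting
// single-rune insertions, deletions, and substitutions.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// suggest returns the vocabulary term closest to query, or "" when nothing
// is within maxDist edits. Ties keep the earliest term in vocab.
func suggest(query string, vocab []string, maxDist int) string {
	best, bestDist := "", maxDist+1
	for _, term := range vocab {
		if d := editDistance(query, term); d < bestDist {
			best, bestDist = term, d
		}
	}
	return best
}

func main() {
	// Hypothetical vocabulary; a real one would come from the indexed terms.
	vocab := []string{"search", "spider", "rank"}
	fmt.Println(suggest("serch", vocab, 2))
}
```

A linear scan over the vocabulary is fine for a sketch; a larger term set would likely need a candidate filter before computing distances.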

Architecture

Crawler (Go)

A concurrent web crawler using goroutines for parallel page fetching. It respects robots.txt, extracts text content and hyperlinks, and stores raw page data in SQLite.
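
The fetching pattern can be sketched roughly as follows. This is a simplified, level-by-level crawl with a bounded number of goroutines, not the crawler's actual code. The `fetchFunc` type is a hypothetical stand-in for the real work of checking robots.txt, issuing the HTTP request, and extracting links, and the in-memory `web` map replaces the network so the example runs offline.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc returns the outgoing links of a page. It is a placeholder for
// the robots.txt check, HTTP request, and HTML parsing of a real crawler.
type fetchFunc func(url string) ([]string, error)

// crawl explores pages breadth-first from seed. Pages at depth < maxDepth are
// fetched, with at most `workers` fetches in flight at once. It returns the
// depth at which each URL was first discovered.
func crawl(seed string, maxDepth, workers int, fetch fetchFunc) map[string]int {
	if workers < 1 {
		workers = 1
	}
	seen := map[string]int{seed: 0}
	frontier := []string{seed}
	for depth := 0; depth < maxDepth && len(frontier) > 0; depth++ {
		results := make([][]string, len(frontier))
		sem := make(chan struct{}, workers) // limits concurrent fetches
		var wg sync.WaitGroup
		for i, u := range frontier {
			wg.Add(1)
			sem <- struct{}{}
			go func(i int, u string) {
				defer wg.Done()
				defer func() { <-sem }()
				links, err := fetch(u)
				if err != nil {
					return // skip pages that fail to fetch
				}
				results[i] = links // each goroutine writes its own slot
			}(i, u)
		}
		wg.Wait()
		// Merge results on a single goroutine, so `seen` needs no lock.
		var next []string
		for _, links := range results {
			for _, l := range links {
				if _, ok := seen[l]; !ok {
					seen[l] = depth + 1
					next = append(next, l)
				}
			}
		}
		frontier = next
	}
	return seen
}

func main() {
	// Hypothetical in-memory "web" standing in for real HTTP fetches.
	web := map[string][]string{
		"a": {"b", "c"},
		"b": {"d"},
		"c": {"d", "a"},
		"d": {"e"},
	}
	fetch := func(u string) ([]string, error) { return web[u], nil }
	seen := crawl("a", 2, 4, fetch)
	fmt.Println(len(seen), "pages discovered")
}
```

Fetching one depth level at a time keeps the deduplication map on a single goroutine, at the cost of waiting for the slowest page in each level.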

Indexer (Go)

Processes crawled pages to build an inverted index. Computes term frequencies for each document and constructs the link graph used for PageRank computation.
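
To make the data structure concrete, here is a small in-memory sketch of an inverted index with per-document term frequencies. The `posting` type, the `tokenize` rule (lowercase, split on non-alphanumerics), and the use of slice positions as document IDs are assumptions for the example; the real indexer writes to SQLite and may tokenize differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// posting records how often a term occurs in one document.
type posting struct {
	DocID int
	Freq  int
}

// tokenize lowercases text and splits it on any rune that is neither a
// letter nor a digit.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// buildIndex maps each term to its postings. Documents are processed in
// order, so every postings list is sorted by ascending DocID.
func buildIndex(docs []string) map[string][]posting {
	index := make(map[string][]posting)
	for docID, text := range docs {
		freqs := make(map[string]int)
		for _, tok := range tokenize(text) {
			freqs[tok]++
		}
		for term, f := range freqs {
			index[term] = append(index[term], posting{DocID: docID, Freq: f})
		}
	}
	return index
}

func main() {
	docs := []string{
		"Go is fast. Go is fun.",
		"Search engines rank pages",
		"fast search",
	}
	index := buildIndex(docs)
	fmt.Println(index["fast"])
}
```

A lookup for a query term is then a single map access, which is what makes the term-to-document mapping fast compared with scanning every page.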

PageRank (Go)

Iterative PageRank computation over the link graph. Authority scores are stored per page and used as a boost factor during search ranking.
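
The iteration can be sketched as a standard power-iteration loop. The damping factor, the iteration count, and the choice to spread the rank of pages without outgoing links evenly over all pages are conventional assumptions here, not values taken from the project.

```go
package main

import "fmt"

// pageRank runs a fixed number of power iterations over a directed graph.
// links[i] lists the pages that page i links to; every index must be a valid
// page number. The returned ranks sum to 1.
func pageRank(links [][]int, damping float64, iterations int) []float64 {
	n := len(links)
	if n == 0 {
		return nil
	}
	rank := make([]float64, n)
	for i := range rank {
		rank[i] = 1.0 / float64(n)
	}
	for it := 0; it < iterations; it++ {
		next := make([]float64, n)
		dangling := 0.0 // rank held by pages with no outgoing links
		for i, out := range links {
			if len(out) == 0 {
				dangling += rank[i]
				continue
			}
			share := rank[i] / float64(len(out))
			for _, j := range out {
				next[j] += share
			}
		}
		for j := range next {
			next[j] = (1-damping)/float64(n) + damping*(next[j]+dangling/float64(n))
		}
		rank = next
	}
	return rank
}

func main() {
	// Page 2 is linked from pages 0, 1, and 3; page 3 has no inbound links.
	links := [][]int{{1, 2}, {2}, {0}, {2}}
	for page, r := range pageRank(links, 0.85, 50) {
		fmt.Printf("page %d: %.4f\n", page, r)
	}
}
```

A production version would typically stop once the ranks change by less than a tolerance rather than after a fixed number of rounds.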

Search API (TypeScript / Next.js)

The search API ranks results in two phases, then combines the scores:

Lightweight scoring -- BM25 scores computed using only the postings table (no JOINs)
Top-K enrichment -- page details fetched only for the top 60 candidates
Combined ranking -- BM25 relevance + PageRank authority boost
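
The steps above can be sketched as follows. The Search API itself is written in TypeScript; this sketch uses Go only to stay consistent with the rest of the codebase. The BM25 parameters `k1 = 1.2` and `b = 0.75` are common defaults, not confirmed project values. The README does not give the combination formula, so the multiplicative boost `bm25 * (1 + alpha * pageRank)` and the `alpha` weight are assumptions, as is the availability of document lengths alongside the postings.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

const (
	k1 = 1.2  // term-frequency saturation (common default)
	b  = 0.75 // length-normalization strength (common default)
)

type scored struct {
	DocID int
	Score float64
}

// idf is the BM25 inverse document frequency with +1 inside the log, which
// keeps the weight non-negative for very common terms.
func idf(totalDocs, docFreq int) float64 {
	return math.Log(1 + (float64(totalDocs)-float64(docFreq)+0.5)/(float64(docFreq)+0.5))
}

// bm25Scores is phase one: it reads only postings (term -> doc -> frequency)
// and document lengths, never the page details.
func bm25Scores(query []string, postings map[string]map[int]int, docLen map[int]int) map[int]float64 {
	scores := make(map[int]float64)
	if len(docLen) == 0 {
		return scores
	}
	total := 0
	for _, l := range docLen {
		total += l
	}
	avgLen := float64(total) / float64(len(docLen))
	for _, term := range query {
		docs := postings[term]
		w := idf(len(docLen), len(docs))
		for id, tf := range docs {
			f := float64(tf)
			norm := k1 * (1 - b + b*float64(docLen[id])/avgLen)
			scores[id] += w * f * (k1 + 1) / (f + norm)
		}
	}
	return scores
}

// sortScored orders by descending score, breaking ties by ascending DocID.
func sortScored(s []scored) {
	sort.Slice(s, func(i, j int) bool {
		if s[i].Score != s[j].Score {
			return s[i].Score > s[j].Score
		}
		return s[i].DocID < s[j].DocID
	})
}

// topK keeps the k best candidates by BM25 score.
func topK(scores map[int]float64, k int) []scored {
	out := make([]scored, 0, len(scores))
	for id, s := range scores {
		out = append(out, scored{id, s})
	}
	sortScored(out)
	if len(out) > k {
		out = out[:k]
	}
	return out
}

// rerank is phase two: only the surviving candidates receive the assumed
// PageRank boost before the final ordering.
func rerank(cands []scored, pageRank map[int]float64, alpha float64) []scored {
	out := make([]scored, len(cands))
	for i, c := range cands {
		out[i] = scored{c.DocID, c.Score * (1 + alpha*pageRank[c.DocID])}
	}
	sortScored(out)
	return out
}

func main() {
	postings := map[string]map[int]int{"go": {1: 3, 2: 1}, "crawler": {2: 2, 3: 1}}
	docLen := map[int]int{1: 100, 2: 100, 3: 100}
	pageRank := map[int]float64{1: 0.9, 2: 0.05, 3: 0.05}
	cands := topK(bm25Scores([]string{"go", "crawler"}, postings, docLen), 60)
	fmt.Println(rerank(cands, pageRank, 1.0))
}
```

Deferring the page lookup until after the top-K cut means the expensive per-page data is read for at most 60 rows per query, regardless of how many documents match.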

An in-memory cache (60-second TTL, at most 200 entries) serves repeated queries without another database round-trip.
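
A cache with these two limits might look like the sketch below. It is written in Go for consistency with the other examples, while the real cache lives in the TypeScript API. The README does not say how entries are evicted when the cache is full, so dropping the oldest inserted entry is an assumption. The `ttlCache` and `fakeClock` names are hypothetical, and the clock is injectable only so that expiry can be exercised without sleeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

// ttlCache is a small in-memory cache with per-entry expiry and a size cap.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	maxSize int
	items   map[string]entry
	order   []string         // keys in insertion order, oldest first
	now     func() time.Time // injectable clock
}

func newTTLCache(ttl time.Duration, maxSize int) *ttlCache {
	if maxSize < 1 {
		maxSize = 1
	}
	return &ttlCache{ttl: ttl, maxSize: maxSize, items: make(map[string]entry), now: time.Now}
}

// remove deletes key from both the map and the insertion-order list.
// Callers must hold c.mu.
func (c *ttlCache) remove(key string) {
	delete(c.items, key)
	for i, k := range c.order {
		if k == key {
			c.order = append(c.order[:i], c.order[i+1:]...)
			break
		}
	}
}

// Get returns the cached value unless it is missing or expired.
func (c *ttlCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expires) {
		c.remove(key)
		return "", false
	}
	return e.value, true
}

// Set stores value under key, evicting the oldest entry when the cache is full.
func (c *ttlCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.items[key]; ok {
		c.remove(key)
	} else if len(c.items) >= c.maxSize {
		c.remove(c.order[0])
	}
	c.items[key] = entry{value: value, expires: c.now().Add(c.ttl)}
	c.order = append(c.order, key)
}

// fakeClock is a manually advanced clock for demonstrations and tests.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clock := &fakeClock{}
	cache := newTTLCache(60*time.Second, 200)
	cache.now = clock.Now
	cache.Set("golang crawler", "cached results")
	_, hit := cache.Get("golang crawler")
	clock.Advance(61 * time.Second)
	_, hitAfterTTL := cache.Get("golang crawler")
	fmt.Println(hit, hitAfterTTL)
}
```

With at most 200 entries, the linear scan in `remove` stays cheap; a larger cache would usually pair the map with a linked list instead.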

Database (SQLite / Turso Cloud)

| Table | Purpose |
| --- | --- |
| pages | URL, title, body text, PageRank score |
| terms | Unique terms from the corpus |
| postings | Term-to-page mapping with frequency |
| links | Directed edges for the link graph |

The database is hosted on Turso Cloud, which provides edge-distributed access through libSQL.

Tech Stack

| Component | Technology |
| --- | --- |
| Crawler | Go (goroutines, net/http) |
| Indexer | Go |
| PageRank | Go (iterative computation) |
| Search API | TypeScript / Next.js |
| Database | SQLite / Turso Cloud (libSQL) |
| Frontend | Next.js / React |
| Ranking | BM25 + PageRank |

Getting Started

Prerequisites

Go 1.21+
Node.js (v18 or later)
A Turso database (or local SQLite file)

Crawl and Index

cd crawler
go run . --seed "https://example.com" --depth 3

Run the Search Frontend

cd web
npm install
npm run dev

The search interface will be available at http://localhost:3000/spider.

Performance

| Metric | Value |
| --- | --- |
| Indexed postings | 345,000+ |
| Search latency | < 1 second |
| Cache TTL | 60 seconds |
| Top-K candidates | 60 per query |

License

This project is licensed under the MIT License.
