orbita-pos/spider

Spider

A search engine built from scratch -- crawler, inverted index, BM25 ranking, and PageRank.

Spider is a full-text search engine with a concurrent web crawler written in Go, an inverted index stored in SQLite, and a search frontend built with Next.js. It ranks results by combining BM25 relevance scoring with PageRank authority, over an index of 345,000+ postings.


Features

  • Web crawler -- concurrent crawler built with goroutines, respects robots.txt
  • Inverted index -- term-to-document mapping for fast full-text lookup
  • BM25 ranking -- probabilistic relevance scoring for search results
  • PageRank -- link-graph authority scoring to boost high-quality pages
  • Spell correction -- "Did you mean?" suggestions using edit distance
  • Page viewer -- preview cached page content directly from search results
  • In-memory caching -- 60-second TTL cache for sub-second repeat queries
  • 345,000+ postings indexed across thousands of crawled pages
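The spell-correction feature is described only as "using edit distance". As a rough illustration, the sketch below assumes Levenshtein distance and a linear scan over the vocabulary; `editDistance`, `suggest`, and the `maxDistance` threshold of 2 are hypothetical names and values, not taken from the project.

```typescript
// Levenshtein edit distance: minimum number of single-character
// insertions, deletions, or substitutions needed to turn a into b.
function editDistance(a: string, b: string): number {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete from a
        curr[j - 1] + 1,    // insert into a
        prev[j - 1] + cost, // substitute (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: closest vocabulary term within maxDistance edits,
// or null when the query is already a known term or nothing is close enough.
function suggest(query: string, vocabulary: string[], maxDistance = 2): string | null {
  let best: string | null = null;
  let bestDist = maxDistance + 1;
  for (const term of vocabulary) {
    const d = editDistance(query, term);
    if (d < bestDist) {
      best = term;
      bestDist = d;
    }
  }
  return bestDist === 0 ? null : best;
}

console.log(suggest("serch", ["search", "crawler", "index"]));
```

A linear scan is only practical for small vocabularies; a real implementation would likely narrow the candidates first (for example by term length or shared prefixes).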

Architecture

Crawler (Go)

A concurrent web crawler using goroutines for parallel page fetching. It respects robots.txt, extracts text content and hyperlinks, and stores raw page data in SQLite.
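To illustrate how extracted hyperlinks can drive a depth-limited crawl (compare the `--depth` flag under Getting Started), here is a minimal sketch written in TypeScript rather than the crawler's Go. It assumes a breadth-first traversal with the seed at depth 0, and it leaves out concurrency, HTTP fetching, and robots.txt handling. `crawlOrder` and `getLinks` are hypothetical names; `getLinks` stands in for "fetch the page and extract its links".

```typescript
// Depth-limited breadth-first traversal over a link graph.
function crawlOrder(
  seed: string,
  maxDepth: number,
  getLinks: (url: string) => string[],
): string[] {
  const visited = new Set<string>([seed]);
  const order: string[] = [];
  let frontier: string[] = [seed];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      order.push(url);
      if (depth === maxDepth) continue; // pages at the limit are visited but not expanded
      for (const link of getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link); // mark on discovery so each URL is queued once
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return order;
}

// Toy link graph used in place of real pages.
const graph: Record<string, string[]> = {
  a: ["b", "c"],
  b: ["d"],
  c: ["a", "d"],
  d: ["e"],
};
console.log(crawlOrder("a", 2, (url) => graph[url] ?? []));
```

In a concurrent crawler, the URLs in each frontier could be fetched in parallel, with the visited set guarded against simultaneous writes.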

Indexer (Go)

Processes crawled pages to build an inverted index. Computes term frequencies for each document and constructs the link graph used for PageRank computation.
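A minimal sketch of the indexing step, written in TypeScript rather than the indexer's Go. The tokenizer (lowercase, split on non-alphanumeric ASCII) and the names `Posting`, `tokenize`, and `buildIndex` are assumptions for illustration; the project's actual tokenization rules are not documented here.

```typescript
type Doc = { id: number; text: string };
type Posting = { docId: number; tf: number };

// Simplistic tokenizer: lowercase, then split on anything that is not a-z or 0-9.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Build term -> postings, where each posting records the term frequency in one document.
function buildIndex(docs: Doc[]): Map<string, Posting[]> {
  const index = new Map<string, Posting[]>();
  for (const doc of docs) {
    // Count occurrences of each term within this document.
    const counts = new Map<string, number>();
    for (const term of tokenize(doc.text)) {
      counts.set(term, (counts.get(term) ?? 0) + 1);
    }
    // Append one posting per distinct term.
    counts.forEach((tf, term) => {
      const list = index.get(term) ?? [];
      list.push({ docId: doc.id, tf });
      index.set(term, list);
    });
  }
  return index;
}

const index = buildIndex([
  { id: 1, text: "Go crawler, go!" },
  { id: 2, text: "crawler index" },
]);
console.log(index.get("go"));
```

Each map entry corresponds to the rows a `postings` table would hold for one term.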

PageRank (Go)

Iterative PageRank computation over the link graph. Authority scores are stored per page and used as a boost factor during search ranking.
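The iteration can be sketched as follows, in TypeScript rather than the project's Go. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are conventional choices assumed here, not details confirmed by the project.

```typescript
// links[i] lists the pages that page i links to; pages are numbered 0..n-1.
function pageRank(links: number[][], damping = 0.85, iterations = 50): number[] {
  const n = links.length;
  let rank = new Array<number>(n).fill(1 / n);

  for (let it = 0; it < iterations; it++) {
    // Every page starts each round with the "random jump" share.
    const next = new Array<number>(n).fill((1 - damping) / n);
    for (let i = 0; i < n; i++) {
      const out = links[i];
      if (out.length === 0) {
        // Dangling page: spread its rank evenly over all pages.
        for (let j = 0; j < n; j++) next[j] += (damping * rank[i]) / n;
      } else {
        // Split this page's rank equally among the pages it links to.
        for (const j of out) next[j] += (damping * rank[i]) / out.length;
      }
    }
    rank = next;
  }
  return rank;
}

// Page 0 links to 1 and 2, page 1 links to 2, page 2 links back to 0.
console.log(pageRank([[1, 2], [2], [0]]));
```

Because dangling pages redistribute their rank, the scores always sum to 1, which makes them easy to use as a bounded boost factor.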

Search API (TypeScript / Next.js)

The search API ranks results in three steps, deferring expensive page lookups until the candidate set is small:

  1. Lightweight scoring -- BM25 scores computed using only the postings table (no JOINs)
  2. Top-K enrichment -- page details fetched only for the top 60 candidates
  3. Combined ranking -- BM25 relevance + PageRank authority boost
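The steps above can be sketched as follows. Several details are assumptions for illustration: the BM25 parameters (`k1 = 1.2`, `b = 0.75`), the IDF variant, the availability of document lengths without a join, and especially the multiplicative boost `bm25 * (1 + boostWeight * pageRank)`, since the project does not state how the two scores are combined. All function and parameter names are hypothetical.

```typescript
type Posting = { docId: number; tf: number };
type Scored = { docId: number; score: number };

// BM25 contribution of one query term to one document.
function bm25Term(
  tf: number, df: number, docLen: number, avgDocLen: number, nDocs: number,
  k1 = 1.2, b = 0.75,
): number {
  // A common IDF variant that stays non-negative.
  const idf = Math.log(1 + (nDocs - df + 0.5) / (df + 0.5));
  return (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * docLen) / avgDocLen));
}

function search(
  queryTerms: string[],
  postings: Map<string, Posting[]>,
  docLen: Map<number, number>,
  pageRank: Map<number, number>,
  topK = 60,
  boostWeight = 1.0, // hypothetical tuning knob
): Scored[] {
  const nDocs = docLen.size;
  let totalLen = 0;
  docLen.forEach((len) => { totalLen += len; });
  const avgLen = totalLen / nDocs;

  // Step 1: accumulate BM25 per document from postings alone.
  const bm25 = new Map<number, number>();
  for (const term of queryTerms) {
    const list = postings.get(term) ?? [];
    for (const p of list) {
      const len = docLen.get(p.docId) ?? avgLen;
      const s = bm25Term(p.tf, list.length, len, avgLen, nDocs);
      bm25.set(p.docId, (bm25.get(p.docId) ?? 0) + s);
    }
  }

  // Step 2: keep only the top K candidates; page details would be fetched for these.
  const candidates = Array.from(bm25.entries())
    .sort((x, y) => y[1] - x[1])
    .slice(0, topK);

  // Step 3: apply the PageRank boost and re-sort.
  return candidates
    .map(([docId, score]) => ({
      docId,
      score: score * (1 + boostWeight * (pageRank.get(docId) ?? 0)),
    }))
    .sort((x, y) => y.score - x.score);
}

const postings = new Map<string, Posting[]>([
  ["go", [{ docId: 1, tf: 3 }, { docId: 2, tf: 1 }]],
  ["crawler", [{ docId: 2, tf: 1 }, { docId: 3, tf: 1 }]],
]);
const docLen = new Map<number, number>([[1, 100], [2, 100], [3, 100]]);
const ranks = new Map<number, number>([[1, 0.1], [2, 0.5], [3, 0.4]]);
console.log(search(["go"], postings, docLen, ranks));
```

Truncating to the top K before enrichment keeps the expensive page lookup bounded regardless of how many documents match the query.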

An in-memory cache (60s TTL, max 200 entries) eliminates redundant database round-trips.
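A minimal sketch of such a cache, assuming entries are keyed by query string and that the oldest-inserted entry is evicted once the limit is reached. The project only documents the 60-second TTL and the 200-entry cap, so the eviction policy and the `TtlCache` class itself are assumptions. The injectable clock exists only to make the behaviour easy to test.

```typescript
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private maxEntries: number;
  private now: () => number;

  constructor(ttlMs = 60000, maxEntries = 200, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      // Expired: drop the entry and report a miss.
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    // Deleting first moves a refreshed key to the end of the insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // A Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

const cache = new TtlCache<string[]>();
cache.set("go crawler", ["https://example.com"]);
console.log(cache.get("go crawler"));
```

A search handler would check `get` before querying the database and call `set` with the ranked results on a miss.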

Database (SQLite / Turso Cloud)

Table      Purpose
pages      URL, title, body text, PageRank score
terms      Unique terms from the corpus
postings   Term-to-page mapping with frequency
links      Directed edges for the link graph

Hosted on Turso Cloud for edge-distributed access with libSQL.

Tech Stack

Component    Technology
Crawler      Go (goroutines, net/http)
Indexer      Go
PageRank     Go (iterative computation)
Search API   TypeScript / Next.js
Database     SQLite / Turso Cloud (libSQL)
Frontend     Next.js / React
Ranking      BM25 + PageRank

Getting Started

Prerequisites

  • Go 1.21+
  • Node.js (v18 or later)
  • A Turso database (or local SQLite file)

Crawl and Index

cd crawler
go run . --seed "https://example.com" --depth 3

Run the Search Frontend

cd web
npm install
npm run dev

The search interface will be available at http://localhost:3000/spider.

Performance

Metric             Value
Indexed postings   345,000+
Search latency     < 1 second
Cache TTL          60 seconds
Top-K candidates   60 per query

License

This project is licensed under the MIT License.
