Helix

SIMD-accelerated near-duplicate detection and content freshness engine for web-scale crawl pipelines. Written in Rust with Node.js and Python bindings.

What it does

Helix solves two problems that every large-scale crawler eventually hits:

Near-duplicate detection: finding pages whose text is identical or nearly identical, without scanning the full index on every insert. Helix fingerprints each document using SimHash with SIMD-accelerated k-shingle hashing, then uses LSH band indexing to find candidates in sub-linear time.

Content freshness tracking: knowing whether a re-crawled URL actually changed or is just a re-fetch of identical content. Helix maintains a SHA-256 content hash and timestamp per URL, classifying each recrawl as Unchanged, NearDuplicate, or New.

Both operations work incrementally. No full index rebuilds.
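As a rough illustration of the fingerprinting step, here is a minimal scalar sketch of SimHash over word-level k-shingles. It is not the helix-core implementation: the function names, the choice of word-level shingles, and the use of the standard library's `DefaultHasher` are all assumptions made for the example, and there is no SIMD here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one shingle (a window of `k` consecutive words) to 64 bits.
/// `DefaultHasher` is a stand-in; the real hash function is not specified here.
fn hash_shingle(words: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    words.hash(&mut hasher);
    hasher.finish()
}

/// Compute a 64-bit SimHash over word-level k-shingles.
/// Returns the fingerprint and the number of shingles hashed.
fn simhash(text: &str, k: usize) -> (u64, usize) {
    let words: Vec<&str> = text.split_whitespace().collect();
    if k == 0 || words.len() < k {
        return (0, 0);
    }

    // One signed counter per fingerprint bit.
    let mut counts = [0i32; 64];
    let mut shingle_count = 0;
    for window in words.windows(k) {
        let h = hash_shingle(window);
        for (bit, count) in counts.iter_mut().enumerate() {
            if ((h >> bit) & 1) == 1 {
                *count += 1;
            } else {
                *count -= 1;
            }
        }
        shingle_count += 1;
    }

    // A fingerprint bit is set when more shingle hashes had it set than clear.
    let mut fingerprint = 0u64;
    for (bit, &count) in counts.iter().enumerate() {
        if count > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    (fingerprint, shingle_count)
}

/// Number of bit positions in which two fingerprints differ.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Two documents would then be compared by calling `hamming_distance` on their fingerprints and checking the result against a threshold such as `HAMMING_THRESHOLD`.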

Architecture

helix-core/        # Rust library — fingerprinting, LSH index, freshness store
helix-api/         # Axum REST microservice wrapping helix-core
helix-node/        # NAPI-RS bindings (npm package)
helix-py/          # PyO3 + Maturin bindings (PyPI package)
benches/           # Criterion benchmarks

Performance

| Operation | Throughput / Latency |
| --- | --- |
| SimHash fingerprinting | ~500k docs/sec (single core, AVX2) |
| LSH near-dup lookup | < 2 ms p99 at 10M-doc index |
| Freshness check (Redis) | < 1 ms p99 |
| Incremental insert | O(b) band updates, no full rebuild |

Benchmarked on an M2 MacBook Pro. The M2 is an ARM CPU with no AVX2 support; the AVX2 path is used on x86_64 Linux in production.
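To make the "O(b) band updates" claim concrete, the following is a minimal in-memory sketch of a band index over 64-bit fingerprints. It assumes contiguous, non-overlapping bands, so `bands * rows` must fit in 64 bits; how Helix lays out its default 24 bands of 4 rows over a 64-bit fingerprint is not described here, and `BandIndex` and its methods are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Band-partitioned index over 64-bit fingerprints (illustrative only).
struct BandIndex {
    bands: u32,
    rows: u32,
    /// One table per band: band value -> fingerprints sharing that value.
    tables: Vec<HashMap<u64, Vec<u64>>>,
}

impl BandIndex {
    fn new(bands: u32, rows: u32) -> Self {
        assert!(
            bands > 0 && rows > 0 && bands * rows <= 64,
            "bands * rows must fit in 64 bits"
        );
        BandIndex {
            bands,
            rows,
            tables: (0..bands).map(|_| HashMap::new()).collect(),
        }
    }

    /// Extract the `rows`-bit slice of the fingerprint that belongs to `band`.
    fn band_key(&self, fp: u64, band: u32) -> u64 {
        let mask = if self.rows == 64 {
            u64::MAX
        } else {
            (1u64 << self.rows) - 1
        };
        (fp >> (band * self.rows)) & mask
    }

    /// Incremental insert: exactly one bucket update per band, no rebuild.
    fn insert(&mut self, fp: u64) {
        for band in 0..self.bands {
            let key = self.band_key(fp, band);
            self.tables[band as usize].entry(key).or_default().push(fp);
        }
    }

    /// Collect candidates that share at least one band with `fp`, then keep
    /// those within `max_distance` differing bits.
    fn near_duplicates(&self, fp: u64, max_distance: u32) -> Vec<u64> {
        let mut seen = HashSet::new();
        let mut hits = Vec::new();
        for band in 0..self.bands {
            let key = self.band_key(fp, band);
            if let Some(bucket) = self.tables[band as usize].get(&key) {
                for &candidate in bucket {
                    // `seen` stops a candidate from being verified twice.
                    if seen.insert(candidate)
                        && (candidate ^ fp).count_ones() <= max_distance
                    {
                        hits.push(candidate);
                    }
                }
            }
        }
        hits
    }
}
```

With this contiguous layout, two fingerprints that differ in at most 3 bits always agree on at least one of 4 bands, because 3 flipped bits can touch at most 3 bands. More bands give each pair more chances to collide (higher recall), while more rows per band make each collision harder (higher precision).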

API (REST)

POST /fingerprint        → { url, fingerprint: u64, shingle_count: usize }
POST /dedup-check        → { url, fingerprint, status: "new"|"near_dup"|"exact_dup", candidates: [...] }
POST /freshness-check    → { url, content_hash, status: "unchanged"|"changed"|"new", last_seen_at }
POST /batch              → bulk version of the above, returns NDJSON
GET  /health             → { status: "ok", index_size, uptime_secs }
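The three `/freshness-check` statuses can be read as a simple decision on the stored record for a URL. The sketch below shows that decision with an in-memory map standing in for Redis. The type and method names are hypothetical, the SHA-256 digest is assumed to be computed elsewhere and passed in as 32 bytes, and updating `last_seen_at` on every check is an assumption rather than documented behavior.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FreshnessStatus {
    Unchanged,
    Changed,
    New,
}

#[derive(Debug, Clone, Copy)]
struct Record {
    content_hash: [u8; 32],
    last_seen_at: u64,
}

/// In-memory stand-in for the Redis-backed freshness store.
#[derive(Default)]
struct FreshnessStore {
    records: HashMap<String, Record>,
}

impl FreshnessStore {
    /// Classify a recrawl of `url`, then store the new hash and timestamp.
    fn check(&mut self, url: &str, content_hash: [u8; 32], now: u64) -> FreshnessStatus {
        let status = match self.records.get(url) {
            // URL never seen before.
            None => FreshnessStatus::New,
            // Same bytes as last time: a pure re-fetch.
            Some(prev) if prev.content_hash == content_hash => FreshnessStatus::Unchanged,
            // Known URL, different content hash.
            Some(_) => FreshnessStatus::Changed,
        };
        self.records.insert(
            url.to_string(),
            Record {
                content_hash,
                last_seen_at: now,
            },
        );
        status
    }

    /// Timestamp of the most recent check for `url`, if any.
    fn last_seen_at(&self, url: &str) -> Option<u64> {
        self.records.get(url).map(|record| record.last_seen_at)
    }
}
```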

Quickstart

# Clone and build
git clone https://github.com/vivekjami/helix
cd helix
cargo build --release

# Run the service
REDIS_URL=redis://localhost:6379 ./target/release/helix-api

# Node.js
npm install helix-node

import { HelixClient } from 'helix-node';
const h = new HelixClient({ redisUrl: 'redis://localhost:6379' });
const result = await h.dedupCheck({ url, content });
// result.status → 'new' | 'near_dup' | 'exact_dup'

# Python
import helix_py
idx = helix_py.HelixIndex(redis_url="redis://localhost:6379")
result = idx.dedup_check(url=url, content=content)

Linting and CI

cargo fmt --check             # formatting
cargo clippy -- -D warnings   # lints, zero-warnings policy
cargo test                    # unit + integration tests
cargo bench                   # Criterion benchmarks

GitHub Actions runs all four on every push and pull request. Benchmark results are compared against main, and the PR fails if p99 latency regresses by more than 10%.
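The regression gate boils down to a percentile and a ratio check. The sketch below shows one way to express it, using a nearest-rank p99 and a fractional tolerance. How the actual workflow extracts latencies from Criterion output is not described here, so both function names and the nearest-rank definition are assumptions.

```rust
/// Nearest-rank 99th percentile of latency samples (any consistent unit).
/// Returns `None` for an empty slice.
fn p99(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank = ceil(0.99 * n), computed in integers to avoid float rounding.
    let rank = (sorted.len() * 99 + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

/// True when the candidate p99 exceeds the baseline by more than `tolerance`
/// (0.10 means 10%).
fn regresses(baseline_p99: f64, candidate_p99: f64, tolerance: f64) -> bool {
    candidate_p99 > baseline_p99 * (1.0 + tolerance)
}
```

For example, with a baseline p99 of 2.0 ms, a candidate at 2.1 ms passes a 10% gate, while one at 2.3 ms fails it.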

Configuration

| Env var | Default | Description |
| --- | --- | --- |
| `REDIS_URL` | `redis://localhost:6379` | Freshness store |
| `LSH_BANDS` | `24` | LSH band count (higher = more recall) |
| `LSH_ROWS` | `4` | Rows per band (higher = more precision) |
| `SIMHASH_BITS` | `64` | Fingerprint width |
| `HAMMING_THRESHOLD` | `3` | Max differing bits to call near-dup |
| `PORT` | `8080` | REST service port |
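For readers embedding similar settings in their own service, this is a minimal sketch of loading the variables above with their documented defaults, using only the standard library. The `Config` struct and its methods are hypothetical, and falling back to the default when a value fails to parse is an assumption; helix-api may well reject invalid values instead.

```rust
use std::env;
use std::str::FromStr;

#[derive(Debug, PartialEq)]
struct Config {
    redis_url: String,
    lsh_bands: u32,
    lsh_rows: u32,
    simhash_bits: u32,
    hamming_threshold: u32,
    port: u16,
}

/// Parse an optional string, falling back to `default` when it is missing
/// or not a valid `T`.
fn parse_or<T: FromStr>(value: Option<String>, default: T) -> T {
    value.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

impl Config {
    /// Build a config from any key -> value lookup, which keeps it testable
    /// without touching the process environment.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        Config {
            redis_url: get("REDIS_URL")
                .unwrap_or_else(|| "redis://localhost:6379".to_string()),
            lsh_bands: parse_or(get("LSH_BANDS"), 24),
            lsh_rows: parse_or(get("LSH_ROWS"), 4),
            simhash_bits: parse_or(get("SIMHASH_BITS"), 64),
            hamming_threshold: parse_or(get("HAMMING_THRESHOLD"), 3),
            port: parse_or(get("PORT"), 8080),
        }
    }

    /// Read the same settings from the process environment.
    fn from_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```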

Status

Active development. Core fingerprinting and LSH index are stable. Bindings and REST API are beta.
