Infinite Context

Experimental context extension for local LLMs via hierarchical retrieval.

Status: Early-Stage Research Prototype / Under Active Development

This project explores extending LLM effective context using the Hierarchical Attention Tree (HAT) for retrieval-augmented memory. Current benchmarks measure retrieval recall on synthetic data — not end-to-end task accuracy. The "100% retrieval" figures below mean HAT finds the right chunks in controlled tests, not that the LLM produces correct answers 100% of the time. Real-world performance depends on many factors (query quality, chunk boundaries, model capability) that have not been rigorously evaluated.

This is a research prototype, not production-ready software. Use at your own risk. Rigorous benchmarking is in progress.

Infinite Context Architecture


Try It NOW (Pick Your Favorite)

Zero Install - Just Run

# Docker (one command, works everywhere)
docker run -it --rm --network host andrewmang/infinite-context

# Or with docker-compose for full stack
curl -O https://raw.githubusercontent.com/Lumi-node/infinite-context/main/docker-compose.yml
docker-compose up -d

Live Demo (No Install At All)

Try it on Hugging Face Spaces - See HAT in action right in your browser!

One-Line Installer

# Linux/macOS - installs everything automatically
curl -sSL https://raw.githubusercontent.com/Lumi-node/infinite-context/main/install.sh | bash

Install from Source

# Clone the repo
git clone https://github.com/Lumi-node/infinite-context
cd infinite-context

# Install Python package (recommended - full HAT support)
pip install maturin sentence-transformers
maturin develop --release

# Or build Rust CLI (benchmarks only)
cargo build --release

Retrieval Benchmarks (Synthetic Data)

Complexity Comparison

Model       Native Context   Addressable via HAT   Extension (retrieval only)
gemma3:1b   8K               11.3M+                1,413x
phi4        16K              11.3M+                706x
llama3.2    8K               11.3M+                1,413x

These figures represent the amount of stored text HAT can search through — not that the model "understands" all 11M tokens simultaneously. Retrieved chunks are injected into the model's native context window. End-to-end task accuracy (does the model answer correctly?) has not been formally benchmarked.
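The extension factors in the table are simply the ratio of the addressable store to each model's native window. A quick sanity check (assuming "8K" and "16K" mean 8,000 and 16,000 tokens; the table rounds 1,412.5 up to 1,413):

```python
# Extension factor = addressable tokens / native context window.
# Store size (11.3M+) and window sizes copied from the table above.
ADDRESSABLE = 11_300_000

models = {"gemma3:1b": 8_000, "phi4": 16_000, "llama3.2": 8_000}
for name, native in models.items():
    print(f"{name}: {native:,} native -> {ADDRESSABLE / native:,.1f}x addressable")
```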

Benchmark Results


The Problem

Local models like Gemma 3 (8K) and Phi 4 (16K) are powerful, but they forget everything outside their small context windows. Conventional RAG helps, but flat single-level retrieval often misses relevant chunks as the store grows, losing critical information.

The Solution

The Hierarchical Attention Tree (HAT) exploits the natural hierarchy of conversations:

HAT Tree Structure

Instead of scoring all chunks (O(n)), HAT runs an O(log n) beam search down the hierarchy, achieving high retrieval recall in synthetic benchmarks. Real-world accuracy depends on data structure, embedding quality, and query characteristics.

Beam Search Visualization
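The traversal can be illustrated with a toy hierarchy. This is an illustrative sketch only, not the actual HAT implementation: the node layout, scoring function, and beam width here are simplified assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class Node:
    """Toy tree node: internal nodes carry a summary embedding, leaves carry text."""
    def __init__(self, embedding, children=None, chunk=None):
        self.embedding = embedding
        self.children = children or []
        self.chunk = chunk

def beam_search(root, query, beam_width=2, k=3):
    """Descend level by level, keeping only the best `beam_width` nodes per
    level, so only O(beam_width * branching * depth) nodes are ever scored."""
    frontier, leaves = [root], []
    while frontier:
        kept = sorted(frontier, key=lambda n: -cosine(n.embedding, query))[:beam_width]
        frontier = []
        for node in kept:
            if node.children:
                frontier.extend(node.children)
            else:
                leaves.append(node)
    leaves.sort(key=lambda n: -cosine(n.embedding, query))
    return [leaf.chunk for leaf in leaves[:k]]

# Toy two-level hierarchy (session summaries -> chunks), 2-D embeddings
quantum = Node([1.0, 0.0], chunk="experiment showed 47% improvement")
notes   = Node([0.9, 0.1], chunk="lab notebook entry")
errand  = Node([0.0, 1.0], chunk="grocery list")
session_work = Node([1.0, 0.05], children=[quantum, notes])
session_life = Node([0.05, 1.0], children=[errand])
root = Node([0.5, 0.5], children=[session_work, session_life])

print(beam_search(root, [1.0, 0.0], beam_width=1, k=2))
# -> ['experiment showed 47% improvement']
```

With beam_width=1, only the work session is expanded; the unrelated session's chunks are never scored, which is where the savings come from.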


Detailed Setup

Docker Usage

# Pull and run immediately
docker run -it --rm --network host andrewmang/infinite-context

# Run benchmark
docker run -it --rm andrewmang/infinite-context infinite-context bench --chunks 100000

# Full stack with Ollama
docker-compose up -d
docker-compose exec infinite-context infinite-context chat --model gemma3:1b

Python API (Recommended - Full HAT Support)

The Python API uses real embeddings + HAT retrieval + Ollama. Note: This is experimental research software, not a production-ready system.

# Shell: install from the cloned repo
pip install maturin sentence-transformers
maturin develop --release

# Python:
from infinite_context import InfiniteContext

# Initialize - connects to Ollama
ctx = InfiniteContext(model="gemma3:1b")

# Add information (automatically embedded with sentence-transformers and indexed in HAT)
ctx.add("My name is Alex and I work on quantum computing.")
ctx.add("The latest experiment showed 47% improvement in coherence.")

# Chat - HAT retrieves relevant context, injects it into prompt, queries Ollama
response = ctx.chat("What were the quantum experiment results?")
print(response)  # Should reference the 47% improvement

# Save memory to disk
ctx.save("my_memory.hat")

# Load later
ctx = InfiniteContext.load("my_memory.hat", model="gemma3:1b")

Low-Level API

from infinite_context import HatIndex
from sentence_transformers import SentenceTransformer

# Setup
embedder = SentenceTransformer('all-MiniLM-L6-v2')
index = HatIndex.cosine(384)

# Add embeddings
embedding = embedder.encode("Important info", normalize_embeddings=True)
index.add(embedding.tolist())

# Query
query_emb = embedder.encode("What's important?", normalize_embeddings=True)
results = index.near(query_emb.tolist(), k=10)

# Persist
index.save("index.hat")
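One detail the snippet above leaves implicit is mapping the indices returned by `near` back to the stored texts. A common pattern is a parallel list. The sketch below substitutes a brute-force stand-in for `HatIndex` so it runs without the compiled extension; the real API may differ.

```python
import math

class ToyIndex:
    """Stand-in for HatIndex: brute-force cosine search over stored vectors."""
    def __init__(self):
        self.vectors = []

    def add(self, vec):
        self.vectors.append(vec)
        return len(self.vectors) - 1  # insertion index

    def near(self, query, k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(y * y for y in b)))
        ranked = sorted(range(len(self.vectors)),
                        key=lambda i: -cos(self.vectors[i], query))
        return ranked[:k]

# Keep texts parallel to the index so result indices map back to chunks.
index, texts = ToyIndex(), []

def add_chunk(text, vec):
    texts.append(text)
    index.add(vec)

add_chunk("meeting notes", [1.0, 0.0])
add_chunk("shopping list", [0.0, 1.0])

hits = index.near([0.9, 0.1], k=1)
print([texts[i] for i in hits])  # -> ['meeting notes']
```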

Rust CLI (Benchmarks & Testing)

The Rust CLI is useful for benchmarking HAT performance and testing Ollama connectivity.

Note: For actual chat with HAT memory retrieval, use the Python API above.

# Build the CLI
cargo build --release

# Run HAT performance benchmark
./target/release/infinite-context bench --chunks 100000

# Test Ollama connection
./target/release/infinite-context test --model gemma3:1b

# List available models
./target/release/infinite-context models

System Requirements

  • Rust: 1.70+ (for CLI)
  • Python: 3.9+ (for Python API)
  • Ollama: Any version
  • RAM: 4GB minimum

Building from Source

git clone https://github.com/Lumi-node/infinite-context
cd infinite-context

# Rust CLI
cargo build --release
./target/release/infinite-context --help

# Python wheel
pip install maturin
maturin develop --release

Why This Exists

We're exploring whether local, hierarchical retrieval can meaningfully extend context for small LLMs — without sending data to cloud APIs.

Local Privacy

Design goals:

  • Local: Runs on your hardware, data stays on your machine
  • Free: No API costs
  • Fast retrieval: Sub-millisecond HAT queries in synthetic benchmarks
  • High retrieval recall: 100% on synthetic hierarchical test data (real-world accuracy not yet validated)

Model Compatibility

Note: This is a research project exploring an idea, not a finished product. The retrieval layer works well in controlled tests, but end-to-end quality (does the LLM actually give better answers?) needs rigorous evaluation. We are actively working on this.


Research

Based on the Hierarchical Attention Tree (HAT) algorithm. Key hypothesis: conversations naturally form hierarchies (sessions → documents → chunks), and exploiting this structure may enable O(log n) retrieval with high recall. Validating this hypothesis rigorously is ongoing work.
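A back-of-the-envelope check on that hypothesis: with branching factor b and beam width w, a descent scores about w * b nodes per level across log_b(n) levels, versus n for a flat scan. Here b = 32 and w = 4 are illustrative assumptions, not HAT's actual parameters:

```python
import math

n = 100_000   # stored chunks (the benchmark size used elsewhere in this README)
b, w = 32, 4  # assumed branching factor and beam width (illustrative only)

levels = math.ceil(math.log(n, b))  # tree depth
hat_comparisons = w * b * levels    # nodes scored per query, worst case
print(f"flat scan: {n:,} comparisons; hierarchical: ~{hat_comparisons} over {levels} levels")
```

Even with generous constants, the hierarchical descent scores a few hundred nodes where a flat scan scores all 100,000, which is what makes sub-millisecond queries plausible.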


License

MIT


Get Started in 10 Seconds

Method    Command                                                          Notes
Docker    docker run -it --rm --network host andrewmang/infinite-context   Full setup
Browser   Hugging Face Spaces                                              Try HAT live
Source    git clone ... && maturin develop --release                       Python API (recommended)

An experiment in local, hierarchical AI memory. Contributions and feedback welcome.
