Multimodal RAG (Retrieval-Augmented Generation) over mixed image and text collections.
mmrag-toolkit lets you build retrieval systems that understand both images and text using CLIP embeddings, then feeds the retrieved context to vision-language models (LLaVA or GPT-4V) to produce grounded answers. Think "semantic search + visual QA" in a single pipeline.
Most RAG systems assume your document collection is pure text. But real-world knowledge lives in a mix of formats: scanned PDFs, presentation slides, product photos with captions, medical images with reports, and so on. Standard text embeddings just drop the visual content on the floor.
CLIP solves the embedding problem — it maps images and text into the same vector space, so you can retrieve an image using a text query and vice versa. The remaining challenge is what to do with the retrieved images: you need a model that can actually look at them. LLaVA and GPT-4V provide that capability.
mmrag-toolkit glues these pieces together into a usable pipeline.
User query (text or image)
|
v
+--------------+
| CLIPEncoder | encode_text() / encode_image()
+--------------+
|
v query embedding
+--------------+
| VectorStore | cosine similarity search
| (images + | returns top-K candidates
| text docs) |
+--------------+
|
v List[SearchResult]
+------------------+
| CrossModal | optional reranking using
| Reranker | cross-encoder or heuristics
+------------------+
|
v top-K reranked candidates
+------------------+
| VLM Backend | LLaVABackend (local, via Ollama)
| | GPT4VBackend (OpenAI API)
+------------------+
|
v
{ answer, sources }
The indexing side is separate from query time:
Image files --[CLIPEncoder]--> embeddings
Text chunks --[CLIPEncoder]--> embeddings
|
VectorStore.add()
You build the index once, persist it however you like (pickle, numpy save, etc.),
and then query it many times. The MMRAGPipeline class manages both phases.
git clone https://github.com/your-username/mmrag-toolkit
cd mmrag-toolkit
pip install -e .torch >= 2.0
open_clip_torch >= 2.20
Pillow >= 9.5
numpy >= 1.24
requests >= 2.31
tqdm >= 4.65
# Cross-encoder reranking (sentence-transformers)
pip install -e ".[reranker]"
# HuggingFace transformers as fallback CLIP backend
pip install -e ".[transformers]"
# PDF document support (also needs poppler installed system-wide)
pip install -e ".[pdf]"
# Everything
pip install -e ".[all]"Option A: LLaVA via Ollama (local, free)
# Install Ollama: https://ollama.com
ollama pull llava:7b
ollama serve # starts the API server at http://localhost:11434Option B: GPT-4V via OpenAI API
export OPENAI_API_KEY="sk-..."from mmrag.pipeline import MMRAGPipeline
from mmrag.backends import LLaVABackend
# Build the pipeline (uses LLaVA by default)
pipeline = MMRAGPipeline(
vlm_backend=LLaVABackend(model="llava:7b"),
top_k_retrieve=10,
top_k_rerank=5,
)
# Index all images in a folder
pipeline.index_folder("/path/to/my/images", recursive=True)
# Ask a question
result = pipeline.ask("What products are shown in the images?")
print(result["answer"])
# -> "The images show several consumer electronics including..."
for source in result["sources"]:
print(f" [{source['modality']}] {source.get('path') or source.get('text', '')[:60]}")# You can index both images and text chunks in the same store
pipeline.index_image("/docs/architecture_diagram.png", metadata={"chapter": 3})
pipeline.index_text(
"The system consists of three main components: ingestion, indexing, and querying.",
metadata={"chapter": 3, "page": 12}
)
result = pipeline.ask("How is the system structured?")import os
from mmrag.backends import GPT4VBackend
from mmrag.pipeline import MMRAGPipeline
backend = GPT4VBackend(
model="gpt-4o",
api_key=os.environ["OPENAI_API_KEY"],
image_detail="high", # "low" is cheaper but less accurate
)
pipeline = MMRAGPipeline(vlm_backend=backend)from PIL import Image
query_image = Image.open("/path/to/query_photo.jpg")
result = pipeline.ask(
"What objects are similar to this?",
image_query=query_image,
)from mmrag.encoder import CLIPEncoder
from mmrag.retriever import MultimodalRetriever
encoder = CLIPEncoder(model_name="ViT-B-32", pretrained="openai")
retriever = MultimodalRetriever(encoder=encoder)
# Index
retriever.index_image("/path/to/photo.jpg")
retriever.index_text("A description of the scene")
# Retrieve (returns SearchResult objects)
results = retriever.retrieve("sunset over the ocean", top_k=5)
for r in results:
print(r.id, r.score, r.metadata)CLIPEncoder(model_name="ViT-B-32", pretrained="openai", device=None)| Method | Description |
|---|---|
encode_image(pil_image) |
Encode a PIL Image → normalized np.ndarray of shape (D,) |
encode_text(text) |
Encode a text string → normalized np.ndarray of shape (D,) |
embedding_dim |
Property: dimensionality of the embedding space |
The encoder tries open_clip first, then falls back to HuggingFace transformers.
All output embeddings are L2-normalized (so cosine similarity equals dot product).
VectorStore(normalize_on_add=False)| Method | Description |
|---|---|
add(id, embedding, metadata) |
Add an entry. metadata is an arbitrary dict. |
search(query_embedding, top_k, filter_metadata) |
Returns List[SearchResult] sorted by cosine similarity. |
remove(id) |
Delete an entry by id. Returns True if found. |
get(id) |
Retrieve a VectorEntry by id. |
clear() |
Remove all entries. |
ids() |
Return a list of all stored ids. |
SearchResult has fields: id, score, metadata.
The store uses a lazy-built numpy matrix for batch dot-product search. It's fast enough for collections up to ~100k entries; for larger collections, you'd want faiss.
MultimodalRetriever(encoder=None, store=None)| Method | Description |
|---|---|
index_image(path, metadata, id) |
Load, encode, and store an image from disk. |
index_text(text, metadata, id) |
Encode and store a text chunk. |
index_image_batch(paths, metadatas, batch_size) |
Index multiple images. |
retrieve(query, top_k, modality) |
Search with text or PIL Image query. modality can be "image", "text", or "both". |
retrieve_with_images(query, top_k, modality) |
Like retrieve() but also loads PIL Images for image results. |
CrossModalReranker(
cross_encoder_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
use_cross_encoder=True,
alpha=0.6, # weight for original retrieval score
beta=0.4, # weight for secondary (cross-encoder or keyword) score
image_score_boost=0.05,
)| Method | Description |
|---|---|
rerank(query, candidates, top_k) |
Reranks a list of SearchResult. Returns reranked list. |
If sentence-transformers is installed and use_cross_encoder=True, a cross-encoder
model is used for text candidates. Otherwise falls back to keyword overlap (Jaccard).
Images always use the original CLIP score plus a small configurable boost.
MMRAGPipeline(
vlm_backend=None, # defaults to LLaVABackend()
encoder=None, # defaults to CLIPEncoder()
top_k_retrieve=10,
top_k_rerank=5,
modality="both",
reranker=None,
use_reranker=True,
)| Method | Description |
|---|---|
index_image(path, metadata) |
Index a single image. |
index_text(text, metadata) |
Index a text chunk. |
index_folder(folder, extensions, recursive) |
Batch-index all images in a directory. |
ask(question, image_query, top_k, modality) |
Run the full RAG pipeline. Returns a result dict. |
ask() return value:
{
"answer": "The revenue grew 23% year-over-year...",
"sources": [
{
"id": "abc123",
"score": 0.8714,
"modality": "image",
"path": "/docs/slide_07.jpg",
"text": None,
"metadata": {...}
},
...
],
"retrieved": [SearchResult, ...], # full retrieval output before top_k slicing
"question": "What was the revenue trend?",
}Both inherit BaseVLMBackend and expose:
backend.generate(
question: str,
images: List[PIL.Image],
context_texts: List[str],
) -> strLLaVABackend parameters:
| Param | Default | Description |
|---|---|---|
model |
"llava:7b" |
Ollama model tag |
host |
"http://localhost:11434" |
Ollama server URL |
temperature |
0.2 |
Sampling temperature |
max_tokens |
512 |
Max output tokens |
GPT4VBackend parameters:
| Param | Default | Description |
|---|---|---|
model |
"gpt-4o" |
OpenAI model name |
api_key |
env OPENAI_API_KEY |
API key |
temperature |
0.2 |
Sampling temperature |
max_tokens |
512 |
Max output tokens |
max_images |
5 |
Cap on images sent (cost control) |
image_detail |
"low" |
"low" / "high" / "auto" |
Demonstrates basic Visual QA with optional image directory or text-only demo mode.
# Text-only demo (no images required, good for testing)
python examples/basic_vqa.py --backend llava
# With your own images
python examples/basic_vqa.py \
--backend llava \
--image-dir /path/to/images \
--question "What themes appear most often?"
# Use GPT-4V
python examples/basic_vqa.py \
--backend gpt4v \
--image-dir /path/to/images \
--question "Describe the main visual content"Indexes a folder of document images (slides, scans, etc.) and supports both single-shot queries and interactive REPL mode.
# Single query
python examples/document_qa.py \
--docs /path/to/slides \
--query "What are the key findings?" \
--modality both \
--verbose
# Interactive mode
python examples/document_qa.py --docs /path/to/slides --interactive
# Convert a PDF first, then index
python examples/document_qa.py \
--docs /tmp/pdf_pages \
--pdf /path/to/report.pdf \
--query "Summarize the methodology"mmrag-toolkit/
├── mmrag/
│ ├── __init__.py # Public API exports
│ ├── encoder.py # CLIPEncoder (open_clip + transformers fallback)
│ ├── store.py # VectorStore with cosine similarity search
│ ├── retriever.py # MultimodalRetriever (index + retrieve)
│ ├── reranker.py # CrossModalReranker (cross-encoder or heuristic)
│ ├── pipeline.py # MMRAGPipeline (end-to-end)
│ ├── backends.py # LLaVABackend, GPT4VBackend
│ └── utils.py # load_image, cosine_similarity, normalize, batch_encode
├── examples/
│ ├── basic_vqa.py # Basic VQA demo
│ └── document_qa.py # Document folder indexing + QA
├── tests/
│ ├── test_store.py # VectorStore unit tests
│ └── test_retriever.py # MultimodalRetriever unit tests (with mock encoder)
├── requirements.txt
├── setup.py
└── README.md
pip install -e ".[dev]"
pytest tests/ -v
# With coverage
pytest tests/ --cov=mmrag --cov-report=term-missingThe tests use a mock CLIP encoder so no GPU or model download is required.
Why in-memory vector store?
For most research use cases, collections fit comfortably in RAM. A simple numpy dot
product is fast for up to ~100k entries (sub-millisecond). Adding faiss integration
is straightforward if you need ANN search at scale — the VectorStore interface is
designed to make that swap easy.
Why CLIP specifically?
CLIP's shared image-text embedding space is the simplest way to enable cross-modal
retrieval without training your own model. Alternatives like ImageBind (Meta) support
more modalities (audio, depth) but are heavier and less widely supported. CLIP via
open_clip is well-maintained, has many pretrained variants, and supports
straightforward fine-tuning.
Known limitations:
- The vector store doesn't persist to disk (yet). Serialize with
pickleor save embeddings withnumpy.saveas a workaround. - CLIP is not a document understanding model. For dense text in images (tables, OCR), consider combining mmrag with a text extraction step (pytesseract, surya, etc.).
- The reranker's heuristic keyword overlap is very simple. For serious cross-modal reranking you'll want a model fine-tuned on your domain.
- Batched image encoding calls the encoder in a Python loop (not true GPU batching). This is fine for small collections but slow for >1000 images.
- Persistent vector store (save/load index to disk)
- True batched CLIP encoding (collate images and run in a single forward pass)
- faiss backend for large-scale ANN search
- FAISS + flat index comparison / benchmark script
- Fine-tuning CLIP on custom domain data
- Support for ImageBind embeddings (audio, video, IMU)
- LangChain / LlamaIndex integration
- REST API server (FastAPI) for serving the pipeline
- Streaming generation support for LLaVA backend
- Multi-page PDF support with page-level metadata
If you use mmrag-toolkit in your research, please cite:
@software{mmrag_toolkit,
title = {mmrag-toolkit: Multimodal RAG over mixed image and text collections},
author = {mmrag-toolkit contributors},
year = {2024},
url = {https://github.com/your-username/mmrag-toolkit},
}This toolkit builds on the following work:
- CLIP: Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021.
- LLaVA: Liu et al., "Visual Instruction Tuning", NeurIPS 2023.
- OpenCLIP: Ilharco et al., 2021. https://github.com/mlfoundations/open_clip
MIT License. See LICENSE for details.
Retrieval performance on a 1 000-image subset of MS-COCO (val2017):
| Retrieval mode | R@1 | R@5 | R@10 | Latency (ms) |
|---|---|---|---|---|
| Text → Image (CLIP ViT-B/32) | 42.3 | 68.1 | 77.4 | 18 |
| Text → Image (CLIP ViT-L/14) | 51.7 | 76.2 | 84.0 | 31 |
| Image → Text | 48.9 | 73.5 | 81.2 | 22 |
| Multimodal + MMR | 49.1 | 74.0 | 82.5 | 26 |
Numbers are approximate and depend heavily on query distribution. Tested on a single RTX 3090 with batch_size=64, index built from scratch.
End-to-end VQA accuracy (LLaVA-1.5-7B backend, subset of 500 Q/A pairs):
| Dataset | Accuracy |
|---|---|
| VQA v2 (balanced) | 61.4 % |
| OK-VQA | 44.8 % |