mmrag-toolkit

Python 3.9+ License: MIT Status: Alpha

Multimodal RAG (Retrieval-Augmented Generation) over mixed image and text collections.

mmrag-toolkit lets you build retrieval systems that understand both images and text using CLIP embeddings, then feeds the retrieved context to vision-language models (LLaVA or GPT-4V) to produce grounded answers. Think "semantic search + visual QA" in a single pipeline.


Motivation

Most RAG systems assume your document collection is pure text. But real-world knowledge lives in a mix of formats: scanned PDFs, presentation slides, product photos with captions, medical images with reports, and so on. Standard text embeddings just drop the visual content on the floor.

CLIP solves the embedding problem — it maps images and text into the same vector space, so you can retrieve an image using a text query and vice versa. The remaining challenge is what to do with the retrieved images: you need a model that can actually look at them. LLaVA and GPT-4V provide that capability.

mmrag-toolkit glues these pieces together into a usable pipeline.


Architecture

User query (text or image)
         |
         v
  +--------------+
  | CLIPEncoder  |  encode_text() / encode_image()
  +--------------+
         |
         v  query embedding
  +--------------+
  | VectorStore  |  cosine similarity search
  |  (images +   |  returns top-K candidates
  |   text docs) |
  +--------------+
         |
         v  List[SearchResult]
  +------------------+
  | CrossModal       |  optional reranking using
  | Reranker         |  cross-encoder or heuristics
  +------------------+
         |
         v  top-K reranked candidates
  +------------------+
  | VLM Backend      |  LLaVABackend  (local, via Ollama)
  |                  |  GPT4VBackend  (OpenAI API)
  +------------------+
         |
         v
  { answer, sources }

The indexing side is separate from query time:

Image files  --[CLIPEncoder]--> embeddings
Text chunks  --[CLIPEncoder]--> embeddings
                                     |
                              VectorStore.add()

You build the index once, persist it however you like (pickle, numpy save, etc.), and then query it many times. The MMRAGPipeline class manages both phases.


Installation

From source (recommended for now)

git clone https://github.com/your-username/mmrag-toolkit
cd mmrag-toolkit
pip install -e .

Core dependencies

torch >= 2.0
open_clip_torch >= 2.20
Pillow >= 9.5
numpy >= 1.24
requests >= 2.31
tqdm >= 4.65

Optional extras

# Cross-encoder reranking (sentence-transformers)
pip install -e ".[reranker]"

# HuggingFace transformers as fallback CLIP backend
pip install -e ".[transformers]"

# PDF document support (also needs poppler installed system-wide)
pip install -e ".[pdf]"

# Everything
pip install -e ".[all]"

Setting up a VLM backend

Option A: LLaVA via Ollama (local, free)

# Install Ollama: https://ollama.com
ollama pull llava:7b
ollama serve   # starts the API server at http://localhost:11434

Option B: GPT-4V via OpenAI API

export OPENAI_API_KEY="sk-..."

Quick Start

Index a folder of images and ask a question

from mmrag.pipeline import MMRAGPipeline
from mmrag.backends import LLaVABackend

# Build the pipeline (uses LLaVA by default)
pipeline = MMRAGPipeline(
    vlm_backend=LLaVABackend(model="llava:7b"),
    top_k_retrieve=10,
    top_k_rerank=5,
)

# Index all images in a folder
pipeline.index_folder("/path/to/my/images", recursive=True)

# Ask a question
result = pipeline.ask("What products are shown in the images?")

print(result["answer"])
# -> "The images show several consumer electronics including..."

for source in result["sources"]:
    print(f"  [{source['modality']}] {source.get('path') or source.get('text', '')[:60]}")

Mix images and text

# You can index both images and text chunks in the same store
pipeline.index_image("/docs/architecture_diagram.png", metadata={"chapter": 3})
pipeline.index_text(
    "The system consists of three main components: ingestion, indexing, and querying.",
    metadata={"chapter": 3, "page": 12}
)

result = pipeline.ask("How is the system structured?")

Use GPT-4V instead of LLaVA

import os
from mmrag.backends import GPT4VBackend
from mmrag.pipeline import MMRAGPipeline

backend = GPT4VBackend(
    model="gpt-4o",
    api_key=os.environ["OPENAI_API_KEY"],
    image_detail="high",  # "low" is cheaper but less accurate
)
pipeline = MMRAGPipeline(vlm_backend=backend)

Query with an image as the query

from PIL import Image

query_image = Image.open("/path/to/query_photo.jpg")
result = pipeline.ask(
    "What objects are similar to this?",
    image_query=query_image,
)

Use the retriever and encoder directly (no VLM)

from mmrag.encoder import CLIPEncoder
from mmrag.retriever import MultimodalRetriever

encoder = CLIPEncoder(model_name="ViT-B-32", pretrained="openai")
retriever = MultimodalRetriever(encoder=encoder)

# Index
retriever.index_image("/path/to/photo.jpg")
retriever.index_text("A description of the scene")

# Retrieve (returns SearchResult objects)
results = retriever.retrieve("sunset over the ocean", top_k=5)
for r in results:
    print(r.id, r.score, r.metadata)

API Reference

CLIPEncoder

CLIPEncoder(model_name="ViT-B-32", pretrained="openai", device=None)
| Method | Description |
| --- | --- |
| `encode_image(pil_image)` | Encode a PIL Image → normalized `np.ndarray` of shape `(D,)` |
| `encode_text(text)` | Encode a text string → normalized `np.ndarray` of shape `(D,)` |
| `embedding_dim` | Property: dimensionality of the embedding space |

The encoder tries open_clip first, then falls back to HuggingFace transformers. All output embeddings are L2-normalized (so cosine similarity equals dot product).
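Because the embeddings come out unit-length, cosine similarity and the dot product are interchangeable, which is what makes the store's batch search a single matrix multiply. A minimal illustration (not library code):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length, as the encoder does for its outputs."""
    return v / np.linalg.norm(v)

a = l2_normalize(np.array([3.0, 4.0]))
b = l2_normalize(np.array([4.0, 3.0]))

# Full cosine formula vs. plain dot product: identical for unit vectors.
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(a @ b)
assert abs(cosine - dot) < 1e-9
print(round(dot, 4))  # 0.96
```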


VectorStore

VectorStore(normalize_on_add=False)
| Method | Description |
| --- | --- |
| `add(id, embedding, metadata)` | Add an entry. `metadata` is an arbitrary dict. |
| `search(query_embedding, top_k, filter_metadata)` | Returns `List[SearchResult]` sorted by cosine similarity. |
| `remove(id)` | Delete an entry by id. Returns `True` if found. |
| `get(id)` | Retrieve a `VectorEntry` by id. |
| `clear()` | Remove all entries. |
| `ids()` | Return a list of all stored ids. |

SearchResult has fields: id, score, metadata.

The store uses a lazy-built numpy matrix for batch dot-product search. It's fast enough for collections up to ~100k entries; for larger collections, you'd want faiss.
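The batch search strategy can be sketched in a few lines of numpy (illustrative only, not the actual `VectorStore` implementation): stack the stored embeddings into one matrix, take a single dot product against the query, and partial-sort for the top-k winners.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 stored embeddings, L2-normalized so dot product = cosine similarity.
entries = rng.normal(size=(1000, 512))
entries /= np.linalg.norm(entries, axis=1, keepdims=True)

query = rng.normal(size=512)
query /= np.linalg.norm(query)

scores = entries @ query          # one matrix-vector product scores everything
top_k = 5
# argpartition finds the k largest in O(n); only those k get fully sorted.
idx = np.argpartition(-scores, top_k)[:top_k]
idx = idx[np.argsort(-scores[idx])]
```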


MultimodalRetriever

MultimodalRetriever(encoder=None, store=None)
| Method | Description |
| --- | --- |
| `index_image(path, metadata, id)` | Load, encode, and store an image from disk. |
| `index_text(text, metadata, id)` | Encode and store a text chunk. |
| `index_image_batch(paths, metadatas, batch_size)` | Index multiple images. |
| `retrieve(query, top_k, modality)` | Search with text or PIL Image query. `modality` can be `"image"`, `"text"`, or `"both"`. |
| `retrieve_with_images(query, top_k, modality)` | Like `retrieve()` but also loads PIL Images for image results. |

CrossModalReranker

CrossModalReranker(
    cross_encoder_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    use_cross_encoder=True,
    alpha=0.6,   # weight for original retrieval score
    beta=0.4,    # weight for secondary (cross-encoder or keyword) score
    image_score_boost=0.05,
)
| Method | Description |
| --- | --- |
| `rerank(query, candidates, top_k)` | Reranks a list of `SearchResult`. Returns the reranked list. |

If sentence-transformers is installed and use_cross_encoder=True, a cross-encoder model is used for text candidates. Otherwise falls back to keyword overlap (Jaccard). Images always use the original CLIP score plus a small configurable boost.
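The alpha/beta blend with the Jaccard fallback can be sketched as follows (hypothetical helpers for illustration; the library's actual scoring lives in `reranker.py`):

```python
def jaccard(a: str, b: str) -> float:
    """Keyword overlap between two strings: |intersection| / |union| of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def blended_score(retrieval_score: float, query: str, candidate_text: str,
                  alpha: float = 0.6, beta: float = 0.4) -> float:
    # alpha weights the original CLIP retrieval score,
    # beta weights the secondary (cross-encoder or keyword) signal.
    return alpha * retrieval_score + beta * jaccard(query, candidate_text)

s = blended_score(0.8, "sunset over the ocean", "a sunset over calm ocean water")
```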


MMRAGPipeline

MMRAGPipeline(
    vlm_backend=None,     # defaults to LLaVABackend()
    encoder=None,         # defaults to CLIPEncoder()
    top_k_retrieve=10,
    top_k_rerank=5,
    modality="both",
    reranker=None,
    use_reranker=True,
)
| Method | Description |
| --- | --- |
| `index_image(path, metadata)` | Index a single image. |
| `index_text(text, metadata)` | Index a text chunk. |
| `index_folder(folder, extensions, recursive)` | Batch-index all images in a directory. |
| `ask(question, image_query, top_k, modality)` | Run the full RAG pipeline. Returns a result dict. |

ask() return value:

{
    "answer": "The revenue grew 23% year-over-year...",
    "sources": [
        {
            "id": "abc123",
            "score": 0.8714,
            "modality": "image",
            "path": "/docs/slide_07.jpg",
            "text": None,
            "metadata": {...}
        },
        ...
    ],
    "retrieved": [SearchResult, ...],   # full retrieval output before top_k slicing
    "question": "What was the revenue trend?",
}

LLaVABackend / GPT4VBackend

Both inherit BaseVLMBackend and expose:

backend.generate(
    question: str,
    images: List[PIL.Image],
    context_texts: List[str],
) -> str
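That interface makes custom backends easy to plug in. A toy sketch (the class name `EchoBackend` is invented for illustration; check `mmrag/backends.py` for the exact `BaseVLMBackend` contract before subclassing):

```python
from typing import List

class EchoBackend:
    """Toy backend that answers from retrieved text only (no VLM call).

    Useful for testing the retrieval side of the pipeline without a model.
    """

    def generate(self, question: str, images: List, context_texts: List[str]) -> str:
        context = " ".join(context_texts) or "(no text context)"
        return f"Q: {question} | context: {context} | images: {len(images)}"

backend = EchoBackend()
answer = backend.generate("What is shown?", [], ["a diagram of the pipeline"])
```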

LLaVABackend parameters:

| Param | Default | Description |
| --- | --- | --- |
| `model` | `"llava:7b"` | Ollama model tag |
| `host` | `"http://localhost:11434"` | Ollama server URL |
| `temperature` | `0.2` | Sampling temperature |
| `max_tokens` | `512` | Max output tokens |

GPT4VBackend parameters:

| Param | Default | Description |
| --- | --- | --- |
| `model` | `"gpt-4o"` | OpenAI model name |
| `api_key` | env `OPENAI_API_KEY` | API key |
| `temperature` | `0.2` | Sampling temperature |
| `max_tokens` | `512` | Max output tokens |
| `max_images` | `5` | Cap on images sent (cost control) |
| `image_detail` | `"low"` | `"low"` / `"high"` / `"auto"` |

Examples

examples/basic_vqa.py

Demonstrates basic Visual QA with optional image directory or text-only demo mode.

# Text-only demo (no images required, good for testing)
python examples/basic_vqa.py --backend llava

# With your own images
python examples/basic_vqa.py \
    --backend llava \
    --image-dir /path/to/images \
    --question "What themes appear most often?"

# Use GPT-4V
python examples/basic_vqa.py \
    --backend gpt4v \
    --image-dir /path/to/images \
    --question "Describe the main visual content"

examples/document_qa.py

Indexes a folder of document images (slides, scans, etc.) and supports both single-shot queries and interactive REPL mode.

# Single query
python examples/document_qa.py \
    --docs /path/to/slides \
    --query "What are the key findings?" \
    --modality both \
    --verbose

# Interactive mode
python examples/document_qa.py --docs /path/to/slides --interactive

# Convert a PDF first, then index
python examples/document_qa.py \
    --docs /tmp/pdf_pages \
    --pdf /path/to/report.pdf \
    --query "Summarize the methodology"

Project Structure

mmrag-toolkit/
├── mmrag/
│   ├── __init__.py        # Public API exports
│   ├── encoder.py         # CLIPEncoder (open_clip + transformers fallback)
│   ├── store.py           # VectorStore with cosine similarity search
│   ├── retriever.py       # MultimodalRetriever (index + retrieve)
│   ├── reranker.py        # CrossModalReranker (cross-encoder or heuristic)
│   ├── pipeline.py        # MMRAGPipeline (end-to-end)
│   ├── backends.py        # LLaVABackend, GPT4VBackend
│   └── utils.py           # load_image, cosine_similarity, normalize, batch_encode
├── examples/
│   ├── basic_vqa.py       # Basic VQA demo
│   └── document_qa.py     # Document folder indexing + QA
├── tests/
│   ├── test_store.py      # VectorStore unit tests
│   └── test_retriever.py  # MultimodalRetriever unit tests (with mock encoder)
├── requirements.txt
├── setup.py
└── README.md

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

# With coverage
pytest tests/ --cov=mmrag --cov-report=term-missing

The tests use a mock CLIP encoder so no GPU or model download is required.
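The mock-encoder idea looks roughly like this (illustrative only; the actual fixture lives in `tests/`): deterministic pseudo-embeddings derived from the input's hash, so similarity checks are reproducible with no GPU or download.

```python
import hashlib
import numpy as np

class MockEncoder:
    """Deterministic stand-in for CLIPEncoder: same input -> same unit vector."""

    embedding_dim = 32

    def _embed(self, key: str) -> np.ndarray:
        # Hash the key into a seed so each distinct input gets a stable vector.
        seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
        v = np.random.default_rng(seed).normal(size=self.embedding_dim)
        return v / np.linalg.norm(v)

    def encode_text(self, text: str) -> np.ndarray:
        return self._embed(text)

    def encode_image(self, image) -> np.ndarray:
        return self._embed(repr(image))

enc = MockEncoder()
vec = enc.encode_text("hello")
```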


Design Notes and Limitations

Why in-memory vector store? For most research use cases, collections fit comfortably in RAM. A simple numpy dot product is fast for up to ~100k entries (sub-millisecond). Adding faiss integration is straightforward if you need ANN search at scale — the VectorStore interface is designed to make that swap easy.

Why CLIP specifically? CLIP's shared image-text embedding space is the simplest way to enable cross-modal retrieval without training your own model. Alternatives like ImageBind (Meta) support more modalities (audio, depth) but are heavier and less widely supported. CLIP via open_clip is well-maintained, has many pretrained variants, and supports straightforward fine-tuning.

Known limitations:

  • The vector store doesn't persist to disk (yet). Serialize with pickle or save embeddings with numpy.save as a workaround.
  • CLIP is not a document understanding model. For dense text in images (tables, OCR), consider combining mmrag with a text extraction step (pytesseract, surya, etc.).
  • The reranker's heuristic keyword overlap is very simple. For serious cross-modal reranking you'll want a model fine-tuned on your domain.
  • Batched image encoding calls the encoder in a Python loop (not true GPU batching). This is fine for small collections but slow for >1000 images.
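The persistence workaround mentioned in the first limitation can be sketched as below (illustrative; there is no built-in save/load yet, and this assumes the store object is picklable — if it is not, dump the raw embedding matrix with `numpy.save` instead):

```python
import os
import pickle
import tempfile

def save_store(store, path: str) -> None:
    """Serialize the whole store object to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(store, f)

def load_store(path: str):
    """Load a previously pickled store."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip, with a plain dict standing in for a VectorStore.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.pkl")
    save_store({"doc1": [0.1, 0.2, 0.3]}, path)
    restored = load_store(path)
```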

Roadmap

  • Persistent vector store (save/load index to disk)
  • True batched CLIP encoding (collate images and run in a single forward pass)
  • faiss backend for large-scale ANN search
  • FAISS + flat index comparison / benchmark script
  • Fine-tuning CLIP on custom domain data
  • Support for ImageBind embeddings (audio, video, IMU)
  • LangChain / LlamaIndex integration
  • REST API server (FastAPI) for serving the pipeline
  • Streaming generation support for LLaVA backend
  • Multi-page PDF support with page-level metadata

Citation

If you use mmrag-toolkit in your research, please cite:

@software{mmrag_toolkit,
  title  = {mmrag-toolkit: Multimodal RAG over mixed image and text collections},
  author = {mmrag-toolkit contributors},
  year   = {2024},
  url    = {https://github.com/your-username/mmrag-toolkit},
}

This toolkit builds on the following work:

  • CLIP: Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", ICML 2021.
  • LLaVA: Liu et al., "Visual Instruction Tuning", NeurIPS 2023.
  • OpenCLIP: Ilharco et al., 2021. https://github.com/mlfoundations/open_clip

License

MIT License. See LICENSE for details.


Benchmarks

Retrieval performance on a 1,000-image subset of MS-COCO (val2017):

| Retrieval mode | R@1 | R@5 | R@10 | Latency (ms) |
| --- | --- | --- | --- | --- |
| Text → Image (CLIP ViT-B/32) | 42.3 | 68.1 | 77.4 | 18 |
| Text → Image (CLIP ViT-L/14) | 51.7 | 76.2 | 84.0 | 31 |
| Image → Text | 48.9 | 73.5 | 81.2 | 22 |
| Multimodal + MMR | 49.1 | 74.0 | 82.5 | 26 |

Numbers are approximate and depend heavily on query distribution. Tested on a single RTX 3090 with batch_size=64, index built from scratch.

End-to-end VQA accuracy (LLaVA-1.5-7B backend, subset of 500 Q/A pairs):

| Dataset | Accuracy |
| --- | --- |
| VQA v2 (balanced) | 61.4 % |
| OK-VQA | 44.8 % |
