An offline multimodal Retrieval-Augmented Generation system that ingests PDFs/DOCX, images, and audio; indexes them in shared vector spaces; and answers natural-language queries with grounded citations.
- Offline embeddings and LLM (no network required after first model downloads)
- Modalities: Text (PDF/DOCX), Images (PNG/JPG), Audio (WAV/MP3 via offline STT)
- Dual vector spaces: CLIP (cross-modal) and Sentence-Transformers (text)
- FAISS indexes on disk + SQLite metadata with cross-links and citations
- Streamlit UI for ingestion and chat/search with expandable sources
```
src/
  config.py
  ingest/
    pdf_docx.py
    images.py
    audio.py
  processing/
    chunking.py
  embeddings/
    clip_embedder.py
    text_embedder.py
  index/
    faiss_index.py
  store/
    metadata_db.py
  retrieval/
    retriever.py
  generation/
    llm.py
app.py
requirements.txt
```
- Create and activate a virtual environment (Windows PowerShell):

```powershell
python -m venv .venv
. .venv/Scripts/Activate.ps1
pip install --upgrade pip
pip install -r requirements.txt
```

- Download offline models:
  - CLIP: `ViT-B-32` via `open_clip` (auto-downloaded on first use; you can pre-download by running the app online once, then staying offline)
  - Text encoder: `sentence-transformers/all-MiniLM-L6-v2` (auto-downloaded on first run). To pre-download for offline use, manually place `all-MiniLM-L6-v2` into `~/.cache/torch/sentence_transformers/`
  - Vosk STT model: download a small model, e.g., `vosk-model-small-en-us-0.15`, and set `VOSK_MODEL_PATH` in `.env` or `src/config.py`
  - LLM: download a GGUF model compatible with `llama-cpp-python` (e.g., `TheBloke/Llama-2-7B-GGUF`, q4_0 quantization). Set `LLM_MODEL_PATH` in `.env` or `src/config.py`.
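Once the models are in place, a `.env` file pointing at them might look like the following sketch (the paths and filenames are illustrative; use wherever you actually placed the models):

```ini
VOSK_MODEL_PATH=./models/vosk-model-small-en-us-0.15
LLM_MODEL_PATH=./models/llama-2-7b.Q4_0.gguf
DEVICE=cpu
```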
- Run the app:

```powershell
streamlit run app.py
```

- Ingest data:
  - Use the Ingest panel to add folders or files (PDF, DOCX, PNG/JPG, WAV/MP3)
  - The system extracts text, generates embeddings, and builds FAISS indexes
- Query:
  - Type a natural-language question. Results fuse text and image/audio-derived context
  - Click citations to open source snippets, transcript segments, or images
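Before embedding, extracted text is split into chunks (see `processing/chunking.py`). A minimal sliding-window sketch, assuming word-based chunks with overlap — the actual module's parameters and strategy may differ:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-window chunks for embedding.

    Each chunk holds up to `chunk_size` words; consecutive chunks share
    `overlap` words so that sentences near a boundary appear in both.
    """
    words = text.split()
    if not words:
        return []
    step = max(1, chunk_size - overlap)  # how far the window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final window already covers the tail
    return chunks
```

Overlapping windows trade a little index size for better recall on queries whose answer straddles a chunk boundary.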
Edit defaults in `src/config.py` or via environment variables:

- `DATA_DIR`, `INDEX_DIR`, `DB_PATH`
- `LLM_MODEL_PATH`, `VOSK_MODEL_PATH`
- `DEVICE` (`cpu`/`cuda`)
- First run may download model weights. After that, the app works fully offline.
- Audio ingestion converts input to 16 kHz mono WAV and uses Vosk for STT.
- On Python 3.13 (Windows), provide 16 kHz mono WAV files directly; MP3 conversion is not included.
- Cross-modal retrieval uses CLIP space for text↔image/audio (via transcript) and fuses with text space for pure text retrieval.
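One common way to fuse rankings from two vector spaces is reciprocal-rank fusion. A sketch, assuming each FAISS index returns document IDs in descending similarity order — the repo's `retrieval/retriever.py` may use a different fusion scheme:

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: score each doc by the sum of 1/(k + rank)
    over every ranking it appears in, then sort by total score."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Fuse a CLIP-space result list with a text-space result list
# (IDs here are illustrative, not from the actual store)
fused = rrf_fuse([["img_3", "doc_1", "doc_2"],
                  ["doc_1", "doc_2", "aud_7"]])
```

Documents retrieved by both spaces accumulate score from each list, so cross-modal and text-only evidence reinforce each other without needing the two spaces' raw similarity scores to be comparable.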