akashrajeev/RAGNet

# Multimodal Offline RAG (RAGNet)

An offline multimodal Retrieval-Augmented Generation system that ingests PDFs/DOCX, images, and audio; indexes them in shared vector spaces; and answers natural-language queries with grounded citations.

## Features

- Offline embeddings and LLM (no network access required after the initial model downloads)
- Modalities: text (PDF/DOCX), images (PNG/JPG), and audio (WAV/MP3 via offline speech-to-text)
- Dual vector spaces: CLIP (cross-modal) and Sentence-Transformers (text)
- On-disk FAISS indexes plus SQLite metadata with cross-links and citations
- Streamlit UI for ingestion and chat/search with expandable sources

## Project Structure

```
src/
  config.py
  ingest/
    pdf_docx.py
    images.py
    audio.py
  processing/
    chunking.py
  embeddings/
    clip_embedder.py
    text_embedder.py
  index/
    faiss_index.py
  store/
    metadata_db.py
  retrieval/
    retriever.py
  generation/
    llm.py
app.py
requirements.txt
```

## Quickstart

1. Create and activate a virtual environment (Windows PowerShell):

   ```powershell
   python -m venv .venv
   . .venv/Scripts/Activate.ps1
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

2. Download offline models:
   - CLIP: ViT-B-32 via open_clip (downloaded automatically on first use; to pre-download, run the app online once, after which it works offline)
   - Text encoder: sentence-transformers/all-MiniLM-L6-v2 (downloaded automatically on first run; to pre-download, place all-MiniLM-L6-v2 into ~/.cache/torch/sentence_transformers/)
   - Vosk STT model: download a small model, e.g. vosk-model-small-en-us-0.15, and set VOSK_MODEL_PATH in .env or src/config.py
   - LLM: download a GGUF model compatible with llama-cpp-python (e.g. TheBloke/Llama-2-7B-GGUF, q4_0 quantization) and set LLM_MODEL_PATH in .env or src/config.py
3. Run the app:

   ```powershell
   streamlit run app.py
   ```

4. Ingest data:
   - Use the Ingest panel to add folders or files (PDF, DOCX, PNG/JPG, WAV/MP3)
   - The system extracts text, generates embeddings, and builds FAISS indexes
5. Query:
   - Type a natural-language question; results fuse text and image/audio-derived context
   - Click citations to open source snippets, transcript segments, or images
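During ingestion, extracted text is split into overlapping chunks before embedding. A minimal sketch of that kind of chunker is below; the function name and parameters here are illustrative, and the project's actual logic lives in `src/processing/chunking.py` and may differ:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-based chunks.

    Illustrative sketch only -- the real implementation is in
    src/processing/chunking.py.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # final chunk already covers the end of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.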

## Configuration

Edit defaults in `src/config.py` or override them via environment variables:

- `DATA_DIR`, `INDEX_DIR`, `DB_PATH`
- `LLM_MODEL_PATH`, `VOSK_MODEL_PATH`
- `DEVICE` (`cpu` or `cuda`)
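A sketch of how such environment-variable overrides are typically read — the variable names match the list above, but the defaults are illustrative assumptions, not the project's actual values from `src/config.py`:

```python
import os
from pathlib import Path


def setting(name: str, default: str) -> str:
    """Read a setting from the environment, falling back to a default."""
    return os.environ.get(name, default)


# Illustrative defaults -- the real ones live in src/config.py.
DATA_DIR = Path(setting("DATA_DIR", "data"))
INDEX_DIR = Path(setting("INDEX_DIR", "indexes"))
DB_PATH = Path(setting("DB_PATH", "metadata.db"))
LLM_MODEL_PATH = setting("LLM_MODEL_PATH", "models/llama-2-7b.Q4_0.gguf")
VOSK_MODEL_PATH = setting("VOSK_MODEL_PATH", "models/vosk-model-small-en-us-0.15")
DEVICE = setting("DEVICE", "cpu")  # "cpu" or "cuda"
```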

## Notes

- The first run may download model weights; after that, the app works fully offline.
- Audio ingestion converts input to mono 16 kHz WAV and transcribes it with Vosk.
  - On Python 3.13 under Windows, provide mono 16 kHz WAV files directly; MP3 conversion is not included.
- Cross-modal retrieval uses the CLIP space for text↔image and text↔audio (via transcripts), fused with the text space for pure text retrieval.
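The fusion described above can be sketched as a weighted merge of per-space similarity scores. Everything here — the weights, the `(doc_id, score)` result shape, and the function name — is an assumption for illustration; the project's actual retrieval logic is in `src/retrieval/retriever.py`:

```python
def fuse_scores(
    text_hits: list[tuple[str, float]],
    clip_hits: list[tuple[str, float]],
    text_weight: float = 0.6,
    clip_weight: float = 0.4,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Merge (doc_id, score) lists from the text and CLIP vector spaces.

    Scores are assumed to be similarities in [0, 1]; a document found in
    both spaces accumulates the weighted sum of both scores, so agreement
    between spaces pushes it up the ranking.
    """
    fused: dict[str, float] = {}
    for doc_id, score in text_hits:
        fused[doc_id] = fused.get(doc_id, 0.0) + text_weight * score
    for doc_id, score in clip_hits:
        fused[doc_id] = fused.get(doc_id, 0.0) + clip_weight * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

For example, a document ranked second in both spaces can outrank a document that appears in only one of them, which is the point of fusing rather than concatenating result lists.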
