A powerful FastAPI-based multi-modal ingestion system that processes PDFs, scanned documents, images, videos, YouTube links, and text files — then optionally performs semantic retrieval using FAISS + HuggingFace embeddings and refines answers using an LLM (Ollama via NPMAI).
- 📄 Extract text from searchable PDFs
- 🖨️ OCR for scanned PDFs
- 🖼️ Image OCR (Tesseract + OpenCV preprocessing)
- 🎥 Local video speech-to-text (Whisper)
- 📺 YouTube video transcription (yt-dlp + Whisper)
- 📃 Plain text processing
- 🧠 FAISS vector database creation & loading
- 🔎 Semantic similarity search
- ♻️ Iterative refinement using LLM (Ollama)
- 🗂 Automatic ingestion routing based on file type
- Supabase Storage for persistent storage
We have officially upgraded the NPMAI Ecosystem to a more intelligent, cost-efficient, and "Product-Ready" pipeline. These updates move beyond basic RAG into High-Performance Agentic Retrieval.
The Problem:
Standard RAG systems use a fixed k value (e.g., k=4). This is inefficient: it provides too little context for large documents (missing facts) and too much "noise" for tiny documents (wasting tokens).
The Solution: I have engineered a Proportional Scaling Logic that calculates the number of chunks to retrieve from the total number of chunks in your vectorized database.
- Logic: `dynamic_k = max(1, int(total_chunks * 0.70))`
- How it works:
  - Short Documents: If your database has only 2 chunks, `int(2 * 0.70)` truncates to 1, so the system retrieves the single most relevant chunk. The `max(1, ...)` floor guarantees that at least one chunk is always retrieved.
  - Large PDFs: If your PDF generates 100 chunks, the system automatically scales up to retrieve the 70 most relevant chunks ($k=70$).
- The Impact: The AI always sees a fixed proportion (about 70%) of the knowledge base, so retrieval depth adapts to the document size.
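The proportional rule above can be sketched as a small helper. The function name `dynamic_k` and the `ratio` parameter are illustrative assumptions; only the formula itself comes from this README.

```python
def dynamic_k(total_chunks: int, ratio: float = 0.70) -> int:
    """Return how many chunks to retrieve from a store of `total_chunks`."""
    if total_chunks < 0:
        raise ValueError("total_chunks must be non-negative")
    # int() truncates toward zero; max(1, ...) guarantees at least one chunk.
    return max(1, int(total_chunks * ratio))


# Print k for a few store sizes to see how retrieval depth scales.
for n in (1, 2, 100):
    print(n, dynamic_k(n))
```

Because `int()` truncates, very small stores retrieve fewer chunks than they hold; rounding up with `math.ceil` would be one alternative if full coverage of tiny documents matters.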
The Problem:
Traditional "Refine" strategies process one chunk at a time. This is incredibly slow because it makes one LLM call per chunk.
The Solution: I have implemented a Sliding Window Batch-Refine system that processes chunks in groups of 3 instead of 1.
- Logic: `for i in range(0, total_chunks, 3):`
- How it works:
  - Instead of making a single LLM call for every 1,000 characters, the system sends a batch of 3 neighboring chunks (about 3,000 characters) in one go.
  - It uses the previous answer as a "Running Memory" to merge new information from the current 3-chunk batch.
- The Impact:
  - 3x Faster Execution: Sending 3 chunks per call cuts the number of LLM calls by about 66%, and total API latency drops with it.
  - Improved Coherence: The AI sees a broader context (3,000 chars vs 1,000 chars), allowing it to spot connections between facts that are split across neighboring chunks.
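A minimal sketch of the batch-refine loop described above, assuming the LLM is any callable that maps a prompt string to an answer string. The function name `batch_refine`, the prompt wording, and the `fake_llm` stand-in are hypothetical and are not taken from the project's code.

```python
from typing import Callable, Sequence


def batch_refine(
    chunks: Sequence[str],
    question: str,
    llm: Callable[[str], str],
    batch_size: int = 3,
) -> str:
    """Refine an answer over `chunks`, sending `batch_size` chunks per LLM call."""
    answer = ""  # running memory carried from one batch to the next
    for i in range(0, len(chunks), batch_size):
        context = "\n\n".join(chunks[i:i + batch_size])
        prompt = (
            f"Question: {question}\n"
            f"Existing answer: {answer or '(none yet)'}\n"
            f"New context:\n{context}\n"
            "Update the existing answer using the new context."
        )
        answer = llm(prompt)
    return answer


# Stand-in for a real model: records each prompt and returns a dummy answer.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer after call {len(calls)}"


result = batch_refine([f"chunk {n}" for n in range(7)], "What is covered?", fake_llm)
print(len(calls), result)
```

With 7 chunks and a batch size of 3, the loop makes 3 calls instead of 7. The batches do not overlap, so each chunk is sent to the model exactly once.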
We have successfully integrated Supabase Object Storage to move from temporary memory to Persistent Knowledge Bases.
- Vector Persistence: All `.faiss` and `.pkl` index files are now automatically uploaded to a secure Supabase bucket.
- Multi-Platform Access: This allows NPM-Rag-AI, NPM-AutoCode-AI, and the npmai SDK to share and load the same vectorized data from anywhere in the world.
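As a rough illustration of the upload step, the sketch below pushes the index files from a local FAISS folder to a bucket. It assumes a client object that exposes the supabase-py storage interface (`client.storage.from_(bucket).upload(path, file)`); the function name `upload_index`, the bucket name, and the key prefix are hypothetical, and the project's real upload code may differ.

```python
from pathlib import Path


def upload_index(client, bucket: str, db_path: str, prefix: str) -> list[str]:
    """Upload every .faiss and .pkl file in `db_path`; return the object keys."""
    uploaded = []
    for file in sorted(Path(db_path).iterdir()):
        if file.suffix not in {".faiss", ".pkl"}:
            continue  # skip anything that is not part of the index
        key = f"{prefix}/{file.name}"
        # Assumed supabase-py storage call: upload(path, file_bytes)
        client.storage.from_(bucket).upload(key, file.read_bytes())
        uploaded.append(key)
    return uploaded
```

With supabase-py installed, a client would typically come from `create_client(url, key)`, and a call might look like `upload_index(client, "vectors", "vector_db", "my-project")`, where the bucket and prefix names are placeholders.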
Summary: These architectural changes make NPMAI one of the most efficient open-source RAG frameworks available for developers who need Speed + Accuracy without the high cost of standard 1-by-1 refinement.
Client Request
↓
/ingestion Endpoint
↓
File Type Detection
↓
Text Extraction (PDF/OCR/Video/etc.)
↓
Optional Vector DB Retrieval (FAISS)
↓
Refinement via LLM
↓
Final Response
`GET /`
Returns: `{ "ok": true }`

`POST /ingestion`
- `file` → Upload file (pdf, txt, mp4, jpg, png, etc.)
- `query` → Optional semantic query
- `DB_PATH` → Path to vector database
- `link` → YouTube link
- `output_path` → Download location for video
- `temperature` → LLM temperature
- `model` → Ollama model name
| Type | Processing Method |
|---|---|
| PDF (text-based) | PyMuPDF |
| PDF (scanned) | pdf2image + Tesseract |
| Image | OpenCV + Tesseract |
| TXT | Direct read |
| MP4 | Whisper transcription |
| YouTube | yt-dlp + Whisper |
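The routing in the table above could be sketched as an extension lookup. The route labels, the `detect_route` name, and the rule that a link takes precedence over a file are assumptions made for illustration; telling a text-based PDF from a scanned one needs a look at the file contents and is not covered here.

```python
from pathlib import Path
from typing import Optional

# Extension -> processing route, mirroring the table above.
ROUTES = {
    ".pdf": "pdf",    # text layer first, OCR fallback for scans
    ".txt": "text",
    ".mp4": "video",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
}


def detect_route(filename: Optional[str] = None, link: Optional[str] = None) -> str:
    """Pick a processing route from a YouTube link or a file extension."""
    if link:
        return "youtube"
    if not filename:
        raise ValueError("either a file or a link is required")
    suffix = Path(filename).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or filename}") from None


print(detect_route("Report.PDF"), detect_route(link="https://youtu.be/VIDEO_ID"))
```

Keeping the mapping in one dictionary means a new file type only needs a new entry plus its handler.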
If query and DB_PATH are provided:
- Check if FAISS DB exists
- If yes → Load and perform similarity search
- If no → Create embeddings & save DB
- Retrieve top 4 chunks
- Send to LLM refine loop
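The load-or-create branch in the steps above can be sketched without tying it to a specific library. Here `load_index` and `build_index` are hypothetical callables supplied by the caller; the check for `index.faiss` assumes the default file name that LangChain's `FAISS.save_local` writes.

```python
import os
from typing import Any, Callable


def load_or_create(
    db_path: str,
    load_index: Callable[[str], Any],
    build_index: Callable[[str], Any],
) -> Any:
    """Load the vector store at `db_path` if present, otherwise build it."""
    # Assumption: the store was saved under the default name "index.faiss".
    if os.path.exists(os.path.join(db_path, "index.faiss")):
        return load_index(db_path)
    return build_index(db_path)
```

With LangChain, `load_index` might wrap `FAISS.load_local(db_path, embeddings, allow_dangerous_deserialization=True)`, and `build_index` might wrap `FAISS.from_texts(chunks, embeddings)` followed by `save_local(db_path)`.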
- Embeddings: `all-MiniLM-L6-v2`
- Vector DB: FAISS
- Chunk Size: 1000
- Overlap: 200
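To show what a chunk size of 1000 with an overlap of 200 means in practice, here is a plain-Python character chunker. It is a simplified stand-in, not the splitter the project uses (which is likely a LangChain text splitter that also respects separators); `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split `text` into chunks of up to `chunk_size` characters,
    where each chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
print([len(c) for c in chunk_text(sample)])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.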
For each retrieved chunk:
- Pass context to LLM
- Iteratively refine previous answer
- Return final refined response
Install required packages:
pip install fastapi uvicorn
pip install langchain langchain-community
pip install faiss-cpu
pip install openai-whisper
pip install moviepy
pip install pytesseract
pip install pdf2image
pip install pymupdf
pip install yt-dlp
pip install opencv-python
pip install pillow
pip install numpy
Make sure:
- Tesseract OCR is installed on the system
- FFmpeg is installed
- Ollama is running locally
uvicorn main:app --reload
POST /ingestion
Form Data:
file = document.pdf
query = "Summarize key points"
DB_PATH = vector_db
model = llama3
temperature = 0.7
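The form data above could be assembled in Python as follows. The helper name `build_ingestion_form` is hypothetical, the address assumes uvicorn's default of `127.0.0.1:8000`, and the commented-out call relies on the third-party `requests` package; the response format is not documented here, so the sketch only prints it.

```python
def build_ingestion_form(query, db_path, model="llama3", temperature=0.7):
    """Return the text fields of a POST /ingestion form as strings."""
    return {
        "query": query,
        "DB_PATH": db_path,
        "model": model,
        "temperature": str(temperature),
    }


form = build_ingestion_form("Summarize key points", "vector_db")
print(form)

# Sending it with `requests` (not run here; needs the server to be up):
#   import requests
#   with open("document.pdf", "rb") as fh:
#       r = requests.post("http://127.0.0.1:8000/ingestion",
#                         data=form, files={"file": fh})
#   print(r.status_code, r.text)
```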
- GPU is disabled (`CUDA_VISIBLE_DEVICES=""`)
- Whisper model loads once (thread-safe singleton)
- FAISS index loading relies on pickle-based ("dangerous") deserialization, so load only trusted DB paths
- Temporary audio is saved as `temp.wav`
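The "loads once" behaviour noted above is commonly built with double-checked locking. The sketch below is one way to do it and is not taken from the project's code; `ModelHolder` is a hypothetical name, and the loader is passed in so the example runs without Whisper installed.

```python
import threading
from typing import Any, Callable, Optional


class ModelHolder:
    """Load a model lazily, at most once, even under concurrent requests."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        if self._model is None:          # fast path: no lock once loaded
            with self._lock:
                if self._model is None:  # re-check: another thread may have loaded it
                    self._model = self._loader()
        return self._model
```

In the real service the loader might be `lambda: whisper.load_model("base")`, where the model size `"base"` is an assumption.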
- Streaming responses
- Async video processing
- Chunk-level caching
- Background task queue
- Better refine logic
- Support for multiple vector stores
- Use through Docker
- FastAPI
- FAISS
- HuggingFace Embeddings
- Whisper
- OpenCV
- Tesseract OCR
- PyMuPDF
- yt-dlp
- npmai
MIT License
This system acts as a universal AI ingestion pipeline capable of processing multi-modal data and performing intelligent semantic retrieval with LLM refinement.
It can serve as:
- AI document assistant
- Video summarizer
- Research helper
- OCR intelligence engine
- Knowledge base system
