A robust AI agent system combining Computer Vision, Text Extraction, and Large Language Models (Gemini 2.0) for sophisticated document understanding and structuring.
Honesty and self-awareness are as important as the code itself!
This repository was initially born out of a hiring assignment for Dave.AI. It challenged me to push my boundaries and step into complex, multi-modal territory. I built the project up to this current iteration and am continuing to actively modify, refine, and upgrade it.
Working on this project has been an incredibly rewarding learning curve. I gained hands-on experience in:
- Full-Stack Orchestration: Bridging a FastAPI/Python backend seamlessly with a sleek React & Vite frontend.
- Agentic Architectures: Conceptualizing multi-agent systems where distinct "Agents" (Vision, Text, Fusion, Synthesis) exchange context.
- Modern User Interfaces: Creating fluid, "glassmorphic" UI components with Framer Motion, tailored to provide a premium AI interactive experience.
I strongly believe in sharing where I stumbled to highlight genuine continuous improvement:
- Missing Implementations: The true underlying orchestration (the `app` module framework) for dynamic, synchronized agent execution is still a work in progress. Certain complex integrations between the YOLO layout parsers and the Gemini Flash engine haven't been fully connected yet.
- The RAG Module: Integrating seamless vector indexing (text + image embeddings) with Qdrant turned out to be much harder than I anticipated at a local scale, and it remains partially implemented.
- My Ongoing Focus: I couldn't perfect everything within the initial timeline, but I have not given up on the vision. I am currently deepening my skills in LangChain/LlamaIndex and advanced computer-vision deployment to bridge exactly these gaps.
This repo stands not just as an assignment submission, but as a living diary of my technical growth. 🚀
- Premium Glassmorphism UI: A highly polished React frontend with 3D animations, Framer Motion transitions, and real-time upload simulation.
- Foundation Scripts & Tools: Built-in connection verification for Tesseract OCR, OpenRouter API tests, and basic system logic-flow validators (`test_full_system.py`).
- API Shell: Ready-to-go environment for FastAPI server execution (`run_all.bat`).
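A Tesseract connection check like the one in those scripts can be sketched in a few lines (a hypothetical helper, not the repo's actual `test_full_system.py`; it only verifies that a working `tesseract` binary is reachable on PATH):

```python
import shutil
import subprocess

def check_tesseract() -> bool:
    """Return True if a working `tesseract` binary is reachable on PATH."""
    path = shutil.which("tesseract")
    if path is None:
        return False
    # `tesseract --version` exits 0 on a healthy install.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print("Tesseract OK" if check_tesseract() else "Tesseract not found")
```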
- Multi-Modal Ingestion: Direct ingestion pipeline for varying formats (PDFs, Scanned images).
- Intelligent Layout Detection: Native integration of YOLOv8 to slice tables, figures, and text efficiently.
- Multi-Agent Architecture:
- 👁️ Vision Agent: Parses layout visually.
- 📝 Text Agent: Reads the content.
- 🔗 Fusion Agent: Merges text/vision contexts.
- 🧠 Synthesis Agent: Produces the structured JSON representation (powered by Gemini Flash).
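The agent hand-off above can be sketched as a simple pipeline (a pure-Python illustration of the control flow; the class and field names are mine, not the repo's actual `app` module, and each agent body is a stand-in for the real YOLO/OCR/Gemini calls):

```python
from dataclasses import dataclass, field

@dataclass
class DocumentContext:
    """Shared context passed from agent to agent."""
    source: str
    layout: list[str] = field(default_factory=list)   # regions found by the Vision Agent
    text: str = ""                                    # content read by the Text Agent
    fused: dict = field(default_factory=dict)         # merged view from the Fusion Agent

class VisionAgent:
    def run(self, ctx: DocumentContext) -> DocumentContext:
        # Stand-in for YOLOv8 layout detection: tag plausible regions.
        ctx.layout = ["table", "figure", "paragraph"]
        return ctx

class TextAgent:
    def run(self, ctx: DocumentContext) -> DocumentContext:
        # Stand-in for OCR / text extraction.
        ctx.text = f"extracted text from {ctx.source}"
        return ctx

class FusionAgent:
    def run(self, ctx: DocumentContext) -> DocumentContext:
        # Merge the visual and textual contexts into one structure.
        ctx.fused = {"regions": ctx.layout, "content": ctx.text}
        return ctx

class SynthesisAgent:
    def run(self, ctx: DocumentContext) -> dict:
        # Stand-in for the Gemini Flash call that emits structured JSON.
        return {"document": ctx.source, **ctx.fused}

def run_pipeline(source: str) -> dict:
    ctx = DocumentContext(source=source)
    for agent in (VisionAgent(), TextAgent(), FusionAgent()):
        ctx = agent.run(ctx)
    return SynthesisAgent().run(ctx)

print(run_pipeline("report.pdf"))
```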
- Frontend: React (v19), Vite, Framer Motion, Tailwind CSS, Lucide React
- Backend (In Dev): FastAPI, Python 3.10
- AI/ML (In Dev): YOLOv8, SentenceTransformers, Tesseract, Google Gemini (via OpenRouter)
- Clone the Repository

  ```bash
  git clone https://github.com/AnmollCodes/Multi-Modal_AI.git
  cd Multi-Modal_AI
  ```

- Run the System (One-Click Windows)

  Double-click `run_all.bat` to automatically set up the environments and boot both the backend & frontend. Or manually:

  ```bash
  # Backend Setup
  python -m venv venv
  venv\Scripts\activate
  pip install -r requirements.txt
  uvicorn app.main:app --host 0.0.0.0 --port 8000

  # Frontend Setup (Port 5173 / localhost)
  cd web
  npm install
  npm run dev
  ```
Created with dedication by Anmol Agarwal. Continually building and striving for excellence.