A robust AI agent system combining Computer Vision, Text Extraction, and Large Language Models (Gemini 2.0) for sophisticated document understanding and structuring.
Honesty and self-awareness are as important as the code itself!
This repository was initially born out of a hiring assignment for Dave.AI. It challenged me to push my boundaries and step into complex, multi-modal territory. I built the project up to this current iteration and am continuing to actively modify, refine, and upgrade it.
Working on this project has been an incredibly rewarding learning curve. I gained hands-on experience in:
- Full-Stack Orchestration: Bridging a FastAPI/Python backend seamlessly with a sleek React & Vite frontend.
- Agentic Architectures: Conceptualizing multi-agent systems where distinct "Agents" (Vision, Text, Fusion, Synthesis) exchange context.
- Modern User Interfaces: Creating fluid, "glassmorphic" UI components with Framer Motion, tailored to provide a premium AI interactive experience.
I strongly believe in sharing where I stumbled to highlight genuine continuous improvement:
- Missing Implementations: The true underlying orchestration (the `app` module framework) for dynamic, synchronized agent execution is still a work in progress. Certain complex integrations between the YOLO layout parsers and the Gemini Flash engine haven't been fully connected yet.
- The RAG Module: Integrating seamless vector indexing (text + image embeddings) with Qdrant turned out to be much harder than I anticipated at a local scale, and it remains partially implemented.
- My Ongoing Focus: I couldn't perfect everything within the initial timeline, but I have not given up on the vision. I am currently deepening my skills in LangChain/LlamaIndex and advanced computer-vision deployment to bridge exactly these gaps.
This repo stands not just as an assignment submission, but as a living diary of my technical growth. 🚀
- Premium Glassmorphism UI: A highly polished React frontend with 3D animations, Framer Motion transitions, and real-time upload simulation.
- Foundation Scripts & Tools: Built-in connection verification for Tesseract OCR, OpenRouter API tests, and basic system logic-flow validators (`test_full_system.py`).
- API Shell: Ready-to-go environment for FastAPI server execution (`run_all.bat`).
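A Tesseract connection check like the one in those scripts can be sketched in a few lines (a hypothetical helper, not the repo's actual `test_full_system.py`; it only verifies that a working `tesseract` binary is reachable on PATH):

```python
import shutil
import subprocess

def check_tesseract() -> bool:
    """Return True if a working `tesseract` binary is reachable on PATH."""
    path = shutil.which("tesseract")
    if path is None:
        return False
    # `tesseract --version` exits 0 on a healthy install.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print("Tesseract OK" if check_tesseract() else "Tesseract not found")
```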
- Multi-Modal Ingestion: Direct ingestion pipeline for varying formats (PDFs, Scanned images).
- Intelligent Layout Detection: Native integration of YOLOv8 to slice tables, figures, and text efficiently.
- Multi-Agent Architecture:
- 👁️ Vision Agent: Parses layout visually.
- 📝 Text Agent: Reads the content.
- 🔗 Fusion Agent: Merges text/vision contexts.
- 🧠 Synthesis Agent: Produces the structured JSON representation (powered by Gemini Flash).
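The agent hand-off above can be sketched as a simple pipeline (a pure-Python illustration of the control flow; the class and field names are mine, not the repo's actual `app` module, and each agent body is a stand-in for the real YOLO/OCR/Gemini calls):

```python
from dataclasses import dataclass, field

@dataclass
class DocumentContext:
    """Shared context passed from agent to agent."""
    source: str
    layout: list[str] = field(default_factory=list)   # regions found by the Vision Agent
    text: str = ""                                    # content read by the Text Agent
    fused: dict = field(default_factory=dict)         # merged view from the Fusion Agent

class VisionAgent:
    def run(self, ctx: DocumentContext) -> DocumentContext:
        # Stand-in for YOLOv8 layout detection: tag plausible regions.
        ctx.layout = ["table", "figure", "paragraph"]
        return ctx

class TextAgent:
    def run(self, ctx: DocumentContext) -> DocumentContext:
        # Stand-in for OCR / text extraction.
        ctx.text = f"extracted text from {ctx.source}"
        return ctx

class FusionAgent:
    def run(self, ctx: DocumentContext) -> DocumentContext:
        # Merge the visual and textual contexts into one structure.
        ctx.fused = {"regions": ctx.layout, "content": ctx.text}
        return ctx

class SynthesisAgent:
    def run(self, ctx: DocumentContext) -> dict:
        # Stand-in for the Gemini Flash call that emits structured JSON.
        return {"document": ctx.source, **ctx.fused}

def run_pipeline(source: str) -> dict:
    ctx = DocumentContext(source=source)
    for agent in (VisionAgent(), TextAgent(), FusionAgent()):
        ctx = agent.run(ctx)
    return SynthesisAgent().run(ctx)

print(run_pipeline("report.pdf"))
```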
- Frontend: React (v19), Vite, Framer Motion, Tailwind CSS, Lucide React
- Backend (In Dev): FastAPI, Python 3.10
- AI/ML (In Dev): YOLOv8, SentenceTransformers, Tesseract, Google Gemini (via OpenRouter)
- Clone the Repository

  ```bash
  git clone https://github.com/AnmollCodes/Multi-Modal_AI.git
  cd Multi-Modal_AI
  ```

- Run the System (One-Click Windows)

  Double-click `run_all.bat` to automatically set up the environments and boot both the backend & frontend. Or manually:

  ```bash
  # Backend Setup
  python -m venv venv
  venv\Scripts\activate
  pip install -r requirements.txt
  uvicorn app.main:app --host 0.0.0.0 --port 8000

  # Frontend Setup (Port 5173 / localhost)
  cd web
  npm install
  npm run dev
  ```
Created with dedication by Anmol Agarwal. Continually building and striving for excellence.