Multi-Modal Document Intelligence System

Status: Work in Progress | Python 3.10+ | React | License

A multi-agent AI system that combines computer vision, text extraction, and large language models (Gemini 2.0) to understand documents and structure their contents.


🎯 Dave.AI Hiring Assignment & My Learning Journey

Honesty and self-awareness are as important as the code itself!

This repository was initially born out of a hiring assignment for Dave.AI. It challenged me to push my boundaries and step into complex, multi-modal territory. I built the project up to this current iteration and am continuing to actively modify, refine, and upgrade it.

🧠 What I Learned

Working on this project has been an incredibly rewarding learning curve. I gained hands-on experience in:

  • Full-Stack Orchestration: Bridging a FastAPI/Python backend seamlessly with a sleek React & Vite frontend.
  • Agentic Architectures: Conceptualizing multi-agent systems where distinct "Agents" (Vision, Text, Fusion, Synthesis) exchange context.
  • Modern User Interfaces: Creating fluid, "glassmorphic" UI components with Framer Motion, tailored to provide a premium AI interactive experience.

🚧 Failures, Challenges & Transparency

I strongly believe in sharing where I stumbled to highlight genuine continuous improvement:

  • Missing Implementations: The underlying orchestration layer (the app module framework) for dynamic, synchronized agent execution is still a work in progress. The integration between the YOLO layout parser and the Gemini Flash engine has not been fully connected yet.
  • The RAG Module: Vector indexing of text and image embeddings with Qdrant turned out to be much harder than I anticipated on a local setup, and remains partially implemented.
  • My Ongoing Focus: I couldn't perfect everything within the initial timeline, but I have not given up on the vision. I am currently taking time to brush up on LangChain/LlamaIndex and advanced computer vision deployment to bridge these exact gaps.
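
To make the vector-indexing idea concrete, here is a minimal sketch of what the RAG module is aiming for. It is not the repository's code: it uses only the Python standard library as a stand-in for Qdrant, and the hand-written three-dimensional vectors are assumptions standing in for real text and image embeddings. The `TinyIndex` class and its methods are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyIndex:
    """Hypothetical in-memory index; a real setup would delegate this to a vector database."""

    def __init__(self):
        self.items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        self.items.append((vector, payload))

    def search(self, query, top_k=3):
        # Score every stored vector against the query and return the best matches.
        scored = [(cosine(query, vector), payload) for vector, payload in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


index = TinyIndex()
# Made-up embeddings: text chunks and image crops share one vector space.
index.add([1.0, 0.0, 0.0], {"kind": "text", "ref": "page 1, paragraph 2"})
index.add([0.0, 1.0, 0.0], {"kind": "image", "ref": "page 1, figure 1"})

best_score, best_payload = index.search([0.9, 0.1, 0.0], top_k=1)[0]
print(best_payload["kind"], round(best_score, 3))
```

The brute-force scan is fine for a handful of vectors. The hard part noted above, keeping text and image embeddings in one searchable space at scale, is what a dedicated vector store is meant to handle.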

This repo stands not just as an assignment submission, but as a living diary of my technical growth. 🚀


🚀 Features (Current & Roadmap)

✅ Currently Implemented

  • Premium Glassmorphism UI: A highly polished React frontend with 3D animations, Framer Motion transitions, and real-time upload simulation.
  • Foundation Scripts & Tools: Built-in connection verification with Tesseract OCR, OpenRouter API tests, and basic system logic flow validators (test_full_system.py).
  • API Shell: Ready-to-go environment for FastAPI server execution (run_all.bat).

🔜 Upcoming Features (Work In Progress)

  • Multi-Modal Ingestion: Direct ingestion pipeline for varying formats (PDFs, scanned images).
  • Intelligent Layout Detection: Native integration of YOLOv8 to slice tables, figures, and text efficiently.
  • Multi-Agent Architecture:
    • 👁️ Vision Agent: Parses layout visually.
    • 📝 Text Agent: Reads the content.
    • 🔗 Fusion Agent: Merges text/vision contexts.
    • 🧠 Synthesis Agent: Produces a structured JSON representation (powered by Gemini Flash).
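
The planned agent hand-off can be sketched as a simple sequential pipeline. This is an illustrative outline only, not the implemented orchestration: the `Context` dataclass, the agent function names, and the placeholder outputs are all assumptions, and the real agents would call YOLOv8, Tesseract, and Gemini where the comments indicate.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Context:
    """Hypothetical shared state that each agent reads from and writes to."""
    source: str
    regions: list[dict[str, Any]] = field(default_factory=list)
    text: str = ""
    fused: list[dict[str, Any]] = field(default_factory=list)
    result: str = ""


def vision_agent(ctx: Context) -> Context:
    # Placeholder for a layout detector (e.g. YOLOv8): emits one fake region.
    ctx.regions = [{"kind": "text", "bbox": (0, 0, 100, 40)}]
    return ctx


def text_agent(ctx: Context) -> Context:
    # Placeholder for OCR (e.g. Tesseract).
    ctx.text = f"text extracted from {ctx.source}"
    return ctx


def fusion_agent(ctx: Context) -> Context:
    # Attach the extracted text to every detected region.
    ctx.fused = [{**region, "content": ctx.text} for region in ctx.regions]
    return ctx


def synthesis_agent(ctx: Context) -> Context:
    # Placeholder for an LLM call: here the JSON is assembled directly.
    ctx.result = json.dumps({"source": ctx.source, "blocks": ctx.fused})
    return ctx


PIPELINE = [vision_agent, text_agent, fusion_agent, synthesis_agent]


def run_pipeline(source: str) -> Context:
    """Run the agents in order, passing the same context along."""
    ctx = Context(source=source)
    for agent in PIPELINE:
        ctx = agent(ctx)
    return ctx


print(run_pipeline("invoice.pdf").result)
```

Keeping every agent as a function from context to context makes it easy to swap a placeholder for a real model later without touching the other stages.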

🛠️ Tech Stack

  • Frontend: React (v19), Vite, Framer Motion, Tailwind CSS, Lucide React
  • Backend (In Dev): FastAPI, Python 3.10
  • AI/ML (In Dev): YOLOv8, SentenceTransformers, Tesseract, Google Gemini (via OpenRouter)

📦 Installation & Setup

  1. Clone the Repository

    git clone https://github.com/AnmollCodes/Multi-Modal_AI.git
    cd Multi-Modal_AI

  2. Run the System (One-Click Windows)

    Double-click run_all.bat to automatically set up the environments and boot both the backend and the frontend.

    Or manually:

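    # Backend setup (Windows; on macOS/Linux activate with: source venv/bin/activate)
    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt
    uvicorn app.main:app --host 0.0.0.0 --port 8000

    # Frontend setup (in a second terminal, since uvicorn keeps the first one busy)
    # The dev server runs on localhost, port 5173
    cd web
    npm install
    npm run dev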
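
Once both servers are started, a quick way to confirm they are listening is a small port check. This is a sketch that is not part of the repository; the `port_open` helper is a hypothetical name, and the ports are the ones given in the manual setup steps (8000 for the backend, 5173 for the frontend).

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the manual setup steps; adjust if yours differ.
    for name, port in (("backend", 8000), ("frontend", 5173)):
        status = "up" if port_open("127.0.0.1", port) else "not reachable"
        print(f"{name} (port {port}): {status}")
```

An open port only shows that something is listening there; it does not verify that the API routes respond correctly.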

Created with dedication by Anmol Agarwal. Continually building and striving for excellence.

