InsightRAG - AI-Powered Document & Video Chat System

A production-ready RAG (Retrieval-Augmented Generation) system that enables intelligent conversations with PDFs and YouTube videos, featuring automatic citation tracking, hierarchical chunking, and real-time source navigation.


🎯 Overview

InsightRAG solves a critical problem in document research: finding exact sources for AI-generated answers. Unlike standard ChatGPT interactions where you need to manually search documents for citations, this system automatically provides clickable references to exact page numbers in PDFs or timestamps in YouTube videos.

The Problem

  • ChatGPT doesn't provide page numbers for document citations
  • When page numbers are given, they're often inaccurate
  • Users must manually search through documents to verify information
  • No easy way to analyze long-form video content

The Solution

  • Automatic Citations: Every answer includes exact page numbers or video timestamps
  • Click-to-Navigate: Click any citation to instantly jump to the source
  • Hierarchical RAG: Advanced chunking strategy ensures accurate retrieval and rich context
  • Multi-Source Support: Works with both PDFs and YouTube videos seamlessly

✨ Key Features

📄 Document Intelligence

  • PDF Processing: Upload PDFs up to 100+ pages with automatic text extraction
  • Smart Chunking: Hierarchical parent-child chunking for optimal retrieval accuracy
  • Page-Level Citations: Every answer includes specific page numbers
  • Auto-Navigation: Click citations to jump directly to referenced pages in the PDF viewer

🎥 Video Analysis

  • YouTube Integration: Paste any YouTube URL to analyze video content
  • Transcript Extraction: Automatic subtitle/caption retrieval in multiple languages
  • Timestamp Citations: Answers include exact timestamps where information appears
  • Quick Seeking: Click timestamps to jump to that moment in the video

💬 Conversation Features

  • Context-Aware Chat: Maintains conversation history per document/video
  • Multi-Document Support: Switch between different documents and their chat histories
  • Real-Time Responses: Fast answer generation with streaming support
  • Source Verification: All claims backed by retrievable sources

🔐 Security & Auth

  • JWT Authentication: Secure token-based authentication system
  • User Isolation: Each user's documents and conversations are private
  • Password Hashing: Bcrypt for secure password storage (salted, adaptive hashing)
  • Session Management: Automatic token refresh and logout handling

🎨 User Experience

  • Split-Screen Interface: Chat and document viewer side-by-side
  • Responsive Design: Works on desktop, tablet, and mobile
  • Dark Mode Support: Easy on the eyes for long reading sessions
  • Keyboard Shortcuts: Efficient navigation and interaction

🏗️ Architecture

System Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                         Frontend                             │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐  │
│  │ React + TS   │  │ PDF Viewer   │  │ Video Player    │  │
│  │ Chat UI      │  │ (react-pdf)  │  │ (YouTube API)   │  │
│  └──────┬───────┘  └──────┬───────┘  └────────┬────────┘  │
│         │                  │                    │            │
│         └──────────────────┴────────────────────┘            │
│                            │                                 │
│                    REST API (JWT Auth)                       │
└────────────────────────────┼────────────────────────────────┘
                             │
┌────────────────────────────┼────────────────────────────────┐
│                      FastAPI Backend                         │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐  │
│  │ Auth Service │  │ Doc Processor│  │ Chat Service    │  │
│  │ (JWT)        │  │ (PDF/YT)     │  │ (RAG Pipeline)  │  │
│  └──────┬───────┘  └──────┬───────┘  └────────┬────────┘  │
│         │                  │                    │            │
│  ┌──────┴──────────────────┴────────────────────┴────────┐ │
│  │              Core Application Layer                    │ │
│  └────────────────────────────────────────────────────────┘ │
│         │                  │                    │            │
│  ┌──────▼───────┐   ┌─────▼──────┐     ┌──────▼─────────┐ │
│  │ PostgreSQL   │   │ Qdrant     │     │ File Storage   │ │
│  │ (User/Meta)  │   │ (Vectors)  │     │ (Parent Chunks)│ │
│  └──────────────┘   └────────────┘     └────────────────┘ │
│         │                  │                    │            │
│  ┌──────▼──────────────────▼────────────────────▼────────┐ │
│  │            External Services Layer                     │ │
│  │  ┌───────────┐  ┌──────────┐  ┌──────────────────┐   │ │
│  │  │ OpenAI    │  │ Groq AI  │  │ YouTube Trans.   │   │ │
│  │  │(Embedding)│  │ (LLM)    │  │ API              │   │ │
│  │  └───────────┘  └──────────┘  └──────────────────┘   │ │
│  └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Data Flow: Document Upload to Query

1. USER UPLOADS PDF
   ↓
2. EXTRACT TEXT (pypdf)
   ↓
3. CREATE CHUNKS
   ├─ Parent Chunks (5000 chars) → Local JSON Storage
   └─ Child Chunks (900 chars) → Continue to embedding
   ↓
4. GENERATE EMBEDDINGS (OpenAI text-embedding-3-small)
   ↓
5. STORE IN QDRANT
   ├─ Child chunks with embeddings
   ├─ Metadata (document_id, user_id, parent_id, page_number)
   └─ Vector index for similarity search
   ↓
6. MARK AS COMPLETED
   ↓
7. USER ASKS QUESTION
   ↓
8. EMBED QUESTION (OpenAI)
   ↓
9. VECTOR SEARCH (Qdrant)
   ├─ Find top-k similar child chunks
   └─ Extract parent_ids
   ↓
10. RETRIEVE PARENT CHUNKS (Local Storage)
    ├─ Get full context from parent chunks
    └─ Extract page numbers from metadata
    ↓
11. GENERATE ANSWER (Groq Llama 3.3 70B)
    ├─ Context: Parent chunk content
    └─ Question: User query
    ↓
12. RETURN RESPONSE
    ├─ Answer text
    ├─ Citations with page numbers
    └─ Source references
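The chunking step (3) can be sketched in plain Python. The windowing below is a simplified stand-in for the LangChain splitters the project uses, with the same sizes and overlaps as above; function names are illustrative:

```python
# Hypothetical sketch of step 3: hierarchical parent/child chunking.
# Sizes mirror the pipeline above (parent 5000/300, child 900/50 chars).
import uuid

def make_chunks(text, size, overlap):
    """Slide a fixed-size window over the text with the given overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def hierarchical_chunks(text,
                        parent_size=5000, parent_overlap=300,
                        child_size=900, child_overlap=50):
    parents, children = [], []
    for parent_text in make_chunks(text, parent_size, parent_overlap):
        parent_id = str(uuid.uuid4())
        parents.append({"id": parent_id, "text": parent_text})
        for child_text in make_chunks(parent_text, child_size, child_overlap):
            # Each child keeps a back-reference so retrieval can swap it
            # for its parent's fuller context (step 10 above).
            children.append({"parent_id": parent_id, "text": child_text})
    return parents, children
```

Parents then go to local JSON storage (step 3) while only the children are embedded and stored in Qdrant (steps 4-5).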

🛠️ Tech Stack

Backend Technologies

| Technology | Version | Purpose |
|------------|---------|---------|
| Python | 3.10+ | Core programming language |
| FastAPI | 0.100+ | High-performance async web framework |
| PostgreSQL | 14+ | Primary database for user data, documents, conversations |
| SQLAlchemy | 2.0+ | Async ORM for database operations |
| Alembic | 1.11+ | Database migration tool |
| Qdrant | 1.7+ | Vector database for semantic search |
| OpenAI API | 1.0+ | Text embedding generation (text-embedding-3-small) |
| Groq API | Latest | Fast LLM inference (Llama 3.3 70B) |
| PyPDF | 3.0+ | PDF text extraction |
| YouTube Transcript API | 0.6+ | Video transcript extraction |
| python-jose | 3.3+ | JWT token creation and validation |
| passlib | 1.7+ | Password hashing with bcrypt |
| python-multipart | 0.0.6+ | File upload handling |
| LangChain | 0.1+ | Text splitting and chunking utilities |

Frontend Technologies

| Technology | Version | Purpose |
|------------|---------|---------|
| React | 18.2+ | UI framework |
| TypeScript | 5.0+ | Type-safe JavaScript |
| Vite | 5.0+ | Build tool and dev server |
| React Router | 6.20+ | Client-side routing |
| Tailwind CSS | 3.4+ | Utility-first CSS framework |
| shadcn/ui | Latest | Pre-built React components |
| react-pdf | 7.5+ | PDF rendering and navigation |
| Lucide React | 0.300+ | Icon library |

Infrastructure

| Service | Purpose |
|---------|---------|
| Qdrant Cloud (optional) | Managed vector database |
| AWS S3 (optional) | File storage for uploaded documents |
| Docker | Containerization for deployment |
| Nginx | Reverse proxy and static file serving |

💻 System Requirements

Minimum Requirements

  • CPU: 2 cores
  • RAM: 4 GB
  • Storage: 10 GB free space
  • OS: Windows 10+, macOS 10.15+, or Linux (Ubuntu 20.04+)

Recommended Requirements

  • CPU: 4+ cores
  • RAM: 8+ GB
  • Storage: 20+ GB SSD
  • OS: Latest stable version

Software Prerequisites

  • Python 3.10 or higher
  • Node.js 18 or higher
  • PostgreSQL 14 or higher
  • Qdrant (local or cloud)
  • Git

📦 Installation

1. Clone the Repository

git clone https://github.com/dattang12/rag-for-doc-youtube.git
cd rag-for-doc-youtube

2. Backend Setup

Install Python Dependencies

cd backend

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Set Up PostgreSQL Database

# Create database
createdb rag_db

# Or using psql:
psql -U postgres
CREATE DATABASE rag_db;
\q

Configure Environment Variables

# Copy example env file
cp .env.example .env

# Edit .env with your values
nano .env

Required environment variables:

# Database Configuration
DATABASE_URL=postgresql://username:password@localhost:5432/rag_db

# OpenAI Configuration (for embeddings)
OPENAI_API_KEY=sk-your-openai-api-key-here

# Groq Configuration (for LLM)
GROQ_API_KEY=gsk_your-groq-api-key-here

# JWT Configuration
SECRET_KEY=your-super-secret-jwt-key-here
ALGORITHM=HS256
ACCESS_TOKEN_EXPIRE_MINUTES=10080

# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_COLLECTION_CHILD=child_chunks
QDRANT_COLLECTION_PARENT=parent_chunks

# Application Settings
TOP_K_RESULTS=10
UPLOAD_DIR=./storage/uploads
PARENT_CHUNK_DIR=./storage/parent_chunks

Run Database Migrations

# Run all migrations
alembic upgrade head

Start Qdrant (Local)

# Using Docker
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

# Or install locally and run
qdrant

Start the Backend Server

# Development mode with auto-reload
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production mode
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

Backend will be available at http://localhost:8000

API documentation at http://localhost:8000/docs

3. Frontend Setup

Install Node Dependencies

cd ../frontend

# Install dependencies
npm install

Configure Environment Variables

# Copy example env file
cp .env.example .env

# Edit .env
nano .env

Required environment variables:

VITE_API_URL=http://localhost:8000

Start the Development Server

# Development mode
npm run dev

# Build for production
npm run build

# Preview production build
npm run preview

Frontend will be available at http://localhost:5173

4. Verify Installation

  1. Open browser to http://localhost:5173
  2. Create an account
  3. Upload a sample PDF
  4. Wait for processing to complete
  5. Ask a question and verify citations appear

⚙️ Configuration

Backend Configuration Options

Database Settings

# app/core/config.py
class Settings:
    # PostgreSQL connection
    DATABASE_URL: str = "postgresql://user:pass@localhost/db"
    
    # Connection pool settings
    DB_POOL_SIZE: int = 5
    DB_MAX_OVERFLOW: int = 10

Embedding Settings

# Embedding model selection
EMBEDDING_MODEL: str = "text-embedding-3-small"  # or "text-embedding-3-large"
EMBEDDING_DIMENSIONS: int = 1536  # or 3072 for large

# Batch size for embedding generation
EMBEDDING_BATCH_SIZE: int = 50

Chunking Strategy

# app/utils/hierarchical_chunker.py
class HierarchicalChunker:
    def __init__(
        self,
        parent_chunk_size: int = 5000,      # Larger chunks for context
        parent_chunk_overlap: int = 300,     # Overlap to preserve context
        child_chunk_size: int = 900,         # Smaller chunks for precision
        child_chunk_overlap: int = 50        # Minimal overlap for children
    ): ...

LLM Settings

# app/services/openai_service.py
GROQ_MODEL: str = "llama-3.3-70b-versatile"
MAX_TOKENS: int = 2048
TEMPERATURE: float = 0.7

Vector Search Parameters

# Search configuration
TOP_K_RESULTS: int = 10              # Number of chunks to retrieve
SCORE_THRESHOLD: float = 0.2         # Minimum similarity score

Frontend Configuration Options

API Configuration

// src/lib/api.ts
export const API_BASE = import.meta.env.VITE_API_URL || 'http://localhost:8000';
export const API_TIMEOUT = 30000; // 30 seconds

PDF Viewer Settings

// src/components/DocumentViewer.tsx
const PDF_SCALE = 1.0;
const PDF_PAGE_WIDTH = 600;
const ENABLE_TEXT_LAYER = true;

📚 Usage Guide

1. Getting Started

Create an Account

  1. Navigate to http://localhost:5173
  2. Click "Sign Up"
  3. Enter email, username, and password
  4. Click "Create Account"

Login

  1. Enter your credentials
  2. Click "Sign In"
  3. You'll be redirected to the main chat interface

2. Working with Documents

Upload a PDF

  1. Click the "New" button in the top-right
  2. Select "Document"
  3. Click "Choose File" or drag-and-drop a PDF
  4. Wait for processing (shows progress bar)
  5. Processing time varies by document size:
    • 10 pages: ~10-15 seconds
    • 50 pages: ~30-45 seconds
    • 100+ pages: ~1-2 minutes

Chat with Your Document

  1. Once processing completes, the chat interface activates
  2. Type your question in the text box
  3. Press Enter or click the send button
  4. Wait for the AI response (usually 2-5 seconds)
  5. Review the answer and citations

Navigate Using Citations

  1. Look for page number badges in the AI response
  2. Click any "📄 Page X" badge
  3. The PDF viewer automatically jumps to that page
  4. Review the source material
  5. Continue your conversation

3. Working with YouTube Videos

Add a YouTube Video

  1. Click the "New" button
  2. Select "YouTube"
  3. Paste the full YouTube URL
  4. Click "Add Video"
  5. Wait for transcript extraction (~5-10 seconds)

Chat with Video Content

  1. Ask questions about the video content
  2. Receive answers with timestamp references
  3. Click timestamp badges (▶ 2:35) to jump to that moment
  4. Video player seeks automatically

4. Managing Conversations

View Conversation History

  1. Click a document/video card to load its conversation
  2. All previous questions and answers are preserved
  3. Context is maintained across the conversation

Start a New Conversation

  1. Click "New" to upload a different document
  2. Each document has its own isolated conversation
  3. Switch between documents to access their chats

Delete Conversations

  1. Click the document options menu (⋮)
  2. Select "Delete"
  3. Conversation and document are permanently removed

📡 API Documentation

Authentication Endpoints

Register User

POST /api/v1/auth/user/register
Content-Type: application/json

{
  "email": "user@example.com",
  "username": "username",
  "password": "securepassword"
}

Response: 200 OK
{
  "id": 1,
  "email": "user@example.com",
  "username": "username"
}

Login

POST /api/v1/auth/user/login
Content-Type: application/x-www-form-urlencoded

username=user@example.com&password=securepassword

Response: 200 OK
{
  "access_token": "eyJ0eXAiOiJKV1QiLCJhbGc...",
  "token_type": "bearer"
}

Get Current User

GET /api/v1/users/users/me
Authorization: Bearer <token>

Response: 200 OK
{
  "id": 1,
  "email": "user@example.com",
  "username": "username"
}

Document Endpoints

Upload Document

POST /api/v1/documents/upload
Authorization: Bearer <token>
Content-Type: multipart/form-data

file: <binary PDF data>

Response: 200 OK
{
  "id": 42,
  "filename": "uuid-filename.pdf",
  "original_filename": "document.pdf",
  "file_size": 1048576,
  "document_type": "PDF",
  "status": "PROCESSING",
  "num_pages": 25,
  "created_at": "2024-01-15T10:30:00Z"
}

Get Document

GET /api/v1/documents/{document_id}
Authorization: Bearer <token>

Response: 200 OK
{
  "id": 42,
  "original_filename": "document.pdf",
  "status": "COMPLETED",
  "num_pages": 25,
  "created_at": "2024-01-15T10:30:00Z",
  "processed_at": "2024-01-15T10:30:45Z"
}

Download Document

GET /api/v1/documents/{document_id}/file
Authorization: Bearer <token>

Response: 200 OK
Content-Type: application/pdf
Content-Disposition: inline; filename="document.pdf"

<binary PDF data>

List Documents

GET /api/v1/documents/
Authorization: Bearer <token>

Response: 200 OK
[
  {
    "id": 42,
    "original_filename": "document.pdf",
    "status": "COMPLETED",
    "num_pages": 25,
    "created_at": "2024-01-15T10:30:00Z"
  }
]

Delete Document

DELETE /api/v1/documents/{document_id}
Authorization: Bearer <token>

Response: 200 OK
{
  "message": "Document deleted successfully"
}

YouTube Endpoints

Add YouTube Video

POST /api/v1/youtube/add
Authorization: Bearer <token>
Content-Type: application/json

{
  "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}

Response: 200 OK
{
  "id": 10,
  "video_id": "dQw4w9WgXcQ",
  "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
  "title": "Video Title",
  "status": "COMPLETED",
  "created_at": "2024-01-15T10:30:00Z"
}

List YouTube Videos

GET /api/v1/youtube/
Authorization: Bearer <token>

Response: 200 OK
[
  {
    "id": 10,
    "video_url": "https://www.youtube.com/watch?v=...",
    "title": "Video Title",
    "status": "COMPLETED"
  }
]

Chat Endpoints

Ask Question

POST /api/v1/chat/ask
Authorization: Bearer <token>
Content-Type: application/json

{
  "question": "What is the main topic of the document?",
  "document_id": 42,
  "conversation_id": null
}

Response: 200 OK
{
  "answer": "The main topic of the document is...",
  "conversation_id": 15,
  "document_id": 42,
  "document_name": "document.pdf",
  "citations": [
    {
      "text": "The document discusses...",
      "page": 5,
      "score": 0.89
    },
    {
      "text": "Further evidence shows...",
      "page": 12,
      "score": 0.85
    }
  ]
}
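For reference, the login and ask calls documented above can be assembled with nothing but the Python standard library. The base URL and credentials below are placeholders for a local run, and the requests are only constructed here, not sent:

```python
# Illustrative client for the login and /chat/ask endpoints.
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000"  # assumed local backend

# 1. Login uses form encoding, per the endpoint documentation above.
login_req = urllib.request.Request(
    f"{BASE}/api/v1/auth/user/login",
    data=urllib.parse.urlencode({
        "username": "user@example.com",
        "password": "securepassword",
    }).encode(),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

# 2. /chat/ask takes JSON plus the bearer token from the login response.
token = "eyJ0eXAiOiJKV1QiLCJhbGc..."  # access_token from step 1
ask_req = urllib.request.Request(
    f"{BASE}/api/v1/chat/ask",
    data=json.dumps({
        "question": "What is the main topic of the document?",
        "document_id": 42,
        "conversation_id": None,
    }).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    },
    method="POST",
)
# Sending would be urllib.request.urlopen(ask_req); the JSON response
# carries "answer", "conversation_id", and a "citations" list.
```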

Get Conversations

GET /api/v1/chat/conversations
Authorization: Bearer <token>

Response: 200 OK
[
  {
    "id": 15,
    "document_id": 42,
    "document_name": "document.pdf",
    "created_at": "2024-01-15T10:30:00Z",
    "messages": [
      {
        "id": 1,
        "role": "USER",
        "content": "What is this about?",
        "created_at": "2024-01-15T10:31:00Z"
      },
      {
        "id": 2,
        "role": "ASSISTANT",
        "content": "This document discusses...",
        "created_at": "2024-01-15T10:31:05Z"
      }
    ]
  }
]

📁 Project Structure

rag-for-doc-youtube/
├── backend/
│   ├── app/
│   │   ├── api/
│   │   │   ├── deps.py                 # Dependency injection
│   │   │   └── v1/
│   │   │       ├── auth.py             # Authentication endpoints
│   │   │       ├── user.py             # User management
│   │   │       ├── document.py         # Document upload/management
│   │   │       ├── youtube.py          # YouTube video handling
│   │   │       └── chat.py             # Chat/Q&A endpoints
│   │   ├── core/
│   │   │   ├── config.py               # Configuration management
│   │   │   └── security.py             # JWT and password hashing
│   │   ├── db/
│   │   │   ├── database.py             # Database connection
│   │   │   └── parent_store_manager.py # Parent chunk storage
│   │   ├── models/
│   │   │   ├── user.py                 # User model
│   │   │   ├── document.py             # Document model
│   │   │   ├── youtube.py              # YouTube video model
│   │   │   └── chat.py                 # Conversation/Message models
│   │   ├── schemas/
│   │   │   ├── user.py                 # User Pydantic schemas
│   │   │   ├── document.py             # Document schemas
│   │   │   ├── youtube.py              # YouTube schemas
│   │   │   └── chat.py                 # Chat schemas
│   │   ├── services/
│   │   │   ├── embedding_service.py    # OpenAI embedding generation
│   │   │   └── openai_service.py       # Groq LLM integration
│   │   ├── utils/
│   │   │   ├── document_processor.py   # PDF text extraction
│   │   │   ├── hierarchical_chunker.py # Text chunking logic
│   │   │   └── youtube_utils.py        # YouTube transcript extraction
│   │   ├── vectordb/
│   │   │   └── qdrant_client.py        # Qdrant vector operations
│   │   └── main.py                     # FastAPI application entry
│   ├── alembic/
│   │   ├── versions/                   # Database migrations
│   │   └── env.py                      # Alembic configuration
│   ├── storage/
│   │   ├── uploads/                    # Uploaded PDF files
│   │   └── parent_chunks/              # Parent chunk JSON files
│   ├── requirements.txt                # Python dependencies
│   ├── alembic.ini                     # Alembic config
│   └── .env                            # Environment variables
│
├── frontend/
│   ├── public/                         # Static assets
│   ├── src/
│   │   ├── components/
│   │   │   ├── ui/                     # shadcn/ui components
│   │   │   └── chat/
│   │   │       ├── ChatPanel.tsx       # Chat interface
│   │   │       ├── DocumentViewer.tsx  # PDF viewer
│   │   │       ├── YouTubeViewer.tsx   # YouTube player
│   │   │       ├── DocumentUpload.tsx  # File upload
│   │   │       └── YouTubeInput.tsx    # URL input
│   │   ├── pages/
│   │   │   ├── Chat.tsx                # Main chat page
│   │   │   └── SignInAndUp.tsx         # Auth page
│   │   ├── lib/
│   │   │   ├── auth.ts                 # Authentication logic
│   │   │   └── utils.ts                # Utility functions
│   │   ├── App.tsx                     # App router
│   │   ├── main.tsx                    # Entry point
│   │   └── index.css                   # Global styles
│   ├── package.json                    # Node dependencies
│   ├── tsconfig.json                   # TypeScript config
│   ├── vite.config.ts                  # Vite config
│   ├── tailwind.config.js              # Tailwind config
│   └── .env                            # Environment variables
│
├── .gitignore                          # Git ignore rules
└── README.md                           # This file

🔬 Implementation Details

Hierarchical RAG Pipeline

Why Hierarchical Chunking?

Traditional RAG systems face a dilemma:

  • Large chunks: Rich context but poor retrieval precision
  • Small chunks: Precise retrieval but insufficient context

Our solution uses hierarchical parent-child chunking:

  1. Parent Chunks (5000 chars)
    • Provide comprehensive context for LLM
    • Stored locally in JSON files for quick access
    • Include full paragraphs and section context

  2. Child Chunks (900 chars)
    • Enable precise semantic search
    • Stored in Qdrant vector database with embeddings
    • Each child references its parent

  3. Retrieval Process
    • Search child chunks for precision
    • Retrieve parent chunks for context
    • Best of both worlds!
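A toy version of that retrieval step, with a plain dict standing in for the parent-chunk store and hard-coded search hits standing in for Qdrant results (all names illustrative):

```python
# Minimal sketch of the child-to-parent retrieval step.
parent_store = {
    "p1": {"text": "Full section about embeddings ...", "page": 5},
    "p2": {"text": "Full section about chunking ...", "page": 12},
}

# Pretend these came back from the child-chunk vector search, best first.
child_hits = [
    {"parent_id": "p1", "score": 0.89},
    {"parent_id": "p1", "score": 0.85},  # duplicate parent
    {"parent_id": "p2", "score": 0.81},
]

def parents_for_hits(hits, store):
    """Deduplicate parent_ids (keeping best-score order) and fetch parents."""
    seen, ordered = set(), []
    for hit in hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            ordered.append(store[pid])
    return ordered

context = parents_for_hits(child_hits, parent_store)
# context holds each parent once; its "page" fields feed the citations.
```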

Page Number Extraction

def calculate_page_number(chunk_index: int, total_chunks: int, total_pages: int) -> int:
    """
    Distribute chunks evenly across document pages
    
    Example: 100-page document with 50 parent chunks
    - Chunk 0  → Page 1
    - Chunk 25 → Page 51
    - Chunk 49 → Page 99
    """
    if total_pages == 0 or total_chunks == 0:
        return 1
    
    page = int((chunk_index / total_chunks) * total_pages) + 1
    return min(page, total_pages)

This linear mapping gives a reliable page estimate when text is spread fairly evenly across pages; for documents with very uneven text distribution the attribution is approximate.
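As a quick sanity check of the mapping on the 100-page / 50-chunk example, the function is repeated here so the snippet runs standalone:

```python
def calculate_page_number(chunk_index: int, total_chunks: int, total_pages: int) -> int:
    if total_pages == 0 or total_chunks == 0:
        return 1
    page = int((chunk_index / total_chunks) * total_pages) + 1
    return min(page, total_pages)

print(calculate_page_number(0, 50, 100))   # first chunk  → page 1
print(calculate_page_number(25, 50, 100))  # middle chunk → page 51
print(calculate_page_number(49, 50, 100))  # last chunk   → page 99
```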

Vector Search Implementation

Embedding Generation

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def generate_embeddings(texts: list[str], batch_size: int = 50) -> list:
    """
    Generate embeddings in batches to handle large documents
    
    Why batches?
    - OpenAI API has rate limits
    - Large documents may have 1000+ chunks
    - Batching prevents timeouts and rate limit errors
    """
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        
        response = await openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        
        all_embeddings.extend(item.embedding for item in response.data)
    
    return all_embeddings

Semantic Search

from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(host="localhost", port=6333)

def search_children(
    query_vector: list[float],
    user_id: int,
    document_id: int,
    limit: int = 10,
    score_threshold: float = 0.2
):
    """
    Search for relevant child chunks using vector similarity
    
    Filters:
    - User isolation: only search user's own documents
    - Document-specific: search within one document at a time
    - Score threshold: filter out low-quality matches
    
    Returns child chunks with:
    - Text content
    - Similarity score
    - Parent ID reference
    - Metadata (page number, etc.)
    """
    results = qdrant_client.search(
        collection_name="child_chunks",
        query_vector=query_vector,
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="user_id",
                    match=models.MatchValue(value=user_id)
                ),
                models.FieldCondition(
                    key="document_id",
                    match=models.MatchValue(value=document_id)
                )
            ]
        ),
        limit=limit,
        score_threshold=score_threshold
    )
    
    return results

YouTube Timestamp Extraction

Keyword-Based Relevance Scoring

def find_relevant_segments(segments: list, question: str, top_k: int = 5) -> list:
    """
    Find transcript segments most relevant to user's question
    
    Algorithm:
    1. Extract keywords from question (remove stop words)
    2. For each transcript segment:
       - Count keyword matches
       - Calculate relevance score
    3. Sort by score and return top-k segments
    
    This is fast and works well for most queries without requiring
    additional embedding/search operations.
    """
    # Extract question keywords
    question_words = set(question.lower().split())
    stop_words = {'what', 'where', 'when', 'who', 'how', 'is', 'the', 'a', 'in', 'to'}
    question_words = question_words - stop_words
    
    scored_segments = []
    for segment in segments:
        text_words = set(segment["text"].lower().split())
        overlap = len(question_words & text_words)
        
        if overlap > 0:
            scored_segments.append({
                "text": segment["text"],
                "start": segment["start"],
                "duration": segment["duration"],
                "score": overlap / len(question_words)
            })
    
    scored_segments.sort(key=lambda x: x["score"], reverse=True)
    return scored_segments[:top_k]

Authentication & Security

JWT Token Generation

from datetime import datetime, timedelta

from jose import jwt

def create_access_token(data: dict, expires_delta: timedelta | None = None):
    """
    Create JWT access token with expiration
    
    Token payload includes:
    - sub: user email (subject)
    - exp: expiration timestamp
    - iat: issued at timestamp
    """
    to_encode = data.copy()
    
    if expires_delta:
        expire = datetime.utcnow() + expires_delta
    else:
        expire = datetime.utcnow() + timedelta(minutes=10080)  # 7 days
    
    to_encode.update({"exp": expire})
    
    encoded_jwt = jwt.encode(
        to_encode,
        SECRET_KEY,
        algorithm=ALGORITHM
    )
    
    return encoded_jwt

Password Hashing

from passlib.context import CryptContext

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")

def hash_password(password: str) -> str:
    """
    Hash password using bcrypt
    
    bcrypt automatically:
    - Generates unique salt per password
    - Uses adaptive hashing (configurable rounds)
    - Resistant to rainbow table attacks
    """
    return pwd_context.hash(password)

def verify_password(plain_password: str, hashed_password: str) -> bool:
    """Verify password against hash"""
    return pwd_context.verify(plain_password, hashed_password)

⚡ Performance Optimization

Backend Optimizations

1. Async Database Operations

# All database queries use async SQLAlchemy
async with AsyncSessionLocal() as db:
    result = await db.execute(query)
    # Non-blocking I/O operations

2. Connection Pooling

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    DATABASE_URL,
    pool_size=5,          # 5 persistent connections
    max_overflow=10,      # Up to 15 total connections
    pool_pre_ping=True    # Verify connection health
)

3. Batch Embedding Generation

  • Process 50 chunks at a time
  • Reduces API calls from 1000 to 20 for large documents
  • Prevents rate limiting

4. Parent Chunk Caching

  • Store parent chunks as JSON files on disk
  • Faster than database queries
  • No network overhead for retrieval
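A minimal sketch of such a store, assuming one JSON file per document keyed by parent_id (the real layout in app/db/parent_store_manager.py may differ):

```python
# Hypothetical on-disk parent store: one JSON file per document.
import json
import tempfile
from pathlib import Path

class ParentStore:
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, document_id: int) -> Path:
        return self.root / f"doc_{document_id}.json"

    def save(self, document_id: int, parents: dict) -> None:
        """Write all parent chunks for a document in one file."""
        self._path(document_id).write_text(json.dumps(parents))

    def get(self, document_id: int, parent_id: str) -> dict:
        """Look up one parent chunk: a plain file read, no DB round-trip."""
        parents = json.loads(self._path(document_id).read_text())
        return parents[parent_id]

store = ParentStore(tempfile.mkdtemp())
store.save(42, {"p1": {"text": "Full section text ...", "page": 5}})
```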

5. Vector Search Optimization

  • Use filtered searches (user_id, document_id)
  • Set appropriate score thresholds
  • Limit results to top-k relevant chunks

Frontend Optimizations

1. Code Splitting

// Lazy load pages
const Chat = lazy(() => import('./pages/Chat'));
const Auth = lazy(() => import('./pages/Auth'));

2. PDF Rendering

// Only render current page
<Page pageNumber={currentPage} width={600} />
// Don't load all pages at once

3. Debounced Search

// Wait for user to stop typing before searching
const debouncedSearch = useMemo(
  () => debounce(handleSearch, 300),
  []
);

4. Memoization

// Cache expensive computations
const processedMessages = useMemo(
  () => messages.map(formatMessage),
  [messages]
);

Caching Strategies

Browser Caching

  • Static assets cached for 1 year
  • API responses cached per user session
  • PDF files cached after first load

Server Caching

  • Parent chunks stored on disk (instant retrieval)
  • Qdrant maintains internal vector cache
  • PostgreSQL query result cache

🐛 Troubleshooting

Common Issues and Solutions

1. Database Connection Errors

Error: sqlalchemy.exc.OperationalError: could not connect to server

Solutions:

# Check PostgreSQL is running
sudo systemctl status postgresql

# Verify connection string in .env
DATABASE_URL=postgresql://user:password@localhost:5432/rag_db

# Test connection
psql -U user -d rag_db -h localhost

2. Qdrant Connection Errors

Error: Failed to connect to Qdrant

Solutions:

# Check Qdrant is running
docker ps | grep qdrant

# Restart Qdrant
docker restart qdrant

# Verify port is correct
QDRANT_PORT=6333  # Default port

3. OpenAI API Errors

Error: AuthenticationError: Incorrect API key

Solutions:

# Verify API key is set correctly
echo $OPENAI_API_KEY

# Check key is valid at platform.openai.com
# Regenerate key if needed

# Ensure no extra spaces in .env
OPENAI_API_KEY=sk-your-key-without-spaces

4. PDF Processing Fails

Error: Document processing failed

Solutions:

# Check file is valid PDF
file document.pdf

# Verify file size is reasonable (< 50MB recommended)
ls -lh document.pdf

# Check logs for specific error
tail -f backend/logs/app.log

# Common causes:
# - Scanned PDF (no extractable text)
# - Password-protected PDF
# - Corrupted file

5. Frontend Build Errors

Error: Module not found or Cannot find module

Solutions:

# Clear node_modules and reinstall
rm -rf node_modules package-lock.json
npm install

# Clear Vite cache
rm -rf node_modules/.vite

# Verify Node version
node --version  # Should be 18+

6. CORS Errors

Error: Access-Control-Allow-Origin header missing

Solutions:

# In backend/app/main.py, verify CORS middleware:
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],  # Frontend URL
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

7. JWT Token Expired

Error: Token has expired

Solutions:

// Frontend should handle token refresh
if (error.status === 401) {
  logout();
  navigate('/login');
}

// Or increase token expiration in backend
ACCESS_TOKEN_EXPIRE_MINUTES=10080  # 7 days

Performance Issues

Slow Document Processing

Symptoms: Processing takes more than 2 minutes for 100-page document

Solutions:

  1. Check internet connection (affects embedding API calls)
  2. Increase embedding batch size (trade memory for speed)
  3. Use local embedding model (sentence-transformers)
  4. Optimize chunk sizes to reduce total chunks

Slow Query Responses

Symptoms: Answers take more than 10 seconds

Solutions:

  1. Reduce TOP_K_RESULTS (fewer chunks to retrieve)
  2. Increase SCORE_THRESHOLD (filter low-quality matches)
  3. Check Qdrant performance (memory usage, disk I/O)
  4. Upgrade to faster LLM model (Groq is already fast)

#### High Memory Usage

Symptoms: backend consuming more than 2GB of RAM

Solutions:

  1. Reduce the database connection pool size
  2. Limit concurrent requests
  3. Clear old parent-chunk files periodically
  4. Use pagination for large result sets
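Point 4 can be as simple as offset/limit slicing at the API layer (a sketch that assumes results arrive as an in-memory list; a real endpoint would usually paginate in SQL instead):

```python
def paginate(items: list, page: int = 1, page_size: int = 20) -> dict:
    """Return a single page of results plus metadata, so the backend never
    serializes an unbounded result set in one response."""
    start = (page - 1) * page_size
    return {
        "items": items[start:start + page_size],
        "page": page,
        "page_size": page_size,
        "total": len(items),
    }
```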

## 🤝 Contributing

We welcome contributions! Here's how to get started:

### Development Setup

  1. Fork the repository
  2. Clone your fork:

     ```bash
     git clone https://github.com/YOUR_USERNAME/rag-for-doc-youtube.git
     ```

  3. Create a feature branch:

     ```bash
     git checkout -b feature/amazing-feature
     ```

  4. Make your changes and commit:

     ```bash
     git commit -m "Add amazing feature"
     ```

  5. Push to your fork:

     ```bash
     git push origin feature/amazing-feature
     ```

  6. Open a Pull Request

### Code Style Guidelines

#### Python (Backend)

  • Follow the PEP 8 style guide
  • Use type hints for function parameters and return values
  • Write docstrings for all functions and classes
  • Use async/await for I/O operations
  • Maximum line length: 100 characters

```python
async def process_document(
    document_id: int,
    db: AsyncSession
) -> Document:
    """
    Process an uploaded document and create embeddings.

    Args:
        document_id: ID of the document to process
        db: Database session

    Returns:
        The processed document with its status updated

    Raises:
        DocumentNotFoundError: If the document doesn't exist
    """
    ...
```

#### TypeScript (Frontend)

  • Use TypeScript strict mode
  • Define interfaces for all data structures
  • Use functional components with hooks
  • Follow React best practices
  • Maximum line length: 100 characters

```typescript
interface DocumentUploadProps {
  onUploadComplete: (doc: Document) => void;
  maxFileSize?: number;
}

export const DocumentUpload: React.FC<DocumentUploadProps> = ({
  onUploadComplete,
  maxFileSize = 10 * 1024 * 1024 // 10MB default
}) => {
  // Component implementation
};
```

### Testing

#### Backend Tests

```bash
cd backend
pytest tests/ -v --cov=app
```

#### Frontend Tests

```bash
cd frontend
npm run test
npm run test:coverage
```

### Commit Message Format

Follow the Conventional Commits style:

```
type(scope): description

[optional body]

[optional footer]
```

Types:

  • `feat`: new feature
  • `fix`: bug fix
  • `docs`: documentation changes
  • `style`: code style changes (formatting)
  • `refactor`: code refactoring
  • `test`: adding or updating tests
  • `chore`: maintenance tasks

Examples:

```
feat(backend): add support for DOCX files
fix(frontend): resolve PDF viewer scrolling issue
docs(readme): update installation instructions
```

## 📄 License

This project is licensed under the MIT License:

MIT License

Copyright (c) 2024 Dat Tang

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## 🙏 Acknowledgments

### Technologies

### Inspiration

  • LangChain's RAG implementation patterns
  • ChromaDB's hierarchical chunking approach
  • OpenAI's best practices for embeddings

### Special Thanks

  • The open-source community for amazing tools and libraries
  • Everyone who reported issues and suggested improvements
  • Contributors who helped improve the codebase

Built with ❤️ by Dat Tang

Last Updated: January 2026
