
Sample Document for Testing Document Assistant RAG

Introduction

This is a sample document created to test the Document Assistant RAG system. It contains various types of content to demonstrate the advanced chunking and retrieval capabilities.

Key Features of the System

Document Processing

The system can handle multiple document formats:

  • PDF files using PyPDF2 and pdfplumber
  • DOCX files using python-docx
  • Plain text files with encoding detection
  • Markdown files with formatting cleanup

Advanced Text Chunking

The text chunker uses multiple strategies:

  1. Semantic Chunking: Based on content structure and meaning
  2. Recursive Character Splitting: Smart boundary detection
  3. Sentence-aware Splitting: Preserves sentence integrity
  4. Content-type Aware: Different strategies for code, tables, and paragraphs
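As an illustration of the second strategy, a minimal recursive character splitter tries progressively finer separators so that chunk boundaries fall on natural breaks. The separator list and merge logic here are simplified assumptions, not the system's exact implementation:

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split text recursively on progressively finer separators.

    Tries paragraph breaks first, then lines, sentences, and words,
    so chunks end on natural boundaries whenever possible.
    """
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse into any piece that is still too large
            return [c for chunk in chunks
                    for c in recursive_split(chunk, chunk_size, separators)]
    # No separator found: fall back to a hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```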

Embedding and Retrieval

  • Uses Sentence Transformers (all-MiniLM-L6-v2) for high-quality embeddings
  • ChromaDB for efficient vector storage and similarity search
  • Semantic search capabilities that understand meaning, not just keywords
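In the actual system, Sentence Transformers produces the vectors and ChromaDB indexes them, but the core ranking step reduces to cosine similarity over embeddings. This pure-Python sketch shows that step in isolation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunk vectors most similar to the query."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores[:k]]
```

A real vector store replaces this linear scan with an approximate index, but the similarity metric is the same.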

Language Model Integration

Multiple LLM options are available:

  • OpenAI GPT: High-quality responses (requires an API key)
  • Hugging Face Models: Free, local processing options
  • Fallback Methods: Template-based responses when no model is available
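A template-based fallback can be as simple as returning the retrieved passages in a fixed format instead of generating text. The template below is hypothetical; it only illustrates the idea:

```python
def fallback_answer(question, retrieved_chunks):
    """Template-based response used when no LLM is available."""
    if not retrieved_chunks:
        return "I could not find relevant information in the documents."
    # Wrap the top retrieved passages in a fixed template
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks[:3])
    return (f"Based on the documents, the most relevant passages for "
            f"'{question}' are:\n{context}")
```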

Scope Validation

The system includes intelligent scope validation:

  • Semantic similarity analysis
  • Keyword overlap detection
  • Pattern-based out-of-scope identification
  • Question type classification
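Keyword overlap detection, for example, can be approximated as the fraction of a question's content words that appear in the document. The stopword list and threshold below are illustrative only; the real validator combines several signals:

```python
import re

# Illustrative stopword list, not the system's actual one
STOPWORDS = {"the", "a", "an", "is", "are", "what", "how", "of", "to", "in", "does"}

def keyword_overlap(question, document_text):
    """Fraction of the question's content words found in the document."""
    tokenize = lambda s: set(re.findall(r"[a-z]+", s.lower())) - STOPWORDS
    q_words = tokenize(question)
    if not q_words:
        return 0.0
    return len(q_words & tokenize(document_text)) / len(q_words)

def in_scope(question, document_text, threshold=0.5):
    """Flag a question as in scope when enough of its words match."""
    return keyword_overlap(question, document_text) >= threshold
```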

Technical Architecture

Core Components

  1. RAG Engine: Orchestrates the entire pipeline
  2. Document Processor: Extracts text from various formats
  3. Text Chunker: Splits text intelligently
  4. Scope Validator: Ensures query relevance

Example Code Block

def process_document(file):
    """Extract, chunk, and embed an uploaded document."""
    # Extract raw text from the file (PDF, DOCX, TXT, or Markdown)
    text = document_processor.extract_text(file)
    # Split the text into overlapping chunks for retrieval
    chunks = text_chunker.chunk_text(text)
    # Encode each chunk into a dense vector for the vector store
    embeddings = embedding_model.encode(chunks)
    return embeddings

Configuration Table

Component         Default Value      Description
Chunk Size        1000               Target characters per chunk
Chunk Overlap     200                Character overlap between chunks
Embedding Model   all-MiniLM-L6-v2   Sentence transformer model
LLM Provider      huggingface        Default language model provider
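These defaults could be captured in a small configuration object. The class and field names below are illustrative, not the system's actual configuration code:

```python
from dataclasses import dataclass

@dataclass
class RAGConfig:
    """Defaults from the configuration table (names are illustrative)."""
    chunk_size: int = 1000        # target characters per chunk
    chunk_overlap: int = 200      # character overlap between chunks
    embedding_model: str = "all-MiniLM-L6-v2"
    llm_provider: str = "huggingface"
```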

Usage Examples

Good Questions to Ask

  • "What are the key features of the system?"
  • "How does the text chunking work?"
  • "What embedding model is used?"
  • "Explain the scope validation process"

Questions Outside Document Scope

  • "What's the weather today?"
  • "Recommend a good restaurant"
  • "Latest news updates"
  • "Personal advice or opinions"

Benefits

  1. Accuracy: Answers are grounded in your documents
  2. Transparency: See sources for each answer
  3. Scope Control: Prevents hallucination on unrelated topics
  4. Flexibility: Supports multiple document formats
  5. Privacy: Can run entirely locally without external APIs

Conclusion

The Document Assistant RAG system provides a comprehensive solution for document-based question answering with advanced features for chunking, embedding, and scope validation. It's designed to be both powerful and user-friendly, making document analysis accessible to everyone.