This is a sample document created to test the Document Assistant RAG system. It contains various types of content to demonstrate the advanced chunking and retrieval capabilities.
The system can handle multiple document formats:
- PDF files using PyPDF2 and pdfplumber
- DOCX files using python-docx
- Plain text files with encoding detection
- Markdown files with formatting cleanup
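To illustrate how format dispatch and encoding detection might fit together, here is a minimal sketch. The helper names `pick_extractor` and `read_text_file` are hypothetical, and the sketch only names the backends listed above rather than importing them:

```python
from pathlib import Path

def read_text_file(path):
    """Read a plain-text file, trying common encodings in order."""
    for encoding in ("utf-8", "utf-16", "latin-1"):
        try:
            return Path(path).read_text(encoding=encoding)
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f"Could not decode {path}")

def pick_extractor(filename):
    """Map a file extension to the extraction backend it would use."""
    backends = {
        ".pdf": "pdfplumber",
        ".docx": "python-docx",
        ".txt": "plain-text reader",
        ".md": "markdown cleaner",
    }
    suffix = Path(filename).suffix.lower()
    if suffix not in backends:
        raise ValueError(f"Unsupported format: {suffix}")
    return backends[suffix]
```

Falling back across encodings is a common lightweight alternative to a full detection library; `latin-1` last guarantees a decode always succeeds.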
The text chunker uses multiple strategies:
- Semantic Chunking: Based on content structure and meaning
- Recursive Character Splitting: Smart boundary detection
- Sentence-aware Splitting: Preserves sentence integrity
- Content-type Aware: Different strategies for code, tables, and paragraphs
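The sentence-aware strategy can be sketched roughly as follows; `sentence_aware_chunks` is an illustrative stand-in, not the system's actual chunker:

```python
import re

def sentence_aware_chunks(text, max_chars=1000, overlap=200):
    """Split text at sentence boundaries, keeping each chunk under
    max_chars and carrying a character overlap between chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap.
            current = current[-overlap:].lstrip() if overlap else ""
        current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because splits only ever happen between sentences, no sentence is cut in half, which is the "sentence integrity" property described above.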
The embedding and retrieval layer provides:
- Sentence Transformers (all-MiniLM-L6-v2) for high-quality embeddings
- ChromaDB for efficient vector storage and similarity search
- Semantic search capabilities that understand meaning, not just keywords
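Under the hood, similarity search ranks stored vectors by cosine similarity to the query embedding. ChromaDB handles this internally; the hypothetical helper below is only a sketch of that scoring step:

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=3):
    """Rank stored chunk embeddings by cosine similarity to the query
    and return the top-k (index, score) pairs."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]
```

This is why semantic search matches meaning rather than keywords: two texts with similar embeddings score highly even when they share no words.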
Multiple LLM options are available:
- OpenAI GPT: High-quality responses (requires API key)
- Hugging Face Models: Free, local processing options
- Fallback Methods: Template-based responses when models unavailable
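A template-based fallback might look like the following sketch; `fallback_answer` is an assumed name, not the system's actual implementation:

```python
def fallback_answer(question, retrieved_chunks):
    """Template-based response used when no LLM is available:
    return the most relevant chunks verbatim with a framing sentence."""
    if not retrieved_chunks:
        return "No relevant passages were found for this question."
    excerpts = "\n\n".join(f"- {chunk.strip()}" for chunk in retrieved_chunks[:3])
    return (
        f"Based on the indexed documents, the passages most relevant to "
        f"{question!r} are:\n\n{excerpts}"
    )
```

Even without a generator model, retrieval alone can surface the right passages, which keeps the tool useful offline.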
The system includes intelligent scope validation:
- Semantic similarity analysis
- Keyword overlap detection
- Pattern-based out-of-scope identification
- Question type classification
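Keyword overlap detection can be approximated as below; the stopword list, threshold, and `keyword_overlap` name are illustrative assumptions:

```python
def keyword_overlap(question, corpus_vocabulary, threshold=0.2):
    """Flag a question as in-scope when enough of its content words
    appear somewhere in the indexed documents."""
    stopwords = {"the", "a", "an", "is", "are", "what", "how",
                 "does", "of", "in", "to"}
    words = {w.strip("?,.").lower() for w in question.split()} - stopwords
    if not words:
        return False
    overlap = len(words & corpus_vocabulary) / len(words)
    return overlap >= threshold
```

In the full system this signal would be combined with semantic similarity and question-type classification rather than used alone.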
The architecture is built from four core components:
- RAG Engine: Orchestrates the entire pipeline
- Document Processor: Extracts text from various formats
- Text Chunker: Splits text intelligently
- Scope Validator: Ensures query relevance
A simplified view of the pipeline:

```python
def process_document(file):
    """Process an uploaded document: extract, chunk, and embed."""
    text = document_processor.extract_text(file)
    chunks = text_chunker.chunk_text(text)
    embeddings = embedding_model.encode(chunks)
    return embeddings
```

Default configuration:

| Component | Default Value | Description |
|---|---|---|
| Chunk Size | 1000 | Target characters per chunk |
| Chunk Overlap | 200 | Character overlap between chunks |
| Embedding Model | all-MiniLM-L6-v2 | Sentence transformer model |
| LLM Provider | huggingface | Default language model provider |
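The defaults in the table above could be represented as a simple overridable config; the key names here are assumptions for illustration:

```python
# Defaults mirroring the configuration table; key names are illustrative.
DEFAULT_CONFIG = {
    "chunk_size": 1000,        # target characters per chunk
    "chunk_overlap": 200,      # characters shared between adjacent chunks
    "embedding_model": "all-MiniLM-L6-v2",
    "llm_provider": "huggingface",
}

def load_config(overrides=None):
    """Merge user-supplied overrides onto the defaults."""
    return {**DEFAULT_CONFIG, **(overrides or {})}
```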
Example in-scope questions:
- "What are the key features of the system?"
- "How does the text chunking work?"
- "What embedding model is used?"
- "Explain the scope validation process"
Out-of-scope questions the validator rejects:
- "What's the weather today?"
- "Recommend a good restaurant"
- "Latest news updates"
- "Personal advice or opinions"
Key benefits:
- Accuracy: Answers are grounded in your documents
- Transparency: See sources for each answer
- Scope Control: Prevents hallucination on unrelated topics
- Flexibility: Supports multiple document formats
- Privacy: Can run entirely locally without external APIs
The Document Assistant RAG system provides a comprehensive solution for document-based question answering with advanced features for chunking, embedding, and scope validation. It's designed to be both powerful and user-friendly, making document analysis accessible to everyone.