A sophisticated Retrieval Augmented Generation (RAG) system that allows users to upload documents and ask questions about their content with intelligent scope validation.
- Multi-format Document Support: PDF, DOCX, TXT, and Markdown files
- Advanced Text Chunking: Semantic, recursive character, and sentence-based splitting
- High-Quality Embeddings: Using Sentence Transformers (all-MiniLM-L6-v2)
- Vector Database: ChromaDB for efficient similarity search
- Multiple LLM Options: OpenAI GPT and Hugging Face models
- Scope Validation: Intelligent detection of out-of-scope queries
- Interactive Web Interface: Streamlit-based user-friendly interface
- Content-Aware Chunking: Different strategies for code, tables, and paragraphs
- Context Preservation: Overlapping chunks maintain context continuity
- Semantic Search: Find relevant information using meaning, not just keywords
- Source Attribution: See which documents and sections answers came from
- Error Handling: Robust processing with graceful fallbacks
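The context-preservation idea can be pictured as a sliding window in which consecutive chunks share an overlap region. This is a minimal illustration only; the actual chunker in `backend/text_chunker.py` also splits on semantic and sentence boundaries:

```python
# Minimal sliding-window chunking sketch (illustrative, not the real chunker).
# Parameters mirror the CHUNK_SIZE / CHUNK_OVERLAP configuration values.

def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows where neighbors share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```

Because each chunk repeats the tail of the previous one, a sentence falling on a boundary still appears whole in at least one chunk.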
- Python 3.8 or higher
- pip package manager
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd CodeLLM
  ```
- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Optional: set up the OpenAI API (for GPT models):

  ```bash
  export OPENAI_API_KEY="your-api-key-here"
  # Or create a .env file with: OPENAI_API_KEY=your-api-key-here
  ```
Run the Streamlit application:

```bash
streamlit run app.py
```

The application will open in your web browser at http://localhost:8501.
- Upload Documents:
  - Use the sidebar to upload PDF, DOCX, TXT, or Markdown files
  - Multiple files can be uploaded simultaneously
  - Processing happens automatically
- Ask Questions:
  - Type questions about your uploaded documents in the chat interface
  - Get contextual answers with source attributions
  - Receive warnings for out-of-scope questions
- Review Sources:
  - Click the source expanders to see which document sections were used
  - Understand the context behind each answer
Good (In-scope) Questions:
- "What are the main points discussed in the document?"
- "Can you summarize the key findings from Chapter 3?"
- "What does the author say about [specific topic]?"
- "According to the document, how does [process] work?"
Out-of-scope Questions (will be flagged):
- "What's the weather today?"
- "Can you recommend a good restaurant?"
- "What's the latest news about [unrelated topic]?"
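One way such flagging can work is a keyword-overlap check against the indexed corpus vocabulary, combined with regex patterns for clearly off-topic phrasings. This is an illustrative sketch, not the project's actual `scope_validator` logic, which also uses embedding similarity:

```python
# Hedged sketch of pattern- and keyword-based scope checks (illustrative only).
import re

# Hypothetical patterns for obviously off-topic queries.
OFF_TOPIC_PATTERNS = [r"\bweather\b", r"\brestaurant\b", r"\blatest news\b"]

def is_in_scope(query, corpus_vocab, min_overlap=0.2):
    """Return False for queries matching off-topic patterns or sharing too
    little vocabulary with the indexed documents."""
    if any(re.search(p, query.lower()) for p in OFF_TOPIC_PATTERNS):
        return False
    words = {w for w in re.findall(r"[a-z]+", query.lower()) if len(w) > 3}
    if not words:
        return True  # too short to judge; let retrieval decide
    overlap = len(words & corpus_vocab) / len(words)
    return overlap >= min_overlap
```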
- RAG Engine (`backend/rag_engine.py`)
  - Orchestrates the entire RAG pipeline
  - Manages embeddings and vector storage
  - Integrates with multiple LLM providers
- Document Processor (`backend/document_processor.py`)
  - Extracts text from various document formats
  - Handles encoding detection and text cleaning
  - Robust error handling for corrupted files
- Text Chunker (`backend/text_chunker.py`)
  - Advanced chunking strategies based on content type
  - Preserves context with intelligent overlapping
  - Handles code, tables, and regular text differently
- Scope Validator (`backend/scope_validator.py`)
  - Validates query relevance using multiple methods
  - Semantic similarity and keyword analysis
  - Pattern-based out-of-scope detection
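Tying the components together, the retrieve-then-generate flow can be sketched as below. The function names and prompt format are illustrative, not the actual `rag_engine.py` API; `retrieve` and `generate` stand in for the ChromaDB similarity search and the LLM call:

```python
# Illustrative retrieve-then-generate flow (names are hypothetical).

def build_prompt(question, chunks):
    """Combine retrieved chunks into a grounded prompt for the LLM."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question, retrieve, generate, top_k=4):
    """retrieve(question, k) -> ranked chunks; generate(prompt) -> text.
    Returns the generated answer plus the sources used, for attribution."""
    chunks = retrieve(question, top_k)
    prompt = build_prompt(question, chunks)
    return generate(prompt), [c["source"] for c in chunks]
```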
- Frontend: Streamlit for interactive web interface
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Vector Database: ChromaDB with persistent storage
- LLM Options:
- OpenAI GPT-3.5/4 (requires API key)
- Hugging Face Transformers (free, local)
- Document Processing: PyPDF2, pdfplumber, python-docx
- Text Processing: Advanced regex and NLP techniques
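The multi-format processing might dispatch on file extension roughly as follows. This is a stdlib-only sketch; the real processor uses PyPDF2/pdfplumber and python-docx for PDF and DOCX input, as listed above:

```python
# Illustrative extension dispatch with an encoding fallback (stdlib only).
from pathlib import Path

def read_text(path):
    """Read a TXT/Markdown file, trying common encodings in order."""
    data = Path(path).read_bytes()
    for enc in ("utf-8", "utf-16", "latin-1"):
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path}")

def extract(path):
    """Dispatch extraction by extension (PDF/DOCX handling elided here)."""
    suffix = Path(path).suffix.lower()
    if suffix in (".txt", ".md"):
        return read_text(path)
    if suffix in (".pdf", ".docx"):
        raise NotImplementedError("handled by PyPDF2 / python-docx in the real processor")
    raise ValueError(f"Unsupported format: {suffix}")
```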
Create a `.env` file in the project root:

```bash
# OpenAI API (optional, for GPT models)
OPENAI_API_KEY=your-openai-api-key

# Chunking Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
MIN_CHUNK_SIZE=100

# Embedding Model
EMBEDDING_MODEL=all-MiniLM-L6-v2

# LLM Configuration
LLM_PROVIDER=huggingface  # or "openai"
LLM_MODEL=microsoft/DialoGPT-small
```

You can customize the RAG system by modifying parameters in the code:
- Chunk sizes: Adjust `chunk_size` and `chunk_overlap` in `text_chunker.py`
- Embedding model: Change the model in `rag_engine.py`
- LLM provider: Switch between OpenAI and Hugging Face models
- Similarity thresholds: Tune scope validation sensitivity
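For illustration, the `.env` settings above could be read with environment-variable fallbacks like so (a hypothetical helper; the project may load them through python-dotenv and its own config module):

```python
# Illustrative config loader with the defaults shown in the .env example.
import os

def load_config():
    return {
        "chunk_size": int(os.getenv("CHUNK_SIZE", "1000")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "200")),
        "min_chunk_size": int(os.getenv("MIN_CHUNK_SIZE", "100")),
        "embedding_model": os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        "llm_provider": os.getenv("LLM_PROVIDER", "huggingface"),
        "llm_model": os.getenv("LLM_MODEL", "microsoft/DialoGPT-small"),
    }
```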
- Upload high-quality documents: Clear text, good formatting
- Use specific questions: More specific queries get better answers
- Check document scope: Ensure questions relate to uploaded content
- Multiple documents: Upload related documents for comprehensive coverage
- Memory: 4GB+ RAM recommended (8GB+ for large documents)
- Storage: Vector database grows with document size
- CPU: Modern multi-core processor for faster processing
- GPU: Optional, improves Hugging Face model performance
- Import Errors:

  ```bash
  pip install --upgrade -r requirements.txt
  ```
- PDF Processing Fails:

  ```bash
  pip install pymupdf  # Alternative PDF processor
  ```

- Memory Issues with Large Documents:
  - Reduce the chunk size
  - Process documents individually
  - Use a machine with more RAM
- Slow Performance:
  - Use GPU acceleration if available
  - Switch to a lighter embedding model
  - Reduce the number of retrieved chunks
The application logs important information to help with debugging:
- Document processing status
- Chunking statistics
- Query processing details
- Error messages with context
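A minimal logger setup in this spirit might look like the following (the actual logger names and format used by the app may differ):

```python
# Illustrative logging setup; logger name and format are assumptions.
import logging

def get_logger(name="rag"):
    """Return a console logger, configuring it only once per name."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```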
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8

# Run tests
pytest

# Format code
black .

# Check code style
flake8 .
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Sentence Transformers: For high-quality embeddings
- ChromaDB: For efficient vector storage
- Streamlit: For the interactive web interface
- Hugging Face: For free, accessible LLM models
- OpenAI: For advanced language understanding capabilities
If you encounter issues or have questions:
- Check the troubleshooting section above
- Search existing GitHub issues
- Create a new issue with detailed information
- Include error logs and system information
Built with ❤️ for better document understanding and knowledge extraction