A sophisticated Retrieval Augmented Generation (RAG) system that allows users to upload documents and ask questions about their content with intelligent scope validation.
- Multi-format Document Support: PDF, DOCX, TXT, and Markdown files
- Advanced Text Chunking: Semantic, recursive character, and sentence-based splitting
- High-Quality Embeddings: Using Sentence Transformers (all-MiniLM-L6-v2)
- Vector Database: ChromaDB for efficient similarity search
- Multiple LLM Options: OpenAI GPT and Hugging Face models
- Scope Validation: Intelligent detection of out-of-scope queries
- Interactive Web Interface: Streamlit-based user-friendly interface
- Content-Aware Chunking: Different strategies for code, tables, and paragraphs
- Context Preservation: Overlapping chunks maintain context continuity
- Semantic Search: Find relevant information using meaning, not just keywords
- Source Attribution: See which documents and sections answers came from
- Error Handling: Robust processing with graceful fallbacks
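The context-preservation idea can be pictured as a sliding window in which consecutive chunks share an overlap region. This is a minimal illustration only; the actual chunker in `backend/text_chunker.py` also splits on semantic and sentence boundaries:

```python
# Minimal sliding-window chunking sketch (illustrative, not the real chunker).
# Parameters mirror the CHUNK_SIZE / CHUNK_OVERLAP configuration values.

def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows where neighbors share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```

Because each chunk repeats the tail of the previous one, a sentence falling on a boundary still appears whole in at least one chunk.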
- Python 3.8 or higher
- pip package manager
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd CodeLLM
  ```
- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Optional: set up the OpenAI API (for GPT models):

  ```bash
  export OPENAI_API_KEY="your-api-key-here"
  # Or create a .env file with: OPENAI_API_KEY=your-api-key-here
  ```
Run the Streamlit application:

```bash
streamlit run app.py
```

The application will open in your web browser at http://localhost:8501.
- Upload Documents:
  - Use the sidebar to upload PDF, DOCX, TXT, or Markdown files
  - Multiple files can be uploaded simultaneously
  - Processing happens automatically
- Ask Questions:
  - Type questions about your uploaded documents in the chat interface
  - Get contextual answers with source attributions
  - Receive warnings for out-of-scope questions
- Review Sources:
  - Click the source expanders to see which document sections were used
  - Understand the context behind each answer
Good (In-scope) Questions:
- "What are the main points discussed in the document?"
- "Can you summarize the key findings from Chapter 3?"
- "What does the author say about [specific topic]?"
- "According to the document, how does [process] work?"
Out-of-scope Questions (will be flagged):
- "What's the weather today?"
- "Can you recommend a good restaurant?"
- "What's the latest news about [unrelated topic]?"
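One way such flagging can work is a keyword-overlap check against the indexed corpus vocabulary, combined with regex patterns for clearly off-topic phrasings. This is an illustrative sketch, not the project's actual `scope_validator` logic, which also uses embedding similarity:

```python
# Hedged sketch of pattern- and keyword-based scope checks (illustrative only).
import re

# Hypothetical patterns for obviously off-topic queries.
OFF_TOPIC_PATTERNS = [r"\bweather\b", r"\brestaurant\b", r"\blatest news\b"]

def is_in_scope(query, corpus_vocab, min_overlap=0.2):
    """Return False for queries matching off-topic patterns or sharing too
    little vocabulary with the indexed documents."""
    if any(re.search(p, query.lower()) for p in OFF_TOPIC_PATTERNS):
        return False
    words = {w for w in re.findall(r"[a-z]+", query.lower()) if len(w) > 3}
    if not words:
        return True  # too short to judge; let retrieval decide
    overlap = len(words & corpus_vocab) / len(words)
    return overlap >= min_overlap
```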
- RAG Engine (`backend/rag_engine.py`)
  - Orchestrates the entire RAG pipeline
  - Manages embeddings and vector storage
  - Integrates with multiple LLM providers
- Document Processor (`backend/document_processor.py`)
  - Extracts text from various document formats
  - Handles encoding detection and text cleaning
  - Robust error handling for corrupted files
- Text Chunker (`backend/text_chunker.py`)
  - Advanced chunking strategies based on content type
  - Preserves context with intelligent overlapping
  - Handles code, tables, and regular text differently
- Scope Validator (`backend/scope_validator.py`)
  - Validates query relevance using multiple methods
  - Semantic similarity and keyword analysis
  - Pattern-based out-of-scope detection
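Tying the components together, the retrieve-then-generate flow can be sketched as below. The function names and prompt format are illustrative, not the actual `rag_engine.py` API; `retrieve` and `generate` stand in for the ChromaDB similarity search and the LLM call:

```python
# Illustrative retrieve-then-generate flow (names are hypothetical).

def build_prompt(question, chunks):
    """Combine retrieved chunks into a grounded prompt for the LLM."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question, retrieve, generate, top_k=4):
    """retrieve(question, k) -> ranked chunks; generate(prompt) -> text.
    Returns the generated answer plus the sources used, for attribution."""
    chunks = retrieve(question, top_k)
    prompt = build_prompt(question, chunks)
    return generate(prompt), [c["source"] for c in chunks]
```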
- Frontend: Streamlit for interactive web interface
- Embeddings: Sentence Transformers (all-MiniLM-L6-v2)
- Vector Database: ChromaDB with persistent storage
- LLM Options:
- OpenAI GPT-3.5/4 (requires API key)
- Hugging Face Transformers (free, local)
- Document Processing: PyPDF2, pdfplumber, python-docx
- Text Processing: Advanced regex and NLP techniques
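The multi-format processing might dispatch on file extension roughly as follows. This is a stdlib-only sketch; the real processor uses PyPDF2/pdfplumber and python-docx for PDF and DOCX input, as listed above:

```python
# Illustrative extension dispatch with an encoding fallback (stdlib only).
from pathlib import Path

def read_text(path):
    """Read a TXT/Markdown file, trying common encodings in order."""
    data = Path(path).read_bytes()
    for enc in ("utf-8", "utf-16", "latin-1"):
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path}")

def extract(path):
    """Dispatch extraction by extension (PDF/DOCX handling elided here)."""
    suffix = Path(path).suffix.lower()
    if suffix in (".txt", ".md"):
        return read_text(path)
    if suffix in (".pdf", ".docx"):
        raise NotImplementedError("handled by PyPDF2 / python-docx in the real processor")
    raise ValueError(f"Unsupported format: {suffix}")
```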
Create a `.env` file in the project root:

```bash
# OpenAI API (optional, for GPT models)
OPENAI_API_KEY=your-openai-api-key

# Chunking Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
MIN_CHUNK_SIZE=100

# Embedding Model
EMBEDDING_MODEL=all-MiniLM-L6-v2

# LLM Configuration
LLM_PROVIDER=huggingface  # or "openai"
LLM_MODEL=microsoft/DialoGPT-small
```

You can customize the RAG system by modifying parameters in the code:
- Chunk sizes: Adjust `chunk_size` and `chunk_overlap` in `text_chunker.py`
- Embedding model: Change the model in `rag_engine.py`
- LLM provider: Switch between OpenAI and Hugging Face models
- Similarity thresholds: Tune scope validation sensitivity
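For illustration, the `.env` settings above could be read with environment-variable fallbacks like so (a hypothetical helper; the project may load them through python-dotenv and its own config module):

```python
# Illustrative config loader with the defaults shown in the .env example.
import os

def load_config():
    return {
        "chunk_size": int(os.getenv("CHUNK_SIZE", "1000")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "200")),
        "min_chunk_size": int(os.getenv("MIN_CHUNK_SIZE", "100")),
        "embedding_model": os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        "llm_provider": os.getenv("LLM_PROVIDER", "huggingface"),
        "llm_model": os.getenv("LLM_MODEL", "microsoft/DialoGPT-small"),
    }
```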
- Upload high-quality documents: Clear text, good formatting
- Use specific questions: More specific queries get better answers
- Check document scope: Ensure questions relate to uploaded content
- Multiple documents: Upload related documents for comprehensive coverage
- Memory: 4GB+ RAM recommended (8GB+ for large documents)
- Storage: Vector database grows with document size
- CPU: Modern multi-core processor for faster processing
- GPU: Optional, improves Hugging Face model performance
- Import Errors:

  ```bash
  pip install --upgrade -r requirements.txt
  ```
- PDF Processing Fails:

  ```bash
  pip install pymupdf  # Alternative PDF processor
  ```

- Memory Issues with Large Documents:
  - Reduce the chunk size
  - Process documents individually
  - Use a machine with more RAM
- Slow Performance:
  - Use GPU acceleration if available
  - Switch to a lighter embedding model
  - Reduce the number of retrieved chunks
The application logs important information to help with debugging:
- Document processing status
- Chunking statistics
- Query processing details
- Error messages with context
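A minimal logger setup in this spirit might look like the following (the actual logger names and format used by the app may differ):

```python
# Illustrative logging setup; logger name and format are assumptions.
import logging

def get_logger(name="rag"):
    """Return a console logger, configuring it only once per name."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```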
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8

# Run tests
pytest

# Format code
black .

# Check code style
flake8 .
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Sentence Transformers: For high-quality embeddings
- ChromaDB: For efficient vector storage
- Streamlit: For the interactive web interface
- Hugging Face: For free, accessible LLM models
- OpenAI: For advanced language understanding capabilities
If you encounter issues or have questions:
- Check the troubleshooting section above
- Search existing GitHub issues
- Create a new issue with detailed information
- Include error logs and system information
Built with ❤️ for better document understanding and knowledge extraction