🔍 Information Retrieval System for Digikala's FAQ

MIT License · Python 3.8+ · Flask · Persian Text Support

An advanced, real-time Information Retrieval System specifically designed for Digikala's FAQ section. This system combines state-of-the-art natural language processing techniques with Persian text processing capabilities to deliver accurate and contextually relevant answers to user queries.

🚀 Key Highlights:

  • Persian Language Expertise: Native support for Persian text processing and understanding
  • Real-time Processing: Instant query processing with WebSocket integration
  • AI-Powered Responses: LLM integration for enhanced, contextual answers
  • Multiple Ranking Algorithms: TF-IDF, LaBSE embeddings with cosine, Jaccard, and custom similarity metrics
  • Interactive Web Interface: Modern UI with Persian language support (DigiSage)
  • Modular Architecture: Extensible pipeline design for easy customization and scaling

📑 Table of Contents

  • 🎯 Overview
  • ✨ Features
  • 🏗️ Architecture
  • 📋 Prerequisites
  • 🚀 Installation
  • ⚡ Quick Start
  • 📖 Usage
  • 💡 Examples
  • 📊 Performance
  • 📁 Project Structure
  • ⚙️ Configuration
  • 🚀 Deployment
  • 🔧 Troubleshooting
  • 🤝 Contributing
  • 📄 License
  • 🙏 Acknowledgments

🎯 Overview

This Information Retrieval System is engineered to handle Persian language queries for Digikala's FAQ section with high accuracy and speed. The system leverages modern NLP techniques, including transformer-based embeddings and traditional IR methods, to provide contextually relevant answers.

Why This System?

  • Language-Specific Optimization: Tailored for Persian text with proper tokenization, normalization, and linguistic processing
  • Hybrid Approach: Combines traditional IR methods (TF-IDF) with modern embeddings (LaBSE) for optimal performance
  • Real-time Performance: Sub-second response times with efficient indexing and retrieval algorithms
  • Scalable Architecture: Modular design allows easy integration and scaling for enterprise environments
  • User-Friendly Interface: Intuitive web interface designed for Persian speakers

✨ Features

🔤 Persian Text Processing

  • Advanced Tokenization: Handles Persian text complexities including compound words and abbreviations
  • Text Normalization: Comprehensive normalization including diacritic removal and character standardization
  • Stemming & Lemmatization: Persian-specific morphological analysis for better matching
  • Stopword Filtering: Customizable Persian stopword lists for improved relevance
  • Informal Text Handling: Processes colloquial and informal Persian text
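
To give a flavor of what the normalization step does, here is a minimal stdlib sketch of character standardization and diacritic removal. The project itself relies on Hazm for this; `CHAR_MAP` and `normalize` are illustrative names, not the project's API:

```python
import re

# Map Arabic code points to their Persian equivalents — a common
# standardization step (Hazm's Normalizer handles this and more).
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
    "\u0629": "\u0647",  # teh marbuta -> heh
})

# Arabic diacritics (fathatan through sukun, plus superscript alef)
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def normalize(text: str) -> str:
    """Standardize characters and strip diacritics."""
    text = text.translate(CHAR_MAP)
    text = DIACRITICS.sub("", text)
    # Collapse any whitespace runs left over
    return re.sub(r"\s+", " ", text).strip()

print(normalize("كتاب"))    # -> "کتاب" (Arabic kaf replaced by Persian kaf)
print(normalize("مَدرسه"))  # -> "مدرسه" (fatha removed)
```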

🧠 Information Retrieval

  • Multiple Embedding Models:
    • TF-IDF with customizable parameters
    • LaBSE (Language-agnostic BERT Sentence Embeddings)
    • Support for custom embedding models
  • Advanced Similarity Metrics:
    • Cosine Similarity for semantic matching
    • Jaccard Similarity for term overlap
    • Custom hybrid metrics combining multiple signals
  • Intelligent Ranking: Multi-factor ranking considering relevance, category, and user context
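
The two classical metrics above reduce to a few lines of plain Python. This is an illustrative sketch only; the pipeline's actual implementations live in `src/pipelines/similarity/`:

```python
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Term-overlap similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

q = "هزینه ارسال سفارش".split()
d = "هزینه ارسال کالا چقدر است".split()
print(jaccard(set(q), set(d)))          # 2 shared terms / 6 total -> ~0.33
print(cosine(Counter(q), Counter(d)))   # higher when shared terms dominate
```

In the real system the cosine operates on TF-IDF or LaBSE vectors rather than raw term counts; the formula is the same.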

🚀 Real-time System

  • WebSocket Integration: Real-time query processing and response streaming
  • Asynchronous Processing: Non-blocking query handling for improved user experience
  • Live Results: Progressive result loading as processing completes
  • Session Management: Maintains context across multiple queries

🤖 AI Integration

  • LLM Enhancement: Integration with Together AI for contextual response generation
  • Prompt Engineering: Optimized prompts for Persian customer service scenarios
  • Fallback Mechanisms: Graceful degradation when AI services are unavailable
  • Response Personalization: Contextual responses based on retrieval results
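
A minimal sketch of the fallback idea, with stub `retrieve`/`generate` callables standing in for the real retrieval pipeline and the Together AI client (hypothetical names, not the app's actual interfaces):

```python
def answer_query(query, retrieve, generate):
    """Return an LLM-enhanced answer, degrading gracefully to the
    top retrieved FAQ answer if the LLM call fails."""
    hits = retrieve(query)
    if not hits:
        return "پاسخی یافت نشد."  # "No answer found."
    try:
        return generate(query, hits)
    except Exception:
        # Graceful degradation: serve the raw retrieval result
        return hits[0]["answer"]

# Stub components for illustration
retrieve = lambda q: [{"question": "…", "answer": "پاسخ بازیابی‌شده"}]
def generate(q, hits):
    raise RuntimeError("LLM service unavailable")

print(answer_query("سوال", retrieve, generate))  # falls back to the retrieved answer
```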

📊 Analytics & Monitoring

  • MLflow Integration: Comprehensive experiment tracking and model versioning
  • Performance Metrics: Real-time monitoring of query response times and accuracy
  • Usage Analytics: Detailed logging of user interactions and system performance
  • A/B Testing Support: Framework for testing different retrieval strategies

🎨 User Interface

  • DigiSage Interface: Modern, responsive web interface
  • Persian Language Support: Right-to-left text support and Persian fonts
  • Mobile Responsive: Optimized for desktop and mobile devices
  • Accessibility: Screen reader compatible and keyboard navigation support

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Web Browser   │◄──►│   Flask Server   │◄──►│  Pipeline Core  │
│   (DigiSage)    │    │   (app.py)       │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Preprocessor  │
                               │                │   - Tokenizer   │
                               │                │   - Normalizer  │
                               │                │   - Stemmer     │
                               │                └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Vectorizer    │
                               │                │   - TF-IDF      │
                               │                │   - LaBSE       │
                               │                └─────────────────┘
                               │                         │
                               │                         ▼
                               │                ┌─────────────────┐
                               │                │   Similarity    │
                               │                │   - Cosine      │
                               │                │   - Jaccard     │
                               │                │   - Custom      │
                               │                └─────────────────┘
                               │                         │
                               ▼                         ▼
                      ┌─────────────────┐    ┌─────────────────┐
                      │   Together AI   │    │   Evaluator     │
                      │   (LLM)         │    │   & Logger      │
                      └─────────────────┘    └─────────────────┘

Key Components:

  1. Pipeline Core: Modular processing pipeline with training and inference modes
  2. Web Server: Flask application with SocketIO for real-time communication
  3. Text Processing: Persian-specific NLP pipeline
  4. Retrieval Engine: Multi-algorithm similarity search
  5. AI Enhancement: LLM integration for response generation
  6. Monitoring: MLflow-based experiment tracking and logging

📋 Prerequisites

Before installing the system, ensure you have:

  • Python 3.8 or higher
  • pip (Python package installer)
  • Virtual environment (recommended: venv or conda)
  • Git for cloning the repository
  • 4GB+ RAM for optimal performance with embeddings
  • Internet connection for downloading language models and dependencies

System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| Python    | 3.8+    | 3.9+        |
| RAM       | 4GB     | 8GB+        |
| Storage   | 2GB     | 5GB+        |
| CPU       | 2 cores | 4+ cores    |

🚀 Installation

1. Clone the Repository

git clone https://github.com/jonathansina/information-retrieval-system.git
cd information-retrieval-system

2. Create Virtual Environment

# Using venv (recommended)
python -m venv venv

# Activate on Linux/Mac
source venv/bin/activate

# Activate on Windows
venv\Scripts\activate

3. Install Dependencies

# Install all required packages
pip install -r requirements.txt

# Verify installation
python -c "import hazm, sklearn, flask; print('Dependencies installed successfully')"

4. Verify Language Support

# Hazm bundles its Persian models; confirm it imports cleanly
python -c "import hazm; print('Hazm ready for Persian processing')"

5. Prepare Data

# Ensure data directory exists and contains required files
ls data/
# Should show: augmented_dataset.csv, test_digikala_faq.csv, stops.txt, etc.

⚡ Quick Start

Start the System

# Run the application
python app.py

Access the Interface

  1. Open your web browser
  2. Navigate to http://localhost:5000
  3. You'll see the DigiSage interface
  4. Enter a Persian query and press send

Example Query

Try entering: چگونه می‌توانم سفارشم را تغییر دهم؟ (How can I change my order?)

📖 Usage

Development Mode

Training a New Model

from src.pipelines.pipeline.builder import PipelineBuilder
from src.pipelines.type_hint import ControllerType
from src.pipelines.config.default import PIPELINE_DEFAULT_CONFIG

# Create training pipeline
pipeline = (
    PipelineBuilder(controller_type=ControllerType.TRAINING)
    .with_logger("information-retrieval", "IRS", "development")
    .with_config(PIPELINE_DEFAULT_CONFIG)
    .with_preprocessor(True)
    .with_vectorizer(True)
    .with_vocabulary()
    .with_similarity()
    .with_evaluator(True)
    .build(True)
)

# Train the model
pipeline.run()

Running Inference

# Create inference pipeline
inference_pipeline = (
    PipelineBuilder(controller_type=ControllerType.INFERENCE)
    .with_logger("information-retrieval", "IRS", "development")
    .with_config(PIPELINE_DEFAULT_CONFIG)
    .with_preprocessor()
    .with_vectorizer()
    .with_vocabulary()
    .with_similarity()
    .with_evaluator()
    .build()
)

# Process a query
result = inference_pipeline.run("سوال شما به فارسی")
print(result)

Development with Jupyter

# Start Jupyter for experimentation
cd dev/
jupyter notebook data_analysis.ipynb

Production Mode

Running the Web Application

# Production mode (set FLASK_ENV=production in your .env first)
python app.py

Using the REST API

# Submit a query via API
curl -X POST http://localhost:5000/submit \
  -H "Content-Type: application/json" \
  -d '{"query": "چگونه محصولی را برگردانم؟"}'

WebSocket Integration

// Connect to WebSocket
const socket = io('http://localhost:5000');

// Listen for retrieval results
socket.on('ir_results', (data) => {
    console.log('Search results:', data.ir_results);
});

// Listen for AI responses
socket.on('llm_response', (data) => {
    console.log('AI response:', data.llm_response);
});

API Documentation

Endpoints

POST /submit

Submit a query for processing.

Request:

{
  "query": "سوال شما به زبان فارسی"
}

Response:

{
  "status": "Processing started"
}

WebSocket Events:

  • ir_results: Returns retrieval results
  • llm_response: Returns AI-generated response

GET /

Serves the main interface (DigiSage).

WebSocket Events

Client to Server
  • connect: Establish connection
Server to Client
  • ir_results: Retrieval results

    {
      "ir_results": [
        {
          "question": "سوال",
          "answer": "پاسخ", 
          "category": "دسته‌بندی"
        }
      ]
    }
  • llm_response: AI-generated response

    {
      "llm_response": "پاسخ تولید شده توسط هوش مصنوعی"
    }

💡 Examples

Basic Persian Query Processing

# Example 1: Simple product return query
query = "چگونه می‌توانم کالای خریداری شده را برگردانم؟"
result = inference_pipeline.run(query)

# Result structure:
{
    "retrieved_question": ["سوالات مشابه"],
    "retrieved_answer": ["پاسخ‌های مربوطه"], 
    "retrieved_category": ["دسته‌بندی‌ها"]
}

Custom Configuration

# Example 2: Custom preprocessing configuration
from src.pipelines.config.config import PreprocessorConfig

custom_config = PreprocessorConfig(
    normalizer=True,
    stemmer=True,
    stopwords=True,
    lemmatizer=False
)

# Use in pipeline (controller_type as in the earlier examples)
pipeline = (
    PipelineBuilder(controller_type=ControllerType.TRAINING)
    .with_config(custom_config)
    .build()
)

Batch Processing

# Example 3: Process multiple queries
queries = [
    "هزینه ارسال چقدر است؟",
    "چه زمانی سفارشم ارسال می‌شود؟",
    "چگونه می‌توانم سفارش را لغو کنم؟"
]

results = []
for query in queries:
    result = inference_pipeline.run(query)
    results.append(result)

Integration with Custom Models

# Example 4: Using custom similarity metrics
from src.pipelines.similarity.base import BaseSimilaritySearch

class CustomSimilarity(BaseSimilaritySearch):
    def search(self, query_vectors, corpus_vectors):
        # Your custom similarity implementation
        pass

# Use in pipeline
pipeline = (
    PipelineBuilder(controller_type=ControllerType.INFERENCE)
    .with_similarity(CustomSimilarity())
    .build()
)

📊 Performance

Benchmark Results

| Metric                | Value   | Description                   |
|-----------------------|---------|-------------------------------|
| Average Response Time | < 200ms | End-to-end query processing   |
| Retrieval Accuracy    | 89.5%   | Top-5 relevant results        |
| Persian Text Coverage | 95%+    | Successful processing rate    |
| Concurrent Users      | 100+    | Simultaneous query handling   |
| Memory Usage          | ~2GB    | With full embeddings loaded   |

Evaluation Metrics

The system is evaluated using:

  • Precision@K: Relevance of top-K results
  • Recall@K: Coverage of relevant documents
  • MRR: Mean Reciprocal Rank
  • NDCG: Normalized Discounted Cumulative Gain
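
For reference, Precision@K and MRR reduce to a few lines of Python. This is a generic sketch, not the project's evaluator:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def mrr(rankings, relevant_sets):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1 / i
                break
    return total / len(rankings)

# Two toy queries: first relevant hit at rank 1 and rank 2
ranks = [["d1", "d2"], ["d3", "d4"]]
rels = [{"d1"}, {"d4"}]
print(precision_at_k(ranks[0], rels[0], 2))  # 0.5
print(mrr(ranks, rels))                      # (1/1 + 1/2) / 2 = 0.75
```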

Performance Optimization

# Enable MLflow tracking for performance monitoring
import mlflow

mlflow.start_run()
mlflow.log_metric("response_time", response_time)
mlflow.log_metric("accuracy", accuracy_score)
mlflow.end_run()

📁 Project Structure

information-retrieval-system/
├── 📁 data/                          # Dataset storage
│   ├── augmented_dataset.csv         # Training data with augmentations
│   ├── test_digikala_faq.csv        # Evaluation dataset
│   ├── digikala_faq.csv             # Original FAQ data
│   ├── stops.txt                     # Persian stopwords
│   ├── synonyms.json                 # Synonym mappings
│   └── vocabulary.txt                # Processed vocabulary
├── 📁 dev/                           # Development notebooks
│   ├── data_analysis.ipynb          # Data exploration and analysis
│   ├── test.ipynb                   # Testing and experimentation
│   └── time_experiment.png          # Performance visualization
├── 📁 src/                           # Core source code
│   ├── 📁 pipelines/                # Processing pipelines
│   │   ├── 📁 config/               # Configuration management
│   │   ├── 📁 evaluator/            # Performance evaluation
│   │   ├── 📁 logger/               # Logging utilities
│   │   ├── 📁 pipeline/             # Pipeline orchestration
│   │   ├── 📁 preprocessor/         # Text preprocessing
│   │   ├── 📁 similarity/           # Similarity algorithms
│   │   ├── 📁 vectorizer/           # Text vectorization
│   │   ├── 📁 vocabulary/           # Vocabulary management
│   │   └── type_hint.py             # Type definitions
│   └── 📁 generate/                 # Data generation utilities
├── 📁 ui/                            # Web interface (DigiSage)
│   ├── index.html                   # Main interface
│   ├── 📁 js/                       # JavaScript functionality
│   ├── 📁 styles/                   # CSS styling
│   └── 📁 images/                   # UI assets
├── app.py                           # Flask web application
├── requirements.txt                 # Python dependencies
├── README.md                        # Project documentation
└── LICENSE                          # MIT license

Key Files Description

| File/Directory  | Purpose                                       |
|-----------------|-----------------------------------------------|
| app.py          | Main Flask application with WebSocket support |
| src/pipelines/  | Modular NLP processing pipeline               |
| ui/             | DigiSage web interface                        |
| data/           | Training and evaluation datasets              |
| dev/            | Jupyter notebooks for experimentation         |

⚙️ Configuration

Environment Variables

Create a .env file in the root directory:

# API Configuration
TOGETHER_API_KEY=your_together_api_key_here
FLASK_ENV=development
FLASK_DEBUG=True

# Server Configuration
HOST=0.0.0.0
PORT=5000

# MLflow Configuration
MLFLOW_TRACKING_URI=http://localhost:9437

Pipeline Configuration

Customize processing parameters in src/pipelines/config/default.py:

# Text Processing
PREPROCESSOR_DEFAULT_CONFIG = PreprocessorConfig(
    normalizer=True,
    stemmer=False,
    lemmatizer=False,
    stopwords=True,
    # ... other parameters
)

# Vectorization
VECTORIZER_DEFAULT_CONFIG = VectorizerConfig(
    vectorizer="tf-idf",
    vectorizer_param={
        "max_features": 10000,
        "ngram_range": (1, 2),
        # ... other parameters
    }
)

# Similarity Search
SIMILARITY_SEARCH_DEFAULT_CONFIG = SimilaritySearchConfig(
    metrics="cosine",
    metrics_param={
        "n_neighbors": 5,
    }
)

🚀 Deployment

Docker Deployment (Recommended)

# Build Docker image
docker build -t digikala-ir-system .

# Run container
docker run -p 5000:5000 \
  -e TOGETHER_API_KEY=your_key \
  digikala-ir-system

Production Deployment

# Using Gunicorn for production
pip install gunicorn

# Flask-SocketIO with eventlet must run as a single worker
gunicorn --worker-class eventlet -w 1 --bind 0.0.0.0:5000 app:app

Cloud Deployment

AWS EC2

# Install dependencies on EC2
sudo yum update -y
sudo yum install python3 python3-pip -y

# Clone and setup
git clone https://github.com/jonathansina/information-retrieval-system.git
cd information-retrieval-system
pip3 install -r requirements.txt

# Run with systemd service
sudo systemctl enable digikala-ir
sudo systemctl start digikala-ir
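
The systemctl commands above presuppose a `digikala-ir` unit file already exists. A minimal sketch, with the path, user, and key as placeholders to adapt:

```ini
# /etc/systemd/system/digikala-ir.service (hypothetical)
[Unit]
Description=Digikala FAQ Information Retrieval System
After=network.target

[Service]
User=ec2-user
WorkingDirectory=/home/ec2-user/information-retrieval-system
Environment=TOGETHER_API_KEY=your_key
ExecStart=/usr/bin/python3 app.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Run `sudo systemctl daemon-reload` after creating or editing the unit file.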

Docker Compose

version: '3.8'
services:
  web:
    build: .
    ports:
      - "5000:5000"
    environment:
      - TOGETHER_API_KEY=your_key
    volumes:
      - ./data:/app/data
  
  mlflow:
    image: python:3.9
    command: sh -c "pip install mlflow && mlflow server --host 0.0.0.0 --port 9437"
    ports:
      - "9437:9437"

🔧 Troubleshooting

Common Issues

1. Dependencies Installation Fails

Problem: pip install -r requirements.txt fails

Solution:

# Upgrade pip first
pip install --upgrade pip

# Install with verbose output
pip install -r requirements.txt -v

# If still fails, install individually
pip install flask flask-socketio flask-cors
pip install hazm scikit-learn pandas numpy

2. Persian Text Not Displaying Correctly

Problem: Persian text appears as question marks or broken characters

Solution:

  • Ensure your terminal/browser supports UTF-8 encoding
  • Check that Persian fonts are installed on your system
  • Verify the UI CSS includes proper Persian font specifications

3. MLflow Connection Issues

Problem: Cannot connect to MLflow tracking server

Solution:

# Start MLflow server manually
mlflow server --host 0.0.0.0 --port 9437

# Or disable MLflow in development
export MLFLOW_TRACKING_URI=""

4. Out of Memory Errors

Problem: System runs out of memory during processing

Solution:

  • Reduce the size of your dataset for testing
  • Decrease max_features in TF-IDF configuration
  • Use smaller batch sizes for processing
  • Consider using lighter embedding models
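
Capping the vocabulary is usually the cheapest fix. This stdlib sketch shows the effect of a `max_features` cap; sklearn's `TfidfVectorizer(max_features=...)` applies the same top-frequency cut internally:

```python
from collections import Counter

def build_vocabulary(corpus, max_features):
    """Keep only the max_features most frequent terms, mirroring
    what TfidfVectorizer(max_features=...) does before weighting."""
    counts = Counter(term for doc in corpus for term in doc.split())
    return {t for t, _ in counts.most_common(max_features)}

corpus = ["ارسال سریع سفارش", "ارسال رایگان", "لغو سفارش"]
vocab = build_vocabulary(corpus, max_features=2)
print(vocab)  # the two most frequent terms survive the cap
```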

5. WebSocket Connection Fails

Problem: Real-time features not working

Solution:

  • Check the browser console for connection errors
  • Ensure the Socket.IO client library is properly loaded
  • Verify CORS settings in app.py

Debug Mode

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Performance Issues

Monitor system performance:

# Check memory usage
htop

# Monitor Python processes
ps aux | grep python

# Check disk space
df -h

Getting Help

  1. Check Issues: Look for similar problems in GitHub Issues
  2. Create Issue: If not found, create a new issue with:
    • System information (OS, Python version)
    • Complete error message
    • Steps to reproduce
    • Expected vs actual behavior

🤝 Contributing

We welcome contributions! Here's how you can help:

Development Setup

# Fork the repository
git clone https://github.com/your-username/information-retrieval-system.git
cd information-retrieval-system

# Create development branch
git checkout -b feature/your-feature-name

# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # If available

Contribution Guidelines

  1. Code Style: Follow PEP 8 conventions
  2. Documentation: Update relevant documentation
  3. Testing: Add tests for new functionality
  4. Commits: Use clear, descriptive commit messages
  5. Pull Requests: Include description of changes and testing

Areas for Contribution

  • 🌐 Internationalization: Support for other languages
  • 🚀 Performance: Optimization and caching
  • 🎨 UI/UX: Interface improvements
  • 🧪 Testing: Comprehensive test coverage
  • 📚 Documentation: Examples and tutorials
  • 🔧 DevOps: CI/CD and deployment automation

Code Review Process

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request
  6. Address review feedback
  7. Merge after approval

Community

  • 💬 Discussions: Use GitHub Discussions for questions
  • 🐛 Bug Reports: Use GitHub Issues
  • 📧 Email: Contact maintainers for sensitive issues

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary

  • Commercial use allowed
  • Modification allowed
  • Distribution allowed
  • Private use allowed
  • Liability not included
  • Warranty not included

🙏 Acknowledgments

  • Hazm Team: For excellent Persian NLP tools
  • Digikala: For the problem domain and inspiration
  • Open Source Community: For the amazing libraries used
  • Contributors: Everyone who has contributed to this project

Made with ❤️ for the Persian NLP community

⭐ Star this repository | 🐛 Report Bug | 💡 Request Feature
