A modern semantic search engine for discovering songs and poems across Hindi, Hinglish, and English using state-of-the-art multilingual embeddings and vector similarity.
- Features
- Architecture
- Quick Start
- API Reference
- Configuration
- Using Weaviate
- Frontend Features
- Data Sources
- Technical Details
- Troubleshooting
- License
- Contributing
- Trilingual Support: Seamlessly search across Hindi (Devanagari), Hinglish (romanized Hindi), and English
- Semantic Search: Powered by multilingual sentence transformers for deep contextual understanding
- RAG (Retrieval-Augmented Generation): AI-powered summaries, recommendations, and interactive chat using LangChain
- Conversation Memory: Multi-turn chat with context retention using LangChain memory management
- High Performance: FAISS vector indexing for lightning-fast similarity search
- Modern UI: Beautiful React interface with real-time search and chat history display
- Flexible Backend: Support for both FAISS (local) and Weaviate (distributed) vector databases
- Rich Metadata: Search results include language detection, similarity scores, and transliterations
This project leverages a powerful tech stack:
- Backend: FastAPI for high-performance async API endpoints
- Embeddings: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (~500MB model)
- Vector Store: FAISS for local indexing with cosine similarity; Weaviate optional for production
- RAG: LangChain with HuggingFace models for AI-generated insights and conversational memory
- Frontend: React 18 + Vite with modern CSS animations and Lucide icons
- Data Sources: Curated Hugging Face datasets (~1.1k Hindi poems, ~20k+ English lyrics)
- Document Processing: LangChain for intelligent text chunking and retrieval
- Python 3.8 or higher
- Node.js 16+ and npm
- Git Bash or compatible shell (Windows users)
Step 1: Clone the repository
git clone <repository-url>
cd GenAI_project
Step 2: Set up Python environment
# Create and activate virtual environment
python -m venv .venv
source .venv/Scripts/activate # Windows Git Bash
# source .venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r requirements.txt
Step 3: Configure environment variables
Create a .env file in the project root (see Configuration section)
Option 1: CLI Search (Quick Test)
# Activate virtual environment
source .venv/Scripts/activate
# Hindi (Devanagari) search
python app.py --rebuild --limit 50 --query "प्रेम गीत" --top_k 5
# Hinglish (romanized Hindi) search
python app.py --query "prem geet" --top_k 5
# English search
python app.py --query "heartbreak love song" --top_k 5
# Force Hindi-only results
python app.py --query "prem geet" --lang hi --top_k 5
Note: The first run downloads the model (~500MB) and datasets. Use --rebuild to recreate the index after configuration changes.
Option 2: Full Web Application
Terminal 1 - Start Backend API:
# From project root
source .venv/Scripts/activate
uvicorn api:app --reload --port 8000
Terminal 2 - Start Frontend Dev Server:
# From project root
cd webui
npm install # First time only
node ./node_modules/vite/bin/vite.js dev --host --port 5173
Access the application at: http://localhost:5173
API documentation available at: http://localhost:8000/docs
cd webui
node ./node_modules/vite/bin/vite.js build
# Serve the dist/ folder with your preferred static server
Windows Users: Because of the & character in the Stats&AI folder name, use the direct Vite binary invocation shown above instead of npm run commands in CMD.
GET /api/health
Response:
{
"status": "healthy",
"backend": "faiss",
"model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
}
POST /api/search
Content-Type: application/json
Request body:
{
"query": "prem geet",
"top_k": 5,
"lang": "auto",
"include_english": true
}
Parameters:
- query (string, required): Search query in Hindi/Hinglish/English
- top_k (integer, optional): Number of results to return (default: 5, max: 20)
- lang (string, optional): Language routing - "auto", "hi", "en", or "both" (default: "auto")
- include_english (boolean, optional): Include English corpus in results (default: true)
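As a rough illustration of the documented defaults and limits, a server-side validator for this request body might look like the following. This is a hypothetical helper, not the project's actual api.py code:

```python
# Hypothetical sketch of /api/search parameter validation,
# mirroring the defaults and limits documented above.
def normalize_search_params(payload: dict) -> dict:
    query = payload.get("query")
    if not query or not isinstance(query, str):
        raise ValueError("'query' is required and must be a string")
    top_k = int(payload.get("top_k", 5))
    top_k = max(1, min(top_k, 20))  # documented max is 20
    lang = payload.get("lang", "auto")
    if lang not in {"auto", "hi", "en", "both"}:
        raise ValueError("unsupported lang: %r" % lang)
    return {
        "query": query,
        "top_k": top_k,
        "lang": lang,
        "include_english": bool(payload.get("include_english", True)),
    }
```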
Response:
{
"results": [
{
"id": "unique-id",
"text": "Original text content",
"hinglish": "Transliteration (if Hindi)",
"title": "Song/Poem title",
"poet": "Artist name",
"language": "hi",
"score": 0.923,
"period": "Modern"
}
],
"backend": "faiss",
"counts": {
"hi": 1142,
"en": 23456
}
}
POST /api/rag
Content-Type: application/json
Request body:
{
"query": "songs about love and separation",
"top_k": 5,
"mode": "chat",
"user_message": "What makes these songs emotionally powerful?",
"session_id": "optional-session-id"
}
Parameters:
- query (string, required): Search query
- top_k (integer, optional): Number of results to retrieve (default: 5, max: 10)
- mode (string, optional): RAG mode - "summary", "recommendation", or "chat" (default: "summary")
- user_message (string, optional): User's question for chat mode
- session_id (string, optional): Session ID for conversation continuity in chat mode
Response:
{
"query": "songs about love and separation",
"response": "The retrieved songs explore themes of longing, heartbreak, and the pain of separation. They capture deep emotional connections and the bittersweet nature of love lost, making them perfect for moments of reflection and emotional catharsis.",
"sources": [
{
"title": "Song Title",
"poet": "Artist Name",
"text": "Excerpt from the song...",
"language": "en",
"score": 0.95
}
],
"mode": "chat",
"session_id": "abc-123-def-456",
"chat_history": [
{"role": "user", "content": "What makes these songs emotionally powerful?"},
{"role": "assistant", "content": "The retrieved songs explore..."}
],
"rag_available": true
}
Chat Memory Features:
- Multi-turn conversations with context retention
- Session-based memory management
- View history: GET /api/rag/history/{session_id}
- Clear history: POST /api/rag/clear/{session_id}
- Delete session: DELETE /api/rag/session/{session_id}
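The project uses LangChain's memory management for this; conceptually, session-based memory boils down to keeping per-session message lists, which can be sketched with a plain dict (hypothetical code, not the project's implementation):

```python
# Conceptual sketch of session-based chat memory. The real project
# uses LangChain memory; this only illustrates the idea.
import uuid

_sessions = {}  # session_id -> list of {"role", "content"} dicts

def append_turn(session_id, role, content):
    """Record one chat turn; create a session ID if none was given."""
    sid = session_id or str(uuid.uuid4())
    _sessions.setdefault(sid, []).append({"role": role, "content": content})
    return sid

def get_history(session_id):
    return _sessions.get(session_id, [])

def clear_history(session_id):
    _sessions.pop(session_id, None)
```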
See CHAT_MEMORY.md for detailed documentation.
Note: RAG features require HUGGINGFACEHUB_API_TOKEN or HUGGINGFACE_TOKEN environment variable.
Create a .env file in the project root:
# Hugging Face (optional, for higher rate limits)
HUGGINGFACE_TOKEN=your_token_here
# RAG / AI Features (optional - enables AI-powered summaries and recommendations)
OPENAI_API_KEY=your_openai_key_here # For OpenAI GPT models (recommended)
# OR use HUGGINGFACE_TOKEN above for HuggingFace models (free alternative)
# Dataset Configuration
DATASET_ID=Sourabh2/Hindi_Poems
EN_DATASET_ID=Santarabantoosoo/hf_song_lyrics_with_names,Annanay/aml_song_lyrics_balanced,sheacon/song_lyrics
DATASET_LIMIT= # Optional: cap rows per dataset for faster testing
# Model Configuration
EMBED_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
# Vector Store Configuration
VECTOR_BACKEND=faiss # Options: "faiss" or "weaviate"
FAISS_INDEX_PATH=artifacts/faiss_index
INCLUDE_ENGLISH=1 # Set to 0 to skip English corpus entirely
# Weaviate Configuration (optional)
WEAVIATE_URL= # e.g., http://localhost:8080
WEAVIATE_API_KEY= # Required for cloud instances
WEAVIATE_PERSIST_PATH=.weaviate # Local persistence directory

| Variable | Description | Default |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key for RAG features (GPT-3.5/GPT-4) | None |
| HUGGINGFACE_TOKEN | HuggingFace token (for rate limits & alternative RAG) | None |
| DATASET_ID | Hindi poems dataset(s), comma/newline separated | Sourabh2/Hindi_Poems |
| EN_DATASET_ID | English lyrics dataset(s), comma/newline separated | Multiple datasets |
| EMBED_MODEL | Sentence transformer model for embeddings | paraphrase-multilingual-MiniLM-L12-v2 |
| VECTOR_BACKEND | Vector database to use | faiss |
| FAISS_INDEX_PATH | Path to store FAISS index | artifacts/faiss_index |
| INCLUDE_ENGLISH | Whether to index English corpus | 1 (yes) |
| DATASET_LIMIT | Max rows per dataset (for testing) | None (all) |
| WEAVIATE_URL | Weaviate instance URL | None |
| WEAVIATE_API_KEY | API key for Weaviate cloud | None |
You can override language behavior at query time:
- --lang auto (default): Auto-detects script; routes Devanagari/Hinglish to the Hindi index first
- --lang hi: Search Hindi/Hinglish corpus only
- --lang en: Search English corpus only
- --lang both: Blend results from all corpora equally
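A simplified sketch of the "auto" branch: Devanagari codepoints (U+0900-U+097F) imply Hindi, otherwise both corpora are searched. The real system additionally uses keyword matching to route romanized Hinglish to the Hindi index first, which this sketch omits:

```python
# Simplified language routing sketch. Devanagari script is detected
# by Unicode range; the project also applies Hinglish keyword
# matching, which is not shown here.
def route_language(query, lang="auto"):
    if lang != "auto":
        return lang  # explicit --lang override wins
    if any("\u0900" <= ch <= "\u097f" for ch in query):
        return "hi"  # Devanagari script detected
    return "both"    # Latin script may be Hinglish or English
```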
For production deployments or distributed setups, you can use Weaviate instead of FAISS:
Step 1: Set up Weaviate
# Docker (easiest)
docker run -d -p 8080:8080 semitechnologies/weaviate:latest
# Or use Weaviate Cloud
# Sign up at https://console.weaviate.cloud
Step 2: Configure environment
VECTOR_BACKEND=weaviate
WEAVIATE_URL=http://localhost:8080
WEAVIATE_API_KEY=your_key_here # Only for cloud instances
Step 3: Run with Weaviate
python app.py --backend weaviate --rebuild --query "your query"
The system automatically falls back to FAISS if Weaviate is unreachable.
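The fallback behavior can be sketched as a try/except around the client connection. The `connect` callable here is a hypothetical stand-in for the real Weaviate client setup:

```python
# Conceptual sketch of the Weaviate-to-FAISS fallback. `connect` is a
# hypothetical stand-in for opening the real Weaviate client.
def choose_backend(preferred, connect):
    if preferred == "weaviate":
        try:
            connect()            # e.g., open the Weaviate client
            return "weaviate"
        except Exception:
            return "faiss"       # unreachable -> fall back to FAISS
    return preferred

def unreachable():
    raise ConnectionError("Weaviate not reachable")
```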
The React web UI includes:
- Modern Design: Glassmorphism effects, gradient animations, smooth transitions
- Real-time Search: Instant results as you type
- Language Detection: Automatic script detection and routing
- Sample Queries: Pre-loaded examples for Hindi, Hinglish, and English
- Responsive Layout: Works seamlessly on desktop and mobile
- Professional Icons: Lucide React icons throughout
- Dark Theme: Eye-friendly dark mode optimized for readability
- Primary dataset: Sourabh2/Hindi_Poems (~1,100 entries)
- Includes classical and modern Hindi poetry
- Automatically transliterated to Hinglish for better romanized matching
- Primary: Santarabantoosoo/hf_song_lyrics_with_names
- Secondary: Annanay/aml_song_lyrics_balanced
- Tertiary: sheacon/song_lyrics
- Combined corpus: ~20,000+ song lyrics
- Diverse genres and artists
You can add custom datasets by modifying DATASET_ID and EN_DATASET_ID in .env with Hugging Face dataset identifiers.
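Since DATASET_ID and EN_DATASET_ID accept comma- or newline-separated lists, a parser for them might look like this (hypothetical helper, shown only to illustrate the accepted format):

```python
# Hypothetical parser for the comma/newline-separated dataset lists
# accepted by DATASET_ID and EN_DATASET_ID.
def parse_dataset_ids(raw):
    """Split on commas and newlines, dropping blanks and whitespace."""
    ids = []
    for chunk in raw.replace("\n", ",").split(","):
        chunk = chunk.strip()
        if chunk:
            ids.append(chunk)
    return ids
```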
- Embedding Generation: Text chunks are encoded using multilingual sentence transformers
- Normalization: Vectors are L2-normalized for cosine similarity via dot product
- Indexing: FAISS builds an efficient similarity search index
- Query Processing: User queries are embedded with the same model
- Retrieval: Top-K most similar vectors are retrieved and ranked
- Post-processing: Results include transliterations and metadata enrichment
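The core of this pipeline (normalize, then rank by dot product) can be sketched with NumPy alone. Random vectors stand in for real sentence-transformer embeddings here; the project itself uses FAISS for the index:

```python
# NumPy-only sketch of the retrieval steps above: L2-normalize
# embeddings, then rank by dot product, which equals cosine
# similarity for normalized vectors. Random vectors stand in for
# real model embeddings.
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k_search(query_vec, doc_vecs, k=5):
    scores = doc_vecs @ query_vec   # cosine scores (pre-normalized)
    idx = np.argsort(-scores)[:k]   # indices ranked by descending score
    return idx, scores[idx]

rng = np.random.default_rng(0)
docs = l2_normalize(rng.normal(size=(100, 384)))   # 384-dim like MiniLM-L12-v2
query = l2_normalize(rng.normal(size=384))
idx, scores = top_k_search(query, docs, k=5)
```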
- Lazy Loading: Embeddings load on first request, not at startup
- Caching: FAISS index persists to disk to avoid rebuilding
- Batch Processing: Documents are embedded in batches for efficiency
- Fast Path: Pre-built indexes skip dataset reloading entirely
Cosine similarity is computed as:
similarity = (query_vector · document_vector) / (||query|| × ||document||)
Since vectors are pre-normalized:
similarity = query_vector · document_vector
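A quick numerical check of this identity: for L2-normalized vectors the denominator is 1, so the plain dot product reproduces the full cosine formula.

```python
# Verify that normalizing first makes the dot product equal the
# full cosine similarity formula.
import numpy as np

rng = np.random.default_rng(1)
q, d = rng.normal(size=384), rng.normal(size=384)
cosine = (q @ d) / (np.linalg.norm(q) * np.linalg.norm(d))
qn, dn = q / np.linalg.norm(q), d / np.linalg.norm(d)
assert np.isclose(cosine, qn @ dn)
```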
You may see warnings about symlinks from huggingface_hub. This is harmless - caching still works, just uses more disk space. To silence it, enable Windows Developer Mode.
If npm run commands fail due to the Stats&AI folder name, use the direct Vite binary:
node ./node_modules/vite/bin/vite.js dev
If port 8000 or 5173 is occupied:
# Backend
uvicorn api:app --reload --port 8001
# Frontend (update VITE_API_URL accordingly)
node ./node_modules/vite/bin/vite.js dev --port 5174
The initial run:
- Downloads the sentence transformer model (~500MB)
- Downloads the datasets (~50-100MB combined)
- Builds the FAISS index (~1-2 minutes)
Subsequent runs use cached artifacts and start instantly.
- Cosine similarity achieved through L2-normalization before indexing
- Hindi chunks include automatic Hinglish transliteration for improved romanized query matching
- Multiple Hugging Face datasets can be specified via comma/newline separation in .env
- Language detection uses script analysis (Devanagari vs. Latin) and keyword matching
- Results include similarity scores, language tags, and transliterations where applicable
MIT License - feel free to use this project for learning, development, or production.
Contributions welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.
Built using FastAPI, React, FAISS, and Sentence Transformers
