┌─────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────┐ ┌──────────────────────┐ │
│ │ Streamlit Web App (app.py) │ │ Jupyter Notebook │ │
│ │ │ │ (Learning) │ │
│ │ • Index Documents Tab │ │ │ │
│ │ • Search Tab │ │ • Experiments │ │
│ │ • Stats Tab │ │ • Visualizations │ │
│ └─────────────────────────────────┘ │ • Hands-on coding │ │
│ └──────────────────────┘ │
└──────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────────┐
│ SEMANTIC SEARCH ENGINE │
│ (src/search_engine.py) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ • Orchestrates all components │
│ • Manages indexing pipeline │
│ • Handles search queries │
│ │
└──┬────────────┬─────────────┬─────────────┬────────────────────┘
   │            │             │             │
   ▼            ▼             ▼             ▼
┌──────────┐ ┌──────────┐ ┌────────────┐ ┌─────────────┐
│ Ingestion│ │ Chunking │ │ Embeddings │ │ Vector Store│
│ (ingest) │ │(chunking)│ │(embeddings)│ │ (vect_store)│
└──────────┘ └──────────┘ └────────────┘ └─────────────┘
```

## 📊 Data Flow: Indexing Pipeline

```
┌──────────────────────────────────────────────────────────────────┐
│                    DOCUMENT INDEXING FLOW                        │
└──────────────────────────────────────────────────────────────────┘

STEP 1: INGESTION
────────────────────────────────────────────────────────────────────

 PDF/TXT/MD Files
        │
        ├─ machine_learning_intro.md
        ├─ embeddings_guide.md
        └─ vector_databases.md
        │
        ▼
 ┌──────────────────────────┐
 │   DocumentIngester       │
 │   (src/ingestion.py)     │
 │                          │
 │   • Load files           │
 │   • Extract text         │
 │   • Handle different     │
 │     formats              │
 └────────────┬─────────────┘
              │
              ▼
 Raw Text + Metadata
 (filename, page_number)


STEP 2: CHUNKING
────────────────────────────────────────────────────────────────────

 "Machine learning is a subset...
  At its core, ML is about creating...
  Supervised learning involves..." (3000 chars)
        │
        ▼
 ┌──────────────────────────────┐
 │   FixedSizeChunker           │
 │   (src/chunking.py)          │
 │                              │
 │   chunk_size = 500           │
 │   overlap = 100              │
 └────────────┬─────────────────┘
              │
              ▼
 Chunk 1: "Machine learning is... [500 chars]"
 Chunk 2: "[overlap=100] ...core, ML is about... [500 chars]"
 Chunk 3: "[overlap=100] ...Supervised learning... [500 chars]"
 (8 chunks total: stride = chunk_size - overlap = 400 chars)


STEP 3: EMBEDDINGS
────────────────────────────────────────────────────────────────────

 Chunk 1: "Machine learning is..."
 Chunk 2: "...core, ML is about..."
 Chunk 3: "...Supervised learning..."
        │
        ▼
 ┌─────────────────────────────┐
 │   OllamaEmbeddings          │
 │   (src/embeddings.py)       │
 │                             │
 │   model: nomic-embed-text   │
 │   (768 dimensions)          │
 │                             │
 │   Sends to local Ollama     │
 └────────────┬────────────────┘
              │
              ▼
 [0.12, -0.45, 0.78, ..., -0.23] (768 numbers)
 [0.14, -0.43, 0.76, ..., -0.21] (768 numbers)
 [0.10, -0.48, 0.80, ..., -0.25] (768 numbers)


STEP 4: VECTOR STORE
────────────────────────────────────────────────────────────────────

 Chunks + Embeddings + Metadata
        │
        ▼
 ┌────────────────────────────────┐
 │   ChromaDB VectorStore         │
 │   (src/vector_store.py)        │
 │                                │
 │   • Store embeddings           │
 │   • Build HNSW index           │
 │   • Track metadata             │
 │   • Persist to disk            │
 └────────────┬───────────────────┘
              │
              ▼
 Persistent Storage
 ./data/chroma_db/
 ├── embeddings (vectors)
 ├── chunks (text)
 └── metadata (source, page, etc.)


═══════════════════════════════════════════════════════════════════
OUTPUT: Indexed and searchable document collection ✅
═══════════════════════════════════════════════════════════════════
```

## 🔍 Data Flow: Search Pipeline

```
┌──────────────────────────────────────────────────────────────────┐
│                          SEARCH FLOW                             │
└──────────────────────────────────────────────────────────────────┘

STEP 1: USER QUERY
────────────────────────────────────────────────────────────────────

 User enters: "What are embeddings?"
        │
        ▼
 SemanticSearchEngine.search("What are embeddings?")


STEP 2: QUERY EMBEDDING
────────────────────────────────────────────────────────────────────

 "What are embeddings?"
        │
        ▼
 ┌─────────────────────────────┐
 │  OllamaEmbeddings.embed()   │
 │  (src/embeddings.py)        │
 │                             │
 │  Sends to Ollama            │
 └────────────┬────────────────┘
```
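The query-embedding step can be sketched in a few lines of Python. This is a hypothetical stand-in for `src/embeddings.py`, assuming only Ollama's standard REST endpoint (`POST /api/embeddings` with `model` and `prompt` fields); the function names here are illustrative, not the project's actual interface.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embed_request(model: str, prompt: str) -> dict:
    """Payload shape expected by Ollama's embeddings endpoint."""
    return {"model": model, "prompt": prompt}

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Return the embedding vector for `text` (768 floats for nomic-embed-text)."""
    payload = json.dumps(build_embed_request(model, text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]

# Usage (requires a local Ollama server with nomic-embed-text pulled):
#   query_vector = embed("What are embeddings?")
```

Because the same model embeds both documents and queries, the resulting vector lives in the same 768-dimensional space as the indexed chunks.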
```
        │
        ▼
 query_vector = [0.11, -0.42, 0.75, ..., -0.22]
 (768 dimensions, same space as the documents!)


STEP 3: SIMILARITY SEARCH
────────────────────────────────────────────────────────────────────

 query_vector
        │
        ▼
 ┌────────────────────────────────┐
 │   ChromaDB.query()             │
 │   (src/vector_store.py)        │
 │                                │
 │   • Load HNSW index            │
 │   • Find nearest neighbors     │
 │   • Return top-k (k=5)         │
 │                                │
 │   Algorithm:                   │
 │   1. Enter at the top layer    │
 │   2. Navigate graph greedily   │
 │   3. Descend to layer 0        │
 │   4. Collect ~100 candidates   │
 │   5. Refine and return top-5   │
 └────────────┬───────────────────┘
              │
              ▼
 Candidate chunks ranked by similarity


STEP 4: CALCULATE SIMILARITY
────────────────────────────────────────────────────────────────────

 For each candidate chunk:

 chunk_vector = [0.12, -0.45, 0.78, ..., -0.23]
 query_vector = [0.11, -0.42, 0.75, ..., -0.22]
        │
        ▼
 ┌───────────────────────────────┐
 │   CosineSimilarity()          │
 │   (src/similarity.py)         │
 │                               │
 │   cos(θ) = (A·B)/(||A||·||B||)│
 │          = 0.9234 ✅          │
 │   (High similarity!)          │
 └───────────────────────────────┘
```
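The cosine-similarity step above is small enough to write out in full. A minimal pure-Python sketch (illustrative; the real `src/similarity.py` may differ in details):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(θ) = (A·B) / (||A||·||B||): 1.0 for identical directions, 0.0 for orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must share the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A vector is most similar to itself; unrelated directions score near 0:
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that cosine similarity ignores vector length and compares direction only, which is why scaling a vector does not change its score.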
```
 similarity = 0.9234


STEP 5: RANK AND RETURN
────────────────────────────────────────────────────────────────────

 Sorted by similarity (highest first):

 [1] Similarity: 0.9234
     Source: embeddings_guide.md
     Text: "Embeddings are vectors that represent..."

 [2] Similarity: 0.8756
     Source: machine_learning_intro.md
     Text: "At its core, machine learning is about..."

 [3] Similarity: 0.7823
     Source: vector_databases.md
     Text: "Embeddings enable fast similarity search..."

 [4] Similarity: 0.6234
     Source: embeddings_guide.md
     Text: "Word embeddings capture semantic..."

 [5] Similarity: 0.5123
     Source: machine_learning_intro.md
     Text: "Neural networks learn patterns..."


═══════════════════════════════════════════════════════════════════
OUTPUT: User sees ranked search results ✅
═══════════════════════════════════════════════════════════════════
```

## 🎯 Embedding Space Visualization

```
High-Dimensional Vector Space (768 dimensions, drawn here in 2D)
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│        "Neural networks"                                        │
│              •                                                  │
│             / \                                                 │
│            /   \                                                │
│           /     • "Machine learning"                            │
│          /       \                                              │
│         /         \                                             │
│        •━━━━━━━━━━━•                                            │
│     "AI"          "Deep learning"                               │
│       /                                                         │
│      /                                                          │
│     • "Artificial intelligence"                                 │
│                                                                 │
│                                                                 │
│   [Very far away in embedding space]                            │
│                                                                 │
│     • "The weather is sunny"                                    │
│                                                                 │
│           • "I like cooking"                                    │
│                                                                 │
│                 • "Sports are fun"                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key insight: similar concepts cluster together!
- "Neural networks", "Machine learning", "Deep learning" form one cluster
- "Weather", "Cooking", "Sports" sit far away from it
```

## 🔧 Module Dependencies

```
┌──────────────────────────────────────┐
│        search_engine.py              │
│     (SemanticSearchEngine)           │
│        Main Orchestrator             │
└────────────┬──────────────┬──────────┘
             │              │
    ┌────────▼──┐   ┌──────▼────────┐
    │ ingestion │   │ config        │
    │ Load docs │   │ Settings      │
    └────────┬──┘   └───────────────┘
             │
    ┌────────▼──────────────────┐
    │ chunking.py               │
    │ Split text into pieces    │
    └────────┬──────────────────┘
             │
    ┌────────▼──────────────────┐
    │ embeddings.py             │
    │ Generate via Ollama       │
    └────────┬──────────────────┘
             │
    ┌────────▼───────────┐
    │ vector_store.py    │
    │ Store in ChromaDB  │
    └────────┬───────────┘
             │
    ┌────────▼──────────┐
    │ similarity.py     │
    │ Score closeness   │
    └───────────────────┘
```

## 📈 Performance Characteristics

```
Operation                      Time          Scale
────────────────────────────────────────────────────────
Embed 1 text                   100-500ms     per text
Embed 1000 chunks              30-60s        with Ollama
Index in ChromaDB              5-10s         1000 chunks
Single query                   100-500ms     depends on index size
Full pipeline (100 docs)       2-5 min       end-to-end

Storage
────────────────────────────────────────────────────────
Embeddings                     ~3GB          per 1M vectors (768 floats × 4 bytes)
Metadata                       ~100MB        per 1M vectors
Total index                    ~3.1GB        per 1M vectors

Scalability
────────────────────────────────────────────────────────
ChromaDB on a laptop           Up to ~1M vectors
Pinecone (managed cloud)       Effectively unlimited
HNSW indexing                  Sub-linear search time
```

## 🔄 Configuration Impact

```
┌──────────────────────────────────────────────────────────┐
│ Configuration → Impact                                   │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ CHUNK_SIZE ──┬──> Smaller (200) ──> Granular, many       │
│              │                      chunks, slower       │
│              └──> Larger (1000) ──> More context, few    │
│                                     chunks, faster       │
│                                                          │
│ OVERLAP ─────┬──> None (0) ──────>  Less storage,        │
│              │                      lower recall         │
│              └──> High (200) ────>  More storage,        │
│                                     better recall        │
│                                                          │
│ TOP_K ───────┬──> Small (1-3) ───>  Fast, confident      │
│              │                      results              │
│              └──> Large (20+) ───>  Slower, may include  │
│                                     false positives      │
│                                                          │
│ EMBEDDING ───┬──> Ollama (local) ─> Free, private,       │
│ PROVIDER     │                      good quality         │
│              └──> OpenAI (API) ───> Costs money,         │
│                                     best quality         │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

## 🎯 Decision Tree: Which Configuration?

```
                Your use case?
                      │
       ┌──────────────┼──────────────┐
       │              │              │
   Research      Production      Learning
       │              │              │
       ▼              ▼              ▼
 Optimize for   Optimize for   Optimize for
   accuracy    speed+accuracy  understanding
       │              │              │
 chunk_size:    chunk_size:    chunk_size:
   300-500        500-700      500 (default)

 overlap:       overlap:       overlap:
  150-200        50-100            100

 model:         model:         model:
  Large          Medium          Small
 (slower)      (balanced)       (fast)
```
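To make the CHUNK_SIZE/OVERLAP trade-off above concrete, here is a minimal fixed-size chunker in the spirit of `FixedSizeChunker` (an illustrative sketch, not the actual `src/chunking.py` code):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split `text` into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # smaller stride => more chunks, better recall
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

# Toy example with chunk_size=4, overlap=2 (stride 2):
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Raising the overlap shrinks the stride, so the same text yields more (and more redundant) chunks; setting overlap to 0 makes chunks disjoint, saving storage at the cost of splitting sentences across chunk boundaries.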