# System Architecture & Data Flow Diagrams

## 🏗️ Complete System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         USER INTERFACE                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────┐  ┌──────────────────────┐ │
│  │   Streamlit Web App (app.py)    │  │  Jupyter Notebook    │ │
│  │                                 │  │  (Learning)          │ │
│  │  • Index Documents Tab          │  │                      │ │
│  │  • Search Tab                   │  │  • Experiments       │ │
│  │  • Stats Tab                    │  │  • Visualizations    │ │
│  └─────────────────────────────────┘  │  • Hands-on coding   │ │
│                                       └──────────────────────┘ │
└──────────────────────┬──────────────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────────────┐
│                    SEMANTIC SEARCH ENGINE                       │
│                   (src/search_engine.py)                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  • Orchestrates all components                                 │
│  • Manages indexing pipeline                                   │
│  • Handles search queries                                      │
│                                                                 │
└──┬────────────┬─────────────┬─────────────┬────────────────────┘
   │            │             │             │
   ▼            ▼             ▼             ▼
┌─────────┐ ┌──────────┐ ┌────────────┐ ┌──────────────┐
│Ingestion│ │ Chunking │ │ Embeddings │ │ Vector Store │
│(ingest) │ │(chunking)│ │(embeddings)│ │ (vect_store) │
└─────────┘ └──────────┘ └────────────┘ └──────────────┘
```
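
In code, this layering means the UI only ever talks to the engine, and the engine delegates to the four components. A minimal usage sketch follows; the import path, the constructor, the `index_documents` method name, and the `top_k` parameter are assumptions based on the diagram, while `search(...)` appears verbatim in the search flow below.

```python
# Usage sketch: how the Streamlit app or notebook drives the engine.
# Assumes the module layout from the diagram; `index_documents`, the
# constructor, and `top_k` are guesses — only `search` is shown later.
from src.search_engine import SemanticSearchEngine

engine = SemanticSearchEngine()

# Indexing pipeline: ingest -> chunk -> embed -> store in ChromaDB
engine.index_documents("data/documents")

# Search pipeline: embed the query, retrieve nearest chunks, rank by similarity
results = engine.search("What are embeddings?", top_k=5)
for result in results:
    print(result)
```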

## 📊 Data Flow: Indexing Pipeline

```
┌──────────────────────────────────────────────────────────────────┐
│                     DOCUMENT INDEXING FLOW                        │
└──────────────────────────────────────────────────────────────────┘

STEP 1: INGESTION
────────────────────────────────────────────────────────────────────

  PDF/TXT/MD Files
         │
         ├─ machine_learning_intro.md
         ├─ embeddings_guide.md
         └─ vector_databases.md
                   │
                   ▼
     ┌──────────────────────────┐
     │ DocumentIngester         │
     │ (src/ingestion.py)       │
     │                          │
     │ • Load files             │
     │ • Extract text           │
     │ • Handle different       │
     │   formats                │
     └────────────┬─────────────┘
                  │
                  ▼
     Raw Text + Metadata
     (filename, page_number)


STEP 2: CHUNKING
────────────────────────────────────────────────────────────────────

  "Machine learning is a subset...
   At its core, ML is about creating...
   Supervised learning involves..." (3000 chars)
           │
           ▼
  ┌──────────────────────────────┐
  │ FixedSizeChunker             │
  │ (src/chunking.py)            │
  │                              │
  │ chunk_size = 500             │
  │ overlap = 100                │
  └────────────┬─────────────────┘
               │
               ▼
  Chunk 1: "Machine learning is... [500 chars]"
  Chunk 2: "[overlap=100] ...core, ML is about... [500 chars]"
  Chunk 3: "[overlap=100] ...Supervised learning... [500 chars]"
  (6 chunks total)


STEP 3: EMBEDDINGS
────────────────────────────────────────────────────────────────────

  Chunk 1: "Machine learning is..."
  Chunk 2: "...core, ML is about..."
  Chunk 3: "...Supervised learning..."
           │
           ▼
  ┌─────────────────────────────┐
  │ OllamaEmbeddings            │
  │ (src/embeddings.py)         │
  │                             │
  │ model: nomic-embed-text     │
  │ (768 dimensions)            │
  │                             │
  │ Sends to local Ollama       │
  └────────────┬────────────────┘
               │
               ▼
  [0.12, -0.45, 0.78, ..., -0.23]  (768 numbers)
  [0.14, -0.43, 0.76, ..., -0.21]  (768 numbers)
  [0.10, -0.48, 0.80, ..., -0.25]  (768 numbers)


STEP 4: VECTOR STORE
────────────────────────────────────────────────────────────────────

  Chunks + Embeddings + Metadata
           │
           ▼
  ┌────────────────────────────────┐
  │ ChromaDB VectorStore           │
  │ (src/vector_store.py)          │
  │                                │
  │ • Store embeddings             │
  │ • Build HNSW index             │
  │ • Track metadata               │
  │ • Persist to disk              │
  └────────────┬───────────────────┘
               │
               ▼
  Persistent Storage
  ./data/chroma_db/
  ├── embeddings (vectors)
  ├── chunks (text)
  └── metadata (source, page, etc.)


═══════════════════════════════════════════════════════════════════
OUTPUT: Indexed and searchable document collection ✅
═══════════════════════════════════════════════════════════════════
```
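
Steps 2-4 map to only a few lines of Python. Below is a minimal, self-contained sketch: fixed-size chunking with overlap, embedding each chunk through a local Ollama server, and adding the results to a persistent ChromaDB collection. The helper names (`chunk_text`, `embed`), the `"documents"` collection name, the documents path, and the use of Ollama's `/api/embeddings` REST endpoint are illustrative assumptions; the project's own implementations live in `src/chunking.py`, `src/embeddings.py`, and `src/vector_store.py`.

```python
# Minimal sketch of the indexing pipeline (names and paths are assumptions).
from pathlib import Path

import requests
import chromadb


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size chunking: each chunk starts (chunk_size - overlap) after the last."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text: str) -> list[float]:
    """Embed one text via a local Ollama server (assumed /api/embeddings endpoint)."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    return resp.json()["embedding"]  # 768-dimensional vector


# Persistent ChromaDB store on disk, as in the diagram.
client = chromadb.PersistentClient(path="./data/chroma_db")
collection = client.get_or_create_collection("documents")  # collection name assumed

raw_text = Path("data/documents/machine_learning_intro.md").read_text(encoding="utf-8")
chunks = chunk_text(raw_text)

collection.add(
    ids=[f"machine_learning_intro-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=[embed(c) for c in chunks],
    metadatas=[{"source": "machine_learning_intro.md", "chunk": i} for i in range(len(chunks))],
)
```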

## 🔍 Data Flow: Search Pipeline

```
┌──────────────────────────────────────────────────────────────────┐
│                            SEARCH FLOW                            │
└──────────────────────────────────────────────────────────────────┘

STEP 1: USER QUERY
────────────────────────────────────────────────────────────────────

  User enters: "What are embeddings?"
       │
       ▼
  SemanticSearchEngine.search("What are embeddings?")


STEP 2: QUERY EMBEDDING
────────────────────────────────────────────────────────────────────

  "What are embeddings?"
           │
           ▼
  ┌─────────────────────────────┐
  │ OllamaEmbeddings.embed()    │
  │ (src/embeddings.py)         │
  │                             │
  │ Sends to Ollama             │
  └────────────┬────────────────┘
               │
               ▼
  query_vector = [0.11, -0.42, 0.75, ..., -0.22]
  (768 dimensions, same space as the documents!)


STEP 3: SIMILARITY SEARCH
────────────────────────────────────────────────────────────────────

  query_vector
       │
       ▼
  ┌────────────────────────────────┐
  │ ChromaDB.query()               │
  │ (src/vector_store.py)          │
  │                                │
  │ • Load HNSW index              │
  │ • Find nearest neighbors       │
  │ • Return top-k (k=5)           │
  │                                │
  │ Algorithm:                     │
  │  1. Start at the top layer     │
  │  2. Navigate down the graph    │
  │  3. Find ~100 candidates       │
  │  4. Refine the search          │
  │  5. Return the top 5           │
  └────────────┬───────────────────┘
               │
               ▼
  Candidate chunks ranked by similarity


STEP 4: CALCULATE SIMILARITY
────────────────────────────────────────────────────────────────────

  For each candidate chunk:

    chunk_vector = [0.12, -0.45, 0.78, ..., -0.23]
    query_vector = [0.11, -0.42, 0.75, ..., -0.22]
                   │
                   ▼
          ┌─────────────────────────┐
          │ CosineSimilarity()      │
          │ (src/similarity.py)     │
          │                         │
          │ cos(θ) = (A·B) /        │
          │      (||A||·||B||)      │
          │         = 0.9234 ✅     │
          │ (High similarity!)      │
          └────────────┬────────────┘
                       │
                       ▼
          similarity = 0.9234


STEP 5: RANK AND RETURN
────────────────────────────────────────────────────────────────────

  Sorted by similarity (highest first):

  [1] Similarity: 0.9234
      Source: embeddings_guide.md
      Text: "Embeddings are vectors that represent..."

  [2] Similarity: 0.8756
      Source: machine_learning_intro.md
      Text: "At its core, machine learning is about..."

  [3] Similarity: 0.7823
      Source: vector_databases.md
      Text: "Embeddings enable fast similarity search..."

  [4] Similarity: 0.6234
      Source: embeddings_guide.md
      Text: "Word embeddings capture semantic..."

  [5] Similarity: 0.5123
      Source: machine_learning_intro.md
      Text: "Neural networks learn patterns..."


═══════════════════════════════════════════════════════════════════
OUTPUT: User sees ranked search results ✅
═══════════════════════════════════════════════════════════════════
```
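
Step 4 is just the cosine-similarity formula applied to every candidate. The sketch below implements the exact (brute-force) version of Steps 3-5 with NumPy; the HNSW index in the vector store arrives at a similar ranking at scale without comparing the query against every stored vector. The variable names and the toy 4-dimensional data are illustrative only.

```python
# Exact nearest-neighbour ranking with cosine similarity (brute force).
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A · B) / (||A|| · ||B||), in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank(query_vector: np.ndarray, chunk_vectors: np.ndarray, top_k: int = 5) -> list[tuple[int, float]]:
    """Return (chunk_index, similarity) pairs for the top_k most similar chunks."""
    scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
    order = np.argsort(scores)[::-1][:top_k]  # highest similarity first
    return [(int(i), scores[i]) for i in order]


# Toy example: 3 stored chunks in a 4-dimensional space (768 in the real system).
chunks = np.array([[0.9, 0.1, 0.0, 0.2],
                   [0.1, 0.8, 0.3, 0.0],
                   [0.0, 0.1, 0.9, 0.4]])
query = np.array([0.8, 0.2, 0.1, 0.1])

print(rank(query, chunks, top_k=2))  # -> [(0, 0.97...), (1, 0.37...)]
```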

## 🎯 Embedding Space Visualization

```
High-Dimensional Vector Space (768 dimensions)
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│    "Neural networks"                                            │
│         •                                                       │
│        / \                                                      │
│       /   \                                                     │
│      /     •  "Machine learning"                                │
│     /       \                                                   │
│    /         \                                                  │
│   •━━━━━━━━━━━•                                                 │
│ "AI"         / "Deep learning"                                  │
│             /                                                   │
│            /                                                    │
│           •  "Artificial intelligence"                          │
│                                                                 │
│                                                                 │
│     [Very far away in embedding space]                          │
│                                                                 │
│     •  "The weather is sunny"                                   │
│                                                                 │
│     •  "I like cooking"                                         │
│                                                                 │
│     •  "Sports are fun"                                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key insight: similar concepts cluster together!
- "Neural networks", "Machine learning", "Deep learning" cluster
- "Weather", "Cooking", "Sports" are far away
```

## 🔧 Module Dependencies

```
┌─────────────────────────────────────┐
│     search_engine.py                │
│  (SemanticSearchEngine)             │
│     Main Orchestrator               │
└────────────┬──────────────┬─────────┘
             │              │
    ┌────────▼──┐    ┌──────▼────────┐
    │ ingestion │    │ config        │
    │ Load docs │    │ Settings      │
    └────────┬──┘    └───────────────┘
             │
    ┌────────▼─────────────────┐
    │   chunking.py            │
    │ Split text into pieces   │
    └────────┬─────────────────┘
             │
    ┌────────▼─────────────────┐
    │   embeddings.py          │
    │ Generate via Ollama      │
    └────────┬─────────────────┘
             │
  ┌──────────▼─────────┐
  │ vector_store.py    │
  │ Store in ChromaDB  │
  │                    │
  └──────────┬─────────┘
             │
  ┌──────────▼───────┐
  │ similarity.py    │
  │ Calc closeness   │
  └──────────────────┘
```

## 📈 Performance Characteristics

```
Operation                  Typical time   Notes
────────────────────────────────────────────────────────
Embed 1 text               100-500 ms     per text
Embed 1000 chunks          30-60 s        with Ollama
Index in ChromaDB          5-10 s         1000 chunks
Single query               100-500 ms     depends on index size
Full pipeline (100 docs)   2-5 min        end-to-end

Storage
────────────────────────────────────────────────────────
Embeddings                 3 GB           per 1M vectors
Metadata                   100 MB         per 1M vectors
Total index                ~3.1 GB        per 1M vectors

Scalability
────────────────────────────────────────────────────────
ChromaDB on a laptop       up to ~1M vectors
Pinecone (cloud)           effectively unlimited
HNSW indexing              sub-linear search time
```
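
The storage figures follow directly from the embedding dimensionality: 768 values per vector at 4 bytes each is about 3 KB, so one million vectors is roughly 3 GB of raw vector data before metadata. A quick back-of-envelope check (float32 storage is an assumption; ChromaDB's actual on-disk overhead will differ):

```python
# Back-of-envelope storage estimate for 1M embeddings (float32 assumed).
dims = 768
bytes_per_value = 4                  # float32
vectors = 1_000_000

per_vector = dims * bytes_per_value  # 3072 bytes ≈ 3 KB per vector
total = per_vector * vectors         # ≈ 3.07 GB of raw vector data

print(f"{per_vector} bytes/vector, {total / 1e9:.2f} GB for {vectors:,} vectors")
# -> 3072 bytes/vector, 3.07 GB for 1,000,000 vectors
```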

## 🔄 Configuration Impact

```
┌──────────────────────────────────────────────────────────┐
│                  Configuration → Impact                  │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ CHUNK_SIZE ──┬──> Smaller (200) ──> Granular, many       │
│              │                      chunks, slower       │
│              └──> Larger (1000) ──> More context, fewer  │
│                                     chunks, faster       │
│                                                          │
│ OVERLAP ─────┬──> None (0) ───────> Less storage,        │
│              │                      lower recall         │
│              └──> High (200) ─────> More storage,        │
│                                     better recall        │
│                                                          │
│ TOP_K ───────┬──> Small (1-3) ────> Fast, confident      │
│              │                      results              │
│              └──> Large (20+) ────> Slower, may include  │
│                                     false positives      │
│                                                          │
│ EMBEDDING ───┬──> Ollama (local) ─> Free, private,       │
│ PROVIDER     │                      good quality         │
│              └──> OpenAI (API) ───> Costs money,         │
│                                     best quality         │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

## 🎯 Decision Tree: Which Configuration?

```
                     Your use case?
                          │
          ┌───────────────┼───────────────┐
          │               │               │
      Research       Production        Learning
          │               │               │
          ▼               ▼               ▼
   Optimize for     Optimize for     Optimize for
     accuracy      speed + accuracy  understanding
          │               │               │
   chunk_size:      chunk_size:      chunk_size:
   300-500          500-700          500 (default)

   overlap:         overlap:         overlap:
   150-200          50-100           100

   model:           model:           model:
   Large            Medium           Small
   (slower)         (balanced)       (fast)
```
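
In code, the decision tree is just three presets over the knobs from the Configuration Impact table. A sketch follows; the concrete values are midpoints of the ranges in the tree, and the setting names are assumptions — the project's real defaults live in src/config.py and may be named differently.

```python
# Illustrative presets derived from the decision tree above.
# Names and exact values are assumptions; see src/config.py for the real settings.
PRESETS = {
    # Midpoints of the ranges shown in the decision tree.
    "research":   {"chunk_size": 400, "overlap": 175, "model_size": "large"},
    "production": {"chunk_size": 600, "overlap": 75,  "model_size": "medium"},
    "learning":   {"chunk_size": 500, "overlap": 100, "model_size": "small"},
}


def settings_for(use_case: str) -> dict:
    """Pick a starting configuration; tune from here based on your own evaluation."""
    return PRESETS[use_case]


print(settings_for("learning"))
```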