
Embedding Model Comparison for Turkish Medical Texts

Which embedding model works best for Turkish medical texts? I tested 3 popular models with the MedTurkQuaD dataset. The fastest model isn't always the best — here's the proof.


TL;DR (Quick Summary)

I compared 3 popular embedding models (Multi-MiniLM, BGE-M3, all-mpnet) using a Turkish medical Q&A dataset. The results are surprising:

  • BGE-M3: Best retrieval (MRR: 0.0338) but slowest (50.59s)
  • Multi-MiniLM: Fastest (15.81s) and champion in Turkish morphology (0.9284)
  • all-mpnet: Great for English but fails in Turkish (MRR: 0.0084)

Key takeaway: A "multilingual" label isn't enough. Domain-specific testing is essential!


The Story: Why I Needed This Test

Last month, I was developing a medical Q&A system. I tried the most popular embedding models on HuggingFace. The results... were disastrous.

For the question "What is an abscess?", the system returned "lung cancer" as the answer. I switched models, got slightly better results, but still not satisfactory.

That's when I realized: Benchmark tables are valid for English. There was no data for the Turkish + Medical combination.

In this article, I'll show you which model actually works through a systematic comparison.


Why This Comparison Matters

Common Problems When Choosing an Embedding Model

"Let me pick the most popular model" → Popularity ≠ Suitable for your use case
"It says multilingual, supports Turkish" → In theory yes, in practice sometimes no
"Ranked #1 on benchmarks" → In which language? Which domain?
"Bigger model is better" → Slower, more expensive, not always better

What Makes This Test Different

Same dataset → Fair comparison
Same metrics → Objective evaluation
Reproducible code → You can try it yourself
Turkish + Domain-specific → Real-world scenario


Test Setup

Competing Models

| Model | Dimensions | Features | Expectation |
|---|---|---|---|
| Multi-MiniLM-L12-v2 | 384 | Lightweight, multilingual | Fast, but is it sufficient? |
| BGE-M3 | 1024 | Next-gen, powerful | Best, but how slow? |
| all-mpnet-base-v2 | 768 | English SOTA | What about Turkish? |

Test Arena: MedTurkQuaD Dataset

What? Turkish medical Q&A dataset
Why difficult? Two-layered challenge:

  1. Turkish morphology (suffixes, inflections)
  2. Medical terminology (domain-specific)

Example Challenge:

Question: "An abscess is usually a type of inflammation caused by what?"

 Correct: "pyogenic bacteria"
 Misleading Negative: "uncontrolled cells in lung tissue..."

→ Both answers contain medical terms!
→ Model must capture subtle differences

Reproducibility Guarantee

# Same results on every run
import random
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

Why 42? The answer to life, the universe, and everything (and the AI community's standard seed)


Test Process: Step by Step

Step 1: Data Preparation - Negative Sampling

def process_qa_data(qa_data):
    all_queries, all_positives, all_negatives = [], [], []
    
    # Questions and correct answers
    for doc in qa_data.get('data', []):
        for paragraph in doc.get('paragraphs', []):
            for qa_pair in paragraph.get('qas', []):
                all_queries.append(qa_pair['question'])
                all_positives.append(qa_pair['answers'][0]['text'])
    
    # Random negative for each positive
    num_pairs = len(all_positives)
    for i in range(num_pairs):
        idx = i
        while idx == i:  # Don't pick the same answer
            idx = random.choice(range(num_pairs))
        all_negatives.append(all_positives[idx])
    
    return all_queries, all_positives, all_negatives

Why this method?

  • In the real world, correct answers get lost among wrong ones
  • Tests the model's discrimination ability
  • Classic benchmark method for retrieval systems
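
A minimal usage sketch, assuming the MedTurkQuaD JSON is saved as data.json (as in the Quick Start below):

import json, random

random.seed(42)  # same seed as the reproducibility block
with open("data.json", encoding="utf-8") as f:
    qa_data = json.load(f)

queries, positives, negatives = process_qa_data(qa_data)
print(len(queries), len(positives), len(negatives))  # three equal-length lists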

Step 2: Embedding Generation and Time Measurement

for model_name, model in models_to_test.items():
    start_time = time.time()
    
    # Encode
    query_vectors = model.encode(queries, convert_to_numpy=True, show_progress_bar=True)
    doc_vectors = model.encode(documents, convert_to_numpy=True, show_progress_bar=True)
    
    duration = time.time() - start_time
    print(f" {model_name}: {duration:.2f} seconds")

Output:

 Multi-MiniLM-L12-v2: 15.81 seconds
 BGE-M3: 50.59 seconds
 all-mpnet-base-v2: 25.00 seconds

Step 3: Similarity Search with FAISS

Critical Detail: L2 Normalization

dim = query_vectors.shape[1]
index = faiss.IndexFlatIP(dim)  # Inner Product Index

# L2 normalization → inner product becomes cosine similarity
faiss.normalize_L2(doc_vectors)
faiss.normalize_L2(query_vectors)

index.add(doc_vectors)
D, I = index.search(query_vectors, k=len(documents))

Why normalize?

| Case | Formula | What it measures |
|---|---|---|
| No normalization | IP(A,B) = \|A\| × \|B\| × cos(θ) | Magnitude + angle |
| With normalization | IP(A,B) = cos(θ) | Angle only (semantic) |
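
A two-line numeric check of why this matters (plain NumPy, illustrative values):

import numpy as np

a = np.array([3.0, 4.0])   # |a| = 5
b = np.array([6.0, 8.0])   # |b| = 10, same direction as a
print(a @ b)               # 50.0 → magnitude dominates the raw inner product
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
print(a @ b)               # 1.0 → pure cos(θ) after L2 normalization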

Evaluation: 4 Different Metrics

MRR (Mean Reciprocal Rank)

What does it measure? On average, what rank is the correct answer?

def compute_mrr(search_results, true_indices):
    rr_sum = 0
    for i in range(len(true_indices)):
        ranks = np.where(search_results[i] == true_indices[i])[0]
        if len(ranks) > 0:
            rr_sum += 1 / (ranks[0] + 1)
    return rr_sum / len(true_indices)

Interpretation (as a rule of thumb, 1/MRR is the typical rank of the correct answer):

  • MRR = 1.0 → Correct answer at rank 1 for every question (perfect!)
  • MRR = 0.5 → Correct answer around rank 2 on average
  • MRR = 0.033 → Correct answer around rank 30 (low)

Worked example: if three queries rank the correct answer at positions 1, 3, and 10, then MRR = (1/1 + 1/3 + 1/10) / 3 ≈ 0.48.

Recall@K

What does it measure? Is the correct answer in the top K results?

| Metric | Description |
|---|---|
| Recall@1 | Is the first result correct? (strictest test) |
| Recall@3 | Is it in the top 3? |
| Recall@10 | Is it in the top 10? |

Why important?

  • Recall@1 → If you're showing only one result to the user
  • Recall@10 → If you're showing a list
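
A minimal implementation in the same style as compute_mrr above (a sketch, not necessarily the repo's exact code):

def compute_recall_at_k(search_results, true_indices, k):
    # Fraction of queries whose correct document appears in the top-k hits
    hits = sum(
        1 for i, true_idx in enumerate(true_indices)
        if true_idx in search_results[i][:k]
    )
    return hits / len(true_indices)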

Morphology Score

What does it measure? Sensitivity to Turkish suffixes

Test pairs:

morph_pairs = [
    ("geliyorum", "gelmekteyim"),      # I'm coming (different forms)
    ("gidecek", "gider"),              # Will go / goes
    ("yaptım", "yapıyorum"),           # I did / I'm doing
    ("okuyor", "okumakta"),            # Reading (different forms)
    ("koşacağım", "koşarım"),          # I will run / I run
    ("araba", "arabalar"),             # Car / cars
    ("evdeyim", "evde olmak")          # I'm at home (different forms)
]

Calculation:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity for each pair
similarities = []
for pair in morph_pairs:
    vec1 = model.encode(pair[0])
    vec2 = model.encode(pair[1])
    sim = cosine_similarity([vec1], [vec2])[0][0]
    similarities.append(sim)

morph_score = np.mean(similarities)

Interpretation:

  • Score > 0.9 → Excellent Turkish understanding
  • Score 0.7-0.9 → Good
  • Score < 0.7 → Weak (treats each suffix as different word)
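
If a model scores low, a per-pair printout (reusing morph_pairs and similarities from the calculation above) shows which suffix patterns it misses:

for (w1, w2), sim in zip(morph_pairs, similarities):
    print(f"{w1:<12} ~ {w2:<14} → {sim:.3f}")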

Silhouette Score

What does it measure? How organized is the embedding space?

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=2, random_state=42, n_init='auto')
labels = kmeans.fit_predict(doc_vectors)
sil_score = silhouette_score(doc_vectors, labels)

Interpretation:

  • Close to +1 → Clusters are well separated
  • Close to 0 → Clusters overlap
  • Close to -1 → Incorrectly clustered

Results: Champions and Surprises

Complete Results Table

┌─────────────────────┬──────┬──────────┬────────────┬─────────────┬────────┬──────────┬──────────┬──────────┬───────────┐
│ Model               │ Dim  │ Time (s) │ Silhouette │ Morph Score │  MRR   │ Recall@1 │ Recall@3 │ Recall@5 │ Recall@10 │
├─────────────────────┼──────┼──────────┼────────────┼─────────────┼────────┼──────────┼──────────┼──────────┼───────────┤
│ BGE-M3              │ 1024 │  50.59   │   0.0366   │   0.8113    │ 0.0338 │  1.12%   │  3.24%   │  4.91%   │   7.66%   │
│ Multi-MiniLM-L12-v2 │  384 │  15.81   │   0.0758   │   0.9284    │ 0.0200 │  0.70%   │  1.93%   │  2.72%   │   4.34%   │
│ all-mpnet-base-v2   │  768 │  25.00   │   0.1185   │   0.7460    │ 0.0084 │  0.30%   │  0.78%   │  1.29%   │   1.85%   │
└─────────────────────┴──────┴──────────┴────────────┴─────────────┴────────┴──────────┴──────────┴──────────┴───────────┘

Visual Analysis

1. Performance Metrics (2×2 Grid)

[Figure: Performance Report, 2×2 grid of metric bar charts]

What we see:

  • MRR chart: All bars are short (low values) → Domain is very challenging
  • Recall@1 chart: BGE-M3 clearly ahead but still low
  • Morph Score chart: Multi-MiniLM champion 🏆
  • Silhouette chart: all-mpnet first but this is misleading

2. Speed vs Quality Trade-off (Scatter Plot)

[Figure: Speed vs Quality scatter plot]

Analysis:

  • Top left = Ideal zone (fast + quality)
  • BGE-M3: Top right (slow but quality)
  • Multi-MiniLM: Bottom left (fast but medium MRR)
  • all-mpnet: Lost in the middle (neither fast nor quality)

Decision guide:

  • Real-time system → Multi-MiniLM
  • Offline batch → BGE-M3

3. Radar Chart: Model Profiles

[Figure: Radar chart of model profiles]

Character analysis:

BGE-M3: "Slow but Effective"

  • High MRR, low speed
  • Ideal for batch processing in large projects

Multi-MiniLM: "Fast and Turkish-Specialized"

  • High speed and morph score
  • Perfect for real-time applications

all-mpnet: "Organized but Wrong"

  • Only good silhouette
  • Don't use for Turkish

Surprising Findings and Analysis

Finding 1: Why Are MRR Values So Low?

Expectation: MRR > 0.5 (correct answer in top 2)
Reality: MRR = 0.008-0.033 (correct answer at rank 30-120)

3 Reasons:

  1. Domain Gap

    • Models trained on Wikipedia, books, news
    • Medical terminology is less than 1% of training data
    • Terms like "pyogenic bacteria" rarely seen
  2. Negative Sampling Difficulty

    • Randomly selected "wrong" answers are actually related
    • Both contain medical terms → Model confuses them
    • Very similar to real-world scenario (good test!)
  3. Lack of Fine-tuning

    • General-purpose models weak in specific domains
    • 5-10x improvement expected with fine-tuning

**Practical lesson:** Don't panic if you see MRR < 0.1. It's normal for domain-specific datasets. Fine-tuning is essential!

Finding 2: Morphology Champion ≠ Retrieval Champion

| Model | Morph Score | MRR |
|---|---|---|
| Multi-MiniLM | 🥇 0.9284 | 🥈 0.0200 |
| BGE-M3 | 🥈 0.8113 | 🥇 0.0338 |

The rankings invert: the morphology winner loses on retrieval!

Why?

Required for morphology:

  • Surface-level similarity ("geliyorum" ≈ "gelmekteyim")
  • Grammar rules
  • Syntax patterns

Required for retrieval:

  • Deep semantic understanding
  • Context awareness
  • Domain knowledge

Analogy:

Morphology = Recognizing word forms
Retrieval = Understanding word meanings

🇬🇧 Finding 3: English Model's Turkish Fiasco

all-mpnet-base-v2 report card:

  • MRR: 0.0084 (last place)
  • Morph: 0.7460 (last place)
  • Recall@1: 0.30% (last place)
  • Silhouette: 0.1185 (1st place) 🤔

Why high silhouette but low others?

Silhouette measures "organization", not "correctness". The model organized vectors nicely but organized them wrongly.

Analogy:

You organized books by color (well organized)
But people searching by topic can't find them (wrongly organized)

Lesson: Don't trust a single metric!

Finding 4: Dramatic Speed Difference

| Model | Time | vs Multi-MiniLM |
|---|---|---|
| Multi-MiniLM | 15.81 s | 1.0× (baseline) |
| all-mpnet | 25.00 s | 1.6× slower |
| BGE-M3 | 50.59 s | 3.2× slower |

Real-world impact:

Scaling the measured encode times to 1,000× this workload:

  • Multi-MiniLM: ~4.4 hours
  • all-mpnet: ~7 hours
  • BGE-M3: ~14 hours

In real-time systems:

  • 50ms vs 160ms per user makes a difference
  • 100 concurrent users = server struggles

Decision Guide: Which Model Should I Choose?

Scenario-Based Recommendations

Scenario 1: Customer Support Chatbot (Real-time)

Requirements:

  • Speed critical (users won't wait)
  • Turkish morphology important (users write differently)
  • Sufficient accuracy (doesn't need to be perfect)

Choice: Multi-MiniLM-L12-v2

Why:

  • 3.2x faster (vs BGE-M3)
  • Morphology champion (0.9284)
  • Sufficient MRR (0.0200)
  • Small vectors = low RAM

Example implementation:

from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Encode all KB answers (offline)
kb_answers = ["answer1", "answer2", ...]
answer_vectors = model.encode(kb_answers)

# Create FAISS index
index = faiss.IndexFlatIP(384)
faiss.normalize_L2(answer_vectors)
index.add(answer_vectors)

# When user question arrives (online)
def get_answer(user_question):
    q_vec = model.encode([user_question])
    faiss.normalize_L2(q_vec)
    D, I = index.search(q_vec, k=3)
    return [kb_answers[i] for i in I[0]]
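
With the index in place, answering is a single lookup; for example, get_answer("Apse nedir?") ("What is an abscess?", a hypothetical user query) returns the three KB answers closest to the question in embedding space.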

Scenario 2: Medical Document Search Engine (Offline)

Requirements:

  • Quality critical (wrong result = critical error)
  • Speed secondary (batch processing)
  • Very specific domain

Choice: BGE-M3 + Fine-tuning

Why:

  • Best MRR (0.0338)
  • Large model = more capacity
  • Speed irrelevant in batch processing

Fine-tuning example:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load model
model = SentenceTransformer('BAAI/bge-m3')

# Prepare medical Q&A pairs
train_examples = [
    InputExample(texts=['What is an abscess?', 'inflammation caused by pyogenic bacteria']),
    InputExample(texts=['High blood pressure...', 'hypertension...']),
    # ... at least 1000 examples
]

# Create DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Train with contrastive loss
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=100
)

# Save
model.save('bge-m3-medical-turkish')

Scenario 3: E-commerce Product Search

Requirements:

  • 🇹🇷 Turkish variations (tişört/tshirt, çorap/sock)
  • Medium speed
  • Lots of products

Choice: Multi-MiniLM-L12-v2

Why:

  • Morphology champion (users write differently)
  • Fast
  • Small vectors = millions of products can be indexed
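
A back-of-envelope size estimate makes the vector-size point concrete (assuming float32 vectors in a flat FAISS index):

# 1M products as float32 vectors in a flat index
n_products = 1_000_000
print(f"MiniLM (384-dim): {n_products * 384 * 4 / 2**30:.1f} GiB")   # ~1.4 GiB
print(f"BGE-M3 (1024-dim): {n_products * 1024 * 4 / 2**30:.1f} GiB") # ~3.8 GiB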

Scenario 4: Multilingual Platform (TR + EN + DE)

Requirements:

  • Cross-lingual search
  • Single model for multiple languages

Choice: BGE-M3

Why:

  • 100+ language support
  • Good cross-lingual alignment
  • Single embedding space
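
A quick sanity check you could run for cross-lingual alignment (a sketch; the Turkish query is a translation of the English one):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('BAAI/bge-m3')
tr = model.encode("Apse nedir?")           # Turkish: "What is an abscess?"
en = model.encode("What is an abscess?")   # English equivalent
print(util.cos_sim(tr, en))                # high similarity = good cross-lingual alignment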

Running the Code Guide

Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install packages
pip install sentence-transformers faiss-cpu scikit-learn pandas torch matplotlib seaborn

# If you have GPU
pip install faiss-gpu  # instead of faiss-cpu

Quick Start

# 1. Clone the code
git clone [repository-url]
cd embedding-comparison

# 2. Prepare data.json (or use fallback sample data)
# 3. Run
python compare_embedding_v2.py

# 4. Results
#  Table in terminal
#  3 visualizations saved as PNG

Dataset Format

MedTurkQuaD JSON structure:

{
  "data": [
    {
      "title": "Medical Topic",
      "paragraphs": [
        {
          "context": "Medical text context...",
          "qas": [
            {
              "question": "What causes abscess?",
              "answers": [
                {
                  "text": "pyogenic bacteria",
                  "answer_start": 42
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Customization Options

Add New Model

models_to_test = {
    'Multi-MiniLM-L12-v2': SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'),
    'BGE-M3': SentenceTransformer('BAAI/bge-m3'),
    'all-mpnet-base-v2': SentenceTransformer('sentence-transformers/all-mpnet-base-v2'),
    'YOUR-MODEL': SentenceTransformer('your-model-name')  # Add here
}

Adjust Turkish Morphology Tests

# Add more challenging pairs
morph_pairs = [
    ("geliyorum", "gelmekteyim"),
    ("custom_word1", "custom_word2"),  # Add your own
]

Change Evaluation Metrics

# Adjust recall@k values
recall_at = [1, 3, 5, 10, 20]  # Add @20 if needed

Understanding the Visualizations

1. Performance Metrics Report (4 subplots)

Purpose: Compare all models across 4 key metrics

How to read:

  • Taller bars = better (except Silhouette; see Finding 3 above)
  • Look for consistent patterns across metrics
  • Single high bar doesn't mean best overall

2. Performance vs Speed Scatter Plot

Purpose: Trade-off analysis

Quadrants:

  • Top-left: Fast and accurate (ideal but rare)
  • Top-right: Slow but accurate (batch processing)
  • Bottom-left: Fast but less accurate (real-time with compromise)
  • Bottom-right: Slow and inaccurate (avoid!)

3. Radar Chart: Model Profiles

Purpose: Holistic view of strengths/weaknesses

Reading tips:

  • Larger area = better overall (but check which dimensions!)
  • Look for spikes = strong specialization
  • Balanced polygon = well-rounded model

Advanced Topics

Fine-tuning for Your Domain

When to fine-tune:

  • MRR < 0.1 on your data
  • Your domain very different from general text
  • You have 1000+ labeled examples

Simple fine-tuning recipe:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# 0. Load the base model to fine-tune (e.g. the BGE-M3 checkpoint from Scenario 2)
model = SentenceTransformer('BAAI/bge-m3')

# 1. Prepare training data
train_examples = []
for query, positive, negative in your_data:
    train_examples.append(InputExample(texts=[query, positive, negative]))

# 2. Create DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3. Define loss
train_loss = losses.TripletLoss(model)

# 4. Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='fine-tuned-model'
)
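
To check whether the fine-tune actually helped, you could re-run the MRR/Recall@K evaluation above, or use sentence-transformers' built-in evaluator on a held-out split (the tiny dicts below are illustrative placeholders):

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Held-out split: query/document ids mapped to texts, plus relevance labels
queries  = {"q1": "What is an abscess?"}
corpus   = {"d1": "inflammation caused by pyogenic bacteria",
            "d2": "uncontrolled cells in lung tissue"}
relevant = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant)
print(evaluator(model))  # scores the model on MRR@K, Recall@K, NDCG@K, ...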

Hybrid Search: Combining Multiple Models

A sketch that reuses the MiniLM and BGE-M3 models, the FAISS index, and the documents list from the earlier steps (the names minilm and bge_m3 are assumptions):

from sentence_transformers import util

def hybrid_search(query, k=10, n_candidates=100):
    # Stage 1: the fast model plus the FAISS index retrieve a wide candidate set
    q_vec = minilm.encode([query])
    faiss.normalize_L2(q_vec)
    _, candidate_ids = index.search(q_vec, n_candidates)
    candidates = [documents[i] for i in candidate_ids[0]]
    # Stage 2: the slower, stronger model re-scores only those candidates
    scores = util.cos_sim(bge_m3.encode(query), bge_m3.encode(candidates))[0]
    best = scores.argsort(descending=True)[:k].tolist()
    return [candidates[i] for i in best]

Monitoring Model Performance

import mlflow

# Log metrics during evaluation (inside an MLflow run)
with mlflow.start_run():
    mlflow.log_metric("mrr", mrr_score)
    mlflow.log_metric("recall_at_1", recall_1)
    mlflow.log_artifact("performance_plot.png")

Troubleshooting

Common Issues

1. Out of Memory Error

Symptoms:

RuntimeError: CUDA out of memory

Solutions:

# Reduce batch size in encoding
model.encode(texts, batch_size=8)  # default is 32

# Or use CPU
model = SentenceTransformer('model-name', device='cpu')

2. FAISS Installation Issues

Windows:

# Use conda instead of pip
conda install -c conda-forge faiss-cpu

macOS (M1/M2):

conda install -c conda-forge faiss-cpu

3. Slow Encoding

Check GPU usage:

import torch
print(torch.cuda.is_available())  # Should be True
print(model.device)  # Should show a CUDA device, e.g. cuda:0

Force GPU:

model = SentenceTransformer('model-name', device='cuda')

4. Different Results on Each Run

Ensure reproducibility:

import random, numpy as np, torch

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
    torch.backends.cudnn.deterministic = True


Contributing

Contributions are welcome! Areas for improvement:

  • Add more Turkish embedding models
  • Test on other Turkish domains (legal, finance)
  • Implement cross-lingual evaluation
  • Add interactive dashboard
  • Benchmark on GPU vs CPU

How to contribute:

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/NewModel)
  3. Commit changes (git commit -m 'Add new model')
  4. Push to branch (git push origin feature/NewModel)
  5. Open Pull Request

💬 Have questions? Start a discussion!

🐛 Found a bug? Open an issue!
