Which embedding model works best for Turkish medical texts? I tested 3 popular models with the MedTurkQuaD dataset. The fastest model isn't always the best; here's the proof.
I compared 3 popular embedding models (Multi-MiniLM, BGE-M3, all-mpnet) using a Turkish medical Q&A dataset. The results are surprising:
- BGE-M3: Best retrieval (MRR: 0.0338) but slowest (50.59s)
- Multi-MiniLM: Fastest (15.81s) and champion in Turkish morphology (0.9284)
- all-mpnet: Great for English but fails in Turkish (MRR: 0.0084)
Key takeaway: A "multilingual" label isn't enough. Domain-specific testing is essential!
Last month, I was developing a medical Q&A system. I tried the most popular embedding models on HuggingFace. The results... were disastrous.
For the question "What is an abscess?", the system returned "lung cancer" as the answer. I switched models, got slightly better results, but still not satisfactory.
That's when I realized: Benchmark tables are valid for English. There was no data for the Turkish + Medical combination.
In this article, I'll show you which model actually works through a systematic comparison.
"Let me pick the most popular model" → Popularity ≠ Suitable for your use case
"It says multilingual, supports Turkish" → In theory yes, in practice sometimes no
"Ranked #1 on benchmarks" → In which language? Which domain?
"Bigger model is better" → Slower, more expensive, not always better
Same dataset → Fair comparison
Same metrics → Objective evaluation
Reproducible code → You can try it yourself
Turkish + Domain-specific → Real-world scenario
| Model | Dimensions | Features | Expectation |
|---|---|---|---|
| Multi-MiniLM-L12-v2 | 384 | Lightweight, multilingual | Fast but sufficient? |
| BGE-M3 | 1024 | Next-gen, powerful | Best but how slow? |
| all-mpnet-base-v2 | 768 | English SOTA | What about Turkish? |
What? Turkish medical Q&A dataset
Why difficult? Two-layered challenge:
- Turkish morphology (suffixes, inflections)
- Medical terminology (domain-specific)
Example Challenge:
Question: "An abscess is usually a type of inflammation caused by what?"
Correct: "pyogenic bacteria"
Misleading Negative: "uncontrolled cells in lung tissue..."
→ Both answers contain medical terms!
→ Model must capture subtle differences
import random
import numpy as np
import torch

# Same results on every run
device = "cuda" if torch.cuda.is_available() else "cpu"
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

Why 42? The answer to life, the universe, and everything (and the AI community's standard seed)
def process_qa_data(qa_data):
    all_queries, all_positives, all_negatives = [], [], []
    # Questions and correct answers
    for doc in qa_data.get('data', []):
        for paragraph in doc.get('paragraphs', []):
            for qa_pair in paragraph.get('qas', []):
                all_queries.append(qa_pair['question'])
                all_positives.append(qa_pair['answers'][0]['text'])
    # Random negative for each positive
    num_pairs = len(all_positives)
    for i in range(num_pairs):
        idx = i
        while idx == i:  # Don't pick the same answer
            idx = random.choice(range(num_pairs))
        all_negatives.append(all_positives[idx])
    return all_queries, all_positives, all_negatives

Why this method?
- In the real world, correct answers get lost among wrong ones
- Tests the model's discrimination ability
- Classic benchmark method for retrieval systems
for model_name, model in models_to_test.items():
    start_time = time.time()
    # Encode
    query_vectors = model.encode(queries, convert_to_numpy=True, show_progress_bar=True)
    doc_vectors = model.encode(documents, convert_to_numpy=True, show_progress_bar=True)
    duration = time.time() - start_time
    print(f" {model_name}: {duration:.2f} seconds")

Output:
Multi-MiniLM-L12-v2: 15.81 seconds
BGE-M3: 50.59 seconds
all-mpnet-base-v2: 25.00 seconds
Critical Detail: L2 Normalization
dim = query_vectors.shape[1]
index = faiss.IndexFlatIP(dim) # Inner Product Index
# Normalization = Cosine Similarity
faiss.normalize_L2(doc_vectors)
faiss.normalize_L2(query_vectors)
index.add(doc_vectors)
D, I = index.search(query_vectors, k=len(documents))

Why normalize?
| Case | Formula | What it measures |
|---|---|---|
| No normalization | IP(A,B) = ‖A‖ × ‖B‖ × cos(θ) | Magnitude + angle |
| With normalization | IP(A,B) = cos(θ) | Angle only (semantic) |
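To make the table concrete, here is a minimal pure-Python sketch (no FAISS needed) of why normalization matters. The vectors and helper names (`dot`, `l2_normalize`, `cosine`) are made up for illustration; the point is that a long vector pointing the wrong way can win on raw inner product but lose once everything is scaled to unit length.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length (the per-row effect of faiss.normalize_L2)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (illustrative only)
query = [1.0, 2.0, 2.0]
aligned_doc = [2.0, 4.0, 4.0]   # same direction as the query, small magnitude
long_doc = [30.0, 0.0, 0.0]     # different direction, large magnitude

# Raw inner product: magnitude dominates, so the long vector wins
raw_aligned = dot(query, aligned_doc)
raw_long = dot(query, long_doc)

# Inner product after L2 normalization: equals cosine similarity
norm_aligned = dot(l2_normalize(query), l2_normalize(aligned_doc))
norm_long = dot(l2_normalize(query), l2_normalize(long_doc))

print(f"raw IP:        aligned={raw_aligned:.2f}, long={raw_long:.2f}")
print(f"normalized IP: aligned={norm_aligned:.4f}, long={norm_long:.4f}")
```

With these toy values the raw inner product ranks `long_doc` first, while the normalized version ranks `aligned_doc` first, which is the behavior you want for semantic search.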
What does it measure? On average, what rank is the correct answer?
def compute_mrr(search_results, true_indices):
    rr_sum = 0
    for i in range(len(true_indices)):
        ranks = np.where(search_results[i] == true_indices[i])[0]
        if len(ranks) > 0:
            rr_sum += 1 / (ranks[0] + 1)
    return rr_sum / len(true_indices)

Interpretation:
- MRR = 1.0 → Correct answer at rank 1 for every question (perfect!)
- MRR = 0.5 → Equivalent to the correct answer sitting at rank 2 for every question
- MRR = 0.033 → Equivalent to roughly rank 30 (low)
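A tiny worked example may help here. The sketch below is a plain-Python restatement of the same idea on three made-up queries (the function name `mean_reciprocal_rank` and the toy IDs are illustrative, not part of the benchmark code). It also shows why "average rank" is only a rule of thumb: MRR rewards top ranks more heavily than a plain mean of ranks would.

```python
def mean_reciprocal_rank(ranked_lists, true_ids):
    """Average of 1/rank of the correct document; a miss contributes 0."""
    total = 0.0
    for ranking, true_id in zip(ranked_lists, true_ids):
        if true_id in ranking:
            total += 1.0 / (ranking.index(true_id) + 1)  # ranks start at 1
    return total / len(true_ids)

# Toy search results: each inner list is the ranked document IDs for one query
rankings = [
    [7, 2, 5],   # correct doc 7 is at rank 1 -> 1/1
    [4, 0, 9],   # correct doc 0 is at rank 2 -> 1/2
    [3, 8, 1],   # correct doc 1 is at rank 3 -> 1/3
]
true_ids = [7, 0, 1]

mrr = mean_reciprocal_rank(rankings, true_ids)
mean_rank = (1 + 2 + 3) / 3

print(f"MRR = {mrr:.4f}, mean rank = {mean_rank:.1f}")
```

Here the mean rank is 2, yet MRR is (1 + 1/2 + 1/3) / 3 ≈ 0.61 rather than 0.5, because the rank-1 hit counts for more than the rank-3 hit costs.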
What does it measure? Is the correct answer in the top K results?
| Metric | Description |
|---|---|
| Recall@1 | Is the first result correct? (strictest test) |
| Recall@3 | Is it in the top 3? |
| Recall@10 | Is it in the top 10? |
Why important?
- Recall@1 → If you're showing only one result to the user
- Recall@10 → If you're showing a list
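The article's own Recall@K code isn't shown in this section, so the following is only a minimal sketch of the metric, assuming one correct document per query (as in this benchmark). The function name `recall_at_k` and the toy rankings are illustrative.

```python
def recall_at_k(ranked_lists, true_ids, k):
    """Fraction of queries whose correct document appears in the top k results."""
    hits = sum(
        1 for ranking, true_id in zip(ranked_lists, true_ids)
        if true_id in ranking[:k]
    )
    return hits / len(true_ids)

# Toy search results for four queries (ranked document IDs)
rankings = [
    [7, 2, 5, 1],   # correct doc 7 at rank 1
    [4, 0, 9, 3],   # correct doc 0 at rank 2
    [3, 8, 1, 6],   # correct doc 1 at rank 3
    [5, 6, 7, 8],   # correct doc 2 never retrieved
]
true_ids = [7, 0, 1, 2]

for k in (1, 3, 10):
    print(f"Recall@{k} = {recall_at_k(rankings, true_ids, k):.2f}")
```

Recall@K can only stay flat or grow as K increases, which is why Recall@1 is the strictest of the three.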
What does it measure? Sensitivity to Turkish suffixes
Test pairs:
morph_pairs = [
    ("geliyorum", "gelmekteyim"),  # I'm coming (different forms)
    ("gidecek", "gider"),          # Will go / goes
    ("yaptım", "yapıyorum"),       # I did / I'm doing
    ("okuyor", "okumakta"),        # Reading (different forms)
    ("koşacağım", "koşarım"),      # I will run / I run
    ("araba", "arabalar"),         # Car / cars
    ("evdeyim", "evde olmak")      # I'm at home (different forms)
]

Calculation:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity for each pair
similarities = []
for pair in morph_pairs:
    vec1 = model.encode(pair[0])
    vec2 = model.encode(pair[1])
    sim = cosine_similarity([vec1], [vec2])[0][0]
    similarities.append(sim)
morph_score = np.mean(similarities)

Interpretation:
- Score > 0.9 → Excellent Turkish understanding
- Score 0.7-0.9 → Good
- Score < 0.7 → Weak (treats each suffix as a different word)
What does it measure? How organized is the embedding space?
kmeans = KMeans(n_clusters=2, random_state=42, n_init='auto')
labels = kmeans.fit_predict(doc_vectors)
sil_score = silhouette_score(doc_vectors, labels)

Interpretation:
- Close to +1 → Clusters are well separated
- Close to 0 → Clusters overlap
- Close to -1 → Incorrectly clustered
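To see what these ranges look like in practice, here is a small pure-Python sketch of the mean silhouette coefficient with Euclidean distance. It is a simplified stand-in for scikit-learn's `silhouette_score`, written for illustration only; the points, label lists, and the `silhouette` helper are made up.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean distance, at least 2 clusters)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other points in the same cluster
        own = [
            math.dist(p, q)
            for j, (q, lab) in enumerate(zip(points, labels))
            if lab == label and j != i
        ]
        if not own:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        a = sum(own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight groups far apart (toy data)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]   # clusters match the geometry
bad_labels = [0, 1, 0, 1]    # clusters cut across the geometry

good_score = silhouette(points, good_labels)
bad_score = silhouette(points, bad_labels)
print(f"good clustering: {good_score:.3f}, bad clustering: {bad_score:.3f}")
```

The first labeling scores about 0.90 (well separated) and the second about -0.45 (points sit closer to the other cluster than to their own). Note that the score only describes geometry; it says nothing about whether the clusters are useful for retrieval.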
| Model | Dim | Time (s) | Silhouette | Morph Score | MRR | Recall@1 | Recall@3 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|---|---|---|---|
| BGE-M3 | 1024 | 50.59 | 0.0366 | 0.8113 | 0.0338 | 1.12% | 3.24% | 4.91% | 7.66% |
| Multi-MiniLM-L12-v2 | 384 | 15.81 | 0.0758 | 0.9284 | 0.0200 | 0.70% | 1.93% | 2.72% | 4.34% |
| all-mpnet-base-v2 | 768 | 25.00 | 0.1185 | 0.7460 | 0.0084 | 0.30% | 0.78% | 1.29% | 1.85% |
What we see:
- MRR chart: All bars are short (low values) → Domain is very challenging
- Recall@1 chart: BGE-M3 clearly ahead but still low
- Morph Score chart: Multi-MiniLM is the champion
- Silhouette chart: all-mpnet first but this is misleading
Analysis:
- Top left = Ideal zone (fast + quality)
- BGE-M3: Top right (slow but quality)
- Multi-MiniLM: Bottom left (fast but medium MRR)
- all-mpnet: Lost in the middle (neither fast nor quality)
Decision guide:
- Real-time system → Multi-MiniLM
- Offline batch → BGE-M3
Character analysis:
BGE-M3: "Slow but Effective"
- High MRR, low speed
- Ideal for batch processing in large projects
Multi-MiniLM: "Fast and Turkish-Specialized"
- High speed and morph score
- Perfect for real-time applications
all-mpnet: "Organized but Wrong"
- Only good silhouette
- Don't use for Turkish
Expectation: MRR > 0.5 (correct answer in top 2)
Reality: MRR = 0.008-0.033 (correct answer at rank 30-120)
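The rank range quoted above can be reproduced with a quick back-of-the-envelope calculation: taking 1 / MRR gives an "equivalent rank", meaning the rank the correct answer would have if it sat at the same position for every query. This is a heuristic, not the true mean rank. The MRR values are copied from the results table; the variable names are illustrative.

```python
# MRR values from the results table
mrr_by_model = {
    "BGE-M3": 0.0338,
    "Multi-MiniLM-L12-v2": 0.0200,
    "all-mpnet-base-v2": 0.0084,
}

# Equivalent rank = 1 / MRR (heuristic, assumes the same rank for every query)
equivalent_rank = {name: 1.0 / mrr for name, mrr in mrr_by_model.items()}

for name, rank in equivalent_rank.items():
    print(f"{name}: MRR {mrr_by_model[name]:.4f} -> roughly rank {rank:.0f}")
```

This gives roughly rank 30 for BGE-M3, 50 for Multi-MiniLM, and 119 for all-mpnet, which matches the "rank 30-120" range.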
3 Reasons:
- Domain Gap
  - Models trained on Wikipedia, books, news
  - Medical terminology is less than 1% of training data
  - Terms like "pyogenic bacteria" rarely seen
- Negative Sampling Difficulty
  - Randomly selected "wrong" answers are actually related
  - Both contain medical terms → Model confuses them
  - Very similar to real-world scenario (good test!)
- Lack of Fine-tuning
  - General-purpose models weak in specific domains
  - 5-10x improvement expected with fine-tuning
**Practical lesson:** Don't panic if you see MRR < 0.1; that is normal for domain-specific datasets. Fine-tuning is essential!
| Model | Morph Score | MRR | Relationship |
|---|---|---|---|
| Multi-MiniLM | 0.9284 (1st) | 0.0200 (2nd) | Inverse correlation! |
| BGE-M3 | 0.8113 (2nd) | 0.0338 (1st) | |
Why?
Required for morphology:
- Surface-level similarity ("geliyorum" ≈ "gelmekteyim")
- Grammar rules
- Syntax patterns
Required for retrieval:
- Deep semantic understanding
- Context awareness
- Domain knowledge
Analogy:
Morphology = Recognizing word forms
Retrieval = Understanding word meanings
all-mpnet-base-v2 report card:
- MRR: 0.0084 (last place)
- Morph: 0.7460 (last place)
- Recall@1: 0.30% (last place)
- Silhouette: 0.1185 (1st place)
Why high silhouette but low others?
Silhouette measures "organization", not "correctness". The model organized vectors nicely but organized them wrongly.
Analogy:
You organized books by color (well organized)
But people searching by topic can't find them (wrongly organized)
Lesson: Don't trust a single metric!
| Model | Time | vs Multi-MiniLM |
|---|---|---|
| Multi-MiniLM | 15.81s | 1.0x (baseline) |
| all-mpnet | 25.00s | 1.6x slower |
| BGE-M3 | 50.59s | 3.2x slower |
Real-world impact:
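Repeating a benchmark-sized encoding job 1000 times (each timing above covers encoding the whole dataset, not a single query):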
- Multi-MiniLM: ~4.4 hours
- all-mpnet: ~7 hours
- BGE-M3: ~14 hours
In real-time systems:
- 50ms vs 160ms per user makes a difference
- 100 concurrent users = server struggles
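The slowdown factors and hour figures above follow directly from the measured encoding times. Here is a short sketch of that arithmetic, assuming each timing is one full encoding pass over the benchmark and that cost scales linearly with the number of passes (a simplification; the variable names are illustrative).

```python
# Measured encoding time for one full pass over the benchmark (seconds)
encode_seconds = {
    "Multi-MiniLM-L12-v2": 15.81,
    "all-mpnet-base-v2": 25.00,
    "BGE-M3": 50.59,
}

baseline = encode_seconds["Multi-MiniLM-L12-v2"]

# Slowdown relative to the fastest model
slowdown = {name: t / baseline for name, t in encode_seconds.items()}

# Hours needed for 1000 passes of the same size, assuming linear scaling
hours_for_1000 = {name: t * 1000 / 3600 for name, t in encode_seconds.items()}

for name in encode_seconds:
    print(f"{name}: {slowdown[name]:.1f}x, ~{hours_for_1000[name]:.1f} hours for 1000 passes")
```

Real throughput depends on batch size, sequence length, and hardware, so treat these as order-of-magnitude estimates rather than predictions.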
Requirements:
- Speed critical (users won't wait)
- Turkish morphology important (users write differently)
- Sufficient accuracy (doesn't need to be perfect)
Choice: Multi-MiniLM-L12-v2
Why:
- 3.2x faster (vs BGE-M3)
- Morphology champion (0.9284)
- Sufficient MRR (0.0200)
- Small vectors = low RAM
Example implementation:
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Encode all KB answers (offline)
kb_answers = ["answer1", "answer2"]  # ... plus the rest of your knowledge base
answer_vectors = model.encode(kb_answers, convert_to_numpy=True)

# Create FAISS index (384 = embedding dimension of this model)
index = faiss.IndexFlatIP(384)
faiss.normalize_L2(answer_vectors)
index.add(answer_vectors)

# When user question arrives (online)
def get_answer(user_question):
    q_vec = model.encode([user_question], convert_to_numpy=True)
    faiss.normalize_L2(q_vec)
    D, I = index.search(q_vec, k=3)
    # FAISS pads results with -1 when fewer than k vectors are indexed
    return [kb_answers[i] for i in I[0] if i != -1]

Requirements:
- Quality critical (wrong result = critical error)
- Speed secondary (batch processing)
- Very specific domain
Choice: BGE-M3 + Fine-tuning
Why:
- Best MRR (0.0338)
- Large model = more capacity
- Speed irrelevant in batch processing
Fine-tuning example:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load model
model = SentenceTransformer('BAAI/bge-m3')

# Prepare medical Q&A pairs
train_examples = [
    InputExample(texts=['What is an abscess?', 'inflammation caused by pyogenic bacteria']),
    InputExample(texts=['High blood pressure...', 'hypertension...']),
    # ... at least 1000 examples
]

# Create DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Train with contrastive loss
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=100
)

# Save
model.save('bge-m3-medical-turkish')

Requirements:
- Turkish variations (tişört/tshirt, çorap/sock)
- Medium speed
- Lots of products
Choice: Multi-MiniLM-L12-v2
Why:
- Morphology champion (users write differently)
- Fast
- Small vectors = millions of products can be indexed
Requirements:
- Cross-lingual search
- Single model for multiple languages
Choice: BGE-M3
Why:
- 100+ language support
- Good cross-lingual alignment
- Single embedding space
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install packages
pip install sentence-transformers faiss-cpu scikit-learn pandas torch matplotlib seaborn
# If you have GPU
pip install faiss-gpu # instead of faiss-cpu

# 1. Clone the code
git clone [repository-url]
cd embedding-comparison
# 2. Prepare data.json (or use fallback sample data)
# 3. Run
python compare_embedding_v2.py
# 4. Results
# Table in terminal
# 3 visualizations saved as PNG

MedTurkQuaD JSON structure:
{
  "data": [
    {
      "title": "Medical Topic",
      "paragraphs": [
        {
          "context": "Medical text context...",
          "qas": [
            {
              "question": "What causes abscess?",
              "answers": [
                {
                  "text": "pyogenic bacteria",
                  "answer_start": 42
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

models_to_test = {
    'Multi-MiniLM-L12-v2': SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'),
    'BGE-M3': SentenceTransformer('BAAI/bge-m3'),
    'all-mpnet-base-v2': SentenceTransformer('sentence-transformers/all-mpnet-base-v2'),
    'YOUR-MODEL': SentenceTransformer('your-model-name')  # Add here
}

# Add more challenging pairs
morph_pairs = [
    ("geliyorum", "gelmekteyim"),
    ("custom_word1", "custom_word2"),  # Add your own
]

# Adjust recall@k values
recall_at = [1, 3, 5, 10, 20]  # Add @20 if needed

Purpose: Compare all models across 4 key metrics
How to read:
- Taller bars = better (except Silhouette, see below)
- Look for consistent patterns across metrics
- Single high bar doesn't mean best overall
Purpose: Trade-off analysis
Quadrants:
- Top-left: Fast and accurate (ideal but rare)
- Top-right: Slow but accurate (batch processing)
- Bottom-left: Fast but less accurate (real-time with compromise)
- Bottom-right: Slow and inaccurate (avoid!)
Purpose: Holistic view of strengths/weaknesses
Reading tips:
- Larger area = better overall (but check which dimensions!)
- Look for spikes = strong specialization
- Balanced polygon = well-rounded model
When to fine-tune:
- MRR < 0.1 on your data
- Your domain very different from general text
- You have 1000+ labeled examples
Simple fine-tuning recipe:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# 0. Load the base model to fine-tune (any base model works; BGE-M3 is used here as in the earlier example)
model = SentenceTransformer('BAAI/bge-m3')

# 1. Prepare training data (your_data: iterable of (query, positive, negative) text triples)
train_examples = []
for query, positive, negative in your_data:
    train_examples.append(InputExample(texts=[query, positive, negative]))

# 2. Create DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3. Define loss
train_loss = losses.TripletLoss(model)

# 4. Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='fine-tuned-model'
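)

from sentence_transformers import util

def hybrid_search(query, docs, fast_model, strong_model, fast_doc_vecs, k=100):
    # Fast model for initial filtering; fast_doc_vecs = fast_model.encode(docs, convert_to_tensor=True)
    q_fast = fast_model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_fast, fast_doc_vecs, top_k=k)[0]
    candidates = [docs[hit['corpus_id']] for hit in hits]
    # Slow model for re-ranking top results
    q_vec = strong_model.encode(query, convert_to_tensor=True)
    cand_vecs = strong_model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(q_vec, cand_vecs)[0]
    order = scores.argsort(descending=True)[:10]
    return [candidates[int(i)] for i in order]

import mlflow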
# Log metrics during evaluation
mlflow.log_metric("mrr", mrr_score)
mlflow.log_metric("recall_at_1", recall_1)
mlflow.log_artifact("performance_plot.png")

Symptoms:
RuntimeError: CUDA out of memory
Solutions:
# Reduce batch size in encoding
model.encode(texts, batch_size=8) # default is 32
# Or use CPU
model = SentenceTransformer('model-name', device='cpu')

Windows:
# Use conda instead of pip
conda install -c conda-forge faiss-cpu

macOS (M1/M2):
conda install -c conda-forge faiss-cpu

Check GPU usage:
import torch
print(torch.cuda.is_available()) # Should be True
print(model.device) # Should start with 'cuda' (e.g. cuda:0)

Force GPU:
model = SentenceTransformer('model-name', device='cuda')

Ensure reproducibility:
import random, numpy as np, torch
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
    torch.backends.cudnn.deterministic = True

- MTEB: Massive Text Embedding Benchmark
- BGE-M3: BGE M3-Embedding
- Sentence-BERT: Sentence Embeddings using Siamese Networks
- MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard
- Sentence Transformers: https://github.com/UKPLab/sentence-transformers
Contributions are welcome! Areas for improvement:
- Add more Turkish embedding models
- Test on other Turkish domains (legal, finance)
- Implement cross-lingual evaluation
- Add interactive dashboard
- Benchmark on GPU vs CPU
How to contribute:
- Fork the repository
- Create feature branch (git checkout -b feature/NewModel)
- Commit changes (git commit -m 'Add new model')
- Push to branch (git push origin feature/NewModel)
- Open Pull Request
Have questions? Start a discussion!
Found a bug? Open an issue!


