A reproducible benchmark evaluating translation quality across 20 languages using the FLORES dataset and modern metrics, powered by the TranslatePlus API.
👉 Includes real-world evaluation of APIs like DeepL, Google Translate, and Azure.
```bash
git clone https://github.com/translateplus/translate-api-benchmark.git
cd translate-api-benchmark
pip install -r requirements.txt
python benchmark.py
```

```python
import requests

url = "https://api.translateplus.io/v2/translate"
headers = {
    "X-API-KEY": "your_api_key",
    "Content-Type": "application/json"
}
payload = {
    "text": "Hello world",
    "source": "en",
    "target": "fr"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
```

👉 Response:

```json
{
  "translations": {
    "translation": "Bonjour le monde",
    "source": "en",
    "target": "fr"
  }
}
```

- COMET scores up to 0.92 (near human-level)
- Strong performance across European, Asian, and other world languages
- Stable latency: ~0.4–0.48s
👉 Full dataset: https://huggingface.co/datasets/meetsohail/translateplus-flores-benchmark
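The latency figures above can be recomputed from the per-language results CSVs with nothing but the standard library. A minimal sketch, assuming the `source,reference,hypothesis,latency` column layout used in this repo (the two sample rows below are made up for illustration):

```python
import csv
import io
import statistics

# Made-up sample rows in the repo's results CSV layout; in practice you would
# open one of the data/results_*.csv files instead of this in-memory string.
sample_csv = """source,reference,hypothesis,latency
Hello world,Bonjour le monde,Bonjour le monde,0.41
Good morning,Bonjour,Bonjour,0.45
"""

reader = csv.DictReader(io.StringIO(sample_csv))
latencies = [float(row["latency"]) for row in reader]
print(f"mean latency: {statistics.mean(latencies):.2f}s")  # mean latency: 0.43s
```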
- BLEU measures surface-level word (n-gram) overlap with the reference
- COMET is a neural metric that scores how well the meaning is preserved

👉 BLEU breaks down for languages written without spaces between words:
- Japanese
- Chinese
- Korean

👉 COMET provides a more realistic evaluation
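To see why whitespace tokenization hurts BLEU-style metrics, here is a deliberately simplified sketch (unigram precision only, not the full BLEU formula): with default whitespace splitting, an unsegmented Japanese sentence becomes a single "token", so even a near-identical translation scores zero:

```python
# Simplified illustration: unigram precision over whitespace tokens.
# This is NOT full BLEU (no n-grams, no brevity penalty) -- just enough
# to show the tokenization failure mode.
def unigram_precision(hypothesis: str, reference: str) -> float:
    hyp_tokens = hypothesis.split()
    ref_tokens = set(reference.split())
    if not hyp_tokens:
        return 0.0
    matches = sum(1 for tok in hyp_tokens if tok in ref_tokens)
    return matches / len(hyp_tokens)

# English: spaces separate words, so overlap is visible
print(unigram_precision("hello world", "hello world"))  # 1.0

# Japanese: no spaces, so the whole sentence is one "token"; a translation
# differing only by a comma scores 0.0
print(unigram_precision("こんにちは世界。", "こんにちは、世界。"))  # 0.0
```

Production BLEU implementations such as sacrebleu mitigate this with language-specific tokenizers, but scores for these languages remain hard to compare across systems.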
- FLORES (Meta AI)
- ~500–997 samples per language
- 20 languages (English → target)
Structure:

```
data/results_eng_Latn_fra_Latn.csv
data/results_eng_Latn_deu_Latn.csv
...
```

Each file contains the columns:

```
source, reference, hypothesis, latency
```

```python
from datasets import load_dataset

# Load the English side of FLORES as the source sentences
dataset = load_dataset("facebook/flores", "eng_Latn")
```

```python
def translate(text, target):
    # plug in any API (DeepL, Google, etc.)
    return translated_text
```

```python
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)
# data: a list of dicts with "src", "mt", and "ref" keys
model.predict(data)
```

| Language | BLEU | COMET |
|---|---|---|
| French | 50.0 | 0.89 |
| German | 40.4 | 0.89 |
| Portuguese | 48.3 | 0.90 |
| Japanese | 1.8 | 0.92 |
👉 BLEU is unreliable for some languages
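Putting the pieces above together, the per-language results CSVs can be produced with a loop like the following sketch. `translate()` here is a dummy placeholder for whatever API is under test, and the single sample row is made up:

```python
import csv
import io
import time

# Dummy stand-in for a real API call (DeepL, Google, TranslatePlus, ...);
# a real run would issue an HTTP request here.
def translate(text: str, target: str) -> str:
    return "Bonjour le monde" if text == "Hello world" else text

# Made-up sample; a real run would iterate over the FLORES sentences.
samples = [{"source": "Hello world", "reference": "Bonjour le monde"}]

buffer = io.StringIO()  # swap for open("data/results_....csv", "w") on disk
writer = csv.writer(buffer)
writer.writerow(["source", "reference", "hypothesis", "latency"])
for sample in samples:
    start = time.perf_counter()
    hypothesis = translate(sample["source"], "fr")
    latency = time.perf_counter() - start  # wall-clock seconds per request
    writer.writerow([sample["source"], sample["reference"], hypothesis, round(latency, 4)])

print(buffer.getvalue())
```

Timing each call individually is what makes the per-language latency columns comparable across APIs.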
```
datasets
sacrebleu
unbabel-comet
pandas
requests
```

- Compare translation APIs
- Evaluate multilingual systems
- Build translation pipelines
- Research in machine translation
- 📊 Dataset: https://huggingface.co/datasets/meetsohail/translateplus-flores-benchmark
- 📝 Blog: https://translateplus.io/blog/translation-api-benchmark
- 🌐 API: https://translateplus.io
PRs welcome!
Ideas:
- add more languages
- add new APIs
- improve evaluation
If this helped you:
👉 Star the repo
👉 Share with others
👉 Contribute improvements
MIT License


