Translation API Benchmark (FLORES + COMET)


A reproducible benchmark evaluating translation quality across 20 languages using the FLORES dataset and modern metrics, powered by the TranslatePlus API.

👉 Includes real-world evaluation of APIs like DeepL, Google Translate, and Azure.


Quick Start (Run in 30 seconds)

```bash
git clone https://github.com/translateplus/translate-api-benchmark.git
cd translate-api-benchmark

pip install -r requirements.txt
python benchmark.py
```

⚡ Try the API (copy-paste)

```python
import requests

url = "https://api.translateplus.io/v2/translate"

headers = {
    "X-API-KEY": "your_api_key",
    "Content-Type": "application/json"
}

payload = {
    "text": "Hello world",
    "source": "en",
    "target": "fr"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
```

👉 Response:

```json
{
  "translations": {
    "translation": "Bonjour le monde",
    "source": "en",
    "target": "fr"
  }
}
```
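Given the response shape above, pulling the translated string out of the parsed JSON is a one-liner:

```python
# Response body from the example above, as returned by response.json()
resp = {
    "translations": {
        "translation": "Bonjour le monde",
        "source": "en",
        "target": "fr",
    }
}

translated = resp["translations"]["translation"]
print(translated)  # Bonjour le monde
```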

📊 Key Results

  • COMET scores up to 0.92 (near human-level)
  • Strong performance across European, Asian, and global languages
  • Stable latency: ~0.4–0.48s

👉 Full dataset: https://huggingface.co/datasets/meetsohail/translateplus-flores-benchmark


Benchmark Visualizations

Charts included in the repository:

  • BLEU Scores
  • COMET Scores
  • Latency


Why COMET > BLEU

  • BLEU measures word overlap
  • COMET measures meaning

👉 BLEU is especially unreliable for languages without whitespace-delimited words, such as:

  • Japanese
  • Chinese
  • Korean

👉 COMET provides more realistic evaluation
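The word-overlap problem can be seen with a toy score (a deliberately simplified stand-in for BLEU, not the real metric): a paraphrase that preserves the meaning still scores near zero because the surface words differ.

```python
# Toy unigram-overlap score (NOT real BLEU) to illustrate why
# word-overlap metrics punish valid paraphrases.
def unigram_overlap(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum(1 for tok in hyp if tok in ref)
    return matches / len(hyp)

same_words = unigram_overlap("the cat sat on the mat", "the cat sat on the mat")
paraphrase = unigram_overlap("a kitty was sitting on a rug", "the cat sat on the mat")

print(same_words)  # 1.0
print(paraphrase)  # ~0.14, even though the meaning is close
```

A meaning-based metric like COMET scores both hypotheses similarly; this is the gap the benchmark's COMET numbers are meant to close.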


📁 Dataset

  • FLORES (Meta AI)
  • ~500–997 samples per language
  • 20 languages (English → target)

Structure:

```
data/results_eng_Latn_fra_Latn.csv
data/results_eng_Latn_deu_Latn.csv
...
```

Each file contains the columns:

```
source, reference, hypothesis, latency
```
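A result file in this format can be read with the standard library alone; the two sample rows below are made up for illustration, but follow the column layout above.

```python
import csv
import io

# Hypothetical two-row sample in the results_*.csv format described above
sample = io.StringIO(
    "source,reference,hypothesis,latency\n"
    "Hello world,Bonjour le monde,Bonjour le monde,0.42\n"
    "Good morning,Bonjour,Bonjour,0.45\n"
)

rows = list(csv.DictReader(sample))
mean_latency = sum(float(r["latency"]) for r in rows) / len(rows)
print(round(mean_latency, 3))  # 0.435
```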

⚙️ Benchmark Pipeline

1. Load dataset

```python
from datasets import load_dataset

dataset = load_dataset("facebook/flores", "eng_Latn")
```

2. Translate

```python
def translate(text, target):
    # Placeholder: plug in any API (DeepL, Google, TranslatePlus, etc.)
    # and return the translated string for `text`.
    raise NotImplementedError
```
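A minimal concrete version of this step, wired to the TranslatePlus endpoint shown earlier. The `post` parameter is an illustrative addition so the function can be exercised without a network call; `"your_api_key"` is a placeholder as in the example above.

```python
def translate(text, target, source="en", post=None):
    """Translate `text` via the TranslatePlus v2 endpoint (sketch)."""
    if post is None:
        import requests  # real HTTP path; inject `post` to stub it out
        post = requests.post
    resp = post(
        "https://api.translateplus.io/v2/translate",
        json={"text": text, "source": source, "target": target},
        headers={"X-API-KEY": "your_api_key", "Content-Type": "application/json"},
    )
    return resp.json()["translations"]["translation"]
```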

3. Evaluate (COMET)

```python
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each item needs the source ("src"), machine translation ("mt"),
# and reference ("ref") segments.
data = [{"src": source, "mt": hypothesis, "ref": reference}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```

📈 Example Results

| Language   | BLEU | COMET |
|------------|------|-------|
| French     | 50.0 | 0.89  |
| German     | 40.4 | 0.89  |
| Portuguese | 48.3 | 0.90  |
| Japanese   | 1.8  | 0.92  |

👉 BLEU is unreliable for some languages


🔧 Requirements

```
datasets
sacrebleu
unbabel-comet
pandas
requests
```

💡 Use Cases

  • Compare translation APIs
  • Evaluate multilingual systems
  • Build translation pipelines
  • Research in machine translation

🤝 Contributing

PRs welcome!

Ideas:

  • add more languages
  • add new APIs
  • improve evaluation

⭐ Support

If this helped you:

  • Star the repo
  • Share with others
  • Contribute improvements


📜 License

MIT License
