Lightweight LLM evaluation tool for OpenAI-compatible endpoints (Chat Completions / Responses), with a CLI and a local Web UI. Supports MMLU, GSM8K, HumanEval, TruthfulQA, and HellaSwag; bring your own BASE_URL and MODEL_NAME.
python model-evaluation fastapi humaneval llm-evaluation openai-compatible mbpp llm-benchmark safety-benchmarks
Updated Jun 6, 2026 - Python