add: Evaluating Large Language Models with lm-evaluation-harness #113
Conversation
Walkthrough

Adds two new documentation artifacts: a comprehensive Markdown guide on evaluating LLMs with lm-evaluation-harness and a Jupyter quick-start notebook demonstrating installation, task listing, evaluation workflows (basic to advanced), and examples.

Sequence Diagram(s): omitted — changes are documentation-only and do not introduce new multi-component control flow.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 3 passed checks (3 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents

In `@docs/en/solutions/How_to_Evaluate_LLM.md`:

- Around lines 531-532: The Markdown link points to a dead target `../public/lm-eval/lm-eval_quick_star.ipynb` (typo, missing 't'). Update the link target to the correct notebook location under docs/public (e.g., the correct relative path to `docs/public/lm-eval/lm-eval_quick_start.ipynb`) so the "lm-eval Quick Start Notebook" link resolves and the build succeeds.
- Around lines 401-407: The Math & Reasoning Suite example uses the wrong lm-evaluation-harness task name. Replace the task token `math` in the lm-eval command's `--tasks` argument with the official name `minerva_math`, i.e., change `gsm8k,math,arc_challenge` to `gsm8k,minerva_math,arc_challenge` in the example invocation (the line with `--tasks`).
🧹 Nitpick comments (1)
docs/public/lm-eval/lm-eval_quick_star.ipynb (1)
178-182: Avoid hardcoding API keys in the notebook.

Direct assignment encourages storing secrets in notebooks. Prefer a prompt or external environment variable to reduce leak risk.

🔐 Proposed safer pattern

```diff
- "import os\n",
- "\n",
- "# Set your OpenAI API key\n",
- "os.environ['OPENAI_API_KEY'] = 'your-api-key-here'\n",
+ "import os\n",
+ "import getpass\n",
+ "\n",
+ "# Set your OpenAI API key (avoid committing real keys)\n",
+ "if not os.getenv('OPENAI_API_KEY'):\n",
+ "    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API key: ')\n",
```

Please verify this still aligns with the recommended authentication flow in the OpenAI API docs.
| **Math & Reasoning Suite**: | ||
|
|
||
| ```bash | ||
| lm-eval --model local-chat-completions \ | ||
| --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ | ||
| --tasks gsm8k,math,arc_challenge \ | ||
| --output_path ./results |
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's check if the file exists and read the context around lines 401-407
fd "How_to_Evaluate_LLM.md" --type f
```

Repository: alauda/knowledge

Length of output: 100
🏁 Script executed:

```shell
# Read the file around lines 401-407 and context to see the table
cat -n docs/en/solutions/How_to_Evaluate_LLM.md | sed -n '350,450p'
```

Repository: alauda/knowledge

Length of output: 3131
🏁 Script executed:

```shell
# Search for the table mentioned in the review that lists "minerva_math"
rg "minerva_math" docs/en/solutions/How_to_Evaluate_LLM.md -B 5 -A 5
```

Repository: alauda/knowledge

Length of output: 1715
🌐 Web query:
lm-eval minerva_math vs math task name documentation
💡 Result:
Short answer: "minerva_math" is the lm-evaluation-harness task group implementing the Minerva / Minerva‑Math dataset (with several subgroups like minerva_math_algebra, etc.); "math" is a more generic/math-category task name used by some wrappers/distributions (or by platform catalogs) that can refer to one or more math tasks in the harness. To see exactly which dataset a given name runs, list tasks from your installed harness (python -m lm_eval --tasks list or lm-eval --tasks list). Sources: EleutherAI lm-evaluation-harness repo/docs and issue referencing minerva_math, and a packaging/registry listing showing a generic "math" task name. [1][2][3]
Sources
- EleutherAI / lm-evaluation-harness (repo/documentation). [1]
- Issue demonstrating use of the minerva_math task and its subgroups. [2]
- Example packaging / product docs showing a generic "math" task label in tool catalogs (lists "math" as available). [3]
🌐 Web query:
lm-evaluation-harness official task names list math minerva_math
💡 Result:
Official lm-evaluation-harness task names for Minerva's MATH group:
- minerva_math
- minerva_math_algebra
- minerva_math_counting_and_prob
- minerva_math_geometry
- minerva_math_intermediate_algebra
- minerva_math_num_theory
- minerva_math_prealgebra
- minerva_math_precalc
Sources: GitHub issue showing these subtask names and example outputs [1]; lm-evaluation-harness repository/task listing (repo/releases/docs) [2].
🏁 Script executed:

```shell
# Check if there are any other references to the "math" task (without minerva_ prefix) in the documentation
rg "tasks.*math" docs/en/solutions/How_to_Evaluate_LLM.md -i
```

Repository: alauda/knowledge

Length of output: 265
Update the Math & Reasoning Suite example to use the correct task name.
The table lists the MATH dataset as `minerva_math`, but the suite example at line 406 uses `math`. The correct official lm-evaluation-harness task name is `minerva_math`. This inconsistency will cause the example to fail when executed.

Fix

```diff
- --tasks gsm8k,math,arc_challenge \
+ --tasks gsm8k,minerva_math,arc_challenge \
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Math & Reasoning Suite**: | |
| ```bash | |
| lm-eval --model local-chat-completions \ | |
| --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ | |
| --tasks gsm8k,math,arc_challenge \ | |
| --output_path ./results | |
| lm-eval --model local-chat-completions \ | |
| --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ | |
| --tasks gsm8k,minerva_math,arc_challenge \ | |
| --output_path ./results |
🤖 Prompt for AI Agents

In `@docs/en/solutions/How_to_Evaluate_LLM.md` around lines 401-407: The Math & Reasoning Suite example uses the wrong lm-evaluation-harness task name. Replace the task token `math` in the lm-eval command's `--tasks` argument with the official name `minerva_math`, i.e., change `gsm8k,math,arc_challenge` to `gsm8k,minerva_math,arc_challenge` in the example invocation (the line with `--tasks`).
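Before applying the fix, the task name can be double-checked against the locally installed harness. Per the query result above, the harness can list its registered tasks; a quick sketch, assuming `lm_eval[api]` is installed in the current environment:

```bash
# List the tasks registered in the installed lm-evaluation-harness and
# filter for math-related names to confirm that "minerva_math" is registered.
lm-eval --tasks list | grep -i math

# Equivalent module-style invocation mentioned in the query result above.
python -m lm_eval --tasks list | grep -i minerva_math
```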
Download the Jupyter notebook example: [lm-eval Quick Start Notebook](../public/lm-eval/lm-eval_quick_star.ipynb)
Fix dead notebook link to restore build.
Build failed due to a dead link pointing to `../public/lm-eval/lm-eval_quick_star.ipynb`. Update the relative path to the actual location under docs/public.

🧩 Proposed fix

```diff
-Download the Jupyter notebook example: [lm-eval Quick Start Notebook](../public/lm-eval/lm-eval_quick_star.ipynb)
+Download the Jupyter notebook example: [lm-eval Quick Start Notebook](../../public/lm-eval/lm-eval_quick_star.ipynb)
```

🤖 Prompt for AI Agents
In `@docs/en/solutions/How_to_Evaluate_LLM.md` around lines 531-532: The Markdown link points to a dead target `../public/lm-eval/lm-eval_quick_star.ipynb` (typo, missing 't'). Update that link target to the correct notebook location under docs/public (e.g., replace it with the correct relative path to `docs/public/lm-eval/lm-eval_quick_start.ipynb`) so the "lm-eval Quick Start Notebook" link resolves and the build succeeds.
Actionable comments posted: 1
🤖 Fix all issues with AI agents

In `@docs/en/solutions/How_to_Evaluate_LLM.md`:

- Line 311: The Markdown table row for `truthfulqa_gen` / `generate_until` (metrics `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max`) also mentions the metric suffixes _acc and _diff, which are being rendered as emphasis. Wrap the _acc and _diff tokens in backticks so they render as code literals and avoid unintended Markdown emphasis; apply the backticks consistently in that table cell.
♻️ Duplicate comments (2)
docs/en/solutions/How_to_Evaluate_LLM.md (2)
401-407: Use the official `minerva_math` task name in the suite.

The "Math & Reasoning Suite" uses `math`, but the dataset table lists `minerva_math`. This mismatch will fail when users run the command.

💡 Proposed fix

```diff
- --tasks gsm8k,math,arc_challenge \
+ --tasks gsm8k,minerva_math,arc_challenge \
```

531-531: Fix the quick-start notebook link target (typo/404 risk).

The link uses `lm-eval_quick_star.ipynb`, which appears to be a typo; verify the actual filename and update to prevent a broken link.

💡 Proposed fix (if filename is quick_start)

```diff
-Download the Jupyter notebook example: [lm-eval Quick Start Notebook](/lm-eval/lm-eval_quick_star.ipynb)
+Download the Jupyter notebook example: [lm-eval Quick Start Notebook](/lm-eval/lm-eval_quick_start.ipynb)
```
| | StoryCloze | `storycloze_2016` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Story completion task |
| **Truthfulness & Safety** | TruthfulQA | `truthfulqa_mc1` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Single-correct answer truthfulness |
| | | `truthfulqa_mc2` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Multiple-correct answer truthfulness |
| | | `truthfulqa_gen` | `generate_until` | `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max` | Chat / Completion | Server-side | Generative truthfulness evaluation (also outputs _acc and _diff variants) |
Avoid unintended emphasis from `_acc` / `_diff` tokens.
Markdown treats underscores as emphasis; wrap these metric suffixes in backticks to prevent formatting artifacts.
✍️ Proposed fix
```diff
-| | | `truthfulqa_gen` | `generate_until` | `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max` | Chat / Completion | Server-side | Generative truthfulness evaluation (also outputs _acc and _diff variants) |
+| | | `truthfulqa_gen` | `generate_until` | `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max` | Chat / Completion | Server-side | Generative truthfulness evaluation (also outputs `*_acc` and `*_diff` variants) |
```

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
311-311: Spaces inside emphasis markers
(MD037, no-space-in-emphasis)
🤖 Prompt for AI Agents
In `@docs/en/solutions/How_to_Evaluate_LLM.md` at line 311: The Markdown table row for `truthfulqa_gen` / `generate_until` (metrics `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max`) also mentions the metric suffixes _acc and _diff, which are being rendered as emphasis. Wrap the _acc and _diff tokens in backticks so they render as code literals and avoid unintended Markdown emphasis; apply the backticks consistently in that table cell.
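After wrapping the suffixes in backticks, the MD037 finding can be re-checked with the linter named in the tool output above (a sketch; the repository's CI may pin a different version or configuration):

```bash
# Re-run markdownlint-cli2 on the affected file and confirm that MD037
# (no-space-in-emphasis) no longer fires on line 311.
npx markdownlint-cli2 "docs/en/solutions/How_to_Evaluate_LLM.md"
```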
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/public/lm-eval/lm-eval_quick_star.ipynb`:
- Around lines 1-25: Rename the notebook file from lm-eval_quick_star.ipynb to lm-eval_quick_start.ipynb to fix the typo; update any references to the notebook name (e.g., links in docs, README, or navigation entries) so they point to lm-eval_quick_start.ipynb, and verify that the notebook metadata/title, if present, mentions "Quick Start" to keep the naming consistent with the displayed heading.
🧹 Nitpick comments (3)
docs/public/lm-eval/lm-eval_quick_star.ipynb (3)
131-151: Consider removing the leading newline in the YAML config.

The triple-quoted string starts with a newline, creating an empty first line in the generated YAML file. While typically harmless, it's cleaner to use `.strip()` or start the content immediately.

💡 Suggested fix

```diff
 # Create a configuration file
-config = """
-model: local-chat-completions
+config = """\
+model: local-chat-completions
 model_args:
   model: Qwen/Qwen2.5-7B-Instruct
   base_url: http://localhost:8000/v1
```

Or alternatively use `config.strip()` when writing:

```python
with open('eval_config.yaml', 'w') as f:
    f.write(config.strip())
```
163-174: Remove empty placeholder cells.

These empty markdown and code cells appear to be leftover placeholders and should be removed for a cleaner notebook.
213-232: Add error handling for results file loading.

The hardcoded path `./results/results.json` assumes a specific output structure. lm-eval may create timestamped subdirectories or differently named files. Consider adding error handling and guidance for locating actual output files.

💡 Suggested improvement

```diff
 import json
+import os
+import glob

 # Load and display results
-with open('./results/results.json', 'r') as f:
-    results = json.load(f)
+# Find the most recent results file
+result_files = glob.glob('./results/**/results.json', recursive=True)
+if not result_files:
+    print("No results.json found. Please run an evaluation first.")
+else:
+    latest_file = max(result_files, key=os.path.getmtime)
+    print(f"Loading results from: {latest_file}")
+    with open(latest_file, 'r') as f:
+        results = json.load(f)
+    print("=== Evaluation Results ===")
+    print(json.dumps(results, indent=2))
```
```json
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# LM-Evaluation-Harness Quick Start Guide\n",
        "\n",
        "This notebook demonstrates how to use the lm-evaluation-harness (lm-eval) to evaluate language models using command-line interface.\n",
        "\n",
        "## Prerequisites\n",
        "\n",
        "Install lm-eval with the required backends:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Install lm-eval with API support\n",
        "!pip install \"lm_eval[api]\""
      ]
    },
```
Filename typo: `quick_star` should be `quick_start`.

The notebook filename `lm-eval_quick_star.ipynb` appears to have a typo. Consider renaming to `lm-eval_quick_start.ipynb` for clarity.
The prerequisites section and installation command look correct.
🤖 Prompt for AI Agents
In `@docs/public/lm-eval/lm-eval_quick_star.ipynb` around lines 1-25: Rename the notebook file from lm-eval_quick_star.ipynb to lm-eval_quick_start.ipynb to fix the typo; update any references to the notebook name (e.g., links in docs, README, or navigation entries) so they point to lm-eval_quick_start.ipynb, and verify that the notebook metadata/title, if present, mentions "Quick Start" to keep the naming consistent with the displayed heading.
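One possible way to carry out the rename and catch stale references (a sketch; the paths follow the layout shown in this review, and the search below is how remaining references would be located rather than a definitive list):

```bash
# Rename the notebook so the filename matches its "Quick Start" heading.
git mv docs/public/lm-eval/lm-eval_quick_star.ipynb \
       docs/public/lm-eval/lm-eval_quick_start.ipynb

# Find any remaining references to the old name that still need updating.
rg -l "lm-eval_quick_star" docs/
```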
Summary by CodeRabbit