Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces several improvements, including a fallback mechanism for tokenizer loading via `PreTrainedTokenizerFast`, a new `LetterExtractor` for MCQ datasets, and support for DeepSeek thinking mode via `chat_template_kwargs`. It also updates benchmark execution to allow partial accuracy results when individual scorers fail. I have provided feedback on refactoring the scoring loop to reduce duplication, and raised concerns about maintaining metric precision in TPOT calculations when handling reasoning chunks.
```diff
 for eval_cfg in ctx.eval_configs:
-    scorer_instance = eval_cfg.scorer(
-        eval_cfg.dataset_name,
-        eval_cfg.dataset,
-        eval_cfg.report_dir,
-        extractor=eval_cfg.extractor,
-        ground_truth_column=eval_cfg.ground_truth_column,
+    try:
+        scorer_instance = eval_cfg.scorer(
+            eval_cfg.dataset_name,
+            eval_cfg.dataset,
+            eval_cfg.report_dir,
+            extractor=eval_cfg.extractor,
+            ground_truth_column=eval_cfg.ground_truth_column,
+        )
+        score, n_repeats = scorer_instance.score()
+        assert eval_cfg.dataset.data is not None
+        accuracy_scores[eval_cfg.dataset_name] = {
+            "dataset_name": eval_cfg.dataset_name,
+            "num_samples": len(eval_cfg.dataset.data),
+            "extractor": eval_cfg.extractor.__name__,
+            "ground_truth_column": eval_cfg.ground_truth_column,
+            "score": score,
+            "n_repeats": n_repeats,
+        }
+        logger.info(
+            f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)"
+        )
+    except Exception as e:
+        scoring_failed = True
+        logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
+        assert eval_cfg.dataset.data is not None
+        accuracy_scores[eval_cfg.dataset_name] = {
+            "dataset_name": eval_cfg.dataset_name,
+            "num_samples": len(eval_cfg.dataset.data),
+            "extractor": eval_cfg.extractor.__name__,
+            "ground_truth_column": eval_cfg.ground_truth_column,
+            "score": None,
+            "error": str(e),
+        }
```
The accuracy scoring loop contains significant code duplication in the result dictionary and uses `assert` statements for data validation, including inside the exception handler. If `dataset.data` is missing, the `assert` raises an `AssertionError` which, while caught by the `except Exception`, makes the logic brittle. Refactoring to a shared base dictionary with safer access improves maintainability and ensures the loop continues gracefully as intended.
```python
for eval_cfg in ctx.eval_configs:
    # Prepare base results; ensure data exists to avoid TypeError in len()
    num_samples = len(eval_cfg.dataset.data) if eval_cfg.dataset.data is not None else 0
    res_base = {
        "dataset_name": eval_cfg.dataset_name,
        "num_samples": num_samples,
        "extractor": eval_cfg.extractor.__name__,
        "ground_truth_column": eval_cfg.ground_truth_column,
    }
    try:
        scorer_instance = eval_cfg.scorer(
            eval_cfg.dataset_name,
            eval_cfg.dataset,
            eval_cfg.report_dir,
            extractor=eval_cfg.extractor,
            ground_truth_column=eval_cfg.ground_truth_column,
        )
        score, n_repeats = scorer_instance.score()
        accuracy_scores[eval_cfg.dataset_name] = {
            **res_base,
            "score": score,
            "n_repeats": n_repeats,
        }
        logger.info(
            f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)"
        )
    except Exception as e:
        scoring_failed = True
        logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
        accuracy_scores[eval_cfg.dataset_name] = {
            **res_base,
            "score": None,
            "error": str(e),
        }
```

```python
if isinstance(self.reasoning, str):
    # str reasoning is the fully joined streaming trace — include it
    # in the TPOT denominator. Over-counts by one token (the first
    # token is not excluded), but the error is negligible in practice.
    parts.append(self.reasoning)
elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
    parts.extend(self.reasoning[1:])
```
This logic to include string reasoning in the TPOT denominator introduces a slight inaccuracy by failing to exclude the first token of the response. If the accumulator preserves reasoning as a tuple of chunks (as suggested in `accumulator.py`), this special case is unnecessary. Reverting to the previous logic ensures that TPOT is calculated only on tokens generated after the first chunk, maintaining metric precision.
```diff
-if isinstance(self.reasoning, str):
-    # str reasoning is the fully joined streaming trace — include it
-    # in the TPOT denominator. Over-counts by one token (the first
-    # token is not excluded), but the error is negligible in practice.
-    parts.append(self.reasoning)
-elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
-    parts.extend(self.reasoning[1:])
+if isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
+    parts.extend(self.reasoning[1:])
```
```diff
 # All reasoning chunks are joined into a single string so the full
 # thinking trace is captured as-is in events.jsonl. TPOT still uses
 # text_after_first_chunk(), which includes string reasoning in the
 # denominator (off by one token vs. the true "after first chunk"
 # count, which is negligible).
 text_output = TextModelOutput(
     output="".join(self.output_chunks),
-    reasoning=resp_reasoning,
+    reasoning="".join(self.reasoning_chunks),
 )
```
Joining all reasoning chunks into a single string loses precision in TPOT (time per output token) calculations. The `text_after_first_chunk()` method in `TextModelOutput` relies on the reasoning being a tuple of chunks to correctly exclude the first chunk from the denominator. Joining them here pulls the first token into the TPOT calculation, producing inaccurate metrics.
Since `TextModelOutput` already joins chunks in its `__str__` method (used for logging and scoring), it is better to preserve the chunks as a tuple.
```python
# Preserve reasoning chunks as a tuple to allow accurate TPOT
# calculation (excluding the first chunk). TextModelOutput
# handles joining for display/logging via __str__.
text_output = TextModelOutput(
    output=tuple(self.output_chunks),
    reasoning=tuple(self.reasoning_chunks),
)
```
Pull request overview
This PR extends the evaluation and benchmarking pipeline to better support DeepSeek-V4-style workflows and MCQ datasets by adding a new letter-based extractor, passing chat template kwargs through the OpenAI msgspec adapter, and improving benchmark runtime robustness around tokenizer handling and scoring.
Changes:
- Add `LetterExtractor` (A–J) and unit tests; update the `ABCDExtractor` doc example.
- Allow passing `chat_template_kwargs` through the OpenAI msgspec request types/adapter and add AIME25 presets that enable "thinking" mode via static columns.
- Add a tokenizer override plus tokenizer-loading fallback behavior; make accuracy scoring continue on per-scorer failures and adjust reasoning aggregation behavior.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| `tests/unit/evaluation/test_extractor.py` | Adds unit tests for the new `LetterExtractor` and applies unit markers/import updates. |
| `src/inference_endpoint/evaluation/extractor.py` | Introduces `LetterExtractor` and updates the `ABCDExtractor` example text. |
| `src/inference_endpoint/openai/types.py` | Adds `chat_template_kwargs` to the msgspec `ChatCompletionRequest`. |
| `src/inference_endpoint/openai/openai_msgspec_adapter.py` | Passes `chat_template_kwargs` through from dataset rows into endpoint requests. |
| `src/inference_endpoint/dataset_manager/transforms.py` | Adjusts `AddStaticColumns` to handle dict/list values without pandas index-alignment issues. |
| `src/inference_endpoint/dataset_manager/predefined/aime25/presets.py` | Adds presets that set `chat_template_kwargs` for thinking/budget tokens. |
| `src/inference_endpoint/openai/accumulator.py` | Changes reasoning aggregation to a single joined string for event logging. |
| `src/inference_endpoint/core/types.py` | Updates TPOT denominator logic to account for string reasoning. |
| `src/inference_endpoint/config/schema.py` | Adds `ModelParams.tokenizer_name` for tokenizer override configuration. |
| `src/inference_endpoint/commands/benchmark/execute.py` | Uses the tokenizer override, changes accuracy-phase `load_pattern` behavior, and continues scoring on failures. |
| `src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py` | Adds a fallback to `PreTrainedTokenizerFast` when `AutoTokenizer` fails. |
```diff
 min_sample_count=acc_ds.num_samples() * acc_ds.repeats,
 rng_sched=ctx.rt_settings.rng_sched,
 rng_sample_index=ctx.rt_settings.rng_sample_index,
-load_pattern=LoadPattern(type=LoadPatternType.MAX_THROUGHPUT),
+load_pattern=ctx.rt_settings.load_pattern,
 )
```
In accuracy phases, `load_pattern=ctx.rt_settings.load_pattern` means accuracy evaluation will be throttled by the main run's load pattern (e.g., Poisson target QPS / concurrency), which can make scoring unnecessarily slow. Previously this forced `MAX_THROUGHPUT`; consider keeping Burst/`MAX_THROUGHPUT` for accuracy phases (or making it explicitly configurable) to avoid long eval runs.
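One possible shape for the configurable variant, as a sketch; `accuracy_load_pattern` is a hypothetical setting name that this PR does not define:

```python
# Sketch only: `accuracy_load_pattern` is a hypothetical runtime setting.
# Default to MAX_THROUGHPUT so accuracy phases finish quickly, while still
# letting users opt into the main run's load pattern when they need to.
accuracy_load_pattern = getattr(ctx.rt_settings, "accuracy_load_pattern", None)
load_pattern = accuracy_load_pattern or LoadPattern(
    type=LoadPatternType.MAX_THROUGHPUT
)
```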
```python
tokenizer_name = tokenizer_override
logger.info(
    f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
)
```
The log line says “Tokenizer available for model … (override: …)”, but when `tokenizer_override` is set you are not actually checking that the override exists or is loadable. The message is misleading and can confuse debugging; consider logging that an override is being used (and optionally validating it with `_check_tokenizer_exists(tokenizer_override)` before proceeding).
```diff
-tokenizer_name = tokenizer_override
-logger.info(
-    f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
-)
+if _check_tokenizer_exists(tokenizer_override):
+    tokenizer_name = tokenizer_override
+    logger.info(
+        f"Using tokenizer override for model: {model_name} ({tokenizer_override})"
+    )
+else:
+    tokenizer_name = None
+    logger.warning(
+        f"Tokenizer override not available for model: {model_name} ({tokenizer_override})"
+    )
```
```python
)
except Exception as e:
    scoring_failed = True
    logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
```
The `except Exception as e: logger.error(...)` handler logs only `str(e)` and drops the traceback, which makes scorer failures hard to diagnose in CI and user runs. Consider including `exc_info=True` (and/or logging the exception type) so the report includes actionable stack traces while still continuing past failures.
```diff
-logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
+logger.error(
+    "Scoring failed for %s: %s: %s",
+    eval_cfg.dataset_name,
+    type(e).__name__,
+    e,
+    exc_info=True,
+)
```
```python
if isinstance(self.reasoning, str):
    # str reasoning is the fully joined streaming trace — include it
    # in the TPOT denominator. Over-counts by one token (the first
    # token is not excluded), but the error is negligible in practice.
    parts.append(self.reasoning)
elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
```
`text_after_first_chunk()` now treats reasoning as a joined `str` for streaming cases, but the method docstring above still states that for non-streaming (`str`) fields it returns an empty string. Please update the docstring (and ideally the `TextModelOutput` field docs) to reflect that `str` reasoning may also represent a streaming trace; otherwise future callers may make incorrect assumptions.
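A possible wording for the updated docstring, assuming the method lives on `TextModelOutput` with this signature (a sketch, not the actual code):

```python
def text_after_first_chunk(self) -> str:
    """Return the text counted in the TPOT denominator.

    Tuple fields are streamed chunks: everything after the first chunk
    is counted. A str ``reasoning`` field may be either a non-streaming
    value or a fully joined streaming trace; in the latter case it is
    counted wholesale, over-counting by the tokens of the first chunk.
    """
    ...
```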
```python
try:
    self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
        self._tokenizer_name
    )
except Exception:
    # AutoTokenizer loads config.json to detect the model type; for
    # models with unknown model_type (e.g. deepseek_v4 in older
    # transformers) or missing rope config fields, this fails.
    # Fall back to PreTrainedTokenizerFast which reads only
    # tokenizer.json / tokenizer_config.json and skips model config.
    self._thread_local.tokenizer = PreTrainedTokenizerFast.from_pretrained(
        self._tokenizer_name
    )
```
Catching a broad `Exception` from `AutoTokenizer.from_pretrained` without logging can mask unrelated configuration issues (e.g., missing files, permission errors, offline mode) and make token-metric failures opaque. Consider catching a narrower set of expected failures (or at least logging the original exception at debug/warning level) before falling back to `PreTrainedTokenizerFast`.
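A sketch of the logged fallback in the same context as the quoted hunk; the exception types `AutoTokenizer` raises for an unknown `model_type` vary by transformers version, so the tuple below is an assumption:

```python
try:
    self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
        self._tokenizer_name
    )
except (ValueError, KeyError, OSError) as e:  # assumed failure modes
    # Surface the original failure so unrelated problems (missing files,
    # offline mode, permissions) are visible before the fallback runs.
    logger.warning(
        "AutoTokenizer failed for %s (%s: %s); falling back to "
        "PreTrainedTokenizerFast",
        self._tokenizer_name,
        type(e).__name__,
        e,
    )
    self._thread_local.tokenizer = PreTrainedTokenizerFast.from_pretrained(
        self._tokenizer_name
    )
```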
```python
# Wrap dict/list values in a list so pandas doesn't try to align
# on index keys (e.g. {"thinking": True} would produce NaN otherwise).
if isinstance(value, (dict, list)):
    df[key] = [value] * len(df)
else:
    df[key] = value
```
`AddStaticColumns` now broadcasts dict/list values via `[value] * len(df)`, which repeats the same mutable object reference for every row. If anything downstream mutates one row's dict/list (e.g., `chat_template_kwargs`), it will silently affect all rows. Consider per-row copies (e.g., `copy.deepcopy`) or constructing a fresh object per row to avoid shared mutable state.
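A self-contained demonstration of the aliasing hazard and the per-row copy fix (column names here are illustrative only):

```python
import copy

import pandas as pd

df = pd.DataFrame({"prompt": ["p1", "p2"]})

# Hazard: every row stores a reference to the SAME dict object.
shared = {"thinking": True}
df["kwargs_shared"] = [shared] * len(df)
df.loc[0, "kwargs_shared"]["thinking"] = False
assert df.loc[1, "kwargs_shared"]["thinking"] is False  # mutation leaked across rows

# Fix: one independent copy per row.
fresh = {"thinking": True}
df["kwargs_copied"] = [copy.deepcopy(fresh) for _ in range(len(df))]
df.loc[0, "kwargs_copied"]["thinking"] = False
assert df.loc[1, "kwargs_copied"]["thinking"] is True  # other rows unaffected
```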
This change adds special handling for dict/list static values, but the existing unit tests for `AddStaticColumns` don't cover dict/list inputs. Please add a regression test ensuring dict/list values are preserved per row (and not converted to NaN via pandas index alignment).
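A sketch of such a regression test; the `AddStaticColumns` constructor and call signature are assumptions, so adapt them to the real transform API:

```python
import pandas as pd
import pytest

from inference_endpoint.dataset_manager.transforms import AddStaticColumns


@pytest.mark.unit
def test_add_static_columns_broadcasts_dict_per_row():
    df = pd.DataFrame({"prompt": ["p1", "p2"]})
    # Constructor/call signature assumed here.
    transform = AddStaticColumns(columns={"chat_template_kwargs": {"thinking": True}})
    out = transform(df)

    # The dict must survive on every row instead of becoming NaN via
    # pandas index alignment.
    assert list(out["chat_template_kwargs"]) == [
        {"thinking": True},
        {"thinking": True},
    ]
```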
Pull request overview
Copilot reviewed 18 out of 19 changed files in this pull request and generated 4 comments.
```python
# - GPQA 198 samples abcd_extractor
# - MMLU-Pro 2410 samples abcd_extractor
```
The header comment says GPQA and MMLU-Pro use `abcd_extractor`, but the actual dataset configs below correctly use `letter_extractor`. Please update the header comment to match the configuration so readers don't pick the wrong extractor when adapting this example.
```diff
-# - GPQA 198 samples abcd_extractor
-# - MMLU-Pro 2410 samples abcd_extractor
+# - GPQA 198 samples letter_extractor
+# - MMLU-Pro 2410 samples letter_extractor
```
> ### Step 2 — Run math + MCQ accuracy
>
> Uncomment MMLU-Pro in [`vllm_dsv4pro_mlperf_accuracy.yaml`](vllm_dsv4pro_mlperf_accuracy.yaml), then:
>
> ```bash
> uv run inference-endpoint benchmark from-config \
>   -c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_mlperf_accuracy.yaml
> ```
This section instructs the reader to “Uncomment MMLU-Pro” in the MLPerf config, but `mlperf-mmlu-pro` is already enabled in `vllm_dsv4pro_mlperf_accuracy.yaml`. Please adjust the instruction (or comment out that dataset in the YAML) to keep the example steps consistent.
```python
# All reasoning chunks are joined into a single string so the full
# thinking trace is captured as-is in events.jsonl. TPOT still uses
# text_after_first_chunk(), which includes string reasoning in the
# denominator (off by one token vs. the true "after first chunk"
# count, which is negligible).
```
The comment claims the TPOT denominator error is only “off by one token”, but joining all reasoning chunks into a single string loses the first-chunk boundary and can over-count by all tokens in the first streamed reasoning chunk (SSE deltas often contain multiple tokens). If TPOT accuracy matters, keep reasoning as chunked data (e.g., a tuple with the first chunk separated) or record the first-chunk split explicitly so `text_after_first_chunk()` can exclude it precisely, while still writing a joined reasoning string to events.jsonl for readability; see the sketch below.
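One way to keep both properties, sketched against the names in the quoted hunks (the events.jsonl serialization step is assumed):

```python
# Sketch: preserve chunk boundaries for TPOT; join only at serialization.
text_output = TextModelOutput(
    output=tuple(self.output_chunks),
    reasoning=tuple(self.reasoning_chunks),  # first-chunk boundary kept intact
)

# events.jsonl still receives a single readable string:
event_reasoning = "".join(text_output.reasoning)
```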
```python
if isinstance(self.reasoning, str):
    # str reasoning is the fully joined streaming trace — include it
    # in the TPOT denominator. Over-counts by one token (the first
    # token is not excluded), but the error is negligible in practice.
    parts.append(self.reasoning)
elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
    parts.extend(self.reasoning[1:])
```
`text_after_first_chunk()` now treats `str` reasoning as streaming output and includes the entire reasoning string in the denominator, which contradicts the method's docstring (it says `str` fields are non-streaming and should return an empty string) and still cannot actually exclude the first streamed chunk when reasoning is stored as a joined `str`. Consider updating the docstring to match the new semantics and/or preserving chunk boundaries (a tuple) so the “after first chunk” calculation remains exact.
Pull request overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.
```python
df[key] = value
# Wrap dict/list values in a list so pandas doesn't try to align
# on index keys (e.g. {"thinking": True} would produce NaN otherwise).
if isinstance(value, dict | list):
```
```python
tokenizer_override = config.model_params.tokenizer_name
tokenizer_name: str | None
if tokenizer_override:
    tokenizer_name = tokenizer_override
    logger.info(
        f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
    )
else:
```
```python
logit_bias: dict[str, float] | None = None
user: str | None = None
chat_template: str | None = None
chat_template_kwargs: dict[str, Any] | None = None
```