Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces several improvements, including a fallback mechanism for tokenizer loading via `PreTrainedTokenizerFast`, a new `LetterExtractor` for MCQ datasets, and support for DeepSeek thinking mode via `chat_template_kwargs`. It also updates benchmark execution to allow partial accuracy results when individual scorers fail. I have provided feedback on refactoring the scoring loop to reduce duplication, and raised concerns about maintaining metric precision in TPOT calculations when handling reasoning chunks.
```diff
 for eval_cfg in ctx.eval_configs:
-    scorer_instance = eval_cfg.scorer(
-        eval_cfg.dataset_name,
-        eval_cfg.dataset,
-        eval_cfg.report_dir,
-        extractor=eval_cfg.extractor,
-        ground_truth_column=eval_cfg.ground_truth_column,
+    try:
+        scorer_instance = eval_cfg.scorer(
+            eval_cfg.dataset_name,
+            eval_cfg.dataset,
+            eval_cfg.report_dir,
+            extractor=eval_cfg.extractor,
+            ground_truth_column=eval_cfg.ground_truth_column,
+        )
+        score, n_repeats = scorer_instance.score()
+        assert eval_cfg.dataset.data is not None
+        accuracy_scores[eval_cfg.dataset_name] = {
+            "dataset_name": eval_cfg.dataset_name,
+            "num_samples": len(eval_cfg.dataset.data),
+            "extractor": eval_cfg.extractor.__name__,
+            "ground_truth_column": eval_cfg.ground_truth_column,
+            "score": score,
+            "n_repeats": n_repeats,
+        }
+        logger.info(
+            f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)"
+        )
+    except Exception as e:
+        scoring_failed = True
+        logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
+        assert eval_cfg.dataset.data is not None
+        accuracy_scores[eval_cfg.dataset_name] = {
+            "dataset_name": eval_cfg.dataset_name,
+            "num_samples": len(eval_cfg.dataset.data),
+            "extractor": eval_cfg.extractor.__name__,
+            "ground_truth_column": eval_cfg.ground_truth_column,
+            "score": None,
+            "error": str(e),
+        }
```
The accuracy scoring loop contains significant code duplication in the result dictionary and uses `assert` statements for data validation, including inside the exception handler. If `dataset.data` is missing, the `assert` raises an `AssertionError` which, while caught by the `except Exception`, makes the logic brittle. Refactoring to a shared base dictionary with safer access improves maintainability and ensures the loop continues gracefully as intended.
```python
for eval_cfg in ctx.eval_configs:
    # Prepare base results; ensure data exists to avoid TypeError in len()
    num_samples = len(eval_cfg.dataset.data) if eval_cfg.dataset.data is not None else 0
    res_base = {
        "dataset_name": eval_cfg.dataset_name,
        "num_samples": num_samples,
        "extractor": eval_cfg.extractor.__name__,
        "ground_truth_column": eval_cfg.ground_truth_column,
    }
    try:
        scorer_instance = eval_cfg.scorer(
            eval_cfg.dataset_name,
            eval_cfg.dataset,
            eval_cfg.report_dir,
            extractor=eval_cfg.extractor,
            ground_truth_column=eval_cfg.ground_truth_column,
        )
        score, n_repeats = scorer_instance.score()
        accuracy_scores[eval_cfg.dataset_name] = {
            **res_base,
            "score": score,
            "n_repeats": n_repeats,
        }
        logger.info(
            f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)"
        )
    except Exception as e:
        scoring_failed = True
        logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
        accuracy_scores[eval_cfg.dataset_name] = {
            **res_base,
            "score": None,
            "error": str(e),
        }
```

```python
if isinstance(self.reasoning, str):
    # str reasoning is the fully joined streaming trace — include it
    # in the TPOT denominator. Over-counts by one token (the first
    # token is not excluded), but the error is negligible in practice.
    parts.append(self.reasoning)
elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
    parts.extend(self.reasoning[1:])
```
This logic to include string reasoning in the TPOT denominator introduces a slight inaccuracy by failing to exclude the first token of the response. If the accumulator preserves reasoning as a tuple of chunks (as suggested in `accumulator.py`), this special case is unnecessary. Reverting to the previous logic ensures that TPOT is calculated only on tokens generated after the first chunk, maintaining metric precision.
```diff
-if isinstance(self.reasoning, str):
-    # str reasoning is the fully joined streaming trace — include it
-    # in the TPOT denominator. Over-counts by one token (the first
-    # token is not excluded), but the error is negligible in practice.
-    parts.append(self.reasoning)
-elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
-    parts.extend(self.reasoning[1:])
+if isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
+    parts.extend(self.reasoning[1:])
```
```diff
 # All reasoning chunks are joined into a single string so the full
 # thinking trace is captured as-is in events.jsonl. TPOT still uses
 # text_after_first_chunk(), which includes string reasoning in the
 # denominator (off by one token vs. the true "after first chunk"
 # count, which is negligible).
 text_output = TextModelOutput(
     output="".join(self.output_chunks),
-    reasoning=resp_reasoning,
+    reasoning="".join(self.reasoning_chunks),
 )
```
Joining all reasoning chunks into a single string loses precision in TPOT (time per output token) calculations. The `text_after_first_chunk()` method in `TextModelOutput` relies on the reasoning being a tuple of chunks to correctly exclude the first chunk from the denominator. Joining them here pulls the first token into the TPOT calculation, producing inaccurate metrics.
Since `TextModelOutput` already joins chunks in its `__str__` method (used for logging and scoring), it is better to preserve the chunks as a tuple.
```python
# Preserve reasoning chunks as a tuple to allow accurate TPOT
# calculation (excluding the first chunk). TextModelOutput
# handles joining for display/logging via __str__.
text_output = TextModelOutput(
    output=tuple(self.output_chunks),
    reasoning=tuple(self.reasoning_chunks),
)
```
Pull request overview
This PR extends the evaluation and benchmarking pipeline to better support DeepSeek-V4-style workflows and MCQ datasets by adding a new letter-based extractor, passing chat template kwargs through the OpenAI msgspec adapter, and improving benchmark runtime robustness around tokenizer handling and scoring.
Changes:
- Add `LetterExtractor` (A–J) and unit tests; update the `ABCDExtractor` doc example.
- Allow passing `chat_template_kwargs` through the OpenAI msgspec request types/adapter and add AIME25 presets that enable "thinking" mode via static columns.
- Add a tokenizer override plus tokenizer-loading fallback behavior; make accuracy scoring continue on per-scorer failures and adjust reasoning aggregation behavior.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| `tests/unit/evaluation/test_extractor.py` | Adds unit tests for the new `LetterExtractor` and applies unit markers/import updates. |
| `src/inference_endpoint/evaluation/extractor.py` | Introduces `LetterExtractor` and updates the `ABCDExtractor` example text. |
| `src/inference_endpoint/openai/types.py` | Adds `chat_template_kwargs` to the msgspec `ChatCompletionRequest`. |
| `src/inference_endpoint/openai/openai_msgspec_adapter.py` | Passes `chat_template_kwargs` through from dataset rows into endpoint requests. |
| `src/inference_endpoint/dataset_manager/transforms.py` | Adjusts `AddStaticColumns` to handle dict/list values without pandas index-alignment issues. |
| `src/inference_endpoint/dataset_manager/predefined/aime25/presets.py` | Adds presets that set `chat_template_kwargs` for thinking/budget tokens. |
| `src/inference_endpoint/openai/accumulator.py` | Changes reasoning aggregation to a single joined string for event logging. |
| `src/inference_endpoint/core/types.py` | Updates TPOT denominator logic to account for string reasoning. |
| `src/inference_endpoint/config/schema.py` | Adds `ModelParams.tokenizer_name` for tokenizer override configuration. |
| `src/inference_endpoint/commands/benchmark/execute.py` | Uses the tokenizer override, changes accuracy-phase `load_pattern` behavior, and continues scoring on failures. |
| `src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py` | Adds a fallback to `PreTrainedTokenizerFast` when `AutoTokenizer` fails. |
```diff
 min_sample_count=acc_ds.num_samples() * acc_ds.repeats,
 rng_sched=ctx.rt_settings.rng_sched,
 rng_sample_index=ctx.rt_settings.rng_sample_index,
-load_pattern=LoadPattern(type=LoadPatternType.MAX_THROUGHPUT),
+load_pattern=ctx.rt_settings.load_pattern,
 )
```
In accuracy phases, `load_pattern=ctx.rt_settings.load_pattern` means accuracy evaluation will be throttled by the main run's load pattern (e.g., Poisson target QPS / concurrency), which can make scoring unnecessarily slow. Previously this forced `MAX_THROUGHPUT`; consider keeping Burst/`MAX_THROUGHPUT` for accuracy phases (or making it explicitly configurable) to avoid long eval runs.
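One possible shape for the configurable variant, as a sketch; `accuracy_load_pattern` is a hypothetical setting name that this PR does not define:

```python
# Sketch only: `accuracy_load_pattern` is a hypothetical runtime setting.
# Default to MAX_THROUGHPUT so accuracy phases finish quickly, while still
# letting users opt into the main run's load pattern when they need to.
accuracy_load_pattern = getattr(ctx.rt_settings, "accuracy_load_pattern", None)
load_pattern = accuracy_load_pattern or LoadPattern(
    type=LoadPatternType.MAX_THROUGHPUT
)
```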
```python
tokenizer_name = tokenizer_override
logger.info(
    f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
)
```
The log line says “Tokenizer available for model … (override: …)”, but when `tokenizer_override` is set you are not actually checking that the override exists or is loadable. The message is misleading and can confuse debugging; consider logging that an override is being used (and optionally validating it with `_check_tokenizer_exists(tokenizer_override)` before proceeding).
```diff
-tokenizer_name = tokenizer_override
-logger.info(
-    f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
-)
+if _check_tokenizer_exists(tokenizer_override):
+    tokenizer_name = tokenizer_override
+    logger.info(
+        f"Using tokenizer override for model: {model_name} ({tokenizer_override})"
+    )
+else:
+    tokenizer_name = None
+    logger.warning(
+        f"Tokenizer override not available for model: {model_name} ({tokenizer_override})"
+    )
```
```python
)
except Exception as e:
    scoring_failed = True
    logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
```
The `except Exception as e: logger.error(...)` handler logs only `str(e)` and drops the traceback, which makes scorer failures hard to diagnose in CI and user runs. Consider including `exc_info=True` (and/or logging the exception type) so the report includes actionable stack traces while still continuing past failures.
```diff
-logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
+logger.error(
+    "Scoring failed for %s: %s: %s",
+    eval_cfg.dataset_name,
+    type(e).__name__,
+    e,
+    exc_info=True,
+)
```
```python
if isinstance(self.reasoning, str):
    # str reasoning is the fully joined streaming trace — include it
    # in the TPOT denominator. Over-counts by one token (the first
    # token is not excluded), but the error is negligible in practice.
    parts.append(self.reasoning)
elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
```
`text_after_first_chunk()` now treats reasoning as a joined `str` for streaming cases, but the method docstring above still states that for non-streaming (`str`) fields it returns an empty string. Please update the docstring (and ideally the `TextModelOutput` field docs) to reflect that `str` reasoning may also represent a streaming trace; otherwise future callers may make incorrect assumptions.
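A possible wording for the updated docstring, assuming the method lives on `TextModelOutput` with this signature (a sketch, not the actual code):

```python
def text_after_first_chunk(self) -> str:
    """Return the text counted in the TPOT denominator.

    Tuple fields are streamed chunks: everything after the first chunk
    is counted. A str ``reasoning`` field may be either a non-streaming
    value or a fully joined streaming trace; in the latter case it is
    counted wholesale, over-counting by the tokens of the first chunk.
    """
    ...
```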
```python
try:
    self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
        self._tokenizer_name
    )
except Exception:
    # AutoTokenizer loads config.json to detect the model type; for
    # models with unknown model_type (e.g. deepseek_v4 in older
    # transformers) or missing rope config fields, this fails.
    # Fall back to PreTrainedTokenizerFast which reads only
    # tokenizer.json / tokenizer_config.json and skips model config.
    self._thread_local.tokenizer = PreTrainedTokenizerFast.from_pretrained(
        self._tokenizer_name
    )
```
Catching a broad `Exception` from `AutoTokenizer.from_pretrained` without logging can mask unrelated configuration issues (e.g., missing files, permission errors, offline mode) and make token-metric failures opaque. Consider catching a narrower set of expected failures (or at least logging the original exception at debug/warning level) before falling back to `PreTrainedTokenizerFast`.
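A sketch of the logged fallback in the same context as the quoted hunk; the exception types `AutoTokenizer` raises for an unknown `model_type` vary by transformers version, so the tuple below is an assumption:

```python
try:
    self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
        self._tokenizer_name
    )
except (ValueError, KeyError, OSError) as e:  # assumed failure modes
    # Surface the original failure so unrelated problems (missing files,
    # offline mode, permissions) are visible before the fallback runs.
    logger.warning(
        "AutoTokenizer failed for %s (%s: %s); falling back to "
        "PreTrainedTokenizerFast",
        self._tokenizer_name,
        type(e).__name__,
        e,
    )
    self._thread_local.tokenizer = PreTrainedTokenizerFast.from_pretrained(
        self._tokenizer_name
    )
```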
```python
# Wrap dict/list values in a list so pandas doesn't try to align
# on index keys (e.g. {"thinking": True} would produce NaN otherwise).
if isinstance(value, (dict, list)):
    df[key] = [value] * len(df)
else:
    df[key] = value
```
`AddStaticColumns` now broadcasts dict/list values via `[value] * len(df)`, which repeats the same mutable object reference for every row. If anything downstream mutates one row's dict/list (e.g., `chat_template_kwargs`), it will silently affect all rows. Consider per-row copies (e.g., `copy.deepcopy`) or constructing a fresh object per row to avoid shared mutable state.
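A self-contained demonstration of the aliasing hazard and the per-row copy fix (column names here are illustrative only):

```python
import copy

import pandas as pd

df = pd.DataFrame({"prompt": ["p1", "p2"]})

# Hazard: every row stores a reference to the SAME dict object.
shared = {"thinking": True}
df["kwargs_shared"] = [shared] * len(df)
df.loc[0, "kwargs_shared"]["thinking"] = False
assert df.loc[1, "kwargs_shared"]["thinking"] is False  # mutation leaked across rows

# Fix: one independent copy per row.
fresh = {"thinking": True}
df["kwargs_copied"] = [copy.deepcopy(fresh) for _ in range(len(df))]
df.loc[0, "kwargs_copied"]["thinking"] = False
assert df.loc[1, "kwargs_copied"]["thinking"] is True  # other rows unaffected
```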
This change adds special handling for dict/list static values, but the existing unit tests for `AddStaticColumns` don't cover dict/list inputs. Please add a regression test ensuring dict/list values are preserved per row (and not converted to NaN via pandas index alignment).
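A sketch of such a regression test; the `AddStaticColumns` constructor and call signature are assumptions, so adapt them to the real transform API:

```python
import pandas as pd
import pytest

from inference_endpoint.dataset_manager.transforms import AddStaticColumns


@pytest.mark.unit
def test_add_static_columns_broadcasts_dict_per_row():
    df = pd.DataFrame({"prompt": ["p1", "p2"]})
    # Constructor/call signature assumed here.
    transform = AddStaticColumns(columns={"chat_template_kwargs": {"thinking": True}})
    out = transform(df)

    # The dict must survive on every row instead of becoming NaN via
    # pandas index alignment.
    assert list(out["chat_template_kwargs"]) == [
        {"thinking": True},
        {"thinking": True},
    ]
```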
Pull request overview
Copilot reviewed 18 out of 19 changed files in this pull request and generated 4 comments.
```python
# - GPQA 198 samples abcd_extractor
# - MMLU-Pro 2410 samples abcd_extractor
```
The header comment says GPQA and MMLU-Pro use `abcd_extractor`, but the actual dataset configs below correctly use `letter_extractor`. Please update the header comment to match the configuration so readers don't pick the wrong extractor when adapting this example.
```diff
-# - GPQA 198 samples abcd_extractor
-# - MMLU-Pro 2410 samples abcd_extractor
+# - GPQA 198 samples letter_extractor
+# - MMLU-Pro 2410 samples letter_extractor
```
> ### Step 2 — Run math + MCQ accuracy
>
> Uncomment MMLU-Pro in [`vllm_dsv4pro_mlperf_accuracy.yaml`](vllm_dsv4pro_mlperf_accuracy.yaml), then:
>
> ```bash
> uv run inference-endpoint benchmark from-config \
>   -c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_mlperf_accuracy.yaml
> ```
This section instructs the reader to “Uncomment MMLU-Pro” in the MLPerf config, but `mlperf-mmlu-pro` is already enabled in `vllm_dsv4pro_mlperf_accuracy.yaml`. Please adjust the instruction (or comment out that dataset in the YAML) to keep the example steps consistent.
```python
# All reasoning chunks are joined into a single string so the full
# thinking trace is captured as-is in events.jsonl. TPOT still uses
# text_after_first_chunk(), which includes string reasoning in the
# denominator (off by one token vs. the true "after first chunk"
# count, which is negligible).
```
The comment claims the TPOT denominator error is only “off by one token”, but joining all reasoning chunks into a single string loses the first-chunk boundary and can over-count by all tokens in the first streamed reasoning chunk (SSE deltas often contain multiple tokens). If TPOT accuracy matters, keep reasoning as chunked data (e.g., a tuple with the first chunk separated) or record the first-chunk split explicitly so `text_after_first_chunk()` can exclude it precisely, while still writing a joined reasoning string to events.jsonl for readability; see the sketch below.
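One way to keep both properties, sketched against the names in the quoted hunks (the events.jsonl serialization step is assumed):

```python
# Sketch: preserve chunk boundaries for TPOT; join only at serialization.
text_output = TextModelOutput(
    output=tuple(self.output_chunks),
    reasoning=tuple(self.reasoning_chunks),  # first-chunk boundary kept intact
)

# events.jsonl still receives a single readable string:
event_reasoning = "".join(text_output.reasoning)
```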
```python
if isinstance(self.reasoning, str):
    # str reasoning is the fully joined streaming trace — include it
    # in the TPOT denominator. Over-counts by one token (the first
    # token is not excluded), but the error is negligible in practice.
    parts.append(self.reasoning)
elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
    parts.extend(self.reasoning[1:])
```
`text_after_first_chunk()` now treats `str` reasoning as streaming output and includes the entire reasoning string in the denominator, which contradicts the method's docstring (it says `str` fields are non-streaming and should return an empty string) and still cannot actually exclude the first streamed chunk when reasoning is stored as a joined `str`. Consider updating the docstring to match the new semantics and/or preserving chunk boundaries (a tuple) so the “after first chunk” calculation remains exact.
Pull request overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.
```python
df[key] = value
# Wrap dict/list values in a list so pandas doesn't try to align
# on index keys (e.g. {"thinking": True} would produce NaN otherwise).
if isinstance(value, dict | list):
```
```python
tokenizer_override = config.model_params.tokenizer_name
tokenizer_name: str | None
if tokenizer_override:
    tokenizer_name = tokenizer_override
    logger.info(
        f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
    )
else:
```
```python
logit_bias: dict[str, float] | None = None
user: str | None = None
chat_template: str | None = None
chat_template_kwargs: dict[str, Any] | None = None
```