
feat: Support DSV4 example. #304

Open
arekay-nv wants to merge 5 commits into main from arekay/dsv4_example

Conversation

@arekay-nv
Collaborator

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@arekay-nv arekay-nv requested review from a team and Copilot May 1, 2026 15:25
@github-actions

github-actions Bot commented May 1, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces several improvements, including a fallback mechanism for tokenizer loading using PreTrainedTokenizerFast, a new LetterExtractor for MCQ datasets, and support for DeepSeek thinking mode via chat_template_kwargs. It also updates the benchmark execution to allow partial accuracy results when individual scorers fail. I have provided feedback on refactoring the scoring loop to reduce duplication, and on maintaining metric precision in TPOT calculations when handling reasoning chunks.

Comment on lines 661 to +694
     for eval_cfg in ctx.eval_configs:
    -    scorer_instance = eval_cfg.scorer(
    -        eval_cfg.dataset_name,
    -        eval_cfg.dataset,
    -        eval_cfg.report_dir,
    -        extractor=eval_cfg.extractor,
    -        ground_truth_column=eval_cfg.ground_truth_column,
    +    try:
    +        scorer_instance = eval_cfg.scorer(
    +            eval_cfg.dataset_name,
    +            eval_cfg.dataset,
    +            eval_cfg.report_dir,
    +            extractor=eval_cfg.extractor,
    +            ground_truth_column=eval_cfg.ground_truth_column,
    +        )
    +        score, n_repeats = scorer_instance.score()
    +        assert eval_cfg.dataset.data is not None
    +        accuracy_scores[eval_cfg.dataset_name] = {
    +            "dataset_name": eval_cfg.dataset_name,
    +            "num_samples": len(eval_cfg.dataset.data),
    +            "extractor": eval_cfg.extractor.__name__,
    +            "ground_truth_column": eval_cfg.ground_truth_column,
    +            "score": score,
    +            "n_repeats": n_repeats,
    +        }
    +        logger.info(
    +            f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)"
    +        )
    +    except Exception as e:
    +        scoring_failed = True
    +        logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
    +        assert eval_cfg.dataset.data is not None
    +        accuracy_scores[eval_cfg.dataset_name] = {
    +            "dataset_name": eval_cfg.dataset_name,
    +            "num_samples": len(eval_cfg.dataset.data),
    +            "extractor": eval_cfg.extractor.__name__,
    +            "ground_truth_column": eval_cfg.ground_truth_column,
    +            "score": None,
    +            "error": str(e),
    +        }

medium

The accuracy scoring loop contains significant code duplication for the result dictionary and uses assert statements for data validation, including one inside the exception handler. If dataset.data is missing, the assert in the try block raises an AssertionError that is swallowed by the except Exception, while the assert in the except block would propagate and abort the loop entirely. Refactoring this to use a base dictionary and safer access improves maintainability and ensures the loop continues gracefully as intended.

    for eval_cfg in ctx.eval_configs:
        # Prepare base results; ensure data exists to avoid TypeError in len()
        num_samples = len(eval_cfg.dataset.data) if eval_cfg.dataset.data is not None else 0
        res_base = {
            "dataset_name": eval_cfg.dataset_name,
            "num_samples": num_samples,
            "extractor": eval_cfg.extractor.__name__,
            "ground_truth_column": eval_cfg.ground_truth_column,
        }
        try:
            scorer_instance = eval_cfg.scorer(
                eval_cfg.dataset_name,
                eval_cfg.dataset,
                eval_cfg.report_dir,
                extractor=eval_cfg.extractor,
                ground_truth_column=eval_cfg.ground_truth_column,
            )
            score, n_repeats = scorer_instance.score()
            accuracy_scores[eval_cfg.dataset_name] = {
                **res_base,
                "score": score,
                "n_repeats": n_repeats,
            }
            logger.info(
                f"Score for {eval_cfg.dataset_name}: {score} ({n_repeats} repeats)"
            )
        except Exception as e:
            scoring_failed = True
            logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
            accuracy_scores[eval_cfg.dataset_name] = {
                **res_base,
                "score": None,
                "error": str(e),
            }

Comment thread src/inference_endpoint/core/types.py Outdated
Comment on lines 136 to 142
    if isinstance(self.reasoning, str):
        # str reasoning is the fully joined streaming trace — include it
        # in the TPOT denominator. Over-counts by one token (the first
        # token is not excluded), but the error is negligible in practice.
        parts.append(self.reasoning)
    elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
        parts.extend(self.reasoning[1:])

medium

This logic to include string reasoning in the TPOT denominator introduces a slight inaccuracy in metrics by failing to exclude the first token of the response. If the accumulator preserves reasoning as a tuple of chunks (as suggested in accumulator.py), this special case is unnecessary. Reverting to the previous logic ensures that TPOT is calculated only on tokens generated after the first chunk, maintaining metric precision.

Suggested change
    -if isinstance(self.reasoning, str):
    -    # str reasoning is the fully joined streaming trace — include it
    -    # in the TPOT denominator. Over-counts by one token (the first
    -    # token is not excluded), but the error is negligible in practice.
    -    parts.append(self.reasoning)
    -elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
    -    parts.extend(self.reasoning[1:])
    +if isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
    +    parts.extend(self.reasoning[1:])

Comment on lines 71 to 79
     # All reasoning chunks are joined into a single string so the full
     # thinking trace is captured as-is in events.jsonl. TPOT still uses
     # text_after_first_chunk(), which includes string reasoning in the
     # denominator (off by one token vs. the true "after first chunk"
     # count, which is negligible).
     text_output = TextModelOutput(
         output="".join(self.output_chunks),
    -    reasoning=resp_reasoning,
    +    reasoning="".join(self.reasoning_chunks),
     )

medium

Joining all reasoning chunks into a single string causes a loss of precision in TPOT (Time Per Output Token) calculations. The text_after_first_chunk() method in TextModelOutput relies on the reasoning being a tuple of chunks to correctly exclude the first chunk from the denominator. By joining them here, the first token is included in the TPOT calculation, leading to inaccurate metrics.

Since TextModelOutput already handles joining chunks in its __str__ method (used for logging and scoring), it is better to preserve the chunks as a tuple.

            # Preserve reasoning chunks as a tuple to allow accurate TPOT 
            # calculation (excluding the first chunk). TextModelOutput 
            # handles joining for display/logging via __str__.
            text_output = TextModelOutput(
                output=tuple(self.output_chunks),
                reasoning=tuple(self.reasoning_chunks),
            )


Copilot AI left a comment


Pull request overview

This PR extends the evaluation and benchmarking pipeline to better support DeepSeek-V4-style workflows and MCQ datasets by adding a new letter-based extractor, passing chat template kwargs through the OpenAI msgspec adapter, and improving benchmark runtime robustness around tokenizer handling and scoring.

Changes:

  • Add LetterExtractor (A–J) and unit tests; update ABCDExtractor doc example.
  • Allow passing chat_template_kwargs through OpenAI msgspec request types/adapter and add AIME25 presets that enable “thinking” mode via static columns.
  • Add tokenizer override + tokenizer-loading fallback behavior; make accuracy scoring continue on per-scorer failures and adjust reasoning aggregation behavior.
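To make the chat_template_kwargs passthrough concrete, here is a hedged sketch of an OpenAI-style request body. Only the chat_template_kwargs field and the "thinking" key come from this PR; the model name and message content are placeholders:

```python
# Hypothetical payload; only "chat_template_kwargs" and the "thinking"
# key mirror this PR (per the AIME25 presets), the rest is illustrative.
request = {
    "model": "deepseek-v4",  # placeholder model name
    "messages": [{"role": "user", "content": "Answer with a single letter."}],
    # Passed through the msgspec adapter to the serving engine's chat template.
    "chat_template_kwargs": {"thinking": True},
}
```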

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/unit/evaluation/test_extractor.py | Adds unit tests for the new LetterExtractor and applies unit markers/import updates. |
| src/inference_endpoint/evaluation/extractor.py | Introduces LetterExtractor and updates the ABCDExtractor example text. |
| src/inference_endpoint/openai/types.py | Adds chat_template_kwargs to the msgspec ChatCompletionRequest. |
| src/inference_endpoint/openai/openai_msgspec_adapter.py | Passes chat_template_kwargs from dataset rows through to endpoint requests. |
| src/inference_endpoint/dataset_manager/transforms.py | Adjusts AddStaticColumns to handle dict/list values without pandas index-alignment issues. |
| src/inference_endpoint/dataset_manager/predefined/aime25/presets.py | Adds presets that set chat_template_kwargs for thinking/budget tokens. |
| src/inference_endpoint/openai/accumulator.py | Changes reasoning aggregation to a single joined string for event logging. |
| src/inference_endpoint/core/types.py | Updates TPOT denominator logic to account for string reasoning. |
| src/inference_endpoint/config/schema.py | Adds ModelParams.tokenizer_name for tokenizer override configuration. |
| src/inference_endpoint/commands/benchmark/execute.py | Uses the tokenizer override, changes accuracy-phase load_pattern behavior, and continues scoring on failures. |
| src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py | Adds a fallback to PreTrainedTokenizerFast when AutoTokenizer fails. |


Comment on lines 374 to 378
     min_sample_count=acc_ds.num_samples() * acc_ds.repeats,
     rng_sched=ctx.rt_settings.rng_sched,
     rng_sample_index=ctx.rt_settings.rng_sample_index,
    -load_pattern=LoadPattern(type=LoadPatternType.MAX_THROUGHPUT),
    +load_pattern=ctx.rt_settings.load_pattern,
     )

Copilot AI May 1, 2026


In accuracy phases, load_pattern=ctx.rt_settings.load_pattern means accuracy evaluation will be throttled by the main run’s load pattern (e.g., Poisson target QPS / concurrency), which can make scoring unnecessarily slow. Previously this forced MAX_THROUGHPUT; consider keeping Burst/MAX_THROUGHPUT for accuracy phases (or making it explicitly configurable) to avoid long eval runs.
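A minimal sketch of the reviewer's alternative, reusing the LoadPattern/LoadPatternType names from the diff; the accuracy_load_pattern field is hypothetical, not part of the PR:

```python
# Hypothetical: default accuracy phases to max throughput, but allow an
# explicit per-phase override from runtime settings if one is configured.
accuracy_override = getattr(ctx.rt_settings, "accuracy_load_pattern", None)
load_pattern = accuracy_override or LoadPattern(type=LoadPatternType.MAX_THROUGHPUT)
```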

Comment on lines +306 to +309
    tokenizer_name = tokenizer_override
    logger.info(
        f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
    )

Copilot AI May 1, 2026


The log line says “Tokenizer available for model … (override: …)” but when tokenizer_override is set you’re not actually checking that the override exists/is loadable. This message is misleading and can confuse debugging; consider logging that an override is being used (and optionally validating the override with _check_tokenizer_exists(tokenizer_override) before proceeding).

Suggested change
    -tokenizer_name = tokenizer_override
    -logger.info(
    -    f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
    -)
    +if _check_tokenizer_exists(tokenizer_override):
    +    tokenizer_name = tokenizer_override
    +    logger.info(
    +        f"Using tokenizer override for model: {model_name} ({tokenizer_override})"
    +    )
    +else:
    +    tokenizer_name = None
    +    logger.warning(
    +        f"Tokenizer override not available for model: {model_name} ({tokenizer_override})"
    +    )

        )
    except Exception as e:
        scoring_failed = True
        logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")

Copilot AI May 1, 2026


except Exception as e: logger.error(...) logs only str(e) and drops the traceback, which makes scorer failures hard to diagnose in CI/user runs. Consider including exc_info=True (and/or logging the exception type) so the report includes actionable stack traces while still continuing past failures.

Suggested change
    -logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
    +logger.error(
    +    "Scoring failed for %s: %s: %s",
    +    eval_cfg.dataset_name,
    +    type(e).__name__,
    +    e,
    +    exc_info=True,
    +)

Comment thread src/inference_endpoint/core/types.py Outdated
Comment on lines +136 to +141
    if isinstance(self.reasoning, str):
        # str reasoning is the fully joined streaming trace — include it
        # in the TPOT denominator. Over-counts by one token (the first
        # token is not excluded), but the error is negligible in practice.
        parts.append(self.reasoning)
    elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:

Copilot AI May 1, 2026


text_after_first_chunk() now treats reasoning as a joined str for streaming cases, but the method docstring above still states that for non-streaming (str fields) it returns an empty string. Please update the docstring (and ideally the TextModelOutput field docs) to reflect that str reasoning may also represent a streaming trace, otherwise future callers may make incorrect assumptions.
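One way the docstring could be brought in line with the new semantics; this is a sketch, not the PR's actual wording:

```python
def text_after_first_chunk(self) -> str:
    """Return the text counted in the TPOT denominator.

    Tuple fields are streaming chunks, and the first chunk is excluded
    exactly. A str reasoning field may also be a fully joined streaming
    trace; it is included whole, over-counting by the first chunk.
    """
```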

Comment on lines +79 to +91
    try:
        self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
            self._tokenizer_name
        )
    except Exception:
        # AutoTokenizer loads config.json to detect the model type; for
        # models with unknown model_type (e.g. deepseek_v4 in older
        # transformers) or missing rope config fields, this fails.
        # Fall back to PreTrainedTokenizerFast which reads only
        # tokenizer.json / tokenizer_config.json and skips model config.
        self._thread_local.tokenizer = PreTrainedTokenizerFast.from_pretrained(
            self._tokenizer_name
        )

Copilot AI May 1, 2026


Catching a broad Exception from AutoTokenizer.from_pretrained without logging can mask unrelated configuration issues (e.g., missing files, permission errors, offline mode) and make token-metric failures opaque. Consider catching a narrower set of expected failures (or at least logging the original exception at debug/warning level) before falling back to PreTrainedTokenizerFast.
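A hedged sketch of the "at least log the original exception" option, assuming a module-level logger exists in token_metrics.py; it keeps the broad catch from the PR but records why AutoTokenizer failed before falling back:

```python
try:
    self._thread_local.tokenizer = AutoTokenizer.from_pretrained(
        self._tokenizer_name
    )
except Exception as e:
    # Surface the root cause so unrelated failures (missing files,
    # offline mode, permissions) are not silently masked.
    logger.warning(
        "AutoTokenizer failed for %s (%s: %s); falling back to "
        "PreTrainedTokenizerFast",
        self._tokenizer_name, type(e).__name__, e,
    )
    self._thread_local.tokenizer = PreTrainedTokenizerFast.from_pretrained(
        self._tokenizer_name
    )
```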

Comment on lines +126 to +131
    # Wrap dict/list values in a list so pandas doesn't try to align
    # on index keys (e.g. {"thinking": True} would produce NaN otherwise).
    if isinstance(value, (dict, list)):
        df[key] = [value] * len(df)
    else:
        df[key] = value

Copilot AI May 1, 2026


AddStaticColumns now broadcasts dict/list values by doing [value] * len(df), which repeats the same mutable object reference for every row. If anything downstream mutates one row’s dict/list (e.g., chat_template_kwargs), it will silently affect all rows. Consider using per-row copies (e.g., deepcopy) or constructing a fresh object per row to avoid shared mutable state.
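A minimal sketch of the per-row-copy fix, using the same df/key/value names as the diff:

```python
import copy

if isinstance(value, (dict, list)):
    # One independent copy per row, so mutating one row's dict/list
    # cannot silently change every other row.
    df[key] = [copy.deepcopy(value) for _ in range(len(df))]
else:
    df[key] = value
```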

Comment on lines +126 to +131
    # Wrap dict/list values in a list so pandas doesn't try to align
    # on index keys (e.g. {"thinking": True} would produce NaN otherwise).
    if isinstance(value, (dict, list)):
        df[key] = [value] * len(df)
    else:
        df[key] = value

Copilot AI May 1, 2026


This change adds special handling for dict/list static values, but the existing unit tests for AddStaticColumns don’t cover dict/list inputs. Please add a regression test ensuring dict/list values are preserved per-row (and not converted to NaN via pandas alignment).
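A hedged sketch of such a regression test; the AddStaticColumns constructor and call signature here are assumptions to adapt to the real API:

```python
import pandas as pd

def test_add_static_columns_preserves_dict_per_row():
    df = pd.DataFrame({"prompt": ["a", "b", "c"]})
    # Hypothetical invocation; mirror how AddStaticColumns is used in the suite.
    transform = AddStaticColumns({"chat_template_kwargs": {"thinking": True}})
    out = transform(df)
    # Each row keeps the dict value rather than NaN from pandas index alignment.
    assert list(out["chat_template_kwargs"]) == [{"thinking": True}] * 3
```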

arekay-nv added 2 commits May 1, 2026 08:32
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 1, 2026 16:15

Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 19 changed files in this pull request and generated 4 comments.



Comment on lines +10 to +11
    # - GPQA 198 samples abcd_extractor
    # - MMLU-Pro 2410 samples abcd_extractor

Copilot AI May 1, 2026


The header comment says GPQA and MMLU-Pro use abcd_extractor, but the actual dataset configs below correctly use letter_extractor. Please update the header comment to match the configuration so readers don’t pick the wrong extractor when adapting this example.

Suggested change
    -# - GPQA 198 samples abcd_extractor
    -# - MMLU-Pro 2410 samples abcd_extractor
    +# - GPQA 198 samples letter_extractor
    +# - MMLU-Pro 2410 samples letter_extractor

Comment on lines +127 to +134
### Step 2 — Run math + MCQ accuracy

Uncomment MMLU-Pro in [`vllm_dsv4pro_mlperf_accuracy.yaml`](vllm_dsv4pro_mlperf_accuracy.yaml), then:

```bash
uv run inference-endpoint benchmark from-config \
-c examples/09_DeepSeek-V4-Pro_Example/vllm_dsv4pro_mlperf_accuracy.yaml
```

Copilot AI May 1, 2026


This section instructs to “Uncomment MMLU-Pro” in the MLPerf config, but mlperf-mmlu-pro is already enabled in vllm_dsv4pro_mlperf_accuracy.yaml. Please adjust the instruction (or comment out that dataset in the YAML) to keep the example steps consistent.

Comment on lines +71 to +75
    # All reasoning chunks are joined into a single string so the full
    # thinking trace is captured as-is in events.jsonl. TPOT still uses
    # text_after_first_chunk(), which includes string reasoning in the
    # denominator (off by one token vs. the true "after first chunk"
    # count, which is negligible).

Copilot AI May 1, 2026


The comment claims the TPOT denominator error is only “off by one token”, but joining all reasoning chunks into a single string loses the first-chunk boundary and can over-count by all tokens in the first streamed reasoning chunk (SSE deltas often contain multiple tokens). If TPOT accuracy matters, keep reasoning as chunked data (e.g., tuple with the first chunk separated) or record the first-chunk split explicitly so text_after_first_chunk() can exclude it precisely while still writing a joined reasoning string to events.jsonl for readability.
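A sketch of the "record the first-chunk split explicitly" option; it relies on the tuple handling that text_after_first_chunk() already has (extending reasoning[1:]):

```python
# Keep the first-chunk boundary: a 2-tuple (first, rest) lets
# text_after_first_chunk() exclude the first streamed chunk exactly,
# while events.jsonl can still log one joined reasoning string.
first = self.reasoning_chunks[0] if self.reasoning_chunks else ""
rest = "".join(self.reasoning_chunks[1:])
text_output = TextModelOutput(
    output="".join(self.output_chunks),
    reasoning=(first, rest),
)
```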

Comment thread src/inference_endpoint/core/types.py Outdated
Comment on lines 136 to 142
    if isinstance(self.reasoning, str):
        # str reasoning is the fully joined streaming trace — include it
        # in the TPOT denominator. Over-counts by one token (the first
        # token is not excluded), but the error is negligible in practice.
        parts.append(self.reasoning)
    elif isinstance(self.reasoning, tuple) and len(self.reasoning) > 1:
        parts.extend(self.reasoning[1:])

Copilot AI May 1, 2026


text_after_first_chunk() now treats str reasoning as streaming output and includes the entire reasoning string in the denominator, which contradicts the method’s docstring (it says str fields are non-streaming and should return an empty string) and still can’t actually exclude the first streamed chunk when reasoning is stored as a joined str. Consider updating the docstring to match the new semantics and/or preserving chunk boundaries (tuple) so the “after first chunk” calculation remains exact.

arekay-nv added 2 commits May 1, 2026 13:23
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 4, 2026 18:00

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.



    -df[key] = value
    +# Wrap dict/list values in a list so pandas doesn't try to align
    +# on index keys (e.g. {"thinking": True} would produce NaN otherwise).
    +if isinstance(value, dict | list):
Comment on lines +303 to +310
    tokenizer_override = config.model_params.tokenizer_name
    tokenizer_name: str | None
    if tokenizer_override:
        tokenizer_name = tokenizer_override
        logger.info(
            f"Tokenizer available for model: {model_name} (override: {tokenizer_override})"
        )
    else:
        )
    except Exception as e:
        scoring_failed = True
        logger.error(f"Scoring failed for {eval_cfg.dataset_name}: {e}")
Comment on lines 103 to 107
    logit_bias: dict[str, float] | None = None
    user: str | None = None
    chat_template: str | None = None
    chat_template_kwargs: dict[str, Any] | None = None

Comment on lines +10 to +11
    # - GPQA 198 samples abcd_extractor
    # - MMLU-Pro 2410 samples abcd_extractor

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants