Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces a formal warmup phase to the benchmark execution process, replacing the previous environment variable-based implementation. Key changes include the addition of a WarmupConfig schema, a SaltedDataset wrapper that injects random salts into prompts to prevent KV-cache reuse, and a drain_after configuration to manage phase transitions. The EchoServer and testing fixtures were also updated to support custom request handling for new integration tests. Review feedback identifies missing imports for os and logging in dataset.py and notes that the SaltedDataset constructor signature is incompatible with its base class, which could break inherited class methods.
```python
def __init__(self, inner: Dataset) -> None:
    # Skip Dataset.__init__ — all state is delegated to inner
    self._inner = inner
    self.data = inner.data
    self.dataframe = None
    self.transforms = None
    self.repeats = inner.repeats
    self.logger = getLogger(__name__)
```
SaltedDataset inherits from Dataset but its __init__ method signature is incompatible with the base class. While it works as a wrapper for an existing instance in the current implementation, it breaks the contract for any code that expects to instantiate a Dataset (or subclass) using the standard signature (e.g., df, transforms, repeats).
Specifically, calling the inherited @classmethod get_dataloader on SaltedDataset will fail with a TypeError because it attempts to call the constructor with arguments that SaltedDataset.__init__ does not accept. Additionally, setting self.dataframe and self.transforms to None might break other inherited methods that expect these attributes to be populated.
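The failure mode the comment describes can be reproduced in isolation. This is a minimal, self-contained sketch (simplified names, not the project's actual classes) showing why an incompatible subclass `__init__` breaks inherited alternate constructors:

```python
class Dataset:
    def __init__(self, data, repeats=1):
        self.data = data
        self.repeats = repeats

    @classmethod
    def from_list(cls, items):
        # Inherited alternate constructor: calls cls(...) with the
        # base-class signature — just like get_dataloader does.
        return cls(list(items), repeats=1)


class SaltedDataset(Dataset):
    def __init__(self, inner):  # signature incompatible with Dataset
        self._inner = inner
        self.data = inner.data
        self.repeats = inner.repeats


Dataset.from_list([1, 2])            # fine
try:
    SaltedDataset.from_list([1, 2])  # cls(...) passes kwargs __init__ rejects
except TypeError as e:
    print("TypeError:", e)
```

Composition (a plain wrapper that does not subclass `Dataset`) or overriding the alternate constructors would both avoid this trap.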
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
arekay-nv
left a comment
Review Council — Multi-AI Code Review
Reviewed by Codex + Claude | Depth: thorough
Found 14 issues across 6 files: 0 critical, 3 high, 4 medium, 7 low.
See inline comments for details. A grouped summary will follow as a separate top-level comment.
```diff
 def get_dataloader(
     cls,
-    datasets_dir: Path = Path("datasets"),
+    datasets_dir: Path = Path("dataset_cache"),
```
[Review Council — Claude] high · api-contract
Default `datasets_dir` changed from `Path("datasets")` to `Path("dataset_cache")`:

```python
datasets_dir: Path = Path("dataset_cache"),
```

This is a silent breaking change. Any existing user who relied on the previous default will, after upgrade, fail to find the cached dataframe under `datasets/<DATASET_ID>/<variant>` and will silently re-download/re-generate (potentially many GB and minutes of work). The `.gitignore` change only ignores `dataset_cache/`, leaving any pre-existing `datasets/` cache uncovered.
Suggest: keep Path("datasets") as the default, OR add a fallback that reads from datasets/ if dataset_cache/ is empty, OR document the migration prominently. At minimum, decouple this rename from the warmup PR — it's an orthogonal change.
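The fallback option could look like the following sketch. This is a hypothetical helper (the name `resolve_datasets_dir` and its shape are not from the PR); it prefers the new directory but reuses a non-empty legacy cache instead of silently re-generating:

```python
from pathlib import Path


def resolve_datasets_dir(
    new_default: Path = Path("dataset_cache"),
    legacy: Path = Path("datasets"),
) -> Path:
    """Prefer the new cache dir, but fall back to a populated legacy cache."""
    if new_default.is_dir() and any(new_default.iterdir()):
        return new_default
    if legacy.is_dir() and any(legacy.iterdir()):
        return legacy  # reuse the old cache rather than re-download
    return new_default  # fresh install: use the new default
```

A log line at the fallback point ("using legacy datasets/ cache; migrate to dataset_cache/") would make the migration visible without breaking anyone.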
```python
) -> None:
    pass  # Inner dataset already loaded

def load_sample(self, index: int) -> Any:
```
[Review Council — Claude] high · bug
SaltedDataset.load_sample() only mutates data["prompt"], but several adapters bypass the prompt entirely. In src/inference_endpoint/sglang/adapter.py:90:
```python
input_tokens = query.data["input_tokens"]
return cls._request_encoder.encode(
    SGLangGenerateRequest(input_ids=input_tokens, ...))
```

When `salt=True` is used with a tokenized dataset (any pipeline that pre-tokenizes — common in MLPerf submissions), the salt prefix is added to the (unused) text but `input_tokens` are unchanged, so the server receives the same token-ID sequence for every warmup request. KV-cache reuse is NOT prevented. The class docstring's claim that "each call to load_sample() generates a fresh salt, so reused samples … each receive a distinct salt" is silently violated.
Fix options: (a) detect input_tokens/token_ids keys and either drop them or re-tokenize after salting; (b) raise an error at warmup setup if salt=True and the inner dataset has tokenized fields; (c) explicitly document the restriction.
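Option (b) is the cheapest to implement. A hedged sketch (the helper name and error text are illustrative, not the project's code) of failing fast at warmup setup:

```python
# Keys that indicate the dataset is pre-tokenized; names follow the
# review's examples and may not match the real schema exactly.
TOKENIZED_KEYS = ("input_tokens", "token_ids")


def check_saltable(sample: dict) -> None:
    """Raise at warmup setup if salting cannot actually bypass the KV-cache."""
    present = [k for k in TOKENIZED_KEYS if k in sample]
    if present:
        raise ValueError(
            f"warmup.salt=True cannot bypass KV-cache: samples carry "
            f"pre-tokenized fields {present}; the salted prompt text "
            "would be ignored by the adapter."
        )


check_saltable({"prompt": "hello"})  # ok — text-only sample
try:
    check_saltable({"prompt": "hello", "input_tokens": [1, 2, 3]})
except ValueError as e:
    print(e)
```

Failing at setup turns a silent benchmarking-validity bug into an immediate, actionable error.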
```python
salt = os.urandom(8).hex()
if isinstance(prompt, str):
    return {**data, "prompt": f"[{salt}] {prompt}"}
if (
```
[Review Council — Both (Codex P2 + Claude)] medium · bug
Multimodal salting only fires when prompt[0] is a text part:
```python
if (isinstance(prompt, list) and prompt and isinstance(prompt[0], dict)
        and prompt[0].get("type") == "text"):
    salted_parts = [{**prompt[0], "text": f"[{salt}] ..."}, *prompt[1:]]
```

Real-world OpenAI-style multimodal prompts often start with an image/audio part and place text at index 1+ ("look at this image, then answer the question"). For those, salting is silently skipped (falls through to the `return data` at line 478) and `warmup.salt=true` does NOT prevent KV-cache reuse for those requests. This breaks the advertised warmup behavior for a common multimodal input shape.
Fix: scan the list for the first dict with type == "text" (or salt all text parts) instead of only inspecting prompt[0].
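The "first text part at any index" variant can be sketched as a standalone helper (not the project's actual code; the surrounding `load_sample` wiring is omitted):

```python
import os


def salt_multimodal(parts: list, salt: str) -> list:
    """Prefix the first text part with the salt, wherever it appears."""
    for i, part in enumerate(parts):
        if isinstance(part, dict) and part.get("type") == "text":
            salted = {**part, "text": f"[{salt}] {part['text']}"}
            # Rebuild the list so the inner dataset's sample is not mutated.
            return [*parts[:i], salted, *parts[i + 1:]]
    return parts  # no text part found — caller should log a warning


prompt = [
    {"type": "image_url", "image_url": {"url": "http://x/img.png"}},
    {"type": "text", "text": "describe this image"},
]
print(salt_multimodal(prompt, os.urandom(8).hex())[1]["text"])
```

Salting all text parts instead of only the first would be a one-line change (accumulate instead of early-return), at the cost of slightly longer prompts.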
```python
try:
    pool_cm = TokenizePool(args.tokenizer, n_workers=args.tokenizer_workers)
except Exception as e:
    logging.warning(
```
[Review Council — Claude] low · error-handling
Tokenizer load failure is silently downgraded to "no token metrics":
```python
except Exception as e:
    logging.warning(
        f"Failed to load tokenizer '{args.tokenizer}': {e}. "
        "ISL/OSL/TPOT token metrics will be unavailable."
    )
    pool_cm = nullcontext()
```

A user who explicitly passes `--tokenizer X` and thus expects token metrics will instead see a single WARNING line buried in the log, then a silent benchmark run with empty ISL/OSL/TPOT columns. By the time they check the report, the run is over.
Fix: re-raise unless an explicit --allow-tokenizer-fallback flag is set, OR at least log at ERROR level and surface the missing-tokenizer state in the final report. Also use the module logger (logger = logging.getLogger(__name__)) rather than the root logging module.
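The suggested behavior could look like this sketch. `load_tokenizer` is a hypothetical stand-in for the real `TokenizePool(...)` call, and the flag name follows the review's `--allow-tokenizer-fallback` suggestion:

```python
import logging
from contextlib import nullcontext

logger = logging.getLogger(__name__)  # module logger, not the root logger


def load_tokenizer(name: str):
    # Stub standing in for TokenizePool(...); always fails here so both
    # code paths can be demonstrated.
    raise RuntimeError(f"tokenizer files for {name!r} not found")


def make_pool(tokenizer: str, allow_fallback: bool):
    try:
        return load_tokenizer(tokenizer)
    except Exception as e:
        if not allow_fallback:
            raise  # user explicitly asked for token metrics: fail loudly
        logger.error(  # ERROR, not WARNING — this degrades the report
            "Failed to load tokenizer %r: %s. "
            "ISL/OSL/TPOT token metrics will be unavailable.",
            tokenizer, e,
        )
        return nullcontext()
```

With the strict default, a typo in `--tokenizer` stops the run immediately instead of being discovered after a completed (and wasted) benchmark.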
```python
if not (isinstance(data, dict) and "prompt" in data):
    return data
prompt = data["prompt"]
salt = os.urandom(8).hex()
```
[Review Council — Claude] low · design
salt = os.urandom(8).hex() on every load_sample() ignores the seeded rng_sample_index in RuntimeSettings. Reproducibility-conscious users who set dataloader_random_seed to compare runs will see different salted prompts on each invocation. Warmup metrics aren't reported, but the actual prompts sent to the server (and thus server-side caches/stats) are nondeterministic.
Fix: thread the rng_sample_index (or a derived random.Random) through SaltedDataset.__init__ and use f'{rng.getrandbits(64):016x}' to keep determinism with a non-zero seed.
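The seeded variant is a small change. A sketch of a salt factory built on a seeded `random.Random` (the factory shape is illustrative; the 16-hex-char width matches `os.urandom(8).hex()`):

```python
import random


def make_salter(seed: int):
    """Return a zero-arg callable producing deterministic 16-hex-char salts."""
    rng = random.Random(seed)
    return lambda: f"{rng.getrandbits(64):016x}"


salt_a = make_salter(42)
salt_b = make_salter(42)
# Same seed → identical salt sequence → reproducible warmup prompts.
assert [salt_a() for _ in range(3)] == [salt_b() for _ in range(3)]
# Different seeds diverge, so distinct runs still defeat the KV-cache.
assert make_salter(42)() != make_salter(43)()
```

Threading the existing `rng_sample_index` (or a `Random` derived from `dataloader_random_seed`) into `SaltedDataset.__init__` gives the same property without introducing a new config knob.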
```python
@pytest.fixture
def base_rt_settings(self):
    import random
```
[Review Council — Claude] low · design
TestBuildPhases uses lazy/inline imports inside fixtures and ~15 test method bodies (import random, from inference_endpoint.commands.benchmark.execute import _build_phases, etc.).
AGENTS.md explicitly forbids this:
> All imports must be at the top of the file. Imports inside functions, methods, or conditional blocks (other than TYPE_CHECKING) are disallowed.
None of the documented exceptions apply (no circular import, no optional dep, no sandboxing). Move the imports to the top of the file.
```python
    guaranteeing all overlap events have been recorded before we assert.
    """

_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending
```
[Review Council — Claude] low · testing
`_DELAY = 0.15` is fragile for the overlap-observation tests:

```python
_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending
```

On a heavily-loaded CI runner, the entire warmup batch (5 requests) can plausibly round-trip within 150 ms before the perf phase issues, producing a false "no overlap" and flaking `test_drain_false_overlap_observed` / `test_drain_false_max_concurrent_exceeds_single_phase_size`.
Fix: use a controllable handler that explicitly blocks until the test signals release (e.g. via asyncio.Event) instead of a fixed sleep. If sticking with a sleep, increase to ≥1s.
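The event-based shape can be sketched as follows. The EchoServer wiring is omitted; `warmup_handler` is a hypothetical request handler showing only the blocking/release mechanism that removes the wall-clock dependency:

```python
import asyncio


async def demo() -> str:
    release = asyncio.Event()

    async def warmup_handler(_request):
        # Hold the warmup request open until the test releases it —
        # no fixed sleep, so a slow CI runner cannot close the window early.
        await release.wait()
        return "ok"

    pending = asyncio.create_task(warmup_handler(None))
    await asyncio.sleep(0)     # let the handler start and block on the event
    assert not pending.done()  # the overlap window is provably still open
    release.set()              # test has observed overlap; let warmup finish
    return await pending


print(asyncio.run(demo()))
```

Because the handler cannot complete before `release.set()`, the "warmup still pending when perf starts" condition is guaranteed by construction rather than by timing.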
```python
_SALT_RE = re.compile(r"^\[([0-9a-f]{16})\] (.+)$")

_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
```
[Review Council — Claude] low · testing
All warmup integration tests run with `num_workers=1`:

```python
_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
)
```

None cover the multi-worker warmup path, where ZMQ ordering between worker processes matters and where salted prompts must be distinct across workers. Suggest adding at least one warmup test with `num_workers >= 2` to exercise the multi-process distribution path the production code uses.
Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough | HEAD:

Found 14 issues across 6 files. Inline comments posted in this review. Cross-reviewer agreement (boosted confidence): execute.py:365 (warmup RNG seeding) and dataset.py:467 (multimodal salt) were flagged by both Codex and Claude.

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Theme summary: The warmup feature lands the structural plumbing well, but several correctness/reproducibility properties the YAML knobs claim to provide are not actually enforced end-to-end.

🤖 Posted by
- Add `random_seed` field to WarmupConfig for deterministic warmup scheduling
- Use `dataclasses.replace` + warmup seed for warmup RuntimeSettings
- Fix SaltedDataset.data to be a property (avoids stale snapshot after inner reload)
- Fix multimodal salting to find first text part at any index (handles image-first prompts)
- Log warning when input_tokens present without prompt (salting not possible)
- Fix ruff-format CI failure in test_async_session.py
- Move inline imports to top of test_benchmark.py (AGENTS.md compliance)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
    return cls(df, transforms=transforms, repeats=num_repeats)


class SaltedDataset(Dataset):
```
[Review Council — Claude] medium · design
SaltedDataset inherits from Dataset without opting out of __init_subclass__ registration. Dataset.__init_subclass__ (line 276) registers every concrete subclass under its class name in Dataset.PREDEFINED:
```python
cls.DATASET_ID = dataset_id  # → "SaltedDataset"
Dataset.PREDEFINED[dataset_id] = cls
```

`SaltedDataset.__init__` takes `inner: Dataset` (not `dataframe: pd.DataFrame`), so any code that tries to instantiate it through the registry (e.g., DataLoaderFactory or the YAML dataset_id lookup) will raise a TypeError with a confusing error. Fix: pass `dataset_id=None` to explicitly opt out:

```python
class SaltedDataset(Dataset, dataset_id=None):
```

```python
        *prompt[1:],
    ]
    return {**data, "prompt": salted_parts}
return data  # unsupported prompt type — skip salting
```
[Review Council — Claude] medium · bug
When the prompt is a multimodal list where ALL parts are non-text (e.g., all image_url), the loop finds no text part and falls through to return data with no warning:
```python
if isinstance(prompt, list) and prompt:
    for i, part in enumerate(prompt):
        if isinstance(part, dict) and part.get("type") == "text":
            ...
            return {**data, "prompt": salted_parts}
return data  # unsupported prompt type — skip salting (silent!)
```

Unlike the `input_tokens` case (which emits a `logger.warning`), this fallback is completely silent. Users who enable `salt=True` for image-only multimodal datasets will not see any indication that KV-cache bypass is not functioning. Add a `logger.warning` consistent with the `input_tokens` path.
```python
n_samples_from_dataset=ctx.dataloader.num_samples(),
n_samples_to_issue=warmup_cfg.n_requests,
min_sample_count=1,
rng_sched=random.Random(),
```
[Review Council — Claude] medium · design
After fixing the unseeded RNGs, both rng_sched and rng_sample_index for the warmup phase will be seeded with the same value (warmup_cfg.random_seed, default 0). This means both RNGs start in identical state and produce identical pseudo-random sequences — the scheduler and sample-index selection will be perfectly correlated.
Compare with the perf phase in RuntimeSettings.from_config which uses separate fields scheduler_random_seed and dataloader_random_seed. Consider using random_seed for one and random_seed ^ 0xDEAD (or adding warmup_scheduler_random_seed to WarmupConfig) for the other, so their sequences diverge.
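The divergence can be demonstrated directly. A sketch of a seed-derivation helper using the review's `^ 0xDEAD` example (the helper name is illustrative):

```python
import random


def warmup_rngs(random_seed: int):
    """Derive two decorrelated RNGs from one warmup seed."""
    return (
        random.Random(random_seed),           # rng_sched
        random.Random(random_seed ^ 0xDEAD),  # rng_sample_index
    )


sched, sample = warmup_rngs(0)
# With identical seeds these two sequences would match element-for-element;
# the XOR'd seed makes scheduling and sample selection independent.
assert [sched.random() for _ in range(5)] != [sample.random() for _ in range(5)]
```

A dedicated `warmup_scheduler_random_seed` field would be more explicit, at the cost of one more knob; the XOR derivation keeps the config surface unchanged.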
```python
sd = SaltedDataset(inner)
assert sd.load_sample(0) == {}


def test_non_dict_sample_is_returned_as_is(self):
```
[Review Council — Claude] low · testing
`sd.data = inner.data` is redundant now that `SaltedDataset.data` is a property that delegates to `self._inner.data`. The assignment calls the property setter, which sets `self._inner.data = inner.data` — a no-op since they reference the same list. This won't cause a test failure, but it misleads readers into thinking `SaltedDataset.data` is a plain attribute:

```python
inner.data = ["raw string sample"]  # override with non-dict
sd = SaltedDataset(inner)
sd.data = inner.data  # redundant — sd.data property already returns inner.data
```

Remove the `sd.data = inner.data` line.
Review Council — Multi-AI Code Review

Reviewed by Claude (Codex CLI unavailable) | Depth: thorough | HEAD: `1143d64`

Found 7 new issues across 5 files. Inline comments posted for 4; 3 are out-of-diff or post-fix. Cross-reviewer note: issues already addressed in the previous round (unseeded RNGs,

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

🤖 Posted by
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
```python
    return cls(df, transforms=transforms, repeats=num_repeats)


class SaltedDataset(Dataset, register=False):
```
arekay-nv
left a comment
Review Council — Multi-AI Code Review (Round 2)
Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)
Re-reviewing after the author's response commits. Most round-1 findings are addressed; this round focuses on new code (global timeout, warmup_random_seed, register=False) and incomplete fixes.
Found 4 new issues across 3 files: 0 critical, 1 high, 1 medium, 2 low. See inline comments — a tiered summary will follow.
```python
    )
    session.stop()

global_timeout_handle = loop.call_later(
```
[Review Council — Both (Codex P1 + Claude)] high · api-contract
The new global timer redefines what max_duration_ms means, but only when warmup is enabled, and the schema description was not updated.
```python
max_duration_ms = ctx.rt_settings.max_duration_ms
if max_duration_ms is not None:
    ...
    global_timeout_handle = loop.call_later(
        max_duration_ms / 1000.0, _on_global_timeout
    )
```

session.py:_make_stop_check still enforces `max_duration_ms` per-phase against `phase_start_ns`, so we now have two clocks running on the same configured value:
- per-phase, from each phase's start (the original semantic — and still what `_build_phases` clears to `None` for warmup/accuracy).
- global, from `loop.call_later` scheduling time (warmup + perf + accuracy combined).
For a user with warmup.enabled=True and max_duration_ms=600_000 (10 min), if warmup takes 90s, the perf phase's effective wall-clock budget collapses from 10 min to ~8.5 min — silently. Worse, an accuracy pass in TestMode.BOTH (which sets max_duration_ms=None per-phase, expecting unbounded) can now be truncated mid-run by the global timer.
The schema description still reads "Maximum test duration in ms (0 for no limit)", which is the original per-phase meaning. Existing YAML configs will get shorter perf/accuracy phases once they enable warmup, with no documentation indicating why.
Suggested fixes (any one):
- Schedule the timer at the start of the performance phase (e.g., on `START_PERFORMANCE_TRACKING`) so warmup time doesn't eat the perf budget — preserves the per-phase semantic.
- Expose a separate `warmup.max_duration_ms` to bound only warmup, and leave the perf-phase `max_duration_ms` untouched.
- If a global cap is genuinely the goal, rename it (`experiment_timeout_ms`?), keep `max_duration_ms` per-phase, and document the change in `WarmupConfig` and the release notes.
```diff
-    throughout the current phase."""
+    Bounded by the global experiment timeout: if the caller schedules a
+    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
+    this wait without leaving it hung indefinitely."""
```
[Review Council — Claude] medium · error-handling
The new docstring on _drain_inflight claims a safety guarantee that only holds in some configurations:
```python
async def _drain_inflight(self, phase_issuer: PhaseIssuer) -> None:
    """Wait for all in-flight responses from this phase to complete.

    Bounded by the global experiment timeout: if the caller schedules a
    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
    this wait without leaving it hung indefinitely."""
    ...
    self._drain_event.clear()
    await self._drain_event.wait()
```

The `loop.call_later` is only scheduled in execute.py:561 if `ctx.rt_settings.max_duration_ms is not None`. `runtime_settings.from_config` maps `max_duration_ms == 0` to `None`, and the schema default is 0 — so the default offline/online CLI run has no global timer, and `_drain_inflight` retains its old unbounded `await self._drain_event.wait()`. A single dropped or hanging request will hang the benchmark forever.
The docstring suggests this case is solved; it isn't.
Suggested fixes:
- Make the docstring conditional ("if a global timeout is configured…") and add a note that the drain is otherwise unbounded, OR
- Add an intrinsic drain timeout (a fixed cap like 30–60s, or a multiple of observed completion latency) inside `_drain_inflight` itself, so the safety property is a property of the function rather than a contract on the caller.
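The intrinsic-timeout option can be sketched with `asyncio.wait_for`. This is a standalone illustration, not the session's actual code; the 60s default and the boolean return are assumptions:

```python
import asyncio


async def drain_inflight(drain_event: asyncio.Event,
                         timeout_s: float = 60.0) -> bool:
    """Wait for in-flight responses; never blocks past timeout_s."""
    try:
        await asyncio.wait_for(drain_event.wait(), timeout=timeout_s)
        return True   # drained cleanly
    except asyncio.TimeoutError:
        return False  # timed out — caller logs the leak and moves on


async def demo() -> tuple:
    ev = asyncio.Event()
    timed_out = await drain_inflight(ev, timeout_s=0.01)  # nobody sets it
    ev.set()
    drained = await drain_inflight(ev, timeout_s=0.01)    # already set
    return timed_out, drained


print(asyncio.run(demo()))
```

With this shape, the no-hang guarantee holds for every configuration, including the default `max_duration_ms=0` case where no global timer exists.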
| "Global experiment timeout reached (%d ms); stopping session.", | ||
| max_duration_ms, | ||
| ) | ||
| session.stop() |
[Review Council — Claude] low · concurrency
The global timer's callback calls session.stop() directly, with no guard against firing after cleanup has begun:
```python
def _on_global_timeout() -> None:
    logger.warning(
        "Global experiment timeout reached (%d ms); stopping session.",
        max_duration_ms,
    )
    session.stop()
```

`global_timeout_handle.cancel()` only runs after entering the `finally:` block at line 580, so the timer can fire during the gap between `await session.run(phases)` returning and the `cancel()` call running — or worse, during the slow cleanup that follows (`http_client.shutdown_async()`, publisher close, etc.). When that happens, `session.stop()` flips `_stop_requested=True` and `_drain_event.set()` on a session that has already finished. The downstream effect is mostly harmless today (the strategy task is None), but the race is real and easy to make harmful with future refactors.
Suggested fix:
```python
def _on_global_timeout() -> None:
    if session._done:  # or expose a public flag
        return
    logger.warning(...)
    session.stop()
```

Also consider moving `global_timeout_handle.cancel()` to immediately after `session.run(phases)` returns (success path), and only running it in `finally:` to handle the exception path.
```python
    for _ in range(n_draw)
]

assert order_with == order_without, (
```
[Review Council — Claude] low · testing
test_performance_sample_order_identical_with_and_without_warmup does not actually exercise the property it claims to test:
```python
def make_rt():
    return RuntimeSettings(
        ...
        rng_sched=random.Random(99),
        rng_sample_index=random.Random(99),
        ...
    )

ctx_with = self._make_ctx(config_with, make_rt(), simple_dataset)
ctx_without = self._make_ctx(config_without, make_rt(), simple_dataset)
```

`make_rt()` is called twice and produces two separate RuntimeSettings instances with freshly-seeded `Random(99)` RNGs. `_build_phases` in the warmup path uses `dataclass_replace`, which substitutes new `Random(seed)` instances on the warmup phase only — it never touches the perf phase's RNGs. So this test only verifies that `_build_phases` doesn't drain the perf RNG, which it never could (it only stores references). It does not verify the regression it claims to guard against: that executing the warmup phase leaves the perf-phase RNG untouched.
A tighter test would either:
- Build phases from a single shared `RuntimeSettings`, then advance the warmup phase's `rng_sample_index` (e.g., draw N samples from `create_sample_order(warmup_rt)`), and assert that the perf-phase order matches a control with no warmup, OR
- Drop this test as redundant with `test_warmup_phase_uses_independent_rngs` (which already asserts `warmup_rt.rng_sched is not perf_rt.rng_sched`).

The current implementation gives a false sense of coverage for the actual race the docstring describes.
Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD:

Re-review after the author's response commits. 4 new issues across 3 files. Inline comments in this review. Cross-reviewer agreement (boosted confidence): execute.py:570 — global timeout semantic change — was the only finding both Codex and Claude flagged in round 2.

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Round 1 follow-up status

Theme summary (round 2): The biggest new concern is the global wall-clock timer added to bound the warmup drain. The mechanism works, but it silently changes what `max_duration_ms` means. The drain-safety claim in `_drain_inflight`'s docstring only holds when a global timeout is actually configured.

🤖 Posted by
Plugs in warmup specification with salt via config.

`datasets` as default local folder for cache — causing pre-commit to complain.

Type of change

Related issues

Testing

Checklist