Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces a formal warmup phase to the benchmark execution process, replacing the previous environment variable-based implementation. Key changes include the addition of a WarmupConfig schema, a SaltedDataset wrapper that injects random salts into prompts to prevent KV-cache reuse, and a drain_after configuration to manage phase transitions. The EchoServer and testing fixtures were also updated to support custom request handling for new integration tests. Review feedback identifies missing imports for os and logging in dataset.py and notes that the SaltedDataset constructor signature is incompatible with its base class, which could break inherited class methods.
```python
def __init__(self, inner: Dataset) -> None:
    # Skip Dataset.__init__ — all state is delegated to inner
    self._inner = inner
    self.data = inner.data
    self.dataframe = None
    self.transforms = None
    self.repeats = inner.repeats
    self.logger = getLogger(__name__)
```
SaltedDataset inherits from Dataset but its __init__ method signature is incompatible with the base class. While it works as a wrapper for an existing instance in the current implementation, it breaks the contract for any code that expects to instantiate a Dataset (or subclass) using the standard signature (e.g., df, transforms, repeats).
Specifically, calling the inherited @classmethod get_dataloader on SaltedDataset will fail with a TypeError because it attempts to call the constructor with arguments that SaltedDataset.__init__ does not accept. Additionally, setting self.dataframe and self.transforms to None might break other inherited methods that expect these attributes to be populated.
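The failure mode the comment describes can be reproduced in isolation. This is a minimal, self-contained sketch (simplified names, not the project's actual classes) showing why an incompatible subclass `__init__` breaks inherited alternate constructors:

```python
class Dataset:
    def __init__(self, data, repeats=1):
        self.data = data
        self.repeats = repeats

    @classmethod
    def from_list(cls, items):
        # Inherited alternate constructor: calls cls(...) with the
        # base-class signature — just like get_dataloader does.
        return cls(list(items), repeats=1)


class SaltedDataset(Dataset):
    def __init__(self, inner):  # signature incompatible with Dataset
        self._inner = inner
        self.data = inner.data
        self.repeats = inner.repeats


Dataset.from_list([1, 2])            # fine
try:
    SaltedDataset.from_list([1, 2])  # cls(...) passes kwargs __init__ rejects
except TypeError as e:
    print("TypeError:", e)
```

Composition (a plain wrapper that does not subclass `Dataset`) or overriding the alternate constructors would both avoid this trap.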
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
arekay-nv
left a comment
Review Council — Multi-AI Code Review
Reviewed by Codex + Claude | Depth: thorough
Found 14 issues across 6 files: 0 critical, 3 high, 4 medium, 7 low.
See inline comments for details. A grouped summary will follow as a separate top-level comment.
```diff
 def get_dataloader(
     cls,
-    datasets_dir: Path = Path("datasets"),
+    datasets_dir: Path = Path("dataset_cache"),
```
[Review Council — Claude] high · api-contract
Default `datasets_dir` changed from `Path("datasets")` to `Path("dataset_cache")`:

```python
datasets_dir: Path = Path("dataset_cache"),
```

This is a silent breaking change. Any existing user who relied on the previous default will, after upgrade, fail to find the cached dataframe under `datasets/<DATASET_ID>/<variant>` and will silently re-download/re-generate (potentially many GB and minutes of work). The `.gitignore` change only ignores `dataset_cache/`, leaving any pre-existing `datasets/` cache uncovered.
Suggest: keep Path("datasets") as the default, OR add a fallback that reads from datasets/ if dataset_cache/ is empty, OR document the migration prominently. At minimum, decouple this rename from the warmup PR — it's an orthogonal change.
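The fallback option could look like the following sketch. This is a hypothetical helper (the name `resolve_datasets_dir` and its shape are not from the PR); it prefers the new directory but reuses a non-empty legacy cache instead of silently re-generating:

```python
from pathlib import Path


def resolve_datasets_dir(
    new_default: Path = Path("dataset_cache"),
    legacy: Path = Path("datasets"),
) -> Path:
    """Prefer the new cache dir, but fall back to a populated legacy cache."""
    if new_default.is_dir() and any(new_default.iterdir()):
        return new_default
    if legacy.is_dir() and any(legacy.iterdir()):
        return legacy  # reuse the old cache rather than re-download
    return new_default  # fresh install: use the new default
```

A log line at the fallback point ("using legacy datasets/ cache; migrate to dataset_cache/") would make the migration visible without breaking anyone.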
```python
) -> None:
    pass  # Inner dataset already loaded

def load_sample(self, index: int) -> Any:
```
[Review Council — Claude] high · bug
SaltedDataset.load_sample() only mutates data["prompt"], but several adapters bypass the prompt entirely. In src/inference_endpoint/sglang/adapter.py:90:
```python
input_tokens = query.data["input_tokens"]
return cls._request_encoder.encode(
    SGLangGenerateRequest(input_ids=input_tokens, ...))
```

When `salt=True` is used with a tokenized dataset (any pipeline that pre-tokenizes — common in MLPerf submissions), the salt prefix is added to the (unused) text but `input_tokens` are unchanged, so the server receives the same token-ID sequence for every warmup request. KV-cache reuse is NOT prevented. The class docstring's claim that "each call to load_sample() generates a fresh salt, so reused samples … each receive a distinct salt" is silently violated.
Fix options: (a) detect input_tokens/token_ids keys and either drop them or re-tokenize after salting; (b) raise an error at warmup setup if salt=True and the inner dataset has tokenized fields; (c) explicitly document the restriction.
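Option (b) is the cheapest to implement. A hedged sketch (the helper name and error text are illustrative, not the project's code) of failing fast at warmup setup:

```python
# Keys that indicate the dataset is pre-tokenized; names follow the
# review's examples and may not match the real schema exactly.
TOKENIZED_KEYS = ("input_tokens", "token_ids")


def check_saltable(sample: dict) -> None:
    """Raise at warmup setup if salting cannot actually bypass the KV-cache."""
    present = [k for k in TOKENIZED_KEYS if k in sample]
    if present:
        raise ValueError(
            f"warmup.salt=True cannot bypass KV-cache: samples carry "
            f"pre-tokenized fields {present}; the salted prompt text "
            "would be ignored by the adapter."
        )


check_saltable({"prompt": "hello"})  # ok — text-only sample
try:
    check_saltable({"prompt": "hello", "input_tokens": [1, 2, 3]})
except ValueError as e:
    print(e)
```

Failing at setup turns a silent benchmarking-validity bug into an immediate, actionable error.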
```python
salt = os.urandom(8).hex()
if isinstance(prompt, str):
    return {**data, "prompt": f"[{salt}] {prompt}"}
if (
```
[Review Council — Both (Codex P2 + Claude)] medium · bug
Multimodal salting only fires when prompt[0] is a text part:
```python
if (isinstance(prompt, list) and prompt and isinstance(prompt[0], dict)
        and prompt[0].get("type") == "text"):
    salted_parts = [{**prompt[0], "text": f"[{salt}] ..."}, *prompt[1:]]
```

Real-world OpenAI-style multimodal prompts often start with an image/audio part and place text at index 1+ ("look at this image, then answer the question"). For those, salting is silently skipped (falls through to the `return data` at line 478) and `warmup.salt=true` does NOT prevent KV-cache reuse for those requests. This breaks the advertised warmup behavior for a common multimodal input shape.
Fix: scan the list for the first dict with type == "text" (or salt all text parts) instead of only inspecting prompt[0].
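The "first text part at any index" variant can be sketched as a standalone helper (not the project's actual code; the surrounding `load_sample` wiring is omitted):

```python
import os


def salt_multimodal(parts: list, salt: str) -> list:
    """Prefix the first text part with the salt, wherever it appears."""
    for i, part in enumerate(parts):
        if isinstance(part, dict) and part.get("type") == "text":
            salted = {**part, "text": f"[{salt}] {part['text']}"}
            # Rebuild the list so the inner dataset's sample is not mutated.
            return [*parts[:i], salted, *parts[i + 1:]]
    return parts  # no text part found — caller should log a warning


prompt = [
    {"type": "image_url", "image_url": {"url": "http://x/img.png"}},
    {"type": "text", "text": "describe this image"},
]
print(salt_multimodal(prompt, os.urandom(8).hex())[1]["text"])
```

Salting all text parts instead of only the first would be a one-line change (accumulate instead of early-return), at the cost of slightly longer prompts.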
```python
try:
    pool_cm = TokenizePool(args.tokenizer, n_workers=args.tokenizer_workers)
except Exception as e:
    logging.warning(
```
[Review Council — Claude] low · error-handling
Tokenizer load failure is silently downgraded to "no token metrics":
```python
except Exception as e:
    logging.warning(
        f"Failed to load tokenizer '{args.tokenizer}': {e}. "
        "ISL/OSL/TPOT token metrics will be unavailable."
    )
    pool_cm = nullcontext()
```

A user who explicitly passes `--tokenizer X` and thus expects token metrics will instead see a single WARNING line buried in the log, then a silent benchmark run with empty ISL/OSL/TPOT columns. By the time they check the report, the run is over.
Fix: re-raise unless an explicit --allow-tokenizer-fallback flag is set, OR at least log at ERROR level and surface the missing-tokenizer state in the final report. Also use the module logger (logger = logging.getLogger(__name__)) rather than the root logging module.
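The suggested behavior could look like this sketch. `load_tokenizer` is a hypothetical stand-in for the real `TokenizePool(...)` call, and the flag name follows the review's `--allow-tokenizer-fallback` suggestion:

```python
import logging
from contextlib import nullcontext

logger = logging.getLogger(__name__)  # module logger, not the root logger


def load_tokenizer(name: str):
    # Stub standing in for TokenizePool(...); always fails here so both
    # code paths can be demonstrated.
    raise RuntimeError(f"tokenizer files for {name!r} not found")


def make_pool(tokenizer: str, allow_fallback: bool):
    try:
        return load_tokenizer(tokenizer)
    except Exception as e:
        if not allow_fallback:
            raise  # user explicitly asked for token metrics: fail loudly
        logger.error(  # ERROR, not WARNING — this degrades the report
            "Failed to load tokenizer %r: %s. "
            "ISL/OSL/TPOT token metrics will be unavailable.",
            tokenizer, e,
        )
        return nullcontext()
```

With the strict default, a typo in `--tokenizer` stops the run immediately instead of being discovered after a completed (and wasted) benchmark.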
```python
if not (isinstance(data, dict) and "prompt" in data):
    return data
prompt = data["prompt"]
salt = os.urandom(8).hex()
```
[Review Council — Claude] low · design
salt = os.urandom(8).hex() on every load_sample() ignores the seeded rng_sample_index in RuntimeSettings. Reproducibility-conscious users who set dataloader_random_seed to compare runs will see different salted prompts on each invocation. Warmup metrics aren't reported, but the actual prompts sent to the server (and thus server-side caches/stats) are nondeterministic.
Fix: thread the rng_sample_index (or a derived random.Random) through SaltedDataset.__init__ and use f'{rng.getrandbits(64):016x}' to keep determinism with a non-zero seed.
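The seeded variant is a small change. A sketch of a salt factory built on a seeded `random.Random` (the factory shape is illustrative; the 16-hex-char width matches `os.urandom(8).hex()`):

```python
import random


def make_salter(seed: int):
    """Return a zero-arg callable producing deterministic 16-hex-char salts."""
    rng = random.Random(seed)
    return lambda: f"{rng.getrandbits(64):016x}"


salt_a = make_salter(42)
salt_b = make_salter(42)
# Same seed → identical salt sequence → reproducible warmup prompts.
assert [salt_a() for _ in range(3)] == [salt_b() for _ in range(3)]
# Different seeds diverge, so distinct runs still defeat the KV-cache.
assert make_salter(42)() != make_salter(43)()
```

Threading the existing `rng_sample_index` (or a `Random` derived from `dataloader_random_seed`) into `SaltedDataset.__init__` gives the same property without introducing a new config knob.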
```python
@pytest.fixture
def base_rt_settings(self):
    import random
```
[Review Council — Claude] low · design
TestBuildPhases uses lazy/inline imports inside fixtures and ~15 test method bodies (import random, from inference_endpoint.commands.benchmark.execute import _build_phases, etc.).
AGENTS.md explicitly forbids this:
> All imports must be at the top of the file. Imports inside functions, methods, or conditional blocks (other than TYPE_CHECKING) are disallowed.
None of the documented exceptions apply (no circular import, no optional dep, no sandboxing). Move the imports to the top of the file.
```python
    guaranteeing all overlap events have been recorded before we assert.
    """

_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending
```
[Review Council — Claude] low · testing
`_DELAY = 0.15` is fragile for the overlap-observation tests:

```python
_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending
```

On a heavily-loaded CI runner, the entire warmup batch (5 requests) can plausibly round-trip within 150 ms before the perf phase issues, producing a false "no overlap" and flaking `test_drain_false_overlap_observed` / `test_drain_false_max_concurrent_exceeds_single_phase_size`.
Fix: use a controllable handler that explicitly blocks until the test signals release (e.g. via asyncio.Event) instead of a fixed sleep. If sticking with a sleep, increase to ≥1s.
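The event-based shape can be sketched as follows. The EchoServer wiring is omitted; `warmup_handler` is a hypothetical request handler showing only the blocking/release mechanism that removes the wall-clock dependency:

```python
import asyncio


async def demo() -> str:
    release = asyncio.Event()

    async def warmup_handler(_request):
        # Hold the warmup request open until the test releases it —
        # no fixed sleep, so a slow CI runner cannot close the window early.
        await release.wait()
        return "ok"

    pending = asyncio.create_task(warmup_handler(None))
    await asyncio.sleep(0)     # let the handler start and block on the event
    assert not pending.done()  # the overlap window is provably still open
    release.set()              # test has observed overlap; let warmup finish
    return await pending


print(asyncio.run(demo()))
```

Because the handler cannot complete before `release.set()`, the "warmup still pending when perf starts" condition is guaranteed by construction rather than by timing.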
```python
_SALT_RE = re.compile(r"^\[([0-9a-f]{16})\] (.+)$")

_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
```
[Review Council — Claude] low · testing
All warmup integration tests run with `num_workers=1`:

```python
_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
)
```

None cover the multi-worker warmup path, where ZMQ ordering between worker processes matters and where salted prompts must be distinct across workers. Suggest adding at least one warmup test with `num_workers >= 2` to exercise the multi-process distribution path the production code uses.
Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough | HEAD:

Found 14 issues across 6 files. Inline comments posted in this review. Cross-reviewer agreement (boosted confidence): execute.py:365 (warmup RNG seeding) and dataset.py:467 (multimodal salt) were flagged by both Codex and Claude.

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Theme summary: The warmup feature lands the structural plumbing well, but several correctness/reproducibility properties the YAML knobs claim to provide are not actually enforced end-to-end.

🤖 Posted by
- Add `random_seed` field to WarmupConfig for deterministic warmup scheduling
- Use `dataclasses.replace` + warmup seed for warmup RuntimeSettings
- Fix SaltedDataset.data to be a property (avoids stale snapshot after inner reload)
- Fix multimodal salting to find first text part at any index (handles image-first prompts)
- Log warning when input_tokens present without prompt (salting not possible)
- Fix ruff-format CI failure in test_async_session.py
- Move inline imports to top of test_benchmark.py (AGENTS.md compliance)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
    return cls(df, transforms=transforms, repeats=num_repeats)


class SaltedDataset(Dataset):
```
[Review Council — Claude] medium · design
SaltedDataset inherits from Dataset without opting out of __init_subclass__ registration. Dataset.__init_subclass__ (line 276) registers every concrete subclass under its class name in Dataset.PREDEFINED:
```python
cls.DATASET_ID = dataset_id  # → "SaltedDataset"
Dataset.PREDEFINED[dataset_id] = cls
```

`SaltedDataset.__init__` takes `inner: Dataset` (not `dataframe: pd.DataFrame`), so any code that tries to instantiate it through the registry (e.g., DataLoaderFactory or the YAML dataset_id lookup) will raise a TypeError with a confusing error. Fix: pass `dataset_id=None` to explicitly opt out:

```python
class SaltedDataset(Dataset, dataset_id=None):
```

```python
        *prompt[1:],
    ]
    return {**data, "prompt": salted_parts}
return data  # unsupported prompt type — skip salting
```
[Review Council — Claude] medium · bug
When the prompt is a multimodal list where ALL parts are non-text (e.g., all image_url), the loop finds no text part and falls through to return data with no warning:
```python
if isinstance(prompt, list) and prompt:
    for i, part in enumerate(prompt):
        if isinstance(part, dict) and part.get("type") == "text":
            ...
            return {**data, "prompt": salted_parts}
return data  # unsupported prompt type — skip salting (silent!)
```

Unlike the `input_tokens` case (which emits a `logger.warning`), this fallback is completely silent. Users who enable `salt=True` for image-only multimodal datasets will not see any indication that KV-cache bypass is not functioning. Add a `logger.warning` consistent with the `input_tokens` path.
```python
n_samples_from_dataset=ctx.dataloader.num_samples(),
n_samples_to_issue=warmup_cfg.n_requests,
min_sample_count=1,
rng_sched=random.Random(),
```
[Review Council — Claude] medium · design
After fixing the unseeded RNGs, both rng_sched and rng_sample_index for the warmup phase will be seeded with the same value (warmup_cfg.random_seed, default 0). This means both RNGs start in identical state and produce identical pseudo-random sequences — the scheduler and sample-index selection will be perfectly correlated.
Compare with the perf phase in RuntimeSettings.from_config which uses separate fields scheduler_random_seed and dataloader_random_seed. Consider using random_seed for one and random_seed ^ 0xDEAD (or adding warmup_scheduler_random_seed to WarmupConfig) for the other, so their sequences diverge.
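The divergence can be demonstrated directly. A sketch of a seed-derivation helper using the review's `^ 0xDEAD` example (the helper name is illustrative):

```python
import random


def warmup_rngs(random_seed: int):
    """Derive two decorrelated RNGs from one warmup seed."""
    return (
        random.Random(random_seed),           # rng_sched
        random.Random(random_seed ^ 0xDEAD),  # rng_sample_index
    )


sched, sample = warmup_rngs(0)
# With identical seeds these two sequences would match element-for-element;
# the XOR'd seed makes scheduling and sample selection independent.
assert [sched.random() for _ in range(5)] != [sample.random() for _ in range(5)]
```

A dedicated `warmup_scheduler_random_seed` field would be more explicit, at the cost of one more knob; the XOR derivation keeps the config surface unchanged.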
```python
sd = SaltedDataset(inner)
assert sd.load_sample(0) == {}


def test_non_dict_sample_is_returned_as_is(self):
```
[Review Council — Claude] low · testing
`sd.data = inner.data` is redundant now that `SaltedDataset.data` is a property that delegates to `self._inner.data`. The assignment calls the property setter, which sets `self._inner.data = inner.data` — a no-op since they reference the same list. This won't cause a test failure, but it misleads readers into thinking `SaltedDataset.data` is a plain attribute:

```python
inner.data = ["raw string sample"]  # override with non-dict
sd = SaltedDataset(inner)
sd.data = inner.data  # redundant — sd.data property already returns inner.data
```

Remove the `sd.data = inner.data` line.
Review Council — Multi-AI Code Review

Reviewed by Claude (Codex CLI unavailable) | Depth: thorough | HEAD: `1143d64`

Found 7 new issues across 5 files. Inline comments posted for 4; 3 are out-of-diff or post-fix. Cross-reviewer note: issues already addressed in the previous round (unseeded RNGs,

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

🤖 Posted by
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
```python
    return cls(df, transforms=transforms, repeats=num_repeats)


class SaltedDataset(Dataset, register=False):
```
arekay-nv
left a comment
Review Council — Multi-AI Code Review (Round 2)
Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)
Re-reviewing after the author's response commits. Most round-1 findings are addressed; this round focuses on new code (global timeout, warmup_random_seed, register=False) and incomplete fixes.
Found 4 new issues across 3 files: 0 critical, 1 high, 1 medium, 2 low. See inline comments — a tiered summary will follow.
```python
    )
    session.stop()

global_timeout_handle = loop.call_later(
```
[Review Council — Both (Codex P1 + Claude)] high · api-contract
The new global timer redefines what max_duration_ms means, but only when warmup is enabled, and the schema description was not updated.
```python
max_duration_ms = ctx.rt_settings.max_duration_ms
if max_duration_ms is not None:
    ...
    global_timeout_handle = loop.call_later(
        max_duration_ms / 1000.0, _on_global_timeout
    )
```

session.py:_make_stop_check still enforces `max_duration_ms` per-phase against `phase_start_ns`, so we now have two clocks running on the same configured value:
- per-phase, from each phase's start (the original semantic — and still what `_build_phases` clears to `None` for warmup/accuracy).
- global, from `loop.call_later` scheduling time (warmup + perf + accuracy combined).
For a user with warmup.enabled=True and max_duration_ms=600_000 (10 min), if warmup takes 90s, the perf phase's effective wall-clock budget collapses from 10 min to ~8.5 min — silently. Worse, an accuracy pass in TestMode.BOTH (which sets max_duration_ms=None per-phase, expecting unbounded) can now be truncated mid-run by the global timer.
The schema description still reads "Maximum test duration in ms (0 for no limit)", which is the original per-phase meaning. Existing YAML configs will get shorter perf/accuracy phases once they enable warmup, with no documentation indicating why.
Suggested fixes (any one):
- Schedule the timer at the start of the performance phase (e.g., on `START_PERFORMANCE_TRACKING`) so warmup time doesn't eat the perf budget — preserves the per-phase semantic.
- Expose a separate `warmup.max_duration_ms` to bound only warmup, and leave the perf-phase `max_duration_ms` untouched.
- If a global cap is genuinely the goal, rename it (`experiment_timeout_ms`?), keep `max_duration_ms` per-phase, and document the change in `WarmupConfig` and the release notes.
```diff
-    throughout the current phase."""
+    Bounded by the global experiment timeout: if the caller schedules a
+    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
+    this wait without leaving it hung indefinitely."""
```
[Review Council — Claude] medium · error-handling
The new docstring on _drain_inflight claims a safety guarantee that only holds in some configurations:
```python
async def _drain_inflight(self, phase_issuer: PhaseIssuer) -> None:
    """Wait for all in-flight responses from this phase to complete.

    Bounded by the global experiment timeout: if the caller schedules a
    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
    this wait without leaving it hung indefinitely."""
    ...
    self._drain_event.clear()
    await self._drain_event.wait()
```

The `loop.call_later` is only scheduled in execute.py:561 if `ctx.rt_settings.max_duration_ms is not None`. `runtime_settings.from_config` maps `max_duration_ms == 0` to `None`, and the schema default is 0 — so the default offline/online CLI run has no global timer, and `_drain_inflight` retains its old unbounded `await self._drain_event.wait()`. A single dropped or hanging request will hang the benchmark forever.
The docstring suggests this case is solved; it isn't.
Suggested fixes:
- Make the docstring conditional ("if a global timeout is configured…") and add a note that the drain is otherwise unbounded, OR
- Add an intrinsic drain timeout (a fixed cap like 30–60s, or a multiple of observed completion latency) inside `_drain_inflight` itself, so the safety property is a property of the function rather than a contract on the caller.
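The intrinsic-timeout option can be sketched with `asyncio.wait_for`. This is a standalone illustration, not the session's actual code; the 60s default and the boolean return are assumptions:

```python
import asyncio


async def drain_inflight(drain_event: asyncio.Event,
                         timeout_s: float = 60.0) -> bool:
    """Wait for in-flight responses; never blocks past timeout_s."""
    try:
        await asyncio.wait_for(drain_event.wait(), timeout=timeout_s)
        return True   # drained cleanly
    except asyncio.TimeoutError:
        return False  # timed out — caller logs the leak and moves on


async def demo() -> tuple:
    ev = asyncio.Event()
    timed_out = await drain_inflight(ev, timeout_s=0.01)  # nobody sets it
    ev.set()
    drained = await drain_inflight(ev, timeout_s=0.01)    # already set
    return timed_out, drained


print(asyncio.run(demo()))
```

With this shape, the no-hang guarantee holds for every configuration, including the default `max_duration_ms=0` case where no global timer exists.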
| "Global experiment timeout reached (%d ms); stopping session.", | ||
| max_duration_ms, | ||
| ) | ||
| session.stop() |
[Review Council — Claude] low · concurrency
The global timer's callback calls session.stop() directly, with no guard against firing after cleanup has begun:
```python
def _on_global_timeout() -> None:
    logger.warning(
        "Global experiment timeout reached (%d ms); stopping session.",
        max_duration_ms,
    )
    session.stop()
```

`global_timeout_handle.cancel()` only runs after entering the `finally:` block at line 580, so the timer can fire during the gap between `await session.run(phases)` returning and the `cancel()` call running — or worse, during the slow cleanup that follows (`http_client.shutdown_async()`, publisher close, etc.). When that happens, `session.stop()` flips `_stop_requested=True` and `_drain_event.set()` on a session that has already finished. The downstream effect is mostly harmless today (the strategy task is None), but the race is real and easy to make harmful with future refactors.
Suggested fix:
```python
def _on_global_timeout() -> None:
    if session._done:  # or expose a public flag
        return
    logger.warning(...)
    session.stop()
```

Also consider moving `global_timeout_handle.cancel()` to immediately after `session.run(phases)` returns (success path), and only running it in `finally:` to handle the exception path.
```python
    for _ in range(n_draw)
]

assert order_with == order_without, (
```
[Review Council — Claude] low · testing
test_performance_sample_order_identical_with_and_without_warmup does not actually exercise the property it claims to test:
```python
def make_rt():
    return RuntimeSettings(
        ...
        rng_sched=random.Random(99),
        rng_sample_index=random.Random(99),
        ...
    )

ctx_with = self._make_ctx(config_with, make_rt(), simple_dataset)
ctx_without = self._make_ctx(config_without, make_rt(), simple_dataset)
```

`make_rt()` is called twice and produces two separate RuntimeSettings instances with freshly-seeded `Random(99)` RNGs. `_build_phases` in the warmup path uses `dataclass_replace`, which substitutes new `Random(seed)` instances on the warmup phase only — it never touches the perf phase's RNGs. So this test only verifies that `_build_phases` doesn't drain the perf RNG, which it never could (it only stores references). It does not verify the regression it claims to guard against: that executing the warmup phase leaves the perf-phase RNG untouched.
A tighter test would either:
- Build phases from a single shared `RuntimeSettings`, then advance the warmup phase's `rng_sample_index` (e.g., draw N samples from `create_sample_order(warmup_rt)`), and assert that the perf-phase order matches a control with no warmup, OR
- Drop this test as redundant with `test_warmup_phase_uses_independent_rngs` (which already asserts `warmup_rt.rng_sched is not perf_rt.rng_sched`).

The current implementation gives a false sense of coverage for the actual race the docstring describes.
Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD:

Re-review after the author's response commits. 4 new issues across 3 files. Inline comments in this review. Cross-reviewer agreement (boosted confidence): execute.py:570 — global timeout semantic change — was the only finding both Codex and Claude flagged in round 2.

🔴 Must Fix (high)

🟡 Should Fix (medium)

🔵 Consider (low)

Round 1 follow-up status

Theme summary (round 2): The biggest new concern is the global wall-clock timer added to bound the warmup drain. The mechanism works, but it silently changes what `max_duration_ms` means. The drain-safety claim in `_drain_inflight`'s docstring only holds when a global timeout is actually configured.

🤖 Posted by
Plugs in warmup specification with salt via config.

`datasets` as default local folder for cache — causing pre-commit to complain.

Type of change

Related issues

Testing

Checklist