
feat: Plug in warmup phase #305

Draft
arekay-nv wants to merge 5 commits into main from
arekay/plugin_warmup_phase

Conversation

@arekay-nv
Collaborator

Plugs in warmup specification with salt via config.

  • Fixes the default local cache folder (previously datasets), which was causing pre-commit to complain.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@github-actions

github-actions Bot commented May 5, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested a review from nvzhihanj May 5, 2026 01:42

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a formal warmup phase to the benchmark execution process, replacing the previous environment variable-based implementation. Key changes include the addition of a WarmupConfig schema, a SaltedDataset wrapper that injects random salts into prompts to prevent KV-cache reuse, and a drain_after configuration to manage phase transitions. The EchoServer and testing fixtures were also updated to support custom request handling for new integration tests. Review feedback identifies missing imports for os and logging in dataset.py and notes that the SaltedDataset constructor signature is incompatible with its base class, which could break inherited class methods.

Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
Comment on lines +442 to +449
def __init__(self, inner: Dataset) -> None:
# Skip Dataset.__init__ — all state is delegated to inner
self._inner = inner
self.data = inner.data
self.dataframe = None
self.transforms = None
self.repeats = inner.repeats
self.logger = getLogger(__name__)

medium

SaltedDataset inherits from Dataset but its __init__ method signature is incompatible with the base class. While it works as a wrapper for an existing instance in the current implementation, it breaks the contract for any code that expects to instantiate a Dataset (or subclass) using the standard signature (e.g., df, transforms, repeats).

Specifically, calling the inherited @classmethod get_dataloader on SaltedDataset will fail with a TypeError because it attempts to call the constructor with arguments that SaltedDataset.__init__ does not accept. Additionally, setting self.dataframe and self.transforms to None might break other inherited methods that expect these attributes to be populated.

Comment thread src/inference_endpoint/dataset_manager/dataset.py Fixed
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Comment thread src/inference_endpoint/dataset_manager/dataset.py Fixed
Collaborator Author

@arekay-nv arekay-nv left a comment


Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough

Found 14 issues across 6 files: 0 critical, 3 high, 4 medium, 7 low.

See inline comments for details. A grouped summary will follow as a separate top-level comment.

def get_dataloader(
cls,
datasets_dir: Path = Path("datasets"),
datasets_dir: Path = Path("dataset_cache"),
Collaborator Author


[Review Council — Claude] high · api-contract

Default datasets_dir changed from Path("datasets") to Path("dataset_cache"):

datasets_dir: Path = Path("dataset_cache"),

This is a silent breaking change. Any existing user who relied on the previous default will, after upgrade, fail to find the cached dataframe under datasets/<DATASET_ID>/<variant> and will silently re-download/re-generate (potentially many GB and minutes of work). The .gitignore change only ignores dataset_cache/, leaving any pre-existing datasets/ cache uncovered.

Suggest: keep Path("datasets") as the default, OR add a fallback that reads from datasets/ if dataset_cache/ is empty, OR document the migration prominently. At minimum, decouple this rename from the warmup PR — it's an orthogonal change.
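A minimal sketch of the fallback option, under the assumption of the cache layout described above. `resolve_datasets_dir` and its defaults are hypothetical illustration names, not the real `get_dataloader` signature:

```python
from pathlib import Path

def resolve_datasets_dir(
    preferred: Path = Path("dataset_cache"),
    legacy: Path = Path("datasets"),
) -> Path:
    """Prefer the new cache dir, but reuse a populated legacy cache.

    Hypothetical helper: avoids silently re-downloading when a user
    upgrades and still has data under the old default directory.
    """
    if preferred.exists() and any(preferred.iterdir()):
        return preferred
    if legacy.exists() and any(legacy.iterdir()):
        return legacy  # fall back to the pre-existing datasets/ cache
    return preferred  # neither is populated: use the new default
```

A one-time log line at the fallback branch would make the migration visible to users.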

) -> None:
pass # Inner dataset already loaded

def load_sample(self, index: int) -> Any:
Collaborator Author


[Review Council — Claude] high · bug

SaltedDataset.load_sample() only mutates data["prompt"], but several adapters bypass the prompt entirely. In src/inference_endpoint/sglang/adapter.py:90:

input_tokens = query.data["input_tokens"]
return cls._request_encoder.encode(
    SGLangGenerateRequest(input_ids=input_tokens, ...))

When salt=True is used with a tokenized dataset (any pipeline that pre-tokenizes — common in MLPerf submissions), the salt prefix is added to the (unused) text but input_tokens are unchanged, so the server receives the same token-ID sequence for every warmup request. KV-cache reuse is NOT prevented. The class docstring's claim that "each call to load_sample() generates a fresh salt, so reused samples … each receive a distinct salt" is silently violated.

Fix options: (a) detect input_tokens/token_ids keys and either drop them or re-tokenize after salting; (b) raise an error at warmup setup if salt=True and the inner dataset has tokenized fields; (c) explicitly document the restriction.
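Option (b) could look roughly like the sketch below. The key names (input_tokens, token_ids) and the helper itself are assumptions for illustration, not the module's real API:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Hypothetical field names a pre-tokenized sample might carry.
_TOKENIZED_KEYS = ("input_tokens", "token_ids")

def check_saltable(sample: dict[str, Any]) -> bool:
    """Return True if salting the text prompt can actually change the request.

    Sketch of option (b): if the sample carries pre-tokenized fields that an
    adapter may send verbatim, salting the text alone cannot bust the KV cache.
    """
    present = [k for k in _TOKENIZED_KEYS if k in sample]
    if present:
        logger.warning(
            "salt=True requested but sample has tokenized fields %s; "
            "salting the text prompt will not prevent KV-cache reuse.",
            present,
        )
        return False
    return True
```

At warmup setup this could raise instead of warn, which is the stricter reading of option (b).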

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
salt = os.urandom(8).hex()
if isinstance(prompt, str):
return {**data, "prompt": f"[{salt}] {prompt}"}
if (
Collaborator Author


[Review Council — Both (Codex P2 + Claude)] medium · bug

Multimodal salting only fires when prompt[0] is a text part:

if (isinstance(prompt, list) and prompt and isinstance(prompt[0], dict)
        and prompt[0].get("type") == "text"):
    salted_parts = [{**prompt[0], "text": f"[{salt}] ..."}, *prompt[1:]]

Real-world OpenAI-style multimodal prompts often start with an image/audio part and place text at index 1+ ("look at this image, then answer the question"). For those, salting is silently skipped (falls through to the return data at line 478) and warmup.salt=true does NOT prevent KV-cache reuse for those requests. This breaks the advertised warmup behavior for a common multimodal input shape.

Fix: scan the list for the first dict with type == "text" (or salt all text parts) instead of only inspecting prompt[0].
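A sketch of the scan-for-first-text-part fix, assuming the OpenAI-style content-part shape described above (salt_prompt is a hypothetical free-function version of load_sample's salting step):

```python
import os
from typing import Any

def salt_prompt(data: dict[str, Any]) -> dict[str, Any]:
    """Salt the first text part found anywhere in the prompt, not just index 0."""
    prompt = data.get("prompt")
    salt = os.urandom(8).hex()
    if isinstance(prompt, str):
        return {**data, "prompt": f"[{salt}] {prompt}"}
    if isinstance(prompt, list):
        for i, part in enumerate(prompt):
            if isinstance(part, dict) and part.get("type") == "text":
                salted = {**part, "text": f"[{salt}] {part['text']}"}
                # splice the salted part back at its original position
                return {**data, "prompt": [*prompt[:i], salted, *prompt[i + 1:]]}
    return data  # no text part found; the caller should log a warning here
```

Image-first prompts now get salted at whatever index the text part sits.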

Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
try:
pool_cm = TokenizePool(args.tokenizer, n_workers=args.tokenizer_workers)
except Exception as e:
logging.warning(
Collaborator Author


[Review Council — Claude] low · error-handling

Tokenizer load failure is silently downgraded to "no token metrics":

except Exception as e:
    logging.warning(
        f"Failed to load tokenizer '{args.tokenizer}': {e}. "
        "ISL/OSL/TPOT token metrics will be unavailable."
    )
    pool_cm = nullcontext()

A user who explicitly passes --tokenizer X and thus expects token metrics will instead see a single WARNING line buried in the log, then a silent benchmark run with empty ISL/OSL/TPOT columns. By the time they check the report, the run is over.

Fix: re-raise unless an explicit --allow-tokenizer-fallback flag is set, OR at least log at ERROR level and surface the missing-tokenizer state in the final report. Also use the module logger (logger = logging.getLogger(__name__)) rather than the root logging module.
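The re-raise-unless-flag option might look like this sketch. The --allow-tokenizer-fallback flag, the make_pool helper, and the args fields are assumptions standing in for the module's real names:

```python
import logging
from contextlib import nullcontext

logger = logging.getLogger(__name__)  # module logger, not the root logger

def make_pool(args, tokenize_pool_cls):
    """Build the tokenize pool, failing loudly unless fallback was opted into.

    Hypothetical sketch: tokenize_pool_cls stands in for TokenizePool.
    """
    try:
        return tokenize_pool_cls(args.tokenizer, n_workers=args.tokenizer_workers)
    except Exception as e:
        if not getattr(args, "allow_tokenizer_fallback", False):
            raise  # user asked for token metrics; don't silently drop them
        logger.error(
            "Failed to load tokenizer %r: %s. ISL/OSL/TPOT metrics unavailable.",
            args.tokenizer, e,
        )
        return nullcontext()
```

The ERROR-level log plus the opt-in flag together cover both suggested mitigations.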

if not (isinstance(data, dict) and "prompt" in data):
return data
prompt = data["prompt"]
salt = os.urandom(8).hex()
Collaborator Author


[Review Council — Claude] low · design

salt = os.urandom(8).hex() on every load_sample() ignores the seeded rng_sample_index in RuntimeSettings. Reproducibility-conscious users who set dataloader_random_seed to compare runs will see different salted prompts on each invocation. Warmup metrics aren't reported, but the actual prompts sent to the server (and thus server-side caches/stats) are nondeterministic.

Fix: thread the rng_sample_index (or a derived random.Random) through SaltedDataset.__init__ and use f'{rng.getrandbits(64):016x}' to keep determinism with a non-zero seed.
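A minimal sketch of the seeded-salt idea; SaltSource is a hypothetical holder for the rng that would be threaded through SaltedDataset.__init__:

```python
import random

class SaltSource:
    """Deterministic salt generator driven by a seeded random.Random."""

    def __init__(self, rng: random.Random) -> None:
        self._rng = rng

    def next_salt(self) -> str:
        # 64 random bits rendered as 16 hex chars, matching the os.urandom(8)
        # format but reproducible for a given seed.
        return f"{self._rng.getrandbits(64):016x}"
```

Two runs constructed with the same seed then produce identical salted prompts, while successive samples within a run still get distinct salts.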

Comment thread tests/unit/commands/test_benchmark.py Outdated

@pytest.fixture
def base_rt_settings(self):
import random
Collaborator Author


[Review Council — Claude] low · design

TestBuildPhases uses lazy/inline imports inside fixtures and ~15 test method bodies (import random, from inference_endpoint.commands.benchmark.execute import _build_phases, etc.).

AGENTS.md explicitly forbids this:

All imports must be at the top of the file. Imports inside functions, methods, or conditional blocks (other than TYPE_CHECKING) are disallowed.

None of the documented exceptions apply (no circular import, no optional dep, no sandboxing). Move the imports to the top of the file.

guaranteeing all overlap events have been recorded before we assert.
"""

_DELAY = 0.15 # seconds; long enough for perf to start while warmup is pending
Collaborator Author


[Review Council — Claude] low · testing

_DELAY = 0.15 is fragile for the overlap-observation tests:

_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending

On a heavily-loaded CI runner, the entire warmup batch (5 requests) can plausibly round-trip within 150 ms before the perf phase issues, producing a false "no overlap" and flaking test_drain_false_overlap_observed / test_drain_false_max_concurrent_exceeds_single_phase_size.

Fix: use a controllable handler that explicitly blocks until the test signals release (e.g. via asyncio.Event) instead of a fixed sleep. If sticking with a sleep, increase to ≥1s.
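A sketch of the event-gated handler idea, independent of the EchoServer API (GatedHandler and demo are hypothetical names): the handler blocks every request until the test releases it, so overlap is observed deterministically rather than raced against a timer.

```python
import asyncio

class GatedHandler:
    """Test handler that holds responses until explicitly released."""

    def __init__(self) -> None:
        self.release = asyncio.Event()
        self.started = 0

    async def handle(self, request: object) -> dict:
        self.started += 1
        await self.release.wait()  # hold warmup responses open
        return {"echo": request}

async def demo() -> int:
    h = GatedHandler()
    tasks = [asyncio.create_task(h.handle(i)) for i in range(3)]
    await asyncio.sleep(0)          # let every handler start and block
    started_while_blocked = h.started
    h.release.set()                 # overlap observed; now drain cleanly
    await asyncio.gather(*tasks)
    return started_while_blocked
```

The same shape works for the drain_after=False overlap assertions: assert overlap while the event is clear, then set it to finish the phase.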

_SALT_RE = re.compile(r"^\[([0-9a-f]{16})\] (.+)$")

_MINIMAL_CLIENT = HTTPClientConfig(
num_workers=1, warmup_connections=0, max_connections=10
Collaborator Author


[Review Council — Claude] low · testing

All warmup integration tests run with num_workers=1:

_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
)

None cover the multi-worker warmup path, where ZMQ ordering between worker processes matters and where salted prompts must be distinct across workers. Suggest adding at least one warmup test with num_workers >= 2 to exercise the multi-process distribution path the production code uses.

@arekay-nv
Collaborator Author

Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough | HEAD: 56258be

Found 14 issues across 6 files. Inline comments posted in this review.

Cross-reviewer agreement (boosted confidence): execute.py:365 (warmup RNG seeding) and dataset.py:467 (multimodal salt) were flagged by both Codex and Claude.

🔴 Must Fix (high)

# File Line Category Reviewer(s) Summary
1 src/inference_endpoint/dataset_manager/dataset.py 414 api-contract Claude Default datasets_dir silently changed datasets/ → dataset_cache/ (orthogonal to warmup; risks losing existing caches)
2 src/inference_endpoint/dataset_manager/dataset.py 459 bug Claude SaltedDataset.load_sample() mutates prompt only, leaving input_tokens (used by SGLang adapter) unchanged → KV-cache reuse not actually prevented for tokenized datasets
3 src/inference_endpoint/commands/benchmark/execute.py 365 bug Both Warmup rng_sched / rng_sample_index use bare random.Random() (unseeded) → warmup sample order is nondeterministic across runs, breaking reproducibility

🟡 Should Fix (medium)

# File Line Category Reviewer(s) Summary
1 src/inference_endpoint/dataset_manager/dataset.py 467 bug Both Multimodal salting only fires when prompt[0] is text; image-first prompts (common shape) silently skip salting
2 src/inference_endpoint/dataset_manager/dataset.py 444 bug Claude SaltedDataset.data snapshots inner.data at construction; later inner.load(force=True) desyncs the wrapper from the inner dataset
3 src/inference_endpoint/commands/benchmark/execute.py 357 design Claude warmup_rt built ad-hoc instead of via dataclasses.replace; load pattern hardcoded to MAX_THROUGHPUT (no way to warm up at the perf-phase QPS/concurrency)
4 src/inference_endpoint/load_generator/session.py 273 api-contract Claude ENABLE_WARMUP=1 env-var escape hatch removed without a deprecation warning; existing scripts now silently no-op

🔵 Consider (low)

# File Line Category Reviewer(s) Summary
1 src/inference_endpoint/load_generator/session.py 68 api-contract Claude PhaseConfig.drain_after=True default flips warmup drain semantics for direct callers (was effectively False)
2 src/inference_endpoint/load_generator/session.py 423 data-integrity Claude drain_after=False filter only covers _on_sample_complete; warmup COMPLETE events still publish to ZMQ → event log pollution
3 src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py 100 error-handling Claude Tokenizer load failure silently downgraded to nullcontext(); user-requested --tokenizer produces empty ISL/OSL/TPOT with only a warning
4 src/inference_endpoint/dataset_manager/dataset.py 464 design Claude os.urandom(8) salt ignores seeded RNG → salted prompts are nondeterministic even when the user sets dataloader_random_seed
5 tests/unit/commands/test_benchmark.py 471 design Claude TestBuildPhases uses lazy/inline imports throughout (~15 sites) — explicit AGENTS.md violation
6 tests/integration/commands/test_warmup.py 292 testing Claude _DELAY = 0.15 is fragile for overlap-observation tests — risk of CI flakes; replace with explicit asyncio.Event synchronization
7 tests/integration/commands/test_warmup.py 65 testing Claude All warmup integration tests use num_workers=1; multi-worker warmup path (ZMQ ordering, salt distinctness) is uncovered

Theme summary

The warmup feature lands the structural plumbing well, but several correctness/reproducibility properties the YAML knobs claim to provide are not actually enforced end-to-end:

  • Reproducibility — warmup RNGs are unseeded (execute.py:365) and the salt uses os.urandom (dataset.py:464). Two runs of the same config produce different warmup workloads, indirectly affecting perf-phase reproducibility when warmup sizes the cache.
  • Salt completeness — SaltedDataset only handles plain-text prompts. Tokenized datasets (SGLang adapter via input_tokens) and image-first multimodal prompts silently skip salting, defeating warmup.salt=true for those paths.
  • API/back-compat surface — the ENABLE_WARMUP env var was removed without a deprecation warning; the new PhaseConfig.drain_after=True default flips warmup drain semantics for direct callers; datasets_dir default rename couples an orthogonal cache-directory change to this PR.
  • Test coverage — single-worker only; the multi-worker ZMQ ordering and salt-distinctness across worker processes is the riskiest part of the architecture and is currently unexercised. The _DELAY = 0.15 overlap tests are also flake-prone.

🤖 Posted by /review-council skill.

arekay-nv and others added 2 commits May 4, 2026 20:08
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
- Add `random_seed` field to WarmupConfig for deterministic warmup scheduling
- Use `dataclasses.replace` + warmup seed for warmup RuntimeSettings
- Fix SaltedDataset.data to be a property (avoids stale snapshot after inner reload)
- Fix multimodal salting to find first text part at any index (handles image-first prompts)
- Log warning when input_tokens present without prompt (salting not possible)
- Fix ruff-format CI failure in test_async_session.py
- Move inline imports to top of test_benchmark.py (AGENTS.md compliance)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
return cls(df, transforms=transforms, repeats=num_repeats)


class SaltedDataset(Dataset):
Collaborator Author


[Review Council — Claude] medium · design

SaltedDataset inherits from Dataset without opting out of __init_subclass__ registration. Dataset.__init_subclass__ (line 276) registers every concrete subclass under its class name in Dataset.PREDEFINED:

cls.DATASET_ID = dataset_id  # → "SaltedDataset"
Dataset.PREDEFINED[dataset_id] = cls

SaltedDataset.__init__ takes inner: Dataset (not dataframe: pd.DataFrame), so any code that tries to instantiate it through the registry (e.g., DataLoaderFactory or the YAML dataset_id lookup) will raise a TypeError with a confusing error. Fix: pass dataset_id=None to explicitly opt out:

class SaltedDataset(Dataset, dataset_id=None):

*prompt[1:],
]
return {**data, "prompt": salted_parts}
return data # unsupported prompt type — skip salting
Collaborator Author


[Review Council — Claude] medium · bug

When the prompt is a multimodal list where ALL parts are non-text (e.g., all image_url), the loop finds no text part and falls through to return data with no warning:

if isinstance(prompt, list) and prompt:
    for i, part in enumerate(prompt):
        if isinstance(part, dict) and part.get("type") == "text":
            ...
            return {**data, "prompt": salted_parts}
return data  # unsupported prompt type — skip salting (silent!)

Unlike the input_tokens case (which emits a logger.warning), this fallback is completely silent. Users who enable salt=True for image-only multimodal datasets will not see any indication that KV-cache bypass is not functioning. Add a logger.warning consistent with the input_tokens path.

n_samples_from_dataset=ctx.dataloader.num_samples(),
n_samples_to_issue=warmup_cfg.n_requests,
min_sample_count=1,
rng_sched=random.Random(),
Collaborator Author


[Review Council — Claude] medium · design

After fixing the unseeded RNGs, both rng_sched and rng_sample_index for the warmup phase will be seeded with the same value (warmup_cfg.random_seed, default 0). This means both RNGs start in identical state and produce identical pseudo-random sequences — the scheduler and sample-index selection will be perfectly correlated.

Compare with the perf phase in RuntimeSettings.from_config which uses separate fields scheduler_random_seed and dataloader_random_seed. Consider using random_seed for one and random_seed ^ 0xDEAD (or adding warmup_scheduler_random_seed to WarmupConfig) for the other, so their sequences diverge.
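The xor-derived-seed suggestion, sketched minimally (warmup_rngs is a hypothetical helper, not code from the PR):

```python
import random

def warmup_rngs(random_seed: int) -> tuple[random.Random, random.Random]:
    """Derive two decorrelated streams from one warmup seed.

    Sketch of the suggested fix: xor one seed with a fixed constant so the
    scheduler and sample-index RNGs start in different states while both
    remain reproducible from warmup_cfg.random_seed.
    """
    rng_sched = random.Random(random_seed)
    rng_sample_index = random.Random(random_seed ^ 0xDEAD)
    return rng_sched, rng_sample_index
```

A dedicated warmup_scheduler_random_seed field would be the more explicit alternative, mirroring the perf phase's split.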

sd = SaltedDataset(inner)
assert sd.load_sample(0) == {}

def test_non_dict_sample_is_returned_as_is(self):
Collaborator Author


[Review Council — Claude] low · testing

sd.data = inner.data is redundant now that SaltedDataset.data is a property that delegates to self._inner.data. The assignment calls the property setter which sets self._inner.data = inner.data — a no-op since they reference the same list. This won't cause a test failure, but it misleads readers into thinking SaltedDataset.data is a plain attribute:

inner.data = ["raw string sample"]  # override with non-dict
sd = SaltedDataset(inner)
sd.data = inner.data  # redundant — sd.data property already returns inner.data

Remove the sd.data = inner.data line.


@arekay-nv
Collaborator Author

Review Council — Multi-AI Code Review

Reviewed by Claude (Codex CLI unavailable) | Depth: thorough | HEAD: `1143d64`

Found 7 new issues across 5 files. Inline comments posted for 4; 3 are out-of-diff or post-fix.

Cross-reviewer note: Issues already addressed in the previous round (unseeded RNGs, SaltedDataset.data snapshot, multimodal salting at index 0, inline imports, ruff-format) are not re-reported.

🔴 Must Fix (high)

# File Line Category Summary
1 commands/benchmark/execute.py 365 design Inline ↑ After RNG-seed fix: rng_sched and rng_sample_index both seeded with same random_seed value — identical pseudo-random sequences for scheduler and sample selection

🟡 Should Fix (medium)

# File Line Category Summary
2 dataset_manager/dataset.py 434 design Inline ↑ SaltedDataset is registered in Dataset.PREDEFINED under "SaltedDataset" via __init_subclass__; instantiating via registry raises TypeError (incompatible constructor). Fix: class SaltedDataset(Dataset, dataset_id=None)
3 dataset_manager/dataset.py 478 bug Inline ↑ Silent no-salt fallback for image-only multimodal prompts (all parts are image_url, no text part exists) — no logger.warning emitted, salt=True silently no-ops
4 load_generator/session.py 366 bug _drain_inflight awaits self._drain_event.wait() with no timeout. If a warmup request is dropped (network error, server restart), the drain hangs indefinitely and the benchmark never starts. Only SIGINT unblocks it.

🔵 Consider (low)

# File Line Category Summary
5 tests/unit/dataset_manager/test_salted_dataset.py 143 testing Inline ↑ sd.data = inner.data is redundant now that data is a property delegating to _inner.data
6 config/schema.py 413 api-contract random_seed: int = Field(0, ...) default is implicitly deterministic; Optional[int] = None with "use fresh seed when None" semantics would better match salt=True cache-busting intent
7 config/schema.py 395 api-contract WarmupConfig has no @cyclopts.Parameter annotations — warmup is YAML-only with no CLI flags. Undocumented limitation; users testing via CLI will find warmup silently inactive.

🤖 Posted by /review-council skill.

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
return cls(df, transforms=transforms, repeats=num_repeats)


class SaltedDataset(Dataset, register=False):
Collaborator Author

@arekay-nv arekay-nv left a comment


Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)

Re-reviewing after the author's response commits. Most round-1 findings are addressed; this round focuses on new code (global timeout, warmup_random_seed, register=False) and incomplete fixes.

Found 4 new issues across 3 files: 0 critical, 1 high, 1 medium, 2 low. See inline comments — a tiered summary will follow.

)
session.stop()

global_timeout_handle = loop.call_later(
Collaborator Author


[Review Council — Both (Codex P1 + Claude)] high · api-contract

The new global timer redefines what max_duration_ms means, but only when warmup is enabled, and the schema description was not updated.

max_duration_ms = ctx.rt_settings.max_duration_ms
if max_duration_ms is not None:
    ...
    global_timeout_handle = loop.call_later(
        max_duration_ms / 1000.0, _on_global_timeout
    )

session.py:_make_stop_check still enforces max_duration_ms per-phase against phase_start_ns, so we now have two clocks running on the same configured value:

  • per-phase, from each phase's start (the original semantic — and still what _build_phases clears to None for warmup/accuracy).
  • global, from loop.call_later scheduling time (warmup + perf + accuracy combined).

For a user with warmup.enabled=True and max_duration_ms=600_000 (10 min), if warmup takes 90s, the perf phase's effective wall-clock budget collapses from 10 min to ~8.5 min — silently. Worse, an accuracy pass in TestMode.BOTH (which sets max_duration_ms=None per-phase, expecting unbounded) can now be truncated mid-run by the global timer.

The schema description still reads "Maximum test duration in ms (0 for no limit)", which is the original per-phase meaning. Existing YAML configs will get shorter perf/accuracy phases once they enable warmup, with no documentation indicating why.

Suggested fixes (any one):

  1. Schedule the timer at the start of the performance phase (e.g., on START_PERFORMANCE_TRACKING) so warmup time doesn't eat the perf budget — preserves the per-phase semantic.
  2. Expose a separate warmup.max_duration_ms to bound only warmup, leave perf-phase max_duration_ms untouched.
  3. If a global cap is genuinely the goal, rename it (experiment_timeout_ms?), keep max_duration_ms per-phase, and document the change in WarmupConfig and release notes.

throughout the current phase."""
Bounded by the global experiment timeout: if the caller schedules a
loop.call_later that calls stop(), stop() sets _drain_event, unblocking
this wait without leaving it hung indefinitely."""
Collaborator Author


[Review Council — Claude] medium · error-handling

The new docstring on _drain_inflight claims a safety guarantee that only holds in some configurations:

async def _drain_inflight(self, phase_issuer: PhaseIssuer) -> None:
    """Wait for all in-flight responses from this phase to complete.

    Bounded by the global experiment timeout: if the caller schedules a
    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
    this wait without leaving it hung indefinitely."""
    ...
    self._drain_event.clear()
    await self._drain_event.wait()

The loop.call_later is only scheduled in execute.py:561 if ctx.rt_settings.max_duration_ms is not None. runtime_settings.from_config maps max_duration_ms == 0 to None, and the schema default is 0 — so the default offline/online CLI run has no global timer, and _drain_inflight retains its old unbounded await self._drain_event.wait(). A single dropped or hanging request will hang the benchmark forever.

The docstring suggests this case is solved; it isn't.

Suggested fixes:

  • Make the docstring conditional ("if a global timeout is configured…") and add a note that the drain is otherwise unbounded, OR
  • Add an intrinsic drain timeout (a fixed cap like 30–60s, or a multiple of observed completion latency) inside _drain_inflight itself so the safety property is a property of the function, not a contract on the caller.
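The intrinsic-timeout option can be sketched like this; drain_inflight is a hypothetical free-function version of the method, and the 60-second cap is an assumed default:

```python
import asyncio

async def drain_inflight(drain_event: asyncio.Event, timeout_s: float = 60.0) -> bool:
    """Wait for the phase drain event, but never hang past timeout_s.

    Returns True if the phase drained cleanly, False on timeout; the caller
    can then log the dropped in-flight requests and proceed. The safety
    property lives inside the function rather than in a caller-scheduled timer.
    """
    try:
        await asyncio.wait_for(drain_event.wait(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False
```

With this shape the docstring's bounded-wait claim holds in every configuration, including the default max_duration_ms=0 path.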

"Global experiment timeout reached (%d ms); stopping session.",
max_duration_ms,
)
session.stop()
Collaborator Author


[Review Council — Claude] low · concurrency

The global timer's callback calls session.stop() directly, with no guard against firing after cleanup has begun:

```python
def _on_global_timeout() -> None:
    logger.warning(
        "Global experiment timeout reached (%d ms); stopping session.",
        max_duration_ms,
    )
    session.stop()
```

global_timeout_handle.cancel() only runs after entering the finally: block at line 580, so the timer can fire during the gap between await session.run(phases) returning and the cancel() call running — or worse, during the slow cleanup that follows (http_client.shutdown_async(), publisher close, etc.). When that happens, session.stop() flips _stop_requested=True and _drain_event.set() on a session that has already finished. The downstream effect is mostly harmless today (the strategy task is None), but the race is real and easy to make harmful with future refactors.

Suggested fix:

```python
def _on_global_timeout() -> None:
    if session._done:  # or expose a public flag
        return
    logger.warning(...)
    session.stop()
```

Also consider moving global_timeout_handle.cancel() to immediately after session.run(phases) returns (success path), and only run it in finally to handle the exception path.
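
The guard pattern can be modeled in isolation. In this sketch `Session` is a stand-in for the real load-generator session (its actual flag and method names are assumptions); the point is only that the callback becomes a no-op once the run has finished, regardless of when the timer fires.

```python
from typing import Callable

class Session:
    """Minimal stand-in for the real session object."""
    def __init__(self) -> None:
        self.done = False            # set when run() returns
        self.stop_requested = False  # set by stop()

    def stop(self) -> None:
        self.stop_requested = True

def make_on_global_timeout(session: Session) -> Callable[[], None]:
    def _on_global_timeout() -> None:
        if session.done:
            # Timer fired during the cleanup gap; run() already finished,
            # so stopping now would only mutate a dead session.
            return
        session.stop()
    return _on_global_timeout
```

Cancelling the handle on the success path right after `session.run(phases)` returns shrinks the window, but the `done` check closes it entirely.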

```python
            for _ in range(n_draw)
        ]

        assert order_with == order_without, (
```
[Review Council — Claude] low · testing

test_performance_sample_order_identical_with_and_without_warmup does not actually exercise the property it claims to test:

```python
def make_rt():
    return RuntimeSettings(
        ...
        rng_sched=random.Random(99),
        rng_sample_index=random.Random(99),
        ...
    )

ctx_with = self._make_ctx(config_with, make_rt(), simple_dataset)
ctx_without = self._make_ctx(config_without, make_rt(), simple_dataset)
```

make_rt() is called twice and produces two separate RuntimeSettings instances with freshly-seeded Random(99) RNGs. _build_phases in the warmup path uses dataclass_replace which substitutes new Random(seed) instances on the warmup phase only — it never touches the perf phase's RNGs. So this test only verifies that _build_phases doesn't drain the perf RNG, which it never could (it only stores references). It does not verify the regression it claims to guard against: that executing the warmup phase leaves the perf-phase RNG untouched.

A tighter test would either:

  1. Build phases from a single shared RuntimeSettings, then advance the warmup phase's rng_sample_index (e.g., draw N samples from create_sample_order(warmup_rt)), and assert that the perf-phase order matches a control with no warmup, OR
  2. Drop this test as redundant with test_warmup_phase_uses_independent_rngs (which already asserts warmup_rt.rng_sched is not perf_rt.rng_sched). The current implementation gives a false sense of coverage for the actual race the docstring describes.
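
Option 1 can be modeled with plain `random.Random` objects. This is a schematic of the property, not the repository's test: `build_phases` and `sample_order` are stand-ins for `_build_phases` and `create_sample_order`, and the seeds and pool sizes are arbitrary.

```python
import random

def build_phases(perf_rng: random.Random, warmup_seed: int = 0):
    """Give warmup its own freshly seeded RNG; perf keeps the shared one."""
    return random.Random(warmup_seed), perf_rng

def sample_order(rng: random.Random, pool: int, n: int) -> list[int]:
    return [rng.randrange(pool) for _ in range(n)]

# Control: perf-phase order with no warmup activity at all.
control = sample_order(random.Random(99), pool=100, n=10)

# With warmup: actually DRAW from the warmup RNG before the perf phase,
# which is the step the current test never performs.
warmup_rng, perf_rng = build_phases(random.Random(99))
_ = sample_order(warmup_rng, pool=100, n=50)
with_warmup = sample_order(perf_rng, pool=100, n=10)

assert with_warmup == control  # warmup draws must not perturb perf order
```

The final assertion fails exactly when warmup shares (and therefore advances) the perf RNG, which is the regression the test's docstring describes.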

@arekay-nv

Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)

Re-review after the author's response commits. 4 new issues across 3 files. Inline comments in this review.

Cross-reviewer agreement (boosted confidence): execute.py:570 — global timeout semantic change — was the only finding both Codex and Claude flagged in round 2.

🔴 Must Fix (high)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/commands/benchmark/execute.py | 570 | api-contract | Both | Global timer makes warmup eat into the per-phase max_duration_ms budget; _make_stop_check still uses per-phase semantics → two clocks on the same configured value; accuracy phases configured for unbounded duration can be truncated |

🟡 Should Fix (medium)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/load_generator/session.py | 361 | error-handling | Claude | _drain_inflight docstring claims a safety guarantee that only holds when max_duration_ms is not None; the default offline/online run still has an unbounded drain on a hung request |

🔵 Consider (low)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/commands/benchmark/execute.py | 568 | concurrency | Claude | _on_global_timeout has no _done guard; it can fire during the cleanup gap between session.run() returning and global_timeout_handle.cancel() running |
| 2 | tests/unit/commands/test_benchmark.py | 775 | testing | Claude | test_performance_sample_order_identical_with_and_without_warmup builds two separate RuntimeSettings with fresh seeded RNGs — it doesn't exercise warmup perturbing the perf RNG; gives a false sense of coverage |

Round 1 follow-up status

  • ✅ Fixed — execute.py:365 — warmup RNGs unseeded — now random.Random(warmup_random_seed)
  • ✅ Fixed — dataset.py:467 — multimodal salt only first text — now scans for first text part at any index
  • ✅ Fixed — dataset.py:444 — SaltedDataset.data snapshot — now @property delegating to _inner.data
  • ✅ Fixed — test_benchmark.py:471 — lazy imports — imports moved to top
  • ✅ Fixed — dataset.py:434 — registry pollution — now class SaltedDataset(Dataset, register=False)
  • 🟡 Partial — dataset.py:459 — input_tokens not salted — warning added for input_tokens-without-prompt; common case (both present) still silently drops salt
  • 🟡 Partial — execute.py:357 — ad-hoc warmup_rt — now uses dataclass_replace ✅ but LoadPattern(MAX_THROUGHPUT) is still hardcoded
  • ⏸️ Unaddressed — dataset.py:414 — datasets/dataset_cache/ default — still a silent breaking default change
  • ⏸️ Unaddressed — session.py:273 — ENABLE_WARMUP env-var removed — no deprecation warning
  • ⏸️ Unaddressed — session.py:68 — drain_after=True default flips warmup — direct callers get opposite semantics
  • ⏸️ Unaddressed — session.py:423 — late warmup COMPLETE events still publish to ZMQ
  • ⏸️ Unaddressed — metrics_aggregator/__main__.py:100 — tokenizer fail silently downgrades
  • ⏸️ Unaddressed — dataset.py:464 — os.urandom salt ignores seeded RNG
  • ⏸️ Unaddressed — test_warmup.py:292 — _DELAY=0.15 flake risk
  • ⏸️ Unaddressed — test_warmup.py:65 — no multi-worker tests

Theme summary (round 2)

The biggest new concern is the global wall-clock timer added to bound the warmup drain. The mechanism works, but it silently changes what max_duration_ms means for any user who enables warmup (warmup time now eats into the perf-phase budget, and accuracy phases configured for None get truncated by the global cap). Either scope the timer to the perf phase, expose a separate warmup.max_duration_ms, or rename the field — but don't double-mean the same configured value.

The drain-safety claim in _drain_inflight is also overstated: with the schema default (max_duration_ms=0, mapped to None), no global timer is scheduled and the drain is unbounded — a single hung request can still hang the benchmark.

🤖 Posted by /review-council skill.
