[WIP][experimental] multi turn chat benchmark #821

Draft
cquil11 wants to merge 145 commits into main from experimental/multi-turn-benchmark

Conversation


cquil11 (Collaborator) commented Feb 27, 2026

No description provided.

Rohan138 and others added 30 commits January 26, 2026 17:15
* fix AITER flags for v0.14.0 release

* drop mi325 triton gemm env var

* Add changes to perf changelog
…won't be erroneous negative diff [skip-sweep] (#571)
* remove assign

* initial

* update perf

* fix perf changelog

* trigger test sweep

* trigger test sweep pt 2

* rebase for evals only

* Update perf-changelog.yaml

* remove newline

* update perf changelog

---------

Co-authored-by: Cam Quilici <cjquilici@gmail.com>
* b300 srt slurm

* update generated srtslurm yaml

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix image

* add uv and sqsh file

* change partition

* change slurm account

* use regular srt

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* update perf changelog

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix runner

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* correct account

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* qos support

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix get checkout

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* update runner label and partition

* undo branch checkout

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* debug info

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* cleanup logging

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* use local model dir

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* checkout specific commit

Signed-off-by: jthomson04 <jothomson@nvidia.com>

---------

Signed-off-by: jthomson04 <jothomson@nvidia.com>
Co-authored-by: Sahithi Chigurupati <schigurupati@nvidia.com>
Co-authored-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com>
…won't be erroneous negative diff [skip-sweep] (#577)
* Update SGLang Docker Image for MI355 to v0.5.8

1. activate FP8 KV cache
2. use the MLA persistent kernel

* Do not activate FP8 KV cache and the MLA persistent kernel explicitly

* Add config-keys (v0.5.5.post3 --> v0.5.8)

* Update perf-changelog.yaml with key fix description for v0.5.8

Add description: Disables mla persistent kernel when not using fp8 kv_cache

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Change 30s default to 300s
* chore: save server log as artifact after single node runs

* test flaky eval

* test flaky eval

* test flaky eval

* rebase

* rebase pt 2

* add trap to upload server logs on exit

* rebase pt 3

* make server log in gha workspace

* export result filename at runtime so it is present

* revert perf changelog
* chore: add pre-merge check for newline in perf-changelog.yaml

Add a validation step in run-sweep.yml that ensures perf-changelog.yaml
ends with a newline character. This prevents negative diff issues in
subsequent PRs when the file is appended to.

Closes #578

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

* test

* change logic of newline check

* trigger test check

* remove test perf changelog

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cquil11 and others added 30 commits March 12, 2026 08:22
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When DURATION is set, generates a 10k conversation pool and runs for
that duration with grace-period=0. When unset, runs to completion.
Added --dataset-sampler shuffle to avoid sequential ordering bias.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
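The duration-vs-completion logic above can be sketched as a small helper. This is a hypothetical sketch, not the actual workflow code; the flag names (`--benchmark-duration`, `--benchmark-grace-period`, `--conversation-num`) and the helper itself are assumptions for illustration.

```python
# Hypothetical sketch of the DURATION-controlled argument selection
# described above. Flag names are assumed, not confirmed from the repo.
def build_aiperf_args(duration_s):
    args = ["--dataset-sampler", "shuffle"]  # avoid sequential ordering bias
    if duration_s is not None:
        # fixed-duration mode: large conversation pool, stop hard at the
        # deadline (grace period 0) instead of draining in-flight turns
        args += [
            "--conversation-num", "10000",
            "--benchmark-duration", str(duration_s),
            "--benchmark-grace-period", "0",
        ]
    return args

# unset DURATION -> run the dataset to completion with shuffled sampling
```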
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs analyze_benchmark_distributions.py after AIPerf completes to
generate turn count, ISL/OSL distribution stats and plots. Results
uploaded as artifacts for verification against Qwen trace profile.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows running sweeps with different model precisions by passing
-f precision='fp8' to select the correct benchmark script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g, fix CUDA arch

- Remove compilation-config (Blackwell-specific custom ops)
- TORCH_CUDA_ARCH_LIST=9.0 (Hopper, not 10.0 Blackwell)
- Remove --attention-config.use_trtllm_attention=0 (Blackwell-specific)
- Add --disable-log-requests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Synthetic workload: 10 turns (stddev 1), 2000 ISL (stddev 200),
500 OSL (stddev 50), 2s think time (stddev 500ms).
No dataset file — uses AIPerf built-in synthetic generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM docs: "When TP > 1, this is the total buffer size summed across
all TP ranks." We were dividing by TP, giving each rank less offload
space than intended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The vLLM docs say "when TP > 1, this is the total buffer size summed
across all TP ranks" — meaning the value is per-rank and the total is
the sum. Original calculation was correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM passes kv_offloading_size * (1 << 30) directly as cpu_bytes_to_use.
The per-rank division happens internally via world_size in the block
cost calculation. Our division by TP was double-dividing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
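The double-division bug described in the last three commits can be shown with a small sketch. The function names are hypothetical; the point is only the arithmetic: vLLM receives `kv_offloading_size * (1 << 30)` as a total and divides by world size internally, so pre-dividing by TP divides twice.

```python
# Hypothetical sketch of the kv_offloading_size double-division bug.
# vLLM treats the configured size (GiB) as the TOTAL across TP ranks and
# performs the per-rank split internally via world_size.
GIB = 1 << 30

def cpu_bytes_buggy(kv_offloading_size_gib, tp):
    # old behavior: pre-divided by TP before handing the value to vLLM,
    # which then divided again internally -> each rank got 1/TP too little
    return (kv_offloading_size_gib // tp) * GIB

def cpu_bytes_fixed(kv_offloading_size_gib):
    # fixed: pass the total; vLLM does the per-rank division itself
    return kv_offloading_size_gib * GIB
```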
Error/timeout records may not have input_sequence_length or
output_sequence_length in their metrics, causing KeyError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
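The KeyError guard above amounts to filtering records before reading sequence lengths. A minimal sketch with a hypothetical record shape (the real analyzer's schema may differ):

```python
# Hypothetical sketch: skip error/timeout records whose metrics lack
# sequence lengths instead of raising KeyError.
records = [
    {"metrics": {"input_sequence_length": 2000, "output_sequence_length": 500}},
    {"metrics": {}},  # timeout record with no length metrics
]

pairs = []
for rec in records:
    m = rec["metrics"]
    # only keep records that carry both lengths
    if "input_sequence_length" in m and "output_sequence_length" in m:
        pairs.append((m["input_sequence_length"], m["output_sequence_length"]))
```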
The min(6000, ...) was hiding all requests above 6K tokens, making
multi-turn context growth invisible in the plot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
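The axis-clipping fix above is just removing a `min(6000, ...)` cap; a toy sketch with made-up token counts:

```python
# Hypothetical sketch of the plot-axis fix: capping the ISL axis at 6000
# hid the long-context tail that multi-turn growth produces.
isl = [1800, 4200, 9500, 14000]  # example token counts, including >6K

buggy_upper = min(6000, max(isl))  # always capped at 6000
fixed_upper = max(isl)             # shows the full context growth
```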
New row: max ISL per conversation (final context size), total OSL per
conversation, and max ISL vs turn count scatter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
conversation_id is the template ID which gets reused across sessions.
x_correlation_id is unique per session. Without this fix, sessions
sharing the same template appeared as one conversation with duplicate
turn indices, inflating turn counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
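The template-ID-vs-session-ID fix above can be demonstrated with a grouping sketch. Record fields mirror the names in the commit message; the data is invented:

```python
from collections import defaultdict

# Hypothetical records: two sessions reuse conversation template "tmpl-a".
records = [
    {"conversation_id": "tmpl-a", "x_correlation_id": "s1", "turn": 0},
    {"conversation_id": "tmpl-a", "x_correlation_id": "s1", "turn": 1},
    {"conversation_id": "tmpl-a", "x_correlation_id": "s2", "turn": 0},
]

# Buggy grouping by template ID merges both sessions into one
# "conversation" with duplicate turn indices, inflating turn counts.
by_template = defaultdict(list)
for r in records:
    by_template[r["conversation_id"]].append(r["turn"])

# Fixed grouping by the per-session correlation ID keeps them separate.
by_session = defaultdict(list)
for r in records:
    by_session[r["x_correlation_id"]].append(r["turn"])
```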
Same workload as H200 homogeneous (10 turns, 2k ISL, 500 OSL) but
with Blackwell-specific config: CUDA 10.0, compilation config,
trtllm attention disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable --steady-state-prefill in homogeneous benchmark scripts.
Update aiperf submodule with steady-state prefill implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two new features for benchmark_serving_multi_turn.py:

1. --synthetic mode: generates multi-turn conversations using Shakespeare
   corpus with configurable ISL/OSL/turns distributions (mean + stddev).
   Self-contained, no AIPerf dependency.

2. --steady-state-prefill: actually runs prefill turns through the server
   before benchmarking starts. Each client's first conversation is
   assigned a staggered starting turn and turns 0..N-1 are executed
   to warm the KV cache. Unlike --virtual-history which just skips
   turns (cold KV), this produces warm cache at benchmark start.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
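The staggered starting-turn assignment described for `--steady-state-prefill` can be sketched as follows. The helper name and even-spreading policy are assumptions; the actual script may stagger differently.

```python
# Hypothetical sketch of steady-state prefill staggering: each client's
# first conversation starts at a different turn, and turns 0..start-1 are
# actually executed against the server to warm the KV cache (unlike
# --virtual-history, which skips them and leaves the cache cold).
def assign_start_turns(num_clients, total_turns):
    # spread starting turns evenly across [0, total_turns)
    return [(i * total_turns) // num_clients for i in range(num_clients)]

starts = assign_start_turns(num_clients=4, total_turns=10)
# each client would then replay turns 0..start-1 as warmup before
# measurement begins
```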
Same workload as AIPerf homogeneous (10 turns, 2k ISL, 500 OSL) but
uses benchmark_serving_multi_turn.py with --synthetic and
--steady-state-prefill. One conversation per client (max-active=1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom client doesn't have --ignore-eos, --save-result, --result-dir,
--result-filename. Removed these flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sets --limit-min-tokens 500 --limit-max-tokens 500 to ensure
the model generates exactly 500 tokens per turn (equivalent to
ignore_eos in AIPerf).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Setting --limit-min-tokens 0 --limit-max-tokens 0 tells the client to
read the assistant message token count from the dataset and use it as
both min_tokens and max_tokens per request. This gives per-request
ignore_eos behavior matching the synthetic OSL distribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
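The sentinel behavior above (both limits set to 0 meaning "read the dataset's OSL") can be sketched with a hypothetical resolver; the function is illustrative, not the client's actual code:

```python
# Hypothetical sketch: with --limit-min-tokens 0 --limit-max-tokens 0,
# each request's token limits come from the dataset's assistant turn,
# giving per-request ignore_eos-style behavior.
def resolve_token_limits(limit_min, limit_max, dataset_osl):
    if limit_min == 0 and limit_max == 0:
        # sentinel: pin both min_tokens and max_tokens to the dataset OSL
        return dataset_osl, dataset_osl
    return limit_min, limit_max
```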
Passes ignore_eos=true in the API request payload to force generating
until max_tokens. Works the same way as AIPerf's --extra-inputs
ignore_eos:true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
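The payload change above is a one-field addition. A sketch of the request body, with a placeholder model name (`ignore_eos` is a server-side extension field, not part of the base OpenAI schema):

```python
# Hypothetical request payload sketch: ignore_eos forces generation to
# run all the way to max_tokens, mirroring AIPerf's
# --extra-inputs ignore_eos:true.
payload = {
    "model": "some-model",  # placeholder
    "messages": [{"role": "user", "content": "..."}],
    "max_tokens": 500,
    "ignore_eos": True,  # extension field honored by e.g. vLLM/SGLang
}
```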
max_active_conversations is divided by num_clients to get per-client
limit. Setting it to 1 with 512 clients gives 0, which fails.
Set to USERS so each client manages 1 conversation at a time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
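The integer-division failure mode above is worth spelling out, since it is easy to hit with any global-limit-divided-by-clients scheme:

```python
# Sketch of the per-client limit arithmetic described above.
def per_client_limit(max_active_conversations, num_clients):
    return max_active_conversations // num_clients

# Global limit 1 with 512 clients floors to 0 per client -> fails.
# Setting the limit to USERS (= num_clients) gives each client exactly 1
# conversation at a time.
```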