[WIP][experimental] multi turn chat benchmark #821

Draft
cquil11 wants to merge 145 commits into main from experimental/multi-turn-benchmark

Conversation


cquil11 (Collaborator) commented Feb 27, 2026

No description provided.

Rohan138 and others added 30 commits January 26, 2026 17:15
* fix AITER flags for v0.14.0 release

* drop mi325 triton gemm env var

* Add changes to perf changelog
…won't be erroneous negative diff [skip-sweep] (#571)
* remove assign

* initial

* update perf

* fix perf changelog

* trigger test sweep

* trigger test sweep pt 2

* rebase for evals only

* Update perf-changelog.yaml

* remove newline

* update perf changelog

---------

Co-authored-by: Cam Quilici <cjquilici@gmail.com>
* b300 srt slurm

* update generated srtslurm yaml

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix image

* add uv and sqsh file

* change partition

* change slurm account

* use regular srt

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* update perf changelog

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix runner

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* correct account

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* qos support

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* fix get checkout

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* update runner label and partition

* undo branch checkout

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* debug info

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* cleanup logging

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* use local model dir

Signed-off-by: jthomson04 <jothomson@nvidia.com>

* checkout specific commit

Signed-off-by: jthomson04 <jothomson@nvidia.com>

---------

Signed-off-by: jthomson04 <jothomson@nvidia.com>
Co-authored-by: Sahithi Chigurupati <schigurupati@nvidia.com>
Co-authored-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com>
…won't be erroneous negative diff [skip-sweep] (#577)
* Update SGLang Docker Image for MI355 to v0.5.8

1. activate FP8 KV cache
2. use the MLA persistent kernel

* Do not activate FP8 KV cache and the MLA persistent kernel explicitly

* Add config-keys (v0.5.5.post3 --> v0.5.8)

* Update perf-changelog.yaml with key fix description for v0.5.8

Add description: Disables mla persistent kernel when not using fp8 kv_cache

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Change 30s default to 300s
* chore: save server log as artifact after single node runs

* test flaky eval

* test flaky eval

* test flaky eval

* rebase

* rebase pt 2

* add trap to upload server logs on exit

* rebase pt 3

* make server log in gha workspace

* export result filename at runtime so it is present

* revert perf changelog
* chore: add pre-merge check for newline in perf-changelog.yaml

Add a validation step in run-sweep.yml that ensures perf-changelog.yaml
ends with a newline character. This prevents negative diff issues in
subsequent PRs when the file is appended to.

Closes #578

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

* test

* change logic of newline check

* trigger test check

* remove test perf changelog

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cquil11 and others added 30 commits March 12, 2026 08:22
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When DURATION is set, generates a 10k conversation pool and runs for
that duration with grace-period=0. When unset, runs to completion.
Added --dataset-sampler shuffle to avoid sequential ordering bias.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
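The duration-vs-completion logic above can be sketched as a small helper. This is a hypothetical sketch, not the actual workflow code; the flag names (`--benchmark-duration`, `--benchmark-grace-period`, `--conversation-num`) and the helper itself are assumptions for illustration.

```python
# Hypothetical sketch of the DURATION-controlled argument selection
# described above. Flag names are assumed, not confirmed from the repo.
def build_aiperf_args(duration_s):
    args = ["--dataset-sampler", "shuffle"]  # avoid sequential ordering bias
    if duration_s is not None:
        # fixed-duration mode: large conversation pool, stop hard at the
        # deadline (grace period 0) instead of draining in-flight turns
        args += [
            "--conversation-num", "10000",
            "--benchmark-duration", str(duration_s),
            "--benchmark-grace-period", "0",
        ]
    return args

# unset DURATION -> run the dataset to completion with shuffled sampling
```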
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs analyze_benchmark_distributions.py after AIPerf completes to
generate turn count, ISL/OSL distribution stats and plots. Results
uploaded as artifacts for verification against Qwen trace profile.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows running sweeps with different model precisions by passing
-f precision='fp8' to select the correct benchmark script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g, fix CUDA arch

- Remove compilation-config (Blackwell-specific custom ops)
- TORCH_CUDA_ARCH_LIST=9.0 (Hopper, not 10.0 Blackwell)
- Remove --attention-config.use_trtllm_attention=0 (Blackwell-specific)
- Add --disable-log-requests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Synthetic workload: 10 turns (stddev 1), 2000 ISL (stddev 200),
500 OSL (stddev 50), 2s think time (stddev 500ms).
No dataset file — uses AIPerf built-in synthetic generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM docs: "When TP > 1, this is the total buffer size summed across
all TP ranks." We were dividing by TP, giving each rank less offload
space than intended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The vLLM docs say "when TP > 1, this is the total buffer size summed
across all TP ranks" — meaning the value is per-rank and the total is
the sum. Original calculation was correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM passes kv_offloading_size * (1 << 30) directly as cpu_bytes_to_use.
The per-rank division happens internally via world_size in the block
cost calculation. Our division by TP was double-dividing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
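The double-division bug described in the last three commits can be shown with a small sketch. The function names are hypothetical; the point is only the arithmetic: vLLM receives `kv_offloading_size * (1 << 30)` as a total and divides by world size internally, so pre-dividing by TP divides twice.

```python
# Hypothetical sketch of the kv_offloading_size double-division bug.
# vLLM treats the configured size (GiB) as the TOTAL across TP ranks and
# performs the per-rank split internally via world_size.
GIB = 1 << 30

def cpu_bytes_buggy(kv_offloading_size_gib, tp):
    # old behavior: pre-divided by TP before handing the value to vLLM,
    # which then divided again internally -> each rank got 1/TP too little
    return (kv_offloading_size_gib // tp) * GIB

def cpu_bytes_fixed(kv_offloading_size_gib):
    # fixed: pass the total; vLLM does the per-rank division itself
    return kv_offloading_size_gib * GIB
```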
Error/timeout records may not have input_sequence_length or
output_sequence_length in their metrics, causing KeyError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
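The KeyError guard above amounts to filtering records before reading sequence lengths. A minimal sketch with a hypothetical record shape (the real analyzer's schema may differ):

```python
# Hypothetical sketch: skip error/timeout records whose metrics lack
# sequence lengths instead of raising KeyError.
records = [
    {"metrics": {"input_sequence_length": 2000, "output_sequence_length": 500}},
    {"metrics": {}},  # timeout record with no length metrics
]

pairs = []
for rec in records:
    m = rec["metrics"]
    # only keep records that carry both lengths
    if "input_sequence_length" in m and "output_sequence_length" in m:
        pairs.append((m["input_sequence_length"], m["output_sequence_length"]))
```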
The min(6000, ...) was hiding all requests above 6K tokens, making
multi-turn context growth invisible in the plot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
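The axis-clipping fix above is just removing a `min(6000, ...)` cap; a toy sketch with made-up token counts:

```python
# Hypothetical sketch of the plot-axis fix: capping the ISL axis at 6000
# hid the long-context tail that multi-turn growth produces.
isl = [1800, 4200, 9500, 14000]  # example token counts, including >6K

buggy_upper = min(6000, max(isl))  # always capped at 6000
fixed_upper = max(isl)             # shows the full context growth
```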
New row: max ISL per conversation (final context size), total OSL per
conversation, and max ISL vs turn count scatter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
conversation_id is the template ID which gets reused across sessions.
x_correlation_id is unique per session. Without this fix, sessions
sharing the same template appeared as one conversation with duplicate
turn indices, inflating turn counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
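The template-ID-vs-session-ID fix above can be demonstrated with a grouping sketch. Record fields mirror the names in the commit message; the data is invented:

```python
from collections import defaultdict

# Hypothetical records: two sessions reuse conversation template "tmpl-a".
records = [
    {"conversation_id": "tmpl-a", "x_correlation_id": "s1", "turn": 0},
    {"conversation_id": "tmpl-a", "x_correlation_id": "s1", "turn": 1},
    {"conversation_id": "tmpl-a", "x_correlation_id": "s2", "turn": 0},
]

# Buggy grouping by template ID merges both sessions into one
# "conversation" with duplicate turn indices, inflating turn counts.
by_template = defaultdict(list)
for r in records:
    by_template[r["conversation_id"]].append(r["turn"])

# Fixed grouping by the per-session correlation ID keeps them separate.
by_session = defaultdict(list)
for r in records:
    by_session[r["x_correlation_id"]].append(r["turn"])
```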
Same workload as H200 homogeneous (10 turns, 2k ISL, 500 OSL) but
with Blackwell-specific config: CUDA 10.0, compilation config,
trtllm attention disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enable --steady-state-prefill in homogeneous benchmark scripts.
Update aiperf submodule with steady-state prefill implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two new features for benchmark_serving_multi_turn.py:

1. --synthetic mode: generates multi-turn conversations using Shakespeare
   corpus with configurable ISL/OSL/turns distributions (mean + stddev).
   Self-contained, no AIPerf dependency.

2. --steady-state-prefill: actually runs prefill turns through the server
   before benchmarking starts. Each client's first conversation is
   assigned a staggered starting turn and turns 0..N-1 are executed
   to warm the KV cache. Unlike --virtual-history which just skips
   turns (cold KV), this produces warm cache at benchmark start.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
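The staggered starting-turn assignment described for `--steady-state-prefill` can be sketched as follows. The helper name and even-spreading policy are assumptions; the actual script may stagger differently.

```python
# Hypothetical sketch of steady-state prefill staggering: each client's
# first conversation starts at a different turn, and turns 0..start-1 are
# actually executed against the server to warm the KV cache (unlike
# --virtual-history, which skips them and leaves the cache cold).
def assign_start_turns(num_clients, total_turns):
    # spread starting turns evenly across [0, total_turns)
    return [(i * total_turns) // num_clients for i in range(num_clients)]

starts = assign_start_turns(num_clients=4, total_turns=10)
# each client would then replay turns 0..start-1 as warmup before
# measurement begins
```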
Same workload as AIPerf homogeneous (10 turns, 2k ISL, 500 OSL) but
uses benchmark_serving_multi_turn.py with --synthetic and
--steady-state-prefill. One conversation per client (max-active=1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Custom client doesn't have --ignore-eos, --save-result, --result-dir,
--result-filename. Removed these flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sets --limit-min-tokens 500 --limit-max-tokens 500 to ensure
the model generates exactly 500 tokens per turn (equivalent to
ignore_eos in AIPerf).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Setting --limit-min-tokens 0 --limit-max-tokens 0 tells the client to
read the assistant message token count from the dataset and use it as
both min_tokens and max_tokens per request. This gives per-request
ignore_eos behavior matching the synthetic OSL distribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
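The sentinel behavior above (both limits set to 0 meaning "read the dataset's OSL") can be sketched with a hypothetical resolver; the function is illustrative, not the client's actual code:

```python
# Hypothetical sketch: with --limit-min-tokens 0 --limit-max-tokens 0,
# each request's token limits come from the dataset's assistant turn,
# giving per-request ignore_eos-style behavior.
def resolve_token_limits(limit_min, limit_max, dataset_osl):
    if limit_min == 0 and limit_max == 0:
        # sentinel: pin both min_tokens and max_tokens to the dataset OSL
        return dataset_osl, dataset_osl
    return limit_min, limit_max
```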
Passes ignore_eos=true in the API request payload to force generating
until max_tokens. Works the same way as AIPerf's --extra-inputs
ignore_eos:true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
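The payload change above is a one-field addition. A sketch of the request body, with a placeholder model name (`ignore_eos` is a server-side extension field, not part of the base OpenAI schema):

```python
# Hypothetical request payload sketch: ignore_eos forces generation to
# run all the way to max_tokens, mirroring AIPerf's
# --extra-inputs ignore_eos:true.
payload = {
    "model": "some-model",  # placeholder
    "messages": [{"role": "user", "content": "..."}],
    "max_tokens": 500,
    "ignore_eos": True,  # extension field honored by e.g. vLLM/SGLang
}
```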
max_active_conversations is divided by num_clients to get per-client
limit. Setting it to 1 with 512 clients gives 0, which fails.
Set to USERS so each client manages 1 conversation at a time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
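The integer-division failure mode above is worth spelling out, since it is easy to hit with any global-limit-divided-by-clients scheme:

```python
# Sketch of the per-client limit arithmetic described above.
def per_client_limit(max_active_conversations, num_clients):
    return max_active_conversations // num_clients

# Global limit 1 with 512 clients floors to 0 per client -> fails.
# Setting the limit to USERS (= num_clients) gives each client exactly 1
# conversation at a time.
```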