Draft
- Add output_tps metric to load_test.py summary (QPS * avg completion tokens)
- Add locust-glm5p1.conf configuration file for GLM 5.1 TPS testing
- Add benchmark_glm5p1_tps.py comprehensive TPS benchmark suite with:
  - Concurrency sweep to find saturation point
  - Token length sweep for different configurations
  - Automated report generation with recommendations
- Update README.md with GLM 5.1 TPS benchmarking documentation

This helps measure and optimize TPS for GLM 5.1 deployments on B200 GPUs.

Co-authored-by: Teo Feliu <teofeliu@gmail.com>
Summary
This PR adds tooling to help measure and optimize TPS (tokens per second) for GLM 5.1 deployments, specifically to help the Revolut team diagnose and improve their B200 deployment performance.
Context
The Revolut team is seeing ~100 TPS on their GLM 5.1 deployment on 4xB200 GPUs, but expects ~300 TPS for a model of this size. This PR adds benchmarking tools to help diagnose that gap.
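The benchmark suite's concurrency sweep finds the saturation point by raising the concurrent user count until aggregate output TPS stops improving. A minimal sketch of that idea, where measure_tps is a hypothetical stand-in for one timed benchmark run (not an actual function from this PR):

```python
def find_saturation(measure_tps, concurrencies=(1, 2, 4, 8, 16, 32, 64)):
    """Return (concurrency, tps) at the point past which adding
    concurrent users no longer meaningfully raises aggregate TPS."""
    best_c, best_tps = None, 0.0
    for c in concurrencies:
        tps = measure_tps(c)  # one benchmark run at concurrency c
        # Stop once the gain over the previous best is under 5%.
        if best_tps and tps < best_tps * 1.05:
            break
        best_c, best_tps = c, tps
    return best_c, best_tps
```

With a deployment that saturates around 300 TPS, the sweep would stop early instead of wasting time on higher concurrencies.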
Changes
New files
- llm_bench/locust-glm5p1.conf: Locust configuration file optimized for GLM 5.1 TPS testing
- llm_bench/benchmark_glm5p1_tps.py: Comprehensive TPS benchmark suite that:
  - Sweeps concurrency to find the saturation point
  - Sweeps token lengths for different configurations
  - Generates an automated report with recommendations

Modified files
- llm_bench/load_test.py: Added output_tps metric to the summary output (calculated as QPS × average completion tokens)
- llm_bench/README.md: Added documentation for GLM 5.1 TPS benchmarking

Usage
Quick TPS measurement
locust --config locust-glm5p1.conf \
  -H https://api.fireworks.ai/inference \
  -k $FIREWORKS_API_KEY \
  -m accounts/fireworks/models/glm-5p1 \
  -u 16 -t 2min

Comprehensive benchmark suite
python benchmark_glm5p1_tps.py \
  --host https://api.fireworks.ai/inference \
  --api-key $FIREWORKS_API_KEY \
  --model accounts/fireworks/models/glm-5p1

Testing
The changes are additive and don't modify existing behavior. The new
output_tps metric is only shown for non-embedding/non-rerank benchmarks.
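As a rough illustration, the QPS × average-completion-tokens calculation behind the new metric can be sketched as follows (a minimal standalone version, not the actual load_test.py code):

```python
def output_tps(num_requests: int, duration_s: float,
               completion_tokens: list[int]) -> float:
    """Aggregate generated-token throughput: QPS * avg completion tokens."""
    if not completion_tokens or duration_s <= 0:
        return 0.0
    qps = num_requests / duration_s
    avg_tokens = sum(completion_tokens) / len(completion_tokens)
    return qps * avg_tokens

# e.g. 120 requests over 60 s, ~50 completion tokens each -> 100 output TPS
```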