Add GLM 5.1 TPS benchmarking support#83

Draft
teofeliu wants to merge 1 commit into main from cursor/glm-5-1-tps-ef63
Conversation

@teofeliu

Summary

This PR adds tooling to help measure and optimize TPS (tokens per second) for GLM 5.1 deployments, specifically to help the Revolut team diagnose and improve their B200 deployment performance.

Context

The Revolut team is seeing ~100 TPS on their GLM 5.1 deployment on 4xB200 GPUs, but expects ~300 TPS for a model of this size. This PR adds benchmarking tools to:

  1. Accurately measure current TPS
  2. Find optimal concurrency settings
  3. Test different token configurations
  4. Generate reports with recommendations

Changes

New files

  • llm_bench/locust-glm5p1.conf: Locust configuration file optimized for GLM 5.1 TPS testing
  • llm_bench/benchmark_glm5p1_tps.py: Comprehensive TPS benchmark suite that:
    • Runs concurrency sweeps to find the saturation point
    • Tests different prompt/output token configurations
    • Generates detailed reports with TPS analysis and recommendations
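The concurrency sweep described above can be sketched as follows. This is a hypothetical illustration of the saturation-point idea, not the actual logic in benchmark_glm5p1_tps.py; `run_benchmark` stands in for a call that drives the Locust load test at a given concurrency and returns aggregate output TPS.

```python
def find_saturation_point(run_benchmark, levels=(1, 2, 4, 8, 16, 32, 64),
                          min_gain=0.05):
    """Sweep concurrency levels; stop once TPS gains fall below min_gain.

    Returns (best_concurrency, best_tps). run_benchmark(concurrency=n)
    is assumed to return aggregate output tokens/sec at that concurrency.
    """
    best_conc, best_tps = levels[0], 0.0
    for conc in levels:
        tps = run_benchmark(concurrency=conc)
        # Once relative improvement drops below min_gain, throughput has
        # saturated; higher concurrency mostly adds latency, not TPS.
        if best_tps and (tps - best_tps) / best_tps < min_gain:
            break
        best_conc, best_tps = conc, tps
    return best_conc, best_tps

# Illustrative fake benchmark whose throughput saturates around 16 workers:
fake = {1: 100, 2: 190, 4: 350, 8: 600, 16: 640, 32: 650, 64: 655}
print(find_saturation_point(lambda concurrency: fake[concurrency]))  # → (16, 640)
```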

Modified files

  • llm_bench/load_test.py: Added output_tps metric to the summary output (calculated as QPS × average completion tokens)
  • llm_bench/README.md: Added documentation for GLM 5.1 TPS benchmarking
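For reference, the new output_tps metric is stated to be QPS × average completion tokens. A minimal sketch of that arithmetic (the parameter names here are illustrative, not the actual fields in load_test.py):

```python
def output_tps(total_requests: int, duration_s: float,
               total_completion_tokens: int) -> float:
    """Output tokens/sec as QPS * average completion tokens per request."""
    qps = total_requests / duration_s
    avg_completion_tokens = total_completion_tokens / total_requests
    return qps * avg_completion_tokens

# 120 requests over 60 s producing 6,000 completion tokens total:
print(output_tps(120, 60.0, 6000))  # → 100.0
```

Note that the two per-request counts cancel, so this is equivalent to total completion tokens divided by wall-clock duration.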

Usage

Quick TPS measurement

locust --config locust-glm5p1.conf \
  -H https://api.fireworks.ai/inference \
  -k $FIREWORKS_API_KEY \
  -m accounts/fireworks/models/glm-5p1 \
  -u 16 -t 2min

Comprehensive benchmark suite

python benchmark_glm5p1_tps.py \
  --host https://api.fireworks.ai/inference \
  --api-key $FIREWORKS_API_KEY \
  --model accounts/fireworks/models/glm-5p1

Testing

The changes are additive and don't modify existing behavior. The new output_tps metric is only shown for non-embedding/non-rerank benchmarks.


- Add output_tps metric to load_test.py summary (QPS * avg completion tokens)
- Add locust-glm5p1.conf configuration file for GLM 5.1 TPS testing
- Add benchmark_glm5p1_tps.py comprehensive TPS benchmark suite with:
  - Concurrency sweep to find saturation point
  - Token length sweep for different configurations
  - Automated report generation with recommendations
- Update README.md with GLM 5.1 TPS benchmarking documentation

This helps measure and optimize TPS for GLM 5.1 deployments on B200 GPUs.

Co-authored-by: Teo Feliu <teofeliu@gmail.com>