Benchmark for latency and throughput both GPU server and Modal by ArnavBharti · Pull Request #12 · ScaledFocus/masala-embed

ArnavBharti · 2025-10-23T12:25:51Z

=== Benchmark Summary ===
Mode: server
Target: https://scaledfocus--masala-embed-server-modal-fastapi-app.modal.run/v1/embeddings
Concurrency: 100
Payloads: 3000
Request timeout: 120s

=== Per-stage results ===
--- stage-1 (duration=20s, target_rps=5) ---
Requests: 100 Successes: 100 Failures: 0
Measured wall-time (first->last): 16.251 s
Throughput (successful reqs / wall-time): 6.154 req/s
End-to-end latency p50/p95/p99: 475.93 / 3199.36 / 4002.60 ms
vLLM p50/p95/p99: 166.60 / 344.90 / 364.09 ms
Normalize p50/p95/p99: 0.01 / 0.02 / 0.04 ms
Search p50/p95/p99: 0.17 / 0.26 / 0.29 ms
Server p50/p95/p99: 166.90 / 345.12 / 364.27 ms

--- stage-2 (duration=40s, target_rps=20) ---
Requests: 796 Successes: 796 Failures: 0
Measured wall-time (first->last): 40.084 s
Throughput (successful reqs / wall-time): 19.858 req/s
End-to-end latency p50/p95/p99: 513.94 / 588.97 / 633.42 ms
vLLM p50/p95/p99: 191.36 / 251.17 / 282.14 ms
Normalize p50/p95/p99: 0.01 / 0.02 / 0.02 ms
Search p50/p95/p99: 0.14 / 0.25 / 0.40 ms
Server p50/p95/p99: 191.60 / 251.36 / 282.32 ms

--- stage-3 (duration=60s, target_rps=50) ---
Requests: 2980 Successes: 2980 Failures: 0
Measured wall-time (first->last): 60.127 s
Throughput (successful reqs / wall-time): 49.562 req/s
End-to-end latency p50/p95/p99: 656.43 / 877.53 / 1108.32 ms
vLLM p50/p95/p99: 295.64 / 448.96 / 521.56 ms
Normalize p50/p95/p99: 0.01 / 0.02 / 0.04 ms
Search p50/p95/p99: 0.17 / 0.42 / 0.61 ms
Server p50/p95/p99: 295.85 / 449.30 / 521.91 ms

--- stage-4 (duration=60s, target_rps=100) ---
Requests: 5960 Successes: 5960 Failures: 0
Measured wall-time (first->last): 90.897 s
Throughput (successful reqs / wall-time): 65.569 req/s
End-to-end latency p50/p95/p99: 1513.67 / 1903.91 / 2114.91 ms
vLLM p50/p95/p99: 314.50 / 891.68 / 1387.54 ms
Normalize p50/p95/p99: 0.01 / 0.02 / 0.14 ms
Search p50/p95/p99: 0.20 / 0.60 / 0.78 ms
Server p50/p95/p99: 314.76 / 891.96 / 1387.78 ms

=== Overall ===
Requests: 9836 Successes: 9836 Failures: 0
Measured wall-time (first->last): 208.773 s
Throughput (successful reqs / wall-time): 47.113 req/s
End-to-end latency p50/p95/p99: 1330.15 / 1823.62 / 2086.31 ms
vLLM p50/p95/p99: 295.48 / 685.89 / 1349.32 ms
Normalize p50/p95/p99: 0.01 / 0.02 / 0.10 ms
Search p50/p95/p99: 0.18 / 0.55 / 0.74 ms
Server p50/p95/p99: 295.81 / 686.11 / 1349.63 ms
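
As a quick sanity check on the numbers above, the per-stage throughput can be recomputed from the reported success counts and wall-times and compared with each stage's `target_rps`. The sketch below is not part of the benchmark script in this PR; the `STAGES` table and the `summarize` helper are hypothetical names that only re-use figures copied from the output. The recomputed throughput should agree with the reported values to within rounding of the printed wall-time. The "non-server" column subtracts the server p50 from the end-to-end p50, which is only a rough indicator, since a difference of medians is not the median of per-request differences.

```python
# Figures copied from the benchmark output above. Column order:
# (stage, duration_s, target_rps, successes, wall_time_s, e2e_p50_ms, server_p50_ms)
STAGES = [
    ("stage-1", 20, 5, 100, 16.251, 475.93, 166.90),
    ("stage-2", 40, 20, 796, 40.084, 513.94, 191.60),
    ("stage-3", 60, 50, 2980, 60.127, 656.43, 295.85),
    ("stage-4", 60, 100, 5960, 90.897, 1513.67, 314.76),
]


def summarize(stages):
    """Recompute throughput per stage and compare it with the configured target."""
    rows = []
    for name, duration_s, target_rps, successes, wall_s, e2e_p50, server_p50 in stages:
        throughput = successes / wall_s  # successful reqs / wall-time
        rows.append({
            "stage": name,
            "throughput_rps": round(throughput, 3),
            # Share of the configured target rate that was actually achieved.
            "fraction_of_target": round(throughput / target_rps, 2),
            # How far the measured wall-time ran past the configured duration.
            "overrun_s": round(wall_s - duration_s, 3),
            # Rough estimate of time spent outside the server (network, queueing).
            "non_server_p50_ms": round(e2e_p50 - server_p50, 2),
        })
    return rows


for row in summarize(STAGES):
    print(row)
```

Under these figures, stages 2 and 3 land within about 1% of their targets, while stage-4 reaches roughly two thirds of its 100 req/s target and runs about 31 s past its 60 s window. That suggests the deployment saturates somewhere between 50 and 100 req/s at this concurrency.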


ArnavBharti and others added 14 commits September 28, 2025 17:32

feat: add vllm inference for both modal and local server

9c280b4

fix: linting error

84a4291

changed model_name in vllm_inference_local.py

6c86453

Update vllm_inference_local.py

e6d49c1

commented out os.environ["CUDA_VISIBLE_DEVICES"] = "0" line.

commented a single line

1785303

Merge branch 'infra' of https://github.com/ScaledFocus/masala-embed into infra

bd5c837

chore: lint error fix

0d64ab3

Local GPU full workflow

03efd14

refactor: local setup and convert it to modal

54802a9

fix: off by one error in index

b1b7abc

chore: lint fixes

04cd225

increase concurrency

694803f

benchmarking for latency and throughput

768a9bd

removed a test case

b575f8a

NirantK self-requested a review October 23, 2025 12:32

NirantK closed this Nov 28, 2025

ArnavBharti wants to merge 14 commits into main from benchmarking


3 participants
