Add embeddings endpoint testing support #501
maryamtahhan wants to merge 2 commits into vllm-project:main from
Conversation
Sorry for the delay on review; I am hung up on some performance regression work, so this will probably be waiting a bit longer, possibly into next year. One note: we will be merging #478 first, which will affect this PR.

No problem.
I will update this PR since #478 has been merged.
Still working on this - will post an update soon.
Some high-level comments that need to be addressed before a full review:
- "Quality" testing should be stripped out of this PR and put in a followup PR.
- Start with just the console and json/yaml output. Move the rest to a followup.
- Try to remove the random formatting changes to unrelated code, or submit a separate PR that this one builds on which cleans up formatting.
- There is a lot of duplicated code that is going to lead to a maintenance headache down the line. For classes that contain "Generative AI" specific code where some of the functionality is needed for embeddings, move the shared functionality to a base class and have both versions inherit from it.
- Don't lazy or late import. If you're importing something non-default, import it in `extras/embeddings` and do something like this (without quality, this is probably unnecessary in this PR).
Basically, strip this PR down to an MVP. It's too many misc changes to meaningfully review.
Ok - I will refactor this to an MVP and push incremental PRs.
Implements embeddings benchmarking functionality with a streamlined MVP scope, removing non-essential features to focus on core functionality. Also refactors common code patterns to eliminate duplication between embeddings and generative benchmarks.

MVP Scope Changes:
- Remove CSV and HTML output formats (keeping JSON only for embeddings)
- Remove quality validation (MTEB integration) from embeddings
- Simplify embeddings output to focus on performance metrics only
- Update tests to reflect MVP scope

Code Refactoring:
- Extract shared entrypoint utility functions (resolve_output_formats_generic, resolve_transient_phases) to entrypoints_utils.py
- Create BaseBenchmarkArgs base class with ~30 common configuration fields shared between BenchmarkEmbeddingsArgs and BenchmarkGenerativeTextArgs
- Consolidate benchmark orchestration into unified run_benchmark_workflow() function with customization points via modifier functions
- Reduce embeddings_entrypoints.py
- Add setup_backend_kwargs() modifier to configure embeddings-specific request_format (/v1/embeddings) and encoding_format
- Add setup_profile_kwargs() modifiers to customize profile constraints

Core Features:
- Full embeddings benchmark support with JSON output
- Support for all profile types (constant, sweep, poisson, synchronous, throughput)
- Request latency metrics (mean, median, p95, p99)
- Server throughput metrics (requests/sec, tokens/sec)
- Concurrency tracking
- Compatible with OpenAI-compatible embeddings endpoints
- vLLM embeddings server support

Testing:
- All unit tests pass (1940/1940 embeddings tests)
- All integration tests pass (2/2)
- E2E tests pass
- Live testing against AWS EC2 vLLM servers validates both embeddings (granite-embedding-english-r2) and generative (Qwen3-0.6B) workflows
- Profile testing (constant, sweep, poisson, synchronous, throughput) - 140/140 requests successful across all profiles

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
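The consolidated `run_benchmark_workflow()` with modifier-function customization points described in the commit message could be sketched roughly like this. This is an illustrative shape, not the actual guidellm code; the dict-based kwargs and the `embeddings_backend_kwargs` helper are assumptions:

```python
from typing import Callable


def run_benchmark_workflow(
    args: dict,
    setup_backend_kwargs: Callable[[dict], dict] = lambda kw: kw,
    setup_profile_kwargs: Callable[[dict], dict] = lambda kw: kw,
) -> dict:
    """One orchestration path; benchmark types customize via modifiers."""
    backend_kwargs = setup_backend_kwargs({"target": args["target"]})
    profile_kwargs = setup_profile_kwargs({"rate": args.get("rate")})
    # ... build backend, run scheduler, accumulate metrics ...
    return {"backend": backend_kwargs, "profile": profile_kwargs}


def embeddings_backend_kwargs(kw: dict) -> dict:
    # Embeddings-specific modifier: point requests at /v1/embeddings and
    # set the encoding format, mirroring setup_backend_kwargs() above.
    kw["request_format"] = "/v1/embeddings"
    kw["encoding_format"] = "float"
    return kw


result = run_benchmark_workflow(
    {"target": "http://localhost:8000", "rate": 10.0},
    setup_backend_kwargs=embeddings_backend_kwargs,
)
```

The generative entrypoint would pass its own modifiers into the same function, so the orchestration logic exists exactly once.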
Add Embeddings Benchmark Support (MVP)
Summary
This PR adds comprehensive support for benchmarking the `/v1/embeddings` endpoint with a streamlined MVP scope focused on performance testing. The implementation also refactors common code patterns between embeddings and generative benchmarks, eliminating ~360 lines of duplication.

Key Features
1. Embeddings Performance Benchmarking
- Benchmarks the `/v1/embeddings` endpoint

2. Code Refactoring
- Shared entrypoint utilities in `entrypoints_utils.py`:
  - `resolve_output_formats_generic()` - generic output format resolver
  - `resolve_transient_phases()` - warmup/cooldown configuration
- `run_benchmark_workflow()` for unified orchestration

3. Mock Server Support
- Mock server endpoint (`/v1/embeddings`)

MVP Scope Decisions
To deliver core functionality quickly, this PR excludes the following features (they can be added in future PRs):
- CSV and HTML output formats (JSON only for embeddings)
- Quality validation (MTEB integration)

This keeps the PR small and reviewable while still covering the core performance-testing workflow.
Implementation Details
Core Embeddings Support
- Data models: `EmbeddingsBenchmark`, `EmbeddingsBenchmarkAccumulator`, `EmbeddingsMetrics`, `EmbeddingsBenchmarksReport`
- `EmbeddingsRequestFinalizer` for preparing embedding requests
- `EmbeddingsRequestCollator` for batching
- `EmbeddingsColumnMapper` preprocessor
- Backend support for `/v1/embeddings`
- `benchmark_embeddings()` entrypoint

Output Formats
- `EmbeddingsBenchmarkerConsole` for terminal display
- `EmbeddingsBenchmarkerOutput` for serialized outputs

Code Refactoring
- `BaseBenchmarkArgs` with 30+ common fields
- `run_benchmark_workflow()` for unified orchestration
- `setup_backend_kwargs()` - sets request_format to `/v1/embeddings` and encoding_format
- `setup_profile_kwargs()` - configures embeddings constraints (no rampup, uses max_duration)

CLI Integration
- New `guidellm benchmark embeddings` command

Testing
Code Quality
- Quality checks pass (`tox -e quality`)
- Type checks pass (`tox -e types`)

Example Usage
Basic Embeddings Benchmark
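The original example command was lost in extraction; a plausible invocation might look like the following. The flag names are assumptions modeled on the generative guidellm CLI and may differ from the final command:

```shell
# Hypothetical flags - check `guidellm benchmark embeddings --help` for the real ones.
guidellm benchmark embeddings \
  --target http://localhost:8000 \
  --model granite-embedding-english-r2 \
  --rate-type constant \
  --rate 10 \
  --max-requests 20
```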
Sweep Profile (Multiple Rates)
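Again the original snippet was stripped; a sweep run might plausibly be invoked like this (flag names are illustrative assumptions, not confirmed by this PR):

```shell
# Hypothetical flags - a sweep profile probes multiple request rates automatically.
guidellm benchmark embeddings \
  --target http://localhost:8000 \
  --model granite-embedding-english-r2 \
  --rate-type sweep \
  --max-seconds 30
```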
Example Output
Console Output
JSON Output
```json
{
  "benchmarks": [
    {
      "mode": "embeddings",
      "config": {
        "strategy": { "type_": "constant", "rate": 10.0 }
      },
      "metrics": {
        "request_latency": { "mean": 0.110, "median": 0.110, "p95": 0.111, "p99": 0.111 },
        "server_ttft": { "mean": 0.0, "median": 0.0 },
        "server_throughput_request": { "mean": 9.1, "median": 9.09 },
        "server_throughput_token_input": { "mean": 524.8, "median": 475.2 }
      }
    }
  ]
}
```

Test Plan
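Because the report is plain JSON, metrics can be pulled out with the standard library alone. A small sketch against the sample structure above (key names are taken from that sample, not from a documented schema):

```python
import json

# The sample report structure from the JSON output above.
report = json.loads("""
{"benchmarks": [{"mode": "embeddings",
  "config": {"strategy": {"type_": "constant", "rate": 10.0}},
  "metrics": {"request_latency": {"mean": 0.110, "median": 0.110,
                                  "p95": 0.111, "p99": 0.111},
              "server_throughput_request": {"mean": 9.1, "median": 9.09}}}]}
""")

# Print one summary line per benchmark: strategy type and p95 latency.
for bench in report["benchmarks"]:
    latency = bench["metrics"]["request_latency"]
    print(f'{bench["config"]["strategy"]["type_"]}: p95={latency["p95"]}s')
```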
Automated Tests
Manual Testing (Completed)
1. Mock Server Embeddings ✅
Result: 20/20 requests successful
2. AWS EC2 vLLM (granite-embedding-english-r2) ✅
Tested all profile types:
Total: 70/70 requests successful
3. AWS EC2 Qwen3-0.6B (generative) ✅
Verified refactored code works for generative benchmarks:
Total: 70/70 requests successful
Grand Total: 140/140 requests successful across all profiles and both model types! 🎉
Breaking Changes
None. This PR is fully backward compatible. All existing generative text benchmarking functionality remains unchanged and passes all tests.
Future Enhancements
The following features were scoped out of this MVP and can be added in future PRs:
Dependencies
No new required dependencies. All embeddings functionality works with existing dependencies.
Use of AI