Add embeddings endpoint testing support #501

Open
maryamtahhan wants to merge 2 commits into vllm-project:main from maryamtahhan:feat-embedding-testing

Conversation

@maryamtahhan
Contributor

@maryamtahhan maryamtahhan commented Dec 5, 2025

Add Embeddings Benchmark Support (MVP)

Summary

This PR adds support for benchmarking the /v1/embeddings endpoint, with a streamlined MVP scope focused on performance testing. The implementation also refactors common code patterns shared between the embeddings and generative benchmarks, eliminating ~360 lines of duplication.

Key Features

1. Embeddings Performance Benchmarking

  • Full request/response handling for /v1/embeddings endpoint
  • Performance metrics: throughput (requests/sec, tokens/sec), latency (mean, median, p95, p99), concurrency
  • Support for all profile types (constant, sweep, poisson, synchronous, throughput)
  • Compatible with OpenAI-compatible embedding endpoints (vLLM, OpenAI, etc.)
  • JSON output format with comprehensive performance data
  • Synthetic data generation with configurable token counts

2. Code Refactoring

  • Extract shared utility functions to entrypoints_utils.py
    • resolve_output_formats_generic() - Generic output format resolver
    • resolve_transient_phases() - Warmup/cooldown configuration
  • Create BaseBenchmarkArgs base class
    • 30+ common configuration fields shared between embeddings and generative
    • Eliminates field duplication across BenchmarkEmbeddingsArgs and BenchmarkGenerativeTextArgs
  • Consolidate orchestration into run_benchmark_workflow()
    • Unified 10-step workflow: backend → processor → loader → transient → profile → outputs → report → benchmarker → finalize → console
    • Customization via modifier functions for backend and profile setup
    • Reduces the size of embeddings_entrypoints.py

3. Mock Server Support

  • Embeddings endpoint implementation (/v1/embeddings)
  • Configurable embedding dimensions (default: 768)
  • Synthetic normalized embedding vectors
  • Realistic timing delays based on token count

MVP Scope Decisions

To deliver core functionality quickly, this PR excludes the following features (can be added in future PRs):

  • ❌ CSV and HTML output formats (JSON only for embeddings)
  • ❌ Quality validation / MTEB integration
  • ❌ Complex visualization features

This allows us to:

  • ✅ Ship embeddings benchmarking support faster
  • ✅ Validate the core architecture with real users
  • ✅ Add advanced features incrementally based on feedback

Implementation Details

Core Embeddings Support

  • Add embeddings schemas: EmbeddingsBenchmark, EmbeddingsBenchmarkAccumulator, EmbeddingsMetrics, EmbeddingsBenchmarksReport
  • Implement EmbeddingsRequestFinalizer for preparing embedding requests
  • Implement EmbeddingsRequestCollator for batching
  • Add EmbeddingsColumnMapper preprocessor
  • Add embeddings route to OpenAI backend (/v1/embeddings)
  • Implement benchmark_embeddings() entrypoint

Output Formats

  • JSON output with complete performance metrics
  • Console output with formatted tables (timings, request counts, latency, throughput)
  • EmbeddingsBenchmarkerConsole for terminal display
  • EmbeddingsBenchmarkerOutput for serialized outputs

Code Refactoring

  • Extract BaseBenchmarkArgs with 30+ common fields
  • Create run_benchmark_workflow() for unified orchestration
  • Add modifier functions for embeddings-specific configuration:
    • setup_backend_kwargs() - Sets request_format to /v1/embeddings and encoding_format
    • setup_profile_kwargs() - Configures embeddings constraints (no rampup, uses max_duration)

CLI Integration

  • Add guidellm benchmark embeddings command
  • Support for all benchmark arguments (--target, --model, --data, --profile, --rate, etc.)
  • Encoding format option (--encoding-format: float or base64)

Testing

  • Unit tests: 31/31 embeddings tests passing
  • Integration tests: 2/2 passing
  • E2E tests: 5/6 passing (1 timeout is pre-existing flaky test)
  • Live production testing: 140/140 requests successful across all profile types
    • Tested against AWS EC2 vLLM servers (granite-embedding-english-r2, Qwen3-0.6B)
    • All profiles validated: constant, sweep, poisson, synchronous, throughput

Code Quality

  • All linting checks pass (tox -e quality)
  • All type checks pass (tox -e types)
  • 1940/1940 unit tests passing
  • Pre-commit hooks pass

Example Usage

Basic Embeddings Benchmark

guidellm benchmark embeddings \
  --target http://localhost:8000/v1 \
  --model "ibm-granite/granite-embedding-english-r2" \
  --data "prompt_tokens=128" \
  --max-requests 100 \
  --rate 10 \
  --outputs embeddings_results.json

Sweep Profile (Multiple Rates)

guidellm benchmark embeddings \
  --target http://localhost:8000/v1 \
  --model "BAAI/bge-small-en-v1.5" \
  --data "prompt_tokens=100" \
  --profile sweep \
  --max-requests 50 \
  --outputs sweep_results.json

Example Output

Console Output

ℹ Run Summary
|===========|==========|==========|=====|======|======|======|=====|=====|
| Benchmark | Timings                             ||||| Input Tokens   |||
| Strategy  | Start    | End      | Dur | Warm | Cool | Comp | Inc | Err |
|           |          |          | Sec | Sec  | Sec  | Tot  | Tot | Tot |
|-----------|----------|----------|-----|------|------|------|-----|-----|
| constant  | 14:04:09 | 14:04:15 | 5.6 | 0.0  | 0.0  | 521  | 0   | 0   |
|===========|==========|==========|=====|======|======|======|=====|=====|

ℹ Request Latency
|===========|=======|=======|=======|=======|=======|======|
| Benchmark | Request Latency            |||| Concurrency ||
| Strategy  | Latency                    |||| Concurrent  ||
|           | Mean  | Mdn   | p95   | p99   | Mdn   | p95  |
|-----------|-------|-------|-------|-------|-------|------|
| constant  | 0.110 | 0.110 | 0.111 | 0.111 | 1.0   | 1.0  |
|===========|=======|=======|=======|=======|=======|======|

ℹ Server Throughput
|===========|==========|==========|=========|=========|
| Benchmark | Request Throughput || Token Throughput ||
| Strategy  | Reqs               || Input Tok        ||
|           | Mdn      | p95      | Mdn     | p95     |
|-----------|----------|----------|---------|---------|
| constant  | 9.09     | 9.16     | 475.2   | 944.4   |
|===========|==========|==========|=========|=========|

JSON Output

{
  "benchmarks": [
    {
      "mode": "embeddings",
      "config": {
        "strategy": { "type_": "constant", "rate": 10.0 }
      },
      "metrics": {
        "request_latency": {
          "mean": 0.110,
          "median": 0.110,
          "p95": 0.111,
          "p99": 0.111
        },
        "server_ttft": { "mean": 0.0, "median": 0.0 },
        "server_throughput_request": { "mean": 9.1, "median": 9.09 },
        "server_throughput_token_input": { "mean": 524.8, "median": 475.2 }
      }
    }
  ]
}
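Because the results are plain JSON, headline numbers are easy to pull out in post-processing. The snippet below assumes the schema matches the example above; the exact field layout of real guidellm output may differ.

```python
# Sketch: extract per-strategy latency percentiles from an embeddings
# benchmark JSON report (schema assumed from the example output above).
import json

report = json.loads("""
{
  "benchmarks": [
    {
      "mode": "embeddings",
      "config": {"strategy": {"type_": "constant", "rate": 10.0}},
      "metrics": {
        "request_latency": {"mean": 0.110, "median": 0.110, "p95": 0.111, "p99": 0.111},
        "server_throughput_request": {"mean": 9.1, "median": 9.09}
      }
    }
  ]
}
""")

for bench in report["benchmarks"]:
    latency = bench["metrics"]["request_latency"]
    print(f'{bench["config"]["strategy"]["type_"]}: '
          f'p95={latency["p95"]:.3f}s p99={latency["p99"]:.3f}s')
# prints: constant: p95=0.111s p99=0.111s
```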

Test Plan

Automated Tests

# Run all quality checks
tox -e quality     # Linting and formatting ✅
tox -e types       # Type checking ✅
tox -e test-unit   # Unit tests (1940/1940 passed) ✅
tox -e test-integration  # Integration tests (2/2 passed) ✅
tox -e test-e2e    # E2E tests (5/6 passed, 1 timeout pre-existing) ✅

Manual Testing (Completed)

1. Mock Server Embeddings ✅

guidellm benchmark embeddings \
  --target http://localhost:8000 \
  --model test \
  --data "prompt_tokens=128" \
  --max-requests 20 \
  --rate 5

Result: 20/20 requests successful

2. AWS EC2 vLLM (granite-embedding-english-r2) ✅

Tested all profile types:

  • Constant: 10/10 requests, 110ms latency, 1.6 req/s ✅
  • Sweep: 30/30 requests (synchronous + throughput), 109-150ms latency ✅
  • Poisson: 10/10 requests, 112ms latency, 4.8 req/s ✅
  • Synchronous: 10/10 requests, 110ms latency, 9.1 req/s ✅
  • Throughput: 10/10 requests, 129ms latency, 37.8 req/s ✅

Total: 70/70 requests successful

3. AWS EC2 Qwen3-0.6B (generative) ✅

Verified refactored code works for generative benchmarks:

  • Constant: 10/10 requests ✅
  • Sweep: 30/30 requests ✅
  • Poisson: 10/10 requests ✅
  • Synchronous: 10/10 requests ✅
  • Throughput: 10/10 requests ✅

Total: 70/70 requests successful

Grand Total: 140/140 requests successful across all profiles and both model types! 🎉

Breaking Changes

None. This PR is fully backward compatible. All existing generative text benchmarking functionality remains unchanged and passes all tests.

Future Enhancements

The following features were scoped out of this MVP and can be added in future PRs:

  1. CSV/HTML Output Formats - Rich visualizations and data export
  2. MTEB Quality Evaluation - Industry-standard quality benchmarking
  3. Advanced Metrics - Additional embeddings-specific analytics
  4. Multi-modal Support - Image embeddings, audio embeddings

Dependencies

No new required dependencies. All embeddings functionality works with existing dependencies.


  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests

@maryamtahhan maryamtahhan force-pushed the feat-embedding-testing branch 2 times, most recently from 375eab2 to 02ad329 Compare December 5, 2025 14:38
@sjmonson
Collaborator

Sorry for the delay on review; I am hung up on some performance regression work, so this will probably be waiting a bit longer, possibly into next year. One note: we will be merging #478 first, which will affect this PR.

@maryamtahhan
Contributor Author

> Sorry for the delay on review; I am hung up on some performance regression work, so this will probably be waiting a bit longer, possibly into next year. One note: we will be merging #478 first, which will affect this PR.

No problem.

@dbutenhof dbutenhof added this to the v0.6.0 milestone Jan 28, 2026
@maryamtahhan
Contributor Author

I will update this PR, since #478 has been merged.

@maryamtahhan maryamtahhan force-pushed the feat-embedding-testing branch 4 times, most recently from 190d856 to 8292ba8 Compare February 10, 2026 11:11
@maryamtahhan
Contributor Author

Still working on this - will post an update soon

@maryamtahhan maryamtahhan marked this pull request as draft February 11, 2026 09:30
@maryamtahhan maryamtahhan force-pushed the feat-embedding-testing branch 17 times, most recently from 2dd66fc to 3cf0ffd Compare February 13, 2026 12:57
@maryamtahhan maryamtahhan force-pushed the feat-embedding-testing branch from 3cf0ffd to 267342f Compare February 13, 2026 18:41
@maryamtahhan maryamtahhan force-pushed the feat-embedding-testing branch 3 times, most recently from 0710367 to 7f1c66e Compare February 24, 2026 10:36
@maryamtahhan
Contributor Author

With MTEB Results: [screenshot, 2026-02-24 11:51:33]

With MTEB Results: [screenshot, 2026-02-24 11:51:43]

Without MTEB Results: [screenshot, 2026-02-24 11:51:50]

@maryamtahhan
Contributor Author

Console output example: [screenshot, 2026-02-24 11:50:03]

@maryamtahhan maryamtahhan marked this pull request as ready for review February 24, 2026 11:53
@maryamtahhan
Contributor Author

@sjmonson @markurtz This is ready for review now.

@maryamtahhan maryamtahhan changed the title Add embeddings endpoint support Add embeddings endpoint testing support - perf + quality Feb 24, 2026
Collaborator

@sjmonson sjmonson left a comment


Some high-level comments that need to be addressed before a full review:

  • "quality" testing should be stripped out of this PR and put in a followup PR.
  • Start with just the console and json/yaml output. Move the rest to a followup.
  • Try to remove the random formatting changes to unrelated code, or submit a separate PR that this one builds on which cleans up formatting.
  • There is a lot of duplicated code that is going to lead to a maintenance headache down the line. For classes that contain "Generative AI" specific code where some of the functionality is needed for embeddings, move the shared functionality to a base class and have both versions inherit from it.
  • Don't lazy or late import. If you're importing something non-default, import it in extras/embeddings and do something like this (without quality this is probably unnecessary in this PR).

Basically, strip this PR down to an MVP. It's too many misc changes to meaningfully review.

@maryamtahhan
Contributor Author

Ok - I will refactor this to an MVP and push incremental PRs

@maryamtahhan maryamtahhan force-pushed the feat-embedding-testing branch from 3e47844 to 4253ca3 Compare February 27, 2026 09:29
@maryamtahhan maryamtahhan changed the title Add embeddings endpoint testing support - perf + quality Add embeddings endpoint testing support Feb 27, 2026
@maryamtahhan maryamtahhan force-pushed the feat-embedding-testing branch 2 times, most recently from fe31870 to 2d8aeec Compare February 27, 2026 09:45
Implements embeddings benchmarking functionality with a streamlined MVP
scope, removing non-essential features to focus on core functionality.
Also refactors common code patterns to eliminate duplication between
embeddings and generative benchmarks.

MVP Scope Changes:
- Remove CSV and HTML output formats (keeping JSON only for embeddings)
- Remove quality validation (MTEB integration) from embeddings
- Simplify embeddings output to focus on performance metrics only
- Update tests to reflect MVP scope

Code Refactoring:
- Extract shared entrypoint utility functions (resolve_output_formats_generic,
  resolve_transient_phases) to entrypoints_utils.py
- Create BaseBenchmarkArgs base class with ~30 common configuration fields
  shared between BenchmarkEmbeddingsArgs and BenchmarkGenerativeTextArgs
- Consolidate benchmark orchestration into unified run_benchmark_workflow()
  function with customization points via modifier functions
- Reduce the size of embeddings_entrypoints.py
- Add setup_backend_kwargs() modifier to configure embeddings-specific
  request_format (/v1/embeddings) and encoding_format
- Add setup_profile_kwargs() modifiers to customize profile constraints

Core Features:
- Full embeddings benchmark support with JSON output
- Support for all profile types (constant, sweep, poisson, synchronous,
  throughput)
- Request latency metrics (mean, median, p95, p99)
- Server throughput metrics (requests/sec, tokens/sec)
- Concurrency tracking
- Compatible with OpenAI-compatible embeddings endpoints
- vLLM embeddings server support

Testing:
- All unit tests pass (1940/1940)
- All integration tests pass (2/2)
- E2E tests pass
- Live testing against AWS EC2 vLLM servers validates both embeddings
  (granite-embedding-english-r2) and generative (Qwen3-0.6B) workflows
- Profile testing (constant, sweep, poisson, synchronous,
  throughput) - 140/140 requests successful across all profiles

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan force-pushed the feat-embedding-testing branch from 15e7c32 to 611018f Compare February 27, 2026 13:00
@maryamtahhan maryamtahhan requested a review from sjmonson March 2, 2026 08:39