
Improve embeddings load test with batch support, new API params, and proper metrics #64

Open
andrefoo wants to merge 1 commit into main from embeddings-improvements
Conversation


@andrefoo andrefoo commented Mar 3, 2026

Summary

  • Batch support: New --embeddings-batch-size N flag sends N texts as a single array request, enabling throughput testing across different batch sizes
  • New API parameters: --embeddings-dimensions (output vector size) and --embeddings-prompt-template (Jinja2 template for structured inputs) now passed through to the Fireworks embeddings API
  • Proper metrics: Embeddings responses are now always parsed to capture prompt_tokens from the API usage field; new latency_per_embedding metric (total latency ÷ batch size) is emitted and reported in the summary with percentiles
  • Bug fixes: Quitting listener no longer crashes with KeyError when running embeddings tests (was trying to access LLM-only metrics like time_to_first_token); FireworksProvider no longer injects perf_metrics_in_response into embeddings payloads; logging params omit completion_tokens for embeddings and show embeddings_batch_size/embeddings_dimensions instead
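The batch request and the derived per-embedding metric described above can be sketched roughly as follows. This is a minimal illustration, not the load test's actual internals: the helper names are made up, and only `input`, `dimensions`, and `usage.prompt_tokens` come from the PR description (they match the standard OpenAI-style embeddings request/response shape).

```python
# Illustrative sketch: how a batched embeddings payload and the
# latency_per_embedding metric fit together. Names are hypothetical.

def build_embeddings_payload(texts, model, dimensions=None):
    """Send N texts as a single array request ("input" as a list)."""
    payload = {"model": model, "input": texts}
    if dimensions is not None:
        # Request a specific output vector size from the API.
        payload["dimensions"] = dimensions
    return payload

def latency_per_embedding(total_latency_ms, batch_size):
    """Per-embedding latency: total request latency divided by batch size."""
    return total_latency_ms / batch_size

payload = build_embeddings_payload(["a", "b", "c"], model="my-embeddings-model", dimensions=256)
print(payload["dimensions"])          # 256
print(latency_per_embedding(120.0, 3))  # 40.0
```

With `--embeddings-batch-size 3`, one HTTP request carries three texts, so dividing the request latency by 3 gives the amortized cost per embedding that the summary percentiles are computed over.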

Test plan

  • Embeddings smoke test: --embeddings --max-requests 3 against accounts/pyroworks/deployments/i907pjzb — 0 failures, correct NxD shape reported, prompt_tokens captured from API
  • Chat streaming: --chat --stream against accounts/fireworks/models/gpt-oss-20b — 0 failures, all LLM metrics intact
  • Chat non-streaming: --chat --no-stream — 0 failures, TTFT/latency_per_token correctly blanked in summary
  • Non-chat completions: --no-chat --no-stream — 0 failures
  • Vision model: --chat --stream --prompt-images-with-resolutions 1920x1080 against accounts/fireworks/models/kimi-k2p5 — 0 failures

Made with Cursor

…proper metrics

- Add --embeddings-batch-size to send arrays of texts per request
- Add --embeddings-dimensions to request specific output vector size
- Add --embeddings-prompt-template for Jinja2 structured input preprocessing
- Always parse embeddings response to capture prompt_tokens from API usage
- Emit latency_per_embedding metric (total_latency / batch_size)
- Fix quitting listener to use embeddings-specific summary metrics instead of
  crashing on missing LLM-only metrics (time_to_first_token, latency_per_token)
- Fix FireworksProvider to skip perf_metrics_in_response for embeddings payloads
- Update logging_params to show embeddings_batch_size/dimensions instead of
  completion_tokens when in embeddings mode
- parse_output_json for embeddings now returns shape "NxD" and prompt_tokens
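The response-parsing change in the last bullet might look roughly like this, assuming the standard OpenAI-style embeddings response body (`data[*].embedding`, `usage.prompt_tokens`); the function below is a standalone sketch, not the real `parse_output_json`:

```python
def parse_embeddings_output(resp: dict) -> dict:
    """Illustrative parse of an OpenAI-style embeddings response:
    report the result shape as "NxD" and capture prompt_tokens from usage."""
    embeddings = [item["embedding"] for item in resp.get("data", [])]
    n = len(embeddings)
    d = len(embeddings[0]) if embeddings else 0
    return {
        "shape": f"{n}x{d}",
        "prompt_tokens": resp.get("usage", {}).get("prompt_tokens"),
    }

resp = {
    "data": [{"embedding": [0.1, 0.2]}, {"embedding": [0.3, 0.4]}],
    "usage": {"prompt_tokens": 7},
}
print(parse_embeddings_output(resp))  # {'shape': '2x2', 'prompt_tokens': 7}
```

Parsing the body unconditionally is what lets the smoke test verify the NxD shape and confirm `prompt_tokens` is captured even when no completion tokens exist.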

Made-with: Cursor
@andrefoo andrefoo requested review from divchenko and dzhulgakov March 3, 2026 13:01
