
Improve embeddings load test with batch support, new API params, and proper metrics #64

Open
andrefoo wants to merge 1 commit into main from embeddings-improvements
Conversation


@andrefoo andrefoo commented Mar 3, 2026

Summary

  • Batch support: New --embeddings-batch-size N flag sends N texts as a single array request, enabling throughput testing across different batch sizes
  • New API parameters: --embeddings-dimensions (output vector size) and --embeddings-prompt-template (Jinja2 template for structured inputs) now passed through to the Fireworks embeddings API
  • Proper metrics: Embeddings responses are now always parsed to capture prompt_tokens from the API usage field; new latency_per_embedding metric (total latency ÷ batch size) is emitted and reported in the summary with percentiles
  • Bug fixes: Quitting listener no longer crashes with KeyError when running embeddings tests (was trying to access LLM-only metrics like time_to_first_token); FireworksProvider no longer injects perf_metrics_in_response into embeddings payloads; logging params omit completion_tokens for embeddings and show embeddings_batch_size/embeddings_dimensions instead
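The batch request and the derived per-embedding metric described above can be sketched roughly as follows. This is a minimal illustration, not the load test's actual internals: the helper names are made up, and only `input`, `dimensions`, and `usage.prompt_tokens` come from the PR description (they match the standard OpenAI-style embeddings request/response shape).

```python
# Illustrative sketch: how a batched embeddings payload and the
# latency_per_embedding metric fit together. Names are hypothetical.

def build_embeddings_payload(texts, model, dimensions=None):
    """Send N texts as a single array request ("input" as a list)."""
    payload = {"model": model, "input": texts}
    if dimensions is not None:
        # Request a specific output vector size from the API.
        payload["dimensions"] = dimensions
    return payload

def latency_per_embedding(total_latency_ms, batch_size):
    """Per-embedding latency: total request latency divided by batch size."""
    return total_latency_ms / batch_size

payload = build_embeddings_payload(["a", "b", "c"], model="my-embeddings-model", dimensions=256)
print(payload["dimensions"])          # 256
print(latency_per_embedding(120.0, 3))  # 40.0
```

With `--embeddings-batch-size 3`, one HTTP request carries three texts, so dividing the request latency by 3 gives the amortized cost per embedding that the summary percentiles are computed over.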

Test plan

  • Embeddings smoke test: --embeddings --max-requests 3 against accounts/pyroworks/deployments/i907pjzb — 0 failures, correct NxD shape reported, prompt_tokens captured from API
  • Chat streaming: --chat --stream against accounts/fireworks/models/gpt-oss-20b — 0 failures, all LLM metrics intact
  • Chat non-streaming: --chat --no-stream — 0 failures, TTFT/latency_per_token correctly blanked in summary
  • Non-chat completions: --no-chat --no-stream — 0 failures
  • Vision model: --chat --stream --prompt-images-with-resolutions 1920x1080 against accounts/fireworks/models/kimi-k2p5 — 0 failures

Made with Cursor

…proper metrics

- Add --embeddings-batch-size to send arrays of texts per request
- Add --embeddings-dimensions to request specific output vector size
- Add --embeddings-prompt-template for Jinja2 structured input preprocessing
- Always parse embeddings response to capture prompt_tokens from API usage
- Emit latency_per_embedding metric (total_latency / batch_size)
- Fix quitting listener to use embeddings-specific summary metrics instead of
  crashing on missing LLM-only metrics (time_to_first_token, latency_per_token)
- Fix FireworksProvider to skip perf_metrics_in_response for embeddings payloads
- Update logging_params to show embeddings_batch_size/dimensions instead of
  completion_tokens when in embeddings mode
- parse_output_json for embeddings now returns shape "NxD" and prompt_tokens
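The response-parsing change in the last bullet might look roughly like this, assuming the standard OpenAI-style embeddings response body (`data[*].embedding`, `usage.prompt_tokens`); the function below is a standalone sketch, not the real `parse_output_json`:

```python
def parse_embeddings_output(resp: dict) -> dict:
    """Illustrative parse of an OpenAI-style embeddings response:
    report the result shape as "NxD" and capture prompt_tokens from usage."""
    embeddings = [item["embedding"] for item in resp.get("data", [])]
    n = len(embeddings)
    d = len(embeddings[0]) if embeddings else 0
    return {
        "shape": f"{n}x{d}",
        "prompt_tokens": resp.get("usage", {}).get("prompt_tokens"),
    }

resp = {
    "data": [{"embedding": [0.1, 0.2]}, {"embedding": [0.3, 0.4]}],
    "usage": {"prompt_tokens": 7},
}
print(parse_embeddings_output(resp))  # {'shape': '2x2', 'prompt_tokens': 7}
```

Parsing the body unconditionally is what lets the smoke test verify the NxD shape and confirm `prompt_tokens` is captured even when no completion tokens exist.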

Made-with: Cursor
@andrefoo andrefoo requested review from divchenko and dzhulgakov March 3, 2026 13:01
