
llm_bench: auto-detect TRT-LLM rejection of Fireworks request extras#109

Draft
kevinle-fw wants to merge 1 commit into main from cursor/auto-handle-trtllm-extras-2d22

Conversation

@kevinle-fw
Contributor

Motivation

Some Fireworks serving images — notably the TRT-LLM image (4.178.0-trtllm-env-config.0 and similar) — enforce a strict OpenAI-compat schema configured with extra="forbid". They reject the Fireworks-specific request body extras perf_metrics_in_response and prompt_cache_max_len with HTTP 400 (extra_forbidden). Before this change, the only way to get a sweep to collect metrics against such a deployment was to patch load_test.py before starting locust — for example the LOAD_TEST_DISABLE_FIREWORKS_EXTRAS sed-patch hook in the Customers collect_data.py / GHA workflow added during Customers PR #593. That preflight added startup time and a per-run failure-noise window to every sweep.

What this PR does

Move the workaround into the locust file itself so the bench works out-of-the-box on TRT-LLM deployments and downstream sweep tooling no longer needs to patch the file.

  • New CLI flag --fireworks-extras=auto|on|off (env var FIREWORKS_EXTRAS), defaulting to auto.
    • auto (default): send the extras; if the deployment rejects them with a 4xx whose body names one of the extras, disable them for the rest of the run (emitting a single warning log line) and transparently retry the failing request once.
    • off: never send them — skips the one-failed-request auto-detect cost when you already know the deployment doesn't support them.
    • on: always send them — fail fast if you expect them to be supported.
  • BaseProvider gains a handle_request_failure(status_code, body) hook so providers can opt into a single transparent retry on schema errors.
  • FireworksProvider:
    • Class-level, thread-safe _extras_disabled flag shared across all locust users in the process (so each user does not pay the one-failed-request cost).
    • format_payload only includes the extras when enabled.
    • handle_request_failure recognizes 4xx responses whose body mentions one of the extras field names and disables them for the rest of the run, with a single warning log line.
  • _do_generate_text gives the provider one chance to recover from a failed request by rebuilding the payload and retrying. The very first request on an incompatible deployment now contributes a clean metric instead of being recorded as a failure.
  • llm_bench/README.md documents the new flag.
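The pieces above fit together roughly as follows. This is a minimal, self-contained sketch, not the actual `load_test.py` code: only `_extras_disabled`, `format_payload`, `handle_request_failure`, and the two extras field names come from this PR; everything else (constructor shape, payload fields, the `print`-based warning) is illustrative.

```python
import threading


class FireworksProvider:
    """Sketch of the auto-disable behavior described above."""

    # Fireworks-specific request body extras that strict schemas reject.
    _EXTRA_FIELD_KEYS = ("perf_metrics_in_response", "prompt_cache_max_len")

    # Class-level flag + lock: shared by every locust user in the process,
    # so only the very first failing request pays the auto-detect cost.
    _extras_disabled = False
    _extras_lock = threading.Lock()

    def __init__(self, mode: str = "auto"):  # mode: "auto" | "on" | "off"
        self.mode = mode
        if mode == "off":
            type(self)._extras_disabled = True  # never send the extras

    def format_payload(self, prompt: str) -> dict:
        payload = {"prompt": prompt, "max_tokens": 64}
        if not type(self)._extras_disabled:
            payload["perf_metrics_in_response"] = True
            payload["prompt_cache_max_len"] = 0
        return payload

    def handle_request_failure(self, status_code: int, body: str) -> bool:
        """Return True if the failure was consumed and the caller may retry once."""
        if self.mode != "auto" or type(self)._extras_disabled:
            return False  # "on" fails fast; already-disabled failures stay failures
        if not 400 <= int(status_code) < 500:
            return False  # 5xx is never a schema rejection
        if not any(key in body for key in self._EXTRA_FIELD_KEYS):
            return False  # unrelated 4xx stays a failure
        with type(self)._extras_lock:
            if not type(self)._extras_disabled:  # warn once, even under concurrency
                print(f"WARNING: disabling Fireworks extras: HTTP {status_code}")
                type(self)._extras_disabled = True
        return True
```

`_do_generate_text` would then call `handle_request_failure` on a failed response and, when it returns True, rebuild the payload (now without the extras) and retry the request once, so the first request on an incompatible deployment still yields a clean metric.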

Net effect

Sweeps against TRT-LLM deployments now collect TTFT/TPOT/latency metrics correctly out of the box, with a single warning line in the locust log instead of a failed first request and a sed-patched locust file.

Downstream cleanup (separate)

The Customers collect_data.py LOAD_TEST_DISABLE_FIREWORKS_EXTRAS sed-patch and the corresponding disable_fireworks_extras GHA workflow input become unnecessary once this lands — the bench handles the case itself. They can either be removed in a follow-up Customers PR or left in place as an explicit override (it will continue to work; the env-var-driven sed simply removes the same fields this PR would otherwise omit at runtime).

Verification

  • Module imports cleanly.
  • Unit-style smoke test (in-process) verified:
    • default auto: extras enabled by default
    • 400 with extra_forbidden: perf_metrics_in_response → handled, extras dropped from subsequent payloads
    • 400 with unrelated body → not handled (stays a failure)
    • 422 with prompt_cache_max_len not in schema → handled (any 4xx)
    • 5xx → not handled
    • --fireworks-extras=off → extras omitted from the very first request
    • --fireworks-extras=on → extras always sent, 400s never auto-handled (fail-fast)
  • Single warning log line on auto-disable, even under concurrent failures.
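The detection rule those smoke tests exercise can be restated as a pure function. `should_auto_disable` is a hypothetical name for illustration; per the PR description the real check lives inside `FireworksProvider.handle_request_failure`.

```python
EXTRA_FIELD_KEYS = ("perf_metrics_in_response", "prompt_cache_max_len")


def should_auto_disable(mode: str, status_code: int, body: str) -> bool:
    # Auto-disable iff the flag is "auto", the status is any 4xx, and the
    # response body names one of the two Fireworks extras.
    if mode != "auto":
        return False
    if not 400 <= status_code < 500:
        return False
    return any(key in body for key in EXTRA_FIELD_KEYS)


# The verification cases above, restated as assertions:
assert should_auto_disable("auto", 400, "extra_forbidden: perf_metrics_in_response")
assert not should_auto_disable("auto", 400, "some unrelated error")  # stays a failure
assert should_auto_disable("auto", 422, "prompt_cache_max_len not in schema")  # any 4xx
assert not should_auto_disable("auto", 503, "perf_metrics_in_response")  # 5xx never handled
assert not should_auto_disable("on", 400, "perf_metrics_in_response")  # fail-fast
```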

Slack Thread


Some serving images (notably TRT-LLM with strict OpenAI-compat schema
configured with extra='forbid') reject the Fireworks-specific request body
extras 'perf_metrics_in_response' and 'prompt_cache_max_len' with HTTP 400.
Until now the only workaround was to patch load_test.py at the start of each
load test (e.g. the Customers GHA workflow's LOAD_TEST_DISABLE_FIREWORKS_EXTRAS
sed-patch hook), which adds startup time and one-off failure noise to every run.

This change handles the rejection inside the locust file itself:

- New --fireworks-extras=auto|on|off flag (env var FIREWORKS_EXTRAS),
  defaulting to 'auto'.
- BaseProvider gains a handle_request_failure(status_code, body) hook so
  providers can opt into a single transparent retry on schema errors.
- FireworksProvider:
  * Class-level _extras_disabled flag (thread-safe), shared across users.
  * format_payload only includes the extras when enabled.
  * handle_request_failure recognizes 4xx responses whose body mentions one
    of the extras field names and disables them for the rest of the run.
- _do_generate_text gives the provider one chance to recover from a failed
  request by rebuilding the payload and retrying — so the very first request
  on an incompatible deployment now contributes a clean metric instead of
  being recorded as a failure.

Net effect: Sweeps against TRT-LLM deployments collect metrics correctly
out of the box. Downstream callers (Customers GHA workflow / collect_data.py
sed-patch, skill files) no longer need to preflight-patch the locust file.

Co-authored-by: kevinle-fw <kevinle-fw@users.noreply.github.com>

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 99eb44a.

Comment thread llm_bench/load_test.py
    if not any(key in body for key in self._EXTRA_FIELD_KEYS):
        return False
    snippet = body[:200].replace("\n", " ")
    type(self)._disable_extras(f"server returned HTTP 400: {snippet}")

Log message hardcodes "HTTP 400" for any 4xx status

Low Severity

handle_request_failure accepts any 4xx status code at the guard on line 986 (400 <= int(status_code) < 500), but the reason string passed to _disable_extras at line 996 hardcodes "server returned HTTP 400". When a deployment returns e.g. 422, the warning log will incorrectly say "HTTP 400", making it harder to debug. The status_code parameter is available but unused in the f-string.
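A minimal fix along the lines Bugbot suggests is to interpolate the actual status into the reason string. `disable_reason` is a hypothetical helper for illustration; in the PR the f-string is built inline before the `_disable_extras` call.

```python
def disable_reason(status_code: int, body: str) -> str:
    # Use the real 4xx status instead of hardcoding "HTTP 400",
    # so a 422 rejection logs as "HTTP 422".
    snippet = body[:200].replace("\n", " ")
    return f"server returned HTTP {status_code}: {snippet}"
```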


