llm_bench: auto-detect TRT-LLM rejection of Fireworks request extras#109
kevinle-fw wants to merge 1 commit into main from
Conversation
Some serving images (notably TRT-LLM with strict OpenAI-compat schema
configured with extra='forbid') reject the Fireworks-specific request body
extras 'perf_metrics_in_response' and 'prompt_cache_max_len' with HTTP 400.
Until now the only workaround was to patch load_test.py at the start of each
load test (e.g. the Customers GHA workflow's LOAD_TEST_DISABLE_FIREWORKS_EXTRAS
sed-patch hook), which adds startup time and one-off failure noise to every run.
This change handles the rejection inside the locust file itself:
- New --fireworks-extras=auto|on|off flag (env var FIREWORKS_EXTRAS),
defaulting to 'auto'.
- BaseProvider gains a handle_request_failure(status_code, body) hook so
providers can opt into a single transparent retry on schema errors.
- FireworksProvider:
* Class-level _extras_disabled flag (thread-safe), shared across users.
* format_payload only includes the extras when enabled.
* handle_request_failure recognizes 4xx responses whose body mentions one
of the extras field names and disables them for the rest of the run.
- _do_generate_text gives the provider one chance to recover from a failed
request by rebuilding the payload and retrying — so the very first request
on an incompatible deployment now contributes a clean metric instead of
being recorded as a failure.
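The hook and flag described above can be sketched roughly as follows. This is a minimal illustration, not the actual `load_test.py` code: the payload fields and the `handle_request_failure` / `format_payload` / `_disable_extras` shapes follow the description in this PR, but everything else (logging setup, payload contents) is an assumption.

```python
import logging
import threading

log = logging.getLogger(__name__)


class BaseProvider:
    """Base hook: by default a failed request is not retried."""

    def handle_request_failure(self, status_code, body):
        return False  # providers return True to request one transparent retry


class FireworksProvider(BaseProvider):
    _EXTRA_FIELD_KEYS = ("perf_metrics_in_response", "prompt_cache_max_len")
    _extras_disabled = False         # shared by all locust users in the process
    _extras_lock = threading.Lock()  # guards the one-time flip

    @classmethod
    def _disable_extras(cls, reason):
        with cls._extras_lock:
            if not cls._extras_disabled:
                cls._extras_disabled = True
                log.warning("disabling Fireworks request extras: %s", reason)

    def format_payload(self, prompt):
        payload = {"prompt": prompt}
        if not type(self)._extras_disabled:
            payload["perf_metrics_in_response"] = True
            payload["prompt_cache_max_len"] = 0
        return payload

    def handle_request_failure(self, status_code, body):
        # Only a schema-style 4xx whose body names one of the extras
        # fields triggers the fallback.
        if not (400 <= int(status_code) < 500):
            return False
        if not any(key in body for key in self._EXTRA_FIELD_KEYS):
            return False
        snippet = body[:200].replace("\n", " ")
        type(self)._disable_extras(f"server returned HTTP {status_code}: {snippet}")
        return True  # caller rebuilds the payload and retries once
```

Because `_extras_disabled` is class-level, the first user that hits the rejection flips it for every other locust user in the process.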
Net effect: Sweeps against TRT-LLM deployments collect metrics correctly
out of the box. Downstream callers (Customers GHA workflow / collect_data.py
sed-patch, skill files) no longer need to preflight-patch the locust file.
Co-authored-by: kevinle-fw <kevinle-fw@users.noreply.github.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 99eb44a.
```python
if not any(key in body for key in self._EXTRA_FIELD_KEYS):
    return False
snippet = body[:200].replace("\n", " ")
type(self)._disable_extras(f"server returned HTTP 400: {snippet}")
```
Log message hardcodes "HTTP 400" for any 4xx status
Low Severity
handle_request_failure accepts any 4xx status code at the guard on line 986 (400 <= int(status_code) < 500), but the reason string passed to _disable_extras at line 996 hardcodes "server returned HTTP 400". When a deployment returns e.g. 422, the warning log will incorrectly say "HTTP 400", making it harder to debug. The status_code parameter is available but unused in the f-string.
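The fix the reviewer suggests is to interpolate the actual status code into the reason string. A minimal illustration (the helper-function form is hypothetical; the real code builds the string inline before calling `_disable_extras`):

```python
def extras_rejection_reason(status_code, body):
    """Build the _disable_extras reason from the actual 4xx status code,
    so a 422 rejection logs as "HTTP 422" rather than a hardcoded "HTTP 400"."""
    snippet = body[:200].replace("\n", " ")
    return f"server returned HTTP {status_code}: {snippet}"
```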


Motivation
Some Fireworks serving images, notably the TRT-LLM image (`4.178.0-trtllm-env-config.0` and similar), enforce a strict OpenAI-compat schema configured with `extra="forbid"`. They reject the Fireworks-specific request body extras `perf_metrics_in_response` and `prompt_cache_max_len` with HTTP 400 (`extra_forbidden`). Before this change, the only way to get a sweep to collect metrics against such a deployment was to patch `load_test.py` before starting locust, for example via the `LOAD_TEST_DISABLE_FIREWORKS_EXTRAS` sed-patch hook in the Customers `collect_data.py` / GHA workflow added during Customers PR #593. That preflight added startup time and a per-run failure-noise window to every sweep.
What this PR does
Move the workaround into the locust file itself so the bench works out-of-the-box on TRT-LLM deployments and downstream sweep tooling no longer needs to patch the file.
- New `--fireworks-extras=auto|on|off` flag (env var `FIREWORKS_EXTRAS`), defaulting to `auto`:
  * `auto` (default): send the extras; if the deployment rejects them with a 4xx whose body names one of the extras, silently disable them for the rest of the run and transparently retry the failing request once.
  * `off`: never send them; skips the one-failed-request auto-detect cost when you already know the deployment doesn't support them.
  * `on`: always send them; fail fast if you expect them to be supported.
- `BaseProvider` gains a `handle_request_failure(status_code, body)` hook so providers can opt into a single transparent retry on schema errors.
- `FireworksProvider`:
  * `_extras_disabled` flag shared across all locust users in the process (so each user does not pay the one-failed-request cost).
  * `format_payload` only includes the extras when enabled.
  * `handle_request_failure` recognizes 4xx responses whose body mentions one of the extras field names and disables them for the rest of the run, with a single warning log line.
- `_do_generate_text` gives the provider one chance to recover from a failed request by rebuilding the payload and retrying. The very first request on an incompatible deployment now contributes a clean metric instead of being recorded as a failure.
- `llm_bench/README.md` documents the new flag.
Net effect
Sweeps against TRT-LLM deployments now collect TTFT/TPOT/latency metrics correctly out of the box, with a single warning line in the locust log instead of a failed first request and a sed-patched locust file.
Downstream cleanup (separate)
The Customers `collect_data.py` `LOAD_TEST_DISABLE_FIREWORKS_EXTRAS` sed-patch and the corresponding `disable_fireworks_extras` GHA workflow input become unnecessary once this lands; the bench handles the case itself. They can either be removed in a follow-up Customers PR or left in place as an explicit override (it will continue to work; the env-var-driven sed simply removes the same fields this PR would otherwise omit at runtime).
Verification
- `auto`: extras enabled by default
- `extra_forbidden: perf_metrics_in_response` → handled, extras dropped from subsequent payloads
- `prompt_cache_max_len not in schema` → handled (any 4xx)
- `--fireworks-extras=off` → extras omitted from the very first request
- `--fireworks-extras=on` → extras always sent, 400s never auto-handled (fail-fast)

Slack Thread