
llm_bench: auto-detect TRT-LLM rejection of Fireworks request extras#109

Draft
kevinle-fw wants to merge 1 commit into main from cursor/auto-handle-trtllm-extras-2d22

Conversation

@kevinle-fw
Contributor

Motivation

Some Fireworks serving images — notably the TRT-LLM image (4.178.0-trtllm-env-config.0 and similar) — enforce a strict OpenAI-compat schema configured with extra="forbid". They reject the Fireworks-specific request body extras perf_metrics_in_response and prompt_cache_max_len with HTTP 400 (extra_forbidden). Before this change, the only way to get a sweep to collect metrics against such a deployment was to patch load_test.py before starting locust — for example the LOAD_TEST_DISABLE_FIREWORKS_EXTRAS sed-patch hook in the Customers collect_data.py / GHA workflow added during Customers PR #593. That preflight added startup time and a per-run failure-noise window to every sweep.

What this PR does

Move the workaround into the locust file itself so the bench works out-of-the-box on TRT-LLM deployments and downstream sweep tooling no longer needs to patch the file.

  • New CLI flag --fireworks-extras=auto|on|off (env var FIREWORKS_EXTRAS), defaulting to auto.
    • auto (default): send the extras; if the deployment rejects them with a 4xx whose body names one of the extras, disable them for the rest of the run (emitting a single warning log line) and transparently retry the failing request once.
    • off: never send them — skips the one-failed-request auto-detect cost when you already know the deployment doesn't support them.
    • on: always send them — fail fast if you expect them to be supported.
  • BaseProvider gains a handle_request_failure(status_code, body) hook so providers can opt into a single transparent retry on schema errors.
  • FireworksProvider:
    • Class-level, thread-safe _extras_disabled flag shared across all locust users in the process (so each user does not pay the one-failed-request cost).
    • format_payload only includes the extras when enabled.
    • handle_request_failure recognizes 4xx responses whose body mentions one of the extras field names and disables them for the rest of the run, with a single warning log line.
  • _do_generate_text gives the provider one chance to recover from a failed request by rebuilding the payload and retrying. The very first request on an incompatible deployment now contributes a clean metric instead of being recorded as a failure.
  • llm_bench/README.md documents the new flag.
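The pieces above fit together roughly as follows. This is a minimal, self-contained sketch, not the actual `load_test.py` code: only `_extras_disabled`, `format_payload`, `handle_request_failure`, and the two extras field names come from this PR; everything else (constructor shape, payload fields, the `print`-based warning) is illustrative.

```python
import threading


class FireworksProvider:
    """Sketch of the auto-disable behavior described above."""

    # Fireworks-specific request body extras that strict schemas reject.
    _EXTRA_FIELD_KEYS = ("perf_metrics_in_response", "prompt_cache_max_len")

    # Class-level flag + lock: shared by every locust user in the process,
    # so only the very first failing request pays the auto-detect cost.
    _extras_disabled = False
    _extras_lock = threading.Lock()

    def __init__(self, mode: str = "auto"):  # mode: "auto" | "on" | "off"
        self.mode = mode
        if mode == "off":
            type(self)._extras_disabled = True  # never send the extras

    def format_payload(self, prompt: str) -> dict:
        payload = {"prompt": prompt, "max_tokens": 64}
        if not type(self)._extras_disabled:
            payload["perf_metrics_in_response"] = True
            payload["prompt_cache_max_len"] = 0
        return payload

    def handle_request_failure(self, status_code: int, body: str) -> bool:
        """Return True if the failure was consumed and the caller may retry once."""
        if self.mode != "auto" or type(self)._extras_disabled:
            return False  # "on" fails fast; already-disabled failures stay failures
        if not 400 <= int(status_code) < 500:
            return False  # 5xx is never a schema rejection
        if not any(key in body for key in self._EXTRA_FIELD_KEYS):
            return False  # unrelated 4xx stays a failure
        with type(self)._extras_lock:
            if not type(self)._extras_disabled:  # warn once, even under concurrency
                print(f"WARNING: disabling Fireworks extras: HTTP {status_code}")
                type(self)._extras_disabled = True
        return True
```

`_do_generate_text` would then call `handle_request_failure` on a failed response and, when it returns True, rebuild the payload (now without the extras) and retry the request once, so the first request on an incompatible deployment still yields a clean metric.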

Net effect

Sweeps against TRT-LLM deployments now collect TTFT/TPOT/latency metrics correctly out of the box, with a single warning line in the locust log instead of a failed first request and a sed-patched locust file.

Downstream cleanup (separate)

The Customers collect_data.py LOAD_TEST_DISABLE_FIREWORKS_EXTRAS sed-patch and the corresponding disable_fireworks_extras GHA workflow input become unnecessary once this lands — the bench handles the case itself. They can either be removed in a follow-up Customers PR or left in place as an explicit override (it will continue to work; the env-var-driven sed simply removes the same fields this PR would otherwise omit at runtime).

Verification

  • Module imports cleanly.
  • Unit-style smoke test (in-process) verified:
    • default auto: extras enabled by default
    • 400 with extra_forbidden: perf_metrics_in_response → handled, extras dropped from subsequent payloads
    • 400 with unrelated body → not handled (stays a failure)
    • 422 with prompt_cache_max_len not in schema → handled (any 4xx)
    • 5xx → not handled
    • --fireworks-extras=off → extras omitted from the very first request
    • --fireworks-extras=on → extras always sent, 400s never auto-handled (fail-fast)
  • Single warning log line on auto-disable, even under concurrent failures.
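The detection rule those smoke tests exercise can be restated as a pure function. `should_auto_disable` is a hypothetical name for illustration; per the PR description the real check lives inside `FireworksProvider.handle_request_failure`.

```python
EXTRA_FIELD_KEYS = ("perf_metrics_in_response", "prompt_cache_max_len")


def should_auto_disable(mode: str, status_code: int, body: str) -> bool:
    # Auto-disable iff the flag is "auto", the status is any 4xx, and the
    # response body names one of the two Fireworks extras.
    if mode != "auto":
        return False
    if not 400 <= status_code < 500:
        return False
    return any(key in body for key in EXTRA_FIELD_KEYS)


# The verification cases above, restated as assertions:
assert should_auto_disable("auto", 400, "extra_forbidden: perf_metrics_in_response")
assert not should_auto_disable("auto", 400, "some unrelated error")  # stays a failure
assert should_auto_disable("auto", 422, "prompt_cache_max_len not in schema")  # any 4xx
assert not should_auto_disable("auto", 503, "perf_metrics_in_response")  # 5xx never handled
assert not should_auto_disable("on", 400, "perf_metrics_in_response")  # fail-fast
```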

Slack Thread


Some serving images (notably TRT-LLM with strict OpenAI-compat schema
configured with extra='forbid') reject the Fireworks-specific request body
extras 'perf_metrics_in_response' and 'prompt_cache_max_len' with HTTP 400.
Until now the only workaround was to patch load_test.py at the start of each
load test (e.g. the Customers GHA workflow's LOAD_TEST_DISABLE_FIREWORKS_EXTRAS
sed-patch hook), which adds startup time and one-off failure noise to every run.

This change handles the rejection inside the locust file itself:

- New --fireworks-extras=auto|on|off flag (env var FIREWORKS_EXTRAS),
  defaulting to 'auto'.
- BaseProvider gains a handle_request_failure(status_code, body) hook so
  providers can opt into a single transparent retry on schema errors.
- FireworksProvider:
  * Class-level _extras_disabled flag (thread-safe), shared across users.
  * format_payload only includes the extras when enabled.
  * handle_request_failure recognizes 4xx responses whose body mentions one
    of the extras field names and disables them for the rest of the run.
- _do_generate_text gives the provider one chance to recover from a failed
  request by rebuilding the payload and retrying — so the very first request
  on an incompatible deployment now contributes a clean metric instead of
  being recorded as a failure.

Net effect: Sweeps against TRT-LLM deployments collect metrics correctly
out of the box. Downstream callers (Customers GHA workflow / collect_data.py
sed-patch, skill files) no longer need to preflight-patch the locust file.

Co-authored-by: kevinle-fw <kevinle-fw@users.noreply.github.com>

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 99eb44a.

Comment thread llm_bench/load_test.py
    if not any(key in body for key in self._EXTRA_FIELD_KEYS):
        return False
    snippet = body[:200].replace("\n", " ")
    type(self)._disable_extras(f"server returned HTTP 400: {snippet}")

Log message hardcodes "HTTP 400" for any 4xx status

Low Severity

handle_request_failure accepts any 4xx status code at the guard on line 986 (400 <= int(status_code) < 500), but the reason string passed to _disable_extras at line 996 hardcodes "server returned HTTP 400". When a deployment returns e.g. 422, the warning log will incorrectly say "HTTP 400", making it harder to debug. The status_code parameter is available but unused in the f-string.
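A minimal fix along the lines Bugbot suggests is to interpolate the actual status into the reason string. `disable_reason` is a hypothetical helper for illustration; in the PR the f-string is built inline before the `_disable_extras` call.

```python
def disable_reason(status_code: int, body: str) -> str:
    # Use the real 4xx status instead of hardcoding "HTTP 400",
    # so a 422 rejection logs as "HTTP 422".
    snippet = body[:200].replace("\n", " ")
    return f"server returned HTTP {status_code}: {snippet}"
```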


