Feat/implement streaming path in audio transcription #926

Open

WaelRabah wants to merge 5 commits into vllm-project:main from
WaelRabah:feat/implement-streaming-path-in-audio-transcription

Conversation

@WaelRabah

Adds a streaming (stream=true) path to proxy_multipart_request so the router can forward chunked transcription responses (SSE / ndjson) from backend Whisper engines straight to the client, instead of buffering the full JSON body before replying.

What changed

  • src/vllm_router/services/request_service/request.py
    • New streaming branch: opens the upstream aiohttp response with await client.post(...) (outside any async generator), forwards headers/status immediately, and streams body chunks via StreamingResponse (see the sketch after this list).
    • Upstream response lifetime is managed outside the body iterator. traced_stream()'s finally block explicitly calls backend_response.close() plus request_stats_monitor.on_request_complete(...), so an aborted consumer (downstream disconnect) can never leak the backend connection or skip stats cleanup.
    • on_new_request / on_request_complete are balanced on the post-failure path via except BaseException: ... raise, so a failed client.post(...) still completes the stats pair before bubbling up to the existing JSON error handlers.
  • src/tests/test_transcription_streaming.py
    • Full unit coverage for the streaming path: returns StreamingResponse, preserves content-type, passes form-data boundary, fires on_new_request / on_request_response / on_request_complete once each.
    • Dedicated cancellation test (test_streaming_closes_upstream_on_consumer_abort) calls body_iterator.aclose() after the first chunk and asserts that backend_response.close() and on_request_complete both fire. Guards against the connection-leak risk flagged in adversarial review.
    • Connection-failure test asserts 503 JSON response + stats balanced even when client.post(...) raises before headers.
  • tutorials/23-whisper-api-transcription.md — tutorial updated with streaming usage examples.
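
For reviewers, a minimal sketch of the streaming-branch shape described above. This is illustrative, not the exact code in request.py: it assumes aiohttp for the upstream call and FastAPI/Starlette's StreamingResponse, and the proxy_streaming wrapper, chunk size, and stats-monitor signatures are simplified stand-ins.

import time

from aiohttp import ClientSession
from fastapi.responses import StreamingResponse


async def proxy_streaming(client: ClientSession, backend_url: str, form, headers,
                          request_stats_monitor, chosen_url: str, request_id: str):
    # The stats pair opens before the upstream call; a failed post() must
    # still close it before bubbling up to the JSON error handlers.
    request_stats_monitor.on_new_request(chosen_url, request_id, time.time())
    try:
        # Opened outside any async generator so headers/status are available
        # immediately and post() failures surface here, not mid-stream.
        backend_response = await client.post(backend_url, data=form, headers=headers)
    except BaseException:
        request_stats_monitor.on_request_complete(chosen_url, request_id, time.time())
        raise
    request_stats_monitor.on_request_response(chosen_url, request_id, time.time())

    async def traced_stream():
        try:
            async for chunk in backend_response.content.iter_chunked(8192):
                yield chunk
        finally:
            # Runs on normal completion *and* on consumer abort (aclose),
            # so the upstream connection and the stats pair never leak.
            backend_response.close()
            request_stats_monitor.on_request_complete(chosen_url, request_id, time.time())

    return StreamingResponse(
        traced_stream(),
        status_code=backend_response.status,
        # Real code may need to filter hop-by-hop headers before forwarding.
        headers=dict(backend_response.headers),
    )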

Why

Whisper-style transcription can take tens of seconds per file. Without streaming, clients wait for the full response before receiving any text, and long requests look indistinguishable from hangs. Streaming gives time-to-first-token feedback and matches the shape already used for chat-completions in this router.

Risk / design notes

  • The streaming path deliberately keeps the upstream ClientResponse alive beyond the handler frame, then closes it from the body iterator's finally. The closure path is covered by the new cancellation test (mechanism sketched below).
  • Non-streaming (stream=false) path is unchanged — backward compatible.
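
For reference, a self-contained sketch of the aclose-triggers-cleanup mechanism the cancellation test relies on. This is not the actual test file: pytest-asyncio is assumed, and the MagicMock stands in for the upstream aiohttp response.

import pytest
from unittest.mock import MagicMock


@pytest.mark.asyncio
async def test_aclose_triggers_cleanup():
    backend_response = MagicMock()  # stand-in for the upstream aiohttp response
    completed = []

    async def traced_stream():
        try:
            yield b"first chunk"
            yield b"second chunk"
        finally:
            # aclose() raises GeneratorExit at the suspended yield,
            # which drives this cleanup exactly once.
            backend_response.close()
            completed.append("on_request_complete")

    it = traced_stream()
    assert await it.__anext__() == b"first chunk"
    await it.aclose()  # simulate the downstream client disconnecting

    backend_response.close.assert_called_once()
    assert completed == ["on_request_complete"]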

FIX #xxxx (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s when doing git commit.
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please classify PRs so the type of change is easy to understand. Prefix the PR title appropriately to indicate the type of change, using one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect from Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>
…rror

  handling between streaming and non-streaming paths. Stats hooks
  (on_new_request/on_request_response/on_request_complete) now fire on both paths.
  Tutorial fixed to use with open(...). Tests pass (8/8).

Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces streaming support for audio transcriptions. It updates the routing logic to parse the stream parameter, modifies the proxy request handler to return a StreamingResponse when requested, and ensures request statistics are correctly tracked throughout the streaming lifecycle. Additionally, comprehensive tests and updated documentation with streaming examples were added. Feedback suggests refining exception handling by catching Exception instead of BaseException, improving error handling for JSON decoding in the non-streaming path, and ensuring robust SSE parsing in the tutorial's Python example.
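
To make the SSE-parsing feedback concrete, here is a hedged sketch of a tolerant client-side consumer. The endpoint, model name, and event shape are illustrative assumptions, not the tutorial's exact example:

import json

import requests

# Skips blank keep-alive lines, ignores non-data lines, and stops on the
# conventional [DONE] sentinel (assumed here; confirm against the backend).
with open("sample.wav", "rb") as audio:
    with requests.post(
        "http://localhost:8000/v1/audio/transcriptions",  # illustrative URL
        files={"file": audio},
        data={"model": "whisper-large-v3", "stream": "true"},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw or not raw.startswith("data:"):
                continue
            payload = raw[len("data:"):].strip()
            if payload == "[DONE]":
                break
            event = json.loads(payload)
            print(event.get("text", ""), end="", flush=True)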

Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py
Comment thread tutorials/23-whisper-api-transcription.md Outdated
WaelRabah and others added 3 commits April 21, 2026 20:05
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Wael Rabah <47258198+WaelRabah@users.noreply.github.com>
…atch json decode errors

Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>
@WaelRabah force-pushed the feat/implement-streaming-path-in-audio-transcription branch from e6b9d77 to 68839d8 on April 21, 2026 at 22:43

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces streaming support for audio transcriptions by allowing the stream parameter in multipart requests. Key changes include updating the proxy_multipart_request function to handle Server-Sent Events (SSE) via StreamingResponse, integrating request statistics monitoring into the proxying logic, and adding comprehensive tests for various streaming scenarios. The documentation has also been updated with streaming examples. A critical feedback point was raised regarding a potential connection leak in the non-streaming path, where the backend response was not explicitly closed.

Comment on lines +1284 to +1287
finally:
    request_stats_monitor.on_request_complete(
        chosen_url, request_id, time.time()
    )

Severity: high

The backend_response is not explicitly closed in the non-streaming path. Since it was opened manually via await client.post(...) (without an async with context manager), it must be closed to prevent connection leaks, especially if the response body is not fully consumed or if an error occurs during parsing. This ensures resources are released back to the pool immediately.

Suggested change
finally:
    request_stats_monitor.on_request_complete(
        chosen_url, request_id, time.time()
    )
finally:
    backend_response.close()
    request_stats_monitor.on_request_complete(
        chosen_url, request_id, time.time()
    )
