Feat/implement streaming path in audio transcription #926

Open

WaelRabah wants to merge 5 commits into vllm-project:main from
WaelRabah:feat/implement-streaming-path-in-audio-transcription

Conversation

@WaelRabah

Adds a streaming (stream=true) path to proxy_multipart_request so the router can forward chunked transcription responses (SSE / ndjson) from backend Whisper engines straight to the client, instead of buffering the full JSON body before replying.

What changed

  • src/vllm_router/services/request_service/request.py
    • New streaming branch: opens the upstream aiohttp response with await client.post(...) (outside any async generator), forwards headers/status immediately, and streams body chunks via StreamingResponse (see the sketch after this list).
    • Upstream response lifetime is managed outside the body iterator. traced_stream()'s finally block explicitly calls backend_response.close() plus request_stats_monitor.on_request_complete(...), so an aborted consumer (downstream disconnect) can never leak the backend connection or skip stats cleanup.
    • on_new_request / on_request_complete are balanced on the post-failure path via except BaseException: ... raise, so a failed client.post(...) still completes the stats pair before bubbling up to the existing JSON error handlers.
  • src/tests/test_transcription_streaming.py
    • Full unit coverage for the streaming path: returns StreamingResponse, preserves content-type, passes form-data boundary, fires on_new_request / on_request_response / on_request_complete once each.
    • Dedicated cancellation test (test_streaming_closes_upstream_on_consumer_abort) calls body_iterator.aclose() after the first chunk and asserts that backend_response.close() and on_request_complete both fire. Guards against the connection-leak risk flagged in adversarial review.
    • Connection-failure test asserts 503 JSON response + stats balanced even when client.post(...) raises before headers.
  • tutorials/23-whisper-api-transcription.md — tutorial updated with streaming usage examples.
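
For reviewers, a minimal sketch of the streaming-branch shape described above. This is illustrative, not the exact code in request.py: it assumes aiohttp for the upstream call and FastAPI/Starlette's StreamingResponse, and the proxy_streaming wrapper, chunk size, and stats-monitor signatures are simplified stand-ins.

import time

from aiohttp import ClientSession
from fastapi.responses import StreamingResponse


async def proxy_streaming(client: ClientSession, backend_url: str, form, headers,
                          request_stats_monitor, chosen_url: str, request_id: str):
    # The stats pair opens before the upstream call; a failed post() must
    # still close it before bubbling up to the JSON error handlers.
    request_stats_monitor.on_new_request(chosen_url, request_id, time.time())
    try:
        # Opened outside any async generator so headers/status are available
        # immediately and post() failures surface here, not mid-stream.
        backend_response = await client.post(backend_url, data=form, headers=headers)
    except BaseException:
        request_stats_monitor.on_request_complete(chosen_url, request_id, time.time())
        raise
    request_stats_monitor.on_request_response(chosen_url, request_id, time.time())

    async def traced_stream():
        try:
            async for chunk in backend_response.content.iter_chunked(8192):
                yield chunk
        finally:
            # Runs on normal completion *and* on consumer abort (aclose),
            # so the upstream connection and the stats pair never leak.
            backend_response.close()
            request_stats_monitor.on_request_complete(chosen_url, request_id, time.time())

    return StreamingResponse(
        traced_stream(),
        status_code=backend_response.status,
        # Real code may need to filter hop-by-hop headers before forwarding.
        headers=dict(backend_response.headers),
    )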

Why

Whisper-style transcription can take tens of seconds per file. Without streaming, clients wait for the full response before receiving any text, and long requests look indistinguishable from hangs. Streaming gives time-to-first-token feedback and matches the shape already used for chat-completions in this router.

Risk / design notes

  • The streaming path deliberately keeps the upstream ClientResponse alive beyond the handler frame, then closes it from the body iterator's finally. The closure path is covered by the new cancellation test (mechanism sketched below).
  • Non-streaming (stream=false) path is unchanged — backward compatible.
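
For reference, a self-contained sketch of the aclose-triggers-cleanup mechanism the cancellation test relies on. This is not the actual test file: pytest-asyncio is assumed, and the MagicMock stands in for the upstream aiohttp response.

import pytest
from unittest.mock import MagicMock


@pytest.mark.asyncio
async def test_aclose_triggers_cleanup():
    backend_response = MagicMock()  # stand-in for the upstream aiohttp response
    completed = []

    async def traced_stream():
        try:
            yield b"first chunk"
            yield b"second chunk"
        finally:
            # aclose() raises GeneratorExit at the suspended yield,
            # which drives this cleanup exactly once.
            backend_response.close()
            completed.append("on_request_complete")

    it = traced_stream()
    assert await it.__anext__() == b"first chunk"
    await it.aclose()  # simulate the downstream client disconnecting

    backend_response.close.assert_called_once()
    assert completed == ["on_request_complete"]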

FIX #xxxx (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s when doing git commit.
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please classify PRs so the type of change is easy to understand. Prefix the PR title appropriately to indicate the type of change, using one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect from Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>
…rror

  handling between streaming and non-streaming paths. Stats hooks
  (on_new_request/on_request_response/on_request_complete) now fire on both paths.
  Tutorial fixed to use with open(...). Tests pass (8/8).

Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces streaming support for audio transcriptions. It updates the routing logic to parse the stream parameter, modifies the proxy request handler to return a StreamingResponse when requested, and ensures request statistics are correctly tracked throughout the streaming lifecycle. Additionally, comprehensive tests and updated documentation with streaming examples were added. Feedback suggests refining exception handling by catching Exception instead of BaseException, improving error handling for JSON decoding in the non-streaming path, and ensuring robust SSE parsing in the tutorial's Python example.
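
To make the SSE-parsing feedback concrete, here is a hedged sketch of a tolerant client-side consumer. The endpoint, model name, and event shape are illustrative assumptions, not the tutorial's exact example:

import json

import requests

# Skips blank keep-alive lines, ignores non-data lines, and stops on the
# conventional [DONE] sentinel (assumed here; confirm against the backend).
with open("sample.wav", "rb") as audio:
    with requests.post(
        "http://localhost:8000/v1/audio/transcriptions",  # illustrative URL
        files={"file": audio},
        data={"model": "whisper-large-v3", "stream": "true"},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw or not raw.startswith("data:"):
                continue
            payload = raw[len("data:"):].strip()
            if payload == "[DONE]":
                break
            event = json.loads(payload)
            print(event.get("text", ""), end="", flush=True)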

Comment thread src/vllm_router/services/request_service/request.py Outdated
Comment thread src/vllm_router/services/request_service/request.py
Comment thread tutorials/23-whisper-api-transcription.md Outdated
WaelRabah and others added 3 commits April 21, 2026 20:05
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Wael Rabah <47258198+WaelRabah@users.noreply.github.com>
…atch json decode errors

Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>
@WaelRabah force-pushed the feat/implement-streaming-path-in-audio-transcription branch from e6b9d77 to 68839d8 on April 21, 2026 at 22:43

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces streaming support for audio transcriptions by allowing the stream parameter in multipart requests. Key changes include updating the proxy_multipart_request function to handle Server-Sent Events (SSE) via StreamingResponse, integrating request statistics monitoring into the proxying logic, and adding comprehensive tests for various streaming scenarios. The documentation has also been updated with streaming examples. A critical feedback point was raised regarding a potential connection leak in the non-streaming path, where the backend response was not explicitly closed.

Comment on lines +1284 to +1287
finally:
    request_stats_monitor.on_request_complete(
        chosen_url, request_id, time.time()
    )

Severity: high

The backend_response is not explicitly closed in the non-streaming path. Since it was opened manually via await client.post(...) (without an async with context manager), it must be closed to prevent connection leaks, especially if the response body is not fully consumed or if an error occurs during parsing. This ensures resources are released back to the pool immediately.

Suggested change
finally:
    request_stats_monitor.on_request_complete(
        chosen_url, request_id, time.time()
    )
finally:
    backend_response.close()
    request_stats_monitor.on_request_complete(
        chosen_url, request_id, time.time()
    )
