Feat/implement streaming path in audio transcription#926
Feat/implement streaming path in audio transcription#926WaelRabah wants to merge 5 commits intovllm-project:mainfrom
Conversation
Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>
…rror handling between streaming and non-streaming paths. Stats hooks (on_new_request/on_request_response/on_request_complete) now fire on both paths. Tutorial fixed to use with open(...). Tests pass (8/8). Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>
There was a problem hiding this comment.
Code Review
This pull request introduces streaming support for audio transcriptions. It updates the routing logic to parse the stream parameter, modifies the proxy request handler to return a StreamingResponse when requested, and ensures request statistics are correctly tracked throughout the streaming lifecycle. Additionally, comprehensive tests and updated documentation with streaming examples were added. Feedback suggests refining exception handling by catching Exception instead of BaseException, improving error handling for JSON decoding in the non-streaming path, and ensuring robust SSE parsing in the tutorial's Python example.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Wael Rabah <47258198+WaelRabah@users.noreply.github.com>
…atch json decode errors Signed-off-by: WaelRabah11 <wael.rabah@multiversecomputing.com>
e6b9d77 to
68839d8
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces streaming support for audio transcriptions by allowing the stream parameter in multipart requests. Key changes include updating the proxy_multipart_request function to handle Server-Sent Events (SSE) via StreamingResponse, integrating request statistics monitoring into the proxying logic, and adding comprehensive tests for various streaming scenarios. The documentation has also been updated with streaming examples. A critical feedback point was raised regarding a potential connection leak in the non-streaming path, where the backend response was not explicitly closed.
| finally: | ||
| request_stats_monitor.on_request_complete( | ||
| chosen_url, request_id, time.time() | ||
| ) |
There was a problem hiding this comment.
The backend_response is not explicitly closed in the non-streaming path. Since it was opened manually via await client.post(...) (without an async with context manager), it must be closed to prevent connection leaks, especially if the response body is not fully consumed or if an error occurs during parsing. This ensures resources are released back to the pool immediately.
| finally: | |
| request_stats_monitor.on_request_complete( | |
| chosen_url, request_id, time.time() | |
| ) | |
| finally: | |
| backend_response.close() | |
| request_stats_monitor.on_request_complete( | |
| chosen_url, request_id, time.time() | |
| ) |
Adds a streaming (
stream=true) path toproxy_multipart_requestso the router can forward chunked transcription responses (SSE / ndjson) from backend Whisper engines straight to the client, instead of buffering the full JSON body before replying.What changed
src/vllm_router/services/request_service/request.pyaiohttpresponse withawait client.post(...)(outside any async generator), forwards headers/status immediately, and streams body chunks viaStreamingResponse.traced_stream()'sfinallyblock explicitly callsbackend_response.close()plusrequest_stats_monitor.on_request_complete(...), so an aborted consumer (downstream disconnect) can never leak the backend connection or skip stats cleanup.on_new_request/on_request_completeare balanced on the post-failure path viaexcept BaseException: ... raise, so a failedclient.post(...)still completes the stats pair before bubbling up to the existing JSON error handlers.src/tests/test_transcription_streaming.pyStreamingResponse, preserves content-type, passes form-data boundary, fireson_new_request/on_request_response/on_request_completeonce each.test_streaming_closes_upstream_on_consumer_abort) callsbody_iterator.aclose()after the first chunk and asserts thatbackend_response.close()andon_request_completeboth fire. Guards against the connection-leak risk flagged in adversarial review.client.post(...)raises before headers.tutorials/23-whisper-api-transcription.md— tutorial updated with streaming usage examples.Why
Whisper-style transcription can take tens of seconds per file. Without streaming, clients wait for the full response before receiving any text, and long requests look indistinguishable from hangs. Streaming gives time-to-first-token feedback and matches the shape already used for chat-completions in this router.
Risk / design notes
ClientResponsealive beyond the handler frame, then closes it from the body iterator'sfinally. The closure path is covered by the new cancellation test.stream=false) path is unchanged — backward compatible.FIX #xxxx (link existing issues this PR will resolve)
BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE
-swhen doinggit commit[Bugfix],[Feat], and[CI].Detailed Checklist (Click to Expand)
Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
[Bugfix]for bug fixes.[CI/Build]for build or continuous integration improvements.[Doc]for documentation fixes and improvements.[Feat]for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).[Router]for changes to thevllm_router(e.g., routing algorithm, router observability, etc.).[Misc]for PRs that do not fit the above categories. Please use this sparingly.Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR need to meet the following code quality standards:
pre-committo format your code. SeeREADME.mdfor installation.DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a
Signed-off-by:header which certifies agreement with the terms of the DCO.Using
-swithgit commitwill automatically add this header.What to Expect for the Reviews
We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11
, Shaoting-Feng or ApostaC.