feat(asr): ✨ add audio transcription endpoint #71
base: release/v2.0.0
Conversation
- Introduced a new endpoint for audio transcription using websockets.
- Implemented streaming of audio files to an external transcription service.
- Added necessary models and validation for transcription responses.
- Updated settings to include TRANSCRIBE_WS_URI for configuration.
Reviewer's Guide

Adds a new FastAPI audio transcription endpoint that streams uploaded audio over a websocket to an external ASR service, validates and normalizes the streaming responses into typed models, and exposes them as a structured transcription API, with corresponding configuration and dependency updates.

Sequence diagram for the new audio transcription endpoint

sequenceDiagram
actor Client
participant FastAPIApp
participant TranscribeEndpoint as transcribe
participant AsrWebsocketClient as websockets_client
participant ExternalASRService
Client->>FastAPIApp: POST /transcribe (file)
FastAPIApp->>TranscribeEndpoint: transcribe(file)
TranscribeEndpoint->>TranscribeEndpoint: check settings.TRANSCRIBE_WS_URI
alt missing URI
TranscribeEndpoint-->>Client: HTTP 500 (TRANSCRIBE_WS_URI not configured)
else configured
TranscribeEndpoint->>AsrWebsocketClient: connect(TRANSCRIBE_WS_URI)
AsrWebsocketClient->>ExternalASRService: WebSocket handshake
activate ExternalASRService
TranscribeEndpoint->>TranscribeEndpoint: create_task(_receive_updates)
TranscribeEndpoint->>TranscribeEndpoint: _stream_audio(file, websocket)
loop audio chunks
TranscribeEndpoint->>ExternalASRService: send(chunk bytes)
ExternalASRService-->>AsrWebsocketClient: partial JSON updates
AsrWebsocketClient->>TranscribeEndpoint: _receive_updates parses WLKMessageRawResponse
end
TranscribeEndpoint->>ExternalASRService: send(empty frame)
ExternalASRService-->>AsrWebsocketClient: status active_transcription
AsrWebsocketClient->>TranscribeEndpoint: WLKMessageStatus(last_active)
ExternalASRService-->>AsrWebsocketClient: ready_to_stop
AsrWebsocketClient->>TranscribeEndpoint: WLKMessageReadyToStopMessage
deactivate ExternalASRService
TranscribeEndpoint->>TranscribeEndpoint: map lines to list TranscriptionItem
TranscribeEndpoint-->>Client: 200 OK (list TranscriptionItem)
end
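The "audio chunks" loop in the diagram amounts to slicing the decoded signal into fixed-size windows and sending their bytes. A minimal sketch of that chunking step, assuming PCM16 framing and the `SAMPLE_RATE_HZ`/`CHUNK_SAMPLES` constants seen in the diff (their values here are guesses, not taken from the PR):

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000          # assumed target sample rate
CHUNK_SAMPLES = SAMPLE_RATE_HZ   # assumed 1-second chunks


def iter_chunks(audio: np.ndarray) -> list[bytes]:
    """Split a mono float signal into fixed-size PCM16 byte chunks."""
    chunks = []
    for i in range(0, len(audio), CHUNK_SAMPLES):
        window = audio[i : i + CHUNK_SAMPLES]
        # Convert float [-1, 1] samples to 16-bit PCM bytes for the wire.
        pcm16 = (np.clip(window, -1.0, 1.0) * 32767).astype(np.int16)
        chunks.append(pcm16.tobytes())
    return chunks


# Two seconds plus one sample of silence -> three chunks.
chunks = iter_chunks(np.zeros(SAMPLE_RATE_HZ * 2 + 1, dtype=np.float32))
```

Each chunk would then be passed to `websocket.send(chunk)`, followed by the empty frame that signals end of audio.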
Class diagram for new ASR websocket models and transcription item

classDiagram
class BaseModel
class WLKMessageModelConfig {
+str asr_model
+str asr_backend
+str diarization_model
+str diarization_backend
}
class WLKMessageConfig {
+Literal_config type
+bool useAudioWorklet
+WLKMessageModelConfig models
}
class WLKMessageTranscriptionLine {
+int speaker
+str text
+timedelta start
+timedelta end
+bool final
+str speaker_id
+str detected_language
+parse_hhmmss(value) timedelta
}
class WLKMessageStatus {
+str status
+list~WLKMessageTranscriptionLine~ lines
+str buffer_transcription
+str buffer_diarization
+str buffer_translation
+float remaining_time_transcription
+float remaining_time_diarization
+dict~str,str~ speaker_ids
}
class WLKMessageSpeakerEmbeddings {
+Literal_speaker_embeddings type
+dict~str,str~ speaker_ids
+int speaker_id_bits
+WLKMessageModelConfig models
}
class WLKMessageReadyToStopMessage {
+Literal_ready_to_stop type
}
class WLKMessageRawResponse {
<<union>>
WLKMessageConfig
WLKMessageStatus
WLKMessageSpeakerEmbeddings
WLKMessageReadyToStopMessage
}
class TranscriptionItem {
+int speaker_no
+str speaker_id
+timedelta start
+timedelta end
+str text
+parse_hhmmss(value) timedelta
}
BaseModel <|-- WLKMessageModelConfig
BaseModel <|-- WLKMessageConfig
BaseModel <|-- WLKMessageTranscriptionLine
BaseModel <|-- WLKMessageStatus
BaseModel <|-- WLKMessageSpeakerEmbeddings
BaseModel <|-- WLKMessageReadyToStopMessage
BaseModel <|-- TranscriptionItem
WLKMessageConfig --> WLKMessageModelConfig : models
WLKMessageSpeakerEmbeddings --> WLKMessageModelConfig : models
WLKMessageStatus --> WLKMessageTranscriptionLine : lines
WLKMessageRawResponse ..> WLKMessageConfig
WLKMessageRawResponse ..> WLKMessageStatus
WLKMessageRawResponse ..> WLKMessageSpeakerEmbeddings
WLKMessageRawResponse ..> WLKMessageReadyToStopMessage
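The `WLKMessageRawResponse` union above is what `ASR_RAW_RESPONSE_ADAPTER.validate_python` (visible in the diff) dispatches over. A reduced sketch with only two of the message models, assuming pydantic v2:

```python
from typing import Literal, Union

from pydantic import BaseModel, TypeAdapter


class WLKMessageReadyToStopMessage(BaseModel):
    type: Literal["ready_to_stop"]


class WLKMessageStatus(BaseModel):
    status: str
    lines: list = []


# Validate a raw websocket payload against the union of message models;
# only two of the four models from the class diagram are reproduced here.
ASR_RAW_RESPONSE_ADAPTER = TypeAdapter(
    Union[WLKMessageReadyToStopMessage, WLKMessageStatus]
)

msg = ASR_RAW_RESPONSE_ADAPTER.validate_python({"type": "ready_to_stop"})
```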
File-Level Changes
Hey - I've found 3 issues, and left some high level feedback:
- The `_stream_audio` function reads the entire uploaded file into memory and then decodes/resamples it with `librosa.load`; if you expect large inputs, consider enforcing a maximum file size/duration or using a more streaming-friendly decoding approach to avoid high memory usage and latency.
- `WLKMessageStatus.status` is typed as a plain `str` but later matched against the literal value `"active_transcription"`; tightening this to a `Literal[...]` (or an enum) would make the pattern matching and validation more robust and self-documenting.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `_stream_audio` function reads the entire uploaded file into memory and then decodes/resamples it with `librosa.load`; if you expect large inputs, consider enforcing a maximum file size/duration or using a more streaming-friendly decoding approach to avoid high memory usage and latency.
- `WLKMessageStatus.status` is typed as a plain `str` but later matched against the literal value `"active_transcription"`; tightening this to a `Literal[...]` (or an enum) would make the pattern matching and validation more robust and self-documenting.
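The second suggestion could be sketched like this; `"active_transcription"` is the only status value visible in this PR, so the Literal's members are an assumption to extend as other statuses are discovered:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class WLKMessageStatus(BaseModel):
    # Tightened from plain `str`; extend the Literal as new statuses appear.
    status: Literal["active_transcription"]
```

With this change, an unexpected status fails validation up front instead of silently falling through the `match` statement.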
## Individual Comments
### Comment 1
<location> `aymurai/api/endpoints/routers/asr/transcribe.py:38-39` </location>
<code_context>
+ file: UploadFile,
+ websocket: websockets.ClientConnection,
+) -> int:
+ payload = await file.read()
+ audio, _ = librosa.load(io.BytesIO(payload), sr=SAMPLE_RATE_HZ, mono=True)
+ total_bytes = 0
+ for i in range(0, len(audio), CHUNK_SAMPLES):
</code_context>
<issue_to_address>
**suggestion (performance):** Reading the entire file into memory can be problematic for large uploads.
This reads the entire upload into memory before decoding, which risks excessive memory use or crashes with large files. Please either enforce and validate a maximum upload size before reading, or switch to a streaming-friendly decoding approach if supported by your ASR stack.
</issue_to_address>
### Comment 2
<location> `aymurai/api/endpoints/routers/asr/transcribe.py:80-89` </location>
<code_context>
+ return None
+
+
+async def _receive_updates(
+ websocket: websockets.ClientConnection,
+) -> WLKMessageStatus | None:
+ last_active_transcription: WLKMessageStatus | None = None
+ while True:
+ try:
+ msg = await websocket.recv()
+ except websockets.exceptions.ConnectionClosedOK:
+ logger.info("connection closed normally")
+ break
+ except websockets.exceptions.WebSocketException as exc:
+ logger.error(f"websocket error while receiving updates: {exc}")
+ break
+
+ parsed = _parse_ws_message(msg)
+ match parsed:
+ case None:
+ continue
+ case WLKMessageStatus(status="active_transcription") as message:
+ last_active_transcription = message
+ case WLKMessageReadyToStopMessage():
+ break
+
+ return last_active_transcription
+
+
</code_context>
<issue_to_address>
**issue:** Websocket receive loop may never terminate if the remote service misbehaves.
The loop only exits on specific close/exception cases or a `ready_to_stop` message. If the server stays open but stops sending messages, this coroutine will hang and `transcribe` will never return. Please add a safeguard (e.g., overall timeout via `asyncio.wait_for`, periodic inactivity timeout, or a max message count) to handle a stalled backend.
</issue_to_address>
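A minimal safeguard along these lines wraps each `recv()` in `asyncio.wait_for`; the timeout value below is a placeholder for a per-message inactivity budget:

```python
import asyncio

RECV_TIMEOUT_S = 30.0  # assumed inactivity budget per message


async def recv_with_timeout(websocket, timeout: float = RECV_TIMEOUT_S):
    """Wrap websocket.recv() so a silent backend cannot hang the loop forever.

    Returns None on timeout; the caller treats None as a stalled backend
    and breaks out of the receive loop instead of waiting indefinitely.
    """
    try:
        return await asyncio.wait_for(websocket.recv(), timeout=timeout)
    except asyncio.TimeoutError:
        return None
```

In `_receive_updates`, `msg = await websocket.recv()` would become `msg = await recv_with_timeout(websocket)` with a `break` on `None`. An overall deadline around the whole `transcribe` call would work as well.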
### Comment 3
<location> `aymurai/api/endpoints/routers/asr/transcribe.py:63-72` </location>
<code_context>
+ try:
+ async with websockets.connect(settings.TRANSCRIBE_WS_URI) as websocket:
+ receive_task = asyncio.create_task(_receive_updates(websocket))
+ try:
+ total_bytes = await _stream_audio(file, websocket)
+ await websocket.send(b"")
+ logger.info(f"sent {total_bytes} bytes to transcription service")
+ last_active_transcription = await receive_task
+ except Exception:
+ if not receive_task.done():
+ receive_task.cancel()
+ with contextlib.suppress(asyncio.CancelledError):
+ await receive_task
+ raise
+ except websockets.exceptions.WebSocketException as exc:
+ logger.error(f"websocket error during transcription: {exc}")
+ raise HTTPException(
+ status_code=status.HTTP_502_BAD_GATEWAY,
+ detail="Transcription service websocket error",
</code_context>
<issue_to_address>
**suggestion (bug_risk):** All non-websocket errors are mapped to a generic 500, masking client-side issues.
Exceptions from `_stream_audio` (e.g. invalid/unsupported audio format, decoding errors) are all surfaced as 5xx. Consider explicitly catching known decoding/validation errors and returning a 4xx for bad input, keeping 5xx for genuine backend or internal failures.
</issue_to_address>
payload = await file.read()
audio, _ = librosa.load(io.BytesIO(payload), sr=SAMPLE_RATE_HZ, mono=True)
suggestion (performance): Reading the entire file into memory can be problematic for large uploads.
This reads the entire upload into memory before decoding, which risks excessive memory use or crashes with large files. Please either enforce and validate a maximum upload size before reading, or switch to a streaming-friendly decoding approach if supported by your ASR stack.
async def _receive_updates(
    websocket: websockets.ClientConnection,
) -> WLKMessageStatus | None:
    last_active_transcription: WLKMessageStatus | None = None
    while True:
        try:
            msg = await websocket.recv()
        except websockets.exceptions.ConnectionClosedOK:
            logger.info("connection closed normally")
            break
issue: Websocket receive loop may never terminate if the remote service misbehaves.
The loop only exits on specific close/exception cases or a ready_to_stop message. If the server stays open but stops sending messages, this coroutine will hang and transcribe will never return. Please add a safeguard (e.g., overall timeout via asyncio.wait_for, periodic inactivity timeout, or a max message count) to handle a stalled backend.
try:
    parsed = json.loads(message)
except json.JSONDecodeError:
    logger.warning(f"received non-json websocket payload: {payload_preview}")
    return None


try:
    return ASR_RAW_RESPONSE_ADAPTER.validate_python(parsed)
except ValidationError as exc:
    logger.warning(
suggestion (bug_risk): All non-websocket errors are mapped to a generic 500, masking client-side issues.
Exceptions from _stream_audio (e.g. invalid/unsupported audio format, decoding errors) are all surfaced as 5xx. Consider explicitly catching known decoding/validation errors and returning a 4xx for bad input, keeping 5xx for genuine backend or internal failures.
Summary by Sourcery
Add a new audio transcription API that streams uploaded audio to an external websocket-based ASR service and returns structured transcription results.
New Features:
Enhancements:
Build: