
Conversation

jedzill4 (Contributor) commented on Jan 19, 2026

  • Introduced a new endpoint for audio transcription using websockets.
  • Implemented streaming of audio files to an external transcription service.
  • Added necessary models and validation for transcription responses.
  • Updated settings to include TRANSCRIBE_WS_URI for configuration.

Summary by Sourcery

Add a new audio transcription API that streams uploaded audio to an external websocket-based ASR service and returns structured transcription results.

New Features:

  • Expose a /transcribe endpoint under the audio/transcription tag for uploading audio files and receiving transcribed segments.
  • Define ASR websocket response and transcription data models to validate and normalize messages from the external transcription service.

Enhancements:

  • Add configuration support for an external transcription websocket URI via the TRANSCRIBE_WS_URI setting.
  • Add librosa as a dependency to support audio loading and resampling for streaming to the ASR service.

Build:

  • Include librosa as a new runtime dependency in pyproject.toml.

sourcery-ai bot commented on Jan 19, 2026

Reviewer's Guide

Adds a new FastAPI audio transcription endpoint that streams uploaded audio over a websocket to an external ASR service, validates and normalizes the streaming responses into typed models, and exposes them as a structured transcription API, with corresponding configuration and dependency updates.

Sequence diagram for the new audio transcription endpoint

sequenceDiagram
    actor Client
    participant FastAPIApp
    participant TranscribeEndpoint as transcribe
    participant AsrWebsocketClient as websockets_client
    participant ExternalASRService

    Client->>FastAPIApp: POST /transcribe (file)
    FastAPIApp->>TranscribeEndpoint: transcribe(file)
    TranscribeEndpoint->>TranscribeEndpoint: check settings.TRANSCRIBE_WS_URI
    alt missing URI
        TranscribeEndpoint-->>Client: HTTP 500 (TRANSCRIBE_WS_URI not configured)
    else configured
        TranscribeEndpoint->>AsrWebsocketClient: connect(TRANSCRIBE_WS_URI)
        AsrWebsocketClient->>ExternalASRService: WebSocket handshake
        activate ExternalASRService
        TranscribeEndpoint->>TranscribeEndpoint: create_task(_receive_updates)
        TranscribeEndpoint->>TranscribeEndpoint: _stream_audio(file, websocket)
        loop audio chunks
            TranscribeEndpoint->>ExternalASRService: send(chunk bytes)
            ExternalASRService-->>AsrWebsocketClient: partial JSON updates
            AsrWebsocketClient->>TranscribeEndpoint: _receive_updates parses WLKMessageRawResponse
        end
        TranscribeEndpoint->>ExternalASRService: send(empty frame)
        ExternalASRService-->>AsrWebsocketClient: status active_transcription
        AsrWebsocketClient->>TranscribeEndpoint: WLKMessageStatus(last_active)
        ExternalASRService-->>AsrWebsocketClient: ready_to_stop
        AsrWebsocketClient->>TranscribeEndpoint: WLKMessageReadyToStopMessage
        deactivate ExternalASRService
        TranscribeEndpoint->>TranscribeEndpoint: map lines to list TranscriptionItem
        TranscribeEndpoint-->>Client: 200 OK (list TranscriptionItem)
    end

Class diagram for new ASR websocket models and transcription item

classDiagram
    class BaseModel

    class WLKMessageModelConfig {
        +str asr_model
        +str asr_backend
        +str diarization_model
        +str diarization_backend
    }

    class WLKMessageConfig {
        +Literal_config type
        +bool useAudioWorklet
        +WLKMessageModelConfig models
    }

    class WLKMessageTranscriptionLine {
        +int speaker
        +str text
        +timedelta start
        +timedelta end
        +bool final
        +str speaker_id
        +str detected_language
        +parse_hhmmss(value) timedelta
    }

    class WLKMessageStatus {
        +str status
        +list~WLKMessageTranscriptionLine~ lines
        +str buffer_transcription
        +str buffer_diarization
        +str buffer_translation
        +float remaining_time_transcription
        +float remaining_time_diarization
        +dict~str,str~ speaker_ids
    }

    class WLKMessageSpeakerEmbeddings {
        +Literal_speaker_embeddings type
        +dict~str,str~ speaker_ids
        +int speaker_id_bits
        +WLKMessageModelConfig models
    }

    class WLKMessageReadyToStopMessage {
        +Literal_ready_to_stop type
    }

    class WLKMessageRawResponse {
        <<union>>
        WLKMessageConfig
        WLKMessageStatus
        WLKMessageSpeakerEmbeddings
        WLKMessageReadyToStopMessage
    }

    class TranscriptionItem {
        +int speaker_no
        +str speaker_id
        +timedelta start
        +timedelta end
        +str text
        +parse_hhmmss(value) timedelta
    }

    BaseModel <|-- WLKMessageModelConfig
    BaseModel <|-- WLKMessageConfig
    BaseModel <|-- WLKMessageTranscriptionLine
    BaseModel <|-- WLKMessageStatus
    BaseModel <|-- WLKMessageSpeakerEmbeddings
    BaseModel <|-- WLKMessageReadyToStopMessage
    BaseModel <|-- TranscriptionItem

    WLKMessageConfig --> WLKMessageModelConfig : models
    WLKMessageSpeakerEmbeddings --> WLKMessageModelConfig : models
    WLKMessageStatus --> WLKMessageTranscriptionLine : lines

    WLKMessageRawResponse ..> WLKMessageConfig
    WLKMessageRawResponse ..> WLKMessageStatus
    WLKMessageRawResponse ..> WLKMessageSpeakerEmbeddings
    WLKMessageRawResponse ..> WLKMessageReadyToStopMessage

File-Level Changes

Change: Expose a new audio transcription REST endpoint in the main API router.
Details:
  • Register the new transcription router in the central FastAPI router with an audio/transcription tag.
  • Remove an unused anonymizer database import from the core router module.
Files: aymurai/api/core.py
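
As a rough illustration of this wiring (the module path follows the new file's location, but the `router` and `api_router` names and the exact contents of aymurai/api/core.py are assumptions):

from fastapi import APIRouter

from aymurai.api.endpoints.routers.asr.transcribe import router as transcribe_router

# Central router assembly; the real core.py may register additional routers.
api_router = APIRouter()
api_router.include_router(transcribe_router, tags=["audio/transcription"])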
Change: Add configuration support for the external websocket transcription service and its dependency.
Details:
  • Introduce a TRANSCRIBE_WS_URI setting (optional string) for configuring the external ASR websocket endpoint.
  • Load the new setting through the existing settings initialization flow.
  • Add librosa as a runtime dependency for audio loading and resampling.
Files: aymurai/settings.py, pyproject.toml
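
A minimal sketch of the new setting, assuming a pydantic-settings based Settings class (the real aymurai/settings.py may be organized differently):

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Websocket URI of the external ASR service, e.g. ws://asr-host:8000/asr.
    # Left unset (None) when transcription is not configured.
    TRANSCRIBE_WS_URI: str | None = None


settings = Settings()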
Change: Implement websocket-based streaming of uploaded audio to an external ASR service and expose it via a /transcribe endpoint.
Details:
  • Define router-level constants for sample rate, chunking behavior, logging limits, and a Pydantic TypeAdapter for websocket responses.
  • Implement _stream_audio to read the uploaded file, decode and resample it with librosa, slice it into fixed-size chunks, convert it to int16 PCM, and stream the bytes over an established websocket connection while tracking the total bytes sent.
  • Implement _parse_ws_message to safely parse text or binary websocket frames into JSON, log truncated previews of invalid payloads, and validate them into a discriminated union of response types using Pydantic; unrecognized payloads are logged and ignored.
  • Implement _receive_updates to continuously read websocket messages, track the last active_transcription status message, and stop when a ready_to_stop message is received or the connection closes or errors.
  • Implement the POST /transcribe endpoint, which checks configuration, opens a websocket connection to TRANSCRIBE_WS_URI, concurrently streams audio and receives updates, handles cancellation and websocket errors with appropriate HTTP error mapping, and returns a list of normalized TranscriptionItem objects built from the final active transcription status (or an empty list if none arrived).
Files: aymurai/api/endpoints/routers/asr/transcribe.py
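
A condensed sketch of the chunked streaming step described in this row; the constant values are assumptions and error handling is omitted:

import io

import librosa
import numpy as np
import websockets
from fastapi import UploadFile

SAMPLE_RATE_HZ = 16_000  # assumed target sample rate
CHUNK_SAMPLES = 4_096    # assumed fixed chunk size, in samples


async def _stream_audio(file: UploadFile, websocket: websockets.ClientConnection) -> int:
    """Decode the upload, resample to mono, and stream int16 PCM in fixed-size chunks."""
    payload = await file.read()
    audio, _ = librosa.load(io.BytesIO(payload), sr=SAMPLE_RATE_HZ, mono=True)

    total_bytes = 0
    for i in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[i : i + CHUNK_SAMPLES]
        # librosa returns float32 in [-1.0, 1.0]; convert to int16 PCM bytes.
        pcm = (chunk * 32767).astype(np.int16).tobytes()
        await websocket.send(pcm)
        total_bytes += len(pcm)
    return total_bytes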
Change: Introduce typed models for websocket ASR protocol messages and normalized transcription items.
Details:
  • Implement a flexible _parse_hhmmss helper that converts HH:MM:SS strings, raw seconds (int/float), or timedeltas into timedelta objects with validation.
  • Define Pydantic models for websocket config, status updates, speaker embeddings, and ready-to-stop messages, using Literal fields for type discrimination and proper nesting of model configuration information.
  • Model transcription lines with diarization metadata and per-field validators to normalize start/end time fields to timedelta.
  • Define a union type WLKMessageRawResponse over all possible websocket message variants for validation via TypeAdapter.
  • Expose a TranscriptionItem API model that normalizes speaker IDs and time fields for the HTTP response, reusing the same time parsing validator logic.
Files: aymurai/api/meta/asr/websocket.py
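
An abridged sketch of these models, assuming Pydantic v2; only two of the four message variants and a subset of the fields from the class diagram are shown:

from datetime import timedelta
from typing import Literal

from pydantic import BaseModel, TypeAdapter, field_validator


def _parse_hhmmss(value: str | int | float | timedelta) -> timedelta:
    """Accept 'HH:MM:SS' strings, raw seconds, or timedelta values."""
    if isinstance(value, timedelta):
        return value
    if isinstance(value, (int, float)):
        return timedelta(seconds=value)
    hours, minutes, seconds = (float(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


class WLKMessageTranscriptionLine(BaseModel):
    speaker: int
    text: str
    start: timedelta
    end: timedelta

    @field_validator("start", "end", mode="before")
    @classmethod
    def parse_hhmmss(cls, value):
        return _parse_hhmmss(value)


class WLKMessageStatus(BaseModel):
    status: str
    lines: list[WLKMessageTranscriptionLine] = []


class WLKMessageReadyToStopMessage(BaseModel):
    type: Literal["ready_to_stop"]


# Union used by the router's TypeAdapter; the real module also covers the
# config and speaker-embedding message variants.
WLKMessageRawResponse = WLKMessageStatus | WLKMessageReadyToStopMessage
ASR_RAW_RESPONSE_ADAPTER = TypeAdapter(WLKMessageRawResponse)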

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help


sourcery-ai bot left a comment

Hey - I've found 3 issues, and left some high level feedback:

  • The _stream_audio function reads the entire uploaded file into memory and then decodes/resamples it with librosa.load; if you expect large inputs, consider enforcing a maximum file size/duration or using a more streaming-friendly decoding approach to avoid high memory usage and latency.
  • WLKMessageStatus.status is typed as a plain str but later matched against the literal value "active_transcription"; tightening this to a Literal[...] (or an enum) would make the pattern matching and validation more robust and self-documenting.
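For the second point, a minimal sketch of the suggested tightening (the set of literal values is an assumption and would need to match what the service actually emits):

from typing import Literal

from pydantic import BaseModel


class WLKMessageStatus(BaseModel):
    # Remaining fields from the current model are omitted for brevity.
    status: Literal["active_transcription"]  # extend with the other statuses the service sends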
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `_stream_audio` function reads the entire uploaded file into memory and then decodes/resamples it with `librosa.load`; if you expect large inputs, consider enforcing a maximum file size/duration or using a more streaming-friendly decoding approach to avoid high memory usage and latency.
- `WLKMessageStatus.status` is typed as a plain `str` but later matched against the literal value `"active_transcription"`; tightening this to a `Literal[...]` (or an enum) would make the pattern matching and validation more robust and self-documenting.

## Individual Comments

### Comment 1
<location> `aymurai/api/endpoints/routers/asr/transcribe.py:38-39` </location>
<code_context>
+    file: UploadFile,
+    websocket: websockets.ClientConnection,
+) -> int:
+    payload = await file.read()
+    audio, _ = librosa.load(io.BytesIO(payload), sr=SAMPLE_RATE_HZ, mono=True)
+    total_bytes = 0
+    for i in range(0, len(audio), CHUNK_SAMPLES):
</code_context>

<issue_to_address>
**suggestion (performance):** Reading the entire file into memory can be problematic for large uploads.

This reads the entire upload into memory before decoding, which risks excessive memory use or crashes with large files. Please either enforce and validate a maximum upload size before reading, or switch to a streaming-friendly decoding approach if supported by your ASR stack.
</issue_to_address>
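
A hedged sketch of one way to enforce such a limit before decoding; the limit value and helper name are assumptions, not part of the current diff:

from fastapi import HTTPException, UploadFile, status

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumed limit of 50 MiB


async def _read_bounded(file: UploadFile, max_bytes: int = MAX_UPLOAD_BYTES) -> bytes:
    """Read the upload in chunks and reject it once it exceeds the limit."""
    chunks: list[bytes] = []
    total = 0
    while chunk := await file.read(64 * 1024):
        total += len(chunk)
        if total > max_bytes:
            raise HTTPException(
                status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
                detail="Uploaded audio exceeds the maximum allowed size",
            )
        chunks.append(chunk)
    return b"".join(chunks)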

### Comment 2
<location> `aymurai/api/endpoints/routers/asr/transcribe.py:80-89` </location>
<code_context>
+        return None
+
+
+async def _receive_updates(
+    websocket: websockets.ClientConnection,
+) -> WLKMessageStatus | None:
+    last_active_transcription: WLKMessageStatus | None = None
+    while True:
+        try:
+            msg = await websocket.recv()
+        except websockets.exceptions.ConnectionClosedOK:
+            logger.info("connection closed normally")
+            break
+        except websockets.exceptions.WebSocketException as exc:
+            logger.error(f"websocket error while receiving updates: {exc}")
+            break
+
+        parsed = _parse_ws_message(msg)
+        match parsed:
+            case None:
+                continue
+            case WLKMessageStatus(status="active_transcription") as message:
+                last_active_transcription = message
+            case WLKMessageReadyToStopMessage():
+                break
+
+    return last_active_transcription
+
+
</code_context>

<issue_to_address>
**issue:** Websocket receive loop may never terminate if the remote service misbehaves.

The loop only exits on specific close/exception cases or a `ready_to_stop` message. If the server stays open but stops sending messages, this coroutine will hang and `transcribe` will never return. Please add a safeguard (e.g., overall timeout via `asyncio.wait_for`, periodic inactivity timeout, or a max message count) to handle a stalled backend.
</issue_to_address>
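
A sketch of one possible safeguard, bounding each receive with an inactivity timeout; the timeout value is an assumption:

import asyncio

RECV_TIMEOUT_SECONDS = 60.0  # assumed inactivity limit


async def _recv_with_timeout(websocket) -> str | bytes | None:
    """Return the next frame, or None if the backend stays silent for too long."""
    try:
        return await asyncio.wait_for(websocket.recv(), timeout=RECV_TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        return None

`_receive_updates` could then treat a `None` return the same way it treats a closed connection and break out of the loop.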

### Comment 3
<location> `aymurai/api/endpoints/routers/asr/transcribe.py:63-72` </location>
<code_context>
+    try:
+        async with websockets.connect(settings.TRANSCRIBE_WS_URI) as websocket:
+            receive_task = asyncio.create_task(_receive_updates(websocket))
+            try:
+                total_bytes = await _stream_audio(file, websocket)
+                await websocket.send(b"")
+                logger.info(f"sent {total_bytes} bytes to transcription service")
+                last_active_transcription = await receive_task
+            except Exception:
+                if not receive_task.done():
+                    receive_task.cancel()
+                    with contextlib.suppress(asyncio.CancelledError):
+                        await receive_task
+                raise
+    except websockets.exceptions.WebSocketException as exc:
+        logger.error(f"websocket error during transcription: {exc}")
+        raise HTTPException(
+            status_code=status.HTTP_502_BAD_GATEWAY,
+            detail="Transcription service websocket error",
</code_context>

<issue_to_address>
**suggestion (bug_risk):** All non-websocket errors are mapped to a generic 500, masking client-side issues.

Exceptions from `_stream_audio` (e.g. invalid/unsupported audio format, decoding errors) are all surfaced as 5xx. Consider explicitly catching known decoding/validation errors and returning a 4xx for bad input, keeping 5xx for genuine backend or internal failures.
</issue_to_address>
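
A sketch of the suggested mapping, wrapping the PR's `_stream_audio` call; the caught exception types are assumptions and depend on the decoding backend librosa uses:

import websockets
from fastapi import HTTPException, UploadFile, status


async def _stream_audio_or_400(
    file: UploadFile, websocket: websockets.ClientConnection
) -> int:
    """Wrap _stream_audio so decoding failures surface as a 400 instead of a 500."""
    try:
        return await _stream_audio(file, websocket)
    except (ValueError, RuntimeError) as exc:
        # The exact exception types raised for undecodable input depend on
        # librosa's backend (soundfile/audioread) and may need tightening.
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail=f"Could not decode uploaded audio: {exc}",
        ) from exc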

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +38 to +39
payload = await file.read()
audio, _ = librosa.load(io.BytesIO(payload), sr=SAMPLE_RATE_HZ, mono=True)

suggestion (performance): Reading the entire file into memory can be problematic for large uploads.

This reads the entire upload into memory before decoding, which risks excessive memory use or crashes with large files. Please either enforce and validate a maximum upload size before reading, or switch to a streaming-friendly decoding approach if supported by your ASR stack.

Comment on lines +80 to +89
async def _receive_updates(
websocket: websockets.ClientConnection,
) -> WLKMessageStatus | None:
last_active_transcription: WLKMessageStatus | None = None
while True:
try:
msg = await websocket.recv()
except websockets.exceptions.ConnectionClosedOK:
logger.info("connection closed normally")
break

issue: Websocket receive loop may never terminate if the remote service misbehaves.

The loop only exits on specific close/exception cases or a ready_to_stop message. If the server stays open but stops sending messages, this coroutine will hang and transcribe will never return. Please add a safeguard (e.g., overall timeout via asyncio.wait_for, periodic inactivity timeout, or a max message count) to handle a stalled backend.

Comment on lines +63 to +72
try:
parsed = json.loads(message)
except json.JSONDecodeError:
logger.warning(f"received non-json websocket payload: {payload_preview}")
return None

try:
return ASR_RAW_RESPONSE_ADAPTER.validate_python(parsed)
except ValidationError as exc:
logger.warning(

suggestion (bug_risk): All non-websocket errors are mapped to a generic 500, masking client-side issues.

Exceptions from _stream_audio (e.g. invalid/unsupported audio format, decoding errors) are all surfaced as 5xx. Consider explicitly catching known decoding/validation errors and returning a 4xx for bad input, keeping 5xx for genuine backend or internal failures.
