Skip to content

Commit d9c4f1f

Browse files
npatel-aaiAssemblyAIclaude
authored
chore: sync sdk code with DeepLearning repo (#209) 0.64.21
Co-authored-by: AssemblyAI <engineering.sdk@assemblyai.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent aecf5d7 commit d9c4f1f

10 files changed

Lines changed: 1390 additions & 6 deletions

File tree

README.md

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -785,6 +785,70 @@ finally:
785785

786786
</details>
787787

788+
<details>
789+
<summary>Dual-channel: mic + system audio in one session</summary>
790+
791+
For note-taker apps that capture two live sources (microphone **and** system/speaker output) but want them handled as **one** streaming session — while still knowing which source each word came from — wrap the client in a `ChannelStreamer`.
792+
793+
You declare named channels and feed each channel's PCM separately. The SDK runs per-channel energy VAD, mixes the channels into a single mono stream over one websocket, and — for handlers registered on the coordinator — delivers an enriched `DualChannelTurnEvent` whose words/turn carry their originating channel (`turn.channel` and per-word `word.channel`). The base `Word` / `TurnEvent` stay unchanged, so single-stream payloads aren't affected. Attribution is fully client-side and model-agnostic, so it composes with `speaker_labels`, multilingual, and `u3-rt-pro`. It is a **separate dimension from diarization**`word.channel` (physical source) is independent of `word.speaker` (voice): two people on the same `system` channel get distinct speaker labels, while one person heard on two channels keeps a single speaker label.
794+
795+
Unlike a browser sample, the SDK does not capture audio — you supply 16-bit PCM for each channel (from `sounddevice`, `pyaudio`, a loopback device, files, …).
796+
797+
```python
798+
from assemblyai.streaming.v3 import (
799+
ChannelStreamer, StreamingClient, StreamingClientOptions,
800+
StreamingEvents, StreamingParameters,
801+
)
802+
803+
def on_turn(client, event): # event is a DualChannelTurnEvent
804+
print(f"[{event.channel}] {event.transcript}")
805+
for w in event.words:
806+
print(f" {w.text!r} -> channel={w.channel} speaker={w.speaker}")
807+
808+
client = StreamingClient(StreamingClientOptions(api_key="<YOUR_API_KEY>"))
809+
810+
# Declare the channels and the session sample rate (must be pcm_s16le).
811+
mixer = ChannelStreamer(client, channels=["mic", "system"], sample_rate=16000)
812+
# Register handlers on the mixer: Turn handlers receive the enriched event,
813+
# other events (Begin/Error/…) are forwarded to the client.
814+
mixer.on(StreamingEvents.Turn, on_turn)
815+
client.connect(StreamingParameters(
816+
sample_rate=16000, speech_model="u3-rt-pro", speaker_labels=True,
817+
))
818+
819+
# Feed each source separately — e.g. from two capture callbacks. Send
820+
# continuous PCM for every channel (silence as zeros), at the same rate.
821+
mixer.stream("mic", mic_pcm)
822+
mixer.stream("system", system_pcm)
823+
824+
mixer.flush() # push trailing buffered audio
825+
client.disconnect(terminate=True)
826+
```
827+
828+
`AsyncChannelStreamer` is the asyncio-native equivalent (`await mixer.stream(...)` / `await mixer.close_channel(...)` / `await mixer.flush()`); register handlers the same way with `mixer.on(...)`.
829+
830+
**Sources that end mid-session.** Mixing keeps channels aligned by consuming the shortest buffer, so it assumes every channel keeps delivering PCM (send silence as zeros, don't omit it). When a source genuinely ends (file EOF, screen share stopped, device removed), call `mixer.close_channel(name)` so the session degrades to the surviving channel(s) instead of stalling — the ended channel is then padded with silence.
831+
832+
**Swappable VAD.** The default detector is the built-in energy-based `EnergyVad`. Supply your own (e.g. a DNN VAD such as Silero) via `ChannelAttributionOptions.create_vad`, which is called once per channel with the channel name; subclass `VadDetector` (`process(frame) -> VadResult`, `reset()`). Pass `on_vad=callback` to observe raw per-frame activity (e.g. a live "who's talking" meter). Tune the default with `EnergyVad(threshold_ratio=3.0, noise_floor_alpha=0.05, hangover_frames=10)``threshold_ratio` below ~2 is too sensitive, above ~6 misses quiet onsets/offsets.
833+
834+
**Resolving unknown channels.** A word is `"unknown"` when no channel was clearly dominant in its window — silence, or two channels too close to call (the top must beat the runner-up by `dominance_ratio`, default 4). `ChannelAttributionOptions.resolve_unknown_channels_method` back-fills these:
835+
836+
- `"window"` (default) — from the dominant non-`"unknown"` channel among ±`resolution_window_words` neighbor words.
837+
- `"speaker-history"` — from the speaker's session-wide channel evidence (requires `speaker_labels`).
838+
- `"none"` — leave `"unknown"` as-is.
839+
840+
Back-filled words are flagged `word.channel_resolved = True`; confident per-word decisions are never overwritten. The method is validated at construction, so a typo raises immediately rather than silently disabling resolution.
841+
842+
**Caveats.**
843+
844+
- Requires 16-bit PCM (`pcm_s16le`, the default) — linear mixing is invalid for `pcm_mulaw`.
845+
- Capturing the system/speaker output is platform-specific: macOS needs a loopback driver (e.g. BlackHole); Windows uses WASAPI loopback; Linux a PulseAudio/PipeWire monitor source.
846+
- If the mic physically picks up the speakers, that bleed can pull attribution toward `mic`. Apply acoustic echo cancellation at capture (`getUserMedia({ audio: { echoCancellation: true } })` in browser front-ends, or an AEC-capable native path) — the SDK only receives already-captured PCM, so it can't apply AEC itself. Transcription quality is unaffected; only the `channel` field.
847+
848+
See [`examples/streaming_dual_channel.py`](./examples/streaming_dual_channel.py) for a complete runnable demo.
849+
850+
</details>
851+
788852
<details>
789853
<summary>Stream a local file (async)</summary>
790854

assemblyai/__version__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
__version__ = "0.64.20"
1+
__version__ = "0.64.21"

assemblyai/streaming/v3/__init__.py

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,17 @@
11
from .async_client import AsyncStreamingClient
22
from .client import StreamingClient
3+
from .extras import (
4+
AsyncChannelStreamer,
5+
ChannelAttributionOptions,
6+
ChannelStreamer,
7+
DualChannelTurnEvent,
8+
DualChannelWord,
9+
EnergyVad,
10+
VadDetector,
11+
VadFrame,
12+
VadResult,
13+
attribute_turn,
14+
)
315
from .models import (
416
BeginEvent,
517
Encoding,
@@ -27,8 +39,14 @@
2739
)
2840

2941
__all__ = [
42+
"AsyncChannelStreamer",
3043
"AsyncStreamingClient",
3144
"BeginEvent",
45+
"ChannelAttributionOptions",
46+
"ChannelStreamer",
47+
"DualChannelTurnEvent",
48+
"DualChannelWord",
49+
"EnergyVad",
3250
"Encoding",
3351
"EventMessage",
3452
"LLMGatewayResponseEvent",
@@ -50,6 +68,10 @@
5068
"StreamingSessionParameters",
5169
"TerminationEvent",
5270
"TurnEvent",
71+
"VadDetector",
72+
"VadFrame",
73+
"VadResult",
5374
"WarningEvent",
5475
"Word",
76+
"attribute_turn",
5577
]

assemblyai/streaming/v3/async_client.py

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,7 @@
3737
ErrorEvent,
3838
EventMessage,
3939
ForceEndpoint,
40+
KeepAlive,
4041
OperationMessage,
4142
StreamingClientOptions,
4243
StreamingError,
@@ -80,10 +81,10 @@ class AsyncStreamingClient(_BaseStreamingClient):
8081
8182
Behavioral notes vs. the sync ``StreamingClient``:
8283
83-
- ``stream`` / ``set_params`` / ``force_endpoint`` raise ``RuntimeError``
84-
when called before ``connect()`` — silent drop would diverge from the
85-
sync client (which buffers pre-connect data) in a way that's easy to
86-
miss. After the connection has closed, the same calls are silent
84+
- ``stream`` / ``set_params`` / ``force_endpoint`` / ``keep_alive`` raise
85+
``RuntimeError`` when called before ``connect()`` — silent drop would
86+
diverge from the sync client (which buffers pre-connect data) in a way
87+
that's easy to miss. After the connection has closed, the same calls are silent
8788
no-ops so cleanup paths don't need defensive try/except.
8889
- ``disconnect(terminate=True)`` waits at most 2.0s for the write task to
8990
drain the ``TerminateSession`` frame before forcing teardown. The sync
@@ -288,6 +289,12 @@ async def force_endpoint(self) -> None:
288289
return
289290
await write_queue.put(ForceEndpoint())
290291

292+
async def keep_alive(self) -> None:
293+
write_queue, stop_event = self._ensure_connected("keep_alive")
294+
if stop_event.is_set():
295+
return
296+
await write_queue.put(KeepAlive())
297+
291298
def _ensure_connected(
292299
self, method: str
293300
) -> "tuple[asyncio.Queue[OperationMessage], asyncio.Event]":

assemblyai/streaming/v3/client.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@
2424
ErrorEvent,
2525
EventMessage,
2626
ForceEndpoint,
27+
KeepAlive,
2728
OperationMessage,
2829
StreamingClientOptions,
2930
StreamingError,
@@ -194,6 +195,10 @@ def force_endpoint(self):
194195
message = ForceEndpoint()
195196
self._write_queue.put(message)
196197

198+
def keep_alive(self):
199+
message = KeepAlive()
200+
self._write_queue.put(message)
201+
197202
def _write_message(self) -> None:
198203
while True:
199204
if not self._websocket:

0 commit comments

Comments
 (0)