Reconstruct fragmented mic clip: fragment-aware direct capture by MaxHeimbrock · Pull Request #306 · livekit/client-sdk-unity

MaxHeimbrock · 2026-06-12T13:30:14Z

Stacked on #304. Fixes the choppy/garbled published audio with a Bluetooth HFP headset mic on macOS — by reading the audio Unity actually delivers, which turned out to be intact but scattered.

Root cause (proven by buffer inspection)

A raw WAV dump of the mic clip in the bad state showed the exact structure: FMOD writes each real 20 ms packet of clip.frequency audio, then advances Microphone.GetPosition as if it had written ~3.2× as much, zero-filling the skipped range. Concretely: valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k where k = counterRate/clip.frequency = 3.2), with exact-zero padding between them (true silence has a real noise floor, never exact zeros).

Junction analysis showed the fragments join continuously (boundary sample deltas within normal in-fragment variation) — the full stream is present, just zero-padded. Concatenating fragments reconstructed clean, correct-pitch voice (verified by ear).
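The offline reconstruction described above can be sketched in a few lines of Python. This is a hedged illustration, not the analysis script used for this PR: `find_fragments`, `reconstruct`, and the `min_gap` threshold are hypothetical names and values, and the sketch assumes the dump is already loaded as a list of floats with exact-zero padding.

```python
def find_fragments(samples, min_gap=8):
    """Return (start, length) for each run of audio separated by zero padding.

    A run of at least `min_gap` exact zeros counts as padding (assumption);
    shorter zero runs are treated as part of the surrounding audio.
    """
    fragments = []
    start = None   # start index of the fragment being scanned, if any
    zeros = 0      # length of the current run of exact zeros
    for i, s in enumerate(samples):
        if s != 0.0:
            if start is None:
                start = i
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= min_gap:
                end = i - zeros + 1          # index of the first padding zero
                fragments.append((start, end - start))
                start = None
                zeros = 0
    if start is not None:                    # buffer ended inside a fragment
        fragments.append((start, len(samples) - zeros - start))
    return fragments


def reconstruct(samples, fragments):
    """Concatenate the valid fragments, dropping the padding between them."""
    out = []
    for start, length in fragments:
        out.extend(samples[start:start + length])
    return out
```

On a buffer laid out as reported (320 valid samples per 1024-sample stride), `find_fragments` should return starts 1024 apart with length 320, and `reconstruct` should return the contiguous stream.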

This explains every prior symptom:

plain playback → 31% voice + 69% silence = chop
counter-paced reading → fragments + padding fast over a live buffer = noise with echo
pitch servo (either model) → cannot help; the gaps are in the data layout, not the timing

Change

MicrophoneSource now does fragment-aware direct capture:

Reads the clip ring buffer directly (no AudioSource, no OnAudioFilterRead) — also decouples capture from the output device's clock.
A ~0.3 s pre-roll measures k (counter rate ÷ clip.frequency) and the counter's smallest discrete jump (the stride J).
k ≈ 1 (healthy devices): plain contiguous read at the counter's pace.
k > 1.05 (this state): read only the first J/k samples of each stride — exactly the valid fragments — skipping the padding.
Downmix → mono, resample clip.frequency → fixed 48 kHz native source (streaming linear; resampler state carries across fragments since junctions are continuous).
Backlog beyond 200 ms after a stall is dropped (stride-aligned) so the native queue can't overrun.
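The fragment read and the stride-aligned backlog drop from the list above can be sketched as follows. This is Python for illustration only; the actual MicrophoneSource code is not shown on this page. The function names are hypothetical, and the sketch assumes the reader is stride-aligned, the clip length is a multiple of the stride, and the 200 ms limit is measured in real audio time (so it is scaled by k when expressed in counter samples).

```python
def read_fragments(ring, read_pos, write_pos, stride, k):
    """Copy the valid head of every complete stride between reader and writer.

    Returns (samples, new_read_pos). Assumes read_pos is stride-aligned and
    len(ring) is a multiple of stride (simplifying assumptions).
    """
    n = len(ring)
    valid = round(stride / k)                 # e.g. 1024 / 3.2 -> 320
    available = (write_pos - read_pos) % n    # counter samples not yet consumed
    out = []
    while available >= stride:
        # Keep the first `valid` samples of the stride, skip the zero padding.
        out.extend(ring[(read_pos + i) % n] for i in range(valid))
        read_pos = (read_pos + stride) % n
        available -= stride
    return out, read_pos


def drop_backlog(read_pos, write_pos, stride, k, clip_len, clip_hz,
                 max_backlog_s=0.2):
    """Skip whole strides so at most max_backlog_s of audio stays queued."""
    available = (write_pos - read_pos) % clip_len
    # Convert the time limit to counter samples: real samples times k.
    max_counter = round(max_backlog_s * clip_hz * k)
    if available <= max_counter:
        return read_pos
    excess = available - max_counter
    skip = -(-excess // stride) * stride      # round up to whole strides
    return (read_pos + skip) % clip_len
```

Dropping whole strides keeps the reader on a fragment boundary, so the next read still starts at valid audio rather than in the middle of the padding.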

Expected log in the bad state:

MicrophoneSource: fragmented clip detected (k=3.20); reading 320 of every 1024 samples at 16000Hz

Healthy devices log contiguous capture (k=1.00).

Verification

Runtime compiles clean.
Buffer-dump analysis (fragment sizes/strides, junction continuity, reconstruction) done offline on a captured WAV; reconstruction validated by ear.
To validate end-to-end: mac publisher with BT headset mic → Android receiver; expect clean, correct-pitch, non-choppy audio. Built-in mic should behave unchanged.

History

This branch went through two falsified designs first — a pitch servo at the counter ratio (garbled: the counter doesn't describe the data) and a k-rescaled lag servo at pitch 1 (perfect telemetry, still choppy: the gaps are in the buffer itself). The WAV dump diagnostic settled it. Commits preserved for the record.

🤖 Generated with Claude Code

The mic clip is filled by the capture device's clock while the AudioSource that plays it (feeding OnAudioFilterRead) runs on the output device's clock. Some devices also misreport the clip rate entirely: a Bluetooth headset on macOS labels its clip 16kHz while filling it at ~51kHz. Either way the read head drifts against the write head and gets lapped, which sounds like periodic chopping.

Add a pacing servo that measures how fast the write head actually advances (GetPosition delta over wall clock, steady within ±0.1% even when the instantaneous position is jumpy) and continuously adjusts AudioSource.pitch so the read head consumes clip samples at the same rate, holding a fixed lag behind the writer. A short pre-roll measures the rate before playback starts so the initial pitch is already correct; the fill-rate estimate and the lag target (sized to ~4x observed jitter, bounded by clip capacity) keep adapting while capturing, and an out-of-bounds resync recovers from long hitches.

In the normal case the measured rate matches clip.frequency, pitch hovers at ~1.0, and the servo is effectively a no-op. In the misreporting case pitch settles at the true ratio (~3.2), which plays the clip's real-time data at correct speed and eliminates the chop. Pitch is rate control, not a delay: the added latency is only the held lag (~80-150ms, adaptive).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
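The rate measurement this commit relies on (GetPosition delta over wall clock) can be sketched offline as below. It is a Python illustration under stated assumptions, not the servo code: `estimate_counter` is a hypothetical helper, `polls` stands for (time in seconds, counter position) pairs sampled during the pre-roll, and the counter is assumed to advance by less than one full clip length between polls. The same data also yields the counter's smallest non-zero jump, which the PR description calls the stride J.

```python
def estimate_counter(polls, clip_len, clip_hz):
    """Estimate counter rate, inflation factor k, and smallest jump.

    polls: list of (time_s, position) pairs, oldest first.
    Returns (rate_hz, k, smallest_jump); smallest_jump is None if the
    counter never moved.
    """
    advanced = 0
    smallest_jump = None
    for (_, p0), (_, p1) in zip(polls, polls[1:]):
        delta = (p1 - p0) % clip_len          # unwrap ring-buffer wraparound
        advanced += delta
        if delta > 0 and (smallest_jump is None or delta < smallest_jump):
            smallest_jump = delta
    elapsed = polls[-1][0] - polls[0][0]
    rate = advanced / elapsed                 # counter samples per second
    return rate, rate / clip_hz, smallest_jump
```

With a counter that jumps 1024 samples every 20 ms against a clip labeled 16 kHz, this gives a rate near 51200 Hz and k near 3.2, matching the figures reported in this PR.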

…head

Field test falsified the previous model: with pitch set to the measured counter ratio (3.2), the published audio became garbled repeats ("noise with echo"), while the servo's own lag telemetry stayed perfectly stable, because it was measuring against the same lying counter. Combined with earlier results (1x playback yields correct-pitch voice; reading at the counter's pace yields noise), the consistent model is:

- The clip DATA genuinely is at clip.frequency (16kHz here).
- Microphone.GetPosition's counter is inflated ~3.2x on macOS + BT-HFP; it does not describe the data.

The choppiness on the plain path is the read head colliding with the bursty real write head due to a small, unmanaged startup lag, not a rate mismatch.

Rework the servo accordingly: pitch stays pinned near 1.0 (max ±3% trim). The counter is used only after rescaling by its measured inflation factor k = counterRate / clip.frequency (~1 on healthy devices) to estimate the real write head, and the servo holds the read head a generous adaptive lag (150ms default) behind that estimate. Clip buffer extended to 2s for more collision headroom.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The reworked servo's telemetry is perfect in the bad state (k=3.20, pitch~1.00, lag locked on target, jitter ~0, no resyncs) yet the published audio still chops like the unpaced path. That falsifies the read/write collision model: the reader is provably never near the writer.

Remaining hypothesis: the chop is baked into the clip data itself. FMOD scatters the real 16kHz samples at the inflated counter's positions, leaving stale regions between fragments (~31% fresh per cycle). That would also explain why counter-paced reading sounds like noise with echo (fragments + stale older audio, fast).

Snapshot the raw clip to a WAV 4s after capture starts (editor-only) so the buffer contents can be inspected directly: contiguous voice means the chop is downstream and still fixable; fragmented voice means capture data is destroyed at write time and the Unity Microphone path cannot work for this device.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
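As a rough idea of what inspecting such a dump involves, here is a Python sketch using the standard `wave` and `struct` modules. It is not the editor-only dump utility from this commit: `write_wav_mono16` and `zero_fraction` are hypothetical helpers, and the sketch assumes a mono, 16-bit PCM file.

```python
import struct
import wave


def write_wav_mono16(path, samples, sample_rate):
    """Write float samples in [-1, 1] as a mono 16-bit PCM WAV file."""
    frames = b"".join(
        struct.pack("<h", max(-32768, min(32767, round(s * 32767))))
        for s in samples
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 2 bytes per sample = 16-bit
        w.setframerate(sample_rate)
        w.writeframes(frames)


def zero_fraction(path):
    """Return the share of samples that are exactly zero (mono 16-bit only)."""
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    values = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return sum(1 for v in values if v == 0) / len(values)
```

A large share of exact zeros is a hint that the buffer contains padding rather than recorded silence, since recorded silence normally carries a noise floor.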

A raw dump of the mic clip in the macOS + Bluetooth HFP state revealed the true buffer structure: FMOD writes each real 20ms packet of clip.frequency audio, then advances the position counter as if it had written k (~3.2x) as much and zero-fills the skipped range. The buffer holds valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k), and the fragments join continuously (junction sample deltas within normal in-fragment variation), i.e. the full audio stream is present, just zero-padded.

Concatenating the fragments reconstructed clean, correct-pitch voice (verified by ear), which also explains every earlier symptom: plain playback = 31% voice + 69% silence (chop); counter-paced reading = fragments and padding played fast over a live buffer (noise with echo).

Replace the pitch-servo playback approach with fragment-aware direct capture:

- Read the clip ring buffer directly (no AudioSource, no OnAudioFilterRead), which also decouples capture from the output device's clock.
- Pre-roll measures the counter rate (k = counterRate / clip.frequency) and the counter's smallest discrete jump (the stride J).
- k ~ 1: plain contiguous read at the counter's pace (healthy devices).
- k > 1.05: read only the first J/k samples of each stride (exactly the valid fragments), skipping the zero padding.
- Downmix to mono and resample clip.frequency -> 48kHz (streaming linear; state carries across fragments since their junctions are continuous), into a native source fixed at 48kHz mono.
- Backlog beyond 200ms after a stall is dropped, stride-aligned, to avoid overrunning the native queue.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
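The streaming linear resampling step, with state carried across fragments, might look like the following Python sketch. The class name and structure are illustrative assumptions, not the code in this commit. It keeps the last input sample and an integer phase between calls, so feeding the stream in fragments gives the same output as feeding it in one piece.

```python
class StreamingLinearResampler:
    """Linear-interpolation resampler that carries state across chunks."""

    def __init__(self, src_hz, dst_hz):
        self.src_hz = src_hz
        self.dst_hz = dst_hz
        self.phase = 0      # read position, in units of 1/dst_hz source samples
        self.prev = None    # last sample of the previous chunk, if any

    def process(self, chunk):
        if not chunk:
            return []
        # Prepend the carried sample so interpolation spans the chunk junction.
        buf = list(chunk) if self.prev is None else [self.prev] + list(chunk)
        last = len(buf) - 1
        out = []
        while self.phase < last * self.dst_hz:
            i, rem = divmod(self.phase, self.dst_hz)
            frac = rem / self.dst_hz
            out.append(buf[i] + (buf[i + 1] - buf[i]) * frac)
            self.phase += self.src_hz         # advance by one output sample
        # Re-base the phase so index 0 is the sample carried to the next call.
        self.phase -= last * self.dst_hz
        self.prev = buf[-1]
        return out
```

The integer phase avoids drift from accumulating a fractional step such as 16000/48000 in floating point. The final input sample itself is only emitted once a following chunk arrives, which adds at most one input sample of delay.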

MaxHeimbrock and others added 4 commits June 12, 2026 15:29

MaxHeimbrock changed the title ~~Adaptive pitch servo: lock mic playback to the measured capture rate~~ Reconstruct fragmented mic clip: fragment-aware direct capture Jun 12, 2026

This was referenced Jun 12, 2026

Add AudioClipDump debugging utility (dump audio buffers to WAV) #307

Open

Fix microphone capture: device-true source format + fragment-aware clip reading #308

Open

MaxHeimbrock wants to merge 4 commits into max/mic-samplerate-device-init from max/mic-pitch-servo
