Fix microphone capture: device-true source format + fragment-aware clip reading #308
Open
MaxHeimbrock wants to merge 5 commits into
…ip reading

Publishing the microphone with a Bluetooth HFP headset on macOS produced "sample_rate and num_channels don't match" errors from the native source and, beyond that, persistently choppy or garbled audio on receivers. Two root causes, both fixed here:

1. The native (Rust) audio source was created with a hardcoded format (48000Hz/2ch) while captured frames arrive in whatever format the device actually delivers. The native source rejects mismatched frames (it does not resample). RtcAudioSource now has two constructors: a device-mode one that resolves the format from Unity's output configuration, and an explicit-format one for sources that know their exact rate/channels. Frames that still mismatch are dropped with a throttled warning instead of erroring natively.

2. On macOS with a Bluetooth HFP headset, Unity's Microphone clip buffer is fragmented: FMOD writes each real 20ms packet of clip.frequency audio, then advances Microphone.GetPosition as if it had written ~3.2x as much, zero-filling the skipped range. A raw buffer dump showed valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k, where k is the counter inflation factor), with the fragments joining continuously - the stream is intact, just scattered. Every playback-based capture strategy therefore chops (31% voice, 69% padding), and counter-paced reading garbles.

MicrophoneSource now reads the clip ring buffer directly (no AudioSource, no OnAudioFilterRead - which also decouples capture from the output device's clock). A short pre-roll measures the counter rate (k = counterRate / clip.frequency) and the counter's smallest discrete jump (the stride). Healthy devices (k ~ 1) use a plain contiguous read; fragmented devices (k > 1.05) read only the first stride/k samples of each stride - exactly the valid fragments. Captured audio is downmixed to mono and resampled from clip.frequency to a fixed 48kHz native source, preserving the publish-before-start contract. Backlog beyond 200ms after a stall is dropped, stride-aligned, to avoid overrunning the native queue.

Also removes the redundant Microphone.Start in the Meet sample and lets the test sine source declare its exact format explicitly.

Verified end-to-end: macOS publisher with the Bluetooth headset microphone to an Android receiver now sounds clean and correct-pitch; healthy microphones take the contiguous path unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
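The fragmented layout described in this commit can be illustrated with a minimal Python sketch (not the C# code in this PR). It builds a synthetic buffer with the dumped structure, 320 valid samples per 1024-sample stride, and recovers the stream by keeping only the first stride/k samples of each stride. The function names are hypothetical, and the sketch assumes fragments start exactly on stride boundaries; it ignores ring wrap and backlog dropping.

```python
def make_fragmented(signal, stride=1024, valid=320):
    """Scatter `signal` into a buffer: `valid` real samples, then zero padding, per stride."""
    buf = []
    for start in range(0, len(signal), valid):
        chunk = signal[start:start + valid]
        buf.extend(chunk)
        buf.extend([0.0] * (stride - len(chunk)))  # zero-filled gap
    return buf


def read_fragments(buf, stride, k):
    """Keep only the first stride/k samples of each stride (the valid fragment)."""
    valid = round(stride / k)
    out = []
    for start in range(0, len(buf), stride):
        out.extend(buf[start:start + valid])
    return out


# Nonzero ramp so real samples are distinguishable from zero padding.
signal = [float(i + 1) for i in range(320 * 5)]
buf = make_fragmented(signal)
restored = read_fragments(buf, stride=1024, k=3.2)

# Share of the buffer that is real audio: 320/1024 = 0.3125, the "31% voice" figure.
valid_fraction = len(restored) / len(buf)
```

Playing `buf` back directly would reproduce the padding as silence between fragments, while `restored` is the original stream with no gaps or repeats.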
The fragment-aware capture logic is subtle and was painful to diagnose, but most of it is pure logic that doesn't need a microphone. Extract it from MicrophoneSource into two UnityEngine-free internal classes:

- MicClipReader: pre-roll measurement (counter rate k, smallest jump = stride), contiguous vs fragmented mode selection, per-stride valid-range emission, ring-wrap splitting, and stride-aligned backlog dropping.
- StreamingResampler: the streaming linear resampler (state carries across chunks so fragment junctions stay continuous).

MicrophoneSource.CaptureLoop becomes a thin Unity shell: poll GetPosition, feed the reader, GetData the emitted ranges, downmix, resample, push. Behavior is unchanged.

Add EditMode tests covering:

- healthy contiguous capture (k~1, every sample emitted)
- fragmented detection (k=3.2, stride 1024, valid 320 - the exact structure dumped from the Sony MDR-1000X on macOS)
- lossless reconstruction of a synthetic fragmented buffer across multiple ring laps (strictly sequential output, no gaps/repeats/padding)
- stride-aligned backlog drops bounded by the limit
- pre-roll emitting nothing
- resampler frequency/length preservation
- chunked-equals-whole resampling (1-sample tail tolerance for float boundary rounding)

Logic verified by executing all test scenarios in a standalone harness (mono) in addition to compiling the Unity assemblies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
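The "state carries across chunks" property of a streaming linear resampler can be sketched in Python as follows. This is an illustration under assumptions, not the StreamingResampler from this PR: the class name and structure are hypothetical, and the read position is tracked as an exact Fraction (a choice made here so that chunked and whole-buffer output match exactly, whereas a float accumulator would need the tail tolerance the tests mention).

```python
import math
from fractions import Fraction


class StreamingLinearResampler:
    """Linear-interpolating resampler whose state survives chunk boundaries."""

    def __init__(self, src_rate, dst_rate):
        self.step = Fraction(src_rate, dst_rate)  # input samples per output sample
        self.pos = Fraction(0)  # read position relative to the current chunk start
        self.prev = 0.0         # last sample of the previous chunk (index -1)

    def process(self, chunk):
        out = []
        n = len(chunk)
        if n == 0:
            return out
        while True:
            i = math.floor(self.pos)
            if i + 1 >= n:      # need sample i+1 to interpolate; wait for next chunk
                break
            frac = float(self.pos - i)
            a = chunk[i] if i >= 0 else self.prev
            b = chunk[i + 1]
            out.append(a + (b - a) * frac)
            self.pos += self.step
        self.prev = chunk[-1]
        self.pos -= n           # rebase the position onto the next chunk
        return out


# 16 kHz input (one 20 ms packet = 320 samples) upsampled to 48 kHz.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]

whole = StreamingLinearResampler(16000, 48000).process(signal)

chunked_resampler = StreamingLinearResampler(16000, 48000)
chunked = []
for start in range(0, len(signal), 320):
    chunked.extend(chunked_resampler.process(signal[start:start + 320]))
```

Because the position and the previous sample carry over, an output sample that falls between the last sample of one chunk and the first of the next is interpolated across the junction instead of being dropped or restarted.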
Added test coverage for the capture logic (35031cc): the fragment reconstruction and resampling logic is now extracted into UnityEngine-free internal classes ( |
MicrophoneSource no longer attaches an AudioSource to its GameObject (it reads the mic clip directly), but the Meet sample still called GetComponent<AudioSource>()?.Stop() on unpublish. The ?. operator bypasses Unity's overloaded null-check on the editor's missing-component stub, so Stop() ran on the stub and threw MissingComponentException. Remove the obsolete call. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…eanup

Field testing device transitions surfaced a false positive: right after recovering onto the healthy MacBook microphone, the pre-roll measured k=1.07 (counter startup burst while driver buffers flush), which crossed the old 1.05 threshold and engaged fragmented mode - silently discarding ~6% of real audio (heard as choppiness) until the next re-measurement.

Engaging fragmented mode discards (stride - valid) samples per stride, so a false positive guarantees audio loss while a false negative only risks mild artifacts. Fix both sides of the measurement:

- Raise the fragmented threshold from 1.05 to 1.5: the observed pathological device measures k=3.2, healthy devices ~1.0 plus a few percent of noise - keep a wide margin between the two.
- Add a 100ms settle window that discards the counter's startup burst before the rate measurement begins.

Add a regression test for the borderline case (k=1.07 must stay contiguous).

Also fix the second AudioSource null-propagation site (CleanUpAllTracks via OnDestroy) with TryGetComponent - same MissingComponentException class as the unpublish path, hit because the local mic object no longer carries an AudioSource.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
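The effect of the settle window and the raised threshold can be sketched in Python. This is a hypothetical model, not MicClipReader: `measure` and `discarded_fraction` are illustrative names, the polls are synthetic (20 ms apart, counter assumed unwrapped and monotonic), and the startup burst size is chosen so the unsettled measurement lands on the k=1.07 reported above.

```python
def measure(polls, clip_frequency, settle_ms=100, threshold=1.5):
    """polls: list of (time_ms, counter). Returns (k, stride, mode)."""
    t0 = polls[0][0]
    # Discard the startup burst: only polls after the settle window count.
    settled = [(t, c) for t, c in polls if t - t0 >= settle_ms]
    (ta, ca), (tb, cb) = settled[0], settled[-1]
    # k = counterRate / clip.frequency, computed with a single division.
    k = (cb - ca) * 1000 / ((tb - ta) * clip_frequency)
    # Stride = smallest positive jump of the counter between polls.
    stride = min(c2 - c1 for (_, c1), (_, c2) in zip(settled, settled[1:]) if c2 > c1)
    mode = "fragmented" if k > threshold else "contiguous"
    return k, stride, mode


def discarded_fraction(k):
    """Share of samples dropped once fragmented mode keeps stride/k per stride."""
    return 1 - 1 / k


# Fragmented device: 16 kHz clip, counter advances 1024 per 20 ms packet.
frag_polls = [(20 * n, 1024 * n) for n in range(21)]
k_frag, stride_frag, mode_frag = measure(frag_polls, 16000)

# Healthy 48 kHz device with a one-off startup burst of 1344 samples.
healthy_polls = [(0, 0)] + [(20 * n, 960 * n + 1344) for n in range(1, 21)]
k_old, _, mode_old = measure(healthy_polls, 48000, settle_ms=0, threshold=1.05)
k_new, _, mode_new = measure(healthy_polls, 48000)

loss_if_false_positive = discarded_fraction(k_old)  # roughly 6.5% of real audio
```

With the old settings the burst inflates k past 1.05 and the healthy device is misclassified as fragmented; with the settle window the same polls measure k = 1.0, and the 1.5 threshold leaves a wide margin below the k = 3.2 of the fragmented device.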
Generated by the editor; required for stable GUIDs when the package is imported. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Consolidated, cleaned-up version of the microphone capture fix (supersedes #303, #304, #305, #306 — see History).
Problem
Publishing the microphone with a Bluetooth HFP headset on macOS produced "sample_rate and num_channels don't match" errors from the native source and, beyond the error, persistently choppy or garbled audio on receivers. The same headset works fine from Android.
Root causes (both proven empirically)
1. Hardcoded native source format. The native (Rust) audio source was created at a fixed 48000 Hz / 2ch, while captured frames arrive at whatever Unity actually delivers. The Rust source rejects mismatched frames; it does not resample.
2. Fragmented mic clip buffer on macOS + BT-HFP. A raw WAV dump of the clip's ring buffer (analyzed offline) showed: FMOD writes each real 20 ms packet of clip.frequency audio, then advances Microphone.GetPosition as if it had written k ≈ 3.2× as much, zero-filling the gap. The buffer holds valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k), and the fragments join continuously: the stream is intact, just scattered with zero padding. Hence:
- Playback-based capture (AudioSource + OnAudioFilterRead) plays 31% voice + 69% padding → chop
- Reading at GetPosition's pace replays fragments + padding too fast over a live buffer → garbled noise
Change
- RtcAudioSource: two constructors, device-mode (format resolved from Unity's output configuration) and explicit-format (type, sampleRate, channels) for sources that know their exact rate. Frames that still mismatch the configured format are dropped with a throttled warning instead of erroring natively.
- MicrophoneSource: reads the clip ring buffer directly (no AudioSource, no OnAudioFilterRead, which also decouples capture from the output device's clock for all devices). A ~0.3 s pre-roll measures k = counterRate / clip.frequency and the counter's smallest discrete jump (the stride J):
  - k ≈ 1 (healthy devices): plain contiguous read at the counter's pace.
  - k > 1.05 (fragmented state): read only the first J/k samples of each stride, which are exactly the valid fragments.
  Captured audio is downmixed to mono and resampled clip.frequency → fixed 48 kHz native source (preserves the publish-before-start contract). Stall backlog beyond 200 ms is dropped stride-aligned so the native queue can't overrun.
- BasicAudioSource uses device-mode (drops its unused channels parameter, a minor source-breaking change); the test SineWaveAudioSource declares its exact format; the Meet sample drops a redundant Microphone.Start.
Log signature in the bad state:
Healthy devices log "contiguous capture (k=1.00)".
Verification
k≈1.0, contiguous path).
Notes
- Microphone.GetPosition's counter is packet-granular and can be rate-inflated on macOS; this PR never trusts it directly, only its measured average and jump size. The dump utility used for the diagnosis is PR "Add AudioClipDump debugging utility (dump audio buffers to WAV)" #307.
- The k measurement doubles as a detector for surfacing degraded-mic states in the future.
History
The investigation went through several falsified designs, preserved on their branches: #303 (recreate + republish on mismatch), #304 (device-config init + output-rate mic open), #305 (naive direct polling — three experiments), #306 (pitch servos, then the working fragment-aware capture that this PR consolidates). The buffer dump was the decisive step.
🤖 Generated with Claude Code