
Fix microphone capture: device-true source format + fragment-aware clip reading #308

Open
MaxHeimbrock wants to merge 5 commits into main from max/mic-fragment-aware-capture

@MaxHeimbrock

Consolidated, cleaned-up version of the microphone capture fix (supersedes #303, #304, #305, #306 — see History).

Problem

Publishing the microphone with a Bluetooth HFP headset on macOS produced "sample_rate and num_channels don't match" errors from the native source and, beyond the error, persistently choppy or garbled audio on receivers. The same headset works fine from Android.

Root causes (both proven empirically)

1. Hardcoded native source format. The native (Rust) audio source was created at a fixed 48000 Hz / 2ch, while captured frames arrive at whatever Unity actually delivers. The Rust source rejects mismatched frames; it does not resample.

2. Fragmented mic clip buffer on macOS + BT-HFP. A raw WAV dump of the clip's ring buffer (analyzed offline) showed: FMOD writes each real 20 ms packet of clip.frequency audio, then advances Microphone.GetPosition as if it had written k ≈ 3.2× as much, zero-filling the gap. The buffer holds valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k), and the fragments join continuously — the stream is intact, just scattered with zero padding. Hence:

  • any playback-based capture (AudioSource + OnAudioFilterRead) plays 31% voice + 69% padding → chop
  • reading at GetPosition's pace replays fragments + padding too fast over a live buffer → garbled noise
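
The figures above follow from a few lines of arithmetic. A minimal sketch, using only the numbers reported in this PR (16 kHz clip, 20 ms packets, stride 1024); the function and key names are hypothetical, for illustration only:

```python
# Numbers reported in this PR for the macOS + BT-HFP case.
CLIP_FREQUENCY_HZ = 16_000   # clip.frequency from the log signature
PACKET_MS = 20               # one real audio packet
STRIDE_SAMPLES = 1024        # counter advance per packet, from the buffer dump


def fragment_geometry(clip_hz: int, packet_ms: int, stride: int) -> dict:
    """Derive the fragment layout from the clip rate, packet length, and stride."""
    valid = clip_hz * packet_ms // 1000   # real samples written per packet
    k = stride / valid                    # counter inflation factor
    return {
        "valid_samples": valid,
        "k": k,
        "counter_rate_hz": clip_hz * k,   # apparent rate of the position counter
        "voice_fraction": valid / stride, # share of the buffer that is real audio
    }


geometry = fragment_geometry(CLIP_FREQUENCY_HZ, PACKET_MS, STRIDE_SAMPLES)
```

Here `valid_samples` comes out as 320, `k` as 3.2, the counter rate as roughly 51.2 k/s, and the voice fraction as 0.3125, which is the "31% voice" share quoted above.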

Change

  • RtcAudioSource: two constructors — device-mode (format resolved from Unity's output configuration) and explicit-format (type, sampleRate, channels) for sources that know their exact rate. Frames that still mismatch the configured format are dropped with a throttled warning instead of erroring natively.
  • MicrophoneSource: reads the clip ring buffer directly (no AudioSource, no OnAudioFilterRead — which also decouples capture from the output device's clock for all devices). A ~0.3 s pre-roll measures k = counterRate / clip.frequency and the counter's smallest discrete jump (the stride J):
    • k ≈ 1 (healthy devices): plain contiguous read at the counter's pace.
    • k > 1.05 (fragmented state): read only the first J/k samples of each stride — exactly the valid fragments.
    • Downmix → mono, streaming-resample clip.frequency → fixed 48 kHz native source (preserves the publish-before-start contract). Stall backlog beyond 200 ms is dropped stride-aligned so the native queue can't overrun.
  • BasicAudioSource uses device-mode (drops its unused channels parameter — minor source-breaking change); the test SineWaveAudioSource declares its exact format; the Meet sample drops a redundant Microphone.Start.
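
The fragmented read path can be illustrated on a synthetic buffer. This is a sketch of the idea, not the code in this PR: the helper names are hypothetical, the buffer is a flat list rather than a ring (no wrap handling), and the layout simply mirrors the dumped structure (valid samples followed by zero padding up to the stride).

```python
from typing import List


def write_fragmented(stream: List[float], valid: int, stride: int) -> List[float]:
    """Lay out `stream` as in the dump: `valid` real samples, then zeros up to `stride`."""
    buffer: List[float] = []
    for start in range(0, len(stream), valid):
        fragment = stream[start:start + valid]
        buffer.extend(fragment)
        buffer.extend([0.0] * (stride - len(fragment)))
    return buffer


def read_valid_fragments(buffer: List[float], stride: int, k: float) -> List[float]:
    """Keep only the first stride/k samples of each stride and skip the padding."""
    valid = round(stride / k)
    out: List[float] = []
    for start in range(0, len(buffer) - stride + 1, stride):
        out.extend(buffer[start:start + valid])
    return out


# Three 20 ms packets of non-zero samples, so padding cannot be mistaken for audio.
stream = [float(n) for n in range(1, 961)]
buffer = write_fragmented(stream, valid=320, stride=1024)
reconstructed = read_valid_fragments(buffer, stride=1024, k=3.2)
```

With these inputs the buffer is 3072 samples long and `reconstructed` is identical to `stream`, which matches the observation that the fragments join continuously.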

Log signature in the bad state:

MicrophoneSource: fragmented clip detected (k=3.20); reading 320 of every 1024 samples at 16000Hz

Healthy devices log contiguous capture (k=1.00).

Verification

  • End-to-end, on hardware: macOS publisher with the Bluetooth headset mic → Android receiver now sounds clean, correct-pitch, and continuous (previously chopped/garbled). Reconstruction was first validated offline by dumping the buffer, concatenating fragments, and listening.
  • Runtime, PlayModeTests, and Meet Assembly-CSharp compile clean.
  • Recommended before merge: a PlayMode E2E run against a dev server and a quick healthy-mic check (expect k≈1.0, contiguous path).

Notes

  • Microphone.GetPosition's counter is packet-granular and can be rate-inflated on macOS; this PR never trusts it directly, only its measured average and jump size. The dump utility used for the diagnosis is in PR #307 (Add AudioClipDump debugging utility (dump audio buffers to WAV)).
  • This is arguably a Unity bug worth reporting upstream (clip labeled 16 kHz, position counter at ~51 k/s, zero-padded fragment writes).
  • Platform Audio (native ADM capture) remains the preferred path where applications can use it; the k measurement doubles as a detector for surfacing degraded-mic states in the future.

History

The investigation went through several falsified designs, preserved on their branches: #303 (recreate + republish on mismatch), #304 (device-config init + output-rate mic open), #305 (naive direct polling — three experiments), #306 (pitch servos, then the working fragment-aware capture that this PR consolidates). The buffer dump was the decisive step.

🤖 Generated with Claude Code

MaxHeimbrock and others added 2 commits June 12, 2026 16:16
…ip reading

Publishing the microphone with a Bluetooth HFP headset on macOS produced
"sample_rate and num_channels don't match" errors from the native source and,
beyond that, persistently choppy or garbled audio on receivers.

Two root causes, both fixed here:

1. The native (Rust) audio source was created with a hardcoded format
   (48000Hz/2ch) while captured frames arrive at whatever format the device
   actually delivers. The native source rejects mismatched frames (it does
   not resample). RtcAudioSource now has two constructors: a device-mode one
   that resolves the format from Unity's output configuration, and an
   explicit-format one for sources that know their exact rate/channels.
   Frames that still mismatch are dropped with a throttled warning instead of
   erroring natively.

2. On macOS with a Bluetooth HFP headset, Unity's Microphone clip buffer is
   fragmented: FMOD writes each real 20ms packet of clip.frequency audio,
   then advances Microphone.GetPosition as if it had written ~3.2x as much,
   zero-filling the skipped range. A raw buffer dump showed valid fragments
   of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k, where k is the
   counter inflation), with the fragments joining continuously - the stream
   is intact, just scattered. Every playback-based capture strategy therefore
   chops (31% voice, 69% padding) and counter-paced reading garbles.

   MicrophoneSource now reads the clip ring buffer directly (no AudioSource,
   no OnAudioFilterRead - which also decouples capture from the output
   device's clock). A short pre-roll measures the counter rate
   (k = counterRate / clip.frequency) and the counter's smallest discrete
   jump (the stride). Healthy devices (k ~ 1) use a plain contiguous read;
   fragmented devices (k > 1.05) read only the first stride/k samples of
   each stride - exactly the valid fragments. Captured audio is downmixed to
   mono and resampled from clip.frequency to a fixed 48kHz native source,
   preserving the publish-before-start contract. Backlog beyond 200ms after
   a stall is dropped, stride-aligned, to avoid overrunning the native queue.

Also removes the redundant Microphone.Start in the Meet sample and lets the
test sine source declare its exact format explicitly.

Verified end-to-end: macOS publisher with the Bluetooth headset microphone to
an Android receiver now sounds clean and correct-pitch; healthy microphones
take the contiguous path unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The fragment-aware capture logic is subtle and was painful to diagnose, but
most of it is pure logic that doesn't need a microphone. Extract it from
MicrophoneSource into two UnityEngine-free internal classes:

- MicClipReader: pre-roll measurement (counter rate k, smallest jump =
  stride), contiguous vs fragmented mode selection, per-stride valid-range
  emission, ring-wrap splitting, and stride-aligned backlog dropping.
- StreamingResampler: the streaming linear resampler (state carries across
  chunks so fragment junctions stay continuous).

MicrophoneSource.CaptureLoop becomes a thin Unity shell: poll GetPosition,
feed the reader, GetData the emitted ranges, downmix, resample, push.
Behavior is unchanged.
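
For readers unfamiliar with the technique, a streaming linear resampler can be sketched as below. This is an illustrative Python version under stated assumptions, not the StreamingResampler from this commit: the class layout and names are hypothetical, and the demo uses 24 kHz to 48 kHz so that the step (0.5) is exact in floating point and chunked output matches whole-buffer output exactly.

```python
from typing import List


class LinearStreamingResampler:
    """Linear interpolation resampler whose read position survives across chunks."""

    def __init__(self, src_hz: int, dst_hz: int) -> None:
        self._step = src_hz / dst_hz     # input samples advanced per output sample
        self._pos = 0.0                  # fractional read position into _pending
        self._pending: List[float] = []  # input samples not yet fully consumed

    def process(self, chunk: List[float]) -> List[float]:
        self._pending.extend(chunk)
        out: List[float] = []
        # Emit while both interpolation neighbours are available.
        while int(self._pos) + 1 < len(self._pending):
            i = int(self._pos)
            frac = self._pos - i
            a, b = self._pending[i], self._pending[i + 1]
            out.append(a + (b - a) * frac)
            self._pos += self._step
        # Drop consumed input but keep the last neighbour for the next chunk.
        consumed = min(int(self._pos), len(self._pending))
        del self._pending[:consumed]
        self._pos -= consumed
        return out


samples = [float(n) for n in range(10)]

whole = LinearStreamingResampler(24_000, 48_000).process(samples)

chunked_resampler = LinearStreamingResampler(24_000, 48_000)
chunked: List[float] = []
for part in (samples[:4], samples[4:7], samples[7:]):
    chunked.extend(chunked_resampler.process(part))
```

Because the position and the trailing input sample carry over between calls, `chunked` equals `whole`, so chunk boundaries (fragment junctions here) introduce no discontinuity. With inexact steps such as 16 kHz to 48 kHz, a one-sample difference at the tail is possible, as the test tolerance below acknowledges.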

Add EditMode tests covering: healthy contiguous capture (k~1, every sample
emitted), fragmented detection (k=3.2, stride 1024, valid 320 - the exact
structure dumped from the Sony MDR-1000X on macOS), lossless reconstruction
of a synthetic fragmented buffer across multiple ring laps (strictly
sequential output, no gaps/repeats/padding), stride-aligned backlog drops
bounded by the limit, pre-roll emitting nothing, resampler frequency/length
preservation, and chunked-equals-whole resampling (1-sample tail tolerance
for float boundary rounding).
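
The stride-aligned backlog drop can be sketched as follows. This is an assumption about one reasonable way to do it, not the MicClipReader code: the function name is hypothetical, the backlog and limit are expressed in counter units, and the limit is assumed to be at least one stride.

```python
def stride_aligned_drop(backlog: int, limit: int, stride: int) -> int:
    """Return how many counter units to skip so the remainder fits within `limit`.

    The result is always a whole number of strides, so the read pointer
    stays on a fragment boundary after the skip.
    """
    if limit < stride:
        raise ValueError("limit must cover at least one stride")
    if backlog <= limit:
        return 0
    excess = backlog - limit
    strides_to_drop = -(-excess // stride)  # ceiling division
    return strides_to_drop * stride


COUNTER_RATE = 51_200                    # 16 kHz clip inflated by k = 3.2
LIMIT = COUNTER_RATE * 200 // 1000       # 200 ms expressed in counter units
backlog_after_stall = 25_000

dropped = stride_aligned_drop(backlog_after_stall, LIMIT, stride=1024)
remaining = backlog_after_stall - dropped
```

For this stall, 15 strides (15360 counter units) are dropped and 9640 remain, which is under the 10240-unit limit.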

Logic verified by executing all test scenarios in a standalone harness
(Mono) in addition to compiling the Unity assemblies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@MaxHeimbrock

Added test coverage for the capture logic (35031cc): the fragment reconstruction and resampling logic is now extracted into UnityEngine-free internal classes (MicClipReader, StreamingResampler) with EditMode tests — including a replay of the exact fragmented structure dumped from the Sony MDR-1000X (320 valid / 1024 stride, k=3.2) asserting lossless, strictly-sequential reconstruction across multiple ring laps, plus contiguous-mode, backlog-drop, and resampler continuity tests. All scenarios were additionally executed in a standalone harness to verify behavior, not just compilation. CaptureLoop is now a thin Unity shell; behavior unchanged.

MicrophoneSource no longer attaches an AudioSource to its GameObject (it
reads the mic clip directly), but the Meet sample still called
GetComponent<AudioSource>()?.Stop() on unpublish. The ?. operator bypasses
Unity's overloaded null-check on the editor's missing-component stub, so
Stop() ran on the stub and threw MissingComponentException. Remove the
obsolete call.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

MaxHeimbrock and others added 2 commits June 12, 2026 17:19
…eanup

Field testing device transitions surfaced a false positive: right after
recovering onto the healthy MacBook microphone, the pre-roll measured k=1.07
(counter startup burst while driver buffers flush) which crossed the old 1.05
threshold and engaged fragmented mode - silently discarding ~6% of real audio
(heard as choppiness) until the next re-measurement.

Engaging fragmented mode discards (stride - valid) samples per stride, so a
false positive guarantees audio loss while a false negative only risks mild
artifacts. Fix both sides of the measurement:

- Raise the fragmented threshold from 1.05 to 1.5: the observed pathological
  device measures k=3.2, healthy devices ~1.0 plus a few percent of noise -
  keep a wide margin between the two.
- Add a 100ms settle window that discards the counter's startup burst before
  the rate measurement begins.
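
The effect of the two changes can be shown on synthetic counter data. A minimal sketch, assuming a monotonic counter (no ring wrap) polled every 5 ms; the function names and the shape of the startup burst are invented for illustration, and only the 100 ms settle window and the 1.5 threshold come from this commit.

```python
from typing import List, Tuple

SETTLE_MS = 100              # settle window added in this commit
FRAGMENTED_THRESHOLD = 1.5   # raised from 1.05 in this commit


def measure_preroll(polls: List[Tuple[int, int]], clip_hz: int) -> Tuple[float, int, bool]:
    """polls: (time_ms, counter) pairs. Returns (k, stride, fragmented)."""
    start_ms = polls[0][0]
    # Ignore everything inside the settle window, including the startup burst.
    settled = [(t, c) for t, c in polls if t - start_ms >= SETTLE_MS]
    (t0, c0), (t1, c1) = settled[0], settled[-1]
    counter_rate = (c1 - c0) * 1000 / (t1 - t0)
    k = counter_rate / clip_hz
    # Stride = smallest non-zero jump between consecutive polls.
    jumps = [cb - ca for (_, ca), (_, cb) in zip(settled, settled[1:]) if cb > ca]
    return k, min(jumps), k > FRAGMENTED_THRESHOLD


def healthy_counter(t_ms: int) -> int:
    """Healthy mic: 320 samples per 20 ms, plus a made-up burst in the first 50 ms."""
    burst = min(t_ms, 50) // 10 * 90
    return t_ms // 20 * 320 + burst


def fragmented_counter(t_ms: int) -> int:
    """Pathological device: counter jumps 1024 per 20 ms packet."""
    return t_ms // 20 * 1024


poll_times = range(0, 301, 5)
healthy = measure_preroll([(t, healthy_counter(t)) for t in poll_times], 16_000)
fragmented = measure_preroll([(t, fragmented_counter(t)) for t in poll_times], 16_000)
naive_k = healthy_counter(300) * 1000 / 300 / 16_000  # no settle window
```

In this synthetic data the naive measurement reads k = 1.09375, which would have crossed the old 1.05 threshold; with the settle window the healthy device measures k = 1.0 and stays contiguous, while the fragmented device still measures k = 3.2 with stride 1024.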

Add a regression test for the borderline case (k=1.07 must stay contiguous).

Also fix the second AudioSource null-propagation site (CleanUpAllTracks via
OnDestroy) with TryGetComponent - same MissingComponentException class as the
unpublish path, hit because the local mic object no longer carries an
AudioSource.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Generated by the editor; required for stable GUIDs when the package is
imported.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>