
Reconstruct fragmented mic clip: fragment-aware direct capture #306

Open
MaxHeimbrock wants to merge 4 commits into max/mic-samplerate-device-init from max/mic-pitch-servo

MaxHeimbrock (Contributor) commented Jun 12, 2026

Stacked on #304. Fixes the choppy/garbled published audio with a Bluetooth HFP headset mic on macOS — by reading the audio Unity actually delivers, which turned out to be intact but scattered.

Root cause (proven by buffer inspection)

A raw WAV dump of the mic clip in the bad state showed the exact structure: FMOD writes each real 20 ms packet of audio at clip.frequency (16 kHz here), then advances Microphone.GetPosition as if it had written ~3.2× as much, zero-filling the skipped range. Concretely, the buffer holds valid fragments of exactly 320 samples (20 ms at 16 kHz) at a stride of exactly 1024 samples, so 320/1024 = 1/k with k = counterRate/clip.frequency = 3.2, and exact-zero padding sits between them. The padding cannot be genuine silence, which would carry a real noise floor rather than exact zeros.

Junction analysis showed the fragments join continuously (boundary sample deltas within normal in-fragment variation) — the full stream is present, just zero-padded. Concatenating fragments reconstructed clean, correct-pitch voice (verified by ear).
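The concatenation step can be illustrated offline. This is a minimal Python sketch, not the analysis script used for this PR: it assumes the dump has already been decoded into a list of mono float samples, and the names `find_fragments`, `reconstruct`, and the `min_gap` threshold are hypothetical. Any run of at least `min_gap` exact zeros is treated as padding.

```python
def find_fragments(samples, min_gap=16):
    """Return (start, length) for each run of audio separated by at least
    min_gap exact-zero samples. Shorter zero runs are kept inside a fragment,
    since real audio can cross zero on individual samples."""
    fragments = []
    start = None          # first index of the fragment currently being scanned
    last_nonzero = None   # most recent index holding a non-zero sample
    for i, s in enumerate(samples):
        if s != 0.0:
            if start is None:
                start = i
            elif i - last_nonzero > min_gap:
                # The zero run since the last non-zero sample was padding.
                fragments.append((start, last_nonzero - start + 1))
                start = i
            last_nonzero = i
    if start is not None:
        fragments.append((start, last_nonzero - start + 1))
    return fragments


def reconstruct(samples, min_gap=16):
    """Concatenate the detected fragments, dropping the zero padding."""
    out = []
    for start, length in find_fragments(samples, min_gap):
        out.extend(samples[start:start + length])
    return out
```

On a dump with the structure described above, `find_fragments` would be expected to report lengths of 320 at starts 1024 apart. A fragment that happens to begin or end on an exact-zero sample would be measured slightly short, so this is a diagnostic aid rather than a capture path.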

This explains every prior symptom:

  • plain playback → 31% voice + 69% silence = chop
  • counter-paced reading → fragments and padding played fast over a live buffer = noise with echo
  • pitch servo (either model) → cannot help; the gaps are in the data layout, not the timing

Change

MicrophoneSource now does fragment-aware direct capture:

  • Reads the clip ring buffer directly (no AudioSource, no OnAudioFilterRead) — also decouples capture from the output device's clock.
  • A ~0.3 s pre-roll measures k (counter rate ÷ clip.frequency) and the counter's smallest discrete jump (the stride J).
  • k ≈ 1 (healthy devices): plain contiguous read at the counter's pace.
  • k > 1.05 (this state): read only the first J/k samples of each stride — exactly the valid fragments — skipping the padding.
  • Downmix to mono and resample from clip.frequency to 48 kHz (streaming linear interpolation; resampler state carries across fragments since junctions are continuous), feeding a native source fixed at 48 kHz mono.
  • Backlog beyond 200 ms after a stall is dropped (stride-aligned) so the native queue can't overrun.
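The pre-roll measurement and the two read modes in the list above can be sketched as follows. This is an illustrative Python model under stated assumptions, not the MicrophoneSource code: the function names are hypothetical, the ring buffer is a plain list, `observations` stands in for (wall-clock seconds, Microphone.GetPosition) pairs polled during pre-roll, the read head is assumed to sit on a fragment start, and `max_backlog` is assumed to be expressed in counter positions.

```python
def estimate_k_and_stride(observations, clip_frequency, ring_len):
    """Estimate k = counterRate / clip_frequency and the smallest non-zero
    counter jump (the stride J) from (time_seconds, position) pairs."""
    if len(observations) < 2:
        raise ValueError("need at least two observations")
    total = 0
    stride = None
    for (_, p0), (_, p1) in zip(observations, observations[1:]):
        delta = (p1 - p0) % ring_len          # unwrap the ring position
        total += delta
        if delta > 0 and (stride is None or delta < stride):
            stride = delta
    elapsed = observations[-1][0] - observations[0][0]
    if elapsed <= 0:
        raise ValueError("observations must span a positive time interval")
    return (total / elapsed) / clip_frequency, stride


def read_ring(ring, start, length):
    """Copy `length` samples starting at `start`, wrapping at the ring end."""
    n = len(ring)
    return [ring[(start + i) % n] for i in range(length)]


def drop_backlog(read_pos, write_pos, stride, max_backlog, ring_len):
    """Skip whole strides so that less than max_backlog + stride remains."""
    available = (write_pos - read_pos) % ring_len
    if available <= max_backlog:
        return read_pos
    skip = ((available - max_backlog) // stride) * stride
    return (read_pos + skip) % ring_len


def capture_step(ring, read_pos, write_pos, k, stride):
    """Return (samples, new_read_pos) for one polling step."""
    n = len(ring)
    available = (write_pos - read_pos) % n
    if k <= 1.05:
        # Healthy device: plain contiguous read up to the counter.
        return read_ring(ring, read_pos, available), write_pos
    valid = round(stride / k)                 # e.g. 1024 / 3.2 -> 320
    out = []
    while available >= stride:
        # Keep the valid fragment, step over the zero padding.
        out.extend(read_ring(ring, read_pos, valid))
        read_pos = (read_pos + stride) % n
        available -= stride
    return out, read_pos
```

Only complete strides are consumed in fragmented mode, so a partially written stride is left for the next step. Rounding the backlog skip down to whole strides keeps the read head aligned with fragment starts.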

Expected log in the bad state:

MicrophoneSource: fragmented clip detected (k=3.20); reading 320 of every 1024 samples at 16000Hz

Healthy devices log contiguous capture (k=1.00).
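The downmix and streaming linear resample step from the change list could look roughly like this. It is a Python sketch under assumptions, not the shipped resampler: `downmix_to_mono` and `LinearResampler` are hypothetical names, and the integer phase accumulator is a choice made here to keep the output count exact. The point it illustrates is that the previous input sample and the fractional phase persist between calls, so feeding consecutive fragments gives the same output as feeding the joined stream.

```python
def downmix_to_mono(interleaved, channels):
    """Average each interleaved frame down to one sample."""
    frames = len(interleaved) // channels
    return [
        sum(interleaved[f * channels:(f + 1) * channels]) / channels
        for f in range(frames)
    ]


class LinearResampler:
    """Streaming linear-interpolation resampler with state across chunks."""

    def __init__(self, src_rate, dst_rate):
        self.src_rate = src_rate
        self.dst_rate = dst_rate
        self.phase = 0      # read position, in units of 1/dst_rate source samples
        self.prev = None    # last input sample of the previous chunk

    def process(self, chunk):
        if not chunk:
            return []
        # Prepend the carried sample so interpolation spans the junction.
        data = list(chunk) if self.prev is None else [self.prev] + list(chunk)
        limit = (len(data) - 1) * self.dst_rate
        out = []
        while self.phase < limit:
            i, rem = divmod(self.phase, self.dst_rate)
            frac = rem / self.dst_rate
            out.append(data[i] + (data[i + 1] - data[i]) * frac)
            self.phase += self.src_rate
        # Re-base the phase onto the last sample, which becomes index 0 next call.
        self.phase -= limit
        self.prev = data[-1]
        return out
```

For example, `LinearResampler(16000, 48000)` would be fed each 320-sample fragment in order, producing roughly three output samples per input sample with no reset at fragment boundaries.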

Verification

  • Runtime compiles clean.
  • Buffer-dump analysis (fragment sizes/strides, junction continuity, reconstruction) done offline on a captured WAV; reconstruction validated by ear.
  • To validate end-to-end: mac publisher with BT headset mic → Android receiver; expect clean, correct-pitch, non-choppy audio. Built-in mic should behave unchanged.

History

This branch went through two falsified designs first — a pitch servo at the counter ratio (garbled: the counter doesn't describe the data) and a k-rescaled lag servo at pitch 1 (perfect telemetry, still choppy: the gaps are in the buffer itself). The WAV dump diagnostic settled it. Commits preserved for the record.

🤖 Generated with Claude Code

MaxHeimbrock and others added 4 commits June 12, 2026 15:29
The mic clip is filled by the capture device's clock while the AudioSource
that plays it (feeding OnAudioFilterRead) runs on the output device's clock.
Some devices also misreport the clip rate entirely: a Bluetooth headset on
macOS labels its clip 16kHz while filling it at ~51kHz. Either way the read
head drifts against the write head and gets lapped, which sounds like
periodic chopping.

Add a pacing servo that measures how fast the write head actually advances
(GetPosition delta over wall clock - steady within ±0.1% even when the
instantaneous position is jumpy) and continuously adjusts AudioSource.pitch
so the read head consumes clip samples at the same rate, holding a fixed lag
behind the writer. A short pre-roll measures the rate before playback starts
so the initial pitch is already correct; the fill-rate estimate and the lag
target (sized to ~4x observed jitter, bounded by clip capacity) keep adapting
while capturing, and an out-of-bounds resync recovers from long hitches.

In the normal case the measured rate matches clip.frequency, pitch hovers at
~1.0, and the servo is effectively a no-op. In the misreporting case pitch
settles at the true ratio (~3.2), which plays the clip's real-time data at
correct speed and eliminates the chop. Pitch is rate control, not a delay:
the added latency is only the held lag (~80-150ms, adaptive).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…head

Field test falsified the previous model: with pitch set to the measured
counter ratio (3.2), the published audio became garbled repeats ("noise with
echo"), while the servo's own lag telemetry stayed perfectly stable — because
it was measuring against the same lying counter. Combined with earlier
results (1x playback yields correct-pitch voice; reading at the counter's
pace yields noise), the consistent model is:

- The clip DATA genuinely is at clip.frequency (16kHz here).
- Microphone.GetPosition's counter is inflated ~3.2x on macOS + BT-HFP; it
  does not describe the data. The choppiness on the plain path is the read
  head colliding with the bursty real write head due to a small, unmanaged
  startup lag — not a rate mismatch.

Rework the servo accordingly: pitch stays pinned near 1.0 (max ±3% trim).
The counter is used only after rescaling by its measured inflation factor
k = counterRate / clip.frequency (~1 on healthy devices) to estimate the
real write head, and the servo holds the read head a generous adaptive lag
(150ms default) behind that estimate. Clip buffer extended to 2s for more
collision headroom.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The reworked servo's telemetry is perfect in the bad state (k=3.20,
pitch~1.00, lag locked on target, jitter ~0, no resyncs) yet the published
audio still chops like the unpaced path. That falsifies the read/write
collision model: the reader is provably never near the writer.

Remaining hypothesis: the chop is baked into the clip data itself — FMOD
scatters the real 16kHz samples at the inflated counter's positions, leaving
stale regions between fragments (~31% fresh per cycle). That would also
explain why counter-paced reading sounds like noise with echo (fragments +
stale older audio, fast).

Snapshot the raw clip to a WAV 4s after capture starts (editor-only) so the
buffer contents can be inspected directly: contiguous voice means the chop
is downstream and still fixable; fragmented voice means capture data is
destroyed at write time and the Unity Microphone path cannot work for this
device.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A raw dump of the mic clip in the macOS + Bluetooth HFP state revealed the
true buffer structure: FMOD writes each real 20ms packet of clip.frequency
audio, then advances the position counter as if it had written k (~3.2x) as
much and zero-fills the skipped range. The buffer holds valid fragments of
exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k), and the
fragments join continuously (junction sample deltas within normal in-fragment
variation) - i.e. the full audio stream is present, just zero-padded.
Concatenating the fragments reconstructed clean, correct-pitch voice
(verified by ear), which also explains every earlier symptom: plain playback
= 31% voice + 69% silence (chop); counter-paced reading = fragments and
padding played fast over a live buffer (noise with echo).

Replace the pitch-servo playback approach with fragment-aware direct capture:

- Read the clip ring buffer directly (no AudioSource, no OnAudioFilterRead),
  which also decouples capture from the output device's clock.
- Pre-roll measures the counter rate (k = counterRate / clip.frequency) and
  the counter's smallest discrete jump (the stride J).
- k ~ 1: plain contiguous read at the counter's pace (healthy devices).
- k > 1.05: read only the first J/k samples of each stride - exactly the
  valid fragments - skipping the zero padding.
- Downmix to mono and resample clip.frequency -> 48kHz (streaming linear;
  state carries across fragments since their junctions are continuous), into
  a native source fixed at 48kHz mono.
- Backlog beyond 200ms after a stall is dropped, stride-aligned, to avoid
  overrunning the native queue.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
MaxHeimbrock changed the title from "Adaptive pitch servo: lock mic playback to the measured capture rate" to "Reconstruct fragmented mic clip: fragment-aware direct capture" on Jun 12, 2026