Reconstruct fragmented mic clip: fragment-aware direct capture #306
Open
MaxHeimbrock wants to merge 4 commits into
The mic clip is filled by the capture device's clock while the AudioSource that plays it (feeding OnAudioFilterRead) runs on the output device's clock. Some devices also misreport the clip rate entirely: a Bluetooth headset on macOS labels its clip 16kHz while filling it at ~51kHz. Either way the read head drifts against the write head and gets lapped, which sounds like periodic chopping.
Add a pacing servo that measures how fast the write head actually advances (GetPosition delta over wall clock, steady within ±0.1% even when the instantaneous position is jumpy) and continuously adjusts AudioSource.pitch so the read head consumes clip samples at the same rate, holding a fixed lag behind the writer. A short pre-roll measures the rate before playback starts so the initial pitch is already correct; the fill-rate estimate and the lag target (sized to ~4x observed jitter, bounded by clip capacity) keep adapting while capturing, and an out-of-bounds resync recovers from long hitches.
In the normal case the measured rate matches clip.frequency, pitch hovers at ~1.0, and the servo is effectively a no-op. In the misreporting case pitch settles at the true ratio (~3.2), which plays the clip's real-time data at correct speed and eliminates the chop. Pitch is rate control, not a delay: the added latency is only the held lag (~80-150ms, adaptive).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
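The rate measurement in this commit (position delta over wall clock) can be sketched roughly as follows. This is plain Python for illustration, not the PR's Unity C#: the function name and the (time, position) sample format are hypothetical, and the sketch assumes the counter wraps at most once between polls. In Unity the positions would come from Microphone.GetPosition.

```python
def estimate_fill_rate(samples, clip_length):
    """Estimate how fast a ring-buffer write counter advances, in samples/s.

    samples: list of (wall_clock_seconds, position) pairs, oldest first.
    clip_length: ring length in samples (the counter wraps at this value).
    Assumption: the counter wraps at most once between consecutive polls.
    """
    if len(samples) < 2:
        raise ValueError("need at least two position samples")
    advanced = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # The modulo turns a wrapped (negative) delta back into a forward step.
        advanced += (cur - prev) % clip_length
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    return advanced / elapsed


# Hypothetical polls: the counter jumps 1024 samples every 20 ms
# inside a 32000-sample clip that is labelled 16 kHz.
polls = [(i * 0.02, (i * 1024) % 32000) for i in range(51)]
rate = estimate_fill_rate(polls, 32000)
ratio = rate / 16000  # counter rate relative to the labelled clip rate
```

Summing deltas over a long window is what makes the estimate steady even when individual position reads are jumpy.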
…head
Field test falsified the previous model: with pitch set to the measured
counter ratio (3.2), the published audio became garbled repeats ("noise with
echo"), while the servo's own lag telemetry stayed perfectly stable — because
it was measuring against the same lying counter. Combined with earlier
results (1x playback yields correct-pitch voice; reading at the counter's
pace yields noise), the consistent model is:
- The clip DATA genuinely is at clip.frequency (16kHz here).
- Microphone.GetPosition's counter is inflated ~3.2x on macOS + BT-HFP; it
does not describe the data. The choppiness on the plain path is the read
head colliding with the bursty real write head due to a small, unmanaged
startup lag — not a rate mismatch.
Rework the servo accordingly: pitch stays pinned near 1.0 (max ±3% trim).
The counter is used only after rescaling by its measured inflation factor
k = counterRate / clip.frequency (~1 on healthy devices) to estimate the
real write head, and the servo holds the read head a generous adaptive lag
(150ms default) behind that estimate. Clip buffer extended to 2s for more
collision headroom.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
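A minimal sketch of the rescaling idea in this commit, in plain Python rather than the PR's C#. All function names are hypothetical, and the proportional gain is an assumption made for illustration; only the k definition, the ±3% trim limit, and the 150ms default lag come from the commit message.

```python
def real_write_head(counter_total, k):
    """Rescale the inflated cumulative counter to an estimate of real samples written."""
    return counter_total / k


def lag_seconds(counter_total, read_total, k, clip_frequency):
    """Lag of the read head behind the estimated real write head, in seconds."""
    return (real_write_head(counter_total, k) - read_total) / clip_frequency


def pitch_trim(lag, target_lag, gain=0.1, max_trim=0.03):
    """Proportional pitch trim around 1.0, clamped to +/- max_trim.

    gain is a hypothetical tuning value, not taken from the PR.
    """
    trim = gain * (lag - target_lag)
    return 1.0 + max(-max_trim, min(max_trim, trim))


# Counter advanced 51200 in one second on a clip labelled 16 kHz: k = 3.2.
k = 51200 / 16000
lag = lag_seconds(51200, 13600, k, 16000)  # reader is 2400 real samples behind
pitch = pitch_trim(lag, target_lag=0.15)
```

On a healthy device k is about 1, so the rescaling leaves the counter unchanged and the trim stays near 1.0.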
The reworked servo's telemetry is perfect in the bad state (k=3.20, pitch~1.00, lag locked on target, jitter ~0, no resyncs) yet the published audio still chops like the unpaced path. That falsifies the read/write collision model: the reader is provably never near the writer.
Remaining hypothesis: the chop is baked into the clip data itself, with FMOD scattering the real 16kHz samples at the inflated counter's positions and leaving stale regions between fragments (~31% fresh per cycle). That would also explain why counter-paced reading sounds like noise with echo (fragments + stale older audio, fast).
Snapshot the raw clip to a WAV 4s after capture starts (editor-only) so the buffer contents can be inspected directly: contiguous voice means the chop is downstream and still fixable; fragmented voice means capture data is destroyed at write time and the Unity Microphone path cannot work for this device.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
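The snapshot diagnostic amounts to writing the raw buffer out unmodified. A rough sketch in plain Python, using the standard wave module in place of whatever the editor-only C# code does; the function name is hypothetical, and the sketch assumes the samples are floats in [-1, 1].

```python
import struct
import wave


def dump_clip_to_wav(samples, sample_rate, path, channels=1):
    """Write float samples in [-1, 1] to a 16-bit PCM WAV file for inspection."""
    pcm = bytearray()
    for s in samples:
        clamped = max(-1.0, min(1.0, s))  # guard against out-of-range values
        pcm += struct.pack("<h", int(round(clamped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
```

Exact-zero input samples stay exact zeros after the 16-bit conversion, so zero-filled regions remain distinguishable from quiet audio in the dump.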
A raw dump of the mic clip in the macOS + Bluetooth HFP state revealed the true buffer structure: FMOD writes each real 20ms packet of clip.frequency audio, then advances the position counter as if it had written k (~3.2x) as much and zero-fills the skipped range. The buffer holds valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k), and the fragments join continuously (junction sample deltas within normal in-fragment variation), i.e. the full audio stream is present, just zero-padded.
Concatenating the fragments reconstructed clean, correct-pitch voice (verified by ear), which also explains every earlier symptom: plain playback = 31% voice + 69% silence (chop); counter-paced reading = fragments and padding played fast over a live buffer (noise with echo).
Replace the pitch-servo playback approach with fragment-aware direct capture:
- Read the clip ring buffer directly (no AudioSource, no OnAudioFilterRead), which also decouples capture from the output device's clock.
- Pre-roll measures the counter rate (k = counterRate / clip.frequency) and the counter's smallest discrete jump (the stride J).
- k ~ 1: plain contiguous read at the counter's pace (healthy devices).
- k > 1.05: read only the first J/k samples of each stride (exactly the valid fragments), skipping the zero padding.
- Downmix to mono and resample clip.frequency -> 48kHz (streaming linear; state carries across fragments since their junctions are continuous), into a native source fixed at 48kHz mono.
- Backlog beyond 200ms after a stall is dropped, stride-aligned, to avoid overrunning the native queue.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
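The "streaming linear" resampling step, with state carried across fragments, might look roughly like this. It is a plain-Python sketch and not the PR's implementation: the class name and the chunk-based interface are hypothetical, and the input is assumed to be already mono.

```python
import math


class StreamingLinearResampler:
    """Linear-interpolation resampler that keeps its state between chunks."""

    def __init__(self, src_rate, dst_rate):
        self.step = src_rate / dst_rate  # input samples advanced per output sample
        self.pos = 0.0    # fractional read position, relative to the current chunk
        self.prev = None  # last sample of the previous chunk (acts as index -1)

    def process(self, chunk):
        out = []
        if not chunk:
            return out
        if self.prev is None:
            self.prev = chunk[0]
        n = len(chunk)
        # Interpolate between index i and i + 1; index -1 is the carried sample.
        while self.pos < n - 1:
            i = math.floor(self.pos)
            frac = self.pos - i
            a = self.prev if i < 0 else chunk[i]
            b = chunk[i + 1]
            out.append(a + (b - a) * frac)
            self.pos += self.step
        # Re-express the position relative to the start of the next chunk.
        self.pos -= n
        self.prev = chunk[-1]
        return out


# Three consecutive 320-sample fragments of a ramp, resampled 16 kHz -> 48 kHz.
resampler = StreamingLinearResampler(16000, 48000)
output = []
for c in range(3):
    output += resampler.process([float(c * 320 + i) for i in range(320)])
```

Because the last sample and the fractional position survive between calls, the interpolation across a fragment junction is the same as it would be inside a fragment, which is only valid because the junctions are continuous.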
This was referenced Jun 12, 2026
Stacked on #304. Fixes the choppy/garbled published audio with a Bluetooth HFP headset mic on macOS — by reading the audio Unity actually delivers, which turned out to be intact but scattered.
Root cause (proven by buffer inspection)
A raw WAV dump of the mic clip in the bad state showed the exact structure: FMOD writes each real 20 ms packet of clip.frequency audio, then advances Microphone.GetPosition as if it had written ~3.2× as much, zero-filling the skipped range. Concretely: valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k, where k = counterRate/clip.frequency = 3.2), with exact-zero padding between them (true silence has a real noise floor, never exact zeros).
Junction analysis showed the fragments join continuously (boundary sample deltas within normal in-fragment variation): the full stream is present, just zero-padded. Concatenating fragments reconstructed clean, correct-pitch voice (verified by ear).
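The buffer inspection and the reconstruction can be illustrated on synthetic data. This is a plain-Python sketch with hypothetical function names, not the analysis script used for the PR. It treats every exact zero as padding, so a real sample that happens to be exactly zero would split a fragment; a more careful version would require a minimum run of zeros.

```python
def find_fragments(buffer):
    """Return (start, length) for each run of non-zero samples."""
    runs, start = [], None
    for i, s in enumerate(buffer):
        if s != 0.0 and start is None:
            start = i  # a fragment begins
        elif s == 0.0 and start is not None:
            runs.append((start, i - start))  # a fragment ends at the first zero
            start = None
    if start is not None:
        runs.append((start, len(buffer) - start))
    return runs


def concatenate_fragments(buffer, stride, fragment_len, offset=0):
    """Keep the first fragment_len samples of every stride, dropping the padding."""
    out = []
    for start in range(offset, len(buffer), stride):
        out.extend(buffer[start:start + fragment_len])
    return out


# Synthetic stand-in for the dump: 320 valid samples, then 704 zeros, repeated.
voice = [((i % 7) + 1) / 10 for i in range(960)]
padded = []
for s in range(0, 960, 320):
    padded += voice[s:s + 320] + [0.0] * 704

runs = find_fragments(padded)
stride = runs[1][0] - runs[0][0]  # distance between fragment starts
k = stride / runs[0][1]           # stride / fragment length
restored = concatenate_fragments(padded, stride, runs[0][1])
```

With 320-sample fragments at a stride of 1024, the ratio comes out as 1024/320 = 3.2, matching the measured counter inflation.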
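This explains every prior symptom:
- Plain playback: 31% voice + 69% silence (chop).
- Counter-paced reading: fragments and padding played fast over a live buffer (noise with echo).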
Change
MicrophoneSource now does fragment-aware direct capture:
- Reads the clip ring buffer directly (no AudioSource, no OnAudioFilterRead), which also decouples capture from the output device's clock.
- Pre-roll measures k (counter rate ÷ clip.frequency) and the counter's smallest discrete jump (the stride J).
- When k > 1.05, reads only the first J/k samples of each stride (exactly the valid fragments), skipping the padding.
- Downmixes to mono and resamples clip.frequency → fixed 48 kHz native source (streaming linear; resampler state carries across fragments since junctions are continuous).
Expected log in the bad state:
Healthy devices log "contiguous capture (k=1.00)".
Verification
History
This branch went through two falsified designs first — a pitch servo at the counter ratio (garbled: the counter doesn't describe the data) and a k-rescaled lag servo at pitch 1 (perfect telemetry, still choppy: the gaps are in the buffer itself). The WAV dump diagnostic settled it. Commits preserved for the record.
🤖 Generated with Claude Code