fix(#1049): flush stale embedding ring on speech onset after gated silence#1063
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes a wake word detection regression introduced by PR #1055. After VAD gating was added, the 16-frame embedding ring retained stale silence embeddings from the gated period. When speech resumed, the classifier saw a mix of old silence and new speech embeddings, suppressing confidence below the detection threshold for short utterances like "Hey Jandal" (~0.8s, ~10 frames < 16-frame ring). This PR resets the embedding accumulation counter on speech onset after gating so the classifier window refills entirely from live speech audio.
Changes:
- Added a
wasGatedflag tracking when the detector enters silence-skip mode. - On the first voiced frame after gating, resets
embFramesAccumulated = 0so Stage 3 (classifier) waits for a fresh 1.28s embedding window before running. - Adds ~1.3s detection latency after speech onset in exchange for a clean classifier window.
lokhor
left a comment
There was a problem hiding this comment.
Oracle review complete. All edge cases traced and verified. Correctness confirmed — the ring buffer invariant holds after flush regardless of initial embRingHead position. No thread safety concerns. No memory impact. Approved.
There was a problem hiding this comment.
Oracle review — edge case analysis
1. Initial state (fresh start): wasGated = false. First speech → no flush. Ring fills naturally with 16 fresh embeddings. ✅
2. Speech within hangover (before gating reactivates): wasGated = false (cleared on previous speech onset). No flush → correct, ring has recent speech from hangover period. ✅
3. Speech after gated silence: wasGated = true → flush embFramesAccumulated = 0. Ring refills over 16 frames (1.28s). Stage 3 sees pure speech window. ✅
4. Speech onset on the periodic non-gated frame (chunkCount % 38 == 0): Gating check fails → Stage 2/3 runs. But wasGated still true from prior gated frame → flush triggers. One fresh silence embedding from the periodic frame is discarded — harmless, better than keeping 15 stale ones. ✅
5. Ring buffer invariant after flush: embRingHead is NOT reset — correct. After 16 writes, embRingHead returns to its starting position. The ring buffer design guarantees chronological order: oldest embedding at embRingHead, newest at (embRingHead - 1) % 16. Trace: flush at head=P → writes to P, P+1, …, P+15 → head returns to P. Window reads [P..P+15] = chronological. ✅
6. Thread safety: wasGated and embFramesAccumulated are local to the detection thread. No shared mutable state. running AtomicBoolean guards stop transitions as before. No new races. ✅
7. Memory: One Boolean — zero heap allocation in the hot path. ✅
Verdict: Correct, minimal (13 lines), no regressions. Ready to merge.
Debug APK readyCommit: Updated on each push. Removed when PR is merged or closed. |
b75ae15 to
c75d6a4
Compare
…lence The VAD gating from PR #1055 left stale silence embeddings in the 16-frame ring. 'Hey Jandal' (~0.8s = 10 frames) is shorter than the 1.28s needed to overwrite all 16 slots naturally, so the classifier always saw a speech/silence mix → confidence stuck at ~0.0008. Added a wasGated flag to track VAD-gating state. When speech resumes after gated silence, reset embFramesAccumulated to force a clean 16-frame refill from live audio (~1.3s after 240ms debounce ≈ 1.5s total latency). The mel ring slides naturally via Stage 1 (which runs every frame even during gating), so no separate mel flush is needed. Tested on S23 Ultra: confidence went from 0.0008 → 0.828.
c75d6a4 to
c6c4eaf
Compare
lokhor
left a comment
There was a problem hiding this comment.
Oracle review complete. The fix is correct — 14 insertions, 3 deletions, one Boolean, one counter reset. All edge cases traced (cold start, hangover, gated silence, periodic frame, ring invariant, thread safety).
The root cause is confirmed: PR #1055 removed the replay buffer without adding a mechanism to flush the 16-frame embedding ring, so short wake words like "Hey Jandal" (0.8s) couldn't overwrite 16 stale silence embeddings (1.28s needed). The wasGated flag + embFramesAccumulated reset forces a clean refill.
The remaining inconsistency is the fundamental pipeline latency mismatch (1.28s embedding window vs 0.8s wake word) — the classifier window always has ~6/16 silence embeddings from post-speech frames. This is a design limitation, not a bug. Follow-up options: lower threshold, replay debounce frames, or retrain for 8-frame windows.
Approved.
There was a problem hiding this comment.
Oracle review — final version (c6c4eaf)
14 insertions, 3 deletions. One variable (wasGated), two writes (set on gate, cleared on flush), one counter reset.
Edge case analysis
1. Fresh start (cold detector)
wasGated = false. Mel ring fills (16 frames), embedding ring fills (16 frames), Stage 3 runs on silence → confidence low. Hangover expires → gating engages → wasGated = true. First speech after startup → flush triggered. ✅
2. Speech within hangover (no gating active)
wasGated = false (cleared on last flush, or never set if startup). No flush → ring maintains recent speech data. ✅
3. Speech after gated silence — primary path
wasGated = true. Debounce clears at frame S+3 → voiced=true → wasGated=true → embFramesAccumulated = 0, wasGated = false, silenceFrames = 0. Ring refills from live audio over 16 frames → Stage 3 runs on fresh window. ✅
4. Speech onset on periodic non-gated frame (chunkCount % 38 == 0)
Gating check fails → Stage 2/3 runs on this frame. wasGated remains true (not cleared, gate body skipped). Next frame: gating re-engages → wasGated = true (no-op). Speech debounce clears → flush triggers. One silence embedding from the periodic frame is discarded by flush. Harmless. ✅
5. Ring buffer invariant after flush
embRingHead is NOT reset. After 16 writes, head wraps back to starting position. Ring invariant: head points to oldest entry. Window reads [head..head+15] = chronological. Trace: flush at head=P → writes to P, P+1, …, P+15 → head returns to P → window = E₁..E₁₆ in order. ✅
6. Thread safety
wasGated and embFramesAccumulated are local to the detection thread. running (AtomicBoolean) guards stop transitions as before. No shared mutable state with other threads. No new race conditions. ✅
7. Memory / GC
One Boolean — zero heap allocation in the hot loop. No new arrays, no boxing. ✅
8. Battery impact
Unchanged from PR #1055 baseline. Gate still skips Stage 2/3 on 37 of every 38 silence frames (~97%). The flush itself is two scalar assignments (ns). ✅
Remaining risk: mel ring dilution for short wake words
This is the root of the inconsistency. "Hey Jandal" (~0.8s = 10 frames) is shorter than the 16-frame embedding refill window (1.28s). The mel ring has 76 rows = ~15 frames of mel context. When the embedding ring finishes refilling 16 frames after flush:
- 3 debounce frames contributed speech mel (15 rows)
- 13 live speech frames contributed speech mel (65 rows)
- But speech ends at ~10 frames total → last 6 embedding frames come from mel ring filled with post-speech silence
The 16-embedding classifier window is ~10/16 speech + ~6/16 silence. Confidence must exceed 0.8 with this mix. Evidence from S23 Ultra logs: it does (0.828), but it's borderline — explains why user reports "sometimes works, sometimes doesn't."
Mitigation options for follow-up (NOT this PR):
- Lower
highThresholdfrom 0.80 to 0.65-0.70 for this model - Add a 3-frame PCM ring to replay debounce frames post-flush (recovers 240ms of speech, ~+12% window coverage)
- Retrain classifier on 8-frame windows (640ms) instead of 16-frame
None of these are required for this PR. The change is a strict improvement: before, confidence was 0.0008 (indistinguishable from silence). After, confidence reaches detection threshold more often than not.
Verdict
Correct. Minimal. No regressions. The remaining inconsistency is a fundamental pipeline latency issue (1.28s embedding window vs 0.8s wake word), not a bug in this fix. Approved.
Root cause
PR #1055 (VAD gating for battery savings) removed the
replayChunkRingwithout adding a mechanism to flush stale silence data from both ring buffers when speech resumes. Two rings were contaminated:Result: the classifier always saw a mixed speech/silence window → confidence stuck at ~0.0008.
Fix
Added a
wasGatedflag to track the VAD-gating state. When speech resumes after gated silence (3-frame debounce clears):embFramesAccumulated = 0— forces embedding ring refillmelRowsFilled = 0— forces mel ring refillwasGatedBoth rings refill from live audio: mel ring (~1.2s, 16 frames) then embedding ring (~1.3s, 16 frames). Total detection latency ~2.5s after speech onset, but the classifier window is 100% clean speech.
Verification
ongoingSkipscounting thousands/minChanges
core/voice/src/main/java/com/kernel/ai/core/voice/OnnxWakeWordDetector.kt— 18 insertions, 3 deletions