feat(#935 #939 #982 #1017): central STT transcript normaliser with Kiwi phonetic and unit alias tables #1070
Open
lokhor wants to merge 1 commit into
…wi phonetic and unit alias tables
- New TranscriptNormaliser in core/voice/ runs on all STT output before downstream consumption
- KIWI_PHONETIC_REPLACEMENTS: fattybaku/farah paco → wharepaku, tonifa/tanifa → taniwha, comrade → kumara, chaka → chocka
- STT_UNIT_ALIAS_REPLACEMENTS: mls/Mills/mils/ml's → ml for numeric quantities
- NativeAndroid, Vosk, Sherpa-ONNX backends wrap extractBestTranscript/resultTextOnline/resultTextOffline
- LIST_NAME_TAIL_MISHEAR_RE in QIR pre-router catches lost/lust/last → list for list commands only
- TranscriptNormaliserTest (18 tests), QuickIntentRouterListRoutingTest regression, E4B fallthrough test
Closes #935, #939, #982, #1017
Closes #935, #939, #982, #1017
What
A pure-Kotlin `TranscriptNormaliser` in `core/voice/` that centralises STT post-processing normalisation, running on every transcript before it reaches any downstream consumer (intent router, LLM, RAG, chat).
TranscriptNormaliser (new)
Two replacement tables:
`KIWI_PHONETIC_REPLACEMENTS`, word-level Kiwi/Māori mishears (#935, #939):
- `fattybaku` / `farah paco` → `wharepaku`
- `tonifa` / `tanifa` → `taniwha`
- `comrade` → `kumara`
- `chaka` → `chocka`
`STT_UNIT_ALIAS_REPLACEMENTS`, unit/abbreviation normalisations (#1017):
- `300 mls` / `300 Mills` / `200 mils` / `100 ml's` → `300 ml` / `300 ml` / `200 ml` / `100 ml` respectively (numeric prefix preserved)
All patterns are word-boundary anchored and case-insensitive. Normalisation is idempotent and does not collapse whitespace.
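To make the two-table design concrete, here is a minimal sketch of how such a normaliser could be structured. Only the table names and the listed replacements come from this PR description; the `TranscriptNormaliserSketch` object, the `normalise` signature, and the exact regexes are assumptions for illustration and may differ from the code in the diff.

```kotlin
// Hypothetical sketch, not the PR's actual implementation.
object TranscriptNormaliserSketch {

    // Word-level Kiwi/Māori mishears (#935, #939): word-boundary anchored, case-insensitive.
    private val KIWI_PHONETIC_REPLACEMENTS: List<Pair<Regex, String>> = listOf(
        Regex("""\b(?:fattybaku|farah paco)\b""", RegexOption.IGNORE_CASE) to "wharepaku",
        Regex("""\b(?:tonifa|tanifa)\b""", RegexOption.IGNORE_CASE) to "taniwha",
        Regex("""\bcomrade\b""", RegexOption.IGNORE_CASE) to "kumara",
        Regex("""\bchaka\b""", RegexOption.IGNORE_CASE) to "chocka",
    )

    // Unit aliases (#1017): rewritten only when preceded by a number.
    // Group 1 captures the number plus any following whitespace, so the
    // numeric prefix and the original spacing are preserved as-is.
    // Singular "mil" is deliberately absent from the alternation.
    private val STT_UNIT_ALIAS_REPLACEMENTS: List<Pair<Regex, String>> = listOf(
        Regex("""\b(\d+(?:\.\d+)?\s*)(?:ml's|mls|mills|mils)\b""", RegexOption.IGNORE_CASE) to "\$1ml",
    )

    // Applies every table entry in order. No replacement output matches any
    // pattern, so running normalise() on its own output changes nothing.
    fun normalise(transcript: String): String =
        (KIWI_PHONETIC_REPLACEMENTS + STT_UNIT_ALIAS_REPLACEMENTS)
            .fold(transcript) { text, (pattern, replacement) ->
                pattern.replace(text, replacement)
            }
}
```

Under these assumptions, `TranscriptNormaliserSketch.normalise("Add 300 Mills of milk and a comrade")` would produce `Add 300 ml of milk and a kumara`: the word `milk` is untouched because the unit pattern requires a numeric prefix.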
Backend wiring
- NativeAndroid: `extractBestTranscript`
- Vosk: `parseTranscript` / `parsePartialTranscript`
- Sherpa-ONNX: `resultTextOnline` / `resultTextOffline` (covers all 5 emission sites)
QuickIntentRouter (#982)
`LIST_NAME_TAIL_MISHEAR_RE` rewrites trailing `lost`/`lust`/`last` → `list` only in the regex pre-pass. Chat voice ("I lost my keys") retains the original text via the `FallThrough` path, as verified by a regression test.
Testing
- `add 111 over 70 to my blood pressure lost` → `add_to_list`
- `"I lost my keys"` still falls through to E4B
Verification
- `./gradlew :core:voice:testDebugUnitTest`: 18/18 pass
- `./gradlew :core:skills:testDebugUnitTest`: all pass (ListRouting + full suite)
- `./gradlew :app:assembleDebug`: clean build
- `./gradlew :core:voice:lintDebug :core:skills:lintDebug`: clean
Out of scope
- `nz_truth_memories.json`: no terms exist in the corpus today
- `mil` alias intentionally omitted (ambiguous: mL vs thousandth of an inch)
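For reviewers, the QuickIntentRouter (#982) behaviour described earlier can be illustrated roughly as follows. This is a sketch under assumptions: `RouteResult`, `ADD_TO_LIST_RE`, and `routeSketch` are hypothetical names, and the real pre-router almost certainly has more patterns. Only `LIST_NAME_TAIL_MISHEAR_RE`, the `add_to_list` intent, and the `FallThrough` path are named in this PR, and the regex body shown here is a guess at its shape.

```kotlin
// Hypothetical result type; only FallThrough is named in the PR description.
sealed interface RouteResult {
    data class Matched(val intent: String, val item: String, val listName: String) : RouteResult
    data class FallThrough(val originalText: String) : RouteResult
}

// Matches lost/lust/last only when it is the final word of the utterance.
val LIST_NAME_TAIL_MISHEAR_RE = Regex("""\b(?:lost|lust|last)\s*$""", RegexOption.IGNORE_CASE)

// Assumed shape of an add-to-list command, for illustration only.
val ADD_TO_LIST_RE = Regex("""^add (.+) to my (.+ list)$""", RegexOption.IGNORE_CASE)

fun routeSketch(transcript: String): RouteResult {
    // The rewrite is applied to a local copy used for regex matching only.
    val candidate = LIST_NAME_TAIL_MISHEAR_RE.replace(transcript.trim(), "list")
    val match = ADD_TO_LIST_RE.matchEntire(candidate)
    return if (match != null) {
        RouteResult.Matched("add_to_list", match.groupValues[1], match.groupValues[2])
    } else {
        // No list command matched: hand the untouched transcript downstream.
        RouteResult.FallThrough(transcript)
    }
}
```

The point of the design is that the rewritten text never escapes the pre-pass. An utterance such as "I got lost" is rewritten internally, fails to match any list command, and still reaches the fallback model with its original wording.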