swap elevenlabs scribe for funasr local backend #4

Open

adan3862936 wants to merge 1 commit into browser-use:main from
Conversation
runs asr fully on-device via modelscope funasr (paraformer-zh + fsmn-vad + ct-punc + cam++). no api key, no per-minute cost, strong chinese support, apple silicon mps acceleration. preserves the elevenlabs-shaped transcript json (words array with type/text/start/end/speaker_id) so pack_transcripts.py, render.py, and the rest of the pipeline are untouched.

- helpers/transcribe.py: rewrite call_scribe → call_funasr with per-character timestamp alignment and sentence-level fallback
- helpers/transcribe_batch.py: drop api-key loading, default workers to 1 (funasr is heavy)
1 issue found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="helpers/transcribe.py">
<violation number="1" location="helpers/transcribe.py:238">
P2: Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).</violation>
</file>
```diff
  payload = call_funasr(audio, language, num_speakers)

- out_path.write_text(json.dumps(payload, indent=2))
+ out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
```
P2: Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/transcribe.py, line 238:
<comment>Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).</comment>
<file context>
@@ -117,10 +232,10 @@ def transcribe_one(
+ payload = call_funasr(audio, language, num_speakers)
- out_path.write_text(json.dumps(payload, indent=2))
+ out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
dt = time.time() - t0
</file context>
Suggested change

```diff
- out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
+ out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
```
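For context, a minimal self-contained sketch of why both flags matter (the file path below is illustrative): `ensure_ascii=False` keeps the Chinese text readable in the JSON, and the explicit `encoding="utf-8"` keeps the write from depending on the process locale.

```python
import json
from pathlib import Path

# With ensure_ascii=False the serialized string contains raw CJK characters,
# so the file encoding matters at write time.
payload = {"language_code": "zh", "text": "你好世界"}
serialized = json.dumps(payload, indent=2, ensure_ascii=False)

# Path.write_text falls back to locale.getpreferredencoding() when no encoding
# is given; on a cp1252 locale that raises UnicodeEncodeError for these chars.
out_path = Path("transcript_demo.json")
out_path.write_text(serialized, encoding="utf-8")  # explicit encoding is locale-proof

# Round-trip check: the characters survive intact.
restored = json.loads(out_path.read_text(encoding="utf-8"))
assert restored["text"] == "你好世界"
out_path.unlink()
```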
Summary
Runs ASR fully on-device via ModelScope FunASR (`paraformer-zh` + `fsmn-vad` + `ct-punc` + `cam++`): no API key, no per-minute cost, strong Chinese support, Apple Silicon MPS acceleration. Preserves the ElevenLabs-shaped transcript JSON (`words` array with `type`/`text`/`start`/`end`/`speaker_id`), so `pack_transcripts.py`, `render.py`, and the rest of the pipeline are untouched.

Motivation
For users working primarily with Chinese or mixed-language content, ElevenLabs Scribe is either unavailable or expensive at scale. FunASR's `paraformer-zh` ships per-character timestamps (aligned with the existing per-word schema), and `cam++` gives the same speaker diarization semantics Scribe provided. End result: the skill works exactly the same, but the transcription is free and runs on the user's machine.

What changed
- `helpers/transcribe.py`: `call_scribe()` → `call_funasr()`. Adds `_get_model()` (cached), per-char timestamp alignment, and a sentence-level fallback when punctuation breaks alignment.
- `helpers/transcribe_batch.py`: `--workers` default changed from 4 → 1 (FunASR holds a large in-process model; high concurrency thrashes memory without GIL benefit).

What did NOT change
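The per-char alignment and the sentence-level fallback can be sketched roughly as follows. This is an illustration only: the helper names and the `speaker_N` label format are assumptions, not the PR's actual code; FunASR-style input is modeled as a text string plus one `[start_ms, end_ms]` pair per character.

```python
def align_chars(text, timestamps, speaker_id=0):
    """Map per-character millisecond timestamps onto ElevenLabs-shaped word entries."""
    words = []
    for ch, (start_ms, end_ms) in zip(text, timestamps):
        words.append({
            "type": "word",
            "text": ch,
            "start": start_ms / 1000.0,  # downstream schema uses seconds
            "end": end_ms / 1000.0,
            "speaker_id": f"speaker_{speaker_id}",  # illustrative label format
        })
    return words

def sentence_fallback(sentence, start_s, end_s, speaker_id=0):
    """Fallback when punctuation breaks char alignment: spread the sentence
    evenly across its start/end interval, one entry per character."""
    step = (end_s - start_s) / max(len(sentence), 1)
    return [{
        "type": "word",
        "text": ch,
        "start": round(start_s + i * step, 3),
        "end": round(start_s + (i + 1) * step, 3),
        "speaker_id": f"speaker_{speaker_id}",
    } for i, ch in enumerate(sentence)]

payload = {
    "language_code": "zh",
    "text": "你好",
    "words": align_chars("你好", [[0, 320], [320, 600]]),
}
```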
- Transcript schema: `{language_code, text, words: [{type, text, start, end, speaker_id}]}`.
- `pack_transcripts.py`, `render.py`, `SKILL.md`, `pyproject.toml`.
- `transcribe_one()` (just drops the now-unused `api_key` parameter).

Runtime footprint
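For reference, a payload that satisfies the preserved schema (field values invented for illustration; only the field layout is from the PR):

```python
# A minimal payload in the preserved ElevenLabs-shaped schema. Downstream
# tools such as pack_transcripts.py read exactly these keys.
payload = {
    "language_code": "zh",
    "text": "你好",
    "words": [
        {"type": "word", "text": "你", "start": 0.0, "end": 0.24, "speaker_id": "speaker_0"},
        {"type": "word", "text": "好", "start": 0.24, "end": 0.48, "speaker_id": "speaker_0"},
    ],
}

# Each word entry must carry the full key set, no more and no less.
required = {"type", "text", "start", "end", "speaker_id"}
assert all(set(w) == required for w in payload["words"])
```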
FunASR's model cache lands in `~/.cache/modelscope/hub/` on first run:

- `paraformer-zh` ~1 GB
- `cam++` ~200 MB
- `fsmn-vad` + `ct-punc` ~tens of MB

On Apple Silicon (tested on M-series / macOS 25.4), torch auto-selects MPS and runs at roughly 5–10x realtime.
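The device auto-selection can be sketched as below; this is assumed logic for illustration, not necessarily how FunASR picks its device internally.

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then MPS (Metal on Apple Silicon), then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on old torch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```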
Trade-off / opt-in suggestion
This PR is a breaking change for users who rely on the ElevenLabs backend (different languages, existing API key, specific audio-event tags like `(laughter)`). If maintainers prefer to keep ElevenLabs as the default, the cleanest path forward is an env-var / CLI toggle.

Happy to rework the PR to that opt-in shape if preferred — just say the word. I left the FunASR path fully self-contained (`_get_model`, `call_funasr`, transform helpers) so it's easy to drop behind a flag.

Test plan
- `python3 -c "import funasr, torch, torchaudio"` — deps resolve
- `python3 helpers/transcribe.py <video.mp4>` — produces `edit/transcripts/<name>.json` with the expected schema
- `pack_transcripts.py` consumes the new output without modification
- `render.py` self-eval pass (pending a sample project from maintainers if they want to see it end-to-end)
- `--language en` (routes to the `iic/Whisper-large-v3` stack)

Install deltas
Users adopting this branch add: `funasr`, `modelscope`, `torch`, `torchaudio`.

No other deps change. No `.env` required.

Summary by cubic
Replaces ElevenLabs Scribe with a local `funasr` backend for transcription. Runs on-device, preserves the transcript JSON, and removes the need for an API key.

New Features

- Local transcription with `funasr` (default `paraformer-zh`; English with `iic/Whisper-large-v3` via `--language en`).
- Preserves the transcript JSON shape (`language_code`, `text`, `words[{type,text,start,end,speaker_id}]`), so downstream tools continue to work.
- VAD (`fsmn-vad`), punctuation (`ct-punc`), and speaker diarization (`cam++`); aligns per-char timestamps into `words` with spacing for silence.
- MPS acceleration via `torch`; no per-minute cost.

Migration

- `pip install funasr modelscope torch torchaudio`.
- No `ELEVENLABS_API_KEY` needed; batch `--workers` now defaults to 1.

Written for commit ca3a4af.
pip install funasr modelscope torch torchaudio.ELEVENLABS_API_KEYneeded; batch--workersnow defaults to 1.Written for commit ca3a4af. Summary will update on new commits.