
swap elevenlabs scribe for funasr local backend#4

Open
adan3862936 wants to merge 1 commit into browser-use:main from adan3862936:feat/funasr-local-backend

Conversation


@adan3862936 adan3862936 commented Apr 22, 2026

Summary

  • Replace ElevenLabs Scribe API with FunASR (ModelScope's open-source ASR) as the transcription backend.
  • Runs fully on-device — no API key, no per-minute cost, strong Chinese + mixed-language support, Apple Silicon MPS acceleration out of the box.
  • The transcript JSON contract is preserved (words array with type/text/start/end/speaker_id), so pack_transcripts.py, render.py, and the rest of the pipeline are untouched.

Motivation

For users working primarily with Chinese or mixed-language content, ElevenLabs Scribe is either unavailable or expensive at scale. FunASR's paraformer-zh ships per-character timestamps (aligned with the existing per-word schema), and cam++ gives the same speaker diarization semantics Scribe provided. End result: the skill works exactly the same, but the transcription is free and runs on the user's machine.

What changed

  • helpers/transcribe.py — full rewrite of the transcription layer: call_scribe() → call_funasr(), plus _get_model() (cached), per-character timestamp alignment, and a sentence-level fallback when punctuation breaks alignment.
  • helpers/transcribe_batch.py — drops API-key loading; default --workers changed from 4 to 1 (FunASR holds a large in-process model, so high concurrency thrashes memory without any GIL benefit).
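The per-character alignment can be illustrated with a small standalone helper. This is a sketch, not the PR's actual code — the gap threshold, the single-speaker handling, and the millisecond timestamp format are assumptions:

```python
def chars_to_words(text, timestamps_ms, gap_s=0.3, speaker_id="speaker_0"):
    """Map per-character timestamps (ms start/end pairs, as paraformer-style
    models emit for Chinese) onto an ElevenLabs-shaped words array.

    A 'spacing' entry is emitted whenever the silence between consecutive
    characters exceeds gap_s seconds; each character becomes one 'word'.
    """
    words = []
    prev_end = None
    for ch, (start_ms, end_ms) in zip(text, timestamps_ms):
        start, end = start_ms / 1000.0, end_ms / 1000.0
        if prev_end is not None and start - prev_end > gap_s:
            words.append({"type": "spacing", "text": " ",
                          "start": prev_end, "end": start,
                          "speaker_id": speaker_id})
        words.append({"type": "word", "text": ch,
                      "start": start, "end": end,
                      "speaker_id": speaker_id})
        prev_end = end
    return words
```

For example, two characters separated by 600 ms of silence produce three entries: word, spacing, word — which is what keeps the downstream per-word schema intact.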

What did NOT change

  • Output JSON schema — still ElevenLabs-shaped {language_code, text, words: [{type, text, start, end, speaker_id}]}.
  • pack_transcripts.py, render.py, SKILL.md, pyproject.toml.
  • Public signature of transcribe_one() is effectively unchanged — it only drops the now-unused api_key parameter.
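For reference, a minimal payload in the preserved shape, with a small contract check (the timestamp values are illustrative):

```python
def validate_payload(payload: dict) -> bool:
    """Assert a transcript dict matches the ElevenLabs-shaped contract."""
    assert {"language_code", "text", "words"} <= payload.keys()
    for w in payload["words"]:
        assert {"type", "text", "start", "end", "speaker_id"} <= w.keys()
    return True


sample = {
    "language_code": "zh",
    "text": "你好 世界",
    "words": [
        {"type": "word", "text": "你好", "start": 0.00, "end": 0.42, "speaker_id": "speaker_0"},
        {"type": "spacing", "text": " ", "start": 0.42, "end": 0.55, "speaker_id": "speaker_0"},
        {"type": "word", "text": "世界", "start": 0.55, "end": 0.98, "speaker_id": "speaker_0"},
    ],
}
```

Anything that passes this check should be consumable by pack_transcripts.py and render.py unmodified.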

Runtime footprint

FunASR model cache lands in ~/.cache/modelscope/hub/ on first run:

  • paraformer-zh ~1GB
  • cam++ ~200MB
  • fsmn-vad + ct-punc ~tens of MB

On Apple Silicon (tested on M-series / macOS 25.4), torch auto-selects MPS and runs ~5–10x realtime.
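Device selection along these lines would reproduce that behavior (a sketch — the PR may simply rely on torch's own defaults):

```python
import importlib.util


def pick_device() -> str:
    """Prefer MPS on Apple Silicon, then CUDA, then CPU.

    Falls back to "cpu" when torch itself is not installed, so the
    check is safe to run anywhere.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```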

Trade-off / opt-in suggestion

This PR is a breaking change for users who rely on the ElevenLabs backend (different languages, existing API key, specific audio-event tags like (laughter)). If maintainers prefer to keep ElevenLabs as the default, the cleanest path forward is an env-var / CLI toggle:

backend = os.getenv("VIDEO_USE_ASR", "elevenlabs")  # or "funasr"

Happy to rework the PR to that opt-in shape if preferred — just say the word. I left the FunASR path fully self-contained (_get_model, call_funasr, transform helpers) so it's easy to drop behind a flag.
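One possible shape for that toggle, with stand-in backends so the dispatch is testable on its own (call_scribe/call_funasr are the helper names from the PR; their signatures here are assumed):

```python
import os


def call_scribe(audio, language, num_speakers):
    """Stand-in for the real ElevenLabs helper."""
    return {"backend": "elevenlabs"}


def call_funasr(audio, language, num_speakers):
    """Stand-in for the real FunASR helper."""
    return {"backend": "funasr"}


def transcribe_one(audio, language="zh", num_speakers=None):
    """Route on the VIDEO_USE_ASR env var, defaulting to ElevenLabs so
    existing users see no behavior change unless they opt in."""
    backend = os.getenv("VIDEO_USE_ASR", "elevenlabs")
    call = call_funasr if backend == "funasr" else call_scribe
    return call(audio, language, num_speakers)
```

With this shape, `VIDEO_USE_ASR=funasr python3 helpers/transcribe.py …` would opt in, and everything else stays on the current default.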

Test plan

  • python3 -c "import funasr, torch, torchaudio" — deps resolve
  • python3 helpers/transcribe.py <video.mp4> — produces edit/transcripts/<name>.json with the expected schema
  • pack_transcripts.py consumes the new output without modification
  • Full render.py self-eval pass (pending a sample project from maintainers if they want to see it end-to-end)
  • English-only smoke test via --language en (routes to iic/Whisper-large-v3 stack)

Install deltas

Users adopting this branch add:

pip install funasr modelscope torch torchaudio

No other deps change. No .env required.


Summary by cubic

Replaces ElevenLabs Scribe with a local funasr backend for transcription. Runs on-device, preserves the transcript JSON, and removes the need for an API key.

  • New Features

    • Local ASR via funasr (default paraformer-zh; English with iic/Whisper-large-v3 via --language en).
    • Keeps the existing JSON contract (language_code, text, words[{type,text,start,end,speaker_id}]), so downstream tools continue to work.
    • Uses VAD fsmn-vad, punctuation ct-punc, and speaker diarization cam++; aligns per-char timestamps into words with spacing for silence.
    • Batch defaults to 1 worker to reduce memory thrash; MPS acceleration via torch; no per‑minute cost.
  • Migration

    • Install: pip install funasr modelscope torch torchaudio.
    • No ELEVENLABS_API_KEY needed; batch --workers now defaults to 1.
    • Breaking for ElevenLabs-specific workflows (e.g., Scribe-only audio event tags). Validate outputs if you depend on them.

Written for commit ca3a4af. Summary will update on new commits.

runs asr fully on-device via modelscope funasr (paraformer-zh +
fsmn-vad + ct-punc + cam++). no api key, no per-minute cost, strong
chinese support, apple silicon mps acceleration.

preserves the elevenlabs-shaped transcript json (words array with
type/text/start/end/speaker_id) so pack_transcripts.py, render.py,
and the rest of the pipeline are untouched.

- helpers/transcribe.py: rewrite call_scribe → call_funasr with
  per-character timestamp alignment and sentence-level fallback
- helpers/transcribe_batch.py: drop api-key loading, default workers
  to 1 (funasr is heavy)

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers/transcribe.py">

<violation number="1" location="helpers/transcribe.py:238">
P2: Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).</violation>
</file>


Comment thread helpers/transcribe.py
payload = call_funasr(audio, language, num_speakers)

- out_path.write_text(json.dumps(payload, indent=2))
+ out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))

@cubic-dev-ai cubic-dev-ai Bot Apr 22, 2026


P2: Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/transcribe.py, line 238:

<comment>Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).</comment>

<file context>
@@ -117,10 +232,10 @@ def transcribe_one(
+        payload = call_funasr(audio, language, num_speakers)
 
-    out_path.write_text(json.dumps(payload, indent=2))
+    out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
     dt = time.time() - t0
 
</file context>
Suggested change
- out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
+ out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
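A quick standalone check of why the suggestion matters — pinning both the JSON escaping and the file encoding makes the write locale-independent (the payload and path here are illustrative):

```python
import json
import tempfile
from pathlib import Path

payload = {"language_code": "zh", "text": "你好"}
out_path = Path(tempfile.mkdtemp()) / "demo.json"

# ensure_ascii=False keeps the Chinese characters readable in the file;
# encoding="utf-8" makes the write independent of the OS locale
# (without it, a cp1252 default on Windows raises UnicodeEncodeError).
out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
                    encoding="utf-8")
```

Reading the file back with `encoding="utf-8"` round-trips the text exactly on any platform.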
