
swap elevenlabs scribe for funasr local backend#4

Open
adan3862936 wants to merge 1 commit into browser-use:main from adan3862936:feat/funasr-local-backend

Conversation


@adan3862936 adan3862936 commented Apr 22, 2026

Summary

  • Replace ElevenLabs Scribe API with FunASR (ModelScope's open-source ASR) as the transcription backend.
  • Runs fully on-device — no API key, no per-minute cost, strong Chinese + mixed-language support, Apple Silicon MPS acceleration out of the box.
  • The transcript JSON contract is preserved (words array with type/text/start/end/speaker_id), so pack_transcripts.py, render.py, and the rest of the pipeline are untouched.

Motivation

For users working primarily with Chinese or mixed-language content, ElevenLabs Scribe is either unavailable or expensive at scale. FunASR's paraformer-zh ships per-character timestamps (aligned with the existing per-word schema), and cam++ gives the same speaker diarization semantics Scribe provided. End result: the skill works exactly the same, but the transcription is free and runs on the user's machine.

What changed

  • helpers/transcribe.py — full rewrite of the transcription layer: call_scribe() → call_funasr(), plus _get_model() (cached), per-character timestamp alignment, and a sentence-level fallback when punctuation breaks alignment.
  • helpers/transcribe_batch.py — drops API-key loading; default --workers changed from 4 to 1 (FunASR holds a large in-process model, so high concurrency thrashes memory without any GIL benefit).
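The per-character alignment can be illustrated with a small standalone helper. This is a sketch, not the PR's actual code — the gap threshold, the single-speaker handling, and the millisecond timestamp format are assumptions:

```python
def chars_to_words(text, timestamps_ms, gap_s=0.3, speaker_id="speaker_0"):
    """Map per-character timestamps (ms start/end pairs, as paraformer-style
    models emit for Chinese) onto an ElevenLabs-shaped words array.

    A 'spacing' entry is emitted whenever the silence between consecutive
    characters exceeds gap_s seconds; each character becomes one 'word'.
    """
    words = []
    prev_end = None
    for ch, (start_ms, end_ms) in zip(text, timestamps_ms):
        start, end = start_ms / 1000.0, end_ms / 1000.0
        if prev_end is not None and start - prev_end > gap_s:
            words.append({"type": "spacing", "text": " ",
                          "start": prev_end, "end": start,
                          "speaker_id": speaker_id})
        words.append({"type": "word", "text": ch,
                      "start": start, "end": end,
                      "speaker_id": speaker_id})
        prev_end = end
    return words
```

For example, two characters separated by 600 ms of silence produce three entries: word, spacing, word — which is what keeps the downstream per-word schema intact.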

What did NOT change

  • Output JSON schema — still ElevenLabs-shaped {language_code, text, words: [{type, text, start, end, speaker_id}]}.
  • pack_transcripts.py, render.py, SKILL.md, pyproject.toml.
  • Public signature of transcribe_one() is effectively unchanged — it only drops the now-unused api_key parameter.
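For reference, a minimal payload in the preserved shape, with a small contract check (the timestamp values are illustrative):

```python
def validate_payload(payload: dict) -> bool:
    """Assert a transcript dict matches the ElevenLabs-shaped contract."""
    assert {"language_code", "text", "words"} <= payload.keys()
    for w in payload["words"]:
        assert {"type", "text", "start", "end", "speaker_id"} <= w.keys()
    return True


sample = {
    "language_code": "zh",
    "text": "你好 世界",
    "words": [
        {"type": "word", "text": "你好", "start": 0.00, "end": 0.42, "speaker_id": "speaker_0"},
        {"type": "spacing", "text": " ", "start": 0.42, "end": 0.55, "speaker_id": "speaker_0"},
        {"type": "word", "text": "世界", "start": 0.55, "end": 0.98, "speaker_id": "speaker_0"},
    ],
}
```

Anything that passes this check should be consumable by pack_transcripts.py and render.py unmodified.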

Runtime footprint

FunASR model cache lands in ~/.cache/modelscope/hub/ on first run:

  • paraformer-zh ~1GB
  • cam++ ~200MB
  • fsmn-vad + ct-punc ~tens of MB

On Apple Silicon (tested on M-series / macOS 25.4), torch auto-selects MPS and runs ~5–10x realtime.
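Device selection along these lines would reproduce that behavior (a sketch — the PR may simply rely on torch's own defaults):

```python
import importlib.util


def pick_device() -> str:
    """Prefer MPS on Apple Silicon, then CUDA, then CPU.

    Falls back to "cpu" when torch itself is not installed, so the
    check is safe to run anywhere.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```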

Trade-off / opt-in suggestion

This PR is a breaking change for users who rely on the ElevenLabs backend (different languages, existing API key, specific audio-event tags like (laughter)). If maintainers prefer to keep ElevenLabs as the default, the cleanest path forward is an env-var / CLI toggle:

backend = os.getenv("VIDEO_USE_ASR", "elevenlabs")  # or "funasr"

Happy to rework the PR to that opt-in shape if preferred — just say the word. I left the FunASR path fully self-contained (_get_model, call_funasr, transform helpers) so it's easy to drop behind a flag.
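One possible shape for that toggle, with stand-in backends so the dispatch is testable on its own (call_scribe/call_funasr are the helper names from the PR; their signatures here are assumed):

```python
import os


def call_scribe(audio, language, num_speakers):
    """Stand-in for the real ElevenLabs helper."""
    return {"backend": "elevenlabs"}


def call_funasr(audio, language, num_speakers):
    """Stand-in for the real FunASR helper."""
    return {"backend": "funasr"}


def transcribe_one(audio, language="zh", num_speakers=None):
    """Route on the VIDEO_USE_ASR env var, defaulting to ElevenLabs so
    existing users see no behavior change unless they opt in."""
    backend = os.getenv("VIDEO_USE_ASR", "elevenlabs")
    call = call_funasr if backend == "funasr" else call_scribe
    return call(audio, language, num_speakers)
```

With this shape, `VIDEO_USE_ASR=funasr python3 helpers/transcribe.py …` would opt in, and everything else stays on the current default.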

Test plan

  • python3 -c "import funasr, torch, torchaudio" — deps resolve
  • python3 helpers/transcribe.py <video.mp4> — produces edit/transcripts/<name>.json with the expected schema
  • pack_transcripts.py consumes the new output without modification
  • Full render.py self-eval pass (pending a sample project from maintainers if they want to see it end-to-end)
  • English-only smoke test via --language en (routes to iic/Whisper-large-v3 stack)

Install deltas

Users adopting this branch add:

pip install funasr modelscope torch torchaudio

No other deps change. No .env required.


Summary by cubic

Replaces ElevenLabs Scribe with a local funasr backend for transcription. Runs on-device, preserves the transcript JSON, and removes the need for an API key.

  • New Features

    • Local ASR via funasr (default paraformer-zh; English with iic/Whisper-large-v3 via --language en).
    • Keeps the existing JSON contract (language_code, text, words[{type,text,start,end,speaker_id}]), so downstream tools continue to work.
    • Uses VAD fsmn-vad, punctuation ct-punc, and speaker diarization cam++; aligns per-char timestamps into words with spacing for silence.
    • Batch defaults to 1 worker to reduce memory thrash; MPS acceleration via torch; no per‑minute cost.
  • Migration

    • Install: pip install funasr modelscope torch torchaudio.
    • No ELEVENLABS_API_KEY needed; batch --workers now defaults to 1.
    • Breaking for ElevenLabs-specific workflows (e.g., Scribe-only audio event tags). Validate outputs if you depend on them.

Written for commit ca3a4af. Summary will update on new commits.

runs asr fully on-device via modelscope funasr (paraformer-zh +
fsmn-vad + ct-punc + cam++). no api key, no per-minute cost, strong
chinese support, apple silicon mps acceleration.

preserves the elevenlabs-shaped transcript json (words array with
type/text/start/end/speaker_id) so pack_transcripts.py, render.py,
and the rest of the pipeline are untouched.

- helpers/transcribe.py: rewrite call_scribe → call_funasr with
  per-character timestamp alignment and sentence-level fallback
- helpers/transcribe_batch.py: drop api-key loading, default workers
  to 1 (funasr is heavy)

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers/transcribe.py">

<violation number="1" location="helpers/transcribe.py:238">
P2: Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).</violation>
</file>


Comment thread helpers/transcribe.py
payload = call_funasr(audio, language, num_speakers)

- out_path.write_text(json.dumps(payload, indent=2))
+ out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))

@cubic-dev-ai cubic-dev-ai Bot Apr 22, 2026


P2: Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/transcribe.py, line 238:

<comment>Writing non-ASCII JSON without an explicit UTF-8 encoding can fail on non-UTF-8 locales (e.g., Windows cp1252).</comment>

<file context>
@@ -117,10 +232,10 @@ def transcribe_one(
+        payload = call_funasr(audio, language, num_speakers)
 
-    out_path.write_text(json.dumps(payload, indent=2))
+    out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
     dt = time.time() - t0
 
</file context>
Suggested change
- out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False))
+ out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
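A quick standalone check of why the suggestion matters — pinning both the JSON escaping and the file encoding makes the write locale-independent (the payload and path here are illustrative):

```python
import json
import tempfile
from pathlib import Path

payload = {"language_code": "zh", "text": "你好"}
out_path = Path(tempfile.mkdtemp()) / "demo.json"

# ensure_ascii=False keeps the Chinese characters readable in the file;
# encoding="utf-8" makes the write independent of the OS locale
# (without it, a cp1252 default on Windows raises UnicodeEncodeError).
out_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False),
                    encoding="utf-8")
```

Reading the file back with `encoding="utf-8"` round-trips the text exactly on any platform.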
