Browser-based live video production switcher with GPU acceleration and AI-powered captions
SwitchFrame is a live video switcher where the server handles all switching, mixing, encoding, and compositing. Browsers connect over WebTransport as control surfaces — they view sources and send commands, but don't produce the output. Sources arrive via Prism MoQ ingest, SRT input (listener or caller mode), or MXL shared-memory transport.
Every source is continuously decoded to raw YUV420. Cuts are instant. All video processing — transitions, keying, compositing, scaling, color grading, AI segmentation — happens server-side in BT.709 YUV420, with optional GPU acceleration on Metal (macOS) and CUDA (Linux).
Switching — Cut, mix, dip-to-black, wipe (6 directions with soft-edge blending), stinger (PNG sequence + audio), fade to black with reverse. Manual T-bar. Frame synchronization with motion-compensated interpolation for mixed-rate sources.
Audio — Per-channel faders, 3-band parametric EQ, single-band compressor, input trim, mute, AFV. Master bus with brickwall limiter. BS.1770-4 loudness metering (momentary, short-term, integrated). Per-source delay for lip-sync correction. Clock-driven output ticker decouples mix timing from source arrival. Signal chain bypasses decode/encode when a single source is at unity with processing bypassed.
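The per-channel gain stages can be sketched as below. This is a minimal illustration, not SwitchFrame's API: the function names are hypothetical, and the real chain also runs EQ, compression, and BS.1770 metering.

```go
package main

import "math"

// dbToLinear converts a fader or trim value in decibels to a linear gain.
func dbToLinear(db float64) float64 { return math.Pow(10, db/20) }

// applyChannel applies input trim and fader gain to a sample buffer, then
// clamps at a linear ceiling as a crude stand-in for the brickwall limiter.
func applyChannel(samples []float64, trimDB, faderDB, ceiling float64) {
	g := dbToLinear(trimDB) * dbToLinear(faderDB)
	for i, s := range samples {
		v := s * g
		if v > ceiling {
			v = ceiling
		} else if v < -ceiling {
			v = -ceiling
		}
		samples[i] = v
	}
}

func main() {
	buf := []float64{0.5, -0.9, 1.5}
	applyChannel(buf, 0, 0, 1.0) // unity trim and fader; limiter catches the 1.5
}
```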
AI Captions — Live speech-to-text via whisper.cpp with Silero VAD for neural speech detection. Per-word confidence scoring. Transcripts feed into the CEA-608 caption encoder for real-time closed captioning on program output. CUDA GPU acceleration supported.
AI Background Removal — TensorRT-accelerated person segmentation (RobustVideoMatting / RVM ONNX at runtime — see server/models/README.md). Removes backgrounds without a green screen. Temporal smoothing (EMA), edge refinement (3x3 erosion), and GPU-resident mask pipeline. Runs as a pipeline node alongside keying and compositing.
Color Grading — GPU-accelerated 3D LUT color grading with 6 built-in broadcast looks (cinematic, warm natural, cool drama, broadcast news, vivid, log-to-Rec.709). Industry-standard .cube file import. LUT uploaded to GPU memory with generation-counter caching for zero-overhead steady state.
Graphics & Keying — 8-layer downstream key compositor with fade, fly-in/out, slide, and pulse animations. 6 built-in broadcast templates. Per-source upstream chroma and luma keying. PIP, side-by-side, and quad layouts with slot transitions and live drag positioning via WebTransport datagrams. ST map per-pixel coordinate remapping for lens correction (barrel, pincushion, fisheye, corner pin) and animated program effects (heat shimmer, dream, ripple, vortex).
Closed Captions — CEA-608/708 captioning with three modes: off, passthrough (re-encode source captions), and author (live text input or AI transcription). H.264 SEI injection for MPEG-TS output. SMPTE ST 334 / CDP VANC output for MXL.
Instant Replay — Per-source circular buffers (configurable up to 5 minutes), mark-in/out, variable-speed playback down to 0.25x with pause/resume/seek. Quick-replay buttons and JKL shuttle control. Pitch-preserved audio via phase vocoder. Frame interpolation: duplication, alpha blend, motion-compensated (MCFI), or hold-crossfade (default).
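The core of a per-source replay buffer is a fixed-capacity circular buffer: writes overwrite the oldest entry, and reads walk backwards from the write head. A toy version over frame IDs (the real buffers hold decoded frames, and these type names are illustrative):

```go
package main

import "fmt"

// replayBuffer is a fixed-capacity circular buffer of frame IDs.
type replayBuffer struct {
	data  []int
	head  int // next write position
	count int
}

func newReplayBuffer(capacity int) *replayBuffer {
	return &replayBuffer{data: make([]int, capacity)}
}

// push stores a frame, silently overwriting the oldest once full.
func (b *replayBuffer) push(frame int) {
	b.data[b.head] = frame
	b.head = (b.head + 1) % len(b.data)
	if b.count < len(b.data) {
		b.count++
	}
}

// last returns the n most recent frames, oldest first.
func (b *replayBuffer) last(n int) []int {
	if n > b.count {
		n = b.count
	}
	out := make([]int, n)
	for i := 0; i < n; i++ {
		out[i] = b.data[(b.head-n+i+2*len(b.data))%len(b.data)]
	}
	return out
}

func main() {
	b := newReplayBuffer(4)
	for f := 1; f <= 10; f++ {
		b.push(f)
	}
	fmt.Println(b.last(3)) // prints [8 9 10]
}
```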
SRT Input — Listener (push) and caller (pull) modes for ingesting SRT sources. Any codec FFmpeg can decode, normalized to YUV420. Persistent config across reconnects. Exponential backoff reconnection. Per-source latency override and connection stats.
Output — MPEG-TS recording with time and size rotation. Multi-destination SRT output, push and pull. CEA-608 closed captions embedded in output. Per-destination SCTE-35 ad insertion with signal conditioning rules engine. SCTE-104 automation on MXL data flows. Confidence monitor (1fps JPEG thumbnail).
Multi-Operator — Director, audio, graphics, and viewer roles with per-subsystem locking. Macro system with 60 action types across 11 categories covering switching, audio, graphics, replay, layout, SCTE-35, and color grading with step validation and keyboard triggers.
Operator Comms — Built-in voice channel for multi-operator coordination. Opus codec over WebTransport bidirectional streams. N-1 mixing (hear everyone except yourself). Push-to-talk with backtick key. Auto-duck dims program audio during active comms. Up to 6 participants.
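N-1 (mix-minus) means each participant's return feed is the sum of everyone else. The efficient form sums all channels once and subtracts each participant's own signal, rather than computing N separate sums. A minimal sketch (function name is illustrative):

```go
package main

import "fmt"

// mixMinusOne builds one return feed per participant: the sum of every
// other participant's samples, i.e. "hear everyone except yourself".
func mixMinusOne(inputs [][]float64) [][]float64 {
	if len(inputs) == 0 {
		return nil
	}
	n := len(inputs[0])
	// Sum all channels once, then subtract each participant's own signal.
	sum := make([]float64, n)
	for _, ch := range inputs {
		for i, s := range ch {
			sum[i] += s
		}
	}
	out := make([][]float64, len(inputs))
	for p, ch := range inputs {
		mix := make([]float64, n)
		for i := range mix {
			mix[i] = sum[i] - ch[i]
		}
		out[p] = mix
	}
	return out
}

func main() {
	feeds := mixMinusOne([][]float64{{1}, {2}, {3}})
	fmt.Println(feeds) // prints [[5] [4] [3]]
}
```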
GPU Acceleration — Metal on macOS (Apple Silicon), CUDA on Linux (NVIDIA). 9 Metal compute shaders and 12 CUDA kernels covering format conversion, transition blending, scaling, chroma/luma keying, PIP compositing, DSK overlay, ST map warp, frame rate up-conversion, V210 conversion, color grading, AI preprocessing, and mask smoothing. Auto-detected at startup with graceful CPU fallback. Apple Silicon unified memory enables zero-copy upload/download. 59 hand-written SIMD assembly kernels (amd64 SSE/AVX + arm64 NEON) for CPU-path graphics, transitions, audio, FFT, V210, motion estimation, and ST map warp.
MXL — Optional shared-memory transport for uncompressed V210 video and float32 audio. Sources bypass H.264 decode entirely — raw YUV420p into the pipeline. Program output routes back to MXL. NMOS IS-04 flow discovery. Bidirectional SCTE-104 on data flows.
Infrastructure — WebTransport/QUIC for media and state, REST polling fallback. Hardware encoder auto-detection (NVENC, VA-API, VideoToolbox, libx264). Preview proxy encoding (480p, 500kbps per source) drops browser bandwidth from ~55 Mbps to ~10 Mbps. mmap-backed frame pool allocations outside the Go GC heap. Single-binary deployment with embedded UI. Prometheus metrics and pprof.
git clone https://github.com/zsiec/switchframe.git && cd switchframe
make demo
Open http://localhost:5173. Four simulated cameras + two SRT sources with full audio mixer.
Prerequisites
Go 1.25+, Node.js 22+, and codec development libraries.
macOS
brew install ffmpeg fdk-aac pkg-config
Debian / Ubuntu
sudo apt install libavcodec-dev libavutil-dev libavformat-dev libswscale-dev libswresample-dev libx264-dev libfdk-aac-dev pkg-config
Sources (H.264 via MoQ, any codec via SRT, or V210 via MXL shared memory)
→ per-source decode to YUV420
→ ST map lens correction · frame synchronizer · delay buffer
→ switching engine
→ pipeline: upstream key → PIP → DSK → color grade → AI segment → encode
→ program relay
├── browsers (WebTransport/MoQ, 480p preview proxy)
├── recording (MPEG-TS with CEA-608 captions)
├── SRT destinations (with per-destination SCTE-35)
└── MXL output (V210 + VANC)
Audio: decode → trim → EQ → compressor → fader
→ mix → master → limiter → LUFS metering → encode
→ ASR (Whisper) → caption pacer → CEA-608 output
GPU: upload NV12 → key → layout → DSK → color grade → AI segment → ST map → download YUV420
The server uses Prism for MoQ/WebTransport media distribution. The frontend is a Svelte 5 SPA that connects over a single QUIC connection for both media streams and control state.
The video pipeline is a chain of immutable processing nodes, atomically swapped at runtime for zero-frame-drop reconfiguration. Sources are routed through lock-free atomic pointers. The hot path holds locks for under 1us per frame at 30fps.
| Action | Mouse | Keyboard |
|---|---|---|
| Preview | Click source tile | 1–9 |
| Cut | CUT button | Space |
| Auto transition | AUTO button | Enter |
| Fade to black | FTB button | F1 |
| Hot-punch to program | — | Shift+1–9 |
| Run macro | — | Ctrl+1–9 |
| Comms mute toggle | — | ` (backtick) |
Press ? for the full shortcut overlay. Append ?mode=simple to the URL for a volunteer-friendly layout.
Start at docs/README.md for the full index, or docs/architecture.md for the narrative spine.
| Entry point | What's there |
|---|---|
| Docs index | Full table of contents across concepts / reference / subsystems / integration / operations |
| Architecture | The whole engine at a glance — component diagram, media-path sequence, state-broadcast fan-out |
| API reference | 291 REST endpoints, grouped by subsystem under docs/reference/api/ |
| Pipeline | Always-decode-per-source, atomic graph swap, processing frames, frame pool |
| Media path | End-to-end: source → switcher → pipeline → output fan-out |
| Concurrency | Lock inventory, frame flow, ordering rules, lock-free hot path |
| GPU acceleration | Metal / CUDA backends, kernel catalog, unified memory |
| Deployment | Build, run, ports, HTTPS/HTTP-3 certs, GPU setup, Docker |
| Fast-control protocol | WebTransport datagram binary protocol |
| State broadcast | ControlRoomState schema, push cadence, delivery |
Per-subsystem deep dives live under docs/subsystems/ — one doc per Go package (switcher, transition, graphics-and-dve, stmap, audio, captions, comms, output, srt-ingest, mxl, codec, control-plane, replay, clips, playout, scte35).
Cross-tier contracts (what a client needs to implement) live under docs/integration/: ui-server-contract, cef-protocol, asr-sidecar.
make dev # Go + Vite dev servers
make demo # 4 simulated cameras
make build # Production binary (embedded UI)
make docker # Multi-stage Docker image
make test-all # Go + Vitest + Playwright
MIT. ONNX weights and some test fixtures are third-party — see THIRD_PARTY_LICENSES.md.
