A local CLI that turns a source video into a single concatenated highlight reel in the source's original orientation (16:9). Pick the moments manually (type timestamp ranges) or automatically (transcribe the audio, then let Gemini rank the most engaging moments).
Each stage is a single-task module: it reads one input artifact and writes one
output artifact, knowing nothing about the other stages. The pivot artifact is
segments.json — an ordered list of Segment objects. Whether those segments
are typed by a human or ranked by AI, everything downstream is identical.
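
As a rough illustration of the pivot artifact, the sketch below shows one way a Segment could round-trip through segments.json. The field names (`start`, `end`, `label`) and the helper names are assumptions made for this example; the actual definitions live in models.py and may differ.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Segment:
    # Field names are assumptions for illustration only.
    start: float      # seconds from the start of the source video
    end: float        # seconds from the start of the source video
    label: str = ""   # e.g. the AI's description; empty for manual ranges


def save_segments(segments: list[Segment], path: Path) -> None:
    """Write the ordered segment list as a JSON array."""
    payload = [asdict(segment) for segment in segments]
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_segments(path: Path) -> list[Segment]:
    """Read segments back in the order they were written."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    return [Segment(**item) for item in payload]
```

Because downstream stages would only ever read this file, a manual run and an auto run can share the same clip and concat code.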
| Stage | Status | What it does |
|---|---|---|
| probe | ✅ | video → metadata (duration, resolution, fps) |
| extract_audio | ✅ | video → 16kHz mono wav |
| transcribe | ✅ | wav → transcript.json (Groq Whisper) |
| audio_energy | 🔲 stub | wav → peaks.json |
| build_candidates | 🔲 stub | merge signals → candidates.json |
| select | ✅ | → segments.json (manual ranges or Gemini auto) |
| clip | ✅ | video + segments.json → individual clips |
| concat | ✅ | clips → reel.mp4 |
scene_detect, subtitles, vertical_reframe, external_context, music
are future layers and are not present yet.
- Python 3.11+
- ffmpeg and ffprobe on your PATH (checked at startup with a clear error if missing).
  - Windows: winget install Gyan.FFmpeg, or a build from https://www.gyan.dev/ffmpeg/builds/ with its bin/ folder on PATH.
- Software encoding only (libx264). No GPU encoders.
- API keys (auto mode only): a Groq key for transcription and a Gemini key for selection. Manual mode needs neither.
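
The startup check for ffmpeg and ffprobe could look roughly like the sketch below. It uses only the standard library; the function names and the error wording are hypothetical and not taken from media/ffmpeg.py.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")) -> list[str]:
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def require_ffmpeg() -> None:
    """Exit with a readable message if ffmpeg or ffprobe cannot be found."""
    missing = missing_tools()
    if missing:
        raise SystemExit(
            f"Missing required tool(s): {', '.join(missing)}. "
            "Install ffmpeg and make sure its bin/ folder is on PATH."
        )
```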
cd highlight-extractor
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

For auto mode, copy .env.example to .env and fill in your keys:
GROQ_API_KEY=...
GEMINI_API_KEY=...
.env is gitignored — keys never get committed.
python cli.py "C:\path\to\source.mp4"

You can also omit the path and you'll be prompted for it. The first question is what to make:
- Highlight reel — cut clips (manual or AI), optionally subtitled (below).
- Subtitle a whole video: generate learning subtitles over the entire video. Outputs a .ass sidecar by default (load it in VLC/mpv, no re-encode, toggleable) or burns it in. Romaji-only is fully offline/free.
For a highlight reel, the remaining questions are:
- Source video path (or pass it as the CLI arg above)
- Selection mode: manual or auto
  - (manual) Ranges, e.g. 00:30-00:45, 01:10-01:25 or one per line. Accepts MM:SS-MM:SS and HH:MM:SS-HH:MM:SS.
  - (auto) How many highlights, an optional steer for the AI (e.g. "focus on funny moments"), and a max seconds per highlight (0 = no cap).
- Padding seconds added around each cut (default 0.5)
- Subtitles? yes/no: burn Japanese-learning subtitles onto the reel. If yes, you're also asked whether the source already shows its own subtitles; if so, ours move to the top and show only Japanese + romaji (so they don't collide with the source's bottom subs).
- Output filename (default reel.mp4, written next to the source)
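
To make the manual range format concrete, here is a minimal parsing sketch for input such as 00:30-00:45, 01:10-01:25. The function names are hypothetical, and the tool's real parser may handle more edge cases (fractional seconds, for instance) than this does.

```python
import re


def parse_timestamp(text: str) -> float:
    """Convert MM:SS or HH:MM:SS into seconds."""
    parts = [int(part) for part in text.strip().split(":")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected MM:SS or HH:MM:SS, got {text!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return float(seconds)


def parse_ranges(text: str) -> list[tuple[float, float]]:
    """Parse ranges separated by commas or newlines into (start, end) pairs."""
    ranges = []
    for chunk in re.split(r"[,\n]", text):
        chunk = chunk.strip()
        if not chunk:
            continue
        start_text, sep, end_text = chunk.partition("-")
        if not sep:
            raise ValueError(f"expected START-END, got {chunk!r}")
        ranges.append((parse_timestamp(start_text), parse_timestamp(end_text)))
    return ranges
```

For example, `parse_ranges("00:30-00:45, 01:10-01:25")` yields `[(30.0, 45.0), (70.0, 85.0)]`.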
Built for studying Japanese. Romaji is always shown, spaced by word, not
syllable (kudasai, not ku da sai), so you can hear where words begin and end.
You choose which other lines appear:
- Romaji + English
- Japanese + romaji + English
- Japanese + romaji
- Romaji only
English is only meaningful for Japanese audio. For other-language audio, the Japanese line is a translation (so you learn how to say it), with romaji.
How it's produced:
- Romaji is generated offline (Janome word-segmentation + pykakasi), so the romaji line is free, instant, unlimited, and never depends on the network — ideal for subtitling whole episodes. Romaji-only on Japanese audio needs no API at all.
- Translation uses an LLM with fallback: Gemini first, then Groq (llama-3.3-70b-versatile) if Gemini is overloaded, so a Gemini 503 spike won't stop you. Both use keys you already have.
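
The fallback behaviour can be sketched as a loop over providers. This is only an illustration of the pattern: `translate_with_fallback` is a hypothetical name, and each provider is passed in as a plain callable rather than a real Gemini or Groq client, so no API signatures are assumed.

```python
from typing import Callable, Sequence


def translate_with_fallback(
    text: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, translation)."""
    errors = []
    for name, translate in providers:
        try:
            return name, translate(text)
        except Exception as exc:  # e.g. an HTTP 503 from an overloaded service
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all translation providers failed: " + "; ".join(errors))
```

In use, the list would hold the Gemini caller first and the Groq caller second, so the second is only reached when the first raises.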
For clean output, use a raw source without burned-in subtitles; for sources that already show subs, answer "yes" to the top-position prompt so ours don't collide.
In auto mode the tool extracts audio, transcribes it, and asks Gemini to pick the highlights — then echoes the resolved config and the final segment list (with the AI's labels) and asks you to confirm before rendering.
| Flag | Default | Meaning |
|---|---|---|
| --padding | 0.5 | Seconds added around each cut |
| --max-clip | 0 | Auto mode: cap each highlight's length in seconds (0 = no cap) |
| --output | reel.mp4 | Output filename |
| --keep-temp | off | Preserve workdir/<task_id>/ for debugging |
| --fresh | off | Auto mode: ignore the cached transcript and re-transcribe |
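
For reference, the flags in the table map naturally onto an argparse parser like the one below. Whether cli.py actually uses argparse is an assumption; the flag names and defaults are the ones listed above, and the optional positional `source` mirrors the "omit the path and you'll be prompted" behaviour.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="cli.py")
    parser.add_argument("source", nargs="?",
                        help="source video; prompted for if omitted")
    parser.add_argument("--padding", type=float, default=0.5,
                        help="seconds added around each cut")
    parser.add_argument("--max-clip", type=float, default=0,
                        help="auto mode: cap per highlight in seconds (0 = no cap)")
    parser.add_argument("--output", default="reel.mp4",
                        help="output filename")
    parser.add_argument("--keep-temp", action="store_true",
                        help="preserve the per-run workdir for debugging")
    parser.add_argument("--fresh", action="store_true",
                        help="auto mode: ignore the cached transcript")
    return parser
```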
The first auto run on a video transcribes it and caches the result under
cache/ (gitignored), keyed on the file's path + size + modified-time. Later
runs on the same video reuse the transcript — so you can re-run with a
different highlight count, steer, or --max-clip without spending Groq quota or
waiting on transcription again. Edit/replace the video and the key changes
automatically; pass --fresh to force a re-transcribe.
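
A cache key built from path + size + modified-time might be derived as in this sketch. The hashing scheme, the key length, and the function names are assumptions; only the three inputs and the cache/ folder come from the description above.

```python
import hashlib
from pathlib import Path


def transcript_cache_key(video: Path) -> str:
    """Derive a key from the file's absolute path, size, and modified time."""
    stat = video.stat()
    identity = f"{video.resolve()}|{stat.st_size}|{stat.st_mtime_ns}"
    # Hashing keeps the key filesystem-safe; the 16-char length is arbitrary.
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()[:16]


def transcript_cache_path(video: Path, cache_dir: Path = Path("cache")) -> Path:
    """Location where the cached transcript for this video would be stored."""
    return cache_dir / f"{transcript_cache_key(video)}.json"
```

Since the key never reads the file contents, looking it up stays cheap even for multi-gigabyte sources.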
- Frame-accurate cuts: every segment is re-encoded (never -c copy). Stream-copy only cuts on keyframes, which drifts boundaries by seconds.
- Glitch-free concat: every clip is normalized to identical parameters (libx264 / yuv420p, source resolution + fps, AAC 44.1kHz stereo), then joined with the concat demuxer (-c copy, safe because the clips already match).
- Validation: ranges are clamped to [0, duration], empty ranges dropped, and after padding, overlapping/adjacent segments are merged.
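
The validation rule can be sketched as three passes over (start, end) pairs. This is an approximation under stated assumptions: the function name is hypothetical, padded ranges are re-clamped to the video, and ranges are sorted by start time before merging, which the real tool may or may not do.

```python
def normalize_ranges(
    ranges: list[tuple[float, float]],
    duration: float,
    padding: float = 0.5,
) -> list[tuple[float, float]]:
    """Clamp, drop empty ranges, pad, then merge overlapping/adjacent ranges."""
    # 1. Clamp to [0, duration] and drop ranges that end up empty.
    clamped = []
    for start, end in ranges:
        start = max(0.0, min(start, duration))
        end = max(0.0, min(end, duration))
        if end > start:
            clamped.append((start, end))

    # 2. Pad each side, staying inside the video (assumption), and sort.
    padded = sorted(
        (max(0.0, start - padding), min(duration, end + padding))
        for start, end in clamped
    )

    # 3. Merge any range that starts at or before the previous range's end.
    merged: list[tuple[float, float]] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging after padding matters: two cuts 0.5 s apart do not overlap as typed, but do once each side gains 0.5 s, and rendering them separately would repeat footage in the reel.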
Each run uses workdir/<task_id>/ for intermediate clips and segments.json.
It is removed at the end unless you pass --keep-temp. The folder is
gitignored.
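
The per-run folder lifecycle could be expressed as a context manager like the one below. How task ids are actually generated is not stated, so the `uuid`-based id and the `run_workdir` name are assumptions for illustration.

```python
import shutil
import uuid
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def run_workdir(root: Path = Path("workdir"), keep_temp: bool = False):
    """Create workdir/<task_id>/ and remove it on exit unless keep_temp is set."""
    task_id = uuid.uuid4().hex[:8]  # assumption: any unique id would do
    path = root / task_id
    path.mkdir(parents=True)
    try:
        yield path
    finally:
        if not keep_temp:
            shutil.rmtree(path, ignore_errors=True)
```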
highlight-extractor/
    cli.py                    # interactive config + orchestration
    config.py                 # Config dataclass
    models.py                 # Segment dataclass + segments.json I/O
    media/
        ffmpeg.py             # subprocess wrappers, presence check, probe
    pipeline/
        probe.py              # ✅
        extract_audio.py      # ✅
        transcribe.py         # ✅
        audio_energy.py       # 🔲 stub
        build_candidates.py   # 🔲 stub
        select.py             # ✅ manual ranges or Gemini auto
        clip.py               # ✅
        concat.py             # ✅
    workdir/                  # per-run temp (gitignored)