CutClaw teaser

🦞CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Beijing Jiaotong University · GVC Lab, Great Bay University · Tencent ARC Lab

🎬 Your personal editor for turning hours of footage into cinematic montages.

Hours-Long Footage · Music Beat Sync · Instruction Following · One-Click Editing · LiteLLM Powered

Overview · Roadmap · Features · Gallery · Quick Start · Troubleshooting · Citation · Star History


Demo.mp4

💡 Overview

CutClaw is an end-to-end editing system for long-form footage + music.

It first deconstructs raw video/audio into structured captions, then uses a multi-agent pipeline to plan shots (shot_plan), select clip timestamps (shot_point), and validate final quality before rendering.
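To make the hand-off between stages concrete, the sketch below shows what `shot_plan` and `shot_point` records and a validation pass *could* look like. The field names here are illustrative assumptions, not CutClaw's actual schema.

```python
# Hypothetical sketch of the artifacts passed between agents.
# Field names are illustrative assumptions, not CutClaw's real schema.
shot_plan = [
    {"shot_id": 1, "description": "Hero enters the city at night", "mood": "tense"},
    {"shot_id": 2, "description": "Close-up reaction shot", "mood": "tense"},
]

shot_point = [
    {"shot_id": 1, "start": 412.0, "end": 415.5},  # timestamps in seconds
    {"shot_id": 2, "start": 918.2, "end": 921.4},
]

def validate(plan, points):
    """Toy final check: every planned shot has a selected clip with positive duration."""
    by_id = {p["shot_id"]: p for p in points}
    return all(
        s["shot_id"] in by_id and by_id[s["shot_id"]]["end"] > by_id[s["shot_id"]]["start"]
        for s in plan
    )
```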

CutClaw Pipeline


🗺️ Roadmap

We warmly welcome new issues and ideas from the community. If you have suggestions, please open an issue. Your feedback will help shape our future plans and be the fuel that helps this project take off. 🔥

Short-Term Goals

What we're building next for faster, cheaper, and more expressive video editing.

  • 🧩 ARC-Chapter Integration
    Bring in ARC-Chapter to reduce the cost of long-form footage deconstruction.
  • 💸 Low-Cost Mode
    Add a budget-friendly mode that proactively reads only relevant footage instead of fully processing all source material.
  • 🎙️ Talking-Head + Visual Mixing
    Introduce hybrid editing logic that coordinates narration-driven clips with supporting visual footage.

Long-Term Goals

Broader product and ecosystem directions for the next stage of CutClaw.

  • ✍️ Playwriter Upgrade
    Expand the Playwriter with richer editing patterns and more diverse visual storytelling methods.
  • 🔌 Claude Code MCP Support
    Adapt CutClaw to work smoothly within Claude Code MCP workflows.
  • 🌐 Online Service Interface
    Build a web-based service interface for easier access and deployment.

✨ Key Features

🎬 One-Click Deconstruction

Long-Form Processing

Effortlessly transforms hours-long raw video and audio into structured, searchable assets with a single click.

🎯 Instruction Control

Text to Edit

Requires only one text instruction to steer the editing style—easily generating fast-paced character montages or slow-paced emotional narratives.

📱 Smart Auto-Cropping

Smart Adaptation

Content-aware cropping automatically identifies core subjects and adjusts aspect ratios to fit various social platforms.
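As a rough illustration of the geometry behind content-aware cropping (not CutClaw's actual implementation), the helper below computes a crop window of a target aspect ratio centred on a detected subject and clamped to the frame:

```python
def crop_box(frame_w, frame_h, subject_cx, ratio=(9, 16)):
    """Compute a crop window of the target aspect ratio centred on the subject's
    x-coordinate, clamped so it stays inside the frame. Assumes a vertical crop
    (full height, reduced width). Illustrative sketch only."""
    rw, rh = ratio
    crop_w = min(frame_w, int(frame_h * rw / rh))
    x0 = max(0, min(subject_cx - crop_w // 2, frame_w - crop_w))
    return x0, 0, crop_w, frame_h

crop_box(1920, 1080, 960)  # → (657, 0, 607, 1080): a 9:16 window centred in a 1080p frame
```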

🎵 Music-Aware Sync

Audio Sync

Extracts musical beats and energy signals to build rhythm-aware cuts that perfectly match the music's pacing.
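A toy sketch of rhythm-aware segmentation: group detected beat timestamps into cut segments whose length stays within a min/max window, echoing the `AUDIO_MIN_SEGMENT_DURATION` / `AUDIO_MAX_SEGMENT_DURATION` settings below. This greedy grouping is an illustration, not CutClaw's actual algorithm.

```python
def segment_beats(beat_times, min_dur=3.0, max_dur=5.0):
    """Greedily group beat timestamps into cut segments of length
    [min_dur, max_dur]. If no beat lands inside the window, force a
    cut at max_dur. Illustrative sketch only."""
    segments, start = [], beat_times[0]
    for t in beat_times[1:]:
        if min_dur <= t - start <= max_dur:
            segments.append((start, t))   # cut on this beat
            start = t
        elif t - start > max_dur:
            segments.append((start, start + max_dur))  # forced cut, off-beat
            start = start + max_dur
    return segments
```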


🖼️ Gallery (remember to turn on the audio)

dark_knight.mp4
kyoto.mp4
paprika.mp4
chongqing.mp4
interstellar.mp4
naruto.mp4
lalaland.mp4
swiss.mp4
titanic.mp4

🚀 Quick Start

1. Install

git clone https://github.com/GVCLab/CutClaw.git
cd CutClaw
conda create -n CutClaw python=3.12
conda activate CutClaw
pip install -r requirements.txt

We strongly recommend the GPU-accelerated Decord/NVDEC build for faster video decoding; build it from source.

2. Add your files

resource/
├── video/      ← put your .mp4 / .mkv here
├── audio/      ← put your .mp3 / .wav here
└── subtitle/   ← optional .srt (skips ASR, saves time)

3. Run

UI (recommended)

streamlit run app.py

Then open http://localhost:8501 in your browser. (If it does not load, try http://127.0.0.1:8501.)

CutClaw UI demo

Place your footage in the paths above; you can then select those files directly in the UI.

Model selection guidance:

  • Video model

    • Role: shot/scene understanding and visual captioning.
    • Recommended: Gemini-3, Qwen3.5, GPT-5.3
  • Audio model

    • Role: ASR plus music-structure parsing (beat/downbeat, pitch, energy) for music-aware segmentation.
    • Recommended: Gemini-3
  • Agent model

    • Role: drives the Screenwriter + Editor + Reviewer loop to generate shot_plan and shot_point.
    • Recommended: MiniMax-2.7, Kimi-2.5, Claude-4.5

We use LiteLLM as the API gateway. Model names follow the `provider/model` convention, e.g. `openai/MiniMax-2.7`, which calls the given model via the OpenAI protocol; see the LiteLLM documentation for details.
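The `provider/model` convention can be checked with a tiny helper before wiring a name into the UI. This parser is an illustration of the naming scheme, not part of CutClaw:

```python
def parse_model_name(name: str) -> tuple[str, str]:
    """Split a LiteLLM-style model name 'provider/model' into its two parts.
    Illustrative helper; not part of CutClaw itself."""
    provider, sep, model = name.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {name!r}")
    return provider, model

parse_model_name("openai/MiniMax-2.7")  # → ("openai", "MiniMax-2.7")
```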

CLI (advanced)
python local_run.py \
  --Video_Path "resource/video/xxxx.mp4" \
  --Audio_Path "resource/audio/xxxx.mp3" \
  --Instruction "xxxx"
Common config overrides

Any src/config.py parameter can be overridden with --config.PARAM_NAME VALUE.

| Parameter | Default | Effect |
| --- | --- | --- |
| `VIDEO_PATH` | `"resource/video/The_Dark_Knight.mkv"` | Default input video path (remembered by the UI) |
| `AUDIO_PATH` | `"resource/audio/Way_Down_We_Go.mp3"` | Default input audio path (remembered by the UI) |
| `INSTRUCTION` | `"Joker's crazy that want to change the world."` | Default editing instruction prompt |
| `ASR_BACKEND` | `"litellm"` | ASR engine (`litellm` cloud or `whisper_cpp` local) |
| `VIDEO_FPS` | `2` | Sampling FPS for preprocessing |
| `MAIN_CHARACTER_NAME` | `"Joker"` | Protagonist name for character-focused edits |
| `AUDIO_MIN_SEGMENT_DURATION` | `3.0` | Minimum beat segment duration (seconds) |
| `AUDIO_MAX_SEGMENT_DURATION` | `5.0` | Maximum beat segment duration (seconds) |
| `AUDIO_DETECTION_METHODS` | `["downbeat", "pitch", "mel_energy"]` | Audio keypoint detection methods |
| `PARALLEL_SHOT_MAX_WORKERS` | `4` | Parallel shot selection workers |
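The `--config.PARAM_NAME VALUE` convention can be sketched as a small argument scanner. This is a minimal illustration of the override syntax, not `local_run.py`'s actual parser:

```python
def parse_overrides(argv: list[str]) -> dict[str, str]:
    """Collect '--config.NAME VALUE' pairs from a CLI argument list.
    Minimal sketch of the override convention; not local_run.py's parser."""
    overrides, i = {}, 0
    while i < len(argv) - 1:
        if argv[i].startswith("--config."):
            overrides[argv[i][len("--config."):]] = argv[i + 1]
            i += 2
        else:
            i += 1
    return overrides
```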

Example:

python local_run.py \
  --Video_Path "resource/video/xxxx.mp4" \
  --Audio_Path "resource/audio/xxxx.mp3" \
  --Instruction "xxxx" \
  --config.MAIN_CHARACTER_NAME "Batman" \
  --config.VIDEO_FPS 2 \
  --config.AUDIO_TOTAL_SHOTS 50

Then render manually:

python render/render_video.py \
  --shot-plan  "Output/<video_audio>/shot_plan_*.json" \
  --shot-json  "Output/<video_audio>/shot_point_*.json" \
  --video  "resource/video/xxxx.mp4" \
  --audio  "resource/audio/xxxx.mp3" \
  --output "output/final.mp4" \
  --crop-ratio "9:16" \
  --no-labels --render-hook-dialogue

🛠️ Troubleshooting

Very slow runtime

  1. API latency — the pipeline sends a large number of concurrent requests to vision/language APIs. Speed is heavily dependent on your API provider's response time and rate limits.
  2. First-run Footage Deconstruction — the first time you process a video, shot detection, captioning, ASR, and scene analysis all run from scratch. This is a one-time cost per video; subsequent edits with the same footage reuse the cached results and are much faster.
  3. GPU acceleration — a CUDA-capable GPU significantly speeds up video decoding and encoding. We recommend building Decord with NVDEC support (see Install section).
  4. Video codec compatibility — if the pipeline appears to hang during video-related steps, the source video's encoding may be the cause. In our testing, videos encoded with libx264 worked reliably.
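If a problematic source hangs the pipeline, re-encoding it to libx264 first is a reasonable workaround. The helper below only builds a standard ffmpeg command line (stream-copying the audio); the file paths are placeholders, and you run the printed command yourself:

```python
import shlex

def build_reencode_cmd(src: str, dst: str, crf: int = 18) -> list[str]:
    """Build an ffmpeg command that re-encodes video to libx264 while
    stream-copying audio. crf=18 is a common near-lossless setting."""
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-crf", str(crf), "-c:a", "copy", dst]

# Placeholder paths for illustration:
print(shlex.join(build_reencode_cmd("resource/video/input.mkv", "resource/video/input_h264.mp4")))
```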

⭐ Citation

If you find CutClaw useful for your research, please cite our work using the following BibTeX:

@article{cutclaw,
  title={CutClaw: Agentic Hours-Long Video Editing via Music Synchronization},
  author={Shifang Zhao and Yihan Hu and Ying Shan and Yunchao Wei and Xiaodong Cun},
  journal={arXiv preprint arXiv:2603.29664},
  year={2026}
}

📈 Star History

Star History Chart
