video-evaluator is a standalone video review and understanding pack for
Codex, Claude Code, and other coding-agent workflows.
Its job is not to render videos or replace domain-specific QA. Its job is to take a video or a video-output bundle, extract grounded evidence, and give an agent a shared way to inspect what happened.
This repo exists as the shared analyzer layer for the 45ck video tooling repos. Product-specific runtimes still own their own capture, generation, publishing, and policy decisions, but reusable video evidence belongs here.
- Purpose: extract storyboard evidence from videos and turn it into agent-usable review artifacts
- Best fit today: UI-heavy product demos, walkthroughs, and internal app recordings
- Not solved yet: arbitrary-video understanding and exact click-by-click timeline reconstruction
- Output shape: storyboard.manifest.json, storyboard.ocr.json, storyboard.transitions.json, storyboard.summary.json, timeline.evidence.json, video.shots.json, segment.evidence.json, layout-safety.report.json
- Maturity: experimental but benchmarked, with explicit low-signal reporting instead of pretending weak OCR is semantic proof
This repo is for:
- extracting storyboard frames from videos
- biasing those frames toward likely change points
- OCRing the extracted frames
- inferring coarse transitions between frames
- generating summary artifacts from OCR evidence
- fusing shot, storyboard, OCR, transition, and timeline evidence by segment
- extracting per-shot storyboard frames when global sampling leaves gaps
- detecting declared layout overlaps and caption safe-zone collisions for generated video systems that emit layout annotations
- packaging review prompts for agent use
- comparing two video bundles or runs
- materializing an installable skill pack for Codex or Claude Code
This repo is not:
- a video generator
- a full multimodal model runtime
- a guaranteed semantic video decompiler
- a replacement for product-specific evaluation logic
What it does reasonably well today:
- turns local videos into structured storyboard artifacts
- works well enough for first-pass review of UI-heavy product videos
- extracts OCR, basic layout regions, coarse transition structure, and filtered UI evidence
- extracts coarse shot boundaries and representative frames for longer videos
- checks layout-annotations.v1 sidecars for overlapping cards, diagrams, captions, and UI layers
- fuses available evidence into per-segment review maps
- can produce shot-aware storyboard frames so each segment gets coverage
- packages grounded prompts so agents can review runs from real artifacts
- gives multiple repos a common evidence format
What it does not do well enough yet:
- exact action-by-action timeline reconstruction
- reliable understanding of arbitrary public YouTube videos
- robust app/view/capability extraction across noisy OCR
- strong semantic understanding of gameplay, cooking, vlogs, sports, or talks
The benchmark work in this repo exists to keep those limits honest.
video-evaluator is currently consumed by:
- 45ck/demo-machine: uses the package as a runtime dependency for demo-machine analyze, review bundles, package review prompts, storyboard and segment evidence, layout-safety reports, visual diff primitives, and demo capture evidence routing.
- 45ck/content-machine: uses the evaluator as the shared analyzer owner for promoted short-form examples and demo-video audits. The content repo keeps skills, archetype guidance, render policy, caption style guidance, and short-form publishing decisions local.
The boundary is intentional:
- Put reusable facts here: media probe data, quality gates, caption artifacts, OCR/sync checks, technical video review, contact sheets, visual diffs, layout-safety reports, source-media signals, timeline evidence, segment evidence, and review prompt packaging.
- Keep repo-specific behavior in the producer repo: browser capture and demo semantics in demo-machine; short-form archetypes, skills, flows, Remotion choices, and publish policy in content-machine.
- Consumer repos may wrap these tools, but cross-repo artifacts should use the contracts in Cross-Repo Compatibility.
The pipeline is:
- video-intake or review-bundle
- storyboard-extract
- optionally video-shots
- optionally segment-storyboard
- storyboard-ocr
- storyboard-transitions
- optionally segment-evidence
- storyboard-understand
- package-review-prompt or compare-bundles
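The stage order above can be sketched as a small planner. This is an illustration only, not code from this repo: the stage names come from the list, while `PlanOptions`, `planPipeline`, and the three boolean flags are hypothetical names introduced here.

```typescript
// Stage names are taken from the pipeline list in this README.
type Stage =
  | "video-intake" | "review-bundle" | "storyboard-extract" | "video-shots"
  | "segment-storyboard" | "storyboard-ocr" | "storyboard-transitions"
  | "segment-evidence" | "storyboard-understand" | "package-review-prompt"
  | "compare-bundles";

// Hypothetical options object; the real tools take JSON requests on stdin.
interface PlanOptions {
  entry: "video-intake" | "review-bundle";
  shots?: boolean;             // include the optional video-shots stage
  segmentStoryboard?: boolean; // include the optional segment-storyboard stage
  segmentEvidence?: boolean;   // include the optional segment-evidence stage
  finish: "package-review-prompt" | "compare-bundles";
}

// Returns the stages in the order the README lists them.
function planPipeline(options: PlanOptions): Stage[] {
  const stages: Stage[] = [options.entry, "storyboard-extract"];
  if (options.shots) stages.push("video-shots");
  if (options.segmentStoryboard) stages.push("segment-storyboard");
  stages.push("storyboard-ocr", "storyboard-transitions");
  if (options.segmentEvidence) stages.push("segment-evidence");
  stages.push("storyboard-understand", options.finish);
  return stages;
}
```

A plan with no optional stages has six entries; enabling all three optional stages yields nine.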
In plain English:
- Start from either a raw local video or an existing run/output folder.
- Extract a small set of frames across the video.
- Bias some of those frames toward likely changes.
- OCR the frames into text lines with confidence and coarse regions.
- Filter that OCR into likely UI evidence versus subtitle-like or noisy text.
- Infer whether frames look like screen changes, same-screen changes, dialog changes, or scroll changes.
- Normalize any existing timestamps, subtitles, or event logs into timeline.evidence.json.
- Optionally extract coarse shot boundaries into video.shots.json.
- Optionally extract one or more frames per shot with segment-storyboard.
- Optionally run layout-safety-review when a generated video ships layout annotations.
- Optionally fuse per-shot evidence into segment.evidence.json.
- Summarize the artifact into a form an agent can actually use.
agent/
  run-tool.mjs                    JSON-stdio tool runner for installed packs
benchmarks/
  youtube-diverse-queries.json    Public benchmark manifest
docs/
  *.md                            Operator docs, contracts, roadmap, releases
scripts/
  bench/                          Benchmark runners
  harness/                        Local CLI entrypoints
skills/
  ...                             Installable skill definitions and examples
src/
  core/                           Artifact logic
  harness/                        Tool wrappers
  index.ts                        Public exports
tests/
  *.test.ts                       Unit tests
- Documentation index
- Architecture
- Tool reference
- Operator workflows
- Artifact contracts
- YouTube evaluation
- Roadmap
- Release process
- Support
- Node.js >=20.6.0
- ffmpeg and ffprobe on PATH
- bundled eng.traineddata or network access for tesseract.js
- for the YouTube benchmark:
  - python3
  - pip
  - optionally Firefox cookies for more reliable downloads
- Contributing guide
- Code of conduct
- Code owners
- Security policy
- Support policy
- Changelog
- CI workflow
- Latest release
Install and verify the repo:
npm install
npm run typecheck
npm test
npm run build
Run the simplest local pipeline from a raw video:
cat <<'JSON' | node --import tsx scripts/harness/storyboard-extract.ts
{
"videoPath": "/path/to/video.mp4",
"frameCount": 8,
"samplingMode": "hybrid"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/storyboard-ocr.ts
{
"storyboardDir": "/path/to/video-evaluator-storyboard"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/storyboard-transitions.ts
{
"storyboardDir": "/path/to/video-evaluator-storyboard"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/storyboard-understand.ts
{
"storyboardDir": "/path/to/video-evaluator-storyboard"
}
JSON
If you already have a run folder and want a review-oriented entrypoint:
cat <<'JSON' | node --import tsx scripts/harness/review-bundle.ts
{
"outputDir": "../demo-machine/output/todo-app/20260426-000000-000"
}
JSON
Lists the skill surface shipped by this repo.
Use when:
- you want an agent to discover what this repo exposes
- you are installing the pack into another workspace
Copies the repo's built runtime, skills, and metadata into a target directory and installs runtime dependencies there.
Example:
cat <<'JSON' | node --import tsx scripts/harness/install-skill-pack.ts
{
"targetDir": ".video-evaluator"
}
JSON
Installed packs are intended to be called through:
cat <<'JSON' | node ./.video-evaluator/agent/run-tool.mjs package-review-prompt
{
"outputDir": "./output/storyboard",
"focus": ["what the app appears to do", "flow progression"]
}
JSON
Normalizes a local video artifact or bundle into a shared evaluation shape.
Use when:
- you need a common artifact contract before review
- you are unifying outputs from different repos
Extracts a small set of frames from a video.
Important options:
- frameCount
- samplingMode
- changeThreshold
- format
Example:
cat <<'JSON' | node --import tsx scripts/harness/storyboard-extract.ts
{
"videoPath": "/path/to/video.mp4",
"frameCount": 8,
"samplingMode": "hybrid",
"changeThreshold": 0.08,
"format": "jpg"
}
JSON
Extracts coarse scene-change segments from a local video and can write one representative frame per segment.
Use when:
- a longer video needs a quick part map before deeper OCR review
- you want stable timestamps for "what changed where" without pretending to reconstruct every action
Example:
cat <<'JSON' | node --import tsx scripts/harness/video-shots.ts
{
"videoPath": "/path/to/video.mp4",
"sceneThreshold": 0.08,
"extractRepresentativeFrames": true
}
JSON
Fuses video.shots.json with available storyboard, OCR, transition, and
timeline artifacts.
Use when:
- you want one ordered per-segment map of the evidence
- you need to know which parts are usable, weak, or empty before reviewing
Example:
cat <<'JSON' | node --import tsx scripts/harness/segment-evidence.ts
{
"outputDir": "/path/to/run-or-video-folder",
"maxTextItemsPerSegment": 8
}
JSON
Extracts one to three frames per shot segment and writes a standard
storyboard.manifest.json under segment-storyboard/.
Use when:
- segment.evidence.json shows too many empty segments
- global storyboard sampling skipped short shots or dense cuts
Example:
cat <<'JSON' | node --import tsx scripts/harness/segment-storyboard.ts
{
"outputDir": "/path/to/run-or-video-folder",
"framesPerSegment": 1,
"format": "jpg"
}
JSON
Runs OCR over extracted frames and writes storyboard.ocr.json.
The OCR artifact tries to preserve:
- line text
- line confidence
- line bounding boxes
- coarse top/middle/bottom region labels
- filtered semanticLines that are treated as likely UI evidence
- per-frame quality so downstream tools can abstain on low-signal OCR
Important distinction:
- lines = raw OCR lines that passed minConfidence
- semanticLines = filtered subset intended for semantic extraction
- quality.status = usable, weak, or reject
That means later stages no longer assume every OCR line is product UI. Subtitle-heavy or noisy frames can now be downgraded or rejected instead of being treated as first-class evidence.
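A downstream consumer might apply that distinction roughly as follows. This is a minimal sketch under assumptions: only `lines`, `semanticLines`, and `quality.status` are named in this README, so the frame shape, the `index` field, and the `collectUiEvidence` helper are hypothetical.

```typescript
type OcrQualityStatus = "usable" | "weak" | "reject";

interface OcrLine {
  text: string;
  confidence: number;
}

// Hypothetical per-frame shape; the real contract lives in docs/artifact-contracts.md.
interface OcrFrame {
  index: number;
  lines: OcrLine[];         // raw OCR lines that passed minConfidence
  semanticLines: OcrLine[]; // filtered subset treated as likely UI evidence
  quality: { status: OcrQualityStatus };
}

interface UiEvidence {
  texts: string[];      // semantic text from frames that were not rejected
  weakFrames: number[]; // frames kept but flagged as weak
  abstained: number[];  // frames skipped because OCR quality was "reject"
}

// Collects semantic text while abstaining on rejected frames.
function collectUiEvidence(frames: OcrFrame[]): UiEvidence {
  const evidence: UiEvidence = { texts: [], weakFrames: [], abstained: [] };
  for (const frame of frames) {
    if (frame.quality.status === "reject") {
      evidence.abstained.push(frame.index);
      continue;
    }
    if (frame.quality.status === "weak") evidence.weakFrames.push(frame.index);
    // Only the filtered lines count as evidence; raw `lines` are ignored here.
    for (const line of frame.semanticLines) evidence.texts.push(line.text);
  }
  return evidence;
}
```

The point of the sketch is that raw `lines` never feed semantic extraction directly, and that rejected frames are recorded rather than silently dropped.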
Example:
cat <<'JSON' | node --import tsx scripts/harness/storyboard-ocr.ts
{
"storyboardDir": "./output/storyboard",
"minConfidence": 45
}
JSON
Infers coarse transitions between storyboard frames.
Current transition kinds:
- screen-change
- state-change
- scroll-change
- dialog-change
- uncertain
The artifact also carries fields like:
- overlapRatio
- sharedLineCount
- OCR-derived transition evidence
Example:
cat <<'JSON' | node --import tsx scripts/harness/storyboard-transitions.ts
{
"storyboardDir": "./output/storyboard"
}
JSON
Builds a higher-level summary from OCR and transition artifacts.
The current summary includes:
- appNames
- views
- ocrQuality
- sampling
- interactionSegments
- likelyFlow
- likelyCapabilities
- textDominance
- openQuestions
textDominance is there to help interpret whether OCR looks more like:
- UI labels
- subtitle / narration text
- mixed evidence
ocrQuality is there to answer a different question:
- did this storyboard actually contain enough usable UI evidence to trust semantic extraction at all?
Example:
cat <<'JSON' | node --import tsx scripts/harness/storyboard-understand.ts
{
"storyboardDir": "./output/storyboard"
}
JSON
Runs the repo's review-oriented bundle path so an agent can inspect an existing output folder without manually stitching every stage together.
Compares two artifact bundles or runs.
Use when:
- you want to compare revisions
- you want to inspect a before/after video pipeline change
Builds a grounded prompt from the artifacts so an agent can review the video with actual evidence, not speculation.
The formal compatibility notes live in docs/artifact-contracts.md. Typical files:
- storyboard.manifest.json
- storyboard.ocr.json
- storyboard.transitions.json
- storyboard.summary.json
- timeline.evidence.json
- video.shots.json
- segment.evidence.json
Short version:
- storyboard.manifest.json
  - extracted frame list, sampling reasons, and change-point diagnostics
- storyboard.ocr.json
  - raw OCR lines, filtered semanticLines, boxes, regions, and quality
- storyboard.transitions.json
  - coarse frame-to-frame transition inference
- storyboard.summary.json
  - agent-facing interpretation, including textDominance and ocrQuality
- timeline.evidence.json
  - normalized timestamped evidence from timestamps.json, events.json, and subtitles.vtt
- video.shots.json
  - coarse scene-change segments and optional representative frame paths
- segment.evidence.json
  - per-shot evidence map joining storyboard frames, OCR, transitions, and timeline items with usable, weak, or empty status
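The usable, weak, or empty status in segment.evidence.json lends itself to a quick coverage check before review. The sketch below is illustrative only: the `SegmentEvidenceItem` shape, the `summarizeSegments` helper, and the 0.3 default for "too many empty segments" are assumptions, not values defined by this repo.

```typescript
type SegmentStatus = "usable" | "weak" | "empty";

// Hypothetical minimal view of one entry in segment.evidence.json.
interface SegmentEvidenceItem {
  index: number;
  status: SegmentStatus;
}

interface SegmentCoverage {
  counts: Record<SegmentStatus, number>;
  emptyRatio: number;
  suggestSegmentStoryboard: boolean;
}

// Counts segments per status and flags runs where empty segments dominate.
// maxEmptyRatio is an arbitrary illustrative cutoff.
function summarizeSegments(
  segments: SegmentEvidenceItem[],
  maxEmptyRatio: number = 0.3,
): SegmentCoverage {
  const counts: Record<SegmentStatus, number> = { usable: 0, weak: 0, empty: 0 };
  for (const segment of segments) counts[segment.status] += 1;
  const emptyRatio = segments.length === 0 ? 0 : counts.empty / segments.length;
  return { counts, emptyRatio, suggestSegmentStoryboard: emptyRatio > maxEmptyRatio };
}
```

When the suggestion flag is set, rerunning with per-shot frames via segment-storyboard is the natural next step.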
samplingMode: "hybrid" is the default mode worth caring about here.
It keeps a uniform backbone, then biases additional frames toward:
- scene changes
- same-screen local changes
- denser local UI sequences when they look important
This helps the later OCR and transition layers see more of what changed without trying to decode every frame in the full video.
It is still heuristic.
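The idea of a uniform backbone plus change-biased extras can be illustrated with a toy planner. This is not the repo's algorithm: the half-and-half budget split, the `ChangeCandidate` shape, and the `planHybridFrames` name are all assumptions made for the sketch.

```typescript
// Hypothetical change-point candidate, e.g. from a frame-difference probe.
interface ChangeCandidate {
  timeSeconds: number;
  score: number; // higher means a stronger visual change
}

// Picks up to frameCount timestamps: a uniform backbone first, then the
// strongest change candidates at or above changeThreshold.
function planHybridFrames(
  durationSeconds: number,
  frameCount: number,
  candidates: ChangeCandidate[],
  changeThreshold: number,
): number[] {
  // Reserve about half the budget for the backbone (an arbitrary split).
  const backboneCount = Math.max(1, Math.ceil(frameCount / 2));
  const picked: number[] = [];
  for (let i = 0; i < backboneCount; i++) {
    // Sample the midpoint of each equal-width slice of the video.
    picked.push(((i + 0.5) * durationSeconds) / backboneCount);
  }
  const strongest = candidates
    .filter((candidate) => candidate.score >= changeThreshold)
    .sort((a, b) => b.score - a.score);
  for (const candidate of strongest) {
    if (picked.length >= frameCount) break;
    if (picked.indexOf(candidate.timeSeconds) === -1) picked.push(candidate.timeSeconds);
  }
  return picked.sort((a, b) => a - b);
}
```

With an 80-second video, a budget of four frames, and a threshold of 0.08, the backbone lands at 20s and 60s, and the two remaining slots go to the strongest candidates above the threshold.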
The repo includes a public benchmark runner:
npm run benchmark:youtube -- --limit=3
Useful flags:
- --manifest=benchmarks/youtube-diverse-queries.json
- --output-root=/tmp/video-evaluator-youtube-benchmark
- --limit=10
- --frame-count=8
- --clip-seconds=75
- --change-threshold=0.08
- --min-confidence=45
- --min-operational-successes=3
- --max-negative-control-false-positives=0
- --min-gold-high-fit-semantic-passes=1
What the benchmark does:
- resolves a public video source
- downloads it with yt-dlp
- clips the requested segment
- runs extraction, OCR, transitions, and summary
- writes:
  - per-case case-report.json
  - aggregate benchmark.report.json
  - aggregate benchmark.report.md
The benchmark manifest lives at benchmarks/youtube-diverse-queries.json.
Each entry can include:
- id
- category
- query
- expectedFit
- curationStatus
- expectedAppNames
- expectedViewHints
- requiredSignals
- forbiddenSignals
- reviewNotes
- videoId
- url
- channelContains
- titleContains
- resolvedAt
- startSeconds
- clipSeconds
Recommended policy:
- use videoId for stable benchmarks
- keep query as provenance or fallback
- use channelContains and titleContains as human-auditable intent
- use startSeconds to skip intro cards and target the real segment
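The policy above amounts to a source-preference order when resolving an entry. The sketch below shows one way to express it. The field names come from the manifest list, but their types, the position of `url` between `videoId` and `query`, and the `pickSource` helper are assumptions rather than the benchmark runner's actual logic.

```typescript
// Partial, hypothetical typing of a manifest entry; field types are assumed.
interface BenchmarkEntry {
  id: string;
  query?: string;
  videoId?: string;
  url?: string;
  startSeconds?: number;
  clipSeconds?: number;
}

interface ResolvedSource {
  kind: "videoId" | "url" | "query";
  value: string;
}

// Prefers the pinned videoId, then an explicit url, then the search query.
function pickSource(entry: BenchmarkEntry): ResolvedSource {
  if (entry.videoId) return { kind: "videoId", value: entry.videoId };
  if (entry.url) return { kind: "url", value: entry.url };
  if (entry.query) return { kind: "query", value: entry.query };
  throw new Error(`entry ${entry.id} has no videoId, url, or query`);
}
```

Keeping `query` on a pinned entry costs nothing and preserves how the video was originally found.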
The aggregate report distinguishes between:
- operational success
- semantic benchmark pass
- raw flow recovery
- meaningful flow recovery
This matters because a case that only says:
- screen-change
- screen-change
- screen-change
did technically produce flow output, but it did not produce useful understanding.
The current repo writes both numbers so the benchmark does not inflate itself.
It also reports OCR signal quality so a case can fail honestly for "usable UI evidence was weak" instead of being misread as a semantic regression.
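The difference between raw and meaningful flow recovery can be made concrete with a toy check. The rule used here (at least two distinct transition kinds other than `uncertain`) is an assumed heuristic for illustration; this README does not state how the benchmark actually scores meaningful flow.

```typescript
// Transition kinds as listed in this README.
type TransitionKind =
  | "screen-change"
  | "state-change"
  | "scroll-change"
  | "dialog-change"
  | "uncertain";

// Raw flow recovery: the pipeline produced any flow output at all.
function hasRawFlow(flow: TransitionKind[]): boolean {
  return flow.length > 0;
}

// Assumed heuristic: a flow is "meaningful" only when it contains at least
// two distinct kinds, ignoring "uncertain".
function hasMeaningfulFlow(flow: TransitionKind[]): boolean {
  const distinct: TransitionKind[] = [];
  for (const kind of flow) {
    if (kind !== "uncertain" && distinct.indexOf(kind) === -1) distinct.push(kind);
  }
  return distinct.length >= 2;
}
```

Under this rule, three consecutive `screen-change` entries count as raw flow but not as meaningful flow.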
Gate flags are optional. Without them, the benchmark remains report-only.
When one or more gate thresholds are configured, the aggregate report
includes a gate block and the process exits non-zero if any configured
threshold fails.
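The gate behavior described above can be sketched as follows. The three threshold names mirror the gate flags, but the `AggregateCounts` shape, the `evaluateGates` helper, and the exact exit code are assumptions rather than the benchmark runner's implementation.

```typescript
// Optional thresholds, one per gate flag; an unset field means "not configured".
interface GateThresholds {
  minOperationalSuccesses?: number;
  maxNegativeControlFalsePositives?: number;
  minGoldHighFitSemanticPasses?: number;
}

// Hypothetical aggregate counts taken from a benchmark run.
interface AggregateCounts {
  operationalSuccesses: number;
  negativeControlFalsePositives: number;
  goldHighFitSemanticPasses: number;
}

interface GateResult {
  configured: boolean; // false means the run stays report-only
  failures: string[];  // names of the gate flags that failed
  exitCode: number;    // non-zero only when a configured gate fails
}

function evaluateGates(counts: AggregateCounts, thresholds: GateThresholds): GateResult {
  const failures: string[] = [];
  let configured = false;
  if (thresholds.minOperationalSuccesses !== undefined) {
    configured = true;
    if (counts.operationalSuccesses < thresholds.minOperationalSuccesses) {
      failures.push("min-operational-successes");
    }
  }
  if (thresholds.maxNegativeControlFalsePositives !== undefined) {
    configured = true;
    if (counts.negativeControlFalsePositives > thresholds.maxNegativeControlFalsePositives) {
      failures.push("max-negative-control-false-positives");
    }
  }
  if (thresholds.minGoldHighFitSemanticPasses !== undefined) {
    configured = true;
    if (counts.goldHighFitSemanticPasses < thresholds.minGoldHighFitSemanticPasses) {
      failures.push("min-gold-high-fit-semantic-passes");
    }
  }
  return { configured, failures, exitCode: failures.length > 0 ? 1 : 0 };
}
```

Checking against `undefined` rather than truthiness matters here, because a threshold of 0 (as in `--max-negative-control-false-positives=0`) is a real configured gate.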
Use docs/youtube-evaluation.md as the operating policy: public-video analysis is a regression and boundary test, not a cloning workflow.
The benchmark will try to stay operational on machines with old global
yt-dlp installs.
If needed, it will:
- bootstrap a newer yt-dlp into the benchmark tooling directory
- prefer Firefox cookies when available
This is done to keep the benchmark reproducible in practice, not just in theory.
This repo ships installable skills under skills/.
Current skill set:
- skill-catalog
- install-skill-pack
- video-artifact-intake
- video-shots
- segment-evidence
- segment-storyboard
- review-bundle
- storyboard-extract
- storyboard-ocr
- storyboard-transitions
- storyboard-understand
- compare-video-runs
- package-review-prompt
These are meant to give Codex and Claude Code a shared operational surface without requiring each repo to invent its own evaluation verbs.
Each skill includes sequencing, input/output, failure-mode, and abstention guidance so agents can operate the pack without guessing from command names alone.
The repo exports its main logic from src/index.ts.
Notable exports:
- request schemas
- extractStoryboard
- extractVideoShots
- buildSegmentEvidence
- extractSegmentStoryboard
- ocrStoryboard
- inferStoryboardTransitions
- classifyStoryboardTransition
- understandStoryboard
- harness wrappers
This means you can use the repo:
- as local scripts
- as an installed skill pack
- as a TypeScript dependency
If you want the smallest realistic storyboard review:
npm install
npm run build
cat <<'JSON' | node --import tsx scripts/harness/storyboard-extract.ts
{
"videoPath": "/tmp/demo.mp4",
"frameCount": 8,
"samplingMode": "hybrid"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/storyboard-ocr.ts
{
"storyboardDir": "/tmp/video-evaluator-storyboard"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/storyboard-transitions.ts
{
"storyboardDir": "/tmp/video-evaluator-storyboard"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/storyboard-understand.ts
{
"storyboardDir": "/tmp/video-evaluator-storyboard"
}
JSON
Then inspect:
- storyboard.manifest.json
- storyboard.ocr.json
- storyboard.transitions.json
- storyboard.summary.json
If the video has many cuts or the first pass misses too many segments, run the shot-aware path:
cat <<'JSON' | node --import tsx scripts/harness/video-shots.ts
{
"videoPath": "/tmp/demo.mp4",
"outputDir": "/tmp/video-evaluator-storyboard"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/segment-storyboard.ts
{
"outputDir": "/tmp/video-evaluator-storyboard",
"framesPerSegment": 1
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/storyboard-ocr.ts
{
"storyboardDir": "/tmp/video-evaluator-storyboard/segment-storyboard"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/storyboard-transitions.ts
{
"storyboardDir": "/tmp/video-evaluator-storyboard/segment-storyboard"
}
JSON
cat <<'JSON' | node --import tsx scripts/harness/segment-evidence.ts
{
"outputDir": "/tmp/video-evaluator-storyboard"
}
JSON
Then inspect:
- video.shots.json
- segment-storyboard/storyboard.manifest.json
- segment-storyboard/storyboard.ocr.json
- segment-storyboard/storyboard.transitions.json
- segment.evidence.json
These are not hidden:
- OCR quality can collapse on noisy public videos
- public-video intros and subtitles can dominate the evidence
- app/view/capability extraction is still too weak across diverse real videos
- benchmark success currently means "pipeline ran", not "video was deeply understood"
- timeline evidence is normalized from existing artifacts, not yet fused into a full semantic timeline summary
- shot extraction is coarse scene segmentation, not a semantic decompile of source footage
- segment evidence routes existing artifacts by time; it is not a full semantic video understanding model
- per-shot storyboard frames improve coverage but still miss motion between sampled frames
If you are deciding whether to depend on this repo, this is the section to take seriously.
High-value next steps:
- OCR quality gating before semantic inference
- broader app/view/capability extraction for generic software
- stronger preference for stable UI-anchor frames during understanding
- better subtitle / narration filtering
- richer local action-sequence reconstruction
- denser benchmark coverage with stronger high-fit product videos
- fused shot, timeline, OCR, and transition evidence
Longer-term direction:
- richer multimodal timeline understanding
- better repo-specific adapters on top of the shared artifact contract
- more reliable comparison and regression-review workflows across repos
Run the core checks:
npm run typecheck
npm test
npm run build
This repo has tests for:
- hybrid frame planning
- same-screen probe scoring
- transition classification
- summary extraction
- narration-dominance heuristics
video-evaluator is already useful as an evidence-extraction and
review-packaging layer.
It is not yet a strong general video-understanding system.
That distinction is the main thing this README is trying to make clear.
