Add Gemini Veo video generation #25

@cameronsjo

Migrated from cameronsjo/cadence-palette#1

Summary

Add Gemini Veo video generation support to cadence-palette alongside the existing image generation pipeline. Veo 3.1 is GA with native audio, 720p/1080p/4K resolution, and 4, 6, or 8 second clips. That unlocks short-form mascot animations, product demos, and promotional clips from structured prompt files.

Motivation

  • The current pipeline (/generate-prompt, then /generate-image) only supports static images via gemini-3-pro-image-preview
  • Gemini's Veo 3.1 (veo-3.1-generate-preview) uses the same google-genai SDK and GOOGLE_API_KEY, so infrastructure overlap is high
  • Video generation is expensive ($0.15-$0.60/sec, ~$3.20 per 8-sec clip at 1080p), so the existing pre-flight validation and cost confirmation patterns in /generate-image are directly reusable
  • First real use case already exists: Code Puppy "bark like a chicken" mascot animation prompt (prompts/bark-like-a-chicken-video.md)

Proposed Changes

1. New skill: generate-video-prompt (parallel to generate-prompt)

Structured video prompt file format with frontmatter:

---
name: bark-like-a-chicken-video
model: veo-3.1-generate-preview
aspect_ratio: '16:9'        # 16:9 or 9:16 only (no 1:1)
resolution: 1080p            # 720p, 1080p, 4k
duration: 8                  # 4, 6, or 8 seconds
cost_estimate: $3.20
style: pixel-art-animation
last_generated: null
last_updated: '2026-03-26T20:00:00Z'
---
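
The constraints in the frontmatter comments above could be checked with something like the sketch below. It assumes the frontmatter has already been parsed into a dict by whatever YAML loader the existing skills use. The function name `validate_video_frontmatter` and the choice of required fields are hypothetical, not part of the current codebase.

```python
# Allowed values are taken from the frontmatter comments in this issue.
ALLOWED_ASPECT_RATIOS = ("16:9", "9:16")
ALLOWED_RESOLUTIONS = ("720p", "1080p", "4k")
ALLOWED_DURATIONS = (4, 6, 8)
# Assumption: these five fields are the ones generation cannot proceed without.
REQUIRED_FIELDS = ("name", "model", "aspect_ratio", "resolution", "duration")


def validate_video_frontmatter(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if meta.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")

    aspect = meta.get("aspect_ratio")
    if aspect is not None and aspect not in ALLOWED_ASPECT_RATIOS:
        errors.append(f"aspect_ratio must be one of {ALLOWED_ASPECT_RATIOS}, got {aspect!r}")

    resolution = meta.get("resolution")
    if resolution is not None and str(resolution).lower() not in ALLOWED_RESOLUTIONS:
        errors.append(f"resolution must be one of {ALLOWED_RESOLUTIONS}, got {resolution!r}")

    duration = meta.get("duration")
    if duration is not None and duration not in ALLOWED_DURATIONS:
        errors.append(f"duration must be one of {ALLOWED_DURATIONS}, got {duration!r}")

    return errors
```

Returning a list of problems rather than raising on the first one lets the skill report every frontmatter issue in a single pass before any paid call is made.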

Body sections differ from image prompts — temporal, not spatial:

┌───────────────────────────────┬────────────────────────────────────────────┐
│         Image Prompt          │                Video Prompt                │
├───────────────────────────────┼────────────────────────────────────────────┤
│ Subject (position, materials) │ Subject + Action (what happens over time)  │
├───────────────────────────────┼────────────────────────────────────────────┤
│ Environment (static setting)  │ Environment + Camera (movement, shot type) │
├───────────────────────────────┼────────────────────────────────────────────┤
│ Secondary Elements            │ Motion Events (sequence of beats)          │
├───────────────────────────────┼────────────────────────────────────────────┤
│ Lighting                      │ Lighting + Color Grading                   │
├───────────────────────────────┼────────────────────────────────────────────┤
│ Style                         │ Style + Sound Design (Veo 3+ native audio) │
└───────────────────────────────┴────────────────────────────────────────────┘

2. New skill: generate-video (parallel to generate-image)

Reuses the same infrastructure gate pattern:
- Check GOOGLE_API_KEY
- Check google-genai SDK installed
- Read + validate prompt files
- Cost confirmation gate (even more critical at $3.20/clip vs $0.13/image)
- Generate via client.models.generate_videos() with polling loop
- Save to generated/{name}-{timestamp}.mp4
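
A rough sketch of the non-API parts of that gate, using only the standard library. All function names here are hypothetical, and the timestamp format in the output filename is an assumption, since this issue only specifies `{name}-{timestamp}.mp4`.

```python
import importlib.util
import os
from datetime import datetime, timezone
from pathlib import Path


def sdk_installed(module_name: str = "google.genai") -> bool:
    """True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package (e.g. "google") is missing.
        return False


def preflight_errors(env=None, sdk: str = "google.genai") -> list[str]:
    """Collect infrastructure problems before any paid call is attempted."""
    env = os.environ if env is None else env
    errors = []
    if not env.get("GOOGLE_API_KEY"):
        errors.append("GOOGLE_API_KEY is not set")
    if not sdk_installed(sdk):
        errors.append(f"{sdk} is not importable (is google-genai installed?)")
    return errors


def confirm_cost(cost_label: str, ask=input) -> bool:
    """Explicit opt-in: anything other than y/yes aborts."""
    reply = ask(f"This will cost about {cost_label}. Continue? [y/N] ")
    return reply.strip().lower() in ("y", "yes")


def output_path(name: str, now=None, out_dir: str = "generated") -> Path:
    """Build generated/{name}-{timestamp}.mp4 (timestamp format is assumed)."""
    now = now or datetime.now(timezone.utc)
    return Path(out_dir) / f"{name}-{now.strftime('%Y%m%dT%H%M%SZ')}.mp4"
```

Defaulting the confirmation to "no" means an accidental Enter keypress never triggers a billed generation.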

Key API differences from image generation:
- Async operation — returns an operation object, must poll operation.done
- No free tier — billing required from first call
- Limited aspect ratios — only 16:9 and 9:16
- Fixed 24fps
- Native audio on Veo 3+ — prompts can describe soundscapes
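
The async behaviour is the main structural difference, so here is a minimal sketch of a polling helper with a timeout. It is SDK-agnostic on purpose: `refresh` is an injected callable, which in real use would probably wrap the SDK's operation lookup (for example `lambda op: client.operations.get(op)`, to be verified against the installed google-genai version). The helper name and the 600s/10s defaults are assumptions, chosen to sit above the 11s–6min latency range quoted in this issue.

```python
import time
from typing import Any, Callable


def wait_for_operation(
    operation: Any,
    refresh: Callable[[Any], Any],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Poll until operation.done is truthy, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while not operation.done:
        if clock() >= deadline:
            raise TimeoutError(f"operation not done after {timeout_s:.0f}s")
        sleep(interval_s)
        # Re-fetch the operation state; the old object is not assumed to update in place.
        operation = refresh(operation)
    return operation
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting, which matters for the "fast completion and timeout" item in the test plan.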

3. New spec: specs/video-prompt-engineering.md

Video-specific prompt engineering guide covering:
- Temporal structure (setup → action → button)
- Camera vocabulary (tracking shot, dolly, static hold, etc.)
- Audio cues for Veo 3+ (dialogue in quotes, SFX descriptions, ambient sound)
- Duration planning (what fits in 4s vs 8s)
- Scene extension chaining for longer narratives (up to ~1 minute)

4. Updates to existing skills

- gemini-image-gen/SKILL.md — add "See also: video generation" cross-reference
- setup-generation/SKILL.md — Veo uses the same API key, but note billing requirement (no free tier for video)

Veo API Reference

┌──────────────────────┬──────────────────────────────┐
│       Property       │            Value             │
├──────────────────────┼──────────────────────────────┤
│ Model                │ veo-3.1-generate-preview     │
├──────────────────────┼──────────────────────────────┤
│ SDK                  │ google-genai (same as image) │
├──────────────────────┼──────────────────────────────┤
│ Durations            │ 4, 6, 8 seconds              │
├──────────────────────┼──────────────────────────────┤
│ Resolutions          │ 720p, 1080p, 4K (3.1 only)   │
├──────────────────────┼──────────────────────────────┤
│ Aspect ratios        │ 16:9, 9:16                   │
├──────────────────────┼──────────────────────────────┤
│ Cost (1080p, 8s)     │ ~$3.20                       │
├──────────────────────┼──────────────────────────────┤
│ Cost (720p fast, 8s) │ ~$1.20                       │
├──────────────────────┼──────────────────────────────┤
│ Audio                │ Native on Veo 3+             │
├──────────────────────┼──────────────────────────────┤
│ Latency              │ 11s–6min                     │
└──────────────────────┴──────────────────────────────┘

Cost Tier Implications

Video is the most expensive generation tier in the palette:

┌────────────────┬────────┬───────────────────────────────────────────────────────────┐
│      Tier      │  Cost  │                           Gate                            │
├────────────────┼────────┼───────────────────────────────────────────────────────────┤
│ Image 1K       │ ~$0.04 │ Validation only                                           │
├────────────────┼────────┼───────────────────────────────────────────────────────────┤
│ Image 2K       │ ~$0.13 │ Validation only                                           │
├────────────────┼────────┼───────────────────────────────────────────────────────────┤
│ Image 4K       │ ~$0.24 │ Validation + confirmation                                 │
├────────────────┼────────┼───────────────────────────────────────────────────────────┤
│ Video 720p 8s  │ ~$1.20 │ Validation + explicit cost confirmation                   │
├────────────────┼────────┼───────────────────────────────────────────────────────────┤
│ Video 1080p 8s │ ~$3.20 │ Validation + explicit cost confirmation                   │
├────────────────┼────────┼───────────────────────────────────────────────────────────┤
│ Video 4K 8s    │ ~$4.80 │ Validation + explicit cost confirmation + "are you sure?" │
└────────────────┴────────┴───────────────────────────────────────────────────────────┘
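
The video rows of this table could be reproduced with a small estimator like the one below. The per-second rates are back-derived from the 8-second figures in this issue (for example $3.20 / 8s = $0.40/s at 1080p) and are not official pricing; the 720p rate corresponds to the "fast" variant listed in the API reference table. Function names are hypothetical.

```python
# Cents per second, derived from this issue's 8-second cost figures (assumption,
# not official pricing). Integer cents avoid floating-point rounding surprises.
RATE_CENTS_PER_SECOND = {"720p": 15, "1080p": 40, "4k": 60}


def estimate_cost_cents(resolution: str, duration_s: int) -> int:
    """Estimated clip cost in cents for a given resolution and duration."""
    key = resolution.lower()
    if key not in RATE_CENTS_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    return RATE_CENTS_PER_SECOND[key] * duration_s


def confirmations_required(resolution: str) -> int:
    """Every video needs one explicit confirmation; 4K adds an 'are you sure?'."""
    return 2 if resolution.lower() == "4k" else 1


def format_usd(cents: int) -> str:
    """Format integer cents as a dollar string, e.g. 320 -> $3.20."""
    return f"${cents // 100}.{cents % 100:02d}"
```

For example, `format_usd(estimate_cost_cents("1080p", 8))` gives `$3.20`, matching the `cost_estimate` in the sample frontmatter.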

Out of Scope

- Image-to-video (passing a generated image as first frame) — future enhancement
- Scene extension chaining (building 1-minute narratives) — future enhancement
- Style reference images — Veo 3.1 supports this but adds complexity
- Vertex AI endpoint support — Gemini API only for now

Test Plan

- Write a video prompt using /generate-video-prompt
- Validate frontmatter parsing handles video-specific fields (model, duration, cost_estimate)
- Confirm infrastructure gate detects missing GOOGLE_API_KEY
- Confirm cost confirmation gate fires before generation
- Generate a test video at 720p/4s (cheapest option: ~$0.60) to validate the pipeline
- Verify .mp4 output saves correctly with timestamp filename
- Verify polling loop handles both fast completion and timeout gracefully
