Live-Stream Stability — a Continual-Learning POC

This is the live_stream_stability POC. It lives under a continual_learning/poc/ workspace that will hold sibling POCs over time (e.g. compression/), each its own git repo.

What I'm testing. Whether I can take a strong open-weight vision-language model (VLM) and continually pretrain it on the long-duration, first-person-style experience stream of a single subject — video + audio — so that it becomes measurably more knowledgeable and perceptive about that subject's reality over time.

Hypothesis. A base VLM's understanding, recall, and perception of the subject are measurably lower on Day-0 (before it sees the stream) than on Day-N, after it has absorbed X hours of that subject's continuous feed. The long-term vision is a model that becomes a user's experiential alter ego: it sees what they see and accumulates their context.

As the closest public proxy for "one person's continuous life feed," I use IShowSpeed's 35-day livestreamed IRL US tour (2025): 67 videos, ~752.8 hours (~11 h each, 1080p60 VP9/Opus).

I own this work: Murali Nandan Nagarapu — Head of Engineering, Nucleus AI · nmn@withnucleus.ai

Where this stands today

Phase	What I set out to do	Status
1 — Acquire	Turn the playlist into an indexed on-disk corpus: download + metadata + ASR + diarization	✅ 67/67 complete
2 — Chunk	Pick the base model + chunking operating point, then cut the corpus into VLM-sized chunks	✅ Complete — 2,290 chunks
3 — Describe	Generate gold per-chunk descriptions (the training targets)	⏳ Next
4 — Train + eval	Continually pretrain; measure Day-0 vs Day-N	⏳

I keep the live working state — what's in flight, gotchas, next steps — in HANDOFF.md and handoff/. This README is the stable overview; HANDOFF is the agent/working canvas.

Decisions locked so far: base model Qwen2.5-VL-32B-Instruct; chunk the stream at 20 min / 1 fps at the model's default token budget; training target = generated descriptions (ASR/diarization are inputs, not targets); full bf16 + fp32 Adam, ZeRO-3, no LoRA; audio off for v1.

The four phases

PHASE-1 — Acquire the corpus ✅

I turn a YouTube playlist into a clean, indexed, on-disk corpus: one folder per video with the media plus all metadata and time-aligned transcripts, all tracked by a single manifest CSV (ishowspeed_tour.csv) that is the spine of the phase — one row per video, static columns from the playlist, runtime columns filled by the downloader and transcript steps.

The pipeline (in data/live_streams/ishowspeed/tour/): build_manifest.py (playlist → CSV) → download.py (yt-dlp, Archivist format chain → MKV + info + description + thumbnail + subs) → fetch_subs.py (YouTube captions) → asr_subs.py (faster-whisper for caption-less videos) → diarize_subs.py (WhisperX + pyannote speaker labels) → rename_files.py (ASCII-safe artifact names) → status.py (progress report). The production pull ran sharded across the 8-node mesh (scripts/node_shard_download.sh), each node taking a slice of the playlist to its local SSD and backing up to GCS.
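
As a rough illustration of "each node taking a slice of the playlist", here is a minimal sketch of a round-robin split. The round-robin rule and the `shard_playlist` helper are assumptions made for illustration; the real logic lives in scripts/node_shard_download.sh and may slice differently.

```python
def shard_playlist(video_ids, node_index, num_nodes):
    """Return the videos one node would pull under a round-robin split.

    Hypothetical helper: the production sharding rule is not documented here.
    """
    if not 0 <= node_index < num_nodes:
        raise ValueError(f"node_index must be in [0, {num_nodes})")
    # Every num_nodes-th video, starting at this node's offset.
    return video_ids[node_index::num_nodes]


# Placeholder ids standing in for the 67 playlist entries.
playlist = [f"v{i:02d}" for i in range(67)]
shards = [shard_playlist(playlist, n, 8) for n in range(8)]
for n, shard in enumerate(shards):
    print(f"node {n}: {len(shard)} videos")
```

With 67 videos over 8 nodes this gives three nodes 9 videos and five nodes 8, and every video lands on exactly one node.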

Status: all 67/67 videos downloaded, transcribed, and diarized.

PHASE-2 — Chunk the stream ✅

A VLM has finite context; 752 hours can't go in at once. So I had to choose (a) a base model and (b) a chunking operating point (length, fps, resolution), then materialize the chunk dataset.

Strategy (deep record: experiments/chunk_length/HANDOVER.md) — a 6-model empirical sweep over a 195-question bank anchored to Day-01 Miami, auto-graded by LLM judges, produced:

Base model: Qwen2.5-VL-32B-Instruct (best mean score, best OCR of its cost tier, trains on ≤4 H100 with ZeRO-3). 72B is the backup for long chunks / critical OCR.
Operating point: 20 min / 1 fps / model-default token budget. Quality holds to ~20 min and degrades past it.
Hard-won lesson: benchmark on your own data. Qwen3-VL scores higher on public VideoMME but its OCR collapsed (~0.86 → ~0.15) on this footage.

The token-budget math (you'll need it in PHASE-3/4). Qwen2.5-VL uses an adaptive global token cap (~12–14K vision tokens), not a fixed per-frame resolution. As chunk_s × fps grows, total tokens stay roughly constant and per-frame resolution shrinks: pairs = chunk_s × fps / 2, pixels_per_pair ≈ 25.2M / pairs, frame_side ≈ √pixels_per_pair. At 20 min × 1 fps → 1,200 frames → 600 pairs → ~42K pixels per pair → ~205-px frames: great temporal coverage, legible for most reasoning, not for small text. If fine OCR ever becomes critical, drop to 0.5 fps (300 pairs → ~290-px frames) or use the 72B.
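
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula as stated in this README: the 25.2M pixel budget is the approximate figure quoted above, and `frame_budget` is a hypothetical helper, not part of the repo.

```python
import math

PIXEL_BUDGET = 25.2e6  # approximate total pixel budget quoted in this README


def frame_budget(chunk_s, fps, pixel_budget=PIXEL_BUDGET):
    """Return (pairs, pixels_per_pair, frame_side) for one operating point."""
    pairs = chunk_s * fps / 2              # frames are counted two at a time
    pixels_per_pair = pixel_budget / pairs
    frame_side = math.sqrt(pixels_per_pair)  # side of an equivalent square frame
    return pairs, pixels_per_pair, frame_side


for minutes, fps in [(20, 1.0), (20, 0.5), (40, 1.0)]:
    pairs, px, side = frame_budget(minutes * 60, fps)
    print(f"{minutes} min @ {fps} fps: {pairs:.0f} pairs, "
          f"{px:,.0f} px/pair, ~{side:.0f}-px side")
```

The 20 min / 1 fps row reproduces the ~205-px figure; halving the frame rate or the chunk length raises the side by a factor of √2, which is why 0.5 fps is the cheaper lever when small text matters.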

Productionization (chunk_videos.py). I stream-copy each video into 20-min chunks with ffmpeg's segment muxer (-c copy — lossless, ~2 min/video, keyframe-exact on this VP9), upload the chunk media to GCS, and write one row per chunk into chunks_manifest.csv. The local SSD is touched only transiently (staged → uploaded → deleted), so the durable chunk store is GCS. The job is mesh-distributed: each node chunks the videos it holds locally and writes its own chunks_manifest.d/<node>.csv; --merge unions the per-node parts into the single source-of-truth table (scripts/launch_mesh_chunk.sh fans the run out over SSH and merges).
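
For readers unfamiliar with the segment muxer, the stream-copy step could look roughly like the command built below. This is a sketch under assumptions, not the contents of chunk_videos.py: the output naming pattern and the `build_segment_cmd` helper are hypothetical, and only standard ffmpeg options are used.

```python
from pathlib import Path


def build_segment_cmd(src, out_dir, chunk_seconds=1200):
    """Build an ffmpeg argv that splits src into fixed-length stream-copied chunks."""
    src = Path(src)
    # Hypothetical naming scheme: <video stem>_<4-digit index><original suffix>
    pattern = Path(out_dir) / f"{src.stem}_%04d{src.suffix}"
    return [
        "ffmpeg", "-hide_banner", "-nostdin",
        "-i", str(src),
        "-map", "0",                # keep every stream from the input
        "-c", "copy",               # stream copy: no re-encode, so no quality loss
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-reset_timestamps", "1",   # each chunk starts at t=0
        str(pattern),
    ]


cmd = build_segment_cmd("videos/day01.mkv", "staging", chunk_seconds=1200)
print(" ".join(cmd))
```

To execute it, pass the list to `subprocess.run(cmd, check=True)`. With `-c copy` the muxer can only cut at keyframes, so chunk boundaries land on the first keyframe at or after each 20-minute mark.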

Result: all 67 videos → 2,290 chunks → 752.7 h in GCS. Integrity verified — 0 duplicate chunk_ids, 0 non-GCS paths, 0 n_chunks mismatches, per-video coverage to <0.1 s. chunks_manifest.csv is the join table PHASE-3 consumes: one row per chunk, media_path → GCS, all PHASE-1 metadata carried.
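
The integrity checks listed above can be expressed as a small function over the manifest rows. A minimal sketch follows, assuming each row carries `chunk_id`, `media_path`, and `n_chunks` (named above) plus a `video_id` column, and that `n_chunks` repeats the per-video chunk count on every row. The `video_id` column and that reading of `n_chunks` are assumptions.

```python
from collections import Counter, defaultdict


def check_chunk_manifest(rows):
    """Report duplicate ids, non-GCS paths, and n_chunks mismatches."""
    rows = list(rows)
    id_counts = Counter(r["chunk_id"] for r in rows)
    duplicate_ids = sorted(cid for cid, n in id_counts.items() if n > 1)
    non_gcs = sorted(r["chunk_id"] for r in rows
                     if not r["media_path"].startswith("gs://"))
    actual = Counter(r["video_id"] for r in rows)   # rows actually present
    declared = defaultdict(set)                     # counts the rows claim
    for r in rows:
        declared[r["video_id"]].add(int(r["n_chunks"]))
    mismatches = sorted(v for v, vals in declared.items() if vals != {actual[v]})
    return {
        "duplicate_chunk_ids": duplicate_ids,
        "non_gcs_paths": non_gcs,
        "n_chunks_mismatches": mismatches,
    }


# Made-up rows: vidB has a local path and claims 3 chunks but has only 1 row.
demo_rows = [
    {"chunk_id": "vidA_0000", "video_id": "vidA",
     "media_path": "gs://example-bucket/vidA_0000.mkv", "n_chunks": "2"},
    {"chunk_id": "vidA_0001", "video_id": "vidA",
     "media_path": "gs://example-bucket/vidA_0001.mkv", "n_chunks": "2"},
    {"chunk_id": "vidB_0000", "video_id": "vidB",
     "media_path": "/mnt/localssd/vidB_0000.mkv", "n_chunks": "3"},
]
report = check_chunk_manifest(demo_rows)
print(report)
```

In practice the rows could come from `csv.DictReader` over chunks_manifest.csv; a clean manifest yields three empty lists.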

PHASE-3 — Gold chunk descriptions ⏳ (next)

For every chunk I'll generate a rich textual description — the TARGET the model learns to predict in PHASE-4. ASR transcripts and diarization are inputs to the generator, not training targets. Each description should capture what's visually happening, where it's set, time/lighting, anything notable, and cross-chunk pointers for later recall; I may condition each on the previous chunk's description. Open questions (generator choice, paragraph vs structured JSON, length, whether to add Q&A pairs) are tracked in handoff/phase-3-describe.md. I'll prototype on 5–10 chunks before committing.

PHASE-4 — Continual pretrain + evaluate ⏳

INPUT  = [<|video_start|>] [vision_tokens(chunk @ 20min/1fps)] [<|video_end|>]
TARGET = chunk description + injected metadata (geo, timestamp, day-N-of-35, cross-chunk cues)
LOSS   = causal LM loss on TARGET tokens only — vision input positions are masked

Why this shape. Modern open-weight VLMs do not convert video to text first. A vision encoder turns pixels into continuous embeddings placed in the LLM's token space, interleaved with text tokens; the LLM does next-token prediction over text positions only while attending to the vision tokens. So I train the model to describe what it sees — grounding it in this subject's visual experience — rather than fine-tuning a language head on synthetic captions.
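
"Loss on TARGET tokens only" is usually implemented by masking labels. The sketch below assumes the common convention of marking masked positions with -100, which PyTorch's cross-entropy ignores by default; the token ids are made-up placeholders, and `build_labels` is a hypothetical helper rather than the PHASE-4 training code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross_entropy skips by default


def build_labels(input_ids, target_start):
    """Copy input_ids into labels, masking every position before target_start."""
    if not 0 <= target_start <= len(input_ids):
        raise ValueError("target_start must lie within the sequence")
    return [IGNORE_INDEX] * target_start + list(input_ids[target_start:])


# Placeholder ids: 1 = video start, 0 = vision positions, 2 = video end,
# 101..103 = description (TARGET) tokens.
input_ids = [1, 0, 0, 0, 2, 101, 102, 103]
labels = build_labels(input_ids, target_start=5)
print(labels)  # [-100, -100, -100, -100, -100, 101, 102, 103]
```

The vision positions still feed attention, so the description tokens are predicted while conditioned on the video; they simply contribute nothing to the loss themselves.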

Full bf16 + fp32 Adam, ZeRO-3 sharded, gradient-checkpointed, no LoRA (continual pretraining needs deep rewiring). ~2 a3-mega nodes (16 H100) for 32B full training. Milestones: v1 Day-01 alone (~11 h) → recall + emergent behavior; v2 all 67 (~752.8 h) → scaling + forgetting; v3 audio; v4 more modalities; v5 online/streaming. Recipe + eval plan in handoff/phase-4-train.md.
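
A back-of-envelope check on the node count, using the usual mixed-precision Adam accounting of 16 bytes per parameter. The assumptions are a nominal 32e9 parameters, bf16 gradients, and an fp32 master copy plus two fp32 Adam moments. Activations, buffers, and fragmentation are deliberately excluded, so real usage is higher.

```python
BYTES_PER_PARAM = 2 + 2 + 12  # bf16 weights + bf16 grads + fp32 master/momentum/variance


def training_state_gb(n_params, n_gpus):
    """Return (total GB, GB per GPU) of model + optimizer state under ZeRO-3."""
    total_gb = n_params * BYTES_PER_PARAM / 1e9
    return total_gb, total_gb / n_gpus  # ZeRO-3 shards all three evenly


total, per_gpu = training_state_gb(32e9, n_gpus=16)
print(f"total state: {total:.0f} GB, per GPU on 16x H100: {per_gpu:.0f} GB")
```

That comes to about 512 GB of state in total, or about 32 GB per GPU across 16 H100s, which leaves the remainder of each 80 GB card for activations even with gradient checkpointing.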

Repository layout

poc/live_stream_stability/              ← this POC (a git repo)
├── README.md                          ← this file (human-facing overview)
├── HANDOFF.md  handoff/               ← agent working canvas + per-phase notes (live state)
├── scripts/                           ← mesh ops: sync_to_gcs, node_shard_download, launch_mesh_chunk
├── data/live_streams/ishowspeed/tour/ ← PHASE-1 pipeline + PHASE-2 chunker (code + seed manifests)
│   ├── *.py  (build_manifest, download, fetch_subs, asr_subs, diarize_subs,
│   │          calib_diarize, rename_files, status, chunk_videos)
│   ├── playlist_raw.jsonl  ishowspeed_tour.csv  ishowspeed_tour.md   ← PHASE-1 manifest seeds
│   └── chunks_manifest.csv                                           ← PHASE-2 chunk table (the join key)
└── experiments/chunk_length/          ← PHASE-2 experiment record (HANDOVER + plan/ qa_bank/ code/ results/)

Third-party reference (gitignored, re-clonable): TheFrenchGhostys-Ultimate-YouTube-DL-Scripts-Collection — download.py's yt-dlp format chain mirrors that repo's Archivist preset.

Data locations

Asset	Location	In git?	In GCS?
Pipeline code, docs, manifest seeds	this repo (NFS `/home`)	✅	✅ (via repo)
67 source videos (~1.1 TB) + info/desc/thumb/subs	`/mnt/localssd/poc/data/live_streams/ishowspeed/tour/videos/` (per node)	❌	✅
20-min chunks — 2,290 files (~1.1 TB) + `chunks_manifest.csv`	`gs://…/tour/chunks/` (media); table in repo + mirrored to GCS	✅ table	✅ media
HF model cache (Qwen2.5-VL-32B etc.)	`/mnt/localssd/.hf-home/`	❌	❌ (re-downloadable)

Storage discipline. Git is the source of truth for code/docs; GCS is the source of truth for bulk data. The working tree /mnt/localssd/poc/ is rsync'd to gs://nucleus-continual-learning/poc/ (us-east4, additive/never-delete). ISHOWSPEED_DATA_ROOT overrides the data root (default /mnt/localssd/poc/data/live_streams/ishowspeed/tour). The bucket enforces uniform bucket-level access + public-access-prevention — I share objects via signed URLs, not public ACLs.
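
The ISHOWSPEED_DATA_ROOT override described above amounts to an environment lookup with a default. A minimal sketch follows; the `data_root` helper is hypothetical, and treating an empty value as unset is an assumption rather than documented behavior of the pipeline scripts.

```python
import os
from pathlib import Path

DEFAULT_DATA_ROOT = "/mnt/localssd/poc/data/live_streams/ishowspeed/tour"


def data_root(env=None):
    """Resolve the data root, honoring ISHOWSPEED_DATA_ROOT when it is set."""
    env = os.environ if env is None else env
    # Assumption: an empty variable falls back to the default.
    return Path(env.get("ISHOWSPEED_DATA_ROOT") or DEFAULT_DATA_ROOT)


print(data_root())                                    # uses the real environment
print(data_root({"ISHOWSPEED_DATA_ROOT": "/tmp/x"}))  # explicit override
```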

Why this discipline exists. I lost the original PHASE-1 working data on /mnt/localssd during a SLURM cluster replacement. Git (code/docs/small seeds) plus the GCS backup are now the durable safety net; nothing irreplaceable lives only on the local SSD.

Runbooks

conda activate moe          # torch 2.8/cu128, yt-dlp, faster-whisper, whisperx, pyannote, ffmpeg 7.1
cd data/live_streams/ishowspeed/tour

# PHASE-1 (acquire) — already complete; regenerable + resumable:
python build_manifest.py                         # playlist → manifest (refuses to clobber done rows)
python download.py --all --no-comments --parallel 2   # full pull (production ran sharded via scripts/node_shard_download.sh)
python fetch_subs.py && python asr_subs.py --parallel 8 && python diarize_subs.py --parallel 8
python rename_files.py && python status.py

# PHASE-2 (chunk) — per node, idempotent; or fan out across the mesh:
python chunk_videos.py --all                     # chunk this node's done videos → GCS
python chunk_videos.py --merge                   # union per-node parts → chunks_manifest.csv (+ GCS)
python chunk_videos.py --status                  # cross-node totals
bash ../../../../scripts/launch_mesh_chunk.sh    # fan out over all 8 nodes, then merge

Infra. GCP poetic-avenue-438401-a7, zone us-east4-b. SLURM partition a3mega = 8× a3-mega nodes nucla3m-a3meganodeset-[0-7] (8× H100 80 GB each); /home is shared NFS, /mnt/localssd is per-node local; node-to-node SSH is passwordless. HF_TOKEN is in ~/.bashrc.
