This is the live_stream_stability POC. It lives under a continual_learning/poc/ workspace that will hold sibling POCs over time (e.g. compression/), each its own git repo.
What I'm testing. Whether I can take a strong open-weight vision-language model (VLM) and continually pretrain it on the long-duration, first-person-style experience stream of a single subject — video + audio — so that it becomes measurably more knowledgeable and perceptive about that subject's reality over time.
Hypothesis. A base VLM's understanding/recall/perception on Day-0 (before it sees the stream) is measurably lower than after it has absorbed X hours of that subject's continuous feed by Day-N. The long-term vision is a model that becomes a user's experiential alter ego — it sees what they see and accumulates their context.
As the closest public proxy for "one person's continuous life feed," I use IShowSpeed's 35-day livestreamed IRL US tour (2025): 67 videos, ~752.8 hours (~11 h each, 1080p60 VP9/Opus).
I own this work: Murali Nandan Nagarapu — Head of Engineering, Nucleus AI · nmn@withnucleus.ai
| Phase | What I set out to do | Status |
|---|---|---|
| 1 — Acquire | Turn the playlist into an indexed on-disk corpus: download + metadata + ASR + diarization | ✅ 67/67 complete |
| 2 — Chunk | Pick the base model + chunking operating point, then cut the corpus into VLM-sized chunks | ✅ Complete — 2,290 chunks |
| 3 — Describe | Generate gold per-chunk descriptions (the training targets) | ⏳ Next |
| 4 — Train + eval | Continually pretrain; measure Day-0 vs Day-N | ⏳ |
I keep the live working state — what's in flight, gotchas, next steps — in
HANDOFF.md and handoff/. This README is the stable overview; HANDOFF is the
agent/working canvas.
Decisions locked so far: base model Qwen2.5-VL-32B-Instruct; chunk the stream at 20 min / 1 fps at the model's default token budget; training target = generated descriptions (ASR/diarization are inputs, not targets); full bf16 + fp32 Adam, ZeRO-3, no LoRA; audio off for v1.
I turn a YouTube playlist into a clean, indexed, on-disk corpus: one folder per video with the
media plus all metadata and time-aligned transcripts, all tracked by a single manifest CSV
(ishowspeed_tour.csv) that is the spine of the phase — one row per video, static columns from
the playlist, runtime columns filled by the downloader and transcript steps.
The pipeline (in data/live_streams/ishowspeed/tour/):
build_manifest.py (playlist → CSV) → download.py (yt-dlp, Archivist format chain → MKV +
info + description + thumbnail + subs) → fetch_subs.py (YouTube captions) → asr_subs.py
(faster-whisper for caption-less videos) → diarize_subs.py (WhisperX + pyannote speaker
labels) → rename_files.py (ASCII-safe artifact names) → status.py (progress report). The
production pull ran sharded across the 8-node mesh (scripts/node_shard_download.sh), each
node taking a slice of the playlist to its local SSD and backing up to GCS.
Status: all 67/67 videos downloaded, transcribed, and diarized.
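To picture how "each node taking a slice of the playlist" can work, here is a minimal sketch. The actual slicing logic lives in scripts/node_shard_download.sh and is not reproduced here; the round-robin rule, the function name `shard_for_node`, and the placeholder video IDs are all assumptions made for illustration.

```python
def shard_for_node(video_ids: list[str], node_index: int, num_nodes: int = 8) -> list[str]:
    """Return the playlist slice for one node (round-robin; an assumed policy)."""
    if not 0 <= node_index < num_nodes:
        raise ValueError(f"node_index must be in [0, {num_nodes})")
    # Every num_nodes-th video, starting at this node's index.
    return video_ids[node_index::num_nodes]


# Placeholder IDs standing in for the 67 playlist entries.
videos = [f"video_{i:02d}" for i in range(67)]
shards = [shard_for_node(videos, n) for n in range(8)]
print([len(s) for s in shards])  # three nodes get 9 videos, five get 8
```

Round-robin keeps the shards within one video of each other in count, and every video lands on exactly one node, so the per-node downloads never overlap.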
A VLM has finite context; 752 hours can't go in at once. So I had to choose (a) a base model and (b) a chunking operating point (length, fps, resolution), then materialize the chunk dataset.
Strategy (deep record: experiments/chunk_length/HANDOVER.md).
A 6-model empirical sweep over a 195-question bank anchored to Day-01 Miami, auto-graded by LLM
judges, produced:
- Base model: Qwen2.5-VL-32B-Instruct (best mean score, best OCR of its cost tier, trains on ≤4 H100 with ZeRO-3). 72B is the backup for long chunks / critical OCR.
- Operating point: 20 min / 1 fps / model-default token budget. Quality holds to ~20 min and degrades past it.
- Hard-won lesson: benchmark on your own data. Qwen3-VL scores higher on public VideoMME but its OCR collapsed (~0.86 → ~0.15) on this footage.
The token-budget math (you'll need it in PHASE-3/4). Qwen2.5-VL uses an adaptive global
token cap (~12–14K vision tokens), not a fixed per-frame resolution. As chunk_seconds × fps
grows, total tokens stay ~constant and per-frame resolution shrinks:
pairs = chunk_s × fps / 2, pixels_per_pair ≈ 25.2M / pairs, frame_side ≈ √pixels_per_pair.
At 20 min × 1 fps → 600 pairs → ~205-px frames: great temporal coverage, legible for most
reasoning, not for small text. Drop to 0.5 fps or use the 72B if fine OCR ever becomes critical.
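The arithmetic above can be checked with a few lines of Python. This is a sketch of the formula exactly as stated here, not of the model's actual preprocessing code; the function name `frame_side_px` is hypothetical, and the 25.2M pixel budget is the figure quoted above.

```python
import math

# Global pixel budget per chunk, as quoted in the formula above (~25.2M pixels).
PIXEL_BUDGET = 25.2e6


def frame_side_px(chunk_seconds: float, fps: float,
                  pixel_budget: float = PIXEL_BUDGET) -> float:
    """Approximate side length in pixels of a square frame under the budget."""
    pairs = chunk_seconds * fps / 2        # frames are grouped two at a time
    pixels_per_pair = pixel_budget / pairs
    return math.sqrt(pixels_per_pair)


for minutes, fps in [(5, 1.0), (20, 1.0), (20, 0.5)]:
    side = frame_side_px(minutes * 60, fps)
    print(f"{minutes:>2} min @ {fps} fps -> ~{side:.0f} px per side")
```

At the chosen operating point (20 min, 1 fps) this gives about 205 px per side. Halving the frame rate to 0.5 fps doubles the pixels per pair, which raises the side length to roughly 290 px.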
Productionization (chunk_videos.py).
I stream-copy each video into 20-min chunks with ffmpeg's segment muxer (-c copy — lossless,
~2 min/video, keyframe-exact on this VP9), upload the chunk media to GCS, and write one row per
chunk into chunks_manifest.csv. The
local SSD is touched only transiently (staged → uploaded → deleted), so the durable chunk store
is GCS. The job is mesh-distributed: each node chunks the videos it holds locally and
writes its own chunks_manifest.d/<node>.csv; --merge unions the per-node parts into the single
source-of-truth table (scripts/launch_mesh_chunk.sh fans the run out over SSH and merges).
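A minimal sketch of the stream-copy step, for orientation only. It builds an ffmpeg segment-muxer command line without running it; the exact flags used by chunk_videos.py are not shown in this README, so the function name `segment_command`, the output pattern, and the choice of `-reset_timestamps 1` are assumptions.

```python
import shlex


def segment_command(src: str, out_pattern: str, chunk_seconds: int = 1200) -> list[str]:
    """Build an ffmpeg command that splits `src` into fixed-length chunks."""
    return [
        "ffmpeg", "-hide_banner",
        "-i", src,
        "-c", "copy",                         # stream copy: no re-encode, lossless
        "-f", "segment",                      # ffmpeg's segment muxer
        "-segment_time", str(chunk_seconds),  # 1200 s = 20 min
        "-reset_timestamps", "1",             # each chunk starts near t=0
        out_pattern,                          # e.g. chunk_%04d.mkv
    ]


# Hypothetical file names, for illustration only.
cmd = segment_command("video_01.mkv", "video_01_chunk_%04d.mkv")
print(shlex.join(cmd))
```

With `-c copy`, ffmpeg can only cut at keyframes, so chunk boundaries follow the source's keyframe spacing rather than exact 20-minute marks. To actually run the command, pass the list to `subprocess.run(cmd, check=True)`.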
Result: all 67 videos → 2,290 chunks → 752.7 h in GCS. Integrity verified — 0 duplicate
chunk_ids, 0 non-GCS paths, 0 n_chunks mismatches, per-video coverage to <0.1 s. chunks_manifest.csv
is the join table PHASE-3 consumes: one row per chunk, media_path → GCS, all PHASE-1 metadata carried.
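The integrity checks listed above can be sketched as a small validator. The column names `chunk_id`, `media_path`, and `n_chunks` come from the text; `video_id`, the sample rows, the bucket name, and the function name `check_manifest` are hypothetical, and the real checks may be implemented differently.

```python
import csv
import io
from collections import Counter


def check_manifest(rows: list[dict]) -> dict:
    """Report duplicate chunk_ids, non-GCS media paths, and n_chunks mismatches."""
    id_counts = Counter(r["chunk_id"] for r in rows)
    rows_per_video = Counter(r["video_id"] for r in rows)
    return {
        "duplicate_ids": sorted(c for c, n in id_counts.items() if n > 1),
        "non_gcs_paths": [r["chunk_id"] for r in rows
                          if not r["media_path"].startswith("gs://")],
        # A video mismatches when its declared n_chunks differs from its row count.
        "n_chunks_mismatches": sorted({r["video_id"] for r in rows
                                       if int(r["n_chunks"]) != rows_per_video[r["video_id"]]}),
    }


# Tiny made-up manifest; the third row deliberately points at a local path.
SAMPLE = """chunk_id,video_id,n_chunks,media_path
v01_0000,v01,2,gs://example-bucket/chunks/v01_0000.mkv
v01_0001,v01,2,gs://example-bucket/chunks/v01_0001.mkv
v02_0000,v02,1,/mnt/localssd/tmp/v02_0000.mkv
"""
report = check_manifest(list(csv.DictReader(io.StringIO(SAMPLE))))
print(report)
```

On the real table, all three lists should come back empty, matching the "0 duplicates, 0 non-GCS paths, 0 mismatches" result reported above.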
For every chunk I'll generate a rich textual description — the TARGET the model learns to
predict in PHASE-4. ASR transcripts and diarization are inputs to the generator, not training
targets. Each description should capture what's visually happening, where it's set, time/lighting,
anything notable, and cross-chunk pointers for later recall; I may condition each on the
previous chunk's description. Open questions (generator choice, paragraph vs structured JSON,
length, whether to add Q&A pairs) are tracked in handoff/phase-3-describe.md.
I'll prototype on 5–10 chunks before committing.
INPUT = [<|video_start|>] [vision_tokens(chunk @ 20min/1fps)] [<|video_end|>]
TARGET = chunk description + injected metadata (geo, timestamp, day-N-of-35, cross-chunk cues)
LOSS = causal LM loss on TARGET tokens only — vision input positions are masked
Why this shape. Modern open-weight VLMs do not convert video to text first. A vision encoder turns pixels into continuous embeddings placed in the LLM's token space, interleaved with text tokens; the LLM does next-token prediction over text positions only while attending to the vision tokens. So I train the model to describe what it sees — grounding it in this subject's visual experience — rather than fine-tuning a language head on synthetic captions.
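The "loss on TARGET tokens only" rule can be illustrated with a label-masking sketch. It uses -100 as the masked label, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The token IDs below are toy placeholders, not the model's real vocabulary, and `build_labels` is a hypothetical helper rather than the training code.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(input_ids: list[int], target_start: int) -> list[int]:
    """Mask every position before target_start; keep the target tokens as labels."""
    if not 0 <= target_start <= len(input_ids):
        raise ValueError("target_start must lie within the sequence")
    return [IGNORE_INDEX] * target_start + list(input_ids[target_start:])


# Toy IDs (assumed): 1 = video start, 2 = vision placeholder, 3 = video end.
VIDEO_START, VISION, VIDEO_END = 1, 2, 3
description = [10, 11, 12]                      # stand-in for description tokens
input_ids = [VIDEO_START, VISION, VISION, VISION, VISION, VIDEO_END] + description

labels = build_labels(input_ids, target_start=6)
print(labels)  # vision span masked, description tokens kept
```

The vision positions still feed the attention layers, so the description is predicted conditioned on them. They just contribute nothing to the loss.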
Full bf16 + fp32 Adam, ZeRO-3 sharded, gradient-checkpointed, no LoRA (continual pretraining
needs deep rewiring). ~2 a3-mega nodes (16 H100) for 32B full training. Milestones: v1 Day-01
alone (~11 h) → recall + emergent behavior; v2 all 67 (~752.8 h) → scaling + forgetting;
v3 audio; v4 more modalities; v5 online/streaming. Recipe + eval plan in
handoff/phase-4-train.md.
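A back-of-the-envelope check on the GPU count, as a sketch only. It assumes the common mixed-precision Adam accounting of 16 bytes per parameter (bf16 weights and gradients, plus fp32 master weights and two fp32 Adam moments), treats the model as exactly 32B parameters, and ignores activations, buffers, and fragmentation. The actual recipe is in handoff/phase-4-train.md and may differ.

```python
def model_state_gb(params_billion: float, num_gpus: int) -> tuple[float, float]:
    """Return (total GB, GB per GPU) of model state under full ZeRO-3 sharding."""
    bytes_per_param = (
        2    # bf16 weights
        + 2  # bf16 gradients
        + 4  # fp32 master weights
        + 4  # fp32 Adam first moment
        + 4  # fp32 Adam second moment
    )
    total_gb = params_billion * 1e9 * bytes_per_param / 1e9  # 1 GB = 1e9 bytes
    return total_gb, total_gb / num_gpus


total, per_gpu = model_state_gb(32, num_gpus=16)
print(f"model state: {total:.0f} GB total, {per_gpu:.0f} GB per GPU on 16 GPUs")
```

Under these assumptions the sharded model state is about 32 GB per 80 GB H100 on 16 GPUs. The remainder is what is available for activations, which gradient checkpointing helps keep in check on long vision sequences.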
poc/live_stream_stability/ ← this POC (a git repo)
├── README.md ← this file (human-facing overview)
├── HANDOFF.md handoff/ ← agent working canvas + per-phase notes (live state)
├── scripts/ ← mesh ops: sync_to_gcs, node_shard_download, launch_mesh_chunk
├── data/live_streams/ishowspeed/tour/ ← PHASE-1 pipeline + PHASE-2 chunker (code + seed manifests)
│ ├── *.py (build_manifest, download, fetch_subs, asr_subs, diarize_subs,
│ │ calib_diarize, rename_files, status, chunk_videos)
│ ├── playlist_raw.jsonl ishowspeed_tour.csv ishowspeed_tour.md ← PHASE-1 manifest seeds
│ └── chunks_manifest.csv ← PHASE-2 chunk table (the join key)
└── experiments/chunk_length/ ← PHASE-2 experiment record (HANDOVER + plan/ qa_bank/ code/ results/)
Third-party reference (gitignored, re-clonable): TheFrenchGhostys-Ultimate-YouTube-DL-Scripts-Collection
— download.py's yt-dlp format chain mirrors that repo's Archivist preset.
| Asset | Location | In git? | In GCS? |
|---|---|---|---|
| Pipeline code, docs, manifest seeds | this repo (NFS /home) | ✅ | ✅ (via repo) |
| 67 source videos (~1.1 TB) + info/desc/thumb/subs | /mnt/localssd/poc/data/live_streams/ishowspeed/tour/videos/ (per node) | ❌ | ✅ |
| 20-min chunks: 2,290 files (~1.1 TB) + chunks_manifest.csv | gs://…/tour/chunks/ (media); table in repo + mirrored to GCS | ✅ table | ✅ media |
| HF model cache (Qwen2.5-VL-32B etc.) | /mnt/localssd/.hf-home/ | ❌ | ❌ (re-downloadable) |
Storage discipline. Git is the source of truth for code/docs; GCS is the source of truth
for bulk data. The working tree /mnt/localssd/poc/ is rsync'd to
gs://nucleus-continual-learning/poc/ (us-east4, additive/never-delete). ISHOWSPEED_DATA_ROOT
overrides the data root (default /mnt/localssd/poc/data/live_streams/ishowspeed/tour). The
bucket enforces uniform bucket-level access + public-access-prevention — I share objects via
signed URLs, not public ACLs.
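The ISHOWSPEED_DATA_ROOT override can be pictured as follows. This is a sketch of the usual environment-variable pattern, not the scripts' actual code; the function name `data_root` and the choice to treat an empty value as unset are assumptions.

```python
import os
from pathlib import Path

# Default data root, as documented above.
DEFAULT_DATA_ROOT = "/mnt/localssd/poc/data/live_streams/ishowspeed/tour"


def data_root() -> Path:
    """Resolve the data root, preferring ISHOWSPEED_DATA_ROOT when it is set."""
    # An empty string is treated the same as unset (an assumed convention).
    return Path(os.environ.get("ISHOWSPEED_DATA_ROOT") or DEFAULT_DATA_ROOT)


print(data_root() / "videos")  # per-node location of the source videos
```

Resolving the path on each call, rather than once at import time, lets a test or a one-off run point the pipeline at a scratch directory without editing any script.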
Why this discipline exists. I lost the original PHASE-1 working data on
/mnt/localssd during a SLURM cluster replacement. Git (code/docs/small seeds) plus the GCS backup are now the durable safety net; nothing irreplaceable lives only on the local SSD.
conda activate moe # torch 2.8/cu128, yt-dlp, faster-whisper, whisperx, pyannote, ffmpeg 7.1
cd data/live_streams/ishowspeed/tour
# PHASE-1 (acquire) — already complete; regenerable + resumable:
python build_manifest.py # playlist → manifest (refuses to clobber done rows)
python download.py --all --no-comments --parallel 2 # full pull (production ran sharded via scripts/node_shard_download.sh)
python fetch_subs.py && python asr_subs.py --parallel 8 && python diarize_subs.py --parallel 8
python rename_files.py && python status.py
# PHASE-2 (chunk) — per node, idempotent; or fan out across the mesh:
python chunk_videos.py --all # chunk this node's done videos → GCS
python chunk_videos.py --merge # union per-node parts → chunks_manifest.csv (+ GCS)
python chunk_videos.py --status # cross-node totals
bash ../../../../scripts/launch_mesh_chunk.sh # fan out over all 8 nodes, then merge
Infra. GCP poetic-avenue-438401-a7, zone us-east4-b. SLURM partition a3mega = 8× a3-mega
nodes nucla3m-a3meganodeset-[0-7] (8× H100 80 GB each); /home is shared NFS, /mnt/localssd
is per-node local; node-to-node SSH is passwordless. HF_TOKEN is in ~/.bashrc.