Add post-training example with structured-JSON captions #17

Open
Xuanmeng-Zhang wants to merge 2 commits into main from unify-structured-jsonl-captions

Conversation

@Xuanmeng-Zhang
Collaborator

The post-training example trained on dense captions, while inference uses the model's native structured-JSON prompt format, so the training and inference prompts were misaligned. This PR makes structured JSON (caption_json) the default across captioning, training, and inference, keeps the dense narrative as a backup, and migrates the example onto the maintained nvidia/BridgeData2-Subset-Synthetic-Captions dataset.

Caption format & pipeline:

  • New inference/structured_caption.py: canonical schema + parse/assemble of the two-phase VLM output; caption_json_to_prompt is the single serializer shared by the loader and inference for byte-identical train↔infer prompts.
  • caption_from_video.py saves both caption.json (structured) and caption.txt (dense); two-phase video_captioner.txt emits the full canonical Phase-1 JSON.
  • captions_to_sft_jsonl.py emits caption_json + dense caption; adds --num-video-frames, loader-matching filters, and a <output>.summary.json.
  • sft_dataset.py _select_caption: caption_json top priority, dict serialized verbatim (no prose-period/suffix/media mangling on the JSON path); configurable max_num_tokens (recipes raise it to 2048 for the longer JSON prompts).
  • New inference_prompts_to_json.py rewrites val inference prompts to JSON.
  • New video_metadata.py (ffprobe media fields).
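
To make the byte-identical train↔infer claim concrete, here is a minimal sketch of a shared serializer plus a caption selector that prefers `caption_json` and falls back to the dense `caption`. It is not the code from this PR. The names `sketch_caption_json_to_prompt` and `sketch_select_caption`, the `json.dumps` settings, and the sample keys `subject` and `action` are all assumptions made for illustration; the real schema lives in `inference/structured_caption.py`.

```python
import json
from typing import Any, Mapping


def sketch_caption_json_to_prompt(caption_json: Mapping[str, Any]) -> str:
    """Serialize a structured caption deterministically (hypothetical settings)."""
    # Fixed separators and ensure_ascii=False keep the bytes stable for every
    # caller. Key order is kept as given (no sort_keys); that is an assumption.
    return json.dumps(caption_json, ensure_ascii=False, separators=(",", ":"))


def sketch_select_caption(record: Mapping[str, Any]) -> str:
    """Prefer the structured `caption_json`; fall back to the dense `caption`."""
    structured = record.get("caption_json")
    if isinstance(structured, dict) and structured:
        # JSON path: no prose post-processing, the dict is serialized as-is.
        return sketch_caption_json_to_prompt(structured)
    return str(record.get("caption", ""))


if __name__ == "__main__":
    # Hypothetical record; only the two top-level field names come from the PR.
    record = {
        "caption_json": {"subject": "robot arm", "action": "picks up a cup"},
        "caption": "A robot arm picks up a cup.",
    }
    train_prompt = sketch_select_caption(record)
    infer_prompt = sketch_caption_json_to_prompt(record["caption_json"])
    # Both sides call the same serializer, so the bytes match.
    assert train_prompt.encode("utf-8") == infer_prompt.encode("utf-8")
    print(train_prompt)
```

Routing both the loader and inference through one function is what rules out drift; two separately maintained serializers could diverge on whitespace or escaping without any test noticing.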

Evaluation:

  • New cosmos_framework/scripts/eval.py: CPU PSNR/SSIM of generated vs GT video, aggregated per conditioning mode, with --compare-baseline for A/B deltas. Verified on base Cosmos3-Nano (JSON vs dense prompt, 51 clips × T2V/I2V/V2V): JSON ≥ dense on the conditioned modes (I2V/V2V SSIM win-rate 71%/69%, p<0.01).
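
The aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `eval.py`: it works on flat sequences of 8-bit pixel values in pure Python rather than on decoded video, it covers PSNR only (SSIM is omitted), and the helper names `psnr`, `aggregate_by_mode`, and `compare_to_baseline` are hypothetical. The PR says only that `--compare-baseline` reports A/B deltas, so the "run minus baseline" convention here is a guess.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Iterable, Sequence


def psnr(gt: Sequence[int], pred: Sequence[int], max_value: float = 255.0) -> float:
    """PSNR in dB between two equal-length sequences of 8-bit pixel values."""
    if len(gt) != len(pred) or not gt:
        raise ValueError("gt and pred must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(gt, pred)) / len(gt)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def aggregate_by_mode(rows: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Average a per-clip metric within each conditioning mode (t2v/i2v/v2v)."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for mode, value in rows:
        buckets[mode].append(value)
    return {mode: mean(values) for mode, values in buckets.items()}


def compare_to_baseline(run: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Per-mode delta (run minus baseline) for modes present in both results."""
    return {mode: run[mode] - baseline[mode] for mode in run if mode in baseline}


if __name__ == "__main__":
    # Toy "clips": four pixel values each, purely illustrative.
    gt = [0, 128, 255, 64]
    json_rows = [("i2v", psnr(gt, [0, 130, 250, 64])), ("v2v", psnr(gt, [1, 128, 255, 60]))]
    dense_rows = [("i2v", psnr(gt, [5, 120, 250, 70])), ("v2v", psnr(gt, [3, 125, 250, 60]))]
    deltas = compare_to_baseline(aggregate_by_mode(json_rows), aggregate_by_mode(dense_rows))
    for mode, delta in sorted(deltas.items()):
        print(f"{mode}: {delta:+.2f} dB vs baseline")
```

Reporting per mode rather than one pooled mean matters here because the PR's result is mode-specific: the JSON prompt wins on the conditioned modes (I2V/V2V), which a pooled average could hide.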

Dataset unification:

  • Migrate every reference (docs, launch shells, checkpoints.py DATASETS registry + its tests, H100 staging) from the dense-only nvidia/bridge-v2-subset-synthetic-captions to nvidia/BridgeData2-Subset-Synthetic-Captions at the revision that carries caption_json. Docs gain Format + Evaluate sections.

Tests: 52 passing (structured_caption, captions_to_sft_jsonl, inference_prompts_to_json, caption_from_video, sft_dataset caption selection, eval).

Xuanmeng-Zhang requested a review from lfengad on June 4, 2026 04:06
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Xuanmeng-Zhang force-pushed the unify-structured-jsonl-captions branch from 17471de to 8eef586 on June 4, 2026 04:17
@@ -0,0 +1,284 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

It seems we could remove this eval.py file. Is the eval part necessary?

CPU-only "vision" evaluation: pair each predicted ``vision.mp4`` with its ground-truth
video, compute per-clip PSNR and SSIM, and aggregate the means **per conditioning mode**
(``t2v`` / ``i2v`` / ``v2v``). This is a dependency-light port of imaginaire4's
``cosmos3.scripts.eval`` *vision* path (which computes PSNR only); SSIM is added here with
Collaborator

This cosmos3.*** reference should be removed, since it is deprecated.

def compute_video_metrics(gt_cthw_uint8: torch.Tensor, pred_path: Path) -> dict[str, float]:
    """Read ``pred_path``, align it to GT, and return ``{"psnr", "ssim"}``.

    Alignment mirrors the imaginaire4 reference: read at most ``T_gt + 1`` frames (so an
Collaborator

The imaginaire4 mentions should be removed from all of the text.
