Add post-training example with structured-JSON captions by Xuanmeng-Zhang · Pull Request #17 · NVIDIA/cosmos-framework

Xuanmeng-Zhang · 2026-06-04T04:06:13Z

The post-training example trained on dense captions while inference uses the model's native structured-JSON prompt format — a misalignment. This makes structured JSON (caption_json) the default across captioning, training, and inference, with the dense narrative kept as a backup, and migrates the example onto the maintained nvidia/BridgeData2-Subset-Synthetic-Captions dataset.

Caption format & pipeline:

New inference/structured_caption.py: canonical schema + parse/assemble of the two-phase VLM output; caption_json_to_prompt is the single serializer shared by the loader and inference for byte-identical train↔infer prompts.
caption_from_video.py saves both caption.json (structured) and caption.txt (dense); two-phase video_captioner.txt emits the full canonical Phase-1 JSON.
captions_to_sft_jsonl.py emits caption_json + dense caption; adds --num-video-frames, loader-matching filters, and a <output>.summary.json.
sft_dataset.py _select_caption: caption_json top priority, dict serialized verbatim (no prose-period/suffix/media mangling on the JSON path); configurable max_num_tokens (recipes raise it to 2048 for the longer JSON prompts).
New inference_prompts_to_json.py rewrites val inference prompts to JSON.
New video_metadata.py (ffprobe media fields).
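
The shared-serializer idea in the list above can be sketched as follows. This is a minimal illustration, not the code in `structured_caption.py` or `sft_dataset.py`. Only the names `caption_json_to_prompt`, `caption_json`, and `caption` come from the description. The function bodies, the compact `json.dumps` settings, the `select_caption` helper, and the example schema keys are assumptions made for the example.

```python
import json
from typing import Any, Mapping


def caption_json_to_prompt(caption_json: Mapping[str, Any]) -> str:
    """Serialize a structured caption deterministically (hypothetical body).

    Key order is preserved and no optional whitespace is emitted, so the same
    dict always yields the same string in the loader and at inference time.
    """
    return json.dumps(caption_json, ensure_ascii=False, separators=(",", ":"))


def select_caption(record: Mapping[str, Any]) -> str:
    """Prefer the structured caption; fall back to the dense narrative."""
    caption_json = record.get("caption_json")
    if isinstance(caption_json, dict) and caption_json:
        # JSON path: no prose post-processing is applied to the string.
        return caption_json_to_prompt(caption_json)
    return str(record.get("caption", "")).strip()


# Hypothetical record; the real schema keys are defined in the PR, not here.
record = {
    "caption_json": {"subject": "robot arm", "action": "picks up a spoon"},
    "caption": "A robot arm picks up a spoon.",
}

train_prompt = select_caption(record)                           # loader side
infer_prompt = caption_json_to_prompt(record["caption_json"])   # inference side
assert train_prompt == infer_prompt
```

Routing both sides through one function is what makes the byte-identical guarantee checkable: a test only needs to compare the two strings.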

Evaluation:

New cosmos_framework/scripts/eval.py: CPU PSNR/SSIM of generated vs GT video, aggregated per conditioning mode, with --compare-baseline for A/B deltas. Verified on base Cosmos3-Nano (JSON vs dense prompt, 51 clips × T2V/I2V/V2V): JSON ≥ dense on the conditioned modes (I2V/V2V SSIM win-rate 71%/69%, p<0.01).
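
As a rough sketch of the evaluation described above, the snippet below computes PSNR, averages a metric per conditioning mode, and reports the delta against a baseline run. It is not the code in `eval.py`: it works on flat sequences of 8-bit pixel values rather than decoded video tensors, omits SSIM, and the helper names `psnr`, `mean_by_mode`, and `deltas` are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Iterable, Mapping, Sequence


def psnr(gt: Sequence[int], pred: Sequence[int], max_value: float = 255.0) -> float:
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(gt) != len(pred) or not gt:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(gt, pred)) / len(gt)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value**2 / mse)


def mean_by_mode(rows: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Average per-clip values grouped by conditioning mode (t2v/i2v/v2v)."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for mode, value in rows:
        grouped[mode].append(value)
    return {mode: mean(values) for mode, values in grouped.items()}


def deltas(candidate: Mapping[str, float], baseline: Mapping[str, float]) -> dict[str, float]:
    """Candidate minus baseline, for modes present in both runs."""
    return {m: candidate[m] - baseline[m] for m in candidate if m in baseline}


# Made-up numbers, for illustration only.
json_run = mean_by_mode([("i2v", 30.0), ("i2v", 32.0), ("t2v", 20.0)])
dense_run = mean_by_mode([("i2v", 29.0), ("i2v", 31.0), ("t2v", 21.0)])
print(deltas(json_run, dense_run))
```

Reporting deltas per mode rather than one pooled mean keeps a gain on the conditioned modes from being masked by T2V, where no ground-truth frames constrain the output.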

Dataset unification:

Migrate every reference (docs, launch shells, checkpoints.py DATASETS registry + its tests, H100 staging) from the dense-only nvidia/bridge-v2-subset-synthetic-captions to nvidia/BridgeData2-Subset-Synthetic-Captions at the revision that carries caption_json. Docs gain Format + Evaluate sections.

Tests: 52 passing (structured_caption, captions_to_sft_jsonl, inference_prompts_to_json, caption_from_video, sft_dataset caption selection, eval).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lfengad · 2026-06-04T07:43:58Z

@@ -0,0 +1,284 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


It seems we could remove this eval.py file. Is the eval part necessary?

lfengad · 2026-06-04T07:47:16Z

+CPU-only "vision" evaluation: pair each predicted ``vision.mp4`` with its ground-truth
+video, compute per-clip PSNR and SSIM, and aggregate the means **per conditioning mode**
+(``t2v`` / ``i2v`` / ``v2v``). This is a dependency-light port of imaginaire4's
+``cosmos3.scripts.eval`` *vision* path (which computes PSNR only); SSIM is added here with


This cosmos3.*** reference should be removed, since it is deprecated.

lfengad · 2026-06-04T07:48:35Z

+def compute_video_metrics(gt_cthw_uint8: torch.Tensor, pred_path: Path) -> dict[str, float]:
+    """Read ``pred_path``, align it to GT, and return ``{"psnr", "ssim"}``.
+
+    Alignment mirrors the imaginaire4 reference: read at most ``T_gt + 1`` frames (so an


Mentions of imaginaire4 should be removed from all the text.

Xuanmeng-Zhang requested a review from lfengad June 4, 2026 04:06

Xuanmeng-Zhang force-pushed the unify-structured-jsonl-captions branch from 17471de to 8eef586 Compare June 4, 2026 04:17

Merge branch 'main' into unify-structured-jsonl-captions

d6b4eb2

lfengad reviewed Jun 4, 2026

Xuanmeng-Zhang wants to merge 2 commits into main from unify-structured-jsonl-captions
