Add post-training example with structured-JSON captions#17
Open
Xuanmeng-Zhang wants to merge 2 commits into
The post-training example trained on dense captions while inference uses the model's native structured-JSON prompt format, which is a misalignment. This makes structured JSON (`caption_json`) the default across captioning, training, and inference, with the dense narrative kept as a backup, and migrates the example onto the maintained `nvidia/BridgeData2-Subset-Synthetic-Captions` dataset.

Caption format & pipeline:
- New `inference/structured_caption.py`: canonical schema + parse/assemble of the two-phase VLM output; `caption_json_to_prompt` is the single serializer shared by the loader and inference for byte-identical train↔infer prompts.
- `caption_from_video.py` saves both `caption.json` (structured) and `caption.txt` (dense); two-phase `video_captioner.txt` emits the full canonical Phase-1 JSON.
- `captions_to_sft_jsonl.py` emits `caption_json` + dense `caption`; adds `--num-video-frames`, loader-matching filters, and a `<output>.summary.json`.
- `sft_dataset.py` `_select_caption`: `caption_json` top priority, dict serialized verbatim (no prose-period/suffix/media mangling on the JSON path); configurable `max_num_tokens` (recipes raise it to 2048 for the longer JSON prompts).
- New `inference_prompts_to_json.py` rewrites val inference prompts to JSON.
- New `video_metadata.py` (ffprobe media fields).

Evaluation:
- New `cosmos_framework/scripts/eval.py`: CPU PSNR/SSIM of generated vs GT video, aggregated per conditioning mode, with `--compare-baseline` for A/B deltas. Verified on base Cosmos3-Nano (JSON vs dense prompt, 51 clips × T2V/I2V/V2V): JSON ≥ dense on the conditioned modes (I2V/V2V SSIM win-rate 71%/69%, p<0.01).

Dataset unification:
- Migrate every reference (docs, launch shells, checkpoints.py DATASETS registry + its tests, H100 staging) from the dense-only `nvidia/bridge-v2-subset-synthetic-captions` to `nvidia/BridgeData2-Subset-Synthetic-Captions` at the revision that carries `caption_json`. Docs gain Format + Evaluate sections.

Tests: 52 passing (structured_caption, captions_to_sft_jsonl, inference_prompts_to_json, caption_from_video, sft_dataset caption selection, eval).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
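
To illustrate why a single shared serializer yields byte-identical train↔infer prompts, here is a minimal sketch. It is not the code in `inference/structured_caption.py`: the function bodies, the `select_caption` helper, and the `subject`/`action` field names are hypothetical, and sorting keys is just one assumed way to make the output independent of dict insertion order (the real serializer may preserve order instead).

```python
import json
from typing import Any, Mapping


def caption_json_to_prompt(caption_json: Mapping[str, Any]) -> str:
    """Hypothetical stand-in: serialize a structured caption deterministically."""
    # Fixed separators and sorted keys mean equal dicts always produce the
    # same string, no matter which code path built them (assumption).
    return json.dumps(
        caption_json, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def select_caption(record: Mapping[str, Any]) -> str:
    """Prefer the structured ``caption_json``; fall back to the dense ``caption``."""
    structured = record.get("caption_json")
    if isinstance(structured, Mapping) and structured:
        # JSON path: no prose post-processing, only the shared serializer.
        return caption_json_to_prompt(structured)
    return record["caption"]


# Training-side record and inference-side prompt built from the same content
# (field names here are made up for illustration).
record = {
    "caption": "A robot arm picks up a cup.",
    "caption_json": {"subject": "robot arm", "action": "picks up a cup"},
}
train_prompt = select_caption(record)
infer_prompt = caption_json_to_prompt(
    {"action": "picks up a cup", "subject": "robot arm"}
)
assert train_prompt == infer_prompt
```

Because both sides call the same function, any later change to the serialization applies to training and inference together, which is the property the description relies on.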
17471de to 8eef586
lfengad reviewed Jun 4, 2026
@@ -0,0 +1,284 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator
It seems we could remove this `eval.py` file. Is the eval part really necessary?
CPU-only "vision" evaluation: pair each predicted ``vision.mp4`` with its ground-truth
video, compute per-clip PSNR and SSIM, and aggregate the means **per conditioning mode**
(``t2v`` / ``i2v`` / ``v2v``). This is a dependency-light port of imaginaire4's
``cosmos3.scripts.eval`` *vision* path (which computes PSNR only); SSIM is added here with
Collaborator
This `cosmos3.***` reference should be removed, since it is deprecated.
def compute_video_metrics(gt_cthw_uint8: torch.Tensor, pred_path: Path) -> dict[str, float]:
    """Read ``pred_path``, align it to GT, and return ``{"psnr", "ssim"}``.

    Alignment mirrors the imaginaire4 reference: read at most ``T_gt + 1`` frames (so an
Collaborator
The imaginaire4 mentions should be removed from all of the text.
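
For readers weighing whether the evaluation is worth keeping, the core of the per-clip PSNR and per-mode aggregation described in the `eval.py` docstring can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the PR's implementation: it assumes each clip has already been decoded and frame-aligned into a flat uint8 buffer (`bytes`), it omits SSIM, and the names `psnr_uint8` and `mean_psnr_per_mode` are hypothetical.

```python
import math
from collections import defaultdict
from statistics import fmean
from typing import Iterable


def psnr_uint8(gt: bytes, pred: bytes) -> float:
    """PSNR in dB between two equal-length uint8 buffers (peak value 255)."""
    if len(gt) != len(pred) or not gt:
        raise ValueError("buffers must be non-empty and the same length")
    # Iterating over bytes yields ints in [0, 255].
    mse = fmean((a - b) ** 2 for a, b in zip(gt, pred))
    if mse == 0:
        return math.inf  # identical buffers
    return 10.0 * math.log10(255.0**2 / mse)


def mean_psnr_per_mode(
    clips: Iterable[tuple[str, bytes, bytes]],
) -> dict[str, float]:
    """Average per-clip PSNR separately for each conditioning mode."""
    per_mode: dict[str, list[float]] = defaultdict(list)
    for mode, gt, pred in clips:
        per_mode[mode].append(psnr_uint8(gt, pred))
    return {mode: fmean(scores) for mode, scores in per_mode.items()}


# Toy data: two "i2v" clips and one "v2v" clip of 16 pixels each.
clips = [
    ("i2v", bytes([0] * 16), bytes([1] * 16)),
    ("i2v", bytes([0] * 16), bytes([2] * 16)),
    ("v2v", bytes([10] * 16), bytes([12] * 16)),
]
print(mean_psnr_per_mode(clips))
```

An A/B comparison such as the `--compare-baseline` mode mentioned in the description would then amount to running this aggregation on two sets of predictions and subtracting the per-mode means.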