feat: end-frame conditioning via guiding-token appending for all pipelines #29
azrahello wants to merge 3 commits into
…TILLED)

Replace the hard latent-replace approach (apply_conditioning with frame_idx=-1) with token appending in denoise_distilled: the encoded end-frame latent is flattened and appended to the video sequence before each transformer step, using a timestep scaled by (1 - guiding_strength) so the model attends to the target end frame without overwriting any video token. This removes the abrupt appearance jump near the last frame that the replace approach caused.

- Add _prepare_guiding_tokens() helper to flatten cond latent and extract last-frame positional encodings from the full position grid
- Add guiding_tokens/guiding_positions/guiding_strength params to denoise_distilled
- Wire up s1/s2 guiding tokens in the DISTILLED two-stage pipeline
- Simplify _build_i2v_conditionings: remove end_image params (guiding tokens now own end-frame conditioning for DISTILLED; other pipelines pending)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
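The flattening and timestep scaling described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `prepare_guiding_tokens` and `guiding_timestep` are hypothetical names, NumPy stands in for the project's array library, and the tensor layouts (a `(1, C, 1, h, w)` latent and a frame-major `(1, frames*h*w, 3)` position grid) are assumed rather than confirmed by the source.

```python
import numpy as np


def prepare_guiding_tokens(cond_latent, positions, frames):
    """Hypothetical stand-in for a helper like _prepare_guiding_tokens().

    Assumed layouts (not confirmed by the source):
      cond_latent: (1, C, 1, h, w)         encoded end frame
      positions:   (1, frames * h * w, 3)  one (t, y, x) row per video token,
                                           ordered frame by frame
    """
    _, c, _, h, w = cond_latent.shape
    # (1, C, 1, h, w) -> (1, h*w, C): one token per spatial location.
    tokens = cond_latent.reshape(1, c, h * w).transpose(0, 2, 1)
    if positions.shape[1] != frames * h * w:
        raise ValueError("position grid does not match frames * h * w")
    # With frame-major ordering, the last h*w rows belong to the final frame.
    guiding_positions = positions[:, -h * w:, :]
    return tokens, guiding_positions


def guiding_timestep(sigma, guiding_strength):
    """Timestep seen by guiding tokens: sigma scaled by (1 - guiding_strength)."""
    return sigma * (1.0 - guiding_strength)
```

Under this scaling, a guiding_strength of 1.0 presents the end-frame tokens to the model as fully clean (timestep 0), while 0.0 gives them the same noise level as the video tokens.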
Add guiding_tokens/guiding_positions/guiding_strength params to denoise_dev_av and denoise_res2s_av, mirroring the approach already applied to denoise_distilled.

For both functions:
- Pre-compute extended RoPE (video + guiding positions) once outside the denoising loop to avoid per-step recomputation.
- Inside each loop step, concatenate guiding tokens + scaled timestep to the video sequence before every transformer pass, then slice velocity back to num_video_tokens. Applies to all guidance passes (pos, neg, STG, modality) so none of the guidance math sees guiding-token outputs.

Wire up all call sites:
- DEV: end_image_latent → _prepare_guiding_tokens → denoise_dev_av
- DEV_TWO_STAGE stage1 (denoise_dev_av): stage1_end_image_latent
- DEV_TWO_STAGE stage2 (denoise_distilled): stage2_end_image_latent
- DEV_TWO_STAGE_HQ stage1 (denoise_res2s_av): stage1_end_image_latent
- DEV_TWO_STAGE_HQ stage2 (denoise_res2s_av): stage2_end_image_latent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
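The loop structure this commit describes (positional encodings computed once, tokens appended per step, velocity sliced back) might look like the sketch below. Everything here is illustrative: `denoise_with_guiding`, the `transformer` and `precompute_rope` callables, and the plain Euler update are assumptions, NumPy stands in for the project's array library, and the separate guidance passes (pos, neg, STG, modality) are collapsed into a single transformer call.

```python
import numpy as np


def denoise_with_guiding(latents, guiding_tokens, video_pos, guiding_pos,
                         sigmas, guiding_strength, transformer, precompute_rope):
    """Illustrative denoising loop with appended guiding tokens.

    latents:        (1, N_video, C) noisy video tokens
    guiding_tokens: (1, N_guide, C) flattened end-frame latent
    transformer:    hypothetical callable (seq, timesteps, rope) -> velocity
    """
    num_video_tokens = latents.shape[1]
    num_guiding_tokens = guiding_tokens.shape[1]

    # Extended positional encodings are computed once, outside the loop.
    rope = precompute_rope(np.concatenate([video_pos, guiding_pos], axis=1))

    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Append guiding tokens on the sequence axis; video tokens are untouched.
        seq = np.concatenate([latents, guiding_tokens], axis=1)
        # Video tokens see sigma; guiding tokens see sigma * (1 - strength).
        timesteps = np.concatenate(
            [
                np.full((1, num_video_tokens), sigma),
                np.full((1, num_guiding_tokens), sigma * (1.0 - guiding_strength)),
            ],
            axis=1,
        )
        velocity = transformer(seq, timesteps, rope)
        # Discard guiding-token outputs before any further math.
        velocity = velocity[:, :num_video_tokens, :]
        # Assumed plain Euler update; the real samplers may differ.
        latents = latents + (sigma_next - sigma) * velocity
    return latents
```

Slicing immediately after the transformer call keeps every downstream computation at the original video sequence length, which is why the guidance math never has to know the guiding tokens exist.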
The config-declared out_channels can diverge from the actual weight in LTX-2.3 (1024 vs 2048), causing a broadcast error when the skip connection is added. Infer out_channels from conv.weight at runtime so the reshape is always consistent with the loaded weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
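A minimal sketch of reading the channel count from the loaded weight instead of the config. It assumes that output channels sit on axis 0 of the convolution weight and that the block's effective output width is the convolution's output channels multiplied by the product of the strides (a space-to-depth layout). The function names and the example shapes are hypothetical and not taken from the PR.

```python
from math import prod


def infer_out_channels(conv_weight_shape, stride):
    """Derive the block's output channels from the weight, not the config.

    Assumption: axis 0 of the conv weight holds output channels, and
    space-to-depth multiplies them by prod(stride).
    """
    conv_out = conv_weight_shape[0]
    return conv_out * prod(stride)


def skip_group_size(in_channels, out_channels, stride):
    """Number of rearranged input channels averaged into each output channel
    on the skip branch (assumed group-mean skip connection)."""
    total = in_channels * prod(stride)
    if total % out_channels != 0:
        raise ValueError(
            f"{total} rearranged channels cannot be grouped into {out_channels}"
        )
    return total // out_channels


# Hypothetical shapes: a conv weight with 256 output channels and stride
# (2, 2, 2) implies 2048 effective channels, whatever the config declares.
out_channels = infer_out_channels((256, 3, 3, 3, 512), (2, 2, 2))
group_size = skip_group_size(512, out_channels, (2, 2, 2))
```

Deriving both values from the same weight tensor keeps the skip branch and the convolution branch at matching widths, so the addition cannot hit a broadcast mismatch caused by a stale config value.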
Could you share samples from before and after?
Sorry for the late reply!

The previous approach used a hard latent-replace strategy for the end frame: the encoded target latent was directly overwritten at position -1 at every denoising step. This did not …

The new approach replaces the hard-replace with token appending: the encoded end-frame latent is flattened and appended to the video sequence as extra tokens before each transformer …

The end-frame works well even without a start frame. One remaining characteristic is that the conditioning tends to assert itself rather sharply; longer videos naturally mitigate this, …

output_old2.mp4
new output.mp4


Summary
- … VideoConditionByKeyframeIndex …
- … DISTILLED, DEV, DEV_TWO_STAGE, DEV_TWO_STAGE_HQ …
- Fixes a SpaceToDepthDownsample broadcast error in the VAE encoder that appeared with LTX-2.3 weights (config declares out_channels=1024 but actual weights are 2048; the fix reads the channel count from the weight tensor at runtime)

Technical approach
For each denoiser call:
- The encoded end-frame latent is flattened to (1, h*w, C) tokens
- [video_tokens, guiding_tokens] are concatenated on the sequence axis; guiding tokens receive cond_ts = sigma × (1 − guiding_strength) (equivalent to denoise_mask = 1 − strength in the PyTorch reference)
- Only velocity[:, :num_video_tokens, :] is used; guiding token outputs are discarded
- In denoise_dev_av and denoise_res2s_av, the extended position grid covering both video and guiding tokens is passed to precompute_freqs_cis once before the loop

Testing status
Tested on the DISTILLED pipeline only (LTX-2.3 distilled, prince-canuma/LTX-2.3-distilled).

The DEV, DEV_TWO_STAGE, and DEV_TWO_STAGE_HQ pipelines follow the same token-appending logic and have been reviewed for correctness, but have not been run end-to-end: the base (non-distilled) LTX-2.3 model weights are several GB and were not locally available during development. Community testing on those pipelines is welcome.

🤖 Generated with Claude Code