feat: end-frame conditioning via guiding-token appending for all pipelines by azrahello · Pull Request #29 · Blaizzy/mlx-video

azrahello · 2026-05-13T08:49:28Z

Summary

Replaces the previous hard-token-replacement approach for end-frame I2V conditioning with the guiding-token appending mechanism used by LTX-2's own VideoConditionByKeyframeIndex
The conditioning latent is flattened into tokens and appended to the video sequence before each transformer pass; the model attends to the target end frame as a clean reference without overwriting any video token, avoiding abrupt appearance jumps
Extended to all four pipelines: DISTILLED, DEV, DEV_TWO_STAGE, DEV_TWO_STAGE_HQ
Also fixes a SpaceToDepthDownsample broadcast error in the VAE encoder that appeared with LTX-2.3 weights (config declares out_channels=1024 but actual weights are 2048; the fix reads the channel count from the weight tensor at runtime)

Technical approach

For each denoiser call:

The end-image latent is encoded once and flattened to (1, h*w, C) tokens
The positions of the last latent frame are extracted from the video position grid and used as the guiding token positions
Inside the denoising loop: [video_tokens, guiding_tokens] are concatenated on the sequence axis; guiding tokens receive cond_ts = sigma × (1 − guiding_strength) (equivalent to denoise_mask = 1 − strength in the PyTorch reference)
After the transformer forward pass, only velocity[:, :num_video_tokens, :] is used — guiding token outputs are discarded
For denoisers that pre-compute RoPE (denoise_dev_av, denoise_res2s_av), the extended position grid covering both video and guiding tokens is passed to precompute_freqs_cis once before the loop
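
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the PR's code: the function names, the (1, C, 1, h, w) latent layout, the (1, 3, N) position-grid layout with frame-major token order, and the signature of the `transformer` callable are all assumptions made for the example.

```python
import numpy as np


def prepare_guiding_tokens(cond_latent, video_positions, h, w):
    """Flatten an end-frame latent and pick the last-frame positions.

    Assumed layouts (hypothetical, for illustration only):
      cond_latent:     (1, C, 1, h, w)
      video_positions: (1, 3, F*h*w), tokens ordered frame by frame
    """
    b, c = cond_latent.shape[:2]
    # (1, C, h*w) -> (1, h*w, C): one token per spatial location
    tokens = cond_latent.reshape(b, c, h * w).transpose(0, 2, 1)
    # The last h*w entries of the grid belong to the last latent frame
    guiding_positions = video_positions[:, :, -(h * w):]
    return tokens, guiding_positions


def denoise_step(transformer, video_tokens, video_positions, sigma,
                 guiding_tokens, guiding_positions, guiding_strength):
    """One transformer pass with guiding tokens appended, then sliced off."""
    num_video_tokens = video_tokens.shape[1]
    num_guiding = guiding_tokens.shape[1]

    # Append guiding tokens and their positions on the sequence axis
    tokens = np.concatenate([video_tokens, guiding_tokens], axis=1)
    positions = np.concatenate([video_positions, guiding_positions], axis=2)

    # Video tokens see sigma; guiding tokens see sigma * (1 - strength)
    cond_ts = sigma * (1.0 - guiding_strength)
    timesteps = np.concatenate(
        [np.full((1, num_video_tokens), sigma),
         np.full((1, num_guiding), cond_ts)],
        axis=1,
    )

    velocity = transformer(tokens, positions, timesteps)
    # Guiding-token outputs are discarded
    return velocity[:, :num_video_tokens, :]
```

With guiding_strength = 1 the guiding tokens receive timestep 0 and are treated as fully clean; with guiding_strength = 0 they share the video's sigma.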

Testing status

⚠️ Tested on DISTILLED pipeline only (LTX-2.3 distilled, prince-canuma/LTX-2.3-distilled).

The DEV, DEV_TWO_STAGE, and DEV_TWO_STAGE_HQ pipelines follow the same token-appending logic and have been reviewed for correctness, but have not been run end-to-end — the base (non-distilled) LTX-2.3 model weights are several GB and were not locally available during development. Community testing on those pipelines is welcome.

🤖 Generated with Claude Code

…TILLED)

Replace the hard latent-replace approach (apply_conditioning with frame_idx=-1) with token appending in denoise_distilled: the encoded end-frame latent is flattened and appended to the video sequence before each transformer step, using a timestep scaled by (1 - guiding_strength) so the model attends to the target end frame without overwriting any video token. This removes the abrupt appearance jump near the last frame that the replace approach caused.

- Add _prepare_guiding_tokens() helper to flatten cond latent and extract last-frame positional encodings from the full position grid
- Add guiding_tokens/guiding_positions/guiding_strength params to denoise_distilled
- Wire up s1/s2 guiding tokens in the DISTILLED two-stage pipeline
- Simplify _build_i2v_conditionings: remove end_image params (guiding tokens now own end-frame conditioning for DISTILLED; other pipelines pending)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add guiding_tokens/guiding_positions/guiding_strength params to denoise_dev_av and denoise_res2s_av, mirroring the approach already applied to denoise_distilled.

For both functions:
- Pre-compute extended RoPE (video + guiding positions) once outside the denoising loop to avoid per-step recomputation.
- Inside each loop step, concatenate guiding tokens + scaled timestep to the video sequence before every transformer pass, then slice velocity back to num_video_tokens. Applies to all guidance passes (pos, neg, STG, modality) so none of the guidance math sees guiding-token outputs.

Wire up all call sites:
- DEV: end_image_latent → _prepare_guiding_tokens → denoise_dev_av
- DEV_TWO_STAGE stage1 (denoise_dev_av): stage1_end_image_latent
- DEV_TWO_STAGE stage2 (denoise_distilled): stage2_end_image_latent
- DEV_TWO_STAGE_HQ stage1 (denoise_res2s_av): stage1_end_image_latent
- DEV_TWO_STAGE_HQ stage2 (denoise_res2s_av): stage2_end_image_latent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
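
A rough sketch of the two points in this commit message: position-dependent work is hoisted out of the loop, and every guidance pass is sliced back to the video tokens before the guidance math runs. `toy_freqs` and `toy_transformer` are hypothetical stand-ins (not the real precompute_freqs_cis or the LTX model). The classifier-free-guidance combination, the Euler update, and the (1, N) position layout are assumptions chosen for brevity. Only the positive and negative passes are shown; STG and modality passes would be sliced the same way.

```python
import numpy as np


def toy_freqs(positions):
    """Hypothetical stand-in for a RoPE table; positions has shape (1, N)."""
    return np.cos(positions.astype(np.float32))


def toy_transformer(tokens, freqs, timesteps, context):
    """Hypothetical model: returns one velocity vector per input token."""
    return tokens * freqs[..., None] + timesteps[..., None] * context


def denoise(video, video_pos, guiding, guiding_pos, sigmas, strength, cfg_scale):
    n_video = video.shape[1]
    n_guiding = guiding.shape[1]

    # Computed once, covering video + guiding positions
    freqs = toy_freqs(np.concatenate([video_pos, guiding_pos], axis=1))

    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        tokens = np.concatenate([video, guiding], axis=1)
        ts = np.concatenate(
            [np.full((1, n_video), sigma),
             np.full((1, n_guiding), sigma * (1.0 - strength))],
            axis=1,
        )
        # Slice every pass so guidance never sees guiding-token outputs
        v_pos = toy_transformer(tokens, freqs, ts, 1.0)[:, :n_video, :]
        v_neg = toy_transformer(tokens, freqs, ts, 0.0)[:, :n_video, :]
        velocity = v_neg + cfg_scale * (v_pos - v_neg)

        # Assumed Euler update; the guiding tokens themselves are never updated
        video = video + (sigma_next - sigma) * velocity
    return video
```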

The config-declared out_channels can diverge from the actual weight in LTX-2.3 (1024 vs 2048), causing a broadcast error when the skip connection is added. Infer out_channels from conv.weight at runtime so the reshape is always consistent with the loaded weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
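
The failure mode and the fix can be reproduced with a toy model. This is a sketch under simplifying assumptions, not the actual SpaceToDepthDownsample code: the "conv" is a pointwise matrix multiply, the skip branch averages channel groups, the weight's axis 0 is taken to be the output-channel axis, and the channel counts are scaled down (4 declared vs 8 actual instead of 1024 vs 2048).

```python
import numpy as np


def grouped_mean_skip(x, out_channels):
    """Reduce the channel axis of x (..., C_in) to out_channels by group-averaging."""
    *lead, c_in = x.shape
    return x.reshape(*lead, out_channels, c_in // out_channels).mean(axis=-1)


def downsample(x, conv_weight, config_out_channels=None):
    """Toy pointwise 'conv' plus skip connection.

    conv_weight has shape (out_channels, in_channels) in this toy.
    If config_out_channels is None, the channel count is read from the weights.
    """
    conv_out = x @ conv_weight.T  # (..., out_channels of the loaded weights)
    if config_out_channels is None:
        out_channels = conv_weight.shape[0]   # trust the loaded weights
    else:
        out_channels = config_out_channels    # trust the (possibly stale) config
    return conv_out + grouped_mean_skip(x, out_channels)


x = np.ones((1, 2, 2, 2, 16), dtype=np.float32)
weight = np.ones((8, 16), dtype=np.float32)

ok = downsample(x, weight)  # channel count inferred from weight: shapes agree
try:
    downsample(x, weight, config_out_channels=4)  # config disagrees with weights
    mismatch_raised = False
except ValueError:
    mismatch_raised = True  # NumPy cannot broadcast 8 channels against 4
```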

Blaizzy · 2026-05-13T08:59:55Z

Could you share before-and-after samples?

azrahello · 2026-05-14T06:38:37Z

Sorry for the late reply!

The previous approach used a hard latent-replace strategy for the end frame: the encoded target latent was directly overwritten at position -1 at every denoising step. This did not affect the trajectory from A to B (the motion and overall composition were unaffected), but it caused a visible artifact in the last frames only, where the video would abruptly snap toward the target image rather than arriving naturally. With images containing foliage or warm/cool color palettes, this produced a noticeable chromatic shift in the final frames.

The new approach replaces the hard-replace with token appending: the encoded end-frame latent is flattened and appended to the video sequence as extra tokens before each transformer step. The model attends to these tokens via self-attention without any video token being overwritten.
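
To make the contrast concrete, here is a small NumPy sketch of the two strategies. The function names and tensor layouts ((B, C, F, h, w) for latents, (B, N, C) for tokens) are assumptions for illustration and do not mirror the repository's code.

```python
import numpy as np


def hard_replace(video_latent, end_latent):
    """Old strategy: overwrite the last latent frame with the target latent."""
    out = video_latent.copy()
    out[:, :, -1] = end_latent[:, :, 0]
    return out


def append_tokens(video_tokens, end_tokens):
    """New strategy: keep every video token and add the target as extra tokens."""
    return np.concatenate([video_tokens, end_tokens], axis=1)


video_latent = np.zeros((1, 4, 3, 2, 2), dtype=np.float32)
end_latent = np.ones((1, 4, 1, 2, 2), dtype=np.float32)
replaced = hard_replace(video_latent, end_latent)

video_tokens = np.zeros((1, 12, 4), dtype=np.float32)
end_tokens = np.ones((1, 4, 4), dtype=np.float32)
appended = append_tokens(video_tokens, end_tokens)
```

In the first case the last frame's content is replaced outright; in the second the sequence grows by h*w tokens and the original video tokens are untouched.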

The end-frame conditioning works well even without a start frame. One remaining characteristic is that the conditioning tends to assert itself rather sharply. Longer videos naturally mitigate this, as the model has more frames over which to build the transition; shorter videos may still show a more abrupt shift near the end.

old

output_old2.mp4

new

output.mp4

The frames used are:

azrahello and others added 3 commits May 13, 2026 09:29

azrahello wants to merge 3 commits into Blaizzy:main from azrahello:feat/endframe-guiding-tokens
