feat: end-frame conditioning via guiding-token appending for all pipelines #29

Open
azrahello wants to merge 3 commits into Blaizzy:main from azrahello:feat/endframe-guiding-tokens

Conversation

@azrahello

Summary

  • Replaces the previous hard-token-replacement approach for end-frame I2V conditioning with the guiding-token appending mechanism used by LTX-2's own VideoConditionByKeyframeIndex
  • The conditioning latent is flattened into tokens and appended to the video sequence before each transformer pass; the model attends to the target end frame as a clean reference without overwriting any video token, avoiding abrupt appearance jumps
  • Extended to all four pipelines: DISTILLED, DEV, DEV_TWO_STAGE, DEV_TWO_STAGE_HQ
  • Also fixes a SpaceToDepthDownsample broadcast error in the VAE encoder that appeared with LTX-2.3 weights (config declares out_channels=1024 but actual weights are 2048; the fix reads the channel count from the weight tensor at runtime)
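
The last bullet can be illustrated with a minimal sketch. It assumes only that the convolution weight stores output channels on its leading axis; the helper name `resolve_out_channels` and the example shape are hypothetical and are not taken from the PR's code.

```python
from typing import Sequence


def resolve_out_channels(weight_shape: Sequence[int], declared: int) -> int:
    """Return the output-channel count implied by the loaded conv weight.

    `declared` is the config value; it is only used to report a mismatch.
    """
    # Assumption: axis 0 of the conv weight is the output-channel axis.
    actual = int(weight_shape[0])
    if actual != declared:
        print(f"config out_channels={declared}, weights say {actual}; using {actual}")
    return actual


# Illustrative shape only: (out_channels, kD, kH, kW, in_channels).
weight_shape = (2048, 3, 3, 3, 512)
out_channels = resolve_out_channels(weight_shape, declared=1024)
```

Any reshape that depends on the channel count would then use `out_channels` rather than the config value, so it stays consistent with whatever weights were loaded.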

Technical approach

For each denoiser call:

  1. The end-image latent is encoded once and flattened to (1, h*w, C) tokens
  2. The positions of the last latent frame are extracted from the video position grid and used as the guiding token positions
  3. Inside the denoising loop: [video_tokens, guiding_tokens] are concatenated on the sequence axis; guiding tokens receive cond_ts = sigma × (1 − guiding_strength) (equivalent to denoise_mask = 1 − strength in the PyTorch reference)
  4. After the transformer forward pass, only velocity[:, :num_video_tokens, :] is used — guiding token outputs are discarded
  5. For denoisers that pre-compute RoPE (denoise_dev_av, denoise_res2s_av), the extended position grid covering both video and guiding tokens is passed to precompute_freqs_cis once before the loop
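
Steps 1 to 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PR's implementation: the batch axis is dropped, the latent is assumed to be laid out as `[C][h][w]`, positions are assumed to be `(frame, row, col)` tuples, every video token is given timestep `sigma` for simplicity, and `flatten_latent`, `last_frame_positions`, `denoise_step`, and `toy_model` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]               # one token: C channel values
Position = Tuple[int, int, int]   # (frame, row, col) index of a token


def flatten_latent(latent: List[List[List[float]]]) -> List[Token]:
    """Step 1: flatten a single-frame [C][h][w] latent into h*w tokens of C channels."""
    channels, height, width = len(latent), len(latent[0]), len(latent[0][0])
    return [
        [latent[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def last_frame_positions(positions: Sequence[Position]) -> List[Position]:
    """Step 2: reuse the positions of the last latent frame for the guiding tokens."""
    last = max(p[0] for p in positions)
    return [p for p in positions if p[0] == last]


def denoise_step(
    model: Callable[[List[Token], List[Position], List[float]], List[Token]],
    video_tokens: Sequence[Token],
    video_positions: Sequence[Position],
    guiding_tokens: Sequence[Token],
    guiding_positions: Sequence[Position],
    sigma: float,
    guiding_strength: float,
) -> List[Token]:
    num_video_tokens = len(video_tokens)
    # Step 3: append guiding tokens on the sequence axis with a scaled timestep.
    tokens = list(video_tokens) + list(guiding_tokens)
    positions = list(video_positions) + list(guiding_positions)
    cond_ts = sigma * (1.0 - guiding_strength)
    timesteps = [sigma] * num_video_tokens + [cond_ts] * len(guiding_tokens)
    velocity = model(tokens, positions, timesteps)
    # Step 4: keep only the video-token outputs.
    return velocity[:num_video_tokens]


def toy_model(tokens, positions, timesteps):
    """Stand-in for the transformer: scales each token by its timestep."""
    return [[t * v for v in tok] for tok, t in zip(tokens, timesteps)]


frames, h, w, C = 3, 2, 2, 4
video_positions = [(f, y, x) for f in range(frames) for y in range(h) for x in range(w)]
video_tokens = [[0.0] * C for _ in video_positions]
end_latent = [[[float(c) for _ in range(w)] for _ in range(h)] for c in range(C)]

guiding_tokens = flatten_latent(end_latent)
guiding_positions = last_frame_positions(video_positions)
velocity = denoise_step(
    toy_model, video_tokens, video_positions,
    guiding_tokens, guiding_positions, sigma=0.8, guiding_strength=1.0,
)
```

With `guiding_strength=1.0` the guiding tokens receive timestep 0, so the model sees them as a fully clean reference; lower values treat them as partially noised.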

Testing status

⚠️ Tested on DISTILLED pipeline only (LTX-2.3 distilled, prince-canuma/LTX-2.3-distilled).

The DEV, DEV_TWO_STAGE, and DEV_TWO_STAGE_HQ pipelines follow the same token-appending logic and have been reviewed for correctness, but have not been run end-to-end — the base (non-distilled) LTX-2.3 model weights are several GB and were not locally available during development. Community testing on those pipelines is welcome.

🤖 Generated with Claude Code

azrahello and others added 3 commits May 13, 2026 09:29
…TILLED)

Replace the hard latent-replace approach (apply_conditioning with frame_idx=-1)
with token appending in denoise_distilled: the encoded end-frame latent is
flattened and appended to the video sequence before each transformer step,
using a timestep scaled by (1 - guiding_strength) so the model attends to
the target end frame without overwriting any video token. This removes the
abrupt appearance jump near the last frame that the replace approach caused.

- Add _prepare_guiding_tokens() helper to flatten cond latent and extract
  last-frame positional encodings from the full position grid
- Add guiding_tokens/guiding_positions/guiding_strength params to denoise_distilled
- Wire up s1/s2 guiding tokens in the DISTILLED two-stage pipeline
- Simplify _build_i2v_conditionings: remove end_image params (guiding tokens
  now own end-frame conditioning for DISTILLED; other pipelines pending)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add guiding_tokens/guiding_positions/guiding_strength params to
denoise_dev_av and denoise_res2s_av, mirroring the approach already
applied to denoise_distilled.

For both functions:
- Pre-compute extended RoPE (video + guiding positions) once outside
  the denoising loop to avoid per-step recomputation.
- Inside each loop step, concatenate guiding tokens + scaled timestep
  to the video sequence before every transformer pass, then slice
  velocity back to num_video_tokens. Applies to all guidance passes
  (pos, neg, STG, modality) so none of the guidance math sees
  guiding-token outputs.
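
A rough sketch of why slicing before the guidance math matters: if every pass goes through one wrapper that appends the guiding tokens and slices the result, the guidance combination only ever sees video-token velocities. The combination below is the generic classifier-free guidance formula with a positive and a negative pass only; the STG and modality passes are omitted, and `make_video_only_forward`, `cfg_velocity`, and `toy_model` are hypothetical names, not functions from this PR.

```python
from typing import Callable, List, Sequence

Token = List[float]


def make_video_only_forward(
    model: Callable[[List[Token], List[float], float], List[Token]],
    guiding_tokens: Sequence[Token],
    guiding_strength: float,
) -> Callable[[Sequence[Token], float, float], List[Token]]:
    """Wrap `model` so each pass appends guiding tokens and returns video outputs only."""

    def forward(video_tokens: Sequence[Token], sigma: float, context: float) -> List[Token]:
        n = len(video_tokens)
        tokens = list(video_tokens) + list(guiding_tokens)
        cond_ts = sigma * (1.0 - guiding_strength)
        timesteps = [sigma] * n + [cond_ts] * len(guiding_tokens)
        return model(tokens, timesteps, context)[:n]  # drop guiding-token outputs

    return forward


def cfg_velocity(forward, video_tokens, sigma, pos_ctx, neg_ctx, scale):
    """Generic classifier-free guidance: v_neg + scale * (v_pos - v_neg)."""
    v_pos = forward(video_tokens, sigma, pos_ctx)
    v_neg = forward(video_tokens, sigma, neg_ctx)
    return [
        [n + scale * (p - n) for p, n in zip(tok_pos, tok_neg)]
        for tok_pos, tok_neg in zip(v_pos, v_neg)
    ]


def toy_model(tokens, timesteps, context):
    """Stand-in for the transformer: every output channel equals the context value."""
    return [[context for _ in tok] for tok in tokens]


video_tokens = [[0.0, 0.0] for _ in range(3)]
guiding_tokens = [[1.0, 1.0] for _ in range(2)]
forward = make_video_only_forward(toy_model, guiding_tokens, guiding_strength=1.0)
guided = cfg_velocity(forward, video_tokens, sigma=0.5, pos_ctx=1.0, neg_ctx=0.0, scale=3.0)
```

The result has one entry per video token, regardless of how many guiding tokens were appended.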

Wire up all call sites:
- DEV: end_image_latent → _prepare_guiding_tokens → denoise_dev_av
- DEV_TWO_STAGE stage1 (denoise_dev_av): stage1_end_image_latent
- DEV_TWO_STAGE stage2 (denoise_distilled): stage2_end_image_latent
- DEV_TWO_STAGE_HQ stage1 (denoise_res2s_av): stage1_end_image_latent
- DEV_TWO_STAGE_HQ stage2 (denoise_res2s_av): stage2_end_image_latent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The config-declared out_channels can diverge from the actual weight
in LTX-2.3 (1024 vs 2048), causing a broadcast error when the skip
connection is added. Infer out_channels from conv.weight at runtime
so the reshape is always consistent with the loaded weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Blaizzy
Owner

Blaizzy commented May 13, 2026

Could you share before-and-after samples?

@azrahello
Author

Sorry for the late reply!

The previous approach used a hard latent-replace strategy for the end frame: at every denoising step, the latent at position -1 was overwritten with the encoded target latent. This did not affect the trajectory from A to B (motion and overall composition were unchanged), but it caused a visible artifact in the last frames, where the video would abruptly snap toward the target image rather than arriving naturally. With images containing foliage or warm/cool color palettes, this produced a noticeable chromatic shift in the final frames.

The new approach replaces the hard replace with token appending: the encoded end-frame latent is flattened and appended to the video sequence as extra tokens before each transformer step. The model attends to these tokens via self-attention without any video token being overwritten.

End-frame conditioning works well even without a start frame. One remaining characteristic is that the conditioning tends to assert itself rather sharply. Longer videos naturally mitigate this, as the model has more frames over which to build the transition; shorter videos may still show a more abrupt shift near the end.
Old:

output_old2.mp4

New:

output.mp4

The frames used are:
image_240
image_239
