Last updated: 2026-05-03
Technical reference for LTX 2.3 model behavior relevant to the audio loop workflow. Extracted from CLAUDE.md for progressive disclosure -- read this when working on model-facing code, not for every conversation.
Guide strength does NOT control how much the image influences style. It controls the denoise mask (noise addition), which is only one of three layers. Text conditioning operates on a separate, unattenuated pathway:
1. Cross-attention (text -> all tokens) <- ALWAYS FULL STRENGTH, no per-guide control
2. Self-attention (guide <-> generated) <- controlled by attention_strength (default 1.0)
3. Denoise mask (noise addition) <- controlled by guide strength (1.0 = no noise)
- strength=1.0 -> denoise_mask=0.0 -> guide frames spatially frozen
- BUT cross-attention still pulls style/appearance toward text description
- Guides are CONCATENATED to the latent sequence (extra frames at the end), not blended at the target index. keyframe_idxs tells RoPE their logical position.
- This is why changing text causes style drift even at guide strength 1.0: the guide anchors composition, but text controls style via cross-attention.
- The right fix for style consistency is keeping text aligned (consistent prompts + ConditioningBlend), not increasing guide strength.
Source: ComfyUI-LTXVideo/latents.py (LTXVAddLatentGuide), comfy_extras/nodes_lt.py (append_keyframe), comfy/ldm/lightricks/model.py (per-reference attention masking).
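A minimal scalar sketch of the three layers and their knobs, assuming a linear strength-to-mask mapping consistent with strength=1.0 -> denoise_mask=0.0 (the real per-reference masking in comfy/ldm/lightricks/model.py is tensor-level, not scalar):

```python
def conditioning_layers(guide_strength: float, attention_strength: float = 1.0):
    # Layer 3: denoise mask. guide_strength=1.0 -> mask=0.0 -> guide frames
    # receive no noise and stay spatially frozen.
    denoise_mask = 1.0 - guide_strength
    # Layer 2: self-attention between guide and generated tokens,
    # scaled per guide by attention_strength (default 1.0).
    self_attn_scale = attention_strength
    # Layer 1: cross-attention from text to ALL tokens -- no per-guide knob,
    # which is why text still drifts style even at guide_strength=1.0.
    cross_attn_scale = 1.0
    return denoise_mask, self_attn_scale, cross_attn_scale
```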
- Video VAE: First pixel frame -> own latent frame, then 8 pixels per latent.
  Formula: `latent = (pixel - 1) // 8 + 1`, NOT `pixel // 8`. Pixel frames must follow 8n+1 (1, 9, 17, 25, ..., 497).
- Audio VAE: 25 latents/second, 1D, completely independent of the video latent temporal dimension. They live in separate NestedTensor sub-tensors.
- Using `pixel // 8` instead of `(pixel - 1) // 8 + 1` caused the v0409 sync bug: 25 pixels -> 3 latent frames (wrong) vs 4 (correct).
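The frame math as plain Python (no ComfyUI dependency):

```python
def pixel_to_latent_frames(pixel_frames: int) -> int:
    """Video VAE temporal mapping: the first pixel frame gets its own latent
    frame, then 8 pixels per latent frame. Valid counts follow 8n+1."""
    assert (pixel_frames - 1) % 8 == 0, "pixel frame count must be 8n+1"
    return (pixel_frames - 1) // 8 + 1

assert pixel_to_latent_frames(497) == 63
assert pixel_to_latent_frames(25) == 4   # correct
assert 25 // 8 == 3                      # the v0409 bug's wrong answer
```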
- Latent volume limit: `(width/32) * (height/32) * ((frames-1)/8 + 1)` should stay below ~15,000-20,000. Exceeding it causes artifacts, grid patterns, and color loss.
- 832x480 at 497 frames: 26 * 15 * 63 = 24,570 -- already at the edge. Don't increase resolution without reducing frame count per window (see the helper sketch after this list).
- Higher resolution improves motion/lip-sync/audio quality but costs more VRAM and risks latent volume overflow. 720p+ with 48-50fps gives smoother motion.
- Portrait (vertical) resolutions are unstable -- keep height < 1600px. Landscape and square work best.
- Two-stage approach is the recommended workaround: generate at lower res (720p), then spatially upscale the latent to 1080p+. This is what LTX-Desktop and native LTX-2 both do. See `docs/analysis/ltx23_gaps_analysis.md` (upscale workflow is designed but not yet shipped -- see `internal/design/upscale_workflow_design.md`, private clone only).
- For our loop workflow: each window is 497 frames at 832x480. Changing resolution requires adjusting window_seconds or temporal_tile_size to stay under the limit.
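The helper referenced above -- a sketch of the volume rule of thumb:

```python
def latent_volume(width: int, height: int, frames: int) -> int:
    """Approximate latent volume; keep below ~15,000-20,000 to avoid
    artifacts, grid patterns, and color loss."""
    return (width // 32) * (height // 32) * ((frames - 1) // 8 + 1)

assert latent_volume(832, 480, 497) == 26 * 15 * 63 == 24_570  # already at the edge
```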
Loop iterations progressively darken because each iteration's latent statistics drift from the initial render. The init_image guide anchors composition but not color -- guide strength controls the denoise mask, not cross-attention style.
Two AdaIN approaches (can be used together or independently):
Per-iteration AdaIN (LTXVAdainLatent, inside subgraph):
- Location: after SeparateAVLatent (#596), before CropGuides (#655)
- Reference: initial render video latent from SeparateAV #245
- Factor: 0.2 default (gentle). Increase to 0.5 for stronger correction.
- per_frame=False (global statistics). Try True if per-frame flickering occurs.
- Present in all three workflows. Bypass (mode=4) to disable.
Per-step AdaIN (LTXVPerStepAdainPatcher, model chain):
- Location: after SamplingPreviewOverride, before Set_model
- Reference: node 531 (init image embed latent, available before sampling)
- Factors: per-denoising-step, e.g., "0.3,0.2,0.1,0.05,0.0,0.0,0.0,0.0" (stronger at early noisy steps, none at late detail steps)
- Only in `audio-loop-music-video_image_adain_perstep.json`.
- More aggressive than per-iteration: applied during sampling, not after.
Testing order: start with per-iteration only (factor=0.2). If drift persists, try the per-step workflow. Compare iteration 5+ brightness against the initial render.
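For intuition, a minimal sketch of AdaIN on latent statistics with global (per_frame=False) reduction; the actual LTXVAdainLatent and LTXVPerStepAdainPatcher implementations live in ComfyUI-LTXVideo and may differ in detail:

```python
import torch

def adain(latent: torch.Tensor, reference: torch.Tensor, factor: float = 0.2) -> torch.Tensor:
    """Pull the latent's per-channel statistics toward the reference's.
    Assumes [B, C, T, H, W]; reduces over everything except channels."""
    dims = (0, 2, 3, 4)
    mean, std = latent.mean(dims, keepdim=True), latent.std(dims, keepdim=True)
    ref_mean, ref_std = reference.mean(dims, keepdim=True), reference.std(dims, keepdim=True)
    matched = (latent - mean) / (std + 1e-6) * ref_std + ref_mean
    # factor=0.0 is a no-op; factor=1.0 fully matches the reference statistics.
    return torch.lerp(latent, matched, factor)
```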
- LTX 2.3 is distilled -- CFG=1.0 by default (NAG handles guidance, not CFG).
- Prompts are i2v (image-to-video) style: describe changes from the init_image, not the full scene.
- Start with `Style: cinematic.` (or omit it if the init_image establishes the style).
- Use present-progressive verbs: "is singing," "is walking."
- Include audio descriptions inline with visuals (LTX 2.3 is audio-video joint).
- No meta-language: no "The scene opens with...", no timestamps, no cuts.
- Camera motion only when intended. Keywords: `static camera`, `dolly in/out/left/right`, `jib up/down`, `focus shift`.
- Avoid dolly out -- it breaks limbs and faces. Use a static camera with lighting shifts for visual variation.
- i2v rule: describe only changes from the init_image. Re-describing the setting causes the model to "restart" the scene.
- Two-person scenes: always "singing together." Don't direct male vs female vocals -- audio conditioning handles it.
- Subject anchoring, not setting re-description. Describe WHO (traits, clothing, position) in every entry to anchor identity. Do NOT re-describe the environment -- that's in the init_image.
- Node 169 covers the trimmed 0:00-to-window_seconds range (~20s). TimestampPromptSchedule does NOT run during the initial render, so the node 169 prompt MUST match the schedule's 0:00 entry to avoid a visual discontinuity at ~20s. A hypothetical example follows.
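Illustration of the alignment rule -- the wording and schedule syntax here are invented for this example, not copied from the shipped workflows:

```
Node 169 (initial render, covers 0:00 to ~20s):
  Style: cinematic. The woman in the red jacket is singing softly,
  warm vocals over sparse piano. Static camera.

TimestampPromptSchedule:
  0:00  Style: cinematic. The woman in the red jacket is singing softly,
        warm vocals over sparse piano. Static camera.    <- MUST match node 169
  0:20  The woman in the red jacket is raising her hands as the drums
        come in. Static camera, lighting shifts warmer.
```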
Full system prompts: docs/reference/ltx23_prompt_system_prompts.md
Prompt creation guide: docs/guides/prompt_creation_guide.md
- LTX 2.3 uses a Gemma 3 text encoder (NOT CLIP). The conditioning format is `[tensor, {"attention_mask": mask}]` with no pooled_output. Standard ConditioningAverage won't work -- use our ConditioningBlend instead (blend sketch below).
- The workflow uses DualCLIPLoader + CLIPTextEncode nodes. Despite the names, these are Gemma 3 encoders (loaded via gemma_3_12B + ltx-2.3_text_projection).
- Extension #843 positive/negative should come from Get_base_cond_pos/neg DIRECTLY, NOT through an extra LTXVConditioning node. Node 1587 is bypassed because it corrupted the initial render's audio-video cross-attention.
- Conditioning wiring (canonical after 2026-04-22): CLIP loads once per generation; DiT + NAG stay resident; hard-cut between schedule entries at the iteration grid.

  ```
  TimestampPromptScheduleBatchEncode (runs ONCE, outside the loop)
    clip, schedule, stride_seconds, audio_duration, snap_boundaries
    |-> conditioning_list -> ConditioningSelectByIteration (inside loop)
                               |-> conditioning -> Extension #843 input 6 (positive)
    TensorLoopOpen.current_iteration ----^
  ```

- In copies of the workflow saved before 2026-04-22, the per-iteration chain was TimestampPromptSchedule (1558) -> CachedTextEncode (1559 + 1607) -> ConditioningBlend (1608) -> Extension, which silenced NAG on iteration 2+ via an `object_patches` device-migration asymmetry (see `docs/analysis/nag_object_patches_offload_asymmetry.md`). Migrate pre-fix copies via `scripts/apply_batch_encode_fix.py`.
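The blend sketch referenced above, assuming the conditioning format described -- a hypothetical stand-in for our ConditioningBlend node, which may handle masks and mismatched lengths differently:

```python
import torch

def blend_conditioning(cond_a: list, cond_b: list, weight: float) -> list:
    """Lerp two Gemma 3 conditionings. Each entry is
    [tensor, {"attention_mask": mask}] with no pooled_output, which is why
    the CLIP-oriented ConditioningAverage (it expects pooled_output) breaks.
    Assumes matching sequence lengths."""
    out = []
    for (t_a, extra_a), (t_b, _) in zip(cond_a, cond_b):
        blended = torch.lerp(t_b, t_a, weight)  # weight=1.0 -> pure cond_a
        out.append([blended, {"attention_mask": extra_a["attention_mask"]}])
    return out
```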
- TensorLoopOpen MUST receive the sampled initial render, NOT the raw image-embed latent from LTXVImgToVideoInplaceKJ.
- LTXVAddLatentGuide APPENDS guide frames to temporal dim (torch.cat dim=2). Sampler output latent has shape [B,C,63+N_guides,H,W], not [B,C,63,H,W].
- For LATENT workflows: initial render prepended via LatentConcat (dim=t) using CropGuides output (guide-stripped).
- Correct latent path: #531 -> #350 ConcatAV -> #161 Sampler -> #245 SeparateAV -> #1539 TensorLoopOpen
- VAEEncode produces latent with NO noise_mask key. LTXVAudioVideoMask then creates a fresh all-zeros mask. This is the correct behavior.
- LTXVSelectLatents PRESERVES the existing noise_mask from its input. Inherited stale masks corrupt the sampler's mask semantics and break sync.
- LatentContextExtract / LatentOverlapTrim (our nodes) strip noise_mask automatically between LTXVSelectLatents and LTXVAudioVideoMask. Use these in the latent-space subgraph rather than raw LTXVSelectLatents. (StripLatentNoiseMask was a standalone helper for the same purpose; removed 2026-04-27 in favor of the auto-stripping LatentContextExtract / LatentOverlapTrim path.)
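A sketch of the append/strip pair, assuming [B, C, T, H, W] layout (the real nodes also manage keyframe_idxs and noise masks):

```python
import torch

def append_latent_guide(latent: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
    # LTXVAddLatentGuide-style concat: guide frames go on the END of the
    # temporal dim, not blended at the target index; keyframe_idxs carries
    # their logical position to RoPE.
    return torch.cat([latent, guide], dim=2)

def crop_guides(latent: torch.Tensor, n_guide_frames: int) -> torch.Tensor:
    # Sampler output is [B, C, 63 + N_guides, H, W]; strip the trailing
    # guide frames before anything downstream consumes the latent.
    return latent[:, :, :-n_guide_frames]
```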
- IMAGE-AdaIN workflow (`audio-loop-music-video_image_adain_perstep.json`): subgraph uses GetImageRangeFromBatch + VAEEncode/Decode, plus LTXVPerStepAdainPatcher on the model chain. ImageBatch prepends the initial render. Use only when you specifically need per-step AdaIN; latent is the production default. (The plain image workflow was retired 2026-04-27.)
- LATENT workflow (`audio-loop-music-video_latent.json`): subgraph uses LatentContextExtract + LatentOverlapTrim. LatentConcat prepends the initial render. No per-iteration VAE round-trip.
- AudioLoopController outputs work for both: overlap_frames (pixel) + overlap_latent_frames (latent).
Per-iteration sampling pipeline. IMAGE and LATENT workflows differ in context extraction and output trimming nodes. Shared internals:
- VAEEncode (1520) -- encodes init_image to latent (scene anchor guide)
- LTXVAddLatentGuide (1519) -- merges conditioning + both guides into latent
- LTXVConcatAVLatent (583) -- adds audio latent
- CFGGuider (644) -- packages for sampling (cfg=1.0, NAG does guidance)
- SamplerCustomAdvanced (573) -- generates new frames
Full trace: docs/reference/pipeline_flow_latent.md (LATENT workflow, the primary baseline).
- TrimAudioDuration (Node 567) start_index is song-dependent. It trims the instrumental intro that doesn't contribute to lip sync.
- Audio and video durations must match: 497 frames / 25fps = 19.88s audio.
- LTXVAudioVideoMask (Node 606): audio_start_time and audio_end_time are BOTH wired to window_size_seconds (19.88). This creates an empty mask range (start=end), so audio stays fixed as the encoded song. DO NOT change.
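A sketch of why that wiring pins the audio, assuming the node maps the time range to audio-latent indices at 25 latents/second (the actual mask construction is internal to LTXVAudioVideoMask):

```python
AUDIO_LATENTS_PER_SECOND = 25

def audio_denoise_range(start_time: float, end_time: float) -> range:
    start = int(start_time * AUDIO_LATENTS_PER_SECOND)
    end = int(end_time * AUDIO_LATENTS_PER_SECOND)
    return range(start, end)  # start == end -> empty range: nothing resampled

# Both inputs wired to window_size_seconds -> no audio latent is ever
# denoised, so the audio stays exactly the encoded song.
assert len(audio_denoise_range(19.88, 19.88)) == 0
```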
LTXVImgToVideoInplaceKJ (and similar multi-input nodes) serialize
widgets as: [num_items, strength_1, strength_2, ..., index_1, index_2, ...]
Strengths come FIRST for all items, THEN indices. NOT interleaved.
Example: ['2', 1.0, 0.5, 0, -1] = 2 images, strengths [1.0, 0.5], indices [0, -1].
Getting this wrong silently misconfigures the node.
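A round-trip sketch of the layout (hypothetical helper; the actual serialization is done by the workflow frontend):

```python
def serialize_multi_image_widgets(strengths: list, indices: list) -> list:
    """[num_items, strength_1..strength_N, index_1..index_N]:
    strengths FIRST for all items, THEN indices, never interleaved."""
    assert len(strengths) == len(indices)
    return [str(len(strengths)), *strengths, *indices]

# The doc's example: 2 images, strengths [1.0, 0.5], indices [0, -1].
assert serialize_multi_image_widgets([1.0, 0.5], [0, -1]) == ['2', 1.0, 0.5, 0, -1]
```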
- Upscale is a SEPARATE workflow, not part of the loop workflow.
- Stay in latent space: Load video -> VAEEncode (once) -> LTXVLatentUpsampler (2x) -> 3-step refinement sampler -> VAEDecodeTiled.
- Model: `ltx-2.3-spatial-upscaler-x2-1.1.safetensors`
- Refinement sigmas: [0.85, 0.725, 0.4219, 0.0] (3 steps; see the sketch below).
- Design doc (workflow not yet shipped): `internal/design/upscale_workflow_design.md` (private clone only)
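For reference, the documented sigma schedule as a tensor; how it reaches the refinement sampler depends on the (not yet shipped) workflow graph:

```python
import torch

REFINEMENT_SIGMAS = torch.tensor([0.85, 0.725, 0.4219, 0.0])
assert REFINEMENT_SIGMAS.numel() - 1 == 3  # 4 sigma values = 3 denoising steps
```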