Add HY-OmniWeaving support for HunyuanVideo 1.5 #13289
ifilipis wants to merge 1 commit into Comfy-Org:master
Conversation
📝 Walkthrough
This pull request adds support for HunyuanVideo 1.5 "Omni" models by extending text encoder detection and checkpoint handling for Qwen2.5-VL encoders, adding attention tensor format conversion for HY-OmniWeave checkpoints, and introducing three new conditioning nodes. The changes include model detection logic updates, checkpoint key normalization, attention tensor merging for split Q/K/V formats, and UI/API extensions to expose the new nodes.

🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
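The "attention tensor merging for split Q/K/V formats" mentioned in the walkthrough can be sketched roughly as below. This is a generic illustration under assumed key names (separate q_proj/k_proj/v_proj tensors fused into a single qkv tensor); the actual HY-OmniWeave checkpoint layout and the PR's conversion code may differ.

```python
import torch

def merge_split_qkv(sd, prefix):
    """Merge separate q/k/v projection tensors into one fused qkv tensor.

    Hypothetical key layout: '<prefix>.{q,k,v}_proj.{weight,bias}' in,
    '<prefix>.qkv.{weight,bias}' out. Other keys pass through untouched.
    """
    out = dict(sd)
    for suffix in ("weight", "bias"):
        keys = [f"{prefix}.{p}_proj.{suffix}" for p in ("q", "k", "v")]
        if all(k in out for k in keys):
            # Concatenate along the output dimension: weights become (3 * hidden, hidden).
            out[f"{prefix}.qkv.{suffix}"] = torch.cat([out.pop(k) for k in keys], dim=0)
    return out

sd = {
    "blocks.0.attn.q_proj.weight": torch.zeros(8, 8),
    "blocks.0.attn.k_proj.weight": torch.zeros(8, 8),
    "blocks.0.attn.v_proj.weight": torch.zeros(8, 8),
}
merged = merge_split_qkv(sd, "blocks.0.attn")
print(sorted(merged))                            # ['blocks.0.attn.qkv.weight']
print(merged["blocks.0.attn.qkv.weight"].shape)  # torch.Size([24, 8])
```

For grouped-query-attention models such as Qwen2.5-VL, the k/v projections are smaller than q, but concatenating along dim 0 still produces one fused tensor.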
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_hunyuan.py`:
- Around line 528-535: The omni_mask can exceed 1.0 (e.g., omni_mask[ref_idx]
becomes 2.0), which makes concat_mask negative after computing 1.0 - omni_mask;
clamp omni_mask to the [0,1] range before inverting so concat_mask remains a
proper 0/1 mask. Update the code that computes concat_mask (and/or immediately
before it) to use a clamped version of omni_mask (e.g., torch.clamp(omni_mask,
0.0, 1.0)) when computing 1.0 - omni_mask, referencing omni_mask, concat_mask,
cond_latent, latent_length and the preceding logic that modifies omni_mask
(including _encode_single_image/reference_images handling).
In `@comfy/sd.py`:
- Around line 1270-1276: detect_te_model() accepts checkpoints keyed under
model.language_model.* for both QWEN25_3B and QWEN25_7B, but the 3B loading path
calls omnigen2.te() with the raw sd (whereas the 7B path normalizes prefixes
before loading), which risks silent weight-dropping by
transformer.load_state_dict in SDClipModel.load_sd(); update the 3B branch to
perform the same key-prefix normalization as the 7B loader before calling
omnigen2.te() (i.e., rewrite keys from the model.language_model.* layout to the
expected model.* layout), or alternatively restrict detect_te_model() to only
detect the 7B layout—prefer the former and apply the same prefix-rewrite logic
where the 3B omnigen2.te(...) invocation occurs so the state dict keys match the
model expected by transformer.load_state_dict.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 07947a53-d006-487c-a7ba-e3c834765b33
📒 Files selected for processing (3)
comfy/sd.py
comfy_extras/nodes_hunyuan.py
nodes.py
encoded_ref = cls._encode_single_image(vae, reference_images[:1], width, height)
ref_idx = 1 if latent_length > 1 else 0
cond_latent[:, :, ref_idx:ref_idx + 1] += encoded_ref[:, :, :1]
omni_mask[ref_idx] += 1.0

cond_latent = comfy.utils.resize_to_batch_size(cond_latent, batch_size)
# BaseModel/HunyuanVideo15 inverts concat_mask (mask = 1 - concat_mask), so pass the pre-inverted mask.
concat_mask = (1.0 - omni_mask).view(1, 1, latent_length, 1, 1).expand(cond_latent.shape[0], 1, latent_length, cond_latent.shape[-2], cond_latent.shape[-1]).to(cond_latent.dtype)
Clamp the TiV2V mask before inverting it.
Line 531 increments a slot that is already set to 1.0 by the conditioned-video branch, so omni_mask[ref_idx] becomes 2.0. After the 1.0 - omni_mask transform on Line 535, the TiV2V path sends -1.0 in concat_mask for that frame, which breaks the 0/1 mask semantics used by the other tasks.
Proposed fix
  encoded_ref = cls._encode_single_image(vae, reference_images[:1], width, height)
  ref_idx = 1 if latent_length > 1 else 0
  cond_latent[:, :, ref_idx:ref_idx + 1] += encoded_ref[:, :, :1]
- omni_mask[ref_idx] += 1.0
+ omni_mask[ref_idx] = 1.0
  cond_latent = comfy.utils.resize_to_batch_size(cond_latent, batch_size)
+ omni_mask = omni_mask.clamp_(0.0, 1.0)
  # BaseModel/HunyuanVideo15 inverts concat_mask (mask = 1 - concat_mask), so pass the pre-inverted mask.
  concat_mask = (1.0 - omni_mask).view(1, 1, latent_length, 1, 1).expand(cond_latent.shape[0], 1, latent_length, cond_latent.shape[-2], cond_latent.shape[-1]).to(cond_latent.dtype)
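The effect of the proposed fix can be checked in isolation. The snippet below is a standalone reproduction of the mask arithmetic with a made-up frame count, not the actual node code:

```python
import torch

latent_length = 4
ref_idx = 1 if latent_length > 1 else 0

# The conditioned-video branch has already set this slot to 1.0.
omni_mask = torch.zeros(latent_length)
omni_mask[ref_idx] = 1.0

# Buggy variant: += pushes the slot to 2.0, so the inverted mask goes to -1.0.
buggy = omni_mask.clone()
buggy[ref_idx] += 1.0
print((1.0 - buggy)[ref_idx].item())  # -1.0, breaking the 0/1 mask semantics

# Fixed variant: assign instead of accumulate, then clamp defensively before inverting.
fixed = omni_mask.clone()
fixed[ref_idx] = 1.0
fixed.clamp_(0.0, 1.0)
concat_mask = 1.0 - fixed
print(concat_mask.min().item(), concat_mask.max().item())  # 0.0 1.0
```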
# Qwen-VL checkpoints can be saved under model.language_model.* (e.g. HY-OmniWeave text encoder).
if 'model.language_model.layers.0.self_attn.k_proj.bias' in sd:
    weight = sd['model.language_model.layers.0.self_attn.k_proj.bias']
    if weight.shape[0] == 256:
        return TEModel.QWEN25_3B
    if weight.shape[0] == 512:
        return TEModel.QWEN25_7B
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
files=$(fd -a '(omnigen2|qwen_image|hunyuan_image|hunyuan_video)\.py$' .)
printf 'Inspecting wrapper files:\n%s\n\n' "$files"
for f in $files; do
echo "== $f =="
rg -n -C2 'model\.language_model|model\.layers\.0\.self_attn\.k_proj|state_dict_prefix_replace|def llama_detect|load_sd|load_state_dict' "$f"
echo
done

Repository: Comfy-Org/ComfyUI
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Examine the critical line ranges in comfy/sd.py
echo "=== Lines 1270-1280 (detect_te_model QWEN25 detection) ==="
sed -n '1270,1280p' comfy/sd.py
echo ""
echo "=== Lines 1420-1445 (QWEN25_7B and prefix normalization) ==="
sed -n '1420,1445p' comfy/sd.py
echo ""
echo "=== Lines 1400-1420 (Context before QWEN25_7B) ==="
sed -n '1400,1420p' comfy/sd.py
echo ""
echo "=== Search for QWEN25_3B handling ==="
rg -n -A8 'QWEN25_3B' comfy/sd.py | head -50
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== omnigen2.te() and llama_detect definition ==="
rg -n -A15 'def te\(' comfy/text_encoders/omnigen2.py | head -60
echo ""
echo "=== llama_detect in comfy/sd.py to see what keys it expects ==="
rg -n -B2 -A20 'def llama_detect' comfy/sd.py
echo ""
echo "=== Check if omnigen2.te handles state_dict_prefix_replace or similar ==="
rg -n 'state_dict_prefix_replace|model\.language_model|model\.visual' comfy/text_encoders/omnigen2.py
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Omnigen2Model class and __init__ ==="
rg -n -B2 -A30 'class Omnigen2Model' comfy/text_encoders/omnigen2.py | head -80
echo ""
echo "=== Check if Omnigen2Model or parent does any prefix rewriting ==="
rg -n 'state_dict_prefix_replace|model\.language_model|load_state_dict' comfy/text_encoders/omnigen2.py
echo ""
echo "=== Check parent class if Omnigen2Model inherits ==="
rg -n 'import.*Omnigen|from.*omnigen' comfy/text_encoders/omnigen2.py | head -10
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check SD1ClipModel ==="
rg -n -B2 -A20 'class SD1ClipModel' comfy/text_encoders/sd1_clip.py | head -60
echo ""
echo "=== Check Qwen25_3BModel ==="
fd 'qwen.*\.py$' --type f | xargs rg -l 'Qwen25_3BModel'
echo ""
echo "=== Search for Qwen25_3BModel definition ==="
rg -n -B2 -A25 'class Qwen25_3BModel' comfy/
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Find SD1ClipModel definition ==="
rg -n 'class SD1ClipModel' comfy/
echo ""
echo "=== Find where sd1_clip is imported from ==="
rg -n 'from.*sd1_clip|import.*sd1_clip' comfy/
echo ""
echo "=== Find Qwen25_3BModel ==="
rg -n 'class Qwen25_3BModel' comfy/
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Qwen25_3BModel definition ==="
sed -n '25,50p' comfy/text_encoders/omnigen2.py
echo ""
echo "=== SDClipModel and SD1ClipModel load_sd or __init__ ==="
sed -n '710,800p' comfy/sd1_clip.py
echo ""
echo "=== Search for load_state_dict in SD1ClipModel area ==="
rg -n -B5 -A15 'def load_state_dict|state_dict_prefix' comfy/sd1_clip.py | head -100
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search for SDClipModel definition ==="
rg -n 'class SDClipModel' comfy/
echo ""
echo "=== Search for Qwen25_3B in llama.py ==="
rg -n -B2 -A30 'class Qwen25_3B' comfy/text_encoders/llama.py | head -80
echo ""
echo "=== Check if SDClipModel has load_sd or transformers loading ==="
rg -n -B5 -A20 'class SDClipModel' comfy/sd1_clip.py | head -80
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check SDClipModel.load_sd implementation ==="
rg -n -B3 -A25 'def load_sd' comfy/sd1_clip.py | head -80
echo ""
echo "=== Check if Qwen25_3B or Llama2_ handle state dict prefix flexibly ==="
rg -n 'def forward|def load_state_dict|state_dict' comfy/text_encoders/llama.py | head -30
echo ""
echo "=== Verify: check what keys Llama2_ expects by looking at model structure ==="
sed -n '300,400p' comfy/text_encoders/llama.py | head -60
Check whether the 3B path needs the same prefix normalization as the 7B path.
detect_te_model() now accepts model.language_model.* prefixed layouts for both 256-dim (3B) and 512-dim (7B) models (lines 1271–1276). However, the 3B loader at line 1425 passes the state dict directly to omnigen2.te() with no key rewriting, while the 7B loader at lines 1431–1440 normalizes the prefixes before loading.
Since SDClipModel.load_sd() calls transformer.load_state_dict(sd, strict=False), PyTorch will silently ignore the mismatched keys model.language_model.* when the model expects model.layers.*. The checkpoint will appear supported but fail to load any weights.
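This silent-drop behavior is easy to demonstrate with a toy module; the mismatched key name below is illustrative:

```python
import torch

lin = torch.nn.Linear(4, 4)
# With strict=False, keys under an unexpected prefix are only reported, never loaded.
result = lin.load_state_dict({"model.language_model.weight": torch.zeros(4, 4)}, strict=False)
print(result.missing_keys)     # ['weight', 'bias'] -> nothing was actually loaded
print(result.unexpected_keys)  # ['model.language_model.weight']
```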
The 3B path should either rewrite the keys the same way as 7B, or the detection should be scoped to only the 7B branch.
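A generic sketch of the kind of prefix normalization both branches would share; `rewrite_prefix` below is a hypothetical stand-in for whatever helper ComfyUI uses (e.g. a `state_dict_prefix_replace`-style utility), not the actual loader code:

```python
def rewrite_prefix(sd, old, new):
    """Rewrite keys from one prefix layout to another, leaving other keys intact."""
    out = {}
    for k, v in sd.items():
        if k.startswith(old):
            out[new + k[len(old):]] = v
        else:
            out[k] = v
    return out

sd = {
    "model.language_model.layers.0.self_attn.k_proj.bias": 0,
    "model.language_model.embed_tokens.weight": 1,
}
normalized = rewrite_prefix(sd, "model.language_model.", "model.")
print(sorted(normalized))
# ['model.embed_tokens.weight', 'model.layers.0.self_attn.k_proj.bias']
```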
https://huggingface.co/tencent/HY-OmniWeaving
Repackaged models:
https://huggingface.co/vafipas663/HY-OmniWeaving_repackaged/tree/main/split_files
Tested with their model and encoder as-is.
Workflow:
