When training SDXL with LoRA, the following error occurs at the UNet forward pass:
RuntimeError: Tensors must have same number of dimensions: got 3 and 2
Full stacktrace:
File ".../unet_2d_condition.py", line 981, in get_aug_embed
add_embeds = torch.concat([text_embeds, time_embeds], dim=-1)
RuntimeError: Tensors must have same number of dimensions: got 3 and 2
Steps to Reproduce
Use SDXL base model (stabilityai/stable-diffusion-xl-base-1.0) and a LoRA fine-tuning script based on official diffusers or notebook code.
At each training iteration, the UNet forward pass fails; the error is raised during the concatenation of text_embeds and time_embeds inside get_aug_embed.
Suspected Cause
This occurs because text_embeds from SDXL's text encoder have shape [batch, seq, emb], while time_embeds have shape [batch, emb]. They cannot be concatenated directly along the last axis.
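The mismatch is easy to reproduce with bare tensors. A minimal sketch (shapes chosen to match SDXL's dimensions: pooled embedding of size 1280, six time ids projected to 256 each for a 1536-wide time embedding; no diffusers model is loaded):

```python
import torch

batch, seq, emb = 2, 77, 1280

prompt_embeds = torch.randn(batch, seq, emb)    # per-token encoder output, dim == 3
pooled_prompt_embeds = torch.randn(batch, emb)  # pooled embedding, dim == 2
time_embeds = torch.randn(batch, 1536)          # flattened time-ids embedding, dim == 2

# Concatenating a 3D and a 2D tensor along the last axis raises the reported error
try:
    torch.concat([prompt_embeds, time_embeds], dim=-1)
except RuntimeError as e:
    print(e)  # Tensors must have same number of dimensions: got 3 and 2

# The pooled embedding has matching rank, so the concat succeeds
add_embeds = torch.concat([pooled_prompt_embeds, time_embeds], dim=-1)
print(add_embeds.shape)  # torch.Size([2, 2816])
```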
Proposed Fix
Broadcast or unsqueeze time_embeds to match the shape of text_embeds, or use the pooled text embedding correctly.
Alternatively, use pooled_prompt_embeds for both parts if that's what SDXL expects.
Update the notebook or script to ensure text_embeds and time_embeds have compatible shapes before concatenation:
if text_embeds.dim() == 3 and time_embeds.dim() == 2:
    # Use pooled text embedding (dim == 2)
    add_embeds = torch.concat([pooled_prompt_embeds, time_embeds], dim=-1)
else:
    # If needed, unsqueeze or broadcast time_embeds
    ...
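For completeness, the broadcast alternative from the else branch can be sketched as follows. Note this only makes the ranks compatible; SDXL's own add-embedding path expects the 2D pooled embedding, so this is shown as an illustration of the shape fix, not as the recommended conditioning:

```python
import torch

batch, seq, emb = 2, 77, 1280
text_embeds = torch.randn(batch, seq, emb)  # per-token embeddings, dim == 3
time_embeds = torch.randn(batch, 1536)      # time embedding, dim == 2

# Repeat the time embedding across the sequence axis so ranks match
time_embeds_3d = time_embeds.unsqueeze(1).expand(batch, seq, -1)  # [batch, seq, 1536]
merged = torch.concat([text_embeds, time_embeds_3d], dim=-1)      # [batch, seq, 2816]
print(merged.shape)
```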
Environment
diffusers version: latest
Python version: 3.12
Torch version: latest
GPU: any
Please fix the tensor shapes for UNet SDXL training and document the proper SDXL conditioning flow for LoRA fine-tuning in the repository docs.
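For reference, diffusers' SDXL UNet receives the per-token embeddings via encoder_hidden_states and the pooled embedding plus raw time ids via added_cond_kwargs; get_aug_embed then embeds the time ids and concatenates them with the already-2D pooled embedding. A shape-checking stub stands in for the real UNet below (loading the actual model is out of scope here), so this is a sketch of the expected argument shapes rather than a runnable training step:

```python
import torch

def unet_forward_stub(sample, timestep, encoder_hidden_states, added_cond_kwargs):
    # Stand-in for UNet2DConditionModel.forward: validates only the shapes that
    # the get_aug_embed concatenation relies on.
    assert encoder_hidden_states.dim() == 3              # [batch, seq, emb]
    assert added_cond_kwargs["text_embeds"].dim() == 2   # pooled: [batch, emb]
    assert added_cond_kwargs["time_ids"].shape[-1] == 6  # original/crop/target sizes
    return sample

batch = 2
noisy_latents = torch.randn(batch, 4, 64, 64)
timesteps = torch.randint(0, 1000, (batch,))
prompt_embeds = torch.randn(batch, 77, 2048)     # concatenated outputs of both text encoders
pooled_prompt_embeds = torch.randn(batch, 1280)  # pooled output of the second text encoder
add_time_ids = torch.tensor([[1024, 1024, 0, 0, 1024, 1024]] * batch)

out = unet_forward_stub(
    noisy_latents,
    timesteps,
    encoder_hidden_states=prompt_embeds,
    added_cond_kwargs={"text_embeds": pooled_prompt_embeds, "time_ids": add_time_ids},
)
```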