When training SDXL with LoRA, the following error occurs at the UNet forward pass:
RuntimeError: Tensors must have same number of dimensions: got 3 and 2
Full stacktrace:
File ".../unet_2d_condition.py", line 981, in get_aug_embed
add_embeds = torch.concat([text_embeds, time_embeds], dim=-1)
RuntimeError: Tensors must have same number of dimensions: got 3 and 2
Steps to Reproduce
Use SDXL base model (stabilityai/stable-diffusion-xl-base-1.0) and a LoRA fine-tuning script based on official diffusers or notebook code.
At each training iteration, the UNet forward pass fails; the error is raised during the concatenation of text_embeds and time_embeds inside get_aug_embed.
Suspected Cause
This occurs because text_embeds from SDXL's text encoder have shape [batch, seq, emb], while time_embeds have shape [batch, emb]. They cannot be concatenated directly along the last axis.
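The mismatch is easy to reproduce with bare tensors. A minimal sketch (shapes chosen to match SDXL's dimensions: pooled embedding of size 1280, six time ids projected to 256 each for a 1536-wide time embedding; no diffusers model is loaded):

```python
import torch

batch, seq, emb = 2, 77, 1280

prompt_embeds = torch.randn(batch, seq, emb)    # per-token encoder output, dim == 3
pooled_prompt_embeds = torch.randn(batch, emb)  # pooled embedding, dim == 2
time_embeds = torch.randn(batch, 1536)          # flattened time-ids embedding, dim == 2

# Concatenating a 3D and a 2D tensor along the last axis raises the reported error
try:
    torch.concat([prompt_embeds, time_embeds], dim=-1)
except RuntimeError as e:
    print(e)  # Tensors must have same number of dimensions: got 3 and 2

# The pooled embedding has matching rank, so the concat succeeds
add_embeds = torch.concat([pooled_prompt_embeds, time_embeds], dim=-1)
print(add_embeds.shape)  # torch.Size([2, 2816])
```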
Proposed Fix
Broadcast or unsqueeze time_embeds to match the shape of text_embeds, or use the pooled text embedding correctly.
Alternatively, use pooled_prompt_embeds for both parts if that's what SDXL expects.
Update the notebook or script to ensure text_embeds and time_embeds have compatible shapes before concatenation:
if text_embeds.dim() == 3 and time_embeds.dim() == 2:
    # Use pooled text embedding (dim == 2)
    add_embeds = torch.concat([pooled_prompt_embeds, time_embeds], dim=-1)
else:
    # If needed, unsqueeze or broadcast time_embeds
    ...
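For completeness, the broadcast alternative from the else branch can be sketched as follows. Note this only makes the ranks compatible; SDXL's own add-embedding path expects the 2D pooled embedding, so this is shown as an illustration of the shape fix, not as the recommended conditioning:

```python
import torch

batch, seq, emb = 2, 77, 1280
text_embeds = torch.randn(batch, seq, emb)  # per-token embeddings, dim == 3
time_embeds = torch.randn(batch, 1536)      # time embedding, dim == 2

# Repeat the time embedding across the sequence axis so ranks match
time_embeds_3d = time_embeds.unsqueeze(1).expand(batch, seq, -1)  # [batch, seq, 1536]
merged = torch.concat([text_embeds, time_embeds_3d], dim=-1)      # [batch, seq, 2816]
print(merged.shape)
```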
Environment
diffusers version: latest
Python version: 3.12
Torch version: latest
GPU: any
Please fix the tensor shapes for UNet SDXL training and document the proper SDXL conditioning flow for LoRA fine-tuning in the repository docs.
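For reference, diffusers' SDXL UNet receives the per-token embeddings via encoder_hidden_states and the pooled embedding plus raw time ids via added_cond_kwargs; get_aug_embed then embeds the time ids and concatenates them with the already-2D pooled embedding. A shape-checking stub stands in for the real UNet below (loading the actual model is out of scope here), so this is a sketch of the expected argument shapes rather than a runnable training step:

```python
import torch

def unet_forward_stub(sample, timestep, encoder_hidden_states, added_cond_kwargs):
    # Stand-in for UNet2DConditionModel.forward: validates only the shapes that
    # the get_aug_embed concatenation relies on.
    assert encoder_hidden_states.dim() == 3              # [batch, seq, emb]
    assert added_cond_kwargs["text_embeds"].dim() == 2   # pooled: [batch, emb]
    assert added_cond_kwargs["time_ids"].shape[-1] == 6  # original/crop/target sizes
    return sample

batch = 2
noisy_latents = torch.randn(batch, 4, 64, 64)
timesteps = torch.randint(0, 1000, (batch,))
prompt_embeds = torch.randn(batch, 77, 2048)     # concatenated outputs of both text encoders
pooled_prompt_embeds = torch.randn(batch, 1280)  # pooled output of the second text encoder
add_time_ids = torch.tensor([[1024, 1024, 0, 0, 1024, 1024]] * batch)

out = unet_forward_stub(
    noisy_latents,
    timesteps,
    encoder_hidden_states=prompt_embeds,
    added_cond_kwargs={"text_embeds": pooled_prompt_embeds, "time_ids": add_time_ids},
)
```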