Hi, thanks for sharing this great work!
In this scenario, the CLIP image embedding and the text embeddings are added after being forwarded through a shared linear layer, mixed with a hyper-parameter $\alpha$.


Questions
- The CLIP text embeddings have shape [batch size, 77, 1024], while the CLIP image embedding has shape [batch size, 1024]. How are they added after being forwarded through the shared linear layer? (I guess the CLIP image embedding is unsqueezed and expanded to the token length of 77.)
- How is $\alpha$ set during training? (At inference, $\alpha$ presumably controls the influence of each type of guidance.)
- If classifier-free guidance is applied, what guidance scale do you use at inference, and what `ucg_rate` (the probability of replacing an embedding with a zero embedding during training) do you use?
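To make the first question concrete, here is a minimal sketch of what I assume is happening: both embeddings pass through one shared linear layer, the pooled image embedding is unsqueezed so it broadcasts across the 77 text tokens, and the two are mixed with $\alpha$. The names (`proj`, `alpha`) and the exact mixing form are my guesses, not taken from the repo.

```python
import torch
import torch.nn as nn

B, T, D = 2, 77, 1024

# Hypothetical shared linear layer applied to both modalities.
proj = nn.Linear(D, D)

text_emb = torch.randn(B, T, D)  # CLIP text token embeddings
img_emb = torch.randn(B, D)      # pooled CLIP image embedding

alpha = 0.5  # mixing weight; at inference it would trade off the two guidances

# Project both through the shared layer, then unsqueeze the image
# embedding to [B, 1, D] so it broadcasts over all 77 token positions.
fused = proj(text_emb) + alpha * proj(img_emb).unsqueeze(1)

print(fused.shape)  # torch.Size([2, 77, 1024])
```

Is this roughly what the implementation does, or is the image embedding concatenated/handled differently?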
Thanks!