Hi, I wanted to commend you on the excellent work with the sdxl-turbo model. While following the implementation, I noticed that the model uses two CLIP text encoders, yet only one of them appears to be optimized in the codebase.
Could this cause any issues or affect training efficiency? Is it an intentional design choice, or something that still needs to be optimized?
Looking forward to your insights.