Why did you decide to use an empty prompt "" when running the UNet to build the features from the noisy image?
input_ids_for_encoder = tokenizer(
    "",  # args.prompt_template.format(placeholder_token=args.domain_class_token),
    padding="max_length",
    truncation=True,
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
).input_ids
We know the image would be described by something like "a photo of args.domain_class_token", so I'm not sure whether using the empty prompt has an impact on the pretraining.
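To make the comparison concrete, here is a minimal sketch of how the two prompt choices could be toggled for an ablation. The helper `build_encoder_prompt` and the flag `use_empty_prompt` are hypothetical names that are not in the repo, and the template string and the "dog" token are only assumptions based on the "a photo of ..." example above. The returned string would be passed to `tokenizer(...)` in place of the hard-coded "".

```python
def build_encoder_prompt(
    prompt_template: str,
    domain_class_token: str,
    use_empty_prompt: bool = True,
) -> str:
    """Return the text fed to the tokenizer before the UNet feature pass.

    Hypothetical helper: `use_empty_prompt=True` mirrors the current
    behaviour (""), while False uses the class-conditioned template.
    """
    if use_empty_prompt:
        return ""
    return prompt_template.format(placeholder_token=domain_class_token)


# Assumed template, taken from the "a photo of ..." example in the question.
template = "a photo of {placeholder_token}"

empty_prompt = build_encoder_prompt(template, "dog", use_empty_prompt=True)
class_prompt = build_encoder_prompt(template, "dog", use_empty_prompt=False)

print(repr(empty_prompt))  # ''
print(repr(class_prompt))  # 'a photo of dog'
```

Running the pretraining once with each setting would show whether the conditioning text makes a measurable difference.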