Hi, I am trying to reproduce M3D by pretraining the mm_projector. However, when training with bf16, the loss suddenly becomes 0 and grad_norm becomes nan at a random point in training. I have tried many of the fixes suggested online and by GPT, including setting overlap_comm=False, but the problem persists. If you have encountered this issue, I would really appreciate it if you could share your solution. Thanks a lot!
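To help pin down where a run like this breaks, a minimal sketch such as the following could scan the logged metrics for the first step where the loss is exactly 0 or grad_norm is no longer finite. It assumes each log record is a plain dict with `loss` and `grad_norm` keys; that record layout and the helper name `first_bad_step` are hypothetical and not taken from the M3D code.

```python
import math


def first_bad_step(records):
    """Return the index of the first suspicious log record, or None.

    A record is suspicious if its grad_norm is NaN/inf or its loss is
    exactly 0. The dict keys used here are an assumed log layout.
    """
    for i, rec in enumerate(records):
        loss = rec.get("loss")
        grad_norm = rec.get("grad_norm")
        # A non-finite gradient norm is the clearest sign of divergence.
        if grad_norm is not None and not math.isfinite(grad_norm):
            return i
        # A loss of exactly 0 is the other symptom described above.
        if loss is not None and loss == 0.0:
            return i
    return None


if __name__ == "__main__":
    logs = [
        {"loss": 2.31, "grad_norm": 0.84},
        {"loss": 2.05, "grad_norm": 0.91},
        {"loss": 0.0, "grad_norm": float("nan")},
    ]
    print(first_bad_step(logs))
```

Knowing the first failing step makes it possible to save a checkpoint shortly before it and inspect the batch and activations at that point.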