NaN gradients with learnable sampling offset

I find that using a learnable sampling offset leads to NaN gradients. When I remove the linear layer that predicts the sampling offset and replace it with a hand-crafted fixed offset, the NaN problem goes away, but performance deteriorates badly.
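
For concreteness, the two variants can be sketched roughly as below. This is a hypothetical reconstruction, not the actual code: it assumes PyTorch, a 2D feature map sampled with `F.grid_sample`, and made-up names (`OffsetSampler`, `nonfinite_grad_names`). The helper only reports which parameters end up with NaN/Inf gradients after a backward pass; it does not fix anything.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffsetSampler(nn.Module):
    """Hypothetical module: samples a feature map at a base grid plus an offset."""

    def __init__(self, channels, learnable=True):
        super().__init__()
        self.learnable = learnable
        if learnable:
            # Linear layer predicting a 2D (x, y) offset per location.
            self.offset = nn.Linear(channels, 2)

    def forward(self, feat):
        n, c, h, w = feat.shape
        # Fixed base grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1.0, 1.0, h, device=feat.device)
        xs = torch.linspace(-1.0, 1.0, w, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(n, h, w, 2)
        if self.learnable:
            # (N, C, H, W) -> (N, H, W, C) -> (N, H, W, 2)
            grid = grid + self.offset(feat.permute(0, 2, 3, 1))
        return F.grid_sample(feat, grid, align_corners=True)


def nonfinite_grad_names(model):
    """Return names of parameters whose gradient contains NaN or Inf."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]


torch.manual_seed(0)
model = OffsetSampler(channels=8, learnable=True)
out = model(torch.randn(2, 8, 16, 16))
out.mean().backward()
print(nonfinite_grad_names(model))
```

Calling `nonfinite_grad_names(model)` after each `backward()` in the real training loop would show whether the offset layer is the first place the gradients become non-finite, or whether they arrive there from a later layer.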

Does anyone have the same problem? Is there any solution if I want to use a learnable sampling offset while still training stably?