Sadly new nodes still recompile constantly. #19
Description
On Turing, I have to set orig_dtype to float16 so it doesn't complain about the bfloat16-to-half conversion. I guess it doesn't get set in forward. That's easily fixable, though. Unfortunately, every run of the node triggers a recompile, so it's about 90s each time. If I run it without compile, it's not as fast as bob's node with compilation.
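For reference, the Turing workaround could look roughly like the sketch below. It is an illustration, not the node's actual code: `pick_orig_dtype` and `detect_orig_dtype` are hypothetical helper names, and it assumes the fallback can be decided from the CUDA compute capability alone (Turing is 7.5, and native bfloat16 starts at 8.0 with Ampere).

```python
def pick_orig_dtype(capability):
    """Return a dtype name for a (major, minor) CUDA compute capability.

    Hypothetical helper: GPUs below compute capability 8.0 (e.g. Turing, 7.5)
    have no native bfloat16, so fall back to float16 there.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


def detect_orig_dtype():
    """Resolve the dtype for the current GPU, or None if CUDA is unavailable.

    torch is imported lazily so pick_orig_dtype stays usable without it.
    """
    import torch

    if not torch.cuda.is_available():
        return None  # assumption: leave orig_dtype untouched without CUDA
    name = pick_orig_dtype(torch.cuda.get_device_capability())
    return getattr(torch, name)
```

Resolving this once at load time, rather than inside forward, would also keep the dtype constant across runs, which may matter for the compile guards.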
I already went in and exposed only one Triton autotune config, so it's not related to that. Something is breaking the graph. There is some partial offload because TE+Chroma doesn't fully fit in 22 GB of VRAM, but bob's node can handle it. It's the difference between 8.x seconds and 10.x seconds on new prompts with the same workflow.
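To narrow down what breaks the graph, something like the following sketch might help. It assumes a PyTorch 2.1+ install; the helper names `enable_recompile_logging` and `summarize_graph_breaks` are made up for illustration. `TORCH_LOGS` with the `recompiles` and `graph_breaks` artifacts and `torch._dynamo.explain` are the standard Dynamo diagnostics, but I haven't checked how they interact with the partial offload here.

```python
import os


def enable_recompile_logging(env=None):
    """Set TORCH_LOGS so Dynamo logs recompile reasons and graph breaks.

    Must run before torch is imported, since the variable is read at import.
    Pass a dict to build an environment for a subprocess instead.
    """
    env = os.environ if env is None else env
    env["TORCH_LOGS"] = "recompiles,graph_breaks"
    return env


def summarize_graph_breaks(fn, *args, **kwargs):
    """Trace fn once with Dynamo and return (break_count, reasons)."""
    import torch  # imported lazily so the logging helper works without torch

    report = torch._dynamo.explain(fn)(*args, **kwargs)
    return report.graph_break_count, [str(r) for r in report.break_reasons]
```

If the process is already running, `torch._logging.set_logs(recompiles=True, graph_breaks=True)` should give the same log output without touching the environment. The recompile log lines name the guard that failed, which would show whether it is a dtype, shape, or device change between runs.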
Output from the TE is improved now, though, and tensorwise models run without NaNs on quantops. Dynamic VRAM is disabled both globally and in the nodes.