Sadly new nodes still recompile constantly.

On Turing, I have to set orig_dtype to float16 so it doesn't complain about the bfloat-to-half conversion. I guess it _doesn't_ get set in forward. That's easily fixable, though. Unfortunately, every run of the node triggers a recompile, so it costs about 90s each time. If I run it without compile, it's not as fast as bob's node with compilation.
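A minimal sketch of the dtype workaround, assuming the choice can be made from the GPU's compute capability. `pick_orig_dtype` is a hypothetical helper, not something either node pack provides; the only hard fact it relies on is that Turing is compute capability 7.5 and native bfloat16 starts at 8.0.

```python
# Hypothetical helper (an assumption for illustration, not part of the nodes).
def pick_orig_dtype(compute_capability):
    """Return the dtype name to use as orig_dtype for a CUDA compute capability."""
    major, minor = compute_capability
    # Native bfloat16 arrives with compute capability 8.0 (Ampere); Turing is 7.5,
    # so anything older falls back to float16.
    return "bfloat16" if (major, minor) >= (8, 0) else "float16"


# With PyTorch available, the tuple could come from
# torch.cuda.get_device_capability() and the name be resolved with getattr(torch, name).
print(pick_orig_dtype((7, 5)))  # Turing
print(pick_orig_dtype((8, 6)))  # Ampere
```

Setting this once at load time would avoid depending on forward to set it.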

I already went in and exposed only one Triton autotune config, so it's not related to that. Something is breaking the graph. There is some partial offload because TE+Chroma don't fully fit in 22 GB of VRAM, but bob's node can handle it. It's the difference between 8.x seconds on new prompts and 10.x seconds with the same workflow.
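One way to narrow down what breaks the graph is PyTorch's `TORCH_LOGS` environment variable, which accepts `recompiles` and `graph_breaks` among other artifacts. A rough sketch, assuming the variable is set before the process imports torch; `with_compile_logging` is a hypothetical helper and the launch command in the comment is only a placeholder.

```python
import os


def with_compile_logging(env, extra=("recompiles", "graph_breaks")):
    """Return a copy of env with the given artifacts appended to TORCH_LOGS."""
    new_env = dict(env)
    # Keep whatever was already requested and avoid duplicate entries.
    current = [p for p in new_env.get("TORCH_LOGS", "").split(",") if p]
    for name in extra:
        if name not in current:
            current.append(name)
    new_env["TORCH_LOGS"] = ",".join(current)
    return new_env


env = with_compile_logging(os.environ)
print(env["TORCH_LOGS"])
# The environment can then be passed to the process that runs the workflow, e.g.
# subprocess.run([sys.executable, "main.py"], env=env)  # script name is an assumption
```

With those logs on, each recompile should report which guard failed, and each graph break should report the offending call, which would show whether the partial offload is involved.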

Output from the TE is now improved, though, and tensorwise models run without NaN on quantops. Dynamic VRAM is disabled both globally and in the nodes.


Sadly new nodes still recompile constantly. #19
