Great work, everyone!
My device has only 24 GB of VRAM, so I have to set the seqlen to around 4k when running the perf test, and at that length FSA performs worse than NSA. Also, the reported speedup ratio seems to have been measured at the default 64k seqlen, which is not very common in practice.
How do NSA_ref and FSA perform when the seqlen is relatively short? It would be helpful if you could provide a benchmark comparing NSA_ref, FSA, and full attention across a range of seqlens.
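
To make the request concrete, here is a rough sketch of the kind of sweep I have in mind. It is only an illustration: `fake_full`, `fake_nsa_ref`, and `fake_fsa` are placeholder callables I made up so the script runs on its own, not the repo's actual API, and the timing helper assumes CPU-style synchronous calls. For real GPU kernels you would swap in the actual attention calls and synchronize the device before reading the clock.

```python
import time
from statistics import median


def time_ms(fn, seqlen, warmup=2, iters=5):
    """Median wall-clock time of fn(seqlen) in milliseconds."""
    for _ in range(warmup):
        fn(seqlen)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(seqlen)
        # NOTE: for CUDA kernels, synchronize the device here before
        # stopping the timer, otherwise only launch overhead is measured.
        samples.append((time.perf_counter() - start) * 1e3)
    # Clamp to a tiny positive value so speedup ratios never divide by zero.
    return max(median(samples), 1e-9)


def sweep(kernels, seqlens, baseline):
    """Time every kernel at every seqlen; report speedup over `baseline`."""
    rows = []
    for n in seqlens:
        timings = {name: time_ms(fn, n) for name, fn in kernels.items()}
        base = timings[baseline]
        speedups = {name: base / t for name, t in timings.items()}
        rows.append((n, timings, speedups))
    return rows


# Hypothetical stand-ins (assumptions, not real kernels): replace each
# with a function that runs the corresponding attention at length n.
def fake_full(n):
    return sum(range(n))


def fake_nsa_ref(n):
    return sum(range(n // 2))


def fake_fsa(n):
    return sum(range(n // 4))


kernels = {"full": fake_full, "nsa_ref": fake_nsa_ref, "fsa": fake_fsa}

for n, timings, speedups in sweep(kernels, [1024, 2048, 4096], "full"):
    cols = "  ".join(
        f"{name}: {timings[name]:.4f} ms ({speedups[name]:.2f}x)"
        for name in kernels
    )
    print(f"seqlen={n:>6}  {cols}")
```

A table like this for seqlens from 1k up to 64k would show where FSA starts to beat NSA_ref and full attention.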