Thank you for this impressive work. I have a question about MoBA: compared to full attention, can MoBA reduce GPU memory consumption during both training and inference, thereby supporting longer input sequences?
We ran some experiments and observed a slight increase in memory consumption relative to full attention, despite a significant improvement in decoding speed. Is this expected?
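For context, our measurement was roughly along these lines (a minimal sketch, not our exact script: `model` is a placeholder for a causal LM with MoBA attention enabled, and we assume a HuggingFace-style `generate` interface; the memory and timing calls are standard PyTorch):

```python
import time
import torch

def measure_decode(model, input_ids, max_new_tokens=128):
    """Report peak GPU memory and decode throughput for one generation run."""
    torch.cuda.reset_peak_memory_stats()   # clear previous peak-memory counters
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()                # wait for all kernels before timing
    elapsed = time.time() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    new_tokens = out.shape[-1] - input_ids.shape[-1]
    print(f"peak memory: {peak_gib:.2f} GiB, "
          f"decode speed: {new_tokens / elapsed:.1f} tok/s")
```

With this setup, the MoBA run showed higher peak memory than the full-attention baseline at the same sequence length, even though tokens per second improved noticeably.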