Implement a KV-cache-reducing attention variant such as [Multi-Query Attention](https://arxiv.org/abs/1911.02150) (MQA) or [Grouped-Query Attention](https://arxiv.org/abs/2305.13245) (GQA). According to the [LLaMA-2](https://arxiv.org/abs/2307.09288) paper, GQA performs slightly better than MQA.
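As a starting point, here is a minimal NumPy sketch of grouped-query attention for one decoding step against a cached K/V. It is an illustration under stated assumptions, not this project's implementation: the function name `grouped_query_attention`, the tensor layout `(heads, seq, head_dim)`, and the `n_kv_heads` parameter are all hypothetical, and masking, batching, and projection weights are left out. Setting `n_kv_heads=1` gives MQA, and `n_kv_heads == n_heads` gives ordinary multi-head attention.

```python
import numpy as np


def grouped_query_attention(q, k, v, n_kv_heads):
    """Attention where several query heads share one K/V head.

    q:    (n_heads,    seq_q, head_dim)
    k, v: (n_kv_heads, seq_k, head_dim)  <- this is what the KV cache stores
    No causal mask is applied; that is sufficient for a single decode step
    (seq_q == 1) attending over already-generated positions.
    """
    n_heads, _, head_dim = q.shape
    assert k.shape[0] == n_kv_heads and v.shape[0] == n_kv_heads
    assert n_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_heads // n_kv_heads

    # Expand each K/V head so it is shared by `group_size` consecutive query heads.
    k_rep = np.repeat(k, group_size, axis=0)  # (n_heads, seq_k, head_dim)
    v_rep = np.repeat(v, group_size, axis=0)

    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_rep  # (n_heads, seq_q, head_dim)


# Hypothetical decode step: 8 query heads, 2 cached K/V heads, 5 cached positions.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 1, 16))
k_cache = rng.standard_normal((2, 5, 16))
v_cache = rng.standard_normal((2, 5, 16))
out = grouped_query_attention(q, k_cache, v_cache, n_kv_heads=2)
print(out.shape)  # (8, 1, 16)
```

In this layout the cache holds `2 * n_kv_heads * seq_k * head_dim` values instead of `2 * n_heads * seq_k * head_dim`, which is where the memory saving comes from. The `np.repeat` copy is only for clarity; a real implementation would more likely reshape the queries into groups and broadcast against the shared K/V.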