LLM Performance Tuning

Reference

Ideas

  • https://x.com/athleticKoder/status/1979163202844754396
    • Techniques I’d master if I wanted to make LLMs faster + cheaper.
      1. Quantization
      2. KV-Cache Quantization
      3. Flash Attention
      4. Speculative Decoding
      5. LoRA
      6. Pruning
      7. Knowledge Distillation
      8. Weight Sharing
      9. Sparse Attention
      10. Batching & Dynamic Batching
      11. Model Serving Optimization
      12. Tensor Parallelism
      13. Pipeline Parallelism
      14. Paged Attention
      15. Mixed Precision Inference
      16. Early Exit / Token-Level Pruning
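A minimal sketch of item 1, quantization: weights are mapped to low-bit integers plus a scale, trading a small, bounded rounding error for smaller memory and faster integer math. This toy uses a single per-tensor symmetric int8 scale; real systems typically use per-channel or group-wise scales.

```python
# Toy symmetric int8 quantization (per-tensor scale; an illustrative
# simplification, not any particular library's scheme).

def quantize_int8(weights):
    """Map float weights to int8 values plus a shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights."""
    return [x * scale for x in q]

weights = [0.12, -0.98, 0.33, 0.5]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# Per-weight error is bounded by half a quantization step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
```

The same idea applied to the cached keys and values rather than the weights is item 2, KV-cache quantization.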
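Items 2 and 14 both revolve around the KV cache, so a sketch of the cache itself helps: at each decoding step the new token's key and value are appended to a cache and attention runs over the cached history, so past keys/values are computed once instead of recomputed per step. This toy uses single-head attention with identity projections (an assumption for brevity).

```python
# Toy autoregressive decoding loop with a KV cache.
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    w = [e / z for e in exps]
    return [sum(wi * v[i] for wi, v in zip(w, values)) for i in range(d)]

cache_k, cache_v = [], []
for token_vec in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    # Identity q/k/v projections stand in for the real learned ones.
    cache_k.append(token_vec)
    cache_v.append(token_vec)
    out = attend(token_vec, cache_k, cache_v)
```

Paged attention (item 14) changes how `cache_k`/`cache_v` are laid out in GPU memory (fixed-size blocks instead of one contiguous buffer per sequence); KV-cache quantization (item 2) stores them in low-bit form.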
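Item 4, speculative decoding, can also be sketched: a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is accepted, so the expensive model advances multiple tokens per verification. The draft/target "models" below are toy next-token functions (assumptions for illustration), and the greedy token-by-token verification loop stands in for what is really a single batched target forward pass over all drafted positions.

```python
# Toy greedy speculative decoding step.

def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens cheaply, keep the prefix the target agrees with."""
    ctx_draft = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft_model(ctx_draft)
        proposed.append(t)
        ctx_draft.append(t)
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target_model(ctx) == t:       # target verifies the drafted token
            accepted.append(t)
            ctx.append(t)
        else:                            # on mismatch, take the target's token
            accepted.append(target_model(ctx))
            break
    return accepted

# Toy models: the target emits last token + 1; the draft agrees but caps at 3,
# so it diverges after three correct guesses.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: min(ctx[-1] + 1, 3)
out = speculative_step([0], draft, target, k=4)
```

Here the target model accepts the draft's first three tokens and supplies the fourth itself, so one verification pass yields four tokens of progress.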