I am passionate about stripping away the "black box" of modern AI frameworks and building high-performance systems from the ground up. I specialize in writing raw CUDA kernels, optimizing deep learning models, and exploring the limits of GPU architectures.
- 🔭 I’m currently building high-performance, lightweight LLM inference engines and AI/NLP applications.
- 🌱 I’m currently diving deep into NVIDIA GPU architecture, machine learning, deep learning, and optimization.
- 👯 I’m looking to collaborate on Open-source AI Infrastructure & Low-level Systems.
- ⚡ Fun fact: I believe the best way to understand how things work is to build them entirely from scratch.
Languages & Core Tools
Frameworks & Expertise
A lightweight, from-scratch CUDA inference engine for transformer models. Built to deeply understand GPU computing fundamentals.
- No PyTorch, No TensorRT. Just raw, hand-optimized CUDA C++.
- Features a custom GEMM (tiled implementation using shared memory).
- Implements Flash Attention v1/v2 and Online Softmax.
- Achieves parity with TensorRT FP16 using INT8 Quantization (dp4a).
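
The online softmax named in the feature list can be illustrated with a small host-side reference. This is a minimal C++ sketch of the general single-pass recurrence, not the engine's actual CUDA kernel. The function name `online_softmax` is a hypothetical helper introduced here for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical reference helper (assumption, not taken from the project).
// Single-pass "online" softmax: keeps a running maximum m and a running
// sum d of exp(x_i - m). Whenever the maximum grows, the sum accumulated
// so far is rescaled by exp(m_old - m_new), so no second pass over the
// input is needed to find the maximum first.
std::vector<float> online_softmax(const std::vector<float>& x) {
    float m = -std::numeric_limits<float>::infinity();  // running max
    float d = 0.0f;                                      // running denominator
    for (float v : x) {
        const float m_new = std::max(m, v);
        d = d * std::exp(m - m_new) + std::exp(v - m_new);
        m = m_new;
    }
    // Normalize with the final max and denominator.
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - m) / d;
    }
    return y;
}
```

Subtracting the running maximum keeps every exponent at or below zero, so large logits do not overflow. The same rescaling step is what allows attention to be computed block by block in Flash Attention style kernels.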