Releases: m96-chan/PyGPUkit
v0.2.20: Fused NN Kernels + Flash Attention 3 SM120
Highlights
Fused NN Kernels
High-performance fused kernels with 2-14x speedups (reference formulas are sketched after the table below):
- SwiGLU: `silu(gate) * up` (used in Qwen, LLaMA3, and Mistral FFNs)
- RMSNorm+Residual: `rmsnorm(x + residual) * gamma`
- GeGLU: `gelu(gate) * up`
| Kernel | Batch | Speedup |
|---|---|---|
| SwiGLU | 32 | 14.25x |
| RMSNorm+Residual | 128 | 12.37x |
| GeGLU | 32 | 13.10x |
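For reference, a minimal NumPy sketch of the math these kernels fuse, written as unfused CPU code (the `eps` default below is an assumption, not the kernel's value):

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swiglu(gate, up):
    # SwiGLU: silu(gate) * up  (Qwen / LLaMA3 / Mistral FFN)
    return silu(gate) * up

def geglu(gate, up):
    # GeGLU: gelu(gate) * up
    return gelu(gate) * up

def rmsnorm_residual(x, residual, gamma, eps=1e-6):
    # Fused residual add + RMSNorm: rmsnorm(x + residual) * gamma
    h = x + residual
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)  # eps is an assumed default
    return (h / rms) * gamma
```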
Flash Attention 3 SM120
- TMA (Tensor Memory Accelerator) for efficient global memory access
- Producer/consumer warp architecture for overlapped compute and memory ops
- Tunable configurations for different sequence lengths
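The TMA and warp-specialization details live at the CUDA/PTX level; as an algorithmic reference only, the sketch below shows the tiled, online-softmax attention that a flash kernel computes, in NumPy (single head, block size 64, and no causal mask are assumptions). The result is numerically equal to `softmax(q @ k.T / sqrt(d)) @ v`:

```python
import numpy as np

def flash_attention_reference(q, k, v, block=64):
    # q, k, v: (seq_len, head_dim); process K/V in tiles while keeping a
    # running row max and row sum so softmax never needs the full score matrix.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q, dtype=np.float64)
    row_max = np.full(n, -np.inf)
    row_sum = np.zeros(n)
    for start in range(0, n, block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        scores = (q @ kb.T) * scale                      # (n, tile)
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)           # rescale earlier partial sums
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]
```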
FP8 Block-Scale MMA
- Native PTX `mma.sync` for FP8 with per-block scaling
- Enables future W8A8 quantized inference paths
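A rough NumPy sketch of per-block scaling for FP8 weights; the block size of 128 and the E4M3 maximum of 448 are assumptions, and real kernels cast to e4m3 rather than clipping floats:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed max representable magnitude for E4M3

def quantize_blockwise_fp8(w, block=128):
    # Per-block scaling: each group of `block` columns gets its own scale so
    # values fit the FP8 dynamic range; dequantization is q * scale.
    rows, cols = w.shape
    q = np.zeros_like(w, dtype=np.float32)
    scales = np.zeros((rows, (cols + block - 1) // block), dtype=np.float32)
    for b, start in enumerate(range(0, cols, block)):
        blk = w[:, start:start + block]
        scale = np.abs(blk).max(axis=1, keepdims=True) / FP8_E4M3_MAX
        scale = np.maximum(scale, 1e-12)
        scales[:, b] = scale[:, 0]
        # emulate FP8 by clipping to its range (real kernels cast to e4m3)
        q[:, start:start + block] = np.clip(blk / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales
```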
Other Improvements
- Native Conv1d CUDA kernel
- LLaMA 4 native CUDA kernels
- Llama Guard 3 content safety classifier example
Full Changelog: v0.2.19...v0.2.20
v0.2.19 - FLUX.1 Image Generation
Highlights
FLUX.1 Image Generation
Text-to-image generation with Black Forest Labs' FLUX.1 model:
- Full FLUX.1-schnell transformer (19 joint + 38 single blocks)
- Flow matching Euler scheduler
- GPU-native operations (transpose, batched matmul, RoPE)
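A minimal sketch of the flow-matching Euler update the scheduler performs; the linear sigma schedule and placeholder velocity below are assumptions standing in for the FLUX transformer's prediction:

```python
import numpy as np

def euler_flow_matching_step(x_t, velocity, sigma, sigma_next):
    # One Euler step along the flow: move x by the predicted velocity
    # scaled by the change in noise level sigma.
    return x_t + (sigma_next - sigma) * velocity

# toy loop with a dummy velocity field
x = np.random.randn(4, 4)
sigmas = np.linspace(1.0, 0.0, 5)  # assumed linear sigma schedule
for s, s_next in zip(sigmas[:-1], sigmas[1:]):
    v = -x  # placeholder for the transformer's velocity prediction
    x = euler_flow_matching_step(x, v, s, s_next)
```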
Lazy Model Loading with Streaming
Memory-efficient model loading strategies:
- `StreamingStrategy.EAGER` - Load all weights at once (default)
- `StreamingStrategy.PROGRESSIVE` - Load during the first forward pass
- `StreamingStrategy.LAYER_BY_LAYER` - Minimal memory usage
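A toy, self-contained sketch of what the three strategies trade off; only the `StreamingStrategy` names come from this release, while the `LazyLayer` wrapper and `loader` callback are illustrative assumptions, not PyGPUkit API:

```python
from enum import Enum, auto

class StreamingStrategy(Enum):
    # names mirror the strategies shipped in this release
    EAGER = auto()           # load every tensor up front (default)
    PROGRESSIVE = auto()     # load tensors the first time a layer runs
    LAYER_BY_LAYER = auto()  # load one layer, run it, then free it

class LazyLayer:
    """Toy lazy-loading wrapper; `loader(name)` stands in for reading a shard."""
    def __init__(self, name, loader, strategy):
        self.name, self.loader, self.strategy = name, loader, strategy
        self.weights = loader(name) if strategy is StreamingStrategy.EAGER else None

    def forward(self, x):
        if self.weights is None:
            self.weights = self.loader(self.name)  # lazy load on first use
        y = x * self.weights                       # placeholder compute
        if self.strategy is StreamingStrategy.LAYER_BY_LAYER:
            self.weights = None                    # release to keep memory minimal
        return y
```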
cuBLAS Dynamic Loader
- Runtime DLL loading without compile-time CUDA Toolkit
- Auto-detection of cuBLASLt versions (13/12/11)
- Graceful fallback to native kernels
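A minimal ctypes sketch of the runtime-loading idea; the library file names for versions 13/12/11 are assumptions about typical cuBLASLt naming, and the real loader falls back to native kernels when none resolve:

```python
import ctypes

def load_cublaslt():
    # Try known cuBLASLt major versions at runtime, newest first; return None
    # so the caller can fall back to the native GEMM kernels instead.
    candidates = [
        "cublasLt64_13.dll", "cublasLt64_12.dll", "cublasLt64_11.dll",  # Windows
        "libcublasLt.so.13", "libcublasLt.so.12", "libcublasLt.so.11",  # Linux
    ]
    for name in candidates:
        try:
            return ctypes.CDLL(name)  # resolves from the system search path
        except OSError:
            continue
    return None  # graceful fallback: no cuBLASLt found

handle = load_cublaslt()
```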
C++ Kernel Profiler
- Built-in CUDA kernel profiling with minimal overhead
- Per-kernel timing statistics
HuggingFace T5 Encoder
- Sharded safetensors support
- Full T5 encoder for FLUX/SD3
DiT Architecture
- PixArt transformer with AdaLN-Zero
- Self/cross attention with GQA
- GEGLU FFN
New GPU Operations
`transpose_4d_0213`, `transpose_3d_012`, `gpu_batched_matmul`, `gpu_softmax`, `gpu_apply_rope`, `cross_attention`, `conv2d`, `group_norm`
Known Issues
- FLUX.1 performance needs optimization (#187)
Full Changelog: v0.2.18...v0.2.19
v0.2.18
What's New in v0.2.18
Major Codebase Refactoring
Complete modularization of the codebase for better maintainability:
- Split monolithic files into modular `.inl` components
- Reorganized matmul kernel directory structure
- Standardized GEMM/GEMV naming conventions
- Modular pybind11 bindings
Kokoro-82M TTS
Text-to-speech synthesis with Japanese/English support.
Positional Encoding Operations
New neural network operations for attention mechanisms:
| Function | Description |
|---|---|
| `pope_init_encoding` | Sinusoidal positional encoding (PoPE) |
| `pope_inplace` | Apply additive encoding to Q/K |
| `alibi_init_slopes` | ALiBi head-specific slopes |
| `alibi_compute_bias` | ALiBi attention bias matrix |
| `rope_init_ntk_aware` | NTK-aware RoPE for context extension |
| `rope_init_yarn` | YaRN dimension-wise interpolation |
| `rope_init_linear` | Linear position interpolation |
| `relu2` | ReLU squared activation (Primer) |
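For reference, NumPy sketches of two of these encodings, sinusoidal additive PE and ALiBi slopes/bias; the GPU ops above work in place on device arrays, and the symmetric ALiBi bias below is a simplification:

```python
import numpy as np

def sinusoidal_encoding(seq_len, dim):
    # Classic sinusoidal positional encoding, added to Q/K (dim assumed even)
    pos = np.arange(seq_len)[:, None]
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = pos * inv_freq[None, :]
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def alibi_slopes(num_heads):
    # ALiBi head-specific slopes 2^(-8*i/num_heads); assumes num_heads is a power of two
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(num_heads, seq_len):
    # Bias added to attention logits: slope * -distance (symmetric, non-causal form)
    dist = np.abs(np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None])
    return alibi_slopes(num_heads)[:, None, None] * -dist
```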
Unified Benchmark Suite
New scripts/benchmark.py for comprehensive performance testing.
QAT/Pruning/Sparsity Config
Model config support for quantization-aware training, pruning, and sparsity patterns.
Optimized BF16 GEMV
Achieves 98-101% of peak memory bandwidth for typical LLM dimensions.
W8A16 GEMM Fix
Fixed MMA A-fragment register mapping. MoE models now produce correct output.
Full Changelog: v0.2.17...v0.2.18
v0.2.17
What's Changed
Full Changelog: v0.2.16...v0.2.17
v0.2.16 - MoE, Thinking Model, SM120 Kernels
What's New
Added
- MoE (Mixture of Experts): Full Mixtral support with TopK routing and grouped GEMM (see the routing sketch after this list)
- Thinking Model: Qwen3 `<think>...</think>` block parsing
- GEMV Kernels (SM120): FP8/FP8 (W8A8), NVF4/NVF4 (W4A4), Int4
- GEMM Kernels (SM120): W8A16, Int8 native (dp4a), Int4 via Int8, Grouped GEMM v2
- Claude Code Skills: Build, benchmark, lint, test automation
- Subagents: kernel-reviewer, perf-analyzer, api-designer
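A minimal NumPy sketch of the TopK routing referenced in the MoE item above; the per-token loop stands in for the grouped GEMM kernel, and all shapes are toy assumptions:

```python
import numpy as np

def topk_route(router_logits, k=2):
    # router_logits: (tokens, num_experts); pick top-k experts per token
    # and softmax-normalize their weights (Mixtral-style gating).
    topk_idx = np.argsort(router_logits, axis=-1)[:, -k:]  # (tokens, k)
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
    w = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return topk_idx, w

def moe_forward(x, router_logits, experts, k=2):
    # experts: list of (d_in, d_out) weight matrices; a grouped GEMM kernel
    # batches these per-expert matmuls instead of looping per token.
    idx, w = topk_route(router_logits, k)
    out = np.zeros((x.shape[0], experts[0].shape[1]))
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += w[t, j] * (x[t] @ experts[idx[t, j]])
    return out
```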
Changed
- Kernel directory restructure: `{gemm|gemv}/{input}/{output}/{arch}/`
- Removed redundant slow kernels (FP8 GEMV basic, Int8 via FP8)
Performance Highlights (RTX 5090)
| Kernel | Performance |
|---|---|
| W8A8 GEMV | 6-18x faster than BF16 |
| W8A16 GEMM | SM120 CUTLASS support |
| Grouped GEMM v2 | Per-row expert IDs for MoE |
See CHANGELOG.md for full details.
v0.2.15
What's Changed
Full Changelog: v0.2.14...v0.2.15
v0.2.14
Full Changelog: v0.2.13...v0.2.14
v0.2.13
Full Changelog: v0.2.12...v0.2.13
v0.2.12
What's Changed
Full Changelog: v0.2.11...v0.2.12