
Releases: m96-chan/PyGPUkit

v0.2.20: Fused NN Kernels + Flash Attention 3 SM120

26 Jan 16:06 · 224e6bb


Highlights

Fused NN Kernels

High-performance fused kernels with 2-14x speedups:

  • SwiGLU: silu(gate) * up (used in Qwen, LLaMA3, Mistral FFN)
  • RMSNorm+Residual: rmsnorm(x + residual) * gamma
  • GeGLU: gelu(gate) * up
Kernel            Batch  Speedup
SwiGLU            32     14.25x
RMSNorm+Residual  128    12.37x
GeGLU             32     13.10x
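
For reference, the unfused math these kernels compute, sketched in NumPy. The fused CUDA kernels produce the same result in a single launch; function names here are illustrative, not the library API.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swiglu(gate, up):
    return silu(gate) * up          # Qwen / LLaMA3 / Mistral FFN activation

def geglu(gate, up):
    return gelu(gate) * up

def rmsnorm_residual(x, residual, gamma, eps=1e-6):
    h = x + residual                # fused residual add ...
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return (h / rms) * gamma        # ... followed by RMSNorm scaling
```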

Flash Attention 3 SM120

  • TMA (Tensor Memory Accelerator) for efficient global memory access
  • Producer/consumer warp architecture for overlapped compute and memory ops
  • Tunable configurations for different sequence lengths
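
Flash Attention's exactness comes from an online-softmax recurrence; below is a minimal single-query NumPy sketch of that math. The TMA pipeline and producer/consumer warps in this release are about feeding this computation efficiently, not changing it.

```python
import numpy as np

def online_softmax_attention(q, K, V, tile=128):
    # Exact attention for one query vector, computed tile by tile over K/V
    # with a running max (m) and normalizer (l) -- the core Flash Attention
    # recurrence. Real kernels do this per query tile in shared memory.
    d = q.shape[-1]
    m = -np.inf                      # running max of scores
    l = 0.0                          # running softmax denominator
    acc = np.zeros_like(V[0])        # unnormalized output accumulator
    for s in range(0, K.shape[0], tile):
        scores = (K[s:s+tile] @ q) / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)    # rescale previous partial results
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[s:s+tile]
        m = m_new
    return acc / l
```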

FP8 Block-Scale MMA

  • Native PTX mma.sync for FP8 with per-block scaling
  • Enables future W8A8 quantized inference paths
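
Per-block scaling keeps one scale per small block along K so a single outlier cannot flatten the whole tensor's dynamic range. A NumPy emulation of the structure (values stay float32 here; the real kernels store FP8 e4m3 and fold the scales into the MMA epilogue; the block size of 128 is an illustrative choice, and K must divide evenly by it):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 e4m3 value

def block_scales(w, block=128):
    # One scale per `block` consecutive values along K, chosen so the
    # scaled values fit the FP8 e4m3 range.
    amax = np.abs(w.reshape(w.shape[0], -1, block)).max(axis=-1)
    return np.maximum(amax / E4M3_MAX, 1e-12)

def fp8_blockscale_matmul(a, w, block=128):
    # Emulates: quantize A (M, K) and W (N, K) per block, multiply in low
    # precision, then re-apply the per-block scale products while
    # accumulating in float32.
    M, K = a.shape
    N = w.shape[0]
    sa = block_scales(a, block)          # (M, K // block)
    sw = block_scales(w, block)          # (N, K // block)
    out = np.zeros((M, N), dtype=np.float32)
    for b in range(K // block):
        ks = slice(b * block, (b + 1) * block)
        qa = a[:, ks] / sa[:, b:b+1]     # values now within e4m3 range
        qw = w[:, ks] / sw[:, b:b+1]
        out += (qa @ qw.T) * np.outer(sa[:, b], sw[:, b])
    return out
```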

Other Improvements

  • Native Conv1d CUDA kernel
  • LLaMA 4 native CUDA kernels
  • Llama Guard 3 content safety classifier example

Full Changelog: v0.2.19...v0.2.20

v0.2.19 - FLUX.1 Image Generation

01 Jan 19:25 · 7adbe5f


Highlights

FLUX.1 Image Generation

Text-to-image generation with Black Forest Labs' FLUX.1 model:

  • Full FLUX.1-schnell transformer (19 joint + 38 single blocks)
  • Flow matching Euler scheduler
  • GPU-native operations (transpose, batched matmul, RoPE)
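
The Euler scheduler for flow matching is a first-order ODE integrator: each step moves the latent along the predicted velocity between consecutive noise levels. A minimal sketch, where `model` stands in for the FLUX transformer's velocity prediction:

```python
import numpy as np

def flow_match_euler_sample(model, x, sigmas):
    # Integrate the flow-matching ODE dx/dt = v(x, t) with Euler steps.
    # `sigmas` runs from 1.0 (pure noise) down to 0.0 (clean latent).
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = model(x, sigma)               # transformer predicts velocity
        x = x + (sigma_next - sigma) * v  # first-order (Euler) update
    return x

# e.g. a 4-step schnell-style schedule: sigmas = np.linspace(1.0, 0.0, 5)
```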

Lazy Model Loading with Streaming

Memory-efficient model loading strategies:

  • StreamingStrategy.EAGER - Load all at once (default)
  • StreamingStrategy.PROGRESSIVE - Load during first forward
  • StreamingStrategy.LAYER_BY_LAYER - Minimal memory usage
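
A hypothetical usage sketch: `load_model`, its `streaming` keyword, and the import path are placeholders for illustration; only the `StreamingStrategy` values come from this release.

```python
# Hypothetical API -- load_model and the streaming keyword are placeholders;
# only the StreamingStrategy values are from this release.
from pygpukit import StreamingStrategy, load_model  # illustrative import path

# EAGER: everything on GPU up front (default, fastest steady state)
model = load_model("some-llm", streaming=StreamingStrategy.EAGER)

# PROGRESSIVE: weights stream in during the first forward pass
model = load_model("some-llm", streaming=StreamingStrategy.PROGRESSIVE)

# LAYER_BY_LAYER: only the active layer resident, minimal memory
model = load_model("some-llm", streaming=StreamingStrategy.LAYER_BY_LAYER)
```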

cuBLAS Dynamic Loader

  • Runtime DLL loading without a compile-time CUDA Toolkit dependency
  • Auto-detection of cuBLASLt versions (13/12/11)
  • Graceful fallback to native kernels
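
The loader pattern itself is easy to sketch with `ctypes`. This is an illustrative reimplementation, not the library's code, using standard Linux sonames (Windows DLL names differ):

```python
import ctypes

def load_cublaslt():
    # Try the newest cuBLASLt ABI first, fall back to older ones, and
    # finally return None so callers can switch to native kernels.
    for name in ("libcublasLt.so.13", "libcublasLt.so.12", "libcublasLt.so.11"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None  # graceful fallback: use native kernels

lib = load_cublaslt()
backend = "cublasLt" if lib is not None else "native"
print(f"matmul backend: {backend}")
```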

C++ Kernel Profiler

  • Built-in CUDA kernel profiling with minimal overhead
  • Per-kernel timing statistics

HuggingFace T5 Encoder

  • Sharded safetensors support
  • Full T5 encoder for FLUX/SD3

DiT Architecture

  • PixArt transformer with AdaLN-Zero
  • Self/cross attention with GQA
  • GEGLU FFN

New GPU Operations

  • transpose_4d_0213, transpose_3d_012
  • gpu_batched_matmul, gpu_softmax, gpu_apply_rope
  • cross_attention, conv2d, group_norm
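
Assuming the digits in the transpose op names give the output axis order (an inference from the naming, not a documented contract), transpose_4d_0213 is the usual pre-attention head reshuffle. The NumPy equivalent:

```python
import numpy as np

# out[b, h, s, d] = in[b, s, h, d]: (batch, seq, heads, head_dim)
# -> (batch, heads, seq, head_dim), done before attention.
x = np.random.rand(2, 16, 8, 64)
y = np.ascontiguousarray(x.transpose(0, 2, 1, 3))
assert y.shape == (2, 8, 16, 64)
```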

Known Issues

  • FLUX.1 performance needs optimization (#187)

Full Changelog: v0.2.18...v0.2.19

v0.2.18

30 Dec 22:59


What's New in v0.2.18

Major Codebase Refactoring

Complete modularization of the codebase for better maintainability:

  • Split monolithic files into modular .inl components
  • Reorganized matmul kernel directory structure
  • Standardized GEMM/GEMV naming conventions
  • Modular pybind11 bindings

Kokoro-82M TTS

Text-to-speech synthesis with Japanese/English support.

Positional Encoding Operations

New neural network operations for attention mechanisms:

Function             Description
pope_init_encoding   Sinusoidal positional encoding (PoPE)
pope_inplace         Apply additive encoding to Q/K
alibi_init_slopes    ALiBi head-specific slopes
alibi_compute_bias   ALiBi attention bias matrix
rope_init_ntk_aware  NTK-aware RoPE for context extension
rope_init_yarn       YaRN dimension-wise interpolation
rope_init_linear     Linear position interpolation
relu2                ReLU squared activation (Primer)
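
For reference, the formulas behind two of these ops as defined in the ALiBi and NTK-aware RoPE literature; signatures are illustrative, not the library's:

```python
import numpy as np

def alibi_slopes(n_heads):
    # ALiBi head-specific slopes for a power-of-two head count:
    # slope_i = 2 ** (-8 * (i + 1) / n_heads), per the ALiBi paper.
    return np.array([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])

def alibi_bias(slope, n):
    # Additive attention bias: a -slope * |i - j| distance penalty
    # (causal kernels only keep the j <= i half).
    idx = np.arange(n)
    return -slope * np.abs(idx[:, None] - idx[None, :])

def ntk_rope_base(base, scale, head_dim):
    # NTK-aware RoPE stretches the frequency base instead of positions:
    # base' = base * scale ** (head_dim / (head_dim - 2)).
    return base * scale ** (head_dim / (head_dim - 2))
```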

Unified Benchmark Suite

New scripts/benchmark.py for comprehensive performance testing.

QAT/Pruning/Sparsity Config

Model config support for quantization-aware training, pruning, and sparsity patterns.

Optimized BF16 GEMV

Achieves 98-101% of peak memory bandwidth for typical LLM dimensions.
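
GEMV at LLM sizes is memory-bound, so the useful yardstick is bytes streamed per second rather than FLOPS. A quick sanity check with illustrative numbers:

```python
def gemv_bandwidth_gbs(n, k, elapsed_s, bytes_per_elem=2):
    # A BF16 GEMV y = W @ x is dominated by streaming the (n, k) weight
    # matrix once; x and y traffic is negligible at LLM sizes.
    return n * k * bytes_per_elem / elapsed_s / 1e9

# Example: an 8192x8192 BF16 GEMV finishing in 80 microseconds
bw = gemv_bandwidth_gbs(8192, 8192, 80e-6)
print(f"{bw:.0f} GB/s")  # ~1678 GB/s; compare against the GPU's datasheet peak
```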

W8A16 GEMM Fix

Fixed MMA A-fragment register mapping. MoE models now produce correct output.


Full Changelog: v0.2.17...v0.2.18

v0.2.17

28 Dec 12:47


What's Changed

Full Changelog: v0.2.16...v0.2.17

v0.2.16 - MoE, Thinking Model, SM120 Kernels

28 Dec 04:42


What's New

Added

  • MoE (Mixture of Experts): Full Mixtral support with TopK routing and grouped GEMM (see the routing sketch after this list)
  • Thinking Model: Qwen3 <think>...</think> block parsing
  • GEMV Kernels (SM120): FP8/FP8 (W8A8), NVF4/NVF4 (W4A4), Int4
  • GEMM Kernels (SM120): W8A16, Int8 native (dp4a), Int4 via Int8, Grouped GEMM v2
  • Claude Code Skills: Build, benchmark, lint, test automation
  • Subagents: kernel-reviewer, perf-analyzer, api-designer
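
A minimal NumPy sketch of the TopK routing step referenced above (Mixtral uses k=2). The grouped GEMM then runs each expert's assigned tokens as one batched matmul, with v2's per-row expert IDs naming the expert for every row:

```python
import numpy as np

def topk_route(logits, k=2):
    # Each token picks its k highest-scoring experts; the kept logits are
    # re-normalized with softmax so the selected experts' outputs are
    # convexly combined.
    topk = np.argsort(logits, axis=-1)[:, -k:]     # (tokens, k) expert ids
    picked = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(picked - picked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return topk, weights

logits = np.random.randn(6, 8)      # 6 tokens, 8 experts
experts, gates = topk_route(logits)
```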

Changed

  • Kernel directory restructure: {gemm|gemv}/{input}/{output}/{arch}/
  • Removed redundant slow kernels (FP8 GEMV basic, Int8 via FP8)

Performance Highlights (RTX 5090)

Kernel           Performance
W8A8 GEMV        6-18x faster than BF16
W8A16 GEMM       SM120 CUTLASS support
Grouped GEMM v2  Per-row expert IDs for MoE

See CHANGELOG.md for full details.

v0.2.15

26 Dec 09:25


What's Changed

  • v0.2.15: FP8 I/O GEMM, Pure NVF4 (446 TFLOPS), New Math Ops by @m96-chan in #117

Full Changelog: v0.2.14...v0.2.15

v0.2.14

23 Dec 06:20


Full Changelog: v0.2.13...v0.2.14

v0.2.13

23 Dec 05:46


Full Changelog: v0.2.12...v0.2.13

v0.2.12

22 Dec 17:48


What's Changed

  • feat(audio): add advanced audio processing kernels (v0.2.12) by @m96-chan in #99

Full Changelog: v0.2.11...v0.2.12

v0.2.11

22 Dec 12:04


What's Changed

  • v0.2.11: Driver API, Dual CUDA Build, RTX 5090 Support by @m96-chan in #94
  • chore: bump version to 0.2.11 by @m96-chan in #95

Full Changelog: v0.2.10...v0.2.11