Releases: m96-chan/PyGPUkit
v0.2.20: Fused NN Kernels + Flash Attention 3 SM120
Highlights
Fused NN Kernels
High-performance fused kernels with 2-14x speedups (reference formulas are sketched after the table below):
- SwiGLU: `silu(gate) * up` (used in Qwen, LLaMA3, and Mistral FFNs)
- RMSNorm+Residual: `rmsnorm(x + residual) * gamma`
- GeGLU: `gelu(gate) * up`
| Kernel | Batch | Speedup |
|---|---|---|
| SwiGLU | 32 | 14.25x |
| RMSNorm+Residual | 128 | 12.37x |
| GeGLU | 32 | 13.10x |
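For reference, a minimal NumPy sketch of the math these kernels fuse, written as unfused CPU code (the `eps` default below is an assumption, not the kernel's value):

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swiglu(gate, up):
    # SwiGLU: silu(gate) * up  (Qwen / LLaMA3 / Mistral FFN)
    return silu(gate) * up

def geglu(gate, up):
    # GeGLU: gelu(gate) * up
    return gelu(gate) * up

def rmsnorm_residual(x, residual, gamma, eps=1e-6):
    # Fused residual add + RMSNorm: rmsnorm(x + residual) * gamma
    h = x + residual
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)  # eps is an assumed default
    return (h / rms) * gamma
```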
Flash Attention 3 SM120
- TMA (Tensor Memory Accelerator) for efficient global memory access
- Producer/consumer warp architecture for overlapped compute and memory ops
- Tunable configurations for different sequence lengths
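The TMA and warp-specialization details live at the CUDA/PTX level; as an algorithmic reference only, the sketch below shows the tiled, online-softmax attention that a flash kernel computes, in NumPy (single head, block size 64, and no causal mask are assumptions). The result is numerically equal to `softmax(q @ k.T / sqrt(d)) @ v`:

```python
import numpy as np

def flash_attention_reference(q, k, v, block=64):
    # q, k, v: (seq_len, head_dim); process K/V in tiles while keeping a
    # running row max and row sum so softmax never needs the full score matrix.
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q, dtype=np.float64)
    row_max = np.full(n, -np.inf)
    row_sum = np.zeros(n)
    for start in range(0, n, block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        scores = (q @ kb.T) * scale                      # (n, tile)
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)           # rescale earlier partial sums
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]
```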
FP8 Block-Scale MMA
- Native PTX `mma.sync` for FP8 with per-block scaling
- Enables future W8A8 quantized inference paths
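A rough NumPy sketch of per-block scaling for FP8 weights; the block size of 128 and the E4M3 maximum of 448 are assumptions, and real kernels cast to e4m3 rather than clipping floats:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed max representable magnitude for E4M3

def quantize_blockwise_fp8(w, block=128):
    # Per-block scaling: each group of `block` columns gets its own scale so
    # values fit the FP8 dynamic range; dequantization is q * scale.
    rows, cols = w.shape
    q = np.zeros_like(w, dtype=np.float32)
    scales = np.zeros((rows, (cols + block - 1) // block), dtype=np.float32)
    for b, start in enumerate(range(0, cols, block)):
        blk = w[:, start:start + block]
        scale = np.abs(blk).max(axis=1, keepdims=True) / FP8_E4M3_MAX
        scale = np.maximum(scale, 1e-12)
        scales[:, b] = scale[:, 0]
        # emulate FP8 by clipping to its range (real kernels cast to e4m3)
        q[:, start:start + block] = np.clip(blk / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales
```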
Other Improvements
- Native Conv1d CUDA kernel
- LLaMA 4 native CUDA kernels
- Llama Guard 3 content safety classifier example
Full Changelog: v0.2.19...v0.2.20
v0.2.19 - FLUX.1 Image Generation
Highlights
FLUX.1 Image Generation
Text-to-image generation with Black Forest Labs' FLUX.1 model:
- Full FLUX.1-schnell transformer (19 joint + 38 single blocks)
- Flow matching Euler scheduler
- GPU-native operations (transpose, batched matmul, RoPE)
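A minimal sketch of the flow-matching Euler update the scheduler performs; the linear sigma schedule and placeholder velocity below are assumptions standing in for the FLUX transformer's prediction:

```python
import numpy as np

def euler_flow_matching_step(x_t, velocity, sigma, sigma_next):
    # One Euler step along the flow: move x by the predicted velocity
    # scaled by the change in noise level sigma.
    return x_t + (sigma_next - sigma) * velocity

# toy loop with a dummy velocity field
x = np.random.randn(4, 4)
sigmas = np.linspace(1.0, 0.0, 5)  # assumed linear sigma schedule
for s, s_next in zip(sigmas[:-1], sigmas[1:]):
    v = -x  # placeholder for the transformer's velocity prediction
    x = euler_flow_matching_step(x, v, s, s_next)
```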
Lazy Model Loading with Streaming
Memory-efficient model loading strategies:
- `StreamingStrategy.EAGER` - Load all weights at once (default)
- `StreamingStrategy.PROGRESSIVE` - Load during the first forward pass
- `StreamingStrategy.LAYER_BY_LAYER` - Minimal memory usage
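A toy, self-contained sketch of what the three strategies trade off; only the `StreamingStrategy` names come from this release, while the `LazyLayer` wrapper and `loader` callback are illustrative assumptions, not PyGPUkit API:

```python
from enum import Enum, auto

class StreamingStrategy(Enum):
    # names mirror the strategies shipped in this release
    EAGER = auto()           # load every tensor up front (default)
    PROGRESSIVE = auto()     # load tensors the first time a layer runs
    LAYER_BY_LAYER = auto()  # load one layer, run it, then free it

class LazyLayer:
    """Toy lazy-loading wrapper; `loader(name)` stands in for reading a shard."""
    def __init__(self, name, loader, strategy):
        self.name, self.loader, self.strategy = name, loader, strategy
        self.weights = loader(name) if strategy is StreamingStrategy.EAGER else None

    def forward(self, x):
        if self.weights is None:
            self.weights = self.loader(self.name)  # lazy load on first use
        y = x * self.weights                       # placeholder compute
        if self.strategy is StreamingStrategy.LAYER_BY_LAYER:
            self.weights = None                    # release to keep memory minimal
        return y
```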
cuBLAS Dynamic Loader
- Runtime DLL loading without compile-time CUDA Toolkit
- Auto-detection of cuBLASLt versions (13/12/11)
- Graceful fallback to native kernels
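A minimal ctypes sketch of the runtime-loading idea; the library file names for versions 13/12/11 are assumptions about typical cuBLASLt naming, and the real loader falls back to native kernels when none resolve:

```python
import ctypes

def load_cublaslt():
    # Try known cuBLASLt major versions at runtime, newest first; return None
    # so the caller can fall back to the native GEMM kernels instead.
    candidates = [
        "cublasLt64_13.dll", "cublasLt64_12.dll", "cublasLt64_11.dll",  # Windows
        "libcublasLt.so.13", "libcublasLt.so.12", "libcublasLt.so.11",  # Linux
    ]
    for name in candidates:
        try:
            return ctypes.CDLL(name)  # resolves from the system search path
        except OSError:
            continue
    return None  # graceful fallback: no cuBLASLt found

handle = load_cublaslt()
```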
C++ Kernel Profiler
- Built-in CUDA kernel profiling with minimal overhead
- Per-kernel timing statistics
HuggingFace T5 Encoder
- Sharded safetensors support
- Full T5 encoder for FLUX/SD3
DiT Architecture
- PixArt transformer with AdaLN-Zero
- Self/cross attention with GQA
- GEGLU FFN
New GPU Operations
`transpose_4d_0213`, `transpose_3d_012`, `gpu_batched_matmul`, `gpu_softmax`, `gpu_apply_rope`, `cross_attention`, `conv2d`, `group_norm`
Known Issues
- FLUX.1 performance needs optimization (#187)
Full Changelog: v0.2.18...v0.2.19
v0.2.18
What's New in v0.2.18
Major Codebase Refactoring
Complete modularization of the codebase for better maintainability:
- Split monolithic files into modular `.inl` components
- Reorganized matmul kernel directory structure
- Standardized GEMM/GEMV naming conventions
- Modular pybind11 bindings
Kokoro-82M TTS
Text-to-speech synthesis with Japanese/English support.
Positional Encoding Operations
New neural network operations for attention mechanisms:
| Function | Description |
|---|---|
| `pope_init_encoding` | Sinusoidal positional encoding (PoPE) |
| `pope_inplace` | Apply additive encoding to Q/K |
| `alibi_init_slopes` | ALiBi head-specific slopes |
| `alibi_compute_bias` | ALiBi attention bias matrix |
| `rope_init_ntk_aware` | NTK-aware RoPE for context extension |
| `rope_init_yarn` | YaRN dimension-wise interpolation |
| `rope_init_linear` | Linear position interpolation |
| `relu2` | ReLU squared activation (Primer) |
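For reference, NumPy sketches of two of these encodings, sinusoidal additive PE and ALiBi slopes/bias; the GPU ops above work in place on device arrays, and the symmetric ALiBi bias below is a simplification:

```python
import numpy as np

def sinusoidal_encoding(seq_len, dim):
    # Classic sinusoidal positional encoding, added to Q/K (dim assumed even)
    pos = np.arange(seq_len)[:, None]
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = pos * inv_freq[None, :]
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def alibi_slopes(num_heads):
    # ALiBi head-specific slopes 2^(-8*i/num_heads); assumes num_heads is a power of two
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(num_heads, seq_len):
    # Bias added to attention logits: slope * -distance (symmetric, non-causal form)
    dist = np.abs(np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None])
    return alibi_slopes(num_heads)[:, None, None] * -dist
```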
Unified Benchmark Suite
New scripts/benchmark.py for comprehensive performance testing.
QAT/Pruning/Sparsity Config
Model config support for quantization-aware training, pruning, and sparsity patterns.
Optimized BF16 GEMV
Achieves 98-101% of peak memory bandwidth for typical LLM dimensions.
W8A16 GEMM Fix
Fixed MMA A-fragment register mapping. MoE models now produce correct output.
Full Changelog: v0.2.17...v0.2.18
v0.2.17
What's Changed
Full Changelog: v0.2.16...v0.2.17
v0.2.16 - MoE, Thinking Model, SM120 Kernels
What's New
Added
- MoE (Mixture of Experts): Full Mixtral support with TopK routing and grouped GEMM (see the routing sketch after this list)
- Thinking Model: Qwen3 `<think>...</think>` block parsing
- GEMV Kernels (SM120): FP8/FP8 (W8A8), NVF4/NVF4 (W4A4), Int4
- GEMM Kernels (SM120): W8A16, Int8 native (dp4a), Int4 via Int8, Grouped GEMM v2
- Claude Code Skills: Build, benchmark, lint, test automation
- Subagents: kernel-reviewer, perf-analyzer, api-designer
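A minimal NumPy sketch of the TopK routing referenced in the MoE item above; the per-token loop stands in for the grouped GEMM kernel, and all shapes are toy assumptions:

```python
import numpy as np

def topk_route(router_logits, k=2):
    # router_logits: (tokens, num_experts); pick top-k experts per token
    # and softmax-normalize their weights (Mixtral-style gating).
    topk_idx = np.argsort(router_logits, axis=-1)[:, -k:]  # (tokens, k)
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
    w = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return topk_idx, w

def moe_forward(x, router_logits, experts, k=2):
    # experts: list of (d_in, d_out) weight matrices; a grouped GEMM kernel
    # batches these per-expert matmuls instead of looping per token.
    idx, w = topk_route(router_logits, k)
    out = np.zeros((x.shape[0], experts[0].shape[1]))
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += w[t, j] * (x[t] @ experts[idx[t, j]])
    return out
```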
Changed
- Kernel directory restructure: `{gemm|gemv}/{input}/{output}/{arch}/`
- Removed redundant slow kernels (FP8 GEMV basic, Int8 via FP8)
Performance Highlights (RTX 5090)
| Kernel | Performance |
|---|---|
| W8A8 GEMV | 6-18x faster than BF16 |
| W8A16 GEMM | SM120 CUTLASS support |
| Grouped GEMM v2 | Per-row expert IDs for MoE |
See CHANGELOG.md for full details.
v0.2.15
What's Changed
Full Changelog: v0.2.14...v0.2.15
v0.2.14
Full Changelog: v0.2.13...v0.2.14
v0.2.13
Full Changelog: v0.2.12...v0.2.13
v0.2.12
What's Changed
Full Changelog: v0.2.11...v0.2.12