WEI CHENG CHIU waynehacking8

Working on ML inference and agentic AI on the NVIDIA stack.

content-radar (Public)

Collect trending AI/dev signal from Hacker News, arXiv, GitHub Trending, Reddit, and X, then synthesize review-ready post drafts with Claude.

Python 2

gh-radar (Public)

Daily email digest of trending GitHub tools — GitHub Trending + Hacker News + new-repo search, no X API. Runs on GitHub Actions.

Python

tensor-core-from-scratch (Public)

From naive matmul to tensor cores on NVIDIA Blackwell — step by step. 8 self-contained CUDA kernels, each benchmarked against cuBLAS.

Cuda 1 1

blackwell-tensorcore-kernels (Public)

Hand-written CUDA Tensor Core GEMM kernels on Blackwell (sm_120) and Hopper (sm_90) — raw mma.sync reaching 106% of the cuBLAS-TC kernel on sm_120, CUTLASS 3.x wgmma at 85.5% of nvjet on H100, and …

Cuda

trtllm-triton-serving (Public)

TensorRT-LLM vs vLLM controlled head-to-head on H100 — 12 studies including a knob-by-knob waterfall reproducing NVIDIA's published 27.7k tok/s (100.3%) and attributing the gap to real serving, plu…

Python