Production-grade CUDA deep learning framework built from scratch.
Combines ideas from PyTorch, TensorRT, FlashAttention, and Triton in a single C++/CUDA/Python codebase.
- Custom GPU tensor engine (FP32/FP16/BF16/INT8)
- Caching CUDA memory allocator (like PyTorch's)
- Optimized CUDA kernels: elementwise, reductions, matmul, conv, softmax, layernorm
- Tensor Core acceleration via WMMA API (FP16 GEMM)
- FlashAttention-2 style attention kernel (O(N) memory)
- Dynamic automatic differentiation engine
- Full neural network API: Linear, Conv2d, BatchNorm, LayerNorm, RMSNorm, Embedding, etc.
- GPT/Transformer engine with multi-head attention
- Adam, AdamW, SGD, RMSprop optimizers
- LR schedulers: cosine annealing, warmup, 1-cycle
- NVRTC-based JIT CUDA kernel compiler
- Real-time GPU monitoring dashboard (FastAPI + WebSocket)
- Multi-GPU data parallelism via NCCL
- DataLoader with async prefetching
- Built-in MNIST, CIFAR-10 datasets
- Full Python API via pybind11
- C++ test suite and Python test suite
- Benchmark suite vs PyTorch
| Requirement | Version |
|---|---|
| CUDA Toolkit | >= 11.8 |
| CMake | >= 3.20 |
| C++ compiler | MSVC 2022 / GCC 11+ / Clang 14+ |
| Python | >= 3.9 |
| pybind11 | auto-fetched by CMake |
Windows: Visual Studio 2022 with "Desktop development with C++" + CUDA workload.
Linux: gcc, g++, CUDA toolkit.
pip install numpy tqdm
cd novatorch
pip install -e .
This calls CMake under the hood and installs the package in editable mode.
# From novatorch/
mkdir build && cd build
# Linux
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Windows (Visual Studio)
cmake .. -G "Visual Studio 17 2022" -A x64 -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release --parallel
The compiled _novatorch_C.pyd (Windows) or _novatorch_C.so (Linux) will be placed in novatorch/.
python -c "import novatorch as tc; print(tc.__version__); print('CUDA:', tc.cuda_available())"
Expected output:
0.1.0
CUDA: True
By default novatorch targets SM 75, 80, 86, 89, 90 (Turing through Hopper).
To target only your specific GPU (faster compile):
# RTX 3090 (SM 86)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=86
# RTX 4090 (SM 89)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=89
# A100 (SM 80)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=80
Or let CMake detect the architecture automatically via setup.py (which calls nvidia-smi).
import novatorch as tc
# Create tensors
x = tc.Tensor.randn([32, 512]) # random normal on CUDA
z = tc.Tensor.zeros([32, 512])
o = tc.Tensor.ones([32, 512])
a = tc.Tensor.from_numpy(np_array) # from numpy
# Math
y = x + z
y = x * 2.0
y = x.matmul(w) # or x @ w
y = x.bmm(w) # batched matmul
y = x.pow(2.0)
y = x.sqrt()
y = x.exp()
y = x.log()
# Reductions
s = x.sum()
m = x.mean(dim=1, keepdim=True)
mx = x.max(dim=-1)
# Reshape / view
y = x.reshape([16, -1])
y = x.transpose(0, 1)
y = x.permute([1, 0, 2])
y = x.unsqueeze(0)
y = x.flatten(1)
# Device transfer
x_cpu = x.cpu()
x_gpu = x.cuda(0)
np_arr = x.cpu().numpy()
# Autograd
x = tc.Tensor.randn([4, 8])
x.requires_grad = True
y = x.matmul(w) + b
loss = y.mean()
loss.backward()
print(x.grad)  # gradient w.r.t. x
import novatorch as tc
# Build a CNN
model = tc.Sequential()
model.add(tc.Conv2d(1, 32, 3, padding=1))
model.add(tc.BatchNorm2d(32))
model.add(tc.ReLU())
model.add(tc.MaxPool2d(2))
model.add(tc.Flatten())
model.add(tc.Linear(32*14*14, 10))
model.cuda()
model.train()
# Forward
x = tc.Tensor.randn([8, 1, 28, 28])
logits = model(x) # [8, 10]
# Loss + backward
criterion = tc.CrossEntropyLoss()
loss = criterion(logits, targets)
loss.backward()
# Optimizer
optimizer = tc.AdamW(model, lr=1e-3, weight_decay=0.01)
optimizer.step()
optimizer.zero_grad()
print(f"Loss: {loss.item():.4f}")
print(f"Parameters: {model.num_parameters():,}")
import novatorch as tc
cfg = tc.GPTConfig()
cfg.vocab_size = 50257
cfg.max_seq_len = 1024
cfg.d_model = 768
cfg.num_heads = 12
cfg.num_layers = 12
cfg.d_ff = 3072
cfg.use_flash_attn = True # FlashAttention-2 kernel
model = tc.GPT(cfg)
model.cuda()
print(f"GPT parameters: {model.num_params():,}")
# Training step
tokens = tc.Tensor.from_numpy(token_ids) # [B, S] float
targets = tc.Tensor.from_numpy(target_ids)
logits = model(tokens) # [B, S, V]
loss = model.compute_loss(logits, targets)
loss.backward()
optimizer.step()
mha = tc.MultiHeadAttention(
embed_dim=512,
num_heads=8,
dropout=0.1,
use_flash=True # use FlashAttention kernel
)
mha.cuda()
# x: [batch, seq_len, embed_dim]
x = tc.Tensor.randn([4, 128, 512])
out = mha(x, causal=True)  # causal mask for decoder
jit = tc.JITCompiler.instance()
x = tc.Tensor.randn([1024*1024])
b = tc.Tensor.ones([1024*1024]) * 0.5
# Compile and run fused kernel at runtime
out = jit.execute("relu(add(x, b))", {"x": x, "b": b})
out = jit.execute("gelu(mul(x, 2.0))", {"x": x})
out = jit.execute("sigmoid(add(mul(x, 2.0), b))", {"x": x, "b": b})
# Kernels are cached - the second call uses the compiled version
optimizer = tc.AdamW(model, lr=3e-4, weight_decay=0.01)
from novatorch.optim import LinearWarmupCosineDecay
scheduler = LinearWarmupCosineDecay(optimizer,
warmup_steps=100,
total_steps=10000,
min_lr=1e-5)
for step in range(10000):
    optimizer.zero_grad()
    loss = compute_loss(...)
    loss.backward()
    # Gradient clipping
    # tc.functional.clip_grad_norm(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    print(f"LR: {scheduler.get_last_lr():.2e}")
from novatorch.data import MNISTDataset, CIFAR10Dataset, DataLoader
# MNIST
train_ds = MNISTDataset(root="./data", train=True, download=True)
loader = DataLoader(train_ds, batch_size=64, shuffle=True,
device_str="cuda", num_prefetch=2)
for images, labels in loader:
# images: tc.Tensor [B, 1, 28, 28] on CUDA
# labels: tc.Tensor [B] on CUDA
out = model(images)
loss = criterion(out, labels)
    ...
tc.Profiler.instance().enable()
# ... run training ...
tc.Profiler.instance().print_report()
# Output:
# Kernel Calls Total(ms) Avg(ms) BW(GB/s)
# -----------------------------------------------------------------
# sgemm_optimized 200 24.150 0.121 890.2
# softmax_kernel 200 3.200 0.016 412.0
# layernorm_fwd_kernel          400       1.800     0.005    1200.0
python examples/mnist_cnn.py
Expected: ~99% accuracy in 5 epochs, ~2000 samples/s on an RTX 3090.
python examples/gpt_lm.py
Trains a 6-layer GPT on a character-level corpus and generates text samples.
python examples/jit_demo.py
Compiles and runs fused CUDA kernels at runtime via NVRTC.
pip install torch # optional, for comparison
python examples/benchmark.py
# Build first
cmake --build build --target test_tensor --config Release
# Run
./build/Release/test_tensor.exe # Windows
./build/test_tensor              # Linux
python tests/test_framework.py
or with pytest:
pip install pytest
pytest tests/test_framework.py -v
Start the real-time GPU monitoring dashboard:
pip install fastapi uvicorn websockets pynvml
python dashboard/server.py
Open http://localhost:8000 to see:
- Live GPU utilization and VRAM usage
- Training loss curve
- Throughput (samples/sec)
- CUDA kernel timing table
- Temperature and power draw
Run training in a separate terminal - the dashboard will pick up metrics from novatorch's built-in profiler automatically.
novatorch/
|-- core/
| |-- dtype.h # FP32/FP16/BF16/INT8 type system
| |-- device.h # CUDA device management
| |-- tensor.h # Tensor class (header)
| `-- tensor.cu # Tensor implementation (CUDA)
|-- cuda/
| |-- memory_pool.h # Caching CUDA allocator
| |-- streams.h # CUDA stream/event RAII wrappers
| `-- kernels/
| |-- elementwise.cuh # add/mul/relu/gelu/dropout/cast/random
| |-- reduction.cuh # sum/max/softmax/layernorm/batchnorm
| |-- matmul.cuh # SGEMM + Tensor Core WMMA HGEMM + batched
| |-- conv.cuh # Conv2d/depthwise/maxpool/avgpool
| `-- flash_attention.cuh # FlashAttention-2 FP32 + FP16
|-- autograd/
| |-- node.h # AutogradNode base + Edge + AccumulateGrad
| |-- functions.h # Backward functions for all ops
| |-- engine.h # Backward engine interface
| `-- engine.cpp # Topological sort + reverse AD execution
|-- nn/
| |-- module.h # Module base + Sequential + ModuleList
| |-- layers/
| | |-- linear.h # Linear, Bilinear
| | |-- conv.h # Conv1d/2d/Transpose, MaxPool, AvgPool
| | |-- normalization.h # LayerNorm, BatchNorm2d, GroupNorm, RMSNorm
| | `-- embedding.h # Embedding
| |-- activations/
| | `-- activations.h # ReLU/GELU/SiLU/Sigmoid/Tanh/Dropout/Flatten
| |-- loss/
| | `-- loss.h # MSE/L1/BCE/CrossEntropy/Focal/KLDiv/Huber
| `-- optimizers/
| `-- optimizer.h # SGD/Adam/AdamW/RMSprop + LR schedulers
|-- transformer/
| |-- attention.h # ScaledDotProduct, MultiHeadAttention, FeedForward
| | # TransformerEncoder/DecoderLayer, SinusoidalPE, RoPE
| `-- gpt.h # GPTBlock, GPTConfig, GPT model
|-- compiler/
| `-- jit.h # NVRTC JIT compiler + expression parser + code gen
|-- profiler/
| `-- profiler.h # CUDA profiler with kernel timing + GPU metrics
|-- distributed/
| `-- nccl_comm.h # NCCL multi-GPU ProcessGroup + DataParallel
|-- novatorch/ # Python package
| |-- __init__.py # Main public API
| |-- nn/__init__.py
| |-- nn/modules.py # High-level Python nn composites
| |-- optim/__init__.py
| |-- optim/schedulers.py # Python LR schedulers
| `-- data/
| |-- dataloader.py # DataLoader with async prefetch
| `-- datasets.py # MNIST, CIFAR-10, RandomDataset
|-- bindings/
| `-- python_bindings.cpp # Full pybind11 module
|-- dashboard/
| |-- server.py # FastAPI + WebSocket dashboard server
| `-- requirements.txt
|-- examples/
| |-- mnist_cnn.py # CNN training on MNIST
| |-- gpt_lm.py # Tiny GPT language model
| |-- jit_demo.py # JIT kernel compiler demo
| `-- benchmark.py # Benchmark vs PyTorch
|-- tests/
| |-- test_framework.py # Python test suite
| `-- test_tensor_cpp.cpp # C++ test suite
|-- CMakeLists.txt
|-- setup.py
`-- README.md
Mirrors PyTorch's caching allocator. Freed blocks are kept in a free list keyed by size. New allocations first check the free list for a best-fit block before calling cudaMalloc. This eliminates most allocation overhead during training.
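The free-list lookup can be sketched in a few lines of Python. This is purely illustrative (the real allocator lives in cuda/memory_pool.h, manages actual cudaMalloc blocks, and also handles splitting and stream safety); `backend_alloc` stands in for cudaMalloc:

```python
import bisect

class CachingAllocator:
    """Best-fit caching allocator sketch: freed blocks are kept for reuse."""

    def __init__(self, backend_alloc):
        self.backend_alloc = backend_alloc   # stands in for cudaMalloc
        self.free_blocks = []                # sorted list of (size, handle)

    def alloc(self, size):
        # Best fit: smallest cached block that is at least `size` bytes.
        i = bisect.bisect_left(self.free_blocks, (size,))
        if i < len(self.free_blocks):
            _, handle = self.free_blocks.pop(i)
            return handle                    # cache hit: no backend call
        return self.backend_alloc(size)      # cache miss: real allocation

    def free(self, size, handle):
        # Keep the block in the free list instead of returning it to the driver.
        bisect.insort(self.free_blocks, (size, handle))
```

A freed 100-byte block will satisfy a later 64-byte request without touching the backend, which is exactly why steady-state training loops stop paying cudaMalloc costs.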
Dynamic graph (define-by-run), as in PyTorch. Each op creates an AutogradNode subclass with a saved context. backward() performs a topological sort via sequence numbers and executes gradient functions in reverse order. Reference counting handles DAGs (shared subexpressions) correctly.
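The mechanism can be illustrated with a scalar-only sketch (assumed names; the real engine in autograd/engine.cpp works on tensors and C++ nodes): each op records its parents and a backward function, creation order doubles as a valid topological order, and gradients accumulate with `+=` so shared parents in a DAG are handled correctly.

```python
class Var:
    """Scalar define-by-run autograd sketch (illustrative only)."""
    _seq = 0

    def __init__(self, value, parents=()):
        self.value, self.parents, self.backward_fn = value, parents, None
        self.grad = 0.0
        Var._seq += 1
        self.seq = Var._seq          # creation order = topological order

    def __mul__(self, other):
        out = Var(self.value * other.value, (self, other))
        out.backward_fn = lambda g: [(self, g * other.value),
                                     (other, g * self.value)]
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, (self, other))
        out.backward_fn = lambda g: [(self, g), (other, g)]
        return out

    def backward(self):
        # Collect the reachable graph, then walk it in reverse creation order.
        nodes, seen = [], set()
        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for p in v.parents:
                visit(p)
            nodes.append(v)
        visit(self)
        self.grad = 1.0
        for v in sorted(nodes, key=lambda v: -v.seq):
            if v.backward_fn:
                for parent, g in v.backward_fn(v.grad):
                    parent.grad += g   # accumulation handles shared parents

x = Var(3.0)
y = x * x + x                # dy/dx = 2x + 1
y.backward()
```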
Implements the FA-2 algorithm: splits Q/K/V into tiles that fit in SRAM, runs attention within tiles with online softmax (running max + sum), never materializing the full N x N attention matrix. Memory is O(N) instead of O(N^2).
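The online-softmax trick at the heart of this is small enough to show in plain Python: a running max and running sum let each tile be visited exactly once, and earlier partial sums are rescaled whenever a larger max appears. (Sketch only; the kernel additionally folds the final normalization into its output rescaling.)

```python
import math

def online_softmax(scores, tile=4):
    """Softmax computed tile-by-tile with a running max and running sum."""
    m, s = float("-inf"), 0.0
    for start in range(0, len(scores), tile):
        block = scores[start:start + tile]
        m_new = max(m, max(block))
        # Rescale the old partial sum to the new max, then add this tile.
        s = s * math.exp(m - m_new) + sum(math.exp(v - m_new) for v in block)
        m = m_new
    return [math.exp(v - m) / s for v in scores]
```

The result is identical to an ordinary softmax, but the attention kernel only ever holds one tile of scores at a time, which is where the O(N) memory comes from.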
Parses expression strings like "relu(add(x, b))" into an AST, generates inline CUDA device code, compiles it via NVRTC to PTX, and loads the result through the CUDA driver API. Compiled kernels are cached by expression string.
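The parsing step can be sketched with a tiny recursive-descent parser for this call-style grammar (a sketch only; the actual parser and code generator live in compiler/jit.h, and scalar literals like "2.0" stay as string leaves here):

```python
def parse(expr):
    """Parse "relu(add(x, b))" into nested tuples: ("relu", ("add", "x", "b"))."""
    expr = expr.replace(" ", "")
    node, rest = _parse(expr)
    assert not rest, f"trailing input: {rest!r}"
    return node

def _parse(s):
    # Read an identifier or numeric literal.
    i = 0
    while i < len(s) and (s[i].isalnum() or s[i] in "._"):
        i += 1
    head, rest = s[:i], s[i:]
    if not rest.startswith("("):
        return head, rest                  # leaf: tensor name or scalar
    args, rest = [], rest[1:]              # function call: parse the arguments
    while True:
        arg, rest = _parse(rest)
        args.append(arg)
        if rest.startswith(","):
            rest = rest[1:]
        elif rest.startswith(")"):
            return (head, *args), rest[1:]
```

From such an AST, code generation is a straightforward recursive walk that emits one fused device expression, which is why the whole string can serve as the cache key.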
hgemm_tensor_core uses the WMMA API (nvcuda::wmma) for 16x16x16 matrix fragment operations, providing a 4-8x speedup over the regular FP32 GEMM on Volta/Turing/Ampere/Ada GPUs.
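The tiling that WMMA imposes is just a blocked GEMM: the output is covered by tiles, and each tile is produced by a sequence of small fragment multiply-accumulates along K. A pure-Python sketch of that decomposition (illustrative only; the real kernel uses 16x16x16 FP16 fragments on Tensor Cores):

```python
def blocked_matmul(A, B, tile=2):
    """C = A @ B computed tile-by-tile, mirroring the WMMA fragment layout."""
    M, K, N = len(A), len(A[0]), len(B[0])   # dims must be multiples of `tile`
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):             # fragment row of C
        for j0 in range(0, N, tile):         # fragment column of C
            for k0 in range(0, K, tile):     # accumulate along K
                # One fragment MMA: C_tile += A_tile @ B_tile
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, k0 + tile))
    return C
```

On hardware, each inner tile product is a single wmma::mma_sync instead of a scalar loop, which is where the speedup comes from.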
ModuleNotFoundError: No module named 'novatorch._novatorch_C'
- Build the C extension first: pip install -e .
- Check the build/ directory for compile errors
- Verify the CUDA toolkit is on PATH
CUDA architecture mismatch
- Error like "no kernel image for this device"
- Set -DCMAKE_CUDA_ARCHITECTURES=XX to match your GPU
NVRTC not found
- Install the CUDA toolkit and ensure nvrtc64_*.dll is on PATH
- The JIT compiler requires CUDA >= 11.0
Out of memory
- Call tc.empty_cache() to release cached allocations
- Reduce the batch size
- Use FP16: model.to(tc.DType.float16)
pybind11 not found
- CMake will auto-download it via FetchContent
- Or: pip install pybind11, then cmake .. -Dpybind11_DIR=$(python -c "import pybind11; print(pybind11.get_cmake_dir())")
MIT License - see LICENSE file.