Production-grade CUDA deep learning framework built from scratch.
Combines ideas from PyTorch, TensorRT, FlashAttention, and Triton in a single C++/CUDA/Python codebase.
- Custom GPU tensor engine (FP32/FP16/BF16/INT8)
- Caching CUDA memory allocator (like PyTorch's)
- Optimized CUDA kernels: elementwise, reductions, matmul, conv, softmax, layernorm
- Tensor Core acceleration via WMMA API (FP16 GEMM)
- FlashAttention-2 style attention kernel (O(N) memory)
- Dynamic automatic differentiation engine
- Full neural network API: Linear, Conv2d, BatchNorm, LayerNorm, RMSNorm, Embedding, etc.
- GPT/Transformer engine with multi-head attention
- Adam, AdamW, SGD, RMSprop optimizers
- LR schedulers: cosine annealing, warmup, 1-cycle
- NVRTC-based JIT CUDA kernel compiler
- Real-time GPU monitoring dashboard (FastAPI + WebSocket)
- Multi-GPU data parallelism via NCCL
- DataLoader with async prefetching
- Built-in MNIST, CIFAR-10 datasets
- Full Python API via pybind11
- C++ test suite and Python test suite
- Benchmark suite vs PyTorch
| Requirement | Version |
|---|---|
| CUDA Toolkit | >= 11.8 |
| CMake | >= 3.20 |
| C++ compiler | MSVC 2022 / GCC 11+ / Clang 14+ |
| Python | >= 3.9 |
| pybind11 | auto-fetched by CMake |
Windows: Visual Studio 2022 with "Desktop development with C++" + CUDA workload.
Linux: gcc, g++, CUDA toolkit.
pip install numpy tqdm
cd novatorch
pip install -e .
This calls CMake under the hood and installs the package in editable mode.
# From novatorch/
mkdir build && cd build
# Linux
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Windows (Visual Studio)
cmake .. -G "Visual Studio 17 2022" -A x64 -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release --parallel
The compiled _novatorch_C.pyd (Windows) or _novatorch_C.so (Linux) will be placed in novatorch/.
python -c "import novatorch as tc; print(tc.__version__); print('CUDA:', tc.cuda_available())"
Expected output:
0.1.0
CUDA: True
By default novatorch targets SM 75, 80, 86, 89, 90 (Turing through Hopper).
To target only your specific GPU (faster compile):
# RTX 3090 (SM 86)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=86
# RTX 4090 (SM 89)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=89
# A100 (SM 80)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=80
Or let CMake detect the architecture automatically via setup.py (which calls nvidia-smi).
import novatorch as tc
# Create tensors
x = tc.Tensor.randn([32, 512]) # random normal on CUDA
z = tc.Tensor.zeros([32, 512])
o = tc.Tensor.ones([32, 512])
a = tc.Tensor.from_numpy(np_array) # from numpy
# Math
y = x + z
y = x * 2.0
y = x.matmul(w) # or x @ w
y = x.bmm(w) # batched matmul
y = x.pow(2.0)
y = x.sqrt()
y = x.exp()
y = x.log()
# Reductions
s = x.sum()
m = x.mean(dim=1, keepdim=True)
mx = x.max(dim=-1)
# Reshape / view
y = x.reshape([16, -1])
y = x.transpose(0, 1)
y = x.permute([1, 0, 2])
y = x.unsqueeze(0)
y = x.flatten(1)
# Device transfer
x_cpu = x.cpu()
x_gpu = x.cuda(0)
np_arr = x.cpu().numpy()
# Autograd
x = tc.Tensor.randn([4, 8])
x.requires_grad = True
y = x.matmul(w) + b
loss = y.mean()
loss.backward()
print(x.grad)  # gradient w.r.t. x
import novatorch as tc
# Build a CNN
model = tc.Sequential()
model.add(tc.Conv2d(1, 32, 3, padding=1))
model.add(tc.BatchNorm2d(32))
model.add(tc.ReLU())
model.add(tc.MaxPool2d(2))
model.add(tc.Flatten())
model.add(tc.Linear(32*14*14, 10))
model.cuda()
model.train()
# Forward
x = tc.Tensor.randn([8, 1, 28, 28])
logits = model(x) # [8, 10]
# Loss + backward
criterion = tc.CrossEntropyLoss()
loss = criterion(logits, targets)
loss.backward()
# Optimizer
optimizer = tc.AdamW(model, lr=1e-3, weight_decay=0.01)
optimizer.step()
optimizer.zero_grad()
print(f"Loss: {loss.item():.4f}")
print(f"Parameters: {model.num_parameters():,}")
import novatorch as tc
cfg = tc.GPTConfig()
cfg.vocab_size = 50257
cfg.max_seq_len = 1024
cfg.d_model = 768
cfg.num_heads = 12
cfg.num_layers = 12
cfg.d_ff = 3072
cfg.use_flash_attn = True # FlashAttention-2 kernel
model = tc.GPT(cfg)
model.cuda()
print(f"GPT parameters: {model.num_params():,}")
# Training step
tokens = tc.Tensor.from_numpy(token_ids) # [B, S] float
targets = tc.Tensor.from_numpy(target_ids)
logits = model(tokens) # [B, S, V]
loss = model.compute_loss(logits, targets)
loss.backward()
optimizer.step()
mha = tc.MultiHeadAttention(
embed_dim=512,
num_heads=8,
dropout=0.1,
use_flash=True # use FlashAttention kernel
)
mha.cuda()
# x: [batch, seq_len, embed_dim]
x = tc.Tensor.randn([4, 128, 512])
out = mha(x, causal=True)  # causal mask for decoder
jit = tc.JITCompiler.instance()
x = tc.Tensor.randn([1024*1024])
b = tc.Tensor.ones([1024*1024]) * 0.5
# Compile and run fused kernel at runtime
out = jit.execute("relu(add(x, b))", {"x": x, "b": b})
out = jit.execute("gelu(mul(x, 2.0))", {"x": x})
out = jit.execute("sigmoid(add(mul(x, 2.0), b))", {"x": x, "b": b})
# Kernels are cached - the second call uses the compiled version
optimizer = tc.AdamW(model, lr=3e-4, weight_decay=0.01)
from novatorch.optim import LinearWarmupCosineDecay
scheduler = LinearWarmupCosineDecay(optimizer,
warmup_steps=100,
total_steps=10000,
min_lr=1e-5)
for step in range(10000):
    optimizer.zero_grad()
    loss = compute_loss(...)
    loss.backward()
    # Gradient clipping
    # tc.functional.clip_grad_norm(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    print(f"LR: {scheduler.get_last_lr():.2e}")
from novatorch.data import MNISTDataset, CIFAR10Dataset, DataLoader
# MNIST
train_ds = MNISTDataset(root="./data", train=True, download=True)
loader = DataLoader(train_ds, batch_size=64, shuffle=True,
device_str="cuda", num_prefetch=2)
for images, labels in loader:
# images: tc.Tensor [B, 1, 28, 28] on CUDA
# labels: tc.Tensor [B] on CUDA
out = model(images)
loss = criterion(out, labels)
    ...
tc.Profiler.instance().enable()
# ... run training ...
tc.Profiler.instance().print_report()
# Output:
# Kernel Calls Total(ms) Avg(ms) BW(GB/s)
# -----------------------------------------------------------------
# sgemm_optimized 200 24.150 0.121 890.2
# softmax_kernel 200 3.200 0.016 412.0
# layernorm_fwd_kernel          400       1.800     0.005    1200.0
python examples/mnist_cnn.py
Expected: ~99% accuracy in 5 epochs, ~2000 samples/s on an RTX 3090.
python examples/gpt_lm.py
Trains a 6-layer GPT on a character-level corpus and generates text samples.
python examples/jit_demo.py
Compiles and runs fused CUDA kernels at runtime via NVRTC.
pip install torch # optional, for comparison
python examples/benchmark.py
# Build first
cmake --build build --target test_tensor --config Release
# Run
./build/Release/test_tensor.exe # Windows
./build/test_tensor              # Linux
python tests/test_framework.py
or with pytest:
pip install pytest
pytest tests/test_framework.py -v
Start the real-time GPU monitoring dashboard:
pip install fastapi uvicorn websockets pynvml
python dashboard/server.py
Open http://localhost:8000 to see:
- Live GPU utilization and VRAM usage
- Training loss curve
- Throughput (samples/sec)
- CUDA kernel timing table
- Temperature and power draw
Run training in a separate terminal - the dashboard will pick up metrics from novatorch's built-in profiler automatically.
novatorch/
|-- core/
| |-- dtype.h # FP32/FP16/BF16/INT8 type system
| |-- device.h # CUDA device management
| |-- tensor.h # Tensor class (header)
| `-- tensor.cu # Tensor implementation (CUDA)
|-- cuda/
| |-- memory_pool.h # Caching CUDA allocator
| |-- streams.h # CUDA stream/event RAII wrappers
| `-- kernels/
| |-- elementwise.cuh # add/mul/relu/gelu/dropout/cast/random
| |-- reduction.cuh # sum/max/softmax/layernorm/batchnorm
| |-- matmul.cuh # SGEMM + Tensor Core WMMA HGEMM + batched
| |-- conv.cuh # Conv2d/depthwise/maxpool/avgpool
| `-- flash_attention.cuh # FlashAttention-2 FP32 + FP16
|-- autograd/
| |-- node.h # AutogradNode base + Edge + AccumulateGrad
| |-- functions.h # Backward functions for all ops
| |-- engine.h # Backward engine interface
| `-- engine.cpp # Topological sort + reverse AD execution
|-- nn/
| |-- module.h # Module base + Sequential + ModuleList
| |-- layers/
| | |-- linear.h # Linear, Bilinear
| | |-- conv.h # Conv1d/2d/Transpose, MaxPool, AvgPool
| | |-- normalization.h # LayerNorm, BatchNorm2d, GroupNorm, RMSNorm
| | `-- embedding.h # Embedding
| |-- activations/
| | `-- activations.h # ReLU/GELU/SiLU/Sigmoid/Tanh/Dropout/Flatten
| |-- loss/
| | `-- loss.h # MSE/L1/BCE/CrossEntropy/Focal/KLDiv/Huber
| `-- optimizers/
| `-- optimizer.h # SGD/Adam/AdamW/RMSprop + LR schedulers
|-- transformer/
| |-- attention.h # ScaledDotProduct, MultiHeadAttention, FeedForward
| | # TransformerEncoder/DecoderLayer, SinusoidalPE, RoPE
| `-- gpt.h # GPTBlock, GPTConfig, GPT model
|-- compiler/
| `-- jit.h # NVRTC JIT compiler + expression parser + code gen
|-- profiler/
| `-- profiler.h # CUDA profiler with kernel timing + GPU metrics
|-- distributed/
| `-- nccl_comm.h # NCCL multi-GPU ProcessGroup + DataParallel
|-- novatorch/ # Python package
| |-- __init__.py # Main public API
| |-- nn/__init__.py
| |-- nn/modules.py # High-level Python nn composites
| |-- optim/__init__.py
| |-- optim/schedulers.py # Python LR schedulers
| `-- data/
| |-- dataloader.py # DataLoader with async prefetch
| `-- datasets.py # MNIST, CIFAR-10, RandomDataset
|-- bindings/
| `-- python_bindings.cpp # Full pybind11 module
|-- dashboard/
| |-- server.py # FastAPI + WebSocket dashboard server
| `-- requirements.txt
|-- examples/
| |-- mnist_cnn.py # CNN training on MNIST
| |-- gpt_lm.py # Tiny GPT language model
| |-- jit_demo.py # JIT kernel compiler demo
| `-- benchmark.py # Benchmark vs PyTorch
|-- tests/
| |-- test_framework.py # Python test suite
| `-- test_tensor_cpp.cpp # C++ test suite
|-- CMakeLists.txt
|-- setup.py
`-- README.md
Mirrors PyTorch's caching allocator. Freed blocks are kept in a free list keyed by size. New allocations first check the free list for a best-fit block before calling cudaMalloc. This eliminates most allocation overhead during training.
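The free-list lookup can be sketched in a few lines of Python. This is purely illustrative (the real allocator lives in cuda/memory_pool.h, manages actual cudaMalloc blocks, and also handles splitting and stream safety); `backend_alloc` stands in for cudaMalloc:

```python
import bisect

class CachingAllocator:
    """Best-fit caching allocator sketch: freed blocks are kept for reuse."""

    def __init__(self, backend_alloc):
        self.backend_alloc = backend_alloc   # stands in for cudaMalloc
        self.free_blocks = []                # sorted list of (size, handle)

    def alloc(self, size):
        # Best fit: smallest cached block that is at least `size` bytes.
        i = bisect.bisect_left(self.free_blocks, (size,))
        if i < len(self.free_blocks):
            _, handle = self.free_blocks.pop(i)
            return handle                    # cache hit: no backend call
        return self.backend_alloc(size)      # cache miss: real allocation

    def free(self, size, handle):
        # Keep the block in the free list instead of returning it to the driver.
        bisect.insort(self.free_blocks, (size, handle))
```

A freed 100-byte block will satisfy a later 64-byte request without touching the backend, which is exactly why steady-state training loops stop paying cudaMalloc costs.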
Dynamic graph (define-by-run), as in PyTorch. Each op creates an AutogradNode subclass with a saved context. backward() performs a topological sort via sequence numbers and executes gradient functions in reverse order. Reference counting handles DAGs (shared subexpressions) correctly.
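The mechanism can be illustrated with a scalar-only sketch (assumed names; the real engine in autograd/engine.cpp works on tensors and C++ nodes): each op records its parents and a backward function, creation order doubles as a valid topological order, and gradients accumulate with `+=` so shared parents in a DAG are handled correctly.

```python
class Var:
    """Scalar define-by-run autograd sketch (illustrative only)."""
    _seq = 0

    def __init__(self, value, parents=()):
        self.value, self.parents, self.backward_fn = value, parents, None
        self.grad = 0.0
        Var._seq += 1
        self.seq = Var._seq          # creation order = topological order

    def __mul__(self, other):
        out = Var(self.value * other.value, (self, other))
        out.backward_fn = lambda g: [(self, g * other.value),
                                     (other, g * self.value)]
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, (self, other))
        out.backward_fn = lambda g: [(self, g), (other, g)]
        return out

    def backward(self):
        # Collect the reachable graph, then walk it in reverse creation order.
        nodes, seen = [], set()
        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for p in v.parents:
                visit(p)
            nodes.append(v)
        visit(self)
        self.grad = 1.0
        for v in sorted(nodes, key=lambda v: -v.seq):
            if v.backward_fn:
                for parent, g in v.backward_fn(v.grad):
                    parent.grad += g   # accumulation handles shared parents

x = Var(3.0)
y = x * x + x                # dy/dx = 2x + 1
y.backward()
```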
Implements the FA-2 algorithm: splits Q/K/V into tiles that fit in SRAM, runs attention within tiles with online softmax (running max + sum), never materializing the full N x N attention matrix. Memory is O(N) instead of O(N^2).
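The online-softmax trick at the heart of this is small enough to show in plain Python: a running max and running sum let each tile be visited exactly once, and earlier partial sums are rescaled whenever a larger max appears. (Sketch only; the kernel additionally folds the final normalization into its output rescaling.)

```python
import math

def online_softmax(scores, tile=4):
    """Softmax computed tile-by-tile with a running max and running sum."""
    m, s = float("-inf"), 0.0
    for start in range(0, len(scores), tile):
        block = scores[start:start + tile]
        m_new = max(m, max(block))
        # Rescale the old partial sum to the new max, then add this tile.
        s = s * math.exp(m - m_new) + sum(math.exp(v - m_new) for v in block)
        m = m_new
    return [math.exp(v - m) / s for v in scores]
```

The result is identical to an ordinary softmax, but the attention kernel only ever holds one tile of scores at a time, which is where the O(N) memory comes from.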
Parses expression strings like "relu(add(x, b))" into an AST, generates inline CUDA device code, compiles it via NVRTC to PTX, and loads the result through the CUDA driver API. Compiled kernels are cached by expression string.
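The parsing step can be sketched with a tiny recursive-descent parser for this call-style grammar (a sketch only; the actual parser and code generator live in compiler/jit.h, and scalar literals like "2.0" stay as string leaves here):

```python
def parse(expr):
    """Parse "relu(add(x, b))" into nested tuples: ("relu", ("add", "x", "b"))."""
    expr = expr.replace(" ", "")
    node, rest = _parse(expr)
    assert not rest, f"trailing input: {rest!r}"
    return node

def _parse(s):
    # Read an identifier or numeric literal.
    i = 0
    while i < len(s) and (s[i].isalnum() or s[i] in "._"):
        i += 1
    head, rest = s[:i], s[i:]
    if not rest.startswith("("):
        return head, rest                  # leaf: tensor name or scalar
    args, rest = [], rest[1:]              # function call: parse the arguments
    while True:
        arg, rest = _parse(rest)
        args.append(arg)
        if rest.startswith(","):
            rest = rest[1:]
        elif rest.startswith(")"):
            return (head, *args), rest[1:]
```

From such an AST, code generation is a straightforward recursive walk that emits one fused device expression, which is why the whole string can serve as the cache key.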
hgemm_tensor_core uses the WMMA API (nvcuda::wmma) for 16x16x16 matrix fragment operations, providing a 4-8x speedup over the regular FP32 GEMM on Volta/Turing/Ampere/Ada GPUs.
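The tiling that WMMA imposes is just a blocked GEMM: the output is covered by tiles, and each tile is produced by a sequence of small fragment multiply-accumulates along K. A pure-Python sketch of that decomposition (illustrative only; the real kernel uses 16x16x16 FP16 fragments on Tensor Cores):

```python
def blocked_matmul(A, B, tile=2):
    """C = A @ B computed tile-by-tile, mirroring the WMMA fragment layout."""
    M, K, N = len(A), len(A[0]), len(B[0])   # dims must be multiples of `tile`
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):             # fragment row of C
        for j0 in range(0, N, tile):         # fragment column of C
            for k0 in range(0, K, tile):     # accumulate along K
                # One fragment MMA: C_tile += A_tile @ B_tile
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, k0 + tile))
    return C
```

On hardware, each inner tile product is a single wmma::mma_sync instead of a scalar loop, which is where the speedup comes from.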
ModuleNotFoundError: No module named 'novatorch._novatorch_C'
- Build the C extension first: pip install -e .
- Check the build/ directory for compile errors
- Verify the CUDA toolkit is on PATH
CUDA architecture mismatch
- Error like "no kernel image for this device"
- Set -DCMAKE_CUDA_ARCHITECTURES=XX to match your GPU
NVRTC not found
- Install the CUDA toolkit and ensure nvrtc64_*.dll is on PATH
- The JIT compiler requires CUDA >= 11.0
Out of memory
- Call tc.empty_cache() to release cached allocations
- Reduce the batch size
- Use FP16: model.to(tc.DType.float16)
pybind11 not found
- CMake will auto-download it via FetchContent
- Or: pip install pybind11, then cmake .. -Dpybind11_DIR=$(python -c "import pybind11; print(pybind11.get_cmake_dir())")
MIT License - see LICENSE file.