novatorch

Production-grade CUDA deep learning framework built from scratch.

Combines ideas from PyTorch, TensorRT, FlashAttention, and Triton in a single C++/CUDA/Python codebase.


Features

  • Custom GPU tensor engine (FP32/FP16/BF16/INT8)
  • Caching CUDA memory allocator (like PyTorch's)
  • Optimized CUDA kernels: elementwise, reductions, matmul, conv, softmax, layernorm
  • Tensor Core acceleration via WMMA API (FP16 GEMM)
  • FlashAttention-2 style attention kernel (O(N) memory)
  • Dynamic automatic differentiation engine
  • Full neural network API: Linear, Conv2d, BatchNorm, LayerNorm, RMSNorm, Embedding, etc.
  • GPT/Transformer engine with multi-head attention
  • Adam, AdamW, SGD, RMSprop optimizers
  • LR schedulers: cosine annealing, warmup, 1-cycle
  • NVRTC-based JIT CUDA kernel compiler
  • Real-time GPU monitoring dashboard (FastAPI + WebSocket)
  • Multi-GPU data parallelism via NCCL
  • DataLoader with async prefetching
  • Built-in MNIST, CIFAR-10 datasets
  • Full Python API via pybind11
  • C++ test suite and Python test suite
  • Benchmark suite vs PyTorch

Requirements

Requirement     Version
--------------  -------------------------------
CUDA Toolkit    >= 11.8
CMake           >= 3.20
C++ compiler    MSVC 2022 / GCC 11+ / Clang 14+
Python          >= 3.9
pybind11        auto-fetched by CMake

Windows: Visual Studio 2022 with "Desktop development with C++" + CUDA workload.

Linux: gcc, g++, CUDA toolkit.


Quick Build

1. Install Python dependencies

pip install numpy tqdm

2. Build with pip (recommended)

cd novatorch
pip install -e .

This calls CMake under the hood and installs the package in editable mode.

3. Manual CMake build

# From novatorch/
mkdir build && cd build

# Linux
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Windows (Visual Studio)
cmake .. -G "Visual Studio 17 2022" -A x64 -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release --parallel

The compiled _novatorch_C.pyd (Windows) or _novatorch_C.so (Linux) will be placed in novatorch/.

4. Verify the build

python -c "import novatorch as tc; print(tc.__version__); print('CUDA:', tc.cuda_available())"

Expected output:

0.1.0
CUDA: True

GPU Architecture Targeting

By default novatorch targets SM 75, 80, 86, 89, 90 (Turing through Hopper).

To target only your specific GPU (faster compile):

# RTX 3090 (SM 86)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=86

# RTX 4090 (SM 89)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=89

# A100 (SM 80)
cmake .. -DCMAKE_CUDA_ARCHITECTURES=80

Alternatively, setup.py detects your GPU's architecture automatically (it queries nvidia-smi).
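The detection step can be sketched as follows (an illustrative sketch, not the actual setup.py logic; `detect_cuda_arch` and its fallback list are hypothetical names, and `--query-gpu=compute_cap` requires a reasonably recent driver):

```python
import subprocess

def detect_cuda_arch(default="75;80;86;89;90"):
    """Query the local GPU's compute capability via nvidia-smi.
    Falls back to the multi-arch default when no GPU is visible."""
    try:
        cap = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            text=True).splitlines()[0].strip()
        return cap.replace(".", "")   # e.g. "8.6" -> "86"
    except (OSError, subprocess.CalledProcessError, IndexError):
        return default

print(detect_cuda_arch())
```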


Python API

Tensor operations

import novatorch as tc

# Create tensors
x = tc.Tensor.randn([32, 512])     # random normal on CUDA
z = tc.Tensor.zeros([32, 512])
o = tc.Tensor.ones([32, 512])
a = tc.Tensor.from_numpy(np_array) # from numpy

# Math
y = x + z
y = x * 2.0
y = x.matmul(w)    # or  x @ w
y = x.bmm(w)       # batched matmul
y = x.pow(2.0)
y = x.sqrt()
y = x.exp()
y = x.log()

# Reductions
s = x.sum()
m = x.mean(dim=1, keepdim=True)
mx = x.max(dim=-1)

# Reshape / view
y = x.reshape([16, -1])
y = x.transpose(0, 1)
y = x.permute([1, 0, 2])
y = x.unsqueeze(0)
y = x.flatten(1)

# Device transfer
x_cpu = x.cpu()
x_gpu = x.cuda(0)
np_arr = x.cpu().numpy()

# Autograd
x = tc.Tensor.randn([4, 8])
w = tc.Tensor.randn([8, 4])
b = tc.Tensor.zeros([4])
x.requires_grad = True
y = x.matmul(w) + b
loss = y.mean()
loss.backward()
print(x.grad)   # gradient w.r.t. x

Neural Network

import novatorch as tc

# Build a CNN
model = tc.Sequential()
model.add(tc.Conv2d(1, 32, 3, padding=1))
model.add(tc.BatchNorm2d(32))
model.add(tc.ReLU())
model.add(tc.MaxPool2d(2))
model.add(tc.Flatten())
model.add(tc.Linear(32*14*14, 10))

model.cuda()
model.train()

# Forward
x = tc.Tensor.randn([8, 1, 28, 28])
logits = model(x)            # [8, 10]

# Loss + backward (targets: class-index tensor of shape [8])
criterion = tc.CrossEntropyLoss()
loss = criterion(logits, targets)
loss.backward()

# Optimizer
optimizer = tc.AdamW(model, lr=1e-3, weight_decay=0.01)
optimizer.step()
optimizer.zero_grad()

print(f"Loss: {loss.item():.4f}")
print(f"Parameters: {model.num_parameters():,}")

GPT / Transformer

import novatorch as tc

cfg = tc.GPTConfig()
cfg.vocab_size  = 50257
cfg.max_seq_len = 1024
cfg.d_model     = 768
cfg.num_heads   = 12
cfg.num_layers  = 12
cfg.d_ff        = 3072
cfg.use_flash_attn = True   # FlashAttention-2 kernel

model = tc.GPT(cfg)
model.cuda()

print(f"GPT parameters: {model.num_params():,}")

# Training step (optimizer created as in the previous section)
tokens   = tc.Tensor.from_numpy(token_ids)   # [B, S] float
targets  = tc.Tensor.from_numpy(target_ids)
logits   = model(tokens)                     # [B, S, V]
loss     = model.compute_loss(logits, targets)
loss.backward()
optimizer.step()

Multi-Head Attention (standalone)

mha = tc.MultiHeadAttention(
    embed_dim=512,
    num_heads=8,
    dropout=0.1,
    use_flash=True    # use FlashAttention kernel
)
mha.cuda()

# x: [batch, seq_len, embed_dim]
x   = tc.Tensor.randn([4, 128, 512])
out = mha(x, causal=True)   # causal mask for decoder

JIT CUDA Compiler

jit = tc.JITCompiler.instance()

x = tc.Tensor.randn([1024*1024])
b = tc.Tensor.ones([1024*1024]) * 0.5

# Compile and run fused kernel at runtime
out = jit.execute("relu(add(x, b))", {"x": x, "b": b})
out = jit.execute("gelu(mul(x, 2.0))", {"x": x})
out = jit.execute("sigmoid(add(mul(x, 2.0), b))", {"x": x, "b": b})
# Kernels are cached - second call uses the compiled version

Optimizers + LR Schedulers

optimizer = tc.AdamW(model, lr=3e-4, weight_decay=0.01)

from novatorch.optim import LinearWarmupCosineDecay
scheduler = LinearWarmupCosineDecay(optimizer,
    warmup_steps=100,
    total_steps=10000,
    min_lr=1e-5)

for step in range(10000):
    optimizer.zero_grad()
    loss = compute_loss(...)
    loss.backward()

    # Gradient clipping
    # tc.functional.clip_grad_norm(model.parameters(), max_norm=1.0)

    optimizer.step()
    scheduler.step()
    print(f"LR: {scheduler.get_last_lr():.2e}")
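The schedule above follows the usual linear-warmup-then-cosine-decay formula. A standalone sketch of the math (the exact implementation in novatorch.optim may differ in details such as off-by-one step handling):

```python
import math

def warmup_cosine_lr(step, base_lr=3e-4, warmup_steps=100,
                     total_steps=10_000, min_lr=1e-5):
    """Linear warmup from ~0 to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(warmup_cosine_lr(0))       # early in warmup: tiny LR
print(warmup_cosine_lr(100))     # end of warmup: base_lr
print(warmup_cosine_lr(10_000))  # end of schedule: min_lr
```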

DataLoader

from novatorch.data import MNISTDataset, CIFAR10Dataset, DataLoader

# MNIST
train_ds = MNISTDataset(root="./data", train=True, download=True)
loader   = DataLoader(train_ds, batch_size=64, shuffle=True,
                       device_str="cuda", num_prefetch=2)

for images, labels in loader:
    # images: tc.Tensor [B, 1, 28, 28] on CUDA
    # labels: tc.Tensor [B] on CUDA
    out  = model(images)
    loss = criterion(out, labels)
    ...

Profiler

tc.Profiler.instance().enable()

# ... run training ...

tc.Profiler.instance().print_report()
# Output:
# Kernel                         Calls    Total(ms)     Avg(ms)     BW(GB/s)
# ------------------------------------------------------------------------
# sgemm_optimized                  200       24.150       0.121        890.2
# softmax_kernel                   200        3.200       0.016        412.0
# layernorm_fwd_kernel             400        1.800       0.005       1200.0

Examples

Train CNN on MNIST

python examples/mnist_cnn.py

Expected: ~99% accuracy in 5 epochs, ~2000 samples/s on RTX 3090.

Train tiny GPT

python examples/gpt_lm.py

Trains a 6-layer GPT on a character-level corpus. Generates text samples.

JIT demo

python examples/jit_demo.py

Compiles and runs fused CUDA kernels at runtime via NVRTC.

Benchmark vs PyTorch

pip install torch  # optional, for comparison
python examples/benchmark.py

Tests

C++ tests

# Build first
cmake --build build --target test_tensor --config Release

# Run
./build/Release/test_tensor.exe   # Windows
./build/test_tensor               # Linux

Python tests

python tests/test_framework.py

or with pytest:

pip install pytest
pytest tests/test_framework.py -v

Dashboard

Start the real-time GPU monitoring dashboard:

pip install fastapi uvicorn websockets pynvml
python dashboard/server.py

Open http://localhost:8000 to see:

  • Live GPU utilization and VRAM usage
  • Training loss curve
  • Throughput (samples/sec)
  • CUDA kernel timing table
  • Temperature and power draw

Run training in a separate terminal; the dashboard picks up metrics from novatorch's built-in profiler automatically.


Project Structure

novatorch/
|-- core/
|   |-- dtype.h              # FP32/FP16/BF16/INT8 type system
|   |-- device.h             # CUDA device management
|   |-- tensor.h             # Tensor class (header)
|   `-- tensor.cu            # Tensor implementation (CUDA)
|-- cuda/
|   |-- memory_pool.h        # Caching CUDA allocator
|   |-- streams.h            # CUDA stream/event RAII wrappers
|   `-- kernels/
|       |-- elementwise.cuh  # add/mul/relu/gelu/dropout/cast/random
|       |-- reduction.cuh    # sum/max/softmax/layernorm/batchnorm
|       |-- matmul.cuh       # SGEMM + Tensor Core WMMA HGEMM + batched
|       |-- conv.cuh         # Conv2d/depthwise/maxpool/avgpool
|       `-- flash_attention.cuh  # FlashAttention-2 FP32 + FP16
|-- autograd/
|   |-- node.h               # AutogradNode base + Edge + AccumulateGrad
|   |-- functions.h          # Backward functions for all ops
|   |-- engine.h             # Backward engine interface
|   `-- engine.cpp           # Topological sort + reverse AD execution
|-- nn/
|   |-- module.h             # Module base + Sequential + ModuleList
|   |-- layers/
|   |   |-- linear.h         # Linear, Bilinear
|   |   |-- conv.h           # Conv1d/2d/Transpose, MaxPool, AvgPool
|   |   |-- normalization.h  # LayerNorm, BatchNorm2d, GroupNorm, RMSNorm
|   |   `-- embedding.h      # Embedding
|   |-- activations/
|   |   `-- activations.h    # ReLU/GELU/SiLU/Sigmoid/Tanh/Dropout/Flatten
|   |-- loss/
|   |   `-- loss.h           # MSE/L1/BCE/CrossEntropy/Focal/KLDiv/Huber
|   `-- optimizers/
|       `-- optimizer.h      # SGD/Adam/AdamW/RMSprop + LR schedulers
|-- transformer/
|   |-- attention.h          # ScaledDotProduct, MultiHeadAttention, FeedForward
|   |                        # TransformerEncoder/DecoderLayer, SinusoidalPE, RoPE
|   `-- gpt.h                # GPTBlock, GPTConfig, GPT model
|-- compiler/
|   `-- jit.h                # NVRTC JIT compiler + expression parser + code gen
|-- profiler/
|   `-- profiler.h           # CUDA profiler with kernel timing + GPU metrics
|-- distributed/
|   `-- nccl_comm.h          # NCCL multi-GPU ProcessGroup + DataParallel
|-- novatorch/               # Python package
|   |-- __init__.py          # Main public API
|   |-- nn/__init__.py
|   |-- nn/modules.py        # High-level Python nn composites
|   |-- optim/__init__.py
|   |-- optim/schedulers.py  # Python LR schedulers
|   `-- data/
|       |-- dataloader.py    # DataLoader with async prefetch
|       `-- datasets.py      # MNIST, CIFAR-10, RandomDataset
|-- bindings/
|   `-- python_bindings.cpp  # Full pybind11 module
|-- dashboard/
|   |-- server.py            # FastAPI + WebSocket dashboard server
|   `-- requirements.txt
|-- examples/
|   |-- mnist_cnn.py         # CNN training on MNIST
|   |-- gpt_lm.py            # Tiny GPT language model
|   |-- jit_demo.py          # JIT kernel compiler demo
|   `-- benchmark.py         # Benchmark vs PyTorch
|-- tests/
|   |-- test_framework.py    # Python test suite
|   `-- test_tensor_cpp.cpp  # C++ test suite
|-- CMakeLists.txt
|-- setup.py
`-- README.md

Architecture Notes

Memory Allocator

Mirrors PyTorch's caching allocator. Freed blocks are kept in a free list keyed by size. New allocations first check the free list for a best-fit block before calling cudaMalloc. This eliminates most allocation overhead during training.
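The reuse strategy can be modeled in a few lines of Python (a toy sketch of the policy only; the real allocator manages raw device pointers, splitting, and stream safety):

```python
import bisect

class CachingAllocator:
    """Toy caching allocator: freed blocks go to a size-sorted free list
    and are reused (best fit) before the driver is asked for new memory."""
    def __init__(self):
        self.free_blocks = []    # sorted list of (size, block_id)
        self.next_id = 0
        self.driver_allocs = 0   # number of simulated cudaMalloc calls

    def alloc(self, size):
        # Best fit: smallest cached block whose size >= requested size.
        i = bisect.bisect_left(self.free_blocks, (size, -1))
        if i < len(self.free_blocks):
            return self.free_blocks.pop(i)
        self.driver_allocs += 1
        self.next_id += 1
        return (size, self.next_id)

    def free(self, block):
        bisect.insort(self.free_blocks, block)

pool = CachingAllocator()
a = pool.alloc(1024)
pool.free(a)
b = pool.alloc(512)        # reuses the cached 1024-byte block
print(pool.driver_allocs)  # -> 1: the second alloc never hit the driver
```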

Autograd Engine

Dynamic graph (define-by-run), like PyTorch. Each op creates an AutogradNode subclass with a saved context. backward() performs a topological sort via sequence numbers and executes gradients in reverse order. Reference counting handles DAGs correctly.
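A minimal Python sketch of the idea (illustration only; the real C++ engine uses reference counting and a worklist rather than an explicit sort over all nodes):

```python
class Node:
    """Define-by-run autograd node: creation order doubles as a valid
    topological key, as in the sequence-number scheme described above."""
    _seq = 0
    def __init__(self, value, parents=(), backward_fns=()):
        self.value, self.parents, self.backward_fns = value, parents, backward_fns
        self.grad = 0.0
        Node._seq += 1
        self.seq = Node._seq

def mul(a, b):  # d(ab)/da = b, d(ab)/db = a
    return Node(a.value * b.value, (a, b),
                (lambda g: g * b.value, lambda g: g * a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (lambda g: g, lambda g: g))

def _collect(node, seen=None):
    seen = set() if seen is None else seen
    if node not in seen:
        seen.add(node)
        for p in node.parents:
            _collect(p, seen)
    return seen

def backward(root):
    # Reverse creation order is a topological order for the dynamic DAG;
    # gradients accumulate correctly even when a node has multiple uses.
    root.grad = 1.0
    for node in sorted(_collect(root), key=lambda n: n.seq, reverse=True):
        for parent, fn in zip(node.parents, node.backward_fns):
            parent.grad += fn(node.grad)

x = Node(3.0)
y = add(mul(x, x), x)   # y = x^2 + x, so dy/dx = 2x + 1
backward(y)
print(x.grad)           # -> 7.0
```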

FlashAttention

Implements the FA-2 algorithm: splits Q/K/V into tiles that fit in SRAM, runs attention within tiles with online softmax (running max + sum), never materializing the full N x N attention matrix. Memory is O(N) instead of O(N^2).
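The key trick is the online softmax update. This sketch shows how the running max and rescaled running sum reproduce the exact softmax denominator without ever seeing the full row at once (scalar Python for clarity; the kernel does the same per attention-row, fused with the V accumulation):

```python
import math

def online_softmax_stats(scores, tile=4):
    """Stream over scores tile by tile, keeping only a running max m and
    a running sum s of exp(score - m); rescale s whenever m increases."""
    m, s = float("-inf"), 0.0
    for start in range(0, len(scores), tile):
        block = scores[start:start + tile]
        m_new = max(m, max(block))
        s = s * math.exp(m - m_new) + sum(math.exp(v - m_new) for v in block)
        m = m_new
    return m, s

scores = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2]
m, s = online_softmax_stats(scores)

# Matches the ordinary two-pass softmax denominator over the full row:
full = sum(math.exp(v - max(scores)) for v in scores)
print(abs(s - full))   # -> ~0.0
```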

JIT Compiler

Parses expression strings like "relu(add(x, b))" into an AST, generates inline CUDA device code, compiles it via NVRTC to PTX, and loads it via the CUDA driver API. Compiled kernels are cached by expression string.

Tensor Cores

hgemm_tensor_core uses the WMMA API (nvcuda::wmma) for 16x16x16 matrix fragment operations. Provides 4-8x speedup over regular FP32 GEMM on Volta/Turing/Ampere/Ada GPUs.
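The fragment-level decomposition looks like this in plain Python (an illustrative sketch only; the real kernel executes each inner tile product on Tensor Cores through nvcuda::wmma fragments rather than scalar loops):

```python
def tiled_matmul(A, B, T=16):
    """Multiply M x K by K x N in T x T x T tiles: the same decomposition
    a WMMA kernel performs with 16x16x16 matrix fragments."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, T):
        for j0 in range(0, N, T):
            for k0 in range(0, K, T):
                # One "fragment" multiply-accumulate: C_tile += A_tile @ B_tile
                for i in range(i0, min(i0 + T, M)):
                    for j in range(j0, min(j0 + T, N)):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(k0, min(k0 + T, K)))
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(tiled_matmul(A, B))   # -> [[19.0, 22.0], [43.0, 50.0]]
```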


Troubleshooting

ModuleNotFoundError: No module named 'novatorch._novatorch_C'

  • Build the C extension first: pip install -e .
  • Check build/ directory for compile errors
  • Verify CUDA toolkit is on PATH

CUDA architecture mismatch

  • Error like "no kernel image for this device"
  • Set -DCMAKE_CUDA_ARCHITECTURES=XX matching your GPU

NVRTC not found

  • Install CUDA toolkit, ensure nvrtc64_*.dll is on PATH
  • JIT compiler requires CUDA >= 11.0

Out of memory

  • Call tc.empty_cache() to release cached allocations
  • Reduce batch size
  • Use FP16: model.to(tc.DType.float16)

pybind11 not found

  • CMake will auto-download it via FetchContent
  • Or: pip install pybind11 then cmake .. -Dpybind11_DIR=$(python -c "import pybind11; print(pybind11.get_cmake_dir())")

License

MIT License - see LICENSE file.
