A performance-oriented implementation, written in C and CUDA, of training a simple multilayer perceptron (MLP) on the MNIST dataset.
The project focuses on understanding the performance characteristics of different matrix multiplication backends, ranging from naive CPU implementations to custom CUDA kernels and cuBLAS.
The codebase implements the entire training pipeline from scratch, including tensor operations, forward and backward propagation, loss computation, and optimization.
The goal of the project is to explore low-level performance engineering in machine learning systems.
- Fully from-scratch MLP training pipeline
- No high-level ML frameworks
- Multiple backends for matrix multiplication
- CPU and GPU implementations
- Custom tiled CUDA kernels
- cuBLAS implementation
- Benchmark comparison across implementations
- For CPU: OpenBLAS, OpenMP (gcc-15 on macOS)
- For CUDA: CUDA toolkit, cuBLAS
apt update
apt install libopenblas-dev pkg-config
brew install openblas pkg-config
sh download_mnist.sh
make cpu
make
Basic usage:
./test <mode> [stochastic] [num_epoch] [lr] [batch_size]
| Mode | Description |
|---|---|
| 0 | naive matrix multiplication (CPU) |
| 1 | custom CPU tiled matrix multiplication |
| 2 | CPU BLAS matrix multiplication |
| 11 | custom GPU tiled matrix multiplication |
| 12 | cuBLAS multiplication |
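To illustrate the difference between modes 0 and 1, here is a minimal sketch of a naive and a cache-tiled matrix multiplication on the CPU. The function names, the row-major layout, and the `TILE` size are assumptions for illustration and are not taken from the project's sources.

```c
#include <stddef.h>

#define TILE 32  /* hypothetical tile size; tune for the target cache */

/* Naive reference: C (M x N) = A (M x K) * B (K x N), all row-major. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t M, size_t K, size_t N)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled variant: same result, but works on TILE x TILE blocks so that
 * the touched parts of A, B and C are more likely to stay in cache. */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t M, size_t K, size_t N)
{
    for (size_t i = 0; i < M * N; i++)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < M; i0 += TILE)
        for (size_t k0 = 0; k0 < K; k0 += TILE)
            for (size_t j0 = 0; j0 < N; j0 += TILE) {
                size_t i1 = min_sz(i0 + TILE, M);
                size_t k1 = min_sz(k0 + TILE, K);
                size_t j1 = min_sz(j0 + TILE, N);
                for (size_t i = i0; i < i1; i++)
                    for (size_t k = k0; k < k1; k++) {
                        float a = A[i * K + k];
                        /* Innermost loop walks B and C contiguously. */
                        for (size_t j = j0; j < j1; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

Both functions take the same arguments, so either can be called as `matmul_tiled(A, B, C, M, K, N)`. The tiled version handles dimensions that are not multiples of `TILE` by clamping each block edge.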
If optional arguments are not provided, the following defaults are used:
| Parameter | Default |
|---|---|
| batch_size | 500 |
| num_epoch | 5 |
| lr | 0.1 |
| stochastic | 0 |
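The positional arguments and their defaults could be handled roughly as in the sketch below. The `run_config` struct and `parse_args` function are hypothetical names, and the project's actual parsing may differ (for example, in how it validates input).

```c
#include <stdlib.h>

/* Hypothetical container for the command-line settings. */
typedef struct {
    int   mode;
    int   stochastic;
    int   num_epoch;
    float lr;
    int   batch_size;
} run_config;

/* Parses: <mode> [stochastic] [num_epoch] [lr] [batch_size]
 * Missing optional arguments fall back to the documented defaults.
 * Returns 0 on success, -1 if the required mode is missing.
 * Note: atoi/atof do not report malformed input. */
int parse_args(int argc, char **argv, run_config *cfg)
{
    if (argc < 2)
        return -1;
    cfg->mode       = atoi(argv[1]);
    cfg->stochastic = argc > 2 ? atoi(argv[2]) : 0;
    cfg->num_epoch  = argc > 3 ? atoi(argv[3]) : 5;
    cfg->lr         = argc > 4 ? (float)atof(argv[4]) : 0.1f;
    cfg->batch_size = argc > 5 ? atoi(argv[5]) : 500;
    return 0;
}
```

With this sketch, `./test 11` would run mode 11 with all defaults, while `./test 2 1 10 0.5 256` would override every optional value.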
The model is a simple fully connected multilayer perceptron trained on MNIST.
Architecture:
784 -> Hidden (default 128) -> Hidden (default 256) -> 10
Training is performed using mini-batch gradient descent with cross-entropy loss.
All forward and backward operations are implemented manually.
- src/cpu/: CPU backend implementations
- src/cuda/: CUDA backend implementations
- include/: Header files
- data/: MNIST data
- poc/: Proof of concept scripts
The project expects the MNIST dataset in the following format:
- data/mnist/train-images-idx3-ubyte
- data/mnist/train-labels-idx1-ubyte
- data/mnist/t10k-images-idx3-ubyte
- data/mnist/t10k-labels-idx1-ubyte
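These are IDX files: each starts with a header of big-endian 32-bit integers (for image files, magic number `0x00000803`, then image count, rows, and columns), followed by raw pixel bytes. A minimal sketch of reading the image header is shown below; `idx3_header` and `read_idx3_header` are hypothetical names, not the project's loader.

```c
#include <stdint.h>
#include <stdio.h>

/* Reads one big-endian 32-bit integer; returns 0 on success. */
static int read_be32(FILE *f, uint32_t *out)
{
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4)
        return -1;
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 0;
}

/* Hypothetical header struct for an idx3-ubyte image file. */
typedef struct {
    uint32_t count;  /* number of images */
    uint32_t rows;   /* pixels per column */
    uint32_t cols;   /* pixels per row */
} idx3_header;

/* Returns 0 on success, -1 on a short read or an unexpected magic number. */
int read_idx3_header(FILE *f, idx3_header *h)
{
    uint32_t magic;
    if (read_be32(f, &magic) != 0 || magic != 0x00000803u)
        return -1;
    if (read_be32(f, &h->count) != 0 ||
        read_be32(f, &h->rows)  != 0 ||
        read_be32(f, &h->cols)  != 0)
        return -1;
    return 0;
}
```

After a successful call on a file opened with `fopen(path, "rb")`, the next `count * rows * cols` bytes are the pixel values, one unsigned byte each. Label files (`idx1-ubyte`) use magic number `0x00000801` and have only a count field.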
Default parameters:
- batch_size = 500
- num_epoch = 5
- lr = 0.1
| Mode | Time (s) | Accuracy |
|---|---|---|
| 0 | 7.18 | 93.67% |
| 1 | 3.54 | 93.65% |
| 2 | 1.11 | 93.60% |
| Mode | Time (s) | Accuracy |
|---|---|---|
| 11 | 0.112 | 94.11% |
| 11 (stochastic) | 0.125 | 94.11% |
| 12 | 0.153 | 94.05% |
| 12 (stochastic) | 0.202 | 94.11% |
Under these conditions, the custom tiled CUDA matrix multiplication (mode 11, 0.112 s) outperforms cuBLAS (mode 12, 0.153 s).
Key implementation aspects include:
- tiled shared-memory CUDA matrix multiplication
- custom tensor abstraction
- softmax and loss reduction kernels
- reuse of cached addresses within kernels
- pluggable backend design via operator table
- CUDA stream-based asynchronous data transfer
- CUDA stream-based parallel parameter updates
- pinned host memory buffers
The project is designed to make it easy to compare performance between different implementations.
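One way to get a pluggable backend of this kind is a table of function pointers indexed by mode. The sketch below is an assumption about the general shape, not the project's actual operator table: `backend_ops`, `select_backend`, and both kernels are hypothetical, and the second entry is only a loop-reordered stand-in for a real tiled kernel.

```c
#include <stddef.h>

/* Signature shared by every matmul backend: C (M x N) = A (M x K) * B (K x N). */
typedef void (*matmul_fn)(const float *A, const float *B, float *C,
                          size_t M, size_t K, size_t N);

/* Hypothetical operator table entry; a real one would carry more ops. */
typedef struct {
    int         mode;
    const char *name;
    matmul_fn   matmul;
} backend_ops;

static void cpu_naive_matmul(const float *A, const float *B, float *C,
                             size_t M, size_t K, size_t N)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* Stand-in second backend: same math with i-k-j loop order. */
static void cpu_reordered_matmul(const float *A, const float *B, float *C,
                                 size_t M, size_t K, size_t N)
{
    for (size_t i = 0; i < M * N; i++)
        C[i] = 0.0f;
    for (size_t i = 0; i < M; i++)
        for (size_t k = 0; k < K; k++)
            for (size_t j = 0; j < N; j++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}

static const backend_ops BACKENDS[] = {
    { 0, "naive CPU",                cpu_naive_matmul     },
    { 1, "loop-reordered CPU (demo)", cpu_reordered_matmul },
};

/* Returns the table entry for a mode, or NULL if none is registered. */
const backend_ops *select_backend(int mode)
{
    for (size_t i = 0; i < sizeof BACKENDS / sizeof BACKENDS[0]; i++)
        if (BACKENDS[i].mode == mode)
            return &BACKENDS[i];
    return NULL;
}
```

The training loop then only calls `ops->matmul(...)`, so benchmarking a different implementation means adding one table entry rather than touching the training code.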
This project was developed as part of a high-performance computing course to explore:
- GPU programming
- CUDA kernel optimization
- BLAS libraries
- CUDA stream synchronization
- performance bottlenecks in ML training systems
It also serves as a learning exercise in building machine learning infrastructure from first principles.