CUDA MLP Training from Scratch

A performance-oriented implementation, written in C and CUDA, of training a simple multilayer perceptron (MLP) on the MNIST dataset.
The project focuses on understanding the performance characteristics of different matrix multiplication backends, ranging from naive CPU implementations to custom CUDA kernels and cuBLAS.

The codebase implements the entire training pipeline from scratch, including tensor operations, forward and backward propagation, loss computation, and optimization.

The goal of the project is to explore low-level performance engineering in machine learning systems.

Features

Fully from-scratch MLP training pipeline
No high-level ML frameworks
Multiple backends for matrix multiplication
CPU and GPU implementations
Custom tiled CUDA kernels
cuBLAS implementation
Benchmark comparison across implementations

Run Instructions

Dependencies

For CPU: OpenBLAS, OpenMP (gcc-15 on macOS)
For CUDA: CUDA toolkit, cuBLAS

Install OpenBLAS

For Ubuntu/Debian

apt update
apt install libopenblas-dev pkg-config

For macOS with Homebrew:

brew install openblas pkg-config

Download dataset

sh download_mnist.sh

Build

CPU Only

make cpu

With CUDA

make

Run

Basic usage:

./test <mode> [stochastic] [num_epoch] [lr] [batch_size]

Modes

Mode	Description
0	naive matrix multiplication (CPU)
1	custom CPU tiled matrix multiplication
2	CPU BLAS matrix multiplication
11	custom GPU tiled matrix multiplication
12	cuBLAS multiplication
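To make the difference between modes 0 and 1 concrete, here is a minimal sketch of a naive and a tiled CPU matrix multiplication. It is not the repository's code: the function names, the row-major layout, and the `tile` parameter are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Naive triple loop: C (m x n) = A (m x k) * B (k x n), all row-major. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* Tiled variant: computes the same product, but walks tile x tile blocks
 * so the active parts of A, B and C are more likely to stay in cache. */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t m, size_t k, size_t n, size_t tile)
{
    assert(tile > 0);

    for (size_t i = 0; i < m * n; i++)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < m; i0 += tile) {
        size_t i1 = min_sz(i0 + tile, m);
        for (size_t p0 = 0; p0 < k; p0 += tile) {
            size_t p1 = min_sz(p0 + tile, k);
            for (size_t j0 = 0; j0 < n; j0 += tile) {
                size_t j1 = min_sz(j0 + tile, n);
                /* Accumulate the contribution of one block of A and B. */
                for (size_t i = i0; i < i1; i++) {
                    for (size_t p = p0; p < p1; p++) {
                        float a = A[i * k + p];
                        for (size_t j = j0; j < j1; j++)
                            C[i * n + j] += a * B[p * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions should agree up to floating-point rounding. A good tile size depends on the cache sizes of the target CPU and is best found by measurement.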

Default parameters

If optional arguments are not provided, the following defaults are used:

Parameter	Default
batch_size	500
num_epoch	5
lr	0.1
stochastic	0
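The positional arguments and their defaults can be read as in the sketch below. The `run_config` struct and `parse_args` function are hypothetical names used only for illustration; the repository's actual parsing code may differ.

```c
#include <stdlib.h>

/* Hypothetical container for the command-line options. */
typedef struct {
    int mode;
    int stochastic;
    int num_epoch;
    float lr;
    int batch_size;
} run_config;

/* Parse ./test <mode> [stochastic] [num_epoch] [lr] [batch_size].
 * Missing optional arguments fall back to the documented defaults.
 * Returns 0 on success, -1 if the required <mode> is missing. */
int parse_args(int argc, char **argv, run_config *cfg)
{
    if (argc < 2)
        return -1;

    cfg->mode       = (int)strtol(argv[1], NULL, 10);
    cfg->stochastic = argc > 2 ? (int)strtol(argv[2], NULL, 10) : 0;
    cfg->num_epoch  = argc > 3 ? (int)strtol(argv[3], NULL, 10) : 5;
    cfg->lr         = argc > 4 ? strtof(argv[4], NULL) : 0.1f;
    cfg->batch_size = argc > 5 ? (int)strtol(argv[5], NULL, 10) : 500;
    return 0;
}
```

Because the arguments are positional, overriding a later one (for example batch_size) requires supplying every argument before it.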

Model

The model is a simple fully connected multilayer perceptron trained on MNIST.

Architecture:

784 -> Hidden (default 128) -> Hidden (default 256) -> 10
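As a rough size check for this architecture, the sketch below counts trainable parameters. It assumes each layer has one weight matrix and one bias vector; the README does not confirm that biases are used, and `mlp_param_count` is a hypothetical helper.

```c
#include <stddef.h>

/* Number of trainable parameters in a fully connected network whose layer
 * widths are sizes[0..n_sizes-1]. Assumes one weight matrix plus one bias
 * vector per layer (the bias is an assumption, not taken from the project). */
size_t mlp_param_count(const size_t *sizes, size_t n_sizes)
{
    size_t total = 0;
    for (size_t l = 0; l + 1 < n_sizes; l++)
        total += sizes[l] * sizes[l + 1] + sizes[l + 1];
    return total;
}
```

With the default widths 784, 128, 256 and 10, this gives 100480 + 33024 + 2570 = 136074 parameters under the bias assumption.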

Training is performed using mini-batch gradient descent with cross-entropy loss.

All forward and backward operations are implemented manually.
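Two small pieces of the backward pass and the update can be sketched as follows, assuming a softmax output layer: with softmax followed by cross-entropy, the gradient with respect to the logits is the predicted probabilities minus the one-hot label. The function names are hypothetical, and the code takes the softmax probabilities as already computed.

```c
#include <assert.h>
#include <stddef.h>

/* Gradient of the cross-entropy loss with respect to the logits of one
 * sample, given its softmax probabilities: grad = probs - one_hot(label). */
void ce_grad_logits(const float *probs, size_t label, size_t n_classes,
                    float *grad)
{
    assert(label < n_classes);
    for (size_t c = 0; c < n_classes; c++)
        grad[c] = probs[c] - (c == label ? 1.0f : 0.0f);
}

/* One mini-batch gradient descent step on n parameters:
 * w <- w - lr * (grad_sum / batch_size),
 * where grad_sum holds gradients summed over the samples in the batch. */
void sgd_step(float *w, const float *grad_sum, size_t n,
              float lr, size_t batch_size)
{
    assert(batch_size > 0);
    float scale = lr / (float)batch_size;
    for (size_t i = 0; i < n; i++)
        w[i] -= scale * grad_sum[i];
}
```

Averaging over the batch keeps the effective step size independent of batch_size; whether the project averages or sums gradients is not stated here.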

Project Structure

src/cpu/: CPU backend implementations
src/cuda/: CUDA backend implementations
include/: Header files
data/: MNIST data
poc/: Proof of concept scripts

Dataset

The project expects the MNIST dataset in the following format:

data/mnist/train-images-idx3-ubyte
data/mnist/train-labels-idx1-ubyte
data/mnist/t10k-images-idx3-ubyte
data/mnist/t10k-labels-idx1-ubyte
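These files use the IDX format, whose header fields are 32-bit big-endian integers. The sketch below reads the header of an image file (magic number 0x00000803, then image count, rows, and columns). The function names are hypothetical and the project's own loader may be organized differently.

```c
#include <stdint.h>
#include <stdio.h>

/* Read one 32-bit big-endian integer, as used by IDX headers.
 * Returns 0 on success, -1 on a short read. */
static int read_be32(FILE *f, uint32_t *out)
{
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4)
        return -1;
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 0;
}

/* Parse the header of an idx3-ubyte image file.
 * Returns 0 on success, -1 on a short read or an unexpected magic number. */
int read_idx3_header(FILE *f, uint32_t *count, uint32_t *rows, uint32_t *cols)
{
    uint32_t magic;
    if (read_be32(f, &magic) != 0 || magic != 0x00000803u)
        return -1;
    if (read_be32(f, count) != 0 ||
        read_be32(f, rows) != 0 ||
        read_be32(f, cols) != 0)
        return -1;
    return 0;
}
```

The pixel data follows the header as count * rows * cols unsigned bytes. Label files (idx1-ubyte) use magic number 0x00000801 and carry only the count field before the label bytes.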

Performance

Default parameters:

batch_size = 500
num_epoch = 5
lr = 0.1

CPU (Apple M4, 10 cores)

Mode	Time (s)	Accuracy
0	7.18	93.67%
1	3.54	93.65%
2	1.11	93.60%

GPU (RTX 4090)

Mode	Time (s)	Accuracy
11	0.112	94.11%
11 (stochastic)	0.125	94.11%
12	0.153	94.05%
12 (stochastic)	0.202	94.11%

The custom CUDA tiled matrix multiplication (mode 11) outperforms cuBLAS (mode 12) under these conditions: 0.112 s versus 0.153 s, or about 27% less training time.

Implementation Notes

Key implementation aspects include:

tiled shared-memory CUDA matrix multiplication
custom tensor abstraction
softmax and loss reduction kernels
reuse of cached addresses inside kernels
pluggable backend design via operator table
CUDA stream-based asynchronous data transfer
CUDA stream-based parallel parameter updates
pinned host memory buffers
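The pluggable backend idea from the list above can be sketched in plain C as a table of function pointers keyed by mode number. The type and function names here (`backend_ops`, `find_backend`, `matmul_reference`) are assumptions for illustration, and only one backend is registered to keep the example short.

```c
#include <stddef.h>

/* Signature shared by every matmul backend:
 * C (m x n) = A (m x k) * B (k x n), all row-major. */
typedef void (*matmul_fn)(const float *A, const float *B, float *C,
                          size_t m, size_t k, size_t n);

/* One row of the operator table. A fuller table could also hold
 * pointers for activation, softmax, and parameter-update routines. */
typedef struct {
    int mode;
    const char *name;
    matmul_fn matmul;
} backend_ops;

/* Reference CPU implementation used as the only registered backend here. */
static void matmul_reference(const float *A, const float *B, float *C,
                             size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

static const backend_ops backends[] = {
    { 0, "naive CPU", matmul_reference },
    /* Other modes (1, 2, 11, 12) would be registered the same way. */
};

/* Look up a backend by mode number; returns NULL if it is not registered. */
const backend_ops *find_backend(int mode)
{
    for (size_t i = 0; i < sizeof backends / sizeof backends[0]; i++)
        if (backends[i].mode == mode)
            return &backends[i];
    return NULL;
}
```

The training loop then calls `ops->matmul(...)` without knowing which implementation is behind it, so benchmarking a new backend only requires adding a row to the table.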

The project is designed to make it easy to compare performance between different implementations.

Purpose

This project was developed as part of a high-performance computing course to explore:

GPU programming
CUDA kernel optimization
BLAS libraries
CUDA stream synchronization
performance bottlenecks in ML training systems

It also serves as a learning exercise in building machine learning infrastructure from first principles.
