A complete CUDA research case study on memory management optimizations for a shallow neural network. This repository compares a reference GPU implementation against optimized variants that reduce allocation overhead, speed up host-device transfers, and overlap transfers with computation using CUDA streams.
- A performance study of memory management in CUDA-based shallow neural network training.
- A comparison between a reference GPU implementation and three optimization strategies:
- Streams — concurrent kernel and transfer execution using CUDA streams.
- Pinned memory — page-locked host buffers for faster host-device transfers.
- Combined — both streams and pinned memory with pre-allocated GPU buffers.
- Experimental evaluation of scaling behavior across dataset sizes and network depth.
- Educational deliverables for a CUDA research project at École Nationale Supérieure d'Informatique (ESI).
- Project Specification - Original project requirements and guidelines in French.
- Final Report - Complete research article with experimental results and analysis.
- Reference Report - Baseline implementation and report by Brouthen and Akeb.
GPU memory management is often the hidden bottleneck in neural network training. When device allocations and host transfers are repeated for every matrix operation, the overhead can dominate runtime. This project shows how alternative memory strategies can substantially improve performance for shallow network training while preserving correctness.
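To make the allocation overhead concrete, here is a minimal sketch that contrasts allocating a device buffer on every call with reusing one buffer allocated up front. It is not the repository's code: the `add_one` kernel, the `step_*` helpers, and the iteration count are hypothetical stand-ins, and the buffer size simply mirrors one batch of 256 samples with 32 features.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical stand-in for one matrix operation on a batch.
__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Copy in, compute, copy out, using an existing device buffer.
static void run_on_device(float *host, float *dev, int n) {
    size_t bytes = n * sizeof(float);
    CUDA_CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
    add_one<<<(n + 255) / 256, 256>>>(dev, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
}

// Pattern 1: allocate and free the device buffer on every call.
void step_alloc_each_call(float *host, int n) {
    float *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    run_on_device(host, dev, n);
    CUDA_CHECK(cudaFree(dev));
}

// Pattern 2: reuse a device buffer that the caller allocated once.
void step_preallocated(float *host, float *dev, int n) {
    run_on_device(host, dev, n);
}

int main() {
    const int n = 256 * 32;  // one batch: 256 samples x 32 features
    const int iters = 1000;  // arbitrary repetition count for the timing
    std::vector<float> host(n, 0.0f);
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    step_alloc_each_call(host.data(), n);  // warm-up: creates the CUDA context

    auto t0 = Clock::now();
    for (int i = 0; i < iters; ++i) step_alloc_each_call(host.data(), n);
    auto t1 = Clock::now();

    float *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    auto t2 = Clock::now();
    for (int i = 0; i < iters; ++i) step_preallocated(host.data(), dev, n);
    auto t3 = Clock::now();
    CUDA_CHECK(cudaFree(dev));

    std::printf("alloc per call: %.3f ms\n", Ms(t1 - t0).count());
    std::printf("preallocated:   %.3f ms\n", Ms(t3 - t2).count());
    return 0;
}
```

The blocking `cudaMemcpy` calls synchronize with the device, so host-side wall-clock timing is adequate here. The size of the gap between the two patterns depends on the GPU, driver, and buffer size.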
- Input dimension: 32 features
- Hidden layer size: 256 neurons
- Output dimension: 1 neuron
- Training epochs: 100
- Batch size: 256
- Metrics: average runtime and final MSE over multiple runs
- Data: synthetic convex datasets in reference/data/
- GPU test target: NVIDIA T4 with compute capability 7.5
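The two metrics listed above could be collected along the following lines. This is a sketch under stated assumptions, not the project's measurement code: `placeholder_work` is a hypothetical kernel standing in for a training run, `average_runtime_ms` and `mean_squared_error` are illustrative helper names, and the prediction and target values are made up for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical placeholder for the GPU work being timed.
__global__ void placeholder_work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;
}

// Mean squared error between predictions and targets, computed on the host.
double mean_squared_error(const std::vector<float> &pred,
                          const std::vector<float> &target) {
    double sum = 0.0;
    for (size_t i = 0; i < pred.size(); ++i) {
        double diff = static_cast<double>(pred[i]) - target[i];
        sum += diff * diff;
    }
    return sum / static_cast<double>(pred.size());
}

// Average GPU time in milliseconds over `runs` launches, using CUDA events.
float average_runtime_ms(int n, int runs) {
    float *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(dev, 0, n * sizeof(float)));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    float total_ms = 0.0f;
    for (int r = 0; r < runs; ++r) {
        CUDA_CHECK(cudaEventRecord(start));
        placeholder_work<<<(n + 255) / 256, 256>>>(dev, n);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaEventRecord(stop));
        CUDA_CHECK(cudaEventSynchronize(stop));  // wait until the run has finished
        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
        total_ms += ms;
    }

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(dev));
    return total_ms / runs;
}

int main() {
    std::printf("average runtime: %.4f ms\n", average_runtime_ms(256 * 32, 5));

    std::vector<float> pred = {1.0f, 2.0f, 3.0f};    // made-up predictions
    std::vector<float> target = {1.0f, 2.0f, 5.0f};  // made-up targets
    std::printf("MSE: %.4f\n", mean_squared_error(pred, target));
    return 0;
}
```

CUDA events measure time on the device timeline, which avoids counting host-side launch jitter. Averaging over several runs smooths out run-to-run variation.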
- NVIDIA GPU with compute capability 7.5+
- CUDA Toolkit 11.8+
- GNU GCC 9.4+
- nvcc available in PATH
- Python 3.10+ for optional analysis scripts
- python3 -m venv support
- numpy, pandas, matplotlib
Install optional Python dependencies with:
python3 -m venv .venv
source .venv/bin/activate
pip install -r python_requirements.txt
git clone <repo-url>
cd <repo-name>
cd reference
gcc -O3 -o nn nn.c -lm -fopenmp
gcc -O3 -o nn_pthreads nn_pthreads.c -lm -pthread -fopenmp
nvcc -O3 -o nn_cuda nn_cuda.cu -Xcompiler -fopenmp -gencode arch=compute_75,code=sm_75
From the repository root:
cd alternatives
./run_all.sh
This script compiles the main CUDA variants and runs each of them on the three synthetic datasets.
cd alternatives
nvcc -O3 -Xcompiler -fopenmp -gencode arch=compute_75,code=sm_75 nn_cuda_combined.cu -o nn_cuda_combined
From reference/:
./nn_cuda ../reference/data/synthetic_convex_small.csv
From alternatives/:
./nn_cuda_combined ../reference/data/synthetic_convex_small.csv
From reference/:
./nn ../reference/data/synthetic_convex_small.csv
./nn_pthreads ../reference/data/synthetic_convex_small.csv
cd alternatives
./run_all.sh
This command:
- compiles nn_cuda_reference, nn_cuda_streams, nn_cuda_pinned, and nn_cuda_combined
- runs each on small, medium, and large
- prints timing output for each dataset
- alternatives/ - Optimized CUDA implementations and evaluation scripts.
  - Variants include nn_cuda_reference.cu, nn_cuda_streams.cu, nn_cuda_pinned.cu, and nn_cuda_combined.cu.
  - Depth variants: *_two_layers.cu, *_three_layers.cu.
- reference/ - Baseline code and datasets.
  - nn.c, nn_pthreads.c, nn_cuda.cu, test_cuda.cu.
  - data/ contains synthetic_convex_small.csv, synthetic_convex_medium.csv, and synthetic_convex_large.csv.
- report/ - Final article source and report materials.
- run_full_project.ipynb - Notebook for analysis and visualization.
- python_requirements.txt - Python dependencies for optional analysis.
- The combined strategy is the best-performing optimization in this study.
- Streams alone offer only small improvements when there is little opportunity to overlap computation with transfers.
- Pinned memory improves transfer throughput and reduces host-device overhead.
- Performance gains depend strongly on network size, batch count, and layer depth.
- Depth variants show that adding more hidden layers reduces the benefit of memory-only optimizations.
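For readers unfamiliar with the techniques behind these findings, the following is a minimal sketch of how pinned host memory, a pre-allocated device buffer, and CUDA streams fit together. It illustrates the general pattern only and is not taken from `nn_cuda_combined.cu`: the `add_one` kernel, the `run_pipeline` helper, the stream count, and the buffer size are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical stand-in for the per-batch computation.
__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Processes n floats in num_streams chunks and returns the number of
// elements that ended up with the wrong value (0 means success).
// Assumes n is divisible by num_streams.
int run_pipeline(int n, int num_streams) {
    const int chunk = n / num_streams;
    const size_t chunk_bytes = chunk * sizeof(float);

    float *host = nullptr;
    float *dev = nullptr;
    CUDA_CHECK(cudaMallocHost(&host, n * sizeof(float)));  // page-locked host buffer
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));       // allocated once, reused
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    std::vector<cudaStream_t> streams(num_streams);
    for (auto &s : streams) CUDA_CHECK(cudaStreamCreate(&s));

    // Each stream copies its chunk in, processes it, and copies it back.
    // Work queued in different streams may overlap on the device.
    for (int s = 0; s < num_streams; ++s) {
        float *h = host + s * chunk;
        float *d = dev + s * chunk;
        CUDA_CHECK(cudaMemcpyAsync(d, h, chunk_bytes,
                                   cudaMemcpyHostToDevice, streams[s]));
        add_one<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h, d, chunk_bytes,
                                   cudaMemcpyDeviceToHost, streams[s]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 2.0f) ++wrong;
    }

    for (auto &s : streams) CUDA_CHECK(cudaStreamDestroy(s));
    CUDA_CHECK(cudaFree(dev));
    CUDA_CHECK(cudaFreeHost(host));
    return wrong;
}

int main() {
    int wrong = run_pipeline(1 << 20, 4);
    std::printf("mismatches: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Pinned memory matters here because `cudaMemcpyAsync` can only run asynchronously with respect to the host when the host buffer is page-locked. Whether copies and kernels actually overlap also depends on the device's copy engines and on how much work each chunk carries.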
Mohamed El Amine Kherroubi, Badis Khalef, Mounir Sofiane Mostefai, Youcef Tati, Mohamed Ishak Messadia
2CS-SIQ/SID, École Nationale Supérieure d'Informatique (ESI), Algiers
[1] Brouthen, K., & Akeb, A. (2024). Exploring parallelization of shallow neural network using CUDA.