Mini-GPU: Custom Parallel GPU Architecture
A fully custom GPU architecture implemented in SystemVerilog/Verilog with Python functional simulation. Inspired by NVIDIA CUDA cores and AMD RDNA compute units.
CPU Interface
|
Command Processor (FIFO queue, 64-deep)
|
Instruction Fetch & Decode (4K IMEM)
|
Warp Scheduler (8 warps, round-robin + stall bypass)
|
SIMD Execution Engine (32 parallel lanes)
|-- Integer ALU (ADD/SUB/MUL/DIV/AND/OR/XOR/SHL/SHR/NOT/CMP)
|-- IEEE 754 FPU (FADD/FSUB/FMUL/FDIV/FSQRT/FABS/FNEG)
|-- Vector Units (VADD/VMUL/VDOT/VCROSS)
|
Shader Pipeline
|-- Vertex Shader (4x4 MVP matrix, perspective divide, viewport)
|-- Primitive Assembly (triangle collector, back-face cull)
|-- Rasterizer (scanline, barycentric interpolation)
|-- Fragment Shader (flat, Gouraud, Phong+texture)
|
Memory Hierarchy
|-- Shared Memory (32KB, 32-bank, bank-conflict detection)
|-- Global Memory (1MB, 16KB L1 4-way cache)
|-- Texture Cache (256x256 RGB, bilinear filtering)
|
Framebuffer (double-buffered 640x480 RGB888, Z-buffer)
|
VGA Controller (640x480@60Hz, 25MHz pixel clock)
|
Monitor Output
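The warp scheduler stage above is described only as "8 warps, round-robin + stall bypass". A minimal Python sketch of one plausible reading follows: start from the warp after the last one issued and skip any warp that is stalled. The function name `next_warp` and the `stalled` flag list are hypothetical and are not taken from the RTL.

```python
from typing import Optional, Sequence

NUM_WARPS = 8  # "Concurrent warps" in the architecture parameters


def next_warp(last_issued: int, stalled: Sequence[bool]) -> Optional[int]:
    """Round-robin selection that bypasses stalled warps.

    Scans the warps in circular order, starting just after `last_issued`,
    and returns the first one that is not stalled. Returns None when every
    warp is stalled, which would correspond to a pipeline bubble.
    """
    for offset in range(1, NUM_WARPS + 1):
        candidate = (last_issued + offset) % NUM_WARPS
        if not stalled[candidate]:
            return candidate
    return None


# Warps 1 and 2 are waiting (for example on memory), so warp 3 issues next.
stalled = [False] * NUM_WARPS
stalled[1] = True
stalled[2] = True
print(next_warp(0, stalled))  # 3
```

Scanning a full circle means the last-issued warp can be chosen again if it is the only one ready, so the issue slot is wasted only when all eight warps are stalled.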
Ray Tracing Unit : Möller-Trumbore ray-triangle intersection, 12-stage pipeline
Tensor Core : 4x4 INT8 systolic array, D=AxB+C
Custom GPU ISA : 64 opcodes covering ALU, FPU, vector, memory, texture, control, sync
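For reference, the ray-triangle test named for the Ray Tracing Unit can be sketched in floating-point Python as below. This is the textbook form of the algorithm, not a model of the 12-stage hardware pipeline; the helper names, the `EPSILON` threshold and the `(t, u, v)` return convention are assumptions made for illustration.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]
EPSILON = 1e-8  # assumed tolerance for "ray parallel to triangle"


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin: Vec3, direction: Vec3,
                 v0: Vec3, v1: Vec3, v2: Vec3
                 ) -> Optional[Tuple[float, float, float]]:
    """Return (t, u, v) for a hit in front of the origin, else None.

    t is the distance along the ray; u and v are barycentric coordinates
    of the hit point relative to v1 and v2.
    """
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < EPSILON:          # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det
    if t <= EPSILON:                # intersection is behind the ray origin
        return None
    return (t, u, v)


# A ray pointing straight down at a unit right triangle in the z = 0 plane.
hit = ray_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0),
                   (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(hit)  # (1.0, 0.25, 0.25)
```

The early exits on `u` and `v` are what make the algorithm attractive for pipelining: most rays miss and can be rejected before the final distance is computed.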
| Category | Instructions |
|---|---|
| Integer | ADD SUB MUL DIV AND OR XOR SHL SHR NOT CMP |
| Float | FADD FSUB FMUL FDIV FSQRT FABS FNEG FCMP |
| Vector | VADD VMUL VDOT VCROSS VNORM |
| Memory | LOAD STORE LSHR SSHR (shared mem) |
| Texture | TEX2D TEXF |
| Control | JMP BEQ BNE BLT BGT CALL RET |
| GPU | SYNC WARP_X LDIMM MOV |
| Ray Trace | RAY_ISECT BVH_TRAV |
| AI/Tensor | MATMUL TMAC |
| Convert | I2F F2I |
rtl/
core/ - gpu_pkg, ALU, FPU, register file, SIMD engine, warp scheduler
pipeline/ - vertex shader, fragment shader
rasterizer/ - primitive assembly, rasterizer
memory/ - shared mem, global mem + L1 cache, texture cache
display/ - framebuffer (double-buffer + Z-buffer), VGA controller
raytracing/ - ray-triangle intersection
tensor/ - 4x4 tensor core
top/ - gpu_top (full integration)
tb/
core/ - tb_alu, tb_simd_engine, tb_warp_scheduler
pipeline/ - tb_rasterizer
system/ - tb_gpu_top (full system)
sim/
functional_sim.py - Python simulation (no HDL tool required)
run_all.sh - Verilator/Linux runner
run_all.ps1 - Verilator/Windows runner
Python functional simulation (no tools needed)
python sim/functional_sim.py
# Expected: 136 PASSED, 0 FAILED
Verilator (HDL simulation)
# Install verilator: https://verilator.org/guide/latest/install.html
bash sim/run_all.sh # Linux/Mac
.\sim\run_all.ps1 # Windows PowerShell
# Target: Basys 3 (xc7a35tcpg236-1)
create_project mini_gpu ./vivado -part xc7a35tcpg236-1
add_files [glob rtl/**/*.sv]
set_property top gpu_top [current_fileset]
synth_design
| Board | Part | Recommended For |
|---|---|---|
| Basys 3 | xc7a35t | Core + VGA output |
| Nexys A7 | xc7a100t | Full pipeline |
| Zynq-7000 | xc7z020 | CPU+GPU integration |
| Kintex-7 | xc7k325t | Ray tracing + tensor |
Key Architecture Parameters
| Parameter | Value |
|---|---|
| Warp size | 32 threads |
| Concurrent warps | 8 |
| Registers per thread | 32 x 32-bit |
| Shared memory | 32KB, 32 banks |
| Global memory | 1MB + 16KB L1 cache |
| Display | 640x480 @ 60Hz VGA |
| Texture size | 256x256 RGB888 |
| Instruction width | 32-bit |
| Data width | 32-bit (INT32 / FP32) |
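With a 32-thread warp and 32 shared-memory banks, whether a warp access conflicts depends on how thread addresses map to banks. The sketch below assumes word-interleaved banking (consecutive 32-bit words go to consecutive banks) and assumes that threads reading the same word are served by a single broadcast. Both are common conventions, but neither is confirmed for this design; `bank_of` and `conflict_degree` are hypothetical names.

```python
from collections import defaultdict
from typing import Iterable

NUM_BANKS = 32   # from the parameter table
WORD_BYTES = 4   # 32-bit data width


def bank_of(byte_addr: int) -> int:
    """Bank index under the assumed word-interleaved mapping."""
    return (byte_addr // WORD_BYTES) % NUM_BANKS


def conflict_degree(byte_addrs: Iterable[int]) -> int:
    """Largest number of distinct words that target a single bank.

    1 means the access is conflict-free; N means the busiest bank would
    need N serialized cycles. An empty access list yields 0.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addrs:
        words_per_bank[bank_of(addr)].add(addr // WORD_BYTES)
    return max((len(words) for words in words_per_bank.values()), default=0)


# One address per thread in a 32-thread warp.
unit_stride = [4 * t for t in range(32)]     # consecutive words
stride_two = [8 * t for t in range(32)]      # every second word
same_bank = [128 * t for t in range(32)]     # stride of 32 words

print(conflict_degree(unit_stride))  # 1
print(conflict_degree(stride_two))   # 2
print(conflict_degree(same_bank))    # 32
```

Under these assumptions each bank holds 32KB / 32 = 1KB, and a stride equal to the bank count is the worst case because every thread lands in bank 0.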
This project demonstrates:
RTL design and hardware description language (SystemVerilog)
GPU microarchitecture: SIMT execution, warp scheduling
Computer graphics pipeline: vertex processing, rasterization
Memory hierarchy design: cache, shared memory, texture
FPGA prototyping workflow
Functional verification with Python models