Mini-GPU: Custom Parallel GPU Architecture

A fully custom GPU architecture implemented in SystemVerilog/Verilog, with a Python functional simulation. The design is inspired by NVIDIA CUDA cores and AMD RDNA compute units.

Architecture Overview

CPU Interface
     |
Command Processor (FIFO queue, 64-deep)
     |
Instruction Fetch & Decode (4K IMEM)
     |
Warp Scheduler (8 warps, round-robin + stall bypass)
     |
SIMD Execution Engine (32 parallel lanes)
  |-- Integer ALU (ADD/SUB/MUL/DIV/AND/OR/XOR/SHL/SHR/NOT/CMP)
  |-- IEEE 754 FPU (FADD/FSUB/FMUL/FDIV/FSQRT/FABS/FNEG)
  |-- Vector Units (VADD/VMUL/VDOT/VCROSS)
     |
Shader Pipeline
  |-- Vertex Shader (4x4 MVP matrix, perspective divide, viewport)
  |-- Primitive Assembly (triangle collector, back-face cull)
  |-- Rasterizer (scanline, barycentric interpolation)
  |-- Fragment Shader (flat, Gouraud, Phong+texture)
     |
Memory Hierarchy
  |-- Shared Memory (32KB, 32-bank, bank-conflict detection)
  |-- Global Memory (1MB, 16KB L1 4-way cache)
  |-- Texture Cache (256x256 RGB, bilinear filtering)
     |
Framebuffer (double-buffered 640x480 RGB888, Z-buffer)
     |
VGA Controller (640x480@60Hz, 25.175MHz nominal pixel clock, commonly approximated as 25MHz)
     |
Monitor Output
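
The warp scheduler stage above is described as "round-robin + stall bypass". A minimal Python sketch of that policy follows. The class name, the `stalled` flag list, and the choice to start from warp 0 are assumptions made for illustration; the RTL in rtl/core/ may differ.

```python
from typing import List, Optional


class RoundRobinWarpScheduler:
    """Picks the next ready warp in round-robin order, skipping stalled warps."""

    def __init__(self, num_warps: int = 8) -> None:
        self.num_warps = num_warps
        # Start "behind" warp 0 so that warp 0 is the first candidate.
        self.last_issued = num_warps - 1

    def select(self, stalled: List[bool]) -> Optional[int]:
        """Return the warp to issue this cycle, or None if every warp is stalled."""
        for offset in range(1, self.num_warps + 1):
            candidate = (self.last_issued + offset) % self.num_warps
            if not stalled[candidate]:
                self.last_issued = candidate
                return candidate
        return None


scheduler = RoundRobinWarpScheduler()
stalled = [False] * 8
stalled[1] = True  # hypothetical: warp 1 is waiting on memory
issue_order = [scheduler.select(stalled) for _ in range(8)]
print(issue_order)  # [0, 2, 3, 4, 5, 6, 7, 0]
```

Warp 1 is bypassed while it is stalled, and the rotation continues from the last warp that actually issued, so no ready warp is starved.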

Advanced Modules

Ray Tracing Unit: Möller-Trumbore ray-triangle intersection, 12-stage pipeline
Tensor Core: 4x4 INT8 systolic array, D = A×B + C
Custom GPU ISA: 64 opcodes covering ALU, FPU, vector, memory, texture, control, sync
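
For reference, the Möller-Trumbore test named above can be sketched in plain Python. This is a floating-point software model, not the 12-stage hardware pipeline; the function name, the `eps` tolerance, and the decision not to cull back-facing triangles are assumptions.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_intersect(origin: Vec3, direction: Vec3,
                           v0: Vec3, v1: Vec3, v2: Vec3,
                           eps: float = 1e-7) -> Optional[Tuple[float, float, float]]:
    """Return (t, u, v) for a hit, where u and v are barycentric weights, else None."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det
    if t < eps:                 # intersection lies behind the ray origin
        return None
    return (t, u, v)


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
hit = ray_triangle_intersect((0.25, 0.25, -1.0), (0.0, 0.0, 1.0), *tri)
miss = ray_triangle_intersect((2.0, 2.0, -1.0), (0.0, 0.0, 1.0), *tri)
print(hit, miss)  # (1.0, 0.25, 0.25) None
```

The early exits on `u` and `v` are what make the algorithm attractive for pipelining: each rejection needs only the values computed so far.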

GPU ISA Summary

Category	Instructions
Integer	ADD SUB MUL DIV AND OR XOR SHL SHR NOT CMP
Float	FADD FSUB FMUL FDIV FSQRT FABS FNEG FCMP
Vector	VADD VMUL VDOT VCROSS VNORM
Memory	LOAD STORE LSHR SSHR (LSHR/SSHR access shared memory)
Texture	TEX2D TEXF
Control	JMP BEQ BNE BLT BGT CALL RET
GPU	SYNC WARP_X LDIMM MOV
Ray Trace	RAY_ISECT BVH_TRAV
AI/Tensor	MATMUL TMAC
Convert	I2F F2I
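
As a reference for the AI/Tensor row, the tensor core operation D = A×B + C on 4x4 INT8 inputs can be modelled in a few lines of Python. This sketch only reproduces the arithmetic result, not the cycle-by-cycle systolic dataflow. The function names and the use of an unbounded Python integer as the accumulator are assumptions; the accumulator width in the RTL is not stated here.

```python
from typing import List

Matrix = List[List[int]]
N = 4  # tensor core tile size (4x4)


def check_int8(m: Matrix) -> None:
    """Reject operands that do not fit in a signed 8-bit value."""
    for row in m:
        for x in row:
            if not -128 <= x <= 127:
                raise ValueError(f"{x} is outside the INT8 range")


def tensor_mma(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    """Compute D = A x B + C for 4x4 tiles with INT8 A and B."""
    check_int8(a)
    check_int8(b)
    d = [row[:] for row in c]  # start each accumulator from C
    for i in range(N):
        for j in range(N):
            for k in range(N):
                d[i][j] += a[i][k] * b[k][j]
    return d


identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
b = [[1, 2, 3, 4], [5, 6, 7, 8], [-1, -2, -3, -4], [127, -128, 0, 1]]
c = [[1] * N for _ in range(N)]
d = tensor_mma(identity, b, c)
print(d[0])  # [2, 3, 4, 5]
```

A model like this is useful as a golden reference when checking a hardware testbench, since any dataflow ordering must produce the same D.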

Directory Structure

rtl/
  core/         - gpu_pkg, ALU, FPU, register file, SIMD engine, warp scheduler
  pipeline/     - vertex shader, fragment shader
  rasterizer/   - primitive assembly, rasterizer
  memory/       - shared mem, global mem + L1 cache, texture cache
  display/      - framebuffer (double-buffer + Z-buffer), VGA controller
  raytracing/   - ray-triangle intersection
  tensor/       - 4x4 tensor core
  top/          - gpu_top (full integration)
tb/
  core/         - tb_alu, tb_simd_engine, tb_warp_scheduler
  pipeline/     - tb_rasterizer
  system/       - tb_gpu_top (full system)
sim/
  functional_sim.py  - Python simulation (no HDL tool required)
  run_all.sh         - Verilator/Linux runner
  run_all.ps1        - Verilator/Windows runner

Running Tests

Python functional simulation (no tools needed)

python sim/functional_sim.py
# Expected: 136 PASSED, 0 FAILED

Verilator (HDL simulation)

# Install verilator: https://verilator.org/guide/latest/install.html
bash sim/run_all.sh        # Linux/macOS
.\sim\run_all.ps1          # Windows PowerShell

Vivado (FPGA synthesis)

# Target: Basys 3 (xc7a35tcpg236-1)
create_project mini_gpu ./vivado -part xc7a35tcpg236-1
# Tcl glob has no recursive "**"; match one directory level under rtl/
add_files [glob rtl/*/*.sv]
set_property top gpu_top [current_fileset]
synth_design

FPGA Targets

Board	Part	Recommended For
Basys 3	xc7a35t	Core + VGA output
Nexys A7	xc7a100t	Full pipeline
Zynq-7000	xc7z020	CPU+GPU integration
Kintex-7	xc7k325t	Ray tracing + tensor

Key Architecture Parameters

Parameter	Value
Warp size	32 threads
Concurrent warps	8
Registers per thread	32 x 32-bit
Shared memory	32KB, 32 banks
Global memory	1MB + 16KB L1 cache
Display	640x480 @ 60Hz VGA
Texture size	256x256 RGB888
Instruction width	32-bit
Data width	32-bit (INT32 / FP32)
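
The shared memory parameters above (32 banks, 32-bit data, 32 threads per warp) imply that a warp access is conflict-free when every lane hits a different bank. A minimal sketch of bank-conflict detection follows. The word-interleaved mapping `bank = (byte_address / 4) mod 32` and the rule that lanes reading the same word do not conflict (broadcast) are assumptions, not confirmed details of this design.

```python
from collections import defaultdict
from typing import Dict, List, Set

NUM_BANKS = 32   # from the parameter table
WORD_BYTES = 4   # 32-bit data width


def bank_of(byte_addr: int) -> int:
    """Assumed mapping: consecutive 32-bit words go to consecutive banks."""
    return (byte_addr // WORD_BYTES) % NUM_BANKS


def conflict_degree(lane_addrs: List[int]) -> int:
    """Largest number of distinct words requested from one bank (1 = conflict-free)."""
    words_per_bank: Dict[int, Set[int]] = defaultdict(set)
    for addr in lane_addrs:
        words_per_bank[bank_of(addr)].add(addr // WORD_BYTES)
    return max((len(words) for words in words_per_bank.values()), default=0)


unit_stride = [lane * 4 for lane in range(32)]     # one word per lane
two_word_stride = [lane * 8 for lane in range(32)]  # every second word
bank_stride = [lane * 128 for lane in range(32)]    # every lane hits bank 0
broadcast = [0] * 32                                # all lanes read one word

print(conflict_degree(unit_stride))      # 1
print(conflict_degree(two_word_stride))  # 2
print(conflict_degree(bank_stride))      # 32
print(conflict_degree(broadcast))        # 1
```

Under these assumptions the conflict degree is the number of serialized passes the access would need, which is why stride-32-word patterns are the worst case.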

Project Phases

Learning Resources

This project demonstrates:

RTL design and hardware description language (SystemVerilog)
GPU microarchitecture: SIMT execution, warp scheduling
Computer graphics pipeline: vertex processing, rasterization
Memory hierarchy design: cache, shared memory, texture
FPGA prototyping workflow
Functional verification with Python models
