SanskaarUndale21/mini-Gpu

Mini-GPU: Custom Parallel GPU Architecture

A fully custom GPU architecture implemented in SystemVerilog/Verilog, with a Python functional simulation. The design is inspired by NVIDIA CUDA cores and AMD RDNA compute units.

Architecture Overview

CPU Interface
     |
Command Processor (FIFO queue, 64-deep)
     |
Instruction Fetch & Decode (4K IMEM)
     |
Warp Scheduler (8 warps, round-robin + stall bypass)
     |
SIMD Execution Engine (32 parallel lanes)
  |-- Integer ALU (ADD/SUB/MUL/DIV/AND/OR/XOR/SHL/SHR/NOT/CMP)
  |-- IEEE 754 FPU (FADD/FSUB/FMUL/FDIV/FSQRT/FABS/FNEG)
  |-- Vector Units (VADD/VMUL/VDOT/VCROSS)
     |
Shader Pipeline
  |-- Vertex Shader (4x4 MVP matrix, perspective divide, viewport)
  |-- Primitive Assembly (triangle collector, back-face cull)
  |-- Rasterizer (scanline, barycentric interpolation)
  |-- Fragment Shader (flat, Gouraud, Phong+texture)
     |
Memory Hierarchy
  |-- Shared Memory (32KB, 32-bank, bank-conflict detection)
  |-- Global Memory (1MB, 16KB L1 4-way cache)
  |-- Texture Cache (256x256 RGB, bilinear filtering)
     |
Framebuffer (double-buffered 640x480 RGB888, Z-buffer)
     |
VGA Controller (640x480@60Hz, 25MHz pixel clock)
     |
Monitor Output
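
The "round-robin + stall bypass" policy named in the warp scheduler stage can be illustrated with a small Python model. This is a minimal sketch, not the RTL: the function name `pick_next_warp` and the per-warp `stalled` flags are hypothetical, and only the warp count (8) comes from the diagram above.

```python
from typing import Optional, Sequence

NUM_WARPS = 8  # from the overview: 8 concurrent warps


def pick_next_warp(last_issued: int, stalled: Sequence[bool]) -> Optional[int]:
    """Return the next warp to issue, or None if every warp is stalled.

    Round-robin: the search starts at the warp after `last_issued`.
    Stall bypass: any warp whose stall flag is set is skipped.
    (Hypothetical model; the real scheduler's stall sources are not modeled.)
    """
    for offset in range(1, NUM_WARPS + 1):
        candidate = (last_issued + offset) % NUM_WARPS
        if not stalled[candidate]:
            return candidate
    return None


if __name__ == "__main__":
    stalled = [False] * NUM_WARPS
    stalled[3] = True  # pretend warp 3 is waiting on memory
    print(pick_next_warp(2, stalled))
```

With warp 3 stalled, a scheduler that last issued warp 2 skips ahead to the next ready warp instead of idling for a cycle.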

Advanced Modules

  • Ray Tracing Unit: Möller-Trumbore ray-triangle intersection, 12-stage pipeline
  • Tensor Core: 4x4 INT8 systolic array, D=AxB+C
  • Custom GPU ISA: 64 opcodes covering ALU, FPU, vector, memory, texture, control, sync
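
For readers unfamiliar with the Möller-Trumbore test used by the ray tracing unit, the following Python sketch shows the standard floating-point form of the algorithm. It is a reference model only: the function name `ray_triangle` and the `eps` tolerance are assumptions, and it does not reflect how the 12-stage hardware pipeline divides the work.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(orig: Vec3, direction: Vec3, v0: Vec3, v1: Vec3, v2: Vec3,
                 eps: float = 1e-7) -> Optional[Tuple[float, float, float]]:
    """Return (t, u, v) for a hit, or None for a miss.

    t is the distance along the ray; u and v are barycentric coordinates
    of the hit point relative to v1 and v2.
    """
    e1 = sub(v1, v0)
    e2 = sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    t_vec = sub(orig, v0)
    u = dot(t_vec, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = cross(t_vec, e1)
    v = dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det
    if t <= eps:                # intersection lies behind the ray origin
        return None
    return (t, u, v)
```

A ray fired straight down the -Z axis from (0.25, 0.25, 1) at the triangle (0,0,0), (1,0,0), (0,1,0) hits at t = 1 with u = v = 0.25.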

GPU ISA Summary

Category     Instructions
Integer      ADD SUB MUL DIV AND OR XOR SHL SHR NOT CMP
Float        FADD FSUB FMUL FDIV FSQRT FABS FNEG FCMP
Vector       VADD VMUL VDOT VCROSS VNORM
Memory       LOAD STORE LSHR SSHR (shared memory)
Texture      TEX2D TEXF
Control      JMP BEQ BNE BLT BGT CALL RET
GPU          SYNC WARP_X LDIMM MOV
Ray Trace    RAY_ISECT BVH_TRAV
AI/Tensor    MATMUL TMAC
Convert      I2F F2I
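
As a rough illustration of what the AI/Tensor instructions compute, the tensor core's D = A×B + C operation on 4x4 INT8 operands can be modeled in a few lines of Python. This is a functional sketch only: the name `tensor_mma` is hypothetical, the accumulator is assumed to be wider than 8 bits (Python integers do not overflow), and the cycle-by-cycle behavior of the systolic array is not modeled.

```python
from typing import List

N = 4  # tensor core tile size from the README (4x4)
Matrix = List[List[int]]


def check_int8(m: Matrix) -> None:
    """Raise ValueError if any element is outside the signed 8-bit range."""
    for row in m:
        for x in row:
            if not -128 <= x <= 127:
                raise ValueError(f"{x} does not fit in INT8")


def tensor_mma(a: Matrix, b: Matrix, c: Matrix) -> Matrix:
    """Compute D = A x B + C for 4x4 tiles.

    A and B are INT8; C and D are assumed to use a wider accumulator
    (an assumption, the README does not state the accumulator width).
    """
    check_int8(a)
    check_int8(b)
    d = [[c[i][j] for j in range(N)] for i in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                d[i][j] += a[i][k] * b[k][j]
    return d
```

Multiplying by the identity matrix is a quick sanity check: with A = I, the result is simply B + C.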

Directory Structure

rtl/
  core/         - gpu_pkg, ALU, FPU, register file, SIMD engine, warp scheduler
  pipeline/     - vertex shader, fragment shader
  rasterizer/   - primitive assembly, rasterizer
  memory/       - shared mem, global mem + L1 cache, texture cache
  display/      - framebuffer (double-buffer + Z-buffer), VGA controller
  raytracing/   - ray-triangle intersection
  tensor/       - 4x4 tensor core
  top/          - gpu_top (full integration)
tb/
  core/         - tb_alu, tb_simd_engine, tb_warp_scheduler
  pipeline/     - tb_rasterizer
  system/       - tb_gpu_top (full system)
sim/
  functional_sim.py  - Python simulation (no HDL tool required)
  run_all.sh         - Verilator/Linux runner
  run_all.ps1        - Verilator/Windows runner

Running Tests

Python functional simulation (no HDL tools needed)

python sim/functional_sim.py
# Expected: 136 PASSED, 0 FAILED

Verilator (HDL simulation)

# Install verilator: https://verilator.org/guide/latest/install.html
bash sim/run_all.sh        # Linux/Mac
.\sim\run_all.ps1          # Windows PowerShell

Vivado (FPGA synthesis)

# Target: Basys 3 (xc7a35tcpg236-1)
create_project mini_gpu ./vivado -part xc7a35tcpg236-1
# Tcl's glob does not recurse; the rtl/ sources sit one directory deep
add_files [glob rtl/*/*.sv]
set_property top gpu_top [current_fileset]
synth_design

FPGA Targets

Board        Part        Recommended For
Basys 3      xc7a35t     Core + VGA output
Nexys A7     xc7a100t    Full pipeline
Zynq-7000    xc7z020     CPU+GPU integration
Kintex-7     xc7k325t    Ray tracing + tensor

Key Architecture Parameters

Parameter               Value
Warp size               32 threads
Concurrent warps        8
Registers per thread    32 x 32-bit
Shared memory           32KB, 32 banks
Global memory           1MB + 16KB L1 cache
Display                 640x480 @ 60Hz VGA
Texture size            256x256 RGB888
Instruction width       32-bit
Data width              32-bit (INT32 / FP32)
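
To make the shared-memory figures concrete, here is a small Python sketch of bank-conflict detection for one warp-wide access. It assumes the common layout in which consecutive 32-bit words map to consecutive banks, and it treats lanes that read the same word as a broadcast rather than a conflict. Both are assumptions, since the README does not describe the address mapping, and `bank_conflict_degree` is a hypothetical name.

```python
from collections import defaultdict
from typing import Iterable

NUM_BANKS = 32   # from the parameter table
WORD_BYTES = 4   # 32-bit data width from the parameter table


def bank_conflict_degree(byte_addrs: Iterable[int]) -> int:
    """Return the worst-case number of distinct words hitting one bank.

    1 means conflict-free; k > 1 means the access would need k serialized
    passes under the assumed word-interleaved mapping.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addrs:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


if __name__ == "__main__":
    unit_stride = [4 * lane for lane in range(32)]     # one word per bank
    stride_32 = [4 * 32 * lane for lane in range(32)]  # every lane hits bank 0
    print(bank_conflict_degree(unit_stride), bank_conflict_degree(stride_32))
```

Under these assumptions a unit-stride access touches each bank once, while a stride of 32 words sends all 32 lanes to the same bank.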

Project Phases

  • Phase 1: ALU + basic pipeline
  • Phase 2: SIMD execution engine (32-wide)
  • Phase 3: Warp scheduler (8 warps, round-robin)
  • Phase 4: Shared memory (banked, conflict detection)
  • Phase 5: Shader pipeline (vertex + fragment)
  • Phase 6: Rasterizer (scanline + barycentric)
  • Phase 7: Framebuffer (double-buffer, Z-test)
  • Phase 8: VGA output (640x480@60Hz)
  • Phase 9: Texture unit (bilinear filtering)
  • Phase 10: Custom GPU ISA (64 opcodes)
  • Phase 11: Parallel thread execution
  • Phase 12: Ray tracing (Möller-Trumbore)
  • Phase 13: Tensor/AI acceleration (4x4 systolic)
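
The barycentric interpolation mentioned in Phase 6 can be sketched in Python with edge functions. This is a simplified model under stated assumptions: pixels are sampled at their centers, the loop walks the triangle's bounding box row by row rather than computing exact scanline spans, and the names `edge`, `barycentric`, and `rasterize` are hypothetical rather than taken from the RTL.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Weights = Tuple[float, float, float]


def edge(a: Point, b: Point, p: Point) -> float:
    """Signed area (times two) of the triangle a, b, p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def barycentric(v0: Point, v1: Point, v2: Point, p: Point) -> Optional[Weights]:
    """Return weights (w0, w1, w2) if p is inside the triangle, else None."""
    area = edge(v0, v1, v2)
    if area == 0:                       # degenerate triangle
        return None
    w0 = edge(v1, v2, p) / area
    w1 = edge(v2, v0, p) / area
    w2 = edge(v0, v1, p) / area
    if w0 < 0 or w1 < 0 or w2 < 0:      # p lies outside at least one edge
        return None
    return (w0, w1, w2)


def rasterize(v0: Point, v1: Point, v2: Point) -> List[Tuple[int, int, Weights]]:
    """List (x, y, weights) for every pixel whose center is covered."""
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    covered = []
    for y in range(math.floor(min(ys)), math.ceil(max(ys))):      # one row at a time
        for x in range(math.floor(min(xs)), math.ceil(max(xs))):
            w = barycentric(v0, v1, v2, (x + 0.5, y + 0.5))
            if w is not None:
                covered.append((x, y, w))
    return covered


def interpolate(w: Weights, a0: float, a1: float, a2: float) -> float:
    """Blend a per-vertex attribute (depth, color channel, ...) with the weights."""
    return w[0] * a0 + w[1] * a1 + w[2] * a2
```

The same weights can interpolate any per-vertex attribute, which is how a rasterizer of this kind would feed depth values to a Z-test and colors to Gouraud shading.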

Learning Resources

This project demonstrates:

  • RTL design in a hardware description language (SystemVerilog)
  • GPU microarchitecture: SIMT execution, warp scheduling
  • Computer graphics pipeline: vertex processing, rasterization
  • Memory hierarchy design: cache, shared memory, texture
  • FPGA prototyping workflow
  • Functional verification with Python models
