This repository contains a complete hardware implementation of a 2D Convolution Accelerator designed for high-performance image processing and deep learning applications. The accelerator is built using an 8×8 Systolic Array architecture and implements a streaming coprocessor model for efficient matrix convolution operations.
Course: CMP3020 - VLSI Design
Institution: Cairo University Faculty of Engineering (CUFE), Computer Engineering Department
Architecture: Domain-Specific Accelerator for 2D Convolution
Target Technology: Sky130 PDK (130nm)
- 8×8 Systolic Array with 64 Processing Elements (PEs)
- 32KB On-Chip SRAM with ping-pong buffering for continuous operation
- Configurable Matrix Sizes: 16×16 to 64×64 input matrices
- Flexible Kernel Support: 2×2 to 16×16 convolution kernels
- 8-bit Unsigned Integer arithmetic with 32-bit internal accumulation
- AXI-Stream-like Interface with Valid/Ready handshake protocol
- Weight Stationary Dataflow for optimal energy efficiency
- ✅ Modular Architecture: Separate control, memory, and compute subsystems
- ✅ Memory Efficiency: Ping-pong buffering hides DRAM access latency
- ✅ Address Generation Unit (AGU): Handles 2D-to-1D address mapping with sliding windows
- ✅ Tiling Support: Processes large matrices in hardware-sized blocks
- ✅ Handshake Protocols: Backpressure-aware data streaming
- ✅ Comprehensive Testbenches: Self-checking verification environment
- ✅ Golden Model: Python reference implementation for validation
┌─────────────────────────────────────────────────────────────────┐
│                   Convolution Accelerator Top                   │
│                                                                 │
│  ┌──────────────┐      ┌──────────────┐      ┌─────────────┐    │
│  │   Control    │      │    Memory    │      │  Systolic   │    │
│  │     Unit     │◄────►│  Controller  │◄────►│    Array    │    │
│  │    (FSM)     │      │ (Ping-Pong)  │      │    (8×8)    │    │
│  └──────┬───────┘      └──────┬───────┘      └──────┬──────┘    │
│         │                     │                     │           │
│         │                     │                     │           │
│  ┌──────▼─────────────────────▼─────────────────────▼───────┐   │
│  │       Data Loader & Address Generation Unit (AGU)        │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                  SRAM Buffer (32KB Max)                  │   │
│  │           (Sky130 1rw1r Pseudo-Dual Port SRAM)           │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────┬─────────────────────────────────┬─────────────────────┘
          │                                 │
rx_data (8-bit input)            tx_data (8-bit output)
  rx_valid/rx_ready                 tx_valid/tx_ready
- 64 identical Processing Elements (PEs)
- Each PE performs Multiply-Accumulate (MAC) operations
- Weight Stationary dataflow: weights preloaded, pixels stream through
- Pipeline depth: 15 cycles (ROWS + COLS - 1)
- Location: rtl/core/systolic_array.v, rtl/core/processing_element.v
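
As a rough behavioural illustration of the MAC operation and pipeline depth listed above, the Python sketch below models a single PE. The class name `ProcessingElement`, its methods, the local accumulation of partial sums, and the wrap-around of the 32-bit accumulator are all assumptions made for illustration; the RTL in rtl/core/processing_element.v remains the authority.

```python
class ProcessingElement:
    """Behavioural model of one weight-stationary MAC unit (hypothetical API)."""

    ACC_MASK = 0xFFFFFFFF  # 32-bit accumulator; wrap-around on overflow is assumed

    def __init__(self):
        self.weight = 0  # preloaded once, then held (weight stationary)
        self.acc = 0     # running partial sum

    def load_weight(self, weight):
        if not 0 <= weight <= 0xFF:
            raise ValueError("weight must be an unsigned 8-bit value")
        self.weight = weight

    def clear(self):
        self.acc = 0

    def mac(self, pixel):
        """Multiply the streamed pixel by the held weight and accumulate."""
        if not 0 <= pixel <= 0xFF:
            raise ValueError("pixel must be an unsigned 8-bit value")
        self.acc = (self.acc + pixel * self.weight) & self.ACC_MASK
        return self.acc


def pipeline_depth(rows=8, cols=8):
    """Fill latency of a rows x cols systolic array: ROWS + COLS - 1."""
    return rows + cols - 1
```

For example, after `load_weight(3)`, calling `mac(10)` and then `mac(255)` leaves 795 in the accumulator, and `pipeline_depth()` evaluates to the 15 cycles quoted above.
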
- Manages 32KB on-chip SRAM (configurable: 4KB-32KB)
- Ping-pong buffering: simultaneous read from one buffer, write to another
- 3-cycle read latency handling
- Integrates Sky130 SRAM hard macros (1rw1r configuration)
- Location: rtl/mem/memory_controller.v
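
The ping-pong idea can be sketched in a few lines of Python: one bank is filled while the other is read, and the roles swap between tiles. The class `PingPongBuffer` and its method names are hypothetical, and the sketch deliberately ignores the 3-cycle read latency and the actual SRAM macro interface.

```python
class PingPongBuffer:
    """Two equally sized banks: one is written while the other is read."""

    def __init__(self, bank_size):
        self.banks = [[0] * bank_size, [0] * bank_size]
        self.write_sel = 0  # index of the bank currently being filled

    @property
    def read_sel(self):
        # The read side always uses the bank that is not being written.
        return 1 - self.write_sel

    def write(self, addr, value):
        self.banks[self.write_sel][addr] = value & 0xFF

    def read(self, addr):
        return self.banks[self.read_sel][addr]

    def swap(self):
        """Exchange the roles of the two banks (e.g. at a tile boundary)."""
        self.write_sel = 1 - self.write_sel
```

Data written before a `swap()` becomes visible to `read()` only after the swap, which is what lets loading of the next tile overlap with computation on the current one.
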
- Main FSM orchestrator with states: IDLE, LOAD_INPUT, LOAD_WEIGHT, COMPUTE, DRAIN, DONE
- Configuration management (matrix size N, kernel size K)
- Handshake protocol controller
- Coordinates between memory, AGU, and systolic array
- Location: rtl/control/control_unit.v
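
The state list above can be illustrated with a small Python state machine. The state names come from this README; the linear ordering of the transitions, the `start` and `phase_done` inputs, and the automatic return from DONE to IDLE are assumptions and may differ from rtl/control/control_unit.v.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LOAD_INPUT = auto()
    LOAD_WEIGHT = auto()
    COMPUTE = auto()
    DRAIN = auto()
    DONE = auto()


# Order in which the busy states are visited (assumed to follow the list above).
_SEQUENCE = [State.LOAD_INPUT, State.LOAD_WEIGHT, State.COMPUTE,
             State.DRAIN, State.DONE]


def next_state(state, start=False, phase_done=False):
    """Return the next FSM state for one clock cycle.

    start      -- hypothetical one-cycle pulse that launches a run
    phase_done -- hypothetical flag raised when the current phase finishes
    """
    if state is State.IDLE:
        return State.LOAD_INPUT if start else State.IDLE
    if state is State.DONE:
        return State.IDLE  # assumed: DONE lasts a single cycle
    if not phase_done:
        return state       # stay in the current phase until it completes
    return _SEQUENCE[_SEQUENCE.index(state) + 1]
```

Stepping this function with `start=True` once and `phase_done=True` four times walks from IDLE through every phase to DONE.
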
- Converts 2D coordinates (x, y) to linear memory addresses
- Implements sliding window patterns for convolution tiling
- Handles "halo" pixels for edge cases: Input Size = Array Size + (K-1)
- Generates non-sequential access patterns efficiently
- Location: rtl/control/address_generator.v
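
A minimal Python sketch of the address arithmetic described above follows. Row-major storage, the `base` offset, and all function names are assumptions for illustration; only the halo formula (Input Size = Array Size + (K-1)) is taken from this README.

```python
def linear_address(x, y, n, base=0):
    """Map 2D coordinates (x, y) of an n-wide matrix to a linear address.

    Row-major layout is assumed: consecutive x values are adjacent in memory.
    """
    return base + y * n + x


def input_tile_size(array_size, k):
    """Input pixels needed per dimension for one output tile, halo included."""
    return array_size + (k - 1)


def window_addresses(x0, y0, k, n, base=0):
    """Addresses of the k x k sliding window whose top-left corner is (x0, y0)."""
    if x0 < 0 or y0 < 0 or x0 + k > n or y0 + k > n:
        raise ValueError("window falls outside the n x n matrix")
    return [linear_address(x0 + dx, y0 + dy, n, base)
            for dy in range(k)
            for dx in range(k)]
```

With an 8-wide array and a 3×3 kernel, `input_tile_size(8, 3)` gives 10, so each 8×8 output tile needs a 10×10 input window. The jump from the end of one window row to the start of the next is what makes the access pattern non-sequential.
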
- Manages data movement between external DRAM and on-chip buffers
- Implements Valid/Ready handshake protocol
- Handles format conversions (8-bit ↔ 32-bit)
- Location: rtl/control/data_loader.v
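
One possible reading of the 8-bit to 32-bit conversion is packing four stream bytes into one 32-bit memory word. The Python sketch below shows that interpretation only; the function names, the zero padding of the last word, and the byte order (first byte in bits 7:0) are assumptions, not a description of rtl/control/data_loader.v.

```python
def pack_bytes(data):
    """Pack 8-bit values into 32-bit words, first byte in bits 7:0 (assumed)."""
    words = []
    for i in range(0, len(data), 4):
        chunk = list(data[i:i + 4])
        chunk += [0] * (4 - len(chunk))  # zero-pad a partial final word
        word = 0
        for lane, byte in enumerate(chunk):
            word |= (byte & 0xFF) << (8 * lane)
        words.append(word)
    return words


def unpack_words(words, count):
    """Inverse of pack_bytes: return the first `count` bytes of the words."""
    out = []
    for word in words:
        for lane in range(4):
            out.append((word >> (8 * lane)) & 0xFF)
    return out[:count]
```

Packing followed by unpacking with the original length returns the original byte sequence, which makes the pair easy to check in a testbench helper.
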
VLSI_Project/
├── README.md                             # This file - comprehensive project description
├── QUICK_START.md                        # Quick setup and simulation guide
├── BUGFIX_SUMMARY.md                     # Integration bug fixes documentation
├── project.txt                           # Full project specification document
├── project_doc.pdf                       # Project documentation (PDF)
│
├── rtl/                                  # RTL source files (Verilog)
│   ├── convolution_accelerator_top.v     # Top-level module
│   ├── accelerator_integration.v         # Memory + Systolic array integration
│   ├── core/                             # Compute core modules
│   │   ├── systolic_array.v              # 8×8 systolic array
│   │   ├── processing_element.v          # Single PE (MAC unit)
│   │   ├── README.md                     # Core architecture documentation
│   │   ├── Stage1_IOs.md                 # Stage 1 I/O specifications
│   │   └── systolic_array_handshake.md   # Handshake protocol details
│   ├── control/                          # Control and data management
│   │   ├── control_unit.v                # Main FSM controller
│   │   ├── address_generator.v           # AGU for 2D→1D mapping
│   │   ├── data_loader.v                 # Data streaming controller
│   │   └── README.md                     # Control subsystem docs
│   ├── mem/                              # Memory subsystem
│   │   ├── memory_controller.v           # Ping-pong buffer controller
│   │   └── README.md                     # Memory architecture docs
│   ├── tb/                               # Testbenches
│   │   ├── tb_processing_element.v       # PE unit tests
│   │   ├── tb_systolic_array.v           # Array tests
│   │   ├── tb_memory_controller.v        # Memory tests
│   │   ├── tb_control_unit.v             # FSM tests
│   │   ├── tb_accelerator_integration.v  # Integration tests
│   │   └── tb_full_system.v              # Full system tests
│   ├── INTEGRATION_README.md             # Integration guide
│   └── README.md                         # RTL directory overview
│
├── scripts/                              # Automation scripts
│   ├── golden_model_conv2d.py            # Python reference model
│   ├── python/                           # Python utilities
│   │   ├── expected_out.txt              # Expected outputs
│   │   ├── results_hw.txt                # Hardware results
│   │   └── README.md
│   ├── sim/                              # Simulation scripts
│   │   └── README.md
│   └── utils/                            # Utility scripts
│       └── README.md
│
├── sim/                                  # Simulation working directory
│   └── run_integration.do                # ModelSim/QuestaSim script
│
├── third_party/                          # Third-party IP
│   ├── sram_macros/                      # Sky130 SRAM models
│   │   ├── sky130_sram_1kbyte_1rw1r_32x256_8.v
│   │   └── README.md
│   └── README.md
│
├── test_cases/                           # Golden test vectors
│   ├── 01_Basic_Minimal_*.hex            # Basic 2×2 kernel test
│   ├── 02_Basic_Identity_*.hex           # Identity kernel test
│   ├── 03_Basic_AllOnes_*.hex            # All-ones kernel test
│   ├── 04_Regular_Standard_*.hex         # Standard 3×3 kernel
│   ├── 05_Regular_LargeHalo_*.hex        # Large kernel test
│   ├── 06_Regular_PingPong_*.hex         # Ping-pong buffer test
│   ├── 07_Adv_MaxSpec_*.hex              # Maximum size (64×64)
│   ├── 08_Adv_Throughput_*.hex           # Throughput test
│   ├── 09_Pro_PartialTile_*.hex          # Partial tile handling
│   └── 10_Pro_Saturation_*.hex           # Output saturation test
│
├── config/                               # OpenLane configuration
│   ├── openlane/                         # Synthesis configs
│   └── README.md
│
└── final/                                # Final implementation outputs
    └── README.md                         # Final deliverables info
Hardware Simulation:
- ModelSim/QuestaSim (for Verilog simulation)
- Icarus Verilog (alternative simulator)
Software Reference:
- Python 3.7+ with NumPy
Physical Design (Optional):
- OpenLane flow
- Sky130 PDK
- Clone the Repository:
  git clone https://github.com/Uderscore/VLSI_Project.git
  cd VLSI_Project
- Run Basic Simulation:
  cd sim
  vsim -do run_integration.do
- Generate Golden Model:
  cd scripts
  python golden_model_conv2d.py
For detailed setup instructions, see QUICK_START.md.
| Parameter | Min | Max | Type |
|---|---|---|---|
| Input Matrix (N×N) | 16×16 | 64×64 | Variable |
| Kernel Size (K×K) | 2×2 | 16×16 | Variable |
| Stride | 1 | 1 | Fixed |
| Padding | 0 | 0 | Fixed |
| Input/Weight Precision | 8-bit | 8-bit | Unsigned |
| Internal Accumulation | 32-bit | 32-bit | Fixed Point |
| Output Precision | 8-bit | 8-bit | Unsigned (Truncated) |
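
To make the parameters in this table concrete, here is a minimal pure-Python reference for a stride-1, zero-padding convolution with 32-bit accumulation and 8-bit output. It is a sketch under stated assumptions, not the contents of scripts/golden_model_conv2d.py: the function name is hypothetical, the kernel is applied without flipping (cross-correlation, as is common in deep learning), and "truncated" is interpreted as keeping the low 8 bits of the accumulator.

```python
def conv2d_reference(image, kernel):
    """Stride-1, padding-0 2D convolution on square lists of 8-bit values.

    Returns an (N-K+1) x (N-K+1) list of lists of 8-bit results.
    """
    n, k = len(image), len(kernel)
    if k > n:
        raise ValueError("kernel must not be larger than the input")
    out_size = n - k + 1
    out = []
    for oy in range(out_size):
        row = []
        for ox in range(out_size):
            acc = 0
            for ky in range(k):
                for kx in range(k):
                    acc += image[oy + ky][ox + kx] * kernel[ky][kx]
            acc &= 0xFFFFFFFF       # 32-bit internal accumulation
            row.append(acc & 0xFF)  # assumed truncation: keep the low 8 bits
        out.append(row)
    return out
```

With stride 1 and no padding, a 64×64 input and a 16×16 kernel produce a 49×49 output, since the output dimension is N - K + 1.
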
| Resource | Min | Max | Notes |
|---|---|---|---|
| On-Chip Memory | 4 KB | 32 KB | Total SRAM |
| Systolic Array | 4×4 | 8×8 | Processing Elements |
| Register Size | 8-bit | 32-bit | Datapath registers |
| External Bus | 8-bit | 32-bit | DRAM interface |
| Internal Bus | 8-bit | 128-bit | SRAM → Array |
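
As a back-of-the-envelope check on the memory budget, the Python sketch below estimates the bytes needed for one input, one kernel, and one output matrix, and how many 1 KB SRAM macros a given capacity corresponds to. The macro size follows from the name sky130_sram_1kbyte_1rw1r_32x256_8 (256 words of 32 bits). The function names and the choice of what counts as the working set are assumptions; how the design actually partitions its SRAM is not stated here.

```python
import math

# 256 words x 32 bits = 1024 bytes per macro.
MACRO_BYTES = 32 * 256 // 8


def working_set_bytes(n, k):
    """Bytes for an N x N input, a K x K kernel and the 8-bit output matrix."""
    out = n - k + 1
    return n * n + k * k + out * out


def macros_needed(total_bytes):
    """Number of 1 KB macros required to hold total_bytes."""
    return math.ceil(total_bytes / MACRO_BYTES)
```

Under these assumptions the largest configuration (N = 64, K = 16) needs 6753 bytes for a single working set, and a full 32 KB of SRAM corresponds to 32 macros.
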
| Signal | Direction | Width | Description |
|---|---|---|---|
| clk | Input | 1 | System clock |
| rst_n | Input | 1 | Active-low async reset |
| start | Input | 1 | Begin computation pulse |
| cfg_N | Input | 7 | Input matrix dimension N |
| cfg_K | Input | 5 | Kernel dimension K |
| done | Output | 1 | Computation complete |
| rx_data | Input | 8-32 | Input data stream |
| rx_valid | Input | 1 | Input data valid |
| rx_ready | Output | 1 | Ready to accept input |
| tx_data | Output | 8-32 | Output data stream |
| tx_valid | Output | 1 | Output data valid |
| tx_ready | Input | 1 | Ready to accept output |
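
The valid/ready pairs in this table follow the usual streaming rule: a beat is transferred only in a cycle where both valid and ready are high. The cycle-level Python sketch below illustrates that rule for a producer that always has data available. The function name and the list-of-booleans representation of the ready signal are assumptions made for illustration.

```python
def stream_transfer(data, ready_pattern):
    """Simulate a valid/ready stream and return (received items, cycles used).

    data          -- items the producer wants to send, in order
    ready_pattern -- consumer's ready signal, one boolean per cycle
    The producer keeps valid high and holds the current item until it is
    accepted, so a low ready simply stalls the stream (backpressure).
    """
    received = []
    index = 0
    cycles = 0
    for ready in ready_pattern:
        if index >= len(data):
            break              # everything has been sent
        cycles += 1
        valid = True           # producer always has data in this sketch
        if valid and ready:    # handshake: transfer happens this cycle
            received.append(data[index])
            index += 1
    return received, cycles
```

Sending three items against a ready pattern of high, low, high, high takes four cycles, because the second item is held for one extra cycle while ready is low.
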
The project includes 10 comprehensive test cases covering:
- Basic Tests (01-03): Minimal kernels, identity operations, edge cases
- Regular Tests (04-06): Standard convolutions, large halos, ping-pong buffering
- Advanced Tests (07-08): Maximum specifications, throughput validation
- Professional Tests (09-10): Partial tiles, saturation handling
A Python reference implementation generates expected outputs:
python scripts/golden_model_conv2d.py
Results are compared against the hardware outputs with a tolerance of ±1 LSB to account for fixed-point rounding.
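
A tolerance check of this kind can be sketched as follows. The function name and the use of plain integer lists are assumptions; the actual formats of expected_out.txt and results_hw.txt are not described here, so file parsing is left out.

```python
def mismatches(expected, actual, tolerance=1):
    """Return (index, expected, actual) for every value outside the tolerance.

    tolerance is in LSBs of the 8-bit output; 1 matches the +/-1 LSB rule.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    return [(i, e, a)
            for i, (e, a) in enumerate(zip(expected, actual))
            if abs(e - a) > tolerance]
```

An empty result means the hardware output matches the golden model within the tolerance; otherwise each tuple points at the first place to look in the waveform.
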
# Run integration testbench
cd sim
vsim -do run_integration.do
# Run full system test
vsim -do run_full_system.do
# Run specific module tests
cd rtl/tb
vsim tb_systolic_array -do "run -all"
- Throughput: up to 64 MACs per cycle (64 PEs × 1 MAC/cycle)
- Latency: ~15 cycles (pipeline fill) + N²/64 cycles (computation)
- Memory Bandwidth: Up to 128 bits/cycle internal
- Power: Clock gating for idle PEs
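
The latency figure above can be turned into a quick estimate. The Python sketch below simply evaluates the stated formula (pipeline fill of ROWS + COLS - 1 cycles plus N²/64 compute cycles); rounding the compute term up to a whole cycle is an assumption, and the estimate ignores kernel size, memory stalls, and drain time.

```python
import math


def estimated_cycles(n, rows=8, cols=8):
    """Rough cycle count for an N x N input on a rows x cols array."""
    fill = rows + cols - 1                         # pipeline fill latency
    compute = math.ceil(n * n / (rows * cols))     # N^2 / 64 for an 8x8 array
    return fill + compute
```

For the largest supported input, `estimated_cycles(64)` gives 15 + 64 = 79 cycles, and for the smallest, `estimated_cycles(16)` gives 15 + 4 = 19 cycles.
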
- Area Optimization:
  - Counter bit-width reduction
  - Resource sharing between PEs
  - Minimal state machine complexity
- Power Optimization:
  - Clock gating for idle PEs during halo loading
  - Efficient memory access patterns
  - Reduced switching activity
- Timing Optimization:
  - Pipeline balancing
  - Critical path analysis
  - Maximum operating frequency tuning
See BUGFIX_SUMMARY.md for detailed bug reports and resolutions, including:
- ✅ Multiple driver conflicts resolved
- ✅ Ping-pong buffer switching corrected
- ✅ Handshake protocol timing fixed
- ✅ Testbench timeout protection added
- QUICK_START.md: Quick setup and simulation guide
- BUGFIX_SUMMARY.md: Integration debugging guide
- project.txt: Complete project specification (14 pages)
- project_doc.pdf: Official project documentation
- rtl/INTEGRATION_README.md: Detailed integration guide
- rtl/README.md: RTL architecture overview
- rtl/core/README.md: Systolic array details
- rtl/control/README.md: Control subsystem docs
- rtl/mem/README.md: Memory controller docs
Systolic Arrays:
- Portland State University - Systolic Arrays (PDF)
- NJIT - Matrix Multiplication Visualization (Video)
This is an academic project for CMP3020 - VLSI Design course. Team members are responsible for:
- Functional Verification (10 marks): Golden model matching with ±0.1 precision
- Performance Optimization (5 marks): PPA metrics ranking
- Personal Contribution (5 marks): Individual component ownership
Team Size: 7-8 members
Deadline: Week 13
Each team member should contribute to specific components:
- PE design and verification
- Systolic array assembly
- Memory controller integration
- AGU implementation
- Control FSM development
- Testbench development
- Documentation
- Processing Element (PE) design
- 8×8 Systolic Array
- Memory Controller with ping-pong buffering
- SRAM macro integration
- Control Unit FSM
- Address Generation Unit (AGU)
- Data Loader
- Top-level integration
- Comprehensive testbenches
- Golden model (Python)
- 10 test cases with golden vectors
- Bug fixes for integration issues
- OpenLane synthesis and place-and-route
- Power analysis and optimization
- Clock gating implementation
- Final GDS-II generation
- PPA metrics optimization
- Additional test cases
- Performance benchmarking
Course: CMP3020 - VLSI Design
Institution: Cairo University Faculty of Engineering (CUFE)
Department: Computer Engineering
Instructor: Muhammad Sayed
For technical questions or issues:
- Check existing documentation in the repository
- Review testbench outputs and waveforms
- Consult BUGFIX_SUMMARY.md for common issues
- Contact team members or instructor
This project is part of academic coursework for CMP3020 - VLSI Design at Cairo University Faculty of Engineering. All rights reserved.
Primary Objective: Design a functional 2D convolution accelerator that matches the golden model output within ±0.1 precision.
Secondary Objectives:
- Achieve competitive PPA (Power, Performance, Area) metrics
- Demonstrate modular and extensible architecture
- Implement industry-standard design practices
- Create comprehensive verification environment
Bonus Opportunities:
- Advanced dataflow analysis (Input vs Weight Stationary)
- Sophisticated memory banking strategies
- Automated regression testing suite
- Novel architectural optimizations
Last Updated: January 2026
Repository: https://github.com/Uderscore/VLSI_Project
Branch: copilot/describe-repo-details