michael-borck/gpu-onboard

gpu-onboard

NVIDIA GPU onboarding and validation tool for second-hand cards (900–4000 series, 2–24 GB VRAM).

Runs a structured test suite against every detected GPU and logs results to a cumulative CSV you can open in Excel, LibreOffice Calc, or pandas.

What it tests

Test         What it checks
-----------  ------------------------------------------------------------------
Static info  Name, VRAM, VBIOS, compute capability, PCIe slot width vs card max
Baseline     Idle temperature, fan speed, power draw, ECC error counters
VRAM fill    Fills available VRAM with a known pattern and verifies read-back
             (catches bad memory cells)
Stress test  Matrix-multiply loop for N seconds; monitors for thermal throttling
Benchmarks   VRAM bandwidth (GB/s), FP32/FP16 TFLOPS, PCIe H↔D bandwidth,
             each compared against reference specs
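
The fill-and-verify idea behind the VRAM test can be illustrated on ordinary host memory. This is only a sketch of the technique, not the code in gpu_onboard.py: the function names fill_pattern and find_mismatches and the 0xAA/0x55 pattern are assumptions made for illustration, and a bytearray stands in for GPU memory.

```python
from typing import List


def fill_pattern(buf: bytearray, pattern: bytes) -> None:
    """Overwrite buf in place with pattern repeated to its full length."""
    reps = len(buf) // len(pattern) + 1
    buf[:] = (pattern * reps)[: len(buf)]


def find_mismatches(buf: bytearray, pattern: bytes) -> List[int]:
    """Return the byte offsets where buf no longer matches the pattern."""
    return [i for i, b in enumerate(buf) if b != pattern[i % len(pattern)]]


# Hypothetical demo: a host buffer stands in for VRAM.
PATTERN = b"\xAA\x55"  # assumed pattern; alternating bits expose stuck cells
memory = bytearray(1024)
fill_pattern(memory, PATTERN)
memory[100] ^= 0x01  # simulate a single flipped bit
print(find_mismatches(memory, PATTERN))
```

A healthy buffer yields an empty list; any offset in the result points at a cell that did not read back what was written.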

Requirements

  • Ubuntu 22.04 LTS (or similar)
  • NVIDIA drivers installed (nvidia-smi in PATH)
  • Python 3.9+
  • uv for environment setup
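
A quick preflight check for these requirements might look like the sketch below. It is not part of the repository; check_prerequisites is a hypothetical helper that only looks for the executables on PATH and compares the running interpreter's version.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9)) -> dict:
    """Report whether each requirement from the list above appears to be met."""
    return {
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        "uv": shutil.which("uv") is not None,
        "python": sys.version_info >= min_python,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```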

Setup

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install dependencies (the first PyTorch download is about 2 GB)
./setup.sh

setup.sh auto-detects your CUDA version and picks the matching PyTorch wheel index.
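
How setup.sh performs this detection is not shown here. One plausible approach, sketched in Python under stated assumptions, is to read the "CUDA Version" field from the nvidia-smi header and choose the newest PyTorch wheel tag that does not exceed it. The WHEEL_TAGS table, the function names, and the CPU fallback are all assumptions for illustration, not the script's actual logic.

```python
import re
from typing import Optional, Tuple

# Assumed subset of PyTorch wheel tags; the real script may support others.
WHEEL_TAGS = {(11, 8): "cu118", (12, 1): "cu121", (12, 4): "cu124"}
BASE_URL = "https://download.pytorch.org/whl"


def parse_cuda_version(smi_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' header field."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(m.group(1)), int(m.group(2))) if m else None


def pick_wheel_index(version: Tuple[int, int]) -> str:
    """Pick the newest known tag that does not exceed the driver's CUDA version."""
    candidates = [v for v in WHEEL_TAGS if v <= version]
    if not candidates:
        return f"{BASE_URL}/cpu"  # assumed fallback when no CUDA tag fits
    return f"{BASE_URL}/{WHEEL_TAGS[max(candidates)]}"


# Sample header line in the shape nvidia-smi prints; the numbers are made up.
header = "| NVIDIA-SMI 550.54.15   Driver Version: 550.54.15   CUDA Version: 12.4 |"
print(pick_wheel_index(parse_cuda_version(header)))
```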

Usage

./run.sh                        # test all GPUs, 60 s stress
./run.sh --stress 120           # longer stress test
./run.sh --no-stress            # skip stress (quick ID + VRAM test only)
./run.sh --no-memtest           # skip VRAM fill test
./run.sh --no-bench             # skip bandwidth/TFLOPS benchmarks
./run.sh --gpu 0                # test one GPU by index
./run.sh --report               # also save gpu_report_<ts>.json/.txt

Or directly:

python gpu_onboard.py --help
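
For orientation, the flags listed above could be declared with argparse roughly as follows. This is a hedged sketch rather than the parser in gpu_onboard.py: build_parser, the destination names, and the help strings are assumptions; only the flag names and the 60 s default come from the usage lines above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line flags."""
    p = argparse.ArgumentParser(prog="gpu_onboard.py")
    p.add_argument("--stress", type=int, default=60, metavar="SECONDS",
                   help="stress test duration in seconds")
    p.add_argument("--no-stress", action="store_true", help="skip the stress test")
    p.add_argument("--no-memtest", action="store_true", help="skip the VRAM fill test")
    p.add_argument("--no-bench", action="store_true", help="skip the benchmarks")
    p.add_argument("--gpu", type=int, default=None, metavar="INDEX",
                   help="test one GPU by index (default: all GPUs)")
    p.add_argument("--report", action="store_true",
                   help="also save a JSON/text report")
    return p


args = build_parser().parse_args(["--stress", "120", "--gpu", "0"])
print(args.stress, args.gpu, args.no_bench)
```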

Log management

Results are appended to gpu_onboard_log.csv after every run. Use gpu_log.py to query it:

python gpu_log.py runs                        # list all test runs
python gpu_log.py runs --failed               # only failed runs
python gpu_log.py runs --arch turing --vram '>=8192'   # quoted so the shell does not treat > as a redirect
python gpu_log.py cards                       # one row per unique GPU (latest result)
python gpu_log.py cards --passed
python gpu_log.py report 20260416_005218      # reproduce full report from CSV
python gpu_log.py history GPU-34c746a6        # all runs for one card
python gpu_log.py delete run 20260416_005218  # remove a run
python gpu_log.py delete gpu GPU-34c746a6     # remove all rows for a card
python gpu_log.py clean                       # keep only latest result per GPU
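
Because the log is a plain CSV, it can also be processed without gpu_log.py. The sketch below shows the "latest result per GPU" idea using only the standard library. The column names run_id, gpu_uuid, and status are assumptions (check the header row of your own gpu_onboard_log.csv), and the sample rows are invented for illustration.

```python
import csv
import io

# Made-up sample data; real column names and values may differ.
SAMPLE = """run_id,gpu_uuid,status
20260416_005218,GPU-34c746a6,FAIL
20260417_101500,GPU-34c746a6,PASS
20260417_101500,GPU-9f00aa11,PASS
"""


def latest_per_gpu(rows):
    """Keep the row with the greatest run_id for each GPU.

    Timestamps shaped like YYYYMMDD_HHMMSS sort correctly as plain strings.
    """
    latest = {}
    for row in rows:
        current = latest.get(row["gpu_uuid"])
        if current is None or row["run_id"] > current["run_id"]:
            latest[row["gpu_uuid"]] = row
    return latest


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for uuid, row in latest_per_gpu(rows).items():
    print(uuid, row["run_id"], row["status"])
```

To run this against a real log, replace io.StringIO(SAMPLE) with an open file handle for gpu_onboard_log.csv and adjust the column names.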

Project layout

gpu_onboard.py       # main test script
gpu_log.py           # log query/management tool
setup.sh             # first-time venv + dependency setup
run.sh               # convenience wrapper (activates venv, runs gpu_onboard.py)
pyproject.toml       # project metadata and dependencies

License

MIT — see LICENSE.
