LLM Inference Optimization on Multiple Nodes and GPUs

Project Description

This is the final project for the High Performance and Scalable Computing Spring class at Seoul National University (SNU). The objective is efficient, scalable inference on a GPT-2 model using 16 GPUs across 4 nodes. The implementation uses CUDA for GPU computation and MPI for communication between nodes.

Introduction

This project implements a distributed GPT-2 inference engine. It runs on 4 nodes with 4 GPUs each and targets both the performance and the scalability of large language model (LLM) inference. The implementation consists of hand-written CUDA kernels, with MPI handling inter-node communication. With the current code, throughput reaches 20,000 tokens per second.
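
To put the throughput figure in context, the arithmetic behind it can be sketched as below. This is a minimal illustration, not code from the repository: the function names are hypothetical, and it assumes throughput is measured as generated tokens divided by wall-clock time, which this README does not specify.

```cpp
#include <cassert>

// Hypothetical helper: aggregate throughput in tokens per second,
// assuming it is defined as tokens generated over wall-clock seconds.
double tokens_per_second(long long total_tokens, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return static_cast<double>(total_tokens) / elapsed_seconds;
}

// Hypothetical helper: average share of the aggregate throughput per GPU.
double per_gpu_throughput(double aggregate_tokens_per_second, int num_gpus) {
    assert(num_gpus > 0);
    return aggregate_tokens_per_second / num_gpus;
}
```

Under that assumption, 20,000 tokens per second spread over 16 GPUs averages 1,250 tokens per second per GPU.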

Prerequisites

Before you begin, ensure you have the following libraries and tools installed:

MPI Library (e.g., OpenMPI)
CUDA Toolkit

Make sure the installed NVIDIA driver supports your CUDA Toolkit version, and that the CUDA compiler (nvcc) and the MPI launcher (mpirun) are available on your PATH on every node.

Installation

Setting up MPI

To install OpenMPI on your system, use the following commands:

For Ubuntu:

sudo apt update
sudo apt install openmpi-bin openmpi-common libopenmpi-dev

For CentOS:

sudo yum install openmpi openmpi-devel

Setting up CUDA

Follow the instructions on the NVIDIA CUDA Toolkit website to download and install the appropriate version for your system.

Usage

Cloning the Repository

First, clone the repository to your local machine:

git clone https://github.com/yourusername/multi_gpu_llm_inference.git
cd multi_gpu_llm_inference

Running the Inference

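To run the GPT-2 inference on multiple nodes, use the following command. The -np 16 option starts 16 MPI processes, matching the 16 GPUs (4 nodes with 4 GPUs each), and the hostfile lists the nodes to launch on: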

mpirun -np 16 --hostfile hostfile ./run_inference.sh
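
How the 16 processes map onto GPUs and share the work is not described in this README. The sketch below shows one common arrangement, under three assumptions: one MPI process per GPU, ranks placed node by node (ranks 0 to 3 on the first node, 4 to 7 on the second, and so on), and a data-parallel split of the input prompts. The names Placement, place_rank, Range, and rank_range are hypothetical and do not come from the repository.

```cpp
#include <cassert>

// Hypothetical: where a rank runs, assuming node-by-node rank placement.
struct Placement {
    int node;       // 0-based index of the node hosting this rank
    int local_gpu;  // 0-based GPU index on that node
};

Placement place_rank(int rank, int gpus_per_node) {
    assert(rank >= 0 && gpus_per_node > 0);
    return Placement{rank / gpus_per_node, rank % gpus_per_node};
}

// Half-open range [begin, end) of work items owned by one rank.
struct Range {
    int begin;
    int end;
};

// Hypothetical: split num_items as evenly as possible across world_size
// ranks; the first (num_items % world_size) ranks get one extra item.
Range rank_range(int rank, int world_size, int num_items) {
    assert(world_size > 0 && rank >= 0 && rank < world_size && num_items >= 0);
    int base = num_items / world_size;
    int extra = num_items % world_size;
    int begin = rank * base + (rank < extra ? rank : extra);
    int end = begin + base + (rank < extra ? 1 : 0);
    return Range{begin, end};
}
```

In an actual MPI and CUDA program, the rank would come from MPI_Comm_rank and the local GPU index would be passed to cudaSetDevice before any kernels are launched. With 100 prompts and 16 ranks, for example, ranks 0 to 3 would each take 7 prompts and the remaining ranks 6 each.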

License

This project is licensed under the MIT License. See the LICENSE file for details.
