Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems

📘 Overview

This repository contains the Jupyter Notebook and supplementary materials for the paper "Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems." This study examines GPU resource utilization patterns on the Perlmutter supercomputer, highlighting temporal and spatial imbalances and their impact on workload efficiency.

📂 Dataset Locations

All datasets are stored under the directory /pscratch/sd/e/esencan/.

  • Compressed Raw DCGM Time Series Data: dcgm_extended_july_all.tar.gz, located directly under /pscratch/sd/e/esencan/.
  • Extracted Raw DCGM Time Series Data: /pscratch/sd/e/esencan/extracted_dcgm_jobs_data_additional_metrics/dcgm_2/, which contains one .pkl file for each day of July 2024.
  • SLURM Job Data: perlmutter_gpu_jobs_july_2024.csv, located under /pscratch/sd/e/esencan/.
  • Feature-Extracted DCGM Data: tsfresh_feature_extracted_all_jobs_minimal_features_corrected_node_id_mem_util.parquet, located under /pscratch/sd/e/esencan/.

To request access to these datasets, please contact Efe Sencan.

📂 Repository Structure

  • 📁 scripts/: Scripts for submitting jobs, extracting features, and workload analysis.
    • submit_jobs.sh: Bash script for daily feature extraction.
    • extract_features_day.py: Python script for time-series feature extraction.
    • workload_analysis.py: Script for analyzing workload patterns from extracted features.
  • 📁 notebooks/
    • NERSC_data_analysis.ipynb: Jupyter Notebook with core analysis and results.
    • inspect_dcgm_metrics.ipynb: Unzips .tar.gz files, loads .pkl files, and inspects columns.
    • feature_extraction.ipynb: Combines daily extracted features into a single Parquet file.
  • 📄 requirements.txt: Python package dependencies.
  • 📄 README.md: This file.

🛠️ Dependencies

  • Python 3.9+
  • Jupyter Notebook
  • pandas
  • numpy
  • matplotlib
  • seaborn
  • tsfresh (for time-series feature extraction)

Install dependencies using:

pip install -r requirements.txt

📝 Dataset

The study uses data from the Perlmutter supercomputer collected during July 2024:

  • DCGM Telemetry: Metrics collected every 10 seconds from four NVIDIA A100 GPUs per node.
  • SLURM Metadata: Job-level details including submission times, durations, and allocated resources.
  • Filtering: Only jobs under the 'regular' QoS were included, totaling 118,276 jobs.
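
The QoS filter described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the column names (job_id, qos, num_nodes) and the toy rows are assumptions, and the real perlmutter_gpu_jobs_july_2024.csv may use a different schema.

```python
import csv
import io


def filter_jobs_by_qos(rows, qos="regular", qos_column="qos"):
    """Keep only the job records whose QoS column equals `qos` (case-insensitive)."""
    return [row for row in rows if row.get(qos_column, "").strip().lower() == qos]


# Toy stand-in for the SLURM CSV; column names here are hypothetical.
sample_csv = io.StringIO(
    "job_id,qos,num_nodes\n"
    "1001,regular,4\n"
    "1002,debug,1\n"
    "1003,regular,16\n"
)
jobs = list(csv.DictReader(sample_csv))
regular_jobs = filter_jobs_by_qos(jobs)
print(len(regular_jobs), "of", len(jobs), "jobs use the regular QoS")
```

With pandas, the same filter would typically be a single boolean mask on the QoS column.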

📝 Workflow Overview

1️⃣ Inspecting DCGM Metrics

Since the raw time series dataset was provided as a compressed .tar.gz file, inspect_dcgm_metrics.ipynb was created to:

  • Unzip dcgm_extended_july_all.tar.gz into extracted_dcgm_jobs_data_additional_metrics/.
  • Load daily .pkl files from dcgm_2/.
  • Print available column names and inspect the data structure for one sample job.
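
As a rough sketch of these three steps (not the notebook's actual code), the snippet below unpacks an archive and reports the type and column names of one pickle. It uses only the standard library; the helper names are hypothetical, and the structure of the real daily .pkl files is an assumption. If a pickle holds a pandas DataFrame, pandas must be installed for pickle.load to succeed.

```python
import pickle
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the extracted .pkl paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)  # only extract archives from a trusted source
    return sorted(dest.rglob("*.pkl"))


def describe_pickle(pkl_path):
    """Return (type name, column names) for the object stored in one pickle file."""
    with open(pkl_path, "rb") as fh:
        obj = pickle.load(fh)  # only load pickles from a trusted source
    if hasattr(obj, "columns"):  # e.g. a pandas DataFrame
        columns = list(obj.columns)
    elif isinstance(obj, dict):
        columns = list(obj.keys())
    else:
        columns = []
    return type(obj).__name__, columns


# Hypothetical usage on Perlmutter (paths taken from the Dataset Locations section):
# files = extract_archive("/pscratch/sd/e/esencan/dcgm_extended_july_all.tar.gz",
#                         "extracted_dcgm_jobs_data_additional_metrics")
# print(describe_pickle(files[0]))
```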

🧩 Methodology

1️⃣ Feature Extraction

  • Used tsfresh to extract 17 statistical features from each DCGM metric.
  • Generated 408 feature columns from time-series telemetry.
  • Stored results in Parquet format for efficient analysis.
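
To illustrate how per-metric statistics turn into a wide table of feature columns, here is a small stand-in that does not call tsfresh. It computes five common statistics per metric and names each column "<metric>__<feature>", the naming pattern tsfresh uses. The five statistics and the metric names are assumptions for illustration; the 17 features used in the study are not listed in this README. If every series receives all 17 features, 408 columns would correspond to 24 metric series (408 / 17), though that breakdown is an inference rather than something stated here.

```python
import statistics

# Illustrative subset of per-series statistics (not the study's 17 features).
FEATURES = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "maximum": max,
    "minimum": min,
    "standard_deviation": statistics.pstdev,
}


def extract_job_features(series_by_metric):
    """Return one flat feature row of the form {"<metric>__<feature>": value}."""
    row = {}
    for metric, values in series_by_metric.items():
        for name, func in FEATURES.items():
            row[f"{metric}__{name}"] = float(func(values))
    return row


# Hypothetical metric names and toy samples for a single job.
job = {"gpu_util": [0, 50, 100, 50], "mem_util": [10, 10, 20, 20]}
row = extract_job_features(job)
print(len(row), "feature columns for", len(job), "metrics")
```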

📝 Feature Extraction Workflow

1️⃣ Submitting Jobs for Daily Feature Extraction

Because extracting features from the concatenated time series of all jobs is time-consuming, the extraction process is split by day.

  • Step 1: Navigate to the scripts/ directory.
  • Step 2: Make the script executable:
chmod +x submit_jobs.sh
  • Step 3: Submit jobs:
bash submit_jobs.sh

The submit_jobs.sh script submits Slurm jobs for daily feature extraction. Each job:

  • Loads the required Python environment (feature_extraction_env by default; update as needed).
  • Runs extract_features_day.py on daily .pkl files.
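
The per-day fan-out can be pictured with the dry-run sketch below, which only builds command lines and does not submit anything. It is not a transcription of submit_jobs.sh: the daily file naming scheme, the use of conda to activate the environment, and the assumption that extract_features_day.py takes a file name as its argument are all hypothetical. Only the sbatch options --job-name and --wrap are standard Slurm flags.

```python
from datetime import date, timedelta


def daily_sbatch_commands(year=2024, month=7, env="feature_extraction_env"):
    """Build one sbatch command (as an argument list) per calendar day of the month."""
    commands = []
    day = date(year, month, 1)
    while day.month == month:
        pkl_name = f"{day.isoformat()}.pkl"  # hypothetical file naming scheme
        # Hypothetical job body: activate the environment, then process one day.
        job = f"conda activate {env} && python extract_features_day.py {pkl_name}"
        commands.append(["sbatch", f"--job-name=feat_{day:%m%d}", f"--wrap={job}"])
        day += timedelta(days=1)
    return commands


commands = daily_sbatch_commands()
print(len(commands), "jobs would be submitted; first command:")
print(" ".join(commands[0]))
```

Passing each argument list to subprocess.run on a login node would perform the actual submission.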

2️⃣ Combining Daily Features

The notebook feature_extraction.ipynb:

  • Reads the extracted daily feature files.
  • Merges them into a single Parquet file for analysis.
  • Outputs the combined file to the specified directory.
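
A minimal sketch of the merge step with pandas is shown below. It assumes the daily outputs are Parquet files that share the same columns, which this README does not confirm, and the function and column names are hypothetical. Reading and writing Parquet additionally requires a Parquet engine such as pyarrow.

```python
import pandas as pd


def combine_daily_features(frames):
    """Stack per-day feature tables row-wise into one table with a fresh index."""
    return pd.concat(list(frames), ignore_index=True)


def combine_daily_files(paths, output_path):
    """Read per-day Parquet files and write the combined table to output_path."""
    combined = combine_daily_features(pd.read_parquet(p) for p in paths)
    combined.to_parquet(output_path, index=False)
    return combined


# Toy per-day tables with hypothetical column names.
day1 = pd.DataFrame({"job_id": [1, 2], "gpu_util__mean": [50.0, 80.0]})
day2 = pd.DataFrame({"job_id": [3], "gpu_util__mean": [65.0]})
combined = combine_daily_features([day1, day2])
print(combined.shape)
```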

2️⃣ Temporal Imbalance Analysis

  • Calculated temporal imbalance factors using:
    • Coefficient of Variation (CV)
    • Linear Trend (via linear regression slope)
    • Mean Absolute Change (MAC)
  • Combined these metrics into a merged temporal imbalance factor.
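
The three temporal metrics can be sketched in plain Python as follows. These are textbook definitions and may differ in detail from the paper's: the coefficient of variation uses the population standard deviation, the trend is the least-squares slope against the sample index, and returning 0.0 for degenerate inputs is a convention chosen here. How the three values are merged into a single factor is not specified in this README, so that step is left out.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return statistics.pstdev(values) / mean


def linear_trend_slope(values):
    """Least-squares slope of the values against sample index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def mean_absolute_change(values):
    """Average absolute difference between consecutive samples."""
    if len(values) < 2:
        return 0.0
    return statistics.fmean([abs(b - a) for a, b in zip(values, values[1:])])


# Toy GPU utilization traces (percent): a steady ramp and an oscillating job.
ramp = [1, 2, 3, 4]
oscillating = [0, 100, 0, 100]
for name, series in (("ramp", ramp), ("oscillating", oscillating)):
    print(name, coefficient_of_variation(series),
          linear_trend_slope(series), mean_absolute_change(series))
```

The oscillating trace has a much larger mean absolute change than the ramp, which is the kind of fluctuation these metrics are meant to expose.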

3️⃣ Spatial Imbalance Analysis

  • Computed intra-node and inter-node imbalance using:
    • Normalized Range (NR)
    • Variance of Utilization
  • Derived merged spatial imbalance factors for both intra-node and inter-node imbalances.
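
A possible reading of the two spatial metrics is sketched below. The normalization of the range by the maximum value is an assumption (other choices, such as dividing by the mean, are equally plausible), and the inputs are toy per-GPU and per-node mean utilizations rather than study data. The same function is applied to the GPUs within one node for intra-node imbalance and to node-level averages for inter-node imbalance.

```python
import statistics


def normalized_range(values):
    """(max - min) / max; normalizing by the maximum is an assumption made here."""
    peak = max(values)
    if peak == 0:
        return 0.0  # all devices idle: treat as perfectly balanced
    return (peak - min(values)) / peak


def spatial_imbalance(values):
    """Normalized range and population variance across devices or nodes."""
    return {
        "normalized_range": normalized_range(values),
        "variance": statistics.pvariance(values),
    }


# Intra-node: mean utilization (percent) of the four GPUs in one node.
intra = spatial_imbalance([90.0, 90.0, 45.0, 45.0])
# Inter-node: mean utilization (percent) of two nodes in the same job.
inter = spatial_imbalance([67.5, 67.5])
print("intra-node:", intra)
print("inter-node:", inter)
```

In this toy case the two nodes look identical on average even though the GPUs inside a node are unevenly loaded, so intra-node imbalance is visible while inter-node imbalance is zero.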

4️⃣ Correlation Analysis

  • Analyzed relationships between temporal and spatial imbalances, maximum utilization metrics, and job node hours.
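
The sketch below shows one way to build such a correlation matrix using the Pearson coefficient. Whether the study used Pearson or a rank-based coefficient is not stated in this README, so the choice is an assumption, as are the column names and the toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of {name: values}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy per-job values with hypothetical column names.
columns = {
    "temporal_imbalance": [0.1, 0.2, 0.3, 0.4],
    "max_gpu_util": [90, 80, 70, 60],
    "node_hours": [1, 2, 3, 4],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```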

📊 Results Highlights

  • GPU Utilization: 62.75% of jobs reach at least 75% GPU utilization, and these jobs account for 80.01% of node hours.
  • Memory Utilization: Low utilization observed, with 37.12% of jobs never exceeding 15% memory usage.
  • Temporal Imbalance: Minimal for most jobs, but a subset shows significant fluctuations.
  • Spatial Imbalance: More pronounced intra-node than inter-node disparities.
  • Correlation Findings: Stable memory utilization correlates with higher GPU efficiency.
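
To make the two kinds of percentages concrete, the sketch below computes the share of jobs that reach a utilization threshold and the share of node hours those jobs represent. The toy jobs are invented for illustration and are not study data, and using each job's peak utilization as the value compared against the threshold is an assumption.

```python
def utilization_shares(jobs, threshold=75.0):
    """Return (% of jobs, % of node hours) for jobs whose utilization >= threshold.

    `jobs` is a list of (utilization_percent, node_hours) pairs.
    """
    total_hours = sum(hours for _, hours in jobs)
    if not jobs or total_hours == 0:
        return 0.0, 0.0
    hits = [(util, hours) for util, hours in jobs if util >= threshold]
    job_share = 100.0 * len(hits) / len(jobs)
    hour_share = 100.0 * sum(hours for _, hours in hits) / total_hours
    return job_share, hour_share


# Toy jobs: (peak GPU utilization in percent, node hours).
toy_jobs = [(95.0, 40.0), (80.0, 40.0), (30.0, 10.0), (10.0, 10.0)]
job_share, hour_share = utilization_shares(toy_jobs)
print(f"{job_share:.1f}% of jobs, {hour_share:.1f}% of node hours")
```

The two shares differ whenever high-utilization jobs are larger or longer than average, which is why the results report both.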

🚀 How to Run the Notebook

jupyter notebook notebooks/NERSC_workload_analysis.ipynb

Follow the steps in the notebook to reproduce the analysis and generate figures.

📈 Example Outputs

  • Figure 1: Distribution of jobs and node hours for GPU and memory utilization.
  • Figure 2: CDF and PDF of temporal imbalance factors for GPU metrics.
  • Figure 3: Spatial imbalance CDFs and PDFs.
  • Figure 4: Correlation matrix showing relationships between imbalance factors and utilization metrics.

🎯 Citation

If you use this repository for your research, please cite our paper:

@incollection{sencan2025analyzing,
  title={Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems},
  author={Sencan, Efe and Kulkarni, Dhruva and Coskun, Ayse and Konate, Kadidia},
  booktitle={Practice and Experience in Advanced Research Computing 2025: The Power of Collaboration},
  pages={1--8},
  year={2025}
}

💬 Contact

For any questions, please contact:

  • Author: Efe Sencan
  • Email: esencan@bu.edu
  • Affiliation: Boston University / National Energy Research Scientific Computing Center (NERSC)

License: This repository is licensed under the MIT License. See LICENSE for details.
