GTaP is a directive-based fork-join task-parallel runtime system for GPUs, implemented in CUDA C++. It consists of:
- A header-only runtime library
- A Clang-based compiler extension that translates GTaP directives into CUDA device code
GTaP enables structured fork-join parallelism directly on GPUs using a pragma-based programming model.
🔬 GTaP is a research prototype under active development. Interfaces and internal mechanisms may evolve over time.
- Fork-join task parallelism on GPUs:
  Programmers express fork-join parallelism using `#pragma gtap task` and `#pragma gtap taskwait`. GTaP realizes it by representing each task function as a switch-statement-based state machine. The Clang extension automatically generates these state machines and manages task data across join points.
- Two granularities:
  GTaP supports two execution modes: thread-executed (thread-level workers) and block-cooperative (block-level workers). In thread-executed mode, a task function runs on a single CUDA thread and is written like ordinary sequential code. In block-cooperative mode, a task function runs cooperatively on all threads of one thread block; programmers write it in a GPU-style data-parallel manner using `threadIdx`/`blockIdx`. The runtime provides `gtap_thread.cuh` and `gtap_block.cuh` for these modes.
- Execution-path-aware queueing (EPAQ):
  Programmers can optionally specify a queue index with `#pragma gtap task queue(expr)` (at spawn) or `#pragma gtap taskwait queue(expr)` (at re-entry after a join). This allows tasks that are expected to follow different execution paths to be separated before they run.
- Task schedulers:
  GTaP uses randomized work-stealing. In thread-executed mode, a warp acquires up to 32 runnable tasks via a warp-cooperative batched pop/steal.
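
To make the switch-statement-based state machine idea concrete, here is a minimal host-side sketch in plain C++ for a Fibonacci task. It is an illustration under stated assumptions, not the code that the Clang extension generates: `FibFrame`, `spawn`, `step`, and `run_fib` are hypothetical names, the frame layout is invented for this example, and a sequential LIFO loop stands in for the GPU work-stealing scheduler.

```cpp
#include <vector>

// Hypothetical task frame (not GTaP's actual layout): the resume state plus
// every value that must stay alive across the join point.
struct FibFrame {
    int  n        = 0;       // task argument
    int  state    = 0;       // switch case to resume at
    int  pending  = 0;       // spawned children that have not finished yet
    long child[2] = {0, 0};  // child results, live across the join
    int  parent   = -1;      // index of the parent frame, -1 for the root
    int  slot     = 0;       // which child[] entry of the parent to fill
    long result   = 0;       // value produced by this task
};

// Creates a frame and marks it runnable; plays the role of a task spawn.
static void spawn(std::vector<FibFrame>& pool, std::vector<int>& ready,
                  int n, int parent, int slot) {
    FibFrame f;
    f.n = n;
    f.parent = parent;
    f.slot = slot;
    pool.push_back(f);
    ready.push_back(static_cast<int>(pool.size()) - 1);
}

// Runs one segment of the task. Returns true when the task has finished and
// false when it has suspended at the join.
static bool step(std::vector<FibFrame>& pool, std::vector<int>& ready, int id) {
    switch (pool[id].state) {
    case 0: {                      // entry segment, up to the join
        const int n = pool[id].n;  // copy first: spawn() may reallocate pool
        if (n < 2) { pool[id].result = n; return true; }
        pool[id].pending = 2;
        pool[id].state = 1;        // re-enter at case 1 after the join
        spawn(pool, ready, n - 1, id, 0);
        spawn(pool, ready, n - 2, id, 1);
        return false;              // suspend here (conceptually the taskwait)
    }
    case 1:                        // segment after the join
        pool[id].result = pool[id].child[0] + pool[id].child[1];
        return true;
    }
    return true;
}

// Sequential stand-in for a scheduler: pops runnable frames until none remain.
long run_fib(int n) {
    std::vector<FibFrame> pool;
    std::vector<int> ready;        // LIFO list of runnable frame indices
    spawn(pool, ready, n, -1, 0);  // root task
    while (!ready.empty()) {
        const int id = ready.back();
        ready.pop_back();
        if (!step(pool, ready, id)) continue;  // suspended at its join
        const int parent = pool[id].parent;
        if (parent < 0) continue;              // the root has finished
        pool[parent].child[pool[id].slot] = pool[id].result;
        if (--pool[parent].pending == 0) ready.push_back(parent);  // join met
    }
    return pool[0].result;
}
```

With this sketch, `run_fib(10)` returns 55. The point of the design is that a task never blocks: at the join it records where to resume, returns control to the scheduler, and is re-queued only once its last child has delivered a result.
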
| Directory | Description |
|---|---|
| clang-gtap/ | Clang fork that compiles GTaP programs. See clang-gtap/README.md for build and usage. |
| runtime/ | Header-only GTaP runtime library. |
| evaluation/ | Benchmarks and scripts used for performance evaluation. |
| examples/ | Example GTaP programs (fib, n-queens, mergesort, cilksort, tree workloads, etc.). |
We have verified the build and basic functionality on a single GH200 node of the Miyabi-G supercomputer (1× NVIDIA GH200, compute capability 9.0 / sm_90; Clang 21.1.8, CUDA Toolkit 12.9, Linux kernel 5.14.0-427.13.1.el9_4.aarch64).
- Clone the repository:

      git clone https://github.com/yukim0359/GTaP.git --recursive
      cd GTaP

- Build the compiler:
  Follow clang-gtap/README.md to build the GTaP-enabled Clang.
- Compile programs:
  Example: Fibonacci

      cd examples/fib
      make
      ./bin/fib

Compilation flags and required preprocessor macros are described in examples/README.md.
Detailed instructions for reproducing experimental results are provided in evaluation/README.md.
GTaP has a built-in profiler for inspecting how tasks are scheduled on the GPU.
- Data:
  `gtap_visualize_profile("app_name")` writes CSV files under `./profile/` with warp/block timelines and summary statistics.
- Usage:

      #define PROFILE              // enable GTaP profiling
      #include "gtap_thread.cuh"   // or "gtap_block.cuh"

      int main() {
          // ... launch GTaP kernel and synchronize ...
          gtap_visualize_profile("fib");
      }

- Programmer responsibilities: define `PROFILE` in the translation unit you profile, and optionally tune `MAX_PROFILE_DATA` if you need more samples.
For further details, see examples/fib_profile.
- clang-gtap: Based on the LLVM Project; see clang-gtap/LICENSE.TXT (Apache License v2.0 with LLVM Exceptions).
- Other components: See LICENSE at the repository root.