From b381be6ed7d6f2f06cd901876fe3bc27ebbbaa28 Mon Sep 17 00:00:00 2001
From: Tomoya Fujita
Date: Sun, 1 Mar 2026 22:00:10 +0900
Subject: [PATCH] fix several misspellings and typos.

Signed-off-by: Tomoya Fujita
---
 CONTRIBUTING.md | 2 +-
 README.md | 4 ++--
 Samples/1_Utilities/topologyQuery/README.md | 2 +-
 Samples/3_CUDA_Features/README.md | 6 +++---
 Samples/3_CUDA_Features/bf16TensorCoreGemm/README.md | 2 +-
 Samples/3_CUDA_Features/dmmaTensorCoreGemm/README.md | 2 +-
 Samples/3_CUDA_Features/tf32TensorCoreGemm/README.md | 2 +-
 Samples/6_Performance/README.md | 2 +-
 Samples/6_Performance/UnifiedMemoryPerf/README.md | 2 +-
 Samples/7_libNVVM/device-side-launch/README.md | 2 +-
 10 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 431410a8f..09bd79542 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -84,7 +84,7 @@ pre-commit install
 
 Now code linters and formatters will be run each time you commit changes.
 
-You can skip these checks with `git commit --no-verify` or with the short version `git commit -n`, althoguh please note
+You can skip these checks with `git commit --no-verify` or with the short version `git commit -n`, although please note
 that this may result in pull requests being rejected if subsequent checks fail.
 
 ## Review Process
diff --git a/README.md b/README.md
index 4686e11d5..778b12079 100644
--- a/README.md
+++ b/README.md
@@ -552,9 +552,9 @@ These CUDA features are needed by some CUDA samples. They are provided by either
 
 CUFFT Callback Routines are user-supplied kernel routines that CUFFT will call when loading or storing data. These callback routines are only available on Linux x86_64 and ppc64le systems.
 
-#### CUDA Dynamic Parallellism
+#### CUDA Dynamic Parallelism
 
-CDP (CUDA Dynamic Parallellism) allows kernels to be launched from threads running on the GPU. CDP is only available on GPUs with SM architecture of 3.5 or above.
+CDP (CUDA Dynamic Parallelism) allows kernels to be launched from threads running on the GPU. CDP is only available on GPUs with SM architecture of 3.5 or above.
 
 #### Multi-block Cooperative Groups
 
diff --git a/Samples/1_Utilities/topologyQuery/README.md b/Samples/1_Utilities/topologyQuery/README.md
index 1c6a06c6a..587ed0a64 100644
--- a/Samples/1_Utilities/topologyQuery/README.md
+++ b/Samples/1_Utilities/topologyQuery/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-A simple exemple on how to query the topology of a system with multiple GPU
+A simple example on how to query the topology of a system with multiple GPU
 
 ## Key Concepts
 
diff --git a/Samples/3_CUDA_Features/README.md b/Samples/3_CUDA_Features/README.md
index 74278321a..a668758c9 100644
--- a/Samples/3_CUDA_Features/README.md
+++ b/Samples/3_CUDA_Features/README.md
@@ -2,7 +2,7 @@
 
 
 ### [bf16TensorCoreGemm](./bf16TensorCoreGemm)
-A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.
 
 ### [binaryPartitionCG](./binaryPartitionCG)
 This sample is a simple code that illustrates binary partition cooperative groups and reduce within the thread block.
@@ -36,7 +36,7 @@ This sample demonstrates the use of the new CUDA WMMA API employing the Tensor C
 In addition to that, it demonstrates the use of the new CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize that allows the application to reserve an extended amount of shared memory than it is available by default.
 
 ### [dmmaTensorCoreGemm](./dmmaTensorCoreGemm)
-CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
+CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
 
 ### [globalToShmemAsyncCopy](./globalToShmemAsyncCopy)
 This sample implements matrix multiplication which uses asynchronous copy of data from global to shared memory when on compute capability 8.0 or higher. Also demonstrates arrive-wait barrier for synchronization.
@@ -69,7 +69,7 @@ A demonstration of CUDA Graphs creation, instantiation and launch using Graphs A
 This sample demonstrates basic use of stream priorities.
 
 ### [tf32TensorCoreGemm](./tf32TensorCoreGemm)
-A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.
 
 ### [warpAggregatedAtomicsCG](./warpAggregatedAtomicsCG)
 This sample demonstrates how using Cooperative Groups (CG) to perform warp aggregated atomics to single and multiple counters, a useful technique to improve performance when many threads atomically add to a single or multiple counters.
diff --git a/Samples/3_CUDA_Features/bf16TensorCoreGemm/README.md b/Samples/3_CUDA_Features/bf16TensorCoreGemm/README.md
index a0c7cc369..adf93fdc8 100644
--- a/Samples/3_CUDA_Features/bf16TensorCoreGemm/README.md
+++ b/Samples/3_CUDA_Features/bf16TensorCoreGemm/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.
 
 ## Key Concepts
 
diff --git a/Samples/3_CUDA_Features/dmmaTensorCoreGemm/README.md b/Samples/3_CUDA_Features/dmmaTensorCoreGemm/README.md
index e9498e5d3..8e3e9c0bf 100644
--- a/Samples/3_CUDA_Features/dmmaTensorCoreGemm/README.md
+++ b/Samples/3_CUDA_Features/dmmaTensorCoreGemm/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
+CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
 
 ## Key Concepts
 
diff --git a/Samples/3_CUDA_Features/tf32TensorCoreGemm/README.md b/Samples/3_CUDA_Features/tf32TensorCoreGemm/README.md
index 7c75298bf..6f4273df1 100644
--- a/Samples/3_CUDA_Features/tf32TensorCoreGemm/README.md
+++ b/Samples/3_CUDA_Features/tf32TensorCoreGemm/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.
 
 ## Key Concepts
 
diff --git a/Samples/6_Performance/README.md b/Samples/6_Performance/README.md
index c44b0ba2c..7ff25fba2 100644
--- a/Samples/6_Performance/README.md
+++ b/Samples/6_Performance/README.md
@@ -8,7 +8,7 @@ A simple test, showing huge access speed gap between aligned and misaligned stru
 This sample demonstrates Matrix Transpose. Different performance are shown to achieve high performance.
 
 ### [UnifiedMemoryPerf](./UnifiedMemoryPerf)
-This sample demonstrates the performance comparision using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
+This sample demonstrates the performance comparison using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
 
 ### [cudaGraphsPerfScaling](./cudaGraphsPerfScaling)
 This sample demonstrates the performance characteristics of cuda graphs. It is focused on how the apis scale with graph size.
diff --git a/Samples/6_Performance/UnifiedMemoryPerf/README.md b/Samples/6_Performance/UnifiedMemoryPerf/README.md
index a9e6d6374..c90ca3603 100644
--- a/Samples/6_Performance/UnifiedMemoryPerf/README.md
+++ b/Samples/6_Performance/UnifiedMemoryPerf/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-This sample demonstrates the performance comparision using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
+This sample demonstrates the performance comparison using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
 
 ## Key Concepts
 
diff --git a/Samples/7_libNVVM/device-side-launch/README.md b/Samples/7_libNVVM/device-side-launch/README.md
index e89496779..cca3a6e71 100644
--- a/Samples/7_libNVVM/device-side-launch/README.md
+++ b/Samples/7_libNVVM/device-side-launch/README.md
@@ -2,7 +2,7 @@ Device-Side Launch From NVVM IR
 ===============================
 
 This document is for the programming language and compiler implementers who
-target NVVM IR and plan to support Dynamic Parallelism in their langauge.
+target NVVM IR and plan to support Dynamic Parallelism in their language.
 It provides the low-level details related to supporting kernel launches at the NVVM IR level.