2805 Bowers Ave, Santa Clara, CA 95051 | 408-730-2275
research@colfax-intl.com

Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS

We explain how to develop NVIDIA CUDA® kernels for optimized general matrix multiplication (GEMM) on the NVIDIA Hopper architecture using the CUTLASS template collection and its core library, CuTe. Our main contribution is an implementation of a GEMM kernel that uses the Tensor Memory Accelerator (TMA) and the Warp Group Matrix-Multiply-Accumulate (WGMMA) operations introduced with the NVIDIA Hopper architecture.

colfax-gemm-kernels-hopper.pdf

