
Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS

We explain how to develop NVIDIA CUDA kernels for optimized general matrix multiplication (GEMM) on the NVIDIA Hopper architecture using the CUTLASS template collection and its core library, CuTe. Our main contribution is an implementation of a GEMM kernel that uses the Tensor Memory Accelerator (TMA) and Warp Group Matrix-Multiply-Accumulate (WGMMA) operations introduced with the NVIDIA Hopper architecture.

 colfax-gemm-kernels-hopper.pdf (603.58 KB)


