We explain how to develop NVIDIA CUDA® kernels for optimized general matrix multiplication (GEMM) on the NVIDIA Hopper™ architecture using the CUTLASS template library and its core library, CuTe. Our main contribution is an implementation of a GEMM kernel that uses the Tensor Memory Accelerator (TMA) and the Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions introduced with the NVIDIA Hopper™ architecture.
Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS
Comments
One response to “Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS”
A well-organized and very clear tutorial.
I think it will be a good starting point for me to learn CUTLASS.
Thank you for sharing.