Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS

We explain how to develop NVIDIA CUDA® kernels for optimized general matrix multiplication (GEMM) on the NVIDIA Hopper™ architecture using the CUTLASS template collection and its core library CuTe. Our main contribution is an implementation of a GEMM kernel that uses the Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) operations introduced with the Hopper architecture.
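To make the two primitives concrete before diving in, here is a minimal host-side sketch, written with CuTe, of the objects such a kernel is built around: a TMA copy descriptor that stages tiles of the A operand into shared memory, and a Hopper warpgroup MMA (WGMMA) atom wrapped into a TiledMMA. This is not the article's kernel configuration; the 64×64 tile, FP16 data type, 128-byte-swizzled shared-memory layout, and problem size are illustrative assumptions, and the snippet presumes CUTLASS 3.x headers compiled with CUDA 12+ for an sm_90a target.

```cpp
#include <cuda_runtime.h>
#include <cute/tensor.hpp>          // CuTe tensors, layouts, copy/MMA atoms

int main() {
  using namespace cute;
  using T = cute::half_t;

  // Illustrative problem size for the A operand (M x K, K-major / row-major).
  int M = 4096, K = 4096;
  T* A = nullptr;
  cudaMalloc(&A, sizeof(T) * size_t(M) * size_t(K));

  // Global-memory view of A as a CuTe tensor.
  Tensor gA = make_tensor(make_gmem_ptr(A),
                          make_shape(M, K),
                          make_stride(K, Int<1>{}));

  // Shared-memory layout for one 64 x 64 tile of A, tiled from the
  // 128-byte-swizzled K-major GMMA layout atom.
  auto sA_layout = tile_to_shape(GMMA::Layout_K_SW128_Atom<T>{},
                                 make_shape(Int<64>{}, Int<64>{}));

  // TMA load: builds the tensor-map descriptor on the host; the kernel
  // would later issue this copy to move 64 x 64 tiles of A into shared memory.
  auto tma_load_A = make_tma_copy(SM90_TMA_LOAD{}, gA, sA_layout);
  (void)tma_load_A;  // consumed inside the device kernel, which is not shown here

  // WGMMA: a 64x64x16 FP16 atom with both operands read from shared
  // memory ("SS"), wrapped into a TiledMMA for one warpgroup.
  auto tiled_mma = make_tiled_mma(
      SM90_64x64x16_F16F16F16_SS<GMMA::Major::K, GMMA::Major::K>{});

  print(sA_layout); print("\n");  // inspect the swizzled shared-memory layout
  print(tiled_mma);               // inspect the WGMMA tiling

  cudaFree(A);
  return 0;
}
```

The "SS" suffix on the MMA atom indicates that both operands are read directly from shared memory, which is why it pairs naturally with TMA: TMA stages tiles into shared memory, and the warpgroup MMA consumes them there without an explicit register-load step for the operands.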