2805 Bowers Ave, Santa Clara, CA 95051 | 408-730-2275
research@colfax-intl.com

Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS

We explain how to develop NVIDIA CUDA® kernels for optimized general matrix multiplication (GEMM) on NVIDIA Hopper™ architecture using the template collection CUTLASS and its core library CuTe. Our main contribution is to provide an implementation of a GEMM kernel that uses the Tensor Memory Accelerator (TMA) and Warp Group Matrix-Multiply-Accumulate (WGMMA) operations introduced with NVIDIA Hopper™ architecture.

colfax-gemm-kernels-hopper.pdf

Posted October 16, 2023 in Deep Learning, Publications

Comments

One response to “Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS”

Shiki

May 20, 2025

Well organized and very clear tutorial.
I think it would be a good start for me to learn cutlass.
Thank you for your sharing.
