Colfax Research

CUTLASS Tutorial: Sub-byte GEMM on NVIDIA® Blackwell GPUs

June 7, 2025 | Blog, Publications, Tutorials

Welcome to part 3 of our series investigating GEMM on the NVIDIA Blackwell architecture. In parts 1 and 2, we looked at the Tensor Memory and 2-SM capabilities of the new Blackwell Tensor Core UMMA instructions and how to work with them in CUTLASS. In this part, we introduce low-precision computation and then discuss…
Read more…
CUTLASS Tutorial: GEMM with Thread Block Clusters on NVIDIA® Blackwell GPUs

May 10, 2025 | Blog, Publications, Tutorials

Welcome to part 2 of our series investigating GEMM on the NVIDIA Blackwell architecture. In part 1, we introduced some key new features available on NVIDIA Blackwell GPUs, including Tensor Memory, and went over how to write a simple CUTLASS GEMM kernel that uses the new UMMA instructions (tcgen05.mma) to target the Blackwell Tensor Cores.…
Read more…
CUTLASS Tutorial: Writing GEMM Kernels Using Tensor Memory For NVIDIA® Blackwell GPUs

April 19, 2025 | Blog, Publications, Tutorials

The NVIDIA Blackwell architecture introduces some new features that significantly change the shape of a GEMM kernel. In this series of posts, we explore the new features available on Blackwell and examine how to write CUTLASS GEMM kernels that utilize these new features by drawing on the CuTe tutorial examples. The goal of this series…
Read more…
DeepSeek-R1 and FP8 Mixed-Precision Training

January 27, 2025 | Article, Blog

DeepSeek has shocked the world with the release of their reasoning model DeepSeek-R1. Similar to OpenAI’s o1 and Google Gemini’s Flash Thinking, the R1 model aims to improve the quality of its replies by generating a “chain of thought” before responding to a prompt. The excitement around R1 stems from it achieving parity with o1…
Read more…
CUTLASS Tutorial: Persistent Kernels and Stream-K

December 19, 2024 | Article, Blog, Publications, Tutorials

Welcome to Part 3 of our tutorial series on GEMM (GEneral Matrix Multiplication). In Parts 1 and 2, we discussed GEMM at length from the perspective of a single threadblock, introducing the WGMMA matmul primitive, pipelining, and warp specialization. In this part, we will examine GEMM from the perspective of the entire grid. At this…
Read more…
FlashAttention-3 for Inference: INT8 Quantization and Query Head Packing for MQA/GQA (External)

November 27, 2024 | Benchmarks, Deep Learning, Publications

In this blog post presented on the Character.AI research blog, we explain two techniques that are important for using FlashAttention-3 for inference: in-kernel pre-processing of tensors via warp specialization and query head packing for MQA/GQA.
Read more…