Author: Colfax Research

Tutorial: Matrix Transpose in CUTLASS

May 5, 2024

The goal of this tutorial is to explain the concepts and techniques involved in memory copy when programming on NVIDIA® GPUs using CUTLASS and its core backend library CuTe. Specifically, we will study the task of matrix transpose as an illustrative example for these concepts. We choose this task because it involves no operation other than…

Installing Ubuntu 22.04 LTS over the Network on Servers with the NVIDIA® Grace Hopper™ Superchip

April 18, 2024

Grace™, NVIDIA’s first datacenter CPU, is a new choice of platform available for datacenter, CPU and HPC applications. The common property of these new NVIDIA Superchips is the Arm® architecture. This post reports on our experience provisioning the Ubuntu 22.04 LTS operating system (OS) on servers based on the NVIDIA Grace Hopper Superchip over the…

Tutorial: Python bindings for CUDA libraries in PyTorch

March 13, 2024

PyTorch today is one of the most popular AI frameworks. Developed by Meta (then Facebook) and open-sourced in 2017, it features approachable, “pythonic” interfaces. This ease of use makes it especially potent for research and development, where a researcher might need to go through multiple iterations of the novel AI workloads they are developing. However, developing in…

Delivering 1 PFLOP/s of Performance with FP8 FlashAttention-2

February 29, 2024

We recently released an update to our FlashAttention-2 forward pass implementation on NVIDIA Hopper™ architecture that incorporates a number of new optimizations and improvements, including a software pipelining scheme and FP8 support. In this article, we will explain a challenge with achieving layout conformance of register fragments for WGMMA instructions that we encountered in the…