Category: Blog

CUTLASS Tutorial: Mastering the NVIDIA® Tensor Memory Accelerator (TMA)

June 24, 2024

TMA (Tensor Memory Accelerator) is a new feature introduced in the NVIDIA Hopper™ architecture for doing asynchronous memory copy between a GPU’s global memory (GMEM) and the shared memory (SMEM) of its threadblocks (i.e., CTAs). Compared to prior approaches, TMA offers a number of advantages, such as (1) improving GPU utilization through facilitating warp-specialized kernel…

Sharing NVIDIA® GPUs at the System Level: Time-Sliced and MIG-Backed vGPUs

May 29, 2024

While some modern applications for GPUs aim to consume all GPU resources and even scale to multiple GPUs (deep learning training, for instance), other applications require only a fraction of GPU resources (like some deep learning inferencing) or don’t use GPUs all the time (for example, a developer working on an NVIDIA CUDA® application may…

Tutorial: Matrix Transpose in CUTLASS

May 5, 2024

The goal of this tutorial is to elucidate the concepts and techniques involving memory copy when programming on NVIDIA® GPUs using CUTLASS and its core backend library CuTe. Specifically, we will study the task of matrix transpose as an illustrative example for these concepts. We choose this task because it involves no operation other than…

Installing Ubuntu 22.04 LTS over the Network on Servers with the NVIDIA® Grace Hopper™ Superchip

April 18, 2024

Grace™, NVIDIA’s first datacenter CPU, is a new choice of platform available for datacenter, CPU and HPC applications. The common property of these new NVIDIA Superchips is the Arm® architecture. This post reports on our experience provisioning the Ubuntu 22.04 LTS operating system (OS) on servers based on the NVIDIA Grace Hopper Superchip over the…

Tutorial: Python bindings for CUDA libraries in PyTorch

March 13, 2024

PyTorch today is one of the most popular AI frameworks. Developed by Meta (then Facebook) and open-sourced in 2017, it features approachable, “pythonic” interfaces. This ease of use makes it especially potent for research and development, where a researcher might need to go through multiple iterations of novel AI workloads that they are developing. However, developing in…
