2805 Bowers Ave, Santa Clara, CA 95051 | 408-730-2275
research@colfax-intl.com

Blog

We present expository-style articles and coding tutorials on our blog.

  • CUTLASS Tutorial: Mastering the NVIDIA® Tensor Memory Accelerator (TMA)

    TMA (Tensor Memory Accelerator) is a new feature introduced in the NVIDIA Hopper™ architecture for doing asynchronous memory copy between a GPU’s global memory (GMEM) and the shared memory (SMEM) of its threadblocks (i.e., CTAs). Compared to prior approaches, TMA offers a number of advantages, such as (1) improving GPU utilization through facilitating warp-specialized kernel […] Go to article…

  • Sharing NVIDIA® GPUs at the System Level: Time-Sliced and MIG-Backed vGPUs

    While some modern applications for GPUs aim to consume all GPU resources and even scale to multiple GPUs (deep learning training, for instance), other applications require only a fraction of GPU resources (like some deep learning inferencing) or don’t use GPUs all the time (for example, a developer working on an NVIDIA CUDA® application may […] Go to article…

  • Tutorial: Matrix Transpose in CUTLASS

    The goal of this tutorial is to elicit the concepts and techniques involving memory copy when programming on NVIDIA® GPUs using CUTLASS and its core backend library CuTe. Specifically, we will study the task of matrix transpose as an illustrative example for these concepts. We choose this task because it involves no operation other than […] Go to article…

  • Installing Ubuntu 22.04 LTS over the Network on Servers with the NVIDIA® Grace Hopper™ Superchip

    Grace™, NVIDIA’s first datacenter CPU, is a new choice of platform available for datacenter, CPU and HPC applications. The common property of these new NVIDIA Superchips is the Arm® architecture. This post reports on our experience provisioning the Ubuntu 22.04 LTS operating system (OS) on servers based on the NVIDIA Grace Hopper Superchip over the network. This procedure is applicable… Go to article…

  • Tutorial: Python bindings for CUDA libraries in PyTorch

    PyTorch today is one of the most popular AI frameworks. Developed by Meta (then Facebook) and open-sourced in 2017, it features approachable, “pythonic” interfaces. This ease-of-use makes it especially potent for research and development, where a researcher might need to go through multiple iterations of novel AI workloads that they are developing. However, developing in […] Go to article…