2805 Bowers Ave, Santa Clara, CA 95051 | 408-730-2275
research@colfax-intl.com

Recent Posts

  • Tutorial: Matrix Transpose in CUTLASS

    The goal of this tutorial is to elucidate the concepts and techniques involved in memory copy when programming NVIDIA® GPUs using CUTLASS and its core backend library, CuTe. Specifically, we study matrix transpose as an illustrative example of these concepts. We choose this task because it involves no operation other than…



  • Installing Ubuntu 22.04 LTS over the Network on Servers with the NVIDIA® Grace Hopper™ Superchip

    Grace™, NVIDIA’s first datacenter CPU, is a new choice of platform available for datacenter, CPU and HPC applications. The common property of these new NVIDIA Superchips is the Arm® architecture. This post reports on our experience provisioning the Ubuntu 22.04 LTS operating system (OS) on servers based on the NVIDIA Grace Hopper Superchip over the…



  • Tutorial: Python bindings for CUDA libraries in PyTorch

    PyTorch is one of today's most popular AI frameworks. Developed by Meta (then Facebook) and open-sourced in 2017, it features approachable, “pythonic” interfaces. This ease of use makes it especially potent for research and development, where a researcher might need to go through multiple iterations of the novel AI workloads they are developing. However, developing in…



  • Delivering 1 PFLOP/s of Performance with FP8 FlashAttention-2

    We recently released an update to our FlashAttention-2 forward pass implementation on NVIDIA Hopper™ architecture that incorporates a number of new optimizations and improvements, including a software pipelining scheme and FP8 support. In this article, we will explain a challenge with achieving layout conformance of register fragments for WGMMA instructions that we encountered in the…



  • A note on the algebra of CuTe Layouts

    The core abstraction of NVIDIA’s CUTLASS library for high-performance linear algebra is the CuTe Layout. In this technical note, we give a rigorous, mathematical treatment of the algebra of these layouts and certain layout operations. Currently, the main goal is to lay down conditions for when the operations of complementation, composition, and logical division are…



  • A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library

    We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA® kernel targeting NVIDIA Hopper™ architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hopper-specific…
