2805 Bowers Ave, Santa Clara, CA 95051 | 408-730-2275
research@colfax-intl.com

Category: Deep Learning

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    In this blogpost, we describe three main techniques that we use to speed up attention on Hopper GPUs in FlashAttention-3: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) incoherent processing that leverages hardware support for… Go to article…

  • Delivering 1 PFLOP/s of Performance with FP8 FlashAttention-2

    Delivering 1 PFLOP/s of Performance with FP8 FlashAttention-2

    We recently released an update to our FlashAttention-2 forward pass implementation on NVIDIA Hopper™ architecture that incorporates a number of new optimizations and improvements, including a software pipelining scheme and FP8 support. In this article, we will explain a challenge with achieving layout conformance of register fragments for WGMMA instructions that we encountered in the… Go to article…

  • A note on the algebra of CuTe Layouts

    A note on the algebra of CuTe Layouts

    The core abstraction of NVIDIA’s CUTLASS library for high-performance linear algebra is the CuTe Layout. In this technical note, we give a rigorous, mathematical treatment of the algebra of these layouts and certain layout operations. Currently, the main goal is to lay down conditions for when the operations of complementation, composition, and logical division are… Go to article…

  • A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library

    A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library

    We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA® kernel targeting NVIDIA Hopper™ architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hopper-specific… Go to article…

  • Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS

    Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS

    We explain how to develop NVIDIA CUDA® kernels for optimized general matrix multiplication (GEMM) on NVIDIA Hopper™ architecture using the template collection CUTLASS and its core library CuTe. Our main contribution is to provide an implementation of a GEMM kernel that uses the Tensor Memory Accelerator (TMA) and Warp Group Matrix-Multiply-Accumulate (WGMMA) operations introduced with… Go to article…