2805 Bowers Ave, Santa Clara, CA 95051 | 408-730-2275
research@colfax-intl.com

Category: Publications

  • FlexAttention + FlashAttention-4: Fast and Flexible (External)

    FlexAttention + FlashAttention-4: Fast and Flexible (External)

    In this PyTorch blog on which we collaborated, we explain the FlexAttention extension to FlashAttention-4 (or from another point of view, the incorporation of FA-4 as an attention backend for the PyTorch FlexAttention API). Go to article…

  • CUTLASS Tutorial: Hardware-supported Block-scaling with NVIDIA Blackwell GPUs

    CUTLASS Tutorial: Hardware-supported Block-scaling with NVIDIA Blackwell GPUs

    Welcome to part 4 of our series investigating GEMM on the NVIDIA Blackwell architecture. So far we have discussed the capabilities of the new Blackwell Tensor Core UMMA instructions, including handling sub-byte data types, and how to work with them in CUTLASS. In this part, we will continue our exploration of low-precision computation by discussing Go to article…

  • FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

    FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

    Modern accelerators like Blackwell GPUs continue the trend of asymmetric hardware scaling, where tensor core throughput grows far faster than other resources such as shared memory bandwidth, special function units (SFUs) for transcendental operations like exponential, and general-purpose integer and floating-point ALUs. From the Hopper H100 to the Blackwell B200, for instance, BF16 tensor core… Go to article…

  • A User’s Guide to FlexAttention in FlashAttention CuTe DSL

    A User’s Guide to FlexAttention in FlashAttention CuTe DSL

    Many variants of attention (Vaswani et al., 2017) have become popular in recent years, for reasons related to performance and model quality. These include: The PyTorch team at Meta recognized that most of these variants (including all of the above) can be unified under one elegant framework, dubbed FlexAttention (Guessous et al., 2024). This simple Go to article…

  • Categorical Foundations for CuTe Layouts

    Categorical Foundations for CuTe Layouts

    In GPU programming, performance depends critically on how data is stored and accessed in memory. While the data we care about is typically multi-dimensional, the GPU’s memory is fundamentally one-dimensional. This means that when we want to load, store, or otherwise manipulate data, we need to map its multi-dimensional logical coordinates to one-dimensional physical coordinates. Go to article…

  • CUTLASS Tutorial: Sub-byte GEMM on NVIDIA® Blackwell GPUs

    CUTLASS Tutorial: Sub-byte GEMM on NVIDIA® Blackwell GPUs

    Welcome to part 3 of our series investigating GEMM on the NVIDIA Blackwell architecture. In parts 1 and 2, we looked at the Tensory Memory and 2 SM capabilities of the new Blackwell Tensor Core UMMA instructions and how to work with them in CUTLASS. In this part, we introduce low-precision computation and then discuss Go to article…

  • CUTLASS Tutorial: GEMM with Thread Block Clusters on NVIDIA® Blackwell GPUs

    CUTLASS Tutorial: GEMM with Thread Block Clusters on NVIDIA® Blackwell GPUs

    Welcome to part two of our series investigating GEMM on the NVIDIA Blackwell architecture. In part 1, we introduced some key new features available on NVIDIA Blackwell GPUs, including Tensor Memory, and went over how to write a simple CUTLASS GEMM kernel that uses the new UMMA instructions (tcgen05.mma) to target the Blackwell Tensor Cores. Go to article…

  • CUTLASS Tutorial: Writing GEMM Kernels Using Tensor Memory For NVIDIA® Blackwell GPUs

    CUTLASS Tutorial: Writing GEMM Kernels Using Tensor Memory For NVIDIA® Blackwell GPUs

    The NVIDIA Blackwell architecture introduces some new features that significantly change the shape of a GEMM kernel. In this series of posts, we explore the new features available on Blackwell and examine how to write CUTLASS GEMM kernels that utilize these new features by drawing on the CuTe tutorial examples. The goal of this series Go to article…

  • CUTLASS Tutorial: Persistent Kernels and Stream-K

    CUTLASS Tutorial: Persistent Kernels and Stream-K

    Welcome to Part 3 of our tutorial series on GEMM (GEneral Matrix Multiplication). In Parts 1 and 2, we discussed GEMM at length from the perspective of a single threadblock, introducing the WGMMA matmul primitive, pipelining, and warp specialization. In this part, we will examine GEMM from the perspective of the entire grid. At this Go to article…

  • FlashAttention-3 for Inference: INT8 Quantization and Query Head Packing for MQA/GQA (External)

    FlashAttention-3 for Inference: INT8 Quantization and Query Head Packing for MQA/GQA (External)

    In this blog post presented on the Character.AI research blog, we explain two techniques that are important for using FlashAttention-3 for inference: in-kernel pre-processing of tensors via warp specialization and query head packing for MQA/GQA. Go to article…