
FlashAttention-3 for Inference: INT8 Quantization and Query Head Packing for MQA/GQA (External)

In this blog post, published on the Character.AI research blog, we explain two techniques that are important for using FlashAttention-3 for inference:

  1. A general methodology for in-kernel pre-processing of tensors via warp specialization, applied here to a half-INT8 attention kernel design that upcasts the V tensor in the producer warpgroup (see the numerics sketch after this list).
  2. Packing query heads into the Q tile for multi-query attention (MQA) and grouped-query attention (GQA), which is needed to saturate memory bandwidth during the memory-bound decode phase of inference (see the packing sketch after this list).
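
The post itself works at the CUDA kernel level; as a rough orientation, the NumPy sketch below shows only the numerics that such a half-INT8 design implies: Q and K stay in INT8 for the first GEMM, while V is upcast before the P·V GEMM, mirroring the in-kernel upcast that the producer warpgroup performs when loading the V tile. The per-tensor scaling scheme and all names here are assumptions for illustration, not the kernel's actual implementation.

```python
# Minimal NumPy sketch of "half-INT8" attention numerics (illustration only).
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns (int8 values, scale)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def half_int8_attention(q_fp, k_fp, v_fp):
    # Quantize Q, K, V (e.g. as a serving stack might quantize the KV cache).
    q_i8, sq = quantize_int8(q_fp)
    k_i8, sk = quantize_int8(k_fp)
    v_i8, sv = quantize_int8(v_fp)

    # First GEMM in INT8, accumulated in INT32, then rescaled to FP32.
    s = (q_i8.astype(np.int32) @ k_i8.astype(np.int32).T).astype(np.float32)
    s *= (sq * sk) / np.sqrt(q_fp.shape[-1])

    # Softmax in FP32.
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)

    # Upcast ("dequantize") V to FP16 before the second GEMM, mirroring the
    # in-kernel upcast done by the producer warpgroup.
    v_fp16 = v_i8.astype(np.float16) * np.float16(sv)
    return p.astype(np.float16) @ v_fp16

# Example usage on random data.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 64)).astype(np.float32) for _ in range(3))
print(half_int8_attention(q, k, v).shape)  # (8, 64)
```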

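A similarly hedged sketch of the second technique: during decode the query length is 1, so the query heads that share a KV head can be packed into the row dimension of a single Q tile, letting one pass over that head's KV cache serve the whole group. Shapes, the head layout, and all names are assumptions for illustration, not the kernel's actual tiling.

```python
# Minimal NumPy sketch of query head packing for GQA decode (illustration only).
import numpy as np

def gqa_decode_packed(q, k, v, n_kv_heads):
    """q: (n_q_heads, d) for one decode token; k, v: (n_kv_heads, seqlen, d)."""
    n_q_heads, d = q.shape
    group = n_q_heads // n_kv_heads

    # Pack: the `group` query heads sharing a KV head become rows of one Q tile
    # (assumes query heads are laid out contiguously per KV group).
    q_packed = q.reshape(n_kv_heads, group, d)

    out = np.empty_like(q_packed)
    for h in range(n_kv_heads):
        # One pass over this KV head's cache serves all `group` query heads,
        # turning a 1-row GEMV per query head into a small GEMM per KV head.
        s = q_packed[h] @ k[h].T / np.sqrt(d)              # (group, seqlen)
        p = np.exp(s - s.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ v[h]                                   # (group, d)

    return out.reshape(n_q_heads, d)

# Example: 32 query heads sharing 8 KV heads (group size 4), head dim 128.
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 128)).astype(np.float32)
k = rng.standard_normal((8, 4096, 128)).astype(np.float32)
v = rng.standard_normal((8, 4096, 128)).astype(np.float32)
print(gqa_decode_packed(q, k, v, n_kv_heads=8).shape)  # (32, 128)
```
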
We also present microbenchmark results for both prefill- and decode-type attention workloads, measured on an NVIDIA H100 SXM5 GPU.

Joint work with Character.AI.

