In this blog post, published on the Character.AI research blog, we explain two techniques that are important for using FlashAttention-3 in inference:
- A general methodology for in-kernel pre-processing of tensors via warp specialization, applied to a mixed INT8/FP16 attention kernel design that upcasts the V tensor to half precision in the producer warpgroup (a simplified sketch of this producer/consumer pattern follows this list).
- Packing query heads into the Q tile for multi-query attention (MQA) or grouped-query attention (GQA), which is needed to saturate memory bandwidth during the memory-bound decoding phase of inference (see the second sketch below).
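To make the first idea concrete, here is a heavily simplified CUDA sketch of warp-specialized pre-processing: one warpgroup acts as the producer, loading INT8 V tiles from global memory and upcasting them to FP16 in shared memory, while the other warpgroup consumes the FP16 tiles to accumulate P·V. The kernel name, tile sizes, single-CTA launch, and the plain `__syncthreads()` ping-pong buffering (in place of the Hopper TMA, mbarrier, and WGMMA machinery a real FlashAttention-3-style kernel would use) are all illustrative assumptions, not the actual implementation.

```cuda
// Sketch only: producer warpgroup upcasts INT8 V tiles to FP16 in shared memory,
// consumer warpgroup accumulates O = P @ V from the upcast tiles.
#include <cuda_fp16.h>
#include <cstdint>

constexpr int BLOCK_M  = 64;   // rows of P / O handled by this CTA
constexpr int BLOCK_N  = 64;   // sequence-length tile processed per stage
constexpr int HEAD_DIM = 64;   // head dimension (columns of V / O)

__global__ void pv_int8_warp_specialized(
    const __half* __restrict__ P,       // [BLOCK_M, seqlen_k], FP16 probabilities
    const int8_t* __restrict__ V_int8,  // [seqlen_k, HEAD_DIM], quantized V
    float v_scale,                      // dequantization scale for V
    float* __restrict__ O,              // [BLOCK_M, HEAD_DIM] output
    int seqlen_k)
{
    // Double-buffered FP16 staging area for the upcast V tile.
    __shared__ __half v_smem[2][BLOCK_N][HEAD_DIM];

    const int  tid       = threadIdx.x;      // 256 threads = 2 warpgroups
    const bool producer  = tid < 128;        // warpgroup 0: load + upcast
    const int  wg_tid    = producer ? tid : tid - 128;
    const int  num_tiles = (seqlen_k + BLOCK_N - 1) / BLOCK_N;

    // Each consumer thread owns a strip of the output accumulator.
    constexpr int OUT_PER_THREAD = BLOCK_M * HEAD_DIM / 128;
    float acc[OUT_PER_THREAD] = {0.f};

    // Producer helper: dequantize one INT8 V tile into shared memory as FP16.
    auto load_tile = [&](int tile, int buf) {
        for (int i = wg_tid; i < BLOCK_N * HEAD_DIM; i += 128) {
            int r = i / HEAD_DIM, c = i % HEAD_DIM;
            int gr = tile * BLOCK_N + r;
            float v = (gr < seqlen_k) ? v_scale * float(V_int8[gr * HEAD_DIM + c]) : 0.f;
            v_smem[buf][r][c] = __float2half(v);
        }
    };

    if (producer) load_tile(0, 0);   // prologue: stage the first tile
    __syncthreads();

    for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1, nxt = (t + 1) & 1;
        if (producer) {
            if (t + 1 < num_tiles) load_tile(t + 1, nxt);   // prefetch next tile
        } else {
            // Consumer: accumulate O += P[:, tile t] @ V_fp16[tile t, :]
            for (int i = 0; i < OUT_PER_THREAD; ++i) {
                int out_idx = wg_tid * OUT_PER_THREAD + i;
                int m = out_idx / HEAD_DIM, d = out_idx % HEAD_DIM;
                for (int k = 0; k < BLOCK_N && t * BLOCK_N + k < seqlen_k; ++k) {
                    acc[i] += __half2float(P[m * seqlen_k + t * BLOCK_N + k]) *
                              __half2float(v_smem[cur][k][d]);
                }
            }
        }
        __syncthreads();   // hand the freshly staged buffer to the consumer
    }

    if (!producer) {
        for (int i = 0; i < OUT_PER_THREAD; ++i) {
            int out_idx = wg_tid * OUT_PER_THREAD + i;
            O[out_idx] = acc[i];
        }
    }
}

// Hypothetical launch for a single 64-row Q tile:
// pv_int8_warp_specialized<<<1, 256>>>(P, V_int8, v_scale, O, seqlen_k);
```

The key point of the pattern is that the format conversion is hidden behind the consumer's math: while the consumer warpgroup works on tile `t`, the producer warpgroup is already fetching and upcasting tile `t + 1`.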
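The second sketch illustrates query-head packing for GQA decode under simplified assumptions: one thread block per (batch, KV head), with the `q_per_kv` query heads that share that KV head processed together, so the K and V cache for the head is streamed from HBM once per group rather than once per query head. A real kernel would instead pack these query heads into the rows of a single tensor-core Q tile; the kernel name, tensor shapes, one-thread-per-query-head mapping, and scalar online-softmax loop here are hypothetical.

```cuda
// Sketch only: GQA decode with the query-head group for one KV head handled by
// one CTA, so the KV cache is read once per group instead of once per head.
#include <cuda_fp16.h>
#include <cmath>

constexpr int HEAD_DIM = 64;

__global__ void gqa_decode_packed(
    const __half* __restrict__ Q,   // [batch, num_q_heads, HEAD_DIM], seqlen_q == 1
    const __half* __restrict__ K,   // [batch, num_kv_heads, seqlen_k, HEAD_DIM]
    const __half* __restrict__ V,   // [batch, num_kv_heads, seqlen_k, HEAD_DIM]
    float*        __restrict__ O,   // [batch, num_q_heads, HEAD_DIM]
    int num_q_heads, int num_kv_heads, int seqlen_k)
{
    const int b        = blockIdx.x;             // batch index
    const int kv_h     = blockIdx.y;             // KV head index
    const int q_per_kv = num_q_heads / num_kv_heads;
    const int g        = threadIdx.x;            // query head within the group
    if (g >= q_per_kv) return;

    const int   q_h      = kv_h * q_per_kv + g;  // packed query head index
    const float sm_scale = rsqrtf((float)HEAD_DIM);

    // Load this thread's query row once; it stays on-chip for the whole pass.
    float q[HEAD_DIM];
    for (int d = 0; d < HEAD_DIM; ++d)
        q[d] = __half2float(Q[(b * num_q_heads + q_h) * HEAD_DIM + d]);

    // Online-softmax running state and output accumulator.
    float m = -INFINITY, l = 0.f, acc[HEAD_DIM] = {0.f};

    const __half* Kb = K + ((long long)b * num_kv_heads + kv_h) * seqlen_k * HEAD_DIM;
    const __half* Vb = V + ((long long)b * num_kv_heads + kv_h) * seqlen_k * HEAD_DIM;

    for (int t = 0; t < seqlen_k; ++t) {
        // All threads in the group read the same K/V row, so the loads are
        // broadcast within the warp: HBM traffic is paid once per group.
        float s = 0.f;
        for (int d = 0; d < HEAD_DIM; ++d)
            s += q[d] * __half2float(Kb[t * HEAD_DIM + d]);
        s *= sm_scale;

        float m_new      = fmaxf(m, s);
        float correction = expf(m - m_new);   // rescale previous accumulator
        float p          = expf(s - m_new);
        l = l * correction + p;
        for (int d = 0; d < HEAD_DIM; ++d)
            acc[d] = acc[d] * correction + p * __half2float(Vb[t * HEAD_DIM + d]);
        m = m_new;
    }

    for (int d = 0; d < HEAD_DIM; ++d)
        O[(b * num_q_heads + q_h) * HEAD_DIM + d] = acc[d] / l;
}

// Hypothetical launch: one CTA per (batch, KV head), q_per_kv threads each.
// gqa_decode_packed<<<dim3(batch, num_kv_heads), q_per_kv>>>(
//     Q, K, V, O, num_q_heads, num_kv_heads, seqlen_k);
```

In the tiled kernel the same packing shows up as a layout choice: because decode has seqlen_q = 1, the group of query heads sharing a KV head is folded into the row (M) dimension of the Q tile, filling tensor-core rows that would otherwise be wasted while each byte of KV is fetched only once.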
We also give microbenchmark results for both prefill and decode-type attention workloads, measured on an NVIDIA H100 SXM5 GPU.
Optimizing AI Inference at Character.AI (Part Deux)
At Character.AI, we’re building personalized AI entertainment. In order to offer our users engaging, interactive experiences, it’s critical we achieve highly efficient inference, or the process by which LLMs generate replies. Our last post on this topic looked at several techniques that contribute to the performance and sustainability of our serving stack, such as multi-query attention (MQA), cross-layer KV-sharing, hybrid attention horizons, and int8 quantization. Those optimizations were primarily concerned with memory efficiency, and in particular how to reduce the KV cache to a manageable size.
In this post, we turn our focus to speed. Specifically, we show how our custom int8 attention kernel achieves strong performance for inference, both in the compute-bound setting of prefill and the memory-bound case of iterative decoding. […]
Joint work with Character.AI.