In this GPU Mode lecture, Jay Shah presents his joint work on FlashAttention-3 and shows how to implement the algorithm's main compute loop using CUTLASS.
The code discussed in this lecture can be found at this commit in the FlashAttention-3 codebase.
Note: Slides adapted from a talk given by Tri Dao.