2805 Bowers Ave, Santa Clara, CA 95051 | 408-730-2275
research@colfax-intl.com

About Us

Our team consists of mathematicians and scientists who bring formal academic training and deep analytical rigor to GPU kernel development. We have a demonstrated record of excellence across research, education, technical writing, and open-source contributions, pairing first-principles reasoning about hardware, systems, and algorithms with hands-on kernel engineering.

Research papers

We publish research at the frontier of GPU performance and the mathematics that underpins it. Our publications include:

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [blog, arXiv:2407.08608]. NeurIPS 2024 Spotlight.
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling [blog, arXiv:2603.05451].
Categorical Foundations for CuTe Layouts [blog, arXiv:2601.05972].

Lectures and talks

We share our methods publicly through technical lectures on leading platforms.

CUTLASS and Flash Attention 3 (GPU Mode).
FlashAttention-3: Fast and Accurate Attention With Asynchrony and Low Precision (NVIDIA GTC).
Fundamentals of CuTe Layout Algebra and Category-theoretic Interpretation (GPU Mode).

Open-source contributions

We have made foundational and ongoing contributions to the FlashAttention project, led by Tri Dao, and have also contributed to vLLM and CUTLASS.

Blogs

We regularly publish in-depth blog posts, often as multi-part series, that cover the architecture and performance engineering of modern GPUs in detail, with the goal of helping developers build their skills from the ground up:

Hopper GEMM series: Part 1, Part 2, Part 3.
Blackwell GEMM series: Part 1, Part 2, Part 3, Part 4.

We also produce joint work with other industry leaders:

FlexAttention + FlashAttention-4: Fast and Flexible (joint with PyTorch)
CUTLASS 3.x: Orthogonal, Reusable, and Composable Abstractions for GEMM Kernel Design (joint with NVIDIA)
FlashAttention-3 for Inference: INT8 Quantization and Query Head Packing for MQA/GQA (joint with Character.AI)

These blog posts serve as primers for the deep-dive training courses we offer in live, pre-recorded, and interactive formats. More details on these courses will be available soon.