FlashAttention has always been an exercise in respecting the memory hierarchy: tile the computation, keep the softmax online, never materialize the full attention matrix in HBM. FA3 adds Hopper-specific machinery (overlapping async memory movement with tensor-core math, plus low-precision paths) to close more of the gap to peak hardware throughput. Our recreation rebuilds the core idea from scratch.
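As a rough illustration of the tiling and online-softmax idea, here is a minimal pure-Python sketch for a single query row. It is not FA3's kernel: the function names, the `block` parameter, and the list-of-lists layout are assumptions made for this example, and a real implementation also tiles over queries and runs in on-chip GPU memory.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: computes every score first, then a standard softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(width)]


def tiled_attention_row(q, keys, values, block=4):
    """Online softmax: visit K/V in blocks, keep only running statistics.

    `block` is a hypothetical tile size; at no point does the function
    hold the scores for more than one block of keys.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf            # largest score seen so far
    running_sum = 0.0                  # softmax denominator so far
    acc = [0.0] * len(values[0])       # unnormalized output so far

    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

The rescaling by `correction` is what lets each block be processed once and discarded: the result matches the reference up to floating-point rounding, without ever storing a full row of attention weights.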
FlashAttention-3: async, low-precision, Hopper-native
The headline is hardware-specific: FA3 is a Hopper story (async copy/MMA overlap, FP8 paths). The portable lesson from the FA line is still the one that matters: unfused attention is bandwidth-bound, and the win comes from cutting HBM traffic, not FLOPs.
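To make the bandwidth argument concrete, here is a back-of-the-envelope sketch that counts HBM traffic for one attention head. The accounting is an idealized assumption for illustration, not a measurement: the unfused path writes and re-reads the N×N score and probability matrices, while the fused figure is a lower bound that reads Q, K, V once and writes the output once (real kernels re-read K/V per query tile). The function names and the 2-byte element size are likewise assumptions.

```python
def attention_hbm_bytes(n, d, bytes_per_elem=2):
    """Return (unfused_bytes, fused_lower_bound_bytes) for one head.

    Unfused, counted in elements:
      S = Q K^T      : read 2*n*d, write n*n
      P = softmax(S) : read n*n,   write n*n
      O = P V        : read n*n + n*d, write n*d
    Fused lower bound: read Q, K, V and write O, i.e. 4*n*d elements.
    """
    unfused_elems = 4 * n * d + 4 * n * n
    fused_elems = 4 * n * d
    return unfused_elems * bytes_per_elem, fused_elems * bytes_per_elem


def attention_flops(n, d):
    """Two n x n x d matmuls at 2 FLOPs per multiply-add; same in both paths."""
    return 4 * n * n * d


if __name__ == "__main__":
    n, d = 4096, 64
    unfused, fused = attention_hbm_bytes(n, d)
    flops = attention_flops(n, d)
    print(f"unfused traffic: {unfused / 1e6:.1f} MB")
    print(f"fused lower bound: {fused / 1e6:.1f} MB")
    print(f"FLOPs per byte, unfused: {flops / unfused:.1f}")
    print(f"FLOPs per byte, fused:   {flops / fused:.1f}")
```

Under these assumptions the FLOP count is identical in both paths while the traffic ratio is 1 + n/d, which is 65 for n = 4096 and d = 64. That is the sense in which the gain comes from HBM traffic rather than arithmetic.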
1 MIN · BY Frontier Checkpoint Editorial
- Source: arXiv:2407.08608 (paper · notable)