FlashAttention has always been an exercise in respecting the memory hierarchy: tile the computation, keep the softmax online, and never materialize the full attention matrix in HBM. FA3 adds Hopper-specific machinery (overlapping asynchronous memory movement with tensor-core math, plus low-precision paths) to close more of the gap to peak throughput. Our recreation rebuilds the core idea from scratch.
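To make "tile the computation, keep the softmax online" concrete, here is a minimal pure-Python sketch for a single query vector. It is not the FA3 kernel and not necessarily how our recreation is structured: the function names (`naive_attention`, `tiled_attention`) and the `block` parameter are hypothetical, and nothing here models HBM, SRAM, async copies, or low precision. It only shows that visiting keys and values one block at a time, while carrying a running max, a running denominator, and a running unnormalized output, gives the same answer as computing the full score row first.

```python
import math

def naive_attention(q, K, V):
    """Reference: softmax(q . K^T / sqrt(d)) . V, with the full score row in memory."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]

def tiled_attention(q, K, V, block=2):
    """Same result, but K/V are visited one block at a time (online softmax)."""
    d = len(q)
    m = -math.inf             # running max of all scores seen so far
    l = 0.0                   # running softmax denominator, relative to m
    acc = [0.0] * len(V[0])   # running unnormalized output, relative to m
    for start in range(0, len(K), block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in Kb]
        m_new = max(m, max(scores))
        # Rescale earlier partial results to the new max.
        # On the first block m is -inf, so scale is exp(-inf) == 0.0.
        scale = math.exp(m - m_new)
        p = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(p)
        acc = [a * scale + sum(pi * v[j] for pi, v in zip(p, Vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]

# Small made-up example: 5 keys with block=2, so the last block is ragged.
q = [1.0, 0.5]
K = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

reference = naive_attention(q, K, V)
tiled = tiled_attention(q, K, V, block=2)
max_diff = max(abs(a - b) for a, b in zip(reference, tiled))
```

The tiled version only ever holds `block` scores at once, which is the property that lets a real kernel keep each tile in fast on-chip memory instead of writing the full attention matrix out. The two results should agree up to floating-point rounding.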