Mamba replaces attention’s pairwise mixing with an input-dependent state-space recurrence, computed with a parallel scan tuned to the GPU memory hierarchy. The practitioner-relevant consequence is a constant-size recurrent state at inference instead of a growing KV-cache — attractive for very long sequences — traded against weaker exact-recall behavior. Most production interest is in hybrid stacks that interleave SSM and attention blocks.
SIGNAL · SIGNALS
Mamba and the selective-state-space line
Worth understanding even if you ship transformers: SSMs change the asymptotics (linear in sequence length, constant state at inference) and the failure modes. The interesting deployments are hybrids, not pure-SSM.
1 MINBY Frontier Checkpoint Editorial
- Source
- arXiv:2312.00752 ↗paper · notable