PagedAttention manages the KV-cache the way an operating system manages virtual memory: the cache is split into fixed-size blocks that need not be contiguous, so fragmentation is near zero and sequences can share blocks. Combined with continuous (in-flight) batching, which admits new requests and retires finished ones at every decoding step, it keeps the GPU saturated instead of waiting on the slowest sequence in a batch. The reason to care is not novelty. The technique has become the assumed baseline, and SGLang, TGI, and TensorRT-LLM all ship their own versions. Our explainer walks through the internals.
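To make the virtual-memory analogy concrete, here is a minimal bookkeeping sketch of a paged KV-cache. It is not vLLM's code: `BlockAllocator`, `Sequence`, `fork`, and the `BLOCK_SIZE` value of 16 are hypothetical names and numbers chosen for illustration, and only block IDs are tracked, not the key/value tensors themselves.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an illustrative value, not a vLLM default claim


class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second sequence now points at the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    allocator: BlockAllocator
    block_table: list[int] = field(default_factory=list)  # logical index -> physical block
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or there is none yet): grab any free block.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: a shared, partially filled block must be
                # replaced by a private one before this sequence writes to it.
                # (A real system would also copy the block's contents.)
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> Sequence:
        # Share every existing block instead of copying the prefix.
        shared = [self.allocator.share(b) for b in self.block_table]
        return Sequence(self.allocator, shared, self.num_tokens)

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0
```

In this sketch a sequence wastes at most `BLOCK_SIZE - 1` token slots (the tail of its last block), and a forked sequence costs no extra memory until it writes into a partially filled shared block.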
vLLM and the new default shape of LLM serving
If you are still serving with naive static batching, the gap is not marginal — paged KV-cache and continuous batching change the throughput-per-GPU math, and most other stacks have copied the idea.
1 min read · By Frontier Checkpoint Editorial
- Source: vllm-project/vllm (repo)