vLLM and the new default shape of LLM serving

If you are still serving with naive static batching, the gap is not marginal: a paged KV-cache and continuous batching change the throughput-per-GPU math, and most other serving stacks have adopted the same idea.

2026-06-26 · 1 MIN · BY Frontier Checkpoint Editorial

Source: vllm-project/vllm ↗repo · major

PagedAttention treats the KV-cache like virtual memory: the cache is split into fixed-size blocks that need not be contiguous, which brings fragmentation close to zero and lets sequences share blocks. Combined with continuous (in-flight) batching, which admits new requests as soon as others finish, it keeps the GPU saturated instead of waiting on the slowest sequence in a batch. The reason to care is not novelty. It is that the technique has become the assumed baseline: SGLang, TGI, and TensorRT-LLM all ship their own versions. Our explainer walks through the internals.
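
To make the virtual-memory analogy concrete, here is a minimal Python sketch of the bookkeeping side of a paged KV-cache. It is not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE`, along with the block size of 16, are assumptions chosen for illustration, and no tensors are stored. It shows only how per-sequence block tables, reference counts, and copy-on-write let sequences share blocks while wasting at most one partly filled block each.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; hypothetical value for illustration


class BlockAllocator:
    """Toy pool of fixed-size KV-cache blocks with reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def share(self, block: int) -> int:
        # Another sequence now points at the same physical block.
        self.ref_counts[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free_blocks.append(block)


@dataclass
class Sequence:
    """One request: a block table mapping logical to physical blocks."""

    allocator: BlockAllocator
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or there is none): grab any free block.
            self.block_table.append(self.allocator.allocate())
        elif self.allocator.ref_counts[self.block_table[-1]] > 1:
            # Copy-on-write: the partly filled last block is shared, so
            # this sequence takes a private block before writing to it.
            old = self.block_table[-1]
            self.block_table[-1] = self.allocator.allocate()
            self.allocator.release(old)
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Share every block with the child, e.g. for a common prompt prefix.
        shared = [self.allocator.share(b) for b in self.block_table]
        return Sequence(self.allocator, shared, self.num_tokens)

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0


if __name__ == "__main__":
    pool = BlockAllocator(num_blocks=8)
    parent = Sequence(pool)
    for _ in range(40):  # 40 tokens occupy 3 blocks of 16
        parent.append_token()
    child = parent.fork()  # no new blocks allocated yet
    child.append_token()   # triggers copy-on-write on the last block
    print("parent blocks:", parent.block_table)
    print("child blocks: ", child.block_table)
    print("free blocks:  ", len(pool.free_blocks))
```

Because a block is only claimed when the previous one fills up, the unused space per sequence never exceeds `BLOCK_SIZE - 1` token slots, which is the sense in which fragmentation stays near zero. A real engine pairs this bookkeeping with attention kernels that read keys and values through the block table.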