Topic

PagedAttention

Virtual-memory-style KV-cache management.

2 checkpoints

2026-06-26 SIGNALS

vLLM and the new default shape of LLM serving

If you are still serving with naive static batching, the gap is not marginal — paged KV-cache and continuous batching change the throughput-per-GPU math, and most other stacks have copied the idea.
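To make the throughput claim concrete, here is a toy step-count comparison. It is a sketch under simplifying assumptions, not vLLM's scheduler: every request needs a known number of decode steps, each step advances every running request by one token, and prefill cost is ignored. The function names and the `batch_size` parameter are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle in their slot until the longest one is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled at the very next step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token,
        # and requests with nothing left to emit leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 2, 8]  # assumed output lengths, in tokens
    print(static_batching_steps(lengths, batch_size=2))
    print(continuous_batching_steps(lengths, batch_size=2))
```

With these assumed lengths and two slots, the static version needs 16 steps and the continuous version 12, because the short requests release their slots early instead of waiting for the long ones.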

CHECKPOINT 0003 2026-05-20 LIBRARIES serving · production

vLLM, Explained: PagedAttention, Continuous Batching, and the Serving Stack

vLLM treats the KV cache like OS virtual memory, storing it in non-contiguous paged blocks, and schedules work at the token level rather than the request level. You get high aggregate throughput; the cost is that per-request latency becomes something you tune rather than something you get for free.
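The virtual-memory analogy can be sketched as a per-sequence block table that maps logical token positions to physical blocks drawn from a shared pool. This is an illustration only, not vLLM's internal API: `BlockAllocator`, `Sequence`, and the `BLOCK_SIZE` of 16 are names and values assumed here.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


@dataclass
class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""
    num_blocks: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids):
        # Blocks from a finished sequence go back to the shared pool.
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """Tracks which physical blocks hold one request's KV entries."""
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new block is claimed only when the current one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence are ever unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx: int):
        # Logical position -> (physical block id, offset inside the block).
        block = self.block_table[token_idx // BLOCK_SIZE]
        return block, token_idx % BLOCK_SIZE


if __name__ == "__main__":
    pool = BlockAllocator(num_blocks=4)
    a, b = Sequence(), Sequence()
    for _ in range(17):  # interleave two growing sequences
        a.append_token(pool)
        b.append_token(pool)
    print(a.block_table, b.block_table)
```

Because the two sequences grow in an interleaved fashion, neither ends up with adjacent physical blocks, yet each can still resolve any token position through its own table. That indirection is what lets memory be allocated on demand instead of reserved up front for a maximum sequence length.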

paged-attention kv-cache serving llm