vLLM’s central bet is that LLM serving is gated by KV-cache memory management, not by FLOPs. If you store the cache the obvious way — one contiguous buffer per request, sized for the worst case — most of your GPU’s memory sits reserved and idle, which caps how many sequences you can batch, which caps throughput. vLLM’s answer is to manage the cache the way an operating system manages RAM: fixed-size pages, a page table, copy-on-write, and eviction under pressure. The win is real and large. The cost, which the marketing tends to skip, is that you are now optimizing aggregate tokens/sec, and an individual request’s tail latency becomes a thing you schedule and tune rather than a thing you get for free.
Everything below describes mechanisms as of vLLM’s V1 engine (GA in 2025); the feature surface moves fast, so treat specific flags as “check the docs” and treat every performance claim as “benchmark on your hardware, your workload.” The repo is vllm-project/vllm (Apache-2.0), and the originating ideas are in the PagedAttention paper (Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP 2023).
PagedAttention: the KV cache as virtual memory
The KV cache is the dominant runtime memory cost of decoding. For each token you keep a key and value vector per layer per head; that state grows linearly with sequence length and with the number of concurrent sequences. The naive serving systems that preceded vLLM (FasterTransformer-style) allocated one contiguous KV buffer per request, sized to max_model_len. That design bleeds memory three ways: internal fragmentation (you reserved 4k slots, the request emits 300 tokens, the rest is dead), reservation waste (slots are committed for tokens not yet generated), and external fragmentation (variable-length contiguous blocks leave unusable gaps between requests). The PagedAttention paper measures 60–80% of KV memory wasted to these effects in prior systems.
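To put rough numbers on that waste, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumption picked to resemble an 8B-class model with grouped-query attention, not something taken from the paper; substitute your own. It also simplifies by treating the 300 tokens as the request's whole footprint.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def contiguous_waste_fraction(reserved_tokens: int, used_tokens: int) -> float:
    """Share of a worst-case contiguous reservation that never holds a token."""
    return 1.0 - used_tokens / reserved_tokens


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, 16-bit elements (2 bytes).
per_token = kv_bytes_per_token(32, 8, 128, 2)
reserved_bytes = per_token * 4096                 # buffer sized for 4k slots
waste = contiguous_waste_fraction(4096, 300)      # request ends up holding 300 tokens

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{reserved_bytes / 2**20:.0f} MiB reserved, {waste:.1%} of it unused")
```

Under these assumptions a single short request pins half a gibibyte of cache it mostly never touches, which is the effect the paper's measurement aggregates over a whole workload.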
PagedAttention borrows paging wholesale. The KV cache is partitioned into fixed-size blocks — by default 16 tokens of KV per block — and a per-sequence block table maps the sequence’s logical token positions to physical blocks that can live anywhere in GPU memory, non-contiguously. The attention kernel is modified to gather KV from those scattered physical blocks via the block table rather than reading one contiguous run. Fragmentation collapses to at most one partially-filled block per sequence, bounded by the block size. The paper reports memory waste under 4%, and that recovered memory is what lets vLLM hold far more sequences resident — the source of the headline 2–4× throughput improvement over FasterTransformer and Orca at comparable latency (their measurement, their hardware).
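The block-table idea is small enough to sketch in a few lines. This is a toy model, not vLLM's allocator: the class names, the free-list policy, and the pool size are all made up for illustration. It shows the two properties the paragraph claims: logical positions translate to arbitrary physical blocks, and waste is confined to the last block.

```python
BLOCK_SIZE = 16  # tokens of KV per block, matching the default cited above


class BlockAllocator:
    """Toy pool of physical block ids (hypothetical, for illustration only)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()


class Sequence:
    """Tracks one request's block table: logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, position: int) -> tuple[int, int]:
        """Translate a logical token position to (physical block id, offset in block)."""
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def wasted_slots(self) -> int:
        # Internal fragmentation lives only in the last, partially filled block.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


pool = BlockAllocator(num_blocks=64)
seq = Sequence(pool)
for _ in range(300):
    seq.append_token()

print(len(seq.block_table), "blocks,", seq.wasted_slots(), "unused slots")
```

The real attention kernel does the `physical_slot` lookup on the GPU for every KV read; the point of the sketch is only that nothing requires the physical ids to be adjacent.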
The page table buys a second thing for free: sharing. Because blocks are addressable and reference-counted, two sequences can point at the same physical block.
- Parallel sampling / beam search: sample n completions from one prompt, and all n sequences share the prompt’s read-only KV blocks. They only diverge in the generated region. When a shared, partially-filled block would be mutated by one sequence, vLLM does copy-on-write (clone the block, decrement the refcount, write into the copy), exactly like an OS forking a process.
- Prefix caching (covered below) is the cross-request version of the same trick.
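The copy-on-write step can be sketched with a reference-counted pool. Everything here is hypothetical scaffolding (the `RefCountedBlocks` class, Python lists standing in for GPU blocks, the token ids); it only illustrates the clone, decrement, write sequence described above.

```python
class RefCountedBlocks:
    """Toy reference-counted block pool; block contents are plain Python lists."""

    def __init__(self) -> None:
        self.data: dict[int, list[int]] = {}
        self.refcount: dict[int, int] = {}
        self._next_id = 0

    def new_block(self, tokens: list[int]) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.data[block_id] = list(tokens)
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id: int) -> int:
        """A second sequence points at the same physical block."""
        self.refcount[block_id] += 1
        return block_id

    def append(self, block_id: int, token: int) -> int:
        """Write a token, cloning first if the block is shared. Returns the block written to."""
        if self.refcount[block_id] > 1:
            self.refcount[block_id] -= 1                      # writer drops its reference
            block_id = self.new_block(self.data[block_id])    # and gets a private copy
        self.data[block_id].append(token)
        return block_id


pool = RefCountedBlocks()
prompt_block = pool.new_block([101, 102, 103])   # partially filled prompt block
seq_a = [prompt_block]
seq_b = [pool.share(prompt_block)]               # parallel sample shares it

seq_b[-1] = pool.append(seq_b[-1], 7)            # B diverges: triggers copy-on-write
seq_a[-1] = pool.append(seq_a[-1], 9)            # A is now the sole owner: writes in place
```

Only the diverging block is copied; full prompt blocks earlier in the table would stay shared for the life of both sequences.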
Continuous batching: scheduling at the token, not the request
Paged memory only pays off if you can keep the batch full, and that is a scheduling problem. Static batching forms a batch, pads every sequence to the longest, and runs all of them to completion before admitting anyone new. Generation lengths vary by an order of magnitude, so the short sequences finish early and their lanes go idle while the batch waits on the one 2000-token outlier. Worse, a new request that arrives one millisecond after the batch starts waits for the entire batch to drain. Throughput craters and tail latency is hostage to the longest member.
Continuous batching (also called in-flight or iteration-level batching, an idea from Orca, OSDI 2022) schedules at the granularity of a single decode step. After every forward pass the scheduler reconsiders the batch: a sequence that emitted its EOS is evicted immediately and its KV blocks are freed; a waiting request is admitted into the freed lane on the very next step. The batch is recomposed every iteration, so the GPU stays saturated and queueing delay drops from “wait for the batch” to “wait for one token’s worth of compute.”
This is the single biggest throughput lever in modern serving, and it is why PagedAttention matters: continuous batching constantly admits and evicts sequences of unpredictable length, which is precisely the workload that destroys a contiguous allocator. Paging makes admit/evict cheap (free a few blocks, hand them to the next request) instead of a defragmentation crisis.
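A toy simulation makes the difference between the two schedulers concrete. It counts decode steps only and ignores prefill and memory limits; the workload and batch size are invented for illustration.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when finished sequences are replaced after every iteration."""
    waiting = deque(lengths)
    running: list[int] = []          # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free lanes before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                   # one forward pass emits one token per sequence
        running = [r - 1 for r in running if r > 1]   # drop sequences that just finished
    return steps


# Assumed workload: two long generations mixed in with short ones.
lengths = [400, 20, 20, 20, 400, 20, 20, 20]
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(static, continuous)
```

With this workload the static schedule needs 800 steps and the continuous one 420, because the second long request starts as soon as a short one frees a lane instead of waiting for the first batch to drain.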
Prefill, decode, and the tension between them
Inference has two phases with opposite hardware profiles. Prefill processes the whole prompt in one shot — big, compute-bound matmuls that produce the first token and fill the prompt’s KV blocks. Decode generates one token per step, reading the entire KV cache each time — small matmuls, memory-bandwidth-bound, latency-dominated. The two metrics your users feel map onto these phases: TTFT (time-to-first-token) is a prefill cost; ITL/TPOT (inter-token latency, time-per-output-token) is a decode cost.
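The opposite hardware profiles fall out of a one-line roofline estimate. This sketch counts only weight traffic for a single matmul and ignores attention's KV reads (which push decode even further toward the bandwidth limit); the accelerator figures are round assumed numbers, not a spec for any particular GPU.

```python
def arithmetic_intensity(tokens_in_step: int, dtype_bytes: int = 2) -> float:
    """Approximate FLOPs per byte for a weight matmul, counting only weight traffic.

    For a d x d weight matrix and t tokens: FLOPs ~= 2 * t * d * d and bytes
    moved ~= dtype_bytes * d * d, so d cancels and intensity ~= 2 * t / dtype_bytes.
    """
    return 2 * tokens_in_step / dtype_bytes


def regime(tokens_in_step: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare a step's intensity against the hardware's FLOPs-per-byte ridge point."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if arithmetic_intensity(tokens_in_step) >= ridge else "memory-bound"


# Assumed accelerator: 1e15 FLOP/s peak and 3e12 bytes/s of memory bandwidth.
PEAK_FLOPS, BANDWIDTH = 1e15, 3e12

prefill = regime(4096, PEAK_FLOPS, BANDWIDTH)    # one 4096-token prompt in a single pass
decode = regime(32, PEAK_FLOPS, BANDWIDTH)       # 32 sequences, one token each
print(prefill, decode)
```

The same estimate explains why batching helps decode: every extra sequence in the step adds FLOPs without adding weight traffic.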
The scheduler has to interleave them, and they fight. A long prefill is a compute hog: if you let one run to completion, every in-flight decode stalls behind it and your ITL spikes — users watching a stream see it freeze. vLLM’s lever here is chunked prefill: split a long prompt’s prefill into token-budgeted chunks and co-schedule those chunks alongside ongoing decodes in the same batch. ITL gets smoother and more predictable; the cost is that the chunked request’s own TTFT goes up, and you’ve added a max_num_batched_tokens budget to tune. This is the prefill-vs-decode tradeoff in one knob: spend TTFT to protect ITL, or the reverse.
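Here is a minimal sketch of how a per-step token budget turns one long prefill into several chunks. The decode-first policy and the function names are assumptions of this sketch, not a description of vLLM's scheduler internals; the numbers are invented.

```python
def schedule_step(num_decodes: int, prefill_remaining: int, token_budget: int) -> tuple[int, int]:
    """Return (decode tokens, prefill tokens) for one step under a token budget.

    Decodes are served first, one token each; whatever budget is left goes to
    the next chunk of the pending prompt.
    """
    decode_tokens = min(num_decodes, token_budget)
    prefill_tokens = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_tokens


def steps_to_prefill(prompt_len: int, num_decodes: int, token_budget: int) -> int:
    """Number of steps a prompt's prefill is spread over while decodes keep running."""
    if token_budget <= num_decodes:
        raise ValueError("token budget leaves no room for prefill")
    steps, remaining = 0, prompt_len
    while remaining > 0:
        _, chunk = schedule_step(num_decodes, remaining, token_budget)
        remaining -= chunk
        steps += 1
    return steps


# Assumed numbers: an 8000-token prompt arriving while 32 sequences are decoding.
small_budget = steps_to_prefill(8000, num_decodes=32, token_budget=2048)
large_budget = steps_to_prefill(8000, num_decodes=32, token_budget=8192)
print(small_budget, large_budget)
```

A small budget spreads the prompt over four steps, so the new request waits longer for its first token while the 32 running streams never see a step heavier than 2048 tokens. A large budget finishes the prefill in one step and makes every running stream sit through it.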
Prefix caching (automatic prefix caching, APC) extends block sharing across requests in time. vLLM hashes block contents; when a new request’s prompt shares a leading run of blocks with something already cached — a shared system prompt, a few-shot preamble, a long document reused across questions, an agent re-sending its conversation — those blocks are reused and their prefill is skipped entirely. For agent and RAG workloads with heavy prompt reuse this is enormous; for one-shot traffic with unique prompts it does nothing. In V1, chunked prefill and prefix caching are on by default.
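One way to picture the lookup is a chain of block hashes, where each block's hash also covers everything before it. This is a sketch under assumptions: the SHA-256 scheme, the string encoding, and caching only full blocks are illustrative choices, not vLLM's exact hashing.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with its parent's hash, so a hash identifies a whole prefix."""
    hashes: list[str] = []
    parent = ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        text = parent + ":" + ",".join(map(str, block))
        digest = hashlib.sha256(text.encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_tokens(token_ids: list[int], cache: set[str]) -> int:
    """Count leading tokens whose blocks are already cached; their prefill can be skipped."""
    hits = 0
    for digest in block_hashes(token_ids):
        if digest not in cache:
            break
        hits += 1
    return hits * BLOCK_SIZE


system_prompt = list(range(100, 140))            # 40 tokens shared by both requests
first = system_prompt + [1, 2, 3, 4, 5, 6, 7, 8]
second = system_prompt + [9, 9, 9, 9, 9, 9, 9, 9]

cache = set(block_hashes(first))                 # first request populates the cache
reused = cached_prefix_tokens(second, cache)
print(reused, "of", len(second), "prompt tokens served from cache")
```

Note the granularity: the two prompts share 40 tokens, but only the 32 that fill whole blocks are reusable, because the third block mixes shared and unique tokens.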
When demand for KV blocks exceeds supply (too many long sequences resident at once), vLLM preempts. It evicts a sequence’s blocks to make room, then resumes it later; V1 recomputes the evicted KV by default rather than swapping it to CPU. This is graceful degradation rather than an OOM crash, but it is your clearest signal of memory pressure: preemption warnings in the logs mean you’ve oversubscribed and throughput is now being spent on rework. The newer architectural answer is prefill/decode disaggregation, which runs prefill and decode on separate GPU pools so a prefill burst can’t jitter decode latency; vLLM supports it in an experimental capacity (via KV-transfer connectors), so treat it as evolving.
Parallelism, quantization, and speculation: the feature surface
Beyond memory and scheduling, vLLM bundles the usual production accelerators. As of writing — benchmark before you trust any of these on your model:
- Tensor and pipeline parallelism: tensor_parallel_size shards each layer’s weights (Megatron-style, all-reduce on the critical path) and wants fast intra-node interconnect (NVLink); pipeline_parallel_size partitions layers into stages and tolerates slower inter-node links. The default mental model: TP within a node, PP across nodes.
- Quantization: weight-only INT4 via GPTQ and AWQ, FP8 (E4M3) weights and activations on Hopper/Ada, plus INT8 paths and Marlin kernels. Separately, KV-cache quantization (kv_cache_dtype="fp8") roughly halves the cache footprint, which directly buys you more concurrent sequences or longer context, but it can move task accuracy, so measure on a task-relevant eval, not just perplexity.
- Speculative decoding: a cheap drafter proposes several tokens and the target model verifies them in one forward pass, accepting a prefix. vLLM supports draft-model, n-gram/prompt-lookup, and EAGLE/Medusa-style speculation. The honest caveat: speculation helps most in the memory-bound, low-batch regime where the GPU has spare compute to verify on. Under high-batch continuous batching you are already compute-saturated, and speculation can break even or hurt. It is a latency tool, not a throughput tool.
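The "accepting a prefix" step in speculative decoding is easy to get wrong, so here is a sketch of the greedy-decoding case only. With sampling, real systems use a rejection-sampling rule instead; the function name and token ids below are hypothetical.

```python
def accept_draft(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Greedy verification: keep the draft prefix the target agrees with, plus one target token.

    target_tokens[i] is the target model's greedy choice at position i given the
    context plus draft_tokens[:i]. It has len(draft_tokens) + 1 entries, all
    obtained from a single forward pass over the drafted positions.
    """
    accepted: list[int] = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            accepted.append(verified)     # first mismatch: take the target's token and stop
            return accepted
        accepted.append(drafted)
    accepted.append(target_tokens[len(draft_tokens)])   # whole draft accepted: bonus token
    return accepted


# Hypothetical token ids: the drafter gets the first two of four right.
emitted = accept_draft(draft_tokens=[5, 8, 13, 21], target_tokens=[5, 8, 2, 21, 34])
print(emitted)
```

Every call emits at least one token the target model would have produced anyway, so output is unchanged under greedy decoding. The gain is the extra accepted tokens per target pass, and the price is the verification compute, which is why the benefit shrinks once the batch is already compute-saturated.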
Running it: the server and the knobs that matter
The offline path is an LLM object; the online path is vllm serve, which stands up an HTTP server exposing OpenAI-compatible /v1/chat/completions, /v1/completions, and /v1/embeddings. That compatibility is most of why vLLM became the default: you point the OpenAI SDK at it and existing clients work unchanged.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        tensor_parallel_size=2,        # shard weights across 2 NVLinked GPUs
        gpu_memory_utilization=0.90,   # everything left after weights -> KV blocks
        max_model_len=8192,            # caps per-seq KV; higher -> fewer concurrent seqs
        enable_prefix_caching=True,    # reuse identical prompt prefixes across requests
        kv_cache_dtype="fp8",          # ~half the KV footprint; verify accuracy
    )

    params = SamplingParams(temperature=0.7, max_tokens=512)
    out = llm.generate(["Explain paged attention in one paragraph."], params)
    print(out[0].outputs[0].text)

    vllm serve meta-llama/Llama-3.1-8B-Instruct \
        --tensor-parallel-size 2 \
        --gpu-memory-utilization 0.92 \
        --max-num-seqs 256 \
        --enable-chunked-prefill
    # OpenAI-compatible server on :8000 ; point the openai SDK at /v1

The knobs that actually decide your throughput/latency curve:
- gpu_memory_utilization (default ~0.9): the fraction of VRAM vLLM claims. After model weights and activation scratch, the remainder is your KV-cache block pool. Pushing it up grows the pool (more concurrent sequences, higher throughput) until you OOM. This is the single highest-leverage dial.
- max_model_len: the per-sequence KV ceiling. Set it to the context you actually serve. If even one sequence of that length can’t fit in the block pool, vLLM tells you at startup; reduce the length or free up memory.
- max_num_seqs / max_num_batched_tokens: the scheduler’s width and per-step token budget; these govern the prefill/decode interleave and your ITL smoothness.
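To see how these dials interact, here is a rough sizing sketch. All figures are assumptions for illustration (an 80 GiB GPU, 16 GiB of weights, 4 GiB of activation scratch, 128 KiB of KV per token); vLLM measures the real values by profiling at startup, so treat this as arithmetic, not as what the engine computes.

```python
def kv_pool_tokens(vram_gib: float, gpu_memory_utilization: float,
                   weights_gib: float, activation_gib: float,
                   kv_bytes_per_token: int) -> int:
    """Tokens of KV that fit in what is left of the memory budget."""
    budget_bytes = vram_gib * gpu_memory_utilization * 2**30
    pool_bytes = budget_bytes - (weights_gib + activation_gib) * 2**30
    if pool_bytes <= 0:
        raise ValueError("weights and activations exceed the memory budget")
    return int(pool_bytes // kv_bytes_per_token)


def worst_case_concurrency(pool_tokens: int, max_model_len: int) -> int:
    """Sequences that fit if every one grows to max_model_len."""
    return pool_tokens // max_model_len


# Assumed numbers, see the paragraph above.
tokens_at_090 = kv_pool_tokens(80, 0.90, 16, 4, 128 * 1024)
tokens_at_095 = kv_pool_tokens(80, 0.95, 16, 4, 128 * 1024)
print(tokens_at_090, worst_case_concurrency(tokens_at_090, 8192))
print(tokens_at_095, worst_case_concurrency(tokens_at_095, 8192))
```

The worst-case figure is deliberately pessimistic. Paging means real sequences only hold the blocks they have filled, so actual concurrency is usually much higher, up to the point where preemptions start.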
Measure with vllm bench serve (the load generator; the exact invocation moves between versions) and watch four numbers: TTFT, ITL/TPOT, throughput (output tokens/sec and requests/sec), and the preemption count. Memory pressure shows up as preemption warnings and as TTFT climbing while the queue grows; if you see those, the KV block pool is too small for the concurrency and context length you are admitting, so lower max_num_seqs, shorten max_model_len, or give the pool more memory.
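If you want to sanity-check the benchmark's numbers against your own client, the latency metrics are simple to compute from token arrival times. The timestamps below are hypothetical, and the nearest-rank percentile is just one common convention.

```python
import math
from statistics import mean


def ttft(request_sent: float, token_times: list[float]) -> float:
    """Time to first token: delay from sending the request to the first streamed token."""
    return token_times[0] - request_sent


def inter_token_latencies(token_times: list[float]) -> list[float]:
    """Gaps between consecutive streamed tokens."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; enough for a quick look at the tail."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical arrival times (seconds) for one streamed response,
# with a stall before the fifth token.
sent_at = 0.0
arrivals = [0.40, 0.42, 0.44, 0.46, 0.76, 0.78]

gaps = inter_token_latencies(arrivals)
print(f"TTFT {ttft(sent_at, arrivals):.2f}s, mean ITL {mean(gaps) * 1000:.0f} ms, "
      f"p99 ITL {percentile(gaps, 99) * 1000:.0f} ms")
```

The single stall barely moves the mean but dominates the tail, which is why ITL is worth tracking as a percentile rather than an average.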
When vLLM — and how to price a token
vLLM is the right default for high-throughput datacenter-GPU serving where you want OpenAI-compatible endpoints and broad model coverage without a build step. The alternatives carve out real niches:
- TGI (Text Generation Inference): tight Hugging Face ecosystem fit; it adopted continuous batching and paged/flash attention too, so the gap narrowed.
- SGLang: RadixAttention (prefix sharing via a radix tree) and a strong structured-generation/programming model — often the pick for agent workloads dense with shared prefixes and constrained decoding.
- TensorRT-LLM: can be the fastest on NVIDIA hardware via compiled engines, at the cost of a build step and less flexibility; usually paired with Triton Inference Server.
- llama.cpp: the answer for CPU, Apple Silicon, consumer GPUs, and GGUF/edge single-user inference — not a throughput server for A100/H100 fleets.
To reason about cost per token, ignore the sticker latency and compute it from saturated throughput at your latency SLO: cost/token ≈ (GPU $/hr) / (3600 × output tokens/sec sustained within your TTFT and ITL budget). Continuous batching and paged memory are exactly the machinery that pushes the denominator up: more sequences amortize the fixed bandwidth cost of streaming weights each step, dragging you from the memory-bound regime toward compute-bound, where each GPU-second produces the most tokens. That is the whole game, and also the whole tradeoff: the cheapest token comes from a packed batch, and a packed batch is the thing that lengthens any single request’s tail. Pick the batch size that sits at your SLO, not the one that maxes the throughput chart, and re-benchmark when the model, context length, or traffic shape changes, because all three move the curve.
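The arithmetic is worth writing down once, with the hours-to-seconds conversion made explicit. The price and throughput figures are invented for illustration; plug in your own measured numbers.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Assumed figures: a $2.50/hr GPU at two operating points.
at_slo = cost_per_million_tokens(2.50, tokens_per_second=2500)    # batch sized to meet the SLO
at_peak = cost_per_million_tokens(2.50, tokens_per_second=4000)   # larger batch, SLO violated
print(f"${at_slo:.3f} vs ${at_peak:.3f} per million output tokens")
```

The gap between the two numbers is the price of your latency target. It is the honest figure to budget with, since the peak-throughput cost is only available if you are willing to miss the SLO.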
The bottom line
vLLM’s durable idea is reframing serving as memory management: page the KV cache, schedule at the token, share what you can, and evict under pressure. PagedAttention and continuous batching are co-dependent and together account for most of the throughput most teams will ever get from a single config change. Treat the rest — quantization, speculation, disaggregation, the exact flag names — as a fast-moving surface to benchmark, not a spec sheet to trust. And keep one number on the wall: the per-request latency you’re spending to buy each marginal token of throughput. That trade, not any leaderboard, is what you’re actually tuning.