Series · ongoing

The Inference Stack

What actually happens between a request and a token: paged KV-cache, continuous batching, quantization, and the serving machinery that makes LLMs cheap enough to ship.

2 of 4 parts published

01
Libraries · 2026-05-20
vLLM, Explained: PagedAttention, Continuous Batching, and the Serving Stack
vLLM treats the KV cache like OS virtual memory, as non-contiguous paged blocks, and schedules work at the level of the token, not the request. You get high aggregate throughput; the cost is that per-request latency becomes something you tune rather than something you get for free.
02
Explainers · 2026-06-08
Post-Training Quantization in Practice: GPTQ, AWQ, and FP8
Post-training quantization is the cheapest inference lever and the easiest to pull wrong. The right method is set by your serving regime — bandwidth-bound decode wants weight-only INT4, compute-bound prefill wants FP8 — and the win is real only if a fast kernel accelerates your exact config.
03
Planned · coming soon
04
Planned · coming soon
