Series · ongoing
The Inference Stack
What actually happens between a request and a token: paged KV-cache, continuous batching, quantization, and the serving machinery that makes LLMs cheap enough to ship.
- 01
vLLM, Explained: PagedAttention, Continuous Batching, and the Serving Stack
vLLM treats the KV cache like OS virtual memory, storing it in non-contiguous paged blocks, and schedules work at the token level rather than the request level. You get high aggregate throughput; the cost is that per-request latency becomes something you tune rather than something you get for free.
- 02
Post-Training Quantization in Practice: GPTQ, AWQ, and FP8
Post-training quantization is the cheapest inference lever and the easiest to get wrong. The right method is set by your serving regime (bandwidth-bound decode wants weight-only INT4, compute-bound prefill wants FP8), and the win is real only if a fast kernel accelerates your exact config.
- 03
- 04