Series · ongoing

The Inference Stack

What actually happens between a request and a token: paged KV cache, continuous batching, quantization, and the serving machinery that makes LLMs cheap enough to ship.

2 of 4 parts published

  1. 01

    Libraries · 2026-05-20

    vLLM, Explained: PagedAttention, Continuous Batching, and the Serving Stack

    vLLM treats the KV cache like OS virtual memory, storing it in non-contiguous paged blocks, and schedules work at the token level rather than the request level. You get high aggregate throughput; the cost is that per-request latency becomes something you tune rather than something you get for free.

  2. 02

    Explainers · 2026-06-08

    Post-Training Quantization in Practice: GPTQ, AWQ, and FP8

    Post-training quantization is the cheapest inference lever and the easiest to pull wrong. The right method is set by your serving regime — bandwidth-bound decode wants weight-only INT4, compute-bound prefill wants FP8 — and the win is real only if a fast kernel accelerates your exact config.

  3. 03

    Planned · coming soon

  4. 04

    Planned · coming soon