Topic

Serving

Throughput, batching, and the LLM serving stack.

5 checkpoints

2026-06-26 · SIGNALS

vLLM and the new default shape of LLM serving

If you are still serving with naive static batching, the gap is not marginal — paged KV-cache and continuous batching change the throughput-per-GPU math, and most other stacks have copied the idea.

CHECKPOINT 0012 · 2026-06-23 · ESSAYS

The Economics of Thinking: Test-Time Compute as a Scaling Axis

Reasoning models turned inference into a per-request dial. This is an economic read on when spending FLOPs at test time actually buys accuracy, why it only pays where answers are cheap to verify, and what variable-cost inference does to latency budgets and capacity planning.

test-time-compute reasoning serving industry

2026-06-17 · SIGNALS

SGLang and RadixAttention for prefix reuse

If your workload has heavy shared prefixes (system prompts, few-shot exemplars, agent scaffolds), automatic prefix caching is a nearly free latency win. This is where serving for agents diverges from serving for chat.

serving kv-cache agents

CHECKPOINT 0008 · 2026-06-08 · EXPLAINERS · intermediate

Post-Training Quantization in Practice: GPTQ, AWQ, and FP8

Post-training quantization is the cheapest inference lever and the easiest to pull wrong. The right method is set by your serving regime — bandwidth-bound decode wants weight-only INT4, compute-bound prefill wants FP8 — and the win is real only if a fast kernel accelerates your exact config.

quantization serving kv-cache

CHECKPOINT 0003 · 2026-05-20 · LIBRARIES · serving · production

vLLM, Explained: PagedAttention, Continuous Batching, and the Serving Stack

vLLM treats the KV cache like OS virtual memory, storing it in non-contiguous paged blocks, and schedules work at the token level rather than the request level. You get high aggregate throughput; the cost is that per-request latency becomes something you tune rather than something you get for free.

paged-attention kv-cache serving llm