Signal · the signals desk

Signals

The live pulse.

High signal-to-noise dispatches from the frontier — papers, releases, kernels, benchmarks — each with a sharp take on why it matters for the code you write.


2026-06-26

vLLM and the new default shape of LLM serving

If you are still serving with naive static batching, the gap is not marginal: paged KV-cache and continuous batching change the throughput-per-GPU math, and most other serving stacks have since adopted both ideas.

serving paged-attention kv-cache
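
A toy count, not vLLM's scheduler: it assumes every request decodes exactly one token per step and ignores prefill and memory limits. The function names and numbers are hypothetical. It shows where both wins come from. A static batch holds its freed slots until its longest member finishes. A contiguous KV reservation holds max_len slots per sequence, while fixed-size pages waste at most one partial block per sequence.

```python
import math


def static_batching_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled on the next step."""
    queue = list(lengths)
    active = []  # tokens still to generate, one entry per occupied slot
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before this step.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Every active request emits one token; drop the ones that just finished.
        active = [r - 1 for r in active if r > 1]
    return steps


def kv_slots_reserved(lengths, max_len, block_size):
    """KV-cache token slots held: contiguous max_len reservation vs. fixed-size pages."""
    contiguous = len(lengths) * max_len
    paged = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return contiguous, paged


if __name__ == "__main__":
    lengths = [8, 1, 1, 1, 1, 1, 1, 1]  # one long request, seven short ones
    print(static_batching_steps(lengths, 2))      # 11
    print(continuous_batching_steps(lengths, 2))  # 8
    print(kv_slots_reserved(lengths, 2048, 16))   # (16384, 128)
```

The skew in request lengths is what makes the difference. With uniform lengths, static and continuous batching take the same number of steps. Real traffic is rarely uniform.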

2026-06-24

FlashAttention-3: async, low-precision, Hopper-native

The headline is hardware-specific: FA3 is a Hopper story (async copy/MMA overlap, FP8 paths). The portable lesson from the FlashAttention line is still the one that matters: attention is memory-bandwidth-bound, and the win comes from cutting HBM traffic (never writing the full score matrix out to HBM), not from cutting FLOPs.

flash-attention kernels gpu-memory
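
The bandwidth argument rests on the online-softmax trick. Keys and values are consumed block by block, and a running max and running sum rescale the partial output, so the full row of scores never has to be stored. Below is a minimal single-query sketch in pure Python. It assumes nothing about FA3's actual kernels: GPU tiling in SRAM, async pipelining and FP8 paths are all absent, and `block_size` simply stands in for the tile size.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax-weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size):
    """Online softmax: one pass over key/value blocks with a running max and sum."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of the values seen so far
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier contributions to the new max (exp(-inf) == 0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same output up to floating-point rounding. The tiled version only ever holds one block of scores plus `dim + 2` running numbers, and that is the property a GPU kernel exploits to keep the working set on-chip.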

2026-06-20

Mamba and the selective-state-space line

Worth understanding even if you ship transformers: SSMs change the asymptotics (linear in sequence length, constant-size state at inference) and the failure modes. The interesting deployments are hybrids, not pure SSMs.

state-space-models attention long-context
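
To make "constant-size state" concrete, here is a toy selective scan for a single input channel with a diagonal state of size n. The step size `delta` depends on the input, which is the "selective" part. The discretization (an `exp(-delta * a)` decay and a `delta * b` input gain) and every parameter name are simplifying assumptions. This is not Mamba's exact parameterization or its hardware-aware parallel scan.

```python
import math


def softplus(x):
    """Numerically safe log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class ToySelectiveSSM:
    """Hypothetical single-channel SSM with a diagonal state of size len(a_diag)."""

    def __init__(self, a_diag, b, c, w_delta):
        self.a = a_diag        # positive decay rates per state dimension
        self.b = b             # input projection per state dimension
        self.c = c             # output projection per state dimension
        self.w_delta = w_delta # weight mapping the input to a step size

    def step(self, h, x):
        """One recurrent update: the only thing carried forward is h (n floats)."""
        delta = softplus(self.w_delta * x)  # input-dependent step: large delta forgets faster
        h = [math.exp(-delta * a) * hi + delta * b * x
             for a, hi, b in zip(self.a, h, self.b)]
        y = sum(ci * hi for ci, hi in zip(self.c, h))
        return h, y

    def scan(self, xs):
        """Sequential scan: cost is linear in len(xs), memory is constant."""
        h = [0.0] * len(self.a)
        ys = []
        for x in xs:
            h, y = self.step(h, x)
            ys.append(y)
        return ys, h
```

Whether you feed 10 tokens or 100,000, `step` carries the same n floats forward. A transformer's KV cache grows with every token instead.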

2026-06-17

SGLang and RadixAttention for prefix reuse

If your workload has heavy shared prefixes (system prompts, few-shot exemplars, agent scaffolds), automatic prefix caching is close to a free latency win. This is where serving for agents diverges from serving for chat.

serving kv-cache agents
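
The core data structure is a prefix index over token IDs. A new request only has to prefill the tokens past its longest cached prefix. SGLang's RadixAttention keeps a radix tree over cached KV entries with LRU eviction. This sketch is deliberately simpler: an uncompressed token trie with no eviction and no KV storage, and the class and method names are hypothetical.

```python
class PrefixCache:
    """Token-level trie; each node stands in for one cached KV entry."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that the KV for this whole token sequence is now cached."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_cached_prefix(self, tokens):
        """Number of leading tokens whose KV entries can be reused."""
        node = self.root
        matched = 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            matched += 1
        return matched

    def tokens_to_prefill(self, tokens):
        """Tokens that still need a prefill pass for this request."""
        return len(tokens) - self.longest_cached_prefix(tokens)


if __name__ == "__main__":
    system_prompt = [101, 7, 7, 42, 9]  # hypothetical token IDs for a shared scaffold
    cache = PrefixCache()
    cache.insert(system_prompt + [500, 501])  # first request, now cached
    print(cache.tokens_to_prefill(system_prompt + [600, 601, 602]))  # 3
```

For an agent loop that re-sends a long scaffold plus a growing transcript on every turn, the cached prefix is most of the request. That is why the savings compound for agents more than for one-shot chat.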

2026-06-13

DeepSeek-R1: RL-trained reasoning with open weights

The reproducible part is the method, not a leaderboard cell: group-relative policy optimization (GRPO) on verifiable rewards, with open weights to probe. It is the cleanest public artifact for understanding the reasoning-model training loop.

reasoning grpo rlhf
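
The "group-relative" part is small enough to write down. You sample a group of completions per prompt, score each with a programmatic check, and normalize the rewards within the group, so no learned value model is needed. This is a sketch under assumptions: exact-match scoring and the population standard deviation are our choices. The clipped policy-ratio objective and KL term that consume these advantages are omitted.

```python
import statistics


def exact_match_reward(answer, reference):
    """Verifiable reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    completions = ["42", "41", " 42 ", "7"]  # hypothetical sampled answers for one prompt
    rewards = [exact_match_reward(c, "42") for c in completions]
    print(rewards)                             # [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(rewards))  # roughly [1, -1, 1, -1]
```

If every sample in a group gets the same reward, all advantages are zero, so that prompt contributes no gradient signal. Prompts that are always solved or never solved teach the policy nothing under this scheme.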

2026-06-11

The modded-nanogpt speedrun and the Muon optimizer

A rare fully public optimization target with a reproducible harness: exactly the kind of artifact we like. Muon, the optimizer the speedrun popularized, is the most interesting practical idea to come out of it.

pretraining optimization reproducibility
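
Muon's central move is to replace the momentum matrix with an approximately orthogonal matrix before applying it. That matrix is computed with a Newton-Schulz iteration instead of an SVD. The sketch below uses the textbook cubic iteration because its convergence is easy to check. The public implementation, as we understand it, uses a tuned quintic polynomial with only a few steps plus Nesterov-style momentum. `muon_like_update` and its default hyperparameters are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-list matrix product of an (m x k) and a (k x n) matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the polar factor U V^T of g = U S V^T with the cubic iteration."""
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]  # Frobenius scaling: singular values <= 1
    for _ in range(steps):
        # x <- 1.5 x - 0.5 x x^T x pushes every nonzero singular value toward 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x


def muon_like_update(weight, grad, momentum, lr=0.02, beta=0.95):
    """Hypothetical Muon-style step: accumulate momentum, orthogonalize, apply."""
    momentum = [[beta * m + g for m, g in zip(rm, rg)] for rm, rg in zip(momentum, grad)]
    direction = newton_schulz_orthogonalize(momentum)
    weight = [[w - lr * d for w, d in zip(rw, rd)] for rw, rd in zip(weight, direction)]
    return weight, momentum


if __name__ == "__main__":
    x = newton_schulz_orthogonalize([[1.0, 2.0, 0.5], [0.0, 1.0, 3.0]])
    print(matmul(x, transpose(x)))  # close to the 2 x 2 identity
```

The design point is that the update's direction keeps the gradient's singular vectors but flattens its singular values. No single dominant direction in a weight matrix gets an outsized step.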