SGLang and RadixAttention for prefix reuse

If your workload shares long prefixes (system prompts, few-shot exemplars, agent scaffolds), automatic prefix caching is a nearly free latency win. This is where serving for agents diverges from serving for chat.

2026-06-17 · 1 min · By Frontier Checkpoint Editorial

Source: sgl-project/sglang (repo) · notable

Agent and structured-generation workloads send the same long prefix over and over. RadixAttention indexes KV-cache blocks in a radix tree so that an identical prefix is computed once and shared across requests, rather than recomputed per request. It is a concrete example of a broader theme we keep returning to: the harness and serving layer, not just the model, are where agent performance is won.
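
To make the mechanism concrete, here is a minimal sketch of prefix reuse. It is not SGLang's implementation: it uses a plain block-level trie as a simplified stand-in for a radix tree, and it counts tokens instead of holding real KV tensors. The names `PrefixCache`, `serve`, and `BLOCK` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cached block (hypothetical size, chosen for the demo)


@dataclass
class Node:
    # Maps a block of token ids (as a tuple) to the next node.
    # A real server would also keep a handle to the block's KV tensors here.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy block-level trie; a simplified stand-in for a radix tree of KV blocks."""

    def __init__(self):
        self.root = Node()

    def _full_blocks(self, tokens):
        # Only whole blocks are cached; a trailing partial block is ignored.
        usable = len(tokens) - len(tokens) % BLOCK
        for i in range(0, usable, BLOCK):
            yield tuple(tokens[i:i + BLOCK])

    def match(self, tokens):
        """Return how many leading tokens are already covered by cached blocks."""
        node, hit = self.root, 0
        for block in self._full_blocks(tokens):
            child = node.children.get(block)
            if child is None:
                break
            node, hit = child, hit + BLOCK
        return hit

    def insert(self, tokens):
        """Record every full block of this request so later requests can reuse it."""
        node = self.root
        for block in self._full_blocks(tokens):
            node = node.children.setdefault(block, Node())


def serve(cache, tokens):
    """Return the number of tokens that would still need prefill for this request."""
    hit = cache.match(tokens)
    cache.insert(tokens)
    return len(tokens) - hit


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = list(range(16))  # shared 16-token prefix
    print(serve(cache, system_prompt + [100, 101, 102, 103]))  # cold request
    print(serve(cache, system_prompt + [200, 201, 202, 203]))  # warm request
```

In this toy run the first request prefills all 20 tokens, while the second prefills only the 4 tokens after the shared prefix. A production radix tree would additionally compress chains of nodes into single edges and evict unused entries under memory pressure; both are omitted here.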