Agent and structured-generation workloads send the same long prefix over and over. RadixAttention indexes KV-cache blocks in a radix tree so identical prefixes are computed once and shared, rather than recomputed per request. It is a concrete example of a broader theme we keep returning to: the harness and serving layer, not just the model, is where agent performance is won.
SIGNAL · SIGNALS
SGLang and RadixAttention for prefix reuse
If your workload has heavy shared prefixes — system prompts, few-shot exemplars, agent scaffolds — automatic prefix caching is close to free latency. This is where serving for agents diverges from serving for chat.
1 MINBY Frontier Checkpoint Editorial
- Source
- sgl-project/sglang ↗repo · notable