Rotary position embeddings won. Llama, Qwen, Mistral, DeepSeek, Gemma — essentially every open model you serve rotates its queries and keys instead of adding a position vector. The elegance is real: position enters through a norm-preserving rotation, and the attention logits end up depending only on the relative offset between two tokens. The catch is that the same rotation is exactly why “we extended it to 128k” is almost always a frequency-scaling trick plus a fine-tune — and why the context the model can actually use is shorter than the number on the card. Below: the rotation math walked into the extension methods (PI, NTK-aware, YaRN), the long-range attention pathologies, and the eval and serving realities that decide whether long context is real or cosmetic.

The rotation, precisely

RoPE (Su et al., RoFormer) splits each head’s dimension d into d/2 two-dimensional subspaces and rotates pair i (for i = 0, …, d/2 − 1) by an angle m·θ_i, where m is the token’s absolute position and θ_i = base^(−2i/d) is a fixed per-pair frequency. The default base, often surfaced as rope_theta, is 10000. Index i = 0 is the highest frequency, rotating a full turn every few tokens; i = d/2 − 1 is the lowest, rotating so slowly its wavelength spans tens of thousands of positions.

rope.py
import torch


def rope_freqs(dim, base=10000.0):
    # one angle per 2D subspace; the first entry is the highest frequency
    # i steps by 2 (0, 2, 4, ...), so -i / dim equals -2k/d for pair k
    i = torch.arange(0, dim, 2).float()
    return base ** (-i / dim)  # theta, shape [dim/2]


def apply_rope(x, pos, theta):
    # x: [..., seq, dim]; pos: [seq]; theta: [dim/2]
    ang = pos[:, None] * theta[None, :]  # [seq, dim/2]
    cos, sin = ang.cos(), ang.sin()
    # interleaved layout: even and odd channels form each 2D pair
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out_even = x1 * cos - x2 * sin
    out_odd = x1 * sin + x2 * cos
    # re-interleave the rotated pairs back into [..., seq, dim]
    return torch.stack((out_even, out_odd), dim=-1).flatten(-2)

Two properties matter. First, it is norm-preserving and applied only to Q and K (never V): rotation is orthogonal, so it changes phase, not norm, and it happens before the scores, so it composes with any attention kernel. Second, and this is the whole point, rotation matrices satisfy R(mθ)ᵀ R(nθ) = R((n − m)θ), so the dot product between a rotated query at position m and a rotated key at position n depends only on m − n. (If you prefer the complex view your prerequisites promised: pack each pair into a complex number and multiply by e^(i·m·θ_i); the relative phase survives.) Production code, Hugging Face for one, uses the “rotate-half” layout (first half against second half) rather than the interleaved pairs above; it is the same operation up to a permutation of dimensions.
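The relative-offset property is easy to check numerically in the complex view. The sketch below is illustrative only: the head dimension of 8, the sample vectors, and the helper names rotate and score are made up for the demonstration, and it uses the standard library rather than torch.

```python
import cmath


def rope_freqs(dim, base=10000.0):
    # theta_i = base^(-2i/dim), one frequency per 2D pair
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def rotate(pairs, pos, theta):
    # Each 2D pair is one complex number; RoPE multiplies it by e^(i * pos * theta_i).
    return [z * cmath.exp(1j * pos * t) for z, t in zip(pairs, theta)]


def score(q, k):
    # Real dot product of the underlying 2D pairs: Re(sum of q * conj(k)).
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


theta = rope_freqs(8)
q = [complex(0.3, -1.2), complex(0.7, 0.1), complex(-0.5, 0.4), complex(1.1, 0.9)]
k = [complex(-0.8, 0.2), complex(0.6, -0.3), complex(0.2, 1.0), complex(-0.4, 0.5)]

# Same offset (m - n = 7) at two very different absolute positions.
s_near = score(rotate(q, 10, theta), rotate(k, 3, theta))
s_far = score(rotate(q, 1010, theta), rotate(k, 1003, theta))
print(abs(s_near - s_far) < 1e-9)  # True: only the offset matters
```

The rotation also leaves each pair's magnitude untouched, which is the norm-preservation half of the argument.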

The spread of wavelengths is the entire long-context story. The shortest wavelength is about 2π tokens (six-ish) and the longest is about 2π times the base, tens of thousands of tokens. The high-frequency dimensions resolve local order; the low-frequency ones carry coarse, long-range position. Keep that split in mind, because every extension method is a different answer to “what do we do to each end of this spectrum.”
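To put numbers on that spread, here is a small sketch that computes the wavelength 2π/θ_i of every pair. The head dimension of 128 is an assumption chosen because it is common; the function name is illustrative.

```python
import math


def rope_wavelengths(dim, base=10000.0):
    # wavelength_i = 2*pi / theta_i = 2*pi * base^(2i/dim), in tokens
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]


lam = rope_wavelengths(128)
print(f"shortest: {lam[0]:.2f} tokens")  # 2*pi, about 6.28
print(f"longest:  {lam[-1]:.0f} tokens")  # a bit under 2*pi*base
```

With these assumed values the longest wavelength comes out in the mid tens of thousands of tokens, slightly below 2π times the base because the last exponent is (d − 2)/d rather than 1.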

Why naive extrapolation falls apart

Each dimension sees the angle m·θ_i, and the model only ever trained on positions m in the range [0, L) for some training length L. Sort the dimensions by wavelength relative to L.

  • Short-wavelength (high-frequency, small i) dimensions cycle through the full range of angles many times inside the training window. Past L, their angle modulo a full turn is a value the model has already seen — they extrapolate cleanly.
  • Long-wavelength (low-frequency, large i) dimensions, whose wavelength exceeds L, only ever sweep a narrow arc during training; they never complete a rotation. Push m past L and those dimensions enter angles the model has literally never seen. The attention logits, fit on that narrow arc, extrapolate to large, erratic values — and perplexity explodes the moment you cross L.

So the breakage is concentrated in the long-wavelength dimensions. That is the opposite of the naive guess, and it is precisely why a good extension method treats high and low frequencies differently instead of squeezing them uniformly.

This is the lens the YaRN paper makes explicit: take the ratio of training length to a dimension’s wavelength. Dimensions that complete at least one full period are safe to leave alone; dimensions that never complete a period are the ones whose angles must be brought back into range.
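That ratio is cheap to compute. The sketch below counts how many periods each pair completes inside the training window and lists the pairs that never finish one. The head dimension (128) and training length (4096) are assumptions picked for illustration, and the function name is hypothetical.

```python
import math


def periods_in_window(dim, train_len, base=10000.0):
    # r_i = L / wavelength_i: full rotations pair i completes during training
    return [
        train_len / (2 * math.pi * base ** (2 * i / dim))
        for i in range(dim // 2)
    ]


r = periods_in_window(dim=128, train_len=4096)
never_complete = [i for i, periods in enumerate(r) if periods < 1.0]
print(f"{len(never_complete)} of {len(r)} pairs never complete a rotation")
print(f"first such pair index: {never_complete[0]}")
```

Under these assumptions the at-risk pairs are a contiguous block at the low-frequency end, which is exactly the slice an extension method has to bring back into range.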

The extension toolkit: PI, NTK-aware, and YaRN

Linear Position Interpolation. Scale positions down by L/L′, where L′ is the extended target length, so every angle lands back inside the trained range: m′ = m·(L/L′). It works and it is a two-line change.

The cost is that it squeezes every frequency uniformly, including the short-wavelength dimensions that were already fine, so adjacent tokens receive nearly identical rotations and local positional resolution degrades. PI needs fine-tuning at the target length to recover; the original paper tuned on the order of a thousand steps.
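A minimal sketch of the position rescaling, assuming a 4096-token training length and a 32768-token target (both illustrative), makes the resolution loss concrete: adjacent tokens end up an eighth of a position apart.

```python
def pi_positions(positions, train_len, target_len):
    # Linear Position Interpolation: m' = m * (L / L')
    scale = train_len / target_len
    return [m * scale for m in positions]


scaled = pi_positions(range(32768), train_len=4096, target_len=32768)
print(max(scaled) < 4096)     # True: every position is back in the trained range
print(scaled[1] - scaled[0])  # 0.125: spacing between adjacent tokens
```

The scaled positions are fed to the rotation in place of the integer positions; nothing else changes, which is why the method is so cheap to try.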

NTK-aware / base scaling. Instead of scaling positions, raise the base. A larger base lengthens every wavelength, but by an amount that grows with the dimension index: the highest frequencies are barely touched, while the lowest are interpolated enough to keep their angles inside the trained arc at the extended length. Two payoffs: it spends the scaling budget where it is actually needed, and “dynamic NTK” variants (scale the base by the current sequence length) can stretch context with no fine-tuning at all, which PI cannot. The bluntest production version of this is to raise the base outright and fine-tune. The “Effective Long-Context Scaling” / ABF work did exactly that, and Llama 3 ships with a base of 500000 rather than 10000.
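One common way to pick the new base, as formulated in the YaRN paper's description of NTK-aware scaling, is base′ = base · s^(d/(d − 2)) for a scale factor s. The sketch below assumes that formula plus illustrative values (head dimension 128, s = 8) and checks what it does to each end of the spectrum.

```python
def ntk_scaled_base(base, factor, dim):
    # base' = base * s^(d / (d - 2)); chosen so the lowest frequency
    # is divided by exactly s while the highest is left unchanged
    return base * factor ** (dim / (dim - 2))


def rope_freqs(dim, base):
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


dim, factor = 128, 8.0
old = rope_freqs(dim, 10000.0)
new = rope_freqs(dim, ntk_scaled_base(10000.0, factor, dim))

print(new[0] / old[0])    # 1.0: highest frequency untouched
print(new[-1] / old[-1])  # about 1/8: lowest frequency fully interpolated
```

Pairs in between are scaled by intermediate amounts, which is the frequency-selective behavior that uniform interpolation lacks.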

YaRN is the strong default and combines two ideas: (1) NTK-by-parts, which classifies each dimension by its wavelength-to-context ratio, extrapolates the short-wavelength dimensions (leaves them unchanged), interpolates the long-wavelength ones, and ramps smoothly between; (2) an attention temperature that scales the logits by a constant slightly above 1 to counter the entropy growth that comes with a longer context, foldable into the cached cos/sin so it costs nothing at inference. The payoff: the YaRN paper reports reaching the target context with roughly 10x fewer tokens and 2.5x fewer training steps than prior interpolation methods.
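The by-parts rule and the temperature can be sketched in a few lines. This follows the formulation in the YaRN paper (ramp thresholds α = 1 and β = 32, and the fitted rule √(1/t) = 0.1·ln(s) + 1, both reported there for Llama-family models); library implementations may compute the ramp differently, so treat it as a reading aid rather than a reference implementation. The function names and the 128/4096/8 values are illustrative.

```python
import math


def yarn_freqs(dim, train_len, factor, base=10000.0, alpha=1.0, beta=32.0):
    out = []
    for i in range(dim // 2):
        theta = base ** (-2 * i / dim)
        # periods this pair completes inside the training window
        r = train_len * theta / (2 * math.pi)
        # gamma = 1: leave alone (extrapolate); gamma = 0: divide by factor (interpolate)
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        out.append((1 - gamma) * theta / factor + gamma * theta)
    return out


def yarn_logit_scale(factor):
    # sqrt(1/t) = 0.1 * ln(s) + 1, and the logits are multiplied by 1/t
    return (0.1 * math.log(factor) + 1.0) ** 2


freqs = yarn_freqs(dim=128, train_len=4096, factor=8.0)
print(freqs[0])                     # 1.0: highest frequency unchanged
print(freqs[-1] * 8.0)              # equals the original lowest frequency
print(round(yarn_logit_scale(8.0), 3))
```

In practice the square root of that scale is applied to both the rotated queries and keys, which is how it gets folded into the cached cos/sin tables.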

rope_scaling (illustrative; exact keys are version-specific)
# linear position interpolation
{"rope_type": "linear", "factor": 8.0}
# dynamic NTK: scale base with current length, no fine-tune
{"rope_type": "dynamic", "factor": 8.0}
# YaRN: by-parts interpolation + attention temperature
{"rope_type": "yarn", "factor": 8.0,
 "original_max_position_embeddings": 4096,
 "beta_fast": 32, "beta_slow": 1}

What “128k” actually buys you

Even with a clean extension and a good fine-tune, a long window is not a usable window. Three behaviors set the gap.

Attention sinks. Softmax forces the attention weights to sum to one, so a head with nothing salient to attend to has to put its mass somewhere — and it learns to dump it on the first couple of tokens, position 0 above all. StreamingLLM (Xiao et al.) named this and showed the consequence: evict those initial tokens from the KV cache and quality collapses, even though they often carry no semantic content. Any long-context serving scheme that wants to drop old KV has to keep the sinks.
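A sink-preserving eviction policy can be sketched as "keep the first few positions plus a recent window." The helper below is hypothetical and only selects which cache positions survive; the sink count of 4 and the window size are assumptions for illustration, not values taken from any particular serving stack.

```python
def kept_positions(seq_len, n_sink=4, window=1024):
    # Keep the first n_sink tokens (the attention sinks) plus the most
    # recent `window` tokens; everything in between is evicted.
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))


kept = kept_positions(seq_len=10, n_sink=2, window=3)
print(kept)  # [0, 1, 7, 8, 9]
```

A plain sliding window is the same function with n_sink=0, and that is the configuration whose quality collapses once position 0 falls out of the cache.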

Lost in the middle. Retrieval accuracy over position is U-shaped (Liu et al., arXiv 2307.03172): models use information at the very start and very end of a long context far better than anything buried in the middle. “Supports 128k” does not mean “attends to 128k uniformly.”

Effective versus nominal length. The extended window is an upper bound on position indices, not a capability guarantee. RULER (Hsieh et al.) formalized an effective context length and found that a number of models advertising 32k-plus hold their quality only up to a fraction of that. Read the card number as nominal and measure the rest yourself.

Measuring it without fooling yourself

Needle-in-a-haystack — plant a sentence in long filler and ask the model to read it back — is necessary but badly insufficient. A model that fails NIAH certainly cannot use its context; a model that passes has demonstrated fuzzy string lookup, which induction heads handle and which says nothing about reasoning across the context. A fully green NIAH heatmap is a smoke test that has been oversold as a capability result.

RULER-style synthetic suites are the honest version: controllable difficulty across multi-needle retrieval, multi-hop variable tracking, aggregation (extract the most frequent words across the whole context), and long QA. They separate retrieval from reasoning and they surface the effective-versus-nominal gap that NIAH hides. Practical guidance:

  • Pick probes shaped like your workload — multi-hop for agentic tool chains, aggregation for summarization.
  • Watch for contamination and for tasks that secretly reduce to “attend to the last sentence.”
  • Report accuracy as a function of both length and insertion depth, never a single pass/fail.
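The reporting advice above can be sketched as a probe builder plus a grid aggregator. Everything here is a hypothetical harness: the helper names, the record format of (length, depth, correct), and the toy inputs are assumptions, and the model call that produces `correct` is deliberately left out.

```python
from collections import defaultdict


def build_probe(filler_sentences, needle, depth):
    # depth in [0, 1]: 0 places the needle at the start, 1 at the end
    idx = int(depth * len(filler_sentences))
    return " ".join(filler_sentences[:idx] + [needle] + filler_sentences[idx:])


def accuracy_grid(records):
    # records: iterable of (context_length, depth, correct) tuples
    hits, totals = defaultdict(int), defaultdict(int)
    for length, depth, correct in records:
        totals[(length, depth)] += 1
        hits[(length, depth)] += int(correct)
    return {cell: hits[cell] / totals[cell] for cell in totals}


print(build_probe(["a", "b", "c", "d"], "NEEDLE", 0.5))  # a b NEEDLE c d
grid = accuracy_grid([(4096, 0.0, True), (4096, 0.0, False), (4096, 0.5, True)])
print(grid)
```

Keeping the result as a grid, rather than averaging it into one number, is what lets the U-shaped depth curve and the length cliff show up at all.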

The systems bill: KV cache, FlashAttention, serving

None of the above is free, and the bill comes due at serving time.

The KV cache is linear in sequence length and is the dominant memory cost at long context: roughly 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element * batch_size (the 2 is K and V). At 128k, and especially once you batch, that term can exceed the weights for many models, which is why grouped-query attention (Llama 3 70B uses 8 KV heads) and KV-cache quantization to FP8/INT8 stop being optional and become the reason long context fits at all.
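Plugging numbers into that formula is worth doing once. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 KV heads, head dimension 128) and a 131072-token context; the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem, batch=1):
    # 2 accounts for storing both K and V at every layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


GIB = 2 ** 30
shape = dict(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=131072)

fp16 = kv_cache_bytes(**shape, bytes_per_elem=2)
fp8 = kv_cache_bytes(**shape, bytes_per_elem=1)
mha = kv_cache_bytes(80, 64, 128, 131072, 2)  # same shape with 64 KV heads, no GQA

print(fp16 / GIB)  # 40.0 GiB for one 128k sequence at 16-bit
print(fp8 / GIB)   # 20.0 GiB with an 8-bit cache
print(mha / GIB)   # 320.0 GiB without grouped-query attention
```

The cost is per sequence, so a batch of four 128k requests at 16-bit needs 160 GiB of cache under these assumptions before any weights are loaded.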

FlashAttention composes with RoPE cleanly — the rotation is applied to Q and K before the kernel, so the IO-aware tiling is untouched, and some kernels fuse the rotation in. But FlashAttention removes the O(N^2) memory, not the O(N^2) FLOPs: a 128k prefill is genuinely compute-heavy, and that is where most long-context latency lives. Decode then becomes memory-bandwidth bound, because every generated token reads the entire KV cache, and it slows as the context grows.
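A back-of-the-envelope count shows how the quadratic term scales. This rough sketch counts only the two attention matmuls (QKᵀ and the weighted sum over V) at about 2·N²·head_dim FLOPs each per head; it ignores the roughly 2x saving from causal masking and all projection and MLP work. The 80-layer, 64-head, 128-dimension shape is an assumption for illustration.

```python
def attention_prefill_flops(seq_len, n_layers, n_heads, head_dim):
    # QK^T and attention-times-V: about 2 * N^2 * head_dim FLOPs each, per head
    per_head = 4 * seq_len * seq_len * head_dim
    return n_layers * n_heads * per_head


short = attention_prefill_flops(32768, n_layers=80, n_heads=64, head_dim=128)
long = attention_prefill_flops(131072, n_layers=80, n_heads=64, head_dim=128)
print(long // short)  # 16: four times the context, sixteen times the attention FLOPs
```

Note that the query-head count appears here, not the KV-head count: grouped-query attention shrinks the cache but does not reduce this term.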

The footgun: RoPE uses absolute position ids, so prefix caching, chunked prefill, and any KV reuse must keep position ids consistent — and the rope_scaling config at serve time must match what the model was tuned with. Serve a YaRN-tuned model without the scaling, or apply linear scaling to a model that never saw it, and nothing errors; long-context quality just quietly degrades. Check the served config against the training config before you trust a single eval.
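Since nothing errors on a mismatch, the check has to be explicit. Here is a minimal sketch that diffs the RoPE-related fields of two config dictionaries. The key names follow the Hugging Face style config fields used earlier (rope_theta, rope_scaling) plus max_position_embeddings; which keys matter for your stack is an assumption to verify, and the function name is hypothetical.

```python
ROPE_KEYS = ("rope_theta", "rope_scaling", "max_position_embeddings")


def rope_config_mismatches(train_cfg, serve_cfg, keys=ROPE_KEYS):
    # Return {key: (training value, serving value)} for every field that differs.
    return {
        k: (train_cfg.get(k), serve_cfg.get(k))
        for k in keys
        if train_cfg.get(k) != serve_cfg.get(k)
    }


train_cfg = {
    "rope_theta": 10000.0,
    "rope_scaling": {"rope_type": "yarn", "factor": 8.0},
    "max_position_embeddings": 32768,
}
serve_cfg = {"rope_theta": 10000.0, "rope_scaling": None, "max_position_embeddings": 32768}

diff = rope_config_mismatches(train_cfg, serve_cfg)
print(diff)  # only rope_scaling differs
```

Running a check like this at server start-up and failing loudly on a non-empty result turns the silent degradation into an ordinary deployment error.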

What to take away

RoPE’s relative-position-by-rotation is a frequency-domain object, and everything about long context follows from that: the failure mode is long-wavelength dimensions leaving their trained arc, the fixes are frequency-selective scaling, and the cost is a quadratic prefill plus a linear KV bill.

  • The methods form a ladder: PI (cheap, uniformly lossy), NTK / base-scaling (frequency-aware, sometimes tune-free), YaRN (the strong default). Production frequently just raises the base and fine-tunes — Llama 3’s base of 500000 is the tell.
  • Trust effective context, not nominal. NIAH is a smoke test; RULER-style multi-hop and aggregation probes are the measurement.
  • Budget for a quadratic prefill and a KV cache that can exceed your weights; GQA and KV quantization are how long context fits.
  • Verify rope_scaling matches between training and serving — the silent failure is the expensive one.

Opinionated close: treat any “supports N tokens” as unverified until you have run a length-scaled, reasoning-heavy probe on your own workload. The rotation is exact; the context window on the box is nominal.