GRPO, Demystified: Group-Relative Policy Optimization for Reasoning Models

PPO for language models carries a passenger almost nobody wants: a value network roughly the size of the policy, trained to predict token-level returns from a reward that only arrives at the end of the sequence. GRPO’s move is to fire that passenger. Instead of a learned critic, it estimates the baseline empirically — sample a group of completions per prompt and let the group’s mean reward be the baseline each completion is measured against. You drop the second network and its optimizer state, you delete an entire failure mode (critic divergence), and you pay for it in generation: every gradient step now needs $G$ rollouts per prompt instead of one. That trade — sampling compute and a coarser credit signal in exchange for no critic — is the whole story, and whether it’s a good deal depends almost entirely on your reward.

The objective, from the baseline up

Start where every policy-gradient method starts. The REINFORCE estimator $\nabla_\theta \mathbb{E}[r] = \mathbb{E}[\,r(o)\,\nabla_\theta \log \pi_\theta(o)\,]$ is unbiased but high-variance. You tame the variance by subtracting a baseline $b$ that does not depend on the sampled action: $(r - b)\,\nabla \log \pi$ has the same expectation, lower variance. The variance-minimizing baseline is approximately the expected return for that prompt, $b \approx \mathbb{E}[\,r \mid q\,]$. PPO learns that quantity with a critic $V_\phi$. GRPO estimates it by Monte Carlo: draw $G$ completions for prompt $q$ and use their mean reward.
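
The "same expectation" claim rests on one step worth writing out. As a sketch, assume $b$ is constant with respect to the sampled completion $o$ (it may still depend on the prompt $q$); the baseline term then integrates to zero:

$\mathbb{E}_{o \sim \pi_\theta}\big[\,b\,\nabla_\theta \log \pi_\theta(o)\,\big] = b \sum_{o} \pi_\theta(o)\,\frac{\nabla_\theta \pi_\theta(o)}{\pi_\theta(o)} = b\,\nabla_\theta \sum_{o} \pi_\theta(o) = b\,\nabla_\theta 1 = 0$

A group mean that includes $r_i$ itself is not strictly independent of sample $i$, but it differs from the leave-one-out mean only by a constant factor: $r_i - \mathrm{mean}(r_1, \dots, r_G) = \frac{G-1}{G}\big(r_i - \mathrm{mean}_{j \neq i}(r_j)\big)$. The direction of the update is unchanged; only its scale shrinks slightly.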

Concretely, for prompt $q$, sample $\{o_1, \dots, o_G\} \sim \pi_{\theta_{old}}$, score each with a reward $r_i$, and compute a group-relative advantage as a z-score:

$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_{1}, \dots, r_{G})}{\mathrm{std}(r_{1}, \dots, r_{G})}$

Under outcome supervision (one reward per completion, the dominant setting for reasoning), every token in completion $i$ inherits the same scalar $\hat{A}_i$. There is no intra-sequence credit assignment; a brilliant deduction and a filler token in the same correct trace get identical credit.

The advantage then plugs into a PPO-style clipped surrogate with a KL anchor:

$\mathcal{J}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\big(\rho_{i,t}\hat{A}_i,\ \mathrm{clip}(\rho_{i,t},1-\epsilon,1+\epsilon)\,\hat{A}_i\big)-\beta\,\mathbb{D}_{KL}\!\big[\pi_\theta\,\|\,\pi_{ref}\big]\right]$

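where $\rho_{i,t} = \pi_\theta(o_{i,t}) / \pi_{\theta_{old}}(o_{i,t})$ is the per-token importance ratio (both probabilities conditioned on the prompt and the preceding tokens), $\epsilon$ is the clip range, and $\beta$ weights the KL penalty. Implementations typically estimate the KL term per token and average it over the same tokens as the surrogate. The advantage computation is the entire novelty, and it is about ten lines: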

import torch

def group_advantages(rewards, group_size, eps=1e-4, normalize_std=True):
    # rewards: (B*G,) flat tensor, grouped contiguously by prompt
    r = rewards.view(-1, group_size)              # (B, G)
    adv = r - r.mean(dim=1, keepdim=True)         # subtract the group baseline
    if normalize_std:                             # GRPO; Dr. GRPO drops this line
        adv = adv / (r.std(dim=1, keepdim=True) + eps)
    return adv.view(-1)                           # one scalar per completion

That is the part that does not exist in PPO. There is no value_head, no GAE, no return bootstrapping — just a mean and a standard deviation over $G$ samples.
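
To make the numbers concrete, here is a small worked example for a single prompt. It is a sketch in plain Python rather than torch so it runs without dependencies; the reward values are made up for illustration, and `stdev` is the sample standard deviation, which matches the default behaviour of `torch.std`.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-4, normalize_std=True):
    """Mean-center (and optionally std-scale) the rewards of one prompt's group."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered                      # the Dr. GRPO variant
    scale = stdev(rewards) + eps             # sample std (n - 1 in the denominator)
    return [c / scale for c in centered]

# Illustrative rewards: 2 of 8 completions are correct.
mixed = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = group_advantages(mixed)
# Correct completions land near +1.62, incorrect ones near -0.54.

# A group where every completion failed: all advantages are zero.
tied = group_advantages([0.0] * 8)
```

The two correct completions share one positive advantage and the six incorrect ones share one negative advantage, and the values sum to zero by construction. The tied group produces no learning signal at all, which matters later when choosing prompts.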

What the critic bought, and why a group mean can stand in

The PPO critic did two jobs. First, it provided a variance-reducing baseline. Second, via bootstrapping and GAE, it spread a sequence-level reward across tokens, so earlier tokens could receive credit for later success. The group mean replaces the first job cleanly — it is an unbiased, same-prompt Monte Carlo estimate of the baseline, and for sparse sequence-level rewards it is arguably a better estimate than a value head that has to learn token-level returns from a signal it only ever sees at EOS. The second job is genuinely lost. GRPO is, structurally, an admission that the token-level credit a value function promised was mostly fictional anyway when the reward is one bit at the end of a thousand-token trace.

What you keep from PPO matters. The clip is still a trust region: across inner epochs the data goes off-policy relative to $\pi_\theta$, and clipping $\rho_{i,t}$ stops a single token’s ratio from exploding the update. The KL-to-reference still anchors the policy to a sane distribution so it cannot wander into degenerate text that happens to score well. One subtlety practitioners miss: GRPO adds the KL as an explicit penalty in the loss (using Schulman’s positive, low-variance k3 estimator), rather than folding it into the per-token reward the way InstructGPT-era RLHF does. With no value function, there is nothing to bootstrap a reward-shaped KL through, so it lives in the objective directly.

def grpo_loss(logp, old_logp, ref_logp, adv, mask, eps=0.2, beta=0.04):
    # logp/old_logp/ref_logp: (N, T) per-token log-probs of sampled tokens
    # adv: (N,) one advantage per completion, broadcast over tokens
    adv = adv[:, None]
    ratio = torch.exp(logp - old_logp)
    pg = torch.minimum(ratio * adv,
                       torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
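    # Schulman's k3 estimator of KL(pi_theta || pi_ref): non-negative per token,
    # and unbiased when the tokens were sampled from pi_theta
    kl = torch.exp(ref_logp - logp) - (ref_logp - logp) - 1.0
    per_tok = pg - beta * kl
    # per-response length normalization -- the term Dr. GRPO flags;
    # the clamp guards against a fully masked (empty) completion
    return -((per_tok * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)).mean()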
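
The k3 term is easier to trust after checking it against a KL divergence you can compute exactly. The sketch below does that for two made-up three-token distributions (`policy` and `reference` are illustrative, not taken from any model): it samples tokens from the policy, averages k3 over the samples, and compares the result with the closed-form value.

```python
import math
import random

def k3(logp, ref_logp):
    """Per-sample k3 estimate of KL(policy || reference) for a policy-sampled token."""
    log_ratio = ref_logp - logp
    return math.exp(log_ratio) - log_ratio - 1.0   # (r - 1) - log r >= 0 for all r > 0

policy = [0.7, 0.2, 0.1]        # hypothetical next-token distribution of pi_theta
reference = [0.5, 0.3, 0.2]     # hypothetical next-token distribution of pi_ref

exact_kl = sum(p * math.log(p / q) for p, q in zip(policy, reference))

rng = random.Random(0)
tokens = rng.choices(range(3), weights=policy, k=20000)   # sample from the policy
samples = [k3(math.log(policy[t]), math.log(reference[t])) for t in tokens]
estimate = sum(samples) / len(samples)
```

Every individual sample is non-negative, unlike the naive estimator `logp - ref_logp`, which is why a noisy batch cannot push the KL term into rewarding divergence. In a real training loop the tokens come from $\pi_{\theta_{old}}$ rather than the current policy, so the estimate is only approximately unbiased after the first inner step.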

Rewards: why GRPO lives in verifiable domains

GRPO’s reputation was made on math and code, and that is not an accident of who published first. It is structural. The group baseline only produces a gradient when there is reward variance within the group: if all $G$ completions score identically, every advantage is zero and the prompt contributes nothing. Reinforcement learning with verifiable rewards (RLVR) supplies exactly the crisp, varying signal this needs.

In these domains the reward is a program, not a model:

Exact-match / equivalence for math: parse the final answer out of a \boxed{} span and compare to ground truth, ideally with a symbolic check (sympy) so 1/2 and 0.5 agree.
Unit tests for code: execute the completion against held-out tests, reward the pass fraction or a binary pass/fail.
Format rewards: a regex that checks the output put its reasoning in a <think> block and its answer where the parser expects it.
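
A minimal sketch of the first and third reward types might look like the following. The helper names `extract_boxed`, `answers_match`, `reward_correct` and `reward_format` are illustrative, and two simplifications are assumptions of this sketch: equivalence is checked with the standard library's `Fraction` instead of sympy (enough to make 1/2 and 0.5 agree, but not a symbolic check), and the format rule assumes a single `<think>` block followed by the answer.

```python
import re
from fractions import Fraction

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} span, or None if absent."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:                      # matching close of \boxed{
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None                                 # unbalanced braces

def answers_match(predicted, truth):
    """Numeric equality when both parse as rationals, else exact string match."""
    if predicted is None:
        return False
    try:
        return Fraction(predicted.replace(" ", "")) == Fraction(truth.replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return predicted.strip() == truth.strip()

# Assumed format: reasoning in one <think> block, then a non-empty answer.
THINK_THEN_ANSWER = re.compile(r"\A\s*<think>.+?</think>\s*\S.*\Z", re.DOTALL)

def reward_correct(completions, answer, **kwargs):
    return [1.0 if answers_match(extract_boxed(c), a) else 0.0
            for c, a in zip(completions, answer)]

def reward_format(completions, **kwargs):
    return [1.0 if THINK_THEN_ANSWER.match(c) else 0.0 for c in completions]

completions = [
    r"<think>half of one</think> The answer is \boxed{0.5}.",
    r"no reasoning \boxed{2}",
    r"<think>x</think> \boxed{\frac{1}{2}}",
]
truths = ["1/2", "3", r"\frac{1}{2}"]
correct = reward_correct(completions, truths)
formatted = reward_format(completions)
```

The string fallback is deliberately strict: `\frac{1}{2}` matches only itself here, which is exactly the kind of brittleness a symbolic checker removes and a policy can otherwise learn to exploit.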

Two properties make this the right fit. A verifier is cheap: there is no extra reward-model forward pass per sample, which matters precisely because GRPO already multiplies your generation bill by $G$. And a verifier is hard to over-optimize in the way a learned reward model is: there is no smooth scalar surface for the policy to climb into nonsense. The DeepSeekMath paper, which introduced GRPO, credits it with a multi-point lift on MATH and GSM8K over the same model’s SFT checkpoint. The more striking demonstration is in the DeepSeek-R1 report: R1-Zero is trained with GRPO directly on a base model, no supervised warmup at all, and the paper reports its AIME 2024 pass@1 climbing from the mid-teens to roughly 71% purely through RL on verifiable rewards. That is the result that turned GRPO from a DeepSeekMath footnote into a default.

The pathologies nobody warns you about

GRPO is not a finished recipe, and treating the DeepSeekMath formulation as canonical is how you ship a length-hacked model. The most cited critique is Dr. GRPO (“Understanding R1-Zero-Like Training”, Sea AI Lab), which identifies two normalization biases baked into the original objective.

The first is the per-response length normalization, the $1/|o_i|$ in the loss. For a completion with negative advantage (a wrong answer), dividing by its length means a longer wrong answer is penalized less per token. Gradient descent notices, and the policy learns to pad incorrect responses, which is easily mistaken for the model “learning to think longer” when it is really learning to dilute its penalty. The second is the std-normalization in the advantage: dividing by the group’s reward standard deviation up-weights questions where rewards barely vary (nearly all-right or nearly all-wrong), so the batch’s gradient is dominated by the easiest and hardest prompts rather than the informative middle. Dr. GRPO’s prescription is to drop both: use $\hat{A}_i = r_i - \mathrm{mean}(r)$ with a constant, length-independent token normalizer. Whether you adopt it is a judgment call, but you should make it deliberately rather than inherit the biases by copy-paste.
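
Both biases show up in simple arithmetic. The sketch below uses made-up lengths and a hypothetical constant normalizer `MAX_TOKENS` (Dr. GRPO only requires that the normalizer not depend on the response length) to compare the per-token weight of a short and a long wrong answer, and then the factor that std-normalization applies to binary-reward groups of different difficulty.

```python
from statistics import stdev

MAX_TOKENS = 2048   # hypothetical constant normalizer (e.g. the generation budget)

def per_token_weight(advantage, length, constant_normalizer=None):
    """Coefficient each token's log-prob gradient receives from its completion."""
    denom = length if constant_normalizer is None else constant_normalizer
    return advantage / denom

def std_scale(num_correct, group_size=8, eps=1e-4):
    """Factor by which std-normalization multiplies a binary-reward group's advantages."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return 1.0 / (stdev(rewards) + eps)

# Length bias: same negative advantage, different lengths.
short_wrong = per_token_weight(-1.0, length=200)     # -0.005 per token
long_wrong = per_token_weight(-1.0, length=2000)     # -0.0005 per token: 10x milder

# With a constant normalizer the per-token penalty no longer depends on length.
short_fixed = per_token_weight(-1.0, 200, constant_normalizer=MAX_TOKENS)
long_fixed = per_token_weight(-1.0, 2000, constant_normalizer=MAX_TOKENS)

# Difficulty bias: groups with 1, 4, or 7 of 8 correct.
hard, medium, easy = std_scale(1), std_scale(4), std_scale(7)
```

Under per-response normalization the 2000-token wrong answer is penalized ten times less per token than the 200-token one. On the std side, the groups with one or seven correct answers are scaled up by roughly 1.5x relative to the evenly split group, which is the up-weighting of extremes described above.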

Reward hacking survives even verifiable rewards. Models learn to farm a format reward without solving anything, to exploit a brittle answer extractor, to special-case the visible unit tests, or — as in R1-Zero — to mix languages in a way the verifier tolerated until a language-consistency reward was bolted on. The KL term is also a live design choice, not a constant: DAPO (ByteDance/Tsinghua) drops the KL penalty entirely for reasoning RL, arguing the clip alone is a sufficient trust region and the KL just tethers the policy to a base distribution you are explicitly trying to move away from. DAPO also contributes a useful toolkit on top of GRPO: clip-higher (an asymmetric upper clip that preserves exploration and fights entropy collapse), dynamic sampling (discard prompts where all $G$ rewards tie, since they carry no gradient), and overlong-response reward shaping. Read these as the live design space around the objective, not optional polish.
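
Clip-higher is a one-line change to the surrogate, which a scalar sketch makes visible. The function below is illustrative rather than any library's API; `eps_high=0.28` mirrors the setting reported in the DAPO paper, and the ratios are made up.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped objective for one token, with separate lower/upper clip ranges."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

# A token the policy already up-weighted by 25%, in a completion with positive advantage.
symmetric = clipped_surrogate(1.25, advantage=1.0)                    # capped at 1.2
clip_higher = clipped_surrogate(1.25, advantage=1.0, eps_high=0.28)   # 1.25, still unclipped

# Negative advantage: the lower bound is untouched by clip-higher.
pushed_down = clipped_surrogate(0.7, advantage=-1.0, eps_high=0.28)   # capped at -0.8
```

When the result equals the clipped branch, the objective is flat in the ratio and that token contributes no gradient. Raising only the upper bound lets low-probability tokens with positive advantage keep growing for longer, which is the exploration argument; the lower bound stays put so that tokens are not driven toward zero probability any faster.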

The knobs that actually move training

Group size $G$. Larger groups give a lower-variance baseline and a better chance of within-group reward variance, at linear generation cost. DeepSeekMath used $G = 64$; many open recipes run 8 to 16. $G = 1$ is degenerate: a single sample equals its own group mean, so the advantage is always zero.
Sampling temperature. Needs to be high enough (commonly 0.6 to 1.0) to produce genuinely diverse completions. This is the lever that creates the variance the baseline depends on; treat it as a learning-rate-class hyperparameter, not a generation default.
Batch construction. Prompts whose $G$ completions all share a reward are dead weight. DAPO-style dynamic sampling filters them so the effective batch stays full of gradient-bearing examples. Difficulty curation matters for the same reason: trivially easy or impossibly hard prompts waste rollouts.
Reference-model refresh. Iterative GRPO periodically resets $\pi_{ref}$ to the current policy (re-anchoring the KL), which lets the policy keep moving without the reference dragging it back; the alternative is to shrink $\beta$ or drop the KL term entirely.
Base vs SFT init. Base-model RL is the cleaner experiment and can discover reasoning from scratch, but yields rough output; an SFT cold start trains faster and produces readable traces. Pick based on whether you are doing science or shipping.
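
The filtering half of dynamic sampling fits in a few lines. This is a sketch under the same layout assumption as the advantage code (rewards stored flat and grouped contiguously by prompt); the function name is hypothetical, and a full DAPO-style implementation would also keep sampling new prompts until the batch is refilled, which is omitted here.

```python
def keep_informative_groups(rewards, group_size, tol=1e-8):
    """Return indices of prompts whose group of rewards is not constant."""
    if len(rewards) % group_size != 0:
        raise ValueError("rewards must contain a whole number of groups")
    kept = []
    for prompt_idx in range(len(rewards) // group_size):
        group = rewards[prompt_idx * group_size:(prompt_idx + 1) * group_size]
        if max(group) - min(group) > tol:       # some within-group variance
            kept.append(prompt_idx)
    return kept

# Three prompts, G = 4: all correct, mixed, all wrong.
rewards = [1.0, 1.0, 1.0, 1.0,
           0.0, 1.0, 0.0, 0.0,
           0.0, 0.0, 0.0, 0.0]
informative = keep_informative_groups(rewards, group_size=4)   # only prompt 1 survives
```

Tracking the fraction of prompts that survive this filter is also a cheap health metric: if it drifts toward zero, the prompt set has become too easy or too hard for the current policy.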

In TRL the whole loop is a trainer plus a reward function, and the reward function is where your domain knowledge lives:

from trl import GRPOConfig, GRPOTrainer

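# extract_boxed, reward_format and ds are assumed to be defined elsewhere: an answer
# parser, a second reward callable, and a dataset with "prompt" and "answer" columns.
def reward_correct(completions, answer, **kwargs):      # the verifier
    return [1.0 if extract_boxed(c) == a else 0.0
            for c, a in zip(completions, answer)]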

cfg = GRPOConfig(
    num_generations=8,          # G
    temperature=0.9,
    max_completion_length=2048,
    beta=0.0,                   # several RLVR recipes drop the KL penalty
    num_iterations=1,           # PPO-style inner epochs per batch
    use_vllm=True,              # rollouts dominate wall-clock; offload them
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=[reward_correct, reward_format],
    args=cfg,
    train_dataset=ds,
)
trainer.train()

Trainer and config field names drift between TRL versions, so treat this as illustrative and check the current docs — but the shape is stable: a reward callable that returns one float per completion, and a config whose most consequential field is num_generations. Note use_vllm: with $G$ rollouts per step, generation throughput, not the backward pass, sets your epoch time, which is why fast inference is now a first-class part of the RL stack.

Picking the tool, and what is still open

The three post-training options sort cleanly by what reward signal you have. DPO is offline: no sampling, no reward function, just a fixed dataset of preference pairs and a closed-form contrastive loss. It is cheap, stable, and reproducible, and it is capped by that dataset and off-policy by construction — reach for it for style and preference alignment when you have pairs and no verifier. PPO is the most general online method: it works with a learned reward model and keeps the critic, so it can chase fuzzy human-preference rewards and assign per-token credit, at the price of the most moving parts and the most ways to diverge. GRPO sits exactly in between — online and on-policy like PPO, but with a programmatic reward and no critic — and it is the right tool when you have a cheap verifier and can afford the rollouts.

The honest framing is that GRPO is not a better optimizer than PPO; it is PPO with the critic amortized into samples. That is an excellent deal precisely when the reward is sequence-level and verifiable and your budget is memory- rather than generation-bound, and a bad one when rewards are dense per-token, no cheap verifier exists, or you are already inference-constrained. The genuinely open question is the one the critic used to answer and GRPO simply dropped: per-token credit assignment over long reasoning traces. Dr. GRPO and DAPO are already chipping pieces off the canonical objective, the KL term is negotiable, and “mean over a group plus a clip” is starting to look less like a law and more like the first thing that worked. If you are building on GRPO today, build on the structure — group baseline, trust-region clip, verifiable reward — and hold the specific normalization choices loosely.
