Topic

PPO

Proximal Policy Optimization.

1 checkpoint

CHECKPOINT 0007 · 2026-06-04 · EXPLAINERS · advanced

GRPO, Demystified: Group-Relative Policy Optimization for Reasoning Models

GRPO replaces PPO's learned critic with a Monte Carlo baseline: the mean reward over a group of completions sampled for the same prompt. The cost is extra rollout compute and the loss of per-token credit assignment; the gain is a simpler, more stable RL loop on verifiable-reward tasks.
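
To make the group-mean baseline concrete, here is a minimal sketch of the advantage computation for a single prompt. The function name `group_relative_advantages`, the `normalize` flag, and the sample rewards are illustrative assumptions, not taken from the checkpoint. Dividing by the group standard deviation is common in GRPO formulations but is not stated in the summary above, so it is left off by default.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=False, eps=1e-8):
    """Score each completion against the mean reward of its own group.

    rewards:   one scalar reward per sampled completion of the same prompt.
    normalize: assumption, not from the source; if True, also divide by the
               group's population standard deviation (plus eps).
    """
    if not rewards:
        raise ValueError("need at least one reward in the group")

    # Monte Carlo baseline: the group mean stands in for a learned critic.
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]

    if normalize:
        # eps keeps the division finite when every reward in the group is equal.
        scale = pstdev(rewards) + eps
        advantages = [a / scale for a in advantages]

    return advantages


# Hypothetical verifiable rewards (1.0 = correct answer, 0.0 = incorrect)
# for four completions sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Each completion gets a single scalar advantage that applies to all of its tokens, which is the coarser credit assignment mentioned above. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.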