Series · ongoing

RL for Reasoning Models

The policy-gradient lineage behind reasoning models — from PPO and RLHF to DPO and GRPO — with the math, the failure modes, and the libraries that implement it.

2 of 3 parts published

01
Explainers · 2026-06-04
GRPO, Demystified: Group-Relative Policy Optimization for Reasoning Models
GRPO swaps PPO's learned critic for a Monte-Carlo baseline (the mean reward over a group of sampled completions), spending extra rollout compute and giving up per-token credit assignment in exchange for a simpler, more stable RL loop on verifiable-reward tasks.
02
Libraries · 2026-06-11
TRL in Anger: SFT, DPO, and GRPO Without Rewriting Your Training Loop
TRL turns SFT, DPO, and GRPO into Trainer subclasses that inherit the entire Hugging Face stack — accelerate, peft, DeepSpeed. The convenience is real; the cost is that you're debugging someone else's training loop the moment your problem stops looking like the quickstart.
03
Planned · coming soon
