Series · ongoing
RL for Reasoning Models
The policy-gradient lineage behind reasoning models — from PPO and RLHF to DPO and GRPO — with the math, the failure modes, and the libraries that implement it.
- 01
GRPO, Demystified: Group-Relative Policy Optimization for Reasoning Models
GRPO swaps PPO's learned critic for a Monte Carlo baseline (the mean reward over a group of completions sampled for the same prompt), spending extra rollout compute and giving up per-token credit assignment in exchange for a simpler, more stable RL loop on verifiable-reward tasks.
- 02
TRL in Anger: SFT, DPO, and GRPO Without Rewriting Your Training Loop
TRL turns SFT, DPO, and GRPO into Trainer subclasses that inherit the entire Hugging Face stack — accelerate, peft, DeepSpeed. The convenience is real; the cost is that you're debugging someone else's training loop the moment your problem stops looking like the quickstart.
- 03