Series · ongoing

RL for Reasoning Models

The policy-gradient lineage behind reasoning models — from PPO and RLHF to DPO and GRPO — with the math, the failure modes, and the libraries that implement it.

2 of 3 parts published

  1. 01

    Explainers · 2026-06-04

    GRPO, Demystified: Group-Relative Policy Optimization for Reasoning Models

    GRPO swaps PPO's learned critic for a Monte-Carlo baseline (the mean reward over a group of completions sampled for the same prompt), paying extra rollout compute and giving up per-token credit assignment in exchange for a simpler, more stable RL loop on verifiable-reward tasks.

  2. 02

    Libraries · 2026-06-11

    TRL in Anger: SFT, DPO, and GRPO Without Rewriting Your Training Loop

    TRL turns SFT, DPO, and GRPO into Trainer subclasses that inherit the entire Hugging Face stack — accelerate, peft, DeepSpeed. The convenience is real; the cost is that you're debugging someone else's training loop the moment your problem stops looking like the quickstart.

  3. 03

    Planned · coming soon