Topic

RLHF

Reinforcement learning from human feedback.

3 checkpoints

2026-06-13 · SIGNALS

DeepSeek-R1: RL-trained reasoning with open weights

The reproducible part is the method, not a leaderboard cell: group-relative RL on verifiable rewards, with open weights to probe. It is the cleanest public artifact for understanding the reasoning-model training loop.

CHECKPOINT 0009 · 2026-06-11 · LIBRARIES · rl · stable

TRL in Anger: SFT, DPO, and GRPO Without Rewriting Your Training Loop

TRL turns SFT, DPO, and GRPO into Trainer subclasses that inherit the Hugging Face training stack, including its accelerate, peft, and DeepSpeed integrations. The convenience is real; the cost is that you're debugging someone else's training loop the moment your problem stops looking like the quickstart.

dpo grpo peft fine-tuning

CHECKPOINT 0007 · 2026-06-04 · EXPLAINERS · advanced

GRPO, Demystified: Group-Relative Policy Optimization for Reasoning Models

GRPO swaps PPO's learned critic for a Monte-Carlo baseline: the mean reward over a group of completions sampled for the same prompt. It pays for that with extra rollout compute and coarser, per-sequence credit assignment, and in return gets a simpler, more stable RL loop on verifiable-reward tasks.

grpo rlhf ppo reasoning
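
The group baseline in the GRPO summary above can be sketched in a few lines. This is a minimal illustration, not the full algorithm: the function name `group_relative_advantages`, the `eps` parameter, and the sample rewards are hypothetical, and dividing by the group standard deviation is a commonly described normalization that the summary itself does not mention.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion against the mean reward of its own group.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. The group mean stands in for a learned critic.
    Dividing by the group standard deviation is an assumed
    normalization step; `eps` guards against a zero-variance group.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards)
    return [(r - baseline) / (scale + eps) for r in rewards]


# Hypothetical verifiable rewards: 1.0 if the answer checked out, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion earns the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Every token in a completion shares that completion's single advantage, which is where the loss of per-token credit assignment comes from. A group with identical rewards yields all-zero advantages, so it contributes no gradient.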