Topic

DPO

Direct Preference Optimization.

1 checkpoint

CHECKPOINT 0009 · 2026-06-11 · LIBRARIES · rl · stable

TRL in Anger: SFT, DPO, and GRPO Without Rewriting Your Training Loop

TRL turns SFT, DPO, and GRPO into Trainer subclasses that inherit the entire Hugging Face stack — accelerate, peft, DeepSpeed. The convenience is real; the cost is that you're debugging someone else's training loop the moment your problem stops looking like the quickstart.