TRL in Anger: SFT, DPO, and GRPO Without Rewriting Your Training Loop

TRL (Transformer Reinforcement Learning, github.com/huggingface/trl, Apache-2.0) is the path of least resistance from a base checkpoint to a post-trained model. Supervised fine-tuning, preference optimization, and online RL all live behind a handful of Trainer subclasses that inherit the entire Hugging Face ecosystem — accelerate, peft, datasets, DeepSpeed, FSDP, the lot. The cost of that convenience is control. You are running someone else’s training loop, and the moment your problem stops looking like the quickstart — a reward that needs external state, a disaggregated rollout fleet, a parallelism layout the Trainer does not expose — the abstraction leaks and you debug it from the outside. This piece is a map of where the seams are.

What is actually in the box

TRL began life as a PPO library: fine-tune GPT-2 with a value head and a clipped surrogate, in the lineage of “learning from human preferences.” It has since grown into a general post-training toolkit, and the useful mental model is a table from trainer name to the algorithm underneath. Names and signatures drift between releases, so treat the table below as a map of concepts and check the current docs for the exact API.

Trainer	Algorithm	Data shape
`SFTTrainer`	Supervised next-token cross-entropy	prompt+completion, or a `messages` column
`RewardTrainer`	Bradley–Terry pairwise reward model	`chosen` / `rejected`
`DPOTrainer`	DPO and variants (via `loss_type`)	prompt / `chosen` / `rejected`
`PPOTrainer`	PPO with a learned value head	prompts + a reward model
`GRPOTrainer`	GRPO (group-relative baseline)	prompts + reward function(s)
`OnlineDPOTrainer` / `RLOOTrainer`	Online DPO / REINFORCE leave-one-out (RLOO)	prompts + reward model or judge

SFTTrainer is a thin specialization of transformers.Trainer that handles dataset formatting, packing, and completion-only loss masking. DPOTrainer implements the Rafailov et al. (2023) objective, which collapses RLHF into a single classification-style loss over preference pairs:

$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)} \, \log \sigma\!\big( \beta\, h_\theta(x, y_w) - \beta\, h_\theta(x, y_l) \big), \quad h_\theta(x, y) = \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)$

The loss_type flag is the variant switch — sigmoid is vanilla DPO, ipo swaps in the IPO objective, and there are several others. Sibling preference methods that need different data or no reference model (KTO, ORPO, CPO) get their own trainers rather than a flag.
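To make the objective concrete, here is a minimal pure-Python sketch of the sigmoid loss for a single preference pair. It assumes you already have summed sequence log-probabilities from the policy and the reference; the function name `dpo_loss` and the default `beta=0.1` are illustrative choices, not TRL's API.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid (vanilla) DPO loss for one preference pair.

    Each argument is a sequence log-probability: the sum of per-token
    log-probs of the completion given the prompt.
    """
    h_chosen = policy_chosen_logp - ref_chosen_logp        # h_theta(x, y_w)
    h_rejected = policy_rejected_logp - ref_rejected_logp  # h_theta(x, y_l)
    z = beta * (h_chosen - h_rejected)
    # -log(sigmoid(z)) == softplus(-z), written in an overflow-safe form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: z = 0, so the loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has moved toward the chosen completion: the loss drops below log(2)
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Because only the difference of log-ratios enters the loss, the policy can lower it either by raising the chosen log-ratio or by lowering the rejected one.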

GRPOTrainer is the one most readers come for. It implements Group-Relative Policy Optimization from DeepSeekMath / DeepSeek-R1: sample $G$ completions per prompt, score each, and use the within-group z-score as the advantage, $\hat{A}_{i} = \big(r_i - \operatorname{mean}(\mathbf r)\big) \,/\, \operatorname{std}(\mathbf r)$ — no critic network. In TRL you pass reward_funcs: one or more plain Python callables that receive the batch of completions plus your dataset columns and return a list of floats. That signature is the whole reason GRPO is pleasant here — verifiable rewards (exact match, a unit-test runner, a format regex) are just functions. (The mechanics of the objective itself are the subject of this series’ GRPO explainer; here we care about the plumbing.)

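from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Assumption: any dataset with a "prompt" column works here;
# trl-lib/tldr is the one TRL's own quickstart uses.
dataset = load_dataset("trl-lib/tldr", split="train")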

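def reward_format(completions, **kwargs):
    # +1 if the model closed its <answer> tag, else 0: a verifiable reward.
    # Assumes plain-string completions (standard dataset format). With a
    # conversational dataset each completion is a list of message dicts,
    # so read c[0]["content"] instead of c.
    return [1.0 if "</answer>" in c else 0.0 for c in completions]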

cfg = GRPOConfig(
    output_dir="grpo-out",
    num_generations=8,            # G: completions sampled per prompt
    per_device_train_batch_size=8,
    max_completion_length=1024,
    temperature=0.9,
    beta=0.04,                    # KL-to-reference coefficient (0.0 = no KL term)
    use_vllm=True,
    learning_rate=1e-6,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=reward_format,
    args=cfg,
    train_dataset=dataset,        # needs a "prompt" column
)
trainer.train()
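The within-group z-score itself is only a few lines. The sketch below is a standalone illustration, not TRL's implementation: `group_advantages` is a hypothetical helper, it uses the population standard deviation (implementations may use the sample estimate instead), and `eps` is an assumed stabilizer that keeps the division defined when every reward in the group is equal.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages for the G completions of one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: successes get positive advantage, failures negative
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
# Every completion scored the same: all advantages are zero, so no signal
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second call is worth keeping in mind: a prompt on which all samples succeed (or all fail) contributes nothing to the update.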

What the quickstart hides: templates, masking, reproducibility

The bugs that cost you a weekend are almost never in the loss function. They are in tokenization.

The chat template is the first landmine. SFTTrainer applies the tokenizer’s chat_template to your messages, and if your DPO or GRPO stage uses a different template — or a base model whose tokenizer ships no template at all — your reference log-probs are computed over text the policy never saw in that form. The classic symptom is a double BOS: the template prepends <|begin_of_text|>, the tokenizer’s add_special_tokens prepends another, and your sequences quietly start with two. It will train; it will just be subtly wrong. Pin the template, render a few examples by hand, and diff the token ids.
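A check for the double-BOS symptom can be as small as counting leading BOS ids. The helpers below are hypothetical names for this sketch and operate on plain lists of token ids, so they make no assumption about which tokenizer produced them; the ids in the demo are made up.

```python
def leading_bos_count(token_ids, bos_token_id):
    """Count how many consecutive BOS tokens open the sequence."""
    count = 0
    for tok in token_ids:
        if tok != bos_token_id:
            break
        count += 1
    return count

def check_single_bos(token_ids, bos_token_id):
    """Raise if the sequence opens with more than one BOS token."""
    n = leading_bos_count(token_ids, bos_token_id)
    if n > 1:
        raise ValueError(
            f"sequence starts with {n} BOS tokens; the template and the "
            "tokenizer are probably both adding one"
        )
    return n

BOS = 1  # made-up id for the demo; use tokenizer.bos_token_id in practice
print(leading_bos_count([BOS, BOS, 42, 43], BOS))  # double BOS
print(check_single_bos([BOS, 42, 43], BOS))        # passes
```

With a Hugging Face tokenizer, one way to reproduce the problem is to render the text with `tokenizer.apply_chat_template(messages, tokenize=False)` and then tokenize that string with default settings; tokenizing with `add_special_tokens=False` is the usual fix when the template already emits BOS.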

Second: loss masking. For instruction tuning you almost never want gradient on the prompt tokens. Depending on version this is assistant_only_loss / completion_only_loss in SFTConfig, or a DataCollatorForCompletionOnlyLM configured with the response template string. Get it wrong and the model learns to generate the user’s turns too. For DPO the analogous trap is that chosen and rejected must share an identical prompt prefix and identical templating — the loss is a difference of log-ratios, so any prompt-side asymmetry leaks straight into the gradient.
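Mechanically, completion-only loss comes down to the label tensor. This is a simplified sketch of the idea rather than TRL's collator: `mask_prompt_labels` is a hypothetical helper, and it relies on the convention that label `-100` is ignored by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_prompt_labels(prompt_ids, completion_ids):
    """Build (input_ids, labels) so only completion tokens carry loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

input_ids, labels = mask_prompt_labels([101, 7, 8, 9], [55, 56, 2])
print(input_ids)  # full sequence the model sees
print(labels)     # prompt positions masked, completion positions kept
```

A quick sanity check on a real batch is to count the labels that are not `-100` and confirm the count matches the completion length you expect.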

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",
    args=SFTConfig(
        output_dir="sft-out",
        packing=True,             # concatenate samples to fill max_length
        max_length=4096,
        # mask user/system turns from the loss; assumes the chat template
        # marks assistant spans with {% generation %} ... {% endgeneration %}
        assistant_only_loss=True,
    ),
    train_dataset=dataset,        # conversational: a "messages" column
)

Two more reproducibility notes. packing=True concatenates samples to fill the context window — efficient, but unless your stack uses position-id resets or a block-diagonal attention mask, tokens from sample A can attend to sample B. For short-context SFT it rarely matters; for anything position-sensitive, verify. And pin your transformers/tokenizers versions in the run: a template change in a point release silently changes your training distribution, and “I upgraded a dependency” is the least satisfying root cause to find three weeks later.
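To see what “position-id resets” and a “block-diagonal attention mask” mean for a packed row, here is a small pure-Python sketch. Both function names are hypothetical, and real stacks build these as tensors inside the attention kernel; the point is only the shape of the bookkeeping.

```python
def pack_with_boundaries(samples):
    """Concatenate samples, restarting position ids at each boundary."""
    input_ids, position_ids, segment_ids = [], [], []
    for seg, sample in enumerate(samples):
        input_ids.extend(sample)
        position_ids.extend(range(len(sample)))   # reset to 0 per sample
        segment_ids.extend([seg] * len(sample))   # which sample owns the token
    return input_ids, position_ids, segment_ids

def block_causal_mask(segment_ids):
    """mask[i][j] is True when token i may attend to token j."""
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]

ids, pos, seg = pack_with_boundaries([[10, 11, 12], [20, 21]])
mask = block_causal_mask(seg)
print(pos)         # positions restart at the second sample
print(mask[3][2])  # first token of sample B cannot see sample A
```

If the mask were plain causal (`j <= i` only), `mask[3][2]` would be `True`, which is exactly the cross-sample leakage described above.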

Scaling down and out: PEFT, accelerate, FSDP

Every TRL trainer takes a peft_config, and that single argument is what makes 7B–70B post-training fit on hardware you can rent. Pass a LoraConfig and you train adapters; combine it with a 4-bit bitsandbytes base and you are doing QLoRA. The non-obvious win is on the RL side: for DPO and GRPO with LoRA, TRL computes the reference log-probs by disabling the adapters on the same base weights rather than holding a second full model in memory.
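A back-of-the-envelope count shows why adapters are so much cheaper to train. LoRA replaces the update to a weight matrix with a low-rank product, so the trainable parameters per matrix scale with the rank rather than with the matrix area. The dimensions and rank below are illustrative assumptions, not values from any particular model.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns W + B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)

d = 4096   # assumed hidden size for the illustration
rank = 16  # assumed LoRA rank
full = d * d
lora = lora_param_count(d, d, rank)
print(full, lora, lora / full)
```

For a square 4096-wide projection at rank 16, the adapter holds 1/128 of the parameters of the full matrix, and optimizer state shrinks by the same factor.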

Because the trainers are transformers.Trainer underneath, multi-GPU is an accelerate launch away rather than a rewrite. You select DeepSpeed ZeRO-3 or FSDP through an accelerate config file; gradient checkpointing, mixed precision, and gradient accumulation are config flags that compose with everything above. The interactions are where care is needed — gradient checkpointing trades compute for activation memory, and with online RL the generation step has its own memory footprint that the training-side sharding does not touch.

# 4×GPU, ZeRO-3 offload via an accelerate+deepspeed config
accelerate launch --config_file zero3.yaml grpo_min.py

One GRPO-specific footgun lives here: the effective batch size (per-device batch times world size times gradient accumulation; the exact product depends on the TRL version) must be evenly divisible by num_generations. Add a GPU without adjusting the batch and you get a hard error at startup, not a warning.
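A pre-flight check for this constraint is cheap to run before you queue a job. The helper below is a hypothetical sketch of the arithmetic described above, not TRL's own validation, and the exact terms in the product may differ in your TRL version.

```python
def check_grpo_batch(num_generations, per_device_batch, world_size, grad_accum=1):
    """Return the number of distinct prompts per effective batch, or raise."""
    effective = per_device_batch * world_size * grad_accum
    if effective % num_generations != 0:
        raise ValueError(
            f"effective batch {effective} is not divisible by "
            f"num_generations={num_generations}"
        )
    return effective // num_generations

print(check_grpo_batch(8, per_device_batch=2, world_size=4))  # 8 / 8 -> 1 prompt
try:
    check_grpo_batch(8, per_device_batch=2, world_size=5)      # 10 / 8 fails
except ValueError as err:
    print(err)
```

The second call is the “added a GPU” case: going from 4 to 5 processes turns a valid layout into an invalid one without any change to the config file.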

Online RL is a generation problem

Here is the thing the SFT/DPO experience does not prepare you for: in GRPO, PPO, and online DPO, the optimizer step is not the bottleneck — generation is. Every training step you sample many completions per prompt, and autoregressive decoding with model.generate is slow, memory-fragmented, and stuck on the same device as your training process. On reasoning tasks with long completions, rollouts can dominate wall-clock by a wide margin, which means your GPU spends most of an “RL” run doing inference.
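Before optimizing anything it helps to measure the split. The sketch below is a generic wall-clock timer, not a TRL feature: `PhaseTimer` is a hypothetical helper, and the `time.sleep` calls stand in for your own rollout and optimizer code. On GPU you would also need to synchronize the device before reading the clock, since CUDA kernels launch asynchronously.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTimer:
    """Accumulate wall-clock time per named phase of a training step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def share(self, name):
        """Fraction of all recorded time spent in the given phase."""
        total = sum(self.totals.values())
        return self.totals[name] / total if total else 0.0

timer = PhaseTimer()
for _ in range(3):
    with timer.phase("rollout"):
        time.sleep(0.02)    # stand-in for sampling completions
    with timer.phase("optimizer"):
        time.sleep(0.005)   # stand-in for forward/backward/step
print(f"rollout share: {timer.share('rollout'):.0%}")
```

If the rollout share is well above one half, faster generation will buy more than any change to the training-side configuration.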

This is why TRL grew a vLLM integration. Set use_vllm=True and rollouts run on vLLM’s PagedAttention engine instead of HF generate. There are two deployment shapes, and the choice is a real one:

Colocate: the vLLM engine shares GPUs with training. Simplest to launch, but the engine and the trainer fight over the same VRAM, so you tune vllm_gpu_memory_utilization and accept the contention.
Server: stand up a separate inference process (the trl vllm-serve command) on its own GPUs and point the trainer at it. More moving parts and better utilization; this is the shape to use for multi-node runs.

Either way there is a tax the docs underplay: weight synchronization. After each optimizer step the updated policy weights must be pushed into the vLLM engine before the next rollout, or you are sampling from a stale policy. That transfer is pure overhead, and minimizing it — by syncing less often, or streaming weights efficiently — is exactly the engineering the heavyweight RL frameworks were built around.

Reading the dashboards before they read you

TRL logs to Weights & Biases or TensorBoard for free, and the metrics are your early-warning system for the failure modes that RL training is heir to.

For DPO, watch rewards/chosen, rewards/rejected, rewards/margins, and rewards/accuracies. These are implicit rewards: the $\beta$-scaled log-ratios $\beta\, h_\theta(x, y)$ from the loss, not an external score. Healthy training pushes the margin up and accuracy toward 1. The pathology to catch is both chosen and rejected rewards sliding negative together: the policy is drifting from the reference faster than it is learning the preference, usually a sign beta is too low.
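These four series are simple functions of the same log-probabilities the loss uses. The sketch below recomputes them for a batch of pairs; `dpo_reward_metrics` is a hypothetical helper and the numbers are invented to show the pathology, not taken from a real run.

```python
def dpo_reward_metrics(pairs, beta=0.1):
    """Implicit-reward metrics for a batch of preference pairs.

    pairs: list of (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
    sequence log-probabilities.
    """
    chosen = [beta * (pc - rc) for pc, _, rc, _ in pairs]
    rejected = [beta * (pr - rr) for _, pr, _, rr in pairs]
    n = len(pairs)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        "rewards/margins": sum(c - r for c, r in zip(chosen, rejected)) / n,
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }

# Invented example: the policy lowered BOTH log-probs relative to the reference
metrics = dpo_reward_metrics([(-12.0, -20.0, -10.0, -12.0)])
for name, value in metrics.items():
    print(name, round(value, 3))
```

In this example the margin is positive and accuracy is perfect, yet the chosen reward is negative: the loss looks healthy while the policy is assigning less probability to the preferred answer than the reference did.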

For GRPO, the panel to live in is reward, reward_std, kl, and completion_length, plus a series per reward function.

reward_std collapsing toward zero means every sample in a group earns the same score, so every advantage is zero (or 0/0 if the implementation has no stabilizing epsilon) and the gradient signal is gone. The task is too easy or too hard, or your reward is saturated.
completion_length creeping up while reward plateaus is the length-bias / reward-hacking tell: the model has found that longer outputs game the scorer. It is one of the most commonly reported GRPO pathologies.
kl exploding means the policy is running away from the reference; raise beta or revisit the reward.
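The three tells above can be turned into a crude automated check over logged metrics. Everything in this sketch is an assumption for illustration: the function name, the dict keys (chosen to mirror the panel names), and especially the thresholds, which you would tune per task.

```python
def grpo_warnings(history, window=100, std_floor=0.01, kl_ceiling=0.5,
                  length_growth=0.2, min_reward_gain=0.01):
    """Flag common GRPO pathologies from logged metrics, oldest first.

    history: list of dicts with keys reward, reward_std, kl, completion_length.
    All thresholds are hypothetical defaults.
    """
    warnings = []
    latest = history[-1]
    if latest["reward_std"] < std_floor:
        warnings.append("reward_std collapsed: no advantage signal")
    if latest["kl"] > kl_ceiling:
        warnings.append("kl exploding: policy leaving the reference")
    if len(history) > window:
        past = history[-1 - window]
        growth = latest["completion_length"] / past["completion_length"] - 1
        gain = latest["reward"] - past["reward"]
        if growth > length_growth and gain < min_reward_gain:
            warnings.append("length rising with flat reward: possible hacking")
    return warnings

logs = [
    {"reward": 0.50, "reward_std": 0.30, "kl": 0.01, "completion_length": 200},
    {"reward": 0.60, "reward_std": 0.28, "kl": 0.02, "completion_length": 220},
    {"reward": 0.60, "reward_std": 0.25, "kl": 0.02, "completion_length": 300},
    {"reward": 0.60, "reward_std": 0.005, "kl": 0.90, "completion_length": 400},
]
for w in grpo_warnings(logs, window=2):
    print(w)
```

A check like this does not replace reading the curves, but it is enough to stop a long run early instead of finding the collapse the next morning.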

A note on beta: some R1-style recipes set the KL coefficient to zero and rely on clipping plus the on-policy distribution to stay grounded.

Several follow-ups, including the “Dr. GRPO” critique, argue that GRPO’s length and difficulty normalizations introduce biases and that the KL term is not always doing what you think; see the GRPO explainer in this series for the structural argument. The practical takeaway: do not treat any single published recipe as canonical; ablate beta and the normalization on your own task.

TRL, or something else

TRL is the right default when you are already in the Hugging Face ecosystem and want SFT, DPO, or GRPO on anything from a single GPU to a modest multi-GPU node without writing distributed plumbing. It is not always the right tool.

A from-scratch loop — when the research is the loss function, the advantage estimator, or the sampling scheme, and you need every line under your control. TRL’s abstraction is exactly what is in your way here.
Axolotl — a YAML-config layer over Transformers/TRL. Same engines, but reproducible recipes as files; excellent for running many SFT/DPO configs without touching Python.
Unsloth — hand-written Triton kernels for single-GPU and small-multi-GPU fine-tuning. Unsloth reports roughly 2x speedups and large memory reductions versus a stock setup; it drops in as the model behind TRL’s trainers, so it is more a complement than a competitor for the single-GPU case. Verify the numbers against your own model and sequence length.
OpenRLHF / veRL — Ray-based frameworks that disaggregate the actor, reference, reward, and vLLM rollout across GPU pools. This is what you graduate to for multi-node, large-scale reasoning RL where rollout/training separation and throughput are the whole game. veRL’s HybridFlow design exists precisely to wring out the throughput that a colocated TRL run leaves on the table.

The honest decision rule: start in TRL, instrument the rollout, and only migrate when generation throughput or multi-node coordination — not the algorithm — becomes the thing you spend your days fighting.

The bottom line

TRL’s value is that it makes the standard post-training recipe boring: a messages dataset, a peft_config, an accelerate launch, and you have SFT, DPO, or GRPO running with logging and checkpointing you did not write. Spend your scrutiny where the abstraction is thin — the chat template, the loss mask, the rollout engine, and the four GRPO curves above — because that is where correctness and wall-clock actually live. The trainer names will change by the time you read this; the seams will not. Pin your versions, profile the generation step, read the dashboards adversarially, and reach for OpenRLHF or veRL only when the bottleneck is genuinely the framework and not your reward function.
