
DeepSeek-R1: RL-trained reasoning with open weights

The reproducible part is the method, not a leaderboard cell: group-relative RL on verifiable rewards, with open weights to probe. It is the cleanest public artifact for understanding the reasoning-model training loop.

2026-06-13 · 1 MIN · BY Frontier Checkpoint Editorial

Source: arXiv:2501.12948 · model · major

R1 is a useful anchor because it pairs a strong claim with an unusual amount of openness — weights plus a recipe centered on GRPO and rule-verifiable rewards. That makes it checkable in a way most frontier reasoning systems are not. Start with our GRPO explainer for the algorithm, then read the report for the training details.
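
To make "group-relative RL on verifiable rewards" concrete, here is a minimal sketch of the two pieces in isolation: a rule-based reward that checks a final answer, and the group-relative advantage that scores each sampled completion against the others drawn for the same prompt. It is not the report's training code. The function names, the `\boxed{...}` answer convention, the toy completions, and the use of population standard deviation with a small `eps` are all assumptions made for illustration.

```python
import re
import statistics


def rule_reward(completion: str, reference: str) -> float:
    # Hypothetical rule-verifiable reward: 1.0 if the last boxed answer
    # in the completion matches the reference string exactly, else 0.0.
    answers = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if answers and answers[-1].strip() == reference.strip():
        return 1.0
    return 0.0


def group_relative_advantages(rewards, eps=1e-6):
    # Normalize each reward against its own group: (r - mean) / std.
    # eps (an assumption) avoids division by zero when all rewards tie.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group: four sampled completions for one prompt with reference "42".
completions = [
    "... so the answer is \\boxed{42}",
    "... therefore \\boxed{41}",
    "... giving \\boxed{42}",
    "no final answer",
]
rewards = [rule_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, without a learned value model. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal in this sketch.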