Topic

FSDP

Fully Sharded Data Parallel and ZeRO-style sharding.

1 checkpoint

CHECKPOINT 0002 · 2026-05-15 · EXPLAINERS · advanced

Sharding the Model: FSDP, ZeRO, and Tensor/Pipeline Parallelism

Past one GPU, you stop training a model and start operating a distributed system. Here is what each parallelism axis actually shards, what it costs on the wire, and how practitioners stack them into 3D/4D layouts.