Sharding the Model: FSDP, ZeRO, and Tensor/Pipeline Parallelism
Past one GPU you stop training a model and start operating a distributed system. Here is what each parallelism axis actually shards, what it costs on the wire, and how practitioners stack them into 3D/4D layouts.