March 15, 2026 · 8 min read · distributed · PyTorch · FSDP · NVIDIA · LLM

How FSDP sharding cut our training wall-time by 3.2×

A teardown of the sharding strategy and NCCL tuning that eliminated inter-node bandwidth as the bottleneck on a multi-node GPU cluster.

Updated March 24, 2026


Distributed training has a dirty secret: the bottleneck is rarely the GPU. It’s the fabric between them.

When I first got the chance to work on distributed GPU training, on a cluster of 12 nodes across 6 racks, each with 128 AMD EPYC CPU cores, 8 NVIDIA A100 SXM 80 GB GPUs, and roughly 3.6 TB of RAM, saying I was excited would be an understatement. Initially I couldn't utilise any of that optimally. Compute utilisation hovered around 40%; the other 60% was GPUs waiting for gradients while I drowned in NCCL errors.

I was happy I'd got it running at all, but I was working with billions of files and thousands of terabytes of data (hyperspectral and multispectral satellite images), and this bottleneck was creating a new problem at the research lab - pre-training itself. Each step took hours even with such a massive resource pool.

This was my first time dealing with anything at this scale. The data kept growing, and the training time per step was so high that our multi-GPU setup showed only a minimal speed-up over fewer devices. Latency between epochs was long, and crashes from NCCL errors caused by GPU race conditions were frequent.

The actual problem

Standard DDP keeps a full model replica on every device. During the backward pass, all-reduce synchronises gradients across replicas - and that communication is proportional to parameter count. At 7B parameters, you’re moving roughly 28 GB of gradient tensors per step, per node.
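That figure is just parameter count times gradient width, assuming FP32 gradients:

```python
params = 7e9     # 7B-parameter model
grad_bytes = 4   # FP32 gradients, 4 bytes each
print(f"{params * grad_bytes / 1e9:.0f} GB")  # -> 28 GB moved per all-reduce
```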

At 12 nodes, the all-reduce becomes the bottleneck before compute is anywhere near saturated.

What we changed

FSDP over DDP. FSDP shards model parameters, gradients, and optimizer state across all devices. Each rank holds a shard and communicates only what its shard requires. The all-reduce becomes a reduce-scatter — fundamentally lower traffic at scale.
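A minimal sketch of the switch, assuming a torchrun launch; build_model() is a hypothetical helper, not our actual training code:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

model = build_model()  # hypothetical helper returning the 7B model
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, and optimizer state
    device_id=torch.cuda.current_device(),
)
```

With FULL_SHARD, each rank all-gathers parameters just-in-time for its own forward/backward and reduce-scatters gradients back to their owning shards.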

NCCL tuning. The default NCCL configuration is conservative. Two changes mattered most (see the sketch after this list):

  1. Bucket sizes: larger buckets, fewer communication ops per step, amortised setup overhead.
  2. Async overlap: gradient communication overlaps with the next layer’s backward pass.
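
One way to realise both with FSDP's own knobs - a minimal sketch, where the wrap-size threshold is an illustrative assumption, not our production value:

```python
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, BackwardPrefetch
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# (1) Coarser sharding units -> larger, fewer collectives per step,
#     amortising per-op setup cost. The 100M-parameter threshold is illustrative.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=100_000_000)

model = FSDP(
    model,  # the unwrapped module, before the FSDP call shown earlier
    auto_wrap_policy=wrap_policy,
    # (2) Prefetch the next unit's all-gather while the current layer's
    #     backward is still computing, hiding communication behind compute.
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
)
```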

FP8 in the final training stages. Weight tensors shrink, FSDP all-gather traffic drops proportionally.
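In PyTorch, FP8 weights usually come via a library such as NVIDIA Transformer Engine; the traffic mechanism itself can be sketched with FSDP's built-in MixedPrecision policy, shown here with bfloat16 purely to illustrate the same principle:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Communicating 16-bit tensors halves all-gather and reduce-scatter
# payloads relative to FP32; FP8 shrinks them further still.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # parameters are all-gathered in bf16
    reduce_dtype=torch.bfloat16,  # gradients are reduce-scattered in bf16
)

model = FSDP(model, mixed_precision=mp_policy)
```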

Results

Wall-time per epoch: [PLACEHOLDER — from X] → [PLACEHOLDER — to Y]. That’s the 3.2× reduction.

GPU utilisation climbed from roughly 40% to over 80%. The remaining headroom is unavoidable synchronisation at this scale.

The fix wasn’t exotic. It was identifying where the bottleneck actually was — in the communication pattern, not the model — and choosing the right primitive for that pattern.