FedSGD – FedAvg
H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” arXiv:1602.05629, Jan. 26, 2023. doi: 10.48550/arXiv.1602.05629.
In FedSGD, the server sends the current model $w_t$ to the participating devices, and each device $k$ computes the gradient $g_k$ of the loss on its local data. The server then updates the model as
$w_{t+1} \leftarrow w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k$
where $n_k$ is the number of samples on device $k$, $n = \sum_k n_k$ is the total number of samples, and $\eta$ is the learning rate.
This is equivalent to the server taking one full-batch gradient-descent step over the union of all local data, with only the gradient computation distributed across devices. However, it requires constant communication: the devices must contact the server after every single gradient step.
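As a concrete illustration, here is a minimal sketch of one FedSGD round in NumPy. The linear model, squared-error gradient, learning rate, and toy client datasets are assumptions made for the example, not taken from the paper.

```python
# Minimal FedSGD sketch (assumed linear model and squared-error loss).
import numpy as np

def local_gradient(w, X, y):
    # Gradient of mean squared error for a linear model (illustrative choice).
    return X.T @ (X @ w - y) / len(y)

def fedsgd_round(w, client_data, eta=0.1):
    """One FedSGD round: each client sends its gradient, and the server
    averages them weighted by the client's sample count n_k / n."""
    n = sum(len(y) for _, y in client_data)
    agg_grad = np.zeros_like(w)
    for X_k, y_k in client_data:
        g_k = local_gradient(w, X_k, y_k)     # computed on the device
        agg_grad += (len(y_k) / n) * g_k      # server-side weighting
    return w - eta * agg_grad                 # single global GD step

# Toy usage: three "devices" with random local datasets.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(5)
for _ in range(100):                          # one communication round per GD step
    w = fedsgd_round(w, clients)
```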
With FedAvg, each client applies the gradient itself locally, $w_{t+1}^k \leftarrow w_t - \eta g_k$, and sends the resulting model back to the server. In practice, each device performs multiple steps of gradient descent locally before sending its updated model to the server, where the client models are averaged (weighted by $n_k / n$). This drastically reduces the number of communication rounds between the devices and the server.
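A matching sketch of one FedAvg round, under the same toy assumptions (linear model, squared-error loss). The number of local steps `local_steps` stands in for the paper's local epochs, and mini-batching is omitted for brevity.

```python
# Minimal FedAvg sketch: clients train locally, the server averages the models.
import numpy as np

def local_gradient(w, X, y):
    # Same illustrative squared-error gradient as in the FedSGD sketch.
    return X.T @ (X @ w - y) / len(y)

def local_update(w, X, y, eta=0.1, local_steps=5):
    """Client-side work: several gradient steps on the local data only."""
    w_k = w.copy()
    for _ in range(local_steps):
        w_k -= eta * local_gradient(w_k, X, y)
    return w_k

def fedavg_round(w, client_data, eta=0.1, local_steps=5):
    """Server-side work: weighted average of the models returned by clients."""
    n = sum(len(y) for _, y in client_data)
    w_next = np.zeros_like(w)
    for X_k, y_k in client_data:
        w_k = local_update(w, X_k, y_k, eta, local_steps)  # runs on the device
        w_next += (len(y_k) / n) * w_k                     # weight by n_k / n
    return w_next

# Toy usage: with local_steps = 5, each communication round carries five
# gradient steps' worth of local work, so far fewer rounds are needed than FedSGD.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(5)
for _ in range(20):
    w = fedavg_round(w, clients)
```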
