Skip to main content

Foundations and FedAvg

A FedAvg training-curve figure compares federated averaging against synchronized SGD across communication rounds.

Figure: The original FedAvg paper showed that local SGD averaging can reach target accuracy in many fewer rounds than synchronized FedSGD across MNIST/CIFAR/Shakespeare. From McMahan et al., 2017 — embedded under educational fair use with attribution.

Federated learning is the training pattern where data remains on clients while models or model updates move. The usual mental picture is a server coordinating many clients: it broadcasts a current model, selected clients run local optimization, and the server aggregates the returned updates. McMahan et al. introduced this as a practical deep-learning method for decentralized mobile data and showed that local computation can reduce communication rounds by large factors compared with synchronized FedSGD [1].

The foundation matters because nearly every later federated-learning idea modifies one of four design choices: which clients participate, how much local computation they do, what they send, and how the server aggregates it. Cross-device systems emphasize massive scale, intermittent availability, and privacy-preserving aggregation. Cross-silo systems emphasize governance, auditability, higher client reliability, and institutional data sovereignty. In both cases, FedAvg is the baseline to understand before studying heterogeneity, personalization, privacy, compression, or robustness.

Definitions

Let there be KK clients. Client kk owns a local dataset PkP_k with nk=Pkn_k=\vert P_k\vert examples, and n=k=1Knkn=\sum_{k=1}^{K} n_k. For a model parameter vector wRdw\in\mathbb{R}^d, define the local objective

Fk(w)=1nkiPki(w),F_k(w)=\frac{1}{n_k}\sum_{i\in P_k}\ell_i(w),

and the population-weighted federated objective

f(w)=k=1KnknFk(w).f(w)=\sum_{k=1}^{K}\frac{n_k}{n}F_k(w).

Here i(w)\ell_i(w) is the loss on example ii, such as cross-entropy for classification or next-token prediction. A global optimum for ff may not be good for every client, but it is the canonical single-model objective.

Cross-device federated learning uses very many clients, often phones, tablets, wearables, vehicles, or sensors. A client may be available only when plugged in, idle, and on unmetered networking. The server samples a small fraction of eligible clients per round, and the system expects dropouts.

Cross-silo federated learning uses fewer clients, such as hospitals, banks, labs, manufacturers, or agencies. Clients are usually more reliable, each may hold substantial data, and governance can include contracts, audit logs, certificates, and manually approved infrastructure.

FedSGD is the direct federated version of synchronous SGD. In each round, selected clients compute gradients on local data or minibatches, and the server averages those gradients before taking a global step.

FedAvg replaces one client gradient with several local SGD steps. At communication round tt, the server has wtw_t. It samples StS_t, usually with m=max(CK,1)m=\max(CK,1) clients for client fraction CC. Each selected client initializes wt,0k=wtw_{t,0}^{k}=w_t and runs EE local epochs over minibatches of size BB:

wt,e+1k=wt,ekηBt,ek(wt,ek).w_{t,e+1}^{k}=w_{t,e}^{k}-\eta\nabla \ell_{\mathcal{B}_{t,e}^{k}}(w_{t,e}^{k}).

After local training, the client sends wt+1kw_{t+1}^{k} or Δtk=wt+1kwt\Delta_t^k=w_{t+1}^k-w_t. The server computes

wt+1=kStnkjStnjwt+1k.w_{t+1}=\sum_{k\in S_t}\frac{n_k}{\sum_{j\in S_t}n_j}w_{t+1}^{k}.

When all clients participate, this is the common formula

wt+1=k=1Knknwt+1k.w_{t+1}=\sum_{k=1}^{K}\frac{n_k}{n}w_{t+1}^{k}.

The basic hyperparameters are TT communication rounds, client fraction CC, local epochs EE, minibatch size BB, learning rate η\eta, and model dimension dd. McMahan et al. used CC, EE, and BB to expose the communication-computation tradeoff: more local work can dramatically reduce rounds, but too much local work can make clients drift toward incompatible local optima [1].

SettingCross-device FLCross-silo FL
Typical clientsPhones, IoT, vehiclesHospitals, banks, firms
Number of clientsThousands to millions2 to hundreds
AvailabilityIntermittent and biasedScheduled and reliable
Local dataSmall per client, highly personalLarger per silo, governed
Main system riskDropout, bandwidth, privacyGovernance, audit, trust
Common aggregationSampled synchronous roundsFull or partial participation
Security emphasisSecure aggregation, DP, abuse resistanceAccess control, MPC, legal audit

Key results

FedAvg can be read as a practical compromise between minibatch SGD and local training. With E=1E=1 and one minibatch per selected client, it resembles FedSGD. As EE grows, clients spend more computation between communications, and the server receives a model that has moved along the client's local objective. If client data are IID samples from the same distribution, then local objectives FkF_k are noisy approximations of ff, so averaging local SGD trajectories can approximate centralized SGD while requiring fewer synchronizations.

The communication win comes from doing more local work per round. McMahan et al. reported that FedAvg reduced communication rounds by roughly 1010 to 100100 times over synchronized SGD in their experimental settings, including MNIST/CIFAR-style image tasks and a Shakespeare next-character prediction task [1]. The precise speedup depends on model, data partition, learning rate, and target accuracy, but the mechanism is simple: one upload can carry the effect of many local minibatch updates.

FedAvg is not just averaging independently trained final models. All clients start a round from the same wtw_t, so local models remain in a shared coordinate system. The original paper illustrates that averaging neural-network weights is much more sensible when models share initialization and training history than when independently initialized networks are averaged [1]. This point still matters in modern FL: aggregation is meaningful because local optimization is coordinated around a common global iterate.

The same design creates client drift. Under non-IID data, FkF_k may point toward a local optimum that differs substantially from the optimum of ff. Multiple local steps compound the mismatch, so the update returned by client kk may no longer be an unbiased or low-variance estimate of a global descent direction. Later methods such as FedProx, SCAFFOLD, FedNova, FedBN, and MOON can all be interpreted as attempts to constrain, correct, normalize, or representationally align this drift [5], [6], [7], [11], [12].

Synchronous rounds are the cleanest baseline. In a synchronous protocol, the server waits for enough selected clients and then aggregates. This simplifies analysis and makes the round count easy to measure. Real systems add eligibility filters, deadlines, retries, secure aggregation setup, server-side optimizers, and sometimes asynchronous buffering. Bonawitz et al.'s production-system discussion emphasizes that practical FL also requires device scheduling, compression, privacy accounting, telemetry, and robust orchestration beyond the textbook algorithm [4].

FedAvg pseudocode:

initialize w_0
for t = 0, 1, ..., T-1:
m = max(floor(CK), 1)
S_t = sample m eligible clients
for each client k in S_t in parallel:
w = w_t
repeat E local epochs:
for minibatch b of size B from P_k:
w = w - eta * grad loss_b(w)
send w_{t+1}^k = w to server
w_{t+1} = sum_{k in S_t} (n_k / sum_{j in S_t} n_j) w_{t+1}^k

The important proof intuition is not that FedAvg always has a clean unbiased-gradient interpretation. Rather, in the IID limit with small local learning rates, the average of local SGD trajectories behaves like a larger minibatch stochastic update. As data become heterogeneous, higher-order terms accumulate because each client evaluates gradients at different locally moved points. This is the bridge to the heterogeneity chapter.

Visual

Worked example 1: One scalar FedAvg round

Problem. Three clients participate in one full FedAvg round. The current scalar model is wt=2.0w_t=2.0. Client sizes are n1=10n_1=10, n2=30n_2=30, and n3=60n_3=60. Each client performs local training and returns:

wt+11=1.6,wt+12=2.2,wt+13=2.5.w_{t+1}^1=1.6,\qquad w_{t+1}^2=2.2,\qquad w_{t+1}^3=2.5.

Compute the next global model.

Step 1: compute total selected examples.

n=10+30+60=100.n=10+30+60=100.

Step 2: compute aggregation weights.

α1=10100=0.10,α2=30100=0.30,α3=60100=0.60.\alpha_1=\frac{10}{100}=0.10,\quad \alpha_2=\frac{30}{100}=0.30,\quad \alpha_3=\frac{60}{100}=0.60.

Step 3: take the weighted model average.

wt+1=0.10(1.6)+0.30(2.2)+0.60(2.5)=0.16+0.66+1.50=2.32.\begin{aligned} w_{t+1} &=0.10(1.6)+0.30(2.2)+0.60(2.5)\\ &=0.16+0.66+1.50\\ &=2.32. \end{aligned}

Checked answer. The large third client pulls the model upward. A simple unweighted average would be (1.6+2.2+2.5)/3=2.1(1.6+2.2+2.5)/3=2.1, but FedAvg uses data-weighted averaging, giving wt+1=2.32w_{t+1}=2.32.

Worked example 2: Communication cost versus local computation

Problem. A model has d=5,000,000d=5{,}000{,}000 float32 parameters. Each upload sends one model update, so each parameter costs 44 bytes before compression. Training uses T=400T=400 rounds, K=1,000,000K=1{,}000{,}000 total clients, and client fraction C=0.01C=0.01. Compare total uplink traffic with a hypothetical FedSGD run that needs TSGD=8,000T_{\mathrm{SGD}}=8{,}000 rounds using the same client fraction.

Step 1: selected clients per round.

m=CK=0.01(1,000,000)=10,000.m=CK=0.01(1{,}000{,}000)=10{,}000.

Step 2: bytes per client upload.

4d=4(5,000,000)=20,000,000 bytes=20 MB.4d=4(5{,}000{,}000)=20{,}000{,}000\text{ bytes}=20\text{ MB}.

Step 3: FedAvg uplink traffic.

FedAvg uplink=Tm20 MB=40010,00020 MB=80,000,000 MB=80,000 GB=80 TB.\begin{aligned} \text{FedAvg uplink} &=T\cdot m\cdot 20\text{ MB}\\ &=400\cdot 10{,}000\cdot 20\text{ MB}\\ &=80{,}000{,}000\text{ MB}\\ &=80{,}000\text{ GB}\\ &=80\text{ TB}. \end{aligned}

Step 4: FedSGD uplink traffic.

FedSGD uplink=8,00010,00020 MB=1,600,000,000 MB=1,600,000 GB=1,600 TB.\begin{aligned} \text{FedSGD uplink} &=8{,}000\cdot 10{,}000\cdot 20\text{ MB}\\ &=1{,}600{,}000{,}000\text{ MB}\\ &=1{,}600{,}000\text{ GB}\\ &=1{,}600\text{ TB}. \end{aligned}

Step 5: ratio.

1,600 TB80 TB=20.\frac{1{,}600\text{ TB}}{80\text{ TB}}=20.

Checked answer. If local computation lets FedAvg use 400400 rounds instead of 8,0008{,}000, the uplink traffic is 2020 times smaller before considering compression or secure-aggregation overhead.

Code

import numpy as np

def fedavg_round(global_w, client_weights, client_sizes):
"""Aggregate client model vectors using FedAvg data weights."""
total = float(sum(client_sizes))
new_w = np.zeros_like(global_w, dtype=np.float64)
for w_k, n_k in zip(client_weights, client_sizes):
new_w += (n_k / total) * w_k
return new_w

def local_sgd_quadratic(w0, a, eta=0.1, steps=5):
# Minimize F_k(w) = 0.5 * (w - a)^2 on one scalar client.
w = float(w0)
for _ in range(steps):
grad = w - a
w -= eta * grad
return np.array([w])

global_w = np.array([2.0])
targets = [1.0, 2.5, 3.0]
sizes = [10, 30, 60]
local_models = [local_sgd_quadratic(global_w[0], a) for a in targets]

print([float(w[0]) for w in local_models])
print(float(fedavg_round(global_w, local_models, sizes)[0]))

Common pitfalls

  • Treating federated learning as a privacy guarantee by itself; updates can still leak information.
  • Averaging clients uniformly when the objective is defined with nk/nn_k/n weights.
  • Comparing methods by local epochs alone instead of by round count, wall-clock time, and communication.
  • Forgetting that the selected-client denominator is jStnj\sum_{j\in S_t}n_j, not always the full nn.
  • Assuming IID convergence intuition applies under label skew, feature shift, or client-specific tasks.
  • Letting EE grow without retuning η\eta; local trajectories can overshoot or drift.
  • Confusing cross-device and cross-silo assumptions about trust, availability, and audit requirements.
  • Reporting only mean accuracy when the per-client distribution may be highly unequal.
  • Ignoring downlink cost; the server also broadcasts models or compressed deltas.
  • Ignoring client eligibility bias; available clients may not represent the whole population.
  • Treating secure aggregation as compatible with every anomaly detector; individual updates may be hidden.
  • Forgetting that real FL systems need retries, deadlines, telemetry, privacy accounting, and rollout controls.

Connections

References

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," AISTATS, 2017. https://arxiv.org/abs/1602.05629

[2] P. Kairouz et al., "Advances and Open Problems in Federated Learning," Foundations and Trends in Machine Learning, 2021. https://arxiv.org/abs/1912.04977

[3] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, "Federated Learning: Strategies for Improving Communication Efficiency," 2016. https://arxiv.org/abs/1610.05492

[4] K. Bonawitz et al., "Towards Federated Learning at Scale: System Design," MLSys, 2019. https://arxiv.org/abs/1902.01046

[5] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated Optimization in Heterogeneous Networks," MLSys, 2020. https://arxiv.org/abs/1812.06127

[6] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning," ICML, 2020. https://arxiv.org/abs/1910.06378

[7] J. Wang et al., "Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization," NeurIPS, 2020. https://arxiv.org/abs/2007.07481

[8] K. Bonawitz et al., "Practical Secure Aggregation for Privacy-Preserving Machine Learning," CCS, 2017. https://dl.acm.org/doi/10.1145/3133956.3133982

[9] M. Abadi et al., "Deep Learning with Differential Privacy," CCS, 2016. https://arxiv.org/abs/1607.00133

[10] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, "Learning Differentially Private Recurrent Language Models," ICLR, 2018. https://arxiv.org/abs/1710.06963

[11] X. Li et al., "FedBN: Federated Learning on Non-IID Features via Local Batch Normalization," ICLR, 2021. https://openreview.net/forum?id=6YEQUn0QICG

[12] Q. Li, B. He, and D. Song, "Model-Contrastive Federated Learning," CVPR, 2021. https://arxiv.org/abs/2103.16257

[13] R. Geyer, T. Klein, and M. Nabi, "Differentially Private Federated Learning: A Client Level Perspective," 2017. https://arxiv.org/abs/1712.07557

[14] European Union, "Regulation (EU) 2016/679: General Data Protection Regulation," 2016. https://eur-lex.europa.eu/eli/reg/2016/679/oj

[15] U.S. Department of Health and Human Services, "Health Insurance Portability and Accountability Act of 1996." https://www.hhs.gov/hipaa/