Robust Global FL Pocket Refs #154
@@ -1 +1,176 @@
| <!-- markdownlint-disable-file MD033 MD013 --> | ||
|
|
||
| # Robust Global FL Approaches | ||
|
|
||
| {{ #aipr_header }} | ||
|
|
||
| ## Data heterogeneity in standard ML | ||
|
|
||
| In standard ML, a core underlying assumption when training and deploying | ||
| a model is that the training data is distributionally | ||
| similar to new data to which the model will be applied. There are methods | ||
| that specialize in out-of-domain generalization, but in most cases | ||
| models are assumed to be applied on data that is drawn from the same | ||
| statistical distributions that describe the data on which it was trained. The | ||
| validity of this assumption can degrade, for example, over time or due to | ||
| the model being used to make predictions in entirely new domains. | ||
|
|
||
| While data shifts present a significant challenge in centralized ML training, | ||
| the characteristics that describe data shifts in this domain also exist in FL | ||
| when comparing disparate, distributed datasets. Data shift between such | ||
| datasets is typically referred to as "data heterogeneity" between clients. Such | ||
| heterogeneity introduces new obstacles in FL and is quite prevalent. Before | ||
| discussing its impact on federated training and how it is addressed, let's | ||
| define some types of data divergence. Three common ways to describe | ||
| disparities or shifts between training and inference data are:[^1] | ||
|
|
||
| 1. [Label Shift](#label-shift) | ||
| 2. [Covariate Shift](#covariate-shift) | ||
| 3. [Concept Drift](#concept-drift) | ||
|
|
||
| Let \\(X\\) and \\(Y\\) represent the feature (input) and label (output) | ||
| spaces, respectively for a model. Shifts are present, regardless of whether | ||
| model performance degrades, when the joint distributions | ||
|
|
||
| $$ | ||
| \begin{align} | ||
| \\mathbb{P}\_{\\text{train}}(X, Y) \\neq \\mathbb{P}\_{\\text{test}}(X, Y). \tag{1} | ||
| \end{align} | ||
| $$ | ||
|
|
||
| ### Label Shift | ||
|
|
||
| Label shifts occur when there is a change in the label distribution \\(\\mathbb{P}(Y)\\) | ||
| with a fixed class-conditional distribution \\(\\mathbb{P}(X \\vert Y)\\). That is, the | ||
| probability of seeing different label values shifts, but the distribution of | ||
| features conditioned on the labels does not change. A pertinent example of | ||
| this might be data meant to train a model to diagnose COVID-19 in the early | ||
| days of spread versus the later stages when the virus was widely circulating. | ||
| Generally, the symptoms, given that someone had the virus, did not markedly | ||
| change. However, the prevalence of the virus, \\(\\mathbb{P}(Y)\\), did. | ||
|
|
||
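To make this concrete, the label-shift setting can be simulated in a few lines. This is a sketch with invented Gaussian class-conditionals and prevalences: only \\(\\mathbb{P}(Y)\\) changes between the two samples, while \\(\\mathbb{P}(X \\vert Y)\\) is held fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, p_positive):
    """Draw labels from P(Y), then features from a fixed P(X | Y).

    P(X | Y=0) = N(0, 1) and P(X | Y=1) = N(2, 1) in both samples;
    only the label prevalence p_positive differs.
    """
    y = rng.binomial(1, p_positive, size=n)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

# Early-spread prevalence vs. wide-circulation prevalence, as in the
# COVID-19 example above (rates invented for illustration).
x_early, y_early = sample(100_000, p_positive=0.01)
x_late, y_late = sample(100_000, p_positive=0.30)

print(y_early.mean(), y_late.mean())  # P(Y=1) shifts: ~0.01 vs ~0.30
# P(X | Y=1) does not: both conditional means sit near 2.
print(x_early[y_early == 1].mean(), x_late[y_late == 1].mean())
```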
| ### Covariate Shift | ||
|
|
||
| Covariate shifts between data distributions represent a change in the feature | ||
| distribution, \\(\\mathbb{P}(X)\\), while the statistical relationship of labels to | ||
| features, \\(\\mathbb{P}(Y \\vert X)\\), remains fixed. Consider the setting of training | ||
| a readmission risk model on data drawn from the patient population of a | ||
| general hospital. If, for instance, that model were transferred for use at a | ||
| nearby pediatric hospital, assuming all else equal, predictions from that model | ||
| would be influenced by covariate drift due to the change in patient | ||
| demographics. Namely, though features associated with younger patients are | ||
| likely part of the general hospital population, they will, of course, be | ||
| statistically over-represented in the data points seen by the model at the | ||
| pediatric hospital. | ||
|
|
||
| ### Concept Drift | ||
|
|
||
| Concept drift is characterized by a change in \\(\\mathbb{P}(Y \\vert X)\\) provided a | ||
| fixed \\(\\mathbb{P}(X)\\). Essentially, this drift encapsulates a shift in the | ||
| predictive relationship between the features, \\(X\\), and the labels, \\(Y\\). | ||
| As an illustrative example, consider training a purchase conversion model | ||
|
Collaborator: I think we could use Trump tariffs as part of this example, but maybe we shouldn't be political.

Author: I may or may not have been thinking of this when I wrote the example, but chose to be more vague 😂
||
| for airline ticket purchases where two possible incentives are features. The | ||
| first offers a ticket discount to encourage purchase, whereas the second offers | ||
| free add-ons. In good economic periods, the second incentive may produce higher | ||
| conversion rates. On the other hand, in periods of economic uncertainty, | ||
| perhaps the first offer would do so. | ||
|
|
||
| Note that each of the shifts discussed above may exist in isolation or be | ||
| present together to varying degrees. | ||
|
|
||
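The three shift types above can be read directly off the two factorizations of the joint distribution in Equation (1): each shift varies one factor of the joint while holding the other fixed.

$$
\begin{align*}
\\mathbb{P}(X, Y) &= \\mathbb{P}(Y) \\, \\mathbb{P}(X \\vert Y) && \\text{(label shift varies } \\mathbb{P}(Y)\\text{)} \\\\
&= \\mathbb{P}(X) \\, \\mathbb{P}(Y \\vert X) && \\text{(covariate shift varies } \\mathbb{P}(X)\\text{, concept drift varies } \\mathbb{P}(Y \\vert X)\\text{)}
\end{align*}
$$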
| ## How does data heterogeneity manifest in FL? | ||
|
|
||
| In FL, differences in training data distributions are not strictly temporal or | ||
| marked by a change in the joint probability distributions of the training and | ||
| test datasets, as expressed in Equation (1). Each client participating in | ||
| federated training might naturally exhibit distribution disparities compared | ||
| to one another. Consider the example given in the Section on | ||
| [Covariate Shift](#covariate-shift). If the general and pediatric hospitals | ||
| would like to collaboratively train a model using FL, the demographics of | ||
| their patient populations mean that there will be substantial statistical | ||
| heterogeneity between their respective training datasets. | ||
|
|
||
| Each distributed training dataset in an FL system may naturally exhibit, | ||
| compared with the others, the various disparities discussed above. As a further | ||
| example, consider two financial institutions working together to train a fraud | ||
| detection model. Because of their different clientele, one bank may see | ||
| fraud in 2% of its transactions, while the other may see it in only 0.1%: an | ||
| example of label shift, potentially among other disparities. | ||
|
|
||
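FL experiments often simulate this kind of label skew by partitioning a centralized dataset with per-class Dirichlet proportions. The sketch below uses an invented transaction pool and fraud rate; the helper name is ours, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy pool of 10,000 transactions with a 1% overall fraud rate.
labels = rng.binomial(1, 0.01, size=10_000)

def label_skewed_split(labels, n_clients=2, alpha=0.5):
    """Partition example indices so that clients receive different label mixes.

    For each class, a Dirichlet(alpha) draw decides what fraction of that
    class each client receives; smaller alpha yields stronger label skew.
    """
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        proportions = rng.dirichlet([alpha] * n_clients)
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_indices, np.split(idx, cuts)):
            client.extend(part.tolist())
    return [np.array(c) for c in client_indices]

clients = label_skewed_split(labels)
for i, idx in enumerate(clients):
    print(f"client {i}: n = {len(idx)}, fraud rate = {labels[idx].mean():.4f}")
```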
| ## How does it impact FL models and their training? | ||
|
|
||
| Data heterogeneity, in its various forms, has been linked to a number of | ||
| challenges in training FL models using methods like | ||
| [FedAvg](../vanilla_fl/fedavg.md), including slower convergence, performance | ||
| degradation, and unevenly distributed training dynamics among clients. In Lin et al.,[^2] | ||
| a clear illustration of the impact of data heterogeneity is provided. In the | ||
| figures below, two clients have locally trained a model on their respective | ||
| datasets. | ||
|
|
||
| <figure> | ||
| <center> | ||
| <img src="../../assets/local_model_1.png" alt="Local Model 1" height="300"> | ||
| <img src="../../assets/local_model_2.png" alt="Local Model 2" height="300"> | ||
| <figcaption>Two clients with different datasets. Note that each holds a | ||
| slightly different view of the feature space. Notably, Client 1 (left) has a | ||
| distinct cluster of data points in the bottom right and fewer points labeled in | ||
| green within the red cluster. </figcaption> | ||
| </center> | ||
| </figure> | ||
|
|
||
| The decision boundaries of the locally trained models are largely similar but | ||
| differ in important ways. If the two models are averaged via FedAvg (see figure | ||
| below), the result is a blurred decision boundary which has diverged from the | ||
| sharp boundary one would expect to compute were the data agglomerated and a | ||
| central model trained. Alternatively, using an approach that is more robust | ||
| to data heterogeneity, FedDF,[^2] the resulting model exhibits the kinds of | ||
| classification boundaries one would expect when considering the data | ||
| distributions from a global perspective. | ||
|
|
||
| <figure> | ||
| <center> | ||
| <img src="../../assets/fedavg_model.png" alt="FedAvg Model" height="300"> | ||
| <img src="../../assets/fed_df_model.png" alt="FedDF Model" height="300"> | ||
| <figcaption>Model resulting from FedAvg (left) compared with the model | ||
| trained using FedDF (right).</figcaption> | ||
| </center> | ||
| </figure> | ||
|
|
||
| There are two common routes, among many others, for addressing | ||
| heterogeneity in FL. The first is to maintain a sense of a single global model | ||
| to be trained by all participants. Modifications to items like the aggregation | ||
| strategy, local learning objectives, or corrections to model updates are | ||
| applied to better align FL training with the dynamics of centralized training | ||
| without sacrificing most of the benefits associated with the original FedAvg | ||
| algorithm. The second route is to abandon, to one degree or another, the idea | ||
| of a global model that performs well across all clients and instead allow | ||
| each client to train a unique model. This is known as Personalized | ||
| FL (pFL). Such models still benefit from global information through aspects of | ||
| FL, but more strongly emphasize local distributions. | ||
|
|
||
| <figure> | ||
| <center> | ||
| <img src="../../assets/heterogeneity_two_routes_alt.svg" alt="Two FL Routes", width="75%"> | ||
| <figcaption>Two possible routes for addressing data heterogeneity in FL.</figcaption> | ||
| </center> | ||
| </figure> | ||
|
|
||
| In the subsequent sections of this chapter, we'll cover a few of the many FL | ||
| methods aimed at robust global model optimization in FL. Such models are often | ||
| more generalizable and are more easily distributed to new domains than their | ||
| pFL equivalents. However, model performance on each client may not be | ||
| as high as those produced by pFL approaches. | ||
|
|
||
| #### References & Useful Links <!-- markdownlint-disable-line MD001 --> | ||
|
|
||
| [^1]: | ||
| J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift | ||
| in machine learning. MIT Press, 2008. | ||
|
|
||
| [^2]: | ||
| [Lin, Tao et al. “Ensemble distillation for robust model fusion in federated | ||
| learning”. In: Proceedings of the 34th International Conference on Neural | ||
| Information Processing Systems. NIPS ’20. Vancouver, BC, Canada: Curran | ||
| Associates Inc., 2020.](https://proceedings.neurips.cc/paper/2020/file/18df51b97ccd68128e994804f3eccc87-Paper.pdf) | ||
|
|
||
| {{#author emersodb}} | ||
This file was deleted.
@@ -0,0 +1,130 @@
| <!-- markdownlint-disable-file MD033 MD013 --> | ||
|
|
||
| # The FedOpt Family of Aggregation Strategies | ||
|
|
||
| {{ #aipr_header }} | ||
|
|
||
| Recall that modern deep learning optimizers like AdamW[^1] or AdaGrad[^2] use | ||
| first- and second-order moment estimates of the stochastic gradients computed | ||
| during iterative optimization to adaptively modify the model updates. | ||
| At a high level, each algorithm aims to reinforce common update directions | ||
| (i.e. those with momentum) and damp update elements corresponding to noisy | ||
| directions (i.e. those with high batch-to-batch variance). The FedOpt | ||
| family[^3] of algorithms considers modifying the traditional | ||
| [FedAvg](../vanilla_fl/fedavg.md) aggregation algorithm to incorporate | ||
| similar adaptations into server-side model updates in FL. | ||
|
|
||
| ## Mathematical motivation | ||
|
|
||
| In FedAvg, recall that, after a round of local training on each client, | ||
| client model weights are combined into a single model representation via | ||
|
|
||
| $$ | ||
| \begin{align*} | ||
| \\mathbf{w}\_{t+1} = \\sum\_{k \\in C_t} \\frac{n_k}{n_s} \\mathbf{w}^k_{t+1}, | ||
| \end{align*} | ||
| $$ | ||
|
|
||
| where \\(\\mathbf{w}^k\_{t+1}\\) denotes the model weights after local | ||
| training on client \\(k\\). For round \\(t\\), each client starts local | ||
| training from the same set of weights, \\(\\mathbf{w}\_t\\). Assume that each | ||
| client has the same number of data points such that \\(n_k = m\\). With a bit | ||
| of algebra, the update is rewritten | ||
|
|
||
| $$ | ||
| \begin{align} | ||
| \\mathbf{w}\_{t+1} = \\sum\_{k \\in C_t} \\frac{n_k}{n_s} \\mathbf{w}^k\_{t+1} &= \\mathbf{w}\_t - \\frac{1}{\\vert C_t \\vert} \\sum\_{k \\in C_t} | ||
| \\left( \\mathbf{w}\_t - \\mathbf{w}^k\_{t+1} \\right), \\\\ | ||
| &= \\mathbf{w}\_t + \\frac{1}{\\vert C_t \\vert} \\sum\_{k \\in C_t} \\Delta^k\_{t+1}, \\\\ | ||
| &= \\mathbf{w}_t + \\Delta\_{t+1}. \tag{1} | ||
| \end{align} | ||
| $$ | ||
|
|
||
| Here, \\(\\Delta^k\_{t+1} = \\mathbf{w}^k\_{t+1} - \\mathbf{w}\_t\\) is just | ||
| the vector pointing from the initial model weights to those after local | ||
| training and \\(\\Delta\_{t+1}\\) is simply the uniform average of these | ||
| update vectors. | ||
|
|
||
| Recall that, if each client uses a fixed learning rate, \\(\\eta\\), and | ||
| performs a single, full gradient update, FedAvg is equivalent to centralized | ||
| large-batch SGD. Similarly, in this case, if each client performs one step of | ||
| batch SGD with a learning rate of 1.0, then the update in Equation (1) is | ||
| equivalent to a batch-SGD update with a learning rate of 1.0 for the | ||
| **server**. The "server-side" batch is the union of the batches used on each | ||
| client. | ||
|
|
||
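This equivalence is easy to check numerically. The sketch below (toy linear-regression data, invented purely for illustration) performs one full-batch gradient step per equally sized client and confirms that the FedAvg average matches a centralized full-batch step on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(0)

eta = 0.1                                 # learning rate shared by all clients
w0 = rng.normal(size=3)                   # common starting weights w_t
# Two clients with equal-sized (invented) local datasets.
data = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(2)]

def grad(w, X, y):
    """Full-batch gradient of the mean squared error 0.5 * mean((Xw - y)^2)."""
    return X.T @ (X @ w - y) / len(y)

# One full-batch gradient step per client, then a uniform FedAvg average
# (uniform because n_k is equal across clients).
local = [w0 - eta * grad(w0, X, y) for X, y in data]
w_fedavg = np.mean(local, axis=0)

# One centralized step on the pooled data.
X_all = np.concatenate([X for X, _ in data])
y_all = np.concatenate([y for _, y in data])
w_central = w0 - eta * grad(w0, X_all, y_all)

print(np.allclose(w_fedavg, w_central))   # True: the two updates coincide
```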
| The observation that \\(-\\Delta\_{t+1}\\) is simply a stochastic gradient | ||
| motivates treating these update directions like the stochastic gradients | ||
| in standard adaptive optimizers. It's important to note that if the clients, | ||
| for instance, apply multiple steps of local SGD or use different learning | ||
| rates, the exact equivalence of \\(-\\Delta\_{t+1}\\) to a stochastic gradient | ||
| is broken. However, it shares similarities with such a gradient and is, | ||
| therefore, called a "pseudo-gradient."[^3] | ||
|
|
||
| ## The algorithms: FedAdagrad, FedAdam, FedYogi | ||
|
|
||
| Drawing inspiration from three successful, traditional adaptive optimizers, | ||
| the adaptive server-side aggregation strategies of FedAdagrad, FedAdam, and | ||
| FedYogi have been proposed. See the algorithm below for details. | ||
|
|
||
| <figure> | ||
| <center> | ||
| <img src="../../assets/algorithm-fedopt.svg" alt="FedOpt Algorithms" width="100%"> | ||
| </center> | ||
| </figure> | ||
|
|
||
| Those familiar with the mathematical formulations of Adagrad, Adam,[^4] and | ||
| Yogi[^5] will recognize the general structure of these equations. Computation | ||
| of \\(m_t\\), based on the average of the update directions suggested by each | ||
| client through local training (\\(\\Delta\_{t+1}\\)), serves to accumulate | ||
| momentum associated with directions that are consistently and frequently part | ||
| of these updates. On the other hand, \\(\\nu_t\\) estimates the variance | ||
| associated with update directions throughout the server rounds. Directions | ||
| with higher variance values are damped in favor of those with more consistency | ||
| round over round. | ||
|
|
||
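As a rough sketch of how these pieces fit together (the hyper-parameter defaults follow the paper's suggestions; the client updates below are invented, and this is not the reference implementation), a single FedAdam server round might be written as:

```python
import numpy as np

def fedadam_update(w, delta, m, v, eta=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server round of FedAdam, following the structure of the algorithm above.

    delta is the pseudo-gradient: the uniform average of the client updates
    w_k - w. m accumulates momentum along consistently suggested directions,
    while v grows for high-variance directions, damping their contribution.
    """
    m = beta1 * m + (1 - beta1) * delta
    v = beta2 * v + (1 - beta2) * delta**2   # FedAdagrad: v = v + delta**2
    w = w + eta * m / (np.sqrt(v) + tau)     # FedYogi differs only in the v update
    return w, m, v

# Toy round: three clients return locally trained weights (values invented).
rng = np.random.default_rng(0)
w, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
client_weights = [w + rng.normal(scale=0.1, size=4) for _ in range(3)]
delta = np.mean([w_k - w for w_k in client_weights], axis=0)
w, m, v = fedadam_update(w, delta, m, v)
```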
| As with the usual forms of these algorithms, there are a number of | ||
| hyper-parameters that can be tuned, including \\(\\tau, \\beta_1,\\) and | ||
| \\(\\beta_2\\). However, sensible defaults are suggested in the paper, such as | ||
| \\(\\beta_1=0.9\\) and \\(\\beta_2=0.99\\). The authors also show that | ||
| performance is generally robust to \\(\\tau\\). | ||
|
|
||
| A number of experiments show that the proposed FedOpt family of algorithms | ||
| can outperform FedAvg, especially in heterogeneous settings. Moreover, these | ||
| algorithms, in the experiments of the paper, outperform SCAFFOLD,[^6] a | ||
|
Collaborator: should we have a pocket ref for SCAFFOLD in this section as well, eventually?

Author: Yes, probably. It'll be a bit more involved, just because the method is more complicated, but I have all of the figures necessary to do one. If you want to throw it on the backlog, I can tackle it at the same time as I work on pFL methods.
||
| variance reduction method aimed at improving convergence in the presence of | ||
| heterogeneity. A final advantage of the FedOpt family of algorithms is that | ||
| they are accompanied by several convergence results showing that, as long as | ||
| the variance of the local gradients is not too large, the algorithms converge | ||
| properly. | ||
|
|
||
| #### References & Useful Links <!-- markdownlint-disable-line MD001 --> | ||
|
|
||
| [^1]: | ||
| [I. Loshchilov and F. Hutter. Fixing weight decay regularization in ADAM, | ||
| 2018.](https://arxiv.org/pdf/1711.05101) | ||
|
|
||
| [^2]: | ||
| [Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods | ||
| for online learning and stochastic optimization. Journal of machine learning research, 12(7).](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) | ||
|
|
||
| [^3]: | ||
| [S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konêcný, S. Kumar, and H. B. | ||
| McMahan. Adaptive federated optimization. In ICLR 2021, 2021.](https://arxiv.org/abs/2003.00295) | ||
|
|
||
| [^4]: | ||
| [Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. | ||
| In Y. Bengio & Y. LeCun (eds.), ICLR (Poster).](https://arxiv.org/pdf/1412.6980) | ||
|
|
||
| [^5]: | ||
| [Manzil Zaheer, Sashank J. Reddi, Devendra Sachan, Satyen Kale, and | ||
| Sanjiv Kumar. 2018. Adaptive methods for nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 9815–9825.](https://proceedings.neurips.cc/paper_files/paper/2018/file/90365351ccc7437a1309dc64e4db32a3-Paper.pdf) | ||
|
|
||
| [^6]: | ||
| [S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh. | ||
| SCAFFOLD: Stochastic controlled averaging for federated learning. In | ||
| Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th | ||
| International Conference on Machine Learning, volume 119 of Proceedings of | ||
| Machine Learning Research, pages 5132–5143. PMLR, 13–18 Jul 2020.](https://proceedings.mlr.press/v119/karimireddy20a.html) | ||
|
|
||
| {{#author emersodb}} | ||