8 changes: 4 additions & 4 deletions books/fl/src/SUMMARY.md
@@ -20,10 +20,10 @@
- [Vanilla FL](horizontal/vanilla_fl/README.md)
- [FedSGD](horizontal/vanilla_fl/fedsgd.md)
- [FedAvg](horizontal/vanilla_fl/fedavg.md)
- [Robust Global FL]() <-- (horizontal/robust_global_fl/README.md) -->
- [FedAdam]() <-- (horizontal/robust_global_fl/fedadam.md) -->
- [FedProx]() <-- (horizontal/robust_global_fl/fedprox.md) -->
- [MOON]() <-- (horizontal/robust_global_fl/moon.md) -->
- [Robust Global FL](horizontal/robust_global_fl/README.md)
- [FedOpt](horizontal/robust_global_fl/fedopt.md)
- [FedProx](horizontal/robust_global_fl/fedprox.md)
- [MOON](horizontal/robust_global_fl/moon.md)
- [Personalized FL]() <-- (horizontal/personalized/README.md) -->
- [FedPer]() <-- (horizontal/personalized/fedper.md) -->
- [FENDA-FL]() <-- (horizontal/personalized/fenda.md) -->
Binary file added books/fl/src/assets/FedProxAdaptation_bottom.png
Binary file added books/fl/src/assets/FedProxAdaptation_top.png
4 changes: 4 additions & 0 deletions books/fl/src/assets/SplitModels.svg
1,010 changes: 1,010 additions & 0 deletions books/fl/src/assets/algorithm-fedopt.svg
661 changes: 661 additions & 0 deletions books/fl/src/assets/algorithm-fedprox.svg
447 changes: 447 additions & 0 deletions books/fl/src/assets/algorithm-fedsgd.svg
647 changes: 647 additions & 0 deletions books/fl/src/assets/algorithm-moon.svg
223 changes: 223 additions & 0 deletions books/fl/src/assets/combined_loss_objective.svg
Binary file added books/fl/src/assets/fed_df_model.png
881 changes: 881 additions & 0 deletions books/fl/src/assets/fedavg_drift.svg
Binary file added books/fl/src/assets/fedavg_model.png
994 changes: 994 additions & 0 deletions books/fl/src/assets/fedsgd_steps.svg
4 changes: 4 additions & 0 deletions books/fl/src/assets/heterogeneity_two_routes_alt.svg
Binary file added books/fl/src/assets/local_model_1.png
Binary file added books/fl/src/assets/local_model_2.png
2 changes: 1 addition & 1 deletion books/fl/src/horizontal/README.md
@@ -54,7 +54,7 @@ This section of the book is organized as follows:
- [FedSGD](vanilla_fl/fedsgd.md)
- [FedAvg](vanilla_fl/fedavg.md)
- [Robust Global FL](robust_global_fl/index.md)
- [FedAdam](robust_global_fl/fedadam.md)
- [FedOpt](robust_global_fl/fedopt.md)
- [FedProx](robust_global_fl/fedprox.md)
- [MOON](robust_global_fl/moon.md)
- [Personalized FL](personalized/index.md)
175 changes: 175 additions & 0 deletions books/fl/src/horizontal/robust_global_fl/README.md
@@ -1 +1,176 @@
<!-- markdownlint-disable-file MD033 MD013 -->

# Robust Global FL Approaches

{{ #aipr_header }}

## Data heterogeneity in standard ML

In standard ML, when training and deploying a model, a standard underlying
assumption is that the training data is distributionally
similar to new data to which the model will be applied. There are methods
that specialize in out-of-domain generalization, but in most cases
models are assumed to be applied on data that is drawn from the same
statistical distributions that describe the data on which it was trained. The
validity of this assumption can degrade, for example, over time or due to
the model being used to make predictions in entirely new domains.

While data shifts present a significant challenge in centralized ML training,
the characteristics that describe data shifts in this domain also exist in FL
when comparing disparate, distributed datasets. Data shift between such
datasets is typically referred to as "data heterogeneity" between clients. Such
heterogeneity introduces new obstacles in FL and is quite prevalent. Before
discussing its impact on federated training and how it is addressed, let's
define some types of data divergence. Three common ways to describe
disparities or shifts between training and inference data are:[^1]

1. [Label Shift](#label-shift)
2. [Covariate Shift](#covariate-shift)
3. [Concept Drift](#concept-drift)

Let \\(X\\) and \\(Y\\) represent the feature (input) and label (output)
spaces, respectively, for a model. Shifts are present, regardless of whether
model performance degrades, when the joint distributions

$$
\begin{align}
\\mathbb{P}\_{\\text{train}}(X, Y) \\neq \\mathbb{P}\_{\\text{test}}(X, Y). \tag{1}
\end{align}
$$

### Label Shift

Label shifts occur when there is a change in the label distribution \\(\\mathbb{P}(Y)\\)
with a fixed class-conditional distribution \\(\\mathbb{P}(X \\vert Y)\\). That is, the
probability of seeing different label values shifts, but the distribution of
features conditioned on the labels does not change. A pertinent example of
this might be data meant to train a model to diagnose COVID-19 in the early
days of spread versus the later stages when the virus was widely circulating.
Generally, the symptoms, given that someone had the virus, did not markedly
change. However, the prevalence of the virus, \\(\\mathbb{P}(Y)\\), did.

### Covariate Shift

Covariate shifts between data distributions represent a change in the feature
distribution, \\(\\mathbb{P}(X)\\), while the statistical relationship of labels to
features, \\(\\mathbb{P}(Y \\vert X)\\), remains fixed. Consider the setting of training
a readmission risk model on data drawn from the patient population of a
general hospital. If, for instance, that model were transferred for use at a
nearby pediatric hospital, assuming all else equal, predictions from that model
would be influenced by covariate drift due to the change in patient
demographics. Namely, though features associated with younger patients are
likely part of the general hospital population, they will, of course, be
statistically over-represented in the data points seen by the model at the
pediatric hospital.

### Concept Drift

Concept drift is characterized by a change in \\(\\mathbb{P}(Y \\vert X)\\) given a
fixed \\(\\mathbb{P}(X)\\). Essentially, this drift encapsulates a shift in the
predictive relationship between the features, \\(X\\), and the labels, \\(Y\\).
As an illustrative example, consider training a purchase conversion model
for airline ticket purchases where two possible incentives are features. The
first offers a ticket discount to encourage purchase, whereas the second offers
free add-ons. In good economic periods, the second incentive may produce higher
conversion rates. On the other hand, in periods of economic uncertainty,
perhaps the first offer would do so.

Note that each of the shifts discussed above may exist in isolation or be
present together to varying degrees.
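These shift types can be made concrete with a small simulation. The sketch below is a hypothetical illustration (all names are my own, not drawn from the referenced text): two clients share the same class-conditional distribution \\(\\mathbb{P}(X \\vert Y)\\) but have different label prevalences \\(\\mathbb{P}(Y)\\), i.e. pure label shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_client(n, p_positive):
    # Labels drawn with a client-specific prevalence P(Y); features drawn
    # from class-conditional Gaussians P(X | Y) shared by all clients.
    y = rng.binomial(1, p_positive, size=n)
    x = rng.normal(loc=2.0 * y, scale=1.0, size=n)
    return x, y

# Label shift: same P(X | Y), different P(Y).
x_a, y_a = sample_client(10_000, p_positive=0.5)
x_b, y_b = sample_client(10_000, p_positive=0.1)

print(y_a.mean(), y_b.mean())  # prevalences differ markedly
# Class-conditional feature means remain (approximately) the same.
print(x_a[y_a == 1].mean(), x_b[y_b == 1].mean())
```

Shifting the `loc` argument per client (while keeping \\(\\mathbb{P}(Y \\vert X)\\) fixed) would instead mimic covariate shift between the two clients.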

## How does data heterogeneity manifest in FL?

In FL, differences in training data distributions are not strictly temporal or
marked by a change in the joint probability distributions of the training and
test datasets, as expressed in Equation (1). Each client participating in
federated training might naturally exhibit distribution disparities compared
to one another. Consider the example given in the Section on
[Covariate Shift](#covariate-shift). If the general and pediatric hospitals
would like to collaboratively train a model using FL, the demographics of
their patient populations mean that there will be substantial statistical
heterogeneity between their respective training datasets.

The distributed training datasets in an FL system may naturally exhibit,
relative to one another, any of the disparities discussed above. As a further
example, consider two financial institutions working together to train a fraud
detection model. Because of their different clientele, one bank may see fraud
in 2% of transactions, while the other may see it in only 0.1%: an example of
label shift, among potentially others.

## How does it impact FL models and their training?

Data heterogeneity, in its various forms, has been linked to a number of
challenges in training FL models using methods like
[FedAvg](../vanilla_fl/fedavg.md), including slower convergence, performance
degradation, and unevenly distributed training dynamics among clients. In [2],
a clear illustration of the impact of data heterogeneity is provided. In the
figures below, two clients have locally trained a model on their respective
datasets.

<figure>
<center>
<img src="../../assets/local_model_1.png" alt="Local Model 1" height="300">
<img src="../../assets/local_model_2.png" alt="Local Model 2" height="300">
<figcaption>Two clients with different datasets. Note that each holds a
slightly different view of the feature space. Notably, Client 1 (left) has a
distinct cluster of data points in the bottom right and fewer points labeled in
green within the red cluster. </figcaption>
</center>
</figure>

The decision boundaries of the locally trained models are largely similar but
differ in important ways. If the two models are averaged via FedAvg (see figure
below), the result is a blurred decision boundary which has diverged from the
sharp boundary one would expect to compute were the data agglomerated and a
central model trained. Alternatively, using an approach that is more robust
to data heterogeneity, FedDF,[^2] the resulting model exhibits the kinds of
classification boundaries one would expect when considering the data
distributions from a global perspective.

<figure>
<center>
<img src="../../assets/fedavg_model.png" alt="FedAvg Model" height="300">
<img src="../../assets/fed_df_model.png" alt="FedDF Model" height="300">
<figcaption>Model resulting from FedAvg (left) compared with the model
trained using FedDF (right).</figcaption>
</center>
</figure>

There are two common routes, among many others, for addressing heterogeneity
in FL. The first is to maintain a single global model to be trained by all
participants. Modifications to items like the aggregation strategy, local
learning objectives, or corrections to model updates are applied to better
align FL training with the dynamics of centralized training without
sacrificing most of the benefits associated with the original FedAvg
algorithm. The second route is to abandon, to one degree or another, the idea
of a global model that performs well across all clients and instead allow
each client to train a unique model. This is known as Personalized FL (pFL).
Such models still benefit from global information through aspects of FL, but
more strongly emphasize local distributions.

<figure>
<center>
<img src="../../assets/heterogeneity_two_routes_alt.svg" alt="Two FL Routes", width="75%">
<figcaption>Two possible routes for addressing data heterogeneity in FL.</figcaption>
</center>
</figure>

In the subsequent sections of this chapter, we'll cover a few of the many FL
methods aimed at robust global model optimization. Such models are often more
generalizable and more easily transferred to new domains than their pFL
counterparts. However, per-client performance may not be as high as that of
models produced by pFL approaches.

#### References & Useful Links <!-- markdownlint-disable-line MD001 -->

[^1]:
J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift
in Machine Learning. MIT Press, 2008.

[^2]:
[Lin, Tao et al. “Ensemble distillation for robust model fusion in federated
learning”. In: Proceedings of the 34th International Conference on Neural
Information Processing Systems. NIPS ’20. Vancouver, BC, Canada: Curran
Associates Inc., 2020.](https://proceedings.neurips.cc/paper/2020/file/18df51b97ccd68128e994804f3eccc87-Paper.pdf)

{{#author emersodb}}
1 change: 0 additions & 1 deletion books/fl/src/horizontal/robust_global_fl/fedadam.md

This file was deleted.

130 changes: 130 additions & 0 deletions books/fl/src/horizontal/robust_global_fl/fedopt.md
@@ -0,0 +1,130 @@
<!-- markdownlint-disable-file MD033 MD013 -->

# The FedOpt Family of Aggregation Strategies

{{ #aipr_header }}

Recall that modern deep learning optimizers like AdamW[^1] or AdaGrad[^2] use
first- and second-order moment estimates of the stochastic gradients computed
during iterative optimization to adaptively modify the model updates.
At a high level, each algorithm aims to reinforce common update directions
(i.e. those with momentum) and damp update elements corresponding to noisy
directions (i.e. those with high batch-to-batch variance). The FedOpt
family[^3] of algorithms considers modifying the traditional
[FedAvg](../vanilla_fl/fedavg.md) aggregation algorithm to incorporate
similar adaptations into server-side model updates in FL.

## Mathematical motivation

In FedAvg, recall that, after a round of local training on each client,
client model weights are combined into a single model representation via

$$
\begin{align*}
\\mathbf{w}\_{t+1} = \\sum\_{k \\in C_t} \\frac{n_k}{n_s} \\mathbf{w}^k_{t+1},
\end{align*}
$$

where \\(\\mathbf{w}^k\_{t+1}\\) is simply the model weights after local
training on client \\(k\\). For round \\(t\\), each client starts local
training from the same set of weights, \\(\\mathbf{w_t}\\). Assume that each
client has the same number of data points such that \\(n_k = m\\). With a bit
of algebra, the update can be rewritten as

$$
\begin{align}
\\mathbf{w}\_{t+1} = \\sum\_{k \\in C_t} \\frac{n_k}{n_s} \\mathbf{w}^k\_{t+1} &= \\mathbf{w}_t - \\frac{1}{\\vert C_t \\vert} \\sum\_{k \\in C_t}
\\left( \\mathbf{w}_t - \\mathbf{w}^k\_{t+1} \\right), \\\\
&= \\mathbf{w}_t + \\frac{1}{\\vert C_t \\vert} \\sum\_{k \\in C_t} \\Delta^k\_{t+1}, \\\\
&= \\mathbf{w}_t + \\Delta\_{t+1}. \tag{1}
\end{align}
$$

Here, \\(\\Delta^k\_{t+1} = \\mathbf{w}^k\_{t+1} - \\mathbf{w}\_t\\) is just
the vector pointing from the initial model weights to those after local
training and \\(\\Delta\_{t+1}\\) is simply the uniform average of these
update vectors.
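The algebra above is easy to check numerically. A minimal sketch (hypothetical values; four clients with equal data counts, so the weights reduce to a uniform average) confirming that the weighted average of client weights equals the pseudo-gradient form of the update:

```python
import numpy as np

rng = np.random.default_rng(0)

w_t = rng.normal(size=5)  # global weights at the start of round t
# Client weights w^k_{t+1} after a round of local training (simulated).
client_ws = [w_t + 0.1 * rng.normal(size=5) for _ in range(4)]

# Form 1: uniform average of client weights (equal n_k across clients).
w_avg = np.mean(client_ws, axis=0)

# Form 2: Equation (1), w_t plus the average client update vector.
deltas = [w_k - w_t for w_k in client_ws]
w_pseudo = w_t + np.mean(deltas, axis=0)

assert np.allclose(w_avg, w_pseudo)
```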

Recall that, if each client uses a fixed learning rate, \\(\eta\\), and
performs a single, full gradient update, FedAvg is equivalent to centralized
large-batch SGD. Similarly, in this case, if each client performs one step of
batch SGD with a learning rate of 1.0, then the update in Equation (1) is
equivalent to a batch-SGD update with a learning rate of 1.0 for the
**server**. The "server-side" batch is the union of the batches used on each
client.

The observation that \\(-\Delta\_{t+1}\\) is simply a stochastic gradient
motivates treating these update directions like the stochastic gradients
in standard adaptive optimizers. It's important to note that if the clients,
for instance, apply multiple steps of local SGD or use different learning
rates, the exact equivalence of \\(-\Delta\_{t+1}\\) to a stochastic gradient
is broken. However, it shares similarities with such a gradient and is,
therefore, called a "pseudo-gradient."[^3]

## The algorithms: FedAdagrad, FedAdam, FedYogi

Drawing inspiration from three successful, traditional adaptive optimizers,
the adaptive server-side aggregation strategies FedAdagrad, FedAdam, and
FedYogi have been proposed. See the algorithm below for details.

<figure>
<center>
<img src="../../assets/algorithm-fedopt.svg" alt="FedOpt Algorithms" width="100%">
</center>
</figure>

Those familiar with the mathematical formulations of Adagrad, Adam,[^4] and
Yogi[^5] will recognize the general structure of these equations. Computation
of \\(m_t\\), based on the average of the update directions suggested by each
client through local training (\\(\Delta\_{t+1}\\)), serves to accumulate
momentum associated with directions that are consistently and frequently part
of these updates. On the other hand, \\(\nu_t\\) estimates the variance
associated with update directions throughout the server rounds. Directions
with higher variance values are damped in favor of those with more consistency
round over round.
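As a rough sketch of how such a server-side step might look, the snippet below implements a simplified FedAdam-style update treating the averaged pseudo-gradient as a stochastic gradient. The function name, defaults, and omission of bias correction are my own simplifications, not the paper's reference implementation.

```python
import numpy as np

def fed_adam_server_update(w, delta, m, v, eta=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    # delta: averaged pseudo-gradient direction, Delta_{t+1}.
    # m accumulates momentum over server rounds; v tracks per-coordinate
    # variance, damping directions that fluctuate round over round.
    m = beta1 * m + (1 - beta1) * delta
    v = beta2 * v + (1 - beta2) * delta**2
    # delta points toward lower loss, so it is added to the server weights.
    w = w + eta * m / (np.sqrt(v) + tau)
    return w, m, v

# Usage: repeated server rounds with a (toy) constant update direction.
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for _ in range(5):
    w, m, v = fed_adam_server_update(w, np.ones(3), m, v)
```

Swapping the `m` and `v` recursions for their AdaGrad or Yogi counterparts would yield the corresponding FedAdagrad and FedYogi variants.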

As with the usual forms of these algorithms, there are a number of
hyper-parameters that can be tuned, including \\(\tau, \beta_1,\\) and
\\(\beta_2\\). However, the paper suggests sensible defaults of
\\(\beta_1=0.9\\) and \\(\beta_2=0.99\\), and the authors show that
performance is generally robust to the choice of \\(\tau\\).

A number of experiments show that the proposed FedOpt family of algorithms
can outperform FedAvg, especially in heterogeneous settings. Moreover, these
algorithms, in the experiments of the paper, outperform SCAFFOLD,[^6] a
variance reduction method aimed at improving convergence in the presence of
heterogeneity. A final advantage of the FedOpt family is that it is
accompanied by convergence guarantees showing that, provided the variance of
the local gradients is not too large, the algorithms converge.

#### References & Useful Links <!-- markdownlint-disable-line MD001 -->

[^1]:
[I. Loshchilov and F. Hutter. Fixing weight decay regularization in ADAM,
2018.](https://arxiv.org/pdf/1711.05101)

[^2]:
[Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods
for online learning and stochastic optimization. Journal of machine learning research, 12(7).](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)

[^3]:
[S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B.
McMahan. Adaptive federated optimization. In ICLR 2021, 2021.](https://arxiv.org/abs/2003.00295)

[^4]:
[Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization.
In Y. Bengio & Y. LeCun (eds.), ICLR (Poster).](https://arxiv.org/pdf/1412.6980)

[^5]:
[Manzil Zaheer, Sashank J. Reddi, Devendra Sachan, Satyen Kale, and
Sanjiv Kumar. 2018. Adaptive methods for nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 9815–9825.](https://proceedings.neurips.cc/paper_files/paper/2018/file/90365351ccc7437a1309dc64e4db32a3-Paper.pdf)

[^6]:
[S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh.
SCAFFOLD: Stochastic controlled averaging for federated learning. In
Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th
International Conference on Machine Learning, volume 119 of Proceedings of
Machine Learning Research, pages 5132–5143. PMLR, 13–18 Jul 2020.](https://proceedings.mlr.press/v119/karimireddy20a.html)

{{#author emersodb}}