Robust Global FL Pocket Refs #154
@@ -1 +1,176 @@
| <!-- markdownlint-disable-file MD033 MD013 --> | ||
|
|
||
| # Robust Global FL Approaches | ||
|
|
||
| {{ #aipr_header }} | ||
|
|
||
| ## Data heterogeneity in standard ML | ||
|
|
||
| In standard ML, a core underlying assumption when training and deploying | ||
| a model is that the training data is distributionally | ||
| similar to new data to which the model will be applied. There are methods | ||
| that specialize in out-of-domain generalization, but in most cases | ||
| models are assumed to be applied on data that is drawn from the same | ||
| statistical distributions that describe the data on which it was trained. The | ||
| validity of this assumption can degrade, for example, over time or due to | ||
| the model being used to make predictions in entirely new domains. | ||
|
|
||
| While data shifts present a significant challenge in centralized ML training, | ||
| the characteristics that describe data shifts in this domain also exist in FL | ||
| when comparing disparate, distributed datasets. Data shift between such | ||
| datasets is typically referred to as "data heterogeneity" between clients. Such | ||
| heterogeneity introduces new obstacles in FL and is quite prevalent. Before | ||
| discussing its impact on federated training and how it is addressed, let's | ||
| define some types of data divergence. Three common ways to describe | ||
| disparities or shifts between training and inference data are:[^1] | ||
|
|
||
| 1. [Label Shift](#label-shift) | ||
| 2. [Covariate Shift](#covariate-shift) | ||
| 3. [Concept Drift](#concept-drift) | ||
|
|
||
| Let \\(X\\) and \\(Y\\) represent the feature (input) and label (output) | ||
| spaces, respectively for a model. Shifts are present, regardless of whether | ||
| model performance degrades, when the joint distributions | ||
|
|
||
| $$ | ||
| \begin{align} | ||
| \\mathbb{P}\_{\\text{train}}(X, Y) \\neq \\mathbb{P}\_{\\text{test}}(X, Y). \tag{1} | ||
| \end{align} | ||
| $$ | ||
|
|
||
| ### Label Shift | ||
|
|
||
| Label shifts occur when there is a change in the label distribution \\(\\mathbb{P}(Y)\\) | ||
| with a fixed class-conditional distribution \\(\\mathbb{P}(X \\vert Y)\\). That is, the | ||
| probability of seeing different label values shifts, but the distribution of | ||
| features conditioned on the labels does not change. A pertinent example of | ||
| this might be data meant to train a model to diagnose COVID-19 in the early | ||
| days of spread versus the later stages when the virus was widely circulating. | ||
| Generally, the symptoms, given that someone had the virus, did not markedly | ||
| change. However, the prevalence of the virus, \\(\\mathbb{P}(Y)\\), did. | ||
|
|
||
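To make this concrete, the label-shift setting can be simulated in a few lines. This is a sketch with invented Gaussian class-conditionals and prevalences: only \\(\\mathbb{P}(Y)\\) changes between the two samples, while \\(\\mathbb{P}(X \\vert Y)\\) is held fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, p_positive):
    """Draw labels from P(Y), then features from a fixed P(X | Y).

    P(X | Y=0) = N(0, 1) and P(X | Y=1) = N(2, 1) in both samples;
    only the label prevalence p_positive differs.
    """
    y = rng.binomial(1, p_positive, size=n)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

# Early-spread prevalence vs. wide-circulation prevalence, as in the
# COVID-19 example above (rates invented for illustration).
x_early, y_early = sample(100_000, p_positive=0.01)
x_late, y_late = sample(100_000, p_positive=0.30)

print(y_early.mean(), y_late.mean())  # P(Y=1) shifts: ~0.01 vs ~0.30
# P(X | Y=1) does not: both conditional means sit near 2.
print(x_early[y_early == 1].mean(), x_late[y_late == 1].mean())
```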
| ### Covariate Shift | ||
|
|
||
| Covariate shifts between data distributions represent a change in the feature | ||
| distribution, \\(\\mathbb{P}(X)\\), while the statistical relationship of labels to | ||
| features, \\(\\mathbb{P}(Y \\vert X)\\), remains fixed. Consider the setting of training | ||
| a readmission risk model on data drawn from the patient population of a | ||
| general hospital. If, for instance, that model were transferred for use at a | ||
| nearby pediatric hospital, assuming all else equal, predictions from that model | ||
| would be influenced by covariate drift due to the change in patient | ||
| demographics. Namely, though features associated with younger patients are | ||
| likely part of the general hospital population, they will, of course, be | ||
| statistically over-represented in the data points seen by the model at the | ||
| pediatric hospital. | ||
|
|
||
| ### Concept Drift | ||
|
|
||
| Concept drift is characterized by a change in \\(\\mathbb{P}(Y \\vert X)\\) provided a | ||
| fixed \\(\\mathbb{P}(X)\\). Essentially, this drift encapsulates a shift in the | ||
| predictive relationship between the features, \\(X\\), and the labels, \\(Y\\). | ||
| As an illustrative example, consider training a purchase conversion model | ||
|
Collaborator: I think we could use Trump tariffs as part of this example, but maybe we shouldn't be political.

Author: I may or may not have been thinking of this when I wrote the example, but chose to be more vague 😂
||
| for airline ticket purchases where two possible incentives are features. The | ||
| first offers a ticket discount to encourage purchase, whereas the second offers | ||
| free add-ons. In good economic periods, the second incentive may produce higher | ||
| conversion rates. On the other hand, in periods of economic uncertainty, | ||
| perhaps the first offer would do so. | ||
|
|
||
| Note that each of the shifts discussed above may exist in isolation or be | ||
| present together to varying degrees. | ||
|
|
||
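The three shift types above can be read directly off the two factorizations of the joint distribution in Equation (1): each shift varies one factor of the joint while holding the other fixed.

$$
\begin{align*}
\\mathbb{P}(X, Y) &= \\mathbb{P}(Y) \\, \\mathbb{P}(X \\vert Y) && \\text{(label shift varies } \\mathbb{P}(Y)\\text{)} \\\\
&= \\mathbb{P}(X) \\, \\mathbb{P}(Y \\vert X) && \\text{(covariate shift varies } \\mathbb{P}(X)\\text{, concept drift varies } \\mathbb{P}(Y \\vert X)\\text{)}
\end{align*}
$$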
| ## How does data heterogeneity manifest in FL? | ||
|
|
||
| In FL, differences in training data distributions are not strictly temporal or | ||
| marked by a change in the joint probability distributions of the training and | ||
| test datasets, as expressed in Equation (1). Each client participating in | ||
| federated training might naturally exhibit distribution disparities compared | ||
| to one another. Consider the example given in the Section on | ||
| [Covariate Shift](#covariate-shift). If the general and pediatric hospitals | ||
| would like to collaboratively train a model using FL, the demographics of | ||
| their patient populations mean that there will be substantial statistical | ||
| heterogeneity between their respective training datasets. | ||
|
|
||
| Each distributed training dataset in an FL system may naturally exhibit, | ||
| compared with the others, the various disparities discussed above. As a further | ||
| example, consider two financial institutions working together to train a fraud | ||
| detection model. Because of their different clientele, one bank may see | ||
| fraud in 2% of its transactions, while the other may see it in only 0.1%: an | ||
| example of label shift, potentially among other disparities. | ||
|
|
||
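FL experiments often simulate this kind of label skew by partitioning a centralized dataset with per-class Dirichlet proportions. The sketch below uses an invented transaction pool and fraud rate; the helper name is ours, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy pool of 10,000 transactions with a 1% overall fraud rate.
labels = rng.binomial(1, 0.01, size=10_000)

def label_skewed_split(labels, n_clients=2, alpha=0.5):
    """Partition example indices so that clients receive different label mixes.

    For each class, a Dirichlet(alpha) draw decides what fraction of that
    class each client receives; smaller alpha yields stronger label skew.
    """
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        proportions = rng.dirichlet([alpha] * n_clients)
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_indices, np.split(idx, cuts)):
            client.extend(part.tolist())
    return [np.array(c) for c in client_indices]

clients = label_skewed_split(labels)
for i, idx in enumerate(clients):
    print(f"client {i}: n = {len(idx)}, fraud rate = {labels[idx].mean():.4f}")
```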
| ## How does it impact FL models and their training? | ||
|
|
||
| Data heterogeneity, in its various forms, has been linked to a number of | ||
| challenges in training FL models using methods like | ||
| [FedAvg](../vanilla_fl/fedavg.md), including slower convergence, performance | ||
| degradation, and unevenly distributed training dynamics among clients. In Lin et al.,[^2] | ||
| a clear illustration of the impact of data heterogeneity is provided. In the | ||
| figures below, two clients have locally trained a model on their respective | ||
| datasets. | ||
|
|
||
| <figure> | ||
| <center> | ||
| <img src="../../assets/local_model_1.png" alt="Local Model 1" height="300"> | ||
| <img src="../../assets/local_model_2.png" alt="Local Model 2" height="300"> | ||
| <figcaption>Two clients with different datasets. Note that each holds a | ||
| slightly different view of the feature space. Notably, Client 1 (left) has a | ||
| distinct cluster of data points in the bottom right and fewer points labeled in | ||
| green within the red cluster. </figcaption> | ||
| </center> | ||
| </figure> | ||
|
|
||
| The decision boundaries of the locally trained models are largely similar but | ||
| differ in important ways. If the two models are averaged via FedAvg (see figure | ||
| below), the result is a blurred decision boundary which has diverged from the | ||
| sharp boundary one would expect to compute were the data agglomerated and a | ||
| central model trained. Alternatively, using an approach that is more robust | ||
| to data heterogeneity, FedDF,[^2] the resulting model exhibits the kinds of | ||
| classification boundaries one would expect when considering the data | ||
| distributions from a global perspective. | ||
|
|
||
| <figure> | ||
| <center> | ||
| <img src="../../assets/fedavg_model.png" alt="FedAvg Model" height="300"> | ||
| <img src="../../assets/fed_df_model.png" alt="FedDF Model" height="300"> | ||
| <figcaption>Model resulting from FedAvg (left) compared with the model | ||
| trained using FedDF (right).</figcaption> | ||
| </center> | ||
| </figure> | ||
|
|
||
| There are two common routes, among many others, for addressing | ||
| heterogeneity in FL. The first is to maintain a sense of a single global model | ||
| to be trained by all participants. Modifications to items like the aggregation | ||
| strategy, local learning objectives, or corrections to model updates are | ||
| applied to better align FL training with the dynamics of centralized training | ||
| without sacrificing most of the benefits associated with the original FedAvg | ||
| algorithm. The second route is to abandon, to one degree or another, the idea | ||
| of a global model that performs well across all clients and instead allow | ||
| each client to train a unique model. This is known as Personalized | ||
| FL (pFL). Such models still benefit from global information through aspects of | ||
| FL, but more strongly emphasize local distributions. | ||
|
|
||
| <figure> | ||
| <center> | ||
| <img src="../../assets/heterogeneity_two_routes_alt.svg" alt="Two FL Routes", width="75%"> | ||
| <figcaption>Two possible routes for addressing data heterogeneity in FL.</figcaption> | ||
| </center> | ||
| </figure> | ||
|
|
||
| In the subsequent sections of this chapter, we'll cover a few of the many FL | ||
| methods aimed at robust global model optimization in FL. Such models are often | ||
| more generalizable and are more easily distributed to new domains than their | ||
| pFL equivalents. However, model performance on each client may not be | ||
| as high as those produced by pFL approaches. | ||
|
|
||
| #### References & Useful Links <!-- markdownlint-disable-line MD001 --> | ||
|
|
||
| [^1]: | ||
| J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift | ||
| in machine learning. MIT Press, 2008. | ||
|
|
||
| [^2]: | ||
| [Lin, Tao et al. “Ensemble distillation for robust model fusion in federated | ||
| learning”. In: Proceedings of the 34th International Conference on Neural | ||
| Information Processing Systems. NIPS ’20. Vancouver, BC, Canada: Curran | ||
| Associates Inc., 2020.](https://proceedings.neurips.cc/paper/2020/file/18df51b97ccd68128e994804f3eccc87-Paper.pdf) | ||
|
|
||
| {{#author emersodb}} | ||
This file was deleted.
@@ -0,0 +1,130 @@
| <!-- markdownlint-disable-file MD033 MD013 --> | ||
|
|
||
| # The FedOpt Family of Aggregation Strategies | ||
|
|
||
| {{ #aipr_header }} | ||
|
|
||
| Recall that modern deep learning optimizers like AdamW[^1] or AdaGrad[^2] use | ||
| first- and second-order moment estimates of the stochastic gradients computed | ||
| during iterative optimization to adaptively modify the model updates. | ||
| At a high level, each algorithm aims to reinforce common update directions | ||
| (i.e. those with momentum) and damp update elements corresponding to noisy | ||
| directions (i.e. those with high batch-to-batch variance). The FedOpt | ||
| family[^3] of algorithms considers modifying the traditional | ||
| [FedAvg](../vanilla_fl/fedavg.md) aggregation algorithm to incorporate | ||
| similar adaptations into server-side model updates in FL. | ||
|
|
||
| ## Mathematical motivation | ||
|
|
||
| In FedAvg, recall that, after a round of local training on each client, | ||
| client model weights are combined into a single model representation via | ||
|
|
||
| $$ | ||
| \begin{align*} | ||
| \\mathbf{w}\_{t+1} = \\sum\_{k \\in C_t} \\frac{n_k}{n_s} \\mathbf{w}^k_{t+1}, | ||
| \end{align*} | ||
| $$ | ||
|
|
||
| where \\(\\mathbf{w}^k\_{t+1}\\) denotes the model weights after local | ||
| training on client \\(k\\). For round \\(t\\), each client starts local | ||
| training from the same set of weights, \\(\\mathbf{w}\_t\\). Assume that each | ||
| client has the same number of data points such that \\(n_k = m\\). With a bit | ||
| of algebra, the update is rewritten | ||
|
|
||
| $$ | ||
| \begin{align} | ||
| \\mathbf{w}\_{t+1} = \\sum\_{k \\in C_t} \\frac{n_k}{n_s} \\mathbf{w}^k\_{t+1} &= \\mathbf{w}\_t - \\frac{1}{\\vert C_t \\vert} \\sum\_{k \\in C_t} | ||
| \\left( \\mathbf{w}\_t - \\mathbf{w}^k\_{t+1} \\right), \\\\ | ||
| &= \\mathbf{w}\_t + \\frac{1}{\\vert C_t \\vert} \\sum\_{k \\in C_t} \\Delta^k\_{t+1}, \\\\ | ||
| &= \\mathbf{w}_t + \\Delta\_{t+1}. \tag{1} | ||
| \end{align} | ||
| $$ | ||
|
|
||
| Here, \\(\\Delta^k\_{t+1} = \\mathbf{w}^k\_{t+1} - \\mathbf{w}\_t\\) is just | ||
| the vector pointing from the initial model weights to those after local | ||
| training and \\(\\Delta\_{t+1}\\) is simply the uniform average of these | ||
| update vectors. | ||
|
|
||
| Recall that, if each client uses a fixed learning rate, \\(\\eta\\), and | ||
| performs a single, full gradient update, FedAvg is equivalent to centralized | ||
| large-batch SGD. Similarly, in this case, if each client performs one step of | ||
| batch SGD with a learning rate of 1.0, then the update in Equation (1) is | ||
| equivalent to a batch-SGD update with a learning rate of 1.0 for the | ||
| **server**. The "server-side" batch is the union of the batches used on each | ||
| client. | ||
|
|
||
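This equivalence is easy to check numerically. The sketch below (toy linear-regression data, invented purely for illustration) performs one full-batch gradient step per equally sized client and confirms that the FedAvg average matches a centralized full-batch step on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(0)

eta = 0.1                                 # learning rate shared by all clients
w0 = rng.normal(size=3)                   # common starting weights w_t
# Two clients with equal-sized (invented) local datasets.
data = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(2)]

def grad(w, X, y):
    """Full-batch gradient of the mean squared error 0.5 * mean((Xw - y)^2)."""
    return X.T @ (X @ w - y) / len(y)

# One full-batch gradient step per client, then a uniform FedAvg average
# (uniform because n_k is equal across clients).
local = [w0 - eta * grad(w0, X, y) for X, y in data]
w_fedavg = np.mean(local, axis=0)

# One centralized step on the pooled data.
X_all = np.concatenate([X for X, _ in data])
y_all = np.concatenate([y for _, y in data])
w_central = w0 - eta * grad(w0, X_all, y_all)

print(np.allclose(w_fedavg, w_central))   # True: the two updates coincide
```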
| The observation that \\(-\\Delta\_{t+1}\\) is simply a stochastic gradient | ||
| motivates treating these update directions like the stochastic gradients | ||
| in standard adaptive optimizers. It's important to note that if the clients, | ||
| for instance, apply multiple steps of local SGD or use different learning | ||
| rates, the exact equivalence of \\(-\\Delta\_{t+1}\\) to a stochastic gradient | ||
| is broken. However, it shares similarities with such a gradient and is, | ||
| therefore, called a "pseudo-gradient."[^3] | ||
|
|
||
| ## The algorithms: FedAdagrad, FedAdam, FedYogi | ||
|
|
||
| Drawing inspiration from three successful, traditional adaptive optimizers, | ||
| the adaptive server-side aggregation strategies of FedAdagrad, FedAdam, and | ||
| FedYogi have been proposed. See the algorithm below for details. | ||
|
|
||
| <figure> | ||
| <center> | ||
| <img src="../../assets/algorithm-fedopt.svg" alt="FedOpt Algorithms" width="100%"> | ||
| </center> | ||
| </figure> | ||
|
|
||
| Those familiar with the mathematical formulations of Adagrad, Adam,[^4] and | ||
| Yogi[^5] will recognize the general structure of these equations. Computation | ||
| of \\(m_t\\), based on the average of the update directions suggested by each | ||
| client through local training (\\(\\Delta\_{t+1}\\)), serves to accumulate | ||
| momentum associated with directions that are consistently and frequently part | ||
| of these updates. On the other hand, \\(\\nu_t\\) estimates the variance | ||
| associated with update directions throughout the server rounds. Directions | ||
| with higher variance values are damped in favor of those with more consistency | ||
| round over round. | ||
|
|
||
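As a rough sketch of how these pieces fit together (the hyper-parameter defaults follow the paper's suggestions; the client updates below are invented, and this is not the reference implementation), a single FedAdam server round might be written as:

```python
import numpy as np

def fedadam_update(w, delta, m, v, eta=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server round of FedAdam, following the structure of the algorithm above.

    delta is the pseudo-gradient: the uniform average of the client updates
    w_k - w. m accumulates momentum along consistently suggested directions,
    while v grows for high-variance directions, damping their contribution.
    """
    m = beta1 * m + (1 - beta1) * delta
    v = beta2 * v + (1 - beta2) * delta**2   # FedAdagrad: v = v + delta**2
    w = w + eta * m / (np.sqrt(v) + tau)     # FedYogi differs only in the v update
    return w, m, v

# Toy round: three clients return locally trained weights (values invented).
rng = np.random.default_rng(0)
w, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
client_weights = [w + rng.normal(scale=0.1, size=4) for _ in range(3)]
delta = np.mean([w_k - w for w_k in client_weights], axis=0)
w, m, v = fedadam_update(w, delta, m, v)
```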
| As with the usual forms of these algorithms, there are a number of | ||
| hyper-parameters that can be tuned, including \\(\\tau, \\beta_1,\\) and | ||
| \\(\\beta_2\\). However, sensible defaults are suggested in the paper, such as | ||
| \\(\\beta_1=0.9\\) and \\(\\beta_2=0.99\\). The authors also show that | ||
| performance is generally robust to \\(\\tau\\). | ||
|
|
||
| A number of experiments show that the proposed FedOpt family of algorithms | ||
| can outperform FedAvg, especially in heterogeneous settings. Moreover, these | ||
| algorithms, in the experiments of the paper, outperform SCAFFOLD,[^6] a | ||
|
Collaborator: should we have a pocket ref for SCAFFOLD in this section as well, eventually?

Author: Yes, probably. It'll be a bit more involved, just because the method is more complicated, but I have all of the figures necessary to do one. If you want to throw it on the backlog, I can tackle it at the same time as I work on pFL methods.
||
| variance reduction method aimed at improving convergence in the presence of | ||
| heterogeneity. A final advantage of the FedOpt family of algorithms is that | ||
| they are accompanied by several convergence results showing that, as long as | ||
| the variance of the local gradients is not too large, the algorithms converge | ||
| properly. | ||
|
|
||
| #### References & Useful Links <!-- markdownlint-disable-line MD001 --> | ||
|
|
||
| [^1]: | ||
| [I. Loshchilov and F. Hutter. Fixing weight decay regularization in ADAM, | ||
| 2018.](https://arxiv.org/pdf/1711.05101) | ||
|
|
||
| [^2]: | ||
| [Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods | ||
| for online learning and stochastic optimization. Journal of machine learning research, 12(7).](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) | ||
|
|
||
| [^3]: | ||
| [S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konêcný, S. Kumar, and H. B. | ||
| McMahan. Adaptive federated optimization. In ICLR 2021, 2021.](https://arxiv.org/abs/2003.00295) | ||
|
|
||
| [^4]: | ||
| [Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. | ||
| In Y. Bengio & Y. LeCun (eds.), ICLR (Poster).](https://arxiv.org/pdf/1412.6980) | ||
|
|
||
| [^5]: | ||
| [Manzil Zaheer, Sashank J. Reddi, Devendra Sachan, Satyen Kale, and | ||
| Sanjiv Kumar. 2018. Adaptive methods for nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 9815–9825.](https://proceedings.neurips.cc/paper_files/paper/2018/file/90365351ccc7437a1309dc64e4db32a3-Paper.pdf) | ||
|
|
||
| [^6]: | ||
| [S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh. | ||
| SCAFFOLD: Stochastic controlled averaging for federated learning. In | ||
| Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th | ||
| International Conference on Machine Learning, volume 119 of Proceedings of | ||
| Machine Learning Research, pages 5132–5143. PMLR, 13–18 Jul 2020.](https://proceedings.mlr.press/v119/karimireddy20a.html) | ||
|
|
||
| {{#author emersodb}} | ||