From e97e07a30fd822c6424364d05fc0e689a825a65a Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 14:23:38 -0500 Subject: [PATCH 01/11] Guides landing page --- guides/guides.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/guides/guides.md b/guides/guides.md index b95927f..aa86445 100644 --- a/guides/guides.md +++ b/guides/guides.md @@ -1,10 +1,10 @@ --- layout: default -title: Coding guides +title: Guides nav_order: 3 has_children: true --- -# Coding guides +# DIL guides -Collection of guides. \ No newline at end of file +These are step-by-step guides covering common coding tasks in DIL's projects. The guides are a work in progress. If you'd like to see guidance on topics that are not yet covered here, feel free to [create an issue](https://github.com/DevInnovationLab/guides/issues) requesting a new guide. \ No newline at end of file From 8bd995612b4fd58dbd3941fb4d95e3e6470812c0 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 14:23:48 -0500 Subject: [PATCH 02/11] Power calcs landing page --- guides/power-calculations.md | 17 ----------------- guides/power.md | 20 ++++++++++++++++++++ 2 files changed, 20 insertions(+), 17 deletions(-) delete mode 100644 guides/power-calculations.md create mode 100644 guides/power.md diff --git a/guides/power-calculations.md b/guides/power-calculations.md deleted file mode 100644 index 027d543..0000000 --- a/guides/power-calculations.md +++ /dev/null @@ -1,17 +0,0 @@ ---- -layout: default -title: Power calculations -parent: Coding guides -has_children: true -has_toc: true ---- - - -# Power calculations - -We will first discuss a few general considerations regarding research design and statistical power and then, most importantly, illustrate what an ideal power calculation workflow in DIL projects looks like. -The most common research scenarios are being outlined and we'll see how power calculations change based on design choices. 
-This note is accompanied by an interactive `Shiny` document which can be [accessed here](https://lehner.shinyapps.io/dil_power/) and is best used side-by-side to this note. -More advanced topics are only treated at a high level by providing basic intuition. -These are covered in detail with example code in separate notebooks which are hosted [on the main branch](https://github.com/DevInnovationLab/guides/tree/main) of the `guides` repository. -If you have a good example code from your own project, please reach out so we can showcase it and help improve the quality of future DIL work. \ No newline at end of file diff --git a/guides/power.md b/guides/power.md new file mode 100644 index 0000000..27e87a0 --- /dev/null +++ b/guides/power.md @@ -0,0 +1,20 @@ +--- +layout: default +title: Power calculations +parent: Guides +has_children: true +has_toc: true +--- + + +# Power calculations + +We will first discuss a few general considerations regarding research design and statistical power and then, most importantly, illustrate what an ideal power calculation workflow in DIL projects looks like. +We outline the most common research scenarios and show how power calculations change based on design choices. +**This note is accompanied by an interactive `Shiny` document which can be [accessed here](https://lehner.shinyapps.io/dil_power/) and is best used side by side with this note.** +**More advanced topics are only treated at a high level by providing basic intuition.** +**These are covered in detail with example code in separate notebooks which are hosted [on the main branch](https://github.com/DevInnovationLab/guides/tree/main) of the `guides` repository.** +If you have good example code from your own project, please reach out so we can showcase it and help improve the quality of future DIL work. 
+ +When computing the needed sample size for an RCT, one of the most important questions we have to answer is what is the smallest difference in the average of the outcome variable between treatment and control group that we would like to be able to detect. +As a primer, we will quickly revisit conventional two-sample hypothesis testing in order to see the role effect size, $$\delta$$, and standard deviation of the outcome, $$\sigma$$, play in the process. \ No newline at end of file From 6cf4eb6959b990c0c29ebf3bbc9a1cd28dff6f37 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 14:24:00 -0500 Subject: [PATCH 03/11] Hypothesis testing --- guides/{ => power}/hyp-testing-recap.md | 26 ++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) rename guides/{ => power}/hyp-testing-recap.md (65%) diff --git a/guides/hyp-testing-recap.md b/guides/power/hyp-testing-recap.md similarity index 65% rename from guides/hyp-testing-recap.md rename to guides/power/hyp-testing-recap.md index 1c3eea7..45452d2 100644 --- a/guides/hyp-testing-recap.md +++ b/guides/power/hyp-testing-recap.md @@ -1,24 +1,24 @@ --- layout: default -title: A recap of hypoyhesis testing -grand_parent: Coding guides +title: A recap of hypothesis testing +grand_parent: Guides parent: Power calculations -nav_order: 5 +nav_order: 3 --- -# Recapping Null Hypothesis Significance Testing (NHST) +## A recap of Null Hypothesis Significance Testing (NHST) -Estimating whether an intervention - e.g. the treatment of an RCT - had an effect on the outcome is challenging due to sampling variation: the sample average of the treatment group will most of the time look different than the one of the control group due to chance alone. +**Estimating whether an intervention -- e.g. 
the treatment of an RCT -- had an effect on the outcome is challenging due to sampling variation: the sample average of the treatment group will most of the time look different than the one of the control group due to chance alone.** -It is our task as researchers to assess whether this difference is large enough in order to conclude that it is due to actual differences at the population level - or whether it is too small and therefore most likely due to sampling noise alone. +It is our task as researchers to assess whether this difference is large enough in order to conclude that it is due to actual differences at the population level -- or whether it is too small and therefore most likely due to sampling noise alone. That's where (frequentist) null hypothesis significance testing (NHST) enters. We always start by stating the null hypothesis, $$H_0$$, which says that the effect was zero. The alternative hypothesis, $$H_1$$, states that the effect was $$\delta$$ - the difference of the outcome variable between treatment and control group. -When conducting such a test, we can make two possible errors. -Rejecting a true null hypothesis, that is, to conclude that the intervention was effective when it was not (Type I error or "false positive"). -We could also conclude that the intervention had no effect when in reality such an effect exists, i.e. fail to reject the null hypothesis when it is false (Type II error or "false negative"). +**When conducting such a test, we can make two possible errors.** +**Rejecting a true null hypothesis**, that is, to conclude that the intervention was effective when it was not (Type I error or "false positive"). +We could also conclude that the intervention had no effect when in reality such an effect exists, i.e. **fail to reject the null hypothesis when it is false** (Type II error or "false negative"). 
| | True Hypothesis (H0) | True Hypothesis (H1) | |------------------------|:---------------------------:|:---------------------------:| @@ -28,8 +28,8 @@ We could also conclude that the intervention had no effect when in reality such | Test Decision (H1) | **Type I Error** [$$\alpha$$] | Correct Decision | | *reject null* | (False Positive, FP) | (True Positive, TP) | -When carried out on real-world data, we will never be able to know whether a Type I or Type II error is being committed. -We can, however, design our study as to control the probability of committing each type of error. +**When carried out on real-world data, we will never be able to know whether a Type I or Type II error is being committed.** +**We can, however, design our study so as to control the probability of committing each type of error.** First, we have to define a significance level, $$\alpha$$. This is conventionally set to 0.05 for a two-sided test and represents the probability of committing a Type I error ($$P[reject\ H_0\ |\ H_0\ true]$$). @@ -37,7 +37,7 @@ This means that when the null is true, we won't reject it in 95% of cases. Second, we have to make a choice on the probability of a Type II error, $$\beta$$ (not to be confused with the regression parameter below - notation was kept for consistency with most stats textbooks). It refers to the chance of finding no treatment effect when one exists in reality ($$P[fail\ reject\ H_0\ |\ H_1\ true]$$). The most common value for $$\beta$$ is 0.2. -Consequently, $$1-\beta$$ is the chance of finding an effect when it exists, i.e. the probability that a study has of uncovering a true effect. +**Consequently, $$1-\beta$$ is the chance of finding an effect when it exists, i.e. the probability that a study has of uncovering a true effect.** **This is what we also call the (statistical) power of a test**. Of course we would like power to be as high as possible, but usually a power of 80% (i.e. $$1-\beta = 0.8$$) is considered high enough. 
Defining the two hypotheses, the desired significance level, and the power of the test, allows us to simultaneously control the likelihood of committing either a Type I or a Type II error. @@ -45,6 +45,6 @@ By looking at the hypothetical distributions of the test statistic under the nul -The [companion Shiny app](https://lehner.shinyapps.io/dil_power/) illustrates the positive role a large effect size and a low deviation of the outcome have on power. +This guide's [companion Shiny app](https://lehner.shinyapps.io/dil_power/) illustrates the positive role a large effect size and a low deviation of the outcome have on power. From 47a2565a01f654c8903e2b2449c443af6b2cf9b7 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 14:25:28 -0500 Subject: [PATCH 04/11] Intro to power --- guides/how-to-step-by-step.md | 34 ----------------------- guides/intro-to-power.md | 32 --------------------- guides/low-power-bad.md | 25 ----------------- guides/power/what-and-why.md | 52 +++++++++++++++++++++++++++++++++++ 4 files changed, 52 insertions(+), 91 deletions(-) delete mode 100644 guides/how-to-step-by-step.md delete mode 100644 guides/intro-to-power.md delete mode 100644 guides/low-power-bad.md create mode 100644 guides/power/what-and-why.md diff --git a/guides/how-to-step-by-step.md b/guides/how-to-step-by-step.md deleted file mode 100644 index f0e4a57..0000000 --- a/guides/how-to-step-by-step.md +++ /dev/null @@ -1,34 +0,0 @@ ---- -layout: default -title: How to perform power calculations for your project, step by step -grand_parent: Coding guides -parent: Power calculations -nav_order: 2 ---- - -## How to Perform Power Calculations for your Project, Step by Step - -First, we have to gather intelligence on the three most fundamental things: - -- What is the expected mean of our variable of interest? -- How much do values deviate from this mean in reality? -- Which effect size can we expect roughly from our intervention? 
- - - - - -In case we want (or have) to assign treatment at a cluster or group level, we also need to gather information on the degree of similarity within each cluster or group. -This descriptive statistic is often referred to as the Intra-cluster Correlation Coefficient (ICC). - -The relevant pieces of information can be gathered through a review of similar studies that were carried out in the past, the analysis of surveys and/or administrative data, or, ideally, a combination of the two. -It is not advised to rely on pilot studies for power calculations due to small sample size problems. - -As a first step we always use basic plug-in formulas to get a rough sense for the needed sample size. -This serves as a benchmark and can/will be refined later by controlling for important covariates/predictors which usually soak up some of the variation in the variable of interest, reducing variance in the residuals of the regression and thus lowering the required sample size for any specified level of precision or power. -Similar gains can be achieved by stratification/blocking. -For more complex settings we usually want to resort to simulation methods in a final step. -This is something where we would recommend using R. - -When computing the needed sample size for an RCT, one of the most important questions we have to answer is what is the smallest difference in the average of the outcome variable between treatment and control group that we would like to be able to detect. -As a primer, we will quickly revisit conventional two-sample hypothesis testing in order to see the role effect size, $\delta$, and standard deviation of the outcome, $\sigma$, play in the process. 
\ No newline at end of file diff --git a/guides/intro-to-power.md b/guides/intro-to-power.md deleted file mode 100644 index 424f517..0000000 --- a/guides/intro-to-power.md +++ /dev/null @@ -1,32 +0,0 @@ ---- -layout: default -title: Introduction to statistical power -grand_parent: Coding guides -parent: Power calculations -nav_order: 3 ---- - - -> Statistical power is defined as the probability, before a study is performed, that a particular comparison will achieve "statistical significance" at some predetermined level (typically a p-value below 0.05), given some assumed true effect size. -> A power analysis is performed by first hypothesizing an effect size, then making some assumptions about the variation in the data and the sample size of the study to be conducted, and finally using probability calculations to determine the chance of the p-value being below the threshold. - -[Gelman, Hill, Vehktari (2020) Regression and Other Stories, Chapter 16] - -First, it has to be clarified that null hypothesis significance testing (NHST) - and thus also statistical power - is a frequentist concept only[^1]. -The question we want to answer is: **how large a sample do we have to collect in order to be reasonably sure of meeting the study objectives**? -Power calculations are not just useful for randomized evaluations, however, but also worth doing before we carry out studies on *purely observational data*. -Think of a scenario where an intervention or event has already happened, and we need to figure out whether it is worth the time and expense of collecting or trying to get access to data. -Power calculations are sometimes even worth doing on published/finished research. 
- - - - -## A useful Heuristic: "Sixteen S-squared over D-squared" - -A rough approximation introduced by Lehr says that for a two sample t-test with power 80% and significance level 5%, the required sample size can be approximated by $$n \approx 16 \frac{\sigma^2}{\delta^2}$$, where $$\sigma$$ is the standard error of the variable of interest and $$\delta$$ the difference in means that we want to detect. - - -[^1]: A Bayesian would think about power differently. In essence, instead of focusing on frequencies of rejecting a null hypothesis when an alternative is true, a Bayesian would consider the posterior distribution of a parameter given the data. The strength of evidence for or against particular parameter values is expressed in terms of probability distributions. To analogously assess "power," a Bayesian might consider the probability that a future observation or set of observations will fall within a particular region, given the current posterior distribution. - - - diff --git a/guides/low-power-bad.md b/guides/low-power-bad.md deleted file mode 100644 index 20e5761..0000000 --- a/guides/low-power-bad.md +++ /dev/null @@ -1,25 +0,0 @@ ---- -layout: default -title: Why underpowered studies are bad -grand_parent: Coding guides -parent: Power calculations -nav_order: 4 ---- - -# Why Underpowered Studies are Bad - - -Inference from low-powered studies is problematic for two main reasons. -First, by definition, adequate power is required to ensure that studies have a high likelihood of detecting a genuine effect. Low power implies high rates of false negatives whereby the null hypothesis of "no effect" is not rejected, despite being false. -This can arguably have impacts on the development of scientific knowledge: false negative results are less likely to be followed up than false positives, self-correction is less likely to occur in these cases. 
- -Second, low power threatens the credibility of research findings through effect inflation because in settings where standard errors are large, only those findings that by chance overestimate the magnitude of the effect will appear statistically significant and thus pass the threshold for discovery. Effect inflation is therefore more severe in underpowered studies that are based on small samples in the presence of high measurement error. This problem is also known as the winner's curse or referred to as a Type M error (a term coined by Andrew Gelman). It is well-known that conditional on NHST, research findings (from non-experimental data) suffer from an upward bias (because of a "file drawer problem" of non-significant studies that are not being published) and there are ways to correct for this (e.g. a procedure by Andrews and Kasy). - - - - - - - - -Keep in mind the following: a corollary from everything we covered so far is that if a study has a small sample size, it necessarily has to find a large effect in order to obtain a significant result (according to our frequentist hypothesis testing standards). diff --git a/guides/power/what-and-why.md b/guides/power/what-and-why.md new file mode 100644 index 0000000..d1e2e83 --- /dev/null +++ b/guides/power/what-and-why.md @@ -0,0 +1,52 @@ +--- +layout: default +title: What is statistical power and why does it matter +grand_parent: Guides +parent: Power calculations +nav_order: 2 +--- + +## What is statistical power and why does it matter + +### Defining statistical power + +> Statistical power is defined as the probability, before a study is performed, that a particular comparison will achieve "statistical significance" at some predetermined level (typically a p-value below 0.05), given some assumed true effect size. 
+> A power analysis is performed by first hypothesizing an effect size, then making some assumptions about the variation in the data and the sample size of the study to be conducted, and finally using probability calculations to determine the chance of the p-value being below the threshold. + +[Gelman, Hill, Vehtari (2020) Regression and Other Stories, Chapter 16] + +First, it has to be clarified that null hypothesis significance testing (NHST) - and thus also statistical power - is a frequentist concept only[^1]. +The question we want to answer is: **how large a sample do we have to collect in order to be reasonably sure of meeting the study objectives**? +Power calculations are not just useful for randomized evaluations, however, but also worth doing before we carry out studies on *purely observational data*. +Think of a scenario where an intervention or event has already happened, and we need to figure out whether it is worth the time and expense of collecting or trying to get access to data. +Power calculations are sometimes even worth doing on published/finished research. + + + + +### A useful heuristic: "Sixteen S-squared over D-squared" + +A rough approximation introduced by Lehr says that **for a two-sample t-test with power 80% and significance level 5%, the required sample size per group can be approximated by $$n \approx 16 \frac{\sigma^2}{\delta^2}$$,** where $$\sigma$$ is the standard deviation of the variable of interest and $$\delta$$ the difference in means that we want to detect. + +### Why underpowered studies are bad + + +Inference from low-powered studies is problematic for two main reasons. +First, by definition, adequate power is required to ensure that studies have a high likelihood of detecting a genuine effect. 
**Low power implies high rates of false negatives whereby the null hypothesis of "no effect" is not rejected, despite being false.** +This can arguably have impacts on the development of scientific knowledge: false negative results are less likely to be followed up than false positives, so self-correction is less likely to occur in these cases. + +Second, low power threatens the credibility of research findings through effect inflation because **in settings where standard errors are large, only those findings that by chance overestimate the magnitude of the effect will appear statistically significant and thus pass the threshold for discovery**. Effect inflation is therefore more severe in underpowered studies that are based on small samples in the presence of high measurement error. This problem is also known as the winner's curse or referred to as a Type M error (a term coined by Andrew Gelman). It is well known that, conditional on NHST, research findings from non-experimental data suffer from an upward bias because of a "file drawer problem" of non-significant studies that are not being published. + + + + + + + + +**Keep in mind the following: a corollary from everything we covered so far is that if a study has a small sample size, it necessarily has to find a large effect in order to obtain a significant result (according to our frequentist hypothesis testing standards).** + +[^1]: A Bayesian would think about power differently. In essence, instead of focusing on frequencies of rejecting a null hypothesis when an alternative is true, a Bayesian would consider the posterior distribution of a parameter given the data. The strength of evidence for or against particular parameter values is expressed in terms of probability distributions. To analogously assess "power," a Bayesian might consider the probability that a future observation or set of observations will fall within a particular region, given the current posterior distribution. 
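
The effect-inflation (Type M) argument in `what-and-why.md` can be illustrated with a small simulation. This is only a sketch and not part of the patch: the function name `simulate_exaggeration`, the true effect of 0.2, the 20 units per arm, and the use of a z-test with known standard deviation are all illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean


def simulate_exaggeration(true_effect=0.2, sigma=1.0, n_per_arm=20,
                          alpha=0.05, n_sims=5000, seed=1):
    """Simulate many small two-arm studies and look only at the significant ones.

    Returns (share of significant studies, mean |estimate| among the
    significant studies divided by the true effect).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of a difference in means with known sigma (assumption).
    se = sigma * (2 / n_per_arm) ** 0.5
    significant = []
    for _ in range(n_sims):
        control = fmean(rng.gauss(0.0, sigma) for _ in range(n_per_arm))
        treated = fmean(rng.gauss(true_effect, sigma) for _ in range(n_per_arm))
        estimate = treated - control
        if abs(estimate) / se > z_crit:  # two-sided z-test
            significant.append(estimate)
    share_significant = len(significant) / n_sims
    exaggeration = fmean(abs(e) for e in significant) / true_effect
    return share_significant, exaggeration


if __name__ == "__main__":
    share, ratio = simulate_exaggeration()
    print(f"share significant: {share:.3f}, exaggeration ratio: {ratio:.2f}")
```

Under these assumptions the study is badly underpowered, and every estimate that clears the significance threshold must exceed about 1.96 standard errors in magnitude, which is several times the assumed true effect.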
+ + + From 0d01440ea4cc1f06fad04b4968da2187b98368cc Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 14:25:42 -0500 Subject: [PATCH 05/11] DIL standards for power calcs --- guides/power-at-dil.md | 23 ------------------ guides/power/standards.md | 49 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+), 23 deletions(-) delete mode 100644 guides/power-at-dil.md create mode 100644 guides/power/standards.md diff --git a/guides/power-at-dil.md b/guides/power-at-dil.md deleted file mode 100644 index 37ede2b..0000000 --- a/guides/power-at-dil.md +++ /dev/null @@ -1,23 +0,0 @@ ---- -layout: default -title: Power calculations at DIL -grand_parent: Coding guides -parent: Power calculations -nav_order: 1 ---- - -## Guidelines for a DIL Powercalculation Summary Document - -It is crucial that every project has a standing file - ideally a notebook combining code and text - that showcases power calculations. -Over its lifecycle, different RAs, RPs, and PIs will contribute to it. -It is thus essential that [coding principles are strictly followed](https://devinnovationlab.github.io/guides/principles/writing-code.html) as in any other piece of work. -Such a document will also help the team think clearly about research design and how to best communicate it at a very early stage. - -- Always start with a quick abstract summary of descriptive statistics: mean and standard deviation of outcomes, cluster sizes, estimates for the intra-cluster correlation, and expected effect sizes (with references to the respective literature). -- Showcase simple power calculations from plug-in formulas first. -- Using these as benchmarks, incorporate more sophisticated designs only afterwards and see how and why power increases (often this has to be done through simulations). 
-- Always link to the folder/repository where the code is stored so anyone can always have a look at the latest, up-to-date version (ideally also push these reports to the Github repository of the respective project). -- Every number has to be easily reproducible. If there is no replication code, then every number that was plugged into a formula has to be stated clearly. - -You can find some example notebooks in the `power_calc/examples/` folder following [this link (main branch of guides repo)](https://github.com/DevInnovationLab/guides/tree/main). - diff --git a/guides/power/standards.md b/guides/power/standards.md new file mode 100644 index 0000000..b4a12fd --- /dev/null +++ b/guides/power/standards.md @@ -0,0 +1,49 @@ +--- +layout: default +title: DIL standards +grand_parent: Guides +parent: Power calculations +nav_order: 1 +--- + +## DIL standards + +**It is crucial that every project has a stand-alone file - ideally a notebook combining code and text - that showcases power calculations.** +Over its lifecycle, different RAs, RPs, and PIs will contribute to it. +It is thus essential that [coding principles are strictly followed](https://devinnovationlab.github.io/guides/principles/writing-code.html) as in any other piece of work. +Such a document will also help the team think clearly about research design and how to best communicate it at a very early stage. + +This document should have two key features: + +- **The document must link to the folder or repository where the code is stored** so the reader can always access the latest version of the document. Ideally also push these reports to the GitHub repository of the respective project. +- **Every number mentioned in this report must be easily reproducible.** If there is no replication code, then the source of every number that was plugged into a formula must be stated clearly. 
+ +You can find some example notebooks in the [`power_calc/examples/` folder of this repository's main branch](https://github.com/DevInnovationLab/guides/tree/main). + +### 1. Start with a quick abstract summary of descriptive statistics + +The first task when performing power calculations is to gather intelligence on the three most fundamental pieces of information: + +- **What is the expected mean of our variable of interest? How much do values deviate from this mean in reality?** Your document should start by indicating the mean and standard deviation of outcomes. +- **What are the cluster sizes and estimates for intra-cluster correlation?** In case we want (or have) to assign treatment at a cluster or group level, we also need to gather information on the degree of similarity within each cluster or group. +This descriptive statistic is often referred to as the Intra-cluster Correlation Coefficient (ICC). +- **Which effect size can we expect roughly from our intervention?** To answer this question, refer to the relevant literature. + +The relevant pieces of information can be gathered through a review of similar studies that were carried out in the past, the analysis of surveys and/or administrative data, or, ideally, a combination of the two. It is not advised to rely on pilot studies for power calculations due to small sample size problems. + +### 2. Showcase simple power calculations from plug-in formulas + + + + + +As a first step we always use basic plug-in formulas to get a rough sense for the needed sample size. This serves as a benchmark that will be refined in the next step. + +### 3. Incorporate more sophisticated designs + +Using the plug-in formulas as a starting point, see how and why power increases when you refine the calculations by + +- **Controlling for important covariates/predictors**. 
They usually soak up some of the variation in the variable of interest, reducing variance in the residuals of the regression and thus lowering the required sample size for any specified level of precision or power. +- **Using stratification/blocking**. These usually achieve similar gains in reducing variation. + +For more complex settings, this last step is often done using simulation. From 4b06be11819a6d9b2c20c0ea093b90130aba984c Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 14:26:01 -0500 Subject: [PATCH 06/11] individual randomization --- .../{ => power}/individual-randomization.md | 42 ++++++++++--------- 1 file changed, 23 insertions(+), 19 deletions(-) rename guides/{ => power}/individual-randomization.md (61%) diff --git a/guides/individual-randomization.md b/guides/power/individual-randomization.md similarity index 61% rename from guides/individual-randomization.md rename to guides/power/individual-randomization.md index 9566eec..c39b0e1 100644 --- a/guides/individual-randomization.md +++ b/guides/power/individual-randomization.md @@ -1,47 +1,53 @@ --- layout: default title: Individual randomization -grand_parent: Coding guides +grand_parent: Guides parent: Power calculations -nav_order: 11 +nav_order: 4 --- ## Individual Level Randomization - -We will look at both continuous and binary outcomes when the treatment is assigned to individuals. +Individual-level randomization occurs when the treatment is assigned to individuals. +This section discusses individual-level randomization with continuous and binary outcomes. In the next section we will then look at these two cases when the treatment is assigned at a cluster or group level. -Everything else equal, this will lead to a decrease in power due to the non-independence of units within each group. -We therefore generally would prefer to design a study that allows us to randomize at the individual level. 
+Everything else equal, clustered randomization lead to a decrease in power compared to individual randomization due to the non-independence of units within each group. +**We therefore generally would prefer to design a study that allows us to randomize at the individual level.** ### Continuous Outcomes In order to test whether an RCT with simple randomization had a significant effect, in theory one does not even need a regression. A simple difference in means by treatment status is unbiased for the average effect. -Since we want to introduce more sophisticated designs later in this short article, we introduce regression notation from the outset. In order to obtain an average treatment effect, we just run the following univariate regression: $$Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i,$$ where $$D$$ is an indicator variable that indicates whether individual $$i$$ received the treatment or not. $$\varepsilon$$ is an iid error term. -We are then interested in the hypothesis test of whether $$\beta_1$$ is different from zero. + +We are interested in the hypothesis test of whether $$\beta_1$$ is different from zero, -Writing down this hypothesis test formally, it is straightforward to derive a closed-form solution for the minimum detectable effect (MDE), given a sample size, and the required sample size, $$N^*$$, to detect a given difference in means, $$\delta$$. -We are interested in rejecting the null hypothesis $$H_0: \beta_1 = 0$$ in order show a significant effect of a treatment. +where rejecting the null hypothesis $$H_0: \beta_1 = 0$$ would show a significant effect of a treatment. +That is, we would like to know the probability that our study has of uncovering a true effect when estimating $$\beta_1$$. 
+Writing down this hypothesis test formally, it is straightforward to derive a closed-form solution for the Minimum Detectable Effect (MDE), given a sample size, and the required sample size, $$N^*$$, to detect a difference in means $$\delta$$ given the standard deviation of the outcome. + +The standard deviation of the outcome is usually unknown and hence needs an estimation -- or a best guess. +Estimating the standard deviation of the outcome can be difficult, especially if there is little prior work on the study's subject and context. +However, it is necessary to conduct power calculations, and small-scale data collection is often the only good way to get a credible estimate. -The standard deviation of the outcome is usually unknown and hence needs an estimation - or a best guess. -Sometimes this can be difficult. We can either assume that the variance of the outcome is homogeneous across the treated and control group, and the standard deviation is $$\sigma$$, or we can assume that the variance is different for the two groups, and the standard deviations are $$\sigma_0, \sigma_1$$ respectively. We only focus on the first case here. -The MDE that can be detected with $$1 − \beta$$ power at significance level $$\alpha$$ is + +The Minimum Detectable Effect that can be detected with $$1 - \beta$$ power at significance level $$\alpha$$ is $$\delta = \left(t_\beta + t_{\frac{\alpha}{2}}\right)\sigma\sqrt{\frac{1}{n_0}+\frac{1}{n_1}},$$ where $$n_0, n_1$$ are the sample size in the treatment and control group respectively. Assuming that the sample size in the treatment and control groups are the same, $$n_0=n_1=n^*$$, the sample size needed for each group is $$n^* = 2\left(t_\beta + t_{\frac{\alpha}{2}}\right)^2\frac{\sigma^2}{\delta^2},$$ implying that the total sample size needed is $$N^* = 2n^*$$. 
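
The two closed-form expressions in this hunk can be checked numerically. The sketch below is not part of the patch; it assumes standard normal quantiles in place of the $$t$$ quantiles (a large-sample approximation), and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def _quantile_sum(alpha, power):
    """z_{alpha/2} + z_{beta}, using normal quantiles instead of t quantiles."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def minimum_detectable_effect(n0, n1, sigma, alpha=0.05, power=0.80):
    """MDE for a difference in means with common standard deviation sigma."""
    return _quantile_sum(alpha, power) * sigma * sqrt(1 / n0 + 1 / n1)


def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Required n* per group under equal allocation (not rounded)."""
    return 2 * _quantile_sum(alpha, power) ** 2 * sigma ** 2 / delta ** 2


if __name__ == "__main__":
    n_star = sample_size_per_group(delta=0.5, sigma=1.0)
    print(ceil(n_star), 2 * ceil(n_star))  # per group, total
```

With $$\alpha = 0.05$$ and 80% power the multiplier $$2(z_\beta + z_{\alpha/2})^2$$ is about 15.7, which is where the "sixteen s-squared over d-squared" rule of thumb comes from. Because of the normal approximation, the result should be slightly smaller than what a $$t$$-based routine such as `power.t.test()` reports.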
- + + + @@ -49,7 +55,7 @@ Assuming that the sample size in the treatment and control groups are the same, -In practice we will estimate MDE or sample size with the function `power.t.test()` (or alternatively with the corresponding function from the package `pwr`). +**In practice we will estimate MDE or sample size with R's function `power.t.test()` (or alternatively with the corresponding function from the package `pwr`).** A good discussion of Stata equivalents can be found [in the DIME Wiki](https://dimewiki.worldbank.org/Power_Calculations_in_Stata). In all of the above we implicitly assumed that the treatment has only an effect on the mean - which is by far the most common assumption and usually what we are interested in. @@ -58,9 +64,7 @@ In all of the above we implicitly assumed that the treatment has only an effect ### Binary Outcomes -For continuous outcomes, we do have to come up with an estimate for the standard deviation of the outcome variable. -Sometimes this can be difficult if little prior work exists and often a pilot is the only good way to get a good sense for how much the variable of interest is going to deviate. -Binary data is a bit more user-friendly, as the standard deviation is implied by the Bernoulli distribution, $$p*(1-p)$$, with the highest value at a base rate of 0.5. +When the outcome variable is binary, power calculations are more straightforward, as the variance of the outcome is implied by the Bernoulli distribution, $$p*(1-p)$$, with its highest value at a base rate of 0.5. To obtain the required sample size, we can measure the effect size in terms of differences in the probability of success: @@ -69,6 +73,6 @@ $$ N^{*}=\left(\frac{p_{1}\left(1-p_{1}\right)}{\pi}+\frac{p_{0}\left(1-p_{0}\ri where $$p_{0}$$ is the probability of success or an event rate, and $$p_{1} = p_0 \times (1-MDE)$$. $$\pi$$ represents the fraction of the sample that is treated and $$z$$ refers to the respective values from the standard normal distribution. 
-In practice, like in the continuous case, we can also use the built-in function `power.prop.test()`. +**In practice, like in the continuous case, we can also use the built-in function `power.prop.test()`.** From 471cc2f6e17c8514063dad6f553547ac68f9d450 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 15:19:21 -0500 Subject: [PATCH 07/11] simulations --- guides/advanced-simulations.md | 66 ---------------------- guides/power/advanced-simulations.md | 82 ++++++++++++++++++++++++++++ 2 files changed, 82 insertions(+), 66 deletions(-) delete mode 100644 guides/advanced-simulations.md create mode 100644 guides/power/advanced-simulations.md diff --git a/guides/advanced-simulations.md b/guides/advanced-simulations.md deleted file mode 100644 index 2ea1c1f..0000000 --- a/guides/advanced-simulations.md +++ /dev/null @@ -1,66 +0,0 @@ ---- -layout: default -title: Power through simulations -grand_grand_parent: Selected advanced topics -grand_parent: Coding guides -parent: Power calculations -nav_order: 66 ---- - -## Calculating Power with Simulations - -A first step in power calculations is always the usage of the plug-in formulas we saw above. - -These analytical power calculations are useful for simple comparisons. For nonparametric tests and more complex or more specific design choices, simulation-based power calculations provide more flexibility. They for complex clustering and blocking schemes (i.e. stratification), can incorporate covariates, multiple treatment arms, etc. They also have advantages when it comes to accounting for the multiplicity of testing if we want to test more than one null hypothesis. - - - - -These simulations require us to specify the following: -- an underlying model -- the full experimental design -- sample size -- the values of the covariates -- the parameter values expressing the distribution of the outcome variable under the alternative hypothesis -- the variances. 
- -Based on the pre-specified model, we generate our synthetic data and run the estimation on these data a large number of times. In each round of the simulations we obtain a p-value. Power is then calculated as the proportion of p-values that are lower than the desired cutoff value $\alpha$. - -Note that we have to pay attention to cluster the standard errors appropriately for designs with clustered treatment assignment or blocked designs, otherwise the t-statistics will be upward biased and we will obtain inflated power numbers. - -As a starter, the most simple setup for a simulation would be to just draw from two different distributions and compare their means - the same way we did initially with the two sample hypothesis testing. -The only thing we add is that we repeat this example n-times and store the t-statistic every time. -At the end we just count the number of significant comparisons. The share of significant results is then the power of our "design" (which is a clear overstatement for this simple example). 
- -``` -# very basic example simulation -# instead of looping, we use the apply() class of functions -# this is called vectorization in R and is strictly preferred due to its computational advantages - -# define a function -get_t_result <- function(sampleSize, mean_C, mean_T, sd){ - sampleSize_C <- sampleSize_T <- sampleSize # 50/50 split - # take sample for both groups, assuming a normal distribution - # we assume the same sd for T and C - group1 <- rnorm(sampleSize, mean_C, sd) - group2 <- rnorm(sampleSize, mean_T, sd) - # do a t.test (or a regression and extract the t-stat or p-value): - ttest.result <- t.test(group1, group2) - # return the value of the t-stat (or equivalently the p-value): - return(tibble(tstat = ttest.result$statistic)) -} - -sampleSize <- 30 -num_runs <- 1000 -mean_C <- 30 -mean_T <- 35 -sd <- 50 -# Use apply to run the function n-times and store the t-statistics in a vector -tstats <- lapply(1:num_runs, function(x) get_t_result(sampleSize, mean_C, mean_T, sd)) - -# Count the number of tstats above 1.96 -mean(abs(unlist(tstats)) > 1.96) - -``` - -If the number of draws is sufficiently large, the power number here will correspond to the number that we determined above with the plug-in formula. diff --git a/guides/power/advanced-simulations.md b/guides/power/advanced-simulations.md new file mode 100644 index 0000000..bca2e3a --- /dev/null +++ b/guides/power/advanced-simulations.md @@ -0,0 +1,82 @@ +--- +layout: default +title: Advanced topics - simulations +grand_parent: Guides +parent: Power calculations +nav_order: 6 +--- + +## Calculating Power with Simulations + +**As discussed before, the initial step in power calculations should always be to use simple plug-in formulas.** + +These analytical power calculations are useful for simple comparisons. 
+**For nonparametric tests and more complex or more specific design choices, simulation-based power calculations provide more flexibility.** +They allow us to incorporate complex clustering and blocking schemes, as well as covariates, multiple treatment arms, and other refinements. +They also have advantages when it comes to accounting for the multiplicity of testing if we want to test more than one null hypothesis. + + + + +These simulations require us to specify the following: +- an underlying model +- the full experimental design +- the sample size +- the values of the covariates +- the parameter values expressing the distribution of the outcome variable under the alternative hypothesis +- the variances. + +**Based on the pre-specified model, we generate synthetic data and estimate the treatment effect on these data a large number of times. +In each round of the simulations we obtain a p-value. +Power is then calculated as the proportion of p-values that are lower than the desired cutoff value $\alpha$.** +For designs with clustered treatment assignment or blocked designs, it is important to cluster the standard errors appropriately, otherwise the t-statistics will be upward biased and we will obtain inflated power numbers. + +As a starter, the most simple setup for a simulation is illustrated in the code snippet below. +It draws from two different distributions n-times, compares their means, and stores the t-statistic. +The share of significant results is then the power of our "design" (which is a clear overstatement for this simple example). +If the number of draws is sufficiently large, the power number here will correspond to the number that we determined above with the plug-in formula. 
+ + +``` +# Very basic example simulation + +library(tidyverse) + +## Define a function to test the difference in means and return the t-test value +get_t_result <- + function(sampleSize, mean_C, mean_T, sd) { + + # Simulate values for the outcome variable for both groups + # Assuming a normal distribution with the same size and standard deviation in both groups + group1 <- rnorm(sampleSize, mean_C, sd) + group2 <- rnorm(sampleSize, mean_T, sd) + + # Test for difference in means using a t-test + # (could also run a regression and extract the t-stat or p-value) + ttest.result <- t.test(group1, group2) + + # Return the value of the t-stat + # (or equivalently the p-value) + return(tibble(tstat = ttest.result$statistic)) + } + +## Determine our input values +num_runs <- 1000 # Number of times the test will be run +sampleSize <- 30 # Sample size (this is the size of the control and the treatment groups separately, not the added total) +mean_C <- 30 # Control mean +mean_T <- 35 # Treatment mean +sd <- 50 # Standard deviation + +## Test the difference in means repeatedly +# (Instead of writing an explicit loop, we use purrr's map(), +# which applies the function once per run and returns a list of results) +tstats <- + map( + 1:num_runs, + ~ get_t_result(sampleSize, mean_C, mean_T, sd) + ) + +# Share of t-stats above 1.96 in absolute value (the simulated power) +mean(abs(unlist(tstats)) > 1.96) +``` + From 6ec2c67a52ec56e24c1c98bb8dc7ef5cb37ed095 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 15:19:33 -0500 Subject: [PATCH 08/11] Cluster randomization --- guides/{ => power}/cluster-randomization.md | 34 +++++++++++---------- 1 file changed, 18 insertions(+), 16 deletions(-) rename guides/{ => power}/cluster-randomization.md (60%) diff --git a/guides/cluster-randomization.md b/guides/power/cluster-randomization.md similarity index 60% rename from guides/cluster-randomization.md rename to guides/power/cluster-randomization.md index f73eade..38cc05f 100644 
--- a/guides/cluster-randomization.md +++ b/guides/power/cluster-randomization.md @@ -1,32 +1,33 @@ --- layout: default title: Cluster randomization -grand_parent: Coding guides +grand_parent: Guides parent: Power calculations -nav_order: 12 +nav_order: 5 --- ## Cluster/Group Level Randomization Individual randomization is typically preferred because of its statistical properties. Often it is not feasible, though, so we are forced to allocate treatment at a higher level of aggregation. -Cluster randomized experiments allocate treatments to groups, but measure outcomes at the level of the individuals that compose the groups. - +**Cluster-randomized experiments allocate treatments to groups, but measure outcomes at the level of the individuals that compose the groups.** The estimating equation therefore changes to: $$Y_{ij} = \beta_0 + \beta_1 D_j + v_j + \varepsilon_{ij},$$ where $$j$$ now denotes the cluster at which the treatment gets assigned (e.g. a village, health facility, classroom, ...). + Note that $$v_j$$ is an error term at the cluster level. The two variances are given by $$var(v_j) = \sigma^2_c$$ and $$var(\varepsilon_{ij}) = \sigma^2_p$$. These represent the variation of the outcome $$Y$$ at two different levels, group and individual (index $$p$$ stands for "personal" since $$i$$ is already in use). Combining them we get $$\sigma^2_c + \sigma^2_p = \sigma^2$$. -For cluster studies we need to come up with a measure that determines the proportion of the total variance accounted for by the between cluster variance component, i.e. how much of the variance is explained by the clusters alone. +For clustered studies, we need to come up with a measure that determines the proportion of the total variance accounted for by the between-cluster variance component, i.e., how much of the variance is explained by the clusters alone. This so-called Intra-cluster Correlation Coefficient (ICC) gives us a measure of how similar units within each clusters are. 
-It has to be taken either from a large pilot, prior studies, or, in fortunate situations from large scale administrative or survey data. -Quite often coming up with reliable estimates for the ICC can be very challenging because large sample sizes are needed. -Since it can have huge effects on required sample sizes - especially when the cluster size is large - it is often worth performing power calculations with a variety of ICC levels to get a range of required sample sizes. [See the DIL article on ICC for further details]. +It has to be taken either from freshly collected data, prior studies, or, in fortunate situations, from large scale administrative or survey data from other studies. +Coming up with reliable estimates for the ICC can be very challenging because large sample sizes are needed. +Since it can have large impacts on required sample sizes -- especially when the cluster size is large -- it is often worth performing power calculations with a variety of ICC levels to get a range of required sample sizes.[^1] + The ICC is a summary statistic that is defined as: $$\rho = \frac{\sigma^2_c}{\sigma^2_c + \sigma^2_p}.$$ @@ -36,24 +37,24 @@ A large ICC means there is a high degree of similarity of units within each clus As a consequence, less information is added by each individual. Adapting the formula involves some tedious algebra, but is generally straightforward. -### Continuous +### Continuous outcomes -The MDE changes to: +Under clustered randomization, the MDE formula for continuous outcomes changes to $$\delta = \left(t_\beta + t_{\frac{\alpha}{2}}\right) 2 \biggl( \frac{m \sigma^2_c + \sigma^2_p}{m k} \biggr),$$ where there are $$m$$ individuals in every cluster and the total number of clusters is $$k$$. -It is straightforward to adjust these calculations to allow for varying clustersizes. -Usually this is done by providing the coefficient of variation in clustersize across all clusters. 
-A large difference in clustersize generally decreases power, or, keeping power constant, increases the needed sample size to detect an effect. +It is straightforward to adjust these calculations to allow for varying cluster sizes. +This is usually done by providing the coefficient of variation in cluster size across all clusters. +A large difference in cluster sizes generally decreases power, or, keeping power constant, increases the necessary sample size to detect an effect. -In a similar fashion, the required sample size per treatment arm - assuming equal allocation - $$n^* = m^* \ k^*$$ becomes: +In a similar fashion, the required sample size per treatment arm -- assuming equal allocation -- $$n^* = m^* \ k^*$$ becomes $$n^* = 2\left(t_\beta + t_{\frac{\alpha}{2}}\right)^2\frac{\sigma^2}{\delta^2} \ (1 + (m-1) \rho).$$ -### Binary +### Binary outcomes -For simplicity we abstract from differences in cluster sizes here again, but accommodating this in the formulas is quite straightforward (a high variation would further reduce power). +For simplicity, we abstract from differences in cluster sizes here again, but accommodating this in the formulas is quite straightforward (a high variation would further reduce power). $$ N^{*}=\left(\frac{p_{1}\left(1-p_{1}\right)}{\pi}+\frac{p_{0}\left(1-p_{0}\right)}{1-\pi}\right) \frac{\left(z_{\beta}+z_{\alpha / 2}\right)^{2}}{\left(p_{1}-p_{0}\right)^{2}} \ (1 + (m-1) \rho).$$ @@ -62,3 +63,4 @@ It is easy to see that the cluster randomization formulas nest the individual fo Experiment with the [companion Shiny app](https://lehner.shinyapps.io/dil_power/) to see how detrimental ICC is for power. +[^1]: See the [guide on ICC](https://github.com/DevInnovationLab/internal-resources/blob/main/power/topics/ICC.Rmd) for further details (internal access only). 
\ No newline at end of file From f0bb120a6e27eca2d2bccf334d1c4945386348ec Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 15:19:40 -0500 Subject: [PATCH 09/11] Delete advanced-topics.md --- guides/advanced-topics.md | 14 -------------- 1 file changed, 14 deletions(-) delete mode 100644 guides/advanced-topics.md diff --git a/guides/advanced-topics.md b/guides/advanced-topics.md deleted file mode 100644 index 28bdcaf..0000000 --- a/guides/advanced-topics.md +++ /dev/null @@ -1,14 +0,0 @@ ---- -layout: default -title: Selected advanced topics -grand_parent: Coding guides -parent: Power calculations -nav_order: 20 -has_children: true -has_toc: true ---- - -## Intro - -this is a test with a formula $$ Y_i = \beta D_i + \varepsilon_i $$ - From 1e81c419ea068325d764b7564d8559b496b196c2 Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 15:20:04 -0500 Subject: [PATCH 10/11] Fix power references --- guides/power/resources-references.md | 30 ++++++++++++++++++++++++ guides/resources-references.md | 34 ---------------------------- 2 files changed, 30 insertions(+), 34 deletions(-) create mode 100644 guides/power/resources-references.md delete mode 100644 guides/resources-references.md diff --git a/guides/power/resources-references.md b/guides/power/resources-references.md new file mode 100644 index 0000000..ccbd28b --- /dev/null +++ b/guides/power/resources-references.md @@ -0,0 +1,30 @@ +--- +layout: default +title: Resources and references +grand_parent: Guides +parent: Power calculations +nav_order: 99 +--- + +## Resources and references for power calculations + +### Useful Links + +- [EGAP's 10 Things to Know About Statistical Power](https://egap.org/resource/10-things-to-know-about-statistical-power/) +- [JPAL's research resources page on power calculations](https://www.povertyactionlab.org/resource/power-calculations) +- [DIME Wiki page on power calculations](https://dimewiki.worldbank.org/Power_Calculations) +- [Rachel Glennerster's 
recorded lecture "Sampling and Sample Size"](https://www.youtube.com/watch?v=aNbabnONlY4): a non-technical introduction + +### Further Reading + +- The classic reference for randomization studies in economics is [Duflo et al. (2007)](https://doi.org/10.1016/S1573-4471(07)04061-2). +- [Athey & Imbens (2017)](https://doi.org/10.1016/bs.hefe.2016.10.003) offer a slightly more technical and up-to-date treatment. +- A classic reference on statistical power is [Murphy et al. (2014)](https://www.routledge.com/Statistical-Power-Analysis-A-Simple-and-General-Model-for-Traditional-and/Myors-Murphy/p/book/9781032283005). +- Derivations for the formulas used can be found for instance in [McConnell & Vera-Hernandez (2015)](https://ifs.org.uk/publications/going-beyond-simple-sample-size-calculations-practitioners-guide). +- See [Czibor et al. (2019)](https://onlinelibrary.wiley.com/doi/10.1002/soej.12392) for a short summary paper with many useful suggestions for field experiments in general. + +### Other recommended material + +- [Schochet (2013)](https://doi.org/10.1080/19345747.2012.725803): power for binary outcomes +- [Gelman, Hill, Vehtari (2020)](https://doi.org/10.1017/9781139161879): Regression and Other Stories (esp. 
Chapter 16) +- [Gelman & Carlin (2014)](https://doi.org/10.1177/1745691614551642): for a good discussion on post-hoc power \ No newline at end of file diff --git a/guides/resources-references.md b/guides/resources-references.md deleted file mode 100644 index c9b2d8f..0000000 --- a/guides/resources-references.md +++ /dev/null @@ -1,34 +0,0 @@ ---- -layout: default -title: Resources and references -grand_parent: Coding guides -parent: Power calculations -nav_order: 99 ---- - -## Resources and references for power calculations - -### Useful Links - -- [EGAP's 10 Things to Know About Statistical Power](https://egap.org/resource/10-things-to-know-about-statistical-power/) -- [JPAL's research resources page on power calculations](https://www.povertyactionlab.org/resource/power-calculations) -- [DIME Wiki page on power calculations](https://dimewiki.worldbank.org/Power_Calculations) -- [Rachel Glennerster's recorded lecture "Sampling and Sample Size"](https://www.youtube.com/watch?v=aNbabnONlY4): a non-technical introduction - -### Further Reading - -- The classic reference for randomization studies in economics is [Duflo et al. (2007)](#DufloGlennersterKremer2007). -- [Athey & Imbens (2017)](#AtheyImbens2017) offer a slightly more technical and up-to-date treatment. -- A classic reference on statistical power is [Murpy et al. (2014)](#MurphyMyorsWolach2014). -- Derivations for the formulas used can be found for instance in [McConnell & Vera-Hernandez (2015)](#McConnellVera-Hernandez2015). -- See [Czibor et al. (2019)](#CziborJimenez-GomezList2019) for a short summary paper with many useful suggestions for field experiments in general. - -### Other recommended material - -- [Schochet (2013)](#Schochet2013): power for binary outcomes -- [Gelman, Hill, Vehktari (2020)](#GelmanHillVehtari2020): Regression and Other Stories (esp. 
Chapter 16) -- [Gelman & Carlin (2014)](#GelmanCarlin2014): for a good discussion on post-hoc power - -### References - -{% bibliography %} \ No newline at end of file From 6db14c977259b95f57058c17301530b606a41cee Mon Sep 17 00:00:00 2001 From: Luiza Andrade Date: Wed, 18 Oct 2023 15:20:14 -0500 Subject: [PATCH 11/11] Clarify mention of pilot --- guides/power/img/avatar.svg | 1 + guides/power/img/logo.svg | 1 + guides/power/individual-randomization.md | 3 ++- 3 files changed, 4 insertions(+), 1 deletion(-) create mode 100644 guides/power/img/avatar.svg create mode 100644 guides/power/img/logo.svg diff --git a/guides/power/img/avatar.svg b/guides/power/img/avatar.svg new file mode 100644 index 0000000..084c985 --- /dev/null +++ b/guides/power/img/avatar.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/guides/power/img/logo.svg b/guides/power/img/logo.svg new file mode 100644 index 0000000..9ce4df4 --- /dev/null +++ b/guides/power/img/logo.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/guides/power/individual-randomization.md b/guides/power/individual-randomization.md index c39b0e1..d9f5616 100644 --- a/guides/power/individual-randomization.md +++ b/guides/power/individual-randomization.md @@ -33,7 +33,7 @@ Writing down this hypothesis test formally, it is straightforward to derive a cl The standard deviation of the outcome is usually unknown and hence needs an estimation -- or a best guess. Estimating the standard deviation of the outcome can be difficult, specially if there is little prior work on the study's subject and context. -However, it is necessary to conduct power calculations, and small-scale data collection is often the only good way to get a credible estimate. 
+However, it is necessary to conduct power calculations, and data collection is often the only good way to get a credible estimate.[^1] We can either assume that the variance of the outcome is homogeneous across the treated and control group, and the standard deviation is $$\sigma$$, or we can assume that the variance is different for the two groups, and the standard deviations are $$\sigma_0, \sigma_1$$ respectively. We only focus on the first case here. @@ -76,3 +76,4 @@ $$\pi$$ represents the fraction of the sample that is treated and $$z$$ refers t **In practice, like in the continuous case, we can also use the built-in function `power.prop.test()`.** +[^1]: It is common for studies to rely on survey pilots for an estimate of the standard deviation of the outcome variable. When doing so, researchers should take into account that pilot samples are usually small, so estimates of data moments are noisy by construction. \ No newline at end of file