diff --git a/guides/advanced-simulations.md b/guides/advanced-simulations.md
deleted file mode 100644
index af04f60..0000000
--- a/guides/advanced-simulations.md
+++ /dev/null
@@ -1,66 +0,0 @@
----
-layout: default
-title: Power through simulations
-grand_grand_parent: Selected advanced topics
-grand_parent: Guides
-parent: Power calculations
-nav_order: 66
----
-
-## Calculating Power with Simulations
-
-A first step in power calculations is always the usage of the plug-in formulas we saw above.
-
-These analytical power calculations are useful for simple comparisons. For nonparametric tests and more complex or more specific design choices, simulation-based power calculations provide more flexibility. They for complex clustering and blocking schemes (i.e. stratification), can incorporate covariates, multiple treatment arms, etc. They also have advantages when it comes to accounting for the multiplicity of testing if we want to test more than one null hypothesis.
-
-
-
-
-These simulations require us to specify the following:
-- an underlying model
-- the full experimental design
-- sample size
-- the values of the covariates
-- the parameter values expressing the distribution of the outcome variable under the alternative hypothesis
-- the variances.
-
-Based on the pre-specified model, we generate our synthetic data and run the estimation on these data a large number of times. In each round of the simulations we obtain a p-value. Power is then calculated as the proportion of p-values that are lower than the desired cutoff value $\alpha$.
-
-Note that we have to pay attention to cluster the standard errors appropriately for designs with clustered treatment assignment or blocked designs, otherwise the t-statistics will be upward biased and we will obtain inflated power numbers.
-
-As a starter, the most simple setup for a simulation would be to just draw from two different distributions and compare their means - the same way we did initially with the two sample hypothesis testing.
-The only thing we add is that we repeat this example n-times and store the t-statistic every time.
-At the end we just count the number of significant comparisons. The share of significant results is then the power of our "design" (which is a clear overstatement for this simple example).
-
-```
-# very basic example simulation
-# instead of looping, we use the apply() class of functions
-# this is called vectorization in R and is strictly preferred due to its computational advantages
-
-# define a function
-get_t_result <- function(sampleSize, mean_C, mean_T, sd){
- sampleSize_C <- sampleSize_T <- sampleSize # 50/50 split
- # take sample for both groups, assuming a normal distribution
- # we assume the same sd for T and C
- group1 <- rnorm(sampleSize, mean_C, sd)
- group2 <- rnorm(sampleSize, mean_T, sd)
- # do a t.test (or a regression and extract the t-stat or p-value):
- ttest.result <- t.test(group1, group2)
- # return the value of the t-stat (or equivalently the p-value):
- return(tibble(tstat = ttest.result$statistic))
-}
-
-sampleSize <- 30
-num_runs <- 1000
-mean_C <- 30
-mean_T <- 35
-sd <- 50
-# Use apply to run the function n-times and store the t-statistics in a vector
-tstats <- lapply(1:num_runs, function(x) get_t_result(sampleSize, mean_C, mean_T, sd))
-
-# Count the number of tstats above 1.96
-mean(abs(unlist(tstats)) > 1.96)
-
-```
-
-If the number of draws is sufficiently large, the power number here will correspond to the number that we determined above with the plug-in formula.
diff --git a/guides/advanced-topics.md b/guides/advanced-topics.md
deleted file mode 100644
index ce51863..0000000
--- a/guides/advanced-topics.md
+++ /dev/null
@@ -1,103 +0,0 @@
----
-layout: default
-title: Selected advanced topics
-grand_parent: Guides
-parent: Power calculations
-nav_order: 20
-has_children: true
-has_toc: true
----
-
-## Some more advanced topics
-
-This section covers a few selected topics that are important for statistical power and experimental design: the inclusion of covariates, panel data (i.e. multiple observations per unit over time), stratification (sometimes called blocking), and multiple hypothesis testing (MHT) corrections.
-
-### Including covariates
-
-The obvious way to improve power is to just brute-forcely increase the sample size.
-A cheaper and more elegant way is to improve the regression model that is being estimated.
-By including important predictors for $$Y$$, we will usually reduce the variance in the residuals, $$\varepsilon$$, and therefore increase the precision of our coefficient of interest - which generally will be the $$\beta_1$$ corresponding to a dummy variable for whether a unit/individual was treated or not - by lowering its standard error.
-This means that we will require a lower sample size to achieve the same power as compared to an unconditional model.
-
-
-Power thus depends on the variability in the error term.
-More variable outcomes $$Y$$ make it harder to detect the changes resulting from our treatment.
-If we control for a set of variables that we think are likely to predict the outcome of interest, then we would typically analyze the data by running the regression:
-
-$$Y_i = \beta_0 + \beta_1 D_i + X_i\gamma + u_i$$
-
-Where $X$ is a set of controls that soak up some of the variation in $$Y$$.
-Power is then going to depend on the variability of the error term $$u$$, which is less than the variability in $$\varepsilon$$ because we have explained some of the variation in the outcome.
-Note, however, that controlling for variables that explain little or none of the variation in the outcome will increase standard errors by reducing degrees of freedom.
-
-In a simple randomized experiment, controlling for baseline values of covariates does not affect the expected value of the estimated $$\hat{\beta_1}$$, but, as we said, it can reduce its variance.
-Note that controlling for covariates affected by treatment $$D$$ would bias the estimate of $$\beta_1$$ by capturing part of its impact.
-Information on covariates should therefore be collected in baseline surveys or, if possible, from other observational data that has been collected before the experiment.
-
-
-### Power calculations for panel data
-
-When there is also a time dimension, David McKenzie (2012) illustrated how one can increase statistical power by taking multiple measurements of the relevant outcomes at relatively short intervals.
-He argues that stronger unit-specific shocks can erode the benefits of collecting additional waves of data.
-What essentially increases power in this setting is the fact that noise in the outcome variable can be averaged out over multiple collection periods.
-Burlig, Preonas, and Woerman (2020) extend his argument to within-unit serial correlation, demonstrating that higher autocorrelation in the idiosyncratic error term can similarly erode - and even reverse - the benefits of increased panel length.
-Their result reflects the analytical properties of estimators that leverage both pre- and post-treatment data and does not reflect the DD estimator over-controlling for pre-period data.
-The implementation of these panel formulas (Burlig et al. also have simulations) is only available in Stata as of September 2023.
-
-
-### Stratification (blocking)
-
-
-
-Stratification and pairing are intended to improve the efficiency of the design by disallowing assignments that are likely to be uninformative about the treatment effects of interest.
-In a stratified randomized experiment, the covariate space is partitioned into a finite set of subsets - the strata (or blocks).
-Within each of these subsets, a completely randomized experiment is carried out, after which the results are combined.
-Stratification carried out properly will ensure that we do not end up in a situation where we have imbalances in baseline covariates.
-This will also protect against type I errors.
-
-
-The blocking approach is preferred to model-based analyses applied after the randomization to adjust for differences in covariates.
-The stronger the covariates are correlated with the outcome variable of interest, the more beneficial stratification will be.
-The primary advantage is a precision gain by reducing the residual variance and therefore increasing power.
-The average treatment effect in such a design is equal to the weighted average of the difference between treated and untreated units in each group - the weight being the number of observations in each group.
-
-The regression specification when the randomization was carried out within stratum changes to:
-
-$$Y_{ij} = \beta_0 + \beta_1 D_i + M_{ij}\gamma + v_j + \varepsilon_{ij},$$
-
-where this is a fully saturated model with $$M$$ being a set of dummy variables indicating the respective stratum (or block) $$j$$ for every observation $$i$$.
-You can think of these as fixed effects and we do not care about the set of unit-specific coefficients $$\gamma$$.
-Importantly, we have to adjust the standard errors $$\varepsilon_{ij}$$ to allow for dependence within each stratum.
-This amounts to choosing **cluster robust standard errors at the stratum level**.
-If stratification was successful, the standard error of our coefficient of interest, $$\beta_1$$, will be smaller than if we had used the default robust standard errors.
-
-Apart from reducing variance, an important reason to adopt a stratified design is when the researchers are interested in the effect of the program on specific subgroups.
-If one is interested in the effect of the program on a subgroup, the experiment must have enough power for this subgroup (each subgroup constitutes in some sense a distinct experiment).
-Stratification according to those subgroups then ensures that the ratio between treatment and control units is determined by the experimenter in each subgroup, and can therefore be chosen optimally.
-
-In R we can e.g. use `blockTools` or `randomizr` to carry out stratification. In Stata this is usually done via `egen strata=group()`.
-
-
-
-### Multiple hypothesis testing (MHT)
-
-In practice, most RCTs test more than one null hypothesis.
-For example, we might have constructed multiple treatment arms, are interested in heterogeneous treatment effects, or want to test a treatment effect on multiple outcome variables.
-From the way we defined null hypothesis testing above, it is easy to see that the more hypotheses we test, the more false rejections - i.e. false positives - we are going to get.
-To account for the multiplicity of testing, we can aim to control either the family-wise error rate (FWER), or the false-discovery rate (FDR).
-These adjustments reduce the likelihood of spurious findings but also change the statistical power, which reduces the probability of detecting effects when they do exist.
-
-Controlling the FWER, which is the probability of making at least one false rejection among all tests, is the more straightforward, but usually also the more conservative approach.
-If we test $$m$$ times, the probability of making a type I error gets inflated to $$1-(1-\alpha)^m$$.
-To bring it back to our desired level, usually 0.05, we simply divide $$\alpha$$ by the number of hypothesis tests $$m$$ and reject only null hypotheses that are below this adjusted threshold. This is the well-known Bonferroni correction. The procedure by Holm also controls the FWER but is slightly more complex.
-
-The FDR instead controls the number of false rejections out of the total number of rejections.
-We define a q-value (the level at which to control the FDR), then order all the p-values ($$p_1 < ... < p_m$$), define $$L = max\{j: p_j < qj/m \}$$, and, finally, reject all null hypotheses for which $$p_j \leq p_L$$.
-This is the famous procedure by Benjamini and Hochberg.
-
-In practice, we can use the function `p.adjust()` in R to do all of these corrections. An overview of commands in Stata for MHT can be found [on the Worldbank blog](https://blogs.worldbank.org/impactevaluations/overview-multiple-hypothesis-testing-commands-stata).
-If feasible, a more accurate way to account for multiple testing in power calculations is through simulation. The main reason is that it allows us to address the extent to which the multiple comparisons are correlated with one another and thus delivers less conservative thresholds for rejection.
-
-
-
diff --git a/guides/guides.md b/guides/guides.md
index 48440e2..aa86445 100644
--- a/guides/guides.md
+++ b/guides/guides.md
@@ -5,6 +5,6 @@ nav_order: 3
has_children: true
---
-# DIL Guides
+# DIL guides
-Collection of guides.
\ No newline at end of file
+These are step-by-step guides covering common coding tasks in DIL's projects. The guides are a work in progress. If you'd like to see guidance on topics that are not yet covered here, feel free to [create an issue](https://github.com/DevInnovationLab/guides/issues) requesting a new guide.
\ No newline at end of file
diff --git a/guides/how-to-step-by-step.md b/guides/how-to-step-by-step.md
deleted file mode 100644
index 8fdbb82..0000000
--- a/guides/how-to-step-by-step.md
+++ /dev/null
@@ -1,34 +0,0 @@
----
-layout: default
-title: How to perform power calculations for your project, step by step
-grand_parent: Guides
-parent: Power calculations
-nav_order: 2
----
-
-## How to Perform Power Calculations for your Project, Step by Step
-
-First, we have to gather intelligence on the three most fundamental things:
-
-- What is the expected mean of our variable of interest?
-- How much do values deviate from this mean in reality?
-- Which effect size can we expect roughly from our intervention?
-
-
-
-
-
-In case we want (or have) to assign treatment at a cluster or group level, we also need to gather information on the degree of similarity within each cluster or group.
-This descriptive statistic is often referred to as the Intra-cluster Correlation Coefficient (ICC).
-
-The relevant pieces of information can be gathered through a review of similar studies that were carried out in the past, the analysis of surveys and/or administrative data, or, ideally, a combination of the two.
-It is not advised to rely on pilot studies for power calculations due to small sample size problems.
-
-As a first step we always use basic plug-in formulas to get a rough sense for the needed sample size.
-This serves as a benchmark and can/will be refined later by controlling for important covariates/predictors which usually soak up some of the variation in the variable of interest, reducing variance in the residuals of the regression and thus lowering the required sample size for any specified level of precision or power.
-Similar gains can be achieved by stratification/blocking.
-For more complex settings we usually want to resort to simulation methods in a final step.
-This is something where we would recommend using R.
-
-When computing the needed sample size for an RCT, one of the most important questions we have to answer is what is the smallest difference in the average of the outcome variable between treatment and control group that we would like to be able to detect.
-As a primer, we will quickly revisit conventional two-sample hypothesis testing in order to see the role effect size, $\delta$, and standard deviation of the outcome, $\sigma$, play in the process.
\ No newline at end of file
diff --git a/guides/intro-to-power.md b/guides/intro-to-power.md
deleted file mode 100644
index 0df2fe9..0000000
--- a/guides/intro-to-power.md
+++ /dev/null
@@ -1,32 +0,0 @@
----
-layout: default
-title: Introduction to statistical power
-grand_parent: Guides
-parent: Power calculations
-nav_order: 3
----
-
-
-> Statistical power is defined as the probability, before a study is performed, that a particular comparison will achieve "statistical significance" at some predetermined level (typically a p-value below 0.05), given some assumed true effect size.
-> A power analysis is performed by first hypothesizing an effect size, then making some assumptions about the variation in the data and the sample size of the study to be conducted, and finally using probability calculations to determine the chance of the p-value being below the threshold.
-
-[Gelman, Hill, Vehktari (2020) Regression and Other Stories, Chapter 16]
-
-First, it has to be clarified that null hypothesis significance testing (NHST) - and thus also statistical power - is a frequentist concept only[^1].
-The question we want to answer is: **how large a sample do we have to collect in order to be reasonably sure of meeting the study objectives**?
-Power calculations are not just useful for randomized evaluations, however, but also worth doing before we carry out studies on *purely observational data*.
-Think of a scenario where an intervention or event has already happened, and we need to figure out whether it is worth the time and expense of collecting or trying to get access to data.
-Power calculations are sometimes even worth doing on published/finished research.
-
-
-
-
-## A useful Heuristic: "Sixteen S-squared over D-squared"
-
-A rough approximation introduced by Lehr says that for a two sample t-test with power 80% and significance level 5%, the required sample size can be approximated by $$n \approx 16 \frac{\sigma^2}{\delta^2}$$, where $$\sigma$$ is the standard error of the variable of interest and $$\delta$$ the difference in means that we want to detect.
-
-
-[^1]: A Bayesian would think about power differently. In essence, instead of focusing on frequencies of rejecting a null hypothesis when an alternative is true, a Bayesian would consider the posterior distribution of a parameter given the data. The strength of evidence for or against particular parameter values is expressed in terms of probability distributions. To analogously assess "power," a Bayesian might consider the probability that a future observation or set of observations will fall within a particular region, given the current posterior distribution.
-
-
-
diff --git a/guides/low-power-bad.md b/guides/low-power-bad.md
deleted file mode 100644
index 76c98da..0000000
--- a/guides/low-power-bad.md
+++ /dev/null
@@ -1,25 +0,0 @@
----
-layout: default
-title: Why underpowered studies are bad
-grand_parent: Guides
-parent: Power calculations
-nav_order: 4
----
-
-# Why Underpowered Studies are Bad
-
-
-Inference from low-powered studies is problematic for two main reasons.
-First, by definition, adequate power is required to ensure that studies have a high likelihood of detecting a genuine effect. Low power implies high rates of false negatives whereby the null hypothesis of "no effect" is not rejected, despite being false.
-This can arguably have impacts on the development of scientific knowledge: false negative results are less likely to be followed up than false positives, self-correction is less likely to occur in these cases.
-
-Second, low power threatens the credibility of research findings through effect inflation because in settings where standard errors are large, only those findings that by chance overestimate the magnitude of the effect will appear statistically significant and thus pass the threshold for discovery. Effect inflation is therefore more severe in underpowered studies that are based on small samples in the presence of high measurement error. This problem is also known as the winner's curse or referred to as a Type M error (a term coined by Andrew Gelman). It is well-known that conditional on NHST, research findings (from non-experimental data) suffer from an upward bias (because of a "file drawer problem" of non-significant studies that are not being published) and there are ways to correct for this (e.g. a procedure by Andrews and Kasy).
-
-
-
-
-
-
-
-
-Keep in mind the following: a corollary from everything we covered so far is that if a study has a small sample size, it necessarily has to find a large effect in order to obtain a significant result (according to our frequentist hypothesis testing standards).
diff --git a/guides/power-at-dil.md b/guides/power-at-dil.md
deleted file mode 100644
index d5d61e5..0000000
--- a/guides/power-at-dil.md
+++ /dev/null
@@ -1,23 +0,0 @@
----
-layout: default
-title: Power calculations at DIL
-grand_parent: Guides
-parent: Power calculations
-nav_order: 1
----
-
-## Guidelines for a DIL Powercalculation Summary Document
-
-It is crucial that every project has a standing file - ideally a notebook combining code and text - that showcases power calculations.
-Over its lifecycle, different RAs, RPs, and PIs will contribute to it.
-It is thus essential that [coding principles are strictly followed](https://devinnovationlab.github.io/guides/principles/writing-code.html) as in any other piece of work.
-Such a document will also help the team think clearly about research design and how to best communicate it at a very early stage.
-
-- Always start with a quick abstract summary of descriptive statistics: mean and standard deviation of outcomes, cluster sizes, estimates for the intra-cluster correlation, and expected effect sizes (with references to the respective literature).
-- Showcase simple power calculations from plug-in formulas first.
-- Using these as benchmarks, incorporate more sophisticated designs only afterwards and see how and why power increases (often this has to be done through simulations).
-- Always link to the folder/repository where the code is stored so anyone can always have a look at the latest, up-to-date version (ideally also push these reports to the Github repository of the respective project).
-- Every number has to be easily reproducible. If there is no replication code, then every number that was plugged into a formula has to be stated clearly.
-
-You can find some example notebooks in the `power/examples/` folder following [this link (main branch of `internal-resources` repo, Github acces required)](https://github.com/DevInnovationLab/internal-resources/tree/main/power/examples).
-
diff --git a/guides/power-calculations.md b/guides/power-calculations.md
deleted file mode 100644
index b4c1d4f..0000000
--- a/guides/power-calculations.md
+++ /dev/null
@@ -1,17 +0,0 @@
----
-layout: default
-title: Power calculations
-parent: Guides
-has_children: true
-has_toc: true
----
-
-
-# Power calculations
-
-We will first discuss a few general considerations regarding research design and statistical power and then, most importantly, illustrate what an ideal power calculation workflow in DIL projects looks like.
-The most common research scenarios are being outlined and we'll see how power calculations change based on design choices.
-This note is accompanied by an interactive `Shiny` document which can be [accessed here](https://lehner.shinyapps.io/dil_power/) and is best used side-by-side to this note.
-More advanced topics are only treated at a high level by providing basic intuition.
-These are covered in detail with example code in separate notebooks which are hosted [on the main branch](https://github.com/DevInnovationLab/internal-resources/tree/main/power/topics) of the `internal-resources` repository (Github access needed).
-If you have a good example code from your own project, please reach out so we can showcase it and help improve the quality of future DIL work.
\ No newline at end of file
diff --git a/guides/power.md b/guides/power.md
new file mode 100644
index 0000000..27e87a0
--- /dev/null
+++ b/guides/power.md
@@ -0,0 +1,20 @@
+---
+layout: default
+title: Power calculations
+parent: Guides
+has_children: true
+has_toc: true
+---
+
+
+# Power calculations
+
+We will first discuss a few general considerations regarding research design and statistical power and then, most importantly, illustrate what an ideal power calculation workflow in DIL projects looks like.
+We outline the most common research scenarios and show how power calculations change based on design choices.
+**This note is accompanied by an interactive `Shiny` document which can be [accessed here](https://lehner.shinyapps.io/dil_power/) and is best used side by side with this note.**
+**More advanced topics are only treated at a high level by providing basic intuition.**
+**These are covered in detail with example code in separate notebooks which are hosted [on the main branch](https://github.com/DevInnovationLab/guides/tree/main) of the `guides` repository.**
+If you have good example code from your own project, please reach out so we can showcase it and help improve the quality of future DIL work.
+
+When computing the needed sample size for an RCT, one of the most important questions we have to answer is how small a difference in the average outcome between the treatment and control groups we would like to be able to detect.
+As a primer, we will quickly revisit conventional two-sample hypothesis testing in order to see the roles that the effect size, $$\delta$$, and the standard deviation of the outcome, $$\sigma$$, play in the process.
\ No newline at end of file
diff --git a/guides/power/advanced-simulations.md b/guides/power/advanced-simulations.md
new file mode 100644
index 0000000..bca2e3a
--- /dev/null
+++ b/guides/power/advanced-simulations.md
@@ -0,0 +1,82 @@
+---
+layout: default
+title: Advanced topics - simulations
+grand_parent: Guides
+parent: Power calculations
+nav_order: 6
+---
+
+## Calculating Power with Simulations
+
+**As discussed before, the initial step in power calculations should always be to use simple plug-in formulas.**
+
+These analytical power calculations are useful for simple comparisons.
+**For nonparametric tests and more complex or more specific design choices, simulation-based power calculations provide more flexibility.**
+They allow us to incorporate complex clustering and blocking schemes, as well as covariates, multiple treatment arms, and other refinements.
+They also have advantages when it comes to accounting for the multiplicity of testing if we want to test more than one null hypothesis.
+
+
+
+
+These simulations require us to specify the following:
+- an underlying model
+- the full experimental design
+- the sample size
+- the values of the covariates
+- the parameter values expressing the distribution of the outcome variable under the alternative hypothesis
+- the variances.
+
+**Based on the pre-specified model, we generate synthetic data and estimate the treatment effect on these data a large number of times.
+In each round of the simulations we obtain a p-value.
+Power is then calculated as the proportion of p-values that are lower than the desired cutoff value $$\alpha$$.**
+For designs with clustered treatment assignment or blocking, it is important to cluster the standard errors appropriately; otherwise, the t-statistics will be biased upward and we will obtain inflated power numbers.
+
+As a starting point, the simplest setup for a simulation is illustrated in the code snippet below.
+It repeatedly draws two samples from different distributions, compares their means, and stores the t-statistic each time.
+The share of significant results is then the power of our "design" (which is a clear overstatement for this simple example).
+If the number of draws is sufficiently large, the power number here will closely match the number that we determined above with the plug-in formula.
+
+
+```
+# Very basic example simulation
+
+library(tidyverse)
+
+## Define a function to test the difference in means and return the t-test value
+get_t_result <-
+ function(sampleSize, mean_C, mean_T, sd) {
+
+ # Simulate values for the outcome variable for both groups
+ # Assuming a normal distribution with the same size and standard deviation in both groups
+ group1 <- rnorm(sampleSize, mean_C, sd)
+ group2 <- rnorm(sampleSize, mean_T, sd)
+
+ # Test for a difference in means using a t-test
+ # (could also run a regression and extract the t-stat or p-value)
+ ttest.result <- t.test(group1, group2)
+
+ # Return the value of the t-stat
+ # (or equivalently the p-value)
+ return(tibble(tstat = ttest.result$statistic))
+ }
+
+## Determine our input values
+num_runs <- 1000 # Number of times the test will be run
+sampleSize <- 30 # Sample size (this is the size of the control and the treatment groups separately, not the added total)
+mean_C <- 30 # Control mean
+mean_T <- 35 # Treatment mean
+sd <- 50 # Standard deviation
+
+## Test the difference in means repeatedly
+# (Instead of writing an explicit loop, we use purrr's map(), a functional in the same family as lapply();
+# this avoids loop bookkeeping, although it is not vectorization in the strict sense)
+tstats <-
+ map(
+ 1:num_runs,
+ ~ get_t_result(sampleSize, mean_C, mean_T, sd)
+ )
+
+# Compute the share of t-stats above 1.96 in absolute value (this share is the simulated power)
+mean(abs(unlist(tstats)) > 1.96)
+```
+
diff --git a/guides/cluster-randomization.md b/guides/power/cluster-randomization.md
similarity index 60%
rename from guides/cluster-randomization.md
rename to guides/power/cluster-randomization.md
index cd8e98c..38cc05f 100644
--- a/guides/cluster-randomization.md
+++ b/guides/power/cluster-randomization.md
@@ -3,30 +3,31 @@ layout: default
title: Cluster randomization
grand_parent: Guides
parent: Power calculations
-nav_order: 12
+nav_order: 5
---
## Cluster/Group Level Randomization
Individual randomization is typically preferred because of its statistical properties.
Often it is not feasible, though, so we are forced to allocate treatment at a higher level of aggregation.
-Cluster randomized experiments allocate treatments to groups, but measure outcomes at the level of the individuals that compose the groups.
-
+**Cluster-randomized experiments allocate treatments to groups, but measure outcomes at the level of the individuals that compose the groups.**
The estimating equation therefore changes to:
$$Y_{ij} = \beta_0 + \beta_1 D_j + v_j + \varepsilon_{ij},$$
where $$j$$ now denotes the cluster at which the treatment gets assigned (e.g. a village, health facility, classroom, ...).
+
Note that $$v_j$$ is an error term at the cluster level.
The two variances are given by $$var(v_j) = \sigma^2_c$$ and $$var(\varepsilon_{ij}) = \sigma^2_p$$.
These represent the variation of the outcome $$Y$$ at two different levels, group and individual (index $$p$$ stands for "personal" since $$i$$ is already in use).
Combining them we get $$\sigma^2_c + \sigma^2_p = \sigma^2$$.
-For cluster studies we need to come up with a measure that determines the proportion of the total variance accounted for by the between cluster variance component, i.e. how much of the variance is explained by the clusters alone.
+For clustered studies, we need to come up with a measure that determines the proportion of the total variance accounted for by the between-cluster variance component, i.e., how much of the variance is explained by the clusters alone.
This so-called Intra-cluster Correlation Coefficient (ICC) gives us a measure of how similar units within each clusters are.
-It has to be taken either from a large pilot, prior studies, or, in fortunate situations from large scale administrative or survey data.
-Quite often coming up with reliable estimates for the ICC can be very challenging because large sample sizes are needed.
-Since it can have huge effects on required sample sizes - especially when the cluster size is large - it is often worth performing power calculations with a variety of ICC levels to get a range of required sample sizes. [See the DIL article on ICC for further details].
+It has to be taken either from freshly collected data, prior studies, or, in fortunate situations, from large-scale administrative or survey data from other studies.
+Coming up with reliable estimates for the ICC can be very challenging because large sample sizes are needed.
+Since it can have large impacts on required sample sizes -- especially when the cluster size is large -- it is often worth performing power calculations with a variety of ICC levels to get a range of required sample sizes.[^1]
+
The ICC is a summary statistic that is defined as:
$$\rho = \frac{\sigma^2_c}{\sigma^2_c + \sigma^2_p}.$$
@@ -36,24 +37,24 @@ A large ICC means there is a high degree of similarity of units within each clus
As a consequence, less information is added by each individual.
Adapting the formula involves some tedious algebra, but is generally straightforward.
-### Continuous
+### Continuous outcomes
-The MDE changes to:
+Under clustered randomization, the MDE formula for continuous outcomes changes to
$$\delta = \left(t_\beta + t_{\frac{\alpha}{2}}\right) \sqrt{2 \biggl( \frac{m \sigma^2_c + \sigma^2_p}{m k} \biggr)},$$
where there are $$m$$ individuals in every cluster and $$k$$ clusters in each treatment arm.
-It is straightforward to adjust these calculations to allow for varying clustersizes.
-Usually this is done by providing the coefficient of variation in clustersize across all clusters.
-A large difference in clustersize generally decreases power, or, keeping power constant, increases the needed sample size to detect an effect.
+It is straightforward to adjust these calculations to allow for varying cluster sizes.
+This is usually done by providing the coefficient of variation in cluster size across all clusters.
+A large difference in cluster sizes generally decreases power, or, keeping power constant, increases the necessary sample size to detect an effect.
-In a similar fashion, the required sample size per treatment arm - assuming equal allocation - $$n^* = m^* \ k^*$$ becomes:
+In a similar fashion, the required sample size per treatment arm -- assuming equal allocation -- $$n^* = m^* \ k^*$$ becomes
$$n^* = 2\left(t_\beta + t_{\frac{\alpha}{2}}\right)^2\frac{\sigma^2}{\delta^2} \ (1 + (m-1) \rho).$$
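As a rough illustration of the formula above, the following sketch computes the required sample size per arm for a range of ICC values. It assumes standard normal quantiles as a stand-in for the $$t$$ quantiles (a common large-sample approximation), and the function name `clustered_n_per_arm` as well as the example inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def clustered_n_per_arm(delta, sigma, m, rho, alpha=0.05, power=0.80):
    """Individuals needed per arm under clustered assignment.

    Assumption: normal quantiles approximate the t quantiles in the formula.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    n_individual = 2 * (z_beta + z_alpha) ** 2 * sigma ** 2 / delta ** 2
    design_effect = 1 + (m - 1) * rho   # inflation caused by clustering
    return n_individual * design_effect


# Hypothetical inputs: effect of 0.2 standard deviations, 20 units per cluster.
for rho in (0.0, 0.05, 0.10, 0.20):
    n = clustered_n_per_arm(delta=0.2, sigma=1.0, m=20, rho=rho)
    print(f"rho={rho:.2f}: n per arm = {ceil(n)}, clusters per arm = {ceil(n / 20)}")
```

Looping over several values of `rho` mirrors the advice to report a range of required sample sizes rather than a single number when the ICC is uncertain.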
-### Binary
+### Binary outcomes
-For simplicity we abstract from differences in cluster sizes here again, but accommodating this in the formulas is quite straightforward (a high variation would further reduce power).
+For simplicity, we abstract from differences in cluster sizes here again, but accommodating this in the formulas is quite straightforward (a high variation would further reduce power).
$$ N^{*}=\left(\frac{p_{1}\left(1-p_{1}\right)}{\pi}+\frac{p_{0}\left(1-p_{0}\right)}{1-\pi}\right) \frac{\left(z_{\beta}+z_{\alpha / 2}\right)^{2}}{\left(p_{1}-p_{0}\right)^{2}} \ (1 + (m-1) \rho).$$
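The binary formula can be evaluated in the same way. This is a minimal sketch under stated assumptions: the function name `binary_total_n` and the example rates are hypothetical, and the arguments simply mirror the symbols $$p_0$$, $$p_1$$, $$\pi$$, $$m$$ and $$\rho$$.

```python
from statistics import NormalDist


def binary_total_n(p0, p1, pi=0.5, m=1, rho=0.0, alpha=0.05, power=0.80):
    """Total sample size N* for a binary outcome, inflated by the design effect."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # Bernoulli variances of the two arms, weighted by the allocation shares
    variance_term = p1 * (1 - p1) / pi + p0 * (1 - p0) / (1 - pi)
    n_individual = variance_term * (z_beta + z_alpha) ** 2 / (p1 - p0) ** 2
    return n_individual * (1 + (m - 1) * rho)


# Hypothetical example: base rate 0.5, treated rate 0.4, equal allocation.
print(binary_total_n(0.5, 0.4))                  # individual randomization
print(binary_total_n(0.5, 0.4, m=30, rho=0.05))  # clusters of 30, ICC of 0.05
```

Setting `m=1` or `rho=0` makes the design effect equal to one, so the function returns the individual-randomization sample size.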
@@ -62,3 +63,4 @@ It is easy to see that the cluster randomization formulas nest the individual fo
Experiment with the [companion Shiny app](https://lehner.shinyapps.io/dil_power/) to see how detrimental ICC is for power.
+[^1]: See the [guide on ICC](https://github.com/DevInnovationLab/internal-resources/blob/main/power/topics/ICC.Rmd) for further details (internal access only).
\ No newline at end of file
diff --git a/guides/hyp-testing-recap.md b/guides/power/hyp-testing-recap.md
similarity index 66%
rename from guides/hyp-testing-recap.md
rename to guides/power/hyp-testing-recap.md
index 3cf9220..45452d2 100644
--- a/guides/hyp-testing-recap.md
+++ b/guides/power/hyp-testing-recap.md
@@ -1,24 +1,24 @@
---
layout: default
-title: A recap of hypoyhesis testing
+title: A recap of hypothesis testing
grand_parent: Guides
parent: Power calculations
-nav_order: 5
+nav_order: 3
---
-# Recapping Null Hypothesis Significance Testing (NHST)
+## A recap of Null Hypothesis Significance Testing (NHST)
-Estimating whether an intervention - e.g. the treatment of an RCT - had an effect on the outcome is challenging due to sampling variation: the sample average of the treatment group will most of the time look different than the one of the control group due to chance alone.
+**Estimating whether an intervention -- e.g. the treatment of an RCT -- had an effect on the outcome is challenging due to sampling variation: the sample average of the treatment group will most of the time look different than the one of the control group due to chance alone.**
-It is our task as researchers to assess whether this difference is large enough in order to conclude that it is due to actual differences at the population level - or whether it is too small and therefore most likely due to sampling noise alone.
+It is our task as researchers to assess whether this difference is large enough in order to conclude that it is due to actual differences at the population level -- or whether it is too small and therefore most likely due to sampling noise alone.
That's where (frequentist) null hypothesis significance testing (NHST) enters.
We always start by stating the null hypothesis, $$H_0$$, which says that the effect was zero.
The alternative hypothesis, $$H_1$$, states that the effect was $$\delta$$ - the difference of the outcome variable between treatment and control group.
-When conducting such a test, we can make two possible errors.
-Rejecting a true null hypothesis, that is, to conclude that the intervention was effective when it was not (Type I error or "false positive").
-We could also conclude that the intervention had no effect when in reality such an effect exists, i.e. fail to reject the null hypothesis when it is false (Type II error or "false negative").
+**When conducting such a test, we can make two possible errors.**
+**Rejecting a true null hypothesis**, that is, to conclude that the intervention was effective when it was not (Type I error or "false positive").
+We could also conclude that the intervention had no effect when in reality such an effect exists, i.e. **fail to reject the null hypothesis when it is false** (Type II error or "false negative").
| | True Hypothesis (H0) | True Hypothesis (H1) |
|------------------------|:---------------------------:|:---------------------------:|
@@ -28,8 +28,8 @@ We could also conclude that the intervention had no effect when in reality such
| Test Decision (H1) | **Type I Error** [$$\alpha$$] | Correct Decision |
| *reject null* | (False Positive, FP) | (True Positive, TP) |
-When carried out on real-world data, we will never be able to know whether a Type I or Type II error is being committed.
-We can, however, design our study as to control the probability of committing each type of error.
+**When carried out on real-world data, we will never be able to know whether a Type I or Type II error is being committed.**
+**We can, however, design our study so as to control the probability of committing each type of error.**
First, we have to define a significance level, $$\alpha$$.
This is conventionally set to 0.05 for a two-sided test and represents the probability of committing a Type I error ($$P[reject\ H_0\ |\ H_0\ true]$$).
@@ -37,7 +37,7 @@ This means that when the null is true, we won't reject it in 95% of cases.
Second, we have to make a choice on the probability of a Type II error, $$\beta$$ (not to be confused with the regression parameter below - notation was kept for consistency with most stats textbooks).
It refers to the chance of finding no treatment effect when one exists in reality ($$P[fail\ reject\ H_0\ |\ H_1\ true]$$).
The most common value for $$\beta$$ is 0.2.
-Consequently, $$1-\beta$$ is the chance of finding an effect when it exists, i.e. the probability that a study has of uncovering a true effect.
+**Consequently, $$1-\beta$$ is the chance of finding an effect when it exists, i.e. the probability that a study has of uncovering a true effect.**
**This is what we also call the (statistical) power of a test**.
Of course we would like power to be as high as possible, but usually a power of 80% (i.e. $$1-\beta = 0.8$$) is considered high enough.
Defining the two hypotheses, the desired significance level, and the power of the test, allows us to simultaneously control the likelihood of committing either a Type I or a Type II error.
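To make the two error rates concrete, the sketch below simulates many hypothetical experiments and counts how often the null is rejected, once when the true effect is zero and once when it is not. It is an illustration only: it assumes normally distributed outcomes with a known standard deviation (so it uses a z-test rather than a t-test), and the function name `rejection_rate`, the sample size, and the effect size are made up for the example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(delta, sigma=1.0, n=64, alpha=0.05, n_sims=2000, seed=1):
    """Share of simulated two-sample z-tests that reject H0 (sigma assumed known)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n)  # standard error of the difference in means
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sigma) for _ in range(n)]
        treated = [rng.gauss(delta, sigma) for _ in range(n)]
        z_stat = (fmean(treated) - fmean(control)) / se
        rejections += abs(z_stat) > z_crit
    return rejections / n_sims


# With no true effect, the rejection rate estimates the Type I error rate (alpha).
print("Type I error rate:", rejection_rate(delta=0.0))
# With a true effect of 0.5 standard deviations, it estimates power (1 - beta).
print("Power:", rejection_rate(delta=0.5))
```

In this setup the first rate should land close to the chosen $$\alpha$$, while the second depends on the effect size, the standard deviation, and the sample size.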
@@ -45,6 +45,6 @@ By looking at the hypothetical distributions of the test statistic under the nul
-The [companion Shiny app](https://lehner.shinyapps.io/dil_power/) illustrates the positive role a large effect size and a low deviation of the outcome have on power.
+This guide's [companion Shiny app](https://lehner.shinyapps.io/dil_power/) illustrates the positive role a large effect size and a low standard deviation of the outcome have on power.
diff --git a/guides/power/img/avatar.svg b/guides/power/img/avatar.svg
new file mode 100644
index 0000000..084c985
--- /dev/null
+++ b/guides/power/img/avatar.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/guides/power/img/logo.svg b/guides/power/img/logo.svg
new file mode 100644
index 0000000..9ce4df4
--- /dev/null
+++ b/guides/power/img/logo.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/guides/individual-randomization.md b/guides/power/individual-randomization.md
similarity index 59%
rename from guides/individual-randomization.md
rename to guides/power/individual-randomization.md
index abb7dd9..d9f5616 100644
--- a/guides/individual-randomization.md
+++ b/guides/power/individual-randomization.md
@@ -3,45 +3,51 @@ layout: default
title: Individual randomization
grand_parent: Guides
parent: Power calculations
-nav_order: 11
+nav_order: 4
---
## Individual Level Randomization
-
-We will look at both continuous and binary outcomes when the treatment is assigned to individuals.
+Individual-level randomization occurs when the treatment is assigned to individuals.
+This section discusses individual-level randomization with continuous and binary outcomes.
In the next section we will then look at these two cases when the treatment is assigned at a cluster or group level.
-Everything else equal, this will lead to a decrease in power due to the non-independence of units within each group.
-We therefore generally would prefer to design a study that allows us to randomize at the individual level.
+Everything else equal, clustered randomization leads to a decrease in power compared to individual randomization due to the non-independence of units within each group.
+**We therefore generally would prefer to design a study that allows us to randomize at the individual level.**
### Continuous Outcomes
In order to test whether an RCT with simple randomization had a significant effect, in theory one does not even need a regression.
A simple difference in means by treatment status is unbiased for the average effect.
-Since we want to introduce more sophisticated designs later in this short article, we introduce regression notation from the outset.
In order to obtain an average treatment effect, we just run the following univariate regression:
$$Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i,$$
where $$D$$ is an indicator variable that indicates whether individual $$i$$ received the treatment or not.
$$\varepsilon$$ is an iid error term.
-We are then interested in the hypothesis test of whether $$\beta_1$$ is different from zero.
+
+We are interested in the hypothesis test of whether $$\beta_1$$ is different from zero,
-Writing down this hypothesis test formally, it is straightforward to derive a closed-form solution for the minimum detectable effect (MDE), given a sample size, and the required sample size, $$N^*$$, to detect a given difference in means, $$\delta$$.
-We are interested in rejecting the null hypothesis $$H_0: \beta_1 = 0$$ in order show a significant effect of a treatment.
+where rejecting the null hypothesis $$H_0: \beta_1 = 0$$ would show a significant effect of a treatment.
+That is, we would like to know the probability that our study has of uncovering a true effect when estimating $$\beta_1$$.
+Writing down this hypothesis test formally, it is straightforward to derive a closed-form solution for the Minimum Detectable Effect (MDE), given a sample size, and the required sample size, $$N^*$$, to detect a difference in means $$\delta$$ given the standard deviation of the outcome.
+
+The standard deviation of the outcome is usually unknown and hence needs an estimation -- or a best guess.
+Estimating the standard deviation of the outcome can be difficult, especially if there is little prior work on the study's subject and context.
+However, such an estimate is necessary to conduct power calculations, and data collection is often the only good way to get a credible one.[^1]
-The standard deviation of the outcome is usually unknown and hence needs an estimation - or a best guess.
-Sometimes this can be difficult.
We can either assume that the variance of the outcome is homogeneous across the treated and control group, and the standard deviation is $$\sigma$$, or we can assume that the variance is different for the two groups, and the standard deviations are $$\sigma_0, \sigma_1$$ respectively.
We only focus on the first case here.
-The MDE that can be detected with $$1 − \beta$$ power at significance level $$\alpha$$ is
+
+The Minimum Detectable Effect that can be detected with $$1 − \beta$$ power at significance level $$\alpha$$ is
$$\delta = \left(t_\beta + t_{\frac{\alpha}{2}}\right)\sigma\sqrt{\frac{1}{n_0}+\frac{1}{n_1}},$$
where $$n_0, n_1$$ are the sample size in the treatment and control group respectively.
Assuming that the sample size in the treatment and control groups are the same, $$n_0=n_1=n^*$$, the sample size needed for each group is $$n^* = 2\left(t_\beta + t_{\frac{\alpha}{2}}\right)^2\frac{\sigma^2}{\delta^2},$$ implying that the total sample size needed is $$N^* = 2n^*$$.
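The two closed-form expressions above are inverses of each other, which the following sketch checks numerically. It assumes standard normal quantiles in place of the $$t$$ quantiles (so results will differ slightly from `power.t.test()` in small samples), and the function names `mde` and `n_per_group` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def _quantile_sum(alpha, power):
    """Sum of the two quantiles, using the normal approximation (an assumption)."""
    z = NormalDist()
    return z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)


def mde(sigma, n0, n1, alpha=0.05, power=0.80):
    """Minimum detectable effect for given group sizes n0 and n1."""
    return _quantile_sum(alpha, power) * sigma * sqrt(1 / n0 + 1 / n1)


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Sample size per group needed to detect delta under equal allocation."""
    return 2 * _quantile_sum(alpha, power) ** 2 * sigma ** 2 / delta ** 2


# Hypothetical example: detect 0.25 standard deviations.
n_star = n_per_group(delta=0.25, sigma=1.0)
print("n per group:", n_star, "total:", 2 * n_star)
print("MDE at that sample size:", mde(1.0, n_star, n_star))
# Unequal allocation with the same total sample yields a larger MDE.
print("MDE with a 25/75 split:", mde(1.0, 0.5 * n_star, 1.5 * n_star))
```

Plugging the computed `n_star` back into `mde` recovers the effect size the calculation started from, and the last line shows why equal allocation is the natural default when both arms cost the same.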
-
+
+
+
@@ -49,7 +55,7 @@ Assuming that the sample size in the treatment and control groups are the same,
-In practice we will estimate MDE or sample size with the function `power.t.test()` (or alternatively with the corresponding function from the package `pwr`).
+**In practice we will estimate MDE or sample size with R's function `power.t.test()` (or alternatively with the corresponding function from the package `pwr`).**
A good discussion of Stata equivalents can be found [in the DIME Wiki](https://dimewiki.worldbank.org/Power_Calculations_in_Stata).
In all of the above we implicitly assumed that the treatment has only an effect on the mean - which is by far the most common assumption and usually what we are interested in.
@@ -58,9 +64,7 @@ In all of the above we implicitly assumed that the treatment has only an effect
### Binary Outcomes
-For continuous outcomes, we do have to come up with an estimate for the standard deviation of the outcome variable.
-Sometimes this can be difficult if little prior work exists and often a pilot is the only good way to get a good sense for how much the variable of interest is going to deviate.
-Binary data is a bit more user-friendly, as the standard deviation is implied by the Bernoulli distribution, $$p*(1-p)$$, with the highest value at a base rate of 0.5.
+When the outcome variable is binary, power calculations are more straightforward, as the variance is implied by the Bernoulli distribution, $$p(1-p)$$, with its highest value at a base rate of 0.5.
To obtain the required sample size, we can measure the effect size in terms of differences in the probability of success:
@@ -69,6 +73,7 @@ $$ N^{*}=\left(\frac{p_{1}\left(1-p_{1}\right)}{\pi}+\frac{p_{0}\left(1-p_{0}\ri
where $$p_{0}$$ is the probability of success or an event rate, and $$p_{1} = p_0 \times (1-MDE)$$.
$$\pi$$ represents the fraction of the sample that is treated and $$z$$ refers to the respective values from the standard normal distribution.
-In practice, like in the continuous case, we can also use the built-in function `power.prop.test()`.
+**In practice, like in the continuous case, we can also use the built-in function `power.prop.test()`.**
+[^1]: It is common for studies to rely on survey pilots for an estimate of the standard deviation of the outcome variable. When doing so, researchers should take into account that pilot samples are usually small, so estimates of data moments are noisy by construction.
\ No newline at end of file
diff --git a/guides/power/resources-references.md b/guides/power/resources-references.md
new file mode 100644
index 0000000..ccbd28b
--- /dev/null
+++ b/guides/power/resources-references.md
@@ -0,0 +1,30 @@
+---
+layout: default
+title: Resources and references
+grand_parent: Guides
+parent: Power calculations
+nav_order: 99
+---
+
+## Resources and references for power calculations
+
+### Useful Links
+
+- [EGAP's 10 Things to Know About Statistical Power](https://egap.org/resource/10-things-to-know-about-statistical-power/)
+- [JPAL's research resources page on power calculations](https://www.povertyactionlab.org/resource/power-calculations)
+- [DIME Wiki page on power calculations](https://dimewiki.worldbank.org/Power_Calculations)
+- [Rachel Glennerster's recorded lecture "Sampling and Sample Size"](https://www.youtube.com/watch?v=aNbabnONlY4): a non-technical introduction
+
+### Further Reading
+
+- The classic reference for randomization studies in economics is [Duflo et al. (2007)](https://doi.org/10.1016/S1573-4471(07)04061-2).
+- [Athey & Imbens (2017)](https://doi.org/10.1016/bs.hefe.2016.10.003) offer a slightly more technical and up-to-date treatment.
+- A classic reference on statistical power is [Murphy et al. (2014)](https://www.routledge.com/Statistical-Power-Analysis-A-Simple-and-General-Model-for-Traditional-and/Myors-Murphy/p/book/9781032283005).
+- Derivations for the formulas used can be found for instance in [McConnell & Vera-Hernandez (2015)](https://ifs.org.uk/publications/going-beyond-simple-sample-size-calculations-practitioners-guide).
+- See [Czibor et al. (2019)](https://onlinelibrary.wiley.com/doi/10.1002/soej.12392) for a short summary paper with many useful suggestions for field experiments in general.
+
+### Other recommended material
+
+- [Schochet (2013)](https://doi.org/10.1080/19345747.2012.725803): power for binary outcomes
+- [Gelman, Hill, Vehtari (2020)](https://doi.org/10.1017/9781139161879): Regression and Other Stories (esp. Chapter 16)
+- [Gelman & Carlin (2014)](https://doi.org/10.1177/1745691614551642): for a good discussion on post-hoc power
\ No newline at end of file
diff --git a/guides/power/standards.md b/guides/power/standards.md
new file mode 100644
index 0000000..b4a12fd
--- /dev/null
+++ b/guides/power/standards.md
@@ -0,0 +1,49 @@
+---
+layout: default
+title: DIL standards
+grand_parent: Guides
+parent: Power calculations
+nav_order: 1
+---
+
+## DIL standards
+
+**It is crucial that every project has a stand-alone file - ideally a notebook combining code and text - that showcases power calculations.**
+Over its lifecycle, different RAs, RPs, and PIs will contribute to it.
+It is thus essential that [coding principles are strictly followed](https://devinnovationlab.github.io/guides/principles/writing-code.html) as in any other piece of work.
+Such a document will also help the team think clearly about research design and how to best communicate it at a very early stage.
+
+This document should have two key features:
+
+- **The document must link to the folder or repository where the code is stored** so the reader can always access the latest version of the document. Ideally also push these reports to the GitHub repository of the respective project.
+- **Every number mentioned in this report must be easily reproducible.** If there is no replication code, then the source of every number that was plugged into a formula must be stated clearly.
+
+You can find some example notebooks in the [`power_calc/examples/` folder of this repository's main branch](https://github.com/DevInnovationLab/guides/tree/main).
+
+### 1. Start with a quick abstract summary of descriptive statistics
+
+The first task when performing power calculations is to gather intelligence on the three most fundamental pieces of information:
+
+- **What is the expected mean of our variable of interest? How much do values deviate from this mean in reality?** Your document should start by indicating the mean and standard deviation of outcomes.
+- **What are the cluster sizes and estimates for intra-cluster correlation?** In case we want (or have) to assign treatment at a cluster or group level, we also need to gather information on the degree of similarity within each cluster or group.
+This descriptive statistic is often referred to as the Intra-cluster Correlation Coefficient (ICC).
+- **Which effect size can we expect roughly from our intervention?** To answer this question, refer to the relevant literature.
+
+The relevant pieces of information can be gathered through a review of similar studies that were carried out in the past, the analysis of surveys and/or administrative data, or, ideally, a combination of the two. It is not advised to rely on pilot studies for power calculations due to small sample size problems.
+
+### 2. Showcase simple power calculations from plug-in formulas
+
+
+
+
+
+As a first step we always use basic plug-in formulas to get a rough sense for the needed sample size. This serves as a benchmark that will be refined in the next step.
+
+### 3. Incorporate more sophisticated designs
+
+Using the plug-in formulas as a starting point, see how and why power increases when you refine the calculations by
+
+- **Controlling for important covariates/predictors**. They usually soak up some of the variation in the variable of interest, reducing variance in the residuals of the regression and thus lowering the required sample size for any specified level of precision or power.
+- **Using stratification/blocking**. These usually achieve similar gains in reducing variation.
+
+For more complex settings, this last step is often done using simulation.
diff --git a/guides/power/what-and-why.md b/guides/power/what-and-why.md
new file mode 100644
index 0000000..d1e2e83
--- /dev/null
+++ b/guides/power/what-and-why.md
@@ -0,0 +1,52 @@
+---
+layout: default
+title: What is statistical power and why does it matter
+grand_parent: Guides
+parent: Power calculations
+nav_order: 2
+---
+
+## What is statistical power and why does it matter
+
+### Defining statistical power
+
+> Statistical power is defined as the probability, before a study is performed, that a particular comparison will achieve "statistical significance" at some predetermined level (typically a p-value below 0.05), given some assumed true effect size.
+> A power analysis is performed by first hypothesizing an effect size, then making some assumptions about the variation in the data and the sample size of the study to be conducted, and finally using probability calculations to determine the chance of the p-value being below the threshold.
+
+[Gelman, Hill, Vehtari (2020) Regression and Other Stories, Chapter 16]
+
+First, it has to be clarified that null hypothesis significance testing (NHST) - and thus also statistical power - is a frequentist concept only[^1].
+The question we want to answer is: **how large a sample do we have to collect in order to be reasonably sure of meeting the study objectives**?
+Power calculations are not just useful for randomized evaluations, however, but also worth doing before we carry out studies on *purely observational data*.
+Think of a scenario where an intervention or event has already happened, and we need to figure out whether it is worth the time and expense of collecting or trying to get access to data.
+Power calculations are sometimes even worth doing on published/finished research.
+
+
+
+
+### A useful Heuristic: "Sixteen S-squared over D-squared"
+
+A rough approximation introduced by Lehr says that **for a two sample t-test with power 80% and significance level 5%, the required sample size per group can be approximated by $$n \approx 16 \frac{\sigma^2}{\delta^2}$$,** where $$\sigma$$ is the standard deviation of the variable of interest and $$\delta$$ the difference in means that we want to detect.
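The constant in this heuristic can be traced back to the usual closed-form sample size formula, $$n = 2\left(z_{\beta} + z_{\alpha/2}\right)^2 \sigma^2 / \delta^2$$ per group. The short check below assumes normal quantiles for a two-sided 5% test with 80% power; the variable names are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()
z_power = z.inv_cdf(0.80)   # quantile for 80% power
z_alpha = z.inv_cdf(0.975)  # two-sided 5% significance level

# Multiplier on sigma^2 / delta^2 in the per-group sample size formula
constant = 2 * (z_power + z_alpha) ** 2
print(round(constant, 2))
```

The multiplier comes out at roughly 15.7, which the heuristic rounds up to 16.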
+
+### Why underpowered studies are bad
+
+
+Inference from low-powered studies is problematic for two main reasons.
+First, by definition, adequate power is required to ensure that studies have a high likelihood of detecting a genuine effect. **Low power implies high rates of false negatives whereby the null hypothesis of "no effect" is not rejected, despite being false.**
+This can arguably slow the development of scientific knowledge: because false negative results are less likely to be followed up than false positives, self-correction is less likely to occur in these cases.
+
+Second, low power threatens the credibility of research findings through effect inflation because **in settings where standard errors are large, only those findings that by chance overestimate the magnitude of the effect will appear statistically significant and thus pass the threshold for discovery**. Effect inflation is therefore more severe in underpowered studies that are based on small samples in the presence of high measurement error. This problem is also known as the winner's curse or referred to as a Type M error (a term coined by Andrew Gelman). It is well-known that conditional on NHST, research findings from non-experimental data suffer from an upward bias because of a "file drawer problem" of non-significant studies that are not being published.
+
+
+
+
+
+
+
+
+**Keep in mind the following: a corollary from everything we covered so far is that if a study has a small sample size, it necessarily has to find a large effect in order to obtain a significant result (according to our frequentist hypothesis testing standards).**
+
+[^1]: A Bayesian would think about power differently. In essence, instead of focusing on frequencies of rejecting a null hypothesis when an alternative is true, a Bayesian would consider the posterior distribution of a parameter given the data. The strength of evidence for or against particular parameter values is expressed in terms of probability distributions. To analogously assess "power," a Bayesian might consider the probability that a future observation or set of observations will fall within a particular region, given the current posterior distribution.
+
+
+
diff --git a/guides/resources-references.md b/guides/resources-references.md
deleted file mode 100644
index 83ced7c..0000000
--- a/guides/resources-references.md
+++ /dev/null
@@ -1,30 +0,0 @@
----
-layout: default
-title: Resources and references
-grand_parent: Coding guides
-parent: Power calculations
-nav_order: 99
----
-
-## Resources and references for power calculations
-
-### Useful Links
-
-- [EGAP's 10 Things to Know About Statistical Power](https://egap.org/resource/10-things-to-know-about-statistical-power/)
-- [JPAL's research resources page on power calculations](https://www.povertyactionlab.org/resource/power-calculations)
-- [DIME Wiki page on power calculations](https://dimewiki.worldbank.org/Power_Calculations)
-- [Rachel Glennerster's recorded lecture "Sampling and Sample Size"](https://www.youtube.com/watch?v=aNbabnONlY4): a non-technical introduction
-
-### Further Reading
-
-- The classic reference for randomization studies in economics is [Duflo et al. (2007)](#DufloGlennersterKremer2007).
-- [Athey & Imbens (2017)](#AtheyImbens2017) offer a slightly more technical and up-to-date treatment.
-- A classic reference on statistical power is [Murpy et al. (2014)](#MurphyMyorsWolach2014).
-- Derivations for the formulas used can be found for instance in [McConnell & Vera-Hernandez (2015)](#McConnellVera-Hernandez2015).
-- See [Czibor et al. (2019)](#CziborJimenez-GomezList2019) for a short summary paper with many useful suggestions for field experiments in general.
-
-### Other recommended material
-
-- [Schochet (2013)](#Schochet2013): power for binary outcomes
-- [Gelman, Hill, Vehktari (2020)](#GelmanHillVehtari2020): Regression and Other Stories (esp. Chapter 16)
-- [Gelman & Carlin (2014)](#GelmanCarlin2014): for a good discussion on post-hoc power