Sampling for Impact Evaluation
Nazmul Chaudhury World Bank
Dakar, Senegal Wednesday, October 2, 2013
Key takeaways
Sampling describes the process to draw a sample of units from a population to estimate the characteristics of that population
Larger samples give more precise estimates of the population characteristics
Impact evaluation requires estimating the difference in outcomes between two groups (treatment and comparison)
Small samples create risks of drawing incorrect policy conclusions
Power calculations tell us how large samples need to be. Larger samples are needed to precisely estimate impacts if we expect the impacts to be small, or if the program creates clusters,…
Drawing a sample from a population
Population of interest
Sampling describes the process to draw a sample of units from a population to estimate its characteristics.
Sample
Infer characteristics of the population based on the sample (e.g. average height of children aged 2 in Nigeria)
How do you draw a sample? In practice…
Define the population of interest:
• All children aged 0-24 months in Nigeria?
• All children aged 0-24 months who visited a health center in the last month?
Define a sampling frame: the most comprehensive list that can be obtained of units in the population of interest.
Define a sampling procedure (i.e. how to draw the sample from the population):
• Probability sampling procedures assign a defined probability for each unit to be drawn (this ensures the sample is representative): e.g. random sampling
• Avoid convenience sampling
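The sampling steps above can be sketched in a few lines of Python. This is a minimal illustration of simple random sampling, not a procedure taken from the presentation: the sampling frame (a list of made-up child IDs), the sample size, and the function name are all hypothetical.

```python
import random

def draw_simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw units without replacement; every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(sampling_frame, sample_size)

# Hypothetical sampling frame: IDs for 10,000 children in the population of interest.
frame = [f"child_{i:05d}" for i in range(10_000)]
sample = draw_simple_random_sample(frame, sample_size=500, seed=42)

# Under simple random sampling, each unit's selection probability is n / N.
selection_probability = len(sample) / len(frame)
print(len(sample), selection_probability)
```

A convenience sample (for example, the first 500 names on the list) would not give every unit a defined probability of being drawn.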
Random sampling is not sufficient for IE
What if we draw a random sample from two different groups?
[Figure: non-participants and participants in a program]
Drawing a random sample from two groups does not make them comparable.
Randomized assignment
When does randomized assignment produce comparable groups?
[Figure: two panels, each labeled "Comparison"]
To achieve balance, randomization needs to be performed on a sufficient number of eligible units.
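A small simulation can illustrate why balance needs enough units. The sketch below randomly splits units into two equal groups and measures how far apart their average baseline characteristic ends up. The baseline variable (height in cm, mean 85, standard deviation 5) and the function names are assumptions made for illustration.

```python
import random
import statistics

def baseline_gap(n_units, seed):
    """Randomly split n_units into two equal groups and return the absolute
    difference in mean baseline height between them."""
    rng = random.Random(seed)
    # Hypothetical baseline characteristic: height in cm, mean 85, sd 5.
    heights = [rng.gauss(85, 5) for _ in range(n_units)]
    rng.shuffle(heights)  # randomized assignment
    half = n_units // 2
    treatment, comparison = heights[:half], heights[half:]
    return abs(statistics.mean(treatment) - statistics.mean(comparison))

def average_gap(n_units, repetitions=200):
    """Average imbalance over many independent randomizations."""
    return statistics.mean(baseline_gap(n_units, seed) for seed in range(repetitions))

gaps = {n: average_gap(n) for n in (10, 100, 1000)}
for n, gap in gaps.items():
    print(n, round(gap, 2))
```

The average gap between the two groups shrinks as the number of randomized units grows, which is what makes the groups comparable.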
How to build large enough samples for impact evaluation?
How many people/facilities/units should be in an evaluation sample?
Determining an adequate sample size is essential
Important trade-offs between cost and reliability of findings.
Power calculations help make decisions on sample size
Impact evaluation is about measuring differences in outcomes between groups
Does a randomized nutrition program improve nutrition outcomes among young children?
The impact of the program is the difference between nutrition status in the treatment and control group.
How do we estimate impact?
Step 1: Estimation of outcomes in treatment group
Step 2: Estimation of outcomes in comparison group
Step 3: Estimation of difference in outcomes between the two groups, and test whether it is statistically different from 0.
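The three estimation steps can be sketched as follows. This is one simple way to do it, assuming a plain difference in means with a large-sample normal approximation for the test; the presentation does not specify an estimator. The data are simulated and hypothetical (height-for-age z-scores with a true impact of 0.3).

```python
import math
import random
from statistics import NormalDist, mean, variance

def estimate_impact(treatment, comparison):
    """Return the difference in means, its standard error, and a two-sided
    p-value from a large-sample normal approximation."""
    mean_t = mean(treatment)    # Step 1: outcome in the treatment group
    mean_c = mean(comparison)   # Step 2: outcome in the comparison group
    impact = mean_t - mean_c    # Step 3: difference between the two groups
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(comparison) / len(comparison))
    z = impact / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return impact, se, p_value

# Hypothetical data: 400 children per group, true impact of 0.3.
rng = random.Random(0)
comparison = [rng.gauss(-1.5, 1.0) for _ in range(400)]
treatment = [rng.gauss(-1.2, 1.0) for _ in range(400)]
impact, se, p_value = estimate_impact(treatment, comparison)
print(f"impact={impact:.2f}, se={se:.3f}, p={p_value:.4f}")
```

The standard error is what the sample size controls: the smaller it is relative to the impact, the easier it is to conclude that the difference is not 0.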
How big a sample do I need for the comparison and treatment groups?
Larger samples are more accurate
Think of the sample size as the accuracy of our measuring device. The more observations you have:
• The more precise is your “measuring device”
• The more confident you are about the conclusions of your evaluation
Example: guess the sentence below
What if we increase the number of “observations”?
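The "measuring device" idea can be checked numerically. The sketch below draws many samples of a given size and reports how much their means vary from one sample to the next. The outcome distribution (mean 85, standard deviation 5) and the function name are illustrative assumptions.

```python
import random
import statistics

def spread_of_sample_means(sample_size, repetitions=500, seed=1):
    """Draw many samples of the given size and return the standard deviation
    of their means: a direct measure of how precise one sample is."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.gauss(85, 5) for _ in range(sample_size))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

spreads = {n: spread_of_sample_means(n) for n in (25, 100, 400)}
for n, spread in spreads.items():
    print(n, round(spread, 2))
```

For an outcome with standard deviation 5, theory gives a spread of 5/√n, so quadrupling the sample size halves the imprecision.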
Small samples create risks for policy decisions
Assume that a program has a positive impact on beneficiaries:
• If the evaluation sample is too small, you might not be able to detect this positive impact.
• “Type 2 error”: The risk of failing to conclude that your program has an impact even when it does.
• This could lead to policy decisions to eliminate the program, which would be detrimental to beneficiaries and society.
An impact evaluation is powerful if there is a low risk of not detecting real program impacts, that is, of committing a type 2 error.
“Type 1 error”: The risk of concluding that your program has an impact when the observed difference is purely coincidental.
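Both types of error can be made concrete by simulating many evaluations. The sketch below is a rough illustration under stated assumptions: outcomes are normal with standard deviation 1, the true impact is 0.3, and significance is judged with a large-sample z-test at the 5% level (an approximation that is crude for the smaller sample). All names and values are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance

def detects_impact(n_per_group, true_impact, rng, alpha=0.05):
    """Simulate one evaluation; return True if the estimated impact is significant."""
    comparison = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treatment = [rng.gauss(true_impact, 1.0) for _ in range(n_per_group)]
    se = math.sqrt(variance(treatment) / n_per_group
                   + variance(comparison) / n_per_group)
    z = (mean(treatment) - mean(comparison)) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def rejection_rate(n_per_group, true_impact, repetitions=500, seed=0):
    """Share of simulated evaluations that find a significant impact."""
    rng = random.Random(seed)
    hits = sum(detects_impact(n_per_group, true_impact, rng) for _ in range(repetitions))
    return hits / repetitions

# With a real impact, the rejection rate is the power (1 - type 2 error risk).
power_small = rejection_rate(30, true_impact=0.3)
power_large = rejection_rate(300, true_impact=0.3)
# With no real impact, the rejection rate is the type 1 error risk.
false_positive_rate = rejection_rate(300, true_impact=0.0)
print(power_small, power_large, false_positive_rate)
```

With 30 units per group most simulated evaluations miss the real impact; with 300 per group most detect it.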
How to determine the sample size?
The short response: an ugly formula
$$N = \frac{4\sigma^{2}\,(z_{\alpha/2} + z_{\beta})^{2}}{D^{2}}\,\left[1 + \rho\,(H - 1)\right]$$
Let’s focus on intuition behind power calculations
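The formula can be turned into a small calculator. The slide does not define its symbols, so the reading used here is the conventional one and should be treated as an assumption: N is the total sample size across both groups, D the smallest effect to detect, σ the standard deviation of the outcome, ρ the intra-cluster correlation, H the number of units per cluster, and z the standard normal critical values for the significance level α and for power 1 − β. The function name and default values are hypothetical.

```python
import math
from statistics import NormalDist

def total_sample_size(D, sigma, rho=0.0, H=1, alpha=0.05, power=0.80):
    """Total sample size: N = 4*sigma^2*(z_alpha/2 + z_beta)^2 / D^2 * [1 + rho*(H - 1)]."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type 2 error risk
    n_unclustered = 4 * sigma**2 * (z_alpha + z_beta)**2 / D**2
    clustering_adjustment = 1 + rho * (H - 1)
    return math.ceil(n_unclustered * clustering_adjustment)

# Hypothetical inputs: detect an effect of 0.2 standard deviations.
n_individual = total_sample_size(D=0.2, sigma=1.0)
n_clustered = total_sample_size(D=0.2, sigma=1.0, rho=0.05, H=20)
print(n_individual, n_clustered)
```

Under these assumptions, introducing clusters of 20 units with ρ = 0.05 multiplies the required sample by 1 + 0.05 × 19 = 1.95.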
Intuition behind power calculations
We do not know in advance the effect of our policy. How can we be sure we will be able to measure it?
Precision is not cheap: larger samples cost more.
Core ingredients:
1. What is the minimum impact that would justify the investment in the intervention?
2. How variable is the outcome you are interested in?
3. Does your program create clusters?
1st ingredient: Smallest program effect size that you wish to detect
Fundamental policy question: what is the level of impact below which a program should be considered unsuccessful?
What is the objective of your program?
Decrease stunting rates by 5%, 20%, 50%?
The smaller the (EXPECTED) differences between treatment & control…
… the more precise the instrument has to be to detect them
… the larger the sample needs to be
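The link between sample size and the smallest detectable effect can be read in the other direction: for a fixed sample, how small an effect can the evaluation pick up? The sketch below assumes the conventional two-group calculation with equal group sizes, no clustering, 5% significance and 80% power. The function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist

def minimum_detectable_effect(total_n, sigma, alpha=0.05, power=0.80):
    """Smallest true difference detectable with total_n units split equally
    between treatment and comparison groups (no clustering)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * math.sqrt(4 * sigma**2 / total_n)

mde = {n: minimum_detectable_effect(n, sigma=1.0) for n in (200, 800, 3200)}
for n, effect in mde.items():
    print(n, round(effect, 3))
```

Halving the effect you want to detect requires roughly four times the sample, which is why small expected impacts are expensive to evaluate.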
Who is taller? Detecting smaller differences is harder
The larger the sample, the more precise the measuring device and the easier it is to detect smaller effects.
Increasing sample size ≈ increasing precision (of our measuring tool)
2nd Ingredient: Variance of Outcomes (1)
How does the variance of the outcome affect our ability to detect an impact?
Example: Of the two (circled) populations, which is bigger? How many observations from each circle would you need to decide?
2nd Ingredient: Variance of Outcomes (2)
Example: on average, which group has the larger animals?
• Comparison is more complicated, so you need more information (i.e. a larger sample)
• The answer may depend on which members of the blue & red groups you observe
2nd Ingredient: Variance of Outcomes (3)
In which of these two cases is the impact harder to detect?
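The role of variance can be illustrated with a simulation in the spirit of the "which group is bigger?" examples. Two groups differ by 1 unit on average; the sketch counts how often a sample ranks them correctly when the outcome varies a little or a lot. The means, standard deviations, and sample sizes are hypothetical.

```python
import random
from statistics import mean

def share_correct(sd, n_per_group, repetitions=500, seed=3):
    """Share of simulated samples in which the group with the larger true mean
    also has the larger sample mean."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(repetitions):
        smaller = mean(rng.gauss(10.0, sd) for _ in range(n_per_group))
        larger = mean(rng.gauss(11.0, sd) for _ in range(n_per_group))
        correct += larger > smaller
    return correct / repetitions

# Same true difference (1 unit); only the variance and the sample size change.
low_variance = share_correct(sd=1.0, n_per_group=5)
high_variance = share_correct(sd=10.0, n_per_group=5)
high_variance_big_sample = share_correct(sd=10.0, n_per_group=500)
print(low_variance, high_variance, high_variance_big_sample)
```

With high variance, five observations per group are barely better than a coin flip; a much larger sample is needed to recover the reliability that a small sample gives when variance is low.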
3rd Ingredient: Clustering
Does your program create clusters?
• What is the unit at which outcomes are measured?
• What is the unit at which the program is implemented?
Example of a nutrition program:
• Impact is measured at the level of the individual/child
• But the program is implemented at the level of the village
Challenges with clustering:
• Outcomes for individuals within a cluster are likely to be correlated (intra-cluster correlation)
• Need to adjust the sample: it is more powerful to add 1 observation in a new cluster than 1 observation in an existing cluster
It is the number of clusters that largely determines the ‘useful’ sample size (the number of individuals within clusters matters less)
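The ‘useful’ sample size can be approximated with the standard design effect, 1 + ρ(m − 1), where ρ is the intra-cluster correlation and m the number of individuals per cluster. The sketch below is an illustration with hypothetical numbers (20 villages of 50 children, ρ = 0.10) and hypothetical function names.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from sampling clusters rather than individuals."""
    return 1 + icc * (cluster_size - 1)

def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is 'worth'."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)

# Hypothetical design: 20 villages of 50 children, intra-cluster correlation 0.10.
base = effective_sample_size(20, 50, 0.10)
more_children = effective_sample_size(20, 100, 0.10)  # double children per village
more_villages = effective_sample_size(40, 50, 0.10)   # double the number of villages
print(round(base), round(more_children), round(more_villages))
```

Both alternatives survey 2,000 children, but doubling the number of villages doubles the effective sample size, while doubling the children per village adds under 10 percent.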
Intuition behind power calculation
Other factors:
1. Multiple evaluation questions/treatment groups
2. Comparison of impacts between sub-groups
3. Take-up
4. Data quality
5. Statistical parameters (level of confidence, power,…)
6. Choice of impact evaluation method
The more questions, the larger the sample…
What if you are interested in two impact evaluation questions?
• Does the nutrition program have an impact?
• Should the nutrition program be complemented by an information campaign?
Impact evaluation will have 3 groups (multiple treatment arms):
• Control group (group C)
• Group receiving nutrition program only (group T1)
• Group receiving nutrition program + information campaign (group T2)
A larger sample is needed to make precise comparisons between the groups.
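A rough sketch of how a third arm drives up the sample, under simplifying assumptions that are not from the presentation: equal-sized arms, no clustering, no adjustment for multiple comparisons, and the conventional two-group calculation for each pairwise comparison. The effect sizes (in standard deviations) are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect, sigma=1.0, alpha=0.05, power=0.80):
    """Units needed in each of two groups to detect a given difference in means."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma**2 * z**2 / effect**2)

# Hypothetical smallest differences of interest, in standard deviations.
pairwise_effects = {
    ("C", "T1"): 0.30,   # nutrition program vs. control
    ("T1", "T2"): 0.15,  # added value of the information campaign
}
# With equal-sized arms, the hardest comparison sets the size of every arm.
arm_size = max(n_per_group(d) for d in pairwise_effects.values())
total_sample = 3 * arm_size
two_arm_total = 2 * n_per_group(pairwise_effects[("C", "T1")])
print(arm_size, total_sample, two_arm_total)
```

The second question usually involves a smaller difference (T1 versus T2) than the first (C versus T1), so it tends to determine the size of the whole evaluation.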
Power Calculations Summary
Elements and their implication for sample size. The sample size will have to be larger:
• The smaller the effects that we want to detect
• The higher the underlying variance
• The higher the level of implementation (clustering) and the correlation of outcomes within clusters
• The more (statistical) confidence/precision required
• The more complicated the design (multiple treatments, interest in comparisons between sub-groups)
• The lower the take-up
• The lower the data quality
Non-Experimental Impact Evaluation Methods require larger samples!
Key takeaways
Sampling describes the process to draw a sample of units from a population to estimate the characteristics of that population
Larger samples give more precise estimates of the population characteristics
Impact evaluation requires estimating the difference in outcomes between two groups (treatment and comparison)
Small samples create risks of drawing incorrect policy conclusions
Power calculations tell us how large samples need to be. Larger samples are needed to precisely estimate impacts if we expect the impacts to be small, or if the program creates clusters,…
In case you need to run power calculations
• Look for a sampling specialist…
• Calculations can be made in many statistical packages.
• In STATA, the key command is sampsi
• OPTIMAL DESIGN software is more user friendly. It displays trade-offs visually:
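[Figure: Optimal Design plot of power (vertical axis, 0.1 to 1.0) against a horizontal axis running from 43 to 199, for n = 5, with four curves. The horizontal axis label and the legend symbols were lost in extraction; the legend value pairs are 0.20/0.00, 0.20/0.05, 0.40/0.00 and 0.40/0.05, and a further parameter is set to 0.050.]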