Sampling for Impact Evaluation

Nazmul Chaudhury World Bank

Dakar, Senegal Wednesday, October 2, 2013

Key takeaways

Sampling describes the process to draw a sample of units from a population to estimate the characteristics of that population

Larger samples give more precise estimates of the population characteristics

Impact evaluation requires estimating the difference in outcomes between two groups (treatment and comparison)

Small samples create risks of drawing incorrect policy conclusions

Power calculations tell us how large samples need to be. Larger samples are needed to precisely estimate impacts if we expect the impacts to be small, or if the program creates clusters,…

Drawing a sample from a population

Population of interest

Sampling describes the process to draw a sample of units from a population to estimate its characteristics.

Sample

Infer characteristics of the population based on the sample (e.g. average height of children aged 2 in Nigeria)

How do you draw a sample? In practice…

Define the population of interest:
• All children aged 0-24 months in Nigeria?
• All children aged 0-24 months who visited a health center in the last month?

Define a sampling frame: the most comprehensive list that can be obtained of units in the population of interest.

Define a sampling procedure (i.e. how to draw the sample from the population):
• Probability sampling procedures assign a defined probability for each unit to be drawn (this ensures the sample is representative), e.g. random sampling
• Avoid convenience sampling
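As an illustration of a probability sampling procedure, here is a minimal sketch of simple random sampling from a sampling frame. The frame, its size and the unit IDs are hypothetical; in practice the frame would come from a census list, facility register or similar source.

```python
import random

# Hypothetical sampling frame: one ID per child in the population of interest.
sampling_frame = [f"child_{i:05d}" for i in range(10_000)]

def draw_simple_random_sample(frame, sample_size, seed=0):
    """Draw units without replacement; every unit has the same chance of selection."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced and audited
    return rng.sample(frame, sample_size)

sample = draw_simple_random_sample(sampling_frame, sample_size=500)

# With simple random sampling, each unit's selection probability is n / N.
inclusion_probability = len(sample) / len(sampling_frame)
print(len(sample), inclusion_probability)
```

Convenience sampling, by contrast, has no known selection probability, so there is no way to compute a quantity like `inclusion_probability`.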

Random sampling is not sufficient for IE

What if we draw a random sample from two different groups, e.g. participants and non-participants in a program?

Drawing a random sample from each of two groups does not make the groups comparable.

Randomized assignment

When does randomized assignment produce comparable groups? To achieve balance, randomization needs to be performed on a sufficient number of eligible units.
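The link between balance and the number of randomized units can be illustrated with a small simulation. This is only a sketch: the baseline characteristic (child height, normally distributed with mean 85 cm and standard deviation 4 cm) and the number of repetitions are assumptions chosen for illustration.

```python
import random
import statistics

def baseline_gap(n_units, seed):
    """Randomly assign n_units to two equal groups and return the absolute
    difference in mean baseline height (cm) between the groups."""
    rng = random.Random(seed)
    # Hypothetical baseline characteristic: height ~ Normal(85 cm, 4 cm).
    heights = [rng.gauss(85, 4) for _ in range(n_units)]
    rng.shuffle(heights)  # randomized assignment: first half treatment, second half comparison
    treatment = heights[: n_units // 2]
    comparison = heights[n_units // 2 :]
    return abs(statistics.mean(treatment) - statistics.mean(comparison))

def average_gap(n_units, n_draws=200):
    """Average the baseline gap over many independent randomizations."""
    return statistics.mean(baseline_gap(n_units, seed) for seed in range(n_draws))

gaps = {n: average_gap(n) for n in (10, 100, 1000)}
for n, gap in gaps.items():
    print(f"units randomized={n:5d}  average baseline gap={gap:.2f} cm")
```

The average gap between the two groups shrinks as more units are randomized, which is what "balance" means here.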

How to build large enough samples for impact evaluation?

How many people/facilities/units should be in an evaluation sample?

Determining an adequate sample size is essential

Important trade-offs between cost and reliability of findings.

Power calculations help make decisions on sample size


Impact evaluation is about measuring differences in outcomes between groups

Does a randomized nutrition program improve nutrition outcomes among young children?

The impact of the program is the difference between nutrition status in the treatment and control group.

How do we estimate impact?
Step 1: Estimation of outcomes in the treatment group
Step 2: Estimation of outcomes in the comparison group
Step 3: Estimation of the difference in outcomes between the two groups, and test whether it is statistically different from 0.
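The three steps can be sketched in a few lines. The outcome values below are hypothetical nutrition scores invented for illustration, and the test uses a normal approximation for simplicity; with samples this small a t-test would be the more careful choice.

```python
import math
import statistics

def estimate_impact(treatment, comparison):
    """Return (difference in means, standard error, z statistic)."""
    mean_t = statistics.mean(treatment)    # Step 1: outcome in the treatment group
    mean_c = statistics.mean(comparison)   # Step 2: outcome in the comparison group
    diff = mean_t - mean_c                 # Step 3: difference between the groups
    se = math.sqrt(statistics.variance(treatment) / len(treatment)
                   + statistics.variance(comparison) / len(comparison))
    return diff, se, diff / se

# Hypothetical nutrition scores, for illustration only.
treatment = [-1.1, -0.8, -1.4, -0.6, -0.9, -1.0, -0.7, -1.2]
comparison = [-1.5, -1.3, -1.8, -1.1, -1.6, -1.2, -1.4, -1.7]

diff, se, z = estimate_impact(treatment, comparison)
critical = statistics.NormalDist().inv_cdf(0.975)  # two-sided test at the 5% level
print(f"difference={diff:.2f}  se={se:.2f}  z={z:.2f}  significant={abs(z) > critical}")
```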

How big a sample do I need for the comparison and treatment groups?


Larger samples are more accurate

Think of the sample size as the accuracy of our measuring device. The more observations you have:
• The more precise your “measuring device” is
• The more confident you are about the conclusions of your evaluation

Example: guess the sentence below


What if we increase the number of “observations”?
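One way to put numbers on this: the standard error of a sample mean is the standard deviation divided by the square root of the sample size. The sketch below assumes a hypothetical standard deviation of 4 cm for child height.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 4.0  # hypothetical standard deviation of child height, in cm
for n in (25, 100, 400, 1600):
    se = standard_error(sd, n)
    print(f"n={n:5d}  standard error={se:.2f} cm  95% margin=±{1.96 * se:.2f} cm")
```

Each fourfold increase in the number of observations only halves the standard error, which is why precision gets expensive.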


Small samples create risks for policy decisions

Assume that a program has a positive impact on beneficiaries:
• If the evaluation sample is too small, you might not be able to detect this positive impact.
• “Type 2 error”: the risk of failing to conclude that your program has an impact even when it does.
• This could lead to policy decisions to eliminate the program, which would be detrimental to beneficiaries and society.

An impact evaluation is powerful if there is a low risk of not detecting real program impacts, that is, of committing a type 2 error.

“Type 1 error”: the risk of concluding that a purely coincidental impact is due to your program.
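Both kinds of error can be made concrete by simulating many evaluations. The sketch below is illustrative only: the true effect (0.3 standard deviations), the sample sizes and the 5% significance level are assumptions, and the test is a simple normal-approximation comparison of means.

```python
import math
import random
from statistics import NormalDist

CRITICAL = NormalDist().inv_cdf(0.975)  # two-sided test at the 5% level

def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)

def rejection_rate(n_per_group, true_effect, n_sims=300, seed=1):
    """Share of simulated evaluations that declare a statistically significant impact."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        comparison = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treatment = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        mean_c, var_c = mean_and_variance(comparison)
        mean_t, var_t = mean_and_variance(treatment)
        se = math.sqrt(var_t / n_per_group + var_c / n_per_group)
        if abs((mean_t - mean_c) / se) > CRITICAL:
            rejections += 1
    return rejections / n_sims

results = {}
for n in (20, 100, 400):
    power = rejection_rate(n, true_effect=0.3)        # 1 minus the type 2 error rate
    type_1_rate = rejection_rate(n, true_effect=0.0)  # no real impact: any detection is a type 1 error
    results[n] = (power, type_1_rate)
    print(f"n per group={n:4d}  power={power:.2f}  type 1 rate={type_1_rate:.2f}")
```

In this setup the type 1 rate stays near the chosen significance level whatever the sample size, while power (the chance of detecting the real impact) rises with the sample.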

How to determine the sample size?

The short response: an ugly formula

Let’s focus on intuition behind power calculations

$$N = \frac{4\sigma^2\,(z_{\alpha/2} + z_\beta)^2}{D^2}\,\bigl[1 + \rho\,(H - 1)\bigr]$$
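Reading the formula as N = 4σ²(z_{α/2} + z_β)² / D² × [1 + ρ(H − 1)], a minimal sketch of the calculation follows. The interpretation of the symbols is an assumption, since the slide does not define them: N is taken as the total sample across both groups, σ as the standard deviation of the outcome, D as the smallest effect to detect, ρ as the intra-cluster correlation and H as the number of units per cluster. The defaults (5% significance, 80% power) are conventional choices, not values from the slides.

```python
import math
from statistics import NormalDist

def total_sample_size(effect, sd, icc=0.0, cluster_size=1, alpha=0.05, power=0.80):
    """N = 4 * sd^2 * (z_{alpha/2} + z_beta)^2 / effect^2 * [1 + icc * (cluster_size - 1)]."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # critical value for a two-sided test
    z_beta = NormalDist().inv_cdf(power)             # quantile matching the desired power
    design_effect = 1 + icc * (cluster_size - 1)     # equals 1 when there is no clustering
    n = 4 * sd**2 * (z_alpha + z_beta) ** 2 / effect**2 * design_effect
    return math.ceil(n)

# Hypothetical inputs: detect an effect of 0.2 standard deviations.
print(total_sample_size(effect=0.2, sd=1.0))                             # no clustering
print(total_sample_size(effect=0.2, sd=1.0, icc=0.05, cluster_size=20))  # clustered design
```

A sampling specialist would normally refine this with the details of the actual design; the sketch only shows how the ingredients enter.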

Intuition behind power calculations

We do not know in advance the effect of our policy. How can we be sure we will be able to measure it? Precision is not cheap: larger samples cost more.

Core ingredients:
1. What is the minimum impact that would justify the investment in the intervention?
2. How variable is the outcome you are interested in?
3. Does your program create clusters?

1st ingredient: Smallest program effect size that you wish to detect

Fundamental policy question: what is the level of impact below which a program should be considered unsuccessful?

What is the objective of your program? Decrease stunting rates by 5%, 20%, 50%?

The smaller the (expected) differences between treatment and control:
• the more precise the instrument has to be to detect them
• the larger the sample needs to be
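To see how quickly the sample grows as the target effect shrinks: under the usual power formula the required sample is proportional to 1 / (effect size)², holding everything else fixed. The sketch below applies that rule to the three targets mentioned above; treating the outcome variance as unchanged across targets is a simplifying assumption.

```python
def relative_sample_size(effect, reference_effect):
    """Sample size needed relative to a reference effect, holding everything else fixed.
    Uses the rule that the required sample is proportional to 1 / effect**2."""
    return (reference_effect / effect) ** 2

# Targets from the slide (reductions of 50, 20 and 5), with 50 as the reference.
for target in (50, 20, 5):
    multiplier = relative_sample_size(target, reference_effect=50)
    print(f"target reduction={target:2d}  sample size multiplier={multiplier:.2f}")
```

Detecting an effect one tenth the size takes roughly one hundred times the sample.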

Who is taller? Detecting smaller differences is harder

The larger the sample, the more precise the measuring device, and the easier it is to detect smaller effects. Increasing sample size ≈ increasing precision (of our measuring tool).

2nd Ingredient: Variance of Outcomes (1)

How does the variance of the outcome affect our ability to detect an impact?

Example: Of the two (circled) populations, which is bigger? How many observations from each circle would you need to decide?

Example: On average, which group has the larger animals? The comparison is more complicated, so you need more information (i.e. a larger sample): the answer may depend on which members of the blue and red groups you observe.

In which of these two cases is the impact harder to detect?
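The role of variance can be illustrated by simulation. In the sketch below, two hypothetical groups ("red" and "blue") differ by the same true amount in both scenarios; only the spread of individual values changes. All numbers are assumptions chosen for illustration.

```python
import random

def share_correct(sd, n_per_group=10, n_sims=2000, seed=7):
    """Share of simulated samples in which the group with the truly larger mean
    also has the larger sample mean."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_sims):
        red = [rng.gauss(10.0, sd) for _ in range(n_per_group)]   # true mean 10
        blue = [rng.gauss(11.0, sd) for _ in range(n_per_group)]  # true mean 11
        if sum(blue) / n_per_group > sum(red) / n_per_group:
            correct += 1
    return correct / n_sims

low_variance = share_correct(sd=1.0)
high_variance = share_correct(sd=5.0)
print(f"low variance: {low_variance:.2f}  high variance: {high_variance:.2f}")
```

With the same ten observations per group, the high-variance case gets the ranking wrong far more often, so it needs a larger sample to reach the same confidence.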

3rd Ingredient: Clustering

Does your program create clusters?
• What is the unit at which outcomes are measured?
• What is the unit at which the program is implemented?

Example of a nutrition program:
• Impact is measured at the level of the individual/child
• But the program is implemented at the level of the village

Challenges with clustering:
• Outcomes for individuals within a cluster are likely to be correlated (intra-cluster correlation)
• Need to adjust the sample: it is more powerful to add 1 observation in a new cluster than 1 observation in an existing cluster

It is the number of clusters that largely determines the ‘useful’ sample size (the number of individuals within clusters matters less).
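A rough way to quantify the 'useful' sample size is to divide the number of observations by the design effect 1 + ρ(H − 1), where ρ is the intra-cluster correlation and H the number of units per cluster. The sketch below assumes equal-sized clusters and a hypothetical ρ of 0.10; the village and child counts are invented for illustration.

```python
def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is 'worth',
    using the design effect 1 + icc * (cluster_size - 1)."""
    design_effect = 1 + icc * (cluster_size - 1)
    return n_clusters * cluster_size / design_effect

icc = 0.10  # hypothetical intra-cluster correlation

base = effective_sample_size(n_clusters=20, cluster_size=25, icc=icc)
# Two ways to double the raw sample from 500 to 1000 children:
more_children = effective_sample_size(n_clusters=20, cluster_size=50, icc=icc)
more_villages = effective_sample_size(n_clusters=40, cluster_size=25, icc=icc)

print(round(base), round(more_children), round(more_villages))
```

Doubling the number of villages doubles the effective sample, while doubling the children per village adds comparatively little.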

Intuition behind power calculation

Other factors:
1. Multiple evaluation questions/treatment groups
2. Comparison of impacts between sub-groups
3. Take-up
4. Data quality
5. Statistical parameters (level of confidence, power,…)
6. Choice of impact evaluation method

The more questions, the larger the sample…

What if you are interested in two impact evaluation questions?
• Does the nutrition program have an impact?
• Should the nutrition program be complemented by an information campaign?

The impact evaluation will have 3 groups (multiple treatment arms):
• Control group (group C)
• Group receiving the nutrition program only (group T1)
• Group receiving the nutrition program + information campaign (group T2)

A larger sample is needed to make precise comparisons between each group.

Power Calculations Summary

Elements and their implication for sample size. The sample size will have to be larger:
• The smaller the effects that we want to detect
• The higher the underlying variance
• The higher the level of implementation (clustering), and the correlation of outcomes within clusters
• The more (statistical) confidence/precision required
• The more complicated the design (multiple treatments, interest in comparisons between sub-groups)
• The lower the take-up
• The lower the data quality

Non-experimental impact evaluation methods require larger samples!
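The take-up element can be given a rough order of magnitude. A common simplification, not stated in the slides, is that if only a share of the treatment group actually participates (and nobody in the comparison group does), the average effect across the whole treatment group shrinks in proportion to take-up, so the required sample grows by 1 / take-up². The function name and the take-up rates below are illustrative.

```python
def inflation_for_take_up(take_up):
    """Factor by which the sample must grow when only a share `take_up` of the
    treatment group participates (assumes no participation in the comparison group)."""
    if not 0 < take_up <= 1:
        raise ValueError("take_up must be in (0, 1]")
    return 1 / take_up**2

for take_up in (1.0, 0.8, 0.5):
    multiplier = inflation_for_take_up(take_up)
    print(f"take-up={take_up:.0%}  sample size multiplier={multiplier:.2f}")
```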

Key takeaways

Sampling describes the process to draw a sample of units from a population to estimate the characteristics of that population

Larger samples give more precise estimates of the population characteristics

Impact evaluation requires estimating the difference in outcomes between two groups (treatment and comparison)

Small samples create risks of drawing incorrect policy conclusions

Power calculations tell us how large samples need to be. Larger samples are needed to precisely estimate impacts if we expect the impacts to be small, or if the program creates clusters,…

In case you need to run power calculations

• Look for a sampling specialist…

Calculations can be made in many statistical packages.
• In STATA, the key command is sampsi
• OPTIMAL DESIGN software is more user friendly. It displays trade-offs visually:

[Figure: Optimal Design output. Vertical axis: power (0.1 to 1.0). Horizontal axis ticks: 43, 82, 121, 160, 199. Settings: 0.050, n = 5. Four curves, labelled (0.20, 0.00), (0.20, 0.05), (0.40, 0.00) and (0.40, 0.05).]
