ifpri- impact surveys 1

Sampling and statistical power

Devesh Roy (on behalf of IFPRI-IFAD team)

Questionnaire of impact surveys

• Module A – Identification – cluster, household id

• Module B – Household demographics – sex, age, literacy

• Module C – Survey questions – dwellings, drinking water supply, sanitation, food security, asset-related questions, farming and livestock questions, anthropometry

Important details underlying this questionnaire

• No tracking of households: straightforward identification module

• Limited set of household characteristics

• Sample size is exogenously fixed and simple random sampling is proposed, yet effects across strata are envisioned

Some elements of robust IE–leading to RIMS+

• Choosing the sample

• Issue: what data are needed, and what sample is required to precisely estimate differences in outcomes between the treatment group and the comparison group.

• Determine both the sample size and how to draw the units in the sample from a population of interest.

• What Kinds of Data Do I Need?

• Need SMART data- specific, measurable, attributable, realistic, and targeted.

• Good quality data are required to assess the impact of the intervention on the outcomes of interest.

• The IE should not measure only outcomes for which the program is directly accountable. Outcome indicators indirectly affected or indicators capturing unintended program impact will maximize the value of the information that the IE generates.

Primer on sample and nature of data

• Some indicators may not be amenable to IE in small samples. Detecting impacts may require prohibitively large samples for outcomes that are:

• extremely variable,

• rare events,

• or likely to be only marginally affected by an intervention.

• Example: identifying the impact of an intervention on maternal mortality rates will be feasible only in a sample that contains many pregnant women.

• Data on exogenous factors that may affect the outcome of interest. These make it possible to control for outside influences.

• Data on other characteristics. These make it possible to include additional controls or to analyze the heterogeneity of the program’s effects along certain characteristics.

Role of existing data

• To set benchmark value of indicators

• Power calculation for getting the minimum sample size

• In IFAD projects, one might not need to do formal power calculations, but should get some sense of when a larger rather than a smaller sample is needed

How large must the sample be?

• Associated calculations are called power calculations.

• Avoid collecting too much data as well as too little data

• Remember the risk of too little data: if the sample is too small, you may not be able to detect a positive impact and may thus wrongly conclude that the program had no effect

• Assume for simplicity that all intended beneficiaries take part and all intended non-beneficiaries remain non-beneficiaries.

Power calculations

• Most impact evaluations test a simple hypothesis

• Does the program have an impact? In other words, Is the program impact different from zero? Answering this question requires two steps:

• 1. Estimate the average outcomes for the beneficiary and non-beneficiary groups.

• 2. Assess whether a difference exists between the average outcome for the beneficiary group and the average outcome for the non-beneficiary group.

Large versus small sample: a large sample reduces the chance of being unlucky (Gertler et al 2010)

Consider a related example

• Take a nutrition program

• Take an anthropometric measure of the beneficiaries and non-beneficiaries

• Take a sample of 2 children (beneficiary and non-beneficiary) and repeat this many times

• The estimates from the repeatedly drawn samples will bounce around a lot, which implies that the estimates are unreliable

• Take a sample of 100 children and repeat this many times

• What do you see? Which estimate bounces around much more?

• It is the estimate from the smaller sample
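The repeated-sampling thought experiment above can be sketched as a small simulation. This is only an illustration under stated assumptions: the outcome is a standardized anthropometric score drawn from a normal distribution, the true impact (0.5) and standard deviation (1.0) are made-up values, the sample sizes are read as children per group, and the function name `spread_of_estimates` is hypothetical.

```python
import random
import statistics


def spread_of_estimates(n_per_group, true_effect=0.5, sd=1.0,
                        repetitions=2000, seed=0):
    """Standard deviation of the estimated impact across repeated samples.

    All numeric defaults are illustrative assumptions, not values from the
    slides. A larger return value means the estimate "bounces" more.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(repetitions):
        # Draw one sample of beneficiaries and one of non-beneficiaries.
        beneficiaries = [rng.gauss(true_effect, sd) for _ in range(n_per_group)]
        non_beneficiaries = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        # The impact estimate is the difference in sample means.
        estimates.append(statistics.fmean(beneficiaries)
                         - statistics.fmean(non_beneficiaries))
    return statistics.stdev(estimates)


small = spread_of_estimates(n_per_group=2)
large = spread_of_estimates(n_per_group=100)
print(f"spread of estimates, 2 children per group:   {small:.3f}")
print(f"spread of estimates, 100 children per group: {large:.3f}")
```

Under these assumptions the spread shrinks roughly with the square root of the sample size, so the estimate from 100 children per group should bounce about seven times less than the estimate from 2 children per group.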

Errors in Impact evaluation

• Type 1 error and type 2 error

• Type 1 error (conclude that average height in the beneficiary group is higher than in non-beneficiary group when in fact it is not)

• Type 2 error (conclude that average height in the beneficiary group is no different than in non-beneficiary group when in fact it is actually different)

• Likelihood of a type 1 error is called the significance level, usually set at 5 percent; the corresponding confidence level is 95 percent, i.e. you would be 95 percent confident that the program had an impact

• Many factors affect the likelihood of committing a type 2 error, but sample size is crucial: the larger the sample, the less likely it is that the average height or weight of children in the two groups appears equal just by luck when a true difference exists

Power calculation and errors in IE (Gertler et al 2010)

• The statistical power of an impact evaluation is the probability that it will detect a difference between the beneficiary and non-beneficiary groups when in fact one exists. An impact evaluation has a high power if there is a low risk of not detecting real program impacts, that is, of committing a type II error.

• With high power, you are unlikely to be disappointed by results showing that the program being evaluated has had no impact when in reality it did have an impact.

• From a policy perspective, underpowered impact evaluations are costly.

• If you were to conclude that the program was not effective, even though it was, policy makers would be likely to close down a good program.

• Carrying out power calculations could be crucial and relevant.
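The two error rates and the notion of power can be made concrete with a Monte Carlo sketch. It assumes normally distributed outcomes, a large-sample z-test on the difference in means, and made-up values for the effect size and sample sizes; the function name `rejection_rate` is hypothetical and not taken from Gertler et al.

```python
import random
import statistics
from statistics import NormalDist


def rejection_rate(n_per_group, true_effect, sd=1.0, alpha=0.05,
                   repetitions=2000, seed=1):
    """Share of simulated evaluations that declare a significant impact.

    With true_effect == 0 this approximates the type 1 error rate;
    with true_effect != 0 it approximates the statistical power.
    """
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(repetitions):
        treated = [rng.gauss(true_effect, sd) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        # Standard error of the difference in means from sample variances.
        se = (statistics.variance(treated) / n_per_group
              + statistics.variance(control) / n_per_group) ** 0.5
        if abs(diff / se) > critical:
            rejections += 1
    return rejections / repetitions


type1_rate = rejection_rate(n_per_group=100, true_effect=0.0)
power_small = rejection_rate(n_per_group=20, true_effect=0.4)
power_large = rejection_rate(n_per_group=100, true_effect=0.4)
print(f"type 1 error rate (no true impact):  {type1_rate:.2f}")
print(f"power with 20 children per group:    {power_small:.2f}")
print(f"power with 100 children per group:   {power_large:.2f}")
```

In this sketch the underpowered design (20 per group) misses a real impact most of the time, which is exactly the costly outcome described above, while the type 1 error rate stays near the chosen 5 percent regardless of sample size.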

Steps in power calculations (Gertler et al 2010)

• Does the program create clusters?

• What is the outcome indicator?

• Do you aim to compare program impacts between subgroups?

• What is the minimum level of impact that would justify the investment that has been made in the intervention?

• What is a reasonable level of power for the evaluation being conducted?

• What are the baseline mean and variance of the outcome indicators?

Power calculation: Continued

• No-clusters case: the intervention is assigned at the level at which impacts are observed (by contrast, treatments given at the school level with outcomes observed at the student level would create clusters)

• No clusters- take a random sample out of the entire population

• Identify the most important indicators that you want to evaluate

• If impacts are to be compared across sub-groups (like SC/ST caste groups), then the required sample size would be larger

• The required sample size also depends on the size of the effect to be detected

• For an evaluation to identify a small impact, estimates of any difference in mean outcomes between the treatment and comparison groups will need to be very precise, requiring a large sample.

• Population baseline mean and variance

• A power of 80 percent is the norm

• Many statistical software packages can do the power calculations once these parameters are known

• Programs that create clusters have a different issue with power calculations
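When the sample size is fixed in advance, the same parameters can be turned around to ask what the smallest detectable impact is. The sketch below assumes two equal-sized groups with a common standard deviation, a two-sided test, and a normal approximation; the function name `minimum_detectable_effect` and the example numbers are illustrative assumptions.

```python
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, sd, alpha=0.05, power=0.80):
    """Smallest true difference in means detectable with the given power.

    Assumes equal group sizes, a common standard deviation `sd`, and a
    two-sided test at significance level `alpha` (normal approximation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * (2 * sd ** 2 / n_per_group) ** 0.5


for n in (50, 200, 800):
    mde = minimum_detectable_effect(n_per_group=n, sd=1.0)
    print(f"n per group = {n:4d} -> minimum detectable effect = {mde:.3f} sd")
```

Each fourfold increase in the sample only halves the minimum detectable effect, which is why identifying small impacts requires very large samples.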

Power calculations with clusters

• Some programs assign benefits at the cluster level

• The guiding principle for sample size here: the number of clusters matters more than the number of households within clusters

• A sufficient number of clusters is required to test convincingly whether a program has had an impact by comparing outcomes

• Compared to the steps before, add just one question: how variable are the outcomes within clusters?

• Suppose incomes are the same within a village but differ across villages: adding an individual from another village would then add more statistical power than adding another individual from the same village
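One common way to quantify this point, which is not stated in the slides and is added here as an assumption, is the standard design-effect formula 1 + (m - 1)ρ, where m is the number of households sampled per cluster and ρ is the intra-cluster correlation. The function names and numbers below are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from sampling clusters instead of individuals.

    `icc` is the intra-cluster correlation: 0 means households in a cluster
    are unrelated, 1 means they are identical (the village-income example).
    """
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)


# Same total of 1,000 households, allocated two different ways.
few_big_clusters = effective_sample_size(n_clusters=10, cluster_size=100, icc=0.2)
many_small_clusters = effective_sample_size(n_clusters=100, cluster_size=10, icc=0.2)
identical_within = effective_sample_size(n_clusters=10, cluster_size=100, icc=1.0)

print(f"10 clusters x 100 households: {few_big_clusters:.0f} effective observations")
print(f"100 clusters x 10 households: {many_small_clusters:.0f} effective observations")
print(f"identical households within clusters: {identical_within:.0f} effective observations")
```

With identical households inside each village, 1,000 interviews are worth only as many observations as there are villages, so extra power can come only from adding villages.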

INTERNATIONAL FOOD POLICY RESEARCH INSTITUTE

Example: test of difference of means in two populations

• The equation for sample size is derived from the equation for the statistical test

• In a t-test the equation for the test is

\[
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\left( \frac{s_1^2}{n} + \frac{s_2^2}{n} \right)^{1/2}}
\]

• The derived equation for sample size is

\[
n = \frac{\left( z_{1-\alpha/2} + z_{1-\beta} \right)^2 \left( s_1^2 + s_2^2 \right)}{(\mu_1 - \mu_2)^2}
\]
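The sample size equation can be evaluated directly. The sketch below assumes that n is the sample size per group, that α is the significance level of a two-sided test, and that 1 − β is the desired power; the function name `sample_size_per_group` and the example means and standard deviations are illustrative, not values from the slides.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mean_1, mean_2, sd_1, sd_2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test of a difference in means.

    Implements n = (z_{1-alpha/2} + z_{1-beta})^2 (s1^2 + s2^2) / (m1 - m2)^2,
    with power = 1 - beta, rounded up to the next whole unit.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * (sd_1 ** 2 + sd_2 ** 2) / (mean_1 - mean_2) ** 2
    return math.ceil(n)


# Hypothetical standardized outcome: expected impacts of 0.5 and 0.25 sd.
print(sample_size_per_group(mean_1=0.5, mean_2=0.0, sd_1=1.0, sd_2=1.0))   # 63
print(sample_size_per_group(mean_1=0.25, mean_2=0.0, sd_1=1.0, sd_2=1.0))  # 252
```

Because the expected difference enters the denominator squared, halving the impact to be detected roughly quadruples the required sample.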

What next after sample size?

• A Sampling Strategy

• Steps in sampling

• 1. Determine the population of interest (e.g. children under a certain age).

• 2. Identify a sampling frame: it should coincide with the population of interest, but sometimes it does not.

• 3. Draw as many units from the sampling frame as required by power calculations.

• Choose the sampling method

Sampling method

• Probability sampling (PS) methods are the most rigorous: they assign each unit a well-defined probability of being drawn. The 3 main PS methods are:

• Random sampling. Every unit in the population has exactly the same probability of being drawn.

• Stratified random sampling. The population is divided into groups (for example, male and female) and random sampling is performed within each group (essential for comparing impacts in subgroups).

• Cluster sampling. Units are grouped in clusters, and a random sample of clusters is drawn, after which either all units in those clusters constitute the sample or a number of units within the cluster are randomly drawn. This means that each cluster has a well-defined probability of being selected, and units within a selected cluster also have a well-defined probability of being drawn.
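The three probability sampling methods can be sketched with Python's standard `random` module. The sampling frame below is made up for illustration (1,000 households in 20 villages), and the function names and field names (`village`, `sex_of_head`) are hypothetical.

```python
import random


def simple_random_sample(units, k, rng):
    """Every unit has the same probability of being drawn."""
    return rng.sample(units, k)


def stratified_random_sample(units, stratum_of, k_per_stratum, rng):
    """Random sampling performed separately within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample


def cluster_sample(units, cluster_of, n_clusters, k_per_cluster, rng):
    """Draw clusters at random, then draw units at random within each one."""
    clusters = {}
    for unit in units:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for cluster_id in chosen:
        sample.extend(rng.sample(clusters[cluster_id], k_per_cluster))
    return sample


# Hypothetical sampling frame: 1,000 households spread over 20 villages.
rng = random.Random(42)
households = [
    {"id": i, "village": i % 20, "sex_of_head": "female" if i % 4 == 0 else "male"}
    for i in range(1000)
]

srs = simple_random_sample(households, 60, rng)
stratified = stratified_random_sample(households, lambda h: h["sex_of_head"], 30, rng)
clustered = cluster_sample(households, lambda h: h["village"], 6, 10, rng)
print(len(srs), len(stratified), len(clustered))
```

The stratified draw guarantees 30 female-headed households even though they are only a quarter of the frame, which is what makes subgroup comparisons feasible; the cluster draw concentrates fieldwork in 6 villages.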

Sampling method and impact evaluation

• Drawing a sample depends on the rules of eligibility in the program

• Usually interventions take place at the cluster level

• This implies that in these cases one should use cluster sampling

• At all costs, avoid non-probability sampling methods such as purposive or convenience sampling
