missing data: how do we know what we don’t know?

31
Missing data: How do we know what we don’t know? Carina Omoeva, PhD Education Policy and Data Center, FHI 360 [email protected]

Upload: hedya

Post on 23-Feb-2016

80 views

Category:

Documents


0 download

DESCRIPTION

Missing data: How do we know what we don’t know?. Carina Omoeva, PhD Education Policy and Data Center, FHI 360 [email protected]. What is missing data??. MISSING!!!!. It’s information that we cannot observe. When are missing data a problem?. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Missing  data:  How do we know what we don’t know?

Missing data: How do we know what we don’t know?

Carina Omoeva, PhDEducation Policy and Data Center, FHI [email protected]

Page 2: Missing  data:  How do we know what we don’t know?

What is missing data??

It’s information that we cannot observe

MISSING!!!!

Page 3: Missing  data:  How do we know what we don’t know?

When are missing data a problem?

“Focusing only on countries with publishable data is likely to be misleading for global policy debates, particularly as many of the countries without data are likely to be furthest from achieving UPE” (2014 GMR, p. 54)

BIAS in observed parameters

1

3

5

+ 7

57,000,000

Page 4: Missing  data:  How do we know what we don’t know?

Missing data rates may reach 40% on some indicators

• About 25 % countries are missing gross intake rate data for 2010-2011

• About 15% missing pupil-teacher ratioUIS

• Armenia, 8th grade student survey: 12% missing on question about parental education

TIMSS

• Up to 40% missing data on reading comprehension questions EGRA

• Aid effectiveness survey: 25-70% countries not reporting data on primary completion, intake

GPE

Page 5: Missing  data:  How do we know what we don’t know?

SOME THEORY

2. ARE ALL MISSING DATA THE SAME?

Page 6: Missing  data:  How do we know what we don’t know?

The missing data framework*

• Missing Completely at Random (MCAR)

E.g.:o A random subsample was drawn for an extended

survey

o Enumerator accidentally flipped pages of questionnaire, skipping a few questions

* Rubin (1976), Allison (2002)

Page 7: Missing  data:  How do we know what we don’t know?

• Missing at Random (MAR)

E.g.: o Students from low SES, rural areas do not report their

parents’ education

o Data on road quality missing from remote villages in poor districts

Missing data framework, cont’d

* Rubin (1976), Allison (2002)

Page 8: Missing  data:  How do we know what we don’t know?

• Not Missing at Random (NMAR):

E.g.: o People who committed crime less likely to report that

they had

• Very difficult mechanism to address and to model (but methods exist)

Missing data framework, cont’d

* Rubin (1976), Allison (2002)

Page 9: Missing  data:  How do we know what we don’t know?

COMMON METHODS

3. WHAT TO DO ABOUT MISSING DATA?

Page 10: Missing  data:  How do we know what we don’t know?

Common practices in addressing missing data

1. Complete cases only– Used in data sets with low

rates of missing data where data are not NMAR

– E.g. Household survey response rates generally are high (>95%) and not considered problematic

CountryHHS Resp.

RateBangladesh 99.3Burkina Faso 99.4Côte d'Ivoire 97.9Egypt 99.1Ghana 98.7Guinea 97.3Indonesia 98.9Kenya 96.3Mali 98Namibia 96.9Nigeria 98.6Rwanda 99.5Tanzania 98.3Viet Nam 99.9

Page 11: Missing  data:  How do we know what we don’t know?

2. Last observation carried forward (LOCF)3. Group mean or median imputation4. Correlation- or regression-based prediction5. “Expert opinion” or “best guess”

– All of these methods are frequently used– However, only a few agencies publish imputation

methodology documentation

Common practices: Single value imputation methods

Page 12: Missing  data:  How do we know what we don’t know?

Missing data methods: Multiple imputation

Several duplicates of same dataset go through iterative rounds of maximum likelihood estimation prediction until convergence is achieved on all variables with missing data

All variables with missing data are simultaneously imputed Parameters are computed in each of the multiply imputed

estimates, and averaged across– ILO ‘The Impact of Multiple Imputation of Coarsened Data on

Estimates on the Working Poor in South Africa’– CDC ‘Multiple Imputation of Family Income and Personal

Earnings in the National Health Interview Survey’

Page 13: Missing  data:  How do we know what we don’t know?

PLUGGING IT IN

4. HOW DOES THIS HELP ME?

Page 14: Missing  data:  How do we know what we don’t know?

Missing 2011 country values, rate of out of school children (primary age), UIS Data Centre

Country 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011Benin ... ... ... ... 17.12625 14.19443 17.72751 15.32814 ... 12.37597 ... ... ...

Botswana 20.44992 19.17772 17.47724 17.15707 17.06117 16.19356 16.90068 16.0317 16.78229 16.91703 15.97294 ... ...

Burundi ... 59.17458 50.43434 51.00511 47.50929 45.86108 42.69354 26.29329 19.37098 9.10837 10.91698 5.9276 ...

Cameroon ... ... ... ... ... ... ... ... ... 17.8611 15.05237 ... ...

Comoros 31.9001 23.61875 ... ... ... ... ... ... 16.66771 ... ... ... ...

Congo ... ... ... ... ... ... 47.34904 43.91574 ... ... ... 12.16786 ...

Côte d'Ivoire 40.31403 40.86011 36.05823 35.52237 38.83673 ... ... ... ... ... 38.14261 ... ...

DR Congo 63.78939 ... ... ... ... ... ... ... ... ... ... ... ...

Gabon ... ... ... ... ... ... ... ... ... ... ... ... ...

Guinea-Bissau ... 50.58379 ... ... ... ... ... ... ... ... ... 29.20563 ...

Kenya 37.13713 34.02705 32.57801 37.49941 25.14625 26.07675 24.38792 24.59853 13.67437 18.01437 17.02601 ... ...

Madagascar 36.46819 34.81941 33.25125 32.32049 22.56573 ... ... ... ... ... ... ... ...

Malawi 0.92699 ... ... 0.76915 ... 2.44002 2.37384 2.87726 6.63348 2.90175 2.52415 ... ...

Namibia 12.27122 10.76214 10.02947 8.73525 8.98333 11.53399 11.73285 13.14494 12.00636 12.86524 13.138 13.37428 ...

Nigeria 38.12975 35.13065 ... ... ... 33.87797 33.4364 32.94754 35.80272 42.19059 43.68921 43.77389 ...

Rwanda 17.86934 ... ... 14.64582 9.54785 18.98723 ... ... 25.4651 26.7457 17.55844 11.42626 ...

Sao Tome and Principe 19.84707 ... ... 3.781 2.65329 0.6945 - ... 2.8143 1.15072 2.11461 1.27772 ...

Sierra Leone ... ... ... ... ... ... ... ... ... ... ... ... ...

Somalia                          

South Africa 3.47284 5.5345 5.45364 5.31352 4.71496 4.5757 5.12409 4.70283 3.22638 4.05069 5.41988 ... ...

Swaziland 29.14217 27.65937 27.21756 27.33598 27.13233 24.53797 23.51192 17.77565 15.13016 ... ... ... ...

Togo 11.45093 15.04471 12.39699 9.04673 11.12428 8.27481 8.61346 6.69741 8.99883 7.18874 ... ... ...

UR Tanzania 50.70551 46.96316 42.17218 27.06975 18.45552 13.95434 9.38522 3.83551 ... 2.16739 ... ... ...

Zimbabwe 15.56385 14.48166 12.35066 15.56918 16.52517 ... ... ... ... ... ... ... ...

Page 15: Missing  data:  How do we know what we don’t know?

Creating a “plausible range”

• We will apply a combination of methods to create a range of plausible estimates of the number of out of school children in Sub-Saharan Africa, based on our dataset of rates

• The methods will include: – Last observation carried forward– Trend extrapolation– Group mean imputation– Regression imputation, with UIS and survey data– Multiple imputation using predictive mean matching,

with survey data

Page 16: Missing  data:  How do we know what we don’t know?

Methods – Last observation carried forward (LOCF)

19992000

20012002

20032004

20052006

20072008

20092010

20110

10203040506070

DR Congo

?

05

101520253035

Swaziland

?

19992000

20012002

20032004

20052006

20072008

20092010

20110

4

8

12

16

Namibia

Page 17: Missing  data:  How do we know what we don’t know?

RESULTS: HOW CLOSE ARE WE?

UIS LOCF05

10152025303540

29.834.9

UIS and imputed estimate by method

mill

ions

Page 18: Missing  data:  How do we know what we don’t know?

Methods – group mean imputation (GMI)

GDP per capita groupings

Poorest Lower mid Upper mid Wealthiest

Central East South West

Region groupings

Page 19: Missing  data:  How do we know what we don’t know?

Group mean imputation

Kenya Rwanda Burundi Tanzania Somalia 0

10203040

1711.4

5.9 2.2 0

East Africa most recent observed values (rates)

Group mean= 32.4

Botswana Swaziland Namibia Congo South Africa Gabon-5

5

15

25

35

16.0 15.1 13.4 12.25.4

Wealthiest GDP group most recent observed values (rates)

Group mean= 13.6

Page 20: Missing  data:  How do we know what we don’t know?

RESULTS

UIS LOCF GMI GDP GMI region

0

5

10

15

20

25

30

35

40

29.834.9

30.9 33.5

UIS and imputed estimates by method

Page 21: Missing  data:  How do we know what we don’t know?

Combining methods: Scenario A

• If a time series exists, missing value is an extrapolation of post-2006 time trend (only four countries!)

• If insufficient time series, but post-2006 data available, last observation carried forward

• Six countries with no data after 2006 • Replaced missing values with mean of subgroup

leveled by GDP per capita• Replaced missing values with mean of region

Page 22: Missing  data:  How do we know what we don’t know?

UIS LOCF GMI GDP

GMI region

Slope, LOCF, GMI GDP

Slope, LOCF, GMI

region

05

10152025303540

29.834.9

30.9 33.5 32.8 31.8

UIS and imputed estimates by method

RESULTS

Page 23: Missing  data:  How do we know what we don’t know?

Scenario B: combination using regression with GER as predictor

20 40 60 80 100 120 140 160 1800

10

20

30

40

50

60

70

80

Out of School Children Rate vs. Gross Enrollment Rate

GER

OOSCrate

Page 24: Missing  data:  How do we know what we don’t know?

RESULTS

UISLO

CF

GMI GDP

GMI region

Slope, LO

CF, GMI G

DP

Slope, L

OCF, GMI re

gion

Regressi

on and GMI region

Regressi

on, LOCF

0

5

10

15

20

25

30

35

40

29.834.9

30.933.5 32.8

31.8 29.329.1

UIS and imputed estimates by method

Page 25: Missing  data:  How do we know what we don’t know?

Scenario C: combination using regression, GMI, and population weight

• Regression w/ mean imputation with population weight adjustment– Replaced 9 missing GER countries with mean of

subgroup leveled by GDP per capita (26,000,000)– Replaced 9 countries with mean of region

(29,200,000)

Page 26: Missing  data:  How do we know what we don’t know?

UISLO

CF

GMI GDP

GMI region

Slope, LO

CF, GMI G

DP

Slope, LO

CF, GMI re

gion

Regressi

on and GMI region

Regressi

on, LOCF

Regressi

on and GMI GDP, p

op wt

Regressi

on and GMI region, p

op wt

0

5

10

15

20

25

30

35

40

29.834.9

30.933.5 32.8

32.8 29.329.1

26.429.2

UIS and imputed estimates by method

RESULTS

Page 27: Missing  data:  How do we know what we don’t know?

Constructing Confidence Intervals with Survey Data

• Method 1: Group mean imputation with survey data– Estimate = 36,410,000; CI (35700000 , 37120000)

• Method 2: Multiple imputation– First step: LOCF or trend-based imputation– Second step: multiple imputation across 80 MI

datasets, using predictive mean matching– Estimate: 30,887,000; CI: (30185000 , 31588000)

Page 28: Missing  data:  How do we know what we don’t know?

RESULTS

Method 1

Method 2

36.4

30.9

UISLO

CF

GMI GDP

GMI region

Slope, LO

CF, GMI G

DP

Slope, LO

CF, GMI re

gion

Regressi

on and GMI region

Regressi

on, LOCF

Regressi

on and GMI GDP, p

op wt

Regressi

on and GMI region, p

op wt

0

5

10

15

20

25

30

35

40

29.834.9

30.933.5 32.8

31.8 29.329.1

26.429.2

UIS and imputed estimates by method Survey based

Page 29: Missing  data:  How do we know what we don’t know?

The Takeaways?

• Estimates and aggregates based on datasets with missing values must be treated as ranges

• Multiple methods used produce different estimates• Use of a single estimate produces a sense of false

precision

• Most importantly, “The only really good solution to the missing data problem is not to have any.”

– Allison (2002)

Page 30: Missing  data:  How do we know what we don’t know?

Agenda

• Invest in data systems, data quality

• More transparency around data collection, imputation, dissemination.

• Focus on larger trends, ignore small variability

• Post-2015 “data revolution” should focus on issues of data quality