
HDCA Summer School on Capability and Multidimensional Poverty

24 August – 3 September 2012

University of Indonesia, Indonesia

Analysing associations across deprivation indicators in MP

Paola Ballón (OPHI)

“It is possible for a set of univariate analyses done independently for each dimension of well-being to conclude that poverty in A is lower than poverty in B while a multivariate analysis concludes the opposite, and vice-versa. The key to these possibilities is the interaction of the various dimensions of well-being in the poverty measure and their correlation in the sampled populations.”

Duclos, Sahn and Younger, 2006, p. 945

The importance of “associations” in Multidimensional Poverty

                                                    1997   2000   2004   2007

Panel A. Marginal distribution: % children under 5 deprived in...
  Health (1)                                        43.5   39.8   26.7   19.9
  Nutrition (2)                                     74.3   62.2   61.4   58.3
  Improved Sanitation (3)                           72.5   68.4   79.1   58.3

Panel B. Joint distribution: % children under 5...
  Deprived in at least one dimension (union)        93.3   88.6   91.0   82.9
  Deprived in all three dimensions (intersection)   26.6   21.1   16.3    8.6

The joint distribution tells a different story

Source: Roche (2012) based on Bangladesh Demographic Health Survey data. The example follows a similar illustration from Atkinson and Lugo (2010) reproduced in Ferreira and Lugo (2012).

Note: (1) Not immunized or did not receive medical treatment when sick; (2) Either underweight, stunted, or wasted; (3) MDG indicator standards.

The joint distribution provides an indication of the intensity of deprivations that batter the poor at the same time.

The joint distribution matters

India NFHS data 2005-6

A comparison of raw headcounts of school attendance and 5 years of schooling shows*:

• Percentage of people living in households where no member has 5 years of schooling: 18.27%
• Percentage of people living in households where a child is not attending school: 21.17%

Are they the same people? Less than half the time: only about 41% of the first group and 35% of the second group experience both deprivations.

• 7.41% of people live in households with both deprivations.
• 10.86% of people live in households where no member has 5 years of schooling, but no child is out of school.
• 13.76% of people live in households where a child is not attending school, but some member has 5 years of schooling.
• 67.97% of people do not experience either deprivation.

*With censored headcounts: 17.58% for 5 years of schooling and 19.53% for children out of school; 7.41% for both.

There is a value in looking across indicators
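The overlap arithmetic in the India example can be checked with a few lines of Python. This is only a sketch: the function name `decompose_headcounts` and its argument names are illustrative assumptions, and the inputs are the raw headcount percentages quoted on the slide.

```python
def decompose_headcounts(share_a, share_b, share_both):
    """Split two marginal headcounts (in %) into the cells of their joint distribution.

    share_a, share_b: % of people deprived in indicator A and in indicator B.
    share_both: % of people deprived in both indicators at once.
    Function and argument names are illustrative, not taken from the lecture.
    """
    only_a = share_a - share_both             # deprived in A but not in B
    only_b = share_b - share_both             # deprived in B but not in A
    union = share_a + share_b - share_both    # deprived in at least one
    neither = 100.0 - union                   # deprived in none
    return {"both": share_both, "only_a": only_a, "only_b": only_b,
            "union": union, "neither": neither}


# A = no member with 5 years of schooling, B = a child not attending school.
cells = decompose_headcounts(18.27, 21.17, 7.41)
for name, value in cells.items():
    print(f"{name:8s} {value:6.2f}%")

# Share of each deprived group that also has the other deprivation.
print(f"overlap within A: {7.41 / 18.27:.1%}")
print(f"overlap within B: {7.41 / 21.17:.1%}")
```

The two marginal headcounts alone cannot produce these cells; the joint share (7.41%) is the extra piece of information that the cross-tabulation supplies.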

Focus of this class

Introduce the basic statistical tools to:

A) Display the relationships between categorical variables: cross-tabs (contingency tables)

B) Measure their degree of association: strength and direction (sign)

• Test for independence
• Choose a measure of association which has a “contextual” meaning

Outline of this lecture

1. Ways of classifying categorical variables
2. Contingency tables (cross-tabs)
3. Tests for independence: the $\chi^2$ statistic
4. Measures of association:
   a) Non-model-based (“traditional”) measures of association, based on the $\chi^2$ statistic
   b) Measures for comparing proportions: the odds ratio
   c) Measures of association assuming an underlying continuum: the tetrachoric correlation coefficient

1. Ways of classifying categorical variables

A) Scale distinction: nominal – ordinal

a) Nominal variables are variables having categories without a natural ordering. Examples: gender, region, religion.

The labels attached to a category do not imply a quantity.

The categories are mutually exclusive (each observation is placed in one and only one category) and exhaustive (each case should be placed in a category).

Gender, with categories male versus female, could be labeled:
“0” versus “1”, which does not imply that females are superior to males; or
“2” versus “1”, which does not imply that males are superior to females.


b) Ordinal variables are variables that do have ordered categories, but the distances between categories are unknown.

Example: level of education of the household head. This is often represented by an ordinal variable with several categories: no education, primary incomplete, primary complete, secondary incomplete, secondary complete, higher.

These categories could be labeled “1, 2, 3, 4, 5, 6”. The one-unit difference between (1) and (2) does not mean that someone in category 2 is twice as educated as someone with no education. The same categories could equally be labeled “10, 12, 15, 20, 23, 30”, because only the order of the labels carries information.


The way a variable is measured determines its classification

For example, education:

• Is nominal when measured as literate/illiterate.
• Is ordinal when measured as highest degree attained: none – PhD.
• Could also be considered an interval/ratio variable when measured by the number of years of education.

Interval/ratio variables belong to the highest level of measurement. An interval or ratio variable indicates the order of categories and the exact distance between them.

Sources of information

• Here we focus on dichotomised deprivation scores, 0 or 1. These are simple, and appropriate for all data.

• To study the association across deprivation indicators we have two different sources of information:
  – the $g^0$ matrix: raw head counts
  – the $g^0(k)$ matrix: censored head counts

• Our main tool for displaying the relationships across deprivation indicators is a cross-tab or contingency table.
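As a minimal sketch of how such a cross-tab is built from two 0/1 columns of a deprivation matrix, the Python below counts the four possible combinations. The function name `crosstab_2x2` and the eight-household toy data are hypothetical and only illustrate the mechanics; they are not taken from the NFHS sample.

```python
from collections import Counter


def crosstab_2x2(x, y):
    """Cross-tabulate two 0/1 deprivation indicators (0 = non deprived, 1 = deprived).

    Returns [[n11, n12], [n21, n22]] with rows for x and columns for y,
    both ordered (non deprived, deprived).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    counts = Counter(zip(x, y))  # missing combinations count as 0
    return [[counts[(0, 0)], counts[(0, 1)]],
            [counts[(1, 0)], counts[(1, 1)]]]


# Hypothetical toy data: one entry per household.
safe_water = [0, 0, 1, 0, 1, 0, 0, 1]
child_mortality = [0, 1, 1, 0, 0, 0, 0, 1]

table = crosstab_2x2(safe_water, child_mortality)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

print("cell counts:", table)
print("row totals: ", row_totals)
print("col totals: ", col_totals)
```

The same function applies unchanged to columns of the censored matrix; only the input columns differ.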

2. The contingency table associated with the $g^0$ matrix

• Example: India NFHS data 2005-6 (sub-sample)

Relative & absolute frequencies, $g^0$ matrix (raw head counts)

                 Child mortality
Safe water       Non deprived     Deprived         Total
Non deprived     323885 (65%)     82568 (17%)      406453 (82%)
Deprived          66885 (14%)     21732 (4%)        88617 (18%)
Total            390770 (79%)    104300 (21%)      495070 (100%)

2. The contingency table associated with the $g^0(k)$ matrix

Relative & absolute frequencies, $g^0(k)$ matrix (censored head counts)

                 Child mortality
Safe water       Non deprived     Deprived         Total
Non deprived     370441 (75%)     65700 (13%)      436141 (88%)
Deprived          38407 (8%)      20522 (4%)        58929 (12%)
Total            408848 (83%)     86222 (17%)      495070 (100%)

The joint distribution between the two categorical indicators determines their relationship; it also determines the marginal and conditional distributions.

2. Describing a contingency table

• Let X and Y denote two deprivation indicators, X with I categories and Y with J categories.
• Classifications of households on both indicators have IJ possible combinations. [If the variables are 0-1, there are 2x2=4 combinations.]
• The responses (X, Y) of a household chosen randomly from some population have a probability distribution.
• A table having I rows for categories of X and J columns for categories of Y displays this distribution.
• The cells of this table represent the IJ possible combinations.
• When these cells contain frequency counts (absolute frequencies), the table is called a contingency table (Karl Pearson, 1904).

2. Describing a 2x2 contingency table: $g^0$ matrix

• In our example: I=2, J=2, thus IJ=4. With the observed counts we have:

                 Child mortality
Safe water       Non deprived   Deprived     Total
Non deprived     323885          82568       406453
Deprived          66885          21732        88617
Total            390770         104300       495070

• In general notation:

                 Child mortality
Safe water       Non deprived   Deprived     Total
Non deprived     $n_{11}$       $n_{12}$     $n_{1+}$
Deprived         $n_{21}$       $n_{22}$     $n_{2+}$
Total            $n_{+1}$       $n_{+2}$     $n$

where $n_{ij}$ are the cell count frequencies, $n_{i+}$ and $n_{+j}$ are the row and column marginal totals, and

$$n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}$$

2. Joint, Conditional and Marginal distributions

• $\pi_{ij}$ denotes the probability that (X, Y) occurs in the cell in row i and column j.
• The probability distribution $\{\pi_{ij}\}$ is the joint distribution of X and Y.
• $\pi_{i+} = \sum_j \pi_{ij}$ are the marginal distributions of the row variable X.
• $\pi_{+j} = \sum_i \pi_{ij}$ are the marginal distributions of the column variable Y.
• These result from summing the joint probabilities.
• These satisfy:

$$\sum_i \pi_{i+} = \sum_j \pi_{+j} = \sum_i \sum_j \pi_{ij} = 1.0$$

• $\pi_{j|i} = \pi_{ij} / \pi_{i+}$ denotes the conditional probability of category j of Y at category i of X.
• For a fixed category i of X, the probabilities $\{\pi_{1|i}, \dots, \pi_{J|i}\}$ form the conditional distribution of Y at category i of X.

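2. Contingency tables and their distributions: $g^0$ or $g^0(k)$ matrices

Each cell shows the joint probability, with the conditional probability of Y given the row category of X in parentheses.

                 Child mortality (Y)
Safe water (X)   Non deprived                Deprived                    Total
Non deprived     $\pi_{11}$ ($\pi_{1|1}$)    $\pi_{12}$ ($\pi_{2|1}$)    $\pi_{1+}$ (1.0)
Deprived         $\pi_{21}$ ($\pi_{1|2}$)    $\pi_{22}$ ($\pi_{2|2}$)    $\pi_{2+}$ (1.0)
Total            $\pi_{+1}$                  $\pi_{+2}$                  1.0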
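The joint, marginal and conditional distributions can be estimated from the raw head count table by replacing probabilities with sample proportions. The sketch below does this in plain Python; the function name `distributions` is an illustrative assumption, and the counts are those of the safe water by child mortality example.

```python
def distributions(table):
    """Sample joint, marginal and conditional proportions of an I x J count table.

    Rows are categories of X, columns are categories of Y.
    """
    n = sum(sum(row) for row in table)
    joint = [[cell / n for cell in row] for row in table]            # cell share of n
    row_marginal = [sum(row) for row in joint]                       # sum over columns
    col_marginal = [sum(col) for col in zip(*joint)]                 # sum over rows
    # Conditional distribution of Y within each row of X: cell count / row total.
    conditional = [[cell / sum(row) for cell in row] for row in table]
    return joint, row_marginal, col_marginal, conditional


# Raw head counts: rows = safe water, columns = child mortality,
# both ordered (non deprived, deprived).
g0_counts = [[323885, 82568],
             [66885, 21732]]

joint, row_m, col_m, cond = distributions(g0_counts)
print("joint:          ", [[round(p, 3) for p in row] for row in joint])
print("row marginals:  ", [round(p, 3) for p in row_m])
print("col marginals:  ", [round(p, 3) for p in col_m])
print("conditional Y|X:", [[round(p, 3) for p in row] for row in cond])
```

Each row of the conditional output sums to 1, matching the (1.0) entries in the Total column of the table.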

3. Independence of Deprivation Indicators

• Two deprivation indicators (categorical variables) are defined to be independent if all joint probabilities equal the product of their marginal probabilities.

• The hypothesis of independence is:

$$H_0: \pi_{ij} = \pi_{i+}\,\pi_{+j} \quad \text{for } i = 1, \dots, I \text{ and } j = 1, \dots, J$$

• The traditional approach to test this hypothesis is to:
  1. calculate the expected counts ($\mu_{ij}$) under the null assumption of independence, and
  2. compare the observed ($n_{ij}$) and expected counts ($\mu_{ij}$) using Pearson's chi-square statistic ($\chi^2$).


• When $H_0$ is true, the expected values of $n_{ij}$, called expected frequencies, are:

$$\mu_{ij} = E(n_{ij}) = n\,\pi_{i+}\,\pi_{+j}$$

• Usually, $\pi_{i+}$ and $\pi_{+j}$ are unknown. Their ML (maximum likelihood) estimates are the sample marginal proportions:

$$\hat{\pi}_{i+} = \frac{n_{i+}}{n} \quad \text{and} \quad \hat{\pi}_{+j} = \frac{n_{+j}}{n}$$

• So estimated expected frequencies are:

$$\hat{\mu}_{ij} = \frac{n_{i+}\,n_{+j}}{n}$$

• Then the $\chi^2$ test statistic is:

$$\chi^2 = \sum_i \sum_j \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}$$

• For large samples $\chi^2$ has approximately a chi-squared distribution with (I-1) x (J-1) degrees of freedom (Fisher, 1922).

• The P-value is approximated by:

$$P = P\left[\chi^2_{(I-1)(J-1)} \ge \chi^2_o\right]$$

where $\chi^2_o$ is the observed chi-square value. For a chosen significance level $\alpha$ (the probability of a Type I error), if P-value < $\alpha$ we reject $H_0$ and conclude that X and Y are not independent.

Example: See xls. sheet
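The spreadsheet itself is not reproduced here. As a rough stand-in, the sketch below applies the two steps of the test (expected counts under independence, then Pearson's chi-square) to the raw head count table for safe water and child mortality. The function names are illustrative assumptions, and the P-value helper is deliberately limited to 1 degree of freedom (the 2x2 case), where the chi-squared tail can be written with the standard library's complementary error function.

```python
import math


def expected_counts(table):
    """Estimated expected frequencies under independence: row total * column total / n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi2(table):
    """Pearson chi-square statistic and its degrees of freedom for an I x J table."""
    expected = expected_counts(table)
    chi2 = sum((obs - exp) ** 2 / exp
               for obs_row, exp_row in zip(table, expected)
               for obs, exp in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def p_value_df1(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    If Z is standard normal, P(Z^2 >= x) = erfc(sqrt(x / 2)).
    Valid only for df = 1, i.e. for 2x2 tables.
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


g0_counts = [[323885, 82568],
             [66885, 21732]]

chi2, df = pearson_chi2(g0_counts)
p = p_value_df1(chi2)
alpha = 0.05
print(f"chi2 = {chi2:.2f}, df = {df}, P-value = {p:.3g}")
print("reject H0 (independence)" if p < alpha else "do not reject H0")
```

With almost half a million observations even a weak association yields a very large statistic, which is one reason the lecture goes on to measures of the degree of association.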

A brief detour…

• Before studying the association across categorical (deprivation) indicators, we recall that for interval-ratio variables the association is measured by the Pearson product-moment correlation coefficient r.

• This coefficient is a measure of the correlation (linear dependence) between two variables. It takes values between −1 and +1 inclusive: ±1 indicates perfect linear correlation, and 0 indicates absence of linear correlation.

• It is defined as the covariance of the two variables divided by the product of their standard deviations. The population correlation coefficient is:

$$r_{XY} = \frac{Cov(X, Y)}{\sigma_X\,\sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\,\sigma_Y}$$

• and its sample counterpart is:

$$\hat{r}_{XY} = \frac{\frac{1}{n-1}\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\hat{\sigma}_X\,\hat{\sigma}_Y}$$

where $\hat{\sigma}_X$ and $\hat{\sigma}_Y$ are the sample standard deviations.
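For reference, the sample correlation coefficient can be computed directly from its definition. The sketch below assumes sample standard deviations with an n−1 divisor; the function name `pearson_r` and the five data points are hypothetical and serve only to illustrate the calculation.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation: sample covariance over the product of sample SDs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y need the same length and at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)


# Hypothetical interval/ratio data, for illustration only.
years_of_schooling = [2, 4, 6, 8, 10]
outcome = [1.0, 2.1, 2.9, 4.2, 5.0]

print(f"r = {pearson_r(years_of_schooling, outcome):.3f}")
```

An exact linear relationship gives r = +1 or r = −1 depending on the sign of the slope.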

4.a Traditional measures of association

• The $\chi^2$ statistic is a useful tool for testing independence across deprivation indicators, but it does not provide any information about their degree of association (strength or sign).

• Traditional measures of association relying on the $\chi^2$ comprise:

• The mean square contingency:

$$\Phi^2 = \frac{\chi^2}{n}, \quad \Phi^2 \ge 0$$

• Coefficient of contingency (Pearson):

$$C = \sqrt{\frac{\chi^2/n}{1 + \chi^2/n}}, \quad 0 \le C < 1$$

4.a Traditional measures of association

• Cramér variant:

$$V = \sqrt{\frac{\chi^2/n}{\min\{(I-1),\,(J-1)\}}}, \quad 0 \le V \le 1$$

• which in the 2x2 case can be written in signed form:

$$V = \frac{n_{11}\,n_{22} - n_{12}\,n_{21}}{(n_{1+}\,n_{2+}\,n_{+1}\,n_{+2})^{1/2}}, \quad -1 \le V \le 1$$

• C and V equal 0 under independence, and V reaches 1 under “complete association”. The maximum attainable value of C is below 1 and depends on the size of the table.

• Goodman and Kruskal (1979) point out that these “traditional” measures (functions of $\chi^2$) do not have an operational meaning. Hence the difficulty of comparing meaningfully the values obtained from two different cross-tabs.
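A short sketch of these chi-square based measures, applied to the raw head count table for safe water and child mortality, is given below. The function names are illustrative assumptions; the square-root forms of C and V used here are the standard textbook definitions.

```python
import math


def chi2_based_measures(table):
    """Mean square contingency, Pearson's C and Cramér's V for an I x J count table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / n
    c = math.sqrt(phi2 / (1.0 + phi2))
    v = math.sqrt(phi2 / min(len(table) - 1, len(table[0]) - 1))
    return phi2, c, v


def signed_v_2x2(table):
    """Signed 2x2 version: cross-product difference over the root of the marginal totals."""
    (n11, n12), (n21, n22) = table
    numerator = n11 * n22 - n12 * n21
    denominator = math.sqrt((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))
    return numerator / denominator


g0_counts = [[323885, 82568],
             [66885, 21732]]

phi2, c, v = chi2_based_measures(g0_counts)
v_signed = signed_v_2x2(g0_counts)
print(f"Phi^2 = {phi2:.5f}, C = {c:.4f}, V = {v:.4f}, signed V (2x2) = {v_signed:.4f}")
```

In a 2x2 table the unsigned V equals the absolute value of the signed version, so the sign is the only extra information the second formula carries.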

4.b Comparing proportions: The odds

• The odds ratio is a measure of association that compares the relative chances of a “success” vis-à-vis a “failure” in a cross-tab.

• In the context of two deprivation indicators X and Y, corresponding to the uncensored deprivation matrix ($g^0$), we denote the probability of not being deprived as $\pi$, and the probability of being deprived as $(1-\pi)$.

• The odds are defined to be:

$$\Omega = \frac{\pi}{1-\pi} = \frac{P(\text{non-deprived})}{P(\text{deprived})}$$

• The odds are nonnegative, with $\Omega > 1.0$ when being non-deprived (success) is more likely than being deprived (failure).

4.b Comparing proportions: The odds

• For instance, if $\pi = 0.75$, then $\Omega = 0.75/0.25 = 3.0$.

• This means that being non-deprived is three times as likely as being deprived.

• That is, we expect about 3 non-deprived households for each deprived one.

• Conversely, if $\Omega = 1/3$, being deprived is three times as likely as being non-deprived, and we expect about 1 non-deprived household for every 3 deprived ones.

• In the case of a 2x2 table, a measure of association compares the odds of being non-deprived rather than deprived in indicator Y within each (row) category of indicator X. This is the odds ratio.
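The odds calculation can be sketched in a few lines. The function name `odds` is an illustrative assumption. The row-wise odds use the raw head counts for safe water (rows) and child mortality (columns): within a row, the probability ratio reduces to the non-deprived count over the deprived count.

```python
def odds(pi):
    """Odds of 'success' (non-deprived) for a success probability pi, with 0 <= pi < 1."""
    if not 0.0 <= pi < 1.0:
        raise ValueError("pi must satisfy 0 <= pi < 1")
    return pi / (1.0 - pi)


print(odds(0.75))   # non-deprived three times as likely as deprived
print(odds(0.25))   # deprived three times as likely as non-deprived

# Odds of being non-deprived in Y (child mortality) within each row of X (safe water).
g0_counts = [[323885, 82568],
             [66885, 21732]]
row_odds = [row[0] / row[1] for row in g0_counts]
for label, value in zip(("non deprived in X", "deprived in X"), row_odds):
    print(f"odds of non-deprivation in Y, {label}: {value:.3f}")
```

Comparing the two row-wise odds is the step that leads to the odds ratio.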

4.b Comparing proportions: The odds ratio

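• The odds ratio or cross-product ratio of a 2x2 table is:

$$\theta = \frac{\Omega_1}{\Omega_2} = \frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)} = \frac{\pi_{11}\,\pi_{22}}{\pi_{12}\,\pi_{21}}, \quad \theta \ge 0$$

where $\pi_i$ is the probability of being non-deprived in Y within row i of X, and $\Omega_i = \pi_i/(1-\pi_i)$ are the corresponding odds.

• For cell counts $n_{ij}$, the sample odds ratio is:

$$\hat{\theta} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}}$$

• A value of $\theta = 1$ corresponds to independence of X and Y.

• Values of $\theta$ farther from 1, in a given direction, represent stronger association.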

4.b Comparing proportions: The odds ratio

• When $1 < \theta < \infty$, households identified as non-deprived in indicator X (row 1 of X) are more likely to be non-deprived in indicator Y than households identified as deprived in indicator X (row 2 of X).

• For instance, when $\theta = 4$, the odds of being non-deprived in Y for households non-deprived in X are four times the odds for households deprived in X.

• For 2x2 tables, Yule (1900, 1912) introduced the measure of association Q:

$$Q = \frac{\pi_{11}\,\pi_{22} - \pi_{12}\,\pi_{21}}{\pi_{11}\,\pi_{22} + \pi_{12}\,\pi_{21}}, \quad -1 \le Q \le 1$$

• Q relates to the odds ratio by

$$Q = \frac{\theta - 1}{\theta + 1},$$

a monotone transformation of $\theta$ from the $(0, \infty)$ scale onto the $(-1, +1)$ scale.

Example: See xls. sheet
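The spreadsheet is not reproduced here. As a sketch of the same kind of calculation, the code below computes the sample odds ratio and Yule's Q for the raw and the censored head count tables of safe water (rows) by child mortality (columns). The function names are illustrative assumptions.

```python
def odds_ratio(table):
    """Sample odds ratio (cross-product ratio) of a 2x2 count table."""
    (n11, n12), (n21, n22) = table
    return (n11 * n22) / (n12 * n21)


def yules_q(table):
    """Yule's Q: (n11*n22 - n12*n21) / (n11*n22 + n12*n21), between -1 and +1."""
    (n11, n12), (n21, n22) = table
    return (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)


raw = [[323885, 82568],
       [66885, 21732]]
censored = [[370441, 65700],
            [38407, 20522]]

for label, table in (("raw g0", raw), ("censored g0(k)", censored)):
    theta = odds_ratio(table)
    q = yules_q(table)
    # Q computed from the table agrees with the transformation (theta - 1) / (theta + 1).
    print(f"{label:15s} theta = {theta:.3f}  Q = {q:.3f}  "
          f"(theta-1)/(theta+1) = {(theta - 1) / (theta + 1):.3f}")
```

Both odds ratios exceed 1, so in both tables households non-deprived in safe water have higher odds of being non-deprived in child mortality.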

4.c The tetrachoric correlation

• The tetrachoric correlation coefficient is a measure of association applied under the assumption of an underlying joint continuum characterising two deprivation indicators X and Y.

• The desirability of assuming an underlying joint continuum was an issue of heated debate in the 1910s between Yule (1912) and K. Pearson & Heron (1913).

• According to Yule, such an assumption was very frequently misleading and artificial.

• In contrast, Pearson and Heron argued that such an assumption was almost always justified and fruitful.

4.c The tetrachoric correlation

• We should note that assuming an underlying one-dimensional continuum establishes an ordering of the categories of the indicator.

• Agresti (2002) defines the tetrachoric correlation as: “[the] ML estimate of the correlation for a bivariate normal distribution assumed to underlie counts in a 2x2 table”.

See illustration of the bivariate normal distribution.

• “It is the correlation value in the bivariate normal density that would produce cell probabilities equal to the sample cell proportions when the density is collapsed to a 2x2 table having the same marginal proportions as the observed table” (p. 620).

• To compute the tetrachoric correlation we shall use the Stata command tetrachoric.
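The lecture relies on Stata for this calculation. Purely as an illustration of the definition quoted above, the Python sketch below searches for the correlation of a standard bivariate normal distribution that reproduces the observed cell proportions, given thresholds set by the observed margins. Everything in it is an assumption made for illustration: the function names, the use of Simpson's rule, and the bisection search are not the algorithm Stata uses, and its output should be treated as a rough cross-check only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    one_minus_r2 = 1.0 - r * r
    exponent = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * one_minus_r2)
    return math.exp(exponent) / (2.0 * math.pi * math.sqrt(one_minus_r2))


def bvn_cdf(h, k, rho, steps=200):
    """P(Z1 <= h, Z2 <= k) for a standard bivariate normal with correlation rho.

    Uses the fact that the derivative of this probability with respect to rho
    is the bivariate density, so the CDF equals Phi(h) * Phi(k) plus the
    integral of the density from 0 to rho (composite Simpson's rule, even steps).
    """
    width = rho / steps
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * width)
    return STD_NORMAL.cdf(h) * STD_NORMAL.cdf(k) + total * width / 3.0


def tetrachoric(table, tol=1e-10):
    """Correlation whose collapsed 2x2 probabilities match the sample proportions."""
    (n11, n12), (n21, n22) = table
    n = n11 + n12 + n21 + n22
    h = STD_NORMAL.inv_cdf((n11 + n12) / n)   # threshold from the row margin
    k = STD_NORMAL.inv_cdf((n11 + n21) / n)   # threshold from the column margin
    target = n11 / n                          # observed share of cell (1, 1)
    lo, hi = -0.999, 0.999
    while hi - lo > tol:                      # the CDF is increasing in rho
        mid = (lo + hi) / 2.0
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


raw = [[323885, 82568], [66885, 21732]]
censored = [[370441, 65700], [38407, 20522]]
print(f"tetrachoric, raw g0:         {tetrachoric(raw):.3f}")
print(f"tetrachoric, censored g0(k): {tetrachoric(censored):.3f}")
```

The sketch assumes all four cells are non-empty and that the association is not close to ±1, where the numerical integration becomes unreliable.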

4.c Example

[Figure: tetrachoric and phi coefficients (series “Tetra” and “N_phi”) for censored head count pairs, X vs Y. Vertical axis “tetra & phi”, from 0.00 to 1.00. The horizontal axis lists the pairs formed from the indicators scho, enrol, chmor, nut, elect, sanit, wat, flo, fuel and assets, from scho_enrol to fuel_assets.]

4. Summary

• To study the association across pairs of (dichotomous) deprivation indicators in a multidimensional perspective we need to:

1. Test for independence:
   • Use the $\chi^2$ statistic
2. Measure the strength and direction of the association (if it exists) by comparing the following three measures:
   • Cramér's V measure, (-1, 1)
   • Odds ratio, > 0
   • Tetrachoric correlation, (-1, 1) (keep in mind the assumption of bivariate normality)

A final note: Multidimensionality and association

• The study of the association across multidimensional indicators of deprivation is intrinsically related to a debate as to whether it is desirable for indicators to have a high or low association.

• Favour low association: The low redundancy across indicators justifies the need for a multidimensional measure. According to this perspective, high correlation is “bad”, as it signals redundancy, and the redundant indicator(s) could be dropped from the analysis (Ranis, Samman, and Stewart, 2006; McGillivray and White, 1993).

• Favour high association: High association is useful because it generates a robust measure. Low association is “undesirable” because it can give rise to unstable measures (Foster, McGillivray, and Seth, 2012).

• Our view: Think it through.

• Even if indicators are highly associated, if there is a normative view supporting the need to monitor both indicators for policy purposes, then both should be included in the analysis, although their weights may then be lower. If indicators have a low association, and if each is independently important, then both can be considered.
