HDCA Summer School on Capability and Multidimensional Poverty
24 August – 3 September 2012
University of Indonesia, Indonesia
Analysing associations across deprivation indicators in MP
Paola Ballón (OPHI)
“It is possible for a set of univariate analyses done independently for each dimension of well-being to conclude that poverty in A is lower than poverty in B while a multivariate analysis concludes the opposite, and vice-versa. The key to these possibilities is the interaction of the various dimensions of well-being in the poverty measure and their correlation in the sampled populations.”
Duclos, Sahn and Younger, 2006, p. 945
The importance of “associations” in Multidimensional Poverty

                                                     1997   2000   2004   2007
Panel A. Marginal distribution: % children under 5 deprived in...
  Health (1)                                         43.5   39.8   26.7   19.9
  Nutrition (2)                                      74.3   62.2   61.4   58.3
  Improved Sanitation (3)                            72.5   68.4   79.1   58.3
Panel B. Joint distribution: % children under 5...
  Deprived in at least one dimension (union)         93.3   88.6   91.0   82.9
  Deprived in all three dimensions (intersection)    26.6   21.1   16.3    8.6
The joint distribution tells a different story
Source: Roche (2012), based on Bangladesh Demographic and Health Survey data. The example follows a similar illustration from Atkinson and Lugo (2010), reproduced in Ferreira and Lugo (2012).
Note: (1) Not immunized or did not receive medical treatment when sick; (2) either underweight, stunted, or wasted; (3) MDG indicator standards.
The joint distribution provides an indication of the intensity of deprivations that batter the poor at the same time.
The joint distribution matters
India NFHS data 2005-6
A comparison of raw headcounts of school attendance and 5 years of schooling shows*:
• Percentage of people living in a household where no member has 5 years of schooling: 18.27%
• Percentage of people living in a household where a child is not attending school: 21.17%
Are they the same people? Far less than half the time.
• 7.41% of people live in households with both deprivations.
• 10.86% of people live in households where no member has 5 years of schooling only.
• 13.76% of people live in households where a child is not attending school only.
• 67.97% of people do not experience either deprivation.
*With censored headcounts: 17.58% for 5 years of schooling and 19.53% for children out of school; 7.41% for both.
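The "far less than half the time" claim can be checked from the reported shares alone. The sketch below is only an illustration of that arithmetic: it rebuilds the two raw headcounts from the three joint shares and computes how often one deprivation is accompanied by the other. The variable names are illustrative, not part of the original material.

```python
# Joint shares reported above (percent of people, raw headcounts).
both = 7.41              # deprived in schooling AND school attendance
schooling_only = 10.86   # deprived in 5 years of schooling only
attendance_only = 13.76  # deprived in school attendance only

# Marginal (raw) headcounts are sums of the joint shares.
schooling_total = both + schooling_only
attendance_total = both + attendance_only
neither = 100.0 - (both + schooling_only + attendance_only)

# Share of each deprived group that also has the other deprivation.
overlap_given_schooling = both / schooling_total
overlap_given_attendance = both / attendance_total

print(f"schooling headcount:  {schooling_total:.2f}%")
print(f"attendance headcount: {attendance_total:.2f}%")
print(f"neither deprivation:  {neither:.2f}%")
print(f"overlap among schooling-deprived:  {overlap_given_schooling:.1%}")
print(f"overlap among attendance-deprived: {overlap_given_attendance:.1%}")
```

Both overlap shares come out below one half, which is the point the two marginal headcounts cannot show on their own.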
There is a value in looking across indicators
Focus of this class
Introduce the basic statistical tools to:
A) Display the relationships between categorical variables
• Cross-tabs (contingency tables)
B) Measure their degree of association: strength and direction (sign)
• Test for independence
• Choose a measure of association which has a “contextual” meaning
Outline of this lecture
1. Ways of classifying categorical variables
2. Contingency tables (cross-tabs)
3. Tests for independence: χ² statistic
4. Measures of association:
a) Non model-based measures of association, “Traditional”: based on the χ² statistic
b) Measures for comparing proportions: the odds ratio
c) Measures of association assuming an underlying continuum: the tetrachoric correlation coefficient
1. Ways of classifying categorical variables
A) Scale distinction: Nominal – ordinal
a) Nominal variables are variables having categories without a natural ordering. Examples: gender, region, religion.
The labels attached to a category do not imply a quantity.
The categories are mutually exclusive (each observation is placed in one and only one category) and exhaustive (each case should be placed in a category).
Gender, with categories male and female, could be labelled “0” versus “1”; this does not imply that females are superior to males. Labelling the same categories “2” versus “1” does not imply that males are superior to females.
1. Ways of classifying categorical variables
b) Ordinal variables are variables that do have ordered categories, but the distances between categories are unknown.
Example: Level of education of the household’s head. Often this is represented by an ordinal variable with several categories comprising: no education, primary incomplete, primary complete, secondary incomplete, secondary complete, higher.
These categories could be labelled “1, 2, 3, 4, 5, 6”; a difference of one unit between (1) and (2) does not mean that someone who has achieved primary education is twice as educated as someone with no education.
They could equally be labelled “10, 12, 15, 20, 23, 30”: only the order of the labels carries information.
1. Ways of classifying categorical variables
The way a variable is measured determines its classification
For example, education
• Is nominal when measured as literate/illiterate.
• Is ordinal when measured as highest degree attained: from none to PhD.
• Could also be considered an interval/ratio variable when measured by the number of years of education.
Interval/ratio variables belong to the highest level of measurement. An interval or ratio variable indicates the order of categories, and the exact distance between them.
Sources of information
• Here we focus on dichotomised deprivation scores, 0 or 1. These are simple, and appropriate for all data.
• To study the association across deprivation indicators we have two different sources of information:
– g0 matrix: raw head counts
– g0(k) matrix: censored head counts
• Our main tool for displaying the relationships across deprivation indicators is a cross-tab or contingency table.
2. The contingency table associated with the g0 matrix
• Example: India NFHS data 2005-6 (sub-sample)
Relative & absolute frequencies - g0 matrix (raw head counts)

                           Child mortality
Safe water       Non deprived     Deprived         Total
Non deprived     323885 (65%)      82568 (17%)     406453 (82%)
Deprived          66885 (14%)      21732 (4%)       88617 (18%)
Total            390770 (79%)     104300 (21%)     495070 (100%)
2. The contingency table associated with the g0(k) matrix
Relative & absolute frequencies - g0(k) matrix (censored head counts)

                           Child mortality
Safe water       Non deprived     Deprived         Total
Non deprived     370441 (75%)      65700 (13%)     436141 (88%)
Deprived          38407 (8%)       20522 (4%)       58929 (12%)
Total            408848 (83%)      86222 (17%)     495070 (100%)
The joint distribution between the two categorical indicators determines their relationship; it also determines the marginal and conditional distributions.
2. Describing a contingency table
• Let X and Y denote two deprivation indicators, X with I categories and Y with J categories.
• Classifications of households on both indicators have IJ possible combinations. [If the variables are 0-1, there are 2x2=4 combinations]
• The responses (X, Y) of a household, chosen randomly from some population, have a probability distribution.
• A table having I rows for categories of X and J columns for categories of Y displays this distribution.
• The cells of this table represent the IJ possible combinations.
• When these cells contain frequency counts (absolute frequencies) the table is called a contingency table (Karl Pearson, 1904).
2. Describing a 2x2 contingency table: g0 matrix
• In our example: I=2, J=2, thus IJ=4, we have:

                     Child mortality
Safe water       Non deprived   Deprived     Total
Non deprived     323885          82568       406453
Deprived          66885          21732        88617
Total            390770         104300       495070

• $n_{ij}$ are the cell count frequencies, with $n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}$
• $n_{i+}$, $n_{+j}$ are the row and column marginal totals
2. Describing a 2x2 contingency table: g0 matrix
• In our example: I=2, J=2, thus IJ=4, we have:

                     Child mortality
Safe water       Non deprived   Deprived     Total
Non deprived     n11            n12          n1+
Deprived         n21            n22          n2+
Total            n+1            n+2          n

• $n_{ij}$ are the cell count frequencies, with $n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}$
• $n_{i+}$, $n_{+j}$ are the row and column marginal totals
2. Joint, Conditional and Marginal distributions
• $\pi_{ij}$ denotes the probability that (X, Y) occurs in the cell in row i and column j.
• The probability distribution $\{\pi_{ij}\}$ is the joint distribution of X and Y.
• $\pi_{i+} = \sum_j \pi_{ij}$ are the marginal distributions of the row variable X.
• $\pi_{+j} = \sum_i \pi_{ij}$ are the marginal distributions of the column variable Y.
• These result from summing the joint probabilities.
• These satisfy:
$$\sum_i \pi_{i+} = \sum_j \pi_{+j} = \sum_i \sum_j \pi_{ij} = 1.0$$
• $\pi_{j|i}$ denotes the conditional probability of category j of Y at category i of X.
• For a fixed category i of X, the probabilities $\{\pi_{1|i}, \ldots, \pi_{J|i}\}$ form the conditional distribution of Y at category i of X.
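2. Contingency tables and their distributions: g0 or g0(k) matrices

                       Child mortality (Y)
Safe water (X)     Non deprived      Deprived          Total
Non deprived       π_11 (π_1|1)      π_12 (π_2|1)      π_1+ (1.0)
Deprived           π_21 (π_1|2)      π_22 (π_2|2)      π_2+ (1.0)
Total              π_+1              π_+2              1.0

Joint probabilities are shown first in each cell, with the conditional probabilities of Y given the row of X in parentheses.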
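As an illustration of these three distributions, the sketch below computes sample versions of the joint, marginal and conditional proportions from the raw head counts of the safe water by child mortality table shown earlier. It uses plain Python lists; the variable names are illustrative only.

```python
# Raw head counts (g0 matrix), India NFHS 2005-6 sub-sample.
# Rows: safe water (non-deprived, deprived).
# Columns: child mortality (non-deprived, deprived).
counts = [[323885, 82568],
          [66885, 21732]]

n = sum(sum(row) for row in counts)

# Joint proportions: each cell count divided by the grand total.
joint = [[c / n for c in row] for row in counts]

# Marginal proportions: sums of the joint proportions over rows or columns.
row_marginal = [sum(row) for row in joint]
col_marginal = [sum(col) for col in zip(*joint)]

# Conditional proportions of Y within each row of X: cell count / row total.
conditional = [[c / sum(row) for c in row] for row in counts]

print("joint:      ", [[round(p, 3) for p in row] for row in joint])
print("row margins:", [round(p, 3) for p in row_marginal])
print("col margins:", [round(p, 3) for p in col_marginal])
print("Y given X:  ", [[round(p, 3) for p in row] for row in conditional])
```

Each row of `conditional` sums to 1, matching the (1.0) entries in the Total column of the table. In this sample, roughly 20% of people non-deprived in safe water are deprived in child mortality, against roughly 25% of those deprived in safe water.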
3. Independence of Deprivation Indicators
• Two deprivation indicators (categorical variables) are defined to be independent if all joint probabilities equal the product of their marginal probabilities.
• The hypothesis of independence is:
$$H_0: \pi_{ij} = \pi_{i+}\,\pi_{+j} \quad \text{for } i = 1,\ldots,I \text{ and } j = 1,\ldots,J$$
• The traditional approach to test this hypothesis is to:
1. calculate the expected counts ($\mu_{ij}$) under the null assumption of independence, and
2. compare the observed ($n_{ij}$) and expected counts ($\mu_{ij}$) using Pearson’s chi-square statistic ($\chi^2$).
3. Independence of Deprivation Indicators
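• When $H_0$ is true the expected values of $n_{ij}$, called expected frequencies, are:
$$\mu_{ij} = E(n_{ij}) = n\,\pi_{i+}\,\pi_{+j}$$
• Usually, $\pi_{i+}$ and $\pi_{+j}$ are unknown. Their ML (maximum likelihood) estimates are the sample marginal proportions:
$$\hat{\pi}_{i+} = \frac{n_{i+}}{n} \quad \text{and} \quad \hat{\pi}_{+j} = \frac{n_{+j}}{n}$$
• So estimated expected frequencies are:
$$\hat{\mu}_{ij} = \frac{n_{i+}\,n_{+j}}{n}$$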
3. Independence of Deprivation Indicators
• Then the $\chi^2$ test statistic is:
$$\chi^2 = \sum_i \sum_j \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}$$
• For large samples $\chi^2$ has approximately a chi-squared distribution with (I-1) x (J-1) degrees of freedom (Fisher, 1922).
• The P-value is approximated by:
$$P\left(\chi^2_{(I-1)(J-1)} \ge \chi^2_o\right)$$
where $\chi^2_o$ is the observed chi-square value. With $\alpha$ the chosen probability of a Type I error, if P-value < $\alpha$ we reject $H_0$ and conclude that X and Y are not independent.
Example: See xls. sheet
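The spreadsheet itself is not reproduced in this transcript. As a stand-in, here is a minimal sketch of the same test applied to the raw head counts of the safe water by child mortality table. It assumes a 2x2 table, so that the one-degree-of-freedom tail probability can be written with `math.erfc`; for larger tables a general chi-squared tail function would be needed.

```python
import math

# Raw head counts (g0 matrix): rows = safe water, columns = child mortality.
counts = [[323885, 82568],
          [66885, 21732]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Estimated expected frequencies under independence: n_i+ * n_+j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = sum(
    (counts[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Assumption: df == 1, where P(chi2_1 >= x) = erfc(sqrt(x / 2)).
assert df == 1
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.3g}")
```

With close to half a million observations the statistic is very large and independence is rejected at any conventional level, but the size of the statistic says little about how strong the association is.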
A brief detour…
• Before studying the association across categorical (deprivation) indicators, we shall recall that for interval-ratio variables the association is measured by the Pearson product-moment correlation coefficient r.
• This coefficient is a measure of the correlation (linear dependence) between two variables, giving a value between −1 and +1 inclusive, where ±1 indicates perfect correlation. A value of 0 indicates absence of linear correlation.
• It is defined as the covariance of the two variables divided by the product of their standard deviations. The population correlation coefficient is:
$$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\,\sigma_Y}$$
• and its sample counterpart is:
$$r_{XY} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\,\hat{\sigma}_X\,\hat{\sigma}_Y}$$
where $\hat{\sigma}_X$ and $\hat{\sigma}_Y$ are the sample standard deviations (computed with divisor n−1).
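A minimal sketch of the sample coefficient, for illustration only. The data below are hypothetical (they are not taken from the NFHS), and the function divides the sum of cross-products by the square root of the product of the sums of squares, which is algebraically the same as dividing by (n−1) times the two sample standard deviations.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical interval-ratio data: years of schooling and number of assets.
years_of_schooling = [0, 3, 5, 8, 10, 12]
number_of_assets = [1, 2, 2, 4, 5, 7]

r = pearson_r(years_of_schooling, number_of_assets)
print(f"r = {r:.3f}")
```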
4.a Traditional measures of association
• The $\chi^2$ statistic is a useful tool for testing independence across deprivation indicators, but does not provide any information about their degree of association (strength or sign).
• Traditional measures of association relying on the $\chi^2$ comprise:
• The mean square contingency:
$$\phi^2 = \frac{\chi^2}{n}, \quad \phi^2 \ge 0$$
• Coefficient of contingency (Pearson):
$$C = \sqrt{\frac{\chi^2/n}{1 + \chi^2/n}}, \quad 0 \le C < 1$$
4.a Traditional measures of association
• Cramér variant:
$$V = \sqrt{\frac{\chi^2/n}{\min\{(I-1),\,(J-1)\}}}, \quad V \in (0,1)$$
• which in the 2x2 case is:
$$V = \frac{n_{11}n_{22} - n_{12}n_{21}}{(n_{1+}\,n_{2+}\,n_{+1}\,n_{+2})^{1/2}}, \quad V \in (-1,1)$$
• C and V lie between 0 and 1 (the signed 2x2 form of V between −1 and 1), and take the extreme values under independence and “complete association”.
• Goodman and Kruskal (1979) point out that these “traditional” measures (functions of $\chi^2$) do not have an operational meaning. Hence the difficulty in comparing meaningfully the values of two different cross-tabs.
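To make these measures concrete, the sketch below evaluates them on the raw head counts of the safe water by child mortality table. It relies on the standard 2x2 shortcut for Pearson's chi-square, n(n11·n22 − n12·n21)² divided by the product of the four marginal totals; the variable names are illustrative.

```python
import math

# Raw head counts (g0 matrix): safe water (rows) by child mortality (columns).
n11, n12, n21, n22 = 323885, 82568, 66885, 21732
n = n11 + n12 + n21 + n22
row1, row2 = n11 + n12, n21 + n22
col1, col2 = n11 + n21, n12 + n22

# 2x2 shortcut for the Pearson chi-square statistic.
cross_diff = n11 * n22 - n12 * n21
chi2 = n * cross_diff ** 2 / (row1 * row2 * col1 * col2)

phi2 = chi2 / n                      # mean square contingency
C = math.sqrt(phi2 / (1.0 + phi2))   # Pearson's coefficient of contingency
V = math.sqrt(phi2 / min(2 - 1, 2 - 1))                      # Cramér's V, I = J = 2
V_signed = cross_diff / math.sqrt(row1 * row2 * col1 * col2)  # signed 2x2 form

print(f"chi2 = {chi2:.2f}, phi2 = {phi2:.5f}")
print(f"C = {C:.4f}, V = {V:.4f}, signed V = {V_signed:.4f}")
```

In this table all three coefficients are close to 0.04: the association is positive but weak, even though the chi-square test rejects independence decisively.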
4.b Comparing proportions: The odds
• The odds ratio is a measure of association that compares the relative chances of a “success” vis-à-vis a “failure” in a cross-tab.
• In the context of two deprivation indicators X and Y, corresponding to the uncensored deprivation matrix (g0), we denote the probability of not being deprived as $\pi$, and the probability of being deprived as $(1-\pi)$.
• The odds are defined to be:
$$\Omega = \frac{\pi}{1-\pi} = \frac{P(\text{non-deprived})}{P(\text{deprived})}$$
• The odds are nonnegative, with $\Omega > 1.0$ when being non-deprived (success) is more likely than being deprived (failure).
4.b Comparing proportions: The odds
• For instance: if $\pi = 0.75$, then $\Omega = 0.75/0.25 = 3.0$.
• This means that being non-deprived is three times as likely as being deprived.
• That is, we expect about 3 non-deprived households for each deprived one.
• Conversely, if $\Omega = 1/3$, being deprived is three times as likely as being non-deprived, and we expect about 1 non-deprived household for each 3 deprived ones.
• In the case of a 2x2 table, a measure of association compares the odds of being non-deprived instead of being deprived in indicator Y, within each (row) category of indicator X. This is the odds ratio.
4.b Comparing proportions: The odds ratio
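• The odds ratio or cross-product ratio of a 2x2 table is:
$$\theta = \frac{\Omega_1}{\Omega_2} = \frac{\pi_{1|1}/\pi_{2|1}}{\pi_{1|2}/\pi_{2|2}} = \frac{\pi_{11}\,\pi_{22}}{\pi_{12}\,\pi_{21}}, \quad \theta \ge 0$$
where $\Omega_i$ denotes the odds of being non-deprived in Y within row i of X.
• For cell counts $n_{ij}$, the sample odds ratio is:
$$\hat{\theta} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}}$$
• A value of $\theta = 1$ corresponds to independence of X and Y.
• Values of $\theta$ farther from 1, in a given direction, represent stronger association.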
4.b Comparing proportions: The odds ratio
• When $1 < \theta < \infty$, households identified as non-deprived in indicator X (row 1 of X) are more likely to be non-deprived in indicator Y, compared to households identified as deprived in indicator X (row 2 of X).
• For instance, when $\theta = 4$ the odds of being non-deprived in Y for households non-deprived in X are four times the odds for households deprived in X.
• For 2x2 tables Yule (1900, 1912) introduced the measure of association Q:
$$Q = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}}, \quad -1 \le Q \le 1$$
• Q relates to the odds ratio by:
$$Q = \frac{\theta - 1}{\theta + 1}$$
a monotone transformation of $\theta$ from the $(0, \infty)$ scale onto the $(-1, +1)$ scale.
Example: See xls. sheet
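The spreadsheet is not part of this transcript; the sketch below is a stand-in that applies the sample odds ratio and Yule's Q to the raw head counts of the safe water by child mortality table. Variable names are illustrative.

```python
# Raw head counts (g0 matrix): safe water (rows) by child mortality (columns).
n11, n12, n21, n22 = 323885, 82568, 66885, 21732

# Odds of being non-deprived in child mortality within each row of safe water.
odds_row1 = n11 / n12   # households non-deprived in safe water
odds_row2 = n21 / n22   # households deprived in safe water

# Sample odds ratio, written both as a ratio of odds and as a cross-product ratio.
theta_hat = odds_row1 / odds_row2
cross_product_ratio = (n11 * n22) / (n12 * n21)

# Yule's Q from the counts, and from the odds ratio.
Q = (n11 * n22 - n12 * n21) / (n11 * n22 + n12 * n21)
Q_from_theta = (theta_hat - 1.0) / (theta_hat + 1.0)

print(f"odds (row 1) = {odds_row1:.3f}, odds (row 2) = {odds_row2:.3f}")
print(f"odds ratio = {theta_hat:.4f}, Yule's Q = {Q:.4f}")
```

The odds ratio is above 1, so households non-deprived in safe water have somewhat higher odds of being non-deprived in child mortality; Q places the same information on the (−1, +1) scale.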
4.c The tetrachoric correlation coefficient
• The tetrachoric correlation coefficient is a measure of association applied under the assumption of an underlying joint continuum characterising two deprivation indicators X and Y.
• The desirability of assuming an underlying joint continuum was an issue of heated debate in the 1910s between Yule (1912) and K. Pearson & Heron (1913).
• According to Yule such an assumption was very frequently misleading and artificial.
• In contrast, Pearson and Heron argued that such an assumption was almost always justified and fruitful.
4.c The tetrachoric correlation coefficient
• We should note that assuming an underlying one-dimensional continuum establishes an ordering of the categories of the indicator.
• Agresti (2002) defines the tetrachoric correlation as: “[the] ML estimate of the correlation for a bivariate normal distribution assumed to underlie counts in a 2x2 table”.
See illustration of the bivariate normal distribution
• “It is the correlation value $\hat{\rho}$ in the bivariate normal density that would produce cell probabilities equal to the sample cell proportions when the density is collapsed to a 2x2 table having the same marginal proportions as the observed table” (p. 620).
• To compute $\hat{\rho}$ we shall use the STATA command tetrachoric.
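For readers without Stata, the definition quoted above can be illustrated with a rough standard-library sketch. This is not Stata's `tetrachoric` implementation and makes no claim to reproduce its output. It assumes all four cells are non-zero, sets the two thresholds from the marginal proportions, and searches by bisection for the correlation at which the bivariate normal probability of the first cell equals the observed proportion. The bivariate normal probability is obtained from the identity that its derivative with respect to the correlation equals the bivariate normal density at the thresholds, integrated with Simpson's rule. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    q = 1.0 - r * r
    expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * q)
    return math.exp(expo) / (2.0 * math.pi * math.sqrt(q))

def bvn_cdf(h, k, rho, steps=400):
    """P(Z1 <= h, Z2 <= k): Phi(h)Phi(k) plus the density integrated from 0 to rho."""
    std = NormalDist()
    base = std.cdf(h) * std.cdf(k)
    if rho == 0.0:
        return base
    width = rho / steps  # negative when rho < 0, giving a signed integral
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for s in range(1, steps):
        total += (4.0 if s % 2 else 2.0) * bvn_density(h, k, s * width)
    return base + total * width / 3.0  # composite Simpson's rule

def tetrachoric(n11, n12, n21, n22, tol=1e-10):
    """Correlation whose collapsed bivariate normal matches the 2x2 proportions."""
    n = n11 + n12 + n21 + n22
    std = NormalDist()
    h = std.inv_cdf((n11 + n12) / n)  # threshold on X from the row-1 margin
    k = std.inv_cdf((n11 + n21) / n)  # threshold on Y from the column-1 margin
    target = n11 / n
    lo, hi = -0.999, 0.999
    while hi - lo > tol:              # bvn_cdf is increasing in rho
        mid = (lo + hi) / 2.0
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Raw head counts: safe water (rows) by child mortality (columns).
rho_hat = tetrachoric(323885, 82568, 66885, 21732)
print(f"tetrachoric correlation (sketch) = {rho_hat:.4f}")
```

A quick check on the method: when both margins are split at 50%, the matching correlation has the closed form sin(2π(p11 − 1/4)), so a table with proportions 0.4, 0.1, 0.1, 0.4 should give about 0.809.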
4.c Example
[Figure: bar chart comparing the phi coefficient (N_phi) and the tetrachoric correlation (Tetra) for each pair of censored head count indicators, X vs Y. Vertical axis: tetra & phi, from 0.00 to 1.00. Horizontal axis: the 45 pairs formed from the indicators scho, enrol, chmor, nut, elect, sanit, wat, flo, fuel and assets, from scho_enrol to fuel_assets.]
4. Summary
• To study the association across pairs of (dichotomous) deprivation indicators in a multidimensional perspective we need to:
1. Test for independence:
• Use the χ² statistic
2. Measure the strength and direction of the association (if it exists) by comparing the following three measures:
• Cramér's V measure, in (-1, 1)
• Odds ratio, θ > 0
• Tetrachoric correlation coefficient, in (-1, 1) (keep in mind the assumption of bivariate normality).
A final note: Multidimensionality and association
• The study of the association across multidimensional indicators of deprivation is intrinsically related to a debate as to whether it is desirable for indicators to have a high or low association.
• Favour low association: The low redundancy across indicators justifies the need for a multidimensional measure. According to this perspective, high correlation is “bad”, as it signals redundancy, and the redundant indicator(s) could be dropped from the analysis (Ranis, Samman, and Stewart, 2006; McGillivray and White, 1993).
• Favour high association: High association is useful because it generates a robust measure. Low association is “undesirable” because it can give rise to unstable measures (Foster, McGillivray, and Seth, 2012).
• Our view: Think it through.
• Even if indicators are highly associated, if there is a normative view supporting the need to monitor both indicators for policy purposes, then both should be included in the analysis. However, it may be that their weights would be lower. If indicators have a low association, and if each is independently important, then both can be considered.