a measure of divergence among several populations …boos/library/mimeo.archive/isms_1972… · its...

A MEASURE OF DIVERGENCE AMONG SEVERAL POPULATIONS

By

William S. Cleveland and Peter A. Lachenbruch

State University of New York, Buffaloand

University of North Carolina, Chapel Hill, N. C.

Institute of Statistics Mimeo Series No. 838

AUGUST 1972

A Measure of Divergence Among Several Populations

by

William S. Cleveland(State University of New York, Buffalo)*

and

Peter A. Lachenbruch(University of North Carolina, Chapel Hill, N.C. 27514)

1. Introduction

Measures of distance between two populations, or more generally of

divergence of a populations, are widely used in statistics, for example

as inputs to clustering and multidimensional scaling procedures or in the

analysis of contingency tables. In this paper a measure of divergence of

a populations is defined to be the probability of correctly assigning an

item to one of the populations on the basis of a measurement X on the item,

using a particular classification rule. An important property of this

measure is that it possesses operational meaning by virtue of this classi-

fication interpretation. The argument of Goodman and Kruskal (1954,

p. 733-735) that operational meaning is important for measures of associa-

tion applies equally well to measures of divergence of populations.

The following is a brief summary of this paper. It will be supposed

except in the last section that X is a categorical variable taking values

labeled 1,2, ... ,6. Let TI denote the a-th population and let p (b) be thea a

probability X = b given the item is from population TI. In the case wherea

*Present address: Bell Telephone Laboratories, Inc., Murr?y Hill, N. J.07974 U.S.A.

2

the p (b) are known, the measure of divergence isa

-1asL max

b=l ap (b).

a

In the case where the p (b) are unknown and a sample is available thea

measure has the same form but with p (b) replaced by an estimate.a

A convenient way of displaying all pairwise divergences of the

populations is described and the definition of what it means for the

populations to be in order is given in terms of this display. Other

measures of divergence on the basis of a categorical variable for the

case a=2 which have appeared in the literature are described. It is

argued that the measure of divergence can be used to measure association

in a contingency table and comparisons are made with the Guttman-Goodman-

Kruskal measures of association. The divergence measure is used in the

analysis of three data sets: 1) Voting results in each state for Nixon,

Humphrey, and Wallace in the 1968 U.S. Presidential election. 2) The

distributions of birth weights of nonwhites and whites in the U.S. in

1950, 1955, 1960, and 1965. 3) Results of intelligence tests given to

5-year old children and their mothers. The paper is concluded by a few

remarks about the measure of divergence in the general case where X may

have any distribution.

2. The Measure of Divergence When the Distributionsof X are Known

Suppose an item is equally likely to come from anyone of the a

populations TIl, .•. ,TIa

and the Pa(b) are known. Suppose on the basis of

its X measurement the item is to be classified in one of the populations.

3

A sensible procedure (the optimal one if the goal is to maximize the

probability of a correct classification is, if X = b, to choose a TI fora

which p (b) is maximized over a. The measure of classification divergencea

of the a populations is defined to be the probability the item will be

correctly classified.

Let p be the axS matrix whose (a,b)-th element is p (b). Thusa

the a-th row of p is the distribution of X given the item is from TI .a

The probability of a correct classification will be denoted either by

f(p), showing the dependence on the probabilities p, or by f(TI1

, .•. ,TIa),

showing which populations are included.

An expression for f(p) will now be derived. Let ab

, for b=l, ..• ,S,

be an index such that

max p (b).a

a

It should be noted that, if there are two or more possible values for ab ,

f(p) does not depend upon which value is chosen in the above c1assifica-

tion procedure. Now

f(p) =al

a=lProb[Correct C1assificationiTI ]Prob[TI ].

a a

Since the assumption is that the item is equally likely to come from any

-1one of the populations, Prob[TI ] = aa

Prob[Correct C1assificationiTI ]a

Furthermore,

Sl Pa(b) <ab = a) ,

b=l

where <aba) is equal to 1 if ab a and 0 otherwise. Thus

4

-1 S ar(p) a I I Pa(b) <~ a}

b=l a=l

-1 Sa I pa (b)

b=l b

-1 13a I max p (b) .

b=l aa

The following list of facts sheds some light on the behavior of

r(p) .

Fact: The minimum value of r(p) is a-I This occurs when X has

the same distribution for each population; that is, the rows of pare

identical.

Fact: If 13 > a the maximum value of r(p) is 1. This occurs when

the a distributions of X given TI are concentrated on mutually disjointa

sets, that is, each column of p has at most one nonzero element. If

13 < a the maximum value of r(p) is 13/0., which occurs when there is a

subset of 13 populations, each having its own single X value which no

other population in the subset can take; that is, there is a SXS sub-

matrix of p with exactly one 1 in each column.

Fact: If

~ ] ,l-s

where 0 < s ~ 1, then r(p) = 2/3 does not depend on s. This is a little

upsetting. However, this phenomenon of total lack of dependence of r(p)

5

on the values of some row of p can only occur if 8<a.

Fact: Let p and q be ax8 matrices whose rows are distributions of

X. Suppose the rows of q lie in the convex region generated by the rows

of p. Intuitively one feels that the rows of q diverge less than those

of p. In fact, it is true that

f(q) < f(p).

To see this note that

q (b)a

awhere L

m=lc

am1 for each a=l, ••. ,a. Thus

so that

q (b)a

< maxa

p (b)a

max qa(b) < max Pa(b),a a

which verifies the fact.

Fact: If some of the categories of X are merged, the divergence

decreases or stays the same since max Pa(b1) + max p (b 2) > max p (b1)+p (b 2).a - a aa a aFor example suppose

[.2 0 .8 .: ].p0 .8 .1

Suppose the 2nd, 3rd, and 4th categories of X are merged so that the

resulting distributions are

6

q

Then .5 = f(p) > f(q) = .6. In some cases it will be desirable to use

f(p) directly to measure divergence. In others it may be desirable to

relocate and rescale f(p) so that it ranges from 0 to 1 as p ranges

over all possible values for fixed a and S. This new measure is

ll(p)-1

f(p) - a1 -1 .

a- min(a,S) - a

ll(p) obscures the operational meaning somewhat but may enable more

effective comparison of divergences of two groups of populations with

different values of a. For 1=2, ll(p) is a metric when the probability

distributions of X are viewed as points in Euclidean S-space, since

Sll(p) = ~ I 1Pl(b) - P2(b) I .

b=l

3. The Measure of Divergence When the PopulationDistributions of X are Unknown

Suppose the true population distributions of X are unknown but the

X values and population origins of a sample of items is available. Let

n (b) be the number of items from the a-th population whose X values area S

b. Let n (0) = I n (b). f(p) cannot be used to measure the distancea b=l a

of the populations since the p (b) are unknown. What is needed is a meaa

sure which is in the same spirit as f(p) but is based on the sample informa-

tion. This will be done, first from the sampling theory point of view and

then from the Bayesian point of view.

7

From the sampling theory point of view a sensible procedure would

be to choose an estimate of p (b) and use f(p) with the p (b) replaceda a

by the estimates, as a sample measure of divergence. An estimate of

Pa(b), and the one which will be used in this paper, is the sample

frequency

f (b)a

n (b)a

= -n::"(~."'""") •a

Thus, in this case the sample measure of divergence will be

-1CI.

f3L max

b=l af (b).

a

In some cases, particularly when the counts are low, one should consider

adding a flattening constant to the counts. (cf. (Fienberg and Halland,

1972), (Good, 1965, p. 24), (Mosteller, 1968, p. 23)).

It is mildly annoying that we must regard this sample measure of

divergence as an estimate of a probability rather than a probability it-

self. The situation can, however, be remedied by viewing from the

Bayesian angle. Suppose your prior distribution (it is an improper one)

is proportional to

f3II

b=l

-1[p (b)] •

a

Then the posterior is proportional to

S n (b) - 1II [Pa(b)] a

b=l

8

The probability a future item from population TI will have X=b is, givena

the sample, equal to f (b). The optimal rule for classifying a futurea

item with X=b, if you gain 1 from a correct classification and 0 other-

-1wise and if your prior probabilities are a for each TI , is to choose a

a

population which maximizes f (b) over a. Your probability the item willa

be correctly classified under this rule is the above sample measure of

divergence.

4. The Pairwise Divergence Triangle andPopulations Which are In Order

For most practical examples it will be informative to calculate,

the (a2

) divergences ~(TI.,TI.) between each distinct pair of populations1 J

from the set TIl, ... ,TIa . One way of displaying these numbers puts them in

a lower triangular matrix with the population labels along the main dia-

gonal. However a rotation of 45° produces a more congenial (at least to

the eye) diagram as in Display 6. We shall refer to this diagram as the

pairwise divergence triangle of the populations. The two populations

corresponding to a particular entry may be found by moving upward along

the two diagonals which meet at that entry. For example, in Display 6

In a number of practical examples we will want to check a particular

ordering of the populations, say TIl, .•. ,TIa

, to see if the pairwise

divergences are "in order". By this it is meant that for each i=l, ••• ,a,

~(TI.,TI.) increases as j runs from i+l to a and increases as j runs from1 J

i-I to 1. In terms of the pairwise divergence triangle this means the

divergences increase as you move down anyone of the 2a diagonals. For

example, the pairwise divergence triangles in Displays 4 and 6 are both

9

in order, whereas the triangle in Display 7 is not in order.

The typical case where one expects the populations to be in order

is where the categorical variable represents the grouping of measure-

ments of a numerical variable. This is illustrated in the examples of

Section 7.

5. Other Measures of Divergence for the Case a=2

Rao (1952, p. 352) suggests a measure of distance which is also

based on the probability of a correct classification. However his

classification rule is the minimax strategy rather than the rule used

here which maximizes the probability of a correct classification. For

example, if S=2 and

l-S]l-t

where 0 2 t 2 s 2 1, then Rao's measure is s:t' whereas f(p)

Gini (1914-1915) has suggested using

S~(p) = ~ L !Pl(b) - P2(b) I

b=l

~(s+1-t).

in contingency tables to measure the distance between two columns (or

two rows), where Pl(b) is the conditional distribution in one column and

P2(b), the conditional distribution in the other. It does not appear,

however, that Gini discussed the classification interpretation of the

measure.

Over the past few years there has been discussion regarding distance

10

between populations based on gene frequencies by several authors.

Balakrishnan and Sangvhi (1968) proposed a measure based on the X2

statistic

s.+l

I !j=l k=l

where j indexes the character and the j-th character has s.+l states.J

2 -1Kurzynski (1970) introduced a measure Dk = (Pl-PZ)'S (Pl-PZ) analogous

to the Mahalanobis D2• Edwards and Cavalli-Sforza (cf. Edwards (1971))

proposed a measure based on the angular transformation

If there are several independent loci (categorical variables) they propose

adding the E2 values.

There is considerable difficulty in interpreting these last three

measures. We are not able to develop a feeling for a distance between

two sets of probabilities calculated in the ways that have been proposed,

primarily due to the lack of any operational meaning. D2 is quite subjectk

to major problems if the proportions are far apart. In that case, no

sensible estimate of S is obtainable since the individual cells will

have different variances. The maximum of E2 depends on the number of

states which makes comparisons difficult.

We would suggest that the measuresG2 and E2 might be improved bys

modifying their treatment of several categorical variables assumed to be

independent. Instead of calculating the distances from the marginals and

11

summing it would seem more reasonable to estimate cell probabilities by

the products of the marginals and treat the computations as one large

multinomial.

It has been called to our attention that a measure similar to the

one proposed here has recently been developed by Powell, Levene, and

Dobzhansky. It is to appear in Evolution in 1973.

6. A Measure of Predictive Association

Suppose A and B are two categorical variables whose joint distri

bution is denoted

rab

Pr[A=a, B=b]

for a=l, .•. ,a and b=l, .•. ,13. Let r denote the marginal distributiona·

of A and r. b , the marginal of B. Let r be the axl3 matrix whose elements

are rab

• Let r[BIA] be the ax 13 matrix whose (a,b)-th element is rab

.;- ra·

Thus the rows of r[BIA] are the a conditional distributions of B given A.

~(r[BIA]) measures the divergence among the a conditional distribu

tions of B given A and therefore may be regarded as a measure of dependence

of B on A. It is defined provided no row of r contains all zeros. It

ranges between 0 and 1 and is 0 if and only if A and B are independent,

that is, r ab = ra.r. b • ~(r[BIA]) is equal to the Guttman-Goodman-Kruskal

(cf. (Goodman and Kruskal, 1954, Section 5.1» measure of dependence of

B on A, if and only if the marginal of B has equal probabilities.

Now

nCr) ~~(r[AIB]) + ~~(r[BIA])

12

is a measure of classification association between A and B. The follow-

ing facts about D(r) give some insight into its behavior.

Fact: D(r) is defined provided no row or no column of r contains

all zeros.

Fact: For fixed a and S, D(r) ranges between 0 and 1, as r ranges

over all possible distributions.

Fact: D(r) = 0 if and only if A and B are independent.

Fact: If a~S, D(r) = 1 if and only if there is exactly one nonzero

probability in each row of r (and no column has all zero probabilities).

For example, if

r = [ 1~31/3

D(r) = 1. A similar statement holds when S~a.

Let A(r) denote the Guttman-Goodman-Kruskal (cf. Goodman and Kruskal

(1954, Section 5.2)) measure of association between A and B. In our nota-

-1 -1 -1-1tion, A(r) = ~(E max rab-a )/(l-a ) + ~(E max rab-S )/(l-S ). D(r)

b a a bhas the advantage that it is 0 if and only if A and B are independent,

which is not true of A(r). A(r) sometimes gives results that are at

first sight quite surprising. Consider, for example, the table

r = [ l_3xlO-lO

10-10

Intuitively one might feel there is a high degree of association. Yet

A(r) = 0, whereas D(r) % .5.

13

One advantage of A(r), however, is that it is not as sensitive as

~(r) to a single observation in a row (or column) with a small number of

counts. For example, consider the two contingency tables

and :] .

For both these tables A is zero whereas ~ is zero for the first and .5

for the second.

The lesson in all this (it has been taught many times) is that no

single measure of association will do everything one wants it to.

7. Examples

Example 1: 1968 Presidential Election.

The World Almanac (1972, p. 718) gives the number of votes cast in

each of the 51 states of the U.S. (including the District of Columbia)

in the 1968 Presidential election for each of the three candidates,

Humphrey, Nixon, and Wallace. The items will be the voters, the three

populations will be the three sets of each candidates voters, and the

state in which an item voted will be the categorical variable X. Thus in

studying the divergance measures we shall be studying how the candidates

differed according to how their vote was distributed among 51 states.

It should be noted that these measures are independent of the total number

of votes cast for each candidate.

The various divergences are shown in Display 1. The three candidates

appear to diverge a moderate amount. Nixon and Humphrey are substantially

closer together than either is to Wallace, not a surprising result.

14

Example 2: Birth Weight Distributions.

Display 2 exhibits birth weight distributions of eight populations

in the U.S. which are cross classified according to race and year. The

data was obtained from Chase and Byrnes, (1972, p. 40-41).

Display 3 shows six interesting divergences. The divergence

~(TI5' TI6 , TI7 , TIS) of the four nonwhite distributions is about six times

as large as the divergence ~(TIl' TI 2 , TI3 , TI4 ) of the four white distribu

tions. A close look at Display 2 reveals a definite pattern in the non

white distributions; the birth weights are shifting toward the left

(getting lower) through time. The divergences between the two populations

white and non-white for each of the four years show that, in addition, the

nonwhite distributions are shifting away from the white distributions.

Display 4 gives the pairwise divergence triangle for the four non

white populations. It is interesting to note that the populations are

in order, that is, the divergences increase as you move down any diagonal.

Furthermore, as you move from left to right along any row of the triangle,

the values decrease. This is an indication that the change through time

in the nonwhite distributions is slowing down.

Example 3: Intelligence Testing of Mothers and Children.

Barber (1972) has reported the results of an I.Q. test given to 92

5-year-old children and an intelligence test given to their mothers. The

data is shown in Display 5.

The X2 statistic is 2S.03, which as a statistical test for the

existence of association between the two intelligence tests is near the

.05 level of significance. But this seems to tell us little. That we

should be suspicious of a significance level based on asymptotic results

15

when so many low counts occur is almost beside the point. It seems point

less to ask if association exists when we feel so certain that it must.

It would only be a question of gathering enough data to prove such

existence. The interesting questions would seem to be what is the

strength of the association, what is the pattern of the association, and

how well does the data measure these two items.

The divergence of the four populations of mothers when the cate

gorical variable X is taken to be the child's test score category is

The divergence of the seven populations of children when the categorical

variable X is taken to be the mother's test score category is

Thus the measure of association Q is .32, but again we should be some

what skeptical of the precision of this value in view of the low counts

in several cells.

Display 6 gives the pairwise divergence triangle for the four

mother populations TI l , ..• ,TI4 and Display 7 for the seven child popula

tions 81

, ..• ,87

, We would fully expect the true population pairwise

divergence triangles to be in order. This in fact is the case in

Display 6 but not in Display 7. The pairwise divergences involving either

81 or 87 are somewhat erratic, presumably due to the small number of

counts in each of these populations. Merging 81 with 82 and 86 with 87

seemed therefore sensible. The new child populations will be denoted

by 8i, ... ,8;.

16

Display 8 shows the pairwise divergences between the mother popu-

lations after the merging. The populations are still in order. As

predicted by a Fact in Section 2 each value in this new triangle is

less than or equal to the corresponding value in the old triangle in

Display 6. Display 9 shows the pairwise divergences between the child

populations ei, ... ,eS• This new triangle is now in order except for

one near miss. .23 = ~(e3,e4) is slightly larger than .22 = ~(e2,e4).

Thus the merging seems justified.

We now have that

so that n, the measure of association between the mothers' test scores

and the children's test scores is .24.

8. The Measure of Divergence for Any Distribution of X

The measure of divergence can easily be extended, in a manner

entirely analogous to the categorical case, to allow for any distribution

of X. If d (x) is the density of X with respect to a measure m, givena

the item is from the a-th population, then the measure of divergence,

f, is

maxa

d (x)dm.a

Suppose a=2, and dl(x) and d2(x) are multivariate normal densities with

the same covariance matrix, then

~ 2 ¢(D/2) - 1

17

where D2 is the Mahalanobis distance function (cf. Rao (1952, p. 354»

and ~ is the standard normal distribution function. Thus r is a monotone

function of D2 in this case.

Suppose 1=2 and d (x) = A exp - A x for a=1,2. If Al > A2 thena a a

where c

-A c1

- e

Acknowledgement: One of us was supported by research career developmentaward HD-46344. w. S. Cleveland was supported by the Officeof Naval Research.

Display 1.

18

Divergences for Three Candidates in 1968 Presidential Election.

Populations r b.

Humphrey-Nixon-Wa11ace .48 .22

Humphrey-Nixon .54 .08

Humphrey-Wallace .68 .36

Nixon-Wallace .69 .38

19

Display 2. Percentage Distribution of Birth Weights for Eight Populations.

Birth Weight (Grams)

Populations Below 2001 2001-2500 2501-3000 3001-3500 3501-4000 Above 4000

White

(TIl) 1950 2.3 4.8 17 .8 38.3 27.5 9.3

(TI2

) 1955 2.4 4.6 17 .5 38.4 28.0 9.3

(TI3

) 1960 2.3 4.5 17.3 38.1 28.2 9.6

(TI4

) 1965 2.4 4.8 18.3 38.5 27.1 8.9

Nonwhite

(TI5

) 1950 3.5 6.8 21.4 35.4 22.8 10.2

(TI6

) 1955 4.0 7.6 23.7 36.7 20.6 7.4

(117

) 1960 4.7 8.3 25.3 37.1 18.9 5.9

(TI8

) 1965 5.0 8.8 26.2 37.3 17 .8 4.9

Display 3. Various Divergences for the Birth Weight Distributions.

20

Populations r b.

Whites (TTl' TT2

, TT3

, TT4

) .255 .006

Nonwhites (TT5

, TT6 , TT7

, TTS) .276 .034

White and Nonwhite in

1950 (TTl' TT5

) .53S .077

1955 (TT2

, TT6

) .554 .10S

1960 (TT3 ' TT7

) .571 .141

1965 (TT4 ' TTS

) .573 .145

Display 4. Pairwise Divergence Triangle for Nonwhite Birth WeightDistributions TIS' •.• ,TIs .

21

1

.OSO

2

.032

3

.021

4

.OS2 .OS3

.103

Display 5. Cross Classification of Child-Mother Scores onIntelligence Tests.

22

Mothers' Scores

Children's Scores (TTl) Below 60 (TT2) 60-69 (TT3

) 70-79 (TT4 ) Above 79

(81) Below 85 3 1 2 0

(82

) 85-94 1 2 0 1

(83

) 95-104 5 5 9 3

(84

) 105-114 2 6 6 3

(85

) 115-124 2 4 10 9

( 86

) 125-134 0 2 5 6

(87

) Above 134 0 1 0 4

Display 6. Pairwise Divergence Triangle for the Mother PopulationsTTl' ••• ,114 •

23

1

.33

2

.24

3

.30

4

.35 .40

.58

Display 7. Pairwise Divergence Triangle for the Child Populationse1 ,···,e7 •

1 2 3 4 5 6 7

.58 .41 .17 .23 .10 .39

.27 .35 .22 .32 .48

.38 .51 .33 .62

.43 .60 .66

.51 .55

.83

24

Display 8.Populations

Pairwise Divergence Triangle After Merging for the MotherTIl' ••• ,TI4 •

25

1

.31

2

.18

3

.26

4

.35 .40

.58

Display 9.Populations

Pairwise Divergence Triangle After Merging for the Childei,···,e~ .

26

1

.25

2

.28

.17

3

.22

.23

4

.38

.20

5

.46 .42

.53

27

References

Balakrishnan, V. and Sanghvi, L. D., "Distance between populations onthe basis of attribute data", Biometrics ~ (1968), 859-865.

Barber, C. Renate, "A note on the results of intelligence testingduring a longitudinal survey", Journal of Biosocial Science 4(1972), 37-39.

Chase, Helen C., and Byrnes, Mary E., "Trends in 'Prematurity.'United States: 1950-67", Vital and Health Statistics AnalyticStudies (1972), Series 3, Number 15, U.S. Department of Health,Education, and Welfare.

Edwards, A. W. F., "Distances between populations on the basis of genefrequencies", Biometrics !J... (1971), 873-881.

Fienberg, Stephen E., and Holland, Paul W., "On the choice of flatteningconstants for estimating multinomial probabilities", Journal ofMultivariate Analysis l (1972), 127-134.

Gini, Corrado, "Di una misura della dissomiglianza tra due gruppi diquantita e delle sue applicazioni allo studio delle relazionistatistiche", Atti Del Reale Istituto Veneto di scienze, lettereed arti, Tomo LXXIV, Parte seconda (1914-1915), 185-213.

Good, Irving John, The Estimation of Probabilities (1965), The M.I.T.Press, Cambridge.

Goodman, Leo A. and Kruskal, William H., "Measures of association forcross-classifications", Journal of the American StatisticalAssociation 49 (1954), 732-764.

Kurczynski, T. W., "Generalized distance and discrete variable",Biometrics ~ (1970),525-534.

Mosteller, Frederick, "Association and estimation in contingency tables",Journal of the American Statistical Association ~ (1968), 1-28.

Rao, C. R., Advanced Statistical Methods in Biometric Research (1952),Wiley, New York.

The World Almanac (1972), Luman H. Long, editor, Newspaper EnterpriseAssociation, Inc., New York.

a measure of divergence among several populations …boos/library/mimeo.archive/isms_1972… · its...

Documents