flexible parametric bootstrap for testing homogeneity ... · overview and introduction example 1:...

75
Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone patients Discussion Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters Christian Hennig 4 December 2014 Christian Hennig Flexible parametric bootstrap

Upload: others

Post on 17-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Flexible parametric bootstrap for testinghomogeneity against clustering and

assessing the number of clusters

Christian Hennig

4 December 2014

Christian Hennig Flexible parametric bootstrap

Page 2: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Overview◮ The idea◮ Example 1: social stratification (mixed type data)◮ Example 2: species distribution ranges

(spatial dependence, BIC)◮ Example 3: methadone patients

(Markov chain, visual testing)◮ Discussion

Christian Hennig Flexible parametric bootstrap

Page 3: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

The general idea is not new (Jain and Dubes 1988),though rarely seen in practice:

For testing H0 : “no real clustering structure”,define null model, simulate distribution of validation index.

Christian Hennig Flexible parametric bootstrap

Page 4: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

The general idea is not new (Jain and Dubes 1988),though rarely seen in practice:

For testing H0 : “no real clustering structure”,define null model, simulate distribution of validation index.

In the literature:simple null models like Gaussian, uniform,random dissimilarities, permutation based.

Christian Hennig Flexible parametric bootstrap

Page 5: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

An additional difficulty

Often there is “non-clustering” structure in the dataand standard homogeneity models are easily rejectedfor reasons other than true clustering.So need model non-clustering structure.

Christian Hennig Flexible parametric bootstrap

Page 6: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

An additional difficulty

Often there is “non-clustering” structure in the dataand standard homogeneity models are easily rejectedfor reasons other than true clustering.So need model non-clustering structure.

Cluster definition may not be based on probability models.E.g., may not identify clusters with mixture components,may want clusters based on low within-cluster distances,still want to distinguish clustering from model-basedhomogeneity.

Christian Hennig Flexible parametric bootstrap

Page 7: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Example 1: social stratification

Hennig & Liao (2013) looked for evidence for social stratain 2007 US Survey of Consumer Financesn = 17,430, mixed type variables:

◮ log savings amount (continuous),◮ log income (continuous),◮ years of education (ordinal/count),◮ number of checking accounts (ordinal/count),◮ number of savings accounts (ordinal/count),◮ housing (nominal but with more structure),◮ life insurance (binary),◮ occupation (nominal/ordinal).

Christian Hennig Flexible parametric bootstrap

Page 8: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Tried latent class mixture, but preferredpam (Kaufman and Rousseeuw 1990)because tailor-made distance measure appropriateinvolving variable weights andflexible handling of categorical structure.

Also clusters should be characterized bylow distance within clusters.

pam:

argmin{C1,...,Ck} partition of xn, x̃1∈C1,...,x̃k∈Ck

n∑

i=1

minj

d(x i , x̃ j).

Christian Hennig Flexible parametric bootstrap

Page 9: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Mixture components vs. low distance clusters

1

1

111 1

1

1

1

1

11

11

1

1

1

1

1

1

1

1

1

1

1

1

111

1

11

1

1

11

1

1

11

1

1

1

1

11

1

1

1

1

1

1

1

1

1

1

1

1

1

11

1

1

1

1

1

1

1

1

1

1

11

1

1

11

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

11

1

1

1

1

1

1

1

1

11

1

1

1

1

1

11

1

1

11

1

1

1

1

1

111

1

11

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

11

1

1

1

1

1

1

1

1

1

11

1

1

1

11

11

1

11

1

1

1

1

1

11

1

1

1

1

1

1

11

1

1

1

1

1

1

1

2 2222

22

22

2

2

2

222

222

22

2222 22

222 22222222 2

2 2222

2

2

22 22 222

222 2

222

22

22

222 2

2

222

22 222

22

22 222

222222

22

222222

2

2

2

3

33

3

3

3

3

3

3

3

3

3

3

3

3

3

3

33

3

3

3

3

3

3

3

3 33

3

3

3

3

3

3

3

3

3

3

3

3 3

33

3

33 3

3

3

3

3

3

33

3

3

3

3

33

3

3

3 3

3

3

3

3

3

3

3

3

3

3

3

3

33

3

3

3

3

3

3

3

33

3

3

3

3

3

33

3

3

3

3

3

1 2 3 4 5 6 7

12

34

56

x

y 1

2

333 3

2

1

1

3

33

11

4

3

4

5

5

4

3

5

3

6

3

4

333

3

66

2

3

53

5

3

66

5

3

1

4

22

3

1

3

1

5

1

6

4

1

2

3

1

4

66

2

5

6

3

1

2

2

5

6

4

33

2

5

33

3

6

3

1

2

6

3

4

3

6

2

4

2

1

5

3

6

1

2

6

3

2

6

2

33

5

1

5

3

2

2

6

3

61

6

6

4

1

5

66

4

5

66

1

4

1

3

2

553

2

11

5

6

3

5

4

1

5

6

4

6

4

1

2

1

3

5

1

3

4

2

44

5

3

5

1

4

1

3

3

6

11

1

2

1

66

11

2

55

2

5

4

3

1

46

3

1

3

1

3

6

33

2

5

3

4

2

3

5

7 7777

77

77

7

7

7

777

777

77

7777 77

777 77777777 7

7 7777

7

7

77 77 777

777 7

777

77

77

777 7

7

777

77 777

77

77 777

777777

77

777777

7

7

7

2

22

8

2

2

8

2

8

3

2

2

8

2

8

2

8

88

8

2

8

8

8

2

8

2 28

8

8

2

2

8

2

2

8

8

8

2

8 8

88

2

552

8

2

2

8

2

22

8

2

8

2

88

2

3

2 2

8

2

2

8

8

2

2

8

2

8

8

8

28

2

2

8

8

8

2

8

22

8

8

2

8

2

88

2

8

8

2

2

1 2 3 4 5 6 7

12

34

56

x

y

Christian Hennig Flexible parametric bootstrap

Page 10: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Standard recommendation for estimating k :

Average silhouette width (ASW)(Kaufman and Rouseeuw 1990):sw(i , k) = b(i ,k)−a(i ,k)

max(a(i ,k),b(i ,k)) ,

a(i , k) =1

|Cj | − 1

x∈Cj

d(x i , x), b(i , k) = minx i 6∈Cl

1|Cl |

x∈Cl

d(x i , x).

Maximum average sw ⇒ optimal k .

Formalises good separation from neighbouring clustercompared to cluster to which object belongs.

Christian Hennig Flexible parametric bootstrap

Page 11: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

2 4 6 8 10

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Number of clusters

Ave

rage

silh

ouet

te w

idth

Looks like k = 2 but is there any clustering at all?

Christian Hennig Flexible parametric bootstrap

Page 12: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Social stratification: structure introduced bymixing categorical, ordinal, continuous information;categorical variablescarrying stronger than categorical information.

Christian Hennig Flexible parametric bootstrap

Page 13: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Social stratification: structure introduced bymixing categorical, ordinal, continuous information;categorical variablescarrying stronger than categorical information.

Latent class, spherical clustering (pam,k-means)imply approximate independence within clusters.

Can clustering in data be explained bysimple dependence and category structure alone?

Christian Hennig Flexible parametric bootstrap

Page 14: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Null model for “structure but no clustering”:◮ Latent multivariate Gaussian, general covariance matrix.◮ Ordinal variables by categorising with quantiles.◮ Assume nominal variables ordinal with unknown order.

Christian Hennig Flexible parametric bootstrap

Page 15: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Null model for “structure but no clustering”:◮ Latent multivariate Gaussian, general covariance matrix.◮ Ordinal variables by categorising with quantiles.◮ Assume nominal variables ordinal with unknown order.

Estimation:◮ Impose correlation-based order on nominal variables

(order by average correlation of category dummy variableswith ordinal and continuous variables).

◮ Compute polychoric correlation matrix (Drasgow 1986),using 10 equally-sized categories for continuous.

Christian Hennig Flexible parametric bootstrap

Page 16: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Parametric bootstrap. Repeat m times:◮ Generate multivariate Gaussian.◮ Transform to original marginal distributions

for ordinal/categorical variables.◮ Compute distances same way as for original data.◮ Cluster for k = 2, . . . , kmax by pam.◮ Compute ASW.

Christian Hennig Flexible parametric bootstrap

Page 17: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Parametric bootstrap. Repeat m times:◮ Generate multivariate Gaussian.◮ Transform to original marginal distributions

for ordinal/categorical variables.◮ Compute distances same way as for original data.◮ Cluster for k = 2, . . . , kmax by pam.◮ Compute ASW.

Need parametric bootstrap, not nonparametric,because nonparametric bootstrap reproducesclustering structure, doesn’t model homogeneity.

Christian Hennig Flexible parametric bootstrap

Page 18: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

2 4 6 8 10

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Number of clusters

Ave

rage

silh

ouet

te w

idth

Significant for larger k , not for “optimal” k = 2!

Christian Hennig Flexible parametric bootstrap

Page 19: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

A subtle detail:Emulate marginal distribution for categorical variablesbecause marginal distribution alonedoesn’t indicate clustering.

Do not emulate marginal distributionfor continuous variables(because non-unimodal distribution indicates clustering).

For ordinal variables it depends on interpretation.

Christian Hennig Flexible parametric bootstrap

Page 20: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

But. . .Gaussian distribution for continuous variablesmay still be over-simplistic,and big proportion of zero savingsmay not seen as “clustering” feature

Christian Hennig Flexible parametric bootstrap

Page 21: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

But. . .Gaussian distribution for continuous variablesmay still be over-simplistic,and big proportion of zero savingsmay not seen as “clustering” feature

Significant ASW may still be caused bynon-clustering structure.

Christian Hennig Flexible parametric bootstrap

Page 22: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

−1 0 1 2 3

−4

−2

02

4

standardised log savings

stan

dard

ised

log

inco

me

Christian Hennig Flexible parametric bootstrap

Page 23: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

More sophisticated “non-clustering”model for savings and income:Zero with probability p0 and unimodal otherwise.

Christian Hennig Flexible parametric bootstrap

Page 24: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

More sophisticated “non-clustering”model for savings and income:Zero with probability p0 and unimodal otherwise.

◮ Estimate p0 for savings and income,

Christian Hennig Flexible parametric bootstrap

Page 25: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

More sophisticated “non-clustering”model for savings and income:Zero with probability p0 and unimodal otherwise.

◮ Estimate p0 for savings and income,◮ look at log variable conditional on untransformed > 0,◮ estimate kernel density for smallest bandwidth

giving unimodality.

Christian Hennig Flexible parametric bootstrap

Page 26: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

More sophisticated “non-clustering”model for savings and income:Zero with probability p0 and unimodal otherwise.

◮ Estimate p0 for savings and income,◮ look at log variable conditional on untransformed > 0,◮ estimate kernel density for smallest bandwidth

giving unimodality.◮ Still simulate latent multivariate Gaussian based on

polychoric correlation, transformed to 10 categories,◮ transform to zero/unimodal through F−1(Φ(X )).

Christian Hennig Flexible parametric bootstrap

Page 27: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

2 4 6 8 10

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Number of clusters

Ave

rage

silh

ouet

te w

idth

k = 8 still significant, much better now than k = 6.k = 3 is a candidate, too.

Christian Hennig Flexible parametric bootstrap

Page 28: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

What have we learnt?◮ There is significantly more clustering structure

(as measured by ASW) in the data than in null model.

Christian Hennig Flexible parametric bootstrap

Page 29: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

What have we learnt?◮ There is significantly more clustering structure

(as measured by ASW) in the data than in null model.◮ “Maximize ASW” isn’t appropriate rule for k .

ASW should be compared with what can be expectedunder null model.

Christian Hennig Flexible parametric bootstrap

Page 30: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

What have we learnt?◮ There is significantly more clustering structure

(as measured by ASW) in the data than in null model.◮ “Maximize ASW” isn’t appropriate rule for k .

ASW should be compared with what can be expectedunder null model.

◮ k = 8 looks better than k = 6 because ASWis about the same, but expectation is lower.

Christian Hennig Flexible parametric bootstrap

Page 31: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

What have we learnt?◮ There is significantly more clustering structure

(as measured by ASW) in the data than in null model.◮ “Maximize ASW” isn’t appropriate rule for k .

ASW should be compared with what can be expectedunder null model.

◮ k = 8 looks better than k = 6 because ASWis about the same, but expectation is lower.

◮ “Zero plus unimodal” vs. Gaussian modelimproves ASW, but still clustering at k = 8 significant.

Christian Hennig Flexible parametric bootstrap

Page 32: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Formal rules. . . for testing homogeneity, and estimating k :

1. Tests for each k :◮

a+1r+1 , a “data has lower ASW”-runs, r all runs,

◮ or based on approximate normality, mean(ASW), sd(ASW).

Christian Hennig Flexible parametric bootstrap

Page 33: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Formal rules. . . for testing homogeneity, and estimating k :

1. Tests for each k :◮

a+1r+1 , a “data has lower ASW”-runs, r all runs,

◮ or based on approximate normality, mean(ASW), sd(ASW).

2. Aggregate tests for all kto a single clustering test,e.g., averaging ASW or rank ASW over k .

Christian Hennig Flexible parametric bootstrap

Page 34: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Formal rules. . . for testing homogeneity, and estimating k :

1. Tests for each k :◮

a+1r+1 , a “data has lower ASW”-runs, r all runs,

◮ or based on approximate normality, mean(ASW), sd(ASW).

2. Aggregate tests for all kto a single clustering test,e.g., averaging ASW or rank ASW over k .

3. Define rule for estimating k ,e.g., maximise (ASW-mean(ASW))/sd(ASW) under H0.

Christian Hennig Flexible parametric bootstrap

Page 35: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Formal rules. . . for testing homogeneity, and estimating k :

1. Tests for each k :◮

a+1r+1 , a “data has lower ASW”-runs, r all runs,

◮ or based on approximate normality, mean(ASW), sd(ASW).

2. Aggregate tests for all kto a single clustering test,e.g., averaging ASW or rank ASW over k .

3. Define rule for estimating k ,e.g., maximise (ASW-mean(ASW))/sd(ASW) under H0.

Exploratory power of image is more impressive.

Christian Hennig Flexible parametric bootstrap

Page 36: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

An example with Gaussian mixtures and BIC:snail species distribution ranges

Clustering 80 snail speciescharacterised by presence-absence dataon 34 Aegean islands (Cyclades)(Hausdorf and Hennig 2004)

Christian Hennig Flexible parametric bootstrap

Page 37: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Christian Hennig Flexible parametric bootstrap

Page 38: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Use Kulczynski-dissimilarity between ranges,MDS, Gaussian mixture clustering with noise (mclust).BIC picks k = 8 and noise.

N

1

N

2

2

8

6

N7N

2

N

N

4

1

6 6

8

4

N

N

N

4

1

4

N

6

5

3

1

3

5N

6

1

N

N

1

N

8

1

23

3

5

N

1

3

2

1

7

N

N

4

5

3

7

4

N

N

4

1

5

7

N3

2N2

3

N

1

7

4

N

N2

5

N

6

−0.6 −0.4 −0.2 0.0 0.2 0.4

−0.

4−

0.2

0.0

0.2

0.4

mds[,1]

mds

[,2]

Christian Hennig Flexible parametric bootstrap

Page 39: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Using a homogeneous Gaussian as null model:

2 4 6 8 10

−20

00

200

400

600

Number of clusters

BIC

Under H0, BIC usually picks k = 1 as expected,clustering significantbut small sample effects for large k .

Christian Hennig Flexible parametric bootstrap

Page 40: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

However, there is spatial structure.Species are more often on neighbouring islands.

Model underlying presence-absence datausing neighbourhood list of islands:

Null model parameters:◮ species width distribution,◮ attraction (species frequency) per region,◮ parameter pn governing autocorrelation.

Christian Hennig Flexible parametric bootstrap

Page 41: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Algorithmic null model: For every species

1. draw species width from empirical distribution,

2. draw starting region proportional to attraction,

3. with probability pn put it in neighborhoodof previous occurrences,otherwise outside neighborhood.(If impossible, put somewhere randomly.)

4. Go to 1 until width exhausted.

Christian Hennig Flexible parametric bootstrap

Page 42: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Empirical distribution of species on islands

Christian Hennig Flexible parametric bootstrap

Page 43: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Draw species size (3) and initial island

Christian Hennig Flexible parametric bootstrap

Page 44: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

With probability p_n, new island is neighbour

Christian Hennig Flexible parametric bootstrap

Page 45: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Christian Hennig Flexible parametric bootstrap

Page 46: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

With probability 1−p_n, new island is non−neighbour

Christian Hennig Flexible parametric bootstrap

Page 47: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Continue until species size is reached

Christian Hennig Flexible parametric bootstrap

Page 48: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Estimation of autocorrelation parameter pn:

Simulate empirical disjunction probability q̂j from H0

for different pn. Compute linear regression.Find p̂n for original data disjunction probability.

Christian Hennig Flexible parametric bootstrap

Page 49: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

From null data, compute MDS, fit Gaussian mixture, BIC.

2 4 6 8 10

020

040

060

080

010

00

Number of clusters

BIC

Christian Hennig Flexible parametric bootstrap

Page 50: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

From null data, compute MDS, fit Gaussian mixture, BIC.

2 4 6 8 10

020

040

060

080

010

00

Number of clusters

BIC

Increasing BIC is expected under null model,clustering is no longer significant!

Christian Hennig Flexible parametric bootstrap

Page 51: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Example 3: methadone patients(joint work with Chien-Ju Lin)

Data from 314 methadone users (heroine addicts)one of six dosages taken over 180 days(and missing values).

Defined dissimilarity and comparedpam, average, complete linkage with ASW andPrediction Strength (Tibshirani and Walther 2005).

Christian Hennig Flexible parametric bootstrap

Page 52: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Prediction Strength:◮ Split dataset in two halves (m times).◮ Cluster both parts.◮ Use one half to predict cluster memberships

of the other halfby assigning every point of part 2 to closest mean of part 1.

Christian Hennig Flexible parametric bootstrap

Page 53: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Prediction Strength:◮ Split dataset in two halves (m times).◮ Cluster both parts.◮ Use one half to predict cluster memberships

of the other halfby assigning every point of part 2 to closest mean of part 1.

Statistic: proportion of correctly predictedco-memberships in clustering 2 of pairs of points.

T& W suggest to choose smallest k with PS> 0.8.

Christian Hennig Flexible parametric bootstrap

Page 54: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

−5 0 5

−4

−2

02

46

xc1[,1]

xc1[

,2]

−5 0 5

−4

−2

02

46

xc2[,1]

xc2[

,2]

−5 0 5

−4

−2

02

46

xc1[,1]

xc1[

,2]

−5 0 5

−4

−2

02

46

xc2[,1]

xc2[

,2]

Christian Hennig Flexible parametric bootstrap

Page 55: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Is there real clustering? How many clusters?

Christian Hennig Flexible parametric bootstrap

Page 56: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Structure:◮ Almost all patients start on low dosage◮ Once weekly new prescription,

much change when new prescription,little on other days(apart from missing values),

◮ Early new prescriptions change much more than laterones.

Christian Hennig Flexible parametric bootstrap

Page 57: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Transition probabilities, new prescription:

Christian Hennig Flexible parametric bootstrap

Page 58: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Transition probabilities, other days:

Christian Hennig Flexible parametric bootstrap

Page 59: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Null model for non-clustering structure:◮ Fix marginal on day 1.◮ Markov model with transition probabilities

◮ . . . for prescription changes 1 and 2,◮ . . . for all other prescription changes aggregated,◮ . . . for all other days aggregated.

◮ Draw missing value pattern from empirical distribution.

Christian Hennig Flexible parametric bootstrap

Page 60: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Average linkage, complete linkage:

pam, p-values:

Christian Hennig Flexible parametric bootstrap

Page 61: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

What have we learnt?◮ PS> 0.8 rule may be fulfilled by null model.

Christian Hennig Flexible parametric bootstrap

Page 62: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

What have we learnt?◮ PS> 0.8 rule may be fulfilled by null model.◮ pam (k ≈ 7) looks better

than average and complete linkage,but clustering looks significantonly if best k are “cherry-picked”.

Aggregating PS ranks gives p = 0.475.

Christian Hennig Flexible parametric bootstrap

Page 63: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

What have we learnt?◮ PS> 0.8 rule may be fulfilled by null model.◮ pam (k ≈ 7) looks better

than average and complete linkage,but clustering looks significantonly if best k are “cherry-picked”.

Aggregating PS ranks gives p = 0.475.

Null model was good enough to fit the data;no evidence for real clustering.

Christian Hennig Flexible parametric bootstrap

Page 64: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Visual testing with the null model

What if there is clusteringbut ASW/PS don’t pick it as significant?

Buja et al. (2009) - visual testing:◮ generate g − 1 datasets from H0,◮ visualise them,◮ show together with real dataset in random position,◮ show to h statisticians and ask them

to nominate the most real(special) dataset.◮ Do significantly more than 1

g of them pick real one?

Christian Hennig Flexible parametric bootstrap

Page 65: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Using pam and pam-adapted observation order by C.-J. Lin:

Christian Hennig Flexible parametric bootstrap

Page 66: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Discussion

Homogeneity test:can real data be distinguished from null model databy clustering criteria?

Christian Hennig Flexible parametric bootstrap

Page 67: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Discussion

Homogeneity test:can real data be distinguished from null model databy clustering criteria?

Not significant: real data isnot more clustered than data from H0

according to the criterion.

If the criterion is chosen well,this is a strong result, though negative

Christian Hennig Flexible parametric bootstrap

Page 68: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Significant: real data is more clusteredthan H0 with heuristic parameter estimates.

Maybe other parameters fit data welland produce better criterion values?(Not likely for large n and good estimators;also look at gap between data and null model.)

Also, H0 may still be too simplistic(try hard to model non-clustering structure;see Example 1.)

Christian Hennig Flexible parametric bootstrap

Page 69: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Estimating k involving the null modelcompares criterion with null behaviour for increasing kwhich isn’t known for most cluster validation criteria.

Provides well-needed calibration,certainly more informative than simple maximum-rule.

Is null behaviour relevant for comparing k and k − 1 > 1?

Christian Hennig Flexible parametric bootstrap

Page 70: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Conclusion◮ Explore behaviour of validation index under

model for non-clustering structure in the databy parametric bootstrap.

Christian Hennig Flexible parametric bootstrap

Page 71: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Conclusion◮ Explore behaviour of validation index under

model for non-clustering structure in the databy parametric bootstrap.

◮ Naive null models for “no clustering” easily rejectedeven by non-clustering structure.

Christian Hennig Flexible parametric bootstrap

Page 72: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Conclusion◮ Explore behaviour of validation index under

model for non-clustering structure in the databy parametric bootstrap.

◮ Naive null models for “no clustering” easily rejectedeven by non-clustering structure.

◮ If there is non-clustering structure,k maximising validation indexes and BIC may not be best.

Christian Hennig Flexible parametric bootstrap

Page 73: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Ideas for further work◮ Compare formal methods for aggregating a test

from different k ,and for using information to estimate k .

Christian Hennig Flexible parametric bootstrap

Page 74: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Ideas for further work◮ Compare formal methods for aggregating a test

from different k ,and for using information to estimate k .

◮ Fit mixture of null models?I’m skeptical; often too many parameters,large within-model distances.

Christian Hennig Flexible parametric bootstrap

Page 75: Flexible parametric bootstrap for testing homogeneity ... · Overview and Introduction Example 1: social stratification Example 2: Species distribution ranges Example 3: methadone

Overview and IntroductionExample 1: social stratification

Example 2: Species distribution rangesExample 3: methadone patients

Discussion

Ideas for further work◮ Compare formal methods for aggregating a test

from different k ,and for using information to estimate k .

◮ Fit mixture of null models?I’m skeptical; often too many parameters,large within-model distances.

◮ Test k against k − 1 clusters,fitting null model within found clusters.

Christian Hennig Flexible parametric bootstrap