flexible parametric bootstrap for testing homogeneity ... · overview and introduction example 1:...
TRANSCRIPT
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Flexible parametric bootstrap for testinghomogeneity against clustering and
assessing the number of clusters
Christian Hennig
4 December 2014
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Overview◮ The idea◮ Example 1: social stratification (mixed type data)◮ Example 2: species distribution ranges
(spatial dependence, BIC)◮ Example 3: methadone patients
(Markov chain, visual testing)◮ Discussion
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
The general idea is not new (Jain and Dubes 1988),though rarely seen in practice:
For testing H0 : “no real clustering structure”,define null model, simulate distribution of validation index.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
The general idea is not new (Jain and Dubes 1988),though rarely seen in practice:
For testing H0 : “no real clustering structure”,define null model, simulate distribution of validation index.
In the literature:simple null models like Gaussian, uniform,random dissimilarities, permutation based.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
An additional difficulty
Often there is “non-clustering” structure in the dataand standard homogeneity models are easily rejectedfor reasons other than true clustering.So need model non-clustering structure.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
An additional difficulty
Often there is “non-clustering” structure in the dataand standard homogeneity models are easily rejectedfor reasons other than true clustering.So need model non-clustering structure.
Cluster definition may not be based on probability models.E.g., may not identify clusters with mixture components,may want clusters based on low within-cluster distances,still want to distinguish clustering from model-basedhomogeneity.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Example 1: social stratification
Hennig & Liao (2013) looked for evidence for social stratain 2007 US Survey of Consumer Financesn = 17,430, mixed type variables:
◮ log savings amount (continuous),◮ log income (continuous),◮ years of education (ordinal/count),◮ number of checking accounts (ordinal/count),◮ number of savings accounts (ordinal/count),◮ housing (nominal but with more structure),◮ life insurance (binary),◮ occupation (nominal/ordinal).
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Tried latent class mixture, but preferredpam (Kaufman and Rousseeuw 1990)because tailor-made distance measure appropriateinvolving variable weights andflexible handling of categorical structure.
Also clusters should be characterized bylow distance within clusters.
pam:
argmin{C1,...,Ck} partition of xn, x̃1∈C1,...,x̃k∈Ck
n∑
i=1
minj
d(x i , x̃ j).
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Mixture components vs. low distance clusters
1
1
111 1
1
1
1
1
11
11
1
1
1
1
1
1
1
1
1
1
1
1
111
1
11
1
1
11
1
1
11
1
1
1
1
11
1
1
1
1
1
1
1
1
1
1
1
1
1
11
1
1
1
1
1
1
1
1
1
1
11
1
1
11
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
11
1
1
1
1
1
1
1
1
11
1
1
1
1
1
11
1
1
11
1
1
1
1
1
111
1
11
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
11
1
1
1
1
1
1
1
1
1
11
1
1
1
11
11
1
11
1
1
1
1
1
11
1
1
1
1
1
1
11
1
1
1
1
1
1
1
2 2222
22
22
2
2
2
222
222
22
2222 22
222 22222222 2
2 2222
2
2
22 22 222
222 2
222
22
22
222 2
2
222
22 222
22
22 222
222222
22
222222
2
2
2
3
33
3
3
3
3
3
3
3
3
3
3
3
3
3
3
33
3
3
3
3
3
3
3
3 33
3
3
3
3
3
3
3
3
3
3
3
3 3
33
3
33 3
3
3
3
3
3
33
3
3
3
3
33
3
3
3 3
3
3
3
3
3
3
3
3
3
3
3
3
33
3
3
3
3
3
3
3
33
3
3
3
3
3
33
3
3
3
3
3
1 2 3 4 5 6 7
12
34
56
x
y 1
2
333 3
2
1
1
3
33
11
4
3
4
5
5
4
3
5
3
6
3
4
333
3
66
2
3
53
5
3
66
5
3
1
4
22
3
1
3
1
5
1
6
4
1
2
3
1
4
66
2
5
6
3
1
2
2
5
6
4
33
2
5
33
3
6
3
1
2
6
3
4
3
6
2
4
2
1
5
3
6
1
2
6
3
2
6
2
33
5
1
5
3
2
2
6
3
61
6
6
4
1
5
66
4
5
66
1
4
1
3
2
553
2
11
5
6
3
5
4
1
5
6
4
6
4
1
2
1
3
5
1
3
4
2
44
5
3
5
1
4
1
3
3
6
11
1
2
1
66
11
2
55
2
5
4
3
1
46
3
1
3
1
3
6
33
2
5
3
4
2
3
5
7 7777
77
77
7
7
7
777
777
77
7777 77
777 77777777 7
7 7777
7
7
77 77 777
777 7
777
77
77
777 7
7
777
77 777
77
77 777
777777
77
777777
7
7
7
2
22
8
2
2
8
2
8
3
2
2
8
2
8
2
8
88
8
2
8
8
8
2
8
2 28
8
8
2
2
8
2
2
8
8
8
2
8 8
88
2
552
8
2
2
8
2
22
8
2
8
2
88
2
3
2 2
8
2
2
8
8
2
2
8
2
8
8
8
28
2
2
8
8
8
2
8
22
8
8
2
8
2
88
2
8
8
2
2
1 2 3 4 5 6 7
12
34
56
x
y
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Standard recommendation for estimating k :
Average silhouette width (ASW)(Kaufman and Rouseeuw 1990):sw(i , k) = b(i ,k)−a(i ,k)
max(a(i ,k),b(i ,k)) ,
a(i , k) =1
|Cj | − 1
∑
x∈Cj
d(x i , x), b(i , k) = minx i 6∈Cl
1|Cl |
∑
x∈Cl
d(x i , x).
Maximum average sw ⇒ optimal k .
Formalises good separation from neighbouring clustercompared to cluster to which object belongs.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
2 4 6 8 10
0.00
0.05
0.10
0.15
0.20
0.25
0.30
Number of clusters
Ave
rage
silh
ouet
te w
idth
Looks like k = 2 but is there any clustering at all?
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Social stratification: structure introduced bymixing categorical, ordinal, continuous information;categorical variablescarrying stronger than categorical information.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Social stratification: structure introduced bymixing categorical, ordinal, continuous information;categorical variablescarrying stronger than categorical information.
Latent class, spherical clustering (pam,k-means)imply approximate independence within clusters.
Can clustering in data be explained bysimple dependence and category structure alone?
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Null model for “structure but no clustering”:◮ Latent multivariate Gaussian, general covariance matrix.◮ Ordinal variables by categorising with quantiles.◮ Assume nominal variables ordinal with unknown order.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Null model for “structure but no clustering”:◮ Latent multivariate Gaussian, general covariance matrix.◮ Ordinal variables by categorising with quantiles.◮ Assume nominal variables ordinal with unknown order.
Estimation:◮ Impose correlation-based order on nominal variables
(order by average correlation of category dummy variableswith ordinal and continuous variables).
◮ Compute polychoric correlation matrix (Drasgow 1986),using 10 equally-sized categories for continuous.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Parametric bootstrap. Repeat m times:◮ Generate multivariate Gaussian.◮ Transform to original marginal distributions
for ordinal/categorical variables.◮ Compute distances same way as for original data.◮ Cluster for k = 2, . . . , kmax by pam.◮ Compute ASW.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Parametric bootstrap. Repeat m times:◮ Generate multivariate Gaussian.◮ Transform to original marginal distributions
for ordinal/categorical variables.◮ Compute distances same way as for original data.◮ Cluster for k = 2, . . . , kmax by pam.◮ Compute ASW.
Need parametric bootstrap, not nonparametric,because nonparametric bootstrap reproducesclustering structure, doesn’t model homogeneity.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
2 4 6 8 10
0.00
0.05
0.10
0.15
0.20
0.25
0.30
Number of clusters
Ave
rage
silh
ouet
te w
idth
Significant for larger k , not for “optimal” k = 2!
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
A subtle detail:Emulate marginal distribution for categorical variablesbecause marginal distribution alonedoesn’t indicate clustering.
Do not emulate marginal distributionfor continuous variables(because non-unimodal distribution indicates clustering).
For ordinal variables it depends on interpretation.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
But. . .Gaussian distribution for continuous variablesmay still be over-simplistic,and big proportion of zero savingsmay not seen as “clustering” feature
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
But. . .Gaussian distribution for continuous variablesmay still be over-simplistic,and big proportion of zero savingsmay not seen as “clustering” feature
Significant ASW may still be caused bynon-clustering structure.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
−1 0 1 2 3
−4
−2
02
4
standardised log savings
stan
dard
ised
log
inco
me
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
More sophisticated “non-clustering”model for savings and income:Zero with probability p0 and unimodal otherwise.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
More sophisticated “non-clustering”model for savings and income:Zero with probability p0 and unimodal otherwise.
◮ Estimate p0 for savings and income,
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
More sophisticated “non-clustering”model for savings and income:Zero with probability p0 and unimodal otherwise.
◮ Estimate p0 for savings and income,◮ look at log variable conditional on untransformed > 0,◮ estimate kernel density for smallest bandwidth
giving unimodality.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
More sophisticated “non-clustering”model for savings and income:Zero with probability p0 and unimodal otherwise.
◮ Estimate p0 for savings and income,◮ look at log variable conditional on untransformed > 0,◮ estimate kernel density for smallest bandwidth
giving unimodality.◮ Still simulate latent multivariate Gaussian based on
polychoric correlation, transformed to 10 categories,◮ transform to zero/unimodal through F−1(Φ(X )).
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
2 4 6 8 10
0.00
0.05
0.10
0.15
0.20
0.25
0.30
Number of clusters
Ave
rage
silh
ouet
te w
idth
k = 8 still significant, much better now than k = 6.k = 3 is a candidate, too.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
What have we learnt?◮ There is significantly more clustering structure
(as measured by ASW) in the data than in null model.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
What have we learnt?◮ There is significantly more clustering structure
(as measured by ASW) in the data than in null model.◮ “Maximize ASW” isn’t appropriate rule for k .
ASW should be compared with what can be expectedunder null model.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
What have we learnt?◮ There is significantly more clustering structure
(as measured by ASW) in the data than in null model.◮ “Maximize ASW” isn’t appropriate rule for k .
ASW should be compared with what can be expectedunder null model.
◮ k = 8 looks better than k = 6 because ASWis about the same, but expectation is lower.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
What have we learnt?◮ There is significantly more clustering structure
(as measured by ASW) in the data than in null model.◮ “Maximize ASW” isn’t appropriate rule for k .
ASW should be compared with what can be expectedunder null model.
◮ k = 8 looks better than k = 6 because ASWis about the same, but expectation is lower.
◮ “Zero plus unimodal” vs. Gaussian modelimproves ASW, but still clustering at k = 8 significant.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Formal rules. . . for testing homogeneity, and estimating k :
1. Tests for each k :◮
a+1r+1 , a “data has lower ASW”-runs, r all runs,
◮ or based on approximate normality, mean(ASW), sd(ASW).
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Formal rules. . . for testing homogeneity, and estimating k :
1. Tests for each k :◮
a+1r+1 , a “data has lower ASW”-runs, r all runs,
◮ or based on approximate normality, mean(ASW), sd(ASW).
2. Aggregate tests for all kto a single clustering test,e.g., averaging ASW or rank ASW over k .
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Formal rules. . . for testing homogeneity, and estimating k :
1. Tests for each k :◮
a+1r+1 , a “data has lower ASW”-runs, r all runs,
◮ or based on approximate normality, mean(ASW), sd(ASW).
2. Aggregate tests for all kto a single clustering test,e.g., averaging ASW or rank ASW over k .
3. Define rule for estimating k ,e.g., maximise (ASW-mean(ASW))/sd(ASW) under H0.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Formal rules. . . for testing homogeneity, and estimating k :
1. Tests for each k :◮
a+1r+1 , a “data has lower ASW”-runs, r all runs,
◮ or based on approximate normality, mean(ASW), sd(ASW).
2. Aggregate tests for all kto a single clustering test,e.g., averaging ASW or rank ASW over k .
3. Define rule for estimating k ,e.g., maximise (ASW-mean(ASW))/sd(ASW) under H0.
Exploratory power of image is more impressive.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
An example with Gaussian mixtures and BIC:snail species distribution ranges
Clustering 80 snail speciescharacterised by presence-absence dataon 34 Aegean islands (Cyclades)(Hausdorf and Hennig 2004)
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Use Kulczynski-dissimilarity between ranges,MDS, Gaussian mixture clustering with noise (mclust).BIC picks k = 8 and noise.
N
1
N
2
2
8
6
N7N
2
N
N
4
1
6 6
8
4
N
N
N
4
1
4
N
6
5
3
1
3
5N
6
1
N
N
1
N
8
1
23
3
5
N
1
3
2
1
7
N
N
4
5
3
7
4
N
N
4
1
5
7
N3
2N2
3
N
1
7
4
N
N2
5
N
6
−0.6 −0.4 −0.2 0.0 0.2 0.4
−0.
4−
0.2
0.0
0.2
0.4
mds[,1]
mds
[,2]
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Using a homogeneous Gaussian as null model:
2 4 6 8 10
−20
00
200
400
600
Number of clusters
BIC
Under H0, BIC usually picks k = 1 as expected,clustering significantbut small sample effects for large k .
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
However, there is spatial structure.Species are more often on neighbouring islands.
Model underlying presence-absence datausing neighbourhood list of islands:
Null model parameters:◮ species width distribution,◮ attraction (species frequency) per region,◮ parameter pn governing autocorrelation.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Algorithmic null model: For every species
1. draw species width from empirical distribution,
2. draw starting region proportional to attraction,
3. with probability pn put it in neighborhoodof previous occurrences,otherwise outside neighborhood.(If impossible, put somewhere randomly.)
4. Go to 1 until width exhausted.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Empirical distribution of species on islands
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Draw species size (3) and initial island
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
With probability p_n, new island is neighbour
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
With probability 1−p_n, new island is non−neighbour
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Continue until species size is reached
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Estimation of autocorrelation parameter pn:
Simulate empirical disjunction probability q̂j from H0
for different pn. Compute linear regression.Find p̂n for original data disjunction probability.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
From null data, compute MDS, fit Gaussian mixture, BIC.
2 4 6 8 10
020
040
060
080
010
00
Number of clusters
BIC
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
From null data, compute MDS, fit Gaussian mixture, BIC.
2 4 6 8 10
020
040
060
080
010
00
Number of clusters
BIC
Increasing BIC is expected under null model,clustering is no longer significant!
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Example 3: methadone patients(joint work with Chien-Ju Lin)
Data from 314 methadone users (heroine addicts)one of six dosages taken over 180 days(and missing values).
Defined dissimilarity and comparedpam, average, complete linkage with ASW andPrediction Strength (Tibshirani and Walther 2005).
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Prediction Strength:◮ Split dataset in two halves (m times).◮ Cluster both parts.◮ Use one half to predict cluster memberships
of the other halfby assigning every point of part 2 to closest mean of part 1.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Prediction Strength:◮ Split dataset in two halves (m times).◮ Cluster both parts.◮ Use one half to predict cluster memberships
of the other halfby assigning every point of part 2 to closest mean of part 1.
Statistic: proportion of correctly predictedco-memberships in clustering 2 of pairs of points.
T& W suggest to choose smallest k with PS> 0.8.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
−5 0 5
−4
−2
02
46
xc1[,1]
xc1[
,2]
−5 0 5
−4
−2
02
46
xc2[,1]
xc2[
,2]
−5 0 5
−4
−2
02
46
xc1[,1]
xc1[
,2]
−5 0 5
−4
−2
02
46
xc2[,1]
xc2[
,2]
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Is there real clustering? How many clusters?
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Structure:◮ Almost all patients start on low dosage◮ Once weekly new prescription,
much change when new prescription,little on other days(apart from missing values),
◮ Early new prescriptions change much more than laterones.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Transition probabilities, new prescription:
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Transition probabilities, other days:
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Null model for non-clustering structure:◮ Fix marginal on day 1.◮ Markov model with transition probabilities
◮ . . . for prescription changes 1 and 2,◮ . . . for all other prescription changes aggregated,◮ . . . for all other days aggregated.
◮ Draw missing value pattern from empirical distribution.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Average linkage, complete linkage:
pam, p-values:
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
What have we learnt?◮ PS> 0.8 rule may be fulfilled by null model.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
What have we learnt?◮ PS> 0.8 rule may be fulfilled by null model.◮ pam (k ≈ 7) looks better
than average and complete linkage,but clustering looks significantonly if best k are “cherry-picked”.
Aggregating PS ranks gives p = 0.475.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
What have we learnt?◮ PS> 0.8 rule may be fulfilled by null model.◮ pam (k ≈ 7) looks better
than average and complete linkage,but clustering looks significantonly if best k are “cherry-picked”.
Aggregating PS ranks gives p = 0.475.
Null model was good enough to fit the data;no evidence for real clustering.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Visual testing with the null model
What if there is clusteringbut ASW/PS don’t pick it as significant?
Buja et al. (2009) - visual testing:◮ generate g − 1 datasets from H0,◮ visualise them,◮ show together with real dataset in random position,◮ show to h statisticians and ask them
to nominate the most real(special) dataset.◮ Do significantly more than 1
g of them pick real one?
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Using pam and pam-adapted observation order by C.-J. Lin:
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Discussion
Homogeneity test:can real data be distinguished from null model databy clustering criteria?
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Discussion
Homogeneity test:can real data be distinguished from null model databy clustering criteria?
Not significant: real data isnot more clustered than data from H0
according to the criterion.
If the criterion is chosen well,this is a strong result, though negative
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Significant: real data is more clusteredthan H0 with heuristic parameter estimates.
Maybe other parameters fit data welland produce better criterion values?(Not likely for large n and good estimators;also look at gap between data and null model.)
Also, H0 may still be too simplistic(try hard to model non-clustering structure;see Example 1.)
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Estimating k involving the null modelcompares criterion with null behaviour for increasing kwhich isn’t known for most cluster validation criteria.
Provides well-needed calibration,certainly more informative than simple maximum-rule.
Is null behaviour relevant for comparing k and k − 1 > 1?
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Conclusion◮ Explore behaviour of validation index under
model for non-clustering structure in the databy parametric bootstrap.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Conclusion◮ Explore behaviour of validation index under
model for non-clustering structure in the databy parametric bootstrap.
◮ Naive null models for “no clustering” easily rejectedeven by non-clustering structure.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Conclusion◮ Explore behaviour of validation index under
model for non-clustering structure in the databy parametric bootstrap.
◮ Naive null models for “no clustering” easily rejectedeven by non-clustering structure.
◮ If there is non-clustering structure,k maximising validation indexes and BIC may not be best.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Ideas for further work◮ Compare formal methods for aggregating a test
from different k ,and for using information to estimate k .
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Ideas for further work◮ Compare formal methods for aggregating a test
from different k ,and for using information to estimate k .
◮ Fit mixture of null models?I’m skeptical; often too many parameters,large within-model distances.
Christian Hennig Flexible parametric bootstrap
Overview and IntroductionExample 1: social stratification
Example 2: Species distribution rangesExample 3: methadone patients
Discussion
Ideas for further work◮ Compare formal methods for aggregating a test
from different k ,and for using information to estimate k .
◮ Fit mixture of null models?I’m skeptical; often too many parameters,large within-model distances.
◮ Test k against k − 1 clusters,fitting null model within found clusters.
Christian Hennig Flexible parametric bootstrap