NUMERICAL ANALYSIS OF BIOLOGICAL AND
ENVIRONMENTAL DATA
Lecture 3. Classification
Agglomerative hierarchical cluster analysis
Two-way indicator species analysis – TWINSPAN
Non-hierarchical k-means clustering
‘Fuzzy’ clustering
Mixture models and latent class analysis
Detection of indicator species
Interpretation of classifications using external data
Comparing classifications
Software
CLASSIFICATION
M. R. Anderberg, 1973, Cluster analysis for applications. Academic
H.T. Clifford & W. Stephenson, 1975, An introduction to numerical classification. Academic
B. Everitt, 1993, Cluster analysis. Halsted Press
A.D. Gordon, 1999, Classification. Chapman & Hall
A.K. Jain & R.C. Dubes, 1988, Algorithms for clustering data. Prentice Hall
L. Kaufman & P.J. Rousseeuw, 1990, Finding groups in data. An introduction to cluster analysis. Wiley
H.C. Romesburg, 1984, Cluster analysis for researchers. Lifetime Learning Publications
P.H. A. Sneath & R.R. Sokal, 1973, Numerical taxonomy. W.H. Freeman
H. Späth, 1980, Cluster analysis algorithms for data reduction and classification of objects
BOOKS ON NUMERICAL CLASSIFICATION
P.G.N. Digby & R.A. Kempton, 1987, Multivariate analysis of ecological communities. Chapman & Hall
P. Greig-Smith, 1983, Quantitative plant ecology. Blackwell
R.H.G. Jongman, C.J.F. ter Braak & O.F.R. van Tongeren (eds), 1995, Data analysis in community and landscape ecology. Cambridge University Press
P. Legendre & L. Legendre, 1998, Numerical ecology. Elsevier (Second English Edition)
J.A. Ludwig & J.F. Reynolds, 1988, Statistical ecology. J. Wiley
L. Orloci, 1978, Multivariate analysis in vegetation research. Dr. Junk
E.C. Pielou, 1984, The interpretation of ecological data. J. Wiley
J. Podani, 2000, Introduction to the exploration of multivariate biological data. Backhuys
W.T. Williams, 1976, Pattern analysis in agricultural science. CSIRO Melbourne
Most important are Chapters 7 and 8 in Legendre and Legendre (1998)
BOOKS ON NUMERICAL CLASSIFICATION IN ECOLOGY
Partition set of data (objects) into groups or clusters.
Partition into g groups so as to optimise some stated mathematical criterion, e.g. minimum sum-of-squares. Divide data into g groups so as to minimise the total variance or within-groups sum-of-squares, i.e. to make within-group variance as small as possible, thereby maximising the between-group variance.
Reduce data to a few groups. Can be very useful.
Compromise: for 50 objects there are ~10^80 possible classifications!
Hierarchical classification: agglomerative or divisive.
Major reviews:
A.D. Gordon, 1996, Hierarchical classification in clustering and classification (ed. P. Arabie & L.J. Hubert) pp 65-121. World Scientific Publishing, River Edge, NJ
A.D. Gordon, 1999, Classification (Second edition). Chapman & Hall
BASIC AIM
CLASSIFICATION OF CLASSIFICATIONS
Formal - Informal
Hierarchical - Non-hierarchical
Quantitative - Qualitative
Agglomerative - Divisive
Polythetic - Monothetic
Sharp - Fuzzy
Supervised - Unsupervised
Useful - Not useful
MAIN APPROACHES
Hierarchical cluster analysis
formal, hierarchical, quantitative, agglomerative, polythetic, sharp, not always useful.
Two-way indicator species analysis (TWINSPAN)
formal, hierarchical, semi-quantitative, divisive, semi-polythetic, sharp, usually useful.
k-means clustering
formal, non-hierarchical, quantitative, semi-agglomerative, polythetic, sharp, usually useful.
Fuzzy clustering
formal, non-hierarchical, quantitative, semi-agglomerative, polythetic, fuzzy, rarely used but potentially useful.
Mixture models and latent class analysis
formal (too formal!) non-hierarchical, quantitative, polythetic, sharp or fuzzy, rarely used, perhaps not potentially useful with complex data-sets.
All UNSUPERVISED classifications
Warning!
“The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other “statistical” innovation (with the possible exception of multiple regression techniques)”
Cormack, 1970
1. Calculate matrix of proximity or dissimilarity coefficients
2. Clustering
3. Graphical display
4. Check for distortion
5. Validation of results
AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS
PROXIMITY OR DISTANCE OR DISSIMILARITY MEASURES
Hubálek, 1982, Biol. Rev. 57, 669-689
Gower & Legendre, 1986, J. Classific. 3, 5-48
Archer & Maples, 1987, Palaios 2, 609-617
Maples & Archer, 1988, Palaios 3, 95-103
Legendre & Legendre, 1998, Numerical ecology, Chapter 7
A. Binary Data
             Object j
              +   -
Object i  +   a   b
          -   c   d

Jaccard coefficient:            S_J = a / (a + b + c)
  Dissimilarity (1 - S):        D_J = (b + c) / (a + b + c)

Simple matching coefficient:    S_SMC = (a + d) / (a + b + c + d)
  Dissimilarity:                D_SMC = (b + c) / (a + b + c + d)

Baroni-Urbani & Buser:          S = (a + √(ad)) / (a + b + c + √(ad))
                                Syst. Zool. (1976) 25, 251-259
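The three binary coefficients above can be sketched in a few lines of Python (a minimal illustration, not from any package; the function names are our own):

```python
# Binary (dis)similarity coefficients from the 2x2 table counts a, b, c, d.
from math import sqrt

def jaccard(a, b, c, d):
    """S_J = a / (a + b + c); joint absences (d) are ignored."""
    return a / (a + b + c)

def simple_matching(a, b, c, d):
    """S_SMC = (a + d) / (a + b + c + d); joint absences count."""
    return (a + d) / (a + b + c + d)

def baroni_urbani_buser(a, b, c, d):
    """S = (a + sqrt(a*d)) / (a + b + c + sqrt(a*d))."""
    r = sqrt(a * d)
    return (a + r) / (a + b + c + r)

# e.g. a = 6 joint presences, b = 2 and c = 1 mismatches, d = 11 joint absences
print(jaccard(6, 2, 1, 11))          # 6/9 = 0.666...
print(simple_matching(6, 2, 1, 11))  # 17/20 = 0.85
```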
B. Quantitative Data
Two objects i and j measured on variables 1 and 2 plot as points (x_i1, x_i2) and (x_j1, x_j2); the Euclidean distance is the straight-line distance between them:
  d_ij² = (x_i1 - x_j1)² + (x_i2 - x_j2)²

Euclidean distance
  d_ij = [ Σ_{k=1..m} (x_ik - x_jk)² ]^½
  dominated by large values

Manhattan or city-block metric
  d_ij = Σ_{k=1..m} |x_ik - x_jk|
  less dominated by large values

Bray & Curtis (percentage similarity)
  d_ij = Σ_k |x_ik - x_jk| / Σ_k (x_ik + x_jk)
  sensitive to extreme values; relates minima to average values and represents the relative influence of abundant and uncommon variables
B. Quantitative Data (cont)
Similarity ratio or Steinhaus-Marczewski coefficient (= Jaccard for binary data)
  d_ij = Σ_k x_ik x_jk / ( Σ_k x_ik² + Σ_k x_jk² - Σ_k x_ik x_jk )
  less dominated by extremes

Chord distance for % data
  d_ij = [ Σ_{k=1..m} (p_ik^½ - p_jk^½)² ]^½
  maximises “signal to noise”
C. Percentage Data (e.g. pollen, diatoms)
Standardised Euclidean distance - gives all variables ‘equal’ weight, increases noise in data
Euclidean distance - dominated by large values, rare variables have almost no influence
Chord distance (= Euclidean distance of square-root transformed data) - good compromise, maximises signal-to-noise ratio
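The quantitative distance measures above can be sketched in plain Python (our own minimal functions, for illustration only):

```python
# Distance measures between two sample vectors x and y (or proportions p, q).
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def bray_curtis(x, y):
    # sum |x_k - y_k| / sum (x_k + y_k)
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / \
           sum(xi + yi for xi, yi in zip(x, y))

def chord(p, q):
    # Euclidean distance of square-root transformed proportions
    return sqrt(sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q)))

x, y = [10, 0, 5], [6, 4, 5]
print(euclidean(x, y))    # sqrt(16 + 16 + 0) = 5.656...
print(manhattan(x, y))    # 8
print(bray_curtis(x, y))  # 8/30 = 0.266...
```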
D. Transformations
Normalise samples (e.g. so that Σ_{k=1..m} x_ik² = 1) - ‘equal’ weight
Normalise variables - ‘equal’ weight, rare species inflated
No transformation - quantity dominated
Double transformation - equalise both, compromise
Noy-Meir et al. (1975) J. Ecology 63; 779-800
E. Mixed data (e.g. quantitative, qualitative, binary)
Gower coefficient (see Lecture 12)
AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS (five stages)
i. Calculate matrix of proximity (similarity or dissimilarity) measures between all ½ n (n - 1) pairs of the n samples
ii. Fuse objects into groups using stated criterion, ‘clustering’ or sorting strategy
iii. Graphical display of results - dendrograms or trees
- graphs
- shadings
iv. Check for distortion
v. Validation of results
i. Simple Distance Matrix
Using d_ij = [ Σ_{k=1..m} (x_ik - x_jk)² ]^½:

D =   1    -
      2    2   -
      3    6   5   -
      4   10   9   4   -
      5    9   8   5   3   -
           1   2   3   4   5
                 Objects
ii. Clustering Strategy using Single-Link Criterion
Find objects with smallest dij = d12 = 2
Calculate distances between this group (1 and 2) and other objects
d(12)3 = min { d13, d23 } = d23 = 5
d(12)4 = min { d14, d24 } = d24 = 9
d(12)5 = min { d15, d25 } = d25 = 8
Find objects with smallest dij = d45 = 3
Calculate distances between (1, 2), 3, and (4, 5)
Find object with smallest dij = d3(4, 5) = 4
Fuse object 3 with group (4 + 5)
Now fuse (1, 2) with (3, 4, 5) at distance 5
D =   1+2   -
      3     5   -
      4     9   4   -
      5     8   5   3   -
            1+2 3   4   5

D =   1+2   -
      3     5   -
      4+5   8   4   -
            1+2 3   4+5
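The worked single-link example above can be reproduced with SciPy (assuming SciPy is available; objects 1-5 become indices 0-4):

```python
# Single-link clustering of the 5-object distance matrix from the example.
from scipy.cluster.hierarchy import linkage

# Condensed upper triangle of the 5x5 distance matrix, row by row:
# d12 d13 d14 d15  d23 d24 d25  d34 d35  d45
d = [2, 6, 10, 9, 5, 9, 8, 4, 5, 3]

Z = linkage(d, method='single')
# Each row of Z: cluster a, cluster b, fusion distance, new cluster size
print(Z)
# Fusions occur at distances 2 (objects 1+2), 3 (4+5),
# 4 (3 with 4+5) and 5 ((1,2) with (3,4,5)), matching the hand calculation.
```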
When I and J fuse, we need to calculate the distance of K to the new group (I, J).
Single-link (nearest neighbour) - fusion depends on distance between closest pairs of objects, produces ‘chaining’
Complete-link (furthest neighbour) -
fusion depends on distance between furthest pairs of objects
Median - fusion depends on distance between K and the mid-point (median) of line IJ; ‘weighted’ because I and J are treated equally (e.g. 1 object compared with 4)
Centroid - fusion depends on the centre of gravity (centroid) of I and J; ‘unweighted’ as the size of each group is taken into account
Also:
Unweighted group-average - the distance between K and (I, J) is the average of all distances from objects in I and J to K; with 1 object in I and 4 in J this is (d_IK + 4 d_JK) / 5
Weighted group-average - the distance between K and (I, J) is the average of the distance between K and J and the distance between K and I, i.e. (d_IK + d_JK) / 2
Single-link (nearest neighbour)
Complete-link (furthest neighbour)
Median
Centroid
Unweighted group-average
Weighted group-average
Minimum variance, sum-of-squares (Ward’s method) - Orloci (1967) J. Ecology 55, 193-206
Q_I, Q_J, Q_K = within-group variance (sum-of-squares) of groups I, J and K. Fuse I with J to give (I, J) if and only if

  Q_IJ - (Q_I + Q_J) ≤ Q_IK - (Q_I + Q_K)   and   Q_IJ - (Q_I + Q_J) ≤ Q_JK - (Q_J + Q_K)

i.e. only fuse I and J if neither would combine better (give a smaller increase in sum-of-squares) with some other group.
The distance between group k and the newly fused group (i, j) follows a recurrence formula, where α_i, α_j, β and γ are parameters that define the different methods (Wishart, 1969, Biometrics 25, 165-170):

  d_k(ij) = α_i d_ki + α_j d_kj + β d_ij + γ |d_ki - d_kj|
CLUSTER
CLUSTAN-PC
CLUSTAN-GRAPHICS
GENERALISED SORTING STRATEGY
Method                      α_i                      α_j                      β                               γ
Single-link                 ½                        ½                        0                               -½
Furthest-link               ½                        ½                        0                               ½
Median                      ½                        ½                        -¼                              0
Group average (unweighted)  n_i/(n_i+n_j)            n_j/(n_i+n_j)            0                               0
Group average (weighted)    ½                        ½                        0                               0
Centroid                    n_i/(n_i+n_j)            n_j/(n_i+n_j)            -α_i α_j = -n_i n_j/(n_i+n_j)²  0
Sum of squares              (n_i+n_k)/(n_i+n_j+n_k)  (n_j+n_k)/(n_i+n_j+n_k)  -n_k/(n_i+n_j+n_k)              0
Single-link example, to calculate the distance d_3(1,2):
  d_3(1,2) = α_i d_ki + α_j d_kj + β d_ij + γ |d_ki - d_kj|
           = ½ d_31 + ½ d_32 + 0 d_12 - ½ |d_31 - d_32|
           = ½ (6) + ½ (5) + 0 - ½ (1) = 3 + 2.5 - 0.5 = 5

Can also have flexible clustering with user-specified β (usually -0.25).
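The recurrence and the worked example above can be sketched directly in Python (our own minimal function; parameter values come from the generalised sorting-strategy table):

```python
# Lance-Williams recurrence: distance from group K to the fused group (I, J).
def lance_williams(d_ki, d_kj, d_ij, alpha_i, alpha_j, beta, gamma):
    return (alpha_i * d_ki + alpha_j * d_kj
            + beta * d_ij + gamma * abs(d_ki - d_kj))

# Single link: alpha_i = alpha_j = 1/2, beta = 0, gamma = -1/2
d = lance_williams(6, 5, 2, 0.5, 0.5, 0.0, -0.5)
print(d)  # 0.5*6 + 0.5*5 - 0.5*|6 - 5| = 5.0
```

Swapping γ to +½ gives complete link: the same call with `gamma=0.5` yields 6.0, the distance to the furthest member.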
CLUSTERING STRATEGIES
Single link = nearest neighbour
Finds the minimum spanning tree, the shortest tree that connects all points
Finds discontinuities if they exist in data
Chaining common
Clusters of unequal size
Complete-link = furthest neighbour
Compact clusters of ± equal size
Makes compact clusters even when none exist
Average-linkage methods
Intermediate between single and complete link
Unweighted GA maximises cophenetic correlation
Clusters often quite compact
Make quite compact clusters even when none exist
Median and centroid
Can form reversals in the tree
Minimum variance sum-of-squares
Compact clusters of ± equal size
Makes very compact clusters even when none exist
Very intense clustering method
iii. Graphical display
Dendrogram ‘Tree Diagram’
Group average dendrogram of 65 regions in Europe; The measure of pairwise similarity is Jaccard’s coefficient, based on the presence or absence of 144 species of fern.
Parsimonious Trees
Limit the number of different values taken by heights of internal nodes, or the number of internal nodes.
Global parsimonious tree of the dendrogram.
Local parsimonious tree of the dendrogram.
A similarity matrix based on scores for 15 qualities of 48 applicants for a job. The dendrogram shows a furthest-neighbour cluster analysis, the end points of which correspond to the 48 applicants in sorted order.
Ling (1973) Comm. Assoc. Computing Mach. 16, 355-361
Matrix Shading
Schematic way of combining row and column
hierarchical analyses
Re-order Data Matrix
Summarised two-way table of the Malham data set. The representation of the species groups (1-23) delimited by minimum variance cluster analyses in the eight quadrat clusters (A-H) is shown by the size of the circle. In addition, both the quadrat and species dendrograms derived from minimum-variance clustering are shown to show the relationships between groups.
Cophenetic correlations. The similarity matrix S contains the original similarity values between the OTUs (in this example it is a dissimilarity matrix U of taxonomic distances). The UPGMA phenogram derived from it is shown, and from the phenogram the cophenetic distances are obtained to give the matrix C. The cophenetic correlation coefficient r_cs is the correlation between corresponding pairs from C and S, and is 0.9911.
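A cophenetic-correlation check of this kind can be sketched with SciPy (assuming SciPy/NumPy are available; the data here are made up for illustration):

```python
# Cophenetic correlation: how well a dendrogram preserves the input distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))        # 10 objects, 4 variables (made-up data)
d = pdist(X)                        # original pairwise distances
Z = linkage(d, method='average')    # UPGMA, as in the example above
r, coph_d = cophenet(Z, d)          # r = cophenetic correlation coefficient
print(round(r, 3))                  # close to 1 => little distortion
```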
iv. Tests for Distortion
R
CLUSTER
Cluster analysis of the Mancetter data
[Dendrograms: Ward’s method analysis of the data; average-link analysis]
No. 1 2 3 4 5 6 No. 1 2 3 4 5 6
1 1 1 1 1 1 1 24 1 1 1 1 2 3
2 1 1 1 1 1 1 25 1 1 1 1 2 1
3 2 2 3 2 3 2 26 2 2 2 2 1 3
4 1 1 1 1 1 1 27 2 2 2 2 2 2
5 2 2 3 2 2 1 28 1 2 3 2 2 1
6 1 1 1 1 2 2 29 2 2 3 2 2 1
7 1 1 1 1 3 2 30 2 2 2 0 1 2
8 1 1 1 1 1 3 31 1 1 1 1 2 2
9 0 1 1 1 3 3 32 1 1 1 1 2 1
10 2 2 2 2 2 3 33 1 1 1 1 1 1
11 2 2 2 2 3 3 34 2 2 3 2 3 2
12 0 1 1 0 3 2 35 2 2 2 2 3 2
13 2 2 2 2 1 3 36 1 1 1 1 1 3
14 2 2 2 2 2 3 37 2 2 2 2 2 3
15 2 2 2 2 1 3 38 1 1 2 1 2 1
16 1 1 1 1 1 3 39 1 1 1 1 2 1
17 2 2 2 2 2 1 40 2 2 3 2 3 2
18 1 1 1 1 3 2 41 1 1 1 1 2 3
19 0 2 0 0 1 2 42 0 1 1 1 3 1
20 2 2 2 2 2 3 43 1 1 1 1 1 3
21 1 1 1 1 2 1 44 2 2 2 2 2 3
22 1 1 2 1 3 2 45 2 2 2 2 2 3
23 2 2 3 2 3 2 46 2 2 1 2 3 3
Results of different classifications obtained by cluster analyses of the Mancetter data.
Note: Mancetter specimens are numbered sequentially; entries in the table identify cluster membership, with a 0 indicating an outlier. The six methods used were (1) PCA, (2) Ward’s method, (3) Ward’s method on the first six standardised principal component
Which Cluster Method to Use?
J. Oksanen (2002)
CLUSTERING AND SPACE
[Figures: single-link and minimum-variance partitions of the same data, with clusters outlined by their convex hulls]
Convex hull encloses all points so that no line between two points can be drawn outside the convex hull.
J. Oksanen (2002)
Minimum variance is usually most useful but tends to produce clusters of fairly equal size, followed by group average. Single-link is least useful.
General Behaviour of Different Methods
Single-link - often results in chaining
Complete-link - intense clustering
Group-average (weighted) - tends to join clusters with small variances
Group-average (unweighted) - intermediate between single and complete link
Median - can result in reversals
Centroid - can result in reversals
Minimum variance - often forms clusters of equal size
General Experience
Clustering of random data on two variables.
Note: Diagram (a) is a plot of two randomly generated variables labelled according to the clusters suggested by Ward’s method in diagram (b)
Baxter (1994)
SIMULATION STUDIES
Validation tests for
1. The complete absence of any group structure in the data
2. The validity of an individual cluster
3. The validity of a partition
4. The validity of a complete hierarchical classification
Main interest in (2) and (3) - generally assume there is some ‘group structure’, rarely interested in validating a complete hierarchical classification.
v. VALIDATION OF RESULTS
TESTS FOR ASSESSING CLUSTERS
Gordon, A.D. (1995) Statistics in Transition 2: 207-217
Gordon, A.D. (1996) In: From Data to Knowledge (ed. W. Gaul & D. Pfeifer). Springer
Cluster analysis of joint occurrence of 43 species of fish in the Susquehanna River drainage area of Pennsylvania, constructed with the UPGMA clustering algorithm (Sneath & Sokal, 1973). The three short perpendicular lines on the dissimilarity scale represent the critical values C1, C2, and C3 obtained from the null nodal distributions of the null frequency histogram. Significant clusters are indicated by solid lines. The non-significant portion of the dendrogram is drawn in dotted lines.
Strauss, (1982) Ecology 63, 634-639.
Hunter & McCoy 2004 J. Vegetation Science 15: 135-138.
Problem of creating ecologically relevant 'random' or 'null' data-sets.
Within a 'significant' cluster, linkages are often identified as 'significant' even when species are actually randomly distributed among the sites in the group.
Artificial data: 2 groups of 20 sites, no species in common.
[Figure: species × sites data matrix showing the two blocks, with significant and non-significant linkages marked on the dendrogram]
Randomisation test identifies both groups and all linkages within them as ‘significant’.
The same test finds all linkages non-significant if only one of the groups is used!
Arises because the randomisation matrices need to be created at each classification step, not just at the beginning.
Can test for significance of groups by comparing linkage distances to a null distribution derived from randomisation and clustering of a sub-matrix containing only the sites within the larger group. In other words, this is testing the null hypothesis that within the significant group, sites represent random assemblages of species.
Sequential randomisation allows evaluation of all nodes in the classification.
OTHER APPROACHES TO ASSESSING AND VALIDATING
CLUSTERS
If replicate samples are available, can use bootstrapping to evaluate significance. Can also use within-cluster samples as ‘replicates’.
BOOTCLUS McKenna (2003) Environmental Modelling & Software 18, 205-220
(www.glsc.usgs.gov/data/bootclus.htm)
SAMPLERE Pillar (1999) Ecology 80, 2508-2516
Compares cluster analysis groups and uses bootstrapping (resampling with replacement) to test the null hypothesis that the clusters in the bootstrap samples are random samples of their most similar corresponding clusters in the observed data. The resulting probability indicates whether the groups in the partition are sharp enough to reappear consistently in resampling.
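A much-simplified sketch of the randomisation idea (our own toy version, not the BOOTCLUS or Pillar algorithms) might look like this, assuming SciPy/NumPy:

```python
# Monte Carlo null for the tightest fusion in a dendrogram: shuffle each
# species (column) independently across sites, recluster, and record the
# smallest fusion distance; an observed fusion tighter than most null
# values is taken as 'significant'.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def null_min_fusion(X, n_rand=99, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_rand):
        Xr = np.column_stack([rng.permutation(col) for col in X.T])
        Z = linkage(pdist(Xr), method='average')
        vals.append(Z[0, 2])          # smallest fusion distance in null tree
    return np.array(vals)

rng = np.random.default_rng(1)
X = rng.poisson(3.0, size=(12, 8)).astype(float)  # 12 sites x 8 species
null = null_min_fusion(X)
obs = linkage(pdist(X), method='average')[0, 2]   # observed tightest fusion
# one-tailed Monte Carlo p-value: is the observed fusion unusually tight?
p = (np.sum(null <= obs) + 1) / (len(null) + 1)
print(round(p, 2))
```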
NUMBER OF CLUSTERS
There are as many fusion levels as there are observations.
Hierarchical classification can be cut at any level.
User generally wants to use groups all at one level, hence ‘cut level’ or ‘stopping rules’.
No optimality criteria or guidelines.
Select what is useful for the purpose at hand.
No right or wrong answer, just useful or not useful!
Mathematical criteria - see A.D. Gordon (1999) pp. 60-65
1. Divide underlying gradients into equal parts
2. Compact clusters
3. Groups of equal size
4. Discontinuous groups
These criteria often in conflict, and cannot all be satisfied simultaneously.
CRITERIA FOR GOOD CLUSTERS
J. Oksanen (2002)
2. TWINSPAN – Two-Way Indicator Species Analysis Mark Hill (1979)
Differential variables characterise groups, i.e. variables common on one side of the dichotomy. This involves a qualitative (+/-) concept, so numerical data have to be analysed as PSEUDO-VARIABLES (conjoint coding).
Cut levels (example):
  Species A  1-5%    →  pseudo-species A1
  Species A  5-10%   →  pseudo-species A2
  Species A  10-25%  →  pseudo-species A3

Basic idea is to construct a hierarchical classification by successive division.
Ordinate samples by correspondence analysis and divide at the middle: the group to the left is ‘negative’, the group to the right ‘positive’. Now refine the classification using the variables with maximum indicator value (so-called iterative character weighting) and do a second ordination that gives greater weight to the ‘preferentials’, namely species found on one or other side of the dichotomy.
Identify the indicators that differ most in frequency of occurrence between the two groups: those associated with the positive side score +1, those on the negative side -1. If a variable is three times more frequent on one side than the other, it is a good indicator. Samples are now reordered on the basis of their indicator scores, then refined a second time to take account of the other variables. Repeat on the 2 groups to give 4, 8, 16 groups, and so on, until a group falls below the minimum size.
TWINSPAN
[TWINSPAN two-way table of the Coed Cymerau data: species (rows, from Cladonia coccifera to Hypnum cupressiforme var. ericetorum) by samples (columns), with abundance values 1-5. The samples fall into TWINSPAN groups I (section A) and II (section B), with group mean pH 3.7 and 3.73 respectively.]
Each species can be represented by several pseudo-species, depending on the species abundance. A pseudo-species is present if the species value equals or exceeds the relevant user-defined cut-level.
Original data            Sample 1   Sample 2
Cirsium palustre            0          2
Filipendula ulmaria         6          0
Juncus effusus             15         25

Cut levels 1, 5, and 20 (user-defined)

Pseudo-species           Sample 1   Sample 2
Cirsium palustre 1          0          1
Filipendula ulmaria 1       1          0
Filipendula ulmaria 2       1          0
Juncus effusus 1            1          1
Juncus effusus 2            1          1
Juncus effusus 3            0          1

Thus quantitative data are transformed into categorical nominal (1/0) variables.
Pseudo-species Concept
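The pseudo-species transformation can be sketched in a few lines of plain Python (the function name is our own):

```python
# Each species becomes one 0/1 pseudo-species per cut level it reaches.
def pseudo_species(abundance, cut_levels=(1, 5, 20)):
    """Return a 0/1 vector: one entry per cut level, 1 if the abundance
    equals or exceeds that cut level."""
    return [1 if abundance >= cut else 0 for cut in cut_levels]

# Juncus effusus in samples 1 and 2 of the table above
print(pseudo_species(15))  # [1, 1, 0] -> pseudo-species 1 and 2 present
print(pseudo_species(25))  # [1, 1, 1] -> pseudo-species 1, 2 and 3 present
print(pseudo_species(0))   # [0, 0, 0]
```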
Variables are classified in much the same way, using sample weights based on the sample classification. They are classified on the basis of fidelity - how confined variables are to particular sample groups: the ratio of the mean occurrence of a variable in samples in the group to its mean occurrence in samples not in the group. Variables are ordered on the basis of their degree of fidelity within a group, and a structured two-way table is then printed.
Concepts of INDICATOR SPECIES, DIFFERENTIALS and PREFERENTIALS, and FIDELITY
Gauch & Whittaker (1981) J. Ecology 69, 537-557
“two-way indicator species analysis usually best. There are cases where other techniques may be complementary or superior”.
Very robust - considers overall data structure
“best general purpose method when a data set is complex, noisy, large or unfamiliar. In many cases a single TWINSPAN classification is likely to be all that is necessary”.
TWINSPAN, TWINGRP, TWINDEND, WINTWINS
van Groenewoud (1992) J. Veg. Sci. 3, 239-246
Belbin & McDonald (1993) J. Veg. Sci. 4, 341-348
Artificial data of known properties and structure
Reliability of TWINSPAN depends on:
1. How well correspondence analysis extracts axes that have ecological meaning
2. How well the CA axes are divided into meaningful segments
3. How faithful certain species are to certain segments of the multivariate space
TWINSPAN TESTS OF ROBUSTNESS & RELIABILITY
Problems arise:
1. The splitting rule of TWINSPAN (dividing the CA axis at its centre) overrides keeping ecologically closely related samples together. Groups of samples that are similar but lie near the centre can get split into two groups. Relocation is necessary (FLEXCLUS).
2. With more complex data, small groups of samples are split off from main body. Outliers. CA sensitive to rare taxa and unusual samples.
3. Displacement of points along first CA axis may be considerable, resulting in poor results, especially occurs when there are two underlying gradients of approximately same length.
4. The division of the first CA axis in the middle, followed by separate CA of each of the two halves of original data creates conditions under which the second CA is not detecting the second gradient but is doing a finer CA of the first gradient. Increases chances of misplacement at centres.
Ideal if:
The first major underlying gradient is considerably larger than the second one and the structure is not complex.
“The erratic behaviour of TWINSPAN beyond the first division makes the results of this analysis of real vegetation data suspect”.
“Use of the TWINSPAN program for vegetation analysis is thus not recommended”.
van Groenewoud (1992)
“TWINSPAN is not a method but a program. A bag of tricks, too unstable and tricky. Better avoided. It uses a kludge of pseudospecies and has many other quirks so that its analyses may be impossible to repeat.”
Oksanen (2003)
Belbin & McDonald (1993)
Artificial data: 480 data sets of 50 sites in 2 dimensions with 2, 3, 4, 5 or 8 clusters.
(TWINSPAN expects 2, 4, 8, 16, 32, 64, ... 2^n clusters)
Recovery of structure (mean of Rand statistic for comparing classifications - 1 is perfect, 0 is terrible).
Mean Rand
TWINSPAN 0.63
Non-hierarchical 0.77
Flexible unweighted GA 0.79
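The Rand statistic used above for comparing classifications can be sketched in plain Python (our own minimal implementation): it is the proportion of object pairs on which the two partitions agree (1 = identical, 0 = total disagreement).

```python
# Rand index between two labellings of the same objects.
from itertools import combinations

def rand_index(labels_a, labels_b):
    pairs = list(combinations(range(len(labels_a)), 2))
    # a pair agrees if both partitions put it together, or both apart
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

a = [1, 1, 2, 2, 3, 3]
b = [1, 1, 2, 2, 2, 3]        # one object moved between groups
print(rand_index(a, a))        # 1.0
print(rand_index(a, b))        # 12 of 15 pairs agree = 0.8
```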
Major TWINSPAN problem (and of all divisive procedures): an early ‘error’ in a division can have serious effects, as it cannot be undone except by relocation (FLEXCLUS).
Useful tool - not the final classification!
Tausch, Charlet, Weixelman & Zamudio (1995)
J. Vegetation Science 6, 897-902
“Patterns of ordination and classification instability resulting from changes in input data order”.
Claimed results from correspondence analysis-based methods such as TWINSPAN dependent on order of input, i.e. vary with entry order of data.
Is this a problem of the Method?
Algorithm?
Software? Used “TWINSPAN as compiled for PC-ORD package”.
“Found 1-4 plots changed group affiliation and relationships between groups often changed”.
TWINSPAN INSTABILITY
Explanation: Oksanen & Minchin (1997) J. Vegetation Science 8, 447-454
TWINSPAN uses correspondence analysis as the basis for ordering and dividing samples. In TWINSPAN a very fast algorithm is used to extract the correspondence analysis axes. This iterative procedure stops after a maximum number of iterations or on reaching a convergence criterion (tolerance), whichever comes first.
TWINSPAN                 Max iterations   Tolerance
Original Hill (1979)           5          0.003
Strict criteria              999          0.000005
(Version 2.2a)

With the strict criteria the instability disappears.
TWINSPAN may have some drawbacks. Also has some major advantages.
The two-way table of samples and species and their groupings is the easiest way of seeing what makes the groups distinct (or otherwise!). It is a powerful way of seeing the fuzziness in the data without hiding it, as techniques like cluster analysis do.
Selection of pseudo-species cut levels can sometimes be a problem.
If the pseudo-species groups are very unequal in width (e.g. group 1-2% vs. group 51-100%), TWINSPAN will give greater weight to the smaller groups, as the smaller values are typically, but not always, of lesser importance and are less well estimated.
www.ecotone.org/Download/twinwght.xls
Allows you to calculate appropriate cut-levels to avoid this weighting problem.
Extensions to TWINSPAN
Basic ordering of objects derived from correspondence analysis axis one. Axis is bisected and objects assigned to positive or negative groups at each stage. Can also use:
1. First PRINCIPAL COMPONENTS ANALYSIS axis
ORBACLAN C.W.N. Looman
Ideal for TWINSPAN style classification of environmental data, e.g. chemistry data in different units, standardise to zero mean and unit variance, use PCA axis in ORBACLAN (cannot use standardised data in correspondence analysis, as negative values not possible).
2. First CANONICAL CORRESPONDENCE ANALYSIS axis.
COINSPAN T.J. Carleton et al. (1996) J. Vegetation Science 7: 125-130
The first CCA axis is the axis that is a linear combination of external environmental variables and maximises the dispersion (spread) of species scores on the axis, i.e. a combination of biological and environmental data is used as the basis for divisions. COINSPAN is a constrained TWINSPAN - ideal for stratigraphically ordered palaeoecological data if sample order is used as the environmental variable.
Dendrograms of (a) TWINSPAN and (b) COINSPAN on a 170 stand pine forest understorey dataset. The number of stands resulting from each division is shown at each level of the dendrogram. Plant species names are the respective indicators on the negative (left) or positive (right) of each division. The mean number of Pinus strobus seedlings between 10 cm and 150 cm in height, with associated standard errors, are given for each of the four final stand groups.
Carleton et al. (1996)
[Figures (a) and (b): COINSPAN dendrograms of lake groups, with indicator taxa (e.g. Zygogonium ericetorum, Homoeothrix juliana, Achnanthes minutissima) at each division, and divisions associated with DIC, Al, DOC and ANC. Final groups: 1 high acidity/clearwater, 2 high acidity/humic, 3 moderate acidity/low alkalinity, 4 low acidity/moderate alkalinity.]
Classification of lake groups and identification of periphyton indicator taxa and key discriminating environmental variables based on a COINSPAN of (a) log10-transformed taxa biovolumes and (b) taxonomic presence-absence data, along with log10-transformed water chemistry data.
Vinebrooke & Graham (1997)
OPTIMISING A CLASSIFICATION: K-MEANS CLUSTERING
• Agglomerative clustering carries a legacy of its history: once formed, classes cannot be changed, even when that would be sensible at a chosen level
• K-means clustering: iterative procedure for non-hierarchical classification
• If started from a chosen hierarchical clustering, that classification will be optimised
• Best suited with centroid or minimum-variance linkage, since it uses the same criterion but in a non-hierarchical way
• Computationally difficult; cannot be sure the optimal solution is found
J. Oksanen (2002)
NON-HIERARCHICAL K-MEANS CLUSTERING
Given n objects in m-dimensional space, find the partition into k groups or clusters such that the objects within each cluster are more similar to one another than to objects in the other cluster. The number of groups k is determined by the user.
In k-means, the numerical function that the partition should minimise is, as in minimum-variance cluster analysis (= Ward’s method), the total error sum of squares (E_K²) or variance, but it does not impose any hierarchical structure.
Tries to form clusters to maximise between-cluster variance and to form groups of samples that will achieve the largest number of significant differences in ANOVA for the variables in relation to the clusters.
Major practical problem is that the solution on which the computation eventually converges depends to some extent on the initial centroids or groups. Only way to be sure that the optimal solution has been found is to try all possible solutions in turn.
Impossible for any real-size problem - 50 objects, 10^80 possible solutions!
POSSIBLE SOLUTIONS
1. Provide initial configuration based on ecological knowledge and hopefully this will provide a good start for the algorithm.
2. Provide an initial configuration based on a hierarchical clustering. The k-means algorithm will try to rearrange the group membership to find a better overall solution (lower E_K²).
3. Select as a group seed for each of the k groups some objects thought to be 'typical' of each group.
4. Assign the objects at random to the various groups, find a solution, and note Ek². Repeat many times (e.g. 100), starting each time with a different random configuration. Retain the solution with the lowest Ek².
5. Alternating least-squares algorithm (e.g. the K-MEANS program, R).
There are several ways to help a k-means algorithm converge to the overall minimum of the criterion (Ek²).
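The restart strategies above can be sketched in code. Below is a minimal pure-Python k-means with random-seed restarts (strategy 4), assuming objects are coordinate tuples; it is an illustrative sketch, not the K-MEANS program or R's implementation.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=None):
    """One k-means run from random group seeds; returns (labels, Ek2)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                  # initial group seeds
    labels = [0] * len(points)
    for _ in range(max_iter):
        # assign each object to the nearest centroid
        labels = [min(range(k), key=lambda g: dist2(p, centroids[g]))
                  for p in points]
        # recompute centroids (keep the old one if a group empties)
        new = []
        for g in range(k):
            members = [p for p, l in zip(points, labels) if l == g]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else centroids[g])
        if new == centroids:                           # assignments are stable
            break
        centroids = new
    ek2 = sum(dist2(p, centroids[l]) for p, l in zip(points, labels))
    return labels, ek2

def kmeans_best(points, k, restarts=100):
    """Strategy 4: many random starts, keep the solution with lowest Ek2."""
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda res: res[1])
```

Each restart can converge to a different local minimum of Ek²; keeping the best of many restarts is the practical substitute for the impossible exhaustive search.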
K-MEANS SOFTWARE
Pierre Legendre (www.fas.umontreal.ca/biol/legendre)
K observations are selected as 'group seeds' and cluster centroids are computed.
Assign each object to the nearest seed.
Carry on moving objects until Ek² can no longer be improved.
In the iterations, the program tries to minimise the sum, over all groups, of the squared within-group residuals, which are the distances of the objects to the respective group centroids. Convergence is reached when the residual sum of squares cannot be lowered any more. The groups obtained are geometrically as compact as possible around their respective centroids.
Cannot guarantee to find the absolute minimum of Ek². It is necessary to repeat the analysis several times with different initial group seeds.
For each number of groups (K), calculate the Calinski-Harabasz pseudo-F statistic (C-H):
C-H = [R² / (K − 1)] / [(1 − R²) / (n − K)]
where R² = (SST − SSE) / SST, SST is the total sum of squared distances to the overall centroid, and SSE is the sum of squared distances of the objects to their own group centroids.
HOW MANY GROUPS?
We look for the number of groups, K, for which the Calinski-Harabasz criterion is maximum. This gives the most compact set of clusters.
No. of groups (K)    C-H
2                    1039.39
3                    1143.07
4                    1445.60
5                    1320.63
6                    1279.83
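The C-H criterion is straightforward to compute for any given partition. A minimal sketch, assuming objects are coordinate tuples, labels assign each object to one of K non-empty groups, and distances are Euclidean:

```python
def calinski_harabasz(points, labels, k):
    """Pseudo-F statistic C-H = [R2/(K-1)] / [(1-R2)/(n-K)],
    with R2 = (SST - SSE) / SST. Assumes all k groups are non-empty."""
    n = len(points)
    dim = len(points[0])
    # SST: squared distances of all objects to the overall centroid
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    sst = sum(sum((p[d] - overall[d]) ** 2 for d in range(dim))
              for p in points)
    # SSE: squared distances of objects to their own group centroids
    sse = 0.0
    for g in range(k):
        members = [p for p, l in zip(points, labels) if l == g]
        cen = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        sse += sum(sum((p[d] - cen[d]) ** 2 for d in range(dim))
                   for p in members)
    r2 = (sst - sse) / sst
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))
```

Evaluating this for K = 2, 3, 4, ... and taking the maximum reproduces the table above in spirit: the K with the largest C-H gives the most compact clustering.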
K-MEANS SOFTWARE
1. Four different initial assignment procedures
2. Input data
3. Data transformation options (gives different implicit proximity measures – k-means requires Euclidean distances)
4. Variable weightings if required
Output
1. K and C-H values
2. Details of group membership for each K
Dimensions: 100,000 objects, 250 variables, 30 groups.
Very fast!
R
K-MEANS CLUSTERING - A SUMMARY
Provides useful and relatively fast non-hierarchical partitioning of large or gigantic data sets. Generally finds a near-optimal solution in a matter of minutes.
Important to compare with results from hierarchical cluster analysis procedures to see how the partitioning has been distorted by imposing a hierarchical structure on the data.
Problem is how to display results for large data-sets.
• map clusters in geographical space
• overlay on ordination plots
• cluster summaries (means, ranges, etc)
• re-arrange data tables
'FUZZY' CLUSTERING
Some objects may clearly belong to some groups. Other objects have group membership that is much less obvious.
18 objects assigned into 2 groups that minimise the sum-of-squares criterion.
What about objects A, B, C, and D? Less clearly associated with groups C1 and C2 than the other 14 objects.
Fuzzy clustering gives each object a membership function between 0 and 1 that specifies the strength with which each object can be regarded as belonging to each group.
Gordon 1999
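The membership functions can be computed from the distances of each object to the cluster centres. A sketch of the standard fuzzy c-means membership formula, u_i = 1 / Σ_j (d_i/d_j)^(2/(m−1)), with the usual fuzziness exponent m = 2; the centroids here are assumed already known:

```python
def dist(a, b):
    """Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fuzzy_memberships(points, centroids, m=2.0):
    """Membership of each object in each cluster, summing to 1 per object
    (standard fuzzy c-means membership formula, fuzziness exponent m)."""
    out = []
    for p in points:
        d = [dist(p, c) for c in centroids]
        if 0.0 in d:                      # object sits exactly on a centroid
            out.append([1.0 if x == 0.0 else 0.0 for x in d])
            continue
        memb = [1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                          for j in range(len(centroids)))
                for i in range(len(centroids))]
        out.append(memb)
    return memb and out
```

An object equidistant from two centres gets memberships [0.5, 0.5]; an object near one centre gets a membership close to 1 for that cluster, which is exactly the behaviour illustrated by objects A-D above.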
Three groups of artificial data with 50 samples each from three bivariate normal distributions.
Data with known group structure. Sum-of-squares clustering into 3 groups
Fuzzy clustering: objects with membership function less than 0.5 are circled. These cannot really be grouped.
Gordon (1999)
'FUZZY' CLUSTERING IN ECOLOGY
Equihua, M. (1990) J. Ecology 78, 519-534
'Fuzzy' c-means algorithm - similar to k-means clustering, but with object weightings or membership functions iteratively changed to minimise the sum-of-squares criterion.
Used correspondence analysis ordination as starting configuration for 'fuzzy' c-means clustering.
Compared 'fuzzy' c-means with TWINSPAN. Assigned each object to the group with which it has the highest membership function for comparison purposes.
Membership functions usually 0.5 - 0.7, a few as high as 0.9.
Good agreement at two or four group levels, less good with more groups.
Both find the 'obvious' groups; differ in fine-level divisions.
Suggests ecological data consist of (1) objects falling in some clear groups and of (2) objects that are clearly intermediate.
'Discontinuous data' and 'continuous data'
Software FUZPHY http://labdsv.nr.usu.edu/
(LABDSV = Laboratory for Dynamic Synthetic Vegephenomenology)
R
'FUZZY' CLUSTERING - A SUMMARY
• Classes can be useful for many purposes (e.g. maps)
• Fuzzy clustering combines good aspects of classification and ordination
• Each observation is given a membership function of class membership
• Corresponding crisp classification: each object is assigned to the class with the highest membership function
• Non-hierarchic, flat classification
• Iterative procedure
• Does not pretend the classes are natural entities
J. Oksanen (2002)
MIXTURE MODELS FOR CLUSTER ANALYSIS
Cluster analysis methods have no underlying theoretical statistical model except for mixture models.
Finite mixture distributions
Sample of individuals from some population have heights recorded, but gender not recorded.
Density function of height
h(height) = p(female) h1(height | female) + p(male) h2(height | male)
where p(female) and p(male) are the probabilities that a member of the population is female or male, and h1 and h2 are the height density functions for females and males.
The density function of height is thus a superposition of two conditional density functions, known as a finite mixture density.
If h1 and h2 follow normal distribution, can estimate p (female) and p (male) by maximum likelihood procedures.
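The maximum likelihood estimation can be sketched with the EM algorithm for this two-component case. A minimal sketch for one variable; the starting values (component means at the extremes of the data) are an illustrative choice, not a prescribed one:

```python
import math

def norm_pdf(x, mu, sd):
    """Normal density N(mu, sd) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_two_normals(x, iters=200):
    """ML fit of h(x) = p N(mu1, sd1) + (1 - p) N(mu2, sd2) by EM."""
    p, mu1, mu2 = 0.5, min(x), max(x)           # crude but deterministic start
    sd1 = sd2 = (max(x) - min(x)) / 4.0
    for _ in range(iters):
        # E-step: posterior probability each observation came from component 1
        r = []
        for xi in x:
            a = p * norm_pdf(xi, mu1, sd1)
            b = (1 - p) * norm_pdf(xi, mu2, sd2)
            r.append(a / (a + b))
        # M-step: re-estimate mixing proportion, means, and SDs
        n1 = sum(r)
        n2 = len(x) - n1
        p = n1 / len(x)
        mu1 = sum(ri * xi for ri, xi in zip(r, x)) / n1
        mu2 = sum((1 - ri) * xi for ri, xi in zip(r, x)) / n2
        sd1 = max(1e-6, (sum(ri * (xi - mu1) ** 2
                             for ri, xi in zip(r, x)) / n1) ** 0.5)
        sd2 = max(1e-6, (sum((1 - ri) * (xi - mu2) ** 2
                             for ri, xi in zip(r, x)) / n2) ** 0.5)
    return p, mu1, sd1, mu2, sd2
```

With well-separated groups (e.g. female and male heights) the E-step responsibilities go to 0 or 1 and the estimates recover the group means; with heavily overlapping groups the algorithm illustrates the convergence problems listed later in this section.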
Can be extended to more than one group and more than one variable
f(x) = Σ (i = 1 to g) p_i φ(x; μ_i, Σ_i)

where the mixing proportions satisfy

0 ≤ p_i ≤ 1 and Σ (i = 1 to g) p_i = 1

and φ(x; μ_i, Σ_i) is the multivariate normal density for cluster i, with mean vector μ_i and covariance matrix Σ_i, for l variables:

φ(x; μ_i, Σ_i) = (2π)^(−l/2) |Σ_i|^(−1/2) exp{−(1/2) (x − μ_i)′ Σ_i^(−1) (x − μ_i)}
Assumes g clusters and within each cluster the variables have a multivariate normal distribution.
Clusters would be formed on the basis of the maximum values of the estimated posterior probabilities

p̂(s | x) = p̂_s φ(x; μ̂_s, Σ̂_s) / Σ (i = 1 to g) p̂_i φ(x; μ̂_i, Σ̂_i)

where p̂(s | x) is the estimated probability that an individual with vector x of observations belongs to group s.
SIMPLE EXAMPLE
50 observations from each of two bivariate normal distributions with the following properties:

Density one: μx = 1.0, μy = 1.0; σx = 1.0, σy = 1.0; ρ = 0.0
Density two: μx = 4.0, μy = 5.0; σx = 2.0, σy = 0.5; ρ = 0.0

Results of fitting a two-component normal mixture:

            Proportion   Means          SDs            Correlation
Cluster 1   0.50         [1.14, 0.64]   [0.95, 1.10]    0.16
Cluster 2   0.50         [3.94, 4.98]   [2.32, 0.45]   −0.22
Bivariate data containing two clusters
Contour plot of estimated two component normal mixture
Perspective view of estimated two component normal mixture
PROBLEMS
1. Computationally complex: fitting requires the EM (expectation-maximisation) algorithm.
2. Requires well separated densities and/or very large sample sizes.
3. Convergence is often to a local rather than a global solution.
4. Different start values needed in the EM algorithm.
5. Can be very slow to converge.
6. How to estimate g, the number of components? The idea is to use the likelihood ratio to test for the smallest value of g compatible with the data. This is not straightforward and there is no agreed estimation procedure.
REAL EXAMPLE
Blood pressure data - systolic and diastolic blood pressure
Is blood pressure a continuous variable from a single population or from two or three sub-populations with different mean levels? If latter, maybe there is a gene that causes arterial blood pressure to increase faster with age in those who have this gene than it does for people who lack it.
                 Whites                        Non-whites
                 Systolic      Diastolic      Systolic      Diastolic
                 Gp 1   Gp 2   Gp 1   Gp 2    Gp 1   Gp 2   Gp 1   Gp 2
Mean             118.3  147.6  65.7   78.4    116.1  145.9  71.0   94.9
Variance         215    694    102    169     159    552    102    54
No. of subjects  847    301    821    325     136    74     178    30
Percent          74     26     72     28      65     35     85     15
No convincing evidence for groups within systolic, possibly some subgroups within diastolic.
MIXTURE MODELS FOR CATEGORICAL DATA - LATENT CLASS ANALYSIS
Mixture models not suitable for data where variables are categorical as the methods assume that within each group the variables have a multivariate normal distribution.
For categorical data, the mixture assumed needs to involve other component densities.
Multivariate Bernoulli density - within each group the categorical variables are independent of one another, so-called conditional independence assumption.
MIXTURE MODELS FOR MIXED MODE DATA
Data may consist of continuous and categorical variables, so-called mixed mode data.
Mixture models can be extended to include mixed mode data but there are severe computational problems if there are more than about four categorical variables.
INTEGRATED CLASSIFICATIONS BASED ON MIXTURE MODELS
Biological and water-quality data
Traditionally two-step analysis
1. Cluster sites on the basis of the biological data
2. Relate the clusters to the water-quality data by, for example, DISCRIM or linear discriminant analysis
'Asymmetric model' (two steps): Taxa → Clusters (step 1); Clusters related to the water-quality data (step 2).
'Symmetric model' (one step): Taxa + Water-quality data → Clusters.
Can we do this?
Data are in very different units and have very different distributions.
Model-based clustering based on latent class analysis.
MODEL-BASED CLUSTERING BASED ON LATENT CLASS ANALYSIS
ter Braak et al. (2003) Ecological Modelling 160: 235-248
1. There are G classes and each site belongs to one and only one of these classes, but it is unknown which one.
2. The variables (environmental variables and taxon counts) have probability distributions that differ between classes. The conditional probability density of the vector-variable y given class g is p(y | g).
3. The marginal distribution is a mixture of these distributions with mixing proportions (π_g, g = 1, ..., G), i.e.
p(y) = Σ_g π_g p(y | g)
The mixing proportions must sum to unity.
4. Class membership of the sites is unknown. With the data vector y from a site, model allows one to calculate the class membership probability p(g|y) that the site belongs to a particular class.
Mixed biological-environmental data have different data properties:
1. Environmental variables assumed to be quantitative (e.g. pH, conductivity) and to follow a multivariate normal distribution within each class.
2. Biological data (counts of taxa) are assumed to follow independent Poisson distributions within each class.
Combining normal and non-normal variables in one analysis - mixed mode data.
Symmetric model - use taxa and environmental data together to create g classes, both 'response variables'.
Assume taxon counts are independent of the environmental variables within each class.
Let taxon counts be y, environmental variables x
p(x, y) = Σ_g π_g p(x, y | g) = Σ_g π_g p(x | g) p(y | g)

This assumes conditional independence of x and y within each class.
Can be fitted using the EM maximum likelihood algorithm or Bayesian approach using Markov Chain Monte Carlo (MCMC) methods.
In Bayesian approach, prior distribution must be specified for all parameters of the model.
Bayesian approach has several advantages:
1. Flexible. Parameters do not have to be equal or unequal. They can be a 'bit unequal'.
2. Problems in the EM approach of values near zero can be avoided by defining robust prior distributions.
3. Can be extended by including prior information (e.g. habitat preferences of taxa, ecological indicator values) or details about field sampling.
4. Can easily compare one model to another so that model fit is balanced against model complexity. Can find the 'optimal' number of clusters.
5. Means for model checking.
6. Predictions are straightforward and the uncertainty of predictions can be assessed in a natural way by integrating out all relevant sources of uncertainty.
ECOLOGICAL EXAMPLE
Stream invertebrates from five types of habitat
P1 - temporary moorland pools, low pH
P2 - permanent moorland pools, low pH
P3 - moorland pools, medium pH
P7 - large mesotrophic bodies
P8 - medium sized eutrophic bodies
Small real data-set: information criteria in latent class analysis plotted against the number of clusters.
ML solutions: BIC suggests 4 clusters; ML suggests no clear number of clusters.
Bayesian solutions (BF, min G, max G): suggest no clear number of clusters.
Which is reality?
Few groups, many groups, or no groups?
"The trend is towards explicit models and a Bayesian approach to cluster analysis to improve upon the good-old TWINSPAN method. Frequently it is hard to beat ad-hocery and TWINSPAN with modern statistical methods; occasionally it is possible but not often!"
C.J.F. ter Braak (2003)
A REAL CLASS STRUCTURE OF THREE IRIS SPECIES
Data on Iris setosa (s), I. versicolor (c), and I. virginica (v).
J. Oksanen (2002)
RESULTS
No method recovers the three species structure!
J. Oksanen (2002)
CLASSIFICATION OF VARIABLES - POSSIBLE APPROACHES
1. TWINSPAN species ordering of basic data or TWINSPAN of transposed matrix so variables become 'objects' and objects become 'variables'. Problem is how to define realistic pseudo-variables for objects (now 'variables').
2. Concept of species associations is usually based on presence/absence data. Transform data to presence/absence or pseudo-variables first, then transpose matrix so that variables are 'objects' and objects are 'variables'. Calculate suitable similarity or dissimilarity coefficients between 'objects' defined as +/-.
Jaccard: SJ = a / (a + b + c), DJ = (b + c) / (a + b + c)
Sørensen: SS = 2a / (2a + b + c), DS = (b + c) / (2a + b + c)
Do cluster analysis or k-means clustering.
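The two coefficients can be sketched directly from a pair of presence/absence vectors, where a counts joint presences and b, c count presences in only one of the two 'objects':

```python
def jaccard_sorensen(u, v):
    """Jaccard (DJ) and Sørensen (DS) dissimilarities between two
    presence/absence vectors (1 = present, 0 = absent)."""
    a = sum(1 for x, y in zip(u, v) if x and y)        # joint presences
    b = sum(1 for x, y in zip(u, v) if x and not y)    # only in u
    c = sum(1 for x, y in zip(u, v) if not x and y)    # only in v
    dj = (b + c) / (a + b + c)          # Jaccard dissimilarity
    ds = (b + c) / (2 * a + b + c)      # Sørensen dissimilarity
    return dj, ds
```

Computing these for every pair of transposed 'objects' (i.e. species) yields the dissimilarity matrix on which the cluster analysis or k-means clustering is run.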
3. Use other similarity coefficients, transform to distance coefficient as (1 - S) (to ensure coefficient is metric), compute principal co-ordinates of the distance matrix to give co-ordinates of the 'objects' (= variables) in orthogonal multi-dimensional space, and do cluster analysis or k-means clustering.
Legendre & Legendre (1998) Numerical Ecology pp. 355-361
CHOICE OF CLUSTERING METHOD
• Some opt for single linkage: finds distinct clusters, but prone to chaining and sensitive to sampling pattern
• Most opt for average linkage and minimum variance methods: chops data more evenly
• All dependent on appropriate dissimilarity measure: should be ecologically meaningful
• Small changes in data can cause large visual changes in clustering: classification may be optimised for a chosen level (k-means)
• Fuzzy clustering may fail as well, but at least it shows the uncertainty
• TWINSPAN: surprisingly robust and useful
• Mixture models and latent class analysis: complex theory but limited utility
Basic concept and tradition in ecology and biogeography – characteristic or indicator species e.g. species characteristic of particular habitat, geographical region, vegetation type. Valuable in monitoring, conservation, management, description, and stratigraphy.
Add ecological meaning to groups of sites discovered by clustering
INDICATOR SPECIES – indicative of particular groups of sites. ‘Good’ indicator species should be found mostly in a single group of a classification and be present at most of the sites belonging to that group. Important DUALITY (faithful AND high constancy)
INDVAL – Dufrene & Legendre (1997) Ecological Monographs 67, 345-366
Derives indicator species from any hierarchical or non-hierarchical classification of objects
Indicator value index based only on within-species abundance and occurrence comparisons. Its value is not affected by the abundances of other species.
Significance of indicator value of each species is assessed by a randomisation procedure.
INDVAL
DETECTION OF INDICATOR SPECIES or CHARACTER SPECIES

Specificity measure (FAITHFULNESS):
Aij = Nindividuals_ij / Nindividuals_i.
where Nindividuals_ij is the mean abundance of species i across the sites in group j, and Nindividuals_i. is the sum of the mean abundances of species i over all groups (means are used to remove any effects of variation in the number of sites belonging to the various groups).

Fidelity measure (CONSTANCY):
Bij = Nsites_ij / Nsites_.j
where Nsites_ij is the number of sites in group j where species i is present, and Nsites_.j is the total number of sites in group j.

Aij is maximum when species i is present in group j only.
Bij is maximum when species i is present at all sites in group j.

Indicator value: INDVALij = (Aij × Bij) × 100%
Indicator value of species i for a grouping of sites is the largest value of INDVALij observed over all groups j of that classification.
INDVALi = max (INDVALij)
Will be 100% when individuals of species i are observed at all sites belonging to a single group.
INDICATOR SPECIES VALUE
A random re-allocation procedure of sites among the groups is used to test the significance of INDVALi
Can be computed for any given partition of grouping of sites and/or for all levels of a hierarchical classification of sites.
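The INDVAL computation and its randomisation test can be sketched for a single species, given its abundances across sites and one group id per site; the permutation count here is an illustrative choice:

```python
import random

def indval(x, groups, nperm=999, seed=0):
    """Indicator value (in %) of one species for a partition of sites,
    plus a permutation p-value. x[k]: abundance at site k; groups[k]: group."""
    def stat(grp):
        best = 0.0
        total_mean = 0.0
        per_group = {}
        for g in set(grp):
            sites = [i for i, gi in enumerate(grp) if gi == g]
            mean_ab = sum(x[i] for i in sites) / len(sites)   # mean abundance
            present = sum(1 for i in sites if x[i] > 0)
            per_group[g] = (mean_ab, present / len(sites))    # (for Aij, Bij)
            total_mean += mean_ab
        for mean_ab, b in per_group.values():
            a = mean_ab / total_mean if total_mean else 0.0   # specificity Aij
            best = max(best, a * b * 100.0)   # INDVALi = max over groups j
        return best
    observed = stat(groups)
    rng = random.Random(seed)
    perm = list(groups)
    ge = 0
    for _ in range(nperm):
        rng.shuffle(perm)            # random re-allocation of sites to groups
        if stat(perm) >= observed:
            ge += 1
    return observed, (ge + 1) / (nperm + 1)
```

A species confined to one group and present at all its sites scores 100%; a species spread evenly over two equal groups scores 50%. The p-value estimates how often a random re-allocation of sites would match or exceed the observed indicator value.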
INDVAL
Two routes to site groups from the sites × species table: hierarchical clustering (UPGMA, Ward) or non-hierarchical clustering (k-means), or a site ranking by ordination (CA, DCA, PCoA, MDS). TWINSPAN proceeds by primary ordination (CA), subdivision into two subsets, identification of indicator species, refinement of the site ordination, and further subdivision, repeated for each site subset; the species preferential power is then measured and the species classified.
Diagram of the analysis steps for the Q- and R- mode classical analyses, and the TWINSPAN procedure. CA = Correspondence Analysis; DCA = Detrended Correspondence Analysis; MDS = Nonmetric Multidimensional Scaling; PcoA = Principal Coordinates Analysis; UPGMA = Unweighted Pair-Group Method using Arithmetic Average; WARD = Ward’s clustering method.
Q-mode: sites; R-mode: species. TWINSPAN classifies the samples in a divisive hierarchy.
Diagram of the analysis steps for the indicator value method
Any site typology (hierarchical clusters, non-hierarchical k-means clusters, or an ordination-based site ranking from MDS, DCA, PCoA, or CA) can be used for measuring species indicator power. The observed INDVAL is compared with the distribution of randomised INDVAL values obtained by random permutation of the sites in the typology.
INDVAL
Test case: abundances of three artificial species at 25 sites. Species A is widespread, species B reaches its maximum in two groups, and species C is confined to one group.
Test case results. (A) Distribution of abundances of the three species in the five clustering levels. (B) Bar chart of the indicator values of species A, B, and C at different clustering levels, showing the decrease (species A) or increase (species C) of the indicator values when the sites are subdivided.
Test case results: species indicator values and z statistics for five clustering levels.

Indicator value
No. of clusters   1      2      3      4      5
Species A         100    60.9   43.8   32.3   25
Species B         72     85.7   85.7   42.9   40
Species C         40     66.7   66.7   100    90

z-statistic
No. of clusters          2      3      4      5
Species A                7.17   5.7    4.25   2.88
Species B                6.2    5.6    2.19   2.84
Species C                4.21   3.14   6.59   6.39
Abundances of the three species at the 25 sites (five groups of five sites):

Group       1               2              3              4              5
Species A   4 4 4 4 4       5 5 5 5 5      5 5 5 5 5      3 3 3 3 3      3 3 3 3 3
Species B   8 8 8 8 8       4 4 4 4 4      6 6 6 6 6      4 4 4 4 4      0 0 0 0 0
Species C   18 18 18 18 18  2 2 2 2 2      0 0 0 0 0      0 0 0 0 0      0 0 0 0 0
1. Xeric chalky grasslands
2. Mesic chalky grasslands
3. Zn grasslands
4. Atypical heathlands
5. Xeric heathlands
6. Temporary flooded heathlands
7. Peat bogs and raised mires
8. Swamps
9. Pond fringes
10. Alluvial grasslands
Hierarchical dendrogram built with the results of the k-means reallocation clustering method. Reallocations are scarce and the main changes concern the “temporary flooded heathlands” (group 6), which are allocated to wet habitats at the two-, five-, and six-group level and to dry habitats at the other clustering levels.
Carabid beetles 97 species. 123 year-catches from 69 different localities representing 9 habitats.
Dendrogram representing the TWINSPAN classification of the year-catch cycles. The indicator species relative abundance levels are expressed on an ordinal scale (1, 0-2%; 2, 2-5%; 3, 5-10%; 4, 10-20%; and 5, 20-100%).
Chalky mesic grasslands
Chalky xeric grasslands
Zn grasslands and xeric sandy heathlands
Atypical and xeric gravelly heathlandsTemporary flooded heathlands
Peaty heathlands
Fringes of ponds and alluvial grasslands
Swamps and raised mires
T. secalis (1), P. nigrita (1)
D. globosus (1), A. communis (1)
P. melanarius (1)
P. cupreus (1)
A. equestris (1)
C. problematicus (3), A. ater (1)
P. versicolor (3), T. cognatus (1)
P. madidus (1), H. rubripes (1), P. cupreus (1)
P. versicolor (3), P. lepidus (1)
C. melanocephalus (1)
P. cupreus (1)
B. ruficolis (1), D. globosus (1), C. violaceus (1)
P. diligens (1), P. rhaeticus (1), A. fuliginosus (1), P. minor (1)
P. minor (3), A. fuliginosus (1), L. pilicornis (1)
The dendrogram axis runs from alluvial grasslands, pond fringes, swamps, peat bogs, and flooded heathlands (wet habitats, split into eutrophic and oligotrophic) to xeric, atypical, and typical heathlands, Zn grasslands, and mesic and xeric chalky grasslands (dry habitats, split into heathlands (typical and acid) and chalky habitats); one species group is present in all habitats.
Two-way table: SPECIES (indicator and satellite species) × SITES (site groups).
Steps that are followed to build a two-way table from the hierarchical clusters indicator values. The first species group (centre of figure) contains species that are common in all habitats (i.e., having their indicator value maximum when all sites are pooled in one group). At the next step, two species groups are created: one with species dominating in all wet habitats, and the other one with species that are common in all dry habitats. The procedure is repeated for each site cluster.
INDVAL
Site clusters obtained with the k-means method, with the associated indicator species and indicator values in parentheses. All species with an indicator value >25% are listed for each site cluster where they are found, until they reach their maximum indicator value.
Eutrophic wet sites: P. minor (83), P. nigrita (80), L. pilicornis (76), T. secalis (73), P. strenuus (62), A. fuliginosum (55), C. granulatus (53), C. fossor (51), P. atrorufus (47), B. unicolor (44), A. versutum (37), O. helopioides (34), B. dentellum (32), A. viduum (31), A. moestrum (30), B. doris (26), T. placidus (26)

Wet habitats: P. rhaeticus (97), P. diligens (79), A. fuliginosum (56), P. minor (48), T. secalis (39), L. pilicornis (35), P. nigrita (29)

Dry habitats: A. lunicollis (51), P. versicolor (44), P. madidus (42), C. campestris (39), B. ruficollis (37), H. rubripes (36), P. cupreus (35), B. lampros (32), C. melanocephalus (31), P. lepidus (27)

P. versicolor (65), A. lunicollis (64), B. ruficollis (58), C. melanocephalus (49), B. lampros (46), P. lepidus (43), C. problematicus (38), C. campestris (36), A. equestris (35), N. aquaticus (35), M. foveatus (33), B. harpalinus (32), C. fuscipes (29), D. globosus (29), L. ferrugineus (26)

Chalky grasslands: P. madidus (82), H. rubripes (65), P. cupreus (56), B. bipustrulatus (36), C. auratus (27), P. melanarius (27), C. nemoralis (26)

Heathlands: B. ruficollis (76), B. lampros (52), C. problematicus (45), B. harpalinus (43), N. aquaticus (42), L. ferrugineus (36), B. globosus (32), T. cognatus (32), B. nigricorne (30), C. micropterus (30), O. rotundatus (30), A. obscurum (29), H. rufitarsis (29), C. erratus (26)

Zn grasslands: A. equestris (97), C. fuscipes (68), C. campestris (56), A. similata (50), P. lepidus (46)

Typical heathlands: B. ruficollis (88), P. versicolor (61), C. melanocephalus (58), N. aquaticus (57), B. nigricorne (47), C. micropterus (47), O. rotundatus (47), D. globosus (42), M. foveatus (38), C. erratus (34)

Xeric heathlands: N. aquaticus (70), O. rotundatus (67), C. melanocephalus (63), C. erratus (59), H. rufitarsis (48), B. nigricorne (46), P. lepidus (46), B. collaris (37), A. infima (30), H. smaragdinus (30), B. 4-maculatus (30), H. tardus (26), B. properans (26)

Temporary flooded heathlands: T. cognatus (98), D. globosus (78), P. niger (75), A. obscurum (54), P. vernalis (46)

Alluvial grasslands: P. strenuus (87), P. atrorufus (82), A. fuliginosum (65), B. unicolor (64), C. granulatus (61), O. helopioides (54), T. placidus (45), L. rufescens (40)

Pond fringes: A. versutum (88), B. dentellum (75), C. fossor (66), B. doris (63), B. obliquum (38), A. sexpunctatum (29), B. assimile (25)

Oligotrophic wet sites: A. ericeti (28)

Peat bogs and raised mires: A. ericeti (28)

Swamps: A. gracile (42)

All habitats: A. communis (26)

Atypical heathlands: C. problematicus (89), L. ferrugineus (45), A. ater (40)

Mesic chalky grasslands: P. melanarius (99), C. auratus (63), C. violaceus (45), H. rufipes (44)

Xeric chalky grasslands: P. cupreus (59), A. ovalis (30), H. atracus (25), H. puncticollis (25)
INDVAL
Classification not an end in itself. Means to an end. Aid interpretation - external data, e.g. environmental data.
Basic EDA graphical approaches e.g. box plots.
Discriminant analysis using classification groups.
RELATING CLASSIFICATIONS TO EXTERNAL SETS OF VARIABLES
INTERPRETING CLUSTERS
• Clusters may differ in their environment
• Community classification may reflect environmental patterns
• Clustering may detect local peculiarities, whereas (most) ordination methods show the global gradient pattern
J. Oksanen (2002)
Hierarchical classification analysis of 34 lakes in northwestern Ontario based on abundance of 27 species of zooplankton (from Patalas, 1971). The vertical axis represents the information gain (∆I) on fusion. The higher the level of fusion, the more dissimilar are the lakes and their species compositions. On the horizontal axis are lake code numbers and group code letters.
BIOLOGY
The separation of the lake groups on the first two discriminant functions of the eleven environmental variables. Mean values for area and maximum depth are given for each group.
ENVIRONMENTAL VARIABLES IN RELATION TO BIOLOGICAL CLUSTERS
CANVAR, CANOCO
Green & Vascotto (1978) Water Research 12, 583-590
Heino et al. 2003 Ecological Applications 13 (3): 842-852.
235 headwater streams in Finland, macroinvertebrates, wide range of associated environmental variables.
TWINSPAN classification of the study streams. Numbers refer to number of sites in each group. Also shown are mean latitude (solid bars) and pH (shaded bars) for each TWINSPAN end group.
Results of indicator species analysis (INDVAL) at the 4th TWINSPAN division level
Mean values of environmental variables important in discriminating among the TWINSPAN groups at the 4th division level
Discriminant analysis to find the environmental variables that best discriminate between groups at the 2-group (level 1) and 10-group (level 4) TWINSPAN classification.
Wilks' lambda from stepwise DFA for variables best discriminating among groups at the 1st and 4th TWINSPAN division level
Leave-one-out cross-validation in discriminant analysis
Can also look at TWINSPAN groups in an ordination context, in this case non-metric multidimensional scaling
Can also plot TWINSPAN group membership on a canonical correspondence analysis (CCA) plot.
A CCA biplot defined by the first two axes of the ordination of environmental variables and TWINSPAN site groups.
Plot the INDVAL indicator species abundances in ordination space (in this case CCA space)
External variables are not always continuous. They may be +/-, nominal, ordinal, or mixed.
C.J.F. ter Braak (1986) Data Analysis & Informatics IV, 11-21
DISCRIM - simple discriminant functions
Supply biological classification first (e.g. TWINSPAN), then characterise classification in terms of external variables.
Coding as in TWINSPAN
+/- binary data: one 0/1 dummy variable

Quantitative data:
(i) discretise, e.g. <10, 10-20, >20, with disjoint 0/1 coding (X1, X2, X3)
(ii) rank and divide into quartiles: 'pseudo-variables' with conjoint coding
(iii) pseudo-variables

Nominal data: red, green, blue: 3 dummy variables (red 0/1, green 0/1, blue 0/1)

Ordinal data: small, medium, large: 3 dummy variables or pseudo-variables
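The disjoint and dummy codings can be sketched as follows; the class limits (<10, 10-20, >20) and the colour levels are the illustrative examples used above, not fixed choices:

```python
def disjoint_code(value, breaks=(10, 20)):
    """Disjoint 0/1 coding of a quantitative variable into the classes
    <lo, lo-hi, >hi (boundary values fall in the middle class)."""
    lo, hi = breaks
    return [int(value < lo), int(lo <= value <= hi), int(value > hi)]

def nominal_code(value, levels=("red", "green", "blue")):
    """One 0/1 dummy variable per level of a nominal variable."""
    return [int(value == lev) for lev in levels]
```

Each observation is thus expanded into a row of 0/1 pseudo-variables, which is the form the simple discriminant functions in DISCRIM operate on.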
Group variables together on basis of their fidelity to particular TWINSPAN groups. Means of characterising TWINSPAN groups.
MILTRANS
SIMPLE DISCRIMINANT ANALYSIS
Indicator species for the first two levels of division of TWINSPAN. Some species are indicators only if they occur with high abundance, e.g. Curlew >5 means Curlew is an indicator only if its abundance reaches class 5 or over (N: number of heathlands in group; thr: threshold value (maximum discriminant score for the negative group); mn: number of misclassified negatives; mp: number of misclassified positives).
Biological Classification
11 16 76 7 36 5 56 63 66 67 3 34 45 67 77 24 78 4 44 44 55 55 53 44 77 8 13 33 55 82 22 22 22 22 36 1 13 11 11 7
78 97 56 65 11 12 38 90 83 23 52 42 90 16 90 34 12 80 64 56 78 1 23 75 39 19 1 36 78 45 26 24 53 78 9 4 70 44 12 56 89 7
TADO −− −− −− −− −− 52 −2 33 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −3 −− −− −− −− −3 −− −− −− −− −− −− −− −− 0
NAEV 23 −− 43 65 5− 11 1− 1− −− −− −1 −− −− −1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −7 4 0
VANE 44 42 41 44 −5 13 33 22 2− 53 −− −− −− −1 −− 2− −2 −− 42 −3 44 −− −− 4− −− 36 −− −− −− −− −− −− −− 4− −− −− −− −− −− −− −− −− 100
SUBB 12 −− 11 −− −− 11 −2 −− −− −− −− −− −− −− −2 2− −− −− −− −− −− −− −− −− −1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
RUBE −− −− −5 34 44 −1 3− 21 23 −− −− −− −− −− −− −− −− −− −− −− −− −− −− 1− −− 3− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
TOTA 2− −2 41 24 −4 −2 1− 14 23 33 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −5 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
GALL −− −− 4− 3− −− −1 −2 −2 2− 2− −− −− −− −3 −− −− −− −− −− −− −− −− −− −− −− 3− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
OSTR 1− 4− 4− 1− −− −1 13 22 2− 25 −− −− −− −1 −− −− −− −− 2− −− −− −− −− −− −− −4 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
LIMO 14 −2 31 1− −− 12 12 22 −− 44 −− −− −− −− −− −− −− −− −− −− −4 −− −− 3− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
FLAV −− −3 23 4− −− 11 −− −− −− −− −− −− −− −− −− −− −2 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
EXCU −− −− −1 1− −− −− −− −− −− −− −− −− −− −− −− 1− 21 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
CYAN −− −− −− 1− −− −1 −− −− −− −− −− −− −− −1 −− −− 22 44 −− −− −− −− 44 47 −− −− −− −6 −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101
PRAT −2 45 46 75 −− 13 1− −1 3− 2− −− 4− −− −1 −2 −− −− 3− −5 −4 65 −3 −− −− 5− −5 −− −− −4 −− −− 2− −− −− −− −− −− −− −− −− −− −− 11
CAMP −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 46 −− −− −− −− −1 44 −1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 11
OENA 1− −− 12 46 5− 33 43 33 43 33 −− −− 44 −− 33 −− −− −− 42 −− 32 42 −1 −1 43 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 11
CRIS −− −− −− −− −− −− −2 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 11
ARQU 55 76 44 46 66 45 65 55 65 55 56 65 55 43 −2 22 41 45 53 32 44 21 −− 3− 32 46 5 −− −− −− 3− −− 2− −− 5− −3 −− −− −− −− −− −− 10
RUBE 12 44 2− 54 −− 13 32 2− 3− −− 3− −4 4− −2 −2 −3 −− −− 31 −− 22 33 24 −4 2− −− −− −3 −5 −− −− −− −− −− −− −− −− −− −− −− −− −− 10
ARVE 35 −3 44 66 75 67 53 56 54 6− 4− 75 35 44 35 56 55 57 45 67 77 55 77 67 76 76 7 −6 −− 66 53 62 4− 44 −− −− −− −− −− −− −− −− 1100
CANN 56 75 76 66 67 52 56 52 5− 55 5− 66 4− 35 46 56 34 55 42 44 24 33 35 4− −2 5− 5 −− −− 43 −− −− −− −− −− −− 5− −− −− −− −− 5 1101
CANO 12 −− 42 34 44 21 33 22 3− 23 3− −4 34 22 3− 22 13 −− −2 −− 11 −1 −− 1− 2− −− −− −− −− −− −− −− −− −− −− −6 −− −5 −− −− −− 4 1101
TETR 43 −− −− 25 −7 23 23 21 3− −− −− −− −− 13 −− 2− 41 −− 2− 22 −3 5− −− 5− 3− −4 5 −− −− −− −6 −− 3− −− 7− −− −− −− −5 −− −− −− 1101
PERD 1− −− 51 36 5− 41 32 31 2− 4− −− 64 −− 34 −2 2− 23 45 −− 12 21 23 −− 3− −− 44 −− −− −5 −− −− 33 −− −− 5− −− −− −− −− −− −− −− 1101
ALBA 4− 44 33 44 4− 21 −3 1− 33 3− 3− −4 −− −1 35 2− 32 3− −2 −1 11 22 22 1− −2 −− −− −− −− −− −− −− −− −− −− −4 −− −3 −− −− −− −− 1101
TINN −− −− 32 3− 4− 21 14 21 2− 2− 34 −− 3− 21 32 −− −1 −− −1 −1 11 −− −− 1− 2− −− −− −− −− −− −− −− −− −− 5− 35 −− −− −− −− −− −− 111
SVEC 15 −5 3− 5− −− −− −− −− −− −− −− −4 −− −− −− −− −− −− 42 −− 21 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 6 111
SHCO 77 57 75 77 66 64 45 51 47 66 5− 4− 44 −3 −− −− 4− −− −− −− −− −− 3− −− −− −− −− −− −− −− −− −− −− −− 35 −− 7− −− −− −− 7 111
COMM 36 64 66 76 56 51 55 44 3− 66 4− 57 44 −5 55 32 21 −− 24 22 34 41 −− 15 −− 5− − −− −− −− −− 3− 2− −− −7 66 −− −− 7− −− 77 6 10
TRIV 77 76 76 66 76 74 45 65 54 56 6− 57 57 56 65 65 66 44 65 54 25 55 46 76 32 6− 5 77 77 67 −− 62 54 43 7− 66 77 76 77 77 −− 7 110
VIRI 1− −− −− −− −− 2− 33 11 2− −− −− −− 3− −2 43 2− −7 −5 −− 21 −− −− −− −− −1 −− −− −− −− −− −3 22 2− 34 55 44 −− −− −− −− −− −− 110
TROC 77 77 76 76 77 73 77 74 63 77 7− 67 6− 56 77 65 31 67 64 52 25 27 67 6− 21 −− −− −− −4 77 73 64 64 56 77 77 77 77 77 77 77 7 110
EURO 32 −4 −− 1− −− −1 −− −− −− −− 4− −− 54 −− −2 −− 44 −− −− −− −− −− 24 −− −− −− −− −3 −− −− −− −− −− −− −− 3− −− −3 −− −− −− −− 110
CITR 45 −5 53 47 65 52 56 54 44 54 −− 67 −− 45 44 45 53 65 42 32 44 55 −− 54 −1 56 6 73 76 54 5− −2 3− 3− 77 56 66 76 −− −− −− 6 111
ARBO −4 −2 −− −− −− 11 −2 −− −− −− −− −4 −− −2 −3 −− −− 55 −− −− −− −1 −− 37 23 −− −− −6 67 44 5− −− −− −− −− −− −− −− −− −− −− −− 111
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1
0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1
0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0 1
TWINSPAN two-way table of bird species (rows) of Dutch heathlands (columns). Values are logarithmic classes of abundance (number of pairs per 10 hectares). (−: absent; 1: <0.5; 2: 0.5-0.9; 3: 1.0-1.9; 4: 2.0-3.9; 5: 4.0-7.9; 6: 8.0-15.9; 7: >16.0). The top margin gives site identification numbers, printed vertically. The bottom and right-hand margins show the hierarchical classifications of the heathlands and birds, respectively, each with five levels of division. Vertical lines separate groups of sites at level 2; horizontal lines separate the first two species divisions.
Heathland characteristics used in DISCRIM to interpret clusters of heathlands

No.  Abbreviation   Description                                                     Values

AREA
 1   AREA < 20      Area of heath smaller than 20 hectares                          0/1
 2   AR20 − 100     Area of heath between 20 and 100 hectares                       0/1
 3   AREA > 100     Area of heath greater than 100 hectares                         0/1

RECREATIONAL USAGE
 4   RECR MILI      Index of recreational use of heath, including military
                    usage, based on inquiry                                         ranked 1-82
24   RECR EATI      Index of recreational use of heath, excluding military
                    usage, based on inquiry                                         ranked 1-82

ISOLATION
 5   HEAT < 5 KM    Number of other heaths within a radius of 5 km from
                    the border of the heath                                         ranked 1-82

LANDSCAPE
 6   OPEN SAND      Presence of open sand within the heath                          0/1
 7   MOOR POOL      Presence of moorland pools within the heath                     0/1
 8   WET            Presence of wet patches within the heath                        0/1
 9   SURR FORE      Heath at least partly surrounded by woodland                    0/1
10   SURR AGRI      Heath at least partly surrounded by arable land                 0/1

GEOGRAPHICAL POSITION
11   VELU WE        Heath lies on the VELUWE                                        0/1
12   BRAB ANT       Heath lies in BRABANT                                           0/1
13   DREN THE       Heath lies in DRENTHE                                           0/1
14   GRON INGE      Heath lies in GRONINGEN                                         0/1
15   GOOI           Heath lies in ‘het GOOI’                                        0/1
16   LIMB URG       Heath lies in LIMBURG                                           0/1

TOPOGRAPHY, SOIL, AND SOIL HETEROGENEITY (based on soil maps)
17   UNDU LATI      Heath is undulating                                             0/1
18   FEN LAND       Presence of fen-land                                            0/1
19   SAND SOIL      Presence of sandy soil                                          0/1
20   SAND FEN       Presence of sandy soil in fen-land                              0/1
21   1SOI LTYP      Presence of only one soil type                                  0/1
22   2SOI LTYP      Presence of two soil types                                      0/1
23   3SOI LTYP      Presence of three or more soil types                            0/1
11 16 76 7 36 5 56 63 66 67 3 34 45 67 77 24 78 4 44 44 55 55 53 44 77 8 13 33 55 82 22 22 22 22 36 1 13 11 11 7
78 97 56 65 11 12 38 90 83 23 52 42 90 16 90 34 12 80 64 56 78 1 23 75 39 19 1 36 78 45 26 24 53 78 9 4 70 44 12 56 89 7
23 3SOI LTYP 11 −− 1− −− 1− 11 11 11 −− −− −− −− −1 11 −− −− −1 11 −− −− −− −− −− 1− −− 1− 1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− − 11
13 DREN THE −− −− 11 11 11 11 11 11 11 11 11 11 11 11 11 −− −− −− 1− −− −− −− −− −− −− 1− − −− −− −− −− −− −− −− −− −1 −− −− −− −− −− − 11
7 MOOR POOL 11 −1 −1 −1 1− 11 11 −− 11 11 1− −− −− 1− −− 1− 1− −− 11 −1 1− −− −− −− −− 1− − −− −− −− −− −− −− −− −− −1 −− −− −− −− −− − 11
3 AREA >100 11 −1 11 1− −− −1 11 11 1− 1− −− −− −− 11 −1 11 11 −− 11 11 11 11 11 −1 1− −− − −− −− −1 −− 11 1− −1 −− −− −− −− −− −− −− − 11
20 SAND FEN −− −1 −− −− 1− −− 1− −1 −− −− −− −− −− −− 1− −− −− −− −− −− −− −− −− −− −− −− − −− −− −− −− −− −− −− −− 1− −− −− −− −1 −− − 10
18 FEN LAND −− −− 11 1− −− 1− −− 1− −− −− −1 −− −− −− −− −− −− 11 −− −− −− −− −− −− −− −− − −− −− −− −− −− −− −− −− −− 1− −− −− −− −1 11 10
12 BRAB ANT 11 11 −− −− −− −− −− −− −− −− −− −− −− −− −− −− 1− −− −− −− −− −− −− −− −− −1 1 −− −− −− 1− −− −− −− 1− −− −− −− −− −− −− −− 10
10 SURR AGRI 1− 1− 11 11 11 11 11 11 1− 11 11 11 11 −1 −1 −1 −− 11 1− −− −1 −− −− −− −− 11 1 1− −− −− 1− −− −− −− 1− 11 11 11 11 −− 11 1 111
8 WET 11 11 11 1− −1 −1 1− 11 −1 −− −− 11 −− −1 −− −− −− −− −− −− −− −− −− −− −− −− − −− −− −− −− −− −− −− −− 1− 1− 1− 1− 11 1− 1 111
2 AR20 -100 −− 1− −− −1 11 1− −− −− −1 −1 11 22 11 −− 1− −− −− 11 −− −− −− −− −− −1 −− 11 1 −1 11 1− 11 −− −1 1− 11 11 11 −1 −1 −− −− 1 111
24 RECR EATI 32 43 21 11 21 42 44 24 33 42 11 −1 31 22 42 43 43 31 21 44 34 44 44 34 42 42 3 24 33 33 23 23 32 33 11 31 12 23 12 21 11 2 1101
22 19 2SOI LTYP −− −1 −1 1− −− −− −− −− −− 11 1− 11 1− −− 11 −− 1− −− 11 −− −− 11 11 −− 1− −1 1 −− −− −− 11 −− −1 −1 −1 1− −− −− −− −1 1− − 1101
5 SAND SOIL 11 1− −− −1 −1 11 −1 −− 11 11 1− 22 11 11 11 11 11 11 11 11 11 11 11 11 11 11 1 11 11 11 11 11 11 11 11 −1 −1 11 11 11 1− − 1101
9 HEAT <5KM 22 44 23 42 23 44 44 34 34 33 42 −1 34 23 33 34 14 22 43 43 24 14 44 41 32 22 2 11 12 44 23 23 32 33 22 32 11 11 11 11 11 1 1101
4 SURR FORE 11 11 −− −− −− −1 1− −− 11 −− 1− 21 −1 1− 11 11 11 11 11 11 11 11 11 11 11 1− − 11 11 11 11 11 11 11 11 11 −− 1− −1 11 −− − 1100
17 RECR MILI 31 23 21 11 11 32 43 24 22 31 11 1− 31 11 41 33 44 22 13 44 34 34 44 34 33 22 2 24 44 44 24 23 43 43 13 11 22 23 24 22 22 3 1100
11 UNDU LATI −− −− −− −− −− −− 1− −1 −− −− −− −− 1− 11 −1 1− 11 1− 11 11 1− 1− −1 −1 11 −1 − −1 11 11 1− −− −− −− −− −− −− −− 11 −− −− − 10
6 VELU WE −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 11 −1 −− −1 11 11 11 11 11 11 −− − −− −1 11 −− −− −− −− −− −− −− −1 −− −− −− − 10
OPEN SAND −− −− −− −− −− 1− 1− −− −− −− −− −− −1 11 −− 1− 1− −− −− −− −− −− −− −1 −− − −− 1− −− 1− −− −− −− −− −− −− −− −− −− −− − 10
21 1SOI LTYP −− 1− −− −1 −1 −− −− −− 11 −− −1 1− −− −− −− 11 −− −− −− 11 11 −− −− −1 −1 −− − 11 11 11 −− 11 1− 1− 1− −1 11 11 11 1− −1 1 1
16 LIMB URG −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −1 −− −− − −1 1− −− −− −− −− −− −− −− −− −− −− −− −− − 0
15 GOOI −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− − −− −− −− −1 11 11 11 −1 1− −− −− −− −− −− − 0
14 GRON INGE −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− − 1− −− −− −− −− −− −− −− −− 11 1− 11 11 11 − 0
1 AREA <20 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− − 1− −− −− −− −− −− −− −− −− −− 1− 1− 11 11 − 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1
0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 1 0 0 1
DISCRIM two-way table of heath characteristics (rows) of Dutch heathlands (columns). Values are presence (1), absence (−) or indicate quartiles (1, 2, 3, 4). Vertical lines separate groups of sites at level 2. Horizontal lines separate three major groups of attributes.
Environmental variables: the indicator attributes resulting from DISCRIM are those that best predict the biological divisions of TWINSPAN. The basic data tables can be re-arranged using DIATAB.
Rohlf (1974) Ann. Rev. Ecol. Syst. 5, 101-113
Gordon (1999)
Fowlkes & Mallows (1983) J. Amer. Stat. Assoc. 78, 553-569

Approaches (implemented in COMPCLUS):
(1) Cross-classification table
(2) Rand coefficient — Rand (1971) J. Amer. Stat. Assoc. 66, 846-850
For a cross-classification of two partitions of the same $n$ objects, with $n_{ij}$ the number of objects placed in group $i$ of classification A and group $j$ of classification B, the Rand coefficient is

$$c = 1 - \frac{\tfrac{1}{2}\left(\sum_i n_{i\cdot}^{2} + \sum_j n_{\cdot j}^{2}\right) - \sum_i \sum_j n_{ij}^{2}}{\tfrac{1}{2}\,n(n-1)}$$

where $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals of the table.
Cross-classification table for two classifications of the same n = 10 objects:

                       Classification B
                        I    II   III
Classification A    I   2     2     1
                   II   1     0     4

c = 1 − [½{(2+1)² + (2+0)² + (1+4)² + (2+2+1)² + (1+0+4)²} − (2² + 2² + 1² + 1² + 0² + 4²)] / 45
  = 1 − [½{38 + 50} − 26] / 45
  = 1 − 18/45
  = 0.6

Range: 0 (completely dissimilar) to 1 (identical classifications).
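The worked example above can be checked numerically. Below is a minimal sketch (the function name `rand_coefficient` is my own, not from COMPCLUS) that applies the Rand coefficient formula directly to a cross-classification table:

```python
def rand_coefficient(table):
    """Rand (1971) coefficient from a cross-classification table,
    using the formulation given in the text:
    c = 1 - [0.5*(sum of squared row totals + sum of squared column
    totals) - sum of squared cell counts] / (n*(n-1)/2)."""
    n = sum(sum(row) for row in table)                    # total objects
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    sum_sq_cells = sum(x * x for row in table for x in row)
    half_marginals = 0.5 * (sum(r * r for r in row_sums) +
                            sum(c * c for c in col_sums))
    pairs = n * (n - 1) / 2
    return 1 - (half_marginals - sum_sq_cells) / pairs

# Worked example from the text (n = 10): should give c = 0.6
table = [[2, 2, 1],
         [1, 0, 4]]
print(rand_coefficient(table))
```

Identical classifications give 1; completely dissimilar ones approach 0, matching the stated range.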
HOW TO COMPARE CLASSIFICATIONS?
Cross-classification table:

                         Vegetation-landform unit
Cluster      T   F-T  OCF  CCF-A  CCF-B  CCF-C  MFU  MFL  DF  AP   G
I           11    1    ―     ―      ―      ―     3    ―    ―   ―   ―
II           3    1    ―     ―      ―      ―     ―    5    ―   1   ―
III          ―    8    8     ―      4      ―     3    ―    ―   ―   ―
IV           ―    ―    ―     ―      ―      ―     ―    ―    4   1   ―
V            ―    ―    ―     ―      ―      ―     ―    ―    ―   7   2
VI           ―    ―    2     ―     15      ―     1    ―    ―   ―   ―
VII          ―    ―    2     ―      3      6     4    7    ―   ―   ―
VIII         ―    ―    2     ―      ―     14     1    1    ―   ―   ―
IX           ―    ―    ―     8      ―      2     ―    1    ―   ―   ―
Total no.
of samples  14   10   14    8     22     22    12   14    4   9   2

Comparison of the Nine-Group Clustering of Samples Suggested by the Agglomerative Minimum Sum-of-Squares Algorithm
Cross-classification table:

                         Vegetation-landform unit
Cluster      T   F-T  OCF  CCF-A  CCF-B  CCF-C  MFU  MFL  DF  AP   G
1           11    1    ―     ―      ―      ―     3    ―    ―   ―   ―
2            3    1    ―     ―      ―      ―     ―    5    ―   1   ―
3            ―    8    8     ―      4      ―     3    ―    ―   ―   ―
4            ―    ―    ―     ―      ―      ―     ―    ―    4   1   ―
5            ―    ―    ―     ―      ―      ―     ―    ―    ―   7   2
6            ―    ―    2     ―     15      ―     1    ―    ―   ―   ―
7            ―    ―    2     ―      3      6     4    7    ―   ―   ―
8            ―    ―    2     ―      ―     14     1    1    ―   ―   ―
9            ―    ―    ―     8      ―      2     ―    1    ―   ―   ―
Total no.
of samples  14   10   14    8     22     22    12   14    4   9   2

Comparison of the Nine-Group Clustering of Samples Suggested by the Hybrid Algorithm
Matrix of Rand's (1971) Coefficients between Partitions of the Lichti-Federovich and Ritchie (1968) Data Based on Vegetation-Landform Units and Partitions Suggested by Several Numerical Classifications of the Surface-Pollen Data

                                    Vegetation-landform classification
Numerical pollen classification      3 groups   7 groups   11 groups
Agglomerative (3 groups)               0.76       0.65       0.64
Agglomerative (5 groups)               0.69       0.76       0.77
Agglomerative (9 groups)               0.61       0.86       0.87
Hybrid (9 groups)                      0.61       0.86       0.87
Hybrid (11 groups)                     0.59       0.85       0.88
Rand's coefficient should be corrected for chance so as to ensure
1. its expected value is 0 when the partitions are selected at random (subject to the constraint that the row and column totals are fixed)
2. its maximum value is 1
The similarity between two independent classifications of the same set of objects can be assessed by comparing their Rand statistic with its distribution under the randomisation model.
For small numbers of objects n, the complete set of values of the Rand statistic over all n! permutations can be evaluated. For large n, comparison is made with the values resulting from a random subset of permutations.
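A widely used chance-corrected form with exactly these two properties is the adjusted Rand index of Hubert & Arabie (1985). A minimal sketch follows (the function name `adjusted_rand` is mine; Python 3.8+ is assumed for `math.comb`):

```python
from math import comb

def adjusted_rand(table):
    """Chance-corrected Rand index from a cross-classification table:
    expected value 0 for random partitions with fixed row and column
    totals, maximum value 1 for identical partitions."""
    n = sum(sum(row) for row in table)
    sum_cells = sum(comb(x, 2) for row in table for x in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)   # chance expectation
    max_index = (sum_rows + sum_cols) / 2         # maximum attainable
    return (sum_cells - expected) / (max_index - expected)
```

Identical partitions give 1; partitions no better than chance scatter around 0 (slightly negative values are possible).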
(3) Hill’s coefficient
Hill’s S (Moss 1985, Applied Geogr. 5, 131-150); implemented in CLASSTAT.
Cross-classification table of classifications I and J, with cell counts $a_{ij}$ and grand total $a$. Let $p_{ij} = a_{ij}/a$, with row totals $p_{i\cdot}$ and column totals $p_{\cdot j}$. Then

$$S_{IJ} = \sum_i \sum_j p_{ij}\,\log\frac{p_{ij}}{p_{i\cdot}\,p_{\cdot j}}$$
                        Categories of Classification J
                         1     2     3    ...    n     Totals
Categories of         1  a11   a12              a1n    a1.
Classification I      2  a21                           a2.
                      3  a31                           a3.
                      4
                      ...
                      m  am1              ...   amn    am.
Totals                   a.1   a.2              a.n    a = N
The maximum possible value of $S_{IJ}$ occurs when the two classifications are identical. Then $S_{IJ} = H_I = H_J$, where $H_I$ is the entropy of classification I:

$$H_I = -\sum_i p_{i\cdot}\log p_{i\cdot} \qquad H_J = -\sum_j p_{\cdot j}\log p_{\cdot j}$$

Adapt $S_{IJ}$ so that 0 = independence and 1 = the best agreement possible:

$$\hat{S}_{IJ} = \frac{S_{IJ}}{\min(H_I, H_J)} = \frac{H_I + H_J - H_{IJ}}{\min(H_I, H_J)}$$

where $H_{IJ} = -\sum_i \sum_j p_{ij}\log p_{ij}$.
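The normalised coefficient can be sketched in a few lines of Python (the function name `hills_s` is mine; natural logarithms are used, although the base cancels in the ratio):

```python
import math

def hills_s(table):
    """Normalised Hill's S for a cross-classification table:
    S_IJ / min(H_I, H_J), where S_IJ is the mutual information of
    the two classifications. 0 = independence, 1 = identical."""
    a = sum(sum(row) for row in table)
    p = [[x / a for x in row] for row in table]
    p_i = [sum(row) for row in p]                 # row totals p_i.
    p_j = [sum(col) for col in zip(*p)]           # column totals p_.j
    s_ij = sum(pij * math.log(pij / (p_i[i] * p_j[j]))
               for i, row in enumerate(p)
               for j, pij in enumerate(row) if pij > 0)
    h_i = -sum(x * math.log(x) for x in p_i if x > 0)
    h_j = -sum(x * math.log(x) for x in p_j if x > 0)
    return s_ij / min(h_i, h_j)
```

Cells with $p_{ij} = 0$ contribute nothing to the sums, following the usual convention $0 \log 0 = 0$.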
(4) Cohen's Kappa statistic

$$\hat{\kappa} = \frac{p_o - p_e}{1 - p_e}$$

where

$$p_o = \sum_{i=1}^{c} p_{ii}$$

is the overall proportion of observed agreements, and

$$p_e = \sum_{i=1}^{c} p_{i\cdot}\,p_{\cdot i}$$

is the overall proportion of chance-expected agreements.

Kappa = 1 with perfect agreement; Kappa = 0 when the observed agreement is approximately that expected by chance ($p_o \approx p_e$).
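For a square cross-classification table (the same categories in both classifications), Kappa can be sketched as follows (the function name `cohens_kappa` is mine):

```python
def cohens_kappa(table):
    """Cohen's Kappa from a square c x c cross-classification table:
    kappa = (p_o - p_e) / (1 - p_e), with p_o the observed and p_e
    the chance-expected proportion of agreements."""
    n = sum(sum(row) for row in table)
    p = [[x / n for x in row] for row in table]
    p_o = sum(p[i][i] for i in range(len(p)))           # observed agreement
    p_i = [sum(row) for row in p]                       # row marginals
    p_j = [sum(col) for col in zip(*p)]                 # column marginals
    p_e = sum(p_i[i] * p_j[i] for i in range(len(p)))   # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

A diagonal table (perfect agreement) returns 1; a table whose diagonal matches the chance expectation returns 0.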
TWINSPAN, TWINGRP, TWINDEND
WINTWINS
ORBACLAN, COINSPAN
FLEXCLUS
DISCRIM, MILTRANS, DIATAB
CLUSTER
DCMAT, GOWER
CLUSTAN-PC
CLUSTAN GRAPHICS
USEFUL CLASSIFICATION SOFTWARE
K-MEANS
FUZPHY
CLASSTAT, COMPCLUS
RANMAT, ASSOC
BOOTCLUS, SAMPLERE
CEDIT
R + Libraries
(e.g. mva, cluster, MASS)