NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Upload: melvyn-anthony

Post on 16-Dec-2015


Agglomerative hierarchical cluster analysis

Two-way indicator species analysis – TWINSPAN

Non-hierarchical k-means clustering

‘Fuzzy’ clustering

Mixture models and latent class analysis

Detection of indicator species

Interpretation of classifications using external data

Comparing classifications

Software

CLASSIFICATION


M. R. Anderberg, 1973, Cluster analysis for applications. Academic

H.T. Clifford & W. Stephenson, 1975, An introduction to numerical classification. Academic

B. Everitt, 1993, Cluster analysis. Halsted Press

A.D. Gordon, 1999, Classification. Chapman & Hall

A.K. Jain & R.C. Dubes, 1988, Algorithms for clustering data. Prentice Hall

L. Kaufman & P.J. Rousseeuw, 1990, Finding groups in data. An introduction to cluster analysis. Wiley

H.C. Romesburg, 1984, Cluster analysis for researchers. Lifetime Learning Publications

P.H. A. Sneath & R.R. Sokal, 1973, Numerical taxonomy. W.H. Freeman

H. Späth, 1980, Cluster analysis algorithms for data reduction and classification of objects

BOOKS ON NUMERICAL CLASSIFICATION


P.G.N. Digby & R.A. Kempton, 1987, Multivariate analysis of ecological communities. Chapman & Hall

P. Greig-Smith, 1983, Quantitative plant ecology. Blackwell

R.H.G. Jongman, C.J.F. ter Braak & O.F.R. van Tongeren (eds), 1995, Data analysis in community and landscape ecology. Cambridge University Press

P. Legendre & L. Legendre, 1998, Numerical ecology. Elsevier (Second English Edition)

J.A. Ludwig & J.F. Reynolds, 1988, Statistical ecology. J. Wiley

L. Orloci, 1978, Multivariate analysis in vegetation research. Dr. Junk

E.C. Pielou, 1984, The interpretation of ecological data. J. Wiley

J. Podani, 2000, Introduction to the exploration of multivariate biological data. Backhuys

W.T. Williams, 1976, Pattern analysis in agricultural science. CSIRO Melbourne

Most important are Chapters 7 and 8 in Legendre and Legendre (1998)

BOOKS ON NUMERICAL CLASSIFICATION IN ECOLOGY


Partition set of data (objects) into groups or clusters.

Partition into g groups so as to optimise some stated mathematical criterion, e.g. minimum sum-of-squares. Divide data into g groups so as to minimise the total variance or within-groups sum-of-squares, i.e. to make within-group variance as small as possible, thereby maximising the between-group variance.

Reduce data to a few groups. Can be very useful.
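The minimum sum-of-squares criterion mentioned above can be illustrated with a minimal sketch. The data and the two candidate partitions are hypothetical, and the function names are chosen only for this example. Because the total sum-of-squares is fixed, the partition with the smaller within-groups value automatically has the larger between-groups value.

```python
def sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    n = len(points)
    m = len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(m)]
    return sum((p[k] - centroid[k]) ** 2 for p in points for k in range(m))


def within_groups_ss(points, labels):
    """Within-groups sum-of-squares for a partition given as one label per point."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return sum(sum_of_squares(members) for members in groups.values())


# Hypothetical data: six objects measured on two variables.
data = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.5), (8.5, 8.0)]
partitions = {"natural": [0, 0, 0, 1, 1, 1], "arbitrary": [0, 1, 0, 1, 0, 1]}

total = sum_of_squares(data)
for name, labels in partitions.items():
    within = within_groups_ss(data, labels)
    # The total is fixed, so a smaller within-groups value
    # means a larger between-groups value.
    print(f"{name}: within = {within:.2f}, between = {total - within:.2f}")
```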

Compromise: with 50 objects there are about $10^{80}$ possible classifications, so not all of them can be examined.

Hierarchical classification: agglomerative or divisive.

Major reviews

A.D. Gordon, 1996, Hierarchical classification in clustering and classification (ed. P. Arabie & L.J. Hubert) pp 65-121. World Scientific Publishing, River Edge, NJ

A.D. Gordon, 1999, Classification (Second edition). Chapman & Hall

BASIC AIM


CLASSIFICATION OF CLASSIFICATIONS

Formal - Informal

Hierarchical - Non-hierarchical

Quantitative - Qualitative

Agglomerative - Divisive

Polythetic - Monothetic

Sharp - Fuzzy

Supervised - Unsupervised

Useful - Not useful


MAIN APPROACHES

Hierarchical cluster analysis

formal, hierarchical, quantitative, agglomerative, polythetic, sharp, not always useful.

Two-way indicator species analysis (TWINSPAN)

formal, hierarchical, semi-quantitative, divisive, semi-polythetic, sharp, usually useful.

k-means clustering

formal, non-hierarchical, quantitative, semi-agglomerative, polythetic, sharp, usually useful.

Fuzzy clustering

formal, non-hierarchical, quantitative, semi-agglomerative, polythetic, fuzzy, rarely used but potentially useful.

Mixture models and latent class analysis

formal (too formal!), non-hierarchical, quantitative, polythetic, sharp or fuzzy, rarely used, perhaps not useful with complex data-sets.

All UNSUPERVISED classifications


Warning!

“The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other “statistical” innovation (with the possible exception of multiple regression techniques)”

Cormack, 1970


1. Calculate matrix of proximity or dissimilarity coefficients

2. Clustering

3. Graphical display

4. Check for distortion

5. Validation of results

AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS

PROXIMITY OR DISTANCE OR DISSIMILARITY MEASURES

Hubalek (1982) Biol. Rev. 57, 669-689

Gower & Legendre (1986) J. Classific. 3, 5-48

Archer & Maples (1987) Palaios 2, 609-617

Maples & Archer (1988) Palaios 3, 95-103

Legendre & Legendre (1998) Numerical ecology. Chapter 7

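A. Binary Data

                  Object j
                  +    -
Object i    +     a    b
            -     c    d

Jaccard coefficient

$S_J = \frac{a}{a + b + c}$

Dissimilarity (1 - S)

$D_J = \frac{b + c}{a + b + c}$

Simple matching coefficient

$S_{SMC} = \frac{a + d}{a + b + c + d}$

$D_{SMC} = \frac{b + c}{a + b + c + d}$

Baroni-Urbani & Buser

$S = \frac{\sqrt{ad} + a}{\sqrt{ad} + a + b + c}$

Syst. Zool. (1976) 25, 251-259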
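As a small illustration of the binary coefficients, the sketch below counts a (joint presences), b and c (mismatches) and d (joint absences) for two presence/absence vectors and evaluates the Jaccard, simple matching and Baroni-Urbani & Buser similarities. The two vectors are hypothetical and the function names are chosen only for this example.

```python
from math import sqrt


def abcd(x, y):
    """Counts for two presence/absence vectors:
    a = present in both, b = only in x, c = only in y, d = absent from both."""
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = sum(1 for p, q in zip(x, y) if not p and not q)
    return a, b, c, d


def jaccard(x, y):
    a, b, c, _ = abcd(x, y)  # joint absences are ignored
    return a / (a + b + c)   # undefined if both objects are empty


def simple_matching(x, y):
    a, b, c, d = abcd(x, y)  # joint absences count as agreement
    return (a + d) / (a + b + c + d)


def baroni_urbani_buser(x, y):
    a, b, c, d = abcd(x, y)
    root = sqrt(a * d)
    return (root + a) / (root + a + b + c)


# Hypothetical objects i and j scored on eight binary variables.
obj_i = [1, 1, 1, 0, 0, 0, 0, 0]
obj_j = [1, 1, 0, 1, 0, 0, 0, 0]

print("a, b, c, d =", abcd(obj_i, obj_j))
print("Jaccard similarity:", jaccard(obj_i, obj_j))
print("Jaccard dissimilarity:", 1 - jaccard(obj_i, obj_j))
print("Simple matching:", simple_matching(obj_i, obj_j))
print("Baroni-Urbani & Buser:", round(baroni_urbani_buser(obj_i, obj_j), 4))
```

Here the many joint absences raise the simple matching coefficient well above the Jaccard coefficient, which is the practical difference between the two.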

B. Quantitative Data

[Figure: objects i and j plotted against Variable 1 and Variable 2, with coordinates $(x_{i1}, x_{i2})$ and $(x_{j1}, x_{j2})$]

$d_{ij}^2 = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2$

Euclidean distance - dominated by large values

$d_{ij} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$

Manhattan or city-block metric - less dominated by large values

$d_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$

Bray & Curtis (percentage similarity) - sensitive to extreme values

$d_{ij} = \frac{\sum_{k} |x_{ik} - x_{jk}|}{\sum_{k} (x_{ik} + x_{jk})}$

relates minima to average values and represents the relative influence of abundant and uncommon variables
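The three quantitative measures can be compared on a single pair of objects with a short sketch. The abundance values are hypothetical, and the Bray & Curtis function uses the usual dissimilarity form (sum of absolute differences divided by the sum of all values), which is an assumption about the form intended on the slide.

```python
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def bray_curtis(x, y):
    # Absolute differences scaled by the total abundance of both objects.
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


# Hypothetical abundances of three variables in objects i and j.
x_i = [10, 0, 5]
x_j = [6, 3, 5]

print("Euclidean:", euclidean(x_i, x_j))
print("Manhattan:", manhattan(x_i, x_j))
print("Bray & Curtis:", round(bray_curtis(x_i, x_j), 4))
```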

B. Quantitative Data (cont)

Similarity ratio or Steinhaus-Marczewski coefficient (≡ Jaccard for presence/absence data) - less dominated by extremes

$s_{ij} = \frac{\sum_{k} x_{ik} x_{jk}}{\sum_{k} x_{ik}^2 + \sum_{k} x_{jk}^2 - \sum_{k} x_{ik} x_{jk}}$

(written here as a similarity; the corresponding dissimilarity is $1 - s_{ij}$)

Chord distance for % data - maximises “signal to noise” ratio

$d_{ij} = \sqrt{\sum_{k=1}^{m} \left( p_{ik}^{1/2} - p_{jk}^{1/2} \right)^2}$

C. Percentage Data (e.g. pollen, diatoms)

Standardised Euclidean distance - gives all variables ‘equal’ weight, increases noise in data

Euclidean distance - dominated by large values, rare variables almost no influence

Chord distance (= Euclidean distance of square-root transformed data) - good compromise, maximises signal to noise ratio
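The statement that chord distance is Euclidean distance on square-root transformed data can be sketched directly. The two percentage spectra below are hypothetical (expressed as proportions), and the per-variable printout is only meant to show how the square-root transformation shifts weight away from the most abundant variable.

```python
from math import sqrt


def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def chord(p, q):
    # Euclidean distance after square-root transforming the proportions.
    return euclidean([sqrt(a) for a in p], [sqrt(b) for b in q])


# Hypothetical proportions of three taxa in two samples (each sums to 1).
sample_1 = [0.90, 0.09, 0.01]
sample_2 = [0.80, 0.10, 0.10]

print("Euclidean:", round(euclidean(sample_1, sample_2), 4))
print("Chord:    ", round(chord(sample_1, sample_2), 4))

# Squared contribution of each taxon to the two distances.
for k, (a, b) in enumerate(zip(sample_1, sample_2), start=1):
    print(f"taxon {k}: Euclidean {(a - b) ** 2:.5f}, chord {(sqrt(a) - sqrt(b)) ** 2:.5f}")
```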

D. Transformations

Normalise samples - ‘equal’ weight

Normalise variables - ‘equal’ weight, rare species inflated

No transformation - quantity dominated

Double transformation - equalise both, compromise

Normalisation is based on the sum of squares, e.g. $\sum_{k=1}^{m} x_{ik}^2$ for sample $i$.

Noy-Meir et al. (1975) J. Ecology 63, 779-800

E. Mixed data (e.g. quantitative, qualitative, binary)

Gower coefficient (see Lecture 12)

AGGLOMERATIVE HIERARCHICAL CLUSTER ANALYSIS (five stages)

i. Calculate matrix of proximity (similarity or dissimilarity measures) between all pairs of n samples, i.e. ½ n (n - 1) values

ii. Fuse objects into groups using stated criterion, ‘clustering’ or sorting strategy

iii. Graphical display of results

- dendrograms or trees

- graphs

- shadings

iv. Check for distortion

v. Validation of results?

i. Simple Distance Matrix

$d_{ij}^2 = \sum_{k=1}^{m} (x_{ik} - x_{jk})^2$

D =

Objects    1    2    3    4    5
   1       -
   2       2    -
   3       6    5    -
   4      10    9    4    -
   5       9    8    5    3    -

ii. Clustering Strategy using Single-Link Criterion

Find objects with smallest dij = d12 = 2

Calculate distances between this group (1 and 2) and other objects

d(12)3 = min { d13, d23 } = d23 = 5

d(12)4 = min { d14, d24 } = d24 = 9

d(12)5 = min { d15, d25 } = d25 = 8

D =

         1+2    3    4    5
  1+2     -
   3      5     -
   4      9     4    -
   5      8     5    3    -

Find objects with smallest dij = d45 = 3

Calculate distances between (1, 2), 3, and (4, 5)

D =

         1+2    3    4+5
  1+2     -
   3      5     -
  4+5     8     4     -

Find object with smallest dij = d3(4, 5) = 4

Fuse object 3 with group (4 + 5)

Now fuse (1, 2) with (3, 4, 5) at distance 5
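The worked single-link example can be replayed with a minimal sketch. It uses the five-object distance matrix from the example; the function name and the tuple representation of clusters are choices made only for this illustration, and the brute-force search over all cluster pairs is not how production software does it.

```python
from itertools import combinations


def single_link(dist, n):
    """Agglomerate objects 1..n; dist maps (i, j) with i < j to a distance.
    Returns the fusions as (cluster_a, cluster_b, level) in the order made."""
    clusters = [(i,) for i in range(1, n + 1)]

    def gap(a, b):
        # Single-link: smallest distance between any member of a and any member of b.
        return min(dist[min(i, j), max(i, j)] for i in a for j in b)

    fusions = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        fusions.append((a, b, gap(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return fusions


# Lower triangle of the distance matrix D from the worked example.
D = {(1, 2): 2, (1, 3): 6, (2, 3): 5, (1, 4): 10, (2, 4): 9,
     (3, 4): 4, (1, 5): 9, (2, 5): 8, (3, 5): 5, (4, 5): 3}

for a, b, level in single_link(D, 5):
    print(f"fuse {a} with {b} at distance {level}")
```

The fusion levels come out as 2, 3, 4 and 5, matching the steps listed in the example.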

I and J fuse. Need to calculate the distance of K to (I, J).

Single-link (nearest neighbour) - fusion depends on distance between closest pair of objects; produces ‘chaining’

Complete-link (furthest neighbour) - fusion depends on distance between furthest pair of objects

Median - fusion depends on distance between K and the mid-point (median) of the line IJ; ‘weighted’ because I and J are treated as equal (I ≈ J) even though their sizes differ (1 compared with 4)

Centroid - fusion depends on distance between K and the centre of gravity (centroid) of I and J; ‘unweighted’ as the size of J is taken into account

Also:

Unweighted group-average: distance between K and (I, J) is the average of all distances from objects in I and J to K, i.e. $\frac{\sum d}{5}$ (I holds 1 object and J holds 4, giving 5 distances).

Weighted group-average: distance between K and (I, J) is the average of the distance between K and J (i.e. $d_{JK} = \frac{\sum d}{4}$) and the distance between I and K, i.e. $\frac{d_{IK} + d_{JK}}{2}$.


Single-link (nearest neighbour)

Complete-link (furthest neighbour)

Median

Centroid

Unweighted group-average

Weighted group-average

Minimum variance, sum-of-squares (Ward’s method). Orloci (1967) J. Ecology 55, 193-206

$Q_I$, $Q_J$, $Q_K$ are the within-group variances. Fuse I with J to give (I, J) if and only if

$Q_{IJ} - (Q_I + Q_J) \le Q_{IK} - (Q_I + Q_K)$ or $Q_{JK} - (Q_J + Q_K)$

i.e. only fuse I and J if neither will combine better and make a lower sum-of-squares with some other group.
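The minimum variance rule can be sketched by computing, for every candidate pair, the increase in within-group sum-of-squares that the fusion would cause, and fusing the pair with the smallest increase. This is a minimal illustration that takes Q to be the within-group sum-of-squares; the three groups I, J and K are hypothetical.

```python
from itertools import combinations


def sum_of_squares(points):
    """Within-group sum of squared distances to the group centroid (Q)."""
    n = len(points)
    m = len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(m)]
    return sum((p[k] - centroid[k]) ** 2 for p in points for k in range(m))


def fusion_cost(group_a, group_b):
    """Increase in sum-of-squares on fusing two groups: Q_AB - (Q_A + Q_B)."""
    fused = group_a + group_b
    return sum_of_squares(fused) - (sum_of_squares(group_a) + sum_of_squares(group_b))


# Hypothetical groups of objects measured on two variables.
groups = {
    "I": [(1.0, 1.0), (2.0, 1.0)],
    "J": [(2.0, 2.0)],
    "K": [(8.0, 8.0), (9.0, 8.0)],
}

costs = {(a, b): fusion_cost(groups[a], groups[b]) for a, b in combinations(groups, 2)}
for pair, cost in costs.items():
    print(pair, round(cost, 3))
print("fuse:", min(costs, key=costs.get))
```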

(distance between group k and group (i, j) follows a recurrence formula, where $\alpha$, $\beta$ and $\gamma$ are parameters for different methods)

Wishart (1969) Biometrics 25, 165-170

$d_{k(ij)} = \alpha_i d_{ki} + \alpha_j d_{kj} + \beta d_{ij} + \gamma |d_{ki} - d_{kj}|$

CLUSTER

CLUSTAN-PC

CLUSTAN-GRAPHICS

GENERALISED SORTING STRATEGY

Method                        α_i                       α_j                       β                                  γ
Single-link                   ½                         ½                         0                                  -½
Furthest-link                 ½                         ½                         0                                  ½
Median                        ½                         ½                         -¼                                 0
Group average (unweighted)    n_i/(n_i+n_j)             n_j/(n_i+n_j)             0                                  0
Group average (weighted)      ½                         ½                         0                                  0
Centroid                      n_i/(n_i+n_j)             n_j/(n_i+n_j)             -α_i α_j = -n_i n_j/(n_i+n_j)^2    0
Sum of squares                (n_i+n_k)/(n_i+n_j+n_k)   (n_j+n_k)/(n_i+n_j+n_k)   -n_k/(n_i+n_j+n_k)                 0

Single-link example, to calculate distance d3(1,2):

$d_{3(1,2)} = \alpha_i d_{ki} + \alpha_j d_{kj} + \beta d_{ij} + \gamma |d_{ki} - d_{kj}|$

$= \tfrac{1}{2} d_{31} + \tfrac{1}{2} d_{32} + 0 \cdot d_{12} - \tfrac{1}{2} |d_{31} - d_{32}|$, with $d_{31} = 6$ and $d_{32} = 5$

$= \tfrac{1}{2}(6) + \tfrac{1}{2}(5) + 0 - \tfrac{1}{2}(1) = 3 + 2.5 - 0.5 = 5$

Can also have flexible clustering with user-specified β (usually -0.25), where β = 1 - α_i - α_j.
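The generalised recurrence can be written as one small function, with the method chosen purely through its parameters. The sketch below is illustrative only: the function names and method labels are invented for this example, only some of the methods are included, and the sum-of-squares parameters are conventionally applied to squared distances.

```python
def lance_williams(d_ki, d_kj, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance from group k to the fused group (i, j)."""
    return alpha_i * d_ki + alpha_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)


def parameters(method, n_i=1, n_j=1, n_k=1):
    """(alpha_i, alpha_j, beta, gamma) for some of the tabulated methods."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "weighted_average":
        return 0.5, 0.5, 0.0, 0.0
    if method == "unweighted_average":
        return n_i / (n_i + n_j), n_j / (n_i + n_j), 0.0, 0.0
    if method == "sum_of_squares":  # usually applied to squared distances
        n = n_i + n_j + n_k
        return (n_i + n_k) / n, (n_j + n_k) / n, -n_k / n, 0.0
    raise ValueError(f"unknown method: {method}")


# Distances from the worked example: d31 = 6, d32 = 5, d12 = 2.
d_31, d_32, d_12 = 6, 5, 2
for method in ("single", "complete", "unweighted_average"):
    print(method, lance_williams(d_31, d_32, d_12, *parameters(method)))
```

With these inputs the single-link parameters return the minimum of the two distances (5) and the furthest-link parameters return the maximum (6), which is a quick check that the γ term is doing its job.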

CLUSTERING STRATEGIES

Single link = nearest neighbour

Finds the minimum spanning tree, the shortest tree that connects all points

Finds discontinuities if they exist in data

Chaining common

Clusters of unequal size

Complete-link = furthest neighbour

Compact clusters of ± equal size

Makes compact clusters even when none exist

Average-linkage methods

Intermediate between single and complete link

Unweighted GA maximises cophenetic correlation

Clusters often quite compact

Make quite compact clusters even when none exist

Median and centroid

Can form reversals in the tree

Minimum variance sum-of-squares

Compact clusters of ± equal size

Makes very compact clusters even when none exist

Very intense clustering method


iii. Graphical display

Dendrogram ‘Tree Diagram’

Group average dendrogram of 65 regions in Europe. The measure of pairwise similarity is Jaccard’s coefficient, based on the presence or absence of 144 species of fern.

Limit the number of different values taken by the heights of internal nodes, or the number of internal nodes.

Global parsimonious tree of the dendrogram.

Parsimonious Trees


Local parsimonious tree of the dendrogram


A similarity matrix based on scores for 15 qualities of 48 applicants for a job. The dendrogram shows a furthest-neighbour cluster analysis, the end points of which correspond to the 48 applicants in sorted order.

Ling (1973) Comm. Assoc. Computing Mach. 16, 355-361

Matrix Shading


Schematic way of combining row and column

hierarchical analyses

Re-order Data Matrix


Summarised two-way table of the Malham data set. The representation of the species groups (1-23) delimited by minimum variance cluster analyses in the eight quadrat clusters (A-H) is shown by the size of the circle. In addition, both the quadrat and species dendrograms derived from minimum-variance clustering are shown to show the relationships between groups.


Cophenetic correlations. The similarity matrix S contains the original similarity values between the OTU’s (in this example it is a dissimilarity matrix U of taxonomic distances). The UPGMA phenogram derived from it is shown, and from the phenogram the cophenetic distances are obtained to give the matrix C. The cophenetic correlation coefficient rcs is the correlation between corresponding pairs from C and S, and is 0.9911.
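A cophenetic correlation can be computed by hand for a small case. The sketch below is a minimal illustration, not the data of the figure: it runs an unweighted group-average (UPGMA) clustering on the five-object distance matrix used earlier in the lecture, records the level at which each pair of objects first joins (the cophenetic distance), and correlates those levels with the original distances. Function names are chosen only for this example.

```python
from itertools import combinations
from math import sqrt


def upgma_cophenetic(dist, n):
    """Cophenetic distances {(i, j): fusion level} from UPGMA on objects 1..n."""
    clusters = [(i,) for i in range(1, n + 1)]

    def average(a, b):
        # Unweighted group average: mean of all between-group object distances.
        return sum(dist[min(i, j), max(i, j)] for i in a for j in b) / (len(a) * len(b))

    cophenetic = {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        level = average(a, b)
        for i in a:
            for j in b:
                cophenetic[min(i, j), max(i, j)] = level
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return cophenetic


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


D = {(1, 2): 2, (1, 3): 6, (2, 3): 5, (1, 4): 10, (2, 4): 9,
     (3, 4): 4, (1, 5): 9, (2, 5): 8, (3, 5): 5, (4, 5): 3}

C = upgma_cophenetic(D, 5)
pairs = sorted(D)
r = pearson([D[p] for p in pairs], [C[p] for p in pairs])
print("cophenetic correlation:", round(r, 3))
```

A value close to 1 indicates that the dendrogram distorts the original distances very little; lower values indicate more distortion.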

iv. Tests for Distortion

R

CLUSTER

Cluster analysis of the Mancetter data

[Figure panels: Ward’s Method analysis of the data; Average link analysis]

Results of different classifications obtained by cluster analyses of the Mancetter data

       Method                     Method
No.   1  2  3  4  5  6     No.   1  2  3  4  5  6
 1    1  1  1  1  1  1     24    1  1  1  1  2  3
 2    1  1  1  1  1  1     25    1  1  1  1  2  1
 3    2  2  3  2  3  2     26    2  2  2  2  1  3
 4    1  1  1  1  1  1     27    2  2  2  2  2  2
 5    2  2  3  2  2  1     28    1  2  3  2  2  1
 6    1  1  1  1  2  2     29    2  2  3  2  2  1
 7    1  1  1  1  3  2     30    2  2  2  0  1  2
 8    1  1  1  1  1  3     31    1  1  1  1  2  2
 9    0  1  1  1  3  3     32    1  1  1  1  2  1
10    2  2  2  2  2  3     33    1  1  1  1  1  1
11    2  2  2  2  3  3     34    2  2  3  2  3  2
12    0  1  1  0  3  2     35    2  2  2  2  3  2
13    2  2  2  2  1  3     36    1  1  1  1  1  3
14    2  2  2  2  2  3     37    2  2  2  2  2  3
15    2  2  2  2  1  3     38    1  1  2  1  2  1
16    1  1  1  1  1  3     39    1  1  1  1  2  1
17    2  2  2  2  2  1     40    2  2  3  2  3  2
18    1  1  1  1  3  2     41    1  1  1  1  2  3
19    0  2  0  0  1  2     42    0  1  1  1  3  1
20    2  2  2  2  2  3     43    1  1  1  1  1  3
21    1  1  1  1  2  1     44    2  2  2  2  2  3
22    1  1  2  1  3  2     45    2  2  2  2  2  3
23    2  2  3  2  3  2     46    2  2  1  2  3  3

Note: Mancetter specimens are numbered sequentially; entries in the table identify cluster membership with a 0 indicating an outlier. The six methods used were (1) PCA, (2) Ward’s method, (3) Ward’s method on the first six standardised principal component

Which Cluster Method to Use?


J. Oksanen (2002)


SINGLE LINK


MINIMUM VARIANCE


CLUSTERING AND SPACE

Convex hull encloses all points so that no line between two points can be drawn outside the convex hull.

J. Oksanen (2002)


Minimum variance is usually most useful but tends to produce clusters of fairly equal size, followed by group average. Single-link is least useful.

General Behaviour of Different Methods

Single-link: often results in chaining

Complete-link: intense clustering

Group-average (weighted): tends to join clusters with small variances

Group-average (unweighted): intermediate between single and complete link

Median: can result in reversals

Centroid: can result in reversals

Minimum variance: often forms clusters of equal size

General Experience


Clustering of random data on two variables.

Note: Diagram (a) is a plot of two randomly generated variables labelled according to the clusters suggested by Ward’s method in diagram (b)

Baxter (1994)

SIMULATION STUDIES


Validation tests for

1. The complete absence of any group structure in the data

2. The validity of an individual cluster

3. The validity of a partition

4. The validity of a complete hierarchical classification

Main interest in (2) and (3) - generally assume there is some ‘group structure’, rarely interested in validating a complete hierarchical classification.

v. VALIDATION OF RESULTS

TESTS FOR ASSESSING CLUSTERS

Gordon, A.D. (1995) Statistics in Transition 2: 207-217

Gordon, A.D. (1996) In: From Data to Knowledge (ed. W. Gaul & D. Pfeifer), Springer

Cluster analysis of joint occurrence of 43 species of fish in the Susquehanna River drainage area of Pennsylvania, constructed with the UPGMA clustering algorithm (Sneath & Sokal, 1973). The three short perpendicular lines on the dissimilarity scale represent the critical values C1, C2, and C3 obtained from the null nodal distributions of the null frequency histogram. Significant clusters are indicated by solid lines. The non-significant portion of the dendrogram is drawn in dotted lines.

Strauss, (1982) Ecology 63, 634-639.


Hunter & McCoy 2004 J. Vegetation Science 15: 135-138.

Problem of creating ecologically relevant 'random' or 'null' data-sets.

Within a 'significant' cluster, linkages are often identified as 'significant' even when species are actually randomly distributed among the sites in the group.

Artificial data: 2 groups of 20 sites, no species in common.

[Figure: species-by-sites data matrix]

Randomisation test identifies both groups and all linkages within them as 'significant'.

Same test finds all linkages non-significant if only use one of the groups!

[Figure legend: symbols distinguish significant from not significant linkages]

Arises because the randomisation matrices need to be created at each classification step, not just at the beginning.

Can test for significance of groups by comparing linkage distances to a null distribution derived from randomisation and clustering of a sub-matrix containing only the sites within the larger group. In other words, this is testing the null hypothesis that within the significant group, sites represent random assemblages of species.

Sequential randomisation allows evaluation of all nodes in the classification.

OTHER APPROACHES TO ASSESSING AND VALIDATING CLUSTERS

If replicate samples are available, can use bootstrapping to evaluate significance. Can also use within-cluster samples as ‘replicates’.

BOOTCLUS McKenna (2003) Environmental Modelling & Software 18, 205-220

(www.glsc.usgs.gov/data/bootclus.htm)

SAMPLERE Pillar (1999) Ecology 80, 2508-2516

Compares cluster analysis groups and uses bootstrapping (resampling with replacement) to test the null hypothesis that the clusters in the bootstrap samples are random samples of their most similar corresponding clusters in the observed data. The resulting probability indicates whether the groups in the partition are sharp enough to reappear consistently in resampling.


NUMBER OF CLUSTERS

There are almost as many fusion levels as there are observations (n - 1 fusions for n observations).

Hierarchical classification can be cut at any level.

User generally wants to use groups all at one level, hence ‘cut level’ or ‘stopping rules’.

No optimality criteria or guidelines.

Select what is useful for the purpose at hand.

No right or wrong answer, just useful or not useful!

Mathematical criteria - see A.D. Gordon (1999) pp. 60-65


1. Divide underlying gradients into equal parts

2. Compact clusters

3. Groups of equal size

4. Discontinuous groups

These criteria often in conflict, and cannot all be satisfied simultaneously.

CRITERIA FOR GOOD CLUSTERS

J. Oksanen (2002)

2. TWINSPAN – Two-Way Indicator Species Analysis. Mark Hill (1979)

Differential variables characterise groups, i.e. variables common on one side of a dichotomy. This involves a qualitative (+/–) concept, so numerical data have to be analysed as PSEUDO-VARIABLES (conjoint coding).

Abundance (cut level)    Pseudo-variable
Species A 1-5%           SPECIES A1
Species A 5-10%          SPECIES A2
Species A 10-25%         SPECIES A3

Basic idea is to construct a hierarchical classification by successive division.

Ordinate samples by correspondence analysis and divide at the middle: the group to the left is negative, the group to the right is positive. Now refine the classification using variables with maximum indicator value (so-called iterative character weighting), and do a second ordination that gives greater weight to the ‘preferentials’, namely species on one or other side of the dichotomy.

Identify the indicators that differ most in frequency of occurrence between the two groups. Those associated with the positive side score +1, those with the negative side -1. If a variable is 3 times more frequent on one side than the other, it is a good indicator. Samples are now reordered on the basis of their indicator scores. Refine a second time to take account of other variables. Repeat on the 2 groups to give 4, 8, 16 and so on, until a group falls below the minimum size.

TWINSPAN

COED CYMERAU TWINSPAN TABLE

Cladonia coccifera − − 1 1 − − − − 1 1 1 1 − − − − − − − − − − −
Pseudoscleropodium purum − − 1 1 − − 1 − − 1 1 − − − − − − − − − − − −
Cladonia arbuscula − − 1 − 1 − 1 1 − − − 1 1 1 − − − − − − − 1 −
Hylocomium splendens 1 − 1 − 1 1 1 − 1 − − − − − − − − − − − − − −
Melampyrum pratense 1 1 1 1 2 2 2 − 1 1 1 1 1 1 − − − − − − − − −
Festuca ovina 3 − 2 3 3 3 1 − 1 1 1 − − 1 − − − − − − − − −
Agrostis canina 1 2 2 1 1 − 1 − − − 1 − − − − − − − − − − − −
Parmelia saxatilis 1 1 − 1 − − − − 1 1 − 1 − − − − − − − − − 1 −
Blechnum spicant 2 1 − − − 1 1 1 − 1 − − − − − − − − 1 − 1 − −
Thuidium tamarisc /delicat. 1 1 − 1 − − 1 − 1 2 − − − − − − − − − − 1 − −
Potentilla erecta 2 2 1 − 1 1 1 − − − − − − − − − 1 − − −
Pleurozium schreberi − 3 2 2 2 1 2 1 1 3 1 2 2 3 − 3 1 − 1 − − 1 −
Molinia caerulea 3 3 3 1 3 3 3 3 1 3 3 3 2 3 − 3 1 1 3 − 1 1 −
Hypnum cupressiforme 1 1 2 2 1 1 1 3 1 1 1 1 1 − − 1 1 − 1 − 1 − −
Pteridium aquilinum − − 3 − 1 2 3 3 − 2 2 2 − − − − 1 − − − − 2 −
Thuidium tamariscinum − − 1 − 1 1 − − − − − 1 1 − − − − − − 1 1 −
Sorbus aucuparia seedling − − − 1 − 1 1 − − − 1 − 1 − − − 1 − 1 − − − −
Betula pubescens seedling 1 − 1 − − − 1 − 1 − 1 − 1 1 − − 1 − 1 − − − −
Dicranum scoparium 2 1 1 2 − 1 2 1 1 1 1 − − 1 − − 2 2 − − − − 1
Plagiothecium undulatum 1 1 1 1 1 2 2 2 − − 1 1 1 1 − 1 1 1 1 1 1 − −
Leucobryum glaucum − 1 − − 2 2 − 3 2 2 2 3 2 2 − 1 2 − 1 1 − 1 1
Isothecium myosuroides 4 2 3 1 1 − − 1 1 1 1 = 1 2 − − − 1 − 3 1 − −
Quercus petraea 1 4 5 4 4 4 4 4 4 4 4 4 4 4 4 3 5 4 4 4 5 3 4
Dicranum majus − 1 − 2 − 1 1 2 2 1 1 1 − − 1 1 1 1 1 − 1 2 −
Campylopus flexuosus − 1 − 2 1 1 1 1 − − 1 − 1 1 − − 1 − 1 1 1 1
Calluna vulgaris 2 3 − 1 1 − − − − − − 1 1 1 − 3 1 − 1 − − − 5
Mnium hornum 1 1 − − − 1 − − − 1 − 1 1 − − − 1 − 1 1 1 − −
Polytrichum formosum 1 2 1 2 1 − − − 1 1 1 − − 1 − 1 − − 1 2 1 1 −
Vaccinium myrtillus − − − 1 1 1 − − 2 − 3 2 4 2 3 3 3 3 3 − 1 − −
Rhytidiadelphus loreus 1 1 − 1 − 1 2 2 2 1 − 1 2 − 1 1 2 2 2 3 1 −
Bazzania trilobata − − − − − 1 1 − 1 − − 1 − 1 − 1 1 − 1 − − 1 −
Sphagnum quinquefarium − − − − − − 2 − − − − 2 − − − 2 2 3 1 1 − 1 1
Deschampsia flexuosa 1 1 2 3 − − 2 1 − − − 1 1 3 2 1 2 2 3 2 3 3 −
Lepidozia reptans 1 − 1 − − − − − − − − − − 1 1 1 − 1 1 1 1 1 −
Diplophyllum albicans − − − − − − 1 − − − 1 − − 1 − 1 − − 1 1 1 1 −
Dicranodontium denudatum − − − − − 1 − − − − − − − 1 1 − 1 1 1 1 1 1 −
Lepidozia pearsonii − − − − − − − − − − − − − − − − 1 − 1 − − 1 −
Saccogyna viticulosa − − − − − − − − − − − − − − − − − 1 − − 1 1 −
Calypogeia fissa − − − − − − − − − − 1 − − − − − 1 1 1 − 1 1 −
Betula pubescens − 1 − − − − − − − − − − − − 3 3 − − − − − 3 −
Scapania gracilis 1 − − − − − − − − − − − − 1 − 2 1 1 1 1 1 2 −
Sphagnum robustum − − − − − − − − − − − − − 1 2 1 1 − − − − − −
Isopterygium elegans − − − − − 1 − − − − − − − − − 1 − − 1 − 1 1 1
Erica cinerea − − − − − − − − − − − − − − − − − − − − − − 1
Hypnum cupress. v. ericet − − − − − − − − − − − − − − − − − − − − 2

SECTION A A A A A A A A A A A A A A B B B B B B B B A

GROUP I II

GROUP MEAN PH 3.7 3.73

TWINSPAN

Each species can be represented by several pseudo-species, depending on the species abundance. A pseudo-species is present if the species value equals or exceeds the relevant user-defined cut-level.

Original data            Sample 1    Sample 2
Cirsium palustre             0           2
Filipendula ulmaria          6           0
Juncus effusus              15          25

Cut levels 1, 5, and 20 (user-defined)

Pseudo-species           Sample 1    Sample 2
Cirsium palustre 1           0           1
Filipendula ulmaria 1        1           0
Filipendula ulmaria 2        1           0
Juncus effusus 1             1           1
Juncus effusus 2             1           1
Juncus effusus 3             0           1

Thus quantitative data are transformed into categorical nominal (1/0) variables.
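The pseudo-species coding rule is simple enough to sketch in a few lines. The example below uses the three species, two samples and cut levels (1, 5, 20) from the table; the function name and the dictionary layout are choices made only for this illustration and are not taken from the TWINSPAN program itself.

```python
def pseudo_species(abundance, cut_levels):
    """One 1/0 value per cut level: 1 if the abundance equals or exceeds it."""
    return [1 if abundance >= cut else 0 for cut in cut_levels]


cut_levels = (1, 5, 20)
# Abundances in (Sample 1, Sample 2).
samples = {
    "Cirsium palustre": (0, 2),
    "Filipendula ulmaria": (6, 0),
    "Juncus effusus": (15, 25),
}

for species, abundances in samples.items():
    coded = [pseudo_species(value, cut_levels) for value in abundances]
    for level in range(len(cut_levels)):
        row = [sample[level] for sample in coded]
        if any(row):  # list only pseudo-species that occur in some sample
            print(f"{species} {level + 1}: {row}")
```

The printed rows reproduce the pseudo-species table: one row for Cirsium palustre, two for Filipendula ulmaria and three for Juncus effusus.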

Pseudo-species Concept

Variables are classified in much the same way, using sample weights based on the sample classification. They are classified on the basis of fidelity, i.e. how confined variables are to particular sample groups: the ratio of the mean occurrence of a variable in samples in the group to its mean occurrence in samples not in the group. Variables are ordered on the basis of their degree of fidelity within a group, and a structured two-way table is then printed out.

Concepts of INDICATOR SPECIES, DIFFERENTIALS and PREFERENTIALS, FIDELITY

 Gauch & Whittaker (1981) J. Ecology 69, 537-557

 

“two-way indicator species analysis usually best. There are cases where other techniques may be complementary or superior”.

Very robust - considers overall data structure

“best general purpose method when a data set is complex, noisy, large or unfamiliar. In many cases a single TWINSPAN classification is likely to be all that is necessary”.

TWINSPAN, TWINGRP, TWINDEND, WINTWINS


van Groenewoud (1992) J. Veg. Sci. 3, 239-246

Belbin & McDonald (1993) J. Veg. Sci. 4, 341-348

 

Artificial data of known properties and structure

 

Reliability of TWINSPAN depends on:

 

1. How well correspondence analysis extracts axes that have ecological meaning 

2. How well the CA axes are divided into meaningful segments

3. How faithful certain species are to certain segments of the multivariate space

 

TWINSPAN TESTS OF ROBUSTNESS & RELIABILITY


Problems arise:

 

1. The splitting rule of TWINSPAN (dividing the CA at centre) overrides keeping ecologically closely related samples together. Groups of samples that are similar but near centre can get split into two groups. Relocation necessary. FLEXCLUS 

2. With more complex data, small groups of samples are split off from main body. Outliers. CA sensitive to rare taxa and unusual samples.

3. Displacement of points along first CA axis may be considerable, resulting in poor results, especially occurs when there are two underlying gradients of approximately same length.

4. The division of the first CA axis in the middle, followed by separate CA of each of the two halves of original data creates conditions under which the second CA is not detecting the second gradient but is doing a finer CA of the first gradient. Increases chances of misplacement at centres.


Ideal if: 

The first major underlying gradient is considerably larger than the second one and the structure is not complex.

“The erratic behaviour of TWINSPAN beyond the first division makes the results of this analysis of real vegetation data suspect”.

“Use of the TWINSPAN program for vegetation analysis is thus not recommended”.

van Groenewoud (1992)

“TWINSPAN is not a method but a program. A bag of tricks, too unstable and tricky. Better avoided. It uses a kludge of pseudospecies and has many other quirks so that its analyses may be impossible to repeat.”

Oksanen (2003)

Belbin & McDonald (1993)

Artificial data: 480 data sets of 50 sites in 2 dimensions with 2, 3, 4, 5 or 8 clusters.

(TWINSPAN expects 2, 4, 8, 16, 32, 64... $2^n$ clusters)

Recovery of structure (mean of Rand statistic for comparing classifications - 1 is perfect, 0 is terrible).

Method                     Mean Rand
TWINSPAN                   0.63
Non-hierarchical           0.77
Flexible unweighted GA     0.79
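For readers unfamiliar with the Rand statistic, a minimal sketch follows. It counts the pairs of objects on which two classifications agree (both put the pair in the same group, or both put it in different groups) and divides by the total number of pairs. The two label vectors are hypothetical and are not the data of Belbin & McDonald.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Proportion of object pairs treated the same way by both classifications."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total


# Hypothetical example: a known 2-group structure and a recovered 3-group result.
known = [0, 0, 0, 1, 1, 1]
recovered = [0, 0, 1, 1, 2, 2]

print("Rand:", round(rand_index(known, recovered), 3))
print("identical classifications:", rand_index(known, known))
```

The statistic depends only on which objects are grouped together, so renaming the groups does not change it.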

Major TWINSPAN problem (and of all divisive procedures): an early ‘error’ in a division can have serious effects, as it cannot be undone except by relocation (FLEXCLUS).

Useful tool - not the final classification!


Tausch, Charlet, Weixelman & Zamudio (1995)

J. Vegetation Science 6, 897-902

“Patterns of ordination and classification instability resulting from changes in input data order”.

Claimed results from correspondence analysis-based methods such as TWINSPAN dependent on order of input, i.e. vary with entry order of data.

Is this a problem of the Method?

Algorithm?

Software?

Used “TWINSPAN as compiled for PC-ORD package”

“Found 1-4 plots changed group affiliation and relationships between groups often changed”.

TWINSPAN INSTABILITY


Explanation: Oksanen & Minchin (1997) J. Vegetation Science 8, 447-454 

TWINSPAN uses correspondence analysis as basis for ordering and dividing samples. In TWINSPAN a very fast algorithm is used to extract the correspondence analysis axes. Iterative procedure - stops after maximum number of iterations or reaches a convergence criterion (tolerance), whichever it reaches first.

TWINSPAN                         Max iterations   Tolerance
Original, Hill (1979)            5                0.003
Strict criteria, Version 2.2a    999              0.000005

Instability disappears with the strict criteria.
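The role of the two stopping criteria can be sketched with a plain reciprocal-averaging iteration for the first correspondence analysis axis. This is an illustration of the general idea only, not the TWINSPAN code: the function name, the way the change between iterations is measured, and the small table are assumptions.

```python
def first_ca_axis(table, start, max_iter=5, tol=0.003):
    """Site scores on CA axis 1 by reciprocal averaging (illustrative sketch)."""
    n_sites, n_species = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_sites)) for j in range(n_species)]
    grand = sum(row_tot)
    x = list(start)
    for iteration in range(1, max_iter + 1):
        # species scores = weighted averages of the site scores
        u = [sum(table[i][j] * x[i] for i in range(n_sites)) / col_tot[j]
             for j in range(n_species)]
        # site scores = weighted averages of the species scores
        new = [sum(table[i][j] * u[j] for j in range(n_species)) / row_tot[i]
               for i in range(n_sites)]
        # centre and scale (weights = site totals) to remove the trivial axis
        mean = sum(w * v for w, v in zip(row_tot, new)) / grand
        new = [v - mean for v in new]
        scale = (sum(w * v * v for w, v in zip(row_tot, new)) / grand) ** 0.5
        new = [v / scale for v in new]
        change = max(abs(a - b) for a, b in zip(new, x))
        x = new
        if change < tol:          # convergence criterion reached first
            break
    return x, iteration           # otherwise stopped by max_iter

# Hypothetical 5 sites x 4 species table with a single gradient.
table = [[5, 3, 0, 0], [4, 4, 1, 0], [1, 3, 3, 1], [0, 1, 4, 4], [0, 0, 3, 5]]
strict_a, n_a = first_ca_axis(table, [1, 2, 3, 4, 5], max_iter=999, tol=0.000005)
strict_b, n_b = first_ca_axis(table, [5, 1, 4, 2, 3], max_iter=999, tol=0.000005)
```

With the strict settings the two starting configurations should end at the same scores (up to a possible sign flip), whereas a run stopped after only five iterations can still carry a trace of its starting values.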

Page 57: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

TWINSPAN may have some drawbacks. Also has some major advantages.

Two-way table of samples and species and the groupings is the easiest way of seeing what makes the groups distinct (or otherwise!). Powerful way of seeing the fuzziness in the data without hiding it as techniques like cluster analysis do.

Page 58: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Selection of pseudo-species cut levels can sometimes be a problem.

If the pseudo-species classes are very unequal in width (e.g. 1-2% vs. 51-100%), TWINSPAN will give greater weight to the narrower classes. This is a problem because the smaller values are typically, but not always, of lesser importance and are less well estimated.

www.ecotone.org/Download/twinwght.xls

Allows you to calculate appropriate cut-levels to avoid this weighting problem.
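A small sketch of how cut levels turn one abundance into pseudo-species, and how unequal the resulting classes can be. It assumes cut levels of 0, 2, 5, 10 and 20% and the rule that a pseudo-species is present when the abundance exceeds its cut level; both are illustrative choices, and the function names are hypothetical.

```python
def pseudospecies(abundance, cut_levels=(0, 2, 5, 10, 20)):
    """Return one 1/0 indicator per cut level for a single abundance value."""
    # Assumed rule: pseudo-species k is present when abundance > cut level k.
    return [1 if abundance > cut else 0 for cut in cut_levels]

def class_widths(cut_levels=(0, 2, 5, 10, 20), upper=100):
    """Width of each abundance class implied by the cut levels."""
    bounds = list(cut_levels) + [upper]
    return [high - low for low, high in zip(bounds, bounds[1:])]

print(pseudospecies(7))    # a species at 7% cover
print(class_widths())      # how unequal the classes are
```

Listing the class widths next to the cut levels makes it easy to see when one class spans a few percent and another spans most of the scale.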

Page 59: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Extensions to TWINSPAN

Basic ordering of objects derived from correspondence analysis axis one. Axis is bisected and objects assigned to positive or negative groups at each stage. Can also use:

1. First PRINCIPAL COMPONENTS ANALYSIS axis

ORBACLAN C.W.N. Looman

Ideal for TWINSPAN style classification of environmental data, e.g. chemistry data in different units, standardise to zero mean and unit variance, use PCA axis in ORBACLAN (cannot use standardised data in correspondence analysis, as negative values not possible).

2. First CANONICAL CORRESPONDENCE ANALYSIS axis.

COINSPAN T.J. Carleton et al. (1996) J. Vegetation Science 7: 125-130

The first CCA axis is the linear combination of external environmental variables that maximises the dispersion (spread) of the species scores on the axis, i.e. a combination of biological and environmental data is used as the basis for the divisions. COINSPAN is a constrained TWINSPAN - ideal for stratigraphically ordered palaeoecological data if sample order is used as the environmental variable.
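The PCA-based variant in item 1 can be sketched as follows: standardise each environmental variable to zero mean and unit variance, extract the first principal component, and bisect the objects by the sign of their scores. This is not ORBACLAN itself; the power-iteration approach, the function names and the pH/conductivity values are assumptions made for illustration, and constant variables (zero variance) are not handled.

```python
def standardise(data):
    """Scale every column to zero mean and unit variance."""
    columns = []
    for col in zip(*data):
        n = len(col)
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / (n - 1)) ** 0.5
        columns.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*columns)]

def first_pc_scores(z, n_iter=500):
    """Object scores on PCA axis 1, found by power iteration."""
    m = len(z[0])
    # cross-product matrix of the standardised variables
    cross = [[sum(row[a] * row[b] for row in z) for b in range(m)] for a in range(m)]
    v = [1.0 + 0.1 * a for a in range(m)]       # arbitrary non-zero start
    for _ in range(n_iter):
        w = [sum(cross[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    return [sum(row[a] * v[a] for a in range(m)) for row in z]

def bisect(scores):
    """Assign objects to the negative (0) or positive (1) side of the axis."""
    return [0 if s < 0 else 1 for s in scores]

# Hypothetical chemistry data in different units: pH and conductivity.
chemistry = [[4.4, 18], [4.6, 25], [4.5, 22], [7.4, 310], [7.6, 290], [7.5, 305]]
groups = bisect(first_pc_scores(standardise(chemistry)))
print(groups)
```

Repeating the same steps within each of the two resulting groups would give the next level of a divisive hierarchy.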

Page 60: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Dendrograms of (a) TWINSPAN and (b) COINSPAN on a 170 stand pine forest understorey dataset. The number of stands resulting from each division is shown at each level of the dendrogram. Plant species names are the respective indicators on the negative (left) or positive (right) of each division. The mean number of Pinus strobus seedlings between 10 cm and 150 cm in height, with associated standard errors, are given for each of the four final stand groups.

Carleton et al. (1996)

Page 61: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

[Figure: COINSPAN dendrograms for (a) log10-transformed taxa biovolumes and (b) presence-absence data. Divisions are labelled with DOC, ANC, DIC and Al; indicator taxa are listed on each branch. The four final lake groups are Group 1 (high acidity, clearwater), Group 2 (high acidity, humic), Group 3 (moderate acidity, low alkalinity) and Group 4 (low acidity, moderate alkalinity).]

Classification of lake groups and identification of periphyton indicator taxa and key discriminating environmental variables based on a COINSPAN of (a) log10-transformed taxa biovolumes and (b) taxonomic presence-absence data, along with log10-transformed water chemistry data.

Vinebrooke & Graham (1997)

Page 62: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

OPTIMISING A CLASSIFICATION: K-MEANS CLUSTERING

• Agglomerative clustering has a legacy of history: once formed, classes cannot be changed although that would be sensible at a chosen level

• K-means clustering: iterative procedure for non-hierarchical classification

• If started from a chosen hierarchical clustering, that clustering will be optimised

• Best suited with centroid or minimum-variance linkage, since it uses same criterion but in a non-hierarchical way

• Computationally difficult, cannot be sure the optimal solution is found

J. Oksanen (2002)

Page 63: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

NON-HIERARCHICAL K-MEANS CLUSTERING

Given n objects in m-dimensional space, find the partition into k groups or clusters such that the objects within each cluster are more similar to one another than to objects in the other clusters. The number of groups k is determined by the user.

In k-means, the numerical function that the partition should minimise is, as in minimum-variance cluster analysis (= Ward’s method), the total error sum of squares ($E_k^2$) or variance, but it does not impose any hierarchical structure.

Tries to form clusters to maximise between-cluster variance and to form groups of samples that will achieve the largest number of significant differences in ANOVA for the variables in relation to the clusters.

Major practical problem is that the solution on which the computation eventually converges depends to some extent on the initial centroids or groups. Only way to be sure that the optimal solution has been found is to try all possible solutions in turn.

Impossible for any real-size problem – 50 objects, $10^{80}$ possible solutions!

Page 64: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

POSSIBLE SOLUTIONS

1. Provide initial configuration based on ecological knowledge and hopefully this will provide a good start for the algorithm.

2. Provide initial configuration based on a hierarchical clustering. The k-means algorithm will try to rearrange the group membership to find a better overall solution (lower $E_k^2$).

3. Select as a group seed for each of the k groups some objects thought to be 'typical' of each group.

4. Assign the objects at random to the various groups, find a solution, note $E_k^2$. Repeat many times (100), starting each time with a different random configuration. Retain solution with lowest $E_k^2$.

5. Alternating least-squares algorithm. K-MEANS, R

There are several solutions to help a k-means algorithm converge to the overall minimum criterion ($E_k^2$).
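Option 4 above (many random starts, keep the solution with the lowest error sum of squares) can be sketched as below. This is a minimal illustration rather than the K-MEANS program: the function names, the handling of empty groups, and the toy data are assumptions.

```python
import random

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def error_ss(data, labels, k):
    """Total within-group sum of squared distances to the group centroids."""
    total = 0.0
    for g in range(k):
        members = [p for p, lab in zip(data, labels) if lab == g]
        if members:
            c = centroid(members)
            total += sum(sq_dist(p, c) for p in members)
    return total

def kmeans_once(data, k, rng, max_iter=100):
    labels = [rng.randrange(k) for _ in data]          # random initial groups
    for _ in range(max_iter):
        centres = []
        for g in range(k):
            members = [p for p, lab in zip(data, labels) if lab == g]
            # assumption: an empty group is re-seeded with a random object
            centres.append(centroid(members) if members else list(rng.choice(data)))
        new = [min(range(k), key=lambda g: sq_dist(p, centres[g])) for p in data]
        if new == labels:                              # no object moved
            break
        labels = new
    return labels, error_ss(data, labels, k)

def kmeans_restarts(data, k, n_starts=100, seed=1):
    """Run k-means from n_starts random configurations; keep the best."""
    rng = random.Random(seed)
    best_labels, best_ess = None, float("inf")
    for _ in range(n_starts):
        labels, ess = kmeans_once(data, k, rng)
        if ess < best_ess:
            best_labels, best_ess = labels, ess
    return best_labels, best_ess
```

Individual runs may stop at different local solutions; only the comparison of the final error sums of squares across runs identifies the best one found.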

Page 65: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

K-MEANS SOFTWARE

Pierre Legendre (www.fas.umontreal.ca/biol/legendre)

K observations are selected as 'group seeds' and cluster centroids are computed.

Assign each object to the nearest seed

Carry on moving objects until $E_k^2$ can no longer be improved.

In the iterations, the program tries to minimise the sum, over all groups, of the squared within-group residuals, which are the distances of the objects to the respective group centroids. Convergence is reached when the residual sum of squares cannot be lowered any more. The groups obtained are geometrically as compact as possible around their respective centroids.

Cannot guarantee to find the absolute minimum of $E_k^2$. Necessary to repeat several times with different initial group seeds.

For each number of groups (K), calculate the Calinski-Harabasz pseudo-F statistic (C-H):

$$\text{C-H} = \frac{R^2 / (K - 1)}{(1 - R^2) / (n - K)}$$

where $R^2 = (SST - SSE) / SST$.

SST is the total sum of squared distances to the overall centroid and SSE is the sum of squared distances of the objects to their own group centroids.
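The pseudo-F statistic defined above can be computed directly from a data table and a vector of group labels. The sketch below follows the formula as given; the function names and the four-object example are illustrative.

```python
def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def calinski_harabasz(data, labels):
    """C-H = [R2 / (K - 1)] / [(1 - R2) / (n - K)], with R2 = (SST - SSE) / SST."""
    n = len(data)
    groups = sorted(set(labels))
    k = len(groups)
    overall = centroid(data)
    sst = sum(sq_dist(p, overall) for p in data)       # distances to overall centroid
    sse = 0.0
    for g in groups:
        members = [p for p, lab in zip(data, labels) if lab == g]
        c = centroid(members)
        sse += sum(sq_dist(p, c) for p in members)     # distances to own centroid
    r2 = (sst - sse) / sst
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

# Hypothetical example: four objects on one variable, two tight groups.
# SST = 101 and SSE = 1, so R2 = 100/101 and C-H = 200.
print(calinski_harabasz([[0], [1], [10], [11]], [0, 0, 1, 1]))
```

Evaluating the function for the partitions obtained at several values of K gives the kind of K versus C-H listing used to choose the number of groups.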

Page 66: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

HOW MANY GROUPS?

One is interested to find the number of groups, K, for which the Calinski-Harabasz criterion is maximum. This would be the most compact set of clusters.

No. of groups (K)    C-H
2                    1039.39
3                    1143.07
4                    1445.60
5                    1320.63
6                    1279.83

Page 67: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

K-MEANS SOFTWARE

1. Four different initial assignment procedures

2. Input data

3. Data transformation options (gives different implicit proximity measures – k-means requires Euclidean distances)

4. Variable weightings if required

Output

1. K and C-H values

2. Details of group membership for each K

Dimensions 100,000 objects, 250 variables, 30 groups

Very fast!

R

Page 68: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

K-MEANS CLUSTERING - A SUMMARY

Provides useful and relatively fast non-hierarchical partitioning of large or gigantic data sets. Generally finds near-optimal solution in matter of minutes.

Important to compare with results from hierarchical cluster analysis procedures to see how the partitioning has been distorted by imposing a hierarchical structure on the data.

Problem is how to display results for large data-sets.

• map clusters in geographical space

• overlay on ordination plots

• cluster summaries (means, ranges, etc)

• re-arrange data tables

Page 69: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

'FUZZY' CLUSTERING

Some objects may clearly belong to some groups. Other objects have group membership that is much less obvious.

18 objects assigned into 2 groups that minimise the sum-of-squares criterion.

What about objects A, B, C, and D? Less clearly associated with groups C1 and C2 than the other 14 objects.

Fuzzy clustering gives each object a membership function between 0 and 1 that specifies the strength with which each object can be regarded as belonging to each group.

Gordon 1999

Page 70: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Three groups of artificial data with 50 samples each from three bivariate normal distributions.

Data with known group structure. Sum-of-squares clustering into 3 groups

Fuzzy clustering: objects with membership function less than 0.5 are circled. These cannot really be grouped.

Gordon (1999)

Page 71: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

'FUZZY' CLUSTERING IN ECOLOGY

Equihua, M. (1990) J. Ecology 78, 519-534

'Fuzzy' c-means algorithm - similar to k-means clustering but with object weightings or membership functions iteratively changed to minimise sum-of-squares criterion.

Used correspondence analysis ordination as starting configuration for 'fuzzy' c-means clustering.

Compared 'fuzzy' c-means with TWINSPAN. Assigned each object to the group with which it has the highest membership function for comparison purposes.

Membership functions usually 0.5 - 0.7, a few as high as 0.9.

Good agreement at two or four group levels, less good with more groups.

Both find the 'obvious' groups; differ in fine-level divisions.

Suggests ecological data consist of (1) objects falling in some clear groups and of (2) objects that are clearly intermediate.

'Discontinuous data' and 'continuous data'

Software FUZPHY http://labdsv.nr.usu.edu/

(LABDSV = Laboratory for Dynamic Synthetic Vegephenomenology)

R

Page 72: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

'FUZZY' CLUSTERING - A SUMMARY

• Classes can be useful for many purposes (e.g. maps)

• Fuzzy clustering combines good aspects of classification and ordination

• Each observation is given a membership function of class membership

• Corresponding crisp classification: class of highest membership functions

• Non-hierarchic, flat classification

• Iterative procedure

• Does not pretend the classes are natural entities

J. Oksanen (2002)
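The iterative procedure summarised above can be sketched with a basic fuzzy c-means loop. It assumes the common formulation with a fuzziness exponent m = 2 and user-supplied starting centres; it is not the FUZPHY implementation, and all names and data are illustrative.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def memberships(data, centres, m=2.0):
    """Membership of every object in every group; each row sums to 1."""
    power = 1.0 / (m - 1.0)
    rows = []
    for p in data:
        # closer centres get larger weights; the floor avoids division by zero
        inv = [(1.0 / max(sq_dist(p, c), 1e-12)) ** power for c in centres]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows

def fuzzy_cmeans(data, centres, m=2.0, max_iter=200, tol=1e-9):
    centres = [list(c) for c in centres]
    n_dim = len(data[0])
    for _ in range(max_iter):
        u = memberships(data, centres, m)
        new_centres = []
        for g in range(len(centres)):
            w = [row[g] ** m for row in u]             # weights u^m
            w_sum = sum(w)
            new_centres.append([sum(wi * p[d] for wi, p in zip(w, data)) / w_sum
                                for d in range(n_dim)])
        shift = max(sq_dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return memberships(data, centres, m), centres

# Hypothetical data: two clear groups plus one intermediate object (5.0).
data = [[0.0], [0.5], [1.0], [9.0], [9.5], [10.0], [5.0]]
u, centres = fuzzy_cmeans(data, centres=[[2.0], [8.0]])
crisp = [row.index(max(row)) for row in u]   # class of highest membership
```

Objects in the core of a group receive a membership close to 1 for that group, while the intermediate object receives memberships near 0.5 for both, which is the information a crisp classification hides.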

Page 73: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

MIXTURE MODELS FOR CLUSTER ANALYSIS

Cluster analysis methods have no underlying theoretical statistical model except for mixture models.

Finite mixture distributions

Sample of individuals from some population have heights recorded, but gender not recorded.

Density function of height

h (height) = p (female) h1 (height:female) + p (male) h2 (height:male)

where p (female) and p (male) are probabilities that a member of the population is female or male, and h1 and h2 are the height density functions for females and males.

Density function of height is a superposition of two conditional density functions. Density function is known as finite mixture density.

Page 74: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

If h1 and h2 follow normal distribution, can estimate p (female) and p (male) by maximum likelihood procedures.
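For the height example, a maximum likelihood fit of a two-component normal mixture can be sketched with an EM iteration. The starting values, the fixed number of iterations, and the simulated heights (means 162 and 178 cm, standard deviation 5 cm) are assumptions made purely for illustration.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_two_normals(x, n_iter=200):
    """Fit p, means and SDs of a two-component normal mixture by EM."""
    overall = sum(x) / len(x)
    sd0 = (sum((v - overall) ** 2 for v in x) / len(x)) ** 0.5
    mean = [min(x), max(x)]        # crude starting values
    sd = [sd0, sd0]
    p = [0.5, 0.5]
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each individual
        post = []
        for v in x:
            dens = [p[s] * normal_pdf(v, mean[s], sd[s]) for s in range(2)]
            total = sum(dens)
            post.append([d / total for d in dens])
        # M-step: update mixing proportions, means and SDs
        for s in range(2):
            w = [row[s] for row in post]
            w_sum = sum(w)
            p[s] = w_sum / len(x)
            mean[s] = sum(wi * v for wi, v in zip(w, x)) / w_sum
            sd[s] = (sum(wi * (v - mean[s]) ** 2 for wi, v in zip(w, x)) / w_sum) ** 0.5
    return p, mean, sd, post

# Simulated (hypothetical) heights with gender not recorded.
rng = random.Random(0)
heights = [rng.gauss(162, 5) for _ in range(300)] + [rng.gauss(178, 5) for _ in range(300)]
p, mean, sd, post = em_two_normals(heights)
```

The rows of `post` are the estimated probabilities that each individual belongs to the first or second component, which is what a mixture-based clustering would use to assign individuals to groups.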

Can be extended to more than one group and more than one variable

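$$f(x) = \sum_{i=1}^{g} p_i \, \alpha(x; \mu_i, \Sigma_i)$$

where

$$\sum_{i=1}^{g} p_i = 1; \quad 0 \le p_i \le 1$$

and

$$\alpha(x; \mu_i, \Sigma_i) = (2\pi)^{-p/2} \, |\Sigma_i|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)' \, \Sigma_i^{-1} (x - \mu_i) \right\}$$

Here p in the exponent is the number of variables.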

Page 75: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Assumes g clusters and within each cluster the variables have a multivariate normal distribution.

Clusters would be formed on the basis of the maximum values of the estimated posterior probabilities

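$$\hat{p}(s|x) = \frac{\hat{p}_s \, \alpha(x; \hat{\mu}_s, \hat{\Sigma}_s)}{\sum_{i=1}^{g} \hat{p}_i \, \alpha(x; \hat{\mu}_i, \hat{\Sigma}_i)}$$

where $\hat{p}(s|x)$ is the estimated probability that an individual with vector x of observations belongs to group s.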

Page 76: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

SIMPLE EXAMPLE

50 observations from each of two bivariate normal distributions with the following properties

Density one: μx = 1.0, μy = 1.0; σx = 1.0, σy = 1.0; ρ = 0.0

Density two: μx = 4.0, μy = 5.0; σx = 2.0, σy = 0.5; ρ = 0.0

Results of fitting two component normal mixture

            Proportion   Means          SDs            Correlation
Cluster 1   0.50         [1.14, 0.64]   [0.95, 1.10]   0.16
Cluster 2   0.50         [3.94, 4.98]   [2.32, 0.45]   -0.22

Page 77: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Bivariate data containing two clusters

Contour plot of estimated two component normal mixture

Perspective view of estimated two component normal mixture

Page 78: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

PROBLEMS

1. Complex computationally, E (expectation) M (maximisation) algorithm.

2. Requires well separated densities and/or very large sample sizes.

3. Convergence is often to a local rather than a global solution.

4. Different start values needed in the EM algorithm.

5. Can be very slow to converge.

6. How to estimate g, the number of components. Idea is to use the likelihood ratio to test for the smallest value of g compatible with the data. Not straightforward and no agreed estimation procedure.

Page 79: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

REAL EXAMPLE

Blood pressure data - systolic and diastolic blood pressure

Is blood pressure a continuous variable from a single population or from two or three sub-populations with different mean levels? If latter, maybe there is a gene that causes arterial blood pressure to increase faster with age in those who have this gene than it does for people who lack it.

                   Whites                          Non-whites
                   Systolic        Diastolic       Systolic        Diastolic
                   Gp 1    Gp 2    Gp 1    Gp 2    Gp 1    Gp 2    Gp 1    Gp 2
Mean               118.3   147.6   65.7    78.4    116.1   145.9   71.0    94.9
Variance           215     694     102     169     159     552     102     54
No. of subjects    847     301     821     325     136     74      178     30
Percent            74      26      72      28      65      35      85      15

Page 80: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

No convincing evidence for groups within systolic, possibly some subgroups within diastolic.

Page 81: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

MIXTURE MODELS FOR CATEGORICAL DATA - LATENT CLASS ANALYSIS

The normal mixture models described above are not suitable for data where the variables are categorical, as they assume that within each group the variables have a multivariate normal distribution.

For categorical data, the mixture assumed needs to involve other component densities.

Multivariate Bernoulli density - within each group the categorical variables are independent of one another, so-called conditional independence assumption.

Page 82: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

MIXTURE MODELS FOR MIXED MODE DATA

Data may consist of continuous and categorical variables, so-called mixed mode data.

Mixture models can be extended to include mixed mode data but there are severe computational problems if there are more than about four categorical variables.

Page 83: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

INTEGRATED CLASSIFICATIONS BASED ON MIXTURE MODELS

Biological and water-quality data

Traditionally two-step analysis

1. Cluster sites on the basis of the biological data

2. Relate the clusters to the water-quality data by, for example, DISCRIM or linear discriminant analysis

Step 1: Taxa → Clusters

Step 2: Clusters → Water-quality data

'Asymmetric model'

Page 84: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

'Symmetric model'

Taxa + Water-quality data → Clusters

Can we do this?

Data are in very different units and have very different distributions.

Model-based clustering based on latent class analysis.

Page 85: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

MODEL-BASED CLUSTERING BASED ON LATENT CLASS ANALYSIS

ter Braak et al. (2003) Ecological Modelling 160: 235-248

1. There are G classes and each site belongs to one and only one of these classes, but it is unknown which one.

2. The variables (environmental variables and taxon counts) have probability distributions that differ between classes. Conditional probability density of the vector-variable y given class g is p(y|g).

3. The marginal distribution is a mixture of these distributions with mixing proportions ($\pi_g$, g = 1, ..., G), i.e.

$$p(y) = \sum_{g=1}^{G} \pi_g \, p(y|g)$$

The mixing proportions must sum to unity.

4. Class membership of the sites is unknown. With the data vector y from a site, model allows one to calculate the class membership probability p(g|y) that the site belongs to a particular class.

Page 86: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

In mixed biological-environmental data, different data properties

1. Environmental variables assumed to be quantitative (e.g. pH, conductivity) and to follow a multivariate normal distribution within each class.

2. Biological data (counts of taxa) are assumed to follow independent Poisson distributions within each class.

Combining normal and non-normal variables in one analysis - mixed mode data.

Symmetric model - use taxa and environmental data together to create g classes, both 'response variables'.

Assume taxon counts are independent of the environmental variables within each class.

Let taxon counts be y, environmental variables x

$$p(x, y) = \sum_{g} \pi_g \, p(x, y|g) = \sum_{g} \pi_g \, p(x|g) \, p(y|g)$$

Assumes conditional independence of x and y.

Can be fitted using the EM maximum likelihood algorithm or Bayesian approach using Markov Chain Monte Carlo (MCMC) methods.
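Once the class parameters are known (or have been estimated), the class membership probabilities for a site follow from the conditional independence assumption. The sketch below simplifies the model by treating the environmental variables as independent normals within each class rather than fully multivariate normal; the parameter values, dictionary layout and function names are hypothetical.

```python
import math

def log_normal(x, mean, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def log_poisson(count, rate):
    return count * math.log(rate) - rate - math.lgamma(count + 1)

def class_probabilities(env, counts, classes):
    """p(g | x, y) for one site, assuming x and y are independent given the class."""
    logs = []
    for cls in classes:
        total = math.log(cls["proportion"])
        # environmental variables: simplified to independent normals (assumption)
        total += sum(log_normal(x, mu, sd) for x, (mu, sd) in zip(env, cls["env"]))
        # taxon counts: independent Poisson distributions within the class
        total += sum(log_poisson(y, lam) for y, lam in zip(counts, cls["rates"]))
        logs.append(total)
    top = max(logs)                              # subtract the maximum for stability
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

# Two hypothetical classes: one variable (pH) and two taxa.
classes = [
    {"proportion": 0.5, "env": [(4.5, 0.3)], "rates": [10.0, 1.0]},
    {"proportion": 0.5, "env": [(7.0, 0.3)], "rates": [1.0, 10.0]},
]
probs = class_probabilities(env=[4.6], counts=[9, 0], classes=classes)
```

In an EM fit these probabilities form the expectation step; the maximisation step would then re-estimate the mixing proportions, normal parameters and Poisson rates from the weighted sites.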

Page 87: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

In Bayesian approach, prior distribution must be specified for all parameters of the model.

Bayesian approach has several advantages:

1. Flexible. Parameters do not have to be equal or unequal. They can be a 'bit unequal'.

2. Problems in the EM approach of values near zero can be avoided by defining robust prior distributions.

3. Can be extended by including prior information (e.g. habitat preferences of taxa, ecological indicator values) or details about field sampling.

4. Can easily compare one model to another so that model fit is balanced against model complexity. Can find the 'optimal' number of clusters.

5. Means for model checking.

6. Predictions are straightforward and the uncertainty of predictions can be assessed in a natural way by integrating out all relevant sources of uncertainty.

Page 88: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

ECOLOGICAL EXAMPLE

Stream invertebrates from five types of habitat

P1 - temporary moorland pools, low pH

P2 - permanent moorland pools, low pH

P3 - moorland pools, medium pH

P7 - large mesotrophic bodies

P8 - medium sized eutrophic bodies

Page 89: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Small real data-set: information criteria in latent class analysis plotted against the number of

clusters.

Page 90: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

ML solutions (BIC, ML): BIC suggests 4 clusters; ML suggests no clear number of clusters.

Bayesian solutions (BF, min G, max G): suggest no clear number of clusters.

Which is reality?

Few groups, many groups, or no groups?

Page 91: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

"The trend is towards explicit models and a Bayesian approach to cluster analysis to improve upon the good-old TWINSPAN method. Frequently it is hard to beat ad-hocery and TWINSPAN with modern statistical methods; occasionally it is possible but not often!"

C.J.F. ter Braak (2003)

Page 92: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

A REAL CLASS STRUCTURE OF THREE IRIS SPECIES

Data on Iris setosa (s), I. versicolor (c), and I. virginica (v).

J. Oksanen (2002)

Page 93: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

RESULTS

No method recovers the three species structure!

J. Oksanen (2002)

Page 94: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

CLASSIFICATION OF VARIABLES - POSSIBLE APPROACHES

1. TWINSPAN species ordering of basic data or TWINSPAN of transposed matrix so variables become 'objects' and objects become 'variables'. Problem is how to define realistic pseudo-variables for objects (now 'variables').

2. Concept of species associations is usually based on presence/absence data. Transform data to presence/absence or pseudo-variables first, then transpose matrix so that variables are 'objects' and objects are 'variables'. Calculate suitable similarity or dissimilarity coefficients between 'objects' defined as +/-.

Jaccard: $S_J = \frac{a}{a + b + c}$, $D_J = \frac{b + c}{a + b + c}$

Sørensen: $S_S = \frac{2a}{2a + b + c}$, $D_S = \frac{b + c}{2a + b + c}$

Do cluster analysis or k-means clustering.

3. Use other similarity coefficients, transform to a distance coefficient as $\sqrt{1 - S}$ (to ensure the coefficient is metric), compute principal co-ordinates of the distance matrix to give co-ordinates of the 'objects' (= variables) in orthogonal multi-dimensional space, and do cluster analysis or k-means clustering.
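The Jaccard and Sørensen coefficients between two species can be computed from their presence/absence vectors after transposing the data. In the sketch below, a is the number of objects where both species are present and b and c are the numbers where only one of them is present; the function name and the example vectors are illustrative, and a pair of species with no presences at all is not handled.

```python
def association(species_1, species_2):
    """Jaccard and Sørensen similarities and dissimilarities for two +/- vectors."""
    a = sum(1 for p, q in zip(species_1, species_2) if p and q)       # joint presences
    b = sum(1 for p, q in zip(species_1, species_2) if p and not q)   # only species 1
    c = sum(1 for p, q in zip(species_1, species_2) if q and not p)   # only species 2
    jaccard = a / (a + b + c)
    sorensen = 2 * a / (2 * a + b + c)
    return {"S_J": jaccard, "D_J": 1 - jaccard, "S_S": sorensen, "D_S": 1 - sorensen}

# Presence (1) or absence (0) of two species at four sites: a = 1, b = 1, c = 1.
result = association([1, 1, 0, 0], [1, 0, 1, 0])
print(result)
```

Applying the function to every pair of species gives the dissimilarity matrix that can then be passed to a cluster analysis.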

Legendre & Legendre (1998) Numerical Ecology pp. 355-361

Page 95: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

CHOICE OF CLUSTERING METHOD

• Some opt for single linkage: finds distinct clusters, but prone to chaining and sensitive to sampling pattern

• Most opt for average linkage and minimum variance methods: chops data more evenly

• All dependent on appropriate dissimilarity measure: should be ecologically meaningful

• Small changes in data can cause large visual changes in clustering: classification may be optimised for a chosen level (k-means)

• Fuzzy clustering may fail as well, but at least it shows the uncertainty

• TWINSPAN: surprisingly robust and useful

• Mixture models and latent class analysis: complex theory but limited utility

Page 96: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Basic concept and tradition in ecology and biogeography – characteristic or indicator species e.g. species characteristic of particular habitat, geographical region, vegetation type. Valuable in monitoring, conservation, management, description, and stratigraphy.

Add ecological meaning to groups of sites discovered by clustering

INDICATOR SPECIES – indicative of particular groups of sites. ‘Good’ indicator species should be found mostly in a single group of a classification and be present at most of the sites belonging to that group. Important DUALITY (faithful AND high constancy)

INDVAL – Dufrene & Legendre (1997) Ecological Monographs 67, 345-366

Derives indicator species from any hierarchical or non-hierarchical classification of objects

Indicator value index based only on within-species abundance and occurrence comparisons. Its value is not affected by the abundances of other species.

Significance of indicator value of each species is assessed by a randomisation procedure.

INDVAL

DETECTION OF INDICATOR SPECIES or CHARACTER SPECIES

Page 97: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Specificity measure: FAITHFULNESS

$$A_{ij} = N\,\mathrm{individuals}_{ij} \, / \, N\,\mathrm{individuals}_{i.}$$

where N individuals_ij is the mean abundance of species i across the sites in group j, and N individuals_i. is the sum of the mean abundances of species i over all groups (means are used to remove any effects of variation in the number of sites belonging to the various groups).

Fidelity measure: CONSTANCY

$$B_{ij} = N\,\mathrm{sites}_{ij} \, / \, N\,\mathrm{sites}_{.j}$$

where N sites_ij is the number of sites in group j where species i is present, and N sites_.j is the total number of sites in cluster j.

$A_{ij}$ is maximum when species i is present in group j only.

$B_{ij}$ is maximum when species i is present in all sites in group j.

Indicator value: $\mathrm{INDVAL}_{ij} = (A_{ij} \times B_{ij} \times 100)\,\%$

Indicator value of species i for a grouping of sites is the largest value of INDVAL_ij observed over all groups j of that classification:

$$\mathrm{INDVAL}_i = \max_j \, [\mathrm{INDVAL}_{ij}]$$

Will be 100% when individuals of species i are observed at all sites of a single group and in no other group.

INDICATOR SPECIES VALUE

Page 98: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

A random re-allocation procedure of sites among the groups is used to test the significance of INDVALi

Can be computed for any given partition or grouping of sites and/or for all levels of a hierarchical classification of sites.
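The indicator value and its randomisation test can be sketched for a single species as follows. The calculation combines the specificity and constancy measures defined for INDVAL; the p-value formula, the number of permutations and the function names are assumptions, and this is not the INDVAL program itself. The abundances used are those of species A and C in the 25-site test case of this lecture, with five groups of five sites.

```python
import random

def indval(abundance, groups):
    """Largest value over groups of A_ij * B_ij * 100 for one species."""
    labels = sorted(set(groups))
    means, constancy = [], []
    for g in labels:
        values = [a for a, lab in zip(abundance, groups) if lab == g]
        means.append(sum(values) / len(values))                           # mean abundance
        constancy.append(sum(1 for v in values if v > 0) / len(values))   # B_ij
    total = sum(means)
    if total == 0:
        return 0.0
    return max(100.0 * (m / total) * b for m, b in zip(means, constancy))

def indval_p_value(abundance, groups, n_perm=199, seed=0):
    """Randomly re-allocate sites among groups and compare with the observed value."""
    rng = random.Random(seed)
    observed = indval(abundance, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if indval(abundance, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

groups = [g for g in range(1, 6) for _ in range(5)]   # five groups of five sites
species_a = [4] * 5 + [5] * 10 + [3] * 10
species_c = [18] * 5 + [2] * 5 + [0] * 15
print(indval(species_a, groups), indval(species_c, groups))
print(indval_p_value(species_c, groups))
```

Running `indval` for each level of a hierarchical classification and keeping, for each species, the level at which the value peaks reproduces the logic of following indicator values across clustering levels.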

INDVAL

Page 99: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Diagram of the analysis steps for the Q- and R-mode classical analyses, and the TWINSPAN procedure. CA = Correspondence Analysis; DCA = Detrended Correspondence Analysis; MDS = Nonmetric Multidimensional Scaling; PCoA = Principal Coordinates Analysis; UPGMA = Unweighted Pair-Group Method using Arithmetic Average; WARD = Ward’s clustering method.

Q-mode (sites) and R-mode (species): from the sites by species table, obtain site (or species) groups by hierarchical clustering (UPGMA, WARD) or non-hierarchical clustering (k-means), and a site (or species) ranking by ordination (MDS, DCA, PCoA, CA).

TWINSPAN: primary ordination (CA); subdivide in two subsets; identify indicator species; refine the site ordination; subdivide in two subsets; repeat for each site subset to classify the samples in a divisive hierarchy (site groups and ordering); measure the species preferential power; classify the species.

Page 100: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Diagram of the analysis steps for the indicator value method

[Diagram: any site typology (site groups from hierarchical clustering such as UPGMA or WARD, or from non-hierarchical k-means clustering; site ranking from MDS, DCA, PCoA or CA) is used to measure species indicator power (INDVAL). Sites are then randomly permuted within the typology; each randomized INDVAL is added to the randomized INDVAL distribution, against which the observed value is compared.]

INDVAL

Page 101: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Test case results. (A) Distribution of abundances of the three species in the five clustering levels: Species A (widespread), Species B (2 group max), Species C (one group). (B) Bar chart of indicator value against number of clusters, showing the decrease (species A) or increase (species C) of the indicator values when the sites are subdivided.

Abundances of the three species at sites 1-25 (values are identical for the five sites within each block):

Species   Sites 1-5   Sites 6-10   Sites 11-15   Sites 16-20   Sites 21-25
          Group 1     Group 2      Group 3       Group 4       Group 5
A         4           5            5             3             3
B         8           4            6             4             0
C         18          2            0             0             0

Test case results: species indicator values and z statistics for five clustering levels.

Indicator value

Species   Number of clusters
          1      2      3      4      5
A         100    60.9   43.8   32.3   25
B         72     85.7   85.7   42.9   40
C         40     66.7   66.7   100    90

z-statistic

Species   Number of clusters
          2      3      4      5
A         7.17   5.7    4.25   2.88
B         6.2    5.6    2.19   2.84
C         4.21   3.14   6.59   6.39

Page 102: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

1. Xeric chalky grasslands
2. Mesic chalky grasslands
3. Zn grasslands
4. Atypical heathlands
5. Xeric heathlands
6. Temporary flooded heathlands
7. Peat bogs and raised mires
8. Swamps
9. Pond fringes
10. Alluvial grasslands

Hierarchical dendrogram built with the results of the k-means reallocation clustering method. Reallocations are scarce and the main changes concern the “temporary flooded heathlands” (group 6), which are allocated to wet habitats at the two-, five-, and six-group level and to dry habitats at the other clustering levels.

Carabid beetles 97 species. 123 year-catches from 69 different localities representing 9 habitats.

Page 103: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Dendrogram representing the TWINSPAN classification of the year-catch cycles. The indicator species relative abundance levels are expressed on an ordinal scale (1, 0-2%; 2, 2-5%; 3, 5-10%; 4, 10-20%; and 5, 20-100%).

[Figure: TWINSPAN dendrogram with seven levels of division. End groups: chalky mesic grasslands; chalky xeric grasslands; Zn grasslands and xeric sandy heathlands; atypical and xeric gravelly heathlands; temporary flooded heathlands; peaty heathlands; fringes of ponds and alluvial grasslands; swamps and raised mires. Indicator species (with pseudo-species level) shown at the divisions: T. secalis (1), P. nigrita (1), D. globosus (1), A. communis (1), P. melanarius (1), P. cupreus (1), A. equestris (1), C. problematicus (3), A. ater (1), P. versicolor (3), T. cognatus (1), P. madidus (1), H. rubripes (1), P. cupreus (1), P. versicolor (3), P. lepidus (1), C. melanocephalus (1), P. cupreus (1), B. ruficolis (1), D. globosus (1), C. violaceus (1), P. diligens (1), P. rhaeticus (1), A. fuliginosus (1), P. minor (1), P. minor (3), A. fuliginosus (1), L. pilicornis (1).]

Page 104: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

[Figure: two-way table of SPECIES (indicator and satellite species) against SITES (site groups). Site groups along one margin: alluvial grasslands, ponds, swamps, peat bogs, flooded heathlands, xeric heathlands, atypical heathlands, Zn grasslands, mesic chalky, xeric chalky. Species groups along the other margin: species present in all habitats; wet habitats; eutrophic; oligotrophic; dry habitats; chalky; heathlands; acid heathlands; typical heathlands.]

Steps that are followed to build a two-way table from the hierarchical clusters indicator values. The first species group (centre of figure) contains species that are common in all habitats (i.e., having their indicator value maximum when all sites are pooled in one group). At the next step, two species groups are created: one with species dominating in all wet habitats, and the other one with species that are common in all dry habitats. The procedure is repeated for each site cluster.

INDVAL
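The stepwise assignment described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual INDVAL definition (specificity times fidelity, expressed as a percentage); the function names `indval` and `best_level` and the toy data are hypothetical and are not taken from the INDVAL program.

```python
def indval(abund, groups):
    """Indicator values (%) of one species for every site group.

    abund  : abundances of the species, one value per site
    groups : group label of each site, in the same order
    """
    labels = sorted(set(groups))
    mean_ab, freq = {}, {}
    for g in labels:
        vals = [a for a, lab in zip(abund, groups) if lab == g]
        mean_ab[g] = sum(vals) / len(vals)                  # mean abundance in group
        freq[g] = sum(1 for v in vals if v > 0) / len(vals)  # fidelity: share of sites occupied
    total = sum(mean_ab.values())
    result = {}
    for g in labels:
        specificity = mean_ab[g] / total if total > 0 else 0.0
        result[g] = 100.0 * specificity * freq[g]
    return result


def best_level(abund, hierarchy):
    """Find where in a hierarchy of partitions the indicator value peaks.

    hierarchy : list of partitions (lists of group labels), coarsest first.
    Returns (level index, group label, indicator value). Ties keep the
    coarser level, so a ubiquitous species stays in the 'all sites' group.
    """
    best = (None, None, -1.0)
    for level, groups in enumerate(hierarchy):
        for g, value in indval(abund, groups).items():
            if value > best[2]:
                best = (level, g, value)
    return best


# Toy example: six sites, first pooled, then split into wet and dry.
hierarchy = [["all"] * 6, ["wet"] * 3 + ["dry"] * 3]
wet_specialist = [5, 4, 6, 0, 0, 0]
ubiquitous = [3, 3, 3, 3, 3, 3]
```

With these toy data, `best_level(wet_specialist, hierarchy)` selects the wet group at the second level, whereas `best_level(ubiquitous, hierarchy)` stays at the pooled level, mirroring the 'species present in all habitats' group at the centre of the figure.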

Page 105: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Site clusters obtained with the k-means method, with the associated indicator species and their indicator values in parentheses. All species with an indicator value >25% are listed for each site cluster in which they occur, down to the cluster where they reach their maximum indicator value.

All habitats: A. communis (26)

Wet habitats: P. rhaeticus (97), P. diligens (79), A. fuliginosum (56), P. minor (48), T. secalis (39), L. pilicornis (35), P. nigrita (29)

Eutrophic wet sites: P. minor (83), P. nigrita (80), L. pilicornis (76), T. secalis (73), P. strenuus (62), A. fuliginosum (55), C. granulatus (53), C. fossor (51), P. atrorufus (47), B. unicolor (44), A. versutum (37), O. helopioides (34), B. dentellum (32), A. viduum (31), A. moestum (30), B. doris (26), T. placidus (26)

Alluvial grasslands: P. strenuus (87), P. atrorufus (82), A. fuliginosum (65), B. unicolor (64), C. granulatus (61), O. helopioides (54), T. placidus (45), L. rufescens (40)

Pond fringes: A. versutum (88), B. dentellum (75), C. fossor (66), B. doris (63), B. obliquum (38), A. sexpunctatum (29), B. assimile (25)

Oligotrophic wet sites: A. ericeti (28)

Peat bogs / Raised mires: A. ericeti (28)

Swamps: A. gracile (42)

Dry habitats: A. lunicollis (51), P. versicolor (44), P. madidus (42), C. campestris (39), B. ruficollis (37), H. rubripes (36), P. cupreus (35), B. lampros (32), C. melanocephalus (31), P. lepidus (27)

[Cluster label missing]: P. versicolor (65), A. lunicollis (64), B. ruficollis (58), C. melanocephalus (49), B. lampros (46), P. lepidus (43), C. problematicus (38), C. campestris (36), A. equestris (35), N. aquaticus (35), M. foveatus (33), B. harpalinus (32), C. fuscipes (29), D. globosus (29), L. ferrugineus (26)

Chalky grasslands: P. madidus (82), H. rubripes (65), P. cupreus (56), B. bipustulatus (36), C. auratus (27), P. melanarius (27), C. nemoralis (26)

Mesic chalky grasslands: P. melanarius (99), C. auratus (63), C. violaceus (45), H. rufipes (44)

Xeric chalky grasslands: P. cupreus (59), A. ovalis (30), H. atratus (25), H. puncticollis (25)

Zn grasslands: A. equestris (97), C. fuscipes (68), C. campestris (56), A. similata (50), P. lepidus (46)

Heathlands: B. ruficollis (76), B. lampros (52), C. problematicus (45), B. harpalinus (43), N. aquaticus (42), L. ferrugineus (36), B. globosus (32), T. cognatus (32), B. nigricorne (30), C. micropterus (30), O. rotundatus (30), A. obscurum (29), H. rufitarsis (29), C. erratus (26)

Atypical heathlands: C. problematicus (89), L. ferrugineus (45), A. ater (40)

Typical heathlands: B. ruficollis (88), P. versicolor (61), C. melanocephalus (58), N. aquaticus (57), B. nigricorne (47), C. micropterus (47), O. rotundatus (47), D. globosus (42), M. foveatus (38), C. erratus (34)

Xeric heathlands: N. aquaticus (70), O. rotundatus (67), C. melanocephalus (63), C. erratus (59), H. rufitarsis (48), B. nigricorne (46), P. lepidus (46), B. collaris (37), A. infima (30), H. smaragdinus (30), B. 4-maculatus (30), H. tardus (26), B. properans (26)

Temporary flooded heathlands: T. cognatus (98), D. globosus (78), P. niger (75), A. obscurum (54), P. vernalis (46)

[The figure also numbers the clusters 2 to 10.]

INDVAL

Page 106: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Classification not an end in itself. Means to an end. Aid interpretation - external data, e.g. environmental data.

Basic EDA graphical approaches e.g. box plots.

Discriminant analysis using classification groups.

RELATING CLASSIFICATIONS TO EXTERNAL SET OF VARIABLES

Page 107: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

INTERPRETING CLUSTERS

• Clusters may differ in their environment

• Community classification may reflect environmental patterns

• Clustering may detect local peculiarities, whereas (most) ordination methods show the global gradient pattern

J. Oksanen (2002)

Page 108: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Hierarchical classification analysis of 34 lakes in northwestern Ontario based on abundance of 27 species of zooplankton (from Patalas, 1971). The vertical axis represents the information gain (∆I) on fusion. The higher the level of fusion, the more dissimilar the lakes are in species composition. On the horizontal axis are lake code numbers and group code letters.

BIOLOGY

The separation of the lake groups on the first two discriminant functions of the eleven environmental variables. Mean values for area and maximum depth are given for each group.

ENVIRONMENTAL VARIABLES IN RELATION TO BIOLOGICAL CLUSTERS

CANVAR, CANOCO

Green & Vascotto (1978) Water Research 12, 583-590

Page 109: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Heino et al. 2003 Ecological Applications 13 (3): 842-852.

235 headwater streams in Finland, macroinvertebrates, wide range of associated environmental variables.

TWINSPAN classification of the study streams. Numbers refer to number of sites in each group. Also shown are mean latitude (solid bars) and pH (shaded bars) for each TWINSPAN end group.

Page 110: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Results of indicator species analysis (INDVAL) at the 4th TWINSPAN division level

Page 111: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Mean values of environmental variables important in discriminating among the TWINSPAN groups at the 4th division level

Page 112: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Discriminant analysis to find the environmental variables that best discriminate between groups at the 2-group (level 1) and 10-group (level 4) TWINSPAN classification.

Wilks' lambda from stepwise DFA for variables best discriminating among groups at the 1st and 4th TWINSPAN division level

Page 113: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Leave-one-out cross-validation in discriminant analysis
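The idea of leave-one-out cross-validation can be sketched as follows. This is only an illustration: a nearest-centroid rule stands in for the discriminant functions (it is not the stepwise DFA used in the study), and the function names and the small data set of two environmental variables are hypothetical.

```python
def centroid(rows):
    """Mean of each variable over a list of sites."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two sites."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(X, y):
    """Leave-one-out proportion of sites allocated to their own group.

    Each site in turn is removed, group centroids are recomputed from
    the remaining sites, and the removed site is allocated to the group
    with the nearest centroid (a simple stand-in for a discriminant rule).
    """
    correct = 0
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        centroids = {}
        for lab in sorted(set(lab for _, lab in train)):
            centroids[lab] = centroid([x for x, l in train if l == lab])
        predicted = min(centroids, key=lambda lab: sq_dist(X[i], centroids[lab]))
        if predicted == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical sites: (pH, depth) for two classification groups A and B.
X = [[4.5, 10], [4.8, 12], [5.0, 11], [6.9, 30], [7.1, 28], [7.3, 33]]
y = ["A", "A", "A", "B", "B", "B"]
accuracy = loo_accuracy(X, y)
```

Because every site is predicted from a rule built without it, the resulting proportion is a less optimistic estimate of classification success than one obtained by re-classifying the sites used to build the rule. In practice the variables would normally be standardised first.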

Page 114: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Can also look at TWINSPAN groups in an ordination context, in this case non-metric multidimensional scaling

Page 115: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Can also plot TWINSPAN group membership on a canonical correspondence analysis (CCA) plot.

A CCA biplot defined by the first two axes of the ordination of environmental variables and TWINSPAN site groups.

Page 116: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Plot the INDVAL indicator species abundances in ordination space (in this case CCA space)

Page 117: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

External variables not always continuous. May be +/–, nominal, ordinal or mixed.

C.J.F. ter Braak (1986) Data Analysis & Informatics IV, 11-21

DISCRIM - simple discriminant functions

Supply biological classification first (e.g. TWINSPAN), then characterise classification in terms of external variables.

Coding as in TWINSPAN

+/– binary data: one dummy variable, 0/1

Nominal data (red, green, blue): 3 dummy variables (red 0/1, green 0/1, blue 0/1)

Ordinal data (small, medium, large): 3 dummy variables or pseudo-variables

Quantitative data (X1, X2, X3):
(i) discretise, e.g. <10, 10 - 20, >20, with disjoint coding (each class 0/1)
(ii) rank them and divide into quartiles ('pseudo-variables', conjoint coding)
(iii) pseudo-variables

Group variables together on basis of their fidelity to particular TWINSPAN groups. Means of characterising TWINSPAN groups.
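The coding options for external variables can be illustrated with a short Python sketch. The helper names are hypothetical, and the reading of 'conjoint coding' as cumulative pseudo-variables (a site in class k scores 1 on every level up to k, by analogy with TWINSPAN pseudo-species) is an assumption rather than something stated on the slide.

```python
def dummy_code(values):
    """Nominal variable -> one 0/1 dummy variable per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def disjoint_code(x, lower=10, upper=20):
    """Quantitative value -> three mutually exclusive 0/1 classes
    (<lower, lower..upper, >upper); exactly one class scores 1."""
    return [int(x < lower), int(lower <= x <= upper), int(x > upper)]


def quartile_classes(values):
    """Rank the values and assign each to a quartile class 1-4."""
    n = len(values)
    classes = []
    for v in values:
        rank = sum(1 for w in values if w < v)   # number of smaller values
        classes.append(1 + min(3, (4 * rank) // n))
    return classes


def conjoint_code(cls, n_levels=4):
    """Class k -> cumulative pseudo-variables: 1 for every level <= k
    (assumed meaning of 'conjoint coding')."""
    return [1 if cls >= level else 0 for level in range(1, n_levels + 1)]


categories, colour_dummies = dummy_code(["red", "green", "blue", "red"])
area_classes = quartile_classes([3, 8, 15, 22, 40, 55, 90, 120])
```

Disjoint coding keeps the classes separate, so a variable can characterise a group through any single class; the cumulative form keeps the ordering, so a pseudo-variable such as 'quartile 3 or higher' can act as a threshold-type indicator.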

MILTRANS

SIMPLE DISCRIMINANT ANALYSIS

Page 118: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Indicator species for the first two levels of division of TWINSPAN. Some species are indicators only if they occur with high abundance, e.g. Curlew >5 means Curlew is an indicator if its abundance reaches abundance class 5 or over (N: number of heathlands in group; thr: threshold value (maximum discriminant score for negative group); mn: number of misclassified negatives; mp: number of misclassified positives).

Biological Classification

Page 119: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

11 16 76 7 36 5 56 63 66 67 3 34 45 67 77 24 78 4 44 44 55 55 53 44 77 8 13 33 55 82 22 22 22 22 36 1 13 11 11 7

78 97 56 65 11 12 38 90 83 23 52 42 90 16 90 34 12 80 64 56 78 1 23 75 39 19 1 36 78 45 26 24 53 78 9 4 70 44 12 56 89 7

TADO −− −− −− −− −− 52 −2 33 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −3 −− −− −− −− −3 −− −− −− −− −− −− −− −− 0

NAEV 23 −− 43 65 5− 11 1− 1− −− −− −1 −− −− −1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −7 4 0

VANE 44 42 41 44 −5 13 33 22 2− 53 −− −− −− −1 −− 2− −2 −− 42 −3 44 −− −− 4− −− 36 −− −− −− −− −− −− −− 4− −− −− −− −− −− −− −− −− 100

SUBB 12 −− 11 −− −− 11 −2 −− −− −− −− −− −− −− −2 2− −− −− −− −− −− −− −− −− −1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

RUBE −− −− −5 34 44 −1 3− 21 23 −− −− −− −− −− −− −− −− −− −− −− −− −− −− 1− −− 3− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

TOTA 2− −2 41 24 −4 −2 1− 14 23 33 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −5 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

GALL −− −− 4− 3− −− −1 −2 −2 2− 2− −− −− −− −3 −− −− −− −− −− −− −− −− −− −− −− 3− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

OSTR 1− 4− 4− 1− −− −1 13 22 2− 25 −− −− −− −1 −− −− −− −− 2− −− −− −− −− −− −− −4 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

LIMO 14 −2 31 1− −− 12 12 22 −− 44 −− −− −− −− −− −− −− −− −− −− −4 −− −− 3− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

FLAV −− −3 23 4− −− 11 −− −− −− −− −− −− −− −− −− −− −2 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

EXCU −− −− −1 1− −− −− −− −− −− −− −− −− −− −− −− 1− 21 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

CYAN −− −− −− 1− −− −1 −− −− −− −− −− −− −− −1 −− −− 22 44 −− −− −− −− 44 47 −− −− −− −6 −− −− −− −− −− −− −− −− −− −− −− −− −− −− 101

PRAT −2 45 46 75 −− 13 1− −1 3− 2− −− 4− −− −1 −2 −− −− 3− −5 −4 65 −3 −− −− 5− −5 −− −− −4 −− −− 2− −− −− −− −− −− −− −− −− −− −− 11

CAMP −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 46 −− −− −− −− −1 44 −1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 11

OENA 1− −− 12 46 5− 33 43 33 43 33 −− −− 44 −− 33 −− −− −− 42 −− 32 42 −1 −1 43 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 11

CRIS −− −− −− −− −− −− −2 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 11

ARQU 55 76 44 46 66 45 65 55 65 55 56 65 55 43 −2 22 41 45 53 32 44 21 −− 3− 32 46 5 −− −− −− 3− −− 2− −− 5− −3 −− −− −− −− −− −− 10

RUBE 12 44 2− 54 −− 13 32 2− 3− −− 3− −4 4− −2 −2 −3 −− −− 31 −− 22 33 24 −4 2− −− −− −3 −5 −− −− −− −− −− −− −− −− −− −− −− −− −− 10

ARVE 35 −3 44 66 75 67 53 56 54 6− 4− 75 35 44 35 56 55 57 45 67 77 55 77 67 76 76 7 −6 −− 66 53 62 4− 44 −− −− −− −− −− −− −− −− 1100

CANN 56 75 76 66 67 52 56 52 5− 55 5− 66 4− 35 46 56 34 55 42 44 24 33 35 4− −2 5− 5 −− −− 43 −− −− −− −− −− −− 5− −− −− −− −− 5 1101

CANO 12 −− 42 34 44 21 33 22 3− 23 3− −4 34 22 3− 22 13 −− −2 −− 11 −1 −− 1− 2− −− −− −− −− −− −− −− −− −− −− −6 −− −5 −− −− −− 4 1101

TETR 43 −− −− 25 −7 23 23 21 3− −− −− −− −− 13 −− 2− 41 −− 2− 22 −3 5− −− 5− 3− −4 5 −− −− −− −6 −− 3− −− 7− −− −− −− −5 −− −− −− 1101

PERD 1− −− 51 36 5− 41 32 31 2− 4− −− 64 −− 34 −2 2− 23 45 −− 12 21 23 −− 3− −− 44 −− −− −5 −− −− 33 −− −− 5− −− −− −− −− −− −− −− 1101

ALBA 4− 44 33 44 4− 21 −3 1− 33 3− 3− −4 −− −1 35 2− 32 3− −2 −1 11 22 22 1− −2 −− −− −− −− −− −− −− −− −− −− −4 −− −3 −− −− −− −− 1101

TINN −− −− 32 3− 4− 21 14 21 2− 2− 34 −− 3− 21 32 −− −1 −− −1 −1 11 −− −− 1− 2− −− −− −− −− −− −− −− −− −− 5− 35 −− −− −− −− −− −− 111

SVEC 15 −5 3− 5− −− −− −− −− −− −− −− −4 −− −− −− −− −− −− 42 −− 21 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 6 111

SHCO 77 57 75 77 66 64 45 51 47 66 5− 4− 44 −3 −− −− 4− −− −− −− −− −− 3− −− −− −− −− −− −− −− −− −− −− −− 35 −− 7− −− −− −− 7 111

COMM 36 64 66 76 56 51 55 44 3− 66 4− 57 44 −5 55 32 21 −− 24 22 34 41 −− 15 −− 5− − −− −− −− −− 3− 2− −− −7 66 −− −− 7− −− 77 6 10

TRIV 77 76 76 66 76 74 45 65 54 56 6− 57 57 56 65 65 66 44 65 54 25 55 46 76 32 6− 5 77 77 67 −− 62 54 43 7− 66 77 76 77 77 −− 7 110

VIRI 1− −− −− −− −− 2− 33 11 2− −− −− −− 3− −2 43 2− −7 −5 −− 21 −− −− −− −− −1 −− −− −− −− −− −3 22 2− 34 55 44 −− −− −− −− −− −− 110

TROC 77 77 76 76 77 73 77 74 63 77 7− 67 6− 56 77 65 31 67 64 52 25 27 67 6− 21 −− −− −− −4 77 73 64 64 56 77 77 77 77 77 77 77 7 110

EURO 32 −4 −− 1− −− −1 −− −− −− −− 4− −− 54 −− −2 −− 44 −− −− −− −− −− 24 −− −− −− −− −3 −− −− −− −− −− −− −− 3− −− −3 −− −− −− −− 110

CITR 45 −5 53 47 65 52 56 54 44 54 −− 67 −− 45 44 45 53 65 42 32 44 55 −− 54 −1 56 6 73 76 54 5− −2 3− 3− 77 56 66 76 −− −− −− 6 111

ARBO −4 −2 −− −− −− 11 −2 −− −− −− −− −4 −− −2 −3 −− −− 55 −− −− −− −1 −− 37 23 −− −− −6 67 44 5− −− −− −− −− −− −− −− −− −− −− −− 111

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1

0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1

0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0 1

TWINSPAN two-way table of bird species (rows) of Dutch heathlands (columns). Values are logarithmic classes of abundance (number of pairs per 10 hectares). (−: absent; 1: <0.5; 2: 0.5-0.9; 3: 1.0-1.9; 4: 2.0-3.9; 5: 4.0-7.9; 6: 8.0-15.9; 7: >16.0). The top margin gives site identification numbers, printed vertically. The bottom and right-hand margins show the hierarchical classifications of the heathlands and birds, respectively, each with five levels of division. Vertical lines separate groups of sites at level 2; horizontal lines separate the first two species divisions.
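The logarithmic abundance classes in the caption double at each step, which can be written compactly. The sketch below is a hypothetical helper, not part of TWINSPAN; it returns 0 for an absent species (printed as − in the table) and assumes that a density of exactly 16.0 falls in class 7.

```python
from bisect import bisect_right

# Lower bounds (pairs per 10 hectares) of classes 2 to 7; class 1 is anything
# present below 0.5.
LOWER_BOUNDS = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]


def abundance_class(pairs_per_10ha):
    """Map a breeding density to the logarithmic class used in the table."""
    if pairs_per_10ha <= 0:
        return 0  # absent
    return bisect_right(LOWER_BOUNDS, pairs_per_10ha) + 1


densities = [0, 0.3, 0.5, 1.5, 3.9, 4.0, 10, 20]
classes = [abundance_class(d) for d in densities]
```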

Page 120: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Heathland characteristics used in DISCRIM to interpret clusters of heathlands

No.  Abbreviation   Description                                                                        Values

AREA
1    AREA < 20      Area of heath smaller than 20 hectares                                             0/1
2    AR20 − 100     Area of heath between 20 and 100 hectares                                          0/1
3    AREA > 100     Area of heath greater than 100 hectares                                            0/1

RECREATIONAL USAGE
4    RECR MILI      Index of recreational use of heath, including military usage, based on inquiry     ranked 1-82
24   RECR EATI      Index of recreational use of heath, excluding military usage, based on inquiry     ranked 1-82

ISOLATION
5    HEAT < 5 KM    Number of other heaths within a radius of 5 km from the border of the heath        ranked 1-82

LANDSCAPE
6    OPEN SAND      Presence of open sand within the heath                                             0/1
7    MOOR POOL      Presence of moorland pools within the heath                                        0/1
8    WET            Presence of wet patches within the heath                                           0/1
9    SURR FORE      Heath at least partly surrounded by woodland                                       0/1
10   SURR AGRI      Heath at least partly surrounded by woodland or arable land                        0/1

GEOGRAPHICAL POSITION
11   VELU WE        Heath lies on the VELUWE                                                           0/1
12   BRAB ANT       Heath lies in BRABANT                                                              0/1
13   DREN THE       Heath lies in DRENTHE                                                              0/1
14   GRON INGE      Heath lies in GRONINGEN                                                            0/1
15   GOOI           Heath lies in ‘het GOOI’                                                           0/1
16   LIMB URG       Heath lies in LIMBURG                                                              0/1

TOPOGRAPHY, SOIL, AND SOIL HETEROGENEITY (based on soil maps)
17   UNDU LATI      Heath is undulating                                                                0/1
18   FEN LAND       Presence of fen-land                                                               0/1
19   SAND SOIL      Presence of sandy soil                                                             0/1
20   SAND FEN       Presence of sandy soil in fen-land                                                 0/1
21   1SOI LTYP      Presence of only one soil type                                                     0/1
22   2SOI LTYP      Presence of two soil types                                                         0/1
23   3SOI LTYP      Presence of three or more soil types                                               0/1

Page 121: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

11 16 76 7 36 5 56 63 66 67 3 34 45 67 77 24 78 4 44 44 55 55 53 44 77 8 13 33 55 82 22 22 22 22 36 1 13 11 11 7

78 97 56 65 11 12 38 90 83 23 52 42 90 16 90 34 12 80 64 56 78 1 23 75 39 19 1 36 78 45 26 24 53 78 9 4 70 44 12 56 89 7

23 3SOI LTYP 11 −− 1− −− 1− 11 11 11 −− −− −− −− −1 11 −− −− −1 11 −− −− −− −− −− 1− −− 1− 1 −− −− −− −− −− −− −− −− −− −− −− −− −− −− − 11

13 DREN THE −− −− 11 11 11 11 11 11 11 11 11 11 11 11 11 −− −− −− 1− −− −− −− −− −− −− 1− − −− −− −− −− −− −− −− −− −1 −− −− −− −− −− − 11

7 MOOR POOL 11 −1 −1 −1 1− 11 11 −− 11 11 1− −− −− 1− −− 1− 1− −− 11 −1 1− −− −− −− −− 1− − −− −− −− −− −− −− −− −− −1 −− −− −− −− −− − 11

3 AREA >100 11 −1 11 1− −− −1 11 11 1− 1− −− −− −− 11 −1 11 11 −− 11 11 11 11 11 −1 1− −− − −− −− −1 −− 11 1− −1 −− −− −− −− −− −− −− − 11

20 SAND FEN −− −1 −− −− 1− −− 1− −1 −− −− −− −− −− −− 1− −− −− −− −− −− −− −− −− −− −− −− − −− −− −− −− −− −− −− −− 1− −− −− −− −1 −− − 10

18 FEN LAND −− −− 11 1− −− 1− −− 1− −− −− −1 −− −− −− −− −− −− 11 −− −− −− −− −− −− −− −− − −− −− −− −− −− −− −− −− −− 1− −− −− −− −1 11 10

12 BRAB ANT 11 11 −− −− −− −− −− −− −− −− −− −− −− −− −− −− 1− −− −− −− −− −− −− −− −− −1 1 −− −− −− 1− −− −− −− 1− −− −− −− −− −− −− −− 10

10 SURR AGRI 1− 1− 11 11 11 11 11 11 1− 11 11 11 11 −1 −1 −1 −− 11 1− −− −1 −− −− −− −− 11 1 1− −− −− 1− −− −− −− 1− 11 11 11 11 −− 11 1 111

8 WET 11 11 11 1− −1 −1 1− 11 −1 −− −− 11 −− −1 −− −− −− −− −− −− −− −− −− −− −− −− − −− −− −− −− −− −− −− −− 1− 1− 1− 1− 11 1− 1 111

2 AR2O -100 −− 1− −− −1 11 1− −− −− −1 −1 11 22 11 −− 1− −− −− 11 −− −− −− −− −− −1 −− 11 1 −1 11 1− 11 −− −1 1− 11 11 11 −1 −1 −− −− 1 111

24 RECR EATI 32 43 21 11 21 42 44 24 33 42 11 −1 31 22 42 43 43 31 21 44 34 44 44 34 42 42 3 24 33 33 23 23 32 33 11 31 12 23 12 21 11 2 1101

22 19 2SOI LTYP −− −1 −1 1− −− −− −− −− −− 11 1− 11 1− −− 11 −− 1− −− 11 −− −− 11 11 −− 1− −1 1 −− −− −− 11 −− −1 −1 −1 1− −− −− −− −1 1− − 1101

5 SAND SOIL 11 1− −− −1 −1 11 −1 −− 11 11 1− 22 11 11 11 11 11 11 11 11 11 11 11 11 11 11 1 11 11 11 11 11 11 11 11 −1 −1 11 11 11 1− − 1101

9 HEAT <5KM 22 44 23 42 23 44 44 34 34 33 42 −1 34 23 33 34 14 22 43 43 24 14 44 41 32 22 2 11 12 44 23 23 32 33 22 32 11 11 11 11 11 1 1101

4 SURR FORE 11 11 −− −− −− −1 1− −− 11 −− 1− 21 −1 1− 11 11 11 11 11 11 11 11 11 11 11 1− − 11 11 11 11 11 11 11 11 11 −− 1− −1 11 −− − 1100

17 RECR MILI 31 23 21 11 11 32 43 24 22 31 11 1− 31 11 41 33 44 22 13 44 34 34 44 34 33 22 2 24 44 44 24 23 43 43 13 11 22 23 24 22 22 3 1100

11 UNDU LATI −− −− −− −− −− −− 1− −1 −− −− −− −− 1− 11 −1 1− 11 1− 11 11 1− 1− −1 −1 11 −1 − −1 11 11 1− −− −− −− −− −− −− −− 11 −− −− − 10

6 VELU WE −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− 11 −1 −− −1 11 11 11 11 11 11 −− − −− −1 11 −− −− −− −− −− −− −− −1 −− −− −− − 10

OPEN SAND −− −− −− −− −− 1− 1− −− −− −− −− −− −1 11 −− 1− 1− −− −− −− −− −− −− −1 −− − −− 1− −− 1− −− −− −− −− −− −− −− −− −− −− − 10

21 ISOI LTYP −− 1− −− −1 −1 −− −− −− 11 −− −1 1− −− −− −− 11 −− −− −− 11 11 −− −− −1 −1 −− − 11 11 11 −− 11 1− 1− 1− −1 11 11 11 1− −1 1 1

16 LIMB URG −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −1 −− −− − −1 1− −− −− −− −− −− −− −− −− −− −− −− −− − 0

15 GOOI −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− − −− −− −− −1 11 11 11 −1 1− −− −− −− −− −− − 0

14 GRON INGE −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− − 1− −− −− −− −− −− −− −− −− 11 1− 11 11 11 − 0

1 AREA <20 −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− −− − 1− −− −− −− −− −− −− −− −− −− 1− 1− 11 11 − 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1

0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1

0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 1 0 0 1

DISCRIM two-way table of heath characteristics (rows) of Dutch heathlands (columns). Values are presence (1), absence (−) or indicate quartiles (1, 2, 3, 4). Vertical lines separate groups of sites at level 2. Horizontal lines separate three major groups of attributes.

Environmental Variables

Page 122: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Indicator attributes resulting from DISCRIM that best predict the biological divisions of TWINSPAN

Can re-arrange basic data tables using DIATAB

TWINSPAN

DISCRIM

Page 123: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Rohlf (1974) Ann. Rev. Ecol. Syst. 5, 101-113
Gordon (1999)
Fowlkes & Mallows (1983) J. Amer. Stat. Assoc. 78, 553-569

(1) Cross-classification table   COMPCLUS

(2) Rand coefficient   Rand (1971) J. Amer. Stat. Assoc. 66, 846-850

COMPCLUS

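$$ c = 1 - \frac{\tfrac{1}{2}\left[\sum_i \left(\sum_j n_{ij}\right)^2 + \sum_j \left(\sum_i n_{ij}\right)^2\right] - \sum_i \sum_j n_{ij}^2}{\tfrac{1}{2}\, n (n - 1)} $$

where $n_{ij}$ is the number of objects placed in group $i$ of one classification and group $j$ of the other, and $n$ is the total number of objects.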

                         Classification A
                         I     II    III
Classification B    I    2     2     1
                    II   1     0     4        (n = 10)

c = 1 – [½{(2 + 1)² + (2 + 0)² + (1 + 4)² + (2 + 2 + 1)² + (1 + 0 + 4)²} – (2² + 2² + 1² + 1² + 0² + 4²)] / 45

= 1 – [½{38 + 50} – 26] / 45

= 1 – 18/45

= 0.6

Range 0 (dissimilar) to 1 (identical classifications)
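The worked example can be checked with a short Python sketch that computes the coefficient directly from a cross-classification table. The function name `rand_coefficient` is hypothetical; it follows the arithmetic shown above and is not taken from COMPCLUS.

```python
def rand_coefficient(table):
    """Rand coefficient from a cross-classification table (list of rows).

    Half the sum of squared row and column totals, minus the sum of
    squared cell counts, gives the number of object pairs on which the
    two classifications disagree; dividing by n(n-1)/2 and subtracting
    from 1 gives the proportion of pairs on which they agree.
    """
    n = sum(sum(row) for row in table)
    row_sq = sum(sum(row) ** 2 for row in table)
    col_sq = sum(sum(col) ** 2 for col in zip(*table))
    cell_sq = sum(v * v for row in table for v in row)
    disagreements = 0.5 * (row_sq + col_sq) - cell_sq
    return 1.0 - disagreements / (n * (n - 1) / 2)


# Rows: classification B (I, II); columns: classification A (I, II, III).
example = [[2, 2, 1],
           [1, 0, 4]]
c = rand_coefficient(example)   # 1 - 18/45
```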

HOW TO COMPARE CLASSIFICATIONS?

Page 124: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

COMPCLUS

Cross-classification table

Comparison of the Nine-Group Clustering of Samples Suggested by the Agglomerative Minimum Sum-of-Squares Algorithm

                        Vegetation-landform unit
Cluster                 T    F-T  OCF  CCF-A  CCF-B  CCF-C  MFU  MFL  DF   AP   G
I                       11   1    -    -      -      -      3    -    -    -    -
II                      3    1    -    -      -      -      -    5    -    1    -
III                     -    8    8    -      4      -      3    -    -    -    -
IV                      -    -    -    -      -      -      -    -    4    1    -
V                       -    -    -    -      -      -      -    -    -    7    2
VI                      -    -    2    -      15     -      1    -    -    -    -
VII                     -    -    2    -      3      6      4    7    -    -    -
VIII                    -    -    2    -      -      14     1    1    -    -    -
IX                      -    -    -    8      -      2      -    1    -    -    -
Total No. of samples    14   10   14   8      22     22     12   14   4    9    2

Cross-classification table

Comparison of the Nine-Group Clustering of Samples Suggested by the Hybrid Algorithm

                        Vegetation-landform unit
Cluster                 T    F-T  OCF  CCF-A  CCF-B  CCF-C  MFU  MFL  DF   AP   G
1                       11   1    -    -      -      -      3    -    -    -    -
2                       3    1    -    -      -      -      -    5    -    1    -
3                       -    8    8    -      4      -      3    -    -    -    -
4                       -    -    -    -      -      -      -    -    4    1    -
5                       -    -    -    -      -      -      -    -    -    7    2
6                       -    -    2    -      15     -      1    -    -    -    -
7                       -    -    2    -      3      6      4    7    -    -    -
8                       -    -    2    -      -      14     1    1    -    -    -
9                       -    -    -    8      -      2      -    1    -    -    -
Total No. of samples    14   10   14   8      22     22     12   14   4    9    2

Page 125: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Matrix of Rand’s (1971) Coefficients between Partitions of the Lichti-Federovich and Ritchie (1968) Data Based on Vegetation-Landform Units and Partitions Suggested by Several Numerical Classifications of the Surface-Pollen Data

                                    Vegetation-landform classification
Numerical pollen classification     3 groups    7 groups    11 groups
Agglomerative (3 groups)            0.76        0.65        0.64
Agglomerative (5 groups)            0.69        0.76        0.77
Agglomerative (9 groups)            0.61        0.86        0.87
Hybrid (9 groups)                   0.61        0.86        0.87
Hybrid (11 groups)                  0.59        0.85        0.88

COMPCLUS

R

Page 126: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

Rand's coefficient should be corrected for chance so as to ensure

1. its expected value is 0 when the partitions are selected at random (subject to the constraint that the row and column totals are fixed)

2. its maximum value is 1

The similarity between two independent classifications of the same set of objects can be assessed by comparing their Rand statistic with its distribution under the randomisation model.

For a small number of objects n, the complete set of n! values of the Rand statistic can be evaluated. For large n, the observed value is compared with the values obtained from a random subset of permutations.

R
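Both ideas, correction for chance and the randomisation test, can be sketched in Python. The correction used here is the adjusted Rand index of Hubert and Arabie (1985), which has expectation 0 under random partitions with fixed marginal totals and maximum 1; whether this is the exact form intended on the slide is an assumption. The function names, the seed and the toy labels are hypothetical.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand(a, b):
    """Chance-corrected Rand index for two label vectors of equal length."""
    n = len(a)
    sum_cells = sum(comb(v, 2) for v in Counter(zip(a, b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(a).values())
    sum_b = sum(comb(v, 2) for v in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)      # value expected by chance
    maximum = 0.5 * (sum_a + sum_b)            # value for identical partitions
    if maximum == expected:
        return 1.0                             # degenerate case: trivial partitions
    return (sum_cells - expected) / (maximum - expected)


def permutation_p_value(a, b, n_perm=999, seed=0):
    """Compare the observed statistic with a random subset of permutations.

    Shuffling the labels of b keeps both sets of group sizes fixed, which
    matches the randomisation model with fixed row and column totals.
    """
    rng = random.Random(seed)
    observed = adjusted_rand(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if adjusted_rand(a, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


labels_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_b = [0, 0, 0, 1, 1, 1, 2, 2, 2]
ari = adjusted_rand(labels_a, labels_b)
p = permutation_p_value(labels_a, labels_b)
```

Shuffling a label vector leaves the adjusted index unchanged in its scaling but not in its value, so the same test applied to the uncorrected Rand coefficient would give the same ranking of permutations and hence the same p-value.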

Page 127: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

(3) Hill’s coefficient

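Hill’s S (Moss 1985 Applied Geogr. 5, 131-150)   CLASSTAT

Cross-classification table of classifications I and J

                                  Categories of Classification J
                                  1      2      3      4      ...    n      Totals
Categories of                1    a11    a12                  ...    a1n    a1.
Classification I             2    a21                                       a2.
                             3    a31                                       a3.
                             4
                             ...
                             m                                       amn    am.
Totals                            a.1    a.2                  ...    a.n    a = N

Let $p_{ij} = a_{ij} / a$, with marginal proportions $p_{i\cdot} = a_{i\cdot} / a$ and $p_{\cdot j} = a_{\cdot j} / a$. Then

$$ S_{IJ} = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}} $$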

Page 128: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

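Maximum possible value of $S_{IJ}$ comes when the two classifications are identical. Then $S_{IJ} = H_I = H_J$, where $H_I$ is the entropy of classification I:

$$ H_I = -\sum_i p_{i\cdot} \log p_{i\cdot}, \qquad H_J = -\sum_j p_{\cdot j} \log p_{\cdot j} $$

Adapt $S_{IJ}$ so that 0 = independence, 1 = best agreement possible:

$$ S = \frac{S_{IJ}}{\min(H_I, H_J)} = \frac{H_I + H_J - H_{IJ}}{\min(H_I, H_J)} $$

where

$$ H_{IJ} = -\sum_i \sum_j p_{ij} \log p_{ij} $$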
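A minimal Python sketch of Hill's coefficient and its standardised form is given below, assuming the definitions above (shared information between the two classifications, divided by the smaller of the two entropies). The function name `hill_s` is hypothetical and is not taken from CLASSTAT; natural logarithms are used, which does not affect the standardised value.

```python
from math import log


def hill_s(table):
    """Return (S_IJ, standardised S) for a cross-classification table.

    table : list of rows of counts a_ij (rows = classification I,
            columns = classification J).
    """
    total = sum(sum(row) for row in table)
    p = [[v / total for v in row] for row in table]
    p_row = [sum(row) for row in p]                 # marginal proportions of I
    p_col = [sum(col) for col in zip(*p)]           # marginal proportions of J
    s_ij = sum(v * log(v / (p_row[i] * p_col[j]))
               for i, row in enumerate(p)
               for j, v in enumerate(row) if v > 0)
    h_i = -sum(v * log(v) for v in p_row if v > 0)  # entropy of I
    h_j = -sum(v * log(v) for v in p_col if v > 0)  # entropy of J
    smallest = min(h_i, h_j)
    # A classification with a single group has zero entropy; report 0 then.
    standardised = s_ij / smallest if smallest > 0 else 0.0
    return s_ij, standardised


identical = hill_s([[5, 0], [0, 5]])      # same partition in I and J
independent = hill_s([[2, 2], [2, 2]])    # cell counts proportional to margins
```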

Page 129: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

(4) Cohen's Kappa statistic

Kappa

$$ \hat{\kappa} = \frac{p_o - p_e}{1 - p_e} $$

where

$$ p_o = \sum_{i=1}^{c} p_{ii} $$

is the overall proportion of observed agreements, and

$$ p_e = \sum_{i=1}^{c} p_{i\cdot}\, p_{\cdot i} $$

is the overall proportion of chance-expected agreements.

Kappa = 1 with perfect agreement

Kappa = 0 with observed agreement approximately the same as would be expected by chance ($p_o \approx p_e$)
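The statistic can be computed from a square cross-classification table in a few lines of Python. This sketch assumes that both classifications use the same c categories in the same order, so that the diagonal cells are the agreements; the function name `cohen_kappa` and the example counts are hypothetical.

```python
def cohen_kappa(table):
    """Cohen's kappa from a square table of counts (rows and columns
    list the same categories in the same order)."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n   # observed agreement
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_p, col_p))          # chance-expected agreement
    return (p_o - p_e) / (1 - p_e)


perfect = cohen_kappa([[5, 0], [0, 5]])
chance = cohen_kappa([[25, 25], [25, 25]])
# 50 objects: p_o = 35/50 = 0.7, p_e = 0.5*0.6 + 0.5*0.4 = 0.5
partial = cohen_kappa([[20, 5], [10, 15]])
```

Unlike the Rand coefficient, kappa requires the groups of the two classifications to be matched one to one, so it suits comparisons such as observed versus predicted group membership rather than two unrelated cluster analyses.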

R

Page 130: NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 3. Classification

TWINSPAN, TWINGRP, TWINDEND

WINTWINS

ORBACLAN, COINSPAN

FLEXCLUS

DISCRIM, MILTRANS, DIATAB

CLUSTER

DCMAT, GOWER

CLUSTAN-PC

CLUSTAN GRAPHICS

USEFUL CLASSIFICATION SOFTWARE

K-MEANS

FUZPHY

CLASSTAT, COMPCLUS

RANMAT, ASSOC

BOOTCLUS, SAMPLERE

CEDIT

R + Libraries

(e.g. mva, cluster, MASS)