classification

Classification (Dis)similarity measures, Resemblance functions Cluster analysis TWINSPAN

Upload: sopoline-castro

Post on 31-Dec-2015


DESCRIPTION

Classification. Similarity measures. Each ordination or classification method is based (explicitly or implicitly) on some similarity measure. (Two possible formulations of the ordination problem). Similarities (dissimilarities, resemblance functions) based on qualitative/quantitative data. - PowerPoint PPT Presentation

TRANSCRIPT

  • Classification: (Dis)similarity measures, resemblance functions. Cluster analysis. TWINSPAN

  • Similarity measures

    Each ordination or classification method is based (explicitly or implicitly) on some similarity measure. (Two possible formulations of the ordination problem.)

  • Similarities (dissimilarities, distances)

    Resemblance functions: the term includes both similarities and dissimilarities.

    If $0 \le S \le 1$, then often $D = 1 - S$, or $D = \sqrt{1 - S}$, or $D = \sqrt{1 - S^2}$.

    Different indices are usually used for sample similarity than for species similarity.

    Similarity of two cases (samples) has a meaning by itself; similarity of two species has meaning only in relation to the data set.

    The species set is fixed (e.g. all vascular plant species), whereas cases are usually a random selection from a population of sites.

  • A distance should fulfil the triangle inequality

    For any three objects A, B and C: $AB \le AC + BC$

  • Resemblance functions

    Probably hundreds have been proposed, and tens are in use.

    Data type: presence/absence (0/1)
        We compare cases (Q): Sørensen coefficient, Jaccard coefficient
        We compare species (R): Pearson φ (V) coefficient, Yule (Q) coefficient

    Data type: quantitative
        We compare cases (Q): Euclidean distance, χ² distance, percentage similarity
        We compare species (R): correlation coefficients, χ² distance

  • Case similarity based on qualitative data

    Sørensen: $S = \frac{2a}{2a + b + c}$; Jaccard: $J = \frac{a}{a + b + c}$

                              Species in sample B
                                +      −
    Species in sample A    +    a      b
                           −    c      d

    Therefore, a is the number of species present in both compared cases.

    d is the number of species absent from both cases (usually not used); consequently, the value is independent of the other cases in the table.
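As a minimal sketch of how the two case-similarity coefficients follow from the 2 × 2 table, the Python below counts a, b and c from two presence/absence vectors and applies the usual definitions S = 2a / (2a + b + c) and J = a / (a + b + c). The function name and the example vectors are illustrative assumptions, not taken from the slides.

```python
def sorensen_jaccard(sample_a, sample_b):
    """Return (Sorensen, Jaccard) for two equal-length 0/1 presence vectors."""
    a = sum(1 for x, y in zip(sample_a, sample_b) if x and y)      # in both
    b = sum(1 for x, y in zip(sample_a, sample_b) if x and not y)  # only in A
    c = sum(1 for x, y in zip(sample_a, sample_b) if y and not x)  # only in B
    # d (species absent from both samples) is deliberately never counted.
    return 2 * a / (2 * a + b + c), a / (a + b + c)


# Hypothetical samples recorded over five species.
sample_a = [1, 1, 1, 0, 0]
sample_b = [1, 1, 0, 1, 0]
s, j = sorensen_jaccard(sample_a, sample_b)

# Appending species absent from both samples leaves both values unchanged.
s_padded, j_padded = sorensen_jaccard(sample_a + [0, 0, 0], sample_b + [0, 0, 0])
print(s, j, s_padded, j_padded)
```

Because d never enters either formula, the result does not depend on which other species occur elsewhere in the table.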

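  • Species similarity based on presence/absence

                      Species B
                        +      −
    Species A      +    a      b
                   −    c      d

    Pearson V: $V = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$; Yule Q: $Q = \frac{ad - bc}{ad + bc}$

    d is the number of cases without both species; here it is absolutely necessary.

    Table A: a = 50, b = 50, c = 50, d = 1000

    Table B: a = 50, b = 50, c = 50, d = 5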
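To illustrate why d is indispensable for species similarity, the sketch below evaluates the standard Pearson V (φ) and Yule Q coefficients on two illustrative 2 × 2 tables that share a = b = c = 50 and differ only in d (1000 versus 5). The function and variable names are assumptions made for this example.

```python
import math


def pearson_v(a, b, c, d):
    """Pearson V (phi) coefficient of a 2 x 2 presence/absence table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def yule_q(a, b, c, d):
    """Yule Q coefficient of a 2 x 2 presence/absence table."""
    return (a * d - b * c) / (a * d + b * c)


# Same a, b, c; only the number of cases lacking both species (d) differs.
many_empty = (50, 50, 50, 1000)
few_empty = (50, 50, 50, 5)

for table in (many_empty, few_empty):
    print(table, round(pearson_v(*table), 3), round(yule_q(*table), 3))
```

With many cases lacking both species the two species appear positively associated; with few such cases the same joint occurrences indicate a negative association.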

  • Species vs. case similarity

    Species similarity (i.e. similarity of the species' ecological behavior, e.g. V, Q) is often scaled from -1 to 1. The null model is independence of the species, in which case V = Q = 0.

    Case similarity (S, J) is usually scaled from 0 (no common species) to 1 (identical species composition). No meaningful null model is available (compare a random selection of two sets of species from the species pool?).

  • Transformation is an algebraic function X'ij = f(Xij) which is applied to each value independently of the other values. Standardization is done either with respect to the values of other species in the case (standardization by cases) or with respect to the values of the species in other cases (standardization by species).

    Quantitative data

    Centering means the subtraction of a mean so that the resulting variable (species) or case has a mean of zero. Standardization usually means division of each value by the case (species) norm or by the total of all the values in a case (species).
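The distinction between a transformation (each value on its own) and a standardization or centering (each value relative to its row or column) can be sketched as follows. The small data table and the function names are hypothetical, and log10(x + 1) is used only as one common choice of f.

```python
import math

# Hypothetical data table: rows are cases (samples), columns are species.
table = [
    [10.0, 0.0, 5.0],
    [2.0, 4.0, 4.0],
]


def log_transform(table):
    """Transformation: X'ij = log10(Xij + 1), each value on its own."""
    return [[math.log10(x + 1) for x in row] for row in table]


def standardize_cases_by_total(table):
    """Standardization by cases: divide each value by its case (row) total."""
    return [[x / sum(row) for x in row] for row in table]


def center_species(table):
    """Centering by species: subtract each species (column) mean."""
    n_cases = len(table)
    means = [sum(col) / n_cases for col in zip(*table)]
    return [[x - m for x, m in zip(row, means)] for row in table]


transformed = log_transform(table)
by_total = standardize_cases_by_total(table)
centered = center_species(table)
```

Only the first function could be applied to a single value in isolation; the other two need the whole row or column.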

  • Ordinal transformation of the Br.-Bl. scale is roughly equivalent to log transformation of the cover values

    Br.-Bl. scale   Ordinal tr.   Cover (Bannister 1966)   log(Cover+1)
    r               1             0.1                      0.04139
    +               2             0.5                      0.17609
    1               3             3                        0.60206
    2               4             15                       1.20412
    3               5             37.5                     1.58546
    4               6             62.5                     1.80277
    5               7             87.5                     1.94694
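The claimed rough equivalence can be checked numerically. This sketch recomputes log(Cover+1) from the cover values listed on the slide and correlates it with the ordinal values; base-10 logarithms are assumed because they reproduce the tabulated numbers, and the helper name is illustrative.

```python
import math

ordinal = [1, 2, 3, 4, 5, 6, 7]               # ordinal transformation
cover = [0.1, 0.5, 3, 15, 37.5, 62.5, 87.5]   # cover values from the slide
log_cover = [math.log10(c + 1) for c in cover]


def pearson_r(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(ordinal, log_cover)
print(round(r, 3))
```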

  • Euclidean distance

    For ED, standardize by case norm, not by total.

    Table columns: cases 1 to 4 (raw), 1t to 4t (standardized by total), 1n to 4n (standardized by case norm); rows: Species 1 to 6. Each nonzero raw abundance is 10 in cases 1 and 2 and 5 in cases 3 and 4; the corresponding standardized values are 1 and 0.33 (by total) and 1 and 0.58 (by case norm).

    Table 6-1. Hypothetical table with samples 1 and 2 containing one species each and samples 3 and 4 containing three equally abundant species each (for standardized data the actual quantities are not important). Sample 1 has no species in common with sample 2, and sample 3 has no species in common with sample 4. The samples marked t contain values standardized by the total; those marked n are standardized by the sample norm. For samples standardized by total, ED12 = 1.41 (√2), whereas ED34 = 0.82; for samples standardized by sample norm, ED12 = ED34 = 1.41.
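The distances quoted in the caption can be reproduced with a few lines of Python. The exact assignment of species to cases below is an assumption (the caption fixes only the number of species per case and that cases 1 and 2, and cases 3 and 4, share no species); the function names are illustrative.

```python
import math


def by_total(case):
    """Standardize a case by the total of its values."""
    total = sum(case)
    return [x / total for x in case]


def by_norm(case):
    """Standardize a case by its norm (root of the sum of squares)."""
    norm = math.sqrt(sum(x * x for x in case))
    return [x / norm for x in case]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Assumed layout over six species: pairs (1, 2) and (3, 4) share no species.
case1 = [10, 0, 0, 0, 0, 0]
case2 = [0, 10, 0, 0, 0, 0]
case3 = [5, 0, 5, 0, 5, 0]
case4 = [0, 5, 0, 5, 0, 5]

ed12_total = euclidean(by_total(case1), by_total(case2))
ed34_total = euclidean(by_total(case3), by_total(case4))
ed12_norm = euclidean(by_norm(case1), by_norm(case2))
ed34_norm = euclidean(by_norm(case3), by_norm(case4))
```

After standardization by total, two completely different species-rich cases look closer than two completely different species-poor cases; standardization by norm removes this artifact.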

  • Percentage similarity (quantitative Sørensen)

    Neither ED nor PS takes into account species that are absent from both compared cases.
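The formula itself did not survive on this slide. A minimal sketch, assuming the commonly used form PS = 200 · Σ min(x1, x2) / Σ (x1 + x2), is given below; the function name and the abundance vectors are hypothetical.

```python
def percentage_similarity(x1, x2):
    """PS = 200 * sum of minima / sum of all values, on a 0-100 scale."""
    shared = sum(min(a, b) for a, b in zip(x1, x2))
    return 200 * shared / (sum(x1) + sum(x2))


# Hypothetical abundances of four species in two cases.
case1 = [10, 5, 0, 1]
case2 = [4, 5, 6, 0]
ps = percentage_similarity(case1, case2)

# With 0/1 data the index reduces to 100 times the Sorensen coefficient.
ps_binary = percentage_similarity([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])

# Species absent from both cases contribute nothing to either sum.
ps_padded = percentage_similarity(case1 + [0, 0], case2 + [0, 0])
```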

  • Similarity of species based on quantitative data

    Correlation coefficients (ordinary, rank), i.e. again taking into account the cases where both species are missing.

    Note the implicit double standardization (by both case and species totals). Consequently, the value changes according to the composition of the other cases in the table.

  • Similarity of samples vs. similarity of communities

    Inspired by the seemingly high beta-diversity of insects in the tropics

    Species    Population (%)   Sample 1 (indiv.)   Sample 2 (indiv.)
    Spec.1     5                3                   1
    Spec.2     3                1                   2
    Spec.3     1                0                   1
    Spec.4     1                0                   1
    .          1                0                   0
    .          1                0                   0
    .          1                1                   0
    .          1                0                   1
    .          1                0                   0
    .          1                0                   0
    .          1                1                   0
    .          1                0                   1
    .          1                0                   0
    .          1                1                   0
    .          1                1                   1
    .          1                0                   0
    .          0.5              0                   0
    .          0.5              1                   1
    .          0.2              0                   0
    .          0.2              0                   0
    .          0.1              1                   0
    .          0.1              0                   1
    Spec. n    0.1              1                   0
    Etc.

  • Normalized expected species similarity index (NESS, Grassle & Smith 1976)

    $\mathrm{NESS} = \frac{ESS_{12}}{(ESS_{11} + ESS_{22}) / 2}$

    $ESS_{12}$: expected number of species in common between two random subsamples of a certain size drawn without replacement from the two compared larger samples

    $ESS_{11}$: expected number of shared species in two subsamples taken randomly from the first sample

    $ESS_{22}$: expected number of shared species in two subsamples taken randomly from the second sample

  • Similarity of objects when variables are measured on various scales

  • Gower distance
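The slide gives only the name. As a hedged sketch of the usual idea, each variable contributes a partial distance between 0 and 1 (absolute difference divided by the variable's range for quantitative variables, a simple mismatch for categorical ones) and the contributions are averaged. Equal weights, the function name and the example objects are assumptions made for illustration.

```python
def gower_distance(obj_a, obj_b, ranges):
    """Average of per-variable partial distances, each scaled to 0..1.

    ranges[k] is the range (max - min) of quantitative variable k in the
    whole data set, or None when variable k is categorical.
    """
    parts = []
    for x, y, rng in zip(obj_a, obj_b, ranges):
        if rng is None:
            parts.append(0.0 if x == y else 1.0)   # categorical mismatch
        else:
            parts.append(abs(x - y) / rng)          # range-scaled difference
    return sum(parts) / len(parts)


# Hypothetical objects: (height in cm, soil pH, habitat type).
ranges = (40.0, 2.0, None)
plot_a = (10.0, 5.0, "forest")
plot_b = (30.0, 6.0, "meadow")
d_ab = gower_distance(plot_a, plot_b, ranges)   # (0.5 + 0.5 + 1) / 3
```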

  • Standardization of variables by the s.d. or range is sometimes problematic; it is possible to standardize by variation, see Lepš et al. 2006

  • Similarity matrices are used directly in:

    Multidimensional scaling (both metric and non-metric; see Šmilauer's lecture)

    Mantel test

  • Mantel test

    Question: is there any dependence between two (dis)similarity matrices?

    E.g. is the distance between individual plants in physical space correlated with their genetic dissimilarity?

  • Individuals in the plot

    [Figure: positions of five plants in the plot, at coordinates (1, 2), (2, 3), (1, 1), (2, 2) and (10, 10). Annotation at individual No. 5: "And this individual is strange (just one of five)".]

  • Two dissimilarity matrices

    Distance in the plot

    plant   1       2       3       4       5
    1
    2       1.41
    3       1.00    2.24
    4       1.00    1.00    1.41
    5       12.04   10.63   12.73   11.31

    Genetic distance

    plant   1       2       3       4       5
    1
    2       0.1
    3       0.2     0.2
    4       0.1     0.3     0.2
    5       0.9     0.6     0.7     0.8

  • Regression is highly significant

    (but the 10 pairwise values are treated as independent observations, although they come from only five plants!)

    And four of those ten observations, all involving plant 5, are the largest

  • Solution: permutation test

    Not the individual distances but the individuals are permuted
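A minimal sketch of such a permutation test on the two five-plant matrices from the slides follows. Because there are only 5! = 120 relabellings, all of them are enumerated rather than sampled; the Pearson correlation as test statistic, the one-tailed p-value and all names are assumptions made for this illustration.

```python
import math
from itertools import permutations

# Lower triangles from the slides, stored as full symmetric matrices.
spatial = [
    [0.00, 1.41, 1.00, 1.00, 12.04],
    [1.41, 0.00, 2.24, 1.00, 10.63],
    [1.00, 2.24, 0.00, 1.41, 12.73],
    [1.00, 1.00, 1.41, 0.00, 11.31],
    [12.04, 10.63, 12.73, 11.31, 0.00],
]
genetic = [
    [0.0, 0.1, 0.2, 0.1, 0.9],
    [0.1, 0.0, 0.2, 0.3, 0.6],
    [0.2, 0.2, 0.0, 0.2, 0.7],
    [0.1, 0.3, 0.2, 0.0, 0.8],
    [0.9, 0.6, 0.7, 0.8, 0.0],
]


def lower_triangle(matrix, order):
    """Pairwise values after relabelling the individuals by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i)]


def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


n = len(spatial)
x = lower_triangle(spatial, tuple(range(n)))
r_obs = pearson_r(x, lower_triangle(genetic, tuple(range(n))))

# Permute individuals (rows and columns together), not single distances.
perm_r = [pearson_r(x, lower_triangle(genetic, p)) for p in permutations(range(n))]
p_value = sum(r >= r_obs - 1e-12 for r in perm_r) / len(perm_r)
print(round(r_obs, 3), p_value)
```

With five individuals the smallest attainable p-value is 1/120, which makes explicit how little independent information the ten pairwise distances carry.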

  • Classification

    nonhierarchical (e.g., K-means clustering)
    hierarchical
        divisive
            monothetic (e.g. Association analysis); of historical significance only
            polythetic (e.g. TWINSPAN)
        agglomerative (classical cluster analysis)

  • Non-hierarchical classification: K-means clustering

    In fact, a reverse ANOVA. ANOVA: F = MS_group / MS_residual

    The goal: divide the set into groups so as to maximize F (a multivariate counterpart of F)
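The "reverse ANOVA" idea can be sketched by computing an F-like ratio for alternative partitions of the same points. The version below uses sums of squared Euclidean distances to centroids as one possible multivariate analogue of F; that choice, the function name and the example points are assumptions, and the sketch only scores given partitions rather than searching for the best one as K-means does.

```python
def pseudo_f(points, labels):
    """(SS_between / (k - 1)) / (SS_within / (n - k)) for a given partition."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    n, k, dim = len(points), len(groups), len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    ss_between = ss_within = 0.0
    for members in groups.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        ss_between += len(members) * sum((c - g) ** 2 for c, g in zip(centroid, grand))
        ss_within += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical cases in a two-dimensional space: two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
f_natural = pseudo_f(points, [0, 0, 0, 1, 1, 1])   # groups follow the gap
f_mixed = pseudo_f(points, [0, 1, 0, 1, 0, 1])     # groups cut across it
```

The partition that respects the gap between the two clouds gives a far larger ratio, which is the quantity the clustering tries to maximize.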

  • Hierarchical agglomerative (cluster analysis)

    [Figure: original data matrix (species × cases) -> resemblance -> similarity matrix (samples × samples) -> clustering algorithm]

  • Subjective decisions in the objective procedure

    Nevertheless, the procedure is reproducible

    Field sampling
        -> (importance value) -> Raw data
        -> (transformation, standardization, similarity measure) -> (Dis)similarity matrix
        -> (clustering algorithm) -> Tree

  • Cluster analysis: joining

    Distances among objects are in the (dis)similarity matrix. In hierarchical classification, we also need the distances among clusters.

  • Single linkage (nearest neighbour) is representative of the "short hand" methods; complete linkage (furthest neighbour) is representative of the "long hand" methods.

    Several other methods exist, e.g. Ward's (minimum dispersion). Average linkage is the most popular, but the term has been used for several methods; the preferred name is UPGMA (Unweighted Pair Group Method with Arithmetic mean).

    [Figure: two clusters, A and B]

  • Single linkage -> chaining
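Chaining is easy to reproduce with a naive agglomerative procedure. The sketch below merges the two closest clusters until a requested number remains, measuring the distance between clusters either as the minimum (single linkage) or the maximum (complete linkage) of the pairwise object distances. The function name and the one-dimensional example with slowly widening gaps are assumptions chosen to make the effect visible.

```python
def agglomerate(dist, n_clusters, linkage):
    """Naive hierarchical joining; returns clusters as lists of object indices."""
    clusters = [[i] for i in range(len(dist))]
    between = min if linkage == "single" else max   # nearest vs furthest pair
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = between(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]                              # j > i, so index i stays valid
    return sorted(clusters)


# Hypothetical objects on a line, each gap slightly wider than the previous one.
positions = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0, 7.5]
dist = [[abs(p - q) for q in positions] for p in positions]

single = agglomerate(dist, 2, "single")
complete = agglomerate(dist, 2, "complete")
print(single, complete)
```

Single linkage keeps attaching the next neighbour to one growing chain and leaves the last object on its own, whereas complete linkage splits the same objects into two compact groups.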

  • Order does not play a role

  • TWINSPAN: Two-Way INdicator SPecies ANalysis

    Invented by Mark Hill to search for a pattern in extensive vegetation tables

    Inspired by classical phytosociology

    Algorithm based on presence/absence data

    Quantitative data are used for the definition of pseudospecies

  • TWINSPAN 2: pseudospecies

    Definition of cut levels has a similar effect to a transformation (weighting dominance vs. presence/absence)

    Compare cut levels 0, 1, 10, 100 vs. 0, 10, 20, 30, 40

    Lower exclusive border
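How cut levels turn one quantitative value into several presence/absence pseudospecies can be sketched as follows. Reading "lower exclusive border" as "a pseudospecies is present when the abundance is strictly greater than its cut level" is an assumption of this sketch, as are the function and variable names; the two sets of cut levels are the ones compared on the slide.

```python
def pseudospecies(abundance, cut_levels):
    """One 0/1 flag per cut level: 1 if the abundance exceeds that level."""
    return [1 if abundance > cut else 0 for cut in cut_levels]


log_like_cuts = (0, 1, 10, 100)      # resolves differences among low values
linear_cuts = (0, 10, 20, 30, 40)    # resolves differences among high values

for abundance in (0, 0.5, 5, 15, 35):
    print(abundance,
          pseudospecies(abundance, log_like_cuts),
          pseudospecies(abundance, linear_cuts))
```

Under the log-like levels, abundances of 0.5 and 5 already differ in their pseudospecies, while under the linear levels they are indistinguishable and only the larger values are separated.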

  • Divisive method: each group is divided on the basis of the first CA axis

    The first axis is based on CA ordination; it is therefore not surprising that TWINSPAN results correspond well to, e.g., DCA, and that individual groups are well clustered in ordination space.


  • Divisive method: each group is divided on the basis of the first CA axis

    Most of the cases are usually around the center -> we need some polarization


  • Polarized ordination (based on indicator species)


  • Group 01 is more similar to group 1 than group 00 is

    The order of groups reflects a possible gradient in the table

  •              SSSSSSSSSSSSSS
                 aaaaaaaaaaaaaa
                 mmmmmmmmmmmmmm
                 pppppppppppppp
                 00000000000000
                 00000000000000
                 00000000011111
                 21345678901234
     4 Sali SIle ---------5---- 0000
    29 Anth Alpi -----2---3---- 0000
    30 Hype Macu ------2--3---- 0000
    31 Rubu Idae ------2--3---- 0000
    28 Aden Alli -----2---2---- 0001
     1 Pice Abie 6666665---5--- 001000
     7 Oxal Acet 55344-4--3---- 001001
     9 Sold Hung 43444--------- 001001
    18 Luzu Pilo 2-2----------- 001001
    20 Luzu Sylv --3243-------- 001001
    12 Gent Ascl 23333333------ 001010
    32 Geun Mont ------2--3-33- 1101
     5 Juni Comm -----------24- 111
    34 Puls Alba -----------32- 111
    38 Oreo Dist ----------5564 111
    39 Fest Supi ----------3444 111
    40 Camp Alpi ----------34-4 111
    41 Junc Trif ----------4453 111
    42 Luzu Alpi ----------33-- 111
    43 Hier Alpi ----------233- 111
    44 Care Semp -----------545 111
    45 Tris Fusc -----------33- 111
    46 Pote Aure ------------32 111
    47 Sale Herb -------------5 111
    48 Prim Mini -------------4 111
                 00000000001111
                 0000000111
                 0000011
