PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION Richard Brereton r.g.brereton@bris.ac.uk


Post on 14-Dec-2015


Page 1: PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION Richard Brereton r.g.brereton@bris.ac.uk

PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION

Richard Brereton

r.g.brereton@bris.ac.uk

Page 2

CLUSTER ANALYSIS - UNSUPERVISED PATTERN RECOGNITION

 

•Grouping of objects according to similarity.

•No predefined classes

Page 3

TAXONOMY

Page 4

CHEMICAL TAXONOMY

Grouping organisms according to similarity from chemical fingerprints

•DNA base pairs, proteins

•NMR and pyrolysis of extracts

•NIR spectra

Page 5

SIMILAR PRINCIPLES IN ALL TYPES OF CHEMISTRY

• Chemical archaeology

• Environmental samples

• Food

Page 6

STEPS IN CLUSTER ANALYSIS

Similarity measures. 

Calculate similarity between objects.

Example

Page 7

Correlation coefficient : higher, more similar

Euclidean distance : smaller, more similar

Euclidean distance

Page 8

Manhattan distance

Manhattan distance : smaller, more similar
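
The three similarity measures named on these slides can be sketched in a few lines. The formulae on the original slides did not survive, so this uses the standard textbook definitions; the two objects and their measurements are invented for illustration.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient: higher means more similar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def euclidean(x, y):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences: smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical measurements on two objects (not from the slides)
obj1 = [1.0, 2.0, 3.0, 4.0]
obj2 = [2.0, 4.0, 6.0, 8.0]

print(correlation(obj1, obj2), euclidean(obj1, obj2), manhattan(obj1, obj2))
```

Note that the two objects are perfectly correlated yet far apart in distance, so the choice of measure can change which objects are grouped.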

Page 9

Use correlations for illustration. Group samples.

1. Find the most similar pair (highest correlation): objects 2 and 5.

2. Combine them.

3. Work out the new correlation of the new object 2&5 with the other objects (1, 3, 4, 6).

Page 10

Linkage methods – determination of new similarity measures of groups.

Several methods.

• Nearest neighbour uses the highest correlation

• Furthest neighbour uses the lowest correlation

• Average linkage uses an average.

Illustrate with nearest neighbour.
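
A minimal sketch of one linkage step follows. The correlation matrix on the original slide is not available, so the values below are invented, chosen only so that objects 2 and 5 are the most similar pair. The average here is a plain mean of the two member correlations; other averaging conventions exist once groups differ in size.

```python
# Hypothetical correlations between six objects (values invented for illustration)
corr = {
    (1, 2): 0.30, (1, 3): 0.55, (1, 4): 0.10, (1, 5): 0.25, (1, 6): 0.40,
    (2, 3): 0.20, (2, 4): 0.60, (2, 5): 0.95, (2, 6): 0.15,
    (3, 4): 0.35, (3, 5): 0.45, (3, 6): 0.70,
    (4, 5): 0.50, (4, 6): 0.05,
    (5, 6): 0.65,
}

def sim(a, b):
    """Look up a correlation regardless of the order of the two objects."""
    return corr[(a, b)] if (a, b) in corr else corr[(b, a)]

# Step 1: find the most similar pair (highest correlation)
pair = max(corr, key=corr.get)

# Steps 2 and 3: combine the pair, then work out its similarity to each other object
others = [o for o in range(1, 7) if o not in pair]
linkage = {
    "nearest": max,                          # highest correlation
    "furthest": min,                         # lowest correlation
    "average": lambda v: sum(v) / len(v),    # mean correlation
}
new_sim = {
    name: {o: rule([sim(o, pair[0]), sim(o, pair[1])]) for o in others}
    for name, rule in linkage.items()
}

print(pair)
for name, values in new_sim.items():
    print(name, values)
```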

Page 11
Page 12

Dendrograms
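
A dendrogram records the order in which groups are joined and the similarity level of each join. The sketch below repeats the nearest neighbour step until one group remains and prints those joins as text rather than drawing the tree. The four objects and their correlations are hypothetical.

```python
from itertools import combinations

def nearest_neighbour_merges(labels, similarity):
    """Agglomerate until one cluster remains; return (cluster_a, cluster_b, level) per join."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        def link(pair):
            # Similarity of two clusters = highest similarity between any two members
            a, b = pair
            return max(similarity[frozenset((i, j))] for i in a for j in b)

        a, b = max(combinations(clusters, 2), key=link)
        merges.append((a, b, link((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical correlations between four objects (not from the slides)
similarity = {
    frozenset((1, 2)): 0.9, frozenset((1, 3)): 0.4, frozenset((1, 4)): 0.2,
    frozenset((2, 3)): 0.5, frozenset((2, 4)): 0.3, frozenset((3, 4)): 0.8,
}

merges = nearest_neighbour_merges([1, 2, 3, 4], similarity)
for a, b, level in merges:
    print(f"join {a} and {b} at similarity {level}")
```

Each printed line corresponds to one horizontal bar of the dendrogram, placed at the stated similarity level.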

Page 13

CLUSTER ANALYSIS : SUMMARY

• Similarity measures

• Linkage methods

• Dendrogram

Page 14

CLASSIFICATION

Many methods.

CONVENTIONAL

LDA (Linear discriminant analysis)

Original statistics : projections

Page 15

Examples

 

Orange juices: can we classify them by origin, and can we detect adulteration from NIR spectra?

Class modelling of mussels: can we find which come from a polluted site using GC?

 

 

Detailed mathematical model

Page 16

PRINCIPLES : BIVARIATE EXAMPLE

[Figure: bivariate plot of class A and class B, with line 1, line 2 and the centres of class A and class B marked.]

Page 17

Often exact cut-off impossible

[Figure: bivariate plot of class A and class B, with line 1, line 2 and the class centres marked.]

Page 18

Class distance plots

[Figure: class distances, measured from the centre of class A and the centre of class B.]

Page 19

Multivariate data : several measurements per class

Example – Fisher Iris data – four measurements per iris: petal width, petal length, sepal width, sepal length

150 irises, 50 of each species

I. setosa

I. versicolor

I. virginica

Page 20

SPECIAL DISTANCES USED.

Linear discriminant function between classes A and B
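
The equation on the original slide did not survive. Judging from the three terms listed on this slide, it presumably had the following form. The symbols are assumptions: row vectors for the class centres and for the measurements of object i, and a pooled matrix built from both classes.

```latex
W_{AB} = \left(\bar{\mathbf{x}}_A - \bar{\mathbf{x}}_B\right)\,\mathbf{C}_{AB}^{-1}\,\mathbf{x}_i^{\prime},
\qquad
\mathbf{C}_{AB} = \frac{(N_A - 1)\,\mathbf{C}_A + (N_B - 1)\,\mathbf{C}_B}{N_A + N_B - 2}
```

Here the first factor is the difference between the class centres, the middle factor is the inverse of the pooled variance covariance matrix, and the last factor is the (transposed) measurement vector of the object.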

• The first term is simply the difference between the centres of each class – so a more positive value indicates class A.

• The middle term is the inverse of the “pooled variance covariance matrix”.

What does this mean? Sometimes measurements are correlated. Sometimes classes are more dispersed. Puts distances on a common scale.

• The final term is the measurement for each object.
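
The three terms can be put together in a small bivariate sketch. This assumes the usual pooled matrix (sums of squares about each class centre, divided by N_A + N_B - 2) and uses invented two-variable data, not the iris measurements. All function names are hypothetical.

```python
def mean_vector(rows):
    """Centre of a class for two measured variables."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def scatter(rows, centre):
    """2x2 sums of squares and cross products about the class centre."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - centre[0], r[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def discriminant(class_a, class_b):
    """Return a function giving the score (centre_A - centre_B) C^-1 x for an object x."""
    ca, cb = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter(class_a, ca), scatter(class_b, cb)
    dof = len(class_a) + len(class_b) - 2
    pooled = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    inv = inverse_2x2(pooled)
    diff = [ca[0] - cb[0], ca[1] - cb[1]]                  # first term
    weights = [diff[0] * inv[0][k] + diff[1] * inv[1][k]   # first term times middle term
               for k in range(2)]
    return lambda x: weights[0] * x[0] + weights[1] * x[1]  # times the final term

# Hypothetical measurements (two variables per object)
class_a = [(5.0, 3.0), (6.0, 4.0), (7.0, 3.5), (6.0, 2.5)]
class_b = [(2.0, 1.0), (3.0, 2.0), (4.0, 1.5), (3.0, 0.5)]

score = discriminant(class_a, class_b)
for x in class_a + class_b:
    print(x, round(score(x), 2))
```

With these data every class A object scores higher than every class B object; real data usually overlap.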

Page 21

[Figure: discriminant score against sample number for I. versicolor and I. virginica.]

Page 22

Can shift the scale so that

• a positive score probably indicates class A,

• a negative score probably indicates class B.

Note some ambiguities.

[Figure: discriminant score (WAB) against sample number, adjusted for group means.]

Page 23

Extending to more than 2 classes

Three classes – 2 out of 3 possible discriminant parameters

If we have 3 classes and choose to use WAB and WAC as the functions, it is easy to see that

• an object belongs to class A if WAB and WAC are both positive,

• an object belongs to class B if WAB is negative and WAC is greater than WAB, and

• an object belongs to class C if WAC is negative and WAB is greater than WAC.
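
The rule above follows because, if all functions share one pooled matrix and are adjusted for group means, the third function is WBC = WAC - WAB, so comparing WAC with WAB decides between B and C. A minimal sketch, with a hypothetical function name and an added "ambiguous" outcome for scores that fall exactly on a boundary:

```python
def assign_class(w_ab, w_ac):
    """Assign A, B or C from the two mean-adjusted discriminant scores."""
    if w_ab > 0 and w_ac > 0:
        return "A"
    if w_ab < 0 and w_ac > w_ab:
        return "B"
    if w_ac < 0 and w_ab > w_ac:
        return "C"
    return "ambiguous"  # a score lies exactly on a class boundary

# Hypothetical scores
for scores in [(2.0, 1.0), (-1.0, 3.0), (-1.0, -2.0), (3.0, -1.0)]:
    print(scores, assign_class(*scores))
```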

Page 24

[Figure: plot with axes WAB and WAC, showing the regions for class A, class B and class C.]

Page 25

Mahalanobis distance

Similar idea to the Euclidean distance, i.e. the distance to the centre of a class, but uses the variance covariance matrix for scaling.
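
A minimal two-variable sketch of the difference, using the standard definition of the Mahalanobis distance (square root of the deviation from the centre, times the inverse variance covariance matrix, times the deviation again). The centre, matrix and points are invented for illustration.

```python
import math

def mahalanobis(x, centre, cov):
    """Distance of a bivariate point from a class centre, scaled by the 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - centre[0], x[1] - centre[1]]
    squared = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(squared)

def euclidean(x, centre):
    return math.hypot(x[0] - centre[0], x[1] - centre[1])

# Hypothetical class: variable 1 has four times the variance of variable 2
centre = (0.0, 0.0)
cov = [[4.0, 0.0], [0.0, 1.0]]
p1, p2 = (2.0, 0.0), (0.0, 2.0)

print(euclidean(p1, centre), euclidean(p2, centre))
print(mahalanobis(p1, centre, cov), mahalanobis(p2, centre, cov))
```

Both points are equally far from the centre in Euclidean terms, but the point displaced along the more dispersed variable is closer in Mahalanobis terms.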

Page 26

[Figure: class distance plot. Axes: distance to class A, distance to class B.]

Page 27

[Figure: class distance plot. Axes: distance to class A, distance to class B. Marked regions: class A, class B, ambiguous, and outlier (maybe another class?).]

Page 28

[Figure: I. versicolor and I. virginica.]

Page 29

[Figure: I. versicolor, I. virginica and I. setosa.]

Page 30

Many classical statistical methods developed first in biology.

Problem for chemists: the Mahalanobis distance requires more samples than variables, otherwise the variance covariance matrix cannot be inverted.

Spectroscopy, chromatography : often a huge number of measurements per sample.
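
The problem can be seen on a very small scale. In this sketch, with invented numbers, two samples measured on two variables give a variance covariance matrix whose determinant is zero, so it has no inverse; adding two more samples makes it invertible.

```python
def covariance_2x2(rows):
    """Variance covariance matrix of samples measured on two variables."""
    n = len(rows)
    m = [sum(r[k] for r in rows) / n for k in range(2)]
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(2)] for i in range(2)]

def determinant(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical data: two samples, two variables (samples do not exceed variables)
two_samples = [(1.0, 5.0), (3.0, 2.0)]
det_two = determinant(covariance_2x2(two_samples))

# Two further hypothetical samples: now more samples than variables
four_samples = two_samples + [(2.0, 6.0), (4.0, 4.0)]
det_four = determinant(covariance_2x2(four_samples))

print(det_two, det_four)
```

A spectrum with hundreds of wavelengths and a few dozen samples is the same situation at a larger scale.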

Page 31

Solutions

•Variable selection

•PCA prior to performing classification

Page 32

Many diagnostics

•Modelling power of variables

•Discriminatory power of variables

•Quality of class model

•Probabilities of class membership

•Ambiguous classification : are the analytical data good enough?

Page 33

MANY SOPHISTICATIONS

Large number of methods for classification based on LDA.

•Bayesian methods – based on prior probabilities.

•Methods that try to find optimal groupings before class modelling.

Page 34

LOTS OF INFORMATION

•Class membership

•Outliers

•Whether there is another, new class

•Is a class well defined, or are there subclasses, e.g. subspecies or species from different environments?

•What measurements are most useful for discrimination. Can we reduce the number of measurements?

•Are there ambiguous samples, and if so do we need more or better measurements?

•Replicate analysis. Is our method sufficiently repeatable? Clinical diagnostics.

Page 35

SIMCA is sometimes used in chemometrics as an alternative

•Soft

•Independent

•Modelling of

•Class Analogy

Page 36

Use PCA models


Page 37

Use PCA to model each class independently

•Choose optimal number of PCs

•Use distance from PC model as an indicator of class distance
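
The idea can be sketched as follows, as a simplification rather than a full SIMCA implementation: each class gets its own one-component PCA model, and the class distance of a new object is its residual distance from that model. The number of PCs is fixed at one here instead of being optimised, the first loading is found by power iteration, and the data and function names are hypothetical.

```python
import math

def fit_pc_model(rows, n_iter=200):
    """One-component PCA model of a single class: (centre, loading vector)."""
    n, m = len(rows), len(rows[0])
    centre = [sum(r[k] for r in rows) / n for k in range(m)]
    centred = [[r[k] - centre[k] for k in range(m)] for r in rows]
    # Power iteration on X'X converges to the first loading vector
    p = [1.0] + [0.0] * (m - 1)
    for _ in range(n_iter):
        scores = [sum(r[k] * p[k] for k in range(m)) for r in centred]
        p = [sum(s * r[k] for s, r in zip(scores, centred)) for k in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
    return centre, p

def residual_distance(x, model):
    """Distance from x to the line spanned by the class's first PC."""
    centre, p = model
    d = [x[k] - centre[k] for k in range(len(x))]
    t = sum(dk * pk for dk, pk in zip(d, p))  # score of x on the PC
    return math.sqrt(sum((dk - t * pk) ** 2 for dk, pk in zip(d, p)))

# Hypothetical training data: each class modelled independently
class_a = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.0)]
class_b = [(1.0, 8.0), (2.0, 8.1), (3.0, 7.9), (4.0, 8.0)]
model_a, model_b = fit_pc_model(class_a), fit_pc_model(class_b)

x = (2.0, 2.2)  # hypothetical unknown object
print(residual_distance(x, model_a), residual_distance(x, model_b))
```

Because the models are independent, an object can be close to both classes, or to neither, which is what makes the method "soft".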

Page 38

VALIDATION OF A CLASS MODEL

Procedure

•Establish a training set.

•Assess model with a test set.

•Use model on real data.

Information

•Graphical - e.g. diagrams

•Quantitative - class distances

•Quantitative - probability of membership of a given class.
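
The first two steps of the procedure can be sketched as below. As a stand-in for a real class model, this uses the Euclidean distance to each class centre; the labelled data and the rule for holding back test objects are invented for illustration.

```python
import math

def centre(rows):
    """Mean of each measurement over the objects of one class."""
    n = len(rows)
    return tuple(sum(r[k] for r in rows) / n for k in range(len(rows[0])))

def classify(x, centres):
    """Assign x to the class with the nearest centre (Euclidean class distance)."""
    return min(centres, key=lambda label: math.dist(x, centres[label]))

# Hypothetical labelled data: (measurements, known class)
data = [((1.0, 1.2), "A"), ((1.5, 0.8), "A"), ((0.9, 1.1), "A"), ((1.2, 1.4), "A"),
        ((4.0, 4.2), "B"), ((4.5, 3.8), "B"), ((3.9, 4.1), "B"), ((4.2, 4.4), "B")]

# Establish a training set; every fourth object is held back as the test set
test = data[::4]
train = [d for i, d in enumerate(data) if i % 4 != 0]

# Build the class model from the training set only
centres = {label: centre([x for x, lab in train if lab == label]) for label in ("A", "B")}

# Assess the model with the test set
correct = sum(classify(x, centres) == lab for x, lab in test)
print(f"{correct} of {len(test)} test objects correctly classified")
```

The test objects play no part in building the model, so the proportion correctly classified is a fairer estimate of how the model will behave on real data.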

Page 39

Training set

Test set

Page 40

SUMMARY

•Cluster analysis – unsupervised pattern recognition

•Similarity measures

•Linkage

•Dendrograms

•Classification – supervised pattern recognition

•Class models

•Class distances

•Graphical methods