Machine Vision


Lecture slides for machine vision.


1

Pattern Recognition

Dr P. N. Suganthan
Room No.: S2-B2a-21
Tel: 67905404
E-mail: epnsugan@ntu.edu.sg

2

Outline
• Pattern Recognition (weeks 4 - 7)
  - Model-based approach: template matching (covered in week #3)
  - Statistical approaches
    Introduction
    Supervised: minimum distance, minimum risk, k-NN, Fisher's discriminant
    Unsupervised: clustering algorithms
  - Applications: biometrics

3

Recommended Reference Books

1. R. C. Gonzalez & R. E. Woods, "Digital Image Processing", Addison-Wesley, 2010.

2. R. O. Duda, P. E. Hart and D. G. Stork, "Pattern Classification", 2nd ed., John Wiley & Sons, 2001. [Duda01]

3. S. T. Bow, "Pattern Recognition and Image Preprocessing", M. Dekker, 2002. (TK7882.P3B784P)

4

Overview of image analysis (block diagram): image acquisition and image processing (e.g. segmentation) feed into feature extraction and representation & description, which in turn feed classification, recognition & interpretation; a knowledge base interacts with all stages. The stages span low-level, intermediate-level and high-level processing.

5

Design Cycle

[Duda01]

• Iterative process; there is no universal approach.
• Training: use the data to determine the optimum classifier, i.e. one providing an optimal decision surface.
• Evaluation of the error for improvement.

6

What is a feature? A feature is a characteristic primitive or attribute of image patterns (e.g. the pixel value of an image point, a distribution of pixel values within an image region, shape characteristics of patterns, etc.).

A pattern may be characterized by one or more features.

Examples of features used to represent image regions (objects):

area (A): count of the pixels within a region; this gives a measure of the size of the region.

perimeter (P): count of the pixels on the boundary of an image region.

roundness (R): a measure of circularity (R = 1 for a circle):  R = 4πA / P²

A lot more on features in lecture set 1.
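As a rough illustration of these region features, here is a small NumPy sketch that computes A, P and R from a binary mask; the function name and the 4-connected boundary test are my own choices (one common convention, not the only one).

```python
import numpy as np

def region_features(mask):
    """Area, perimeter and roundness of a binary region (illustrative sketch).

    mask: 2-D boolean array, True inside the region. A boundary pixel is a
    region pixel with at least one 4-connected background neighbour.
    """
    mask = mask.astype(bool)
    area = int(mask.sum())                       # A: number of region pixels

    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[1:-1, :-2] & padded[1:-1, 2:] &   # left & right neighbours
                padded[:-2, 1:-1] & padded[2:, 1:-1])    # up & down neighbours
    boundary = mask & ~interior
    perimeter = int(boundary.sum())              # P: number of boundary pixels

    roundness = 4 * np.pi * area / perimeter**2  # R = 4*pi*A / P^2
    return area, perimeter, roundness

# Example: a filled disc of radius 20 pixels.
yy, xx = np.mgrid[-25:26, -25:26]
disc = xx**2 + yy**2 <= 20**2
print(region_features(disc))   # R comes out roughly 1 (pixel counting is approximate)
```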

7

Statistical features:

• Statistical measures are commonly used to represent features and the variations of feature values.

• Two basic types of statistical measures are:
  - First-order statistics, e.g. mean, variance and standard deviation of the feature distribution (a normal distribution is assumed).
  - Second-order statistics, i.e. features obtained from the co-occurrence matrix.

• By analyzing feature values, shapes/patterns can be classified and recognized.

• Statistical features can be any extractable measurements.
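As a rough illustration, a short NumPy sketch of both kinds of measure: first-order statistics of the pixel values in a region, and a tiny grey-level co-occurrence matrix for a single "one pixel to the right" offset. The quantisation to 8 levels and the offset are arbitrary choices, not taken from the slides.

```python
import numpy as np

def first_order_stats(region):
    """First-order statistics of the pixel values in a region (sketch)."""
    v = np.asarray(region, dtype=float).ravel()
    return {"mean": v.mean(), "variance": v.var(), "std": v.std()}

def cooccurrence_right(gray, levels=8):
    """Grey-level co-occurrence counts for the offset 'one pixel to the right'."""
    q = (np.asarray(gray, dtype=float) / 256.0 * levels).astype(int)
    q = np.clip(q, 0, levels - 1)                 # quantise intensities to few levels
    left, right = q[:, :-1].ravel(), q[:, 1:].ravel()
    C = np.zeros((levels, levels), dtype=int)
    np.add.at(C, (left, right), 1)                # count co-occurring level pairs
    return C

img = np.random.default_rng(0).random((32, 32)) * 256   # made-up test image
print(first_order_stats(img))
print(cooccurrence_right(img).shape)   # (8, 8)
```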

[Slides 8-21: figure-only slides; no text recoverable from the transcript.]

Minimum Distance Classifier (MDC)

Example (bean sorting: red/green, big/small): each class ωj, j = 1, 2, 3, 4, is represented by a prototype mj = (x1^(j), x2^(j))^T in the (x1, x2) feature space.

For an unknown pattern x:

  ||x - m3|| < ||x - m1||
  ||x - m3|| < ||x - m2||
  ||x - m3|| < ||x - m4||

x is closest to m3; therefore, x belongs to ω3.

[Figure: the (x1, x2) feature space showing the prototypes m1, m2, m3, m4, the unknown pattern x and the resulting decision boundaries (types of decision function).]

22

• Each pattern class ωj is represented by a prototype (mean vector):

  mj = (1/Nj) Σ_{x ∈ ωj} x ,   j = 1, 2, ..., M

  where mj is the mean vector of class ωj and Nj is the number of pattern vectors from class ωj.

• Given an unknown pattern vector x, find the closest prototype by finding the minimum Euclidean distance:

  Dj(x) = ||x - mj|| ,   j = 1, 2, ..., M

  where ||a|| = (a^T a)^(1/2) is the Euclidean norm of vector a.

• Assign x to ωi if Di(x) ≤ Dj(x) for all j = 1, 2, ..., M.

[Figure: prototypes m1, m2, m3, m4 in the feature space.]

23

• Recognition is achieved by finding Di(x) = min_j { Dj(x) }, j = 1, 2, ..., M.

• The decision function for the MDC can be defined as dj(x) = -Dj(x).

• Recognition is then achieved by finding di(x) = max_j { dj(x) }, j = 1, 2, ..., M.

• By expanding Dj(x) (previous slide) and removing the common terms, dj(x) is simplified to

  dj(x) = x^T mj - (1/2) mj^T mj

24

How to simplify the decision function of the MDC?

  Dj²(x) = (x - mj)^T (x - mj) ,   j = 1, ..., M

  dj(x) = -(x - mj)^T (x - mj)

Expanding,

  (x - mj)^T (x - mj) = x^T x - x^T mj - mj^T x + mj^T mj
                      = x^T x - 2 mj^T x + mj^T mj        (since mj^T x = x^T mj)

so

  dj(x) = -(x^T x - 2 mj^T x + mj^T mj)

x^T x is independent of j, i.e. a common term. When max_j { dj(x) } is sought, x^T x can be dropped. Therefore, dividing by 2, we have

  dj(x) = x^T mj - (1/2) mj^T mj

25

Iris example:

  versicolor ω1: m1 = (4.3, 1.3)^T
  setosa     ω2: m2 = (1.5, 0.3)^T

  d1(x) = x^T m1 - (1/2) m1^T m1 = 4.3 x1 + 1.3 x2 - 10.1
  d2(x) = x^T m2 - (1/2) m2^T m2 = 1.5 x1 + 0.3 x2 - 1.17

Rewritten in matrix form:

  [d1]   [4.3  1.3  -10.1 ] [x1]
  [d2] = [1.5  0.3   -1.17] [x2]
                            [ 1]
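The matrix form maps directly onto a couple of lines of NumPy. A minimal sketch (the variable and function names are my own) that reproduces the decision values shown on the next slide:

```python
import numpy as np

# Rows: [w1, w2, bias] for each class, exactly as in the matrix form above:
#   d1(x) = 4.3*x1 + 1.3*x2 - 10.1   (Iris versicolor, class w1)
#   d2(x) = 1.5*x1 + 0.3*x2 - 1.17   (Iris setosa,     class w2)
W = np.array([[4.3, 1.3, -10.1],
              [1.5, 0.3, -1.17]])

def mdc_decide(x1, x2):
    d = W @ np.array([x1, x2, 1.0])          # [d1, d2]
    label = ["Iris versicolor", "Iris setosa"][int(np.argmax(d))]
    return d, label

print(mdc_decide(2, 0.5))   # d ~ [-0.85, 1.98] -> Iris setosa
print(mdc_decide(5, 1.5))   # d ~ [13.35, 6.78] -> Iris versicolor
```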

26

If a petal has length 2 cm and width 0.5 cm, i.e. x = (2, 0.5)^T:

  [d1]   [4.3  1.3  -10.1 ] [2  ]   [-0.85]
  [d2] = [1.5  0.3   -1.17] [0.5] = [ 1.98]
                            [1  ]

d2 > d1; therefore it is Iris setosa.

If a petal has length 5 cm and width 1.5 cm, i.e. x = (5, 1.5)^T:

  [d1]   [4.3  1.3  -10.1 ] [5  ]   [13.35]
  [d2] = [1.5  0.3   -1.17] [1.5] = [ 6.78]
                            [1  ]

d1 > d2; therefore it is Iris versicolor.

27

Decision boundary

- it is the line that separates two classes.

Property of the decision boundary in MDC

the decision boundary is the perpendicular bisector of the line joining the two class prototypes.

At the decision boundary (surface), d1(x) = d2(x)

d12(x) = d1(x) - d2(x) = 2.8x1 + 1.0x2 - 8.9 = 0

For Iris classification:

  x ∈ ω1 (versicolor) if d12(x) > 0
  x ∈ ω2 (setosa)     if d12(x) < 0

28

Example:

x = (2, 0.5)^T: d12((2, 0.5)^T) = -2.8 < 0  ->  ω2, Iris setosa
x = (5, 1.5)^T: d12((5, 1.5)^T) = 6.6 > 0  ->  ω1, Iris versicolor

Decision boundary can be used for recognition

Definition of decision function:

Let the decision function be a line or surface in the pattern space separating the two classes.

Recognition is achieved by testing the sign of the decision function when presented with a pattern vector.

29

Example:

Classify Iris versicolor and Iris setosa by petal length, x1, and petal width, x2.

30

Use of Probability Theory in Recognition

• Probability theory is widely used in statistical pattern recognition.

• In particular, Bayes probability theory is used in formulating the optimal classifier (recognizer).

• Feature distributions are used to form a priori probability distributions, represented as

  p(x | ωj)   for j = 1, 2, ..., M.

• The above is also known as the likelihood function of class ωj (it is assumed known a priori).

31

Optimal Statistical Classifier - The Bayes Classifier

Recognition problem: given x, find ωi.

Let the probability of deciding that x belongs to ωi be p(ωi | x) (given x, the probability that it belongs to ωi).

If x should belong to ωk but is wrongly classified to ωj, the conditional average risk or conditional average loss is defined as

  rj(x) = Σ_{k=1}^{M} Lkj p(ωk | x)

where Lkj is the loss incurred due to misclassification.

rj(x) = conditional average risk

32

• p(ωk | x) is a posterior probability and is generally unknown.

From Bayes' rule,

  p(ωk | x) = p(x | ωk) p(ωk) / p(x)

so

  rj(x) = (1/p(x)) Σ_{k=1}^{M} Lkj p(x | ωk) p(ωk)

Since 1/p(x) is independent of j (a common term), rj can be reduced to

  rj(x) = Σ_{k=1}^{M} Lkj p(x | ωk) p(ωk)

p(x | ωk): probability density function of the patterns in each class;
p(ωk): probability of occurrence of each class.

33

• p(x | ωj) and p(ωj) are both a priori probabilities that can be estimated from the training set.

• Recognition is achieved by

  x ∈ ωi if ri(x) = min_j { rj(x) },  for j = 1, 2, ..., M.

• The decision function for the Bayes classifier is

  dj(x) = -rj(x) = -Σ_{k=1}^{M} Lkj p(x | ωk) p(ωk)

  and recognition is then achieved by

  x ∈ ωi if di(x) = max_j { dj(x) },  for j = 1, 2, ..., M.

34

Example illustrating the concept of Bayes classifier

For a 2-class problem, show how a feature x coming from class ω2 can be correctly classified using the Bayes classifier. Assume that patterns from both classes are equally likely to be seen. For correct classification there is no penalty; for incorrect classification the penalty is Lij = 1, i ≠ j.

From the above, we know: M = 2, p(ω1) = p(ω2), and x ∈ ω2. Also, L11 = L22 = 0, L12 = L21 = 1.

  r1(x) = L11 p(x | ω1) p(ω1) + L21 p(x | ω2) p(ω2)
  r2(x) = L12 p(x | ω1) p(ω1) + L22 p(x | ω2) p(ω2)

35

Since p(ω1) = p(ω2) and L11 = L22 = 0, L12 = L21 = 1 (the common factor p(ω1) = p(ω2) can be dropped):

  r1(x) = p(x | ω2)      r2(x) = p(x | ω1)
  d1(x) = -p(x | ω2)     d2(x) = -p(x | ω1)

If x ∈ ω2:  p(x | ω2) > p(x | ω1)    ->  r1 > r2
If x ∈ ω2:  -p(x | ω2) < -p(x | ω1)  ->  d1 < d2

so x is assigned to ω2, i.e. correctly classified.

[Figure: the class-conditional densities p(x | ω1) and p(x | ω2) along the feature axis.]

36

Bayes classifier with the 0-1 loss function

• In the penalty approach: Lii = 0 for a correct decision (i = j); Lij = 1 for an incorrect decision (i ≠ j).

• In the reward approach: δii = 1 for a correct decision (i = j); δij = 0 for an incorrect decision (i ≠ j).

Let Lij = 1 - δij, with δij = 1 when i = j and δij = 0 when i ≠ j. Substituting into the decision function from three slides back (penalty approach):

  dj(x) = -Σ_{k=1}^{M} (1 - δkj) p(x | ωk) p(ωk)

37

The summation can be separated into two parts:

  dj(x) = -Σ_{k=1}^{M} p(x | ωk) p(ωk) + Σ_{k=1}^{M} δkj p(x | ωk) p(ωk)

But

  p(x) = Σ_{k=1}^{M} p(x | ωk) p(ωk)

so

  dj(x) = -p(x) + Σ_{k=1}^{M} δkj p(x | ωk) p(ωk)

Since -p(x) is constant (independent of j) and δkj = 0 except when k = j,

  dj(x) = p(x | ωj) p(ωj) ,   j = 1, 2, ..., M

and x ∈ ωi if di(x) = max_j { dj(x) },  j = 1, 2, ..., M.
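A minimal sketch of the resulting 0-1-loss rule dj(x) = p(x | ωj) p(ωj): evaluate the class-conditional density times the prior for each class and take the maximum. The two 1-D Gaussian classes and the priors below are arbitrary illustration values, not from the slides.

```python
import numpy as np

def gaussian_pdf(x, m, sigma):
    """Univariate Gaussian class-conditional density with mean m and std sigma."""
    return np.exp(-0.5 * ((x - m) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative 2-class problem (means, stds and priors are made-up numbers).
means, sigmas = [0.0, 3.0], [1.0, 1.5]
priors = [0.4, 0.6]

def bayes_decide(x):
    # 0-1 loss: d_j(x) = p(x | w_j) * p(w_j); choose the maximum.
    d = [gaussian_pdf(x, m, s) * P for m, s, P in zip(means, sigmas, priors)]
    return int(np.argmax(d)) + 1, d     # 1-based class index, decision values

print(bayes_decide(0.5))   # near the first mean  -> class 1
print(bayes_decide(2.5))   # near the second mean -> class 2
```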

38

A Bayes classifier with 0-1 loss (block diagram): the input x goes to M evaluation blocks that compute p(x | ωj), j = 1, 2, ..., M; each output is multiplied by the corresponding prior p(ωj); a maximum selector then produces the decision.

39

Bayes classifier with 0-1 loss for Gaussian patterns

A pattern class with a Gaussian distribution of its feature values can be represented by the Gaussian probability density function

  p(x) = (1 / (√(2π) σ)) exp( -(1/2) ((x - m)/σ)² )

where

  m = E{x}
  σ² = E{(x - m)²}

and E{·} denotes the expected value.

40

A 2-class example of Gaussian patterns

Suppose only one feature, x, is used to classify two pattern classes. The Bayes decision function is

  dj(x) = p(x | ωj) p(ωj) ,   j = 1, 2

Therefore,

  dj(x) = (1 / (√(2π) σj)) exp( -(x - mj)² / (2σj²) ) p(ωj) ,   j = 1, 2

Therefore,

41

At the decision boundary x0, d1(x0) = d2(x0).

If p(2) = p(1), p(x0 | 1) = p(x0 | 2).

If p(1) > p(2), x0 will be shifted to the left.

If p(1) < p(2), x0 will be shifted to the right.

42

For n-dimensional feature vectors, the Gaussian distribution is represented by

  p(x | ωj) = (1 / ((2π)^(n/2) |Cj|^(1/2))) exp( -(1/2) (x - mj)^T Cj⁻¹ (x - mj) )

where mj is the mean vector of class ωj, Cj is the covariance matrix of class ωj, |Cj| is the determinant of Cj, and n is the dimension of the feature vector:

  mj = Ej{ x }
  Cj = Ej{ (x - mj)(x - mj)^T }

Approximating the expected value E{·} by the average over the Nj samples of class ωj:

  mj = (1/Nj) Σ_{x ∈ ωj} x

  Cj = (1/Nj) Σ_{x ∈ ωj} x x^T - mj mj^T

43

The Bayes decision function for Gaussian patterns is:

  dj(x) = p(x | ωj) p(ωj) = (1 / ((2π)^(n/2) |Cj|^(1/2))) exp( -(1/2) (x - mj)^T Cj⁻¹ (x - mj) ) p(ωj)

Because the logarithm is monotonic, it is convenient to work with

  d'j(x) = ln dj(x) = -(n/2) ln 2π - (1/2) ln |Cj| - (1/2) (x - mj)^T Cj⁻¹ (x - mj) + ln p(ωj)

Removing the constant term -(n/2) ln 2π (common to all classes), d'j(x) can be redefined as

  d'j(x) = ln p(ωj) - (1/2) ln |Cj| - (1/2) (x - mj)^T Cj⁻¹ (x - mj)

This is the general form of the Bayes classifier for Gaussian pattern classes with the 0-1 loss function.

44

The covariance matrix Cj

Cj is a symmetric square matrix of size n by n:

  Cj = E{ (x - mj)(x - mj)^T }
     = E{ [x1-m1, x2-m2, ..., xn-mn]^T [x1-m1, x2-m2, ..., xn-mn] }

45

Written out element by element,

  Cj = [ E{(x1-m1)(x1-m1)}  E{(x1-m1)(x2-m2)}  ...  E{(x1-m1)(xn-mn)}
         E{(x2-m2)(x1-m1)}  E{(x2-m2)(x2-m2)}  ...        ...
               ...                ...          ...        ...
         E{(xn-mn)(x1-m1)}        ...          ...  E{(xn-mn)(xn-mn)} ]

     = [ c11  c12  ...  c1n
         c21  c22  ...  ...
         ...  ...  ...  ...
         cn1  ...  ...  cnn ]

• cii is the variance of the i-th feature.
• cij (i ≠ j) is the covariance between the i-th and j-th features.

46

• The covariance matrix can be estimated by expanding the original definition and simplifying, using averages in place of expectations:

  Cj = E{ (x - mj)(x - mj)^T } ≈ (1/Nj) Σ_{x ∈ ωj} x x^T - mj mj^T

• For features that are statistically independent (uncorrelated), Cj is a diagonal matrix (cij = 0 for i ≠ j).

If all covariance matrices are equal to the identity matrix and P(ωj) = 1/M for all j = 1, 2, ..., M, then

  dj(x) = x^T mj - (1/2) mj^T mj   = MDC (Minimum Distance Classifier)

Thus, the MDC is optimal in the Bayes sense if:
1. the pattern classes are Gaussian,
2. all covariance matrices are equal to the identity matrix,
3. all classes are equally likely to occur.
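A small sketch of these sample estimates (the data below is random, purely for illustration), which also checks the identity Cj = (1/Nj) Σ x xᵀ - mj mjᵀ against the direct definition E{(x - m)(x - m)ᵀ}:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[2.0, -1.0], scale=1.0, size=(500, 2))   # samples of one class

m = X.mean(axis=0)                          # m_j = (1/N_j) * sum of x
C = X.T @ X / len(X) - np.outer(m, m)       # C_j = (1/N_j) * sum of x x^T - m_j m_j^T

# The same quantity computed directly from the definition E{(x - m)(x - m)^T}.
C_direct = (X - m).T @ (X - m) / len(X)
print(np.allclose(C, C_direct))             # True
print(m.round(2))
```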

47

Special cases of the covariance matrix (2-D examples):

(a) assume m = 0, C = I;

(b) assume m = [1, 0]^T, C = [ c11  0
                               0   c22 ]   with c22 > c11.

[Figures: the corresponding distributions sketched in the (x1, x2) plane, centred at (0,0) and (1,0) respectively.]

48

(c) assume m = [0, 1]^T, C = [ c11  c12
                               c21  c22 ]   (a full covariance matrix).

[Figure: the corresponding distribution sketched in the (x1, x2) plane, centred at (0,1).]

Random Vector Functional Link Network (RVFL)

EE6222

Outline
• Introduction
  - SLFN: Single Layer Feedforward Network
  - From error back-propagation (back-prop) to randomized learning
• Notations of RVFL
• Learning process in RVFL
• Theoretical part of RVFL
• Classification ability of RVFL
• Property of RVFL variants


Structure of an SLFN (Single Layer Feedforward Network)

[Figure: SLFN structure (input layer, one hidden layer, output layer).]

Details
• Bias: x × β + b = [x, 1] × β with the bias b appended to β, i.e. the input features are extended by an additional feature which equals 1.
• For C classes, we can have C 0-1 output neurons. When we present an input observation belonging to the c-th class, we expect the c-th neuron to give an output near one while all the others give outputs near zero. Circles in the hidden and output layers have non-linear activation functions; the inputs to the activation functions are weighted sums.

Learning with neural network

We learn the mapping ti = f(xi), where i is the index of the data. The pipeline can be formalized as follows:

• Input: the input consists of a set of N data pairs [xi, yi], where xi are the features and yi is the actual target given with the dataset. We refer to this data as the training set, used to learn the network parameters, etc.
• Learning: use the training set to learn the network structure and parameters. We refer to this step as training a classifier.
• Classification: in the end, we evaluate the quality of the classifier by asking it to predict the target t for input data that it has never seen before. We then compare the true target y of the data to the t predicted by the model. The prediction t is selected by a max operator among the C outputs.

Approximation ability of neural network

A theoretical proof of the approximation ability of the standard multilayer feedforward network can be found in [1], under the following conditions:

• arbitrary squashing functions;
• sufficiently many hidden units are available.

Then the standard multilayer feedforward network can approximate virtually any function to any desired degree of accuracy.

[1] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators". Neural Networks 2.5 (1989), pp. 359-366.

Why randomized neural network

Some weaknesses of back-prop

• The error surface can have multiple local minima.
• Slow convergence, due to iteratively adjusting the weights on every link. These weights are first randomly generated.
• Sensitivity to the learning-rate setting.

Alternatives to back-prop

• In [2], the authors show that the parameters in the hidden layers can be randomly (and properly) generated without learning.
• In [3], the parameters in the enhancement nodes can be randomly generated within a proper range.

[2] Wouter F. Schmidt, Martin Kraaijveld, Robert P. W. Duin, et al. "Feedforward neural networks with random weights". In: 1992 11th IAPR International Conference on Pattern Recognition. IEEE, 1992, pp. 1-4.
[3] Yoh-Han Pao, Gwang-Hoon Park, and Dejan J. Sobajic. "Learning and generalization characteristics of the random vector functional-link net". Neurocomputing 6.2 (1994), pp. 163-180.


Structure of RVFL [4]

[Figure: RVFL network structure, with enhancement (hidden) nodes and direct links from the input layer to the output layer.]

Details
• Parameters β (indicated in blue) in the enhancement nodes are randomly generated in a proper range and kept fixed.
• The original features (from the input layer) are concatenated with the enhanced features (from the hidden layer); the result is referred to as d.
• Learning aims at solving ti = di ∗ W, i = 1, 2, ..., N, where N is the number of training samples, W (indicated in red and gray) are the weights in the output layer, and ti, di represent the output values and the combined features, respectively.

[4] Yoh-Han Pao, Gwang-Hoon Park, and Dejan J. Sobajic. "Learning and generalization characteristics of the random vector functional-link net". Neurocomputing 6.2 (1994), pp. 163-180.


Details of RVFL: Notations

• X = [x1, x2, ..., xN]′: input data (N samples and m features).
• β = [β1, β2, ..., βm]′: the weights for the enhancement nodes (m × k, where k is the number of enhancement nodes).
• b = [b1, b2, ..., bk]: the biases for the enhancement nodes.
• H = h(X ∗ β + ones(N, 1) ∗ b): the feature matrix after the enhancement nodes, where h is the activation function. There is no activation in the output nodes.

Commonly used activation functions in the hidden layer (s is the input, y is the output):

  sigmoid:  y = 1 / (1 + e^(-s))
  sine:     y = sin(s)
  hardlim:  y = (sign(s) + 1) / 2
  tribas:   y = max(1 - |s|, 0)
  radbas:   y = exp(-s²)
  sign:     y = sign(s)
  relu:     y = max(0, s)
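These map one-to-one onto NumPy expressions; a small sketch (NumPy's convention sign(0) = 0 is used for the sign and hardlim entries):

```python
import numpy as np

activations = {
    "sigmoid": lambda s: 1.0 / (1.0 + np.exp(-s)),
    "sine":    np.sin,
    "hardlim": lambda s: (np.sign(s) + 1.0) / 2.0,
    "tribas":  lambda s: np.maximum(1.0 - np.abs(s), 0.0),
    "radbas":  lambda s: np.exp(-s ** 2),
    "sign":    np.sign,
    "relu":    lambda s: np.maximum(0.0, s),
}

s = np.linspace(-2, 2, 5)
for name, h in activations.items():
    print(f"{name:8s}", np.round(h(s), 3))
```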

Details of RVFL: Notations (continued)

• H = [h(x1); h(x2); ...; h(xN)] collects the enhancement-node outputs:

  H = [ h1(x1)  ...  hk(x1)
        h1(x2)  ...  hk(x2)
          ...   ...    ...
        h1(xN)  ...  hk(xN) ]

• D = [H, X] = [d(x1); d(x2); ...; d(xN)] concatenates the enhanced features with the original features:

  D = [ h1(x1)  ...  hk(x1)   x11  ...  x1m
        h1(x2)  ...  hk(x2)   x21  ...  x2m
          ...   ...    ...    ...  ...  ...
        h1(xN)  ...  hk(xN)   xN1  ...  xNm ]

• The target y can be the class labels for a C-class classification problem (1, 2, ..., C).

• For classification, y is usually transformed into a 0-1 coding of the class label. For example, if we have 3 classes,

  y = [1; 2; 3]   ->   [ 1 0 0
                         0 1 0
                         0 0 1 ]

  i.e. we have 3 output neurons; the 1st neuron goes to 1 for the 1st class while the others stay at zero, and so on.
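A minimal sketch of this 0-1 coding of the labels (assuming labels 1, ..., C), just an identity-matrix style lookup:

```python
import numpy as np

def one_hot(y, num_classes):
    """Turn labels 1..C into 0-1 target rows (one output neuron per class)."""
    Y = np.zeros((len(y), num_classes))
    Y[np.arange(len(y)), np.asarray(y) - 1] = 1.0
    return Y

print(one_hot([1, 2, 3, 2], 3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```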


Learning

Define the problem: once the random parameters β and b are generated in a proper range, the learning of RVFL aims at solving the following problem:

  ti = di ∗ W ,   i = 1, 2, ..., N

where N is the number of training samples, W are the weights in the output layer, and ti, di represent the output and the combined features.

Solutions

• Closed form:
  - In primal space:  W = (λI + DᵀD)⁻¹ DᵀY    (1)
  - In dual space:    W = Dᵀ (λI + DDᵀ)⁻¹ Y    (2)
  where λ is the regularization parameter; when λ → 0, the method converges to the Moore-Penrose pseudoinverse solution. Y = [y1, y2, ..., yN]′.

• Iterative methods: tune the weights based on the gradient of the error function at each step. (In the assignment, you will use the equations above.)
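A compact sketch putting the pieces together: random enhancement weights, the [H, X] concatenation, and the primal closed-form solution (1). The uniform random range, the sigmoid activation, and the values of k and λ are illustrative choices, not values prescribed by the slides.

```python
import numpy as np

def rvfl_train(X, Y, k=100, lam=1e-2, seed=0):
    """Fit an RVFL: random hidden weights, closed-form output weights, Eq. (1)."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    beta = rng.uniform(-1.0, 1.0, size=(m, k))   # random, then kept fixed
    b = rng.uniform(-1.0, 1.0, size=k)
    H = 1.0 / (1.0 + np.exp(-(X @ beta + b)))    # sigmoid enhancement nodes
    D = np.hstack([H, X])                        # direct links: D = [H, X]
    W = np.linalg.solve(lam * np.eye(D.shape[1]) + D.T @ D, D.T @ Y)
    return beta, b, W

def rvfl_predict(X, beta, b, W):
    H = 1.0 / (1.0 + np.exp(-(X @ beta + b)))
    return np.hstack([H, X]) @ W                 # class = argmax over the C outputs

# Toy usage: two Gaussian blobs and 0-1 coded targets (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
Y = np.zeros((100, 2)); Y[:50, 0] = 1; Y[50:, 1] = 1
beta, b, W = rvfl_train(X, Y, k=50)
pred = rvfl_predict(X, beta, b, W).argmax(axis=1)
print((pred == Y.argmax(axis=1)).mean())         # training accuracy, close to 1.0
```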


Approximation ability of RVFL

Theoretical justification for RVFL is presented in [5].

Method

• It formulates a limit-integral representation of the function to be approximated.
• It evaluates the integral with the Monte Carlo method.

Conclusion

• RVFL is a universal approximator for continuous functions on bounded finite-dimensional sets.
• RVFL is an efficient universal approximator, with the approximation error converging to zero at rate O(C/√n), where n is the number of basis functions and C is independent of n.

[5] Boris Igelnik and Yoh-Han Pao. "Stochastic choice of basis functions in adaptive function approximation and the functional-link net". IEEE Transactions on Neural Networks 6.6 (1995), pp. 1320-1329.


Classification ability of RVFL

Evaluation: the following issues can be investigated using the 121 UCI datasets, as done by [6]:

• Effect of direct links from the input layer to the output layer.
• Effect of the bias in the output neuron.
• Effect of scaling the random features before feeding them into the activation function.
• Performance of 6 commonly used activation functions: sigmoid, radbas, sine, sign, hardlim, tribas.
• Performance of the Moore-Penrose pseudoinverse and of ridge regression (regularized least-squares solutions) for the computation of the output weights.

[6] Manuel Fernandez-Delgado et al. "Do we need hundreds of classifiers to solve real world classification problems?" The Journal of Machine Learning Research 15.1 (2014), pp. 3133-3181.

67

k-Nearest Neighbour

• We are concerned with the following problem: we wish to label some observed pattern x with some class category ω.
• Two possible situations with respect to x and ω may occur:
  - We may have complete statistical knowledge of the distribution of the observation x and category ω. In this case, a standard Bayes analysis yields an optimal decision procedure.
  - We may have no knowledge of the distribution of the observation x and category ω aside from that provided by pre-classified samples. In this case, a decision to classify x into category ω will depend only on a collection of correctly classified samples.
• The nearest neighbour rule is concerned with the latter case. Such problems fall in the domain of non-parametric statistics. No optimal classification procedure exists with respect to all underlying statistics under such conditions.

68

The NN Rule

• Recall that the only knowledge we have is a set of points xi correctly classified into categories ωi: (xi, ωi). By intuition, it is reasonable to assume that observations that are close together -- for some appropriate metric -- will have the same classification.
• Thus, when classifying an unknown sample x, it seems appropriate to weight the evidence of the nearby xi's heavily.
• One simple non-parametric decision procedure of this form is the nearest neighbour rule or NN-rule. This rule classifies x in the category of its nearest neighbour. More precisely, we call xm ∈ {x1, x2, x3, ..., xn} a nearest neighbour to x if d(xm, x) = min_i d(xi, x), where i = 1, 2, ..., n.
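The rule translates directly into a few lines; a sketch that also handles k > 1 by majority vote, since the section heading is k-NN (the training points and labels below are made up purely for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=1):
    """Classify x by a majority vote among its k nearest training samples.

    k = 1 gives exactly the NN rule described above.
    """
    d = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances d(x_i, x)
    nearest = np.argsort(d)[:k]
    vote = Counter(y_train[i] for i in nearest)
    return vote.most_common(1)[0][0]

# Tiny illustrative training set.
X_train = np.array([[0.0, 0.0], [1.0, 0.5], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(np.array([0.4, 0.2]), X_train, y_train, k=1))   # -> A
print(knn_classify(np.array([5.4, 5.2]), X_train, y_train, k=3))   # -> B
```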

[Slides 69-72: figure-only slides; no text recoverable from the transcript.]

73

Fisher’s Linear Discriminant Analysis

[Slides 74-87: derivation of Fisher's linear discriminant, given mostly as figures and equations that are not recoverable from the transcript. One recoverable formula is the class mean m_i = (1/n_i) Σ_{x ∈ H_i} x, where H_i is the set of samples from class i.]

Also refer to the book by Robert Schalkoff, "Pattern Recognition …", from page 89 onwards, for more details on LDA (library call number Q327.S297).

88

Supervised Training Vs Unsupervised Training

Supervised training
• A supervisor teaches the system how to classify a known set of patterns with class labels before using it.
• Needs a priori class label information.

Unsupervised training
• Does not depend on a priori class label information.
• Used when a priori class label information is not available or proper training sets are not available.

89

Unsupervised training by clustering

Seek regions such that every pattern falls into one of these regions and no pattern falls in two or more regions, i.e.

  S = S1 ∪ S2 ∪ S3 ∪ ... ∪ SK ,   Si ∩ Sj = ∅ for i ≠ j

• Algorithms classify objects into clusters by natural association according to some similarity measure, e.g. Euclidean distance.
• Clustering algorithms are based either on heuristics or on optimization principles.

90

Clustering with a known number of classes

Requires knowledge of the number of classes, e.g. K classes (or clusters). The K-means algorithm is based on the minimization of the sum of squared distances

  Jj = Σ_{x ∈ Sj(k)} ||x - zj||² ,   j = 1, 2, ..., K

where Sj(k) is the cluster domain for cluster centre zj at the k-th iteration.

91

K-Means Algorithm

1. Arbitrarily choose K samples as the initial cluster centres z1(k), z2(k), z3(k), ..., zK(k); iteration k = 1.

2. At the k-th iterative step, distribute the pattern samples x among the K clusters according to the following rule:

   x ∈ Sj(k)  if  ||x - zj(k)|| < ||x - zi(k)||   for all i = 1, 2, ..., K, i ≠ j.

92

3. Compute zj(k+1) and Jj for j = 1, 2, ..., K:

   zj(k+1) = (1/Nj) Σ_{x ∈ Sj(k)} x

   Jj(k+1) = Σ_{x ∈ Sj(k)} ||x - zj(k+1)||²

   where Nj is the number of samples in Sj(k).

4. If zj(k+1) = zj(k) for j = 1, 2, ..., K, stop. Otherwise go to step 2.

93

Advantages :

• simple and efficient.

• minimum computations are required

Drawbacks:
• K must be decided in advance.
• Individual clusters themselves should be tight, and clusters should be widely separated from one another.
• Results usually depend on the order of presentation of the training samples and on the initial cluster centres.

94

Example: Apply the K-means algorithm to the following data for K = 2:

x1 = (0,0), x2 = (0,1), x3 = (1,0), x4 = (1,1), x5 = (2,1), x6 = (1,2), x7 = (2,2), x8 = (3,2), x9 = (6,6), x10 = (7,6), x11 = (8,6), x12 = (6,7), x13 = (7,7), x14 = (8,7), x15 = (9,7), x16 = (7,8), x17 = (8,8), x18 = (9,8), x19 = (8,9), x20 = (9,9)

95

[Figure: scatter plot of the 20 data points in the (x1, x2) plane, axes running from 0 to 10.]

Homework – make use of this dataset with class labels to try out Fisher's Discriminant method.

96

Example

Iteration 1:

Step 1. Let K = 2, z1(1) = x1 = (0,0)^T, z2(1) = x2 = (0,1)^T. (These starting points are selected randomly.)

Step 2. Calculate the distances of all patterns from z1 and z2. The patterns are assigned to S1(1) and S2(1) as

  S1(1) = {x1, x3} ;  S2(1) = {x2, x4, x5, ..., x20}

97

Step 3. Update the cluster centres:

  z1(2) = (0.5, 0.0)^T ;  z2(2) = (5.61, 5.39)^T

Step 4. zj(1) and zj(2) are different, so go back to step 2.

Iteration 2:

Step 2. Calculate the distances of all patterns from z1 and z2. The patterns are assigned to S1(2) and S2(2) as

  S1(2) = {x1, x2, ..., x8} ;  S2(2) = {x9, x10, ..., x20}

98

Step 3. Update the cluster centres:

  z1(3) = (1.25, 1.13)^T ;  z2(3) = (7.67, 7.33)^T

Step 4. zj(3) and zj(2) are different, so go to step 2.

Iteration 3:

Step 2. Calculate the distances of all patterns from z1 and z2. No change in S1 and S2.

99

Step 3. Update the cluster centres:

  z1(4) = (1.25, 1.13)^T ;  z2(4) = (7.67, 7.33)^T

Step 4. zj(4) and zj(3) are the same, so stop.

Final centres:  z1 = (1.25, 1.13)^T ;  z2 = (7.67, 7.33)^T
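A sketch of the algorithm applied to this dataset with the same starting centres z1(1) = x1 and z2(1) = x2; it terminates at the final centres reported above (up to rounding). This is an illustrative implementation of steps 1-4, not the only way to code them.

```python
import numpy as np

def k_means(X, centres, max_iter=100):
    """Plain K-means (steps 1-4 above): assign to the nearest centre, recompute, repeat."""
    z = np.asarray(centres, dtype=float).copy()
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - z[None, :, :], axis=2)   # N x K distances
        labels = d.argmin(axis=1)                                   # step 2
        new_z = np.array([X[labels == j].mean(axis=0) for j in range(len(z))])  # step 3
        if np.allclose(new_z, z):                                   # step 4
            break
        z = new_z
    return z, labels

X = np.array([(0,0),(0,1),(1,0),(1,1),(2,1),(1,2),(2,2),(3,2),
              (6,6),(7,6),(8,6),(6,7),(7,7),(8,7),(9,7),(7,8),
              (8,8),(9,8),(8,9),(9,9)], dtype=float)

z, labels = k_means(X, centres=[[0.0, 0.0], [0.0, 1.0]])   # z1(1) = x1, z2(1) = x2
print(np.round(z, 3))   # ~[[1.25 1.125] [7.667 7.333]], matching the final centres above
```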

100

Batchelor and Wilkins' algorithm

This algorithm forms clusters without requiring the specification of the number of clusters, K. Instead, it needs one parameter to be specified, namely the proportion p.

The algorithm:

1. Arbitrarily take the first sample as the first cluster centre, i.e. z1 = x1.

2. Choose the pattern sample farthest from z1 to be z2.

101

3. Compute distance from each remaining pattern sample to z1 and z2.

4. Save the minimum distance for each pair of these computations.

5. Select the maximum of these minimum distances.

6. If this maximum distance is greater than a fraction p of d(z1, z2), assign the corresponding sample to a new cluster centre z3. Otherwise, stop.

102

7. Compute distance from each remaining pattern sample to each of the 3 clusters and save the minimum distance of every group of 3 distances.

8. Then select the maximum of these minimum distances.

9. If this maximum distance is greater than a fraction p of the typical previous maximum distance, assign the corresponding sample to a new cluster centre z4. Otherwise, stop.

103

10. Repeat steps 7 to 9 until no new cluster is formed.

11. Assign each sample to its nearest cluster.

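A sketch of the procedure above. The threshold in steps 6 and 9 is interpreted here as p times the average pairwise distance between the current centres (which equals d(z1, z2) when there are two centres); the slides leave the exact "typical previous maximum distance" open, so this is only one reasonable reading. With p = 0.5, this interpretation reproduces the grouping given in the example that follows.

```python
import numpy as np

def batchelor_wilkins(X, p=0.5):
    """Maximin cluster-centre selection (steps 1-11 above), one interpretation."""
    X = np.asarray(X, dtype=float)
    dist = lambda a, b: np.linalg.norm(a - b)
    centres = [X[0]]                                                  # step 1: z1 = x1
    centres.append(X[np.argmax([dist(x, centres[0]) for x in X])])    # step 2

    while True:                                                       # steps 3-10
        d_min = np.array([min(dist(x, c) for c in centres) for x in X])
        cand = int(np.argmax(d_min))                                  # steps 4-5 / 8
        pairwise = [dist(a, b) for i, a in enumerate(centres) for b in centres[i + 1:]]
        if d_min[cand] > p * np.mean(pairwise):                       # steps 6 / 9
            centres.append(X[cand])
        else:
            break

    labels = np.array([np.argmin([dist(x, c) for c in centres]) for x in X])  # step 11
    return np.array(centres), labels

X = np.array([(0,0),(3,8),(2,2),(1,1),(5,3),(4,8),(6,3),(5,4),(6,4),(7,5)])
z, labels = batchelor_wilkins(X, p=0.5)
print(z)        # centres chosen: x1 = (0,0), x6 = (4,8), x7 = (6,3)
print(labels)   # [0 1 0 0 2 1 2 2 2 2] -> S1={x1,x3,x4}, S2={x2,x6}, S3={x5,x7,...,x10}
```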

104

Example: Apply Batchelor and Wilkins' algorithm to find the cluster groups for the following samples:

x1 = (0,0), x2 = (3,8), x3 = (2,2), x4 = (1,1),

x5 = (5,3), x6 = (4,8), x7 = (6,3), x8 = (5,4),

x9 = (6,4), x10 = (7,5)

By using the algorithm, the result will be

S1 = {x1, x3, x4}; S2 = {x2, x6}

S3 = {x5, x7, x8, x9, x10};

105

[Figure: scatter plot of the 10 samples in the (x1, x2) plane.]

Q: Apply Batchelor and Wilkins' algorithm to the above training data. Select the proportion parameter as 0.5. Start with x1 = (0,0).

Q : What effect do you think the proportion parameter has on the final clustering ?

106

The advantages of this heuristic algorithm are as follows:

i) It determines K,

ii) it self-organises, and

iii) it is efficient for small-to-medium numbers of classes.

107

The disadvantages are :

i) the proportion p must be given,

ii) the ordering of the samples influences the clustering,

iii) the value of p influences the clustering,

iv) the separation is linear,

v) the cluster centers are not recomputed to better represent the classes,

vi) there is no measure of how good the clustering is,

vii) an outlier can upset the whole process.

108

In practice, it is suggested that the process be run for several values of p and the mean-squared error of all distances of the samples from their center be computed for each kth cluster (k = 1, 2, …, K) and then summed over all classes k to obtain the total mean-squared error. The value of p that provides the lowest total mean-square error is an optimal one. The final clusters should be averaged to obtain a new center and all sample vectors reassigned via minimum distance to the new centers.
