Pattern Recognition Syllabus 2013



    Topics in Pattern Recognition, 2nd Semester 2013

    1. Lecturer

    Soo-Hyung Kim, Prof., Dept. of ECE, Chonnam National University
    E-mail: [email protected]
    Homepage: http://pr.jnu.ac.kr/shkim
    (office) 062-530-3430  (mobile) 010-2687-3430

    2. Textbooks

    (1) S. Theodoridis, A. Pikrakis, K. Koutroumbas, and D. Cavouras, Introduction to Pattern Recognition - A MATLAB Approach, Academic Press, 2010. (A Korean translation was published in 2013.)

    (2) S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed., Academic Press, 2009.

    3. Course Schedule

    - Introduction to Pattern Recognition
    - Introduction to MATLAB
    - Chapter 1: Classifiers Based on Bayes Decision Theory
    - Chapter 2: Classifiers Based on Cost Function Optimization
    - Chapter 3: Feature Generation and Dimensionality Reduction
    - Chapter 4: Feature Selection
    - Chapter 5: Template Matching
    - Chapter 6: Hidden Markov Models
    - Chapter 7: Clustering

    Chapter   #-pages   #-sections   #-examples   #-exercises
    1         28        10           13           13
    2         50        8            12           5
    3         28        6            8            3
    4         30        8            13           7
    5         10        4            4            1
    6         12        3            5            2
    7         50        7            12           13
    Total     208       46           67           44


    4. Grading Criteria

    - Presentation quality for the assigned topic
    - Homework: exercises from every chapter
    - Final exam at the end of the semester
    - Term project (optional): only for students who do not give a presentation in class
    - Class attendance

    5. e-Class (via CNU portal)

    - All class materials can be downloaded from the e-Class
    - Homework should be uploaded to the e-Class
    - All announcements will be posted on the e-Class

    6. Topic Assignment

    Chapter   Contents                                      Speaker           No.
    -         Introduction to PR                            S.H. Kim, Prof.   1
    -         Introduction to MATLAB                        S.H. Kim, Prof.   2
    1         1.1 ~ 1.5 (parametric methods)                                  3
              1.6 ~ 1.10 (nonparametric methods)                              4
              Exercises (Homework)                                            5
    2         2.1 ~ 2.3 (least error methods)                                 6
              2.4 ~ 2.5 (support vector machine)                              7
              2.7 ~ 2.8 (AdaBoost & MLP)                                      8
              Exercises (Homework)                                            9
    3         3.1 ~ 3.3 (PCA & SVD)                                           10
              3.4 ~ 3.5 (Fisher's LDA & kernel PCA)                           11
              Exercises (Homework)                                            12
    4         4.1 ~ 4.6 (data normalization)                                  13
              4.7 ~ 4.8 (feature selection)                                   14
              Exercises (Homework)                                            15
    5         5.1 ~ 5.3 (matching sequences) & Exercises                      16
    6         6.1 ~ 6.3 (HMM)                                                 17
              Exercises (Homework)                                            18
    7         7.1 ~ 7.4 (sequential clustering)                               19
              7.5 (cost optimization clustering)                              20
              7.7 (hierarchical clustering)                                   21
              Exercises (Homework)                                            22


    CHAPTER 1 Classifiers Based on Bayes Decision Theory 1

    1.1 Introduction 1
    1.2 Bayes Decision Theory 1
    1.3 Gaussian Probability Density Function 2

    Example 1.3.1: Compute the value of a Gaussian PDF
    Example 1.3.2: Given two PDFs for w1 & w2, classify a pattern x
    (Ex 1.3.1) Repeat Example 1.3.2 with different prior probabilities
    Example 1.3.3: Generate N=500 samples according to given parameters
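    For orientation only (this is not the textbook's companion code), a minimal MATLAB sketch of Examples 1.3.1-1.3.2: evaluating a multivariate Gaussian PDF and making a Bayes decision between two classes. The means, covariances, priors, and test pattern below are made-up placeholders.

        % Evaluate a d-dimensional Gaussian PDF at x (no toolbox required)
        gauss_pdf = @(x, m, S) exp(-0.5*(x-m)'*(S\(x-m))) / sqrt((2*pi)^length(m)*det(S));

        % Hypothetical parameters for two classes w1 and w2
        m1 = [0; 0];  S1 = eye(2);  P1 = 0.5;   % class w1
        m2 = [2; 2];  S2 = eye(2);  P2 = 0.5;   % class w2

        x = [1.0; 1.2];                          % pattern to classify
        p1 = P1 * gauss_pdf(x, m1, S1);          % unnormalized posterior of w1
        p2 = P2 * gauss_pdf(x, m2, S2);          % unnormalized posterior of w2

        if p1 > p2, fprintf('x assigned to w1\n'); else fprintf('x assigned to w2\n'); end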

    1.4 Minimum Distance Classifiers 6
    1.4.1 Euclidean Distance Classifier
    1.4.2 Mahalanobis Distance Classifier

    Example 1.4.1: Given two PDFs for w1 & w2 in 3D, classify a pattern x by a Euclidean distance classifier and Mahalanobis distance classifier

    1.4.3 ML Parameter Estimation of Gaussian PDFs
    Example 1.4.2: Generate N=50 samples according to given m & S, and compute the ML estimates of m & S using the given samples
    (Ex 1.4.1) Repeat Example 1.4.2 with N=500 and N=5000
    Example 1.4.3: Generate 1,000 train- & test-samples for 3 classes, compute the ML estimates of m & S for each class, and classify a pattern x
    (Ex 1.4.2) Repeat Example 1.4.3 using a different covariance S
    (Ex 1.4.3) Repeat Example 1.4.3 with different prior probabilities
    (Ex 1.4.4) Repeat Example 1.4.3 with different covariances S1, S2, and S3
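    A minimal sketch in the spirit of Examples 1.4.1-1.4.2 (illustrative only; the class parameters, test pattern, and sample size are placeholders): Euclidean versus Mahalanobis minimum-distance decisions, followed by ML estimation of a Gaussian mean and covariance from generated samples.

        % Two class means in 3D and a common covariance (placeholder values)
        m1 = [0; 0; 0];  m2 = [3; 3; 3];
        S  = [1.2 0.4 0.0; 0.4 1.8 0.1; 0.0 0.1 1.0];
        x  = [1.0; 2.2; 0.5];                    % pattern to classify

        % Euclidean distance classifier: assign x to the nearest mean
        dE = [norm(x - m1), norm(x - m2)];
        [~, cE] = min(dE);

        % Mahalanobis distance classifier: account for the covariance S
        dM = [sqrt((x-m1)'*(S\(x-m1))), sqrt((x-m2)'*(S\(x-m2)))];
        [~, cM] = min(dM);
        fprintf('Euclidean -> class %d, Mahalanobis -> class %d\n', cE, cM);

        % ML estimation of m and S from N generated samples (as in Example 1.4.2)
        N = 50;
        X = repmat(m1, 1, N) + chol(S)' * randn(3, N);    % N samples from N(m1, S)
        m_hat = mean(X, 2);                               % ML estimate of the mean
        S_hat = (X - repmat(m_hat, 1, N)) * (X - repmat(m_hat, 1, N))' / N;  % ML covariance (divide by N)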

    1.5 Mixture Models 11
    Example 1.5.1: Generate and plot 500 points from a Gaussian mixture, with different mixing probabilities

    1.6 The Expectation-Maximization Algorithm 13
    Example 1.6.1: Generate 500 points from a Gaussian mixture, and estimate the parameters using the EM algorithm with different initializations
    Example 1.6.2: Given two-class samples, 500 for each, estimate GMs for w1 and w2 respectively, and test the classification accuracy for a Bayesian classifier using another 1,000 samples
    (Ex 1.6.1) Repeat Example 1.6.2 with different initial parameters
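    The examples above presumably use the textbook's MATLAB routines; purely as an illustrative sketch of the underlying updates (1-D, two components, placeholder data and initial values), the E- and M-steps of EM for a Gaussian mixture look like this:

        % Generate data from a two-component 1-D Gaussian mixture (placeholder parameters)
        N = 500;
        z = rand(N,1) < 0.4;                         % latent component labels
        x = z .* (1 + 0.5*randn(N,1)) + ~z .* (4 + 1.0*randn(N,1));

        % Initial guesses for the mixture parameters
        p = [0.5 0.5];  mu = [0 5];  s2 = [1 1];

        for it = 1:100
            % E-step: responsibility of each component for each sample
            g = zeros(N, 2);
            for j = 1:2
                g(:,j) = p(j) * exp(-(x - mu(j)).^2 / (2*s2(j))) / sqrt(2*pi*s2(j));
            end
            g = g ./ repmat(sum(g, 2), 1, 2);

            % M-step: re-estimate mixing weights, means, and variances
            Nj = sum(g, 1);
            p  = Nj / N;
            for j = 1:2
                mu(j) = sum(g(:,j) .* x) / Nj(j);
                s2(j) = sum(g(:,j) .* (x - mu(j)).^2) / Nj(j);
            end
        end
        fprintf('weights: %.2f %.2f  means: %.2f %.2f\n', p, mu);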

    1.7 Parzen Windows 19
    Example 1.7.1: Generate N=1,000 points from a Gaussian mixture, and estimate the PDF using a Parzen window method with h=0.1
    (Ex 1.7.1) Repeat Example 1.7.1 with different N & h
    (Ex 1.7.2) Repeat Example 1.7.1 with a 2-D PDF
    (Ex 1.7.3) Repeat Example 1.4.3 with a Parzen window estimator
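    A rough 1-D Parzen-window sketch with a Gaussian kernel, along the lines of Example 1.7.1 (the mixture used to generate the data is a placeholder):

        % Parzen-window estimate of a 1-D PDF with a Gaussian kernel
        X = [randn(500,1); 3 + 0.5*randn(500,1)];      % N=1,000 placeholder samples
        h = 0.1;                                        % window (smoothing) parameter
        xg = linspace(min(X)-1, max(X)+1, 200);         % evaluation grid

        p_hat = zeros(size(xg));
        for i = 1:numel(xg)
            u = (xg(i) - X) / h;                                 % scaled distances to every sample
            p_hat(i) = mean(exp(-0.5*u.^2) / sqrt(2*pi)) / h;    % average kernel contribution
        end
        plot(xg, p_hat);                                % estimated density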

    1.8 k-NN Density Estimation 21
    Example 1.8.1: Repeat Example 1.7.1 with a k-NN estimator with k=21


    (Ex 1.8.1) Repeat Example 1.8.1 with different k and N
    (Ex 1.8.2) Repeat Example 1.4.3 with the k-NN density estimator

    1.9 Naive Bayesian Classifier 22
    Example 1.9.1: Generate 50 5-D data points from 2 classes, estimate the two PDFs for a naive Bayesian classifier, and then test with 10,000 samples
    (Ex 1.9.1) Classify the data in Example 1.9.1 with the original PDF, and repeat Example 1.9.1 with 1,000 training data

    1.10 Nearest Neighbor Rule 25
    Example 1.10.1: Generate 1,000 data points from two equiprobable classes, and classify another 5,000 samples with a k-NN classifier with k=3, adopting the squared Euclidean distance
    (Ex 1.10.1) Repeat Example 1.10.1 for k=1, 7, 15
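    A minimal k-NN classifier sketch matching the description of Example 1.10.1 (majority vote among the k nearest training samples under squared Euclidean distance); the data layout (d x N matrices, labels 1 or 2) is a convention of this sketch, not the textbook's interface:

        % k-NN classification with squared Euclidean distance (save as knn_classify.m)
        % Xtr: d x Ntr training samples, ytr: 1 x Ntr labels, Xte: d x Nte test samples
        function yte = knn_classify(Xtr, ytr, Xte, k)
            Nte = size(Xte, 2);
            yte = zeros(1, Nte);
            for i = 1:Nte
                d2 = sum((Xtr - repmat(Xte(:,i), 1, size(Xtr,2))).^2, 1);  % squared distances
                [~, idx] = sort(d2);                                        % nearest first
                votes = ytr(idx(1:k));                                      % labels of the k nearest
                yte(i) = mode(votes);                                       % majority vote
            end
        end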

    CHAPTER 2 Classifiers Based on Cost Function Optimization 29

    2.1 Introduction 29
    2.2 Perceptron Algorithm 30

    Example 2.2.1: Generate 4 different data sets containing -1 & +1 classes; apply the perceptron algorithm to get the separating line

    2.2.1 Online Form of the Perceptron Algorithm
    Example 2.2.2: Repeat Example 2.2.1 with the online version of the algorithm
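    A sketch of the perceptron iteration for +1 / -1 labels (illustrative only; the learning rate rho, the epoch limit, and the augmented-vector convention are choices of this sketch):

        % Perceptron algorithm for linearly separable classes +1 / -1 (save as perceptron_train.m)
        % X: d x N samples, y: 1 x N labels in {-1, +1}; returns the weights of the separating line
        function w = perceptron_train(X, y, rho)
            [d, N] = size(X);
            Xa = [X; ones(1, N)];            % augment with a bias coordinate
            w  = zeros(d + 1, 1);            % initial weight vector
            for epoch = 1:1000
                misclassified = false;
                for i = 1:N
                    if y(i) * (w' * Xa(:,i)) <= 0       % sample on the wrong side
                        w = w + rho * y(i) * Xa(:,i);   % perceptron update
                        misclassified = true;
                    end
                end
                if ~misclassified, break; end           % converged: all samples correct
            end
        end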

    2.3 Sum of Squared Errors (SSE) Classifier 35
    Example 2.3.1: Generate train and test data, each with 200 samples from two Gaussians having the same covariance; apply the SSE method to get the separating line; repeat with 100,000 samples; compare with the optimal Bayesian classifier
    Example 2.3.2: skip

    2.3.1 Multi-class LS Classifier
    Example 2.3.3: Generate train data with 1,000 samples and test data with 10,000 samples from three Gaussians having the same covariance; apply SSE to get the three discriminant functions; show that the values of these functions correspond to posterior probabilities; and compare with the optimal Bayesian classifier
    (Ex 2.3.1) Repeat Example 2.3.3 in a more non-separable situation
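    A least-squares (SSE) linear classifier reduces to the normal equations; a minimal sketch with placeholder Gaussian data (not the data of Example 2.3.1):

        % Sum-of-squared-errors (least squares) linear classifier
        N  = 200;
        X1 = randn(2, N) + repmat([0; 0], 1, N);     % class +1 samples (placeholder Gaussian)
        X2 = randn(2, N) + repmat([3; 3], 1, N);     % class -1 samples
        X  = [X1, X2];
        y  = [ones(1, N), -ones(1, N)];              % targets +1 / -1

        Xa = [X; ones(1, 2*N)];                      % augment with a bias coordinate
        w  = (Xa * Xa') \ (Xa * y');                 % normal equations: (Xa*Xa') w = Xa*y'
        yhat = sign(w' * Xa);                        % decisions on the training set
        fprintf('training accuracy: %.3f\n', mean(yhat == y));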

    2.4 SVM: The Linear Case 43
    Example 2.4.1: Given 400 points of two classes in a 2D space, apply SVM with 6 different values of C; compute the accuracy, count the support vectors, compute the margin, and plot the separating line
    (Ex 2.4.1) Repeat Example 2.4.1 with a different data distribution

    2.4.1 Multi-class Generalization


    Example 2.4.2: Given 3-class samples as in Example 2.3.3, get the three SVM classifiers

    2.5 SVM: Nonlinear Case 50
    Example 2.5.1: Generate a set of N=150 training samples in the 2D region [-5, 5] x [-5, 5], belonging to either the +1 or the -1 class and nonlinearly separable; apply a linear SVM with C=2, tol=0.001; apply a nonlinear SVM using an RBF kernel; apply a nonlinear SVM using a polynomial kernel
    Example 2.5.2: Generate a set of 270 training samples in a 3x3 grid of 9 cells, where the +1 and -1 classes alternate; apply a linear SVM with C=200, tol=0.001; apply a nonlinear SVM using an RBF kernel; apply a nonlinear SVM using a polynomial kernel

    2.6 Kernel Perceptron Algorithm 58  skip
    2.7 AdaBoost Algorithm 63

    Example 2.7.1: Apply the AdaBoost algorithm to build a classifier between two classes in 2D, where each class is a Gaussian mixture; use 100 samples, 50 for class +1 and the other 50 for class -1; observe the error rates as a function of the number of base classifiers
    (Ex 2.7.1) Repeat Example 2.7.1 where the two classes are described by normal distributions with means (1, 1) and (s, s), s=2, 3, 4, 6

    2.8 Multi-Layer Perceptron 66
    Example 2.8.1: Generate two-class samples in 2D, where class +1 comes from a mixture of 3 Gaussians and class -1 from another mixture of 4 Gaussians; train a 2-layer feedforward NN with 2 hidden nodes and another 2-layer feedforward NN with 4 hidden nodes using standard BP with lr=0.01; repeat standard BP with lr=0.0001; train the same NNs with adaptive BP
    (Ex 2.8.1) Repeat Example 2.8.1 where the two classes are more spread out, with larger covariance values
    Example 2.8.2: Generate two-class samples in 2D, where class +1 comes from a mixture of 4 Gaussians and class -1 from another mixture of 5 Gaussians; train 2-layer feedforward NNs with 3, 4, and 10 hidden nodes using standard BP with lr=0.01; train the same NNs with adaptive BP
    (Ex 2.8.2) Generate two-class samples in 2D, where class +1 comes from a mixture of 8 Gaussians and class -1 from another mixture of 8 Gaussians; train a 2-layer FNN with 7, 8, 10, 14, 16, 20, 32, 40 hidden nodes using an adaptive BP algorithm; repeat with more spread-out samples


    CHAPTER 3 Data Transformation: Feature Generation and Dimensionality Reduction

    3.1 Introduction 79
    3.2 Principal Component Analysis (PCA) 79

    Example 3.2.1: Generate a set of 500 samples from a Gaussian distribution, perform PCA, and get the two eigenvalue/eigenvector pairs; repeat the same procedure with a different distribution
    Example 3.2.2: Generate two data sets X1 and X2; perform PCA on X1 and project the data on the first principal component; repeat with X2
    (Ex 3.2.1) Generate two data sets X1 and X2, having different mean points in 3D; perform PCA on X1 and project the data on the first two principal components; repeat the same procedure with X2

    3.3 Singular Value Decomposition Method 84
    (Ex 3.3.1) Perform SVD on the data X1 and X2 in Ex 3.2.1 and compare
    Example 3.3.1: Generate a set of 100 samples from a Gaussian distribution in a 2000-dimensional space; apply PCA and SVD and compare the results
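    A minimal sketch of PCA via the eigendecomposition of the sample covariance, with the SVD of the centered data matrix shown for comparison (Section 3.3); the 2-D Gaussian used here is a placeholder:

        % PCA via the covariance eigendecomposition
        N = 500;
        X = chol([3 1; 1 1])' * randn(2, N);         % 2 x N placeholder Gaussian samples
        Xc = X - repmat(mean(X, 2), 1, N);           % center the data

        S = Xc * Xc' / N;                            % sample covariance matrix
        [V, D] = eig(S);                             % columns of V = eigenvectors
        [lambda, order] = sort(diag(D), 'descend');  % sort eigenvalues, largest first
        V = V(:, order);

        Y = V(:, 1)' * Xc;                           % projection on the first principal component

        % The same subspace from the SVD of the centered data matrix (Section 3.3):
        [U, Sv, ~] = svd(Xc, 'econ');                % U(:,1) spans the same direction as V(:,1), up to sign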

    3.4 Fisher's Linear Discriminant Analysis 87
    Example 3.4.1: Apply Fisher's LDA on the data X2 in Example 3.2.2
    Example 3.4.2: Generate 900 3D samples of two classes: the first 100 samples from a zero-mean Gaussian distribution, and the rest from 8 groups of Gaussian distributions with different means; plot the data; perform Fisher's LDA and project the data; repeat the same for a 3-class problem where the last group of 100 samples is labeled class 3
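    For the two-class case, Fisher's LDA direction is the inverse within-class scatter applied to the difference of the class means; a short sketch with placeholder data:

        % Fisher's LDA direction for two classes
        N  = 100;
        X1 = chol([1 .3; .3 1])' * randn(2, N) + repmat([0; 0], 1, N);   % class 1 (placeholder)
        X2 = chol([1 .3; .3 1])' * randn(2, N) + repmat([3; 2], 1, N);   % class 2 (placeholder)

        m1 = mean(X1, 2);  m2 = mean(X2, 2);
        S1 = (X1 - repmat(m1, 1, N)) * (X1 - repmat(m1, 1, N))';
        S2 = (X2 - repmat(m2, 1, N)) * (X2 - repmat(m2, 1, N))';
        Sw = S1 + S2;                                  % within-class scatter matrix

        w = Sw \ (m1 - m2);                            % Fisher's discriminant direction
        w = w / norm(w);
        y1 = w' * X1;  y2 = w' * X2;                   % 1-D projections of the two classes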

    3.5 Kernel PCA 92
    Example 3.5.1: Data set in the original space; data set in the transformed space; perform PCA in the transformed space; map back to the original space
    Example 3.5.2:
    Example 3.5.3:
    (Ex 3.5.1)

    3.6 Eigenmap 101 skip

    CHAPTER 4 Feature Selection 107

    4.1 Introduction 107
    4.2 Outlier Removal 107

    Example 4.2.1: Generate N=100 points from a 1-D Gaussian distribution, and then add 6 outliers; use a library function for outlier detection

    4.3 Data Normalization 108


    Example 4.3.1: Normalize two different data sets using three normalization methods

    4.4 Hypothesis Testing: t-Test 111
    Example 4.4.1: Given two-class data from equiprobable Gaussian distributions with m1=8.75 and m2=9 (variance=4), test whether the two mean values differ significantly at significance levels of 0.05 and 0.001
    (Ex 4.4.1) Repeat the t-test with different variances of 1 and 16

    4.5 Receiver Operating Characteristic (ROC) Curve 113
    Example 4.5.1: Consider two Gaussians with m1=2, s1=1 and m2=0, s2=0; plot the respective histograms; get the AUC using the ROC function
    (Ex 4.5.1) Repeat Example 4.5.1 with { m1=m2=0 } and { m1=5, m2=0 }

    4.6 Fisher's Discriminant Ratio 114
    Example 4.6.1: Given 200 samples of two Gaussians in 5D, compute the FDR values for the five features
    Example 4.6.2: Compute FDR values for the discriminatory power of the four features from ultrasonic images
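    The FDR of a single feature is the squared difference of the class means divided by the sum of the class variances; a per-feature sketch with placeholder two-class data:

        % Fisher's discriminant ratio of each feature for a two-class problem
        N = 200;  d = 5;
        X1 = randn(d, N);                              % placeholder class 1 data (d x N)
        X2 = randn(d, N) + repmat((1:d)'*0.3, 1, N);   % class 2, shifted differently per feature

        m1 = mean(X1, 2);  v1 = var(X1, 0, 2);         % per-feature mean and variance
        m2 = mean(X2, 2);  v2 = var(X2, 0, 2);
        FDR = (m1 - m2).^2 ./ (v1 + v2);               % one FDR value per feature
        [~, rank_order] = sort(FDR, 'descend');        % most discriminative features first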

    4.7 Class Separability Measures 117
    4.7.1 Divergence

    Example 4.7.1: Generate 100 normally distributed samples for two different classes and compute the divergence between the two classes
    (Ex 4.7.1) Repeat Example 4.7.1 with three different situations

    4.7.2 Bhattacharyya Distance and Chernoff Bound
    Example 4.7.2: Compute the Bhattacharyya distance and Chernoff bound for the same data as in Example 4.7.1
    (Ex 4.7.2) Compute the Bhattacharyya distance and Chernoff bound for the same data as in Ex 4.7.1
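    For two Gaussian classes the Bhattacharyya distance has a closed form; a sketch with placeholder parameters (the error bound shown assumes equiprobable classes):

        % Bhattacharyya distance between two Gaussian classes
        m1 = [0; 0];  S1 = eye(2);
        m2 = [2; 1];  S2 = [1.5 0.2; 0.2 0.8];         % placeholder parameters

        S  = (S1 + S2) / 2;
        dm = m2 - m1;
        B  = dm' * (S \ dm) / 8 + 0.5 * log(det(S) / sqrt(det(S1) * det(S2)));
        Pe_bound = 0.5 * exp(-B);      % Bhattacharyya bound on the Bayes error (equiprobable classes)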

    4.7.3 Scatter Matrices
    Example 4.7.3: For the data in Example 4.6.2, select three features out of the four based on the J3 measure
    (Ex 4.7.3) Compare the 3-feature combination obtained in Example 4.7.3 with all other 3-feature combinations

    4.8 Feature Subset Selection 122
    4.8.1 Scalar Feature Selection

    Example 4.8.1: Normalize the features of the data in Example 4.6.2, and use FDR to rank the four features; use scalar feature selection technique to rank the features

    4.8.2 Feature Vector Selection
    Example 4.8.2: Among the four features from a mammography data set, apply the exhaustive search method to select the best combination of three features according to the divergence, the Bhattacharyya distance, and the J3 measure


    (Ex 4.8.2) Compute the J3 measure for mean, skewness, and kurtosis in Example 4.8.2 and compare the results
    Example 4.8.3: Suboptimal searching
    (Ex 4.8.3) Repeat Example 4.8.3 with a different number of selected features
    Example 4.8.4: Designing a classification system via data collection, feature generation, feature selection, classifier design, and evaluation

    CHAPTER 5 Template Matching 137

    5.1 Introduction 137
    5.2 Edit (Levenshtein) Distance 137

    Example 5.2.1: Compute the edit distance between "book" and "bokks"; repeat for "template" and "replatte"
    (Ex 5.2.1) Find the word most likely intended by "igposre" among "impose", "ignore", and "restore" in terms of edit distance
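    A standard dynamic-programming sketch of the edit distance (unit insertion, deletion, and substitution costs assumed):

        % Edit (Levenshtein) distance by dynamic programming (save as edit_distance.m)
        function d = edit_distance(a, b)
            na = length(a);  nb = length(b);
            D = zeros(na + 1, nb + 1);
            D(:, 1) = (0:na)';                 % cost of deleting a prefix of a
            D(1, :) = 0:nb;                    % cost of inserting a prefix of b
            for i = 2:na+1
                for j = 2:nb+1
                    cost = ~(a(i-1) == b(j-1));              % 0 if the characters match, else 1
                    D(i, j) = min([D(i-1, j) + 1, ...        % deletion
                                   D(i, j-1) + 1, ...        % insertion
                                   D(i-1, j-1) + cost]);     % match / substitution
                end
            end
            d = D(na + 1, nb + 1);
        end
        % e.g., edit_distance('book', 'bokks') and edit_distance('template', 'replatte')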

    5.3 Matching Sequences of Real Numbers 139
    Example 5.3.1: Compute the matching cost between P={-1, -2, 0, 2} and T={-1, -2, -2, 0, 2} using the Sakoe-Chiba local constraints
    Example 5.3.2: Compute the matching cost between P and T1, and between P and T2, where P={1, 0, 1}, T1={1, 1, 0, 0, 0, 1, 1, 1}, and T2={1, 1, 0, 0, 1}, using the standard Itakura constraints
    Example 5.3.3: Compute the matching cost between P={-8, -4, 0, 4, 0, -4} and T={0, -8, -4, 0, 4, 0, -4, 0, 0} using the Sakoe-Chiba local constraints; repeat with endpoint constraints
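    A sketch of the dynamic-time-warping matching cost with the basic symmetric local path (horizontal, vertical, and diagonal steps); note that this is not the exact Sakoe-Chiba or Itakura constraint set used in the examples above:

        % Dynamic time warping cost between two real-valued sequences (save as dtw_cost.m)
        function c = dtw_cost(P, T)
            np = length(P);  nt = length(T);
            D = inf(np, nt);
            D(1, 1) = abs(P(1) - T(1));
            for i = 1:np
                for j = 1:nt
                    if i == 1 && j == 1, continue; end
                    best = inf;
                    if i > 1,          best = min(best, D(i-1, j));   end
                    if j > 1,          best = min(best, D(i, j-1));   end
                    if i > 1 && j > 1, best = min(best, D(i-1, j-1)); end
                    D(i, j) = abs(P(i) - T(j)) + best;   % local cost plus best predecessor
                end
            end
            c = D(np, nt);
        end
        % e.g., dtw_cost([-1 -2 0 2], [-1 -2 -2 0 2])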

    5.4 Dynamic Time Warping in Speech Recognition 143 skip

    CHAPTER 6 Hidden Markov Models 147

    6.1 Introduction 147
    6.2 Modeling 147
    6.3 Recognition and Training 148

    Example 6.3.1: Given a sequence of Head-Tail observations, O=HHHTHHHHTHHHHTTHH, and two models M1 and M2 with different transition matrices, compute the recognition probabilities P(O|M1) and P(O|M2)
    Example 6.3.2: For the setting of Example 6.3.1, compute the Viterbi score and the respective best-state sequence for M1 and M2
    Example 6.3.3: Train the HMM with a set of 70 training samples using the Baum-Welch algorithm; use two different initializations
    (Ex 6.3.1) Repeat Example 6.3.3 with a different initialization


    Example 6.3.4: Repeat Example 6.3.3 using Viterbi training
    Example 6.3.5: Compute the Viterbi score for a set of 30 test samples
    (Ex 6.3.2) Compare the two results in Example 6.3.5
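    A sketch of the forward (recognition) probability P(O|M) for a discrete HMM, using the Head-Tail sequence of Example 6.3.1; the transition, emission, and initial probabilities below are placeholders, not the models M1/M2 of the example:

        % Forward algorithm: probability of an observation sequence given a discrete HMM
        A   = [0.7 0.3; 0.4 0.6];          % state transition matrix (placeholder)
        B   = [0.9 0.1; 0.2 0.8];          % B(i,k) = P(symbol k | state i); symbols: 1 = H, 2 = T
        pi0 = [0.5 0.5];                   % initial state probabilities (placeholder)

        % O = HHHTHHHHTHHHHTTHH encoded with H = 1, T = 2
        O = [1 1 1 2 1 1 1 1 2 1 1 1 1 2 2 1 1];

        alpha = pi0 .* B(:, O(1))';        % initialization
        for t = 2:length(O)
            alpha = (alpha * A) .* B(:, O(t))';   % recursion over states
        end
        P_O = sum(alpha);                  % P(O | model)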

    CHAPTER 7 Clustering 159

    7.1 Introduction 159
    7.2 Basic Concepts and Definitions 159

    Example 7.2.1: Two clusterings for 7 samples in 2D space
    7.3 Clustering Algorithms 160
    7.4 Sequential Algorithms 161
    7.4.1 BSAS Algorithm
    7.4.2 Clustering Refinement

    Example 7.4.1: Apply the BSAS algorithm on 15 samples with variations in the presentation order, the dissimilarity threshold, and the q values
    Example 7.4.2: Generate 400 samples from 4 different Gaussians, and then apply the BSAS algorithm, estimating the number of compact clusters
    (Ex 7.4.1) Repeat step 1 of Example 7.4.2 with a set of 300 samples from a zero-mean Gaussian with an identity covariance matrix
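    A sketch of the BSAS idea: one pass over the data, opening a new cluster whenever a sample lies farther than a threshold from every existing representative, up to q clusters; updating the representatives as running means is one common variant, chosen here for concreteness:

        % Basic sequential algorithmic scheme (BSAS) sketch (save as bsas.m)
        % X: d x N data, theta: dissimilarity threshold, q: maximum number of clusters
        function labels = bsas(X, theta, q)
            [~, N] = size(X);
            labels = zeros(1, N);
            reps = X(:, 1);                  % cluster representatives (means), one column each
            counts = 1;
            labels(1) = 1;
            for i = 2:N
                d = sqrt(sum((reps - repmat(X(:,i), 1, size(reps, 2))).^2, 1));
                [dmin, jmin] = min(d);
                if dmin > theta && size(reps, 2) < q
                    reps = [reps, X(:,i)];                  % open a new cluster
                    counts = [counts, 1];
                    labels(i) = size(reps, 2);
                else
                    labels(i) = jmin;                       % assign to the nearest cluster
                    counts(jmin) = counts(jmin) + 1;        % and update its mean representative
                    reps(:, jmin) = reps(:, jmin) + (X(:,i) - reps(:, jmin)) / counts(jmin);
                end
            end
        end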

    7.5 Cost Function Optimization Clustering Algorithms 168
    7.5.1 Hard Clustering Algorithms

    Example 7.5.1: Generate 400 samples of 4 groups in 2D; apply the k-means algorithm for m=4; repeat for m=3; repeat for m=5; repeat for m=4 with specific initializations
    (Ex 7.5.1) Apply the k-means algorithm for m=2, 3 on the data in Ex 7.4.1
    Example 7.5.2: Generate 500 samples, where the first 400 are as in Example 7.5.1 and the other 100 come from a uniform distribution in [-1, 12] x [-2, 12]; apply k-means for m=4 and compare the results
    Example 7.5.3: Apply the k-means algorithm (m=2) to a set of 515 samples, where the first 500 stem from a zero-mean normal distribution and the other 15 stem from a normal distribution centered at [5, 5]
    Example 7.5.4: Apply the k-means algorithm for m=4 on a data set with 4 groups as in Figure 7.5
    Example 7.5.5: Run the k-means algorithm for each value of m in a range, and find the significant knee on the various data in Example 7.5.1, Example 7.5.2, Example 7.4.2, and Ex 7.5.1
    Example 7.5.6: Generate a set of 216 samples, where the first 100 stem from a zero-mean Gaussian, the next 100 stem from a Gaussian centered at [12, 13], and the other two groups of 8 samples each lie around [0, -40] and [-30, -30], respectively; apply the k-means and PAM algorithms for m=2


    (Ex 7.5.2) Repeat Example 7.5.1 for the PAM algorithm
    (Ex 7.5.3) Repeat Example 7.5.5 for the PAM algorithm
    (Ex 7.5.4) Repeat Example 7.5.1 using the GMDAS algorithm
    (Ex 7.5.5) Repeat Example 7.5.5 using the GMDAS algorithm
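    A plain k-means sketch (random initial centroids, alternating assignment and mean-update steps); the initialization and stopping rule are choices of this sketch, not the textbook's routine:

        % k-means clustering (save as kmeans_sketch.m); X: d x N data, m: number of clusters
        function [labels, C] = kmeans_sketch(X, m)
            [~, N] = size(X);
            idx = randperm(N);
            C = X(:, idx(1:m));                              % random initial centroids
            labels = zeros(1, N);
            for it = 1:100
                % assignment step: each sample goes to its nearest centroid
                for i = 1:N
                    d = sum((C - repmat(X(:,i), 1, m)).^2, 1);
                    [~, labels(i)] = min(d);
                end
                % update step: recompute each centroid as the mean of its samples
                Cold = C;
                for j = 1:m
                    if any(labels == j)
                        C(:, j) = mean(X(:, labels == j), 2);
                    end
                end
                if max(abs(C(:) - Cold(:))) < 1e-8, break; end   % converged
            end
        end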

    7.5.2 Nonhard Clustering Algorithms
    (Ex 7.5.6) Repeat Example 7.5.1 using FCM with q=2
    (Ex 7.5.7) Repeat Example 7.5.3 using FCM with q=2
    (Ex 7.5.8) Repeat Example 7.5.5 using FCM with q=2
    (Ex 7.5.9) Repeat Example 7.5.1 using FCM with q=2, 10, 25, and compare
    Example 7.5.7: Apply the FCM algorithm on the data in Example 7.5.6, and compare with the results from k-means and PAM
    (Ex 7.5.10) Apply the PCM algorithm on the data in Example 7.5.1 for m=4, 6, 3; use q=2, n=3; repeat with a different initialization
    (Ex 7.5.11) Apply PCM on the data in Example 7.5.3 for m=2, q=2

    7.6 Miscellaneous Clustering Algorithms 189  skip
    7.7 Hierarchical Clustering Algorithms 198
    7.7.1 Generalized Agglomerative Scheme
    7.7.2 Specific Agglomerative Clustering Algorithms

    Example 7.7.1: Apply the single-link and complete-link algorithms on a set of six points

    7.7.3 Choosing the Best Clustering
    Example 7.7.2: Generate a set of 40 samples of 4 groups in 2D; apply the single-link and complete-link algorithms; determine the best clusterings
    (Ex 7.7.1) Repeat Example 7.7.2 on data with 4 clusters of 30, 20, 10, 51 points, respectively