
Page 1: Ch 1. Introduction

Ch 1. Introduction

Pattern Recognition and Machine Learning,

C. M. Bishop, 2006.

Department of Computer Science and Engineering

Pohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea

[email protected]

Page 2: Contents

Contents

• 1.1 Example: Polynomial Curve Fitting

• 1.2 Probability Theory

– 1.2.1 Probability densities

– 1.2.2 Expectations and covariances

– 1.2.3 Bayesian probabilities

– 1.2.4 The Gaussian distribution

– 1.2.5 Curve fitting re-visited

– 1.2.6 Bayesian curve fitting

• 1.3 Model Selection

• 1.4 The Curse of Dimensionality

• 1.5 Decision Theory

• 1.6 Information Theory

Page 3: Pattern Recognition

Pattern Recognition

• Training set: $\{x_1, \ldots, x_N\}$

• Target vector: $\mathbf{t}$

• Training (learning) phase

  – Determine $y(\mathbf{x})$

• Generalization

  – Test set

• Preprocessing

  – Feature selection

Page 4: Supervised, Unsupervised and Reinforcement Learning

Supervised, Unsupervised and Reinforcement Learning

• Supervised Learning: with target vector

– Classification

– Regression

• Unsupervised learning: w/o target vector

– Clustering

– Density estimation

– Visualization

• Reinforcement learning: maximize a reward

– Trade-off between exploration & exploitation

Page 5: 1.1 Example: Polynomial Curve Fitting

1.1 Example: Polynomial Curve Fitting

• N observations: $\mathbf{x} = (x_1, \ldots, x_N)^T$ with targets $\mathbf{t} = (t_1, \ldots, t_N)^T$, generated from the underlying function $\sin(2\pi x)$

• Polynomial model:

  $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$

• Minimizing the error function:

  $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$
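As a concrete illustration of this least-squares fit, here is a minimal sketch (variable names, the noise level, and the data sizes are illustrative, not from the slides):

```python
import numpy as np

# Fit y(x, w) = sum_j w_j x^j by minimizing the sum-of-squares error E(w).
rng = np.random.default_rng(0)
N, M = 10, 3                                 # number of points, polynomial order
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets

Phi = np.vander(x, M + 1, increasing=True)   # design matrix, columns x^0 .. x^M
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution

E = 0.5 * np.sum((Phi @ w_ml - t) ** 2)      # sum-of-squares error E(w)
print(w_ml, E)
```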

Page 6: Model Selection & Over-fitting (1/2)

Model Selection & Over-fitting (1/2)

Page 7: Model Selection & Over-fitting (2/2)

Model Selection & Over-fitting (2/2)

• RMS (root-mean-square) error:

  $E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^*) / N}$

• Too large coefficients $\mathbf{w}^*$

  → over-fitting

• The more data, the better the generalization

• Over-fitting is a general property of maximum likelihood
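A small helper for this quantity (a sketch; it assumes predictions have already been computed for the set being evaluated):

```python
import numpy as np

def rms_error(y_pred, t):
    """E_RMS = sqrt(2 E(w*) / N), which puts errors of different-sized sets on a common scale."""
    E = 0.5 * np.sum((y_pred - t) ** 2)      # sum-of-squares error E(w*)
    return np.sqrt(2.0 * E / len(t))
```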

Page 8: Regularization

Regularization

- Shrinkage

- Ridge regression

- Weight decay

$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
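A sketch of the corresponding regularized least-squares solution, using the standard closed form for a linear-in-the-parameters model (the function name is illustrative):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize the regularized error above: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    n_basis = Phi.shape[1]                   # number of basis functions (M + 1)
    A = lam * np.eye(n_basis) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)
```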

Page 9: 1.2 Probability Theory

1.2 Probability Theory

• “What is the overall probability that the selection procedure will pick an apple?”

• “Given that we have chosen an orange, what is the probability that the box we chose was the blue one?”

Page 10: Rules of Probability (1/2)

Rules of Probability (1/2)

• Joint probability:

  $p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$

• Marginal probability:

  $p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) = \frac{c_i}{N}$

• Conditional probability:

  $p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$

where $n_{ij}$ is the number of points in cell $(i, j)$, $c_i = \sum_j n_{ij}$ the number of points in column $i$, and $N$ the total number of points.

Page 11: Rules of Probability (2/2)

Rules of Probability (2/2)

• Sum rule:

  $p(X) = \sum_{Y} p(X, Y)$

• Product rule:

  $p(X, Y) = p(Y \mid X)\, p(X)$

• Bayes' theorem:

  $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_{Y} p(X \mid Y)\, p(Y)$

  posterior $\propto$ likelihood $\times$ prior; $p(X)$ is the normalizing constant.
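A worked illustration of these rules for the two-box questions on the previous slide; the slide does not list the counts, so the numbers below follow the standard example in Bishop Sec. 1.2 and should be read as an assumption:

```python
from fractions import Fraction as F

p_box = {"red": F(4, 10), "blue": F(6, 10)}
p_fruit_given_box = {                        # red box: 2 apples, 6 oranges; blue box: 3 apples, 1 orange
    "red":  {"apple": F(2, 8), "orange": F(6, 8)},
    "blue": {"apple": F(3, 4), "orange": F(1, 4)},
}

# Sum rule: overall probability of picking an apple.
p_apple = sum(p_fruit_given_box[b]["apple"] * p_box[b] for b in p_box)

# Bayes' theorem: probability the box was blue, given that an orange was picked.
p_orange = sum(p_fruit_given_box[b]["orange"] * p_box[b] for b in p_box)
p_blue_given_orange = p_fruit_given_box["blue"]["orange"] * p_box["blue"] / p_orange

print(p_apple)               # 11/20
print(p_blue_given_orange)   # 1/3
```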

Page 12: Probability densities

Probability densities

Page 13: Expectations and Covariances

Expectations and Covariances

• Expectation:

  $\mathbb{E}[f] = \sum_{x} p(x)\, f(x)$

• Variance:

  $\mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$

• Covariance:

  $\mathrm{cov}[x, y] = \mathbb{E}_{x,y}\big[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$
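An illustrative check of these definitions on made-up numbers (values, probabilities, and sample sizes are arbitrary):

```python
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])

E_x = np.sum(p * x_vals)                     # E[x] = sum_x p(x) x
var_x = np.sum(p * (x_vals - E_x) ** 2)      # var[x] = E[(x - E[x])^2]

# Covariance from paired samples, using cov[x, y] = E[xy] - E[x] E[y].
rng = np.random.default_rng(1)
xs = rng.normal(size=1000)
ys = 2.0 * xs + rng.normal(scale=0.1, size=1000)
cov_xy = np.mean(xs * ys) - np.mean(xs) * np.mean(ys)
print(E_x, var_x, cov_xy)
```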

Page 14: Bayesian Probabilities – Frequentist vs. Bayesian

Bayesian Probabilities – Frequentist vs. Bayesian

• Likelihood: $p(\mathcal{D} \mid \mathbf{w})$

• Frequentist

  – $\mathbf{w}$: a fixed parameter determined by an 'estimator'

  – Maximum likelihood: error function $= -\ln p(\mathcal{D} \mid \mathbf{w})$

  – Error bars: obtained from the distribution of possible data sets $\mathcal{D}$

    • Bootstrap

• Bayesian

  – A single data set $\mathcal{D}$

  – A probability distribution over $\mathbf{w}$ expresses the uncertainty in the parameters:

    $p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$

  – Prior knowledge

    • noninformative prior

Page 15: Bayesian Probabilities – Expansion of Bayesian Application

Bayesian Probabilities – Expansion of Bayesian Application

• The full Bayesian procedure had limited practical application for a long time

  – Although it originated in the 18th century

  – Making predictions or comparing different models requires marginalizing over the whole of parameter space

• Markov chain Monte Carlo sampling methods (Chap. 11)

  – Small-scale problems

• Highly efficient deterministic approximation schemes

  – e.g. variational Bayes, expectation propagation (Chap. 10)

  – Large-scale problems

Page 16: Gaussian Distribution

Gaussian distribution

$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\}$

• D-dimensional multivariate Gaussian distribution:

$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$
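A quick sanity check of the two densities above (the parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

x = 0.5
print(norm.pdf(x, loc=0.0, scale=1.0))       # N(x | mu = 0, sigma^2 = 1)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_normal.pdf([0.5, -0.2], mean=mu, cov=Sigma))
```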

Page 17: Gaussian Distribution – Example (1/2)

Gaussian Distribution – Example (1/2)

• Estimating the unknown parameters $\mu$ and $\sigma^2$

• Data points are i.i.d.:

  $p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$

  $\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$

• Maximizing with respect to $\mu$

  – sample mean: $\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n$

• Maximizing with respect to the variance $\sigma^2$

  – sample variance: $\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2$
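A sketch of these maximum-likelihood estimates on synthetic data (the true parameters and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=1000)

mu_ml = np.mean(x)                               # mu_ML = (1/N) sum x_n
var_ml = np.mean((x - mu_ml) ** 2)               # sigma^2_ML, the biased estimate
var_unbiased = var_ml * len(x) / (len(x) - 1)    # (N/(N-1)) correction, cf. the next slide
print(mu_ml, var_ml, var_unbiased)
```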

Page 18: Gaussian Distribution – Example (2/2)

Gaussian Distribution – Example (2/2)

• Bias phenomenon

  – Limitation of the maximum likelihood approach

  $\mathbb{E}[\mu_{\mathrm{ML}}] = \mu, \qquad \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N-1}{N}\, \sigma^2$

Page 19: Curve Fitting Re-visited (1/2)

Curve Fitting Re-visited (1/2)

• Goal in the curve fitting problem

– Prediction for the target variable t given some new input variable x

• Determine the unknown $\mathbf{w}$ and $\beta$ by maximum likelihood

  $p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right)$

  $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\right)$

  $\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$

Page 20: Curve Fitting Re-visited (2/2)

Curve Fitting Re-visited (2/2)

– Maximizing the likelihood with respect to $\mathbf{w}$ = minimizing the sum-of-squares error function, giving $\mathbf{w}_{\mathrm{ML}}$

– Maximizing with respect to $\beta$:

  $\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \}^2$

• Predictive distribution:

  $p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\left(t \mid y(x, \mathbf{w}_{\mathrm{ML}}), \beta_{\mathrm{ML}}^{-1}\right)$
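A small sketch of the noise precision and the resulting predictive distribution for a polynomial fit; the names Phi, w and M are illustrative and follow the earlier curve-fitting snippet:

```python
import numpy as np

def beta_ml(Phi, w, t):
    """1 / beta_ML is the mean squared residual of the ML fit."""
    return 1.0 / np.mean((Phi @ w - t) ** 2)

def predictive(x_new, w, beta, M):
    """Mean and standard deviation of p(t | x, w_ML, beta_ML) at new inputs."""
    phi = np.vander(np.atleast_1d(x_new), M + 1, increasing=True)
    return phi @ w, np.sqrt(1.0 / beta)
```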

Page 21: Maximum Posterior (MAP)

Maximum Posterior (MAP)

• Add a prior probability over $\mathbf{w}$:

  $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$

  – $\alpha$: hyperparameter

• Posterior:

  $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$

• Maximizing the posterior (MAP) = minimizing

  $\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2} \|\mathbf{w}\|^2$

  which equals the regularized sum-of-squares error (1.4) with $\lambda = \alpha / \beta$

Page 22: Bayesian Curve Fitting

Bayesian Curve Fitting

• Marginalization
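In the book's notation (Bishop Sec. 1.2.6), the marginalization referred to here integrates the predictive distribution over the posterior for $\mathbf{w}$:

$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}$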

Page 23: 1.3 Model Selection

1.3 Model Selection

• Proper model complexity

→ Good generalization & best model

• Measuring the generalization performance

– If data are plentiful, divide into training, validation & test set

– Otherwise, cross-validate

• Leave-one-out technique

• Drawbacks

  – Expensive computation

  – With multiple complexity parameters, exploring their combinations requires many separate training runs

• New measures of performance

  – e.g. Akaike information criterion (AIC), Bayesian information criterion (BIC)

  – AIC chooses the model that maximizes $\ln p(\mathcal{D} \mid \mathbf{w}_{\mathrm{ML}}) - M$, where $M$ is the number of adjustable parameters
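A minimal sketch of S-fold cross-validation (leave-one-out when S equals the number of points) for choosing the polynomial order M; function and variable names are illustrative:

```python
import numpy as np

def cross_validate(x, t, M, S):
    """Average held-out RMS error of an order-M polynomial over S folds."""
    folds = np.array_split(np.random.default_rng(3).permutation(len(x)), S)
    errs = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        Phi_tr = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
        Phi_te = np.vander(x[held_out], M + 1, increasing=True)
        errs.append(np.sqrt(np.mean((Phi_te @ w - t[held_out]) ** 2)))
    return np.mean(errs)
```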

Page 24: 1.4 The Curse of Dimensionality

1.4 The Curse of Dimensionality

• The High Dimensionality Problem

• Ex. Mixture of Oil, Water, Gas

  – 3 classes (Homogeneous, Annular, Laminar)

  – 12 input variables

  – Scatter plot of x6, x7

  – Predict point X

  – Simple and naïve approach

Page 25: 1.4 The Curse of Dimensionality (Cont'd)

1.4 The Curse of Dimensionality(Cont’d)

• The Shortcomings of the Naïve Approach

  – The number of cells increases exponentially with the dimensionality D

  – A large training data set is needed so that the cells are not empty
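A tiny illustration of this exponential growth: dividing each of D input variables into k bins gives k**D cells (k = 3 is an arbitrary choice here; D = 12 matches the oil-flow example's input count):

```python
k = 3                       # bins per dimension (illustrative)
for D in (1, 2, 3, 7, 12):
    print(D, k ** D)        # number of cells grows as k**D
```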

Page 26: 1.4 The Curse of Dimensionality (Cont'd)

1.4 The Curse of Dimensionality(Cont’d)

• Polynomial Curve Fitting (order M = 3)

  – As D increases, the number of coefficients grows proportionally to $D^M$

• The Volume of a High-Dimensional Sphere

  – Concentrated in a thin shell near the surface

Page 27: 1.4 The Curse of Dimensionality (Cont'd)

1.4 The Curse of Dimensionality(Cont’d)

• Gaussian Distribution

  – For large D, the probability mass of a Gaussian concentrates in a thin shell at some radius from the mean

Page 28: 1.5 Decision Theory

1.5 Decision Theory

• Making Optimal Decisions

  – Inference step & decision step

  – Select the class with the higher posterior probability

• Minimizing the Misclassification Rate

  – MAP

  → minimizes the shaded (misclassification) area

Page 29: 1.5 Decision Theory (Cont'd)

1.5 Decision Theory (Cont’d)

• Minimizing the Expected Loss

  – The damage caused by a misclassification may differ between classes

  – Introduce a loss function (cost function)

  – MAP → minimizing the expected loss

• The Reject Option

  – Threshold θ

  – Reject if the largest posterior probability is below θ
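A sketch of the two decision rules on this slide: expected-loss minimization with a loss matrix L[k, j] (the cost of deciding class j when the truth is class k), plus a reject option when the largest posterior falls below theta. The loss values are made up for the example:

```python
import numpy as np

def decide(posteriors, L, theta):
    posteriors = np.asarray(posteriors)
    if posteriors.max() < theta:
        return "reject"
    expected_loss = posteriors @ L           # E[loss | decide j] = sum_k p(C_k|x) L[k, j]
    return int(np.argmin(expected_loss))

L = np.array([[0.0, 1.0],                    # penalize one error type more heavily
              [10.0, 0.0]])
print(decide([0.7, 0.3], L, theta=0.6))      # -> 1, despite class 0 having the larger posterior
```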

Page 30: 1.5 Decision Theory (Cont'd)

1.5 Decision Theory (Cont’d)

• Inference and Decision

  – Three distinct approaches:

  1. Generative models: model $p(\mathbf{x} \mid \mathcal{C}_k)$ and $p(\mathcal{C}_k)$, then obtain the posterior probabilities

  2. Discriminative models: model the posterior probabilities $p(\mathcal{C}_k \mid \mathbf{x})$ directly

  3. Find a discriminant function that maps inputs directly to class labels

Page 31: 1.5 Decision Theory (Cont'd)

1.5 Decision Theory (Cont'd)

• The Reasons to Compute the Posterior

1. Minimizing Risk

2. Reject Option

3. Compensating for Class Priors

4. Combining Models

Page 32: 1.5 Decision Theory (Cont'd)

1.5 Decision Theory (Cont’d)

• Loss Function for Regression

  – Extends to a vector of multiple target variables $\mathbf{t}$
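For reference, the squared-loss case the slide builds on (Bishop Sec. 1.5.5) has expected loss and minimizer

$\mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t, \qquad y(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}]$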

Page 33: 1.5 Decision Theory (Cont'd)

1.5 Decision Theory (Cont'd)

• Minkowski Loss
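For reference, the Minkowski loss generalizes the squared loss (Bishop Sec. 1.5.5); its minimizer is the conditional mean for q = 2, the conditional median for q = 1, and the conditional mode as q → 0:

$\mathbb{E}[L_q] = \iint |y(\mathbf{x}) - t|^q\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t$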

Page 34: 1.6 Information Theory

1.6 Information Theory

• Entropy

  – The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable

  – Higher entropy, larger uncertainty
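An illustrative entropy computation, using H[x] = -sum_i p(x_i) log2 p(x_i) in bits (the distributions are examples, not from the slides):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # convention: 0 log 0 = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))           # 1 bit: a fair coin
print(entropy([1.0, 0.0]))           # 0 bits: no uncertainty
print(entropy([0.25] * 4))           # 2 bits: uniform over 4 states
```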

Page 35: 1.6 Information Theory (Cont'd)

1.6 Information Theory (Cont’d)

• Maximum Entropy Configuration for a Continuous Variable

  – Constraints: normalization, fixed mean, and fixed variance

  – Result: the distribution that maximizes the differential entropy is the Gaussian

• Conditional Entropy: H[x, y] = H[y | x] + H[x]

Page 36: 1.6 Information Theory (Cont'd)

1.6 Information Theory (Cont’d)

• Relative Entropy (Kullback-Leibler divergence)

• Convex Functions (Jensen's Inequality)
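An illustrative KL-divergence computation for discrete distributions, KL(p || q) = sum_i p_i ln(p_i / q_i); it is non-negative and zero only when p = q, a consequence of Jensen's inequality applied to the convex function -ln (the example distributions are made up):

```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                                 # convention: 0 ln 0 = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), kl_divergence(p, p))  # positive, 0.0
```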

Page 37: 1.6 Information Theory (Cont'd)

1.6 Information Theory (Cont'd)

• Mutual Information

– I[x, y] = H[x] − H[x | y] = H[y] − H[y | x]

– If x and y are independent, I[x, y] = 0

– The reduction in the uncertainty about x by virtue of being told the value of y
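A sketch of mutual information for a discrete joint distribution p(x, y), using the equivalent form I[x, y] = KL( p(x, y) || p(x) p(y) ); the joint tables are illustrative:

```python
import numpy as np

def mutual_information(p_xy):
    p_xy = np.asarray(p_xy, float)
    p_x = p_xy.sum(axis=1, keepdims=True)        # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)        # marginal p(y)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # independent -> 0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))       # fully dependent -> ln 2
```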