Ch 1. Introduction
Pattern Recognition and Machine Learning,
C. M. Bishop, 2006.
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea
Contents
• 1.1 Example: Polynomial Curve Fitting
• 1.2 Probability Theory
– 1.2.1 Probability densities
– 1.2.2 Expectations and covariances
– 1.2.3 Bayesian probabilities
– 1.2.4 The Gaussian distribution
– 1.2.5 Curve fitting re-visited
– 1.2.6 Bayesian curve fitting
• 1.3 Model Selection
• 1.4 The Curse of Dimensionality
• 1.5 Decision Theory
• 1.6 Information Theory
Pattern Recognition
• Training set: $\{x_1, \ldots, x_N\}$
• Target vector: $\mathbf{t}$
• Training (learning) phase
– Determine $y(\mathbf{x})$
• Generalization
– Test set
• Preprocessing
– Feature selection
Supervised, Unsupervised and Reinforcement Learning
• Supervised Learning: with target vector
– Classification
– Regression
• Unsupervised learning: w/o target vector
– Clustering
– Density estimation
– Visualization
• Reinforcement learning: maximize a reward
– Trade-off between exploration & exploitation
1.1 Example: Polynomial Curve Fitting
• N observations: $\mathbf{x} = (x_1, \ldots, x_N)^T$ with targets $\mathbf{t} = (t_1, \ldots, t_N)^T$, generated from $\sin(2\pi x)$
• Polynomial model: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
• Minimizing the error function: $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$
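A minimal numpy sketch of this least-squares fit (the noisy $\sin(2\pi x)$ data, the choice M = 3, and the noise level are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N noisy observations of sin(2*pi*x) (noise level assumed).
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix with columns x^0, x^1, ..., x^M.
Phi = np.vander(x, M + 1, increasing=True)

# The least-squares solution minimizes E(w) = 1/2 * sum_n {y(x_n, w) - t_n}^2.
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E = 0.5 * np.sum((Phi @ w_star - t) ** 2)
print("w* =", w_star)
print("E(w*) =", E, " E_RMS =", np.sqrt(2 * E / N))  # E_RMS = sqrt(2 E(w*) / N)
```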
Model Selection & Over-fitting (1/2)
Model Selection & Over-fitting (2/2)
• RMS (Root-Mean-Square) error: $E_{\mathrm{RMS}} = \sqrt{2E(\mathbf{w}^*)/N}$
• Too large $\mathbf{w}^*$ → over-fitting
• The more data, the better the generalization
• Over-fitting is a general property of maximum likelihood
Regularization
• Regularized error function:
$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
- Shrinkage
- Ridge regression
- Weight decay
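A sketch of the corresponding closed-form ridge / weight-decay solution (synthetic data again; M = 9 and ln λ = −18 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy sin(2*pi*x) data and a high-order polynomial basis.
N, M, lam = 10, 9, np.exp(-18)          # lambda chosen only for illustration
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing E~(w) = 1/2 sum_n {y(x_n,w) - t_n}^2 + lambda/2 ||w||^2
# has the closed-form solution w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
w_ridge = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print("regularized coefficients:", w_ridge)
```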
1.2 Probability Theory
• “What is the overall probability that the selection procedure will pick an apple?”
• “Given that we have chosen an orange, what is the probability that the box we chose was the blue one?”
Rules of Probability (1/2)
• Joint probability: $p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$
• Marginal probability: $p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) = \frac{c_i}{N}$
• Conditional probability: $p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$
Rules of Probability (2/2)
• Sum rule: $p(X) = \sum_{Y} p(X, Y)$
• Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$
• Bayes' theorem: $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$, where $p(X) = \sum_{Y} p(X \mid Y)\, p(Y)$
– Posterior $\propto$ likelihood $\times$ prior; $p(X)$ is the normalizing constant
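A small numerical sketch of the two questions from the fruit-and-boxes example. The box contents and selection probabilities below are assumptions in the spirit of the textbook setup (red box chosen with probability 0.4 holding 2 apples and 6 oranges, blue box chosen with probability 0.6 holding 3 apples and 1 orange):

```python
# Prior over boxes and conditional probabilities of fruit given box (assumed values).
p_box = {"red": 0.4, "blue": 0.6}
p_fruit_given_box = {
    "red":  {"apple": 2 / 8, "orange": 6 / 8},
    "blue": {"apple": 3 / 4, "orange": 1 / 4},
}

# Sum rule + product rule: p(fruit) = sum_box p(fruit | box) p(box)
def p_fruit(fruit):
    return sum(p_fruit_given_box[b][fruit] * p_box[b] for b in p_box)

# Bayes' theorem: p(box | fruit) = p(fruit | box) p(box) / p(fruit)
def p_box_given_fruit(box, fruit):
    return p_fruit_given_box[box][fruit] * p_box[box] / p_fruit(fruit)

print("p(apple) =", p_fruit("apple"))                             # 0.55
print("p(blue | orange) =", p_box_given_fruit("blue", "orange"))  # 1/3
```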
Probability densities
• $p(x \in (a, b)) = \int_a^b p(x)\, dx$, with $p(x) \ge 0$ and $\int_{-\infty}^{\infty} p(x)\, dx = 1$
Expectations and Covariances
• Expectation: $\mathbb{E}[f] = \sum_{x} p(x)\, f(x)$
• Variance: $\mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$
• Covariance: $\mathrm{cov}[x, y] = \mathbb{E}_{x,y}\big[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$
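A quick numpy check of these definitions and of the identity $\mathrm{cov}[x,y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$ on sampled data (the joint distribution used here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw correlated samples (x, y) from an arbitrary joint distribution.
n = 200_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.3, size=n)

f = np.sin(x)                               # any function f(x)
E_f = f.mean()                              # E[f], estimated by the sample average
var_f = np.mean((f - E_f) ** 2)             # var[f] = E[(f - E[f])^2]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# The two sides of cov[x, y] = E[xy] - E[x] E[y] should agree closely.
print(cov_xy, np.mean(x * y) - x.mean() * y.mean())
```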
Bayesian Probabilities - Frequentist vs. Bayesian
• Bayes' theorem for the parameters: $p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$
• Likelihood: $p(\mathcal{D} \mid \mathbf{w})$
• Frequentist: $\mathbf{w}$ is a fixed parameter determined by an 'estimator'
– Maximum likelihood: error function $= -\ln p(\mathcal{D} \mid \mathbf{w})$
– Error bars: obtained from the distribution of possible data sets $\mathcal{D}$ (bootstrap)
• Bayesian: a single data set $\mathcal{D}$
– A probability distribution over $\mathbf{w}$ expresses the uncertainty in the parameters
– Prior knowledge: e.g. a noninformative prior
Bayesian Probabilities
-Expansion of Bayesian Application
• Application of the full Bayesian procedure was limited for a long time
– Bayesian methods originated in the 18th century
– Making predictions or comparing different models requires marginalizing over the whole of parameter space
• Markov chain Monte Carlo sampling methods (Chap. 11)
– Small-scale problems
• Highly efficient deterministic approximation schemes
– e.g. variational Bayes, expectation propagation (Chap. 10)
– Large-scale problems
Gaussian distribution
• Univariate Gaussian distribution:
$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}$
• D-dimensional multivariate Gaussian distribution:
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$
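A direct numpy transcription of the multivariate density formula (the mean, covariance, and test point below are arbitrary illustrative values):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the D-dimensional Gaussian density N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```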
Gaussian distribution
-Example (1/2)
• Getting the unknown parameters $\mu$ and $\sigma^2$
• Data points are i.i.d.:
$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$
$\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$
– Maximizing with respect to $\mu$
• sample mean: $\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n$
– Maximizing with respect to the variance $\sigma^2$
• sample variance: $\sigma_{ML}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2$
Gaussian distribution
-Example (2/2)
• Bias phenomenon
– Limitation of the maximum likelihood approach:
$\mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}[\sigma_{ML}^2] = \frac{N-1}{N}\, \sigma^2$
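A short simulation of this bias (the true parameters, the sample size N = 5, and the number of trials are arbitrary choices): repeated ML estimates average to roughly $\mu$ but only $(N-1)/N \cdot \sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, var_true, N = 2.0, 4.0, 5

# Draw many data sets of size N and compute the ML estimates for each.
trials = 100_000
data = rng.normal(mu_true, np.sqrt(var_true), size=(trials, N))
mu_ml = data.mean(axis=1)
var_ml = ((data - mu_ml[:, None]) ** 2).mean(axis=1)   # divides by N, not N-1

print("average mu_ML      :", mu_ml.mean())            # ~ mu (unbiased)
print("average sigma^2_ML :", var_ml.mean())           # ~ (N-1)/N * sigma^2
print("(N-1)/N * sigma^2  :", (N - 1) / N * var_true)  # 3.2 for these values
```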
Curve Fitting Re-visited (1/2)
• Goal in the curve fitting problem
– Prediction for the target variable t given some new input variable x
$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big)$
• Determine the unknown $\mathbf{w}$ and $\beta$ by maximum likelihood
$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\big)$
$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$
Curve Fitting Re-visited (2/2)
• $\mathbf{w}_{ML}$: maximizing the likelihood = minimizing the sum-of-squares error function
• $\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{ML}) - t_n \}^2$
• Predictive distribution:
$p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1}\big)$
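A sketch combining the pieces: fit $\mathbf{w}_{ML}$ by least squares, estimate the noise precision $\beta_{ML}$ from the residuals, and report the predictive mean and standard deviation at a new input (the data generation, M, and the query point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Noisy sin(2*pi*x) data and an order-M polynomial fit, as before.
N, M = 15, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
Phi = np.vander(x, M + 1, increasing=True)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1/beta_ML = (1/N) sum_n { y(x_n, w_ML) - t_n }^2
residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)

# Predictive distribution at a new input: N(t | y(x, w_ML), beta_ML^{-1})
x_new = 0.35
mean = np.vander([x_new], M + 1, increasing=True) @ w_ml
print("predictive mean =", mean[0], " predictive std =", np.sqrt(1.0 / beta_ml))
```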
Maximum Posterior (MAP)
• Add a prior probability: $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$
– $\alpha$: hyperparameter
• Posterior: $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$
– Maximizing the posterior = minimizing
$\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2} \|\mathbf{w}\|^2$
which equals the regularized sum-of-squares error (1.4) with $\lambda = \alpha / \beta$
Bayesian Curve Fitting
• Marginalization over $\mathbf{w}$:
$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w}$
1.3 Model Selection
• Proper model complexity
→ Good generalization & best model
• Measuring the generalization performance
– If data are plentiful, divide into training, validation & test set
– Otherwise, cross-validate
• Leave-one-out technique (a sketch follows below)
• Drawbacks
– Expensive computation
– Multiple complexity parameters require separate runs for each combination of settings
– New measures of performance
• e.g. Akaike information criterion (AIC), Bayesian information criterion (BIC)
– AIC: choose the model maximizing $\ln p(\mathcal{D} \mid \mathbf{w}_{ML}) - M$, where $M$ is the number of adjustable parameters
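A minimal leave-one-out cross-validation sketch for choosing the polynomial order M (the synthetic data set and the candidate orders are assumptions for illustration):

```python
import numpy as np

def loo_error(x, t, M):
    """Leave-one-out sum of squared prediction errors for a degree-M polynomial."""
    err = 0.0
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        Phi = np.vander(x[mask], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[mask], rcond=None)
        pred = np.vander(x[i:i + 1], M + 1, increasing=True) @ w
        err += (pred[0] - t[i]) ** 2
    return err

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

# Pick the order with the lowest held-out error.
for M in range(0, 7):
    print(M, loo_error(x, t, M))
```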
1.4 The Curse of Dimensionality
• The High Dimensionality Problem
• Ex. Mixture of Oil, Water, Gas
- 3-Class
(Homogeneous, Annular, Laminar)
- 12 Input Variables
- Scatter Plot of x6, x7
- Predict Point X
- Simple and Naïve Approach
1.4 The Curse of Dimensionality(Cont’d)
• The Shortcomings of Naïve Approach
- The number of cells increases exponentially with the dimensionality.
- A large training data set is needed so that the cells are not empty.
1.4 The Curse of Dimensionality(Cont’d)
• Polynomial Curve Fitting Method (order M = 3)
- As D increases, the number of coefficients grows proportionally to DM
• The Volume of High Dimensional Sphere
- Concentrated in a thin shell near the surface
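A one-line check of the thin-shell claim: since the volume of a sphere of radius r scales as $r^D$, the fraction lying between radius $1-\epsilon$ and 1 is $1 - (1-\epsilon)^D$ (ε = 0.05 here is an arbitrary shell thickness):

```python
# Fraction of a unit D-dimensional sphere's volume in the shell between
# radius 1 - eps and 1: volume scales like r^D, so the fraction is
# 1 - (1 - eps)^D, which approaches 1 as D grows.
eps = 0.05
for D in (1, 2, 10, 50, 200):
    print(D, 1 - (1 - eps) ** D)
```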
1.4 The Curse of Dimensionality(Cont’d)
• Gaussian Distribution: in high dimensions, most of the probability mass is concentrated in a thin shell at a particular radius
1.5 Decision Theory
• Make Optimal Decisions
- Inference Step & Decision Step
- Select Higher Posterior Probability
• Minimizing the Misclassification Rate
- MAP
→ Minimizing Colored Area
1.5 Decision Theory (Cont’d)
• Minimizing the Expected Loss
- The cost of misclassification may differ between classes
- Introduction of a loss function (cost function)
- MAP
→ Minimizing the expected loss
• The Reject Option
- Threshold θ
- Reject if θ > the largest posterior probability (see the sketch below)
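A sketch of the minimum-expected-loss decision rule with a reject option; the loss matrix and posterior values are illustrative assumptions (a large penalty for missing the first class, in the spirit of the book's cancer-screening example), not numbers from the slides:

```python
import numpy as np

# Loss matrix L[k, j]: cost of deciding class j when the true class is k.
# Rows/cols = (class 0, class 1); missing class 0 is 1000x worse than a false alarm.
L = np.array([[0, 1000],
              [1,    0]])

def decide(posterior, theta=None):
    """Pick the class minimizing expected loss; optionally reject on low confidence."""
    if theta is not None and posterior.max() < theta:
        return "reject"
    expected_loss = posterior @ L      # expected_loss[j] = sum_k p(C_k|x) L[k, j]
    return int(np.argmin(expected_loss))

posterior = np.array([0.3, 0.7])       # p(C_0|x), p(C_1|x) for some input x
print(decide(posterior))               # unequal losses flip the decision to class 0
print(decide(posterior, theta=0.9))    # rejected: largest posterior below threshold
```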
1.5 Decision Theory (Cont’d)
• Inference and Decision
- Three Distinct Approaches
1. Obtain posterior probabilities via generative models
2. Obtain posterior probabilities via discriminative models
3. Find a discriminant function directly
1.5 Decision Theory (Cont'd)
• The Reasons to Compute the Posterior
1. Minimizing Risk
2. Reject Option
3. Compensating for Class Priors
4. Combining Models
1.5 Decision Theory (Cont’d)
• Loss Function for Regression
- Under squared loss, the optimal prediction is the conditional mean $y(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}]$
- Extends to a vector of multiple target variables
1.5 Decision Theory (Cont'd)
• Minkowski Loss: $\mathbb{E}[L_q] = \iint |y(\mathbf{x}) - t|^q\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$
1.6 Information Theory
• Entropy: $H[x] = -\sum_{x} p(x) \log_2 p(x)$
- The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
- Higher entropy, larger uncertainty
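A tiny sketch of the entropy computation; the two 8-state distributions below are illustrative (uniform vs. strongly non-uniform) and show how lower entropy means fewer bits on average:

```python
import numpy as np

def entropy_bits(p):
    """H[x] = -sum_x p(x) log2 p(x), ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([1/8] * 8))                                       # uniform: 3 bits
print(entropy_bits([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]))   # non-uniform: 2 bits
```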
1.6 Information Theory (Cont’d)
• Maximum Entropy Configuration for a Continuous Variable
- Constraints: normalization, fixed mean, fixed variance
- Result: the distribution that maximizes the differential entropy is the Gaussian
• Conditional Entropy: H[x, y] = H[y|x] + H[x]
1.6 Information Theory (Cont’d)
• Relative Entropy (Kullback-Leibler divergence):
$\mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} \;\ge\; 0$
• Convex functions (Jensen's inequality): $f\big(\mathbb{E}[x]\big) \le \mathbb{E}[f(x)]$
1.6 Information Theory (Cont'd)
• Mutual Information: $I[x, y] = \mathrm{KL}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$
- I[x, y] = H[x] – H[x|y] = H[y] – H[y|x]
- If x and y are independent, I[x,y] = 0
- The reduction in the uncertainty about x by virtue of being told the value of y
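A small discrete sketch of the KL divergence and of the mutual information defined through it, $I[x, y] = \mathrm{KL}(p(x,y) \,\|\, p(x)p(y))$; the 2×2 joint table is an arbitrary illustrative choice and values are in nats:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) ln( p(x) / q(x) ) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Joint distribution p(x, y) on a 2x2 grid (illustrative numbers).
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I[x, y] = KL( p(x, y) || p(x) p(y) ): zero iff x and y are independent.
p_indep = np.outer(p_x, p_y)
print("I[x, y] =", kl(p_xy.ravel(), p_indep.ravel()))
print("independent joint:", kl(p_indep.ravel(), p_indep.ravel()))   # 0.0
```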