Ch 1. Introduction
Pattern Recognition and Machine Learning,
C. M. Bishop, 2006.
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea
Contents
• 1.1 Example: Polynomial Curve Fitting
• 1.2 Probability Theory
– 1.2.1 Probability densities
– 1.2.2 Expectations and covariances
– 1.2.3 Bayesian probabilities
– 1.2.4 The Gaussian distribution
– 1.2.5 Curve fitting re-visited
– 1.2.6 Bayesian curve fitting
• 1.3 Model Selection
• 1.4 The Curse of Dimensionality
• 1.5 Decision Theory
• 1.6 Information Theory
Pattern Recognition
• Training set: $\{x_1, \ldots, x_N\}$
• Target vector: $\mathbf{t}$
• Training (learning) phase
– Determine $y(\mathbf{x})$
• Generalization
– Test set
• Preprocessing
– Feature selection
Supervised, Unsupervised and Reinforcement Learning
• Supervised Learning: with target vector
– Classification
– Regression
• Unsupervised learning: w/o target vector
– Clustering
– Density estimation
– Visualization
• Reinforcement learning: maximize a reward
– Trade-off between exploration & exploitation
1.1 Example: Polynomial Curve Fitting
• N observations: $\mathbf{x} = (x_1, \ldots, x_N)^T$ with targets $\mathbf{t} = (t_1, \ldots, t_N)^T$, generated from $\sin(2\pi x)$
• Polynomial model: $y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
• Minimizing the error function: $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$
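A minimal numpy sketch of this least-squares fit (the noisy $\sin(2\pi x)$ data, the choice M = 3, and the noise level are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N noisy observations of sin(2*pi*x) (noise level assumed).
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix with columns x^0, x^1, ..., x^M.
Phi = np.vander(x, M + 1, increasing=True)

# The least-squares solution minimizes E(w) = 1/2 * sum_n {y(x_n, w) - t_n}^2.
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E = 0.5 * np.sum((Phi @ w_star - t) ** 2)
print("w* =", w_star)
print("E(w*) =", E, " E_RMS =", np.sqrt(2 * E / N))  # E_RMS = sqrt(2 E(w*) / N)
```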
Model Selection & Over-fitting (1/2)
Model Selection & Over-fitting (2/2)
• RMS (Root-Mean-Square) error: $E_{\mathrm{RMS}} = \sqrt{2E(\mathbf{w}^*)/N}$
• Too large $\mathbf{w}^*$ → over-fitting
• The more data, the better the generalization
• Over-fitting is a general property of maximum likelihood
Regularization
• Regularized error function:
$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
- Shrinkage
- Ridge regression
- Weight decay
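A sketch of the corresponding closed-form ridge / weight-decay solution (synthetic data again; M = 9 and ln λ = −18 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy sin(2*pi*x) data and a high-order polynomial basis.
N, M, lam = 10, 9, np.exp(-18)          # lambda chosen only for illustration
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing E~(w) = 1/2 sum_n {y(x_n,w) - t_n}^2 + lambda/2 ||w||^2
# has the closed-form solution w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
w_ridge = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print("regularized coefficients:", w_ridge)
```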
1.2 Probability Theory
• “What is the overall probability that the selection procedure will pick an apple?”
• “Given that we have chosen an orange, what is the probability that the box we chose was the blue one?”
Rules of Probability (1/2)
• Joint probability: $p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$
• Marginal probability: $p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) = \frac{c_i}{N}$
• Conditional probability: $p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$
Rules of Probability (2/2)
• Sum rule: $p(X) = \sum_{Y} p(X, Y)$
• Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$
• Bayes' theorem: $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$, where $p(X) = \sum_{Y} p(X \mid Y)\, p(Y)$
– Posterior $\propto$ likelihood $\times$ prior; $p(X)$ is the normalizing constant
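A small numerical sketch of the two questions from the fruit-and-boxes example. The box contents and selection probabilities below are assumptions in the spirit of the textbook setup (red box chosen with probability 0.4 holding 2 apples and 6 oranges, blue box chosen with probability 0.6 holding 3 apples and 1 orange):

```python
# Prior over boxes and conditional probabilities of fruit given box (assumed values).
p_box = {"red": 0.4, "blue": 0.6}
p_fruit_given_box = {
    "red":  {"apple": 2 / 8, "orange": 6 / 8},
    "blue": {"apple": 3 / 4, "orange": 1 / 4},
}

# Sum rule + product rule: p(fruit) = sum_box p(fruit | box) p(box)
def p_fruit(fruit):
    return sum(p_fruit_given_box[b][fruit] * p_box[b] for b in p_box)

# Bayes' theorem: p(box | fruit) = p(fruit | box) p(box) / p(fruit)
def p_box_given_fruit(box, fruit):
    return p_fruit_given_box[box][fruit] * p_box[box] / p_fruit(fruit)

print("p(apple) =", p_fruit("apple"))                             # 0.55
print("p(blue | orange) =", p_box_given_fruit("blue", "orange"))  # 1/3
```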
Probability densities
• $p(x \in (a, b)) = \int_a^b p(x)\, dx$, with $p(x) \ge 0$ and $\int_{-\infty}^{\infty} p(x)\, dx = 1$
Expectations and Covariances
• Expectation: $\mathbb{E}[f] = \sum_{x} p(x)\, f(x)$
• Variance: $\mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$
• Covariance: $\mathrm{cov}[x, y] = \mathbb{E}_{x,y}\big[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]$
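A quick numpy check of these definitions and of the identity $\mathrm{cov}[x,y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y]$ on sampled data (the joint distribution used here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw correlated samples (x, y) from an arbitrary joint distribution.
n = 200_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.3, size=n)

f = np.sin(x)                               # any function f(x)
E_f = f.mean()                              # E[f], estimated by the sample average
var_f = np.mean((f - E_f) ** 2)             # var[f] = E[(f - E[f])^2]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# The two sides of cov[x, y] = E[xy] - E[x] E[y] should agree closely.
print(cov_xy, np.mean(x * y) - x.mean() * y.mean())
```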
Bayesian Probabilities - Frequentist vs. Bayesian
• Bayes' theorem for the parameters: $p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$
• Likelihood: $p(\mathcal{D} \mid \mathbf{w})$
• Frequentist: $\mathbf{w}$ is a fixed parameter determined by an 'estimator'
– Maximum likelihood: error function $= -\ln p(\mathcal{D} \mid \mathbf{w})$
– Error bars: obtained from the distribution of possible data sets $\mathcal{D}$ (bootstrap)
• Bayesian: a single data set $\mathcal{D}$
– A probability distribution over $\mathbf{w}$ expresses the uncertainty in the parameters
– Prior knowledge: e.g. a noninformative prior
Bayesian Probabilities
-Expansion of Bayesian Application
• Application of the full Bayesian procedure was limited for a long time
– Bayesian methods originated in the 18th century
– Making predictions or comparing different models requires marginalizing over the whole of parameter space
• Markov chain Monte Carlo sampling methods (Chap. 11)
– Small-scale problems
• Highly efficient deterministic approximation schemes
– e.g. variational Bayes, expectation propagation (Chap. 10)
– Large-scale problems
Gaussian distribution
• Univariate Gaussian distribution:
$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}$
• D-dimensional multivariate Gaussian distribution:
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$
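A direct numpy transcription of the multivariate density formula (the mean, covariance, and test point below are arbitrary illustrative values):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the D-dimensional Gaussian density N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return norm * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```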
Gaussian distribution
-Example (1/2)
• Getting the unknown parameters $\mu$ and $\sigma^2$
• Data points are i.i.d.:
$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$
$\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)$
– Maximizing with respect to $\mu$
• sample mean: $\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n$
– Maximizing with respect to the variance $\sigma^2$
• sample variance: $\sigma_{ML}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2$
Gaussian distribution
-Example (2/2)
• Bias phenomenon
– Limitation of the maximum likelihood approach:
$\mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}[\sigma_{ML}^2] = \frac{N-1}{N}\, \sigma^2$
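A short simulation of this bias (the true parameters, the sample size N = 5, and the number of trials are arbitrary choices): repeated ML estimates average to roughly $\mu$ but only $(N-1)/N \cdot \sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, var_true, N = 2.0, 4.0, 5

# Draw many data sets of size N and compute the ML estimates for each.
trials = 100_000
data = rng.normal(mu_true, np.sqrt(var_true), size=(trials, N))
mu_ml = data.mean(axis=1)
var_ml = ((data - mu_ml[:, None]) ** 2).mean(axis=1)   # divides by N, not N-1

print("average mu_ML      :", mu_ml.mean())            # ~ mu (unbiased)
print("average sigma^2_ML :", var_ml.mean())           # ~ (N-1)/N * sigma^2
print("(N-1)/N * sigma^2  :", (N - 1) / N * var_true)  # 3.2 for these values
```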
Curve Fitting Re-visited (1/2)
• Goal in the curve fitting problem
– Prediction for the target variable t given some new input variable x
$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big)$
• Determine the unknown $\mathbf{w}$ and $\beta$ by maximum likelihood
$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\big)$
$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)$
Curve Fitting Re-visited (2/2)
• $\mathbf{w}_{ML}$: maximizing the likelihood = minimizing the sum-of-squares error function
• $\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{ML}) - t_n \}^2$
• Predictive distribution:
$p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1}\big)$
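A sketch combining the pieces: fit $\mathbf{w}_{ML}$ by least squares, estimate the noise precision $\beta_{ML}$ from the residuals, and report the predictive mean and standard deviation at a new input (the data generation, M, and the query point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Noisy sin(2*pi*x) data and an order-M polynomial fit, as before.
N, M = 15, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
Phi = np.vander(x, M + 1, increasing=True)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1/beta_ML = (1/N) sum_n { y(x_n, w_ML) - t_n }^2
residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)

# Predictive distribution at a new input: N(t | y(x, w_ML), beta_ML^{-1})
x_new = 0.35
mean = np.vander([x_new], M + 1, increasing=True) @ w_ml
print("predictive mean =", mean[0], " predictive std =", np.sqrt(1.0 / beta_ml))
```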
Maximum Posterior (MAP)
• Add a prior probability: $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$
– $\alpha$: hyperparameter
• Posterior: $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$
– Maximizing the posterior = minimizing
$\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2} \|\mathbf{w}\|^2$
which equals the regularized sum-of-squares error (1.4) with $\lambda = \alpha / \beta$
Bayesian Curve Fitting
• Marginalization over $\mathbf{w}$:
$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w}$
1.3 Model Selection
• Proper model complexity
→ Good generalization & best model
• Measuring the generalization performance
– If data are plentiful, divide into training, validation & test set
– Otherwise, cross-validate
• Leave-one-out technique (a sketch follows below)
• Drawbacks
– Expensive computation
– Multiple complexity parameters require separate runs for each combination of settings
– New measures of performance
• e.g. Akaike information criterion (AIC), Bayesian information criterion (BIC)
– AIC: choose the model maximizing $\ln p(\mathcal{D} \mid \mathbf{w}_{ML}) - M$, where $M$ is the number of adjustable parameters
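A minimal leave-one-out cross-validation sketch for choosing the polynomial order M (the synthetic data set and the candidate orders are assumptions for illustration):

```python
import numpy as np

def loo_error(x, t, M):
    """Leave-one-out sum of squared prediction errors for a degree-M polynomial."""
    err = 0.0
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        Phi = np.vander(x[mask], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[mask], rcond=None)
        pred = np.vander(x[i:i + 1], M + 1, increasing=True) @ w
        err += (pred[0] - t[i]) ** 2
    return err

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

# Pick the order with the lowest held-out error.
for M in range(0, 7):
    print(M, loo_error(x, t, M))
```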
1.4 The Curse of Dimensionality
• The High Dimensionality Problem
• Ex. Mixture of Oil, Water, Gas
- 3-Class
(Homogeneous, Annular, Laminar)
- 12 Input Variables
- Scatter Plot of x6, x7
- Predict Point X
- Simple and Naïve Approach
1.4 The Curse of Dimensionality(Cont’d)
• The Shortcomings of Naïve Approach
- The number of cells increases exponentially with the dimensionality.
- A large training data set is needed so that the cells are not empty.
1.4 The Curse of Dimensionality(Cont’d)
• Polynomial Curve Fitting Method (order M = 3)
- As D increases, the number of coefficients grows proportionally to DM
• The Volume of High Dimensional Sphere
- Concentrated in a thin shell near the surface
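A one-line check of the thin-shell claim: since the volume of a sphere of radius r scales as $r^D$, the fraction lying between radius $1-\epsilon$ and 1 is $1 - (1-\epsilon)^D$ (ε = 0.05 here is an arbitrary shell thickness):

```python
# Fraction of a unit D-dimensional sphere's volume in the shell between
# radius 1 - eps and 1: volume scales like r^D, so the fraction is
# 1 - (1 - eps)^D, which approaches 1 as D grows.
eps = 0.05
for D in (1, 2, 10, 50, 200):
    print(D, 1 - (1 - eps) ** D)
```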
1.4 The Curse of Dimensionality(Cont’d)
• Gaussian Distribution: in high dimensions, most of the probability mass is concentrated in a thin shell at a particular radius
1.5 Decision Theory
• Make Optimal Decisions
- Inference Step & Decision Step
- Select Higher Posterior Probability
• Minimizing the Misclassification Rate
- MAP
→ Minimizing Colored Area
1.5 Decision Theory (Cont’d)
• Minimizing the Expected Loss
- The cost of misclassification may differ between classes
- Introduction of a loss function (cost function)
- MAP
→ Minimizing the expected loss
• The Reject Option
- Threshold θ
- Reject if θ > the largest posterior probability (see the sketch below)
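A sketch of the minimum-expected-loss decision rule with a reject option; the loss matrix and posterior values are illustrative assumptions (a large penalty for missing the first class, in the spirit of the book's cancer-screening example), not numbers from the slides:

```python
import numpy as np

# Loss matrix L[k, j]: cost of deciding class j when the true class is k.
# Rows/cols = (class 0, class 1); missing class 0 is 1000x worse than a false alarm.
L = np.array([[0, 1000],
              [1,    0]])

def decide(posterior, theta=None):
    """Pick the class minimizing expected loss; optionally reject on low confidence."""
    if theta is not None and posterior.max() < theta:
        return "reject"
    expected_loss = posterior @ L      # expected_loss[j] = sum_k p(C_k|x) L[k, j]
    return int(np.argmin(expected_loss))

posterior = np.array([0.3, 0.7])       # p(C_0|x), p(C_1|x) for some input x
print(decide(posterior))               # unequal losses flip the decision to class 0
print(decide(posterior, theta=0.9))    # rejected: largest posterior below threshold
```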
1.5 Decision Theory (Cont’d)
• Inference and Decision
- Three Distinct Approaches
1. Obtain posterior probabilities via generative models
2. Obtain posterior probabilities via discriminative models
3. Find a discriminant function directly
1.5 Decision Theory (Cont'd)
• The Reasons to Compute the Posterior
1. Minimizing Risk
2. Reject Option
3. Compensating for Class Priors
4. Combining Models
1.5 Decision Theory (Cont’d)
• Loss Function for Regression
- Under squared loss, the optimal prediction is the conditional mean $y(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}]$
- Extends to a vector of multiple target variables
1.5 Decision Theory (Cont'd)
• Minkowski Loss: $\mathbb{E}[L_q] = \iint |y(\mathbf{x}) - t|^q\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$
1.6 Information Theory
• Entropy: $H[x] = -\sum_{x} p(x) \log_2 p(x)$
- The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
- Higher entropy, larger uncertainty
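A tiny sketch of the entropy computation; the two 8-state distributions below are illustrative (uniform vs. strongly non-uniform) and show how lower entropy means fewer bits on average:

```python
import numpy as np

def entropy_bits(p):
    """H[x] = -sum_x p(x) log2 p(x), ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([1/8] * 8))                                       # uniform: 3 bits
print(entropy_bits([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]))   # non-uniform: 2 bits
```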
1.6 Information Theory (Cont’d)
• Maximum Entropy Configuration for a Continuous Variable
- Constraints: normalization, fixed mean, fixed variance
- Result: the distribution that maximizes the differential entropy is the Gaussian
• Conditional Entropy: H[x, y] = H[y|x] + H[x]
1.6 Information Theory (Cont’d)
• Relative Entropy (Kullback-Leibler divergence):
$\mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} \;\ge\; 0$
• Convex functions (Jensen's inequality): $f\big(\mathbb{E}[x]\big) \le \mathbb{E}[f(x)]$
1.6 Information Theory (Cont'd)
• Mutual Information: $I[x, y] = \mathrm{KL}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$
- I[x, y] = H[x] – H[x|y] = H[y] – H[y|x]
- If x and y are independent, I[x,y] = 0
- The reduction in the uncertainty about x by virtue of being told the value of y
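A small discrete sketch of the KL divergence and of the mutual information defined through it, $I[x, y] = \mathrm{KL}(p(x,y) \,\|\, p(x)p(y))$; the 2×2 joint table is an arbitrary illustrative choice and values are in nats:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) ln( p(x) / q(x) ) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Joint distribution p(x, y) on a 2x2 grid (illustrative numbers).
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I[x, y] = KL( p(x, y) || p(x) p(y) ): zero iff x and y are independent.
p_indep = np.outer(p_x, p_y)
print("I[x, y] =", kl(p_xy.ravel(), p_indep.ravel()))
print("independent joint:", kl(p_indep.ravel(), p_indep.ravel()))   # 0.0
```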