Computer Vision: Models, Learning and Inference
Chapter 4: Fitting Probability Models
TRANSCRIPT
Structure
Computer vision: models, learning and inference. ©2011 Simon J.D. Prince
• Fitting probability distributions
  – Maximum likelihood
  – Maximum a posteriori
  – Bayesian approach
• Worked example 1: Normal distribution
• Worked example 2: Categorical distribution
Maximum Likelihood

Fitting: as the name suggests, find the parameters under which the data are most likely.

Predictive density: evaluate a new data point under the probability distribution with the best parameters.

We have assumed that the data are independent (hence the product).
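Written out in standard notation (data x_1 … x_I assumed independent, parameters θ), the fitting criterion and predictive density described above are:

```latex
% ML fit: parameters that maximize the joint likelihood of the data
\hat{\theta} = \operatorname*{argmax}_{\theta}
  \left[ \prod_{i=1}^{I} Pr(x_i \mid \theta) \right]

% Predictive density: evaluate a new point x^* under the fitted model
Pr(x^{*} \mid \hat{\theta})
```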
Maximum a posteriori (MAP)

Fitting: as the name suggests, we find the parameters which maximize the posterior probability.

Again we have assumed that the data are independent.
Since the denominator doesn't depend on the parameters, we can instead maximize the numerator: the likelihood times the prior.
Predictive density: evaluate a new data point under the probability distribution with the MAP parameters.
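In symbols (a standard formulation, consistent with the slides' description), the MAP criterion via Bayes' rule, and its simplified form once the parameter-independent denominator is dropped:

```latex
\hat{\theta} = \operatorname*{argmax}_{\theta}
  \frac{\prod_{i=1}^{I} Pr(x_i \mid \theta)\, Pr(\theta)}{Pr(x_{1 \ldots I})}
  = \operatorname*{argmax}_{\theta}
  \left[ \prod_{i=1}^{I} Pr(x_i \mid \theta)\, Pr(\theta) \right]

% Predictive density with the MAP estimate
Pr(x^{*} \mid \hat{\theta})
```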
Bayesian Approach

Fitting: compute the posterior distribution over possible parameter values using Bayes' rule.

Principle: why pick one set of parameters? There are many values that could have explained the data. Try to capture all of the possibilities.
Bayesian Approach: Predictive Density

• Each possible parameter value makes a prediction
• Some parameters are more probable than others

Make a prediction that is an infinite weighted sum (integral) of the predictions for each parameter value, where the weights are the probabilities.
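In symbols, the posterior and the resulting Bayesian predictive density are:

```latex
% Posterior over parameters (Bayes' rule)
Pr(\theta \mid x_{1 \ldots I}) =
  \frac{\prod_{i=1}^{I} Pr(x_i \mid \theta)\, Pr(\theta)}{Pr(x_{1 \ldots I})}

% Predictive density: an infinite weighted sum (integral) of the
% predictions for each parameter value, weighted by the posterior
Pr(x^{*} \mid x_{1 \ldots I}) =
  \int Pr(x^{*} \mid \theta)\, Pr(\theta \mid x_{1 \ldots I})\, d\theta
```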
Predictive densities for 3 methods
Maximum a posteriori:
Evaluate new data point under probability distribution with MAP parameters
Maximum likelihood:
Evaluate new data point under probability distribution with ML parameters
Bayesian:
Calculate weighted sum of predictions from all possible values of parameters
How to rationalize different forms?
Consider the ML and MAP estimates as probability distributions with zero probability everywhere except at the estimate (i.e., delta functions).
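Under this view, all three predictive densities share one form: with a delta-function "posterior" the Bayesian integral collapses to the point-estimate prediction:

```latex
Pr(x^{*} \mid x_{1 \ldots I}) =
  \int Pr(x^{*} \mid \theta)\, \delta(\theta - \hat{\theta})\, d\theta
  = Pr(x^{*} \mid \hat{\theta})
```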
Structure
• Fitting probability distributions
  – Maximum likelihood
  – Maximum a posteriori
  – Bayesian approach
• Worked example 1: Normal distribution
• Worked example 2: Categorical distribution
Univariate Normal Distribution

The univariate normal distribution describes a single continuous variable. It takes 2 parameters, μ and σ² > 0.

For short we write: Pr(x) = Norm_x[μ, σ²].
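The density itself (the equation shown on the slide) is the standard normal pdf:

```latex
% Univariate normal pdf with mean mu and variance sigma^2
Pr(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
  \exp\!\left[ -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right]
```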
Normal Inverse Gamma Distribution

Defined on 2 variables, μ and σ² > 0. Takes four parameters: α, β, γ > 0 and δ.

For short we write: Pr(μ, σ²) = NormInvGam_{μ,σ²}[α, β, γ, δ].
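Its density, as given in the book (γ scales how tightly μ concentrates around the prior mean δ; worth cross-checking the exact constants against the text):

```latex
% Normal-scaled inverse gamma density over mu and sigma^2
Pr(\mu, \sigma^{2}) =
  \frac{\sqrt{\gamma}}{\sigma\sqrt{2\pi}} \,
  \frac{\beta^{\alpha}}{\Gamma(\alpha)} \,
  \left( \frac{1}{\sigma^{2}} \right)^{\alpha+1}
  \exp\!\left[ -\frac{2\beta + \gamma(\delta-\mu)^{2}}{2\sigma^{2}} \right]
```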
Ready?
• Approach the same problem 3 different ways:
  – Learn ML parameters
  – Learn MAP parameters
  – Learn a Bayesian distribution over parameters
• Will we get the same results?
Fitting normal distribution: ML

As the name suggests, we find the parameters under which the data are most likely. The likelihood is given by the pdf.
Fitting a normal distribution: ML
Plotted surface of likelihoods as a function of possible parameter values
ML Solution is at peak
Fitting normal distribution: ML

Algebraically, we maximize the product of the individual likelihoods, or alternatively we can maximize its logarithm.
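Expanding the log of the normal pdf gives the objective that is then differentiated (a standard derivation, consistent with the slides):

```latex
\hat{\mu}, \hat{\sigma}^{2}
  = \operatorname*{argmax}_{\mu,\sigma^{2}}
    \sum_{i=1}^{I} \log\!\left( \mathrm{Norm}_{x_i}\!\left[\mu, \sigma^{2}\right] \right)
  = \operatorname*{argmax}_{\mu,\sigma^{2}}
    \left[ -0.5\, I \log(2\pi) - 0.5\, I \log \sigma^{2}
      - 0.5 \sum_{i=1}^{I} \frac{(x_i - \mu)^{2}}{\sigma^{2}} \right]
```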
Why the logarithm?
The logarithm is a monotonic transformation.
Hence, the position of the peak stays in the same place
But the log likelihood is easier to work with.
Fitting normal distribution: ML

How to maximize a function? Take the derivative, equate it to zero, and solve.
Fitting normal distribution: ML

Maximum likelihood solution: the sample mean and the average squared deviation from it. Should look familiar!
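A minimal sketch of the closed-form ML solution (sample mean, and variance normalized by I rather than I−1; the function names here are my own, not the book's):

```python
import math

def fit_normal_ml(data):
    """Maximum likelihood fit of a univariate normal: mu is the sample
    mean; sigma^2 is the average squared deviation (divided by I, not I-1)."""
    I = len(data)
    mu = sum(data) / I
    var = sum((x - mu) ** 2 for x in data) / I
    return mu, var

def normal_pdf(x, mu, var):
    """Univariate normal density at x: the predictive density under
    the ML (or MAP) parameters."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu, var = fit_normal_ml([1.0, 2.0, 3.0, 4.0])
# mu = 2.5, var = 1.25
```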
Least Squares

Maximum likelihood for the normal distribution gives the 'least squares' fitting criterion: maximizing the likelihood over μ is equivalent to minimizing the sum of squared deviations Σ_i (x_i − μ)².
Fitting normal distribution: MAP

Fitting: as the name suggests, we find the parameters which maximize the posterior probability. The likelihood is the normal pdf.
Fitting normal distribution: MAP

Prior: use the conjugate prior, the normal-scaled inverse gamma.
Fitting normal distribution: MAP
(Figure: likelihood, prior, and resulting posterior.)
Fitting normal distribution: MAP
Again we maximize the log, which does not change the position of the maximum.
Fitting normal distribution: MAP

MAP solution: the mean can be rewritten as a weighted sum of the data mean and the prior mean.
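A sketch of the closed-form MAP solution under the normal-scaled inverse gamma prior (formulas follow the book's derivation; the exact normalizers are worth double-checking against the text, and the helper name is my own):

```python
def fit_normal_map(data, alpha, beta, gamma, delta):
    """MAP fit of a univariate normal under a normal-scaled inverse
    gamma prior (alpha, beta, gamma, delta). The MAP mean is a
    weighted sum of the data sum and the prior mean delta."""
    I = len(data)
    mu = (sum(data) + gamma * delta) / (I + gamma)
    var = (sum((x - mu) ** 2 for x in data)
           + 2 * beta + gamma * (delta - mu) ** 2) / (I + 3 + 2 * alpha)
    return mu, var

# With lots of data the estimate approaches the ML answer; with
# little data it is pulled toward the prior mean delta.
mu, var = fit_normal_map([1.0, 2.0, 3.0, 4.0],
                         alpha=1.0, beta=1.0, gamma=1.0, delta=0.0)
```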
Fitting normal distribution: MAP

(Figure: MAP fits with 50, 5, and 1 data points.)
Fitting normal: Bayesian approach

Fitting: compute the posterior distribution using Bayes' rule.
Two constants MUST cancel out, or the LHS would not be a valid pdf.
The posterior has the same form as the prior (conjugacy): a normal-scaled inverse gamma with updated parameters.
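The standard conjugate update (symbols match the prior's α, β, γ, δ; as derived in the book, and worth verifying there):

```latex
Pr(\mu, \sigma^{2} \mid x_{1 \ldots I})
  = \mathrm{NormInvGam}_{\mu,\sigma^{2}}
    [\tilde{\alpha}, \tilde{\beta}, \tilde{\gamma}, \tilde{\delta}]

\tilde{\alpha} = \alpha + I/2 \qquad
\tilde{\gamma} = \gamma + I \qquad
\tilde{\delta} = \frac{\gamma\delta + \sum_i x_i}{\gamma + I}

\tilde{\beta} = \frac{\sum_i x_i^{2}}{2} + \beta + \frac{\gamma\delta^{2}}{2}
  - \frac{\left(\gamma\delta + \sum_i x_i\right)^{2}}{2(\gamma + I)}
```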
Fitting normal: Bayesian approach
Predictive density

Take a weighted sum of the predictions from the different parameter values.

(Figure: the posterior over parameters, and samples drawn from the posterior.)
The integral can be evaluated in closed form as a ratio of the normalizing constants of two normal-scaled inverse gamma distributions.
Fitting normal: Bayesian approach

(Figure: predictive densities with 50, 5, and 1 data points.)
Structure
• Fitting probability distributions
  – Maximum likelihood
  – Maximum a posteriori
  – Bayesian approach
• Worked example 1: Normal distribution
• Worked example 2: Categorical distribution
Categorical Distribution

The categorical distribution describes a situation where there are K possible outcomes, x = 1 … K. It takes K parameters λ_k ≥ 0, where Σ_k λ_k = 1.

Alternatively, we can think of the data as a vector with all elements zero except the kth, e.g. [0, 0, 0, 1, 0].

For short we write: Pr(x = k) = λ_k, or Cat_x[λ_1 … λ_K].
Dirichlet Distribution

Defined over K values λ_1 … λ_K, where λ_k ≥ 0 and Σ_k λ_k = 1. Has K parameters α_k > 0.
Categorical distribution: ML

Maximize the product of the individual likelihoods (remember, Pr(x = k) = λ_k).

N_k = number of times we observed bin k.
Categorical distribution: ML

Instead maximize the log probability, with a Lagrange multiplier to ensure that the parameters sum to one. Take the derivative, set it to zero, and re-arrange: the solution is λ_k = N_k / Σ_m N_m, the proportion of observations in each bin.
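The ML solution above can be sketched as follows (function names are illustrative, not from the book):

```python
from collections import Counter

def fit_categorical_ml(data, K):
    """ML fit of a categorical distribution over outcomes 1..K:
    lambda_k = N_k / (total observations), the observed proportion."""
    counts = Counter(data)
    I = len(data)
    return [counts.get(k, 0) / I for k in range(1, K + 1)]

lam = fit_categorical_ml([1, 2, 2, 3, 3, 3], K=3)
# lam = [1/6, 2/6, 3/6]
```

Note that any bin never observed gets probability exactly zero under ML, which the Bayesian treatment later avoids.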
Categorical distribution: MAP

MAP criterion: maximize the product of the categorical likelihood and the Dirichlet prior over the parameters.
Categorical distribution: MAP

Take the derivative, set it to zero, and re-arrange: the solution is λ_k = (N_k + α_k − 1) / Σ_m (N_m + α_m − 1). With a uniform prior (α_{1..K} = 1), this gives the same result as maximum likelihood.
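A sketch of the MAP solution (the Dirichlet prior contributes α_k − 1 pseudo-counts; helper names are my own):

```python
def fit_categorical_map(counts, alphas):
    """MAP fit of a categorical under a Dirichlet prior: the posterior
    mode, lambda_k proportional to N_k + alpha_k - 1.
    counts[k-1] holds N_k; alphas[k-1] holds alpha_k."""
    weights = [n + a - 1 for n, a in zip(counts, alphas)]
    total = sum(weights)
    return [w / total for w in weights]

# With a uniform prior (all alpha_k = 1) this reduces to the ML answer.
lam = fit_categorical_map([1, 2, 3], alphas=[1, 1, 1])
# lam = [1/6, 2/6, 3/6]
```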
Categorical Distribution

(Figure: observed data, five samples from the prior, and five samples from the posterior.)
Categorical Distribution: Bayesian approach

Compute the posterior distribution over parameters. Two constants MUST cancel out, or the LHS would not be a valid pdf. By conjugacy, the posterior is again a Dirichlet, with parameters α_k + N_k.
Categorical Distribution: Bayesian approach

Compute the predictive distribution. Again, two constants MUST cancel out, or the LHS would not be a valid pdf.
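The Bayesian predictive for the categorical works out to a simple closed form, the mean of the Dirichlet posterior (a standard result; helper names are illustrative):

```python
def categorical_predictive(counts, alphas):
    """Bayesian predictive for a categorical with a Dirichlet prior:
    Pr(x* = k | data) = (N_k + alpha_k) / sum_m (N_m + alpha_m).
    Note '+ alpha_k' with no '-1': we average over the posterior
    rather than take its mode, so empty bins keep nonzero probability."""
    weights = [n + a for n, a in zip(counts, alphas)]
    total = sum(weights)
    return [w / total for w in weights]

p = categorical_predictive([0, 2, 3], alphas=[1, 1, 1])
# p = [1/8, 3/8, 4/8]; the unobserved first bin is not assigned zero
```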
ML / MAP vs. Bayesian

(Figure: predictive distributions compared for MAP/ML and the Bayesian approach.)
Conclusion

• Three ways to fit probability distributions:
  – Maximum likelihood
  – Maximum a posteriori
  – Bayesian approach
• Two worked examples:
  – Normal distribution (ML = least squares)
  – Categorical distribution