COMS21202: Symbols, Patterns and Signals
Probabilistic Data Models
[modified from Dima Damen lecture notes]
Rui Ponte Costa & Dima [email protected]
Department of Computer Science, University of Bristol, Bristol BS8 1UB, UK
March 9, 2020
Data Modelling
- Deterministic models do not explicitly model uncertainties or 'randomness' in data
- Inference variability from the data is not included
- In many tasks, we benefit from modelling uncertainty
- This is explicit in Probabilistic Models
Back to Fish - Discrete case
Discrete variable:
Example
A fisherman returns with the daily catch of fish. If we select a fish at random from the hold, what species will it be?
fish ∈ {salmon, seabass, cod , ...}
- A deterministic model would give one value, the most likely
- A probabilistic model quantifies the chance/probability of the selected fish being one of the possible species.
- Model the probability P(x_i = q_i) where q_i ∈ {salmon, seabass, cod, · · · }
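As a concrete sketch of the discrete case (the species names come from the slide, but the catch counts below are invented for illustration), the probabilities P(x_i = q_i) can be estimated from a day's catch by relative frequency:

```python
from collections import Counter

# Hypothetical daily catch; the counts are made up for illustration
catch = ["salmon"] * 50 + ["seabass"] * 30 + ["cod"] * 20

def species_probabilities(observations):
    """Estimate P(fish = q) for each species q by relative frequency."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {species: n / total for species, n in counts.items()}

probs = species_probabilities(catch)
# A deterministic model would return only the most likely species;
# the probabilistic model keeps the whole distribution
most_likely = max(probs, key=probs.get)
```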
Back to Fish - Continuous case
Continuous variable:
Example
Predict the weight of fish from its length
Let us assume that we think the weight of fish is directly proportional to its length, i.e. weight = b × length + a (linear regression)
A probabilistic approach would model weight as a random variable and hypothesize that
weight = b × length + a + ε
where ε is a random variable, with mean usually close to zero
Back to Fish - Continuous case
weight = b × length + a + ε
- Modelled using a probability distribution for ε,
  - by a uniform distribution
  - by a normal distribution
  - · · ·
- In the next slides, we will simplify things by setting weight = 0 when length = 0
- As a consequence, the y-intercept can be set to zero (i.e. a = 0), and
weight = b × length + ε
Back to Fish - Probabilistic
weight = b × length + ε
We can assume, for example, that ε is given by N(0, σ²):

p(ε) = (1 / (√(2π) σ)) e^(−ε² / (2σ²))
This model has two parameters: the slope b and uncertainty σ (with µ = 0)
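A minimal simulation of this two-parameter model, assuming made-up values for the slope b, the uncertainty σ and the range of lengths:

```python
import numpy as np

rng = np.random.default_rng(0)
b_true, sigma = 2.5, 0.4                   # assumed slope and uncertainty
lengths = rng.uniform(20, 80, size=1000)   # made-up fish lengths
noise = rng.normal(0.0, sigma, size=1000)  # eps ~ N(0, sigma^2), with mu = 0
weights = b_true * lengths + noise         # weight = b * length + eps
```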
Maximum Likelihood Estimation
- Similar to building deterministic models, probabilistic model parameters need to be tuned/trained
- Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a probabilistic model.
- Assume θ is a vector of all parameters of the probabilistic model (e.g. θ = {b, σ}).
- MLE is an extremum estimator [1] obtained by maximising an objective function of θ
[1] "Extremum estimators are a wide class of estimators for parametric models that are calculated through maximization (or minimization) of a certain objective function, which depends on the data." wikipedia.org
Maximum Likelihood Estimation
Definition

Assume f(θ) is an objective function to be optimised (e.g. maximised). The arg max corresponds to the value of θ that attains the maximum value of the objective function f:

θ̂ = arg max_θ f(θ)

- Tuning the parameter is then equal to finding the maximising argument, arg max
Maximum Likelihood Estimation - General

- Maximum Likelihood Estimation (MLE) is a common method for solving such problems

θ_MLE = arg max_θ p(D|θ) = arg max_θ ln p(D|θ) = arg min_θ − ln p(D|θ)
MLE Recipe

1. Determine θ, D and expression for likelihood p(D|θ)
2. Take the natural logarithm of the likelihood
3. Take the derivative of ln p(D|θ) w.r.t. θ. If θ is a multi-dimensional vector, take partial derivatives
4. Set derivative(s) to 0 and solve for θ
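The next slides carry out these four steps analytically for the fish model. As a sanity check, the same recipe can also be run numerically: write down − ln p(D|θ) and minimise it, here by a simple grid search over the slope b on synthetic data (all values below are assumptions for illustration):

```python
import numpy as np

# Synthetic data under the model y_i = b x_i + eps (assumed values)
rng = np.random.default_rng(1)
b_true, sigma = 2.0, 1.0
x = rng.uniform(1, 10, size=200)
y = b_true * x + rng.normal(0.0, sigma, size=200)

def neg_log_likelihood(b):
    """-ln p(D|b) for Gaussian noise with known sigma (steps 1-2)."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (y - b * x) ** 2 / (2 * sigma**2))

# Steps 3-4, done numerically: minimise over a grid of candidate slopes
bs = np.linspace(0.0, 4.0, 4001)
b_mle = bs[np.argmin([neg_log_likelihood(b) for b in bs])]
```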
MLE: 1. Define likelihood
Given a set of N data points - x_i is length and y_i is weight in our fishy example

D = {(x_1, y_1), (x_2, y_2), · · · , (x_N, y_N)}

- The probabilistic approach would derive an expression for the conditional probability of observing data D given parameters θ = {b, σ}:

p(D|θ)
MLE: 1. Define likelihood

Given a set of N data points

D = {(x_1, y_1), (x_2, y_2), · · · , (x_N, y_N)}

Assume that observations are independent - a common assumption often referred to as i.i.d. (independent and identically distributed) - then:

p(D|θ) = ∏_{i=1}^{N} p(y_i | x_i, θ)

Given y_i = b x_i + ε, and ε is N(0, σ²), then

p(y_i | x_i, θ) ∼ N(b x_i, σ²)

For a large sample:
- The average of the y_i values will be b x_i
- The 'spread' or variance will be the same as for ε, defined by σ²
MLE: 1. Define likelihood

The conditional probability (for all data) is thus formulated as

p(D|θ) = ∏_{i=1}^{N} p(y_i | x_i, θ) = ∏_{i=1}^{N} (1 / (√(2π) σ)) e^(−(y_i − b x_i)² / (2σ²))
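This product form also explains why step 2 of the recipe takes the logarithm: a product of many densities underflows double-precision floats, while ln turns it into a numerically stable sum. A small demonstration on synthetic data (values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
b, sigma = 2.0, 0.5
x = rng.uniform(1, 10, size=2000)
y = b * x + rng.normal(0.0, sigma, size=2000)

# Per-point Gaussian densities p(y_i | x_i, theta)
dens = np.exp(-0.5 * ((y - b * x) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

raw_product = np.prod(dens)            # underflows to 0.0 for large N
log_likelihood = np.sum(np.log(dens))  # the ln of the same quantity, computed stably
```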
MLE: 2. Take natural logarithm

We will focus on parameter b for the next steps,

b_ML = arg max_b p(D|θ)

     = arg max_b ∏_{i=1}^{N} (1/(√(2π) σ)) e^(−(y_i − b x_i)² / (2σ²))

     = arg max_b ln( ∏_{i=1}^{N} (1/(√(2π) σ)) e^(−(y_i − b x_i)² / (2σ²)) )    (use ln trick)

     = arg max_b ∑_{i=1}^{N} ln( (1/(√(2π) σ)) e^(−(y_i − b x_i)² / (2σ²)) )    (ln prop. 1)

     = arg max_b ∑_{i=1}^{N} [ ln(1/(√(2π) σ)) − (y_i − b x_i)² / (2σ²) ]    (ln prop. 2)

     = arg max_b ∑_{i=1}^{N} −(1/(2σ²)) (y_i − b x_i)²    (discard terms without b, i.e. ln(1/(√(2π) σ)))

     = arg min_b ∑_{i=1}^{N} (1/(2σ²)) (y_i − b x_i)²    (switch to minimisation form)
Data Modelling - Deterministic vs Probabilistic

- Deterministic Least Squares [Lecture 3]:

b_LS = arg min_b R(b, a = 0) = arg min_b ∑_i (y_i − b x_i)²

- Probabilistic Maximum Likelihood:

b_ML = arg min_b ∑_i (1/(2σ²)) (y_i − b x_i)²

- The probabilistic model explicitly considers uncertainty, σ².
MLE: 3. Take derivatives and 4. Find solution

To simplify the calculations, here we assume σ = 1,

b_ML = arg min_b ∑_i ½ (y_i − b x_i)²

To find the minimum, calculate the derivative

d/db ∑_i ½ (y_i − b x_i)² = −∑_i x_i (y_i − b x_i)

and equate it to zero

−∑_i x_i (y_i − b_ML x_i) = 0

∑_i x_i y_i − b_ML ∑_i x_i² = 0

b_ML = (∑_i y_i x_i) / (∑_i x_i²)
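The closed-form solution above is a one-liner on data. A quick check on synthetic data with an assumed true slope, to confirm the estimate recovers it:

```python
import numpy as np

rng = np.random.default_rng(3)
b_true = 1.7                                # assumed true slope
x = rng.uniform(5, 50, size=500)            # made-up lengths
y = b_true * x + rng.normal(0.0, 1.0, size=500)

# b_ML = (sum_i y_i x_i) / (sum_i x_i^2), as derived above
b_ml = np.sum(y * x) / np.sum(x * x)
```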
MLE: summary

Example: normal distribution (parameter b)

1. Determine θ, D and expression for likelihood p(D|θ)

p(D|b) = ∏_{i=1}^{N} (1/(√(2π) σ)) e^(−(y_i − b x_i)² / (2σ²))

2. Take the natural logarithm of the likelihood

b_ML = arg min_b ∑_i (1/(2σ²)) (y_i − b x_i)²

3. Take the derivative of ln p(D|θ) w.r.t. θ. If θ is a multi-dimensional vector, take partial derivatives [2]

d/db ∑_i ½ (y_i − b x_i)² = −∑_i x_i (y_i − b x_i)

4. Set derivative(s) to 0 and solve for b

b_ML = (∑_i y_i x_i) / (∑_i x_i²)
[2] For simplicity we set σ = 1.
Data Modelling - Deterministic vs Probabilistic

- Probabilistic Models can tell us more
- We could use the same MLE recipe to find σ_ML. This would tell us how uncertain our model is about the data D.
- For example: if we apply this method to two datasets (D1 and D2), what would the parameters θ = {b, σ} be?

b_ML^D1 > b_ML^D2 [slope] and σ_ML^D1 < σ_ML^D2 [uncertainty [3]]
[3] The uncertainty (σ) is represented by the light green bar in the plots. Test it yourself.
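For the Gaussian noise model, running the same recipe for σ gives the root mean squared residual (a standard result for this model). A sketch on synthetic data with assumed parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
b_true, sigma_true = 2.0, 3.0
x = rng.uniform(1, 10, size=5000)
y = b_true * x + rng.normal(0.0, sigma_true, size=5000)

b_ml = np.sum(y * x) / np.sum(x * x)   # slope, as derived before
# Maximising ln p(D | b, sigma) w.r.t. sigma gives the RMS residual
sigma_ml = np.sqrt(np.mean((y - b_ml * x) ** 2))
```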
Probabilistic Model - Ex2
Example
Given a coin, you were assigned the task of figuring out whether the coin will land on heads or tails. You were asked to build a probabilistic model (i.e. with confidence)

- Data: head/tail binary attempts (of size N)
- Model: Binomial distribution
- Model Parameters: head probability α
Probabilistic Model - Ex2
Definition

The binomial distribution gives the probability distribution for a discrete variable to obtain exactly D successes out of N trials, where the probability of success is α and the probability of failure is (1 − α), with 0 ≤ α ≤ 1.

The binomial probability mass function is given by

P(D|N) = (N choose D) α^D (1 − α)^(N−D) = (N! / (D! (N − D)!)) α^D (1 − α)^(N−D)
Probabilistic Model - Ex2
Accordingly, using the binomial probability distribution where D is the number of heads in N coin tosses and θ is the probability of getting heads in a single toss,

P(D|θ) = (N choose D) θ^D (1 − θ)^(N−D)

Maximum Likelihood Estimation (MLE) would then be looking for

θ_ML = arg max_θ p(D|θ)
Probabilistic Model - Ex2
- Take the natural logarithm

P(D|θ) = (N choose D) θ^D (1 − θ)^(N−D)

ln P(D|θ) = ln (N choose D) + D ln θ + (N − D) ln(1 − θ)

- Take the derivative w.r.t. θ

d/dθ ln P(D|θ) = D (1/θ) + (N − D) (1/(1 − θ)) (−1)

               = D/θ − (N − D)/(1 − θ)
Probabilistic Model - Ex2
- Set the derivative to 0 and solve for θ

D/θ_ML − (N − D)/(1 − θ_ML) = 0

(D(1 − θ_ML) − (N − D) θ_ML) / (θ_ML(1 − θ_ML)) = 0

D − N θ_ML = 0

θ_ML = D/N

- In conclusion, the probability of heads is the relative frequency (D over the number of samples N).
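A quick simulated check of this result, with an assumed head probability:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha_true = 0.7                      # assumed head probability (made up)
N = 10000
tosses = rng.random(N) < alpha_true   # True = heads
D = int(tosses.sum())                 # number of heads observed

theta_ml = D / N                      # MLE: the relative frequency derived above
```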
Probabilistic Model - Using prior information

- MLE ignores any prior knowledge we may have about θ
- If we have prior knowledge about what values θ is likely to take, then we can use Bayesian inference, which combines prior and likelihood probabilities as p(θ|D) = p(D|θ)p(θ)/Z, where p(θ|D) is the posterior, p(D|θ) the likelihood of the data, p(θ) the prior over the parameters and Z is the normalisation term (p(D)).
Probabilistic Model - Using prior information
- The MLE method can be expanded to consider prior information as

θ_MAP = arg max_θ p(D|θ) p(θ)

- This is known as Maximum a Posteriori (MAP) estimation [4]
[4] You are going to learn more about posterior distributions next year in Machine Learning!
Maximum a Posteriori - Example
Example
Given a coin, you were assigned the task of figuring out whether the coin will land on heads or tails. You were asked to build a probabilistic model (i.e. with confidence)

- Suppose we want to utilise our prior belief that coins are typically fair
- p(θ) would peak around θ = 0.5
- Let's use

p(θ) = b θ (1 − θ)

where b is a normalising factor so that the area under the curve is equal to 1
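For this particular prior, the normalising factor can be found directly: ∫₀¹ θ(1 − θ) dθ = 1/2 − 1/3 = 1/6, so b = 6. A quick numeric check via a midpoint-rule integral:

```python
import numpy as np

dtheta = 1e-5
theta = np.arange(0.0, 1.0, dtheta) + dtheta / 2   # midpoints covering [0, 1]
area = np.sum(theta * (1 - theta)) * dtheta        # approximates the integral
b = 1.0 / area                                     # normalising factor
```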
Maximum a Posteriori - Example
- Likelihood:

p(D|θ) = (N choose D) θ^D (1 − θ)^(N−D)

- Prior:

p(θ) = b θ (1 − θ)

- Posterior:

p(θ|D) ∝ p(D|θ) p(θ) = (N choose D) θ^D (1 − θ)^(N−D) b θ (1 − θ)
Maximum a Posteriori - Example
- Take the natural logarithm and differentiate [same recipe as in MLE]

p(D|θ) p(θ) = (N choose D) θ^D (1 − θ)^(N−D) b θ (1 − θ)

ln p(D|θ) p(θ) = ln (N choose D) + D ln θ + (N − D) ln(1 − θ) + ln b + ln θ + ln(1 − θ)

d/dθ ln p(D|θ) p(θ) = D/θ − (N − D)/(1 − θ) + 1/θ − 1/(1 − θ)
Maximum a Posteriori - Example
- Set the derivative to 0 and solve for θ_MAP

D/θ_MAP − (N − D)/(1 − θ_MAP) + 1/θ_MAP − 1/(1 − θ_MAP) = 0

(D + 1)/θ_MAP − (N − D + 1)/(1 − θ_MAP) = 0

((D + 1)(1 − θ_MAP) − (N − D + 1) θ_MAP) / (θ_MAP(1 − θ_MAP)) = 0

θ_MAP = (D + 1)/(N + 2)

- The prior added two 'virtual' coin tosses, one with heads and one with tails. Note that for D = N = 0, it defaults to our prior knowledge, i.e. θ_MAP = 1/2.
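Comparing the two estimators on small samples makes the effect of the prior concrete (the helper names below are mine, not from the slides):

```python
def theta_mle(D, N):
    """MLE for the head probability: relative frequency D/N."""
    return D / N

def theta_map(D, N):
    """MAP estimate under the prior p(theta) = b * theta * (1 - theta)."""
    return (D + 1) / (N + 2)

# With no data at all, MAP falls back to the prior's peak, 1/2
prior_only = theta_map(0, 0)
# With 3 heads in 3 tosses, MLE is certain of heads while MAP stays cautious
mle_small, map_small = theta_mle(3, 3), theta_map(3, 3)
```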
Conclusion
- Probabilistic models encode randomness in the data
- They provide model uncertainty
- Parameters of the model are tuned using estimators
- Maximum Likelihood Estimation (MLE) is a recipe used for training model parameters
- MLE does not encode our prior knowledge of possible parameters
- Maximum a Posteriori (MAP) maximises the likelihood along with the prior
Further Reading
- Probability and Statistics for Engineers and Scientists, Walpole et al. (2007): Sections 3.1, 3.2, 4.1, 4.2
- Statistical Learning Methods, Russell and Norvig (2003): Chapter 20 (pp. 712-720)