
Page 1:

COMS21202: Symbols, Patterns and Signals
Probabilistic Data Models

[modified from Dima Damen lecture notes]

Rui Ponte Costa & Dima Damen
rui.costa@bristol.ac.uk

Department of Computer Science, University of Bristol
Bristol BS8 1UB, UK

March 9, 2020

Page 2:

Data Modelling

- Deterministic models do not explicitly model uncertainties or 'randomness' in the data
- Variability of inference from the data is not included
- In many tasks, we benefit from modelling uncertainty
- This is made explicit in Probabilistic Models

Page 3:

Back to Fish - Discrete case

Discrete variable:

Example

A fisherman returns with the daily catch of fish. If we select a fish at random from the hold, what species will it be?

fish ∈ {salmon, seabass, cod, ...}

- A deterministic model would give one value, the most likely one
- A probabilistic model quantifies the chance/probability of the selected fish being one of the possible species.
- Model the probability P(x_i = q_i) where q_i ∈ {salmon, seabass, cod, ...}

Page 4:

Back to Fish - Continuous case

Continuous variable:

Example

Predict the weight of fish from its length

Let us assume that the weight of a fish depends linearly on its length, i.e. weight = b × length + a (linear regression)

A probabilistic approach would model weight as a random variable and hypothesize that

weight = b × length + a + ε

where ε is a random variable, with mean usually close to zero
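As a quick illustration, here is a minimal sketch (assuming NumPy; the values b = 3.2, a = 0 and σ = 0.5 are hypothetical) of data generated under this model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only
b, a, sigma = 3.2, 0.0, 0.5

length = rng.uniform(20, 80, size=100)       # fish lengths (e.g. in cm)
epsilon = rng.normal(0.0, sigma, size=100)   # zero-mean random noise
weight = b * length + a + epsilon            # the probabilistic model
```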

Page 5:

Back to Fish - Continuous case

weight = b × length + a + ε

- Modelled using a probability distribution for ε:
  - by a uniform distribution
  - by a normal distribution
  - ...
- In the next slides, we will simplify things by setting weight = 0 when length = 0
- As a consequence, the y-intercept can be set to zero (i.e. a = 0), and

weight = b × length + ε

Page 6:

Back to Fish - Probabilistic

weight = b × length + ε

We can assume, for example, that ε is given by N(0, σ²):

p(\varepsilon) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{\varepsilon^2}{2\sigma^2}}

This model has two parameters: the slope b and the uncertainty σ (with µ = 0)
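A minimal sketch of this density (assuming NumPy and SciPy, with a hypothetical σ = 0.5); the explicit formula is checked against scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5                    # hypothetical noise scale
eps = np.linspace(-2, 2, 5)

# The density written out explicitly, matching the formula above
p_manual = np.exp(-eps**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# The same density via scipy, as a sanity check
p_scipy = norm.pdf(eps, loc=0.0, scale=sigma)

assert np.allclose(p_manual, p_scipy)
```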

Page 7:

Maximum Likelihood Estimation

- Similar to building deterministic models, probabilistic model parameters need to be tuned/trained
- Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a probabilistic model.
- Assume θ is a vector of all parameters of the probabilistic model (e.g. θ = {b, σ}).
- MLE is an extremum estimator [1] obtained by maximising an objective function of θ

[1] "Extremum estimators are a wide class of estimators for parametric models that are calculated through maximization (or minimization) of a certain objective function, which depends on the data." (wikipedia.org)

Page 8:

Maximum Likelihood Estimation

Definition
Assume f(θ) is an objective function to be optimised (e.g. maximised); the arg max corresponds to the value of θ that attains the maximum value of the objective function f:

\hat{\theta} = \arg\max_\theta f(\theta)

- Tuning the parameter then amounts to finding the maximising argument, the arg max
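As a toy illustration (assuming NumPy; the objective f(θ) = −(θ − 1)², with its maximum at θ = 1, is hypothetical), the arg max can be approximated by a grid search:

```python
import numpy as np

def f(theta):
    # Toy objective with its maximum at theta = 1
    return -(theta - 1.0)**2

thetas = np.linspace(-3, 3, 601)            # candidate parameter values
theta_hat = thetas[np.argmax(f(thetas))]    # argument attaining the maximum
print(theta_hat)                            # ~1.0
```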

Page 9:

Maximum Likelihood Estimation - General

- Maximum Likelihood Estimation (MLE) is a common method for solving such problems

\theta_{MLE} = \arg\max_\theta p(D|\theta) = \arg\max_\theta \ln p(D|\theta) = \arg\min_\theta -\ln p(D|\theta)

MLE Recipe

1. Determine θ, D and expression for likelihood p(D|θ)

2. Take the natural logarithm of the likelihood

3. Take the derivative of ln p(D|θ) w.r.t. θ. If θ is a multi-dimensional vector, take partial derivatives

4. Set derivative(s) to 0 and solve for θ
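A minimal numerical counterpart to this recipe (assuming SciPy; the data and the true slope 3.2 are synthetic and hypothetical): instead of solving analytically, it minimises −ln p(D|θ) directly, which should agree with the closed form derived later:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.uniform(20, 80, 50)
y = 3.2 * x + rng.normal(0, 0.5, 50)   # synthetic data, true slope b = 3.2

def neg_log_likelihood(b, sigma=0.5):
    # -ln p(D|b) for y_i ~ N(b x_i, sigma^2), dropping b-independent terms
    return np.sum((y - b * x)**2) / (2 * sigma**2)

b_mle = minimize_scalar(neg_log_likelihood).x
print(b_mle)                           # close to 3.2
```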

Page 10:

MLE: 1. Define likelihood

Given a set of N data points - x_i is length and y_i is weight in our fishy example

D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}

- The probabilistic approach would:
  - derive an expression for the conditional probability of observing data D given parameters θ = {b, σ}:

p(D|\theta)

Page 11:

MLE: 1. Define likelihood

Given a set of N data points

D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}

Assume that observations are independent - a common assumption often referred to as i.i.d., independent and identically distributed - then:

p(D|\theta) = \prod_{i=1}^{N} p(y_i | x_i, \theta)

Given y_i = b x_i + ε, and ε is N(0, σ²), then

p(y_i | x_i, θ) ∼ N(b x_i, σ²)

For a large sample:
- The average of the y_i values will be b x_i
- The 'spread' or variance will be the same as for ε, defined by σ²

Page 12:

MLE: 1. Define likelihood

The conditional probability (for all data) is thus formulated as

p(D|\theta) = \prod_{i=1}^{N} p(y_i | x_i, \theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}}

Page 13:

MLE: 2. Take natural logarithm

We will focus on parameter b for the next steps:

b_{ML} = \arg\max_b p(D|\theta)

= \arg\max_b \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}}

= \arg\max_b \ln\left( \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}} \right)  (use ln trick)

= \arg\max_b \sum_{i=1}^{N} \ln\left( \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}} \right)  (ln prop. 1)

= \arg\max_b \sum_{i=1}^{N} \left[ \ln\frac{1}{\sqrt{2\pi}\sigma} - \frac{(y_i - b x_i)^2}{2\sigma^2} \right]  (ln prop. 2)

= \arg\max_b \sum_{i=1}^{N} -\frac{1}{2\sigma^2} (y_i - b x_i)^2  (discard the b-free term \ln\frac{1}{\sqrt{2\pi}\sigma})

= \arg\min_b \sum_{i=1}^{N} \frac{1}{2\sigma^2} (y_i - b x_i)^2  (switch to minimisation form)

Page 14:

Data Modelling - Deterministic vs Probabilistic

- Deterministic Least Squares [Lecture 3]:

b_{LS} = \arg\min_b R(b, a=0) = \arg\min_b \sum_i (y_i - b x_i)^2

- Probabilistic Maximum Likelihood:

b_{ML} = \arg\min_b \sum_i \frac{1}{2\sigma^2} (y_i - b x_i)^2

- The probabilistic model explicitly considers the uncertainty σ².

Page 15:

MLE: 3. Take derivatives and 4. Find solution

To simplify the calculations, here we assume σ = 1:

b_{ML} = \arg\min_b \sum_i \frac{1}{2} (y_i - b x_i)^2

To find the minimum, calculate the derivative

\frac{d}{db} \sum_i \frac{1}{2} (y_i - b x_i)^2 = -\sum_i x_i (y_i - b x_i)

and equate it to zero:

-\sum_i x_i (y_i - b_{ML} x_i) = 0

\sum_i x_i y_i - b_{ML} \sum_i x_i^2 = 0

b_{ML} = \frac{\sum_i y_i x_i}{\sum_i x_i^2}
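A minimal sketch checking this closed form on synthetic data (assuming NumPy; the true slope 3.2 and noise level 0.5 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(20, 80, 200)
y = 3.2 * x + rng.normal(0, 0.5, 200)   # synthetic data, true slope 3.2

b_ml = np.sum(y * x) / np.sum(x**2)     # closed-form MLE for the slope
print(b_ml)                             # close to 3.2
```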

Page 16:

MLE: summary

Example: normal distribution (parameter b)

1. Determine θ, D and expression for likelihood p(D|θ)

p(D|b) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}}

2. Take the natural logarithm of the likelihood

b_{ML} = \arg\min_b \sum_i \frac{1}{2\sigma^2} (y_i - b x_i)^2

3. Take the derivative of ln p(D|θ) w.r.t. θ. If θ is a multi-dimensional vector, take partial derivatives [2]

\frac{d}{db} \sum_i \frac{1}{2} (y_i - b x_i)^2 = -\sum_i x_i (y_i - b x_i)

4. Set derivative(s) to 0 and solve for b

b_{ML} = \frac{\sum_i y_i x_i}{\sum_i x_i^2}

[2] For simplicity we set σ = 1.

Page 17:

Data Modelling - Deterministic vs Probabilistic

- Probabilistic Models can tell us more
- We could use the same MLE recipe to find σ_ML. This would tell us how uncertain our model is about the data D.
- For example: if we apply this method to two datasets (D1 and D2), what would the parameters θ = {b, σ} be?

b_{ML}^{D1} > b_{ML}^{D2} [slope]  and  \sigma_{ML}^{D1} < \sigma_{ML}^{D2} [uncertainty [3]]

[3] The uncertainty (σ) is represented by the light green bar in the plots. Test it yourself.
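A minimal sketch of this comparison (assuming NumPy; the two datasets below are hypothetical, D1 with a steep slope and low noise, D2 with a shallow slope and high noise). For σ_ML we use the standard Gaussian result, the root mean squared residual:

```python
import numpy as np

def fit_mle(x, y):
    # Closed-form MLE for the slope, then the Gaussian MLE of sigma
    b = np.sum(y * x) / np.sum(x**2)
    sigma = np.sqrt(np.mean((y - b * x)**2))
    return b, sigma

rng = np.random.default_rng(3)
x = rng.uniform(20, 80, 200)
y1 = 4.0 * x + rng.normal(0, 0.5, 200)   # D1: steep slope, low noise
y2 = 2.0 * x + rng.normal(0, 5.0, 200)   # D2: shallow slope, high noise

print(fit_mle(x, y1))   # b ~ 4.0, sigma ~ 0.5
print(fit_mle(x, y2))   # b ~ 2.0, sigma ~ 5.0
```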

Page 18:

Probabilistic Model - Ex2

Example

Given a coin, you were assigned the task of figuring out whether the coin will land on heads or tails. You were asked to build a probabilistic model (i.e. with confidence)

- Data: head/tail binary attempts (of size N)
- Model: Binomial distribution
- Model parameters: head probability α

Page 19:

Probabilistic Model - Ex2

Definition
The binomial distribution gives the probability distribution for a discrete variable to obtain exactly D successes out of N trials, where the probability of success is α, the probability of failure is (1 − α), and 0 ≤ α ≤ 1

The binomial probability mass function is given by

P(D|N) = \binom{N}{D} \alpha^D (1-\alpha)^{N-D} = \frac{N!}{D!\,(N-D)!} \alpha^D (1-\alpha)^{N-D}
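A minimal sketch (assuming SciPy, with hypothetical values N = 10, D = 7, α = 0.5) checking the explicit formula against scipy.stats.binom:

```python
import numpy as np
from math import comb
from scipy.stats import binom

N, D, alpha = 10, 7, 0.5   # hypothetical trials, successes, success probability

# The formula written out explicitly, matching the definition above
p_manual = comb(N, D) * alpha**D * (1 - alpha)**(N - D)

# The same probability via scipy, as a sanity check
p_scipy = binom.pmf(D, N, alpha)

assert np.isclose(p_manual, p_scipy)
```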

Page 20:

Probabilistic Model - Ex2

Accordingly, using the binomial probability distribution where D is the number of heads in N coin tosses and θ is the probability of getting heads in a single toss,

P(D|\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D}

Maximum Likelihood Estimation (MLE) would then be looking for

\theta_{ML} = \arg\max_\theta P(D|\theta)

Page 21:

Probabilistic Model - Ex2

- Take the natural logarithm

P(D|\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D}

\ln P(D|\theta) = \ln\binom{N}{D} + D\ln\theta + (N-D)\ln(1-\theta)

- Take the derivative w.r.t. θ

\frac{d}{d\theta} \ln P(D|\theta) = D\frac{1}{\theta} + (N-D)\frac{1}{1-\theta}(-1) = \frac{D}{\theta} - \frac{N-D}{1-\theta}

Page 22:

Probabilistic Model - Ex2

- Set the derivative to 0 and solve for θ

\frac{D}{\theta_{ML}} - \frac{N-D}{1-\theta_{ML}} = 0

\frac{D(1-\theta_{ML}) - (N-D)\theta_{ML}}{\theta_{ML}(1-\theta_{ML})} = 0

D - N\theta_{ML} = 0

\theta_{ML} = \frac{D}{N}

- In conclusion, the probability of heads is the relative frequency (D over the number of samples N).
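A minimal simulation sketch (assuming NumPy; the true heads probability 0.7 is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
true_theta = 0.7                         # hypothetical true heads probability

tosses = rng.random(1000) < true_theta   # simulate 1000 coin tosses
D, N = tosses.sum(), tosses.size         # number of heads, number of tosses

theta_ml = D / N                         # MLE: relative frequency of heads
print(theta_ml)                          # close to 0.7
```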

Page 23:

Probabilistic Model - Using prior information

- MLE ignores any prior knowledge we may have about θ
- If we have prior knowledge about which values θ is likely to take, then we can use Bayesian inference, which combines prior and likelihood probabilities as p(θ|D) = p(D|θ)p(θ)/Z, where p(θ|D) is the posterior, p(D|θ) the likelihood of the data, p(θ) the prior over the parameters, and Z is the normalisation term (p(D)), as sketched below.
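A minimal sketch of this computation on a grid (assuming NumPy and SciPy; the counts D = 7, N = 10 are hypothetical, and the prior p(θ) ∝ θ(1 − θ) is the one used in the example that follows):

```python
import numpy as np
from scipy.stats import binom

D, N = 7, 10                              # hypothetical heads / tosses
theta = np.linspace(0.001, 0.999, 999)    # grid over the parameter

likelihood = binom.pmf(D, N, theta)       # p(D|theta)
prior = theta * (1 - theta)               # unnormalised prior peaked at 0.5

posterior = likelihood * prior            # p(theta|D) up to the 1/Z term
posterior /= posterior.sum()              # normalise on the grid

print(theta[np.argmax(posterior)])        # posterior mode, ~ (D+1)/(N+2)
```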

Page 24:

Probabilistic Model - Using prior information

- The MLE method can be expanded to consider prior information as

\theta_{MAP} = \arg\max_\theta p(D|\theta)\,p(\theta)

- This is known as Maximum a Posteriori (MAP) estimation [4]

[4] You are going to learn more about posterior distributions next year in Machine Learning!

Page 25:

Maximum a Posteriori - Example

Example

Given a coin, you were assigned the task of figuring out whether the coin will land on heads or tails. You were asked to build a probabilistic model (i.e. with confidence)

- Suppose we want to utilise our prior belief that coins are typically fair
- p(θ) would peak around θ = 0.5
- Let's use

p(\theta) = b\,\theta\,(1-\theta)

where b is a normalising factor so the area under the curve is equal to 1

Page 26:

Maximum a Posteriori - Example

- Likelihood:

p(D|\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D}

- Prior:

p(\theta) = b\,\theta\,(1-\theta)

- Posterior:

p(\theta|D) \propto p(D|\theta)\,p(\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D} \, b\,\theta\,(1-\theta)

Page 27:

Maximum a Posteriori - Example

- Take the natural logarithm and differentiate [same recipe as in MLE]

p(D|\theta)\,p(\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D} \, b\,\theta\,(1-\theta)

\ln p(D|\theta)\,p(\theta) = \ln\binom{N}{D} + D\ln\theta + (N-D)\ln(1-\theta) + \ln b + \ln\theta + \ln(1-\theta)

\frac{d}{d\theta} \ln p(D|\theta)\,p(\theta) = D\frac{1}{\theta} - (N-D)\frac{1}{1-\theta} + \frac{1}{\theta} - \frac{1}{1-\theta}

Page 28:

Maximum a Posteriori - Example

- Set the derivative to 0 and solve for θ_MAP

\frac{D}{\theta_{MAP}} - \frac{N-D}{1-\theta_{MAP}} + \frac{1}{\theta_{MAP}} - \frac{1}{1-\theta_{MAP}} = 0

\frac{D+1}{\theta_{MAP}} - \frac{N-D+1}{1-\theta_{MAP}} = 0

\frac{(D+1)(1-\theta_{MAP}) - (N-D+1)\theta_{MAP}}{\theta_{MAP}(1-\theta_{MAP})} = 0

\theta_{MAP} = \frac{D+1}{N+2}

- The prior added two 'virtual' coin tosses, one heads and one tails. Note that for D = N = 0, the estimate defaults to our prior knowledge, i.e. θ_MAP = 1/2.
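A minimal sketch comparing the two estimators on hypothetical counts; with little data the MAP estimate is pulled towards the prior's peak at 0.5:

```python
# MLE vs MAP estimates of the heads probability for hypothetical counts
for D, N in [(0, 0), (2, 2), (7, 10), (70, 100)]:
    theta_ml = D / N if N > 0 else float("nan")   # MLE undefined with no data
    theta_map = (D + 1) / (N + 2)                 # prior adds two virtual tosses
    print(f"D={D:3d} N={N:3d}  MLE={theta_ml:.3f}  MAP={theta_map:.3f}")
```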

Page 29:

Conclusion

- Probabilistic models encode randomness in the data
- They provide model uncertainty
- Parameters of the model are tuned using estimators
- Maximum Likelihood Estimation (MLE) is a recipe used for training model parameters
- MLE does not encode our prior knowledge of possible parameters
- Maximum a Posteriori (MAP) maximises the likelihood along with the prior

Page 30:

Further Reading

- Probability and Statistics for Engineers and Scientists, Walpole et al. (2007)
  - Section 3.1
  - Section 3.2
  - Section 4.1
  - Section 4.2
- Statistical Learning Methods, Russell and Norvig (2003)
  - Chapter 20 (p. 712-720)
