
Page 1:

COMS21202: Symbols, Patterns and Signals
Probabilistic Data Models

[modified from Dima Damen lecture notes]

Rui Ponte Costa & Dima Damen
rui.costa@bristol.ac.uk

Department of Computer Science, University of Bristol
Bristol BS8 1UB, UK

March 9, 2020

Page 2:

Data Modelling

- Deterministic models do not explicitly model uncertainties or 'randomness' in the data
- Variability of inference from the data is not included
- In many tasks, we benefit from modelling uncertainty
- This is made explicit in Probabilistic Models

Page 3:

Back to Fish - Discrete case

Discrete variable:

Example

A fisherman returns with the daily catch of fish. If we select a fish at random from the hold, what species will it be?

fish ∈ {salmon, seabass, cod, ...}

- A deterministic model would give one value, the most likely one
- A probabilistic model quantifies the chance/probability of the selected fish being one of the possible species.
- Model the probability P(x_i = q_i) where q_i ∈ {salmon, seabass, cod, ...}

Page 4:

Back to Fish - Continuous case

Continuous variable:

Example

Predict the weight of fish from its length

Let us assume that the weight of a fish depends linearly on its length, i.e. weight = b × length + a (linear regression)

A probabilistic approach would model weight as a random variable and hypothesize that

weight = b × length + a + ε

where ε is a random variable, with mean usually close to zero
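As a quick illustration, here is a minimal sketch (assuming NumPy; the values b = 3.2, a = 0 and σ = 0.5 are hypothetical) of data generated under this model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only
b, a, sigma = 3.2, 0.0, 0.5

length = rng.uniform(20, 80, size=100)       # fish lengths (e.g. in cm)
epsilon = rng.normal(0.0, sigma, size=100)   # zero-mean random noise
weight = b * length + a + epsilon            # the probabilistic model
```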

Page 5:

Back to Fish - Continuous case

weight = b × length + a + ε

- Modelled using a probability distribution for ε:
  - by a uniform distribution
  - by a normal distribution
  - ...
- In the next slides, we will simplify things by setting weight = 0 when length = 0
- As a consequence, the y-intercept can be set to zero (i.e. a = 0), and

weight = b × length + ε

Page 6:

Back to Fish - Probabilistic

weight = b × length + ε

We can assume, for example, that ε is given by N(0, σ²):

p(\varepsilon) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{\varepsilon^2}{2\sigma^2}}

This model has two parameters: the slope b and the uncertainty σ (with µ = 0)
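A minimal sketch of this density (assuming NumPy and SciPy, with a hypothetical σ = 0.5); the explicit formula is checked against scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5                    # hypothetical noise scale
eps = np.linspace(-2, 2, 5)

# The density written out explicitly, matching the formula above
p_manual = np.exp(-eps**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# The same density via scipy, as a sanity check
p_scipy = norm.pdf(eps, loc=0.0, scale=sigma)

assert np.allclose(p_manual, p_scipy)
```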

Page 7:

Maximum Likelihood Estimation

- Similar to building deterministic models, probabilistic model parameters need to be tuned/trained
- Maximum-likelihood estimation (MLE) is a method of estimating the parameters of a probabilistic model.
- Assume θ is a vector of all parameters of the probabilistic model (e.g. θ = {b, σ}).
- MLE is an extremum estimator [1] obtained by maximising an objective function of θ

[1] "Extremum estimators are a wide class of estimators for parametric models that are calculated through maximization (or minimization) of a certain objective function, which depends on the data." (wikipedia.org)

Page 8:

Maximum Likelihood Estimation

Definition
Assume f(θ) is an objective function to be optimised (e.g. maximised); the arg max corresponds to the value of θ that attains the maximum value of the objective function f:

\hat{\theta} = \arg\max_\theta f(\theta)

- Tuning the parameter then amounts to finding the maximising argument, the arg max
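As a toy illustration (assuming NumPy; the objective f(θ) = −(θ − 1)², with its maximum at θ = 1, is hypothetical), the arg max can be approximated by a grid search:

```python
import numpy as np

def f(theta):
    # Toy objective with its maximum at theta = 1
    return -(theta - 1.0)**2

thetas = np.linspace(-3, 3, 601)            # candidate parameter values
theta_hat = thetas[np.argmax(f(thetas))]    # argument attaining the maximum
print(theta_hat)                            # ~1.0
```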

Page 9:

Maximum Likelihood Estimation - General

- Maximum Likelihood Estimation (MLE) is a common method for solving such problems

\theta_{MLE} = \arg\max_\theta p(D|\theta) = \arg\max_\theta \ln p(D|\theta) = \arg\min_\theta -\ln p(D|\theta)

MLE Recipe

1. Determine θ, D and expression for likelihood p(D|θ)

2. Take the natural logarithm of the likelihood

3. Take the derivative of ln p(D|θ) w.r.t. θ. If θ is a multi-dimensional vector, take partial derivatives

4. Set derivative(s) to 0 and solve for θ
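A minimal numerical counterpart to this recipe (assuming SciPy; the data and the true slope 3.2 are synthetic and hypothetical): instead of solving analytically, it minimises −ln p(D|θ) directly, which should agree with the closed form derived later:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.uniform(20, 80, 50)
y = 3.2 * x + rng.normal(0, 0.5, 50)   # synthetic data, true slope b = 3.2

def neg_log_likelihood(b, sigma=0.5):
    # -ln p(D|b) for y_i ~ N(b x_i, sigma^2), dropping b-independent terms
    return np.sum((y - b * x)**2) / (2 * sigma**2)

b_mle = minimize_scalar(neg_log_likelihood).x
print(b_mle)                           # close to 3.2
```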

Page 10:

MLE: 1. Define likelihood

Given a set of N data points - x_i is length and y_i is weight in our fishy example

D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}

- The probabilistic approach would:
  - derive an expression for the conditional probability of observing data D given parameters θ = {b, σ}:

p(D|\theta)

Page 11:

MLE: 1. Define likelihood

Given a set of N data points

D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}

Assume that observations are independent - a common assumption often referred to as i.i.d., independent and identically distributed - then:

p(D|\theta) = \prod_{i=1}^{N} p(y_i | x_i, \theta)

Given y_i = b x_i + ε, and ε is N(0, σ²), then

p(y_i | x_i, θ) ∼ N(b x_i, σ²)

For a large sample:
- The average of the y_i values will be b x_i
- The 'spread' or variance will be the same as for ε, defined by σ²

Page 12:

MLE: 1. Define likelihood

The conditional probability (for all data) is thus formulated as

p(D|\theta) = \prod_{i=1}^{N} p(y_i | x_i, \theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}}

Page 13:

MLE: 2. Take natural logarithm

We will focus on parameter b for the next steps:

b_{ML} = \arg\max_b p(D|\theta)

= \arg\max_b \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}}

= \arg\max_b \ln\left( \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}} \right)  (use ln trick)

= \arg\max_b \sum_{i=1}^{N} \ln\left( \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}} \right)  (ln prop. 1)

= \arg\max_b \sum_{i=1}^{N} \left[ \ln\frac{1}{\sqrt{2\pi}\sigma} - \frac{(y_i - b x_i)^2}{2\sigma^2} \right]  (ln prop. 2)

= \arg\max_b \sum_{i=1}^{N} -\frac{1}{2\sigma^2} (y_i - b x_i)^2  (discard the b-free term \ln\frac{1}{\sqrt{2\pi}\sigma})

= \arg\min_b \sum_{i=1}^{N} \frac{1}{2\sigma^2} (y_i - b x_i)^2  (switch to minimisation form)

Page 14:

Data Modelling - Deterministic vs Probabilistic

- Deterministic Least Squares [Lecture 3]:

b_{LS} = \arg\min_b R(b, a=0) = \arg\min_b \sum_i (y_i - b x_i)^2

- Probabilistic Maximum Likelihood:

b_{ML} = \arg\min_b \sum_i \frac{1}{2\sigma^2} (y_i - b x_i)^2

- The probabilistic model explicitly considers the uncertainty σ².

Page 15:

MLE: 3. Take derivatives and 4. Find solution

To simplify the calculations, here we assume σ = 1:

b_{ML} = \arg\min_b \sum_i \frac{1}{2} (y_i - b x_i)^2

To find the minimum, calculate the derivative

\frac{d}{db} \sum_i \frac{1}{2} (y_i - b x_i)^2 = -\sum_i x_i (y_i - b x_i)

and equate it to zero:

-\sum_i x_i (y_i - b_{ML} x_i) = 0

\sum_i x_i y_i - b_{ML} \sum_i x_i^2 = 0

b_{ML} = \frac{\sum_i y_i x_i}{\sum_i x_i^2}
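A minimal sketch checking this closed form on synthetic data (assuming NumPy; the true slope 3.2 and noise level 0.5 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(20, 80, 200)
y = 3.2 * x + rng.normal(0, 0.5, 200)   # synthetic data, true slope 3.2

b_ml = np.sum(y * x) / np.sum(x**2)     # closed-form MLE for the slope
print(b_ml)                             # close to 3.2
```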

Page 16:

MLE: summary

Example: normal distribution (parameter b)

1. Determine θ, D and expression for likelihood p(D|θ)

p(D|b) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\frac{(y_i - b x_i)^2}{\sigma^2}}

2. Take the natural logarithm of the likelihood

b_{ML} = \arg\min_b \sum_i \frac{1}{2\sigma^2} (y_i - b x_i)^2

3. Take the derivative of ln p(D|θ) w.r.t. θ. If θ is a multi-dimensional vector, take partial derivatives [2]

\frac{d}{db} \sum_i \frac{1}{2} (y_i - b x_i)^2 = -\sum_i x_i (y_i - b x_i)

4. Set derivative(s) to 0 and solve for b

b_{ML} = \frac{\sum_i y_i x_i}{\sum_i x_i^2}

[2] For simplicity we set σ = 1.

Page 17:

Data Modelling - Deterministic vs Probabilistic

- Probabilistic Models can tell us more
- We could use the same MLE recipe to find σ_ML. This would tell us how uncertain our model is about the data D.
- For example: if we apply this method to two datasets (D1 and D2), what would the parameters θ = {b, σ} be?

b_{ML}^{D1} > b_{ML}^{D2} [slope]  and  \sigma_{ML}^{D1} < \sigma_{ML}^{D2} [uncertainty [3]]

[3] The uncertainty (σ) is represented by the light green bar in the plots. Test it yourself.
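A minimal sketch of this comparison (assuming NumPy; the two datasets below are hypothetical, D1 with a steep slope and low noise, D2 with a shallow slope and high noise). For σ_ML we use the standard Gaussian result, the root mean squared residual:

```python
import numpy as np

def fit_mle(x, y):
    # Closed-form MLE for the slope, then the Gaussian MLE of sigma
    b = np.sum(y * x) / np.sum(x**2)
    sigma = np.sqrt(np.mean((y - b * x)**2))
    return b, sigma

rng = np.random.default_rng(3)
x = rng.uniform(20, 80, 200)
y1 = 4.0 * x + rng.normal(0, 0.5, 200)   # D1: steep slope, low noise
y2 = 2.0 * x + rng.normal(0, 5.0, 200)   # D2: shallow slope, high noise

print(fit_mle(x, y1))   # b ~ 4.0, sigma ~ 0.5
print(fit_mle(x, y2))   # b ~ 2.0, sigma ~ 5.0
```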

Page 18:

Probabilistic Model - Ex2

Example

Given a coin, you were assigned the task of figuring out whether the coin will land on heads or tails. You were asked to build a probabilistic model (i.e. with confidence)

- Data: head/tail binary attempts (of size N)
- Model: Binomial distribution
- Model parameters: head probability α

Page 19:

Probabilistic Model - Ex2

Definition
The binomial distribution gives the probability distribution for a discrete variable to obtain exactly D successes out of N trials, where the probability of success is α, the probability of failure is (1 − α), and 0 ≤ α ≤ 1

The binomial probability mass function is given by

P(D|N) = \binom{N}{D} \alpha^D (1-\alpha)^{N-D} = \frac{N!}{D!\,(N-D)!} \alpha^D (1-\alpha)^{N-D}
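A minimal sketch (assuming SciPy, with hypothetical values N = 10, D = 7, α = 0.5) checking the explicit formula against scipy.stats.binom:

```python
import numpy as np
from math import comb
from scipy.stats import binom

N, D, alpha = 10, 7, 0.5   # hypothetical trials, successes, success probability

# The formula written out explicitly, matching the definition above
p_manual = comb(N, D) * alpha**D * (1 - alpha)**(N - D)

# The same probability via scipy, as a sanity check
p_scipy = binom.pmf(D, N, alpha)

assert np.isclose(p_manual, p_scipy)
```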

Page 20:

Probabilistic Model - Ex2

Accordingly, using the binomial probability distribution where D is the number of heads in N coin tosses and θ is the probability of getting heads in a single toss,

P(D|\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D}

Maximum Likelihood Estimation (MLE) would then be looking for

\theta_{ML} = \arg\max_\theta P(D|\theta)

Page 21:

Probabilistic Model - Ex2

- Take the natural logarithm

P(D|\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D}

\ln P(D|\theta) = \ln\binom{N}{D} + D\ln\theta + (N-D)\ln(1-\theta)

- Take the derivative w.r.t. θ

\frac{d}{d\theta} \ln P(D|\theta) = D\frac{1}{\theta} + (N-D)\frac{1}{1-\theta}(-1) = \frac{D}{\theta} - \frac{N-D}{1-\theta}

Page 22:

Probabilistic Model - Ex2

- Set the derivative to 0 and solve for θ

\frac{D}{\theta_{ML}} - \frac{N-D}{1-\theta_{ML}} = 0

\frac{D(1-\theta_{ML}) - (N-D)\theta_{ML}}{\theta_{ML}(1-\theta_{ML})} = 0

D - N\theta_{ML} = 0

\theta_{ML} = \frac{D}{N}

- In conclusion, the probability of heads is the relative frequency (D over the number of samples N).
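A minimal simulation sketch (assuming NumPy; the true heads probability 0.7 is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
true_theta = 0.7                         # hypothetical true heads probability

tosses = rng.random(1000) < true_theta   # simulate 1000 coin tosses
D, N = tosses.sum(), tosses.size         # number of heads, number of tosses

theta_ml = D / N                         # MLE: relative frequency of heads
print(theta_ml)                          # close to 0.7
```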

Page 23:

Probabilistic Model - Using prior information

- MLE ignores any prior knowledge we may have about θ
- If we have prior knowledge about which values θ is likely to take, then we can use Bayesian inference, which combines prior and likelihood probabilities as p(θ|D) = p(D|θ)p(θ)/Z, where p(θ|D) is the posterior, p(D|θ) the likelihood of the data, p(θ) the prior over the parameters, and Z is the normalisation term (p(D)), as sketched below.
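A minimal sketch of this computation on a grid (assuming NumPy and SciPy; the counts D = 7, N = 10 are hypothetical, and the prior p(θ) ∝ θ(1 − θ) is the one used in the example that follows):

```python
import numpy as np
from scipy.stats import binom

D, N = 7, 10                              # hypothetical heads / tosses
theta = np.linspace(0.001, 0.999, 999)    # grid over the parameter

likelihood = binom.pmf(D, N, theta)       # p(D|theta)
prior = theta * (1 - theta)               # unnormalised prior peaked at 0.5

posterior = likelihood * prior            # p(theta|D) up to the 1/Z term
posterior /= posterior.sum()              # normalise on the grid

print(theta[np.argmax(posterior)])        # posterior mode, ~ (D+1)/(N+2)
```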

Page 24:

Probabilistic Model - Using prior information

- The MLE method can be expanded to consider prior information as

\theta_{MAP} = \arg\max_\theta p(D|\theta)\,p(\theta)

- This is known as Maximum a Posteriori (MAP) estimation [4]

[4] You are going to learn more about posterior distributions next year in Machine Learning!

Page 25:

Maximum a Posteriori - Example

Example

Given a coin, you were assigned the task of figuring out whether the coin will land on heads or tails. You were asked to build a probabilistic model (i.e. with confidence)

- Suppose we want to utilise our prior belief that coins are typically fair
- p(θ) would peak around θ = 0.5
- Let's use

p(\theta) = b\,\theta\,(1-\theta)

where b is a normalising factor so the area under the curve is equal to 1

Page 26:

Maximum a Posteriori - Example

- Likelihood:

p(D|\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D}

- Prior:

p(\theta) = b\,\theta\,(1-\theta)

- Posterior:

p(\theta|D) \propto p(D|\theta)\,p(\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D} \, b\,\theta\,(1-\theta)

Page 27:

Maximum a Posteriori - Example

- Take the natural logarithm and differentiate [same recipe as in MLE]

p(D|\theta)\,p(\theta) = \binom{N}{D} \theta^D (1-\theta)^{N-D} \, b\,\theta\,(1-\theta)

\ln p(D|\theta)\,p(\theta) = \ln\binom{N}{D} + D\ln\theta + (N-D)\ln(1-\theta) + \ln b + \ln\theta + \ln(1-\theta)

\frac{d}{d\theta} \ln p(D|\theta)\,p(\theta) = D\frac{1}{\theta} - (N-D)\frac{1}{1-\theta} + \frac{1}{\theta} - \frac{1}{1-\theta}

Page 28:

Maximum a Posteriori - Example

- Set the derivative to 0 and solve for θ_MAP

\frac{D}{\theta_{MAP}} - \frac{N-D}{1-\theta_{MAP}} + \frac{1}{\theta_{MAP}} - \frac{1}{1-\theta_{MAP}} = 0

\frac{D+1}{\theta_{MAP}} - \frac{N-D+1}{1-\theta_{MAP}} = 0

\frac{(D+1)(1-\theta_{MAP}) - (N-D+1)\theta_{MAP}}{\theta_{MAP}(1-\theta_{MAP})} = 0

\theta_{MAP} = \frac{D+1}{N+2}

- The prior added two 'virtual' coin tosses, one heads and one tails. Note that for D = N = 0, the estimate defaults to our prior knowledge, i.e. θ_MAP = 1/2.
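A minimal sketch comparing the two estimators on hypothetical counts; with little data the MAP estimate is pulled towards the prior's peak at 0.5:

```python
# MLE vs MAP estimates of the heads probability for hypothetical counts
for D, N in [(0, 0), (2, 2), (7, 10), (70, 100)]:
    theta_ml = D / N if N > 0 else float("nan")   # MLE undefined with no data
    theta_map = (D + 1) / (N + 2)                 # prior adds two virtual tosses
    print(f"D={D:3d} N={N:3d}  MLE={theta_ml:.3f}  MAP={theta_map:.3f}")
```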

Page 29:

Conclusion

- Probabilistic models encode randomness in the data
- They provide model uncertainty
- Parameters of the model are tuned using estimators
- Maximum Likelihood Estimation (MLE) is a recipe used for training model parameters
- MLE does not encode our prior knowledge of possible parameters
- Maximum a Posteriori (MAP) maximises the likelihood along with the prior

Page 30:

Further Reading

- Probability and Statistics for Engineers and Scientists, Walpole et al. (2007)
  - Section 3.1
  - Section 3.2
  - Section 4.1
  - Section 4.2
- Statistical Learning Methods, Russell and Norvig (2003)
  - Chapter 20 (p. 712-720)
