
Lecture 6: Probabilistic modelling (part 2)

Pierre Lison, Language Technology Group (LTG)

Department of Informatics

Fall 2012, September 17, 2012


© 2012, Pierre Lison - INF5820 course

Outline

• Review of last week’s concepts

• Special cases of Bayesian networks

• Exercises



Random variables

• A random variable is a variable whose value is subject to variations due to chance

• The domain of possible values for the random variable can be categorical, continuous or discrete

• A probability distribution can be defined over the variable domain:

• Via a probability mass function (PMF) for discrete/categorical variables

• Via a probability density function (PDF) for continuous variables


Example of discrete distribution

• Well-known discrete distribution: the binomial B(n,p)

• Describes the total number of successes in a sequence of independent yes/no experiments

• n is the number of experiments

• p is the probability of success («yes») for a single experiment


(Figure: binomial distribution with n=12 and p=0.5)

Probability mass function (PMF) for a binomial B(n,p):

P(K = k) = \binom{n}{k} p^k (1 − p)^{n−k}, with \binom{n}{k} = \frac{n!}{k!(n − k)!}

NB: you don’t have to remember the PMF!
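As a quick illustration (not part of the original slides), the PMF can be evaluated directly in Python; math.comb requires Python 3.8+:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(K = k) for a binomial B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The distribution shown in the figure: n=12, p=0.5
print(binomial_pmf(6, 12, 0.5))   # ~0.2256, the mode of B(12, 0.5)
```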


Example of continuous distribution


• Well-known distribution for continuous variable: the normal (or Gaussian) distribution

• μ and σ2 are the two parameters of this distribution

• μ is the mean and defines the «centre» of the bell curve

• σ2 is the variance and defines the spread of the curve

Probability density function (PDF) for a normal N(µ, σ²):

f(x; µ, σ²) = \frac{1}{\sqrt{2πσ²}} e^{−(x−µ)² / (2σ²)}

NB: you don’t have to remember the PDF!
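Similarly, a minimal Python sketch of the density (not part of the original slides):

```python
from math import exp, pi, sqrt

def normal_pdf(x: float, mu: float, sigma2: float) -> float:
    """Density of N(mu, sigma2) at x."""
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

print(normal_pdf(0.0, 0.0, 1.0))  # ~0.3989, the peak of the standard normal
```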


Bayes’ rule

• A central formula in probabilistic inference is Bayes’ rule:


P(A|B) = \frac{P(B|A) P(A)}{P(B)} = α P(B|A) P(A)

where P(A|B) is the posterior, P(B|A) the likelihood, P(A) the prior, and α = 1/P(B) the normalisation factor.


Bayes’ rule in ML

• Bayes’ rule is notably crucial for machine learning

• Let’s say we have a dataset D, and we want to find the hypothesis h* which provides the best fit for the data

• In other words, we search for:

• But we have no idea how to estimate P(h|D)!


h* = argmax_h P(h|D)


Bayes’ rule in ML

• Fortunately, we can use Bayes’ rule:

• P(h): prior probability of the hypothesis

• P(D|h): likelihood, i.e. the probability of observing data D if h is true

• P(D) is a normalisation factor (can often be ignored)

• Finding the best hypothesis is then:


P(h|D) = \frac{P(D|h) P(h)}{P(D)}

h* = argmax_h P(D|h) P(h)
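A minimal sketch (not from the slides) of this argmax, with made-up priors and likelihoods over three hypothetical hypotheses:

```python
# Hypothetical numbers for illustration only
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}        # P(h)
likelihood = {"h1": 0.01, "h2": 0.10, "h3": 0.05}  # P(D | h)

# h* = argmax_h P(D|h) P(h); P(D) can be ignored since it does not depend on h
h_star = max(prior, key=lambda h: likelihood[h] * prior[h])
print(h_star)  # "h2"
```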


Expectation and variance

• Important measures associated with a (numeric) random variable X:

• The expectation E(X), which is the mean of the variable:

• The variance Var(X), which is the «spread» of the variable (i.e. how much the values tend to diverge from the mean):


E(X) = \sum_{x ∈ Domain(X)} x P(X = x)

Var(X) = E((X − E(X))²) = E(X²) − E(X)²
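As a quick illustration (not from the slides), both measures can be computed directly from a finite distribution represented as a Python dict; the numbers below are the ones that reappear in Exercise 4 later in the lecture:

```python
def expectation(dist):
    """E(X) = sum_x x * P(X=x), for a dict {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = E(X^2) - E(X)^2."""
    mean = expectation(dist)
    return sum(x**2 * p for x, p in dist.items()) - mean**2

lottery = {125: 0.2, 100: 0.3, 50: 0.5}
print(expectation(lottery), variance(lottery))  # 80.0 975.0
```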


Marginalisation

• A very useful operation is marginalisation (also called summing-out):

• The operation simply follows from the fact that probabilities must add up to 1

• Marginalisation is very useful if P(X) is unknown, but P(X|Y) and P(Y) are known


P(X) = \sum_{y ∈ Domain(Y)} P(X, Y = y) = \sum_{y ∈ Domain(Y)} P(X | Y = y) P(Y = y)
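A minimal Python sketch of this summing-out operation (not from the slides), using made-up numbers:

```python
def marginalise(p_x_given_y, p_y):
    """P(X) = sum_y P(X|Y=y) P(Y=y).

    p_x_given_y: dict mapping y -> {x: P(X=x | Y=y)}
    p_y:         dict mapping y -> P(Y=y)
    """
    p_x = {}
    for y, cond in p_x_given_y.items():
        for x, p in cond.items():
            p_x[x] = p_x.get(x, 0.0) + p * p_y[y]
    return p_x

# Illustrative numbers only
p_y = {"sunny": 0.7, "rainy": 0.3}
p_x_given_y = {"sunny": {"walk": 0.9, "bus": 0.1},
               "rainy": {"walk": 0.2, "bus": 0.8}}
print(marginalise(p_x_given_y, p_y))  # {'walk': 0.69, 'bus': 0.31}
```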


Bayesian Networks


Network structure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls

P(B) = 0.001        P(E) = 0.002

B      E      P(A|B,E)
true   true   0.95
true   false  0.95
false  true   0.29
false  false  0.001

A      P(JC|A)
true   0.9
false  0.05

A      P(MC|A)
true   0.70
false  0.01

(we abbreviate the variables in the tables by their first letter)


Outline

• Review of last week’s concepts

• Special cases of Bayesian Networks

• Exercises


Types of models

• Bayesian Networks define a general class of probabilistic models

• The probabilistic models used in spoken dialogue systems often rely on special cases of Bayesian Networks

• We’re going to mention 2 special cases:

• Markov chains (used for e.g. language modelling)

• Hidden Markov Models (used for e.g. acoustic modelling)


Markov Chains

• A Markov Chain is a probabilistic model for a sequence of variables X1, X2, ..., Xn, where for all 1 < i ≤ n, we have:


P(Xi | Xi-1, ..., X1) = P(Xi | Xi-1)

• In other words, every variable Xi is only dependent on its predecessor Xi-1

(Diagram: X1 → X2 → ... → Xi-1 → Xi → ... → Xn)


Markov Chains

• A Markov Chain of order k is a probabilistic model for a sequence of variables X1, X2, ..., Xn, where for all k < i ≤ n, we have:


P(Xi | Xi-1, ..., X1) = P(Xi | Xi-1, ..., Xi-k)

• In other words, every variable Xi is only dependent on its k predecessors Xi-k,... Xi-1

(Diagram: chain X1 → ... → Xi-2 → Xi-1 → Xi → ... → Xn, where each Xi has incoming edges from both Xi-1 and Xi-2; here for k=2)


Markov Chains

• Example of application: Markov Chains can easily encode language models

• A language model defines the probability for a sequence of words w1,...wn

• Often by estimating the probabilities of each word wi given its k predecessors wi-1,...wi-k

• These models are called n-gram models, where n = k+1
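As an illustration (not from the slides), a bigram model (k=1, i.e. n=2) can be estimated simply by counting; this sketch uses a tiny made-up corpus and no smoothing:

```python
from collections import defaultdict

def train_bigram(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency (no smoothing)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    # Normalise the counts into conditional probabilities
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

model = train_bigram(["the cat sleeps", "the dog sleeps", "the cat eats"])
print(model["the"])   # {'cat': 0.666..., 'dog': 0.333...}
print(model["cat"])   # {'sleeps': 0.5, 'eats': 0.5}
```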


Hidden Markov Models

• Hidden Markov Models (HMMs) are another probabilistic model, closely related to Markov chains, for describing a sequence of variables X1, ..., Xn

• The particularity of HMMs is that the chain is not directly accessible to observation

• But we can obtain indirect evidence via additional observation nodes


Hidden Markov Models

• A Hidden Markov Model extends a Markov chain for X1,...Xn with additional variables O1,...On

• An observation variable Oi only depends on its corresponding variable Xi


(Diagram: hidden chain X1 → X2 → ... → Xi-1 → Xi → ... → Xn, with each hidden variable Xi emitting an observation variable Oi)


Hidden Markov Models

• Hidden Markov Models are notably used in acoustic modelling for speech recognition

• Here, the objective is to find the most likely sequence of phonemes corresponding to what the user said

• But the only information we have is a set of low-level acoustic features (the observations)

• Given a set of observations and the likelihood of the phoneme sequence, we can then calculate the most likely hypothesis for the utterance
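To make the idea concrete, here is a minimal sketch (not from the slides) of the forward algorithm for a discrete HMM, with made-up phoneme states and observations; real acoustic models use continuous features and far larger state spaces:

```python
def forward(states, init, trans, emit, observations):
    """P(o_1..o_n) for a discrete HMM, via the forward algorithm.

    init[s]      = P(X_1 = s)
    trans[s][t]  = P(X_i = t | X_{i-1} = s)
    emit[s][o]   = P(O_i = o | X_i = s)
    """
    alpha = {s: init[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in states) * emit[t][obs]
                 for t in states}
    return sum(alpha.values())

# Toy example with made-up numbers
states = ["/a/", "/b/"]
init = {"/a/": 0.6, "/b/": 0.4}
trans = {"/a/": {"/a/": 0.7, "/b/": 0.3}, "/b/": {"/a/": 0.4, "/b/": 0.6}}
emit = {"/a/": {"low": 0.8, "high": 0.2}, "/b/": {"low": 0.3, "high": 0.7}}
print(forward(states, init, trans, emit, ["low", "high", "high"]))
```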


Outline

• Review of last week’s concepts

• Special cases of Bayesian Networks

• Exercises


Exercise 1

• An elementary school is offering 3 language classes: one in Spanish, one in French, and one in German. These classes are open to any of the 100 students in the school. There are 28 students in the Spanish class, 26 in the French class, and 16 in the German class. There are 12 students that are in both Spanish and French, 4 that are in both French and German, and 6 that are in both Spanish and German. In addition, there are 2 students taking all 3 classes.

• Question: If a student is chosen randomly, what is the probability that he or she is not in any of these classes?


[From http://www.stat.ucla.edu/~rosario/classes/081/100a-2a/100aHW2Soln.pdf ]


Exercise 1: answers

• We want P(not in any class)

• We already know some probabilities:

• for instance, P(student in both French and German)= 4/100

• We start by calculating P(student in at least one class)


(Venn diagram over the 100 students: Spanish 28, French 26, German 16; Spanish∩French 12, French∩German 4, Spanish∩German 6; all three 2)

P(at least one class) = P(Fr) + P(Ge) + P(Sp) − P(Fr & Ge) − P(Fr & Sp) − P(Ge & Sp) + P(Fr & Sp & Ge)

= 0.26 + 0.16 + 0.28 − 0.04 − 0.12 − 0.06 + 0.02 = 0.5

• Then, P(not in any class) = 1 − P(at least one class) = 0.5
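A quick counting check in Python (not part of the original solution), using inclusion-exclusion on the raw counts:

```python
# Number of students in at least one class
in_at_least_one = 28 + 26 + 16 - 12 - 4 - 6 + 2   # = 50 students
print((100 - in_at_least_one) / 100)               # 0.5
```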


Exercise 2

• Toss a coin twice and observe whether heads (H) or tails (T) shows up. Assume event A refers to «at least one H shows up», and event B refers to «the same side shows up both times».

• Question: What is the probability of event B happening if it is already known that event A has happened?


http://www.u.arizona.edu/~sunjing/conditional_prob.pdf


Exercise 2

• Sample space: {HH, HT, TH, TT}

• Event A is the subset {HH, HT, TH}

• Event B is the subset {HH, TT}

• We are asked to calculate P(B|A)

• We know that P(B|A) = \frac{P(B ∩ A)}{P(A)}

• The intersection of B and A is {HH}, so:

P(B|A) = \frac{1/4}{3/4} = \frac{1}{3}
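A quick enumeration check in Python (not part of the original solution):

```python
from itertools import product

space = list(product("HT", repeat=2))          # sample space: HH, HT, TH, TT
A = [s for s in space if "H" in s]             # at least one head
B = [s for s in space if s[0] == s[1]]         # same side both times
print(len([s for s in A if s in B]) / len(A))  # 0.333... = 1/3
```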


Exercise 3

• From previous data analysis, it was found that:

• when the machine operates well, the probability of producing good-quality products is 0.98;

• when the machine is defective, the probability of producing good-quality products is 0.55;

• every morning, when the machine is started, the probability that the machine is adjusted to normal operation is 0.95.

• Question: If, one morning, the first product produced by the machine is of good quality, what is the probability that the machine is adjusted to normal operation that morning?


http://www.u.arizona.edu/~sunjing/conditional_prob.pdf


Exercise 3

• We know that:

• P(product=good|condition=good) = 0.98

• P(product=good|condition=defect) = 0.55

• P(condition=good) = 0.95

• We want to know P(condition|product=good), using Bayes’ rule:


P(cond=defect | prod=good) = α P(prod=good | cond=defect) P(cond=defect) = α × 0.55 × 0.05

P(cond=good | prod=good) = α P(prod=good | cond=good) P(cond=good) = α × 0.98 × 0.95

If we then renormalise with α ≈ 1.043, we see that P(cond=good | prod=good) ≈ 0.97
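A quick numerical check in Python (not part of the original solution):

```python
# Unnormalised posteriors P(prod=good | cond) * P(cond)
unnorm_good   = 0.98 * 0.95
unnorm_defect = 0.55 * 0.05
alpha = 1 / (unnorm_good + unnorm_defect)
print(alpha, alpha * unnorm_good)   # ~1.043, ~0.971
```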


Exercise 4

• Consider a lottery with three possible outcomes: $125 will be received with probability 0.2, $100 with probability 0.3, and $50 with probability 0.5

• Question: what is the expectation and variance of the lottery?


Exercise 4


E(Lottery) = \sum_{x ∈ Dom(Lottery)} x P(Lottery = x)

= 0.2 × 125 + 0.3 × 100 + 0.5 × 50 = 80

Var(Lottery) = E((Lottery − E(Lottery))²)

= \sum_{x ∈ Dom(Lottery)} (x − E(Lottery))² P(Lottery = x)

= 0.2 × (125 − 80)² + 0.3 × (100 − 80)² + 0.5 × (50 − 80)² = 975

or alternatively:

Var(Lottery) = E(Lottery²) − E(Lottery)²

= \left[ \sum_{x ∈ Dom(Lottery)} x² P(Lottery = x) \right] − E(Lottery)²

= 0.2 × 125² + 0.3 × 100² + 0.5 × 50² − 80² = 975


Exercise 5 (advanced)

• The definitions of expectation and variance we used until now are only valid for discrete variables

• For continuous variables, they are actually defined as such:

• Now, assume the following PDF:

• Compute its expectation!


E(X) = \int_{−∞}^{+∞} x f(x) dx        Var(X) = \int_{−∞}^{+∞} (x − E(X))² f(x) dx

f(x) = 2(1 − x) if 0 ≤ x ≤ 1, and 0 otherwise


Exercise 5 (advanced)

• Here’s the development:


E(X) = \int_{−∞}^{+∞} x f(x) dx

= \int_0^1 x × 2(1 − x) dx        (since the function is 0 outside [0,1])

= \int_0^1 (2x − 2x²) dx

= \left[ \frac{2x²}{2} − \frac{2x³}{3} \right]_0^1        (using the basic integration formula \int x^n dx = \frac{x^{n+1}}{n+1})

= (1 − \frac{2}{3}) − 0 = \frac{1}{3}
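A quick symbolic check (not part of the original solution), assuming sympy is available:

```python
from sympy import symbols, integrate

x = symbols("x")
# E(X) = integral of x * f(x) over [0, 1]
print(integrate(x * 2 * (1 - x), (x, 0, 1)))   # 1/3
```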


Exercise 6.1


(Same burglary network and conditional probability tables as on the earlier Bayesian Networks slide.)

Question: calculate P(Alarm)


Exercise 6.1

• We want to calculate P(Alarm)

• What we have is P(Alarm|Burglary,Earthquake)

• We can do marginalisation:


P(A) = \sum_{b ∈ \{true,false\}} P(A, B=b)

= \sum_{b ∈ \{true,false\}} \sum_{e ∈ \{true,false\}} P(A, B=b, E=e)

= \sum_{b ∈ \{true,false\}} \sum_{e ∈ \{true,false\}} P(A | B=b, E=e) P(B=b) P(E=e)

= 0.95 × 0.001 × 0.002 + 0.95 × 0.001 × 0.998 + 0.29 × 0.999 × 0.002 + 0.001 × 0.999 × 0.998

= 0.00253
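A quick numerical check in Python (not part of the original solution), encoding the conditional probability tables as dicts:

```python
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.95,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)

# P(A=true) by summing out B and E
p_alarm = sum(P_A[(b, e)] * P_B[b] * P_E[e]
              for b in (True, False) for e in (True, False))
print(p_alarm)   # ~0.00253
```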


Exercise 6.2


(Same burglary network and conditional probability tables as on the earlier Bayesian Networks slide.)

Question: calculate P(Earthquake|MaryCalls=true)


Exercise 6.2


P(E=true | MC=true)

= α P(MC=true | E) P(E)        (Bayes’ rule)

= α \left[ \sum_{a ∈ \{true,false\}} P(MC=true, A=a | E) \right] P(E)        (marginalisation on A)

= α \left[ \sum_{a ∈ \{true,false\}} P(MC=true | A=a) P(A=a | E) \right] P(E)        (since P(MC, A | E) = P(MC | A, E) P(A | E) = P(MC | A) P(A | E))

= α \left[ \sum_{a ∈ \{true,false\}} P(MC=true | A=a) \sum_{b ∈ \{true,false\}} P(A=a | B=b, E) P(B=b) \right] P(E)        (marginalisation on B)


Exercise 6.2


P(E=true | MC=true)
= α (0.7 × 0.95 × 0.001 + 0.7 × 0.29 × 0.999 + 0.01 × 0.05 × 0.001 + 0.01 × 0.71 × 0.999) × 0.002
= α × 4.211 × 10^{−4}

P(E=false | MC=true)
= α (0.7 × 0.95 × 0.001 + 0.7 × 0.001 × 0.999 + 0.01 × 0.05 × 0.001 + 0.01 × 0.999 × 0.999) × 0.998
= α × 0.01132

Renormalising, we get: P(E=true | MC=true) ≈ 0.0359
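A quick numerical check in Python (not part of the original solution), reusing the same tables:

```python
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.95,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_MC = {True: 0.70, False: 0.01}                      # P(MC=true | A)

def p_a_given_e(a, e):
    """P(A=a | E=e) = sum_b P(A=a | B=b, E=e) P(B=b)."""
    p_true = sum(P_A[(b, e)] * P_B[b] for b in (True, False))
    return p_true if a else 1 - p_true

# Unnormalised P(E, MC=true), then renormalise over E
unnorm = {e: sum(P_MC[a] * p_a_given_e(a, e) for a in (True, False)) * P_E[e]
          for e in (True, False)}
print(unnorm[True] / sum(unnorm.values()))    # ~0.0359
```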
