Lecture 6: Probabilistic modelling (part 2)
Pierre Lison, Language Technology Group (LTG)
Department of Informatics
Fall 2012, September 17, 2012
© 2012, Pierre Lison - INF5820 course
Outline
• Review of last week’s concepts
• Special cases of Bayesian Networks
• Exercises
Outline
• Review of last week’s concepts
• Special cases of Bayesian Networks
• Exercises
Random variables
• A random variable is a variable whose value is subject to variations due to chance
• The domain of possible values for the random variable can be categorical, continuous or discrete
• A probability distribution can be defined over the variable domain:
• Via a probability mass function (PMF) for discrete/categorical variables
• Via a probability density function (PDF) for continuous variables
Example of discrete distribution
• Well-known discrete distribution: the binomial B(n,p)
• Describes the total number of successes in a sequence of independent yes/no experiments
• n is the number of experiments
• p is the probability of success («yes») for a single experiment
(Figure: binomial distribution with n=12 and p=0.5)
Probability mass function (PMF) for a binomial B(n,p):

P(K = k) = \binom{n}{k} p^k (1-p)^{n-k}, with \binom{n}{k} = \frac{n!}{k!(n-k)!}
NB: you don’t have to remember the PMF!
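As a quick sanity check, here is a minimal Python sketch of this PMF, using only math.comb from the standard library:

    from math import comb

    def binomial_pmf(k: int, n: int, p: float) -> float:
        """P(K = k) for a binomial B(n, p)."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # The distribution shown in the figure: n=12, p=0.5
    print(binomial_pmf(6, 12, 0.5))  # ≈ 0.2256, the peak at k=6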
Example of continuous distribution
• Well-known distribution for continuous variables: the normal (or Gaussian) distribution
• μ and σ² are the two parameters of this distribution
• μ is the mean and defines the «centre» of the bell curve
• σ² is the variance and defines the spread of the curve

Probability density function (PDF) for a normal N(μ, σ²):

f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-(x-\mu)^2/(2\sigma^2)}
NB: you don’t have to remember the PDF!
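Similarly, a minimal Python sketch of the normal density, parameterised directly by the variance σ²:

    from math import exp, pi, sqrt

    def normal_pdf(x: float, mu: float, var: float) -> float:
        """Density of N(mu, var) at point x."""
        return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

    print(normal_pdf(0.0, 0.0, 1.0))  # ≈ 0.3989, the peak of the standard normal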
Bayes’ rule
• A central formula in probabilistic inference is Bayes’ rule:
P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} = \alpha \, P(B|A) \, P(A)

where P(A|B) is the posterior, P(B|A) the likelihood, P(A) the prior, and α = 1/P(B) the normalisation factor
Bayes’ rule in ML
• Bayes’ rule is particularly important for machine learning
• Let’s say we have a dataset D, and we want to find the hypothesis h* which provides the best fit for the data
• In other words, we search for:
h^* = \arg\max_h P(h|D)

• But we have no idea how to estimate P(h|D)!
Bayes’ rule in ML
• Fortunately, we can use Bayes’ rule:
P(h|D) = \frac{P(D|h) \, P(h)}{P(D)}

• P(h): prior probability of the hypothesis
• P(D|h): likelihood of observing the data D if h is true
• P(D): a normalisation factor (can often be ignored)
• Finding the best hypothesis is then:

h^* = \arg\max_h P(D|h) \, P(h)
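As an illustration, here is a minimal sketch of this search (the so-called MAP hypothesis); the priors and likelihoods below are hypothetical numbers, not taken from any real dataset:

    # Hypothetical priors P(h) and likelihoods P(D|h), for illustration only
    priors = {"h1": 0.7, "h2": 0.3}
    likelihoods = {"h1": 0.1, "h2": 0.4}

    # argmax over h of P(D|h) P(h)
    h_star = max(priors, key=lambda h: likelihoods[h] * priors[h])
    print(h_star)  # h2, since 0.4 × 0.3 = 0.12 > 0.1 × 0.7 = 0.07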
Expectation and variance
• Important measures associated with a (numeric) random variable X:
• The expectation E(X), which is the mean of the variable:

E(X) = \sum_{x \in Domain(X)} x \, P(X = x)

• The variance Var(X), which is the «spread» of the variable (i.e. how much the values tend to diverge from the mean):

Var(X) = E\big((X - E(X))^2\big) = E(X^2) - E(X)^2
Marginalisation
• A very useful operation is marginalisation (also called summing-out):
• The operation simply follows from the fact that probabilities must add up to 1
• Marginalisation is very useful when P(X) is unknown, but P(X|Y) and P(Y) are known
P(X) = \sum_{y \in Domain(Y)} P(X, Y = y) = \sum_{y \in Domain(Y)} P(X | Y = y) \, P(Y = y)
Bayesian Networks
(Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls)

P(B) = 0.001
P(E) = 0.002

B      E      P(A|B,E)
true   true   0.95
true   false  0.95
false  true   0.29
false  false  0.001

A      P(JC|A)
true   0.9
false  0.05

A      P(MC|A)
true   0.70
false  0.01
(we abbreviate the variables in the tables by their first letter)
Outline
• Review of last week’s concepts
• Special cases of Bayesian Networks
• Exercises
Types of models
• Bayesian Networks define a general class of probabilistic models
• The probabilistic models used in spoken dialogue systems often rely on special cases of Bayesian Networks
• We’re going to mention two special cases:
• Markov chains (used for e.g. language modelling)
• Hidden Markov Models (used for e.g. acoustic modelling)
Markov Chains
• A Markov Chain is a probabilistic model for a sequence of variables X1, X2, ..., Xn, where for all 2 ≤ i ≤ n, we have:
P(Xi | Xi-1, ..., X1) = P(Xi | Xi-1)

• In other words, every variable Xi is only dependent on its predecessor Xi-1

(Diagram: X1 → X2 → ... → Xi-1 → Xi → ... → Xn)
Markov Chains
• A Markov Chain of order k is a probabilistic model for a sequence of variables X1, X2, ..., Xn, where for all k < i ≤ n, we have:
P(Xi | Xi-1, ..., X1) = P(Xi | Xi-1, ..., Xi-k)

• In other words, every variable Xi is only dependent on its k predecessors Xi-k, ..., Xi-1

(Diagram: X1 → ... → Xn, where each Xi has incoming edges from both Xi-1 and Xi-2; here for k=2)
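To make the definition concrete, here is a minimal sketch of sampling from a first-order Markov Chain; the two-state transition table is a toy example, not from the lecture:

    import random

    # Toy transition probabilities P(Xi | Xi-1)
    transitions = {
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.6},
    }

    def sample_chain(start, n):
        """Sample X1..Xn, where each state only depends on its predecessor."""
        states = [start]
        for _ in range(n - 1):
            dist = transitions[states[-1]]
            states.append(random.choices(list(dist), weights=list(dist.values()))[0])
        return states

    print(sample_chain("sunny", 10))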
Markov Chains
• Example of application: Markov Chains can easily encode language models
• A language model defines the probability for a sequence of words w1,...wn
• Often by estimating the probabilities of each word wi given its k predecessors wi-1,...wi-k
• These models are called n-gram models, where n = k+1
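For instance, a bigram model (k=1, so n=2) can be estimated from raw counts; the sketch below uses a toy corpus and plain maximum-likelihood estimates, without any smoothing:

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate".split()  # toy corpus

    bigram_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])

    # P(wi | wi-1) as relative frequencies
    prob = defaultdict(dict)
    for (w1, w2), c in bigram_counts.items():
        prob[w1][w2] = c / context_counts[w1]

    print(prob["the"])  # {'cat': 0.666..., 'mat': 0.333...}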
Hidden Markov Models
• Hidden Markov Models (HMMs) are another probabilistic model for describing a sequence of variables X1, ..., Xn
• The particularity of HMMs is that the chain itself is not directly observable
• But we can obtain indirect evidence via additional observation nodes
Hidden Markov Models
• A Hidden Markov Model extends a Markov chain for X1,...Xn with additional variables O1,...On
• An observation variable Oi only depends on its associated state variable Xi

(Diagram: hidden chain X1 → X2 → ... → Xi-1 → Xi → ... → Xn, with each state Xi emitting an observation Oi)
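A minimal sketch of this generative process, with hypothetical transition and emission tables (hidden weather states emitting observed activities):

    import random

    trans = {"sunny": {"sunny": 0.8, "rainy": 0.2},
             "rainy": {"sunny": 0.4, "rainy": 0.6}}
    emit = {"sunny": {"walk": 0.7, "shop": 0.3},
            "rainy": {"walk": 0.1, "shop": 0.9}}

    def pick(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def sample_hmm(start, n):
        """Sample hidden states X1..Xn together with observations O1..On."""
        x, xs, os = start, [], []
        for _ in range(n):
            xs.append(x)
            os.append(pick(emit[x]))  # Oi only depends on Xi
            x = pick(trans[x])        # the hidden chain is Markov
        return xs, os

    print(sample_hmm("sunny", 5))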
Hidden Markov Models
• Hidden Markov Models are notably used in acoustic modelling for speech recognition
• Here, the objective is to find the most likely sequence of phonemes corresponding to what the user said
• But the only information we have is a set of low-level acoustic features (the observation)
• Given a set of observations and the likelihood of the phoneme sequence, we can then calculate the most likely hypothesis for the utterance
Outline
• Review of last week’s concepts
• Special cases of Bayesian Networks
• Exercises
Exercise 1
• An elementary school is offering 3 language classes: one in Spanish, one in French, and one in German. These classes are open to any of the 100 students in the school. There are 28 students in the Spanish class, 26 in the French class, and 16 in the German class. There are 12 students that are in both Spanish and French, 4 that are in both French and German, and 6 that are in both Spanish and German. In addition, there are 2 students taking all 3 classes.
• Question: If a student is chosen randomly, what is the probability that he or she is not in any of these classes?
[From http://www.stat.ucla.edu/~rosario/classes/081/100a-2a/100aHW2Soln.pdf ]
Exercise 1: answers
• We want P(not in any class)
• We already know some probabilities:
• for instance, P(student in both French and German) = 4/100
• We start by calculating P(student in at least one class)
(Venn diagram: 100 students in total; Spanish 28, French 26, German 16; Spanish∩French 12, French∩German 4, Spanish∩German 6; all three classes 2)
P(at least one class)
= P(Fr) + P(Ge) + P(Sp) - P(Fr & Ge) - P(Fr & Sp) - P(Ge & Sp) + P(Fr & Sp & Ge)
= 0.26 + 0.16 + 0.28 - 0.04 - 0.12 - 0.06 + 0.02 = 0.5
• Then, P(not in any class) = 1 - P(at least one class) = 0.5
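The inclusion-exclusion arithmetic can be double-checked in a few lines of Python:

    # Probabilities from the exercise
    fr, ge, sp = 0.26, 0.16, 0.28
    fr_ge, fr_sp, ge_sp, all3 = 0.04, 0.12, 0.06, 0.02

    at_least_one = fr + ge + sp - fr_ge - fr_sp - ge_sp + all3
    print(1 - at_least_one)  # ≈ 0.5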
Exercise 2
• Toss a coin twice and observe whether the head (H) or tail (T) shows up. Assume event A refers to «at least one H shows up», and event B refers to «the same side shows up both times».
• Question: What is the probability of event B happening if it is already known that event A has happened?
[From http://www.u.arizona.edu/~sunjing/conditional_prob.pdf]
Exercise 2
• Sample space: {HH, HT, TH, TT}
• Event A is the subset {HH, HT, TH}
• Event B is the subset {HH, TT}
• We are asked to calculate P(B|A)
• We know that
P(B|A) = \frac{P(B \cap A)}{P(A)}

• The intersection of B and A is {HH}, so:

P(B|A) = \frac{1/4}{3/4} = \frac{1}{3}
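Since the sample space is tiny, the answer can also be checked by brute-force enumeration:

    from itertools import product

    space = list(product("HT", repeat=2))   # HH, HT, TH, TT
    A = [s for s in space if "H" in s]      # at least one head
    B = [s for s in space if s[0] == s[1]]  # same side both times

    print(len([s for s in A if s in B]) / len(A))  # 0.333...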
Exercise 3
• From previous data analysis, it was found that:
• when the machine operates well, the probability of producing good-quality products is 0.98;
• when the machine is defective, the probability of producing good-quality products is 0.55;
• every morning, when the machine is started up, the probability that it is adjusted to normal operation is 0.95.
• Question: If one morning the first product produced by the machine is of good quality, what is the probability that the machine is adjusted to normal operation that morning?
[From http://www.u.arizona.edu/~sunjing/conditional_prob.pdf]
Exercise 3
• We know that:
• P(product=good | condition=good) = 0.98
• P(product=good | condition=defect) = 0.55
• P(condition=good) = 0.95
• We want to know P(condition|product=good), using Bayes’ rule:
P(cond=defect | prod=good) = α P(prod=good | cond=defect) P(cond=defect) = α × 0.55 × 0.05

P(cond=good | prod=good) = α P(prod=good | cond=good) P(cond=good) = α × 0.98 × 0.95

If we then renormalise with α = 1/(0.55 × 0.05 + 0.98 × 0.95) ≈ 1.043, we see that P(cond=good | prod=good) ≈ 0.97
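The same computation in a few lines of Python:

    p_good_given_ok, p_good_given_defect = 0.98, 0.55
    p_ok = 0.95

    unnorm_ok = p_good_given_ok * p_ok                 # 0.931
    unnorm_defect = p_good_given_defect * (1 - p_ok)   # 0.0275
    alpha = 1 / (unnorm_ok + unnorm_defect)            # ≈ 1.043
    print(alpha * unnorm_ok)                           # ≈ 0.971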
Exercise 4
• Consider a lottery with three possible outcomes: $125 will be received with probability 0.2, $100 with probability 0.3, and $50 with probability 0.5
• Question: what is the expectation and variance of the lottery?
Exercise 4
E(Lottery) = \sum_{x \in Dom(Lottery)} x \, P(Lottery = x)
           = 0.2 × 125 + 0.3 × 100 + 0.5 × 50 = 80

Var(Lottery) = E\big((Lottery - E(Lottery))^2\big)
             = \sum_{x \in Dom(Lottery)} (x - E(Lottery))^2 \, P(Lottery = x)
             = 0.2 × (125 - 80)² + 0.3 × (100 - 80)² + 0.5 × (50 - 80)² = 975

or alternatively:

Var(Lottery) = E(Lottery²) - E(Lottery)²
             = \left( \sum_{x \in Dom(Lottery)} x² \, P(Lottery = x) \right) - E(Lottery)²
             = 0.2 × 125² + 0.3 × 100² + 0.5 × 50² - 80² = 975
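Both variance formulas are easy to verify numerically:

    dist = {125: 0.2, 100: 0.3, 50: 0.5}  # outcome -> probability

    e = sum(x * p for x, p in dist.items())                   # 80.0
    var = sum((x - e)**2 * p for x, p in dist.items())        # ≈ 975.0
    var_alt = sum(x**2 * p for x, p in dist.items()) - e**2   # ≈ 975.0
    print(e, var, var_alt)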
Exercise 5 (advanced)
• The definitions of expectation and variance we used until now are only valid for discrete variables
• For continuous variables, they are defined as follows:

E(X) = \int_{-\infty}^{+\infty} x f(x) \, dx        Var(X) = \int_{-\infty}^{+\infty} (x - E(X))^2 f(x) \, dx

• Now, assume the following PDF:

f(x) = \begin{cases} 2(1-x) & \text{if } 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}

• Compute its expectation!
Exercise 5 (advanced)
• Here’s the development:
E(X) = \int_{-\infty}^{+\infty} x f(x) \, dx
     = \int_0^1 x \cdot 2(1-x) \, dx          (since the function is 0 outside [0,1])
     = \int_0^1 (2x - 2x^2) \, dx
     = \left[ \frac{2x^2}{2} - \frac{2x^3}{3} \right]_0^1          (since \int x^n dx = \frac{x^{n+1}}{n+1}, the basic formula for integration)
     = (1 - \frac{2}{3}) - 0 = \frac{1}{3}
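The integral can also be verified symbolically, e.g. with sympy (assuming the library is available):

    import sympy as sp

    x = sp.symbols("x")
    f = 2 * (1 - x)  # the PDF on [0, 1]; it is 0 elsewhere
    print(sp.integrate(x * f, (x, 0, 1)))  # 1/3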
Exercise 6.1
(Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls)

P(B) = 0.001
P(E) = 0.002

B      E      P(A|B,E)
true   true   0.95
true   false  0.95
false  true   0.29
false  false  0.001

A      P(JC|A)
true   0.9
false  0.05

A      P(MC|A)
true   0.70
false  0.01
(we abbreviate the variables in the tables by their first letter)
Question: calculate P(Alarm)
Exercise 6.1
• We want to calculate P(Alarm)
• What we have is P(Alarm|Burglary,Earthquake)
• We can do marginalisation:
P(A)
= \sum_{b \in \{true,false\}} P(A, B=b)
= \sum_{b \in \{true,false\}} \sum_{e \in \{true,false\}} P(A, B=b, E=e)
= \sum_{b \in \{true,false\}} \sum_{e \in \{true,false\}} P(A | B=b, E=e) \, P(B=b) \, P(E=e)          (since B and E are independent)
= 0.95 × 0.001 × 0.002 + 0.95 × 0.001 × 0.998 + 0.29 × 0.999 × 0.002 + 0.001 × 0.999 × 0.998
= 0.00253
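The same double marginalisation in Python:

    p_b, p_e = 0.001, 0.002
    p_a = {(True, True): 0.95, (True, False): 0.95,
           (False, True): 0.29, (False, False): 0.001}  # P(A=true | B, E)

    p_alarm = sum(p_a[b, e] * (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
                  for b in (True, False) for e in (True, False))
    print(p_alarm)  # ≈ 0.00253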
Exercise 6.2
(Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls)

P(B) = 0.001
P(E) = 0.002

B      E      P(A|B,E)
true   true   0.95
true   false  0.95
false  true   0.29
false  false  0.001

A      P(JC|A)
true   0.9
false  0.05

A      P(MC|A)
true   0.70
false  0.01
(we abbreviate the variables in the tables by their first letter)
Question: calculate P(Earthquake|MaryCalls=true)
Exercise 6.2
P(E=true | MC=true)
= α \, P(MC=true | E) \, P(E)          (Bayes’ rule)
= α \left[ \sum_{a \in \{true,false\}} P(MC=true, A=a | E) \right] P(E)          (marginalisation on A)
= α \left[ \sum_{a \in \{true,false\}} P(MC=true | A=a) \, P(A=a | E) \right] P(E)          (since P(MC,A|E) = P(MC|A,E) P(A|E) = P(MC|A) P(A|E))
= α \left[ \sum_{a \in \{true,false\}} P(MC=true | A=a) \sum_{b \in \{true,false\}} P(A=a | B=b, E) \, P(B=b) \right] P(E)          (marginalisation on B)
Exercise 6.2
P(E=true | MC=true)
= α (0.7 × 0.95 × 0.001 + 0.7 × 0.29 × 0.999 + 0.01 × 0.05 × 0.001 + 0.01 × 0.71 × 0.999) × 0.002
= α × 4.211 × 10⁻⁴

P(E=false | MC=true)
= α (0.7 × 0.95 × 0.001 + 0.7 × 0.001 × 0.999 + 0.01 × 0.05 × 0.001 + 0.01 × 0.999 × 0.999) × 0.998
= α × 0.01132

Renormalising, we get: P(E=true | MC=true) ≈ 0.0359
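The whole computation, renormalisation included, in a short Python sketch:

    p_b, p_e = 0.001, 0.002
    p_a = {(True, True): 0.95, (True, False): 0.95,
           (False, True): 0.29, (False, False): 0.001}  # P(A=true | B, E)
    p_mc = {True: 0.70, False: 0.01}                    # P(MC=true | A)

    def prob(value, p_true):
        """P(X=value) for a boolean variable with P(X=true) = p_true."""
        return p_true if value else 1 - p_true

    unnorm = {}
    for e in (True, False):
        s = 0.0
        for a in (True, False):
            # P(A=a | E=e) = sum over b of P(A=a | B=b, E=e) P(B=b)
            p_a_e = sum(prob(a, p_a[b, e]) * prob(b, p_b) for b in (True, False))
            s += p_mc[a] * p_a_e
        unnorm[e] = s * prob(e, p_e)

    print(unnorm[True] / sum(unnorm.values()))  # ≈ 0.0359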