Lecture 6: Probabilistic modelling (part 2)
Pierre Lison, Language Technology Group (LTG)
Department of Informatics
Fall 2012, September 17, 2012
© 2012, Pierre Lison - INF5820 course
Outline
• Review of last week’s concepts
• Special cases of Bayesian Networks
• Exercises
Outline
• Review of last week’s concepts
• Special cases of Bayesian Networks
• Exercises
Random variables
• A random variable is a variable whose value is subject to variations due to chance
• The domain of possible values for the random variable can be categorical, continuous or discrete
• A probability distribution can be defined over the variable domain:
• Via a probability mass function (PMF) for discrete/categorical variables
• Via a probability density function (PDF) for continuous variables
Example of discrete distribution
• Well-known discrete distribution: the binomial B(n,p)
• Describes the total number of successes in a sequence of independent yes/no experiments
• n is the number of experiments
• p is the probability of success («yes») for a single experiment
(Figure: binomial distribution with n=12 and p=0.5)
Probability mass function (PMF) for a binomial B(n,p):

P(K = k) = \binom{n}{k} p^k (1-p)^{n-k}, with \binom{n}{k} = \frac{n!}{k!(n-k)!}
NB: you don’t have to remember the PMF!
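As a quick sanity check, here is a minimal Python sketch of this PMF, using only math.comb from the standard library:

    from math import comb

    def binomial_pmf(k: int, n: int, p: float) -> float:
        """P(K = k) for a binomial B(n, p)."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # The distribution shown in the figure: n=12, p=0.5
    print(binomial_pmf(6, 12, 0.5))  # ≈ 0.2256, the peak at k=6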
Example of continuous distribution
• Well-known distribution for continuous variables: the normal (or Gaussian) distribution
• μ and σ² are the two parameters of this distribution
• μ is the mean and defines the «centre» of the bell curve
• σ² is the variance and defines the spread of the curve

Probability density function (PDF) for a normal N(μ, σ²):

f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-(x-\mu)^2/(2\sigma^2)}
NB: you don’t have to remember the PDF!
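Similarly, a minimal Python sketch of the normal density, parameterised directly by the variance σ²:

    from math import exp, pi, sqrt

    def normal_pdf(x: float, mu: float, var: float) -> float:
        """Density of N(mu, var) at point x."""
        return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

    print(normal_pdf(0.0, 0.0, 1.0))  # ≈ 0.3989, the peak of the standard normal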
Bayes’ rule
• A central formula in probabilistic inference is Bayes’ rule:
P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} = \alpha \, P(B|A) \, P(A)

where P(A|B) is the posterior, P(B|A) the likelihood, P(A) the prior, and α = 1/P(B) the normalisation factor
Bayes’ rule in ML
• Bayes’ rule is particularly important for machine learning
• Let’s say we have a dataset D, and we want to find the hypothesis h* which provides the best fit for the data
• In other words, we search for:
h^* = \arg\max_h P(h|D)

• But we have no idea how to estimate P(h|D)!
Bayes’ rule in ML
• Fortunately, we can use Bayes’ rule:
P(h|D) = \frac{P(D|h) \, P(h)}{P(D)}

• P(h): prior probability of the hypothesis
• P(D|h): likelihood of observing the data D if h is true
• P(D): a normalisation factor (can often be ignored)
• Finding the best hypothesis is then:

h^* = \arg\max_h P(D|h) \, P(h)
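As an illustration, here is a minimal sketch of this search (the so-called MAP hypothesis); the priors and likelihoods below are hypothetical numbers, not taken from any real dataset:

    # Hypothetical priors P(h) and likelihoods P(D|h), for illustration only
    priors = {"h1": 0.7, "h2": 0.3}
    likelihoods = {"h1": 0.1, "h2": 0.4}

    # argmax over h of P(D|h) P(h)
    h_star = max(priors, key=lambda h: likelihoods[h] * priors[h])
    print(h_star)  # h2, since 0.4 × 0.3 = 0.12 > 0.1 × 0.7 = 0.07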
Expectation and variance
• Important measures associated with a (numeric) random variable X:
• The expectation E(X), which is the mean of the variable:

E(X) = \sum_{x \in Domain(X)} x \, P(X = x)

• The variance Var(X), which is the «spread» of the variable (i.e. how much the values tend to diverge from the mean):

Var(X) = E\big((X - E(X))^2\big) = E(X^2) - E(X)^2
Marginalisation
• A very useful operation is marginalisation (also called summing-out):
• The operation simply follows from the fact that probabilities must add up to 1
• Marginalisation is very useful when P(X) is unknown, but P(X|Y) and P(Y) are known
P(X) = \sum_{y \in Domain(Y)} P(X, Y = y) = \sum_{y \in Domain(Y)} P(X | Y = y) \, P(Y = y)
Bayesian Networks
(Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls)

P(B) = 0.001
P(E) = 0.002

B      E      P(A|B,E)
true   true   0.95
true   false  0.95
false  true   0.29
false  false  0.001

A      P(JC|A)
true   0.9
false  0.05

A      P(MC|A)
true   0.70
false  0.01
(we abbreviate the variables in the tables by their first letter)
Outline
• Review of last week’s concepts
• Special cases of Bayesian Networks
• Exercises
Types of models
• Bayesian Networks define a general class of probabilistic models
• The probabilistic models used in spoken dialogue systems often rely on special cases of Bayesian Networks
• We’re going to mention two special cases:
• Markov chains (used for e.g. language modelling)
• Hidden Markov Models (used for e.g. acoustic modelling)
Markov Chains
• A Markov Chain is a probabilistic model for a sequence of variables X1, X2, ..., Xn, where for all 2 ≤ i ≤ n, we have:
P(Xi | Xi-1, ..., X1) = P(Xi | Xi-1)

• In other words, every variable Xi is only dependent on its predecessor Xi-1

(Diagram: X1 → X2 → ... → Xi-1 → Xi → ... → Xn)
Markov Chains
• A Markov Chain of order k is a probabilistic model for a sequence of variables X1, X2, ..., Xn, where for all k < i ≤ n, we have:
P(Xi | Xi-1, ..., X1) = P(Xi | Xi-1, ..., Xi-k)

• In other words, every variable Xi is only dependent on its k predecessors Xi-k, ..., Xi-1

(Diagram: X1 → ... → Xn, where each Xi has incoming edges from both Xi-1 and Xi-2; here for k=2)
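To make the definition concrete, here is a minimal sketch of sampling from a first-order Markov Chain; the two-state transition table is a toy example, not from the lecture:

    import random

    # Toy transition probabilities P(Xi | Xi-1)
    transitions = {
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.6},
    }

    def sample_chain(start, n):
        """Sample X1..Xn, where each state only depends on its predecessor."""
        states = [start]
        for _ in range(n - 1):
            dist = transitions[states[-1]]
            states.append(random.choices(list(dist), weights=list(dist.values()))[0])
        return states

    print(sample_chain("sunny", 10))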
Markov Chains
• Example of application: Markov Chains can easily encode language models
• A language model defines the probability for a sequence of words w1,...wn
• Often by estimating the probabilities of each word wi given its k predecessors wi-1,...wi-k
• These models are called n-gram models, where n = k+1
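For instance, a bigram model (k=1, so n=2) can be estimated from raw counts; the sketch below uses a toy corpus and plain maximum-likelihood estimates, without any smoothing:

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate".split()  # toy corpus

    bigram_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])

    # P(wi | wi-1) as relative frequencies
    prob = defaultdict(dict)
    for (w1, w2), c in bigram_counts.items():
        prob[w1][w2] = c / context_counts[w1]

    print(prob["the"])  # {'cat': 0.666..., 'mat': 0.333...}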
Hidden Markov Models
• Hidden Markov Models (HMMs) are another probabilistic model for describing a sequence of variables X1, ..., Xn
• The particularity of HMMs is that the chain itself is not directly observable
• But we can obtain indirect evidence via additional observation nodes
Hidden Markov Models
• A Hidden Markov Model extends a Markov chain for X1,...Xn with additional variables O1,...On
• An observation variable Oi only depends on its associated state variable Xi

(Diagram: hidden chain X1 → X2 → ... → Xi-1 → Xi → ... → Xn, with each state Xi emitting an observation Oi)
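A minimal sketch of this generative process, with hypothetical transition and emission tables (hidden weather states emitting observed activities):

    import random

    trans = {"sunny": {"sunny": 0.8, "rainy": 0.2},
             "rainy": {"sunny": 0.4, "rainy": 0.6}}
    emit = {"sunny": {"walk": 0.7, "shop": 0.3},
            "rainy": {"walk": 0.1, "shop": 0.9}}

    def pick(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def sample_hmm(start, n):
        """Sample hidden states X1..Xn together with observations O1..On."""
        x, xs, os = start, [], []
        for _ in range(n):
            xs.append(x)
            os.append(pick(emit[x]))  # Oi only depends on Xi
            x = pick(trans[x])        # the hidden chain is Markov
        return xs, os

    print(sample_hmm("sunny", 5))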
Hidden Markov Models
• Hidden Markov Models are notably used in acoustic modelling for speech recognition
• Here, the objective is to find the most likely sequence of phonemes corresponding to what the user said
• But the only information we have is a set of low-level acoustic features (the observation)
• Given a set of observations and the likelihood of the phoneme sequence, we can then calculate the most likely hypothesis for the utterance
Outline
• Review of last week’s concepts
• Special cases of Bayesian Networks
• Exercises
Exercise 1
• An elementary school is offering 3 language classes: one in Spanish, one in French, and one in German. These classes are open to any of the 100 students in the school. There are 28 students in the Spanish class, 26 in the French class, and 16 in the German class. There are 12 students that are in both Spanish and French, 4 that are in both French and German, and 6 that are in both Spanish and German. In addition, there are 2 students taking all 3 classes.
• Question: If a student is chosen randomly, what is the probability that he or she is not in any of these classes?
[From http://www.stat.ucla.edu/~rosario/classes/081/100a-2a/100aHW2Soln.pdf ]
Exercise 1: answers
• We want P(not in any class)
• We already know some probabilities:
• for instance, P(student in both French and German) = 4/100
• We start by calculating P(student in at least one class)
(Venn diagram: 100 students in total; Spanish 28, French 26, German 16; Spanish∩French 12, French∩German 4, Spanish∩German 6; all three classes 2)
P(at least one class)
= P(Fr) + P(Ge) + P(Sp) - P(Fr & Ge) - P(Fr & Sp) - P(Ge & Sp) + P(Fr & Sp & Ge)
= 0.26 + 0.16 + 0.28 - 0.04 - 0.12 - 0.06 + 0.02 = 0.5
• Then, P(not in any class) = 1 - P(at least one class) = 0.5
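The inclusion-exclusion arithmetic can be double-checked in a few lines of Python:

    # Probabilities from the exercise
    fr, ge, sp = 0.26, 0.16, 0.28
    fr_ge, fr_sp, ge_sp, all3 = 0.04, 0.12, 0.06, 0.02

    at_least_one = fr + ge + sp - fr_ge - fr_sp - ge_sp + all3
    print(1 - at_least_one)  # ≈ 0.5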
Exercise 2
• Toss a coin twice and observe whether the head (H) or tail (T) shows up. Assume event A refers to «at least one H shows up», and event B refers to «the same side shows up both times».
• Question: What is the probability of event B happening if it is already known that event A has happened?
[From http://www.u.arizona.edu/~sunjing/conditional_prob.pdf]
Exercise 2
• Sample space: {HH, HT, TH, TT}
• Event A is the subset {HH, HT, TH}
• Event B is the subset {HH, TT}
• We are asked to calculate P(B|A)
• We know that
P(B|A) = \frac{P(B \cap A)}{P(A)}

• The intersection of B and A is {HH}, so:

P(B|A) = \frac{1/4}{3/4} = \frac{1}{3}
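Since the sample space is tiny, the answer can also be checked by brute-force enumeration:

    from itertools import product

    space = list(product("HT", repeat=2))   # HH, HT, TH, TT
    A = [s for s in space if "H" in s]      # at least one head
    B = [s for s in space if s[0] == s[1]]  # same side both times

    print(len([s for s in A if s in B]) / len(A))  # 0.333...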
Exercise 3
• From previous data analysis, it was found that:
• when the machine operates well, the probability of producing good-quality products is 0.98;
• when the machine is defective, the probability of producing good-quality products is 0.55;
• every morning, when the machine is started up, the probability that it is adjusted to normal operation is 0.95.
• Question: If one morning the first product produced by the machine is of good quality, what is the probability that the machine is adjusted to normal operation that morning?
[From http://www.u.arizona.edu/~sunjing/conditional_prob.pdf]
Exercise 3
• We know that:
• P(product=good | condition=good) = 0.98
• P(product=good | condition=defect) = 0.55
• P(condition=good) = 0.95
• We want to know P(condition|product=good), using Bayes’ rule:
P(cond=defect | prod=good) = α P(prod=good | cond=defect) P(cond=defect) = α × 0.55 × 0.05

P(cond=good | prod=good) = α P(prod=good | cond=good) P(cond=good) = α × 0.98 × 0.95

If we then renormalise with α = 1/(0.55 × 0.05 + 0.98 × 0.95) ≈ 1.043, we see that P(cond=good | prod=good) ≈ 0.97
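The same computation in a few lines of Python:

    p_good_given_ok, p_good_given_defect = 0.98, 0.55
    p_ok = 0.95

    unnorm_ok = p_good_given_ok * p_ok                 # 0.931
    unnorm_defect = p_good_given_defect * (1 - p_ok)   # 0.0275
    alpha = 1 / (unnorm_ok + unnorm_defect)            # ≈ 1.043
    print(alpha * unnorm_ok)                           # ≈ 0.971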
Exercise 4
• Consider a lottery with three possible outcomes: $125 will be received with probability 0.2, $100 with probability 0.3, and $50 with probability 0.5
• Question: what is the expectation and variance of the lottery?
Exercise 4
E(Lottery) = \sum_{x \in Dom(Lottery)} x \, P(Lottery = x)
           = 0.2 × 125 + 0.3 × 100 + 0.5 × 50 = 80

Var(Lottery) = E\big((Lottery - E(Lottery))^2\big)
             = \sum_{x \in Dom(Lottery)} (x - E(Lottery))^2 \, P(Lottery = x)
             = 0.2 × (125 - 80)² + 0.3 × (100 - 80)² + 0.5 × (50 - 80)² = 975

or alternatively:

Var(Lottery) = E(Lottery²) - E(Lottery)²
             = \left( \sum_{x \in Dom(Lottery)} x² \, P(Lottery = x) \right) - E(Lottery)²
             = 0.2 × 125² + 0.3 × 100² + 0.5 × 50² - 80² = 975
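Both variance formulas are easy to verify numerically:

    dist = {125: 0.2, 100: 0.3, 50: 0.5}  # outcome -> probability

    e = sum(x * p for x, p in dist.items())                   # 80.0
    var = sum((x - e)**2 * p for x, p in dist.items())        # ≈ 975.0
    var_alt = sum(x**2 * p for x, p in dist.items()) - e**2   # ≈ 975.0
    print(e, var, var_alt)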
Exercise 5 (advanced)
• The definitions of expectation and variance we used until now are only valid for discrete variables
• For continuous variables, they are defined as follows:

E(X) = \int_{-\infty}^{+\infty} x f(x) \, dx        Var(X) = \int_{-\infty}^{+\infty} (x - E(X))^2 f(x) \, dx

• Now, assume the following PDF:

f(x) = \begin{cases} 2(1-x) & \text{if } 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}

• Compute its expectation!
Exercise 5 (advanced)
• Here’s the development:
E(X) = \int_{-\infty}^{+\infty} x f(x) \, dx
     = \int_0^1 x \cdot 2(1-x) \, dx          (since the function is 0 outside [0,1])
     = \int_0^1 (2x - 2x^2) \, dx
     = \left[ \frac{2x^2}{2} - \frac{2x^3}{3} \right]_0^1          (since \int x^n dx = \frac{x^{n+1}}{n+1}, the basic formula for integration)
     = (1 - \frac{2}{3}) - 0 = \frac{1}{3}
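The integral can also be verified symbolically, e.g. with sympy (assuming the library is available):

    import sympy as sp

    x = sp.symbols("x")
    f = 2 * (1 - x)  # the PDF on [0, 1]; it is 0 elsewhere
    print(sp.integrate(x * f, (x, 0, 1)))  # 1/3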
Exercise 6.1
(Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls)

P(B) = 0.001
P(E) = 0.002

B      E      P(A|B,E)
true   true   0.95
true   false  0.95
false  true   0.29
false  false  0.001

A      P(JC|A)
true   0.9
false  0.05

A      P(MC|A)
true   0.70
false  0.01
(we abbreviate the variables in the tables by their first letter)
Question: calculate P(Alarm)
Exercise 6.1
• We want to calculate P(Alarm)
• What we have is P(Alarm|Burglary,Earthquake)
• We can do marginalisation:
P(A)
= \sum_{b \in \{true,false\}} P(A, B=b)
= \sum_{b \in \{true,false\}} \sum_{e \in \{true,false\}} P(A, B=b, E=e)
= \sum_{b \in \{true,false\}} \sum_{e \in \{true,false\}} P(A | B=b, E=e) \, P(B=b) \, P(E=e)          (since B and E are independent)
= 0.95 × 0.001 × 0.002 + 0.95 × 0.001 × 0.998 + 0.29 × 0.999 × 0.002 + 0.001 × 0.999 × 0.998
= 0.00253
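The same double marginalisation in Python:

    p_b, p_e = 0.001, 0.002
    p_a = {(True, True): 0.95, (True, False): 0.95,
           (False, True): 0.29, (False, False): 0.001}  # P(A=true | B, E)

    p_alarm = sum(p_a[b, e] * (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)
                  for b in (True, False) for e in (True, False))
    print(p_alarm)  # ≈ 0.00253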
Exercise 6.2
(Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls)

P(B) = 0.001
P(E) = 0.002

B      E      P(A|B,E)
true   true   0.95
true   false  0.95
false  true   0.29
false  false  0.001

A      P(JC|A)
true   0.9
false  0.05

A      P(MC|A)
true   0.70
false  0.01
(we abbreviate the variables in the tables by their first letter)
Question: calculate P(Earthquake|MaryCalls=true)
Exercise 6.2
P(E=true | MC=true)
= α \, P(MC=true | E) \, P(E)          (Bayes’ rule)
= α \left[ \sum_{a \in \{true,false\}} P(MC=true, A=a | E) \right] P(E)          (marginalisation on A)
= α \left[ \sum_{a \in \{true,false\}} P(MC=true | A=a) \, P(A=a | E) \right] P(E)          (since P(MC,A|E) = P(MC|A,E) P(A|E) = P(MC|A) P(A|E))
= α \left[ \sum_{a \in \{true,false\}} P(MC=true | A=a) \sum_{b \in \{true,false\}} P(A=a | B=b, E) \, P(B=b) \right] P(E)          (marginalisation on B)
Exercise 6.2
P(E=true | MC=true)
= α (0.7 × 0.95 × 0.001 + 0.7 × 0.29 × 0.999 + 0.01 × 0.05 × 0.001 + 0.01 × 0.71 × 0.999) × 0.002
= α × 4.211 × 10⁻⁴

P(E=false | MC=true)
= α (0.7 × 0.95 × 0.001 + 0.7 × 0.001 × 0.999 + 0.01 × 0.05 × 0.001 + 0.01 × 0.999 × 0.999) × 0.998
= α × 0.01132

Renormalising, we get: P(E=true | MC=true) ≈ 0.0359
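The whole computation, renormalisation included, in a short Python sketch:

    p_b, p_e = 0.001, 0.002
    p_a = {(True, True): 0.95, (True, False): 0.95,
           (False, True): 0.29, (False, False): 0.001}  # P(A=true | B, E)
    p_mc = {True: 0.70, False: 0.01}                    # P(MC=true | A)

    def prob(value, p_true):
        """P(X=value) for a boolean variable with P(X=true) = p_true."""
        return p_true if value else 1 - p_true

    unnorm = {}
    for e in (True, False):
        s = 0.0
        for a in (True, False):
            # P(A=a | E=e) = sum over b of P(A=a | B=b, E=e) P(B=b)
            p_a_e = sum(prob(a, p_a[b, e]) * prob(b, p_b) for b in (True, False))
            s += p_mc[a] * p_a_e
        unnorm[e] = s * prob(e, p_e)

    print(unnorm[True] / sum(unnorm.values()))  # ≈ 0.0359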