L2 Probabilistic Reasoning


TRANSCRIPT

  • Slide 2/39

    The process of probabilistic inference

    1. define model of problem

    2. derive posterior distributions and estimators

    3. estimate parameters from data

    4. evaluate model accuracy

  • Slide 3/39

    Axioms of probability

    Axioms (Kolmogorov):

    0 ≤ P(A) ≤ 1

    P(true) = 1

    P(false) = 0

    P(A or B) = P(A) + P(B) − P(A and B)

    Corollaries:

    - A single random variable's distribution must sum to 1:

      Σ_{i=1}^{n} P(D = d_i) = 1

    - The joint probability of a set of variables must also sum to 1.

    - If A and B are mutually exclusive:

      P(A or B) = P(A) + P(B)
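    As a quick sanity check, here is a minimal sketch in Python that verifies the sum rule and the corollaries on a toy distribution (a fair six-sided die, chosen here purely for illustration; it is not from the slides):

    from fractions import Fraction

    # Toy distribution: a fair six-sided die.
    P = {face: Fraction(1, 6) for face in range(1, 7)}

    # A single random variable's distribution must sum to 1.
    assert sum(P.values()) == 1

    # Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B).
    p = lambda event: sum(P[x] for x in event)
    A = {2, 4, 6}   # "roll is even"
    B = {1, 2}      # "roll is at most 2"
    assert p(A | B) == p(A) + p(B) - p(A & B)

    # Mutually exclusive events: the cross term vanishes.
    C = {1, 3, 5}   # "roll is odd" -- disjoint from A
    assert p(A | C) == p(A) + p(C)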

  • Slide 4/39

    Rules of probability

    Conditional probability:

    Pr(A|B) = Pr(A and B) / Pr(B),   Pr(B) > 0

    Product rule:

    Pr(B|A) Pr(A) = Pr(A and B) = Pr(A|B) Pr(B)

    Corollary (Bayes rule):

    Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)

  • Slide 6/39

    Basic concepts

    Making rational decisions when faced with uncertainty:

    Probability: the precise representation of knowledge and uncertainty

    Probability theory: how to optimally update your knowledge based on new information

    Decision theory (probability theory + utility theory): how to use this information to achieve maximum expected utility

    Basic probability theory:

    - random variables
    - probability distributions (discrete) and probability densities (continuous)
    - rules of probability
    - expectation and the computation of 1st and 2nd moments
    - joint and multivariate probability distributions and densities
    - covariance and principal components

  • Slide 7/39

    The Joint Distribution

    Recipe for making a joint distribution of M variables:

    1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).

    2. For each combination of values, say how probable it is.

    3. If you subscribe to the axioms of probability, those numbers must sum to 1.

    Example: Boolean variables A, B, C

    A  B  C  Prob
    0  0  0  0.30
    0  0  1  0.05
    0  1  0  0.10
    0  1  1  0.05
    1  0  0  0.05
    1  0  1  0.10
    1  1  0  0.25
    1  1  1  0.10

    [Venn-style area diagram of the same eight probabilities over regions A, B, C]

    All the nice-looking slides like this one are from Andrew Moore, CMU.
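    The recipe is only a few lines of code. Here is a minimal sketch in Python that builds the A, B, C joint from the table above and checks steps 1 and 3:

    from itertools import product

    # Joint distribution over Boolean A, B, C: one probability per row
    # of the 2^M truth table (values copied from the table above).
    joint = {
        (0, 0, 0): 0.30, (0, 0, 1): 0.05,
        (0, 1, 0): 0.10, (0, 1, 1): 0.05,
        (1, 0, 0): 0.05, (1, 0, 1): 0.10,
        (1, 1, 0): 0.25, (1, 1, 1): 0.10,
    }

    # Step 1: every combination of values is listed ...
    assert set(joint) == set(product((0, 1), repeat=3))

    # Step 3: ... and the probabilities sum to 1.
    assert abs(sum(joint.values()) - 1.0) < 1e-12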


  • Slide 10/39

    Using the Joint

    Once you have the JD you can ask for the probability of any logical expression E involving your attributes:

    P(E) = Σ_{rows matching E} P(row)

  • Slide 11/39

    Using the Joint

    Example: P(Poor and Male) = 0.4654

  • Slide 12/39

    Using the Joint

    Example: P(Poor) = 0.7604

  • Slide 13/39

    Inference with the Joint

    P(E1 | E2) = P(E1 and E2) / P(E2)
               = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)

    Example: P(Male | Poor) = 0.4654 / 0.7604 = 0.612
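    In code this is one more helper on top of the joint. The census table behind the 0.4654 and 0.7604 figures is not reproduced in this transcript, so this sketch reuses the A, B, C joint from the earlier sketch:

    # (reuses the `joint` dict from the previous sketch)

    def prob(joint, event):
        """P(E) = sum of P(row) over the rows matching E (a predicate on rows)."""
        return sum(p for row, p in joint.items() if event(row))

    def cond_prob(joint, e1, e2):
        """P(E1|E2) = P(E1 and E2) / P(E2)."""
        return prob(joint, lambda r: e1(r) and e2(r)) / prob(joint, e2)

    # Events are predicates over a row (a, b, c).
    A = lambda r: r[0] == 1
    B = lambda r: r[1] == 1

    print(prob(joint, A))          # P(A) = 0.05 + 0.10 + 0.25 + 0.10 = 0.50
    print(cond_prob(joint, A, B))  # P(A|B) = 0.35 / 0.50 = 0.70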

  • Slide 14/39

    Continuous probability distributions

    - probability density function (pdf)
    - joint probability density
    - marginal probability
    - calculating probabilities using the pdf
    - Bayes rule

  • Slide 15/39

    A PDF of American Ages in 2000

    (more of Andrew's nice slides)

  • Slide 16/39

    A PDF of American Ages in 2000

    Let X be a continuous random variable. If p(x) is a probability density function for X, then

    P(a ≤ X ≤ b) = ∫_a^b p(x) dx

    P(30 ≤ Age ≤ 50) = ∫_30^50 p(age) d(age) = 0.36
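    The same calculation in code: the actual age pdf is not reproduced here, so this sketch stands in a hypothetical Gamma density (shape and scale picked arbitrarily); the mechanics of integrating a pdf over [a, b] are the same.

    from scipy import integrate, stats

    age = stats.gamma(a=4.0, scale=9.0)  # hypothetical stand-in pdf, mean 36

    # Numerically integrate the pdf from 30 to 50 ...
    p_numeric, _ = integrate.quad(age.pdf, 30, 50)

    # ... which must agree with the difference of the CDF at the endpoints.
    p_exact = age.cdf(50) - age.cdf(30)

    print(p_numeric, p_exact)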

  • Slide 17/39

    What does p(x) mean?

    It does not mean a probability!

    First of all, it's not a value between 0 and 1. It's just a value, and an arbitrary one at that. The value p(a) can only be compared relative to other values p(b). It indicates the relative probability of the density integrated over a small interval:

    If p(a) = α p(b)

    then lim_{h→0} P(a − h ≤ X ≤ a + h) / P(b − h ≤ X ≤ b + h) = α

  • Slide 18/39

    Expectations

    E[X] = the expected value of random variable X

         = the average value we'd see if we took a very large number of random samples of X

         = ∫_{x=−∞}^{∞} x p(x) dx

  • Slide 19/39

    Expectations (continued)

    E[X] = ∫_{x=−∞}^{∞} x p(x) dx

         = the first moment of the shape formed by the axes and the blue curve

         = the best value to choose if you must guess an unknown person's age and you'll be fined the square of your error

    E[age] = 35.897

  • Slide 20/39

    Expectation of a function

    E[f(X)] = the expected value of f(x) where x is drawn from X's distribution

            = the average value we'd see if we took a very large number of random samples of f(X)

            = ∫_{x=−∞}^{∞} f(x) p(x) dx

    Note that in general: E[f(X)] ≠ f(E[X])

    E[age²] = 1786.64
    (E[age])² = 1288.62

  • Slide 21/39

    Variance

    σ² = Var[X] = the expected squared difference between x and E[X]

                = ∫_{x=−∞}^{∞} (x − μ)² p(x) dx

                = the amount you'd expect to lose if you must guess an unknown person's age and you'll be fined the square of your error, assuming you play optimally

    Var[age] = 498.02

  • Slide 22/39

    Standard Deviation

    σ² = Var[X] = ∫_{x=−∞}^{∞} (x − μ)² p(x) dx

    σ = √Var[X] = the standard deviation = the typical deviation of X from its mean

    Var[age] = 498.02

    σ = 22.32
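    A short sketch tying the last few slides together: it computes E[X], E[X²], Var[X] and σ by numerical integration. The hypothetical Gamma density stands in for the real age pdf again, so the printed values differ from the slide's census figures.

    import numpy as np
    from scipy import integrate, stats

    age = stats.gamma(a=4.0, scale=9.0)  # hypothetical stand-in pdf

    mean, _ = integrate.quad(lambda x: x * age.pdf(x), 0, np.inf)     # E[X]
    ex2, _  = integrate.quad(lambda x: x**2 * age.pdf(x), 0, np.inf)  # E[X^2]
    var     = ex2 - mean**2                                           # Var[X]
    sigma   = np.sqrt(var)                                            # std dev

    # E[f(X)] != f(E[X]): the gap between E[X^2] and (E[X])^2 is the variance.
    print(mean, ex2, mean**2, var, sigma)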

  • Slide 23/39

    Simple example: medical test results

    Your test report for a rare disease comes back positive, and the test is 90% accurate. What's the probability that you have the disease?

    What if the test is repeated?

    This is the simplest example of reasoning by combining sources of information.

  • Slide 24/39

    How do we model the problem?

    Which is the correct description of "test is 90% accurate"?

    P(T = true) = 0.9
    P(T = true | D = true) = 0.9
    P(D = true | T = true) = 0.9

    What do we want to know?

    More compact notation:

    P(T = true | D = true) is written P(T|D)
    P(T = false | D = false) is written P(¬T|¬D)

  • Slide 25/39

    Evaluating the posterior probability through Bayesian inference

    We want P(D|T), the probability of having the disease given a positive test. Use Bayes rule to relate it to what we know, P(T|D):

    P(D|T) = P(T|D) P(D) / P(T)

    posterior = likelihood × prior / normalizing constant

    What's the prior P(D)? The disease is rare, so let's assume P(D) = 0.001.

    What about P(T)? What's the interpretation of that?

  • Slide 26/39

    Evaluating the normalizing constant

    P(D|T) = P(T|D) P(D) / P(T)        (posterior = likelihood × prior / normalizing constant)

    P(T) is the marginal probability of P(T, D) = P(T|D) P(D), so compute it with a summation:

    P(T) = Σ_{all values of D} P(T|D) P(D)

    For true-or-false propositions:

    P(T) = P(T|D) P(D) + P(T|¬D) P(¬D)

    What are P(T|¬D) and P(¬D)?
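    A minimal sketch of this computation, using the slide's numbers: sensitivity P(T|D) = 0.9, prior P(D) = 0.001, and (anticipating the next slides) a false-positive rate P(T|¬D) = 0.1:

    def posterior(p_t_given_d, p_t_given_not_d, prior):
        """P(D|T) by Bayes rule, with the normalizing constant expanded
        as P(T) = P(T|D)P(D) + P(T|~D)P(~D)."""
        evidence = p_t_given_d * prior + p_t_given_not_d * (1 - prior)
        return p_t_given_d * prior / evidence

    print(posterior(0.9, 0.1, 0.001))  # ~0.0089: still very unlikely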

  • Slide 27/39

    Refining our model of the test

    We also have to consider the negative case to incorporate all information:

    P(T|D) = 0.9
    P(T|¬D) = ?

    What should it be?

    What about P(¬D)?

  • Slide 29/39

    Same problem, different situation

    Suppose we have a test to determine if you won the lottery. It's 90% accurate.

    What is P($ = true | T = true) then?

  • Slide 30/39

    Playing around with the numbers

    P(D|T) = P(T|D) P(D) / [P(T|D) P(D) + P(T|¬D) P(¬D)]

    What if the test were 100% reliable?

    P(D|T) = (1.0 × 0.001) / (1.0 × 0.001 + 0.0 × 0.999) = 1.0

    What if the test was the same, but the disease weren't so rare, say P(D) = 0.1?

    P(D|T) = (0.9 × 0.1) / (0.9 × 0.1 + 0.1 × 0.9) = 0.5

  • Slide 31/39

    Repeating the test

    We can relax, P(D|T) = 0.0089, right? Just to be sure, the doctor recommends repeating the test.

    How do we represent this? P(D|T1, T2)

    Again, we apply Bayes rule:

    P(D|T1, T2) = P(T1, T2|D) P(D) / P(T1, T2)

    How do we model P(T1, T2|D)?

  • Slide 32/39

    Modeling repeated tests

    Easiest is to assume the tests are independent given the disease state:

    P(T1, T2|D) = P(T1|D) P(T2|D)

    We likewise treat the tests as independent for the normalizing constant:

    P(T1, T2) = P(T1) P(T2)

    Plugging these into

    P(D|T1, T2) = P(T1, T2|D) P(D) / P(T1, T2)

    we have

    P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / [P(T1) P(T2)]

  • Slide 33/39

    Evaluating the normalizing constant again

    Expanding as before, we have

    P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / Σ_{D∈{t,f}} P(T1|D) P(T2|D) P(D)

    Plugging in the numbers gives us

    P(D|T1, T2) = (0.9 × 0.9 × 0.001) / (0.9 × 0.9 × 0.001 + 0.1 × 0.1 × 0.999) = 0.075

    Another way to think about this:
    - What's the chance of 1 false positive from the test?
    - What's the chance of 2 false positives?

    The chance of 2 false positives (0.1 × 0.1 = 0.01) is still 10× more likely than the prior probability of having the disease (0.001).
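    The same computation as a two-test sketch, reusing the numbers above:

    def posterior_two_tests(p_t_d, p_t_not_d, prior):
        """P(D|T1,T2) after two positive tests, independent given D."""
        num = p_t_d * p_t_d * prior
        den = num + p_t_not_d * p_t_not_d * (1 - prior)
        return num / den

    print(posterior_two_tests(0.9, 0.1, 0.001))  # ~0.075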

  • Slide 34/39

    Simpler: combining information the Bayesian way

    Let's look at the equation again:

    P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / [P(T1) P(T2)]

    If we rearrange slightly:

    P(D|T1, T2) = P(T2|D) / P(T2) × [P(T1|D) P(D) / P(T1)]

    We've seen this before! The bracketed factor,

    P(D|T1) = P(T1|D) P(D) / P(T1),

    is the posterior for the first test, which we just computed.

  • Slide 35/39

    The old posterior is the new prior

    We can just plug in the value of the old posterior; it plays exactly the same role as our old prior:

    P(D|T1, T2) = P(T2|D) P(D|T1) / P(T2) = P(T2|D) × 0.0089 / P(T2)

    Expanding the normalizing constant as before,

    P(D|T) = P(T|D) P(D) / [P(T|D) P(D) + P(T|¬D) P(¬D)]

    and plugging in the numbers gives the same answer:

    P(D|T1, T2) = (0.9 × 0.0089) / (0.9 × 0.0089 + 0.1 × 0.9911) = 0.075

    This is how Bayesian reasoning combines old information with new information to update our belief states.
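    The sequential view in code: apply the single-test update twice, feeding the first posterior back in as the new prior. With the independence assumptions above it reproduces the joint two-test answer.

    def update(prior, p_t_d=0.9, p_t_not_d=0.1):
        """One Bayesian update for a single positive test."""
        evidence = p_t_d * prior + p_t_not_d * (1 - prior)
        return p_t_d * prior / evidence

    belief = 0.001            # prior P(D)
    belief = update(belief)   # after test 1: ~0.0089
    belief = update(belief)   # after test 2: ~0.075
    print(belief)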

  • Slide 36/39

    Example 1.2 (Hamburgers). Consider the following fictitious scientific information: doctors find that people with Kreuzfeld-Jacob disease (KJ) almost invariably ate hamburgers, thus p(Hamburger Eater|KJ) = 0.9. The probability of an individual having KJ is currently rather low, about one in 100,000.

    1. Assuming eating lots of hamburgers is rather widespread, say p(Hamburger Eater) = 0.5, what is the probability that a hamburger eater will have Kreuzfeld-Jacob disease?

    This may be computed as

    p(KJ|Hamburger Eater) = p(Hamburger Eater, KJ) / p(Hamburger Eater)
                          = p(Hamburger Eater|KJ) p(KJ) / p(Hamburger Eater)   (1.2.1)
                          = (9/10 × 1/100000) / (1/2) = 1.8 × 10⁻⁵   (1.2.2)

    2. If the fraction of people eating hamburgers was rather small, p(Hamburger Eater) = 0.001, what is the probability that a regular hamburger eater will have Kreuzfeld-Jacob disease? Repeating the above calculation, this is given by

    (9/10 × 1/100000) / (1/1000) ≈ 1/100   (1.2.3)

    This is much higher than in scenario (1), since here we can be more sure that eating hamburgers is related to the illness.

  • Slide 37/39

    Example 1.3 (Inspector Clouseau). Inspector Clouseau arrives at the scene of a crime. The victim lies dead in the room alongside the possible murder weapon, a knife. The Butler (B) and Maid (M) are the inspector's main suspects, and the inspector has a prior belief of 0.6 that the Butler is the murderer, and a prior belief of 0.2 that the Maid is the murderer. These beliefs are independent in the sense that p(B, M) = p(B)p(M). (It is possible that both the Butler and the Maid murdered the victim, or neither.) The inspector's prior criminal knowledge can be formulated mathematically as follows:

    dom(B) = dom(M) = {murderer, not murderer},  dom(K) = {knife used, knife not used}   (1.2.4)

    p(B = murderer) = 0.6,  p(M = murderer) = 0.2   (1.2.5)

    p(knife used | B = not murderer, M = not murderer) = 0.3
    p(knife used | B = not murderer, M = murderer)     = 0.2
    p(knife used | B = murderer,     M = not murderer) = 0.6
    p(knife used | B = murderer,     M = murderer)     = 0.1
       (1.2.6)

    In addition, p(K, B, M) = p(K|B, M) p(B) p(M). Assuming that the knife is the murder weapon, what is the probability that the Butler is the murderer? (Remember that it might be that neither is the murderer.) Using b for the two states of B and m for the two states of M:

    p(B|K) = Σ_m p(B, m|K)
           = Σ_m p(B, m, K) / p(K)
           = Σ_m p(K|B, m) p(B, m) / Σ_{b,m} p(K|b, m) p(b, m)
           = p(B) Σ_m p(K|B, m) p(m) / Σ_b p(b) Σ_m p(K|b, m) p(m)   (1.2.7)
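    Evaluating (1.2.7) numerically is a short sketch; the states and numbers are exactly those of (1.2.4)-(1.2.6):

    # True means "murderer" for B and M; p_k is p(knife used | b, m).
    p_b = {True: 0.6, False: 0.4}
    p_m = {True: 0.2, False: 0.8}
    p_k = {(False, False): 0.3, (False, True): 0.2,
           (True,  False): 0.6, (True,  True): 0.1}

    def butler_term(b):
        """p(B=b) * sum over m of p(K=used | b, m) p(m)."""
        return p_b[b] * sum(p_k[(b, m)] * p_m[m] for m in (True, False))

    posterior_butler = butler_term(True) / (butler_term(True) + butler_term(False))
    print(posterior_butler)  # ~0.73: the evidence points to the Butler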

  • Slide 38/39

    Example 1.5 (Aristotle: Resolution). We can represent the statement "all apples are fruit" by p(F = tr|A = tr) = 1. Similarly, "all fruits grow on trees" may be represented by p(T = tr|F = tr) = 1. Additionally we assume that whether or not something grows on a tree depends only on whether or not it is a fruit, p(T|A, F) = p(T|F). From this we can compute

    p(T = tr|A = tr) = Σ_F p(T = tr|F, A = tr) p(F|A = tr)
                     = Σ_F p(T = tr|F) p(F|A = tr)
                     = p(T = tr|F = fa) p(F = fa|A = tr) + p(T = tr|F = tr) p(F = tr|A = tr)

    where p(F = fa|A = tr) = 0 and p(T = tr|F = tr) = p(F = tr|A = tr) = 1, so

    p(T = tr|A = tr) = 1   (1.2.16)

    In other words, we have deduced that "all apples grow on trees" is a true statement, based on the information presented. (This kind of reasoning is called resolution and is a form of transitivity: from the statements A ⇒ F and F ⇒ T we can infer A ⇒ T.)

  • Slide 39/39

    Next time

    Bayesian belief networks