Probabilistic Calculus to the Rescue


TRANSCRIPT

Page 1: Probabilistic Calculus to the Rescue
Page 2: Probabilistic Calculus to the Rescue

Probabilistic Calculus to the Rescue

Suppose we know the likelihood of each of the (propositional) worlds (aka the Joint Probability Distribution)

Then we can use standard rules of probability to compute the likelihood of all queries (as I will remind you)

So, Joint Probability Distribution is all that you ever need!

In the case of the Pearl example, we just need the joint probability distribution over B, E, A, J, M (32 numbers)--in general, 2^n separate numbers

(which should add up to 1)

If Joint Distribution is sufficient for reasoning, what is domain knowledge supposed to help us with?

--Answer: Indirectly, by helping us specify the joint probability distribution with fewer than 2^n numbers

---The local relations between propositions can be seen as “constraining” the form the joint probability distribution can take!

Burglary => Alarm
Earth-Quake => Alarm
Alarm => John-calls
Alarm => Mary-calls

Only 10 (instead of 32) numbers to specify!
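To make the count concrete, here is a minimal Python sketch of the factored joint P(B,E,A,J,M) = P(B)P(E)P(A|B,E)P(J|A)P(M|A). The CPT values are illustrative (they follow the usual textbook presentation of the burglary example, but treat them as placeholders); the point is that 1 + 1 + 4 + 2 + 2 = 10 numbers reproduce all 32 joint entries.

```python
from itertools import product

# Illustrative CPT entries: 1 + 1 + 4 + 2 + 2 = 10 numbers instead of 2^5 = 32.
P_B = 0.001                                    # P(Burglary)
P_E = 0.002                                    # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}                # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}                # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) computed from the factored form."""
    p  = P_B if b else 1 - P_B
    p *= P_E if e else 1 - P_E
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# All 32 joint entries are recoverable, and they add up to 1.
total = sum(joint(*world) for world in product([True, False], repeat=5))
print(round(total, 10))   # -> 1.0
```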

Page 3: Probabilistic Calculus to the Rescue
Page 4: Probabilistic Calculus to the Rescue

How do we learn the Bayes nets?

• We assumed that both the topology and the CPTs for Bayes nets are given by experts

• What if we want to learn them from data?
  – And use them to predict other data?

Page 5: Probabilistic Calculus to the Rescue
Page 6: Probabilistic Calculus to the Rescue

Statistics vs. Probability

Page 7: Probabilistic Calculus to the Rescue

[Figure: hypothesis node H with prior P(H); i.i.d. data nodes D1, D2, ..., DN, each with likelihood P(d|H)]

Page 8: Probabilistic Calculus to the Rescue

True hypothesis eventually dominates… the probability of indefinitely producing uncharacteristic data goes to 0
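A minimal sketch of this convergence, under made-up assumptions (candidate hypotheses are biased coins with different heads probabilities, uniform prior): the posterior P(h | d1..dN) ∝ P(h) · Π_i P(d_i | h) concentrates on the hypothesis closest to the one generating the data as N grows.

```python
import random

# Hypothetical candidate hypotheses: coins with these heads probabilities.
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
posterior = {h: 1.0 / len(hypotheses) for h in hypotheses}   # uniform prior P(H)

random.seed(0)
true_h = 0.7
data = [random.random() < true_h for _ in range(500)]        # i.i.d. flips

for d in data:
    # Bayes update: P(h | d) is proportional to P(d | h) * P(h)
    for h in hypotheses:
        posterior[h] *= h if d else (1 - h)
    z = sum(posterior.values())
    posterior = {h: p / z for h, p in posterior.items()}

print({h: round(p, 4) for h, p in posterior.items()})
# The hypothesis nearest the true one (0.7) eventually dominates.
```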

Page 9: Probabilistic Calculus to the Rescue

Bayesian prediction is optimal (Given the hypothesis prior, all other predictions are less likely)

Page 10: Probabilistic Calculus to the Rescue
Page 11: Probabilistic Calculus to the Rescue
Page 12: Probabilistic Calculus to the Rescue
Page 13: Probabilistic Calculus to the Rescue

So, BN learning is just probability estimation! (as long as data is complete, and topology is given..)

Page 14: Probabilistic Calculus to the Rescue

Works for any topology

[Figure: the B, E, A, J, M network]

So, BN learning is just probability estimation?

Data:
B=T, E=T, A=F, J=T, M=F
. . .
B=F, E=T, A=T, J=F, M=T

P(J|A) = (#data items where J and A are true) / (#data items where A is true)
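A minimal Python sketch of this counting estimate, using a small made-up complete dataset (the rows and the helper name are illustrative assumptions; only the P(J|A) formula itself comes from the slide):

```python
# Made-up complete data: each row assigns True/False to B, E, A, J, M.
data = [
    dict(B=True,  E=True,  A=False, J=True,  M=False),
    dict(B=False, E=True,  A=True,  J=False, M=True),
    dict(B=False, E=False, A=True,  J=True,  M=True),
    dict(B=False, E=False, A=True,  J=True,  M=False),
]

def estimate(child, parent, rows):
    """P(child=True | parent=True) by simple relative-frequency counting."""
    parent_true = [r for r in rows if r[parent]]
    both_true = [r for r in parent_true if r[child]]
    return len(both_true) / len(parent_true)

print(estimate("J", "A", data))   # #(J and A) / #(A) = 2/3
```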

Page 15: Probabilistic Calculus to the Rescue

Steps in ML-based learning

1. Write down an expression for the likelihood of the data as a function of the parameter(s). Assume an i.i.d. distribution.

2. Write down the derivative of the log likelihood with respect to each parameter.

3. Find the parameter values such that the derivatives are zero.

There are two ways this last step can become complex:
– Individual (partial) derivatives lead to non-linear functions (this depends on the type of distribution the parameters are controlling; the binomial is a very easy case)
– Individual (partial) derivatives involve more than one parameter (thus leading to simultaneous equations)

In general, we will need to use continuous function optimization techniques. One idea is to use gradient descent to find the point where the derivative goes to zero. But for gradient descent to find the global optimum, we need to know for sure that the function we are optimizing has a single optimum (this is why convex functions are important: if the function being minimized, e.g. the negative log likelihood, is convex, then gradient descent is guaranteed to find the global minimum).
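As an illustration of steps 1-3 for the easy binomial case (a sketch with made-up counts, not taken from the slides): for N i.i.d. coin flips with h heads, the log likelihood is L(θ) = h·log θ + (N-h)·log(1-θ); setting dL/dθ = h/θ - (N-h)/(1-θ) = 0 gives θ = h/N. The script below checks the closed form against a brute-force grid search.

```python
import math

N, h = 10, 7   # made-up data: 10 flips, 7 heads

def log_likelihood(theta):
    # L(theta) = h*log(theta) + (N - h)*log(1 - theta), assuming i.i.d. flips
    return h * math.log(theta) + (N - h) * math.log(1 - theta)

# Closed form from setting the derivative to zero: theta = h / N
theta_mle = h / N

# Brute-force check on a fine grid
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_mle, theta_grid)   # both approximately 0.7
```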

Page 16: Probabilistic Calculus to the Rescue

Continuous Function Optimization

• Function optimization involves finding the zeroes of the gradient

• We can use Newton-Raphson method

• ..but will need the second derivative…

• ..for a function of n variables, the second derivative is an n×n matrix (called the Hessian)
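A minimal one-dimensional Newton-Raphson sketch for finding a zero of the gradient (in n dimensions the scalar second derivative would be replaced by the n×n Hessian and the division by solving a linear system; the example function below is an illustrative assumption):

```python
def newton_raphson(grad, hess, x0, iters=20):
    """Find a zero of grad by iterating x <- x - grad(x) / hess(x)."""
    x = x0
    for _ in range(iters):
        x = x - grad(x) / hess(x)
    return x

# Example: minimize f(x) = (x - 3)^2 + 1 by zeroing its gradient 2(x - 3).
grad = lambda x: 2 * (x - 3)
hess = lambda x: 2.0
print(newton_raphson(grad, hess, x0=0.0))   # -> 3.0
```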

Page 17: Probabilistic Calculus to the Rescue

Beyond Known Topology & Complete data!

• So we just noted that if we know the topology of the Bayes net, and we have complete data then the parameters are un-entangled, and can be learned separately from just data counts.

• Questions: How big a deal is this?
  – Can we have known topology?
  – Can we have complete data?

• What if there are hidden nodes?

Page 18: Probabilistic Calculus to the Rescue

Some times you don’t really know the topologyRussel’s restaurant waiting habbits.

Page 19: Probabilistic Calculus to the Rescue

Classification as a special case of data modeling

• Until now, we were interested in learning a model of the entire data (i.e., we want to be able to predict each of the attribute values of the data)

• Sometimes, we are most interested in predicting just a subset (or even one) of the attributes of the data
  – This will be a “classification” task

Page 20: Probabilistic Calculus to the Rescue

Structure (Topology) Learning

• Search over different network topologies
• Question: How do we decide which topology is better?
  – Idea 1: Check if the independence relations posited by the topology actually hold
  – Idea 2: Consider which topology agrees with the data more (i.e., provides higher likelihood)
    • But we need to be careful--adding edges to a network cannot reduce its likelihood
  – Idea 3: Penalize the complexity of the network (either using a prior on network topologies, or using syntactic complexity measures)

[Figure: per-node parameter counts 1, 2, 4, 8, 16 -- 31 in all!]
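A minimal sketch of Idea 2 plus Idea 3: score a candidate topology by the data log likelihood under counted (MLE) CPTs minus a complexity penalty. The BIC-style penalty, the tiny made-up dataset, and the helper names are all illustrative assumptions, not anything specified on the slides.

```python
import math
from collections import Counter

# Made-up complete data over binary variables (illustrative only).
data = [
    dict(B=0, E=0, A=0, J=0, M=0),
    dict(B=1, E=0, A=1, J=1, M=0),
    dict(B=0, E=1, A=1, J=1, M=1),
    dict(B=0, E=0, A=0, J=0, M=1),
    dict(B=0, E=0, A=0, J=1, M=0),
    dict(B=1, E=0, A=1, J=1, M=1),
]

def score(topology, rows):
    """Data log likelihood under counted (MLE) CPTs, minus a BIC-style
    penalty of (number of parameters) * log(N) / 2."""
    n = len(rows)
    loglik, num_params = 0.0, 0
    for node, parents in topology.items():
        num_params += 2 ** len(parents)   # one Bernoulli parameter per parent configuration
        joint = Counter((tuple(r[p] for p in parents), r[node]) for r in rows)
        marg = Counter(tuple(r[p] for p in parents) for r in rows)
        for r in rows:
            cfg = tuple(r[p] for p in parents)
            loglik += math.log(joint[(cfg, r[node])] / marg[cfg])
    return loglik - num_params * math.log(n) / 2

sparse = dict(B=[], E=[], A=["B", "E"], J=["A"], M=["A"])          # 10 parameters
dense  = dict(B=[], E=["B"], A=["B", "E"],
              J=["B", "E", "A"], M=["B", "E", "A", "J"])            # 31 parameters
print(round(score(sparse, data), 2), round(score(dense, data), 2))
# Higher score = better trade-off between fit and complexity.
```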

Page 21: Probabilistic Calculus to the Rescue

Naïve Bayes Models: The Jean Harlow of Bayesnet Learners..

[Figure: Naïve Bayes structure -- class node WillWait with attribute children Alt, Bar, Est, …]

Page 22: Probabilistic Calculus to the Rescue

Example

P(willwait=yes) = 6/12 = 0.5
P(Patrons=“full” | willwait=yes) = 2/6 = 0.333
P(Patrons=“some” | willwait=yes) = 4/6 = 0.666
Similarly, we can show that P(Patrons=“full” | willwait=no) = 0.666

P(willwait=yes | Patrons=full) = P(Patrons=full | willwait=yes) * P(willwait=yes) / P(Patrons=full) = k * 0.333 * 0.5
P(willwait=no | Patrons=full) = P(Patrons=full | willwait=no) * P(willwait=no) / P(Patrons=full) = k * 0.666 * 0.5
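A minimal Naïve Bayes sketch of this calculation in Python. The tiny dataset is made up to reproduce the slide's counts (6 yes / 6 no, with 2 of the yes and 4 of the no examples having Patrons=full); everything else about it is an illustrative assumption.

```python
from collections import Counter, defaultdict

# Tiny illustrative dataset: (Patrons value, WillWait label) pairs.
data = [("full", "yes"), ("some", "yes"), ("some", "yes"), ("some", "yes"),
        ("some", "yes"), ("full", "yes"), ("full", "no"), ("full", "no"),
        ("full", "no"), ("full", "no"), ("none", "no"), ("none", "no")]

labels = Counter(label for _, label in data)
cond = defaultdict(Counter)                 # cond[label][value] = count
for value, label in data:
    cond[label][value] += 1

def posterior(value):
    """P(label | Patrons=value), with the constant k handled by normalizing."""
    scores = {label: (labels[label] / len(data)) *          # P(label)
                     (cond[label][value] / labels[label])   # P(value | label)
              for label in labels}
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}

print(posterior("full"))   # roughly {'yes': 0.333, 'no': 0.667}
```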

Page 23: Probabilistic Calculus to the Rescue

Need for Smoothing..

• Suppose I toss a coin twice, and it comes up heads both times
  – What is the empirical probability of Rao’s coin coming up tails?
• Suppose I continue to toss the coin another 3000 times, and it comes up heads all those times
  – What is the empirical probability of Rao’s coin coming up tails?

What is happening? We have a “prior” on the coin tosses

We slowly modify that prior in light of evidence

How do we get NBC to do it?
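One hedged way to make this concrete (using the next slide's M-estimate with m = 2 virtual samples and uniform p = 1/2, i.e., Laplace smoothing -- a choice of constants not stated on the slides): after 2 heads, the smoothed estimate of tails is (0 + 1) / (2 + 2) = 0.25; after 3002 heads, it is (0 + 1) / (3002 + 2) ≈ 0.0003. The prior never lets the estimate hit exactly zero, but the evidence steadily pushes it toward zero.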

I beseech you, in the bowels of Christ, think it possible you may be mistaken.  --Cromwell to synod of the Church of Scotland; 1650 (aka Cromwell's Rule)

Page 24: Probabilistic Calculus to the Rescue

Using M-estimates to improve probability estimates

• The simple frequency-based estimation of P(Ai=vj | Ck) can be inaccurate, especially when the true value is close to zero and the number of training examples is small (so the probability that your examples don’t contain the rare cases is quite high)

• Solution: Use the M-estimate
  P(Ai=vj | Ck) = [#(Ck, Ai=vj) + m·p] / [#(Ck) + m]
  – m virtual samples, with p being the probability that each of those samples has Ai=vj
  – p is the prior probability of Ai taking the value vj; if we don’t have any background information, assume a uniform probability (that is, 1/d if Ai can take d values)

• If we believe that our sample set is large enough, we can keep m small. Otherwise, keep it large.

• Essentially we are augmenting the #(Ck) normal samples with m more virtual samples drawn according to the prior probability of how Ai takes values

Also, to avoid underflow errors, add logarithms of probabilities instead of multiplying the probabilities

Zero is FOREVER
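A minimal sketch of the M-estimate and of the log-space trick mentioned above; the function name and the tiny counts are illustrative assumptions.

```python
import math

def m_estimate(count_joint, count_class, m, p):
    """M-estimate of P(Ai=vj | Ck): (#(Ck, Ai=vj) + m*p) / (#(Ck) + m)."""
    return (count_joint + m * p) / (count_class + m)

# A rare value never observed with this class in a small sample
# still gets a non-zero (soft) probability:
print(m_estimate(0, 20, m=2, p=0.5))   # 1/22 ~ 0.045, not a hard zero

# Summing logs of many small conditional probabilities avoids the
# numerical underflow that the straight product would suffer.
probs = [0.001] * 200
log_score = sum(math.log(q) for q in probs)
print(log_score)   # about -1381.6; the direct product would underflow to 0.0
```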

Page 25: Probabilistic Calculus to the Rescue

Beyond Known Topology & Complete data!

• So we just noted that if we know the topology of the Bayes net, and we have complete data then the parameters are un-entangled, and can be learned separately from just data counts.

• Questions: How big a deal is this?
  – Can we have known topology?
  – Can we have complete data?

• What if there are hidden nodes?

Page 26: Probabilistic Calculus to the Rescue

Missing Data

What should we do?
-- Idea: Just consider the complete data as the training data and go ahead and learn the parameters
-- But wait: now that we have parameters, we can infer the missing value! (suppose we infer B to be 1 with probability 0.7 and 0 with probability 0.3)
-- But wait, wait: now that we have inferred the missing value, we can re-estimate the parameters..

Infinite Regress? No.. Expectation Maximization

Fractional samples

1 1 0 (0.7)
1 0 0 (0.3)
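A minimal sketch of one EM round with fractional samples (the three-variable layout, the fully observed records, and the starting posterior 0.7/0.3 are illustrative assumptions, except for the 0.7/0.3 split which follows the slide): the E-step splits the incomplete record into weighted completions, and the M-step re-estimates a parameter from the weighted counts.

```python
# One incomplete record over (B, A, J): B is missing, A=1, J=0.
# E-step (posterior under the current parameters): B=1 with 0.7, B=0 with 0.3.
fractional_samples = [
    ((1, 1, 0), 0.7),   # completion B=1, weight 0.7
    ((0, 1, 0), 0.3),   # completion B=0, weight 0.3
]

# Some fully observed (weight 1.0) records to count alongside the fractional ones.
complete = [((1, 1, 1), 1.0), ((1, 0, 1), 1.0), ((0, 0, 0), 1.0), ((1, 1, 0), 1.0)]
all_samples = complete + fractional_samples

# M-step for one parameter, P(A=1 | B=1): weighted counts instead of raw counts.
num = sum(w for (b, a, _), w in all_samples if b == 1 and a == 1)
den = sum(w for (b, _, _), w in all_samples if b == 1)
print(num / den)   # 2.7 / 3.7 ~ 0.73; re-infer the missing value and repeat
```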

Page 27: Probabilistic Calculus to the Rescue
Page 28: Probabilistic Calculus to the Rescue

E-step: Involves Bayes net inference; can get by with approximate inference

M-step: Involves maximization; can get away with just improvement (i.e., a few steps of gradient ascent)

Page 29: Probabilistic Calculus to the Rescue

Candy Example

Start with 1000 samples

Initialize parameters as

Page 30: Probabilistic Calculus to the Rescue

The “size of the step” is determined adaptively by where the max of the lower bound is..

--In contrast, gradient descent requires a step-size parameter
--Newton-Raphson requires the second derivative..

Why does EM work? Logs of sums don’t have easy closed-form optima; use Jensen’s inequality and focus on a sum of logs, which will be a lower bound.

F_t(J) is an arbitrary probability distribution over J

By Jensen’s inequality