Probabilistic Calculus to the Rescue
Suppose we know the likelihood of each of the (propositional) worlds (aka the Joint Probability Distribution)
Then we can use standard rules of probability to compute the likelihood of all queries (as I will remind you)
So, Joint Probability Distribution is all that you ever need!
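A minimal sketch of why the joint is sufficient: with a full joint table over two hypothetical propositions (the numbers are made up), any query falls out of marginalization and the conditioning rule.

```python
# Sketch: the full joint over (Rain, WetGrass) answers any query via
# marginalization and conditioning. Entries are illustrative and sum to 1.
joint = {(True, True): 0.30, (True, False): 0.05,
         (False, True): 0.10, (False, False): 0.55}

p_rain = sum(p for (r, w), p in joint.items() if r)   # marginalize out WetGrass
p_wet = sum(p for (r, w), p in joint.items() if w)
p_rain_given_wet = joint[(True, True)] / p_wet        # conditioning rule

print(round(p_rain, 2))            # 0.35
print(round(p_rain_given_wet, 2))  # 0.75
```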
In the case of the Pearl example, we just need the joint probability distribution over B, E, A, J, M (32 numbers)
--In general, 2^n separate numbers (which should add up to 1)
If the Joint Distribution is sufficient for reasoning, what is domain knowledge supposed to help us with?
--Answer: Indirectly, by helping us specify the joint probability distribution with fewer than 2^n numbers
---The local relations between propositions can be seen as “constraining” the form the joint probability distribution can take!
Burglary => Alarm
Earthquake => Alarm
Alarm => John-calls
Alarm => Mary-calls
Only 10 (instead of 32) numbers to specify!
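The 10-vs-32 count above can be checked directly: each binary node needs one number per row of its CPT, i.e. 2^(#parents) numbers.

```python
# Sketch: compare parameter counts for the full joint vs. the alarm
# network factorization P(B)P(E)P(A|B,E)P(J|A)P(M|A). All variables binary.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

# One number per CPT row for a binary variable: P(X=true | parent values).
bn_params = sum(2 ** len(ps) for ps in parents.values())
joint_params = 2 ** len(parents)  # full joint table over 5 binary variables

print(bn_params)     # 10
print(joint_params)  # 32
```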
How do we learn Bayes nets?
• We assumed that both the topology and the CPTs of a Bayes net are given by experts
• What if we want to learn them from data?
– And use them to predict other data..
Statistics vs. Probability
[Figure: Bayesian learning setup — hypothesis H with prior P(H) generates i.i.d. data D1, D2, …, DN, each with likelihood P(d|H); probability reasons from H to the data, statistics (learning) from the data back to H.]
The true hypothesis eventually dominates: the probability of indefinitely producing uncharacteristic data goes to 0.
Bayesian prediction is optimal (Given the hypothesis prior, all other predictions are less likely)
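A small simulation of the claim that the true hypothesis eventually dominates (the three coin-bias hypotheses, the uniform prior, and the true bias are all assumptions for illustration, not from the slides):

```python
# Sketch: posterior over three hypothetical coin-bias hypotheses under
# i.i.d. flips; the posterior concentrates on the true bias.
import random

random.seed(0)
true_bias = 0.75                                # P(heads) of the actual coin
posterior = {0.25: 1/3, 0.5: 1/3, 0.75: 1/3}    # uniform prior P(H)

for _ in range(500):
    heads = random.random() < true_bias           # one i.i.d. datum d
    for h in posterior:
        posterior[h] *= h if heads else (1 - h)   # multiply in P(d | H)
    z = sum(posterior.values())                   # renormalize each step
    posterior = {h: p / z for h, p in posterior.items()}

best = max(posterior, key=posterior.get)
print(best)  # 0.75
```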
So, BN learning is just probability estimation! (as long as data is complete, and topology is given..)
Works for any topology
[Figure: the alarm network — B and E are parents of A; A is the parent of J and M]
So, BN learning is just probability estimation?
Data:
B=T, E=T, A=F, J=T, M=F
…
B=F, E=T, A=T, J=F, M=T

P(J|A) = (#data items where J and A are true) / (#data items where A is true)
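The counting estimator above is a few lines of code (the data rows and the `cpt_estimate` helper are hypothetical, for illustration):

```python
# Sketch: maximum-likelihood CPT estimation from complete data, as in
# P(J|A) = #(J=T, A=T) / #(A=T).
data = [  # hypothetical complete records over (B, E, A, J, M)
    dict(B=True,  E=True,  A=False, J=True,  M=False),
    dict(B=False, E=True,  A=True,  J=False, M=True),
    dict(B=False, E=False, A=True,  J=True,  M=True),
    dict(B=True,  E=False, A=True,  J=True,  M=False),
]

def cpt_estimate(data, child, parent):
    """P(child=T | parent=T) by simple frequency counts."""
    with_parent = [r for r in data if r[parent]]
    return sum(r[child] for r in with_parent) / len(with_parent)

print(cpt_estimate(data, "J", "A"))  # 2 of the 3 A=T rows have J=T
```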
Steps in ML-based learning
1. Write down an expression for the likelihood of the data as a function of the parameter(s). Assume an i.i.d. distribution.
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
There are two ways this last step can become complex:
– Individual (partial) derivatives lead to non-linear functions (this depends on the type of distribution the parameters are controlling; binomial is a very easy case)
– Individual (partial) derivatives involve more than one parameter (thus leading to simultaneous equations)
In general, we will need to use continuous function optimization techniques. One idea is to use gradient descent to find the point where the derivative goes to zero. But for gradient descent to find the global optimum, we need to know for sure that the function we are optimizing has a single optimum (this is why convex functions are important: if the negative log likelihood is a convex function, gradient descent is guaranteed to find the global minimum).
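For the binomial case called "very easy" above, the three steps can be carried out in closed form (a standard derivation; h and t denote the head and tail counts):

```latex
L(\theta) = \theta^{h}(1-\theta)^{t}, \qquad
\log L(\theta) = h\log\theta + t\log(1-\theta)

\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{t}{1-\theta} = 0
\;\Rightarrow\; \theta_{ML} = \frac{h}{h+t}
```

That is, the maximum-likelihood parameter is just the observed frequency — which is why complete-data BN learning reduces to the data counts used earlier.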
Continuous Function Optimization
• Function optimization involves finding the zeroes of the gradient
• We can use Newton-Raphson method
• ..but will need the second derivative…
• ..for a function of n variables, the second derivative is an n×n matrix (called the Hessian)
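A minimal 1-D sketch of the Newton-Raphson update on the gradient (the example function is an assumption; in 1-D the "Hessian" is just the second derivative):

```python
# Sketch: Newton-Raphson to find a zero of the gradient.
# Update: x <- x - f'(x) / f''(x). Example: f(x) = (x - 3)^2 + 1.
def newton_optimize(fprime, fdouble, x0, iters=20):
    x = x0
    for _ in range(iters):
        x -= fprime(x) / fdouble(x)  # one Newton step on the gradient
    return x

x_star = newton_optimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0)
print(x_star)  # 3.0 (a quadratic converges in a single step)
```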
Beyond Known Topology & Complete data!
• So we just noted that if we know the topology of the Bayes net, and we have complete data then the parameters are un-entangled, and can be learned separately from just data counts.
• Questions: How big a deal is this?
– Can we have known topology?
– Can we have complete data?
• What if there are hidden nodes?
Sometimes you don’t really know the topology: Russell’s restaurant waiting habits.
Classification as a special case of data modeling
• Until now, we were interested in learning a model of the entire data (i.e., we want to be able to predict each of the attribute values of the data)
• Sometimes, we are most interested in predicting just a subset (or even one) of the attributes of the data– This will be a “classification” task
Structure (Topology) Learning
• Search over different network topologies
• Question: How do we decide which topology is better?
– Idea 1: Check if the independence relations posited by the topology actually hold
– Idea 2: Consider which topology agrees with the data more (i.e., provides higher likelihood)
• But need to be careful--increasing the edges in a network cannot reduce likelihood
– Idea 3: Need to penalize the complexity of the network (either using a prior on network topologies, or using syntactic complexity measures)
[Figure: CPT sizes double as parents are added — 1, 2, 4, 8, 16 — approaching the 31 independent numbers of the full joint over five binary variables]
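A BIC-style penalized score is one common way to realize Idea 3 (the exact penalty form and all numbers below are illustrative, not from the slides):

```python
# Sketch: score a candidate topology by penalized log likelihood.
import math

def bic_score(log_likelihood, num_params, num_samples):
    """Higher is better; extra edges raise num_params, lowering the score."""
    return log_likelihood - 0.5 * num_params * math.log(num_samples)

# Adding edges can only increase raw likelihood, but the penalty can still
# prefer the simpler network (values are illustrative):
sparse = bic_score(log_likelihood=-520.0, num_params=10, num_samples=1000)
dense = bic_score(log_likelihood=-515.0, num_params=31, num_samples=1000)
print(sparse > dense)  # True: the penalty outweighs the small gain in fit
```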
Naïve Bayes Models: The Jean Harlow of Bayesnet Learners..
[Figure: naive Bayes topology — the class node WillWait is the sole parent of the attribute nodes Alt, Bar, Est, …]
P(willwait=yes) = 6/12 = 0.5
P(Patrons=“full” | willwait=yes) = 2/6 = 0.333
P(Patrons=“some” | willwait=yes) = 4/6 = 0.666
Similarly, we can show that P(Patrons=“full” | willwait=no) = 0.666

P(willwait=yes | Patrons=full)
  = P(Patrons=full | willwait=yes) * P(willwait=yes) / P(Patrons=full)
  = k * 0.333 * 0.5
P(willwait=no | Patrons=full) = k * 0.666 * 0.5
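Finishing the computation: the normalizer k just makes the two unnormalized scores sum to 1, so it never needs P(Patrons=full) explicitly.

```python
# Sketch: normalizing the two slide scores k * 0.333 * 0.5 and k * 0.666 * 0.5.
p_yes = 0.333 * 0.5  # P(Patrons=full | yes) * P(yes)
p_no = 0.666 * 0.5   # P(Patrons=full | no)  * P(no)
k = 1.0 / (p_yes + p_no)

print(round(k * p_yes, 3))  # 0.333 -> P(willwait=yes | Patrons=full)
print(round(k * p_no, 3))   # 0.667 -> P(willwait=no  | Patrons=full)
```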
Example
Need for Smoothing..
• Suppose I toss a coin twice, and it comes up heads both times
– What is the empirical probability of Rao’s coin coming up tails?
• Suppose I continue to toss the coin another 3000 times, and it comes up heads all these times
– What is the empirical probability of Rao’s coin coming up tails?
What is happening? We have a “prior” on the coin tosses, and we slowly modify that prior in light of evidence.
How do we get NBC to do it?
I beseech you, in the bowels of Christ, think it possible you may be mistaken. --Cromwell to synod of the Church of Scotland; 1650 (aka Cromwell's Rule)
Using M-estimates to improve probability estimates
• The simple frequency-based estimation of P(Ai=vj|Ck) can be inaccurate, especially when the true value is close to zero and the number of training examples is small (so the probability that your examples don’t contain the rare cases is quite high)
• Solution: Use the M-estimate
  P(Ai=vj | Ck) = [#(Ck, Ai=vj) + m*p] / [#(Ck) + m]
– m virtual samples, with p being the probability that each of those samples has Ai=vj
• If we believe that our sample set is large enough, we can keep m small. Otherwise, keep it large.
• Essentially we are augmenting the #(Ck) real samples with m more virtual samples drawn according to the prior probability on how Ai takes values
– p is the prior probability of Ai taking the value vj
• If we don’t have any background information, assume a uniform probability (that is, 1/d if Ai can take d values)
Also, to avoid overflow errors do addition of logarithms of probabilities (instead of multiplication of probabilities)
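Both points — the M-estimate formula above and log-space accumulation — fit in a few lines (all counts and probabilities are illustrative):

```python
# Sketch: M-estimate smoothing plus summing log probabilities.
import math

def m_estimate(count_class_and_value, count_class, m, p):
    """[#(Ck, Ai=vj) + m*p] / [#(Ck) + m]; p is the prior for value vj."""
    return (count_class_and_value + m * p) / (count_class + m)

# A zero raw count no longer yields a hard zero (m=2 virtual samples,
# uniform prior p=1/2 for a binary attribute):
print(m_estimate(0, 6, m=2, p=0.5))  # 0.125 instead of 0.0

# Multiply probabilities by adding logs to avoid underflow:
probs = [0.333, 0.5, 0.125]
log_score = sum(math.log(x) for x in probs)
print(math.isclose(math.exp(log_score), 0.333 * 0.5 * 0.125))  # True
```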
Zero is FOREVER
Missing Data
What should we do?
--Idea: Just consider the complete data as the training data, and go ahead and learn the parameters
--But wait, now that we have parameters, we can infer the missing value! (suppose we infer B to be 1 with probability 0.7 and 0 with probability 0.3)
--But wait wait, now that we have inferred the missing value, we can re-estimate the parameters..
Infinite Regress? No.. Expectation Maximization
Fractional samples
1 1 0 (weight 0.7)
1 0 0 (weight 0.3)
Involves Bayes Net inference; can get by with approximate inference
Involves maximization; can get away with just improvement (i.e., a few steps of gradient ascent)
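The E/M loop above can be sketched on a candy-style mixture. The hidden variable is which bag each candy came from; the observation is cherry (1) vs. lime (0). The data, starting parameters, and the `em_mixture` helper are all illustrative, not the slide's actual candy numbers.

```python
# Sketch of EM on a two-bag candy mixture with a hidden bag variable.
def em_mixture(data, theta, p1, p2, iters=50):
    """theta = P(bag1); p1, p2 = P(cherry | bag1), P(cherry | bag2)."""
    for _ in range(iters):
        # E-step: fractional membership of each candy in bag 1
        w = []
        for cherry in data:
            a = theta * (p1 if cherry else 1 - p1)
            b = (1 - theta) * (p2 if cherry else 1 - p2)
            w.append(a / (a + b))
        # M-step: re-estimate parameters from the fractional samples
        theta = sum(w) / len(w)
        p1 = sum(wi for wi, c in zip(w, data) if c) / sum(w)
        p2 = sum(1 - wi for wi, c in zip(w, data) if c) / (len(w) - sum(w))
    return theta, p1, p2

theta, p1, p2 = em_mixture([1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
                           theta=0.6, p1=0.7, p2=0.3)
# After each M-step the mixture mean theta*p1 + (1-theta)*p2 matches the
# observed cherry frequency (0.7 here), even before full convergence.
print(round(theta * p1 + (1 - theta) * p2, 6))  # 0.7
```

With a single observed feature this mixture is not identifiable (many splits share the same mean), so EM here only pins down the mixture mean; richer observations, as in the full candy example, break that tie.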
Candy Example
Start with 1000 samples
Initialize the parameters [initial values shown in the slide figure]
The “size of the step” is determined adaptively by where the maximum of the lower bound is..
--In contrast, gradient descent requires a step-size parameter
--Newton-Raphson requires the second derivative..
Why does EM Work? The log of a sum doesn’t have an easy closed-form optimum; use Jensen’s inequality and focus on a sum of logs, which gives a lower bound.
F_t(J) is an arbitrary probability distribution over J; the lower bound follows by Jensen’s inequality.
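Written out, the bound the slide invokes (standard EM derivation, keeping the slide's F_t(J) for an arbitrary distribution over the hidden variables J):

```latex
\log P(D \mid \theta)
  = \log \sum_{J} P(D, J \mid \theta)
  = \log \sum_{J} F_t(J)\,\frac{P(D, J \mid \theta)}{F_t(J)}
  \;\ge\; \sum_{J} F_t(J) \log \frac{P(D, J \mid \theta)}{F_t(J)}
```

The inequality holds because log is concave (Jensen), and it is tight when F_t(J) = P(J | D, θ) — which is exactly what the E-step computes.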