CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING Bayesian Networks

Upload: timothy-phillips

Post on 18-Jan-2016

AGENDA

Bayesian networks

Chain rule for Bayes nets

Naïve Bayes models

Independence declarations

D-separation

Probabilistic inference queries

PURPOSES OF BAYESIAN NETWORKS

Efficient and intuitive modeling of complex causal interactions

Compact representation of joint distributions: $O(n)$ parameters rather than $O(2^n)$ for $n$ binary variables, when each node has a bounded number of parents

Algorithms for efficient inference with given evidence (more on this next time)

INDEPENDENCE OF RANDOM VARIABLES

Two random variables A and B are independent if

$P(A,B) = P(A)\,P(B)$

hence $P(A \mid B) = P(A)$: knowing B doesn’t give you any information about A.

[This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent.]

SIGNIFICANCE OF INDEPENDENCE

If A and B are independent, then P(A,B) = P(A) P(B)

=> The joint distribution over A and B can be defined as a product over the distribution of A and the distribution of B

=> Store two much smaller probability tables rather than a large probability table over all combinations of A and B
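The requirement that the product rule hold for every combination of values can be checked mechanically on a small table. The sketch below is only an illustration: the helper names `marginal` and `is_independent` and both example tables are hypothetical, not taken from the slides.

```python
from itertools import product

def marginal(joint, axis):
    """Sum a two-variable joint table {(a, b): p} down to one variable."""
    out = {}
    for values, p in joint.items():
        out[values[axis]] = out.get(values[axis], 0.0) + p
    return out

def is_independent(joint, tol=1e-9):
    """True if P(A=a, B=b) = P(A=a) P(B=b) for every pair of values."""
    p_a, p_b = marginal(joint, 0), marginal(joint, 1)
    return all(abs(joint.get((a, b), 0.0) - p_a[a] * p_b[b]) <= tol
               for a, b in product(p_a, p_b))

# Hypothetical numbers: a joint built as a product of two marginals...
p_a = {"a0": 0.3, "a1": 0.7}
p_b = {"b0": 0.6, "b1": 0.4}
independent_joint = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}

# ...and one where B tends to copy A, so the product rule fails.
dependent_joint = {("a0", "b0"): 0.25, ("a0", "b1"): 0.05,
                   ("a1", "b0"): 0.05, ("a1", "b1"): 0.65}
```

Calling `is_independent(independent_joint)` should report independence, while `is_independent(dependent_joint)` should not, because a single violated pair of values is enough to break it.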

CONDITIONAL INDEPENDENCE

Two random variables A and B are conditionally independent given C if

$P(A, B \mid C) = P(A \mid C)\,P(B \mid C)$

hence $P(A \mid B, C) = P(A \mid C)$: once you know C, learning B doesn’t give you any information about A.

[Again, this has to hold for all combinations of values that A, B, C can take on.]

SIGNIFICANCE OF CONDITIONAL INDEPENDENCE

Consider Grade(CS101), Intelligence, and SAT.

Ostensibly, the grade in a course doesn’t have a direct relationship with SAT scores,

but good students are more likely to get good SAT scores, so they are not independent…

It is reasonable to believe that Grade(CS101) and SAT are conditionally independent given Intelligence.

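BAYESIAN NETWORK

Explicitly represent independence among propositions.

Notice that Intelligence is the “cause” of both Grade and SAT, and the causality is represented explicitly.

[Figure: network with arcs Intel. → Grade and Intel. → SAT]

P(I=x)
high       0.3
low        0.7

P(G=x|I)   I=low   I=high
‘A’        0.2     0.74
‘B’        0.34    0.17
‘C’        0.46    0.09

P(S=x|I)   I=low   I=high
low        0.95    0.2
high       0.05    0.8

(Each column is a conditional distribution and sums to 1.)

$P(I,G,S) = P(G,S \mid I)\,P(I) = P(G \mid I)\,P(S \mid I)\,P(I)$

7 independent probabilities (1 for P(I), 2×2 for P(G|I), 2×1 for P(S|I)), instead of 11 for the full joint table with 2×3×2 = 12 entries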
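The factorization can be exercised numerically. The sketch below assumes the SAT table is read so that each conditional distribution sums to 1, i.e. P(S=low|I=low) = 0.95 and P(S=low|I=high) = 0.2; the variable names are illustrative. It rebuilds the joint from the three tables, confirms it is normalized, and shows that Grade and SAT are dependent once Intelligence is summed out. It also counts independent parameters, with one number saved per distribution because each must sum to 1.

```python
from itertools import product

# CPTs from the slide; each inner dict is one conditional distribution.
P_I = {"low": 0.7, "high": 0.3}
P_G_given_I = {"low": {"A": 0.2, "B": 0.34, "C": 0.46},
               "high": {"A": 0.74, "B": 0.17, "C": 0.09}}
P_S_given_I = {"low": {"low": 0.95, "high": 0.05},
               "high": {"low": 0.2, "high": 0.8}}

def joint(i, g, s):
    """P(I=i, G=g, S=s) = P(G=g | I=i) P(S=s | I=i) P(I=i)."""
    return P_G_given_I[i][g] * P_S_given_I[i][s] * P_I[i]

grades, sats = ("A", "B", "C"), ("low", "high")
total = sum(joint(i, g, s) for i, g, s in product(P_I, grades, sats))

# Marginals obtained by summing out the other variables.
p_grade_a = sum(joint(i, "A", s) for i in P_I for s in sats)
p_sat_high = sum(joint(i, g, "high") for i in P_I for g in grades)
p_a_and_high = sum(joint(i, "A", "high") for i in P_I)

# Independent parameters: a distribution over m values needs m - 1 numbers.
factored_params = (
    (len(P_I) - 1)
    + sum(len(row) - 1 for row in P_G_given_I.values())
    + sum(len(row) - 1 for row in P_S_given_I.values())
)
full_joint_params = len(P_I) * len(grades) * len(sats) - 1
```

Under this reading, `p_a_and_high` differs clearly from `p_grade_a * p_sat_high`, which is the numeric form of "not independent, but conditionally independent given Intelligence".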

DEFINITION: BAYESIAN NETWORK

Set of random variables $X = \{X_1,\dots,X_n\}$ with domains $Val(X_1),\dots,Val(X_n)$

Each node has a set of parents $Pa_X$

Graph must be a DAG

Each node also maintains a conditional probability distribution (often, a table) $P(X \mid Pa_X)$

For binary-valued variables, a node with $k$ parents has a table with $2^k$ rows, one per assignment to its parents

Overall: $O(n 2^k)$ storage for binary variables, where $k$ bounds the number of parents per node

Encodes the joint probability over $X_1,\dots,X_n$
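To make the storage claim concrete, here is a small sketch that counts table rows for a network of binary variables, one independent probability per assignment to a node's parents. The function name `cpt_rows` is hypothetical; the structure is that of the burglary network used on the following slides.

```python
def cpt_rows(parents):
    """Independent entries per node for binary variables: 2 ** (number of parents)."""
    return {node: 2 ** len(pa) for node, pa in parents.items()}

# Structure of the burglary network: each node maps to its list of parents.
alarm_parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

rows = cpt_rows(alarm_parents)
network_params = sum(rows.values())              # 1 + 1 + 4 + 2 + 2
full_joint_params = 2 ** len(alarm_parents) - 1  # explicit joint over 5 binary variables
```

For this five-node network the factored form needs 10 numbers where an explicit joint table would need 31, and the gap widens exponentially as nodes are added while the number of parents stays bounded.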

CALCULATION OF JOINT PROBABILITY

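[Figure: network with arcs Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls]

P(b) = 0.001

P(e) = 0.002

B   E   P(a|…)
T   T   0.95
T   F   0.94
F   T   0.29
F   F   0.001

A   P(j|…)
T   0.90
F   0.05

A   P(m|…)
T   0.70
F   0.01

$P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = ??$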

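$P(j \wedge m \wedge a \wedge \neg b \wedge \neg e)$

$= P(j \wedge m \mid a, \neg b, \neg e)\,P(a \wedge \neg b \wedge \neg e)$

$= P(j \mid a, \neg b, \neg e)\,P(m \mid a, \neg b, \neg e)\,P(a \wedge \neg b \wedge \neg e)$ (J and M are independent given A)

$P(j \mid a, \neg b, \neg e) = P(j \mid a)$ (J is independent of B and of E given A)

$P(m \mid a, \neg b, \neg e) = P(m \mid a)$

$P(a \wedge \neg b \wedge \neg e) = P(a \mid \neg b, \neg e)\,P(\neg b \mid \neg e)\,P(\neg e) = P(a \mid \neg b, \neg e)\,P(\neg b)\,P(\neg e)$ (B and E are independent)

$P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = P(j \mid a)\,P(m \mid a)\,P(a \mid \neg b, \neg e)\,P(\neg b)\,P(\neg e)$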

CALCULATION OF JOINT PROBABILITY

$P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = P(j \mid a)\,P(m \mid a)\,P(a \mid \neg b, \neg e)\,P(\neg b)\,P(\neg e)$

$= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.000628$
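The same product can be evaluated for any assignment, not just this one. A minimal sketch, using the CPT values from the slides and hypothetical helper names (`bernoulli`, `joint`); the query with John and Mary calling, the alarm ringing, and neither a burglary nor an earthquake corresponds to `b=False, e=False`.

```python
# CPTs of the burglary network, as given on the slides.
P_B = 0.001                      # P(Burglary = true)
P_E = 0.002                      # P(Earthquake = true)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm = true | B, E)
P_J = {True: 0.90, False: 0.05}  # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(MaryCalls = true | Alarm)

def bernoulli(p_true, value):
    """Probability that a binary variable takes `value` when P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) as the product of the five CPT entries."""
    return (bernoulli(P_J[a], j) * bernoulli(P_M[a], m)
            * bernoulli(P_A[(b, e)], a) * bernoulli(P_B, b) * bernoulli(P_E, e))

# John calls, Mary calls, alarm rings, no burglary, no earthquake.
p = joint(j=True, m=True, a=True, b=False, e=False)
```

Summing `joint` over all 32 assignments gives 1, which is a quick sanity check that the tables were entered consistently.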

In general, the full joint distribution is

$P(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa_{X_i})$

CHAIN RULE FOR BAYES NETS

Joint distribution is a product of all CPTs

$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_{X_i})$
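One way to see where this product comes from, sketched under the assumption that the variables are indexed in a topological order of the DAG (every parent precedes its children): apply the ordinary chain rule of probability, then drop from each conditioning set everything except the parents. This is legitimate because, under that ordering, the predecessors of $X_i$ are all non-descendants of $X_i$ and include its parents.

$$
\begin{aligned}
P(X_1,\dots,X_n) &= \prod_{i=1}^{n} P(X_i \mid X_1,\dots,X_{i-1}) && \text{(chain rule of probability)} \\
&= \prod_{i=1}^{n} P(X_i \mid Pa_{X_i}) && \text{(each } X_i \text{ is independent of its other predecessors given } Pa_{X_i}\text{)}
\end{aligned}
$$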

EXAMPLE: NAÏVE BAYES MODELS

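$P(\text{Cause}, \text{Effect}_1, \dots, \text{Effect}_n) = P(\text{Cause}) \prod_i P(\text{Effect}_i \mid \text{Cause})$

[Figure: network with arcs Cause → Effect1, Cause → Effect2, …, Cause → Effectn]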

ADVANTAGES OF BAYES NETS (AND OTHER GRAPHICAL MODELS)

More manageable # of parameters to set and store

Incremental modeling

Explicit encoding of independence assumptions

Efficient inference techniques

ARCS DO NOT NECESSARILY ENCODE CAUSALITY

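[Figure: three Bayesian networks over the nodes A, B, and C; the arc directions are not recoverable from the transcript]

2 BNs with the same expressive power, and a 3rd with greater power (exercise)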

READING OFF INDEPENDENCE RELATIONSHIPS

Given B, does the value of A affect the probability of C?

No: $P(C \mid B, A) = P(C \mid B)$. C’s parent (B) is given, and so C is independent of its non-descendants (A).

Independence is symmetric: $C \perp A \mid B \Rightarrow A \perp C \mid B$

[Figure: network over A, B, and C in which B is the parent of C]

BASIC RULE

A node is independent of its non-descendants given its parents (and given nothing else)

WHAT DOES THE BN ENCODE?

Burglary $\perp$ Earthquake

JohnCalls $\perp$ MaryCalls | Alarm

JohnCalls $\perp$ Burglary | Alarm

JohnCalls $\perp$ Earthquake | Alarm

MaryCalls $\perp$ Burglary | Alarm

MaryCalls $\perp$ Earthquake | Alarm

[Figure: network with arcs Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls]

A node is independent of its non-descendants, given its parents

READING OFF INDEPENDENCE RELATIONSHIPS

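How about Burglary $\perp$ Earthquake | Alarm? No! Why?

$P(b \wedge e \mid a) = P(a \mid b, e)\,P(b \wedge e)/P(a) \approx 0.00075$

$P(b \mid a)\,P(e \mid a) \approx 0.086$

The two values differ, so Burglary and Earthquake are not independent given Alarm.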
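Both numbers follow from the CPTs by marginalizing over Burglary and Earthquake with the alarm fixed to true. A short sketch, with hypothetical variable names, that reproduces them:

```python
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm = true | B, E)

def prior(p_true, value):
    """P(variable = value) for a binary variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# P(B=b, E=e, Alarm=true) for each combination of b and e.
p_bea = {(b, e): P_A[(b, e)] * prior(P_B, b) * prior(P_E, e)
         for b, e in product([True, False], repeat=2)}

p_a = sum(p_bea.values())                                   # P(a)
p_be_given_a = p_bea[(True, True)] / p_a                    # P(b, e | a)
p_b_given_a = (p_bea[(True, True)] + p_bea[(True, False)]) / p_a
p_e_given_a = (p_bea[(True, True)] + p_bea[(False, True)]) / p_a
product_of_marginals = p_b_given_a * p_e_given_a            # P(b | a) P(e | a)
```

The joint conditional is roughly a hundred times smaller than the product of the two conditionals: once the alarm is known to have rung, learning that a burglary occurred makes an earthquake much less likely.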

READING OFF INDEPENDENCE RELATIONSHIPS

How about Burglary $\perp$ Earthquake | JohnCalls? No! Why?

Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent.

INDEPENDENCE RELATIONSHIPS

For polytrees, there exists a unique undirected path between A and B. For each node on the path:

Evidence on the directed road $X \rightarrow E \rightarrow Y$ or $X \leftarrow E \leftarrow Y$ makes X and Y independent

Evidence on an $X \leftarrow E \rightarrow Y$ makes descendants independent

Evidence on a “V” node, or below the V: $X \rightarrow E \leftarrow Y$, or $X \rightarrow W \leftarrow Y$ with $W \rightarrow \dots \rightarrow E$, makes X and Y dependent (otherwise they are independent)

GENERAL CASE

Formal property in general case:

D-separation: the above properties hold for all (acyclic) paths between A and B

D-separation $\Rightarrow$ independence

That is, we can’t read off any more independence relationships from the graph than those that are encoded in D-separation

The CPTs may indeed encode additional independences
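D-separation can be tested without enumerating paths. The sketch below does not apply the path-by-path blocking rules; it swaps in the equivalent ancestral moral graph criterion, which is shorter to code. The function name `d_separated` and the parent-list representation are assumptions made for this example, and the query nodes are assumed not to be among the evidence.

```python
def d_separated(parents, x, y, given):
    """True if x and y are d-separated given the set `given` in the DAG `parents`.

    Ancestral moral graph criterion: keep only x, y, the evidence and their
    ancestors; connect co-parents; drop arc directions; delete the evidence;
    x and y are d-separated iff they are then disconnected.
    """
    given = set(given)
    # 1. Ancestral set of {x, y} and the evidence.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])
    # 2. Moralize: undirected parent-child edges plus edges between co-parents.
    neighbors = {node: set() for node in keep}
    for child in keep:
        pas = parents[child]
        for p in pas:
            neighbors[p].add(child)
            neighbors[child].add(p)
            for q in pas:
                if q != p:
                    neighbors[p].add(q)
    # 3. Search from x without passing through evidence nodes.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node] - given - seen:
            seen.add(nxt)
            stack.append(nxt)
    return y not in seen

alarm_parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}
```

On the burglary network, `d_separated(alarm_parents, "Burglary", "Earthquake", set())` holds, while adding `"Alarm"` or `"JohnCalls"` to the evidence breaks the separation, matching the earlier slides.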

PROBABILITY QUERIES

Given: some probabilistic model over variables X

Find: distribution over $Y \subseteq X$ given evidence $E = e$ for some subset $E \subseteq X \setminus Y$: $P(Y \mid E = e)$

Inference problem

ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION

Easiest case: $Y = X \setminus E$

$P(Y \mid E=e) = P(Y, e)/P(e)$

Denominator makes the probabilities sum to 1

Determine $P(e)$ by marginalizing: $P(e) = \sum_y P(Y=y, e)$

Otherwise, let $Z = X \setminus (E \cup Y)$

$P(Y \mid E=e) = \sum_z P(Y, Z=z, e)/P(e)$

$P(e) = \sum_y \sum_z P(Y=y, Z=z, e)$

Inference with joint distribution: $O(2^{|X \setminus E|})$ for binary variables
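The summations above translate directly into brute-force enumeration. The following is a sketch for a single binary query variable, with hypothetical names (`query`, `alarm_joint`, `on`) and the burglary network's CPT values hard-coded so the block is self-contained. Its cost is exponential in the number of non-evidence variables, as stated.

```python
from itertools import product

def query(joint, variables, target, evidence):
    """P(target = True | evidence) for binary variables, by summing the joint.

    `joint` maps a dict {variable: bool} to a probability; every variable that
    is neither the target nor in the evidence is summed out.
    """
    hidden = [v for v in variables if v != target and v not in evidence]
    unnormalized = {}
    for value in (True, False):
        total = 0.0
        for hidden_values in product([True, False], repeat=len(hidden)):
            world = {**evidence, target: value, **dict(zip(hidden, hidden_values))}
            total += joint(world)
        unnormalized[value] = total           # P(target = value, e)
    p_e = sum(unnormalized.values())          # P(e), the normalizing denominator
    return unnormalized[True] / p_e

# Joint of the burglary network, written as the product of its CPT entries.
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def on(p_true, value):
    """P(variable = value) for a binary variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def alarm_joint(w):
    return (on(0.001, w["B"]) * on(0.002, w["E"]) * on(P_A[(w["B"], w["E"])], w["A"])
            * on(0.90 if w["A"] else 0.05, w["J"]) * on(0.70 if w["A"] else 0.01, w["M"]))

# Probability of a burglary given that both John and Mary called.
p_burglary = query(alarm_joint, ["B", "E", "A", "J", "M"], "B", {"J": True, "M": True})
```

Here Earthquake and Alarm play the role of $Z$: they are summed out for each value of the query variable before normalizing by $P(e)$.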

NAÏVE BAYES CLASSIFIER

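$P(\text{Class}, \text{Feature}_1, \dots, \text{Feature}_n) = P(\text{Class}) \prod_i P(\text{Feature}_i \mid \text{Class})$

[Figure: network with arcs Class → Feature1, Class → Feature2, …, Class → Featuren]

Given features, what class?

Example classes: Spam / Not Spam, or English / French / Latin

Example features: word occurrences

$P(C \mid F_1, \dots, F_n) = P(C, F_1, \dots, F_n)/P(F_1, \dots, F_n) = \frac{1}{Z}\,P(C) \prod_i P(F_i \mid C)$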

NAÏVE BAYES CLASSIFIER

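$P(\text{Class}, \text{Feature}_1, \dots, \text{Feature}_n) = P(\text{Class}) \prod_i P(\text{Feature}_i \mid \text{Class})$

$P(C \mid F_1, \dots, F_k) = \frac{1}{Z}\,P(C, F_1, \dots, F_k)$

$= \frac{1}{Z} \sum_{f_{k+1}, \dots, f_n} P(C, F_1, \dots, F_k, f_{k+1}, \dots, f_n)$

$= \frac{1}{Z}\,P(C) \sum_{f_{k+1}, \dots, f_n} \prod_{i=1}^{k} P(F_i \mid C) \prod_{j=k+1}^{n} P(f_j \mid C)$

$= \frac{1}{Z}\,P(C) \prod_{i=1}^{k} P(F_i \mid C) \prod_{j=k+1}^{n} \sum_{f_j} P(f_j \mid C)$

$= \frac{1}{Z}\,P(C) \prod_{i=1}^{k} P(F_i \mid C)$ (each inner sum $\sum_{f_j} P(f_j \mid C)$ equals 1)

Given some features, what is the distribution over class?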
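The practical consequence of this derivation is that unobserved features can simply be left out of the product. A minimal sketch for binary features; the function name `class_posterior`, the two-word vocabulary, and all probabilities are hypothetical values chosen for illustration.

```python
def class_posterior(prior, likelihoods, observed):
    """P(C | observed features) for a naive Bayes model.

    prior:       {class: P(C = class)}
    likelihoods: {feature: {class: P(feature present | class)}}
    observed:    {feature: bool} for the features that were actually seen;
                 unobserved features are simply left out of the product.
    """
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for feature, present in observed.items():
            p = likelihoods[feature][c]
            score *= p if present else 1.0 - p
        scores[c] = score
    z = sum(scores.values())                  # normalizer Z
    return {c: s / z for c, s in scores.items()}

# Hypothetical numbers for a two-word spam filter.
prior = {"spam": 0.4, "ham": 0.6}
likelihoods = {"offer": {"spam": 0.5, "ham": 0.1},
               "meeting": {"spam": 0.05, "ham": 0.3}}

# Only "offer" was observed; "meeting" is marginalized out by omission.
posterior = class_posterior(prior, likelihoods, {"offer": True})
```

Passing every feature in `observed` recovers the fully observed classifier of the previous slide, so one function covers both cases.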

FOR GENERAL QUERIES

For BNs and queries in general, it’s not that simple… more in later lectures.

Next class: skim 5.1-3, begin reading 9.1-4