Exploiting Structure in Probability Distributions


Page 1:

Exploiting Structure in Probability Distributions

Irit Gat-Viks

Based on presentation and lecture notes of Nir Friedman, Hebrew University

Page 2:

General References:

- D. Koller and N. Friedman, Probabilistic Graphical Models
- J. Pearl, Probabilistic Reasoning in Intelligent Systems
- F. Jensen, An Introduction to Bayesian Networks
- D. Heckerman, A Tutorial on Learning with Bayesian Networks

Page 3:

Basic Probability Definitions

Product rule: P(A,B) = P(A|B)*P(B) = P(B|A)*P(A)

Independence between A and B: P(A,B) = P(A)*P(B), or alternatively: P(A|B) = P(A), P(B|A) = P(B).

Total probability theorem: if B_1,...,B_n are mutually exclusive (B_i ∩ B_j = ∅ for i ≠ j) and exhaustive, then

P(A) = Σ_{i=1..n} P(A,B_i) = Σ_{i=1..n} P(B_i)*P(A|B_i)

Page 4:

Basic Probability Definitions

Bayes rule: P(A|B) = P(B|A)*P(A) / P(B)

Conditioned on C: P(A|B,C) = P(B|A,C)*P(A|C) / P(B|C)

Chain rule: P(X_1,...,X_n) = P(X_n|X_{n-1},...,X_1) * P(X_{n-1}|X_{n-2},...,X_1) * ... * P(X_2|X_1) * P(X_1)

Page 5:

Exploiting Independence Property

G: whether the woman is pregnant. D: whether the doctor's test is positive.

The joint distribution representation P(g,d):

G  D  P(G,D)
0  0  0.54
0  1  0.06
1  0  0.02
1  1  0.38

Factored representation, using conditional probability: P(g,d) = P(g)*P(d|g). The distributions P(g) and P(d|g):

P(g):    g0  0.6
         g1  0.4

P(d|g):       d0    d1
         g0   0.9   0.1
         g1   0.05  0.95

Example: P(g0,d1) = 0.06 vs. P(g0)*P(d1|g0) = 0.6*0.1 = 0.06

[Network: G → D]
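To make the factored representation concrete, here is a minimal Python sketch (the dictionary layout and names are ours, not the slides') that rebuilds the joint P(g,d) from the two local tables and reproduces the values above:

```python
# Factored representation: P(g,d) = P(g) * P(d|g).
P_g = {0: 0.6, 1: 0.4}                   # P(G)
P_d_g = {0: {0: 0.90, 1: 0.10},          # P(D | G=g0)
         1: {0: 0.05, 1: 0.95}}          # P(D | G=g1)

# Rebuild the joint distribution from the two local tables.
joint = {(g, d): P_g[g] * P_d_g[g][d] for g in (0, 1) for d in (0, 1)}
print(joint)  # ≈ {(0, 0): 0.54, (0, 1): 0.06, (1, 0): 0.02, (1, 1): 0.38}, up to float rounding
```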

Page 6:

Exploiting Independence Property

H: home test. Independence assumption: Ind(H;D|G) (i.e., given G, H is independent of D).

Joint distribution → factored representation:

P(d,h,g) = P(d,h|g)*P(g)          (product rule)
         = P(d|g)*P(h|g)*P(g)     (by Ind(H;D|G))

P(g):    g0  0.6
         g1  0.4

P(d|g):       d0    d1
         g0   0.9   0.1
         g1   0.05  0.95

P(h|g):       h0    h1
         g0   0.9   0.1
         g1   0.2   0.8

[Network: G → D, G → H]

Page 7:

Exploiting Independence Property

Representation of P(d,g,h) (the network G → D, G → H with the local tables P(g), P(d|g), P(h|g) from the previous slide):

                    joint distribution    factored distribution
No. of parameters   7                     5

Adding a new variable H:
- In the joint distribution, it changes the distribution entirely.
- In the factored distribution, modularity: the existing local probability models are reused, and only a new local probability model for H is needed.

=> Bayesian networks: exploiting independence properties of the distribution in order to allow a compact and natural representation.

Page 8:

Outline

- Introduction
- Bayesian Networks
  » Representation & Semantics
  - Inference in Bayesian networks
  - Learning Bayesian networks

Page 9:

Representing the Uncertainty

A story with five random variables: Burglary, Earthquake, Alarm, Neighbor Call, Radio Announcement. Specify a joint distribution with 2^5 = 32 parameters... maybe.

An expert system for monitoring intensive care patients: specify a joint distribution over 37 variables with (at least) 2^37 parameters... no way!!!

[Network diagram: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call]

Page 10:

Probabilistic Independence: a Key for Representation and Reasoning

Recall that if X and Y are independent given Z, then P(X|Z,Y) = P(X|Z).

In our story... if burglary and earthquake are independent, and alarm sound and radio are independent given earthquake, then instead of 15 parameters we need 8:

P(A,R,E,B) = P(A|R,E,B) * P(R|E,B) * P(E|B) * P(B)

versus

P(A,R,E,B) = P(A|E,B) * P(R|E) * P(E) * P(B)
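As a quick sanity check of those counts (a worked count, not spelled out in the slides):

```latex
\underbrace{2^4 - 1}_{\text{full joint over } A,R,E,B} = 15
\qquad\text{vs.}\qquad
\underbrace{4}_{P(A\mid E,B)} + \underbrace{2}_{P(R\mid E)} + \underbrace{1}_{P(E)} + \underbrace{1}_{P(B)} = 8
```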

Need a language to represent independence statements

Page 11:

Markov Assumption

Generalizing: a node is conditionally independent of its non-descendants, given the value of its parents:

Ind(X_i ; NonDescendants_{X_i} | Pa_{X_i})

It is a natural assumption for many causal processes.

[Diagram: a node X with its parent, ancestors, descendants (children Y1, Y2), and non-descendants labeled]

Page 12:

Markov Assumption (cont.)

In this example:
- R is independent of A, B, C, given E
- A is independent of R, given B and E
- C is independent of B, E, R, given A

Are other independencies implied by these ones? A graph-theoretic criterion (d-separation) identifies all such independencies.

[Network diagram: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call]

Page 13:

Bayesian Network Semantics

Qualitative part (conditional independence statements in the BN structure) + quantitative part (local probability models, e.g., multinomial, linear Gaussian) = a unique joint distribution over the domain.

Compact & efficient representation: when nodes have at most k parents, O(2^k * n) vs. O(2^n) parameters, and the parameters pertain to local interactions.

For the alarm network (B → A ← E, E → R, A → C):

P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)

versus

P(C,A,R,E,B) = P(B)*P(E)*P(R|E)*P(A|B,E)*P(C|A)

In general:

P(x_1,...,x_n) = Π_{i=1..n} P(x_i | Pa_{x_i})
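The general product formula is easy to turn into code. A minimal Python sketch for the alarm network (the CPT numbers here are illustrative placeholders, not values given in the slides):

```python
# Joint probability from a Bayesian network: P(x_1,...,x_n) = prod_i P(x_i | Pa_i).
# 'parents' gives each node's parent list; 'cpt[node]' maps a tuple of parent
# values to {value: probability}. All numbers below are made up for illustration.
parents = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}
cpt = {
    "B": {(): {0: 0.99, 1: 0.01}},
    "E": {(): {0: 0.999, 1: 0.001}},
    "R": {(0,): {0: 1.0, 1: 0.0}, (1,): {0: 0.35, 1: 0.65}},
    "A": {(0, 0): {0: 0.999, 1: 0.001}, (0, 1): {0: 0.71, 1: 0.29},
          (1, 0): {0: 0.06, 1: 0.94}, (1, 1): {0: 0.05, 1: 0.95}},
    "C": {(0,): {0: 0.95, 1: 0.05}, (1,): {0: 0.10, 1: 0.90}},
}

def joint(assignment):
    """P(assignment) as the product of the local conditional probabilities."""
    p = 1.0
    for node, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_vals][assignment[node]]
    return p

print(joint({"B": 1, "E": 0, "R": 0, "A": 1, "C": 1}))
```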

Page 14:

Bayesian networks

Qualitative part: statistical independence statements, encoded by a directed acyclic graph (DAG):
- Nodes: random variables of interest (with exhaustive and mutually exclusive states)
- Edges: direct influence

This gives an efficient representation of probability distributions via conditional independence.

Quantitative part: local probability models, a set of conditional probability distributions, e.g. P(A | E,B):

E   B    a1    a0
e1  b1   0.9   0.1
e1  b0   0.2   0.8
e0  b1   0.9   0.1
e0  b0   0.01  0.99

[Network diagram: Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call]

Page 15:

Outline

- Introduction
- Bayesian Networks
  - Representation & Semantics
  » Inference in Bayesian networks
  - Learning Bayesian networks

Page 16:

Inference in Bayesian networks

A Bayesian network represents a probability distribution. Can we answer queries about this distribution? Examples:

- Conditional probability: P(Y | Z=z)
- Most probable explanation: MPE(W | Z=z) = argmax_w P(w, z)
- Maximum a posteriori: MAP(Y | Z=z) = argmax_y P(y | z)

Page 17:

Inference in Bayesian networks

Goal: compute P(E=e, A=a) in the following Bayesian network:

A → B → C → D → E

Using the definition of probability, we have:

P(a,e) = Σ_b Σ_c Σ_d P(a,b,c,d,e)
       = Σ_b Σ_c Σ_d P(a)*P(b|a)*P(c|b)*P(d|c)*P(e|d)

Page 18:

Inference in Bayesian networks

Eliminating d, we get:

P(a,e) = Σ_b Σ_c Σ_d P(a)*P(b|a)*P(c|b)*P(d|c)*P(e|d)
       = Σ_b Σ_c P(a)*P(b|a)*P(c|b) * [Σ_d P(d|c)*P(e|d)]
       = Σ_b Σ_c P(a)*P(b|a)*P(c|b)*P(e|c)

[Diagram: A → B → C → D → E, with D crossed out]

Page 19:

Inference in Bayesian networks

Eliminating c, we get:

P(a,e) = Σ_b Σ_c P(a)*P(b|a)*P(c|b)*P(e|c)
       = Σ_b P(a)*P(b|a) * [Σ_c P(c|b)*P(e|c)]
       = Σ_b P(a)*P(b|a)*P(e|b)

[Diagram: A → B → C → D → E, with C and D crossed out]

Page 20:

Inference in Bayesian networks

Finally, we eliminate b:

P(a,e) = Σ_b P(a)*P(b|a)*P(e|b)
       = P(a) * [Σ_b P(b|a)*P(e|b)]
       = P(a)*P(e|a)

[Diagram: A → B → C → D → E, with B, C and D crossed out]

Page 21:

Variable Elimination Algorithm

General idea: write the query in the form

P(x_1) = Σ_{x_k} ... Σ_{x_3} Σ_{x_2} Π_i P(x_i | pa_i)

Then iteratively:
- Move all irrelevant terms outside of the innermost sum
- Perform the innermost sum, getting a new term
- Insert the new term into the product

In case of evidence, for P(x_i | evidence x_j) use:

P(x_i | x_j) = P(x_i, x_j) / P(x_j)
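A compact Python sketch of this elimination loop on the chain A → B → C → D → E from the previous slides (the CPT numbers and the factor representation are our illustrative choices):

```python
# Variable elimination on the chain A -> B -> C -> D -> E, computing P(A=a, E=e).
# cpd[child][parent_value][child_value] = P(child | parent); numbers are made up.
P_A = {0: 0.6, 1: 0.4}
cpd = {
    "B": {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}},   # P(B|A)
    "C": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}},   # P(C|B)
    "D": {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}},   # P(D|C)
    "E": {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}},   # P(E|D)
}

def eliminate(inner, outer):
    """Sum out the shared variable: f[x][z] = sum_y outer[x][y] * inner[y][z]."""
    return {x: {z: sum(outer[x][y] * inner[y][z] for y in outer[x])
                for z in (0, 1)}
            for x in outer}

f_e_c = eliminate(cpd["E"], cpd["D"])   # sum over d, yielding P(e|c)
f_e_b = eliminate(f_e_c, cpd["C"])      # sum over c, yielding P(e|b)
f_e_a = eliminate(f_e_b, cpd["B"])      # sum over b, yielding P(e|a)

a, e = 1, 1
print(P_A[a] * f_e_a[a][e])             # P(A=1, E=1) = P(a) * P(e|a)
```

Each call to eliminate sums out one variable and produces the new term (P(e|c), then P(e|b), then P(e|a)), exactly mirroring the hand derivation on the previous pages.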

Page 22:

Complexity of inference

Lower bound: general inference in BNs is NP-hard.

Upper bound:
- Naïve exact inference is exponential in the number of variables in the network.
- Variable elimination is exponential in the size of the largest factor and polynomial in the number of variables in the network. Its cost depends on the order of elimination (many heuristics exist, e.g., the clique tree algorithm).

Page 23:

Outline

- Introduction
- Bayesian Networks
  - Representation & Semantics
  - Inference in Bayesian networks
  » Learning Bayesian networks
    - Parameter Learning
    - Structure Learning

Page 24:

Learning

Process:
- Input: dataset and prior information
- Output: Bayesian network

Data + prior information → [Learning] → Bayesian network (e.g., the alarm network over E, R, B, A, C, with its local probability models)

Page 25:

The Learning Problem

                  | Known Structure                | Unknown Structure
Complete Data     | Statistical parametric         | Discrete optimization over
                  | estimation (closed-form eq.)   | structures (discrete search)
Incomplete Data   | Parametric optimization        | Combined (Structural EM,
                  | (EM, gradient descent...)      | mixture models...)

We will focus on complete data for the rest of the talk; the situation with incomplete data is more involved.

Page 26:

Outline

- Introduction
- Bayesian Networks
  - Representation & Semantics
  - Inference in Bayesian networks
  - Learning Bayesian networks
    » Parameter Learning
    - Structure Learning

Page 27:

Learning Parameters

Key concept: the likelihood function measures how the probability of the data changes when we change the parameters θ:

L(θ : D) = P(D | θ) = Π_m P(x[m] | θ)

Estimation:
- MLE: choose the parameters that maximize the likelihood.
- Bayesian: treat the parameters as an unknown quantity, and marginalize over it.

Page 28:

MLE principle for Binomial Data

Estimation task: given a sequence of samples x[1], x[2], ..., x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1-θ.

Data: H, H, T, H, H. θ is the unknown probability P(H).

Likelihood function:

L(θ : D) = Π_{k∈{0,1}} θ_k^{N_k} = θ^{N_H} * (1-θ)^{N_T};  for this data, L(θ : D) = θ^4 * (1-θ)

MLE principle: choose the parameter that maximizes the likelihood function. Applying the MLE principle we get θ̂ = N_H / (N_H + N_T), so the MLE for P(X = H) is 4/5 = 0.8.

[Plot: L(θ : D) as a function of θ on [0, 1], peaking at θ = 0.8]

Page 29:

MLE principle for Multinomial Data

Suppose X can take the values 1, 2, ..., k. We want to learn the parameters θ_1, ..., θ_k. Let N_1, ..., N_k be the number of times each outcome is observed.

Likelihood function:

L(θ : D) = Π_{k=1..K} θ_k^{N_k}     (θ_k: probability of the k-th outcome; N_k: count of the k-th outcome in D)

The MLE is:

θ̂_i = N_i / Σ_{l=1..k} N_l
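A short check of this estimator in Python (the toy data reuses the binomial example, which is the k = 2 case):

```python
from collections import Counter

# Multinomial MLE: theta_i = N_i / sum_l N_l.
data = ["H", "H", "T", "H", "H"]
counts = Counter(data)
total = sum(counts.values())
theta_hat = {x: n / total for x, n in counts.items()}
print(theta_hat)  # {'H': 0.8, 'T': 0.2}
```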

Page 30:

MLE principle for Bayesian networks

Training data has the form of i.i.d. samples m = 1, ..., M over the network variables (here the network E → A ← B, A → C):

E[1]  B[1]  A[1]  C[1]
 ...   ...   ...   ...
E[M]  B[M]  A[M]  C[M]

Assuming i.i.d. samples, the likelihood is

L(θ : D) = Π_m P(E[m], B[m], A[m], C[m] : θ)

By definition of the network, each term factors, so

L(θ : D) = Π_m P(E[m]:θ) * P(B[m]:θ) * P(A[m] | B[m],E[m] : θ) * P(C[m] | A[m] : θ)
         = [Π_m P(E[m]:θ)] * [Π_m P(B[m]:θ)] * [Π_m P(A[m] | B[m],E[m]:θ)] * [Π_m P(C[m] | A[m]:θ)]

Page 31:

MLE principle for Bayesian networks

Generalizing for any Bayesian network:

L(θ : D) = Π_m Π_i P(x_i[m] | Pa_i[m] : θ_i) = Π_i L_i(θ_i : D)

- The likelihood decomposes according to the network structure.
- Decomposition → independent estimation problems (if the parameters for each family are not related).
- For each value pa_i of the parents of X_i we get an independent multinomial problem, with likelihood Π_{x_i} θ_{x_i|pa_i}^{N(x_i, pa_i)}.

The MLE is:

θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
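A minimal Python sketch of this counting estimator for one family, P(A | B, E) (the sample layout, values, and names are ours):

```python
from collections import Counter

# MLE for a CPT: theta[x | pa] = N(x, pa) / N(pa).
# Each sample is a dict of variable -> value; toy data, for illustration only.
samples = [
    {"E": 0, "B": 0, "A": 0, "C": 0},
    {"E": 0, "B": 1, "A": 1, "C": 1},
    {"E": 1, "B": 0, "A": 1, "C": 1},
    {"E": 0, "B": 1, "A": 1, "C": 0},
]

joint_counts = Counter((s["B"], s["E"], s["A"]) for s in samples)   # N(a, pa)
pa_counts = Counter((s["B"], s["E"]) for s in samples)              # N(pa)

theta = {(b, e, a): n / pa_counts[(b, e)]
         for (b, e, a), n in joint_counts.items()}
print(theta)   # e.g. theta[(1, 0, 1)] = 1.0, from the two (B=1, E=0) samples
```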

Page 32:

Outline

- Introduction
- Bayesian Networks
  - Representation & Semantics
  - Inference in Bayesian networks
  - Learning Bayesian networks
    - Parameter Learning
    » Structure Learning

Page 33:

Learning Structure: Motivation

[Diagrams: three networks over Earthquake, Burglary, Alarm Set, Sound - the true structure, a variant with an added arc, and a variant with a missing arc]

Page 34:

Optimization Problem

Input:
- Training data
- Scoring function (including priors)
- Set of possible structures

Output: a network (or networks) that maximizes the score.

Key property - decomposability: the score of a network is a sum of per-family terms.

Page 35:

Scores

When the data is complete, the score is decomposable:

Score(G : D) = Σ_i Score(X_i | Pa_i^G : D)

For example, the BDe score:

Score(G : D) = P(G | D) ∝ P(D | G) * P(G),  where  P(D | G) = ∫ P(D | G, θ) * P(θ | G) dθ
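To illustrate decomposability in code: a Python sketch using the maximum-log-likelihood family score as a stand-in for the BDe term (the real BDe score integrates over Dirichlet priors; function names and data layout are ours):

```python
import math
from collections import Counter

def family_score(data, child, parents):
    """Max log-likelihood of one family: sum_m log theta_hat(child[m] | pa[m])."""
    joint = Counter((tuple(s[p] for p in parents), s[child]) for s in data)
    pa = Counter(tuple(s[p] for p in parents) for s in data)
    return sum(n * math.log(n / pa[pa_val]) for (pa_val, x), n in joint.items())

def score(graph, data):
    """Decomposable score: a sum of per-family terms, one per (node, parents) pair."""
    return sum(family_score(data, child, ps) for child, ps in graph.items())
```

Here graph maps each node to its parent list and data is a list of dicts, as in the parameter-learning sketch above.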

Page 36:

Heuristic Search (cont.)

Typical operations on a candidate network over S, C, E, D:

- Add an edge (e.g., add C → D)
- Remove an edge (e.g., remove C → E)
- Reverse an edge (e.g., reverse C → E)

[Diagrams: the current network and the three neighboring networks produced by these operations]

Page 37:

Heuristic Search

We address the problem by using heuristic search: traverse the space of possible networks, looking for high-scoring structures. Search techniques (a hill-climbing sketch follows below):

- Greedy hill-climbing
- Simulated annealing
- ...
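A greedy hill-climbing sketch over add/remove moves, reusing the decomposable score from the previous sketch (edge reversal and the acyclicity check are omitted for brevity; with a pure likelihood score this would keep adding edges, so in practice a penalized score such as BIC or BDe is used):

```python
def neighbors(graph, nodes):
    """Graphs reachable by adding or removing one edge (acyclicity check omitted)."""
    for child in nodes:
        for parent in nodes:
            if parent == child:
                continue
            new = {n: list(ps) for n, ps in graph.items()}
            if parent in new[child]:
                new[child].remove(parent)    # remove an existing edge
            else:
                new[child].append(parent)    # add a new edge
            yield new

def hill_climb(data, nodes):
    graph = {n: [] for n in nodes}           # start from the empty network
    best_score = score(graph, data)
    improved = True
    while improved:
        improved = False
        for g in neighbors(graph, nodes):
            s = score(g, data)
            if s > best_score:
                graph, best_score, improved = g, s, True
                break                        # restart from the improved graph
    return graph
```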

Page 38:

Outline

- Introduction
- Bayesian Networks
  - Representation & Semantics
  - Inference in Bayesian networks
  - Learning Bayesian networks
» Conclusion & Applications

Page 39:

Example I

Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Karen Sachs et al., Science, 2005.

[Figure: the currently accepted consensus network of human primary CD4 T cells, downstream of CD3, CD28, and LFA-1 activation; the 11 colored nodes are measured signaling proteins, perturbed in 10 conditions by small molecules]

Page 40:

Bayesian network results

- The inferred PKC → PKA edge was validated experimentally.
- Akt was not affected by Erk in additional experiments.

Page 41:

Example II: Bayesian network models for a transcription factor-DNA binding motif with 5 positions

[Figure: five model structures over the motif positions - PSSM (position-specific scoring matrix, all positions independent), tree, non-tree, mixture of PSSMs, and mixture of trees]