Probabilistic Graphical Models
COMP 790-90 Seminar
Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Outline
- Introduction
- Representation
- Bayesian network
  - Conditional Independence
  - Inference: Variable elimination
  - Learning
- Markov Random Field
  - Clique
  - Pair-wise MRF
  - Inference: Belief Propagation
- Conclusion
Introduction
Graphical Model = Probability Theory + Graph Theory
Probability theory ensures consistency and provides the interface from models to data.
Graph theory gives an intuitively appealing interface for humans and efficient general-purpose algorithms.
Introduction
Modularity: a complex system is built by combining simpler parts.
Graphical models provide a natural tool for two problems: uncertainty and complexity.
They play an important role in the design and analysis of machine learning algorithms.
Introduction
Many of the classical multivariate probabilistic systems are special cases of the general graphical model formalism:
- Mixture models
- Factor analysis
- Hidden Markov Models
- Kalman filters
The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism.
Techniques developed in one field can be transferred to other fields.
It is also a framework for the design of new systems.
Representation
A graphical model represents probabilistic relationships between a set of random variables.
Variables are represented by nodes: binary events, discrete variables, continuous variables.
Conditional (in)dependence is represented by the (absence of) edges.
Directed graphical model: Bayesian network.
Undirected graphical model: Markov Random Field.
Bayesian Network
A directed acyclic graph (DAG): directed edges give causal relationships between variables.
For each variable X with parents pa(X), there exists a conditional probability P(X | pa(X)).
For discrete variables this is a Conditional Probability Table (CPT).
A description of a noisy "causal" process.
An Example: What Causes Wet Grass?
(Figure: the Cloudy/Sprinkler/Rain/WetGrass network with its CPTs.)
More Complex Example
Diagnose the engine start problem.
More Complex Example
The Computer-based Patient Case Simulation system (CPCS-PM), developed by Parker and Miller.
422 nodes and 867 arcs: 14 nodes describe diseases, 33 nodes describe history and risk factors, and the remaining 375 nodes describe various findings related to the diseases.
Joint Distribution
$P(X_1, \dots, X_n)$
If the variables are binary, we need $O(2^n)$ parameters to describe P.
For the wet grass example, we need $2^4 - 1 = 15$ parameters.
Can we do better?
Key idea: use properties of independence.
Independent Random Variables
X is independent of Y iff $P(X = x \mid Y = y) = P(X = x)$ for all values x, y.
If X and Y are independent, then
$P(X, Y) = P(X \mid Y)\,P(Y) = P(X)\,P(Y)$
and, more generally,
$P(X_1, \dots, X_n) = P(X_1) \cdots P(X_n)$
Unfortunately, most random variables of interest are not independent of each other, as in the wet grass example.
Conditional Independence
A more suitable notion is that of conditional independence.
X and Y are conditionally independent given Z iff
$P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$
$P(X \mid Y, Z) = P(X \mid Z)$
Notation: $I(X, Y \mid Z)$
The conditional independence structure in the grass example: $I(S, R \mid C)$ and $I(C, W \mid S, R)$.
(Figure: the network with C on top, S and R in the middle, and W at the bottom.)
Conditional Independence
Directed Markov property: each random variable X is conditionally independent of its non-descendants, given its parents Pa(X).
Formally:
$P(X \mid \mathrm{NonDesc}(X), Pa(X)) = P(X \mid Pa(X))$
Notation: $I(X, \mathrm{NonDesc}(X) \mid Pa(X))$
(Figure: a node X with its parents, descendants, and non-descendants (Y1–Y4) labeled.)
Factorized Representation
The full joint distribution is defined in terms of local conditional distributions (obtained via the chain rule):
$P(x_1, \dots, x_n) = \prod_i p(x_i \mid pa(x_i))$
The graphical structure encodes conditional independences among the random variables and represents the full joint distribution over the variables more compactly.
Complexity reduction: the joint probability of n binary variables takes $O(2^n)$ parameters; the factorized form takes $O(n \cdot 2^k)$, where k is the maximal number of parents of a node.
Factorized Representation
The wet grass example:
$P(C, S, R, W) = P(W \mid S, R)\,P(R \mid C)\,P(S \mid C)\,P(C)$
We only need $1 + 2 + 2 + 4 = 9$ parameters.
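To make the factorization concrete, here is a minimal sketch in Python. The CPT values are the classic ones used in the inference table later in the deck; the dictionary layout is just one convenient encoding, not part of the original slides.

```python
# Wet grass network: P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R).
P_C = {True: 0.5, False: 0.5}
P_S = {True: {True: 0.1, False: 0.9},    # P(S | C)
       False: {True: 0.5, False: 0.5}}
P_R = {True: {True: 0.8, False: 0.2},    # P(R | C)
       False: {True: 0.2, False: 0.8}}
P_W = {(True, True): 0.99, (True, False): 0.9,   # P(W=T | S, R)
       (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    """Joint probability from the factorized form."""
    pw = P_W[(s, r)] if w else 1.0 - P_W[(s, r)]
    return P_C[c] * P_S[c][s] * P_R[c][r] * pw

# 1 + 2 + 2 + 4 = 9 independent parameters instead of 2^4 - 1 = 15.
print(joint(True, True, True, True))  # 0.5 * 0.1 * 0.8 * 0.99 = 0.0396
```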
Inference
Computation of the conditional probability distribution of one set of nodes, given a model and another set of nodes.
Bottom-up: given observations (the leaves), the probabilities of the causes can be calculated accordingly; "diagnosis" from effects to causes.
Top-down: knowledge of a cause influences the probability of the outcome; predict the effects.
Basic Computation
The value of x depends on y.
Dependency: conditional probability P(x | y).
Knowledge about y: prior probability P(y).
Product rule:
$P(x, y) = P(x \mid y)\,P(y)$
Sum rule (marginalization):
$P(x) = \sum_y P(x, y), \qquad P(y) = \sum_x P(x, y)$
Bayes' rule:
$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$ (posterior = likelihood × prior / evidence)
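A quick numeric illustration of these three rules, using a made-up 2×2 joint distribution (the numbers are arbitrary):

```python
import numpy as np

# A made-up joint P(x, y) over binary x (rows) and y (columns).
P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

P_x = P_xy.sum(axis=1)       # sum rule: P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)       # sum rule: P(y) = sum_x P(x, y)
P_x_given_y = P_xy / P_y     # P(x | y) = P(x, y) / P(y)

# Product rule recovers the joint: P(x, y) = P(x | y) P(y).
assert np.allclose(P_x_given_y * P_y, P_xy)

# Bayes' rule: P(y | x) = P(x | y) P(y) / P(x).
P_y_given_x = (P_x_given_y * P_y) / P_x[:, None]
print(P_y_given_x)  # each row sums to 1
```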
Inference: Bottom-Up
Observe wet grass (denoted by W = T).
Two possible causes: rain or sprinkler. Which is more likely?
Apply Bayes' rule. First, the probability of the evidence:
$P(W = T) = \sum_{c,s,r} P(C = c, S = s, R = r, W = T)$
$= 0.0396 + 0.009 + 0.324 + 0 + 0.0495 + 0.18 + 0.045 + 0 = 0.6471$
Inference: Bottom-Up
C S R W   P(C,S,R,W) = P(W|S,R) · P(R|C) · P(S|C) · P(C)
T T T T   0.99 · 0.8 · 0.1 · 0.5 = 0.0396
T T F T   0.9  · 0.2 · 0.1 · 0.5 = 0.009
T F T T   0.9  · 0.8 · 0.9 · 0.5 = 0.324
T F F T   0    · 0.2 · 0.9 · 0.5 = 0
F T T T   0.99 · 0.2 · 0.5 · 0.5 = 0.0495
F T F T   0.9  · 0.8 · 0.5 · 0.5 = 0.18
F F T T   0.9  · 0.2 · 0.5 · 0.5 = 0.045
F F F T   0    · 0.8 · 0.5 · 0.5 = 0
Inference: Bottom-Up
Observe wet grass (W = T). How likely is the sprinkler?
Apply Bayes' rule:
$P(S = T \mid W = T) = \frac{P(S = T, W = T)}{P(W = T)} = \frac{\sum_{c,r} P(C = c, S = T, R = r, W = T)}{P(W = T)}$
$= \frac{0.0396 + 0.009 + 0.0495 + 0.18}{0.6471} = \frac{0.2781}{0.6471} \approx 0.43$
Inference: Bottom-Up
Observe wet grass (W = T). How likely is rain?
Apply Bayes' rule:
$P(R = T \mid W = T) = \frac{P(R = T, W = T)}{P(W = T)} = \frac{\sum_{c,s} P(C = c, S = s, R = T, W = T)}{P(W = T)}$
$= \frac{0.0396 + 0.324 + 0.0495 + 0.045}{0.6471} = \frac{0.4581}{0.6471} \approx 0.708$
So rain is the more likely cause.
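These sums are easy to check by brute-force enumeration. A self-contained sketch (the CPTs repeat the values from the table above):

```python
from itertools import product

P_C = {True: 0.5, False: 0.5}
P_S = {True: {True: 0.1, False: 0.9}, False: {True: 0.5, False: 0.5}}
P_R = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    pw = P_W[(s, r)] if w else 1.0 - P_W[(s, r)]
    return P_C[c] * P_S[c][s] * P_R[c][r] * pw

# Probability of the evidence: P(W=T).
p_w = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))

# Posteriors of the two candidate causes.
p_s = sum(joint(c, True, r, True) for c, r in product([True, False], repeat=2)) / p_w
p_r = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2)) / p_w
print(round(p_w, 4), round(p_s, 2), round(p_r, 3))  # 0.6471 0.43 0.708
```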
Inference: Top-Down
The probability that the grass will be wet, given that it is cloudy:
$P(W = T \mid C = T) = \frac{P(W = T, C = T)}{P(C = T)} = \frac{\sum_{s,r} P(C = T, S = s, R = r, W = T)}{P(C = T)}$
(Figure: the network with C on top, S and R in the middle, and W at the bottom.)
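The same enumeration answers this predictive query (the CPTs are inlined again so the sketch runs on its own; the value 0.7452 is implied by the numbers above, not stated on the slide):

```python
from itertools import product

P_C = {True: 0.5, False: 0.5}
P_S = {True: {True: 0.1, False: 0.9}, False: {True: 0.5, False: 0.5}}
P_R = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    pw = P_W[(s, r)] if w else 1.0 - P_W[(s, r)]
    return P_C[c] * P_S[c][s] * P_R[c][r] * pw

# Top-down query: P(W=T | C=T).
num = sum(joint(True, s, r, True) for s, r in product([True, False], repeat=2))
print(num / P_C[True])  # 0.7452
```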
Inference Algorithms
The exact inference problem in a general graphical model is NP-hard.
Exact inference:
- Variable elimination
- Message passing algorithm
- Clustering and junction tree approach
Approximate inference:
- Loopy belief propagation
- Sampling (Monte Carlo) methods
- Variational methods
Variable Elimination
Computing P(W = T).
Approach 1, the blind approach: sum out all un-instantiated variables from the full joint.
Computation cost: $O(2^n)$.
For the wet grass example: number of additions: 14; number of products: ?
Solution: exploit the graph structure.
Variable Elimination
Approach 2: interleave sums and products.
The key idea is to push the sums in as far as possible:
$P(W = T) = \sum_{s,r} P(W = T \mid s, r) \sum_c P(c)\,P(s \mid c)\,P(r \mid c)$
In the computation, first compute the inner sum (eliminating C), then the outer sums, and so on.
Computation cost: $O(n \cdot 2^k)$.
For the wet grass example: number of additions: ? Number of products: ?
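A minimal sketch of the interleaved computation for this query, eliminating C first to form an intermediate factor over (S, R) (CPTs as before):

```python
# Variable elimination for P(W=T) in the wet grass network.
P_C = {True: 0.5, False: 0.5}
P_S = {True: {True: 0.1, False: 0.9}, False: {True: 0.5, False: 0.5}}
P_R = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}
vals = [True, False]

# Step 1: eliminate C, producing an intermediate factor f(s, r).
f_sr = {(s, r): sum(P_C[c] * P_S[c][s] * P_R[c][r] for c in vals)
        for s in vals for r in vals}

# Step 2: sum the factor against the evidence W = T.
p_w = sum(P_W[(s, r)] * f_sr[(s, r)] for s in vals for r in vals)
print(p_w)  # 0.6471
```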
Learning
Learning
Learn parameters or structure from data.
Structure learning: find the correct connectivity between the existing nodes.
Parameter learning: find maximum likelihood estimates of the parameters of each conditional probability distribution.
In practice, a lot of knowledge (structures and probabilities) comes from domain experts.
Learning
Structure   Observation   Method
Known       Full          Maximum Likelihood (ML) estimation
Known       Partial       Expectation Maximization (EM) algorithm
Unknown     Full          Model selection
Unknown     Partial       EM + model selection
Model Selection Method
Select a 'good' model from all possible models and use it as if it were the correct model.
Having defined a scoring function, a search algorithm is then used to find a network structure that receives the highest score while fitting the prior knowledge and data.
Unfortunately, the number of DAGs on n variables is super-exponential in n. The usual approach is therefore to use local search algorithms (e.g., greedy hill climbing) to search through the space of graphs.
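A schematic of greedy hill climbing over structures; score and neighbors are placeholders for a scoring function (e.g., BIC) and a legal-move generator, not any particular library's API:

```python
# Greedy hill climbing over DAG structures (schematic sketch).
def hill_climb(initial_graph, data, neighbors, score):
    current = initial_graph
    current_score = score(current, data)
    improved = True
    while improved:
        improved = False
        # neighbors(g) should yield the graphs reachable by one edge
        # addition, deletion, or reversal that keeps g acyclic.
        for candidate in neighbors(current):
            s = score(candidate, data)
            if s > current_score:
                # keep any improvement; loop until no neighbor scores higher
                current, current_score = candidate, s
                improved = True
    return current  # a local optimum, not necessarily the global one
```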
EM Algorithm
Expectation (E) step: use the current parameters to estimate the unobserved data.
Maximization (M) step: use the estimated data to do ML/MAP estimation of the parameters.
Repeat the EM steps until convergence.
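As a concrete, if simplified, illustration: a sketch of EM for a two-coin mixture, where the identity of the coin behind each trial is the unobserved data (all numbers here are assumptions for the demo, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Each trial: one of two biased coins is flipped 10 times and we record
# the number of heads. Which coin was used is hidden.
n_flips = 10
heads = np.concatenate([rng.binomial(n_flips, 0.8, 30),
                        rng.binomial(n_flips, 0.35, 30)])

pi, pA, pB = 0.5, 0.6, 0.5   # initial guesses
for _ in range(200):
    # E step: responsibility of coin A for each trial
    # (the shared binomial coefficient cancels in the ratio).
    lA = pi * pA**heads * (1 - pA)**(n_flips - heads)
    lB = (1 - pi) * pB**heads * (1 - pB)**(n_flips - heads)
    r = lA / (lA + lB)
    # M step: ML estimates given the soft assignments.
    pi = r.mean()
    pA = (r * heads).sum() / (r * n_flips).sum()
    pB = ((1 - r) * heads).sum() / ((1 - r) * n_flips).sum()

print(round(pA, 2), round(pB, 2))  # recovers roughly 0.8 and 0.35
```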
Markov Random Fields
Undirected edges simply give correlations between variables.
The joint distribution is a product of local functions over the cliques of the graph:
$P(x) = \frac{1}{Z} \prod_C P_C(x_C)$
where the $P_C$ are the clique potentials and Z is a normalization constant.
Example with two cliques, A = {x, y, w} and B = {x, y, z}:
$P(x, y, z, w) = \frac{1}{Z}\,P_A(x, y, w)\,P_B(x, y, z)$
(Figure: four nodes w, x, y, z with the two cliques highlighted.)
The Clique
A clique: a set of variables which are the arguments of a local function.
The order of a clique: the number of variables in the clique.
Example with first-, second-, and third-order cliques:
$P(x_1, \dots, x_5) = P_A(x_1)\,P_B(x_5)\,P_C(x_1, x_2, x_3)\,P_D(x_3, x_4)\,P_E(x_3, x_5)$
Regular and Arbitrary Graph
(Figure: a regular grid graph contrasted with an arbitrary graph.)
Pair-wise MRF
The order of the cliques is at most two.
Commonly used in computer vision applications: infer the underlying unknown variables through local observations and a smoothness prior.
(Figure: a 3×3 grid. Observed image nodes o1–o9 are tied to underlying truth nodes i1–i9 by compatibility functions φx(ix); neighboring truth nodes are linked by pair-wise potentials ψxy(ix, iy), e.g. ψ12(i1, i2), ψ45(i4, i5).)
Pair-wise MRF
(Figure: the same 3×3 grid of observed nodes o1–o9 and hidden nodes i1–i9.)
ψxy(ix, iy) is an nx × ny matrix.
φx(ix) is a vector of length nx, where nx is the number of states of ix.
Pair-wise MRF
(Figure: the same 3×3 grid of observed nodes o1–o9 and hidden nodes i1–i9.)
Given all the evidence nodes yi, we want to find the most likely state for all the hidden nodes xi, which is equivalent to maximizing
$P(\{x\}) = \frac{1}{Z} \prod_{(i,j)} \psi_{ij}(x_i, x_j) \prod_i \phi_i(x_i)$
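A minimal sketch of this objective on a toy chain of three binary hidden nodes, with assumed potentials and brute-force normalization (feasible only at this tiny scale):

```python
import numpy as np
from itertools import product

# Toy pair-wise MRF: three binary hidden nodes in a chain.
# phi[i] encodes local evidence; psi favors equal neighboring states.
phi = [np.array([0.7, 0.3]), np.array([0.4, 0.6]), np.array([0.2, 0.8])]
psi = np.array([[0.9, 0.1],
                [0.1, 0.9]])

def unnormalized(states):
    p = 1.0
    for i, s in enumerate(states):
        p *= phi[i][s]                      # node potentials phi_i(x_i)
    for i in range(len(states) - 1):
        p *= psi[states[i], states[i + 1]]  # edge potentials psi_ij(x_i, x_j)
    return p

Z = sum(unnormalized(s) for s in product([0, 1], repeat=3))
best = max(product([0, 1], repeat=3), key=unnormalized)
print(best, unnormalized(best) / Z)  # most likely joint state and its probability
```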
Belief Propagation
(Figure: the same 3×3 grid of observed nodes o1–o9 and hidden nodes i1–i9.)
Beliefs are used to approximate this probability:
$b_x(i_x) \propto \phi_x(i_x) \prod_{z \in N(x)} m_{zx}(i_x)$
where the message from node x to a neighboring node y is
$m_{xy}(i_y) = \sum_{i_x} \phi_x(i_x)\,\psi_{xy}(i_x, i_y) \prod_{z \in N(x) \setminus \{y\}} m_{zx}(i_x)$
Belief Propagation
Example: the belief at node 5 combines its local evidence with the messages from its four neighbors:
$b_5(i_5) \propto \phi_5(i_5)\,m_{25}(i_5)\,m_{45}(i_5)\,m_{65}(i_5)\,m_{85}(i_5)$
(Figure: node i5 receiving messages m2→5, m4→5, m6→5, m8→5.)
Belief Propagation
Each incoming message is itself assembled from evidence and messages one step away, e.g.
$m_{45}(i_5) = \sum_{i_4} \phi_4(i_4)\,\psi_{45}(i_4, i_5)\,m_{14}(i_4)\,m_{74}(i_4)$
(Figure: the message from i4 to i5 built from φ4(i4) and the incoming messages m14(i4) and m74(i4).)
Belief Propagation
The algorithm, given φx(ix) and ψxy(ix, iy):
1. For every node ix, compute the message mzx(ix) from each neighbor iz.
2. Check whether the beliefs bx(ix) converge; if not, repeat step 1.
3. Once converged, compute the beliefs bx(ix) and output the most likely state for every node ix.
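A compact sketch of this loop on the toy chain from the previous snippet, with normalized, synchronous message updates (the potentials are assumptions; on a tree such as this chain the fixed point is exact):

```python
import numpy as np

phi = [np.array([0.7, 0.3]), np.array([0.4, 0.6]), np.array([0.2, 0.8])]
psi = np.array([[0.9, 0.1],
                [0.1, 0.9]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# Messages m[(x, y)] from node x to neighbor y, initialized uniformly.
m = {(x, y): np.ones(2) / 2 for x in neighbors for y in neighbors[x]}

for _ in range(50):
    new_m = {}
    for (x, y) in m:
        # m_xy(i_y) = sum_{i_x} phi_x(i_x) psi(i_x, i_y) prod_{z != y} m_zx(i_x)
        prod = phi[x].copy()
        for z in neighbors[x]:
            if z != y:
                prod = prod * m[(z, x)]
        msg = psi.T @ prod
        new_m[(x, y)] = msg / msg.sum()   # normalize for numerical stability
    m = new_m

for x in neighbors:
    b = phi[x].copy()
    for z in neighbors[x]:
        b = b * m[(z, x)]
    b /= b.sum()
    print(x, b, "most likely state:", int(b.argmax()))
```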
Application: Learning-Based Image Super-Resolution
Extrapolate higher-resolution images from low-resolution inputs.
The basic assumption: there are correlations between low-frequency and high-frequency information.
A node corresponds to an image patch:
- φx(xp): the probability of the high-frequency content given the observed low-frequency content.
- ψxy(xp, xq): the smoothness prior between neighboring patches.
Image Super-Resolution
(Figure: (a) images from a "generic" example set; (b) input, magnified ×4; (c) cubic spline; (d) super-resolution result; (e) actual full-resolution image.)
Conclusion
A graphical representation of the probabilistic structure of a set of random variables, along with functions that can be used to derive the joint probability distribution.
An intuitive interface for modeling.
Modular: a useful tool for managing complexity.
A common formalism for many models.
References
- Kevin Murphy, Introduction to Graphical Models, Technical Report, May 2001.
- M. I. Jordan, Learning in Graphical Models, MIT Press, 1999.
- Yijuan Lu, Introduction to Graphical Models, http://www.cs.utsa.edu/~danlo/teaching/cs7123/Fall2005/Lyijuan.ppt.
- Milos Hauskrecht, Probabilistic Graphical Models, http://www.cs.pitt.edu/~milos/courses/cs3710/Lectures/Class3.pdf.
- P. Smyth, Belief networks, hidden Markov models, and Markov random fields: a unifying view, Pattern Recognition Letters, 1998.
- F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, Factor graphs and the sum-product algorithm, IEEE Transactions on Information Theory, February 2001.
- J. S. Yedidia, W. T. Freeman, and Y. Weiss, Understanding Belief Propagation and Its Generalizations, IJCAI 2001 Distinguished Lecture track.
- William T. Freeman, Thouis R. Jones, and Egon C. Pasztor, Example-based super-resolution, IEEE Computer Graphics and Applications, March/April 2002.
- W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, Learning Low-Level Vision, International Journal of Computer Vision, 40(1), pp. 25–47, 2000.