STAT 598L: Probabilistic Graphical Models
Instructor: Sergey Kirshner
Markov Networks
Motivating Example
• Is there a Bayesian network that is a P-map for {(A ⊥ B │ C, D), (C ⊥ D │ A, B)}?
– No other independencies except for applications of symmetry, so the remaining pairs are dependent (in a P-map)
– Skeleton
– Adding directions
• Without loss of generality, A->C
• Cannot have B->C (A->C<-B)
• Cannot have D->B (C->B<-D)
• Cannot have A->D (A->D<-B)
[Figure: skeleton over A, B, C, D (the four-cycle A-C-B-D-A)]
No BN P-map!
Undirected Model
• Is there a different framework that can represent these dependencies?
– What if we had undirected separation instead of d-separation?
[Figure: undirected graph over A, B, C, D]
• Markov networks (Markov random fields, MRFs)
– Represent conditional independence relations with an undirected graph
– Encode functional dependence using potential functions or factors
Factors
{X1, X2, …, Xn} = set of variables
{Y1, Y2, …, Yk} ⊆ {X1, X2, …, Xn} = subset of variables
φ : Val(Y1) × Val(Y2) × … × Val(Yk) → R+ = factor, with Scope[φ] = {Y1, …, Yk}
Joint probability = product of factors
Factor = measure of relationship for a group of variables
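A minimal sketch (not from the slides) of a factor as a data structure: a table of nonnegative values indexed by assignments to the variables in its scope. The class and variable names here are illustrative.

```python
from itertools import product

class Factor:
    """A factor: a table mapping assignments of its scope to nonnegative reals."""
    def __init__(self, scope, values):
        # scope: tuple of variable names; values: dict assignment-tuple -> float
        self.scope = tuple(scope)
        self.values = dict(values)

    def __call__(self, assignment):
        # assignment: dict variable -> value; look up the entry for this factor's scope
        return self.values[tuple(assignment[v] for v in self.scope)]

# Example: a (made-up) factor over two binary variables
phi_AB = Factor(("A", "B"), {(a, b): 1.0 for a, b in product([0, 1], repeat=2)})
print(phi_AB({"A": 1, "B": 0, "C": 0}))  # extra variables in the assignment are ignored
```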
Example
• Gibbs distribution: P(X1, …, Xn) = (1/Z) ∏_i φ_i(D_i), where Z, the normalization constant (partition function), sums the product of the factors over all assignments
Example (continued)
• How many free parameters? 3 + 3 + 3 + 3 = 12 (four pairwise factors over binary variables, 3 free parameters each)
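To make the Gibbs distribution concrete, here is a brute-force sketch for the four-cycle example; the factor values are invented for illustration, and the enumeration over all 2^4 assignments is exactly what the partition function Z sums over.

```python
from itertools import product

# Four illustrative pairwise factors over binary A, C, B, D (values are made up)
phi = {
    ("A", "C"): {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    ("C", "B"): {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 1.0},
    ("B", "D"): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0},
    ("D", "A"): {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0},
}

def unnormalized(x):
    """Product of all factors evaluated at assignment x = dict var -> {0, 1}."""
    p = 1.0
    for (u, v), table in phi.items():
        p *= table[(x[u], x[v])]
    return p

# Partition function: sum of the unnormalized measure over all 2^4 assignments
Z = sum(unnormalized(dict(zip("ACBD", vals))) for vals in product([0, 1], repeat=4))

def gibbs(x):
    return unnormalized(x) / Z

print(Z, gibbs({"A": 1, "C": 1, "B": 0, "D": 0}))
```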
Factors and Free Parameters
• For this analysis, stick to binary variables
• Each factor of k variables = 2^k - 1 free parameters
• Assume all factors are of the same size
– At most C(n, k) possible factors (O(n^k))
– Total of O(n^k 2^k) free parameters
– Compare to O(2^n) for a full table
• Conclusion: even using large factors reduces the number of free parameters
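Writing out the counting argument (binary variables, factors of size k), with a small numeric instance for concreteness:

\[
\underbrace{\binom{n}{k}}_{\text{possible factors}} \times \underbrace{(2^{k}-1)}_{\text{free parameters each}} = O\!\left(n^{k} 2^{k}\right) \quad \text{vs.} \quad 2^{n}-1 \text{ for a full joint table.}
\]
\[
\text{E.g., } n = 20,\; k = 2: \quad \binom{20}{2}\cdot 3 = 570 \;\;\text{vs.}\;\; 2^{20}-1 = 1{,}048{,}575.
\]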
BNs: Special Case
Factor Operations: Product
X=x Y=y φ1(x,y)
1 1 0.4
1 0 0.7
0 1 1
0 0 0.8
Y=y Z=z φ2(y,z)
1 1 0.3
1 0 0.9
0 1 0.5
0 0 1
X=x Y=y Z=z φ12(x,y,z)
1 1 1 0.12
1 1 0 0.36
1 0 1 0.35
1 0 0 0.7
0 1 1 0.3
0 1 0 0.9
0 0 1 0.4
0 0 0 0.8
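A sketch of the factor product illustrated by these tables: entries that agree on the shared variable Y are multiplied, so φ12(x, y, z) = φ1(x, y) · φ2(y, z). The numbers below are taken from the tables above, so the output should reproduce the φ12 column.

```python
from itertools import product

# phi1 over (X, Y) and phi2 over (Y, Z), values taken from the tables above
phi1 = {(1, 1): 0.4, (1, 0): 0.7, (0, 1): 1.0, (0, 0): 0.8}
phi2 = {(1, 1): 0.3, (1, 0): 0.9, (0, 1): 0.5, (0, 0): 1.0}

# Factor product: multiply entries that agree on the shared variable Y
phi12 = {(x, y, z): phi1[(x, y)] * phi2[(y, z)]
         for x, y, z in product([0, 1], repeat=3)}

for (x, y, z), v in sorted(phi12.items(), reverse=True):
    print(f"X={x} Y={y} Z={z}  phi12={v:.2f}")
```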
Conditional Independence?
• What about {A,C}, {A,D}, {B,C}, and {B,D}?
– They cannot be made independent!
– Edges connect variables in the same scope
– Resulting graph = Markov network
[Figure: Markov network over A, B, C, D (the four-cycle A-C-B-D-A)]
Factorization: Formal Definition
• Given: Gibbs distribution P with non-negative factors Φ = {φ1, …, φK}, and a Markov network H
• P factorizes over H: the scope of every factor corresponds to a complete subgraph of H
[Figure: two Markov networks over A, B, C, D]
Factorization
• Collection of factors is not unique
– Are the scopes {A,B}, {A,C}, and {B,C}, or is it just {A,B,C}?
– Networks can obscure scopes (structures) of the original factors
[Figure: complete graph (triangle) over A, B, C]
Graphical Model
Graphical Model = Graph + Parameters
Bayesian network = parents in chain decomposition + conditional probability distributions
Markov network = variables in factors + factors
Undirected vs Directed Model
• Bayesian networks:
– DAG => dimensionality reduction with chain rule for probability (simple justification)
– Possible causal dependence (interpretation of the edge directions)
– Parameters are interpretable
– Represented independencies depend on the order of variables (drawback)
• Undirected model:
– No ordering to consider! (Fewer objects, one less uncertainty to worry about)
– Intuition using exponential models (later in the course)
– Difficult to interpret (and to elicit) the parameters
Representational Power: BN vs MN
• Can Bayesian networks represent all independencies of a Markov network?
– No: {(A ⊥ B │ C, D), (C ⊥ D │ A, B)}
• Can Markov networks represent all independencies of a Bayesian network?
– No: A -> B <- C
• What is the overlap?
– Later
Graph Separation
• Need to establish conditional independence from undirected graph properties
• Active path = none of the intermediate variables are observed
• No active paths = separation
• Monotonic: adding observed variables can only reduce active paths
[Figure: undirected graph over A, B, C, D, E with a path blocked by an observed node]
Set of global independencies (global Markov property)
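A sketch of how the separation test just described could be checked in code (the graph encoding and function names are mine): search for a path from X to Y that avoids the observed set Z; if none exists, X and Y are separated given Z.

```python
from collections import deque

def separated(adj, X, Y, Z):
    """Return True if every path from X to Y in the undirected graph `adj`
    passes through an observed node in Z (i.e., X and Y are separated given Z)."""
    observed = set(Z)
    frontier = deque(x for x in X if x not in observed)
    reached = set(frontier)
    while frontier:
        node = frontier.popleft()
        for nbr in adj[node]:
            if nbr in observed or nbr in reached:
                continue          # observed nodes block the path
            if nbr in Y:
                return False      # found an active path into Y
            reached.add(nbr)
            frontier.append(nbr)
    return True

# The four-cycle from the earlier example: A-C, C-B, B-D, D-A
adj = {"A": {"C", "D"}, "B": {"C", "D"}, "C": {"A", "B"}, "D": {"A", "B"}}
print(separated(adj, {"A"}, {"B"}, {"C", "D"}))  # True: (A ⊥ B | C, D)
print(separated(adj, {"A"}, {"B"}, {"C"}))       # False: the path A-D-B is active
```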
Representation Theorem for BNs
P factorizes according to G <=> each variable is independent of its non-descendants given its parents (local Markov assumption)
(independencies <-> graph structure)
Representation Theorem for MNs
P factorizes according to H => global independencies set by scopes of factors (global Markov property)
(independencies <-> graph structure; converse?)
[Figure: undirected graph over A, B, C, D, E]
Representation Theorem for MNs
• Proof: need to show that factorization according to H implies the global (separation) independencies
– Case 1: Assume A ∪ B ∪ C contains all of the variables
• Partition the factor scopes Di so that either Di ⊆ A ∪ C or Di ⊆ B ∪ C
[Figure: variables partitioned into sets A, B, C]
Representation Theorem for MNs
• Proof (continued)
– Case 2: A ∪ B ∪ C does not contain all of the variables; split the remaining variables into two sets U1 and U2 and reduce to Case 1
[Figure: sets A, B, C with the remaining variables split into U1 and U2]
Converse?
• Think xor
Global Markov property => P factorizes according to H (scopes of factors)?
[Figure: undirected graph over A, C, B, D, E]
Hammersley-Clifford Theorem
If P is positive and satisfies the global independencies given by separation in H (the global Markov property), then P factorizes according to H.
[Figure: undirected graph over A, C, B, D, E]
Completeness of separation
Active trail between X and Y given Z => X and Y are dependent given Z in some P that factorizes according to H
• Interpreting the statement
• Sketch of proof (by construction):
– All factors not in the trail are uniform (remove nodes and edges not in the trail)
– Make the remaining factors almost deterministic
More General Result
Soundness
• Intuition: two binary variables X and Y; a 3-d space of possible factors with a 2-d manifold for independence
Completeness (almost)
[Figure: two-node graph X - Y]
Representation Theorem for BNs
P factorizes according to G <=> each variable is independent of its non-descendants given its parents (local Markov assumption)
(independencies <-> graph structure)
Other Ways to Encode Independence
• Local Markov independencies (Markov blanket)
• Pairwise Markov independencies
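The formulas themselves did not survive the transcript; the standard statements, which I assume are what the slide shows, are:

\[
\text{Local: } \bigl(X \;\perp\; \mathcal{X} - \{X\} - \mathrm{MB}_H(X) \;\mid\; \mathrm{MB}_H(X)\bigr) \text{ for every } X, \text{ where } \mathrm{MB}_H(X) \text{ is the set of neighbors of } X \text{ in } H;
\]
\[
\text{Pairwise: } \bigl(X \;\perp\; Y \;\mid\; \mathcal{X} - \{X, Y\}\bigr) \text{ for every pair } X, Y \text{ not connected by an edge in } H.
\]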
Relation Between Independencies
• Two separated nodes will also be separated by the neighbors of either node
• Variables corresponding to non-adjacent nodes are conditionally independent given the variables corresponding to their neighbors
– Conditionally independent also given the rest of the variables (monotonicity)
global => local => pairwise
Converse
• For all disjoint A, B, and C, show that the pairwise independencies imply the global ones
– Induction on the size of C
• |C| = n - 2:
• |C| = k - 1 < n - 2, case I:
Converse
• For all disjoint A, B, and C,
– Induction on the size of C
• |C| = k - 1 < n - 2, case II:
• Assume |A| = |B| = 1; otherwise approach as in case I
Equivalence
• Given P is positive, the following are equivalent:
– Global Markov property
– Local Markov property
– Pairwise Markov property
How To Recover MNs from Distribution
• If P is positive:
– Check whether A ⊥ B | X - A - B for each pair (if not, add the edge A-B), or
– Find the smallest C such that A ⊥ X - A - C | C; then C = MBP(A) (Markov blanket)
– In both cases, the resulting graph is a minimal I-map of P
– The graphs are the same: such an I-map is unique!
• If P is not positive:
– No guarantee that the resulting graph is an I-map of P
Finding P-maps
• If a P-map exists:
– Find a minimal I-map
– It is also a P-map!
• Does it always exist?
– Think v-structure
Alternative Parametrizations
• Structure of the Markov network may hide the scopes of the factors
– Think complete graph: is it one factor with all variables in the scope, or a product of factors with pairs of variables in the scope?
• May want to make the factorization more explicit in the structure
Factor Graphs
• Bipartite graph: variables vs factors
[Figure: Markov network over A, B, C, D and the corresponding factor graph (variable nodes A, C, B, D connected to factor nodes)]
Log-Linear Model
• Product into a sum
• Convert factors into a finer set of features
• Break down factors further (context)
• Different features may share the same scope
(energy functions; weights and features: see the sketch below)
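The slide's equations are missing from the transcript; the standard forms that the annotations (energy functions, weights, features) refer to, which I assume match the slides up to sign conventions, are:

\[
\phi_i(D_i) = \exp\bigl(-\epsilon_i(D_i)\bigr) \;\Longrightarrow\; P(X_1,\dots,X_n) = \frac{1}{Z}\exp\Bigl(-\sum_i \epsilon_i(D_i)\Bigr) \quad (\epsilon_i = \text{energy functions}),
\]
\[
P(X_1,\dots,X_n) = \frac{1}{Z}\exp\Bigl(\sum_j w_j\, f_j(D_j)\Bigr) \quad (w_j = \text{weights},\; f_j = \text{features}).
\]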
Ising Model
• Binary xi's
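The Ising model equation is also missing; a standard form (a sketch of what the slide presumably shows), with binary xi ∈ {-1, +1}, is:

\[
P(x_1,\dots,x_n) = \frac{1}{Z}\exp\Bigl(\sum_{(i,j)\in E} w_{ij}\, x_i x_j \;+\; \sum_i u_i\, x_i\Bigr), \qquad x_i \in \{-1,+1\},
\]

a pairwise log-linear model whose features are the products x_i x_j on the edges of the graph.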
http://www.cis.upenn.edu/~jshi/GraphTutorial/
Recap
• Parameterizations for Markov networks
– Features
– Overparameterizations
– How many parameters are free?
– Canonical parameterization
Plan
• Proof of Hammersley-Clifford theorem (if there is interest)
• Justification for Markov networks using Maximum Entropy principle (later)
• Relating Bayesian and Markov networks
– Proof of soundness theorem for Bayesian networks
– Determining which Markov networks are P-maps for which Bayesian networks
Information Theory
• P(X) encodes our uncertainty about X
– Some variables are more uncertain than others
– How can we quantify this intuition?
• Entropy: average number of bits required to encode X
• Entropy is maximized when X is uniform
[Figure: two distributions P(X) and P(Y) over variables X and Y]
H_P(X) = E_P[log(1/P(x))] = Σ_x P(x) log(1/P(x))
From Carlos Guestrin’s 10-708 Probabilistic Graphical Models Fall 2008 at CMU
Maximum Entropy Principle
• Given everything else the same, pick a distribution with the maximum entropy
– Closest to uniform
• Example: ¾ of kangaroos are left-handed and ¾ drink Foster's
– Want to reconstruct the full probability table knowing only p11 + p12 = 0.75 and p11 + p21 = 0.75
– Have 3 free parameters and only 2 constraints, leaving 1 free parameter
p11  p12
p21  p22
MaxEnt Principle Continued
• Since we are not given that left-handedness is correlated with Foster's drunkenness, we ideally do not want to introduce a correlation into the model
• Which objective function to maximize?
• Entropy is (the only) such function
– Want to maximize H_P(X) subject to the constraints p11 + p12 = 0.75 and p11 + p21 = 0.75
Gull S.F., Skilling J. (1984), “The Maximum Entropy Method,” in Indirect Imaging
Direct Solution
Left-handedness is independent of Foster's drunkenness!
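The worked solution is not in the transcript; filling in the standard calculation under the slide's two constraints (using a as the single free parameter):

\[
p_{11} = a, \qquad p_{12} = p_{21} = 0.75 - a, \qquad p_{22} = a - 0.5, \qquad 0.5 \le a \le 0.75,
\]
\[
\frac{dH}{da} = 2\ln(0.75 - a) - \ln a - \ln(a - 0.5) = 0 \;\Longrightarrow\; (0.75-a)^2 = a(a - 0.5) \;\Longrightarrow\; a = \tfrac{9}{16},
\]
\[
p_{11} = \tfrac{9}{16},\quad p_{12} = p_{21} = \tfrac{3}{16},\quad p_{22} = \tfrac{1}{16}: \text{ exactly the product of the marginals.}
\]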
Round-about Solution
• Constraints = Lagrange multipliers
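The Lagrangian itself is missing from the transcript; a sketch of the standard construction for this example (notation mine):

\[
F = H_P + \lambda_1\bigl(p_{11}+p_{12}-0.75\bigr) + \lambda_2\bigl(p_{11}+p_{21}-0.75\bigr) + \mu\Bigl(\textstyle\sum_{ij} p_{ij}-1\Bigr),
\]
\[
\frac{\partial F}{\partial p_{ij}} = 0 \;\Longrightarrow\; p_{ij} \;\propto\; \exp\bigl(\lambda_1 f_1(i,j) + \lambda_2 f_2(i,j)\bigr),
\]

where f_1 and f_2 are the indicator features of the two constraints: a log-linear model, as the next slide notes.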
Round-about Solution
• How to find the weights?
– Plug in the log-linear model for P(x) and maximize F(x)
– Or, satisfy the constraints
Log-linear model!
MaxEnt in a More General Setting
• Given a set of constraints
– General solution to the MaxEnt formulation (see the sketch below)
• Log-linear model is an approximation to a distribution that preserves some properties (constraints) while making the distribution as close to uniform as possible
– Duality between constraints and weights
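A sketch of the general solution referenced above, assuming the constraints are expectation constraints E_P[f_i(X)] = c_i (the standard setting):

\[
\max_P H_P(X) \;\text{ subject to }\; E_P[f_i(X)] = c_i,\; i = 1,\dots,m \quad\Longrightarrow\quad P(x) = \frac{1}{Z(\lambda)}\exp\Bigl(\sum_{i=1}^{m}\lambda_i f_i(x)\Bigr),
\]

with one weight λ_i per constraint, which is the duality between constraints and weights mentioned above.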
Soundness of d-separation
For all P that factorizes according to G (G is a BN structure for P):
d-separation in G => conditional independence in P, i.e., G is an I-map for P
(from a local graph property to a global separation property)
Proof Outline
• Given evidence, convert the Bayesian network into an equivalent Markov network
– Construct such a network
– Show that it is an equivalent Markov network
• Use the separation property of the Markov network to prove the theorem
Constructing MNs from BNs
[Figure: Bayesian network G over A, B, C, D, E and its moralized graph H]
moralized graph: an I-map (in fact a minimal I-map) of G
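A sketch of the moralization step this slide illustrates (the graph encoding, function name, and example network are mine, not from the slides): connect ("marry") the parents of each node, then drop the edge directions.

```python
from itertools import combinations

def moralize(parents):
    """Moralize a Bayesian network given as {node: set of parents}.
    Returns the undirected moral graph as {node: set of neighbors}."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        # keep every original edge, but undirected
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
        # marry the parents: connect every pair of parents of the same child
        for u, v in combinations(ps, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

# Hypothetical BN: C, D, and E each have two parents, which get married
bn = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"A", "C"}, "E": {"C", "B"}}
moral = moralize(bn)
print(sorted(moral["A"]))  # A is now also connected to C's other parent B
```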
Constructing MNs from BNs with Evidence
[Figure: Bayesian network G over A, B, C, D, E with evidence, and its moralized graph H]
P-map for Moral Graphs
[Figure: moral Bayesian network G over A, B, C, D, E and its moralized graph H (a minimal I-map)]
Proof: pick an active (minimal) trail in G. Show it is in H.
Two cases:
– Trail has no v-structures: no marked nodes, so the same trail is in H
– Trail has v-structures: the v-structure is covered, so the trail is not minimal (contradiction)
Soundness for d-separation
• What if the graph is not moral?
– What if immoralities did not matter?
– They do matter if the effect or one of its descendants is in the evidence
• Only consider the subgraphs for which immoralities have a descendant in the evidence
– Upward closure of evidence nodes
Upward Closure and Its MN
[Figure: Bayesian network G over A, B, C, D, E; its upward closure G' over A, B, C, D (E is a barren node); and the moralized graph H of G']
Exercise 3.8: BN(G') agrees with BN(G) over the nodes of G'
Soundness of d-separation
• Consider X and Y d-separated by Z
• Build an upward closure for X∪Y∪Z
• d-separation is equivalent to separation in H
• Separation in H implies conditional independence
For all P that factorizes according to G (G is a BN structure for P): d-separation in G => conditional independence in P
[Figure: Bayesian network G over A, B, C, D, E and Markov network H over A, B, C, D]
From Markov Networks to Bayesian Networks
• As seen before, Markov networks cannot represent immoralities
• Can show that if a Bayesian network G is a minimal I-map for some Markov network structure H, it contains no immoralities
• No immoralities = every v-structure (three nodes) is covered
• Undirected cycle of length > 3 => v-structure
– Must have a chord
• All BN I-maps of Markov networks are chordal
– No BN P-map exists for a non-chordal MN
Markov Networks: Summary
• Mass/density = normalized product of factors
• Represent conditional independence with independence graphs
– Conditional independence = separation in the graph
– Global separation = local separation (Markov blanket) = pairwise separation, all in positive distributions
• Interpretation: closest to uniform under constraints specified by features
– Scope of features determines the structure of the graph (representation theorem)
• Relationship between Markov and Bayesian networks
– MNs cannot represent v-structures of BNs
– BNs cannot represent chordless loops of MNs
– Chordal graphs can be represented (as P-maps) by both