Log-linear MRFs: Ising, Boltzmann, Deep Belief, Metric
Sargur Srihari
Topics
• Log-linear MRF Applications
  – Ising Model
  – Boltzmann Distribution
  – Energy-Based Model
  – Boltzmann Machine
    • Restricted Boltzmann Machine
    • Deep Belief Networks
  – Metric MRF
General Log-linear Model with Features
• A distribution P is a log-linear model over H if

  P(X_1, \ldots, X_n) = \frac{1}{Z} \exp\left[ -\sum_{i=1}^{k} w_i f_i(D_i) \right]

  – Can have several features over the same scope
  – Each term is an energy function
  – Note that k is the number of features, not the number of subgraphs
• Equivalent to a Gibbs distribution

  P_\Phi(X_1, \ldots, X_n) = \frac{1}{Z} \tilde{P}(X_1, \ldots, X_n), \qquad \tilde{P}(X_1, \ldots, X_n) = \prod_{i=1}^{m} \phi_i(D_i)

  where \tilde{P} is an unnormalized measure and Z = \sum_{X_1, \ldots, X_n} \tilde{P}(X_1, \ldots, X_n)
• Rewrite each factor \phi(D) as \phi(D) = \exp(-\varepsilon(D)), where \varepsilon(D) = -\ln \phi(D) is the energy function
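To make the factor-energy correspondence concrete, the following is a minimal sketch (a hypothetical three-variable toy model, not from the slides) that stores each factor as an energy table ε(D) = −ln φ(D) and evaluates the Gibbs distribution by brute-force enumeration.

```python
import itertools
import numpy as np

# Toy log-linear MRF over three binary variables X1, X2, X3 (hypothetical).
# Each factor phi(D) is stored as an energy table eps(D) = -ln phi(D),
# so P(x) = (1/Z) exp(-sum_i eps_i(d_i)).
energies = {
    (0, 1): np.array([[0.0, 1.0], [1.0, 0.0]]),  # eps(X1, X2): prefers agreement
    (1, 2): np.array([[0.0, 2.0], [2.0, 0.0]]),  # eps(X2, X3): stronger agreement
}

def total_energy(x):
    """Sum of factor energies for a full assignment x."""
    return sum(tab[x[i], x[j]] for (i, j), tab in energies.items())

# Partition function Z by enumerating all 2^3 assignments.
assignments = list(itertools.product([0, 1], repeat=3))
Z = sum(np.exp(-total_energy(x)) for x in assignments)

for x in assignments:
    print(x, np.exp(-total_energy(x)) / Z)
```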
Ising Model
• The earliest Markov network model: a model of interacting atoms
  – Square lattice model
• Energy is determined by the magnetic spins of the atoms
  – An atom's spin is the sum of its electron spins
• An atom going from a higher to a lower energy state releases radio emission and changes its electron spin
Ising Model
• Each atom is associated with a binary random variable X_i ∈ {+1, −1} whose value is the direction of the atom's spin
• The energy function has the parametric form ε_ij(x_i, x_j) = −w_ij x_i x_j
• Symmetric in X_i, X_j; note the scope is pairwise
  – Contributes −w_ij to the energy when X_i = X_j (same spin) and +w_ij otherwise
Ising Model as a Markov Network
• Features are pairwise and singleton
  – Pairwise energy function between X_i, X_j has parametric form ε_ij(x_i, x_j) = −w_ij x_i x_j
    • Contributes −w_ij when X_i = X_j (same spin) and +w_ij otherwise
  – Node energies ε_i(x_i) = −u_i x_i bias the individual atom's spin
• Probability distribution over configurations, in terms of the energy function:

  P(\xi) = \frac{1}{Z} \exp(-E(\xi)), \qquad E(\xi) = -\sum_{i<j} w_{ij} x_i x_j - \sum_i u_i x_i

  where ξ ∈ Val(X) is a full assignment of the variables
• When w_ij > 0 the model prefers aligned spins: ferromagnetic
• w_ij < 0: antiferromagnetic
• w_ij = 0: non-interacting
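The following sketch samples from this distribution with a Metropolis chain on a small lattice; the uniform coupling w > 0, zero biases u_i, and the Metropolis sampler itself are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal Metropolis sampler for a ferromagnetic Ising model on an L x L
# lattice (a sketch; uniform coupling w and no node bias are assumptions).
L, w, steps = 16, 0.5, 50_000
x = rng.choice([-1, 1], size=(L, L))

def local_field(x, i, j):
    """Sum of the four lattice neighbors of site (i, j), with wraparound."""
    return (x[(i - 1) % L, j] + x[(i + 1) % L, j] +
            x[i, (j - 1) % L] + x[i, (j + 1) % L])

for _ in range(steps):
    i, j = rng.integers(L, size=2)
    # Flipping x[i, j] changes E = -w * sum_<ij> x_i x_j by:
    dE = 2 * w * x[i, j] * local_field(x, i, j)
    if dE <= 0 or rng.random() < np.exp(-dE):
        x[i, j] = -x[i, j]

print("mean magnetization:", x.mean())  # near +/-1 when w is large enough
```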
Ising Model Studies
• Used to answer a variety of questions
  – Usually in the limit as the number of atoms (variables) goes to infinity
• Inference problems, e.g.,
  – Determine the probability of configurations where the majority of spins are +1 (or −1) versus more mixed ones
    • The answer depends on the strength of the interactions w_ij
    • e.g., scale all weights by a temperature parameter
• Many other problems have been investigated extensively
  – Answers are known, some even analytically
Boltzmann Distribution
• A variant of the Ising model
• Variables X_i take values in {0, 1} instead of {+1, −1}
  – The energy function has the same parametric form: ε_ij(x_i, x_j) = −w_ij x_i x_j
  – Edge X_i–X_j makes a nonzero contribution −w_ij to the energy only when X_i = X_j = 1
    • In contrast, the Ising model contributes −w_ij whenever the variables are equal and +w_ij when they differ
• Mapping 0 to −1 shows it has the same form of energy function and distribution as the Ising model
Boltzmann Distribution & Statistical Mechanics
• Boltzmann probability distribution:

  P(\text{state}) \propto \exp(-E / kT)

  where E is the state energy (varies from state to state) and kT is a constant of the distribution
  – k = Boltzmann's constant, T = absolute temperature
• The ratio of the probabilities of two states depends only on their energy difference:

  P(\text{state}_1) / P(\text{state}_2) = \exp[(E_2 - E_1) / kT]

• Later investigated by Josiah Willard Gibbs; the Boltzmann distribution is also known as the Gibbs measure
• The related Maxwell–Boltzmann distribution of particle speeds is a chi distribution with 3 degrees of freedom
Boltzmann Distribution with Sigmoid
• The conditional probability P(X_i = 1 | assignment to neighbors X_j) is sigmoid(z), where

  z = \sum_j w_{ij} x_j + u_i

• z is a weighted combination of X_i's neighbors, weighted by the strength and direction of the connections
  – sigmoid(z) = 1 / (1 + exp(−z)) takes values in (0, 1)
• The simplest mathematical approximation of the function employed by a neuron in the brain
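A short sketch of Gibbs sampling driven by this sigmoid conditional; the random symmetric weights and biases are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gibbs sampling for a small Boltzmann machine over {0,1} variables
# (a sketch; the random symmetric W and biases u are assumptions).
n = 5
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                 # symmetric weights
np.fill_diagonal(W, 0.0)          # no self-connections
u = rng.normal(size=n)            # node biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.integers(0, 2, size=n).astype(float)
samples = []
for sweep in range(5000):
    for i in range(n):
        # P(X_i = 1 | neighbors) = sigmoid(sum_j w_ij x_j + u_i)
        p = sigmoid(W[i] @ x + u[i])
        x[i] = float(rng.random() < p)
    samples.append(x.copy())

print("empirical means:", np.mean(samples, axis=0))
```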
Boltzmann Distribution & Neuron
• The Boltzmann conditional distribution resembles a neuron
  – Compare the output node of an artificial neural network:

  y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)

• In the Boltzmann machine, a node's output is a stochastic function of its connected neighbors
• The Boltzmann distribution is a type of energy-based model
Energy-Based Model (EBM)
• A probability distribution that associates a scalar energy with each configuration of its variables
• Learning corresponds to modifying the energy function so that its shape has desirable properties
  – E.g., plausible configurations have low energy
• Energy-based probability distribution:

  p(x) = \frac{1}{Z} \exp(-E(x)), \qquad Z = \sum_x \exp(-E(x))

  – where Z is the partition function
Learning the Energy-Based Model
• To determine the parameters θ of p(x) = (1/Z) exp(−E(x)),
  perform stochastic gradient descent on the negative log-likelihood
• Log-likelihood over training data D: L(\theta) = \sum_{x \in D} \log p(x)
• Loss function: \ell(\theta) = -L(\theta)
• Gradient-descent update, where θ are the parameters and η the learning rate:

  \theta^{(\tau+1)} = \theta^{(\tau)} - \eta \nabla \ell
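For intuition, here is a sketch of this gradient descent on a toy fully visible EBM whose partition function is small enough to enumerate exactly; the linear energy E(x) = −θ·x and the synthetic data are assumptions, not the slides' model.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Exact NLL gradient for a tiny fully visible EBM with E_theta(x) = -theta.x
# over x in {0,1}^n (a toy; chosen so Z is computable by enumeration).
n, lr = 4, 0.5
theta = np.zeros(n)
data = rng.integers(0, 2, size=(100, n)).astype(float)  # fake training data

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

for step in range(200):
    # p(x) = exp(-E(x)) / Z, with E(x) = -theta . x
    unnorm = np.exp(states @ theta)
    p = unnorm / unnorm.sum()
    # grad NLL = E_data[dE/dtheta] - E_model[dE/dtheta] = -mean(x_data) + E_p[x]
    grad = -data.mean(axis=0) + p @ states
    theta -= lr * grad                                  # theta <- theta - lr * grad

unnorm = np.exp(states @ theta)
print("model marginals:", (unnorm / unnorm.sum()) @ states)
print("data  marginals:", data.mean(axis=0))
```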
EBMs with Hidden Units
• Want to include non-observed variables to increase the expressive power of the model
• Introduce the free energy F(x) = -\log \sum_h \exp(-E(x, h)), so that p(x) = \frac{1}{Z} \exp(-F(x)) with Z = \sum_x \exp(-F(x))
• Negative log-likelihood gradient on the data:

  -\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial F(\tilde{x})}{\partial \theta}

• Sampling version: approximate the second term with samples \tilde{x} drawn from P
• The first term increases the probability of training data; the second term decreases the probability of samples generated by the model
Boltzmann Machine
• A form of energy-based model
• Has the structure of a recurrent neural network (RNN): one with directed cycles
  – Unlike feed-forward neural networks, an RNN can use internal memory to process arbitrary sequences
  – Can process time-varying real-valued inputs
  – Has nodes which are inputs, hidden, and outputs
• Boltzmann machines are a type of RNN
Restricted Boltzmann Machine
• An RBM is a special case of both Boltzmann machines and Markov networks
• No visible–visible or hidden–hidden connections: the graph is bipartite
• Used to learn features for input to neural networks in deep learning
[Figure: a bipartite RBM graph, contrasted with a network containing intra-layer connections, labeled "Not an RBM"]
Energy Function of an RBM
• Energy function:

  E(v, h) = -b'v - c'h - h'Wv

  where W is the weight matrix connecting hidden and visible units, v = [v_0, v_1, ...], h = [h_0, h_1, ...], with offset vectors b, c
• Define the free energy as F(v) = -\log \sum_h \exp(-E(v, h))
• Due to the bipartite structure of the RBM, the conditionals factorize:

  p(h|v) = \prod_i p(h_i|v), \qquad p(v|h) = \prod_j p(v_j|h)
RBM with Binary Units
• Using v_j, h_i ∈ {0, 1}, the free energy simplifies to

  F(v) = -b'v - \sum_i \log\left(1 + e^{c_i + W_i v}\right)

• Update equations (the conditionals used for Gibbs sampling):

  P(h_i = 1 \mid v) = \text{sigmoid}(c_i + W_i v), \qquad P(v_j = 1 \mid h) = \text{sigmoid}\left(b_j + \sum_i h_i W_{ij}\right)
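These formulas translate directly into code. A minimal sketch, assuming random parameters and the shape convention W ∈ R^{n_h × n_v}:

```python
import numpy as np

# Free energy and conditionals for a binary RBM (a sketch; the random
# parameters are assumptions; W has shape (n_hidden, n_visible)).
rng = np.random.default_rng(3)
n_v, n_h = 6, 4
W = 0.1 * rng.normal(size=(n_h, n_v))
b = np.zeros(n_v)          # visible offsets
c = np.zeros(n_h)          # hidden offsets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(v):
    # F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v))
    return -v @ b - np.log1p(np.exp(c + W @ v)).sum()

def p_h_given_v(v):
    return sigmoid(c + W @ v)      # P(h_i = 1 | v)

def p_v_given_h(h):
    return sigmoid(b + W.T @ h)    # P(v_j = 1 | h)

v = rng.integers(0, 2, size=n_v).astype(float)
print("F(v) =", free_energy(v), " P(h|v) =", p_h_given_v(v))
```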
Training RBMs
• Contrastive divergence (CD)
  – A method to overcome the exponential complexity of dealing with the partition function
  – Replaces the model expectation in the gradient with samples from a short Gibbs chain started at the data
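A sketch of CD-1 for a binary RBM: the positive phase uses the data, the negative phase uses one Gibbs step, and their difference approximates the log-likelihood gradient. The data, sizes, and learning rate here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# CD-1 training sketch for a binary RBM (illustrative; sizes, learning rate,
# and fake data are assumptions).
n_v, n_h, lr = 6, 4, 0.1
W = 0.01 * rng.normal(size=(n_h, n_v))
b, c = np.zeros(n_v), np.zeros(n_h)
data = rng.integers(0, 2, size=(200, n_v)).astype(float)   # fake training set

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(50):
    for v0 in data:
        # positive phase: hidden probabilities given the data
        ph0 = sigmoid(c + W @ v0)
        h0 = (rng.random(n_h) < ph0).astype(float)
        # negative phase: one step of Gibbs sampling (the "reconstruction")
        pv1 = sigmoid(b + W.T @ h0)
        v1 = (rng.random(n_v) < pv1).astype(float)
        ph1 = sigmoid(c + W @ v1)
        # approximate gradient of the log-likelihood
        W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
        b += lr * (v0 - v1)
        c += lr * (ph0 - ph1)
```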
Deep Belief Networks (DBNs)
• Consist of several layers of RBMs
  – Formed by stacking RBMs
• The resulting deep network is fine-tuned using gradient descent and back-propagation
• DBNs are generative models
  – Provide estimates of both p(x|C_k) and p(C_k|x)
  – Conventional neural networks are discriminative
    • Directly estimate p(C_k|x)
Deep Belief Network Framework
[Figure: the layered DBN architecture built from stacked RBMs]
Training DBNs
• Let X be a matrix of input feature vectors (a code sketch of the procedure follows this list)
1. Train an RBM on X to obtain its weight matrix W
   – Used between the lower two layers (input and hidden)
2. Transform X by the RBM to produce new data X′
   – Either by sampling or by computing the mean activation of the hidden units
3. Repeat the procedure with X ← X′ for the next pair of layers
   – Until the top two layers of the network (output and hidden) are reached
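A sketch of the greedy layer-wise procedure above, reusing CD-1 for each layer; the layer sizes and synthetic input X are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hidden, lr=0.1, epochs=20):
    """CD-1 training of one RBM layer; returns (W, b, c). A sketch."""
    n_vis = X.shape[1]
    W = 0.01 * rng.normal(size=(n_hidden, n_vis))
    b, c = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in X:
            ph0 = sigmoid(c + W @ v0)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(b + W.T @ h0)
            v1 = (rng.random(n_vis) < pv1).astype(float)
            ph1 = sigmoid(c + W @ v1)
            W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
            b += lr * (v0 - v1)
            c += lr * (ph0 - ph1)
    return W, b, c

# Greedy layer-wise pretraining: train an RBM, transform X by the mean
# hidden activation, and repeat for the next layer pair.
X = rng.integers(0, 2, size=(100, 8)).astype(float)   # fake input features
layers = []
for n_hidden in [6, 4]:                               # layer sizes are assumptions
    W, b, c = train_rbm(X, n_hidden)
    layers.append((W, b, c))
    X = sigmoid(c + X @ W.T)                          # X <- X' (mean activations)
```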
Metric MRF for Labeling
• Task:
  – Given a graph with nodes X_1,...,X_n and edges E
  – Assign to each X_i a label in V = {v_1,...,v_k}
    • E.g., labeling superpixels in an image
  – Each node, in isolation, has a preferred label
    • E.g., color specifies a label
  – However, we want a smoothness constraint over neighbors
    • Neighboring nodes should have "similar" values
Importance of Modeling Correlations Between Superpixels
[Figure, panels (a)–(d): (a) original image; (b) oversegmented image, where each superpixel is a random variable; (c) classification using node potentials alone, each superpixel classified independently; (d) segmentation using a pairwise Markov network encoding interactions between adjacent superpixels. Labels shown: car, road, building, cow, grass.]
Solution for Labeling
• Solution:
  – Encode node preferences as node potentials
  – Encode smoothness preferences as edge potentials
  – Encode the model in negative log-space, using energy functions
• Energy function:

  E(x_1, \ldots, x_n) = \sum_i \varepsilon_i(x_i) + \sum_{(i,j) \in E} \varepsilon_{ij}(x_i, x_j)

  – For the MAP objective, the partition function can be ignored
  – Goal: minimize the energy (MAP objective): \arg\min_{x_1, \ldots, x_n} E(x_1, \ldots, x_n)
  – How to define smoothness? Next.
Smoothness for Metric MRF
• Many variants
• The simplest is a variant of the Ising model (a minimization sketch follows):

  \varepsilon_{i,j}(x_i, x_j) = \begin{cases} 0 & x_i = x_j \\ \lambda_{i,j} & x_i \neq x_j \end{cases} \qquad \text{for } \lambda_{i,j} \geq 0

• In this model:
  – Lowest pairwise energy (0) when neighbors have the same value
  – Higher energy (λ_{i,j}) otherwise
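One simple (locally optimal) way to minimize the resulting labeling energy is iterated conditional modes (ICM), sketched below; ICM, the 4-connected grid, and the random node energies are assumptions for illustration, not the slides' algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)

# Iterated conditional modes (ICM) for the labeling energy
# E(x) = sum_i eps_i(x_i) + sum_{(i,j) in E} eps_ij(x_i, x_j), with the
# Ising-variant smoothness: eps_ij = 0 if labels agree, else lam.
rows, cols, K, lam = 10, 10, 3, 0.8
node_energy = rng.normal(size=(rows, cols, K))   # eps_i(x_i) per site and label
labels = node_energy.argmin(axis=2)              # init: node potentials alone

def neighbors(i, j):
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < rows and 0 <= nj < cols:
            yield ni, nj

for sweep in range(10):
    for i in range(rows):
        for j in range(cols):
            # local energy of each candidate label at node (i, j)
            cost = node_energy[i, j].copy()
            for ni, nj in neighbors(i, j):
                cost += lam * (np.arange(K) != labels[ni, nj])
            labels[i, j] = cost.argmin()          # greedy coordinate update

print(labels)
```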
Generalizations of Smoothness for Metric MRF
1. Potts model (when there are more than two labels)
2. Distance function on labels
   – Prefer neighboring nodes to have labels a smaller distance apart
   – This yields the metric MRF
     • Requires a metric µ(v_k, v_l) on the labels
Metric Requirement
• A function µ: V × V → [0, ∞) satisfying
  – reflexivity, symmetry, and the triangle inequality
• Called a semi-metric if the triangle inequality is violated
• Metric MRF
  – Define ε_{i,j}(v_k, v_l) = µ(v_k, v_l), where µ is a metric (or semi-metric)
  – Assume the same µ for all pairs of variables
    » Simplifies the number of parameters needed
    » Usually holds in practice
• Example metric, a truncated p-norm: ε(x_i, x_j) = min(c·||x_i − x_j||_p, dist_max)
• Metric interactions arise frequently and play an important role in computer vision
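A sketch of the truncated p-norm energy; the constants c, p, and dist_max are assumptions. Truncation caps the penalty so large label discontinuities (e.g., at object boundaries) are not over-penalized.

```python
import numpy as np

# Truncated p-norm pairwise energy: eps(x_i, x_j) = min(c * ||x_i - x_j||_p, dist_max)
# (a sketch; the values of c, p, and dist_max are assumptions).
def truncated_pnorm(xi, xj, c=1.0, p=2, dist_max=5.0):
    d = np.atleast_1d(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))
    return min(c * np.linalg.norm(d, ord=p), dist_max)

# Small differences are penalized proportionally; large ones are capped.
print(truncated_pnorm(0.0, 1.5), truncated_pnorm(0.0, 100.0))
```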