
Introduction to Machine Learning

Undirected Graphical Models

Barnabás Póczos

Credits

Many of these slides are taken from Ruslan Salakhutdinov, Hugo Larochelle, and Eric Xing:

• http://www.dmi.usherb.ca/~larocheh/neural_networks

• http://www.cs.cmu.edu/~rsalakhu/10707/

• http://www.cs.cmu.edu/~epxing/Class/10708/

Reading material:

• http://www.cs.cmu.edu/~rsalakhu/papers/Russ_thesis.pdf

• Section 30.1 of Information Theory, Inference, and Learning Algorithms by David MacKay

• http://www.stat.cmu.edu/~larry/=sml/GraphicalModels.pdf


Probabilistic graphical models are a powerful framework for representing the dependency structure between random variables.

A Markov network (or undirected graphical model) is a set of random variables whose dependency structure is described by an undirected graph.

Undirected Graphical Models = Markov Random Fields

Example application: semantic labeling of images (figure omitted).

Cliques

Clique: a subset of nodes such that there exists a link between every pair of nodes in the subset.

Maximal clique: a clique such that it is not possible to include any other node in the set without it ceasing to be a clique.

The example graph on the slide (not reproduced here) has two maximal cliques, along with several smaller, non-maximal cliques.
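As a quick illustration (not from the slides), the sketch below enumerates the maximal cliques of a small, made-up graph, assuming the networkx library is available.

# A minimal sketch, assuming networkx; the example graph is hypothetical,
# not the one drawn on the slide.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3),   # nodes 1, 2, 3 are fully connected
                  (3, 4), (4, 5), (3, 5)])  # nodes 3, 4, 5 are fully connected

print(list(nx.find_cliques(G)))  # maximal cliques, e.g. [[1, 2, 3], [3, 4, 5]]
# Every edge (and every single node) is also a clique, just not a maximal one.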

Undirected Graphical Models = Markov Random Fields

Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are useful for expressing dependencies between random variables.

The joint distribution defined by the graph is given by the normalized product of non-negative potential functions over the maximal cliques (fully connected subsets of nodes):

p(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C), \qquad Z = \sum_{x} \prod_{C} \psi_C(x_C).

In the example on the slide (graph not reproduced here), the joint distribution factorizes as a product of one potential per maximal clique, divided by Z.
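To make the factorization concrete, here is a small numerical sketch (not from the slides): a three-node chain x1 - x2 - x3 of binary variables with one pairwise potential per maximal clique, and the partition function computed by brute-force enumeration. The potential values are arbitrary.

import itertools
import numpy as np

# Pairwise potentials psi(a, b) for the two maximal cliques {x1, x2} and {x2, x3}.
psi_12 = np.array([[4.0, 1.0],
                   [1.0, 4.0]])   # favours x1 == x2
psi_23 = np.array([[3.0, 1.0],
                   [1.0, 3.0]])   # favours x2 == x3

def unnormalized(x1, x2, x3):
    return psi_12[x1, x2] * psi_23[x2, x3]

# Partition function: sum of the unnormalized product over all 2^3 configurations.
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=3))
p = {x: unnormalized(*x) / Z for x in itertools.product([0, 1], repeat=3)}
print(Z, p[(0, 0, 0)], p[(0, 1, 0)])  # agreeing configurations are more probable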

Markov Random Fields (MRFs)

Each potential function is a mapping from the joint configurations of the random variables in a maximal clique to the non-negative real numbers.

The choice of potential functions is not restricted to those having a specific probabilistic interpretation. It is convenient to express them as exponentials,

\psi_C(x_C) = \exp(-E_C(x_C)), \qquad p(x) = \frac{1}{Z} \exp\Big(-\sum_C E_C(x_C)\Big) = \frac{1}{Z} \exp(-E(x)),

where E(x) is called an energy function.

Conditional Independence


Theorem (global Markov property): if every path between two sets of nodes A and B passes through a separating set C, then x_A \perp x_B \mid x_C.

It follows that the undirected graphical structure represents conditional independence: two variables are conditionally independent given all the remaining variables whenever there is no edge between them.

MRFs with Hidden Variables


For many interesting problems, we need to introduce hidden or latent variables. Our random variables will then contain both visible and hidden variables, x = (v, h), and the quantity of interest is the marginal

p(v) = \sum_h p(v, h) = \frac{1}{Z} \sum_h \prod_C \psi_C(v_C, h_C).

This makes computation hard:

• Computing the partition function Z is intractable.

• Computing the summation over the hidden variables is intractable.

• Parameter learning is very challenging.

Boltzmann Machines

Definition [Boltzmann machine]: MRFs with maximum clique size two (i.e., pairwise / edge potentials) on binary-valued nodes are called Boltzmann machines.

The joint probabilities are given by

p(x) = \frac{1}{Z} \exp\Big(\sum_{i<j} \theta_{ij} x_i x_j + \sum_i \theta_{i0} x_i\Big).

The parameter \theta_{ij} measures the dependence of x_i on x_j, conditioned on the other nodes.

Boltzmann Machines

Theorem: One can prove that in a Boltzmann machine (with x_i \in \{0, 1\}) the conditional distribution of one node conditioned on the others is given by the logistic function:

p(x_i = 1 \mid x_{\setminus i}) = \sigma\Big(\sum_{j \neq i} \theta_{ij} x_j + \theta_{i0}\Big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.

Proof (sketch): writing the conditional as a ratio of joint probabilities,

p(x_i = 1 \mid x_{\setminus i}) = \frac{p(x_i = 1, x_{\setminus i})}{p(x_i = 0, x_{\setminus i}) + p(x_i = 1, x_{\setminus i})},

the partition function and all terms in the exponent not involving x_i cancel, leaving

p(x_i = 1 \mid x_{\setminus i}) = \frac{\exp\big(\sum_{j \neq i} \theta_{ij} x_j + \theta_{i0}\big)}{1 + \exp\big(\sum_{j \neq i} \theta_{ij} x_j + \theta_{i0}\big)} = \sigma\Big(\sum_{j \neq i} \theta_{ij} x_j + \theta_{i0}\Big).

Q.E.D.
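The theorem can also be checked numerically. The sketch below (not from the slides) builds a small Boltzmann machine with x_i in {0, 1} and random parameters, computes the exact conditional by enumeration, and compares it with the logistic expression; all parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n = 4
theta = rng.normal(size=(n, n))
theta = (theta + theta.T) / 2          # symmetric pairwise weights theta_ij
np.fill_diagonal(theta, 0.0)
theta0 = rng.normal(size=n)            # bias terms theta_i0

def unnorm(x):
    x = np.asarray(x)
    # exp( sum_{i<j} theta_ij x_i x_j + sum_i theta_i0 x_i ); Z cancels in ratios
    return np.exp(0.5 * x @ theta @ x + theta0 @ x)

rest = (1, 0, 1)                       # fixed values of x_1, x_2, x_3
exact = unnorm((1,) + rest) / (unnorm((0,) + rest) + unnorm((1,) + rest))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
pred = sigmoid(theta[0, 1:] @ np.array(rest) + theta0[0])
print(exact, pred)                     # the two numbers agree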

Example: Image Denoising

Let us look at the example of noise removal from a binary image. The image is an array of {-1, +1} pixel values.

We take the original noise-free image x and randomly flip the sign of each pixel with a small probability. This process creates the noisy image y.

Our goal is to estimate the original image x from the noisy observations y.

We model the joint distribution with an MRF that couples neighboring pixels of x and each pixel x_i to its observation y_i; a standard choice is the Ising-type energy

E(x, y) = -\beta \sum_{\{i, j\}} x_i x_j - \eta \sum_i x_i y_i, \qquad p(x, y) = \frac{1}{Z} \exp(-E(x, y)),

where the first sum runs over pairs of neighboring pixels and \beta, \eta > 0.

Inference: Iterated Conditional Modes (ICM)

Goal: find the image x that maximizes p(x \mid y), i.e. minimizes the energy E(x, y) over x for the given y.

Solution: coordinate-wise descent on the energy: sweep through the pixels and set each x_i to the value in {-1, +1} that gives the lower energy while all other pixels are held fixed; repeat until no pixel changes.
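A minimal ICM sketch for this denoising model (not from the slides), assuming the Ising-type energy written above with a 4-neighborhood; the coupling strengths beta and eta and the number of sweeps are hypothetical choices.

import numpy as np

def icm_denoise(y, beta=2.0, eta=1.5, n_sweeps=10):
    """y: noisy image with entries in {-1, +1}; returns the ICM estimate of x."""
    x = y.copy()                       # initialize with the noisy observation
    H, W = x.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                s = 0.0                # sum over the 4-neighborhood of (i, j)
                if i > 0:     s += x[i - 1, j]
                if i < H - 1: s += x[i + 1, j]
                if j > 0:     s += x[i, j - 1]
                if j < W - 1: s += x[i, j + 1]
                # choosing x_ij = sign(beta * s + eta * y_ij) minimizes the
                # energy terms that involve pixel (i, j)
                x[i, j] = 1 if beta * s + eta * y[i, j] >= 0 else -1
    return x

# usage: x_hat = icm_denoise(noisy_image), where noisy_image is a {-1, +1} array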

Gaussian MRFs

• The information (precision) matrix J = \Sigma^{-1} is sparse: J_{ij} = 0 whenever there is no edge between x_i and x_j. The covariance matrix \Sigma, however, is in general dense.
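A small numerical illustration (not from the slides): for a Gaussian MRF on the chain x1 - x2 - x3 - x4, the precision matrix is tridiagonal, but its inverse, the covariance matrix, has no zero entries. The numbers are arbitrary.

import numpy as np

J = np.array([[ 2.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  2.0]])  # sparse information (precision) matrix
Sigma = np.linalg.inv(J)                  # dense covariance matrix
print(np.round(Sigma, 2))                 # every entry is non-zero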

Restricted Boltzmann Machines


Restricted = no connections within the hidden layer and no connections within the visible layer; the only edges run between visible and hidden units.

The joint distribution over the visible units v and the hidden units h is

p(v, h) = \frac{1}{Z} \exp(-E(v, h)), \qquad E(v, h) = -v^\top W h - b^\top v - c^\top h,

where W is the matrix of visible-hidden weights, b and c are the visible and hidden biases, and Z is the partition function (intractable to compute).

Gaussian-Bernoulli RBM

For real-valued visible units, the energy is modified so that it is quadratic in v and linear in h (Gaussian visible units, Bernoulli hidden units):

E(v, h) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2} - c^\top h - \sum_{i, j} \frac{v_i}{\sigma_i} W_{ij} h_j.
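A small numerical sketch of the binary RBM energy and joint distribution defined above (not from the slides); the layer sizes and random parameters are hypothetical, and the exact partition function is only computable because the model is tiny.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = 0.1 * rng.normal(size=(n_v, n_h))  # visible-hidden weights
b = 0.1 * rng.normal(size=n_v)         # visible biases
c = 0.1 * rng.normal(size=n_h)         # hidden biases

def energy(v, h):
    return -(v @ W @ h + b @ v + c @ h)

# brute-force partition function: sum over all 2^(n_v + n_h) joint states
Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in itertools.product([0, 1], repeat=n_v)
        for h in itertools.product([0, 1], repeat=n_h))

v = np.array([1, 0, 1])
h = np.array([1, 0])
print(np.exp(-energy(v, h)) / Z)       # joint probability p(v, h)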

Possible Tasks with an RBM

Tasks:

• Inference: compute the conditional distributions p(h \mid v) and p(v \mid h).

• Evaluate the likelihood function p(v).

• Sampling from the RBM: draw samples from p(v, h).

• Training the RBM: learn the parameters W, b, c from data.

Inference

Theorem: Inference in an RBM is simple: the conditional distributions are products of logistic functions,

p(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i W_{ij} v_i\Big), \qquad p(h \mid v) = \prod_j p(h_j \mid v).

Similarly,

p(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j W_{ij} h_j\Big), \qquad p(v \mid h) = \prod_i p(v_i \mid h).

Proof (sketch): for a fixed v, the energy E(v, h) is a sum of terms each involving a single hidden unit, so

p(h \mid v) \propto \exp(-E(v, h)) \propto \prod_j \exp\Big(h_j \big(c_j + \sum_i W_{ij} v_i\big)\Big),

which factorizes over the hidden units; normalizing each binary factor gives the logistic function above. The argument for p(v \mid h) is symmetric. Q.E.D.
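A minimal numpy sketch of these conditionals (not from the slides); W, b, c denote the RBM parameters in the notation used above, and the example values are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_v(v, W, c):
    """Vector of probabilities p(h_j = 1 | v) for all hidden units."""
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):
    """Vector of probabilities p(v_i = 1 | h) for all visible units."""
    return sigmoid(b + W @ h)

# usage with hypothetical parameters
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(3, 2))
b, c = np.zeros(3), np.zeros(2)
print(p_h_given_v(np.array([1.0, 0.0, 1.0]), W, c))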

Evaluating the Likelihood

Calculating the Likelihood of an RBM

Theorem: Calculating the likelihood is simple in an RBM (apart from the partition function):

p(v) = \sum_h p(v, h) = \frac{1}{Z} \exp(-F(v)), \qquad F(v) = -b^\top v - \sum_j \log\Big(1 + \exp\big(c_j + \sum_i W_{ij} v_i\big)\Big),

where F(v) is called the free energy.

Proof (sketch): summing the joint over the hidden units,

\sum_h e^{-E(v, h)} = e^{b^\top v} \prod_j \sum_{h_j \in \{0, 1\}} e^{h_j (c_j + \sum_i W_{ij} v_i)} = e^{b^\top v} \prod_j \Big(1 + e^{c_j + \sum_i W_{ij} v_i}\Big),

and taking the negative logarithm gives the free energy above. Q.E.D.
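The free energy can be computed directly from the parameters; a small numpy sketch (not from the slides), using np.logaddexp(0, z) = log(1 + exp(z)) for numerical stability. Parameter values are hypothetical.

import numpy as np

def free_energy(v, W, b, c):
    """F(v) = -b^T v - sum_j log(1 + exp(c_j + sum_i W_ij v_i))."""
    return -v @ b - np.sum(np.logaddexp(0.0, c + v @ W))

# log p(v) = -F(v) - log Z: Z is intractable, but free-energy differences
# between two visible vectors do not involve Z at all.
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(3, 2))
b, c = np.zeros(3), np.zeros(2)
print(free_energy(np.array([1.0, 0.0, 1.0]), W, b, c))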

Sampling


Sampling from p(v, h) in an RBM

Sampling is tricky… it is much easier in directed graphical models. Here we will use Gibbs sampling.

Goal: generate samples from the joint distribution p(v, h) (and hence from p(v)), using the conditional distributions p(h \mid v) and p(v \mid h) derived above.

Gibbs Sampling: The Problem

Our goal is to generate samples from a joint distribution p(x_1, \dots, x_d). Suppose that we can generate samples from each full conditional distribution p(x_i \mid x_{\setminus i}).

Gibbs Sampling: Pseudo Code

Initialize x^{(0)} = (x_1^{(0)}, \dots, x_d^{(0)}). Then, for t = 1, \dots, T and for i = 1, \dots, d, sample

x_i^{(t)} \sim p\big(x_i \mid x_1^{(t)}, \dots, x_{i-1}^{(t)}, x_{i+1}^{(t-1)}, \dots, x_d^{(t-1)}\big).

After a burn-in period, the iterates x^{(t)} are (dependent) samples from the joint distribution. In an RBM, all hidden units can be sampled at once given the visible units, and vice versa (block Gibbs sampling).
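In an RBM the two conditionals derived earlier give an especially convenient block Gibbs sampler, alternating h ~ p(h | v) and v ~ p(v | h). A minimal sketch (not from the slides), with hypothetical sizes and parameters:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sample(W, b, c, n_steps=1000, rng=None):
    """Run block Gibbs sampling in an RBM and return the final (v, h) state."""
    rng = rng if rng is not None else np.random.default_rng()
    n_v, n_h = W.shape
    v = (rng.random(n_v) < 0.5).astype(float)   # random initial visible state
    for _ in range(n_steps):
        h = (rng.random(n_h) < sigmoid(c + v @ W)).astype(float)  # h ~ p(h | v)
        v = (rng.random(n_v) < sigmoid(b + W @ h)).astype(float)  # v ~ p(v | h)
    return v, h

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(6, 3))
v, h = gibbs_sample(W, b=np.zeros(6), c=np.zeros(3), rng=rng)
print(v, h)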

Training


RBM Training

Training is complicated…

To train an RBM, we would like to minimize the negative log-likelihood of the training data:

\ell(W, b, c) = \frac{1}{T} \sum_{t=1}^{T} -\log p(v^{(t)}).

To solve this, we use stochastic gradient descent.

Theorem: for a single training example v^{(t)}, the gradient decomposes as

\frac{\partial \big(-\log p(v^{(t)})\big)}{\partial \theta} = \mathbb{E}_{h \sim p(h \mid v^{(t)})}\Big[\frac{\partial E(v^{(t)}, h)}{\partial \theta}\Big] - \mathbb{E}_{(v, h) \sim p(v, h)}\Big[\frac{\partial E(v, h)}{\partial \theta}\Big].

The first term is the positive phase; the second is the negative phase, which is hard to compute.

RBM Training

Proof (sketch): write the negative log-likelihood of one training example as

-\log p(v^{(t)}) = \log Z - \log \sum_h e^{-E(v^{(t)}, h)},

and differentiate the two terms separately.

First term:

\frac{\partial \log Z}{\partial \theta} = -\mathbb{E}_{(v, h) \sim p(v, h)}\Big[\frac{\partial E(v, h)}{\partial \theta}\Big].

It is difficult to calculate this expectation over the model distribution; this is the negative phase.

Second term:

\frac{\partial}{\partial \theta}\Big(-\log \sum_h e^{-E(v^{(t)}, h)}\Big) = \mathbb{E}_{h \sim p(h \mid v^{(t)})}\Big[\frac{\partial E(v^{(t)}, h)}{\partial \theta}\Big].

This expectation is easy to compute, because the conditionals p(h_j \mid v^{(t)}) are independent logistic distributions; this is the positive phase. Q.E.D.

RBM Training

Since E(v, h) = -v^\top W h - b^\top v - c^\top h, the derivatives we need are

\frac{\partial E(v, h)}{\partial W_{ij}} = -v_i h_j, \qquad \frac{\partial E(v, h)}{\partial b_i} = -v_i, \qquad \frac{\partial E(v, h)}{\partial c_j} = -h_j.

For the positive phase, the required expectations are available in closed form, because each hidden unit is an independent logistic unit given v:

\mathbb{E}_{h \sim p(h \mid v)}[h_j] = p(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i W_{ij} v_i\Big).

The second term (the negative phase) is more tricky: the expectation over the model distribution p(v, h) cannot be computed exactly. We approximate it with a single sample (\tilde v, \tilde h), obtained by running a short Gibbs chain started at the training example. This gives the Contrastive Divergence (CD-k) algorithm.

CD-k (Contrastive Divergence) Pseudocode
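The pseudocode itself is not reproduced in this transcript; below is a minimal CD-k sketch in numpy under the conventions used above (binary units, parameters W, b, c). The learning rate, k, layer sizes, and data are all hypothetical.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(v0, W, b, c, k=1, lr=0.01, rng=None):
    """One CD-k update of (W, b, c), in place, from a single training vector v0."""
    rng = rng if rng is not None else np.random.default_rng()
    ph0 = sigmoid(c + v0 @ W)          # positive phase: E[h | v0] in closed form
    v = v0.copy()                      # negative phase: k steps of block Gibbs
    for _ in range(k):
        h = (rng.random(c.shape) < sigmoid(c + v @ W)).astype(float)
        v = (rng.random(b.shape) < sigmoid(b + W @ h)).astype(float)
    phk = sigmoid(c + v @ W)
    # approximate log-likelihood gradient: positive phase minus negative phase
    W += lr * (np.outer(v0, ph0) - np.outer(v, phk))
    b += lr * (v0 - v)
    c += lr * (ph0 - phk)

# usage: one pass over a hypothetical binary data matrix X (one example per row)
rng = np.random.default_rng(0)
X = (rng.random((100, 6)) < 0.5).astype(float)
W = 0.01 * rng.normal(size=(6, 3))
b, c = np.zeros(6), np.zeros(3)
for v0 in X:
    cd_k_update(v0, W, b, c, k=1, lr=0.05, rng=rng)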

Results

http://deeplearning.net/tutorial/rbm.html

RBM Training Results

(Figures: original images and the learned filters; samples generated by the RBM after training. Each row represents a mini-batch of negative particles, i.e. samples from independent Gibbs chains; 1000 steps of Gibbs sampling were taken between each of those rows.)

Summary

Tasks:

• Inference: the conditionals p(h \mid v) and p(v \mid h) are products of logistic functions.

• Evaluating the likelihood function: easy up to the partition function, via the free energy F(v).

• Sampling from the RBM: block Gibbs sampling, alternating h \sim p(h \mid v) and v \sim p(v \mid h).

• Training the RBM: stochastic gradient descent, with the negative phase approximated by contrastive divergence (CD-k).

Thanks for your Attention!

RBM Training Results

(Additional result figures omitted.)

Gaussian-Bernoulli RBM Training Results

Each document (story) is represented with a bag of words coming from a multinomial distribution whose parameters are given by the hidden units (h = topics). After training, we can generate words from these topics.
