pattern recognition and machine learning : graphical...

38
C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I

Upload: dotuyen

Post on 31-Mar-2018

220 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

C. M. Bishop

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I

Page 2: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

BCS Summer School, Exeter, 2003 Christopher M. Bishop

Probabilistic Graphical Models

• Graphical representation of a probabilistic model

• Each variable corresponds to a node in the graph

• Links in the graph denote probabilistic relations between

variables

Page 3: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Why do we need graphical models?

• Graphs are an intuitive way of representing and visualising the

relationships between many variables. (Examples: family trees,

electric circuit diagrams, neural networks)

• A graph allows us to abstract out the conditional independence

relationships between the variables from the details of their

parametric forms. Thus we can ask questions like: “Is A dependent

on B given that we know the value of C ?” just by looking at the

graph.

• Graphical models allow us to define general message-passing

algorithms that implement Bayesian inference efficiently. Thus we

can answer queries like “What is P(A|C = c)?” without enumerating

all settings of all variables in the model.

Zoubin Ghahramani

Page 4: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Three kinds of Graphical Models

image by Zoubin Ghahramani

or Bayesian Networks

useful to express causal

relationships between

variables

or Markov random fields

useful to express soft

constraints between

variables

convenient for

solving inference

problems

Page 5: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Bayesian Networks

Directed Acyclic Graph (DAG)

Note: the left-hand side is symmetrical w/r to the variables whereas the right-

hand side is not.

Page 6: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Bayesian Networks

Generalization to K variables:

The associated graph is fully connected.

The absence of links conveys important information.

Page 7: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Bayesian Networks

General Factorization

Page 8: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Some Notations(1)

Plate

N

n

ntppp1

)w|()w()w,t(

More compact

representation

Page 9: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Some Notations(2)

Condition on data

Shaded to indicate that the r.v. is set to its observed value

Page 10: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Generative Models

Causal process for generating images

Page 11: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

PPCA

nx

nz

N

W

2

Page 12: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Discrete Variables

Denote the probability of observing both and

by

General joint distribution: K 2 { 1 parameters

K states K states

11 kx

12 lx 1 , klk lkl

)x()x|x()x,x( 11221 ppp

K -1 parameters

K -1 parameters

K2-1 parameters

Page 13: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Discrete Variables

Independent joint distribution: 2(K { 1) parameters

Page 14: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Discrete Variables

General joint distribution over M variables: KM { 1 parameters

M -node Markov chain: K { 1 + (M { 1) K(K { 1) parameters

Reduce the number of parameters by dropping link in the graph, at the

expense of having a restricted class of distributions.

)x( 1pMipp iii ,...,2),x()x|x( 11

Page 15: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Conditional Independence

a is independent of b given c

Equivalently

Notation

Page 16: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Conditional Independence: Example 1

marginalize w.r.t c

as it doesn‟t factorize into p(a)p(b) in

general

tail-to-tail

Page 17: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Conditional Independence: Example 1

tail-to-tail

Page 18: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Conditional Independence: Example 2

marginalize w.r.t c

head-to-tail

Page 19: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Conditional Independence: Example 2

head-to-tail

Page 20: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Conditional Independence: Example 3

Note: this is the opposite of Example 1 and 2, with c unobserved.

marginalize w.r.t c

head-to-head

Page 21: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Conditional Independence: Example 3

Note: this is the opposite of Example 1 and 2, with c observed.

head-to-head

Page 22: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

“Am I out of fuel?”

B = Battery (0=flat, 1=fully charged) F = Fuel Tank (0=empty, 1=full) G = Fuel Gauge Reading (0=empty, 1=full)

and hence

Page 23: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

“Am I out of fuel?”

Probability of an empty tank increased by observing G = 0.

Page 24: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

“Am I out of fuel?”

Probability of an empty tank reduced by observing B = 0. This referred to as “explaining away”.

the state of the fuel tank and that of the

battery have become dependent on each

other as a result of observing the reading

on the fuel gauge.

Page 25: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

D-separation • A, B, and C are non-intersecting subsets of nodes in a

directed graph. • A path from A to B is blocked if it contains a node such that

either a) the arrows on the path meet either head-to-tail or tail-

to-tail at the node, and the node is in the set C, or b) the arrows meet head-to-head at the node, and

neither the node, nor any of its descendants, are in the set C.

• If all paths from A to B are blocked, A is said to be d-separated from B by C.

• If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies .

Page 26: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

D-separation: Example

Page 27: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

D-separation: I.I.D. Data

Page 28: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

The Markov Blanket

Markov Blanket (remaining factors)

Parents and children of xi

Co-parents: parents of children of xi

due to „explaining away‟ phenomenon

Page 29: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Factorization Properties

Page 30: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Cliques and Maximal Cliques

Clique

Maximal Clique

Thus, if {x1, x2, x3} is a maximal clique and we define an arbitrary function over this clique, then including another factor defined over a subset of these variables would be redundant.

Page 31: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Notations: C – max. clique, - the set of variables in that clique.

Joint Distribution

where is the potential over clique C , which is a non-negative function

which measures “compatibility” between settings of the variables.

is the partition function (used for normalization).

note: M K-state variables KM terms in Z.

Page 32: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Hammersley and Clifford Theorem

• UI: the set of distributions that are consistent with the set of conditional independence statements read from the graph using graph separation.

• UF: the set of distributions that can be expressed as a factorization described with respect to the maximal cliques.

• The Hammersley-Clifford theorem states that the sets UI and UF are identical if is strictly positive.

• In such case

where is called an energy function, and the exponential representation is called the Boltzmann distribution.

Page 33: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Illustration: Image De-Noising using MRF

Original Image Noisy Image

Noisy image is obtained by randomly flipping the sign of pixels in the

Noise free image with some small probability.

The Goal: Given Noisy image recover Noise Free Image

Page 34: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Markov Model

strong correlation between xi and yi.

neighbouring pixels xi and xj in an image are

strongly correlated.

Page 35: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Markov Model

biasing the model towards pixel

values that have one particular

sign in preference to the other.

Page 36: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

The joint distribution IC

M -

ite

rate

d c

onditio

nal m

odes

having a high probability

Page 37: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Illustration: Image De-Noising (3)

Noisy Image Restored Image (ICM)

Page 38: Pattern Recognition and Machine Learning : Graphical …rita/uml_course/lectures/prml8_update_p1.pdf · c. m. bishop pattern recognition and machine learning chapter 8: graphical

Illustration: Image De-Noising (4)

Restored Image (Graph cuts) Restored Image (ICM)