CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING Continuous Probability Distributions and Bayesian Networks with Continuous Variables


TRANSCRIPT

Page 1:

CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Continuous Probability Distributions and Bayesian Networks with Continuous Variables

Page 2:

AGENDA

Continuous probability distributions
Common families:
The Gaussian distribution
Linear Gaussian Bayesian networks

Page 3:

CONTINUOUS PROBABILITY DISTRIBUTIONS

Let X be a random variable in R, and let P(X) be a probability distribution over X: P(x) ≥ 0 for all x, and it "sums to 1" (integrates to 1).

Challenge: (most of the time) P(X=x) = 0 for any individual x.

Page 4:

CDF AND PDF

Probability density function (pdf) f(x): nonnegative, with ∫ f(x) dx = 1.

Cumulative distribution function (cdf) g(x):
g(x) = P(X ≤ x)
g(-∞) = 0, g(∞) = 1, g(x) = ∫_(-∞,x] f(y) dy, monotonic
f(x) = g'(x)

[Figure: a pdf f(x) and the corresponding cdf g(x), which rises monotonically from 0 to 1.]

Both cdfs and pdfs are complete representations of the probability space over X, but usually pdfs are more intuitive to work with.

Page 5:

CAVEATS

pdfs may exceed 1.

Deterministic values, or ones taking on a few discrete values, can be represented in terms of the Dirac delta function δ_a(x) as a pdf (an improper function):
δ_a(x) = 0 if x ≠ a
δ_a(x) = ∞ if x = a
∫ δ_a(x) dx = 1

Page 6:

COMMON DISTRIBUTIONS

Uniform distribution U(a,b): p(x) = 1/(b-a) if x ∈ [a,b], 0 otherwise.
P(X ≤ x) = 0 if x < a, (x-a)/(b-a) if x ∈ [a,b], 1 otherwise.

Gaussian (normal) distribution N(μ,σ): μ = mean, σ = standard deviation.
P(X ≤ x) has no closed form; for N(0,1) it is (1 + erf(x/√2))/2.

[Figure: pdfs of two uniform distributions, U(a1,b1) and U(a2,b2).]
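As a quick numerical check of the erf expression for the standard normal cdf (my own sketch, not from the slides; the helper names are mine), the closed form (1 + erf(x/√2))/2 can be compared against a crude numerical integral of the pdf:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density N(mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x):
    """Standard normal cdf via the error function: (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare against a Riemann-sum integration of the pdf from -10 to x.
x, dx = 1.3, 1e-4
approx = sum(normal_pdf(-10.0 + i * dx) * dx for i in range(int((x + 10.0) / dx)))
print(normal_cdf(x), approx)   # both close to 0.9032
```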

Page 7:

MULTIVARIATE CONTINUOUS DISTRIBUTIONS

Consider the c.d.f. g(x,y) = P(X ≤ x, Y ≤ y):
g(-∞,y) = 0, g(x,-∞) = 0
g(∞,∞) = 1
g(x,∞) = P(X ≤ x), g(∞,y) = P(Y ≤ y)
g monotonic

Its joint density is given by the p.d.f. f(x,y) iff g(p,q) = ∫_(-∞,p] ∫_(-∞,q] f(x,y) dy dx,

i.e. P(a_x ≤ X ≤ b_x, a_y ≤ Y ≤ b_y) = ∫_[a_x,b_x] ∫_[a_y,b_y] f(x,y) dy dx.

Page 8:

MARGINALIZATION WORKS OVER PDFS

Marginalizing f(x,y) over y: if h(x) = ∫_(-∞,∞) f(x,y) dy, then h(x) is a p.d.f. for X.

Proof:
P(X ≤ a) = P(X ≤ a, Y ≤ ∞) = g(a,∞) = ∫_(-∞,a] ∫_(-∞,∞) f(x,y) dy dx
h(a) = d/da P(X ≤ a)
     = d/da ∫_(-∞,a] ∫_(-∞,∞) f(x,y) dy dx  (definition)
     = ∫_(-∞,∞) f(a,y) dy  (fundamental theorem of calculus)

So, the joint density contains all information needed to reconstruct the density of each individual variable

Page 9:

CONDITIONAL DENSITIES

We might want to represent the density P(Y | X=x)… but how?
Naively, P(a ≤ Y ≤ b | X=x) = P(a ≤ Y ≤ b, X=x) / P(X=x), but the denominator is 0!

Instead, for a joint pdf p(x,y), consider taking the limit of P(a ≤ Y ≤ b | x ≤ X ≤ x+ε) as ε → 0.

So p(x,y)/p(x) is the conditional density, where p(x) = ∫ p(x,y) dy is the marginal.
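Spelling out the limit step (my own filling-in of the algebra the slide leaves implicit, using the marginal p(x) = ∫ p(x,y) dy):

```latex
P(a \le Y \le b \mid x \le X \le x+\epsilon)
  = \frac{\int_a^b \int_x^{x+\epsilon} p(x',y)\,dx'\,dy}{\int_x^{x+\epsilon} p(x')\,dx'}
  \;\xrightarrow[\epsilon \to 0]{}\;
  \frac{\int_a^b \epsilon\, p(x,y)\,dy}{\epsilon\, p(x)}
  = \int_a^b \frac{p(x,y)}{p(x)}\,dy .
```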

Page 10:

TRANSFORMATIONS OF CONTINUOUS RANDOM VARIABLES

Suppose we want to compute the distribution of f(X), where X is a random variable distributed w.r.t. p_X(x). Assume f is monotonic (increasing) and invertible.

Consider the random variable Y = f(X):
P(Y ≤ y) = ∫ I[f(x) ≤ y] p_X(x) dx = P(X ≤ f^-1(y))
p_Y(y) = d/dy P(Y ≤ y)
       = d/dy P(X ≤ f^-1(y))
       = p_X(f^-1(y)) · d/dy f^-1(y)   (by the chain rule)
       = p_X(f^-1(y)) / f'(f^-1(y))    (by the inverse function derivative)
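A quick worked example of this change-of-variables rule (mine, not from the slides): take X ~ N(0,1) and Y = e^X, so f(x) = e^x, f^-1(y) = ln y, f'(x) = e^x:

```latex
p_Y(y) = \frac{p_X(\ln y)}{e^{\ln y}}
       = \frac{1}{y\sqrt{2\pi}}\exp\!\left(-\tfrac{1}{2}(\ln y)^2\right),
       \qquad y > 0,
```

which is the standard log-normal density.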

Page 11:

NOTES:

In general, continuous multivariate distributions are hard to handle exactly

But there are specific classes that lead to efficient exact inference techniques, in particular Gaussians.

Other distributions usually require resorting to Monte Carlo approaches

Page 12:

MULTIVARIATE GAUSSIANS

Multivariate analog in N-dimensional space: mean (vector) μ, covariance (matrix) Σ, with a normalization factor.

X ~ N(μ,Σ)
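The density itself appears only as an image in the original transcript; for reference, the standard N-dimensional Gaussian density (presumably what the slide showed) is:

```latex
p(x) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}
       \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right).
```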

Page 13:

INDEPENDENCE IN GAUSSIANS

If X ~ N(μ_X,Σ_X) and Y ~ N(μ_Y,Σ_Y) are independent, then their joint distribution is Gaussian with block-diagonal covariance (see below).

Moreover, if X ~ N(μ,Σ), then Σ_ij = 0 iff X_i and X_j are independent.
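The joint in the independent case (again filling in a formula that is only an image in the original): stacking X and Y gives

```latex
\begin{pmatrix} X \\ Y \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix},
\begin{pmatrix} \Sigma_X & 0 \\ 0 & \Sigma_Y \end{pmatrix}
\right).
```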

Page 14:

LINEAR TRANSFORMATIONS

Linear transformations of Gaussians: if X ~ N(μ,Σ) and Y = AX + b, then Y ~ N(Aμ+b, AΣA^T). In fact, X and Y are jointly Gaussian.

Consequence: if X ~ N(μ_X,Σ_X) and Y ~ N(μ_Y,Σ_Y) are independent and Z = X + Y, then Z ~ N(μ_X+μ_Y, Σ_X+Σ_Y).
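A small empirical check of the transformation rule (a numpy sketch of my own; the specific μ, Σ, A, b values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

# Draw samples of X ~ N(mu, Sigma) and transform them: Y = A X + b.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b

# The empirical mean/covariance of Y should match A mu + b and A Sigma A^T.
print(Y.mean(axis=0), A @ mu + b)
print(np.cov(Y.T), A @ Sigma @ A.T)
```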

Page 15:

MARGINALIZATION AND CONDITIONING

If (X,Y) ~ N([μ_X; μ_Y], [Σ_XX, Σ_XY; Σ_YX, Σ_YY]), then:

Marginalization: summing out Y gives
X ~ N(μ_X, Σ_XX)

Conditioning: on observing Y = y, we have
X | Y=y ~ N(μ_X + Σ_XY Σ_YY^-1 (y - μ_Y), Σ_XX - Σ_XY Σ_YY^-1 Σ_YX)
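A minimal numpy sketch of the conditioning formula (my own helper, not from the slides); it returns the conditional mean and covariance of X given Y = y:

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Condition a joint Gaussian over (X, Y) on the observation Y = y.

    Returns the mean and covariance of X | Y = y:
        mean = mu_x + S_xy S_yy^-1 (y - mu_y)
        cov  = S_xx - S_xy S_yy^-1 S_yx
    """
    K = S_xy @ np.linalg.inv(S_yy)   # "gain" matrix S_xy S_yy^-1
    mean = mu_x + K @ (y - mu_y)
    cov = S_xx - K @ S_xy.T          # S_yx = S_xy^T by symmetry
    return mean, cov

# Example: a 2-D joint Gaussian over scalar X and scalar Y.
mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx = np.array([[1.0]]); S_xy = np.array([[0.8]]); S_yy = np.array([[2.0]])
print(condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y=np.array([2.0])))
# mean = 0 + (0.8/2)*(2 - 1) = 0.4,  cov = 1 - 0.8*0.8/2 = 0.68
```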

Page 16:

LINEAR GAUSSIAN MODELS

A conditional linear Gaussian (CLG) model has: P(Y | X=x) = N(β₀ + Ax, Σ₀), with parameters β₀, A, and Σ₀.

Page 17:

LINEAR GAUSSIAN MODELS

A conditional linear Gaussian (CLG) model has: P(Y | X=x) = N(β₀ + Ax, Σ₀), with parameters β₀, A, and Σ₀.

If X ~ N(μ_X,Σ_X), then the joint distribution over (X,Y) is given below.

(Recall the linear transformation rule: if X ~ N(μ,Σ) and Y = AX + b, then Y ~ N(Aμ+b, AΣA^T).)
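The joint itself is only an image in the original; applying the transformation rule to the stacked vector (X, Y) with Y = AX + β₀ + noise of covariance Σ₀ gives (my reconstruction):

```latex
\begin{pmatrix} X \\ Y \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_X \\ A\mu_X + \beta_0 \end{pmatrix},
\begin{pmatrix} \Sigma_X & \Sigma_X A^{\top} \\ A\Sigma_X & A\Sigma_X A^{\top} + \Sigma_0 \end{pmatrix}
\right).
```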

Page 18:

CLG BAYESIAN NETWORKS

If all variables in a Bayesian network have Gaussian or CLG CPDs, inference can be done efficiently!

[Example network: X1 and X2 are parents of Y; X1 and Y are parents of Z.]
P(X1) = N(μ1,σ1), P(X2) = N(μ2,σ2)
P(Y | x1,x2) = N(a·x1 + b·x2, σ_Y)
P(Z | x1,y) = N(c + d·x1 + e·y, σ_Z)

Page 19:

CANONICAL REPRESENTATION

All factors in a CLG Bayes net can be represented as canonical forms C(x; K, h, g), with
C(x; K, h, g) = exp(-1/2 x^T K x + h^T x + g)

Ex: if P(Y | x) = N(β₀ + Ax, Σ₀), then
P(y | x) = 1/Z exp(-1/2 (y - Ax - β₀)^T Σ₀^-1 (y - Ax - β₀))
         = 1/Z exp(-1/2 (y,x)^T [I -A]^T Σ₀^-1 [I -A] (y,x) + β₀^T Σ₀^-1 [I -A] (y,x) - 1/2 β₀^T Σ₀^-1 β₀)

This is of the form C((y,x); K, h, g) with
K = [I -A]^T Σ₀^-1 [I -A]
h = [I -A]^T Σ₀^-1 β₀
g = log(1/Z) - 1/2 β₀^T Σ₀^-1 β₀
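A numpy sketch of this conversion (hypothetical helper name; it just evaluates the K, h, g expressions above, with Z the usual Gaussian normalizer):

```python
import numpy as np

def linear_gaussian_to_canonical(A, beta0, Sigma0):
    """Convert P(Y | x) = N(beta0 + A x, Sigma0) into canonical form C((y, x); K, h, g)."""
    m, n = A.shape                       # y is m-dimensional, x is n-dimensional
    Sinv = np.linalg.inv(Sigma0)
    M = np.hstack([np.eye(m), -A])       # [I  -A], so that M @ (y, x) = y - A x
    K = M.T @ Sinv @ M
    h = M.T @ Sinv @ beta0
    logZ = 0.5 * (m * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma0)))
    g = -logZ - 0.5 * beta0 @ Sinv @ beta0
    return K, h, g

# Tiny example: scalar x and y with P(Y|x) = N(0.5 + 2x, 1).
K, h, g = linear_gaussian_to_canonical(np.array([[2.0]]), np.array([0.5]), np.array([[1.0]]))
print(K, h, g)
```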

Page 20:

PRODUCT OPERATIONS

C(x;K1,h1,g1) · C(x;K2,h2,g2) = C(x;K,h,g) with
K = K1 + K2
h = h1 + h2
g = g1 + g2

If the scopes of the two factors are not equivalent, just extend the K’s with 0 rows and columns, and h’s with 0 rows so that each row/column matches
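For factors that already share the same scope and variable ordering, the product is just element-wise addition of the parameters (a minimal sketch, assuming numpy arrays):

```python
def canonical_product(K1, h1, g1, K2, h2, g2):
    """Multiply two canonical factors defined over the same variable ordering.

    Factors over different scopes should first be zero-padded so that their
    K matrices and h vectors line up over the union of their scopes.
    """
    return K1 + K2, h1 + h2, g1 + g2
```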

Page 21:

SUM OPERATION

∫ C((x,y); K, h, g) dy = C(x; K', h', g') with
K' = K_XX - K_XY K_YY^-1 K_YX
h' = h_X - K_XY K_YY^-1 h_Y
g' = g + 1/2 (log|2π K_YY^-1| + h_Y^T K_YY^-1 h_Y)

Using these two operations we can implement inference algorithms developed for discrete Bayes nets: top-down inference and variable elimination (exact), and belief propagation (approximate).
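A numpy sketch of the sum (marginalization) operation, using my own indexing convention that the first nx entries of the vector are X and the rest are Y:

```python
import numpy as np

def canonical_marginalize(K, h, g, nx):
    """Integrate Y out of C((x, y); K, h, g), where x has nx components."""
    Kxx, Kxy = K[:nx, :nx], K[:nx, nx:]
    Kyx, Kyy = K[nx:, :nx], K[nx:, nx:]
    hx, hy = h[:nx], h[nx:]
    Kyy_inv = np.linalg.inv(Kyy)
    K_new = Kxx - Kxy @ Kyy_inv @ Kyx
    h_new = hx - Kxy @ Kyy_inv @ hy
    g_new = g + 0.5 * (np.log(np.linalg.det(2 * np.pi * Kyy_inv)) + hy @ Kyy_inv @ hy)
    return K_new, h_new, g_new
```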

Page 22:

MONTE CARLO WITH GAUSSIANS

Assume sampling X ~ N(0,1) is given as a primitive RandN().
To sample X ~ N(μ,σ²), simply set x ← μ + σ·RandN().

How do we generate a random multivariate Gaussian variable ~ N(μ,Σ)?

Page 23:

MONTE CARLO WITH GAUSSIANS

Assume sampling X ~ N(0,1) is given as a primitive RandN().
To sample X ~ N(μ,σ²), simply set x ← μ + σ·RandN().

How do we generate a random multivariate Gaussian variable ~ N(μ,Σ)?
Take the Cholesky decomposition Σ^-1 = L L^T; L is invertible if Σ is positive definite.
Let y = L^T (x - μ). Then P(y) ∝ exp(-1/2 (y1² + … + yN²)) is isotropic, and each y_i is independent.
Sample each component of y at random with RandN(), then set x ← L^-T y + μ.
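A runnable numpy sketch of this recipe, using the Cholesky factor of Σ^-1 exactly as described (in practice one often factors Σ itself instead; the example μ and Σ are mine):

```python
import numpy as np

def sample_multivariate_gaussian(mu, Sigma, rng):
    """Draw one sample from N(mu, Sigma) via the Cholesky factor of Sigma^-1."""
    L = np.linalg.cholesky(np.linalg.inv(Sigma))   # Sigma^-1 = L L^T
    y = rng.standard_normal(len(mu))               # y ~ N(0, I): independent components
    return np.linalg.solve(L.T, y) + mu            # x = L^-T y + mu

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])
samples = np.array([sample_multivariate_gaussian(mu, Sigma, rng) for _ in range(50_000)])
print(samples.mean(axis=0))   # close to mu
print(np.cov(samples.T))      # close to Sigma
```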

Page 24:

MONTE CARLO WITH LIKELIHOOD WEIGHTING

Monte Carlo with rejection sampling has probability 0 of hitting a continuous value given as evidence, so likelihood weighting must be used.

[Example network: X → Y, with Y = y observed.]
P(X) = N(μ_X,Σ_X)
P(Y | x) = N(Ax + μ_Y, Σ_Y)

Step 1: Sample x ~ N(μ_X,Σ_X)
Step 2: Weight the sample by P(y | x)
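A minimal likelihood-weighting sketch for this two-node network (scalar case; the parameter values are mine), estimating E[X | Y = y]:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_x, sigma_x = 0.0, 1.0          # prior  P(X)   = N(mu_x, sigma_x^2)
a, mu_y, sigma_y = 2.0, 0.0, 0.5  # CPD    P(Y|x) = N(a*x + mu_y, sigma_y^2)
y_obs = 1.0                       # evidence Y = y_obs

def gauss_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Step 1: sample X from its prior.  Step 2: weight each sample by P(y_obs | x).
xs = rng.normal(mu_x, sigma_x, size=100_000)
ws = gauss_pdf(y_obs, a * xs + mu_y, sigma_y)

print(np.sum(ws * xs) / np.sum(ws))   # weighted estimate of E[X | Y = y_obs]
```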

Page 25:

HYBRID NETWORKS

Hybrid networks combine both discrete and continuous variables

Exact inference techniques are hard to apply: they result in Gaussian mixtures, and inference is NP-hard even in polytree networks.

Monte Carlo techniques apply in a straightforward way.

Belief approximation can be applied (e.g., collapsing Gaussian mixtures to single Gaussians)

Page 26:

ISSUES

Non-Gaussian distributions
Nonlinear dependencies
More in future lectures on particle filtering