
Page 1: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Pattern Recognition and Machine Learning - Chapter 2: Probability Distributions (Part 2) + Graphs

Affiliation: Kyoto University
Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin
Date: Nov 04, 2011

Page 2: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Terminologies

For understanding distributions


Page 3: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Terminologies

• Schur complement: relationship between the blocks of a partitioned matrix and the corresponding blocks of its inverse.

• Completing the square: converting a quadratic of the form ax^2+bx+c to a(...)^2+const, either to equate quadratic components with a standard Gaussian and solve for unknowns, or to solve the quadratic.

• Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) expressed as a mean, i.e. E[N(x)] = M(x), where N(x) is a noisy observation of M(x). A minimal sketch follows below.
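As an illustration only (not from the slides), here is a minimal Robbins-Monro sketch in Python; the toy target M(x) = x - 2 and the step sizes a_n = 1/n are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_observation(x):
    """N(x): a noisy measurement of the regression function M(x).
    Here M(x) = x - 2 is an assumed toy example with root x* = 2."""
    return (x - 2.0) + rng.normal(scale=0.1)

# Robbins-Monro iteration: x_{n+1} = x_n - a_n * N(x_n),
# with step sizes a_n -> 0, sum a_n = inf, sum a_n^2 < inf.
x = 0.0
for n in range(1, 5001):
    a_n = 1.0 / n
    x = x - a_n * noisy_observation(x)

print(f"estimated root: {x:.3f}")  # approaches 2.0
```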

Page 4: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Terminologies (cont.)

• Robbins-Monro step-size conditions [Stochastic approximation, Wikipedia, 2011]
  – The step sizes a_N must satisfy lim a_N = 0, Σ_N a_N = ∞, and Σ_N a_N^2 < ∞.

• Trace Tr(W) is the sum of the diagonal elements of W.

• Degrees of freedom: the dimension of a subspace. Here it refers to a hyperparameter.

Page 5: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Distributions

Gaussian distributions and their motivation

Page 6: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Conditional Gaussian Distribution

• Derivation of the conditional mean and variance:
  – noting the Schur complement.

• Linear Gaussian model: observations are a weighted sum of underlying latent variables. The conditional mean is linear in Xb, and the conditional variance is independent of Xb.

Assume y = Xa, x = Xb.
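For reference, the standard result from PRML Section 2.3.1 that this slide summarizes, written for the partition x = (x_a, x_b) with mean blocks (μ_a, μ_b) and covariance blocks Σ_aa, Σ_ab, Σ_ba, Σ_bb:

\[
p(x_a \mid x_b) = \mathcal{N}\!\left(x_a \mid \mu_{a|b}, \Sigma_{a|b}\right), \qquad
\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b), \qquad
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba},
\]

where \(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\) is the Schur complement of \(\Sigma_{bb}\).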

Page 7: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Marginal Gaussian Distribution

• The goal is again to identify the mean and variance by 'completing the square'.

• Solve the marginalization integral while noting the Schur complement, then compare components.
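The corresponding standard marginal result (PRML Section 2.3.2), in the same block notation as above:

\[
p(x_a) = \int p(x_a, x_b)\, \mathrm{d}x_b = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa}).
\]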

Page 8: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Gaussian distr. (quick view)

• Consider a multivariate Gaussian where
  – thus
  – according to Bayes' equation

• The conditional Gaussian must have a form whose exponent is the difference of the exponents of p(x,y) and p(x).
  – i.e. it becomes

Page 9: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Gaussian distr.

• Starting from

• Mean and variance for the joint Gaussian distribution p(x,y)

• Mean and variance for p(x|y)

p(x) can be seen as the prior, p(y|x) as the likelihood, and p(x|y) as the posterior.
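For completeness, the standard linear-Gaussian results from PRML Section 2.3.3 that these slides excerpt, for a marginal p(x) and conditional p(y|x) of the form:

\[
p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad
p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}),
\]
\[
p(y) = \mathcal{N}\!\left(y \mid A\mu + b,\; L^{-1} + A\Lambda^{-1}A^{\mathsf{T}}\right), \qquad
p(x \mid y) = \mathcal{N}\!\left(x \mid \Sigma\{A^{\mathsf{T}}L(y - b) + \Lambda\mu\},\; \Sigma\right),
\]
\[
\text{where } \Sigma = (\Lambda + A^{\mathsf{T}} L A)^{-1}.
\]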

Page 10: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Gaussian distr., sequential estimation

• Estimate the mean from (N-1)+1 observations, i.e. update the estimate based on the first N-1 points when the N-th observation arrives.

• The Robbins-Monro algorithm has the same form as this update, and can be used to find the maximum likelihood mean.
  – solve for the ML mean as a root-finding problem via Robbins-Monro.
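The sequential update referred to above is the standard PRML result:

\[
\mu_{\mathrm{ML}}^{(N)} = \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{\mathrm{ML}}^{(N-1)}\right),
\]

which matches the Robbins-Monro form with step size \(a_{N-1} = 1/N\).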

Page 11: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Univariate Gaussian distr.

• The conjugate prior for the precision (inverse variance) of a univariate Gaussian is the gamma distribution.

• The conjugate prior for the mean and precision of a univariate Gaussian jointly is the Gaussian-gamma distribution.
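For reference, the gamma prior over the precision λ has the standard form (PRML Section 2.3.6):

\[
\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a}\, \lambda^{a-1} \exp(-b\lambda).
\]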

Page 12: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Multivariate Gaussian distr.

• The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian is the Wishart distribution.

• The conjugate prior for the mean and precision matrix of a multivariate Gaussian jointly is the Gaussian-Wishart distribution.

Page 13: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Distributions

Variations of the Gaussian distribution

Page 14: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Student’s t-distr

• Used in analysis of variance to test whether an effect is real and statistically significant, using the t-distribution with n-1 degrees of freedom.

• If the X_i are normal random variables, then
  \[
  t(n) = \frac{\bar{X}(n) - \mu}{S(n)/\sqrt{n}}
  \]
  follows a t-distribution with n-1 degrees of freedom, where \(\bar{X}(n)\) is the sample mean and \(S(n)^2\) the sample variance of n observations.
  – The t-distribution has a lower peak and longer tails than the Gaussian (it allows more outliers and is therefore more robust).

• Obtained by integrating over an infinite number of univariate Gaussians with the same mean but different precisions.
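The last bullet corresponds to the standard scale-mixture representation (PRML Section 2.3.7):

\[
\mathrm{St}(x \mid \mu, \lambda, \nu) = \int_0^{\infty} \mathcal{N}\!\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gam}\!\left(\eta \mid \tfrac{\nu}{2}, \tfrac{\nu}{2}\right) \mathrm{d}\eta.
\]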

Page 15: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Student’s t-distr (cont.)

• For a multivariate Gaussian, the corresponding multivariate t-distribution is obtained in the same way.
  – with the Mahalanobis distance appearing in the exponent.

• Mean E[x] = μ (for ν > 1); covariance cov[x] = ν/(ν-2) Λ^{-1} (for ν > 2).

Page 16: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Gaussian with periodic variables

• To avoid the mean being dependent on the choice of origin, use polar coordinates.
  – Solve for theta.

• The von Mises distribution is a special case of the von Mises-Fisher distribution for the N-dimensional sphere: it is the stationary distribution of a drift process on the circle.

Page 17: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Gaussian with periodic variables (cont.)

• From a Gaussian in Cartesian coordinates to polar coordinates
  – becomes
  – the von Mises distribution
    • mean
    • precision (concentration)
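The resulting density, in its standard form (PRML Section 2.3.8), with mean direction θ₀ and concentration m:

\[
p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \theta_0) \},
\]

where \(I_0(m)\) is the zeroth-order modified Bessel function of the first kind.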

Page 18: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Gaussian with periodic variables: mean and variance

• Solving the log likelihood gives
  – the mean
  – the precision (concentration) m

• by noting
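For reference, the standard maximum-likelihood solutions (PRML Section 2.3.8) that these bullets refer to:

\[
\theta_0^{\mathrm{ML}} = \tan^{-1}\!\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}, \qquad
A(m_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} \cos\!\left(\theta_n - \theta_0^{\mathrm{ML}}\right), \quad
\text{where } A(m) = \frac{I_1(m)}{I_0(m)}.
\]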

Page 19: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Mixture of Gaussians

• In part 1 we already saw one limitation of the Gaussian: it is unimodal.
  – Solution: a linear combination (superposition) of Gaussians.

• The mixing coefficients sum to 1.

• The posterior here is known as the ‘responsibilities’.
  – Log likelihood: see below.
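The standard mixture density and log likelihood (PRML Section 2.3.9), with mixing coefficients π_k:

\[
p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K}\pi_k = 1,\ \ 0 \le \pi_k \le 1,
\]
\[
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\},
\]

and the responsibilities are the posteriors \(\gamma_k(x) = p(k \mid x) = \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \big/ \sum_j \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)\).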

Page 20: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Exponential family

• Natural form (see below)
  – normalized by

• 1) Bernoulli
  – becomes

• 2) Multinomial
  – becomes
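The natural (canonical) form and the Bernoulli case, as given in PRML Section 2.4:

\[
p(x \mid \eta) = h(x)\, g(\eta)\, \exp\{\eta^{\mathsf{T}} u(x)\},
\qquad
g(\eta)\int h(x)\exp\{\eta^{\mathsf{T}}u(x)\}\,\mathrm{d}x = 1.
\]

For the Bernoulli distribution \(p(x\mid\mu)=\mu^x(1-\mu)^{1-x}\):

\[
\eta = \ln\frac{\mu}{1-\mu}, \qquad \mu = \sigma(\eta) = \frac{1}{1+\exp(-\eta)}, \qquad u(x) = x,\ \ h(x) = 1,\ \ g(\eta) = \sigma(-\eta).
\]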

Page 21: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Exponential family (cont.)

• 3) Univariate Gaussian
  – becomes

• Solve for the natural parameter
  – becomes
  – from maximum likelihood
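The general maximum-likelihood condition for the exponential family (PRML Section 2.4.1), which the last bullet refers to:

\[
-\nabla \ln g(\eta_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} u(x_n),
\]

i.e. the ML estimate depends on the data only through the sufficient statistic \(\sum_n u(x_n)\).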

Page 22: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Parameters of Distributions

And interesting methodologies


Page 23: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Uninformative priors

• Avoid incorrect subjective assumptions ("subjective Bayesian" choices) by using an uninformative (e.g. uniform) prior.
  – Improper prior: the prior need not integrate to 1 for the posterior to integrate to 1, as per Bayes' equation.

• 1) Location parameter: translation invariance leads to a constant prior, p(μ) ∝ const.

• 2) Scale parameter: scale invariance leads to p(σ) ∝ 1/σ.

Page 24: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Nonparametric methods

• Instead of assuming a form for the distribution, use nonparametric methods.

• 1) Histogram with constant bin width
  – Good for sequential data.
  – Problems: discontinuities at bin edges, and the number of bins grows exponentially with dimensionality.

• 2) Kernel estimators: a sum of Parzen windows.
  – If K of the N observations fall in a region R of volume V, the density estimate becomes p(x) ≈ K/(NV).

Page 25: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Nonparametric method: Kernel estimators

• 2) Kernel estimators: fix V, determine K.
  – A kernel function counts the points falling in R.
  – h > 0 is a fixed bandwidth parameter controlling smoothing.
  – This is the Parzen estimator; the kernel k(u) can be chosen freely (e.g. Gaussian). A minimal sketch follows below.
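A minimal Parzen-window sketch in Python with a Gaussian kernel (not from the slides; the toy data and the bandwidth h = 0.3 are assumptions chosen for illustration):

```python
import numpy as np

def parzen_density(x_query, data, h):
    """Kernel density estimate for 1-D data with a Gaussian kernel:
    p(x) = (1/N) * sum_n (1/(h*sqrt(2*pi))) * exp(-(x - x_n)^2 / (2*h^2))."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # Gaussian kernel evaluated at (x_query - x_n) / h for every data point
    u = (x_query - data) / h
    kernel_vals = np.exp(-0.5 * u**2) / (h * np.sqrt(2.0 * np.pi))
    return kernel_vals.sum() / n

# Toy usage: samples from a bimodal mixture, density queried at a few points
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.5, 0.8, 200)])
for x in (-2.0, 0.0, 1.5):
    print(f"p({x:+.1f}) ~ {parzen_density(x, samples, h=0.3):.3f}")
```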

Page 26: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Nonparametric method: Nearest-neighbor

• 3) Nearest neighbor: this time use the data to grow V around the query point until it contains K points. Prior: p(C_k) = N_k/N.
  – As with the kernel estimator, the training set is stored as the knowledge base.
  – 'k' is the number of neighbors; a larger 'k' gives a smoother, less complex decision boundary with fewer regions.
  – For classifying, with N_k of the N points in class C_k, Bayes' theorem gives the posterior p(C_k | x) = K_k/K (where K_k of the K neighbors are in class C_k), which is then maximized.

Page 27: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Nonparametric method: Nearest-neighbor (cont.)

• 3) Nearest neighbor: assign a new point to the class C_k receiving the majority vote among its k nearest neighbors.
  – For k = 1 and N -> ∞, the error rate is bounded by twice the Bayes error rate. A minimal sketch follows below.

[k-nearest neighbor algorithm, Wikipedia, 2011]
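A minimal k-nearest-neighbor classifier sketch in Python (not from the slides; the toy data and k = 3 are assumptions chosen for illustration):

```python
import numpy as np

def knn_classify(x_query, X_train, y_train, k=3):
    """Assign x_query to the class receiving the majority vote
    among its k nearest training points (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Squared Euclidean distance from the query to every training point
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # majority vote

# Toy usage: two 2-D Gaussian blobs as classes 0 and 1
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([3.5, 3.8]), X, y, k=3))  # expected: 1
```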

Page 28: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Ch.2 Basic Graph Concepts

From David Barber’s book


Page 29: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Directed and undirected graphs


Page 30: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Representations of Graphs

• Singly connected (tree): there is only one path from A to B.

• Spanning tree of an undirected graph: a singly connected subgraph that covers all vertices.

• Graph representations (numerical):
  • Edge list: e.g. a list of vertex pairs (i, j).
  • Adjacency matrix A: for N vertices, an N x N matrix with A_ij = 1 if there is an edge from i to j. For an undirected graph the matrix is symmetric (see the sketch below).
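A small Python sketch building an adjacency matrix from an edge list; the example graph (vertices 1..4 with edges (1,2), (1,3), (2,3), (2,4), (3,4)) is an assumption, chosen so that it contains the two cliques {1,2,3} and {2,3,4} used on the next slide:

```python
import numpy as np

# Edge list for an undirected graph on vertices 1..4 (assumed example)
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
n_vertices = 4

# Build the N x N adjacency matrix; symmetric because the graph is undirected
A = np.zeros((n_vertices, n_vertices), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1
    A[j - 1, i - 1] = 1

print(A)
print("symmetric:", np.array_equal(A, A.T))  # True for an undirected graph
```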

Page 31: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Representations of Graphs (cont.)

• Directed graph: if the vertices are labeled in ancestral order (parents before children), the adjacency matrix is strictly upper triangular, provided there are no edges from a vertex to itself.

• An undirected graph with K maximal cliques can be described by an N x K clique matrix, where each column c_k indicates which vertices form clique k.
  • Example with 2 cliques: vertices {1,2,3} and {2,3,4} (shown below).
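For this 2-clique example, the clique matrix Z (rows = vertices 1..4, columns = the cliques {1,2,3} and {2,3,4}) is:

\[
Z = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}.
\]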

Page 32: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Incidence Matrix

• Adjacency matrix A and incidence matrix Z_inc.

• Maximal clique incidence matrix Z.
  • Property relating Z_inc (and Z) to the adjacency matrix A; a related identity is sketched below.

• Note: the columns of Z_inc denote edges, and the rows denote vertices.
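A hedged sketch of one such identity (my own illustration, not necessarily the exact property on the original slide): for a simple undirected graph with a 0/1 incidence matrix Z_inc (rows = vertices, columns = edges), the off-diagonal part of Z_inc Z_incᵀ equals the adjacency matrix A, and its diagonal gives the vertex degrees.

```python
import numpy as np

# Same example graph as before: vertices 1..4, edges (1,2),(1,3),(2,3),(2,4),(3,4)
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
n_vertices = 4

# Incidence matrix Z_inc: rows are vertices, columns are edges
Z_inc = np.zeros((n_vertices, len(edges)), dtype=int)
for e, (i, j) in enumerate(edges):
    Z_inc[i - 1, e] = 1
    Z_inc[j - 1, e] = 1

M = Z_inc @ Z_inc.T
A = M - np.diag(np.diag(M))     # off-diagonal part reproduces the adjacency matrix
degrees = np.diag(M)            # diagonal gives each vertex's degree

print(A)
print("degrees:", degrees)
```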

Page 33: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Additional Information

• Graphs and equations excerpted from [Pattern Recognition and Machine Learning, Bishop C.M.], pages 84-127.

• Graphs and equations excerpted from [Bayesian Reasoning and Machine Learning, Barber D.], pages 19-23.

• Slides uploaded to the Google group. Please use with attribution.