
Page 1: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Pattern Recognition and Machine Learning - Chapter 2: Probability Distributions (Part 2) + Graphs

Affiliation: Kyoto University
Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin
Date: Nov 04, 2011

Page 2: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Terminologies

For understanding distributions


Page 3: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Terminologies

• Schur complement: relationship between the blocks of a partitioned matrix and the corresponding blocks of its inverse.

• Completing the square: converting a quadratic of the form ax^2+bx+c to a(...)^2+const, either to equate quadratic components with a standard Gaussian and solve for unknowns, or to solve the quadratic.

• Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) expressed as a mean, i.e. E[N(x)] = M(x), where N(x) is a noisy observation of M(x). A minimal sketch follows below.
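As an illustration only (not from the slides), here is a minimal Robbins-Monro sketch in Python; the toy target M(x) = x - 2 and the step sizes a_n = 1/n are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_observation(x):
    """N(x): a noisy measurement of the regression function M(x).
    Here M(x) = x - 2 is an assumed toy example with root x* = 2."""
    return (x - 2.0) + rng.normal(scale=0.1)

# Robbins-Monro iteration: x_{n+1} = x_n - a_n * N(x_n),
# with step sizes a_n -> 0, sum a_n = inf, sum a_n^2 < inf.
x = 0.0
for n in range(1, 5001):
    a_n = 1.0 / n
    x = x - a_n * noisy_observation(x)

print(f"estimated root: {x:.3f}")  # approaches 2.0
```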

Page 4: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Terminologies (cont.)

• Robbins-Monro step-size conditions [Stochastic approximation, Wikipedia, 2011]
  – The step sizes a_N must satisfy lim a_N = 0, Σ_N a_N = ∞, and Σ_N a_N^2 < ∞.

• Trace Tr(W) is the sum of the diagonal elements of W.

• Degrees of freedom: the dimension of a subspace. Here it refers to a hyperparameter.

Page 5: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Distributions

Gaussian distributions and their motivation

Page 6: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Conditional Gaussian Distribution

• Derivation of the conditional mean and variance:
  – noting the Schur complement.

• Linear Gaussian model: observations are a weighted sum of underlying latent variables. The conditional mean is linear in Xb, and the conditional variance is independent of Xb.

Assume y = Xa, x = Xb.
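For reference, the standard result from PRML Section 2.3.1 that this slide summarizes, written for the partition x = (x_a, x_b) with mean blocks (μ_a, μ_b) and covariance blocks Σ_aa, Σ_ab, Σ_ba, Σ_bb:

\[
p(x_a \mid x_b) = \mathcal{N}\!\left(x_a \mid \mu_{a|b}, \Sigma_{a|b}\right), \qquad
\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b), \qquad
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba},
\]

where \(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\) is the Schur complement of \(\Sigma_{bb}\).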

Page 7: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Marginal Gaussian Distribution

• The goal is again to identify the mean and variance by 'completing the square'.

• Solve the marginalization integral while noting the Schur complement, then compare components.
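The corresponding standard marginal result (PRML Section 2.3.2), in the same block notation as above:

\[
p(x_a) = \int p(x_a, x_b)\, \mathrm{d}x_b = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa}).
\]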

Page 8: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Gaussian distr. (quick view)

• Consider a multivariate Gaussian where
  – thus
  – according to Bayes' equation

• The conditional Gaussian must have a form whose exponent is the difference of the exponents of p(x,y) and p(x).
  – i.e. it becomes

Page 9: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Gaussian distr.

• Starting from

• Mean and variance for the joint Gaussian distribution p(x,y)

• Mean and variance for p(x|y)

p(x) can be seen as the prior, p(y|x) as the likelihood, and p(x|y) as the posterior.
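For completeness, the standard linear-Gaussian results from PRML Section 2.3.3 that these slides excerpt, for a marginal p(x) and conditional p(y|x) of the form:

\[
p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad
p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}),
\]
\[
p(y) = \mathcal{N}\!\left(y \mid A\mu + b,\; L^{-1} + A\Lambda^{-1}A^{\mathsf{T}}\right), \qquad
p(x \mid y) = \mathcal{N}\!\left(x \mid \Sigma\{A^{\mathsf{T}}L(y - b) + \Lambda\mu\},\; \Sigma\right),
\]
\[
\text{where } \Sigma = (\Lambda + A^{\mathsf{T}} L A)^{-1}.
\]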

Page 10: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Gaussian distr., sequential estimation

• Estimate the mean from (N-1)+1 observations, i.e. update the estimate based on the first N-1 points when the N-th observation arrives.

• The Robbins-Monro algorithm has the same form as this update, and can be used to find the maximum likelihood mean.
  – solve for the ML mean as a root-finding problem via Robbins-Monro.
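The sequential update referred to above is the standard PRML result:

\[
\mu_{\mathrm{ML}}^{(N)} = \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{\mathrm{ML}}^{(N-1)}\right),
\]

which matches the Robbins-Monro form with step size \(a_{N-1} = 1/N\).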

Page 11: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Univariate Gaussian distr.

• The conjugate prior for the precision (inverse variance) of a univariate Gaussian is the gamma distribution.

• The conjugate prior for the mean and precision of a univariate Gaussian jointly is the Gaussian-gamma distribution.
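For reference, the gamma prior over the precision λ has the standard form (PRML Section 2.3.6):

\[
\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a}\, \lambda^{a-1} \exp(-b\lambda).
\]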

Page 12: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Bayesian relationship with Multivariate Gaussian distr.

• The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian is the Wishart distribution.

• The conjugate prior for the mean and precision matrix of a multivariate Gaussian jointly is the Gaussian-Wishart distribution.

Page 13: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Distributions

Variations of the Gaussian distribution

Page 14: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Student’s t-distr

• Used in analysis of variance to test whether an effect is real and statistically significant, using the t-distribution with n-1 degrees of freedom.

• If the X_i are normal random variables, then
  \[
  t(n) = \frac{\bar{X}(n) - \mu}{S(n)/\sqrt{n}}
  \]
  follows a t-distribution with n-1 degrees of freedom, where \(\bar{X}(n)\) is the sample mean and \(S(n)^2\) the sample variance of n observations.
  – The t-distribution has a lower peak and longer tails than the Gaussian (it allows more outliers and is therefore more robust).

• Obtained by integrating over an infinite number of univariate Gaussians with the same mean but different precisions.
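The last bullet corresponds to the standard scale-mixture representation (PRML Section 2.3.7):

\[
\mathrm{St}(x \mid \mu, \lambda, \nu) = \int_0^{\infty} \mathcal{N}\!\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gam}\!\left(\eta \mid \tfrac{\nu}{2}, \tfrac{\nu}{2}\right) \mathrm{d}\eta.
\]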

Page 15: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Student’s t-distr (cont.)

• For a multivariate Gaussian, the corresponding multivariate t-distribution is obtained in the same way.
  – with the Mahalanobis distance appearing in the exponent.

• Mean E[x] = μ (for ν > 1); covariance cov[x] = ν/(ν-2) Λ^{-1} (for ν > 2).

Page 16: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Gaussian with periodic variables

• To avoid the mean being dependent on the choice of origin, use polar coordinates.
  – Solve for theta.

• The von Mises distribution is a special case of the von Mises-Fisher distribution for the N-dimensional sphere: it is the stationary distribution of a drift process on the circle.

Page 17: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Gaussian with periodic variables (cont.)

• From a Gaussian in Cartesian coordinates to polar coordinates
  – becomes
  – the von Mises distribution
    • mean
    • precision (concentration)
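The resulting density, in its standard form (PRML Section 2.3.8), with mean direction θ₀ and concentration m:

\[
p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \theta_0) \},
\]

where \(I_0(m)\) is the zeroth-order modified Bessel function of the first kind.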

Page 18: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Gaussian with periodic variables: mean and variance

• Solving the log likelihood gives
  – the mean
  – the precision (concentration) m

• by noting
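For reference, the standard maximum-likelihood solutions (PRML Section 2.3.8) that these bullets refer to:

\[
\theta_0^{\mathrm{ML}} = \tan^{-1}\!\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}, \qquad
A(m_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} \cos\!\left(\theta_n - \theta_0^{\mathrm{ML}}\right), \quad
\text{where } A(m) = \frac{I_1(m)}{I_0(m)}.
\]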

Page 19: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Mixture of Gaussians

• In part 1 we already saw one limitation of the Gaussian: it is unimodal.
  – Solution: a linear combination (superposition) of Gaussians.

• The mixing coefficients sum to 1.

• The posterior here is known as the ‘responsibilities’.
  – Log likelihood: see below.
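The standard mixture density and log likelihood (PRML Section 2.3.9), with mixing coefficients π_k:

\[
p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K}\pi_k = 1,\ \ 0 \le \pi_k \le 1,
\]
\[
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\},
\]

and the responsibilities are the posteriors \(\gamma_k(x) = p(k \mid x) = \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \big/ \sum_j \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)\).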

Page 20: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Exponential family

• Natural form (see below)
  – normalized by

• 1) Bernoulli
  – becomes

• 2) Multinomial
  – becomes
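The natural (canonical) form and the Bernoulli case, as given in PRML Section 2.4:

\[
p(x \mid \eta) = h(x)\, g(\eta)\, \exp\{\eta^{\mathsf{T}} u(x)\},
\qquad
g(\eta)\int h(x)\exp\{\eta^{\mathsf{T}}u(x)\}\,\mathrm{d}x = 1.
\]

For the Bernoulli distribution \(p(x\mid\mu)=\mu^x(1-\mu)^{1-x}\):

\[
\eta = \ln\frac{\mu}{1-\mu}, \qquad \mu = \sigma(\eta) = \frac{1}{1+\exp(-\eta)}, \qquad u(x) = x,\ \ h(x) = 1,\ \ g(\eta) = \sigma(-\eta).
\]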

Page 21: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Exponential family (cont.)

• 3) Univariate Gaussian
  – becomes

• Solve for the natural parameter
  – becomes
  – from maximum likelihood
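The general maximum-likelihood condition for the exponential family (PRML Section 2.4.1), which the last bullet refers to:

\[
-\nabla \ln g(\eta_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N} u(x_n),
\]

i.e. the ML estimate depends on the data only through the sufficient statistic \(\sum_n u(x_n)\).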

Page 22: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Parameters of Distributions

And interesting methodologies


Page 23: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Uninformative priors

• Avoid incorrect subjective assumptions ("subjective Bayesian" choices) by using an uninformative (e.g. uniform) prior.
  – Improper prior: the prior need not integrate to 1 for the posterior to integrate to 1, as per Bayes' equation.

• 1) Location parameter: translation invariance leads to a constant prior, p(μ) ∝ const.

• 2) Scale parameter: scale invariance leads to p(σ) ∝ 1/σ.

Page 24: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Nonparametric methods

• Instead of assuming a form for the distribution, use nonparametric methods.

• 1) Histogram with constant bin width
  – Good for sequential data.
  – Problems: discontinuities at bin edges, and the number of bins grows exponentially with dimensionality.

• 2) Kernel estimators: a sum of Parzen windows.
  – If K of the N observations fall in a region R of volume V, the density estimate becomes p(x) ≈ K/(NV).

Page 25: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Nonparametric method: Kernel estimators

• 2) Kernel estimators: fix V, determine K.
  – A kernel function counts the points falling in R.
  – h > 0 is a fixed bandwidth parameter controlling smoothing.
  – This is the Parzen estimator; the kernel k(u) can be chosen freely (e.g. Gaussian). A minimal sketch follows below.
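A minimal Parzen-window sketch in Python with a Gaussian kernel (not from the slides; the toy data and the bandwidth h = 0.3 are assumptions chosen for illustration):

```python
import numpy as np

def parzen_density(x_query, data, h):
    """Kernel density estimate for 1-D data with a Gaussian kernel:
    p(x) = (1/N) * sum_n (1/(h*sqrt(2*pi))) * exp(-(x - x_n)^2 / (2*h^2))."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # Gaussian kernel evaluated at (x_query - x_n) / h for every data point
    u = (x_query - data) / h
    kernel_vals = np.exp(-0.5 * u**2) / (h * np.sqrt(2.0 * np.pi))
    return kernel_vals.sum() / n

# Toy usage: samples from a bimodal mixture, density queried at a few points
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.5, 0.8, 200)])
for x in (-2.0, 0.0, 1.5):
    print(f"p({x:+.1f}) ~ {parzen_density(x, samples, h=0.3):.3f}")
```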

Page 26: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Nonparametric method: Nearest-neighbor

• 3) Nearest neighbor: this time use the data to grow V around the query point until it contains K points. Prior: p(C_k) = N_k/N.
  – As with the kernel estimator, the training set is stored as the knowledge base.
  – 'k' is the number of neighbors; a larger 'k' gives a smoother, less complex decision boundary with fewer regions.
  – For classifying, with N_k of the N points in class C_k, Bayes' theorem gives the posterior p(C_k | x) = K_k/K (where K_k of the K neighbors are in class C_k), which is then maximized.

Page 27: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Nonparametric method: Nearest-neighbor (cont.)

• 3) Nearest neighbor: assign a new point to the class C_k receiving the majority vote among its k nearest neighbors.
  – For k = 1 and N -> ∞, the error rate is bounded by twice the Bayes error rate. A minimal sketch follows below.

[k-nearest neighbor algorithm, Wikipedia, 2011]
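A minimal k-nearest-neighbor classifier sketch in Python (not from the slides; the toy data and k = 3 are assumptions chosen for illustration):

```python
import numpy as np

def knn_classify(x_query, X_train, y_train, k=3):
    """Assign x_query to the class receiving the majority vote
    among its k nearest training points (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Squared Euclidean distance from the query to every training point
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # majority vote

# Toy usage: two 2-D Gaussian blobs as classes 0 and 1
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([3.5, 3.8]), X, y, k=3))  # expected: 1
```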

Page 28: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Ch.2 Basic Graph Concepts

From David Barber’s book


Page 29: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Directed and undirected graphs


Page 30: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Representations of Graphs

• Singly connected (tree): there is only one path from A to B.

• Spanning tree of an undirected graph: a singly connected subgraph that covers all vertices.

• Graph representations (numerical):
  • Edge list: e.g. a list of vertex pairs (i, j).
  • Adjacency matrix A: for N vertices, an N x N matrix with A_ij = 1 if there is an edge from i to j. For an undirected graph the matrix is symmetric (see the sketch below).
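A small Python sketch building an adjacency matrix from an edge list; the example graph (vertices 1..4 with edges (1,2), (1,3), (2,3), (2,4), (3,4)) is an assumption, chosen so that it contains the two cliques {1,2,3} and {2,3,4} used on the next slide:

```python
import numpy as np

# Edge list for an undirected graph on vertices 1..4 (assumed example)
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
n_vertices = 4

# Build the N x N adjacency matrix; symmetric because the graph is undirected
A = np.zeros((n_vertices, n_vertices), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1
    A[j - 1, i - 1] = 1

print(A)
print("symmetric:", np.array_equal(A, A.T))  # True for an undirected graph
```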

Page 31: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Representations of Graphs (cont.)

• Directed graph: if the vertices are labeled in ancestral order (parents before children), the adjacency matrix is strictly upper triangular, provided there are no edges from a vertex to itself.

• An undirected graph with K maximal cliques can be described by an N x K clique matrix, where each column c_k indicates which vertices form clique k.
  • Example with 2 cliques: vertices {1,2,3} and {2,3,4} (shown below).
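For this 2-clique example, the clique matrix Z (rows = vertices 1..4, columns = the cliques {1,2,3} and {2,3,4}) is:

\[
Z = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}.
\]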

Page 32: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Incidence Matrix

• Adjacency matrix A and incidence matrix Z_inc.

• Maximal clique incidence matrix Z.
  • Property relating Z_inc (and Z) to the adjacency matrix A; a related identity is sketched below.

• Note: the columns of Z_inc denote edges, and the rows denote vertices.
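A hedged sketch of one such identity (my own illustration, not necessarily the exact property on the original slide): for a simple undirected graph with a 0/1 incidence matrix Z_inc (rows = vertices, columns = edges), the off-diagonal part of Z_inc Z_incᵀ equals the adjacency matrix A, and its diagonal gives the vertex degrees.

```python
import numpy as np

# Same example graph as before: vertices 1..4, edges (1,2),(1,3),(2,3),(2,4),(3,4)
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
n_vertices = 4

# Incidence matrix Z_inc: rows are vertices, columns are edges
Z_inc = np.zeros((n_vertices, len(edges)), dtype=int)
for e, (i, j) in enumerate(edges):
    Z_inc[i - 1, e] = 1
    Z_inc[j - 1, e] = 1

M = Z_inc @ Z_inc.T
A = M - np.diag(np.diag(M))     # off-diagonal part reproduces the adjacency matrix
degrees = np.diag(M)            # diagonal gives each vertex's degree

print(A)
print("degrees:", degrees)
```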

Page 33: Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

Additional Information

• Graphs and equations excerpted from [Pattern Recognition and Machine Learning, Bishop C.M.], pages 84-127.

• Graphs and equations excerpted from [Bayesian Reasoning and Machine Learning, Barber D.], pages 19-23.

• Slides uploaded to the Google group. Please use with attribution.