
Page 1:

CS6720 --- Pattern Recognition

Review of Prerequisites in Math and Statistics

Prepared by Li Yang

Based on Appendix chapters of

Pattern Recognition, 4th Ed. by S. Theodoridis and K. Koutroumbas

and figures from Wikipedia.org


Page 2:

Probability and Statistics

Probability P(A) of an event A: a real number between 0 and 1.

Joint probability P(A ∩ B): the probability that both A and B occur in a single experiment. P(A ∩ B) = P(A)P(B) if A and B are independent.

Probability P(A ∪ B) of the union of A and B: the probability that either A or B occurs in a single experiment. P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive.

Conditional probability:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Therefore, the Bayes rule:
$$P(A|B)P(B) = P(B|A)P(A) \quad\Longrightarrow\quad P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

Total probability: let $A_1, \ldots, A_m$ be such that $\sum_{i=1}^{m} P(A_i) = 1$; then
$$P(B) = \sum_{i=1}^{m} P(B|A_i)\,P(A_i)$$
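As a sanity check (not part of the original slides), here is a minimal Python sketch of the total probability and Bayes rule formulas above; the prior and likelihood numbers are made up for illustration.

```python
# Minimal numeric check of total probability and the Bayes rule.
# Two disjoint events A1, A2 with P(A1) + P(A2) = 1, and an observed event B.
priors = {"A1": 0.3, "A2": 0.7}          # P(A_i), illustrative values
likelihoods = {"A1": 0.8, "A2": 0.1}     # P(B | A_i), illustrative values

# Total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_b = sum(likelihoods[a] * priors[a] for a in priors)

# Bayes rule: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posteriors = {a: likelihoods[a] * priors[a] / p_b for a in priors}

print(p_b)         # 0.31
print(posteriors)  # the posteriors sum to 1
```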

Page 3:

Probability density function (pdf): p(x) for a continuous random variable x
$$P(a \le x \le b) = \int_a^b p(x)\,dx$$

Total and conditional probabilities can also be extended to pdf's.

Mean and variance: let p(x) be the pdf of a random variable x
$$E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx \quad\text{and}\quad \sigma^2 = E[(x - E[x])^2] = \int_{-\infty}^{\infty} (x - E[x])^2\,p(x)\,dx$$

Statistical independence:
$$p_{x,y}(x, y) = p_x(x)\,p_y(y)$$

Kullback-Leibler divergence (a "distance"?) of pdf's:
$$L(p(\mathbf{x}), p'(\mathbf{x})) = -\int p(\mathbf{x}) \ln \frac{p'(\mathbf{x})}{p(\mathbf{x})}\,d\mathbf{x}$$

Pay attention that it is not symmetric:
$$L(p(\mathbf{x}), p'(\mathbf{x})) \ne L(p'(\mathbf{x}), p(\mathbf{x}))$$
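A small numerical sketch of the KL divergence may help; the two Gaussians and the grid below are arbitrary choices, and the Riemann-sum approximation is mine, not the book's.

```python
# Sketch: approximate the KL divergence between two univariate Gaussian pdfs
# on a fine grid, and confirm that L(p, p') != L(p', p).
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)   # p(x)
q = norm.pdf(x, loc=1.0, scale=2.0)   # p'(x)

def kl(p, q):
    # L(p, p') = -integral of p(x) ln(p'(x)/p(x)) dx, as a Riemann sum
    return -np.sum(p * np.log(q / p)) * dx

print(kl(p, q), kl(q, p))  # two different values: KL is not symmetric
```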

Page 4:

Characteristic function of a pdf:
$$\Phi(\mathbf{\Omega}) = \int p(\mathbf{x}) \exp(j\mathbf{\Omega}^T\mathbf{x})\,d\mathbf{x} = E[\exp(j\mathbf{\Omega}^T\mathbf{x})]$$
For a scalar x:
$$\Phi(s) = \int p(x)\exp(sx)\,dx = E[\exp(sx)]$$

2nd characteristic function:
$$\Psi(s) = \ln \Phi(s)$$

n-th order moment:
$$\left.\frac{d^n \Phi(s)}{ds^n}\right|_{s=0} = E[x^n]$$

Cumulants:
$$\kappa_n = \left.\frac{d^n \Psi(s)}{ds^n}\right|_{s=0}$$

When $E[x] = 0$, then $\kappa_0 = 0$, $\kappa_1 = 0$, $\kappa_2 = E[x^2] = \sigma^2$, $\kappa_3 = E[x^3]$ (skewness), and $\kappa_4 = E[x^4] - 3\sigma^4$ (kurtosis).
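The cumulant formulas above can be checked by simulation. A hedged sketch, assuming a zero-mean Gaussian sample (for which all cumulants beyond κ₂ vanish); note that scipy.stats reports the normalized skewness and kurtosis, i.e., κ₃/σ³ and κ₄/σ⁴.

```python
# Sketch: sample estimates of the cumulants named on the slide.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)           # zero mean, so the formulas apply

sigma2 = np.mean(x**2)                   # kappa_2 = E[x^2]
kappa3 = np.mean(x**3)                   # kappa_3 = E[x^3] (skewness)
kappa4 = np.mean(x**4) - 3 * sigma2**2   # kappa_4 = E[x^4] - 3 sigma^4 (kurtosis)

print(sigma2, kappa3, kappa4)            # ~1, ~0, ~0 for a Gaussian
print(skew(x), kurtosis(x))              # normalized versions, also ~0
```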

Page 5:

Discrete Distributions

Binomial distribution B(n, p): repeatedly draw n balls, each with a probability p of being a black ball. The probability of getting k black balls:
$$P(k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Poisson distribution: the probability of the number of events occurring in a fixed period of time, if these events occur with a known average rate λ:
$$P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$

When $n \to \infty$ and $\lambda = np$ remains constant, $B(n, p) \to \text{Poisson}(\lambda)$.
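The Poisson limit can be seen numerically. A minimal sketch with scipy.stats, using an arbitrary λ = 3 and k = 2:

```python
# Sketch: the Poisson limit of the binomial. With n large and lambda = n*p
# held fixed, B(n, p) probabilities approach Poisson(lambda) probabilities.
from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 10_000):
    p = lam / n
    # compare P(k) under B(n, p) and under Poisson(lambda) at k = 2
    print(n, binom.pmf(2, n, p), poisson.pmf(2, lam))
```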

Page 6:

Normal (Gaussian) Distribution

Univariate N(μ, σ²):
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Multivariate N(μ, Σ):
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{l/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
with the mean $\boldsymbol{\mu}$ and the covariance matrix $\Sigma$, where $\mu_i = E[x_i]$, $\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$, and $\Sigma = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T]$.

Central limit theorem: let $x_1, \ldots, x_n$ be independent with zero mean and unit variance, and let
$$z = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i$$
Then $z \sim N(0, 1)$ when $n \to \infty$, irrespective of the pdf's of the $x_i$'s.
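A quick simulation illustrates the central limit theorem; the uniform source distribution below is an arbitrary choice.

```python
# Sketch: central limit theorem for a decidedly non-Gaussian source.
# Standardized sums of n uniform variables look N(0, 1) for large n.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 100_000
x = rng.uniform(0, 1, size=(trials, n))        # mean 1/2, variance 1/12
z = (x.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)

print(z.mean(), z.std())                        # ~0 and ~1
print(np.mean(np.abs(z) < 1.96))                # ~0.95, as for N(0, 1)
```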

Page 7:

Other Continuous Distributions

Chi-square (χ²) distribution of k degrees of freedom: the distribution of a sum of squares of k independent standard normal random variables, that is,
$$x = x_1^2 + x_2^2 + \cdots + x_k^2, \quad\text{where } x_i \sim N(0, 1)$$
Its pdf is
$$p(y) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, y^{k/2-1} e^{-y/2}\,\text{step}(y), \quad\text{where}\quad \Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$$

Mean: k. Variance: 2k.

Assume $x \sim \chi_k^2$. Then $(x - k)/\sqrt{2k} \sim N(0, 1)$ as $k \to \infty$, by the central limit theorem.

Also, $\sqrt{2x}$ is approximately normally distributed with mean $\sqrt{2k - 1}$ and unit variance.
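Both normal approximations can be checked by simulation; the choice k = 50 below is arbitrary.

```python
# Sketch: the two normal approximations for chi-square on this slide,
# checked by simulation for k = 50 degrees of freedom.
import numpy as np

rng = np.random.default_rng(0)
k, trials = 50, 200_000
x = rng.chisquare(k, size=trials)

z1 = (x - k) / np.sqrt(2 * k)                   # ~ N(0, 1) for large k
z2 = np.sqrt(2 * x)                             # ~ N(sqrt(2k - 1), 1)

print(z1.mean(), z1.std())                      # ~0, ~1
print(z2.mean(), np.sqrt(2 * k - 1), z2.std())  # mean ~ sqrt(99), std ~ 1
```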

Page 8:

Other Continuous Distributions

t-distribution: used for estimating the mean of a normal distribution when the sample size is small. A t-distributed variable:
$$q = \frac{x}{\sqrt{z/k}}, \quad\text{where } x \sim N(0, 1) \text{ and } z \sim \chi_k^2$$
Its pdf is
$$p(q) = \frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\,\Gamma(k/2)} \left(1 + \frac{q^2}{k}\right)^{-(k+1)/2}$$

Mean: 0 for k > 1. Variance: k/(k − 2) for k > 2.

β-distribution Beta(α, β): the posterior distribution of the parameter p of a binomial distribution after observing α − 1 events with probability p and β − 1 events with probability 1 − p. Its pdf is
$$p(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1} = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1} (1-x)^{\beta-1}$$
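A minimal sketch of the Beta-posterior interpretation, assuming a flat Beta(1, 1) prior and hypothetical counts of 7 successes and 3 failures:

```python
# Sketch: Beta(alpha, beta) as the posterior of a binomial success
# probability p. Starting from a flat Beta(1, 1) prior, observing s
# successes and f failures gives a Beta(1 + s, 1 + f) posterior.
from scipy.stats import beta

s, f = 7, 3                      # hypothetical observations
posterior = beta(1 + s, 1 + f)

print(posterior.mean())          # (1 + s) / (2 + s + f) = 0.666...
print(posterior.interval(0.95))  # 95% credible interval for p
```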

Page 9:

Linear Algebra

Eigenvalues and eigenvectors: there exist λ and v such that Av = λv.

A real matrix A is called positive semidefinite if $\mathbf{x}^T A \mathbf{x} \ge 0$ for every nonzero vector x; A is called positive definite if $\mathbf{x}^T A \mathbf{x} > 0$.

Positive definite matrices act like positive numbers: all their eigenvalues are positive.

If A is symmetric, $A^T = A$, then its eigenvectors are orthogonal, $\mathbf{v}_i^T \mathbf{v}_j = 0$ for $i \ne j$. Therefore, a symmetric A can be diagonalized as
$$A = \Phi \Lambda \Phi^T \quad\text{and}\quad \Lambda = \Phi^T A \Phi$$
where $\Phi = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_l]$ and $\Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_l)$.
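A minimal numpy sketch of this diagonalization; the random symmetric matrix is constructed as BBᵀ so that it is also positive semidefinite.

```python
# Sketch: diagonalizing a symmetric matrix with numpy. eigh returns
# orthonormal eigenvectors Phi and eigenvalues lam with A = Phi L Phi^T.
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T                                          # symmetric, PSD

lam, Phi = np.linalg.eigh(A)                         # A Phi = Phi diag(lam)
print(np.allclose(Phi @ np.diag(lam) @ Phi.T, A))    # True: A = Phi L Phi^T
print(np.allclose(Phi.T @ A @ Phi, np.diag(lam)))    # True: L = Phi^T A Phi
print(np.all(lam >= -1e-12))                         # PSD: all eigenvalues >= 0
```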

Page 10:

Correlation Matrix and Inner Product Matrix

Principal component analysis (PCA): let x be a random variable in $R^l$. Its correlation matrix $\Sigma = E[\mathbf{x}\mathbf{x}^T]$ is positive semidefinite and thus can be diagonalized as
$$\Sigma = \Phi \Lambda \Phi^T$$
Assign $\mathbf{x}' = \Phi^T \mathbf{x}$; then $E[\mathbf{x}'\mathbf{x}'^T] = \Phi^T \Sigma \Phi = \Lambda$. Further assign $\mathbf{x}'' = \Lambda^{-1/2}\Phi^T \mathbf{x}$; then $E[\mathbf{x}''\mathbf{x}''^T] = I$.

Classical multidimensional scaling (classical MDS): given a distance matrix $D = \{d_{ij}\}$, the inner product matrix $G = \{\mathbf{x}_i^T \mathbf{x}_j\}$ can be computed by a bidirectional centering process
$$G = -\frac{1}{2}\left(I - \frac{1}{n}\mathbf{e}\mathbf{e}^T\right) D \left(I - \frac{1}{n}\mathbf{e}\mathbf{e}^T\right), \quad\text{where } \mathbf{e} = [1, 1, \ldots, 1]^T$$
G can be diagonalized as $G = \Phi' \Lambda' \Phi'^T$. Actually, $n\Lambda$ and $\Lambda'$ share the same set of eigenvalues. Because $G = X^T X$, where $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$, X can then be recovered as $X = \Lambda'^{1/2} \Phi'^T$.
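A sketch of classical MDS along these lines; note that the centering formula is applied here to squared distances, and the random 3-dimensional configuration is arbitrary. Coordinates are recovered only up to rotation and translation.

```python
# Sketch: classical MDS from a squared-distance matrix D, using the
# bidirectional centering formula above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10))                    # l = 3 dims, n = 10 points
D = np.square(X[:, :, None] - X[:, None, :]).sum(axis=0)  # squared distances

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n             # centering: I - (1/n) e e^T
G = -0.5 * J @ D @ J                            # inner-product (Gram) matrix

lam, Phi = np.linalg.eigh(G)
lam, Phi = lam[::-1], Phi[:, ::-1]              # sort eigenvalues descending
X_rec = np.diag(np.sqrt(np.clip(lam[:3], 0, None))) @ Phi[:, :3].T

# the recovered configuration reproduces G exactly (rank of G is <= 3 here)
print(np.allclose(X_rec.T @ X_rec, G, atol=1e-6))
```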

Page 11:

Cost Function Optimization

Find θ so that a differentiable function J(θ) is minimized.

Gradient descent method: start with an initial estimate θ(0) and adjust θ iteratively by
$$\theta_{new} = \theta_{old} + \Delta\theta, \quad \Delta\theta = -\mu \left.\frac{\partial J(\theta)}{\partial \theta}\right|_{\theta = \theta_{old}}, \quad \mu > 0$$

Taylor expansion of J(θ) at a stationary point θ₀:
$$J(\theta) = J(\theta_0) + (\theta - \theta_0)^T \mathbf{g} + \frac{1}{2}(\theta - \theta_0)^T H (\theta - \theta_0) + O(\|\theta - \theta_0\|^3)$$
where
$$\mathbf{g} = \left.\frac{\partial J(\theta)}{\partial \theta}\right|_{\theta = \theta_0} = \mathbf{0} \ \text{(stationary point)} \quad\text{and}\quad H(i, j) = \left.\frac{\partial^2 J(\theta)}{\partial \theta_i \partial \theta_j}\right|_{\theta = \theta_0}$$

Ignoring higher order terms within a neighborhood of θ₀, and with H positive semidefinite, substituting the update rule gives
$$\theta_{new} - \theta_0 = (I - \mu H)(\theta_{old} - \theta_0)$$
which will converge if every $|1 - \mu\lambda_i| < 1$, i.e., $\mu < 2/\lambda_{max}$. Therefore, the convergence speed is decided by the ratio $\lambda_{min}/\lambda_{max}$.
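The step-size bound μ < 2/λ_max can be seen on a toy quadratic; the diagonal Hessian below is an arbitrary choice with λ_max = 10.

```python
# Sketch: gradient descent on J(theta) = 0.5 theta^T H theta, illustrating
# the step-size bound mu < 2 / lambda_max derived above.
import numpy as np

H = np.diag([1.0, 10.0])            # lambda_min = 1, lambda_max = 10
grad = lambda th: H @ th            # gradient of J

def descend(mu, steps=100):
    th = np.array([1.0, 1.0])
    for _ in range(steps):
        th = th - mu * grad(th)
    return th

print(descend(0.19))   # mu < 2/10: converges toward the minimum at 0
print(descend(0.21))   # mu > 2/10: diverges along the lambda_max direction
```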

Page 12:

Newton's method: adjust θ iteratively by
$$\theta_{new} = \theta_{old} - H^{-1} \left.\frac{\partial J(\theta)}{\partial \theta}\right|_{\theta = \theta_{old}}$$
It converges much faster than gradient descent. In fact, from the Taylor expansion we have
$$\theta_{new} = \theta_{old} - H^{-1} H (\theta_{old} - \theta_0) = \theta_0$$
so for a quadratic cost the minimum is found in one iteration.

Conjugate gradient method:
$$\Delta\theta^t = -\mathbf{g}^t + \beta^t \Delta\theta^{t-1}, \quad\text{where}\quad \mathbf{g}^t = \left.\frac{\partial J(\theta)}{\partial \theta}\right|_{\theta = \theta^t}$$
and
$$\beta^t = \frac{\mathbf{g}^{tT}(\mathbf{g}^t - \mathbf{g}^{t-1})}{\mathbf{g}^{(t-1)T}\mathbf{g}^{t-1}} \ \text{(Polak-Ribière)} \quad\text{or}\quad \beta^t = \frac{\mathbf{g}^{tT}\mathbf{g}^t}{\mathbf{g}^{(t-1)T}\mathbf{g}^{t-1}} \ \text{(Fletcher-Reeves)}$$
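A one-step check of Newton's method on a quadratic cost; the Hessian and linear term below are arbitrary.

```python
# Sketch: on a quadratic, Newton's method reaches the minimum in a single
# iteration, whereas gradient descent only approaches it gradually.
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])    # Hessian (positive definite)
b = np.array([1.0, -1.0])
J_grad = lambda th: H @ th - b            # gradient of 0.5 th^T H th - b^T th

theta = np.array([5.0, 5.0])              # arbitrary starting point
theta_new = theta - np.linalg.solve(H, J_grad(theta))   # one Newton step

print(theta_new)                          # equals the solution of H theta = b
print(np.linalg.solve(H, b))              # the true minimizer, same vector
```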

Page 13:

Constrained Optimization with Equality Constraints

Minimize J(θ) subject to $f_i(\theta) = 0$ for $i = 1, 2, \ldots, m$.

Minimization happens where the gradient of J is a linear combination of the gradients of the constraints:
$$\frac{\partial J(\theta)}{\partial \theta} = \sum_{i=1}^{m} \lambda_i \frac{\partial f_i(\theta)}{\partial \theta}$$

Lagrange multipliers: construct
$$L(\theta, \lambda) = J(\theta) + \sum_{i=1}^{m} \lambda_i f_i(\theta)$$
and solve
$$\frac{\partial L(\theta, \lambda)}{\partial \theta} = 0, \qquad \frac{\partial L(\theta, \lambda)}{\partial \lambda} = 0$$
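A minimal sketch of the Lagrange multiplier recipe using sympy, on a made-up problem (minimize θ₁² + θ₂² subject to θ₁ + θ₂ = 1):

```python
# Sketch: equality-constrained minimization via a Lagrangian, with sympy.
import sympy as sp

t1, t2, lam = sp.symbols("t1 t2 lam")
J = t1**2 + t2**2
f = t1 + t2 - 1
L = J + lam * f                    # Lagrangian L(theta, lambda)

# Solve grad_theta L = 0 and grad_lambda L = 0 simultaneously
sol = sp.solve([sp.diff(L, t1), sp.diff(L, t2), sp.diff(L, lam)], [t1, t2, lam])
print(sol)                         # {t1: 1/2, t2: 1/2, lam: -1}
```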

Page 14:

Constrained Optimization with Inequality Constraints

Minimize J(θ) subject to $f_i(\theta) \ge 0$ for $i = 1, 2, \ldots, m$.

The constraints $f_i(\theta) \ge 0$, $i = 1, 2, \ldots, m$, define a feasible region in which the answer lies.

Karush–Kuhn–Tucker (KKT) conditions: a set of necessary conditions that a local optimizer θ* has to satisfy. There exists a vector λ of Lagrange multipliers such that
$$(1)\ \frac{\partial L(\theta^*, \lambda)}{\partial \theta} = 0 \qquad (2)\ \lambda_i \ge 0 \ \text{for } i = 1, 2, \ldots, m \qquad (3)\ \lambda_i f_i(\theta^*) = 0 \ \text{for } i = 1, 2, \ldots, m$$

Remarks (see the sketch after this list):
(1) The most natural condition.
(2) $f_i(\theta^*)$ is inactive if $\lambda_i = 0$.
(3) $\lambda_i \ge 0$ if the minimum is on the boundary $f_i(\theta^*) = 0$.
(4) The (unconstrained) minimum lies in the interior of the region if all $\lambda_i = 0$.
(5) For convex J(θ) and a convex region, a local minimum is the global minimum.
(6) Still difficult to compute: assume some $f_i(\theta^*)$'s are active, solve, and check $\lambda_i \ge 0$.
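A hedged sketch of conditions (1)-(3) on a made-up problem, using scipy.optimize (which applies SLSQP when constraints are given) and then checking the multiplier by hand:

```python
# Sketch: minimize J(theta) = (theta1 - 2)^2 + (theta2 - 1)^2
# subject to f(theta) = 1 - theta1 - theta2 >= 0, then verify KKT.
import numpy as np
from scipy.optimize import minimize

J = lambda th: (th[0] - 2) ** 2 + (th[1] - 1) ** 2
f = lambda th: 1 - th[0] - th[1]          # constraint f(theta) >= 0

res = minimize(J, x0=np.zeros(2), constraints=[{"type": "ineq", "fun": f}])
theta = res.x
print(theta)                              # ~ [1, 0]: the constraint is active

# KKT check: grad J = lambda * grad f with lambda >= 0, and lambda * f = 0
grad_J = np.array([2 * (theta[0] - 2), 2 * (theta[1] - 1)])   # ~ [-2, -2]
grad_f = np.array([-1.0, -1.0])
lam = grad_J[0] / grad_f[0]               # ~ 2 >= 0
print(lam, f(theta))                      # multiplier ~2, and f(theta*) ~ 0
```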

Page 15:

Convex function: $f(\cdot): S \subseteq R^l \to R$ is convex if, for every $\theta, \theta' \in S$ and $\lambda \in [0, 1]$,
$$f(\lambda\theta + (1-\lambda)\theta') \le \lambda f(\theta) + (1-\lambda) f(\theta')$$

Concave function:
$$f(\lambda\theta + (1-\lambda)\theta') \ge \lambda f(\theta) + (1-\lambda) f(\theta')$$

Convex set: $S \subseteq R^l$ is a convex set if, for every $\theta, \theta' \in S$ and $\lambda \in [0, 1]$,
$$\lambda\theta + (1-\lambda)\theta' \in S$$

If $f(\mathbf{x})$ is concave, then $\{\mathbf{x} \mid f(\mathbf{x}) \ge b\}$ is a convex set.

A local minimum of a convex function is also a global minimum.

Page 16:

Min-Max Duality

Game: A pays $F(x, y)$ to B, while A chooses x and B chooses y.

A's goal: $\min_x \max_y F(x, y)$. B's goal: $\max_y \min_x F(x, y)$. The two problems are dual to each other.

In general:
$$\min_x F(x, y) \le F(x, y) \le \max_y F(x, y)$$
Therefore,
$$\max_y \min_x F(x, y) \le \min_x \max_y F(x, y)$$

Saddle point condition: if there exists $(x^*, y^*)$ such that
$$F(x^*, y) \le F(x^*, y^*) \le F(x, y^*)$$
or equivalently
$$F(x^*, y^*) = \max_y \min_x F(x, y) = \min_x \max_y F(x, y)$$
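A tiny numeric illustration of max-min ≤ min-max over pure strategies; the payoff matrix is arbitrary and happens to have no (pure) saddle point:

```python
# Sketch: max-min <= min-max on a finite payoff matrix F[x, y].
# Rows are A's choices x, columns are B's choices y; A pays F[x, y] to B.
import numpy as np

F = np.array([[3.0, 1.0],
              [0.0, 2.0]])

max_min = F.min(axis=0).max()   # max_y min_x F(x, y)
min_max = F.max(axis=1).min()   # min_x max_y F(x, y)
print(max_min, min_max)         # 1.0 <= 2.0: strict, so no saddle point here
```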

Page 17:

Lagrange duality. Recall the optimization problem:
Minimize $J(\theta)$ subject to $f_i(\theta) \ge 0$ for $i = 1, 2, \ldots, m$.

Lagrange function:
$$L(\theta, \lambda) = J(\theta) - \sum_{i=1}^{m} \lambda_i f_i(\theta)$$
Because $\max_{\lambda \ge 0} L(\theta, \lambda) = J(\theta)$, we have
$$\min_\theta J(\theta) = \min_\theta \max_{\lambda \ge 0} L(\theta, \lambda)$$

Convex programming: for a large class of applications, $J(\theta)$ is convex and the $f_i(\theta)$'s are concave. Then the minimization solution $(\theta^*, \lambda^*)$ is a saddle point of $L(\theta, \lambda)$:
$$L(\theta^*, \lambda) \le L(\theta^*, \lambda^*) \le L(\theta, \lambda^*)$$
$$L(\theta^*, \lambda^*) = \min_\theta \max_{\lambda \ge 0} L(\theta, \lambda) = \max_{\lambda \ge 0} \min_\theta L(\theta, \lambda)$$
Therefore, the optimization problem becomes
$$\max_{\lambda \ge 0} \min_\theta L(\theta, \lambda), \quad\text{i.e.,}\quad \max_\lambda \min_\theta L(\theta, \lambda) \ \text{subject to } \lambda \ge 0$$
MUCH SIMPLER!
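A one-variable sketch of this dual construction with sympy; the problem (minimize θ² subject to θ ≥ 1) is made up, and strong duality holds here because it is convex:

```python
# Sketch: Lagrange duality on a toy convex problem.
# Minimize J(theta) = theta^2 subject to f(theta) = theta - 1 >= 0.
import sympy as sp

theta, lam = sp.symbols("theta lam", real=True)
L = theta**2 - lam * (theta - 1)        # L(theta, lambda) = J - lambda f

theta_star = sp.solve(sp.diff(L, theta), theta)[0]   # inner min: theta = lam/2
dual = sp.simplify(L.subs(theta, theta_star))        # dual function: lam - lam^2/4

lam_star = sp.solve(sp.diff(dual, lam), lam)[0]      # outer max: lambda* = 2
print(lam_star, theta_star.subs(lam, lam_star))      # lambda* = 2, theta* = 1
print(dual.subs(lam, lam_star))                      # dual optimum = 1 = J(theta*)
```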

Page 18:

Mercer's Theorem and the Kernel Method

Mercer's theorem: let $\mathbf{x} \in R^l$ and a given mapping $\mathbf{x} \mapsto \phi(\mathbf{x}) \in H$ (H denotes a Hilbert space, i.e., a finite- or infinite-dimensional Euclidean space). Then the inner product $\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$ can be expressed as a kernel function:
$$\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = K(\mathbf{x}, \mathbf{y})$$
where $K(\mathbf{x}, \mathbf{y})$ is symmetric, continuous, and positive semi-definite. The opposite is also true.

The kernel method can transform any algorithm that solely depends on the dot product between two vectors into a kernelized version, by replacing the dot product with the kernel function. The kernelized version is equivalent to the algorithm operating in the range space of φ. Because kernels are used, however, φ is never explicitly computed.
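A minimal illustration of the kernel trick, using the 2nd-degree polynomial kernel K(x, y) = (xᵀy)², whose explicit map φ is known in closed form for l = 2:

```python
# Sketch: the kernel value equals the inner product in the feature space H,
# so phi itself never has to be computed.
import numpy as np

def phi(v):
    # explicit map for l = 2: phi(v) = (v1^2, v2^2, sqrt(2) v1 v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))   # inner product in H: 1.0
print(np.dot(x, y) ** 2)        # kernel value: the same, without using phi
```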