
Page 1:

CS6720 --- Pattern Recognition

Review of Prerequisites in Math and Statistics

Prepared by Li Yang

Based on Appendix chapters of

Pattern Recognition, 4th Ed. by S. Theodoridis and K. Koutroumbas

and figures from Wikipedia.org


Page 2:

Probability and Statistics

Probability P(A) of an event A: a real number between 0 and 1.

Joint probability P(A ∩ B): the probability that both A and B occur in a single experiment. P(A ∩ B) = P(A)P(B) if A and B are independent.

Probability P(A ∪ B) of the union of A and B: the probability that either A or B occurs in a single experiment. P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive.

Conditional probability:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Therefore, the Bayes rule:
$$P(A|B)P(B) = P(B|A)P(A) \quad\Longrightarrow\quad P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

Total probability: let $A_1, \ldots, A_m$ be such that $\sum_{i=1}^{m} P(A_i) = 1$; then
$$P(B) = \sum_{i=1}^{m} P(B|A_i)\,P(A_i)$$
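As a sanity check (not part of the original slides), here is a minimal Python sketch of the total probability and Bayes rule formulas above; the prior and likelihood numbers are made up for illustration.

```python
# Minimal numeric check of total probability and the Bayes rule.
# Two disjoint events A1, A2 with P(A1) + P(A2) = 1, and an observed event B.
priors = {"A1": 0.3, "A2": 0.7}          # P(A_i), illustrative values
likelihoods = {"A1": 0.8, "A2": 0.1}     # P(B | A_i), illustrative values

# Total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_b = sum(likelihoods[a] * priors[a] for a in priors)

# Bayes rule: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posteriors = {a: likelihoods[a] * priors[a] / p_b for a in priors}

print(p_b)         # 0.31
print(posteriors)  # the posteriors sum to 1
```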

Page 3:

Probability density function (pdf): p(x) for a continuous random variable x
$$P(a \le x \le b) = \int_a^b p(x)\,dx$$

Total and conditional probabilities can also be extended to pdf's.

Mean and variance: let p(x) be the pdf of a random variable x
$$E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx \quad\text{and}\quad \sigma^2 = E[(x - E[x])^2] = \int_{-\infty}^{\infty} (x - E[x])^2\,p(x)\,dx$$

Statistical independence:
$$p_{x,y}(x, y) = p_x(x)\,p_y(y)$$

Kullback-Leibler divergence (a "distance"?) of pdf's:
$$L(p(\mathbf{x}), p'(\mathbf{x})) = -\int p(\mathbf{x}) \ln \frac{p'(\mathbf{x})}{p(\mathbf{x})}\,d\mathbf{x}$$

Pay attention that it is not symmetric:
$$L(p(\mathbf{x}), p'(\mathbf{x})) \ne L(p'(\mathbf{x}), p(\mathbf{x}))$$
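A small numerical sketch of the KL divergence may help; the two Gaussians and the grid below are arbitrary choices, and the Riemann-sum approximation is mine, not the book's.

```python
# Sketch: approximate the KL divergence between two univariate Gaussian pdfs
# on a fine grid, and confirm that L(p, p') != L(p', p).
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)   # p(x)
q = norm.pdf(x, loc=1.0, scale=2.0)   # p'(x)

def kl(p, q):
    # L(p, p') = -integral of p(x) ln(p'(x)/p(x)) dx, as a Riemann sum
    return -np.sum(p * np.log(q / p)) * dx

print(kl(p, q), kl(q, p))  # two different values: KL is not symmetric
```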

Page 4:

Characteristic function of a pdf:
$$\Phi(\mathbf{\Omega}) = \int p(\mathbf{x}) \exp(j\mathbf{\Omega}^T\mathbf{x})\,d\mathbf{x} = E[\exp(j\mathbf{\Omega}^T\mathbf{x})]$$
For a scalar x:
$$\Phi(s) = \int p(x)\exp(sx)\,dx = E[\exp(sx)]$$

2nd characteristic function:
$$\Psi(s) = \ln \Phi(s)$$

n-th order moment:
$$\left.\frac{d^n \Phi(s)}{ds^n}\right|_{s=0} = E[x^n]$$

Cumulants:
$$\kappa_n = \left.\frac{d^n \Psi(s)}{ds^n}\right|_{s=0}$$

When $E[x] = 0$, then $\kappa_0 = 0$, $\kappa_1 = 0$, $\kappa_2 = E[x^2] = \sigma^2$, $\kappa_3 = E[x^3]$ (skewness), and $\kappa_4 = E[x^4] - 3\sigma^4$ (kurtosis).
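The cumulant formulas above can be checked by simulation. A hedged sketch, assuming a zero-mean Gaussian sample (for which all cumulants beyond κ₂ vanish); note that scipy.stats reports the normalized skewness and kurtosis, i.e., κ₃/σ³ and κ₄/σ⁴.

```python
# Sketch: sample estimates of the cumulants named on the slide.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)           # zero mean, so the formulas apply

sigma2 = np.mean(x**2)                   # kappa_2 = E[x^2]
kappa3 = np.mean(x**3)                   # kappa_3 = E[x^3] (skewness)
kappa4 = np.mean(x**4) - 3 * sigma2**2   # kappa_4 = E[x^4] - 3 sigma^4 (kurtosis)

print(sigma2, kappa3, kappa4)            # ~1, ~0, ~0 for a Gaussian
print(skew(x), kurtosis(x))              # normalized versions, also ~0
```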

Page 5:

Discrete Distributions

Binomial distribution B(n, p): repeatedly draw n balls, each with a probability p of being a black ball. The probability of getting k black balls:
$$P(k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Poisson distribution: the probability of the number of events occurring in a fixed period of time, if these events occur with a known average rate λ:
$$P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$

When $n \to \infty$ and $\lambda = np$ remains constant, $B(n, p) \to \text{Poisson}(\lambda)$.
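The Poisson limit can be seen numerically. A minimal sketch with scipy.stats, using an arbitrary λ = 3 and k = 2:

```python
# Sketch: the Poisson limit of the binomial. With n large and lambda = n*p
# held fixed, B(n, p) probabilities approach Poisson(lambda) probabilities.
from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 10_000):
    p = lam / n
    # compare P(k) under B(n, p) and under Poisson(lambda) at k = 2
    print(n, binom.pmf(2, n, p), poisson.pmf(2, lam))
```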

Page 6:

Normal (Gaussian) Distribution

Univariate N(μ, σ²):
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Multivariate N(μ, Σ):
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{l/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
with the mean $\boldsymbol{\mu}$ and the covariance matrix $\Sigma$, where $\mu_i = E[x_i]$, $\sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$, and $\Sigma = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T]$.

Central limit theorem: let $x_1, \ldots, x_n$ be independent with zero mean and unit variance, and let
$$z = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i$$
Then $z \sim N(0, 1)$ when $n \to \infty$, irrespective of the pdf's of the $x_i$'s.
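A quick simulation illustrates the central limit theorem; the uniform source distribution below is an arbitrary choice.

```python
# Sketch: central limit theorem for a decidedly non-Gaussian source.
# Standardized sums of n uniform variables look N(0, 1) for large n.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 100_000
x = rng.uniform(0, 1, size=(trials, n))        # mean 1/2, variance 1/12
z = (x.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)

print(z.mean(), z.std())                        # ~0 and ~1
print(np.mean(np.abs(z) < 1.96))                # ~0.95, as for N(0, 1)
```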

Page 7:

Other Continuous Distributions

Chi-square (χ²) distribution of k degrees of freedom: the distribution of a sum of squares of k independent standard normal random variables, that is,
$$x = x_1^2 + x_2^2 + \cdots + x_k^2, \quad\text{where } x_i \sim N(0, 1)$$
Its pdf is
$$p(y) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, y^{k/2-1} e^{-y/2}\,\text{step}(y), \quad\text{where}\quad \Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$$

Mean: k. Variance: 2k.

Assume $x \sim \chi_k^2$. Then $(x - k)/\sqrt{2k} \sim N(0, 1)$ as $k \to \infty$, by the central limit theorem.

Also, $\sqrt{2x}$ is approximately normally distributed with mean $\sqrt{2k - 1}$ and unit variance.
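Both normal approximations can be checked by simulation; the choice k = 50 below is arbitrary.

```python
# Sketch: the two normal approximations for chi-square on this slide,
# checked by simulation for k = 50 degrees of freedom.
import numpy as np

rng = np.random.default_rng(0)
k, trials = 50, 200_000
x = rng.chisquare(k, size=trials)

z1 = (x - k) / np.sqrt(2 * k)                   # ~ N(0, 1) for large k
z2 = np.sqrt(2 * x)                             # ~ N(sqrt(2k - 1), 1)

print(z1.mean(), z1.std())                      # ~0, ~1
print(z2.mean(), np.sqrt(2 * k - 1), z2.std())  # mean ~ sqrt(99), std ~ 1
```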

Page 8:

Other Continuous Distributions

t-distribution: used for estimating the mean of a normal distribution when the sample size is small. A t-distributed variable:
$$q = \frac{x}{\sqrt{z/k}}, \quad\text{where } x \sim N(0, 1) \text{ and } z \sim \chi_k^2$$
Its pdf is
$$p(q) = \frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\,\Gamma(k/2)} \left(1 + \frac{q^2}{k}\right)^{-(k+1)/2}$$

Mean: 0 for k > 1. Variance: k/(k − 2) for k > 2.

β-distribution Beta(α, β): the posterior distribution of the parameter p of a binomial distribution after observing α − 1 events with probability p and β − 1 events with probability 1 − p. Its pdf is
$$p(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1} = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1} (1-x)^{\beta-1}$$
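A minimal sketch of the Beta-posterior interpretation, assuming a flat Beta(1, 1) prior and hypothetical counts of 7 successes and 3 failures:

```python
# Sketch: Beta(alpha, beta) as the posterior of a binomial success
# probability p. Starting from a flat Beta(1, 1) prior, observing s
# successes and f failures gives a Beta(1 + s, 1 + f) posterior.
from scipy.stats import beta

s, f = 7, 3                      # hypothetical observations
posterior = beta(1 + s, 1 + f)

print(posterior.mean())          # (1 + s) / (2 + s + f) = 0.666...
print(posterior.interval(0.95))  # 95% credible interval for p
```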

Page 9:

Linear Algebra

Eigenvalues and eigenvectors: there exist λ and v such that Av = λv.

A real matrix A is called positive semidefinite if $\mathbf{x}^T A \mathbf{x} \ge 0$ for every nonzero vector x; A is called positive definite if $\mathbf{x}^T A \mathbf{x} > 0$.

Positive definite matrices act like positive numbers: all their eigenvalues are positive.

If A is symmetric, $A^T = A$, then its eigenvectors are orthogonal, $\mathbf{v}_i^T \mathbf{v}_j = 0$ for $i \ne j$. Therefore, a symmetric A can be diagonalized as
$$A = \Phi \Lambda \Phi^T \quad\text{and}\quad \Lambda = \Phi^T A \Phi$$
where $\Phi = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_l]$ and $\Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_l)$.
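A minimal numpy sketch of this diagonalization; the random symmetric matrix is constructed as BBᵀ so that it is also positive semidefinite.

```python
# Sketch: diagonalizing a symmetric matrix with numpy. eigh returns
# orthonormal eigenvectors Phi and eigenvalues lam with A = Phi L Phi^T.
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T                                          # symmetric, PSD

lam, Phi = np.linalg.eigh(A)                         # A Phi = Phi diag(lam)
print(np.allclose(Phi @ np.diag(lam) @ Phi.T, A))    # True: A = Phi L Phi^T
print(np.allclose(Phi.T @ A @ Phi, np.diag(lam)))    # True: L = Phi^T A Phi
print(np.all(lam >= -1e-12))                         # PSD: all eigenvalues >= 0
```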

Page 10:

Correlation Matrix and Inner Product Matrix

Principal component analysis (PCA): let x be a random variable in $R^l$. Its correlation matrix $\Sigma = E[\mathbf{x}\mathbf{x}^T]$ is positive semidefinite and thus can be diagonalized as
$$\Sigma = \Phi \Lambda \Phi^T$$
Assign $\mathbf{x}' = \Phi^T \mathbf{x}$; then $E[\mathbf{x}'\mathbf{x}'^T] = \Phi^T \Sigma \Phi = \Lambda$. Further assign $\mathbf{x}'' = \Lambda^{-1/2}\Phi^T \mathbf{x}$; then $E[\mathbf{x}''\mathbf{x}''^T] = I$.

Classical multidimensional scaling (classical MDS): given a distance matrix $D = \{d_{ij}\}$, the inner product matrix $G = \{\mathbf{x}_i^T \mathbf{x}_j\}$ can be computed by a bidirectional centering process
$$G = -\frac{1}{2}\left(I - \frac{1}{n}\mathbf{e}\mathbf{e}^T\right) D \left(I - \frac{1}{n}\mathbf{e}\mathbf{e}^T\right), \quad\text{where } \mathbf{e} = [1, 1, \ldots, 1]^T$$
G can be diagonalized as $G = \Phi' \Lambda' \Phi'^T$. Actually, $n\Lambda$ and $\Lambda'$ share the same set of eigenvalues. Because $G = X^T X$, where $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$, X can then be recovered as $X = \Lambda'^{1/2} \Phi'^T$.
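A sketch of classical MDS along these lines; note that the centering formula is applied here to squared distances, and the random 3-dimensional configuration is arbitrary. Coordinates are recovered only up to rotation and translation.

```python
# Sketch: classical MDS from a squared-distance matrix D, using the
# bidirectional centering formula above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10))                    # l = 3 dims, n = 10 points
D = np.square(X[:, :, None] - X[:, None, :]).sum(axis=0)  # squared distances

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n             # centering: I - (1/n) e e^T
G = -0.5 * J @ D @ J                            # inner-product (Gram) matrix

lam, Phi = np.linalg.eigh(G)
lam, Phi = lam[::-1], Phi[:, ::-1]              # sort eigenvalues descending
X_rec = np.diag(np.sqrt(np.clip(lam[:3], 0, None))) @ Phi[:, :3].T

# the recovered configuration reproduces G exactly (rank of G is <= 3 here)
print(np.allclose(X_rec.T @ X_rec, G, atol=1e-6))
```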

Page 11:

Cost Function Optimization

Find θ so that a differentiable function J(θ) is minimized.

Gradient descent method: start with an initial estimate θ(0) and adjust θ iteratively by
$$\theta_{new} = \theta_{old} + \Delta\theta, \quad \Delta\theta = -\mu \left.\frac{\partial J(\theta)}{\partial \theta}\right|_{\theta = \theta_{old}}, \quad \mu > 0$$

Taylor expansion of J(θ) at a stationary point θ₀:
$$J(\theta) = J(\theta_0) + (\theta - \theta_0)^T \mathbf{g} + \frac{1}{2}(\theta - \theta_0)^T H (\theta - \theta_0) + O(\|\theta - \theta_0\|^3)$$
where
$$\mathbf{g} = \left.\frac{\partial J(\theta)}{\partial \theta}\right|_{\theta = \theta_0} = \mathbf{0} \ \text{(stationary point)} \quad\text{and}\quad H(i, j) = \left.\frac{\partial^2 J(\theta)}{\partial \theta_i \partial \theta_j}\right|_{\theta = \theta_0}$$

Ignoring higher order terms within a neighborhood of θ₀, and with H positive semidefinite, substituting the update rule gives
$$\theta_{new} - \theta_0 = (I - \mu H)(\theta_{old} - \theta_0)$$
which will converge if every $|1 - \mu\lambda_i| < 1$, i.e., $\mu < 2/\lambda_{max}$. Therefore, the convergence speed is decided by the ratio $\lambda_{min}/\lambda_{max}$.
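The step-size bound μ < 2/λ_max can be seen on a toy quadratic; the diagonal Hessian below is an arbitrary choice with λ_max = 10.

```python
# Sketch: gradient descent on J(theta) = 0.5 theta^T H theta, illustrating
# the step-size bound mu < 2 / lambda_max derived above.
import numpy as np

H = np.diag([1.0, 10.0])            # lambda_min = 1, lambda_max = 10
grad = lambda th: H @ th            # gradient of J

def descend(mu, steps=100):
    th = np.array([1.0, 1.0])
    for _ in range(steps):
        th = th - mu * grad(th)
    return th

print(descend(0.19))   # mu < 2/10: converges toward the minimum at 0
print(descend(0.21))   # mu > 2/10: diverges along the lambda_max direction
```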

Page 12:

Newton's method: adjust θ iteratively by
$$\theta_{new} = \theta_{old} - H^{-1} \left.\frac{\partial J(\theta)}{\partial \theta}\right|_{\theta = \theta_{old}}$$
It converges much faster than gradient descent. In fact, from the Taylor expansion we have
$$\theta_{new} = \theta_{old} - H^{-1} H (\theta_{old} - \theta_0) = \theta_0$$
so for a quadratic cost the minimum is found in one iteration.

Conjugate gradient method:
$$\Delta\theta^t = -\mathbf{g}^t + \beta^t \Delta\theta^{t-1}, \quad\text{where}\quad \mathbf{g}^t = \left.\frac{\partial J(\theta)}{\partial \theta}\right|_{\theta = \theta^t}$$
and
$$\beta^t = \frac{\mathbf{g}^{tT}(\mathbf{g}^t - \mathbf{g}^{t-1})}{\mathbf{g}^{(t-1)T}\mathbf{g}^{t-1}} \ \text{(Polak-Ribière)} \quad\text{or}\quad \beta^t = \frac{\mathbf{g}^{tT}\mathbf{g}^t}{\mathbf{g}^{(t-1)T}\mathbf{g}^{t-1}} \ \text{(Fletcher-Reeves)}$$
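A one-step check of Newton's method on a quadratic cost; the Hessian and linear term below are arbitrary.

```python
# Sketch: on a quadratic, Newton's method reaches the minimum in a single
# iteration, whereas gradient descent only approaches it gradually.
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])    # Hessian (positive definite)
b = np.array([1.0, -1.0])
J_grad = lambda th: H @ th - b            # gradient of 0.5 th^T H th - b^T th

theta = np.array([5.0, 5.0])              # arbitrary starting point
theta_new = theta - np.linalg.solve(H, J_grad(theta))   # one Newton step

print(theta_new)                          # equals the solution of H theta = b
print(np.linalg.solve(H, b))              # the true minimizer, same vector
```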

Page 13:

Constrained Optimization with Equality Constraints

Minimize J(θ) subject to $f_i(\theta) = 0$ for $i = 1, 2, \ldots, m$.

Minimization happens where the gradient of J is a linear combination of the gradients of the constraints:
$$\frac{\partial J(\theta)}{\partial \theta} = \sum_{i=1}^{m} \lambda_i \frac{\partial f_i(\theta)}{\partial \theta}$$

Lagrange multipliers: construct
$$L(\theta, \lambda) = J(\theta) + \sum_{i=1}^{m} \lambda_i f_i(\theta)$$
and solve
$$\frac{\partial L(\theta, \lambda)}{\partial \theta} = 0, \qquad \frac{\partial L(\theta, \lambda)}{\partial \lambda} = 0$$
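A minimal sketch of the Lagrange multiplier recipe using sympy, on a made-up problem (minimize θ₁² + θ₂² subject to θ₁ + θ₂ = 1):

```python
# Sketch: equality-constrained minimization via a Lagrangian, with sympy.
import sympy as sp

t1, t2, lam = sp.symbols("t1 t2 lam")
J = t1**2 + t2**2
f = t1 + t2 - 1
L = J + lam * f                    # Lagrangian L(theta, lambda)

# Solve grad_theta L = 0 and grad_lambda L = 0 simultaneously
sol = sp.solve([sp.diff(L, t1), sp.diff(L, t2), sp.diff(L, lam)], [t1, t2, lam])
print(sol)                         # {t1: 1/2, t2: 1/2, lam: -1}
```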

Page 14:

Constrained Optimization with Inequality Constraints

Minimize J(θ) subject to $f_i(\theta) \ge 0$ for $i = 1, 2, \ldots, m$.

The constraints $f_i(\theta) \ge 0$, $i = 1, 2, \ldots, m$, define a feasible region in which the answer lies.

Karush–Kuhn–Tucker (KKT) conditions: a set of necessary conditions that a local optimizer θ* has to satisfy. There exists a vector λ of Lagrange multipliers such that
$$(1)\ \frac{\partial L(\theta^*, \lambda)}{\partial \theta} = 0 \qquad (2)\ \lambda_i \ge 0 \ \text{for } i = 1, 2, \ldots, m \qquad (3)\ \lambda_i f_i(\theta^*) = 0 \ \text{for } i = 1, 2, \ldots, m$$

Remarks (see the sketch after this list):
(1) The most natural condition.
(2) $f_i(\theta^*)$ is inactive if $\lambda_i = 0$.
(3) $\lambda_i \ge 0$ if the minimum is on the boundary $f_i(\theta^*) = 0$.
(4) The (unconstrained) minimum lies in the interior of the region if all $\lambda_i = 0$.
(5) For convex J(θ) and a convex region, a local minimum is the global minimum.
(6) Still difficult to compute: assume some $f_i(\theta^*)$'s are active, solve, and check $\lambda_i \ge 0$.
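A hedged sketch of conditions (1)-(3) on a made-up problem, using scipy.optimize (which applies SLSQP when constraints are given) and then checking the multiplier by hand:

```python
# Sketch: minimize J(theta) = (theta1 - 2)^2 + (theta2 - 1)^2
# subject to f(theta) = 1 - theta1 - theta2 >= 0, then verify KKT.
import numpy as np
from scipy.optimize import minimize

J = lambda th: (th[0] - 2) ** 2 + (th[1] - 1) ** 2
f = lambda th: 1 - th[0] - th[1]          # constraint f(theta) >= 0

res = minimize(J, x0=np.zeros(2), constraints=[{"type": "ineq", "fun": f}])
theta = res.x
print(theta)                              # ~ [1, 0]: the constraint is active

# KKT check: grad J = lambda * grad f with lambda >= 0, and lambda * f = 0
grad_J = np.array([2 * (theta[0] - 2), 2 * (theta[1] - 1)])   # ~ [-2, -2]
grad_f = np.array([-1.0, -1.0])
lam = grad_J[0] / grad_f[0]               # ~ 2 >= 0
print(lam, f(theta))                      # multiplier ~2, and f(theta*) ~ 0
```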

Page 15:

Convex function: $f(\cdot): S \subseteq R^l \to R$ is convex if, for every $\theta, \theta' \in S$ and $\lambda \in [0, 1]$,
$$f(\lambda\theta + (1-\lambda)\theta') \le \lambda f(\theta) + (1-\lambda) f(\theta')$$

Concave function:
$$f(\lambda\theta + (1-\lambda)\theta') \ge \lambda f(\theta) + (1-\lambda) f(\theta')$$

Convex set: $S \subseteq R^l$ is a convex set if, for every $\theta, \theta' \in S$ and $\lambda \in [0, 1]$,
$$\lambda\theta + (1-\lambda)\theta' \in S$$

If $f(\mathbf{x})$ is concave, then $\{\mathbf{x} \mid f(\mathbf{x}) \ge b\}$ is a convex set.

A local minimum of a convex function is also a global minimum.

Page 16:

Min-Max Duality

Game: A pays $F(x, y)$ to B, while A chooses x and B chooses y.

A's goal: $\min_x \max_y F(x, y)$. B's goal: $\max_y \min_x F(x, y)$. The two problems are dual to each other.

In general:
$$\min_x F(x, y) \le F(x, y) \le \max_y F(x, y)$$
Therefore,
$$\max_y \min_x F(x, y) \le \min_x \max_y F(x, y)$$

Saddle point condition: if there exists $(x^*, y^*)$ such that
$$F(x^*, y) \le F(x^*, y^*) \le F(x, y^*)$$
or equivalently
$$F(x^*, y^*) = \max_y \min_x F(x, y) = \min_x \max_y F(x, y)$$
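A tiny numeric illustration of max-min ≤ min-max over pure strategies; the payoff matrix is arbitrary and happens to have no (pure) saddle point:

```python
# Sketch: max-min <= min-max on a finite payoff matrix F[x, y].
# Rows are A's choices x, columns are B's choices y; A pays F[x, y] to B.
import numpy as np

F = np.array([[3.0, 1.0],
              [0.0, 2.0]])

max_min = F.min(axis=0).max()   # max_y min_x F(x, y)
min_max = F.max(axis=1).min()   # min_x max_y F(x, y)
print(max_min, min_max)         # 1.0 <= 2.0: strict, so no saddle point here
```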

Page 17:

Lagrange duality. Recall the optimization problem:
Minimize $J(\theta)$ subject to $f_i(\theta) \ge 0$ for $i = 1, 2, \ldots, m$.

Lagrange function:
$$L(\theta, \lambda) = J(\theta) - \sum_{i=1}^{m} \lambda_i f_i(\theta)$$
Because $\max_{\lambda \ge 0} L(\theta, \lambda) = J(\theta)$, we have
$$\min_\theta J(\theta) = \min_\theta \max_{\lambda \ge 0} L(\theta, \lambda)$$

Convex programming: for a large class of applications, $J(\theta)$ is convex and the $f_i(\theta)$'s are concave. Then the minimization solution $(\theta^*, \lambda^*)$ is a saddle point of $L(\theta, \lambda)$:
$$L(\theta^*, \lambda) \le L(\theta^*, \lambda^*) \le L(\theta, \lambda^*)$$
$$L(\theta^*, \lambda^*) = \min_\theta \max_{\lambda \ge 0} L(\theta, \lambda) = \max_{\lambda \ge 0} \min_\theta L(\theta, \lambda)$$
Therefore, the optimization problem becomes
$$\max_{\lambda \ge 0} \min_\theta L(\theta, \lambda), \quad\text{i.e.,}\quad \max_\lambda \min_\theta L(\theta, \lambda) \ \text{subject to } \lambda \ge 0$$
MUCH SIMPLER!
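A one-variable sketch of this dual construction with sympy; the problem (minimize θ² subject to θ ≥ 1) is made up, and strong duality holds here because it is convex:

```python
# Sketch: Lagrange duality on a toy convex problem.
# Minimize J(theta) = theta^2 subject to f(theta) = theta - 1 >= 0.
import sympy as sp

theta, lam = sp.symbols("theta lam", real=True)
L = theta**2 - lam * (theta - 1)        # L(theta, lambda) = J - lambda f

theta_star = sp.solve(sp.diff(L, theta), theta)[0]   # inner min: theta = lam/2
dual = sp.simplify(L.subs(theta, theta_star))        # dual function: lam - lam^2/4

lam_star = sp.solve(sp.diff(dual, lam), lam)[0]      # outer max: lambda* = 2
print(lam_star, theta_star.subs(lam, lam_star))      # lambda* = 2, theta* = 1
print(dual.subs(lam, lam_star))                      # dual optimum = 1 = J(theta*)
```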

Page 18:

Mercer's Theorem and the Kernel Method

Mercer's theorem: let $\mathbf{x} \in R^l$ and a given mapping $\mathbf{x} \mapsto \phi(\mathbf{x}) \in H$ (H denotes a Hilbert space, i.e., a finite- or infinite-dimensional Euclidean space). Then the inner product $\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$ can be expressed as a kernel function:
$$\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = K(\mathbf{x}, \mathbf{y})$$
where $K(\mathbf{x}, \mathbf{y})$ is symmetric, continuous, and positive semi-definite. The opposite is also true.

The kernel method can transform any algorithm that solely depends on the dot product between two vectors into a kernelized version, by replacing the dot product with the kernel function. The kernelized version is equivalent to the algorithm operating in the range space of φ. Because kernels are used, however, φ is never explicitly computed.
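A minimal illustration of the kernel trick, using the 2nd-degree polynomial kernel K(x, y) = (xᵀy)², whose explicit map φ is known in closed form for l = 2:

```python
# Sketch: the kernel value equals the inner product in the feature space H,
# so phi itself never has to be computed.
import numpy as np

def phi(v):
    # explicit map for l = 2: phi(v) = (v1^2, v2^2, sqrt(2) v1 v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))   # inner product in H: 1.0
print(np.dot(x, y) ** 2)        # kernel value: the same, without using phi
```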