Post on 21-Dec-2015
CS6720 --- Pattern Recognition
Review of Prerequisites in Math and Statistics
Prepared by Li Yang
Based on Appendix chapters of
Pattern Recognition, 4th Ed., by S. Theodoridis and K. Koutroumbas,
and figures from Wikipedia.org
Probability and Statistics
Probability P(A) of an event A: a real number between 0 and 1.
Joint probability P(A ∩ B): probability that both A and B occur in a single experiment. P(A ∩ B) = P(A)P(B) if A and B are independent.
Probability P(A ∪ B) of the union of A and B: either A or B occurs in a single experiment. P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive.
Conditional probability:
P(A|B) = P(A ∩ B) / P(B)
Therefore, the Bayes rule:
P(A|B)P(B) = P(B|A)P(A), i.e., P(A|B) = P(B|A)P(A) / P(B)
Total probability: let A_1, ..., A_m be such that ∑_{i=1}^m P(A_i) = 1; then
P(B) = ∑_{i=1}^m P(B|A_i)P(A_i)
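To make the Bayes rule and total probability concrete, here is a minimal numeric sketch (the prior and conditional probabilities are made-up values for illustration only):

```python
# Hypothetical two-event example: A = "condition present", B = "test positive".
p_a = 0.01              # P(A), assumed prior
p_b_given_a = 0.95      # P(B|A)
p_b_given_not_a = 0.05  # P(B|not A)

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes rule: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
# p_b = 0.059 and p_a_given_b ~ 0.161: a positive test still leaves A unlikely
```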
Probability density function (pdf): p(x) for a continuous random variable x,
P(a ≤ x ≤ b) = ∫_a^b p(x) dx
Total and conditional probabilities can also be extended to pdf's.
Mean and variance: let p(x) be the pdf of a random variable x; then
E[x] = ∫ x p(x) dx and σ² = E[(x − E[x])²] = ∫ (x − E[x])² p(x) dx
Statistical independence:
p_{x,y}(x, y) = p_x(x) p_y(y)
Kullback-Leibler divergence ("distance"?) of pdf's:
L(p(x), p'(x)) = −∫ p(x) ln( p'(x) / p(x) ) dx
Pay attention that L(p(x), p'(x)) ≠ L(p'(x), p(x)).
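The asymmetry of the KL divergence is easy to verify on discrete pmf's (the two example distributions below are arbitrary):

```python
import math

def kl(p, q):
    # L(p, q) = sum_i p_i * ln(p_i / q_i), the discrete form of the KL divergence
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.4, 0.1]
q = [0.2, 0.5, 0.3]

d_pq = kl(p, q)   # > 0
d_qp = kl(q, p)   # > 0, but not equal to d_pq: KL is not symmetric
d_pp = kl(p, p)   # 0: a pdf has zero divergence from itself
```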
Characteristic function of a pdf:
Φ(Ω) = ∫ p(x) exp(jΩᵀx) dx = E[exp(jΩᵀx)]; in the scalar case, Φ(s) = ∫ p(x) exp(sx) dx = E[exp(sx)]
2nd characteristic function:
Ψ(s) = ln Φ(s)
n-th order moment:
d^nΦ(0)/ds^n = E[x^n]
Cumulants:
κ_n = d^nΨ(0)/ds^n
When E[x] = 0: κ_0 = 0, κ_1 = 0, κ_2 = σ² = E[x²], κ_3 = E[x³] (skewness), κ_4 = E[x⁴] − 3σ⁴ (kurtosis)
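The zero-mean cumulant relations can be sanity-checked by numerically integrating moments of a Gaussian pdf (σ = 2 and the midpoint-rule settings are arbitrary choices): for N(0, σ²), the skewness cumulant E[x³] and the kurtosis cumulant E[x⁴] − 3σ⁴ are both zero.

```python
import math

SIGMA = 2.0

def gauss_pdf(x):
    # pdf of N(0, SIGMA^2)
    return math.exp(-x * x / (2 * SIGMA**2)) / (math.sqrt(2 * math.pi) * SIGMA)

def moment(n, lo=-30.0, hi=30.0, steps=100000):
    # E[x^n] = integral of x^n p(x) dx, via a simple midpoint rule
    h = (hi - lo) / steps
    return h * sum((lo + (i + 0.5) * h)**n * gauss_pdf(lo + (i + 0.5) * h)
                   for i in range(steps))

m2 = moment(2)                      # kappa_2 = sigma^2 = 4
m3 = moment(3)                      # kappa_3 = 0: no skewness
kappa4 = moment(4) - 3 * SIGMA**4   # kappa_4 = 0: Gaussian has zero kurtosis
```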
Discrete Distributions
Binomial distribution B(n,p): repeatedly grab n balls, each with a probability p of getting a black ball. The probability of getting k black balls:
P(k) = C(n,k) p^k (1 − p)^(n−k)
When n → ∞ and np remains constant, B(n,p) → Poisson(np).
Poisson distribution: probability of the number of events occurring in a fixed period of time if these events occur with a known average rate λ:
P(k; λ) = λ^k e^{−λ} / k!
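The binomial-to-Poisson limit can be illustrated directly (n = 1000 and p = 0.003 are arbitrary values chosen so that np = 3):

```python
import math

def binom_pmf(k, n, p):
    # B(n, p): C(n, k) p^k (1-p)^(n-k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # Poisson(lam): lam^k e^{-lam} / k!
    return lam**k * math.exp(-lam) / math.factorial(k)

n, p = 1000, 0.003   # np = 3 held constant while n is large
# for small k, B(n, p) and Poisson(np) agree to about three decimal places
diffs = [abs(binom_pmf(k, n, p) - poisson_pmf(k, 3.0)) for k in range(10)]
```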
Normal (Gaussian) Distribution
Univariate N(μ, σ²):
p(x) = 1/(√(2π) σ) exp(−(x − μ)² / (2σ²))
Multivariate N(μ, Σ) with mean μ and covariance matrix Σ:
p(x) = 1/((2π)^{l/2} |Σ|^{1/2}) exp(−(1/2) (x − μ)ᵀ Σ^{−1} (x − μ))
where μ_i = E[x_i] and σ_{ij} = E[(x_i − μ_i)(x_j − μ_j)].
Central limit theorem: let z = ∑_{i=1}^n x_i for i.i.d. zero-mean, unit-variance x_i; then z/√n ~ N(0, 1) when n → ∞, irrespective of the pdf's of the x_i.
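A quick simulation of the central limit theorem (the uniform distribution, n = 50, 20000 trials, and the seed are all arbitrary choices made here for repeatability): standardized sums of i.i.d. uniforms behave like N(0, 1).

```python
import math, random

random.seed(0)

n, trials = 50, 20000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and std of Uniform(0, 1)

zs = []
for _ in range(trials):
    z = sum(random.random() for _ in range(n))
    # standardize: (z - n*mu) / (sqrt(n)*sigma) should be approximately N(0, 1)
    zs.append((z - n * mu) / (math.sqrt(n) * sigma))

mean = sum(zs) / trials                                   # ~ 0
var = sum((v - mean)**2 for v in zs) / trials             # ~ 1
within_one_sigma = sum(1 for v in zs if abs(v) <= 1.0) / trials  # ~ 0.683
```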
Other Continuous Distributions
Chi-square (χ²) distribution of k degrees of freedom: distribution of a sum of squares of k independent standard normal random variables, that is,
x = x_1² + x_2² + ... + x_k², where x_i ~ N(0, 1)
p(y) = 1/(2^{k/2} Γ(k/2)) y^{k/2 − 1} e^{−y/2} step(y), where Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt
Mean: k, Variance: 2k
Assume x ~ χ²(k). Then (x − k)/√(2k) ~ N(0, 1) as k → ∞ by the central limit theorem.
Also, √(2x) is approximately normally distributed with mean √(2k − 1) and unit variance.
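Both the χ² moments and the √(2x) normal approximation can be checked by simulation (k = 10, 20000 trials, and the seed are arbitrary choices):

```python
import math, random

random.seed(1)

k, trials = 10, 20000
# chi-square samples built from the definition: sums of squared N(0,1) draws
xs = [sum(random.gauss(0.0, 1.0)**2 for _ in range(k)) for _ in range(trials)]

mean = sum(xs) / trials                         # ~ k
var = sum((x - mean)**2 for x in xs) / trials   # ~ 2k
# sqrt(2x) is approximately N(sqrt(2k - 1), 1)
y_mean = sum(math.sqrt(2 * x) for x in xs) / trials   # ~ sqrt(19)
```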
Other Continuous Distributions
t-distribution: used for estimating the mean of a normal distribution when the sample size is small. A t-distributed variable:
q = x / √(z/k), where x ~ N(0, 1) and z ~ χ²(k)
p(q) = Γ((k + 1)/2) / (√(kπ) Γ(k/2)) · (1 + q²/k)^{−(k+1)/2}
Mean: 0 for k > 1; variance: k/(k − 2) for k > 2
β-distribution Beta(α, β): the posterior distribution of the parameter p of a binomial distribution after α − 1 events with probability p and β − 1 events with probability 1 − p:
p(x) = Γ(α + β) / (Γ(α)Γ(β)) · x^{α−1} (1 − x)^{β−1}, 0 ≤ x ≤ 1
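The β-distribution's role as a binomial posterior can be sketched with a flat Beta(1, 1) prior (the counts of 7 successes and 3 failures are made-up data):

```python
import math

def beta_pdf(x, a, b):
    # Beta(a, b) density: Gamma(a+b) / (Gamma(a) Gamma(b)) * x^(a-1) (1-x)^(b-1)
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * x**(a - 1) * (1 - x)**(b - 1)

# After s events with probability p and f events with probability 1-p,
# a flat Beta(1, 1) prior updates to Beta(1 + s, 1 + f).
s, f = 7, 3
a, b = 1 + s, 1 + f                # Beta(8, 4)

post_mean = a / (a + b)            # 8/12 = 2/3
post_mode = (a - 1) / (a + b - 2)  # 7/10, the observed fraction of successes

# the posterior pdf integrates to 1 (midpoint rule)
steps = 100000
h = 1.0 / steps
total = h * sum(beta_pdf((i + 0.5) * h, a, b) for i in range(steps))
```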
Linear Algebra
Eigenvalues and eigenvectors: there exist λ and v such that Av = λv.
A real matrix A is called positive semidefinite if xᵀAx ≥ 0 for every nonzero vector x; A is called positive definite if xᵀAx > 0.
Positive definite matrices act like positive numbers: all eigenvalues are positive.
If A is symmetric, Aᵀ = A, then its eigenvectors are orthogonal, v_iᵀv_j = 0 for i ≠ j. Therefore, a symmetric A can be diagonalized as
A = ΦΛΦᵀ and ΦᵀAΦ = Λ, where Φ = [v_1, v_2, ..., v_l] and Λ = diag(λ_1, λ_2, ..., λ_l)
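A minimal sketch of the symmetric case, worked by hand for a 2×2 matrix (the entries are arbitrary): eigenvalues come from the characteristic polynomial, the eigenvectors turn out orthogonal, and both eigenvalues are positive, so this A is positive definite.

```python
import math

# Symmetric 2x2 matrix A = [[a11, a12], [a12, a22]]
a11, a12, a22 = 2.0, 1.0, 3.0

# Eigenvalues: roots of lambda^2 - tr*lambda + det = 0
tr, det = a11 + a22, a11 * a22 - a12 * a12
disc = math.sqrt(tr * tr - 4 * det)
l1, l2 = (tr + disc) / 2, (tr - disc) / 2

def unit(vx, vy):
    nrm = math.hypot(vx, vy)
    return (vx / nrm, vy / nrm)

# From (A - lambda I)v = 0: v = (a12, lambda - a11), then normalize
v1 = unit(a12, l1 - a11)
v2 = unit(a12, l2 - a11)

dot = v1[0] * v2[0] + v1[1] * v2[1]   # 0: eigenvectors of a symmetric A are orthogonal
Av1 = (a11 * v1[0] + a12 * v1[1],
       a12 * v1[0] + a22 * v1[1])     # equals l1 * v1
```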
Correlation Matrix and Inner Product Matrix
Principal component analysis (PCA): let x be a random variable in R^l; its correlation matrix Σ = E[xxᵀ] is positive semidefinite and thus can be diagonalized as
Σ = ΦΛΦᵀ
Assign x' = Φᵀx; then E[x'x'ᵀ] = Λ. Further assign x'' = Λ^{−1/2}Φᵀx; then E[x''x''ᵀ] = I.
Classical multidimensional scaling (classical MDS): given the matrix D = {d_ij²} of pairwise squared distances, the inner product matrix G = {x_iᵀx_j} can be computed by a bidirectional centering process:
G = −(1/2) (I − (1/n) eeᵀ) D (I − (1/n) eeᵀ), where e = [1, 1, ..., 1]ᵀ
G can be diagonalized as G = ΦΛ'Φᵀ. Because G = XᵀX, where X = [x_1, ..., x_n], X can then be recovered as X = Λ'^{1/2}Φᵀ. Actually, nΛ and Λ' share the same set of eigenvalues.
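The double-centering step can be sketched on a tiny example (three collinear, already-centered points; note that D must hold squared distances):

```python
# Points on a line, centered at the origin
pts = [-1.0, 0.0, 1.0]
n = len(pts)

# D holds SQUARED pairwise distances d_ij^2
D = [[(pts[i] - pts[j])**2 for j in range(n)] for i in range(n)]

# Centering matrix J = I - (1/n) e e^T, with e = [1, ..., 1]^T
J = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# G = -1/2 J D J recovers the inner products x_i^T x_j
G = [[-0.5 * g for g in row] for row in matmul(matmul(J, D), J)]
```

For these points, G[0][0] = 1 = x_1·x_1 and G[0][2] = −1 = x_1·x_3, matching the original inner products.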
Cost Function Optimization
Find θ so that a differentiable function J(θ) is minimized.
Gradient descent method:
Start with an initial estimate θ(0). Adjust θ iteratively by
θ_new = θ_old + Δθ, where Δθ = −μ (∂J(θ)/∂θ)|_{θ=θ_old}, μ > 0
Taylor expansion of J(θ) at a stationary point θ_0:
J(θ) = J(θ_0) + (θ − θ_0)ᵀ g + (1/2) (θ − θ_0)ᵀ H (θ − θ_0) + O(‖θ − θ_0‖³)
where g = (∂J(θ)/∂θ)|_{θ=θ_0} = 0 and H(i, j) = (∂²J(θ)/∂θ_i∂θ_j)|_{θ=θ_0}.
Ignoring higher order terms within a neighborhood of θ_0, and since H is positive semidefinite, we get
θ_new − θ_0 = (I − μH)(θ_old − θ_0)
which will converge if |1 − μλ_i| < 1 for every eigenvalue λ_i of H, i.e., μ < 2/λ_max. Therefore, the convergence speed is decided by λ_min/λ_max.
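The step-size condition μ < 2/λ_max can be seen on a toy quadratic with a diagonal Hessian (the eigenvalues 1 and 10 and the two step sizes are arbitrary choices):

```python
# J(theta) = 1/2 * sum_i lam_i * theta_i^2, so the gradient component is lam_i * theta_i
lams = (1.0, 10.0)   # lambda_min = 1, lambda_max = 10

def descend(mu, steps):
    theta = [1.0, 1.0]
    for _ in range(steps):
        # gradient descent update: theta <- theta - mu * grad J(theta)
        theta = [t - mu * lam * t for t, lam in zip(theta, lams)]
    return theta

good = descend(0.18, 500)  # mu < 2/lambda_max = 0.2: converges to (0, 0)
bad = descend(0.25, 50)    # mu > 2/lambda_max: diverges along the lambda_max direction
```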
Newton's method: adjust θ iteratively by
θ_new = θ_old − H^{−1} (∂J(θ)/∂θ)|_{θ=θ_old}
It converges much faster than gradient descent. In fact, from the Taylor expansion we have ∂J(θ)/∂θ = H(θ − θ_0), so
θ_new = θ_old − H^{−1} H(θ_old − θ_0) = θ_0
The minimum is found in one iteration.
Conjugate gradient method:
Δθ^t = −g^t + β^t Δθ^{t−1}, where g^t = (∂J(θ)/∂θ)|_{θ=θ^t}
and β^t = (g^{tᵀ}(g^t − g^{t−1})) / (g^{t−1ᵀ} g^{t−1}) or β^t = (g^{tᵀ} g^t) / (g^{t−1ᵀ} g^{t−1})
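Newton's one-iteration convergence on a quadratic can be sketched directly (the Hessian, minimizer, and starting point below are arbitrary):

```python
# J(theta) = 1/2 (theta - t0)^T H (theta - t0), so grad J = H (theta - t0)
H = [[4.0, 1.0], [1.0, 3.0]]   # symmetric positive definite Hessian
t0 = [2.0, -1.0]               # the minimizer (unknown to the iteration)
theta = [10.0, 10.0]           # arbitrary starting estimate

g = [H[0][0] * (theta[0] - t0[0]) + H[0][1] * (theta[1] - t0[1]),
     H[1][0] * (theta[0] - t0[0]) + H[1][1] * (theta[1] - t0[1])]

# theta_new = theta - H^{-1} g, using the closed-form 2x2 inverse
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
d = [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
     (-H[1][0] * g[0] + H[0][0] * g[1]) / det]
theta = [theta[0] - d[0], theta[1] - d[1]]   # lands exactly on t0 in one step
```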
Constrained Optimization with Equality Constraints
Minimize J(θ) subject to f_i(θ) = 0 for i = 1, 2, ..., m.
Minimization happens at a point where the gradient of J(θ) is a linear combination of the gradients of the f_i(θ).
Lagrange multipliers: construct
L(θ, λ) = J(θ) + ∑_{i=1}^m λ_i f_i(θ)
and solve ∂L(θ, λ)/∂θ = 0 and ∂L(θ, λ)/∂λ = 0.
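A worked Lagrange-multiplier example (the cost and constraint are made-up): minimize J(θ) = θ₁² + θ₂² subject to f(θ) = θ₁ + θ₂ − 1 = 0. Setting ∂L/∂θ = 0 and ∂L/∂λ = 0 gives θ₁ = θ₂ = 1/2 with λ = −1:

```python
# L(theta, lam) = theta1^2 + theta2^2 + lam * (theta1 + theta2 - 1)
theta1, theta2, lam = 0.5, 0.5, -1.0

dL_dtheta1 = 2 * theta1 + lam   # = 0 at the solution
dL_dtheta2 = 2 * theta2 + lam   # = 0
dL_dlam = theta1 + theta2 - 1   # = 0: the constraint holds
cost = theta1**2 + theta2**2    # 0.5, the constrained minimum
```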
Constrained Optimization with Inequality Constraints
Minimize J(θ) subject to f_i(θ) ≥ 0 for i = 1, 2, ..., m.
f_i(θ) ≥ 0, i = 1, 2, ..., m defines a feasible region in which the answer lies.
Karush-Kuhn-Tucker (KKT) conditions: a set of necessary conditions which a local optimizer θ* has to satisfy. There exists a vector λ of Lagrange multipliers such that
(1) ∂L(θ*, λ)/∂θ = 0
(2) λ_i ≥ 0 for i = 1, 2, ..., m
(3) λ_i f_i(θ*) = 0 for i = 1, 2, ..., m
Remarks:
(1) is the most natural condition.
(2) f_i(θ*) is inactive if λ_i = 0.
(3) λ_i > 0 if the minimum is on f_i(θ*) = 0.
(4) The (unconstrained) minimum lies in the interior of the region if all λ_i = 0.
(5) For convex J(θ) and a convex region, a local minimum is the global minimum.
(6) Still difficult to compute: assume some of the f_i(θ*)'s are active, then check λ_i ≥ 0.
Convex function: f: S ⊆ R^l → R is convex if, for every θ, θ' ∈ S and λ ∈ [0, 1],
f(λθ + (1 − λ)θ') ≤ λf(θ) + (1 − λ)f(θ')
Concave function: the same with ≥ in place of ≤.
Convex set: S ⊆ R^l is a convex set if λθ + (1 − λ)θ' ∈ S for every θ, θ' ∈ S and λ ∈ [0, 1].
If f(θ) is concave, then the set X = {θ | f(θ) ≥ b} is a convex set.
A local minimum of a convex function is also a global minimum.
Min-Max duality
Game: A pays $F(x, y) to B, where A chooses x and B chooses y.
A's goal: min_x max_y F(x, y); B's goal: max_y min_x F(x, y).
The two problems are dual to each other.
In general: min_x F(x, y) ≤ F(x, y) ≤ max_y F(x, y).
Therefore, max_y min_x F(x, y) ≤ min_x max_y F(x, y).
Saddle point condition: if there exists (x*, y*) such that
F(x*, y) ≤ F(x*, y*) ≤ F(x, y*) for all x, y
or, equivalently,
F(x*, y*) = min_x max_y F(x, y) = max_y min_x F(x, y)
Lagrange duality: recall the optimization problem:
Minimize J(θ) s.t. f_i(θ) ≥ 0 for i = 1, 2, ..., m
Lagrange function: L(θ, λ) = J(θ) − ∑_{i=1}^m λ_i f_i(θ)
Because max_{λ≥0} L(θ, λ) = J(θ), we have min_θ J(θ) = min_θ max_{λ≥0} L(θ, λ).
Convex programming: for a large class of applications, J(θ) is convex and the f_i(θ)'s are concave. Then the minimization solution (θ*, λ*) is a saddle point of L(θ, λ):
L(θ*, λ) ≤ L(θ*, λ*) ≤ L(θ, λ*)
L(θ*, λ*) = min_θ max_{λ≥0} L(θ, λ) = max_{λ≥0} min_θ L(θ, λ)
Therefore, the optimization problem becomes max_{λ≥0} min_θ L(θ, λ), or:
max_λ L(θ, λ) subject to λ ≥ 0 and ∂L(θ, λ)/∂θ = 0
MUCH SIMPLER!
Mercer's Theorem and the Kernel Method
Mercer's theorem: let x ∈ R^l and a given mapping φ: x → φ(x) ∈ H (H denotes a Hilbert space, i.e., a finite- or infinite-dimensional Euclidean space). Then the inner product ⟨φ(x), φ(y)⟩ can be expressed as a kernel function:
⟨φ(x), φ(y)⟩ = K(x, y)
where K(x, y) is symmetric, continuous, and positive semi-definite. The opposite is also true.
The kernel method can transform any algorithm that solely depends on the dot product between two vectors into a kernelized version, by replacing the dot product with the kernel function. The kernelized version is equivalent to the algorithm operating in the range space of φ. Because kernels are used, however, φ is never explicitly computed.
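A classic instance of the kernel trick (the second-degree polynomial kernel on R²; the test vectors are arbitrary): K(x, y) = (xᵀy)² equals the inner product of the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²), so an algorithm can use K without ever computing φ.

```python
import math

def K(x, y):
    # polynomial kernel of degree 2 on R^2
    return (x[0] * y[0] + x[1] * y[1])**2

def phi(x):
    # explicit feature map into R^3 whose dot product reproduces K
    return (x[0]**2, math.sqrt(2.0) * x[0] * x[1], x[1]**2)

x, y = (1.0, 2.0), (3.0, -1.0)
via_kernel = K(x, y)
via_map = sum(a * b for a, b in zip(phi(x), phi(y)))
# via_kernel == via_map: the kernel computes the feature-space inner product
```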