
A Bahadur Type Representation of the Linear Support Vector Machine and its Relative Efficiency
Yoonkyung Lee, Department of Statistics, The Ohio State University
June 8, 2009, Machine Learning Summer School/Workshop, University of Chicago


Page 1: A Bahadur Type Representation of the Linear Support Vector Machine (web.cse.ohio-state.edu/mlss09/mlss09_talks/8.june-MON/Lee.pdf)

A Bahadur Type Representation of the Linear Support Vector Machine and its Relative Efficiency

Yoonkyung Lee
Department of Statistics
The Ohio State University

June 8, 2009
Machine Learning Summer School/Workshop
University of Chicago

Page 2:

Outline

◮ Support Vector Machines
◮ Statistical Properties of SVM
◮ Main Results (asymptotic analysis of the linear SVM)
◮ An Illustrative Example
◮ Asymptotic Relative Efficiency
◮ Discussion

Page 3:

Applications

◮ Handwritten digit recognition
◮ Cancer diagnosis with microarray data
◮ Text categorization

Page 4:

Classification

◮ x = (x1, . . . , xd) ∈ R^d
◮ y ∈ Y = {1, . . . , k}
◮ Learn a rule φ : R^d → Y from the training data {(x_i, y_i), i = 1, . . . , n}, where the (x_i, y_i) are i.i.d. with distribution P(X, Y).
◮ The 0-1 loss function:

  L(y, φ(x)) = I(y ≠ φ(x))

Page 5:

Methods of Regularization (Penalization)

Find f(x) ∈ F minimizing

  (1/n) ∑_{i=1}^n L(y_i, f(x_i)) + λJ(f).

◮ Empirical risk + penalty
◮ F: a class of candidate functions
◮ J(f): complexity of the model f
◮ λ > 0: a regularization parameter
◮ Without the penalty J(f), the problem is ill-posed
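The template above has no closed form under the hinge loss, but with squared-error loss and the penalty J(f) = ‖w‖² the minimizer does, which makes the empirical-risk-plus-penalty structure concrete. A minimal sketch on synthetic data (the data and λ here are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Minimize (1/n) * sum_i (y_i - w'x_i)^2 + lam * ||w||^2.
# Setting the gradient to zero gives (X'X/n + lam*I) w = X'y/n.
w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# First-order condition: the penalized-risk gradient vanishes at w.
grad = 2 * (X.T @ (X @ w - y)) / n + 2 * lam * w
print(np.max(np.abs(grad)))  # ~0
```

As λ → 0 the solution approaches ordinary least squares; the penalty shrinks w toward zero and keeps the linear system well conditioned, the finite-sample face of the ill-posedness remark above.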

Page 6:

Maximum Margin Hyperplane

[Figure: separating hyperplane wx + b = 0 with margin boundaries wx + b = −1 and wx + b = 1; margin = 2/‖w‖]

Page 7:

Support Vector Machines

Boser, Guyon, and Vapnik (1992); Vapnik (1995), The Nature of Statistical Learning Theory; Schölkopf and Smola (2002), Learning with Kernels

◮ y_i ∈ {−1, 1}, class labels in the binary case
◮ Find f ∈ F = {f(x) = w⊤x + b | w ∈ R^d and b ∈ R} minimizing

  (1/n) ∑_{i=1}^n (1 − y_i f(x_i))_+ + λ‖w‖²,

  where λ is a regularization parameter.
◮ Classification rule: φ(x) = sign(f(x))
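The slide specifies the objective but not an optimization method. Since the objective is convex, plain subgradient descent is one simple way to minimize it; the sketch below runs it on synthetic Gaussian data (λ, the step-size schedule, and the sample sizes are illustrative assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 2, 0.01
# Two Gaussian classes centered at (1, 1) and (-1, -1), identity covariance
X = np.vstack([rng.normal(1.0, 1.0, (n // 2, d)),
               rng.normal(-1.0, 1.0, (n // 2, d))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

w, b = np.zeros(d), 0.0
for t in range(1, 2001):
    margin = y * (X @ w + b)
    active = margin < 1                    # points inside or beyond the margin
    # Subgradient of (1/n) sum_i (1 - y_i f(x_i))_+ + lam * ||w||^2
    gw = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
    gb = -y[active].sum() / n
    eta = 1.0 / np.sqrt(t)                 # diminishing step size
    w, b = w - eta * gw, b - eta * gb

train_acc = np.mean(np.sign(X @ w + b) == y)
print(w, b, train_acc)
```

The classes overlap, so training accuracy near the Bayes level, not 100%, is the expected outcome.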

Page 8:

Hinge Loss

[Figure: the hinge loss (1 − t)_+ and the 0-1 loss [−t]_∗ as functions of t = yf]

(1 − yf(x))_+ is an upper bound of the misclassification loss function:
I(y ≠ φ(x)) = [−yf(x)]_∗ ≤ (1 − yf(x))_+, where [t]_∗ = I(t ≥ 0) and (t)_+ = max{t, 0}.
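The upper-bound inequality can be verified numerically over a grid of t = yf(x); a small sketch:

```python
import numpy as np

t = np.linspace(-2, 2, 401)         # t = y f(x)
hinge = np.maximum(1 - t, 0.0)      # (1 - t)_+
zero_one = (t <= 0).astype(float)   # [-t]_* = I(-t >= 0) = I(t <= 0)

# The hinge loss dominates the 0-1 loss everywhere
print(np.all(hinge >= zero_one))  # True
```

Equality holds at t = 0 and for t ≥ 1, so the bound is tight exactly where classification is correct with margin at least one.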

Page 9:

SVM in General

Find f(x) = b + h(x) with h ∈ H_K (an RKHS) minimizing

  (1/n) ∑_{i=1}^n (1 − y_i f(x_i))_+ + λ‖h‖²_{H_K}.

◮ Linear SVM: H_K = {h(x) = w⊤x | w ∈ R^d} with
  i) K(x, x′) = x⊤x′
  ii) ‖h‖²_{H_K} = ‖w⊤x‖²_{H_K} = ‖w‖²
◮ Nonlinear SVM: K(x, x′) = (1 + x⊤x′)^d, exp(−‖x − x′‖²/2σ²), ...

Page 10:

Statistical Properties of SVM

◮ Fisher consistency (Lin, DM&KD 2002)

  arg min_f E[(1 − Yf(X))_+ | X = x] = sign(p(x) − 1/2),

  where p(x) = P(Y = 1 | X = x)
◮ SVM approximates the Bayes decision rule

  φB(x) = sign(p(x) − 1/2).

◮ Bayes risk consistency (Zhang, AOS 2004; Steinwart, IEEE IT 2005; Bartlett et al., JASA 2006)

  R(f_SVM) → R(φB) in prob.

  under a universal approximation condition on H_K
◮ Rate of convergence (Steinwart et al., AOS 2007)
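Fisher consistency can be illustrated pointwise: for fixed x with p(x) = p, the conditional hinge risk is p(1 − f)_+ + (1 − p)(1 + f)_+, and its minimizer over f should carry the sign of p − 1/2. A sketch using a grid search (an illustrative shortcut, not the argument in the cited paper):

```python
import numpy as np

def cond_hinge_risk(f, p):
    """E[(1 - Y f)_+ | X = x] when P(Y = 1 | X = x) = p."""
    return p * max(1 - f, 0.0) + (1 - p) * max(1 + f, 0.0)

grid = np.linspace(-3, 3, 601)
for p in (0.2, 0.35, 0.7, 0.9):
    risks = [cond_hinge_risk(f, p) for f in grid]
    f_star = grid[int(np.argmin(risks))]
    # The minimizer sits at -1 or +1 and matches sign(p - 1/2)
    print(p, f_star)
```

The minimizer lands exactly on ±1: the conditional risk is piecewise linear in f, decreasing on [−1, 1] when p > 1/2 and increasing there when p < 1/2.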

Page 11:

Main Questions

◮ Recursive Feature Elimination (Guyon et al., ML 2002): backward elimination of variables based on the fitted coefficients of the linear SVM
◮ What is the statistical behavior of the coefficients?
◮ What determines their variances?
◮ Study the asymptotic properties of the coefficients of the linear SVM.
◮ What about its relative efficiency?

Page 12:

Something New, Old, and Borrowed

◮ The hinge loss: not everywhere differentiable, and no closed-form expression for the solution
◮ Useful link: sign(p(x) − 1/2), the population minimizer w.r.t. the hinge loss, is the median of Y at x.
◮ Asymptotics for least absolute deviation (LAD) estimators (Pollard, ET 1991)
◮ Convexity of the loss

Page 13:

Preliminaries

◮ (X, Y): a pair of random variables with X ∈ X ⊂ R^d and Y ∈ {1, −1}
◮ P(Y = 1) = π+ and P(Y = −1) = π−
◮ Let f and g be the densities of X given Y = 1 and Y = −1.
◮ With x = (1, x1, . . . , xd)⊤ and β = (b, w⊤)⊤,

  h(x; β) = w⊤x + b = x⊤β

◮ βλ,n = arg min_β (1/n) ∑_{i=1}^n (1 − y_i h(x_i; β))_+ + λ‖w‖²

Page 14:

Population Version

◮ L(β) = E[1 − Yh(X; β)]_+
◮ β∗ = arg min_β L(β)
◮ The gradient of L(β):

  S(β) = −E(ψ(1 − Yh(X; β)) YX), where ψ(t) = I(t ≥ 0)

◮ The Hessian matrix of L(β):

  H(β) = E(δ(1 − Yh(X; β)) XX⊤), where δ is the Dirac delta function.

Page 15:

More on H(β)

◮ The Hessian matrix of L(β):

  H(β) = E(δ(1 − Yh(X; β)) XX⊤)

  H_{j,k}(β) = E(δ(1 − Yh(X; β)) Xj Xk) for 0 ≤ j, k ≤ d
             = π+ ∫_X δ(1 − b − w⊤x) xj xk f(x) dx + π− ∫_X δ(1 + b + w⊤x) xj xk g(x) dx.

◮ For a function s on X, define the Radon transform Rs of s for p ∈ R and ξ ∈ R^d as

  (Rs)(p, ξ) = ∫_X δ(p − ξ⊤x) s(x) dx.

  (the integral of s over the hyperplane ξ⊤x = p)
◮ H_{j,k}(β) = π+ (Rf_{j,k})(1 − b, w) + π− (Rg_{j,k})(1 + b, −w), where f_{j,k}(x) = xj xk f(x) and g_{j,k}(x) = xj xk g(x).
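For a probability density s and a unit vector ξ, (Rs)(p, ξ) is precisely the density of ξ⊤X at p, which gives a closed form to check a numerical line integral against. A sketch for the standard bivariate normal (the discretization choices are illustrative):

```python
import math

def std_normal_2d(x1, x2):
    return math.exp(-(x1 * x1 + x2 * x2) / 2) / (2 * math.pi)

def radon(s, p, xi):
    """(Rs)(p, xi) for unit xi: line integral of s over the line xi'x = p."""
    nx, ny = xi
    tx, ty = -ny, nx                 # direction along the line
    total, dt, t = 0.0, 1e-3, -8.0
    while t < 8.0:
        total += s(p * nx + t * tx, p * ny + t * ty) * dt
        t += dt
    return total

# With X ~ N(0, I_2), xi'X ~ N(0, 1), so (Rs)(p, xi) should equal phi(p).
xi = (math.cos(0.7), math.sin(0.7))
p = 0.5
approx = radon(std_normal_2d, p, xi)
exact = math.exp(-p * p / 2) / math.sqrt(2 * math.pi)
print(abs(approx - exact))  # tiny discretization error
```

This is the same identity the slide exploits: H(β) evaluates the Radon transforms of xj xk f and xj xk g on the two margin hyperplanes.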

Page 16:

Regularity Conditions

(A1) The densities f and g are continuous and have finite second moments.

(A2) There exists a ball B(x0, δ0) such that f(x) > C1 and g(x) > C1 for every x ∈ B(x0, δ0).

(A3) For some 1 ≤ i∗ ≤ d,

  ∫_X I{x_{i∗} ≥ G−_{i∗}} x_{i∗} g(x) dx < ∫_X I{x_{i∗} ≤ F+_{i∗}} x_{i∗} f(x) dx

or

  ∫_X I{x_{i∗} ≤ G+_{i∗}} x_{i∗} g(x) dx > ∫_X I{x_{i∗} ≥ F−_{i∗}} x_{i∗} f(x) dx.

(When π+ = π−, this says that the class means are different.)

(A4) Let M+ = {x ∈ X | x⊤β∗ = 1} and M− = {x ∈ X | x⊤β∗ = −1}. There exist two subsets of M+ and M− on which the class densities f and g are bounded away from zero.

Page 17:

Bahadur Representation

◮ Bahadur (1966), A Note on Quantiles in Large Samples
◮ A statistical estimator is approximated by a sum of independent variables plus a higher-order remainder.
◮ Let ξ = F^{−1}(p) be the pth quantile of the distribution F. For X1, . . . , Xn ∼ iid F, the sample pth quantile is

  ξ + [∑_{i=1}^n I(X_i > ξ) − n(1 − p)] / (n f(ξ)) + Rn,

  where f(x) = F′(x).
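The representation can be checked by simulation: for the sample median of standard normal data, the linear term should agree with the sample quantile up to a remainder of smaller order than n^{−1/2}. A sketch (sample size and seed are arbitrary choices):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, p = 10000, 0.5
xi = 0.0                                # true median of N(0, 1)
f_xi = 1.0 / math.sqrt(2 * math.pi)     # density at the median

x = rng.normal(size=n)
sample_q = np.quantile(x, p)
bahadur = xi + (np.sum(x > xi) - n * (1 - p)) / (n * f_xi)

print(abs(sample_q - bahadur))  # remainder R_n, of smaller order than the linear term
```

Here the linear term fluctuates on the n^{−1/2} = 0.01 scale, while the gap between the two quantities is typically an order of magnitude smaller.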

Page 18:

Bahadur-type Representation of the Linear SVM

Theorem
Suppose that (A1)-(A4) are met. For λ = o(n^{−1/2}), we have

  √n (βλ,n − β∗) = −(1/√n) H(β∗)^{−1} ∑_{i=1}^n ψ(1 − Y_i h(X_i; β∗)) Y_i X_i + oP(1).

Recall that H(β∗) = E(δ(1 − Yh(X; β∗)) XX⊤) and ψ(t) = I(t ≥ 0).

Page 19:

Asymptotic Normality of βλ,n

Theorem
Suppose (A1)-(A4) are satisfied. For λ = o(n^{−1/2}),

  √n (βλ,n − β∗) → N(0, H(β∗)^{−1} G(β∗) H(β∗)^{−1})

in distribution, where

  G(β) = E(ψ(1 − Yh(X; β)) XX⊤).

Corollary
Under the same conditions as in the Theorem,

  √n (h(x; βλ,n) − h(x; β∗)) → N(0, x⊤ H(β∗)^{−1} G(β∗) H(β∗)^{−1} x)

in distribution.

Page 20:

An Illustrative Example

◮ Two multivariate normal distributions in R^d with mean vectors µf and µg and a common covariance matrix Σ
◮ π+ = π− = 1/2.
◮ What is the relation between the Bayes decision boundary and the optimal hyperplane of the SVM, h(x; β∗) = 0?

[Figure: two Gaussian classes centered at µf and µg]

Page 21:

Example

◮ The Bayes decision boundary (Fisher's LDA):

  {Σ^{−1}(µf − µg)}⊤ {x − (1/2)(µf + µg)} = 0.

◮ The hyperplane determined by the SVM:

  x⊤β∗ = 0 with S(β∗) = 0.

◮ β∗ balances the two classes within the margin:

  E(ψ(1 − Yh(X; β∗)) YX) = 0

  P(h(X; β∗) ≤ 1 | Y = 1) = P(h(X; β∗) ≥ −1 | Y = −1)

  E(I{h(X; β∗) ≤ 1} Xj | Y = 1) = E(I{h(X; β∗) ≥ −1} Xj | Y = −1)

Page 22:

Example

◮ Direct calculation shows that

  β∗ = C(dΣ(µf, µg)) [ −(1/2)(µf + µg)⊤ ; Id ] Σ^{−1}(µf − µg),

  where dΣ(µf, µg) is the Mahalanobis distance between the two distributions and the bracket denotes the row −(1/2)(µf + µg)⊤ stacked on Id.
◮ The linear SVM is asymptotically equivalent to Fisher's LDA in this case.
◮ The assumptions (A1)-(A4) are satisfied, so the main theorem applies.
◮ Consider d = 1, µf + µg = 0, σ = 1, and dΣ(µf, µg) = |µf − µg|.

Page 23:

Distance and Margin

[Figure: class-conditional densities with the margin for Mahalanobis distance d = 0.5, 1, 3, and 6]

Page 24:

Asymptotic Variance

[Figure: panels (a) Intercept and (b) Slope; asymptotic variability vs. Mahalanobis distance]

Figure: The asymptotic variabilities of the intercept and the slope for the optimal hyperplane as a function of the Mahalanobis distance.

Page 25:

A Bivariate Normal Example

◮ µf = (1, 1)⊤, µg = (−1, −1)⊤ and Σ = diag(1, 1)
◮ dΣ(µf, µg) = 2√2 and the Bayes error rate is 0.07865.
◮ Find (β0, β1, β2) = arg min_β (1/n) ∑_{i=1}^n (1 − y_i x_i⊤β)_+

  Estimates   n = 100   n = 200   n = 500   Optimal coefficients
  β0          0.0006    −0.0013   0.0022    0
  β1          0.7709    0.7450    0.7254    0.7169
  β2          0.7749    0.7459    0.7283    0.7169

Table: Averages of estimated optimal coefficients over 1000 replicates.

Page 26:

Sampling Distributions of β0 and β1

[Figure: estimated vs. asymptotic densities for panels (a) β0 and (b) β1]

Figure: Estimated sampling distributions of β0 and β1 (n = 500)

Page 27:

Type I Error Rates

[Figure: type I error rate vs. n for d = 6, 12, 18, 24]

Figure: The median values of the type I error rates in variable selection when µf = (1_{d/2}, 0_{d/2})⊤, µg = 0_d⊤, and Σ = Id

Page 28:

Excess Error

Efron (1975), The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis

◮ The canonical LDA setting:
  X ∼ N((∆/2)e1, I) for Y = 1 with probability π+
  X ∼ N(−(∆/2)e1, I) for Y = −1 with probability π−
◮ Fisher's linear discriminant function is ℓ(x) = β∗⊤x, where β∗0 = log(π+/π−) and (β∗1, . . . , β∗d)⊤ = ∆e1.
◮ For a linear discriminant method ℓ with coefficient vector β, if √n(β − β∗) → N(0, Σ), then E(R(ℓ) − R(φB)) is given by

  (π+ φ(D1) / (2∆n)) [σ00 − (2β∗0/∆)σ01 + (β∗0²/∆²)σ11 + σ22 + · · · + σdd],

  where D1 = ∆/2 + (1/∆) log(π+/π−).

Page 29:

Relative Efficiency

◮ Efron (1975) studied the Asymptotic Relative Efficiency (ARE) of logistic regression (LR) relative to normal discrimination (LDA), defined as

  lim_{n→∞} E(R(ℓ_LDA) − R(φB)) / E(R(ℓ_LR) − R(φB)).

◮ Logistic regression is typically between one half and two thirds as effective as normal discrimination.

Page 30:

Asymptotic Relative Efficiencies of SVM to LDA

Under the canonical LDA setting with π+ = π− = 0.5, the ARE of the linear SVM to LDA is given by

  Eff = (2/∆)(1 + ∆²/4) φ(a∗),

where a∗ is the constant satisfying φ(a∗)/Φ(a∗) = ∆/2.

  ∆    R(φB)    a∗       SVM     LR
  2.0  0.1587   −0.3026  0.7622  0.899
  2.5  0.1056   −0.6466  0.6636  0.786
  3.0  0.0668   −0.9685  0.5408  0.641
  3.5  0.0401   −1.2756  0.4105  0.486
  4.0  0.0228   −1.5718  0.2899  0.343
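The SVM column can be reproduced from the displayed formula by solving φ(a∗)/Φ(a∗) = ∆/2 numerically; the ratio is decreasing in a, so bisection works. A sketch using only the standard library:

```python
import math

def phi(a):   # standard normal density
    return math.exp(-a * a / 2) / math.sqrt(2 * math.pi)

def Phi(a):   # standard normal cdf
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def a_star(delta):
    """Solve phi(a)/Phi(a) = delta/2; the ratio decreases in a."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if phi(mid) / Phi(mid) > delta / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def eff_svm(delta):
    return (2 / delta) * (1 + delta ** 2 / 4) * phi(a_star(delta))

for delta in (2.0, 2.5, 3.0, 3.5, 4.0):
    print(delta, round(a_star(delta), 4), round(eff_svm(delta), 4))
# delta = 2.0 reproduces a* = -0.3026 and Eff = 0.7622 from the table
```

The LR column comes from Efron's separate analysis and is not computed here.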

Page 31:

A Mixture of Two Gaussian Distributions

[Figure: two Gaussian components per class, with means µf1, µf2 and µg1, µg2, within-class gap ∆W and between-class gap ∆B]

Figure: ∆W and ∆B indicate the mean difference between two Gaussian components within each class and the mean difference between the two classes.

Page 32:

As ∆W Varies

[Figure: error rates of LDA, LR, and SVM with the Bayes error, as ∆W varies]

Figure: ∆B = 2, d = 5, π+ = π−, π1 = π2, and n = 100

Page 33:

As ∆B Varies

[Figure: error rates of LDA, LR, and SVM with the Bayes error, as ∆B varies]

Figure: ∆W = 1, d = 5, π+ = π−, π1 = π2, and n = 100

Page 34:

As Dimension d Varies

[Figure: error rates of LDA, LR, and SVM with the Bayes error, as the dimension varies]

Figure: ∆W = 1, ∆B = 2, π+ = π−, π1 = π2, and n = 100

Page 35:

Concluding Remarks

◮ Examine the asymptotic properties of the coefficients of variables in the linear SVM and study its relative efficiency.
◮ Establish a Bahadur type representation of the coefficients.
◮ The margins of the optimal hyperplane and the underlying probability distribution characterize the coefficients' statistical behavior.
◮ Variable selection for the SVM in the framework of hypothesis testing
◮ For practical applications, consistent estimators of G(β∗) and H(β∗) are needed.
◮ Explore a different scenario where d also grows with n.
◮ Extension of the SVM asymptotics to the nonlinear case

Page 36:

Reference

◮ A Bahadur Representation of the Linear Support Vector Machine, Koo, J.-Y., Lee, Y., Kim, Y., and Park, C., Journal of Machine Learning Research (2008). Available at www.stat.osu.edu/~yklee or http://www.jmlr.org/.
◮ Relative efficiency analysis and related results are from joint work with Rui Wang (in progress).