
Page 1: Computational Learning Theory and Kernel Methods

1

Computational Learning Theory and Kernel Methods

Tianyi Jiang
March 8, 2004

Page 2: Computational Learning Theory and Kernel Methods

2

General Research Question

“Under what conditions is successful learning possible and impossible?”

“Under what conditions is a particular learning algorithm assured of learning successfully?”

-Mitchell, ‘97

Page 3: Computational Learning Theory and Kernel Methods

3

Computational Learning Theory

1. Sample Complexity

2. Computational Complexity

3. Mistake Bound

-Mitchell, ‘97

Page 4: Computational Learning Theory and Kernel Methods

4

Problem Setting

Instance Space: X, with a stable distribution D

Concept Class: C, s.t. $c: X \to \{0, 1\}$

Hypothesis Space: H

General Learner: L

Page 5: Computational Learning Theory and Kernel Methods

5

Error of a Hypothesis

[Figure: instance space containing the target concept c and hypothesis h, with + and - instances; the error of h is the region where c and h disagree.]

Page 6: Computational Learning Theory and Kernel Methods

6

PAC Learnability

True Error: $error_D(h) \equiv \Pr_{x \in D}\left[c(x) \neq h(x)\right]$

Difficulties in getting 0 error:

1. Multiple hypotheses may be consistent with the training examples
2. The training examples can mislead the learner
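
As a small illustration (not part of the original slides), the true error defined above can be approximated by sampling from D; the concept c, hypothesis h, and sampler below are assumed placeholders.

```python
import random

def estimate_true_error(c, h, sample_from_D, n_samples=100_000):
    """Monte-Carlo estimate of error_D(h) = Pr_{x ~ D}[c(x) != h(x)]."""
    disagreements = 0
    for _ in range(n_samples):
        x = sample_from_D()
        disagreements += int(c(x) != h(x))
    return disagreements / n_samples

# Toy usage: concepts over [0, 1] with D uniform (all assumed for illustration).
c = lambda x: int(x > 0.5)   # target concept
h = lambda x: int(x > 0.6)   # hypothesis; disagrees with c on (0.5, 0.6]
print(estimate_true_error(c, h, random.random))  # roughly 0.1
```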

Page 7: Computational Learning Theory and Kernel Methods

7

PAC-Learnable

Learner L will output a hypothesis h such that, with probability at least $(1 - \delta)$, $error_D(h) \le \epsilon$,

in time that is polynomial in $1/\epsilon$, $1/\delta$, $n$, and $size(c)$,

where $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$,

n = size of a training example
size(c) = encoding length of c in C

Page 8: Computational Learning Theory and Kernel Methods

8

Consistent Learner & Version Space

Consistent Learner – Outputs hypotheses that perfectly fit the training data whenever possible

Version Space: $VS_{H,E} = \{h \in H \mid \forall \langle x, c(x)\rangle \in E:\ h(x) = c(x)\}$

$VS_{H,E}$ is ε-exhausted with respect to c and D if: $\forall h \in VS_{H,E},\ error_D(h) < \epsilon$

Page 9: Computational Learning Theory and Kernel Methods

9

Version Space

[Figure: hypothesis space H (ε = .21); each hypothesis is annotated with its true error and its training error r (error=.1, r=.2; error=.3, r=.1; error=.3, r=.4; error=.2, r=.3; error=.2, r=0; error=.1, r=0). The hypotheses with training error r = 0 form $VS_{H,E}$.]

Page 10: Computational Learning Theory and Kernel Methods

10

Sample Complexity for Finite Hypothesis Spaces

Theorem (ε-exhausting the version space):

If H is finite and E is a sequence of m ≥ 1 independently, randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that $VS_{H,E}$ is NOT ε-exhausted (with respect to c) is at most

$|H|\,e^{-\epsilon m}$

Page 11: Computational Learning Theory and Kernel Methods

11

Upper bound on sufficient number of training examples

If we set the probability of failure below some level δ,

$|H|\,e^{-\epsilon m} \le \delta$

then…

$m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$

…however, this is often too loose a bound because of the |H| term.
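
Purely as an illustration of the bound above (not from the slides), it translates directly into a helper; the function name is hypothetical.

```python
import math

def sample_bound_consistent(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)
```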

Page 12: Computational Learning Theory and Kernel Methods

12

Agnostic Learning

What if the concept c ∉ H?

Agnostic Learner: simply finds the h with min. training error

Find upper bound on m s.t.

$error_D(h_{best}) \le error_E(h_{best}) + \epsilon$

where $h_{best}$ = the h with lowest training error

Page 13: Computational Learning Theory and Kernel Methods

13

Upper bound on sufficient number of training examples when $error_E(h_{best}) \neq 0$

From Chernoff bounds, we have:

$\Pr\left[error_D(h) > error_E(h) + \epsilon\right] \le e^{-2m\epsilon^2}$

then…

$\Pr\left[\exists h \in H:\ error_D(h) > error_E(h) + \epsilon\right] \le |H|\,e^{-2m\epsilon^2}$

thus…

$m \ge \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
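
In the same illustrative spirit (not from the slides), the agnostic-case bound as a helper with a hypothetical name:

```python
import math

def sample_bound_agnostic(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m satisfying m >= (1/(2 epsilon^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * epsilon ** 2))
```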

Page 14: Computational Learning Theory and Kernel Methods

14

Example:

Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1 95% of the time?

|H| = ?   ε = ?   δ = ?

Page 15: Computational Learning Theory and Kernel Methods

15

Example:

Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1 95% of the time?

$|H| = 3^{10}$,  ε = .1,  δ = .05

$m \ge \frac{1}{\epsilon}\left(n\ln 3 + \ln\frac{1}{\delta}\right) = \frac{1}{.1}\left(10\ln 3 + \ln\frac{1}{.05}\right) \approx 140$

Page 16: Computational Learning Theory and Kernel Methods

16

Sample Complexity for Infinite Hypothesis Spaces

Consider a subset of instances S ⊆ X and an h ∈ H; h imposes a dichotomy on S, i.e. the two subsets {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}.

Thus for any instance set S, there are $2^{|S|}$ possible dichotomies.

Definition: A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some h ∈ H consistent with that dichotomy.
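
As an illustration of shattering (not part of the slides), a brute-force check over a finite set of hypotheses; the 1-D threshold hypotheses below are an assumed toy example.

```python
def is_shattered(points, hypotheses):
    """True iff every one of the 2^|S| dichotomies of `points` is realized
    by some h in `hypotheses` (each h is a callable returning 0 or 1)."""
    dichotomies = {tuple(h(x) for x in points) for h in hypotheses}
    return len(dichotomies) == 2 ** len(points)

# Assumed toy hypothesis "space": 1-D thresholds h_t(x) = [x >= t].
thresholds = [lambda x, t=t: int(x >= t) for t in (-0.5, 0.5, 1.5, 2.5)]
print(is_shattered([0.0], thresholds))       # True: a single point is shattered
print(is_shattered([0.0, 1.0], thresholds))  # False: no h produces the labeling (1, 0)
```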

Page 17: Computational Learning Theory and Kernel Methods

17

3 Instances Shattered by 8 Hypotheses

Instance Space X

Page 18: Computational Learning Theory and Kernel Methods

18

Vapnik-Chervonenkis Dimension

Definition: VC(H) is the size of the largest finite subset of X shattered by H.

If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞.

For any finite H, $VC(H) \le \log_2|H|$.

Page 19: Computational Learning Theory and Kernel Methods

19

Example of VC Dimension

Along a line…

In a plane…

Page 20: Computational Learning Theory and Kernel Methods

20

VC Dimension Example 2

Page 21: Computational Learning Theory and Kernel Methods

21

VC Dimensions in $R^n$

Theorem: Consider some set of m points in $R^n$. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes iff the position vectors of the remaining points are linearly independent.

So the VC dimension of the set of oriented hyperplanes in $R^{10}$ is ?
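
The theorem above suggests a direct numerical check (an illustrative numpy sketch; the point sets are assumed toy data):

```python
import numpy as np

def shatterable_by_hyperplanes(points: np.ndarray) -> bool:
    """Pick one point as origin; the set can be shattered by oriented
    hyperplanes iff the remaining position vectors are linearly independent."""
    vecs = points[1:] - points[0]
    return np.linalg.matrix_rank(vecs) == len(vecs)

# 3 non-collinear points in R^2 can be shattered; 4 points in R^2 never can,
# since 3 vectors in R^2 cannot be linearly independent.
print(shatterable_by_hyperplanes(np.array([[0., 0.], [1., 0.], [0., 1.]])))            # True
print(shatterable_by_hyperplanes(np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])))  # False
```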

Page 22: Computational Learning Theory and Kernel Methods

22

Bounds on m with VC Dimension

Upper Bound (sufficient number of training examples):

$m \ge \frac{1}{\epsilon}\left(4\log_2\frac{2}{\delta} + 8\,VC(H)\log_2\frac{13}{\epsilon}\right)$

(recall that $VC(H) \le \log_2|H|$)

Lower Bound (necessary number of training examples):

$m \ge \max\left[\frac{1}{\epsilon}\log\frac{1}{\delta},\ \frac{VC(C)-1}{32\epsilon}\right]$

Page 23: Computational Learning Theory and Kernel Methods

23

Mistake Bound Model of Learning

“How many mistakes will the learner make in its predictions before it learns the target concept?”

The best algorithm in the worst-case scenario (hardest target concept, hardest training sequence) will make Opt(C) mistakes, where

$VC(C) \le Opt(C) \le \log_2|C|$

Page 24: Computational Learning Theory and Kernel Methods

24

Linear Support Vector Machines

Consider a binary classification problem:

Training data: $\{x_i, y_i\},\ i = 1, \ldots, l$; $y_i \in \{-1, +1\}$; $x_i \in R^d$

Points x that lie on the separating hyperplane satisfy: $w \cdot x + b = 0$,

where w is normal to the hyperplane, $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of w.

Page 25: Computational Learning Theory and Kernel Methods

25

Linear Support Vector Machine, Definitions

Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) example

Margin of a separating hyperplane $= d_+ + d_- = 1/\|w\| + 1/\|w\| = 2/\|w\|$

Constraints:

$x_i \cdot w + b \ge +1$ for $y_i = +1$
$x_i \cdot w + b \le -1$ for $y_i = -1$

which combine into: $y_i(x_i \cdot w + b) - 1 \ge 0\ \ \forall i$

Page 26: Computational Learning Theory and Kernel Methods

26

Linear Separating Hyperplane for the Separable Case

Page 27: Computational Learning Theory and Kernel Methods

27

Problem of Maximizing the Margins

H1 and H2 are parallel, with no training points between them.

Thus we reformulate the problem as:

Maximize the margin by minimizing $\|w\|^2$

s.t. $y_i(x_i \cdot w + b) - 1 \ge 0\ \ \forall i$
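
As a concrete sketch of this optimization problem (not part of the original slides), scikit-learn's SVC with a linear kernel and a very large C closely approximates the hard-margin formulation; the toy data and the value of C are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates: minimize ||w||^2 s.t. y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```

The fitted coef_ and intercept_ play the roles of w and b above, and the reported support vectors are the training points lying on (or closest to) H1 and H2.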

Page 28: Computational Learning Theory and Kernel Methods

28

Ties to Least Squares

[Figure: least-squares fit of y versus x, with intercept b.]

$y = f(x) = w \cdot x + b$

Loss Function: $L(w, b) = \sum_{i=1}^{l}\left(y_i - w \cdot x_i - b\right)^2$
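
For comparison (an illustrative numpy sketch with assumed toy data), the least-squares fit above can be computed directly:

```python
import numpy as np

# Ordinary least squares for f(x) = w . x + b.
X = np.array([[0.0], [1.0], [2.0], [3.0]])     # assumed 1-D inputs
y = np.array([0.1, 0.9, 2.1, 2.9])             # assumed targets

A = np.hstack([X, np.ones((len(X), 1))])       # append a column of 1s for b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes sum_i (y_i - w.x_i - b)^2
w, b = coef[:-1], coef[-1]
print(w, b)
```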

Page 29: Computational Learning Theory and Kernel Methods

29

Lagrangian Formulation

1. Transform the constraints into Lagrange multipliers
2. The training data will then appear only in the form of dot products

Let $\alpha_i,\ i = 1, \ldots, l$, be positive Lagrange multipliers.

We have the Lagrangian:

$L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i$,  s.t. $\alpha_i \ge 0$

Page 30: Computational Learning Theory and Kernel Methods

30

Transform the convex quadratic programming problem

Observations: minimizing $L_P$ w.r.t. w and b, while simultaneously requiring that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish, subject to the constraints $\alpha_i \ge 0$,

is a convex quadratic programming problem that can be more easily solved in its dual form.

Page 31: Computational Learning Theory and Kernel Methods

31

Transform the convex quadratic programming problem – the Dual

$L_P$'s dual: maximize $L_P$ subject to the constraints that the gradients of $L_P$ w.r.t. w and b vanish, and that $\alpha_i \ge 0$.

The gradient conditions give: $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$

Substituting these into $L_P$ gives the dual: $L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$
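
To make the dual concrete (an illustrative numpy sketch, not part of the slides; the function names are hypothetical):

```python
import numpy as np

def dual_objective(alpha, y, X):
    """L_D = sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    G = np.outer(y, y) * (X @ X.T)
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def weight_vector(alpha, y, X):
    """w = sum_i alpha_i y_i x_i; only support vectors (alpha_i > 0) contribute."""
    return (alpha * y) @ X
```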

Page 32: Computational Learning Theory and Kernel Methods

32

Observations about the Dual

$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$

• There is a Lagrange multiplier $\alpha_i$ for every training point.
• In the solution, points for which $\alpha_i > 0$ are called “support vectors”. They lie on either H1 or H2.
• Support vectors are the critical elements of the training set; they lie closest to the decision boundary.
• If all other points were removed or moved around (without crossing H1 or H2), the same separating hyperplane would be found.

Page 33: Computational Learning Theory and Kernel Methods

33

Prediction

• Solving the SVM problem is equivalent to finding a solution to the Karush-Kuhn-Tucker (KKT) conditions (the KKT conditions hold at the solution of any suitably regular constrained optimization problem; for this convex problem with linear constraints they are both necessary and sufficient).

Once we have solved for w and b, we predict the class of a new point x as $\mathrm{sign}(w \cdot x + b)$.
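
A minimal sketch of this decision rule (assumed names, not from the slides):

```python
import numpy as np

def predict(X_new, w, b):
    """Classify each row x of X_new as sign(w . x + b)."""
    return np.sign(X_new @ w + b)
```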

Page 34: Computational Learning Theory and Kernel Methods

34

Linear SVM: The Non-Separable Case

We account for outliers by introducing slack variables $\xi_i$:

$x_i \cdot w + b \ge +1 - \xi_i$ for $y_i = +1$
$x_i \cdot w + b \le -1 + \xi_i$ for $y_i = -1$
$\xi_i \ge 0\ \ \forall i$

We penalize outliers by changing the cost function to:

$\min\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$

Page 35: Computational Learning Theory and Kernel Methods

35

Example of Linear SVM with slacks

Page 36: Computational Learning Theory and Kernel Methods

36

Linear SVM Classification Examples

Linearly Separable Linearly Non-Separable

Page 37: Computational Learning Theory and Kernel Methods

37

Nonlinear SVM

Observation: the data appear only in the form of dot products in the training problem.

So we can use a mapping function $\Phi$ to map the data into a high-dimensional space where the points are linearly separable:

$\Phi: R^d \to H$

To make things easier, we define a kernel function K s.t.

$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
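
To make this definition concrete (an illustrative sketch, not from the slides), the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on $R^2$ equals the dot product under the explicit map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2 computes phi(x) . phi(z) without forming phi."""
    return (x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), poly2_kernel(x, z))   # both print 1.0
```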

Page 38: Computational Learning Theory and Kernel Methods

38

Nonlinear SVM (cont.)

Kernel functions can compute dot products in the high-dimensional space without explicitly working with $\Phi$.

Example: $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$

Rather than computing w, we make a prediction on x via:

$f(x) = \sum_{i=1}^{N_S} \alpha_i y_i\, \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N_S} \alpha_i y_i\, K(s_i, x) + b$

where the $s_i$ are the $N_S$ support vectors.
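
A minimal sketch of this kernel-based prediction (not from the slides; alpha, y, b, and the support vectors would come from solving the dual, and sigma is an assumed parameter):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def kernel_predict(x, support_vectors, alpha, y, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i y_i K(s_i, x) + b, summed over the support vectors."""
    return sum(a_i * y_i * kernel(s_i, x)
               for a_i, y_i, s_i in zip(alpha, y, support_vectors)) + b
```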

Page 39: Computational Learning Theory and Kernel Methods

39

Example of mapping

Image, in H, of the square $[-1, 1] \times [-1, 1] \subset R^2$ under the mapping $\Phi$.

Page 40: Computational Learning Theory and Kernel Methods

40

Example Kernel Functions

Kernel functions must satisfy Mercer's condition, or, put simply, the Hessian matrix

$H_{ij} = y_i y_j K(x_i, x_j)$

must be positive semidefinite (i.e., have non-negative eigenvalues).

Example kernels:

$K(x_i, x_j) = (x_i \cdot x_j + 1)^p$

$K(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j - \delta)$
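
A quick numerical way to sanity-check this condition on a given training set (an illustrative numpy sketch with assumed toy data):

```python
import numpy as np

def is_positive_semidefinite(H, tol=1e-10):
    """True iff all eigenvalues of the symmetric matrix H are (numerically) non-negative."""
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# H_ij = y_i y_j K(x_i, x_j) with the linear kernel on assumed toy data.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
H = np.outer(y, y) * (X @ X.T)
print(is_positive_semidefinite(H))   # True: the linear kernel satisfies the condition
```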

Page 41: Computational Learning Theory and Kernel Methods

41

Nonlinear SVM Classification Examples (Degree 3 Polynomial Kernel)

Linearly Separable Linearly Non-Separable

Page 42: Computational Learning Theory and Kernel Methods

42

Multi-Class SVM

1. One-against-all

2. One-against-one (majority vote)

3. One-against-one (DAGSVM)
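
For illustration (not part of the slides), the first two strategies are available as meta-estimators in scikit-learn; the toy data are assumptions, and DAGSVM has no built-in scikit-learn implementation.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.1],
              [2.0, 2.0], [2.1, 1.9],
              [4.0, 0.0], [4.2, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])   # three classes

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one-against-all
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one-against-one (majority vote)
print(ova.predict([[2.0, 1.8]]), ovo.predict([[2.0, 1.8]]))
```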

Page 43: Computational Learning Theory and Kernel Methods

43

Global Solution and Uniqueness

• Every local solution is also global (property of any convex programming problem)

• Solution is guaranteed unique if the objective function is strictly convex (Hessian matrix is positive definite)

Page 44: Computational Learning Theory and Kernel Methods

44

Complexity and Scalability

Curse of dimensionality:
1. The proliferation of parameters causes intractable complexity
2. The proliferation of parameters causes overfitting

SVMs circumvent these via the use of:
1. Kernel functions (the kernel trick), which compute the needed dot products at $O(d_L)$ cost (the input-space dimension)
2. Support vectors, which focus attention on the “boundary”

Page 45: Computational Learning Theory and Kernel Methods

45

Structural Risk Minimization

Empirical Risk:

$R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l}\left|y_i - f(x_i, \alpha)\right|$

Expected Risk (bound holding with probability $1 - \eta$, where h is the VC dimension):

$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}$
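
As a small numerical illustration of this bound (not from the slides; the example values of h, l, and eta are assumptions):

```python
import math

def vc_confidence(h: int, l: int, eta: float) -> float:
    """sqrt((h * (log(2l/h) + 1) - log(eta/4)) / l), the VC confidence term."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def risk_bound(r_emp: float, h: int, l: int, eta: float = 0.05) -> float:
    """Upper bound on the expected risk, holding with probability 1 - eta."""
    return r_emp + vc_confidence(h, l, eta)

print(risk_bound(r_emp=0.1, h=10, l=1000))
```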

Page 46: Computational Learning Theory and Kernel Methods

46

Structural Risk Minimization

Nested subsets of functions, ordered by VC dimensions