
Page 1: Computational Learning Theory and Kernel Methods

1

Computational Learning Theory and Kernel Methods

Tianyi Jiang
March 8, 2004

Page 2: Computational Learning Theory and Kernel Methods

2

General Research Question

“Under what conditions is successful learning possible and impossible?”

“Under what conditions is a particular learning algorithm assured of learning successfully?”

-Mitchell, ‘97

Page 3: Computational Learning Theory and Kernel Methods

3

Computational Learning Theory

1. Sample Complexity

2. Computational Complexity

3. Mistake Bound

-Mitchell, ‘97

Page 4: Computational Learning Theory and Kernel Methods

4

Problem Setting

Instance Space: X, with a stable distribution D

Concept Class: C, s.t. $c: X \to \{0, 1\}$

Hypothesis Space: H

General Learner: L

Page 5: Computational Learning Theory and Kernel Methods

5

Error of a Hypothesis

[Figure: instance space containing the target concept c and hypothesis h, with + and - instances; the error of h is the region where c and h disagree.]

Page 6: Computational Learning Theory and Kernel Methods

6

PAC Learnability

True Error: $error_D(h) \equiv \Pr_{x \in D}\left[c(x) \neq h(x)\right]$

Difficulties in getting 0 error:

1. Multiple hypotheses may be consistent with the training examples
2. The training examples can mislead the learner
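
As a small illustration (not part of the original slides), the true error defined above can be approximated by sampling from D; the concept c, hypothesis h, and sampler below are assumed placeholders.

```python
import random

def estimate_true_error(c, h, sample_from_D, n_samples=100_000):
    """Monte-Carlo estimate of error_D(h) = Pr_{x ~ D}[c(x) != h(x)]."""
    disagreements = 0
    for _ in range(n_samples):
        x = sample_from_D()
        disagreements += int(c(x) != h(x))
    return disagreements / n_samples

# Toy usage: concepts over [0, 1] with D uniform (all assumed for illustration).
c = lambda x: int(x > 0.5)   # target concept
h = lambda x: int(x > 0.6)   # hypothesis; disagrees with c on (0.5, 0.6]
print(estimate_true_error(c, h, random.random))  # roughly 0.1
```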

Page 7: Computational Learning Theory and Kernel Methods

7

PAC-Learnable

Learner L will output a hypothesis h such that, with probability at least $(1 - \delta)$, $error_D(h) \le \epsilon$,

in time that is polynomial in $1/\epsilon$, $1/\delta$, $n$, and $size(c)$,

where $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$,

n = size of a training example
size(c) = encoding length of c in C

Page 8: Computational Learning Theory and Kernel Methods

8

Consistent Learner & Version Space

Consistent Learner – Outputs hypotheses that perfectly fit the training data whenever possible

Version Space: $VS_{H,E} = \{h \in H \mid \forall \langle x, c(x)\rangle \in E:\ h(x) = c(x)\}$

$VS_{H,E}$ is ε-exhausted with respect to c and D if: $\forall h \in VS_{H,E},\ error_D(h) < \epsilon$

Page 9: Computational Learning Theory and Kernel Methods

9

Version Space

[Figure: hypothesis space H (ε = .21); each hypothesis is annotated with its true error and its training error r (error=.1, r=.2; error=.3, r=.1; error=.3, r=.4; error=.2, r=.3; error=.2, r=0; error=.1, r=0). The hypotheses with training error r = 0 form $VS_{H,E}$.]

Page 10: Computational Learning Theory and Kernel Methods

10

Sample Complexity for Finite Hypothesis Spaces

Theorem (ε-exhausting the version space):

If H is finite and E is a sequence of m ≥ 1 independently, randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that $VS_{H,E}$ is NOT ε-exhausted (with respect to c) is at most

$|H|\,e^{-\epsilon m}$

Page 11: Computational Learning Theory and Kernel Methods

11

Upper bound on sufficient number of training examples

If we set the probability of failure below some level δ,

$|H|\,e^{-\epsilon m} \le \delta$

then…

$m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$

…however, this is often too loose a bound because of the |H| term.
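
Purely as an illustration of the bound above (not from the slides), it translates directly into a helper; the function name is hypothetical.

```python
import math

def sample_bound_consistent(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)
```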

Page 12: Computational Learning Theory and Kernel Methods

12

Agnostic Learning

What if the concept c ∉ H?

Agnostic Learner: simply finds the h with min. training error

Find upper bound on m s.t.

$error_D(h_{best}) \le error_E(h_{best}) + \epsilon$

where $h_{best}$ = the h with lowest training error

Page 13: Computational Learning Theory and Kernel Methods

13

Upper bound on sufficient number of training examples when $error_E(h_{best}) \neq 0$

From Chernoff bounds, we have:

$\Pr\left[error_D(h) > error_E(h) + \epsilon\right] \le e^{-2m\epsilon^2}$

then…

$\Pr\left[\exists h \in H:\ error_D(h) > error_E(h) + \epsilon\right] \le |H|\,e^{-2m\epsilon^2}$

thus…

$m \ge \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
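
In the same illustrative spirit (not from the slides), the agnostic-case bound as a helper with a hypothetical name:

```python
import math

def sample_bound_agnostic(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m satisfying m >= (1/(2 epsilon^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * epsilon ** 2))
```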

Page 14: Computational Learning Theory and Kernel Methods

14

Example:

Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1 95% of the time?

|H| = ?   ε = ?   δ = ?

Page 15: Computational Learning Theory and Kernel Methods

15

Example:

Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1 95% of the time?

$|H| = 3^{10}$,  ε = .1,  δ = .05

$m \ge \frac{1}{\epsilon}\left(n\ln 3 + \ln\frac{1}{\delta}\right) = \frac{1}{.1}\left(10\ln 3 + \ln\frac{1}{.05}\right) \approx 140$

Page 16: Computational Learning Theory and Kernel Methods

16

Sample Complexity for Infinite Hypothesis Spaces

Consider a subset of instances S ⊆ X and an h ∈ H; h imposes a dichotomy on S, i.e. the two subsets {x ∈ S | h(x) = 1} and {x ∈ S | h(x) = 0}.

Thus for any instance set S, there are $2^{|S|}$ possible dichotomies.

Definition: A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some h ∈ H consistent with that dichotomy.
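
As an illustration of shattering (not part of the slides), a brute-force check over a finite set of hypotheses; the 1-D threshold hypotheses below are an assumed toy example.

```python
def is_shattered(points, hypotheses):
    """True iff every one of the 2^|S| dichotomies of `points` is realized
    by some h in `hypotheses` (each h is a callable returning 0 or 1)."""
    dichotomies = {tuple(h(x) for x in points) for h in hypotheses}
    return len(dichotomies) == 2 ** len(points)

# Assumed toy hypothesis "space": 1-D thresholds h_t(x) = [x >= t].
thresholds = [lambda x, t=t: int(x >= t) for t in (-0.5, 0.5, 1.5, 2.5)]
print(is_shattered([0.0], thresholds))       # True: a single point is shattered
print(is_shattered([0.0, 1.0], thresholds))  # False: no h produces the labeling (1, 0)
```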

Page 17: Computational Learning Theory and Kernel Methods

17

3 Instances Shattered by 8 Hypotheses

Instance Space X

Page 18: Computational Learning Theory and Kernel Methods

18

Vapnik-Chervonenkis Dimension

Definition: VC(H) is the size of the largest finite subset of X shattered by H.

If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞.

For any finite H, $VC(H) \le \log_2|H|$.

Page 19: Computational Learning Theory and Kernel Methods

19

Example of VC Dimension

Along a line…

In a plane…

Page 20: Computational Learning Theory and Kernel Methods

20

VC Dimension Example 2

Page 21: Computational Learning Theory and Kernel Methods

21

VC Dimensions in $R^n$

Theorem: Consider some set of m points in $R^n$. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes iff the position vectors of the remaining points are linearly independent.

So the VC dimension of the set of oriented hyperplanes in $R^{10}$ is ?
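
The theorem above suggests a direct numerical check (an illustrative numpy sketch; the point sets are assumed toy data):

```python
import numpy as np

def shatterable_by_hyperplanes(points: np.ndarray) -> bool:
    """Pick one point as origin; the set can be shattered by oriented
    hyperplanes iff the remaining position vectors are linearly independent."""
    vecs = points[1:] - points[0]
    return np.linalg.matrix_rank(vecs) == len(vecs)

# 3 non-collinear points in R^2 can be shattered; 4 points in R^2 never can,
# since 3 vectors in R^2 cannot be linearly independent.
print(shatterable_by_hyperplanes(np.array([[0., 0.], [1., 0.], [0., 1.]])))            # True
print(shatterable_by_hyperplanes(np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])))  # False
```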

Page 22: Computational Learning Theory and Kernel Methods

22

Bounds on m with VC Dimension

Upper Bound (sufficient number of training examples):

$m \ge \frac{1}{\epsilon}\left(4\log_2\frac{2}{\delta} + 8\,VC(H)\log_2\frac{13}{\epsilon}\right)$

(recall that $VC(H) \le \log_2|H|$)

Lower Bound (necessary number of training examples):

$m \ge \max\left[\frac{1}{\epsilon}\log\frac{1}{\delta},\ \frac{VC(C)-1}{32\epsilon}\right]$

Page 23: Computational Learning Theory and Kernel Methods

23

Mistake Bound Model of Learning

“How many mistakes will the learner make in its predictions before it learns the target concept?”

The best algorithm in the worst-case scenario (hardest target concept, hardest training sequence) will make Opt(C) mistakes, where

$VC(C) \le Opt(C) \le \log_2|C|$

Page 24: Computational Learning Theory and Kernel Methods

24

Linear Support Vector Machines

Consider a binary classification problem:

Training data: $\{x_i, y_i\},\ i = 1, \ldots, l$; $y_i \in \{-1, +1\}$; $x_i \in R^d$

Points x that lie on the separating hyperplane satisfy: $w \cdot x + b = 0$,

where w is normal to the hyperplane, $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of w.

Page 25: Computational Learning Theory and Kernel Methods

25

Linear Support Vector Machine, Definitions

Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) example

Margin of a separating hyperplane $= d_+ + d_- = 1/\|w\| + 1/\|w\| = 2/\|w\|$

Constraints:

$x_i \cdot w + b \ge +1$ for $y_i = +1$
$x_i \cdot w + b \le -1$ for $y_i = -1$

which combine into: $y_i(x_i \cdot w + b) - 1 \ge 0\ \ \forall i$

Page 26: Computational Learning Theory and Kernel Methods

26

Linear Separating Hyperplane for the Separable Case

Page 27: Computational Learning Theory and Kernel Methods

27

Problem of Maximizing the Margins

H1 and H2 are parallel, with no training points between them.

Thus we reformulate the problem as:

Maximize the margin by minimizing $\|w\|^2$

s.t. $y_i(x_i \cdot w + b) - 1 \ge 0\ \ \forall i$
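
As a concrete sketch of this optimization problem (not part of the original slides), scikit-learn's SVC with a linear kernel and a very large C closely approximates the hard-margin formulation; the toy data and the value of C are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates: minimize ||w||^2 s.t. y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```

The fitted coef_ and intercept_ play the roles of w and b above, and the reported support vectors are the training points lying on (or closest to) H1 and H2.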

Page 28: Computational Learning Theory and Kernel Methods

28

Ties to Least Squares

[Figure: least-squares fit of y versus x, with intercept b.]

$y = f(x) = w \cdot x + b$

Loss Function: $L(w, b) = \sum_{i=1}^{l}\left(y_i - w \cdot x_i - b\right)^2$
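
For comparison (an illustrative numpy sketch with assumed toy data), the least-squares fit above can be computed directly:

```python
import numpy as np

# Ordinary least squares for f(x) = w . x + b.
X = np.array([[0.0], [1.0], [2.0], [3.0]])     # assumed 1-D inputs
y = np.array([0.1, 0.9, 2.1, 2.9])             # assumed targets

A = np.hstack([X, np.ones((len(X), 1))])       # append a column of 1s for b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes sum_i (y_i - w.x_i - b)^2
w, b = coef[:-1], coef[-1]
print(w, b)
```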

Page 29: Computational Learning Theory and Kernel Methods

29

Lagrangian Formulation

1. Transform the constraints into Lagrange multipliers
2. The training data will then appear only in the form of dot products

Let $\alpha_i,\ i = 1, \ldots, l$, be positive Lagrange multipliers.

We have the Lagrangian:

$L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i$,  s.t. $\alpha_i \ge 0$

Page 30: Computational Learning Theory and Kernel Methods

30

Transform the convex quadratic programming problem

Observations: minimizing $L_P$ w.r.t. w and b, while simultaneously requiring that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish, subject to the constraints $\alpha_i \ge 0$,

is a convex quadratic programming problem that can be more easily solved in its dual form.

Page 31: Computational Learning Theory and Kernel Methods

31

Transform the convex quadratic programming problem – the Dual

$L_P$'s dual: maximize $L_P$ subject to the constraints that the gradients of $L_P$ w.r.t. w and b vanish, and that $\alpha_i \ge 0$.

The gradient conditions give: $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$

Substituting these into $L_P$ gives the dual: $L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$
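
To make the dual concrete (an illustrative numpy sketch, not part of the slides; the function names are hypothetical):

```python
import numpy as np

def dual_objective(alpha, y, X):
    """L_D = sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    G = np.outer(y, y) * (X @ X.T)
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def weight_vector(alpha, y, X):
    """w = sum_i alpha_i y_i x_i; only support vectors (alpha_i > 0) contribute."""
    return (alpha * y) @ X
```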

Page 32: Computational Learning Theory and Kernel Methods

32

Observations about the Dual

$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$

• There is a Lagrange multiplier $\alpha_i$ for every training point.
• In the solution, points for which $\alpha_i > 0$ are called “support vectors”. They lie on either H1 or H2.
• Support vectors are the critical elements of the training set; they lie closest to the decision boundary.
• If all other points were removed or moved around (without crossing H1 or H2), the same separating hyperplane would be found.

Page 33: Computational Learning Theory and Kernel Methods

33

Prediction

• Solving the SVM problem is equivalent to finding a solution to the Karush-Kuhn-Tucker (KKT) conditions (the KKT conditions hold at the solution of any suitably regular constrained optimization problem; for this convex problem with linear constraints they are both necessary and sufficient).

Once we have solved for w and b, we predict the class of a new point x as $\mathrm{sign}(w \cdot x + b)$.
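
A minimal sketch of this decision rule (assumed names, not from the slides):

```python
import numpy as np

def predict(X_new, w, b):
    """Classify each row x of X_new as sign(w . x + b)."""
    return np.sign(X_new @ w + b)
```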

Page 34: Computational Learning Theory and Kernel Methods

34

Linear SVM: The Non-Separable Case

We account for outliers by introducing slack variables $\xi_i$:

$x_i \cdot w + b \ge +1 - \xi_i$ for $y_i = +1$
$x_i \cdot w + b \le -1 + \xi_i$ for $y_i = -1$
$\xi_i \ge 0\ \ \forall i$

We penalize outliers by changing the cost function to:

$\min\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$

Page 35: Computational Learning Theory and Kernel Methods

35

Example of Linear SVM with slacks

Page 36: Computational Learning Theory and Kernel Methods

36

Linear SVM Classification Examples

Linearly Separable Linearly Non-Separable

Page 37: Computational Learning Theory and Kernel Methods

37

Nonlinear SVM

Observation: the data appear only in the form of dot products in the training problem.

So we can use a mapping function $\Phi$ to map the data into a high-dimensional space where the points are linearly separable:

$\Phi: R^d \to H$

To make things easier, we define a kernel function K s.t.

$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
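
To make this definition concrete (an illustrative sketch, not from the slides), the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on $R^2$ equals the dot product under the explicit map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2 computes phi(x) . phi(z) without forming phi."""
    return (x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), poly2_kernel(x, z))   # both print 1.0
```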

Page 38: Computational Learning Theory and Kernel Methods

38

Nonlinear SVM (cont.)

Kernel functions can compute dot products in the high-dimensional space without explicitly working with $\Phi$.

Example: $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$

Rather than computing w, we make a prediction on x via:

$f(x) = \sum_{i=1}^{N_S} \alpha_i y_i\, \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N_S} \alpha_i y_i\, K(s_i, x) + b$

where the $s_i$ are the $N_S$ support vectors.
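
A minimal sketch of this kernel-based prediction (not from the slides; alpha, y, b, and the support vectors would come from solving the dual, and sigma is an assumed parameter):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def kernel_predict(x, support_vectors, alpha, y, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i y_i K(s_i, x) + b, summed over the support vectors."""
    return sum(a_i * y_i * kernel(s_i, x)
               for a_i, y_i, s_i in zip(alpha, y, support_vectors)) + b
```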

Page 39: Computational Learning Theory and Kernel Methods

39

Example of mapping

Image, in H, of the square $[-1, 1] \times [-1, 1] \subset R^2$ under the mapping $\Phi$.

Page 40: Computational Learning Theory and Kernel Methods

40

Example Kernel Functions

Kernel functions must satisfy Mercer's condition, or, put simply, the Hessian matrix

$H_{ij} = y_i y_j K(x_i, x_j)$

must be positive semidefinite (i.e., have non-negative eigenvalues).

Example kernels:

$K(x_i, x_j) = (x_i \cdot x_j + 1)^p$

$K(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j - \delta)$
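
A quick numerical way to sanity-check this condition on a given training set (an illustrative numpy sketch with assumed toy data):

```python
import numpy as np

def is_positive_semidefinite(H, tol=1e-10):
    """True iff all eigenvalues of the symmetric matrix H are (numerically) non-negative."""
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# H_ij = y_i y_j K(x_i, x_j) with the linear kernel on assumed toy data.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
H = np.outer(y, y) * (X @ X.T)
print(is_positive_semidefinite(H))   # True: the linear kernel satisfies the condition
```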

Page 41: Computational Learning Theory and Kernel Methods

41

Nonlinear SVM Classification Examples (Degree 3 Polynomial Kernel)

Linearly Separable Linearly Non-Separable

Page 42: Computational Learning Theory and Kernel Methods

42

Multi-Class SVM

1. One-against-all

2. One-against-one (majority vote)

3. One-against-one (DAGSVM)
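
For illustration (not part of the slides), the first two strategies are available as meta-estimators in scikit-learn; the toy data are assumptions, and DAGSVM has no built-in scikit-learn implementation.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.1],
              [2.0, 2.0], [2.1, 1.9],
              [4.0, 0.0], [4.2, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])   # three classes

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one-against-all
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one-against-one (majority vote)
print(ova.predict([[2.0, 1.8]]), ovo.predict([[2.0, 1.8]]))
```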

Page 43: Computational Learning Theory and Kernel Methods

43

Global Solution and Uniqueness

• Every local solution is also global (property of any convex programming problem)

• Solution is guaranteed unique if the objective function is strictly convex (Hessian matrix is positive definite)

Page 44: Computational Learning Theory and Kernel Methods

44

Complexity and Scalability

Curse of dimensionality:
1. The proliferation of parameters causes intractable complexity
2. The proliferation of parameters causes overfitting

SVMs circumvent these via the use of:
1. Kernel functions (the kernel trick), which compute the needed dot products at $O(d_L)$ cost (the input-space dimension)
2. Support vectors, which focus attention on the “boundary”

Page 45: Computational Learning Theory and Kernel Methods

45

Structural Risk Minimization

Empirical Risk:

$R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l}\left|y_i - f(x_i, \alpha)\right|$

Expected Risk (bound holding with probability $1 - \eta$, where h is the VC dimension):

$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}$
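
As a small numerical illustration of this bound (not from the slides; the example values of h, l, and eta are assumptions):

```python
import math

def vc_confidence(h: int, l: int, eta: float) -> float:
    """sqrt((h * (log(2l/h) + 1) - log(eta/4)) / l), the VC confidence term."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def risk_bound(r_emp: float, h: int, l: int, eta: float = 0.05) -> float:
    """Upper bound on the expected risk, holding with probability 1 - eta."""
    return r_emp + vc_confidence(h, l, eta)

print(risk_bound(r_emp=0.1, h=10, l=1000))
```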

Page 46: Computational Learning Theory and Kernel Methods

46

Structural Risk Minimization

Nested subsets of functions, ordered by VC dimensions