Rapid Introduction to Machine Learning/Deep Learning
Hyeong In Choi
Seoul National University
Lecture 2b: Statistical learning theory and its consequences
October 2, 2015
Table of contents
1. Objectives of Lecture 2b
2. Statistical learning theory
   2.1. Nature of data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^n$
   2.2. Errors
   2.3. Machine learning in a nutshell
   2.4. VC dimension
   2.5. VC theory
3. Overfitting
   3.1. Error vs. model complexity
   3.2. VC bound of the generalization error
4. Tikhonov regularization
   4.1. Linear regression
   4.2. Lasso regression
   4.3. Digression: convex optimization
   4.4. Mitigating the overfitting problem via regularization
1. Objectives of Lecture 2b
Objective 1
Understand the probabilistic model of the nature of machine learning data
Objective 2
Understand the errors: empirical vs. generalization
Objective 3
Learn the outline of Vapnik-Chervonenkis theory
Objective 4
Understand the relationship between model complexity and the various errors
Objective 5
Understand the nature of overfitting and how to mitigate it via regularization
Objective 6
As an aside, learn about the sparsity of L1 optimization solutions
2. Statistical learning theory
Question on generalization
Suppose a classifier is found that works well for given data. Does this mean it will work just as well for similar data in the future? We need to pin down what we mean by “similar”.
2.1. Nature of data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^n$
Deterministic label
Assume there is a probability distribution $P_X$ on $\mathbb{R}^d$ such that $x^{(1)}, \cdots, x^{(n)}$ are IID samples $\sim P_X$
Assume there is a deterministic function $\alpha : X \to Y$, where $X = \mathbb{R}^d$
Random label
The label $y$, given $x$, is obtained as a random (IID) sample $\sim P(Y = y \mid X = x)$
Combining $P_X(X = x)$ and $P_{Y|X}(Y = y \mid X = x)$, we get $P(X = x, Y = y) = P_{X,Y}(x, y)$
So this case amounts to assuming the existence of $P_{X,Y}$ and regarding $(x^{(1)}, y^{(1)}), \cdots, (x^{(n)}, y^{(n)})$ as IID samples $\sim P_{X,Y}$
We assume this is the case for the rest of our lectures
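To make this data model concrete, here is a minimal Python sketch, assuming a made-up joint distribution $P_{X,Y}$: a Gaussian $P_X$ and a noisy-sign conditional label, i.e. the “random label” setting above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint(n, d=2):
    """Draw n IID samples (x, y) ~ P_{X,Y}.

    Hypothetical choice: X ~ N(0, I_d), and given X = x the label Y
    is sign(x_1) flipped with probability 0.1, so Y given X = x is
    itself random, as in the 'random label' setting."""
    X = rng.standard_normal((n, d))      # x^(1), ..., x^(n) IID ~ P_X
    flip = rng.random(n) < 0.1           # 10% label noise
    y = np.where(flip, -1.0, 1.0) * np.sign(X[:, 0])
    return X, y

X, y = sample_joint(n=100)               # the data D
```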
2.2. Errors
Generalization error
Let $(X, Y)$ be a random variable $\sim P_{X,Y}$. Let $f : X \to Y$ be any function, which we regard as a classifier or a decision function. Define its loss by $\ell(f(X), Y) = I(f(X) \neq Y)$.
Note: loss is a random variable.
The risk of the classifier $f : X \to Y$ is defined as $R(f) = E[\ell(f(X), Y)]$. This risk is also called the generalization error or the out-of-sample error.
Empirical error
Given data $D = \{(x_i, y_i)\}_{i=1}^n$ and a classifier $f : X \to Y$, define its empirical error, or empirical risk, $R_{emp}(f)$, by
$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^n I(f(x_i) \neq y_i).$$
This empirical error is also called the in-sample error.
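A small sketch of both errors, reusing `sample_joint` from the sketch above. $R_{emp}(f)$ uses only the data; $R(f)$ is approximated by Monte Carlo, which is possible here only because in this toy setting we can sample $P_{X,Y}$ at will (in practice we cannot).

```python
import numpy as np

def empirical_error(f, X, y):
    """R_emp(f): fraction of the given data the classifier gets wrong."""
    return float(np.mean(f(X) != y))

def generalization_error(f, n_mc=100_000):
    """Monte Carlo estimate of R(f) = E[I(f(X) != Y)] under P_{X,Y}."""
    X, y = sample_joint(n_mc)
    return float(np.mean(f(X) != y))

f = lambda X: np.sign(X[:, 0])           # a hypothetical fixed classifier
X, y = sample_joint(n=100)
print(empirical_error(f, X, y), generalization_error(f))
```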
Observables and unobservables
In general, it is assumed that the nature of $P_{X,Y}$ is not revealed to us (though sometimes we may assume more about $P_{X,Y}$)
Thus $P_{X,Y}$ is used only as a general theoretical background construct that is devoid of much concrete information
$R_{emp}(f)$ is what we can observe from the data
$R(f)$ is what we are truly interested in knowing
Statistical learning theory deals with this general situation, and it bridges the gap between $R_{emp}(f)$ and $R(f)$
2.3. Machine learning in a nutshell
General framework of machine learning
Postulate a set $F$ of candidate classifiers
Pick the one from $F$ that is “best” according to some error-minimization criterion
Possible error criteria include:
$R_{emp}(f)$
$R_{emp}(f) + \lambda \cdot (\text{regularizing term})$
some other expression involving the given data in one way or another
But the ultimate aim is to control (minimize) $R(f)$
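A minimal sketch of this framework, with $R_{emp}(f)$ as the error criterion and a hypothetical finite candidate set $F$ of one-dimensional threshold classifiers:

```python
import numpy as np

def erm(F, X, y):
    """Empirical risk minimization: return the f in F with the
    smallest empirical error on the given data."""
    errors = [np.mean(f(X) != y) for f in F]
    return F[int(np.argmin(errors))]

# Hypothetical candidate set: thresholds on the first coordinate.
thresholds = np.linspace(-2.0, 2.0, 41)
F = [lambda X, t=t: np.where(X[:, 0] > t, 1.0, -1.0) for t in thresholds]

X, y = sample_joint(n=100)   # data from the earlier sketch
f_best = erm(F, X, y)
```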
Further issues of machine learning
How to postulate $F$? Which choice holds more promise?
How to get a “good” estimate of $R(f)$?
How to “massage” (transform) the input so that the truly discerning characteristics of the input stand out in the feature set?
What is a good regularizer?
How to select a good model (hyperparameters)?
2.4. VC dimension
Shattering
Example: suppose there are 3 points in $\mathbb{R}^2$. There are $2^3 = 8$ ways of labeling them with x's and o's
In all 8 cases, the x's and o's can be separated by a linear classifier
Definition: we say these 3 points are shattered by $F$, where $F$ is the set of classifiers given by lines in $\mathbb{R}^2$
Example 2: But no set of 4 points in $\mathbb{R}^2$ can be shattered by such an $F$. For a proof, consider, e.g., the XOR-type labeling in which diagonally opposite points of a quadrilateral receive the same label: no line can realize it
Example 3: Even in the case of 3 points, if they are aligned on a line, they cannot be shattered
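These examples can be verified by brute force. A sketch, assuming scipy is available: a labeling is linearly separable exactly when the feasibility LP “find $(w, b)$ with $y_i(w \cdot x_i + b) \ge 1$ for all $i$” has a solution, so shattering means every labeling passes this test.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?"""
    n, d = X.shape
    # Variables z = (w, b); constraints -y_i * ([x_i, 1] . z) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # 0 means a feasible point was found

def shattered(X):
    """True iff every +/-1 labeling of the rows of X is separable."""
    return all(separable(X, np.array(lab))
               for lab in itertools.product([-1.0, 1.0], repeat=len(X)))

print(shattered(np.array([[0., 0.], [1., 0.], [0., 1.]])))  # True
print(shattered(np.array([[0., 0.], [1., 1.], [2., 2.]])))  # False (collinear)
print(shattered(np.array([[0., 0.], [1., 0.],
                          [0., 1.], [1., 1.]])))            # False (4 points)
```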
VC (Vapnik-Chervonenkis) dimension
The VC (Vapnik-Chervonenkis) dimension of $F$, denoted by $VC(F)$, is the maximal number of points that can be shattered by $F$
Thus, for the set $F$ of classifiers defined by lines in $\mathbb{R}^2$, $VC(F) = 3$
The VC dimension measures how complex the set $F$ is
The bigger $VC(F)$ is, the richer (more versatile) $F$ is
Then it is more likely that the empirical error for the given data gets smaller, i.e. smaller $R_{emp}(f)$
But it may increase the chance of fitting too closely to the given data, at the risk of increased generalization error, i.e. larger $R(f)$
Example of infinite VC dimension
Let $X$ be the circle in $\mathbb{R}^2$
Any finite number of points on it can be shattered by $F$
Thus $VC(F) = \infty$
2.5. VC theory
Theorem (Vapnik-Chervonenkis)
For any probability $P_{X,Y}$, any data $(x^{(1)}, y^{(1)}), \cdots, (x^{(n)}, y^{(n)})$ obtained as IID samples $\sim P_{X,Y}$, and any set $F$ of binary classifiers, we have
$$P\Big(\sup_{f \in F} |R_{emp}(f) - R(f)| > \varepsilon\Big) \le 8\,S(F, n)\,e^{-n\varepsilon^2/32}$$
Remark: the constants in the above inequality are not sharp and can be improved somewhat
Shatter coefficient
Define
$$F(x_1, \cdots, x_n) = \{(f(x_1), \cdots, f(x_n)) : f \in F\},$$
$$S(F, n) = \max_{x_1, \cdots, x_n \in X} |F(x_1, \cdots, x_n)|.$$
This $S(F, n)$ is called the growth function or the shatter coefficient
Theorem of Sauer
Suppose the VC dimension is finite. Then $S(F, n)$ is a polynomial of order $O(n^D)$, where $D$ is the VC dimension
Thus we have
$$P\Big(\sup_{f \in F} |R_{emp}(f) - R(f)| > \varepsilon\Big) \lesssim 8 \exp\Big\{D \log n - \frac{n\varepsilon^2}{32}\Big\},$$
where the symbol $\lesssim$ denotes asymptotic inequality as $n \to \infty$. In particular,
$$P\Big(\sup_{f \in F} |R_{emp}(f) - R(f)| > \varepsilon\Big) \to 0$$
as $n \to \infty$
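Numerically, this bound is vacuous for small $n$ and collapses only once $n\varepsilon^2/32$ overtakes $D \log n$. A small sketch with illustrative values $D = 3$ (lines in $\mathbb{R}^2$) and $\varepsilon = 0.1$:

```python
import numpy as np

def vc_bound(n, D, eps):
    """The Sauer-form bound 8 * exp(D*log(n) - n*eps**2 / 32) on
    P(sup_f |R_emp(f) - R(f)| > eps)."""
    return 8.0 * np.exp(D * np.log(n) - n * eps**2 / 32.0)

# The bound is astronomically large (useless) until n is huge,
# then decays extremely fast.
for n in [10**4, 10**5, 10**6, 10**7]:
    print(f"n = {n:>8d}: bound = {vc_bound(n, D=3, eps=0.1):.3e}")
```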
The VC theorem says that, with the exception of probability $\delta(F, n, \varepsilon) = 8\,S(F, n)\,e^{-n\varepsilon^2/32}$, we have
$$|R_{emp}(f) - R(f)| \le \varepsilon$$
for all $f \in F$ uniformly; and
$$\delta(F, n, \varepsilon) \to 0$$
as $n \to \infty$. Thus
$$R(f) = \big(R(f) - R_{emp}(f)\big) + R_{emp}(f) \le \varepsilon + R_{emp}(f)$$
In conclusion
This says that $R(f)$ can be controlled by $R_{emp}(f)$ as long as we can take $n$ big enough
But the question is how big $n$ has to be
The estimate in the VC theorem is quite loose, as it is valid regardless of which $F$ or $P_{X,Y}$ is taken
It means the number $n$ required by the VC theorem is too big to be practical
VC theory should be taken as a theoretical guideline, not as a practical estimate
Nonetheless, VC theory is still one of the most fundamental results in machine learning and a truly remarkable feat!
3. Overfitting
3.1. Error vs. model complexity
Error vs. model complexity
VC dimension is one way of measuring model complexity
But model complexity may also be understood rather intuitively
It is a general tendency that as the model complexity increases, the empirical error decreases
But the generalization error behaves differently: although it decreases initially as the complexity increases, beyond a certain threshold $d^*$ the generalization error starts to increase with model complexity (see the sketch below)
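A small numerical illustration of this U-shape, with polynomial degree standing in for model complexity. The setup is made up (a cubic ground truth plus noise), and the held-out test error serves as a proxy for the generalization error:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical P_{X,Y}: a noisy cubic on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, n)
    return x, x**3 - x + 0.1 * rng.standard_normal(n)

x_tr, y_tr = make_data(30)       # small training set
x_te, y_te = make_data(10_000)   # proxy for the generalization error

for deg in [1, 3, 6, 12]:
    coef = np.polyfit(x_tr, y_tr, deg)   # least-squares polynomial fit
    tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"degree {deg:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

The train MSE keeps falling as the degree grows, while the test MSE bottoms out near the true complexity (degree 3, playing the role of $d^*$) and then climbs.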
This phenomenon is generally dubbed “overfitting”
So the “best” choice of model complexity is the threshold value $d^*$
The trouble is that we have no access to the generalization error per se, so it is difficult to determine $d^*$ (but we can devise proxies for the generalization error)
How to overcome the overfitting problem:
Regularization
Model selection via validation
3.2. VC bound of the generalization error
VC theory
$$R(f) = R_{emp}(f) + \big(R(f) - R_{emp}(f)\big)$$
Let $\delta = 8\,S(F, n)\,e^{-n\varepsilon^2/32}$. Then
$$\varepsilon = \varepsilon(F, n, \delta) = \sqrt{\frac{32}{n}\{\log(8/\delta) + \log S(F, n)\}}$$
Thus for any $\delta > 0$, with probability at least $1 - \delta$,
$$|R(f) - R_{emp}(f)| \le \varepsilon(F, n, \delta)$$
But this is a loose bound, and even so it is difficult to get hold of
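A sketch of just how loose, assuming Sauer's estimate $S(F, n) \le (en/D)^D$, so that $\log S(F, n) \le D(1 + \log(n/D))$:

```python
import numpy as np

def vc_epsilon(n, D, delta):
    """epsilon(F, n, delta), with log S(F, n) bounded via Sauer's lemma."""
    log_S = D * (1.0 + np.log(n / D))
    return np.sqrt((32.0 / n) * (np.log(8.0 / delta) + log_S))

# With D = 3 and 95% confidence the guarantee is nearly vacuous at
# n = 1000, and even at n = 10**5 the certified gap is still above 0.1.
for n in [10**3, 10**4, 10**5, 10**6]:
    print(f"n = {n:>7d}: eps = {vc_epsilon(n, D=3, delta=0.05):.3f}")
```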
4. Tikhonov regularization
4.1. Linear regression
Example
Data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^n$, where $x^{(t)} = (x_1^{(t)}, \cdots, x_d^{(t)})$ and $y^{(t)} \in \mathbb{R}$
Linear regression:
$$y = \sum_{j=1}^d \omega_j x_j + b$$
Empirical error:
$$E_{emp} = \sum_{t=1}^n \Big(\sum_{j=1}^d \omega_j x_j^{(t)} + b - y^{(t)}\Big)^2$$
Linear regression minimizes $E_{emp}$
Tikhonov regularization
Augmented error:
$$E_{aug} = E_{emp} + \lambda\{\|\omega\|^2 + b^2\}$$
Tikhonov regularization minimizes $E_{aug}$
$E_{aug}$ is a surrogate for the generalization error
The ill-posed problem becomes well-posed
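A minimal numpy sketch of this minimization in closed form (the normal equations); the bias $b$ is absorbed as one more weight via a constant feature, so it is penalized along with $\omega$, matching the $E_{aug}$ above:

```python
import numpy as np

def tikhonov_fit(X, y, lam):
    """Minimize E_aug = sum_t (w . x_t + b - y_t)**2 + lam*(||w||^2 + b^2).

    With A = [X | 1] and z = (w, b), E_aug = ||A z - y||^2 + lam*||z||^2,
    whose unique minimizer solves (A^T A + lam*I) z = A^T y."""
    A = np.hstack([X, np.ones((len(X), 1))])
    z = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    return z[:-1], z[-1]   # (omega, b)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(50)
omega, b = tikhonov_fit(X, y, lam=1.0)
```

For $\lambda > 0$ the matrix $A^T A + \lambda I$ is positive definite even when $A^T A$ is singular, which is exactly the sense in which the ill-posed problem becomes well-posed.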
4.2. Lasso regression
Lasso regression
Lasso regression minimizes
$$E_{aug} = E_{emp} + \lambda\{\|\omega\|_1 + |b|\},$$
where $\|\omega\|_1 = |\omega_1| + \cdots + |\omega_d|$
Why $\|\omega\|_1$?
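One empirical answer, assuming scikit-learn is available (its `Lasso` penalizes only the coefficients, not the intercept, a small departure from the $E_{aug}$ above): the $L^1$ penalty drives most coefficients exactly to zero, while the $L^2$ penalty of ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
# Only coordinates 0 and 3 actually matter in this made-up data.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(100)

print("lasso:", Lasso(alpha=0.1).fit(X, y).coef_.round(2))  # mostly zeros
print("ridge:", Ridge(alpha=0.1).fit(X, y).coef_.round(2))  # small, nonzero
```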
4.3. Digression: convex optimization
Example
minimize $(x - a)^2 + (y - b)^2$
subject to $|x| + |y| = 1$
Normal space
For a large (positive-measure) set of $(a, b)$, the optimum occurs at a corner point of the diamond
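A numerical sanity check of the corner phenomenon, contrasting the diamond $|x| + |y| = 1$ with the smooth constraint $x^2 + y^2 = 1$. Both curves are sampled densely, and a solution counts as sparse when one coordinate is (numerically) zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense samplings of the two constraint sets.
t = np.linspace(0.0, 2.0 * np.pi, 20001)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
s = np.linspace(-1.0, 1.0, 10001)
diamond = np.concatenate([np.stack([s, 1.0 - np.abs(s)], axis=1),
                          np.stack([s, np.abs(s) - 1.0], axis=1)])

def sparse_fraction(pts, trials=2000, tol=1e-2):
    """How often the minimizer of (x-a)^2 + (y-b)^2 over the sampled
    constraint set, for random centers (a, b), has a near-zero coordinate."""
    hits = 0
    for _ in range(trials):
        c = rng.uniform(-3.0, 3.0, 2)
        p = pts[np.argmin(((pts - c) ** 2).sum(axis=1))]
        hits += np.min(np.abs(p)) < tol
    return hits / trials

print("circle :", sparse_fraction(circle))    # tiny, shrinks with tol
print("diamond:", sparse_fraction(diamond))   # sizable: corners attract mass
```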
Convex optimization with L1 term
For the higher-dimensional problem,
$$\text{minimize } f(x_1, \cdots, x_d) \quad \text{subject to } \sum_{i=1}^d |x_i| = 1,$$
where $f(x_1, \cdots, x_d)$ is a convex function
Look at the optimization problem:
$$(P)\quad \text{minimize } f(x_1, \cdots, x_d) + \lambda\|x\|_1$$
Associated optimization problem:
$$(P_\alpha)\quad \text{minimize } f(x_1, \cdots, x_d) \quad \text{subject to } \|x\|_1 = \alpha$$
Let $m_\alpha$ be the optimal value of $(P_\alpha)$, and let $\alpha^* = \operatorname{argmin}_\alpha (m_\alpha + \lambda\alpha)$
Then $(P)$ is equivalent to
$$\text{minimize } f(x_1, \cdots, x_d) + \lambda\alpha^* \quad \text{subject to } \|x\|_1 = \alpha^*$$
As in the two-dimensional example, the optimal solution typically occurs at a corner (or low-dimensional face) of $\sum_{i=1}^d |x_i| = \alpha^*$, i.e. many coordinates are zero
Thus the solution of $(P)$ is also sparse
4.4. Mitigating the overfitting problem via regularization
The objective function of a regularization problem is chosen in such a way as to control the possible range of the parameters (sort of like putting on a brake)
The regularization idea, even without an explicit optimization formulation, is a way of restricting the possible range of the parameters in question
It is used in many contexts throughout machine learning