
Page 1

Rapid Introduction to Machine Learning/Deep Learning

Hyeong In Choi

Seoul National University

Page 2

Lecture 2b: Statistical learning theory and its consequences

October 2, 2015

Page 3

Table of contents

1. Objectives of Lecture 2b

2. Statistical learning theory
  2.1. Nature of data D = {(x^(t), y^(t))}_{t=1}^n
  2.2. Errors
  2.3. Machine learning in a nutshell
  2.4. VC dimension
  2.5. VC theory

3. Overfitting
  3.1. Error vs. model complexity
  3.2. VC bound of the generalization error

4. Tikhonov regularization
  4.1. Linear regression
  4.2. Lasso regression
  4.3. Digression: convex optimization
  4.4. Mitigating the overfitting problem via regularization

Page 4

1. Objectives of Lecture 2b

Objective 1

Understand the probabilistic model of the nature of machine learning data

Objective 2

Understand the errors: empirical vs. generalization

Objective 3

Learn about the outline of Vapnik-Chervonenkis theory

Page 5

Objective 4

Understand the relationship between the model complexity and various errors

Objective 5

Understand the nature of overfitting and how to mitigate it via regularization

Objective 6

As an aside, learn about the sparsity of L1 optimization solutions

Page 6

2. Statistical learning theory

Question on generalization

Suppose a classifier is found that works well for given data. Does this mean it will work just as well for similar data in the future? We need to pin down what we mean by “similar”.

Page 7

2.1. Nature of data D = {(x^(t), y^(t))}_{t=1}^n

Deterministic label

Assume there is a probability distribution P_X on R^d such that x^(1), ..., x^(n) are IID samples ∼ P_X

Assume there is a deterministic function α : X → Y, where X = R^d, so that y^(t) = α(x^(t))

Random label

The label y, given x, is obtained as a random (IID) sample ∼ P(Y = y | X = x)

Combining P_X(X = x) and P_{Y|X}(Y = y | X = x), we get P(X = x, Y = y) = P_{X,Y}(x, y)

So this case amounts to assuming the existence of P_{X,Y} and regarding (x^(1), y^(1)), ..., (x^(n), y^(n)) as IID samples ∼ P_{X,Y}

We assume this is the case for the rest of our lectures
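As a concrete illustration (not part of the original slides), the following sketch generates data exactly in this way: draw x ∼ P_X, then draw the label y from P(Y = y | X = x). The specific choices of P_X (a standard normal) and P(Y | X) (a logistic conditional) are made up for illustration.

```python
# Illustrative sketch of the data model: n IID samples (x^(t), y^(t)) ~ P_{X,Y},
# generated by first drawing x ~ P_X and then y ~ P(Y = y | X = x).
# The particular P_X and P(Y | X) below are made-up choices.
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(d=2):
    x = rng.normal(size=d)                    # x ~ P_X = N(0, I_d)
    p1 = 1.0 / (1.0 + np.exp(-x.sum()))       # P(Y = 1 | X = x): a logistic conditional
    y = int(rng.random() < p1)                # random label drawn from P(Y | X = x)
    return x, y

n = 5
D = [sample_pair() for _ in range(n)]         # D = {(x^(t), y^(t))}_{t=1}^n
for x, y in D:
    print(np.round(x, 2), y)
```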

Page 8

2.2. Errors

Generalization error

Let (X, Y) be a random variable ∼ P_{X,Y}. Let f : X → Y be any function, which we regard as a classifier or a decision function. Define its loss by ℓ(f(X), Y) = I(f(X) ≠ Y).

Note: the loss is a random variable.

The risk of the classifier f : X → Y is defined as R(f) = E[ℓ(f(X), Y)]. This risk is also called the generalization error or the out-of-sample error.

Page 9

Empirical error

Given data D = {(x_i, y_i)}_{i=1}^n and a classifier f : X → Y, define its empirical error, or empirical risk, Remp(f) by

Remp(f) = (1/n) Σ_{i=1}^n I(f(x_i) ≠ y_i).

This empirical error is also called the in-sample error.
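A small sketch (not from the slides) to make the two notions concrete: with a made-up toy P_{X,Y} and a fixed classifier f, Remp(f) is computed from a finite sample, while R(f) is approximated by Monte Carlo on a very large fresh sample (something we can only do here because the toy distribution is known).

```python
# Compare the empirical error Remp(f) with a Monte Carlo estimate of the
# generalization error R(f) = E[I(f(X) != Y)] for a made-up P_{X,Y} and f.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.normal(size=(n, 2))                  # X ~ N(0, I_2)
    p1 = 1.0 / (1.0 + np.exp(-2.0 * x[:, 0]))    # P(Y = 1 | X = x)
    y = (rng.random(n) < p1).astype(int)
    return x, y

def f(x):                                        # a fixed classifier f : X -> Y
    return (x[:, 0] > 0).astype(int)

x_n, y_n = sample(50)                            # the data D we actually observe
R_emp = np.mean(f(x_n) != y_n)                   # empirical (in-sample) error

x_big, y_big = sample(1_000_000)                 # large fresh sample standing in for P_{X,Y}
R_mc = np.mean(f(x_big) != y_big)                # Monte Carlo estimate of R(f)

print(f"Remp(f) = {R_emp:.3f},  R(f) (Monte Carlo) = {R_mc:.3f}")
```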

Page 10

Observables and unobservables

In general, it is assumed that the nature of P_{X,Y} is not revealed to us (but sometimes we may assume more about P_{X,Y})

Thus P_{X,Y} is used only as a general theoretical background construct that is devoid of much concrete information

Remp(f) is what we can observe from the data

R(f) is what we are truly interested in knowing

Statistical learning theory deals with this general situation, and it bridges the gap between Remp(f) and R(f)

Page 11

2.3. Machine learning in a nutshell

General framework of machine learning

Postulate a set F of candidate classifiers

Pick the one from F that is “best” according to some minimal error criterion

The error criteria are:

Remp(f)

Remp(f) + λ · (regularizing term)

some other expression involving the given data one way or another

But the ultimate aim is to control (minimize) R(f)
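A minimal sketch (not from the slides) of this framework with the first criterion, Remp(f): F is a small, hand-made family of one-coordinate threshold classifiers, and we simply pick the member of F with the smallest empirical error.

```python
# Empirical risk minimization over a small hand-made family F of
# threshold classifiers f_{j,s}(x) = I(x_j > s).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)     # noisy labels

candidates = [(j, s) for j in (0, 1) for s in np.linspace(-2, 2, 41)]

def remp(j, s):
    return np.mean((x[:, j] > s).astype(int) != y)              # empirical error of f_{j,s}

j_best, s_best = min(candidates, key=lambda c: remp(*c))
print(f"chosen classifier: I(x_{j_best} > {s_best:.2f}),  Remp = {remp(j_best, s_best):.3f}")
```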

Page 12

Further issues of machine learning

How to postulate F? Which choice holds better hope?

How to get a “good” estimate of R(f)?

How to “massage” (transform) the input so that the truly discerning characteristics of the input stand out in the feature set?

What is a good regularizer?

How to select a good model (hyperparameters)?

Page 13

2.4. VC dimension

Shattering

Example: suppose there are 3 points in R^2. There are 2^3 = 8 ways of labeling them with x’s or o’s

Page 14

In all 8 cases, the x’s and o’s can be separated by linear classifiers

Definition: we say these 3 points are shattered by F, where F is the set of classifiers given by lines in R^2

Page 15

Example 2: But no set of 4 points in R^2 can be shattered by such F. For a proof, look at the labeling in which the two diagonal pairs of points get opposite labels (the XOR configuration): no line can separate the x’s from the o’s

Example 3: Even for the case of 3 points, if they are aligned on a line, they cannot be shattered
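These examples can be checked mechanically. The sketch below (not part of the slides) brute-forces every ±1 labeling of a point set and tests each labeling for strict linear separability as a linear-programming feasibility problem via scipy.optimize.linprog; the point sets at the bottom are made-up instances of the three examples above.

```python
# Check whether a finite point set in R^2 is shattered by linear classifiers:
# a labeling is separable iff there exist w, b with y_i * (w . x_i + b) >= 1.
from itertools import product

import numpy as np
from scipy.optimize import linprog


def linearly_separable(points, labels):
    """True if some line w . x + b = 0 strictly separates the +1 and -1 points."""
    # LP feasibility in the variables (w1, w2, b):  -y_i (w . x_i + b) <= -1.
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success


def shattered(points):
    """True if every +-1 labeling of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))


three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]                 # general position
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]      # XOR labeling fails
collinear = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]             # aligned on a line

print("3 points, general position:", shattered(three))       # True
print("4 points:                  ", shattered(four))        # False
print("3 collinear points:        ", shattered(collinear))   # False
```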

Page 16

VC (Vapnik-Chervonenkis) dimension

The VC (Vapnik-Chervonenkis) dimension of F, denoted by VC(F), is the maximal number of points that can be shattered by F

Thus for the set F of the classifiers defined by lines in R^2, VC(F) = 3

VC dimension measures how complex the set F is

The bigger VC(F) is, the richer (more versatile) F is

Then it is more likely that the empirical error for the given data gets smaller, i.e. smaller Remp(f)

But it may increase the chance of fitting too closely to the given data, at the risk of increased generalization error, i.e. larger R(f)

Page 17

Example of infinite VC dimension

Let X = a circle in R^2

Any finite number of points (on the circle) can be shattered by F

Thus VC(F) = ∞

Page 18

2.5. VC theory

Theorem (Vapnik-Chervonenkis)

For any probability distribution P_{X,Y}, any data (x^(1), y^(1)), ..., (x^(n), y^(n)) obtained as IID samples ∼ P_{X,Y}, and any set F of binary classifiers, we have

P( sup_{f∈F} |Remp(f) − R(f)| > ε ) ≤ 8 S(F, n) e^(−nε²/32)

Remark: The numbers in the above inequality are not sharp, and can be improved somewhat. (The shatter coefficient S(F, n) is defined on the next page.)

Page 19

Shatter coefficient

Define

F(x_1, ..., x_n) = {(f(x_1), ..., f(x_n)) : f ∈ F},

S(F, n) = max_{x_1, ..., x_n ∈ X} |F(x_1, ..., x_n)|.

This S(F, n) is called the growth function or the shatter coefficient

Page 20

Theorem of Sauer

Suppose the VC dimension is finite. Then S(F, n) is bounded by a polynomial of order O(n^D), where D is the VC dimension

Thus we have:

P( sup_{f∈F} |Remp(f) − R(f)| > ε ) ≲ 8 exp{ D log n − nε²/32 },

where the symbol ≲ denotes asymptotic inequality as n → ∞. In particular,

P( sup_{f∈F} |Remp(f) − R(f)| > ε ) → 0,

as n → ∞
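To get a feel for the numbers, here is a quick sketch (not from the slides) that plugs a Sauer-type surrogate S(F, n) ≈ n^D into the bound 8 S(F, n) e^(−nε²/32); the values of D and ε are made-up choices.

```python
# Evaluate the VC-type tail bound 8 * n^D * exp(-n * eps^2 / 32) for growing n,
# using S(F, n) ~ n^D as a rough Sauer-type surrogate for the shatter coefficient.
import math

D = 3        # e.g. VC dimension of lines in R^2
eps = 0.05   # desired accuracy

for n in [10**3, 10**4, 10**5, 10**6, 10**7]:
    bound = 8 * n**D * math.exp(-n * eps**2 / 32)
    print(f"n = {n:>9,d}:  bound on P(sup_f |Remp(f) - R(f)| > {eps}) <= {bound:.3e}")
# The bound does go to 0, but it only becomes informative (below 1) for very
# large n, which is the point made on the "In conclusion" page below.
```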

Page 21

The VC theorem says that, with the exception of probability δ(F, n, ε) = 8 S(F, n) e^(−nε²/32), we have

|Remp(f) − R(f)| ≤ ε

for any f ∈ F uniformly; and

δ(F, n, ε) → 0

as n → ∞. Thus

R(f) = (R(f) − Remp(f)) + Remp(f) ≤ ε + Remp(f)

Page 22

In conclusion

This says that R(f) can be controlled by Remp(f) as long as we can take n big enough

But the question is how big n has to be

The estimate in the VC theorem is quite loose, as it is valid regardless of which F or P_{X,Y} is taken

It means the number n required by the VC theorem is too big to be practical

VC theory should be taken as a theoretical guideline, not as a practical estimate

Nonetheless, VC theory is still one of the most fundamental results in machine learning and a truly remarkable feat!

Page 23

3. Overfitting
3.1. Error vs. model complexity

Page 24

Error vs. model complexity

VC dimension is one way of measuring model complexity

But the model complexity may be understood rather intuitively

It is a general tendency that as the model complexity increases, the empirical error decreases

But the generalization error behaves differently: although it decreases initially as the complexity increases, beyond a certain threshold d*, the generalization error starts to increase as the model complexity increases further

Page 25

This phenomenon is generally dubbed “overfitting”

So the “best” choice of model complexity is the threshold value d*

The trouble is that we have no access to the generalization error per se, so it is difficult to determine d* (but we can devise proxies for the generalization error)

How to overcome the overfitting problem

Regularization

Model selection via validation
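A small sketch (not from the slides) of this picture, using polynomial degree as an intuitive measure of model complexity: the training error keeps decreasing with the degree, while the error on a held-out set (a simple proxy for the generalization error, in the spirit of validation) eventually turns around. The data-generating process is made up for illustration.

```python
# Fit polynomials of increasing degree to noisy data and compare the in-sample
# (training) MSE with the MSE on a held-out set, used as a proxy for the
# generalization error.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + 0.25 * rng.normal(size=n)
    return x, y

x_tr, y_tr = make_data(12)      # small training set
x_te, y_te = make_data(2000)    # large held-out set

for degree in [1, 3, 5, 9, 11]:
    coef = np.polyfit(x_tr, y_tr, degree)                    # least-squares fit
    mse_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train MSE = {mse_tr:8.4f},  held-out MSE = {mse_te:8.4f}")
# Expect the train MSE to shrink as the degree grows, while the held-out MSE
# eventually blows up beyond some threshold degree d*: overfitting.
```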

Page 26

3.2. VC bound of the generalization error

VC theory

R(f) = Remp(f) + (R(f) − Remp(f))

Let δ = 8 S(F, n) e^(−nε²/32). Then, solving for ε,

ε = ε(F, n, δ) = √( (32/n) { log(8/δ) + log S(F, n) } )

Thus for any δ > 0, with probability at least 1 − δ,

|R(f) − Remp(f)| ≤ ε(F, n, δ)

But this is a loose bound, and even then it is difficult to get hold of
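A quick numerical sketch (not from the slides) of how loose this is: evaluate ε(F, n, δ) with the Sauer-type surrogate log S(F, n) ≈ D log n; the values of D and δ are made-up choices.

```python
# Evaluate eps(F, n, delta) = sqrt((32/n) * (log(8/delta) + log S(F, n)))
# with the rough surrogate log S(F, n) ~ D * log n.
import math

def vc_epsilon(n, D, delta):
    log_S = D * math.log(n)                       # Sauer-type surrogate for log S(F, n)
    return math.sqrt((32.0 / n) * (math.log(8.0 / delta) + log_S))

D, delta = 3, 0.05
for n in [10**2, 10**3, 10**4, 10**5, 10**6, 10**7]:
    print(f"n = {n:>9,d}:  |R(f) - Remp(f)| <= {vc_epsilon(n, D, delta):.3f}  with prob. >= {1 - delta}")
# For small n the bound exceeds 1 (vacuous for the 0-1 loss); it shrinks slowly
# and becomes reasonably tight only for very large n, echoing Page 22.
```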

Page 27

4. Tikhonov regularization
4.1. Linear regression

Example

Data D = {(x^(t), y^(t))}_{t=1}^n

x^(t) = (x_1^(t), ..., x_d^(t)) ∈ R^d, y^(t) ∈ R

Linear regression

y = Σ_{j=1}^d ω_j x_j + b

Empirical error

Eemp = Σ_{t=1}^n ( Σ_{j=1}^d ω_j x_j^(t) + b − y^(t) )²

Page 28

Linear regression minimizes Eemp

Tikhonov regularization

Augmented error

Eaug = Eemp + λ { ||ω||² + b² }

Tikhonov regularization minimizes Eaug

Eaug is a surrogate for the generalization error

An ill-posed problem becomes well-posed
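A minimal sketch (not from the slides) of this in closed form: following the slide, the intercept b is penalized along with ω, so with the augmented design A = [X, 1] and parameter vector θ = (ω, b), minimizing ||Aθ − y||² + λ||θ||² gives θ = (AᵀA + λI)⁻¹ Aᵀy. The toy data are made up.

```python
# Tikhonov-regularized least squares in closed form, penalizing b as in Eaug.
import numpy as np

def tikhonov_fit(X, y, lam):
    n, d = X.shape
    A = np.hstack([X, np.ones((n, 1))])                        # augmented design [X, 1]
    theta = np.linalg.solve(A.T @ A + lam * np.eye(d + 1), A.T @ y)
    return theta[:d], theta[d]                                 # (omega, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=50)  # only 2 relevant features

for lam in [0.0, 1.0, 100.0]:
    w, b = tikhonov_fit(X, y, lam)
    print(f"lambda = {lam:6.1f}:  ||omega|| = {np.linalg.norm(w):.3f},  b = {b:+.3f}")
# Larger lambda shrinks the parameters toward zero -- the "brake" of Section 4.4.
```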

Page 29

4.2. Lasso regression

Lasso regression

Lasso regression minimizes

Eaug = Eemp + λ { ||ω||_1 + |b| },

where ||ω||_1 = |ω_1| + · · · + |ω_d|

Why ||ω||_1?
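As a preview of the answer (not part of the slides), the sketch below fits the same made-up data with an L2 (ridge/Tikhonov) penalty and with an L1 (lasso) penalty using scikit-learn. Note that scikit-learn's Ridge and Lasso do not penalize the intercept and scale the squared loss slightly differently from Eaug above, but the qualitative point survives: the L1 penalty drives many coefficients exactly to zero.

```python
# Contrast L2 and L1 penalties on the same data: L1 yields sparse coefficients.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)   # 2 relevant features out of 10

ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)      # L1 penalty

print("ridge coefficients:", np.round(ridge.coef_, 3))   # all nonzero, merely shrunk
print("lasso coefficients:", np.round(lasso.coef_, 3))   # most are exactly zero
```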

Page 30

4.3. Digression: convex optimization

Example

minimize (x − a)² + (y − b)²
subject to |x| + |y| = 1

Page 31

Normal space

For almost all (a, b), the optimum occurs at a corner point of the diamond |x| + |y| = 1

Page 32

Convex optimization with L1 term

For a higher-dimensional problem,

minimize f(x_1, ..., x_d)
subject to Σ_{i=1}^d |x_i| = 1,

where f(x_1, ..., x_d) is a convex function

Page 33

Look at the optimization problem:

(P) minimize f(x_1, ..., x_d) + λ ||x||_1

Associated optimization problem

(P_α) minimize f(x_1, ..., x_d)
subject to ||x||_1 = α

Let m_α be the optimal value of (P_α)

Let α* = argmin_α { m_α + λα }

Page 34

Then (P) is equivalent to

minimize f(x_1, ..., x_d) + λα*
subject to ||x||_1 = α*

As in the two-dimensional example, almost all optimal solutions occur at a corner point of Σ_{i=1}^d |x_i| = α*, i.e. the solution is sparse (many coordinates are zero)

Thus the solution of (P) is also sparse.
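A concrete special case (not from the slides): if f(x) = ||x − a||² for some fixed vector a, then (P) separates across coordinates and its solution is the soft-thresholding of a, x_i = sign(a_i) · max(|a_i| − λ/2, 0), so every coordinate with |a_i| ≤ λ/2 is set exactly to zero. The sketch below just evaluates this formula on made-up numbers.

```python
# Sparsity of the L1-penalized problem in the separable case f(x) = ||x - a||^2:
# argmin_x ||x - a||^2 + lam * ||x||_1  is the soft-thresholding of a at lam/2.
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

rng = np.random.default_rng(0)
a = rng.normal(size=10)
lam = 1.0
x_opt = soft_threshold(a, lam / 2.0)

print("a     =", np.round(a, 2))
print("x_opt =", np.round(x_opt, 2))
print("exact zeros:", int(np.sum(x_opt == 0.0)), "out of", a.size)
```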

Page 35

4.4. Mitigating the overfitting problem via regularization

The objective function of the regularization problem is chosen in such a way as to control the possible range of the parameters (sort of like putting on a brake)

The regularization idea, even without an explicit optimization formulation, is a way of restricting the possible range of the parameters in question

It is typically used in many contexts in machine learning