Rapid Introduction to Machine Learning/Deep Learning
Hyeong In Choi
Seoul National University
Lecture 2b: Statistical learning theory and its consequences
October 2, 2015
Table of contents
1. Objectives of Lecture 2b
2. Statistical learning theory
   2.1. Nature of data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^n$
   2.2. Errors
   2.3. Machine learning in a nutshell
   2.4. VC dimension
   2.5. VC theory
3. Overfitting
   3.1. Error vs. model complexity
   3.2. VC bound of the generalization error
4. Tikhonov regularization
   4.1. Linear regression
   4.2. Lasso regression
   4.3. Digression: convex optimization
   4.4. Mitigating the overfitting problem via regularization
1. Objectives of Lecture 2b
Objective 1
Understand the probabilistic model of the nature of machine learning data
Objective 2
Understand the errors: empirical vs. generalization
Objective 3
Learn the outline of Vapnik-Chervonenkis theory
Objective 4
Understand the relationship between model complexity and the various errors
Objective 5
Understand the nature of overfitting and how to mitigate it via regularization
Objective 6
As an aside, learn about the sparsity of L1 optimization solutions
2. Statistical learning theory
Question on generalization
Suppose a classifier is found that works well for given data. Does this mean it will work just as well for similar data in the future? We need to pin down what we mean by “similar”.
2.1. Nature of data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^n$
Deterministic label
Assume there is a probability distribution $P_X$ on $\mathbb{R}^d$ such that $x^{(1)}, \cdots, x^{(n)}$ are IID samples $\sim P_X$
Assume there is a deterministic function $\alpha : X \to Y$, where $X = \mathbb{R}^d$
Random label
The label $y$, given $x$, is obtained as a random (IID) sample $\sim P(Y = y \mid X = x)$
Combining $P_X(X = x)$ and $P_{Y|X}(Y = y \mid X = x)$, we get $P(X = x, Y = y) = P_{X,Y}(x, y)$
So this case amounts to assuming the existence of $P_{X,Y}$ and regarding $(x^{(1)}, y^{(1)}), \cdots, (x^{(n)}, y^{(n)})$ as IID samples $\sim P_{X,Y}$
We assume this is the case for the rest of our lectures
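To make this data model concrete, here is a minimal Python sketch, assuming a made-up joint distribution $P_{X,Y}$: a Gaussian $P_X$ and a noisy-sign conditional label, i.e. the “random label” setting above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint(n, d=2):
    """Draw n IID samples (x, y) ~ P_{X,Y}.

    Hypothetical choice: X ~ N(0, I_d), and given X = x the label Y
    is sign(x_1) flipped with probability 0.1, so Y given X = x is
    itself random, as in the 'random label' setting."""
    X = rng.standard_normal((n, d))      # x^(1), ..., x^(n) IID ~ P_X
    flip = rng.random(n) < 0.1           # 10% label noise
    y = np.where(flip, -1.0, 1.0) * np.sign(X[:, 0])
    return X, y

X, y = sample_joint(n=100)               # the data D
```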
2.2. Errors
Generalization error
Let $(X, Y)$ be a random variable $\sim P_{X,Y}$. Let $f : X \to Y$ be any function, which we regard as a classifier or a decision function. Define its loss by $\ell(f(X), Y) = I(f(X) \neq Y)$.
Note: loss is a random variable.
The risk of the classifier $f : X \to Y$ is defined as $R(f) = E[\ell(f(X), Y)]$. This risk is also called the generalization error or the out-of-sample error.
Empirical error
Given data $D = \{(x_i, y_i)\}_{i=1}^n$ and a classifier $f : X \to Y$, define its empirical error, or empirical risk, $R_{emp}(f)$, by
$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^n I(f(x_i) \neq y_i).$$
This empirical error is also called the in-sample error.
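A small sketch of both errors, reusing `sample_joint` from the sketch above. $R_{emp}(f)$ uses only the data; $R(f)$ is approximated by Monte Carlo, which is possible here only because in this toy setting we can sample $P_{X,Y}$ at will (in practice we cannot).

```python
import numpy as np

def empirical_error(f, X, y):
    """R_emp(f): fraction of the given data the classifier gets wrong."""
    return float(np.mean(f(X) != y))

def generalization_error(f, n_mc=100_000):
    """Monte Carlo estimate of R(f) = E[I(f(X) != Y)] under P_{X,Y}."""
    X, y = sample_joint(n_mc)
    return float(np.mean(f(X) != y))

f = lambda X: np.sign(X[:, 0])           # a hypothetical fixed classifier
X, y = sample_joint(n=100)
print(empirical_error(f, X, y), generalization_error(f))
```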
Observables and unobservables
In general, it is assumed that the nature of $P_{X,Y}$ is not revealed to us (though sometimes we may assume more about $P_{X,Y}$)
Thus $P_{X,Y}$ is used only as a general theoretical background construct that is devoid of much concrete information
$R_{emp}(f)$ is what we can observe from the data
$R(f)$ is what we are truly interested in knowing
Statistical learning theory deals with this general situation, and it bridges the gap between $R_{emp}(f)$ and $R(f)$
2.3. Machine learning in a nutshell
General framework of machine learning
Postulate a set $F$ of candidate classifiers
Pick the one from $F$ that is “best” according to some error-minimization criterion
Possible error criteria include:
$R_{emp}(f)$
$R_{emp}(f) + \lambda \cdot (\text{regularizing term})$
some other expression involving the given data in one way or another
But the ultimate aim is to control (minimize) $R(f)$
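A minimal sketch of this framework, with $R_{emp}(f)$ as the error criterion and a hypothetical finite candidate set $F$ of one-dimensional threshold classifiers:

```python
import numpy as np

def erm(F, X, y):
    """Empirical risk minimization: return the f in F with the
    smallest empirical error on the given data."""
    errors = [np.mean(f(X) != y) for f in F]
    return F[int(np.argmin(errors))]

# Hypothetical candidate set: thresholds on the first coordinate.
thresholds = np.linspace(-2.0, 2.0, 41)
F = [lambda X, t=t: np.where(X[:, 0] > t, 1.0, -1.0) for t in thresholds]

X, y = sample_joint(n=100)   # data from the earlier sketch
f_best = erm(F, X, y)
```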
Further issues of machine learning
How to postulate $F$? Which choice holds more promise?
How to get a “good” estimate of $R(f)$?
How to “massage” (transform) the input so that the truly discerning characteristics of the input stand out in the feature set?
What is a good regularizer?
How to select a good model (hyperparameters)?
2.4. VC dimension
Shattering
Example: suppose there are 3 points in $\mathbb{R}^2$. There are $2^3 = 8$ ways of labeling them with x's and o's
In all 8 cases, the x's and o's can be separated by a linear classifier
Definition: we say these 3 points are shattered by $F$, where $F$ is the set of classifiers given by lines in $\mathbb{R}^2$
Example 2: But no set of 4 points in $\mathbb{R}^2$ can be shattered by such an $F$. For a proof, consider, e.g., the XOR-type labeling in which diagonally opposite points of a quadrilateral receive the same label: no line can realize it
Example 3: Even in the case of 3 points, if they are aligned on a line, they cannot be shattered
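These examples can be verified by brute force. A sketch, assuming scipy is available: a labeling is linearly separable exactly when the feasibility LP “find $(w, b)$ with $y_i(w \cdot x_i + b) \ge 1$ for all $i$” has a solution, so shattering means every labeling passes this test.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?"""
    n, d = X.shape
    # Variables z = (w, b); constraints -y_i * ([x_i, 1] . z) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # 0 means a feasible point was found

def shattered(X):
    """True iff every +/-1 labeling of the rows of X is separable."""
    return all(separable(X, np.array(lab))
               for lab in itertools.product([-1.0, 1.0], repeat=len(X)))

print(shattered(np.array([[0., 0.], [1., 0.], [0., 1.]])))  # True
print(shattered(np.array([[0., 0.], [1., 1.], [2., 2.]])))  # False (collinear)
print(shattered(np.array([[0., 0.], [1., 0.],
                          [0., 1.], [1., 1.]])))            # False (4 points)
```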
VC (Vapnik-Chervonenkis) dimension
The VC (Vapnik-Chervonenkis) dimension of $F$, denoted by $VC(F)$, is the maximal number of points that can be shattered by $F$
Thus, for the set $F$ of classifiers defined by lines in $\mathbb{R}^2$, $VC(F) = 3$
The VC dimension measures how complex the set $F$ is
The bigger $VC(F)$ is, the richer (more versatile) $F$ is
Then it is more likely that the empirical error for the given data gets smaller, i.e. smaller $R_{emp}(f)$
But it may increase the chance of fitting too closely to the given data, at the risk of increased generalization error, i.e. larger $R(f)$
Example of infinite VC dimension
Let $X$ be the circle in $\mathbb{R}^2$
Any finite number of points on it can be shattered by $F$
Thus $VC(F) = \infty$
2.5. VC theory
Theorem (Vapnik-Chervonenkis)
For any probability $P_{X,Y}$, any data $(x^{(1)}, y^{(1)}), \cdots, (x^{(n)}, y^{(n)})$ obtained as IID samples $\sim P_{X,Y}$, and any set $F$ of binary classifiers, we have
$$P\Big(\sup_{f \in F} |R_{emp}(f) - R(f)| > \varepsilon\Big) \le 8\,S(F, n)\,e^{-n\varepsilon^2/32}$$
Remark: the constants in the above inequality are not sharp and can be improved somewhat
Shatter coefficient
Define
$$F(x_1, \cdots, x_n) = \{(f(x_1), \cdots, f(x_n)) : f \in F\},$$
$$S(F, n) = \max_{x_1, \cdots, x_n \in X} |F(x_1, \cdots, x_n)|.$$
This $S(F, n)$ is called the growth function or the shatter coefficient
Theorem of Sauer
Suppose the VC dimension is finite. Then $S(F, n)$ is a polynomial of order $O(n^D)$, where $D$ is the VC dimension
Thus we have
$$P\Big(\sup_{f \in F} |R_{emp}(f) - R(f)| > \varepsilon\Big) \lesssim 8 \exp\Big\{D \log n - \frac{n\varepsilon^2}{32}\Big\},$$
where the symbol $\lesssim$ denotes asymptotic inequality as $n \to \infty$. In particular,
$$P\Big(\sup_{f \in F} |R_{emp}(f) - R(f)| > \varepsilon\Big) \to 0$$
as $n \to \infty$
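Numerically, this bound is vacuous for small $n$ and collapses only once $n\varepsilon^2/32$ overtakes $D \log n$. A small sketch with illustrative values $D = 3$ (lines in $\mathbb{R}^2$) and $\varepsilon = 0.1$:

```python
import numpy as np

def vc_bound(n, D, eps):
    """The Sauer-form bound 8 * exp(D*log(n) - n*eps**2 / 32) on
    P(sup_f |R_emp(f) - R(f)| > eps)."""
    return 8.0 * np.exp(D * np.log(n) - n * eps**2 / 32.0)

# The bound is astronomically large (useless) until n is huge,
# then decays extremely fast.
for n in [10**4, 10**5, 10**6, 10**7]:
    print(f"n = {n:>8d}: bound = {vc_bound(n, D=3, eps=0.1):.3e}")
```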
The VC theorem says that, with the exception of probability $\delta(F, n, \varepsilon) = 8\,S(F, n)\,e^{-n\varepsilon^2/32}$, we have
$$|R_{emp}(f) - R(f)| \le \varepsilon$$
for all $f \in F$ uniformly; and
$$\delta(F, n, \varepsilon) \to 0$$
as $n \to \infty$. Thus
$$R(f) = \big(R(f) - R_{emp}(f)\big) + R_{emp}(f) \le \varepsilon + R_{emp}(f)$$
In conclusion
This says that $R(f)$ can be controlled by $R_{emp}(f)$ as long as we can take $n$ big enough
But the question is how big $n$ has to be
The estimate in the VC theorem is quite loose, as it is valid regardless of which $F$ or $P_{X,Y}$ is taken
It means the number $n$ required by the VC theorem is too big to be practical
VC theory should be taken as a theoretical guideline, not as a practical estimate
Nonetheless, VC theory is still one of the most fundamental results in machine learning and a truly remarkable feat!
3. Overfitting
3.1. Error vs. model complexity
Error vs. model complexity
VC dimension is one way of measuring model complexity
But model complexity may also be understood rather intuitively
It is a general tendency that as the model complexity increases, the empirical error decreases
But the generalization error behaves differently: although it decreases initially as the complexity increases, beyond a certain threshold $d^*$ the generalization error starts to increase with model complexity (see the sketch below)
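A small numerical illustration of this U-shape, with polynomial degree standing in for model complexity. The setup is made up (a cubic ground truth plus noise), and the held-out test error serves as a proxy for the generalization error:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical P_{X,Y}: a noisy cubic on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, n)
    return x, x**3 - x + 0.1 * rng.standard_normal(n)

x_tr, y_tr = make_data(30)       # small training set
x_te, y_te = make_data(10_000)   # proxy for the generalization error

for deg in [1, 3, 6, 12]:
    coef = np.polyfit(x_tr, y_tr, deg)   # least-squares polynomial fit
    tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"degree {deg:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

The train MSE keeps falling as the degree grows, while the test MSE bottoms out near the true complexity (degree 3, playing the role of $d^*$) and then climbs.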
This phenomenon is generally dubbed “overfitting”
So the “best” choice of model complexity is the threshold value $d^*$
The trouble is that we have no access to the generalization error per se, so it is difficult to determine $d^*$ (but we can devise proxies for the generalization error)
How to overcome the overfitting problem:
Regularization
Model selection via validation
3.2. VC bound of the generalization error
VC theory
$$R(f) = R_{emp}(f) + \big(R(f) - R_{emp}(f)\big)$$
Let $\delta = 8\,S(F, n)\,e^{-n\varepsilon^2/32}$. Then
$$\varepsilon = \varepsilon(F, n, \delta) = \sqrt{\frac{32}{n}\{\log(8/\delta) + \log S(F, n)\}}$$
Thus for any $\delta > 0$, with probability at least $1 - \delta$,
$$|R(f) - R_{emp}(f)| \le \varepsilon(F, n, \delta)$$
But this is a loose bound, and even so it is difficult to get hold of
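A sketch of just how loose, assuming Sauer's estimate $S(F, n) \le (en/D)^D$, so that $\log S(F, n) \le D(1 + \log(n/D))$:

```python
import numpy as np

def vc_epsilon(n, D, delta):
    """epsilon(F, n, delta), with log S(F, n) bounded via Sauer's lemma."""
    log_S = D * (1.0 + np.log(n / D))
    return np.sqrt((32.0 / n) * (np.log(8.0 / delta) + log_S))

# With D = 3 and 95% confidence the guarantee is nearly vacuous at
# n = 1000, and even at n = 10**5 the certified gap is still above 0.1.
for n in [10**3, 10**4, 10**5, 10**6]:
    print(f"n = {n:>7d}: eps = {vc_epsilon(n, D=3, delta=0.05):.3f}")
```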
4. Tikhonov regularization
4.1. Linear regression
Example
Data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^n$, where $x^{(t)} = (x_1^{(t)}, \cdots, x_d^{(t)})$ and $y^{(t)} \in \mathbb{R}$
Linear regression:
$$y = \sum_{j=1}^d \omega_j x_j + b$$
Empirical error:
$$E_{emp} = \sum_{t=1}^n \Big(\sum_{j=1}^d \omega_j x_j^{(t)} + b - y^{(t)}\Big)^2$$
Linear regression minimizes $E_{emp}$
Tikhonov regularization
Augmented error:
$$E_{aug} = E_{emp} + \lambda\{\|\omega\|^2 + b^2\}$$
Tikhonov regularization minimizes $E_{aug}$
$E_{aug}$ is a surrogate for the generalization error
The ill-posed problem becomes well-posed
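A minimal numpy sketch of this minimization in closed form (the normal equations); the bias $b$ is absorbed as one more weight via a constant feature, so it is penalized along with $\omega$, matching the $E_{aug}$ above:

```python
import numpy as np

def tikhonov_fit(X, y, lam):
    """Minimize E_aug = sum_t (w . x_t + b - y_t)**2 + lam*(||w||^2 + b^2).

    With A = [X | 1] and z = (w, b), E_aug = ||A z - y||^2 + lam*||z||^2,
    whose unique minimizer solves (A^T A + lam*I) z = A^T y."""
    A = np.hstack([X, np.ones((len(X), 1))])
    z = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    return z[:-1], z[-1]   # (omega, b)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(50)
omega, b = tikhonov_fit(X, y, lam=1.0)
```

For $\lambda > 0$ the matrix $A^T A + \lambda I$ is positive definite even when $A^T A$ is singular, which is exactly the sense in which the ill-posed problem becomes well-posed.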
4.2. Lasso regression
Lasso regression
Lasso regression minimizes
$$E_{aug} = E_{emp} + \lambda\{\|\omega\|_1 + |b|\},$$
where $\|\omega\|_1 = |\omega_1| + \cdots + |\omega_d|$
Why $\|\omega\|_1$?
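One empirical answer, assuming scikit-learn is available (its `Lasso` penalizes only the coefficients, not the intercept, a small departure from the $E_{aug}$ above): the $L^1$ penalty drives most coefficients exactly to zero, while the $L^2$ penalty of ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
# Only coordinates 0 and 3 actually matter in this made-up data.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(100)

print("lasso:", Lasso(alpha=0.1).fit(X, y).coef_.round(2))  # mostly zeros
print("ridge:", Ridge(alpha=0.1).fit(X, y).coef_.round(2))  # small, nonzero
```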
4.3. Digression: convex optimization
Example
minimize $(x - a)^2 + (y - b)^2$
subject to $|x| + |y| = 1$
Normal space
For a large (positive-measure) set of $(a, b)$, the optimum occurs at a corner point of the diamond
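A numerical sanity check of the corner phenomenon, contrasting the diamond $|x| + |y| = 1$ with the smooth constraint $x^2 + y^2 = 1$. Both curves are sampled densely, and a solution counts as sparse when one coordinate is (numerically) zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense samplings of the two constraint sets.
t = np.linspace(0.0, 2.0 * np.pi, 20001)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
s = np.linspace(-1.0, 1.0, 10001)
diamond = np.concatenate([np.stack([s, 1.0 - np.abs(s)], axis=1),
                          np.stack([s, np.abs(s) - 1.0], axis=1)])

def sparse_fraction(pts, trials=2000, tol=1e-2):
    """How often the minimizer of (x-a)^2 + (y-b)^2 over the sampled
    constraint set, for random centers (a, b), has a near-zero coordinate."""
    hits = 0
    for _ in range(trials):
        c = rng.uniform(-3.0, 3.0, 2)
        p = pts[np.argmin(((pts - c) ** 2).sum(axis=1))]
        hits += np.min(np.abs(p)) < tol
    return hits / trials

print("circle :", sparse_fraction(circle))    # tiny, shrinks with tol
print("diamond:", sparse_fraction(diamond))   # sizable: corners attract mass
```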
Convex optimization with L1 term
For the higher-dimensional problem,
$$\text{minimize } f(x_1, \cdots, x_d) \quad \text{subject to } \sum_{i=1}^d |x_i| = 1,$$
where $f(x_1, \cdots, x_d)$ is a convex function
Look at the optimization problem:
$$(P)\quad \text{minimize } f(x_1, \cdots, x_d) + \lambda\|x\|_1$$
Associated optimization problem:
$$(P_\alpha)\quad \text{minimize } f(x_1, \cdots, x_d) \quad \text{subject to } \|x\|_1 = \alpha$$
Let $m_\alpha$ be the optimal value of $(P_\alpha)$, and let $\alpha^* = \operatorname{argmin}_\alpha (m_\alpha + \lambda\alpha)$
Then $(P)$ is equivalent to
$$\text{minimize } f(x_1, \cdots, x_d) + \lambda\alpha^* \quad \text{subject to } \|x\|_1 = \alpha^*$$
As in the two-dimensional example, the optimal solution typically occurs at a corner (or low-dimensional face) of $\sum_{i=1}^d |x_i| = \alpha^*$, i.e. many coordinates are zero
Thus the solution of $(P)$ is also sparse
4.4. Mitigating the overfitting problem via regularization
The objective function of a regularization problem is chosen in such a way as to control the possible range of the parameters (sort of like putting on a brake)
The regularization idea, even without an explicit optimization formulation, is a way of restricting the possible range of the parameters in question
It is used in many contexts throughout machine learning