© Stanley Chan 2020. All Rights Reserved.

ECE595 / STAT598: Machine Learning I
Lecture 28: Sample and Model Complexity

Spring 2020

Stanley Chan
School of Electrical and Computer Engineering, Purdue University

1 / 17
Outline

Lecture 28 Sample and Model Complexity
Lecture 29 Bias and Variance
Lecture 30 Overfit

Today's Lecture:

Generalization Bound using VC Dimension
  Review of growth function and VC dimension
  Generalization bound
Sample and Model Complexity
  Sample complexity
  Model complexity
  Trade-off

2 / 17
VC Dimension

Definition (VC Dimension)
The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by dVC, is the largest value of N for which H can shatter N training samples.

You give me a hypothesis set H, e.g., the linear model
You tell me the number of training samples N
Start with a small N
I will be able to shatter for a while, until I hit a bump
E.g., linear in 2D: N = 3 is okay, but N = 4 is not okay
So I find the largest N such that H can shatter N training samples
E.g., linear in 2D: dVC = 3
If H is complex, then expect a large dVC
Does not depend on p(x), A and f
3 / 17
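The 2D shattering check on this slide can be done by brute force for small N. The sketch below (the helper names are my own, not from the slides) enumerates every dichotomy a line can induce on a point set in general position, using the fact that any separating line can be translated and rotated until it touches two of the points, with each touched point free to fall on either side:

```python
from itertools import combinations

def linear_dichotomies(points):
    """Enumerate all dichotomies of 2D `points` realizable by a line.

    Assumes the points are in general position (no three collinear).
    Any separating line can be moved until it touches two of the points,
    so scanning lines through point pairs covers every dichotomy.
    """
    n = len(points)
    dichos = {(1,) * n, (-1,) * n}              # trivial labelings
    for i, j in combinations(range(n), 2):
        (xi, yi), (xj, yj) = points[i], points[j]
        nx, ny = yj - yi, xi - xj               # normal of line through i, j
        base = [0] * n
        for k, (xk, yk) in enumerate(points):
            s = nx * (xk - xi) + ny * (yk - yi)
            base[k] = 1 if s > 0 else (-1 if s < 0 else 0)
        for si in (1, -1):                      # boundary points go either way
            for sj in (1, -1):
                lab = list(base)
                lab[i], lab[j] = si, sj
                dichos.add(tuple(lab))
                dichos.add(tuple(-v for v in lab))  # flipped orientation
    return dichos

def can_shatter(points):
    return len(linear_dichotomies(points)) == 2 ** len(points)

# N = 3 non-collinear points: shattered.  N = 4 points: never shattered in 2D.
print(can_shatter([(0, 0), (1, 0), (0, 1)]))          # True
print(can_shatter([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False
```

For the four-point example, the two XOR labelings are the only missing dichotomies, which is why dVC = 3 for the linear model in 2D.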
Linking the Growth Function

Theorem (Sauer's Lemma)
Let dVC be the VC dimension of a hypothesis set H. Then

  m_{\mathcal{H}}(N) \le \sum_{i=0}^{d_{\mathrm{VC}}} \binom{N}{i}. \tag{1}

I skip the proof here; see AML Chapter 2.2 for the proof. What is more interesting is this:

  \sum_{i=0}^{d_{\mathrm{VC}}} \binom{N}{i} \le N^{d_{\mathrm{VC}}} + 1.

This can be proved by simple induction (exercise). So together we have

  m_{\mathcal{H}}(N) \le N^{d_{\mathrm{VC}}} + 1.

4 / 17
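Both inequalities above are easy to sanity-check numerically. A minimal sketch (the function name is my own):

```python
from math import comb

def sauer_bound(N, dvc):
    """Right-hand side of Sauer's lemma: sum_{i=0}^{dVC} C(N, i)."""
    return sum(comb(N, i) for i in range(dvc + 1))

# The polynomial N^dVC + 1 dominates the binomial sum, so
# m_H(N) <= sauer_bound(N, dvc) <= N**dvc + 1 for every N and dVC.
for N in range(1, 30):
    for dvc in range(6):
        assert sauer_bound(N, dvc) <= N ** dvc + 1

# Linear model in 2D (dVC = 3): Sauer gives m_H(4) <= 15,
# consistent with the actual count of 14 dichotomies.
print(sauer_bound(4, 3))  # 15
```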
Difference between VC and Hoeffding
5 / 17
Generalization Bound Again

Recall the generalization bound

  E_{\mathrm{in}}(g) - \sqrt{\frac{1}{2N}\log\frac{2M}{\delta}} \le E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{1}{2N}\log\frac{2M}{\delta}}.

Substitute M by m_{\mathcal{H}}(N), and then m_{\mathcal{H}}(N) \le N^{d_{\mathrm{VC}}} + 1:

  E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{1}{2N}\log\frac{2(N^{d_{\mathrm{VC}}}+1)}{\delta}}.

Wonderful!

Everything is characterized by δ, N and dVC
dVC tells us the expressiveness of the model
You can also think of dVC as the effective number of parameters

6 / 17
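A quick numerical look at how the bound behaves, assuming the polynomial bound N^dVC + 1 on the growth function (the helper name is illustrative):

```python
from math import log, sqrt

def vc_gap(N, dvc, delta):
    """Width of the generalization interval:
    sqrt((1/2N) * log(2 * (N**dvc + 1) / delta)),
    i.e. the Hoeffding-style bound with M replaced by N**dvc + 1."""
    return sqrt(log(2 * (N ** dvc + 1) / delta) / (2 * N))

# More samples tighten the bound; a larger dVC loosens it.
for dvc in (1, 3, 10):
    print(dvc, round(vc_gap(10_000, dvc, 0.05), 4))
```

At N = 10,000 and δ = 0.05, the gap stays under 0.05 for moderate dVC, matching the claim that δ, N and dVC characterize everything.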
Generalization Bound Again

If dVC < ∞, then as N → ∞,

  \epsilon = \sqrt{\frac{1}{2N}\log\frac{2(N^{d_{\mathrm{VC}}}+1)}{\delta}} \to 0.

If this is the case, then the final hypothesis g ∈ H will generalize.

If dVC = ∞,
  Then H is as diverse as it can be
  It is not possible to generalize

Message 1: If you choose a complex model, then you need to pay the price in training samples
Message 2: If you choose an extremely complex model, then it may not be able to generalize regardless of the number of samples
7 / 17
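The contrast between the two cases can be made concrete: with m_H(N) ≤ N^dVC + 1 the error bar vanishes, while with m_H(N) = 2^N (H shatters every sample size) it plateaus near sqrt(log 2 / 2) ≈ 0.59 and never goes to zero. A small sketch with illustrative helper names:

```python
from math import log, sqrt

def eps_finite(N, dvc, delta=0.05):
    # Growth function bounded by the polynomial N**dvc + 1 (finite dVC).
    return sqrt(log(2 * (N ** dvc + 1) / delta) / (2 * N))

def eps_infinite(N, delta=0.05):
    # Growth function equal to 2**N (dVC = infinity): the log is
    # linear in N, so the 1/(2N) factor can no longer kill it.
    return sqrt((log(2 / delta) + N * log(2)) / (2 * N))

for N in (100, 10_000, 1_000_000):
    print(N, round(eps_finite(N, 3), 4), round(eps_infinite(N), 4))
```

The finite-dVC column shrinks toward 0, while the infinite-dVC column stalls around 0.59 no matter how large N grows, which is exactly Message 2.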
Generalizing the Generalization Bound

Theorem (Generalization Bound)
For any tolerance δ > 0,

  E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N}\log\frac{4 m_{\mathcal{H}}(2N)}{\delta}},

with probability at least 1 − δ.

There are some small, subtle technical requirements; see AML Chapter 2.2.

How tight is this generalization bound? Not very tight:
  The Hoeffding inequality has slack: it is too general, as it holds for all values of E_out
  The growth function m_H(N) gives the worst-case scenario
  Bounding m_H(N) by a polynomial introduces slack
8 / 17
Outline

Lecture 28 Sample and Model Complexity
Lecture 29 Bias and Variance
Lecture 30 Overfit

Today's Lecture:

Generalization Bound using VC Dimension
  Review of growth function and VC dimension
  Generalization bound
Sample and Model Complexity
  Sample complexity
  Model complexity
  Trade-off

9 / 17
Sample and Model Complexity

Sample Complexity
What is the smallest number of samples required?
  Required to ensure that the training and testing errors are close
  Close = within a certain ε, with confidence 1 − δ
  Regardless of which learning algorithm you use

Model Complexity
What is the largest model you can use?
  Refers to the hypothesis set
  With respect to the number of training samples
  Largest = measured in terms of the VC dimension
  Can use = within a certain ε, with confidence 1 − δ
  Regardless of which learning algorithm you use

10 / 17
Sample Complexity

The generalization bound is
\[
E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N}\log\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}.
\]
If you want the generalization error to be at most ε, then
\[
\sqrt{\frac{8}{N}\log\frac{4\,m_{\mathcal{H}}(2N)}{\delta}} \le \varepsilon.
\]
Rearranging terms and using the VC dimension to bound the growth function, m_H(N) ≤ N^{d_VC} + 1,
\[
N \ge \frac{8}{\varepsilon^2}\log\left(\frac{4\left((2N)^{d_{\mathrm{VC}}} + 1\right)}{\delta}\right).
\]
Example. d_VC = 3, ε = 0.1, δ = 0.1 (90% confidence). Then the number of samples we need satisfies
\[
N \ge \frac{8}{0.1^2}\log\left(\frac{4(2N)^3 + 4}{0.1}\right).
\]

11 / 17
Sample Complexity

How do we solve for N in this inequality?
\[
N \ge \frac{8}{0.1^2}\log\left(\frac{4(2N)^3 + 4}{0.1}\right).
\]
Substitute N = 1000 into the right-hand side:
\[
N \ge \frac{8}{0.1^2}\log\left(\frac{4(2\times 1000)^3 + 4}{0.1}\right) \approx 21{,}193.
\]
Not enough. So substitute N = 21,193 into the right-hand side, and iterate.
Then we get N ≈ 30,000.
So we need at least 30,000 samples.
However, the generalization bound is not tight, so this estimate is an over-estimate.
Rule of thumb: N ≈ 10 × d_VC.

12 / 17
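The fixed-point iteration described on this slide is easy to script. Below is a small numerical sketch (not from the lecture); the function name `sample_complexity` and the starting point N = 1000 are my choices. Starting from 1000, it converges in a handful of iterations to roughly 29,300, consistent with the slide's N ≈ 30,000.

```python
import math

def sample_complexity(d_vc, eps, delta, n0=1000.0, iters=50):
    """Iterate N <- (8/eps^2) * log(4*((2N)^d_vc + 1)/delta) to a fixed point."""
    n = n0
    for _ in range(iters):
        n = (8.0 / eps**2) * math.log(4.0 * ((2.0 * n)**d_vc + 1.0) / delta)
    return n

# d_VC = 3, eps = 0.1, delta = 0.1, as on the slide
print(round(sample_complexity(3, 0.1, 0.1)))
```

Because the right-hand side grows only logarithmically in N, the iteration contracts quickly regardless of the starting guess.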
Error Bar

The generalization bound is
\[
E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N}\log\left(\frac{4\left((2N)^{d_{\mathrm{VC}}} + 1\right)}{\delta}\right)}.
\]
What error bar can we offer?
Example. N = 100, δ = 0.1 (90% confidence), d_VC = 1.
\[
E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{8}{100}\log\left(\frac{4\left((2\times 100) + 1\right)}{0.1}\right)} \approx E_{\mathrm{in}}(g) + 0.848.
\]
Close to useless.
If we use N = 1000, then
\[
E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + 0.301.
\]
A somewhat more respectable estimate.

13 / 17
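These error bars are straightforward to reproduce numerically. A minimal sketch (the helper name `vc_error_bar` is mine, not from the slides):

```python
import math

def vc_error_bar(n, d_vc, delta):
    """Width of the VC bound: sqrt((8/N) * log(4*((2N)^d_vc + 1)/delta))."""
    return math.sqrt(8.0 / n * math.log(4.0 * ((2.0 * n)**d_vc + 1.0) / delta))

print(round(vc_error_bar(100, 1, 0.1), 3))   # slide's N = 100 case: 0.848
print(round(vc_error_bar(1000, 1, 0.1), 3))  # slide's N = 1000 case: 0.301
```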
Model Complexity

The generalization bound is
\[
E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \underbrace{\sqrt{\frac{8}{N}\log\frac{4\left((2N)^{d_{\mathrm{VC}}} + 1\right)}{\delta}}}_{=\,\epsilon(N,\mathcal{H},\delta)}
\]
ε(N, H, δ) = penalty for the model complexity
If d_VC is large, then ε(N, H, δ) is large
So the generalization error bound is large
There is a trade-off curve

14 / 17
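To see the penalty grow with the VC dimension, here is a quick sketch (mine, not the slides'), evaluating ε(N, H, δ) at a fixed N and δ while sweeping d_VC:

```python
import math

def complexity_penalty(n, d_vc, delta):
    """epsilon(N, H, delta) = sqrt((8/N) * log(4*((2N)^d_vc + 1)/delta))."""
    return math.sqrt(8.0 / n * math.log(4.0 * ((2.0 * n)**d_vc + 1.0) / delta))

# Fixed N = 1000 and delta = 0.1: the penalty grows with d_VC.
for d_vc in (1, 3, 10):
    print(d_vc, round(complexity_penalty(1000, d_vc, 0.1), 3))
```

Since d_VC sits in an exponent inside a logarithm, the penalty grows roughly like sqrt(d_VC) for fixed N, which is the trade-off curve sketched on the next slide.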
Trade-off Curve
15 / 17
Generalization Bound for Testing

Testing set: D_test = {x_1, . . . , x_L}.
The final hypothesis g is already determined, so there is no need to use the union bound.
The Hoeffding inequality is then as simple as
\[
\mathbb{P}\left\{\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \varepsilon\,\right\} \le 2e^{-2\varepsilon^2 L},
\]
and the generalization bound is
\[
E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\frac{1}{2L}\log\frac{2}{\delta}}.
\]
If you have many testing samples, then E_in(g) will be a good estimate of E_out(g)
Independent of model complexity
Only δ and L appear

16 / 17
![Page 82: ECE595 / STAT598: Machine Learning I Lecture 28 Sample …...E.g., linear in 2D: N = 3 is okay, but N = 4 is not okay So I nd the largest N such that Hcan shatter N training samples](https://reader033.vdocuments.us/reader033/viewer/2022050121/5f513a85e5f918157102ab7f/html5/thumbnails/82.jpg)
c©Stanley Chan 2020. All Rights Reserved.
Generalization Bound for Testing
Testing Set: Dtest = {x1, . . . , xL}.
The final hypothesis g is already determined. So no need to use Union bound.
The Hoeffding is as simple as
P{|Ein(g)− Eout(g)| > ε
}≤ 2e−2ε
2L,
The generalization bound is
Eout(g) ≤ Ein(g) +
√1
2Llog
2
δ.
If you have a lot of testing samples, then Ein(g) will be good estimate of Eout(g)
Independent of model complexity
Only δ and L
16 / 17
Reading List
Yaser Abu-Mostafa, Learning from Data, Chapter 2.1

Mehryar Mohri, Foundations of Machine Learning, Chapter 3.2
Stanford Note http://cs229.stanford.edu/notes/cs229-notes4.pdf
17 / 17