Introduction to Machine Learning (67577), Lecture 4
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
Boosting
Shai Shalev-Shwartz (Hebrew U) IML Lecture 4 Boosting 1 / 29
Outline
1. Weak learnability
2. Boosting the confidence
3. Boosting the accuracy using AdaBoost
4. AdaBoost as a learner for Halfspaces++
5. AdaBoost and the Bias-Complexity Tradeoff
6. Weak Learnability and Separability with Margin
7. AdaBoost for Face Detection
Weak Learnability
Definition ((ε, δ)-Weak-Learnability)
A class H is (ε, δ)-weak-learnable if there exists a learning algorithm, A, and a training set size, m ∈ N, such that for every distribution D over X and every f ∈ H,

D^m({S : L_{D,f}(A(S)) ≤ ε}) ≥ 1 − δ .
Remarks:
- Almost identical to (strong) PAC learning, but we only need to succeed for specific ε, δ
- Every class H is (1/2, 0)-weak-learnable
- Intuitively, one can think of a weak learner as an algorithm that uses a simple 'rule of thumb' to output a hypothesis that performs just slightly better than a random guess
Example of weak learner
X = R, H is the class of 3-piece classifiers, e.g.
[figure: the real line with thresholds θ1 < θ2, labeled + on (−∞, θ1), − on (θ1, θ2), and + on (θ2, ∞)]
Let B = {x ↦ sign(x − θ) · b : θ ∈ R, b ∈ {±1}} be the class of Decision Stumps
Claim: There is a constant m such that ERM_B over m examples is a (5/12, 1/2)-weak learner for H

Proof:
- Observe that there is always a decision stump with L_{D,f}(h) ≤ 1/3
- Apply the VC bound for the class of decision stumps
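The first proof step relies on ERM over B being easy to compute. A minimal sketch of ERM_B on a (possibly weighted) one-dimensional sample; this is my own illustrative code, not from the lecture, and the function name is mine:

```python
import numpy as np

def erm_decision_stump(xs, ys, weights=None):
    """ERM over B: return (theta, b, weighted 0-1 error) of the best stump."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    if weights is None:
        weights = np.full(len(xs), 1.0 / len(xs))
    else:
        weights = np.asarray(weights, float)
    order = np.argsort(xs)
    xs, ys, weights = xs[order], ys[order], weights[order]
    # Candidate thresholds: one below all points, plus midpoints between neighbors.
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    best = (thresholds[0], 1.0, np.inf)
    for theta in thresholds:
        for b in (1.0, -1.0):
            preds = np.where(xs - theta >= 0, b, -b)  # sign(x - theta) * b, with sign(0) := +1
            err = weights[preds != ys].sum()
            if err < best[2]:
                best = (theta, b, err)
    return best
```

Only O(m) thresholds need to be checked because the weighted 0-1 loss can change only where the predicted labels change, i.e. between consecutive sample points.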
The problem of boosting
- Suppose we have an (ε0, δ0)-weak-learner algorithm, A, for some class H
- Can we use A to construct a strong learner?
- If A is computationally efficient, can we boost it efficiently?
- Two questions:
  - Boosting the confidence
  - Boosting the accuracy
Boosting the confidence
- Suppose A is an (ε0, δ0)-weak learner for H that requires m0 examples
- For any δ, ε ∈ (0, 1) we show how to learn H to accuracy ε0 + ε with confidence 1 − δ
- Step 1: Apply A on k = ⌈log(2/δ) / log(1/δ0)⌉ i.i.d. samples, each of m0 examples, to obtain h1, . . . , hk
- Step 2: Take an additional validation sample V of size |V| ≥ 2 log(4k/δ)/ε² and output h ∈ argmin_{h_i} L_V(h_i)
- Claim: W.p. at least 1 − δ, we have L_D(h) ≤ ε0 + ε
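The two-step scheme is short enough to sketch directly. A toy illustration, with names of my own choosing (`draw_sample` stands for drawing i.i.d. examples from D, `loss` for the empirical 0-1 loss):

```python
import math, random

def boost_confidence(weak_learner, draw_sample, m0, eps, delta, delta0, loss):
    """Run the weak learner k times on fresh samples, then pick the best on validation."""
    k = math.ceil(math.log(2 / delta) / math.log(1 / delta0))   # number of repetitions
    candidates = [weak_learner(draw_sample(m0)) for _ in range(k)]
    v_size = math.ceil(2 * math.log(4 * k / delta) / eps ** 2)  # validation set size
    validation = draw_sample(v_size)
    return min(candidates, key=lambda h: loss(h, validation))
```

The point of the scheme is that k grows only logarithmically in 1/δ: each run fails with probability at most δ0 < 1, so the chance that all k runs fail is δ0^k ≤ δ/2, and the validation step adds at most ε extra error.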
Proof
First, the validation procedure guarantees

P[L_D(h) > min_i L_D(h_i) + ε] ≤ δ/2 .

Second, since the k samples are independent,

P[min_i L_D(h_i) > ε0] = P[∀i, L_D(h_i) > ε0] = ∏_{i=1}^{k} P[L_D(h_i) > ε0] ≤ δ0^k ≤ δ/2 .

Apply the union bound to conclude the proof.
Boosting a learner that succeeds on expectation
- Suppose that A is a learner that guarantees:

  E_{S∼D^m}[L_D(A(S))] ≤ min_{h∈H} L_D(h) + ε .

- Denote θ = L_D(A(S)) − min_{h∈H} L_D(h), so we obtain E_{S∼D^m}[θ] ≤ ε .
- Since θ is a non-negative random variable, we can apply Markov's inequality to obtain

  P[θ ≥ 2ε] ≤ E[θ]/(2ε) ≤ 1/2 .

- Corollary: A is a (2ε, 1/2)-weak learner.
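The Markov step can be seen numerically. A toy simulation of my own (not from the lecture): draw θ from a non-negative distribution with mean ε and check that the fraction of draws exceeding 2ε stays below 1/2, as Markov's inequality guarantees:

```python
import random

random.seed(1)
eps = 0.1
# theta drawn from an exponential distribution with mean eps, so E[theta] = eps
thetas = [random.expovariate(1 / eps) for _ in range(100_000)]
# Markov guarantees this fraction is at most 1/2; for the exponential it is about e^-2
frac_bad = sum(t >= 2 * eps for t in thetas) / len(thetas)
```

The bound P[θ ≥ 2ε] ≤ 1/2 is tight only for extreme distributions; here the true value, e^-2 ≈ 0.135, is well below it.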
Boosting the accuracy
AdaBoost (’Adaptive Boosting’)
- Input: S = (x1, y1), . . . , (xm, ym), where for each i, yi = f(xi)
- Output: hypothesis h with small error on S
- We'll later analyze L_{D,f}(h) as well
- AdaBoost calls the weak learner on distributions over S
The AdaBoost Algorithm
input: training set S = (x1, y1), . . . , (xm, ym), weak learner WL, number of rounds T
initialize D^(1) = (1/m, . . . , 1/m)
for t = 1, . . . , T:
    invoke the weak learner: h_t = WL(D^(t), S)
    compute ε_t = L_{D^(t)}(h_t) = Σ_{i=1}^m D^(t)_i · 1[y_i ≠ h_t(x_i)]
    let w_t = (1/2) log(1/ε_t − 1)
    update D^(t+1)_i = D^(t)_i exp(−w_t y_i h_t(x_i)) / Σ_{j=1}^m D^(t)_j exp(−w_t y_j h_t(x_j)) for all i = 1, . . . , m
output the hypothesis h_s(x) = sign(Σ_{t=1}^T w_t h_t(x))
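A compact runnable sketch of the pseudocode above, in my own Python, with a weighted decision-stump ERM standing in for WL. The early exit when ε_t ∈ {0} ∪ [1/2, 1) is a practical guard of mine, not part of the lecture's pseudocode:

```python
import numpy as np

def stump_learner(X, y, D):
    """Weighted ERM over decision stumps x -> sign(x - theta) * b."""
    xs = np.sort(X)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0))
    best, best_err = (thresholds[0], 1.0), np.inf
    for theta in thresholds:
        for b in (1.0, -1.0):
            err = D[np.where(X >= theta, b, -b) != y].sum()
            if err < best_err:
                best_err, best = err, (theta, b)
    theta, b = best
    return lambda Z: np.where(np.asarray(Z) >= theta, b, -b)

def adaboost(X, y, weak_learner, T):
    m = len(X)
    D = np.full(m, 1.0 / m)                    # D^(1): uniform over S
    hyps, ws = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)              # h_t = WL(D^(t), S)
        preds = h(X)
        eps_t = D[preds != y].sum()            # weighted error under D^(t)
        if eps_t == 0 or eps_t >= 0.5:         # practical guard (not in the slides)
            break
        w_t = 0.5 * np.log(1.0 / eps_t - 1.0)
        D = D * np.exp(-w_t * y * preds)       # exponential reweighting
        D = D / D.sum()
        hyps.append(h)
        ws.append(w_t)
    return lambda Z: np.sign(sum(w * h(Z) for w, h in zip(ws, hyps)))
```

On the lecture's 1-D example, boosted stumps fit a + − + labeling that no single stump can realize, e.g. `adaboost(np.array([0.,1,2,3,4,5]), np.array([1.,1,-1,-1,1,1]), stump_learner, 3)` reaches zero training error in three rounds.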
Intuition: AdaBoost forces WL to focus on problematic examples
Claim: The error of h_t w.r.t. D^(t+1) is exactly 1/2

Proof:

Σ_{i=1}^m D^(t+1)_i 1[y_i ≠ h_t(x_i)]
    = (Σ_{i=1}^m D^(t)_i e^{−w_t y_i h_t(x_i)} 1[y_i ≠ h_t(x_i)]) / (Σ_{j=1}^m D^(t)_j e^{−w_t y_j h_t(x_j)})
    = e^{w_t} ε_t / (e^{w_t} ε_t + e^{−w_t} (1 − ε_t))
    = ε_t / (ε_t + e^{−2w_t} (1 − ε_t))
    = ε_t / (ε_t + (ε_t/(1 − ε_t)) (1 − ε_t))        [since e^{−2w_t} = ε_t/(1 − ε_t)]
    = 1/2 .
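A numerical sanity check of the claim, with toy numbers of my own: pick an arbitrary distribution and an arbitrary error pattern for h_t, apply the update rule, and verify that the reweighted error is 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8
D = rng.random(m)
D = D / D.sum()                          # an arbitrary distribution D^(t)
y = rng.choice([-1.0, 1.0], size=m)      # arbitrary labels
preds = y.copy()
preds[:3] = -preds[:3]                   # pretend h_t errs on the first 3 examples
eps_t = D[preds != y].sum()
w_t = 0.5 * np.log(1.0 / eps_t - 1.0)    # AdaBoost's choice of w_t
D_next = D * np.exp(-w_t * y * preds)
D_next = D_next / D_next.sum()
err_next = D_next[preds != y].sum()      # the claim: exactly 1/2 (up to rounding)
```

Note the algebra never used ε_t < 1/2, so the identity holds for any error pattern; it is exactly the choice of w_t that balances the reweighted mistakes and non-mistakes.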
Theorem

If WL is a $(1/2 - \gamma, \delta)$-weak learner then, with probability at least $1 - \delta T$,
$$L_S(h_S) \le \exp(-2\gamma^2 T) \, .$$

Remarks:
- For any $\varepsilon > 0$ and $\gamma \in (0, 1/2)$, if $T \ge \frac{\log(1/\varepsilon)}{2\gamma^2}$, then AdaBoost will output a hypothesis $h_S$ with $L_S(h_S) \le \varepsilon$.
- Setting $\varepsilon = 1/(2m)$, the hypothesis $h_S$ must have zero training error.
- Since the weak learner is invoked on a distribution over $S$, in many cases $\delta$ can be $0$. In any case, by "boosting the confidence", we can assume w.l.o.g. that $\delta$ is very small.
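The zero-training-error remark deserves one extra step. Since $L_S(h_S) = \frac{1}{m}\sum_{i=1}^m \mathbb{1}[h_S(x_i) \ne y_i]$ can only take values in $\{0, \frac{1}{m}, \frac{2}{m}, \ldots, 1\}$, any bound strictly below $\frac{1}{m}$ already forces it to be zero:

$$
T \;\ge\; \frac{\log(2m)}{2\gamma^2}
\;\Longrightarrow\;
L_S(h_S) \;\le\; \exp(-2\gamma^2 T) \;\le\; \frac{1}{2m} \;<\; \frac{1}{m}
\;\Longrightarrow\;
L_S(h_S) = 0 \, .
$$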
![Page 52: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/52.jpg)
AdaBoost as a Learner for Halfspaces++

- Let $B$ be the set of all hypotheses the WL may return.
- Observe that AdaBoost outputs a hypothesis from the class
$$L(B, T) = \left\{ x \mapsto \mathrm{sign}\left(\sum_{t=1}^T w_t h_t(x)\right) : w \in \mathbb{R}^T,\ \forall t,\ h_t \in B \right\} .$$
- Since WL is invoked only on distributions over $S$, we can assume w.l.o.g. that $B = \{g_1, \ldots, g_d\}$ for some $d \le 2^m$.
- Denote $\psi(x) = (g_1(x), \ldots, g_d(x))$. Therefore
$$L(B, T) = \left\{ x \mapsto \mathrm{sign}(\langle w, \psi(x) \rangle) : w \in \mathbb{R}^d,\ \|w\|_0 \le T \right\} ,$$
where $\|w\|_0 = |\{i : w_i \ne 0\}|$.
- That is, AdaBoost learns a composition of the class of halfspaces with sparse coefficients over the mapping $x \mapsto \psi(x)$.
![Page 57: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/57.jpg)
Expressiveness of $L(B, T)$

- Suppose $X = \mathbb{R}$ and $B$ is the class of decision stumps,
$$B = \{ x \mapsto \mathrm{sign}(x - \theta) \cdot b : \theta \in \mathbb{R},\ b \in \{\pm 1\} \} .$$
- Let $G_T$ be the class of piecewise-constant functions with $T$ pieces.
- Claim: $G_T \subseteq L(B, T)$.
- Composing halfspaces on top of simple classes can be very expressive!
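The claim can be illustrated with a concrete target (the construction below is a sketch, not taken from the slides): a 3-piece function with values $+1, -1, +1$ and jumps at $x = 1$ and $x = 2$ is realised in $L(B, 3)$ by one stump whose threshold lies below every input (acting as a constant) plus one stump per internal boundary, with each boundary's weight large enough to flip the sign of the running sum.

```python
import numpy as np

def stump(theta, b):
    # Decision stump x -> b * sign(x - theta), with sign(0) taken as +1.
    return lambda x: b * np.where(x - theta >= 0, 1.0, -1.0)

# Target in G_3: +1 on (-inf, 1), -1 on [1, 2), +1 on [2, inf).
hs = [stump(-1e6, 1.0),   # threshold below any input: acts as the constant +1
      stump(1.0, -1.0),   # flips the sum's sign at the boundary x = 1
      stump(2.0, 1.0)]    # flips it back at the boundary x = 2
ws = [1.0, 2.0, 2.0]      # each boundary weight exceeds the running total

def F(x):                 # an element of L(B, 3)
    return np.sign(sum(w * h(x) for w, h in zip(ws, hs)))

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0])
print(F(x))               # +1 +1 -1 -1 +1 +1, i.e. the 3-piece target
```

The same pattern extends to any $T$: a constant stump plus $T - 1$ boundary stumps with geometrically dominating weights realise any piecewise-constant sign pattern with $T$ pieces.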
![Page 63: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/63.jpg)
Bias-complexity

- Recall the decomposition of the error into an approximation term and an estimation term, $\varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}}$.
- We have argued that the expressiveness of $L(B, T)$ grows with $T$.
- In other words, the approximation error decreases with $T$.
- We'll show that the estimation error increases with $T$.
- Therefore, the parameter $T$ of AdaBoost enables us to control the bias-complexity tradeoff.
![Page 67: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/67.jpg)
The Estimation Error of $L(B, T)$

- Claim: $\mathrm{VCdim}(L(B, T)) \le O(T \cdot \mathrm{VCdim}(B))$
- Corollary: if $m \ge \Omega\left(\frac{\log(1/\delta)}{\gamma^2 \varepsilon}\right)$ and $T = \log(m)/(2\gamma^2)$, then w.p. of at least $1 - \delta$,
$$L_{(\mathcal{D}, f)}(h_S) \le \varepsilon \, .$$
![Page 70: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/70.jpg)
Weak Learnability and Separability with Margin

- We have essentially shown: if $H$ is weak learnable, then $H \subseteq L(B, \infty)$.
- What about the other direction?
- Using von Neumann's minimax theorem, it can be shown that if $L(B, \infty)$ separates a training set with $\ell_1$ margin $\gamma$, then $\mathrm{ERM}_B$ is a $\gamma$-weak learner for $H$.
- This is beyond the scope of the course.
![Page 75: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/75.jpg)
Face Detection
Classify rectangles in an image as face or non-face
![Page 76: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/76.jpg)
Weak Learner for Face Detection
Rules of thumb:
- "the eye region is often darker than the cheeks"
- "the bridge of the nose is brighter than the eyes"

Goal:
- We want to combine a few rules of thumb to obtain a face detector.
- "Sparsity" yields both a small estimation error and speed!
![Page 78: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/78.jpg)
Weak Learner for Face Detection
Each hypothesis in the base class is of the form h(x) = f(g(x)), where fis a decision stump and g : R24,24 → R is parameterized by:
An axis-aligned rectangle R. Since each image is of size 24× 24,there are at most 244 axis-aligned rectangles.
A type, t ∈ A,B,C,D. Each type corresponds to a mask:
A B
C D
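The Viola–Jones excerpt quoted below notes that each rectangle feature needs only a handful of array references. The standard trick (assumed here; the slides do not spell it out) is an integral image $ii$ with $ii[r, c] = \sum_{r' < r,\, c' < c} \mathrm{img}[r', c']$: any rectangle sum then costs 4 lookups, and a two-rectangle type-A feature costs 6 distinct lookups. The sign convention (left half minus right half) is a hypothetical choice for illustration.

```python
import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[:r, :c]; zero-padded so ii[0, :] = ii[:, 0] = 0.
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, top, left, h, w):
    # Sum of pixels in the h x w rectangle at (top, left): 4 array references.
    return (ii[top + h, left + w] - ii[top, left + w]
            - ii[top + h, left] + ii[top, left])

def feature_type_A(ii, top, left, h, w):
    # Two-rectangle feature: left half minus right half (w must be even).
    # The two halves share two corners, so only 6 distinct lookups are needed.
    half = w // 2
    return (rect_sum(ii, top, left, h, half)
            - rect_sum(ii, top, left + half, h, half))

img = np.zeros((24, 24))
img[:, :12] = 1.0                         # bright left half, dark right half
ii = integral_image(img)
print(feature_type_A(ii, 0, 0, 24, 24))   # 288.0 = (24 * 12) - 0
```

Computing the integral image once per frame lets every stump $f(g(x))$ be evaluated in constant time, which is what makes the sparse AdaBoost detector fast at test time.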
![Page 79: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/79.jpg)
AdaBoost for Face Detection
The first and second features selected by AdaBoost, as implemented by Viola and Jones.

Figure 5: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.
4 The Attentional Cascade

This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.

Stages in the cascade are constructed by training classifiers using AdaBoost. Starting with a two-feature strong classifier, an effective face filter can be obtained by adjusting the strong classifier threshold to minimize false negatives. The initial AdaBoost threshold is designed to yield a low error rate on the training data. A lower threshold yields higher detection rates and higher false positive rates. Based on performance measured using a validation training set, the two-feature classifier can be adjusted to detect 100% of the faces with a false positive rate of 40%. See Figure 5 for a description of the two features used in this classifier.

The detection performance of the two-feature classifier is far from acceptable as an object detection system. Nevertheless, the classifier can significantly reduce the number of sub-windows that need further processing with very few operations:

1. Evaluate the rectangle features (requires between 6 and 9 array references per feature).
2. Compute the weak classifier for each feature (requires one threshold operation per feature).
![Page 80: Introduction to Machine Learning (67577) Lecture 4shais/Lectures2014/lecture4.pdf · Introduction to Machine Learning (67577) Lecture 4 Shai Shalev-Shwartz School of CS and Engineering,](https://reader030.vdocuments.us/reader030/viewer/2022040214/5ec5eefff1e2d22df87563c3/html5/thumbnails/80.jpg)
Summary

- Boosting the confidence using validation
- Boosting the accuracy using AdaBoost
- The power of composing halfspaces over simple classes
- The bias-complexity tradeoff
- AdaBoost works well in many practical problems!