Online Algorithms: Perceptron and...
TRANSCRIPT
Outline
• Online Model: Mistake Bound
Simple algorithms
• CON, HAL, ELIM
• Online to PAC
• Linear Separator
Perceptron
• Realizable case
• Unrealizable case – Hinge loss
Winnow
Introduction to Machine Learning 2
Online Model
• Examples arrive sequentially
• Need to make a prediction
Afterwards observe the outcome
• No distributional assumptions
• Goal: Minimize the number of mistakes
Online Algorithm
Example xt
Prediction ht(xt)
Label c*(xt)
Time t:
Con Algorithm
• Realizable setting.
• Modest goal:
At most |C|-1 mistakes
• How?
• At time t
Let Ct be the set of concepts consistent with x1, x2, …, xt-1: for all c in Ct and all i ≤ t-1, c(xi) = c*(xi)
Let ht be an arbitrary concept in Ct
Predict ht(xt); observe c*(xt)
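As a quick illustration, the CON loop above can be sketched in Python; representing the finite class as a list of predicate functions is an assumption of this sketch, not something the slides specify:

```python
def con_learner(concepts, stream):
    """CON: predict with an arbitrary consistent concept, prune after each example."""
    consistent = list(concepts)          # C_1 = C
    mistakes = 0
    for x, label in stream:              # stream of (x_t, c*(x_t)) pairs
        prediction = consistent[0](x)    # h_t: any concept in C_t
        if prediction != label:
            mistakes += 1
        # keep only the concepts that agree with the observed label
        consistent = [c for c in consistent if c(x) == label]
    return mistakes
```

Any consistent concept may serve as ht; the bound depends only on how fast Ct shrinks after each mistake.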
CON Algorithm
• Theorem: For any concept class C, CON makes at most |C| − 1 mistakes
• Proof: Initially 𝐶1 = 𝐶.
• After each mistake |Ct| decreases by at least 1
• |Ct| ≥ 1, since c* ∈ Ct at any t
• Therefore the number of mistakes is bounded by |C| − 1
Can we do better?!
HAL – halving algorithm
• Change selection of ht
• At time t, given xt:
For every c ∈ Ct compute c(xt)
Predict the majority: ht(x) = M{c(x) : c ∈ Ct}
– M is the majority vote
• Might be that ℎ𝑡 ∉ 𝐶𝑡
• HAL algorithm
• At time t
Let Ct be the set of concepts consistent with x1, x2, …, xt-1: for all c in Ct and all i ≤ t-1, c(xi) = c*(xi)
Let ht(x) = M{c(x) : c ∈ Ct}
Predict ht(xt); observe c*(xt)
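The same loop with a majority-vote prediction gives a minimal sketch of HAL (the predicate-list representation of the class is again an illustrative assumption):

```python
def halving_learner(concepts, stream):
    """HAL: predict the majority vote of all still-consistent concepts."""
    consistent = list(concepts)                    # C_1 = C
    mistakes = 0
    for x, label in stream:
        votes = sum(1 for c in consistent if c(x))
        prediction = 2 * votes >= len(consistent)  # h_t(x) = M{c(x) : c in C_t}
        if prediction != label:
            mistakes += 1      # at least half of C_t just got eliminated
        consistent = [c for c in consistent if c(x) == label]
    return mistakes
```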
HAL – halving algorithm
• Theorem: For any concept class C, HAL makes at most log2 |C| mistakes
• Proof: Initially 𝐶1 = 𝐶.
• After each mistake |Ct+1| ≤ ½ |Ct|, since the majority of the consistent concepts were wrong
• Therefore the number of mistakes is bounded by log2 |C|
Example: Learning OR of literals
• Inputs: (x1, … , xn)
• Literals: zi, z̄i
• OR functions: z1 ∨ z̄4 ∨ z7
• Realizable case:
c*(x) is an OR function
• L* is its set of literals
• ELIM algorithm:
Initialize: L1 = {z1, z̄1, …, zn, z̄n}
At time t, receive xt = (x1, …, xn)
Predict OR(Lt, xt)
Receive c*(xt)
• If c*(xt) is negative
– delete from Lt every literal that evaluates to 1 on xt
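A minimal sketch of ELIM, under the assumption that a literal is encoded as a pair (i, s) with s = True for zi and s = False for z̄i:

```python
def elim_learner(n, stream):
    """ELIM for an OR of literals over {0,1}^n; (i, True) is z_i, (i, False) is its negation."""
    literals = {(i, s) for i in range(n) for s in (True, False)}   # L_1: all 2n literals
    mistakes = 0
    for x, label in stream:                        # label = c*(x_t)
        value = lambda i, s: (x[i] == 1) == s      # does this literal evaluate to 1 on x_t?
        prediction = any(value(i, s) for (i, s) in literals)
        if prediction != label:
            mistakes += 1
        if not label:
            # negative example: delete every literal that evaluates to 1 on x_t
            literals = {(i, s) for (i, s) in literals if not value(i, s)}
    return mistakes
```

Mistakes happen only on negative examples, and each one removes at least one literal, which is exactly what the analysis below counts.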
What is the MAXIMUM number of mistakes?
Analysis of ELIM
• Properties:
• Properties:
L* ⊆ L1, Lt ⊆ Lt-1, L* ⊆ Lt
On a mistake: |Lt| ≤ |Lt-1| − 1
First mistake: |Lt-1| − |Lt| = n
Later mistakes: |Lt-1| − |Lt| ≥ 1
• No mistake when c*(xt) = +1, since L* ⊆ Lt
• Number of mistakes:
Initially |L1| = 2n
First mistake: reduces |Lt| by n
Each additional mistake: reduces |Lt| by at least 1
Maximum number of mistakes: n + 1
Regret Minimization
• What about non-realizable setting?
• We can guarantee:
Sequence of length T; average loss:
Loss(online) ≤ minc∈C Loss(c) + √(log|C| / T)
Best expert problem.
• Regret = Loss(online) − minc∈C Loss(c)
• Not in the course !
Mistake Bound and PAC
• Are they equivalent?
Class C is PAC learnable iff
Class C has finite Mistake Bound
• NO!
Prefixes of [0,1]
PAC learnable
No finite mistake bound
• Why?
• What about the other direction?
Algorithm A learns C in the mistake bound model
Does A learn C in the PAC model?
• We will construct APAC, which PAC-learns C
• APAC uses A as a subroutine
Conservative Online Algorithm
• A is conservative if
for every sample xt, if c*(xt) = ht(xt) then
• ht+1 = ht
• Conservative algorithm
Changes its hypothesis only after mistakes
After a mistake it always changes its hypothesis
• Goal: show how to convert any algorithm A to a conservative algorithm A’
Same mistake bound
• Basic idea:
Feed A only mistakes
Conservative Mistake Bound Algorithm
• Given A, define A':
Initially h0 = A(∅)
At time t:
• Predict ht(xt)
• IF c*(xt) = ht(xt),
– THEN ht+1 = ht
– ELSE
» ER = ER || xt
» ht+1 = A(ER)
• For any input sequence
If A' makes M mistakes
Then A makes M mistakes on ER
• Claim: the mistake bounds of A and A' are identical
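The conversion can be sketched as a wrapper around A; here A is modeled as a hypothetical function that maps an error sequence ER to a hypothesis:

```python
def conservative_run(A, stream):
    """A': feed A only the examples that A' itself got wrong."""
    error_seq = []                  # ER
    h = A(error_seq)                # h_0 = A(empty sequence)
    mistakes = 0
    for x, label in stream:
        if h(x) != label:           # mistake: append x_t to ER and rerun A
            mistakes += 1
            error_seq.append((x, label))
            h = A(error_seq)
        # on a correct prediction h_{t+1} = h_t: nothing changes
    return mistakes
```

A sees exactly the mistake subsequence ER, so every mistake of A' is also a mistake of A on ER, which is why the bounds coincide.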
Building APAC
• Given algorithm A: at most M mistakes, A is conservative
• APAC works in stages: at most M + 1 stages
• Stage i: run algorithm A
• If A makes an error: start stage i + 1
• If no error for mi = (1/ε) log(1/δi) consecutive examples: output hi
• Termination: in each completed stage A makes an error, so at most M completed stages
• Performance: if hi is ε-bad, the probability it survives stage i is (1 − ε)^mi ≤ e^(−ε mi) = δi
• Confidence: set δi = δ/2^i, so Σi δi ≤ δ
• Sample size: Σi=1..M+1 mi ≤ ((M+1)/ε)((M+2)/2 + log(1/δ))
Learning Linear Separators
• Input {0,1}d or Rd
• Linear Separator
weights w in Rd and threshold θ
hypothesis h(x)=+1 iff
<w,x> = Σ wi xi ≥θ
• Simplifying assumptions:
θ=0 (add coordinate x0 such that x0=1 always)
||x|| = 1
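Both simplifying assumptions amount to one preprocessing step, sketched here in Python (the helper name is illustrative):

```python
import numpy as np

def homogenize(x):
    """Append the constant coordinate x0 = 1, then rescale so ||x|| = 1."""
    x = np.append(np.asarray(x, dtype=float), 1.0)
    return x / np.linalg.norm(x)

# The matching weight map is w -> (w, -θ): <w, x> >= θ  iff  <(w, -θ), (x, 1)> >= 0,
# and rescaling x by a positive constant does not change the sign of the inner product.
```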
Perceptron - Algorithm
• Initialize w1=(0,…,0)
• Given example xt,
predict positive iff <wt ,xt> ≥ 0
• On a mistake at time t: wt+1 = wt + c*(xt) xt
False negative (i.e., c*(xt) = +1): wt+1 = wt + xt
False positive (i.e., c*(xt) = −1): wt+1 = wt − xt
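The update rule above, as a minimal Python sketch (the data fed to it is hypothetical):

```python
import numpy as np

def perceptron(stream):
    """stream yields (x_t, y_t) with y_t in {+1, -1}; returns (w, #mistakes)."""
    w, mistakes = None, 0
    for x, y in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)    # w_1 = (0, ..., 0)
        prediction = 1 if np.dot(w, x) >= 0 else -1
        if prediction != y:
            w = w + y * x                        # w_{t+1} = w_t + c*(x_t) x_t
            mistakes += 1
    return w, mistakes
```

On unit-norm points with margin γ = 0.6 around w* = (1, 0), the theorem of the following slides caps the total number of mistakes at 1/γ² ≈ 2.8, regardless of how many passes are made.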
Perceptron - motivation
• False Negative
c*(xt) = +1
<wt, xt> negative
after the update:
<wt+1, xt>
= <wt, xt> + <xt, xt>
= <wt, xt> + 1
• False Positive
c*(xt) = −1
<wt, xt> positive
after the update:
<wt+1, xt>
= <wt, xt> − <xt, xt>
= <wt, xt> − 1
Perceptron Example
[Figure, over four slides: points labeled +1 and −1 on either side of the separator x1 − x2 = 0; processing examples 1, 2, 3 the weight vector evolves w1 = (0,0), w2 = (0,0), w3 = (−1,0), w4 = (−0.2, +0.6).]
Perceptron - Geometric Interpretation
[Figure: the update on example 3 shown geometrically; w4 = w3 − x3, i.e., subtracting the misclassified negative example x3 from w3.]
Perceptron - Analysis
• target concept c*(x) uses w* and ||w*|| = 1
• Margin γ: γ = min x∈S |<x, w*>|
• Theorem: Number of mistakes ≤ 1/γ²
Perceptron - Performance
Claim 1: <wt+1, w*> ≥ <wt, w*> + γ
Assume c*(x)=+1
<wt+1 , w*> =
<(wt +x) , w*> =
<wt, w*> +<x ,w*> ≥
<wt, w*> + γ
Similar for c*(x)=-1
Claim 2: ||wt+1 ||2 ≤ ||wt||2+1
Assume c*(x)=+1
||wt+1 ||2 =
||wt +x||2 =
||wt ||2 + 2<wt,x> + ||x||2 ≤
||wt ||2 +1
Since x was a mistake, <wt, x> is negative.
Similar for c*(x)=-1
Perceptron - Performance
Claim 3: <wt, w*> ≤ ||wt|| ||w*|| = ||wt|| (Cauchy–Schwarz, with ||w*|| = 1)
Completing the proof
• After M mistakes:
<wM+1, w*> ≥ γM (claim 1)
||wM+1||² ≤ M (claim 2)
γM ≤ <wM+1, w*> ≤ ||wM+1|| ≤ √M, hence M ≤ 1/γ²
Perceptron
• Guaranteed convergence
realizable case
• Can be very slow (even for {0,1}d)
• Additive increases:
problematic with large weights
• Still, a simple benchmark
Perceptron – Unrealizable case
Motivation
[Figure: a linearly separable (realizable) sample vs. a non-separable (unrealizable) sample]
Hinge Loss
Motivation
• “Move” points to make the data realizable
– with margin γ
• correct points (both classification and margin)
– zero loss
• mistaken points (even just on the margin)
– loss is the distance moved
Definition
• Assume <x, w*> = β
• Hinge Loss with margin γ: Lγ(x) = max{0, 1 − c*(x)β/γ}
• Compare to the error, which counts a mistake iff c*(x)β < 0
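The definition is a one-liner; a small sketch with hypothetical numbers:

```python
def hinge_loss(beta, label, gamma):
    """max{0, 1 - c*(x)·β/γ}, where β = <x, w*> and label = c*(x) in {+1, -1}."""
    return max(0.0, 1.0 - label * beta / gamma)

# With γ = 0.2: a correct point well past the margin (β = 0.5, label +1) has
# zero loss; a correct but small-margin point (β = 0.1) pays 0.5; a
# misclassified point (β = -0.1) pays 1.5 -- more than its 0/1 error of 1.
```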
Perceptron - Performance
• Let TDγ = total distance = Σi max{0, γ − c*(xi)βi}, where βi = <xi, w*>
• Claim 1’: <wM+1 ,w*> ≥ γM – TDγ
• Claim 2: ||wt+1 ||2 ≤ ||wt||2+1
• Bounding the mistakes:
γM − TDγ ≤ <wM+1, w*> ≤ ||wM+1|| ≤ √M ≤ (γM + 1/γ)/2
hence M ≤ 1/γ² + 2TDγ/γ
Winnow
Winnow –motivation
• Updates
multiplicative vs additive
• Domain
{0,1}d or [0,1]d
• we will use {0,1}d
• Weights
non-negative
• monotone functions
• Separation
c*(x)=+1: <w*,x> ≥ θ
c*(x)=-1: <w*,x> ≤ θ -γ
θ ≥ 1
• part of the input
• Remarks:
normalizing x in L∞ to 1
Winnow - Algorithm
• parameter β >1
we will use β=1+γ/2
• Initialize w=(1, … , 1)
• predict h(x)=+1 iff
<w,x> ≥ θ
• For a mistake:
• False Positive (demotion)
c*(x)=-1, h(x)=+1
for every xi=1: wi=wi/β
• False Negative (promotion)
c*(x)=+1, h(x)=-1
for every xi=1: wi=βwi
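The promotion/demotion loop above, as a minimal Python sketch:

```python
def winnow(stream, d, theta, beta):
    """Winnow over x in {0,1}^d; labels are c*(x_t) in {+1, -1}."""
    w = [1.0] * d                        # w = (1, ..., 1)
    mistakes = 0
    for x, label in stream:
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else -1
        if prediction != label:
            mistakes += 1
            for i in range(d):
                if x[i] == 1:
                    # promotion on a false negative, demotion on a false positive
                    w[i] = w[i] * beta if label == 1 else w[i] / beta
    return w, mistakes
```

Only coordinates with xi = 1 are touched, so a weight changes exactly when its coordinate participated in a mistake.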
Winnow - intuition
• Demotion step
target negative
hypothesis positive
• Before update
<w,x>=α ≥ θ
• After the update:
<w,x> = α/β < α
• Decrease in ∑ wi
at least (1 − 1/β)θ
• Promotion step
target positive
hypothesis negative
• Before update
<w,x>=α < θ
• After the update:
<w,x> = αβ > α
• Increase in ∑ wi
at most (β-1)θ
Winnow - example
• Target function:
• w*=(2,2,0,0)
• θ=2 , β=2
• What is the target function?
x1 v x2
monotone OR
• w0 = (1,1,1,1)
• x1 = (0,0,1,1), c*(x1) = −1
w1 = (1, 1, ½, ½)
• x2 = (1,0,1,0), c*(x2) = +1
w2 = (2, 1, 1, ½)
• x3 = (0,1,0,1), c*(x3) = +1
w3 = (2, 2, 1, 1)
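The three updates above can be checked mechanically (θ = 2, β = 2; demotion divides by β, promotion multiplies by β):

```python
w = [1.0, 1.0, 1.0, 1.0]
examples = [((0, 0, 1, 1), -1), ((1, 0, 1, 0), +1), ((0, 1, 0, 1), +1)]
for x, label in examples:
    prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 2 else -1
    if prediction != label:            # each of the three examples is a mistake
        w = [wi * (2 if label == 1 else 0.5) if xi else wi
             for wi, xi in zip(w, x)]
print(w)   # [2.0, 2.0, 1.0, 1.0]
```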
Winnow - Theorem
• Theorem (realizable case)
Number of mistakes bounded by O((1/γ²) (Σi=1..d w*i) ln θ)
• Corollary: For θ = d we have O((1/γ²) (Σi=1..d w*i) ln d)
Winnow - Analysis
• Mistakes:
u promotion steps
v demotion steps
mistakes = u + v
• Lemma 1: v ≤ βu + (β/(β−1)) d/θ
the total weight Σi wi starts at d, stays non-negative, increases by at most (β−1)θ on a promotion and decreases by at least (1 − 1/β)θ on a demotion
• Lemma 2: wi ≤ βθ
we won't promote unless wi by itself is not enough to pass θ
• Lemma 3: after u promotions and v demotions there exists an i with logβ wi ≥ (uθ − v(θ−γ)) / Σj w*j
due to the sum inequalities before, and since the weights are positive
• Proof of theorem: combine Lemma 3 with Lemma 2 (wi ≤ βθ) and Lemma 1; solving for u + v gives the O((1/γ²) (Σi w*i) ln θ) bound
Winnow vs Perceptron
Perceptron
• Additive updates
– slow for large d
– slow with large weights
• Non-monotone
– handles non-monotone targets naturally
• Simple Algorithm
• Margin scale: L2(w*) L2(x)
Winnow
• Multiplicative updates
– handles large d nicely
– OK with large weights
• Monotone
– need to make the target monotone
– flip non-monotone attributes
• Simple Algorithm
• Margin scale: L1(w*) L∞(x)
• Additional factor log d
– for θ = d
Summary
Linear Separators
– Today: Perceptron and Winnow
– Next week: SVM
– 2 weeks: Kernels
– later: Adaboost
Brief history:
– Perceptron
• Rosenblatt 1957
– Fell out of favor in 70s
• representation issues
– Reemerged with Neural nets
• late 80s early 90s
– Linear separators:
• Adaboost and SVM
– The immediate future: deep learning