Online Algorithms: Perceptron and...
TRANSCRIPT
Outline
• Online Model: Mistake Bound
Simple algorithms
• CON, HAL, ELIM
• Online to PAC
• Linear Separator
Perceptron
• Realizable case
• Unrealizable case – Hinge loss
Winnow
Introduction to Machine Learning 2
Online Model
• Examples arrive sequentially
• Need to make a prediction
Afterwards observe the outcome
• No distributional assumptions
• Goal: Minimize the number of mistakes
Online Algorithm
Example xt
Prediction ht(xt)
Label c*(xt)
Time t:
Con Algorithm
• Realizable setting.
• Modest goal:
At most |C|-1 mistakes
• How?
• At time t
Let Ct be the set of concepts consistent with x1, x2, …, xt-1: for all c in Ct and all i ≤ t-1, c(xi) = c*(xi)
Let ht be an arbitrary concept in Ct
Predict ht(xt); observe c*(xt)
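As a quick illustration, the CON loop above can be sketched in Python; representing the finite class as a list of predicate functions is an assumption of this sketch, not something the slides specify:

```python
def con_learner(concepts, stream):
    """CON: predict with an arbitrary consistent concept, prune after each example."""
    consistent = list(concepts)          # C_1 = C
    mistakes = 0
    for x, label in stream:              # stream of (x_t, c*(x_t)) pairs
        prediction = consistent[0](x)    # h_t: any concept in C_t
        if prediction != label:
            mistakes += 1
        # keep only the concepts that agree with the observed label
        consistent = [c for c in consistent if c(x) == label]
    return mistakes
```

Any consistent concept may serve as ht; the bound depends only on how fast Ct shrinks after each mistake.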
CON Algorithm
• Theorem: For any concept class C, CON makes at most |C| − 1 mistakes
• Proof: Initially 𝐶1 = 𝐶.
• After each mistake |Ct| decreases by at least 1
• |Ct| ≥ 1, since c* ∈ Ct at any t
• Therefore the number of mistakes is bounded by |C| − 1
Can we do better?!
HAL – halving algorithm
• Change selection of ht
• At time t, given xt:
For every c ∈ Ct compute c(xt)
Predict the majority: ht(x) = M{c(x) : c ∈ Ct}
– M is the majority vote
• Might be that ℎ𝑡 ∉ 𝐶𝑡
• HAL algorithm
• At time t
Let Ct be the set of concepts consistent with x1, x2, …, xt-1: for all c in Ct and all i ≤ t-1, c(xi) = c*(xi)
Let ht(x) = M{c(x) : c ∈ Ct}
Predict ht(xt); observe c*(xt)
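The same loop with a majority-vote prediction gives a minimal sketch of HAL (the predicate-list representation of the class is again an illustrative assumption):

```python
def halving_learner(concepts, stream):
    """HAL: predict the majority vote of all still-consistent concepts."""
    consistent = list(concepts)                    # C_1 = C
    mistakes = 0
    for x, label in stream:
        votes = sum(1 for c in consistent if c(x))
        prediction = 2 * votes >= len(consistent)  # h_t(x) = M{c(x) : c in C_t}
        if prediction != label:
            mistakes += 1      # at least half of C_t just got eliminated
        consistent = [c for c in consistent if c(x) == label]
    return mistakes
```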
HAL – halving algorithm
• Theorem: For any concept class C, HAL makes at most log2 |C| mistakes
• Proof: Initially 𝐶1 = 𝐶.
• After each mistake |Ct+1| ≤ ½ |Ct|, since the majority of the consistent concepts were wrong
• Therefore the number of mistakes is bounded by log2 |C|
Example: Learning OR of literals
• Inputs: (x1, … , xn)
• Literals: zi, z̄i
• OR functions: z1 ∨ z̄4 ∨ z7
• Realizable case:
c*(x) is an OR function
• L* is its set of literals
• ELIM algorithm:
Initialize: L1 = {z1, z̄1, …, zn, z̄n}
At time t, receive xt = (x1, …, xn)
Predict OR(Lt, xt)
Receive c*(xt)
• If c*(xt) is negative
– delete from Lt every literal that evaluates to 1 on xt
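A minimal sketch of ELIM, under the assumption that a literal is encoded as a pair (i, s) with s = True for zi and s = False for z̄i:

```python
def elim_learner(n, stream):
    """ELIM for an OR of literals over {0,1}^n; (i, True) is z_i, (i, False) is its negation."""
    literals = {(i, s) for i in range(n) for s in (True, False)}   # L_1: all 2n literals
    mistakes = 0
    for x, label in stream:                        # label = c*(x_t)
        value = lambda i, s: (x[i] == 1) == s      # does this literal evaluate to 1 on x_t?
        prediction = any(value(i, s) for (i, s) in literals)
        if prediction != label:
            mistakes += 1
        if not label:
            # negative example: delete every literal that evaluates to 1 on x_t
            literals = {(i, s) for (i, s) in literals if not value(i, s)}
    return mistakes
```

Mistakes happen only on negative examples, and each one removes at least one literal, which is exactly what the analysis below counts.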
What is the MAXIMUM number of mistakes?
Analysis of ELIM
• Properties:
• Properties:
L* ⊆ L1, Lt ⊆ Lt-1, L* ⊆ Lt
On a mistake: |Lt| ≤ |Lt-1| − 1
First mistake: |Lt-1| − |Lt| = n
Later mistakes: |Lt-1| − |Lt| ≥ 1
• No mistake when c*(xt) = +1, since L* ⊆ Lt
• Number of mistakes:
Initially |L1| = 2n
First mistake: reduces |Lt| by n
Each additional mistake: reduces |Lt| by at least 1
Maximum number of mistakes: n + 1
Regret Minimization
• What about non-realizable setting?
• We can guarantee:
Sequence of length T; average loss:
Loss(online) ≤ minc∈C Loss(c) + √(log|C| / T)
Best expert problem.
• Regret = Loss(online) − minc∈C Loss(c)
• Not in the course !
Mistake Bound and PAC
• Are they equivalent?
Class C is PAC learnable iff
Class C has finite Mistake Bound
• NO!
Prefixes of [0,1]
PAC learnable
No finite mistake bound
• Why?
• What about the other direction?
Algorithm A learns C in the mistake bound model
Does A learn C in the PAC model?
• We will construct APAC, which PAC-learns C
• APAC uses A as a subroutine
Conservative Online Algorithm
• A is conservative if
for every sample xt, if c*(xt) = ht(xt) then
• ht+1 = ht
• Conservative algorithm
Changes its hypothesis only after mistakes
After a mistake it always changes its hypothesis
• Goal: show how to convert any algorithm A to a conservative algorithm A’
Same mistake bound
• Basic idea:
Feed A only mistakes
Conservative Mistake Bound Algorithm
• Given A, define A':
Initially h0 = A(∅)
At time t:
• Predict ht(xt)
• IF c*(xt) = ht(xt),
– THEN ht+1 = ht
– ELSE
» ER = ER || xt
» ht+1 = A(ER)
• For any input sequence
If A' makes M mistakes
Then A makes M mistakes on ER
• Claim: the mistake bounds of A and A' are identical
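The conversion can be sketched as a wrapper around A; here A is modeled as a hypothetical function that maps an error sequence ER to a hypothesis:

```python
def conservative_run(A, stream):
    """A': feed A only the examples that A' itself got wrong."""
    error_seq = []                  # ER
    h = A(error_seq)                # h_0 = A(empty sequence)
    mistakes = 0
    for x, label in stream:
        if h(x) != label:           # mistake: append x_t to ER and rerun A
            mistakes += 1
            error_seq.append((x, label))
            h = A(error_seq)
        # on a correct prediction h_{t+1} = h_t: nothing changes
    return mistakes
```

A sees exactly the mistake subsequence ER, so every mistake of A' is also a mistake of A on ER, which is why the bounds coincide.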
Building APAC
• Given algorithm A: at most M mistakes, A is conservative
• APAC works in stages: at most M + 1 stages
• Stage i: run algorithm A
• If A makes an error: start stage i + 1
• If no error for mi = (1/ε) log(1/δi) consecutive examples: output hi
• Termination: in each completed stage A makes an error, so at most M completed stages
• Performance: if hi is ε-bad, the probability it survives stage i is (1 − ε)^mi ≤ e^(−ε mi) = δi
• Confidence: set δi = δ/2^i, so Σi δi ≤ δ
• Sample size: Σi=1..M+1 mi ≤ ((M+1)/ε)((M+2)/2 + log(1/δ))
Learning Linear Separators
• Input {0,1}d or Rd
• Linear Separator
weights w in Rd and threshold θ
hypothesis h(x)=+1 iff
<w,x> = Σ wi xi ≥θ
• Simplifying assumptions:
θ=0 (add coordinate x0 such that x0=1 always)
||x|| = 1
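Both simplifying assumptions amount to one preprocessing step, sketched here in Python (the helper name is illustrative):

```python
import numpy as np

def homogenize(x):
    """Append the constant coordinate x0 = 1, then rescale so ||x|| = 1."""
    x = np.append(np.asarray(x, dtype=float), 1.0)
    return x / np.linalg.norm(x)

# The matching weight map is w -> (w, -θ): <w, x> >= θ  iff  <(w, -θ), (x, 1)> >= 0,
# and rescaling x by a positive constant does not change the sign of the inner product.
```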
Perceptron - Algorithm
• Initialize w1=(0,…,0)
• Given example xt,
predict positive iff <wt ,xt> ≥ 0
• On a mistake at time t: wt+1 = wt + c*(xt) xt
False negative (i.e., c*(xt) = +1): wt+1 = wt + xt
False positive (i.e., c*(xt) = −1): wt+1 = wt − xt
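The update rule above, as a minimal Python sketch (the data fed to it is hypothetical):

```python
import numpy as np

def perceptron(stream):
    """stream yields (x_t, y_t) with y_t in {+1, -1}; returns (w, #mistakes)."""
    w, mistakes = None, 0
    for x, y in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)    # w_1 = (0, ..., 0)
        prediction = 1 if np.dot(w, x) >= 0 else -1
        if prediction != y:
            w = w + y * x                        # w_{t+1} = w_t + c*(x_t) x_t
            mistakes += 1
    return w, mistakes
```

On unit-norm points with margin γ = 0.6 around w* = (1, 0), the theorem of the following slides caps the total number of mistakes at 1/γ² ≈ 2.8, regardless of how many passes are made.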
Perceptron - motivation
• False Negative
c*(xt) = +1
<wt, xt> negative
after the update:
<wt+1, xt>
= <wt, xt> + <xt, xt>
= <wt, xt> + 1
• False Positive
c*(xt) = −1
<wt, xt> positive
after the update:
<wt+1, xt>
= <wt, xt> − <xt, xt>
= <wt, xt> − 1
Perceptron Example
[Figure, over four slides: points labeled +1 and −1 on either side of the separator x1 − x2 = 0; processing examples 1, 2, 3 the weight vector evolves w1 = (0,0), w2 = (0,0), w3 = (−1,0), w4 = (−0.2, +0.6).]
Perceptron - Geometric Interpretation
[Figure: the update on example 3 shown geometrically; w4 = w3 − x3, i.e., subtracting the misclassified negative example x3 from w3.]
Perceptron - Analysis
• target concept c*(x) uses w* and ||w*|| = 1
• Margin γ: γ = min x∈S |<x, w*>|
• Theorem: Number of mistakes ≤ 1/γ²
Perceptron - Performance
Claim 1: <wt+1, w*> ≥ <wt, w*> + γ
Assume c*(x)=+1
<wt+1 , w*> =
<(wt +x) , w*> =
<wt, w*> +<x ,w*> ≥
<wt, w*> + γ
Similar for c*(x)=-1
Claim 2: ||wt+1 ||2 ≤ ||wt||2+1
Assume c*(x)=+1
||wt+1 ||2 =
||wt +x||2 =
||wt ||2 + 2<wt,x> + ||x||2 ≤
||wt ||2 +1
Since x was a mistake, <wt, x> is negative.
Similar for c*(x)=-1
Perceptron - Performance
Claim 3: <wt, w*> ≤ ||wt|| ||w*|| = ||wt|| (Cauchy–Schwarz, with ||w*|| = 1)
Completing the proof
• After M mistakes:
<wM+1, w*> ≥ γM (claim 1)
||wM+1||² ≤ M (claim 2)
γM ≤ <wM+1, w*> ≤ ||wM+1|| ≤ √M, hence M ≤ 1/γ²
Perceptron
• Guaranteed convergence
realizable case
• Can be very slow (even for {0,1}d)
• Additive increases:
problematic with large weights
• Still, a simple benchmark
Perceptron – Unrealizable case
Motivation
[Figure: a linearly separable (realizable) sample vs. a non-separable (unrealizable) sample]
Hinge Loss
Motivation
• “Move” points to make the data realizable
– with margin γ
• correct points (both classification and margin)
– zero loss
• mistaken points (even just on the margin)
– loss is the distance moved
Definition
• Assume <x, w*> = β
• Hinge Loss with margin γ: Lγ(x) = max{0, 1 − c*(x)β/γ}
• Compare to the error, which counts a mistake iff c*(x)β < 0
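The definition is a one-liner; a small sketch with hypothetical numbers:

```python
def hinge_loss(beta, label, gamma):
    """max{0, 1 - c*(x)·β/γ}, where β = <x, w*> and label = c*(x) in {+1, -1}."""
    return max(0.0, 1.0 - label * beta / gamma)

# With γ = 0.2: a correct point well past the margin (β = 0.5, label +1) has
# zero loss; a correct but small-margin point (β = 0.1) pays 0.5; a
# misclassified point (β = -0.1) pays 1.5 -- more than its 0/1 error of 1.
```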
Perceptron - Performance
• Let TDγ = total distance = Σi max{0, γ − c*(xi)βi}, where βi = <xi, w*>
• Claim 1’: <wM+1 ,w*> ≥ γM – TDγ
• Claim 2: ||wt+1 ||2 ≤ ||wt||2+1
• Bounding the mistakes:
γM − TDγ ≤ <wM+1, w*> ≤ ||wM+1|| ≤ √M ≤ (γM + 1/γ)/2
hence M ≤ 1/γ² + 2TDγ/γ
Winnow
Winnow –motivation
• Updates
multiplicative vs additive
• Domain
{0,1}d or [0,1]d
• we will use {0,1}d
• Weights
non-negative
• monotone functions
• Separation
c*(x)=+1: <w*,x> ≥ θ
c*(x)=-1: <w*,x> ≤ θ -γ
θ ≥ 1
• part of the input
• Remarks:
normalizing x in L∞ to 1
Winnow - Algorithm
• parameter β >1
we will use β=1+γ/2
• Initialize w=(1, … , 1)
• predict h(x)=+1 iff
<w,x> ≥ θ
• For a mistake:
• False Positive (demotion)
c*(x)=-1, h(x)=+1
for every xi=1: wi=wi/β
• False Negative (promotion)
c*(x)=+1, h(x)=-1
for every xi=1: wi=βwi
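The promotion/demotion loop above, as a minimal Python sketch:

```python
def winnow(stream, d, theta, beta):
    """Winnow over x in {0,1}^d; labels are c*(x_t) in {+1, -1}."""
    w = [1.0] * d                        # w = (1, ..., 1)
    mistakes = 0
    for x, label in stream:
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else -1
        if prediction != label:
            mistakes += 1
            for i in range(d):
                if x[i] == 1:
                    # promotion on a false negative, demotion on a false positive
                    w[i] = w[i] * beta if label == 1 else w[i] / beta
    return w, mistakes
```

Only coordinates with xi = 1 are touched, so a weight changes exactly when its coordinate participated in a mistake.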
Winnow - intuition
• Demotion step
target negative
hypothesis positive
• Before update
<w,x>=α ≥ θ
• After the update:
<w,x> = α/β < α
• Decrease in ∑ wi
at least (1 − 1/β)θ
• Promotion step
target positive
hypothesis negative
• Before update
<w,x>=α < θ
• After the update:
<w,x> = αβ > α
• Increase in ∑ wi
at most (β-1)θ
Winnow - example
• Target function:
• w*=(2,2,0,0)
• θ=2 , β=2
• What is the target function?
x1 v x2
monotone OR
• w0 = (1,1,1,1)
• x1 = (0,0,1,1), c*(x1) = −1
w1 = (1, 1, ½, ½)
• x2 = (1,0,1,0), c*(x2) = +1
w2 = (2, 1, 1, ½)
• x3 = (0,1,0,1), c*(x3) = +1
w3 = (2, 2, 1, 1)
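The three updates above can be checked mechanically (θ = 2, β = 2; demotion divides by β, promotion multiplies by β):

```python
w = [1.0, 1.0, 1.0, 1.0]
examples = [((0, 0, 1, 1), -1), ((1, 0, 1, 0), +1), ((0, 1, 0, 1), +1)]
for x, label in examples:
    prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 2 else -1
    if prediction != label:            # each of the three examples is a mistake
        w = [wi * (2 if label == 1 else 0.5) if xi else wi
             for wi, xi in zip(w, x)]
print(w)   # [2.0, 2.0, 1.0, 1.0]
```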
Winnow - Theorem
• Theorem (realizable case)
Number of mistakes bounded by O((1/γ²) (Σi=1..d w*i) ln θ)
• Corollary: For θ = d we have O((1/γ²) (Σi=1..d w*i) ln d)
Winnow - Analysis
• Mistakes:
u promotion steps
v demotion steps
mistakes = u + v
• Lemma 1: v ≤ βu + (β/(β−1)) d/θ
the total weight Σi wi starts at d, stays non-negative, increases by at most (β−1)θ on a promotion and decreases by at least (1 − 1/β)θ on a demotion
• Lemma 2: wi ≤ βθ
we won't promote unless wi by itself is not enough to pass θ
• Lemma 3: after u promotions and v demotions there exists an i with logβ wi ≥ (uθ − v(θ−γ)) / Σj w*j
due to the sum inequalities before, and since the weights are positive
• Proof of theorem: combine Lemma 3 with Lemma 2 (wi ≤ βθ) and Lemma 1; solving for u + v gives the O((1/γ²) (Σi w*i) ln θ) bound
Winnow vs Perceptron
Perceptron
• Additive updates
– slow for large d
– slow with large weights
• Non-monotone
– handles non-monotone targets naturally
• Simple Algorithm
• Margin scale: L2(w*) L2(x)
Winnow
• Multiplicative updates
– handles large d nicely
– OK with large weights
• Monotone
– need to make the target monotone
– flip non-monotone attributes
• Simple Algorithm
• Margin scale: L1(w*) L∞(x)
• Additional factor log d
– for θ = d
Summary
Linear Separators
– Today: Perceptron and Winnow
– Next week: SVM
– 2 weeks: Kernels
– later: Adaboost
Brief history:
– Perceptron
• Rosenblatt 1957
– Fell out of favor in 70s
• representation issues
– Reemerged with Neural nets
• late 80s early 90s
– Linear separators:
• Adaboost and SVM
– The immediate future: deep learning