
Online Learning

Jordan Boyd-Graber, University of Colorado Boulder, Lecture 21

Slides adapted from Mohri


Motivation

• PAC learning: distribution fixed over time (training and test), IID assumption.

• Online learning:
◦ no distributional assumption.
◦ worst-case analysis (adversarial).
◦ mixed training and test.
◦ Performance measures: mistake model, regret.


General Online Setting

• For t = 1 to T:
◦ Get instance x_t ∈ X
◦ Predict ŷ_t ∈ Y
◦ Get true label y_t ∈ Y
◦ Incur loss L(ŷ_t, y_t)

• Classification: Y = {0, 1}, L(y, y′) = |y′ − y|
• Regression: Y ⊂ R, L(y, y′) = (y′ − y)²

• Objective: Minimize the total loss Σ_t L(ŷ_t, y_t)
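To make the protocol concrete, here is a minimal Python sketch of the online loop for this setting; the learner interface (predict/update) and the data stream are hypothetical placeholders, not something defined on the slides.

```python
# Minimal sketch of the general online learning protocol.
def online_loop(learner, stream, loss):
    """Run the protocol and return the cumulative loss.

    learner: assumed to expose predict(x) and update(x, y)
    stream:  iterable of (x_t, y_t) pairs revealed one round at a time
    loss:    loss function L(y_hat, y)
    """
    total = 0.0
    for x_t, y_t in stream:            # for t = 1 ... T
        y_hat = learner.predict(x_t)   # predict before seeing the label
        total += loss(y_hat, y_t)      # incur loss L(y_hat_t, y_t)
        learner.update(x_t, y_t)       # learner may adapt after every round
    return total

# The two losses from the slide:
zero_one = lambda y_hat, y: abs(y_hat - y)    # classification, Y = {0, 1}
squared = lambda y_hat, y: (y_hat - y) ** 2   # regression, Y ⊂ R
```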


Plan

Experts

Perceptron Algorithm

Online Perceptron for Structure Learning


Prediction with Expert Advice

• For t = 1 to T:
◦ Get instance x_t ∈ X and expert advice a_{t,i} ∈ Y, i ∈ [1, N]
◦ Predict ŷ_t ∈ Y
◦ Get true label y_t ∈ Y
◦ Incur loss L(ŷ_t, y_t)

• Objective: Minimize regret, i.e., the difference between the total loss and that of the best expert

\text{Regret}(T) = \sum_t L(\hat{y}_t, y_t) - \min_i \sum_t L(a_{t,i}, y_t) \quad (1)
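As a concrete illustration (not from the slides), the following sketch replays a short sequence of expert advice and labels and computes the regret of a simple majority vote against the best single expert, i.e., equation (1); the advice matrix and labels are made up.

```python
# Regret of an (unweighted) majority vote against the best single expert.
advice = [          # advice[t][i] = prediction of expert i at round t, Y = {0, 1}
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
labels = [1, 1, 1, 0]

def zero_one(y_hat, y):
    return abs(y_hat - y)

learner_loss = 0
expert_loss = [0] * len(advice[0])
for a_t, y_t in zip(advice, labels):
    y_hat = int(2 * sum(a_t) >= len(a_t))   # majority vote over the experts
    learner_loss += zero_one(y_hat, y_t)
    for i, a in enumerate(a_t):
        expert_loss[i] += zero_one(a, y_t)

regret = learner_loss - min(expert_loss)     # equation (1)
```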


Mistake Bound Model

• Define the maximum number of mistakes a learning algorithm L makes to learn a concept c over any set of examples (until it's perfect):

M_L(c) = \max_{x_1, \ldots, x_T} |\text{mistakes}(L, c)| \quad (2)

• For any concept class C, this is the max over concepts c:

M_L(C) = \max_{c \in C} M_L(c) \quad (3)

• In the expert advice case, this assumes some expert matches the concept (realizable)


Halving Algorithm

H_1 ← H
for t ← 1 ... T do
    Receive x_t
    ŷ_t ← Majority(H_t, a_t, x_t)
    Receive y_t
    if ŷ_t ≠ y_t then
        H_{t+1} ← {a ∈ H_t : a(x_t) = y_t}
return H_{T+1}

Algorithm 1: The Halving Algorithm (Mitchell, 1997)
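A minimal Python sketch of the same algorithm, assuming hypotheses are callables mapping x to {0, 1}; the hypothesis class and data stream below are illustrative, not from the slides.

```python
# Sketch of the Halving Algorithm: majority vote over the version space,
# which is cut down (by at least half) on every mistake.
def halving(hypotheses, stream):
    H = list(hypotheses)
    mistakes = 0
    for x_t, y_t in stream:
        votes = sum(h(x_t) for h in H)
        y_hat = int(2 * votes >= len(H))           # majority prediction
        if y_hat != y_t:
            mistakes += 1
            H = [h for h in H if h(x_t) == y_t]    # keep consistent hypotheses only
    return H, mistakes

# Toy realizable example: thresholds on integers, target concept is "x >= 2".
hypotheses = [lambda x, k=k: int(x >= k) for k in range(5)]
stream = [(x, int(x >= 2)) for x in [0, 4, 2, 1, 3]]
H, m = halving(hypotheses, stream)   # in the realizable case, m <= lg|H|
```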


Halving Algorithm Bound (Littlestone, 1988)

• For a finite hypothesis set H:

M_{\text{Halving}}(H) \le \lg |H| \quad (4)

• After each mistake, the hypothesis set is reduced by at least half

• Consider the optimal mistake bound opt(H). Then

\text{VC}(H) \le \text{opt}(H) \le M_{\text{Halving}}(H) \le \lg |H| \quad (5)

• For a fully shattered set, an adversary can form a binary tree of mistakes with height VC(H)

• What about the non-realizable case?


Weighted Majority (Littlestone and Warmuth, 1994)

for i ← 1 ... N do
    w_{1,i} ← 1
for t ← 1 ... T do
    Receive x_t
    ŷ_t ← 1[ Σ_{i : a_{t,i} = 1} w_{t,i} ≥ Σ_{i : a_{t,i} = 0} w_{t,i} ]
    Receive y_t
    if ŷ_t ≠ y_t then
        for i ← 1 ... N do
            if a_{t,i} ≠ y_t then
                w_{t+1,i} ← β w_{t,i}
            else
                w_{t+1,i} ← w_{t,i}
return w_{T+1}

• Weights for every expert

• Classification goes in favor of the side with higher total weight (y ∈ {0, 1})

• Experts that are wrong get their weights decreased (β ∈ [0, 1))

• If you're right, your weight stays unchanged


Weighted Majority

• Let m_t be the number of mistakes made by WM until time t

• Let m*_t be the best expert's number of mistakes until time t

m_t \le \frac{\log N + m^*_t \log\frac{1}{\beta}}{\log\frac{2}{1+\beta}} \quad (6)

• Thus, the mistake bound is O(log N) plus a constant (depending on β) times the best expert's mistakes

• The Halving algorithm is the special case β = 0


Proof: Potential Function

• The potential function is the sum of all weights:

\Phi_t \equiv \sum_i w_{t,i} \quad (7)

• We'll create a sandwich of upper and lower bounds

• For any expert i, we have the lower bound

\Phi_t \ge w_{t,i} = \beta^{m_{t,i}} \quad (8)

since weights are nonnegative (so \sum_i w_{t,i} \ge w_{t,i}) and each of expert i's errors multiplies its weight by β.


Proof: Potential Function (Upper Bound)

• If the algorithm makes an error at round t, then at least half of the total weight was on experts that were wrong (and gets multiplied by β), while at most half was on experts that were right:

\Phi_{t+1} \le \frac{\Phi_t}{2} + \frac{\beta\,\Phi_t}{2} = \left[\frac{1+\beta}{2}\right] \Phi_t \quad (9)

• Initially the potential function sums all weights, which start at 1:

\Phi_1 = N \quad (10)

• After m_T mistakes over T rounds:

\Phi_T \le \left[\frac{1+\beta}{2}\right]^{m_T} N \quad (11)


Weighted Majority Proof

• Put the two inequalities together, using the best expert:

\beta^{m^*_T} \le \Phi_T \le \left[\frac{1+\beta}{2}\right]^{m_T} N \quad (12)

• Take the log of both sides:

m^*_T \log \beta \le \log N + m_T \log\left[\frac{1+\beta}{2}\right] \quad (13)

• Solve for m_T:

m_T \le \frac{\log N + m^*_T \log\frac{1}{\beta}}{\log\left[\frac{2}{1+\beta}\right]} \quad (14)


Weighted Majority Recap

• Simple algorithm

• No harsh assumptions (handles the non-realizable case)

• The bound depends on the best expert

• Downside: can take a long time to do well in the worst case (but okay in practice)

• Solution: Randomization


Plan

Experts

Perceptron Algorithm

Online Perceptron for Structure Learning


Perceptron Algorithm

• Online algorithm for classification

• Very similar to logistic regression (but 0/1 loss)

• But what can we prove?


Perceptron Algorithm

w_1 ← 0
for t ← 1 ... T do
    Receive x_t
    ŷ_t ← sgn(w_t · x_t)
    Receive y_t
    if ŷ_t ≠ y_t then
        w_{t+1} ← w_t + y_t x_t
    else
        w_{t+1} ← w_t
return w_{T+1}

Algorithm 2: Perceptron Algorithm (Rosenblatt, 1958)
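A minimal NumPy sketch of Algorithm 2 with labels in {−1, +1}; the data at the bottom are an illustrative, linearly separable toy set, not from the slides.

```python
import numpy as np

# Perceptron: predict with sgn(w . x), update w <- w + y x only on mistakes.
def perceptron(X, y, epochs=1):
    w = np.zeros(X.shape[1])                     # w_1 = 0
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            y_hat = 1 if w @ x_t >= 0 else -1    # sgn(w_t . x_t), taking sgn(0) = +1
            if y_hat != y_t:
                w = w + y_t * x_t                # mistake-driven update
    return w

# Toy separable data (illustrative only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y, epochs=5)
```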


Objective Function

• Optimizes

\frac{1}{T} \sum_t \max\left(0, -y_t (\vec{w} \cdot \vec{x}_t)\right) \quad (15)

• Convex but not differentiable
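As an aside (not on the slide), the objective in equation (15) and one valid subgradient can be written directly; this small NumPy sketch uses the same {−1, +1} label convention as above.

```python
import numpy as np

# Perceptron objective (15) and a subgradient of one term.
def perceptron_objective(w, X, y):
    margins = y * (X @ w)                        # y_t (w . x_t) for every t
    return np.mean(np.maximum(0.0, -margins))    # (1/T) sum_t max(0, -y_t (w . x_t))

def perceptron_subgradient(w, x_t, y_t):
    # On a mistake (or exactly on the boundary), -y_t x_t is a valid subgradient;
    # stepping against it with step size 1 recovers the update w <- w + y_t x_t.
    return -y_t * x_t if y_t * (w @ x_t) <= 0 else np.zeros_like(w)
```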


Margin and Errors

• If there’s a good margin ρ,you’ll converge quickly

• Whenever you se an error, youmove the classifier to get itright

• Convergence only possible ifdata are separable


How many errors does Perceptron make?

• If your data lie in a ball of radius R and there is a margin

\rho \le \frac{y_t (\vec{v} \cdot \vec{x}_t)}{\|\vec{v}\|} \quad (16)

for some \vec{v} and all t, then the number of mistakes is bounded by R²/ρ²

• The places where you make an error are support vectors

• Convergence can be slow for small margins
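As a quick sanity check of the bound: with R = 1 and margin ρ = 0.1, the perceptron makes at most R²/ρ² = 100 mistakes, no matter how many rounds it runs.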


Plan

Experts

Perceptron Algorithm

Online Perceptron for Structure Learning


Binary to Structure


Generic Perceptron

• The perceptron is one of the simplest machine learning algorithms

• Online learning: one example at a time

• Learning by doing (sketched below):
◦ find the best output under the current weights
◦ update the weights on mistakes
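To make "find the best output, update on mistakes" concrete, here is a minimal structured perceptron sketch; the feature map, the brute-force candidate enumeration, and the toy example are all hypothetical stand-ins (a real tagger would compute the argmax with dynamic programming, e.g., Viterbi).

```python
from collections import defaultdict
from itertools import product

# Structured perceptron step: argmax over candidate outputs under the current
# weights, then update toward the gold features and away from the prediction.
def feat(x, y):
    """Joint feature map Phi(x, y) as a sparse dict (illustrative emission features)."""
    return {(x_i, y_i): 1.0 for x_i, y_i in zip(x, y)}

def score(w, x, y):
    return sum(w[f] * v for f, v in feat(x, y).items())

def predict(w, x, candidates):
    return max(candidates, key=lambda y: score(w, x, y))   # best output under w

def update(w, x, y_gold, y_hat):
    if y_hat != y_gold:                                     # update only on mistakes
        for f, v in feat(x, y_gold).items():
            w[f] += v
        for f, v in feat(x, y_hat).items():
            w[f] -= v

# Toy usage: a two-word sentence, candidate tag sequences enumerated by brute force.
w = defaultdict(float)
x, y_gold = ("the", "dog"), ("DT", "NN")
candidates = list(product(["DT", "NN"], repeat=len(x)))
y_hat = predict(w, x, candidates)
update(w, x, y_gold, y_hat)
```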


Structured Perceptron


Perceptron Algorithm


POS Example


What must be true?

• Finding the highest-scoring structure must be really fast (you'll do it often)

• Requires some sort of dynamic programming algorithm

• For tagging: features must be local to y (but can be global to x)


Averaging is Good


Smoothing

• Must include subset templates for features

• For example, if you have the feature (t0, w0, w−1), you must also have:
◦ (t0, w0); (t0, w−1); (w0, w−1)
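A tiny illustration (not from the slides) of generating the sub-templates of a conjunction feature; the tuple-of-strings representation is just an assumed convention.

```python
from itertools import combinations

# All non-empty proper sub-templates of a conjunction feature template.
def sub_templates(template):
    subs = []
    for r in range(1, len(template)):
        subs.extend(combinations(template, r))
    return subs

print(sub_templates(("t0", "w0", "w-1")))
# -> singletons plus the pairs (t0, w0), (t0, w-1), (w0, w-1) from the slide
```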


Inexact Search?

• Sometimes search is too hard

• So we use beam search instead

• How to create algorithms that respect this relaxation: track when the right answer falls off the beam


Wrapup

• Structured prediction: when one label isn’t enough

• Generative models can help when you don't have a lot of data

• Discriminative models are state of the art

• More in Natural Language Processing (at least when I teach it)
