predicting sequences: structured perceptron - svivek · notes on structured perceptron • mistake...

41
CS 6355: Structured Prediction Predicting Sequences: Structured Perceptron 1

Upload: vuongthu

Post on 24-Jul-2019

221 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

CS6355:StructuredPrediction

PredictingSequences:StructuredPerceptron

1

Page 2: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

ConditionalRandomFieldssummary

• Anundirectedgraphicalmodel– Decomposethescoreoverthestructureintoacollectionoffactors– Eachfactorassignsascoretoassignmentoftherandomvariablesitis

connectedto

• Trainingandprediction– Finalpredictionviaargmax wTÁ(x,y)– Trainbymaximum(regularized)likelihood

• Connectionstoothermodels– Effectivelyalinearclassifier– Ageneralizationoflogisticregressiontostructures– AninstanceofMarkovRandomField,withsomerandomvariables

observed• Wewillseethissoon

2

Page 3: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

Global features

Thefeaturefunctiondecomposesoverthesequence

3

y0 y1 y2 y3

x

wTÁ(x,y0,y1) wTÁ(x,y1,y2) wTÁ(x,y2,y3)

Page 4: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

Outline

• Sequencemodels

• HiddenMarkovmodels

– InferencewithHMM– Learning

• ConditionalModelsandLocalClassifiers

• Globalmodels– ConditionalRandomFields

– StructuredPerceptronforsequences

4

Page 5: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalsoalinearclassifierConsidertheHMM

5

Page 6: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalsoalinearclassifierConsidertheHMM

6

Logprobabilities=transitionscores+emissionscores

Page 7: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalsoalinearclassifierConsidertheHMM

7

Defineindicators:Iz =1ifz istrue;else0

Page 8: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalsoalinearclassifierConsidertheHMM

8

Defineindicators:Iz =1ifz istrue;else0

Page 9: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalsoalinearclassifierConsidertheHMM

9

Defineindicators:Iz =1ifz istrue;else0

Page 10: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalsoalinearclassifierConsidertheHMM

10

Indicators:Iz =1ifz istrue;else0

Page 11: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalsoalinearclassifierConsidertheHMM

• Orequivalently

• Thisisalinearfunction– logPtermsaretheweights;countsandindicatorsarefeatures– CanbewrittenaswTÁ(x,y)andaddmorefeatures

11

Indicators:Iz =1ifz istrue;else0

Page 12: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalsoalinearclassifierConsidertheHMM

• Orequivalently

• Thisisalinearfunction– logPtermsaretheweights;countsandindicatorsarefeatures– CanbewrittenaswTÁ(x,y)andaddmorefeatures

12

Indicators:Iz =1ifz istrue;else0

Page 13: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

13

The ate thedog homework

Det Verb DetNoun Noun

Page 14: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

14

The ate thedog homework

Det Verb DetNoun Noun

Consider

Page 15: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

15

The ate thedog homework

Det Verb DetNoun Noun

+

Consider

Transitionscores Emissionscores

Page 16: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

16

The ate thedog homework

Det Verb DetNoun Noun

logP(Det! Noun)£ 2

+logP(Noun! Verb) £ 1

+logP(Verb! Det) £ 1

+

Consider

Emissionscores

Page 17: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

17

The ate thedog homework

Det Verb DetNoun Noun

logP(Det! Noun)£ 2

+logP(Noun! Verb) £ 1

+logP(Verb! Det) £ 1

logP(The|Det) £ 1

+logP(dog|Noun) £ 1

+logP(ate|Verb) £ 1

+logP(the|Det) £ 1

+logP(homework|Noun) £ 1

+

Consider

Page 18: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

18

The ate thedog homework

Det Verb DetNoun Noun

logP(Det! Noun)£ 2

+logP(Noun! Verb) £ 1

+logP(Verb! Det) £ 1

logP(The|Det) £ 1

+logP(dog|Noun) £ 1

+logP(ate|Verb) £ 1

+logP(the|Det) £ 1

+logP(homework|Noun) £ 1

+

w:Parametersofthemodel

Consider

Page 19: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

19

The ate thedog homework

Det Verb DetNoun Noun

logP(Det! Noun)£ 2

+logP(Noun! Verb) £ 1

+logP(Verb! Det) £ 1

logP(The|Det) £ 1

+logP(dog|Noun) £ 1

+logP(ate|Verb) £ 1

+logP(the|Det) £ 1

+logP(homework|Noun) £ 1

+

Á(x,y):Propertiesofthisoutputandtheinput

Consider

Page 20: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

20

The ate thedog homework

Det Verb DetNoun Noun

Á(x,y):Propertiesofthisoutputandtheinputw:Parameters

ofthemodel

Consider

log 𝑃(𝐷𝑒𝑡 → 𝑁𝑜𝑢𝑛)log 𝑃(𝑁𝑜𝑢𝑛 → 𝑉𝑒𝑟𝑏)log 𝑃(𝑉𝑒𝑟𝑏 → 𝐷𝑒𝑡)log 𝑃 𝑇ℎ𝑒 𝐷𝑒𝑡)log 𝑃 𝑑𝑜𝑔 𝑁𝑜𝑢𝑛)log 𝑃 𝑎𝑡𝑒 𝑉𝑒𝑟𝑏)log 𝑃 𝑡ℎ𝑒 𝐷𝑒𝑡

log 𝑃 ℎ𝑜𝑚𝑒𝑤𝑜𝑟𝑘 𝑁𝑜𝑢𝑛)

21111111

Page 21: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

HMMisalinearclassifier:Anexample

21

The ate thedog homework

Det Verb DetNoun Noun

Á(x,y):Propertiesofthisoutputandtheinputw:Parameters

ofthemodel

logP(x,y)=Alinearscoringfunction=wTÁ(x,y)

Consider

log 𝑃(𝐷𝑒𝑡 → 𝑁𝑜𝑢𝑛)log 𝑃(𝑁𝑜𝑢𝑛 → 𝑉𝑒𝑟𝑏)log 𝑃(𝑉𝑒𝑟𝑏 → 𝐷𝑒𝑡)log 𝑃 𝑇ℎ𝑒 𝐷𝑒𝑡)log 𝑃 𝑑𝑜𝑔 𝑁𝑜𝑢𝑛)log 𝑃 𝑎𝑡𝑒 𝑉𝑒𝑟𝑏)log 𝑃 𝑡ℎ𝑒 𝐷𝑒𝑡

log 𝑃 ℎ𝑜𝑚𝑒𝑤𝑜𝑟𝑘 𝑁𝑜𝑢𝑛)

21111111

Page 22: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

TowardsstructuredPerceptron

1. HMMisalinearclassifier– Canwetreatitasanylinearclassifierfortraining?– Ifso,wecouldaddadditionalfeaturesthatareglobalproperties

• Aslongastheoutputcanbedecomposedforeasyinference

2. TheViterbialgorithmcalculatesmaxwTÁ(x,y)Viterbionlycaresaboutscorestostructures(notnecessarilynormalized)

3. Wecouldpushthelearningalgorithmtotrainforun-normalizedscores– Ifweneednormalization,wecouldalwaysnormalizebycomputing

exponentiating anddividingbyZ– Thatis,thelearningalgorithmcaneffectivelyjustfocusonthescoreofy

foraparticularx– Trainadiscriminativemodel!

22

Page 23: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

TowardsstructuredPerceptron

1. HMMisalinearclassifier– Canwetreatitasanylinearclassifierfortraining?– Ifso,wecouldaddadditionalfeaturesthatareglobalproperties

• Aslongastheoutputcanbedecomposedforeasyinference

2. TheViterbialgorithmcalculatesmaxwTÁ(x,y)Viterbionlycaresaboutscorestostructures(notnecessarilynormalized)

3. Wecouldpushthelearningalgorithmtotrainforun-normalizedscores– Ifweneednormalization,wecouldalwaysnormalizebycomputing

exponentiating anddividingbyZ– Thatis,thelearningalgorithmcaneffectivelyjustfocusonthescoreofy

foraparticularx– Trainadiscriminativemodel!

23

Page 24: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

TowardsstructuredPerceptron

1. HMMisalinearclassifier– Canwetreatitasanylinearclassifierfortraining?– Ifso,wecouldaddadditionalfeaturesthatareglobalproperties

• Aslongastheoutputcanbedecomposedforeasyinference

2. TheViterbialgorithmcalculatesmaxwTÁ(x,y)Viterbionlycaresaboutscorestostructures(notnecessarilynormalized)

3. Wecouldpushthelearningalgorithmtotrainforun-normalizedscores– Ifweneednormalization,wecouldalwaysnormalizebyexponentiating

anddividingbyZ– Thatis,thelearningalgorithmcaneffectivelyjustfocusonthescoreofy

foraparticularx– Trainadiscriminativemodelinsteadofthegenerativeone!

24

Page 25: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

StructuredPerceptronalgorithm

GivenatrainingsetD={(x,y)}1. Initializew =02 <n

2. Forepoch=1…T:1. Foreachtrainingexample(x,y)2 D:

1. Predict y’ =argmaxy’ wTÁ(x,y’)2. Ify ≠ y’,updatewà w +learningRate (Á(x,y)- Á(x,y’))

3. Returnw

Prediction:argmaxywTÁ(x,y)

25

StructuredPerceptronupdate

Page 26: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

StructuredPerceptronalgorithm

GivenatrainingsetD={(x,y)}1. Initializew =02 <n

2. Forepoch=1…T:1. Foreachtrainingexample(x,y)2 D:

1. Predict y’ =argmaxy’ wTÁ(x,y’)2. Ify ≠ y’,updatewà w +learningRate (Á(x,y)- Á(x,y’))

3. Returnw

Prediction:argmaxywTÁ(x,y)

26

Page 27: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

StructuredPerceptronalgorithm

GivenatrainingsetD={(x,y)}1. Initializew =02 <n

2. Forepoch=1…T:1. Foreachtrainingexample(x,y)2 D:

1. Predict y’ =argmaxy’ wTÁ(x,y’)2. Ify ≠ y’,updatewà w +learningRate (Á(x,y)- Á(x,y’))

3. Returnw

Prediction:argmaxywTÁ(x,y)

27

Updateonlyonanerror.Perceptronisanmistake-drivenalgorithm.Ifthereisamistake,promotey anddemotey’

Page 28: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

StructuredPerceptronalgorithm

GivenatrainingsetD={(x,y)}1. Initializew =02 <n

2. Forepoch=1…T:1. Foreachtrainingexample(x,y)2 D:

1. Predict y’ =argmaxy’ wTÁ(x,y’)2. Ify ≠ y’,updatewà w +learningRate (Á(x,y)- Á(x,y’))

3. Returnw

Prediction:argmaxywTÁ(x,y)

28

Tisahyperparameter tothealgorithm

Updateonlyonanerror.Perceptronisanmistake-drivenalgorithm.Ifthereisamistake,promotey anddemotey’

Page 29: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

StructuredPerceptronalgorithm

GivenatrainingsetD={(x,y)}1. Initializew =02 <n

2. Forepoch=1…T:1. Foreachtrainingexample(x,y)2 D:

1. Predict y’ =argmaxy’ wTÁ(x,y’)2. Ify ≠ y’,updatewà w +learningRate (Á(x,y)- Á(x,y’))

3. Returnw

Prediction:argmaxywTÁ(x,y)

29

Tisahyperparameter tothealgorithm

Inpractice,goodtoshuffleDbeforetheinnerloop

Updateonlyonanerror.Perceptronisanmistake-drivenalgorithm.Ifthereisamistake,promotey anddemotey’

Page 30: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

StructuredPerceptronalgorithm

GivenatrainingsetD={(x,y)}1. Initializew =02 <n

2. Forepoch=1…T:1. Foreachtrainingexample(x,y)2 D:

1. Predict y’ =argmaxy’ wTÁ(x,y’)2. Ify ≠ y’,updatewà w +learningRate (Á(x,y)- Á(x,y’))

3. Returnw

Prediction:argmaxywTÁ(x,y)

30

Tisahyperparameter tothealgorithm

Inpractice,goodtoshuffleDbeforetheinnerloop

Updateonlyonanerror.Perceptronisanmistake-drivenalgorithm.Ifthereisamistake,promotey anddemotey’

Inferenceintrainingloop!

Page 31: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

Notesonstructuredperceptron

• Mistakeboundforseparabledata,justlikeperceptron

• Inpractice,useaveragingforbettergeneralization– Initializea =0– Aftereachstep,whetherthereisanupdateornot,aà w +a

• Note,westillcheckformistakeusingw nota– Returnaattheendinsteadof w

Exercise:Optimizethisforperformance– modifya onlyonerrors

• Globalupdate– Oneweightvectorforentiresequence

• Notforeachposition– Samealgorithmcanbederivedviaconstraintclassification

• Createabinaryclassificationdatasetandrunperceptron

31

Page 32: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

StructuredPerceptronwithaveraging

GivenatrainingsetD={(x,y)}1. Initializew =02 <n,a =02 <n

2. Forepoch=1…T:1. Foreachtrainingexample(x,y)2 D:

1. Predict y’ =argmaxy’ wTÁ(x,y’)2. Ify ≠ y’,updatewà w +learningRate (Á(x,y)- Á(x,y’))3. Setaà a +w

3. Returna

32

Page 33: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

CRFvs.structuredperceptron

StochasticgradientdescentupdateforCRF– Foratrainingexample(xi,yi)

Structuredperceptron– Foratrainingexample(xi,yi)

33

Caveat:AddingregularizationwillchangetheCRFupdate,averagingchangestheperceptronupdate

Expectationvs max

Page 34: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

ThelayofthelandHMM:Agenerativemodel,assignsprobabilitiestosequences

34

Page 35: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

ThelayofthelandHMM:Agenerativemodel,assignsprobabilitiestosequences

35

Tworoadsdiverge

Page 36: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

ThelayofthelandHMM:Agenerativemodel,assignsprobabilitiestosequences

• HiddenMarkovModelsareactuallyjustlinear classifiers

• Don’treallycarewhetherwearepredictingprobabilities.Weareassigningscorestoafulloutputforagiveninput(likemulticlass)

• Generalizealgorithmsforlinearclassifiers.Sophisticatedmodelsthatcanusearbitraryfeatures

• StructuredPerceptronStructuredSVM

36

Tworoadsdiverge

Page 37: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

ThelayofthelandHMM:Agenerativemodel,assignsprobabilitiestosequences

• HiddenMarkovModelsareactuallyjustlinear classifiers

• Don’treallycarewhetherwearepredictingprobabilities.Weareassigningscorestoafulloutputforagiveninput(likemulticlass)

• Generalizealgorithmsforlinearclassifiers.Sophisticatedmodelsthatcanusearbitraryfeatures

• StructuredPerceptronStructuredSVM

• Modelprobabilitiesviaexponentialfunctions.Givesusthelog-linearrepresentation

• Log-probabilitiesforsequencesforagiveninput

• Learnbymaximizinglikelihood.Sophisticatedmodelsthatcanusearbitraryfeatures

• ConditionalRandomfield

37

Tworoadsdiverge

Page 38: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

ThelayofthelandHMM:Agenerativemodel,assignsprobabilitiestosequences

• HiddenMarkovModelsareactuallyjustlinear classifiers

• Don’treallycarewhetherwearepredictingprobabilities.Weareassigningscorestoafulloutputforagiveninput(likemulticlass)

• Generalizealgorithmsforlinearclassifiers.Sophisticatedmodelsthatcanusearbitraryfeatures

• StructuredPerceptronStructuredSVM

• Modelprobabilitiesviaexponentialfunctions.Givesusthelog-linearrepresentation

• Log-probabilitiesforsequencesforagiveninput

• Learnbymaximizinglikelihood.Sophisticatedmodelsthatcanusearbitraryfeatures

• ConditionalRandomfield

38

Tworoadsdiverge

Discriminative/Conditionalmodels

Page 39: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

ThelayofthelandHMM:Agenerativemodel,assignsprobabilitiestosequences

• HiddenMarkovModelsareactuallyjustlinear classifiers

• Don’treallycarewhetherwearepredictingprobabilities.Weareassigningscorestoafulloutputforagiveninput(likemulticlass)

• Generalizealgorithmsforlinearclassifiers.Sophisticatedmodelsthatcanusearbitraryfeatures

• StructuredPerceptronStructuredSVM

• Modelprobabilitiesviaexponentialfunctions.Givesusthelog-linearrepresentation

• Log-probabilitiesforsequencesforagiveninput

• Learnbymaximizinglikelihood.Sophisticatedmodelsthatcanusearbitraryfeatures

• ConditionalRandomfield

ApplicablebeyondsequencesEventually,similarobjectiveminimizedwithdifferentlossfunctions 39

Tworoadsdiverge

Discriminative/Conditionalmodels

Page 40: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

ThelayofthelandHMM:Agenerativemodel,assignsprobabilitiestosequences

• HiddenMarkovModelsareactuallyjustlinear classifiers

• Don’treallycarewhetherwearepredictingprobabilities.Weareassigningscorestoafulloutputforagiveninput(likemulticlass)

• Generalizealgorithmsforlinearclassifiers.Sophisticatedmodelsthatcanusearbitraryfeatures

• StructuredPerceptronStructuredSVM

• Modelprobabilitiesviaexponentialfunctions.Givesusthelog-linearrepresentation

• Log-probabilitiesforsequencesforagiveninput

• Learnbymaximizinglikelihood.Sophisticatedmodelsthatcanusearbitraryfeatures

• ConditionalRandomfield

ApplicablebeyondsequencesEventually,similarobjectiveminimizedwithdifferentlossfunctions 40

Tworoadsdiverge

Comingsoon…

Discriminative/Conditionalmodels

Page 41: Predicting Sequences: Structured Perceptron - svivek · Notes on structured perceptron • Mistake bound for separable data, just like perceptron • In practice, use averaging for

Sequencemodels:Summary

• Goal:Predictanoutputsequencegiveninputsequence

• HiddenMarkovModel

• Inference– PredictviaViterbialgorithm

• Conditionalmodels/discriminativemodels– Localapproaches(noinferenceduringtraining)

• MEMM,conditionalMarkovmodel– Globalapproaches (inferenceduringtraining)

• CRF,structuredperceptron

• Tothink– Whatarethepartsinasequencemodel?– Howiseachmodelscoringtheseparts?

41

Samedichotomyformoregeneralstructures

Predictionisnotalwaystractableforgeneralstructures