CS6355: Structured Prediction

Predicting Sequences: Structured Perceptron
Conditional Random Fields: summary

• An undirected graphical model
  – Decompose the score over the structure into a collection of factors
  – Each factor assigns a score to an assignment of the random variables it is connected to

• Training and prediction
  – Final prediction via argmax_y w^T φ(x, y)
  – Train by maximum (regularized) likelihood

• Connections to other models
  – Effectively a linear classifier
  – A generalization of logistic regression to structures
  – An instance of a Markov Random Field, with some random variables observed
    • We will see this soon
Global features

The feature function decomposes over the sequence.

[Figure: a chain of labels y0, y1, y2, y3 over the input x, with factor scores w^T φ(x, y0, y1), w^T φ(x, y1, y2), and w^T φ(x, y2, y3).]
Outline

• Sequence models
• Hidden Markov models
  – Inference with HMMs
  – Learning
• Conditional models and local classifiers
• Global models
  – Conditional Random Fields
  – Structured Perceptron for sequences
HMM is also a linear classifier

Consider the HMM.

• Log probabilities = transition scores + emission scores

• Define indicators: I_z = 1 if z is true; else 0

• Or equivalently, rewrite the sums of log probabilities using the indicator counts (see the sketch below)

• This is a linear function
  – The log P terms are the weights; counts and indicators are the features
  – It can be written as w^T φ(x, y), and we can add more features
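A minimal reconstruction of the argument, assuming a first-order HMM with transition probabilities P(y_t | y_{t-1}) and emission probabilities P(x_t | y_t); the summation indices s, s' range over tags and a over words, and this notation beyond the indicators I_z is introduced here for illustration:

```latex
% Joint log-probability of a first-order HMM: transition scores + emission scores
\log P(x, y) = \sum_t \log P(y_t \mid y_{t-1}) + \sum_t \log P(x_t \mid y_t)

% Rewrite each sum with the indicators I_z, grouping identical terms into counts
\log P(x, y)
  = \sum_{s, s'} \log P(s' \mid s) \Big( \sum_t I_{[y_{t-1} = s,\; y_t = s']} \Big)
  + \sum_{s, a} \log P(a \mid s) \Big( \sum_t I_{[y_t = s,\; x_t = a]} \Big)

% The log P terms are the weights w; the indicator counts are the features \phi(x, y)
\log P(x, y) = w^T \phi(x, y)
```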
HMM is a linear classifier: An example

Consider log P(x, y) for the sentence and tag sequence

  The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

Transition scores:
    log P(Det → Noun) × 2
  + log P(Noun → Verb) × 1
  + log P(Verb → Det) × 1

Emission scores:
    log P(The | Det) × 1
  + log P(dog | Noun) × 1
  + log P(ate | Verb) × 1
  + log P(the | Det) × 1
  + log P(homework | Noun) × 1

Collect the log-probability terms into w (the parameters of the model) and the counts into φ(x, y) (properties of this output and the input):

  w = [log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det),
       log P(The | Det), log P(dog | Noun), log P(ate | Verb),
       log P(the | Det), log P(homework | Noun)]

  φ(x, y) = [2, 1, 1, 1, 1, 1, 1, 1]

log P(x, y) = a linear scoring function = w^T φ(x, y)
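A small sketch of how the feature vector on this slide is just a bag of counts. The sentence and tags are from the example above; the helper name and the weight values are illustrative placeholders, not from the slides:

```python
from collections import Counter

def hmm_features(words, tags):
    """Count transition (tag -> tag) and emission (word, tag) events.
    These counts are exactly the phi(x, y) vector on the slide."""
    phi = Counter()
    for prev, curr in zip(tags, tags[1:]):
        phi[("trans", prev, curr)] += 1
    for word, tag in zip(words, tags):
        phi[("emit", word, tag)] += 1
    return phi

words = ["The", "dog", "ate", "the", "homework"]
tags = ["Det", "Noun", "Verb", "Det", "Noun"]

phi = hmm_features(words, tags)
# phi[("trans", "Det", "Noun")] == 2; every other active feature has count 1

# With w holding the corresponding log-probabilities (placeholder values here),
# log P(x, y) is just the dot product w . phi(x, y):
w = {("trans", "Det", "Noun"): -0.5,   # stands in for log P(Det -> Noun)
     ("emit", "dog", "Noun"): -2.3}    # stands in for log P(dog | Noun), etc.
score = sum(w.get(feat, 0.0) * count for feat, count in phi.items())
```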
Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference
2. The Viterbi algorithm calculates max_y w^T φ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)
3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model instead of the generative one!
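Point 2 relies on Viterbi working with arbitrary (unnormalized) per-position scores. A minimal sketch under that assumption; the emission_score and transition_score callables stand in for the relevant components of w^T φ and are not part of the slides:

```python
import numpy as np

def viterbi(num_words, labels, emission_score, transition_score):
    """argmax over tag sequences of sum_t [emission_score(t, y_t) + transition_score(y_{t-1}, y_t)].
    The scores can be any real numbers; nothing needs to be a normalized probability."""
    n, L = num_words, len(labels)
    best = np.full((n, L), -np.inf)    # best score of a prefix ending in label j at position t
    back = np.zeros((n, L), dtype=int)  # back-pointer to the best previous label
    for j in range(L):
        best[0, j] = emission_score(0, labels[j])
    for t in range(1, n):
        for j in range(L):
            scores = [best[t - 1, i] + transition_score(labels[i], labels[j]) for i in range(L)]
            back[t, j] = int(np.argmax(scores))
            best[t, j] = scores[back[t, j]] + emission_score(t, labels[j])
    # Follow back-pointers from the best final label
    y = [int(np.argmax(best[n - 1]))]
    for t in range(n - 1, 0, -1):
        y.append(back[t, y[-1]])
    return [labels[j] for j in reversed(y)]
```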
Structured Perceptron algorithm

Given a training set D = {(x, y)}
1. Initialize w = 0 ∈ ℝ^n
2. For epoch = 1 … T:
   1. For each training example (x, y) ∈ D:
      1. Predict y' = argmax_{y'} w^T φ(x, y')
      2. If y ≠ y', update w ← w + learningRate · (φ(x, y) − φ(x, y'))   [the structured Perceptron update]
3. Return w

Prediction: argmax_y w^T φ(x, y)

• Update only on an error. The Perceptron is a mistake-driven algorithm: if there is a mistake, promote y and demote y'.
• T is a hyperparameter of the algorithm.
• In practice, it is good to shuffle D before the inner loop.
• Note that there is inference in the training loop!
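A sketch of the algorithm above in code. It assumes a feature function phi(x, y) that returns a sparse count vector and an inference routine argmax_y (e.g., the Viterbi sketch earlier) that computes the highest-scoring output; both names are illustrative:

```python
from collections import Counter

def structured_perceptron(data, phi, argmax_y, num_epochs, learning_rate=1.0):
    """data: list of (x, y) pairs; phi(x, y): Counter of features;
    argmax_y(w, x): returns argmax_{y'} of w . phi(x, y'), e.g., via Viterbi."""
    w = Counter()                                   # w = 0
    for epoch in range(num_epochs):                 # T is a hyperparameter
        # In practice, good to shuffle data here
        for x, y in data:
            y_pred = argmax_y(w, x)                 # inference inside the training loop!
            if y_pred != y:                         # update only on a mistake
                for feat, count in phi(x, y).items():        # promote the gold structure y
                    w[feat] += learning_rate * count
                for feat, count in phi(x, y_pred).items():   # demote the prediction y'
                    w[feat] -= learning_rate * count
    return w
```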
Notes on structured perceptron

• Mistake bound for separable data, just like the perceptron

• In practice, use averaging for better generalization
  – Initialize a = 0
  – After each step, whether there is an update or not, a ← w + a
    • Note: we still check for mistakes using w, not a
  – Return a at the end instead of w
  – Exercise: optimize this for performance – modify a only on errors

• Global update
  – One weight vector for the entire sequence, not one for each position
  – The same algorithm can be derived via constraint classification
    • Create a binary classification dataset and run the perceptron
Structured Perceptron with averaging

Given a training set D = {(x, y)}
1. Initialize w = 0 ∈ ℝ^n, a = 0 ∈ ℝ^n
2. For epoch = 1 … T:
   1. For each training example (x, y) ∈ D:
      1. Predict y' = argmax_{y'} w^T φ(x, y')
      2. If y ≠ y', update w ← w + learningRate · (φ(x, y) − φ(x, y'))
      3. Set a ← a + w
3. Return a
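The averaged variant differs from the earlier sketch only in maintaining the running sum a and returning it instead of w; a minimal sketch with the same hypothetical phi and argmax_y as before:

```python
from collections import Counter

def averaged_structured_perceptron(data, phi, argmax_y, num_epochs, learning_rate=1.0):
    w, a = Counter(), Counter()                    # w = 0, a = 0
    for epoch in range(num_epochs):
        for x, y in data:
            y_pred = argmax_y(w, x)                # mistakes are still checked with w, not a
            if y_pred != y:
                for feat, count in phi(x, y).items():
                    w[feat] += learning_rate * count
                for feat, count in phi(x, y_pred).items():
                    w[feat] -= learning_rate * count
            for feat, val in w.items():            # a <- a + w, whether or not there was an update
                a[feat] += val                     # (the slide's exercise: do this only on errors)
    return a                                       # return the averaged weights instead of w
```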
CRF vs. structured perceptron

• Stochastic gradient descent update for a CRF
  – For a training example (x_i, y_i): see the reconstruction below
• Structured perceptron update
  – For a training example (x_i, y_i): see the reconstruction below
• The difference between the two updates: an expectation vs. a max

Caveat: adding regularization changes the CRF update, and averaging changes the perceptron update.
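The two updates can be written side by side. This is a reconstruction of their standard forms (with learning rate η), not a verbatim copy of the slide:

```latex
% CRF: stochastic gradient ascent step on the log-likelihood of (x_i, y_i)
w \leftarrow w + \eta \Big( \phi(x_i, y_i)
      - \mathbb{E}_{y \sim P(y \mid x_i; w)} \big[ \phi(x_i, y) \big] \Big)

% Structured perceptron: update on a mistake for (x_i, y_i)
w \leftarrow w + \eta \Big( \phi(x_i, y_i)
      - \phi\big(x_i, \; \arg\max_{y} w^T \phi(x_i, y)\big) \Big)
```

The second term is the only difference: the CRF subtracts the expected features under the current model, while the perceptron subtracts the features of the single highest-scoring output: expectation vs. max.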
The lay of the land

HMM: a generative model that assigns probabilities to sequences. From here, two roads diverge.

One road (Structured Perceptron, Structured SVM):
• Hidden Markov Models are actually just linear classifiers
• We don't really care whether we are predicting probabilities; we are assigning scores to a full output for a given input (like multiclass classification)
• Generalize algorithms for linear classifiers; sophisticated models that can use arbitrary features

The other road (Conditional Random Fields):
• Model probabilities via exponential functions, which gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood; sophisticated models that can use arbitrary features

Both roads lead to discriminative/conditional models, applicable beyond sequences. Eventually, a similar objective is minimized with different loss functions. Coming soon…
Sequence models: Summary

• Goal: predict an output sequence given an input sequence
• Hidden Markov Model
• Inference
  – Predict via the Viterbi algorithm
• Conditional models / discriminative models
  – Local approaches (no inference during training)
    • MEMM, conditional Markov model
  – Global approaches (inference during training)
    • CRF, structured perceptron
• To think about
  – What are the parts in a sequence model?
  – How is each model scoring these parts?

The same dichotomy holds for more general structures.
Prediction is not always tractable for general structures.