CS6355: Structured Prediction

Predicting Sequences: Structured Perceptron
Conditional Random Fields: summary

• An undirected graphical model
  – Decompose the score over the structure into a collection of factors
  – Each factor assigns a score to an assignment of the random variables it is connected to

• Training and prediction
  – Final prediction via argmax_y w^T φ(x, y)
  – Train by maximum (regularized) likelihood

• Connections to other models
  – Effectively a linear classifier
  – A generalization of logistic regression to structures
  – An instance of a Markov Random Field, with some random variables observed
    • We will see this soon
Global features

The feature function decomposes over the sequence.

[Figure: a chain of labels y0, y1, y2, y3 over the input x, with factor scores w^T φ(x, y0, y1), w^T φ(x, y1, y2), and w^T φ(x, y2, y3).]
Outline

• Sequence models
• Hidden Markov models
  – Inference with HMMs
  – Learning
• Conditional models and local classifiers
• Global models
  – Conditional Random Fields
  – Structured Perceptron for sequences
HMM is also a linear classifier

Consider the HMM.

• Log probabilities = transition scores + emission scores

• Define indicators: I_z = 1 if z is true; else 0

• Or equivalently, rewrite the sums of log probabilities using the indicator counts (see the sketch below)

• This is a linear function
  – The log P terms are the weights; counts and indicators are the features
  – It can be written as w^T φ(x, y), and we can add more features
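A minimal reconstruction of the argument, assuming a first-order HMM with transition probabilities P(y_t | y_{t-1}) and emission probabilities P(x_t | y_t); the summation indices s, s' range over tags and a over words, and this notation beyond the indicators I_z is introduced here for illustration:

```latex
% Joint log-probability of a first-order HMM: transition scores + emission scores
\log P(x, y) = \sum_t \log P(y_t \mid y_{t-1}) + \sum_t \log P(x_t \mid y_t)

% Rewrite each sum with the indicators I_z, grouping identical terms into counts
\log P(x, y)
  = \sum_{s, s'} \log P(s' \mid s) \Big( \sum_t I_{[y_{t-1} = s,\; y_t = s']} \Big)
  + \sum_{s, a} \log P(a \mid s) \Big( \sum_t I_{[y_t = s,\; x_t = a]} \Big)

% The log P terms are the weights w; the indicator counts are the features \phi(x, y)
\log P(x, y) = w^T \phi(x, y)
```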
HMM is a linear classifier: An example

Consider log P(x, y) for the sentence and tag sequence

  The/Det  dog/Noun  ate/Verb  the/Det  homework/Noun

Transition scores:
    log P(Det → Noun) × 2
  + log P(Noun → Verb) × 1
  + log P(Verb → Det) × 1

Emission scores:
    log P(The | Det) × 1
  + log P(dog | Noun) × 1
  + log P(ate | Verb) × 1
  + log P(the | Det) × 1
  + log P(homework | Noun) × 1

Collect the log-probability terms into w (the parameters of the model) and the counts into φ(x, y) (properties of this output and the input):

  w = [log P(Det → Noun), log P(Noun → Verb), log P(Verb → Det),
       log P(The | Det), log P(dog | Noun), log P(ate | Verb),
       log P(the | Det), log P(homework | Noun)]

  φ(x, y) = [2, 1, 1, 1, 1, 1, 1, 1]

log P(x, y) = a linear scoring function = w^T φ(x, y)
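A small sketch of how the feature vector on this slide is just a bag of counts. The sentence and tags are from the example above; the helper name and the weight values are illustrative placeholders, not from the slides:

```python
from collections import Counter

def hmm_features(words, tags):
    """Count transition (tag -> tag) and emission (word, tag) events.
    These counts are exactly the phi(x, y) vector on the slide."""
    phi = Counter()
    for prev, curr in zip(tags, tags[1:]):
        phi[("trans", prev, curr)] += 1
    for word, tag in zip(words, tags):
        phi[("emit", word, tag)] += 1
    return phi

words = ["The", "dog", "ate", "the", "homework"]
tags = ["Det", "Noun", "Verb", "Det", "Noun"]

phi = hmm_features(words, tags)
# phi[("trans", "Det", "Noun")] == 2; every other active feature has count 1

# With w holding the corresponding log-probabilities (placeholder values here),
# log P(x, y) is just the dot product w . phi(x, y):
w = {("trans", "Det", "Noun"): -0.5,   # stands in for log P(Det -> Noun)
     ("emit", "dog", "Noun"): -2.3}    # stands in for log P(dog | Noun), etc.
score = sum(w.get(feat, 0.0) * count for feat, count in phi.items())
```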
Towards structured Perceptron

1. HMM is a linear classifier
   – Can we treat it as any linear classifier for training?
   – If so, we could add additional features that are global properties
     • As long as the output can be decomposed for easy inference
2. The Viterbi algorithm calculates max_y w^T φ(x, y)
   – Viterbi only cares about the scores assigned to structures (not necessarily normalized)
3. We could push the learning algorithm to train for un-normalized scores
   – If we need normalization, we could always normalize by exponentiating and dividing by Z
   – That is, the learning algorithm can effectively just focus on the score of y for a particular x
   – Train a discriminative model instead of the generative one!
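Point 2 relies on Viterbi working with arbitrary (unnormalized) per-position scores. A minimal sketch under that assumption; the emission_score and transition_score callables stand in for the relevant components of w^T φ and are not part of the slides:

```python
import numpy as np

def viterbi(num_words, labels, emission_score, transition_score):
    """argmax over tag sequences of sum_t [emission_score(t, y_t) + transition_score(y_{t-1}, y_t)].
    The scores can be any real numbers; nothing needs to be a normalized probability."""
    n, L = num_words, len(labels)
    best = np.full((n, L), -np.inf)    # best score of a prefix ending in label j at position t
    back = np.zeros((n, L), dtype=int)  # back-pointer to the best previous label
    for j in range(L):
        best[0, j] = emission_score(0, labels[j])
    for t in range(1, n):
        for j in range(L):
            scores = [best[t - 1, i] + transition_score(labels[i], labels[j]) for i in range(L)]
            back[t, j] = int(np.argmax(scores))
            best[t, j] = scores[back[t, j]] + emission_score(t, labels[j])
    # Follow back-pointers from the best final label
    y = [int(np.argmax(best[n - 1]))]
    for t in range(n - 1, 0, -1):
        y.append(back[t, y[-1]])
    return [labels[j] for j in reversed(y)]
```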
Structured Perceptron algorithm

Given a training set D = {(x, y)}
1. Initialize w = 0 ∈ ℝ^n
2. For epoch = 1 … T:
   1. For each training example (x, y) ∈ D:
      1. Predict y' = argmax_{y'} w^T φ(x, y')
      2. If y ≠ y', update w ← w + learningRate · (φ(x, y) − φ(x, y'))   [the structured Perceptron update]
3. Return w

Prediction: argmax_y w^T φ(x, y)

• Update only on an error. The Perceptron is a mistake-driven algorithm: if there is a mistake, promote y and demote y'.
• T is a hyperparameter of the algorithm.
• In practice, it is good to shuffle D before the inner loop.
• Note that there is inference in the training loop!
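A sketch of the algorithm above in code. It assumes a feature function phi(x, y) that returns a sparse count vector and an inference routine argmax_y (e.g., the Viterbi sketch earlier) that computes the highest-scoring output; both names are illustrative:

```python
from collections import Counter

def structured_perceptron(data, phi, argmax_y, num_epochs, learning_rate=1.0):
    """data: list of (x, y) pairs; phi(x, y): Counter of features;
    argmax_y(w, x): returns argmax_{y'} of w . phi(x, y'), e.g., via Viterbi."""
    w = Counter()                                   # w = 0
    for epoch in range(num_epochs):                 # T is a hyperparameter
        # In practice, good to shuffle data here
        for x, y in data:
            y_pred = argmax_y(w, x)                 # inference inside the training loop!
            if y_pred != y:                         # update only on a mistake
                for feat, count in phi(x, y).items():        # promote the gold structure y
                    w[feat] += learning_rate * count
                for feat, count in phi(x, y_pred).items():   # demote the prediction y'
                    w[feat] -= learning_rate * count
    return w
```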
Notes on structured perceptron

• Mistake bound for separable data, just like the perceptron

• In practice, use averaging for better generalization
  – Initialize a = 0
  – After each step, whether there is an update or not, a ← w + a
    • Note: we still check for mistakes using w, not a
  – Return a at the end instead of w
  – Exercise: optimize this for performance – modify a only on errors

• Global update
  – One weight vector for the entire sequence, not one for each position
  – The same algorithm can be derived via constraint classification
    • Create a binary classification dataset and run the perceptron
Structured Perceptron with averaging

Given a training set D = {(x, y)}
1. Initialize w = 0 ∈ ℝ^n, a = 0 ∈ ℝ^n
2. For epoch = 1 … T:
   1. For each training example (x, y) ∈ D:
      1. Predict y' = argmax_{y'} w^T φ(x, y')
      2. If y ≠ y', update w ← w + learningRate · (φ(x, y) − φ(x, y'))
      3. Set a ← a + w
3. Return a
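The averaged variant differs from the earlier sketch only in maintaining the running sum a and returning it instead of w; a minimal sketch with the same hypothetical phi and argmax_y as before:

```python
from collections import Counter

def averaged_structured_perceptron(data, phi, argmax_y, num_epochs, learning_rate=1.0):
    w, a = Counter(), Counter()                    # w = 0, a = 0
    for epoch in range(num_epochs):
        for x, y in data:
            y_pred = argmax_y(w, x)                # mistakes are still checked with w, not a
            if y_pred != y:
                for feat, count in phi(x, y).items():
                    w[feat] += learning_rate * count
                for feat, count in phi(x, y_pred).items():
                    w[feat] -= learning_rate * count
            for feat, val in w.items():            # a <- a + w, whether or not there was an update
                a[feat] += val                     # (the slide's exercise: do this only on errors)
    return a                                       # return the averaged weights instead of w
```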
CRF vs. structured perceptron

• Stochastic gradient descent update for a CRF
  – For a training example (x_i, y_i): see the reconstruction below
• Structured perceptron update
  – For a training example (x_i, y_i): see the reconstruction below
• The difference between the two updates: an expectation vs. a max

Caveat: adding regularization changes the CRF update, and averaging changes the perceptron update.
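The two updates can be written side by side. This is a reconstruction of their standard forms (with learning rate η), not a verbatim copy of the slide:

```latex
% CRF: stochastic gradient ascent step on the log-likelihood of (x_i, y_i)
w \leftarrow w + \eta \Big( \phi(x_i, y_i)
      - \mathbb{E}_{y \sim P(y \mid x_i; w)} \big[ \phi(x_i, y) \big] \Big)

% Structured perceptron: update on a mistake for (x_i, y_i)
w \leftarrow w + \eta \Big( \phi(x_i, y_i)
      - \phi\big(x_i, \; \arg\max_{y} w^T \phi(x_i, y)\big) \Big)
```

The second term is the only difference: the CRF subtracts the expected features under the current model, while the perceptron subtracts the features of the single highest-scoring output: expectation vs. max.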
The lay of the land

HMM: a generative model that assigns probabilities to sequences. From here, two roads diverge.

One road (Structured Perceptron, Structured SVM):
• Hidden Markov Models are actually just linear classifiers
• We don't really care whether we are predicting probabilities; we are assigning scores to a full output for a given input (like multiclass classification)
• Generalize algorithms for linear classifiers; sophisticated models that can use arbitrary features

The other road (Conditional Random Fields):
• Model probabilities via exponential functions, which gives us the log-linear representation
• Log-probabilities for sequences for a given input
• Learn by maximizing likelihood; sophisticated models that can use arbitrary features

Both roads lead to discriminative/conditional models, applicable beyond sequences. Eventually, a similar objective is minimized with different loss functions. Coming soon…
Sequence models: Summary

• Goal: predict an output sequence given an input sequence
• Hidden Markov Model
• Inference
  – Predict via the Viterbi algorithm
• Conditional models / discriminative models
  – Local approaches (no inference during training)
    • MEMM, conditional Markov model
  – Global approaches (inference during training)
    • CRF, structured perceptron
• To think about
  – What are the parts in a sequence model?
  – How is each model scoring these parts?

The same dichotomy holds for more general structures.
Prediction is not always tractable for general structures.