Learning Weighted Finite‐State Transducers SPFLODD October 27, 2011



Background

• This lecture is based on a paper by Jason Eisner at ACL 2002, “Parameter Estimation for Probabilistic Finite‐State Transducers.”
– This is perhaps the most under‐appreciated paper in the past ten years of NLP.

• Full disclosure: Jason Eisner was my Ph.D. advisor.
– He’s one of the smartest people I have ever met.

Finite‐State Automaton

Notation           Definition
Q                  finite set of states
Σ                  finite vocabulary
q0 ∈ Q             start state
F ⊆ Q              set of final states
δ: Q ⨉ Σ* → 2^Q    transition function; possible next states given current state and input symbol(s)

Finite‐State Automaton (Maybe Better?)

Notation          Definition
Q                 finite set of states
Σ                 finite vocabulary
q0 ∈ Q            start state
F ⊆ Q             set of final states
A ⊆ Q ⨉ Σ* ⨉ Q    set of transitions (source state, symbol sequence, target state)

Finite‐State Automaton

• Automaton that recognizes a regular language
– Key transformations: remove ε‐transitions, determinize, minimize

• Implementation of a regular expression
• Regular languages are closed under numerous operations
– Concatenation, union, intersection, Kleene star, difference, reverse, complement, …

• Correspond to regular grammars (type 3 in the Chomsky hierarchy)

• Pumping lemma: a necessary condition for a language to be regular

FSA as a Dictionary

• Example: 850 words in “Basic English”
• Each word is an FSA

Ten‐Word Dictionary

Remove ε‐transitions

Determinize

Minimize

Full 850‐Word Dictionary

operation              states   final states   arcs
Union                  5303     850            5302
Remove ε‐transitions   4454     850            4453
Determinize            2609     848            2608
Minimize               744      42             1535

Generalizations

• A finite‐state recognizer is a function from Σ* → {0, 1}
– Meaning: fsa(s) = 1 ⇔ s is in the language

• Other rational relations…
– Finite‐state transducer: Σ* → Δ*
– Weighted FSA: Σ* → ℝ
– Weighted FST: Σ* → Δ* × ℝ

• WFSAs and WFSTs can be considered probabilistic (but don’t have to be)

Relations on Strings

• A relation is a set of (input, output) pairs.
– More general than functions because you can represent ambiguity and optionality!
– For standard FSAs, think of input = output.
• Rational relations are a special kind of relation with a wide range of closure properties.
• Rational relations can be understood as a declarative programming paradigm:
– source code is a regular expression
– object code is a 2‐tape automaton called a finite‐state transducer (FST)
– optimization is accomplished by determinization and minimization
– supports nondeterminism, parallel processing over infinite sets of input strings, and reverse computation from outputs to inputs

Finite‐State Automata and Transducers

FSA                 FST                       Definition
Q                   Q                         finite set of states
Σ                   Σ                         finite (input) vocabulary
                    Δ                         finite output vocabulary
q0 ∈ Q              q0 ∈ Q                    start state
F ⊆ Q               F ⊆ Q                     set of final states
δ: Q ⨉ Σ* → 2^Q     δ: Q ⨉ Σ* ⨉ Δ* → 2^Q      transition function; possible next states given current state and input symbol(s) (and output symbol(s), for the FST)

Eisner’s Running Example

(figure: the example WFST)

Weighted Relations

• Assign scores to (input, output) pairs.
– Sometimes interpreted as p(input, output)
– Sometimes interpreted as p(output | input)
– Sometimes neither

• This idea unifies many NLP approaches:
– sequence labeling
– chunking
– normalization
– segmentation
– alignment
– speech recognition (Pereira and Riley, 1997)
– machine translation (Knight and Al‐Onaizan, 1998)

Weighted FSTs

FST                     Weights        Definition
Q                                      finite set of states
Σ                                      finite input vocabulary
Δ                                      finite output vocabulary
q0 ∈ Q                                 start state
F ⊆ Q                   stop weights   set of final states
δ: Q ⨉ Σ* ⨉ Δ* → 2^Q    arc weights    transition function; possible next states given current state and input symbol(s) and output symbol(s)

WFSTs

(diagram) WFSTs contain two special cases: FSTs, whose weights are in {0, 1}, and WFSAs, which represent sets (languages), a special kind of relation where output = input. FSAs are both at once.

Weights and Scores

• Weights are assigned to transitions and to ending a path in each state.

• The score of a path is the product of its transition weights and the stop weight.

• “Zero” means the same thing as “impossible.”
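
A minimal sketch of this scoring rule in Python. The states, arcs, and weights below are invented for illustration (this is not Eisner’s example):

```python
# Hypothetical WFST fragment: arcs map (source, input, output, target) -> weight.
arcs = {
    (0, "a", "x", 1): 0.7,
    (1, "b", "z", 2): 0.5,
}
stop_weight = {2: 0.4}  # weight of ending a path in state 2

def path_score(path):
    """Product of the arc weights along the path, times the stop weight of
    the state where the path ends.  A zero weight means "impossible"."""
    score = 1.0
    for arc in path:
        score *= arcs.get(arc, 0.0)
    final_state = path[-1][3]
    return score * stop_weight.get(final_state, 0.0)

# One path reading "ab" and writing "xz": 0.7 * 0.5 * 0.4 = 0.14
print(path_score([(0, "a", "x", 1), (1, "b", "z", 2)]))
```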

Eisner’s Running Example

Two paths for (aabb, xz):
path score = 0.0002646
path score = 0.0002646
score of (aabb, xz) = 0.0002646 + 0.0002646 = 0.0005292

Weighted Relations and Probabilities

• Let f(x, y) be the function corresponding to a WFST’s assignment of a score to the (input, output) pair (x, y).
– If f sums to one over all (x, y), then it is a joint distribution.
– If f sums to one over all y, for each x, then it is a conditional distribution.
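
A small illustration of the two normalization conditions, with a made‐up score table f over four string pairs:

```python
from collections import defaultdict

# Hypothetical scores f(x, y) on a tiny set of (input, output) pairs.
f = {("a", "x"): 0.1, ("a", "z"): 0.2, ("b", "x"): 0.3, ("b", "z"): 0.4}

# Joint distribution: f sums to one over all (x, y).
print(abs(sum(f.values()) - 1.0) < 1e-9)  # True: f is a joint distribution

# Conditional distribution: f would have to sum to one over y for EACH x.
per_x = defaultdict(float)
for (x, _), score in f.items():
    per_x[x] += score
print(dict(per_x))  # x="a" sums to 0.3, x="b" to 0.7, so f is not conditional
```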

Parameterizing the WFST

• Option 1: every transition or stop gets a parameter.
– Option 1A: make sure competing choices (transitions from q and stopping in q) sum to 1.

13 free parameters

Parameterizing the WFST

• Suppose our WFST was built by composing two simpler WFSTs.

WFST Composition

First machine: Σ = {a, b}, Δ = {p, q}
Second machine: Σ = {p, q}, Δ = {x, z}

WFST Composition

input string in {a, b}* → WFST 1 → intermediate string in {p, q}* → WFST 2 → output string in {x, z}*


WFST Composition

input string in {a, b}* → (composed) WFST → output string in {x, z}*

WFST Composition

• Let f and g define the weighted relations for two WFSTs such that f’s output alphabet and g’s input alphabet are the same. Then:

$$(f \circ g)(x, z) = \sum_y f(x, y) \times g(y, z)$$

– Either f or g or both can be a set (instead of a relation).
– Either f or g can be unweighted (scores are 0 or 1).
– If both are unweighted sets (FSAs), then this is intersection.
• If f is a joint distribution p(x, y) and g is a conditional distribution p(z | y), we now have a probabilistic model over three string random variables.
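
A brute‐force sketch of the composition formula, with each weighted relation stored as a dictionary from (input, output) pairs to scores. The entries are invented, and real WFST composition operates on the machines themselves rather than on enumerated pairs:

```python
def compose(f, g):
    """(f o g)(x, z) = sum over y of f(x, y) * g(y, z), for weighted
    relations represented as {(input, output): score} dictionaries."""
    h = {}
    for (x, y1), f_xy in f.items():
        for (y2, z), g_yz in g.items():
            if y1 == y2:  # the intermediate string y must match
                h[(x, z)] = h.get((x, z), 0.0) + f_xy * g_yz
    return h

f = {("aa", "p"): 0.5, ("aa", "q"): 0.5}
g = {("p", "x"): 0.9, ("q", "x"): 0.2, ("q", "z"): 0.8}
print(compose(f, g))  # {('aa', 'x'): 0.55, ('aa', 'z'): 0.4}
```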

WFST Composition

(4,6) self‐loop is really a → p → x
(5,6) self‐loop is really b → {p, q} → x
(4,7) self‐loop is really a → p → ε
(5,7) self‐loops are really b → p → ε and b → q → z

WFST Composition

6 free parameters
1 free parameter
6 + 1 = 7 free parameters

Aside 1

• Eisner suggests another way to write down weighted regular relations, as probabilistic regular expressions.
– Build up from atomic expressions a:b, with a in Σ* and b in Δ*
– Concatenation, probabilistic union, probabilistic closure.
• Almost no work on this in NLP or ML, as far as I’ve seen.

Noisy Channel Model

Source WFSA → idealized output of predictor → Channel WFST → observable input to predictor


Historical Note

• Unweighted FSTs were developed largely for designing and implementing models of the morphology of natural languages.
– Huge amount of work at Xerox.
– Also used in information extraction.
• Very useful for hand‐constructing morphological rules individually, then assembling them by concatenation, union, composition, etc.

Parameterizations

1. Every arc gets one probability
2. Every “original” arc gets one probability
3. Log‐linear distribution with shared features all over the WFST
– This is really the most general, since features could be identities of arcs or of “original” arcs!

Exercise

• How to represent an HMM as a WFST? (one possibility is sketched below)
– An MEMM?
– A chain‐structured CRF?
• How to represent stochastic edit distance as a WFST?
– Elegant way to design a wide range of edit operations: composition of WFSTs
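
For the first exercise, one standard construction (a sketch; the tags, words, and probabilities are invented): make the WFST’s states the HMM’s states, and give the arc from s to t input symbol t (the tag) and output symbol w (the word), with weight p(t | s) · p(w | t). Stop weights play the role of the HMM’s stopping probability, so a path’s weight is the joint probability of the tag and word sequences.

```python
# Invented HMM parameters.
trans = {"START": {"N": 0.6, "V": 0.4}, "N": {"V": 0.7, "N": 0.3}, "V": {"N": 1.0}}
emit = {"N": {"dog": 0.8, "run": 0.2}, "V": {"run": 0.9, "dog": 0.1}}
stop = {"N": 0.5, "V": 0.5}  # stop weights ~ p(STOP | state)

def joint_weight(tags, words):
    """Weight of the single WFST path that reads `tags` and writes `words`:
    the arc s --t:w--> t has weight trans[s][t] * emit[t][w]."""
    score, state = 1.0, "START"
    for t, w in zip(tags, words):
        score *= trans[state].get(t, 0.0) * emit[t].get(w, 0.0)
        state = t
    return score * stop.get(state, 0.0)

print(joint_weight(["N", "V"], ["dog", "run"]))  # 0.6*0.8 * 0.7*0.9 * 0.5 = 0.1512
```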

Back to Learning

• We want a general method for learning the parameters from data, even when all layers are not known.
– EM for HMMs is well known (Baum, 1972)
– EM for stochastic edit distances is well known (Ristad and Yianilos, 1996)
• If our WFST was constructed by composing simpler machines, we might want to keep the original parameterization.
– I.e., learn weights for the individual machines jointly.

Very General Formulation of Learning

• Flexibility in parameterization, including
– one probability per arc
– one probability per “original” arc
– log‐linear distribution over arcs from a state, with parameter tying throughout the WFST
• Learn from samples of (input, output) pairs where xᵢ ⊆ Σ* and yᵢ ⊆ Δ* (paths not observed).
– supervised: each xᵢ and each yᵢ is a single string
– unsupervised: yᵢ = Δ*
– lots of possibilities in between

Maximum Likelihood Estimation

• You can view this as a generalization of Baum‐Welch training or of EM for stochastic edit distances.
• Each example’s total score is a path sum over the scores of paths that
– are allowed by the WFST
– match the input set xᵢ
– match the output set yᵢ
• Recall that f might be a joint or conditional probability distribution and θ might be log‐linear or multinomial parameters.

$$\max_\theta \prod_i f_\theta(x_i, y_i) \;=\; \max_\theta \prod_i \sum_{a \in \mathrm{Paths}(x_i, y_i)} \prod_{j=1}^{|a|} \mathrm{weight}_\theta(a_j)$$
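
A brute‐force rendering of the inner path sum for one example, on a toy machine where every arc consumes exactly one input symbol and emits exactly one output symbol (a simplifying assumption of this sketch; the machine and weights are invented, and a real implementation would use dynamic programming rather than recursion over arcs):

```python
def example_score(arcs, stop, start, x, y):
    """Sum, over a in Paths(x, y), of the product of arc weights times the
    stop weight -- the inner sum-of-products in the objective above."""
    if len(x) != len(y):  # one symbol per arc on each tape, in this sketch
        return 0.0
    def rec(state, i):
        if i == len(x):
            return stop.get(state, 0.0)
        total = 0.0
        for (s, a, b, t), w in arcs.items():
            if s == state and a == x[i] and b == y[i]:
                total += w * rec(t, i + 1)
        return total
    return rec(start, 0)

arcs = {(0, "a", "x", 0): 0.3, (0, "a", "x", 1): 0.2, (1, "a", "x", 1): 0.5}
stop = {1: 0.5}
# Two paths accept (aa, xx): 0->0->1 and 0->1->1.
print(example_score(arcs, stop, 0, "aa", "xx"))  # 0.3*0.2*0.5 + 0.2*0.5*0.5 = 0.08
```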

Expectation Maximization

• E step (one example): find the distribution over paths given xᵢ and yᵢ. Each path a ∈ Paths(xᵢ, yᵢ) gets posterior mass proportional to its weight product, and the example’s likelihood is the ratio of path sums:

$$\frac{\displaystyle\sum_{a \in \mathrm{Paths}(x_i, y_i)} \prod_{j=1}^{|a|} \mathrm{weight}_\theta(a_j)}{\displaystyle\sum_{a \in \mathrm{Paths}} \prod_{j=1}^{|a|} \mathrm{weight}_\theta(a_j)}$$

• M step: update θ to make those paths more likely (exact form depends on the parameterization):

$$\max_\theta \sum_a E_{\theta^{(t-1)}}[\mathrm{freq}(a)] \log \mathrm{weight}_\theta(a)$$

M Step

• Parameterization 1 (one probability per arc): the M‐step objective

$$\max_\theta \sum_a E_{\theta^{(t-1)}}[\mathrm{freq}(a)] \log \mathrm{weight}_\theta(a)$$

specializes to

$$\max_\theta \sum_a E_{\theta^{(t-1)}}[\mathrm{freq}(a)] \log \theta_a$$

• Parameterization 2 (one probability per “original” arc): “unwind” p(a) into a product of original‐arc probabilities.
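
Under parameterization 1 the M step has a closed form: renormalize the expected arc frequencies among competing choices, i.e., among the arcs leaving a given state together with stopping there. A sketch with invented expected counts from the E step:

```python
# Expected frequencies keyed by (state, choice); "STOP" competes with the
# arcs that leave the same state.  Numbers are invented.
expected = {
    (0, ("a", "x", 1)): 3.0,
    (0, ("b", "z", 0)): 1.0,
    (0, "STOP"): 1.0,
}

# M step: theta_a = E[freq(a)] / sum of E[freq(a')] over a's competitors.
totals = {}
for (state, _), count in expected.items():
    totals[state] = totals.get(state, 0.0) + count
theta = {key: count / totals[key[0]] for key, count in expected.items()}
print(theta)  # 0.6, 0.2, 0.2 -- the choices competing at state 0 sum to 1
```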

“Unwinding” Example

The (4,6) self‐loop is really a → p → x: 0’s self‐loop with a:x is really 4’s self‐loop with a:p and 6’s self‐loop with p:x.

M Step

• Parameterization 1 (one probability per arc):

$$\max_\theta \sum_a E_{\theta^{(t-1)}}[\mathrm{freq}(a)] \log \theta_a$$

• Parameterization 2 (one probability per “original” arc): “unwind” p(a) into a product of original‐arc probabilities.
• Parameterization 3 (log‐linear and most general):

$$\max_\theta \sum_a E_{\theta^{(t-1)}}[\mathrm{freq}(a)] \left( \theta^\top g(a) - \log \sum_{a' \in \mathrm{Competitors}(a)} \exp\, \theta^\top g(a') \right)$$
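
For parameterization 3 there is no closed form, so the M step is itself an optimization (e.g., by gradient ascent). The gradient of the objective above is, for each arc, its expected frequency times the difference between its feature vector and the expected feature vector under the local softmax over its competitors. A sketch with a hypothetical feature map g (feature vectors as plain lists):

```python
import math

def mstep_gradient(expected, competitors, g, theta):
    """Gradient of sum_a E[freq(a)] * (theta.g(a) - log Z_a) w.r.t. theta,
    where Z_a sums exp(theta.g(a')) over a' in Competitors(a)."""
    grad = [0.0] * len(theta)
    for a, count in expected.items():
        comp = competitors[a]
        scores = [math.exp(sum(t * f for t, f in zip(theta, g[b]))) for b in comp]
        z = sum(scores)
        for k in range(len(theta)):
            model_k = sum(s * g[b][k] for s, b in zip(scores, comp)) / z
            grad[k] += count * (g[a][k] - model_k)
    return grad

g = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}          # invented feature vectors
competitors = {"a1": ["a1", "a2"], "a2": ["a1", "a2"]}
print(mstep_gradient({"a1": 2.0, "a2": 1.0}, competitors, g, [0.0, 0.0]))
# [0.5, -0.5]: push theta toward the features of the more frequent arc
```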


E Step

• The likelihood value for one example is the ratio of two path sums:

$$\frac{\displaystyle\sum_{a \in \mathrm{Paths}(x_i, y_i)} \prod_{j=1}^{|a|} \mathrm{weight}_\theta(a_j)}{\displaystyle\sum_{a \in \mathrm{Paths}} \prod_{j=1}^{|a|} \mathrm{weight}_\theta(a_j)}$$

where the denominator is over Paths(Σ*, Δ*) for a generative model and Paths(xᵢ, Δ*) for a conditional model.
– The denominator path sum is the same for all examples in the generative case.
• But the E step’s real job is to calculate the sufficient statistics that the M step needs!

Best Path

• General idea: take x and build a graph.
• The score of a path factors into the edges.
• Decoding is finding the best path.

The Viterbi algorithm is an instance of finding a best path!
Flashback to October 4, when I talked about decoding…


Sum

Replace max with log Σ_y exp: best‐path scoring becomes sum scoring, and the Viterbi algorithm becomes the forward algorithm.
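
The two computations share one recurrence over the graph; only the accumulation operator differs. A sketch on a toy edge‐weighted DAG (invented weights; edges are listed so that an edge is processed only after all edges into its source):

```python
def path_aggregate(edges, n, combine):
    """Best-path (combine=max) or path-sum (combine=sum) from node 0 to
    node n-1 in a DAG whose edges are (source, target, weight)."""
    score = [0.0] * n
    score[0] = 1.0
    for u, v, w in edges:
        score[v] = combine([score[v], score[u] * w])
    return score[n - 1]

edges = [(0, 1, 0.6), (0, 2, 0.4), (1, 3, 0.5), (2, 3, 0.9)]
print(path_aggregate(edges, 4, max))  # Viterbi: best path 0-2-3 scores 0.36
print(path_aggregate(edges, 4, sum))  # forward: 0.6*0.5 + 0.4*0.9 = 0.66
```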

Expected Feature Counts

• The E step’s real job, in the most general case, is to calculate the expected feature counts in the examples, under the current model.
• We’ve seen this before!
– Forward‐backward; you can do it that way.
– Eisner suggests a different way, where the usual “plus‐times” semiring is extended and the expectations are obtained in a single pass.

Semirings

The tuple (K, ⊕, ⊗, 0, 1) is a semiring if:
– K is a set of values
– ⊕ is a commutative binary operation K × K → K with identity element 0
– ⊗ is a binary operation K × K → K with identity element 1
• for composition, ⊗ must be commutative
– ⊗ distributes over ⊕: a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c)
– 0 annihilates K: a ⊗ 0 = 0
– to handle infinite sets of cyclic paths, we need a unary closure operator * such that a* = 1 ⊕ a ⊕ (a ⊗ a) ⊕ (a ⊗ a ⊗ a) ⊕ …
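
The definition above translates directly into a small interface. A sketch with two instances; for the probability semiring the closure of a cycle weight a with 0 ≤ a < 1 is the geometric series a* = 1/(1 − a):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Semiring:
    """(K, plus, times, zero, one) with a closure operator for cyclic paths."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any
    star: Callable[[Any], Any]

# Probability semiring: a* = 1 + a + a^2 + ... = 1/(1-a) for 0 <= a < 1.
prob = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0,
                lambda a: 1.0 / (1.0 - a))

# Boolean semiring: a* is always True (taking the loop zero times).
boolean = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True,
                   lambda a: True)

# Total weight of a 0.5-weight self-loop traversed any number of times:
print(prob.star(0.5))  # 2.0
```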

Semirings You Know

interpretation of weights   want to compute       weights    “plus”      “times”
probabilities               p(s)                  [0, 1]     +           ×
probabilities               best path prob.       [0, 1]     max         ×
log‐probabilities           log p(s)              (−∞, 0]    log+        +
log‐probabilities           best path log‐prob.   (−∞, 0]    max         +
costs                       min‐cost path         [0, +∞)    min         +
Boolean                     s in language?        {0, 1}     ∨           ∧
strings                     s itself              Σ*         set‐union   concat.

The Expectation Semiring

• Instead of a weight storing just a “forward” score or probability, also store a value in V.
– For us, V is vectors of (expected) feature counts.
– Assign to each arc a value corresponding to its local feature vector.
• Define operations:

$$(p, v) \otimes (p', v') = (p p',\; p v' + p' v)$$
$$(p, v) \oplus (p', v') = (p + p',\; v + v')$$
$$(p, v)^* = (p^*,\; p^* v\, p^*)$$

• Result: the final value contains the path sum and the feature expectations.
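
A sketch of these operations for a one‐dimensional value v (the expected count of a single feature), checked against direct computation on a two‐path example with invented numbers. Each arc carries the pair (p, p × count), so a whole path accumulates (p_path, p_path × total count), and ⊕ over paths yields the path sum alongside Σ p(path) × count(path):

```python
def e_times(a, b):  # (p, v) (x) (p', v') = (pp', pv' + p'v)
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def e_plus(a, b):   # (p, v) (+) (p', v') = (p + p', v + v')
    return (a[0] + b[0], a[1] + b[1])

# Path 1: two arcs that each fire the feature once; path 2: no firings.
path1 = e_times((0.3, 0.3 * 1), (0.2, 0.2 * 1))
path2 = e_times((0.1, 0.1 * 0), (0.6, 0.6 * 0))
total = e_plus(path1, path2)
print(total)                # ~(0.12, 0.12): path sum, unnormalized expectation
print(total[1] / total[0])  # 1.0 = E[feature count], matching brute force:
# (0.06 * 2 + 0.06 * 0) / 0.12 = 1.0
```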

What’s Really Happening?

• We are manipulating weighted relations, not WFSTs.
• The expectation semiring’s values are scores and gradients of scores with respect to θ.
– Forward‐backward (e.g., for CRFs) is doing the same thing, only using the chain rule for derivatives to define a second pass (the backward pass).
– The expectation semiring lets you avoid the backward pass and the per‐arc products of forward and backward probabilities.
– But it’s probably slower in practice.

Other Goodies in the Paper and Later Work

• Analysis as the “algebraic path” problem, with links to a range of speedups (e.g., for acyclic graphs).
• Viterbi variant of the expectation semiring.
• The probabilistic regular expressions idea.
– Potential for rapid incorporation of expert intuitions into data‐driven systems?
• Li and Eisner (2009) go to second‐order expectation semirings!
• Dreyer and Eisner (2009) use WFSTs to define factors in graphical models!

Closing Notes on Learning

• Eisner’s approach is for MLE (and MAP), but his algorithms are actually inference methods for WFSTs.
– By changing to maximization and incorporating costs, you can do perceptron, structured SVM, and other error‐driven learning.
• We didn’t talk at all about learning the structure of finite‐state models!
– There’s a rich formal literature on this, and not too many papers that attempt it for real problems.
– I gave one classic citation on the wiki (Stolcke and Omohundro, 1993).

Toolkits

• FSM libraries (AT&T)
– Free binaries
– Implement pretty much everything you need to build weighted and unweighted FS recognizers and transducers… except training!
• Xerox FS toolkit
– Web demo; software can be purchased
– No weights
• RWTH FSA toolkit
– Newer, open‐source
– Not sure what’s implemented
• OpenFST (Google)
– New incarnation of the FSM libraries
– Free and open source!

Notes on Algorithms

algorithm     reference     limitations           poly time
ε‐removal     Mohri 2002
determinize   Mohri 1997    not all transducers
minimize      Eisner 2003   not all semirings     ✓

Conclusion

• WFSTs are extremely general and powerful.
– People use them to implement or approximate almost everything in NLP, IE, and MT.
• You should know this abstraction, even if you don’t use it every day.