
1

Anoop Sarkar, on sabbatical at U. of Edinburgh (Informatics 4.18b)

Simon Fraser University, Vancouver, Canada

natlang.cs.sfu.ca, October 2, 2009

Bootstrapping a Classifier Using the Yarowsky Algorithm

Acknowledgements

• This is joint work with my students Gholamreza Haffari (Ph.D.) and Max Whitney (B.Sc.) at SFU.

• Thanks to Michael Collins for providing the named-entity dataset and answering our questions.

• Thanks to Damianos Karakos and Jason Eisner for providing the word sense dataset and answering our questions.

2

3

Bootstrapping

4

Self‐Training

1. A base model is trained with a small/large amount of labeled data.

2. The base model is then used to classify the unlabeled data.

3. Only the most confident unlabeled points, along with their predicted labels, are incorporated into the labeled training set (pseudo-labeled data).

4. The base model is re-trained, and the process is repeated.
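A minimal sketch of this loop in Python, assuming a scikit-learn-style base classifier with fit/predict_proba; the confidence threshold and iteration cap are illustrative, not values from the talk:

```python
import numpy as np

def self_train(model, labeled_X, labeled_y, unlabeled_X, threshold=0.95, max_iter=10):
    """Generic self-training: train, label the unlabeled pool,
    keep only confident predictions, retrain, repeat."""
    X, y = list(labeled_X), list(labeled_y)
    pool = list(unlabeled_X)
    for _ in range(max_iter):
        model.fit(X, y)                              # 1. train the base model
        if not pool:
            break
        probs = model.predict_proba(pool)            # 2. classify the unlabeled data
        confident, rest = [], []
        for x, p in zip(pool, probs):
            k = int(np.argmax(p))
            if p[k] >= threshold:                    # 3. keep only confident predictions
                confident.append((x, model.classes_[k]))
            else:
                rest.append(x)
        if not confident:                            # the model abstains on everything
            break
        X.extend(x for x, _ in confident)            # add the pseudo-labeled points ...
        y.extend(lab for _, lab in confident)
        pool = rest                                  # 4. ... and retrain on the enlarged set
    return model
```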

5

Self‐Training

• It can be applied to any base learning algorithm: we only need confidence weights for its predictions.

• Differences with EM:
  • Self-training only uses the mode of the prediction distribution.
  • Unlike hard-EM, it can abstain: "I do not know the label."

• Differences with Co-training:
  • In co-training there are two views, and a model is learned in each view.
  • The model in one view trains the model in the other view by providing pseudo-labeled examples.

6

Bootstrapping

• Start with a few seed rules (typically high precision, low recall). Build an initial classifier.

• Use the classifier to label the unlabeled data.

• Extract new rules from the pseudo-labeled data and build the classifier for the next iteration.

• Exit if the labels for the unlabeled data are unchanged. Else, apply the classifier to the unlabeled data and continue.

7

Decision List (DL)

• A Decision List is an ordered set of rules.
• Given an instance x, the first applicable rule determines the class label.

• Instead of ordering the rules, we can give a weight to each of them.
• Among all rules applicable to an instance x, apply the rule which has the highest weight.

• The parameters are the weights, which specify the ordering of the rules.

Rules: If x has feature f → class k, with parameters θ_f,k
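A small sketch of a weighted decision list in Python; the rule representation (feature, label, weight triples) and the abstention threshold are illustrative assumptions:

```python
class DecisionList:
    """Weighted decision list: among the rules whose feature is active
    in the instance, apply the one with the highest weight."""

    def __init__(self, threshold=0.0):
        self.rules = {}          # (feature, label) -> weight theta_{f,k}
        self.threshold = threshold

    def add_rule(self, feature, label, weight):
        self.rules[(feature, label)] = weight

    def predict(self, features):
        best_weight, best_label = None, None
        for f in features:
            for (rf, label), w in self.rules.items():
                if rf == f and (best_weight is None or w > best_weight):
                    best_weight, best_label = w, label
        if best_weight is None or best_weight < self.threshold:
            return None          # abstain: no applicable rule is confident enough
        return best_label
```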

8

DL for Word Sense Disambiguation

(Yarowsky 1995)

• WSD: specify the most appropriate sense (meaning) of a word in a given sentence.

• Consider these two sentences:
  … company said the plant is still operating. → factory sense (+), features: (company, operating)
  … and divide life into plant and animal kingdom. → living organism sense (−), features: (life, animal)

• Sorted decision list (excerpt):
  If company → +1, confidence weight .97
  If life → −1, confidence weight .96
  …

Example: disambiguate 2 senses of "sentence"

• Seed rules:
  If context contains served → label +1, conf = 1.0
  If context contains reads → label −1, conf = 1.0

• The seed rules label 8 out of 303 unlabeled examples.

• These 8 pseudo-labeled examples provide 6 rules above the 0.95 threshold (including the original seed rules), e.g.
  If context contains read → label −1, conf = 0.953

• These 6 rules label 151 out of 303 unlabeled examples.

Example: disambiguate 2 senses of "sentence"

• These 151 pseudo-labeled examples provide 60 rules above the threshold, e.g.
  If context contains prison → label +1, conf = 0.989
  If previous word is life → label +1, conf = 0.986
  If previous word is his → label +1, conf = 0.983
  If next word is from → label −1, conf = 0.982
  If context contains relevant → label −1, conf = 0.953
  If context contains page → label −1, conf = 0.953

• After 5 iterations, 297/303 unlabeled examples are permanently labeled (no changes possible).

• Building the final classifier gives 67% accuracy on a test set of 515 sentences. With some "tricks" we can get 76% accuracy.
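A rough sketch of how such rule confidences can be derived from pseudo-labeled examples, using smoothed relative frequencies in the style of decision-list learning; the smoothing constant is an assumption, not a value from the talk:

```python
from collections import Counter

def extract_rules(pseudo_labeled, threshold=0.95, smooth=0.1, num_labels=2):
    """pseudo_labeled: iterable of (features, label) pairs, label in {+1, -1}.
    Returns (feature, label, confidence) rules whose smoothed relative
    frequency clears the confidence threshold."""
    feat_label = Counter()
    feat_total = Counter()
    for features, label in pseudo_labeled:
        for f in set(features):
            feat_label[(f, label)] += 1
            feat_total[f] += 1
    rules = []
    for (f, label), c in feat_label.items():
        conf = (c + smooth) / (feat_total[f] + num_labels * smooth)
        if conf >= threshold:
            rules.append((f, label, conf))
    return sorted(rules, key=lambda r: -r[2])        # highest-confidence rules first
```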

11

Brief History of Bootstrapping

• (Yarowsky 1995) used it with a Decision List base classifier for the Word Sense Disambiguation (WSD) task.
  • It achieved the same performance level as the supervised algorithm, using only a few seed examples as labeled training data.

• (Collins & Singer 1999) used it for the Named Entity Classification task with a Decision List base classifier.
  • Using only 7 initial rules, it achieved 91% accuracy.
  • It achieved the same performance level as Co-training (no need for 2 views).

• (Abney ACL 2002), in a paper about co-training, contrasts it with the Yarowsky algorithm. The initial analysis was abandoned later.

12

Brief History of Bootstrapping

• (Abney CL 2004) provided a new analysis of the Yarowsky algorithm.
  • It could not mathematically analyze the original Yarowsky algorithm, but introduced new variants (we will see them later).

• (Haffari & Sarkar UAI 2007) advanced Abney's analysis and gave a general framework that showed how the Yarowsky algorithm introduced by Abney is related to other SSL methods.

• (Eisner and Karakos 2005) examine the construction of seed rules for bootstrapping.

13

Analysis of the Yarowsky Algorithm

14

OriginalYarowskyAlgorithm

•  TheYarowskyalgorithmisabootstrappingalgorithmwithaDecisionListbaseclassifier.

•  Thepredictedlabelisk*iftheconfidenceoftheappliedruleisabovesomethresholdη.

•  Aninstancemaybecomeunlabeledinfutureiterations.

(Yarowsky1995)

15

Modified Yarowsky Algorithm

• Instead of the feature with the maximum score, we use the sum of the scores of all features active for an example to be labeled.

• The predicted label is k* if the confidence of the applied rule is above the threshold 1/K.
  • K is the number of labels.

• An instance must stay labeled once it becomes labeled, but the label may change.

• These are the conditions in all the algorithms we will analyze in the rest of the talk.
  • Analyzing the original Yarowsky algorithm is still an open question.

(Abney 2004)
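A sketch of this labeling rule, assuming per-feature scores θ_{f,k} stored in a dictionary; the representation is illustrative, not Abney's pseudocode:

```python
def label_instance(features, theta, num_labels):
    """Modified Yarowsky labeling: sum the scores of all active features
    for each label; predict the argmax label k* only if its normalized
    score exceeds 1/K, otherwise abstain."""
    scores = [sum(theta.get((f, k), 0.0) for f in features) for k in range(num_labels)]
    total = sum(scores)
    if total == 0.0:
        return None                                  # no active feature has any weight
    probs = [s / total for s in scores]
    k_star = max(range(num_labels), key=lambda k: probs[k])
    return k_star if probs[k_star] > 1.0 / num_labels else None
```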

16

Bipartite Graph Representation

[Figure: a bipartite graph between features F and instances X.
  Instance (+1) "… company said the plant is still operating" is connected to the features company and operating.
  Instance (−1) "… divide life into plant and animal kingdom" is connected to the features life and animal.
  The remaining instances are unlabeled.]

(Cordunneanu 2006, Haffari & Sarkar 2007)

We propose to view bootstrapping as propagating the labels of the initially labeled nodes to the rest of the graph nodes.
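A minimal sketch of building this bipartite view, assuming each instance is represented by its set of active features; the propagation algorithms below operate on this adjacency structure:

```python
def build_bipartite_graph(instances):
    """instances: list of feature sets, one per instance.
    Returns both adjacency maps: instance index -> features,
    and feature -> set of instance indices."""
    inst_to_feats, feat_to_insts = {}, {}
    for i, feats in enumerate(instances):
        inst_to_feats[i] = set(feats)
        for f in feats:
            feat_to_insts.setdefault(f, set()).add(i)
    return inst_to_feats, feat_to_insts
```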

17

Self-Training on the Graph

(Haffari & Sarkar 2007)

[Figure: the bipartite graph of features F and instances X. π_x and q_x are labeling distributions attached to instance node x, and θ_f is the labeling distribution attached to feature node f; for example, q_x = (1, 0) for a labeled instance, and θ_f = (.7, .3) or (.6, .4) for feature nodes.]

18

Goals of the Analysis

• To find reasonable objective functions for the self-training algorithms on the bipartite graph.

• The objective functions may shed light on the empirical success of different DL-based self-training algorithms.

• They can tell us what kinds of properties in the data are well exploited and captured by the algorithms.

• They are also useful in proving the convergence of the algorithms.

• KL-divergence is a measure of distance between two probability distributions:

• Entropy H is a measure of randomness in a distribution:

• The objective function:
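For reference, the standard definitions, and one natural form of the edge-wise objective on the feature-instance graph (a sketch; the exact variant on the original slide may differ), are:

```latex
\mathrm{KL}(p \,\|\, q) = \sum_{k} p_k \log \frac{p_k}{q_k}, \qquad
H(p) = -\sum_{k} p_k \log p_k, \qquad
\min_{q,\,\theta} \; \sum_{x \in X} \sum_{f \in F_x} \mathrm{KL}\!\left(q_x \,\|\, \theta_f\right)
```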

19

Objective Function

[Figure: the objective is defined over the edges of the feature-instance bipartite graph, with features F on one side and instances X on the other.]

20

The Bregman Distance

• Given a strictly convex function ψ, the Bregman distance Bψ between two probability distributions is defined as:

• The ψ-entropy Hψ is defined as:

• Examples:
  – If ψ(t) = t log t, then Bψ(α, β) = KL(α, β).
  – If ψ(t) = t², then Bψ(α, β) = Σ_i (α_i − β_i)².

• The generalized objective function:

[Figure: a strictly convex function ψ(t), with the Bregman distance shown as the gap at α_i between ψ and the tangent to ψ at β_i.]
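For reference, the standard forms (consistent with the examples above: ψ(t) = t log t recovers KL and Shannon entropy) are:

```latex
B_\psi(\alpha, \beta) = \sum_i \Big( \psi(\alpha_i) - \psi(\beta_i) - \psi'(\beta_i)\,(\alpha_i - \beta_i) \Big), \qquad
H_\psi(p) = -\sum_i \psi(p_i)
```

The generalized objective then, presumably, replaces each KL term in the earlier objective with B_ψ and each entropy term with H_ψ over the same graph.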

21

Generalizing the Objective Function

[Figure: the generalized objective defined over the feature-instance bipartite graph.]

22

OptimizingtheObjectiveFunctions

•  Inwhatfollows,wementionsomespecificobjectivefunctionstogetherwiththeiroptimizationalgorithms.

•  TheseoptimizationalgorithmscorrespondtosomevariantsofthemodifiedYarowskyalgorithm.

•  Itisnoteasytocomeupwithalgorithmsfordirectlyoptimizingthegeneralizedobjectivefunctions.

23

Useful Operations

• Average: take the average distribution of the neighbors.

• Majority: take the majority label of the neighbors.

[Example: neighbors with distributions (.2, .8) and (.4, .6) average to (.3, .7); the majority operation maps them to (0, 1), since both put most of their mass on the second label.]
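A small sketch of the two operations, with label distributions represented as plain Python lists; the example in the comments mirrors the one on the slide:

```python
def average_op(neighbor_dists):
    """Average: component-wise mean of the neighbors' label distributions."""
    n, k = len(neighbor_dists), len(neighbor_dists[0])
    return [sum(d[i] for d in neighbor_dists) / n for i in range(k)]

def majority_op(neighbor_dists):
    """Majority: a 0/1 distribution on the label that most neighbors
    put the largest share of their mass on."""
    k = len(neighbor_dists[0])
    votes = [0] * k
    for d in neighbor_dists:
        votes[max(range(k), key=lambda i: d[i])] += 1
    winner = max(range(k), key=lambda i: votes[i])
    return [1.0 if i == winner else 0.0 for i in range(k)]

# average_op([[.2, .8], [.4, .6]])  -> [0.3, 0.7]
# majority_op([[.2, .8], [.4, .6]]) -> [0.0, 1.0]
```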

24

Analyzing Self-Training

Theorem. The following objective functions are optimized by the corresponding label-propagation algorithms on the bipartite graph:

[Table: objective functions over the feature-instance graph paired with the label-propagation algorithms that optimize them. One is related to graph-based SS learning (Zhu et al. 2003); another is Abney's variant of the Yarowsky algorithm. The algorithms converge in polynomial time, O(|F|²|X|²).]

25

What about Log-Likelihood?

• Initially, the labeling distribution is uniform for unlabeled vertices and a δ-like distribution for labeled vertices.

• By learning the parameters, we would like to reduce the uncertainty in the labeling distribution while respecting the labeled data: the negative log-likelihood of the old and newly labeled data.

• Lemma. If m is the number of features connected to an instance, then: …

26

Connection between the two Analyses

Compare with Conditional Entropy Regularization (Grandvalet and Bengio 2005)!

\sum_{x \in L} \mathrm{KL}(q_x \,\|\, \pi_x) \;+\; \lambda \sum_{x \in U} H(\pi_x)

27

Experiments

Named Entity Classification

• 971,476 sentences from the NYT were parsed with the Collins parser.

• The task is to identify three types of named entities:
  1. Location (LOC)
  2. Person (PER)
  3. Organization (ORG)
  −1. not a NE, or "don't know"

28

(Collins and Singer, 1999)

Named Entity Classification

• Noun phrases were extracted that met the following conditions:
  1. The NP contained only words tagged as proper nouns.
  2. The NP appeared in one of the following two syntactic contexts:
     – Modified by an appositive whose head is a singular noun
     – In a prepositional phrase modifying an NP whose head is a singular noun

29

(Collins and Singer, 1999)


Examples:

  NP → NNP NNP NNPS: "International Business Machines" (all words tagged as proper nouns)

  …, says [NE Maury Cooper], a vice [CONTEXT president] at S.&P.

  …, fraud related to work on a federally funded sewage [CONTEXT plant in] [NE Georgia]

Named Entity Classification

• The task: classify NPs into LOC, PER, ORG.
• 89,305 training examples with 68,475 distinct feature types.
  – 88,962 were used in the CS99 experiments.
• 1000 test data examples (includes NPs that are not LOC, PER or ORG).
• Month names are easily identifiable as not named entities; removing them leaves 962 examples.
• Still, 85 NPs are not LOC, PER, or ORG.
• Clean accuracy is measured over 877 examples; Noisy accuracy over all 962.

31

(Collins and Singer, 1999)

Yarowsky Variants

• A trick from the Co-training paper (Blum and Mitchell 1998) is to be cautious: don't add all rules above the 0.95 threshold.

• Add only n rules per label (say 5) and increase this amount by n in each iteration.

• This changes the dynamics of learning in the algorithm, but not the objective function.

• Two variants: Yarowsky (basic), Yarowsky (cautious).

• Without a threshold: Yarowsky (no threshold).

32

(Abney 2004; Collins and Singer, 1999)
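A sketch of the cautious selection step, assuming rules have already been scored as (feature, label, confidence) triples; the growth schedule follows the n-per-label description above, and the names are illustrative:

```python
def cautious_select(scored_rules, iteration, n=5, threshold=0.95):
    """Cautious variant: among rules above the confidence threshold,
    keep only the top n * iteration rules for each label."""
    budget = n * iteration
    by_label = {}
    for feature, label, conf in scored_rules:
        if conf >= threshold:
            by_label.setdefault(label, []).append((conf, feature))
    selected = []
    for label, rules in by_label.items():
        rules.sort(reverse=True)                     # most confident rules first
        selected.extend((f, label, c) for c, f in rules[:budget])
    return selected
```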

Results

Learning Algorithm           Accuracy (Clean)   Accuracy (Noisy)
Baseline (all organization)  45.8               41.8
EM                           83.1               75.8
Yarowsky (basic)             80.7               73.5
Yarowsky (no threshold)      80.3               73.2
Yarowsky (cautious)          91                 83
DL-CoTrain                   91                 83

33

Number of Rules (basic)

34

[Plot: num. rules (0-5000) vs. iteration (1-9).]

Number of Rules (cautious)

35

[Plot: num. rules (0-7000) vs. iteration (0-450).]

Coverage (basic)

36

[Plot: coverage (0-0.8) vs. iteration (1-9).]

Coverage (cautious)

37

[Plot: coverage (0-0.8) vs. iteration (0-450).]

Accuracy (basic)

38

[Plot: accuracy (0-0.8) vs. iteration (1-10).]

Accuracy (cautious)

39

[Plot: accuracy (0-0.9) vs. iteration (0-450).]

Precision-Recall (basic)

40

[Plot: precision (0.82-0.94) vs. recall (0.5-0.85).]

Precision-Recall (cautious)

41

[Plot: precision (0.91-0.97) vs. recall (0.2-1.0).]

Seeds

• Selecting seed rules: what is a good strategy?
  – Frequency: sort by frequency of feature occurrence.
  – Contexts: sort by the number of other features a feature was observed with.
  – Weighted: sort by a weighted count of the other features observed with the feature, where weight(f) = count(f) / Σ_f' count(f').

42

(Eisner and Karakos 2005, Zagibalov and Carroll 2008)
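A sketch of the three orderings, assuming the counts are collected from the unlabeled training data as sets of active features per example; the normalization in the weighted variant is one reading of the formula above:

```python
from collections import Counter, defaultdict

def rank_seed_candidates(instances):
    """instances: list of feature sets from the unlabeled training data.
    Returns the features sorted under the three strategies:
    frequency, number of co-occurring contexts, and weighted contexts."""
    freq = Counter()
    cooc = defaultdict(set)
    for feats in instances:
        feats = set(feats)
        for f in feats:
            freq[f] += 1
            cooc[f] |= feats - {f}
    total = sum(freq.values())
    by_frequency = sorted(freq, key=lambda f: -freq[f])
    by_contexts = sorted(freq, key=lambda f: -len(cooc[f]))
    by_weighted = sorted(freq, key=lambda f: -sum(freq[g] / total for g in cooc[f]))
    return by_frequency, by_contexts, by_weighted
```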

Seeds

• In each case the frequencies were taken from the unlabeled training data.

• Seeds were extracted from the sorted list of features by manual inspection and assigned a label (the entire example was used).

• Location (LOC) features appear infrequently in all three orderings.

• It is possible that some good LOC seeds were missed.

43

Seeds

Number of Rules
(n/3 rules/label)   Frequency        Contexts         Weighted
                    Clean   Noisy    Clean   Noisy    Clean   Noisy
3                   84      77       84      77       88      80
9                   91      83       90      82       82      74
15                  91      83       91      83       85      77
7 (CS99)            91      83

44

Word Sense Disambiguation

• Data from (Eisner and Karakos 2005).

• Disambiguate two senses each for drug, duty, land, language, position, sentence (Gale et al. 1992).

• Source of unlabeled data: the 14M-word Canadian Hansards (English only).

• Two seed rules for each disambiguation task, from (Eisner and Karakos 2005).

45

Results

46

Learning Algorithm         drug          land           sentence
Seeds                      alcohol,      acres,         served,
                           medical       courts         reads
Train / Test size          134 / 386     1604 / 1488    303 / 515
Yarowsky (basic)           53.3          79.3           67.7
Yarowsky (no threshold)    52            79             64.8
Yarowsky (cautious)        55.9          79             76.1
DL-CoTrain (2 views:
  long distance vs.
  immediate context)       53.1          77.7           75.9

47

Self-Training for Machine Translation

48

Self-Training for SMT

[Diagram: train an SMT model M_FE on bilingual text (F, E); decode monolingual F text to produce translated text (F, E); select high-quality sentence pairs; use them to re-train the log-linear SMT model, and repeat.]

49

Selecting Sentence Pairs

• First, give scores:
  – Use the normalized decoder score
  – Confidence estimation method (Ueffing & Ney 2007)

• Then select based on the scores:
  – Importance sampling
  – Those whose score is above a threshold
  – Keep all sentence pairs

50

Re-training the SMT Model

• Use the new sentence pairs to train an additional phrase table and use it as a new feature function in the SMT log-linear model:
  – One phrase table trained on sentences for which we have the true translations
  – One phrase table trained on sentences with their generated translations
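A high-level sketch of the whole loop under the assumptions above; train_smt, decode, and score_pair are hypothetical hooks standing in for the underlying SMT system (Portage in these experiments), not its real API:

```python
def self_train_smt(train_smt, decode, score_pair, bitext, mono_source,
                   rounds=4, threshold=0.5):
    """Self-training for SMT. Assumed hook signatures:
       train_smt(list_of_corpora) -> model   (one phrase table per corpus)
       decode(model, src) -> hypothesis translation
       score_pair(model, src, hyp) -> quality score (e.g. normalized decoder score)
    """
    model = train_smt([bitext])                      # baseline: true bilingual text only
    for _ in range(rounds):
        pseudo = []
        for src in mono_source:
            hyp = decode(model, src)                 # translate monolingual sentences
            if score_pair(model, src, hyp) >= threshold:
                pseudo.append((src, hyp))            # keep only high-quality pairs
        # the generated pairs feed a second phrase table, used as an
        # additional feature function in the log-linear model
        model = train_smt([bitext, pseudo])
    return model
```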

51

Chinese to English (Transductive)

Selection             Scoring        BLEU%
Baseline                             31.8 ± .7
Keep all                             33.1
Importance sampling   Norm. score    33.5
                      Confidence     33.2
Threshold             Norm. score    33.5
                      Confidence     33.5

Bold: best result, italic: significantly better

NIST Eval-2004: train = 8.2M, test = 1788 (4 refs)
Train: news, magazines, laws + UN
Test: newswire, editorials, political speeches

We use Portage from NRC as the underlying SMT system (Ueffing et al., 2007)

52

Chinese to English (Inductive)

System                       BLEU%
Baseline                     31.8 ± .7
Add Chinese data, Iter 1     32.8
Add Chinese data, Iter 4     32.6
Add Chinese data, Iter 10    32.5

Bold: best result, italic: significantly better
Using importance sampling

BLEU% by genre:
               Before   After
editorials     30.7     31.3
newswire       30.0     31.1
speeches       36.1     37.3

53

Why does it work?

• Reinforces the parts of the phrase translation model which are relevant for the test corpus.

• Glue phrases from the test data are used to compose new phrases (most phrases still come from the original data).

54


Summary

• Should we ever use Co-training for Bootstrapping?

• Per-label cautiousness leads to effective bootstrapping.
  – Exploited in the Yarowsky algorithm, DL-CoTrain, Co-Boosting

• These dynamics can/should be examined more closely.
  – Perhaps using tools from the analysis of feature induction.

• Bootstrapping and self-training may be more effective than you might have thought.

55
