
1

Anoop Sarkar, on sabbatical at U. of Edinburgh (Informatics 4.18b)

Simon Fraser University, Vancouver, Canada

natlang.cs.sfu.ca, October 2, 2009

Bootstrapping a Classifier Using the Yarowsky Algorithm

Acknowledgements

• This is joint work with my students Gholamreza Haffari (Ph.D.) and Max Whitney (B.Sc.) at SFU.

• Thanks to Michael Collins for providing the named-entity dataset and answering our questions.

• Thanks to Damianos Karakos and Jason Eisner for providing the word sense dataset and answering our questions.

2

3

Bootstrapping

4

Self‐Training

1. A base model is trained with a small/large amount of labeled data.

2. The base model is then used to classify the unlabeled data.

3. Only the most confident unlabeled points, along with their predicted labels, are incorporated into the labeled training set (pseudo-labeled data).

4. The base model is re-trained, and the process is repeated.
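A minimal sketch of this loop in Python, assuming a scikit-learn-style base classifier with fit/predict_proba; the confidence threshold and iteration cap are illustrative, not values from the talk:

```python
import numpy as np

def self_train(model, labeled_X, labeled_y, unlabeled_X, threshold=0.95, max_iter=10):
    """Generic self-training: train, label the unlabeled pool,
    keep only confident predictions, retrain, repeat."""
    X, y = list(labeled_X), list(labeled_y)
    pool = list(unlabeled_X)
    for _ in range(max_iter):
        model.fit(X, y)                              # 1. train the base model
        if not pool:
            break
        probs = model.predict_proba(pool)            # 2. classify the unlabeled data
        confident, rest = [], []
        for x, p in zip(pool, probs):
            k = int(np.argmax(p))
            if p[k] >= threshold:                    # 3. keep only confident predictions
                confident.append((x, model.classes_[k]))
            else:
                rest.append(x)
        if not confident:                            # the model abstains on everything
            break
        X.extend(x for x, _ in confident)            # add the pseudo-labeled points ...
        y.extend(lab for _, lab in confident)
        pool = rest                                  # 4. ... and retrain on the enlarged set
    return model
```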

5

Self‐Training

• It can be applied to any base learning algorithm: we only need confidence weights for its predictions.

• Differences with EM:
  • Self-training only uses the mode of the prediction distribution.
  • Unlike hard-EM, it can abstain: "I do not know the label."

• Differences with Co-training:
  • In co-training there are two views, and a model is learned in each view.
  • The model in one view trains the model in the other view by providing pseudo-labeled examples.

6

Bootstrapping

• Start with a few seed rules (typically high precision, low recall). Build an initial classifier.

• Use the classifier to label the unlabeled data.

• Extract new rules from the pseudo-labeled data and build the classifier for the next iteration.

• Exit if the labels for the unlabeled data are unchanged. Else, apply the classifier to the unlabeled data and continue.

7

Decision List (DL)

• A Decision List is an ordered set of rules.
• Given an instance x, the first applicable rule determines the class label.

• Instead of ordering the rules, we can give a weight to each of them.
• Among all rules applicable to an instance x, apply the rule which has the highest weight.

• The parameters are the weights, which specify the ordering of the rules.

Rules: If x has feature f → class k, with parameters θ_f,k
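A small sketch of a weighted decision list in Python; the rule representation (feature, label, weight triples) and the abstention threshold are illustrative assumptions:

```python
class DecisionList:
    """Weighted decision list: among the rules whose feature is active
    in the instance, apply the one with the highest weight."""

    def __init__(self, threshold=0.0):
        self.rules = {}          # (feature, label) -> weight theta_{f,k}
        self.threshold = threshold

    def add_rule(self, feature, label, weight):
        self.rules[(feature, label)] = weight

    def predict(self, features):
        best_weight, best_label = None, None
        for f in features:
            for (rf, label), w in self.rules.items():
                if rf == f and (best_weight is None or w > best_weight):
                    best_weight, best_label = w, label
        if best_weight is None or best_weight < self.threshold:
            return None          # abstain: no applicable rule is confident enough
        return best_label
```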

8

DL for Word Sense Disambiguation

(Yarowsky 1995)

• WSD: specify the most appropriate sense (meaning) of a word in a given sentence.

• Consider these two sentences:
  … company said the plant is still operating. → factory sense (+), features: (company, operating)
  … and divide life into plant and animal kingdom. → living organism sense (−), features: (life, animal)

• Sorted decision list (excerpt):
  If company → +1, confidence weight .97
  If life → −1, confidence weight .96
  …

Example: disambiguate 2 senses of "sentence"

• Seed rules:
  If context contains served → label +1, conf = 1.0
  If context contains reads → label −1, conf = 1.0

• The seed rules label 8 out of 303 unlabeled examples.

• These 8 pseudo-labeled examples provide 6 rules above the 0.95 threshold (including the original seed rules), e.g.
  If context contains read → label −1, conf = 0.953

• These 6 rules label 151 out of 303 unlabeled examples.

Example: disambiguate 2 senses of "sentence"

• These 151 pseudo-labeled examples provide 60 rules above the threshold, e.g.
  If context contains prison → label +1, conf = 0.989
  If previous word is life → label +1, conf = 0.986
  If previous word is his → label +1, conf = 0.983
  If next word is from → label −1, conf = 0.982
  If context contains relevant → label −1, conf = 0.953
  If context contains page → label −1, conf = 0.953

• After 5 iterations, 297/303 unlabeled examples are permanently labeled (no changes possible).

• Building the final classifier gives 67% accuracy on a test set of 515 sentences. With some "tricks" we can get 76% accuracy.
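A rough sketch of how such rule confidences can be derived from pseudo-labeled examples, using smoothed relative frequencies in the style of decision-list learning; the smoothing constant is an assumption, not a value from the talk:

```python
from collections import Counter

def extract_rules(pseudo_labeled, threshold=0.95, smooth=0.1, num_labels=2):
    """pseudo_labeled: iterable of (features, label) pairs, label in {+1, -1}.
    Returns (feature, label, confidence) rules whose smoothed relative
    frequency clears the confidence threshold."""
    feat_label = Counter()
    feat_total = Counter()
    for features, label in pseudo_labeled:
        for f in set(features):
            feat_label[(f, label)] += 1
            feat_total[f] += 1
    rules = []
    for (f, label), c in feat_label.items():
        conf = (c + smooth) / (feat_total[f] + num_labels * smooth)
        if conf >= threshold:
            rules.append((f, label, conf))
    return sorted(rules, key=lambda r: -r[2])        # highest-confidence rules first
```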

11

Brief History of Bootstrapping

• (Yarowsky 1995) used it with a Decision List base classifier for the Word Sense Disambiguation (WSD) task.
  • It achieved the same performance level as the supervised algorithm, using only a few seed examples as labeled training data.

• (Collins & Singer 1999) used it for the Named Entity Classification task with a Decision List base classifier.
  • Using only 7 initial rules, it achieved 91% accuracy.
  • It achieved the same performance level as Co-training (no need for 2 views).

• (Abney ACL 2002), in a paper about co-training, contrasts it with the Yarowsky algorithm. The initial analysis was abandoned later.

12

Brief History of Bootstrapping

• (Abney CL 2004) provided a new analysis of the Yarowsky algorithm.
  • It could not mathematically analyze the original Yarowsky algorithm, but introduced new variants (we will see them later).

• (Haffari & Sarkar UAI 2007) advanced Abney's analysis and gave a general framework that showed how the Yarowsky algorithm introduced by Abney is related to other SSL methods.

• (Eisner and Karakos 2005) examine the construction of seed rules for bootstrapping.

13

Analysis of the Yarowsky Algorithm

14

OriginalYarowskyAlgorithm

•  TheYarowskyalgorithmisabootstrappingalgorithmwithaDecisionListbaseclassifier.

•  Thepredictedlabelisk*iftheconfidenceoftheappliedruleisabovesomethresholdη.

•  Aninstancemaybecomeunlabeledinfutureiterations.

(Yarowsky1995)

15

Modified Yarowsky Algorithm

• Instead of the feature with the maximum score, we use the sum of the scores of all features active for an example to be labeled.

• The predicted label is k* if the confidence of the applied rule is above the threshold 1/K.
  • K is the number of labels.

• An instance must stay labeled once it becomes labeled, but the label may change.

• These are the conditions in all the algorithms we will analyze in the rest of the talk.
  • Analyzing the original Yarowsky algorithm is still an open question.

(Abney 2004)
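A sketch of this labeling rule, assuming per-feature scores θ_{f,k} stored in a dictionary; the representation is illustrative, not Abney's pseudocode:

```python
def label_instance(features, theta, num_labels):
    """Modified Yarowsky labeling: sum the scores of all active features
    for each label; predict the argmax label k* only if its normalized
    score exceeds 1/K, otherwise abstain."""
    scores = [sum(theta.get((f, k), 0.0) for f in features) for k in range(num_labels)]
    total = sum(scores)
    if total == 0.0:
        return None                                  # no active feature has any weight
    probs = [s / total for s in scores]
    k_star = max(range(num_labels), key=lambda k: probs[k])
    return k_star if probs[k_star] > 1.0 / num_labels else None
```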

16

Bipartite Graph Representation

[Figure: a bipartite graph between features F and instances X.
  Instance (+1) "… company said the plant is still operating" is connected to the features company and operating.
  Instance (−1) "… divide life into plant and animal kingdom" is connected to the features life and animal.
  The remaining instances are unlabeled.]

(Cordunneanu 2006, Haffari & Sarkar 2007)

We propose to view bootstrapping as propagating the labels of the initially labeled nodes to the rest of the graph nodes.
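A minimal sketch of building this bipartite view, assuming each instance is represented by its set of active features; the propagation algorithms below operate on this adjacency structure:

```python
def build_bipartite_graph(instances):
    """instances: list of feature sets, one per instance.
    Returns both adjacency maps: instance index -> features,
    and feature -> set of instance indices."""
    inst_to_feats, feat_to_insts = {}, {}
    for i, feats in enumerate(instances):
        inst_to_feats[i] = set(feats)
        for f in feats:
            feat_to_insts.setdefault(f, set()).add(i)
    return inst_to_feats, feat_to_insts
```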

17

Self-Training on the Graph

(Haffari & Sarkar 2007)

[Figure: the bipartite graph of features F and instances X. π_x and q_x are labeling distributions attached to instance node x, and θ_f is the labeling distribution attached to feature node f; for example, q_x = (1, 0) for a labeled instance, and θ_f = (.7, .3) or (.6, .4) for feature nodes.]

18

Goals of the Analysis

• To find reasonable objective functions for the self-training algorithms on the bipartite graph.

• The objective functions may shed light on the empirical success of different DL-based self-training algorithms.

• They can tell us what kinds of properties in the data are well exploited and captured by the algorithms.

• They are also useful in proving the convergence of the algorithms.

• KL-divergence is a measure of distance between two probability distributions:

• Entropy H is a measure of randomness in a distribution:

• The objective function:
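For reference, the standard definitions, and one natural form of the edge-wise objective on the feature-instance graph (a sketch; the exact variant on the original slide may differ), are:

```latex
\mathrm{KL}(p \,\|\, q) = \sum_{k} p_k \log \frac{p_k}{q_k}, \qquad
H(p) = -\sum_{k} p_k \log p_k, \qquad
\min_{q,\,\theta} \; \sum_{x \in X} \sum_{f \in F_x} \mathrm{KL}\!\left(q_x \,\|\, \theta_f\right)
```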

19

Objective Function

[Figure: the objective is defined over the edges of the feature-instance bipartite graph, with features F on one side and instances X on the other.]

20

The Bregman Distance

• Given a strictly convex function ψ, the Bregman distance Bψ between two probability distributions is defined as:

• The ψ-entropy Hψ is defined as:

• Examples:
  – If ψ(t) = t log t, then Bψ(α, β) = KL(α, β).
  – If ψ(t) = t², then Bψ(α, β) = Σ_i (α_i − β_i)².

• The generalized objective function:

[Figure: a strictly convex function ψ(t), with the Bregman distance shown as the gap at α_i between ψ and the tangent to ψ at β_i.]
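For reference, the standard forms (consistent with the examples above: ψ(t) = t log t recovers KL and Shannon entropy) are:

```latex
B_\psi(\alpha, \beta) = \sum_i \Big( \psi(\alpha_i) - \psi(\beta_i) - \psi'(\beta_i)\,(\alpha_i - \beta_i) \Big), \qquad
H_\psi(p) = -\sum_i \psi(p_i)
```

The generalized objective then, presumably, replaces each KL term in the earlier objective with B_ψ and each entropy term with H_ψ over the same graph.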

21

Generalizing the Objective Function

[Figure: the generalized objective defined over the feature-instance bipartite graph.]

22

OptimizingtheObjectiveFunctions

•  Inwhatfollows,wementionsomespecificobjectivefunctionstogetherwiththeiroptimizationalgorithms.

•  TheseoptimizationalgorithmscorrespondtosomevariantsofthemodifiedYarowskyalgorithm.

•  Itisnoteasytocomeupwithalgorithmsfordirectlyoptimizingthegeneralizedobjectivefunctions.

23

Useful Operations

• Average: take the average distribution of the neighbors.

• Majority: take the majority label of the neighbors.

[Example: neighbors with distributions (.2, .8) and (.4, .6) average to (.3, .7); the majority operation maps them to (0, 1), since both put most of their mass on the second label.]
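A small sketch of the two operations, with label distributions represented as plain Python lists; the example in the comments mirrors the one on the slide:

```python
def average_op(neighbor_dists):
    """Average: component-wise mean of the neighbors' label distributions."""
    n, k = len(neighbor_dists), len(neighbor_dists[0])
    return [sum(d[i] for d in neighbor_dists) / n for i in range(k)]

def majority_op(neighbor_dists):
    """Majority: a 0/1 distribution on the label that most neighbors
    put the largest share of their mass on."""
    k = len(neighbor_dists[0])
    votes = [0] * k
    for d in neighbor_dists:
        votes[max(range(k), key=lambda i: d[i])] += 1
    winner = max(range(k), key=lambda i: votes[i])
    return [1.0 if i == winner else 0.0 for i in range(k)]

# average_op([[.2, .8], [.4, .6]])  -> [0.3, 0.7]
# majority_op([[.2, .8], [.4, .6]]) -> [0.0, 1.0]
```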

24

Analyzing Self-Training

Theorem. The following objective functions are optimized by the corresponding label-propagation algorithms on the bipartite graph:

[Table: objective functions over the feature-instance graph paired with the label-propagation algorithms that optimize them. One is related to graph-based SS learning (Zhu et al. 2003); another is Abney's variant of the Yarowsky algorithm. The algorithms converge in polynomial time, O(|F|²|X|²).]

25

What about Log-Likelihood?

• Initially, the labeling distribution is uniform for unlabeled vertices and a δ-like distribution for labeled vertices.

• By learning the parameters, we would like to reduce the uncertainty in the labeling distribution while respecting the labeled data: the negative log-likelihood of the old and newly labeled data.

• Lemma. If m is the number of features connected to an instance, then: …

26

Connection between the two Analyses

Compare with Conditional Entropy Regularization (Grandvalet and Bengio 2005)!

\sum_{x \in L} \mathrm{KL}(q_x \,\|\, \pi_x) \;+\; \lambda \sum_{x \in U} H(\pi_x)

27

Experiments

Named Entity Classification

• 971,476 sentences from the NYT were parsed with the Collins parser.

• The task is to identify three types of named entities:
  1. Location (LOC)
  2. Person (PER)
  3. Organization (ORG)
  −1. not a NE, or "don't know"

28

(Collins and Singer, 1999)

Named Entity Classification

• Noun phrases were extracted that met the following conditions:
  1. The NP contained only words tagged as proper nouns.
  2. The NP appeared in one of the following two syntactic contexts:
     – Modified by an appositive whose head is a singular noun
     – In a prepositional phrase modifying an NP whose head is a singular noun

29

(Collins and Singer, 1999)


Examples:

  NP → NNP NNP NNPS: "International Business Machines" (all words tagged as proper nouns)

  …, says [NE Maury Cooper], a vice [CONTEXT president] at S.&P.

  …, fraud related to work on a federally funded sewage [CONTEXT plant in] [NE Georgia]

Named Entity Classification

• The task: classify NPs into LOC, PER, ORG.
• 89,305 training examples with 68,475 distinct feature types.
  – 88,962 were used in the CS99 experiments.
• 1000 test data examples (includes NPs that are not LOC, PER or ORG).
• Month names are easily identifiable as not named entities; removing them leaves 962 examples.
• Still, 85 NPs are not LOC, PER, or ORG.
• Clean accuracy is measured over 877 examples; Noisy accuracy over all 962.

31

(Collins and Singer, 1999)

Yarowsky Variants

• A trick from the Co-training paper (Blum and Mitchell 1998) is to be cautious: don't add all rules above the 0.95 threshold.

• Add only n rules per label (say 5) and increase this amount by n in each iteration.

• This changes the dynamics of learning in the algorithm, but not the objective function.

• Two variants: Yarowsky (basic), Yarowsky (cautious).

• Without a threshold: Yarowsky (no threshold).

32

(Abney 2004; Collins and Singer, 1999)
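A sketch of the cautious selection step, assuming rules have already been scored as (feature, label, confidence) triples; the growth schedule follows the n-per-label description above, and the names are illustrative:

```python
def cautious_select(scored_rules, iteration, n=5, threshold=0.95):
    """Cautious variant: among rules above the confidence threshold,
    keep only the top n * iteration rules for each label."""
    budget = n * iteration
    by_label = {}
    for feature, label, conf in scored_rules:
        if conf >= threshold:
            by_label.setdefault(label, []).append((conf, feature))
    selected = []
    for label, rules in by_label.items():
        rules.sort(reverse=True)                     # most confident rules first
        selected.extend((f, label, c) for c, f in rules[:budget])
    return selected
```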

Results

Learning Algorithm           Accuracy (Clean)   Accuracy (Noisy)
Baseline (all organization)  45.8               41.8
EM                           83.1               75.8
Yarowsky (basic)             80.7               73.5
Yarowsky (no threshold)      80.3               73.2
Yarowsky (cautious)          91                 83
DL-CoTrain                   91                 83

33

Number of Rules (basic)

34

[Plot: num. rules (0-5000) vs. iteration (1-9).]

Number of Rules (cautious)

35

[Plot: num. rules (0-7000) vs. iteration (0-450).]

Coverage (basic)

36

[Plot: coverage (0-0.8) vs. iteration (1-9).]

Coverage (cautious)

37

[Plot: coverage (0-0.8) vs. iteration (0-450).]

Accuracy (basic)

38

[Plot: accuracy (0-0.8) vs. iteration (1-10).]

Accuracy (cautious)

39

[Plot: accuracy (0-0.9) vs. iteration (0-450).]

Precision-Recall (basic)

40

[Plot: precision (0.82-0.94) vs. recall (0.5-0.85).]

Precision-Recall (cautious)

41

[Plot: precision (0.91-0.97) vs. recall (0.2-1.0).]

Seeds

• Selecting seed rules: what is a good strategy?
  – Frequency: sort by frequency of feature occurrence.
  – Contexts: sort by the number of other features a feature was observed with.
  – Weighted: sort by a weighted count of the other features observed with the feature, where weight(f) = count(f) / Σ_f' count(f').

42

(Eisner and Karakos 2005, Zagibalov and Carroll 2008)
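A sketch of the three orderings, assuming the counts are collected from the unlabeled training data as sets of active features per example; the normalization in the weighted variant is one reading of the formula above:

```python
from collections import Counter, defaultdict

def rank_seed_candidates(instances):
    """instances: list of feature sets from the unlabeled training data.
    Returns the features sorted under the three strategies:
    frequency, number of co-occurring contexts, and weighted contexts."""
    freq = Counter()
    cooc = defaultdict(set)
    for feats in instances:
        feats = set(feats)
        for f in feats:
            freq[f] += 1
            cooc[f] |= feats - {f}
    total = sum(freq.values())
    by_frequency = sorted(freq, key=lambda f: -freq[f])
    by_contexts = sorted(freq, key=lambda f: -len(cooc[f]))
    by_weighted = sorted(freq, key=lambda f: -sum(freq[g] / total for g in cooc[f]))
    return by_frequency, by_contexts, by_weighted
```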

Seeds

• In each case the frequencies were taken from the unlabeled training data.

• Seeds were extracted from the sorted list of features by manual inspection and assigned a label (the entire example was used).

• Location (LOC) features appear infrequently in all three orderings.

• It is possible that some good LOC seeds were missed.

43

Seeds

Number of Rules
(n/3 rules/label)   Frequency        Contexts         Weighted
                    Clean   Noisy    Clean   Noisy    Clean   Noisy
3                   84      77       84      77       88      80
9                   91      83       90      82       82      74
15                  91      83       91      83       85      77
7 (CS99)            91      83

44

Word Sense Disambiguation

• Data from (Eisner and Karakos 2005).

• Disambiguate two senses each for drug, duty, land, language, position, sentence (Gale et al. 1992).

• Source of unlabeled data: the 14M-word Canadian Hansards (English only).

• Two seed rules for each disambiguation task, from (Eisner and Karakos 2005).

45

Results

46

Learning Algorithm         drug          land           sentence
Seeds                      alcohol,      acres,         served,
                           medical       courts         reads
Train / Test size          134 / 386     1604 / 1488    303 / 515
Yarowsky (basic)           53.3          79.3           67.7
Yarowsky (no threshold)    52            79             64.8
Yarowsky (cautious)        55.9          79             76.1
DL-CoTrain (2 views:
  long distance vs.
  immediate context)       53.1          77.7           75.9

47

Self-Training for Machine Translation

48

Self-Training for SMT

[Diagram: train an SMT model M_FE on bilingual text (F, E); decode monolingual F text to produce translated text (F, E); select high-quality sentence pairs; use them to re-train the log-linear SMT model, and repeat.]

49

Selecting Sentence Pairs

• First, give scores:
  – Use the normalized decoder score
  – Confidence estimation method (Ueffing & Ney 2007)

• Then select based on the scores:
  – Importance sampling
  – Those whose score is above a threshold
  – Keep all sentence pairs

50

Re-training the SMT Model

• Use the new sentence pairs to train an additional phrase table and use it as a new feature function in the SMT log-linear model:
  – One phrase table trained on sentences for which we have the true translations
  – One phrase table trained on sentences with their generated translations
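A high-level sketch of the whole loop under the assumptions above; train_smt, decode, and score_pair are hypothetical hooks standing in for the underlying SMT system (Portage in these experiments), not its real API:

```python
def self_train_smt(train_smt, decode, score_pair, bitext, mono_source,
                   rounds=4, threshold=0.5):
    """Self-training for SMT. Assumed hook signatures:
       train_smt(list_of_corpora) -> model   (one phrase table per corpus)
       decode(model, src) -> hypothesis translation
       score_pair(model, src, hyp) -> quality score (e.g. normalized decoder score)
    """
    model = train_smt([bitext])                      # baseline: true bilingual text only
    for _ in range(rounds):
        pseudo = []
        for src in mono_source:
            hyp = decode(model, src)                 # translate monolingual sentences
            if score_pair(model, src, hyp) >= threshold:
                pseudo.append((src, hyp))            # keep only high-quality pairs
        # the generated pairs feed a second phrase table, used as an
        # additional feature function in the log-linear model
        model = train_smt([bitext, pseudo])
    return model
```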

51

Chinese to English (Transductive)

Selection             Scoring        BLEU%
Baseline                             31.8 ± .7
Keep all                             33.1
Importance sampling   Norm. score    33.5
                      Confidence     33.2
Threshold             Norm. score    33.5
                      Confidence     33.5

Bold: best result, italic: significantly better

NIST Eval-2004: train = 8.2M, test = 1788 (4 refs)
Train: news, magazines, laws + UN
Test: newswire, editorials, political speeches

We use Portage from NRC as the underlying SMT system (Ueffing et al., 2007)

52

Chinese to English (Inductive)

System                       BLEU%
Baseline                     31.8 ± .7
Add Chinese data, Iter 1     32.8
Add Chinese data, Iter 4     32.6
Add Chinese data, Iter 10    32.5

Bold: best result, italic: significantly better
Using importance sampling

BLEU% by genre:
               Before   After
editorials     30.7     31.3
newswire       30.0     31.1
speeches       36.1     37.3

53

Why does it work?

• Reinforces the parts of the phrase translation model which are relevant for the test corpus.

• Glue phrases from the test data are used to compose new phrases (most phrases still come from the original data).

54


Summary

• Should we ever use Co-training for Bootstrapping?

• Per-label cautiousness leads to effective bootstrapping.
  – Exploited in the Yarowsky algorithm, DL-CoTrain, Co-Boosting

• These dynamics can/should be examined more closely.
  – Perhaps using tools from the analysis of feature induction.

• Bootstrapping and self-training may be more effective than you might have thought.

55
