Transcript
Page 1: Building Resources for Human and Computational Language ... · o Projeto PorPopular (Finatto et al. 2012) • Tabloids for low literacy readers Simple Corpora • Wikilivros Readability

Building Resources for Human and Computational Language Processing of Portuguese

Sílvio Cordeiro, Carlos Ramisch, Marco Idiart, Rodrigo Wilkens, Leonardo Zilio, Jorge Wagner, Aline Villavicencio

Federal University of Rio Grande do Sul (Brazil)
Aix Marseille Université, CNRS, LIF UMR 7279 (France)
University of Essex (UK)

Page 2:

MWE-aware processing with the mwetoolkit

Page 3:

Multiword Expressions in a Nutshell
•  A combination of words that must be treated as a unit at some level of linguistic processing (Calzolari et al., 2002)
    o  Compound nouns
    o  Verb-particle constructions
    o  Light-verb constructions
    o  Idioms

Page 4:

Multiword Expressions in a Nutshell
•  Lexical, syntactic, semantic, pragmatic, statistical idiosyncrasies
    o  ad hoc, wine and dine (Kim and Baldwin 2010)
•  Arbitrariness and institutionalisation
    o  salt and pepper, ?pepper and salt (Smadja, 1993)
•  Frequency
    o  Same order of magnitude as words in mental lexicon (Jackendoff, 1997)
•  Limited lexical, syntactic and semantic variability
    o  kick the bucket / ?pail / ?container (Sag et al., 2002)

Page 5:

MWEs and NLP
•  Real understanding requires MWE-aware corpus processing
1.  Corpus processing
2.  MWE discovery from corpus (to build MWE lexica)
3.  MWE representation (in lexicon and grammar)
4.  MWE token identification in corpus (to annotate MWEs)
5.  MWE semantic processing
6.  MWE integration in applications

Page 6:

mwetoolkit
•  Language independent framework for MWE processing
•  Extracts MWE from corpora
•  Annotates corpora with MWEs
•  Calculates AMs
•  Pre-processes MWEs in corpora for DSM construction
•  Imports DSMs (word2vec, glove, PPMI)
•  Provides functions for vector combinations
•  Calculates compositionality
•  Evaluates against gold standard

Project CAPES-COFECUB (France-Brazil)
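
Note (illustrative sketch, not from the slides): "AMs" presumably stands for association measures, of which pointwise mutual information (PMI) is a common one. The snippet below shows one way PMI could be computed for adjacent word pairs in a tokenized corpus. The function name `pmi_scores` and the toy corpus are hypothetical, and this is not the mwetoolkit's own implementation.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return PMI (log base 2) for every adjacent word pair in tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # probability of the pair
        p_w1 = unigrams[w1] / n_uni    # probability of the first word
        p_w2 = unigrams[w2] / n_uni    # probability of the second word
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus, invented for illustration only.
corpus = [["salt", "and", "pepper"], ["salt", "and", "pepper"], ["red", "pepper"]]
scores = pmi_scores(corpus)
```

Pairs that co-occur more often than their individual frequencies would predict receive higher scores, which is why such measures are used to rank MWE candidates.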

Page 7:

Overview of the mwetoolkit

Page 8:

MWE Type Discovery
•  Candidate extraction
    o  Pattern-based heuristics (e.g. noun-noun, verb-particle…)
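
Note (illustrative sketch, not from the slides): a pattern-based heuristic such as noun-noun can be approximated by scanning a POS-tagged corpus for adjacent tokens whose tags match the pattern. The function name `extract_noun_noun`, the tag label "NOUN" and the example sentences are assumptions made for this example, not the mwetoolkit's pattern syntax.

```python
from collections import Counter


def extract_noun_noun(tagged_sentences, noun_tag="NOUN"):
    """Count adjacent noun-noun pairs in sentences given as (word, tag) lists."""
    counts = Counter()
    for sentence in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if t1 == noun_tag and t2 == noun_tag:
                counts[(w1.lower(), w2.lower())] += 1
    return counts


# Hypothetical tagged input; a real corpus would come from a POS tagger.
tagged = [
    [("the", "DET"), ("cheese", "NOUN"), ("knife", "NOUN"), ("is", "VERB"), ("sharp", "ADJ")],
    [("a", "DET"), ("Cheese", "NOUN"), ("knife", "NOUN")],
]
candidates = extract_noun_noun(tagged)
```

The resulting candidate counts would then typically be filtered or ranked, for instance with an association measure.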

Page 9:

MWE-aware corpus processing
•  MWE token identification
    o  Pattern-based heuristics
    o  Contiguous/gappy identification
    o  Shortest/longest/all match distances
    o  Projecting extracted MWE types back in source corpus
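
Note (illustrative sketch, not from the slides): the difference between contiguous and gappy identification can be shown with a small matcher that allows up to `max_gap` intervening tokens between consecutive MWE words. The function `find_mwe` is hypothetical. It always takes the nearest occurrence of each next word, which roughly corresponds to a "shortest match" policy, and it does not reproduce the mwetoolkit's actual matching options.

```python
def find_mwe(tokens, mwe, max_gap=0):
    """Return a list of index lists where `mwe` occurs in `tokens`.

    max_gap=0 finds contiguous occurrences only; max_gap=k allows up to k
    intervening tokens between two consecutive words of the MWE.
    """
    matches = []
    for start, token in enumerate(tokens):
        if token != mwe[0]:
            continue
        indices = [start]
        pos = start
        for word in mwe[1:]:
            # Look at the next token plus up to max_gap further tokens.
            window = tokens[pos + 1 : pos + 2 + max_gap]
            if word not in window:
                break
            pos = pos + 1 + window.index(word)  # nearest occurrence
            indices.append(pos)
        else:
            matches.append(indices)
    return matches


sentence = "she took the decision into account".split()
contiguous = find_mwe(sentence, ["took", "into", "account"], max_gap=0)
gappy = find_mwe(sentence, ["took", "into", "account"], max_gap=2)
```

With `max_gap=0` the expression is missed because "the decision" intervenes; with `max_gap=2` it is found at token positions 1, 4 and 5.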

Page 10:

Annotation Options
•  Corpus but no MWE list:
    o  Generate MWE list from corpus and project them back
        •  Corpus → MWE list → Annotated Corpus
•  Corpus and MWE list
    o  Annotation based on external lists of MWEs
        •  Corpus + MWE list → Annotated Corpus

Page 11:

MWE semantic processing
•  Meaning of MWE may not be understood from meaning of individual words
    o  brick wall is a wall made of bricks,
    o  cheese knife is not a knife made of cheese → knife for cutting cheese (Girju et al., 2005).
    o  Loan shark is not a shark for loan but a person who offers loans at extremely high interest rates

Page 12:

How to detect compositionality?
•  Distributional Semantic Models (DSMs)
    o  Position words in multidimensional semantic space
•  Each word/MWE represented as a vector in the semantic space
    o  Proximity in space indicates semantic relatedness

[Figure: scale between compositionality and idiomaticity, with example compounds: access road, grandfather clock, cloud nine]

Page 13:

How to detect compositionality?
•  Cosine similarity between the MWE vector and the sum of the vectors of the component words
    o  The closer the vectors are, the more compositional the MWE is (Reddy et al. 2011)
    o  cos(w1w2 vector, w1 vector + w2 vector)
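
Note (illustrative sketch, not from the slides): the score cos(w1w2 vector, w1 vector + w2 vector) can be written in a few lines of plain Python. The function names and the toy two-dimensional vectors are assumptions for illustration; real vectors would come from a trained DSM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def compositionality(mwe_vec, w1_vec, w2_vec):
    """Cosine between the MWE vector and the sum of its component word vectors."""
    summed = [a + b for a, b in zip(w1_vec, w2_vec)]
    return cosine(mwe_vec, summed)


# Toy vectors, invented for illustration only.
literal_like = compositionality([1.0, 1.0], [1.0, 0.0], [0.0, 1.0])
idiomatic_like = compositionality([1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

A value near 1 suggests the compound behaves like the combination of its parts; a value near 0 suggests a more idiomatic reading.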

Page 14:

Distributional Semantic Models
•  Techniques and tools for constructing DSMs
    o  Dissect (Dinu et al., 2013), Minimantics (Ramisch et al. 2013), word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).

[Figure labels: Minimantics, word2vec, dissect, LexVec (Lexical Vectors)]

Page 15:

Gold Standards for Evaluation
•  Roller et al. (2013): 244 German compounds
    o  around 30 judgments by crowdsourcing, scale from 1 to 7
•  Farahmand et al. (2015): 1,042 English compounds
    o  4 expert judges, binary scale for non-compositionality and conventionality
•  Reddy et al. (2011): 90 English compounds
    o  around 30 judgments by crowdsourcing, scale from 0 to 5
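
Note (illustrative sketch, not from the slides): the slides do not say which statistic is used to compare model scores with these human judgments. A rank correlation such as Spearman's ρ is a common choice for graded scales, so the snippet below assumes it. All function names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores: model predictions vs. averaged human judgments.
predicted = [0.1, 0.5, 0.9]
human = [1.0, 3.0, 4.5]
rho = spearman(predicted, human)
```

Because only the ordering matters, the model's cosine scores and the human scale (for example 0 to 5) do not need to share units.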

Page 16:

DSMs and Compositionality
•  Dataset of nominal compounds with human judgments about literality/compositionality
    o  180 compounds for English, French and Portuguese
    o  Resource freely available
        •  http://pageperso.lif.univ-mrs.fr/~carlos.ramisch/?page=downloads/compounds&lang=en

Page 17:

DSMs and Compositionality
•  Dataset of Lexical Substitution of Nominal Compounds in Portuguese (LexSubNC)
    o  180 compounds for Portuguese
    o  Resource freely available
        •  http://pageperso.lif.univ-mrs.fr/~carlos.ramisch/?page=downloads/compounds&lang=en

Page 18:

Collecting Human Judgments
•  Judgments with Likert scale (0 to 5)
    o  For compound
    o  For w1 and w2 separately
•  Agreement for Portuguese
    o  For subset of annotators
        •  α = .52 for head,
        •  α = .36 for modifier
        •  α = .42 for compound
    o  Same annotator after 1 month: 0.59 for compound

Page 19:

Collecting Human Judgments - Agreement
•  Greater agreement between score for compound and head (or modifier) for extremes
    o  totally idiomatic and fully compositional
•  For PT and FR, compound score determined by score of the least literal word

Page 20:

Agreement
•  Most/least variation in scores (average ± σ score)
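
Note (illustrative sketch, not from the slides): an "average ± σ" summary per compound can be computed from the individual judgments and used to list the compounds with the most and least variation. The judgments below are invented, and using the population standard deviation is an assumption; the slides do not specify which variant was used.

```python
from statistics import mean, pstdev


def score_variation(judgments):
    """Map each compound to (average score, standard deviation of scores)."""
    return {compound: (mean(scores), pstdev(scores))
            for compound, scores in judgments.items()}


def rank_by_variation(judgments):
    """Return compounds ordered from most to least variation in scores."""
    stats = score_variation(judgments)
    return sorted(stats, key=lambda compound: stats[compound][1], reverse=True)


# Hypothetical Likert judgments (0 to 5), for illustration only.
judgments = {
    "access road": [5, 5, 5, 5],
    "cloud nine": [0, 1, 0, 5],
}
stats = score_variation(judgments)
ordering = rank_by_variation(judgments)
```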

Page 21:

The models
•  WaCky Corpora (Baroni et al., 2009):
    o  ukWaC for English (∼2 billion tokens)
    o  frWaC (∼1.6 billion tokens) for French
    o  brWaC (∼2.3 billion tokens) for Portuguese (Wagner Filho et al. 2016)
    o  Pre-processing
        •  surface+: the original corpus
        •  surface: with stopword removal.
        •  lemma: stopword removal and lemmatization;
        •  lemmaPOS: stopword removal, lemmatization and POS-tagging
    o  Context window size: 1, 4 and 8
    o  Dimension size: 250, 500, 750
•  DSMs
    o  PPMI models – positive PMI (Minimantics)
    o  GloVe (Pennington et al. 2014)
    o  Word2vec (Mikolov et al 2013) Skipgram, CBOW
    o  LexVec

[Figure panels: French, Portuguese, English]
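
Note (illustrative sketch, not from the slides): to make the "context window size" and "PPMI" settings concrete, the snippet below counts target/context co-occurrences within a symmetric window and converts them to positive PMI weights. It is a minimal, unoptimized illustration with hypothetical function names, not the Minimantics implementation.

```python
import math
from collections import Counter


def cooccurrences(sentences, window=1):
    """Count (target, context) pairs within `window` tokens on either side."""
    counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(target, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Turn co-occurrence counts into positive PMI weights (negative PMI becomes 0)."""
    total = sum(counts.values())
    target_totals = Counter()
    context_totals = Counter()
    for (target, context), n in counts.items():
        target_totals[target] += n
        context_totals[context] += n
    weights = {}
    for (target, context), n in counts.items():
        pmi = math.log2(n * total / (target_totals[target] * context_totals[context]))
        weights[(target, context)] = max(0.0, pmi)
    return weights


# Toy corpus, invented for illustration only.
narrow = cooccurrences([["a", "b", "c"]], window=1)
wide = cooccurrences([["a", "b", "c"]], window=2)
weights = ppmi(cooccurrences([["x", "y"], ["z", "w"]], window=1))
```

A larger window lets more distant words count as context, which is the effect the window sizes 1, 4 and 8 vary.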

Page 22:

Resources for Text Simplification for Portuguese

Rodrigo Wilkens, Leonardo Zilio, Marco Idiart, Jorge Wagner Filho, Eduardo Ferreira, Luis Mollmann, Bianca Pasqualini, Aline Villavicencio

Federal University of Rio Grande do Sul (Brazil)

Page 23:

Text Simplification (TS)
•  TS quality dependent on resources
    o  English:
        •  Corpora:
            o  Simple English Wikipedia parallel corpus, with alignments between the Simple and the Standard English Wikipedia
            o  Penn Treebank, British National Corpus, ukWaC
        •  Resources for Lexical Substitutions: WordNet, Roget, Moby
        •  Gold Standard Substitution Lists: SemEval Lexical Substitution Task
    o  Portuguese:
        •  Corpora
            o  PorSimples (Aluísio et al)
        •  Thesauri
            o  WordNet.Pt, OntoPT,

Page 24:

Text Simplification
•  Two main tasks (Shardlow, 2014):
    o  lexical simplification (LS),
        •  replacing complex expressions with simpler synonyms,
    o  syntactic simplification (SS)
        •  change the structure of a sentence by using simpler syntactic constructions (Siddharthan, 2002).
    o  TS with MT techniques for monolingual translation
        •  learning alignments between simple and standard sentences
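
Note (illustrative sketch, not from the slides): lexical simplification in its most basic form can be pictured as replacing each word that is not on a simple-word list with a listed synonym that is. The function `simplify_lexically`, the synonym dictionary and the word list are all invented for this example; a real system would also need to handle word sense, inflection and context.

```python
def simplify_lexically(tokens, synonyms, simple_words):
    """Replace tokens outside `simple_words` with their first simple synonym, if any."""
    output = []
    for token in tokens:
        replacement = token
        if token.lower() not in simple_words:
            for candidate in synonyms.get(token.lower(), []):
                if candidate in simple_words:
                    replacement = candidate
                    break
        output.append(replacement)
    return output


# Hypothetical resources, for illustration only.
synonyms = {"purchase": ["acquire", "buy"]}
simple_words = {"he", "will", "buy", "a", "car"}
simplified = simplify_lexically(["he", "will", "purchase", "a", "car"], synonyms, simple_words)
```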

Page 25:

General Corpora
•  WaCky (Baroni et al. 2009)
    o  ukWaC (Baroni et al. 2009)
    o  brWaC (Boos et al. 2014, Wagner Filho et al. 2018)
•  Crawling, from medium frequency content words as seeds
    o  Linguateca Corpora Frequency List
•  Cleaning
    o  HTML and boilerplate stripping, using density metrics and shallow text features
•  Near-duplicate detection and removal
    o  pairwise comparison of all documents
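
Note (illustrative sketch, not from the slides): the slides only say that near-duplicates are found by pairwise comparison of all documents. One common way to do such a comparison, assumed here, is Jaccard similarity over word n-gram "shingles". The function names, the shingle size and the 0.8 threshold are hypothetical choices.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (the whole text if it is shorter than n)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(documents, threshold=0.8, n=3):
    """Compare all document pairs and return index pairs at or above the threshold."""
    sets = [shingles(doc, n) for doc in documents]
    return [(i, j) for i, j in combinations(range(len(documents)), 2)
            if jaccard(sets[i], sets[j]) >= threshold]


# Toy documents, invented for illustration only.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about corpora",
]
pairs = near_duplicates(docs)
```

Comparing every pair is quadratic in the number of documents, so at web-corpus scale it is usually combined with some form of hashing or blocking.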

Page 26:

Simple Corpora
•  For English,
    o  Simple English Wikipedia aligned with English Wikipedia (Coster and Kauchak, 2011)
•  For Portuguese
    o  Coleção É Só o Começo (Wilkens et al. 2014)
        •  5 books manually simplified by linguists.
    o  Caseli et al. 2009
        •  manually annotated corpus of syntactic and lexical simplifications
    o  WikiJunior
        •  illustrated books for children up to 12 years old.
    o  Projeto PorPopular (Finatto et al. 2012)
        •  Tabloids for low literacy readers

Page 27:

Simple Corpora
•  Wikilivros Readability Corpus (WRC)
    o  Book library from Wikilivros
        •  L1: 33 books from 1st to 9th grades
        •  L2: 65 books from 10th to 12th grades
        •  L3: 21 books for college education
•  Readability Assessed WaC (RAW)
    o  readability assessment module (Wagner Filho et al. 2016)
        •  intermediate module of readability assessment
        •  several readability features used as features for classifier
    o  129,000 sentences from L1,
        •  13.5 words per sentence
    o  236,000 sentences from L2
        •  15.2 words per sentence
    o  96,000 sentences from L3,
        •  17.4 words per sentence
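
Note (illustrative sketch, not from the slides): statistics such as "words per sentence" are simple surface features that a readability classifier can use. The snippet below computes two of them with whitespace tokenization. The function name and the second feature (characters per word) are assumptions for illustration, not the feature set of the cited module.

```python
def readability_features(sentences):
    """Compute simple surface features from a list of sentence strings."""
    token_lists = [sentence.split() for sentence in sentences]
    n_sentences = len(token_lists)
    n_words = sum(len(tokens) for tokens in token_lists)
    n_chars = sum(len(word) for tokens in token_lists for word in tokens)
    return {
        "words_per_sentence": n_words / n_sentences if n_sentences else 0.0,
        "chars_per_word": n_chars / n_words if n_words else 0.0,
    }


# Toy input, invented for illustration only.
features = readability_features(["a bb ccc", "dddd"])
```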

Page 28:

Multiword Expressions (MWEs)
•  For English
    •  NomLex, WordNet
    •  Verb-Particle Constructions (Baldwin 2005)
    •  Compound Nouns (Nakov 2010, Reddy et al 2011, Yazdani et al 2015, Ramisch et al 2016)
•  For Portuguese
    •  Light Verb Constructions
        o  Duran et al. 2011
    •  Compound Nouns
        o  Noun Preposition Nouns from Europarl (Zilio et al. 2016)
            •  Parsing-based (FIPS) combined with Statistical (PMI)
        o  Noun Adjective (Cordeiro et al. 2016)

Page 29:

Simple Words Lists
•  Manually created lists
    o  For English
        •  Oxford 3000
    o  For Portuguese
        •  3,853 words from Oxford 3000 translation complemented with most frequent words in corpora (Finatto et al. 2013)

Page 30:

Lexical Substitution
•  Manual resources
    o  For English
        •  WordNet (Fellbaum 1998), Roget, Moby
    o  For Portuguese
        •  Onto.PT (Oliveira et al. 2010),
        •  OpenWN-PT (Paiva et al. 2012),
        •  MultiWordnet (Branco et al.),
        •  WordNet.PT (Marrafa, 2002),
        •  WordNet.Br (Dias da Silva et al., 2008)

Page 31:

Lexical Substitution
•  Distributional Semantic Models
    o  GloVe, word2vec, Minimantics, Dissect, LexVec
•  Quality Evaluation
    o  For English
        •  WordNet-Based Synonymy Test (WBST) (Freitag et al. 2005)
        •  Word similarity and analogy tasks
    o  For Portuguese
        •  BabelNet-Based Semantic Gold Standard (B2SG) (Wilkens et al. 2016)
            o  synonymy, antonymy and hypernymy for nouns and verbs
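
Note (illustrative sketch, not from the slides): assuming a multiple-choice format for synonymy tests (a target word, several candidates, one correct answer), a DSM can be scored by picking the candidate whose vector is closest to the target. The function names, the question format and the toy vectors below are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_synonym(target, candidates, vectors):
    """Return the candidate whose vector has the highest cosine with the target."""
    return max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))


def accuracy(questions, vectors):
    """Fraction of (target, candidates, gold) questions answered correctly."""
    correct = sum(pick_synonym(target, candidates, vectors) == gold
                  for target, candidates, gold in questions)
    return correct / len(questions)


# Toy vectors and a single invented question, for illustration only.
vectors = {
    "big": [1.0, 0.1],
    "large": [0.9, 0.2],
    "small": [-1.0, 0.0],
    "green": [0.0, 1.0],
}
questions = [("big", ["small", "large", "green"], "large")]
score = accuracy(questions, vectors)
```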

Page 32:

Semantic Role Labeling (SRL)
•  For word substitution in context
    o  For English
        •  FrameNet (Baker et al. 1998), PropBank (Kingsbury et al. 2002)
    o  For Portuguese
        •  PropBank.Br (Duran and Aluísio 2012), VerbNet.Br (Scarton 2013), and FrameNet Brasil (Salomao, 2009).
        •  VerbLexPor (Zilio et al. 2016)
            o  Cardiology papers vs. Newspaper articles
                •  15,281 annotated arguments (4,192 in CARD and 11,089 in DG)

Page 33:

Resources in Numbers

Page 34:

Conclusion and Future Work
•  English and Portuguese
    o  Difference in magnitude and availability of manually constructed resources
•  Alternative: language independent methods
    o  Extrapolate from manually created resources
    o  Corpora
        •  brWaC (Wagner Filho et al. 2016)
        •  Readability Assessed WaC (Wagner Filho et al. 2016, Wagner Filho et al. 2018)
    o  Distributional Semantic Models
        •  LexVec (Salle et al. 2016)
    o  Gold standards
        •  NC Compositionality Dataset (Cordeiro et al. 2016)
        •  NC LexSub (Cordeiro et al. 2017)
        •  BabelNet-Based Semantic Gold Standard (B2SG) (Wilkens et al. 2016)

Page 35:
Page 36:

Acknowledgments
•  This work has been funded by the
    •  Brazilian Agency CNPq (482520/2012-4 and 312114/2015-0)
    •  Projects “Simplificação Textual de Expressões Complexas”, sponsored by Samsung Eletrônica da Amazônia Ltda. under the terms of Brazilian federal law No. 8.248/91, and
    •  French Agence Nationale pour la Recherche through projects PARSEME-FR (ANR-14-CERA-0001) and ORFEO (ANR-12-CORP-0005), and by
    •  French-Brazilian cooperation projects CAMELEON (CAPES-COFECUB 707/11) and AIM-WEST (FAPERGS-INRIA 1706-2551/13-7).


