Tim Finin: “From Strings to Things: Populating Knowledge Bases from Text”

From Strings to Things: Populating Knowledge Bases from Text. Tim Finin, University of Maryland, Baltimore County. IBM Cognitive Systems Institute Speaker Series, July 14, 2016. http://ebiq.org/r/369

Upload: diannepatricia

Post on 11-Jan-2017


TRANSCRIPT

From Strings to Things: Populating Knowledge Bases from Text

Tim Finin, University of Maryland, Baltimore County

IBM Cognitive Systems Institute Speaker Series, July 14, 2016

http://ebiq.org/r/369

The Web is our greatest knowledge source

But it was designed for people, not machines
•  Its content is mostly text, spoken language, images and videos
•  These are easy for people to understand
•  But hard for machines

Machines need access to this knowledge too

We need to add knowledge graphs
•  High quality semi-structured information about entities and relations
•  Represented and accessed via Web standards
•  Easily integrated, fused and reasoned with

Who wrote Tom Sawyer?

Almost all commercial recipe sites embed semantic data about the recipes in an RDF-compatible form using terms from the schema.org ontology.

Search engines read and use this data to better understand the semantics of the page content.
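
As a rough illustration of what such embedded data can look like, the sketch below reads a JSON-LD block that uses schema.org terms. The recipe and its values are invented for the example; only the property names (`name`, `recipeIngredient`, `cookTime`) are schema.org vocabulary, and the helper `summarize_recipe` is a hypothetical name, not part of any search engine's pipeline.

```python
import json

# Hypothetical JSON-LD of the kind a recipe page might embed.
# The recipe is made up; the property names are schema.org terms.
PAGE_DATA = """
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Example Pancakes",
  "recipeIngredient": ["2 cups flour", "2 eggs", "1 cup milk"],
  "cookTime": "PT15M"
}
"""


def summarize_recipe(jsonld_text):
    """Return a small summary dict if the block describes a Recipe, else None."""
    data = json.loads(jsonld_text)
    if data.get("@type") != "Recipe":
        return None
    return {
        "name": data.get("name"),
        "ingredients": data.get("recipeIngredient", []),
        "cook_time": data.get("cookTime"),  # ISO 8601 duration string
    }


recipe = summarize_recipe(PAGE_DATA)
print(recipe["name"], "-", len(recipe["ingredients"]), "ingredients")
```

Because the data is typed and keyed by shared vocabulary terms, a consumer does not need to parse the page's prose to find the ingredients or cooking time.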

Where does Knowledge come from
•  Existing knowledge graphs quickly developed with semi-structured data from Wikipedia, augmented with
   –  Structured databases (Geonames)
   –  Semi-structured data collections (CIA Factbook)
•  Maintaining and extending them and creating focused knowledge graphs will increasingly require extracting information from text

NIST Text Analysis Conference
Yearly evaluation workshops on natural language processing and related applications with large test collections and common evaluation procedures

TAC Tracks, 08 09 10 11 12 13 14 15 (one X per year in which the track ran)
•  Question Answering: X
•  Textual Entailment: X X X X
•  Summarization: X X X X X
•  Knowledge Base Population: X X X X X X X
•  Entity Discovery & Linking: X X
•  Event Detection: X X
•  Validation: X

http://www.nist.gov/tac

Knowledge Base Population
Tracks focused on building a knowledge base from entities and relations extracted from text
•  Cold Start KBP: construct a KB from text
•  Entity discovery & linking: cluster and link entity mentions
•  Slot filling
•  Slot filler validation
•  Sentiment
•  Events: discover and cluster events in text

Knowledge Base Population*
Build a system that can
•  Read 90K documents: newswire articles & social media posts in English, Chinese and Spanish
•  Find entity mentions, types and relations
•  Cluster entities within and across documents and link to a reference KB when appropriate
•  Remove errors (Obama born in Illinois), draw sound inferences (Malia and Sasha sisters)
•  Generate triples with provenance data (Doc + string offsets) for mentions and relations
* 2016

<DOC id="APW_ENG_20100325.0021" type="story">
<HEADLINE>Divorce attorney says Dennis Hopper is dying</HEADLINE>
<DATELINE>LOS ANGELES 2010-03-25 00:15:51 UTC</DATELINE>
<TEXT>
<P>Dennis Hopper's divorce attorney says in a court filing that the actor is dying and can't undergo chemotherapy as he battles prostate cancer.</P>
<P>Attorney Joseph Mannis described the "Easy Rider" star's grave condition in a declaration filed Wednesday in Los Angeles Superior Court.</P>
<P>Mannis and attorneys for Hopper's wife Victoria are fighting over when and whether to take the actor's deposition.</P>
…

…
:e00211 type PER
:e00211 link FB:m.02fn5
:e00211 link WIKI:Dennis_Hopper
:e00211 mention "Dennis Hopper" APW_021:185-197
:e00211 mention "Hopper" APW_021:507-512
:e00211 mention "Hopper" APW_021:618-623
:e00211 mention "丹尼斯·霍珀" CMN_011:930-936
:e00211 per:spouse :e00217 APW_021:521-528
:e00217 per:spouse :e00211 APW_021:521-528
:e00211 per:age "72" APW_021:521-528
…
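
The triples-with-provenance output shown on this slide can be read back with a few lines of code. The sketch below is only an illustration: it assumes whitespace-separated fields, double-quoted strings, and a `DOC:start-end` provenance field as in the slide's sample. The actual TAC serialization has more rules than this, and the names `Assertion` and `parse_line` are hypothetical.

```python
import shlex
from typing import NamedTuple, Optional


class Assertion(NamedTuple):
    subject: str
    predicate: str
    obj: str
    doc: Optional[str]    # provenance document id, if present
    start: Optional[int]  # provenance start offset, if present
    end: Optional[int]    # provenance end offset, if present


def parse_line(line):
    """Parse one 'subject predicate object [DOC:start-end]' line."""
    fields = shlex.split(line)  # keeps quoted strings such as "Dennis Hopper" together
    subject, predicate, obj = fields[:3]
    doc = start = end = None
    if len(fields) > 3:
        doc, _, span = fields[3].rpartition(":")
        start, end = (int(x) for x in span.split("-"))
    return Assertion(subject, predicate, obj, doc, start, end)


# A few lines copied from the slide's sample output.
SAMPLE = """
:e00211 type PER
:e00211 link FB:m.02fn5
:e00211 mention "Dennis Hopper" APW_021:185-197
:e00211 per:spouse :e00217 APW_021:521-528
:e00211 per:age "72" APW_021:521-528
"""

assertions = [parse_line(line) for line in SAMPLE.strip().splitlines()]
for a in assertions:
    print(a)
```

Keeping the document id and offsets next to every mention and relation is what lets an assessor jump from a submitted fact back to the text that supports it.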

KB Evaluation Methodology
•  Evaluating KBs extracted from 90K documents is non-trivial
•  TAC’s approach is simplified by:
   –  Fixing the ontology of entity types and relations
   –  Specifying a serialization as triples + provenance
   –  Sampling a KB using a set of queries grounded in an entity mention found in a document
•  Given a KB, we can evaluate its precision and recall for a set of queries

KB Evaluation Methodology
•  EX: What are the names of schools attended by the children of the entity mentioned in document #45611 at characters 401-412
   –  That mention is George Bush and the document context suggests it refers to the 41st U.S. president
   –  Query given in structured form using TAC ontology
•  Assessors determine good answers given text corpus and check submission results using provenance as needed
   –  Yale, Harvard, UT Austin, Tulane, Univ. of Virginia, Boston College, ...
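
Precision and recall over a set of queries can be sketched as follows. This is a simplified illustration, not the official TAC scorer: it assumes each query has a set of assessor-approved answers and a set of system answers, compares them by exact string match, and micro-averages over queries. The gold set reuses a few school names from the example above; the system answers are invented.

```python
def precision_recall(system_answers, gold_answers):
    """Precision and recall of one system answer set against the assessed answers."""
    system, gold = set(system_answers), set(gold_answers)
    correct = system & gold
    precision = len(correct) / len(system) if system else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


def micro_average(results):
    """Pool (system, gold) pairs from many queries into one precision/recall pair."""
    correct = sum(len(set(s) & set(g)) for s, g in results)
    n_system = sum(len(set(s)) for s, _ in results)
    n_gold = sum(len(set(g)) for _, g in results)
    precision = correct / n_system if n_system else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Illustrative query: schools attended by the children of the queried entity.
gold = {"Yale", "Harvard", "UT Austin", "Tulane"}
system = {"Yale", "Harvard", "MIT"}  # invented system output

p, r = precision_recall(system, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```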

TAC Ontology
•  Five basic entity types
   –  PER: people (John Lennon) or groups (Americans)
   –  ORG: organizations like IBM, MIT or US Senate
   –  GPE: geopolitical entity like Boston, Belgium or Europe
   –  LOC: locations like Central Park or the Rockies
   –  FAC: facilities like BWI or the Empire State Building
•  Entity Mentions
   –  Strings referencing entities by name (Barack Obama) or description (the President)
•  ~65 relations
   –  Relations hold between two entities: parent_of, spouse, employer, founded_by, city_of_birth, …
   –  Or between an entity & string: age, website, title, cause_of_death, ...

Kelvin
•  KELVIN: Knowledge Extraction, Linking, Validation and Inference
•  Information extraction system developed at the Human Language Technology Center of Excellence at JHU
•  Used in the TAC KBP tracks in 2012-2016 and various projects
•  Documents in, triples out (with provenance)

http://ebiq.org/p/730

Kelvin’s basic pipeline
1. Document-level analysis for entity & relation extraction
2. Cross-document co-reference to create initial KB
3. KB linking, inference and adjudication
4. Materialize the KB

Document-level processing (step 1)
•  Done in parallel on a grid-based system
•  Extract named entities, mentions, events, relations, values using several tools (BBN Serif, Stanford CoreNLP, custom systems)
•  Map results to TAC ontology
•  Resolve inconsistencies in entity mention chains
•  Produces “single document” KBs with TAC compliant triples and provenance data
•  CS15: 50K docs, 1.9M mentions, 800K entities, 2M facts

X-doc Coref: creating initial KB (step 2)
•  Cross-document co-reference creates an initial KB from the collection of single-document KBs
   –  Identify that Barack Obama entity in DOC32 is same individual as Obama in DOC342, etc.
•  Uses agglomerative clustering on similarity of entity mentions & co-mentioned entities
   –  Both docs also mention White House, Biden, D.C.
   –  Various string similarity measures, name frequency
•  Works well on English, Spanish, Chinese
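
The clustering idea can be sketched in a few lines. This is a toy version under stated assumptions, not Kelvin's implementation: it uses `difflib` string similarity as a stand-in for the string measures, Jaccard overlap of co-mentioned entities for context, an equal weighting of the two, average-link merging, and an arbitrary threshold of 0.6. DOC32 and DOC342 come from the slide; DOC77 and all context sets are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def entity_similarity(a, b, name_weight=0.5):
    """Blend name similarity with overlap of co-mentioned entities (weights assumed)."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    union = a["context"] | b["context"]
    context_sim = len(a["context"] & b["context"]) / len(union) if union else 0.0
    return name_weight * name_sim + (1 - name_weight) * context_sim


def cluster_similarity(c1, c2):
    """Average-link similarity between two clusters of document entities."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(entity_similarity(a, b) for a, b in pairs) / len(pairs)


def agglomerate(entities, threshold=0.6):
    """Repeatedly merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [[e] for e in entities]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_similarity(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # conservative: prefer under-merging to over-merging
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters


doc_entities = [
    {"doc": "DOC32", "name": "Barack Obama", "context": {"White House", "Biden", "D.C."}},
    {"doc": "DOC342", "name": "Obama", "context": {"White House", "Biden", "D.C."}},
    {"doc": "DOC77", "name": "Michelle Obama", "context": {"Princeton", "Chicago"}},
]

clusters = agglomerate(doc_entities)
for c in clusters:
    print(sorted(e["doc"] for e in c))
```

The shared context (White House, Biden, D.C.) is what pulls "Obama" and "Barack Obama" together even though the names only partly match, while the third entity stays separate.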

Kripke (step 2)
•  Kelvin clustered ~1M document entities into ~300K KB entities
•  Conservative, tending to under- rather than over-merge
•  Typical pattern is one large cluster and a long tail of small ones
•  Barack Obama had one cluster of 8K doc. entities and ~100 small ones

[Chart: sizes of the Obama clusters, axis from 0 to 10,000. Annotations: "Huge 8k cluster"; "72 singleton Obama clusters + 15 with just two"]

Inference and adjudication (step 3)
Inference and adjudication rules are used to
•  Delete relations violating KB constraints
   –  A person can’t be born in an organization
•  Add relations from sound inference rules
   –  Two people sharing a parent are siblings
   –  Born in place P1, P1 part_of P2 => Born in P2
•  Merge entities using rules
   –  Merge cities in same geo. region with same names
•  Add relations by default or heuristic rules
   –  A person is probably a citizen of their country of birth
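
A minimal sketch of the constraint and sound-inference rules follows, assuming facts are plain (subject, relation, object) tuples and entity types live in a dictionary. The relation names `born_in`, `parent_of`, `part_of` and `sibling` are simplified stand-ins rather than the actual TAC relation names, and the "John Doe" fact is invented to trigger the constraint.

```python
from itertools import permutations


def apply_rules(facts, types):
    """Drop constraint violations, then add sibling and born-in inferences."""
    # Constraint: a person can't be born in an organization.
    kept = {
        (s, r, o) for (s, r, o) in facts
        if not (r == "born_in" and types.get(o) == "ORG")
    }

    inferred = set()

    # Sound rule: two people sharing a parent are siblings.
    children = {}
    for s, r, o in kept:
        if r == "parent_of":
            children.setdefault(s, set()).add(o)
    for kids in children.values():
        for a, b in permutations(sorted(kids), 2):
            inferred.add((a, "sibling", b))

    # Sound rule: born in P1 and P1 part_of P2 => born in P2 (repeat to a fixed point).
    part_of = {(s, o) for s, r, o in kept if r == "part_of"}
    born = {(s, o) for s, r, o in kept if r == "born_in"}
    while True:
        new = {(p, big) for p, place in born for small, big in part_of if small == place} - born
        if not new:
            break
        born |= new
    inferred |= {(p, "born_in", place) for p, place in born}

    return kept | inferred


facts = {
    ("Barack Obama", "parent_of", "Malia"),
    ("Barack Obama", "parent_of", "Sasha"),
    ("Barack Obama", "born_in", "Honolulu"),
    ("Honolulu", "part_of", "Hawaii"),
    ("Hawaii", "part_of", "United States"),
    ("John Doe", "born_in", "IBM"),  # invented fact that violates the constraint
}
types = {"IBM": "ORG", "Honolulu": "GPE", "Hawaii": "GPE", "United States": "GPE"}

kb = apply_rules(facts, types)
for fact in sorted(kb):
    print(fact)
```

Default and heuristic rules (such as probable citizenship) are left out here on purpose, since they are not sound and would need separate handling.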

Entity Linking (step 3)
•  Simple strategy to link text entities to reference KB
•  TAC uses the last Freebase KB; our subset has
   –  ~4.5M entities and ~150M triples
   –  Names and text in English, Spanish and Chinese
•  Entity linking compares entity mentions with Freebase entity names & aliases, weighted by
   –  How often each mention is used for the entity
   –  Entity significance (log of inlinks (1..20))
•  Results: set of candidates and a score for each
   –  Reject candidates if there are too many or the scores are low
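
One way to read this scoring scheme is sketched below. Several details are assumptions, not taken from the talk: significance is computed as log2 of the inlink count clamped to the range 1..20, the score is the fraction of an entity's mentions that match a candidate's names or aliases times that significance, and the cutoffs `max_candidates` and `min_score` are arbitrary. The reference KB is a two-entry toy dictionary; the second entity id and both inlink counts are invented.

```python
import math


def significance(inlinks):
    """Assumed reading of 'log of inlinks (1..20)': log2, clamped to [1, 20]."""
    return min(20.0, max(1.0, math.log2(max(inlinks, 1))))


def link(mention_counts, kb, max_candidates=3, min_score=2.0):
    """Score KB candidates for one text entity; return (best_id or None, ranked list)."""
    total = sum(mention_counts.values())
    scores = {}
    for entity_id, entry in kb.items():
        aliases = {a.lower() for a in entry["aliases"]}
        matched = sum(n for m, n in mention_counts.items() if m.lower() in aliases)
        if matched:
            scores[entity_id] = (matched / total) * significance(entry["inlinks"])
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Reject when there are no candidates, too many candidates, or the best score is low.
    if not ranked or len(ranked) > max_candidates or ranked[0][1] < min_score:
        return None, ranked
    return ranked[0][0], ranked


toy_kb = {
    "FB:m.02fn5": {"aliases": ["Dennis Hopper", "Hopper"], "inlinks": 4096},
    "FB:hypothetical_grace_hopper": {"aliases": ["Grace Hopper", "Hopper"], "inlinks": 1024},
}
mentions = {"Dennis Hopper": 1, "Hopper": 2}  # mention string -> count for one text entity

best, ranked = link(mentions, toy_kb)
print(best, ranked)
```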

KB-level merging rules (step 3)
•  Merge cities with same name in same state
•  Highly discriminative relations give evidence of sameness
   –  per:spouse is few to few
   –  org:top_level_emp is few to few
•  Merge PERs with similar names who were
   –  both CEOs of the same company, or
   –  both married to the same person
•  Merge entities linked to same Freebase entity

Slot Value Consolidation (step 3)
•  Problem: too many values for some slots, especially for ‘popular’ entities, e.g.
   –  An entity with four different per:age values
   –  Obama has >100 per:employee_of values
•  Strategy: rank values and select best
   –  Rank attested values by # of attesting docs
   –  Rank inferred values below attested values
   –  Select best value for single-valued slots
   –  Select best N values for multi-valued slots
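
The ranking strategy can be sketched as a single sort. This is an illustration under assumptions: each candidate value carries the set of documents attesting it and an `inferred` flag, and the default `n=3` for multi-valued slots is arbitrary. The sample ages and document ids are invented.

```python
def consolidate(values, single_valued, n=3):
    """Rank slot values and keep the best one (single-valued) or best n (multi-valued)."""
    ranked = sorted(
        values,
        # Attested values first (False sorts before True), then more attesting docs first.
        key=lambda v: (v["inferred"], -len(v["docs"])),
    )
    limit = 1 if single_valued else n
    return [v["value"] for v in ranked[:limit]]


# Invented candidates for a single-valued slot such as per:age.
age_values = [
    {"value": "73", "docs": {"D4"}, "inferred": False},
    {"value": "72", "docs": {"D1", "D2", "D3"}, "inferred": False},
    {"value": "71", "docs": set(), "inferred": True},
]

print(consolidate(age_values, single_valued=True))          # best single value
print(consolidate(age_values, single_valued=False, n=2))    # best two values
```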

Inference in Cold Start (step 3)
•  In CS we can only use sound inference rules, e.g.
   –  People are siblings if they share a parent
   –  People born in city X were born in region Y if X is in Y
•  Default and heuristic rules are not attested
   –  You probably resided in a city where you went to school
•  2014 base run found 848 attested sibling relations, inference added 132, 16% more
•  Drawing sound inferences from {good|bad} facts yields {good|bad} results

Materialize KB versions (step 4)
Some work is needed to encode our nascent KB in an actual KB system or serialization
•  TAC requires a custom serialization with specific requirements, e.g., at most four provenance strings of length ≤150 characters
   –  Pick best for slots with many provenance strings
•  RDF requires an OWL schema and using reification for metadata, e.g., mention and relation provenance

We use RDF for subsequent use

2015 TAC Results
•  TAC Cold Start Knowledge Base Population
   –  Eight teams participated in 2015 Cold Start KBP
   –  We placed 2nd and 3rd depending on the metric
•  TAC Entity Discovery and Linking
   –  Six teams submitted runs for all three languages, 8 for ENG, 7 for CMN and 7 for SPA
   –  We placed third for all languages, second for Chinese and third for English

Lessons Learned
•  We always have to mind precision & recall
•  Extracting information from text is inherently noisy; reading more text helps both
•  Using machine learning at every level is important
•  Making more use of probabilities will help
•  Extracting information about events is hard
•  Recognizing the temporal extent of relations is important, but still a challenge

Conclusion
•  KBs help in extracting information from text
•  The information extracted can update the KBs
•  The KBs provide support for new tasks, such as question answering and speech interfaces
•  We’ll see this approach grow and evolve in the future
•  New machine learning frameworks will result in better accuracy

For more information, [email protected]