PhD Day: Adaptive Entity Linking
Presentation at the 5th NLP PhD Day at the National University of Ireland, Galway (Insight), on 16/10/2013
www.insight-centre.org
Adaptive Entity Linking
PhD Day – October 2013
Bianca Pereira
Agenda
• Motivation
• Problem
• Proposed Solution
• Experiments
• Next Steps
Motivation
• Entity Linking creates links from mentions in text to entities in a structured knowledge base. It:
– enables the reuse of knowledge already published on the Web;
– can be used as the first step for ontology learning and population algorithms.
Problem
• Entity Linking has been performed using generic approaches.
• These approaches do not work for all domains and types of text.
• There is no clear definition of “entity”.
Problem
• Research Question: “How to adapt a general Entity Linking approach to a domain?”
• Philosophical Question: “What is an Entity?”
Proposed Solution
• Use of Linked Data datasets.
• AELA, a Framework for Adaptive Entity Linking.
Experiments
• What is an Entity?
• What has been identified as an entity?
• How to manually detect entities from text?
• How does the definition of Entity change from one domain to another?
Experiments
• AIDA-CoNLL annotated dataset
– 1,387 Reuters documents (some of them are tables)
– Annotation of entities with links to Wikipedia.
Experiments – AIDA CoNLL
• Proper Nouns: 5576
– Names beginning with a capital letter
• Acronyms: 712
– Names with all letters in upper case
• Others: 20
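The surface-form split above can be approximated with a simple rule-based check. The following is a minimal sketch, not the procedure behind the counts on the slide (which is not described); the function name `surface_category` and the order of the rules are assumptions made for illustration.

```python
from collections import Counter


def surface_category(mention: str) -> str:
    """Assign a mention string to a surface-form category (hypothetical rule)."""
    if not mention[:1].isupper():
        return "other"          # does not start with a capital letter
    if mention.isupper():
        return "acronym"        # every cased character is upper case
    return "proper_noun"        # starts with a capital letter, mixed case


# A few mention strings taken from the slides.
mentions = ["European Commission", "Jimi Hendrix", "BSE", "BRUSSELS",
            "interior ministry", "1860 Munich", "van der Sar"]
counts = Counter(surface_category(m) for m in mentions)
print(counts)
```

Under this rule BRUSSELS is labelled an acronym and van der Sar falls into "other", which shows how little the surface form alone says about what kind of name a mention is.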
AIDA CoNLL – Proper Nouns
• German • British • European Commission • Germany • European Union • Britain • Commission • Franz Fischler • France • Spanish • Loyola de Palacio • Europe • Bonn • Hendrix
• U.S. • Jimi Hendrix • English • Nottingham • Australian • China • Taiwan • Taipei • Taiwan Strait • Ukraine • Taiwanese • Lien Chan • Chinese • Foreign Ministry
AIDA CoNLL – “Acronyms”
• BRUSSELS • BSE • LONDON • BEIJING • FRANKFURT • GREEK • ATHENS • BAYERISCHE VEREINSBANK • SWEDISH • SWEDEN • JERUSALEM • TUNIS • KDPI • PUK
• KDP • MANAMA • UAE • DUBAI • BEIRUT • AN-NAHAR • AS-SAFIR • AD-DIYAR • CME • CHICAGO • MONTGOMERY • SNET • PHOENIX • PARIS
AIDA CoNLL – Others
• interior ministry • neo-Nazi • neo-Nazism • post-Soviet • van der Sar • 1860 Munich • serie A • 1990 World Cup • 1992 European championship • 2,000 Guineas • 2000 Games • pan-Turkism • al-Akhbar • al-Ram
• 1997 FED CUP • 1998 World Cup • 1995 World Cup • 1. FC Cologne • post-Communist • cocker spaniels
AIDA CoNLL
• SOCCER - GERMAN FIRST DIVISION RESULTS / STANDINGS.
BONN 1996-12-06
Results of German first division soccer matches played on Friday :
Bochum 2 Bayer Leverkusen 2
Werder Bremen 1 1860 Munich 1
Karlsruhe 3 Freiburg 0
Schalke 2 Hansa Rostock 0
Standings ( tabulated under played, won, drawn, lost, goals for goals against points ) :
Bayer Leverkusen 17 10 4 3 38 22 34
Bayern Munich 16 9 6 1 26 14 33
VfB Stuttgart 16 9 4 3 39 17 31
Borussia Dortmund 16 9 4 3 33 17 31
Karlsruhe 17 8 4 5 30 20 28
VfL Bochum 16 7 6 3 23 21 27
1. FC Cologne 16 8 2 6 31 27 26
Schalke 04 17 7 4 6 25 26 25
Werder Bremen 17 6 4 7 29 28 22
MSV Duisburg 16 5 4 7 16 22 19
SV 1860 Munich 17 4 6 7 25 31 18
FC St. Pauli 15 5 3 7 21 28 18
Fortuna Dusseldorf 16 5 3 8 13 24 18
Hamburger SV 16 4 5 7 20 25 17
Arminia Bielefeld 16 4 4 8 18 28 16
FC Hansa Rostock 17 4 3 10 19 26 15
Borussia Monchengladbach 16 4 3 9 12 22 15
SC Freiburg 17 4 1 12 20 40 13
AIDA CoNLL – Some findings
• Syntactic structure does not help in all cases.
– Proper nouns do not always begin with a capital letter.
– Not every word written entirely in upper case is an acronym.
• “Mention boundary” problems can occur even in manual annotation.
AIDA CoNLL
• 5596 entities
• 6308 different mention strings
AIDA CoNLL
• 1110 entities with name variations, e.g.:
– http://en.wikipedia.org/wiki/New_York_Jets (New York Jets; NY JETS)
– http://en.wikipedia.org/wiki/Butch_Harmon (Butch Harmon; Butch)
– http://en.wikipedia.org/wiki/Norway (Norway; Norwegian)
– http://en.wikipedia.org/wiki/Cincinnati_Reds (Cincinnati Reds; CINCINNATI Reds)
– http://en.wikipedia.org/wiki/Republika_Srpska (Bosnian Serb; Republika Srpska)
– http://en.wikipedia.org/wiki/John_Smoltz (John Smoltz; Smoltz)
– http://en.wikipedia.org/wiki/Rede_Globo (TV Globo; Globo)
– http://en.wikipedia.org/wiki/London_Wasps (London Wasps)
– http://en.wikipedia.org/wiki/Chicago_Cubs (CHICAGO CUBS; Chicago Cubs)
– http://en.wikipedia.org/wiki/England_cricket_team (ENGLAND; Englishmen)
– http://en.wikipedia.org/wiki/Alexander_Downer (Alexander Downer; Downer)
– http://en.wikipedia.org/wiki/Wales (Wales; Welsh)
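A count like the one above can be obtained by grouping mention strings by the entity they link to. The sketch below assumes the annotations are available as (mention string, entity URL) pairs; the function name `name_variations` and the Bonn link are illustrative assumptions, not part of the original analysis.

```python
from collections import defaultdict

WIKI = "http://en.wikipedia.org/wiki/"

# (mention string, linked entity) pairs; a tiny hand-picked sample.
annotations = [
    ("New York Jets", WIKI + "New_York_Jets"),
    ("NY JETS", WIKI + "New_York_Jets"),
    ("Norway", WIKI + "Norway"),
    ("Norwegian", WIKI + "Norway"),
    ("Bonn", WIKI + "Bonn"),  # assumed link, included to show the filter
]


def name_variations(pairs):
    """Return the entities referred to by more than one distinct mention string."""
    mentions_by_entity = defaultdict(set)
    for mention, entity in pairs:
        mentions_by_entity[entity].add(mention)
    return {e: m for e, m in mentions_by_entity.items() if len(m) > 1}


for entity, names in name_variations(annotations).items():
    print(entity, sorted(names))
```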
AIDA CoNLL – Some findings
• Use of metonymy.
• Disambiguation (Norway vs. Norwegians).
• Mention of an entity using only part of its name.
AIDA CoNLL
• 434 ambiguous mention strings (corpus level), e.g.:
– French: http://en.wikipedia.org/wiki/France http://en.wikipedia.org/wiki/France_national_football_team
– NORTHAMPTON: http://en.wikipedia.org/wiki/Northampton http://en.wikipedia.org/wiki/Northampton_Town_F.C. http://en.wikipedia.org/wiki/Northamptonshire_County_Cricket_Club http://en.wikipedia.org/wiki/Northampton_Saints
– West: http://en.wikipedia.org/wiki/Western_World http://en.wikipedia.org/wiki/American_League_West
– Volkswagen AG: http://en.wikipedia.org/wiki/Volkswagen http://en.wikipedia.org/wiki/Volkswagen_Group
– EDMONTON: http://en.wikipedia.org/wiki/Edmonton http://en.wikipedia.org/wiki/Edmonton_Oilers
– Rangers: http://en.wikipedia.org/wiki/Texas_Rangers_(baseball) http://en.wikipedia.org/wiki/Rangers_F.C.
– Vatican: http://en.wikipedia.org/wiki/Holy_See http://en.wikipedia.org/wiki/Vatican_Library http://en.wikipedia.org/wiki/Vatican_City
– Shell: http://en.wikipedia.org/wiki/Shell_Turbo_Chargers http://en.wikipedia.org/wiki/Shell_Oil_Company
– Irish: http://en.wikipedia.org/wiki/Republic_of_Ireland http://en.wikipedia.org/wiki/Republic_of_Ireland_national_football_team http://en.wikipedia.org/wiki/Northern_Ireland
AIDA CoNLL
• 190 ambiguous mention strings (document level), e.g.:
– Document “17 Iraq”, mention “BAGHDAD”: http://en.wikipedia.org/wiki/Baghdad http://en.wikipedia.org/wiki/Iraq
– Document “965testa SOCCER”, mention “SILVA”: http://en.wikipedia.org/wiki/Mario_Silva http://en.wikipedia.org/wiki/Mauro_Silva
– Document “1102testa SOCCER”, mention “WORLD CUP”: http://en.wikipedia.org/wiki/1998_FIFA_World_Cup http://en.wikipedia.org/wiki/FIFA_World_Cup
– Document “791 PRESS”, mention “Chinese”: http://en.wikipedia.org/wiki/People’s_Republic_of_China http://en.wikipedia.org/wiki/Chinese_language
– Document “179 Soccer”, mention “Liechenstein”: http://en.wikipedia.org/wiki/Liechtenstein_national_football_team http://en.wikipedia.org/wiki/Liechtenstein
– Document “703 Cricket”, mention “Pakistan”: http://en.wikipedia.org/wiki/Pakistan_national_cricket_team http://en.wikipedia.org/wiki/Pakistan
– Document “1323testb Frankfurt”, mention “Frankfurt”: http://en.wikipedia.org/wiki/Frankfurt_Stock_Exchange http://en.wikipedia.org/wiki/Frankfurt_am_Main
– Document “1054testa CRICKET”, mention “ENGLAND”: http://en.wikipedia.org/wiki/England_cricket_team http://en.wikipedia.org/wiki/England
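Corpus-level and document-level ambiguity differ only in the key used for grouping: the mention string alone, or the mention string together with its document. The sketch below assumes annotations stored as (document id, mention, entity URL) triples. The function name `ambiguous_mentions` and the placeholder document ids "doc-A" and "doc-B" are hypothetical, and the two Rangers links are only assumed to occur in different documents.

```python
from collections import defaultdict

WIKI = "http://en.wikipedia.org/wiki/"

# (document id, mention string, linked entity) triples.
# Mentions and links come from the slides; "doc-A"/"doc-B" are placeholders.
annotations = [
    ("1054testa CRICKET", "ENGLAND", WIKI + "England_cricket_team"),
    ("1054testa CRICKET", "ENGLAND", WIKI + "England"),
    ("doc-A", "Rangers", WIKI + "Texas_Rangers_(baseball)"),
    ("doc-B", "Rangers", WIKI + "Rangers_F.C."),
]


def ambiguous_mentions(triples, per_document=False):
    """Map each ambiguous mention (optionally per document) to its entities."""
    entities = defaultdict(set)
    for doc_id, mention, entity in triples:
        key = (doc_id, mention) if per_document else mention
        entities[key].add(entity)
    # A mention is ambiguous when it links to more than one entity.
    return {k: v for k, v in entities.items() if len(v) > 1}


corpus_level = ambiguous_mentions(annotations)
document_level = ambiguous_mentions(annotations, per_document=True)
print(sorted(corpus_level))
print(sorted(document_level))
```

In this sample "Rangers" is ambiguous only across the corpus, while "ENGLAND" is ambiguous within a single document, which is the harder case for a linker that relies on document context.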
AIDA CoNLL – Some findings
• Even misspelled text is marked.
• “Classes” and “instances” are annotated.
AIDA CoNLL
• 39 Classes
– http://dbpedia.org/ontology/Agent 2579
– http://xmlns.com/foaf/0.1/Person 426
– http://dbpedia.org/ontology/Place 333
– http://dbpedia.org/ontology/City 234
– http://dbpedia.org/ontology/Country 194
– http://dbpedia.org/ontology/AdministrativeRegion 76
– http://dbpedia.org/ontology/Newspaper 55
– http://dbpedia.org/ontology/ArchitecturalStructure 39
– http://dbpedia.org/ontology/EthnicGroup 30
– http://dbpedia.org/ontology/Airport 21
– http://dbpedia.org/ontology/Event 18
– http://dbpedia.org/ontology/Island 12
– http://dbpedia.org/ontology/Film 10
– http://dbpedia.org/ontology/BodyOfWater 10
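A class distribution of this kind can be produced by counting, for each class, how many linked entities carry it as a type. The sketch below is an assumption about how such a table might be built, not the method used for the slide. The function name `class_frequencies` is hypothetical, and the type assignments in `entity_types` are illustrative and have not been checked against DBpedia.

```python
from collections import Counter

DBO = "http://dbpedia.org/ontology/"

# Hypothetical entity -> classes mapping (e.g. gathered from rdf:type statements).
entity_types = {
    "Norway": [DBO + "Place", DBO + "Country"],
    "Baghdad": [DBO + "Place", DBO + "City"],
    "Edmonton": [DBO + "Place", DBO + "City"],
    "Jimi_Hendrix": [DBO + "Agent", "http://xmlns.com/foaf/0.1/Person"],
}


def class_frequencies(types_by_entity):
    """Count how many entities are typed with each class."""
    counts = Counter()
    for classes in types_by_entity.values():
        counts.update(set(classes))  # count each class once per entity
    return counts


for cls, n in class_frequencies(entity_types).most_common(3):
    print(n, cls)
```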
AIDA CoNLL – Some findings
• Not only Person, Location and Organization.
Experiments
• How were those entities annotated?
• Which Wikipedia pages were chosen to represent entities?
• What is the Annotation Guideline?
Experiments
• What is an Entity?
• What has been identified as an entity?
• How to manually detect entities from text?
• How does the definition of Entity change from one domain to another?
Experiments
• Survey on Annotation Guidelines
– Question: “Is there any guideline for entity annotation?”
– Search Strategy:
  • Papers found by searching for “entity annotation guidelines”.
  • Guidelines from annotated corpora provided by Entity Recognition, Disambiguation and Linking challenges.
Experiments
• Survey on Annotation Guidelines
– Common Problems (differ from one domain to another)
  • Mention Boundaries
  • Name variations
  • Metonymy
– Annotation Process
– Evaluation
Next Steps
• Corpus Sampling for Annotation
• Development of Annotation Guidelines
– Domain/Task dependent
– Iterative Process
• Domains:
– Touristic Domain (TripAdvisor corpus)
– Electronics Domain
– Other
Next Steps
• What is an Entity?
• What has been identified as an entity?
• How to manually detect entities from text?
• How does the definition of Entity change from one domain to another?
• How to identify the most frequent classes in a domain?