PhD Day: Adaptive Entity Linking

Upload: bianca-pereira

Post on 05-Dec-2014

DESCRIPTION

Presentation at the 5th NLP PhD Day at the National University of Ireland, Galway (Insight), on 16/10/2013

TRANSCRIPT

Page 1: PhD Day: Adaptive Entity Linking

www.insight-centre.org

Adaptive Entity Linking

PhD Day – October/2013
Bianca Pereira

Page 2: PhD Day: Adaptive Entity Linking

Agenda

• Motivation
• Problem
• Proposed Solution
• Experiments
• Next Steps

Page 3: PhD Day: Adaptive Entity Linking

Motivation

• Entity Linking creates links from mentions in text to entities from a structured knowledge base. It ...
– ... enables reusing knowledge already published on the web.
– ... can be used as the first step for ontology learning and population algorithms.
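As a minimal sketch of the task described on this slide, the snippet below links mention strings to knowledge-base URIs through a plain dictionary lookup. The toy `KB` table, its two URIs, and the `link_mentions` function are assumptions made for illustration; a real linker would also need mention detection and disambiguation.

```python
# Hypothetical toy knowledge base: surface form -> entity URI (illustrative only).
KB = {
    "Jimi Hendrix": "http://en.wikipedia.org/wiki/Jimi_Hendrix",
    "Bonn": "http://en.wikipedia.org/wiki/Bonn",
}

def link_mentions(text, kb=KB):
    """Return (mention, uri) pairs for every KB surface form found in text."""
    links = []
    for surface, uri in kb.items():
        # Naive substring match; no tokenisation and no disambiguation.
        if surface in text:
            links.append((surface, uri))
    return links

print(link_mentions("Jimi Hendrix played in Bonn."))
```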

Page 4: PhD Day: Adaptive Entity Linking

Problem

• Entity Linking has been performed using generic approaches.
• It does not work for all domains and types of text.
• There is no clear definition of “entity”.

Page 5: PhD Day: Adaptive Entity Linking

Problem

• Research Question: “How to adapt a general Entity Linking Approach to a Domain?”
• Philosophical Question: “What is an Entity?”

Page 6: PhD Day: Adaptive Entity Linking

Proposed Solution

• Usage of Linked Data datasets.
• AELA, a Framework for Adaptive Entity Linking.

Page 7: PhD Day: Adaptive Entity Linking

Experiments

• What is an Entity?
    • What has been identified as entities?
    • How to manually detect entities from text?
    • How does the definition of Entity change from one domain to another?

Page 8: PhD Day: Adaptive Entity Linking

Experiments

• What is an Entity?
    • What has been identified as entities?
    • How to manually detect entities from text?
    • How does the definition of Entity change from one domain to another?

Page 9: PhD Day: Adaptive Entity Linking

Experiments

• AIDA-CoNLL annotated dataset
– 1,387 Reuters documents (some of them are tables)
– Annotation of entities with links to Wikipedia.

Page 10: PhD Day: Adaptive Entity Linking

Experiments

• AIDA-CoNLL annotated dataset
– 1,387 Reuters documents (some of them are tables)
– Annotation of entities with links to Wikipedia.

?

Page 11: PhD Day: Adaptive Entity Linking

Experiments – AIDA CoNLL

• Proper Nouns: 5576
– Names starting with a capital letter
• Acronyms: 712
– Names with all letters in upper case
• Others: 20
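The three surface-form buckets above can be sketched as a small classifier. This is an assumption about how such a split could be computed, not the procedure used for the slides: the `classify_mention` name and the regular expression are hypothetical, and periods are treated as disqualifying for acronyms only so that a string like "U.S." falls with the proper nouns, as it does on the slides.

```python
import re
from collections import Counter

# Assumed rule: upper-case ASCII words joined by single spaces or hyphens.
ACRONYM_RE = re.compile(r"[A-Z]+(?:[ -][A-Z]+)*")

def classify_mention(mention):
    """Assign a mention string to one of the three surface-form buckets."""
    if ACRONYM_RE.fullmatch(mention):
        return "acronym"        # all letters in upper case
    if mention[:1].isupper():
        return "proper_noun"    # starts with a capital letter
    return "other"              # starts with a digit or a lower-case letter

sample = ["Jimi Hendrix", "BRUSSELS", "AN-NAHAR", "U.S.", "1860 Munich", "van der Sar"]
print(Counter(classify_mention(m) for m in sample))
```

Even this small sample shows the weakness of the rule: "BRUSSELS" is counted as an acronym although it is a city name written in upper case.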

Page 12: PhD Day: Adaptive Entity Linking

AIDA CoNLL – Proper Nouns

• German • British • European Commission • Germany • European Union • Britain • Commission • Franz Fischler • France • Spanish • Loyola de Palacio • Europe • Bonn • Hendrix

• U.S. • Jimi Hendrix • English • Nottingham • Australian • China • Taiwan • Taipei • Taiwan Strait • Ukraine • Taiwanese • Lien Chan • Chinese • Foreign Ministry

Page 14: PhD Day: Adaptive Entity Linking

AIDA CoNLL – “Acronyms”

• BRUSSELS • BSE • LONDON • BEIJING • FRANKFURT • GREEK • ATHENS • BAYERISCHE VEREINSBANK • SWEDISH • SWEDEN • JERUSALEM • TUNIS • KDPI • PUK

• KDP • MANAMA • UAE • DUBAI • BEIRUT • AN-NAHAR • AS-SAFIR • AD-DIYAR • CME • CHICAGO • MONTGOMERY • SNET • PHOENIX • PARIS

Page 15: PhD Day: Adaptive Entity Linking

AIDA CoNLL – Others

• interior ministry • neo-Nazi • neo-Nazism • post-Soviet • van der Sar • 1860 Munich • serie A • 1990 World Cup • 1992 European championship • 2,000 Guineas • 2000 Games • pan-Turkism • al-Akhbar • al-Ram

• 1997 FED CUP • 1998 World Cup • 1995 World Cup • 1. FC Cologne • post-Communist • cocker spaniels

Page 18: PhD Day: Adaptive Entity Linking

AIDA CoNLL

• SOCCER - GERMAN FIRST DIVISION RESULTS / STANDINGS.

BONN 1996-12-06

Results of German first division soccer matches played on Friday :
Bochum 2 Bayer Leverkusen 2
Werder Bremen 1 1860 Munich 1
Karlsruhe 3 Freiburg 0
Schalke 2 Hansa Rostock 0

Standings ( tabulated under played, won, drawn, lost, goals for goals against points ) :
Bayer Leverkusen 17 10 4 3 38 22 34
Bayern Munich 16 9 6 1 26 14 33
VfB Stuttgart 16 9 4 3 39 17 31
Borussia Dortmund 16 9 4 3 33 17 31
Karlsruhe 17 8 4 5 30 20 28
VfL Bochum 16 7 6 3 23 21 27
1. FC Cologne 16 8 2 6 31 27 26
Schalke 04 17 7 4 6 25 26 25
Werder Bremen 17 6 4 7 29 28 22
MSV Duisburg 16 5 4 7 16 22 19
SV 1860 Munich 17 4 6 7 25 31 18
FC St. Pauli 15 5 3 7 21 28 18
Fortuna Dusseldorf 16 5 3 8 13 24 18
Hamburger SV 16 4 5 7 20 25 17
Arminia Bielefeld 16 4 4 8 18 28 16
FC Hansa Rostock 17 4 3 10 19 26 15
Borussia Monchengladbach 16 4 3 9 12 22 15
SC Freiburg 17 4 1 12 20 40 13

Page 19: PhD Day: Adaptive Entity Linking

AIDA CoNLL – Some findings

• Syntactic structure does not help in all cases.
– Proper Nouns may not start with a capital letter.
– Not all words with all letters in upper case are Acronyms.
• There may be some “mention boundary” problems even in manual annotation.

Page 20: PhD Day: Adaptive Entity Linking

AIDA CoNLL

• 5596 entities
• 6308 different mention strings

Page 21: PhD Day: Adaptive Entity Linking

AIDA CoNLL

• 1110 entities with name variations.

http://en.wikipedia.org/wiki/New_York_Jets    New York Jets    NY JETS
http://en.wikipedia.org/wiki/Butch_Harmon    Butch Harmon    Butch
http://en.wikipedia.org/wiki/Norway    Norway    Norwegian
http://en.wikipedia.org/wiki/Cincinnati_Reds    Cincinnati Reds    CINCINNATI Reds
http://en.wikipedia.org/wiki/Republika_Srpska    Bosnian Serb    Republika Srpska
http://en.wikipedia.org/wiki/John_Smoltz    John Smoltz    Smoltz
http://en.wikipedia.org/wiki/Rede_Globo    TV Globo    Globo
http://en.wikipedia.org/wiki/London_Wasps    London    Wasps
http://en.wikipedia.org/wiki/Chicago_Cubs    CHICAGO    CUBS    Chicago Cubs
http://en.wikipedia.org/wiki/England_cricket_team    ENGLAND    Englishmen
http://en.wikipedia.org/wiki/Alexander_Downer    Alexander Downer    Downer
http://en.wikipedia.org/wiki/Wales    Wales    Welsh
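A count of entities with name variations can be sketched by grouping the distinct mention strings under each entity URI. The `(mention, uri)` pair format and the `name_variations` function are assumptions for illustration, not the AIDA CoNLL file format; the Bonn URI in the sample is likewise assumed.

```python
from collections import defaultdict

def name_variations(annotations):
    """Map each entity URI to its distinct mention strings,
    keeping only entities that have more than one."""
    by_entity = defaultdict(set)
    for mention, uri in annotations:
        by_entity[uri].add(mention)
    return {uri: names for uri, names in by_entity.items() if len(names) > 1}

# Hypothetical sample of (mention string, entity URI) annotations.
sample = [
    ("New York Jets", "http://en.wikipedia.org/wiki/New_York_Jets"),
    ("NY JETS", "http://en.wikipedia.org/wiki/New_York_Jets"),
    ("Norway", "http://en.wikipedia.org/wiki/Norway"),
    ("Norwegian", "http://en.wikipedia.org/wiki/Norway"),
    ("Bonn", "http://en.wikipedia.org/wiki/Bonn"),
]
variations = name_variations(sample)
print(len(variations), "entities with name variations")
```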

Page 23: PhD Day: Adaptive Entity Linking

AIDA CoNLL – Some findings

• Use of metonymy.
• Disambiguation (Norway vs. Norwegians).
• Mention of an entity using part of its name.

Page 24: PhD Day: Adaptive Entity Linking

AIDA CoNLL

• 434 ambiguous mention strings (corpus level)

French
    http://en.wikipedia.org/wiki/France
    http://en.wikipedia.org/wiki/France_national_football_team

NORTHAMPTON
    http://en.wikipedia.org/wiki/Northampton
    http://en.wikipedia.org/wiki/Northampton_Town_F.C.
    http://en.wikipedia.org/wiki/Northamptonshire_County_Cricket_Club
    http://en.wikipedia.org/wiki/Northampton_Saints

West
    http://en.wikipedia.org/wiki/Western_World
    http://en.wikipedia.org/wiki/American_League_West

Volkswagen AG
    http://en.wikipedia.org/wiki/Volkswagen
    http://en.wikipedia.org/wiki/Volkswagen_Group

EDMONTON
    http://en.wikipedia.org/wiki/Edmonton
    http://en.wikipedia.org/wiki/Edmonton_Oilers

Rangers
    http://en.wikipedia.org/wiki/Texas_Rangers_(baseball)
    http://en.wikipedia.org/wiki/Rangers_F.C.

Vatican
    http://en.wikipedia.org/wiki/Holy_See
    http://en.wikipedia.org/wiki/Vatican_Library
    http://en.wikipedia.org/wiki/Vatican_City

Shell
    http://en.wikipedia.org/wiki/Shell_Turbo_Chargers
    http://en.wikipedia.org/wiki/Shell_Oil_Company

Irish
    http://en.wikipedia.org/wiki/Republic_of_Ireland
    http://en.wikipedia.org/wiki/Republic_of_Ireland_national_football_team
    http://en.wikipedia.org/wiki/Northern_Ireland
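Corpus-level ambiguity can be sketched as the inverse grouping: collect the distinct entity URIs seen for each mention string anywhere in the corpus and keep the strings with more than one. The `(mention, uri)` input format and the `ambiguous_mentions` function are assumptions for illustration; the Bonn URI in the sample is also assumed.

```python
from collections import defaultdict

def ambiguous_mentions(annotations):
    """Map each mention string to the distinct entity URIs it was linked to,
    keeping only strings linked to more than one entity."""
    by_mention = defaultdict(set)
    for mention, uri in annotations:
        by_mention[mention].add(uri)
    return {m: uris for m, uris in by_mention.items() if len(uris) > 1}

# Hypothetical sample of (mention string, entity URI) annotations.
sample = [
    ("French", "http://en.wikipedia.org/wiki/France"),
    ("French", "http://en.wikipedia.org/wiki/France_national_football_team"),
    ("Rangers", "http://en.wikipedia.org/wiki/Texas_Rangers_(baseball)"),
    ("Rangers", "http://en.wikipedia.org/wiki/Rangers_F.C."),
    ("Bonn", "http://en.wikipedia.org/wiki/Bonn"),
]
ambiguous = ambiguous_mentions(sample)
print(sorted(ambiguous))
```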

Page 25: PhD Day: Adaptive Entity Linking

AIDA CoNLL

• 190 ambiguous mention strings (document level)

17 Iraq    BAGHDAD
    http://en.wikipedia.org/wiki/Baghdad
    http://en.wikipedia.org/wiki/Iraq

965testa SOCCER    SILVA
    http://en.wikipedia.org/wiki/Mario_Silva
    http://en.wikipedia.org/wiki/Mauro_Silva

1102testa SOCCER    WORLD CUP
    http://en.wikipedia.org/wiki/1998_FIFA_World_Cup
    http://en.wikipedia.org/wiki/FIFA_World_Cup

791 PRESS    Chinese
    http://en.wikipedia.org/wiki/People’s_Republic_of_China
    http://en.wikipedia.org/wiki/Chinese_language

179 Soccer    Liechenstein
    http://en.wikipedia.org/wiki/Liechtenstein_national_football_team
    http://en.wikipedia.org/wiki/Liechtenstein

703 Cricket    Pakistan
    http://en.wikipedia.org/wiki/Pakistan_national_cricket_team
    http://en.wikipedia.org/wiki/Pakistan

1323testb Frankfurt    Frankfurt
    http://en.wikipedia.org/wiki/Frankfurt_Stock_Exchange
    http://en.wikipedia.org/wiki/Frankfurt_am_Main

1054testa CRICKET    ENGLAND
    http://en.wikipedia.org/wiki/England_cricket_team
    http://en.wikipedia.org/wiki/England
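Document-level ambiguity differs only in the grouping key: a mention string counts as ambiguous when it is linked to more than one entity inside the same document. The sketch below assumes `(doc_id, mention, uri)` triples; the triple format, the `ambiguous_in_document` function, and the `doc-x` identifier are hypothetical.

```python
from collections import defaultdict

def ambiguous_in_document(annotations):
    """Map (doc_id, mention) to its distinct entity URIs,
    keeping only pairs linked to more than one entity."""
    by_key = defaultdict(set)
    for doc_id, mention, uri in annotations:
        by_key[(doc_id, mention)].add(uri)
    return {key: uris for key, uris in by_key.items() if len(uris) > 1}

# Hypothetical sample; "doc-x" is an invented document id.
sample = [
    ("703 Cricket", "Pakistan", "http://en.wikipedia.org/wiki/Pakistan_national_cricket_team"),
    ("703 Cricket", "Pakistan", "http://en.wikipedia.org/wiki/Pakistan"),
    ("17 Iraq", "BAGHDAD", "http://en.wikipedia.org/wiki/Baghdad"),
    ("17 Iraq", "BAGHDAD", "http://en.wikipedia.org/wiki/Iraq"),
    ("doc-x", "Pakistan", "http://en.wikipedia.org/wiki/Pakistan"),
]
per_document = ambiguous_in_document(sample)
print(sorted(per_document))
```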

Page 26: PhD Day: Adaptive Entity Linking

AIDA CoNLL – Some findings

• Even misspelled text is marked.
• “Classes” and “instances” are annotated.

Page 27: PhD Day: Adaptive Entity Linking

AIDA CoNLL

• 39 Classes

http://dbpedia.org/ontology/Agent    2579
http://xmlns.com/foaf/0.1/Person    426
http://dbpedia.org/ontology/Place    333
http://dbpedia.org/ontology/City    234
http://dbpedia.org/ontology/Country    194
http://dbpedia.org/ontology/AdministrativeRegion    76
http://dbpedia.org/ontology/Newspaper    55
http://dbpedia.org/ontology/ArchitecturalStructure    39
http://dbpedia.org/ontology/EthnicGroup    30
http://dbpedia.org/ontology/Airport    21
http://dbpedia.org/ontology/Event    18
http://dbpedia.org/ontology/Island    12
http://dbpedia.org/ontology/Film    10
http://dbpedia.org/ontology/BodyOfWater    10
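A class frequency table of this kind can be sketched by counting, for each class, how many linked entities carry it. The snippet assumes the entity-to-classes mapping has already been obtained (for example from a Linked Data dataset); the `class_frequencies` function and the type assignments in the sample are illustrative assumptions, not data from the slides.

```python
from collections import Counter

def class_frequencies(entity_types):
    """Count how many entities carry each class, most frequent first."""
    counts = Counter()
    for types in entity_types.values():
        counts.update(set(types))  # count each class once per entity
    return counts.most_common()

# Hypothetical mapping: entity URI -> class URIs (illustrative only).
sample = {
    "http://en.wikipedia.org/wiki/Bonn": [
        "http://dbpedia.org/ontology/Place",
        "http://dbpedia.org/ontology/City",
    ],
    "http://en.wikipedia.org/wiki/Norway": [
        "http://dbpedia.org/ontology/Place",
        "http://dbpedia.org/ontology/Country",
    ],
}
for class_uri, count in class_frequencies(sample):
    print(class_uri, count)
```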

Page 28: PhD Day: Adaptive Entity Linking

AIDA CoNLL – Some findings

• Not only Person, Location and Organization.

Page 29: PhD Day: Adaptive Entity Linking

Experiments

• How were those entities annotated?
• Which Wikipedia pages were chosen as representing entities?

Page 30: PhD Day: Adaptive Entity Linking

Experiments

• How were those entities annotated?
• Which Wikipedia pages were chosen as representing entities?
• What is the Annotation Guideline?

Page 31: PhD Day: Adaptive Entity Linking

Experiments

• What is an Entity?
    • What has been identified as entities?
    • How to manually detect entities from text?
    • How does the definition of Entity change from one domain to another?

Page 32: PhD Day: Adaptive Entity Linking

Experiments

• Survey on Annotation Guidelines
– Question: “Is there any guideline for entity annotation?”
– Search Strategy:
    • Papers from “entity annotation guidelines”.
    • Guidelines from annotated corpora provided by Entity Recognition, Disambiguation and Linking challenges.

Page 33: PhD Day: Adaptive Entity Linking

Experiments

• Survey on Annotation Guidelines
– Common Problems (differ from one domain to another)
    • Mention Boundaries
    • Name variations
    • Metonymy
– Annotation Process
– Evaluation

Page 34: PhD Day: Adaptive Entity Linking

Next Steps

• Corpus Sampling for Annotation
• Development of Annotation Guidelines
– Domain/Task dependent
– Iterative Process
• Domains:
– Touristic Domain (TripAdvisor corpus)
– Electronics Domain
– Other

Page 35: PhD Day: Adaptive Entity Linking

Next Steps

• What is an Entity?
    • What has been identified as entities?
    • How to manually detect entities from text?
    • How does the definition of Entity change from one domain to another?

Page 36: PhD Day: Adaptive Entity Linking

Next Steps

• What is an Entity?
    • What has been identified as entities?
    • How to manually detect entities from text?
    • How does the definition of Entity change from one domain to another?

• How to identify the most frequent classes in a domain?