Semantic Analysis in Language Technology


Page 1: Semantic Analysis in Language Technology

Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Relation Extraction

Marina Santini
[email protected]

Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden

Spring 2016

Page 2: Semantic Analysis in Language Technology

Previous Lecture: Question Answering

Page 3: Semantic Analysis in Language Technology

Question Answering systems

• Factoid questions:
  • Google
  • Wolfram
  • Ask Jeeves
  • Start
  • …

• Approaches:
  • IR-based
  • Knowledge-based
  • Hybrid

Page 4: Semantic Analysis in Language Technology

Katz et al. (2006): http://start.csail.mit.edu/publications/FLAIRS0601KatzB.pdf

• START answers natural language questions by presenting components of text and multi-media information drawn from a set of information resources that are hosted locally or accessed remotely through the Internet.

• START targets high precision in its question answering.

• The START system analyzes English text and produces a knowledge base which incorporates, in the form of nested ternary expressions (= triples), the information found in the text.

Page 5: Semantic Analysis in Language Technology

Is it true?: http://uncyclopedia.wikia.com/wiki/Ask_Jeeves

• Ask Jeeves, more correctly known as Ask.com, is a search engine founded in 1996 in California.

• Initially it represented a stereotypical English butler who would "fetch" the answer to any question asked.

• Ask.com is now considered one of the great failures of the internet. The question and answer feature simply didn't work as well as hoped, and after trying his hand at being both a traditional search engine and a terrible kind of "artificial AI" with a bald spot…

• These days Jeeves is ranked as the 4th most successful search engine on the web, and the 4th most successful overall. This seems impressive until you consider that Google holds the top spot with 95% of the market. It has even fallen behind Bing; enough said.

Page 6: Semantic Analysis in Language Technology

Search engines that can be used as QA systems

• Yahoo
• Bing

Page 7: Semantic Analysis in Language Technology

Siri: http://en.wikipedia.org/wiki/Siri

• Siri /ˈsɪri/ is an intelligent personal assistant and knowledge navigator which works as an application for Apple Inc.'s iOS.

• The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.

• The software, both in its original version and as an iOS application, adapts to the user's individual language usage and individual searches (preferences) with continuing use, and returns results that are individualized.

• The name Siri is Scandinavian, a short form of the Norse name Sigrid meaning "beauty" and "victory", and comes from the intended name for the original developer's first child.

Page 8: Semantic Analysis in Language Technology

Chatterbots

• Siri… a conversational "safety net".
• Conversational agents (chatter bots and personal assistants) → customer care, customer analytics (replacing/integrating FAQs and help desks)

Avatar: a picture of a person or animal that represents you on a computer screen, for example in some chat rooms or when you are playing games over the Internet

Page 9: Semantic Analysis in Language Technology

Eliza: http://en.wikipedia.org/wiki/ELIZA
ELIZA was written at MIT by Joseph Weizenbaum between 1964 and 1966.

Page 10: Semantic Analysis in Language Technology

General IR architecture for factoid questions

[Diagram: Question → Question Processing (Query Formulation, Answer Type Detection) → Document Retrieval over indexed documents → relevant docs → Passage Retrieval → passages → Answer Processing → Answer]

Page 11: Semantic Analysis in Language Technology

Things to extract from the question

• Answer Type Detection
  • Decide the named entity type (person, place) of the answer
• Query Formulation
  • Choose query keywords for the IR system
• Question Type classification
  • Is this a definition question, a math question, a list question?
• Focus Detection
  • Find the question words that are replaced by the answer
• Relation Extraction
  • Find relations between entities in the question

Page 12: Semantic Analysis in Language Technology

Common Evaluation Metrics

1. Accuracy (does the answer match the gold-labeled answer?)
2. Mean Reciprocal Rank:
   • The reciprocal rank of a query response is the inverse of the rank of the first correct answer.
   • The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q:

   MRR = (1/N) · Σ_{i=1..N} 1/rank_i,   where N = |Q|

Page 13: Semantic Analysis in Language Technology

Common Evaluation Metrics: MRR

• The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q.
• (Example adapted from Wikipedia)
• Three sample queries, each with ranked answers, the first being the one the system thinks is most likely correct; the first correct answer appears at rank 3, rank 2, and rank 1 respectively.
• Given those 3 samples, we can calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 ≈ 0.61.
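To make the computation concrete, here is a minimal Python sketch (the function name and input format are ours, not part of the lecture):

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR over a sample of queries; the list holds, for each query, the
    rank of the first correct answer (use None when nothing is correct)."""
    reciprocal = [1.0 / r if r else 0.0 for r in first_correct_ranks]
    return sum(reciprocal) / len(reciprocal)

# The Wikipedia-style example: first correct answers at ranks 3, 2 and 1.
print(mean_reciprocal_rank([3, 2, 1]))  # (1/3 + 1/2 + 1) / 3 ≈ 0.61
```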

Page 14: Semantic Analysis in Language Technology

Complex questions: "What is the 'Hajj'?"

• The (bottom-up) snippet method:
  • Find a set of relevant documents
  • Extract informative sentences from the documents (using tf-idf, MMR)
  • Order and modify the sentences into an answer

• The (top-down) information extraction method:
  • build specific answerers for different question types:
    • definition questions
    • biography questions
    • certain medical questions

Page 15: Semantic Analysis in Language Technology

Information that should be in the answer for 3 kinds of questions

Page 16: Semantic Analysis in Language Technology

Architecture for complex question answering: definition questions
(S. Blair-Goldensohn, K. McKeown and A. Schlaikjer. 2004. Answering Definition Questions: A Hybrid Approach.)

[Diagram, for the query "What is the Hajj?" (Ndocs=20, Len=8): Document Retrieval (11 Web documents, 1127 total sentences) feeds Predicate Identification (9 Genus-Species sentences, e.g. "The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam.") and Data-Driven Analysis (383 Non-Specific Definitional sentences); sentence clusters and importance ordering then feed Definition Creation.]

Resulting answer: "The Hajj, or pilgrimage to Makkah [Mecca], is the central duty of Islam. More than two million Muslims are expected to take the Hajj this year. Muslims must perform the hajj at least once in their lifetime if physically and financially able. The Hajj is a milestone event in a Muslim's life. The annual hajj begins in the twelfth month of the Islamic year (which is lunar, not solar, so that hajj and Ramadan fall sometimes in summer, sometimes in winter). The Hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. Another ceremony, which was not connected with the rites of the Ka'ba before the rise of Islam, is the Hajj, the annual pilgrimage to 'Arafat, about two miles east of Mecca, toward Mina…"

Page 17: Semantic Analysis in Language Technology

State-of-the-art: examples

• Top down:
  • Ming Tan, Cicero dos Santos, Bing Xiang & Bowen Zhou. 2015. LSTM-Based Deep Learning Models for Non-Factoid Answer Selection.
  • Di Wang and Eric Nyberg. 2015. A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering. In ACL 2015.
  • Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task.

Deep Learning is a new area of Machine Learning research, said to be very promising. It is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. It is based on neural networks.

Page 18: Semantic Analysis in Language Technology

Practical activity

• Start seems to be limited, but it understands natural language
• Google (presumably helped by Knowledge Graph) is more accurate, but skips natural language (uses keywords)
• Google is customized to the users' preferences (different results)

• Interesting outcomes:
  • Currency vs. Coin
  • What's love?
  • Lyric/song vs. Definition question

Page 19: Semantic Analysis in Language Technology

What's the meaning of life?

• Google

Presumably from Knowledge Graph…

Page 20: Semantic Analysis in Language Technology

Start and the 42 puzzle

Page 21: Semantic Analysis in Language Technology

End of previous lecture

Page 22: Semantic Analysis in Language Technology

Acknowledgements

Most slides borrowed or adapted from:

Dan Jurafsky and Christopher Manning, Coursera

Dan Jurafsky and James H. Martin (2015)

J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/

Page 23: Semantic Analysis in Language Technology

Relation Extraction

What is relation extraction?

Page 24: Semantic Analysis in Language Technology

Extracting relations from text

• Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…"

• Extracted Complex Relation: Company-Founding
  Company        IBM
  Location       New York
  Date           June 16, 1911
  Original-Name  Computing-Tabulating-Recording Co.

• But we will focus on the simpler task of extracting relation triples:
  Founding-year(IBM, 1911)
  Founding-location(IBM, New York)

Page 25: Semantic Analysis in Language Technology

Extracting Relation Triples from Text

"The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford… founded the university in 1891"

Stanford  EQ          Leland Stanford Junior University
Stanford  LOC-IN      California
Stanford  IS-A        research university
Stanford  LOC-NEAR    Palo Alto
Stanford  FOUNDED-IN  1891
Stanford  FOUNDER     Leland Stanford

Page 26: Semantic Analysis in Language Technology

Why Relation Extraction?

• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
  • Adding words to the WordNet thesaurus, facts to FreeBase or DBpedia
• Support question answering:
  The granddaughter of which actor starred in the movie "E.T."?
  (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)
• But which relations should we extract?

Page 27: Semantic Analysis in Language Technology

Automated Content Extraction (ACE)

[Diagram: ACE relation types and subtypes for the "Relation Extraction Task":
• PHYSICAL: Located, Near
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• ORG-AFFILIATION: Employment, Membership, Founder, Ownership, Student-Alum, Investor, Sports-Affiliation
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ARTIFACT: User-Owner-Inventor-Manufacturer]

Automatic Content Extraction (ACE) is a research program for developing advanced information extraction technologies. Given a text in natural language, the ACE challenge is to detect:
• entities
• relations between entities
• events

Page 28: Semantic Analysis in Language Technology

Automated Content Extraction (ACE)

• Physical-Located        PER-GPE   He was in Tennessee
• Part-Whole-Subsidiary   ORG-ORG   XYZ, the parent company of ABC
• Person-Social-Family    PER-PER   John's wife Yoko
• Org-AFF-Founder         PER-ORG   Steve Jobs, co-founder of Apple…

Page 29: Semantic Analysis in Language Technology

UMLS: Unified Medical Language System

• 134 entity types, 54 relations

Injury                   disrupts     Physiological Function
Bodily Location          location-of  Biologic Function
Anatomical Structure     part-of      Organism
Pharmacologic Substance  causes       Pathological Function
Pharmacologic Substance  treats       Pathologic Function

Page 30: Semantic Analysis in Language Technology

Extracting UMLS relations from a sentence

"Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes"
↓
Echocardiography, Doppler  DIAGNOSES  Acquired stenosis

Page 31: Semantic Analysis in Language Technology

Databases of Wikipedia Relations

Relations extracted from the Wikipedia Infobox:
Stanford  state  California
Stanford  motto  "Die Luft der Freiheit weht"
…

Page 32: Semantic Analysis in Language Technology

Relation databases that draw from Wikipedia

• Resource Description Framework (RDF) triples: subject predicate object
  Golden Gate Park  location  San Francisco
  dbpedia:Golden_Gate_Park  dbpedia-owl:location  dbpedia:San_Francisco

• The DBpedia project uses the Resource Description Framework (RDF) to represent the extracted information and consists of 3 billion RDF triples, 580 million extracted from the English edition of Wikipedia and 2.46 billion from other language editions (Wikipedia, March 2016).

• Frequent Freebase relations:
  people/person/nationality
  location/location/contains
  people/person/profession
  people/person/place-of-birth
  biology/organism_higher_classification
  film/film/genre

DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project.

Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members (cf. Semantic Web) --> Knowledge Graph: https://en.wikipedia.org/wiki/Freebase

Page 33: Semantic Analysis in Language Technology

How to build relation extractors

1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
   • Bootstrapping (using seeds)
   • Distant supervision
   • Unsupervised learning from the web

Page 34: Semantic Analysis in Language Technology

Relation Extraction

Using patterns to extract relations

Page 35: Semantic Analysis in Language Technology

Rules for extracting IS-A relations

Early intuition from Hearst (1992):

• "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
• What does Gelidium mean?
• How do you know?


Page 37: Semantic Analysis in Language Technology

Hearst's Patterns for extracting IS-A relations
(Hearst, 1992: Automatic Acquisition of Hyponyms)

"Y such as X ((, X)* (, and|or) X)"
"such Y as X"
"X or other Y"
"X and other Y"
"Y including X"
"Y, especially X"

Page 38: Semantic Analysis in Language Technology

Hearst's Patterns for extracting IS-A relations

Hearst pattern     Example occurrences
X and other Y      ...temples, treasuries, and other important civic buildings.
X or other Y       Bruises, wounds, broken bones or other injuries...
Y such as X        The bow lute, such as the Bambara ndang...
Such Y as X        ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X      ...common-law countries, including Canada and England...
Y, especially X    European countries, especially France, England, and Spain...
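To make the patterns concrete, here is a toy regex sketch of two of them (the regexes and function name are ours; a real system would match over NP chunks and named entities, not raw word sequences):

```python
import re

# Simplified regex renderings of two Hearst patterns.
SUCH_AS = re.compile(r"(\w+(?: \w+)?) such as (\w+)")      # Y such as X
AND_OTHER = re.compile(r"(\w+) and other (\w+(?: \w+)?)")  # X and other Y

def hearst_pairs(text):
    pairs = []
    for m in SUCH_AS.finditer(text):
        pairs.append((m.group(2), m.group(1)))   # X IS-A Y
    for m in AND_OTHER.finditer(text):
        pairs.append((m.group(1), m.group(2)))   # X IS-A Y
    return pairs

print(hearst_pairs("a mixture of red algae such as Gelidium"))
# [('Gelidium', 'red algae')]  ->  Gelidium IS-A red algae
```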

Page 39: Semantic Analysis in Language Technology

Hand-built patterns for relations

• Plus:
  • Human patterns tend to be high-precision
  • Can be tailored to specific domains

• Minus:
  • Human patterns are often low-recall
  • A lot of work to think of all possible patterns!
  • Don't want to have to do this for every relation!
  • We'd like better accuracy

Page 40: Semantic Analysis in Language Technology

Relation Extraction

Supervised relation extraction

Page 41: Semantic Analysis in Language Technology

Supervised machine learning for relations

• Choose a set of relations we'd like to extract
• Choose a set of relevant named entities
• Find and label data:
  • Choose a representative corpus
  • Label the named entities in the corpus
  • Hand-label the relations between these entities
  • Break into training, development, and test sets
• Train a classifier on the training set

Page 42: Semantic Analysis in Language Technology

How to do classification in supervised relation extraction

1. Find all pairs of named entities (usually in the same sentence)
2. Decide if the two entities are related
3. If yes, classify the relation

• Why the extra step?
  • Faster classification training by eliminating most pairs
  • Can use distinct feature sets appropriate for each task

Page 43: Semantic Analysis in Language Technology

Word Features for Relation Extraction

Example: "American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said"

• Headwords of M1 and M2, and their combination:
  Airlines   Wagner   Airlines-Wagner
• Bag of words and bigrams in M1 and M2:
  {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2:
  M2: -1 spokesman
  M2: +1 said
• Bag of words or bigrams between the two entities:
  {a, AMR, of, immediately, matched, move, spokesman, the, unit}
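A toy sketch of how a few of these word features could be computed (the token offsets, function and feature names are ours, not from the slides):

```python
def word_features(tokens, m1_span, m2_span):
    """tokens: list of words; m1_span/m2_span: (start, end) token offsets."""
    m1, m2 = tokens[slice(*m1_span)], tokens[slice(*m2_span)]
    feats = {
        "head_m1": m1[-1],                     # headword approximated as last token
        "head_m2": m2[-1],
        "head_pair": m1[-1] + "-" + m2[-1],    # e.g. Airlines-Wagner
        "m2_minus_1": tokens[m2_span[0] - 1],  # word just left of M2
        "m2_plus_1": tokens[m2_span[1]] if m2_span[1] < len(tokens) else "<END>",
    }
    for w in tokens[m1_span[1]:m2_span[0]]:    # bag of words between the mentions
        if w.isalpha():
            feats["between_" + w.lower()] = 1
    return feats

sent = ("American Airlines , a unit of AMR , immediately matched "
        "the move , spokesman Tim Wagner said").split()
print(word_features(sent, (0, 2), (14, 16)))
# {'head_pair': 'Airlines-Wagner', 'm2_minus_1': 'spokesman', 'm2_plus_1': 'said', ...}
```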

Page 44: Semantic Analysis in Language Technology

Named Entity Type and Mention Level Features for Relation Extraction

Example: "American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said"

• Named-entity types
  • M1: ORG
  • M2: PERSON
• Concatenation of the two named-entity types
  • ORG-PERSON
• Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
  • M1: NAME [it or he would be PRONOUN]
  • M2: NAME [the company would be NOMINAL]

Page 45: Semantic Analysis in Language Technology

Parse Features for Relation Extraction

Example: "American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said"

• Base syntactic chunk sequence from one mention to the other:
  NP  NP  PP  VP  NP  NP
• Constituent path through the tree from one to the other:
  NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency path:
  Airlines  matched  Wagner  said

Page 46: Semantic Analysis in Language Technology

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Page 47: Semantic Analysis in Language Technology

Classifiers for supervised methods

• Now you can use any classifier you like:
  • MaxEnt
  • Naïve Bayes
  • SVM
  • ...
• Train it on the training set, tune on the dev set, test on the test set (a toy sketch follows)
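A minimal end-to-end sketch with scikit-learn, assuming feature dicts like the ones on the previous slides (the toy training data are invented; MaxEnt corresponds to logistic regression):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: one feature dict per entity pair, plus gold labels.
X_train = [
    {"head_pair": "Airlines-Wagner", "ne_types": "ORG-PERSON", "between_spokesman": 1},
    {"head_pair": "Stanford-California", "ne_types": "ORG-GPE", "between_in": 1},
]
y_train = ["employment", "located-in"]

# MaxEnt = (multinomial) logistic regression over one-hot encoded features.
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

print(clf.predict([{"head_pair": "Jobs-Apple", "ne_types": "PERSON-ORG"}]))
```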

Page 48: Semantic Analysis in Language Technology

Evaluation of Supervised Relation Extraction

• Compute P/R/F1 for each relation:

P = (# of correctly extracted relations) / (total # of extracted relations)

R = (# of correctly extracted relations) / (total # of gold relations)

F1 = 2PR / (P + R)
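The same metrics as a small set-based sketch (the function name and toy triples are ours):

```python
def precision_recall_f1(extracted, gold):
    """P, R, F1 for one relation, comparing sets of extracted vs. gold triples."""
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

extracted = {("IBM", "founding-year", "1911"), ("IBM", "founding-location", "Armonk")}
gold = {("IBM", "founding-year", "1911"), ("IBM", "founding-location", "New York")}
print(precision_recall_f1(extracted, gold))  # (0.5, 0.5, 0.5)
```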

Page 49: Semantic Analysis in Language Technology

Summary: Supervised Relation Extraction

+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set

- Labeling a large training set is expensive
- Supervised models are brittle, don't generalize well to different genres

Page 50: Semantic Analysis in Language Technology

Relation Extraction

Semi-supervised and unsupervised relation extraction

Page 51: Semantic Analysis in Language Technology

Seed-based or bootstrapping approaches to relation extraction

• No training set? Maybe you have:
  • A few seed tuples, or
  • A few high-precision patterns
• Can you use those seeds to do something useful?
  • Bootstrapping: use the seeds to directly learn to populate a relation

Roughly speaking: use seeds to initialize a process of annotation, then refine through iterations.

Page 52: Semantic Analysis in Language Technology

Relation Bootstrapping (Hearst 1992)

• Gather a set of seed pairs that have relation R
• Iterate:
  1. Find sentences with these pairs
  2. Look at the context between or around the pair and generalize the context to create patterns
  3. Use the patterns to grep for more pairs

Page 53: Semantic Analysis in Language Technology

Bootstrapping

• <Mark Twain, Elmira>   Seed tuple
• Grep (google) for the environments of the seed tuple:
  "Mark Twain is buried in Elmira, NY."  →  X is buried in Y
  "The grave of Mark Twain is in Elmira"  →  The grave of X is in Y
  "Elmira is Mark Twain's final resting place"  →  Y is X's final resting place
• Use those patterns to grep for new tuples
• Iterate (a toy sketch follows)
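A toy version of this loop (the corpus, seed, and helper names are ours; real systems generalize contexts far more carefully and score pattern confidence):

```python
import re

corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira.",
    "Emily Dickinson is buried in Amherst.",
]
seeds = {("Mark Twain", "Elmira")}
NP = r"[A-Z][A-Za-z ]*?"  # crude stand-in for a proper named-entity matcher

def patterns_from(seeds, corpus):
    """Generalize the context between a seed pair into 'X ... Y' regexes."""
    pats = set()
    for x, y in seeds:
        for sent in corpus:
            if x in sent and y in sent and sent.index(x) < sent.index(y):
                middle = sent[sent.index(x) + len(x):sent.index(y)]
                pats.add(f"(?P<X>{NP}){re.escape(middle)}(?P<Y>{NP})\\b")
    return pats

def tuples_from(pats, corpus):
    return {(m["X"], m["Y"]) for p in pats for s in corpus
            for m in re.finditer(p, s)}

for _ in range(2):  # iterate: seeds -> patterns -> new seeds -> ...
    seeds |= tuples_from(patterns_from(seeds, corpus), corpus)
print(seeds)
# gains ('Emily Dickinson', 'Amherst') via "X is buried in Y", plus a noisy
# tuple from the over-general "X is in Y" context: why confidence scoring matters.
```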

Page 54: Semantic Analysis in Language Technology

DIPRE: Extract <author, book> pairs

• Start with 5 seeds:

  Author                Book
  Isaac Asimov          The Robots of Dawn
  David Brin            Startide Rising
  James Gleick          Chaos: Making a New Science
  Charles Dickens       Great Expectations
  William Shakespeare   The Comedy of Errors

• Find instances:
  The Comedy of Errors, by William Shakespeare, was
  The Comedy of Errors, by William Shakespeare, is
  The Comedy of Errors, one of William Shakespeare's earliest attempts
  The Comedy of Errors, one of William Shakespeare's most

• Extract patterns (group by middle, take longest common prefix/suffix):
  ?x , by ?y ,
  ?x , one of ?y 's

• Now iterate, finding new seeds that match the patterns (see the sketch below)

Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
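A sketch of the DIPRE-style pattern step, "group by middle, take longest common prefix/suffix" (the occurrence tuples and helper names are ours):

```python
import os
from collections import defaultdict

# (prefix, middle, suffix) contexts around "?x … ?y" in matched sentences.
occurrences = [
    ("", " , by ", " , was"),
    ("", " , by ", " , is"),
]

def common_prefix(strings):
    return os.path.commonprefix(strings)

def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]

by_middle = defaultdict(list)
for pre, mid, suf in occurrences:
    by_middle[mid].append((pre, suf))

for mid, contexts in by_middle.items():
    pre = common_suffix([p for p, _ in contexts])  # shared text just before ?x
    suf = common_prefix([s for _, s in contexts])  # shared text just after ?y
    print(f"{pre}?x{mid}?y{suf}")                  # -> "?x , by ?y , "
```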

Page 55: Semantic Analysis in Language Technology

Distant Supervision

• Combine bootstrapping with supervised learning
• Instead of 5 seeds:
  • Use a large database to get a huge number of seed examples
  • Create lots of features from all these examples
  • Combine in a supervised classifier

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.

Page 56: Semantic Analysis in Language Technology

Distant supervision paradigm

• Like supervised classification:
  • Uses a classifier with lots of features
  • Supervised by detailed hand-created knowledge
  • Doesn't require iteratively expanding patterns
• Like unsupervised classification:
  • Uses very large amounts of unlabeled data
  • Not sensitive to genre issues in the training corpus

Page 57: Semantic Analysis in Language Technology

Distantly supervised learning of relation extraction patterns

1. For each relation (e.g. Born-In)
2. For each tuple in a big database (e.g. <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>)
3. Find sentences in a large corpus with both entities:
   Hubble was born in Marshfield
   Einstein, born (1879), Ulm
   Hubble's birthplace in Marshfield
4. Extract frequent features (parse, words, etc.):
   PER was born in LOC
   PER, born (XXXX), LOC
   PER's birthplace in LOC
5. Train a supervised classifier using thousands of patterns:
   P(born-in | f1, f2, f3, …, f70000)
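A toy rendering of steps 2-4 (the KB, corpus, and featurization are illustrative; note the deliberately misleading third sentence):

```python
# Toy distant supervision: treat any sentence that contains both entities of
# a KB pair as a (noisily) labeled positive example for that relation.
kb = {("Edwin Hubble", "Marshfield"): "born-in",
      ("Albert Einstein", "Ulm"): "born-in"}
corpus = ["Hubble was born in Marshfield.",
          "Einstein, born (1879), Ulm.",
          "Einstein visited Ulm in 1920."]  # matches the pair, not the relation

training = []
for (person, place), rel in kb.items():
    surname = person.split()[-1]
    for sent in corpus:
        if surname in sent and place in sent:
            # featurize the words between the two mentions
            between = sent.split(surname, 1)[1].split(place, 1)[0].strip(" ,.")
            training.append(({"between": between}, rel))

for feats, rel in training:
    print(feats, rel)
# {'between': 'was born in'} born-in
# {'between': 'born (1879)'} born-in
# {'between': 'visited'} born-in   <- label noise the classifier must tolerate
```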

Page 58: Semantic Analysis in Language Technology

Unsupervised relation extraction

• Open Information Extraction:
  • extract relations from the web with no training data, no list of relations

1. Use parsed data to train a "trustworthy tuple" classifier
2. Single-pass: extract all relations between NPs, keep them if trustworthy
3. An assessor ranks relations based on text redundancy

(FCI, specializes in, software development)
(Tesla, invented, coil transformer)

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI.

Page 59: Semantic Analysis in Language Technology

Evaluation of Semi-supervised and Unsupervised Relation Extraction

• Since it extracts totally new relations from the web:
  • There is no gold set of correct instances of relations!
  • Can't compute precision (don't know which ones are correct)
  • Can't compute recall (don't know which ones were missed)
• Instead, we can approximate precision (only):
  • Draw a random sample of relations from the output, check precision manually
• Can also compute precision at different levels of recall:
  • Precision for the top 1,000 new relations, top 10,000 new relations, top 100,000
  • In each case taking a random sample of that set
• But there is no way to evaluate recall

P̂ = (# of correctly extracted relations in the sample) / (total # of extracted relations in the sample)

Page 60: Semantic Analysis in Language Technology

The end