TRANSCRIPT
Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Relation Extraction
Marina Santini [email protected]
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2016
Previous Lecture: Question Answering
Question Answering systems
• Factoid questions: Google, Wolfram, Ask Jeeves, START, …
• Approaches: IR-based, knowledge-based, hybrid
Katz et al. (2006) http://start.csail.mit.edu/publications/FLAIRS0601KatzB.pdf
• START answers natural language questions by presenting components of text and multimedia information drawn from a set of information resources that are hosted locally or accessed remotely through the Internet.
• START targets high precision in its question answering.
• The START system analyzes English text and produces a knowledge base which incorporates, in the form of nested ternary expressions (= triples), the information found in the text.
Is it true?: http://uncyclopedia.wikia.com/wiki/Ask_Jeeves
• Ask Jeeves, more correctly known as Ask.com, is a search engine founded in 1996 in California.
• Initially it represented a stereotypical English butler who would "fetch" the answer to any question asked.
• Ask.com is now considered one of the great failures of the internet. The question and answer feature simply didn't work as well as hoped, after trying his hand at being both a traditional search engine and a terrible kind of "artificial AI" with a bald spot…
• These days Jeeves is ranked as the 4th most successful search engine on the web, and the 4th most successful overall. This seems impressive until you consider that Google holds the top spot with 95% of the market. It has even fallen behind Bing; enough said.
Search engines that can be used as QA systems
• Yahoo • Bing
Siri http://en.wikipedia.org/wiki/Siri
• Siri /ˈsɪri/ is an intelligent personal assistant and knowledge navigator which works as an application for Apple Inc.'s iOS.
• The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.
• The software, both in its original version and as an iOS application, adapts to the user's individual language usage and individual searches (preferences) with continuing use, and returns results that are individualized.
• The name Siri is Scandinavian, a short form of the Norse name Sigrid meaning "beauty" and "victory", and comes from the intended name for the original developer's first child.
Chatterbots
• Siri… a conversational "safety net".
• Conversational agents (chatter bots and personal assistants) → customer care, customer analytics (replacing/integrating FAQs and help desks)
Avatar: a picture of a person or animal that represents you on a computer screen, for example in some chat rooms or when you are playing games over the Internet
Eliza http://en.wikipedia.org/wiki/ELIZA ELIZA was written at MIT by Joseph Weizenbaum between 1964 and 1966
General IR architecture for factoid questions
[Figure: IR-based factoid QA pipeline. A Question goes through Question Processing (Query Formulation, Answer Type Detection); the formulated query is run against an indexed document collection (Indexing, Document Retrieval); relevant docs are narrowed down to passages (Passage Retrieval); Answer Processing then produces the Answer.]
Things to extract from the question
• Answer Type Detection: decide the named entity type (person, place) of the answer
• Query Formulation: choose query keywords for the IR system
• Question Type classification: is this a definition question, a math question, a list question?
• Focus Detection: find the question words that are replaced by the answer
• Relation Extraction: find relations between entities in the question
Common Evaluation Metrics
1. Accuracy (does the answer match the gold-labeled answer?)
2. Mean Reciprocal Rank:
• The reciprocal rank of a query response is the inverse of the rank of the first correct answer.
• The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q
MRR = (1/N) · Σ_{i=1}^{N} 1/rank_i
(N = |Q|, the number of queries; rank_i = rank of the first correct answer for query i)
Common Evaluation Metrics: MRR
• The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q (example adapted from Wikipedia).
• Suppose the system returns ranked answers for 3 queries, each list ordered with the answer it thinks most likely correct first, and the first correct answer appears at rank 3, rank 2, and rank 1 respectively.
• Given those 3 samples, the mean reciprocal rank is (1/3 + 1/2 + 1)/3 ≈ 0.61.
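A minimal sketch of this computation (the helper function and its input format are illustrative, not part of the lecture materials):

```python
def mean_reciprocal_rank(ranks):
    """MRR from the rank of the first correct answer per query
    (None if no correct answer was returned for that query)."""
    reciprocals = [1.0 / r if r is not None else 0.0 for r in ranks]
    return sum(reciprocals) / len(reciprocals)

# First correct answers at ranks 3, 2 and 1, as in the example above:
print(mean_reciprocal_rank([3, 2, 1]))  # 0.611... ≈ 0.61
```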
Complex questions: "What is the 'Hajj'?"
• The (bottom-up) snippet method: find a set of relevant documents; extract informative sentences from the documents (using tf-idf, MMR); order and modify the sentences into an answer.
• The (top-down) information extraction method: build specific answerers for different question types (definition questions, biography questions, certain medical questions).
Information that should be in the answer for 3 kinds of questions
[Figure: pipeline for the definition question "What is the Hajj?" (Ndocs=20, Len=8):
Document Retrieval (11 Web documents, 1127 total sentences) → Predicate Identification → Data-Driven Analysis (383 Non-Specific Definitional sentences; 9 Genus-Species sentences) → Sentence clusters, Importance ordering → Definition Creation.
Example Genus-Species sentences: "The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam." / "The Hajj is a milestone event in a Muslim's life." / "The hajj is one of five pillars that make up the foundation of Islam." / ...
Generated definition: "The Hajj, or pilgrimage to Makkah [Mecca], is the central duty of Islam. More than two million Muslims are expected to take the Hajj this year. Muslims must perform the hajj at least once in their lifetime if physically and financially able. The Hajj is a milestone event in a Muslim's life. The annual hajj begins in the twelfth month of the Islamic year (which is lunar, not solar, so that hajj and Ramadan fall sometimes in summer, sometimes in winter). The Hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. Another ceremony, which was not connected with the rites of the Ka'ba before the rise of Islam, is the Hajj, the annual pilgrimage to 'Arafat, about two miles east of Mecca, toward Mina…"]
Architecture for complex question answering: definition questions
S. Blair-Goldensohn, K. McKeown and A. Schlaikjer. 2004. Answering Definition Questions: A Hybrid Approach.
State of the art: examples
• Top down: Ming Tan, Cicero dos Santos, Bing Xiang & Bowen Zhou. 2015. LSTM-Based Deep Learning Models for Non-Factoid Answer Selection.
• Di Wang and Eric Nyberg. 2015. A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering. In ACL 2015.
• Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou. 2015. Applying Deep Learning to Answer Selection: A Study and an Open Task.
Deep Learning is a new area of Machine Learning research, said to be very promising. It is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. It is based on neural networks.
Practical activity
• START seems to be limited, but it understands natural language.
• Google (presumably helped by the Knowledge Graph) is more accurate, but skips natural language (uses keywords).
• Google is customized to the users' preferences (different results).
• Interesting outcomes: Currency vs. Coin; What's love?; Lyric/song vs. Definition question
What’s the meaning of life?
Presumably from Knowledge Graph…
Start and the 42 puzzle
End of previous lecture
Acknowledgements
Most slides borrowed or adapted from:
Dan Jurafsky and Christopher Manning, Coursera
Dan Jurafsky and James H. Martin (2015)
J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
Relation Extraction
What is relation extraction?
Extracting relations from text
• Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…"
• Extracted complex relation: Company-Founding
  Company: IBM
  Location: New York
  Date: June 16, 1911
  Original-Name: Computing-Tabulating-Recording Co.
• But we will focus on the simpler task of extracting relation triples:
  Founding-year(IBM, 1911)
  Founding-location(IBM, New York)
Extracting Relation Triples from Text
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891
  Stanford EQ Leland Stanford Junior University
  Stanford LOC-IN California
  Stanford IS-A research university
  Stanford LOC-NEAR Palo Alto
  Stanford FOUNDED-IN 1891
  Stanford FOUNDER Leland Stanford
Why Relation Extraction?
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
• Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
• Support question answering: The granddaughter of which actor starred in the movie "E.T."?
  (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)
• But which relations should we extract?
Automated Content Extraction (ACE)
[Figure: the ACE relation types and subtypes ("Relation Extraction Task"):
• PHYSICAL: Located, Near
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• ORG AFFILIATION: Employment, Membership, Ownership, Founder, Student-Alum, Investor, Sports-Affiliation
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ARTIFACT: User-Owner-Inventor-Manufacturer]
Automatic Content Extraction (ACE) is a research program for developing advanced information extraction technologies. Given a text in natural language, the ACE challenge is to detect: • entities • relations between entities • events
Automated Content Extraction (ACE)
• Physical-Located PER-GPE: He was in Tennessee
• Part-Whole-Subsidiary ORG-ORG: XYZ, the parent company of ABC
• Person-Social-Family PER-PER: John's wife Yoko
• Org-AFF-Founder PER-ORG: Steve Jobs, co-founder of Apple…
UMLS: Unified Medical Language System
• 134 entity types, 54 relations
  Injury disrupts Physiological Function
  Bodily Location location-of Biologic Function
  Anatomical Structure part-of Organism
  Pharmacologic Substance causes Pathological Function
  Pharmacologic Substance treats Pathologic Function
Extracting UMLS relations from a sentence
Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
  ⇓
Echocardiography, Doppler DIAGNOSES Acquired stenosis
Databases of Wikipedia Relations
Relations extracted from the infobox:
  Stanford | state | California
  Stanford | motto | "Die Luft der Freiheit weht"
  …
Wikipedia Infobox
Relation databases that draw from Wikipedia
• Resource Description Framework (RDF) triples: subject predicate object
  Golden Gate Park | location | San Francisco
  dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
• The DBpedia project uses the Resource Description Framework (RDF) to represent the extracted information and consists of 3 billion RDF triples, 580 million extracted from the English edition of Wikipedia and 2.46 billion from other language editions (Wikipedia, March 2016).
• Frequent Freebase relations: people/person/nationality, location/location/contains, people/person/profession, people/person/place-of-birth, biology/organism_higher_classification, film/film/genre
DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project.
Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members (cf. Semantic Web) → Knowledge Graph: https://en.wikipedia.org/wiki/Freebase
How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
• Bootstrapping (using seeds)
• Distant supervision
• Unsupervised learning from the web
Relation Extraction
Using patterns to extract relations
Rules for extracting the IS-A relation
Early intuition from Hearst (1992)
• "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
• What does Gelidium mean? How do you know?
Hearst's Patterns for extracting IS-A relations
(Hearst, 1992): Automatic Acquisition of Hyponyms
  "Y such as X ((, X)* (, and|or) X)"
  "such Y as X"
  "X or other Y"
  "X and other Y"
  "Y including X"
  "Y, especially X"
Hearst's Patterns for extracting IS-A relations

Hearst pattern | Example occurrences
X and other Y | ...temples, treasuries, and other important civic buildings.
X or other Y | Bruises, wounds, broken bones or other injuries...
Y such as X | The bow lute, such as the Bambara ndang...
Such Y as X | ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X | ...common-law countries, including Canada and England...
Y, especially X | European countries, especially France, England, and Spain...
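A minimal sketch of matching two of these patterns with plain regular expressions (the naive noun-phrase pattern and the function name are illustrative assumptions; real systems match over POS-tagged or chunked text):

```python
import re

# Naive "noun phrase": 1-3 words (an assumption made for this demo).
NP = r"[A-Za-z][A-Za-z-]*(?:\s[A-Za-z][A-Za-z-]*){0,2}"

PATTERNS = [
    re.compile(rf"(?P<y>{NP}),?\s+such as\s+(?P<x>{NP})"),  # Y such as X
    re.compile(rf"(?P<x>{NP})\s+and other\s+(?P<y>{NP})"),  # X and other Y
]

def hearst_isa(sentence):
    """Return (hyponym X, hypernym Y) pairs found by the patterns."""
    return [(m.group("x"), m.group("y"))
            for p in PATTERNS for m in p.finditer(sentence)]

print(hearst_isa("a mixture of red algae, such as Gelidium"))
# [('Gelidium', 'red algae')], i.e. Gelidium IS-A red algae
```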
Hand-built patterns for relations
• Plus:
• Human patterns tend to be high-precision
• Can be tailored to specific domains
• Minus:
• Human patterns are often low-recall
• A lot of work to think of all possible patterns!
• Don't want to have to do this for every relation!
• We'd like better accuracy
Relation Extraction
Supervised relation extraction
Supervised machine learning for relations
• Choose a set of relations we'd like to extract
• Choose a set of relevant named entities
• Find and label data
• Choose a representative corpus
• Label the named entities in the corpus
• Hand-label the relations between these entities
• Break into training, development, and test sets
• Train a classifier on the training set
How to do classification in supervised relation extraction
1. Find all pairs of named entities (usually in the same sentence)
2. Decide if the 2 entities are related
3. If yes, classify the relation
• Why the extra step?
• Faster classification training by eliminating most pairs
• Can use distinct feature sets appropriate for each task
Word Features for Relation Extraction
• Headwords of M1 and M2, and their combination: Airlines, Wagner, Airlines-Wagner
• Bag of words and bigrams in M1 and M2: {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2: M2 -1 spokesman, M2 +1 said
• Bag of words or bigrams between the two entities: {a, AMR, of, immediately, matched, move, spokesman, the, unit}
Example: American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
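A sketch of how some of these word features could be computed (the tokenisation, the (start, end) span format, and the function name are assumptions made for this example):

```python
def word_features(tokens, m1, m2):
    """tokens: list of words; m1, m2: (start, end) token spans of the
    two mentions, with m1 occurring before m2."""
    feats = {}
    head1, head2 = tokens[m1[1] - 1], tokens[m2[1] - 1]  # crude head = last token
    feats["head_m1"], feats["head_m2"] = head1, head2
    feats["head_pair"] = head1 + "-" + head2
    # Words in particular positions left/right of the mentions
    if m1[0] > 0:
        feats["m1_-1"] = tokens[m1[0] - 1]
    feats["m2_-1"] = tokens[m2[0] - 1]
    if m2[1] < len(tokens):
        feats["m2_+1"] = tokens[m2[1]]
    # Bag of words between the two mentions
    for w in tokens[m1[1]:m2[0]]:
        feats["between=" + w] = True
    return feats

sent = ("American Airlines , a unit of AMR , immediately matched "
        "the move , spokesman Tim Wagner said").split()
print(word_features(sent, (0, 2), (14, 16)))
# head_pair: Airlines-Wagner, m2_-1: spokesman, m2_+1: said, between={a, unit, ...}
```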
Named Entity Type and Mention Level Features for Relation Extraction
• Named-entity types: M1: ORG, M2: PERSON
• Concatenation of the two named-entity types: ORG-PERSON
• Entity level of M1 and M2 (NAME, NOMINAL, PRONOUN)
• M1: NAME [it or he would be PRONOUN]
• M2: NAME [the company would be NOMINAL]
Example: American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
Parse Features for Relation Extraction
• Base syntactic chunk sequence from one mention to the other: NP NP PP VP NP NP
• Constituent path through the tree from one to the other: NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency path: Airlines matched Wagner said
Example: American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
[Figure: parse of "American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said."]
Classifiers for supervised methods
• Now you can use any classifier you like: MaxEnt, Naïve Bayes, SVM, ...
• Train it on the training set, tune on the dev set, test on the test set
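For instance, a minimal sketch assuming scikit-learn and toy feature dicts like those above (the training data and labels are made up; LogisticRegression stands in for MaxEnt):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression  # ≈ MaxEnt
from sklearn.pipeline import make_pipeline

# Toy training data: feature dicts and gold relation labels (made up).
X_train = [{"head_pair": "Airlines-Wagner", "between=spokesman": True},
           {"head_pair": "IBM-York", "between=incorporated": True}]
y_train = ["employment", "founding-location"]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)                                # train on the training set
print(clf.predict([{"head_pair": "Airlines-Wagner"}]))   # -> ['employment']
```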
Evaluation of Supervised Relation Extraction
• Compute P/R/F1 for each relation:
  P = (# of correctly extracted relations) / (total # of extracted relations)
  R = (# of correctly extracted relations) / (total # of gold relations)
  F1 = 2PR / (P + R)
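For instance (a made-up run, only to illustrate the formulas): if a system extracts 10 relation instances of which 6 are correct, and the gold standard contains 12 relations, then P = 6/10 = 0.60, R = 6/12 = 0.50, and F1 = 2 · 0.60 · 0.50 / (0.60 + 0.50) ≈ 0.55.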
Summary: Supervised Relation Extraction
+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set
- Labeling a large training set is expensive
- Supervised models are brittle; they don't generalize well to different genres
Relation Extraction
Semi-supervised and unsupervised relation extraction
Seed-based or bootstrapping approaches to relation extraction
• No training set? Maybe you have:
• A few seed tuples, or
• A few high-precision patterns
• Can you use those seeds to do something useful?
• Bootstrapping: use the seeds to directly learn to populate a relation
Roughly speaking: use seeds to initialize a process of annotation, then refine through iterations.
Relation Bootstrapping (Hearst 1992)
• Gather a set of seed pairs that have relation R
• Iterate:
1. Find sentences with these pairs
2. Look at the context between or around the pair and generalize the context to create patterns
3. Use the patterns to grep for more pairs
Bootstrapping
• <Mark Twain, Elmira> seed tuple
• Grep (Google) for the environments of the seed tuple:
  "Mark Twain is buried in Elmira, NY." → X is buried in Y
  "The grave of Mark Twain is in Elmira" → The grave of X is in Y
  "Elmira is Mark Twain's final resting place" → Y is X's final resting place
• Use those patterns to grep for new tuples
• Iterate
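A minimal sketch of one such iteration over the three example sentences (the tiny "corpus" and the whole-sentence templates are simplifications; real systems generalize just the context around the pair and search the web):

```python
import re

corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira.",
    "Elmira is Mark Twain's final resting place.",
]
seeds = {("Mark Twain", "Elmira")}

def learn_patterns(seeds, corpus):
    """Turn each sentence containing a seed pair into a template
    by replacing the two entities with capture slots."""
    patterns = set()
    for x, y in seeds:
        for sent in corpus:
            if x in sent and y in sent:
                pat = re.escape(sent)
                pat = pat.replace(re.escape(x), "(?P<x>.+?)")
                pat = pat.replace(re.escape(y), "(?P<y>.+?)")
                patterns.add(pat)
    return patterns

def harvest(patterns, corpus):
    """Grep the corpus with the learned patterns to collect pairs."""
    return {(m.group("x"), m.group("y"))
            for pat in patterns for sent in corpus
            for m in [re.search(pat, sent)] if m}

patterns = learn_patterns(seeds, corpus)
print(harvest(patterns, corpus))  # on a bigger corpus this would add new pairs
```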
Dipre: Extract <author, book> pairs
Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
• Start with 5 seeds:
  Author | Book
  Isaac Asimov | The Robots of Dawn
  David Brin | Startide Rising
  James Gleick | Chaos: Making a New Science
  Charles Dickens | Great Expectations
  William Shakespeare | The Comedy of Errors
• Find instances:
  The Comedy of Errors, by William Shakespeare, was
  The Comedy of Errors, by William Shakespeare, is
  The Comedy of Errors, one of William Shakespeare's earliest attempts
  The Comedy of Errors, one of William Shakespeare's most
• Extract patterns (group by middle, take longest common prefix/suffix):
  ?x , by ?y ,
  ?x , one of ?y 's
• Now iterate, finding new seeds that match the patterns
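A sketch of the pattern-generation step (the (before, middle, after) context tuples and helper names are illustrative; os.path.commonprefix is just a convenient longest-common-prefix utility):

```python
import os

def common_suffix(strings):
    """Longest common suffix = reversed common prefix of reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def dipre_patterns(occurrences):
    """occurrences: (before, middle, after) contexts around <?x=book, ?y=author>.
    Group by the middle string, then keep the longest shared context."""
    by_middle = {}
    for before, middle, after in occurrences:
        by_middle.setdefault(middle, []).append((before, after))
    return [common_suffix([b for b, _ in ctxs]) + "?x" + middle + "?y"
            + os.path.commonprefix([a for _, a in ctxs])
            for middle, ctxs in by_middle.items()]

# Contexts of the four instances above (segmentation made up for the demo):
occs = [("", ", by ", ", was"),
        ("", ", by ", ", is"),
        ("", ", one of ", "'s earliest attempts"),
        ("", ", one of ", "'s most")]
print(dipre_patterns(occs))  # ['?x, by ?y, ', "?x, one of ?y's "]
```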
Distant Supervision
• Combine bootstrapping with supervised learning
• Instead of 5 seeds, use a large database to get a huge number of seed examples
• Create lots of features from all these examples
• Combine in a supervised classifier
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.
Distant supervision paradigm
• Like supervised classification:
• Uses a classifier with lots of features
• Supervised by detailed hand-created knowledge
• Doesn't require iteratively expanding patterns
• Like unsupervised classification:
• Uses very large amounts of unlabeled data
• Not sensitive to genre issues in the training corpus
Distantly supervised learning of relation extraction patterns
1. For each relation (e.g. Born-In)
2. For each tuple in a big database: <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>
3. Find sentences in a large corpus with both entities:
  Hubble was born in Marshfield
  Einstein, born (1879), Ulm
  Hubble's birthplace in Marshfield
4. Extract frequent features (parse, words, etc.):
  PER was born in LOC
  PER, born (XXXX), LOC
  PER's birthplace in LOC
5. Train a supervised classifier using thousands of patterns: P(born-in | f1, f2, f3, …, f70000)
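A sketch of the data-generation step (the toy KB tuples and three-sentence corpus repeat the example above; the surname matching and everything else is a simplifying assumption):

```python
kb = {("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")}  # Born-In tuples

corpus = [
    "Hubble was born in Marshfield",
    "Einstein, born (1879), Ulm",
    "Hubble's birthplace in Marshfield",
]

def distant_examples(kb, corpus):
    """Label every sentence mentioning both entities of a KB tuple as a
    positive training example for the relation (matching surnames only)."""
    examples = []
    for person, place in kb:
        surname = person.split()[-1]
        for sent in corpus:
            if surname in sent and place in sent:
                # a real system would extract thousands of features here
                examples.append((sent, person, place, "born-in"))
    return examples

for ex in distant_examples(kb, corpus):
    print(ex)   # 3 positive examples to feed the supervised classifier
```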
Unsupervised relation extraction
• Open Information Extraction: extract relations from the web with no training data and no list of relations
1. Use parsed data to train a "trustworthy tuple" classifier
2. Single-pass: extract all relations between NPs, keep if trustworthy
3. Assessor ranks relations based on text redundancy
  (FCI, specializes in, software development)
  (Tesla, invented, coil transformer)
M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI.
Evaluation of Semi-supervised and Unsupervised Relation Extraction
• Since it extracts totally new relations from the web, there is no gold set of correct instances of relations!
• Can't compute precision (don't know which ones are correct)
• Can't compute recall (don't know which ones were missed)
• Instead, we can approximate precision (only):
• Draw a random sample of relations from the output, check precision manually
• Can also compute precision at different levels of recall:
• Precision for the top 1,000 new relations, the top 10,000, the top 100,000
• In each case taking a random sample of that set
• But there is no way to evaluate recall
P̂ = (# of correctly extracted relations in the sample) / (total # of extracted relations in the sample)
The end