TRANSCRIPT
Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Relation Extraction
Marina Santini [email protected]
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2016
Previous Lecture: Question Answering
Question Answering systems
• Factoid questions: Google, Wolfram, Ask Jeeves, START, …
• Approaches: IR-based, knowledge-based, hybrid
Katz et al. (2006) http://start.csail.mit.edu/publications/FLAIRS0601KatzB.pdf
• START answers natural language questions by presenting components of text and multimedia information drawn from a set of information resources that are hosted locally or accessed remotely through the Internet.
• START targets high precision in its question answering.
• The START system analyzes English text and produces a knowledge base which incorporates, in the form of nested ternary expressions (= triples), the information found in the text.
Is it true?: http://uncyclopedia.wikia.com/wiki/Ask_Jeeves
• Ask Jeeves, more correctly known as Ask.com, is a search engine founded in 1996 in California.
• Initially it represented a stereotypical English butler who would "fetch" the answer to any question asked.
• Ask.com is now considered one of the great failures of the internet. The question and answer feature simply didn't work as well as hoped, after trying his hand at being both a traditional search engine and a terrible kind of "artificial AI" with a bald spot…
• These days Jeeves is ranked as the 4th most successful search engine on the web, and the 4th most successful overall. This seems impressive until you consider that Google holds the top spot with 95% of the market. It has even fallen behind Bing; enough said.
Search engines that can be used as QA systems
• Yahoo • Bing
Siri http://en.wikipedia.org/wiki/Siri
• Siri /ˈsɪri/ is an intelligent personal assistant and knowledge navigator which works as an application for Apple Inc.'s iOS.
• The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.
• The software, both in its original version and as an iOS application, adapts to the user's individual language usage and individual searches (preferences) with continuing use, and returns results that are individualized.
• The name Siri is Scandinavian, a short form of the Norse name Sigrid meaning "beauty" and "victory", and comes from the intended name for the original developer's first child.
Chatterbots
• Siri… a conversational "safety net".
• Conversational agents (chatter bots and personal assistants) → customer care, customer analytics (replacing/integrating FAQs and help desks)
Avatar: a picture of a person or animal that represents you on a computer screen, for example in some chat rooms or when you are playing games over the Internet
Eliza http://en.wikipedia.org/wiki/ELIZA ELIZA was written at MIT by Joseph Weizenbaum between 1964 and 1966
General IR architecture for factoid questions
[Figure: IR-based factoid QA pipeline. A Question goes through Question Processing (Query Formulation, Answer Type Detection); the formulated query is run against an indexed document collection (Indexing, Document Retrieval); relevant docs are narrowed down to passages (Passage Retrieval); Answer Processing then produces the Answer.]
Things to extract from the question
• Answer Type Detection: decide the named entity type (person, place) of the answer
• Query Formulation: choose query keywords for the IR system
• Question Type classification: is this a definition question, a math question, a list question?
• Focus Detection: find the question words that are replaced by the answer
• Relation Extraction: find relations between entities in the question
Common Evaluation Metrics
1. Accuracy (does the answer match the gold-labeled answer?)
2. Mean Reciprocal Rank:
• The reciprocal rank of a query response is the inverse of the rank of the first correct answer.
• The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q
MRR = (1/N) · Σ_{i=1}^{N} 1/rank_i
(N = |Q|, the number of queries; rank_i = rank of the first correct answer for query i)
Common Evaluation Metrics: MRR
• The mean reciprocal rank is the average of the reciprocal ranks of the results for a sample of queries Q (example adapted from Wikipedia).
• Suppose the system returns ranked answers for 3 queries, each list ordered with the answer it thinks most likely correct first, and the first correct answer appears at rank 3, rank 2, and rank 1 respectively.
• Given those 3 samples, the mean reciprocal rank is (1/3 + 1/2 + 1)/3 ≈ 0.61.
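A minimal sketch of this computation (the helper function and its input format are illustrative, not part of the lecture materials):

```python
def mean_reciprocal_rank(ranks):
    """MRR from the rank of the first correct answer per query
    (None if no correct answer was returned for that query)."""
    reciprocals = [1.0 / r if r is not None else 0.0 for r in ranks]
    return sum(reciprocals) / len(reciprocals)

# First correct answers at ranks 3, 2 and 1, as in the example above:
print(mean_reciprocal_rank([3, 2, 1]))  # 0.611... ≈ 0.61
```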
Complex questions: "What is the 'Hajj'?"
• The (bottom-up) snippet method: find a set of relevant documents; extract informative sentences from the documents (using tf-idf, MMR); order and modify the sentences into an answer.
• The (top-down) information extraction method: build specific answerers for different question types (definition questions, biography questions, certain medical questions).
Information that should be in the answer for 3 kinds of questions
[Figure: pipeline for the definition question "What is the Hajj?" (Ndocs=20, Len=8):
Document Retrieval (11 Web documents, 1127 total sentences) → Predicate Identification → Data-Driven Analysis (383 Non-Specific Definitional sentences; 9 Genus-Species sentences) → Sentence clusters, Importance ordering → Definition Creation.
Example Genus-Species sentences: "The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam." / "The Hajj is a milestone event in a Muslim's life." / "The hajj is one of five pillars that make up the foundation of Islam." / ...
Generated definition: "The Hajj, or pilgrimage to Makkah [Mecca], is the central duty of Islam. More than two million Muslims are expected to take the Hajj this year. Muslims must perform the hajj at least once in their lifetime if physically and financially able. The Hajj is a milestone event in a Muslim's life. The annual hajj begins in the twelfth month of the Islamic year (which is lunar, not solar, so that hajj and Ramadan fall sometimes in summer, sometimes in winter). The Hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. Another ceremony, which was not connected with the rites of the Ka'ba before the rise of Islam, is the Hajj, the annual pilgrimage to 'Arafat, about two miles east of Mecca, toward Mina…"]
Architecture for complex question answering: definition questions
S. Blair-Goldensohn, K. McKeown and A. Schlaikjer. 2004. Answering Definition Questions: A Hybrid Approach.
State of the art: examples
• Top down: Ming Tan, Cicero dos Santos, Bing Xiang & Bowen Zhou. 2015. LSTM-Based Deep Learning Models for Non-Factoid Answer Selection.
• Di Wang and Eric Nyberg. 2015. A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering. In ACL 2015.
• Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou. 2015. Applying Deep Learning to Answer Selection: A Study and an Open Task.
Deep Learning is a new area of Machine Learning research, said to be very promising. It is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. It is based on neural networks.
Practical activity
• START seems to be limited, but it understands natural language.
• Google (presumably helped by the Knowledge Graph) is more accurate, but skips natural language (uses keywords).
• Google is customized to the users' preferences (different results).
• Interesting outcomes: Currency vs. Coin; What's love?; Lyric/song vs. Definition question
What’s the meaning of life?
Presumably from Knowledge Graph…
Start and the 42 puzzle
End of previous lecture
Acknowledgements
Most slides borrowed or adapted from:
Dan Jurafsky and Christopher Manning, Coursera
Dan Jurafsky and James H. Martin (2015)
J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
Relation Extraction
What is relation extraction?
Extracting relations from text
• Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…"
• Extracted complex relation: Company-Founding
  Company: IBM
  Location: New York
  Date: June 16, 1911
  Original-Name: Computing-Tabulating-Recording Co.
• But we will focus on the simpler task of extracting relation triples:
  Founding-year(IBM, 1911)
  Founding-location(IBM, New York)
Extracting Relation Triples from Text
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891
  Stanford EQ Leland Stanford Junior University
  Stanford LOC-IN California
  Stanford IS-A research university
  Stanford LOC-NEAR Palo Alto
  Stanford FOUNDED-IN 1891
  Stanford FOUNDER Leland Stanford
Why Relation Extraction?
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
• Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
• Support question answering: The granddaughter of which actor starred in the movie "E.T."?
  (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)
• But which relations should we extract?
Automated Content Extraction (ACE)
[Figure: the ACE relation types and subtypes ("Relation Extraction Task"):
• PHYSICAL: Located, Near
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• ORG AFFILIATION: Employment, Membership, Ownership, Founder, Student-Alum, Investor, Sports-Affiliation
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ARTIFACT: User-Owner-Inventor-Manufacturer]
Automatic Content Extraction (ACE) is a research program for developing advanced information extraction technologies. Given a text in natural language, the ACE challenge is to detect: • entities • relations between entities • events
Automated Content Extraction (ACE)
• Physical-Located PER-GPE: He was in Tennessee
• Part-Whole-Subsidiary ORG-ORG: XYZ, the parent company of ABC
• Person-Social-Family PER-PER: John's wife Yoko
• Org-AFF-Founder PER-ORG: Steve Jobs, co-founder of Apple…
UMLS: Unified Medical Language System
• 134 entity types, 54 relations
  Injury disrupts Physiological Function
  Bodily Location location-of Biologic Function
  Anatomical Structure part-of Organism
  Pharmacologic Substance causes Pathological Function
  Pharmacologic Substance treats Pathologic Function
Extracting UMLS relations from a sentence
Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
  ⇓
Echocardiography, Doppler DIAGNOSES Acquired stenosis
Databases of Wikipedia Relations
Relations extracted from the infobox:
  Stanford | state | California
  Stanford | motto | "Die Luft der Freiheit weht"
  …
Wikipedia Infobox
Relation databases that draw from Wikipedia
• Resource Description Framework (RDF) triples: subject predicate object
  Golden Gate Park | location | San Francisco
  dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
• The DBpedia project uses the Resource Description Framework (RDF) to represent the extracted information and consists of 3 billion RDF triples, 580 million extracted from the English edition of Wikipedia and 2.46 billion from other language editions (Wikipedia, March 2016).
• Frequent Freebase relations: people/person/nationality, location/location/contains, people/person/profession, people/person/place-of-birth, biology/organism_higher_classification, film/film/genre
DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project.
Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members (cf. Semantic Web) → Knowledge Graph: https://en.wikipedia.org/wiki/Freebase
How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
• Bootstrapping (using seeds)
• Distant supervision
• Unsupervised learning from the web
Relation Extraction
Using patterns to extract relations
Rules for extracting the IS-A relation
Early intuition from Hearst (1992)
• "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
• What does Gelidium mean? How do you know?
Hearst's Patterns for extracting IS-A relations
(Hearst, 1992): Automatic Acquisition of Hyponyms
  "Y such as X ((, X)* (, and|or) X)"
  "such Y as X"
  "X or other Y"
  "X and other Y"
  "Y including X"
  "Y, especially X"
Hearst's Patterns for extracting IS-A relations

Hearst pattern | Example occurrences
X and other Y | ...temples, treasuries, and other important civic buildings.
X or other Y | Bruises, wounds, broken bones or other injuries...
Y such as X | The bow lute, such as the Bambara ndang...
Such Y as X | ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X | ...common-law countries, including Canada and England...
Y, especially X | European countries, especially France, England, and Spain...
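A minimal sketch of matching two of these patterns with plain regular expressions (the naive noun-phrase pattern and the function name are illustrative assumptions; real systems match over POS-tagged or chunked text):

```python
import re

# Naive "noun phrase": 1-3 words (an assumption made for this demo).
NP = r"[A-Za-z][A-Za-z-]*(?:\s[A-Za-z][A-Za-z-]*){0,2}"

PATTERNS = [
    re.compile(rf"(?P<y>{NP}),?\s+such as\s+(?P<x>{NP})"),  # Y such as X
    re.compile(rf"(?P<x>{NP})\s+and other\s+(?P<y>{NP})"),  # X and other Y
]

def hearst_isa(sentence):
    """Return (hyponym X, hypernym Y) pairs found by the patterns."""
    return [(m.group("x"), m.group("y"))
            for p in PATTERNS for m in p.finditer(sentence)]

print(hearst_isa("a mixture of red algae, such as Gelidium"))
# [('Gelidium', 'red algae')], i.e. Gelidium IS-A red algae
```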
Hand-built patterns for relations
• Plus:
• Human patterns tend to be high-precision
• Can be tailored to specific domains
• Minus:
• Human patterns are often low-recall
• A lot of work to think of all possible patterns!
• Don't want to have to do this for every relation!
• We'd like better accuracy
Relation Extraction
Supervised relation extraction
Supervised machine learning for relations
• Choose a set of relations we'd like to extract
• Choose a set of relevant named entities
• Find and label data
• Choose a representative corpus
• Label the named entities in the corpus
• Hand-label the relations between these entities
• Break into training, development, and test sets
• Train a classifier on the training set
How to do classification in supervised relation extraction
1. Find all pairs of named entities (usually in the same sentence)
2. Decide if the 2 entities are related
3. If yes, classify the relation
• Why the extra step?
• Faster classification training by eliminating most pairs
• Can use distinct feature sets appropriate for each task
Word Features for Relation Extraction
• Headwords of M1 and M2, and their combination: Airlines, Wagner, Airlines-Wagner
• Bag of words and bigrams in M1 and M2: {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2: M2 -1 spokesman, M2 +1 said
• Bag of words or bigrams between the two entities: {a, AMR, of, immediately, matched, move, spokesman, the, unit}
Example: American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
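A sketch of how some of these word features could be computed (the tokenisation, the (start, end) span format, and the function name are assumptions made for this example):

```python
def word_features(tokens, m1, m2):
    """tokens: list of words; m1, m2: (start, end) token spans of the
    two mentions, with m1 occurring before m2."""
    feats = {}
    head1, head2 = tokens[m1[1] - 1], tokens[m2[1] - 1]  # crude head = last token
    feats["head_m1"], feats["head_m2"] = head1, head2
    feats["head_pair"] = head1 + "-" + head2
    # Words in particular positions left/right of the mentions
    if m1[0] > 0:
        feats["m1_-1"] = tokens[m1[0] - 1]
    feats["m2_-1"] = tokens[m2[0] - 1]
    if m2[1] < len(tokens):
        feats["m2_+1"] = tokens[m2[1]]
    # Bag of words between the two mentions
    for w in tokens[m1[1]:m2[0]]:
        feats["between=" + w] = True
    return feats

sent = ("American Airlines , a unit of AMR , immediately matched "
        "the move , spokesman Tim Wagner said").split()
print(word_features(sent, (0, 2), (14, 16)))
# head_pair: Airlines-Wagner, m2_-1: spokesman, m2_+1: said, between={a, unit, ...}
```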
Named Entity Type and Mention Level Features for Relation Extraction
• Named-entity types: M1: ORG, M2: PERSON
• Concatenation of the two named-entity types: ORG-PERSON
• Entity level of M1 and M2 (NAME, NOMINAL, PRONOUN)
• M1: NAME [it or he would be PRONOUN]
• M2: NAME [the company would be NOMINAL]
Example: American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
Parse Features for Relation Extraction
• Base syntactic chunk sequence from one mention to the other: NP NP PP VP NP NP
• Constituent path through the tree from one to the other: NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency path: Airlines matched Wagner said
Example: American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
[Figure: parse of "American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said."]
Classifiers for supervised methods
• Now you can use any classifier you like: MaxEnt, Naïve Bayes, SVM, ...
• Train it on the training set, tune on the dev set, test on the test set
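For instance, a minimal sketch assuming scikit-learn and toy feature dicts like those above (the training data and labels are made up; LogisticRegression stands in for MaxEnt):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression  # ≈ MaxEnt
from sklearn.pipeline import make_pipeline

# Toy training data: feature dicts and gold relation labels (made up).
X_train = [{"head_pair": "Airlines-Wagner", "between=spokesman": True},
           {"head_pair": "IBM-York", "between=incorporated": True}]
y_train = ["employment", "founding-location"]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)                                # train on the training set
print(clf.predict([{"head_pair": "Airlines-Wagner"}]))   # -> ['employment']
```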
Evaluation of Supervised Relation Extraction
• Compute P/R/F1 for each relation:
  P = (# of correctly extracted relations) / (total # of extracted relations)
  R = (# of correctly extracted relations) / (total # of gold relations)
  F1 = 2PR / (P + R)
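For instance (a made-up run, only to illustrate the formulas): if a system extracts 10 relation instances of which 6 are correct, and the gold standard contains 12 relations, then P = 6/10 = 0.60, R = 6/12 = 0.50, and F1 = 2 · 0.60 · 0.50 / (0.60 + 0.50) ≈ 0.55.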
Summary: Supervised Relation Extraction
+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set
- Labeling a large training set is expensive
- Supervised models are brittle; they don't generalize well to different genres
Relation Extraction
Semi-supervised and unsupervised relation extraction
Seed-based or bootstrapping approaches to relation extraction
• No training set? Maybe you have:
• A few seed tuples, or
• A few high-precision patterns
• Can you use those seeds to do something useful?
• Bootstrapping: use the seeds to directly learn to populate a relation
Roughly speaking: use seeds to initialize a process of annotation, then refine through iterations.
Relation Bootstrapping (Hearst 1992)
• Gather a set of seed pairs that have relation R
• Iterate:
1. Find sentences with these pairs
2. Look at the context between or around the pair and generalize the context to create patterns
3. Use the patterns to grep for more pairs
Bootstrapping
• <Mark Twain, Elmira> seed tuple
• Grep (Google) for the environments of the seed tuple:
  "Mark Twain is buried in Elmira, NY." → X is buried in Y
  "The grave of Mark Twain is in Elmira" → The grave of X is in Y
  "Elmira is Mark Twain's final resting place" → Y is X's final resting place
• Use those patterns to grep for new tuples
• Iterate
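A minimal sketch of one such iteration over the three example sentences (the tiny "corpus" and the whole-sentence templates are simplifications; real systems generalize just the context around the pair and search the web):

```python
import re

corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira.",
    "Elmira is Mark Twain's final resting place.",
]
seeds = {("Mark Twain", "Elmira")}

def learn_patterns(seeds, corpus):
    """Turn each sentence containing a seed pair into a template
    by replacing the two entities with capture slots."""
    patterns = set()
    for x, y in seeds:
        for sent in corpus:
            if x in sent and y in sent:
                pat = re.escape(sent)
                pat = pat.replace(re.escape(x), "(?P<x>.+?)")
                pat = pat.replace(re.escape(y), "(?P<y>.+?)")
                patterns.add(pat)
    return patterns

def harvest(patterns, corpus):
    """Grep the corpus with the learned patterns to collect pairs."""
    return {(m.group("x"), m.group("y"))
            for pat in patterns for sent in corpus
            for m in [re.search(pat, sent)] if m}

patterns = learn_patterns(seeds, corpus)
print(harvest(patterns, corpus))  # on a bigger corpus this would add new pairs
```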
Dipre: Extract <author, book> pairs
Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
• Start with 5 seeds:
  Author | Book
  Isaac Asimov | The Robots of Dawn
  David Brin | Startide Rising
  James Gleick | Chaos: Making a New Science
  Charles Dickens | Great Expectations
  William Shakespeare | The Comedy of Errors
• Find instances:
  The Comedy of Errors, by William Shakespeare, was
  The Comedy of Errors, by William Shakespeare, is
  The Comedy of Errors, one of William Shakespeare's earliest attempts
  The Comedy of Errors, one of William Shakespeare's most
• Extract patterns (group by middle, take longest common prefix/suffix):
  ?x , by ?y ,
  ?x , one of ?y 's
• Now iterate, finding new seeds that match the patterns
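A sketch of the pattern-generation step (the (before, middle, after) context tuples and helper names are illustrative; os.path.commonprefix is just a convenient longest-common-prefix utility):

```python
import os

def common_suffix(strings):
    """Longest common suffix = reversed common prefix of reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def dipre_patterns(occurrences):
    """occurrences: (before, middle, after) contexts around <?x=book, ?y=author>.
    Group by the middle string, then keep the longest shared context."""
    by_middle = {}
    for before, middle, after in occurrences:
        by_middle.setdefault(middle, []).append((before, after))
    return [common_suffix([b for b, _ in ctxs]) + "?x" + middle + "?y"
            + os.path.commonprefix([a for _, a in ctxs])
            for middle, ctxs in by_middle.items()]

# Contexts of the four instances above (segmentation made up for the demo):
occs = [("", ", by ", ", was"),
        ("", ", by ", ", is"),
        ("", ", one of ", "'s earliest attempts"),
        ("", ", one of ", "'s most")]
print(dipre_patterns(occs))  # ['?x, by ?y, ', "?x, one of ?y's "]
```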
Distant Supervision
• Combine bootstrapping with supervised learning
• Instead of 5 seeds, use a large database to get a huge number of seed examples
• Create lots of features from all these examples
• Combine in a supervised classifier
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.
Distant supervision paradigm
• Like supervised classification:
• Uses a classifier with lots of features
• Supervised by detailed hand-created knowledge
• Doesn't require iteratively expanding patterns
• Like unsupervised classification:
• Uses very large amounts of unlabeled data
• Not sensitive to genre issues in the training corpus
Distantly supervised learning of relation extraction patterns
1. For each relation (e.g. Born-In)
2. For each tuple in a big database: <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>
3. Find sentences in a large corpus with both entities:
  Hubble was born in Marshfield
  Einstein, born (1879), Ulm
  Hubble's birthplace in Marshfield
4. Extract frequent features (parse, words, etc.):
  PER was born in LOC
  PER, born (XXXX), LOC
  PER's birthplace in LOC
5. Train a supervised classifier using thousands of patterns: P(born-in | f1, f2, f3, …, f70000)
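A sketch of the data-generation step (the toy KB tuples and three-sentence corpus repeat the example above; the surname matching and everything else is a simplifying assumption):

```python
kb = {("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")}  # Born-In tuples

corpus = [
    "Hubble was born in Marshfield",
    "Einstein, born (1879), Ulm",
    "Hubble's birthplace in Marshfield",
]

def distant_examples(kb, corpus):
    """Label every sentence mentioning both entities of a KB tuple as a
    positive training example for the relation (matching surnames only)."""
    examples = []
    for person, place in kb:
        surname = person.split()[-1]
        for sent in corpus:
            if surname in sent and place in sent:
                # a real system would extract thousands of features here
                examples.append((sent, person, place, "born-in"))
    return examples

for ex in distant_examples(kb, corpus):
    print(ex)   # 3 positive examples to feed the supervised classifier
```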
Unsupervised relation extraction
• Open Information Extraction: extract relations from the web with no training data and no list of relations
1. Use parsed data to train a "trustworthy tuple" classifier
2. Single-pass: extract all relations between NPs, keep if trustworthy
3. Assessor ranks relations based on text redundancy
  (FCI, specializes in, software development)
  (Tesla, invented, coil transformer)
M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI.
Evaluation of Semi-supervised and Unsupervised Relation Extraction
• Since it extracts totally new relations from the web, there is no gold set of correct instances of relations!
• Can't compute precision (don't know which ones are correct)
• Can't compute recall (don't know which ones were missed)
• Instead, we can approximate precision (only):
• Draw a random sample of relations from the output, check precision manually
• Can also compute precision at different levels of recall:
• Precision for the top 1,000 new relations, the top 10,000, the top 100,000
• In each case taking a random sample of that set
• But there is no way to evaluate recall
P̂ = (# of correctly extracted relations in the sample) / (total # of extracted relations in the sample)
The end