Relation Extraction


Information Extraction
Slides by Dan Jurafsky

Relation Extraction: What is relation extraction?

Extracting relations from text
Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)..."

Extracted complex relation:
Company-Founding
  Company: IBM
  Location: New York
  Date: June 16, 1911
  Original-Name: Computing-Tabulating-Recording Co.

But we will focus on the simpler task of extracting relation triples:
Founding-year(IBM, 1911)
Founding-location(IBM, New York)
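
A relation triple is just a typed (relation, subject, object) record. As an illustrative sketch (not part of the original slides), the target output can be modeled like this:

```python
from typing import NamedTuple

class RelationTriple(NamedTuple):
    """A binary relation between two entities: relation(subject, object)."""
    relation: str
    subject: str
    obj: str

# The two triples extracted from the IBM company report above.
triples = [
    RelationTriple("Founding-year", "IBM", "1911"),
    RelationTriple("Founding-location", "IBM", "New York"),
]

for t in triples:
    print(f"{t.relation}({t.subject}, {t.obj})")
```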

Extracting Relation Triples from Text
"The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California, near Palo Alto, California." "Leland Stanford founded the university in 1891."

Extracted triples:
- Stanford EQ Leland Stanford Junior University
- Stanford LOC-IN California
- Stanford IS-A research university
- Stanford LOC-NEAR Palo Alto
- Stanford FOUNDED-IN 1891
- Stanford FOUNDER Leland Stanford

Why Relation Extraction?
- Create new structured knowledge bases, useful for any app
- Augment current knowledge bases: add words to the WordNet thesaurus, facts to Freebase or DBpedia
- Support question answering: "The granddaughter of which actor starred in the movie E.T.?"
  (acted-in ?x E.T.) (is-a ?y actor) (granddaughter-of ?x ?y)
- But which relations should we extract?

Automated Content Extraction (ACE)

17 relations from the 2008 Relation Extraction Task. Examples:
- Physical-Located PER-GPE: "He was in Tennessee"
- Part-Whole-Subsidiary ORG-ORG: "XYZ, the parent company of ABC"
- Person-Social-Family PER-PER: "John's wife Yoko"
- Org-AFF-Founder PER-ORG: "Steve Jobs, co-founder of Apple"

UMLS: Unified Medical Language System
134 entity types, 54 relations. Examples:
- Injury disrupts Physiological Function
- Bodily Location location-of Biologic Function
- Anatomical Structure part-of Organism
- Pharmacologic Substance causes Pathologic Function
- Pharmacologic Substance treats Pathologic Function

Extracting UMLS relations from a sentence:
"Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes"
→ Echocardiography, Doppler DIAGNOSES Acquired stenosis

Databases of Wikipedia Relations

Relations extracted from the Wikipedia infobox for Stanford:
- Stanford state California
- Stanford motto "Die Luft der Freiheit weht"

Relation databases that draw from Wikipedia
Resource Description Framework (RDF) triples: subject predicate object
- Golden Gate Park location San Francisco
- dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
DBpedia: 1 billion RDF triples, 385 million from English Wikipedia
Frequent Freebase relations:
- people/person/nationality, location/location/contains
- people/person/profession, people/person/place-of-birth
- biology/organism_higher_classification, film/film/genre

Ontological relations
- IS-A (hypernym): subsumption between classes
  Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal
- Instance-of: relation between an individual and a class
  San Francisco instance-of city

(Examples from the WordNet thesaurus.)

How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised:
   - Bootstrapping (using seeds)
   - Distant supervision
   - Unsupervised learning from the web

Relation Extraction: Using patterns to extract relations

Rules for extracting the IS-A relation
Early intuition from Hearst (1992):
"Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
What does "Gelidium" mean? How do you know?

Hearst's patterns for extracting IS-A relations (Hearst, 1992: Automatic Acquisition of Hyponyms):
- Y such as X ((, X)* (, and|or) X)
- such Y as X
- X or other Y
- X and other Y
- Y including X
- Y, especially X

Hearst pattern | Example occurrences
X and other Y | ...temples, treasuries, and other important civic buildings.
X or other Y | Bruises, wounds, broken bones or other injuries...
Y such as X | The bow lute, such as the Bambara ndang...
such Y as X | ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X | ...common-law countries, including Canada and England...
Y, especially X | European countries, especially France, England, and Spain...
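
A minimal regex sketch of two of these patterns. Real systems match over NP chunks from a parser or chunker; the crude noun-phrase regex here is only a stand-in for that machinery:

```python
import re

NP = r"[A-Z]?[a-z]+(?: [a-z]+)*"   # crude stand-in for an NP chunk

HEARST_PATTERNS = [
    # "Y such as X"  ->  X IS-A Y
    (re.compile(rf"({NP}),? such as ({NP})"), "such_as"),
    # "X and other Y"  ->  X IS-A Y
    (re.compile(rf"({NP}),? and other ({NP})"), "and_other"),
]

def extract_isa(sentence):
    """Yield (hyponym, hypernym) pairs matched by the patterns."""
    for pattern, name in HEARST_PATTERNS:
        for m in pattern.finditer(sentence):
            if name == "such_as":
                yield (m.group(2), m.group(1))   # X IS-A Y
            else:
                yield (m.group(1), m.group(2))

print(list(extract_isa("red algae such as Gelidium")))
# [('Gelidium', 'red algae')]
```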

Extracting Richer Relations Using Rules
Intuition: relations often hold between specific entity types:
- located-in (ORGANIZATION, LOCATION)
- founded (PERSON, ORGANIZATION)
- cures (DRUG, DISEASE)
Start with named entity tags to help extract the relation!

But named entities aren't quite enough. Which relations hold between two entities?
- DRUG and DISEASE: Cure? Prevent? Cause?

- PERSON and ORGANIZATION: Founder? Investor? Member? Employee? President?

Extracting Richer Relations Using Rules and Named Entities
Who holds what office in what organization?
- PERSON, POSITION of ORG
  "George Marshall, Secretary of State of the United States"
- PERSON (named|appointed|chose|...) PERSON Prep? POSITION
  "Truman appointed Marshall Secretary of State"
- PERSON [be]? (named|appointed|...) Prep? ORG POSITION
  "George Marshall was named US Secretary of State"
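
These rules marry surface patterns to NER output. A minimal sketch, assuming a hypothetical inline tagging scheme (<PER>, <POS>) produced by an upstream named entity tagger; the tag format is an assumption, not a standard API:

```python
import re

# One of the office-holding patterns above:
# PERSON (named|appointed|chose) PERSON Prep? POSITION
APPOINT = re.compile(
    r"<PER>(?P<appointer>[^<]+)</PER> "
    r"(?:named|appointed|chose) "
    r"<PER>(?P<person>[^<]+)</PER> "
    r"(?:as |to )?<POS>(?P<position>[^<]+)</POS>"
)

sent = ("<PER>Truman</PER> appointed <PER>Marshall</PER> "
        "<POS>Secretary of State</POS>")
m = APPOINT.search(sent)
if m:
    print(f"holds-office({m.group('person')}, {m.group('position')})")
# holds-office(Marshall, Secretary of State)
```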

Hand-built patterns for relations
Plus:
- Human patterns tend to be high-precision
- Can be tailored to specific domains
Minus:
- Human patterns are often low-recall
- It's a lot of work to think of all possible patterns!
- We don't want to have to do this for every relation
- We'd like better accuracy

Relation Extraction: Supervised relation extraction

Supervised machine learning for relations
1. Choose a set of relations we'd like to extract
2. Choose a set of relevant named entities
3. Find and label data:
   - Choose a representative corpus
   - Label the named entities in the corpus
   - Hand-label the relations between these entities
   - Break into training, development, and test sets
4. Train a classifier on the training set

How to do classification in supervised relation extraction (sketched in code below)
1. Find all pairs of named entities (usually in the same sentence)
2. Decide whether the two entities are related
3. If yes, classify the relation
Why the extra step?
- Faster classification training by eliminating most pairs
- Can use distinct feature sets appropriate for each task
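
A sketch of that two-step decision. Here is_related and classify_relation stand in for the two trained classifiers (the cheap binary filter and the multi-class relation labeler):

```python
from itertools import combinations

def extract_relations(mentions, sentence, is_related, classify_relation):
    """Two-step supervised relation extraction over one sentence.

    mentions:          named-entity mentions found in the sentence
    is_related:        cheap binary classifier, (m1, m2, sent) -> bool
    classify_relation: multi-class classifier, (m1, m2, sent) -> label
    """
    relations = []
    # For ordered relations, use permutations() instead.
    for m1, m2 in combinations(mentions, 2):
        if not is_related(m1, m2, sentence):        # step 1: filter most pairs
            continue
        label = classify_relation(m1, m2, sentence) # step 2: full classifier
        relations.append((label, m1, m2))
    return relations
```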

Automated Content Extraction (ACE)
17 sub-relations of 6 relations from the 2008 Relation Extraction Task.

Classifying the relation between two entities in a sentence:
"American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said."
Candidate labels: SUBSIDIARY, FAMILY, EMPLOYMENT, FOUNDER, CITIZEN, INVENTOR, or NIL

Word Features for Relation Extraction
Running example: "American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said" (Mention 1 = American Airlines, Mention 2 = Tim Wagner)
- Headwords of M1 and M2, and their combination: Airlines, Wagner, Airlines-Wagner
- Bag of words and bigrams in M1 and M2: {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
- Words or bigrams in particular positions left and right of M1/M2: M2 -1: spokesman; M2 +1: said
- Bag of words or bigrams between the two entities: {a, AMR, of, immediately, matched, move, spokesman, the, unit}

Named Entity Type and Mention Level Features for Relation Extraction
- Named-entity types: M1: ORG, M2: PERSON
- Concatenation of the two named-entity types: ORG-PERSON
- Entity level of M1 and M2 (NAME, NOMINAL, PRONOUN): M1: NAME, M2: NAME ("it" or "he" would be PRONOUN; "the company" would be NOMINAL)

Parse Features for Relation Extraction
- Base syntactic chunk sequence from one mention to the other: NP NP PP VP NP NP
- Constituent path through the tree from one to the other: NP ↑ NP ↑ S ↑ S ↓ NP
- Dependency path: Airlines ← matched ← said → Wagner

Gazetteer and Trigger Word Features for Relation Extraction
- Trigger list for family: kinship terms (parent, wife, husband, grandparent, etc., from WordNet)
- Gazetteers: lists of useful geo or geopolitical words (country name lists, other sub-entities)
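
A minimal sketch of the word features above, with mentions given as token-offset spans and headwords approximated as each mention's last token:

```python
def word_features(tokens, m1_span, m2_span):
    """Word features for one mention pair, following the slide."""
    feats = {}
    # Headwords of M1 and M2, and their combination.
    h1, h2 = tokens[m1_span[1] - 1], tokens[m2_span[1] - 1]
    feats["head_m1"], feats["head_m2"] = h1, h2
    feats["head_pair"] = f"{h1}-{h2}"
    # Words in particular positions left and right of M2.
    if m2_span[0] > 0:
        feats["m2_minus_1"] = tokens[m2_span[0] - 1]
    if m2_span[1] < len(tokens):
        feats["m2_plus_1"] = tokens[m2_span[1]]
    # Bag of words between the two mentions.
    for w in tokens[m1_span[1]:m2_span[0]]:
        feats["between_" + w] = 1
    return feats

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
print(word_features(tokens, (0, 2), (14, 16)))
```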

Classifiers for supervised methods
Now you can use any classifier you like: MaxEnt, Naive Bayes, SVM, ...
Train it on the training set, tune on the dev set, test on the test set.

Evaluation of Supervised Relation Extraction
Compute P/R/F1 for each relation:
  P = # of correctly extracted relations / # of extracted relations
  R = # of correctly extracted relations / # of gold relations
  F1 = 2PR / (P + R)
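
A sketch with scikit-learn: feature dicts like those above go through a DictVectorizer into a logistic-regression (MaxEnt) classifier, and classification_report prints per-relation P/R/F1. The two training examples are stand-ins, and the evaluation here is only a placeholder for a real held-out test set:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Tiny stand-in dataset: feature dicts (like those above) with labels.
X_train = [{"head_pair": "Airlines-Wagner", "between_unit": 1},
           {"head_pair": "Jobs-Apple", "between_co-founder": 1}]
y_train = ["EMPLOYMENT", "FOUNDER"]

# Logistic regression is the MaxEnt classifier named on the slide.
clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)

# Placeholder evaluation: reuse the training data; in practice, predict
# on a held-out test set.  classification_report prints per-relation
# precision, recall, and F1.
print(classification_report(y_train, clf.predict(X_train)))
```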

Summary: Supervised Relation Extraction
+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set
- Labeling a large training set is expensive
- Supervised models are brittle and don't generalize well to different genres

Relation Extraction: Semi-supervised and unsupervised relation extraction

Seed-based or bootstrapping approaches to relation extraction
No training set? Maybe you have:
- A few seed tuples, or
- A few high-precision patterns
Can you use those seeds to do something useful?
Bootstrapping: use the seeds to directly learn to populate a relation.

Relation Bootstrapping (Hearst 1992)
Gather a set of seed pairs that have relation R. Iterate:
1. Find sentences with these pairs
2. Look at the context between or around the pair and generalize the context to create patterns
3. Use the patterns to grep for more pairs

Bootstrapping example
Seed tuple: <Mark Twain, Elmira>
Grep (Google) for the environments of the seed tuple:
- "Mark Twain is buried in Elmira, NY." → X is buried in Y
- "The grave of Mark Twain is in Elmira" → The grave of X is in Y
- "Elmira is Mark Twain's final resting place" → Y is X's final resting place
Use those patterns to grep for new tuples. Iterate.
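
A minimal sketch of this loop over a toy in-memory corpus; a real system queries a search engine and generalizes contexts far more carefully than the string surgery here:

```python
import re

def bootstrap(seed_pairs, corpus, n_iter=2):
    """Sketch of the bootstrapping loop over a toy corpus."""
    pairs, patterns = set(seed_pairs), set()
    for _ in range(n_iter):
        # Steps 1-2: find sentences containing a known pair, then
        # generalize the context into a pattern, e.g.
        # "Mark Twain is buried in Elmira" -> "X is buried in Y".
        for s in corpus:
            for x, y in pairs:
                if x in s and y in s:
                    patterns.add(s.replace(x, "X").replace(y, "Y"))
        # Step 3: grep the corpus with each pattern for new pairs.
        for p in patterns:
            rx = (re.escape(p)
                  .replace("X", r"([A-Z][\w. ]+?)")
                  .replace("Y", r"([A-Z][\w. ]+?)$"))
            for s in corpus:
                m = re.match(rx, s)
                if m:
                    pairs.add((m.group(1), m.group(2)))
    return pairs, patterns

corpus = ["Mark Twain is buried in Elmira",
          "Robert Frost is buried in Bennington"]
pairs, patterns = bootstrap({("Mark Twain", "Elmira")}, corpus)
print(patterns)  # {'X is buried in Y'}
print(pairs)     # now also includes ('Robert Frost', 'Bennington')
```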

DIPRE: Extract (author, book) pairs
Start with 5 seeds:

Author | Book
Isaac Asimov | The Robots of Dawn
David Brin | Startide Rising
James Gleick | Chaos: Making a New Science
Charles Dickens | Great Expectations
William Shakespeare | The Comedy of Errors

Find instances:
- "The Comedy of Errors, by William Shakespeare, was"
- "The Comedy of Errors, by William Shakespeare, is"
- "The Comedy of Errors, one of William Shakespeare's earliest attempts"
- "The Comedy of Errors, one of William Shakespeare's most"
Extract patterns (group by middle, take longest common prefix/suffix):
- ?x , by ?y ,
- ?x , one of ?y 's
Now iterate, finding new seeds that match the patterns.

Sergey Brin. 1998. Extracting Patterns and Relations from the World Wide Web.
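
The pattern-generation step, sketched: collect each occurrence of a seed pair as a (prefix, middle, suffix) context triple, group occurrences that share the same middle, and keep the longest context common to each side (os.path.commonprefix does the string work):

```python
import os

def dipre_pattern(occurrences):
    """DIPRE pattern generation for one group of occurrences (sketch).

    occurrences: (prefix, middle, suffix) strings around a seed pair;
    all members of the group must share the same middle.
    """
    prefixes, middles, suffixes = zip(*occurrences)
    assert len(set(middles)) == 1, "group occurrences by middle first"
    # Longest common suffix of the prefixes (reverse, commonprefix,
    # reverse back) and longest common prefix of the suffixes.
    pre = os.path.commonprefix([p[::-1] for p in prefixes])[::-1]
    post = os.path.commonprefix(list(suffixes))
    return f"{pre}?x{middles[0]}?y{post}"

# Contexts of <The Comedy of Errors, William Shakespeare> from above:
occ = [("", ", by ", ", was"),
       ("", ", by ", ", is")]
print(repr(dipre_pattern(occ)))   # '?x, by ?y, '
```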

Snowball
A similar iterative algorithm:
- Group instances with similar prefixes, middles, and suffixes, and extract patterns
- But require that X and Y be named entities
- And compute a confidence for each pattern

Example patterns and confidences:
- ORGANIZATION {'s, in, headquarters} LOCATION (conf .69)
- LOCATION {in, based} ORGANIZATION (conf .75)

Seed tuples:

Organization | Location of Headquarters
Microsoft | Redmond
Exxon | Irving
IBM | Armonk

E. Agichtein and L. Gravano. 2000. Snowball: Extracting Relations from Large Plain-Text Collections. ICDL.
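
Snowball scores a candidate pattern by how well its extractions agree with the trusted tuples. A minimal sketch of that confidence estimate, using the seed table above as the trusted set:

```python
def pattern_confidence(pattern_matches, known):
    """Snowball-style pattern confidence (sketch).

    pattern_matches: (org, loc) pairs the candidate pattern extracted
    known:           trusted tuples as a dict, org -> headquarters loc
    conf(P) = positive / (positive + negative), where a match counts
    as negative if it contradicts a trusted tuple.
    """
    pos = neg = 0
    for org, loc in pattern_matches:
        if org in known:
            if known[org] == loc:
                pos += 1
            else:
                neg += 1
    return pos / (pos + neg) if pos + neg else 0.0

seeds = {"Microsoft": "Redmond", "Exxon": "Irving", "IBM": "Armonk"}
matches = [("Microsoft", "Redmond"), ("IBM", "Armonk"), ("IBM", "Paris")]
print(round(pattern_confidence(matches, seeds), 2))   # 0.67
```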

Distant Supervision
Combine bootstrapping with supervised learning:
- Instead of 5 seeds, use a large database to get a huge number of seed examples
- Create lots of features from all these examples
- Combine in a supervised classifier

Snow, Jurafsky, and Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.
Mintz, Bills, Snow, and Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.

Distant supervision paradigm
Like supervised classification:
- Uses a classifier with lots of features
- Supervised by detailed hand-created knowledge
- Doesn't require iteratively expanding patterns
Like unsupervised classification:
- Uses very large amounts of unlabeled data
- Not sensitive to genre issues in the training corpus

Distantly supervised learning of relation extraction patterns
1. For each relation (e.g., Born-In)
2. For each tuple in a big database (e.g., <Hubble, Marshfield>, <Einstein, Ulm>)
3. Find sentences in a large corpus with both entities:
   - "Hubble was born in Marshfield"
   - "Einstein, born (1879), Ulm"
   - "Hubble's birthplace in Marshfield"
4. Extract frequent features (parse, words, etc.):
   - PER was born in LOC
   - PER, born (XXXX), LOC
   - PER's birthplace in LOC
5. Train a supervised classifier using thousands of patterns:
   P(born-in | f1, f2, f3, ..., f70000)
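
A minimal sketch of steps 2-4: every sentence that mentions both entities of a database tuple becomes a (noisily labeled) training example, with the entity-masked sentence as a crude pattern feature. A real system would also generalize numbers (e.g. the year to XXXX) and add parse features; the resulting examples would then feed a classifier like the scikit-learn sketch earlier:

```python
def distant_supervision_examples(db_tuples, corpus, relation):
    """Build noisy training examples for `relation` (sketch).

    db_tuples: entity pairs from a big database,
               e.g. [("Hubble", "Marshfield"), ("Einstein", "Ulm")]
    corpus:    a large collection of sentences
    """
    examples = []
    for per, loc in db_tuples:
        for sent in corpus:
            if per in sent and loc in sent:
                # Lexical pattern feature: mask the two entities.
                pattern = sent.replace(per, "PER").replace(loc, "LOC")
                examples.append(({"pattern": pattern}, relation))
    return examples

corpus = ["Hubble was born in Marshfield", "Einstein, born (1879), Ulm"]
born_in = [("Hubble", "Marshfield"), ("Einstein", "Ulm")]
for feats, label in distant_supervision_examples(born_in, corpus, "born-in"):
    print(label, feats["pattern"])
# born-in PER was born in LOC
# born-in PER, born (1879), LOC
```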

Unsupervised relation extraction
Open Information Extraction: extract relations from the web with no training data and no pre-specified list of relations.
1. Use parsed data to train a "trustworthy tuple" classifier
2. Single-pass extraction: extract all relations between NPs, keep the trustworthy ones
3. Assessor ranks relations based on text redundancy
Examples: (FCI, specializes in, software development), (Tesla, invented, coil transformer)

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open Information Extraction from the Web. IJCAI.
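
The assessor step can be approximated by simple redundancy counting; this is a sketch, not the actual probabilistic assessor of Banko et al.:

```python
from collections import Counter

def assess_by_redundancy(extractions):
    """Rank candidate triples by how often they were extracted
    independently across the corpus (text redundancy)."""
    return Counter(extractions).most_common()

triples = [("FCI", "specializes in", "software development"),
           ("FCI", "specializes in", "software development"),
           ("Tesla", "invented", "coil transformer")]
for triple, count in assess_by_redundancy(triples):
    print(count, triple)
# 2 ('FCI', 'specializes in', 'software development')
# 1 ('Tesla', 'invented', 'coil transformer')
```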

Evaluation of Semi-supervised and Unsupervised Relation Extraction
Since it extracts totally new relations from the web, there is no gold set of correct relation instances!
- Can't compute precision (we don't know which extractions are correct)
- Can't compute recall (we don't know which relations were missed)
Instead, we can approximate precision (only):
- Draw a random sample of relations from the output and check precision manually

- Can also compute precision at different levels of recall: precision for the top 1,000 new relations, the top 10,000, the top 100,000, in each case taking a random sample of that set
- But there is no way to evaluate recall
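
A sketch of the sampling procedure; here `judge` stands in for a human annotator:

```python
import random

def sampled_precision_at_k(ranked_relations, k, sample_size, judge):
    """Approximate precision of the top-k extracted relations by
    manually judging a random sample.  Recall cannot be computed this
    way: the set of missed relations is unknown."""
    top_k = ranked_relations[:k]
    sample = random.sample(top_k, min(sample_size, len(top_k)))
    correct = sum(1 for rel in sample if judge(rel))
    return correct / len(sample)

# Hypothetical usage, one call per recall level:
# p_1k   = sampled_precision_at_k(relations, 1_000,   100, human_judge)
# p_10k  = sampled_precision_at_k(relations, 10_000,  100, human_judge)
# p_100k = sampled_precision_at_k(relations, 100_000, 100, human_judge)
```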
