![Page 1: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/1.jpg)
1
Schema-Driven Relationship Extraction from Unstructured Text
Cartic Ramakrishnan, Krys Kochut and Amit Sheth
LSDIS Lab, University of Georgia, Athens, GA
November 7th 2006
ISWC 2006
![Page 2: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/2.jpg)
2
Outline
• Motivation
• Problem Description & Approach
• Results
• Future Work
![Page 3: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/3.jpg)
3
Anecdotal Example
[Figure: a graph linking the entities Leonardo Da Vinci; The Da Vinci Code; The Louvre; Victor Hugo; The Vitruvian Man; Santa Maria delle Grazie; Et in Arcadia Ego; Holy Blood, Holy Grail; Harry Potter; The Last Supper; Nicolas Poussin; Priory of Sion; The Hunchback of Notre Dame; The Mona Lisa; and Nicolas Flamel, through edges labeled painted_by, member_of, written_by, mentioned_in, displayed_at and cryptic_motto_of.]
Discovering connections hidden in text
UNDISCOVERED PUBLIC KNOWLEDGE
![Page 4: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/4.jpg)
4
Motivation 1 – Undiscovered Public Knowledge in Biology
Swanson’s Discoveries
[Figure: Magnesium and Migraine in PubMed, joined by an unknown link ("?") alongside the concepts Stress, Spreading Cortical Depression and Calcium Channel Blockers.]
Associations were discovered through keyword searches followed by manual analysis of the text to establish possibly relevant relationships
11 possible associations found
These associations were discovered in 1986
![Page 5: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/5.jpg)
5
Motivation 2 – Hypothesis-Driven Retrieval of Scientific Literature
[Figure: a complex query, drawn as a graph over Migraine, Stress, Patient, Magnesium and Calcium Channel Blockers with edges labeled affects, isa and inhibit, is posed against PubMed, and supporting document sets are retrieved.]
Keyword query: Migraine[MH] + Magnesium[MH]
![Page 6: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/6.jpg)
6
Motivation 3 – Growth Rate of Public Knowledge
• Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)
• How much is that?
  – Compare it to the estimate of the total words ever spoken by humans = 12 exabytes
• A small but significant portion is text data
  – PubMed: 16 million abstracts
  – MedlinePlus: health information
  – OMIM: catalog of human genes and genetic disorders
Undiscovered public knowledge may also have increased by a large amount
![Page 7: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/7.jpg)
7
Our past work in Connection Discovery
• Semantic Associations over RDF graphs
  – Discovery and Ranking
[Figure: Migraine, Stress, Patient, Magnesium and Calcium Channel Blockers connected by edges labeled affects, isa and inhibit, and marked as semantically connected.]
Assumption: rich semantic metadata containing entities related by a diverse set of relationships
It is therefore critical to bridge the gap between unstructured and structured data by extracting entities and the relationships between them, resulting in semantic metadata
![Page 8: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/8.jpg)
8
Outline
• Motivation
• Problem Description & Approach
• Results
• Future Work
![Page 9: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/9.jpg)
9
Problem – Extracting relationships between MeSH terms from PubMed
[Figure: three layers. UMLS Semantic Network: the classes Biologically active substance, Lipid and Disease or Syndrome, connected by the relationships affects, causes and complicates. MeSH: Fish Oils and Raynaud’s Disease, each linked to the schema by instance_of, with the relationship between them unknown ("???????"). PubMed: document sets of 9284, 4733 and 5 documents.]
![Page 10: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/10.jpg)
10
Background knowledge used
• UMLS – A high-level schema of the biomedical domain
  – 136 classes and 49 relationships
  – Synonyms of all relationships, found using variant lookup (tools from NLM)
  – 49 relationships + their synonyms = ~350 terms, mostly verbs
• MeSH
  – 22,000+ topics organized as a forest of 16 trees
  – Used to query PubMed
• PubMed
  – Over 16 million abstracts
  – Abstracts annotated with one or more MeSH terms
Variants mapped to the UMLS identifier T147: effect, induce, etiology, cause, effecting, induced
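The variant lookup on this slide can be pictured as a table from surface word forms to UMLS relationship identifiers. The following is a minimal sketch that uses only the six T147 variants listed above; the names `VARIANT_TO_UMLS` and `lookup_relationship` are illustrative assumptions and do not correspond to any NLM tool.

```python
from typing import Optional

# Surface forms copied from the slide; each maps to the UMLS identifier T147.
# The full table (~350 entries) is not reproduced in the talk.
VARIANT_TO_UMLS = {
    "effect": "T147",
    "induce": "T147",
    "etiology": "T147",
    "cause": "T147",
    "effecting": "T147",
    "induced": "T147",
}


def lookup_relationship(token: str) -> Optional[str]:
    """Return the UMLS relationship ID for a token, or None if it is not a known variant."""
    return VARIANT_TO_UMLS.get(token.lower())
```

A token that is absent from the table, such as an entity name, simply yields `None`, so the same lookup can be used to decide whether a verb in a parsed sentence names a schema relationship.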
![Page 11: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/11.jpg)
11
Method – Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
• Entities (MeSH terms) in sentences occur in modified forms
  • “adenomatous” modifies “hyperplasia”
  • “An excessive endogenous or exogenous stimulation” modifies “estrogen”
• Entities can also occur as composites of 2 or more other entities
  • “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
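To work with a bracketed parse like the one on this slide, it first has to be read into a tree. The sketch below is a small hand-written reader in plain Python, not the SS-Parser's own tooling; the names `parse_brackets` and `leaves` are assumptions made for illustration.

```python
import re
from typing import List, Union

Tree = Union[str, list]  # a leaf word, or [label, child, child, ...]

# The parser output shown on the slide, copied verbatim.
PARSE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)


def parse_brackets(text: str) -> Tree:
    """Turn a bracketed parse string into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Tree:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a bare word
        node = [tokens[pos]]  # the first token after "(" is the node label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ")"
        return node

    return read()


def leaves(tree: Tree) -> List[str]:
    """Collect the words of the sentence in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words
```

Joining `leaves(parse_brackets(PARSE))` with spaces recovers the original sentence, which is a quick sanity check that the tree was read correctly before entities and modifiers are looked up in it.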
![Page 12: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/12.jpg)
12
Method – Identify entities and Relationships in Parse Tree
[Figure: the parse tree of the example sentence with annotations. The MeSH IDs D004967, D006965 and D004717 are attached to entity nodes, and the UMLS ID T147 is attached to the verb "induces". A legend distinguishes modifiers, modified entities and composite entities.]
![Page 13: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/13.jpg)
13
Entities – The simple, the modified and the composite
• To capture the various types of entities we define
  – Simple entities as MeSH terms
  – Modifiers as siblings of entities that are
    • Determiners – “Y induces no X”
    • Noun phrases – “An excessive endogenous or exogenous stimulation”
    • Adjective phrases – “adenomatous”
    • Prepositional phrases – “M is induced by the X in the Z”
  – Modified entities as any entity that has a sibling which is a modifier
  – Composite entities as any entity that has another entity as a sibling
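The sibling rules above can be sketched as a small classifier over the children of one parse-tree node. This is an illustration under assumptions, not the authors' implementation: the mapping of modifier categories to Penn Treebank labels, the `Node` type, and the choice to let "composite" take precedence over "modified" are all hypothetical, and the sketch only looks at direct siblings.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed mapping from the slide's modifier categories to parse labels:
# determiners (DT), noun phrases (NP), adjectives and adjective phrases
# (JJ, ADJP) and prepositional phrases (PP).
MODIFIER_LABELS = {"DT", "NP", "JJ", "ADJP", "PP"}


@dataclass
class Node:
    label: str               # syntactic label, e.g. "NN" or "JJ"
    text: str                # words covered by this node
    is_entity: bool = False  # True if the text matched a MeSH term


def classify(siblings: List[Node]) -> Dict[str, str]:
    """Label each entity among sibling nodes as simple, modified or composite."""
    result: Dict[str, str] = {}
    for node in siblings:
        if not node.is_entity:
            continue
        others = [s for s in siblings if s is not node]
        if any(s.is_entity for s in others):
            result[node.text] = "composite"   # another entity is a sibling
        elif any(s.label in MODIFIER_LABELS for s in others):
            result[node.text] = "modified"    # a modifier is a sibling
        else:
            result[node.text] = "simple"
    return result
```

For the sibling pair `(JJ adenomatous) (NN hyperplasia)`, with "hyperplasia" marked as an entity, this returns "modified", matching the slide's example that "adenomatous" modifies "hyperplasia".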
![Page 14: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/14.jpg)
14
Resulting RDF
[Figure: RDF graph for the example sentence. Nodes: estrogen, "An excessive endogenous or exogenous stimulation", adenomatous hyperplasia, endometrium, modified_entity1, modified_entity2 and composite_entity1. Edge labels: hasModifier (2 edges), hasPart (4 edges) and induces (1 edge). A legend distinguishes modifiers, modified entities and composite entities.]
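The graph on this slide can be written down as subject-predicate-object triples. The transcript preserves the node names and edge labels but not which edge connects which nodes, so the wiring below is an assumed reconstruction: it is one assignment that agrees with the parse and with the edge counts (two hasModifier, four hasPart, one induces). Treating "adenomatous" and "hyperplasia" as separate nodes is also an assumption.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]

# ASSUMPTION: the endpoints of each edge are guessed so that the edge-label
# counts match the figure; the original endpoints are not recoverable here.
TRIPLES: List[Triple] = [
    ("modified_entity1", "hasPart", "estrogen"),
    ("modified_entity1", "hasModifier", "An excessive endogenous or exogenous stimulation"),
    ("modified_entity2", "hasPart", "hyperplasia"),
    ("modified_entity2", "hasModifier", "adenomatous"),
    ("composite_entity1", "hasPart", "modified_entity2"),
    ("composite_entity1", "hasPart", "endometrium"),
    ("modified_entity1", "induces", "composite_entity1"),
]

# Predicates that only describe how complex entities are assembled.
STRUCTURAL = {"hasPart", "hasModifier"}


def predicate_counts(triples: List[Triple]) -> Counter:
    """Count how often each predicate occurs."""
    return Counter(pred for _, pred, _ in triples)


def named_statements(triples: List[Triple]) -> List[Triple]:
    """Keep only statements whose predicate is a named (non-structural) relationship."""
    return [t for t in triples if t[1] not in STRUCTURAL]
```

Under this reading, the only named relationship extracted from the sentence is `induces`, linking the modified estrogen entity to the composite "adenomatous hyperplasia of the endometrium" entity.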
![Page 15: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/15.jpg)
15
Outline
• Motivation
• Approach
• Results
• Future Work
![Page 16: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/16.jpg)
16
Results
• Dataset 1
  – Swanson’s discoveries
• Associations between Migraine and Magnesium [Hearst99]
  – stress is associated with migraines
  – stress can lead to loss of magnesium
  – calcium channel blockers prevent some migraines
  – magnesium is a natural calcium channel blocker
  – spreading cortical depression (SCD) is implicated in some migraines
  – high levels of magnesium inhibit SCD
  – migraine patients have high platelet aggregability
  – magnesium can suppress platelet aggregability
![Page 17: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/17.jpg)
17
Results – Creation of Dataset 1
• Keyword pairs (e.g. stress + migraine) queried against PubMed return the abstracts that are annotated (by NLM) with both terms
• 8 pairs of terms in this scenario result in 8 subsets of PubMed
• Semantic Metadata
  – Represented in RDF
  – With complex entities and relationships connecting them
  – Pointers to original document and sentence
  – Size
    • ~2MB RDF for Migraine Magnesium subset of PubMed
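Building one keyword query per term pair can be sketched as follows. The query notation copies the "Migraine[MH] + Magnesium[MH]" form shown earlier in the talk; the function name `keyword_query` is an assumption, and only the two pairs named in the slides are listed because the full set of 8 pairs is not spelled out.

```python
from typing import List, Tuple

# Only pairs that appear explicitly in the slides; the remaining pairs are not listed there.
PAIRS: List[Tuple[str, str]] = [
    ("Stress", "Migraine"),
    ("Migraine", "Magnesium"),
]


def keyword_query(term_a: str, term_b: str) -> str:
    """Format a two-term MeSH query in the notation used on the slides."""
    return f"{term_a}[MH] + {term_b}[MH]"


def build_queries(pairs: List[Tuple[str, str]]) -> List[str]:
    """One query per pair; each query selects one subset of PubMed."""
    return [keyword_query(a, b) for a, b in pairs]
```

Each resulting string would be submitted to PubMed separately, so 8 pairs give 8 abstract subsets to tag, parse and convert to RDF.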
![Page 18: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/18.jpg)
18
Evaluating the Result of Extraction
• Ideal method to evaluate the extraction method
  – Domain experts read a set of abstracts, given a set of relationship names and entities to look for
  – In addition, they are given the extracted triples and entities
  – For every abstract, the expert judge counts the correct, incorrect and missed triples
  – Measure precision and recall
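Given the expert's counts, precision and recall follow directly: precision is the share of extracted triples that are correct, and recall is the share of triples present in the text that were extracted. A minimal sketch, with the function name `precision_recall` chosen for illustration:

```python
from typing import Tuple


def precision_recall(correct: int, incorrect: int, missed: int) -> Tuple[float, float]:
    """Compute precision and recall from an expert's per-abstract triple counts.

    correct   -- extracted triples the expert judged right
    incorrect -- extracted triples the expert judged wrong
    missed    -- triples the expert found in the text that were not extracted
    """
    extracted = correct + incorrect  # everything the system produced
    relevant = correct + missed      # everything the system should have produced
    precision = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall
```

For example, with made-up counts of 8 correct, 2 incorrect and 8 missed triples, precision is 8/10 and recall is 8/16. These numbers are purely illustrative; the talk reports no such evaluation.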
![Page 19: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/19.jpg)
19
Evaluating the Result of Extraction
• In the absence of a domain expert, we focus on getting a feel for the utility of the extracted data
  – We know the associations manually discovered between Migraine and Magnesium
  – We locate paths of various lengths between them and manually inspect these paths
  – If the paths are indicative of the manually discovered associations, the extracted data is useful
![Page 20: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/20.jpg)
20
Paths between Migraine and Magnesium
Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifier
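Locating paths of bounded length and keeping the interesting ones can be sketched with a depth-first search over the extracted triples. This is not the authors' path-discovery code: the undirected treatment of edges, the function names, and the toy triples in the usage note are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
STRUCTURAL = {"hasPart", "hasModifier"}  # predicates that do not name a relationship


def build_graph(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples as an undirected graph: node -> [(predicate, neighbor), ...]."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subj, pred, obj in triples:
        graph[subj].append((pred, obj))
        graph[obj].append((pred, subj))  # direction is ignored when searching for connections
    return graph


def find_paths(graph, start: str, goal: str, max_edges: int) -> List[List[Triple]]:
    """Enumerate simple paths from start to goal with at most max_edges edges."""
    paths: List[List[Triple]] = []

    def walk(node: str, visited: set, edges: List[Triple]) -> None:
        if node == goal:
            paths.append(list(edges))
            return
        if len(edges) == max_edges:
            return
        for pred, nxt in graph[node]:
            if nxt not in visited:
                edges.append((node, pred, nxt))
                walk(nxt, visited | {nxt}, edges)
                edges.pop()

    walk(start, {start}, [])
    return paths


def is_interesting(path: List[Triple]) -> bool:
    """A path is interesting if at least one edge is a named relationship."""
    return any(pred not in STRUCTURAL for _, pred, _ in path)
```

With a toy graph such as `[("entity_a", "hasPart", "migraine"), ("entity_a", "caused_by", "entity_b"), ("entity_b", "hasPart", "magnesium")]`, the single path from "migraine" to "magnesium" is interesting because it contains `caused_by`. A path made only of hasPart and hasModifier edges would be filtered out.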
![Page 21: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/21.jpg)
21
An example of such a path
[Figure: a path from migraine (D008881) to magnesium (D008274) that passes through platelet (D001792), collagen (D003094) and the extracted entities me_3142by_a_primary_abnormality_of_platelet_behavior and me_2286_13%_and_17%_adp_and_collagen_induced_platelet_aggregation, with edges labeled caused_by (1 edge), stimulated (2 edges) and hasPart (3 edges).]
![Page 22: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/22.jpg)
22
Results
• Dataset 2
  – Neoplasm (C04)
    • For the subtree of MeSH rooted at Neoplasms, all topics under this subtree are used as query terms against PubMed
    • The resulting dataset contains ~500,000 PubMed abstracts
    • The extraction process run on this data returns ~150MB
• Processing the tagged and parsed sentences for Dataset 2 (Neoplasm) to generate RDF took approx. 5 minutes
• Stats
  – 211 different named relationships found
  – 500,000 instance-property-instance statements
  – 260,000 instance-property-literal statements
• Currently setting up to extract RDF from all of PubMed
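Selecting every topic under the Neoplasms subtree can be sketched with MeSH tree numbers, where a descendant's tree number extends its ancestor's with further dot-separated segments. Apart from C04 for Neoplasms, which is taken from the slide, the sample topics and tree numbers in the usage note are placeholders rather than real MeSH data, and `topics_under` is a hypothetical helper.

```python
from typing import Dict, List


def topics_under(tree_numbers: Dict[str, List[str]], root: str) -> List[str]:
    """Return topics with at least one tree number at or below the given root.

    tree_numbers maps a topic name to its tree numbers; a topic may sit in
    more than one place in the MeSH forest.
    """
    prefix = root + "."  # the dot prevents "C04" from matching e.g. "C040"
    return sorted(
        topic
        for topic, numbers in tree_numbers.items()
        if any(n == root or n.startswith(prefix) for n in numbers)
    )
```

Called with `{"Neoplasms": ["C04"], "placeholder topic 1": ["C04.100"], "placeholder topic 2": ["C10.300"]}` and root `"C04"`, this keeps "Neoplasms" and "placeholder topic 1". Each selected topic would then serve as a query term against PubMed.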
![Page 23: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/23.jpg)
23
Outline
• Motivation
• Problem Description & Approach
• Results
• Future Work
![Page 24: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/24.jpg)
24
Future Extensions to the Extraction process
• Short-term goals (1 month)
  – MeSH qualifiers (blood pressure, contraindications)
  – Curate and release Migraine-Magnesium RDF
• Long-term goals
  – More complex structures
    • Conjunctions
    • X causes Y to inhibit Z
  – Rule-action language to test new extraction rules
  – Finding new terms to enrich existing vocabularies
  – Perhaps ontology enrichment
![Page 25: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/25.jpg)
25
The projected future of research in Biology
From … Hypothesis-driven “wet lab” experiments
To … Data-driven reduction/pruning of hypothesis space leading to new insight and possibly discovery
• What challenges does this transition bring?
![Page 26: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/26.jpg)
26
Use of Generated Semantic Metadata
• Semantic browsing of PubMed based on named relationships between MeSH terms
• Path/hypothesis based document retrieval
• Knowledge discovery from literature
  – Corpus-based complex relationship discovery and ranking
  – Corpus-based relevant connection subgraph discovery
![Page 27: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/27.jpg)
27
Support such retrieval and discovery operations across multiple data sources
• Extract Semantic Metadata about entities in all of these databases that might occur in PubMed text
• Resulting metadata will contain relationships between genes (OMIM), diseases (MeSH), nucleotide anomalies (SNP)
• Hypothesis validation and knowledge discovery in biology
![Page 28: Schema-Driven Relationship Extraction from Unstructured Text](https://reader036.vdocuments.us/reader036/viewer/2022062309/56813d44550346895da70291/html5/thumbnails/28.jpg)
28
THANK YOU!