bridget mcinnes ted pedersen ying liu genevieve b. melton serguei pakhomov
DESCRIPTION
Knowledge-based Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity. Bridget McInnes Ted Pedersen Ying Liu Genevieve B. Melton Serguei Pakhomov. Objective of this work. Develop and evaluate a method than can - PowerPoint PPT PresentationTRANSCRIPT
1
KNOWLEDGE-BASED METHOD FOR DETERMINING THE MEANING OF AMBIGUOUS BIOMEDICAL TERMS USING INFORMATION CONTENT MEASURES OF SIMILARITY
Bridget McInnesTed Pedersen Ying LiuGenevieve B. MeltonSerguei Pakhomov
2
OBJECTIVE OF THIS WORK Develop and evaluate a method than can
disambiguate terms in biomedical text by exploiting similarity information extrapolated from the Unified Medical
Language System
Evaluate the efficacy of Information Content-based similarity measures over path-based similarity measures
for Word Sense Disambiguation, WSD
3
WORD SENSE DISAMBIGUATIONWord sense disambiguation is the task of
determining the appropriate sense of a term given context in which it is used.
TERM: tolerance
DrugTolerance
ImmuneTolerance
4
WORD SENSE DISAMBIGUATIONWord sense disambiguation is the task of
determining the appropriate sense of a term given context in which it is used.
Busprione attenuates tolerance to morphine in mice with skin cancer
DrugTolerance
ImmuneTolerance
5
SENSE INVENTORY: UNIFIED MEDICAL LANGUAGE SYSTEM Unified Medical Language Sources (UMLS)
Semantic Network Metathesaurus
~1.7 million biomedical and clinical concepts; integrated semi-automatically
CUIs (Concept Unique Identifiers), linked: Hierarchical: PAR/CHD and RB/RN Non-hierarchical: SIB, RO
Sources viewed together or independently Medical Subject Heading (MSH)
SPECIALIST Lexicon Biomedical and clinical terms, including variants
6
WORD SENSE DISAMBIGUATION
Busprione attenuates tolerance to morphine in mice with skin cancer
DrugTolerance: C0013220
ImmuneTolerance:C0020963
Concept Unique Identifiers: CUIs
7
SENSERELATE ALGORITHM
Each possible sense of a target word is assigned a score [sum similarity between it and its surrounding terms]
Assign target word the sense with highest score
Proposed by Patwardhan and Pedersen 2003 using WordNet
UMLS::SenseRelate is a modification of this algorithm using information from the UMLS
NEXT UP: an example
8
SENSERELATE EXAMPLEBusprione attenuates tolerance to morphine
in mice with skin cancer
9
SENSERELATE EXAMPLEBusprione attenuates tolerance to morphine
in mice with skin cancer
DrugTolerance: C0013220
ImmuneTolerance:C0020963
10
SENSERELATE EXAMPLEBusprione attenuates tolerance to morphine
in mice with skin cancer
DrugTolerance: C0013220
ImmuneTolerance:C0020963
Busprione:
C0006462Morphine:C0026549
Mice: C0026809
Skin cancer:
C0007114
11
SENSERELATE EXAMPLE
0.090.160.11
Busprione attenuates tolerance to morphine in mice with skin cancer
0.09
DrugTolerance: C0013220
ImmuneTolerance:C0020963
Busprione:
C0006462Morphine:C0026549
Mice: C0026809
Skin cancer:
C0007114
12
SENSERELATE EXAMPLE
0.090.160.11
Busprione attenuates tolerance to morphine in mice with skin cancer
0.09
DrugTolerance: C0013220
ImmuneTolerance:C0020963
Busprione:
C0006462Morphine:C0026549
Mice: C0026809
Skin cancer:
C0007114
Drug ToleranceScore = 0.09 + 0.09 + 0.16 + 0.11 = 0.45
13
SENSERELATE EXAMPLE
0.090.160.11
0.090.050.04
Busprione attenuates tolerance to morphine in mice with skin cancer
0.09 0.09
DrugTolerance: C0013220
ImmuneTolerance:C0020963
Busprione:
C0006462Morphine:C0026549
Mice: C0026809
Skin cancer:
C0007114
Drug ToleranceScore = 0.09 + 0.09 + 0.16 + 0.11 = 0.45
14
SENSERELATE EXAMPLE
0.090.160.11
0.090.050.04
Busprione attenuates tolerance to morphine in mice with skin cancer
0.09 0.09
DrugTolerance: C0013220
ImmuneTolerance:C0020963
Busprione:
C0006462Morphine:C0026549
Mice: C0026809
Skin cancer:
C0007114
Drug ToleranceScore = 0.09 + 0.09 + 0.16 + 0.11 = 0.45
Immune ToleranceScore = 0.09 + 0.09 + 0.05 + 0.05 = 0.27
15
SENSERELATE EXAMPLE
0.090.160.11
0.090.050.04
Busprione attenuates tolerance to morphine in mice with skin cancer
0.09 0.09
DrugTolerance: C0013220
ImmuneTolerance:C0020963
Busprione:
C0006462Morphine:C0026549
Mice: C0026809
Skin cancer:
C0007114
Drug ToleranceScore = 0.09 + 0.09 + 0.16 + 0.11 = 0.45
Immune ToleranceScore = 0.09 + 0.09 + 0.05 + 0.05 = 0.27
16
SENSE RELATE ASSUMPTION
An ambiguous word is often used in the sense
that is most similar to the sense of the terms that surround it
17
SENSERELATE COMPONENTS Identifying the concepts of surrounding terms
Calculating semantic similarity
18
IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS
Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in
the UMLS
19
IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS
Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in
the UMLSBusprione attenuates tolerance to morphine in mice with skin cancer
20
IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS
Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in
the UMLS
...skin cancerskin graftingskin disease
...
SPECIALISTLEXICON
Busprione attenuates tolerance to morphine in mice with skin cancer
21
IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS
Use the SPECIALIST LEXICON to identify the terms and map the terms doing a string match to the MRCONSO table in
the UMLS
...skin cancerskin graftingskin disease
...
...skin cancer C0007114skin grafting C0037297skin disease C0037274
...
SPECIALISTLEXICON
MRCONSO
Busprione attenuates tolerance to morphine in mice with skin cancer
22
SEMANTIC SIMILARITY MEASURES Path-based measures
Path Wu and Palmer Leacock and Chodorow Ngyuen and Al-Mubaid
Information content (IC)-based measures Resnik Lin Jiang and Conrath
23
PATH-BASED SIMILARITY MEASURES Use only the path information obtained from a
taxonomy
24
PATH-BASED SIMILARITY MEASURES Use only the path information obtained from a
taxonomy
Path measure sim(c1,c2) = 1 / minpath(c2,c2)
where minpath is the shortest path between the two concepts
25
PATH-BASED SIMILARITY MEASURES Use only the path information obtained from a
taxonomy
Path measure sim(c1,c2) = 1/minpath(c2,c2)
where minpath is the shortest path between the two concepts
Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) /
(depth(c1)+depth(c2)) where LCS is the least common subsumer of the
two concepts
26
PATH-BASED SIMILARITY MEASURES Use only the path information obtained from a taxonomy
Path measure sim(c1,c2) = 1/ minpath(c2,c2)
where minpath is the shortest path between the two concepts
Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) / (depth(c1)+depth(c2))
where LCS is the least common subsumer of the two concepts
Leacock and Chodorow, 1998 sim(c1,c2) = -log( minpath(c1,c2) / (2D) )
where D is the total depth of the taxonomy
27
PATH-BASED SIMILARITY MEASURES Use only the path information obtained from a taxonomy
Path measure sim(c1,c2) = 1/ minpath(c2,c2)
where minpath is the shortest path between the two concepts
Leacock and Chodorow, 1998 sim(c1,c2) = -log( minpath(c1,c2) / (2D) )
where D is the total depth of the taxonomy
Wu and Palmer, 1994 sim(c1,c2) = (2*depth(LCS(c2,c2))) / (depth(c1)+depth(c2))
where LCS is the least common subsumer of the two concepts
Nyguen and Al-Mubaid, 2006 sim(c1,c2) = log ( (2 + minpath(c1,c2) - 1) *
(D - depth(LCS(c1,c2))) )
28
PATH-BASED SIMILARITY MEASURES
USE ONLY THE PATH INFORMATION OBTAINED FROM A TAXONOMY
Disease: C0012634
Drug Related
Disorder: C0277579
DrugTolerance: C0013220
Neoplasm: C1302761
Neoplastic Disease: C1882062
Malignant Neoplasm: C0006826
Skin cancer:
C0007114
29
INFORMATION CONTENT-BASED MEASURES Incorporate the probability of the concepts
IC = -log(P(concept))
30
INFORMATION CONTENT-BASED MEASURES Incorporate the probability of the concepts
IC = -log(P(concept))
P(concept)
Calculated by summing the probability of the concept and the probability of its descendants
Probabilities are obtained from an external corpus
31
INFORMATION CONTENT-BASED MEASURES Incorporate the probability of the concepts
IC = -log(P(concept)
Resnik, 1995 sim(c1,c2) = IC(LCS(c1,c2))
32
INFORMATION CONTENT-BASED MEASURES
Incorporate the probability of the concepts
IC = -log(P(concept)
Resnik, 1995 sim(c1,c2) = IC(LCS(c2,c2))
Jiang and Conrath, 1997 sim(c1,c2) = 1 / (IC(c1)+IC(c2) – 2*
IC(LCS(c1,c2))
33
INFORMATION CONTENT-BASED MEASURES
Incorporate the probability of the concepts
IC = -log(P(concept)
Resnik, 1995 sim(c1,c2) = IC(LCS(c2,c2))
Jiang and Conrath, 1997 sim(c1,c2) = 1 ÷ (IC(c1)+IC(c2) – 2* IC(LCS(c1,c2))
Lin, 1998 sim(c1,c2) = (2*IC(LCS(c2,c2))) / (IC(c1)+IC(c2))
34
IC-BASED SIMILARITY MEASURES
Disease: C001263
4Drug
Related Disorder: C0277579Drug
Tolerance:
C0013220
Neoplasm:
C1302761Neoplasti
c Disease: C188206
2Malignant
Neoplasm:
C0006826Skin
cancer: C000711
4
+
PATH INFORMATIONPROBABILITY OF
CONCEPTS
EXTERNAL CORPUS
35
EXPERIMENTAL FRAMEWORK Use open-source UMLS::Similarity package to
obtain the similarity between the terms and possible senses in the SenseRelate algorithm
Path information: parent/child relations in MSH source
Information content: calculated using the UMLSonMedline dataset created by NLM
Consists of concepts from 2009AB UMLS and the frequency they occurred in Medline using the Essie Search Engine (Ide et al 2007)
Medline: database of citations of biomedical/clinical articles
36
EVALUATION DATA: MSH WSD MSH-WSD dataset (Jimeno-Yepes, et al 2011)
203 target words (ambiguous word) from Medline 106 terms e.g. tolerance 88 acronyms e.g. CA (calcium, california) 9 mixtures e.g. bat (brown adipose tissue)
Each target word contains ~187 instances (Medline abstracts) abstract = ~ 500 words
Each target word in the instances assigned a concept from MSH by exploiting the manually assigned MSH concepts assigned to the abstract
Average of 2.08 possible senses per target word Majority sense over all the target words is 54.5%
37
RESULTS
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.55
0.7200000000000010.690000000
0000010.700000000
0000010.720000000
0000010.730000000
0000010.740000000
0000010.740000000
000001
baseline path
lch wup
nam
res jcn
accuracy
Path-based IC-basedlin
38
COMPARISON ACROSS SUBSETS OF MSH-WSD
Terms Acronyms Mixture Overall0
0.10.20.30.40.50.60.70.80.9
1
0.55 0.54 0.53 0.55
0.670000000000002
0.8 0.730000000000001
0.7400000000000020.71000000000
0001
0.870000000000002 0.88
0.80.67000000000
0002
0.850000000000001
0.93
0.78
BaselineSenseRe-lateMRD2-MRD
accuracy
39
COMPARISON ACROSS SUBSETS OF MSH-WSD
Terms Acronyms Mixture Overall0
0.10.20.30.40.50.60.70.80.9
1
0.55 0.54 0.53 0.55
0.670000000000002
0.8 0.730000000000001
0.7400000000000020.71000000000
0001
0.870000000000002 0.88
0.80.67000000000
0002
0.850000000000001
0.93
0.78
BaselineSenseRe-lateMRD2-MRD
accuracy
40
COMPARISON ACROSS SUBSETS OF MSH-WSD
Terms Acronyms Mixture Overall0
0.10.20.30.40.50.60.70.80.9
1
0.55 0.54 0.53 0.55
0.670000000000002
0.8 0.730000000000001
0.7400000000000020.71000000000
0001
0.870000000000002 0.88
0.80.67000000000
0002
0.850000000000001
0.93
0.78
BaselineSenseRe-lateMRD2-MRD
accuracy
41
COMPARISON ACROSS SUBSETS OF MSH-WSD
Terms Acronyms Mixture Overall0
0.10.20.30.40.50.60.70.80.9
1
0.55 0.54 0.53 0.55
0.670000000000002
0.8 0.730000000000001
0.7400000000000020.71000000000
0001
0.870000000000002 0.88
0.80.67000000000
0002
0.850000000000001
0.93
0.78
BaselineSenseRe-lateMRD2-MRD
accuracy
42
COMPARISON ACROSS SUBSETS OF MSH-WSD
Terms Acronyms Mixture Overall0
0.10.20.30.40.50.60.70.80.9
1
0.55 0.54 0.53 0.55
0.670000000000002
0.8 0.730000000000001
0.7400000000000020.71000000000
0001
0.870000000000002 0.88
0.80.67000000000
0002
0.850000000000001
0.93
0.78
BaselineSenseRe-lateMRD2-MRD
accuracy
43
WINDOW SIZES Use the terms surrounding the target word
within a specified window: 1, 2, 5, 10, 25, 50, 60, 70
Busprione attenuates tolerance to morphine in mice with skin_cancer
WINDOW SIZE = 2
44
COMPARISON OF WINDOW SIZES FOR LIN
0 1 2 5 10 25 50 60 700
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.5 0.53
0.650000000000002
0.690000000000001
0.710000000000001
0.740000000000002
0.740000000000002
0.740000000000002
0.740000000000002
lin
accuracy
window size
45
SURROUNDING TERMSNot all terms have a concept in the
UMLS
therefore
Not all surrounding terms in the window mapped to CUIs
46
WINDOW SIZES VERSUS MAPPED TERMS
0 1 2 5 10 25 50 60 7002468
1012141618
0 0.27 0.791.85
3.49
7.6
12.9614.28
15.64
lin
number
of
mappings
window size
47
FUTURE WORK: MAPPING TERMS Currently looking at mapping the terms to
CUIs using information from the concept mapping system MetaMap
Obtain the terms from MetaMap and do a dictionary look up in MRCONSO Hypothesis – the terms obtained by MetaMap are more
accurate than using the SPECIALIST Lexicon
Obtain the CUIs from MetaMap Hypothesis – the CUIs obtained by MetaMap will be
more accurate than the dictionary look-up
48
OBJECTIVE #1Develop and evaluate a method than can disambiguate terms in biomedical text by
exploiting similarity information extrapolated from the UMLS
UMLS::SenseRelate statistically significantly higher disambiguation accuracy than the baseline
On par with previous unsupervised methods for terms
49
OBJECTIVE #2Evaluate the efficacy of IC-based similarity
measures over path-based measures on a secondary task
There is no statistically significant difference between the accuracies obtained by the IC-based measures
There is a statistically significant difference between the IC-based measures and the path-based measures
50
TAKE HOME MESSAGE:
An ambiguous word is often used in the sense
that is most similar to the sense of the concepts
of the terms that surround it
51
RESOURCES Software:
UMLS::SenseRelate http://search.cpan.org/dist/UMLS-SenseRelate/
UMLS::Similarity http://search.cpan.org/dist/UMLS-Similarity/
Data MSH-WSD
http://wsd.nlm.nih.gov/collaboration.shtml
52
RESOURCES Software:
UMLS::SenseRelate http://search.cpan.org/dist/UMLS-SenseRelate/
UMLS::Similarity http://search.cpan.org/dist/UMLS-Similarity/
Data MSH-WSD
http://wsd.nlm.nih.gov/collaboration.shtml
THANK YOU
53
RESOURCES Software:
UMLS::SenseRelate http://search.cpan.org/dist/UMLS-SenseRelate/
UMLS::Similarity http://search.cpan.org/dist/UMLS-Similarity/
Data MSH-WSD
http://wsd.nlm.nih.gov/collaboration.shtml
QUESTIONS?