KNOWLEDGE-BASED METHOD FOR DETERMINING THE MEANING OF AMBIGUOUS BIOMEDICAL TERMS USING INFORMATION CONTENT MEASURES OF SIMILARITY

Bridget McInnes, Ted Pedersen, Ying Liu, Genevieve B. Melton, Serguei Pakhomov


Post on 23-Feb-2016


OBJECTIVE OF THIS WORK

Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the Unified Medical Language System (UMLS)

Evaluate the efficacy of information content (IC)-based similarity measures over path-based similarity measures for word sense disambiguation (WSD)

WORD SENSE DISAMBIGUATION

Word sense disambiguation is the task of determining the appropriate sense of a term given the context in which it is used.

TERM: tolerance

Buspirone attenuates tolerance to morphine in mice with skin cancer

Drug Tolerance

Immune Tolerance

SENSE INVENTORY: UNIFIED MEDICAL LANGUAGE SYSTEM

Unified Medical Language System (UMLS) knowledge sources:

Semantic Network

Metathesaurus
  ~1.7 million biomedical and clinical concepts; integrated semi-automatically
  CUIs (Concept Unique Identifiers), linked:
    Hierarchical: PAR/CHD and RB/RN
    Non-hierarchical: SIB, RO
  Sources viewed together or independently
    Medical Subject Headings (MSH)

SPECIALIST Lexicon
  Biomedical and clinical terms, including variants

WORD SENSE DISAMBIGUATION

Buspirone attenuates tolerance to morphine in mice with skin cancer

Drug Tolerance: C0013220

Immune Tolerance: C0020963

Concept Unique Identifiers: CUIs

SENSERELATE ALGORITHM

Each possible sense of a target word is assigned a score: the sum of the similarities between that sense and the terms surrounding the target word

Assign the target word the sense with the highest score

Proposed by Patwardhan and Pedersen (2003) using WordNet

UMLS::SenseRelate is a modification of this algorithm using information from the UMLS

NEXT UP: an example

SENSERELATE EXAMPLE

Buspirone attenuates tolerance to morphine in mice with skin cancer

Possible senses of "tolerance":
  Drug Tolerance: C0013220
  Immune Tolerance: C0020963

Surrounding terms:
  Buspirone: C0006462
  Morphine: C0026549
  Mice: C0026809
  Skin cancer: C0007114

Drug Tolerance Score = 0.09 + 0.09 + 0.16 + 0.11 = 0.45

Immune Tolerance Score = 0.09 + 0.09 + 0.05 + 0.04 = 0.27
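The scoring step in this example can be sketched in a few lines of Python. This is only an illustration, not the UMLS::SenseRelate implementation: the function name senserelate is hypothetical, and the similarity values are simply the edge weights drawn on the slide (0.09, 0.09, 0.16, 0.11 for Drug Tolerance and 0.09, 0.09, 0.05, 0.04 for Immune Tolerance), without assuming which surrounding term each value belongs to.

```python
from typing import Dict, List, Tuple


def senserelate(similarities: Dict[str, List[float]]) -> Tuple[str, Dict[str, float]]:
    """Return the best-scoring sense and the score of every candidate sense.

    `similarities` maps each candidate sense (here a CUI) to its similarity
    with each surrounding term that was mapped to a concept.
    """
    # Score of a sense = sum of its similarities to the surrounding terms.
    scores = {sense: sum(values) for sense, values in similarities.items()}
    # The target word is assigned the sense with the highest score.
    best = max(scores, key=scores.get)
    return best, scores


# Values copied from the slide; in practice they would come from a
# similarity measure computed over the UMLS.
example = {
    "C0013220": [0.09, 0.09, 0.16, 0.11],  # Drug Tolerance
    "C0020963": [0.09, 0.09, 0.05, 0.04],  # Immune Tolerance
}

best_sense, sense_scores = senserelate(example)
print(best_sense, {cui: round(score, 2) for cui, score in sense_scores.items()})
```

With these inputs the Drug Tolerance CUI gets the higher score, matching the sums on the slide.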

SENSERELATE ASSUMPTION

An ambiguous word is often used in the sense that is most similar to the sense of the terms that surround it

SENSERELATE COMPONENTS

Identifying the concepts of surrounding terms

Calculating semantic similarity

IDENTIFYING THE CONCEPTS OF THE SURROUNDING TERMS

Use the SPECIALIST Lexicon to identify the terms, then map each term to a concept by string matching against the MRCONSO table in the UMLS

Buspirone attenuates tolerance to morphine in mice with skin cancer

SPECIALIST Lexicon:
  ...
  skin cancer
  skin grafting
  skin disease
  ...

MRCONSO:
  ...
  skin cancer C0007114
  skin grafting C0037297
  skin disease C0037274
  ...
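A rough sketch of the term-to-CUI lookup, under stated assumptions: TOY_MRCONSO is a hypothetical in-memory stand-in for the MRCONSO table that reuses term/CUI pairs shown on the slides, map_terms is a hypothetical helper, and greedy longest-match over word n-grams is one plausible way to prefer "skin cancer" over its single words. The slides do not say how the actual system resolves overlapping matches.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for MRCONSO: lower-cased term string -> CUI.
# The entries reuse term/CUI pairs that appear on the slides.
TOY_MRCONSO: Dict[str, str] = {
    "buspirone": "C0006462",
    "morphine": "C0026549",
    "mice": "C0026809",
    "skin cancer": "C0007114",
    "skin grafting": "C0037297",
    "skin disease": "C0037274",
}


def map_terms(text: str, lexicon: Dict[str, str], max_len: int = 3) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup of word n-grams against `lexicon`."""
    words = text.lower().split()
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(words):
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in lexicon:
                found.append((candidate, lexicon[candidate]))
                i += n  # skip past the matched term
                break
        else:
            i += 1  # no term starts here; move on one word
    return found


sentence = "Buspirone attenuates tolerance to morphine in mice with skin cancer"
mapped = map_terms(sentence, TOY_MRCONSO)
print(mapped)
```

Words with no entry in the toy table (for example "attenuates") are simply skipped, which mirrors the point made later in the talk that not every surrounding term maps to a CUI.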

SEMANTIC SIMILARITY MEASURES

Path-based measures
  Path
  Wu and Palmer
  Leacock and Chodorow
  Nguyen and Al-Mubaid

Information content (IC)-based measures
  Resnik
  Lin
  Jiang and Conrath

PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

Path measure
  sim(c1,c2) = 1 / minpath(c1,c2)
  where minpath is the length of the shortest path between the two concepts

Leacock and Chodorow, 1998
  sim(c1,c2) = -log( minpath(c1,c2) / (2D) )
  where D is the total depth of the taxonomy

Wu and Palmer, 1994
  sim(c1,c2) = (2 * depth(LCS(c1,c2))) / (depth(c1) + depth(c2))
  where LCS is the least common subsumer of the two concepts

Nguyen and Al-Mubaid, 2006
  sim(c1,c2) = log( 2 + (minpath(c1,c2) - 1) * (D - depth(LCS(c1,c2))) )

PATH-BASED SIMILARITY MEASURES

Use only the path information obtained from a taxonomy

Disease: C0012634
Drug Related Disorder: C0277579
Drug Tolerance: C0013220
Neoplasm: C1302761
Neoplastic Disease: C1882062
Malignant Neoplasm: C0006826
Skin cancer: C0007114
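The path, Leacock and Chodorow, and Wu and Palmer formulas can be tried out on the concepts listed here. The sketch below is illustrative only. Three things are assumptions rather than facts from the slides: the child-to-parent wiring in PARENT (the extracted text lists the concepts but not the edges), the convention that paths and depths are counted in nodes (so the root has depth 1 and minpath(c, c) is 1), and all function names. The Nguyen and Al-Mubaid measure is left out to keep the sketch short.

```python
import math
from typing import Dict, List, Optional

# Child -> parent links. ASSUMPTION: this wiring is illustrative; the slide
# text lists the concepts but does not preserve the edges between them.
PARENT: Dict[str, Optional[str]] = {
    "Disease": None,
    "Drug Related Disorder": "Disease",
    "Drug Tolerance": "Drug Related Disorder",
    "Neoplasm": "Disease",
    "Neoplastic Disease": "Neoplasm",
    "Malignant Neoplasm": "Neoplastic Disease",
    "Skin cancer": "Malignant Neoplasm",
}


def path_to_root(c: str) -> List[str]:
    """Return [concept, parent, ..., root]."""
    path = [c]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(c: str) -> int:
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(path_to_root(c))


def lcs(c1: str, c2: str) -> str:
    """Least common subsumer: the deepest ancestor shared by c1 and c2."""
    ancestors = set(path_to_root(c1))
    return next(a for a in path_to_root(c2) if a in ancestors)


def minpath(c1: str, c2: str) -> int:
    """Shortest path counted in nodes (an assumption), so minpath(c, c) == 1."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1


D = max(depth(c) for c in PARENT)  # total depth of the toy taxonomy


def sim_path(c1: str, c2: str) -> float:
    return 1 / minpath(c1, c2)


def sim_lch(c1: str, c2: str) -> float:
    return -math.log(minpath(c1, c2) / (2 * D))


def sim_wup(c1: str, c2: str) -> float:
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


for name, measure in [("path", sim_path), ("lch", sim_lch), ("wup", sim_wup)]:
    print(name, measure("Drug Tolerance", "Skin cancer"))
```

Under these assumptions the two concepts meet only at Disease, so all three measures return low values for the pair.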

INFORMATION CONTENT-BASED MEASURES

Incorporate the probability of the concepts

IC = -log(P(concept))

P(concept)
  Calculated by summing the probability of the concept and the probability of its descendants
  Probabilities are obtained from an external corpus

Resnik, 1995
  sim(c1,c2) = IC(LCS(c1,c2))

Jiang and Conrath, 1997
  sim(c1,c2) = 1 / (IC(c1) + IC(c2) - 2 * IC(LCS(c1,c2)))

Lin, 1998
  sim(c1,c2) = (2 * IC(LCS(c1,c2))) / (IC(c1) + IC(c2))
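As a minimal sketch, the three IC-based formulas can be written directly in terms of IC values. The probabilities used here (0.01, 0.02 and 0.1) are hypothetical numbers chosen for illustration, not figures from the talk, and returning infinity when the Jiang and Conrath denominator is zero is a choice made for this sketch only.

```python
import math


def ic(p: float) -> float:
    """Information content of a concept with probability p."""
    return -math.log(p)


def sim_res(ic_lcs: float) -> float:
    """Resnik: the IC of the least common subsumer."""
    return ic_lcs


def sim_jcn(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Jiang and Conrath: inverse of the IC distance between the concepts."""
    distance = ic_c1 + ic_c2 - 2 * ic_lcs
    # ASSUMPTION: treat a zero distance (identical concepts) as infinite similarity.
    return math.inf if distance == 0 else 1 / distance


def sim_lin(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Lin: IC of the LCS relative to the IC of the two concepts."""
    return 2 * ic_lcs / (ic_c1 + ic_c2)


# Hypothetical probabilities: two specific concepts and their more general LCS.
ic_c1, ic_c2, ic_lcs = ic(0.01), ic(0.02), ic(0.1)

res_value = sim_res(ic_lcs)
jcn_value = sim_jcn(ic_c1, ic_c2, ic_lcs)
lin_value = sim_lin(ic_c1, ic_c2, ic_lcs)
print(res_value, jcn_value, lin_value)
```

Because the LCS is at least as probable as either concept, its IC is no larger than theirs, which keeps the Lin value between 0 and 1.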

IC-BASED SIMILARITY MEASURES

PATH INFORMATION (taxonomy) + PROBABILITY OF CONCEPTS (external corpus)

Disease: C0012634
Drug Related Disorder: C0277579
Drug Tolerance: C0013220
Neoplasm: C1302761
Neoplastic Disease: C1882062
Malignant Neoplasm: C0006826
Skin cancer: C0007114

EXPERIMENTAL FRAMEWORK

Use the open-source UMLS::Similarity package to obtain the similarity between the terms and possible senses in the SenseRelate algorithm

Path information: parent/child relations in the MSH source

Information content: calculated using the UMLSonMedline dataset created by NLM
  Consists of concepts from the 2009AB UMLS and the frequency with which they occurred in Medline, obtained using the Essie search engine (Ide et al. 2007)
  Medline: database of citations of biomedical/clinical articles
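To make the step from corpus frequencies to information content concrete, here is a small sketch. Everything in it is hypothetical: the toy taxonomy in CHILDREN, the counts in COUNTS, and the function names. It assumes a tree (each concept has one parent); in a hierarchy where a concept has several parents, a naive recursion like this would count shared descendants more than once. It also assumes every concept has a non-zero cumulative count, since log(0) is undefined.

```python
import math
from typing import Dict, List

# Hypothetical toy taxonomy (parent -> children) and raw corpus counts.
CHILDREN: Dict[str, List[str]] = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": [],
    "a1": [],
    "a2": [],
}
COUNTS: Dict[str, int] = {"root": 0, "a": 10, "b": 50, "a1": 30, "a2": 10}


def cumulative_count(concept: str) -> int:
    """Count of the concept itself plus the counts of all its descendants."""
    return COUNTS[concept] + sum(cumulative_count(ch) for ch in CHILDREN[concept])


TOTAL = cumulative_count("root")


def probability(concept: str) -> float:
    """P(concept): share of all occurrences that fall at or below the concept."""
    return cumulative_count(concept) / TOTAL


def information_content(concept: str) -> float:
    """IC = -log(P(concept)); general concepts get low IC, specific ones high IC."""
    return -math.log(probability(concept))


for concept in CHILDREN:
    print(concept, cumulative_count(concept), round(information_content(concept), 3))
```

The root covers every occurrence, so its probability is 1 and its IC is 0, while leaf concepts that occur rarely receive the highest IC.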

EVALUATION DATA: MSH-WSD

MSH-WSD dataset (Jimeno-Yepes et al. 2011)

203 target words (ambiguous words) from Medline
  106 terms, e.g. tolerance
  88 acronyms, e.g. CA (calcium, California)
  9 mixtures, e.g. bat (brown adipose tissue)

Each target word has ~187 instances (Medline abstracts); each abstract is ~500 words

Each target word in the instances is assigned a concept from MSH by exploiting the manually assigned MSH concepts of the abstract

Average of 2.08 possible senses per target word

Majority sense over all the target words is 54.5%
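For readers unfamiliar with the majority-sense baseline, the sketch below shows one way such a figure could be computed. The sense labels and instance counts are hypothetical, and averaging the per-word accuracy over target words is an assumption; the slides do not state whether the 54.5% figure is averaged per word or per instance.

```python
from collections import Counter
from typing import Dict, List


def majority_sense_accuracy(gold_senses: List[str]) -> float:
    """Accuracy obtained by always predicting the most frequent sense."""
    counts = Counter(gold_senses)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(gold_senses)


# Hypothetical gold labels for two target words (not MSH-WSD data).
toy_data: Dict[str, List[str]] = {
    "word_a": ["M1"] * 6 + ["M2"] * 4,
    "word_b": ["M1"] * 5 + ["M2"] * 5,
}

per_word = {word: majority_sense_accuracy(labels) for word, labels in toy_data.items()}
# ASSUMPTION: the overall baseline is the mean of the per-word accuracies.
overall = sum(per_word.values()) / len(per_word)
print(per_word, overall)
```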

RESULTS

[Bar chart: accuracy (y-axis, 0 to 0.8) for the baseline, the path-based measures (path, lch, wup, nam) and the IC-based measures (res, jcn, lin). Bar values as extracted: 0.55, 0.72, 0.69, 0.70, 0.72, 0.73, 0.74, 0.74]

COMPARISON ACROSS SUBSETS OF MSH-WSD

[Bar chart: accuracy (y-axis, 0 to 1) by subset]

              Terms  Acronyms  Mixture  Overall
Baseline      0.55   0.54      0.53     0.55
SenseRelate   0.67   0.80      0.73     0.74
MRD           0.71   0.87      0.88     0.80
2-MRD         0.67   0.85      0.93     0.78


WINDOW SIZES

Use the terms surrounding the target word within a specified window: 1, 2, 5, 10, 25, 50, 60, 70

Buspirone attenuates tolerance to morphine in mice with skin_cancer

WINDOW SIZE = 2
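A small sketch of how a context window might be selected. It assumes that a window of size n means up to n terms on each side of the target word, which the slide does not spell out, and context_window is a hypothetical helper name. The multi-word term is kept as the single token skin_cancer, as on the slide.

```python
from typing import List


def context_window(terms: List[str], target_index: int, window: int) -> List[str]:
    """Return up to `window` terms on each side of the target term."""
    left = terms[max(0, target_index - window):target_index]
    right = terms[target_index + 1:target_index + 1 + window]
    return left + right


terms = [
    "Buspirone", "attenuates", "tolerance", "to", "morphine",
    "in", "mice", "with", "skin_cancer",
]
target = terms.index("tolerance")

window_2 = context_window(terms, target, 2)
window_10 = context_window(terms, target, 10)
print(window_2)
print(window_10)
```

Only the terms in the window that also map to a CUI would contribute to the SenseRelate score, so a larger window does not guarantee proportionally more evidence.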

COMPARISON OF WINDOW SIZES FOR LIN

[Line chart: accuracy of lin (y-axis, 0 to 0.8) by window size (x-axis)]

Window size   0     1     2     5     10    25    50    60    70
Accuracy      0.50  0.53  0.65  0.69  0.71  0.74  0.74  0.74  0.74

SURROUNDING TERMS

Not all terms have a concept in the UMLS; therefore, not all surrounding terms in the window are mapped to CUIs

WINDOW SIZES VERSUS MAPPED TERMS

[Line chart: number of mappings for lin (y-axis, 0 to 18) by window size (x-axis)]

Window size          0     1     2     5     10    25    50     60     70
Number of mappings   0     0.27  0.79  1.85  3.49  7.60  12.96  14.28  15.64

FUTURE WORK: MAPPING TERMS

Currently looking at mapping the terms to CUIs using information from the concept mapping system MetaMap

Obtain the terms from MetaMap and do a dictionary look-up in MRCONSO
  Hypothesis: the terms obtained by MetaMap are more accurate than those obtained using the SPECIALIST Lexicon

Obtain the CUIs from MetaMap
  Hypothesis: the CUIs obtained by MetaMap will be more accurate than the dictionary look-up

OBJECTIVE #1

Develop and evaluate a method that can disambiguate terms in biomedical text by exploiting similarity information extrapolated from the UMLS

UMLS::SenseRelate obtains a statistically significantly higher disambiguation accuracy than the baseline

On par with previous unsupervised methods for terms

OBJECTIVE #2

Evaluate the efficacy of IC-based similarity measures over path-based measures on a secondary task

There is no statistically significant difference between the accuracies obtained by the IC-based measures

There is a statistically significant difference between the IC-based measures and the path-based measures

TAKE HOME MESSAGE:

An ambiguous word is often used in the sense that is most similar to the sense of the concepts of the terms that surround it

RESOURCES

Software:
  UMLS::SenseRelate http://search.cpan.org/dist/UMLS-SenseRelate/
  UMLS::Similarity http://search.cpan.org/dist/UMLS-Similarity/

Data:
  MSH-WSD http://wsd.nlm.nih.gov/collaboration.shtml


THANK YOU


QUESTIONS?