literature-based knowledge discovery using natural language processing

Post on 25-Feb-2016

1

Literature-Based Knowledge Discovery using Natural Language Processing

Dimitar Hristovski,1 PhD, Carol Friedman,2 PhD, Thomas C Rindflesch,3 PhD, Borut Peterlin,4 MD PhD

1 Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia
2 Department of Biomedical Informatics, Columbia University, New York
3 National Library of Medicine, Bethesda, Maryland
4 Division of Medical Genetics, UMC, Slajmerjeva 3, Ljubljana, Slovenia

e-mail: dimitar.hristovski@mf.uni-lj.si

2

Part 1: Co-occurrence based LBD

3

Motivation

• Overspecialization
• Information overload
• Large databases
• Need and opportunity for computer-supported knowledge discovery

4

Literature-based Discovery (LBD)

• A method for automatically generating hypotheses (discoveries) from literature

• Hypotheses have the form: Concept1 –Relation– Concept2

• Example: Fish oil –Treats– Raynaud’s disease

5

Background

• Swanson’s LBD paradigm:
  – Concept X (Disease), e.g. Raynaud’s
  – Concepts Y (Pathological or Cell Function, …), e.g. Blood viscosity
  – Concepts Z (Drugs, …), e.g. Fish oil
  – New relation between X and Z? e.g. Treats
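A minimal sketch of the open discovery step in this paradigm, assuming the literature has already been reduced to a table of directly related concepts. The `related` dictionary and the function name `open_discovery` are hypothetical, and the toy data only mirrors the Raynaud’s example; it is not real Medline output.

```python
from collections import defaultdict

def open_discovery(related, x):
    """Return {z: [linking y, ...]} for concepts z that are reachable from x
    through some intermediate y but are not directly related to x."""
    direct = related.get(x, set())
    hypotheses = defaultdict(list)
    for y in sorted(direct):
        for z in sorted(related.get(y, set())):
            # Skip the starting concept and anything already known to relate to it.
            if z != x and z not in direct:
                hypotheses[z].append(y)
    return dict(hypotheses)

# Hand-made, symmetric toy relations (assumption, for illustration only).
related = {
    "Raynaud's": {"Blood viscosity"},
    "Blood viscosity": {"Raynaud's", "Fish oil"},
    "Fish oil": {"Blood viscosity"},
}

print(open_discovery(related, "Raynaud's"))
```

Keeping the linking Y concepts alongside each Z lets the user see why a candidate was proposed.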

6

Biomedical Discovery Support System (BITOLA)

• Goal:
  – discover potentially new relations (knowledge) between biomedical concepts
  – to be used as a research idea generator and/or as
  – an alternative way to search Medline

• System user (researcher or intermediary):
  – interactively guides the discovery process
  – evaluates the proposed relations

7

Extending and Enhancing Literature Based Discovery

• Goal:
  – Make literature based discovery more suitable for disease candidate gene discovery
  – Decrease the number of candidate relations

• Method:
  – Integrate background knowledge:
    • Chromosomal location of diseases and genes
    • Gene expression location
    • Disease manifestation location

8

System Overview

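[Architecture diagram: Databases (Medline, LocusLink, HUGO, OMIM, …) feed Knowledge Extraction, which fills the Knowledge Base (Concepts, Association Rules, Background Knowledge (Chromosomal Locations, …)); the Discovery Algorithm works on the Knowledge Base and is driven through the User Interface]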

9

Terminology Problems during Knowledge Extraction

• Gene names
• Gene symbols
• MeSH and genetic diseases

10

Detected Gene Symbols by Frequency

• type|666548
• II|552584
• III|201776
• component|179643
• CT|175973
• AT|151337
• ATP|147357
• IV|123429
• CD4|99657
• p53|89357
• MR|88682
• SD|85889
• GH|84797
• LPS|68982
• 59|67272
• E2|64616
• 82|63521
• AMP|61862
• TNF|59343
• RA|58818
• CD8|57324
• O2|56847
• ACTH|54933
• CO2|53171
• PKC|51057
• EGF|50483
• T3|49632
• MS|46813
• A2|44896
• ER|43212
• upstream|41820
• PRL|41599

11

Gene Symbol Disambiguation

• Find MEDLINE docs in which we can expect to find gene symbols

• Example of false positive:
  – Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390
  – breast basic conserved 1 (BBC1) gene vs. BBC1 television station featuring new drama series Life Support

12

Binary Association Rules

• X → Y (confidence, support)
• If X Then Y (confidence, support)
• Confidence = % of docs containing Y within the X docs
• Support = number (or %) of docs containing both X and Y
• The relation between X and Y is not known.
• Examples:
  – Multiple Sclerosis → Optic Neuritis (2.02, 117)
  – Multiple Sclerosis → Interferon-beta (5.17, 300)
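The two measures can be sketched as follows, assuming each document is represented as a set of concepts. The function name `association_rule` and the five toy documents are hypothetical. Read this way, the slide's two examples both imply roughly 5,800 Multiple Sclerosis documents (117 / 0.0202 ≈ 5,790 and 300 / 0.0517 ≈ 5,800), which is a useful consistency check.

```python
def association_rule(docs, x, y):
    """Return (confidence_percent, support) for the rule x -> y.

    docs: iterable of sets of concepts, one set per document.
    support: number of documents containing both x and y.
    confidence: percentage of the x documents that also contain y.
    """
    x_docs = [d for d in docs if x in d]
    both = sum(1 for d in x_docs if y in d)
    confidence = 100.0 * both / len(x_docs) if x_docs else 0.0
    return confidence, both

# Toy documents (assumption, not real Medline records).
docs = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis", "Interferon-beta", "Optic Neuritis"},
    {"Multiple Sclerosis"},
    {"Optic Neuritis"},
]

print(association_rule(docs, "Multiple Sclerosis", "Interferon-beta"))
```

Note that the rule is directional: the confidence of Y → X is computed within the Y documents and generally differs.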

13

Discovery Algorithm

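[Diagram: discovery algorithm for candidate genes]

• Concept X (Disease) is linked to Concepts Y (Pathological or Cell Function, …), which are linked to Concepts Z (Genes)
• Match: Chromosomal Region of the disease against Chromosomal Location of the gene
• Match: Manifestation Location of the disease against Expression Location of the gene
• Output: Candidate Gene?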
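The two location matches can be sketched as a filter over the genes Z, assuming each gene carries a cytogenetic band and a set of expression sites. The `Gene` class, the gene symbols, and the prefix test on band names are all hypothetical simplifications, not BITOLA's actual matching logic.

```python
from dataclasses import dataclass, field

@dataclass
class Gene:
    symbol: str
    band: str                                   # chromosomal location, e.g. "Xq28"
    expressed_in: set = field(default_factory=set)

def candidate_genes(genes, disease_region, manifestation_sites):
    """Keep genes located in the disease region and expressed where the disease manifests."""
    kept = []
    for g in genes:
        # Crude prefix test (assumption): "Xq2" covers "Xq28"; real band
        # comparison needs proper parsing of chromosome, arm and band.
        in_region = g.band.startswith(disease_region)
        expressed = bool(g.expressed_in & manifestation_sites)
        if in_region and expressed:
            kept.append(g.symbol)
    return kept

# Made-up genes for illustration only.
genes = [
    Gene("GENE_A", "Xq28", {"brain"}),
    Gene("GENE_B", "Xq28", {"liver"}),
    Gene("GENE_C", "7p13", {"brain"}),
]

print(candidate_genes(genes, "Xq28", {"brain"}))
```

Applying both matches together is what reduces the number of candidate relations: a gene must pass the chromosomal test and the expression test.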

14

Ranking Concepts Z

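[Diagram: starting concept X linked to intermediate concepts Y1, Y2, Y3, …, Yi, Yj, which are linked to candidate concepts Z1, Z2, Z3, …, Zk, …, Zn]

$$\mathrm{Rank}(Z_k) = \sum_{i=1}^{m} \left( S_{XY_i} \cdot S_{Y_i Z_k} \right)$$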
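Reading the ranking formula on this slide as a sum, over the intermediate concepts Yi, of the product of the X–Yi and Yi–Zk rule scores, a small sketch looks like this. The `rules` dictionary, the function name `rank_z`, and the numeric scores are hypothetical; the slide does not say which rule measure S stands for.

```python
from collections import defaultdict

def rank_z(x, rules):
    """rules: {(a, b): score} for association rules a -> b.

    Returns [(z, rank)] sorted by decreasing rank, where
    rank(z) = sum over y of score(x, y) * score(y, z).
    """
    ranks = defaultdict(float)
    for (a, y), s_xy in rules.items():
        if a != x:
            continue
        for (b, z), s_yz in rules.items():
            if b == y and z != x:
                ranks[z] += s_xy * s_yz
    # Highest rank first; ties broken alphabetically for stable output.
    return sorted(ranks.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy scores (assumption).
rules = {
    ("X", "Y1"): 3, ("X", "Y2"): 2,
    ("Y1", "Z1"): 4, ("Y2", "Z1"): 1, ("Y2", "Z2"): 5,
}

print(rank_z("X", rules))  # Z1: 3*4 + 2*1 = 14, Z2: 2*5 = 10
```

A Z reached through several Y concepts accumulates rank from each path, so well-connected candidates rise to the top.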

15

Problem Size

• Full Medline analyzed (ca. 15,000,000 records)
• 87,000,000 association rules between 180,000 biomedical concepts

16

Bilateral Perisylvian Polymicrogyria - BPP (OMIM: 300388)

• Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution

• Clinical characteristics:
  – Mental retardation
  – Epilepsy
  – Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process)

• X linked dominant inheritance

17

[Diagram: candidate gene filtering for BPP, narrowing from 237 genes in Xq28 to 18 gene candidates, then 15 gene candidates, then 2 gene candidates: L1CAM and FLNA. Filters shown: sublocalisation in Xq28; relation between semantic types Cell Movement and Gene or gene products; tissue specific expression]

18

User Interface “cgi-bin” version

19

Automatically search for supporting Medline Citations

20

Part 1: Summary and Conclusions

• Discovery support system (BITOLA) presented
• The system can be used as:
  – Research idea generator, or
  – Alternative method of searching Medline
• Genetic knowledge about the chromosomal locations of diseases and genes included to make BITOLA more suitable for disease candidate gene discovery

21

System Availability

• URL: www.mf.uni-lj.si/bitola/

22

Part 2: Exploring Semantic Relations for LBD

23

Current LBD Systems

• Co-occurrence based
• Concepts
  – Title/Abstract Words/Phrases
  – MeSH
  – UMLS
  – Genes ...
• UMLS Semantic types used for filtering
• Semantic relations between concepts NOT used

24

Drawbacks of Current LBD

• Not all co-occurrences represent a relation
• Users have to read many Medline citations when reviewing candidate relations
• Many spurious (false-positive) relations and hypotheses produced
• No explanation of proposed hypotheses

25

Enhancing the LBD paradigm

• Use semantic relations obtained from
  – two NLP systems (BioMedLee and SemRep)
  to augment
  – co-occurrence based LBD system (BITOLA)

26

Methods

27

Discovery Patterns

• Discovery pattern: set of conditions to be satisfied for the generation of new hypotheses
• Conditions are combinations of semantic relations between concepts
• Maybe_Treats pattern in this research has two forms:
  – Maybe_Treats1
  – Maybe_Treats2

28

Maybe_Treats Discovery Pattern

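[Diagram: the two forms of the Maybe_Treats discovery pattern]

• Maybe_Treats1: Disease X is associated with Change1 in Substance Y1 (or body measurement, body function); Drug Z1 (or substance) causes Opposite_Change1 in Y1; therefore Z1 Maybe_Treats1 X.
• Maybe_Treats2: Disease X is associated with Change2 in Substance Y2 (or body measurement, body function); Disease X2 shows the Same Change2 in Y2; Drug Z2 (or substance) Treats X2; therefore Z2 Maybe_Treats2 X.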

29

Maybe_Treats1 and Maybe_Treats2

• Goal: propose potentially new treatments

• Can work in concert:
  – Propose different treatments (complementary)
  – Propose same treatments using different discovery reasoning (reinforcing)

30

Multiple Usages of Maybe_Treats

• Given Disease X as input:
  – find new treatments Z

• Given Drug Z as input:
  – find diseases X that can be treated

• Given Disease X and Drug Z as input:
  – test whether Z can be used to treat X

31

Semantic Relations Used

• Associated_with_change and Treats used to extract known facts from the literature

• Then Maybe_Treats1 and Maybe_Treats2 predict new treatments based on the known extracted facts
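A minimal sketch of how the two patterns could be evaluated over already-extracted facts, assuming Associated_with_change facts are (subject, target, direction) triples and Treats facts are (drug, disease) pairs. The function name `maybe_treats` and the data layout are hypothetical, and the semantic-type checks (drug vs. disease vs. substance) that a real system needs are omitted here. The facts themselves are the Huntington examples used later in this presentation.

```python
def maybe_treats(disease, changes, treats):
    """Return (pattern, drug, linking concept) hypotheses for a disease.

    changes: set of (subject, target, direction), direction "increase" or "decrease".
    treats:  set of (drug, disease) facts taken as already known.
    """
    opposite = {"increase": "decrease", "decrease": "increase"}
    known = {drug for drug, dis in treats if dis == disease}
    out = []
    for subj, y, direction in sorted(changes):
        if subj != disease:
            continue
        for other, y2, dir2 in sorted(changes):
            if y2 != y or other == disease:
                continue
            # Maybe_Treats1: something changes Y in the opposite direction.
            if dir2 == opposite[direction] and other not in known:
                out.append(("Maybe_Treats1", other, y))
            # Maybe_Treats2: another disease shows the same change in Y
            # and already has a known treatment.
            if dir2 == direction:
                for drug, dis in sorted(treats):
                    if dis == other and drug not in known:
                        out.append(("Maybe_Treats2", drug, y))
    return out

changes = {
    ("Huntington", "Substance P", "decrease"),
    ("Capsaicin", "Substance P", "increase"),
    ("Huntington", "Insulin", "decrease"),
    ("Diabetes mellitus", "Insulin", "decrease"),
}
treats = {
    ("Insulin regulation therapy", "Diabetes mellitus"),
    ("Amantadine", "Huntington"),
}

for hypothesis in maybe_treats("Huntington", changes, treats):
    print(hypothesis)
```

Amantadine is not proposed because the Treats facts mark it as an existing Huntington treatment, which is the elimination role Treats plays in this approach.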

32

Associated_with_change

• One concept associated with a change in another concept, for example:

• Associated_with(Raynaud’s, Blood viscosity, increase):
  – “Local increase of blood viscosity during cold-induced Raynaud's phenomenon.”
  – “Increased viscosity might be a causal factor in secondary forms of Raynaud's disease, …”

• BioMedLee (Friedman et al) used to extract Associated_with_change

33

Treats

• Used to extract drugs known to treat a disease
• Major purpose in our approach:
  – Eliminate drugs already known to be used to treat a disease
  – Find existing treatments for similar diseases

• TREATS(Amantadine,Huntington):
  – “…treatment of Huntington’s disease with amantadine…”

• Treats extracted by SemRep (Rindflesch et al)

34

Results

35

Huntington Disease

• Inherited neurodegenerative disorder
• All 5511 Huntington citations (Jan. 2006) processed with BioMedLee and SemRep
• 35 interesting concepts associated with change selected and corresponding citations (250,000) processed

36

Insulin for Huntington Disease

• Assoc_with(Huntington,Insulin,decrease):
  – “Huntington's disease transgenic mice develop an age-dependent reduction of insulin mRNA expression and diminished expression of key regulators of insulin gene transcription, …”

• Insulin also decreased in diabetes mellitus
• Therapies used to regulate insulin in diabetes might be used for Huntington

37

Capsaicin for Huntington

• Assoc_with(Huntington,Substance P,decrease):
  – “In Huntington's disease brains decreased Substance P staining was found in …”

• Assoc_with(Capsaicin,Substance P,increase):
  – “Capsaicin also attenuated the increase in Substance P content in sciatic nerve, …”

• Capsaicin maybe treats Huntington because Substance P is decreased in Huntington and Capsaicin increases Substance P.

38

Huntington Results - Summary

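[Diagram: the Maybe_Treats pattern instantiated for Huntington (Disease X)]

• Maybe_Treats1: Substance P (Substance Y1) is decreased in Huntington; Capsaicin (Drug Z1) increases Substance P; therefore Capsaicin maybe treats Huntington.
• Maybe_Treats2: Insulin (Substance Y2) is decreased in Huntington and also decreased in Diabetes M (Disease X2); insulin regulation therapy (Z2) treats Diabetes M; therefore it maybe treats Huntington.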

39

Example: Parkinson disease as starting concept. Shown below are some related concepts changed in association with Parkinson

40

Potential Treatments for Parkinson (e.g. gabapentin)

41

Showing Supporting Sentences with highlighted concepts and relations

42

Gabapentin for Parkinson

• Assoc_with(Parkinson,gamma-aminobutyric acid (GABA),decrease):
  – “…studies indicate that patients with Parkinson's disease have decreased basal ganglia gamma-aminobutyric acid function…”

• Assoc_with(Gabapentin,GABA,increase):
  – “Gabapentin, probably through the activation of glutamic acid decarboxylase, leads to the increase in synaptic GABA.”

• Explanation: Gabapentin maybe treats Parkinson because GABA is decreased in Parkinson and Gabapentin increases GABA.
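The explanation on this slide follows a fixed template over the two extracted facts, so it can be generated rather than written by hand. A minimal sketch, assuming the Maybe_Treats1 facts are already available; the function name `explain` and its parameters are hypothetical.

```python
def explain(drug, disease, substance, change_in_disease, change_by_drug):
    """Build a one-sentence explanation for a Maybe_Treats1 hypothesis."""
    state = {"increase": "increased", "decrease": "decreased"}[change_in_disease]
    verb = {"increase": "increases", "decrease": "decreases"}[change_by_drug]
    return (f"{drug} maybe treats {disease} because {substance} is {state} "
            f"in {disease} and {drug} {verb} {substance}.")

print(explain("Gabapentin", "Parkinson", "GABA", "decrease", "increase"))
```

Because the sentence is built from the same facts that triggered the pattern, every proposed hypothesis comes with its own justification.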

43

Part 2: Conclusions

• A new method to improve LBD presented
• Based on discovery patterns and semantic relations extracted by BioMedLee and SemRep, coupled with BITOLA LBD
• Easier for the user to evaluate a smaller number of hypotheses
• Two potentially new therapeutic approaches for Huntington proposed and one for Parkinson
• Raynaud’s – Fish oil discovery replicated

44

The future of Literature-based Discovery

• Development of specific discovery patterns based on semantic relations and further integrated with co-occurrence-based LBD

45

Link, References and some propaganda

• http://www.mf.uni-lj.si/bitola
• Hristovski D, Peterlin B, Mitchell JA, Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int J Med Inform. 2005;74(2–4):289–298. Selected for Yearbook of Medical Informatics 2006.
• Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based discovery. In: Proc AMIA 2006 Symp; 2006. p. 349–53.
• Ahlers C, Hristovski D, Kilicoglu H, Rindflesch TC. Using the Literature-Based Discovery Paradigm to Investigate Drug Mechanisms. In: Proc AMIA 2007 Symp; 2007. p. 6–10. “Distinguished Paper Award AMIA2007”.
• Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Literature-Based Knowledge Discovery using Natural Language Processing. To appear as a chapter in the first LBD book in 2008.
