Using Domain Ontologies to Improve Information Retrieval in Scientific Publications

Engineering Informatics Lab at Stanford

Data

3/29/2012 Engineering Informatics Lab at Stanford University 2

TREC Genomics 2007 Data Set

• Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine

• Metadata available through MEDLINE

• Tasks involve passage, document, and feature retrieval

• Methodologies are evaluated on their response to 36 topics (‘queries’)

• The topics are categorized based on 13 entity types (Proteins, Genes, etc.)

BioPortal

• BioPortal is an integrated resource for biomedical ontologies

• Currently indexes over 300 ontologies including Medical Subject Headings and Gene Ontology

• Provides a comprehensive web service, abstracting the formats and APIs of all underlying ontologies

Methodology

How is Domain Knowledge Integrated

(1) Annotating Documents prior to indexing
– Response time is fast
– Not flexible, the entire index has to be updated if a new ontology needs to be added
– Indexes can grow very large

(2) Query Expansion
– Response time is slower
– Very flexible, ontologies can be dynamically chosen
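
To make the trade-off of option (1) concrete, here is a minimal sketch of index-time annotation. The toy ontology, the concept labels, and the function name are all hypothetical and are not taken from the slides; the point is only that concept labels are written into the index, so lookups are fast but adding a new ontology means rebuilding the index.

```python
from collections import defaultdict

# Hypothetical toy ontology: surface term -> concept label (not real MeSH data).
TOY_ONTOLOGY = {
    "tumor": "C:Neoplasm",
    "neoplasm": "C:Neoplasm",
    "leukemia": "C:Leukemia",
}


def build_annotated_index(docs, ontology):
    """Index every token and, for tokens known to the ontology, its concept label too."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            concept = ontology.get(token)
            if concept is not None:
                # The annotation is stored in the index itself.
                index[concept].add(doc_id)
    return index


docs = {1: "zebrafish neoplasm model", 2: "leukemia in zebrafish", 3: "mouse genome"}
index = build_annotated_index(docs, TOY_ONTOLOGY)
```

Searching for a concept is then a single lookup such as `index["C:Neoplasm"]`, whereas supporting another ontology requires calling `build_annotated_index` again over the whole collection.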

Query Expansion

• TREC Queries are first manually pre-processed

“What [TUMOR TYPES] are found in zebrafish?” => “[Tumor][MeSH] AND zebrafish”

• [Tumor] indicates term that has to be expanded

• [MeSH] indicates ontology that should be used
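
A pre-processed query in this notation can be split mechanically into the parts that need expansion and the parts that do not. The sketch below assumes the markup is exactly "[Term][Ontology]" with clauses separated by "AND"; the function name and the returned tuple shape are illustrative choices, not part of the original system.

```python
import re

# Matches the "[Term][Ontology]" markup shown on the slide.
TAGGED = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def parse_query(query):
    """Split a pre-processed query on AND into (term, ontology) pairs.

    The ontology is None for plain keywords that need no expansion.
    """
    clauses = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        match = TAGGED.fullmatch(part)
        if match:
            clauses.append((match.group(1), match.group(2)))
        else:
            clauses.append((part, None))
    return clauses


clauses = parse_query("[Tumor][MeSH] AND zebrafish")
# clauses == [("Tumor", "MeSH"), ("zebrafish", None)]
```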

Query Expansion

• The pre-processed query is automatically expanded using BioPortal’s API

[Tumor][MeSH] => {Tumor, Neoplasm, Carcinoma, Leukemia …}

[Figure: MeSH concept tree rooted at Tumor (synonyms: Cancer, Neoplasm, …) with nodes Leukemia (synonyms: Leucocythaemias, Leucocythemia), Melanoma, Adenocarcinoma, Nerve Sheath Neo]
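
The expansion step can be sketched without calling any web service. The dictionary below is a toy stand-in for an ontology fragment: the labels come from the slide, but the parent/child structure, the data layout, and the function name are assumptions, and the real system retrieves this information through BioPortal rather than from a local dictionary. The truncated label on the slide is left out.

```python
# Toy stand-in for an ontology fragment. Labels are the ones visible on the
# slide; the hierarchy and the dictionary layout are assumed for illustration.
TOY_MESH = {
    "Tumor": {
        "synonyms": ["Cancer", "Neoplasm"],
        "children": ["Leukemia", "Melanoma", "Adenocarcinoma"],
    },
    "Leukemia": {"synonyms": ["Leucocythaemias", "Leucocythemia"], "children": []},
    "Melanoma": {"synonyms": [], "children": []},
    "Adenocarcinoma": {"synonyms": [], "children": []},
}


def expand_term(term, ontology):
    """Collect the term, its synonyms, and all descendants with their synonyms."""
    expanded, stack, seen = [], [term], set()
    while stack:
        concept = stack.pop()
        if concept in seen:
            continue
        seen.add(concept)
        node = ontology.get(concept, {"synonyms": [], "children": []})
        for label in [concept, *node["synonyms"]]:
            if label not in expanded:
                expanded.append(label)
        # Reverse so children are visited in their listed order.
        stack.extend(reversed(node["children"]))
    return expanded


terms = expand_term("Tumor", TOY_MESH)
```

A term that is missing from the ontology expands to itself only, which is one way the "relevant terms may not be classified" problem shows up in practice.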

Which Domain Knowledge is Integrated

• The use of synonymy results in inconsistent performance (2007 TREC genomics track)

• Common reasons include:
– Relevant terms may not be classified as expected
– Some relevant terms may not be classified in a particular ontology
– Incomplete information (such as synonyms)

• Selection of the appropriate domain ontology is important

Enriching Existing Ontologies

• Existing ontologies must be enriched to complete missing information

• Multiple ontologies can be used to provide different classifications

Ontology: NDF

Concept: Pamidronate

Synonyms from NDF: APD, Amidronate, ...

Synonyms from MeSH: pamidronate calcium, pamidronate monosodium, aredia

Synonyms from NCI: Pamidronic acid, pamidronate disodium, …
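
Enrichment by synonymy amounts to merging the synonym lists that different ontologies hold for the same concept. The sketch below uses only the Pamidronate synonyms listed above (the elided entries are not filled in); the function name and the case-insensitive de-duplication rule are assumptions made for illustration.

```python
def enrich_synonyms(base, *others):
    """Merge synonym lists, keeping first-seen order and dropping
    duplicates that differ only by letter case."""
    merged, seen = [], set()
    for source in (base, *others):
        for synonym in source:
            key = synonym.casefold()
            if key not in seen:
                seen.add(key)
                merged.append(synonym)
    return merged


# Synonyms as listed on the slide; truncated entries are omitted.
ndf = ["APD", "Amidronate"]
mesh = ["pamidronate calcium", "pamidronate monosodium", "aredia"]
nci = ["Pamidronic acid", "pamidronate disodium"]

enriched = enrich_synonyms(ndf, mesh, nci)
```

Note that this only adds names for a concept the base ontology already contains; a concept that is missing entirely stays missing.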

Evaluations

• Baseline

• With Query Expansion (Suggested Sources)

• Using Enriched Ontologies

• Multiple Query Expansions per query

Summary of 2007 TREC genomics track

Max 0.3286

Min 0.0329

Mean 0.1862

Median 0.1897

Queries

Topic Number | Query | Domain Knowledge

205 | What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease? | Symptom Ontology

206 | What [TOXICITIES] are associated with zoledronic acid? | NCI Thesaurus

207 | What [TOXICITIES] are associated with etidronate? | NCI Thesaurus

211 | What [ANTIBODIES] have been used to detect protein PSD-95? | MeSH

229 | What [SIGNS OR SYMPTOMS] are caused by human parvovirus infection? | Symptom Ontology

231 | What [TUMOR TYPES] are found in zebrafish? | MeSH

Baseline

• Queries are used without modification, e.g.,
– “What [ANTIBODIES] have been used to detect protein PSD-95?”
– “What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?”

• Document MAP: 0.277
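
For readers unfamiliar with the measure, document MAP (mean average precision) can be sketched as follows. This is the textbook definition; the track's official evaluation scripts may differ in details such as tie handling or result cut-offs, and the function names and input shapes here are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one topic: mean of the precision values
    measured at the rank of each relevant document retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    if not runs:
        return 0.0
    return sum(average_precision(r, q) for r, q in runs) / len(runs)
```

For example, a ranking `["a", "b", "c", "d"]` with relevant documents `{"a", "c"}` has precision 1/1 at the first hit and 2/3 at the second, so its average precision is (1 + 2/3) / 2.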

Query Expansion

• Queries are formulated in ‘AND’ clauses:

“[Tumor][MeSH] AND zebrafish” => (Tumor, Neoplasm, Carcinoma, Leukemia …) AND zebrafish

• Document MAP: 0.347
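
One way to turn expanded term groups into a query string is sketched below. It assumes that the comma-separated alternatives inside a group are combined with OR and that multi-word terms are quoted as phrases; the slides do not name the search engine or its query syntax, so both choices and the function name are assumptions.

```python
def build_boolean_query(groups):
    """Join alternatives with OR inside each group and join groups with AND."""
    clauses = []
    for terms in groups:
        # Quote multi-word terms so they are treated as phrases (assumed syntax).
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        clause = " OR ".join(quoted)
        clauses.append(f"({clause})" if len(quoted) > 1 else clause)
    return " AND ".join(clauses)


query = build_boolean_query(
    [["Tumor", "Neoplasm", "Carcinoma", "Leukemia"], ["zebrafish"]]
)
# query == "(Tumor OR Neoplasm OR Carcinoma OR Leukemia) AND zebrafish"
```

The same function covers the case where several query terms are expanded, since each expanded term simply contributes one more group.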

Multiple Query Expansion Terms

• Expansion can be performed on multiple terms in the query

• Example: Coronary Artery Disease => {Coronary heart disease, coronary disease, CAD, …}

[Tumor][MeSH] AND [zebrafish][MeSH] => (tumor, neoplasm, …) AND (zebrafish, danio rerio, …)

• Document MAP: 0.352

Enriched Ontology

• Marginal improvement over basic enhanced models

• Document MAP: 0.352

• Why is the improvement only marginal?
– Framework for enrichment based on synonymy is rigid, i.e., relevant terms that are entirely missing in the ontology are still not included
– Relevant terms that are classified differently are never included in the search

Visualization

• Expert knowledge is valuable

• We extend MINOE, a co-occurrence based visualization tool, originally designed for exploring marine ecosystems

• User can browse (or search) documents through ontologies and visualize interactions between concepts

SEE DEMO

Summary

• Search methodologies must be based on semantics in order to tackle terminology inconsistency

• Domain ontologies provide these semantics

• Domain ontologies need to be modified (or enriched) in order to fulfill information needs

• User interaction is important

Future Work

• Using multiple enriched ontologies may provide the necessary terms

• MeSH Descriptors are provided for every publication during indexing and can potentially improve results

• Implement Okapi model for scoring documents
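
The slides do not say which Okapi variant is intended. Assuming the common Okapi BM25 ranking function, a minimal scoring sketch could look like the following; the parameter defaults `k1=1.2` and `b=0.75`, the smoothed IDF form, and the whitespace-tokenized toy documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    docs is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    ["zebrafish", "tumor", "tumor"],
    ["zebrafish", "genome"],
    ["mouse", "genome"],
]
scores = bm25_scores(["zebrafish", "tumor"], toy_docs)
```

With expanded queries, each synonym would be scored as its own query term here; whether synonyms should instead share one weight is a design question the slides leave open.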

Backup Slides

Motivation

• Scientific literature is an important source of information

• Retrieving relevant information from scientific publications is challenging

• Domain terminology is used inconsistently in scientific publications

• Increasing amounts of information amplify the problem

• Improved methodologies based on semantics are required

Background

• Text REtrieval Conference (TREC) organized by NIST has showcased many successful methods

• The Genomics track focused on full-text scientific publications from 49 prominent journals

• Methodologies involved:
– Use of Synonymy from ontologies
– Language based models
– Query expansion and annotations
– Okapi scoring model

Goals

• Understand how domain ontologies can be leveraged

• Understand which domain ontologies can be leveraged

• Develop a knowledge-based approach to integrate domain knowledge with search mechanism
