

IndoUS DL 2003 6/23/03

Text Metadata Mining: Exploring its potential*

Padmini Srinivasan

School of Library & Information Science

The University of Iowa

Iowa City, IA

[email protected]

*Students: Aditya Sehgal, Xin Ying Qiu


Outline

1. Text Mining

2. Metadata-based Topic profiles

3. Function: Exploring topic characteristics via profiles

Problem: Study disease research prevalence

4. Conclusions


1. Text Mining: Novelty and Usefulness

Assist researchers with hypothesis generation, exploration, and testing.

Discover knowledge that is ‘novel’, at least relative to the text collection

Discover knowledge that is potentially ‘useful’

Extract patterns, explore relationships

Propositions/Hypotheses: need follow-up verification


Examples

Of all 45 studies in Medline on chemical X, 80% have been done in the context of disease L, 10% in the context of disease M, and the remainder in the context of disease N.

Gene A is known to be associated with disease X. The literature suggests that gene B shows some key ‘similarities’ to A and therefore B may also be associated with X.


Metadata in Digital Libraries

Support content organization and management

Provide access to content

Dublin Core Metadata Initiative

RDF: Resource Description Framework

Library of Congress Subject Headings (LCSH)

Medical Subject Headings (MeSH)

Question: Can we use metadata for text mining and knowledge discovery?

Given a topic, e.g. ‘Toxic waste’, and a collection of texts such as Medline...


Metadata for Text Mining

Describe topics: topic profiles built from the text collection being mined ~ metadata profiles

- Compare topics via their profiles:
  a. topic similarity
  b. trends over specific features/characteristics

- Look for indirect links between topics

- Given a topic, look for related topics.


Example MEDLINE Record

[Figure: an example MEDLINE record, with a MeSH phrase and a MeSH qualifier labeled]


MeSH Metadata (22,000) and Semantic Types (134)

[Figure: MeSH metadata terms linked to semantic types. Labels shown in the figure: Aldehydes, Formaldehyde, Protein Isoprenylation, Organic Chemical, Chemical, Genetic Function]


2. Topic Profiles

A set of terms that characterize the topic, with weights assigned to represent their relative importance.

{Medline: A vector of MeSH term vectors, one for each of the 134 semantic types.}


Topic: “hip fractures in the elderly”

Search against PubMed: (geriatrics OR elderly) AND hip fractures

Extract MeSH metadata terms from retrieved documents

Build weighted profile: a vector of vectors, which can be limited to MeSH terms of particular semantic types
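
The profile-building steps above can be sketched roughly as follows. The slides do not state the weighting scheme, so this sketch assumes a simple one: the fraction of retrieved documents that carry a term. The function name `build_profile`, the sample documents, and the semantic type assignments are all hypothetical and only for illustration.

```python
from collections import Counter, defaultdict


def build_profile(documents, semantic_type_of):
    """Build a 'vector of vectors': one weighted term vector per semantic type.

    documents: list of MeSH term lists, one list per retrieved document.
    semantic_type_of: dict mapping a MeSH term to its semantic type.
    Weight (assumed, not from the slides): share of documents containing the term.
    """
    n_docs = len(documents)
    if n_docs == 0:
        return {}

    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))  # count each term once per document

    profile = defaultdict(dict)
    for term, df in doc_freq.items():
        sem_type = semantic_type_of.get(term, "Unknown")
        profile[sem_type][term] = df / n_docs
    return dict(profile)


# Hypothetical retrieved documents for the topic "hip fractures in the elderly".
docs = [
    ["Hip Fractures", "Aged", "Osteoporosis"],
    ["Hip Fractures", "Aged"],
    ["Hip Fractures", "Accidental Falls"],
]
# Illustrative type assignments only; real ones would come from the MeSH/UMLS data.
types = {
    "Hip Fractures": "Injury or Poisoning",
    "Aged": "Age Group",
    "Osteoporosis": "Disease or Syndrome",
}

profile = build_profile(docs, types)
for sem_type, vector in sorted(profile.items()):
    print(sem_type, vector)
```

Limiting the profile to particular semantic types then amounts to keeping only the matching keys of the returned dictionary.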


Example Profile: Raynaud's disease


Comparing topics via their profiles

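[Diagram: Topic 1 and Topic 2 are each submitted as a PubMed search; the retrieved documents give one MeSH profile per topic; the two profiles are compared using cosine similarity. Additional label in the diagram: 13,000 genes]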
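
The comparison of two topic profiles by cosine similarity can be sketched as follows. This is a minimal illustration that assumes each profile has been flattened to a single sparse vector (a dict of term weights); how the slides combine the per-semantic-type vectors is not stated. The function name and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical profiles restricted to one semantic type.
topic_1 = {"Aldehydes": 0.5, "Formaldehyde": 0.5}
topic_2 = {"Formaldehyde": 0.9, "Protein Isoprenylation": 0.2}
topic_3 = {"Protein Isoprenylation": 0.7}

print(cosine_similarity(topic_1, topic_2))  # shared term, so greater than 0
print(cosine_similarity(topic_1, topic_3))  # no shared terms
```

Profiles that share no terms score 0, and a profile compared with itself scores 1 (up to floating-point rounding).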


Comparing topics - studying particular characteristics in their profiles

Problem: To study the prevalence of disease research.

‘geographical context’.


Topic: “cholera”

Search against Pubmed:

Extract MeSH metadata terms from retrieved documents

Build weighted profile vectors, which can be limited to MeSH terms in ‘Geographical Area’

Cholera: {0.6 Nigeria, 0.1 Malaysia, ……}

Breast Cancer: {0.1 Poland, 0.8 Italy, ……}

Rank nations
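
Restricting a profile to the ‘Geographical Area’ semantic type and ranking nations by weight might look like the sketch below. The two nation weights come from the cholera example on the slide; the third term, its weight, and its type label are hypothetical, as are the function names.

```python
def geographic_profile(term_weights, semantic_type_of, wanted="Geographical Area"):
    """Keep only the terms whose semantic type matches `wanted`."""
    return {
        term: weight
        for term, weight in term_weights.items()
        if semantic_type_of.get(term) == wanted
    }


def rank_nations(geo_profile):
    """Return nations ordered by descending weight (ties broken by name)."""
    ordered = sorted(geo_profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [nation for nation, _ in ordered]


# Cholera weights for Nigeria and Malaysia are from the slide; the rest is made up.
cholera_terms = {"Nigeria": 0.6, "Malaysia": 0.1, "Vibrio cholerae": 0.9}
term_types = {
    "Nigeria": "Geographical Area",
    "Malaysia": "Geographical Area",
    "Vibrio cholerae": "Bacterium",
}

cholera_geo = geographic_profile(cholera_terms, term_types)
cholera_ranking = rank_nations(cholera_geo)
print(cholera_ranking)
```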


Research Prevalence: Mental Disorders (1961-2000). Ranking nations.


Research Prevalence: Cholera (middle & low income; 1991-2000). Ranking nations.


Research prevalence versus disease prevalence

For each disease:

(a) Rank nations by Disease Prevalence (WHO epid. data)
- estimated by # of cases reported or # of deaths
- Statistical Information System
- weekly epidemiological records

(b) Rank nations by Research Prevalence

Compare rankings using Spearman’s rank correlation coefficient.

Analysis limited to the decade of the 90s.

Question: So how does the prevalence of research compare with the prevalence of the disease?
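
The ranking comparison can be illustrated with a small, dependency-free sketch of Spearman's rank correlation (the Pearson correlation of the two rank vectors, with tied values given their average rank). The case counts and research weights below are invented purely for illustration and are not data from the study; the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        return float("nan")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # a constant ranking has no defined correlation
    return cov / (sd_x * sd_y)


# Invented values for four nations, in the same nation order in both lists.
reported_cases = [500, 120, 40, 10]          # disease prevalence
research_weight = [0.60, 0.20, 0.15, 0.05]   # research prevalence (profile weights)

rho = spearman(reported_cases, research_weight)
print(round(rho, 3))
```

A coefficient near +1 means nations with more reported disease also show more research on it; a significance test on the coefficient is a separate step not shown here.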


19 diseases

Breast cancer, Colorectal cancer, Hodgkins disease, Meningitis, Dengue, Tuberculosis, Liver neoplasms, Prostate cancer, Ovarian cancer, Esophagus cancer, Cholera, AIDS, Stomach cancer, Melanoma, Leprosy, Malaria, Yellow fever, Trypanosomiasis, Dracunculiasis


Disease         Income   N     CC
Breast Cancer   All      168   0.645*
                High     35    0.856*
                Medium   71    0.709*
                Low      61    0.372*
Hodgkins        All      165   0.539*
                High     34    0.71*
                Medium   70    0.545*
                Low      61    0.386*

* significant at the 0.05 level


Observations:

Diseases most prevalent in the high or middle income groups show a significant +ve correlation (9/10 diseases)

Diseases most prevalent in the low income group: a significant +ve correlation is less likely (4/9, 44%).


Temporal analysis on disease research

Extract the top 3 ranked diseases studied in the context of each nation

Pool these together

How often does a disease rank in the top 3 positions?


Topic: Each nation

Sweden: {0.6 Breast Cancer, 0.1 Malaria , ……}

Nigeria: {0.1 Breast Cancer, 0.8 Malaria, ……}

Rank diseases


Pooling: (for each decade & each income group)
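
The pooling step (take each nation's top 3 diseases, then count how often each disease appears) can be sketched as below for a single decade and income group. The Breast Cancer and Malaria weights for Sweden and Nigeria come from the slides; the Cholera and AIDS weights are invented to give each nation more than three diseases, and the function name is hypothetical.

```python
from collections import Counter


def top_k_counts(nation_profiles, k=3):
    """Count how often each disease ranks in the top k of a nation's profile."""
    counts = Counter()
    for profile in nation_profiles.values():
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
        counts.update(disease for disease, _ in ranked[:k])
    return counts


# One pool, i.e. one decade and one income group (grouping is left to the caller).
pool = {
    "Sweden": {"Breast Cancer": 0.6, "Malaria": 0.1, "Cholera": 0.05, "AIDS": 0.2},
    "Nigeria": {"Breast Cancer": 0.1, "Malaria": 0.8, "Cholera": 0.3, "AIDS": 0.4},
}

counts = top_k_counts(pool)
for disease, n in counts.most_common():
    print(disease, n)
```

Running the same function once per decade and per income group gives the counts that a temporal comparison would be built on.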


Observations from the study:

Collecting epidemiological data is extremely complicated.

Collect it at a fine-grained level of analysis: different forms of Leishmaniasis; Plague

Complement existing efforts at collecting epidemiological data.

Consider more complex phenomena such as the prevalence of Leishmania and HIV as co-infections.

Research based evidence to explore policy issues.


Conclusions:

Metadata can be exploited for text mining

MeSH ~ rich metadata scheme

Importance of metadata for digital libraries

Other text mining applications built on DL?

Domain independent ~ accounting!

Thank you!