IndoUS DL 2003 6/23/03
Text Metadata Mining: Exploring its potential*
Padmini Srinivasan
School of Library & Information Science
The University of Iowa
Iowa City, IA
*Students: Aditya Sehgal, Xin Ying Qiu
Outline
1. Text Mining
2. Metadata-based Topic profiles
3. Function: Exploring topic characteristics via profiles
Problem: Study disease research prevalence
4. Conclusions
1. Text Mining: Novelty and Usefulness
Assist researchers with hypothesis generation, exploration, and testing.
Discover knowledge that is ‘novel’ at least relative to the text collection.
Discover knowledge that is potentially ‘useful’.
Extract patterns, explore relationships.
Propositions/Hypotheses: need follow-up verification.
Examples
Of all 45 studies in Medline on chemical X, 80% have been done in the context of disease L, 10% in the context of disease M, and the remainder in the context of disease N.
Gene A is known to be associated with disease X. The literature suggests that gene B shows some key ‘similarities’ to A and therefore B may also be associated with X.
Metadata in Digital Libraries
Support content organization and management
Provide access to content
Dublin Core Metadata Initiative
RDF: Resource Description Framework
Library of Congress Subject Headings (LCSH)
Medical Subject Headings (MeSH)
Question: Can we use metadata for text mining and knowledge discovery?
Given a topic, e.g. ‘Toxic waste’, and a collection of texts such as Medline..
Metadata for Text Mining
Describe topics: topic profiles built from the text collection being mined ~ metadata profiles
- Compare topics via their profiles:
  a. topic similarity
  b. trends over specific features/characteristics
- Look for indirect links between topics
- Given a topic, look for related topics.
MeSH Metadata
[Figure: MeSH terms (22,000) mapped to Semantic Types (134). Labels visible in the figure: Aldehydes, Formaldehyde, Protein Isoprenylation, Organic Chemical, Chemical, Genetic Function.]
2. Topic Profiles
A set of terms that characterize the topic, with weights assigned to represent their relative importance.
{Medline: A vector of MeSH term vectors, one for each of the 134 semantic types.}
Topic: “hip fractures in the elderly”
Search against PubMed: (geriatrics or elderly) AND hip fractures
Extract MeSH metadata terms from retrieved documents
Build weighted profile: vector of vectors
can be limited to MeSH terms of particular semantic types
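The profile-building steps can be sketched roughly as follows. This is a minimal illustration, not the method used in the study: the slides do not say how weights are computed, so the sketch assumes a weight equal to the fraction of retrieved documents carrying the term. The names `build_profile`, `TOY_SEMANTIC_TYPE`, and the lowercase semantic type labels are hypothetical, and the documents are invented.

```python
from collections import Counter, defaultdict

# Hypothetical term -> semantic type mapping (toy labels, not taken from MeSH/UMLS).
TOY_SEMANTIC_TYPE = {
    "Hip Fractures": "injury",
    "Accidental Falls": "injury",
    "Aged": "age_group",
    "Osteoporosis": "disease",
}


def build_profile(docs, semantic_type, allowed_types=None):
    """Return {semantic_type: {mesh_term: weight}} for a list of documents.

    Each document is a list of MeSH terms. The weight is ASSUMED to be the
    fraction of documents that carry the term; the slides do not specify it.
    """
    if not docs:
        return {}
    counts = defaultdict(Counter)
    for terms in docs:
        for term in set(terms):  # count a term at most once per document
            stype = semantic_type.get(term)
            if stype is None:
                continue  # term has no known semantic type
            if allowed_types is not None and stype not in allowed_types:
                continue  # profile limited to particular semantic types
            counts[stype][term] += 1
    n_docs = len(docs)
    return {
        stype: {term: c / n_docs for term, c in ctr.items()}
        for stype, ctr in counts.items()
    }


# Invented MeSH term lists standing in for documents retrieved from PubMed.
docs = [
    ["Hip Fractures", "Aged", "Osteoporosis"],
    ["Hip Fractures", "Aged", "Accidental Falls"],
    ["Hip Fractures", "Aged"],
    ["Hip Fractures", "Osteoporosis"],
]

profile = build_profile(docs, TOY_SEMANTIC_TYPE)
limited = build_profile(docs, TOY_SEMANTIC_TYPE, allowed_types={"disease"})
```

Here `profile` is the "vector of vectors" (one inner vector per semantic type), and `limited` shows the restriction to a single semantic type.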
Comparing topics via their profiles
Topic 1: PubMed search -> documents -> MeSH Profile
Topic 2: PubMed search -> documents -> MeSH Profile
Compare the two MeSH profiles (cosine similarity)
13,000 genes
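As a rough illustration of the comparison step, the sketch below computes cosine similarity between two profiles stored as sparse term-to-weight dictionaries. The function name `cosine_similarity` and the example weights are invented for illustration. A vector-of-vectors profile could be compared one semantic type at a time in the same way, though the slides do not say how the per-type results are combined.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse {term: weight} profiles."""
    # Only terms present in both profiles contribute to the dot product.
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_p * norm_q)


# Invented weights, for illustration only.
topic1 = {"Hip Fractures": 0.9, "Aged": 0.7, "Osteoporosis": 0.4}
topic2 = {"Osteoporosis": 0.8, "Aged": 0.5, "Calcium": 0.3}

similarity = cosine_similarity(topic1, topic2)
```

The result is 1 for profiles pointing in the same direction and 0 for profiles with no shared terms, so it does not depend on how many documents each search retrieved.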
Comparing topics - studying particular characteristics in their profiles
Problem: To study the prevalence of disease research.
‘geographical context’.
Topic: “cholera”
Search against PubMed:
Extract MeSH metadata terms from retrieved documents
Build weighted profile vectors
can be limited to MeSH terms in ‘Geographical Area’
Cholera: {0.6 Nigeria, 0.1 Malaysia, ……}
Breast Cancer: {0.1 Poland, 0.8 Italy, ……}
Rank nations
Research Prevalence: Cholera (middle & low income; 1991 - 2000)
Ranking nations
Research prevalence versus disease prevalence
For each disease:
(a) Rank nations by Disease Prevalence (WHO epid. data)
- estimated by # of cases reported or # of deaths
- Statistical Information System
- Weekly Epidemiological Record
(b) Rank nations by Research Prevalence
Compare rankings using Spearman’s rank correlation coefficient.
Analysis limited to the decade of the 90s.
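The ranking comparison can be sketched as below: Spearman's coefficient computed as the Pearson correlation of the two rank vectors, with tied values sharing their mean rank. This is an illustrative sketch only. The helper names, the nation labels, and all the numbers are invented, and the slides do not say how ties or missing nations were handled in the study.

```python
import math


def average_ranks(values):
    """Rank values from largest (rank 1) downward; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one ranking is constant
    return cov / (sx * sy)


# Invented figures for five hypothetical nations, in the same order in both lists.
nations = ["Nation A", "Nation B", "Nation C", "Nation D", "Nation E"]
reported_cases = [1200, 300, 4500, 80, 950]            # disease prevalence proxy
research_weight = [0.30, 0.10, 0.35, 0.05, 0.20]       # weight in the geographic profile

rho = spearman(reported_cases, research_weight)
```

A significance test on `rho`, as reported in the table of results, would be an additional step that this sketch does not cover.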
Question: So how does the prevalence of research compare with the prevalence of the disease?
19 diseases
Breast cancer
Colorectal cancer
Hodgkin's disease
Meningitis
Dengue
Tuberculosis
Liver neoplasms
Prostate cancer
Ovarian cancer
Esophagus cancer
Cholera
AIDS
Stomach cancer
Melanoma
Leprosy
Malaria
Yellow fever
Trypanosomiasis
Dracunculiasis
Disease          Income    N     CC
Breast Cancer    All       168   0.645*
                 High      35    0.856*
                 Medium    71    0.709*
                 Low       61    0.372*
Hodgkins         All       165   0.539*
                 High      34    0.71*
                 Medium    70    0.545*
                 Low       61    0.386*
* 0.05 sig. level
Observations:
Diseases most prevalent in the high or middle income groups: significant +ve correlation (9/10 diseases).
Diseases most prevalent in the low income group: significant +ve correlation less likely (4/9, 44%).
Temporal analysis on disease research
Extract the top 3 ranked diseases studied in the context of each nation
Pool these together
How often does a disease rank in the top 3 positions?
Topic: Each nation
Sweden: {0.6 Breast Cancer, 0.1 Malaria, ……}
Nigeria: {0.1 Breast Cancer, 0.8 Malaria, ……}
Rank diseases
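The top-3 pooling described for the nation profiles might look like the sketch below. It ranks the diseases in each nation's profile, keeps the top 3, pools them, and counts how often each disease appears. The function names, nation labels, and weights are all invented for illustration, and alphabetical tie-breaking is an assumption. The slides do not describe how the time periods of the temporal analysis were handled, so the sketch covers a single period.

```python
from collections import Counter


def top_k(profile, k=3):
    """Return the k highest-weighted diseases; ties broken alphabetically (assumed)."""
    ranked = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [disease for disease, _ in ranked[:k]]


def pooled_top_counts(nation_profiles, k=3):
    """Count how often each disease appears among the top k across all nations."""
    pool = Counter()
    for profile in nation_profiles.values():
        pool.update(top_k(profile, k))
    return pool


# Invented profiles for three hypothetical nations.
nation_profiles = {
    "Nation A": {"Breast Cancer": 0.6, "Malaria": 0.1, "Tuberculosis": 0.2, "Cholera": 0.05},
    "Nation B": {"Breast Cancer": 0.1, "Malaria": 0.8, "Tuberculosis": 0.3, "Cholera": 0.4},
    "Nation C": {"Breast Cancer": 0.5, "Malaria": 0.2, "Tuberculosis": 0.1, "Cholera": 0.3},
}

counts = pooled_top_counts(nation_profiles)
```

Running the same count separately for each time window would give the trend over time.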
Observations from the study:
Collecting epidemiological data is extremely complicated.
Collect it at a fine-grained level of analysis, e.g. different forms of Leishmaniasis; Plague.
Complement existing efforts at collecting epidemiological data.
Consider more complex phenomena such as the prevalence of Leishmania and HIV as co-infections.
Research based evidence to explore policy issues.