Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
Frizo Janssens, Wolfgang Glänzel, and Bart De Moor
Presented by Cindy Burklow
CS 685: Special Topics in Data Mining
Professor Dr. Jinze Liu
University of Kentucky
April 17th, 2008
Outline
Introduction
Motivation
Related Work
Proposed Models
Proposed Algorithms
Results: Hybrid & Dynamic Clustering
Discussion of Pros and Cons
Questions
References
Introduction
Bioinformatics …
◦Computer Science
◦Information Technology
◦Solves problems in Biomedicine
Goal of Paper: Investigate
◦Cognitive structure
◦Dynamics of bioinformatics core
◦Sub-disciplines
◦ISI Web of Science & MEDLINE
◦Retrieval of core literature in bioinformatics
MeSH = Medical Subject Headings
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.
Motivation
Bioinformatics field …
◦Dynamic
◦Evolving discipline
◦Fast growth rate
Monitor current trends
Predict future direction
Decision Making
◦Grants
◦Business Ventures
◦Research Opportunities
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.
Related Work
Web mining
Bibliometrics
Text mining & citation analysis
◦Mapping of knowledge
◦Charting science & technology fields
Textual & graph-based approaches
◦Different perceptions of similarity between documents or groups of documents
Related Work
Establishing the Data Set
Patra & Mishra – Bibliometric Study
◦MeSH term based
◦Liberal delineation strategy with maximal recall
◦Broader interpretation of bioinformatics
◦Less restricted search strategy
◦Broader coverage of underlying database
◦14,563 journal papers
Related Work
Hybrid Clustering
◦He – Unsupervised spectral clustering of web pages
◦Wang & Kitsuregawa – Contents-link coupled clustering algorithm of web pages
Dynamic hybrid clustering
◦Mei & Zhai – Temporal Text Mining, using Kullback-Leibler divergence for coherent themes & Hidden Markov Models
◦Griffiths & Steyvers – Latent Dirichlet Allocation with hot topics in PNAS abstracts
Models: Data Set
Bibliometric Retrieval Strategy
Novel subject delineation strategy
◦Retrieve core literature
◦Combines textual components & bibliometric, citation-based techniques
◦Web of Science Edition of Thomson Scientific
  7401 bioinformatics-related papers
  1981 to 2004
  Titles, abstracts, author keywords, and MeSH terms
Models – Text Analysis
◦All text was indexed with Jakarta Lucene Platform
◦Encoded in Vector Space Model using TF-IDF weighting scheme
◦Text-based similarities
  Cosine of angle between the vector representations of two papers
◦Stop words removed during indexing
◦Porter Stemmer
  All remaining terms from titles and abstracts
◦Bigrams
  Candidate list of MeSH descriptors, author keywords, and noun phrases
◦Latent Semantic Indexing (LSI) – 10 terms
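The text-based similarity on this slide can be illustrated with a minimal sketch. It assumes documents that are already tokenized and stemmed, and it uses one common TF-IDF variant (raw term frequency times log(N/df)); the slide does not say which variant Lucene applied, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy, already-stemmed documents (hypothetical data).
docs = [
    ["protein", "structur", "predict"],
    ["rna", "structur", "predict"],
    ["microarrai", "gene", "express"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # the two papers share two terms
sim_02 = cosine(vecs[0], vecs[2])  # the two papers share no terms
```

A text-based distance for clustering can then be taken as 1 minus this similarity.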
Models – Citation Analysis
Citation Graphs
Link-based algorithms
◦HITS
◦PageRank
Representative Publications
[Diagram: Documents; quantify similarities: text-based and citation-based (co-citation, bibliographic coupling (BC)); Boolean input vectors; cosine; combine]
Image Reference: Google Logo from http://www.google.com
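As a rough illustration of the citation-based similarity, the sketch below computes bibliographic coupling as the cosine between Boolean reference vectors, which reduces to the number of shared references divided by the square root of the product of the two reference-list lengths. Reading the diagram labels this way is an assumption, and the paper identifiers and reference lists are hypothetical.

```python
import math

def bibliographic_coupling(refs_a, refs_b):
    """Cosine of two Boolean reference vectors: |A & B| / sqrt(|A| * |B|)."""
    a, b = set(refs_a), set(refs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical reference lists of three papers.
references = {
    "paper_1": ["r1", "r2", "r3", "r4"],
    "paper_2": ["r3", "r4", "r5", "r6"],
    "paper_3": ["r7"],
}
bc_12 = bibliographic_coupling(references["paper_1"], references["paper_2"])
bc_13 = bibliographic_coupling(references["paper_1"], references["paper_3"])
```

The same function gives a co-citation similarity when each set holds the papers that cite a document instead of the papers it cites.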
Models – Clustering
Agglomerative Hierarchical Clustering Algorithm with Ward’s Method
Hard Clustering Algorithm:
◦Every publication is assigned to exactly 1 cluster.
Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering
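A minimal sketch of agglomerative clustering with Ward's method, working directly on a distance matrix through the Lance-Williams update on squared distances. This is a plain-Python illustration and not the implementation used in the paper; the function name and the toy points are hypothetical.

```python
def ward_clustering(dist, n_clusters):
    """Agglomerative clustering with Ward's method on a symmetric distance matrix.

    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # The Lance-Williams update for Ward operates on squared distances.
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}

    def get(a, b):
        return d2[(a, b) if a < b else (b, a)]

    next_id = n
    while len(members) > max(n_clusters, 1):
        # Merge the pair of active clusters with the smallest Ward distance.
        i, j = min(d2, key=d2.get)
        d_ij = d2[(i, j)]
        ni, nj = len(members[i]), len(members[j])
        new_d = {}
        for k in members:
            if k in (i, j):
                continue
            nk = len(members[k])
            new_d[k] = ((ni + nk) * get(i, k) + (nj + nk) * get(j, k)
                        - nk * d_ij) / (ni + nj + nk)
        # Drop every entry that involves one of the two merged clusters.
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        members[next_id] = members.pop(i) + members.pop(j)
        for k, v in new_d.items():
            d2[(k, next_id)] = v  # k < next_id always holds
        next_id += 1
    return sorted(sorted(m) for m in members.values())

# Toy one-dimensional points (hypothetical data).
points = [0.0, 1.0, 10.0, 11.0, 50.0]
dist = [[abs(p - q) for q in points] for p in points]
three_clusters = ward_clustering(dist, 3)
two_clusters = ward_clustering(dist, 2)
```

Cutting the hierarchy at a fixed number of clusters yields a hard clustering: every item lands in exactly one cluster.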
Models – Clustering
Optimal number of clusters
Strategy: combine distance-based & stability-based methods
Dendrogram observation
Silhouette Curves: mean text-based and citation-based
Stability Diagram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
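The silhouette curves plot a mean silhouette value against the number of clusters. The sketch below shows how one such value could be computed from a distance matrix and a hard cluster assignment; it assumes the standard silhouette definition, with singletons scored as 0, and the function name and toy data are hypothetical.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette value of a hard clustering (needs at least two clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            continue  # convention: a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the item's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for lab, other in clusters.items() if lab != lab_i
        )
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Toy one-dimensional points (hypothetical data).
points = [0.0, 1.0, 10.0, 11.0, 50.0]
dist = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(dist, [0, 0, 1, 1, 2])  # nearby points grouped together
bad = mean_silhouette(dist, [0, 1, 0, 1, 2])   # nearby points split apart
```

Evaluating this for each candidate number of clusters, once with text-based and once with citation-based distances, would give curves of the kind named on the slide.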
Proposed Algorithm – Hybrid Clustering
Cluster Input: Distances
Combining text mining and bibliometrics
◦Integrate text & citation info early in mapping process, before applying the clustering algorithm
Weighted linear combination
Fisher’s inverse chi-square method
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.
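The two integration schemes named on this slide can be sketched as follows. The weight `alpha` and both function names are hypothetical. The Fisher variant assumes each distance has already been converted to a p-value in (0, 1]; how the paper performs that conversion is not shown on the slide.

```python
import math

def linear_combination(d_text, d_cite, alpha=0.5):
    """Weighted linear combination of a text-based and a citation-based distance."""
    return alpha * d_text + (1.0 - alpha) * d_cite

def fisher_combine(p_values):
    """Fisher's inverse chi-square method for k independent p-values.

    X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of
    freedom under the null hypothesis; the combined p-value is P(chi2 >= X).
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom 2k.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

combined_distance = linear_combination(0.2, 0.6, alpha=0.25)
combined_p = fisher_combine([0.05, 0.05])
```

For two p-values the formula simplifies to p1·p2·(1 − ln(p1·p2)), so two moderately small p-values combine into a smaller one.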
Proposed Algorithm – Dynamic Hybrid Clustering
Goal: Match & track clusters through time
Process:
◦Separate hybrid clustering for each period
◦Determine optimal number of clusters
  Dendrogram
  Silhouette curve
  Ben-Hur stability plot
◦Construct complete graph
  All cluster centroids from each period as nodes
  Edge weights as mutual cosine similarities in LSS
◦Form Cluster Chains
  Keep edge weights > threshold, T1
  Allow qualifying clusters to join > threshold, T2
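The chain-forming step can be sketched as below, under simplifying assumptions: centroids are plain dense vectors, only the first threshold T1 is modeled, and chains are taken to be the connected components of the thresholded graph. The slide does not spell out the join rule for T2, so it is left out; all names and the toy centroids are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_chains(centroids, t1):
    """Group (period, cluster_id) nodes into chains.

    Every pair of centroids forms an edge of the complete graph; edges with
    cosine similarity > t1 are kept, and each connected component of the
    remaining graph is returned as one chain (assumed reading of the slide).
    """
    nodes = sorted(centroids)
    parent = {node: node for node in nodes}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if cosine(centroids[a], centroids[b]) > t1:
                parent[find(a)] = find(b)

    chains = {}
    for node in nodes:
        chains.setdefault(find(node), []).append(node)
    return sorted(chains.values())

# Hypothetical centroids for two periods.
centroids = {
    (1, "a"): [1.0, 0.0, 0.0],
    (2, "a"): [0.9, 0.1, 0.0],
    (1, "b"): [0.0, 1.0, 0.0],
    (2, "b"): [0.0, 0.9, 0.2],
    (2, "c"): [0.0, 0.0, 1.0],
}
chains = cluster_chains(centroids, t1=0.8)
```

In this toy run clusters "a" and "b" are each tracked across both periods, while cluster "c" of period 2 stays a chain of its own.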
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Silhouette Curve
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Stability
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Dendrogram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Cluster Characterization (cluster: number of papers)
◦RNA structure prediction: 205
◦Protein structure prediction: 1167
◦Systems biology & molecular networks: 694
◦Phylogeny & Evolution: 749
◦Genome sequencing & assembly: 640
◦Gene / promoter / motif prediction: 995
◦Molecular DBs & annotation platforms: 1091
◦Multiple sequence alignment: 713
◦Microarray analysis: 1147
Results – Dynamic Clustering
Histogram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Dynamic Clustering
Cluster Chains
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
Yearly Publication Output among Cluster Chains
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.
Dynamic Term Network
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.
Pros & Cons
Pros
◦Offers fresh perspective on clustering
◦Integrates various techniques
◦Provides insight into bioinformatics
Cons
◦Challenge of selecting the optimal number of clusters still exists
◦There are many steps required to implement their approach
Questions
References
Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233
ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch
PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/
The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/
Fisher’s Method: http://en.wikipedia.org/wiki/Fisher%27s_method
“Data Mining - Concepts and techniques” by Han and Kamber, Morgan Kaufmann, 2006. (ISBN:1-55860-901-6)