Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Frizo Janssens, Wolfgang Glänzel, and Bart De Moor

Presented by Cindy Burklow

CS 685: Special Topics in Data Mining
Professor Dr. Jinze Liu
University of Kentucky
April 17th, 2008

Outline
Introduction
Motivation
Related Work
Proposed Models
Proposed Algorithms
Results: Hybrid & Dynamic Clustering
Discussion of Pros and Cons
Questions
References

Introduction
Bioinformatics …
◦ Computer Science
◦ Information Technology
◦ Solves problems in Biomedicine
Goal of Paper: Investigate
◦ Cognitive structure
◦ Dynamics of bioinformatics core
◦ Sub-disciplines
◦ ISI Web of Science & MEDLINE
◦ Retrieval of core literature in bioinformatics

MeSH = Medical Subject Headings

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.

Motivation
Bioinformatics field …
◦ Dynamic
◦ Evolving discipline
◦ Fast growth rate
Monitor current trends
Predict future direction
Decision Making
◦ Grants
◦ Business Ventures
◦ Research Opportunities

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

Related Work
Web mining
Bibliometrics
Text mining & citation analysis
◦ Mapping of knowledge
◦ Charting science & technology fields
Textual & graph-based approaches
◦ Different perceptions of similarity between documents or groups of documents

Related Work

Establishing the Data Set
Patra & Mishra – Bibliometric Study
◦ MeSH term based
◦ Liberal delineation strategy with maximal recall
◦ Broader interpretation of bioinformatics
◦ Less restricted search strategy
◦ Broader coverage of underlying database
◦ 14,563 journal papers

Related Work
Hybrid Clustering
◦ He – Unsupervised spectral clustering of web pages
◦ Wang & Kitsuregawa – Contents-link coupled clustering algorithm for web pages
Dynamic hybrid clustering
◦ Mei & Zhai – Temporal Text Mining, using Kullback-Leibler divergence for coherent themes & Hidden Markov Models
◦ Griffiths & Steyvers – Latent Dirichlet Allocation with hot topics in PNAS abstracts

Models: Data Set
Bibliometric Retrieval Strategy
Novel subject delineation strategy
◦ Retrieve core literature
◦ Combines textual components & bibliometric, citation-based techniques
◦ Web of Science Edition of Thomson Scientific: 7401 bioinformatics-related papers, 1981 to 2004; titles, abstracts, author keywords, and MeSH terms

Models – Text Analysis
◦ All text was indexed with the Jakarta Lucene platform
◦ Encoded in Vector Space Model using TF-IDF weighting scheme
◦ Text-based similarities: cosine of the angle between the vector representations of two papers
◦ No stop words used during indexing
◦ Porter Stemmer: applied to all remaining terms from titles and abstracts
◦ Bigrams: candidate list of MeSH descriptors, author keywords, and noun phrases
◦ Latent Semantic Indexing (LSI) – 10 terms
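As a rough illustration of the TF-IDF encoding and cosine similarity named on this slide, here is a minimal sketch in plain Python. It is not the Lucene-based pipeline used in the paper: the toy documents, the function names `tfidf_vectors` and `cosine`, and the plain `tf * log(N/df)` weighting are all assumptions made for the example, and stemming, bigram detection, and LSI are left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up documents, assumed to be stemmed and free of stop words already.
docs = [
    ["protein", "structur", "predict"],
    ["protein", "fold", "predict"],
    ["microarrai", "gene", "express"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two papers sharing terms
sim_02 = cosine(vecs[0], vecs[2])  # two papers with no terms in common
```

In this toy set the first two documents share two terms and so get a positive similarity, while the first and third share none and get zero.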

Models – Citation Analysis
Citation Graphs
Link-based algorithms
◦ HITS
◦ PageRank
[Diagram: Quantify Similarities. Labels: Documents; Representative Publications; Text-based; Citation-based; Co-citation; Bibliographic coupling (BC); Boolean Input Vectors; Cosine; Combine]

Image Reference: Google Logo from http://www.google.com
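The citation-based similarities named on this slide (bibliographic coupling and co-citation, with cosine on Boolean input vectors) can be sketched as follows. This is an illustrative reading, not the paper's code: Boolean vectors are represented as Python sets, and the paper and reference identifiers are hypothetical.

```python
import math

def boolean_cosine(a, b):
    """Cosine between two Boolean vectors given as sets of non-zero positions."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical reference lists: paper -> set of works it cites.
references = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3", "r4"},
    "p3": {"r5"},
}

# Bibliographic coupling: two papers are similar if they cite the same works.
bc_12 = boolean_cosine(references["p1"], references["p2"])
bc_13 = boolean_cosine(references["p1"], references["p3"])

def cited_by(refs):
    """Invert the reference lists: cited work -> set of papers citing it."""
    citing = {}
    for paper, cited in refs.items():
        for work in cited:
            citing.setdefault(work, set()).add(paper)
    return citing

# Co-citation: two works are similar if the same papers cite both of them.
citing = cited_by(references)
cocit_r2_r3 = boolean_cosine(citing["r2"], citing["r3"])
```

Both measures use the same cosine; they differ only in which direction of the citation graph supplies the Boolean vectors.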

Models – Clustering
Agglomerative Hierarchical Clustering Algorithm with Ward’s Method
Hard Clustering Algorithm:
◦ Every publication is assigned to exactly 1 cluster.

Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering

http://en.wikipedia.org/wiki/Image:Clusters.PNG

http://en.wikipedia.org/wiki/Image:Hierarchical_clustering_diagram.png
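A naive sketch of agglomerative clustering with Ward's criterion may help make the slide concrete. It assumes points with Euclidean coordinates, whereas the paper clusters from text- and citation-based distances. The function name `ward_clustering` and the toy points are made up, and the quadratic search is chosen for readability, not speed.

```python
def ward_clustering(points, k):
    """Merge clusters bottom-up with Ward's criterion until k clusters remain.

    Returns a list of clusters, each a list of point indices (hard clustering:
    every point ends up in exactly one cluster).
    """
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Ward: increase in within-cluster sum of squares caused by merging a and b.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > k:
        # Pick the pair whose merge increases the total variance the least.
        _, i, j = min(
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy 2-D points (illustrative only).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = ward_clustering(points, 3)
```

Recording the merge cost at each step would give the heights needed to draw a dendrogram.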

Models – Clustering
Optimal number of clusters
Strategy: combine distance-based & stability-based methods
◦ Dendrogram observation
◦ Silhouette curves: mean text-based and citation-based
◦ Stability diagram

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
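To show what a point on a silhouette curve measures, the sketch below computes the mean silhouette value of one clustering from a distance matrix. It uses the standard silhouette definition. The function name, the toy 1-D data, and the choice to give singleton clusters a value of 0 are assumptions for illustration, and at least two clusters are required.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette value of a hard clustering, given a full distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: a singleton contributes 0
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Toy 1-D data with two obvious groups (illustrative only).
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in xs] for p in xs]
good = mean_silhouette(dist, [0, 0, 1, 1])  # natural split
bad = mean_silhouette(dist, [0, 0, 0, 1])   # poor split
```

Repeating this for each candidate number of clusters and plotting the means gives a curve whose peak suggests a reasonable cut of the dendrogram.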

Proposed Algorithm – Hybrid Clustering
Cluster Input: Distances
Combining text mining and bibliometrics
◦ Integrate text & citation info early in the mapping process, before applying the clustering algorithm
Weighted linear combination
Fisher’s inverse chi-square method

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.
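The two integration schemes named on this slide can be sketched in a few lines. The weighted linear combination below takes a hypothetical weight `alpha`. The Fisher function is the textbook, unweighted inverse chi-square method for combining p-values, and how the paper turns text and citation distances into p-values is not shown on the slides, so that step is left out here.

```python
import math

def weighted_linear(d_text, d_cite, alpha=0.5):
    """Weighted linear combination of a text-based and a citation-based distance.

    alpha is a hypothetical weight in [0, 1] given to the text-based distance.
    """
    return alpha * d_text + (1.0 - alpha) * d_cite

def fisher_combine(p_values):
    """Fisher's inverse chi-square method for k independent p-values (each > 0).

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with 2k
    degrees of freedom under the null hypothesis. For even degrees of freedom
    the upper tail has the closed form used below.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

combined_distance = weighted_linear(0.2, 0.6, alpha=0.25)
combined_p = fisher_combine([0.05, 0.05])
```

With a single p-value the Fisher function returns that p-value unchanged, which is a quick sanity check on the closed form.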

Proposed Algorithm – Dynamic Hybrid Clustering
Goal: Match & track clusters through time
Process:
◦ Separate hybrid clustering for each period
◦ Determine optimal number of clusters: dendrogram, silhouette curve, Ben-Hur stability plot
◦ Construct complete graph: all cluster centroids from each period as nodes; edge weights as mutual cosine similarities in LSS
◦ Form cluster chains: keep edge weights > threshold T1; allow qualifying clusters to join > threshold T2

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
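One plausible reading of the chain-forming step is sketched below: keep only edges whose weight exceeds T1, then treat each connected group of period clusters as a chain. Using connected components is an assumption, not something the slides state. The second threshold T2 is omitted because the slides do not spell out its rule, and the node labels and similarity values are made up.

```python
def cluster_chains(nodes, similarity, t1):
    """Group period clusters into chains by thresholding centroid similarities.

    nodes:      labels of the per-period clusters
    similarity: {(node_a, node_b): cosine similarity between their centroids}
    t1:         edges with weight <= t1 are discarded
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), weight in similarity.items():
        if weight > t1:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    # A chain needs at least two linked clusters.
    return [sorted(g) for g in groups.values() if len(g) > 1]

# Hypothetical clusters from three periods (P1..P3) and made-up similarities.
nodes = ["P1:c1", "P2:c1", "P3:c1", "P1:c2", "P2:c2"]
similarity = {
    ("P1:c1", "P2:c1"): 0.90,
    ("P2:c1", "P3:c1"): 0.85,
    ("P1:c2", "P2:c2"): 0.40,
    ("P1:c1", "P2:c2"): 0.10,
}
chains = cluster_chains(nodes, similarity, t1=0.8)
```

In this toy graph only the first cluster is tracked across all three periods, because the other edges fall below the threshold.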

Results – Hybrid Clustering: Silhouette Curve

Results – Hybrid Clustering: Stability

Results – Hybrid Clustering: Dendrogram


Results – Hybrid Clustering: Cluster Characterization
Cluster (number of papers)
◦ RNA structure prediction: 205
◦ Protein structure prediction: 1167
◦ Systems biology & molecular networks: 694
◦ Phylogeny & evolution: 749
◦ Genome sequencing & assembly: 640
◦ Gene / promoter / motif prediction: 995
◦ Molecular DBs & annotation platforms: 1091
◦ Multiple sequence alignment: 713
◦ Microarray analysis: 1147

Results – Dynamic Clustering: Histogram

Results – Dynamic Clustering: Cluster Chains

Yearly Publication Output among Cluster Chains

Dynamic Term Network


Pros & Cons
Pros
◦ Offers fresh perspective on clustering
◦ Integrates various techniques
◦ Provides insight into bioinformatics
Cons
◦ Challenge of selecting the optimal number of clusters still exists
◦ There are many steps required to implement their approach

Questions

References
Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233
ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch
PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/
The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/
Fisher’s Method: http://en.wikipedia.org/wiki/Fisher%27s_method
“Data Mining - Concepts and techniques” by Han and Kamber, Morgan Kaufmann, 2006. (ISBN:1-55860-901-6)