A Clustering Method Based on Nonnegative Matrix Factorization
for Text Mining
Farial Shahnaz
Topics
• Introduction
• Algorithm
• Performance
• Observation
• Conclusion and Future Work
Introduction
Basic Concepts
• Text Mining : Detection of trends or patterns in text data
• Clustering : Grouping or classifying documents based on similarity of content
Clustering
• Manual Vs Automated
• Supervised Vs Unsupervised
• Hierarchical Vs Partitional
Clustering
• Objective: Automated Unsupervised Partitional Clustering of Text Data or Documents
• Method : Nonnegative Matrix Factorization or NMF
Vector Space Model of Text Data
• Documents represented as n-dimensional vectors
  – n : number of terms in the dictionary
  – vector component : importance of term
• Document collection represented as term-by-document matrix
Term-by-Document Matrix
• Terms in the dictionary, n : 9 (a, brown, dog, fox, jumped, lazy, over, quick, the)
• Document 1 : a quick brown fox
• Document 2 : jumped over the lazy dog
Term-by-Document Matrix
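As a minimal sketch, the term-by-document matrix for the two example documents can be built as follows. It assumes raw term counts as the measure of "importance"; the slides do not say which weighting was used, and the helper name `term_by_document_matrix` is hypothetical.

```python
import numpy as np

# Dictionary and documents taken from the example above.
terms = ["a", "brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"]
documents = [
    "a quick brown fox",
    "jumped over the lazy dog",
]


def term_by_document_matrix(terms, documents):
    """Return a (terms x documents) matrix of raw term counts.

    Raw counts are an assumption; any nonnegative weighting would fit here.
    """
    index = {term: row for row, term in enumerate(terms)}
    matrix = np.zeros((len(terms), len(documents)))
    for col, text in enumerate(documents):
        for word in text.lower().split():
            if word in index:  # words outside the dictionary are ignored
                matrix[index[word], col] += 1
    return matrix


V = term_by_document_matrix(terms, documents)
print(V)  # one row per term, one column per document
```

Each column is one document vector, and every entry is nonnegative, which is the property NMF relies on.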
Clustering Method : NMF
• Low rank approximation of large sparse matrices
• Preserves data nonnegativity
• Introduces the concept of parts-based representation (by Lee and Seung in Nature, 1999)
Other Methods
• Other rank reduction methods :
  – Principal Component Analysis (PCA)
  – Vector Quantization (VQ)
• Can produce basis vectors with negative entries (PCA in particular)
• Additive and Subtractive combinations of basis vectors yield original document vectors
NMF
• Produces nonnegative basis vectors
• Additive combinations of basis vectors yield original document vectors
Term-by-Document Matrix (all entries nonnegative)
NMF
• Basis vectors interpreted as semantic features or topics
• Documents clustered on the basis of shared features
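One simple way to turn shared features into clusters is sketched below. It assumes a coefficient matrix `H` whose column j holds the weight of each topic in document j, and it assigns each document to its highest-weighted topic. The argmax rule and the example values are illustrative assumptions, not something stated on the slides.

```python
import numpy as np


def assign_clusters(H):
    """Assign document j to the topic with the largest weight in column j of H."""
    return np.argmax(H, axis=0)


# Hypothetical 2-topic, 4-document coefficient matrix, for illustration only.
H_example = np.array([
    [0.9, 0.1, 0.7, 0.0],
    [0.2, 0.8, 0.1, 0.5],
])
labels = assign_clusters(H_example)
print(labels)  # one cluster index per document
```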
NMF
• Demonstrated by Xu et al. (2003):
  – Outperforms Singular Value Decomposition (SVD)
  – Comparable to Graph Partitioning methods
Algorithm
NMF : Definition
Given
• S : Document collection
• V (m × n) : term-by-document matrix
• m : Number of terms in the dictionary
• n : Number of documents in S
NMF : Definition
NMF is defined as:
• Low rank approximation of V (m × n) in terms of some metric
• Factor V into the product WH
  – W (m × k) : Contains basis vectors
  – H (k × n) : Contains the coefficients of the linear combinations
  – k : Selected number of topics or basis vectors, k << min(m, n)
NMF : Common Approach
• Minimize objective function:
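The objective itself is not shown here. A common choice, and the one suggested by the || V - WH || notation used for the multiplicative method, is the squared Frobenius norm of the residual, with W and H constrained to be nonnegative. The sketch below assumes that choice; the function name `nmf_objective` is hypothetical.

```python
import numpy as np


def nmf_objective(V, W, H):
    """Squared Frobenius norm of the residual V - WH (assumed objective)."""
    return np.linalg.norm(V - W @ H, ord="fro") ** 2


# Small illustrative matrices, not data from the slides.
V_example = np.array([[1.0, 2.0], [3.0, 4.0]])
# W = I, H = V reproduces V exactly, so the residual vanishes.
exact = nmf_objective(V_example, np.eye(2), V_example)
# All-zero factors leave the whole of V as residual.
empty = nmf_objective(V_example, np.zeros((2, 1)), np.zeros((1, 2)))
print(exact, empty)
```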
NMF : Existing Methods
Multiplicative Method (MM) [ by Lee and Seung ]
• Based on Multiplicative update rules
• || V - WH || is monotonically non-increasing under the updates, and constant if and only if W and H are at a stationary point
• Version of Gradient Descent (GD) optimization scheme
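The multiplicative update rules are not written out above. The sketch below uses the widely cited Lee and Seung updates for the Frobenius-norm objective. The small constant `eps`, added to avoid division by zero, and the random test matrices are assumptions made for illustration.

```python
import numpy as np


def multiplicative_update(V, W, H, eps=1e-9):
    """One round of Lee-Seung multiplicative updates for ||V - WH||.

    Elementwise multiplication by nonnegative ratios keeps W and H nonnegative.
    """
    H = H * (W.T @ V) / (W.T @ W @ H + eps)
    W = W * (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical random nonnegative data: 6 terms, 5 documents, k = 2 topics.
rng = np.random.default_rng(0)
V = rng.random((6, 5))
W = rng.random((6, 2))
H = rng.random((2, 5))

errors = []
for _ in range(50):
    W, H = multiplicative_update(V, W, H)
    errors.append(np.linalg.norm(V - W @ H))

print(errors[0], errors[-1])  # the residual norm should not increase
```

Recording the residual after each round gives a direct check of the non-increasing property stated above.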
NMF : Existing Methods
Sparse Encoding [ by Hoyer ]
• Based on study of neural networks
• Enforces statistical sparsity of H
  – Minimizes sum of non-zeros in H
NMF : Existing Methods
Sparse Encoding [ by Mu, Plemmons and Santago ]
• Similar to Hoyer’s method
• Enforces statistical sparsity of H using a regularization parameter
  – Minimizes number of non-zeros in H
NMF : Proposed Algorithm
Hybrid Method:
• W approximated using Multiplicative Method
• H calculated using a Constrained Least Squares (CLS) model as the metric
  – Penalizes the number of non-zeros
  – Similar to the method by Mu, Plemmons and Santago
• Called GD-CLS
GD-CLS
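The hybrid scheme can be sketched as below, under several assumptions that the slides do not confirm: the W step is the Lee and Seung multiplicative update, the H step solves a ridge-regularized least squares problem with parameter `lam` (standing in for λ) for all columns at once, and negative entries of H are then set to zero. Random initialization, the fixed iteration count, and the name `gd_cls` are also illustrative choices. Column scaling of W and convergence tests are left out.

```python
import numpy as np


def gd_cls(V, k, lam=0.1, iterations=100, eps=1e-9, seed=0):
    """Sketch of a hybrid NMF: multiplicative W step, regularized LS H step."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iterations):
        # W step: multiplicative update keeps W nonnegative.
        W = W * (V @ H.T) / (W @ H @ H.T + eps)
        # H step: solve (W^T W + lam * I) H = W^T V for every column of H.
        H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ V)
        # Enforce nonnegativity by zeroing negative entries (assumed rule).
        H = np.maximum(H, 0.0)
    return W, H


# Hypothetical random nonnegative matrix standing in for a term-by-document matrix.
V_demo = np.random.default_rng(1).random((8, 6))
W_demo, H_demo = gd_cls(V_demo, k=3, lam=0.1)
print(np.linalg.norm(V_demo - W_demo @ H_demo))
```

Because `lam` is positive, the matrix in the H step is always invertible, so the linear solve does not fail even if a column of W shrinks to zero.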
Performance
Text Collections Used
• Two benchmark topic detection text collections:
  – Reuters : Collection of documents on assorted topics
  – TDT2 : Transcripts from news media
Text Collections Used
Accuracy Metric
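• Defined by: AC = (1/n) Σ δ(d_i), summed over i = 1, ..., n, where n is the number of documents
• d_i : Document number i
• δ(d_i) = 1 if the topic labels match
• δ(d_i) = 0 otherwise
• Parameter values tested: k = 2, 4, 6, 8, 10, 15, 20; λ = 0.1, 0.01, 0.001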
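The accuracy measure, the fraction of documents whose assigned topic matches the reference topic, can be sketched as below. It assumes cluster labels have already been mapped to topic labels. The function name and the example labels are hypothetical.

```python
def clustering_accuracy(true_topics, assigned_topics):
    """Fraction of documents whose assigned topic equals the reference topic."""
    if len(true_topics) != len(assigned_topics):
        raise ValueError("label lists must have the same length")
    if not true_topics:
        raise ValueError("at least one document is required")
    # delta(d_i) is 1 when the labels match and 0 otherwise.
    matches = sum(1 for t, a in zip(true_topics, assigned_topics) if t == a)
    return matches / len(true_topics)


# Hypothetical labels for four documents; three of the four match.
reference = ["earn", "earn", "cocoa", "interest"]
assigned = ["earn", "cocoa", "cocoa", "interest"]
ac = clustering_accuracy(reference, assigned)
print(ac)
```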
[Figures: Results for Reuters; Results for TDT2]
Observations
Observations : AC
• AC decreases as k increases
• Nature of the collection affects AC
  – Reuters : earn, interest, cocoa
  – TDT2 : Asian economic crisis, Oprah lawsuit
Observations : λ parameter
• AC declines as λ increases (mostly effective for homogeneous text collections)
• CPU time declines as λ increases
Observations : Cluster size
• Imbalance in cluster sizes has adverse effect
Conclusion & Future Work
GD-CLS can be used to effectively cluster text data. Further development involves:
• Smart updating
• Use in Bioinformatics
• Develop user-interface
• Convert to C++