Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan. Advisor: Dr. Dianne O'Leary.
Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O’Leary
Querying a Document Database
We want to return documents that are relevant to the entered search terms.
Given data:
• Term-Document Matrix, A
  • Entry (i, j): importance of term i in document j
• Query Vector, q
  • Entry (i): importance of term i in the query
Solutions
Literal Term Matching
• Compute score vector: s = qᵀA
• Return the highest scoring documents
• May not return relevant documents that do not contain the exact query terms
Latent Semantic Indexing (LSI)
• Same process as above, but use an approximation to A
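The literal-matching score step can be sketched in NumPy; the term weights and the query below are made-up toy values:

```python
import numpy as np

# Hypothetical toy data: 5 terms x 4 documents (weights are made up).
A = np.array([[2.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.0],
              [0.0, 2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])

# Query vector: importance of each term in the query.
q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])

# Literal term matching: score each document, return the highest scorers.
s = q.T @ A                      # s[j] = score of document j
ranking = np.argsort(s)[::-1]    # document indices by decreasing score
```

For LSI, the same scoring would be applied with A replaced by a low-rank approximation.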
Term-Document Matrix Approximation
Standard approximation used in LSI: rank-k SVD
Project goal: evaluate the use of term-document matrix approximations other than the rank-k SVD in LSI
• Nonnegative Matrix Factorization (NMF)
• CUR Decomposition
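A minimal sketch of the rank-k SVD approximation used in LSI (the matrix is a random toy example and k is chosen arbitrarily):

```python
import numpy as np

# Rank-k SVD approximation A_k = U_k diag(s_k) Vt_k on a toy matrix.
rng = np.random.default_rng(0)
A = rng.random((50, 30))
k = 10

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted descending
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, A_k is the best rank-k approximation
# of A in the Frobenius norm.
rel_err = np.linalg.norm(A - A_k, "fro") / np.linalg.norm(A, "fro")
```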
Matrix Approximation Validation
Let Ã be an approximation to A. As the rank of Ã increases, we expect the relative error, ||A − Ã||_F / ||A||_F, to go to zero.
Matrix approximation can be applied to any matrix A.
• Preliminary test matrix A: 50 x 30 random sparse matrix
• Future test matrices: three large sparse term-document matrices
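The validation idea above can be sketched for the rank-k SVD: the relative error decreases with k and reaches zero at full rank (random toy matrix, ranks chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
normA = np.linalg.norm(A, "fro")

# Relative error ||A - A_k||_F / ||A||_F for increasing rank k.
errs = []
for k in (5, 10, 20, 30):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    errs.append(np.linalg.norm(A - A_k, "fro") / normA)
```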
Nonnegative Matrix Factorization (NMF)
Term-document matrix is nonnegative.
A ≈ W H, where A is m x n, W is m x k, and H is k x n
• W and H are nonnegative
• rank(WH) ≤ k
NMF
Multiplicative update algorithm of Lee and Seung, found in [1]
• Find W, H to minimize (1/2)||A − WH||_F²
• Random initialization for W, H
• Convergence is not guaranteed, but in practice it is very common
• Slow due to matrix multiplications in each iteration
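The multiplicative updates can be sketched as follows; the fixed iteration count and the small epsilon guarding against division by zero are our choices for the sketch, not taken from [1]:

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimizing ||A - WH||_F^2.

    Sketch only: random initialization, fixed iteration count,
    no convergence test.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H, stays nonnegative
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W, stays nonnegative
    return W, H

A = np.random.default_rng(2).random((50, 30))   # nonnegative toy matrix
W, H = nmf(A, k=10)
rel_err = np.linalg.norm(A - W @ H, "fro") / np.linalg.norm(A, "fro")
```

The updates only multiply by nonnegative factors, so W and H remain nonnegative throughout.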
NMF Validation
A: 50 x 30 random sparse matrix. Average over 5 runs.
[Two plots vs. k: NMF Validation: Relative Error and NMF Validation: Run Time]
CUR Decomposition
Term-document matrix is sparse.
A ≈ C U R, where A is m x n, C is m x c, U is c x r, and R is r x n
• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) ≤ k, where k is a rank parameter
CUR Implementations
CUR algorithm in [2] by Drineas, Kannan, and Mahoney
• Linear time algorithm
• Modification: use ideas in [3] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
• Improvement: Compact Matrix Decomposition (CMD) in [5] by Sun, Xie, Zhang, and Faloutsos
• Other modifications: our ideas
Deterministic CUR code by G. W. Stewart
Sampling
Column (row) norm sampling [2]
• Prob(col j) = ||A(:, j)||² / ||A||_F² (similar for row i)
Subspace sampling [3]
• Uses rank-k SVD of A for column probabilities
• Prob(col j) = ||V_{A,k}(j, :)||² / k
• Uses "economy size" SVD of C for row probabilities
• Prob(row i) = ||U_C(i, :)||² / c
Sampling without replacement
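Column norm sampling can be sketched as follows; the rescaling of each sampled column by 1/sqrt(c·p_j) follows the standard Monte Carlo convention in [2], and the toy matrix is ours:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((50, 30))
c = 10   # number of columns to sample

# Prob(col j) proportional to the squared column norm.
p = np.sum(A**2, axis=0) / np.linalg.norm(A, "fro")**2

# Sample c column indices with replacement and rescale by 1/sqrt(c * p_j).
cols = rng.choice(A.shape[1], size=c, replace=True, p=p)
C = A[:, cols] / np.sqrt(c * p[cols])
```

The analogous construction on rows of A yields R.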
Sampling Comparison
A: 50 x 30 random sparse matrix. Average over 5 runs.
Legend: Sampling, U, Scaling (scaling listed only for sampling without replacement)
[Two plots vs. k: CUR Validation: Relative Error and CUR Validation: Run Time; curves: CN,L; S,L; w/o R,L,w/o Sc; w/o R,L,Sc]
Computation of U
Linear algorithm U: approximately solves min over Û of ||A − C Û||_F, where Û = U R [2]
Optimal U: solves min over U of ||A − C U R||_F²
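For fixed C and R, the optimal U is the standard least-squares solution U = C⁺ A R⁺ (pseudoinverses). A sketch on toy data; the uniform column/row selection here is only for illustration, not one of the sampling schemes above:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((50, 30))

# Illustrative C and R: uniformly chosen columns and rows of A.
C = A[:, rng.choice(30, size=10, replace=False)]   # 50 x 10
R = A[rng.choice(50, size=10, replace=False), :]   # 10 x 30

# Optimal U minimizes ||A - C U R||_F for the given C and R.
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)      # 10 x 10
rel_err = np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro")
```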
U Comparison
A: 50 x 30 random sparse matrix. Average over 5 runs. Legend: Sampling, U
[Two plots vs. k: CUR Validation: Relative Error and CUR Validation: Run Time; curves: CN,L; CN,O; S,L; S,O]
Compact Matrix Decomposition (CMD) Improvement
Remove repeated columns (rows) in C (R). Decreases storage while still achieving the same relative error [5].

Algorithm      | [2]      | [2] with CMD
Runtime        | 0.008060 | 0.007153
Storage        | 880.5    | 550.5
Relative Error | 0.820035 | 0.820035

A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.
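The deduplication can be sketched as follows: keep each distinct sampled column once, scaled by the square root of its multiplicity, which preserves C Cᵀ exactly while shrinking storage (a sketch of the CMD idea in [5]; the uniform sampling here is our simplification):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((50, 30))

# Sampling with replacement typically repeats some column indices.
cols = rng.choice(30, size=15, replace=True)
C = A[:, cols]                               # may contain duplicate columns

# CMD-style compaction: one copy of each distinct column,
# scaled by sqrt of how many times it was sampled.
uniq, counts = np.unique(cols, return_counts=True)
C_cmd = A[:, uniq] * np.sqrt(counts)         # fewer columns, same C @ C.T
```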
Deterministic CUR
Code by G. W. Stewart
Uses an RRQR algorithm that does not store Q
• We only need the permutation vector
• It gives us the columns (rows) for C (R)
Uses optimal U
CUR Comparison
A: 50 x 30 random sparse matrix. Average over 5 runs.
Legend: Sampling, U, Scaling (scaling listed only for sampling without replacement)
[Two plots vs. k: CUR Validation: Relative Error and CUR Validation: Run Time; curves: CN,L; CN,O; S,L; S,O; w/o R,L,w/o Sc; w/o R,L,Sc; D]
Future Project Goals
Finish investigation of the CUR improvement
Validate NMF and CUR using term-document matrices
Investigate storage, computation time, and relative error of NMF and CUR
Test performance of NMF and CUR in LSI
• Use average precision and recall, where the average is taken over all queries in the data set
Precision and Recall
Measurements of performance for document retrieval.
Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, and RetRel = number of retrieved documents that are relevant.
Precision: P(Retrieved) = RetRel / Retrieved
Recall: R(Retrieved) = RetRel / Relevant
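The two measures can be computed directly from their definitions; the document IDs below are made up:

```python
# Precision and recall for a single query (toy example).
retrieved = {1, 3, 5, 7}          # documents returned for the query
relevant = {1, 2, 3}              # all documents relevant to the query
ret_rel = retrieved & relevant    # retrieved AND relevant

precision = len(ret_rel) / len(retrieved)   # fraction of retrieved that are relevant
recall = len(ret_rel) / len(relevant)       # fraction of relevant that were retrieved
```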
Further Topics
Time-permitting investigations
• Parallel implementations of matrix approximations
• Testing performance of matrix approximations in forming a multidocument summary
References
[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
[2] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
[3] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.
[4] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
[5] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.