Information Retrieval in Text, Part III
Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, 1999.
Reading Assignment: Chapter 4.
Outline
Matrix Decompositions
QR Factorization
Singular Value Decomposition
Updating Techniques
Matrix Decomposition
To produce a reduced-rank approximation of the m × n term-by-document matrix A, one must identify the dependence between the columns or rows of A.
For a rank-k matrix, the k basis vectors of its column space serve in place of its n column vectors to represent its column space.
QR Factorization
The QR factorization of the matrix A is defined as
$$A = QR$$
where Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix.
A square matrix is orthogonal if its columns are orthonormal, i.e., if $q_j$ denotes a column of the orthogonal matrix Q, then $q_j$ has unit Euclidean norm ($\|q_j\|_2 = 1$ for j = 1, 2, …, m) and is orthogonal to all other columns of Q ($q_j^T q_i = 0$ for all i ≠ j).
The rows of Q are also orthonormal, i.e., $Q^T Q = Q Q^T = I$.
Such a factorization exists for any matrix A.
There are many ways to compute the factorization.
QR Factorization
Given A = QR, the columns of A are all linear combinations of the columns of Q.
Thus, a subset of k of the columns of Q forms a basis for the column space of A, where k = rank(A).
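The properties above can be checked numerically. The following is a minimal sketch using NumPy; the 4 × 3 matrix is hypothetical and chosen only for illustration (it is not the example from the slides). It has full column rank, so no column pivoting is needed here.

```python
import numpy as np

# Hypothetical 4 x 3 matrix, chosen only for illustration (not from the slides).
A = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])
m, n = A.shape

# mode="complete" returns the full m x m orthogonal Q and the m x n R.
Q, R = np.linalg.qr(A, mode="complete")

k = np.linalg.matrix_rank(A)   # k = rank(A), here 3
Q1 = Q[:, :k]                  # first k columns of Q
R1 = R[:k, :]                  # top k rows of R (the rows below are zero)

print("Q^T Q = I :", np.allclose(Q.T @ Q, np.eye(m)))
print("Q Q^T = I :", np.allclose(Q @ Q.T, np.eye(m)))
print("A = Q R   :", np.allclose(Q @ R, A))
print("A = Q1 R1 :", np.allclose(Q1 @ R1, A))
```

The last check confirms that the first k columns of Q already span the column space of A in this example.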
QR Factorization: Example
The QR factorization of the previous example can be represented as
$$A = QR = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix} \begin{pmatrix} R_1 \\ 0 \end{pmatrix} = Q_1 R_1$$
where $R_1$ denotes the nonzero upper block of R.
Note that the first 7 columns of Q, $Q_1$, are orthonormal and hence constitute a basis for the column space of A.
The bottom zero submatrix of R is not always generated automatically by the QR factorization; column pivoting may be needed to guarantee it.
$Q_2$ does not contribute to any nonzero value in A.
QR Factorization
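One motivation for using the QR factorization is that the basis vectors can be used to describe the semantic content of the corresponding text collection.
The cosines of the angles $\theta_j$ between a query vector q and the document vectors $a_j$ are
$$\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \|q\|_2} = \frac{(Q_1 r_j)^T q}{\|Q_1 r_j\|_2 \|q\|_2} = \frac{r_j^T (Q_1^T q)}{\|r_j\|_2 \|q\|_2}, \quad j = 1, 2, \ldots, n$$
where $r_j$ is column j of the nonzero upper block of R, so that $a_j = Q_1 r_j$.
Note that for the query “Child Proofing” it gives exactly the same cosines. Why?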
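A small numerical check may help with the question above. The sketch below compares cosines computed directly from A with cosines computed from the QR factors. The 9 × 7 matrix, the term order, and the query indices are assumptions: the slide showing the matrix is not part of this transcript, and the entries were chosen so that the results agree with the "A" column of the “Child Proofing” table.

```python
import numpy as np

# Assumed 9-term x 7-document incidence matrix (rows = terms, columns = documents).
# Assumed term order: Baby, Child, Guide, Health, Home, Infant, Proofing, Safety, Toddler.
A_raw = np.array([
    [0, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
], dtype=float)
A = A_raw / np.linalg.norm(A_raw, axis=0)   # scale each document column to unit length

# Query "Child Proofing": assumed to be terms 2 and 7 (zero-based indices 1 and 6).
q = np.zeros(9)
q[[1, 6]] = 1.0

# Reduced QR: Q is 9 x 7 with orthonormal columns, R is 7 x 7 upper triangular.
Q, R = np.linalg.qr(A)

# Cosines from the columns a_j of A.
cos_direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Cosines from the columns r_j of R and the projected query Q^T q.
cos_qr = (R.T @ (Q.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

print(np.round(cos_direct, 3))
print(np.round(cos_qr, 3))
```

The two results agree because multiplying by a matrix with orthonormal columns preserves inner products and Euclidean norms, so $a_j^T q = r_j^T (Q^T q)$ and $\|a_j\|_2 = \|r_j\|_2$.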
Frobenius Matrix Norm
Definition: The Frobenius matrix norm $\|\cdot\|_F$ of an m × n matrix $B = [b_{ij}]$ is defined by
$$\|B\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} b_{ij}^2}$$
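As a quick illustration of the definition, the sketch below computes the Frobenius norm of a hypothetical 2 × 2 matrix (not taken from the slides) both from the double sum and with NumPy's built-in routine.

```python
import numpy as np

# Hypothetical 2 x 2 matrix, for illustration only.
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Square root of the sum of squared entries, as in the definition.
fro_by_definition = np.sqrt(np.sum(B ** 2))

# NumPy's built-in Frobenius norm.
fro_numpy = np.linalg.norm(B, "fro")

print(fro_by_definition, fro_numpy)
```

Both values equal the square root of 1 + 4 + 9 + 16 = 30.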
Low Rank Approximation for QR Factorization
Initially, the rank of A is not known. However, after performing the QR factorization, its rank is obviously the rank of _______
With column pivoting, we know that there exists a permutation matrix P such that AP = QR, where the larger entries of R are moved to the upper left corner.
Such an arrangement, if possible, partitions R so that the smallest entries are isolated in the bottom submatrix.
Low Rank Approximation for QR Factorization
Computing $\|R_{22}\|_F / \|R\|_F$
Redefining $R_{22}$ to be the 4 × 2 zero matrix, the modified upper triangular matrix $\tilde{R}$ has rank 5 rather than 7.
Hence, the matrix $A + E = Q\tilde{R}$ has rank ____
Show that $\|E\|_F = \|R_{22}\|_F$.
Show that $\|E\|_F / \|A\|_F = \|R_{22}\|_F / \|R\|_F = 0.3237$.
Therefore, the relative change in R, 32.37%, yields the same relative change in A.
With r = 4, the relative change is 76%.
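The two "show that" identities can be checked numerically. The sketch below is a minimal illustration under stated assumptions: it uses a hypothetical random 9 × 7 matrix (not the slides' A, so the ratio will not be 0.3237) and NumPy's QR routine, which does not perform column pivoting. SciPy's `scipy.linalg.qr` accepts `pivoting=True` if pivoting is wanted.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((9, 7))                     # hypothetical 9 x 7 matrix, not the slides' A

Q, R = np.linalg.qr(A, mode="complete")    # Q: 9 x 9 orthogonal, R: 9 x 7 upper triangular

r = 5                                      # target rank
R22 = R[r:, r:].copy()                     # trailing 4 x 2 block of R

R_tilde = R.copy()
R_tilde[r:, r:] = 0.0                      # redefine R22 as the 4 x 2 zero matrix

A_tilde = Q @ R_tilde                      # A + E = Q R_tilde
E = A_tilde - A

rel_change_R = np.linalg.norm(R22, "fro") / np.linalg.norm(R, "fro")
rel_change_A = np.linalg.norm(E, "fro") / np.linalg.norm(A, "fro")

print("rank(A + E)      :", np.linalg.matrix_rank(A_tilde))
print("||E||_F          :", np.linalg.norm(E, "fro"))
print("||R22||_F        :", np.linalg.norm(R22, "fro"))
print("relative change R:", rel_change_R)
print("relative change A:", rel_change_A)
```

The identities hold because $E = Q(\tilde{R} - R)$ and the Frobenius norm is unchanged by multiplication with an orthogonal matrix, so $\|E\|_F = \|\tilde{R} - R\|_F = \|R_{22}\|_F$ and $\|A\|_F = \|R\|_F$.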
Low Rank Approximation for QR Factorization: Example
Comparing Cosine Similarities for the Query: “Child Proofing”
Doc A r=5 r=4
2 0.408 0.408 0.408
3 0.408 0.5 0.309
4 0 0 0.184
5 0.5 0.5 0.5
6 0.5 0.5 0.5
Comparing Cosine Similarities for the Query: “Child Home Safety”
Doc A r=5 r=4
2 0.667 0.667 0.667
3 1 0.816 0.756
4 0.258 0 0.1
5 0 0 0
6 0 0 0
7 0 0 0.356
Singular Value Decomposition
While the QR factorization provides a reduced-rank basis for the column space, no information is provided about the row space of A.
The SVD can provide a reduced-rank approximation for both spaces, and a rank-k approximation to A of minimal change for any value of k.
Singular Value Decomposition
$A = U \Sigma V^T$, where
U: m × m orthogonal matrix whose columns define the left singular vectors of A
V: n × n orthogonal matrix whose columns define the right singular vectors of A
Σ: m × n diagonal matrix containing the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min\{m,n\}}$
Such a factorization exists for any matrix A.
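The component matrices can be inspected with NumPy. The following is a minimal sketch on a hypothetical random 6 × 4 matrix (an assumption for illustration, not data from the slides).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))                     # hypothetical 6 x 4 matrix
m, n = A.shape

# full_matrices=True returns U as m x m and V^T as n x n;
# s holds the min(m, n) singular values in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the m x n diagonal matrix Sigma from the singular values.
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

print("U^T U = I        :", np.allclose(U.T @ U, np.eye(m)))
print("V^T V = I        :", np.allclose(Vt @ Vt.T, np.eye(n)))
print("A = U Sigma V^T  :", np.allclose(U @ Sigma @ Vt, A))
print("singular values  :", np.round(s, 4))
print("rank(A)          :", np.linalg.matrix_rank(A))
```

The number of nonzero singular values equals the rank of A, which is one way to relate the rank of A to the factors.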
Component Matrices of the SVD
SVD vs. QR
What is the relationship between the rank of A and the ranks of the matrices in both factorizations?
In QR, the first $r_A$ columns of Q form a basis for the column space, and so do the first $r_A$ columns of U, where $r_A$ = rank(A).
The first $r_A$ rows of $V^T$ form a basis for the row space of A.
The rank-k approximation in the SVD is obtained by setting all but the k largest singular values in Σ to zero.
SVD
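Theorem: The rank-k approximation from the SVD is the closest rank-k approximation to A.
Proven by Eckart and Young.
It shows that the error in approximating A by $A_k$ is given by
$$\|A - A_k\|_F = \min_{\operatorname{rank}(B) \le k} \|A - B\|_F = \sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_{r_A}^2}$$
where $A_k = U_k \Sigma_k V_k^T$ and $r_A$ = rank(A).
Hence, the error in approximating the original matrix is determined by the discarded singular values ($\sigma_{k+1}, \sigma_{k+2}, \ldots, \sigma_{r_A}$).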
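The error formula can be verified numerically. The sketch below builds rank-k approximations of a hypothetical random 9 × 7 matrix (an assumption for illustration; the percentages it prints are not those of the slides' example) and compares the measured Frobenius error with the value predicted from the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((9, 7))                     # hypothetical 9 x 7 matrix

# Reduced SVD: U is 9 x 7, s has 7 entries, Vt is 7 x 7.
U, s, Vt = np.linalg.svd(A, full_matrices=False)


def rank_k_approx(k):
    """Return A_k = U_k Sigma_k V_k^T, keeping the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


norm_A = np.linalg.norm(A, "fro")
errors, predicted = [], []
for k in range(1, len(s) + 1):
    err = np.linalg.norm(A - rank_k_approx(k), "fro")   # measured error
    pred = np.sqrt(np.sum(s[k:] ** 2))                  # from discarded singular values
    errors.append(err)
    predicted.append(pred)
    print(f"rank-{k}: error = {err:.4f}, predicted = {pred:.4f}, "
          f"relative change = {100 * err / norm_A:.2f}%")
```

The relative change column is the same quantity as the "% Change" figures quoted for the slides' example, computed here for different data.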
SVD: Example
$\|A - A_6\|_F$ = ……
Hence, the relative change in the matrix A is …
Therefore, a rank-5 approximation may be appropriate in our case.
Determining the best rank approximation for any database depends on empirical testing.
For very large databases, the number could be between 100 and 300.
Computational feasibility, rather than accuracy, determines the rank reduction.
k-rank approximation % Change
Rank-6 7.4%
Rank-5 22.67%
Rank-4 32.49%
Rank-3 56.45%
Low Rank Approximations
Visual comparison of rank-reduced approximations to A can be misleading.
Check the rank-4 QR approximation vs. the more accurate rank-4 SVD approximation.
The rank-4 SVD approximation shows associations made with terms not originally in the document title, e.g., Term 4 (Health) and Term 8 (Safety) in Document 1 (Infant & Toddler First Aid).
Query Matching
The query vector q is compared with the columns of the reduced-rank matrix $A_k$.
Let $e_j$ denote the jth canonical vector (the jth column of $I_n$). Then $A_k e_j$ represents _______________
It is easy to show that
$$\cos\theta_j = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \|q\|_2}, \quad j = 1, 2, \ldots, n$$
where $s_j = \Sigma_k V_k^T e_j$.
Query Matching
An alternate formula for the cosine computation is
$$\cos\hat\theta_j = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \|U_k^T q\|_2}, \quad j = 1, 2, \ldots, n$$
Note that $\cos\hat\theta_j \ge \cos\theta_j$ for j = 1, 2, …, n, which means that the number of retrieved documents using this query matching technique is larger.
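Both cosine formulas can be sketched in a few lines of NumPy. The example below is an illustration under stated assumptions: the 9 × 7 matrix is random, and the query vector and the choice k = 4 are hypothetical, none of them taken from the slides. It also checks the first formula against cosines computed directly from the columns of the rank-k matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((9, 7))                     # hypothetical 9 x 7 matrix
A /= np.linalg.norm(A, axis=0)             # unit-length document columns

q = np.zeros(9)
q[[1, 6]] = 1.0                            # hypothetical two-term query

k = 4                                      # hypothetical reduced rank
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]
Ak = Uk @ Sk @ Vkt                         # rank-k approximation of A

S = Sk @ Vkt                               # column j is s_j = Sigma_k V_k^T e_j
Ukq = Uk.T @ q                             # projected query U_k^T q (length k)
s_norms = np.linalg.norm(S, axis=0)        # ||s_j||_2 for every j

# First formula: normalize by ||q||_2.
cos_theta = (S.T @ Ukq) / (s_norms * np.linalg.norm(q))

# Alternate formula: normalize by ||U_k^T q||_2 instead.
cos_theta_hat = (S.T @ Ukq) / (s_norms * np.linalg.norm(Ukq))

# Reference: cosines computed directly from the columns of A_k.
cos_direct = (Ak.T @ q) / (np.linalg.norm(Ak, axis=0) * np.linalg.norm(q))

print(np.round(cos_theta, 3))
print(np.round(cos_theta_hat, 3))
```

The two formulas share the same numerator and differ only in the query norm. Since $\|U_k^T q\|_2 \le \|q\|_2$, the alternate cosine is at least as large in magnitude, and at least as large in value whenever the numerator is nonnegative. The factored forms also avoid forming $A_k$ explicitly: only k-dimensional vectors are needed per document.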