Information Retrieval in Text, Part III

Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, 1999.

Reading Assignment: Chapter 4.

Outline

Matrix Decompositions
QR Factorization
Singular Value Decomposition
Updating Techniques

Matrix Decomposition

To produce a reduced-rank approximation of the $m \times n$ term-by-document matrix $A$, one must identify the dependence between the columns or rows of $A$.

For a rank-k matrix, the k basis vectors of its column space serve in place of its n column vectors to represent its column space.

QR Factorization

The QR factorization of the matrix $A$ is defined as

$$A = QR,$$

where $Q$ is an $m \times m$ orthogonal matrix and $R$ is an $m \times n$ upper triangular matrix.

A square matrix is orthogonal if its columns are orthonormal, i.e., if $q_j$ denotes a column of the orthogonal matrix $Q$, then $q_j$ has unit Euclidean norm ($\|q_j\|_2 = 1$ for $j = 1, 2, \ldots, m$) and is orthogonal to all other columns of $Q$ ($q_j^T q_i = 0$ for all $i \neq j$).

The rows of $Q$ are also orthonormal, i.e., $Q^T Q = Q Q^T = I$.

Such a factorization exists for any matrix $A$.

There are many ways to compute the factorization.

QR Factorization

Given $A = QR$, the columns of $A$ are all linear combinations of the columns of $Q$. Thus, a subset of $k$ of the columns of $Q$ forms a basis for the column space of $A$, where $k = \mathrm{rank}(A)$.
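The properties above can be checked numerically. The following is a minimal sketch using NumPy's `numpy.linalg.qr`; the 4 x 3 matrix is a hypothetical illustration (it is not the example from the slides), chosen so that its third column is the sum of the first two and its rank is therefore 2.

```python
import numpy as np

# Hypothetical 4 x 3 matrix (not the slides' example).
# Column 3 = column 1 + column 2, so rank(A) = 2.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [0.0, 0.0, 0.0]])
m, n = A.shape

# mode="complete" returns the full m x m orthogonal Q and the m x n upper triangular R.
Q, R = np.linalg.qr(A, mode="complete")

q_is_orthogonal = np.allclose(Q.T @ Q, np.eye(m)) and np.allclose(Q @ Q.T, np.eye(m))
r_is_upper_triangular = np.allclose(R, np.triu(R))
reproduces_a = np.allclose(Q @ R, A)

# The first k columns of Q span the column space of A here because the first
# k columns of A happen to be linearly independent (no pivoting was needed).
k = np.linalg.matrix_rank(A)
Q_k = Q[:, :k]
spans_column_space = np.allclose(Q_k @ (Q_k.T @ A), A)

print(q_is_orthogonal, r_is_upper_triangular, reproduces_a, k, spans_column_space)
```

Projecting $A$ onto the span of the first $k$ columns of $Q$ and getting $A$ back is one way to confirm that those $k$ columns form a basis for the column space.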

QR Factorization: Example

The QR factorization of the previous example can be represented as

$$A = QR = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix} \begin{pmatrix} R_1 \\ 0 \end{pmatrix} = Q_1 R_1,$$

where $R_1$ denotes the nonzero upper block of $R$.

Note that the first 7 columns of $Q$, denoted $Q_1$, are orthonormal and hence constitute a basis for the column space of $A$.

The bottom zero submatrix of $R$ is not always guaranteed to be generated automatically by the QR factorization, so column pivoting may need to be applied in order to guarantee the zero submatrix.

$Q_2$ does not contribute to producing any nonzero value in $A$.

QR Factorization

One motivation for using QR factorization is that the basis vectors can be used to describe the semantic content of the corresponding text collection.

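The cosines of the angles $\theta_j$ between a query vector $q$ and the document vectors $a_j$ are

$$\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2} = \frac{r_j^T (Q_1^T q)}{\|r_j\|_2 \, \|q\|_2}, \qquad j = 1, 2, \ldots, n,$$

where $a_j = Q_1 r_j$ and $r_j$ is the $j$th column of the nonzero upper block of $R$.

Note that for the query “Child Proofing” it gives exactly the same cosines. Why?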
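As a hint toward the question above, the cosine between $q$ and a document vector $a_j$ can be computed either directly from $A$ or from the QR factors, using $a_j = Q_1 r_j$ and the fact that multiplying by a matrix with orthonormal columns preserves the Euclidean norm. The sketch below compares the two; the matrix and query are hypothetical (not the slides' data), and the names `cos_direct` and `cos_qr` are introduced here only for illustration.

```python
import numpy as np

# Hypothetical 4 x 3 term-by-document matrix and query (not the slides' data).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0, 0.0])

# Direct cosines between q and each column a_j of A.
cos_direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Reduced QR: Q1 has orthonormal columns and a_j = Q1 r_j, so ||a_j||_2 = ||r_j||_2.
Q1, R1 = np.linalg.qr(A, mode="reduced")
cos_qr = (R1.T @ (Q1.T @ q)) / (np.linalg.norm(R1, axis=0) * np.linalg.norm(q))

print(np.round(cos_direct, 3))
print(np.allclose(cos_direct, cos_qr))
```

No rank reduction has been applied yet, so both computations describe the same vectors and the cosines agree up to rounding.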

Frobenius Matrix Norm

Definition: The Frobenius matrix norm $\|\cdot\|_F$ of an $m \times n$ matrix $B = [b_{ij}]$ is defined by

$$\|B\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |b_{ij}|^2}$$
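The Frobenius norm is the square root of the sum of the squared entries. A small sketch, assuming NumPy and a hypothetical 3 x 2 matrix, evaluates the definition directly and compares it with `numpy.linalg.norm`:

```python
import numpy as np

# Hypothetical 3 x 2 matrix used only to illustrate the definition.
B = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Definition: square root of the sum of squared absolute entries.
frob_by_definition = np.sqrt(np.sum(np.abs(B) ** 2))

# NumPy's built-in Frobenius norm for comparison.
frob_numpy = np.linalg.norm(B, ord="fro")

print(frob_by_definition, frob_numpy)
```

Here the squared entries sum to 91, so both values equal the square root of 91.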

Low Rank Approximation for QR Factorization

Initially, the rank of $A$ is not known. However, after performing the QR factorization, its rank is obviously the rank of _______

With column pivoting, we know that there exists a permutation matrix $P$ such that

$$AP = QR,$$

where the larger entries of $R$ are moved to the upper left corner. Such an arrangement, if possible, partitions $R$ so that the smallest entries are isolated in the bottom submatrix.
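One way to see how pivoting pushes the large entries of $R$ toward the upper left is a small hand-written routine. The sketch below uses modified Gram-Schmidt with column pivoting; this is an illustrative choice, not necessarily the method used in the reference (library routines typically use Householder reflections). The function name `pivoted_qr`, its `tol` parameter, and the test matrix are all hypothetical, and the routine returns a thin $Q$ with `min(m, n)` columns rather than the full $m \times m$ matrix.

```python
import numpy as np

def pivoted_qr(A, tol=1e-12):
    """Modified Gram-Schmidt with column pivoting.

    Returns Q, R, perm such that A[:, perm] is approximately Q @ R.
    """
    W = np.array(A, dtype=float)       # working copy holding residual columns
    m, n = W.shape
    k = min(m, n)
    Q = np.zeros((m, k))
    R = np.zeros((k, n))
    perm = np.arange(n)
    for i in range(k):
        # Pivot: move the remaining column with the largest residual norm to position i.
        p = i + int(np.argmax(np.linalg.norm(W[:, i:], axis=0)))
        W[:, [i, p]] = W[:, [p, i]]
        R[:, [i, p]] = R[:, [p, i]]
        perm[[i, p]] = perm[[p, i]]
        r_ii = np.linalg.norm(W[:, i])
        if r_ii <= tol:                # remaining columns are numerically dependent
            break
        R[i, i] = r_ii
        Q[:, i] = W[:, i] / r_ii
        # Project the new basis vector out of the remaining residual columns.
        R[i, i + 1:] = Q[:, i] @ W[:, i + 1:]
        W[:, i + 1:] -= np.outer(Q[:, i], R[i, i + 1:])
    return Q, R, perm

# Hypothetical 4 x 3 matrix (not the slides' example).
A = np.array([[1.0, 0.0, 3.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 4.0],
              [0.0, 0.0, 1.0]])
Q, R, perm = pivoted_qr(A)
P = np.eye(A.shape[1])[:, perm]        # permutation matrix with A @ P = A[:, perm]

print(np.allclose(A @ P, Q @ R))
print(np.abs(np.diag(R)))              # diagonal magnitudes do not increase
```

Because each pivot picks the largest remaining residual column, the magnitudes on the diagonal of $R$ are non-increasing, which is what isolates the small entries in the bottom submatrix.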

Low Rank Approximation for QR Factorization

With $R$ partitioned as

$$R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix},$$

computing $\|R_{22}\|_F / \|R\|_F$ gives the relative size of the submatrix $R_{22}$.

Redefining $R_{22}$ to be the $4 \times 2$ zero matrix, the modified upper triangular matrix $\tilde{R}$ has rank 5 rather than 7. Hence, the matrix $A + E = Q\tilde{R}$ has rank ____

Show that $\|E\|_F = \|R_{22}\|_F$.

Show that $\|E\|_F / \|A\|_F = \|R_{22}\|_F / \|R\|_F = 0.3237$.

Therefore, the relative change in $R$, 32.37%, yields the same relative change in $A$. With $r = 4$, the relative change is 76%.
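The two identities to be shown can be checked numerically before proving them. The sketch below zeroes the trailing block of $R$ and measures the resulting perturbation $E$; the 5 x 4 matrix and the cut-off `r = 2` are hypothetical (not the slides' 9 x 7 example), and no column pivoting is applied, so the discarded block is not necessarily small.

```python
import numpy as np

def fro(M):
    """Frobenius norm of a matrix."""
    return np.linalg.norm(M, ord="fro")

# Hypothetical 5 x 4 matrix (not the slides' example).
A = np.array([[2.0, 1.0, 0.0, 1.0],
              [1.0, 2.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0],
              [1.0, 0.0, 1.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
Q, R = np.linalg.qr(A, mode="complete")

r = 2                                  # target rank (assumed for illustration)
R22 = R[r:, r:].copy()                 # trailing block to be discarded
R_tilde = R.copy()
R_tilde[r:, r:] = 0.0                  # redefine R22 as a zero block
A_tilde = Q @ R_tilde                  # A + E = Q R~
E = A_tilde - A

print(np.isclose(fro(E), fro(R22)))
print(np.isclose(fro(E) / fro(A), fro(R22) / fro(R)))
print(np.linalg.matrix_rank(A_tilde))
```

Both checks rely on the same fact: multiplying by the orthogonal matrix $Q$ leaves the Frobenius norm unchanged.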

Low Rank Approximation for QR Factorization: Example

Comparing Cosine Similarities for the Query: “Child Proofing”

Doc    A        r=5      r=4
2      0.408    0.408    0.408
3      0.408    0.5      0.309
4      0        0        0.184
5      0.5      0.5      0.5
6      0.5      0.5      0.5

Comparing Cosine Similarities for the Query: “Child Home Safety”

Doc    A        r=5      r=4
2      0.667    0.667    0.667
3      1        0.816    0.756
4      0.258    0        0.1
5      0        0        0
6      0        0        0
7      0        0        0.356

Singular Value Decomposition

While QR factorization provides a reduced-rank basis for the column space, no information is provided about the row space of $A$.

SVD can provide a reduced-rank approximation for both spaces, and a rank-$k$ approximation to $A$ of minimal change for any value of $k$.

Singular Value Decomposition

$$A = U \Sigma V^T,$$

where

$U$: $m \times m$ orthogonal matrix whose columns define the left singular vectors of $A$

$V$: $n \times n$ orthogonal matrix whose columns define the right singular vectors of $A$

$\Sigma$: $m \times n$ diagonal matrix containing the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min\{m,n\}}$

Such a factorization exists for any matrix $A$.
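The shapes and properties of the three factors can be inspected with `numpy.linalg.svd`. This is a minimal sketch on a hypothetical 4 x 3 matrix; note that NumPy returns the singular values as a vector `s` and the matrix $V^T$ (named `Vt` here), so the $m \times n$ matrix $\Sigma$ is assembled by hand.

```python
import numpy as np

# Hypothetical 4 x 3 matrix used only for illustration.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
m, n = A.shape

# full_matrices=True gives U (m x m) and Vt (n x n); s holds min(m, n) singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the m x n diagonal matrix Sigma from the singular values.
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

print(U.shape, Sigma.shape, Vt.shape)
print(s)                                   # sorted in non-increasing order
print(np.allclose(U @ Sigma @ Vt, A))      # A = U Sigma V^T
```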

Component Matrices of the SVD

SVD vs. QR

What is the relationship between the rank of A and the ranks of the matrices in both factorizations?

In QR, the first $r_A$ columns of $Q$ form a basis for the column space of $A$, where $r_A = \mathrm{rank}(A)$, and so do the first $r_A$ columns of $U$. The first $r_A$ rows of $V^T$ form a basis for the row space of $A$.

The rank-$k$ approximation in SVD can be obtained by setting all but the $k$ largest singular values in $\Sigma$ to zero.

SVD

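Theorem: The rank-$k$ approximation $A_k$ obtained from the SVD is the closest rank-$k$ approximation to $A$.

Proven by Eckart and Young.

It shows that the error in approximating $A$ by $A_k$ is given by

$$\|A - A_k\|_F = \min_{\mathrm{rank}(B) \leq k} \|A - B\|_F = \sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_{r_A}^2},$$

where $A_k = U_k \Sigma_k V_k^T$ and $r_A = \mathrm{rank}(A)$.

Hence, the error in approximating the original matrix is determined by the singular values $\sigma_{k+1}, \sigma_{k+2}, \ldots, \sigma_{r_A}$.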
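The error formula can be verified numerically: the Frobenius norm of $A - A_k$ should equal the square root of the sum of the squared discarded singular values. The sketch below assumes NumPy and a hypothetical random 6 x 5 matrix with a fixed seed; the helper name `rank_k_approx` is introduced here for illustration.

```python
import numpy as np

# Hypothetical data: a random 6 x 5 matrix with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
A = rng.random((6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k):
    """A_k = U_k Sigma_k V_k^T, keeping only the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

k = 2
A_k = rank_k_approx(k)

err = np.linalg.norm(A - A_k, ord="fro")   # ||A - A_k||_F
tail = np.sqrt(np.sum(s[k:] ** 2))         # sqrt(sigma_{k+1}^2 + ... + sigma_r^2)

print(err, tail)
```

The check confirms the error value only; that no other rank-$k$ matrix does better is the content of the theorem itself.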

SVD: Example

$\|A - A_6\|_F$ = …… Hence, the relative change in the matrix $A$ is …

Rank-k approximation    % Change
Rank-6                  7.4%
Rank-5                  22.67%
Rank-4                  32.49%
Rank-3                  56.45%

Therefore, a rank-5 approximation may be appropriate in our case.

Determining the best rank approximation for any database depends on empirical testing.

For very large databases, the number could be between 100 and 300.

Computational feasibility, rather than accuracy, determines the rank reduction.
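A table of percentage changes like the one for this example can be produced from the singular values alone, since $\|A\|_F$ is the square root of the sum of all squared singular values and $\|A - A_k\|_F$ uses only the discarded ones. The sketch below uses hypothetical singular values (not those of the slides' matrix), and `relative_change` is a name introduced for illustration.

```python
import numpy as np

def relative_change(singular_values, k):
    """Relative change ||A - A_k||_F / ||A||_F computed from singular values only."""
    s = np.asarray(singular_values, dtype=float)
    return np.sqrt(np.sum(s[k:] ** 2) / np.sum(s ** 2))

# Hypothetical singular values in non-increasing order (not the slides' values).
s = np.array([3.0, 2.0, 1.0, 0.5])

for k in range(len(s) - 1, 0, -1):
    print(f"Rank-{k}: {100 * relative_change(s, k):.2f}%")
```

This avoids forming each $A_k$ explicitly, which matters when many candidate ranks are compared.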

Low Rank Approximations

Visual comparison of rank-reduced approximations to $A$ can be misleading. Check the rank-4 QR approximation vs. the more accurate rank-4 SVD approximation.

The rank-4 SVD approximation shows associations made with terms not originally in the document title, e.g., Term 4 (Health) and Term 8 (Safety) in Document 1 (Infant & Toddler First Aid).

Query Matching

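The query vector $q$ is to be compared with the columns of the reduced-rank matrix $A_k$.

Let $e_j$ denote the $j$th canonical vector (the $j$th column of the identity matrix $I_n$). Then, $A_k e_j$ represents _______________

It is easy to show that

$$\cos\theta_j = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \, \|q\|_2}, \qquad j = 1, 2, \ldots, n,$$

where

$$s_j = \Sigma_k V_k^T e_j.$$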
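The cosine formula in terms of $s_j = \Sigma_k V_k^T e_j$ can be checked against a direct computation with the columns of $A_k$. The sketch below assumes NumPy; the 5 x 4 matrix, the query, and the choice `k = 2` are hypothetical, and the variable names (`S`, `q_hat`, `cos_theta`, `cos_direct`) are introduced only for this illustration.

```python
import numpy as np

# Hypothetical 5 x 4 term-by-document matrix and query (not the slides' data).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, Sigma_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = U_k @ Sigma_k @ Vt_k

# Column j of S is s_j = Sigma_k V_k^T e_j (a k-vector per document).
S = Sigma_k @ Vt_k
q_hat = U_k.T @ q                      # query projected onto the k left singular vectors

# Cosines via the low-dimensional formula.
cos_theta = (S.T @ q_hat) / (np.linalg.norm(S, axis=0) * np.linalg.norm(q))

# Cosines computed directly from the m-dimensional columns of A_k.
cos_direct = (A_k.T @ q) / (np.linalg.norm(A_k, axis=0) * np.linalg.norm(q))

print(np.allclose(cos_theta, cos_direct))
```

The low-dimensional form works with $k$-vectors instead of $m$-vectors, so $A_k$ never has to be formed explicitly when answering a query.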

Query Matching

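An alternate formula for the cosine computation is

$$\cos\hat{\theta}_j = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \, \|U_k^T q\|_2}, \qquad j = 1, 2, \ldots, n.$$

Note that

$$\cos\hat{\theta}_j \geq \cos\theta_j, \qquad j = 1, 2, \ldots, n,$$

which means that the number of retrieved documents using this query matching technique is larger.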
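The effect of normalizing by $\|U_k^T q\|_2$ instead of $\|q\|_2$ can be seen by computing both cosines side by side. The sketch below assumes NumPy; the matrix, query, rank `k = 2`, and retrieval threshold of 0.5 are all hypothetical choices. Because $\|U_k^T q\|_2 \leq \|q\|_2$, the alternate cosine is at least as large in magnitude, and the inequality stated above holds whenever the numerator is nonnegative.

```python
import numpy as np

# Hypothetical 5 x 4 term-by-document matrix and query (not the slides' data).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
k = 2
threshold = 0.5                        # assumed retrieval cut-off for illustration

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]
S = np.diag(s[:k]) @ Vt[:k, :]         # column j is s_j = Sigma_k V_k^T e_j
q_hat = U_k.T @ q                      # projected query U_k^T q

numerator = S.T @ q_hat
s_norms = np.linalg.norm(S, axis=0)

cos_theta = numerator / (s_norms * np.linalg.norm(q))          # normalized by ||q||_2
cos_theta_hat = numerator / (s_norms * np.linalg.norm(q_hat))  # normalized by ||U_k^T q||_2

retrieved = int(np.sum(cos_theta >= threshold))
retrieved_hat = int(np.sum(cos_theta_hat >= threshold))

print(np.round(cos_theta, 3), np.round(cos_theta_hat, 3))
print(retrieved, retrieved_hat)
```

For any positive threshold, every document retrieved with the original formula is also retrieved with the alternate one.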
