Latent Semantic Indexing (LSI) — ce.sharif.edu/courses/95-96/1/ce324-1
TRANSCRIPT
Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology
M. Soleymani
Fall 2016
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Vector space model: pros
Partial matching of queries and docs
dealing with the case where no doc contains all search terms
Ranking according to similarity score
Term weighting schemes
improve retrieval performance
Various extensions
Relevance feedback (modifying query vector)
Doc clustering and classification
Problems with lexical semantics
Ambiguity and association in natural language
Polysemy: Words often have a multitude of meanings and
different types of usage
More severe in very heterogeneous collections.
The vector space model is unable to discriminate between
different meanings of the same word.
Problems with lexical semantics
Synonymy: Different terms may have identical or similar
meanings (weaker: words indicating the same topic).
No associations between words are made in the vector
space representation.
Polysemy and context
Doc similarity on single word level: polysemy and context
[Figure: two senses of “saturn” — meaning 1: ring, jupiter, space, voyager, planet; meaning 2: car, company, dodge, ford. An occurrence of “saturn” contributes to similarity if both docs use the 1st meaning, but not if one uses the 2nd.]
SVD
Latent Semantic Indexing (LSI)
Perform a low-rank approximation of doc-term
matrix (typical rank 100-300)
latent semantic space
Term-doc matrices are very large, but the number of topics that people talk about is small (in some sense)
General idea: Map docs (and terms) to a low-dimensional
space
Design a mapping such that the low-dimensional space reflects
semantic associations
Compute doc similarity based on the inner product in this latent
semantic space
Goals of LSI
Similar terms map to similar location in low
dimensional space
Noise reduction by dimension reduction
Term-document matrix
This matrix is the basis for computing similarity between docs and queries.
Can we transform this matrix so that we get a better measure of similarity
between docs and queries? . . .
Singular Value Decomposition (SVD)
For an 𝑀 × 𝑁 matrix 𝐴 of rank 𝑟 there exists a factorization:
𝐴 = 𝑈Σ𝑉^𝑇   (𝑈: 𝑀 × 𝑀, Σ: 𝑀 × 𝑁, 𝑉: 𝑁 × 𝑁)
The columns of 𝑈 are orthogonal eigenvectors of 𝐴𝐴^𝑇.
The columns of 𝑉 are orthogonal eigenvectors of 𝐴^𝑇𝐴.
Singular values: the eigenvalues 𝜆1 … 𝜆𝑟 of 𝐴𝐴^𝑇 are also the eigenvalues of 𝐴^𝑇𝐴, and
Σ = diag(𝜎1, … , 𝜎𝑟) with 𝜎𝑖 = √𝜆𝑖
Typically, the singular values are arranged in decreasing order.
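As a sketch (not part of the slides), the factorization and the eigenvalue relation can be checked with NumPy:

```python
import numpy as np

A = np.array([[1.0, -1.0], [0.0, 1.0], [1.0, 0.0]])  # an M x N matrix (M=3, N=2)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# A = U @ Sigma @ V^T, with Sigma the M x N diagonal matrix of singular values.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# The singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s, np.sqrt(eigvals))
```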
Singular Value Decomposition (SVD)
Reduced SVD
𝐴 = 𝑈Σ𝑉^𝑇 with 𝑈: 𝑀 × min(𝑀, 𝑁), Σ: min(𝑀, 𝑁) × min(𝑀, 𝑁), 𝑉^𝑇: min(𝑀, 𝑁) × 𝑁
SVD example
M = 3, N = 2:
𝐴 = [ 1 −1 ; 0 1 ; 1 0 ]
Reduced SVD:
𝐴 = [ 0 2/√6 ; 1/√2 −1/√6 ; 1/√2 1/√6 ] · [ 1 0 ; 0 √3 ] · [ 1/√2 1/√2 ; 1/√2 −1/√2 ]
Or equivalently, the full SVD:
𝐴 = [ 0 2/√6 1/√3 ; 1/√2 −1/√6 1/√3 ; 1/√2 1/√6 −1/√3 ] · [ 1 0 ; 0 √3 ; 0 0 ] · [ 1/√2 1/√2 ; 1/√2 −1/√2 ]
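A quick numerical check of the example (matrices as reconstructed above):

```python
import numpy as np

A = np.array([[1.0, -1.0], [0.0, 1.0], [1.0, 0.0]])

# Reduced SVD factors from the slide (singular values 1 and sqrt(3)).
U = np.array([[0.0,            2 / np.sqrt(6)],
              [1 / np.sqrt(2), -1 / np.sqrt(6)],
              [1 / np.sqrt(2),  1 / np.sqrt(6)]])
S = np.diag([1.0, np.sqrt(3)])
Vt = np.array([[1 / np.sqrt(2),  1 / np.sqrt(2)],
               [1 / np.sqrt(2), -1 / np.sqrt(2)]])

# The product reproduces A exactly.
assert np.allclose(A, U @ S @ Vt)
```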
Example
We use a non-weighted matrix here to simplify the example.
Example of 𝐶 = 𝑈Σ𝑉𝑇: All four matrices
𝐶 = 𝑈Σ𝑉𝑇
Example of 𝐶 = 𝑈Σ𝑉𝑇: matrix 𝑈
Columns: “semantic” dimensions (distinct topics like politics, sports, ...)
𝑢𝑖𝑗: how strongly related term 𝑖 is to the topic in column 𝑗.
One row per term; one column per semantic dimension (min(𝑀, 𝑁) columns).
Example of 𝐶 = 𝑈Σ𝑉𝑇: The matrix Σ
Singular value:
“measures the importance of the corresponding semantic dimension”.
We’ll make use of this by omitting unimportant dimensions.
Σ is a square, diagonal matrix: min(𝑀, 𝑁) × min(𝑀, 𝑁).
Example of 𝐶 = 𝑈Σ𝑉𝑇: The matrix 𝑉𝑇
Columns of 𝑉: “semantic” dims
𝑣𝑖𝑗: how strongly related doc 𝑖 is to the topic in column 𝑗.
One column per doc; one row per semantic dimension (min(𝑀, 𝑁) rows).
Matrix decomposition: Summary
We’ve decomposed the term-doc matrix 𝐶 into a
product of three matrices.
𝑈: consists of one (row) vector for each term
𝑉𝑇: consists of one (column) vector for each doc
Σ: diagonal matrix with singular values, reflecting importance of
each dimension
Next: Why are we doing this?
LSI: Overview
Decompose term-doc matrix 𝐶 into a product of
matrices using SVD
𝐶 = 𝑈Σ𝑉𝑇
We use columns of matrices 𝑈 and 𝑉 that correspond to the
largest values in the diagonal matrix Σ as term and document
dimensions in the new space
Using the SVD for this purpose is called LSI.
Solution via SVD
Low-rank approximation: set the smallest 𝑟 − 𝑘 singular values to zero, i.e. retain only 𝑘 singular values:
𝐴𝑘 = 𝑈 diag(𝜎1, … , 𝜎𝑘 , 0, … , 0) 𝑉^𝑇   (𝑀 × 𝑁 = (𝑀 × 𝑘)(𝑘 × 𝑘)(𝑘 × 𝑁))
In column notation, this is a sum of 𝑘 rank-1 matrices:
𝐴𝑘 = ∑𝑖=1..𝑘 𝜎𝑖 𝑢𝑖 𝑣𝑖^𝑇
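A sketch of both forms — the truncated SVD product and the equivalent sum of rank-1 matrices — on a small random matrix (illustrative data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Keep only the k largest singular values.
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Column notation: A_k = sum_{i=1}^{k} sigma_i u_i v_i^T
Ak_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

assert np.allclose(Ak, Ak_sum)
assert np.linalg.matrix_rank(Ak) == k
```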
Low-rank approximation
Approximation problem: Given matrix 𝐴, find matrix 𝐴𝑘 of rank 𝑘 (i.e., a matrix with 𝑘 linearly independent rows or columns) such that
𝐴𝑘 = argmin 𝑋: rank(𝑋)=𝑘 ‖𝐴 − 𝑋‖𝐹   (Frobenius norm)
𝐴𝑘 and 𝑋 are both 𝑀 × 𝑁 matrices. Typically, we want 𝑘 ≪ 𝑟.
SVD can be used to compute optimal low-rank approximations: keeping the 𝑘 largest singular values and setting all others to zero yields the optimal approximation [Eckart-Young].
No matrix of rank 𝑘 approximates 𝐴 better than 𝐴𝑘.
![Page 22: Latent Semantic Indexing (LSI)ce.sharif.edu/courses/95-96/1/ce324-1/resources/root/... · 2020. 9. 7. · Latent Semantic Indexing (LSI) Perform a low-rank approximation of doc-term](https://reader033.vdocuments.us/reader033/viewer/2022051512/603dc9a8fc2e5d1b555e16eb/html5/thumbnails/22.jpg)
Approximation error
How good (bad) is this approximation? It’s the best possible, measured by the Frobenius norm of the error. For
𝐴𝑘 = 𝑈 diag(𝜎1, … , 𝜎𝑘 , 0, … , 0) 𝑉^𝑇
min 𝑋: rank(𝑋)=𝑘 ‖𝐴 − 𝑋‖𝐹 = ‖𝐴 − 𝐴𝑘‖𝐹 = (𝜎𝑘+1² + ⋯ + 𝜎𝑟²)^1/2
(in the spectral norm, the error is exactly 𝜎𝑘+1), where the 𝜎𝑖 are ordered such that 𝜎𝑖 ≥ 𝜎𝑖+1.
This suggests why the Frobenius error drops as 𝑘 increases.
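The error formulas can be verified numerically (illustrative random matrix, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Spectral-norm error is sigma_{k+1}; Frobenius error is the root-sum-square
# of the discarded singular values.
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])
assert np.isclose(np.linalg.norm(A - Ak, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```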
SVD low-rank approximation
A term-doc matrix 𝐶 may have 𝑀 = 50000, 𝑁 = 10⁶, with rank close to 50000.
Construct an approximation 𝐶100 with rank 100: of all rank-100 matrices, it has the lowest Frobenius error.
Great … but why would we want to do this?
Answer: Latent Semantic Indexing

C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211–218, 1936.
Recall unreduced decomposition 𝐶 = 𝑈Σ𝑉𝑇
Reducing the dimensionality to 2
Reducing the dimensionality to 2
Original matrix 𝐶 vs. reduced 𝐶2 = 𝑈Σ2𝑉𝑇
𝐶2 as a two-dimensional representation of 𝐶: dimensionality reduction to two dimensions.
Why is the reduced matrix “better”?
Similarity of d2 and d3 in the original space: 0.
Similarity of d2 and d3 in the reduced space:
0.52 × 0.28 + 0.36 × 0.16 + 0.72 × 0.36 + 0.12 × 0.20 + (−0.39) × (−0.08) ≈ 0.52
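The reduced-space similarity above is just a dot product of the two doc columns (values taken from the slide):

```python
import numpy as np

d2 = np.array([0.52, 0.36, 0.72, 0.12, -0.39])  # d2 in the reduced space
d3 = np.array([0.28, 0.16, 0.36, 0.20, -0.08])  # d3 in the reduced space

print(round(float(d2 @ d3), 2))  # → 0.52
```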
Why is the reduced matrix “better”?
“boat” and “ship” are semantically similar.
The “reduced” similarity measure reflects this.
What property of the SVD reduction is responsible for improved similarity?
Example
[Example from Dumais et al.]
Example
[Example from Dumais et al.]
Example (k=2)
[Example from Dumais et al.]
𝑈𝑘
Σ𝑘 𝑉𝑘𝑇
[Figure: 2-D plot of the reduced space. Squares: terms (graph, tree, minor, survey, time, response, user, computer, interface, human, EPS, system); circles: docs.]
[Example from Dumais et al.]
How we use the SVD in LSI
Key property of SVD: Each singular value tells us how
important its dimension is.
By setting less important dimensions to zero, we keep the
important information, but get rid of the “details”.
These details may be noise ⇒ the reduced LSI space is a better representation.
Details can make things dissimilar that should be similar ⇒ the reduced LSI space is a better representation because it represents similarity better.
How does LSI address synonymy and semantic relatedness?
Docs may be semantically similar but are not similar in the
vector space (when we talk about the same topics but use
different words).
Standard vector space: synonyms contribute nothing to doc similarity.
Desired effect of LSI: synonyms contribute strongly to doc similarity.
LSI (via SVD) selects the “least costly” mapping:
different words (= different dimensions of the full space) are
mapped to the same dimension in the reduced space.
Thus, it maps synonyms or semantically related words to the same dimension.
The “cost” of mapping synonyms to the same dimension is much less
than the cost of collapsing unrelated words.
Thus, LSI will avoid doing that for unrelated words.
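A toy illustration of this effect (hypothetical data; the term and doc choices are mine, not from the slides): two terms that never co-occur but share a common neighbor end up close together in the reduced space:

```python
import numpy as np

# Toy term-doc count matrix: "ship" and "boat" never co-occur in a doc,
# but both co-occur with "ocean"; "vote" is unrelated to all three.
terms = ["ship", "boat", "ocean", "vote"]
C = np.array([
    [1, 1, 0, 0, 0, 0, 0],   # ship  : docs 1-2
    [0, 0, 1, 1, 0, 0, 0],   # boat  : docs 3-4
    [0, 2, 2, 0, 0, 0, 0],   # ocean : docs 2-3
    [0, 0, 0, 0, 1, 1, 1],   # vote  : docs 5-7
], dtype=float)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In the full space, ship and boat have similarity exactly 0.
assert cos(C[0], C[1]) == 0.0

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # term representations in the reduced space

print(round(cos(term_vecs[0], term_vecs[1]), 2))  # ship vs boat: high
print(round(cos(term_vecs[0], term_vecs[3]), 2))  # ship vs vote: ~0
```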
Performing the maps
Each row and column of 𝐶 gets mapped into the 𝑘-dimensional LSI space by the SVD.
A query 𝑞 is also mapped into this space: since 𝑉𝑘 = 𝐶𝑘^𝑇𝑈𝑘Σ𝑘^−1, we should transform query 𝑞 to
𝑞𝑘 = 𝑞^𝑇𝑈𝑘Σ𝑘^−1
The mapped query is NOT a sparse vector.
Claim: this is not only the mapping with the best (Frobenius error) approximation to 𝐶, but it also improves retrieval.
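A sketch of the query mapping on an illustrative random term-doc matrix (data and dimensions are mine); the sanity check confirms that mapping doc columns the same way recovers 𝑉𝑘:

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.random((8, 5))                 # term-doc matrix: 8 terms, 5 docs
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])
Vk = Vt[:k, :].T                       # one row per doc

# q_k = q^T U_k Sigma_k^{-1}: the sparse query becomes a dense k-dim vector.
q = np.zeros(8)
q[[1, 4]] = 1.0                        # query containing terms 1 and 4
qk = q @ Uk @ np.linalg.inv(Sk)

# Sanity check: mapping each doc (column of C) the same way recovers V_k.
docs_k = C.T @ Uk @ np.linalg.inv(Sk)
assert np.allclose(docs_k, Vk)
```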
Implementation
Compute SVD of term-doc matrix
Map docs to the reduced space
Map the query into the reduced space: 𝑞𝑘 = 𝑞^𝑇𝑈𝑘Σ𝑘^−1
Compute similarity of 𝑞𝑘 with all reduced docs in 𝑉𝑘 .
Output ranked list of docs as usual
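The steps above can be sketched as follows (toy count matrix; cosine similarity assumed for the ranking step):

```python
import numpy as np

# Toy term-doc count matrix (4 terms x 4 docs); illustrative data only.
C = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# 1. SVD of the term-doc matrix, truncated to k dimensions.
k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
Uk, Sk = U[:, :k], np.diag(s[:k])

# 2. Docs in the reduced space: the rows of V_k.
docs_k = Vt[:k, :].T

# 3. Map the query into the reduced space: q_k = q^T U_k Sigma_k^{-1}.
q = np.array([1.0, 0.0, 1.0, 0.0])
qk = q @ Uk @ np.linalg.inv(Sk)

# 4. Cosine similarity of q_k with all reduced docs; 5. ranked list as usual.
sims = docs_k @ qk / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(qk))
ranking = np.argsort(-sims)
print(ranking)
```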
What is the fundamental problem with this approach?
Empirical evidence
Experiments on TREC 1/2/3 – Dumais
Lanczos SVD code (available on netlib), due to Berry, was used
in these experiments
Running times of ~ one day on tens of thousands of docs [still an
obstacle to use]
Dimensions: various values in the range 250–350 reported; under 200 reported unsatisfactory.
Reducing 𝑘 improves recall.
Generally expect recall to improve – what about precision?
Empirical evidence
Precision at or above median TREC precision
Top scorer on almost 20% of TREC topics
Slightly better on average than straight vector spaces
Effect of dimensionality:

| Dimensions | Precision |
|---|---|
| 250 | 0.367 |
| 300 | 0.371 |
| 346 | 0.374 |
But why is this clustering?
We’ve talked about docs, queries, retrieval and
precision here.
What does this have to do with clustering?
Intuition: Dimension reduction through LSI brings
together “related” axes in the vector space.
Simplistic picture
Topic 1
Topic 2
Topic 3
Reference
Chapter 18 of the IIR book (Manning, Raghavan & Schütze, Introduction to Information Retrieval)