Why Spectral Retrieval Works
Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany
joint work with Debapriyo Majumdar
SIGIR 2005, Salvador, Brazil, August 15 – 19
What we mean by spectral retrieval

Ranked retrieval in the term space. Term-document matrix (rows: internet, web, surfing, beach) and query q = (1, 0, 0, 0):

              d1  d2  d3  d4  d5      q
  internet     2   0   0   1   0      1
  web          1   2   0   1   0      0
  surfing      1   1   0   2   1      0
  beach        0   0   1   1   2      0

Cosine similarities q^T d_i / (|q| |d_i|):   0.82  0.00  0.00  0.38  0.00
"True" similarities to the query:            1.00  1.00  0.00  0.50  0.00
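These numbers are easy to reproduce; a minimal NumPy check using the matrix and query from the slide:

```python
import numpy as np

# term-document matrix from the slide (rows: internet, web, surfing, beach)
A = np.array([[2, 0, 0, 1, 0],
              [1, 2, 0, 1, 0],
              [1, 1, 0, 2, 1],
              [0, 0, 1, 1, 2]], dtype=float)
q = np.array([1, 0, 0, 0], dtype=float)  # query: "internet"

# cosine similarity of the query with each document column
sims = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))
print(np.round(sims, 2))  # matches the slide: 0.82 0.00 0.00 0.38 0.00
```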
Spectral retrieval = linear projection to an eigensubspace

Projection matrix L:

   0.42  0.51  0.66  0.37
   0.33  0.43 -0.08 -0.84

Projected query and documents:

    Lq     Ld1    Ld2    Ld3    Ld4    Ld5
    0.42   2.01   1.67   0.37   2.61   1.39
    0.33   1.01   0.79  -0.84  -0.21  -1.75

Cosine similarities in the subspace, (Lq)^T (Ld_i) / (|Lq| |Ld_i|):
    0.98   0.98  -0.25   0.73   0.01
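The same check in the subspace, with L copied from the slide (its entries are rounded to two decimals, so the reproduced similarities can drift in the last digit):

```python
import numpy as np

A = np.array([[2, 0, 0, 1, 0],
              [1, 2, 0, 1, 0],
              [1, 1, 0, 2, 1],
              [0, 0, 1, 1, 2]], dtype=float)
q = np.array([1, 0, 0, 0], dtype=float)
L = np.array([[0.42, 0.51,  0.66,  0.37],
              [0.33, 0.43, -0.08, -0.84]])  # projection matrix from the slide

Lq, LA = L @ q, L @ A  # projected query and documents
sims = (Lq @ LA) / (np.linalg.norm(Lq) * np.linalg.norm(LA, axis=0))
print(np.round(sims, 2))  # close to the slide's 0.98 0.98 -0.25 0.73 0.01
```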
Why and when does this work?

Previous work: if the term-document matrix is a slight perturbation of a rank-k matrix, then projection to a k-dimensional subspace works
– Papadimitriou, Tamaki, Raghavan, Vempala, PODS'98
– Ding, SIGIR'99
– Ando and Lee, SIGIR'01
– Azar, Fiat, Karlin, McSherry, Saia, STOC'01

Our explanation: spectral retrieval works through its ability to identify pairs of terms with similar co-occurrence patterns
– no single subspace is appropriate for all term pairs
– we fix that problem
Spectral retrieval — alternative view

Spectral retrieval = linear projection to an eigensubspace, with projection matrix L:

   0.42  0.51  0.66  0.37
   0.33  0.43 -0.08 -0.84

The cosine similarities in the subspace can be rewritten:

  (Lq)^T (Ld_i) / (|Lq| |Ld_i|)  =  q^T (L^T L d_i) / (|Lq| |L^T L d_i|)

Expansion matrix L^T L:

   0.29  0.36  0.25 -0.12
   0.36  0.44  0.30 -0.17
   0.25  0.30  0.44  0.30
  -0.12 -0.17  0.30  0.84

Expanded documents L^T L d_1, …, L^T L d_5:

   1.18   0.96  -0.12   1.03   0.01
   1.45   1.19  -0.17   1.22  -0.05
   1.24   1.04   0.30   1.73   1.04
  -0.11  -0.04   0.84   1.15   1.98

Similarities after document expansion:  q^T (L^T L d_i) / (|q| |L^T L d_i|)

Spectral retrieval = document expansion (not query expansion)
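The rewriting on this slide is just associativity, (Lq)^T (Ld) = q^T (L^T L d), which holds for any L; a quick numerical confirmation with the slide's numbers:

```python
import numpy as np

L = np.array([[0.42, 0.51,  0.66,  0.37],
              [0.33, 0.43, -0.08, -0.84]])
A = np.array([[2, 0, 0, 1, 0],
              [1, 2, 0, 1, 0],
              [1, 1, 0, 2, 1],
              [0, 0, 1, 1, 2]], dtype=float)
q = np.array([1, 0, 0, 0], dtype=float)

E = L.T @ L  # the expansion matrix
# scoring in the subspace == scoring the plain query against expanded documents
assert np.allclose((L @ q) @ (L @ A), q @ (E @ A))
print(np.round(E, 2))  # first row matches the slide: 0.29 0.36 0.25 -0.12
```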
Why document "expansion"

0-1 expansion matrix (rows/columns: internet, web, surfing, beach):

  1 1 0 0
  1 1 0 0
  0 0 1 0
  0 0 0 1

Applied to the document (0, 1, 1, 0) it yields (1, 1, 1, 0): add "internet" if "web" is present.
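As a multiplication, with the 0-1 matrix from the slide:

```python
import numpy as np

# 0-1 expansion matrix (rows/columns: internet, web, surfing, beach):
# "add 'internet' if 'web' is present", and vice versa
B = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])
d = np.array([0, 1, 1, 0])  # a document containing "web" and "surfing"
print(B @ d)  # [1 1 1 0]: "internet" has been added
```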
Why document "expansion"

Matrix L projecting to 2 dimensions:

   0.42  0.51  0.66  0.37
   0.33  0.43 -0.08 -0.84

Expansion matrix L^T L (rows/columns: internet, web, surfing, beach):

   0.29  0.36  0.25 -0.12
   0.36  0.44  0.30 -0.17
   0.25  0.30  0.44  0.30
  -0.12 -0.17  0.30  0.84

Applied to the document (0, 1, 1, 0) it yields (0.61, 0.74, 0.74, 0.13): "internet" is added because "web" is present.

An ideal expansion matrix has
– high scores for intuitively related terms
– low scores for intuitively unrelated terms

The expansion matrix depends heavily on the subspace dimension!
Why document "expansion"

Matrix L projecting to 3 dimensions:

   0.42  0.51  0.66  0.37
   0.33  0.43 -0.08 -0.84
  -0.80  0.59  0.06 -0.01

Expansion matrix L^T L:

   0.93 -0.12  0.20 -0.11
  -0.12  0.80  0.34 -0.18
   0.20  0.34  0.44  0.30
  -0.11 -0.18  0.30  0.84

Applied to the document (0, 1, 1, 0) it yields (0.08, 1.13, 0.78, 0.12): now "internet" is barely added even though "web" is present.

The expansion matrix depends heavily on the subspace dimension!
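The dimension dependence can be seen directly in the internet/web entry of L^T L, stacking the projection rows from the slides above:

```python
import numpy as np

# rows of the projection matrices shown on the slides
# (2-dim uses the first two rows, 3-dim the first three)
rows = np.array([[ 0.42, 0.51,  0.66,  0.37],
                 [ 0.33, 0.43, -0.08, -0.84],
                 [-0.80, 0.59,  0.06, -0.01]])

for k in (2, 3):
    L = rows[:k]
    E = L.T @ L
    print(k, round(float(E[0, 1]), 2))  # internet/web: 0.36 at k=2, -0.12 at k=3
```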
Our Key Observation

We studied how the entries in the expansion matrix depend on the dimension of the subspace to which documents are projected.

[Plots: expansion matrix entry vs. subspace dimension (0 – 600) for the term pairs node/vertex, logic/logics, and logic/vertex]

No single dimension is appropriate for all term pairs, but the shape of the curve is a good indicator for relatedness!
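These curves are cheap to compute for all dimensions at once: if A = UΣV^T and the projection keeps the top k left singular vectors (as in LSI), the expansion matrix is U_k U_k^T, so entry (i, j) as a function of k is a cumulative sum. A sketch (the name `expansion_curve` is ours, not the paper's):

```python
import numpy as np

def expansion_curve(A, i, j):
    """Entry (i, j) of the expansion matrix U_k U_k^T for k = 1, ..., rank(A)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.cumsum(U[i, :] * U[j, :])

A = np.array([[2, 0, 0, 1, 0],
              [1, 2, 0, 1, 0],
              [1, 1, 0, 2, 1],
              [0, 0, 1, 1, 2]], dtype=float)

# at full dimension the expansion matrix is the identity, so diagonal
# curves end at 1 and off-diagonal curves end at 0
print(expansion_curve(A, 0, 0)[-1], expansion_curve(A, 0, 1)[-1])
```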
Curves for related terms

We call two terms perfectly related if they have an identical co-occurrence pattern.

[Slide shows example 0/1 co-occurrence patterns: · · · · · 1 1 0 0 | · · · · · 0 0 1 1 | · · · · · 1 1 1 1 | 0 0 1 1 1 1 0 1 0 | 0 0 1 1 1 0 1 0 1]

[Plots: expansion matrix entry vs. subspace dimension (0 – 600), three panels:
– proven shape for perfectly related terms
– provably small change after slight perturbation
– half way to a real matrix: the up-and-then-down shape remains]

The point of fall-off is different for every term pair!
Curves for unrelated terms

Co-occurrence graph:
– vertices = terms
– edge = two terms co-occur

We call two terms perfectly unrelated if no path connects them in the graph.

[Plots: expansion matrix entry vs. subspace dimension (0 – 600), three panels:
– proven shape for perfectly unrelated terms
– provably small change after slight perturbation
– half way to a real matrix: the curves for unrelated terms are random oscillations around zero]
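The "no path connects them" condition is plain graph reachability; a small sketch over the co-occurrence graph (terms are adjacent when some document contains both):

```python
import numpy as np
from collections import deque

def perfectly_unrelated(A, i, j):
    """True iff no path connects terms i and j in the co-occurrence graph."""
    adj = (A @ A.T) > 0              # terms co-occur iff they share a document
    seen, queue = {i}, deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            return False
        for v in np.flatnonzero(adj[u]):
            if int(v) not in seen:
                seen.add(int(v))
                queue.append(int(v))
    return True

# in the slide's example corpus, d4 links every term to every other one
A = np.array([[2, 0, 0, 1, 0],
              [1, 2, 0, 1, 0],
              [1, 1, 0, 2, 1],
              [0, 0, 1, 1, 2]], dtype=float)
print(perfectly_unrelated(A, 0, 3))  # False: even internet/beach are connected
```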
Telling the shapes apart — TN

1. Normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs
2. For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0

[Plots: three curves over subspace dimension 0 – 600; entries set to 1, 1, 0]

A simple 0-1 classification, no fractional entries!
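A sketch of the TN test. The normalization that makes one fall-off point valid for all term pairs is the technical core of the paper and is not reproduced here; the `falloff` parameter below is a hypothetical stand-in for that precomputed point:

```python
import numpy as np

def tn_entry(A, i, j, falloff):
    """0-1 expansion-matrix entry: 1 iff the curve for the pair (i, j)
    is never negative before the (assumed) fall-off dimension."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    curve = np.cumsum(U[i, :] * U[j, :])  # expansion entry vs. dimension
    return 1 if (curve[:falloff] >= 0).all() else 0

A = np.array([[2, 0, 0, 1, 0],
              [1, 2, 0, 1, 0],
              [1, 1, 0, 2, 1],
              [0, 0, 1, 1, 2]], dtype=float)
print(tn_entry(A, 0, 0, 4))  # a term's curve against itself is a sum of squares: 1
```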
An alternative algorithm — TM

1. Again, normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs
2. For each term pair compute the monotonicity of its initial curve (= 1 if perfectly monotone, approaching 0 as the number of turns increases)
3. If the monotonicity is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0

[Plots: three curves with monotonicity 0.82, 0.69, 0.07; entries set to 1, 1, 0]

Again: a simple 0-1 classification!
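The monotonicity score can be instantiated, for example, as a count of direction changes; the paper's exact measure may differ, this is one plausible reading of "1 if perfectly monotone, falling toward 0 as turns accumulate":

```python
import numpy as np

def monotonicity(curve):
    """1.0 for a perfectly monotone curve; smaller as the number of turns grows."""
    steps = np.sign(np.diff(curve))
    steps = steps[steps != 0]                    # ignore flat segments
    turns = int(np.sum(steps[1:] != steps[:-1]))
    return 1.0 / (1.0 + turns)

print(monotonicity([0, 1, 2, 3]))      # perfectly monotone -> 1.0
print(monotonicity([0, 1, 0, 1, 0]))   # three turns -> 0.25
```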
Experimental results (average precision)

              TIME           REUTERS         OHSUMED
              425 docs       21578 docs      233445 docs
              3882 terms     5701 terms      99117 terms

  COS         63.2%          36.2%           13.2%
  LSI*        62.8%          32.0%            6.9%
  LSI-RN*     58.6%          37.0%           13.0%
  CORR*       59.1%          32.3%           10.9%
  IRR*        62.2%          ——              ——
  TN          64.9%          41.9%           14.4%
  TM          64.1%          42.9%           15.3%

  COS: baseline, cosine similarity in term space
  LSI: Latent Semantic Indexing, Dumais et al. 1990
  LSI-RN: term-normalized LSI, Ding et al. 2001
  CORR: correlation-based LSI, Dupret et al. 2001
  IRR: Iterative Residual Rescaling, Ando & Lee 2001
  TN: our non-negativity test
  TM: our monotonicity test

  * the numbers for LSI, LSI-RN, CORR, IRR are for the best subspace dimension!
Conclusions
Main message: spectral retrieval works through its ability to identify pairs of terms with similar co-occurrence patterns
– a simple 0-1 classification that considers a sequence of subspaces is at least as good as schemes that commit to a fixed subspace
Some useful corollaries …
– new insights into the effect of term-weighting and other normalizations for spectral retrieval
– straightforward integration of known word relationships
– consequences for spectral link analysis?
Obrigado!
Why document "expansion"

Matrix L projecting to 4 dimensions:

   0.42  0.51  0.66  0.37
   0.33  0.43 -0.08 -0.84
  -0.80  0.59  0.06 -0.01
   0.27  0.45 -0.75  0.41

Expansion matrix L^T L is the identity:

  1 0 0 0
  0 1 0 0
  0 0 1 0
  0 0 0 1

Applied to the document (0, 1, 1, 0) it yields (0, 1, 1, 0): no expansion at all; "internet" is not added although "web" is present.

An ideal expansion matrix has
– high scores for related terms
– low scores for unrelated terms

The expansion matrix L^T L depends on the subspace dimension.