Hsin-Hsi Chen 5-1
Chapter 5 Query Operations
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Paraphrase Problem in IR
• Users often input queries containing terms that do not match the terms used to index the majority of the relevant documents.
• relevance feedback and query modification
– reweighting of the query terms based on the distribution of these terms in the relevant and nonrelevant documents retrieved in response to those queries
– changing the actual terms in the query
Query Reformulation
• basic steps
– query expansion: expanding the original query with new terms
  • feedback information from the user
  • information derived from the set of documents initially retrieved (local set of documents)
  • global information derived from the document collection
– term reweighting: reweighting the terms in the expanded query
User Relevance Feedback
• U: Query is submitted
• S: A list of the retrieved documents is presented
• U: The documents are examined and the relevant ones are marked
• S: The important terms/expressions are selected from the documents that have been identified as relevant
• The relevance feedback cycle is repeated several times
User Relevance Feedback (Continued)
• advantages
– Shields the user from the details of the query reformulation
– Breaks down the whole searching task into a sequence of small steps
– Provides a controlled process designed to emphasize some terms (relevant ones) and de-emphasize others (non-relevant ones)
Query Expansion and Term Reweighting for the Vector Model
• basic idea
– Relevant documents resemble each other
– Non-relevant documents have term-weight vectors which are dissimilar from the ones for the relevant documents
– The reformulated query is moved closer to the term-weight vector space of the relevant documents
Query Expansion and Term Reweighting for the Vector Model (Continued)
Cr: set of relevant documents in the collection
Complement of Cr: set of non-relevant documents in the collection
Dr: set of relevant documents among the retrieved documents, as identified by the user
Dn: set of non-relevant documents among the retrieved documents
Query Expansion and Term Reweighting for the Vector Model (Continued)
• when the complete set Cr of relevant documents is known
$$\vec{q}_{opt} = \frac{1}{|C_r|}\sum_{\vec{d}_j \in C_r}\vec{d}_j \;-\; \frac{1}{N - |C_r|}\sum_{\vec{d}_j \notin C_r}\vec{d}_j$$
• when the set Cr is not known a priori
– Formulate an initial query
– Incrementally change the initial query vector
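• Calculate the modified query
– Standard-Rocchio
$$\vec{q}_m = \alpha\,\vec{q} + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j$$
– Ide-Regular
$$\vec{q}_m = \alpha\,\vec{q} + \beta\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\sum_{\vec{d}_j \in D_n}\vec{d}_j$$
– Ide-Dec-Hi
$$\vec{q}_m = \alpha\,\vec{q} + \beta\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\,\max\nolimits_{non\text{-}relevant}(\vec{d}_j)$$
$\max_{non\text{-}relevant}(\vec{d}_j)$: the highest ranked non-relevant document
• α, β, γ: tuning constants (usually, β > γ)
– α = 1 (Rocchio, 1971)
– α = β = γ = 1 (Ide, 1971)
– γ = 0: positive feedback
• The sums over the feedback documents add new terms to the query (query expansion) and change the weights of the existing terms (term reweighting)
• The three methods give similar performance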
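The three update rules on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the lecturer's code: the sparse dictionary representation, the function names, and the default value 1.0 for all three tuning constants are assumptions, and negative weights are kept as they are rather than clipped.

```python
from typing import Dict, List

Vector = Dict[str, float]  # assumed representation: term -> weight


def _add_scaled(target: Vector, source: Vector, scale: float) -> None:
    """Add scale * source to target in place."""
    for term, weight in source.items():
        target[term] = target.get(term, 0.0) + scale * weight


def rocchio(q: Vector, rel: List[Vector], nonrel: List[Vector],
            alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> Vector:
    """Standard Rocchio: add the mean relevant vector, subtract the mean non-relevant one."""
    qm: Vector = {}
    _add_scaled(qm, q, alpha)
    for d in rel:
        _add_scaled(qm, d, beta / len(rel))
    for d in nonrel:
        _add_scaled(qm, d, -gamma / len(nonrel))
    return qm


def ide_regular(q: Vector, rel: List[Vector], nonrel: List[Vector],
                alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> Vector:
    """Ide-Regular: same as Rocchio but without dividing by the set sizes."""
    qm: Vector = {}
    _add_scaled(qm, q, alpha)
    for d in rel:
        _add_scaled(qm, d, beta)
    for d in nonrel:
        _add_scaled(qm, d, -gamma)
    return qm


def ide_dec_hi(q: Vector, rel: List[Vector], nonrel_ranked: List[Vector],
               alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> Vector:
    """Ide-Dec-Hi: subtract only the highest ranked non-relevant document.

    Assumes nonrel_ranked is sorted by the original ranking, best first.
    """
    return ide_regular(q, rel, nonrel_ranked[:1], alpha, beta, gamma)
```

Terms that appear only in the feedback documents enter the modified query (expansion), while terms already in the query get new weights (reweighting). Setting `gamma=0.0` gives positive feedback only.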
Positive relevance feedback: α = β = 1 and γ = 0
• “dec hi” method: use all relevant information, but subtract only the highest ranked nonrelevant document
• feedback with query splitting: addresses two problems: (1) the relevant documents identified do not form a tight cluster; (2) nonrelevant documents are scattered among certain relevant ones
[Figure: the relevant documents are split into two groups of homogeneous relevant items]
Analysis
• advantages
– simplicity
– good results
• disadvantages
– No optimality criterion is adopted
Term Weighting for the Probabilistic Model
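• The similarity of a document dj to a query q
$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log\frac{P(k_i|R)}{1 - P(k_i|R)} + \log\frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$
$P(k_i|R)$: the probability of observing the term ki in the set R of relevant documents
$P(k_i|\bar{R})$: the probability of observing the term ki in the set $\bar{R}$ of non-relevant documents
Initial search:
$$P(k_i|R) = 0.5, \qquad P(k_i|\bar{R}) = \frac{n_i}{N}$$
(ni: the number of documents containing ki; N: the number of documents in the collection)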
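Initial search: $P(k_i|R) = 0.5$, $P(k_i|\bar{R}) = n_i / N$
$$\begin{aligned}
sim(d_j, q) &\sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log\frac{P(k_i|R)}{1 - P(k_i|R)} + \log\frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) \\
&= \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log\frac{0.5}{1 - 0.5} + \log\frac{1 - n_i/N}{n_i/N} \right) \\
&= \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log\frac{N - n_i}{n_i}
\end{aligned}$$
Feedback search:
$$P(k_i|R) = \frac{V_i}{V} = \frac{|D_{r,i}|}{|D_r|}$$
$$P(k_i|\bar{R}) = \frac{n_i - V_i}{N - V} = \frac{n_i - |D_{r,i}|}{N - |D_r|}$$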
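Feedback search:
$$P(k_i|R) = \frac{|D_{r,i}|}{|D_r|}, \qquad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}$$
$$\begin{aligned}
sim(d_j, q) &\sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log\frac{P(k_i|R)}{1 - P(k_i|R)} + \log\frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) \\
&= \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log\frac{\frac{|D_{r,i}|}{|D_r|}}{1 - \frac{|D_{r,i}|}{|D_r|}} + \log\frac{1 - \frac{n_i - |D_{r,i}|}{N - |D_r|}}{\frac{n_i - |D_{r,i}|}{N - |D_r|}} \right) \\
&= \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \log\left( \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \cdot \frac{N - |D_r| - (n_i - |D_{r,i}|)}{n_i - |D_{r,i}|} \right)
\end{aligned}$$
No query expansion occurs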
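For small values of |Dr| and |Dr,i| (e.g., |Dr| = 1, |Dr,i| = 0), the estimates are adjusted.
Alternative 1:
$$P(k_i|R) = \frac{V_i + 0.5}{V + 1} = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}$$
$$P(k_i|\bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1} = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$$
Alternative 2:
$$P(k_i|R) = \frac{|D_{r,i}| + \frac{n_i}{N}}{|D_r| + 1}$$
$$P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + \frac{n_i}{N}}{N - |D_r| + 1}$$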
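The initial and feedback term weights of the probabilistic model can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the feedback weight uses the 0.5 adjustment (Alternative 1), and binary term weights are assumed so that a document's score is the sum of the weights of the query terms it contains.

```python
import math
from typing import Dict, Set


def initial_weight(n_i: int, N: int) -> float:
    """Initial search: P(ki|R) = 0.5 and P(ki|not R) = n_i / N.

    Requires 0 < n_i < N.
    """
    return math.log((N - n_i) / n_i)


def feedback_weight(n_i: int, N: int, dr: int, dr_i: int) -> float:
    """Feedback search with the 0.5 adjustment (Alternative 1).

    dr   = |Dr|   (relevant documents marked by the user)
    dr_i = |Dr,i| (those among them that contain term ki)
    """
    p_rel = (dr_i + 0.5) / (dr + 1)
    p_nonrel = (n_i - dr_i + 0.5) / (N - dr + 1)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)


def score(doc_terms: Set[str], query_terms: Set[str],
          weight: Dict[str, float]) -> float:
    """Sum the weights of the query terms present in the document
    (assumes binary w_{i,q} and w_{i,j})."""
    return sum(weight[k] for k in query_terms & doc_terms)
```

With the adjustment, the case |Dr| = 1, |Dr,i| = 0 no longer produces log(0): `feedback_weight(10, 1000, 1, 0)` is finite.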
Analysis
• advantages
– Feedback process is directly related to the derivation of new weights for query terms
– The term reweighting is optimal (under the assumptions of term independence and binary document indexing)
• disadvantages
– Document term weights are not considered
– Weights of terms in previous query formulations are disregarded
– No query expansion is used
A Variant of Probabilistic Term Reweighting
• variant
– distinct initial search method
– include within-document frequency weights
• initial search
$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j}\, F_{i,j,q}$$
$$F_{i,j,q} = (C + idf_i)\, \bar{f}_{i,j}$$
$$\bar{f}_{i,j} = K + (1 - K)\,\frac{f_{i,j}}{\max(f_{i,j})}$$
Similar to the tf-idf scheme
• C = 0 for automatically indexed collections or for feedback searching (allow IDF or the relevance weighting to be the dominant factor)
• C > 0 for manually indexed collections (allow the mere existence of a term within a document to carry more weight)
• K = 0.3 for initial search of regular length documents (documents having many multiple occurrences of a term)
• K = 0.5 for feedback searches
• K = 1 for short documents: the within-document frequency is removed (the within-document frequency plays a minimum role)
Feedback search
$$F_{i,j,q} = \left( C + \log\frac{P(k_i|R)}{1 - P(k_i|R)} + \log\frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) \bar{f}_{i,j}$$
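A minimal sketch of the factor F for the initial and the feedback search is given below. The function names are hypothetical, and the maximum in the normalized frequency is assumed to be the largest term frequency within the document; the default values of C and K follow the settings listed above.

```python
import math


def normalized_tf(f_ij: float, max_f: float, K: float) -> float:
    """Normalized within-document frequency: K + (1 - K) * f / max(f).

    max_f is assumed to be the largest term frequency in document dj.
    """
    return K + (1.0 - K) * f_ij / max_f


def f_initial(f_ij: float, max_f: float, idf_i: float,
              C: float = 0.0, K: float = 0.3) -> float:
    """Initial search: F = (C + idf_i) * normalized frequency."""
    return (C + idf_i) * normalized_tf(f_ij, max_f, K)


def f_feedback(f_ij: float, max_f: float, p_rel: float, p_nonrel: float,
               C: float = 0.0, K: float = 0.5) -> float:
    """Feedback search: idf_i is replaced by the relevance weight of ki.

    p_rel = P(ki|R), p_nonrel = P(ki|not R); both must lie strictly in (0, 1).
    """
    relevance = (math.log(p_rel / (1.0 - p_rel))
                 + math.log((1.0 - p_nonrel) / p_nonrel))
    return (C + relevance) * normalized_tf(f_ij, max_f, K)
```

With `K=1.0` the normalized frequency is always 1, which matches the remark that the within-document frequency is removed for short documents.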
Analysis
• advantages
– The within-document frequencies are considered
– A normalized version of these frequencies is adopted
– Constants C and K are introduced
• disadvantages
– more complex formulation
– no query expansion
Evaluation of relevance feedback
• The standard evaluation method (i.e., recall-precision) is not suitable, because the relevant documents used to reweight the query terms move to higher ranks.
• The residual collection method
– the evaluation of the results compares only the residual collections, i.e., the initial run is remade minus the documents previously shown to the user and this is compared with the feedback run minus the same documents
Note that the recall-precision figures for qm tend to be lower than the figures for the original query vector q in the residual collection
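The residual collection method can be sketched as follows, assuming rankings are plain lists of document identifiers. The function names and the use of precision at a cutoff k as the comparison measure are illustrative assumptions.

```python
from typing import Iterable, List, Set


def residual_ranking(ranking: List[str], shown: Iterable[str]) -> List[str]:
    """Remove the documents already shown to the user, keeping the order."""
    seen = set(shown)
    return [d for d in ranking if d not in seen]


def precision_at(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


# Hypothetical runs: the user saw the top 3 documents of the initial run.
initial_run = ["d1", "d2", "d3", "d4", "d5", "d6"]
feedback_run = ["d1", "d5", "d2", "d6", "d3", "d4"]
shown = initial_run[:3]
relevant = {"d1", "d5", "d6"}

initial_residual = residual_ranking(initial_run, shown)
feedback_residual = residual_ranking(feedback_run, shown)
```

Both runs are then scored on the same residual collection, for example with `precision_at(initial_residual, relevant, 2)` against `precision_at(feedback_residual, relevant, 2)`, so that documents the user has already judged cannot inflate the feedback run.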
Residual Collection with Partial Rank Freezing
• The previously retrieved items identified as relevant are kept “frozen”, and the previously retrieved nonrelevant items are simply removed from the collection.
Assume 10 documents are relevant.
Automatic Local Analysis
• user relevance feedback
– Known relevant documents contain terms which can be used to describe a larger cluster of relevant documents with assistance from the user (clustering)
• automatic analysis
– Obtain a description (in terms of terms) for a larger cluster of relevant documents automatically
– global strategy: a global thesaurus-like structure is trained from all documents before querying
– local strategy: terms from the documents retrieved for a given query are selected at query time
Local Feedback Strategy
• Internet
– client site
  • Retrieving the text of 100 Web documents for local analysis would take too long
– server site
  • Analyzing the text of 100 Web documents would spend extra CPU time
• Applications
– Intranet
– Specialized document collections, e.g., medical document collections
Query Expansion-Local Clustering
• stem
– V(s): a non-empty subset of words which are grammatical variants of each other, e.g., {polish, polishing, polished}
– A canonical form s of V(s) is called a stem, e.g., polish
• local document set Dl
– the set of documents retrieved for a given query
• local vocabulary Vl (Sl)
– the set of all distinct words (stems) in the local document set
local cluster
• basic concept
– Expanding the query with terms correlated to the query terms
– The correlated terms are presented in the local clusters built from the local document set
• local clusters
– association clusters: co-occurrences of pairs of terms in documents
– metric clusters: distance factor between two terms
– scalar clusters: terms with similar neighborhoods have some synonymity relationship
Association Clusters
• idea
– Based on the co-occurrence of stems (or terms) inside documents
• association matrix
– $f_{s_i,j}$: the frequency of a stem si in a document dj (dj ∈ Dl)
– $\vec{m} = (f_{s_i,j})$: an association matrix with |Sl| rows and |Dl| columns
– $\vec{s} = \vec{m}\,\vec{m}^{t}$: a local stem-stem association matrix
$$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$$
$c_{u,v}$: a correlation between the stems su and sv (an element of $\vec{m}\,\vec{m}^{t}$)
$$s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$$
: normalized matrix
$s_{u,v} = c_{u,v}$: unnormalized matrix
$S_u(n)$: local association cluster around the stem su
– Take the u-th row
– Return the set of the n largest values $s_{u,v}$ (u ≠ v)
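The construction of a local association cluster can be sketched in plain Python. The input format (one list of per-document frequencies for each stem), the function names, and the toy data are assumptions made for illustration.

```python
from typing import Dict, List

Matrix = Dict[str, Dict[str, float]]


def association_matrix(freq: Dict[str, List[int]]) -> Matrix:
    """c[u][v] = sum over the local documents of f(su, j) * f(sv, j)."""
    return {
        u: {v: float(sum(a * b for a, b in zip(fu, fv)))
            for v, fv in freq.items()}
        for u, fu in freq.items()
    }


def normalize(c: Matrix) -> Matrix:
    """s[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])."""
    return {
        u: {v: c[u][v] / (c[u][u] + c[v][v] - c[u][v]) for v in c[u]}
        for u in c
    }


def cluster(s: Matrix, u: str, n: int) -> List[str]:
    """Stems with the n largest values in the u-th row, excluding u itself."""
    row = [(value, v) for v, value in s[u].items() if v != u]
    row.sort(reverse=True)
    return [v for _, v in row[:n]]


# Hypothetical stem frequencies in three locally retrieved documents.
freq = {"polish": [2, 1, 0], "wax": [1, 1, 0], "poland": [0, 0, 3]}
c = association_matrix(freq)
s = normalize(c)
```

Here `cluster(s, "polish", 1)` returns the stem that co-occurs most strongly with "polish" in the local document set; passing `c` instead of `s` uses the unnormalized matrix.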
Metric Clusters
• idea
– Consider the distance between two terms in the computation of their correlation factor
• local stem-stem metric correlation matrix
– r(ki,kj): the number of words between keywords ki and kj in the same document
– cu,v: metric correlation between stems su and sv
$$c_{u,v} = \sum_{k_i \in V(s_u)} \; \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}$$
– If ki and kj are in distinct documents, $r(k_i, k_j) = \infty$, so $\frac{1}{r(k_i, k_j)} = 0$
$$s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}$$
: normalized matrix
$s_{u,v} = c_{u,v}$: unnormalized matrix
$S_u(n)$: local metric cluster around the stem su
– Take the u-th row
– Return the set of the n largest values $s_{u,v}$ (u ≠ v)
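A small sketch of the metric correlation follows. Two simplifications are assumptions of this sketch and not taken from the slides: the distance r(ki,kj) is measured as the difference between word positions (so adjacent words have distance 1 and no division by zero occurs), and every pair of occurrences within the same document contributes to the sum. Word pairs in different documents contribute nothing, which corresponds to an infinite distance.

```python
from typing import List, Set


def metric_correlation(docs: List[List[str]],
                       variants_u: Set[str], variants_v: Set[str]) -> float:
    """c_{u,v}: sum of 1 / distance over occurrences of V(su) and V(sv)
    that appear in the same document."""
    c = 0.0
    for tokens in docs:
        pos_u = [p for p, w in enumerate(tokens) if w in variants_u]
        pos_v = [p for p, w in enumerate(tokens) if w in variants_v]
        for p in pos_u:
            for q in pos_v:
                if p != q:  # skip a word paired with itself
                    c += 1.0 / abs(p - q)
    return c


def normalized_metric(c_uv: float,
                      variants_u: Set[str], variants_v: Set[str]) -> float:
    """s_{u,v} = c_{u,v} / (|V(su)| * |V(sv)|)."""
    return c_uv / (len(variants_u) * len(variants_v))


# Hypothetical local document set, already tokenized.
docs = [["polish", "the", "floor"], ["floors", "need", "polishing"]]
v_polish = {"polish", "polishing", "polished"}
v_floor = {"floor", "floors"}
c_uv = metric_correlation(docs, v_polish, v_floor)
```

In this toy set each document contributes 1/2, so stems whose variants appear close together accumulate a larger correlation than stems that merely share a document.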
Scalar Clusters
• idea
– Two stems with similar neighborhoods have some synonymity relationship
– The relationship is indirect or induced by the neighborhood
– The row corresponding to a specific term in a term co-occurrence matrix forms its neighborhood
– The direct correlation value for su and sv in this matrix may be small
• scalar association matrix
$$s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$$
$S_u(n)$: local scalar cluster around the stem su
– Take the u-th row
– Return the set of the n largest values $s_{u,v}$ (u ≠ v)
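The scalar measure is the cosine between two rows of a correlation matrix, which can be sketched as below. The matrix values and the function names are hypothetical; the rows are chosen so that two stems with no direct correlation still share part of their neighborhood.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two neighborhood vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def scalar_cluster(rows: Dict[str, List[float]], u: str, n: int) -> List[str]:
    """Stems whose rows are most similar to the row of stem u."""
    scored = [(cosine(rows[u], row), v) for v, row in rows.items() if v != u]
    scored.sort(reverse=True)
    return [v for _, v in scored[:n]]


# Hypothetical correlation matrix; columns follow the order
# car, automobile, engine, river.
rows = {
    "car":        [16.0, 0.0, 8.0, 0.0],
    "automobile": [0.0, 16.0, 8.0, 0.0],
    "engine":     [8.0, 8.0, 16.0, 0.0],
    "river":      [0.0, 0.0, 0.0, 9.0],
}
```

In this example the direct correlation between "car" and "automobile" is 0, yet `cosine(rows["car"], rows["automobile"])` is positive because both stems correlate with "engine".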
Interactive Search Formulation
• neighbors of the query term sv
– Terms su belonging to clusters associated to sv, i.e., su ∈ Sv(n)
– su is called a searchonym of sv
[Figure: stems plotted as points (x); the cluster Sv(n) encloses the stem sv and its neighbor su]
Interactive Search Formulation (Continued)
• Algorithm
– For each stem sv ∈ q, select m neighbor stems from the cluster Sv(n) and add them to the query
– Merge normalized and unnormalized clusters: normalized clusters tend to group rarer stems, while unnormalized clusters tend to group stems with large frequencies
• Extension
– Let su and sv be correlated with a correlation factor cu,v
– If cu,v is larger than a predefined threshold, then a neighbor stem su' of su can also be interpreted as a neighbor stem of sv, and vice versa.
Query Expansion through Local Context Analysis
• local analysis
– Based on the set of documents retrieved for the original query
– Based on term co-occurrence inside documents
– Terms closest to individual query terms are selected
• global analysis
– Based on the whole document collection
– Based on term co-occurrence inside small contexts and phrase structures
– Terms closest to the whole query are selected
Query Expansion through Local Context Analysis (Continued)
• candidates
– noun groups instead of simple keywords
– single noun, two adjacent nouns, or three adjacent nouns
• query expansion
– Concepts are selected from the top ranked documents (as in local analysis)
– Passages are used for determining co-occurrence (as in global analysis)
Query Expansion through Local Context Analysis (Continued)
• algorithm
– Retrieve the top n ranked passages using the original query
– For each concept c in the top ranked passages, the similarity sim(q,c) between the whole query q and the concept c is computed using a variant of tf-idf ranking
– The top m ranked concepts are added to the original query q
  • Each concept is assigned a weight 1 − 0.9 × i/m (i: rank)
  • Each term in the original query is assigned a weight 2 × original weight
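$$sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log\left( f(c, k_i) \times idf_c \right)}{\log n} \right)^{idf_i}$$
$$f(c, k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}$$
$$idf_i = \max\left(1, \frac{\log_{10}(N / np_i)}{5}\right), \qquad idf_c = \max\left(1, \frac{\log_{10}(N / np_c)}{5}\right)$$
n: # of ranked passages
f(c, ki): correlation between c and ki (as in association clusters, but computed over passages)
pfi,j (pfc,j): frequency of ki (c) in the j-th passage
N: # of passages in the collection
npi: # of passages containing term ki
npc: # of passages containing concept c
δ: constant, 0.1
idf ≥ 1: when np is large (small), the second argument of max may be smaller (larger) than 1
The exponent idfi emphasizes infrequent query terms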
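The ranking of a candidate concept can be sketched as follows. The function names and argument layout are hypothetical. One detail is an assumption of this sketch: when a concept never co-occurs with a query term, f(c,ki) = 0 and the logarithm is undefined, so that term is made to contribute δ only. The number of passages n must be greater than 1 so that log n is non-zero.

```python
import math
from typing import Dict, List


def idf(N: int, np_x: int) -> float:
    """max(1, log10(N / np) / 5), used for both idf_i and idf_c."""
    return max(1.0, math.log10(N / np_x) / 5.0)


def lca_similarity(query_terms: List[str],
                   pf_terms: Dict[str, List[int]], pf_c: List[int],
                   N: int, np_terms: Dict[str, int], np_c: int,
                   delta: float = 0.1) -> float:
    """sim(q, c) over the n top-ranked passages (n = len(pf_c) > 1)."""
    n = len(pf_c)
    idf_c = idf(N, np_c)
    sim = 1.0
    for k in query_terms:
        # f(c, ki): passage-level co-occurrence of the concept and the term
        f = sum(a * b for a, b in zip(pf_terms[k], pf_c))
        # Assumption: no co-occurrence means the term contributes delta only
        ratio = math.log(f * idf_c) / math.log(n) if f > 0 else 0.0
        sim *= (delta + ratio) ** idf(N, np_terms[k])
    return sim


def concept_weight(i: int, m: int) -> float:
    """Weight of the concept ranked i among the m added concepts."""
    return 1.0 - 0.9 * i / m
```

Because the per-term factors are multiplied, a concept must co-occur with all query terms to score well, which is how the method favors terms close to the whole query rather than to a single query term.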
Automatic Global Analysis
• local analysis
– Extract information from the local set of documents (passages) retrieved
• global analysis
– Expand the query using information from the whole set of documents in the collection
– Issues
  • How to build the thesaurus
  • How to select the terms for query expansion
– Query expansion based on similarity thesaurus
– Query expansion based on statistical thesaurus
Similarity Thesaurus
• How to build the thesaurus
– Consider term to term relationship instead of co-occurrence
• How to select the terms for query expansion
– Consider the similarity to the whole query instead of individual query terms
Concept Space
• basic idea
– Each term is indexed by the documents in which it appears
– The role of terms and documents is interchanged in the concept space
• t: the number of terms in the collection
• N: the number of documents in the collection
• fi,j: the frequency of term ki in document dj
• tj: the number of distinct index terms in document dj
• itfj: inverse term frequency for document dj
$$itf_j = \log\frac{t}{t_j}$$
(itfj measures the ability of dj to discriminate index terms: the more index terms dj contains, the lower its discriminating power)
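Each term ki is associated with a vector
$$\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$$
where
$$w_{i,j} = \frac{\left( 0.5 + 0.5\,\frac{f_{i,j}}{\max_j(f_{i,j})} \right) itf_j}{\sqrt{\sum_{l=1}^{N} \left( 0.5 + 0.5\,\frac{f_{i,l}}{\max_l(f_{i,l})} \right)^2 itf_l^{\,2}}}$$
The relationship between two terms ku and kv is computed as
$$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}$$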
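The term vectors and the term-term correlation can be sketched as below. The input layout (one list of per-document frequencies for each term) and the function names are assumptions. The sketch also assumes that a term receives weight 0 in documents where it does not occur, which the slide formula does not state explicitly.

```python
import math
from typing import Dict, List


def term_vectors(freq: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Build unit-length term vectors k_i indexed by documents.

    freq[term][j] is f_{i,j}, the frequency of the term in document dj.
    """
    t = len(freq)
    n_docs = len(next(iter(freq.values())))
    # t_j: number of distinct index terms in document dj
    t_j = [sum(1 for f in freq.values() if f[j] > 0) for j in range(n_docs)]
    itf = [math.log(t / tj) if tj else 0.0 for tj in t_j]
    vectors: Dict[str, List[float]] = {}
    for term, f in freq.items():
        max_f = max(f)
        # Assumption: weight is 0 where the term does not occur
        raw = [(0.5 + 0.5 * fj / max_f) * itf[j] if fj > 0 else 0.0
               for j, fj in enumerate(f)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm if norm else 0.0 for x in raw]
    return vectors


def correlation(k_u: List[float], k_v: List[float]) -> float:
    """c_{u,v}: dot product of two term vectors."""
    return sum(a * b for a, b in zip(k_u, k_v))


# Hypothetical collection: 4 terms, 4 documents.
freq = {"car": [2, 1, 0, 0], "auto": [1, 2, 0, 0],
        "river": [0, 0, 3, 1], "bank": [0, 0, 1, 2]}
vecs = term_vectors(freq)
```

Terms that occur in the same documents with similar relative frequencies, such as "car" and "auto" in this toy data, receive a correlation close to 1.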
Query Expansion using Global Similarity Thesaurus
• Represent the query in the concept space used for representation of the index terms
$$\vec{q} = \sum_{k_i \in q} w_{i,q}\, \vec{k}_i$$
• Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q
$$sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}$$
(ku: query term; kv: candidate expansion term)
Query Expansion using Global Similarity Thesaurus (Continued)
• Expand the query with the top r ranked terms according to sim(q,kv)
$$w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}$$
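The expansion step can be sketched as follows, assuming the thesaurus is available as a nested dictionary of correlations c[u][v] that covers every query term. The function name and the toy correlation values are hypothetical.

```python
from typing import Dict

Thesaurus = Dict[str, Dict[str, float]]


def expand_query(query: Dict[str, float], c: Thesaurus, r: int) -> Dict[str, float]:
    """Add the r terms most similar to the whole query.

    query maps each query term ku to its weight w_{u,q};
    c[u][v] is the thesaurus correlation between terms ku and kv.
    """
    total = sum(query.values())
    # sim(q, kv) = sum over query terms of w_{u,q} * c_{u,v}
    sims = {v: sum(w_u * c[u][v] for u, w_u in query.items())
            for v in c if v not in query}
    top = sorted(sims, key=lambda v: sims[v], reverse=True)[:r]
    expanded = dict(query)
    for v in top:
        expanded[v] = sims[v] / total  # weight of the added term
    return expanded


# Hypothetical thesaurus over four terms.
c = {
    "car":    {"car": 1.0, "engine": 0.5, "auto": 0.75, "river": 0.0},
    "engine": {"car": 0.5, "engine": 1.0, "auto": 0.5, "river": 0.25},
    "auto":   {"car": 0.75, "engine": 0.5, "auto": 1.0, "river": 0.0},
    "river":  {"car": 0.0, "engine": 0.25, "auto": 0.0, "river": 1.0},
}
query = {"car": 1.0, "engine": 1.0}
```

With `r=1` the term "auto" is added, since its summed correlation with both query terms is the largest among the candidates.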
[Figure: terms Ki, Kj, Kv, Ka and Kb in the concept space; the query Q = {Ka, Kb} has centroid Qc; one term is labeled as the expansion term]
GVSM vs. Query Expansion
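Query expansion:
$$sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}$$
GVSM:
$$\vec{d}_j = \sum_{k_i \in d_j} w_{i,j}\, \vec{k}_i$$
$$sim(q, d_j) \propto \sum_{k_v \in d_j} \; \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}$$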
Only the top r ranked terms are used for query expansion.