Putting Query Representation and Understanding in Context:
A Decision-Theoretic Framework for Optimal Interactive Retrieval through Dynamic User Modeling

ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
(Including joint work with Xuehua Shen and Bin Tan)
SIGIR 2010 Workshop on Query Representation and Understanding, July 23, 2010, Geneva, Switzerland
What is a query?
Query = a sequence of keywords that describes the information need of a particular user at a particular time for finishing a particular task.

Example: "iPhone battery" typed into a search box carries rich context!

Query = just a sequence of keywords? No: a query must be put in context.
Example query: "Jaguar" — Mac OS? Car? Animal?

What queries did the user type in before this query? What documents were just viewed by this user? What documents were skipped by this user? What other users looked for similar information? ...

Context helps query understanding.
Suppose we know:
1. Previous query = "racing cars" vs. "Apple OS"
2. "car" occurs far more frequently than "Apple" in pages browsed by the user in the last 20 days
3. User just viewed an "Apple OS" document
[Diagram: each piece of evidence shifts the interpretation of "Jaguar" among car, software, and animal]
Questions
• How can we model a query in a context-sensitive way? → Generalize query representation to a user model
• How can we model the dynamics of user information needs? → Dynamic updating of user models
• How can we put query representation into a retrieval framework to improve search? → A framework for optimal interactive retrieval
Rest of the talk: UCAIR Project
1. A decision-theoretic framework
2. Statistical language models for implicit feedback (personalized search without extra user effort)
3. Open challenges
UCAIR Project
• UCAIR = User-Centered Adaptive IR
  – user modeling ("user-centered")
  – search context modeling ("adaptive")
  – interactive retrieval
• Implemented as a personalized search agent that
  – sits on the client side (owned by the user)
  – integrates information around a user (1 user vs. N sources, as opposed to 1 source vs. N users)
  – collaborates with other search agents
  – goes beyond search toward task support
Main Idea: Putting the User in the Center!
[Diagram: a personalized search agent sits on the client side between the user and multiple search engines, the Web, and desktop files, integrating the user's query history and viewed Web pages around the query "java"]

A search agent can know about a particular user very well.
1. A Decision-Theoretic Framework for Optimal Interactive Retrieval
IR as Sequential Decision Making

The user (with an information need) and the system (with a model of that information need) alternate actions:
• User A1: Enter a query. System: Which documents to present? How to present them? → Ri: results (i = 1, 2, 3, ...)
• User: Which documents to view? A2: View a document. System: Which part of the document to show? How? → R': document content
• User: View more? A3: Click on "Back" button. ...
Retrieval Decisions

User U issues actions A1, A2, ..., At; the system returns responses R1, R2, ..., Rt-1 over document collection C. Given U, C, At, and the history H = {(Ai, Ri)}, i = 1, ..., t-1, choose the best response Rt from r(At), the set of all possible responses to At.

• At = query "Jaguar": r(At) = all possible rankings of C; Rt = the best ranking for the query.
• At = click on "Next" button: r(At) = all possible rankings of unseen docs; Rt = the best ranking of the unseen docs.
A Risk Minimization Framework

• Observed: user U, interaction history H, current user action At, document collection C
• All possible responses: r(At) = {r1, ..., rn}
• User model (inferred): M = (θU, S, ...) — information need θU, seen docs S
• Loss function: L(ri, At, M)
• Optimal response r* (minimum Bayes risk):

$$R_t = \arg\min_{r \in r(A_t)} \int_M L(r, A_t, M)\, P(M \mid U, H, A_t, C)\, dM$$
A Simplified Two-Step Decision-Making Procedure

• Approximate the Bayes risk by the loss at the mode of the posterior distribution
• Two-step procedure:
  – Step 1: Compute an updated user model M* based on the currently available information
  – Step 2: Given M*, choose a response to minimize the loss function (sketched below)

$$
\begin{aligned}
R_t &= \arg\min_{r \in r(A_t)} \int_M L(r, A_t, M)\, P(M \mid U, H, A_t, C)\, dM \\
    &\approx \arg\min_{r \in r(A_t)} L(r, A_t, M^*)\, P(M^* \mid U, H, A_t, C) \\
    &= \arg\min_{r \in r(A_t)} L(r, A_t, M^*), \\
\text{where } M^* &= \arg\max_M P(M \mid U, H, A_t, C)
\end{aligned}
$$
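To make the two-step procedure concrete, here is a minimal Python sketch (function and parameter names are illustrative, not from the talk; the posterior mode and loss function are placeholders that the cases below instantiate):

```python
def two_step_response(candidates, action, posterior_mode, loss):
    """Simplified two-step decision procedure (sketch).

    candidates: r(A_t), the possible responses to the current action
    posterior_mode: callable computing M* = argmax_M P(M | U, H, A_t, C)
    loss: callable computing L(r, A_t, M)
    """
    m_star = posterior_mode(action)  # Step 1: update the user model
    # Step 2: choose the response minimizing the loss under M*
    return min(candidates, key=lambda r: loss(r, action, m_star))
```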
Optimal Interactive Retrieval

[Diagram: for each user action At over collection C, the IR system infers M*t = argmax P(Mt | U, H, At, C), then returns the response Rt minimizing L(r, At, M*t); the cycle repeats for A1, A2, A3, ...]
Refinement of Risk Minimization
• r(At): decision space (At-dependent)
  – r(At) = all possible subsets of C (document selection)
  – r(At) = all possible rankings of docs in C
  – r(At) = all possible rankings of unseen docs
  – r(At) = all possible subsets of C + summarization strategies
• M: user model
  – Essential component: θU = user information need
  – S = seen documents
  – n = "topic is new to the user"
• L(Rt, At, M): loss function
  – Generally measures the utility of Rt for a user modeled as M
  – Often encodes retrieval criteria (e.g., using M to select a ranking of docs)
• P(M | U, H, At, C): user model inference
  – Often involves estimating a unigram language model θU
Case 1: Context-Insensitive IR
– At = "enter a query Q"
– r(At) = all possible rankings of docs in C
– M = θU, a unigram language model (word distribution)
– p(M | U, H, At, C) = p(θU | Q)

$$L(r_i, A_t, M) = L\big((d_{i_1}, \ldots, d_{i_N}), \theta_U\big) = \sum_{j=1}^{N} p(\mathrm{viewed} \mid d_{i_j})\, D(\theta_U \,\|\, \theta_{d_{i_j}})$$

Since $p(\mathrm{viewed} \mid d_{i_1}) \ge p(\mathrm{viewed} \mid d_{i_2}) \ge \ldots$, the optimal ranking $R_t$ is given by ranking documents in increasing order of $D(\theta_U \,\|\, \theta_d)$ (see the sketch below).
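A minimal Python sketch of this ranking criterion, assuming language models are plain word-to-probability dicts; the small epsilon stands in for smoothing, whose details the slide does not specify:

```python
import math

def kl_divergence(theta_u, theta_d):
    """D(theta_U || theta_d); epsilon guards against zero document probabilities."""
    return sum(p * math.log(p / theta_d.get(w, 1e-12))
               for w, p in theta_u.items() if p > 0)

def rank_by_kl(theta_u, doc_models):
    """Case 1: the optimal ranking orders docs by increasing D(theta_U || theta_d)."""
    return sorted(doc_models, key=lambda d: kl_divergence(theta_u, doc_models[d]))
```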
Case 2: Implicit Feedback
– At = "enter a query Q"
– r(At) = all possible rankings of docs in C
– M = θU, a unigram language model (word distribution)
– H = {previous queries} + {viewed snippets}
– p(M | U, H, At, C) = p(θU | Q, H)

The loss function is the same as in Case 1, so the optimal ranking again orders documents by D(θU || θd); the difference is that θU is now estimated from both the query Q and the history H.
Case 3: General Implicit Feedback
– At = "enter a query Q", or click on the "Back" or "Next" button
– r(At) = all possible rankings of unseen docs in C
– M = (θU, S), S = seen documents
– H = {previous queries} + {viewed snippets}
– p(M | U, H, At, C) = p(θU | Q, H)

The same loss function applies, restricted to unseen documents: rank the unseen docs by D(θU || θd).
Case 4: User-Specific Result Summary
– At = "enter a query Q"
– r(At) = {(D, η)}, D ⊆ C, |D| = k, η ∈ {"snippet", "overview"}
– M = (θU, n), n ∈ {0, 1}: "topic is new to the user"
– p(M | U, H, At, C) = p(θU, n | Q, H); M* = (θ*, n*)

$$L(r_i, A_t, M) = L\big((D_i, \eta_i), \theta^*, n^*\big) = L(D_i, \theta^*) + L(\eta_i, n^*) = \sum_{d \in D_i} D(\theta^* \,\|\, \theta_d) + L(\eta_i, n^*)$$

L(ηi, n*):         n* = 1    n* = 0
ηi = snippet         1         0
ηi = overview        0         1

Choose the k most relevant docs. If the topic is new (n* = 1), give an overview summary; otherwise, a regular snippet summary (see the sketch below).
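A minimal sketch of the Case 4 decision rule (names are hypothetical; it assumes the per-document divergences D(θ* || θd) and the novelty estimate n* have already been computed):

```python
def summarize_results(divergence, n_star, k=10):
    """Pick the k lowest-divergence docs, then choose the zero-loss summary type.

    divergence: dict mapping doc id -> D(theta* || theta_d)
    n_star: 1 if the topic is new to the user, else 0
    """
    top_k = sorted(divergence, key=divergence.get)[:k]  # minimize sum of divergences
    eta = "overview" if n_star == 1 else "snippet"      # zero-loss cell of L(eta, n*)
    return top_k, eta
```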
2. Statistical Language Models for Implicit Feedback
(Personalized search without extra user effort)
Risk Minimization for Implicit Feedback
– At = "enter a query Q"
– r(At) = all possible rankings of docs in C
– M = θU, a unigram language model (word distribution)
– H = {previous queries} + {viewed snippets}
– p(M | U, H, At, C) = p(θU | Q, H)

As in Case 2, the optimal ranking orders documents by D(θU || θd). The remaining problem: we need to estimate a context-sensitive LM, p(θU | Q, H).
Estimate a Context-Sensitive LM

• User queries Q1, ..., Qk (e.g., Q1 = "Apple software", ..., Qk = "Jaguar")
• User clickthrough for each past query: Ci = {Ci,1, Ci,2, Ci,3, ...} (e.g., "Apple - Mac OS X. The Apple Mac OS X product page. Describes features in the current version of Mac OS X, ...")
• User model to estimate:

$$p(w \mid \theta_k) = p(w \mid Q_k, Q_1, \ldots, Q_{k-1}, C_1, \ldots, C_{k-1}) = \,?$$

i.e., combine the current query with the query history and the clickthrough history.
Short-term vs. long-term implicit feedback
• Short-term implicit feedback
  – context = current retrieval session
  – past queries in the context are closely related to the current query
  – clickthroughs reflect the user's current interests
• Long-term implicit feedback
  – context = all search interaction history
  – not all past queries/clickthroughs are related to the current query
"Bayesian Interpolation" for Short-Term Implicit Feedback

Average the user's query history Q1, ..., Qk-1 and clickthrough history C1, ..., Ck-1:

$$p(w \mid H_Q) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid Q_i) \qquad\qquad p(w \mid H_C) = \frac{1}{k-1} \sum_{i=1}^{k-1} p(w \mid C_i)$$

Then use them as a Dirichlet prior on the current query Qk:

$$p(w \mid \theta_k) = \frac{c(w, Q_k) + \mu\, p(w \mid H_Q) + \nu\, p(w \mid H_C)}{|Q_k| + \mu + \nu}$$

Intuition: trust the current query Qk more if it is longer (see the sketch below).
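A minimal Python sketch of this estimator, assuming maximum-likelihood unigram models for each past query and clickthrough text and a nonempty history; the defaults μ = 0.2 and ν = 5.0 match the BayesInt setting in the results table below:

```python
from collections import Counter

def avg_unigram(texts):
    """(1/(k-1)) * sum_i p(w | text_i), with an ML estimate per text."""
    model = Counter()
    if not texts:
        return model
    for words in texts:
        for w, c in Counter(words).items():
            model[w] += c / len(words) / len(texts)
    return model

def bayes_int(current_query, past_queries, past_clicks, mu=0.2, nu=5.0):
    """p(w | theta_k) = (c(w, Qk) + mu*p(w|HQ) + nu*p(w|HC)) / (|Qk| + mu + nu)."""
    hq, hc = avg_unigram(past_queries), avg_unigram(past_clicks)
    cq = Counter(current_query)
    z = len(current_query) + mu + nu
    return {w: (cq[w] + mu * hq[w] + nu * hc[w]) / z
            for w in set(cq) | set(hq) | set(hc)}
```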
Overall Effect of Search Context

           FixInt (α=0.1, β=1.0)   BayesInt (μ=0.2, ν=5.0)   OnlineUp (μ=5.0, ν=15.0)   BatchUp (μ=2.0, ν=15.0)
Query      MAP      pr@20          MAP      pr@20            MAP      pr@20             MAP      pr@20
Q3         0.0421   0.1483         0.0421   0.1483           0.0421   0.1483            0.0421   0.1483
Q3+HQ+HC   0.0726   0.1967         0.0816   0.2067           0.0706   0.1783            0.0810   0.2067
Improve    72.4%    32.6%          93.8%    39.4%            67.7%    20.2%             92.4%    39.4%
Q4         0.0536   0.1933         0.0536   0.1933           0.0536   0.1933            0.0536   0.1933
Q4+HQ+HC   0.0891   0.2233         0.0955   0.2317           0.0792   0.2067            0.0950   0.2250
Improve    66.2%    15.5%          78.2%    19.9%            47.8%    6.9%              77.2%    16.4%

• Short-term context helps the system improve retrieval accuracy
• BayesInt better than FixInt; BatchUp better than OnlineUp
Using Clickthrough Data Only

BayesInt (μ=0.0, ν=5.0):

Query     MAP      pr@20
Q3        0.0421   0.1483
Q3+HC     0.0766   0.2033
Improve   81.9%    37.1%
Q4        0.0536   0.1930
Q4+HC     0.0925   0.2283
Improve   72.6%    18.1%

Clickthrough is the major contributor.

Performance on unseen docs:

Query     MAP      pr@20
Q3        0.0331   0.125
Q3+HC     0.0661   0.178
Improve   99.7%    42.4%
Q4        0.0442   0.165
Q4+HC     0.0739   0.188
Improve   67.2%    13.9%

Snippets for non-relevant docs are still useful:

Query     MAP      pr@20
Q3        0.0421   0.1483
Q3+HC     0.0521   0.1820
Improve   23.8%    23.0%
Q4        0.0536   0.1930
Q4+HC     0.0620   0.1850
Improve   15.7%    -4.1%
Mixture Model with Dynamic Weighting for Long-Term Implicit Feedback

[Diagram: each past session i (query qi, clicked results Ci, documents Di) yields a session model θSi; the history model θH mixes θS1, ..., θSt-1 with weights λ1, ..., λt-1, and the final model θq,H interpolates θH with the current query model θq via weight λq]

Select the weights {λ} to maximize P(Dt | θq,H), using the EM algorithm (sketched below).
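A minimal sketch of the EM weight estimation, under the assumption that the component models (θq and the session models θSi) are held fixed while only the mixture weights are fit to the current results Dt; names are illustrative:

```python
import math

def em_mixture_weights(doc_words, components, iters=50):
    """Fit weights lambda_i for fixed unigram components to maximize P(Dt | mixture).

    doc_words: tokens of the current results Dt (assumed nonempty)
    components: list of word->prob dicts (theta_q, theta_S1, ..., theta_St-1)
    """
    m = len(components)
    lam = [1.0 / m] * m                                        # uniform start
    for _ in range(iters):
        expected = [0.0] * m
        for w in doc_words:
            # E-step: posterior that component i generated w (epsilon avoids zeros)
            post = [lam[i] * components[i].get(w, 1e-12) for i in range(m)]
            z = sum(post)
            for i in range(m):
                expected[i] += post[i] / z
        # M-step: normalize expected counts into new weights
        lam = [e / len(doc_words) for e in expected]
    return lam
```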
Results: Different Individual Search Models

[Plot omitted: retrieval accuracy of the individual long-term search models]
• recurring queries ≫ fresh queries
• combination ≈ clickthrough > docs > query, contextless
Results: Different Weighting Schemes for the Overall History Model

[Plot omitted: retrieval accuracy under different history-weighting schemes]
• hybrid ≈ EM > cosine > equal > contextless
3. Open Challenges

• What is a query?
• How to collect as much context information as possible without infringing on user privacy?
• How to store and organize the collected context information?
• How to accurately interpret/exploit context information?
• How to formally represent the evolving information need of a user?
• How to optimize search results for an entire session?
• What's the right architecture (client-side, server-side, or a client-server combo)?
References

• Framework
  – Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit User Modeling for Personalized Search. In Proceedings of CIKM 2005, pp. 824-831.
  – ChengXiang Zhai and John Lafferty. A Risk Minimization Framework for Information Retrieval. Information Processing and Management, 42(1), Jan. 2006, pp. 31-55.
• Short-term implicit feedback
  – Xuehua Shen, Bin Tan, and ChengXiang Zhai. Context-Sensitive Information Retrieval Using Implicit Feedback. In Proceedings of SIGIR 2005, pp. 43-50.
• Long-term implicit feedback
  – Bin Tan, Xuehua Shen, and ChengXiang Zhai. Mining Long-Term Search History to Improve Search Accuracy. In Proceedings of KDD 2006, pp. 718-723.
Thank You!