session search by direct policy learning

Dynamic IR

• Statistical Modeling of Information Seeking

• Aims to connect user’s information seeking behaviors with a set of new retrieval models

• most are generative

• the ‘dynamics’ in the search process are the primary elements to be modeled


Research Agenda

• Evaluation of DIR: CubeTest, CIKM’13

• POMDP stochastic game, SIGIR’14

• MDP query change model, SIGIR’13

• Design of POMDPs, ECIR’15

• Direct Policy Learning (reduce complexity), ICTIR’15

• Two-Way Communication, ICTIR’15

• TREC Dynamic Domain Track’15

Session Search by Direct Policy Learning

Jiyun Luo, Xuchu Dong, Grace Hui Yang

Georgetown University

ICTIR 2015


Session Search

• The information retrieval task that aims to find relevant documents for a session of multiple queries.

• We would like to call it ‘dynamic search’, too

• It happens when information needs are complex, vague, evolving, often containing multiple subtopics

• not possible to be resolved by one-shot ad-hoc search


An Illustration

Information need: E.g., find what city and state Dulles airport is in; what shuttles, ride-sharing vans, and taxi cabs connect the airport to other cities; what hotels are close to the airport; what cheap off-airport parking is available; and what metro stops are close to Dulles airport.

[Diagram: User and Search Engine]

• Search by “test-the-water”

• Trial-and-error

• Repeatedly trying different search paths via writing different queries, until succeeding in satisfying the information need by finding relevant documents

• The search engine receives immediate, instant feedback (reward) from the user

• Aim for optimization of long term gain

A good fit for Reinforcement Learning

• Existing work on RL in IR

• Query Change Model (SIGIR’13, Guan et al.)

• POMDP Modeling (SIGIR’14, Luo et al.)

• POMDP in Re-ranking (WWW’13, Jin et al.)

• More Prior work

• UCAIR (CIKM’05, X. Shen, B. Tan, and C. Zhai)

• Online Learning (ECIR’11, K. Hofmann, S. Whiteson, and M. de Rijke)


POMDP: Partially Observable Markov Decision Process

[Diagram: hidden states s0, s1, s2, s3, … linked in a chain, with actions a0, a1, a2, rewards r0, r1, r2, and observations o1, o2, o3]

● Hidden states ● Actions ● Rewards ● Markov ● Long Term Optimization ● Observations ● Beliefs

R. D. Smallwood et al., ’73

Challenges of RL in IR

• Formulation of the Problem (ECIR’15, Luo et al.)

• What are the states, actions, rewards, observations, agents?

• Math vs. Physical Meanings

• Efficiency

• RL training is expensive

• Existing Solutions - Reduce the states, actions

• Four Decision-Making states in our POMDP modeling (SIGIR’14)

Contribution of this Paper

• Addresses high complexity of RL in IR

• directly learns mappings from observations to actions

• skips states, beliefs

• flattens the model structure (a more down-to-earth model)

• … but, still complex enough to be interesting

• less model complexity leads to higher efficiency


A Direct Policy Learning Framework

• At each search iteration, the search engine maximizes long-term rewards (value function)

• Learns a direct mapping from observations to actions by gradient descent

• Experimented on TREC Session Tracks 2012-2014

$$V_\theta(s_0) = E\left[\sum_{t=0}^{\infty} \gamma^t r(t) \,\middle|\, s_0\right]$$
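To make the value function concrete, here is a minimal Python sketch that estimates the expected discounted reward from a handful of observed sessions. The function names, the default discount of 0.9, and the per-iteration reward lists are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum of gamma**t * r(t) over one finite session, t starting at 0."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_value(sessions: Sequence[Sequence[float]], gamma: float = 0.9) -> float:
    """Monte Carlo estimate of the expectation: mean discounted return."""
    return sum(discounted_return(s, gamma) for s in sessions) / len(sessions)


# Hypothetical per-iteration rewards for two short sessions.
sessions = [[1.0, 0.0, 2.0], [0.0, 1.0]]
value = estimate_value(sessions, gamma=0.5)
```

A real session is finite, so the infinite sum is truncated at the session length; the discount makes later iterations count less.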

Defining a History

• History: the record of a session from the search iteration 0 to the current iteration t

• A chain of events happening in a session

• the dynamic changes of states, actions, observations, and rewards in a session

$$h_t = [h_{t-1}, C_t, T_t, q_t, \Delta q_t, D_t]$$
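One possible way to hold such a recursive history in code is sketched below. It assumes that C_t and T_t carry the clicked ranks and dwell times (as the session example in these slides suggests) and folds them into a single click record; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Click:
    rank: int           # rank position of the clicked document
    dwell_seconds: int  # time spent on the document
    sat: bool           # whether the click counts as a SAT-click


@dataclass
class History:
    prev: Optional["History"]  # h_{t-1}; None at the first iteration
    clicks: List[Click]        # C_t and T_t folded together (assumption)
    query: str                 # q_t
    added: List[str]           # +delta q_t, terms added to the query
    removed: List[str]         # -delta q_t, terms removed from the query
    docs: List[str]            # D_t, identifiers of the returned documents

    def length(self) -> int:
        """Number of iterations recorded in this history."""
        return 1 + (self.prev.length() if self.prev else 0)


# Illustrative values; the document identifiers are placeholders.
h1 = History(None, [], "quit smoking", ["quit", "smoking"], [], ["d1", "d3", "d6"])
h2 = History(h1, [Click(3, 24, False), Click(6, 40, True)],
             "smoking quitting hypnosis", ["hypnosis"], [], ["d1", "d4"])
```

Keeping a pointer to the previous history mirrors the nested definition, so the whole session can be recovered by walking back from the latest iteration.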

Example: TREC 2014 Session 1011 “quit smoking”

q1: quit smoking

D1:

Rank 1: Easy Ways to Quit Smoking | Quit Smoking Help …

…

Rank 3: Quit Smoking Toolbox - Quit Smoking - Nicotine Addiction …

…

Rank 6: Quit Smoking Hypnosis, Stop Smoking Hypnosis CDs…

h1 = [clicked: none, q1, +∆q: quit smoking, -∆q: none, D1]


C2:

Rank 1: Easy Ways to Quit Smoking | Quit Smoking Help …

…

Rank 3: Quit Smoking Toolbox - Quit Smoking - Nicotine Addiction … (Clicked. Dwell time: 24 seconds)

…

Rank 6: Quit Smoking Hypnosis, Stop Smoking Hypnosis CDs… (SAT-Clicked. Dwell time: 40 seconds)


q2: smoking quitting hypnosis

D2:

Rank 1: Quit Smoking Hypnosis | Stop Smoking Hypnosis CDs Quit Smoking Hypnosis Neuro…

…

Rank 4: Quit Smoking with Video Hypnosis Home Shopping Cart…


q2: smoking quitting + hypnosis (+∆q)

Exploitation: query reformulation using words in previous search results


h2 = [h1, clicked: [[3, 24, SAT-Clicked=F], [6, 40, SAT-Clicked=T]], q2, +∆q: hypnosis, -∆q: none, D2]



C3:

Rank 1: Quit Smoking Hypnosis | Stop Smoking Hypnosis CDs Quit Smoking Hypnosis Neuro…

…

Rank 4: Quit Smoking with Video Hypnosis Home Shopping Cart…

q3: quit smoking side effects

D3:

Rank 1: Side Effects Of Quitting Smoking | Self Hypnosis To Quit Smoking …

…


q3: quit smoking + side effects (+∆q), with hypnosis removed (-∆q)

Exploration: query reformulation excluding words in previous search results


h3 = [h2, clicked: [[4, 31, SAT-Clicked=T]], q3, +∆q: side effects, -∆q: hypnosis, D3]
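The added and removed query terms in the example can be derived mechanically once word variants such as "quit" and "quitting" are treated as the same term. The sketch below assumes a deliberately crude, hypothetical stemmer (`toy_stem`) only to make the example self-contained; it is not the normalization used in the paper.

```python
from typing import List, Tuple


def toy_stem(word: str) -> str:
    """Crude stemmer (assumption): strip '-ing' and undouble the last letter."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        if len(w) >= 2 and w[-1] == w[-2]:
            w = w[:-1]
    return w


def query_change(prev_query: str, query: str) -> Tuple[List[str], List[str]]:
    """Return (+delta q, -delta q): terms added to and removed from the query."""
    prev_terms = prev_query.split()
    terms = query.split()
    prev_stems = {toy_stem(w) for w in prev_terms}
    stems = {toy_stem(w) for w in terms}
    added = [w for w in terms if toy_stem(w) not in prev_stems]
    removed = [w for w in prev_terms if toy_stem(w) not in stems]
    return added, removed


step2 = query_change("quit smoking", "smoking quitting hypnosis")
step3 = query_change("smoking quitting hypnosis", "quit smoking side effects")
```

With this normalization, the second query adds only "hypnosis", and the third adds "side" and "effects" while dropping "hypnosis", matching the histories in the example.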


Decompose a history

• First level: iteration by iteration

• Second level: break down an iteration into

• browse phase

• query phase

• retrieval phase

Browse Phase

• Actor: the user

• It happens

• after the search results are shown to the user

• before the user starts to write the next query

• Records how the user perceives and examines the (previously retrieved) search results

[Diagram: browse phase. From state s(t), node n1(t) pairs observation o_rank(t) with action a_browse(t)]

Query Phase

• Actor: the user

• It happens

• when the user writes a query

• Assuming the query is created based on

• what has been seen in the browse phase

• the information need

[Diagram: browse and query phases. Node n1(t): o_rank(t), a_browse(t); node n2(t): o_browse(t), a_query(t)]

Rank Phase

• Actor: the search engine

• It happens

• after the query is entered

• before the search results are returned

• It is where the search algorithm takes place

[Diagram: browse, query, and rank phases. Node n1(t): o_rank(t), a_browse(t); node n2(t): o_browse(t), a_query(t); node n3(t): o_query(t), a_rank(t), leading to s(t+1)]

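Our objective function:

$$
\begin{aligned}
P(h|\theta) &= \prod_{t=1}^{len(h)} P\big(o_{rank}(t), a_{browse}(t), o_{browse}(t), a_{query}(t), o_{query}(t), a_{rank}(t) \,\big|\, h_{t-1}, \theta\big) \\
&\propto \prod_{t=1}^{len(h)} P\big(a_{browse}(t) \,\big|\, o_{rank}(t), \theta_1\big) \times P\big(a_{query}(t) \,\big|\, o_{browse}(t), \theta_2\big) \times P\big(a_{rank}(t) \,\big|\, o_{browse}(t), o_{query}(t), o_{rank}(t), \theta_3\big) \\
&\propto \prod_{t=1}^{len(h)} \prod_{i \in \{1,2,3\}} P\big(a_i(t) \,\big|\, n_i(t), \theta_i\big)
\end{aligned}
$$

where n_i(t) collects the observations that phase i conditions on at iteration t, a_i(t) is the action taken in that phase, and θ_i are that phase's parameters.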

Ranking Function

• It originally represents the probability of selecting a (ranking) action

• In our context, it is the probability of selecting d to be put at the top of a ranked list under n3 and θ3 at the t-th iteration

• Then we sort the documents by it to generate the document list

$$P(a_{rank}|n_3, \theta_3) = \frac{e^{\theta_3 \cdot \phi(a_{rank}, n_3)}}{\sum_{a'_{rank}} e^{\theta_3 \cdot \phi(a'_{rank}, n_3)}}$$
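The ranking function is a softmax over candidate documents, which the following Python sketch illustrates. The feature vectors, parameter values, and function names are made up for illustration; only the softmax form and the final sort come from the slide.

```python
import math
from typing import Dict, List


def softmax_scores(theta: List[float], features: Dict[str, List[float]]) -> Dict[str, float]:
    """P(a_rank | n3, theta3) for each candidate document."""
    logits = {d: sum(t * f for t, f in zip(theta, phi)) for d, phi in features.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {d: math.exp(z - m) for d, z in logits.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}


def rank_documents(theta: List[float], features: Dict[str, List[float]]) -> List[str]:
    """Sort documents by their probability of being placed at the top."""
    probs = softmax_scores(theta, features)
    return sorted(probs, key=lambda d: probs[d], reverse=True)


# Hypothetical parameters and per-document feature vectors phi(d, n3).
theta3 = [1.0, 0.5]
phi = {"d1": [0.2, 0.0], "d2": [1.0, 1.0], "d3": [0.5, 0.2]}
ranking = rank_documents(theta3, phi)
```

Because the denominator is shared by all candidates, sorting by probability gives the same order as sorting by the raw score θ3·φ; the normalization matters for learning, not for the final ranking.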

Updates:

$$\Delta\theta_3 = \sum_{h \in H} \sum_{t=1}^{len(h)} \gamma^t r(t, h) \times \sum_{i=1}^{t} \Big[\phi(a_{rank}, n_3) - \sum_{a'_{rank}} \phi(a'_{rank}, n_3)\, P(a'_{rank}|n_3, \theta_3)\Big]$$
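A rough Python sketch of this update for a single history is given below. It assumes that the inner term at index i uses the ranking action and observation of iteration i, which the slide does not index explicitly, and all names, the discount value, and the toy data are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Step = Tuple[Sequence[Vector], int, float]  # (candidate features, chosen index, reward)


def policy_probs(theta: Vector, candidates: Sequence[Vector]) -> List[float]:
    """Softmax probabilities over the candidate ranking actions."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in candidates]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_term(theta: Vector, candidates: Sequence[Vector], chosen: int) -> Vector:
    """phi(chosen) minus the expected phi under the current policy."""
    probs = policy_probs(theta, candidates)
    dims = range(len(theta))
    expected = [sum(p * phi[k] for p, phi in zip(probs, candidates)) for k in dims]
    return [candidates[chosen][k] - expected[k] for k in dims]


def history_update(theta: Vector, steps: Sequence[Step], gamma: float = 0.9) -> Vector:
    """Contribution of one history h to delta theta3."""
    delta = [0.0] * len(theta)
    running = [0.0] * len(theta)  # sum of score terms for i = 1..t
    for t, (candidates, chosen, reward) in enumerate(steps, start=1):
        term = score_term(theta, candidates, chosen)
        running = [a + b for a, b in zip(running, term)]
        weight = (gamma ** t) * reward
        delta = [d + weight * r for d, r in zip(delta, running)]
    return delta


cands = [[1.0, 0.0], [0.0, 1.0]]
one_step = history_update([0.0, 0.0], [(cands, 0, 1.0)], gamma=0.5)
```

Summing `history_update` over all histories in H and adding the result to θ3, scaled by a learning rate, would give one parameter update under these assumptions.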

Feature function: φ(a_rank, n3)

Query Features

• Test if a search term w ∈ q_t and w ∈ q_{t−1}

• # of times that a term w occurs in q_1, q_2, …, q_t

Query-Document Features

• Test if a search term w ∈ +∆q_t and w ∈ D_{t−1}

• Test if a document d contains a term w ∈ −∆q_t

• tf.idf score of a document d to q_t

Click Features

• Test if there are SAT-Clicks in D_{t−1}

• # of times a document is clicked in the current session

• # of seconds a document is viewed and reviewed in the current session

Query-Document-Click Features

• Test if q_i leads to SAT-Clicks in D_i, where i = 0...t−1

Session Features

• Position in the current session

[Diagram labels: Browse, Query, Rank]
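A few of these features are simple enough to sketch directly. The Python below shows two query features and two click features; the function names, dictionary keys, and the click tuple layout (rank, dwell seconds, SAT flag) are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Click = Tuple[int, int, bool]  # (rank, dwell seconds, is SAT-click), assumed layout


def query_features(term: str, queries: List[str]) -> Dict[str, float]:
    """Query features for one term, given q_1..q_t in session order."""
    current = queries[-1].split()
    previous = queries[-2].split() if len(queries) > 1 else []
    return {
        # Does the term appear in both q_t and q_{t-1}?
        "in_current_and_previous": float(term in current and term in previous),
        # How many times does the term occur across q_1..q_t?
        "count_in_session_queries": float(sum(q.split().count(term) for q in queries)),
    }


def click_features(prev_clicks: List[Click]) -> Dict[str, float]:
    """Click features computed from the clicks on the previous result list."""
    return {
        "sat_click_in_previous_results": float(any(sat for _, _, sat in prev_clicks)),
        "total_dwell_seconds": float(sum(dwell for _, dwell, _ in prev_clicks)),
    }


queries = ["quit smoking", "smoking quitting hypnosis"]
smoking = query_features("smoking", queries)
clicks = click_features([(3, 24, False), (6, 40, True)])
```

Concatenating such values in a fixed order would give one feature vector φ per candidate document, which is what the ranking function and update rule consume.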

Experiments

• Data: TREC 2012, 2013, 2014 Session Tracks

• Corpus: ClueWeb09, ClueWeb12

Baselines

• lemur

• qcm (the query change model, MDP [Guan et al., SIGIR’13])

• winwin: a POMDP model [Luo et al., SIGIR’14]

  • winwin-short: user clicks are used as reward

  • winwin-long: nDCG, an ideal reward function, is used as reward

• dpl (proposed in this paper)

• dpl+upper bound

  • using ground truth as reward function

  • an upper bound of the proposed approach

Whole-Session Search Accuracy

• dpl achieves a significant improvement over the TREC best run

• We found similar conclusions on TREC 2013 and 2014 Session Tracks

Efficiency

• lemur > dpl > qcm > winwin

• dpl achieves a good balance between accuracy and efficiency

• the conclusions are also consistent across experiments on the TREC’12 to ’14 Session Tracks

Conclusions

• A novel document retrieval algorithm by Direct Policy Learning

• Define the history and three phases in search

• a way to describe the (messy) information seeking process

• The approach achieves a good balance between effectiveness and efficiency

• less complexity

• more flexibility to incorporate a large set of features
