A Thesis - Georgetown University (infosense.cs.georgetown.edu/publication/dongyi_guan_thesis.pdf)



Structured Query Formulation and Result Organization for Session

Search

A Thesis

submitted to the Faculty of the

Graduate School of Arts and Sciences

of Georgetown University

in partial fulfillment of the requirements for the

degree of

Master of Science

in Computer Science

By

Dongyi Guan

Washington, DC

April 22, 2013


Copyright © 2013 by Dongyi Guan

All Rights Reserved


Structured Query Formulation and Result Organization for Session

Search

Dongyi Guan

Thesis Advisor: Dr. Grace Hui Yang

Abstract

Complicated search tasks, such as making a travel plan, usually require more than one search query. A user interacts with a search engine over multiple iterations, which we call a session. Session search is the task of document retrieval within a session, which often involves a series of interactions between the user and the search engine. To make use of all the queries and the various interactions in a session, we propose an effective structured query formulation method for session search. By identifying phrase-like textual nuggets, we investigate different degrees of importance for phrases in queries, aggregate them to create a highly effective session-wise query, and send it to a state-of-the-art search engine to retrieve the relevant documents. Our system participated in the TREC 2012 Session track evaluation and placed second in whole-session search (RL2-RL4).

The second main contribution of this thesis is to increase the stability of result organization for session search. Search result clustering (SRC) hierarchies are widely used to organize search results, providing users with an overview of their results. Search result organization is usually sensitive to even slight changes in queries. Within a session, however, queries are related, and hence the search result organizations should be related as well, maintaining a more stable representation. We propose two monothetic concept hierarchy approaches that exploit external knowledge to build more stable SRC hierarchies for session search. One approach corrects erroneous relations generated by Subsumption, a state-of-the-art concept hierarchy construction approach. The other employs external knowledge to build SRC hierarchies directly. Evaluations show that our approaches generate statistically significantly more stable search result organizations while maintaining good organization quality.

Index words: Information retrieval, session search, structured query, search result organization


Acknowledgments

This thesis would not have been completed without the guidance and help of the people who contributed their valuable assistance to the preparation and completion of this study.

First and foremost, I would like to express my utmost gratitude to my advisor, Dr. Grace H. Yang, for her continuous support and inspiring instruction throughout my study and research. Dr. Yang is a great advisor with patience, motivation, enthusiasm, and immense knowledge. I would also like to thank her for encouraging and helping me to shape my interests and ideas.

Besides my advisor, I am deeply grateful to the rest of my thesis committee, Dr. Lisa Singh and Dr. Calvin Newport, for their insightful comments and high-quality questions.

My sincere thanks also go to the professors at Georgetown University for their great support and kind help: Dr. Ophir Frieder, Dr. Evan Barba, Dr. Eric Burger, Dr. Der-Chen Chang, Dr. Jeremy Fineman, Dr. Nazli Goharian, Dr. Bala Kalyanasundaram, Dr. Mark Maloof, Dr. Jami Montgomery, Dr. Micah Sherr, Dr. Clay Shields, Dr. Richard Squier, Dr. Mahendran Velauthapillai, and Dr. Wenchao Zhou. I also thank my friends Yifan Gu, Jiyun Luo, Jon Parker, Henry Tan, Amin Teymorian, Chris Wacek, Yifang Wei, Andrew Yates, and Sicong Zhang for the stimulating discussions and the sleepless nights we spent working together before deadlines.

I owe my warm thanks to my family for their continuous love and support of my decisions. My parents always give me advice to help me get through difficult times.


I am so grateful to my fiancée, whose love and unconditional support allowed me to finish this journey. Finally, I would like to dedicate this work to my late Grandma, who left us too soon. I hope that this work makes her proud.


Table of Contents

Chapter

1 Introduction
    1.1 Motivation
    1.2 Session Search
        1.2.1 Overview
        1.2.2 Query Formulation
        1.2.3 Search Result Organization
    1.3 Challenges
        1.3.1 Challenges in Query Formulation for Session Search
        1.3.2 Challenges in Result Organization for Session Search
    1.4 TREC Session Tracks
    1.5 Our Approaches
        1.5.1 Structured Query Formulation for Session Search
        1.5.2 Stable Search Result Organization by Exploiting External Knowledge
    1.6 Contributions of this Thesis
    1.7 Outline

2 Related Work
    2.1 Session Search and TREC Session Tracks
    2.2 Query Formulation
    2.3 Search Result Organization
        2.3.1 Hierarchical Clustering
        2.3.2 Subsumption
        2.3.3 Exploiting External Knowledge

3 Effective Structured Query Formulation for Session Search
    3.1 Identifying Nuggets and Formulating Structured Queries
        3.1.1 The Strict Method
        3.1.2 The Relaxed Method
    3.2 Query Aggregation within a Session
        3.2.1 Aggregation Schemes
    3.3 Query Expansion by Anchor Text
    3.4 Removing Duplicated Queries
    3.5 Document Re-ranking
    3.6 Evaluation for Session Search
        3.6.1 Datasets, Baseline, and Evaluation Metrics
        3.6.2 Results for TREC 2011 Session Track
        3.6.3 Results for TREC 2012 Session Track
        3.6.4 Official Evaluation Results for TREC 2012 Session Track
    3.7 Chapter Summary

4 Increasing Stability of Result Organization for Session Search
    4.1 Utilizing External Knowledge to Increase Stability of Search Result Organization
    4.2 Identifying Reference Wikipedia Entries
    4.3 Improving Stability of Subsumption
    4.4 Building Concept Hierarchy Purely Based on Wikipedia
    4.5 Evaluation for Search Result Organization
        4.5.1 Hierarchy Stability
        4.5.2 Hierarchy Quality
    4.6 Chapter Summary

5 Conclusion
    5.1 Research Summary
    5.2 Significance of the Thesis
    5.3 Future Directions

Bibliography


List of Figures

1.1 Typical procedure of session search.

1.2 Retrieved documents by Lemur (TREC 2011 Session 25). The top document only describes the symptoms and treatments for communicable diseases, which is not relevant to the topic of the session, "collagen vascular disease".

1.3 Search result clustering (SRC) hierarchies by Yippy (TREC 2010 Session 123). SRC hierarchies (a) and (b) are for the queries "diet" and "low carb diet" respectively. A low carb diet, "South Beach Diet", that should have appeared in both (a) and (b) is missing in (b); the cluster "Diet And Weight Loss" in (a) is dramatically changed in (b). Screenshot taken at 15:51 EST, 6/15/2012 from Yippy.

3.1 A sample nugget in the TREC 2012 session 53 query "servering spinal cord paralysis".

3.2 Words in a snippet built from the TREC 2012 session 53 query "servering spinal cord consequenses", where "spinal" is always connected to "cord".

3.3 Words in a snippet built from the TREC 2011 session 20 query "dooney bourke purses", where "dooney and bourke" is a brand name but the user omits the word "and".

3.4 nDCG@10 values of retrieved documents using the TREC 2011 Session track dataset. Two cases, with and without threshold, are compared.

3.5 Anchor text in a web page.

3.6 Changes in nDCG@10 from RL1 to RL2 presented by the TREC 2012 Session track. Error bars are 95% confidence intervals (Figure 1 in [26]).

3.7 All results by nDCG@10 for the current query in the session for each subtask (Table 2 in [26]).

4.1 Framework overview of the Wikipedia-enhanced concept hierarchy construction system.

4.2 Mapping to a relevant Wikipedia entry. Text in circles denotes Wikipedia entries; text in rectangles denotes concepts. Based on the context of the current search session, the entry "Gestational diabetes" is selected as the most relevant Wikipedia entry. Therefore the concept "GDM" is mapped to "Gestational diabetes", whose supercategories are "Diabetes" and "Health issues in pregnancy".

4.3 An example of Wikipedia-enhanced Subsumption. The concepts "Diabetes" and "type 2 diabetes" satisfy Eq. (4.5) and are identified as a potential subsumption pair. The reference Wikipedia entry of "Diabetes" is a category, and the reference Wikipedia entry of "type 2 diabetes" is the Wikipedia entry "Diabetes mellitus type 2". Therefore we check whether "Diabetes" is one of the supercategories of "Diabetes mellitus type 2" and confirm that "diabetes" subsumes "type 2 diabetes".

4.4 An example of Wikipedia-only hierarchy construction. From the concept "Diabetes mellitus" we find the reference Wikipedia entry "Diabetes mellitus", then we find its start category "Diabetes". Similarly, for another concept, "joslin", we find its reference Wikipedia entry "Joslin Diabetes Center" and its start category "Diabetes organizations". We then expand from these two start categories. "Diabetes organizations" is one of the subcategories of "Diabetes", thus we merge them together.

4.5 Major clusters in hierarchies built by Clusty for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

4.6 Major clusters in hierarchies built by Wiki-only for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

4.7 Major clusters in hierarchies built by Subsumption for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

4.8 Major clusters in hierarchies built by Subsumption+Wiki for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

4.9 Search result organization quality improvement vs. stability for Subsumption and Subsumption+Wiki.

4.10 Extreme case 1. A totally static hierarchy for two queries in a session (TREC 2010 session 107).

4.11 Extreme case 2. A totally different hierarchy for two queries in a session (TREC 2010 session 75).


List of Tables

3.1 nDCG@10 for TREC 2011 Session track RL1. Dirichlet smoothing is used, with µ = 4000, f = 10 for the strict method and µ = 4000, f = 20 for the relaxed method. Methods are compared to the baseline, the original query. A significant improvement over the baseline is indicated with a † at the p < 0.05 level and a ‡ at the p < 0.005 level (t-test, single-tailed). The best and median runs in TREC 2011 are listed for comparison.

3.2 nDCG@10 for TREC 2011 Session track RL2. Dirichlet smoothing and the strict method are used, with µ = 4000, f = 5 for uniform, and µ = 4500, f = 5 for previous vs. current (PvC) and distance-based. Methods are compared to the baseline, the original query. A significant improvement over the baseline is indicated with a † at the p < 0.05 level and a ‡ at the p < 0.005 level (t-test, single-tailed). The best and median runs in TREC 2011 are listed for comparison.

3.3 nDCG@10 for TREC 2011 Session track RL3 and RL4. All runs use the strict method with µ = 4500, f = 5. Methods are compared to the baseline, the original query. A significant improvement over the baseline is indicated with a † at the p < 0.05 level and a ‡ at the p < 0.005 level (t-test, single-tailed). The best and median runs in TREC 2011 are listed for comparison.

3.4 Methods and parameter settings for the TREC 2012 Session track. µ is the Dirichlet smoothing parameter; f is the number of pseudo-relevance feedback documents.

3.5 nDCG@10 for the TREC 2012 Session track. The mean of the medians of the TREC 2012 evaluation results is listed.

3.6 AP for the TREC 2012 Session track. The mean of the medians of the TREC 2012 evaluation results is listed.

4.1 Statistics of the TREC 2010 and TREC 2011 Session track datasets.

4.2 Stability of search result organization for TREC 2010 Session queries. Approaches are compared to the baseline, Subsumption. A significant improvement over the baseline is indicated with a † at p < 0.05 and a ‡ at p < 0.005 (t-test, single-tailed).

4.3 Stability of search result organization for TREC 2011 Session queries. Approaches are compared to the baseline, Subsumption. A significant improvement over the baseline is indicated with a † at p < 0.05 and a ‡ at p < 0.005 (t-test, single-tailed).


Chapter 1

Introduction

1.1 Motivation

Complicated search tasks, such as planning a trip, buying a product, or looking for a good elementary school, are common in daily life. These tasks often contain multiple sub-topics, so they require more than one query. A user usually interacts with a search engine when performing such tasks; these interactions form a session. Session search is the task of document retrieval within a session.

Major Web search engines, including Google and Bing, return a list of documents ranked in decreasing order of relevance to a single query. However, this representation may not fully satisfy users: a complicated information need may contain multiple sub-topics, so documents relevant to different sub-topics may be mixed together in the returned list. If a search engine organized the search results into a hierarchical representation that explicitly shows the sub-topics emerging in the documents, a user would probably be able to locate the needed information more easily and more efficiently.

For example, if a user is preparing an article about the "Pocono Mountains region", he or she would want to search for information about its many sub-topics, such as national parks, resorts, shopping, etc. It is not easy to retrieve information about all these aspects with one query. Consequently, the user may begin with the query "pocono mountain region" and then turn to queries like "pocono mountains region things to do", "pocono mountains region activities", and "pocono mountains region national park" to search the sub-topics. Because the queries are about different sub-topics, the user may expect a system that organizes the relevant documents into a hierarchical structure, which places documents about different sub-topics in different groups such as "activities" or "national park".

Figure 1.1: Typical procedure of session search.

1.2 Session Search

Session search is a field devoted to finding documents relevant to a session. A system that supports session search accepts an entire session, which includes a series of previous queries with their corresponding search results and a current (last) query, and retrieves documents relevant to the topic of the session.


1.2.1 Overview

Figure 1.1 shows a typical procedure for session search. The interaction between a user and a search engine can be represented as a session. The session contains a series of previous queries q1, q2, ..., qn−1, each associated with a set of relevant documents, i.e., the previous results D1, D2, ..., Dn−1, and a current (last) query qn. The search engine usually formulates a query that represents the entire session, then applies retrieval models to the formulated query and an indexed corpus to retrieve relevant documents. After retrieval, the search engine presents them to the user. The user may be satisfied with the search results and finish the procedure, or may be unsatisfied and modify the session to re-retrieve documents.

This thesis studies two crucial components in this procedure: query formulation

and search result organization.

1.2.2 Query Formulation

Query formulation is important because the retrieval model directly relies on the formulated query: the system can retrieve more relevant documents if the formulated query represents the topic of the session more accurately. Structured formulation of queries, such as combining terms, assigning weights to terms, or expanding the query, focuses on the underlying meanings in queries. Structured queries identify the concepts in queries and emphasize the important concepts as individual atoms. In other words, structured queries express user intentions more precisely, so as to retrieve relevant documents more effectively.

In the example in Section 1.1, there are multiple concepts in this session: "pocono mountains region", "things to do", "activities", and "national park". Since "pocono mountains region" appears in every query, it is probably much more important than the others. The search engine can therefore build a structured query that assigns a higher weight to "pocono mountains region" to express the importance of that concept.
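For instance, Indri (the retrieval engine behind the Lemur toolkit) accepts structured queries in which #1(...) matches an exact phrase and #weight(...) combines weighted atoms. Below is a minimal Python sketch of building such a query string; the weight values are illustrative assumptions, not the settings used in the thesis.

```python
def build_weighted_query(concepts):
    """Build an Indri-style #weight query from (weight, phrase) pairs.

    Multi-word phrases are wrapped in the exact-phrase operator #1(...);
    single terms are used as-is.
    """
    parts = []
    for weight, phrase in concepts:
        atom = f"#1({phrase})" if " " in phrase else phrase
        parts.append(f"{weight} {atom}")
    return "#weight( " + " ".join(parts) + " )"

# The recurring session concept gets a higher (illustrative) weight
# than the sub-topic concepts.
query = build_weighted_query([
    (2.0, "pocono mountains region"),
    (1.0, "national park"),
    (1.0, "activities"),
])
```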

1.2.3 Search Result Organization

A clear organization of search results gives users an overview of the results and may help them discover further information needs effectively. Since the results of a session search often cover multiple aspects, a search engine serves the user better if it applies search result clustering to organize the results into hierarchies, which we call SRC hierarchies. SRC hierarchies support better information access by improving the display of information: search results are presented in a "lay of the land" format, which groups similar results together and reveals important concepts in lower-ranked results.

In the example in Section 1.1, an SRC hierarchy is appropriate for organizing the search results because the documents relevant to the topic contain multiple sub-topics, and some sub-topics can be further divided. For example, the query "pocono mountains things to do" may return documents that can be divided into more detailed groups such as "hiking" or "camping".

1.3 Challenges

The complexity of session search poses great challenges for researchers, especially in the two crucial components of the session search procedure: query formulation and search result organization.


Figure 1.2: Retrieved documents by Lemur (TREC 2011 Session 25). The top document only describes the symptoms and treatments for communicable diseases, which is not relevant to the topic of the session, "collagen vascular disease".

1.3.1 Challenges in Query Formulation for Session Search

Words within a query may form phrases that express coherent meanings, or concepts. A word group may describe a topic different from that of any single word, and the words in the group may be more important than the rest. Furthermore, a session contains multiple queries, some of which are more important than others for expressing the topic of the session. If a search engine treats all the words in a session individually and identically, it may rank highly documents that are relevant to individual words. However, documents relevant to single words are not necessarily relevant to the topic of the query, which may decrease search accuracy.


Figure 1.2 shows an example of directly submitting all the words of the queries in a session to Lemur¹, a powerful search engine. The session is composed of three queries: "collagen vascular disease causes symptoms treatments effects", "CVD causes symptoms treatments", and "collagen vascular disease causes symptoms treatments". As we can see, the search engine processes "collagen", "vascular", and "disease" as separate words. Moreover, the common words "disease", "symptoms", and "treatments", which occur repeatedly in all the queries, heavily bias the search results, leading to high ranks for documents about the "symptoms" and "treatments" of other "diseases". Consequently, relevant documents about the topic "collagen vascular disease" do not appear in the top retrieval results.

Session search thus poses tremendous challenges: (1) how to identify word groups expressing coherent unit meanings, that is, concepts, in the queries within a session; and (2) how to formulate these word groups into a structured query according to their importance.

1.3.2 Challenges in Result Organization for Session Search

SRC hierarchies (see an example in Figure 1.3) are suitable for organizing the search results of a regular search. However, most SRC hierarchies created by state-of-the-art algorithms are overly sensitive to minor query changes, regardless of whether the queries are similar and belong to the same session. Such minor query changes often occur within a session: about 38.6% of adjacent queries in the TREC 2010-2011 Session tracks show only a one-word change, and 26.4% show a two-word change.

Figure 1.3 shows hierarchies generated by Yippy² for the adjacent queries "diet" and "low carb diet". The second query, "low carb diet", is a specification of the first. We observe many changes between the two SRC hierarchies (a) and (b). Overall, they share only 4 common words, "weight", "loss", "review", and "diet", and 0 common pair-wise relations. This is a very low overlap given that the two queries are closely related and within the same session.

¹http://www.lemurproject.org/, version 5.0
²http://www.yippy.com

Figure 1.3: Search result clustering (SRC) hierarchies by Yippy (TREC 2010 Session 123). SRC hierarchies (a) and (b) are for the queries "diet" and "low carb diet" respectively. A low carb diet, "South Beach Diet", that should have appeared in both (a) and (b) is missing in (b); the cluster "Diet And Weight Loss" in (a) is dramatically changed in (b). Screenshot taken at 15:51 EST, 6/15/2012 from Yippy.
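Overlap counts of this kind can be computed mechanically. A small sketch follows, assuming each hierarchy is given as a list of (parent label, child label) pairs; this is an illustration of the comparison above, not the thesis's stability metric.

```python
def hierarchy_overlap(h1, h2):
    """Return the cluster-label words and the parent-child relations
    shared by two SRC hierarchies, each given as (parent, child) pairs."""
    words1 = {w for pair in h1 for label in pair for w in label.lower().split()}
    words2 = {w for pair in h2 for label in pair for w in label.lower().split()}
    rels1 = {(p.lower(), c.lower()) for p, c in h1}
    rels2 = {(p.lower(), c.lower()) for p, c in h2}
    return words1 & words2, rels1 & rels2
```

Applied to toy versions of hierarchies (a) and (b), it reports one shared word ("diet") and no shared relations.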

The dramatic change, i.e., instability, of SRC hierarchies in session search weakens their function as an information overview. With rapidly changing SRC hierarchies, users may perceive them as random search result organizations and find it difficult to re-find relevant documents identified for previous queries. We argue that while SRC hierarchies should not be static, they should maintain their basic topics and structure across the entire session as they change.

Ideally, SRC hierarchies should not only be closely related to the current query and its search results but also reflect changes in adjacent queries to the right degree and at the right places. In this work, we address this new challenge of producing stable SRC hierarchies for session search.

1.4 TREC Session Tracks

The National Institute of Standards and Technology (NIST) held TREC Session tracks [24, 25, 26] for three years, from 2010 to 2012. The TREC Session tracks aim to test whether IR systems can improve search accuracy with the assistance of previous queries and the corresponding user interactions. The session data was composed of sequences of queries q1, q2, ..., qn−1, qn, with only the current (last) query qn being the subject of retrieval; q1, q2, ..., qn−1 were accordingly named previous queries.

NIST invited faculty, staff, and students at the University of Sheffield as users to generate session queries. In addition to these queries, NIST provided a user interaction for each previous query in a session. A user interaction contained a ranked document list retrieved for the previous query and user-click information, including the click order, start time, and end time.

The TREC participants (we were one of them) were requested to submit their retrieval results as ranked document lists. NIST assessors evaluated the submissions, and TREC released official evaluation results every year.


1.5 Our Approaches

In this work, we tackle the challenges in two crucial components of session search: query formulation and search result organization. We formulate structured queries for sessions to improve search accuracy. We also propose to build stable, high-quality SRC hierarchies for session search.

1.5.1 Structured Query Formulation for Session Search

Observation shows that a query often contains phrases that describe a coherent meaning as a group. For example, the query "russian politics kursk submarine" (TREC 2012, session 18) contains two phrases, "russian politics" and "kursk submarine", each of which expresses a concept and cannot be split. Phrases are usually more related to the topic of a session, and hence more important, than single words. A structured query can represent the phrases in a query. Therefore, we focus on formulating effective structured queries for search tasks within a session.

To represent phrases, we introduce the nugget, a substring of a query whose terms frequently occur together. We propose to identify nuggets by examining, in the pseudo-relevance feedback, the distance between terms that are adjacent in a query. Two rules, named strict and relaxed, are applied when calculating the term distance.

We can generate a set of terms and nuggets from every query in a session. However, the importance of these queries is not identical. We combine the terms and nuggets from every query into one structured query using different aggregation schemes. We compare three schemes: uniform, previous vs. current, and distance-based. The schemes are designed based on the order of queries in a session.


Our approach includes query expansion and document re-ranking as well. The top k terms in anchor texts in the pseudo-relevance feedback are extracted to expand the structured query. We then re-rank the retrieved documents by comparing them to the clicked documents in the user interactions, using the dwell times as weights.

1.5.2 Stable Search Result Organization by Exploiting External Knowledge

External knowledge bases such as Wikipedia and WordNet are compiled manually. Therefore, they are widely used to enhance automatic information retrieval. Correct relations between concepts are crucial for generating high-quality SRC hierarchies. We apply external knowledge as a reference to build relations between concepts. We choose Wikipedia in this work because it contains more extensive definitions of concepts, along with relations represented by links and categories. Wikipedia is used in two ways: (1) fixing the incorrect relations generated by an existing approach, which is named Subsumption+Wiki; (2) extracting the category information to build the concept hierarchies directly, which is named Wiki-only.

The issue of unstable SRC hierarchies might occur for various reasons, of which the most significant is the popular bottom-up clustering strategy. In contrast, monothetic concept hierarchy approaches first extract the labels (or concepts) from retrieved documents and then organize these concepts into hierarchies. Since labels are obtained before clusters are formed, they are not derived from the clusters. Monothetic concept hierarchy approaches, hence, produce more stable hierarchies than clustering approaches. Therefore, we build our system based on the monothetic concept hierarchy approach.

In both methods in which we exploit Wikipedia, we extract a set of concepts from the document set, i.e., the search results. In the first one, we apply an existing approach


to draw the possible parent-child relations between pairs of concepts. Then we identify the location of each pair of concepts in the category network of Wikipedia and filter out the incorrect relations. In the second one, for each concept, we identify the most relevant page in Wikipedia. Then we extract the category structure from this Wikipedia page. The category structures for all concepts are merged to build the SRC hierarchies.

1.6 Contributions of this Thesis

This thesis focuses on improving the search accuracy for session search and building stable SRC hierarchies for queries in a session. By combining the nugget approach and aggregation schemes, a structured query represents the topic of a session more accurately. In addition, our approach integrates external knowledge into the monothetic concept hierarchy algorithm and significantly increases the stability of SRC hierarchies without loss of quality. The specific contributions are: 1) we propose an approach that introduces the concept of a nugget to formulate a session into a structured query; 2) we propose an efficient method to predict a window size for a nugget; 3) we present two effective approaches that organize the search results into SRC hierarchies of high stability and high quality; 4) we evaluate the stability of concept hierarchies built by monothetic concept hierarchy approaches and by clustering approaches over the dataset of the TREC Session tracks.

We propose to formulate a structured query to represent the topic of a session precisely. We try to find the phrases in queries that express atomic meanings. In particular, we introduce the concept of a nugget, a phrase-like substring of a query. Based on identifying nuggets, an effective approach is proposed to generate a structured query from a session. Evaluation indicates that nuggets increase the accuracy


for session search. Moreover, we propose an efficient relaxed method to predict an appropriate window size for a nugget according to the average distance between two terms in the pseudo-relevance feedback. Experiments show that the relaxed method gives an advantage on the single-query task over the traditional method of examining n-grams.

Furthermore, we introduce three aggregation schemes for the multiple queries in a session. A session contains multiple queries, from which we can obtain a set of nuggets. The queries may differ in importance, hence some nuggets may play more important roles than others. We find that the last query is commonly more important than the previous ones.

This thesis further studies result organization for session search. Search result organization gives the user an overview of the relevant documents, which helps the user locate the needed information rapidly. We present a novel framework based on the monothetic concept hierarchy approach, which shows advantages in terms of stability over the popular organization approaches, mostly based on hierarchical clustering. Our algorithm dynamically maps the concepts to Wikipedia entries and generates the hierarchical structure, which can extract Wikipedia category structures about a specific topic efficiently.

We are the first to evaluate the stability of concept hierarchies built by monothetic concept hierarchy approaches and by clustering approaches. Moreover, we are the first to integrate external knowledge into a monothetic concept hierarchy approach to correct erroneous parent-child relations between concepts. The results indicate that our approach improves the quality of the hierarchies.


1.7 Outline

The rest of this thesis is organized as follows. Chapter 2 discusses the related work. Chapter 3 presents the methods of generating effective structured queries from the session data. Chapter 4 presents the enhancement of result organization for session search by integrating the Wikipedia category structure. Chapter 5 summarizes the thesis and describes possible directions for future work.


Chapter 2

Related Work

This chapter reviews the work related to this thesis research, including the submissions to the TREC Session tracks, query formulation, and search result organization.

2.1 Session Search and TREC Session Tracks

In TREC 2011 and TREC 2012 Session tracks [25, 26], a session contained multiple

queries q1, q2, · · · , qn−1, qn, and the user interactions such as the previous search results

and click information. Four subtasks were requested:

• RL1. Only using the current query qn.

• RL2. Including the previous queries q1, q2, · · · , qn−1 and the current query qn.

• RL3. Including top retrieved documents for previous queries.

• RL4. Considering additional information about which top results are clicked by

users.

The ClueWeb09 collection1 is used as the corpus in the TREC Session tracks. Participants are allowed to use the first 50 million documents of ClueWeb09, named "Category B" or "CatB", as the corpus instead. However, they are evaluated as if they were using the entire collection (named "Category A" or "CatA").

1http://www.lemurproject.org/clueweb09.php


Twenty teams participated in the TREC Session tracks over three years [22, 6, 45, 32, 28, 19, 23, 13, 33, 2, 15, 30, 50]. Evaluation results showed significant improvements from the first subtask to the last one in most of the submissions. These results indicated that considering all session information contributed to search accuracy.

Jiang et al. [22, 23] applied the sequential dependence model (SDM) [36] as the basic retrieval model. SDM features, including all single terms, ordered phrases, and unordered phrases, were extracted from the query. Then the features were incorporated into the Lemur system by using the Indri query language. The session historical query model (SH-QM) was used when including previous queries. For each SDM feature, a weight was assigned by linearly combining the frequencies of this feature in the current and previous queries. After introducing previous search results in RL3 and RL4, the authors applied the pseudo-relevance feedback query model (PRF-QM) on the single-term features. The weight of a single-term feature was adjusted by calculating the term frequency in the pseudo-relevance feedback. In RL3, the top 10 ranked Wikipedia documents served as the pseudo-relevance feedback, while in RL4, the clicked documents associated with their snippets were considered the pseudo-relevance feedback. Furthermore, Jiang et al. introduced document novelty to adjust document scores in retrieval. Scores of the documents previously clicked by the user were lowered based on their ranks in previous search results. Jiang et al. achieved the top rank in the TREC 2011 and TREC 2012 Session tracks.
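SDM features map directly onto Indri query syntax. The sketch below (our own illustration, not the authors' code; the 0.85/0.1/0.05 weights are the commonly cited SDM defaults, and the window size of 8 is a typical choice) builds such a query string from a raw query:

```python
def sdm_query(query, w_t=0.85, w_o=0.1, w_u=0.05, window=8):
    """Build a sequential dependence model query in Indri syntax:
    single terms, ordered bi-gram phrases (#1), and unordered
    windows (#uwN) over adjacent term pairs."""
    terms = query.split()
    pairs = list(zip(terms, terms[1:]))
    t_part = "#combine(" + " ".join(terms) + ")"
    if not pairs:
        # A one-term query has no phrase features.
        return t_part
    o_part = "#combine(" + " ".join(f"#1({a} {b})" for a, b in pairs) + ")"
    u_part = "#combine(" + " ".join(f"#uw{window}({a} {b})" for a, b in pairs) + ")"
    return f"#weight( {w_t} {t_part} {w_o} {o_part} {w_u} {u_part} )"
```

For example, `sdm_query("kursk submarine")` yields a `#weight` query combining the term, ordered-phrase, and unordered-window beliefs.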

Albakour et al. from the University of Essex [34] utilized anchor texts to expand queries. The anchor log file provided by the University of Twente2 was used as the reference to find terms on topics similar to the session. First, stop words were removed from queries in a session. Then they found the lines in the anchor log containing any

2http://wwwhome.cs.utwente.nl/hiemstra/2010/anchor-text-for-clueweb09-category-a.html


of these queries. Terms in these lines were extracted to expand the queries. The anchor text approach was proven effective and was adopted by other teams, such as the BUPT team [31].

The Nootropia model [38] was applied in another approach proposed by the University of Essex [34]. The authors built a Nootropia network based on previous search results and then re-ranked the documents retrieved for the current query. They experimented with two opposite strategies. The "positive" one assumed that previous search results in a session were relevant to the topic of the session, so the documents with higher Nootropia scores would be ranked higher in the final results. The "negative" one made the opposite assumption, that previous search results in a session dissatisfied the user who submitted the session, hence the documents with higher Nootropia scores would be ranked lower in the final results. Evaluation indicated that the "positive" strategy was more valid than the "negative" one.

The CWI team [17] presented a discount rate model for the previous queries. They assumed two classes of users: "good" users and "bad" users. A "good" user learned from previous search results in a session to generate a high-quality query, so that the current query in a session was able to express the topic of the session precisely. On the contrary, a "bad" user failed to adjust queries to fit the topic of a session; consequently, the previous queries had equal value in representing the topic of a session. Based on the above assumption, a session submitted by a "better" user received a more discounted rate for its previous queries. When only considering the queries in the session, the authors used the average number of interactions over all sessions as a standard to determine whether a session was submitted by a "good" user or a "bad" user. A session submitted by a "good" user was supposed to be finished within the average number of interactions, while a session submitted by a "bad" user was supposed to contain more interactions than the average number. After adding


the information about the previous search results, the average adjacent-interaction overlap over all sessions became the standard for differentiating sessions submitted by "good" users from those submitted by "bad" users. The authors assumed that a session submitted by a "better" user would have less overlap between its adjacent interactions.

The BUPT team [31] exploited the dwell time of documents clicked by users. They built a reference document set containing all clicked documents in a session. The dwell time of every document in the set was then transformed into attention time by an exponential decay function with respect to the rank of the document. Next, they predicted the attention time for every retrieved document based on its cosine similarity to the reference document set. Finally, the authors re-ranked the retrieved documents according to their predicted attention times.
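A rough sketch of this pipeline, under simplifying assumptions (a plain bag-of-words cosine, a single reference profile instead of a document set, and a hypothetical decay constant; not the BUPT team's actual implementation):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attention_time(dwell_time, rank, decay=0.2):
    """Transform dwell time into attention time with an exponential
    decay in the document's rank (the decay constant is hypothetical)."""
    return dwell_time * math.exp(-decay * rank)

def rerank(retrieved, reference_profile):
    """Re-rank retrieved documents by similarity to the clicked-document
    profile, as a proxy for their predicted attention time."""
    scored = [(cosine(Counter(doc.split()), reference_profile), doc)
              for doc in retrieved]
    return [doc for _, doc in sorted(scored, reverse=True)]
```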

Most teams modified the retrieval models and used query expansion [22, 23, 33, 2, 30, 50] to fit session search. However, they did not apply query formulation to generate structured queries. Structured queries can represent phrases, which can emphasize the important terms in a query. We propose to build structured queries for session search in our work.

2.2 Query Formulation

The process of query formulation modifies the original query submitted by a user [8]. The goal is to understand the user intention underlying the query more accurately. Query formulation includes spelling correction, term proximity, etc.

As a crucial component of search engines, spelling correction has been studied thoroughly [29, 20, 10]. For example, Li et al. proposed a generalized hidden Markov model to correct query spelling errors [29]. They divided spelling errors into six types. For every type, the authors designed a rule to fix the spelling error. For each


word in the query submitted by a user, they classified it into one type based on the Markov model. The parameters of the Markov model were trained using manually corrected documents.

Many structured query formulation approaches were based on n-grams, defined as contiguous sequences of n terms [3, 37, 42]. For example, Bendersky et al. focused on optimizing the weights of concepts in a query [3]. The authors first extracted bi-grams from a query as concept candidates. Then they referred to multiple information sources, such as ClueWeb09 and Wikipedia, concurrently to evaluate the relatedness between a bi-gram and the query. With this evaluation, they filtered out the meaningless bi-grams and assigned a weight to each of the remaining bi-grams, i.e., concepts. Finally, a structured query consisted of the concepts associated with weights.

Mishne et al. applied proximity terms to web retrieval. They extracted n-grams from a query and then experimented with multiple ways to define a term frequency (tf) and an inverse document frequency (idf) for an n-gram. For example, the idf could be defined as the minimum or maximum idf of the terms in a group. Finally, they applied a traditional tf-idf retrieval model with the extended tf's and idf's for n-grams.
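One of the extended idf variants described above, taking the minimum or maximum idf of an n-gram's component terms, might be sketched as follows (the function names and the fallback for unseen terms are our own illustration):

```python
import math

def idf(term, doc_freq, num_docs):
    """Standard idf; doc_freq maps each term to its document frequency.
    Unseen terms fall back to a document frequency of 1 (an assumption)."""
    return math.log(num_docs / doc_freq.get(term, 1))

def ngram_idf(ngram, doc_freq, num_docs, mode="min"):
    """An extended idf for an n-gram: the min (or max) of the idfs of
    its component terms, as in the variants described above."""
    idfs = [idf(t, doc_freq, num_docs) for t in ngram]
    return min(idfs) if mode == "min" else max(idfs)
```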

Zhao and Callan tried to identify term mismatches and fix them by expanding the query using boolean conjunctive normal form (CNF) [51]. CNF queries contain the operators "AND" and "OR" to describe relations between terms in a query. The authors experimented with two measurements, the highest inverse document frequency or the lowest probability of a term in the pseudo-relevance feedback, to diagnose mismatched terms. Evaluation showed that using the probability of a term in the pseudo-relevance feedback was more accurate. After identifying mismatched terms, they used a set of manually built CNF queries to expand the original query.

Huston and Croft detected key concepts in queries with a classifier [18]. The features used in the classifier included term frequency, inverse document frequency, residual


inverse document frequency, weighted information gain, n-gram term frequency, and query frequency. The classifier was trained using the GOV2 dataset.

Approaches using n-grams can effectively represent phrases in a query. However, some phrases have multiple forms. For example, both "pull out a book" and "pull it out" contain the phrase "pull out". Therefore, the n-gram method is sometimes too strict: if we only identify continuous terms, we may miss some relevant documents. On the contrary, it is hard for Boolean conjunctive normal form to represent phrases. On the other hand, a classifier can precisely detect key concepts in queries given a good training dataset; however, it is not easy to find a training dataset that fits all queries. We propose a relaxed method, which relies on the query itself, to predict window sizes for nuggets and then build structured queries.

2.3 Search Result Organization

Meta search engines such as Clusty (now Yippy) employ search result clustering (SRC) [1, 5, 41] to automatically organize search results into hierarchical clusters. There are two strategies for clustering search results: hierarchical clustering and monothetic concept hierarchy construction. Furthermore, external knowledge bases are increasingly exploited to improve the quality of clustering. The remainder of this subsection discusses the related work on hierarchical clustering, Subsumption, and exploiting external knowledge.

2.3.1 Hierarchical Clustering

Most search result organization work adopted clustering-based approaches, which share a common scheme: first cluster similar documents and then assign labels to the clusters [5, 9]. Clustering-based approaches often produce non-interpretable


clusters and semantically ill-formed hierarchies due to their data-driven nature and poor cluster labeling. Even in the best-known commercial clustering-based search engine, Clusty (now Yippy), which presents search results in hierarchical clusters and labels the clusters with variable-length sentences, cluster labeling remains a challenging issue.

2.3.2 Subsumption

Monothetic concept hierarchy approaches build concept hierarchies differently. They avoid cluster labeling by first extracting concepts from documents and then organizing the concepts into a hierarchy where each concept is attached to the subset of documents containing it. Hence, for document sets about similar topics, monothetic concept hierarchy approaches usually generate hierarchies with more stable nodes, which are concepts extracted from the entire document set.

The Subsumption approach [39] is a classic and state-of-the-art monothetic concept hierarchy approach. It built browsing hierarchies based on conditional probability. The authors expanded the query by Local Context Analysis [46] and then used the expanded query to retrieve documents. Terms with a high ratio of occurrence in the retrieved documents to occurrence in the collection were chosen and added to the concept set composed of query terms. For a term pair (x, y) in the concept set, x was said to subsume y if P(x|y) ≥ 0.8 and P(y|x) < 1.
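The subsumption rule can be sketched as follows, estimating P(x|y) from the sets of documents in which each concept occurs (a minimal illustration; the function name and data layout are our own):

```python
def subsumptions(doc_sets, threshold=0.8):
    """Find subsumption relations between concepts: x subsumes y if
    P(x|y) >= threshold and P(y|x) < 1, where the conditional
    probabilities are estimated from concept-document co-occurrence.
    `doc_sets` maps each concept to the set of documents containing it."""
    relations = []
    for x, dx in doc_sets.items():
        for y, dy in doc_sets.items():
            if x == y or not dx or not dy:
                continue
            both = len(dx & dy)
            p_x_given_y = both / len(dy)
            p_y_given_x = both / len(dx)
            if p_x_given_y >= threshold and p_y_given_x < 1:
                relations.append((x, y))  # x is a parent of y
    return relations
```

For instance, if "animal" occurs in every document that "dog" occurs in, but not vice versa, the rule makes "animal" the parent of "dog".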

2.3.3 Exploiting External Knowledge

Computer scientists have used Wikipedia to improve their research work because Wikipedia is compiled by thousands of experts from all over the world [16, 44, 4, 40, 27]. Not only the texts but also the links and the categories in Wikipedia are prevalently exploited.


Carmel et al. improved cluster labeling accuracy by using labels from Wikipedia [4]. They first look for a set of terms that maximizes the Jensen-Shannon divergence (JSD) distance between the specific cluster and the entire corpus. They then search Wikipedia using these terms for a list of documents. The titles and corresponding categories are picked as candidate cluster labels. After that, they rank the candidates by a Mutual Information judgment and a Score Propagation judgment. The results indicate high-quality labels.

Han et al. organized search results by topic by leveraging the knowledge that Wikipedia links provide [16]. They chose Wikipedia concepts from the link words in the retrieved Wikipedia documents. Then a semantic graph was built based on the semantic relatedness between these Wikipedia concepts. The graph was divided into communities according to the internal link density. The terms in each community represented the subtopics of the query. Finally, the search results for the query were assigned to the communities by comparing their similarity to the communities.

Wang et al. [44] constructed a thesaurus of concepts from Wikipedia and used this thesaurus to improve text classification. They used an out-link category-based measure to help decide whether or not two articles were related. The out-link categories of an article were defined as the categories to which the articles out-linked from the original one belong. Two articles were more closely related if their out-link categories overlapped more. They reported significant improvement in text classification by introducing the out-link category-based measure.

Many SRC hierarchy construction approaches are data-driven, such as the widely used hierarchical clustering algorithms. These algorithms first group similar documents into clusters and then label the clusters as hierarchy nodes. Multiple aspects in textual search results often yield mixed-initiative clusters, which reduce the stability of SRC hierarchies. Moreover, when clustering algorithms build clusters bottom-up,


small changes in leaf clusters propagate to upper levels and amplify the instability. Furthermore, hierarchy labels are automatically generated from the documents in a cluster, which is often data-sensitive, so SRC hierarchies can look even more unstable. Monothetic concept hierarchy approaches usually generate hierarchies with stable nodes. However, they often produce hierarchies short of semantic meaning because only term frequencies, not meanings, are taken into account. This work fills the gap by exploiting external knowledge to correct the relations between concepts.


Chapter 3

Effective Structured Query Formulation for Session Search

In session search, a user feeds a session into a search engine. The session includes a series of previous queries q1, q2, · · · , qn−1 with corresponding previous results D1, D2, · · · , Dn−1, and a current/last query qn. All the queries have a common underlying topic, or session topic. The search engine is expected to retrieve documents relevant to the session topic.

Each query in a session is composed of terms and phrases. In order to represent the topic of the session precisely, we extract phrases from each query and combine the words and phrases from all queries together when performing document retrieval. The Indri query language1 supports complex queries such as proximity terms and combining beliefs, which benefits building structured queries from all the queries in a session. In this work, we further expand structured queries with anchor texts. Anchor texts are texts in a document, each of which is associated with a link to another document. The research reported in this chapter has been published in the Proceedings of the 21st Text REtrieval Conference (TREC 2012) [13].

3.1 Identifying Nuggets and Formulating Structured Queries

In a query, several words sometimes bundle together as a phrase to express a coherent

meaning. We identify phrase-like text nuggets and formulate them into Lemur queries

1http://www.lemurproject.org/, version 5.0


Figure 3.1: A sample nugget in the TREC 2012 session 53 query "servering spinal cord paralysis".

for retrieval. Nuggets are substrings in a query, similar to phrases but not necessarily as semantically coherent as phrases.

Figure 3.1 shows an example of a nugget. The words "spinal" and "cord" often occur together to represent a specific concept. We discover that a valid nugget appears frequently in the top returned snippets for a query. Hence, we identify nuggets to formulate new structured queries in the Lemur query language. Particularly, we look for nuggets in the top s snippets returned by Lemur for a query q. Nuggets are identified by two methods, a strict one and a relaxed one, as described below.

3.1.1 The Strict Method

First, a query is represented as a word list q = w1w2 · · ·wn. We send this word list to Lemur and retrieve the top s snippets over an inverted index built for ClueWeb09 CatB. Then all snippets are concatenated into a reference document R.

For every bi-gram in q, we count its occurrences in R. The occurrence of a bi-gram is normalized by the smaller occurrence of the two words in the bi-gram. A bi-gram is marked as a nugget candidate if its normalized occurrence exceeds a threshold, as shown in (3.1):

    count(wiwi+1; R) / min(count(wi; R), count(wi+1; R)) ≥ θ    (3.1)


Figure 3.2: Words in a snippet built from the TREC 2012 session 53 query "servering spinal cord consequenses", where "spinal" is always connected to "cord".

where count(x; R) denotes the occurrence of x in the reference document R; wi and wi+1 are adjacent words in the query; and θ is the threshold, which is tuned to 0.97 over all of the TREC 2011 session data. For example, in the TREC 2012 session 53 query "servering spinal cord consequenses", we identify the bi-gram "spinal cord" as a candidate.

Bi-grams can connect to form longer n-grams. For instance, consider the query "hawaii real estate average resale value house or condo news" in TREC 2011 session 11. We discover that "hawaii real" and "real estate" are both marked as nugget candidates, so they can be merged into the longer sequence "hawaii real estate". On the contrary, "estate average" is not a candidate, hence we cannot append it to form "hawaii real estate average". Therefore, "hawaii real estate" is the longest sequence and is recognized as a nugget.
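The strict method, Eq. (3.1) plus the merging of overlapping candidate bi-grams, can be sketched as follows (a minimal illustration assuming whitespace tokenization; θ = 0.97 as tuned on the TREC 2011 data):

```python
from collections import Counter

def strict_nuggets(query, reference, theta=0.97):
    """Identify maximal nuggets in `query` using the strict rule of
    Eq. (3.1) against the reference document (concatenated snippets)."""
    q = query.split()
    ref = reference.split()
    unigrams = Counter(ref)
    bigrams = Counter(zip(ref, ref[1:]))
    # Mark adjacent query-word pairs whose normalized co-occurrence
    # in the reference document passes the threshold theta.
    is_candidate = []
    for w1, w2 in zip(q, q[1:]):
        denom = min(unigrams[w1], unigrams[w2])
        is_candidate.append(denom > 0 and bigrams[(w1, w2)] / denom >= theta)
    # Merge runs of candidate bi-grams into the longest possible nuggets.
    nuggets, singles, i = [], [], 0
    while i < len(q):
        j = i
        while j < len(q) - 1 and is_candidate[j]:
            j += 1
        if j > i:
            nuggets.append(q[i:j + 1])
            i = j + 1
        else:
            singles.append(q[i])
            i += 1
    return nuggets, singles
```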

Consequently, the query is broken down into nuggets and single words. All serve as the elements to build up a structured query in the Lemur query language:

    #combine(nugget1 nugget2 · · · nuggetm w1 w2 · · · wr)    (3.2)

where we suppose there are m nuggets and r single words.


Figure 3.3: Words in a snippet built from the TREC 2011 session 20 query "dooney bourke purses", where "dooney and bourke" is a brand name but the user omits the word "and".

The example nugget detection for TREC 2012 session 53 is shown in Figure 3.2. We obtain the structured query "#1(spinal cord) servering consequenses".

3.1.2 The Relaxed Method

The operator #1 is a strict structure operator and may miss relevant documents. For example, the queries in TREC 2011 session 20 all contain "dooney bourke". However, "dooney and bourke" is a brand name that is sometimes written as "dooney bourke". We would miss relevant documents containing the phrase "dooney and bourke" if we formulated the query as "#1(dooney bourke)". Hence, we introduce a relaxed method for query formulation. We relax the constraints based on the intuition that the distance between two words reflects their associativity. Particularly, we first retrieve the reference document R as in Section 3.1.1. Every word's position in the snippet is


Figure 3.4: nDCG@10 values of retrieved documents using the TREC 2011 Session track dataset. Two cases, with threshold and without threshold, are compared.

marked as shown in Figure 3.3. We then estimate the centroid of a word wi by

    x̄(wi) = Σj xj(wi; R) / count(wi; R)    (3.3)

where R is the reference document, i.e., the concatenated top s snippets; xj(wi; R) is the position of the jth instance of wi in R; and count(wi; R) is the occurrence of wi in R.

For every bi-gram in a query, the distance between the estimated centroids of its words is calculated. We predict the window size (X in #X) of a nugget based on this distance. Intuitively, it is reasonable to assume that the window size is proportional to the


distance between their estimated centroids, which can be written as:

    nugget = #⌈|x̄(wi) − x̄(wi+1)| / ξ⌉ (wi wi+1)    (3.4)

where ξ is an empirical factor. However, some terms in a query do not form nuggets.

The distance between the centroids of these terms may be very long so as to generate

a large window size, which could be noise and hurt the search precision. Therefore,

we set a threshold to �lter out those term pairs with too far centroids. Figure 3.4

compares the nDCG@10 values of the retrieved documents over TREC 2011 sessions,

with and without threshold respectively. It shows that the precision greatly increases

with the threshold. A decision tree can be derived from Eq (3.4) with the threshold:

nugget = #1(wi wi+1)    if |x̄(wi) − x̄(wi+1)| ≤ ξ
         #2(wi wi+1)    if ξ < |x̄(wi) − x̄(wi+1)| ≤ 2ξ
         ∅              if |x̄(wi) − x̄(wi+1)| > 2ξ    (3.5)

where we set the threshold to 2ξ by experiments, i.e., we only keep nuggets with a window size no larger than 2. A structured query is then formulated as in Eq. (3.2).

We tune ξ from 2 to 8 using the TREC 2011 Session track dataset. Figure 3.4 shows the nDCG@10 values for different ξ. We find that the precision of session search is not sensitive to the value of ξ. Hence, we choose the ξ value with the largest nDCG@10, which is 5.

For the above query "dooney bourke purses", Figure 3.3 shows the procedure of generating the structured query "#2(dooney bourke) purses".
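The relaxed nugget identification above can be sketched in a few lines of Python. This is an illustrative implementation only, not the thesis code: the function names, the toy position index, and the handling of trailing words are our own assumptions, with ξ set to the tuned value of 5.

```python
from math import ceil

def centroid(positions):
    """Mean position of a word's occurrences in the reference document R (Eq 3.3)."""
    return sum(positions) / len(positions)

def formulate_nuggets(terms, pos_index, xi=5.0):
    """Relaxed nugget identification (Eqs 3.4 and 3.5).

    pos_index maps each query word to the list of its positions in R
    (the concatenated snippets); xi is the empirical factor tuned to 5.
    Bi-grams whose centroid distance is within 2*xi become #1 or #2
    nuggets; farther pairs are left as single words.
    """
    parts, i = [], 0
    while i < len(terms) - 1:
        w1, w2 = terms[i], terms[i + 1]
        d = abs(centroid(pos_index[w1]) - centroid(pos_index[w2]))
        win = ceil(d / xi)
        if win <= 2:                       # distance <= 2*xi: keep as a nugget
            parts.append(f"#{max(win, 1)}({w1} {w2})")
            i += 2
        else:                              # centroids too far apart: no nugget
            parts.append(w1)
            i += 1
    if i < len(terms):                     # trailing single word
        parts.append(terms[i])
    return " ".join(parts)
```

With made-up positions in which "dooney" and "bourke" occur close together, the sketch reproduces the "#2(dooney bourke) purses" example above.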


3.2 Query Aggregation within a Session

A session contains multiple queries, from each of which we can build a structured query. Therefore, we aggregate over all queries in a session to generate a large structured query. We first obtain a set of nuggets and single words qk = {nuggetik, wjk} from every query by the approach presented in Section 3.1. Then we merge these nuggets to form a structured query:

#weight( λ1 #combine(nugget11 nugget12 · · · nugget1m w11 w12 · · · w1r)
         λ2 #combine(nugget21 nugget22 · · · nugget2m w21 w22 · · · w2r)
         · · ·
         λn #combine(nuggetn1 nuggetn2 · · · nuggetnm wn1 wn2 · · · wnr) )    (3.6)

where λk denotes the weight of query qk. Note that the last #combine is for the

current query qn.

3.2.1 Aggregation Schemes

Three weighting schemes are designed to determine the weight λk, namely uniform,

previous vs. current, and distance-based.

• uniform. Queries are assigned the same weight, i.e., λk = 1.

• previous vs. current. All previous queries share the same weight while the current query uses a complementary, higher weight. In particular, we define:

λk = λp        for k = 1, 2, · · · , n − 1
λk = 1 − λp    for k = n    (3.7)


where λp is tuned to be 0.4 on TREC 2011 session track data.

• distance-based. The weights are distributed based on how far a query's position in the session is from the current query. We use a reciprocal function to model it:

λk = λp / (n − k)    for k = 1, 2, · · · , n − 1
λk = 1 − λp          for k = n    (3.8)

where λp is tuned to be 0.4 based on the TREC 2011 Session track data and k is the position of the query in the session.
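The aggregation in Eq. (3.6) under the three weighting schemes can be sketched as follows. The query-string format follows the equations above; the function name, the scheme labels, and the weight formatting are our own illustrative choices, not the thesis implementation.

```python
def session_query(per_query_parts, scheme="pvc", lam_p=0.4):
    """Aggregate per-query nuggets and words into one session-wise query (Eq 3.6).

    per_query_parts[k] holds the nuggets and single words of the (k+1)-th
    query, oldest first; the last entry is the current query q_n.
    scheme selects the weighting: "uniform", "pvc" (previous vs. current,
    Eq 3.7), or "distance" (reciprocal decay, Eq 3.8). lam_p is the
    tuned weight 0.4.
    """
    n = len(per_query_parts)

    def weight(k):                      # k is the 1-based query position
        if scheme == "uniform":
            return 1.0
        if k == n:                      # current query gets the higher weight
            return 1 - lam_p
        if scheme == "pvc":             # all previous queries share lam_p
            return lam_p
        return lam_p / (n - k)          # weight decays with distance from q_n

    clauses = [f"{weight(k):.4g} #combine({' '.join(parts)})"
               for k, parts in enumerate(per_query_parts, start=1)]
    return "#weight( " + " ".join(clauses) + " )"
```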

3.3 Query Expansion by Anchor Text

A session also provides the previous search results, which are pages relevant to the previous queries. An anchor text pointing to a page often provides a valuable human-created description of that page [34], as shown in Figure 3.5, which enables us to expand a query with words from anchor texts. An anchor log is extracted by harvestlinks in the Lemur toolkit.

We collect anchor texts for all previous search results and sort them by term frequency in decreasing order. The top 5 most frequent anchor texts are appended to the structured query generated in Section 3.2, each with a weight proportional to its term frequency.

Figure 3.5: Anchor text in a web page.

#weight( λ1 #combine(nugget11 nugget12 · · · nugget1m w11 w12 · · · w1r)
         λ2 #combine(nugget21 nugget22 · · · nugget2m w21 w22 · · · w2r)
         · · ·
         λn #combine(nuggetn1 nuggetn2 · · · nuggetnm wn1 wn2 · · · wnr)
         βω1 #combine(e1) βω2 #combine(e2) · · · βω5 #combine(e5) )    (3.9)

where ei (i = 1 · · · 5) are the top 5 anchor texts, ωi denotes the frequency of anchor text ei normalized by the maximum frequency, and β is a factor that adjusts the influence of the anchor texts, tuned to be 0.1 based on the TREC 2011 session data.

For example, in TREC 2012 session 53, the anchor texts with the top frequencies are "type of paralysi", "quadriplegia paraplegia", "paraplegia", "spinal cord injury", and "quadriplegic tetraplegic", hence the final structured query becomes "#weight(1.0 #1(spinal cord) 0.6 consequenses 0.4 paralysis 1.0 servering 0.380723 #combine(type of paralysi) 0.004819 #combine(quadriplegia paraplegia) 0.004819 paraplegia 0.004819 #combine(spinal cord injury) 0.00241 #combine(quadriplegic tetraplegic) )", where the last five weighted clauses come from the anchor texts.
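A minimal sketch of the anchor-text expansion in Eq. (3.9), assuming the anchor frequencies have already been counted from the anchor log. The function name and the example counts in the test are hypothetical, not data from the experiments.

```python
def expand_with_anchors(base_clauses, anchor_counts, beta=0.1, top_k=5):
    """Append the top-k most frequent anchor texts to the session query,
    each weighted by beta times its frequency normalized by the maximum
    frequency (Eq 3.9). anchor_counts maps anchor text -> frequency."""
    top = sorted(anchor_counts.items(), key=lambda kv: -kv[1])[:top_k]
    max_f = top[0][1]                       # normalize by the maximum frequency
    anchors = [f"{beta * f / max_f:.6g} #combine({text})" for text, f in top]
    return "#weight( " + " ".join(base_clauses + anchors) + " )"
```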

3.4 Removing Duplicated Queries

The trace of how a user modifies queries in a session may suggest the user's intention, so it can be exploited to study the real information need of the user. We notice that sometimes a user repeats a previous query, producing duplicated queries. Thus, we make two assumptions to refine the final structured query as follows.

• If a previous query is the same as the current query qn, we only use the current query to generate the final structured query. For example, in TREC 2011 session 22, the current query "shoulder joint pain" is the same as the first query. A possible reason is that the search results for the intermediate queries did not satisfy the user, so the user returned to one of the previous queries.

• If multiple previous queries are duplicated but they are all different from qn, we remove these queries when formulating the final structured query. For example, in TREC 2011 session 60, the query "non-extinct marsupials" occurs three times and the query "marsupial manure" occurs twice. Using all of these duplicate queries could bias the search results.

In duplicate detection, we consider one special situation. If a substring is the abbreviation of another one, we consider the two queries duplicated. For example, the only difference between the queries "History of DSEC" and "History of dupont science essay contest" is "DSEC" versus "dupont science essay contest", where the former is the abbreviation of the latter; hence the queries are considered duplicates. To detect abbreviations, we scan a query string and split a word into letters if the word is entirely uppercase. In the example above, the first query is transformed to "History of D S E C". When comparing two queries, two words in corresponding positions are considered the same if one of them is a single capital letter and both start with the same letter. In the above example, "dupont" and "D" are considered the same.
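The abbreviation-aware duplicate check just described can be sketched as follows; the function names and the exact matching details (case handling, length check) are our own illustrative assumptions.

```python
def normalize(query):
    """Split all-uppercase words into single letters so that an
    abbreviation can be aligned with its expansion word by word."""
    out = []
    for w in query.split():
        if w.isupper() and len(w) > 1:   # e.g. "DSEC" -> "D", "S", "E", "C"
            out.extend(list(w))
        else:
            out.append(w)
    return out

def is_duplicate(q1, q2):
    """Two queries are duplicates if, position by position, the words are
    equal, or one is a single capital letter matching the other's initial."""
    a, b = normalize(q1), normalize(q2)
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x.lower() == y.lower():
            continue
        if len(x) == 1 and x.isupper() and y.lower().startswith(x.lower()):
            continue
        if len(y) == 1 and y.isupper() and x.lower().startswith(y.lower()):
            continue
        return False
    return True
```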

3.5 Document Re-ranking

A user tends to stay longer on a page that he or she is interested in [11, 14, 47]. We use dwell time, defined as the elapsed time that a user stays on a page, to re-rank the search results retrieved by the structured query generated in Section 3.4.

Click information in a session is associated with a start time ts and an end time te. Therefore, the dwell time ∆t can be derived as te − ts. In a session, we retrieve all clicked pages ci together with their dwell times ∆ti. For each document dj returned for the structured query generated after Section 3.4, its cosine similarity to each ci is computed. We calculate the score of dj by

s(dj) = ∑i Sim(dj, ci) · ∆ti    (3.10)


where Sim(dj, ci) is the cosine similarity between dj and ci. We rank dj by s(dj) in decreasing order as the final search results.

In our experiments, the raw dwell time strongly biases the document weights towards documents similar to pages with long dwell time, which means that satisfying visits receive much higher weights. For example, if a user has viewed a document for more than 30 seconds, we consider the user satisfied by this document. On the contrary, if the dwell time of a document is only a few seconds, the user might have just glimpsed at this document and found the content not relevant. Since the dwell time is multiplied by the similarity, the former document achieves a much higher score than the latter.
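Eq. (3.10) can be sketched with sparse term-frequency vectors as follows; the vectors and dwell times in the test are toy values, not data from the experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(doc_vecs, clicked_vecs, dwell):
    """Score each retrieved document by Eq (3.10),
    s(d_j) = sum_i Sim(d_j, c_i) * dt_i, and sort in decreasing order."""
    scores = {d: sum(cosine(vec, cvec) * dwell[c]
                     for c, cvec in clicked_vecs.items())
              for d, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```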

3.6 Evaluation for Session Search

We participated in the TREC 2012 Session track and submitted three runs using different approach combinations, which are listed in Table 3.4. The four subtasks, RL1, RL2, RL3, and RL4, are described in Section 2.1. In the evaluation results officially released by NIST [26], we achieved the highest improvement from RL1 to RL2. Our retrieval results for RL2–RL4 ranked second among the participants.

3.6.1 Datasets, Baseline, and Evaluation Metrics

We build an inverted index over ClueWeb09 CatB. An anchor log is acquired by applying harvestlinks over ClueWeb09 CatA, since the official previous search results are from CatA. Previous research demonstrates that the ClueWeb09 collection contains many spam documents. We filter out documents whose Waterloo "GroupX" spam ranking score2 is less than 70 [7].

2http://durum0.uwaterloo.ca/clueweb09spam/


The Lemur search engine is employed in our experiments as the baseline. Lemur's language model based on the Bayesian belief network is applied [35]. The language model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution [49]:

pµ(w|d) = (c(w; d) + µ p(w|C)) / (∑w c(w; d) + µ)    (3.11)

where c(w; d) denotes the number of occurrences of term w in document d, p(w|C) is the collection language model, and µ is the smoothing parameter, which is tuned based on the 2011 session data.
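Eq. (3.11) amounts to one line of code. The following sketch assumes term counts and a collection model are given as dictionaries; the function and parameter names are illustrative.

```python
def dirichlet_lm(word, doc_tf, doc_len, p_coll, mu=4000):
    """Dirichlet-smoothed document language model (Eq 3.11):
    p_mu(w|d) = (c(w; d) + mu * p(w|C)) / (|d| + mu),
    where |d| = sum_w c(w; d) is the document length."""
    return (doc_tf.get(word, 0) + mu * p_coll.get(word, 0.0)) / (doc_len + mu)
```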

The metrics provided by the TREC 2012 Session track [26] are used to evaluate retrieval performance: Expected Reciprocal Rank (ERR), ERR@10, ERR normalized by the maximum ERR per query (nERR), nERR@10, normalized discounted cumulative gain (nDCG), nDCG@10, Average Precision (AP), and Precision@10. nDCG@10 serves as the primary metric and is defined as [21]:

nDCG@10 = [ ∑i=1..10 rel(i) / (1 + log2(i)) ] / [ ∑i=1..10 rel*(i) / (1 + log2(i)) ]    (3.12)

where rel(i) is the relevance score of the document at rank i in the ranked list retrieved for a session, and rel*(i) denotes the relevance score of the document at rank i in the ideal ranked list for the session. Only the top 10 documents are taken into account because search engines usually display the top 10 relevant documents on the first page, which are the most attractive to a user.
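Eq. (3.12) can be written directly as code. This sketch follows the discount 1 + log2(i) literally (so the document at rank 1 is discounted by 1); the function name and the relevance lists in the test are our own illustrative choices.

```python
import math

def ndcg_at_10(run_rels, ideal_rels):
    """nDCG@10 as in Eq (3.12): the gain at 1-based rank i is
    rel(i) / (1 + log2(i)), normalized by the same sum over the
    ideal ranking rel*(i)."""
    def dcg(rels):
        return sum(r / (1 + math.log2(i))
                   for i, r in enumerate(rels[:10], start=1))
    return dcg(run_rels) / dcg(ideal_rels)
```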

3.6.2 Results for TREC 2011 Session Track

For RL1, where only the current query qn is available, we generate a structured query from qn by the approach described in Section 3.1 and send it to Lemur. The Dirichlet parameter µ and the number of pseudo relevance feedback documents f are tested on


Table 3.1: nDCG@10 for TREC 2011 Session track RL1. The Dirichlet smoothing method is used; µ = 4000, f = 10 for the strict method and µ = 4000, f = 20 for the relaxed method. Methods are compared to the baseline (original query). A significant improvement over the baseline is indicated with a † at the p < 0.05 level and a ‡ at the p < 0.005 level (t-test, single-tailed). The best run in TREC 2011 is listed for comparison.

Method     original query   strict     relaxed    TREC Best
nDCG@10    0.3378           0.3834     0.3979     0.3789
%chg       0.00%            13.50%†    17.79%‡    12.17%

the TREC 2011 session data. The documents retrieved by directly searching qn serve as the baseline. Table 3.1 shows the nDCG@10 results for RL1 on TREC 2011. By formulating structured queries using nuggets, we greatly boost the search accuracy over the baseline, by 13.50%. The relaxed method achieves an even better search accuracy of 0.3979 (+17.79%).

For RL2, we apply query expansion with the previous queries as explained in Section 3.2. We observe that the strict method performs much better here, because the window size in the relaxed method is hard to optimize across multiple queries. Table 3.2 presents the nDCG@10 for RL2 on the TREC 2011 session data. We find that the "previous vs. current" scheme gives the best search accuracy. It is worth noting that the distance-based scheme performs even worse than the uniform scheme, which implies that the evolution of user intention is complex and we cannot assume that an early query has less importance in the entire session.

For RL3 and RL4, we combine several methods, including anchor texts, removing duplicated queries, and re-ranking by dwell time. Table 3.3 displays the nDCG@10 for RL3 and RL4 on the 2011 Session track data. It illustrates that removing duplicated queries significantly improves the performance. However, neither re-ranking nor only


Table 3.2: nDCG@10 for TREC 2011 Session track RL2. The Dirichlet smoothing method and the strict method are used; µ = 4000, f = 5 for uniform, µ = 4500, f = 5 for previous vs. current (PvC) and distance-based. Methods are compared to the baseline (original query). A significant improvement over the baseline is indicated with a † at the p < 0.05 level and a ‡ at the p < 0.005 level (t-test, single-tailed). The best run in TREC 2011 is listed for comparison.

Scheme     original query   uniform    PvC        distance-based   TREC Best
nDCG@10    0.3378           0.4475     0.4626     0.4431           0.4281
%chg       0.00%            32.47%‡    36.94%‡    31.17%‡          26.73%

Table 3.3: nDCG@10 for TREC 2011 Session track RL3 and RL4. All runs use the strict method and the configuration µ = 4500, f = 5. Methods are compared to the baseline (original query, nDCG@10 = 0.3378). A significant improvement over the baseline is indicated with a † at the p < 0.05 level and a ‡ at the p < 0.005 level (t-test, single-tailed). The best TREC 2011 runs for RL3 and RL4 are listed for comparison.

                         anchor text, all documents   anchor text, clicked documents   TREC Best
Method                   nDCG@10   %chg               nDCG@10   %chg                   nDCG@10
all queries              0.4695    38.99%‡            0.4680    38.54%‡                RL3: 0.4307
remove duplicate         0.4836    43.16%‡            0.4542    34.46%‡                –
re-rank by dwell time    0.4435    31.29%‡            –         –                      RL4: 0.4540

considering clicked documents contributes to the results. The reason may be that we calculate cosine similarity based on the full text of documents, which perhaps introduces a lot of noise.

3.6.3 Results for TREC 2012 Session Track

We submitted three runs to the TREC 2012 Session track. The run names, methods, and parameters are listed in Table 3.4, where µ is the Dirichlet smoothing parameter and f is the number of pseudo relevance feedback documents.


Table 3.4: Methods and parameter settings for TREC 2012 Session track. µ is the Dirichlet smoothing parameter, f is the number of pseudo relevance feedback documents.

guphrase1
  RL1: strict method; µ = 4000, f = 10
  RL2: strict method, query expansion; µ = 4500, f = 5
  RL3: strict method, query expansion, anchor text, remove duplicates; µ = 4500, f = 5
  RL4: strict method, query expansion, anchor text, all queries; µ = 4500, f = 5

guphrase2
  RL1: strict method; µ = 3500, f = 10
  RL2: strict method, query expansion; µ = 5000, f = 5
  RL3: strict method, query expansion, anchor text, remove duplicates; µ = 5000, f = 5
  RL4: strict method, query expansion, anchor text, all queries; µ = 5000, f = 5

gurelaxphr
  RL1: relaxed method; µ = 4000, f = 20
  RL2: relaxed method, query expansion; µ = 4500, f = 20
  RL3: relaxed method, query expansion, anchor text, remove duplicates; µ = 4500, f = 20
  RL4: strict method, query expansion, anchor text, re-ranking by dwell time; µ = 4500, f = 5

Table 3.5: nDCG@10 for TREC 2012 Session track. The mean of the medians of the evaluation results in TREC 2012 is listed.

run    original query   guphrase1   guphrase2   gurelaxphr   TREC Best
RL1    0.2474           0.2298      0.2265      0.2334       0.2615
RL2    0.2474           0.2932      0.2839      0.2832       0.3100
RL3    0.2474           0.3021      0.2995      0.3033       0.3221
RL4    0.2474           0.3021      0.2995      0.2900       0.3153

Table 3.6: AP for TREC 2012 Session track. The mean of the medians of the evaluation results in TREC 2012 is listed.

run    original query   guphrase1   guphrase2   gurelaxphr   TREC Best
RL1    0.1274           0.1185      0.1186      0.1223       0.1286
RL2    0.1274           0.1466      0.1457      0.1455       0.1496
RL3    0.1274           0.1490      0.1483      0.1482       0.1538
RL4    0.1274           0.1490      0.1483      0.1467       0.1542


Figure 3.6: Changes in nDCG@10 from RL1 to RL2 presented by TREC 2012 Session track. Error bars are 95% confidence intervals (Figure 1 in [26]).

The evaluation results for nDCG@10 and Average Precision (AP) released by TREC are presented in Table 3.5 and Table 3.6. They show trends similar to what we observe on the TREC 2011 data, but in a much lower range, sometimes even beneath the results using the original query. This may imply that our query formulation methods overfit the TREC 2011 session data. Nonetheless, using previous queries and eliminating duplicates continues to demonstrate significant improvement in search accuracy.

3.6.4 Official Evaluation Results for TREC 2012 Session Track

TREC 2012 Session track presented the evaluation for all participants [26]. The runs

were compared on both the individual subtasks and the improvements between the


Figure 3.7: All results by nDCG@10 for the current query in the session for each subtask (Table 2 in [26]).


pairs of subtasks. Our runs achieved the highest improvement from RL1 to RL2, as shown in Figure 3.6 (Figure 1 in [26]). This improvement ranked us second among all the groups in RL2, RL3, and RL4, as shown in Figure 3.7 (Table 2 in [26]). The evaluation results demonstrate the effectiveness of query formulation that combines nuggets and user interaction information in session search.

3.7 Chapter Summary

In this chapter, we describe an approach to building effective structured queries for session search. The concept of nuggets is introduced to represent the phrase-like semantic units in a query. A window size can be predicted for a nugget by the relaxed method of nugget identification. Nuggets from all queries are combined by three aggregation schemes. Experiments indicate that injecting nuggets into all queries in a session increases the search accuracy significantly. Removing duplicated queries in a session improves the search accuracy even further. In addition, nuggets from the current query are more important than those from previous queries. However, no evidence shows a difference in the importance of the previous queries.

Query expansion and document re-ranking are applied to make additional progress in search accuracy. Moreover, we design two rules to remove duplicate queries within a session, which improves the search accuracy effectively. All these techniques ranked our results second among all the participants in the subtasks that involve a session in the TREC 2012 Session track.


Chapter 4

Increasing Stability of Result Organization for Session Search

The relatedness of the queries in a session requires high stability of the search result organization. In order to improve the stability of SRC hierarchies, we present an original system framework based on the monothetic concept hierarchy approach. In particular, we first extract concepts from the document set. Then the hierarchies are built according to statistics of the concepts in the document set, such as document frequency. Additionally, we apply the category information in Wikipedia to regulate the parent-child relationship between pairs of concepts.

It is worth mentioning that we investigate how to increase the stability of concept hierarchies by considering only the current query and its search results. One may argue that the instability issue could be resolved by considering the queries in the same session all together when building SRC hierarchies. However, in Web search, session membership is not always available; therefore, our task is more consistent with the real application. Moreover, our task is to independently generate similar hierarchies for similar queries, which places more challenges in front of us. Furthermore, our algorithms can be extended to include other queries in the session if the session segmentation is known.

The research results reported in this chapter have been published in Proceedings

of the 35th European Conference on Information Retrieval (ECIR 2013) [12].


Figure 4.1: Framework overview of the Wikipedia-enhanced concept hierarchy construction system.

4.1 Utilizing External Knowledge to Increase Stability of Search

Result Organization

We propose to exploit external knowledge to increase the stability of SRC hierarchies. Wikipedia, a broadly used knowledge base, is used as the main source of external knowledge. We refer to each article in Wikipedia as a page, which usually discusses a single topic. The title of a page is called an entry. Every entry belongs to one or more categories. The categories in Wikipedia are organized following subsumption (also called is-a) relations; together, all Wikipedia categories form a network that consists of many connected hierarchies.

Our framework consists of three components: concept extraction, identification of reference Wikipedia entries, and relationship construction, as shown in Figure 4.1. Initially, the framework takes in a single query q and its search results D and extracts the concept set C that best represents D by an efficient version of [48], Chapter 4. Next, for each concept c ∈ C, the framework identifies its most relevant Wikipedia entry e, which is called the reference Wikipedia entry. Finally, relationship construction adopts two schemes to incorporate Wikipedia category information: one applies Subsumption [39] first and then refines the relationships according to Wikipedia categories, while the other connects the concepts purely based on Wikipedia. We present mapping to reference Wikipedia entries in Section 4.2, followed by enhancing Subsumption with Wikipedia in Section 4.3 and constructing hierarchies purely based on Wikipedia in Section 4.4.

4.2 Identifying Reference Wikipedia Entries

Given the set of concepts C acquired by concept extraction, we identify the reference Wikipedia entry for each concept. In particular, we first obtain potential Wikipedia entries by retrieval. We employ the Lemur toolkit to build an index from the entire Wikipedia collection in the ClueWeb09 CatB dataset. Each concept c ∈ C is sent as a query to the index and the top 10 returned Wikipedia pages are kept. The titles of these pages are considered Wikipedia entry candidates for c, denoted {ei}, i = 1 · · · 10.

We then select the most relevant Wikipedia entry as the reference Wikipedia entry. Although we have obtained a ranked list of Wikipedia pages for c, the top result is not always the best-suited Wikipedia entry for the search session. For instance, TREC 2010 session 3 is about "diabetes education"; the top Lemur-returned Wikipedia entry for the concept "GDM" is "GNOME Display Manager", which is not relevant, whereas the second-ranked entry "Gestational diabetes" is relevant. We propose to disambiguate among the top returned Wikipedia entries by the following measures.


Figure 4.2: Mapping to the relevant Wikipedia entry. Text in circles denotes Wikipedia entries, while text in rectangles denotes concepts. Based on the context of the current search session, the entry "Gestational diabetes" is selected as the most relevant Wikipedia entry. Therefore the concept "GDM" is mapped to "Gestational diabetes", whose supercategories are "Diabetes" and "Health issues in pregnancy".

Cosine Similarity. Selected by the concept extraction component, most concepts in C are meaningful phrases that map exactly to a Wikipedia entry. However, many multiple-word concepts and entries only partially match each other. If they partially match with a good portion, they should still be considered matched. We therefore measure the similarity between a concept c and its candidate Wikipedia entries by cosine similarity. In particular, we represent the concept and the entry as term vectors after stemming and stop-word removal. If a candidate entry, i.e., the title of a Wikipedia page, starts with "Category:", we remove the prefix "Category:". The cosine similarity of c and Wikipedia entry candidate ei is:

Sim(c, ei) = (vc · vei) / (|vc| |vei|)    (4.1)


where vc and vei are the term vectors of c and ei, respectively.

Mutual Information. To resolve the ambiguity among Wikipedia entry candidates, we select the entry that best fits the current search query q and its search results D. For example, in Figure 4.2, the concept "GDM" could mean "GNOME Display Manager" or "Gestational Diabetes Mellitus". Given the query "diabetes education", only the latter is relevant. We need a measure that indicates the similarity between a candidate entry ei and the search query. Since the concept set C can be used to represent the search results D, we convert this problem into measuring the similarity between ei and C. We calculate the mutual information MI(ei, C) between an entry candidate ei and the extracted concept set C as described in [4], but with a modified formula for calculating the weight of a concept:

w(c) = log(1 + ctf(c)) · idf(c)    (4.2)

where ctf(c) is the term frequency of concept c with regard to the entire document set, and idf(c) is the inverse document frequency of concept c with regard to the entire document set. It is worth noting that [4] clustered the document set first; therefore, the weight formula in [4] counted the term frequency of a concept with regard to the cluster to which the concept belongs. In addition, the weight formula in [4] slightly biased the weights of terms distributed over many cluster documents by multiplying an extra factor

cdf(c, L) = log(N(c, L) + 1)    (4.3)

where N(c, L) is the document frequency of concept c with regard to the cluster L.

Finally, we aggregate the scores. Each candidate entry is scored by a linear combination of cosine similarity and MI:

score(ei) = α Sim(c, ei) + (1 − α) MI(ei, C)    (4.4)


where α is set to 0.8 empirically. The aggregated score considers both the word similarity and the topic relevancy of a candidate entry. The highest-scored candidate entry is selected as the reference Wikipedia entry. Figure 4.2 illustrates the procedure of finding the reference Wikipedia entry.
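The candidate scoring of Eq. (4.4) reduces to a weighted argmax. In the sketch below, the function name is ours, and the similarity and MI values in the test are hypothetical, chosen only to mirror the GDM example above.

```python
def pick_reference_entry(candidates, cos_sim, mi, alpha=0.8):
    """Score each candidate Wikipedia entry by Eq (4.4),
    alpha * Sim + (1 - alpha) * MI, and return the highest-scored one."""
    return max(candidates,
               key=lambda e: alpha * cos_sim[e] + (1 - alpha) * mi[e])
```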

4.3 Improving Stability of Subsumption

Subsumption is a popular approach for building concept hierarchies [39]. It identifies the is-a relationship between two concepts based on conditional probabilities: concept x subsumes concept y if 0.8 < P(x|y) < 1 and P(y|x) < 1. The main weakness of Subsumption is that a minor fluctuation in document frequency may lead to the opposite conclusion. For example, in the search results for the query "diabetes education", two concepts, "type 1 diabetes" and "type 2 diabetes", show very similar document frequencies. Small changes in the search result documents may completely flip the decision from "type 1 diabetes" subsuming "type 2 diabetes" to "type 2 diabetes" subsuming "type 1 diabetes". Neither conclusion is reliable or stable. In this work, we propose to inject Wikipedia category information into Subsumption to build more stable hierarchies.

First, we build a concept hierarchy by Subsumption. For the sake of efficiency, we sort all concepts in C by their document frequencies in D from high to low, and compare the document frequency of a concept c with every concept that has a higher document frequency than c. Since the concepts are all relevant to the same session, we slightly relax the decision condition in Subsumption: for concepts x and y with document frequencies dfx > dfy, we say x potentially subsumes y if

log(1 + dfy) / log(1 + dfx) > 0.6    (4.5)


where dfx and dfy are the document frequencies of concepts x and y, respectively, evaluated in D.

Second, based on the reference Wikipedia entries ex and ey for concepts x and y, we evaluate every potential subsumption pair (x, y) in the following cases:

• ex is marked as a Wikipedia category: we extract from ey's Wikipedia page the Wikipedia categories that ey belongs to, including the case that ey itself is a Wikipedia category. Note that ey may have multiple categories. The list of Wikipedia categories for ey is called the super-categories of ey and is denoted Sy. That x subsumes y is confirmed if ex ∈ Sy.

• Neither ex nor ey is marked as a Wikipedia category: we extract the Wikipedia categories that contain ey (ex) to form its super-category set Sy (Sx). For each syi ∈ Sy, we again extract its super-categories and form the super-super-category set SSy for ey. Next we calculate a subsumption score by counting the overlap between SSy and Sx, normalized by the smaller size of the two sets. The subsumption score for concepts x and y is defined as:

Scoresub(x, y) = count(s; s ∈ Sx and s ∈ SSy) / min(|Sx|, |SSy|)    (4.6)

where count(s; s ∈ Sx and s ∈ SSy) denotes the number of categories that appear in both Sx and SSy. If Scoresub(x, y) for a potential subsumption pair (x, y) passes a threshold (set to 0.6), x subsumes y.

• ey is marked as a Wikipedia category but ex is not: The potential subsumption

relationship between x and y is canceled.

By employing Wikipedia to refine and expand the relationships identified by Subsumption, we remove the majority of the noise in hierarchies built by Subsumption. Figure 4.3 demonstrates this procedure.


Figure 4.3: An example of Wikipedia-enhanced Subsumption. The concepts "Diabetes" and "type 2 diabetes" satisfy Eq. (4.5) and are identified as a potential subsumption pair. The reference Wikipedia entry of "Diabetes" is a category, and the reference Wikipedia entry of "type 2 diabetes" is the Wikipedia entry "Diabetes mellitus type 2". Therefore we check whether "Diabetes" is one of the super-categories of "Diabetes mellitus type 2" and confirm that "Diabetes" subsumes "type 2 diabetes".

4.4 Building Concept Hierarchy Purely Based on Wikipedia

This section describes how to build SRC hierarchies purely based on Wikipedia. We observed that categories on the same topic often share common super-categories or common subcategories. This inspired us to create hierarchies by joining Wikipedia subtrees. The algorithm is described as follows:

First, identify the start categories. For each concept c ∈ C, we collect all Wikipedia categories that c's reference Wikipedia entry belongs to. We call these categories start categories. If an entry is itself marked as a category, it is the start category.

Second, expand from the start categories. For each start category, we extract its subcategories from its Wikipedia page. Among these subcategories, we choose those relevant to the current query for further expansion. The relevance of (e_i, q) is measured by the MI measure described in Section 4.2. The subcategories with an MI


Figure 4.4: An example of Wikipedia-only hierarchy construction. From the concept "Diabetes mellitus" we find the reference Wikipedia entry "Diabetes mellitus", then we find its start category "Diabetes". Similarly, for another concept, "joslin", we find its reference Wikipedia entry "Joslin Diabetes Center" and its start category "Diabetes organizations". We then expand from these two start categories. "Diabetes organizations" is one of the subcategories of "Diabetes", thus we merge them together.

score higher than a threshold (set to 0.9) are kept. For the sake of efficiency as well as hierarchy quality, we expand the subcategories to at most three levels. Since concepts in the search session share many start categories, expanding to a limited number of levels hardly misses relevant categories. At the end of this step, we have generated a forest of trees consisting of all concepts in C as well as their related Wikipedia categories.

Third, select the right nodes to merge the trees. We apply the MI score described in Section 4.2 to determine which super-category fits the search session and assign the common node as its child. For example, the start categories "Diabetes" and "Medical and health organizations by medical condition" share a common child node, "Diabetes organizations", which is a start category too. "Diabetes" is selected as the super-category of "Diabetes organizations". Trees that have common nodes get connected and form a larger hierarchy.


Last, clean up the hierarchy. For every internal node in the joined structure, we traverse downwards to the leaves. Along the way, we trim the nodes that have no offspring in the concept set C, eliminating noise that is irrelevant to the current query. Figure 4.4 shows the Wikipedia-only algorithm.
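The expansion and merging steps above can be sketched as below. `start_cats`, `sub_cats`, and `mi_score` are hypothetical stand-ins for the Wikipedia lookups and the MI measure of Section 4.2, and the final leaf-trimming step is omitted for brevity:

```python
def build_wiki_hierarchy(concepts, start_cats, sub_cats, mi_score, query,
                         mi_threshold=0.9, max_depth=3):
    """Sketch of the Wikipedia-only construction. Each concept's start
    categories are expanded for at most max_depth levels, keeping only
    subcategories whose MI score with the query passes the threshold.
    Trees that share a node merge automatically because all
    child -> parent edges live in a single map."""
    parent = {}
    frontier = [(cat, 0) for c in concepts for cat in start_cats(c)]
    while frontier:
        cat, depth = frontier.pop()
        if depth >= max_depth:
            continue  # expand to three levels at most
        for sub in sub_cats(cat):
            # keep only subcategories relevant to the current query
            if mi_score(sub, query) > mi_threshold and sub not in parent:
                parent[sub] = cat
                frontier.append((sub, depth + 1))
    return parent  # child -> parent edges of the joined hierarchy
```

With the Figure 4.4 example, the start category "Diabetes organizations" of "joslin" is discovered as a subcategory of "Diabetes", so the two subtrees merge under "Diabetes".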

4.5 Evaluation for Search Result Organization

We evaluate our approach using the datasets of the TREC 2010 and 2011 Session Tracks. For each q, to obtain its search results D, we retrieve the top 1000 documents returned by Lemur from an index built over the ClueWeb09 CatB collection. All relevant documents identified by TREC assessors are merged into the result set. Table 4.1 summarizes the data used in this evaluation.

We compare our approaches, Subsumption+Wikipedia (Section 4.3) and Wikipedia-only (Section 4.4), with the following systems:

• Clusty (now Yippy): We could not re-implement Clusty's algorithm. Instead, we sent queries to yippy.com and saved the returned hierarchies.

• Hierarchical clustering: We employ WEKA¹ to form hierarchical document clusters and then assign labels to the clusters. The labeling is done by a highly effective cluster labeling algorithm [4].

• Subsumption: A popular monothetic concept hierarchy construction algorithm [39], used as the baseline. We modify Subsumption's decision parameters to suit our dataset. In particular, we consider that x subsumes y if P(x|y) ≥ 0.6 and P(y|x) < 1.

¹ http://www.cs.waikato.ac.nz/ml/weka/, version 3.6.6, bottom-up hierarchical clustering based on cosine similarity.


Table 4.1: Statistics of the TREC 2010 and TREC 2011 Session Track datasets.

Dataset     #sessions   #q    #q per session   #doc
TREC2010    100         200   2                200,000
TREC2011    24          99    4.12             99,000
Total       124         299   2.41             299,000

4.5.1 Hierarchy Stability

To quantitatively evaluate the stability of SRC hierarchies, we compare the similarity between SRC hierarchies created within one search session. Given a query session Q with queries q_1, q_2, ..., q_n, the stability of the SRC hierarchies for Q is measured by the average pairwise hierarchy similarity over the unique query pairs in Q. It is defined as:

Stability(Q) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Sim_{hie}(H_i, H_j)    (4.7)

where n is the number of queries in Q, H_i and H_j are the SRC hierarchies for queries q_i and q_j, and Sim_hie(H_i, H_j) is the hierarchy similarity between H_i and H_j.
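Eq. (4.7) is simply the mean of Sim_hie over the unique hierarchy pairs of a session; a sketch, with the similarity function passed in:

```python
from itertools import combinations

def stability(hierarchies, sim_hie):
    """Eq. (4.7): average pairwise similarity over the unique pairs of
    the n hierarchies built for one session. combinations() enumerates
    the n(n-1)/2 unordered pairs, so dividing by the pair count matches
    the 2/(n(n-1)) normalization factor."""
    pairs = list(combinations(hierarchies, 2))
    return sum(sim_hie(h_i, h_j) for h_i, h_j in pairs) / len(pairs)
```

Any of the three similarity measures below can be supplied as `sim_hie`.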

We apply three methods to calculate Sim_hie(H_i, H_j). Suppose there are M nodes in H_i and N nodes in H_j:

• node overlap: Measures the percentage of identical nodes in H_i and H_j, normalized by min(M, N).

• parent-child precision: Measures the percentage of similar parent-child pairs in H_i and H_j, normalized by min(M, N).


• fragment-based similarity (FBS) [48]: Given two hierarchies H_i and H_j, FBS compares their similarity by calculating

\frac{1}{\max(M, N)} \sum_{p=1}^{m} Sim_{cos}(c_{ip}, c_{jp})    (4.8)

where c_{ip} ⊆ H_i and c_{jp} ⊆ H_j are the p-th matched pair among the m matched fragment pairs.

These metrics measure different aspects of two hierarchies. Node overlap measures content differences between hierarchies and ignores structural differences. Parent-child precision measures local content and structure differences and is a very strict measure. FBS considers both content and structure differences; it measures differences at the fragment level and tolerates minor changes in hierarchies.
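The node overlap and parent-child precision measures can be sketched directly from their definitions, assuming each hierarchy is represented as a set of nodes plus a set of (parent, child) edges:

```python
def node_overlap(nodes_i, nodes_j):
    """Identical nodes shared by both hierarchies, normalized by the
    smaller node count min(M, N); ignores structure entirely."""
    return len(nodes_i & nodes_j) / min(len(nodes_i), len(nodes_j))

def parent_child_precision(edges_i, edges_j, nodes_i, nodes_j):
    """Shared (parent, child) pairs, normalized by min(M, N); sensitive
    to both content and local structure, hence the strictest measure."""
    return len(edges_i & edges_j) / min(len(nodes_i), len(nodes_j))
```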

Table 4.2 and Table 4.3 summarize the stability evaluation over the TREC 2010 and 2011 datasets, respectively. The most stable hierarchies are generated by the proposed approaches, which statistically significantly outperform Subsumption in terms of stability in FBS on both evaluation datasets.

Both our approaches and Subsumption tremendously improve the stability of SRC hierarchies as compared to Clusty. Our explanation is that a monothetic concept hierarchy approach acquires concepts directly from the search results; it thus learns from a more complete dataset rather than a segment of the data (one cluster) and is able to avoid minor changes.

Figure 4.5 and Figure 4.6 exhibit the major clusters in SRC hierarchies for TREC 2010 session 3 generated by Clusty and Wiki-only (Section 4.4), respectively. The queries are "diabetes education" and "diabetes education videos books". We observe that the Clusty hierarchies (Figure 4.5(a)(b)) are less stable than those built by Wiki-only (Figure 4.6(a)(b)). For example, Clusty groups the search results by types of services (Figure 4.5(a)); however, a test indicator of diabetes, "Blood Sugar", which is


Table 4.2: Stability of search result organization for TREC 2010 Session queries. Approaches are compared to the baseline, Subsumption. A significant improvement over the baseline is indicated with a † at p < 0.05 and a ‡ at p < 0.005 (t-test, single-tailed).

Method                     FBS               Node overlap      Parent-child precision
                           Average  % chg    Average  % chg    Average   % chg
Clusty                     0.463    –        0.415    –        0.144     –
Hierarchical clustering    0.347    –        0.342    –        0.061     –
Subsumption                0.573    0.00%    0.518    0.00%    0.394     0.00%
Subsumption + Wikipedia    0.603†   5.24%    0.529†   2.12%    0.450‡    14.21%
Wikipedia only             0.634‡   10.65%   0.516    -0.39%   0.452‡    14.72%

Table 4.3: Stability of search result organization for TREC 2011 Session queries. Approaches are compared to the baseline, Subsumption. A significant improvement over the baseline is indicated with a † at p < 0.05 and a ‡ at p < 0.005 (t-test, single-tailed).

Method                     FBS               Node overlap      Parent-child precision
                           Average  % chg    Average  % chg    Average   % chg
Clusty                     0.440    –        0.327    –        0.115     –
Hierarchical clustering    0.350    –        0.129    –        0.043     –
Subsumption                0.483    0.00%    0.420    0.00%    0.262     0.00%
Subsumption + Wikipedia    0.504†   4.35%    0.420    0.00%    0.247     -5.73%
Wikipedia only             0.532‡   10.14%   0.425†   1.19%    0.255     -2.67%

not a type of service, is added after the query changes slightly (Figure 4.5(b)). Moreover, the largest cluster in Figure 4.5(a), "Research", disappears completely in Figure 4.5(b). These changes make the Clusty hierarchies less stable and less desirable. The Wiki-only approach (Figure 4.6(a)(b)), which employs an external knowledge base, better maintains a single classification dimension, in this case types of diabetes, and is easier to follow.


Figure 4.5: Major clusters in hierarchies built by Clusty for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

Figure 4.6: Major clusters in hierarchies built by Wiki-only for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".


4.5.2 Hierarchy Quality

One may argue that perfect stability could be achieved by a static SRC hierarchy that ignores query changes in a session. To avoid evaluating SRC hierarchies only by stability while sacrificing other important features, such as hierarchy quality, we manually evaluate the hierarchies. In particular, we compare two approaches, Subsumption and Subsumption+Wikipedia, to see how much quality improvement is gained by adding Wikipedia information.

Figure 4.7 and Figure 4.8 illustrate the major clusters in hierarchies built for TREC 2010 session 3 by Subsumption (Section 2.3.2) and Subsumption+Wiki (Section 4.3). We observe errors in Figure 4.7(a): "Type 1 diabetes" is misplaced under "type 2 diabetes". Figure 4.8(a) corrects this relationship, and both concepts are correctly identified under "diabetes".

Moreover, we find that the hierarchies created with Wikipedia (Figure 4.8(a)(b)) exhibit higher stability than those built by Subsumption only (Figure 4.7(a)(b)). For example, in Figure 4.7, "type 2 diabetes" becomes the root of the hierarchy when the query changes, while in Figure 4.8 the main structure of the hierarchy, "Diabetes" with the two children "type 2 diabetes" and "Type 1 diabetes", is maintained.

We further compare the hierarchies generated by Subsumption+Wiki (Section 4.3) and Wiki-only (Section 4.4). The Wiki-only approach generates more stable hierarchies because it utilizes Wikipedia entries, which are standardized concepts, to connect the concepts extracted from the search results. This may cause a high overlap between the hierarchies generated for queries about a similar topic. In contrast, the hierarchies generated by the Subsumption+Wiki approach are more closely related to the query, because it primarily builds relations between the concepts extracted from the search results and only uses Wikipedia to filter out inappropriate relations.


Figure 4.7: Major clusters in hierarchies built by Subsumption for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".

Figure 4.8: Major clusters in hierarchies built by Subsumption+Wiki for TREC 2010 session 3. (a) is for the query "diabetes education" and (b) is for "diabetes education videos books".


Figure 4.9: Search result organization quality improvement vs. stability for Subsumption and Subsumption+Wiki.

Quantitatively, we measure the quality improvement of Subsumption+Wiki over Subsumption by checking the correctness of the parent-child concept pairs in a hierarchy H as:

\frac{(count_{w,corr} - count_{w,err}) - (count_{s,corr} - count_{s,err})}{count_w + count_s}    (4.9)

where count_* is the number of concept pairs in H, w denotes Subsumption+Wikipedia, s denotes Subsumption, corr denotes correct pairs, and err denotes incorrect pairs.
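A sketch of Eq. (4.9), taking count_w = count_{w,corr} + count_{w,err} and likewise for s:

```python
def quality_improvement(w_corr, w_err, s_corr, s_err):
    """Eq. (4.9): net number of correct parent-child pairs gained by
    Subsumption+Wikipedia (w) over Subsumption alone (s), normalized by
    the total number of pairs in both hierarchies."""
    count_w = w_corr + w_err  # all pairs in the Subsumption+Wiki hierarchy
    count_s = s_corr + s_err  # all pairs in the Subsumption hierarchy
    return ((w_corr - w_err) - (s_corr - s_err)) / (count_w + count_s)
```

For example, 8 correct and 2 incorrect pairs from Subsumption+Wiki against 6 correct and 4 incorrect pairs from Subsumption yields an improvement of (6 - 2) / 20 = 0.2.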

Figure 4.9 plots quality improvement vs. stability for Subsumption and Subsumption+Wiki over all evaluated query sessions. Here stability is measured by the number of differing parent-child pairs in the corresponding hierarchies generated by these


Figure 4.10: Extreme case 1. A totally static hierarchy for two queries in a session (TREC 2010 session 107).

two approaches. Figure 4.9 demonstrates that quality and stability correlate well. Moreover, we calculate Spearman's rank correlation coefficient [43] and Pearson's correlation coefficient [43] between quality improvement and stability; the values are 0.764 and 0.760, respectively.

Queries change only slightly within a session, so the user may not expect a totally static hierarchy in session search. Figure 4.10 and Figure 4.11 show two extreme cases. In the first case, the two queries in a session are "elliptical trainer" and "elliptical trainer benefits" (TREC 2010 session 107). The hierarchies are exactly the same for the two queries, but the user may want a more detailed hierarchy about "benefits" for the query "elliptical trainer benefits". If the hierarchies are not stable at all, the user would not be satisfied either, as shown in Figure 4.11 (TREC 2010 session 75). Therefore, in these extreme cases, the quality of the hierarchies is poor regardless of stability, shown as the red line in Figure 4.9. The comparison indicates that our proposed techniques increase the quality of hierarchies while improving stability.


Figure 4.11: Extreme case 2. A totally different hierarchy for two queries in a session (TREC 2010 session 75).

4.6 Chapter Summary

This chapter presented a system framework that generates stable hierarchies for session search. Because the query usually changes little within a session, stability is required of the search result organization. Our system first extracts concepts from the document set and then uses the concepts as nodes to build the hierarchy. We presented two approaches that exploit Wikipedia categories to improve the stability of the hierarchy. The first corrects mistaken relationships generated by Subsumption, while the second builds the hierarchy purely from the Wikipedia categories related to the concepts.

The monothetic concept hierarchy approaches show a significant improvement in stability over the hierarchical clustering approaches. The evaluation further shows that Wikipedia category information increases not only the stability but also the quality of the hierarchy.


Chapter 5

Conclusion

This chapter concludes the thesis. Section 5.1 summarizes the research work. Section 5.2 highlights the significance of this thesis. Section 5.3 proposes possible directions for future research.

5.1 Research Summary

This thesis contains two components: (1) applying query formulation to improve the accuracy of session search; and (2) presenting a new system, based on a monothetic concept hierarchy approach, that stably organizes the retrieved documents for session search.

First, to obtain documents relevant to the queries in a session, we translate each query into a set of nuggets associated with weights. The nuggets from all queries are aggregated by three schemes. Then we extract anchor texts from the previous interactions to build an expansion term set. To handle duplicated queries in a session, we design two rules for the two cases of whether or not the duplicate involves the current query. The dwell time of clicked documents is applied to re-rank the results returned by Lemur. Evaluation results on both the TREC 2011 and 2012 datasets show that search accuracy is significantly improved by introducing previous queries, expanding queries with anchor texts, and removing duplicates sequentially. Furthermore, the strict method of identifying nuggets performs better over an entire session, while the relaxed method performs better when searching a single query. The official evaluation of the TREC 2012 Session track showed that our submission achieved the highest


improvement in RL2. Our results for RL2, RL3, and RL4 were ranked second among

all groups.

Second, we study result organization for session search. We present two algorithms to generate stable SRC hierarchies. The first inserts a module that utilizes Wikipedia into the Subsumption approach to refine the parent-child relationships. The second builds SRC hierarchies directly from the category information of Wikipedia. We extract noun phrases using a POS tagger and then filter them through Google to generate a concept set. The concepts are mapped to Wikipedia entries using cosine similarity and mutual information. The evaluation indicates that the monothetic concept hierarchy approaches benefit the stability of SRC hierarchies for session search. Furthermore, applying external knowledge such as Wikipedia improves the quality of SRC hierarchies as well.

5.2 Significance of the Thesis

This thesis addresses how to search effectively and efficiently over a series of queries and the associated user interactions. In order to retrieve documents relevant to the topic of a session, search engines are expected to precisely understand the queries in a session. This requires the system to identify the important concepts in the session. Nuggets, which represent the phrases in a query, can identify the concepts emphasized by the user and thus increase the accuracy of session search.

Nuggets identify important concepts within queries, while aggregation schemes and removal rules for duplicated queries deal with the importance of queries across a session. Some queries in a session are more important because they were submitted more recently, while others fail to satisfy the user who submitted them and become noise. Aggregation schemes help the system decide the importance of queries with regard


to when they are submitted. Removal rules for duplicated queries filter out the noisy queries that could bias the search results.

This thesis further generates stable result organization for session search. We adopt the idea of the monothetic concept hierarchy because its top-down strategy benefits the stability of hierarchies. We implement a filter module that utilizes Wikipedia to improve the accuracy of the parent-child relationships generated by Subsumption, the most prevalent monothetic concept hierarchy approach. Evaluation results show significant improvements in both stability and quality.

Building highly stable SRC hierarchies requires dynamic access to Wikipedia category information, because the concepts, which become the cluster labels, are dynamically extracted from the search results. We propose an approach that combines cosine similarity and mutual information for disambiguation to identify the Wikipedia entry relevant to a concept. With this effective approach to mapping concepts to Wikipedia entries, we present an original concept hierarchy construction algorithm that dynamically generates stable hierarchies from Wikipedia categories.

5.3 Future Directions

The field of session search is attractive, and we can continue to improve search accuracy on top of our present system. We believe the nugget approach is promising since it greatly boosted the performance of the system on the TREC 2011 data. It is therefore important to improve the accuracy of identifying nuggets in a query. Phrases in queries could be identified by their appearance. For example, "American Girl" with both words capitalized represents a brand name; in lowercase, the connection between the two words is much looser. External knowledge


like dictionaries could be used to improve the accuracy of identifying nuggets. We could build a component that compares candidate nuggets against an external knowledge base so as to filter out erroneous nuggets.

Some phrases are not suited to being nuggets. For example, verb phrases are more volatile than noun phrases, so they could wipe out relevant documents if we formulate them into strict nuggets. A better method is perhaps to identify nuggets by different rules according to the type of phrase: a POS tagger could be applied to classify a phrase and then select an appropriate rule for identifying nuggets.

Our approaches to search result organization can be improved in several respects. First, other knowledge bases, such as Freebase and WordNet, can be adopted in the proposed framework. The synonym and antonym information in these knowledge bases may help us build more stable hierarchies. Second, anchor texts in Wikipedia pages often link related concepts together. Therefore, besides Wikipedia categories, we consider extracting relations between concepts by analyzing anchor texts, which might help build SRC hierarchies of high quality. In the future, we will additionally consider using lexical-syntactic patterns to create more stable and higher-quality SRC hierarchies. An extended user study will also be conducted to evaluate the quality of SRC hierarchies.


Bibliography

[1] D. C. Anastasiu, B. J. Gao, and D. Buttler. A framework for personalized and collaborative clustering of search results. In Proceedings of the 20th International Conference on Information and Knowledge Management, CIKM '11, pages 573–582, New York, NY, USA, 2011.

[2] S. Araujo, G. Gebremeskel, J. He, C. Bosscarino, and A. de Vries. CWI at TREC 2012, KBA track and session track. In Proceedings of the 21st Text REtrieval Conference, TREC '12, Gaithersburg, MD, USA, 2012.

[3] M. Bendersky, D. Metzler, and W. B. Croft. Effective query formulation with multiple information sources. In Proceedings of the 5th International Conference on Web Search and Data Mining, WSDM '12, pages 443–452, New York, NY, USA, 2012.

[4] D. Carmel, H. Roitman, and N. Zwerdling. Enhancing cluster labeling using Wikipedia. In Proceedings of the 32nd International SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 139–146, New York, NY, USA, 2009.

[5] C. Carpineto and G. Romano. Optimal meta search results clustering. In Proceedings of the 33rd International SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 170–177, New York, NY, USA, 2010.


[6] B. Carterette and P. Chander. Implicit feedback and document filtering for retrieval over query sessions. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[7] G. V. Cormack, M. D. Smucker, and C. L. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–465, Oct. 2011.

[8] W. B. Croft, M. Bendersky, H. Li, and G. Xu. Query representation and understanding workshop. SIGIR Forum, 44(2):48–53, Jan. 2011.

[9] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International SIGIR Conference on Research and Development in Information Retrieval, SIGIR '92, pages 318–329, New York, NY, USA, 1992.

[10] H. Duan and B.-J. P. Hsu. Online spelling correction for query completion. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 117–126, New York, NY, USA, 2011.

[11] S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White. Evaluating implicit measures to improve web search. Transactions on Information Systems, 23(2):147–168, Apr. 2005.

[12] D. Guan and H. Yang. Increasing stability of result organization for session search. In Proceedings of the 35th European Conference on Advances in Information Retrieval, ECIR '13, Berlin, Heidelberg, 2013. Springer-Verlag.


[13] D. Guan, H. Yang, and N. Goharian. Effective structured query formulation for session search. In Proceedings of the 21st Text REtrieval Conference, TREC '12, Gaithersburg, MD, USA, 2012.

[14] Q. Guo and E. Agichtein. Ready to buy or just browsing? Detecting web searcher goals from interaction data. In Proceedings of the 33rd International SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 130–137, New York, NY, USA, 2010.

[15] M. Hagen, M. Potthast, M. Busse, J. Gomoll, J. Harder, and B. Stein. Webis at the TREC 2012 session track. In Proceedings of the 21st Text REtrieval Conference, TREC '12, Gaithersburg, MD, USA, 2012.

[16] X. Han and J. Zhao. Topic-driven web search result organization by leveraging Wikipedia semantic knowledge. In Proceedings of the 19th International Conference on Information and Knowledge Management, CIKM '10, pages 1749–1752, New York, NY, USA, 2010.

[17] J. He, V. Hollink, C. Bosscarino, R. Cornacchia, and A. de Vries. CWI at TREC 2011: session, web, and medical. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[18] S. Huston and W. B. Croft. Evaluating verbose query processing techniques. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 291–298, New York, NY, USA, 2010.

[19] B. Huurnink, R. Berendsen, K. Hofmann, E. Meij, and M. de Rijke. The University of Amsterdam at the TREC 2011 session track. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.


[20] A. Islam and D. Inkpen. Real-word spelling correction using Google Web 1T n-gram data set. In Proceedings of the 18th Conference on Information and Knowledge Management, CIKM '09, pages 1689–1692, New York, NY, USA, 2009.

[21] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. Transactions on Information Systems, 20(4):422–446, Oct. 2002.

[22] J. Jiang, S. Han, J. Wu, and D. He. Pitt at TREC 2011 session track. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[23] J. Jiang, D. He, and S. Han. Pitt at TREC 2012 session track. In Proceedings of the 21st Text REtrieval Conference, TREC '12, Gaithersburg, MD, USA, 2012.

[24] E. Kanoulas, B. Carterette, P. D. Clough, and M. Sanderson. Overview of the TREC 2010 session track. In Proceedings of the 19th Text REtrieval Conference, TREC '10, Gaithersburg, MD, USA, 2010.

[25] E. Kanoulas, B. Carterette, M. Hall, P. Clough, and M. Sanderson. Overview of the TREC 2011 session track. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[26] E. Kanoulas, B. Carterette, M. Hall, P. Clough, and M. Sanderson. Overview of the TREC 2012 session track. In Proceedings of the 21st Text REtrieval Conference, TREC '12, Gaithersburg, MD, USA, 2012.

[27] R. Kaptein, P. Serdyukov, A. De Vries, and J. Kamps. Entity ranking using Wikipedia as a pivot. In Proceedings of the 19th International Conference on Information and Knowledge Management, CIKM '10, pages 69–78, New York, NY, USA, 2010.


[28] L. Leal, S. Kharazmi, J. Dhaliwal, M. Sanderson, F. Scholer, S. Sadeghi, and

F. Alahmari. Rmit at trec 2011 session track. In Proceedings of the 20th Text

REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[29] Y. Li, H. Duan, and C. Zhai. A generalized hidden markov model with discrimina-

tive training for query spelling correction. In Proceedings of the 35th international

SIGIR conference on Research and development in information retrieval, SIGIR

'12, pages 611�620, New York, NY, USA, 2012.

[30] C. Liu, M. Cole, E. Baik, and J. N. Belkin. Rutgers at the trec 2012 session

track. In Proceedings of the 21st Text REtrieval Conference, TREC '12, 2012.

[31] T. Liu, C. Zhang, Y. Gao, W. Xiao, and H. Huang. BUPT_WILDCAT at TREC 2011 session track. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[32] W. Liu, H. Lin, Y. Ma, and T. Chang. DUTIR at the session track in TREC 2011. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[33] M.-D. Albakour and U. Kruschwitz. University of Essex at the TREC 2012 session track. In Proceedings of the 21st Text REtrieval Conference, TREC '12, Gaithersburg, MD, USA, 2012.

[34] M.-D. Albakour, U. Kruschwitz, N. Nanas, B. Neville, D. Lungley, and M. Fasli. University of Essex at the TREC 2011 session track. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[35] D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Information Processing and Management, 40(5):735–750, Sept. 2004.

[36] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In Proceedings of the 28th annual international SIGIR conference on Research and development in information retrieval, SIGIR '05, pages 472–479, New York, NY, USA, 2005.

[37] G. Mishne and M. de Rijke. Boosting web retrieval through query operations. In Proceedings of the 27th European conference on Advances in information retrieval, ECIR '05, pages 502–516, Berlin, Heidelberg, 2005. Springer-Verlag.

[38] N. Nanas and A. De Roeck. Autopoiesis, the immune system, and adaptive information filtering. Natural Computing, 8(2):387–427, June 2009.

[39] M. Sanderson and B. Croft. Deriving concept hierarchies from text. In Proceedings of the 22nd annual international SIGIR conference on Research and development in information retrieval, SIGIR '99, pages 206–213, New York, NY, USA, 1999.

[40] C. Santamaría, J. Gonzalo, and J. Artiles. Wikipedia as sense inventory to improve diversity in web search results. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 1357–1366, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[41] U. Scaiella, P. Ferragina, A. Marino, and M. Ciaramita. Topical clustering of search results. In Proceedings of the 5th international conference on Web search and data mining, WSDM '12, pages 223–232, New York, NY, USA, 2012.

[42] R. Song, M. J. Taylor, J.-R. Wen, H.-W. Hon, and Y. Yu. Viewing term proximity from a different perspective. In Proceedings of the 30th European conference on Advances in information retrieval, ECIR '08, pages 346–357, Berlin, Heidelberg, 2008. Springer-Verlag.

[43] D. Wackerly, W. Mendenhall, and R. L. Scheaffer. Mathematical Statistics with Applications. Duxbury Advanced Series, 2002.

[44] P. Wang, J. Hu, H.-J. Zeng, and Z. Chen. Using Wikipedia knowledge to improve text classification. Knowledge and Information Systems, 19(3):265–281, May 2009.

[45] M. Wei, Y. Xue, C. Xu, X. Yu, Y. Liu, and X. Cheng. ICTNET at session track TREC 2011. In Proceedings of the 20th Text REtrieval Conference, TREC '11, Gaithersburg, MD, USA, 2011.

[46] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th annual international SIGIR conference on Research and development in information retrieval, SIGIR '96, pages 4–11, New York, NY, USA, 1996.

[47] S. Xu, H. Jiang, and F. C. Lau. User-oriented document summarization through vision-based eye-tracking. In Proceedings of the 14th international conference on Intelligent user interfaces, IUI '09, pages 7–16, New York, NY, USA, 2009.

[48] H. Yang. Personalized Concept Hierarchy Construction. Ph.D. dissertation, Carnegie Mellon University, 2011.

[49] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. Transactions on Information Systems, 22(2):179–214, Apr. 2004.

[50] C. Zhang, X. Wang, S. Wen, and R. Li. BUPT_PRIS at TREC 2012 session track. In Proceedings of the 21st Text REtrieval Conference, TREC '12, Gaithersburg, MD, USA, 2012.

[51] L. Zhao and J. Callan. Automatic term mismatch diagnosis for selective query expansion. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, SIGIR '12, pages 515–524, New York, NY, USA, 2012.
