
Click Models for Web Search

Authors’ version*

Aleksandr Chuklin, University of Amsterdam and Google Switzerland
Ilya Markov, University of Amsterdam
Maarten de Rijke, University of Amsterdam

* The final version will be available at the publisher’s website: http://www.morganclaypool.com/doi/abs/10.2200/S00654ED1V01Y201507ICR043

Contents

Acknowledgments

1 Introduction
  1.1 Motivation
  1.2 Historical Notes
  1.3 Aims of this Survey
  1.4 Additional Material and Website
  1.5 Structure

2 Terminology

3 Basic Click Models
  3.1 Random Click Model (RCM)
  3.2 Click-through Rate Models (CTR)
    3.2.1 Rank-based CTR Model (RCTR)
    3.2.2 Document-based CTR Model (DCTR)
  3.3 Position-based Model (PBM)
  3.4 Cascade Model (CM)
  3.5 User Browsing Model (UBM)
  3.6 Dependent Click Model (DCM)
  3.7 Click Chain Model (CCM)
  3.8 Dynamic Bayesian Network Model (DBN)
    3.8.1 Simplified DBN Model (SDBN)
  3.9 Click Probabilities
  3.10 Summary

4 Parameter Estimation
  4.1 The MLE Algorithm
    4.1.1 MLE for the RCM and CTR models
    4.1.2 MLE for DCM
    4.1.3 MLE for SDBN
  4.2 The EM Algorithm
    4.2.1 Expectation (E-step)
    4.2.2 Maximization (M-step)
    4.2.3 EM Estimation for UBM
    4.2.4 EM Estimation for DBN
  4.3 Formulas for Click Model Parameters
  4.4 Alternative Estimation Methods

5 Evaluation
  5.1 Log-Likelihood
  5.2 Perplexity
    5.2.1 Conditional Perplexity
    5.2.2 Detailed Perplexity Analysis
  5.3 Click-through Rate Prediction
  5.4 Relevance Prediction Evaluation
  5.5 Intuitiveness Evaluation for Aggregated Search Click Models

6 Data and Tools
  6.1 Datasets
  6.2 Software Tools and Open Source Projects

7 Experimental Comparison
  7.1 Experimental Setup
  7.2 Results and Discussion
  7.3 Summary

8 Advanced Click Models
  8.1 Aggregated Search
    8.1.1 Federated Click Model (FCM)
    8.1.2 Vertical Click Model (VCM)
    8.1.3 Intent-aware Click Models
  8.2 Beyond a Single Query Session
    8.2.1 Pagination
    8.2.2 Task-centric Click Model (TCM)
  8.3 User and Query Diversity
    8.3.1 Session Variability
    8.3.2 Query Variability
    8.3.3 User Variability
  8.4 Eye Tracking and Mouse Movements
  8.5 Using Editorial Relevance Judgements
  8.6 Non-linear SERP Examination
  8.7 Using Features

9 Applications
  9.1 Simulating Users
  9.2 User Interaction Analysis
  9.3 Inference of Document Relevance
  9.4 Click Model-based Metrics

10 Discussion and Directions for Future Work

Authors’ Biographies

Index


Acknowledgments

We benefited from comments and feedback from many colleagues. In particular, we want to thank Alexey Borisov, Artem Grotov, Damien Lefortier, Anne Schuth, and our reviewers Yiqun Liu and Jian-Yun Nie for their valuable questions and suggestions. A special thank you to the Morgan & Claypool staff, and especially Diane Cerra, for their help in publishing the book in a very efficient manner.

The writing of this book was partially supported by grant P2T1P2 152269 of the Swiss National Science Foundation, Amsterdam Data Science, the Dutch national program COMMIT, Elsevier, the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement nr 312827 (VOX-Pol), the ESF Research Network Program ELIAS, the HPC Fund, the Royal Dutch Academy of Sciences (KNAW) under the Elite Network Shifts project, the Microsoft Research Ph.D. program, the Netherlands eScience Center under project number 027.012.105, the Netherlands Institute for Sound and Vision, the Netherlands Organisation for Scientific Research (NWO) under project nrs 727.011.005, 612.001.116, HOR-11-10, 640.006.013, 612.066.930, CI-14-25, SH-322-15, the Yahoo! Faculty Research and Engagement Program, and Yandex.

The discussions in this book as well as the choice of models and results discussed in the book are entirely based on published research, publicly available datasets and the authors’ own judgements rather than on the internal practice of any organization. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

Aleksandr Chuklin, Ilya Markov, Maarten de Rijke
June 2015

Chapter 1

Introduction

All models are wrong but some are useful.

George Edward Pelham Box

1.1 MOTIVATION

Many areas of science have models of events. For example, mechanics deals with solid bodies and forces applied to them. A model abstracts away many things that are not important to be able to predict the outcome of an event sufficiently accurately. When we want to predict the movements of two billiard balls after collision, we apply the laws of conservation of mechanical energy and momentum, neglecting the friction between two balls and between balls and the air. The model allows us to predict the movement without performing the experiment, which then only serves as a verification of our model. Without the model we would have to perform an experiment every time we want to predict future events, which is simply not scalable.

When it comes to research and development in web search, we perform many experiments that involve users. These experiments range from small-scale lab studies to experiments with millions of users of real web search systems. It is still common to perform experiments “blindly,” e.g., by trying different colors of a web search UI and checking which colors yield more clicks and revenue [Kohavi et al., 2014]. However, we would not be able to advance as fast as we do now if we did not have some understanding of user behavior. For example, Joachims et al. [2005] show that users tend to prefer the first document to the last document on a search engine result page (SERP for short). That knowledge helps us to rank documents in a way that is most beneficial for a user. This is a part of our understanding of user behavior, a part of a user model. We first suggest hypotheses, which together form a user model, and then validate these hypotheses with real users.

Intuitively, a user model is a set of rules that allow us to simulate user behavior on a SERP in the form of a random process. For example, we know from numerous user studies that there is a so-called position bias effect: search engine users tend to prefer documents higher in the ranking [Guan and Cutrell, 2007, Joachims et al., 2005, Lorigo et al., 2008]. There is also attention bias to visually salient documents [Chen et al., 2012a, Wang et al., 2013a], novelty bias to previously unseen documents [Zhang et al., 2011], and many other types of bias that different models attempt to take into account by introducing random variables and dependencies between them. Since the main observed user interaction with a search system concerns clicks, the models we study are usually called click models. There are some recent efforts to predict mouse movements instead of clicks [Diaz et al., 2013], but our main focus is on click models.

An important motivation for developing click models is that they help in cases when we do not have real users to experiment with, or prefer not to experiment with real users for fear of hurting the user experience. Most notably, researchers in academia have limited access to user interaction data due to privacy and commercial constraints. Even companies that do have access to millions or billions of users cannot afford to run user experiments all the time and have to use some models during the development phase. This is where simulated users come to the rescue—we simulate user interactions based on a click model.

Apart from user simulations, click models are also used to improve document ranking (i.e., infer document relevance from clicks predicted by a click model), to improve evaluation metrics (e.g., model-based metrics) and to better understand users by inspecting the parameters of click models. We detail these applications in Chapter 9.

1.2 HISTORICAL NOTES

Perhaps the first work that uses the term “click model” is the paper by Craswell et al. [2008] in which the cascade model is introduced. This work also considers a number of other models that are based on a position bias hypothesis that had previously been established using eye tracking and click log analysis by Joachims et al. [2005]. Prior to the work of Craswell et al. a couple of models that we now call click models had been introduced for the sake of ranking evaluation [Dupret et al., 2007, Moffat and Zobel, 2008].

Following the publication of [Craswell et al., 2008], the UBM model was introduced by Dupret and Piwowarski [2008], the DBN model by Chapelle and Zhang [2009], and the DCM model by Guo et al. [2009b]. Most click models that have been introduced since then are based on one of those three models.

In more recent years, quite a few improvements to the basic click models just listed have been suggested. For instance, once search engine companies began to present their search results as blocks of verticals aggregated from different sources (such as images or news items), click models for aggregated search started to receive their rightful attention [Chen et al., 2012a, Chuklin et al., 2013b, Wang et al., 2013a].

Overall, the development of click models goes hand in hand with their applications. For example, recent research work on analyzing mouse movements has been applied to improve click models in [Huang et al., 2012]. And as web search is rapidly becoming more personalized [Dou et al., 2007, Teevan et al., 2005], this is also reflected in click models; see, e.g., [Xing et al., 2013]. We believe these directions are going to be developed further, following the evolution of modern search engines.


1.3 AIMS OF THIS SURVEY

A large body of research on click models has developed, research that has improved our understanding of user behavior in web search and that has facilitated the usage of these models in various search-related tasks. Current studies use a broad range of notations and terminology, perform experiments using different and mostly proprietary datasets, and do not detail the model instantiations and parameter estimation procedures used. A general systematic view on the research area is not available. This, in turn, slows down the development and hinders the application of click models. The goal of this survey is to bring together current efforts in the area, summarize the research performed so far and give a holistic view on existing click models for web search. More specifically, the aims of this survey are the following:

1. Describe existing click models in a unified way, i.e., using common notation and terminology, so that different models can easily be related to each other.

2. Compare commonly used click models both theoretically and experimentally, outline their advantages and limitations and provide a set of recommendations on their usage.

3. Provide ready-to-use formulas and implementations of existing click models and detail parameter estimation procedures to facilitate the development of new ones.

4. Summarize current efforts on click model evaluation in terms of evaluation approaches, datasets and software packages.

5. Provide an overview of click model applications and directions for future development of click models.

Our target audience consists of researchers and developers in information retrieval who are interested in formally capturing user interactions with search engine result pages, whether for ranking purposes, to simulate user behavior in a lab setting or simply to gain deeper insights into user behavior and interaction data. The survey will be useful as an overview for anyone who is starting research work in the area as well as for practitioners seeking concrete recipes.

We presuppose some basic familiarity with statistical inference and parameter estimation techniques, including maximum-likelihood estimation (MLE) and expectation-maximization (EM); see, e.g., [Bishop, 2006, Murphy, 2013]. Familiarity with probabilistic graphical models is helpful; see, e.g., parts I and II of [Koller and Friedman, 2009] or the probabilistic graphical models course on Coursera,1 but we explain the required concepts where they are being used for the first time.

The survey provides a map of an increasingly rich landscape of click models. By the end of this survey, our readers should be familiar with basic definitions and intuitions of what we consider to be the core models that have been introduced so far, with inference tasks for these models, and with uses of these models. While our presentation is necessarily formal in places, we make a serious effort to relate the models, the estimation procedures and the applications back to the core web search task and to web search data by including a fair number of examples. We hope that this supplies readers who are new to the area with effective means to begin to carry out research on click models.

1 https://www.coursera.org/course/pgm

1.4 ADDITIONAL MATERIAL AND WEBSITE

At http://clickmodels.weebly.com we make available additional material that may benefit readers of the book. This includes slides for tutorials based on the book and a list of resources (datasets, benchmarks, etc.) that may be useful to the reader interested in pursuing research on click models.

1.5 STRUCTURE

Our ideal reader would read the survey from start to finish. But the survey is structured in such a way that it should be easy to skip parts that the reader is not interested in. For example, if the plan is to use a click model package without modification, the reader may skip details of the parameter estimation process and go directly to the final formulas in Chapter 4.

We start in Chapter 2 by introducing the general terminology and notation that we will be using throughout the survey. In Chapter 3 we present the basic click models for web search and discuss their similarities and differences as well as strengths and weaknesses. Chapter 4 details the parameter estimation process—the process of estimating the values of model parameters using click data. Chapter 5 introduces evaluation techniques that are used to assess the quality of click models. Chapter 6 is a practice-oriented chapter that presents the datasets and the software packages available for click model experiments. In Chapter 7 we perform a thorough evaluation of all the popular models using a common dataset. In Chapter 8 we briefly discuss extensions of the basic click models that are aimed at handling additional aspects of user behavior and web search user interfaces. Chapter 9 is dedicated to applications of click models; in it we reiterate the most common use-cases for click models and detail a few concrete examples. We conclude with a discussion in Chapter 10.

Chapter 2

Terminology

In this book we consider click models for web search. Such models describe user behavior on a search engine result page (SERP). The scenario captured by existing click models is the following. A user issues a query to a search engine, based on their information need. The search engine presents the user with a SERP, which contains a number of objects, either directly related to the issued query (e.g., document snippets, ads, query suggestions, etc.) or not (e.g., pagination buttons, feedback links, etc.). A user examines the presented SERP, possibly clicks on one or more objects and then abandons the search either by issuing a new query or by stopping to interact with the SERP. The set of events between issuing a query and abandoning the search is called a session.1

Click models treat user search behavior as a sequence of observable and hidden events.2 Each event is described with a binary random variable X, where X = 1 corresponds to the occurrence of the event and X = 0 means that the event did not happen. The main events that most click models consider are the following:

• E: a user examines an object on a SERP;

• A: a user is attracted by the object’s representation;

• C: an object is clicked; and

• S: a user’s information need is satisfied.

Click models define dependencies between these events and aim at estimating full or conditional probabilities of the corresponding random variables, e.g., P(E = 1), P(C = 1 | E = 1), etc. Some probabilities are treated as parameters of click models. The parameters are usually denoted with Greek letters (e.g., P(A = 1) = α) and depend on features of a SERP and a user’s query. For example, the examination probability P(E = 1) may depend on the position of a document, while the attractiveness probability P(A = 1) can be dependent on a query and the document’s content.
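To make these events and parameters concrete, here is a minimal Python sketch (all parameter values, queries and document identifiers are made up for illustration) of a decomposition used by several of the models discussed later: a document is clicked if it is examined and its representation is attractive, so that P(C = 1) = P(E = 1) · P(A = 1) when the two events are independent:

```python
import random

# Hypothetical parameter values, made up for illustration.
examination = {1: 0.9, 2: 0.6, 3: 0.4}            # P(E_r = 1), depends on rank
attractiveness = {("cats", "doc_a"): 0.7,         # P(A_u = 1) = alpha, depends
                  ("cats", "doc_b"): 0.2,         # on the query-document pair
                  ("cats", "doc_c"): 0.5}

def click_probability(query, doc, rank):
    """A click requires examination and attraction: P(C=1) = P(E=1) * P(A=1)."""
    return examination[rank] * attractiveness[(query, doc)]

def sample_click(query, doc, rank, rng=random):
    """Simulate the two hidden binary events and return the observed click."""
    examined = rng.random() < examination[rank]
    attracted = rng.random() < attractiveness[(query, doc)]
    return int(examined and attracted)

print(round(click_probability("cats", "doc_a", 1), 2))  # 0.63
```

The observed variable here is the click; examination and attraction stay hidden, which is exactly what makes parameter estimation non-trivial in later chapters.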

Overall, a click model is unambiguously defined with the following:

• a set of events/random variables that it considers;

• a directed graph with nodes corresponding to events and edges corresponding to dependencies between events (dependency graph);

1 Note that the session here applies only to one query, not a set of queries related to the same information need. While there are click models that are targeted at modeling multi-query behavior (e.g., [Zhang et al., 2011]), we are not going to study them here, but see Chapter 8 for further details.

2 Observed events are sometimes referred to as evidence.


Figure 2.1: A small sample of the query log released by AOL in 2006.

• conditional probabilities in the nodes of the dependency graph expressed by parameters; and

• a correspondence between the model’s parameters and features of a SERP and user’s query.

When working with click models, researchers often have a click log available to them. A click log is a log of user interactions with a search system by means of clicks together with the information about documents presented on a SERP. Different click log datasets use different formats and have different types of additional information, such as mouse movements or options selected in the search engine’s interface. We cover these aspects in more detail in Section 6.1. A sample from a click log, the AOL query log released in 2006, is shown in Figure 2.1; here, every line contains the following information: user id, query, query timestamp and, if there was a click, the rank of the clicked item and the URL3 of the clicked item. If we have click log data, parameters can be estimated or learned from clicks.4 Once we have the parameter values, a click model can be used to simulate users’ actions on a SERP as a random process and predict click probabilities.

3 Uniform Resource Locator, an address of a resource on the Internet.
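As a concrete illustration, here is a minimal sketch of parsing such a log. It assumes the tab-separated AnonID / Query / QueryTime / ItemRank / ClickURL layout of the AOL log, with the last two fields present only on lines that record a click; the sample line is invented:

```python
from collections import namedtuple

LogRecord = namedtuple("LogRecord", "user_id query timestamp rank url")

def parse_line(line):
    """Parse one tab-separated log line; rank and url are None for click-less rows."""
    fields = line.rstrip("\n").split("\t")
    user_id, query, timestamp = fields[0], fields[1], fields[2]
    rank = int(fields[3]) if len(fields) > 3 and fields[3] else None
    url = fields[4] if len(fields) > 4 else None
    return LogRecord(user_id, query, timestamp, rank, url)

record = parse_line("142\tclick models\t2006-03-01 07:17:12\t1\thttp://example.com")
print(record.query, record.rank)  # click models 1
```

Real logs need more care (header rows, malformed lines, URL normalization), but this is enough to turn raw lines into the (session, rank, click) structures that the models in Chapter 3 consume.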

In this book we will often depict a click model as a graphical model. An example of a graphical model is given in Figure 3.1. We show random variables as circles, with gray circles corresponding to observed variables and white circles corresponding to hidden variables. Solid lines (arrows) between circles show the “direction of probabilistic dependence.”

More precisely, we use the following definition by Koller and Friedman [2009]:

Definition 2.1 Bayesian Network. A Bayesian network is a directed acyclic graph (DAG) whose nodes represent random variables X_1, ..., X_n and the joint distribution of these variables can be written as

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i)),

where Pa(X_i) is the set of “parents of X_i,” i.e., the set of variables X' such that there is a directed edge from X' to X_i.
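To make the definition concrete, here is a small sketch (with hypothetical probabilities) of a three-node network in which examination E and attractiveness A have no parents and a click C has parents {E, A}; the joint distribution is computed exactly as the product in Definition 2.1:

```python
from itertools import product

# Hypothetical probabilities. E and A have no parents; C has parents {E, A}.
p_E = {1: 0.8, 0: 0.2}                        # P(E = e)
p_A = {1: 0.5, 0: 0.5}                        # P(A = a)
p_C1 = {(1, 1): 1.0, (1, 0): 0.0,             # P(C = 1 | E = e, A = a):
        (0, 1): 0.0, (0, 0): 0.0}             # click iff examined and attracted

def joint(e, a, c):
    """P(E=e, A=a, C=c) = P(E=e) * P(A=a) * P(C=c | E=e, A=a)."""
    p_click = p_C1[(e, a)]
    return p_E[e] * p_A[a] * (p_click if c == 1 else 1.0 - p_click)

# A sanity check: the factorized joint must sum to one over all 2^3 assignments.
total = sum(joint(e, a, c) for e, a, c in product((0, 1), repeat=3))
print(round(total, 9), joint(1, 1, 1))  # 1.0 0.4
```

The same factorization is what makes inference in the click models of Chapter 3 tractable: each conditional probability table stays small even as the number of variables grows.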

Notice that a click model does not have to be a Bayesian network and can be represented as a pure feature-based machine learning problem. One can refer to any machine learning textbook (e.g., [Witten and Frank, 2005]) for guidelines on how to do feature engineering and select the right learning method. While some advanced click models5 discussed in Chapter 8 successfully use machine learning as part of the model, the focus of the book is on probabilistic Bayesian network models.

The notation used throughout the book is shown in Table 2.1. Here, events and the corresponding random variables are denoted with capital letters, while their values are denoted with lower-case letters. The expression X = x means that the random variable X takes the value x, where x ∈ {0, 1} in the case of a binary variable. X_c denotes an event X being applied to a concept c. For example, E_r = 1 means that rank r was examined by a user, while A_u = 0 means that a user was not attracted by document u.

The indicator function I(·) is defined as follows:

I(expr) = 1 if expr is true, and 0 otherwise.

For example, I(S_u = 1) = 1 if, and only if, a user is satisfied by document u, i.e., the value of the random variable S_u is 1. If a user is not satisfied by u, then I(S_u = 1) = 0. In particular, I(X = 1) = x for any binary random variable X.

4 Estimation is used more often in statistics, while inference is sometimes used in models involving the Bayesian framework. This process can also be referred to as training or learning, as in machine learning studies. We use these terms interchangeably.

5 E.g., the noise-aware models by Chen et al. [2012b] or the GCM model by Zhu et al. [2010].


Table 2.1: Notation used in the book.

Expression   Meaning
u            A document (documents are identified by their URLs, hence the notation).
q            A user’s query.
r            The rank of a document.
c            A placeholder for any concept associated with a SERP (e.g., query-document pair, rank, etc.).
s            A user search session.
S            A set of user search sessions.
S_c          A set of user search sessions containing a concept c.
u_r          A document at rank r.
r_u          The rank of a document u.
N            Maximum rank (SERP size); usually equals 10.
X            An event/random variable.
x            The value of a random variable X.
X_c          An event X applied to a concept c.
x_c          The value that a random variable X takes, when applied to a concept c.
x_c^(s)      The value that a random variable X takes, when applied to a concept c in a particular session s.
I(·)         An indicator function.

Chapter 3

Basic Click Models

In this chapter we present basic click models that capture different assumptions about searchers’ interaction behavior on a web search engine result page. Later, in Chapter 8, we study more advanced models that take into account additional aspects of this behavior, additional signals from a user and/or peculiarities of a SERP.

To define a model we need to describe the observed and hidden random variables, the relations between the variables and how they depend on the model parameters. Together, this forms the content of the current chapter. Building on the material of this chapter, click models can be used to predict clicks or estimate the data likelihood, provided that the parameters’ values are known. For the question of parameter estimation we reserve a separate chapter, Chapter 4.

We start with a simple random click model (Section 3.1), two click-through rate models (Section 3.2) and a very basic position-based model (Section 3.3). Then we introduce the early click models, namely the cascade model (Section 3.4) by Craswell et al. [2008] and the user browsing model (Section 3.5) by Dupret and Piwowarski [2008], as well as the more recent DCM, CCM and DBN models (Sections 3.6, 3.7, 3.8) that are widely used nowadays. In Section 3.9 we provide formulas for click probabilities that are useful when using models for click prediction or simulation. We conclude the chapter by summarizing the differences and similarities between the models.

3.1 RANDOM CLICK MODEL (RCM)

The simplest click model one can think of has only one parameter and is defined as follows:

P (Cu = 1) = ρ. (3.1)

This means that every document u has the same probability of being clicked, and that this probability is a single model parameter ρ.

Once we have estimated the value of ρ, we can generate user clicks by flipping a biased coin, i.e., by running a simple Bernoulli process. As we show in Section 4.1.1, the estimation process for ρ is also very straightforward and reduces to simple counting of clicks.

While this model is extremely simplistic, its performance can be used as a baseline when comparing to other models. In addition, the random click model (RCM) is almost guaranteed to be safe from overfitting since it has only one parameter.
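The coin-flipping view of RCM can be made concrete with a short sketch. This is our own illustration, not code from the original text; the function names and the session representation (a list of 0/1 clicks per SERP) are assumptions.

```python
import random

def simulate_rcm(rho, serp_size=10, rng=None):
    """Generate one session of clicks under RCM: every document is
    clicked independently with the same probability rho."""
    rng = rng or random.Random(0)
    return [1 if rng.random() < rho else 0 for _ in range(serp_size)]

def estimate_rho(sessions):
    """MLE for rho (see Section 4.1.1): total number of clicks
    divided by the total number of displayed documents."""
    total_clicks = sum(sum(session) for session in sessions)
    total_shown = sum(len(session) for session in sessions)
    return total_clicks / total_shown
```

For example, three sessions with three results each and three clicks in total give an estimate of ρ = 1/3.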


3.2 CLICK-THROUGH RATE MODELS (CTR)

The next step beyond the random click model is a model that has more than a single parameter. As with many other parameters that we study later, these typically depend either on the rank of a document or on the query-document pair. This gives us two possibilities for click models based on click-through rate estimation, which we present next.

3.2.1 RANK-BASED CTR MODEL (RCTR)

One of the quantities typically present in click log studies is the click-through rate at different positions. For example, according to Joachims et al. [2005], the click-through rate (CTR) of the first document is about 0.45, while the CTR of the tenth document is well below 0.05. So the simple model that we can build from this observation is that the click probability depends on the rank of the document:

P (Cr = 1) = ρr. (3.2)

As we will see in Section 4.1, the maximum likelihood estimation for the model parameters boils down to computing the CTR of each position on the training set, which is why this model can be called a CTR model.

RCTR can serve as a robust baseline that is more complex than RCM, yet has only a handful of parameters and hence is safe from overfitting. This model was implicitly used by Dupret and Piwowarski [2008] as a baseline for perplexity evaluation.

3.2.2 DOCUMENT-BASED CTR MODEL (DCTR)

Another option would be to estimate the click-through rates for each query-document pair:

P (Cu = 1) = ρuq. (3.3)

This model was formally introduced by Craswell et al. [2008] and used as a baseline. Unlike the simple RCM or RCTR models, DCTR has one parameter for each query-document pair. As a consequence, when we fit model parameters using past observations and then apply the result to predict future clicks, DCTR is more prone to overfitting than RCM or RCTR, especially given the fact that some documents and/or queries were not previously encountered in our click log. The same holds for other models studied below.
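Both CTR variants can be estimated from a click log by simple counting. The sketch below is ours, under the assumption that a log is a list of (query, [(document, click), ...]) records ordered by rank; the function names are illustrative.

```python
from collections import defaultdict

def rank_ctr(log):
    """Rank-based CTR (RCTR): fraction of sessions with a click at each rank."""
    clicks, shows = defaultdict(int), defaultdict(int)
    for _query, results in log:
        for rank, (_doc, click) in enumerate(results, start=1):
            clicks[rank] += click
            shows[rank] += 1
    return {rank: clicks[rank] / shows[rank] for rank in shows}

def document_ctr(log):
    """Document-based CTR (DCTR): one parameter per query-document pair."""
    clicks, shows = defaultdict(int), defaultdict(int)
    for query, results in log:
        for doc, click in results:
            clicks[(query, doc)] += click
            shows[(query, doc)] += 1
    return {pair: clicks[pair] / shows[pair] for pair in shows}
```

Note how `document_ctr` produces one parameter per previously seen pair: unseen query-document pairs simply have no estimate, which is the overfitting issue discussed above.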

In the next section we introduce a position-based model that unites the position (rank) bias and the document bias in a single model.

3.3 POSITION-BASED MODEL (PBM)

Many click models include a so-called examination hypothesis:

Cu = 1⇔ Eu = 1 and Au = 1, (3.4)


Figure 3.1: Graphical representation of the position-based model.

which means that a user clicks a document if, and only if, they examined the document and were attracted by the document. The random variables Eu and Au are usually considered independent. The corresponding Bayesian network (Definition 2.1) is shown in Figure 3.1.

The simplest model that uses the examination hypothesis introduces a set of document-dependent parameters α that represent the attractiveness1 of documents on a SERP:

P (Au = 1) = αuq. (3.5)

We should emphasize that attractiveness here is a characteristic of the document’s snippet (sometimes referred to as the document caption), and not of the full text of the document. As was shown by Turpin et al. [2009], this parameter is correlated with the full-text relevance obtained from judges in a TREC-style assessment process, but there is still a substantial discrepancy between them.

Joachims et al. [2005] show that the probability of a user examining a document depends heavily on its rank or position on a SERP and typically decreases with rank. To incorporate this intuition into a model, we introduce a set of examination parameters γr, one for each rank.2 This position-based model (PBM) was formally introduced by Craswell et al. [2008] and can be written as follows:

P (Cu = 1) = P (Eu = 1) · P (Au = 1) (3.6)
P (Au = 1) = αuq (3.7)
P (Eu = 1) = γru . (3.8)

1It is sometimes referred to as perceived relevance or just relevance of a document.
2Strictly speaking, it is assumed that the first document is always examined, so γ1 always equals 1 and we have one less parameter.
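Equations (3.6)–(3.8) factorize the click probability into an attractiveness term and a rank-dependent examination term. A minimal sketch (our own function, with made-up illustrative parameter values):

```python
def pbm_click_prob(alpha_uq, gamma, rank):
    """P(C_u = 1) = alpha_uq * gamma_r under PBM.
    gamma is a list of examination probabilities; gamma[0] is for rank 1."""
    return alpha_uq * gamma[rank - 1]

# Illustrative (made-up) examination curve, decreasing with rank as in
# the click log studies cited above; gamma_1 = 1 by convention.
gamma = [1.0, 0.8, 0.6, 0.4, 0.2]
```

With these values, a snippet of attractiveness 0.5 is clicked with probability 0.5 at rank 1 but only 0.3 at rank 3, illustrating how PBM separates the document effect from the position effect.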


3.4 CASCADE MODEL (CM)

Figure 3.2: Graphical representation of the cascade model (fragment).

The cascade model [Craswell et al., 2008] assumes that a user scans documents on a SERP from top to bottom until they find a relevant document. Under this assumption, the top ranked document u1 is always examined, while documents ur at ranks r ≥ 2 are examined if, and only if, the previous document ur−1 was examined and not clicked. If we combine this idea with the examination assumptions (3.6) and (3.7), we obtain the cascade model as introduced by Craswell et al. [2008]:

Cr = 1 ⇔ Er = 1 and Ar = 1 (3.9)
P (Ar = 1) = αurq (3.10)
P (E1 = 1) = 1 (3.11)
P (Er = 1 | Er−1 = 0) = 0 (3.12)
P (Er = 1 | Cr−1 = 1) = 0 (3.13)
P (Er = 1 | Er−1 = 1, Cr−1 = 0) = 1. (3.14)

Note that for the sake of simplicity we use the rank r as a subscript for the random variables A, E and C, and not the document u. We are going to use these notations interchangeably, writing either Xu or Xr depending on the context. The corresponding Bayesian network is shown in Figure 3.2.

This model allows simple estimation, since the examination events are observed: the model implies that all documents up to the first-clicked document were examined by a user. This also means that the cascade model can only describe sessions with one click and cannot explain non-linear examination patterns.

The key difference between the cascade model (CM) and the position-based model (PBM) is that the probability of clicking on a document ur in PBM does not depend on the events at previous ranks r′ < r, whereas in CM it does. In particular, CM does not allow sessions with more than one click, but such sessions are perfectly possible under PBM.
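The top-down scan of the cascade model can be sketched as a short simulation. This code is our own illustration (the function name and the attractiveness-list layout are assumptions, not from the original text):

```python
import random

def simulate_cascade(alphas, rng=None):
    """Scan the SERP top-down; click the first attractive document and stop.
    alphas[r] is the attractiveness of the document at rank r + 1."""
    rng = rng or random.Random(0)
    clicks = [0] * len(alphas)
    for r, alpha in enumerate(alphas):
        if rng.random() < alpha:
            clicks[r] = 1  # clicked => the user stops examining (3.13)
            break
    return clicks
```

By construction a simulated session contains at most one click, which is exactly the limitation of CM discussed above.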


3.5 USER BROWSING MODEL (UBM)

Figure 3.3: Graphical representation of the user browsing model (fragment).

The user browsing model (UBM) by Dupret and Piwowarski [2008] is an extension of the PBM model that has some elements of the cascade model. The idea is that the examination probability should take previous clicks into account, but should mainly be position-based: we assume that it depends not only on the rank of a document r, but also on the rank of the previously clicked document r′:3

P (Er = 1 | C1 = c1, . . . , Cr−1 = cr−1) = γrr′ , (3.15)

where r′ is the rank of the previously clicked document, or 0 if none of them was clicked. In other words:

r′ = max {k ∈ {0, . . . , r − 1} : ck = 1}, (3.16)

where c0 is set to 1 for convenience. Alternatively, (3.15) can be written as follows:

P (Er = 1 | C<r) = P (Er = 1 | Cr′ = 1, Cr′+1 = 0, . . . , Cr−1 = 0) = γrr′ (3.17)

Figure 3.3 shows a Bayesian network for the UBM model. The set of arrows on the left means that the examination probability P (Er = 1) depends on all click events Ck at smaller ranks k < r. Cr, in turn, influences all examination probabilities further down the ranking (hence the set of arrows on the right-hand side).
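The bookkeeping for r′ in (3.16) and the resulting conditional click probability can be sketched as follows. This is our own code; γ is assumed to be supplied as a dictionary keyed by (r, r′), and the function names are illustrative.

```python
def prev_click_rank(clicks, r):
    """Rank r' of the last clicked document above rank r, or 0 if there is
    none (c_0 = 1 by convention). clicks[k] holds the click at rank k + 1."""
    for k in range(r - 1, 0, -1):
        if clicks[k - 1] == 1:
            return k
    return 0

def ubm_cond_click_prob(alpha_uq, gamma, clicks, r):
    """P(C_r = 1 | C_<r) = alpha_uq * gamma_{r r'} under UBM."""
    return alpha_uq * gamma[(r, prev_click_rank(clicks, r))]
```

For instance, with clicks [0, 1, 0] at ranks 1–3, the document at rank 3 uses the examination parameter γ with r = 3 and r′ = 2.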

3.6 DEPENDENT CLICK MODEL (DCM)

The dependent click model (DCM) by Guo et al. [2009b] is an extension of the cascade model that is meant to handle sessions with multiple clicks. This model assumes that after a user clicked a document, they may still continue to examine other documents. In other words, (3.13) is replaced by

P (Er = 1 | Cr−1 = 1) = λr, (3.18)

where λr is the continuation parameter, which depends on the rank r of a document.

For uniformity with models studied below and to simplify parameter estimation, we introduce a set of satisfaction random variables Sr that denote the user satisfaction after a click:

P (Sr = 1 | Cr = 0) = 0 (3.19)
P (Sr = 1 | Cr = 1) = 1 − λr (3.20)
P (Er = 1 | Sr−1 = 1) = 0 (3.21)
P (Er = 1 | Er−1 = 1, Sr−1 = 0) = 1. (3.22)

The corresponding Bayesian network is shown in Figure 3.4.

Figure 3.4: Graphical representation of the dependent click model (fragment).

3Instead of the rank of the previously clicked document r′, the distance d = r − r′ to this document was used in the original paper [Dupret and Piwowarski, 2008], which is equivalent to the form we use.
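The DCM browsing behavior can be sketched as a simulation. This is our own illustration (function name and parameter-list layout assumed), directly encoding (3.18)–(3.22):

```python
import random

def simulate_dcm(alphas, lambdas, rng=None):
    """DCM: scan top-down; a document is clicked if attractive; after a
    click the user continues with probability lambdas[r], otherwise (no
    click) the user always continues to the next rank."""
    rng = rng or random.Random(0)
    clicks = [0] * len(alphas)
    for r, alpha in enumerate(alphas):
        if rng.random() < alpha:
            clicks[r] = 1
            if rng.random() >= lambdas[r]:  # satisfied: stop examining
                break
    return clicks
```

Unlike the cascade simulation, this one can produce several clicks per session whenever the continuation probabilities are non-zero.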

3.7 CLICK CHAIN MODEL (CCM)

The click chain model (CCM) by Guo et al. [2009a] is a further extension of the DCM model. The authors introduce a parameter to handle the situation where a user may abandon the search without clicking on a result. They also change the continuation parameter to depend not on the rank of a document, but on its relevance αuq. All in all, the model can be written as follows:

Cr = 1 ⇔ Er = 1 and Ar = 1 (3.23)
P (Ar = 1) = αurq (3.24)
P (E1 = 1) = 1 (3.25)
P (Er = 1 | Er−1 = 0) = 0 (3.26)
P (Er = 1 | Er−1 = 1, Cr−1 = 0) = τ1 (3.27)


P (Er = 1 | Cr−1 = 1) = τ2(1 − αur−1q) + τ3αur−1q, (3.28)

where τ1, τ2 and τ3 are three constants (continuation parameters).

As in the case of DCM, we introduce a satisfaction variable Sr for CCM:

P (Sr = 1 | Cr = 0) = 0 (3.29)
P (Sr = 1 | Cr = 1) = αurq (3.30)
P (Er = 1 | Er−1 = 1, Cr−1 = 0) = τ1 (3.31)
P (Er = 1 | Cr−1 = 1, Sr−1 = 0) = τ2 (3.32)
P (Er = 1 | Cr−1 = 1, Sr−1 = 1) = τ3. (3.33)

When we write it this way, we can say that the model makes the assumption that the probability of satisfaction is the same as the probability of attractiveness ((3.24) and (3.30)). This is a rather strong assumption, since satisfaction is determined by the content of a document itself (available only after clicking), while attractiveness depends only on the document’s snippet presented on a SERP.

Notice that CCM is the only model that we have considered so far that distinguishes between the continuation probability after not clicking the document (τ1) and the continuation probability after being dissatisfied by the clicked document (τ2). The corresponding graphical model is shown in Figure 3.5.

Figure 3.5: Graphical representation of the click chain model (fragment).

3.8 DYNAMIC BAYESIAN NETWORK MODEL (DBN)

Like CCM, the dynamic Bayesian network model (DBN) by Chapelle and Zhang [2009] extends the cascade model, but in a different way. Unlike CCM, DBN assumes that the user’s perseverance after a click depends not on the perceived relevance αuq but on the actual relevance4 σuq. In other words, DBN introduces another set of document-dependent parameters. The model, then, can be written as follows:

Cr = 1 ⇔ Er = 1 and Ar = 1 (3.34)
P (Ar = 1) = αurq (3.35)
P (E1 = 1) = 1 (3.36)
P (Er = 1 | Er−1 = 0) = 0 (3.37)
P (Sr = 1 | Cr = 1) = σurq (3.38)
P (Er = 1 | Sr−1 = 1) = 0 (3.39)
P (Er = 1 | Er−1 = 1, Sr−1 = 0) = γ, (3.40)

where γ is the continuation probability for a user that either did not click on a document or clicked but was not satisfied by it. The graphical model is shown in Figure 3.6.

Figure 3.6: Graphical representation of the dynamic Bayesian network model (fragment).

4This parameter is often referred to as the satisfaction probability.

3.8.1 SIMPLIFIED DBN MODEL (SDBN)

One special case of the DBN model is the so-called simplified DBN model (SDBN) [Chapelle and Zhang, 2009, Section 5], which assumes that γ = 1. This model allows for simple parameter estimation (Section 4.1.3) while maintaining a similar level of predictive power as DBN, according to Chapelle and Zhang.

It is worth noting that if we further assume that a user is always satisfied by a clicked document (Sr ≡ Cr), in other words, if all σuq are equal to 1, then the SDBN model reduces to the cascade model (Section 3.4).
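To illustrate the DBN behavior described by (3.34)–(3.40), including the SDBN special case γ = 1, here is a rough simulation sketch under our own naming conventions (not code from the original text):

```python
import random

def simulate_dbn(alphas, sigmas, gamma, rng=None):
    """DBN: a document is clicked if attracted; after a click the user
    leaves if satisfied (probability sigmas[r]); otherwise the user keeps
    examining with probability gamma."""
    rng = rng or random.Random(0)
    clicks = [0] * len(alphas)
    for r, alpha in enumerate(alphas):
        if rng.random() < alpha:
            clicks[r] = 1
            if rng.random() < sigmas[r]:  # satisfied: abandon the SERP
                break
        if rng.random() >= gamma:  # stop examining with probability 1 - gamma
            break
    return clicks

# With gamma = 1 (SDBN) and all sigmas equal to 1, this reduces to the
# cascade model: the first attractive document is clicked and the session ends.
```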


3.9 CLICK PROBABILITIES

Click models aim at modeling user clicks on a SERP. In particular, the basic click models discussed in this chapter are able to calculate the probability of a click on a given document u, i.e., P (Cu = 1), and the probability of a click on that document given previously observed clicks in the same session, i.e., P (Cu = 1 | C<ru).5 The former is used to estimate the click-through rate (CTR) of a document, while the latter is useful when simulating clicks in a session one by one or computing the likelihood.

For simple click models such as RCM, RCTR, DCTR and PBM, the probability of examining a document does not depend on clicks above this document (C<ru). Therefore, for these models P (Cu = 1 | C<ru) = P (Cu = 1). For the RCM, RCTR and DCTR models, these probabilities are directly equal to the model parameters: ρ, ρru and ρuq, respectively. For the PBM model, the probability of a click is decomposed into the examination and attractiveness probabilities and, thus,

P (Cu = 1) = P (Au = 1) · P (Eru = 1) = αuqγru . (3.41)

For the other basic click models, examination of a document depends on clicks above the document (C<ru) and, thus, the full and conditional probabilities of a click are calculated differently. For the CM, DCM, CCM, DBN and SDBN models, the full probability of a click, P (Cu = 1), is calculated as follows:

P (Cu = 1) = P (Cu = 1 | Eru = 1) · P (Eru = 1) = αuqεru . (3.42)

The difference between the models lies in the way they compute the examination probability P (Eru = 1) = εru. The general rule for all models mentioned above is the following (here, we use r instead of ru for simplicity):

εr+1 = P (Er+1 = 1)
     = P (Er = 1) · P (Er+1 = 1 | Er = 1)
     = P (Er = 1) · (P (Er+1 = 1 | Er = 1, Cr = 1) · P (Cr = 1 | Er = 1) + P (Er+1 = 1 | Er = 1, Cr = 0) · P (Cr = 0 | Er = 1)). (3.43)

The examination probability of each model can be calculated using this equation. For example, for the DBN model it gives the following formula for the examination probability:

εr+1 = εr ((1 − σuq)γαuq + γ(1 − αuq)). (3.44)
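The recursion (3.43)–(3.44) is easy to unroll numerically. The sketch below is our own; it returns the DBN examination probability at every rank, starting from ε1 = 1:

```python
def dbn_exam_probs(alphas, sigmas, gamma):
    """Examination probabilities eps_r under DBN, using eps_1 = 1 and
    eps_{r+1} = eps_r * gamma * (alpha_r * (1 - sigma_r) + (1 - alpha_r))."""
    eps = [1.0]
    for alpha, sigma in zip(alphas[:-1], sigmas[:-1]):
        eps.append(eps[-1] * gamma * (alpha * (1 - sigma) + (1 - alpha)))
    return eps
```

The full click probability at rank r is then alphas[r-1] * eps[r-1], per (3.42). As a sanity check, with γ = 1 and all σ equal to 1 the recursion collapses to the cascade form ε_{r+1} = ε_r(1 − α_r).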

The conditional probability of a click for the CM, DCM, CCM, DBN and SDBN models can be calculated in a similar way:

P (Cu = 1 | C<ru) = P (Cu = 1 | Eru = 1,C<ru) · P (Eru = 1 | C<ru) = αuqεru , (3.45)

5Here, C<ru denotes the set of observed clicks above rank ru.


where the examination probability P (Eru = 1 | C<ru) = εru should now be calculated based on clicks observed above the document u (again, we use r instead of ru):

εr+1 = P (Er+1 = 1 | C<r+1)
     = P (Er+1 = 1 | Er = 1, C<r+1) · P (Er = 1 | C<r+1)
     = P (Er+1 = 1 | Er = 1, Cr = 1) · P (Er = 1 | Cr = 1, C<r) · c(s)r + P (Er+1 = 1 | Er = 1, Cr = 0) · P (Er = 1 | Cr = 0, C<r) · (1 − c(s)r). (3.46)

Here, we split the conditional examination probability into two parts based on the click c(s)r observed at rank r in a particular query session s. All basic click models assume that a clicked document is examined, thus P (Er = 1 | Cr = 1, C<r) = 1. On the other hand, if a document is not clicked, then we cannot directly infer whether it was examined or not. Therefore, P (Er = 1 | Cr = 0, C<r) should be calculated as follows:

P (Er = 1 | Cr = 0, C<r) = P (Er = 1, Cr = 0 | C<r) / P (Cr = 0 | C<r)
                         = P (Cr = 0 | Er = 1, C<r) · P (Er = 1 | C<r) / P (Cr = 0 | C<r)
                         = εr(1 − αurq) / (1 − αurqεr). (3.47)

So the final formula for the conditional examination probability is the following:

εr+1 = P (Er+1 = 1 | Er = 1, Cr = 1) · c(s)r + P (Er+1 = 1 | Er = 1, Cr = 0) · (εr(1 − αurq) / (1 − αurqεr)) · (1 − c(s)r). (3.48)

The conditional examination probability for each click model can be calculated using this equation. For example, for the DCM model this probability can be computed as follows:

εr+1 = λr c(s)r + ((1 − αurq)εr / (1 − αurqεr)) · (1 − c(s)r). (3.49)
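The DCM case of this recursion can be sketched directly; this is our own code (function name and list layout assumed), following the update in (3.49):

```python
def dcm_cond_exam_probs(alphas, lambdas, clicks):
    """Conditional examination probabilities eps_r given the observed
    clicks of a session, per (3.49): eps_{r+1} = c_r * lambda_r +
    (1 - c_r) * (1 - alpha_r) * eps_r / (1 - alpha_r * eps_r)."""
    eps = [1.0]
    for alpha, lam, c in zip(alphas[:-1], lambdas[:-1], clicks[:-1]):
        if c:
            eps.append(lam)  # clicked: examined for sure, continue w.p. lambda_r
        else:
            eps.append((1 - alpha) * eps[-1] / (1 - alpha * eps[-1]))
    return eps
```

For example, with α = 0.5 at every rank, λ1 = 0.4 and a click at rank 1 only, the examination probability is 0.4 at rank 2 and then 0.5 · 0.4 / (1 − 0.5 · 0.4) = 0.25 at rank 3.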

Finally, the UBM click model has a different procedure for calculating the full and conditional click probabilities. The UBM conditional probability of a click is computed as follows:

P (Cu = 1 | C<ru) = αuqγrur′u, (3.50)

where r′u is the rank of the last clicked document above u.

To calculate the full click probability P (Cu = 1), we should marginalize over all possible clicks above the document u [Chuklin et al., 2013c]:

P (Cu = 1) = ∑_{j=0}^{ru−1} P (Cj = 1, Cj+1 = 0, . . . , Cru−1 = 0, Cru = 1). (3.51)


By applying Bayes’ rule we can rewrite this equation as follows:

P (Cu = 1) = ∑_{j=0}^{ru−1} P (Cj = 1) · (∏_{k=j+1}^{ru−1} P (Ck = 0 | Cj = 1, Cj+1 = 0, . . . , Ck−1 = 0)) · P (Cru = 1 | Cj = 1, Cj+1 = 0, . . . , Cru−1 = 0)
           = ∑_{j=0}^{ru−1} P (Cj = 1) · (∏_{k=j+1}^{ru−1} (1 − αukqγkj)) · αuqγruj, (3.52)

where P (C0 = 1) = 1.

The full and conditional click probabilities for all models presented in this chapter are summarized in Table 3.1.
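The marginalization (3.51)–(3.52) can be implemented recursively, because P (Cj = 1) on the right-hand side is itself a full click probability computed the same way. The sketch below is ours (function name and the dictionary layout of γ, keyed by (rank, last-click rank), are assumptions):

```python
def ubm_full_click_probs(alphas, gamma, max_rank):
    """Full click probabilities P(C_r = 1) for r = 1..max_rank under UBM,
    per (3.52). alphas[r-1] is the attractiveness at rank r; gamma[(r, j)]
    is the examination probability at rank r given the last click was at
    rank j (j = 0 means no click so far)."""
    full = {0: 1.0}  # P(C_0 = 1) = 1 by convention
    for r in range(1, max_rank + 1):
        total = 0.0
        for j in range(r):  # marginalize over the rank j of the last click
            term = full[j]
            for k in range(j + 1, r):  # no clicks between j and r
                term *= 1 - alphas[k - 1] * gamma[(k, j)]
            total += term * alphas[r - 1] * gamma[(r, j)]
        full[r] = total
    return [full[r] for r in range(1, max_rank + 1)]
```

A handy sanity check: with γ ≡ 1, examination never blocks a click, so P (Cr = 1) reduces to the attractiveness αr and clicks are mutually independent.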


Table 3.1: Full and conditional probabilities of click for basic click models.

RCM:   P (Cu = 1) = ρ;   P (Cu = 1 | C<ru) = ρ.
RCTR:  P (Cu = 1) = ρru;   P (Cu = 1 | C<ru) = ρru.
DCTR:  P (Cu = 1) = ρuq;   P (Cu = 1 | C<ru) = ρuq.
PBM:   P (Cu = 1) = αuqγru;   P (Cu = 1 | C<ru) = αuqγru.
CM:    P (Cu = 1) = αuqεru, where εr+1 = εr(1 − αurq);   P (Cu = 1 | C<ru) = αuqεru, where εr = 1 if there are no clicks before rank r and 0 otherwise.
UBM:   P (Cu = 1) = ∑_{j=0}^{ru−1} P (Cj = 1) (∏_{k=j+1}^{ru−1} (1 − αukqγkj)) αuqγruj, where P (C0 = 1) = 1;   P (Cu = 1 | C<ru) = αuqγrr′.
DCM:   P (Cu = 1) = αuqεru, where εr+1 = εr(αurqλr + (1 − αurq));   P (Cu = 1 | C<ru) = αuqεru, where εr+1 = c(s)r λr + (1 − c(s)r)(1 − αurq)εr / (1 − αurqεr).
CCM:   P (Cu = 1) = αuqεru, where εr+1 = εr(αurq((1 − αurq)τ2 + αurqτ3) + (1 − αurq)τ1);   P (Cu = 1 | C<ru) = αuqεru, where εr+1 = c(s)r((1 − αurq)τ2 + αurqτ3) + (1 − c(s)r)(1 − αurq)εrτ1 / (1 − αurqεr).
DBN:   P (Cu = 1) = αuqεru, where εr+1 = εrγ(αurq(1 − σurq) + (1 − αurq));   P (Cu = 1 | C<ru) = αuqεru, where εr+1 = c(s)r γ(1 − σurq) + (1 − c(s)r)(1 − αurq)εrγ / (1 − αurqεr).
SDBN:  Same as DBN with γ = 1.


3.10 SUMMARY

As has become apparent from the model equations and Bayesian network graphical models, there are many similarities between the models that we have presented. In particular, all models introduce an attractiveness variable Au and most of the models define a satisfaction variable Su. The models then define how the examination probability Eu depends on the examination, satisfaction and click events for previous documents: E<ru, S<ru, C<ru.

Table 3.2: Core assumptions of the basic click models. Here, P (Au = 1) is the probability that document u is attractive, P (Su = 1 | Cu = 1) is the probability that a user is satisfied with document u after clicking on it, and P (Eu = 1 | E<ru, S<ru, C<ru) is the probability of examining document u based on examination, satisfaction and click events for previous documents.

Click model   P (Au = 1)              P (Su = 1 | Cu = 1)     P (Eu = 1 | E<ru, S<ru, C<ru)
RCM           constant (ρ)            N/A                     constant (1)
RCTR          constant (1)            N/A                     rank (ρr)
DCTR          query-document (ρuq)    N/A                     constant (1)
PBM           query-document (αuq)    N/A                     rank (γr)
CM            query-document (αuq)    constant (1)            constant (1 or 0)
UBM           query-document (αuq)    N/A                     other (γrr′)
DCM           query-document (αuq)    rank (1 − λr)           constant (1 or 0)
CCM           query-document (αuq)    query-document (αuq)    constant (τ1, τ2, τ3)
DBN           query-document (αuq)    query-document (σuq)    constant (γ)
SDBN          query-document (αuq)    query-document (σuq)    constant (1 or 0)

We summarize the parameters corresponding to these events in Table 3.2. The table shows that RCM and RCTR use a constant value for attractiveness, while the other models use a set of query-document dependent parameters αuq. Satisfaction is not defined for RCM, RCTR, DCTR, PBM and UBM; it is constant for CM (satisfaction is equivalent to a click for the cascade model); it depends on the document rank for DCM and on the query-document pair for CCM and DBN. CCM and DBN use different parameters for satisfaction: CCM re-uses the attractiveness parameters, while DBN introduces a new set of parameters σuq. The examination probability is always 1 for RCM and DCTR, and depends only on the document rank for RCTR and PBM. It has a non-trivial dependency on previous clicks in UBM, while in CM, DCM, CCM and DBN it depends on the click, satisfaction and examination events at rank (r − 1) and is governed by a set of constants (different for different models).

An interesting observation, which becomes evident by inspecting Table 3.2, concerns the difference between the SDBN (Section 3.8.1) and DCM (Section 3.6) models. These two models appear to be the same apart from the definition of the satisfaction probability P (Su = 1 | Cu = 1): for SDBN it depends on the query-document pair, while for DCM it depends on the document rank.


In the next chapter we are going to discuss the process of estimating click model parameters, which completes the picture we have begun to sketch in this chapter. Chapter 5 describes different ways to evaluate click models and Chapter 6 presents datasets and software packages useful for experimenting with click models. After reading these chapters the reader should be able to experiment with click models, modify them and evaluate them. We present our own experiments in Chapter 7.

Alternatively, the reader may now proceed directly to Chapter 8, where various advanced click models are described, or to Chapter 9, where we discuss applications of click models.


CHAPTER 4

Parameter Estimation

In Chapter 3 we introduced basic click models for web search. To evaluate the quality of these models (in Chapter 5) and to use them in applications (in Chapter 9), click models first need to be trained. In this chapter we describe the process of learning click model parameters from past click observations. This process is called parameter estimation or parameter learning. We will review two main techniques used to estimate the parameters of click models, namely maximum likelihood estimation (MLE) and the expectation-maximization algorithm (EM). We will then show how each technique can be applied to parameter estimation. Several detailed examples of these techniques will be given to support the theory. Additionally, we discuss a few alternative methods of parameter estimation, but our main focus is on the MLE and EM algorithms.

The MLE estimates and EM update rules for the parameters of the click models introduced so far will be presented in Section 4.3. Readers interested in implementing those click models without modifications may skip the calculations of this chapter and directly use the formulas that are summarized in Table 4.1.

The click models that we have introduced so far can be divided into two classes based on the nature of their random variables. These classes determine the way the model parameters are estimated. Click models of the first class operate with observed random variables. For example, in the simplified DBN model [Chapelle and Zhang, 2009] both the attractiveness A and satisfaction S depend on the examination event E, which is observed. In this case, maximum likelihood estimation can be used to estimate their probabilities. Click models of the second class have one or more hidden variables, the values of which are not observed. In this case we suggest using the expectation-maximization algorithm. Below we discuss each of the two cases in detail.

4.1 THE MLE ALGORITHM

Click models consider binary events (e.g., a document is either examined by a user or not) and assume that the corresponding random variables are distributed according to a Bernoulli distribution: X ∼ Bernoulli(θ), where θ is a parameter of the distribution. The parameter θ is unknown and needs to be estimated based on the values of X. In case X is observed, θ can be derived in closed form using maximum likelihood estimation. If X is not observed, the expectation-maximization algorithm should be used instead (see Section 4.2).

Given a concept c, assume that we observe a set of user sessions Sc containing this concept at least once, as well as the values of X for all or some occurrences of c. Since Xc is a binary random variable, its value xc can be either 0 or 1. The likelihood for the parameter θc can be written as


follows:1

L(θc) = ∏_{s∈Sc} ∏_{ci∈sc} θc^(x(s)ci) · (1 − θc)^(1 − x(s)ci), (4.1)

where sc is the set of occurrences of the concept c in the session s and x(s)ci is the value of the random variable X when applied to the i-th occurrence of c.

By taking the log of (4.1), calculating its derivative and equating it to zero, we get the MLE estimation of the parameter θc. In our case, this is simply the sample mean of Xc:

θc = (∑_{s∈Sc} ∑_{ci∈sc} I(X(s)ci = 1)) / (∑_{s∈Sc} |sc|) = (∑_{s∈Sc} ∑_{ci∈sc} x(s)ci) / (∑_{s∈Sc} |sc|), (4.2)

where |sc| is the size of the set sc.

Below, we present several examples of parameter estimation using MLE. In particular, we focus on the RCM, CTR, DCM and SDBN click models.

1Here, we assume the simplest case where the full likelihood function can be represented as a product of likelihood functions for each model parameter.

4.1.1 MLE FOR THE RCM AND CTR MODELS

The simplest click model, RCM, discussed in Section 3.1, has only one parameter ρ, which models the probability of a click. The click event is always observed, so ρ can be estimated using MLE. In the case of RCM, the concept c is just “any document,” so in (4.2) we should consider all sessions in S and all documents in each session:

ρ = (∑_{s∈S} ∑_{u∈s} c(s)u) / (∑_{s∈S} |s|). (4.3)

Similar to this formula, we can obtain the estimation of the RCTR model (Section 3.2.1):

ρr = (∑_{s∈S} c(s)r) / |S|, (4.4)

and the DCTR model (Section 3.2.2):

ρuq = (∑_{s∈Suq} c(s)u) / |Suq|, (4.5)

where Suq is the set of sessions initiated by the query q and containing the document u.

4.1.2 MLE FOR DCM

The DCM click model, introduced in Section 3.6, has two sets of parameters: the perceived relevance αuq of all query-document pairs and the continuation probabilities λr for all ranks r. The corresponding random variables are the attractiveness A and the opposite of satisfaction, i.e., 1 − S. In the general DCM model, the attractiveness A is not observable starting from the rank of the last-clicked document (denoted as l here): we do not know whether a user stopped examining documents because they were satisfied by the last-clicked document, or because they were not attracted by any document below it. However, if a user is assumed to be satisfied by the last-clicked document, then A and S do become observable. We will call this modification simplified DCM, or SDCM for short. According to the simplified DCM model, for r ≤ l a document ur is attractive whenever it is clicked:

Aur = 1 ⇔ Cur = 1. (4.6)

Also, a user is satisfied by a document at rank r whenever ur is clicked and this is the last click in a session:

Sr = 1 ⇔ Cur = 1 and r = l. (4.7)

Since all random variables of SDCM are observed, its parameters can be estimated using MLE.

The attractiveness αuq of a given query-document pair is defined for sessions that have been initiated by the query q and contain the document u above or at the last-clicked rank. More formally, let us consider the query-document pair uq as a concept and define Suq = {sq : u ∈ sq, ru ≤ l}, where sq is a session initiated by q. Notice that the concept uq appears at most once in each session. Thus, according to (4.2), the attractiveness parameter αuq of the simplified DCM model is estimated as follows:

αuq = (∑_{s∈Suq} I(A(s)u = 1)) / (∑_{s∈Suq} 1)
    = (1/|Suq|) ∑_{s∈Suq} I(A(s)u = 1)
    = (1/|Suq|) ∑_{s∈Suq} I(C(s)u = 1)
    = (1/|Suq|) ∑_{s∈Suq} c(s)u. (4.8)

This equation means that the attractiveness of a document is directly proportional to the number of times it is clicked and inversely proportional to the number of times it is presented for a given query above or at the last-clicked rank.
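This counting can be sketched directly. The code below is ours, under the assumption that a session is (query, [(document, click), ...]) ordered by rank; sessions without clicks are skipped, as in the original paper.

```python
from collections import defaultdict

def estimate_alpha_sdcm(sessions):
    """MLE of alpha_uq under simplified DCM, per (4.8): clicks divided by
    impressions at or above the last-clicked rank."""
    clicks, shows = defaultdict(int), defaultdict(int)
    for query, results in sessions:
        clicked = [r for r, (_doc, c) in enumerate(results) if c]
        if not clicked:
            continue  # sessions without clicks are excluded
        last = clicked[-1]
        for doc, c in results[: last + 1]:  # only up to the last click
            clicks[(query, doc)] += c
            shows[(query, doc)] += 1
    return {pair: clicks[pair] / shows[pair] for pair in shows}
```

Documents displayed below the last click contribute nothing, which is exactly the restriction ru ≤ l in the definition of Suq.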

The continuation probability λr is the opposite of the satisfaction probability: λr = 1 − σr. The satisfaction event Sr, in turn, is defined for sessions where a click at rank r is observed: Sr = {s : c(s)ur = 1}.2 Combining the above observations with (4.2), we obtain the MLE estimate of the continuation probability λr:

λr = 1 − (∑_{s∈Sr} I(S(s)ur = 1)) / (∑_{s∈Sr} 1)
   = 1 − (1/|Sr|) ∑_{s∈Sr} I(S(s)ur = 1)
   = 1 − (1/|Sr|) ∑_{s∈Sr} I(r(s)u = l)
   = (1/|Sr|) ∑_{s∈Sr} I(r(s)u ≠ l). (4.9)

2Sessions without clicks were excluded in the original paper [Guo et al., 2009b].

Again, notice that the rank r that we consider appears only once in each session, so we do not needtwo summations in the numerator.

Equation 4.9 shows that the satisfaction probability at a given rank r is equal to the number oftimes r is the rank of the last-clicked document, divided by the number of times a click at rank r isobserved. The continuation probability, in turn, is one minus this quantity.

Notice that the parameters of the general DCM model should be estimated using the EMalgorithm because it has hidden variables. However, the paper in which DCM was originallyintroduced uses the same equations (4.8) and (4.9), although derived in a different way [Guo et al.,2009b].
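To make the two counting formulas (4.8) and (4.9) concrete, here is a minimal sketch of SDCM estimation over a toy session log. The session representation (query, ranked documents, binary click list) and the function name are our own illustration, not from the original papers; following Guo et al. [2009b], sessions without clicks are skipped.

```python
from collections import defaultdict

def sdcm_mle(sessions, num_ranks=10):
    """MLE for simplified DCM, eqs. (4.8)-(4.9).

    `sessions` is a list of (query, docs, clicks) triples, where `docs`
    is the ranked document list and `clicks` is a 0/1 list of the same
    length.  Sessions without clicks are skipped.
    """
    click_cnt = defaultdict(int)       # clicks on a (u, q) pair
    shown_cnt = defaultdict(int)       # times (u, q) shown at or above rank l
    last_click_cnt = [0] * num_ranks   # times rank r is the last-clicked rank
    rank_click_cnt = [0] * num_ranks   # times a click at rank r is observed

    for query, docs, clicks in sessions:
        if 1 not in clicks:
            continue
        l = max(r for r, c in enumerate(clicks) if c == 1)  # last-clicked rank
        for r in range(l + 1):         # only ranks at or above the last click
            shown_cnt[(docs[r], query)] += 1
            click_cnt[(docs[r], query)] += clicks[r]
            if clicks[r] == 1:
                rank_click_cnt[r] += 1
        last_click_cnt[l] += 1

    alpha = {uq: click_cnt[uq] / shown_cnt[uq] for uq in shown_cnt}
    # lambda_r = 1 - sigma_r: fraction of clicks at r that are NOT the last click
    lam = [(1 - last_click_cnt[r] / rank_click_cnt[r]) if rank_click_cnt[r] else 0.0
           for r in range(num_ranks)]
    return alpha, lam
```

Both parameters reduce to simple ratios of counters, which is why SDCM can be estimated in a single pass over the log.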

4.1.3 MLE FOR SDBN

Similarly to the simplified DCM click model, SDBN assumes that a user examines all documents until the last-clicked one and then abandons the search (Section 3.8.1). In this case, both the attractiveness A and the satisfaction S of SDBN are observed. Therefore, MLE can be used to estimate the model parameters: attractiveness probabilities αuq and satisfaction probabilities σuq.

In the case of SDBN, both attractiveness and satisfaction are defined for a query-document pair. Therefore, in order to estimate αuq and σuq for a given pair, we should consider sessions initiated by the query q and containing the document u above or at the last-clicked rank l: Suq = {sq : u ∈ sq, ru ≤ l}. Then, according to (4.2), the attractiveness probability αuq can be estimated using MLE as follows:

  αuq = (∑_{s∈Suq} I(Au^(s) = 1)) / (∑_{s∈Suq} 1)
      = (1/|Suq|) ∑_{s∈Suq} I(Au^(s) = 1)
      = (1/|Suq|) ∑_{s∈Suq} I(Cu^(s) = 1)
      = (1/|Suq|) ∑_{s∈Suq} cu^(s).   (4.10)

This means that the attractiveness of a document, given a query, is directly proportional to the number of times this document is clicked and inversely proportional to the number of times it is presented in response to the given query above or at the last-clicked rank.


The satisfaction event Su is defined only for the part of the session set Suq where a click on u is observed: S′uq = {s : s ∈ Suq, cu^(s) = 1}. Based on this session set, the satisfaction probability σuq is estimated as follows:

  σuq = (∑_{s∈S′uq} I(Su^(s) = 1)) / (∑_{s∈S′uq} 1)
      = (1/|S′uq|) ∑_{s∈S′uq} I(Su^(s) = 1)
      = (1/|S′uq|) ∑_{s∈S′uq} I(Cu^(s) = 1, ru^(s) = l)
      = (1/|S′uq|) ∑_{s∈S′uq} I(ru^(s) = l).   (4.11)

Thus, the satisfaction probability of a document u, given a query q, is equal to the number of times u appears as the last-clicked document, divided by the number of times it is clicked for the query q.
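The two SDBN estimators can also be read off as counting: α divides clicks by impressions up to the last-clicked rank (4.10), and σ divides last-clicks by clicks (4.11). A minimal sketch, with our own session representation (query, ranked documents, binary click list):

```python
from collections import defaultdict

def sdbn_mle(sessions):
    """MLE for SDBN, eqs. (4.10)-(4.11): per (document, query) pair,
    alpha = clicks / impressions at or above the last-clicked rank,
    sigma = times u is the last click / times u is clicked."""
    shown = defaultdict(int)    # impressions of (u, q) within S_uq
    clicked = defaultdict(int)  # clicks on (u, q) within S_uq
    last = defaultdict(int)     # times u is the last-clicked document for q

    for query, docs, clicks in sessions:
        if 1 not in clicks:
            continue                       # no observable part of the session
        l = max(r for r, c in enumerate(clicks) if c == 1)
        for r in range(l + 1):
            uq = (docs[r], query)
            shown[uq] += 1
            clicked[uq] += clicks[r]
        last[(docs[l], query)] += 1

    alpha = {uq: clicked[uq] / shown[uq] for uq in shown}
    sigma = {uq: last[uq] / clicked[uq] for uq in clicked if clicked[uq] > 0}
    return alpha, sigma
```

A clicked document that never ends a session gets σ close to 0 even if its α is high, which is exactly the perceived vs. actual relevance distinction of DBN.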

In contrast to SDBN, the full DBN click model, discussed in Section 3.8, has an additional parameter γ. It determines whether a user continues examining documents after they are not satisfied with the current one. In this case, the attractiveness A and satisfaction S are not observable after the last-clicked document, so the EM algorithm should be used to estimate the parameters of DBN. We discuss the EM estimation method in the next section and apply it to DBN in Section 4.2.4.

4.2 THE EM ALGORITHM

Consider a random variable X and a set of its parents P(X) in a Bayesian network.³ Assume further that there is an assignment P(X) = p of the parents such that the probability P(X | P(X) = p) is a Bernoulli distribution with a parameter θ. In this case, the expectation-maximization (EM) algorithm by Dempster et al. [1977] can be used to estimate the value of the parameter θ if X or any of its parents are not observed.⁴

The EM algorithm operates by iteratively performing expectation (E) and maximization (M) steps. During the E-step, θ is assumed to be known and fixed, and the values of the so-called expected sufficient statistics (ESS), namely E[∑_s I(X = x, P(X) = p | C^(s))], are computed,⁵ where C^(s) is a vector of observations for a data point s. During the M-step, a new value of θ is derived using a formula similar to (4.2) to maximize the empirical likelihood. EM starts by assigning some initial values to the model parameters and then repeats the E- and M-steps until convergence.

³ By parents of a random variable X we mean the set of random variables connected to X by a directed edge in the dependency graph of a model. This set can be empty if there are no edges coming into X.
⁴ If X and all its parents are observed, maximum likelihood estimation can be used instead (see Section 4.1).
⁵ See, e.g., [Koller and Friedman, 2009, Section 19.2.2] for more details.


Notice that the EM algorithm is not guaranteed to converge quickly. The only theoretical guarantee is that the likelihood value improves on each iteration of the algorithm and eventually converges to a (generally local) optimum. In practice, however, the convergence rate is fast, especially in the beginning. It is possible to adopt other optimization techniques, e.g., gradient ascent methods, but we are not going to cover them here.

Below, we show how the EM algorithm can be applied to estimate the parameters of click models. We discuss the E-step in Section 4.2.1, the M-step in Section 4.2.2, and examples of the EM estimation for the UBM and DBN click models in Sections 4.2.3 and 4.2.4, respectively.
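Schematically, all the EM estimators below share the same outer loop. The function below is a generic sketch of that loop (our own scaffolding, not from the survey), where `update_fn` implements one combined E+M step such as the update rule (4.16):

```python
def em(sessions, update_fn, init_params, max_iter=50, tol=1e-4):
    """Generic EM loop: repeat combined E+M steps until the parameters
    stop changing (or `max_iter` is reached).  `update_fn` maps the
    current parameter dict to a new one, e.g. by computing expected
    sufficient statistics and renormalizing as in (4.16)."""
    params = dict(init_params)
    for _ in range(max_iter):
        new_params = update_fn(sessions, params)
        delta = max(abs(new_params[k] - params[k]) for k in params)
        params = new_params
        if delta < tol:
            break
    return params
```

For a flavor of the convergence behavior, a toy `update_fn` that contracts toward a fixed point, e.g. `lambda s, p: {"theta": 0.5 * (p["theta"] + 0.8)}`, drives `theta` arbitrarily close to 0.8; real click-model updates behave analogously, except that the fixed point is only a local optimum of the likelihood.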

4.2.1 EXPECTATION (E-STEP)

Given a click model and a model parameter θc corresponding to a random variable Xc, our goal is to find the value of θc that optimizes the log-likelihood of the model given a set of observed query sessions:

  LL = ∑_{s∈S} log ( ∑_X P(X, C^(s) | Ψ) ),   (4.12)

where X is the vector of all hidden variables, C^(s) is the vector of clicks observed in a particular session s and Ψ is the vector of all model parameters.

Generally speaking, the equation above is hard to compute due to the necessity of summing over all hidden variables. As was shown by Dempster et al. [1977], one can replace the summation over hidden variables by an expectation. So instead of the original likelihood function we are going to optimize a similar but different expression:

  Q = ∑_{s∈S} E_{X|C^(s),Ψ} [ log P(X, C^(s) | Ψ) ].   (4.13)

This simplifies the formula and allows us to group it around the parameter θc:

  Q(θc) = ∑_{s∈S} E_{X|C^(s),Ψ} [ log P(X, C^(s) | Ψ) ]
        = ∑_{s∈S} E_{X|C^(s),Ψ} [ ∑_{ci∈s} ( I(Xci^(s) = 1, P(Xci^(s)) = p) log θc
                                            + I(Xci^(s) = 0, P(Xci^(s)) = p) log(1 − θc) ) + Z ]
        = ∑_{s∈S} ∑_{ci∈s} ( P(Xci^(s) = 1, P(Xci^(s)) = p | C^(s), Ψ) log θc
                            + P(Xci^(s) = 0, P(Xci^(s)) = p | C^(s), Ψ) log(1 − θc) ) + Z,   (4.14)

where P(Xc) = p is the assignment of the parents of Xc corresponding to the parameter θc and Z is the part of the expression that does not depend on θc.


Here, we use the fact that the expectation of a sum is a sum of expectations, and the fact that the expected value of an indicator function E[I(expr)] is equal to the probability of the corresponding expression P(expr).

So, at the E-step we compute the following expected sufficient statistics:

  ESS(x) = ∑_{s∈S} ∑_{ci∈s} P(Xci^(s) = x, P(Xci^(s)) = p | C^(s), Ψ),   (4.15)

where x ∈ {0, 1}. The computation is done using the formulas from Table 3.1 for each session s.

Notice that at the E-step the value of θc is assumed to be known and fixed. Therefore, in order to launch the EM algorithm, θc should first be set to some initial value, which is then updated on every iteration. The initialization of the algorithm is important; it may affect the convergence rate and even push the algorithm to a different local optimum. For example, if the satisfaction probability in DBN (Section 3.8) is initially set to 0 then, as we will see below in Section 4.2.4, Equation 4.41, it will always stay 0.

Existing studies on click models do not provide specific recommendations on choosing initial values and use intuitive constants such as 0.2 or 0.5. When possible, one should use values obtained from a simpler model, e.g., a CTR model studied in Section 3.2, or rely on intuition or prior knowledge.

4.2.2 MAXIMIZATION (M-STEP)

The expected value of the log-likelihood function obtained during the E-step (Equation 4.14 in Section 4.2.1) should be maximized by taking its derivative with respect to the parameter θc and equating it to zero:

  ∂Q(θc)/∂θc = ∑_{s∈S} ∑_{ci∈s} ( P(Xci^(s) = 1, P(Xci^(s)) = p | C^(s), Ψ) / θc
                                 − P(Xci^(s) = 0, P(Xci^(s)) = p | C^(s), Ψ) / (1 − θc) ) = 0.

From this equation the new optimal value of the parameter θc can be derived, which will then be used in the next iteration of EM:

  θc^(t+1) = ( ∑_{s∈S} ∑_{ci∈s} P(Xci^(s) = 1, P(Xci^(s)) = p | C^(s), Ψ) )
             / ( ∑_{s∈S} ∑_{ci∈s} ∑_{x=0}^{1} P(Xci^(s) = x, P(Xci^(s)) = p | C^(s), Ψ) )
           = ( ∑_{s∈S} ∑_{ci∈s} P(Xci^(s) = 1, P(Xci^(s)) = p | C^(s), Ψ) )
             / ( ∑_{s∈S} ∑_{ci∈s} P(P(Xci^(s)) = p | C^(s), Ψ) )
           = ESS^(t)(1) / ( ESS^(t)(1) + ESS^(t)(0) ).   (4.16)

Here, ESS is composed of probabilities that should be calculated for each session based on the observed clicks C^(s) and the set of parameters Ψ of a click model. Notice that the values of parameters calculated during iteration t are used to estimate θc^(t+1).

Notice also that it is sometimes practical to smooth the operands in (4.16) in order to deal with data sparsity. For example, Dupret and Piwowarski [2008] added two fictitious observations, one click and one skip (i.e., no click), when estimating the attractiveness probability of each query-document pair. This essentially means that the sum in the numerator starts from 1 instead of 0 and the sum in the denominator starts from 2.

A slightly more general way to view the smoothing approach is to assume that the parameters themselves have some prior distribution. For example, Chapelle and Zhang [2009] assume that the attractiveness and satisfaction parameters of the DBN model have Beta(1, 1) priors, which corresponds to the same kind of smoothing discussed above. In general, any prior distribution can be used here. According to Guo et al. [2009a], the right choice of priors and a Bayesian approach to parameter learning are crucial for less frequent queries.
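The Beta(1, 1) smoothing amounts to adding pseudo-counts before dividing. A one-line sketch (the helper name is our own, for illustration):

```python
def smoothed_rate(num_clicks, num_shown, prior_clicks=1.0, prior_total=2.0):
    """Beta(1, 1)-smoothed click rate: one fictitious click and one
    fictitious skip are added to the observed counts, so the estimate
    is the posterior mean (clicks + 1) / (impressions + 2)."""
    return (num_clicks + prior_clicks) / (num_shown + prior_total)
```

With no observations the estimate is the prior mean 0.5, and it approaches the raw click-through rate as the counts grow, which is why this matters mostly for infrequent queries.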

Below, we discuss two examples of EM estimation based on the UBM and DBN click models (Sections 4.2.3 and 4.2.4, respectively). To simplify the discussion, we omit the superscript (s), the index i and the parameter set Ψ in our equations, and use r instead of ru to denote the rank of a document u.

4.2.3 EM ESTIMATION FOR UBM

The UBM click model, introduced in Section 3.5, has two sets of parameters: the attractiveness probability αuq for each query-document pair and the examination probability γrr′, which depends on the document rank r and the rank r′ of the previously clicked document. In the UBM model, neither the attractiveness A nor the examination E is observed, so we have to use the EM algorithm to estimate the model parameters. Below, we show how to apply the theoretical results presented in Sections 4.2.1 and 4.2.2 to derive the EM update rules for αuq and γrr′.

Recall that in order to perform the EM estimation, the model parameters first have to be set to some initial values, which are then updated by the EM algorithm. The original study on UBM suggests setting the attractiveness probabilities to 0.2 and the examination probabilities to 0.5 [Dupret and Piwowarski, 2008].

Attractiveness probability. According to (3.10) in Section 3.5, the attractiveness probability of the UBM model is defined as follows:

  αuq = P(Au = 1).   (4.17)

Notice that αuq is defined for sessions that are initiated by the query q and that contain the document u: Suq = {sq : u ∈ sq}. Notice, also, that the attractiveness random variable A does not have parents in the Bayesian network, so P(Au = 1, P(Au) = p | C) = P(Au = 1 | C) and P(P(Au) = p | C) = 1. Therefore, we can write the update rule for αuq based on (4.16) as follows:

  αuq^(t+1) = (∑_{s∈Suq} P(Au = 1 | C)) / (∑_{s∈Suq} 1) = (1/|Suq|) ∑_{s∈Suq} P(Au = 1 | C).   (4.18)

The attractiveness event Au for a document u depends on the click event Cu. Therefore, when calculating the attractiveness probability, we can consider two separate cases: the document u is clicked (Cu = 1) and the document u is not clicked (Cu = 0):

  P(Au = 1 | C) = P(Au = 1 | Cu)
    = I(Cu = 1) · P(Au = 1 | Cu = 1) + I(Cu = 0) · P(Au = 1 | Cu = 0)
    = cu + (1 − cu) · P(Cu = 0 | Au = 1) · P(Au = 1) / P(Cu = 0)
    = cu + (1 − cu) · (1 − γrr′) αuq / (1 − γrr′ αuq).   (4.19)

The first transition holds because Cv and Au are independent for any document v above u (more formally, for all rv < ru, Cv ⊥⊥ Au), and Cw and Au are conditionally independent given Cu for any document w below u (for all rw > ru, Cw ⊥⊥ Au | Cu). The second transition splits P(Au = 1 | Cu) into two non-overlapping cases: Cu = 1 and Cu = 0. For the third transition we observe that if a document is clicked, i.e., Cu = 1, then it is attractive, i.e., Au = 1, which means that P(Au = 1 | Cu = 1) = 1. In the third transition, we also replace the indicator function I(Cu) with the actual value cu and apply Bayes' rule to P(Au = 1 | Cu = 0). Finally, in the last transition, we calculate the probabilities according to the definition of the UBM model, given in (3.9)–(3.14).

Equation 4.19 can be interpreted as follows. If a click is observed for a document, then this document is attractive with probability 1 (first summand). If there is no click on a document, the document is attractive only if it is not examined; the probability of this event is (1 − γrr′) αuq. The overall probability of no click is (1 − γrr′ αuq). Then, the probability of a document being attractive when no click on that document is observed is the fraction of these two quantities: (1 − γrr′) αuq / (1 − γrr′ αuq) (second summand).

By plugging (4.19) into (4.18) we get the final update rule for the attractiveness parameter αuq:

  αuq^(t+1) = (1/|Suq|) ∑_{s∈Suq} [ cu^(s) + (1 − cu^(s)) · (1 − γrr′^(t)) αuq^(t) / (1 − γrr′^(t) αuq^(t)) ].   (4.20)

Notice that in order to calculate the value of αuq in iteration t + 1, the values of parameters calculated in the previous iteration t are used. Notice also that so far we have omitted the superscript (s), but here we show it for completeness.


Examination probability. In the UBM click model the examination probability is defined as follows (see (3.17) in Section 3.5):

  γrr′ = P(Er = 1 | Cr′ = 1, Cr′+1 = 0, . . . , Cr−1 = 0),   (4.21)

where r′ is the rank of the previously clicked document. As we can see, the examination variable Er depends on all the clicks above r. Thus, the parents of Er in the UBM Bayesian network are P(Er) = {C1, . . . , Cr−1}, and the assignment of the parents for which γrr′ is defined is

  p = [0, . . . , 0, 1, 0, . . . , 0],   (4.22)

where the single 1 appears at position r′.

According to (4.16), we need to compute the expected sufficient statistics (ESS) based on P(Er = x, P(Er) = p | C), where x ∈ {0, 1}. This probability can be decomposed as follows:

  P(Er = x, P(Er) = p | C) = P(Er = x | P(Er) = p, C) · P(P(Er) = p | C).   (4.23)

For sessions where a click at rank r′ is observed but there are no clicks between r′ and r, P(Er) = p always holds. We denote this set of sessions as Srr′ = {s : cr′ = 1, cr′+1 = 0, . . . , cr−1 = 0}. Therefore, for sessions in Srr′ the components of the ESS can be simplified as follows:

  P(Er = x | P(Er) = p, C) = P(Er = x | C),   (4.24)
  P(P(Er) = p | C) = 1.   (4.25)

For all other sessions, P(P(Er) = p | C) = 0. So, using (4.16) and the above components of the ESS, we can write the update rule for γrr′ as follows:

  γrr′^(t+1) = (∑_{s∈Srr′} P(Er = 1 | C)) / (∑_{s∈Srr′} 1) = (1/|Srr′|) ∑_{s∈Srr′} P(Er = 1 | C).   (4.26)

Considering the conditional independence Ct ⊥⊥ Er | Cr for t > r, the rest of the estimation for the examination probability is similar to that of the attractiveness probability. In particular, the following holds:

  P(Er = 1 | C) = cu + (1 − cu) · γrr′ (1 − αuq) / (1 − γrr′ αuq).   (4.27)

The interpretation of this equation is similar to that of (4.19). The final update rule for γrr′ can be obtained by combining (4.26) and (4.27) as follows:

  γrr′^(t+1) = (1/|Srr′|) ∑_{s∈Srr′} [ cu^(s) + (1 − cu^(s)) · γrr′^(t) (1 − αuq^(t)) / (1 − γrr′^(t) αuq^(t)) ].   (4.28)
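Putting (4.20) and (4.28) together, one EM iteration for UBM can be sketched as follows. The session representation and the fallback initial values (0.2 and 0.5, following Dupret and Piwowarski [2008]) are our own scaffolding; r′ = −1 encodes "no click so far".

```python
from collections import defaultdict

def ubm_em_iteration(sessions, alpha, gamma):
    """One EM iteration for UBM, eqs. (4.20) and (4.28).

    alpha: {(u, q): P(A_u = 1)};  gamma: {(r, r_prev): examination prob.},
    with r_prev = -1 when no click precedes rank r.  Unseen parameters
    fall back to the initial values 0.2 / 0.5."""
    a_num, a_den = defaultdict(float), defaultdict(int)
    g_num, g_den = defaultdict(float), defaultdict(int)

    for query, docs, clicks in sessions:
        r_prev = -1                      # rank of the previous click
        for r, (u, c) in enumerate(zip(docs, clicks)):
            a = alpha.get((u, query), 0.2)
            g = gamma.get((r, r_prev), 0.5)
            if c == 1:                   # click implies attraction and examination
                p_attr, p_exam = 1.0, 1.0
            else:
                p_attr = (1 - g) * a / (1 - g * a)   # eq. (4.19)
                p_exam = g * (1 - a) / (1 - g * a)   # eq. (4.27)
            a_num[(u, query)] += p_attr
            a_den[(u, query)] += 1
            g_num[(r, r_prev)] += p_exam
            g_den[(r, r_prev)] += 1
            if c == 1:
                r_prev = r

    new_alpha = {uq: a_num[uq] / a_den[uq] for uq in a_den}
    new_gamma = {rr: g_num[rr] / g_den[rr] for rr in g_den}
    return new_alpha, new_gamma
```

Each (r, r_prev) context observed in a session is, by construction, exactly one member of the set Srr′ from (4.28), so the per-context averages implement the update rule directly.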


4.2.4 EM ESTIMATION FOR DBN

The DBN click model (Section 3.8) has two sets of parameters that describe the relevance of a document u given a query q: attractiveness αuq, also known as perceived relevance, and satisfaction σuq, also known as actual relevance. In the simplified version of DBN, i.e., SDBN (Section 3.8.1), the corresponding attractiveness and satisfaction random variables are observed and, therefore, the values of αuq and σuq can be calculated using MLE, as shown in Section 4.1.3. However, the complete DBN model has an additional parameter γ (continuation probability), which determines whether a user continues examining a SERP after they were not satisfied with the current document. In this case, both the attractiveness and satisfaction variables of the DBN model are not observable, and so EM estimation should be used to calculate the values of the corresponding parameters.

Notice that the original work on DBN admitted that the value of γ could be estimated using EM, but treated it as a configurable parameter [Chapelle and Zhang, 2009]. Here, instead, we show the EM estimation for all parameters of DBN. Notice also that Chapelle and Zhang [2009] derived the attractiveness αuq and satisfaction σuq using an alternative parameterization (called α and β in the paper) and the forward-backward propagation algorithm. Here, we derive concise closed-form formulas for the EM updates of αuq and σuq.

Recall that we have to set the parameters of DBN to some initial values in order to launch the EM algorithm. Chapelle and Zhang [2009] used 0.5 as the initial value for both αuq and σuq, while γ was treated as a configurable parameter. Since we estimate γ directly from the data, we may set its initial value to an arbitrary constant, although setting it to a value supposedly close to the optimum (0.9 according to Chapelle and Zhang [2009]) may speed up the estimation.

Attractiveness probability. The attractiveness probability of the DBN click model is defined in a similar way to that of UBM (see (3.35) in Section 3.8):

  αuq = P(Au = 1).   (4.29)

However, unlike the UBM click model, DBN does not explicitly define the examination parameter. According to (3.36)–(3.40), a user examines a document at rank r + 1 (that is, Er+1 = 1) if, and only if, they examine, but are not satisfied with, the document at rank r (Er = 1, Sr = 0). Then, the examination probability of the DBN model can be written recursively as follows:

  ε1 = P(E1 = 1) = 1,
  εr+1 = P(Er+1 = 1)
       = P(Er+1 = 1 | Er = 1) · P(Er = 1)
       = P(Er+1 = 1 | Er = 1) · εr
       = εr · P(Er+1 = 1 | Sr = 0, Er = 1) · P(Sr = 0 | Er = 1)
       = εr γ · ( P(Sr = 0 | Cur = 1, Er = 1) · P(Cur = 1 | Er = 1)
                + P(Sr = 0 | Cur = 0, Er = 1) · P(Cur = 0 | Er = 1) )
       = εr γ ( (1 − σurq) αurq + (1 − αurq) ).   (4.30)

This equation can be interpreted as follows. The examination of a document at rank r + 1 depends on the examination and click at rank r. If there is a click at rank r (with probability αurq), then the user examines rank r + 1 if they are not satisfied with the document ur (with probability 1 − σurq) and decide to continue the search (with probability γ). This explains the first summand. If there is no click at rank r (with probability 1 − αurq), then the user continues examining documents with probability γ (second summand).
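The recursion (4.30) is straightforward to compute rank by rank. A sketch with per-rank parameter lists (the calling convention, with 0-based ranks, is our own):

```python
def dbn_examination(alphas, sigmas, gamma):
    """Examination probabilities for DBN via recursion (4.30):
    eps_1 = 1 and eps_{r+1} = eps_r * gamma * ((1 - sigma_r) * alpha_r
    + (1 - alpha_r)), where alphas[r] / sigmas[r] belong to the
    document at 0-based rank r."""
    eps = [1.0]                          # top document is always examined
    for a, s in zip(alphas, sigmas):
        eps.append(eps[-1] * gamma * ((1 - s) * a + (1 - a)))
    return eps[:len(alphas)]             # eps[r] = P(E at 0-based rank r)
```

Because each factor is at most γ < 1 for an attractive, satisfying result, the examination probability decays down the ranked list, which is the cascade intuition behind DBN.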

Now, similarly to UBM, we can apply (4.16) to derive the update rule for the attractiveness parameter αuq of the DBN model:

  αuq^(t+1) = (1/|Suq|) ∑_{s∈Suq} P(Au = 1 | C),   (4.31)

where Suq is the set of sessions initiated by the query q and containing the document u.

As in UBM, the attractiveness probability P(Au = 1 | C) of the DBN model depends on the click event Cu. For example, if a document u is clicked, i.e., Cu = 1, then it is attractive, i.e., Au = 1. However, unlike UBM, the DBN model assumes that a user examines documents sequentially and, thus, all documents above the last-clicked one are examined, i.e., Er = 1 for all r ≤ l, where l is the rank of the last-clicked document. Due to this, the attractiveness probability P(Au = 1 | C) of the DBN model also depends on clicks after the document u. For example, if there are clicks after u, then, according to DBN, u is examined, i.e., Eru = 1. Moreover, if, in addition, there is no click on u, i.e., Cu = 0, then u is certainly not attractive (Au = 0), because it was examined but not clicked.

Let us consider the vector of random variables describing clicks after the document u and denote it as C>r, where r = ru for simplicity. The scalar C>r, in turn, is a binary variable that equals 1 if a click is observed below u and 0 otherwise. Since Au does not depend on clicks before the document u, the probability of attractiveness in the DBN model can be written as follows:

  P(Au = 1 | C) = P(Au = 1 | Cu, C>r)
    = I(Cu = 1) · P(Au = 1 | Cu = 1, C>r) + I(Cu = 0) · P(Au = 1 | Cu = 0, C>r)
    = cu + (1 − cu) · ( I(C>r = 1) · P(Au = 1 | Cu = 0, C>r = 1, C>r)
                      + I(C>r = 0) · P(Au = 1 | Cu = 0, C>r = 0, C>r) )
    = cu + (1 − cu)(1 − c>r) · P(Cu = 0, C>r = 0 | Au = 1) · P(Au = 1) / P(Cu = 0, C>r = 0).   (4.32)

Here, we first distinguish two cases: Cu = 1 and Cu = 0. If a click on u is observed, i.e., Cu = 1, then u is attractive and other clicks below u do not matter, i.e., P(Au = 1 | Cu = 1, C>r) = 1. In the second transition, we split the equation based on clicks after u: C>r = 1 and C>r = 0. If u is not clicked, i.e., Cu = 0, but there are clicks after u, i.e., C>r = 1, then u is not attractive, because it was examined but not clicked: P(Au = 1 | Cu = 0, C>r = 1, C>r) = 0. Finally, in the last transition, we use the fact that the scalar condition C>r = 0 is equivalent to the vector condition C>r = 0 and apply Bayes' rule to P(Au = 1 | Cu = 0, C>r = 0).

In (4.32), we need to calculate two probabilities: P(Cu = 0, C>r = 0 | Au = 1) and P(Cu = 0, C>r = 0). First, an attractive document u is not clicked only if it is not examined:

  P(Cu = 0, C>r = 0 | Au = 1) = P(Cu = 0, C>r = 0 | Au = 1, Er = 0) · P(Er = 0)
                              = P(Er = 0) = 1 − εr.   (4.33)

Second, the probability P(Cu = 0, C>r = 0) = P(C≥r = 0) can be computed as 1 − P(C≥r = 1), while P(C≥r = 1) can be calculated based on the definition of the DBN model, given in (3.34)–(3.40):

  P(C≥r = 1) = εr · Xr,   (4.34)

where

  Xr = P(C≥r = 1 | Er = 1)
     = P(Cr = 1 | Er = 1) + P(Cr = 0, C≥r+1 = 1 | Er = 1)
     = αurq + P(C≥r+1 = 1 | Cr = 0, Er = 1) · P(Cr = 0 | Er = 1)
     = αurq + P(C≥r+1 = 1 | Er+1 = 1) · P(Er+1 = 1 | Cr = 0, Er = 1) · (1 − αurq)
     = αurq + (1 − αurq) γ Xr+1.   (4.35)

We assume that XN+1 = 0, where N is the size of the SERP (usually 10). Combining (4.31)–(4.35), we get the final update rule for the attractiveness parameter αuq:

  αuq^(t+1) = (1/|Suq|) ∑_{s∈Suq} [ cu^(s) + (1 − cu^(s))(1 − c>r^(s)) · (1 − εr^(t)) αuq^(t) / (1 − εr^(t) Xr^(t)) ],   (4.36)

where εr is given by (4.30) and Xr is given by the recursive formula (4.35).
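Xr from (4.35) is computed backwards from the bottom of the SERP. A sketch (the calling convention, with 0-based ranks, is our own):

```python
def dbn_click_below(alphas, gamma):
    """X_r = P(C_{>=r} = 1 | E_r = 1) via the backward recursion (4.35):
    X_r = alpha_r + (1 - alpha_r) * gamma * X_{r+1}, with the boundary
    condition X_{N+1} = 0."""
    X = [0.0] * (len(alphas) + 1)        # X[N] = 0 boundary condition
    for r in range(len(alphas) - 1, -1, -1):
        X[r] = alphas[r] + (1 - alphas[r]) * gamma * X[r + 1]
    return X[:-1]
```

Together with the examination probabilities from (4.30), these values are all that is needed to evaluate the per-session term of the α update (4.36).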

Satisfaction probability. According to (3.38) in Section 3.8, the satisfaction probability of the DBN model is defined as follows:

  σuq = P(Su = 1 | Cu = 1).   (4.37)

That is, σuq is defined only for clicked documents. In other words, the parent of Su is P(Su) = Cu and the assignment of P(Su) for which σuq is defined is p = [1].


According to (4.16), we need to compute the expected sufficient statistics based on P(Su = x, Cu = 1 | C), where x ∈ {0, 1}. Similarly to the UBM examination probability (Section 4.2.3), this probability simplifies to P(Su = x | C) for sessions where a click on the document u is observed. We will denote these sessions as S′uq = {s : s ∈ Suq, cu^(s) = 1}.

Then, according to (4.16), the update rule for σuq can be written as follows:

  σuq^(t+1) = (∑_{s∈S′uq} P(Su = 1 | C)) / (∑_{s∈S′uq} 1) = (1/|S′uq|) ∑_{s∈S′uq} P(Su = 1 | C).   (4.38)

Given the fact that cu = 1 for all sessions s ∈ S′uq, the satisfaction probability σuq does not depend on clicks above the document u. If a click is observed below u, i.e., C>ru = 1, then the user is not satisfied with u, i.e., Su = 0. Otherwise, there are no clicks after the document u. In other words, the probability of satisfaction P(Su = 1 | C) depends on Cu and on C>ru (the vector of observed clicks after rank ru). So P(Su = 1 | C) can be written as follows:

  P(Su = 1 | C) = P(Su = 1 | Cu = 1, C>r)
    = I(C>r = 1) · P(Su = 1 | Cu = 1, C>r = 1, C>r)
    + I(C>r = 0) · P(Su = 1 | Cu = 1, C>r = 0, C>r)
    = (1 − c>r) · P(Su = 1 | Cu = 1, C>r = 0)
    = (1 − c>r) · P(C>r = 0 | Su = 1, Cu = 1) · P(Su = 1 | Cu = 1) / P(C>r = 0 | Cu = 1)
    = (1 − c>r) · σuq / P(C>r = 0 | Cu = 1).   (4.39)

In the second transition, we consider two cases: C>r = 1 and C>r = 0. In the third transition, we note that P(Su = 1 | C>r = 1) = 0, because, according to the DBN model, a user cannot be satisfied with a document u if they click on other documents below u. Then we also apply Bayes' rule to P(Su = 1 | Cu = 1, C>r = 0). Finally, in the last transition, we note that P(C>r = 0 | Su = 1, Cu = 1) = 1, because if a user is satisfied with u then they do not examine other documents. We also replace the probability of satisfaction P(Su = 1 | Cu = 1) with the corresponding parameter σuq.

In (4.39), we still need to calculate the probability of no click after the document u, i.e., P(C>r = 0 | Cu = 1) = P(C≥r+1 = 0 | Cu = 1). It can be calculated as 1 − P(C≥r+1 = 1 | Cu = 1), where the last term is related to X given by (4.35):

  P(C≥r+1 = 1 | Cu = 1) = P(C≥r+1 = 1 | Er+1 = 1) · P(Er+1 = 1 | Cu = 1)
                        = Xr+1 · (1 − σurq) γ.   (4.40)


Then, the final update rule for σuq can be written as follows:

  σuq^(t+1) = (1/|S′uq|) ∑_{s∈S′uq} (1 − c>r^(s)) · σuq^(t) / (1 − (1 − σurq^(t)) γ^(t) Xr+1^(t)),   (4.41)

where X is given by (4.35).
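The per-session term inside (4.41) has a simple closed form. A sketch (the helper name and scalar calling convention are our own, for illustration):

```python
def dbn_satisfaction_posterior(click_below, sigma, gamma, x_next):
    """P(S_u = 1 | C) for a clicked document u, following (4.39)-(4.41):
    zero if any click is observed below u, and otherwise
    sigma / (1 - (1 - sigma) * gamma * X_{r+1})."""
    if click_below:
        return 0.0                       # a later click rules out satisfaction
    return sigma / (1 - (1 - sigma) * gamma * x_next)
```

Note that if σ is initialized to 0, the returned posterior is 0 regardless of the data, which is the initialization pitfall mentioned in Section 4.2.1.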

Continuation probability. The continuation probability γ in the DBN model is defined as follows (see (3.40) in Section 3.8):

  γ = P(Er+1 = 1 | Er = 1, Sur = 0),   (4.42)

where none of Er+1, Er and Sur is observed. Here, the parents of Er+1 are P(Er+1) = {Er, Sur} and their assignment for which γ is defined is p = [1, 0].

According to (4.16), we need to compute the expected sufficient statistics ESS(z) = ∑_{s∈S} ∑_r P(Er+1 = z, Er = 1, Sur = 0 | C), where z ∈ {0, 1}. There is no concise closed-form formula for this statistic. In order to compute it, we introduce an additional factor:

  Φ(x, y, z) = P(Er = x, Sur = y, Er+1 = z, C) / P(C<r).   (4.43)

Once we have computed this factor, we can compute the ESS as follows:

  ESS(z) = ∑_{s∈S} ∑_r P(Er+1 = z, Er = 1, Sur = 0 | C)
         = ∑_{s∈S} ∑_r P(Er+1 = z, Er = 1, Sur = 0, C) / P(C)
         = ∑_{s∈S} ∑_r P(Er+1 = z, Er = 1, Sur = 0, C)
                       / ( ∑_x ∑_y ∑_z P(Er+1 = z, Er = x, Sur = y, C) )
         = ∑_{s∈S} ∑_r ( P(Er+1 = z, Er = 1, Sur = 0, C) · (1 / P(C<r)) )
                       / ( ∑_x ∑_y ∑_z P(Er+1 = z, Er = x, Sur = y, C) · (1 / P(C<r)) )
         = ∑_{s∈S} ∑_r Φ^(s)(1, 0, z) / ( ∑_x ∑_y ∑_z Φ^(s)(x, y, z) ).   (4.44)

We compute the factor Φ(x, y, z) using the chain rule for probabilities and the cascade nature of the model, which guarantees that events at ranks greater than or equal to r are conditionally independent of the clicks C<r above r, given the examination Er:

  Φ(x, y, z) = P(Er = x, Cr = cr, Sur = y, Er+1 = z, C≥r+1 | C<r)
             = P(Er = x | C<r) · P(Cr = cr | Er = x) · P(Sr = y | Cr = cr)
               · P(Er+1 = z | Er = x, Sr = y, Cr = cr) · P(C≥r+1 | Er+1 = z).   (4.45)

The first multiplier is computed as follows:

  P(Er = x | C<r) = { P(Er = 1 | C<r)      if x = 1,
                    { 1 − P(Er = 1 | C<r)  if x = 0,   (4.46)

where P(Er = 1 | C<r) is given by the recursive formula (3.48) and is presented in Table 3.1. The last multiplier is computed as follows:

  P(C≥r+1 | Er+1 = z) = { P(C≥r+1 | Er+1 = 1)  if z = 1,
                        { 1                    if z = 0 and C≥r+1 = 0,
                        { 0                    otherwise.   (4.47)

If z = 0, this multiplier equals 1 if there are no clicks after r (C≥r+1 = 0) and 0 otherwise. If, in turn, z = 1, then P(C≥r+1 | Er+1 = 1) is computed using the chain rule of probabilities and the model formulas (3.34)–(3.40).

The other multipliers in (4.45), namely P(Cr = cr | Er = x), P(Sr = y | Cr = cr) and P(Er+1 = z | Er = x, Sr = y, Cr = cr), are computed directly from the model formulas (3.34)–(3.40).

4.3 FORMULAS FOR CLICK MODEL PARAMETERS

In this chapter we have presented the two main methods used to estimate the parameters of click models, namely maximum likelihood estimation and expectation-maximization. We have also shown how these methods can be used to calculate the parameters of the models that were presented in Chapter 3. Table 4.1 summarizes the MLE and EM formulas derived throughout this chapter. The table shows the abbreviation of each click model, the estimation algorithm that should be used (MLE or EM), the list of parameters and the corresponding MLE formulas or EM update rules.

Recall that cu^(s) denotes an actual click on a document u in a search session s, where cu^(s) = 1 if there was a click on u and 0 otherwise. Recall also that the EM update rules use values of parameters that were calculated during iteration t (e.g., γ^(t)) to compute new values during iteration t + 1 (e.g., γ^(t+1)). In Table 4.1 we use r instead of ru to simplify the notation. For other notation used in Table 4.1, please refer to Table 2.1 in Chapter 2.

Table 4.1: Formulas for calculating parameters of click models. For each model the table shows its abbreviation, parameter estimation algorithm (MLE/EM), list of parameters and corresponding MLE formulas or EM update rules.

RCM (MLE), parameter ρ:
    ρ = (1 / Σ_{s∈S} |s|) Σ_{s∈S} Σ_{u∈s} c(s)u

RCTR (MLE), parameter ρr:
    ρr = (1/|S|) Σ_{s∈S} c(s)r

DCTR (MLE), parameter ρuq:
    ρuq = (1/|Suq|) Σ_{s∈Suq} c(s)u,   where Suq = {sq : u ∈ sq}

PBM (EM), parameters αuq, γr:
    α(t+1)uq = (1/|Suq|) Σ_{s∈Suq} [ c(s)u + (1 − c(s)u) (1 − γ(t)r) α(t)uq / (1 − γ(t)r α(t)uq) ],
        where Suq = {sq : u ∈ sq}
    γ(t+1)r = (1/|S|) Σ_{s∈S} [ c(s)u + (1 − c(s)u) (1 − α(t)uq) γ(t)r / (1 − γ(t)r α(t)uq) ]

CM (MLE), parameter αuq:
    αuq = (1/|Suq|) Σ_{s∈Suq} c(s)u,
        where Suq = {sq : u ∈ sq, ru ≤ l}, l is the rank of the first-clicked document

UBM (EM), parameters αuq, γrr′:
    α(t+1)uq = (1/|Suq|) Σ_{s∈Suq} [ c(s)u + (1 − c(s)u) (1 − γ(t)rr′) α(t)uq / (1 − γ(t)rr′ α(t)uq) ],
        where Suq = {sq : u ∈ sq}
    γ(t+1)rr′ = (1/|Srr′|) Σ_{s∈Srr′} [ c(s)u + (1 − c(s)u) (1 − α(t)uq) γ(t)rr′ / (1 − γ(t)rr′ α(t)uq) ],
        where Srr′ = {s : c(s)r′ = 1, c(s)r′+1 = 0, …, c(s)r−1 = 0}

SDCM (MLE), parameters αuq, λr:
    αuq = (1/|Suq|) Σ_{s∈Suq} c(s)u,
        where Suq = {sq : u ∈ sq, ru ≤ l}, l is the rank of the last-clicked document
    λr = (1/|Sr|) Σ_{s∈Sr} I(r ≠ l),   where Sr = {s : c(s)ur = 1}

CCM (EM), parameters αuq, τ1, τ2, τ3:
    α(t+1)uq = 1/(|Suq| + |S′uq|) Σ_{s∈Suq} [ c(s)u + (1 − c(s)u)(1 − c(s)>r) (1 − ε(t)r) α(t)uq / (1 − ε(t)r Y(t)r) ]
             + 1/(|Suq| + |S′uq|) Σ_{s∈S′uq} (1 − c(s)>r) α(t)uq / [ 1 − (τ2 (1 − α(t)urq) + τ3 α(t)urq) Y(t)r+1 ],
        where Suq = {sq : u ∈ sq}, S′uq = {s : s ∈ Suq, c(s)u = 1},
        ε(t)r+1 = ε(t)r ( (1 − α(t)urq) τ1 + α(t)urq (τ2 (1 − α(t)urq) + τ3 α(t)urq) ),
        Y(t)r = P(C≥r = 1 | Er = 1) = α(t)urq + (1 − α(t)urq) τ(t)1 Y(t)r+1,   Y(t)N+1 = 0,
        c(s)>r = 1 if in the session s a click is observed after rank r, and 0 otherwise
    τ(t+1)1 = ESS(t)(1) / (ESS(t)(1) + ESS(t)(0)),
        where ESS(z) = Σ_{s∈S} Σ_{r : c(s)r = 0} Σ_y Φ(s)(1, y, z) / (Σ_x Σ_y Σ_z Φ(s)(x, y, z))
    τ(t+1)2 = ESS(t)(1) / (ESS(t)(1) + ESS(t)(0)),
        where ESS(z) = Σ_{s∈S} Σ_{r : c(s)r = 1} Φ(s)(1, 0, z) / (Σ_x Σ_y Σ_z Φ(s)(x, y, z))
    τ(t+1)3 = ESS(t)(1) / (ESS(t)(1) + ESS(t)(0)),
        where ESS(z) = Σ_{s∈S} Σ_{r : c(s)r = 1} Φ(s)(1, 1, z) / (Σ_x Σ_y Σ_z Φ(s)(x, y, z)),
        and Φ is defined by (4.43) and computed using (4.45)

DBN (EM), parameters αuq, σuq, γ:
    α(t+1)uq = (1/|Suq|) Σ_{s∈Suq} [ c(s)u + (1 − c(s)u)(1 − c(s)>r) (1 − ε(t)r) α(t)uq / (1 − ε(t)r X(t)r) ],
        where Suq = {sq : u ∈ sq},
        ε(t)r+1 = ε(t)r γ(t) ( (1 − σ(t)urq) α(t)urq + (1 − α(t)urq) ),
        X(t)r = P(C≥r = 1 | Er = 1) = α(t)urq + (1 − α(t)urq) γ(t) X(t)r+1,   X(t)N+1 = 0,
        c(s)>r = 1 if in the session s a click is observed after rank r, and 0 otherwise
    σ(t+1)uq = (1/|S′uq|) Σ_{s∈S′uq} (1 − c(s)>r) σ(t)uq / ( 1 − (1 − σ(t)urq) γ(t) X(t)r+1 ),
        where S′uq = {s : s ∈ Suq, c(s)u = 1} and X is defined above
    γ(t+1) = ESS(t)(1) / (ESS(t)(1) + ESS(t)(0)),
        where ESS(z) = Σ_{s∈S} Σ_r Φ(s)(1, 0, z) / (Σ_x Σ_y Σ_z Φ(s)(x, y, z)),
        and Φ is defined by (4.43) and computed using (4.45)

SDBN (MLE), parameters αuq, σuq:
    αuq = (1/|Suq|) Σ_{s∈Suq} c(s)u,
        where Suq = {sq : u ∈ sq, ru ≤ l}, l is the rank of the last-clicked document
    σuq = (1/|S′uq|) Σ_{s∈S′uq} I(r(s)u = l),
        where S′uq = {sq : u ∈ sq, ru ≤ l, c(s)u = 1}
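To make the update rules concrete, the following sketch performs one EM iteration for PBM along the lines of Table 4.1. The session representation (a query, a document list and a click list) and all names are our own simplifications, not the PyClick API:

```python
from collections import defaultdict

def pbm_em_step(sessions, alpha, gamma):
    """One EM iteration for PBM. A session is a (query, docs, clicks)
    triple; alpha[(u, q)] holds the current attractiveness estimate and
    gamma[r] the examination probability of rank r."""
    alpha_num = defaultdict(float); alpha_den = defaultdict(float)
    gamma_num = defaultdict(float); gamma_den = defaultdict(float)
    for query, docs, clicks in sessions:
        for r, (u, c) in enumerate(zip(docs, clicks)):
            a, g = alpha[(u, query)], gamma[r]
            denom = 1 - g * a
            # no click: the document may still be attractive but unexamined
            alpha_num[(u, query)] += c + (1 - c) * (1 - g) * a / denom
            alpha_den[(u, query)] += 1
            # no click: the rank may still be examined, document unattractive
            gamma_num[r] += c + (1 - c) * (1 - a) * g / denom
            gamma_den[r] += 1
    return ({k: alpha_num[k] / alpha_den[k] for k in alpha_num},
            {r: gamma_num[r] / gamma_den[r] for r in gamma_num})
```

Iterating this step until the parameters stabilize gives the EM estimates; the MLE formulas for RCM, RCTR and DCTR in Table 4.1 need no iteration and reduce to simple click averages.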


4.4 ALTERNATIVE ESTIMATION METHODS

As an alternative to estimation based on MLE or EM, a Bayesian approach is sometimes used. Two notable examples are the so-called Bayesian browsing model (BBM) [Liu et al., 2009] and the hierarchical models (H-UBM, H-CCM, H-DBN) of Zhang et al. [2010]. Both approaches make additional assumptions about prior distributions for model parameters and are reported to run faster than the EM estimation and to outperform EM-trained models, especially for queries that rarely occur in the training set.

The BBM model by Liu et al. [2009] is essentially a UBM model (Section 3.5), where the attractiveness αuq is assumed to have a uniform prior distribution. The γrr′ parameters can then be computed in a closed form by maximizing the likelihood (which in this case does not depend on αuq). In turn, the αuq parameters are not estimated directly, but their posterior distribution is derived instead. This allows for better flexibility, as one may either use this posterior distribution or use expected values for each αuq.

The hierarchical models by Zhang et al. [2010] assume that all model parameters have a normal prior distribution. With each session added to a training set, the posterior distribution is computed and approximated with another normal distribution (a process referred to as Gaussian density filtering). This approximated distribution replaces the parameter distribution estimated at a previous step and then the process continues. The authors also suggest a modification to their estimation process called deep probit Bayesian inference, which adds feature-based smoothing based on, e.g., BM25 query-document matching scores. This allows for more accurate estimations in cases where the data is too sparse.

Shen et al. [2012] suggest to adopt the ideas of collaborative filtering for the click modeling problem. While the main focus of the authors is on user personalization (see Section 8.3), their first idea of matrix factorization already leads to a substantial improvement of the model quality as measured by the log-likelihood (Section 5.1). The matrix factorization click model (MFCM) that they propose introduces Gaussian priors for the model parameters and adopts a different estimation process based on matrix factorization. The performance improvement comes from the fact that previously unseen documents now get better estimates based on other elements in the decomposed parameter matrix.

Now that we have introduced the basic click models as well as estimation methods for working with them, we are almost ready to evaluate those models. Next, in Chapter 5, we introduce different evaluation methods for click models, and in Chapter 6 we list available datasets before turning to experiments in Chapter 7.


C H A P T E R 5

Evaluation

Evaluation lies at the core of any scientific method and click models are no exception. We want to be sure that by introducing a more complex model we do, in fact, obtain better results. In this chapter we discuss different ways to evaluate click models. We start with traditional statistical evaluation approaches, such as log-likelihood and perplexity, and then discuss more application-oriented evaluation methods like click-through rate prediction, and evaluation based on NDCG and intuitiveness.

5.1 LOG-LIKELIHOOD

Whenever we have a statistical model, we can evaluate its accuracy by looking at the likelihood of some held-out test set (see, e.g., [Fisher, 1922]). For each session in the test set S we compute how likely this session is according to our click model:

    L(s) = PM(C1 = c(s)1, …, Cn = c(s)n),                          (5.1)

where PM is the probability measure induced by the click model M. If we further assume independence of sessions, we can compute the logarithm of the joint likelihood:

    LL(M) = Σ_{s∈S} log PM(C1 = c(s)1, …, Cn = c(s)n).             (5.2)

This metric is known as log-likelihood and is usually computed using the chain rule of probability:

    LL(M) = Σ_{s∈S} Σ_{r=1..n} log PM(Cr = c(s)r | C<r = c(s)<r).  (5.3)

As a logarithm of a probability measure, this metric always has non-positive values, with higher values representing better prediction quality. The log-likelihood of a perfect prediction equals 0.
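As a minimal illustration, for the random click model (Section 3.1) all clicks are independent Bernoulli draws with parameter ρ, so (5.3) reduces to a sum of Bernoulli log-probabilities. The function name and session representation below are ours:

```python
import math

def rcm_log_likelihood(sessions, rho):
    """LL(M) for the random click model: every document is clicked
    with the same probability rho, so the chain-rule terms in (5.3)
    do not depend on the previous clicks and (5.3) becomes a sum of
    Bernoulli log-probabilities over all observed clicks and skips."""
    ll = 0.0
    for clicks in sessions:  # each session is a list of 0/1 click outcomes
        for c in clicks:
            ll += math.log(rho if c == 1 else 1 - rho)
    return ll
```

For a richer model, the only change is that each term becomes the model's conditional click probability at that rank given the preceding clicks.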

5.2 PERPLEXITY

Craswell et al. [2008] proposed to use the notion of cross-entropy from information theory [Murphy, 2013] as a metric for click models. This metric was not easy to interpret and it did not become widely used. Instead, Dupret and Piwowarski [2008] proposed to use the conceptually similar notion of perplexity as a metric for assessing click models:

    pr(M) = 2^( −(1/|S|) Σ_{s∈S} ( c(s)r log2 q(s)r + (1 − c(s)r) log2(1 − q(s)r) ) ),    (5.4)


where q(s)r is the probability of a user clicking the document at rank r in the session s as predicted by the model M, i.e., q(s)r = PM(Cr = 1 | q, u). Note that while computing this probability only the query q and the vector of documents u are used, and not the clicks in the session s. Usually, perplexity is reported for each rank r independently, but it is also often averaged across ranks to obtain an overall measure of model quality:

    p(M) = (1/n) Σ_{r=1..n} pr(M).    (5.5)

Notice that the perplexity of the ideal prediction is 1. Indeed, an ideal model would predict the click probability qr to be always equal to the actual click cr (zero or one). On the other hand, the perplexity of the simple random click model with probability ρ = 0.5 (Section 3.1) is equal to 2. Hence, the perplexity of a realistic model should lie between 1 and 2.[1]
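A direct implementation of (5.4) can be sketched as follows (names are ours; the predicted probabilities must lie strictly between 0 and 1 for the logarithms to be defined):

```python
import math

def perplexity_at_rank(clicks, probs):
    """p_r(M) per (5.4): `clicks` holds the observed binary outcomes
    c_r at one rank over all sessions, `probs` the model's predicted
    click probabilities q_r for the same sessions."""
    avg_ll = sum(c * math.log2(q) + (1 - c) * math.log2(1 - q)
                 for c, q in zip(clicks, probs)) / len(clicks)
    return 2 ** (-avg_ll)
```

For the random click model with ρ = 0.5 every term equals log2(0.5) = −1, so the function returns exactly 2, matching the upper bound discussed above.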

Notice, also, that better models have lower values of perplexity. When we compare perplexity scores of two models A and B, we follow Guo et al. [2009a] and compute the perplexity gain of A over B as follows:

    gain(A, B) = (pB − pA) / (pB − 1).    (5.6)

The perplexity is typically higher for top documents and decreases toward the bottom of a SERP.[2] As we will see later, predicting a click is generally much harder than predicting its absence and, since top documents typically get more clicks, they are harder for a click model to get right.

5.2.1 CONDITIONAL PERPLEXITY

One of the building blocks of perplexity is the click probability qr at rank r. In the above definition we defined it as the full probability P(Cr = 1). This probability is independent of the clicks in the current session s and can be tricky to compute, as it involves marginalizing over all possible assignments of the previous clicks C<r.

Another possibility for defining perplexity is to use the conditional click probability, given the previous clicks in the session. We introduce another version of perplexity, which we call conditional perplexity:

    pr(M) = 2^( −(1/|S|) Σ_{s∈S} ( c(s)r log2 q(s)r + (1 − c(s)r) log2(1 − q(s)r) ) ),    (5.7)

where q(s)r is the conditional probability:

    q(s)r = PM(Cr = 1 | C1 = c(s)1, …, Cr−1 = c(s)r−1).    (5.8)

Unlike the original version of perplexity (5.4), which evaluates a model's ability to predict clicks without any input, the conditional perplexity is just a decomposition of the likelihood (Section 5.1) and also factors in a model's ability to exploit clicks above the current rank. Since the conditional perplexity uses the beginning of a session c(s)<r to compute the probability q(s)r, it typically yields lower values than the regular perplexity, especially for models like UBM (Section 3.5) that rely on the accuracy of previous clicks. It is important to keep this in mind when comparing results from different publications that use different versions of perplexity.

[1] One can also obtain tighter upper bounds using the RCTR model (Section 3.2.1) [Dupret and Piwowarski, 2008].
[2] As was shown in [Chuklin et al., 2013d], this does not hold beyond the first SERP. In particular, when a search interface provides pagination buttons to switch to a second or further SERP (i.e., retrieve the next ranked group of documents), the perplexity no longer decreases with rank.

5.2.2 DETAILED PERPLEXITY ANALYSIS

Clicks and skips. Dupret and Piwowarski [2008] suggest that one should look separately at the perplexity of predicting clicks (cr = 1) and the perplexity of predicting "non-clicks" or skips (cr = 0). This perspective reveals some general trends about click models and helps to better compare them to each other. The authors show that predicting skips is usually much easier than predicting clicks. This is due to the fact that the majority of documents are not clicked and even a simple random click model with a small value of the parameter ρ can bring perplexity close to 1 for skips. This breakdown has also been used in [Chuklin et al., 2013d] to show how two different variations of a model outperform one another on either clicks or skips.

Query frequency stratification. Guo et al. [2009a,b] analyze perplexity values for different query frequency buckets. Naturally, user behavior can be modeled more accurately for frequent queries, i.e., queries that appear many times in a training set. Following this evaluation approach, one may be interested to determine whether some frequency buckets are particularly hard to model. This method also allows one to compare how different click models handle the sparsity of data.

5.3 CLICK-THROUGH RATE PREDICTION

Click-through rate (or CTR for short) is the ratio of the cases when a particular document was clicked to the cases when it was shown. CTRs are often used to compare ranking methods in A/B testing (see, e.g., [Kohavi et al., 2008]) and, even more importantly, when evaluating sponsored search. Predicting clicks in sponsored search [Richardson et al., 2007] has an immense influence on revenue and is often done using click models (see, e.g., [Ashkan and Clarke, 2012, Zhu et al., 2010]). Below we discuss two ways of evaluating the predictive power of a click model in terms of CTR.

Removing position bias. Chapelle and Zhang [2009] evaluate CTR prediction using the assumption that there is no position effect on the first position. In other words, if there is a model that predicts the probability of a click given examination, P(Cu = 1 | Eu = 1), that value would be equal to the click-through rate P(Cu = 1) if u is on the first position (ru = 1). This means that by placing the document on top, we remove the position bias. To do that, Chapelle and Zhang suggest the following procedure:

1. Consider a document u appearing on both the first position and some other position for the same query (but in different sessions).


2. Use sessions where u appears on the first position as a test set and the rest of the sessions as a training set.

3. Perform parameter estimation using the training set.

4. Estimate P(Cu = 1 | Eu = 1) from the training set and compare these values to the CTR values obtained from the test set.

To compare CTR values, two metrics have been suggested: Mean Square Error (MSE) and Kullback-Leibler (KL) divergence [Kullback and Leibler, 1951].

While this method directly evaluates CTR, it uses a very skewed set of queries. First, it only looks at previously seen query-document pairs and does not evaluate the generalization ability of the model. Second, for each document it requires sessions where the document appears on the first position and sessions where the document appears on other positions. While this naturally happens in major search engine systems due to user experiments and personalization effects (see, e.g., [Radlinski and Craswell, 2013]), it is not guaranteed to be the case for other search systems. Moreover, these high-variability queries represent a special set of queries that is not representative of the whole distribution.

Direct CTR estimation. Zhu et al. [2010] suggest an alternative method where CTR is estimated by simulating user clicks multiple times and comparing simulated CTR values to the real ones.[3]

CTR is computed for each triple (u, q, s) consisting of a document u, a query q and a session s that was initiated by the query q and contains the document u. While the simulated CTR is a real value computed by repeatedly generating clicks given a query and a SERP, the real CTR is either equal to 1 if the document was clicked or equal to 0 otherwise. To avoid comparing real-valued simulated CTRs to binary-valued actual CTRs, Zhu et al. [2010] suggest the following method. First, they sort (u, q, s) triples by the predicted CTR and group them into buckets of similar predicted CTR values (1,000 triples in each bucket). Then, they compute the average predicted CTR xi for triples in the i-th bucket as well as the average real CTR yi for the same triples. The real CTR in this case equals the number of clicked triples divided by 1,000. Finally, they report the coefficient of determination R2 between the vectors of simulated and real CTRs x and y.
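The bucketing step can be sketched as follows (the function name and the configurable bucket size are ours; Zhu et al. [2010] use buckets of 1,000 triples):

```python
def ctr_buckets(predicted, clicked, bucket_size=1000):
    """Sort (predicted CTR, click) pairs by predicted CTR and average
    both quantities per bucket; R^2 between the two returned vectors
    is then reported, as in the scheme of Zhu et al. [2010]."""
    pairs = sorted(zip(predicted, clicked))
    xs, ys = [], []  # average predicted and average real CTR per bucket
    for i in range(0, len(pairs), bucket_size):
        chunk = pairs[i:i + bucket_size]
        xs.append(sum(p for p, _ in chunk) / len(chunk))
        ys.append(sum(c for _, c in chunk) / len(chunk))
    return xs, ys
```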

A similar method was used by Xing et al. [2013], where the ranks of the first and last clicks were used instead of CTRs. This method does not require any bucketing, as click ranks can be computed for every session. As a measure, Xing et al. use mean absolute error (MAE), which is the average absolute difference between the real and predicted first/last click ranks.

The direct CTR estimation method discussed above removes limitations of the previous method by Chapelle and Zhang [2009]. In particular, it takes into account all documents, even the ones that never appear on the first position.

[3] Simulating clicks is also discussed below in Section 9.1.


5.4 RELEVANCE PREDICTION EVALUATION

Click models are often used to predict relevance either directly or as a feature in a machine-learned ranking system. If a model operates with the notion of satisfaction (e.g., DCM, CCM and DBN),[4] the relevance of a document u with respect to a query q can be defined as follows:

    Ruq = P(Su = 1 | Eu = 1) = P(Su = 1 | Cu = 1) · P(Cu = 1 | Eu = 1).    (5.9)

In the case of DBN (Section 3.8) this reduces to Ruq = σuq αuq. In the original DBN paper [Chapelle and Zhang, 2009] this notion of click model-based relevance has first been used verbatim as a ranking function and then tested as a feature in a machine learning-based ranking. The latter has also been used in [Dupret and Liao, 2010]. While the latter represents the more realistic situation of modern search engines (see, e.g., [Liu, 2009]), the former is preferable for reusable evaluation, since it depends neither on ranking features nor on the learning algorithm used. In both cases, a document ranking is evaluated using standard IR evaluation metrics such as DCG, NDCG [Jarvelin and Kekalainen, 2002] or AUC when only binary relevance labels are available.
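Under (5.9), ranking documents by DBN-estimated relevance amounts to sorting by the product αuq · σuq. A sketch with our own naming and parameter layout:

```python
def rank_by_dbn_relevance(docs, alpha, sigma, query):
    """Order candidate documents for a query by R_uq = alpha_uq * sigma_uq,
    the DBN instantiation of (5.9), with the most relevant document first.
    alpha[(u, q)] and sigma[(u, q)] are trained DBN parameters."""
    return sorted(docs,
                  key=lambda u: alpha[(u, query)] * sigma[(u, query)],
                  reverse=True)
```

Note that a highly attractive document (large αuq) can still be ranked low if users who click it are rarely satisfied (small σuq).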

One issue with this approach is that when the underlying assumptions of an evaluation metric reflect those of a model, the model is going to get a high score, while this may not necessarily mean real quality improvements for the end user. As we will see in Section 9.4, one can build an evaluation metric from a user model and often the opposite is possible. In the latter case the model will receive high scores from the metric, but those might be artificial. To address this issue, Wang et al. [2015] suggest to compare metrics by showing two SERPs side by side to raters, where each SERP is ranked according to a particular click model. This evaluation method requires more rater resources, but allows one to circumvent the potential issues of IR evaluation metrics.

5.5 INTUITIVENESS EVALUATION FOR AGGREGATED SEARCH CLICK MODELS

Chuklin et al. [2014] propose an evaluation method for click models for aggregated search. Aggregated search (also referred to as federated search) is a type of search where results are retrieved from different vertical search engines (such as News, Sports or Video search engines) and then presented in a special way on a SERP (see, e.g., [Arguello et al., 2009, 2012]). See Section 8.1 for more details.

Recently, a number of click models for aggregated search have been developed [Chen et al., 2012a, Chuklin et al., 2013b, Wang et al., 2013a]. While all the traditional evaluation methods are applicable to these vertical-aware models, it is sometimes important to look at the model performance on a more fine-grained level.

Zhou et al. [2013] explicitly name four components of aggregated search:

1. Vertical selection: which vertical search engines are relevant to a given query?

[4] Even models that do not employ the notion of satisfaction can be useful in machine learning-based settings, where model parameters are used as ranking features.


2. Item selection: which items should be selected from a given vertical?

3. Result presentation: how should results from different verticals be presented on a SERP?

4. Vertical diversity: how diverse should the set of verticals be?

Even if an aggregated system A is better than an aggregated system B overall, it may be important to know whether A loses to B in one of the above aspects. To evaluate this, a simple base metric was suggested for each aspect.

Chuklin et al. [2014] adapt this idea to evaluate click models for aggregated search. Similar to the above, four dimensions are used to show how well a particular click model can capture different aggregated search components. Multiple pairs of aggregated search systems are considered and the cases of disagreement where a click model assigns more clicks to the "worse" system (according to one of the four dimensions listed above) are counted. This is then repeated for different dimensions and different click models, and the click models are compared to each other by counting the amount of disagreement.

Now that we have discussed different evaluation methods for click models, we need only one more ingredient before we can evaluate the basic click models introduced in Chapter 3, viz. datasets and libraries, which are the topic of the next chapter.


C H A P T E R 6

Data and Tools

To estimate parameters and evaluate the performance of click models, researchers use click logs, i.e., logs of user search sessions with click-through information (see page 6). Such logs are produced by live search systems and contain highly sensitive information in terms of privacy and commercial value. For this reason, publicly releasing such data while respecting privacy and the interests of the search engine owner is very challenging and requires a substantial amount of work. Still, substantial click logs have been made available to academic researchers. In this chapter we discuss such publicly available click logs. We also describe software packages and libraries that we have found useful for working with click models.

6.1 DATASETS

One of the first publicly released datasets was the AOL query log, released in 2006. It was a comprehensive dataset containing twenty million search sessions for over 650,000 users over a three-month period. The data was not redacted for privacy, which led the company to withdraw the dataset just a couple of days after its release.[1] It is one of the few datasets that contain actual queries and document URLs, which makes it valuable in spite of the fact that it represents a different generation of web search users interacting with a representative of a different generation of search engines. One of the drawbacks of this dataset is that it does not contain all documents shown to a user, only those that were clicked.

More recent releases include click logs from Microsoft[2] and Yandex,[3] published as part of the Web Search Click Data (WSCD) workshop series in 2009, 2012, 2013, and 2014. The WSCD 2009 dataset, also known as MSN 2006,[4] was collected from the MSN search engine in 2006 during a period of one month. The dataset contains query strings, grouped into sessions,[5] the number of returned results, and clicked URLs. The dataset cannot be downloaded directly, but is provided upon request.

The WSCD 2012 dataset[6] consists of user search sessions extracted from Yandex logs around 2009. The dataset contains anonymized queries, URL rankings, clicks and relevance judgments for ranked URLs. In addition, queries are grouped into search sessions. We use this dataset for our experiments in Chapter 7.

[1] The dataset is still available for researchers through several mirrors.
[2] http://research.microsoft.com
[3] http://yandex.com, the biggest search engine in Russia with a significant presence in some other countries.
[4] http://research.microsoft.com/en-us/um/people/nickcr/wscd09/
[5] Note that the notion of search session used in this dataset differs from our notion of query session (see Chapter 2).
[6] http://imat-relpred.yandex.ru/en/datasets


Table 6.1: Summary of publicly available click log datasets.

Dataset        Queries      URLs         Users      Sessions     Other information
AOL 2006       10,154,742   1,632,788    657,426    21,011,340   Only clicked documents
MSN 2006       8,831,280    4,975,897    –          7,470,915    Only clicked documents
SogouQ 2012*   8,939,569    15,095,269   9,739,704  25,530,711   Only clicked documents
WSCD 2012      30,717,251   117,093,258  –          146,278,823  Judgements (71,930)
WSCD 2013      10,139,547   49,029,185   956,536    17,784,583   Search engine switches
WSCD 2014      21,073,569   70,348,426   5,736,333  65,172,853   Query terms, URL domains

* For SogouQ the number of unique query-user pairs is reported instead of the session count; the actual number of sessions may be higher.

The WSCD 2013 dataset[7] [Serdyukov et al., 2013] was extracted from Yandex logs around 2011. In addition to the information provided by WSCD 2012, this dataset contains anonymized user ids and search engine switching actions (i.e., indicators that a user switched from Yandex to other search engines during a search session). The WSCD 2013 dataset does not contain relevance judgements.

The WSCD 2014 dataset[8] [Serdyukov et al., 2014], collected around 2012 over a period of one month, contains personalized rankings for more than five million users. Compared to WSCD 2012, this dataset also provides query terms and domains of ranked URLs, but does not provide relevance judgements.

Sogou[9] also released two click log datasets, called SogouQ.[10] The first dataset was released in 2008; in 2012 it was replaced by a bigger dataset. Similarly to the WSCD datasets, SogouQ contains anonymized user ids, queries, URL rankings and clicks. Unlike the Yandex datasets, however, query strings and document URLs are not obfuscated and provided verbatim in the click log. This allows researchers to perform query similarity analysis, document analysis and other applications that are not possible with numeric ids. The downside of this dataset is that it only provides information about clicked documents, so the exact set of documents shown to a user can only be approximated.

The statistics of publicly available click logs are given in Table 6.1. Although these logs largely facilitate research on click models for web search, they are limited to the standard result presentation paradigm (known as "ten blue links") and to searches performed on desktop computers. As we show in Chapter 8, current developments of click models go beyond these limitations and consider user search intents, results of vertical search engines (e.g., Image or News search engines) and even other types of user interaction, such as eye gaze and mouse movements. Unfortunately, at the time of writing there are still no publicly available datasets suitable for working with these advanced click models.

[7] http://switchdetect.yandex.ru/en
[8] https://www.kaggle.com/c/yandex-personalized-web-search-challenge
[9] http://www.sogou.com, one of the biggest search engines in China.
[10] http://www.sogou.com/labs/dl/q-e.html; as of 2015 the files are encoded using the GB18030 character set, not UTF-8.


In this context, it is worth mentioning the ad datasets published by Criteo.[11] The 2014[12] and 2015[13] releases contain ads served by Criteo (represented using 39 integer features) and click events. The Criteo datasets are not directly suitable for training and testing click models for web search, but could be used to experiment with domain-specific models (e.g., click models for ads [Yin et al., 2014]).

6.2 SOFTWARE TOOLS AND OPEN SOURCE PROJECTS

A number of software tools and libraries are suitable for working with click models: from dedicated projects like ClickModels[14] and PyClick,[15] through more general tools like Infer.NET[16] and Lerot,[17] to general-purpose languages such as Octave[18] or Matlab.[19]

The ClickModels project, written in Python, provides an open source implementation of state-of-the-art click models, namely the dynamic Bayesian network model (DBN) [Chapelle and Zhang, 2009] (simplified and full versions), the dependent click model (DCM) [Guo et al., 2009b] and the user browsing model (UBM) [Dupret and Piwowarski, 2008]. It also implements the exploration bias extension of UBM according to Chen et al. [2012a]. A special feature of this project is the implementation of intent-aware extensions of the above models according to Chuklin et al. [2013b]. Overall, the ClickModels project is suitable for training and testing standard click models for web search as well as advanced models that consider user search intents.

The PyClick open source project, also written in Python, aims to provide an extendable implementation of existing click models and to simplify the development of new models. PyClick implements all basic click models discussed in Chapter 3. It mainly uses the standard parameter estimation techniques described in Chapter 4, i.e., maximum likelihood estimation (MLE) and expectation-maximization (EM). In addition, it implements two alternative parameter estimation techniques, namely probit Bayesian inference (PBI) [Zhang et al., 2010] and the Bayesian browsing model (BBM) [Liu et al., 2009]. PyClick can also be used as a basis for the rapid development of new click models. We use PyClick to perform the experiments in Chapter 7.

Infer.NET, created and maintained by Microsoft Research, is a framework for running Bayesian inference in graphical models. It is fast, scalable and well supported. Infer.NET provides algorithms to solve various machine learning problems and, in particular, can be used to train click models. The advantage of this framework is that it can be used for a wide range of tasks beyond click models. On the other hand, Infer.NET is limited to non-commercial use and does not provide out-of-the-box implementations of existing click models.

[11] http://www.criteo.com
[12] https://www.kaggle.com/c/criteo-display-ad-challenge
[13] http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset
[14] https://github.com/varepsilon/clickmodels
[15] https://github.com/markovi/PyClick
[16] http://research.microsoft.com/en-us/um/cambridge/projects/infernet
[17] https://bitbucket.org/ilps/lerot
[18] https://www.gnu.org/software/octave
[19] http://mathworks.com/products/matlab

Page 58: Click Models for Web Search...goal of this survey is to bring together current efforts in the area, summarize the research performed so far and give a holistic view on existing click

6. DATA AND TOOLS

The Lerot project [Schuth et al., 2013], implemented in Python, is designed to run experiments with online learning to rank methods. Lerot uses click models to simulate user clicks and, for this purpose, implements some basic models, such as the random click model (RCM), the position-based model (PBM) and the cascade model (CM) (see Chapter 3). It is also integrated with the ClickModels project, so that a wider range of click models can be used for simulation within Lerot.

Finally, general-purpose languages, such as Octave and Matlab, provide a number of ready-to-use inference and evaluation methods, which can be used to implement click models. These languages are highly flexible, but of course do not provide any specific support for click models.

In summary, existing software tools and open source projects implement many of the click models proposed in the literature so far and support the development of new models. When working with click models, researchers and practitioners have access to a solid set of tools, ranging from highly specific projects with many implemented models but relatively low flexibility to general-purpose languages with high flexibility but no out-of-the-box implementations.



CHAPTER 7

Experimental Comparison

Existing studies of click models usually evaluate only a subset of available click models, often use proprietary data, rely on different evaluation metrics and rarely publish the source code used to produce the results. This makes it hard to compare the reported results between different studies. In this chapter, we present a comprehensive evaluation of the click models described in Chapter 3 using a publicly available dataset (see Chapter 6), an open-source implementation (again, see Chapter 6) and a set of commonly used evaluation metrics (see Chapter 5).

7.1 EXPERIMENTAL SETUP

We use the WSCD 2012 dataset discussed in Section 6.1 (see Table 6.1 for more statistics). More specifically, we use the first one million sessions of the dataset to train and test the basic click models.¹ We train models on the first 75% of the sessions and test them on the last 25%. This is done to simulate the real-world application where the model is trained using historical data and applied to unseen future sessions. Since sessions in the dataset are grouped by user tasks, we are not fully guaranteed to have a strict chronological ordering between training and test material, but this is the closest we can get. Since the basic click models introduced in Chapter 3 cannot handle unseen query-document pairs, we filter the test set to contain only queries that appear in the training set. The statistics of the training and test sets are given in Table 7.1.
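The split and filtering procedure above can be sketched as follows (a minimal illustration, assuming sessions are stored in log order as (query, results) tuples; the representation is a simplification):

```python
def split_and_filter(sessions, train_fraction=0.75):
    """Chronological train/test split; the test set keeps only queries
    that also appear in the training set."""
    cutoff = int(len(sessions) * train_fraction)
    train = sessions[:cutoff]
    train_queries = {query for query, _ in train}
    test = [s for s in sessions[cutoff:] if s[0] in train_queries]
    return train, test

sessions = [("q1", []), ("q2", []), ("q1", []), ("q3", [])]
train, test = split_and_filter(sessions)
# train holds the first three sessions; the unseen-query q3 session is dropped
```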

In order to verify that the results hold for other subsets of the data and to report confidence intervals for our results, we repeat our experiment 15 times, each time selecting the next million sessions from the dataset and creating the same training-test split as described above. We then report the 95% confidence intervals using the bootstrap method [Efron and Tibshirani, 1994].
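The percentile bootstrap used here can be sketched in a few lines (a minimal illustration; the resample count and seed are arbitrary choices, and the input numbers are made up):

```python
import random

def bootstrap_ci(values, num_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap: resample with replacement and take the
    empirical alpha/2 and 1 - alpha/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(num_resamples)
    )
    lo = means[int(num_resamples * alpha / 2)]
    hi = means[int(num_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# e.g., per-split log-likelihood values from the repeated experiments
lo, hi = bootstrap_ci([-0.26, -0.25, -0.27, -0.26, -0.25])
```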

All click models are implemented in Python within the PyClick library described in Section 6.2. The expectation-maximization (EM) algorithm uses 50 iterations. The DCM model is implemented in its simplified form (SDCM) according to the original study [Guo et al., 2009b] (see Section 4.1.2). To measure the quality of click models, we use log-likelihood and perplexity, both the standard and conditional versions, which are the most commonly used evaluation metrics for click models. In addition, we report the time it took us to train each model using a single CPU core² and the PyPy interpreter.³

1. We also experimented with ten million sessions and the results were qualitatively similar to what we present here.
2. Intel® Xeon® CPU E5-2650 0 @ 2.00 GHz, 20 MB L2 cache.
3. http://pypy.org.


Table 7.1: Statistics of the training and test sets.

Set        # Sessions   # Unique Queries
Training   750,000      322,799
Test       139,009       23,465

Table 7.2: Log-likelihood, perplexity, conditional perplexity and training time of basic click models for web search as calculated on the first one million sessions of the WSCD 2012 dataset. The best values for each metric are indicated in boldface.

Model   Log-likelihood   Perplexity   Cond. perplexity   Time (sec)
RCM        −0.3727        1.5325        1.5325               2.37
RCTR       −0.3017        1.3730        1.3730               2.45
DCTR       −0.3082        1.3713        1.3713               9.39
PBM        −0.2757        1.3323        1.3323              77.95
CM         −∞             1.3675        +∞                  12.17
UBM        −0.2568        1.3321        1.3093             113.53
SDCM       −0.2974        1.3315        1.3588              15.53
CCM        −0.2807        1.3406        1.3412           2,993.03
DBN        −0.2680        1.3311        1.3217           1,661.16
SDBN       −0.2940        1.3270        1.3538              17.42

7.2 RESULTS AND DISCUSSION

The results of our experimental comparison are shown in Table 7.2. The table reports log-likelihood, perplexity, conditional perplexity and training time. The best values for each evaluation metric are indicated in boldface. Below we discuss the results for each metric.

Log-likelihood. The log-likelihood metric shows how well a model approximates the observed data. In our case, it shows how well a click model approximates observed user clicks in a set of query sessions (see Section 5.1). Table 7.2 and Figure 7.1 report the log-likelihood values for all basic click models. Notice that the cascade model (CM) cannot handle query sessions with more than one click and gives zero probabilities to all clicks below the first one. For such sessions, the log-likelihood of CM is log 0 = −∞ and, thus, the total log-likelihood of CM is −∞. Similarly, the value of conditional perplexity for CM is 2^(−log 0) = +∞. Notice that we could consider only sessions with one click to evaluate the CM model. However, this would make the CM log-likelihood values incomparable to those of other click models.
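The behavior described above follows directly from how a session's log-likelihood is computed; a simplified illustration (not PyClick code):

```python
import math

def session_log_likelihood(click_probs, clicks):
    """Log-likelihood of the observed clicks and skips under a model's
    per-position click probabilities P(C_r = 1)."""
    ll = 0.0
    for p, c in zip(click_probs, clicks):
        p_observed = p if c else 1.0 - p
        ll += math.log(p_observed) if p_observed > 0 else float("-inf")
    return ll

# A cascade-style model assigns probability 0 to any click below the first
# one, so a session with two clicks drives the log-likelihood to -infinity:
cm_probs = [0.6, 0.0, 0.0]
print(session_log_likelihood(cm_probs, [1, 1, 0]))  # -inf
```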

Figure 7.1: Log-likelihood values for different models; higher is better. Error bars correspond to the 95% bootstrap confidence intervals. The cascade model (CM) is excluded because it cannot handle multiple clicks in a session.

Since most click models have the same set of attractiveness parameters that depend on queries and documents (apart from RCM and DCTR), the main difference between the models is the way they treat examination parameters (e.g., the number of examination parameters and the estimation technique).⁴ Thus, it is natural to expect that models with more examination parameters, which are estimated in connection to other parameters (i.e., using the EM algorithm), should better approximate observed data.

The results in Table 7.2 and Figure 7.1 generally confirm this intuition. UBM, having the largest number of examination parameters, which are estimated using EM, appears to be the best model in terms of approximating user clicks based on previous clicks and skips. It is followed by DBN (one examination parameter, a set of satisfaction parameters, EM algorithm) and PBM (ten examination parameters, EM algorithm). Other models have noticeably lower log-likelihood, with the RCM baseline being significantly worse than the others, as expected.

Notice that DBN outperforms PBM and SDBN outperforms SDCM (although not significantly), respectively, having fewer examination parameters and using the same estimation technique. This is due to the fact that DBN and SDBN have a large set of satisfaction parameters that also affect examination. Overall, the above results show that complex models are able to describe user clicks better than the simple CTR models, which confirms the usefulness of these models.

4. Notice that DBN and SDBN differ from other models also by considering satisfaction parameters that depend on queries and documents.

Figure 7.2: Perplexity values for different models; lower is better. Error bars correspond to the 95% bootstrap confidence intervals.

As expected, simple models, namely RCTR, DCTR and RCM, have the lowest log-likelihood. Notice, however, that RCTR and DCTR do not differ much from SDCM and SDBN, while being much simpler and sometimes much faster (RCTR). Still, the downside of these simple CTR-based models is that they do not explain user behavior on a SERP, as opposed to SDCM and SDBN, which explicitly define examination and click behavior and can, therefore, be used in a number of applications of click models (see Chapter 9).

Perplexity. Perplexity shows how well a click model can predict user clicks in a query session when previous clicks and skips in that session are not known (the lower the better). Table 7.2 shows that this version of perplexity (defined in Section 5.2) does not directly correlate with log-likelihood. Therefore, we believe that the standard perplexity should be preferred over the conditional perplexity for the task of click model evaluation, because it gives a different perspective compared to log-likelihood.
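For reference, the perplexity at a single rank can be computed from predicted click probabilities as follows (a minimal sketch; the probabilities here are predicted without conditioning on earlier clicks in the session):

```python
import math

def perplexity_at_rank(probs, clicks):
    """Perplexity at one rank: 2 to the minus average log2-probability of
    the observed click/skip outcomes. 1.0 is a perfect predictor; 2.0
    matches random guessing."""
    total = 0.0
    for q, c in zip(probs, clicks):
        p_observed = q if c else 1.0 - q
        total += math.log2(p_observed)
    return 2 ** (-total / len(clicks))

# A model that always predicts 0.5 has perplexity 2 whatever the outcomes:
print(perplexity_at_rank([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 2.0
```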


Figure 7.3: Perplexity values for different models and at different ranks; lower is better. Error bars correspond to the 95% bootstrap confidence intervals.

When ranking click models based on the perplexity values (see Table 7.2 and Figures 7.2 and 7.3), the best model is SDBN, followed by DBN, SDCM, UBM and PBM, respectively. The results show that complex click models perform better than CTR-based models not only in terms of log-likelihood but also in terms of perplexity.

By looking at perplexity values for different ranks, as shown in Figure 7.3, one can see that, apart from the simple CTR-based models and the cascade model, all the models show similar perplexity results, with the biggest difference observed at rank 1.

Conditional perplexity. As discussed in Section 5.2.1, conditional perplexity can be seen as a per-rank decomposition of the likelihood. Indeed, Table 7.2 and Figure 7.4 show that the conditional perplexity produces the same ranking of click models as log-likelihood (apart from RCTR and DCTR, whose ranks are swapped). This means that conditional perplexity does not add much additional information when used together with log-likelihood for click model evaluation.

Figure 7.4: Conditional perplexity values for different models; lower is better. Error bars correspond to the 95% bootstrap confidence intervals. The cascade model (CM) is excluded because it cannot handle multiple clicks in a session.

By looking at a per-rank decomposition (Figure 7.5) one can see a more non-trivial pattern. First, UBM is a clear leader starting from rank 4, suggesting that knowing the distance to the last click is quite useful for examination prediction (cf. Section 3.5). DBN also shows quite good results for higher ranks, much better than SDBN, although not as good as UBM. Surprisingly, predicting a click on the first position is easier for SDBN than for the more complex DBN model. This suggests that the DBN assumptions work either for higher ranks or for lower ranks (with different γ parameter values), but cannot quite explain user behavior as a whole.

Training time. The amount of time required to train a click model (Figure 7.6) directly relates to the estimation method used. Maximum likelihood estimation (MLE) iterates through training sessions only once and, thus, is much faster than EM, which uses a large number of iterations. Click models estimated using MLE require from two to 18 seconds to train on 750,000 query sessions, with the fastest models being RCM and RCTR. The EM-based models, instead, require from one to 50 minutes to be trained on the same number of sessions, with the slowest model being CCM. In fact, CCM is over 60% slower than DBN, which, in turn, is over ten times slower than any other model studied here.


Figure 7.5: Conditional perplexity values for different models and at different ranks; lower is better. Error bars correspond to the 95% bootstrap confidence intervals. The cascade model (CM) is excluded because it cannot handle multiple clicks in a session.

One of the things to mention here is that the choice of PyPy as a Python interpreter was crucial. Just this simple switch reduced the training time of DBN from 39 hours to 28 minutes; similar dramatic drops were observed for all other models.

7.3 SUMMARY

In this chapter we have provided a comparison of the basic click models introduced in Chapter 3, using two core evaluation metrics, namely log-likelihood and perplexity. While we include all the basic models in our analysis, there are many dimensions that we did not look at, such as prediction of clicks vs. prediction of their absence, query frequency and query entropy analysis, to name a few. For those additional dimensions, we refer the reader to a complementary study by Grotov et al. [2015].

Overall, complex click models, namely PBM, UBM, SDCM, CCM, DBN and SDBN, outperform CTR-based models and the cascade model both in terms of log-likelihood and perplexity. Also, the models whose parameters are estimated using the EM algorithm (PBM, UBM, CCM and DBN) tend to outperform those estimated using the MLE technique (SDCM and SDBN), because in the former case model parameters are estimated in connection to each other. However, such models take much longer to train.

Figure 7.6: Training time for different click models; lower is better.

When comparing pairs of more complex and simplified models (DBN/SDBN and CCM/SDCM) one can see that the simplified one yields better perplexity values but worse conditional perplexity and likelihood scores. We believe that these simplifications, together with the MLE algorithm, can find more robust estimates for the model parameters, but cannot fully capture the relation between past and future behavior in the same session.

There is no clear winner among the best performing models, but UBM tends to have the highest log-likelihood and almost the lowest perplexity. A potential drawback of UBM, however, is that the interpretation of its examination parameters is not straightforward (see Section 3.5). UBM is closely followed by DBN and PBM, which sometimes have slightly worse performance, but their parameters are easier to interpret.


CHAPTER 8

Advanced Click Models

In this chapter we summarize more recent click models that somehow improve over the basic click models studied in Chapter 3. Most of these models build on one of the basic click models by introducing new parameters, using more data and, more generally, incorporating additional knowledge about user behavior.

We start by describing click models for aggregated search in Section 8.1. In Section 8.2 we discuss models that aim to predict user behavior that spans multiple query sessions. Section 8.3 is devoted to models that exploit user diversity at different granularity levels. In Section 8.4 we talk about models that use additional signals such as mouse or eye movements, while in Section 8.5 we talk about models that make use of relevance labels. In Section 8.6 we discuss models that capture non-linearity in user examination behavior. And finally, in Section 8.7, we briefly discuss models that make use of feature-based machine learning in addition to a probabilistic modeling framework.

8.1 AGGREGATED SEARCH

In this section we discuss click models for aggregated search. As explained in Section 5.5, aggregated search (also known as federated search) is a search paradigm where a SERP aggregates results from multiple sources known as verticals (e.g., News, Image or Video verticals). All major web search engines have adopted this mode of presentation by adding blocks of vertical results to their SERPs; see Figure 8.1 for an example.

8.1.1 FEDERATED CLICK MODEL (FCM)

It is evident from Figure 8.1 that vertical blocks on an aggregated SERP can be more visually salient and attract more user attention than other results on the SERP. Moreover, Chen et al. [2012a] show that vertical blocks also affect the amount of attention that nearby non-vertical documents get. For example, if there is an image vertical placed at rank 4, then the chance of the document at rank 3 being clicked is more than twice as high as normal. To account for this, Chen et al. introduce a parameter called "attention bias" A, which is a binary variable.¹ If A equals 1, then documents near vertical blocks have a higher chance of being examined:

P(A = 1) = χc    (8.1)
P(Er = 1 | A = 0, E<r, S<r, C<r) = εr    (8.2)
P(Er = 1 | A = 1, E<r, S<r, C<r) = εr + (1 − εr)βdist,    (8.3)

1. Not to be confused with the attractiveness variable Au, which always has a subscript.


Figure 8.1: An example of a search engine result page featuring organic results as well as results from Image and Video verticals for the query "Amsterdam."

where the parameter β is indexed by the distance from the current document u to the nearest vertical block vert, i.e., dist = ru − rvert (dist can be negative); εr is the examination probability in the absence of vertical attention bias. The parameter χc can either depend on the position of a vertical block (i.e., χrvert) or on the content of the block (i.e., χvert).

Notice that through the parameter εr different click models can be extended to a vertical-aware model (called FCM, for federated click model, in [Chen et al., 2012a]). For example, UBM (Section 3.5) can be extended to FCM-UBM as follows:

P(A = 1) = χc    (8.4)
P(Er = 1 | A = 0, C<r) = γrr′    (8.5)
P(Er = 1 | A = 1, C<r) = γrr′ + (1 − γrr′)βdist.    (8.6)

The FCM model has been extended to take into account the fact that some users may have a stronger bias to a particular vertical and tend to skip organic web results [Chen et al., 2012a]. Later, it has also been extended to account for multiple verticals [Chuklin et al., 2014, 2015].
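Marginalizing equations (8.1)-(8.3) over the attention-bias variable A gives the examination probability of a document near a vertical block; a minimal sketch with illustrative parameter values:

```python
def fcm_examination_prob(eps_r, chi, beta_dist):
    """Marginal examination probability under FCM: with probability chi the
    attention-bias variable A is 1, which lifts examination from eps_r to
    eps_r + (1 - eps_r) * beta_dist."""
    boosted = eps_r + (1.0 - eps_r) * beta_dist
    return (1.0 - chi) * eps_r + chi * boosted

# A document near a vertical (beta_dist = 0.5), baseline examination 0.4,
# attention-bias probability chi = 0.3:
p = fcm_examination_prob(0.4, 0.3, 0.5)  # ~0.49, higher than the 0.4 baseline
```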

8.1.2 VERTICAL CLICK MODEL (VCM)

After performing a thorough analysis of different peculiarities related to verticals, Wang et al. [2013a] suggest a complex model that is based on UBM (Section 3.5) and incorporates four types of bias:

1. Attraction bias. If there is a vertical on a SERP, there is a certain probability that a user starts examining the SERP from this vertical.


2. Global bias. If the user starts examining the SERP from the vertical result, this affects their further behavior (both the examination and attractiveness probabilities change).

3. First place bias. If there is a vertical on the first position, it affects the further behavior of the user. This is a special case of the global bias.

4. Sequence bias. If the user starts examining the SERP from the vertical block, they may continue examining documents from top to bottom, or first examine documents above the vertical in reversed order and only then proceed in the normal top-to-bottom order.

Wang et al. perform a detailed eye-tracking and click log analysis to study these biases and show that a click model incorporating all of them outperforms UBM in both perplexity and log-likelihood. The detailed description of the model can be found in the original paper [Wang et al., 2013a].

8.1.3 INTENT-AWARE CLICK MODELS

Chuklin et al. [2013b] suggest to look at aggregated search as if there are different users coming with different intents (needs): organic web, Image, News, Video, etc. If we assume that these intents are independent, then the click probability decomposes as follows:

P(C1, . . . , Cn) = Σi P(I = i) · P(C1, . . . , Cn | I = i),    (8.7)

where P(I = i) is the probability of the i-th intent. One may then use different examination and click probabilities for different intents, assuming that the intent distribution P(I = i) is known for each query.
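Equation (8.7) can be sketched directly (the intent distribution and per-intent click probabilities below are illustrative numbers):

```python
def intent_aware_click_prob(intent_probs, click_probs_given_intent):
    """Equation (8.7): marginalize the probability of an observed click
    pattern over a known per-query intent distribution P(I = i)."""
    return sum(
        p_i * p_c for p_i, p_c in zip(intent_probs, click_probs_given_intent)
    )

# Two intents (say, organic web vs. Image), P(I) = (0.8, 0.2), and the
# per-intent probabilities of the observed clicks:
p = intent_aware_click_prob([0.8, 0.2], [0.1, 0.4])  # ~0.16
```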

The authors suggest to go one step further and also take visual aspects into account, hypothesizing that the use of special presentation formats, e.g., News results enhanced with a thumbnail, will lead to different examination patterns than if these results are presented as organic web results.

For example, the intent-aware version of UBM (Section 3.5), UBM-IA, can be written as follows:

P(Er = 1 | Gr = b, I = i, C<r) = γrr′(b, i)    (8.8)
P(Cr = 1 | I = i, Er = 1) = α(i)urq,    (8.9)

where Gr is the presentation type of a document at rank r (e.g., standard snippet, thumbnail-enhanced news, etc.). In the simplest case both the I and Gr random variables take two possible values: organic web or vertical.

One nice bonus of the intent-aware extension is that it can be applied on top of any model, including the FCM model discussed above [Chen et al., 2012a]. The resulting model outperforms the original FCM model in terms of perplexity [Chuklin et al., 2013b].

8.2 BEYOND A SINGLE QUERY SESSION


8.2.1 PAGINATION

Modern SERPs usually contain roughly ten results, but a user can request more results by clicking on pagination buttons. As reported in [Chuklin et al., 2013d], each session has a click on a pagination button with a probability of 5–10%. The new set of documents available after the pagination action obviously should not be treated as a new session, but it cannot be viewed as a simple continuation of the previous page either. Chuklin et al. [2013d] show that by explicitly adding pagination buttons as a clickable SERP object into a click model one can substantially improve over a baseline model, which, in this case, works as if a user would see an infinite list of documents at once without the need to click on pagination buttons. Two models based on DBN (Section 3.8) are proposed, in which the click and examination probabilities of the pagination button may or may not depend on a query. The authors show that both extensions outperform the baseline DBN model after the first result page, with one or the other model giving better perplexity at different rank positions.

8.2.2 TASK-CENTRIC CLICK MODEL (TCM)

Zhang et al. [2011] present a general notion of a search task, where two consecutive query sessions do not necessarily share the same query (as opposed to the case of pagination discussed in the previous section) but rather represent the same task from the user's perspective, such as, e.g., buying a concert ticket.² After performing a qualitative analysis, the authors formulate two assumptions about sessions in a task:

1. Query bias. If a query does not match the user's information need, the results are not clicked and the query is reformulated based on the presented results to better match the information need.

2. Duplicate bias. If the same document appears again in the same task, it has a lower probability of being clicked.

Zhang et al. formalize this within the task-centric click model (TCM):

P(M = 1) = τ1    (8.10)
P(N = 1 | M = 0) = 1    (8.11)
P(N = 1 | M = 1) = τ2    (8.12)
P(Hr = 1 | Hs′r′ = 0, Es′r′ = 0) = 0    (8.13)
P(Fr = 1 | Hr = 0) = 1    (8.14)
P(Fr = 1 | Hr = 1) = τ3    (8.15)
P(Er = 1 | E<r, S<r, C<r) = εr    (8.16)
P(Ar = 1) = αurq    (8.17)
Cr = 1 ⇔ M = 1, Fr = 1, Er = 1 and Ar = 1.    (8.18)

2. In practice, users may switch between several tasks while browsing (a phenomenon called multitasking). In that case, sessions that belong to the same task do not have to be consecutive.

The notation should be read as follows:


• M : whether a session matches a user need;

• N : whether a user is going to submit another query in the same task;

• Hr: whether the document ur occurred in some earlier session; if so, its previous occurrence was in a session s′ at rank r′;

• Fr: whether the document ur is considered “fresh”;

• Er: examination (see Chapter 2);

• Ar: attractiveness (see Chapter 2);

• Cr: click (see Chapter 2).

The TCM model can be built on top of any basic click model (Chapter 3). This basic click model determines the parameters αuq and εr. Zhang et al. show that the TCM models built on top of DBN and UBM outperform the corresponding basic models in terms of perplexity (Section 5.2) and relevance prediction (Section 5.4).
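Under the conjunction in (8.18), and assuming the four factors are independent, the click probability is a product of their marginals; a minimal sketch (all parameter values are illustrative):

```python
def tcm_click_prob(tau_1, tau_3, p_duplicate, eps_r, alpha_uq):
    """Click probability under TCM's conjunction (8.18), treating the four
    factors as independent. p_duplicate is P(H_r = 1): the probability that
    the document occurred earlier in the task."""
    p_fresh = (1.0 - p_duplicate) + p_duplicate * tau_3  # (8.14)-(8.15)
    return tau_1 * p_fresh * eps_r * alpha_uq

# A certain duplicate (p_duplicate = 1) is clicked tau_3 times less often
# than a certainly fresh document, all else being equal:
fresh = tcm_click_prob(0.9, 0.5, 0.0, 0.7, 0.6)
dup = tcm_click_prob(0.9, 0.5, 1.0, 0.7, 0.6)
# dup is roughly 0.5 * fresh
```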

8.3 USER AND QUERY DIVERSITY

It is well-known that there is a certain variability between different users and different sessions [White and Drucker, 2007]. One user may be very persistent and click many results, while another may just abandon a SERP without spending much effort. Over time, different approaches to capture user diversity have been developed: from considering session variability and query variability to directly incorporating users into click models. In the next sections we describe the corresponding click models.

8.3.1 SESSION VARIABILITY

Hu et al. [2011] are the first to formally introduce user diversity in a click model. They start with the idea that different users come to a search box with different notions of relevance, and some users may decide not to click a document even if it looks attractive overall. For this purpose they introduce a session-dependent parameter μs that decreases the attractiveness probability:

P(Au = 1) = P(Cu = 1 | Eu = 1) = μs αuq.    (8.19)

The authors build corresponding click models based on UBM and DBN (calling them Unbiased-UBM and Unbiased-DBN, respectively), each one outperforming the corresponding basic model in terms of log-likelihood (Section 5.1) and relevance prediction (Section 5.4).


8.3.2 QUERY VARIABILITY

Varying persistence (DBN-VP). Ashkan and Clarke [2012] suggest to make model parameters query-dependent, which is especially useful in the context of commercial search, where different queries have different levels of commercial intent and hence different user persistence in clicking links. They apply this idea to the DBN model (Section 3.8) by modifying (3.40) and replacing it with the following:

P (Er = 1 | Er−1 = 1, Sr−1 = 0) = γq, (8.20)

where the user perseverance γq now depends on a query. They call this model DBN-VP (varying persistence) and show that it outperforms the baseline DBN for commercial search in terms of log-likelihood and perplexity.

A similar but simpler idea is used by Guo et al. [2009a]. They suggest to train the CCM model (Section 3.7) for navigational and informational queries separately, which results in substantially different persistence parameters τi for different query classes.

Intent-aware models. The idea of query-dependent parameters is similar to the intent-aware models by Chuklin et al. [2013b] considered above. Instead of having model parameters directly depend on a query, Chuklin et al. assume a smaller set of user intents. The intents do not have to correspond to verticals as discussed above. Instead, they can be used to differentiate users with different goals. For example, we can introduce the following intents: "searching definition," "fact checking" or "entertainment." These intents will have different perseverance parameters in the case of intent-aware DBN (the DBN-intents model in [Chuklin et al., 2013b]):

P (Er = 1 | I = i, Er−1 = 1, Sr−1 = 0) = γ(i). (8.21)

Each query will then have an intent distribution P(I = i), and the final click probability can be obtained by marginalizing over this distribution.

8.3.3 USER VARIABILITY

Matrix factorization click model (MFCM). Shen et al. [2012] are the first to formally incorporate users into click models. In order to do so, they say that the attractiveness parameter αuqv = P(Au = 1) has a Gaussian prior that depends on the document matrix Uu, the query matrix Qq and the user matrix Vv:

P(Au = 1) = αuqv    (8.22)
P(αuqv | Uu, Qq, Vv, σ) ∼ N(Uu ◦ Qq ◦ Vv, σ²)    (8.23)
P(Uu | σu) ∼ N(0, σu²)    (8.24)
P(Qq | σq) ∼ N(0, σq²)    (8.25)
P(Vv | σv) ∼ N(0, σv²),    (8.26)

where Uu ◦ Qq ◦ Vv = Σf Ufu Qfq Vfv is a tensor factorization. While this personalized click model (PCM) directly adds a user component to the model, it may have overfitting issues for cases where user differences are not important, for instance, navigational queries. In order to solve this problem, the authors suggest a so-called hybrid personalized click model (HPCM) where a non-personalized component is added to the personalized one:

P(αuqv | Uu, Qq, Vv, σ) ∼ N((Ũu ◦ Q̃q + Uu ◦ Qq ◦ Vv), σ²).    (8.27)

Each matrix above has its own Gaussian prior similar to the earlier case of PCM.

Experiments show that the HPCM model (based on UBM) consistently outperforms all other models in terms of both log-likelihood and perplexity and hence represents a good trade-off between the fully personalized and the regular click model that treats users homogeneously.
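The factorized prior means in (8.23) and (8.27) can be sketched with the latent dimension f made explicit (pure Python, purely illustrative numbers; the function names are our own):

```python
def factorization_mean(U_u, Q_q, V_v):
    """Mean of the Gaussian prior in (8.23): the three-way inner product
    sum_f U[f][u] * Q[f][q] * V[f][v], with the per-entity factor vectors
    already sliced out."""
    return sum(u * q * v for u, q, v in zip(U_u, Q_q, V_v))

def hybrid_mean(U_tilde_u, Q_tilde_q, U_u, Q_q, V_v):
    """Mean of the HPCM prior (8.27): a non-personalized two-way term plus
    the personalized three-way term."""
    non_personalized = sum(u * q for u, q in zip(U_tilde_u, Q_tilde_q))
    return non_personalized + factorization_mean(U_u, Q_q, V_v)

# Two latent factors per entity; all numbers are illustrative:
mu = hybrid_mean([0.5, 0.1], [0.4, 0.2], [0.3, 0.1], [0.2, 0.3], [1.0, 0.5])
```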

Incorporating users. Xing et al. [2013] take a more direct approach and simply add a user-based parameter into a click model:

P (Au = 1) = αuqεv, (8.28)

where v denotes a user.

The above model, called UBM-user and based on UBM (Section 3.5), already gives a small improvement over the baseline model in terms of perplexity and CTR prediction. A few other modifications are suggested, two of which perform exceptionally well, with over 10% perplexity gain (5.6) on the test set; both modifications are based on UBM.

The first of these two modifications is a logistic model (LOG-rd-user), which modifies the examination and attractiveness probabilities separately:

P(Er = 1 | C<r) = σ(γrr′ + εv)    (8.29)
P(Ar = 1) = σ(αurq + ε′v),    (8.30)

where εv and ε′v are the two user-dependent parameters and σ(x) = 1/(1 + exp(−x)) is a logistic function (sigmoid). Here, the α and γ parameters do not have to lie between 0 and 1 as this is guaranteed by the logistic transformation.
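A minimal sketch of these logistic probabilities; the function names and example parameter values are ours, not from Xing et al.:

```python
import math

def sigmoid(x):
    # Logistic transformation: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def p_examination(gamma_rr, eps_user):
    # (8.29): rank/distance effect gamma_rr' plus a user-dependent offset
    return sigmoid(gamma_rr + eps_user)

def p_attractiveness(alpha_urq, eps_user):
    # (8.30): per-document logit plus a user-dependent offset
    return sigmoid(alpha_urq + eps_user)
```

Because the sigmoid is applied last, the raw α, γ and ε parameters may be any real numbers, which simplifies gradient-based training.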

Another modification, called the dilution model (DIL-user), is specifically designed to mitigate the risk of overfitting. This is done by simplifying the UBM examination probability parameter γrr′ and splitting it into the term that depends on the rank r and the term that depends on the distance to the previous click r − r′:

P(Er = 1 | C<r) = (βv)^(r−1) · (λv)^k · (µv)^(r−r′)    (8.31)
P(Ar = 1) = αurq εv,    (8.32)

where k is the total number of clicks before r and βv, λv and µv are the three user-dependent damping factors, which control how the examination probability diminishes with increasing rank, increasing number of previous clicks and the distance to the previous click, respectively.
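The examination probability (8.31) is a product of three decaying factors; a sketch, with function and argument names of our choosing:

```python
def p_examination_dilution(rank, prev_click_rank, n_clicks_before, beta, lam, mu):
    """Examination probability of (8.31): user-dependent damping factors for
    rank (beta), number of previous clicks (lam) and distance to the previous
    click (mu); all three are assumed to lie in (0, 1]."""
    return (beta ** (rank - 1)) \
        * (lam ** n_clicks_before) \
        * (mu ** (rank - prev_click_rank))
```

With all factors below one, examination decays multiplicatively as the user moves down the list and away from the last click, which is what keeps the parameter count small compared to a full γrr′ grid.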


8.4 EYE TRACKING AND MOUSE MOVEMENTS

Since the work of Huang et al. [2011] it has become clear that mouse movement data can be collected at scale. With almost no additional resources, exact coordinates of a mouse cursor can be stored in a search log on top of the click events data. As has been shown in multiple experiments [Chen et al., 2001, Rodden et al., 2008], mouse movements can serve as a robust proxy to result examination and can help to estimate the examination probability.

Mousing and scrolling as hints to examination. Huang et al. [2012] suggest to use mouse movements and page scrolls as additional data that helps to detect examination events E that are not observed directly. The authors suggest a refined DBN model that benefits from this additional source of data by adding the following constraints:

P(Er = 1 | ∃h ∈ H : r ≤ rh) = 1    (8.33)
P(Er = 1 | r ∈ V) = 1,    (8.34)

where H is the set of hovered documents (mousing) and V is the set of all documents shown when a user scrolled a SERP (scrolling). Results show that augmenting the model with this additional data leads to improved perplexity.

From skimming to reading. Recent work by Liu et al. [2014] suggests that mouse movements themselves or even eye positions obtained with an eye tracker do not necessarily indicate that a user is reading a snippet. Instead, two separate states should be considered: skimming a document and reading a document:

P (Cu = 1) = P (Cu = 1 | Ru = 1) · P (Ru = 1 | Fu = 1) · P (Fu = 1), (8.35)

where Ru is an additional random variable corresponding to reading the document u as opposed to skimming or fixing eyes on u, i.e., Fu. The authors show that training separate models for each of the factors above yields better relevance prediction (Section 5.4).

Predicting mousing instead of clicks. Diaz et al. [2013] take the first steps toward modeling mouse movements themselves. They suggest a Markov model that predicts the probability of mousing element j after element i. While in most of the click models studied above it is assumed that documents are examined from top to bottom, the order may be different when a result page is non-linear. For instance, the result page of an image search engine typically contains results in the form of a grid and the top-to-bottom examination assumption is simply not applicable in this case.

Diaz et al. demonstrate that complex sequences of mouse movements can be predicted reasonably well and that their model can generalize to unseen SERP layouts.

8.5 USING EDITORIAL RELEVANCE JUDGEMENTS

As we discussed in Section 6.1, there is a very limited amount of publicly available click data and each of the public datasets has its own drawbacks. On the other hand, there are multiple datasets


available, which contain rankings produced by different search systems and relevance labels obtained from judges. This includes, but is not limited to, the TREC datasets [Voorhees and Harman, 2001], the Microsoft LETOR datasets [Liu et al., 2007], the Yahoo! Learning to Rank Challenge3 and the Yandex Internet Mathematics Challenge.4 Below we discuss different approaches to incorporating relevance judgements into click models, which benefit from datasets that come with relevance judgements.

Using only editorial relevance. Hofmann et al. [2011] suggest to instantiate click model parameters based on editorial relevance judgements. As has been shown by Turpin et al. [2009], attractiveness parameters αuq appearing in most of the models are reasonably well correlated with editorial relevance judgements Ruq. The same holds for satisfaction parameters σuq appearing in DBN or in any model with the notion of satisfaction. This allows us to greatly reduce the number of click model parameters by making the assumption that any parameter θuq is a function of the relevance judgement Ruq. This means that instead of having a separate parameter value for every query-document pair, we only need to have one value for each relevance grade. For example, the DBN model can then be specified as follows (cf. (3.34)–(3.40)):

Cr = 1 ⇔ Er = 1 and Ar = 1    (8.36)
P(Ar = 1) = α(Rurq)    (8.37)
P(E1 = 1) = 1    (8.38)
P(Er = 1 | Er−1 = 0) = 0    (8.39)
P(Sr = 1 | Cr = 1) = σ(Rurq)    (8.40)
P(Er = 1 | Sr−1 = 1) = 0    (8.41)
P(Er = 1 | Er−1 = 1, Sr−1 = 0) = γ.    (8.42)

If we assume that there are five possible relevance labels, as is customary in web search settings [Sanderson, 2010], then this model has 11 parameters in total: five attractiveness, five satisfaction and one continuation parameter.

By training a model only once on a relatively small number of clicks, one can then reuse it with new datasets with completely different documents, provided that they have relevance judgements on the same scale.
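Assuming five relevance grades, the relevance-based DBN above can be sketched as follows; the grade-indexed parameter tables and the continuation value are illustrative placeholders, not estimates from the survey.

```python
# Hypothetical grade-indexed parameters for relevance grades 0..4;
# in practice these 11 values would be estimated from click data.
alpha = [0.05, 0.2, 0.4, 0.6, 0.8]   # attractiveness per relevance grade (8.37)
sigma = [0.05, 0.1, 0.3, 0.5, 0.7]   # satisfaction per relevance grade (8.40)
gamma = 0.9                          # single continuation parameter (8.42)

def click_probabilities(grades):
    """Marginal click probabilities of the relevance-based DBN (8.36)-(8.42)
    for a ranked list of relevance grades."""
    probs = []
    p_exam = 1.0                      # (8.38): the first result is examined
    for g in grades:
        p_click = p_exam * alpha[g]   # (8.36)-(8.37)
        probs.append(p_click)
        # Continue only if not satisfied: clicked-and-unsatisfied or not clicked,
        # in both cases continuing with probability gamma; this equals
        # p_exam * gamma * (1 - alpha * sigma).
        p_exam = p_exam * gamma * (1 - alpha[g] * sigma[g])
    return probs
```

For example, `click_probabilities([4, 2, 0])` yields probabilities that decay both with rank and with the relevance of the documents above, which is exactly the behavior the 11-parameter model is meant to transfer across datasets.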

If there is no click data available to estimate the values of the click model parameters, one can consider different assignments for these parameters. For instance, Hofmann et al. [2013] consider three different extreme cases for the attractiveness parameter assignment, which they call perfect, navigational and informational user models. Each of these user models corresponds to a different mapping from the relevance scale (Hofmann et al. use three relevance grades) to the probability of attractiveness. The intuition is that under the perfect user model only relevant documents are clicked, under the navigational user model there is some relaxation to this, while the informational user model assumes that a user carefully examines a SERP and can click even non-relevant documents

3http://webscope.sandbox.yahoo.com/catalog.php?datatype=c.
4http://imat2009.yandex.ru/academic/mathematic/2009/en/datasets.


with reasonable probability. The assignment of the attractiveness parameters according to [Hofmann et al., 2013] is given in Table 8.1. By considering all three different models in a simulation, one may hope to cover real-world settings, which supposedly lie somewhere in between.

Table 8.1: Click model attractiveness parameters for the simulated click models from [Hofmann et al., 2013].

relevance grade        0     1     2
perfect model        0.00  0.50  1.00
navigational model   0.05  0.50  0.95
informational model  0.40  0.70  0.90

Using clicks and editorial relevance. Using both click signals and human relevance labels is a promising direction for click models, which can help to get more accurate predictions and to overcome data sparsity issues. The first steps in this direction are presented in the noise-aware framework due to Chen et al. [2012b]. In order to make use of editorial judgements, the notion of relevance R is introduced directly into the model. Then, the user's click given examination is either guided by relevance (noise-free environment, Nr = 0) or not (noisy environment, Nr = 1). Formally, this can be written as follows:

P(Cr = 1 | Er = 0) = 0    (8.43)
P(Cr = 1 | Er = 1, Rr = 1, Nr = 0) = αurq    (8.44)
P(Cr = 1 | Er = 1, Rr = 0, Nr = 0) = 0    (8.45)
P(Cr = 1 | Er = 1, Nr = 1) = βq,    (8.46)

where βq is a query-specific parameter.

The authors introduce two models, N-DBN and N-UBM, and show that they consistently have better perplexity than the corresponding baseline models. For details on the parameter estimation (that uses editorial relevance data) as well as on the estimation of the noise probability, we refer the reader to the original paper [Chen et al., 2012b].
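The conditional click probability table (8.43)-(8.46) can be written down directly; the function below is our sketch, with the parameter values passed in rather than estimated:

```python
def p_click(examined, relevant, noisy, alpha_urq, beta_q):
    """Click probability under the noise-aware framework (8.43)-(8.46).
    alpha_urq and beta_q would normally be learned from clicks and
    relevance labels; here they are plain arguments."""
    if not examined:
        return 0.0                           # (8.43): no click without examination
    if noisy:
        return beta_q                        # (8.46): noisy clicks ignore relevance
    return alpha_urq if relevant else 0.0    # (8.44)-(8.45): relevance-guided clicks
```

Marginalizing this table over the (unobserved) noise variable is what ties the click signal to the editorial labels during training.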

8.6 NON-LINEAR SERP EXAMINATION

Aggregated search. It is evident that the more heterogeneous a SERP is, the more likely it is that a user is going to examine it in a non-linear way. We have already seen this when we discussed the VCM model by Wang et al. [2013a] in Section 8.1.2. In that model a user is allowed to jump right to a vertical block and then proceed up or down.

Temporal click models. A similar idea is used for a range of temporal click models. Unlike most click models that are based on the position of search engine results, temporal click models are based on the time series of users' actions, such as clicks, examinations, etc., and therefore give us a different perspective on user interaction behavior.


In the setting of sponsored search, Xu et al. [2010] introduce the temporal click model (sometimes referred to as TCM, which should not be confused with the task-centric model, for which the same acronym is used in the literature, see Section 8.2.2). The temporal click model builds on what the authors call the positional rationality hypothesis, according to which users first examine two ads together and then compare their quality. The temporal click model is found to have better CTR prediction accuracy (Section 5.3) than BBM5 on less frequent queries and comparable performance on frequent queries.

Wang et al. [2010] model the sequence of user actions on a SERP as a Markov chain and consider not only positional but also temporal information in search logs. The authors model all the user actions in a session, including hovering events, page loading and unloading, query reformulation, etc., in addition to conventional click events. The resulting model is compared against the experimental outcomes of an eye tracking study and high agreement is reported. No evaluation of the type that we focus on in this book (see Chapter 5) is performed. He and Wang [2011] refine this approach by explicitly modeling the duration of an event (e.g., the time a user spends reading a search result).

The temporal hidden click model (THCM) by Xu et al. [2012] also incorporates non-linear SERP examination, with a special focus on what the authors call revisiting patterns, where a user clicks on a higher-ranked document after having clicked on a lower-ranked one. Xu et al. suggest a temporal model in which, after examining a document ur, a user examines the next document ur+1 with probability α, but may also examine the previous document ur−1 with probability γ. Several evaluations are conducted. In terms of perplexity (Section 5.2), THCM is found to outperform both DBN and CCM. In terms of relevance prediction (Section 5.4), there is no significant difference between THCM and DBN or CCM on frequent queries, but on moderately frequent and rare queries the authors do report significant improvements.

Recently, Wang et al. [2015] have introduced the partially sequential click model (PSCM). It builds on two hypotheses. The first, called the locally unidirectional examination assumption, says that between adjacent clicks, users tend to examine search results in a single direction without changes and this direction is usually consistent with that of clicks (whether going up or down the SERP). The second, called the non-first-order examination assumption, says that although the examination behavior between adjacent clicks can be regarded as locally unidirectional, users may skip a few results and examine a result at some distance from the current one following a certain direction. PSCM outperforms existing models in terms of perplexity (Section 5.2) and relevance prediction (Section 5.4).

Switching between organic web and other SERP blocks. Finally, the whole page click model (WPC) by Chen et al. [2011] is meant to tackle the problem of side elements on a SERP, such as ads or related searches. The authors suggest to separately model two things: user transitions between blocks (organic results, top ads, side ads, related searches, etc.) and user behavior within a block. The first model, called macro-model, is set up as a Markov chain, while the second model,

5We discuss the BBM model, which is due to Liu et al. [2009], in Section 4.4.


called micro-model, is based on a regular click model (Chen et al. use UBM). Since WPC considers additional SERP elements, it yields substantial perplexity improvements over the baseline model for the whole SERP as well as separately within the organic result block and the ads block.

8.7 USING FEATURES

Regression-based models. Starting with the work of Richardson et al. [2007], a feature-based approach to predicting clicks in the setting of sponsored search has become popular. Richardson et al. suggest to use a feature-based logistic model that they train using conventional machine learning methods:

P(Cu = 1) = 1 / (1 + e^−Z)    (8.47)
Z = ∑i wi fi,    (8.48)

where fi is a feature extracted from the query, document and rank, and wi is the weight of fi learned during training. Richardson et al. use 81 features and show that their model predicts CTR better than the simple RCM model (Section 3.1).
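Prediction with a trained model of this form reduces to a weighted sum followed by a sigmoid; a sketch of (8.47)-(8.48), where the feature values and weights are hypothetical (the 81 features themselves are not listed in the survey):

```python
import math

def predict_ctr(features, weights):
    """Logistic CTR prediction: P(C=1) = 1 / (1 + exp(-Z)) with Z = sum_i w_i f_i.
    `features` is a ready-made vector; feature extraction is out of scope here."""
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

Usage would look like `predict_ctr([1.0, 0.3, -2.0], [0.5, 1.2, 0.1])`, returning a probability in (0, 1).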

Like logistic regression, factorization machines are a general regression model. They capture interactions between pairs of elements using factors. Factorization machines can efficiently model user interactions with various elements of ads. Rendle [2012] uses factorization machines and a broad range of user-, content- and interaction-based features to predict the CTR of an ad given a user and a query. This system produced very competitive results at the 2012 edition of the KDD Cup.6

Recurrent neural networks have also been used for click prediction in sponsored search. Zhang et al. [2014] observe that users' interactions with ads depend on past user behavior in terms of what queries they submitted, what ads they clicked or ignored, how much time they spent on the landing pages of clicked ads, etc. Zhang et al. build on these observations and introduce a click prediction framework based on recurrent neural networks. This framework directly models the dependency on users' sequential behavior as a part of the click prediction process through the recurrent structure of a network. The authors compare their approach against logistic regression and non-recurrent neural networks and find that their method wins on the ad click prediction task in terms of the Area under the ROC Curve and Relative Information Gain scores.

Post-click model (PCC). Zhong et al. [2010] suggest to take into account the after-click behavior on a landing page and beyond. They start with the DBN model (Section 3.8) and add a feature-based term to the satisfaction probability (3.38) using the following formula:

P(Sr = 1 | Cr = 1) = P(σurq + ∑i yuri fi + ε > 0),    (8.49)

6http://www.kddcup2012.org.


where σurq is assumed to have a normal distribution, fi is a post-click feature like "dwell time on the clicked page," "time before next query" or the fact of a search engine switch, yuri is a weight parameter and ε is a normally distributed error term.

The authors show that this model allows for efficient training and outperforms DBN and CCM in terms of the relevance prediction evaluation (Section 5.4).

General click model (GCM). Zhu et al. [2010] propose to generalize the attributes on which model parameters depend. For example, in the DCM model (Section 3.6) the attractiveness parameter αuq depends on a query-document pair while the continuation parameter λr depends on the current rank (also see Table 3.2). In the general click model (GCM), suggested by Zhu et al., click model parameters can depend on other attributes, such as time of the day, user browser, user location or length of a URL:

P(E1 = 1) = 1    (8.50)
P(Er = 1 | Er−1 = 0) = 0    (8.51)
P(Cr = 1 | Er = 1) = P(∑t θ^R_fusert + ∑j θ^R_furli,j + ε > 0)    (8.52)
P(Er = 1 | Er−1 = 1, Cr−1 = 0) = P(∑t θ^A_fusert + ∑j θ^A_furli,j + ε > 0)    (8.53)
P(Er = 1 | Cr−1 = 1) = P(∑t θ^B_fusert + ∑j θ^B_furli,j + ε > 0),    (8.54)

where fusert and furli,j are user-specific and URL-specific features respectively, θf is some random distribution parametrized by the corresponding feature and ε is a normally distributed error term (cf. (8.49)). By using appropriate features f and distributions θf it can be shown that the CM, DCM, CCM and DBN models can be considered as special cases of this model.
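The P(· + ε > 0) construction in (8.52)-(8.54) is a probit-style threshold model: with normally distributed ε, the probability that the sum of the θ terms plus noise exceeds zero is a normal CDF evaluated at that sum. A sketch, assuming unit noise variance for illustration:

```python
import math

def probit_probability(theta_sum):
    """P(theta_sum + eps > 0) with eps ~ N(0, 1): the standard normal CDF
    of theta_sum, computed via the error function."""
    return 0.5 * (1.0 + math.erf(theta_sum / math.sqrt(2.0)))

print(probit_probability(0.0))  # prints 0.5
```

In the full model the θ terms are themselves random variables, so inference additionally integrates over their distributions; the sketch only shows how the threshold turns a real-valued score into a probability.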

The authors show that with many attributes (they use between 7 and 21 attributes) the model performs better than DBN and CCM in terms of both log-likelihood (Section 5.1) and CTR prediction (Section 5.3).

Wang et al. [2013b] propose a similar model called the Bayesian Sequential State model (BSS) that extends the feature set by adding features based on the dependencies between sequential click events. Their model has a poor perplexity score, but outperforms UBM and DBN on the task of relevance prediction (Section 5.4). The authors did not, however, compare their model to GCM.

In this chapter we have covered a diverse landscape of recent click model developments. As the reader can see, most advanced models are built on top of the basic click models presented in Chapter 3. Quite often, performance improvements are achieved at the cost of far greater model complexity and,


as usual, compromises have to be made between the improvement a newly proposed model brings and the amount of time and resources required to implement, test and run it.


CHAPTER 9

Applications

In this chapter we expand on the main click model applications that were briefly mentioned in the introduction. We start with the most straightforward application, that of simulating users, which we discuss in Section 9.1. Then we move on to a more practical area of observing user behavior and drawing important conclusions about IR system design in Section 9.2. Next, we discuss a really powerful application of click models that allows one to improve search quality by correctly interpreting user clicks (Section 9.3). Finally, we show how click models can be applied to offline evaluation by incorporating user behavior into evaluation metrics (Section 9.4).

9.1 SIMULATING USERS

When we are performing online quality evaluation using methods such as A/B-testing [Kohavi et al., 2008] or interleaving [Chapelle et al., 2012], or assessing the response time and resource requirements of a search system, we either need to have access to real users or somehow simulate them. It is clear that experimenting with real users is not always feasible or sufficient due to the limited amount of user interaction data, the sensitive nature of such data and the risk of noticeably degrading the user experience. Click models give a perfect solution to this problem: once trained on historical interaction data, a click model can be applied to simulate an infinite number of user clicks. This can be done using a very simple algorithm, as shown in Algorithm 1, where the probability P(Cr = 1 | C<r) of a click given the clicks above it is taken from Table 3.1 for each particular model.

As we will explain below, depending on the character and availability of training data, there are different ways in which one can use click models for user simulations.

Simulating users in the same sessions: data expansion. The first situation we consider is when we need to simulate clicks in the same sessions as we have in our training set. Assume, for example, that we are running a ranking experiment where, with a certain probability, a user is presented with a re-ordered set of results on a SERP. Due to the risk of degrading the user experience, re-orderings are shown with a very small probability. On the one hand, this prudent approach minimizes the impact on the search quality. On the other hand, it leads to fewer clicks on the experimental rankings, as a result of which we may have problems drawing conclusions using online click-based metrics. What we can do instead is to train a click model using available clicks in the sessions with re-orderings and extrapolate these clicks to the same SERPs to get more robust estimations for quality metrics such as abandonment rate or mean reciprocal rank (MRR) [Radlinski et al., 2008].


Algorithm 1 Simulating user clicks for a query session.

Input: click model M, query session s
Output: vector of simulated clicks (c1, . . . , cn)

1: for r ← 1 to |s| do
2:     Compute p ← P(Cr = 1 | C1 = c1, . . . , Cr−1 = cr−1) using previous clicks c1, . . . , cr−1 and the parameters of the model M
3:     Generate random value cr from Bernoulli(p)
4: end for
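Algorithm 1 translates almost directly into code. In the sketch below, the `click_probability` interface is our assumption for how a trained model would expose P(Cr = 1 | C<r), and the rank-based toy model is purely illustrative:

```python
import random

def simulate_clicks(model, session, rng=None):
    """Algorithm 1: simulate a click vector for a query session.
    `model` is assumed to expose click_probability(session, rank, previous_clicks),
    i.e. P(C_r = 1 | C_1, ..., C_{r-1}); this interface is illustrative."""
    rng = rng or random.Random()
    clicks = []
    for r in range(len(session)):
        p = model.click_probability(session, r, clicks)
        clicks.append(1 if rng.random() < p else 0)  # Bernoulli(p) draw
    return clicks

class RankModel:
    """Toy stand-in for a trained click model: click probability decays with rank."""
    def click_probability(self, session, rank, previous_clicks):
        return 1.0 / (rank + 2)
```

Because each cr is drawn conditioned on the clicks already generated, repeated runs of `simulate_clicks` produce click vectors distributed according to the model, which is what makes the simulated sessions usable in downstream experiments.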

Simulating users in new sessions. Quite often, one needs to extrapolate user behavior to unseen data. For example, one may want to test a completely new ranking of documents and does not want to show this ranking to real users. Here, click models can be of great help because they essentially decompose the click probability into the properties of a document (such as attractiveness or satisfaction parameters) and the effects of the document rank and surrounding documents. Consequently, even if a click model was trained on one ranking of documents, we can simulate user clicks on different rankings of the same documents. We can go even further and apply a trained click model to unseen documents or queries. In this case, models such as the matrix factorization model (MFCM) by Shen et al. [2012] can help to fill the gaps in the sparse matrix of model parameters.

Simulating users on the LETOR datasets. Often, a click model has to be applied to a totally new dataset that does not share common queries or documents with previously used ones or it is hard to match those because they are anonymized in different ways. This often happens in the setting of academic research labs that do not have access to large-scale search logs, but also with commercial organizations that want to quickly assess their ideas on new data using click logs. Here, editorial relevance labels can serve as a link between different datasets. As we discussed in Section 8.5, every click model can be modified in such a way that its parameters depend only on the rank and editorial relevance label of documents. In this manner we not only contract the size of a trained model to the order of hundreds of parameters, but also allow researchers to exchange these parameters without sharing the data the model was trained on.

Once relevance-based click model parameters have been made available, they can then be applied to other datasets that have relevance labels. Typically, these are datasets used for learning to rank (LETOR), a task for machine learning to rank documents in a way that optimizes some utility function. A number of such datasets have been published, for example, by Microsoft Research Asia1 [Liu et al., 2007, Qin et al., 2010]. The algorithm for simulating clicks in this setting is the following: train a click model on a dataset having both clicks and relevance labels (see Table 6.1) and then apply it to any dataset having only relevance labels.

1http://research.microsoft.com/en-us/um/beijing/projects/letor/.


Such relevance-based click models have been used successfully to simulate users in interleaving experiments [Chuklin et al., 2013a, 2015, Hofmann et al., 2011, 2013].

9.2 USER INTERACTION ANALYSIS

One of the less obvious applications of click models is their usage as an observational device in the area of user experience research (UX). By examining parameters of a click model one can learn more about user behavior and make changes to user interfaces based on insights gained in this manner.

Dupret and Piwowarski [2008] are the first to point out this application of click models. In particular, they analyze the value of the examination parameter γrr′ of the UBM model (Section 3.5). This parameter defines the probability of examining a document at rank r, given that the previous click was at rank r′. The authors present a set of results that confirm the fact that the examination probability declines with the rank r of the current document, but also show that this probability exhibits a non-trivial dependency on the rank of the previous click r′.

Analyzing this kind of information is especially interesting in the case of aggregated search, which we studied in Section 8.1. By looking at the parameters of a click model we can quantify how vertical documents affect the way a user consumes a SERP. Following that, one may introduce changes to the way we surface these vertical results. For example, Chen et al. [2012a] show that the presence of vertical documents affects the examination of the neighboring web documents; the effect is less prominent for textual verticals like News, which only affects its neighbors when it is placed at the top. But if, for example, an Image vertical is placed at rank 4, then web documents at ranks 5, 6 and 7 will have substantially higher chances of being examined than without the placement of the Image vertical.

9.3 INFERENCE OF DOCUMENT RELEVANCE

One of the ways to define document relevance is through the probability of user satisfaction given examination P(Sr = 1 | Er = 1), as was suggested by Chapelle and Zhang [2009]. Having trained a click model which has the notion of satisfaction, one can obtain this probability from the click model parameters:

Rurq = P (Sr = 1 | Er = 1) = P (Sr = 1 | Cr = 1) · P (Cr = 1 | Er = 1). (9.1)

After having obtained this click-based relevance score, we can either use it to evaluate the quality of the current production system using metrics such as DCG [Järvelin and Kekäläinen, 2002] or use it to improve the system in the future. Chapelle and Zhang [2009] show that if this click-based relevance is used as a learning feature in a machine learning-based ranking system, it significantly improves the end result, which means that it is a powerful signal of document quality.

Ozertem et al. [2011] use this notion of click-based relevance together with a smoothing technique to complement editorial relevance judgements in the evaluation process. They show that despite a fair amount of missing judgements, a significant correlation with the ground-truth DCG can be achieved.


9.4 CLICK MODEL-BASED METRICS

Here we discuss the connection between click models and web search quality metrics. In the so-called Cranfield approach described and practiced by Cleverdon et al. [1966], the relevance of each query-document pair is obtained from trained raters independently of other documents. This way the same relevance labels can be used to evaluate different rankings of documents, which makes it an effective way to evaluate systems without running long and costly experiments or user studies. On the other hand, these relevance labels need to be aggregated into a SERP-level quality metric, which is not a straightforward task.

The first metric to evaluate the quality of a whole SERP was just a simple precision metric: the number of relevant documents found by raters divided by the total number of documents shown to a user. With the emergence of ubiquitous search engines that are being used many times a day, it has become clear that users do not always examine all documents, but rather start from top to bottom. Following this idea, Järvelin and Kekäläinen [2002] suggest a discounted metric, DCG, that assigns higher weights to the documents higher in the ranking. However, this metric does not fully explain why this particular rank discount has to be used, nor does it address the question of dealing with graded relevance. For instance, is it better to have two marginally relevant documents at ranks 2 and 3 or one very relevant document at rank 3? In order to answer this question, one must first observe user behavior and learn regularities from it.

Following early attempts to build user behavior-aware evaluation metrics such as Expected Reciprocal Rank (ERR) by Chapelle et al. [2009] and Expected Browser Utility (EBU) by Yilmaz et al. [2010], Chuklin et al. [2013c] suggest the idea that any click model can be a basis for building evaluation metrics. Following Carterette [2011], Chuklin et al. suggest to use two families of click model-based metrics: utility-based and effort-based. Utility-based metrics assume that by examining relevant documents a user accumulates utility, so the metric should be defined as an average utility value:

uMetric = ∑_{r=1}^{n} P(Cr = 1) · Ur,    (9.2)

where Ur is some utility measure of the r-th document (usually relevance Rr). So if we fix a click model and compute click probabilities P(Cr = 1) for a particular SERP, we can plug them into the above formula and obtain an evaluation metric.

There are two important factors to note here. First, the click model used has to be relevance-based (see Section 8.5) in order for the final metric to depend only on the relevance of documents on a SERP. Second, the click probability has to be the full click probability, i.e., it has to be derived from the model equations such that it is marginalized over all random variables that are not observed (note that clicks are also not observed when we want to apply a metric). For instance, for UBM (Section 3.5) this requires marginalizing over all possible ranks of the previous click. We give more details and provide final formulas in Section 3.9.

As to the family of so-called effort-based metrics, they are based on models that have a notion of satisfaction, such as DBN (Section 3.8) and DCM (Section 3.6). The metric then computes the average effort a searcher has to spend until they are satisfied:

eMetric = \sum_{r=1}^{n} P(S_r = 1) \cdot F_r,    (9.3)

where F_r is some effort measure. Usually the inverse of the effort is used, e.g., the reciprocal rank 1/r, so that higher metric values correspond to lower effort. For example, the widely used ERR metric by Chapelle et al. [2009] can be viewed as a metric of this family.
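To make this concrete, here is ERR written directly in the form of Equation 9.3, with the stopping probabilities from Chapelle et al. [2009] playing the role of P(S_r = 1) and the reciprocal rank as the effort measure F_r:

```python
def err(grades, max_grade=4):
    """Expected Reciprocal Rank as an effort-based metric:
    sum over ranks of P(S_r = 1) * (1 / r), where
    P(S_r = 1) = prod_{i < r} (1 - R_i) * R_r and
    R_i = (2^{g_i} - 1) / 2^{max_grade}."""
    not_satisfied = 1.0  # probability of reaching rank r unsatisfied
    score = 0.0
    for r, g in enumerate(grades, start=1):
        stop = (2 ** g - 1) / (2 ** max_grade)  # R_r
        score += not_satisfied * stop / r       # P(S_r = 1) * F_r
        not_satisfied *= 1 - stop
    return score
```

Swapping in satisfaction probabilities from DBN or DCM instead of the geometric cascade above yields other metrics of this family.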

Chuklin et al. [2013c] find that an effort-based metric based on UBM shows the highest correlation with online experiments, especially when there is a lot of unlabeled data. However, in some cases other models may be more suitable. For example, for aggregated search there exist more precise click models, which we discussed in Section 8.1. As has been shown by Markov et al. [2014], in the presence of vertical results, metrics based on click models for aggregated search show better alignment with online user preferences.

Similarly, Kharitonov et al. [2013] consider the problem of evaluating query auto-completion, where a specific user model needs to be used. They also demonstrate that incorporating a user model into evaluation helps to achieve higher correlation with online metrics.


CHAPTER 10

Discussion and Directions for Future Work

In this book we have presented a coherent view on the area of click models for web search. The main focus of the book is on models represented as probabilistic graphical models (PGMs). Using the terminology of Koller and Friedman [2009], most of the models studied in the book are directed Bayesian networks (see Definition 2.1). As we have shown, this machinery is very powerful and allows one to express many ideas about user behavior, incorporate additional signals and achieve good performance results. But of course, there is always room for improvement. Specifically, we believe that the development of user modeling should go in three directions.

Foundations of user modeling. The first direction is to expand the foundations of the user modeling aspects of click models. As we have seen in Chapter 7, greater complexity does not always come with better results. Nevertheless, there are many approaches to user modeling that have not been tried yet. First of all, the area of probabilistic graphical models itself suggests additional tools, such as undirected Markov networks or partially directed conditional random fields [Koller and Friedman, 2009, Chapter 4]. These alternative structures allow us to specify fewer dependencies between the user's behavioral states. Also, there are many options to generalize conditional probability distributions. Currently, we set the probability of observing some state to have a simple dependency on the states connected to it in a PGM. We may extend this by using so-called structured probability distributions, where the dependency becomes more complex; see [Koller and Friedman, 2009, Chapter 5]. This extension would allow us to naturally introduce feature-based machine learning elements, as in some of the models that we discussed in Section 8.7.

Another promising approach is structure learning [Koller and Friedman, 2009, Chapter 18]. Up until now we have always fixed the structure of a network. However, the very fact that there are so many different models, as demonstrated in this book, suggests that we, humans, cannot easily engineer a perfect dependency graph by hand. Instead, we may want to infer this structure from data, just as we currently do for model parameters. The recent successes of distributed representations in many prediction tasks suggest a natural way forward, viz. to represent user behavior as a sequence of vector states that capture user interactions with a SERP; see, e.g., [Zhang et al., 2014].

Rich interactions and alternative search environments. As we pointed out in Section 8.4, even in a desktop environment there are additional signals that one may use to gain a better understanding of user interactions with a SERP. While cognitively rich signals such as eye tracking and mouse movements increasingly receive attention, other, perhaps more mundane, aspects have been little studied, as a typical SERP may not be equipped with them. Pagination buttons, as discussed in Section 8.2.1, are but one example. Other examples include interface elements to support result list refinement through facets and filters [He et al., 2015], common in professional search environments [Huurnink et al., 2010], so-called knowledge cards [Blanco et al., 2013] and interactive elements such as weather or calculator widgets.

Devices beyond desktop computers provide other means of interaction with search results. Mobile devices mainly rely on touch technology, which substitutes for mouse movements and page scrolls and supports a number of gestures. Voice input is also more heavily used for mobile search interfaces. Also, it was shown recently that the SERP viewport (the visible portion of a web page) is an important signal of a searcher's attention on mobile devices [Lagun et al., 2014]. Other devices, such as TV sets, provide limited interaction capabilities, but may have additional search elements, such as previews, which are not available in standard search environments. As Markov et al. [2015] put it, the main challenge here is to capture and reliably interpret user signals other than clicks. For example, when capturing mouse movements and touch gestures, one needs to map the position of a cursor/finger on a screen to an underlying object on a SERP. Also, while some user signals have relatively clear interpretations (e.g., clicks, scrolls, zooming), others do not (e.g., mouse movements).

Apart from the fact that different devices provide different means of interaction with search results, devices may also differ in their technical characteristics and main function. Desktops and laptops are general-purpose devices with rich interaction capabilities, and hence they can be used for most types of search. In contrast, smartphones and tablet computers are more mobile and have additional components, such as a touch screen, microphone and GPS, which make them perfect for specific types of search, e.g., local search. Search systems on devices such as TV sets and GPS navigation devices may use additional input sources, such as voice commands, user location or automobile sensors. On the other hand, they provide limited search capabilities and are often geared toward answering specific types of question. Thus, the device type determines not only the way of interacting with search results, but also the user's search intents. The notion of what constitutes a successful search may also differ across devices. In generic web search, a successful session is usually one with many clicks followed by long dwell times. In TV search, however, a successful session could be one that ends with a purchase of a TV program. Similarly, a successful search session on a GPS navigation device should probably be followed by the user navigating to a point of interest [Norouzzadeh Ravari et al., 2015]. Overall, users' search behavior is bound to differ considerably between devices. Therefore, models aimed at capturing and predicting user behavior have to carefully consider the special properties of each device and the corresponding user intents and indicators of success.

Convergence. A final direction for future development consists of fusing ideas developed in adjacent areas. We already discussed some of these ideas in Chapter 8, but there is more to it. First, there is a large body of work on user modeling and user behavior analysis for online advertising and sponsored search; see, e.g., [Alipov et al., 2014, Ashkan and Clarke, 2012, Ghosh and Mahdian, 2008, Graepel et al., 2010, Richardson et al., 2007, Xiong et al., 2012, Zhu et al., 2010]. Since predicting clicks is critical for many commercial search and advertisement applications, models aimed at predicting clicks appeared relatively early on in those areas [Richardson et al., 2007]. Since then, they have developed rather independently from click models for web search. Yin et al. [2014] add organic results to a click model for ads, but only as an external entity. We believe that these two areas could and should benefit more from each other and that they will eventually converge.

There are also more distant areas, such as user experience studies [Dumais et al., 2001], network and game theory [Easley and Kleinberg, 2010] and learning to rank [Burges et al., 2005], that contribute ideas to the area of user modeling and will continue to do so in the future.

As we have shown in Section 8.3, there is a fair amount of work that tries to tackle the problem of user diversity. The next steps in the direction of personalization should concern the aggregation of similar users into clusters [Buscher et al., 2012] or cohorts [Hassan and White, 2013] in order to improve model performance for infrequent users and to minimize overfitting risks.


Bibliography

Vyacheslav Alipov, Valery Topinsky, and Ilya Trofimov. On peculiarities of positional effects in sponsored search. In SIGIR, 2014. ACM Press. 81

Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. Sources of evidence for vertical selection. In SIGIR. ACM Press, 2009. 47

Jaime Arguello, Fernando Diaz, and Milad Shokouhi. Integrating and ranking aggregated content on the web. In WWW. ACM Press, 2012. Tutorial at WWW 2012. 47

Azin Ashkan and Charles L.A. Clarke. Modeling browsing behavior for click analysis in sponsored search. In CIKM, 2012. ACM Press. 45, 66, 81

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. 3

Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, and Nicolas Torzec. Entity recommendations in web search. In The Semantic Web – ISWC 2013. Springer, 2013. 81

Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. ACM Press. 82

Georg Buscher, Ryen W. White, Susan Dumais, and Jeff Huang. Large-scale analysis of individual and task differences in search result page examination strategies. In WSDM, 2012. ACM Press. 82

Ben Carterette. System effectiveness, user models, and user utility: A conceptual framework for investigation. In SIGIR. ACM, 2011. 78

Olivier Chapelle and Ya Zhang. A dynamic Bayesian network click model for web search ranking. In WWW, 2009. ACM Press. 2, 15, 16, 23, 30, 33, 45, 46, 47, 51, 77

Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. Expected reciprocal rank for graded relevance. In CIKM, 2009. ACM Press. 78, 79

Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 2012. 75

Danqi Chen, Weizhu Chen, Haixun Wang, Zheng Chen, and Qiang Yang. Beyond ten blue links: enabling user click modeling in federated web search. In WSDM, 2012a. ACM Press. 1, 2, 47, 51, 61, 62, 63, 77


Mon Chu Chen, John R. Anderson, and Myeong Ho Sohn. What can a mouse cursor tell us more?: Correlation of eye/mouse movements on web browsing. In CHI '01 Extended Abstracts on Human Factors in Computing Systems, 2001. ACM Press. 68

Weizhu Chen, Zhanglong Ji, Si Shen, and Qiang Yang. A whole page click model to better interpret search engine click data. In AAAI. AAAI Press, 2011. 71, 72

Weizhu Chen, Dong Wang, Yuchen Zhang, Zheng Chen, Adish Singla, and Qiang Yang. A noise-aware click model for web search. In WSDM, 2012b. ACM Press. 7, 70

Aleksandr Chuklin, Anne Schuth, Katja Hofmann, Pavel Serdyukov, and Maarten de Rijke. Evaluating aggregated search using interleaving. In CIKM, 2013a. ACM Press. 77

Aleksandr Chuklin, Pavel Serdyukov, and Maarten de Rijke. Using intent information to model user behavior in diversified search. In ECIR, 2013b. 2, 47, 51, 63, 66

Aleksandr Chuklin, Pavel Serdyukov, and Maarten de Rijke. Click model-based information retrieval metrics. In SIGIR, 2013c. ACM Press. 18, 78, 79

Aleksandr Chuklin, Pavel Serdyukov, and Maarten de Rijke. Modeling clicks beyond the first result page. In CIKM, 2013d. ACM Press. 44, 45, 64

Aleksandr Chuklin, Ke Zhou, Anne Schuth, Floor Sietsma, and Maarten de Rijke. Evaluating intuitiveness of vertical-aware click models. In SIGIR, 2014. ACM Press. 47, 48, 62

Aleksandr Chuklin, Anne Schuth, Ke Zhou, and Maarten de Rijke. A comparative analysis of interleaving methods for aggregated search. ACM Transactions on Information Systems, 33(2), 2015. 62, 77

Cyril W. Cleverdon, Jack Mills, and Michael Keen. Aslib Cranfield research project – Factors determining the performance of indexing systems; Volume 1, Design; Part 1, text. Technical report, College of Aeronautics, Cranfield, 1966. 78

Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In WSDM, 2008. ACM Press. 2, 9, 10, 11, 12, 43

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1977. 27, 28

Fernando Diaz, Ryen W. White, Georg Buscher, and Dan Liebling. Robust models of mouse movement on dynamic web search results pages. In CIKM, 2013. ACM Press. 2, 68

Zhicheng Dou, Ruihua Song, and Ji-Rong Wen. A large-scale evaluation and analysis of personalized search strategies. In WWW, 2007. ACM Press. 2


Susan Dumais, Edward Cutrell, and Hao Chen. Optimizing search by showing results in context. In CHI, 2001. ACM Press. 82

Georges Dupret and Ciya Liao. A model to estimate intrinsic document relevance from the click-through logs of a web search engine. In WSDM, 2010. ACM Press. 47

Georges Dupret, Vanessa Murdock, and Benjamin Piwowarski. Web search engine evaluation using clickthrough data and a user model. In WWW, 2007. ACM Press. 2

Georges E. Dupret and Benjamin Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR, 2008. ACM Press. 2, 9, 10, 13, 30, 43, 44, 45, 51, 77

David Easley and Jon Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010. 82

Bradley Efron and R. J. Tibshirani. An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics & Applied Probability). Chapman and Hall/CRC, 1 edition, May 1994. 53

Ronald Aylmer Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 1922. 43

Arpita Ghosh and Mohammad Mahdian. Externalities in online advertising. In WWW, 2008. ACM Press. 81

Thore Graepel, Joaquin Q. Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML. ACM Press, 2010. 82

Artem Grotov, Aleksandr Chuklin, Ilya Markov, Luka Stout, Finde Xumara, and Maarten de Rijke. A comparative study of click models for web search. In CLEF, 2015. 59

Zhiwei Guan and Edward Cutrell. An eye tracking study of the effect of target rank on web search. In CHI, 2007. ACM Press. 1

Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. Click chain model in web search. In WWW, 2009a. ACM Press. 14, 30, 44, 45, 66

Fan Guo, Chao Liu, and Yi Min Wang. Efficient multiple-click models in web search. In WSDM, 2009b. ACM Press. 2, 13, 25, 26, 45, 51, 53

Ahmed Hassan and Ryen W. White. Personalized models of search satisfaction. In CIKM, 2013. ACM Press. 82


Jiyin He, Marc Bron, Arjen P. de Vries, Leif Azzopardi, and Maarten de Rijke. Untangling result list refinement and ranking quality: A model for evaluation and prediction. In SIGIR, August 2015. ACM Press. 81

Yin He and Kuansan Wang. Inferring search behaviors using partially observable Markov model with duration (POMD). In WSDM, 2011. ACM Press. 71

Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. A probabilistic method for inferring preferences from clicks. In CIKM, 2011. ACM Press. 69, 77

Katja Hofmann, Anne Schuth, Shimon Whiteson, and Maarten de Rijke. Reusing historical interaction data for faster online learning to rank for IR. In WSDM, 2013. ACM Press. 69, 70, 77

Botao Hu, Yuchen Zhang, Weizhu Chen, Gang Wang, and Qiang Yang. Characterizing search intent diversity into click models. In WWW, 2011. ACM Press. 65

Jeff Huang, Ryen W. White, and Susan Dumais. No clicks, no problem: Using cursor movements to understand and improve search. In CHI, 2011. ACM Press. 68

Jeff Huang, Ryen W. White, Georg Buscher, and Kuansan Wang. Improving searcher models using mouse cursor activity. In SIGIR, 2012. ACM Press. 2, 68

Bouke Huurnink, Laura Hollink, Wietske van den Heuvel, and Maarten de Rijke. Search behavior of media professionals at an audiovisual archive: A transaction log analysis. Journal of the American Society for Information Science and Technology, 61(6):1180–1197, June 2010. 81

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002. 47, 77, 78

Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In SIGIR, 2005. ACM Press. 1, 2, 10, 11

Eugene Kharitonov, Craig Macdonald, Pavel Serdyukov, and Iadh Ounis. User model-based metrics for offline query suggestion evaluation. In SIGIR, 2013. ACM Press. 79

Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140–181, July 2008. 45, 75

Ron Kohavi, Alex Deng, and Roger Longbotham. Seven rules of thumb for web site experimenters. In KDD, 2014. ACM Press. 1

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, August 2009. 3, 7, 27, 80


Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951. 46

Dmitry Lagun, Chih-Hung Hsieh, Dale Webster, and Vidhya Navalpakkam. Towards better measurement of attention and satisfaction in mobile search. In SIGIR, 2014. ACM Press. 81

Chao Liu, Fan Guo, and Christos Faloutsos. BBM: Bayesian browsing model from petabyte-scale data. In KDD, 2009. ACM Press. 42, 51, 71

Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009. 47

Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, and Hang Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007. ACM Press. 69, 76

Yiqun Liu, Chao Wang, Ke Zhou, Jianyun Nie, Min Zhang, and Shaoping Ma. From skimming to reading: A two-stage examination model for web search. In CIKM, 2014. ACM Press. 68

Lori Lorigo, Maya Haridasan, Hronn Brynjarsdottir, Ling Xia, Thorsten Joachims, Geri Gay, Laura Granka, Fabio Pellacini, and Bing Pan. Eye tracking and online search: Lessons learned and challenges ahead. Journal of the American Society for Information Science and Technology, 59(7):1041–1052, 2008. 1

Ilya Markov, Eugene Kharitonov, Vadim Nikulin, Pavel Serdyukov, Maarten de Rijke, and Fabio Crestani. Vertical-aware click model-based effectiveness metrics. In CIKM, 2014. ACM Press. 79

Ilya Markov, Aleksandr Chuklin, and Yiqun Liu. Modeling search behavior in heterogeneous environments. In Proceedings of HIA, February 2015. ACM Press. 81

Alistair Moffat and Justin Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems, 27(1):1–27, December 2008. 2

Kevin Patrick Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2013. 3, 43

Yaser Norouzzadeh Ravari, Ilya Markov, Artem Grotov, Maarten Clements, and Maarten de Rijke. User behavior in location search on mobile devices. In ECIR. Springer, April 2015. 81

Umut Ozertem, Rosie Jones, and Benoit Dumoulin. Evaluating new search engine configurations with pre-existing judgments and clicks. In WWW, 2011. ACM. 77

Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346–374, 2010. 76

Filip Radlinski and Nick Craswell. Optimized interleaving for online retrieval evaluation. In WSDM, 2013. ACM Press. 46


Filip Radlinski, Madhu Kurup, and Thorsten Joachims. How does clickthrough data reflect retrieval quality? In CIKM, 2008. ACM Press. 75

Steffen Rendle. Social network and click-through prediction with factorization machines. In KDD Cup Workshop, 2012. ACM Press. 72

Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: Estimating the click-through rate for new ads. In WWW, 2007. ACM Press. 45, 72, 82

Kerry Rodden, Xin Fu, Anne Aula, and Ian Spiro. Eye-mouse coordination patterns on web search results pages. In CHI, 2008. 68

Mark Sanderson. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4):247–375, 2010. 69

Anne Schuth, Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. Lerot: An online learning to rank framework. In Living Labs '13: Workshop on Living Labs for Information Retrieval Evaluation, November 2013. ACM Press. 52

Pavel Serdyukov, Georges Dupret, and Nick Craswell. WSCD2013: Workshop on web search click data 2013. In WSDM, 2013. ACM Press. 50

Pavel Serdyukov, Georges Dupret, and Nick Craswell. Log-based personalization: The 4th web search click data (WSCD) workshop. In WSDM, 2014. ACM. 50

Si Shen, Botao Hu, Weizhu Chen, and Qiang Yang. Personalized click model through collaborative filtering. In WSDM, 2012. ACM Press. 42, 66, 76

Jaime Teevan, Susan T. Dumais, and Eric Horvitz. Personalizing search via automated analysis of interests and activities. In SIGIR, 2005. ACM Press. 2

Andrew Turpin, Falk Scholer, Kalervo Järvelin, Mingfang Wu, and J. Shane Culpepper. Including summaries in system evaluation. In SIGIR, 2009. ACM Press. 11, 69

Ellen M. Voorhees and Donna Harman. Overview of TREC 2001. In TREC. NIST, 2001. 69

Chao Wang, Yiqun Liu, Min Zhang, Shaoping Ma, Meihong Zheng, Jing Qian, and Kuo Zhang. Incorporating vertical results into search click models. In SIGIR, 2013a. ACM Press. 1, 2, 47, 62, 63, 70

Chao Wang, Yiqun Liu, Meng Wang, Ke Zhou, Jian-Yun Nie, and Shaoping Ma. Incorporating non-sequential behavior into click models. In SIGIR, 2015. ACM. 47, 71

Hongning Wang, ChengXiang Zhai, Anlei Dong, and Yi Chang. Content-aware click modeling. In WWW, 2013b. 73


Kuansan Wang, Nikolas Gloy, and Xiaolong Li. Inferring search behaviors using partially observable Markov (POM) model. In WSDM, 2010. ACM Press. 71

Ryen W. White and Steven M. Drucker. Investigating behavioral variability in web search. In WWW, 2007. ACM Press. 65

Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005. 7

Qianli Xing, Yiqun Liu, Jian-Yun Nie, Min Zhang, Shaoping Ma, and Kuo Zhang. Incorporating user preferences into click models. In CIKM, 2013. ACM Press. 2, 46, 67

Chenyan Xiong, Taifeng Wang, Wenkui Ding, Yidong Shen, and Tie-Yan Liu. Relational click prediction for sponsored search. In WSDM, 2012. ACM Press. 82

Danqing Xu, Yiqun Liu, Min Zhang, Shaoping Ma, and Liyun Ru. Incorporating revisiting behaviors into click models. In WSDM, 2012. ACM Press. 71

Wanhong Xu, Eren Manavoglu, and Erick Cantu-Paz. Temporal click model for sponsored search. In SIGIR, 2010. ACM Press. 71

Emine Yilmaz, Milad Shokouhi, Nick Craswell, and Stephen Robertson. Expected browsing utility for web search evaluation. In CIKM, 2010. ACM Press. 78

Dawei Yin, Shike Mei, Bin Cao, Jian-Tao Sun, and Brian D. Davison. Exploiting contextual factors for click modeling in sponsored search. In WSDM, 2014. ACM Press. 51, 82

Yuchen Zhang, Dong Wang, Gang Wang, Weizhu Chen, Zhihua Zhang, Botao Hu, and Li Zhang. Learning click models via probit Bayesian inference. In CIKM, 2010. ACM Press. 42, 51

Yuchen Zhang, Weizhu Chen, Dong Wang, and Qiang Yang. User-click modeling for understanding and predicting search-behavior. In KDD, 2011. ACM Press. 1, 5, 64, 65

Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. Sequential click prediction for sponsored search with recurrent neural networks. In AAAI. AAAI Press, 2014. 72, 80

Feimin Zhong, Dong Wang, Gang Wang, Weizhu Chen, Yuchen Zhang, Zheng Chen, and Haixun Wang. Incorporating post-click behaviors into a click model. In SIGIR, 2010. ACM Press. 72

Ke Zhou, Mounia Lalmas, Tetsuya Sakai, Ronan Cummins, and Joemon M. Jose. On the reliability and intuitiveness of aggregated search metrics. In CIKM, 2013. ACM Press. 47

Zeyuan Allen Zhu, Weizhu Chen, Tom Minka, Chenguang Zhu, and Zheng Chen. A novel click model and its applications to online advertising. In WSDM, 2010. ACM Press. 7, 45, 46, 73, 82


Authors’ Biographies

ALEKSANDR CHUKLIN

Aleksandr Chuklin is a Software Engineer working on search problems at Google Switzerland. Apart from his projects at Google he is also working with the Information and Language Processing Systems group at the University of Amsterdam on a number of research topics. He received his MSc degree from the Moscow Institute of Physics and Technology in 2012. His main research interests are modeling and understanding user behavior on a search engine result page. Aleksandr has a number of publications on click models and their applications at SIGIR, CIKM and ECIR. He is also a PC member of the CIKM and WSDM conferences.

ILYA MARKOV

Ilya Markov is a postdoctoral researcher and SNF fellow at the University of Amsterdam. His research agenda builds around information retrieval methods for heterogeneous search environments. Ilya has experience in federated search, user behavior analysis, click models and effectiveness metrics. He is a PC member of leading IR conferences, such as SIGIR, WWW and ECIR, a PC chair of the RuSSIR 2015 summer school and a co-organizer of the IMine-2 task at NTCIR-12. Ilya is currently teaching an MSc course on web search and has previously taught information retrieval courses at the BSc and MSc levels and given tutorials at conferences and summer schools in IR (ECIR, RuSSIR).

MAARTEN DE RIJKE

Maarten de Rijke is Professor of Computer Science at the University of Amsterdam. He leads a large team of researchers in information retrieval. His recent research focus is on (online) ranking and evaluation and on semantic search. Maarten has authored over 650 papers, many of which are core to this tutorial, and is Editor-in-Chief of ACM Transactions on Information Systems. He has supervised or is supervising over 40 Ph.D. students. He has taught at the primary school, high school, BSc, MSc and Ph.D. levels, as well as for general audiences, with recent tutorials at ECIR, ESSIR and SIGIR.


Index

A/B testing, 75
abandonment rate, 75
actual relevance, 15
aggregated search, 47, 61, 77
AOL query log, 49
attention bias, 1
attraction bias, 62
attractiveness, 11, 33
attractiveness probability, 30

Bayesian browsing model, 42
Bayesian network, 7
Bayesian sequential state model, 73
BBM, 42
BSS, 73

cascade model, 12
CCM, 14
click chain model, 14
click log, 6
click model, 5
click-through rate, 45
click-through rate model, 10
  document-based, 10
  rank-based, 10

ClickModels, 51
CM, 12
conditional perplexity, 44
Cranfield evaluation methodology, 78
cross-entropy, 43
CTR model, 10

DBN, 15
DBN-IA, 63
DBN-intents, 66
DBN-VP, 66
DCG, 78
DCM, 13
DCTR, 10
deep probit Bayesian inference, 42
dependency graph, 5
dependent click model, 13
DIL-user, 67
dilution model, 67
dynamic Bayesian network model, 15

EBU, 78
EM, 23, 27
ERR, 78
ESS, 27
estimation, 6, 23
evidence, 5
examination hypothesis, 10
expectation maximization, 27
expectation-maximization algorithm, 23
Expected Browser Utility, 78
Expected Reciprocal Rank, 78
expected sufficient statistics, 27

FCM, 62
FCM-UBM, 62
federated click model, 62
federated search, 61
first place bias, 63


GCM, 73
general click model, 73
global bias, 63
graphical model, 7
  probabilistic, 3, 80

H-CCM, 42
H-DBN, 42
H-UBM, 42
hidden events, 5
hierarchical model, 42
HPCM, 67
hybrid personalized click model, 67

indicator function, 7
Infer.NET, 51
inference, 7
interleaving, 75

Lerot, 52
LETOR dataset, 76
likelihood, 23
log-likelihood, 43
LOG-rd-user, 67

matrix factorization click model, 42, 66
maximum likelihood estimation, 23
mean reciprocal rank, 75
metric
  effort-based, 78
  utility-based, 78

MFCM, 42, 66
MLE, 23
MRR, 75
MSN 2006 dataset, 49
multitasking, 64

N-DBN, 70
N-UBM, 70
novelty bias, 1

observed events, 5

parameter learning, 6, 23
parents, 27
partially sequential click model, 71
PBM, 11
PCC, 72
PCM, 66
perceived relevance, 11, 33
perplexity, 43
  conditional, 44
  gain, 44

PGM, 80
position bias, 1
position-based model, 11
post-click click model, 72
probabilistic graphical model, 3, 80
PSCM, 71
PyClick, 51
PyPy, 53

random click model, 9
RCM, 9
RCTR, 10

satisfaction, 33
satisfaction probability, 15
SDBN, 16
SDCM, 25
search task, 64
sequence bias, 63
SERP, 5
session, 5
simplified DBN model, 16
simplified DCM, 25
simulating users, 75
simulations, 75
SogouQ dataset, 50

TCM, 64, 71
temporal click model, 71


temporal hidden click model, 71
THCM, 71

UBM, 13
UBM-IA, 63
UBM-user, 67
Unbiased-DBN, 65
Unbiased-UBM, 65
URL, 6
user browsing model, 13
user experience, 77
UX, 77

varying persistence, 66
VCM, 62
Vertical Click Model, 62
verticals, 61

whole page click model, 71
WPC, 71
WSCD dataset, 49

Yandex dataset, 49