Optimizing Search Engines using Click-through Data

By
Sameep - 100050003
Rahee - 100050028
Anil - 100050082

Friday, 15 March 2013

Overview

• Web Search Engines: Creating a good information retrieval system

• Previous Approaches: TF-IDF, PageRank

• Machine learning model

• User Feedback using Clickthrough Data

• Ranking SVM and Kendall’s τ

• Experimental Results


Introduction to IR

• Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.

• Searches can be based on metadata or on full-text indexing.

• Web search engines are the most visible IR applications.


Web Search Engine

• Creating a search engine which scales even to today's web presents many challenges.


What is the problem?

• Which WWW page(s) does a user actually want to retrieve when he types some keywords into a search engine?

• There are typically thousands of pages that contain these words, but the user is interested in a much smaller subset.

• Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below.


Evolution of search algorithms

• 1994: TF-IDF

• 1999: PageRank

• 2002: ML


TF-IDF

• Term Frequency–Inverse Document Frequency (tf-idf) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

• The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.

• One of the simplest ranking functions is computed by summing the tf–idf for each query term.
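
The ranking function described above can be sketched in a few lines. This is a minimal illustration, not a production scorer: it assumes whitespace tokenization, raw term counts for tf, and idf = log(N / df), all of which are choices made here for the example. The function name `tf_idf_scores` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document by summing tf-idf over the query terms.

    Assumptions for this sketch: whitespace tokenization, raw counts as tf,
    and idf = log(N / df) where df is the number of documents with the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term occurs nowhere in the corpus
            score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


docs = ["the cat sat", "the dog sat on the log", "cats and dogs"]
print(tf_idf_scores("cat", docs))
```

A term that occurs in every document gets idf = log(1) = 0 and so contributes nothing to the score, which is the "offset by corpus frequency" effect in action.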


PageRank - Bringing Order to Web

• Plain tf-idf performs poorly on the web.

• The citation (link) graph of the web is an important resource that was largely going unused in existing web search engines at that time.

• PageRank captures how well linked a document is on the web. This is a good indicator of the quality of a web page.


PageRank Computation

• We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

• Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. (Strictly, this requires the constant term to be (1-d)/N for N pages rather than (1-d).)

• Computation is iterative and converges.
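
The iterative computation can be sketched as follows. This is an illustrative sketch under stated assumptions: it uses the normalized constant term (1-d)/N so that the ranks sum to one, it spreads the rank of pages without out-links uniformly (a common convention, not something the slides specify), and it runs a fixed number of iterations instead of testing convergence. The name `pagerank` and the `links` format are hypothetical.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to; every linked
    page is assumed to appear as a key. Uses the normalized constant
    (1 - d) / N, so the returned ranks sum to one.
    """
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        new_pr = {page: (1.0 - d) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Each out-link receives an equal share PR(T) / C(T).
                share = d * pr[page] / len(outgoing)
                for target in outgoing:
                    new_pr[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                for target in pages:
                    new_pr[target] += d * pr[page] / n
        pr = new_pr
    return pr


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```

In this small graph, page C is cited by both A and B and therefore ends up with a higher rank than B, which is cited only by A.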


Architecture

• Let us follow how results actually end up in your browser when you type in a query and hit Enter.


Google way back in 2001 - Documents ranked by PageRank


Machine-learned ranking

• Learning to rank or machine-learned ranking (MLR) is a type of supervised or semi-supervised machine learning problem in which the goal is to automatically construct a ranking model from training data.

• Training data consists of lists of items with some partial order specified between items in each list.

• This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item.

• The ranking model's purpose is to rank, i.e. to produce a permutation of the items in new, unseen lists that is "similar" in some sense to the rankings in the training data.


Motivation for better learning model

• While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts.

• This makes them difficult and expensive to apply.


User Feedback Model

• One could simply ask the user for feedback.

• If we knew the set of pages actually relevant to the user’s query, we could use this as training data for optimizing (and even personalizing) the retrieval function.

• Unfortunately, experience shows that users are only rarely willing to give explicit feedback.


Click-through Data Method

• Sufficient information is already hidden in the logfiles of WWW search engines.

• Since major search engines receive millions of queries per day, such data is available in abundance.

• Compared to explicit feedback data, which is typically elicited in laborious user studies, any information that can be extracted from logfiles is virtually free and substantially more timely.


Basics of Click-through Data

• What is click-through data?

• How can it be recorded?

• How can it be used to generate training examples in the form of preferences?


What is Click-through Data?

• Click-through data can be thought of as triplets (q, r, c) where

• q is the user query,

• r is the ranking presented to the user and

• c is the set of links the user clicked on.


Recording Click-through Data

• For recording the clicks, a simple proxy system can keep a logfile where each query is assigned a unique ID.

• The links on the results-page presented to the user do not lead directly to the suggested document, but point to a proxy server.

• When the user clicks on the link, the proxy server records the URL and the query ID in the click log and then redirects the user to the target URL.

• This process can be made transparent to the user and does not influence system performance.
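
The recording scheme above can be illustrated with a small in-memory stand-in for the proxy. This is only a sketch: the class `ClickLogger`, its method names, and the `/click` endpoint are hypothetical, and a real proxy would persist its logs and issue an HTTP redirect instead of returning the URL.

```python
import itertools
from urllib.parse import urlencode


class ClickLogger:
    """In-memory stand-in for the proxy's query log and click log."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.query_log = {}  # query ID -> (query, presented ranking)
        self.click_log = []  # (query ID, clicked URL), in click order

    def log_query(self, query, ranking):
        """Assign a unique ID to the query and remember the ranking shown."""
        query_id = next(self._next_id)
        self.query_log[query_id] = (query, list(ranking))
        return query_id

    def proxy_link(self, query_id, url):
        """Build the link shown on the results page ("/click" is hypothetical)."""
        return "/click?" + urlencode({"qid": query_id, "url": url})

    def handle_click(self, query_id, url):
        """Record the click, then return the URL to redirect the user to."""
        self.click_log.append((query_id, url))
        return url

    def triplets(self):
        """Reassemble (q, r, c) triplets from the two logs."""
        result = []
        for query_id, (query, ranking) in self.query_log.items():
            clicked = [u for qid, u in self.click_log if qid == query_id]
            result.append((query, ranking, clicked))
        return result


logger = ClickLogger()
qid = logger.log_query("svm", ["u1", "u2", "u3"])
logger.handle_click(qid, "u2")
print(logger.triplets())
```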


Click-through data logging in Google


What kind of information does Click-through data convey?

• There are strong dependencies between the three parts of (q, r, c).

• The presented ranking r depends on the query q as determined by the retrieval function implemented in the search engine.

• Furthermore, the set c of clicked-on links depends on both the query q and the presented ranking r.

• A user is more likely to click on a link if it is relevant to q. While this dependency is desirable and interesting for analysis, the dependency of the clicks on the presented ranking r muddies the water.

• Thus clicks yield only relative relevance judgments, not absolute ones.


Example

• Denoting the ranking preferred by the user with r∗, clicks on links 1, 3 and 7 give partial (and potentially noisy) information of the form

link3 <r∗ link2
link7 <r∗ link2
link7 <r∗ link4
link7 <r∗ link5
link7 <r∗ link6


Algorithm: Extracting Preference Feedback from Clickthrough

• For a ranking (link1 , link2 , link3 , ...) and a set C containing the ranks of the clicked-on links, extract a preference example

link_i <r∗ link_j for all pairs 1 ≤ j < i, with i ∈ C and j ∉ C.
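
A direct transcription of this extraction rule might look like the sketch below. The function name `extract_preferences` is hypothetical, and each pair (i, j) in the output stands for "link i is preferred over link j". The sample click set {1, 3, 7} is the one consistent with the preferences listed in the Example slide.

```python
def extract_preferences(clicked_ranks):
    """Return pairs (i, j) meaning link_i <r* link_j.

    A pair is produced for every clicked rank i and every skipped
    rank j above it, i.e. 1 <= j < i with i in C and j not in C.
    """
    clicked = set(clicked_ranks)
    preferences = []
    for i in sorted(clicked):
        for j in range(1, i):
            if j not in clicked:
                preferences.append((i, j))
    return preferences


print(extract_preferences({1, 3, 7}))
# [(3, 2), (7, 2), (7, 4), (7, 5), (7, 6)]
```

Note that a click on rank 1 never produces a preference, because no skipped link precedes it.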


Framework for Learning Retrieval Function

• For a query q and a document collection D = {d1 , ..., dm }, the optimal retrieval system should return a ranking r∗ that orders the documents in D according to their relevance to the query.

• Typically, retrieval systems do not achieve an optimal ordering r∗. Instead, an operational retrieval function f is evaluated by how closely its ordering rf(q) approximates the optimum.

• Formally, both r∗ and rf(q) are binary relations over D × D such that if a document di is ranked higher than dj for an ordering r, i.e. di <r dj, then (di,dj) ∈ r, otherwise (di , dj ) ∉ r.


Kendall’s τ as performance measure

• A pair di ≠ dj is concordant, if both ra and rb agree in how they order di and dj . It is discordant if they disagree.

• P : number of concordant pairs
Q : number of discordant pairs
P + Q = C(m, 2) = m(m-1)/2 on a finite domain D of m documents
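
With P and Q defined as above, Kendall's τ is the standard statistic τ(ra, rb) = (P - Q) / (P + Q). The sketch below computes it for two strict orderings of the same documents, each given as a list from best to worst; the function name `kendall_tau` and this list representation are assumptions made for the example.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two strict orderings of the same items.

    Each ranking is a list of items from best to worst.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}

    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # The pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


print(kendall_tau([1, 2, 3, 4, 5], [3, 2, 1, 4, 5]))
```

In this usage example the two rankings disagree on the pairs (1, 2), (1, 3) and (2, 3), so Q = 3 and P = 7 out of C(5, 2) = 10 pairs. Identical rankings give τ = 1 and exactly reversed rankings give τ = -1.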


Problem at Hand

• Given an independently and identically distributed training sample S of size n containing queries q with their target rankings r∗,

(q1, r∗1), (q2, r∗2), ..., (qn, r∗n),

the learner L will select a ranking function f from a family of ranking functions F that maximizes the empirical τ on the training sample.


Ranking SVM (Support Vector Machine)

• Ranking SVM is used to adaptively sort web pages by their relevance to a specific query.

• Generally, Ranking SVM includes three steps in the training period:

1. It maps the similarities between queries and the clicked pages onto a certain feature space.

2. It calculates the distances between any two of the vectors obtained in step 1.

3. It forms an optimization problem similar to SVM classification and solves it with a regular SVM solver.


Mapping function

• A mapping function is required to define relevance of a web-page to the query.

• Φ(q, d) is a mapping onto features that describe the match between query q and document d.

• Such features are, for example, the number of words that query and document share, the number of words they share inside certain HTML tags (e.g. TITLE, H1, H2, ...), or the PageRank of d.

• These features, combined with the user's click-through data (which implies a preference ordering of pages for a specific query), can be considered as the training data for machine learning algorithms.
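
A toy version of such a mapping is sketched below. It is only an illustration of the idea: the function name `phi`, the dictionary keys `title`, `body` and `pagerank`, and the choice of exactly three features are assumptions for this example, not the feature set used in the paper.

```python
def phi(query, doc):
    """Map a (query, document) pair to a small feature vector.

    doc is a dict with the hypothetical keys "title", "body" and
    "pagerank". Features: shared words in the body, shared words in
    the title, and the document's PageRank.
    """
    query_words = set(query.lower().split())
    body_words = set(doc["body"].lower().split())
    title_words = set(doc["title"].lower().split())
    return [
        float(len(query_words & body_words)),
        float(len(query_words & title_words)),
        float(doc["pagerank"]),
    ]


page = {
    "title": "Ranking SVM",
    "body": "the ranking svm learns a ranking",
    "pagerank": 0.3,
}
print(phi("ranking svm", page))
# [2.0, 2.0, 0.3]
```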


Finding the weight vector w

• We need to find a weight vector w that fulfils as many as possible of the inequalities w·Φ(q, di) > w·Φ(q, dj), one for each pair (di, dj) ∈ r∗.

• This is, however, NP-hard.

Figure: how 4 points are ranked by w1 and w2


Slack

• It is possible to approximate the solution by introducing (non-negative) slack variables ξi,j,k and minimizing Σ ξi,j,k, an upper bound on the number of violated inequalities.

• C is a parameter that allows trading off margin size against training error.

• Optimization Problem 1 is convex and has no local optima.
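
To make the optimization concrete, the sketch below minimizes the equivalent unconstrained objective 0.5·(w·w) + C·Σ max(0, 1 - w·(Φ(q, di) - Φ(q, dj))) over all preference pairs, where each max term plays the role of a slack variable. It uses plain subgradient descent, which is a swapped-in technique chosen for brevity and not the quadratic-programming solver an SVM package would use. The function names, learning rate and epoch count are assumptions for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_ranking_svm(pairs, n_features, c=1.0, learning_rate=0.01, epochs=200):
    """Learn w from preference pairs by subgradient descent.

    pairs is a list of (phi_preferred, phi_other) feature vectors.
    Objective: 0.5 * w.w + c * sum(max(0, 1 - w.(phi_preferred - phi_other))).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = list(w)  # gradient of the margin term 0.5 * w.w
        for preferred, other in pairs:
            diff = [p - o for p, o in zip(preferred, other)]
            if dot(w, diff) < 1.0:
                # This pair has positive slack, so it contributes -c * diff.
                for k in range(n_features):
                    grad[k] -= c * diff[k]
        w = [wk - learning_rate * gk for wk, gk in zip(w, grad)]
    return w


pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([2.0, 1.0], [1.0, 1.0]),
    ([3.0, 0.5], [1.0, 2.0]),
]
w = train_ranking_svm(pairs, n_features=2)
print(w)
```

Documents for a new query can then be sorted by the score w·Φ(q, d), highest first. Because the objective is convex, a small enough step size brings the iterates close to the global optimum regardless of the starting point.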


Using Partial Feedback

• If clickthrough logs are the source of training data, the full target ranking r∗ for a query q is not observable.

• It is straightforward to adapt the Ranking SVM to the case of such partial data by replacing r∗ with the observed preferences r′.

• We are given a training set S : (q1, r′1), (q2, r′2), ..., (qn, r′n) containing training data.

• The resulting retrieval function is defined analogously to the full-ranking case. Using the algorithm results in finding a ranking function that has a low number of discordant pairs with respect to the observed parts of the target ranking.


Experiments

• Need to verify

1. The Ranking SVM can indeed learn a retrieval function maximizing Kendall’s τ on partial preference feedback.

2. The learned retrieval function does improve retrieval quality as desired.


Experiment Setup: Meta-Search

• To elicit data and provide a framework for testing the algorithm, a WWW meta-search engine called “Striver” was implemented.

• Meta-search engines combine the results of several basic search engines without having a database of their own.

• Striver forwards the user's query to the search engines “Google”, “MSNSearch”, “Excite”, “Altavista”, and “Hotbot” to get a set of relevant documents, then ranks them with the learned retrieval function before returning them to the user.


Comparing Different Retrieval Functions

• The key idea is to present two rankings at the same time combined into one.

• The ranking should be such that if the user scans the links of C (the combined ranking of A and B) from top to bottom, at any point he has seen almost equally many links from the top of A as from the top of B.

• This particular form of presentation leads to a blind statistical test so that the clicks of the user demonstrate unbiased preferences.
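
One way to build a combined ranking with this property is to alternate between the two lists, skipping links that are already present. The sketch below is an illustrative take on that idea; the function name `combine` and the handling of lists with unequal lengths are assumptions made here.

```python
def combine(ranking_a, ranking_b):
    """Interleave two rankings so that any prefix of the result covers
    almost equally many top links of A and of B."""
    combined = []
    seen = set()
    ka = kb = 0  # how many links of A and of B have been consumed so far

    while ka < len(ranking_a) or kb < len(ranking_b):
        # Take from A when both lists are level (or B is exhausted),
        # otherwise take from B so that it catches up.
        take_a = ka < len(ranking_a) and (ka == kb or kb >= len(ranking_b))
        if take_a:
            link = ranking_a[ka]
            ka += 1
        else:
            link = ranking_b[kb]
            kb += 1
        if link not in seen:  # a link shared by both lists is shown once
            seen.add(link)
            combined.append(link)
    return combined


print(combine(["a", "b", "c", "d"], ["b", "e", "a", "f"]))
# ['a', 'b', 'e', 'c', 'd', 'f']
```

In the usage example, the top three links of the combined list contain the top two of A (a, b) and the top two of B (b, e). The user cannot tell which engine contributed a link, which is what makes the comparison blind.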


Offline Experiment

• This experiment verifies that the Ranking SVM can indeed learn regularities using partial feedback from clickthrough data.

• To generate a first training set, the Striver search engine was used. Striver displayed the results of Google and MSNSearch using the combination method from the previous section.

• All clickthrough triplets were recorded. This resulted in 112 queries with a non-empty set of clicks. This data provides the basis for the offline experiment.


Results of Offline Experiment


Interactive Online Experiment

• To show that the learned retrieval function improves retrieval, the Striver search engine was made available to a group of approximately 20 users.

• The system collected 260 training queries (with at least one click). On these queries, the Ranking SVM was trained.

• During evaluation, the learned retrieval function is compared against Google, MSNSearch and Toprank (a meta-search engine).


Conclusions

• We presented an approach to mining logfiles of WWW search engines with the goal of improving their retrieval performance automatically.

• The key insight is that clickthrough data can provide training data in the form of relative preferences.

• Taking a Support Vector approach, the resulting training problem is tractable even for large numbers of queries and large numbers of features.

• Experimental results show that the algorithm derived in this paper for learning a ranking function performs well in practice, successfully adapting the retrieval function of a meta-search engine to the preferences of a group of users.


Food For Thought

• There is a trade-off between the amount of training data (i.e. large group) and maximum homogeneity (i.e. single user). What is a good size of a user group and how can such groups be determined?

• Is it possible to use clustering algorithms to find homogeneous groups of users?

• Can click-through data also be used to adapt a search engine not to a group of users, but to the properties of a particular document collection?


References

• Brin, Sergey, and Lawrence Page. "The anatomy of a large-scale hypertextual Web search engine." Computer networks and ISDN systems 30.1 (1998): 107-117.

• Joachims, Thorsten. "Optimizing search engines using clickthrough data." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002.

• Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.


Thank You