optimizing search engines using clickthrough data

Optimizing search engines using clickthrough data

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟΣΧΟΛΗ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝΤΟΜΕΑΣ ΤΕΧΝΟΛΟΓΙΑΣ ΠΛΗΡΟΦΟΡΙΚΗΣ ΚΑΙ ΥΠΟΛΟΓΙΣΤΩΝ

2

Problem Optimization of web search results ranking

Ranking algorithms mainly based on similarity Similarity between query keywords and page text keywords Similarity between pages (Page Rank)

No consideration for user personal preferences

Digital librariesHigh rank

VacationLow rank

3

Problem E.g.: query “delos”

Digital libraries Archaeology Vacation

Room for improvement by incorporating user behaviour data: user feedback Use of previous implicit search information to enhance result ranking

VacationLow rank

Digital librariesHigh rank

4

Types of user feedbackExplicit feedback

User explicitly judges relevance of results with the query

Costly in time and resources Costly for user limited effectiveness Direct evaluation of results

Implicit feedback Extracted from log files

Large amount of information Indirect evaluation of results through click behaviour

5

Implicit feedback (Categories) Clicked results

Absolute relevance: Clicked result -> Relevant

Risky: poor user behaviour quality Percentage of result clicks for a query Frequently clicked groups of results Links followed from result page

Relative relevance: Clicked result -> More relevant than non-clicked

More reliable Time

Between clicks E.g., fast clicks maybe bad results

Spent on a result page E.g., a lot of time relevant page

First click, scroll E.g., maybe confusing results

6

Implicit feedback (Categories) Query chains: sequences of reformulated queries to

improve results of initial search Detection:

Query similarity Result sets similarity Time

Connection between results of different queries: Enhancement of a bad query with another from the query chain Comparison of result relevance between different query results

Scrolling Time (quality of results) Scrolling behaviour (quality of results) Percentage of page (results viewed)

Other features Save, print, copy etc of result page maybe relevant Exit type (e.g. close window poor results)

7

Joachims approach Clickthrough data

Relative relevance Indicated by user behaviour study

Method Training of svm functions Training input: inequations of query result rankings Trained function: weight vector for features examined Use of trained vector to give weights to examined

features Experiments

Comparison of method with existing search engines

8

Clickthrough data Form

Triplets (q,r,c) = (query,ranked results,links clicked) …in the log file of a proxy server

Relative relevance (regarding a specific query) dk more relevant than dl if

dk clicked and dl not clicked and dk lower initial ranking than dl

d5 more relevant than d4d5 more relevant than d2

d3 more relevant than d2

9

Clickthrough data (Studies) Why not absolute relevance?

User behaviour influenced by initial ranking

Study: Rank and viewership Percentage of queries where a user

viewed the search result presented at a particular rank

Conclusion: Most times users view only the few first results

Study: Rank and viewership Percentage of queries where a user

clicked the result presented at a particular rank, both in normal and swapped conditions

Conclusion: Users tend to click on higher ranked results irrespective of content

10

Method (System train input) Data extracted from log

Relevance inequalities: dk <rq dl dk more relevant than dl for query q

Construct relevance inequalites for every query q and every result pair (dk , dl) for q

For each link di (result of query q): construct a feature vector Φ(q,di)



11

Method (System train input) Feature vector:

Describes the quality of the match between a document di and a query q

Φ(q,di) = [rank_X, top1_X, top10_X, …..]

12

Method (System train input) Weight vector:

w = [w1, w2,…,wn] Assigns weights to each of the features in Φ(q,di) Sdi = w * Φ(q,di) assigns a score to document di for

query q w: unknown must be trained by solving the system of

relative relevance inequalities derived from clickthrough data

13

Method (System training) Initial input transformation:

dk <rm dl => Sdk >rm Sdl => w * Φ(qm,dk) > w * Φ(qm,dl) => [w1, w2,…,wn] Φ(qm,dk) > [w1, w2,…,wn] Φ(qm,dl) System of relevance inequalites for every query q and

every result pair (dk , dl) for q Object of training:

Knowing, for every clicked link di , its feature vector, Find an optimal weight vector w which satisfies most of

the inequalities above (optimal solution) With vector w known, we can calculate a score for every

link, whether clicked or not, hence a new ranking



14

Method (System training) Intuitively

Example: 2 dimension feature vector X-axis: similarity between query terms and link text Y-axis: time spent on link page

Looking for optimal weight vector w If training input: r1 < r2 < r3 < r4

Optimal solution w1: text similarity more important for ranking If training input: r2 < r3 < r1 < r4

Optimal solution w2: time more important for ranking w optimized by maximizing δ

Link 1

Link 2Link 3

Link 4

15

Experiments Based on a meta-search engine

Combination of results from different search engines

“Fair” presentation of results: For a meta-meta-search engine combining 2 engines: Top z results of meta-search contain x results from engine A y results from engine b x + y = z |x – y| <= 1

16

Experiments Meta-search

engine Assumptions

Users click more often on more relevant links

Preference for one of the search engines not influenced by ranking

17

Experiments (Results) Conditions

20 users (university students) 260 training queries collected 20 days of training 10 days of evaluation

Comparison with Google MSNSearch Toprank (meta-search for Google, MSNSearch, Altavista, Excite,

Hotbot) Results

18

Experiments (Results) Weight vector

19

Open issues Trade-off between amount of training data and

homogeneity Clustering algorithms to find homogenous groups

of users Adaptation to the properties of a particular

document collection Incremental online learning/feedback algorithm Protection from spamming

20

Joachims approach II Clickthrough data

Addition of new relative relevance criterionsAddition of query chains

Method Modified feature vector

Rank features Term/document features

Constraints to the svm optimization problem To avoid trivial/incorrect solutions

21

Query chains Sequences of reformulated queries

Poor results for first query Too many/few terms Not representative terms

q1: “special collections” q2: “rare books” Incorrect spelling

q1: “Lexis Nexus” q2: “Lexis Nexis” Execution of new query

Addition/deletion of query terms to get better results In every query of the sequence, searching for the same thing

Ways of detection Term similarity between queries Percentage of common results Time between queries

22

Query chains detection method 1st strategy

Data recorded from log for 1285 queries: query, date, IP address, results returned, number of

clicks on the results, session id uniquely assigned Manual grouping of queries in query chains Division of data set into training and testing set Training

Feature vector for every pair of query Training of a weight vector indicating which features

are more important for a query pair to be a query chain

Testing Recognition of query chains in testing set Comparison with manual grouping 94.3% accuracy

2nd strategy Assumption: Query chain if:

Queries from the same IP Time between 2 queries < 30 min 91.6% accuracy

Adoption of (simpler) 2nd strategy

23

Clickthrough data (studies) Conclusion: Most

times users view the first 2 results

Conclusion: Most times users view the next result after the last clicked

24

Relevance criteria (query) Click >q Skip Above

dk more relevant than dl if dk clicked and dl not clicked and dk lower initial ranking than dl

Click First >q No-Click Second d1 more relevant than d2 if

d1 clicked d2 not clicked

d5 more relevant than d4d5 more relevant than d2d3 more relevant than d2


25

Relevance criteria (query chain) Click >q1 Skip Above

dk more relevant than dl if dk clicked and dl not clicked and dk lower initial ranking than dl

Click First >q1 No-Click Second d1 more relevant than d2 if

d1 clicked d2 not clicked

FOR QUERY 1 (q’)d8 more relevant than d7d8 more relevant than d6

FOR QUERY 1 (q’)d5 more relevant than d6

QUERY 1 QUERY 2

26

Relevance criteria (query chain II)

Click >q1 Skip Earlier Query

Click First >q1 Top 2 Earlier Query

FOR QUERY 1 (q’)d6 more relevant than d1d6 more relevant than d2d6 more relevant than d4

FOR QUERY 1 (q’)d6 more relevant than d1d6 more relevant than d2

QUERY 1 QUERY 2

27

Relevance criteria (Experiment) Sequences of reformulated queries

Search on 16 subjectsRelevance inequalities produced by previous

criteriaComparison with explicit relevance judgements

on query chains

28

Method (SVM training) Feature vector consists of

Rank features φfirank(d, q)

Term/document features φterms(d, q) Rank features

One φfirank(d, q) for every retrieval

function fi examined Every φfi

rank(d, q) consists of 28 features

For ranks 1,2,3,…10,15,20,…100 in the initial ranking

If rank of d in initial ranking ≤ 1,2,3,…10,15,20,…100 set 1 else set 0

Allow us to learn weights for and combine different retrieval functions

29

Method (SVM training) Term/document features

φterms(d, q) consists of N*M features N number of terms M number of documents

Set 1 if di = d and tj ε q Trains weight vector to learn

assosiations between terms and documents

30

Method (Constraints) Problem

Most relevance criteria suggest that a lower ranked link (in the initial ranking) is better than a higher ranked link

Trivial solution: reverse the initial ranking by assigning negative values to weight vector

Solution Each w*φfi

rank(d, q) ≥ a minimun positive value

31

Experiments Based on a meta-search engine

Initial retrieval function from Nutch Training

9949 queries 7429 clicks Pqc: Total of 120134 relative relevance inequalities Pnc: A set of Pqc generated without use of query chains

Results: (rel0 initial ranking)

32

Experiments Results

Trained weights for term/document features

33

Open issues Tolerance of noise and malicious clicks Different weights in each query of a query chain Personalized ranking functions Improvement of performance

34

Related work (Fox et al. 2005) Evaluation of implicit measures Use of explicit ratings of user satisfaction Method

Bayesian modeling and decision trees Gene analysis

Results: Importance of combination of Clickthrough Time Exit type

Features used:

35

Related work (Agichtein, Brill, Dumais 2006) Incorporation of implicit

feedback features directly into the initial ranking algorithm BM25F RankNet

Features used:

36

Related work Query/links clustering based on features extracted

from implicit feedback Score based on interrelations between queries and

web pages

37

Possible extensions Utilization of time feedback

As relative relevance criterion As a feature

Other types of feedback Scrolling Exit type

Combination with absolute relevance clickthrough feedback Percentage of result clicks for a query Links followed from result page

Query chains Improvement of detection method

Link association rules For frequently clicked groups of results

Query/links clustering Constant training of ranking functions

38

References Search Engines that Learn from Implicit Feedback

Joachims, Radlinski Optimizing Search Engines Using Clickthrough Data

Joachims Query Chains: Learning to Rank from Implicit Feedback

Radlinski, Joachims Evaluating implicit measures to improve web search

Fox et al. Implicit Feedback for Inferring User Preference A Bibliography

Kelly, Teevan

Improving Web Search Ranking by Incorporating User Behavior Agichtein, Brill, Dumais

Optimizing web search using web click-through data Xue, Zeng, Chen, Yu, Ma, Xi, Fan

39

Clickthrough data (Studies) Why not absolute relevance?

Study: Rank and viewership Percentage of queries where a user clicked the result presented

at a particular rank, both in normal and swapped conditions

Conclusion: Users tend to click on higher ranked results irrespective of content

40

Types of user feedbackExplicit feedback

User explicitly judges relevance of results with the query

Costly in time and resources Costly for user limited effectiveness Direct evaluation of results

Implicit feedback Extracted from log files

Large amount of information Real user behaviour (not expert judgement) Indirect evaluation of results through click behaviour

optimizing search engines using clickthrough data

Documents

relevance of results

user behaviour data

result pagerelative

result pagee

results of different

bad query

query similarityresult

query keywords