Improving Web Search Ranking by Incorporating User Behavior Information

Eugene Agichtein, Eric Brill, Susan Dumais

Microsoft Research

Web Search Ranking

Rank pages relevant for a query
– Content match: e.g., page terms, anchor text, term weights
– Prior document quality: e.g., web topology, spam features
– Hundreds of parameters

Tune ranking functions on explicit document relevance ratings

Query: SIGIR 2006

Users can help indicate the most relevant results

Web Search Ranking: Revisited

Incorporate user behavior information
– Millions of users submit queries daily
– Rich user interaction features (earlier talk)
– Complementary to content and web topology

Some challenges:
– User behavior “in the wild” is not reliable
– How to integrate interactions into ranking
– What is the impact over all queries

Outline

Modelling user behavior for ranking

Incorporating user behavior into ranking

Empirical evaluation

Conclusions

Related Work

Personalization
– Rerank results based on a user’s clickthrough and browsing history

Collaborative filtering
– Amazon, DirectHit: rank by clickthrough

General ranking
– Joachims et al. [KDD 2002], Radlinski et al. [KDD 2005]: tuning ranking functions with clickthrough

Rich User Behavior Feature Space

Observed and distributional features
– Aggregate observed values over all user interactions for each query and result pair
– Distributional features: deviations from the “expected” behavior for the query

Represent user interactions as vectors in user behavior space
– Presentation: what a user sees before a click
– Clickthrough: frequency and timing of clicks
– Browsing: what users do after a click

Some User Interaction Features

Presentation
  ResultPosition       Position of the URL in current ranking
  QueryTitleOverlap    Fraction of query terms in result title
Clickthrough
  DeliberationTime     Seconds between query and first click
  ClickFrequency       Fraction of all clicks landing on page
  ClickDeviation       Deviation from expected click frequency
Browsing
  DwellTime            Result page dwell time
  DwellTimeDeviation   Deviation from expected dwell time for query
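As a rough illustration of how two of the clickthrough features might be derived, the sketch below aggregates clicks for a single query. The function name `click_features`, the `expected_by_position` table, and the choice of "observed minus expected" as the deviation are all assumptions made for this example; the slides do not specify them.

```python
from collections import Counter

def click_features(clicked_urls, positions, expected_by_position):
    """Compute (ClickFrequency, ClickDeviation) per result for one query.

    clicked_urls: URLs clicked across all logged instances of the query.
    positions: dict mapping url -> rank position in the current ranking.
    expected_by_position: dict mapping position -> background click
        frequency (a hypothetical input, assumed to be estimated elsewhere).
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    features = {}
    for url, pos in positions.items():
        # Fraction of all clicks for this query that landed on the page.
        freq = counts[url] / total if total else 0.0
        # Assumed definition: observed frequency minus expected frequency.
        deviation = freq - expected_by_position.get(pos, 0.0)
        features[url] = (freq, deviation)
    return features
```

For example, with clicks `["a", "b", "b", "b"]` and an expected frequency of 0.25 at position 2, result `b` gets a ClickFrequency of 0.75 and a ClickDeviation of 0.5.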

Training a User Behavior Model

Map user behavior features to relevance judgements

RankNet: Burges et al. [ICML 2005]
– Scalable neural net implementation
– Input: user behavior + relevance labels
– Output: weights for behavior feature values
– Used as testbed for all experiments

Training RankNet

RankNet [Burges et al. 2005]

For query results 1 and 2, present a pair of feature vectors and labels, with label(1) > label(2)

[Diagram: Feature Vector 1 (Label 1) and Feature Vector 2 (Label 2) are each passed through the neural net, producing NN output 1 and NN output 2]

Error is a function of both outputs (desired: output 1 > output 2)
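To make "error is a function of both outputs" concrete, here is a minimal sketch of a pairwise loss of the kind described in Burges et al. [ICML 2005]: a cross-entropy on the sigmoid of the output difference, for the case where result 1 is labeled more relevant than result 2. The function name is hypothetical, and the sketch covers only the loss, not the network or its training loop.

```python
import math

def ranknet_pair_loss(output1, output2):
    """Pairwise loss when result 1 should rank above result 2.

    Equals log(1 + exp(-(output1 - output2))): small when output1 is
    well above output2, large when the order is reversed.
    """
    diff = output1 - output2
    # Numerically stable evaluation of log(1 + exp(-diff)).
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The loss depends only on the difference between the two outputs, which is why training needs pairs while prediction can score a single vector.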


Predicting with RankNet

[Diagram: Feature Vector 1 is passed through the neural net, producing NN output]

Present an individual vector and get a score

Outline

Modelling user behavior




User Behavior Models for Ranking

Use interactions from previous instances of query
– General-purpose (not personalized)
– Only available for queries with past user interactions

Models:
– Rerank, clickthrough only: reorder results by number of clicks
– Rerank, predicted preferences (all user behavior features): reorder results by predicted preferences
– Integrate directly into ranker: incorporate user interactions as features for the ranker

Rerank, Clickthrough Only

Promote all clicked results to the top of the result list
– Re-order by click frequency

Retain relative ranking of un-clicked results
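The clickthrough-only reranking described on this slide can be sketched in a few lines. The function name `rerank_clickthrough` and the tie-breaking rule (clicked results with equal counts keep their original relative order) are assumptions of this example.

```python
def rerank_clickthrough(ranked_urls, click_counts):
    """Move clicked results to the top, ordered by click count.

    ranked_urls: result URLs in their original ranked order.
    click_counts: dict mapping url -> number of clicks for this query.
    """
    clicked = [u for u in ranked_urls if click_counts.get(u, 0) > 0]
    unclicked = [u for u in ranked_urls if click_counts.get(u, 0) == 0]
    # list.sort is stable, so equal counts keep the original order
    # (an assumed tie-breaking rule).
    clicked.sort(key=lambda u: -click_counts[u])
    # Unclicked results follow in their original relative order.
    return clicked + unclicked
```

With original order `a, b, c, d` and clicks `{c: 5, b: 2}`, the reranked list is `c, b, a, d`.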

Rerank, Preference Predictions

Re-order results by function of preference prediction score

Experimented with different variants
– Using inverse of ranks
– Intuition: scores not comparable, so merge ranks

$$\mathrm{Score}(I_d, O_d) = w_I \cdot \frac{1}{I_d + 1} + \frac{1}{O_d + 1}$$
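A small sketch of the inverse-rank merge on this slide: each result gets a weighted inverse of its rank under the behavior-based predictions plus the inverse of its original rank. Here `I_d` is read as the behavior-based rank of result d, `O_d` as its original rank, and `w_I` as the weight on the behavior term. The function names, the 1-based ranks, and the fallback for results with no behavior-based rank (original-rank term only) are assumptions of this sketch.

```python
def merged_score(implicit_rank, original_rank, w_i=1.0):
    """Score = w_i / (implicit_rank + 1) + 1 / (original_rank + 1)."""
    score = 1.0 / (original_rank + 1)
    # Assumed fallback: no behavior data means only the original term.
    if implicit_rank is not None:
        score += w_i / (implicit_rank + 1)
    return score

def rerank_by_merged_score(original_order, implicit_ranks, w_i=1.0):
    """Reorder results by descending merged score.

    original_order: URLs in their original ranked order.
    implicit_ranks: dict mapping url -> rank under the behavior model.
    """
    scored = [
        (merged_score(implicit_ranks.get(url), rank, w_i), url)
        for rank, url in enumerate(original_order, start=1)
    ]
    # Stable sort: ties keep the original order.
    scored.sort(key=lambda pair: -pair[0])
    return [url for _, url in scored]
```

Merging ranks rather than raw scores sidesteps the problem that the two rankers' scores live on different scales.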

Integrate User Behavior Features Directly into Ranker

For a given query
– Merge original feature set with user behavior features when available
– User behavior features computed from previous interactions with same query

Train RankNet on enhanced feature set
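One way to picture the merged feature set is simple concatenation. The slide does not say how results without behavior data are encoded, so the zero-fill plus availability flag below is purely an assumption of this sketch, as is the function name `merge_features`.

```python
def merge_features(content_features, behavior_features, n_behavior):
    """Concatenate content and behavior features into one vector.

    behavior_features: list of length n_behavior, or None when the
    query has no past interactions for this result.
    The final element is an assumed availability flag (1.0 or 0.0).
    """
    if behavior_features is None:
        # Assumed encoding for missing behavior data: zeros plus flag 0.
        return list(content_features) + [0.0] * n_behavior + [0.0]
    if len(behavior_features) != n_behavior:
        raise ValueError("unexpected number of behavior features")
    return list(content_features) + list(behavior_features) + [1.0]
```

Every vector then has the same length, so a single ranker can be trained on queries with and without interaction history.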

Outline

Modelling user behavior




Evaluation Metrics

Precision at K: fraction of relevant in top K

NDCG at K: normalized discounted cumulative gain
– Top-ranked results most important

$$N_q = M_q \sum_{j=1}^{K} \frac{2^{r(j)} - 1}{\log(1 + j)}$$

MAP: mean average precision
– Average precision for each query: mean of the precision at K values computed after each relevant document was retrieved
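The three metrics can be sketched directly from their definitions. In the NDCG sum, r(j) is taken to be the relevance rating of the result at position j, and the normalization constant M_q is computed here from the ideal ordering of the same rated list, so that a perfect ranking scores 1; that reading, like the function names, is an assumption of this example.

```python
import math

def precision_at_k(relevant_flags, k):
    """Fraction of the top k results that are relevant (flags are 0/1)."""
    return sum(relevant_flags[:k]) / k

def dcg_at_k(ratings, k):
    """Sum over positions j = 1..k of (2^r(j) - 1) / log(1 + j)."""
    return sum((2 ** r - 1) / math.log(1 + j)
               for j, r in enumerate(ratings[:k], start=1))

def ndcg_at_k(ratings, k):
    """DCG normalized by the DCG of the ideal ordering (assumed M_q)."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0

def average_precision(relevant_flags):
    """Mean of the precision values at each relevant result's position."""
    hits, total = 0, 0.0
    for j, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / j
    return total / hits if hits else 0.0
```

MAP is then the mean of `average_precision` over all test queries. The logarithm base does not matter for NDCG, since it cancels in the normalization.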

Datasets

8 weeks of user behavior data from anonymized opt-in client instrumentation

Millions of unique queries and interaction traces

Random sample of 3,000 queries
– Gathered independently of user behavior
– 1,500 train, 500 validation, 1,000 test

Explicit relevance assessments for top 10 results for each query in sample

Methods Compared

Content only: BM25F

Full Search Engine: RN
– Hundreds of parameters for content match and document quality
– Tuned with RankNet

Incorporating User Behavior
– Clickthrough: Rerank-CT
– Full user behavior model predictions: Rerank-All
– Integrate all user behavior features directly: +All

Content, User Behavior: Precision at K, queries with interactions

BM25 < Rerank-CT < Rerank-All < +All

[Figure: Precision (y-axis, 0.38 to 0.63) vs. K (x-axis: 1, 3, 5, 10) for BM25, Rerank-CT, Rerank-All, and BM25+All]

Content, User Behavior: NDCG

BM25 < Rerank-CT < Rerank-All < +All

[Figure: NDCG (y-axis, 0.5 to 0.68) vs. K (x-axis, 1 to 10) for BM25, Rerank-CT, Rerank-All, and BM25+All]

Full Search Engine, User Behavior: NDCG, MAP

Method      MAP     Gain
RN          0.270
RN+ALL      0.321   0.052 (19.13%)
BM25        0.236
BM25+ALL    0.292   0.056 (23.71%)

[Figure: NDCG (y-axis, 0.56 to 0.74) vs. K (x-axis, 1 to 10) for RN, Rerank-All, and RN+All]

Impact: All Queries, Precision at K

< 50% of test queries with prior interactions

+0.06-0.12 precision over all test queries

[Figure: Precision (y-axis, 0.4 to 0.7) vs. K (x-axis: 1, 3, 5, 10) for RN, Rerank-All, and RN+All]

Impact: All Queries, NDCG

+0.03-0.05 NDCG over all test queries

[Figure: NDCG (y-axis, 0.56 to 0.7) vs. K (x-axis, 1 to 10) for RN, Rerank-All, and RN+All]

Which Queries Benefit Most

[Figure: Frequency (axis, 0 to 350) and Average Gain (axis, -0.4 to 0.2) plotted over a horizontal axis from 0.1 to 0.6]

Most gains are for queries with poor ranking

Conclusions

Incorporating user behavior into web search ranking dramatically improves relevance

Providing rich user interaction features to ranker is the most effective strategy

Large improvement shown for up to 50% of test queries

Thank you

Text Mining, Search, and Navigation group: http://research.microsoft.com/tmsn/

Adaptive Systems and Interaction group: http://research.microsoft.com/adapt/

Content, User Behavior: All Queries, Precision at K

BM25 < Rerank-CT < Rerank-All < All

[Figure: Precision (y-axis, 0.35 to 0.65) vs. K (x-axis: 1, 3, 5, 10) for BM25, Rerank-CT, Rerank-All, and All]

Content, User Behavior: All Queries, NDCG

BM25 << Rerank-CT << Rerank-All < All

[Figure: NDCG (y-axis, 0.5 to 0.68) vs. K (x-axis, 1 to 10) for BM25, Rerank-CT, Rerank-All, and All]

Results Summary

Incorporating user behavior into web search ranking dramatically improves relevance

Incorporating user behavior features into ranking directly is the most effective strategy

Impact on relevance substantial

Poorly performing queries benefit most

Promising Extensions

Backoff (improve query coverage)

Model user intent/information need

Personalization of various degrees

Query segmentation
