Learning to Rank: New Techniques and Applications
Martin Szummer
Microsoft Research, Cambridge, UK
Why learning to rank?
• Current rankers use many features, in complex combinations
• Applications
  – Web search ranking, enterprise search
  – Image search
  – Ad selection
  – Merging multiple results lists
• The good: uses training data to find combinations of features that optimize IR metrics
• The bad: requires judged training data, which is expensive, subjective, not provided by end-users, and quickly out-of-date
This talk
• Learning to rank with IR metrics
  A single, simple yet competition-winning recipe. Works for NDCG, MAP, and Precision, with linear or non-linear ranking functions (neural nets, boosted trees, etc.)
• Semi-supervised ranking
  A new technique. Reduces the amount of judged training data required.
• Learning to merge
  Application: merging results lists from multiple query reformulations

Actually – I apply the same recipe in three different settings!
Ranking Background
• Classification: determine the class of an item i (operates on individual items)
• Ranking: determine the preference of item i versus j (operates on pairs of items)
• Ranking function: a score function f(x_i; w) over query-document features x_i, with parameters w
  Example: linear function f(x_i; w) = w · x_i
  The ranking function induces a preference: i ≻ j when f(x_i; w) > f(x_j; w)
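As a concrete illustration (a minimal sketch, not code from the talk; the feature vectors and weights are invented):

```python
import numpy as np

# A linear ranking function f(x; w) = w . x over query-document features.
w = np.array([0.7, 0.3])       # parameters (invented for illustration)
x_i = np.array([2.0, 1.0])     # query-doc features of item i
x_j = np.array([1.0, 1.5])     # query-doc features of item j

s_i, s_j = w @ x_i, w @ x_j    # scores: 1.7 and 1.15
print(s_i > s_j)               # True: the function prefers i over j
```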
From Ranking Function to the Ranking
• Applying the ranking function to define a ranking: sort items by score, Sort{ f(x_i; w) }
• Above: a deterministic model of preference. Henceforth: a probabilistic model that translates score differences into a probability of preference (Bradley-Terry/Mallows):
  P(i ≻ j) = 1 / (1 + exp(−σ (s_i − s_j))), where s_i = f(x_i; w)
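A minimal sketch of this step (the scale parameter sigma is an assumption; the talk does not specify one):

```python
import numpy as np

def preference_prob(s_i, s_j, sigma=1.0):
    """Bradley-Terry: probability that item i is preferred to item j."""
    return 1.0 / (1.0 + np.exp(-sigma * (s_i - s_j)))

scores = np.array([1.7, 1.15, 0.4])
print(np.argsort(-scores))           # deterministic ranking: sort by score
print(preference_prob(1.7, 1.15))    # ~0.63: a soft preference for item 0
```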
Learning to Rank
• Learning to rank: given query-document features and given preference pairs in the training data, determine the parameters w of the ranking function Sort{ f(x_i; w) }
• Maximize the likelihood of the preference pairs given in training data:
  max_w Σ_{(i ≻ j) ∈ train} log P(i ≻ j)
  e.g. the RankNet model [Burges et al 2005]
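A sketch of this maximum-likelihood training for a linear scorer (illustrative only: RankNet in the cited paper uses a neural net, and sigma, the learning rate, and the epoch count here are assumptions):

```python
import numpy as np

def train_pairwise_linear(X, pairs, sigma=1.0, lr=0.1, epochs=100):
    """X: query-doc feature matrix; pairs: (i, j) meaning i is preferred to j.
    Gradient ascent on the log-likelihood of the training preference pairs."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i, j in pairs:
            p_ij = 1.0 / (1.0 + np.exp(-sigma * ((X[i] - X[j]) @ w)))
            w += lr * sigma * (1.0 - p_ij) * (X[i] - X[j])   # d log P / dw
    return w
```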
Learning to Rank for IR metrics
• IR metrics such as NDCG, MAP or Precision depend on:
  – the sorted order of items
  – the ranks of items: they weight the top of the ranking more

Recipe:
1) Express the metric as a sum of pairwise swap deltas
2) Smooth it by multiplying by a Bradley-Terry term
3) Optimize parameters by gradient descent over a judged training set

LambdaRank & LambdaMART [Burges et al] are instances of this recipe. The latter won the Yahoo! Learning to Rank Challenge (2010).
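A sketch of the resulting gradients as published for LambdaRank (sign conventions and the sigma parameter vary between write-ups):

```python
import numpy as np

def lambda_gradients(scores, metric_deltas, pairs, sigma=1.0):
    """pairs: (i, j) with i judged more relevant than j.
    metric_deltas[i, j]: |change in NDCG/MAP/Precision if i and j swap ranks|
    (step 1 of the recipe). The Bradley-Terry factor smooths the metric
    (step 2); the per-document lambdas then drive gradient descent (step 3)."""
    lambdas = np.zeros_like(scores)
    for i, j in pairs:
        bt = sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
        lambdas[i] += bt * metric_deltas[i, j]   # force pushing i up
        lambdas[j] -= bt * metric_deltas[i, j]   # force pushing j down
    return lambdas
```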
Example: Apply recipe to NDCG metric
Unpublished material. Email me if interested.
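The recipe application itself is withheld above; for reference, a sketch of the standard (public) NDCG@k definition it targets, with the common 2^rel − 1 gain and log2 rank discount:

```python
import numpy as np

def ndcg_at_k(rels_in_ranked_order, k=10):
    """NDCG@k for graded relevance judgments listed in ranked order."""
    rels = np.asarray(rels_in_ranked_order, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, min(k, rels.size) + 2))
    dcg = np.sum((2.0 ** rels[:k] - 1.0) * discounts)
    ideal = np.sort(rels)[::-1]                     # best possible ordering
    idcg = np.sum((2.0 ** ideal[:k] - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0
```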
Gradients - intuition
• Gradients act as forces on doc pairs
[Figure: a ranked list of documents (ranks 1-5); the gradients dC/ds_ij are drawn as arrows, forces that push each document of a pair up or down the ranking.]
Semi-supervised Ranking
[with Emine Yilmaz]
Train with judged AND unjudged query-document pairs
• Applications
  – (Pseudo) relevance feedback
  – Reduce the number of (expensive) human judgments
  – Use when judgments are hard to obtain
    • customers may not want to judge their collections
    • adaptation to a specific company in enterprise search
    • ranking for small markets, special-interest domains
• Approach
  – preference learning
  – end-to-end optimization of ranking metrics (NDCG, MAP)
  – multiple and completely unlabeled rank instances
  – scalability
How to benefit from unlabeled data?
Unlabeled data gives information about the data distribution P(x). We must make assumptions about what the structure of the unlabeled data tells us about the ranking distribution P(R|x).
A common assumption is the cluster assumption: unlabeled data defines the extent of clusters; labeled data determines the class/function value of each cluster.
Semi-supervised
  classification: similar documents ⇒ same class
  regression: similar documents ⇒ similar function value
  ranking: similar documents ⇒ similar preference, i.e. neither is preferred to the other
• Differences from classification & regression:
  – Preferences provide weaker constraints than function values or classes
  – The similarity term is a type of regularizer on the function we are learning
  – Similarity can be defined based on content; it does not require judgments
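The actual similarity term used here is unpublished (next slide); purely as an illustration, one natural regularizer consistent with the assumption penalizes score differences between content-similar documents:

```python
def unlabeled_regularizer(scores, similar_pairs):
    """Hypothetical illustration only: not the talk's (unpublished) term.
    similar_pairs: (i, j, w_ij) with w_ij a content-similarity weight;
    the penalty discourages preferring either document of a similar pair."""
    return sum(w_ij * (scores[i] - scores[j]) ** 2
               for i, j, w_ij in similar_pairs)

# Combined objective: C = C_L (labeled pairwise cost) + beta * C_U (above)
```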
Quantify Similarity
similar documents ⇒ similar preference, i.e. neither is preferred to the other
Unpublished material. Email me if interested.
Semi-supervised Gradients
[Figure: labeled and unlabeled document pairs. Labeled pairs contribute gradients dC_L/ds_ij, unlabeled pairs contribute dC_U/ds_ij, and the combined gradient on each pair is dC_L/ds_ij + β dC_U/ds_ij.]
Experiments
Relevance feedback task:
1) the user issues a query and labels a few of the resulting documents from a traditional ranker (BM25)
2) the system trains a query-specific ranker and re-ranks

Data: TREC collection. 528,000 documents, 150 queries; 1000 total documents per query; 2-15 docs are labeled.

Features:
  ranking features (q, d): 22 features from LETOR
  content features (d1, d2): TF-IDF distance between top 50 words

Neighbors in input space are found using either of the above (see the sketch below). Note: at test time, only ranking features are used; at training time the method can exploit features of type (d1, d2) and (q, d1, d2) that other algorithms cannot use.

Ranking function f(): neural network, 3 hidden units; K=5 neighbors.
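A sketch of the content-feature neighbor construction (tokenization details are assumptions, and the per-document top-50-words step is approximated here by a 50-term TF-IDF vocabulary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def neighbor_pairs(docs, k=5):
    """Pair each document with its K nearest neighbors under TF-IDF
    cosine distance; these become the unlabeled similar pairs."""
    tfidf = TfidfVectorizer(max_features=50).fit_transform(docs)
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(tfidf)
    _, idx = nn.kneighbors(tfidf)
    return [(i, j) for i, row in enumerate(idx) for j in row[1:]]  # drop self
```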
Relevance Feedback Task
[Plot: NDCG@10 (y-axis, 0.1-0.6) vs. number of labeled documents (x-axis: 2, 3, 5, 10, 15). Methods: LambdaRank L&U Cont, LambdaRank L&U, LambdaRank L, TSVM L&U, RankBoost L&U, RankingSVM L, RankBoost L.]
Novel Queries Task
90,000 training documents; 3500 preference pairs; 40 million unlabeled pairs
[Plot: NDCG@10 (y-axis, 0.1-0.5) vs. number of labeled preference pairs (x-axis, 10^2 to 10^3, log scale). Methods: LambdaRank L&U Cont, LambdaRank L&U, LambdaRank L; an upper bound is also shown.]
Learning to Merge
Task: learn a ranker that merges results from other rankers

Example application: users do not know the best way to express their web search query, and a single query may not be enough to reach all relevant documents.
[Diagram: the user issues the query "wp7"; the solution reformulates it in parallel ("wp7 phone", "microsoft wp7") and merges the results lists.]
Merging Multiple Queries [with Sheldon, Shokouhi, Craswell]
• Traditional approach: alter the query before retrieval
• Merging: alter after retrieval
  – Prospecting: see the results first, then decide
  – Flexibility: any rewrite is allowed, arbitrary features
  – Upside potential: better than any individual list
  – Increased query load on the engine: use a cache to mitigate it
LambdaMerge: learn to merge
A weighted mixture of ranking functions: a gating net weights each rewrite's results list based on rewrite features, and a scoring net scores each document based on scoring features (see the sketch below).

Rewrite features:
  Rewrite-difficulty: ListMean, ListStd, Clarity
  Rewrite-drift: IsRewrite, RewriteRank, RewriteScore, Overlap@N
Scoring features: dynamic rank score, BM25, Rank, IsTopN

[Diagram: results lists for the query "jupiters mass" and its rewrite "mass of jupiter"; rewrite features feed the gating net, score features feed the scoring nets.]
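A minimal sketch of the weighted mixture (linear gating and scoring functions stand in for the nets, and the parameter shapes are assumptions):

```python
import numpy as np

def merged_scores(lists, g_params, f_params):
    """lists: one (z_k, docs) entry per rewrite, where z_k holds the rewrite
    features of list k and docs holds (doc_id, x_dk) scoring-feature pairs.
    A gating weight per list scales a per-document score; documents that
    appear in several lists accumulate score. Sort descending to merge."""
    scores = {}
    for z_k, docs in lists:
        gate = 1.0 / (1.0 + np.exp(-(g_params @ z_k)))   # gating net weight
        for doc_id, x_dk in docs:
            scores[doc_id] = scores.get(doc_id, 0.0) + gate * (f_params @ x_dk)
    return scores
```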
[Scatter plot: λ-Merge results. x-axis: Reformulation – Original NDCG; y-axis: Merged – Original NDCG.]
Summary
• Learning to Rank
  – An indispensable tool
  – Requires judgments: but semi-supervised learning can help
    • crowd-sourcing is also a possibility
    • research frontier: implicit judgments from clicks
  – Many applications beyond those shown
    • Merging: multiple local search engines, multiple language engines
    • Ranking recommendations in collaborative filtering
    • Many thresholding tasks (filtering) can be posed as ranking
    • Ranking ads for relevance
    • Elections
– Use it!