Learning to Rank: from heuristics to theoretic approaches
Hongning Wang, CS 6501: Information Retrieval, CS@UVa
(Slide transcript, 80 pages)

Page 1

Learning to Rank: from heuristics to theoretic approaches

Hongning Wang

Page 2

Congratulations

• Job Offer

– Design the ranking module for Bing.com

Page 3

How should I rank documents?

Answer: Rank by relevance!

Page 4

Relevance ?!

Page 5

The Notion of Relevance

[Diagram (from the CS598CXZ lecture notes): how relevance has been modeled.
• Similarity between Rep(q) and Rep(d): vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89)
• Probability of relevance, P(r=1|q,d), r ∈ {0,1}: regression model (Fuhr, 89); generative models, via document generation (classical prob. model, Robertson & Sparck Jones, 76) or query generation (LM approach, Ponte & Croft, 98; Lafferty & Zhai, 01a)
• Probabilistic inference, P(d→q) or P(q→d), with different inference systems: prob. concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)
• Also in the picture: divergence from randomness (Amati & Rijsbergen, 02); learning to rank (Joachims 02, Burges et al. 05); relevance constraints (Fang et al., 04)]

Page 6

Relevance Estimation

• Query matching

– Language model

– BM25

– Vector space cosine similarity

• Document importance

– PageRank

– HITS

Page 7

Did I do a good job of ranking documents?

• IR evaluations metrics

– Precision

– MAP

– NDCG

(Ranking features: BM25, PageRank)

Page 8

Take advantage of different relevance estimators?

• Ensemble the cues

– Linear?

– Non-linear?

• Decision tree-like

𝛼1 × 𝐵𝑀25 + 𝛼2 × 𝐿𝑀 + 𝛼3 × 𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘 + 𝛼4 × 𝐻𝐼𝑇𝑆

[Decision-tree example: the root splits on BM25 > 0.5, with further splits on LM > 0.1 and PageRank > 0.3; the leaves assign relevance scores r = 1.0, 0.7, 0.4, and 0.1.]

{𝛼1= 0.2, 𝛼2 = 0.2, 𝛼3 = 0.3, 𝛼4 = 0.1} → {𝑀𝐴𝑃 = 0.12, 𝑁𝐷𝐶𝐺 = 0.5 }

{𝛼1= 0.1, 𝛼2 = 0.1, 𝛼3 = 0.5, 𝛼4 = 0.2} → {𝑀𝐴𝑃 = 0.18, 𝑁𝐷𝐶𝐺 = 0.7}

{𝛼1= 0.4, 𝛼2 = 0.2, 𝛼3 = 0.1, 𝛼4 = 0.1} → {𝑀𝐴𝑃 = 0.20, 𝑁𝐷𝐶𝐺 = 0.6}
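To make the "ensemble the cues" idea concrete, here is a minimal sketch (my own illustration with made-up feature values and labels, not the course's data): score each document by α1·BM25 + α2·LM + α3·PageRank + α4·HITS and compare weight settings by NDCG.

```python
import numpy as np

# Toy feature matrix: one row per document, columns BM25, LM, PageRank, HITS (made-up values).
features = np.array([
    [1.6, 1.1, 0.9, 0.3],
    [2.7, 1.9, 0.2, 0.1],
    [0.4, 0.5, 0.8, 0.6],
])
labels = np.array([0, 2, 1])   # graded relevance labels of the three documents

def dcg(ranked_labels):
    """DCG with (2^rel - 1) gains and log2(rank + 1) discounts."""
    ranked_labels = np.asarray(ranked_labels, dtype=float)
    discounts = np.log2(np.arange(2, len(ranked_labels) + 2))
    return float(np.sum((2.0 ** ranked_labels - 1.0) / discounts))

def ndcg_for_weights(alpha):
    scores = features @ np.asarray(alpha)   # alpha1*BM25 + alpha2*LM + alpha3*PageRank + alpha4*HITS
    ranked = labels[np.argsort(-scores)]    # labels in the order induced by the scores
    return dcg(ranked) / dcg(np.sort(labels)[::-1])

# The three weight settings from the slide, now evaluated on the toy data.
for alpha in [(0.2, 0.2, 0.3, 0.1), (0.1, 0.1, 0.5, 0.2), (0.4, 0.2, 0.1, 0.1)]:
    print(alpha, "NDCG =", round(ndcg_for_weights(alpha), 3))
```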

Page 9

What if we have thousands of features?

• Is there any way I can do better?

– Optimizing the metrics automatically!

How to determine those 𝛼s? Where to find those tree structures?

Page 10

Rethink the task

• Given: (query, document) pairs represented by a set of relevance estimators, a.k.a., features

• Needed: a way of combining the estimators

– f: {(q, d_i)}_{i=1}^{D} → ordered {d_i}_{i=1}^{D}

• Criterion: optimize IR metrics

– P@k, MAP, NDCG, etc.

DocID BM25 LM PageRank Label

0001 1.6 1.1 0.9 0

0002 2.7 1.9 0.2 1

Key!

Page 11

Machine Learning

• Input: {(X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n)}, where X_i ∈ R^N, Y_i ∈ R^M

• Objective function: O(Y′, Y)

• Output: f: X → Y, such that f = argmax_{f′ ∈ F} O(f′(X), Y)

Classification: O(Y′, Y) = δ(Y′ = Y)
http://en.wikipedia.org/wiki/Statistical_classification

Regression: O(Y′, Y) = −||Y′ − Y||
http://en.wikipedia.org/wiki/Regression_analysis

NOTE: We will only talk about supervised learning.

Page 12

Learning to Rank

• General solution in optimization framework

– Input: {((q, d_1), y_1), ((q, d_2), y_2), …, ((q, d_n), y_n)}, where d_i ∈ R^N, y_i ∈ {0, …, L}

– Objective: O = {P@k, MAP, NDCG}

– Output: f(q, d) → Y, s.t. f = argmax_{f′ ∈ F} O(f′(q, d), Y)

DocID BM25 LM PageRank Label

0001 1.6 1.1 0.9 0

0002 2.7 1.9 0.2 1
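A minimal sketch of how this training data is consumed (toy values from the table above; the query id is my addition for grouping): each (query, document) pair carries a feature vector and a label, and any candidate f ranks a query's documents by its scores.

```python
from collections import defaultdict

rows = [
    # (query_id, doc_id, [BM25, LM, PageRank], label) -- query_id added here for grouping
    ("q1", "0001", [1.6, 1.1, 0.9], 0),
    ("q1", "0002", [2.7, 1.9, 0.2], 1),
]

by_query = defaultdict(list)
for qid, doc_id, feats, label in rows:
    by_query[qid].append((doc_id, feats, label))

def rank(f, qid):
    """Order one query's documents by a scoring function f(features) -> score."""
    return [doc for doc, feats, _ in sorted(by_query[qid], key=lambda r: f(r[1]), reverse=True)]

print(rank(lambda feats: feats[0], "q1"))                         # rank by BM25 alone
print(rank(lambda feats: 0.2 * feats[0] + 0.3 * feats[2], "q1"))  # or by a weighted combination
```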

Page 13

Challenge: how to optimize?

• Evaluation metric recap

– Average Precision

– DCG

• Order is essential!

– 𝑓 → order→metric

Not continuous with respect to f(X)!
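To see the discontinuity concretely, here is a small sketch (my own toy example): average precision depends on f(X) only through the induced order, so nudging a score changes nothing until two documents swap, at which point the metric jumps.

```python
import numpy as np

def average_precision(ranked_labels):
    """AP for binary labels listed in ranked order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel == 1:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

labels = np.array([1, 0, 1, 0])                # relevance of four documents
base_scores = np.array([0.9, 0.85, 0.3, 0.2])  # f(X) for the four documents

for w in [1.0, 0.95, 0.9, 0.85]:               # gradually shrink the first document's score
    scores = base_scores.copy()
    scores[0] *= w
    order = np.argsort(-scores)
    print(w, average_precision(labels[order])) # constant, then a sudden drop when the order flips
```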

Page 14

Approximating the Objectives!

• Pointwise

– Fit the relevance labels individually

• Pairwise

– Fit the relative orders

• Listwise

– Fit the whole order

Page 15

Pointwise Learning to Rank

• Ideally, perfect relevance prediction leads to perfect ranking
– f → score → order → metric

• Reducing the ranking problem to (see the sketch below)

– Regression
• O(f(Q,D), Y) = −Σ_i |f(q_i, d_i) − y_i|
• Subset Ranking using Regression, D. Cossock and T. Zhang, COLT 2006

– (multi-)Classification
• O(f(Q,D), Y) = Σ_i δ(f(q_i, d_i) = y_i)
• Ranking with Large Margin Principles, A. Shashua and A. Levin, NIPS 2002
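For concreteness, a minimal pointwise sketch on toy data (my own illustration, using squared error instead of the absolute error above; not the procedure of the cited papers): fit a linear regressor to the relevance labels, then rank by predicted score.

```python
import numpy as np

X = np.array([          # features of 4 documents for one query: [BM25, LM, PageRank]
    [1.6, 1.1, 0.9],
    [2.7, 1.9, 0.2],
    [1.3, 0.2, 0.2],
    [1.2, 0.4, 0.7],
])
y = np.array([0.0, 2.0, 1.0, 0.0])                 # graded relevance labels

X1 = np.hstack([X, np.ones((len(X), 1))])          # add a bias column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)         # minimize sum_i (f(d_i) - y_i)^2
scores = X1 @ w
print("predicted scores:", np.round(scores, 2))
print("ranking (best first):", np.argsort(-scores))
```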

Page 16

Subset Ranking using Regression
D. Cossock and T. Zhang, COLT 2006

• Fit relevance labels via regression

– Emphasize more on relevant documents

http://en.wikipedia.org/wiki/Regression_analysis

[Formula annotations from the paper: weights on each document; the most positive (most relevant) document.]

Page 17

Ranking with Large Margin Principles
A. Shashua and A. Levin, NIPS 2002

• Goal: correctly place the documents into the corresponding categories and maximize the margin

[Figure: documents of grades Y = 0, Y = 1, Y = 2 separated by learned thresholds (the α's); slack variables reduce the violations.]

Page 18

• Maximizing the sum of margins

Ranking with Large Margin Principles A. Shashua and A. Levin, NIPS 2002

[Figure: the sum-of-margins formulation, again with thresholds (the α's) separating grades Y = 0, Y = 1, Y = 2.]

Page 19

Ranking with Large Margin Principles A. Shashua and A. Levin, NIPS 2002

• Ranking loss consistently decreases with more training data

Page 20

What did we learn

• Machine learning helps!

– Derive something optimizable

– More efficient and guided

Page 21

There is always a catch

• Cannot directly optimize IR metrics
– Predicting (0 → 1, 2 → 0) is worse for ranking than (0 → −2, 2 → 4), even though its regression error is smaller

• Positions of documents are ignored
– The penalty on documents at higher positions should be larger

• Favors queries with more documents

Page 22

Pairwise Learning to Rank

• Ideally perfect partial order leads to perfect ranking

– 𝑓 → partial order→ order →metric

• Ordinal regression
– O(f(Q,D), Y) = Σ_{i≠j} δ(y_i > y_j) δ(f(q_i, d_i) > f(q_i, d_j))

• Relative ordering between different documents is significant

• E.g., (0 → −2, 2 → 4) is better than (0 → 1, 2 → 0)

– Large body of work

Page 23

Optimizing Search Engines using Clickthrough Data
Thorsten Joachims, KDD’02

• Minimizing the number of mis-ordered pairs

y_1 > y_2, y_2 > y_3, y_1 > y_4

RankingSVM: f(q, d) = w^T X_{q,d}, a linear combination of features

Keep the relative orders (e.g., y_1 > y_2)
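As a concrete, simplified illustration of this pairwise idea (my own sketch, with toy data and plain sub-gradient descent standing in for the actual SVM solver): turn each preference y_i > y_j into a feature difference X_i − X_j that should score above zero, and learn w with a hinge loss.

```python
import numpy as np

X = np.array([[1.6, 0.9], [2.7, 0.2], [1.3, 0.2], [1.2, 0.7]])   # [BM25, PageRank]
y = np.array([4, 2, 1, 0])                                        # graded labels

pairs = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]
diffs = np.array([X[i] - X[j] for i, j in pairs])                 # each difference should score > 0

w, lr, C = np.zeros(X.shape[1]), 0.01, 1.0
for _ in range(500):                                  # sub-gradient descent on the hinge loss
    margins = diffs @ w
    grad = w - C * diffs[margins < 1].sum(axis=0)     # d/dw [ ||w||^2/2 + C * sum of hinge terms ]
    w -= lr * grad
print("learned w:", np.round(w, 3))
print("ranking:", np.argsort(-(X @ w)))               # order induced by f(q, d) = w^T x
```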

Page 24

Optimizing Search Engines using Clickthrough Data
Thorsten Joachims, KDD’02

• How to use it?

– f → score → order

𝑦1 > 𝑦2 > 𝑦3 > 𝑦4

Page 25

Optimizing Search Engines using Clickthrough Data
Thorsten Joachims, KDD’02

• What did it learn from the data?

– Linear correlations

Positively correlated features

Negatively correlated features

Page 26

• How good is it?

– Test on real system

Optimizing Search Engines using Clickthrough Data
Thorsten Joachims, KDD’02

Page 27

An Efficient Boosting Algorithm for Combining Preferences

Y. Freund, R. Iyer, et al. JMLR 2003

• Smooth the loss on mis-ordered pairs

– Σ_{y_i > y_j} Pr(d_i, d_j) exp[f(q, d_j) − f(q, d_i)]

[Figure (from Pattern Recognition and Machine Learning, p. 337): surrogate losses compared with the 0/1 loss: exponential loss, hinge loss (RankingSVM), and square loss.]
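A tiny numeric sketch of this smoothing (toy numbers, my own illustration): the exponential surrogate upper-bounds the weighted count of mis-ordered pairs.

```python
import numpy as np

scores = {"d1": 2.0, "d2": 1.5, "d3": 0.3}   # current f(q, d) for three documents
# Pr(d_i, d_j) over preferred pairs with y_i > y_j
prefs = {("d1", "d2"): 0.4, ("d1", "d3"): 0.4, ("d2", "d3"): 0.2}

loss_01  = sum(p * (scores[i] <= scores[j]) for (i, j), p in prefs.items())
loss_exp = sum(p * np.exp(scores[j] - scores[i]) for (i, j), p in prefs.items())
print(f"0/1 pairwise loss: {loss_01:.3f}, exponential surrogate: {loss_exp:.3f}")
```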

Page 28

• RankBoost: optimize via boosting

– Vote by a committee

An Efficient Boosting Algorithm for Combining Preferences

Y. Freund, R. Iyer, et al. JMLR 2003

[Figure (from Pattern Recognition and Machine Learning, p. 658): a committee of weak rankers such as BM25, PageRank, and cosine, each with a credibility weight; the pair distribution Pr(d_i, d_j) is updated between boosting rounds.]

Page 29

• How good is it?

An Efficient Boosting Algorithm for Combining Preferences

Y. Freund, R. Iyer, et al. JMLR 2003

Page 30

A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments

Zheng et al. SIGIR’07

• Non-linear ensemble of features

– Objective: Σ_{y_i > y_j} max(0, f(q, d_j) − f(q, d_i))²

– Gradient-descent boosting tree

• Boosting tree
– Use a regression tree to minimize the residuals
– r_t(q, d, y) = O_t(q, d, y) − f_{t−1}(q, d, y)
(A small sketch follows the figure below.)

[The same decision tree as on Page 8: splits on BM25 > 0.5, LM > 0.1, and PageRank > 0.3, with leaf scores r = 1.0, 0.7, 0.4, 0.1.]
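A rough sketch of boosted regression trees on the pairwise squared-hinge objective (in the spirit of the paper but simplified, with a small margin τ; toy data, scikit-learn assumed installed; not the authors' exact procedure): each round fits a small tree to the negative gradients of the still-violated pairs and adds it to the ensemble.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.6, 1.1, 0.9], [2.7, 1.9, 0.2], [1.3, 0.2, 0.2], [1.2, 0.4, 0.7]])
y = np.array([0, 2, 1, 0])                       # graded relevance for one query
pairs = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]

f = np.zeros(len(y))                             # current scores f_{t-1}(q, d)
eta, tau = 0.5, 0.1                              # learning rate and margin
for t in range(20):
    Xt, rt = [], []                              # residual training set for this round
    for i, j in pairs:
        if f[i] < f[j] + tau:                    # pair still violated
            Xt += [X[i], X[j]]
            rt += [f[j] + tau - f[i], f[i] - tau - f[j]]   # negative gradients: push d_i up, d_j down
    if not Xt:
        break
    tree = DecisionTreeRegressor(max_depth=2).fit(np.array(Xt), np.array(rt))
    f += eta * tree.predict(X)                   # boosting update
print("final scores:", np.round(f, 2), "ranking:", np.argsort(-f))
```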

Page 31

A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments

Zheng et al. SIGIR’07

• Non-linear vs. linear

– Comparing with RankingSVM

Page 32

Where do we get the relative orders

• Human annotations

– Small scale, expensive to acquire

• Clickthroughs

– Large amount, easy to acquire

Page 33

Accurately Interpreting Clickthrough Data as Implicit Feedback

Thorsten Joachims, et al., SIGIR’05

• Position bias: your click is not because of the document's relevance, but its position?!

Page 34

Accurately Interpreting Clickthrough Data as Implicit Feedback

Thorsten Joachims, et al., SIGIR’05

• Controlled experiment

– Over-trust of the top-ranked positions

Page 35

• Pairwise preference matters

– Click: examined and clicked document

– Skip: examined but non-clicked document

Accurately Interpreting Clickthrough Data as Implicit Feedback

Thorsten Joachims, et al., SIGIR’05

Click > Skip

Page 36

What did we learn

• Predicting relative order

– Getting closer to the nature of ranking

• Promising performance in practice

– Pairwise preferences from click-throughs

Page 37

Listwise Learning to Rank

• Can we directly optimize the ranking?

– 𝑓 → order→metric

• Tackle the challenge

– Optimization without gradient

Page 38

From RankNet to LambdaRank to LambdaMART: An Overview

Christopher J.C. Burges, 2010

• Minimizing mis-ordered pairs => maximizing IR metrics?

Ranking 1: mis-ordered pairs: 6, AP: 5/8, DCG: 1.333
Ranking 2: mis-ordered pairs: 4, AP: 5/12, DCG: 0.931

Position is crucial!
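The conflict can be reproduced with a constructed example (my own lists, not the slide's exact ones): ranking B has fewer mis-ordered pairs than A, yet a lower AP and DCG, because the pairwise count ignores where in the list the swaps happen.

```python
import numpy as np

def misordered_pairs(ranked_labels):
    return sum(1 for i in range(len(ranked_labels)) for j in range(i + 1, len(ranked_labels))
               if ranked_labels[i] < ranked_labels[j])

def average_precision(ranked_labels):
    hits, precs = 0, []
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precs.append(hits / rank)
    return sum(precs) / len(precs)

def dcg(ranked_labels):
    return sum(rel / np.log2(rank + 1) for rank, rel in enumerate(ranked_labels, start=1))

A = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # one relevant doc at the top, one at the bottom
B = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]   # both relevant docs in the middle

for name, r in [("A", A), ("B", B)]:
    print(name, "inversions:", misordered_pairs(r),
          "AP:", round(average_precision(r), 3), "DCG:", round(dcg(r), 3))
```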

Page 39

From RankNet to LambdaRank to LambdaMART: An Overview

Christopher J.C. Burges, 2010

• Weight the mis-ordered pairs?

– Some pairs are more important to place in the right order

– Inject the weights into the objective function
• Σ_{y_i > y_j} Ω(d_i, d_j) exp[−w ΔΦ(q_n, d_i, d_j)]

– Or inject them into the gradient
• λ_ij = (∂O_approx / ∂w) · ΔO

∂O_approx / ∂w: the gradient with respect to the approximated objective, i.e., the exponential loss on mis-ordered pairs

ΔO: the change in the original objective, e.g., NDCG, if we switch documents i and j, leaving the other documents unchanged; it depends on the positions of documents i and j in the whole list
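A rough sketch of the lambda idea (my own simplification of the formulas above, using |ΔNDCG| as the pair weight): for every preferred pair, take the gradient of the exponential surrogate and scale it by how much NDCG would change if the two documents swapped places.

```python
import numpy as np

labels = np.array([2, 0, 1, 0])            # graded relevance, in the current ranked order
scores = np.array([3.0, 2.5, 2.0, 1.0])    # current model scores (already sorted descending)

def dcg(ls):
    return float(sum((2 ** l - 1) / np.log2(r + 2) for r, l in enumerate(ls)))

ideal = dcg(sorted(labels, reverse=True))
lambdas = np.zeros(len(labels))
for i in range(len(labels)):
    for j in range(len(labels)):
        if labels[i] > labels[j]:                        # document i should be ranked above j
            swapped = labels.copy()
            swapped[i], swapped[j] = swapped[j], swapped[i]
            delta_ndcg = abs(dcg(labels) - dcg(swapped)) / ideal   # |change in NDCG| if i, j swap
            force = np.exp(scores[j] - scores[i]) * delta_ndcg     # exp-surrogate gradient, rescaled
            lambdas[i] += force                          # push the preferred document up ...
            lambdas[j] -= force                          # ... and the other one down

print("lambda per document:", np.round(lambdas, 3))
# A step "scores += eta * lambdas" would nudge the scores toward a better NDCG.
```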

Page 40

• Lambda functions

– Gradient?

• Yes, it meets the necessary and sufficient condition for being a partial derivative

– Does it lead to the optimal solution of the original problem?

• Empirically

From RankNet to LambdaRank to LambdaMART: An Overview

Christopher J.C. Burges, 2010

Page 41

• Evolution

From RankNet to LambdaRank to LambdaMART: An Overview

Christopher J.C. Burges, 2010

RankNet
– Objective: cross entropy over the pairs
– Gradient (λ function): gradient of cross entropy
– Optimization method: neural network, stochastic gradient descent

LambdaRank
– Objective: unknown (optimized solely by the gradient)
– Gradient (λ function): gradient of cross entropy times the pairwise change in the target metric
– Optimization method: neural network, stochastic gradient descent

LambdaMART
– Objective: unknown (optimized solely by the gradient)
– Gradient (λ function): gradient of cross entropy times the pairwise change in the target metric
– Optimization method: Multiple Additive Regression Trees (MART), a non-linear combination, as we discussed for RankBoost

Page 42

• A Lambda tree

From RankNet to LambdaRank to LambdaMART: An Overview

Christopher J.C. Burges, 2010

[Figure: a lambda tree; each node splits on a combination of features.]

Page 43

AdaRank: a boosting algorithm for information retrieval

Jun Xu & Hang Li, SIGIR’07

• Loss defined by IR metrics

– Σ_{q ∈ Q} Pr(q) exp[−O(q)]

– Optimized by boosting; target metrics O(q): MAP, NDCG, MRR

[Figure (from Pattern Recognition and Machine Learning, p. 658): the same committee-vote picture as for RankBoost, but here the query distribution Pr(q) is updated between rounds and each weak ranker (BM25, PageRank, cosine) receives a credibility weight.]
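A compact sketch of the boosting loop described here (in the spirit of AdaRank, not the paper's exact algorithm; the per-query performance numbers are invented): pick the weak ranker that does best under the current query distribution Pr(q), give it a vote weight, and re-weight the queries it handles poorly.

```python
import numpy as np

# Per-query performance (e.g., NDCG in [0, 1)) of three candidate weak rankers;
# rows: features, columns: queries. Numbers are made up for illustration.
perf = np.array([
    [0.9, 0.2, 0.3],    # BM25
    [0.4, 0.8, 0.3],    # PageRank
    [0.3, 0.4, 0.7],    # Cosine
])
names = ["BM25", "PageRank", "Cosine"]

p = np.ones(perf.shape[1]) / perf.shape[1]      # Pr(q), initially uniform
ensemble = []
for t in range(3):
    k = int(np.argmax(perf @ p))                # weak ranker best on the weighted queries
    alpha = 0.5 * np.log((p @ (1 + perf[k])) / (p @ (1 - perf[k])))  # its vote weight
    ensemble.append((names[k], round(float(alpha), 3)))
    p *= np.exp(-alpha * perf[k])               # up-weight queries it handled badly
    p /= p.sum()
print(ensemble)
```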

Page 44

A Support Vector Machine for Optimizing Average Precision

Yisong Yue, et al., SIGIR’07

RankingSVM
• Minimizes the pairwise loss: a loss defined on the number of mis-ordered document pairs

SVM-MAP
• Minimizes a structural loss: a loss defined on the quality of the whole list of ordered documents

(The structural loss is measured by the MAP difference.)

Page 45

• Max margin principle

– Push the ground-truth far away from any mistakes you might make

– Finding the most violated constraints

A Support Vector Machine for Optimizing Average Precision

Yisong Yue, et al., SIGIR’07

Page 46

• Finding the most violated constraints

– MAP is invariant to permutation of (ir)relevant documents

– Maximize MAP over a series of swaps between relevant and irrelevant documents

A Support Vector Machine for Optimizing Average Precision

Yisong Yue, et al., SIGIR’07

Right-hand side of constraints

Greedy solution: start from the reverse order of the ideal ranking

Page 47

• Experiment results

A Support Vector Machine for Optimizing Average Precision

Yisong Yue, et al., SIGIR’07

Page 48

Other listwise solutions

• Soften the metrics to make them differentiable

– Michael Taylor et al., SoftRank: optimizing non-smooth rank metrics, WSDM'08

• Minimize a loss function defined on permutations

– Zhe Cao et al., Learning to rank: from pairwise approach to listwise approach, ICML'07

Page 49

What did we learn

• Taking a list of documents as a whole

– Positions are visible for the learning algorithm

– Directly optimizing the target metric

• Limitation

– The search space is huge!

Page 50

Summary

• Learning to rank
– Automatic combination of ranking features for optimizing IR evaluation metrics

• Approaches
– Pointwise: fit the relevance labels individually
– Pairwise: fit the relative orders
– Listwise: fit the whole order

Page 51

Experimental Comparisons

• Ranking performance

Page 52

Experimental Comparisons

• Winning count

– Over seven different data sets

Page 53

Experimental Comparisons

• My experiments

– 1.2k queries, 45.5K documents with 1890 features

– 800 queries for training, 400 queries for testing

Method       MAP     P@1     ERR     MRR     NDCG@5
ListNET      0.2863  0.2074  0.1661  0.3714  0.2949
LambdaMART   0.4644  0.4630  0.2654  0.6105  0.5236
RankNET      0.3005  0.2222  0.1873  0.3816  0.3386
RankBoost    0.4548  0.4370  0.2463  0.5829  0.4866
RankingSVM   0.3507  0.2370  0.1895  0.4154  0.3585
AdaRank      0.4321  0.4111  0.2307  0.5482  0.4421
pLogistic    0.4519  0.3926  0.2489  0.5535  0.4945
Logistic     0.4348  0.3778  0.2410  0.5526  0.4762

Page 54

Analysis of the Approaches

• What are they really optimizing?

– Relation with IR metrics

(There is a gap between what these approaches optimize and the IR metrics.)

Page 55

Pointwise Approaches

• Regression based

• Classification based

[Bounds omitted: the regression loss combined with the discount coefficients in DCG]

[Bounds omitted: the classification loss combined with the discount coefficients in DCG]

Page 56

[Figure from the referenced analysis omitted.]

Page 57

[Figure from the referenced analysis omitted; it involves the discount coefficients in DCG.]

Page 58

Listwise Approaches

• No general analysis

– Method dependent

– Directness and consistency

Page 59

Connection with Traditional IR

• People foresaw this topic a long time ago

– It fits nicely into the risk minimization framework

Page 60

Applying Bayesian Decision Theory

[Diagram: given a query q issued by user U against document collection C from source S, the system considers choices (D_1, π_1), (D_2, π_2), …, (D_n, π_n), i.e., a document set D plus a ranking π, each with its own loss L.]

Bayes risk for a choice (D, π), with θ hidden and (q, U, C, S) observed:

(D*, π*) = argmin_{D,π} ∫ L(D, π, θ) p(θ | q, U, C, S) dθ

RISK MINIMIZATION: the loss L plays the role of the metric to be optimized, and the observed variables supply the available ranking features.

Page 61

Traditional Solution

• Set-based models (choose D)
– e.g., the Boolean model

• Ranking models (choose π)
– Independent loss (pointwise)
• Relevance-based loss
• Distance-based loss
– Dependent loss (pairwise/listwise)
• MMR loss
• MDR loss

[The figure also places the traditional models in this taxonomy: probabilistic relevance model, generative relevance theory, vector-space model, two-stage LM, KL-divergence model, and the subtopic retrieval model.]

Page 62

Traditional Notion of Relevance

[The same relevance-modeling diagram as on Page 5: similarity between Rep(q) and Rep(d) (vector space model; prob. distr. model), probability of relevance P(r=1|q,d) (classical prob. model; LM approach; regression model), and probabilistic inference P(d→q) or P(q→d) (prob. concept space model; inference network model), plus divergence from randomness, learning to rank, and relevance constraints.]

Page 63

Broader Notion of Relevance

• Traditional view
– Content-driven (query-document specific, unsupervised)
• Vector space model
• Probability relevance model
• Language model

• Modern view
– Anything related to the quality of the document (query, document, and query-document specific; supervised)
• Clicks/views
• Link structure
• Visual structure
• Social network
• …

Page 64

Broader Notion of Relevance

[Figure: the notion of relevance broadens from left to right. First, only query-document matching features (BM25, language model, cosine); then linkage structure and query relations are added; finally clicks/views, visual structure, the social network, and "likes" also feed the scoring of documents against the query.]

Page 65

Future

• Tighter bounds

• Faster solution

• Larger scale

• Wider application scenario

Page 66

Resources

• Books
– Liu, Tie-Yan. Learning to Rank for Information Retrieval. Vol. 13. Springer, 2011.
– Li, Hang. "Learning to rank for information retrieval and natural language processing." Synthesis Lectures on Human Language Technologies 4.1 (2011): 1-113.

• Helpful pages
– http://en.wikipedia.org/wiki/Learning_to_rank

• Packages
– RankingSVM: http://svmlight.joachims.org/
– RankLib: http://people.cs.umass.edu/~vdang/ranklib.html

• Data sets
– LETOR: http://research.microsoft.com/en-us/um/beijing/projects/letor//
– Yahoo! Learning to Rank Challenge: http://learningtorankchallenge.yahoo.com/

Page 67

References

• Liu, Tie-Yan. "Learning to rank for information retrieval." Foundations and Trends in Information Retrieval 3.3 (2009): 225-331.

• Cossock, David, and Tong Zhang. "Subset ranking using regression." Learning Theory (2006): 605-619.

• Shashua, Amnon, and Anat Levin. "Ranking with large margin principle: Two approaches." Advances in Neural Information Processing Systems 15 (2003): 937-944.

• Joachims, Thorsten. "Optimizing search engines using clickthrough data." Proceedings of the eighth ACM SIGKDD. ACM, 2002.

• Freund, Yoav, et al. "An efficient boosting algorithm for combining preferences." The Journal of Machine Learning Research 4 (2003): 933-969.

• Zheng, Zhaohui, et al. "A regression framework for learning ranking functions using relative relevance judgments." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.

Page 68

References

• Joachims, Thorsten, et al. "Accurately interpreting clickthrough data as implicit feedback." Proceedings of the 28th annual international ACM SIGIR. ACM, 2005.

• Burges, C. "From RankNet to LambdaRank to LambdaMART: An overview." Learning 11 (2010): 23-581.

• Xu, Jun, and Hang Li. "AdaRank: a boosting algorithm for information retrieval." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.

• Yue, Yisong, et al. "A support vector method for optimizing average precision." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.

• Taylor, Michael, et al. "SoftRank: optimizing non-smooth rank metrics." Proceedings of the international conference on WSDM. ACM, 2008.

• Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th ICML. ACM, 2007.

Page 69

Thank you!

• Q&A

Page 70

Recap of last lecture

• Goal

– Design the ranking module for Bing.com

Page 71

Basic Search Engine Architecture

• The anatomy of a large-scale hypertextual Web search engine

– Sergey Brin and Lawrence Page. Computer networks and ISDN systems 30.1 (1998): 107-117.

Your Job

Page 72

Learning to Rank

• Given: (query, document) pairs represented by a set of relevance estimators, a.k.a., features

• Needed: a way of combining the estimators

– f: {(q, d_i)}_{i=1}^{D} → ordered {d_i}_{i=1}^{D}

• Criterion: optimize IR metrics

– P@k, MAP, NDCG, etc.

QueryID DocID BM25 LM PageRank Label

0001 0001 1.6 1.1 0.9 0

0001 0002 2.7 1.9 0.2 1

Key!

Page 73

Challenge: how to optimize?

• Order is essential!

– 𝑓 → order→metric

• Evaluation metrics are not continuous and not differentiable

Page 74

Approximating the Objectives!

• Pointwise

– Fit the relevance labels individually

• Pairwise

– Fit the relative orders

• Listwise

– Fit the whole order

Page 75

Pointwise Learning to Rank

• Ideally perfect relevance prediction leads to perfect ranking

– 𝑓 → score→ order →metric

• Reduce the ranking problem to

– Regression
http://en.wikipedia.org/wiki/Regression_analysis

– Classification
http://en.wikipedia.org/wiki/Statistical_classification

Page 76

Deficiency

• Cannot directly optimize IR metrics
– Predicting (0 → 1, 2 → 0) is worse for ranking than (0 → −2, 2 → 4), even though its regression error is smaller

• Positions of documents are ignored
– The penalty on documents at higher positions should be larger

• Favors queries with more documents

[Figure: the example predictions for documents d_1 and d_2 placed on a score line.]

Page 77

Pairwise Learning to Rank

• Ideally perfect partial order leads to perfect ranking

– 𝑓 → partial order→ order →metric

• Ordinal regression
– O(f(Q,D), Y) = Σ_{i≠j} δ(y_i > y_j) δ(f(q_i, d_i) < f(q_i, d_j))

• Relative ordering between different documents is significant

[Figure: the same score-line illustration as on Page 76.]

Page 78

RankingSVM
Thorsten Joachims, KDD’02

• Minimizing the number of mis-ordered pairs

y_1 > y_2, y_2 > y_3, y_1 > y_4

f(q, d) = w^T X_{q,d}, a linear combination of features

Keep the relative orders (e.g., y_1 > y_2)

QueryID DocID BM25 PageRank Label

0001 0001 1.6 0.9 4

0001 0002 2.7 0.2 2

0001 0003 1.3 0.2 1

0001 0004 1.2 0.7 0

Page 79

RankingSVM
Thorsten Joachims, KDD’02

• Minimizing the number of mis-ordered pairs

wΔΦ(q_n, d_i, d_j) ≥ 1 − ξ_{i,j,n}

Page 80

General Idea of Pairwise Learning to Rank

• For any pair of 𝑦𝑖 > 𝑦𝑗

• The surrogate losses, as functions of the margin wΔΦ(q_n, d_i, d_j):
– 0/1 loss: δ(wΔΦ(q_n, d_i, d_j) < 0)
– Hinge loss (RankingSVM): max(0, 1 − wΔΦ(q_n, d_i, d_j))
– Square loss (GBRank): max(0, 1 − wΔΦ(q_n, d_i, d_j))²
– Exponential loss: exp(−wΔΦ(q_n, d_i, d_j))
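To close, a tiny sketch (my own illustration) evaluating the four surrogate losses above at a few values of the margin z = wΔΦ(q_n, d_i, d_j) for a preferred pair with y_i > y_j.

```python
import numpy as np

def zero_one(z):    return (z < 0).astype(float)          # count the pair only if mis-ordered
def hinge(z):       return np.maximum(0.0, 1.0 - z)       # RankingSVM
def sq_hinge(z):    return np.maximum(0.0, 1.0 - z) ** 2  # square loss (GBRank-style)
def exponential(z): return np.exp(-z)                     # RankBoost

z = np.linspace(-2, 2, 5)
for name, fn in [("0/1", zero_one), ("hinge", hinge), ("square", sq_hinge), ("exp", exponential)]:
    print(f"{name:>6}:", np.round(fn(z), 2))
```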