Foundations of Machine Learning Ranking
Mehryar Mohri, Courant Institute and Google Research
Motivation
Very large data sets:
• too large to display or process.
• limited resources, need priorities.
• ranking more desirable than classification.
Applications:
• search engines, information extraction.
• decision making, auctions, fraud detection.
Can we learn to predict ranking accurately?
Related Problem
Rank aggregation: given $n$ candidates and $k$ voters each giving a ranking of the candidates, find an ordering as close as possible to these.
• closeness measured in number of pairwise misrankings.
• problem NP-hard even for $k = 4$ (Dwork et al., 2001).
This Talk
Score-based ranking
Preference-based ranking
Score-Based Setting
Single stage: learning algorithm
• receives labeled sample of pairwise preferences;
• returns scoring function $h\colon U \to \mathbb{R}$.
Drawbacks:
• $h$ induces a linear ordering for the full set $U$.
• does not match a query-based scenario.
Advantages:
• efficient algorithms.
• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).
Score-Based Ranking
Training data: sample of i.i.d. labeled pairs drawn from $U \times U$ according to some distribution $D$,
$$S = \big((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m)\big) \in (U \times U \times \{-1, 0, +1\})^m,$$
with
$$y_i = \begin{cases} +1 & \text{if } x'_i >_{\text{pref}} x_i \\ 0 & \text{if } x_i =_{\text{pref}} x'_i \text{ or no information} \\ -1 & \text{if } x'_i <_{\text{pref}} x_i. \end{cases}$$
Problem: find hypothesis $h\colon U \to \mathbb{R}$ in $H$ with small generalization error
$$R(h) = \Pr_{(x, x') \sim D}\Big[\big(f(x, x') \neq 0\big) \wedge \big(f(x, x')\,(h(x') - h(x)) \leq 0\big)\Big].$$
Notes
Empirical error:
$$\hat R(h) = \frac{1}{m} \sum_{i=1}^m 1_{(y_i \neq 0) \wedge (y_i (h(x'_i) - h(x_i)) \leq 0)}\,.$$
The relation $x\,R\,x' \Leftrightarrow f(x, x') = 1$ may be non-transitive (it need not even be antisymmetric). The problem is therefore different from classification.
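As a concrete check, the empirical error above is just the fraction of informative pairs ($y_i \neq 0$) that a scoring function misorders. A minimal sketch (the scorer `h` and the toy pairs are illustrative, not from the lecture):

```python
# Empirical pairwise misranking error: fraction of labeled pairs
# with y_i != 0 on which the score difference disagrees with the label.
def empirical_ranking_error(h, pairs):
    """pairs: iterable of (x, x_prime, y) with y in {-1, 0, +1}."""
    informative = [(x, xp, y) for (x, xp, y) in pairs if y != 0]
    if not informative:
        return 0.0
    errors = sum(1 for (x, xp, y) in informative
                 if y * (h(xp) - h(x)) <= 0)
    return errors / len(informative)

# Toy example: the scoring function is the identity on the reals.
h = lambda x: x
pairs = [(1, 2, +1), (3, 1, -1), (2, 2, 0), (1, 0, +1)]
# (1,2,+1) and (3,1,-1) are ranked correctly; (2,2,0) is ignored;
# (1,0,+1) is misranked, so the error is 1/3.
print(empirical_ranking_error(h, pairs))  # -> 0.3333333333333333
```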
Distributional Assumptions
Distribution over points: $m$ points (literature).
• labels for pairs.
• squared number of examples: $O(m^2)$.
• dependency issue.
Distribution over pairs: $m$ pairs.
• label for each pair received.
• independence assumption.
• same (linear) number of examples.
Confidence Margin in Ranking
Labels assumed to be in $\{-1, +1\}$.
Empirical margin loss for ranking: for $\rho > 0$,
$$\hat R_\rho(h) = \frac{1}{m} \sum_{i=1}^m \Phi_\rho\big(y_i\,(h(x'_i) - h(x_i))\big) \leq \frac{1}{m} \sum_{i=1}^m 1_{y_i [h(x'_i) - h(x_i)] \leq \rho}\,.$$
Marginal Rademacher Complexities
Distributions:
• $D_1$: marginal distribution with respect to the first element of the pairs;
• $D_2$: marginal distribution with respect to the second element of the pairs.
Samples:
$$S_1 = \big((x_1, y_1), \ldots, (x_m, y_m)\big) \qquad S_2 = \big((x'_1, y_1), \ldots, (x'_m, y_m)\big).$$
Marginal Rademacher complexities:
$$\mathfrak{R}_m^{D_1}(H) = \mathbb{E}\big[\hat{\mathfrak{R}}_{S_1}(H)\big] \qquad \mathfrak{R}_m^{D_2}(H) = \mathbb{E}\big[\hat{\mathfrak{R}}_{S_2}(H)\big].$$
Ranking Margin Bound
Theorem (Boyd, Cortes, MM, and Radovanovic, 2012; MM, Rostamizadeh, and Talwalkar, 2012): let $H$ be a family of real-valued functions. Fix $\rho > 0$; then, for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample of size $m$, the following holds for all $h \in H$:
$$R(h) \leq \hat R_\rho(h) + \frac{2}{\rho}\Big(\mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H)\Big) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}\,.$$
Ranking with SVMs
Optimization problem: application of SVMs (see for example (Joachims, 2002)):
$$\min_{w, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i$$
$$\text{subject to: } y_i\big(w \cdot [\Phi(x'_i) - \Phi(x_i)]\big) \geq 1 - \xi_i,\quad \xi_i \geq 0,\ i \in [1, m].$$
Decision function:
$$h\colon x \mapsto w \cdot \Phi(x) + b.$$
Notes
The algorithm coincides with SVMs using the feature mapping
$$\Psi\colon (x, x') \mapsto \Psi(x, x') = \Phi(x') - \Phi(x).$$
Can be used with kernels:
$$K'\big((x_i, x'_i), (x_j, x'_j)\big) = \Psi(x_i, x'_i) \cdot \Psi(x_j, x'_j) = K(x_i, x_j) + K(x'_i, x'_j) - K(x'_i, x_j) - K(x_i, x'_j).$$
Algorithm directly based on the margin bound.
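Since ranking with SVMs reduces to a standard SVM over the difference vectors $\Phi(x') - \Phi(x)$, a minimal sketch needs only subgradient descent on the hinge-loss objective. The synthetic data, learning rate, and epoch count below are illustrative choices, with $\Phi$ taken to be the identity:

```python
import numpy as np

def rank_svm(X, Xp, y, C=1.0, lr=0.01, epochs=200):
    """Linear RankSVM sketch: subgradient descent on
    1/2 ||w||^2 + C * sum_i hinge(y_i * w.(x'_i - x_i)), with Phi = identity."""
    w = np.zeros(X.shape[1])
    diffs = Xp - X                          # difference vectors Phi(x') - Phi(x)
    for _ in range(epochs):
        viol = y * (diffs @ w) < 1          # margin-violating pairs
        subgrad = w - C * (y[viol][:, None] * diffs[viol]).sum(axis=0)
        w -= lr * subgrad
    return w

# Synthetic pairs labeled by a hidden linear score x . w_true.
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])
X, Xp = rng.normal(size=(200, 3)), rng.normal(size=(200, 3))
y = np.sign((Xp - X) @ w_true)
w = rank_svm(X, Xp, y)
acc = np.mean(np.sign((Xp - X) @ w) == y)   # pairwise ordering accuracy
print(f"pairwise accuracy: {acc:.2f}")
```

The learned `w` is then used through the decision function $x \mapsto w \cdot x$; only score differences matter for ranking, so the bias $b$ is irrelevant here.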
Boosting for Ranking
Use weak ranking algorithm and create stronger ranking algorithm.
Ensemble method: combine base rankers returned by weak ranking algorithm.
Finding simple relatively accurate base rankers often not hard.
How should base rankers be combined?
CD RankBoost
(Freund et al., 2003; Rudin et al., 2005)
$H \subseteq \{0, 1\}^X$. Notation: $\epsilon^0_t + \epsilon^+_t + \epsilon^-_t = 1$, with
$$\epsilon^s_t(h) = \Pr_{(x, x') \sim D_t}\Big[\operatorname{sgn}\big(f(x, x')\,(h(x') - h(x))\big) = s\Big].$$

RankBoost$\big(S = ((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m))\big)$
1  for $i \leftarrow 1$ to $m$ do
2      $D_1(x_i, x'_i) \leftarrow \frac{1}{m}$
3  for $t \leftarrow 1$ to $T$ do
4      $h_t \leftarrow$ base ranker in $H$ with smallest $\epsilon^-_t - \epsilon^+_t = -\mathbb{E}_{i \sim D_t}\big[y_i\,(h_t(x'_i) - h_t(x_i))\big]$
5      $\alpha_t \leftarrow \frac{1}{2} \log \frac{\epsilon^+_t}{\epsilon^-_t}$
6      $Z_t \leftarrow \epsilon^0_t + 2\,[\epsilon^+_t \epsilon^-_t]^{\frac{1}{2}}$  ▷ normalization factor
7      for $i \leftarrow 1$ to $m$ do
8          $D_{t+1}(x_i, x'_i) \leftarrow \dfrac{D_t(x_i, x'_i)\, \exp\big(-\alpha_t y_i (h_t(x'_i) - h_t(x_i))\big)}{Z_t}$
9  $f_T \leftarrow \sum_{t=1}^T \alpha_t h_t$
10 return $f_T$
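A compact Python sketch of the RankBoost pseudocode, using threshold functions on a single real feature as the base ranker class $H \subseteq \{0,1\}^X$ (the base rankers, toy data, and number of rounds are illustrative choices, not from the lecture):

```python
import numpy as np

def rankboost(pairs, thresholds, T=30):
    """pairs: list of (x, xp, y) with scalar x and y in {-1, +1}.
    Base rankers: h_theta(x) = 1 if x > theta else 0."""
    m = len(pairs)
    xs = np.array([p[0] for p in pairs])
    xps = np.array([p[1] for p in pairs])
    ys = np.array([p[2] for p in pairs], dtype=float)
    D = np.full(m, 1.0 / m)                 # distribution over pairs
    tiny = 1e-12
    alphas, chosen = [], []
    for _ in range(T):
        # pick the base ranker maximizing |eps_t^+ - eps_t^-| under D
        best = None
        for th in thresholds:
            diff = (xps > th).astype(float) - (xs > th).astype(float)
            r = np.sum(D * ys * diff)       # eps_t^+ - eps_t^-
            if best is None or abs(r) > abs(best[1]):
                best = (th, r)
        th = best[0]
        diff = (xps > th).astype(float) - (xs > th).astype(float)
        ep = np.sum(D * (ys * diff > 0))
        em = np.sum(D * (ys * diff < 0))
        alpha = 0.5 * np.log((ep + tiny) / (em + tiny))
        D = D * np.exp(-alpha * ys * diff)
        D /= D.sum()                        # normalization Z_t
        alphas.append(alpha); chosen.append(th)
    def f(x):                               # f_T = sum_t alpha_t h_t
        return sum(a * (x > th) for a, th in zip(alphas, chosen))
    return f

# Toy data: the true score is x itself.
rng = np.random.default_rng(0)
xs, xps = rng.uniform(0, 1, 100), rng.uniform(0, 1, 100)
pairs = [(a, b, 1 if b > a else -1) for a, b in zip(xs, xps)]
f = rankboost(pairs, thresholds=np.linspace(0, 1, 21))
err = np.mean([y * (f(b) - f(a)) <= 0 for a, b, y in pairs])
print(f"pairwise error: {err:.2f}")
```

The combined scorer is a monotone step function, so the residual error comes only from pairs falling in the same flat region between chosen thresholds.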
Notes
Distributions $D_t$ over pairs of sample points:
• originally uniform.
• at each round, the weight of a misclassified example is increased.
• observation: $D_{t+1}(x, x') = \dfrac{e^{-y [f_t(x') - f_t(x)]}}{|S| \prod_{s=1}^t Z_s}$, since
$$D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-y \alpha_t [h_t(x') - h_t(x)]}}{Z_t} = \frac{1}{|S|} \frac{e^{-y \sum_{s=1}^t \alpha_s [h_s(x') - h_s(x)]}}{\prod_{s=1}^t Z_s}\,.$$
Weight $\alpha_t$ assigned to base classifier $h_t$: directly depends on the accuracy of $h_t$ at round $t$.
Coordinate Descent RankBoost
Objective function: convex and differentiable ($e^{-x}$ upper-bounds the 0-1 pairwise loss):
$$F(\alpha) = \sum_{(x, x', y) \in S} e^{-y [f_T(x') - f_T(x)]} = \sum_{(x, x', y) \in S} \exp\Big(-y \sum_{t=1}^T \alpha_t [h_t(x') - h_t(x)]\Big).$$
• Direction: unit vector $e_t$ with
$$e_t = \operatorname*{argmin}_t \frac{dF(\alpha + \eta e_t)}{d\eta}\bigg|_{\eta = 0}.$$
• Since $F(\alpha + \eta e_t) = \sum_{(x, x', y) \in S} e^{-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]}\, e^{-y \eta [h_t(x') - h_t(x)]}$,
$$\frac{dF(\alpha + \eta e_t)}{d\eta}\bigg|_{\eta = 0} = -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]\Big)$$
$$= -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x') \Big[m \prod_{s=1}^T Z_s\Big] = -[\epsilon^+_t - \epsilon^-_t] \Big[m \prod_{s=1}^T Z_s\Big].$$
Thus, the direction corresponds to the base classifier selected by the algorithm.
• Step size: obtained by setting $\dfrac{dF(\alpha + \eta e_t)}{d\eta} = 0$:
$$-\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]\Big) e^{-y \eta [h_t(x') - h_t(x)]} = 0$$
$$\Leftrightarrow -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x') \Big[m \prod_{s=1}^T Z_s\Big] e^{-y \eta [h_t(x') - h_t(x)]} = 0$$
$$\Leftrightarrow -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)]\, D_{T+1}(x, x')\, e^{-y \eta [h_t(x') - h_t(x)]} = 0$$
$$\Leftrightarrow -[\epsilon^+_t e^{-\eta} - \epsilon^-_t e^{\eta}] = 0 \Leftrightarrow \eta = \frac{1}{2} \log \frac{\epsilon^+_t}{\epsilon^-_t}\,.$$
Thus, the step size matches the base classifier weight used in the algorithm.
Bipartite Ranking
Training data:
• sample of negative points $S_- = (x_1, \ldots, x_m) \in U^m$ drawn according to $D_-$;
• sample of positive points $S_+ = (x'_1, \ldots, x'_{m'}) \in U^{m'}$ drawn according to $D_+$.
Problem: find hypothesis $h\colon U \to \mathbb{R}$ in $H$ with small generalization error
$$R_D(h) = \Pr_{x \sim D_-,\, x' \sim D_+}\big[h(x') < h(x)\big].$$
Notes
Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05):
• if a constant base ranker is used.
• relationship between the objective functions.
More efficient algorithm in this special case (Freund et al., 2003).
Bipartite ranking results are typically reported in terms of the AUC.
AdaBoost and CD RankBoost
Objective functions: comparison.
$$F_{\mathrm{Ada}}(\alpha) = \sum_{x_i \in S_- \cup S_+} \exp(-y_i f(x_i)) = \sum_{x_i \in S_-} \exp(+f(x_i)) + \sum_{x_i \in S_+} \exp(-f(x_i)) = F_-(\alpha) + F_+(\alpha).$$
$$F_{\mathrm{Rank}}(\alpha) = \sum_{(i, j) \in S_- \times S_+} \exp\big(-[f(x_j) - f(x_i)]\big) = \sum_{(i, j) \in S_- \times S_+} \exp(+f(x_i)) \exp(-f(x_j)) = F_-(\alpha)\, F_+(\alpha).$$
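The product-vs-sum relationship between the two objectives is easy to sanity-check numerically for any fixed scoring function $f$ (the toy scores below are illustrative):

```python
import numpy as np

# Scores f(x) on negative and positive samples (arbitrary toy values).
f_neg = np.array([0.2, -0.5, 1.0])      # x_i in S-
f_pos = np.array([0.8, 1.5])            # x_j in S+

F_minus = np.exp(+f_neg).sum()          # sum over negatives of e^{+f(x)}
F_plus = np.exp(-f_pos).sum()           # sum over positives of e^{-f(x)}

# AdaBoost objective: sum over all points of e^{-y f(x)}.
F_ada = F_minus + F_plus
# RankBoost objective: sum over neg-pos pairs of e^{-(f(x_j) - f(x_i))}.
F_rank = sum(np.exp(-(fj - fi)) for fi in f_neg for fj in f_pos)

print(np.isclose(F_rank, F_minus * F_plus))  # -> True (product identity)
```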
AdaBoost and CD RankBoost
Property: AdaBoost (non-separable case).
• constant base ranker $h = 1$: equal contribution of positive and negative points in the limit, $F_+(\alpha) = F_-(\alpha)$.
• consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective (Rudin et al., 2005).
Observation: if $F_+(\alpha) = F_-(\alpha)$,
$$d(F_{\mathrm{Rank}}) = F_+\, d(F_-) + F_-\, d(F_+) = F_+ \big(d(F_-) + d(F_+)\big) = F_+\, d(F_{\mathrm{Ada}}).$$
Bipartite RankBoost - Efficiency
Decomposition of the distribution: for $(x, x') \in (S_-, S_+)$,
$$D(x, x') = D_-(x)\, D_+(x').$$
Thus,
$$D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-\alpha_t [h_t(x') - h_t(x)]}}{Z_t} = \frac{D_{t,-}(x)\, e^{\alpha_t h_t(x)}}{Z_{t,-}} \cdot \frac{D_{t,+}(x')\, e^{-\alpha_t h_t(x')}}{Z_{t,+}},$$
with
$$Z_{t,-} = \sum_{x \in S_-} D_{t,-}(x)\, e^{\alpha_t h_t(x)} \qquad Z_{t,+} = \sum_{x' \in S_+} D_{t,+}(x')\, e^{-\alpha_t h_t(x')}.$$
ROC Curve
Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. false positive rate (FP) (Egan, 1975).
• TP: percentage of positive points correctly labeled positive.
• FP: percentage of negative points incorrectly labeled positive.
[Figure: ROC curve traced by sweeping a decision threshold over the sorted scores $h(x)$; points scoring below the threshold are predicted negative, points above it positive.]
Area under the ROC Curve (AUC)
Definition: the AUC is the area under the ROC curve (Hanley and McNeil, 1982); a measure of ranking quality. Equivalently,
$$\mathrm{AUC}(h) = \frac{1}{m m'} \sum_{i=1}^{m} \sum_{j=1}^{m'} 1_{h(x'_j) > h(x_i)} = \Pr_{x \sim \hat D_-,\, x' \sim \hat D_+}\big[h(x') > h(x)\big] = 1 - \hat R(h).$$
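The pairwise definition of the AUC can be computed directly in $O(mm')$, or via ranks in $O((m + m')\log(m + m'))$; a quick check that the two agree (toy scores illustrative, with ties counted as half, the usual convention not spelled out on the slide):

```python
import numpy as np

def auc_pairwise(neg_scores, pos_scores):
    """AUC as the fraction of (negative, positive) pairs ordered correctly;
    ties counted as half, matching the rank-based computation."""
    neg = np.asarray(neg_scores)[:, None]
    pos = np.asarray(pos_scores)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def auc_by_ranks(neg_scores, pos_scores):
    """Equivalent rank-based formula (Mann-Whitney U statistic)."""
    scores = np.concatenate([neg_scores, pos_scores])
    order = scores.argsort()
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks over ties
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    m, mp = len(neg_scores), len(pos_scores)
    return (ranks[m:].sum() - mp * (mp + 1) / 2) / (m * mp)

neg = np.array([0.1, 0.4, 0.35])
pos = np.array([0.8, 0.4, 0.9])
print(auc_pairwise(neg, pos), auc_by_ranks(neg, pos))
```

On these scores both computations give $8.5/9 \approx 0.944$: eight of the nine negative-positive pairs are ordered correctly and one is tied.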
This Talk
Score-based ranking
Preference-based ranking
Preference-Based Setting
Definitions:
• $U$: universe, full set of objects.
• $V$: finite query subset to rank, $V \subseteq U$.
• $\sigma^*$: target ranking for $V$ (random variable).
Two stages: can be viewed as a reduction.
• learn preference function $h\colon U \times U \to [0, 1]$.
• given $V$, use $h$ to determine a ranking $\sigma$ of $V$.
Running time: measured in terms of the number of calls to $h$.
Preference-Based Ranking Problem
Training data: pairs $(V, \sigma^*)$ sampled i.i.d. according to $D$ (subsets ranked by different labelers):
$$(V_1, \sigma^*_1), (V_2, \sigma^*_2), \ldots, (V_m, \sigma^*_m), \qquad V_i \subseteq U;$$
learn a classifier (preference function) $h\colon U \times U \to [0, 1]$.
Problem: for any query set $V \subseteq U$, use $h$ to return a ranking $\sigma_{h,V}$ close to the target $\sigma^*$ with small average error
$$R(h, \sigma) = \mathbb{E}_{(V, \sigma^*) \sim D}\big[L(\sigma_{h,V}, \sigma^*)\big].$$
Preference Function
$h(u, v)$: close to $1$ when $u$ is preferred to $v$, close to $0$ otherwise. For the analysis, $h(u, v) \in \{0, 1\}$.
Assumed pairwise consistent: $h(u, v) + h(v, u) = 1$.
May be non-transitive, e.g., we may have $h(u, v) = h(v, w) = h(w, u) = 1$.
Output of a classifier or a 'black box'.
Loss Functions
For a fixed $(V, \sigma^*)$, with $n = |V|$:
Preference loss:
$$L(h, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \neq v} h(u, v)\, \sigma^*(v, u).$$
Ranking loss:
$$L(\sigma, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \neq v} \sigma(u, v)\, \sigma^*(v, u).$$
(Weak) Regret
Preference regret:
$$R'_{\mathrm{class}}(h) = \mathbb{E}_{V, \sigma^*}\big[L(h_{|V}, \sigma^*)\big] - \mathbb{E}_V \min_{\tilde h} \mathbb{E}_{\sigma^*|V}\big[L(\tilde h, \sigma^*)\big].$$
Ranking regret:
$$R'_{\mathrm{rank}}(A) = \mathbb{E}_{V, \sigma^*, s}\big[L(A_s(V), \sigma^*)\big] - \mathbb{E}_V \min_{\tilde\sigma \in S(V)} \mathbb{E}_{\sigma^*|V}\big[L(\tilde\sigma, \sigma^*)\big].$$
Deterministic Algorithm
Stage one: standard classification. Learn preference function $h\colon U \times U \to [0, 1]$.
Stage two: sort-by-degree using the comparison function $h$ (Balcan et al., 07).
• sort by number of points ranked below.
• quadratic time complexity $O(n^2)$.
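Stage two of the deterministic reduction can be sketched directly: score each element by how many others the preference function ranks below it, then sort, using $O(|V|^2)$ calls to $h$, matching the quadratic complexity noted above. The preference function here is a toy transitive stand-in:

```python
def sort_by_degree(V, h):
    """Rank elements of V by out-degree in the preference tournament:
    the number of other elements u with h(v, u) = 1. O(|V|^2) calls to h."""
    degree = {v: sum(h(v, u) for u in V if u is not v) for v in V}
    return sorted(V, key=lambda v: degree[v], reverse=True)

# Toy preference function: larger numbers are preferred (transitive here).
h = lambda u, v: 1 if u > v else 0
print(sort_by_degree([3, 1, 4, 1.5, 9], h))  # -> [9, 4, 3, 1.5, 1]
```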
Randomized Algorithm
Stage one: standard classification. Learn preference function $h\colon U \times U \to [0, 1]$.
Stage two: randomized QuickSort (Hoare, 61) using $h$ as the comparison function (Ailon & MM, 08).
• comparison function non-transitive, unlike the textbook description.
• but time complexity shown to be $O(n \log n)$ in general.
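Randomized QuickSort runs unchanged even when $h$ is non-transitive: pick a random pivot, send each remaining element left or right according to $h$, and recurse. A sketch (the toy comparison function is illustrative):

```python
import random

def quicksort_by_h(V, h, rng=random):
    """Randomized QuickSort driven by a pairwise preference h(u, v) in {0, 1}
    with h(u, v) + h(v, u) = 1; h need not be transitive.
    Expected O(n log n) calls to h."""
    if len(V) <= 1:
        return list(V)
    pivot = rng.choice(V)
    left = [v for v in V if v is not pivot and h(v, pivot) == 1]   # preferred to pivot
    right = [v for v in V if v is not pivot and h(pivot, v) == 1]  # pivot preferred
    return quicksort_by_h(left, h, rng) + [pivot] + quicksort_by_h(right, h, rng)

# With a transitive h it reduces to ordinary sorting.
h = lambda u, v: 1 if u > v else 0   # "larger is preferred"
print(quicksort_by_h([3, 1, 4, 1.5, 9], h))  # -> [9, 4, 3, 1.5, 1]
```

With a cyclic $h$ the output depends on the random pivots, but it is always a full permutation of $V$, which is all the guarantees above require.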
Randomized QS
[Figure: a random pivot $u$ is chosen; each element $v$ goes to the left recursion if $h(v, u) = 1$ and to the right recursion if $h(u, v) = 1$.]
Deterministic Algo. - Bipartite Case
Bounds: for the deterministic sort-by-degree algorithm in the bipartite case ($V = V_+ \cup V_-$) (Balcan et al., 07):
• expected loss:
$$\mathbb{E}_{V, \sigma^*}\big[L(A(V), \sigma^*)\big] \leq 2\, \mathbb{E}_{V, \sigma^*}\big[L(h, \sigma^*)\big].$$
• regret:
$$R'_{\mathrm{rank}}(A(V)) \leq 2\, R'_{\mathrm{class}}(h).$$
Time complexity: $\Omega(|V|^2)$.
Randomized Algo. - Bipartite Case
Bounds: for randomized QuickSort (Hoare, 61) in the bipartite case ($V = V_+ \cup V_-$) (Ailon & MM, 08):
• expected loss (equality):
$$\mathbb{E}_{V, \sigma^*, s}\big[L(Q^h_s(V), \sigma^*)\big] = \mathbb{E}_{V, \sigma^*}\big[L(h, \sigma^*)\big].$$
• regret:
$$R'_{\mathrm{rank}}(Q^h_s(\cdot)) \leq R'_{\mathrm{class}}(h).$$
Time complexity:
• full set: $O(n \log n)$.
• top $k$: $O(n + k \log k)$.
Proof Ideas
QuickSort decomposition:
$$p_{uv} + \frac{1}{3} \sum_{w \notin \{u, v\}} p_{uvw}\big[h(u, w)\, h(w, v) + h(v, w)\, h(w, u)\big] = 1.$$
Bipartite property:
$$\sigma^*(u, v) + \sigma^*(v, w) + \sigma^*(w, u) = \sigma^*(v, u) + \sigma^*(w, v) + \sigma^*(u, w).$$
Lower Bound
Theorem: for any deterministic algorithm $A$, there is a bipartite distribution for which
$$R_{\mathrm{rank}}(A) \geq 2\, R_{\mathrm{class}}(h).$$
• thus, the factor of 2 is best possible in the deterministic case.
• randomization is necessary for a better bound.
Proof: take the simple case $U = V = \{u, v, w\}$ and assume that $h$ induces a cycle: $h(u, v) = h(v, w) = h(w, u) = 1$.
• up to symmetry, $A$ returns $u, v, w$ or $w, v, u$.
Lower Bound
If $A$ returns $u, v, w$, then choose $\sigma^*$ to label $u$ negative and $v, w$ positive.
If $A$ returns $w, v, u$, then choose $\sigma^*$ to label $w$ negative and $u, v$ positive.
In either case,
$$L[h, \sigma^*] = \frac{1}{3}; \qquad L[A, \sigma^*] = \frac{2}{3}.$$
Guarantees - General Case
Loss bound for QuickSort:
$$\mathbb{E}_{V, \sigma^*, s}\big[L(Q^h_s(V), \sigma^*)\big] \leq 2\, \mathbb{E}_{V, \sigma^*}\big[L(h, \sigma^*)\big].$$
Comparison with the optimal ranking (see (CSS 99)):
$$\mathbb{E}_s\big[L(Q^h_s(V), \sigma_{\mathrm{optimal}})\big] \leq 2\, L(h, \sigma_{\mathrm{optimal}})$$
$$\mathbb{E}_s\big[L(h, Q^h_s(V))\big] \leq 3\, L(h, \sigma_{\mathrm{optimal}}),$$
where $\sigma_{\mathrm{optimal}} = \operatorname*{argmin}_{\sigma} L(h, \sigma)$.
Weight Function
Generalization: weighted loss with
$$\tilde\sigma(u, v) = \sigma(u, v)\, \omega(\sigma(u), \sigma(v)).$$
Properties needed for all previous results to hold:
• symmetry: $\omega(i, j) = \omega(j, i)$ for all $i, j$.
• monotonicity: $\omega(i, j), \omega(j, k) \leq \omega(i, k)$ for $i < j < k$.
• triangle inequality: $\omega(i, j) \leq \omega(i, k) + \omega(k, j)$ for all triplets $i, j, k$.
Weight Function - Examples
Kemeny: $\omega(i, j) = 1$ for all $i, j$.
Top-$k$: $\omega(i, j) = 1$ if $i \leq k$ or $j \leq k$; $0$ otherwise.
Bipartite: $\omega(i, j) = 1$ if $i \leq k$ and $j > k$; $0$ otherwise.
$k$-partite: can be defined similarly.
(Strong) Regret Definitions
Ranking regret:
$$R_{\mathrm{rank}}(A) = \mathbb{E}_{V, \sigma^*, s}\big[L(A_s(V), \sigma^*)\big] - \min_{\tilde\sigma} \mathbb{E}_{V, \sigma^*}\big[L(\tilde\sigma_{|V}, \sigma^*)\big].$$
Preference regret:
$$R_{\mathrm{class}}(h) = \mathbb{E}_{V, \sigma^*}\big[L(h_{|V}, \sigma^*)\big] - \min_{\tilde h} \mathbb{E}_{V, \sigma^*}\big[L(\tilde h_{|V}, \sigma^*)\big].$$
All previous regret results hold if for all $u, v$ and all $V_1, V_2 \supseteq \{u, v\}$,
$$\mathbb{E}_{\sigma^*|V_1}\big[\sigma^*(u, v)\big] = \mathbb{E}_{\sigma^*|V_2}\big[\sigma^*(u, v)\big]$$
(pairwise independence on irrelevant alternatives).
References
• Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6, 393–425.
• Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT (pp. 32–47).
• Nir Ailon and Mehryar Mohri. An efficient reduction of ranking to classification. In Proceedings of COLT 2008. Helsinki, Finland, July 2008. Omnipress.
• Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. In Proceedings of COLT (pp. 604–619). Springer.
• Stephen Boyd, Corinna Cortes, Mehryar Mohri, and Ana Radovanovic (2012). Accuracy at the top. In NIPS 2012.
• Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. J. Artif. Intell. Res. (JAIR), 10, 243–270.
• Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT (pp. 605–619).
• Corinna Cortes and Mehryar Mohri. AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (NIPS 2003), 2004. MIT Press.
• Cortes, C., Mohri, M., and Rastogi, A. (2007a). An Alternative Ranking Problem for Search Engines. Proceedings of WEA 2007 (pp. 1–21). Rome, Italy: Springer.
• Corinna Cortes and Mehryar Mohri. Confidence Intervals for the Area under the ROC Curve. In Advances in Neural Information Processing Systems (NIPS 2004), 2005. MIT Press.
• Crammer, K., and Singer, Y. (2001). Pranking with ranking. Proceedings of NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada] (pp. 641–647). MIT Press.
• Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank Aggregation Methods for the Web. WWW 10, 2001. ACM Press.
• J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.
• Yoav Freund, Raj Iyer, Robert E. Schapire and Yoram Singer. An efficient boosting algorithm for combining preferences. JMLR 4:933-969, 2003.
• J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.
• Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115–132, 2000.
• Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press. 2012.
• Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web, Stanford Digital Library Technologies Project, 1998.
• Lehmann, E. L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco, California: Holden-Day.
• Cynthia Rudin, Corinna Cortes, Mehryar Mohri, and Robert E. Schapire. Margin-Based Ranking Meets Boosting in the Middle. In Proceedings of The 18th Annual Conference on Computational Learning Theory (COLT 2005), pages 63-78, 2005.
• Thorsten Joachims. Optimizing search engines using clickthrough data. Proceedings of the 8th ACM SIGKDD, pages 133-142, 2002.