introduction to machine learning lecture 16 (mohri/mlu/mlu_lecture_16.pdf · 2011-12-04 · …)
TRANSCRIPT
Introduction to Machine Learning - Lecture 16
Mehryar Mohri, Courant Institute and Google Research
Mehryar Mohri - Foundations of Machine Learning

Motivation
Very large data sets:
• too large to display or process.
• limited resources, need priorities.
• ranking more desirable than classification.
Applications:
• search engines, information extraction.
• decision making, auctions, fraud detection.
Can we learn to predict ranking accurately?
Score-Based Setting
Single stage: learning algorithm
• receives labeled sample of pairwise preferences;
• returns scoring function h: U → R.
Drawbacks:
• induces a linear ordering for the full set U.
• does not match a query-based scenario.
Advantages:
• efficient algorithms.
• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).
Preference-Based Setting
Definitions:
• U: universe, full set of objects.
• V: finite query subset to rank, V ⊆ U.
• τ∗: target ranking for V (random variable).
Two stages: can be viewed as a reduction.
• learn preference function h: U×U → [0, 1].
• given V, use h to determine ranking σ of V.
Running-time: measured in terms of the number of calls to h.
Related Problem
Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these.
• closeness measured in number of pairwise misrankings.
• problem NP-hard even for k = 4 (Dwork et al., 2001).
This Talk
Score-based ranking
Preference-based ranking
Score-Based Ranking
Training data: sample of i.i.d. labeled pairs drawn from U×U according to some distribution D,
S = ((x₁, x′₁, y₁), . . . , (x_m, x′_m, y_m)) ∈ (U×U×{−1, 0, +1})^m,
with y_i = +1 if x′_i >pref x_i; y_i = 0 if x′_i =pref x_i or no information; y_i = −1 if x′_i <pref x_i.
Problem: find hypothesis h: U → R in H with small generalization error
R_D(h) = Pr_{(x,x′)∼D}[ f(x, x′)(h(x′) − h(x)) < 0 ].
Notes
Empirical error:
R̂(h) = (1/m) Σ_{i=1}^m 1_{y_i(h(x′_i)−h(x_i))<0}.
The relation x R x′ ⇔ f(x, x′) = 1 may be non-transitive (it need not even be anti-symmetric).
Problem different from classification.
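The empirical error above is easy to compute directly. Below is a minimal sketch; the scoring function and toy sample are illustrative placeholders, not part of the lecture.

```python
# Empirical pairwise misranking error of a scoring function h on a
# sample of labeled pairs (x, x', y) with y in {-1, 0, +1}.

def empirical_error(h, pairs):
    """pairs: list of (x, xp, y); y = +1 means x' is preferred to x."""
    m = len(pairs)
    mistakes = sum(1 for x, xp, y in pairs if y * (h(xp) - h(x)) < 0)
    return mistakes / m

# Toy scoring function: score = identity on the reals.
h = lambda x: x
pairs = [(1.0, 2.0, +1),   # h ranks 2.0 above 1.0: correct
         (3.0, 2.0, +1),   # h ranks 2.0 below 3.0: misranked
         (5.0, 5.0, 0)]    # tie / no information: never counted
print(empirical_error(h, pairs))  # one of three pairs misranked
```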
Distributional Assumptions
Distribution over points: m points (common in the literature).
• labels for pairs.
• squared number of examples O(m²).
Distribution over pairs: m pairs.
• label for each pair received.
• independence assumption.
• same (linear) number of examples.
Boosting for Ranking
Use a weak ranking algorithm to create a stronger ranking algorithm.
Ensemble method: combine base rankers returned by the weak ranking algorithm.
Finding simple, relatively accurate base rankers is often not hard.
How should base rankers be combined?
CD RankBoost (Freund et al., 2003; Rudin et al., 2005)
H ⊆ {0, 1}^X. For s ∈ {−1, 0, +1}, define
ε^s_t(h) = Pr_{(x,x′)∼D_t}[ sgn(f(x, x′)(h(x′) − h(x))) = s ],   so that ε⁰_t + ε⁺_t + ε⁻_t = 1.

RankBoost(S = ((x₁, x′₁, y₁), . . . , (x_m, x′_m, y_m)))
1  for i ← 1 to m do
2      D₁(x_i, x′_i) ← 1/m
3  for t ← 1 to T do
4      h_t ← base ranker in H with smallest ε⁻_t − ε⁺_t = −E_{i∼D_t}[ y_i (h_t(x′_i) − h_t(x_i)) ]
5      α_t ← (1/2) log(ε⁺_t / ε⁻_t)
6      Z_t ← ε⁰_t + 2[ε⁺_t ε⁻_t]^{1/2}   ▷ normalization factor
7      for i ← 1 to m do
8          D_{t+1}(x_i, x′_i) ← D_t(x_i, x′_i) exp(−α_t y_i [h_t(x′_i) − h_t(x_i)]) / Z_t
9  φ_T ← Σ_{t=1}^T α_t h_t
10 return h = sgn(φ_T)
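The pseudocode above can be sketched in Python. The data set, the family of threshold base rankers, and the early exit when a base ranker is perfect are illustrative assumptions, not part of the slides.

```python
# Minimal RankBoost sketch with threshold base rankers h_theta(x) = 1[x > theta].
import math

def rankboost(pairs, thresholds, T=10):
    """pairs: list of (x, xp, y), y in {-1, 0, +1}; returns scoring function."""
    m = len(pairs)
    D = [1.0 / m] * m                      # D_1: uniform over pairs
    ensemble = []                          # list of (alpha_t, theta_t)
    for _ in range(T):
        # line 4: base ranker minimizing eps_minus - eps_plus
        best = None
        for theta in thresholds:
            diff = lambda x, xp: (xp > theta) - (x > theta)   # h(x') - h(x)
            e_plus = sum(d for d, (x, xp, y) in zip(D, pairs) if y * diff(x, xp) > 0)
            e_minus = sum(d for d, (x, xp, y) in zip(D, pairs) if y * diff(x, xp) < 0)
            if best is None or e_minus - e_plus < best[0]:
                best = (e_minus - e_plus, theta, e_plus, e_minus)
        _, theta, e_plus, e_minus = best
        if e_minus == 0 or e_plus == 0:    # degenerate round: stop (illustrative hack)
            ensemble.append((1.0 if e_minus == 0 else -1.0, theta))
            break
        alpha = 0.5 * math.log(e_plus / e_minus)                      # line 5
        Z = (1 - e_plus - e_minus) + 2 * math.sqrt(e_plus * e_minus)  # line 6
        ensemble.append((alpha, theta))
        D = [d * math.exp(-alpha * y * ((xp > theta) - (x > theta))) / Z
             for d, (x, xp, y) in zip(D, pairs)]                      # line 8
    return lambda x: sum(a * (x > th) for a, th in ensemble)

# y = +1 means x' should be ranked above x.
pairs = [(0.1, 0.9, +1), (0.4, 0.8, +1), (0.7, 0.2, -1), (0.5, 0.5, 0)]
f = rankboost(pairs, thresholds=[0.3, 0.6])
print(f(0.9) > f(0.1))  # the learned scores respect the preferences
```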
Notes
Distributions D_t over pairs of sample points:
• originally uniform.
• at each round, the weight of a misclassified example is increased.
• observation: D_{t+1}(x, x′) = e^{−y[φ_t(x′)−φ_t(x)]} / (|S| ∏_{s=1}^t Z_s), since
D_{t+1}(x, x′) = D_t(x, x′) e^{−y α_t[h_t(x′)−h_t(x)]} / Z_t = (1/|S|) e^{−y Σ_{s=1}^t α_s[h_s(x′)−h_s(x)]} / ∏_{s=1}^t Z_s.
Weight α_t assigned to base classifier h_t: directly depends on the accuracy of h_t at round t.
Coordinate Descent RankBoost
Objective function: convex and differentiable (e^{−x} upper-bounds the pairwise 0−1 loss).
F(α) = Σ_{(x,x′,y)∈S} e^{−y[φ_T(x′)−φ_T(x)]} = Σ_{(x,x′,y)∈S} exp(−y Σ_{t=1}^T α_t[h_t(x′) − h_t(x)]).
• Direction: unit vector e_t with
e_t = argmin_t dF(α + η e_t)/dη |_{η=0}.
• Since F(α + η e_t) = Σ_{(x,x′,y)∈S} e^{−y Σ_{s=1}^T α_s[h_s(x′)−h_s(x)]} e^{−yη[h_t(x′)−h_t(x)]},
dF(α + η e_t)/dη |_{η=0} = − Σ_{(x,x′,y)∈S} y[h_t(x′) − h_t(x)] exp(−y Σ_{s=1}^T α_s[h_s(x′) − h_s(x)])
= − Σ_{(x,x′,y)∈S} y[h_t(x′) − h_t(x)] D_{T+1}(x, x′) m ∏_{s=1}^T Z_s
= −[ε⁺_t − ε⁻_t] m ∏_{s=1}^T Z_s.
Thus, the direction corresponds to the base classifier selected by the algorithm.
• Step size: obtained via
dF(α + η e_t)/dη = 0
⇔ − Σ_{(x,x′,y)∈S} y[h_t(x′) − h_t(x)] exp(−y Σ_{s=1}^T α_s[h_s(x′) − h_s(x)]) e^{−y[h_t(x′)−h_t(x)]η} = 0
⇔ − Σ_{(x,x′,y)∈S} y[h_t(x′) − h_t(x)] D_{T+1}(x, x′) m ∏_{s=1}^T Z_s e^{−y[h_t(x′)−h_t(x)]η} = 0
⇔ − Σ_{(x,x′,y)∈S} y[h_t(x′) − h_t(x)] D_{T+1}(x, x′) e^{−y[h_t(x′)−h_t(x)]η} = 0
⇔ −[ε⁺_t e^{−η} − ε⁻_t e^{η}] = 0
⇔ η = (1/2) log(ε⁺_t/ε⁻_t).
Thus, the step size matches the base classifier weight used in the algorithm.
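The closed-form step can be checked numerically: restricted to the coordinate direction, the objective reduces (up to a positive factor) to ε⁰ + ε⁺e^{−η} + ε⁻e^{η}, whose derivative vanishes exactly at η = (1/2) log(ε⁺/ε⁻). The particular ε-values below are illustrative.

```python
# Numeric sanity check of the coordinate-descent step size.
import math

eps_plus, eps_minus = 0.5, 0.2
eps_0 = 1 - eps_plus - eps_minus

g = lambda eta: eps_0 + eps_plus * math.exp(-eta) + eps_minus * math.exp(eta)
eta_star = 0.5 * math.log(eps_plus / eps_minus)

# derivative -[eps_plus * e^-eta - eps_minus * e^eta] vanishes at eta_star
deriv = -(eps_plus * math.exp(-eta_star) - eps_minus * math.exp(eta_star))
print(abs(deriv) < 1e-12)                                  # zero up to float error
print(g(eta_star) < min(g(eta_star - 0.1), g(eta_star + 0.1)))  # a local minimum
```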
L1 Margin Definitions
Definition: the margin of a pair (x, x′) with label y ≠ 0 is
ρ(x, x′) = y(φ(x′) − φ(x)) / Σ_{t=1}^T α_t = y Σ_{t=1}^T α_t[h_t(x′) − h_t(x)] / ‖α‖₁ = y α · Δh(x, x′) / ‖α‖₁.
Definition: the margin of the hypothesis for a sample S = ((x₁, x′₁, y₁), . . . , (x_m, x′_m, y_m)) is the minimum margin for pairs in S with non-zero labels:
ρ = min_{(x,x′,y)∈S, y≠0} y α · Δh(x, x′) / ‖α‖₁.
Ranking Margin Bound (Cortes and MM, 2011)
Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ over the choice of a sample of size m, the following holds for all h ∈ H:
R(h) ≤ R̂_ρ(h) + (2/ρ)[ℜ^{D₁}_m(H) + ℜ^{D₂}_m(H)] + √(log(1/δ)/(2m)).
RankBoost Margin
Smooth-margin RankBoost (Rudin et al., 2005): G(α) = −log F(α) / ‖α‖₁.
But RankBoost does not maximize the margin.
Empirical performance not reported.
Ranking with SVMs (see for example (Joachims, 2002))
Optimization problem: application of SVMs.
min_{w,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξ_i
subject to: y_i (w · [Φ(x′_i) − Φ(x_i)]) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i ∈ [1, m].
Decision function: h: x → w · Φ(x) + b.
Notes
The algorithm coincides with SVMs using the feature mapping
(x, x′) → Ψ(x, x′) = Φ(x′) − Φ(x).
Can be used with kernels.
Algorithm directly based on the margin bound.
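A quick sketch of this reduction: train a linear classifier on the difference vectors Ψ(x, x′) = Φ(x′) − Φ(x). Here Φ is taken to be the identity and a plain subgradient descent on the regularized hinge loss stands in for a QP solver; the data, learning rate, and epoch count are illustrative assumptions.

```python
# Ranking SVM sketch on pairwise differences, trained by subgradient descent.

def rank_svm(pairs, dim, C=1.0, epochs=200, lr=0.05):
    """pairs: list of (x, xp, y) with y in {-1, +1}; returns weight vector w."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, xp, y in pairs:
            psi = [a - b for a, b in zip(xp, x)]            # Phi(x') - Phi(x)
            margin = y * sum(wi * pi for wi, pi in zip(w, psi))
            # subgradient of (1/2)||w||^2 + C * hinge, split per example
            for i in range(dim):
                grad = w[i] / len(pairs)
                if margin < 1:
                    grad -= C * y * psi[i]
                w[i] -= lr * grad
    return w

# y = +1: the second point should score higher than the first.
pairs = [((0.0, 1.0), (1.0, 0.0), +1),
         ((0.2, 0.8), (0.9, 0.1), +1),
         ((0.8, 0.3), (0.1, 0.6), -1)]
w = rank_svm(pairs, dim=2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
print(score((1.0, 0.0)) > score((0.0, 1.0)))  # learned scores match preferences
```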
Bipartite Ranking
Training data:
• sample of negative points S− = (x₁, . . . , x_m) ∈ U^m drawn according to D−.
• sample of positive points S+ = (x′₁, . . . , x′_{m′}) ∈ U^{m′} drawn according to D+.
Problem: find hypothesis h: U → R in H with small generalization error
R_D(h) = Pr_{x∼D−, x′∼D+}[ h(x′) < h(x) ].
Notes
More efficient algorithm in this special case (Freund et al., 2003).
Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05).
• if constant base ranker used.
• relationship between objective functions.
Bipartite ranking results typically reported in terms of AUC.
ROC Curve (Egan, 1975)
Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. false positive rate (FP).
• TP: percentage of positive points correctly labeled positive.
• FP: percentage of negative points incorrectly labeled positive.
[Figure: ROC curve, true positive rate vs. false positive rate; each point obtained by sliding a threshold θ over the sorted scores h(x).]
Area under the ROC Curve (AUC) (Hanley and McNeil, 1982)
Definition: the AUC is the area under the ROC curve, a measure of ranking quality.
Equivalently,
AUC(h) = (1/(mm′)) Σ_{i=1}^{m′} Σ_{j=1}^{m} 1_{h(x′_i)>h(x_j)} = Pr_{x′∼D̂+, x∼D̂−}[h(x′) > h(x)] = 1 − R̂(h).
[Figure: the AUC is the area under the ROC curve.]
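The pairwise form of the AUC translates directly into code; the toy scores below are illustrative.

```python
# AUC computed from its pairwise definition on the two samples.

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly (ties count as errors)."""
    total = len(pos_scores) * len(neg_scores)
    wins = sum(1 for sp in pos_scores for sn in neg_scores if sp > sn)
    return wins / total

pos = [0.9, 0.8, 0.4]   # h(x') for positive points
neg = [0.7, 0.3, 0.1]   # h(x)  for negative points
print(auc(pos, neg))    # 8 of the 9 pairs are correctly ordered
```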
AdaBoost and CD RankBoost
Objective functions: comparison.
F_Rank(α) = Σ_{(i,j)∈S−×S+} exp(−[f(x_j) − f(x_i)]) = [Σ_{x_i∈S−} exp(+f(x_i))][Σ_{x_j∈S+} exp(−f(x_j))] = F−(α) F+(α).
F_Ada(α) = Σ_{x_i∈S−∪S+} exp(−y_i f(x_i)) = Σ_{x_i∈S−} exp(+f(x_i)) + Σ_{x_i∈S+} exp(−f(x_i)) = F−(α) + F+(α).
AdaBoost and CD RankBoost (Rudin et al., 2005)
Property: AdaBoost (non-separable case).
• constant base learner h = 1 ⇒ equal contribution of positive and negative points (in the limit): F+(α) = F−(α).
• consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective.
Observations: if F+(α) = F−(α),
d(F_Rank) = F+ d(F−) + F− d(F+) = F+ [d(F−) + d(F+)] = F+ d(F_Ada).
Bipartite RankBoost - Efficiency
Decomposition of distribution: for (x, x′) ∈ (S−, S+),
D(x, x′) = D−(x) D+(x′).
Thus,
D_{t+1}(x, x′) = D_t(x, x′) e^{−α_t[h_t(x′)−h_t(x)]} / Z_t = [D_{t,−}(x) e^{α_t h_t(x)} / Z_{t,−}] [D_{t,+}(x′) e^{−α_t h_t(x′)} / Z_{t,+}],
with Z_{t,−} = Σ_{x∈S−} D_{t,−}(x) e^{α_t h_t(x)} and Z_{t,+} = Σ_{x′∈S+} D_{t,+}(x′) e^{−α_t h_t(x′)}.
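The point of the factorization is that one weight per point suffices, so each round costs linear rather than quadratic work. The sketch below checks on toy numbers (illustrative values, not from the slides) that the per-point product reproduces the pairwise update.

```python
# Check that the bipartite update factorizes: D_{t,-}(x) * D_{t,+}(x')
# matches the normalized pairwise update started from the uniform product.
import math

neg, pos = [0.2, 0.7], [0.9, 0.4]          # toy points
h = lambda x: 1.0 if x > 0.5 else 0.0      # toy base ranker
alpha = 0.3

# per-point updates with separate normalizers
d_neg = [math.exp(alpha * h(x)) for x in neg]
d_pos = [math.exp(-alpha * h(x)) for x in pos]
Zn, Zp = sum(d_neg), sum(d_pos)
d_neg = [v / Zn for v in d_neg]
d_pos = [v / Zp for v in d_pos]

# pairwise update over all |S-| x |S+| pairs
pairw = [[math.exp(-alpha * (h(xp) - h(x))) for xp in pos] for x in neg]
Z = sum(sum(row) for row in pairw)
pairw = [[v / Z for v in row] for row in pairw]

ok = all(abs(pairw[i][j] - d_neg[i] * d_pos[j]) < 1e-12
         for i in range(len(neg)) for j in range(len(pos)))
print(ok)  # True: the two updates agree
```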
Ranking ≠ Classification
Bipartite case: can we learn to rank by training a classifier on the positive and negative sets?
• different objective functions: AUC vs. 0/1 loss.
• preliminary analysis (Cortes and MM, 2004): different results for imbalanced data sets, on average over all classifications.
• example, stochastic case: [figure: rankings of A, B, C with +/− labels.]
This Talk
Score-based ranking
Preference-based ranking
Preference-Based Setting
Definitions:
• U: universe, full set of objects.
• V: finite query subset to rank, V ⊆ U.
• τ∗: target ranking for V (random variable).
Two stages: can be viewed as a reduction.
• learn preference function h: U×U → [0, 1].
• given V, use h to determine ranking σ of V.
Running-time: measured in terms of the number of calls to h.
Preference-Based Ranking Problem
Training data: pairs (V, τ∗) sampled i.i.d. according to D (subsets ranked by different labelers):
(V₁, τ∗₁), (V₂, τ∗₂), . . . , (V_m, τ∗_m), with V_i ⊆ U.
Learn classifier (preference function) h: U×U → [0, 1].
Problem: for any query set V ⊆ U, use h to return a ranking σ_{h,V} close to the target τ∗ with small average error
R(h, σ) = E_{(V,τ∗)∼D}[L(σ_{h,V}, τ∗)].
Preference Function
h(u, v) close to 1 when u is preferred to v, close to 0 otherwise. For the analysis, h(u, v) ∈ {0, 1}.
Assumed pairwise consistent: h(u, v) + h(v, u) = 1.
May be non-transitive, e.g., h(u, v) = h(v, w) = h(w, u) = 1.
Output of a classifier or a 'black box'.
Loss Functions (for fixed (V, τ∗))
Preference loss:
L(h, τ∗) = (2/(n(n−1))) Σ_{u≠v} h(u, v) τ∗(v, u).
Ranking loss:
L(σ, τ∗) = (2/(n(n−1))) Σ_{u≠v} σ(u, v) τ∗(v, u).
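These pairwise losses are averages of disagreements with the target over ordered pairs. A minimal sketch, with rankings encoded as lists (best first); the encoding and toy rankings are assumptions for illustration.

```python
# Pairwise loss of a 0/1 preference function against a target ranking.

def as_pref(order):
    """order: list, best first -> pairwise 0/1 preference function."""
    rank = {u: i for i, u in enumerate(order)}
    return lambda u, v: 1.0 if rank[u] < rank[v] else 0.0

def loss(pref, target_order):
    """Average pairwise disagreement with the target ranking."""
    tau = as_pref(target_order)
    V, n = target_order, len(target_order)
    total = sum(pref(u, v) * tau(v, u) for u in V for v in V if u != v)
    return 2.0 * total / (n * (n - 1))

sigma = as_pref(['a', 'b', 'c'])
print(loss(sigma, ['a', 'b', 'c']))  # 0.0: perfect agreement
print(loss(sigma, ['c', 'b', 'a']))  # 1.0: complete reversal
```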
(Weak) Regret
Preference regret:
R_class(h) = E_{V,τ∗}[L(h|V, τ∗)] − E_V min_h E_{τ∗|V}[L(h, τ∗)].
Ranking regret:
R_rank(A) = E_{V,τ∗,s}[L(A_s(V), τ∗)] − E_V min_{σ∈S(V)} E_{τ∗|V}[L(σ, τ∗)].
Deterministic Algorithm (Balcan et al., 07)
Stage one: standard classification. Learn preference function h: U×U → [0, 1].
Stage two: sort-by-degree using comparison function h.
• sort by number of points ranked below.
• quadratic time complexity O(n²).
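Sort-by-degree is a one-liner once h is available; the toy transitive preference below is an illustrative stand-in for a learned h.

```python
# Sort-by-degree sketch: order the points of V by how many other points
# the preference function h ranks below them (O(n^2) calls to h).

def sort_by_degree(V, h):
    degree = {u: sum(h(u, v) for v in V if v != u) for u in V}
    return sorted(V, key=lambda u: degree[u], reverse=True)

h = lambda u, v: 1 if u > v else 0      # toy preference: prefer larger numbers
print(sort_by_degree([3, 1, 4, 2], h))  # [4, 3, 2, 1]
```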
Randomized Algorithm (Ailon & MM, 08)
Stage one: standard classification. Learn preference function h: U×U → [0, 1].
Stage two: randomized QuickSort (Hoare, 61) using h as comparison function.
• comparison function non-transitive, unlike the textbook setting.
• but time complexity shown to be O(n log n) in general.
Randomized QS
[Figure: pick a random pivot u; each v with h(v, u) = 1 goes to the left recursion, each v with h(u, v) = 1 to the right recursion.]
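The figure's recursion can be sketched directly; the pairwise-consistent h makes the two branches a partition even when h is non-transitive. The fixed random seed and the toy (here transitive) h are illustrative assumptions.

```python
# Randomized QuickSort driven by a 0/1 preference function h.
import random

def quicksort_h(V, h, rng=random.Random(0)):
    if len(V) <= 1:
        return list(V)
    u = V[rng.randrange(len(V))]                         # random pivot
    left = [v for v in V if v != u and h(v, u) == 1]     # preferred to u
    right = [v for v in V if v != u and h(u, v) == 1]    # u preferred to them
    return quicksort_h(left, h, rng) + [u] + quicksort_h(right, h, rng)

h = lambda u, v: 1 if u < v else 0       # toy preference: prefer smaller numbers
print(quicksort_h([3, 1, 4, 1.5, 2], h)) # [1, 1.5, 2, 3, 4]
```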
Deterministic Algo. - Bipartite Case (V = V+ ∪ V−) (Balcan et al., 07)
Bounds: for the deterministic sort-by-degree algorithm,
• expected loss: E_{V,τ∗}[L(A(V), τ∗)] ≤ 2 E_{V,τ∗}[L(h, τ∗)].
• regret: R_rank(A(V)) ≤ 2 R_class(h).
Time complexity: Ω(|V|²).
Randomized Algo. - Bipartite Case (V = V+ ∪ V−) (Ailon & MM, 08)
Bounds: for randomized QuickSort (Hoare, 61),
• expected loss (equality): E_{V,τ∗,s}[L(Q^h_s(V), τ∗)] = E_{V,τ∗}[L(h, τ∗)].
• regret: R_rank(Q^h_s(·)) ≤ R_class(h).
Time complexity:
• full set: O(n log n).
• top k: O(n + k log k).
Proof Ideas
QuickSort decomposition:
p_uv + (1/3) Σ_{w∉{u,v}} p_uvw [h(u, w)h(w, v) + h(v, w)h(w, u)] = 1.
Bipartite property: for any u, v, w,
τ∗(u, v) + τ∗(v, w) + τ∗(w, u) = τ∗(v, u) + τ∗(w, v) + τ∗(u, w).
Lower Bound
Theorem: for any deterministic algorithm A, there is a bipartite distribution for which
R_rank(A) ≥ 2 R_class(h).
• thus, the factor of 2 is best possible in the deterministic case.
• randomization necessary for a better bound.
Proof: take the simple case U = V = {u, v, w} and assume h induces a cycle.
• up to symmetry, A returns u, v, w or w, v, u.
Lower Bound
If A returns u, v, w, then choose τ∗ as the bipartite ranking separating u from {v, w} (+/− split in the slides' figure).
If A returns w, v, u, then choose τ∗ as the bipartite ranking separating {w, v} from u.
In both cases, L[h, τ∗] = 1/3 while L[A, τ∗] = 2/3.
Guarantees - General Case
Loss bound for QuickSort:
E_{V,τ∗,s}[L(Q^h_s(V), τ∗)] ≤ 2 E_{V,τ∗}[L(h, τ∗)].
Comparison with the optimal ranking (see (CSS 99)):
E_s[L(Q^h_s(V), σ_optimal)] ≤ 2 L(h, σ_optimal),
E_s[L(h, Q^h_s(V))] ≤ 3 L(h, σ_optimal),
where σ_optimal = argmin_σ L(h, σ).
Weight Function
Generalization: τ∗(u, v) = σ∗(u, v) ω(σ∗(u), σ∗(v)).
Properties needed for all previous results to hold:
• symmetry: ω(i, j) = ω(j, i) for all i, j.
• monotonicity: ω(i, j), ω(j, k) ≤ ω(i, k) for i < j < k.
• triangle inequality: ω(i, j) ≤ ω(i, k) + ω(k, j) for all triplets i, j, k.
Weight Function - Examples
Kemeny: ω(i, j) = 1, ∀ i, j.
Top-k: ω(i, j) = 1 if i ≤ k or j ≤ k; 0 otherwise.
Bipartite: ω(i, j) = 1 if i ≤ k and j > k; 0 otherwise.
k-partite: can be defined similarly.
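The three example weight functions are easy to write down and brute-force check against the three required properties. The range of positions and the value k = 3 are illustrative; the "or"/"and" conditions follow the standard definitions (a top-k pair matters if either position is in the top k, a bipartite pair if exactly one is).

```python
# Example weight functions with a brute-force check of symmetry,
# monotonicity, and the triangle inequality on positions 1..6 (k = 3).
k, N = 3, 6

kemeny = lambda i, j: 1
top_k = lambda i, j: 1 if (i <= k or j <= k) else 0
bipartite = lambda i, j: 1 if (min(i, j) <= k and max(i, j) > k) else 0

def check(w):
    idx = range(1, N + 1)
    sym = all(w(i, j) == w(j, i) for i in idx for j in idx)
    mono = all(max(w(i, j), w(j, l)) <= w(i, l)
               for i in idx for j in range(i + 1, N + 1)
               for l in range(j + 1, N + 1))
    tri = all(w(i, j) <= w(i, l) + w(l, j)
              for i in idx for j in idx for l in idx)
    return sym and mono and tri

print(all(check(w) for w in (kemeny, top_k, bipartite)))  # True
```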
(Strong) Regret Definitions
Ranking regret:
R_rank(A) = E_{V,τ∗,s}[L(A_s(V), τ∗)] − min_σ E_{V,τ∗}[L(σ|V, τ∗)].
Preference regret:
R_class(h) = E_{V,τ∗}[L(h|V, τ∗)] − min_h E_{V,τ∗}[L(h|V, τ∗)].
All previous regret results hold if for all u, v and all V₁, V₂ ⊇ {u, v},
E_{τ∗|V₁}[τ∗(u, v)] = E_{τ∗|V₂}[τ∗(u, v)]
(pairwise independence on irrelevant alternatives).
Courant Institute, NYU · Mehryar Mohri - Courant Seminar

References
• Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6, 393–425.
• Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT (pp. 32–47).
• Ailon, N., and Mohri, M. (2008). An efficient reduction of ranking to classification. COLT 2008. Helsinki, Finland. Omnipress.
• Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. COLT (pp. 604–619). Springer.
• Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. JAIR 10, 243–270.
• Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT (pp. 605–619).
• Cortes, C., and Mohri, M. (2004). AUC optimization vs. error rate minimization. NIPS 2003. MIT Press.
• Cortes, C., Mohri, M., and Rastogi, A. (2007). An alternative ranking problem for search engines. WEA 2007 (pp. 1–21). Rome, Italy: Springer.
• Cortes, C., and Mohri, M. (2005). Confidence intervals for the area under the ROC curve. NIPS 2004. MIT Press.
• Crammer, K., and Singer, Y. (2001). Pranking with ranking. NIPS 2001 (pp. 641–647). Vancouver, British Columbia, Canada. MIT Press.
• Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the Web. WWW 10. ACM Press.
• Egan, J. P. (1975). Signal Detection Theory and ROC Analysis. Academic Press.
• Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. JMLR 4, 933–969.
• Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.
• Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pp. 115–132.
• Joachims, T. (2002). Optimizing search engines using clickthrough data. KDD 2002 (pp. 133–142).
• Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco, California: Holden-Day.
• Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project.
• Rudin, C., Cortes, C., Mohri, M., and Schapire, R. E. (2005). Margin-based ranking meets boosting in the middle. COLT 2005 (pp. 63–78).