Introduction to Machine Learning, Lecture 16 (mohri/mlu/mlu_lecture_16.pdf, 2011-12-04)

Introduction to Machine Learning, Lecture 16

Mehryar Mohri, Courant Institute and Google Research

[email protected]

Mehryar Mohri - Introduction to Machine Learning

Ranking


Motivation

Very large data sets:

• too large to display or process.

• limited resources, need priorities.

• ranking more desirable than classification.

Applications:

• search engines, information extraction.

• decision making, auctions, fraud detection.

Can we learn to predict ranking accurately?


Score-Based Setting

Single stage: learning algorithm

• receives labeled sample of pairwise preferences;

• returns scoring function h: U → R.

Drawbacks:

• induces a linear ordering for the full set U.

• does not match a query-based scenario.

Advantages:

• efficient algorithms.

• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).

Preference-Based Setting

Definitions:

• U: universe, full set of objects.

• V: finite query subset to rank, V ⊆ U.

• τ*: target ranking for V (random variable).

Two stages: can be viewed as a reduction.

• learn preference function h: U×U → [0, 1].

• given V, use h to determine a ranking σ of V.

Running time: measured in terms of the number of calls to h.

Related Problem

Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these.

• closeness measured in number of pairwise misrankings.

• problem NP-hard even for k = 4 (Dwork et al., 2001).


This Talk

Score-based ranking

Preference-based ranking


Score-Based Ranking

Training data: sample of i.i.d. labeled pairs drawn from U×U according to some distribution D,

$$S = \big((x_1, x'_1, y_1), \ldots, (x_m, x'_m, y_m)\big) \in \big(U \times U \times \{-1, 0, +1\}\big)^m,$$

with

$$y_i = \begin{cases} +1 & \text{if } x'_i >_{\text{pref}} x_i \\ 0 & \text{if } x'_i =_{\text{pref}} x_i \text{ or no information} \\ -1 & \text{if } x'_i <_{\text{pref}} x_i. \end{cases}$$

Problem: find hypothesis h: U → R in H with small generalization error

$$R_D(h) = \Pr_{(x, x') \sim D}\big[ f(x, x')\,\big(h(x') - h(x)\big) < 0 \big].$$

Notes

Empirical error:

$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i (h(x'_i) - h(x_i)) < 0}.$$

The relation x R x' ⇔ f(x, x') = 1 may be non-transitive (it need not even be antisymmetric). The problem is therefore different from classification.
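A minimal sketch of this empirical error (not from the slides; the scoring function and the toy sample below are illustrative):

```python
# Empirical pairwise misranking error: the fraction of labeled pairs
# (x, x', y) on which the scorer h orders the pair against its label y.

def empirical_pairwise_error(h, sample):
    """sample: list of (x, x_prime, y) with y in {-1, 0, +1}."""
    m = len(sample)
    mistakes = sum(1 for (x, xp, y) in sample if y * (h(xp) - h(x)) < 0)
    return mistakes / m

h = lambda x: 2.0 * x          # toy linear scorer
S = [(0.0, 1.0, +1),           # x' preferred and scored higher: correct
     (0.0, 1.0, -1),           # x preferred but x' scored higher: mistake
     (1.0, 1.0, 0)]            # tie / no information: never counted
print(empirical_pairwise_error(h, S))  # 1/3
```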

Distributional Assumptions

Distribution over points: m points (standard in the literature).

• labels attached to pairs of points.

• squared number of examples: O(m²).

Distribution over pairs: m pairs.

• one label received per pair.

• independence assumption.

• same (linear) number of examples.


Boosting for Ranking

Use weak ranking algorithm and create stronger ranking algorithm.

Ensemble method: combine base rankers returned by weak ranking algorithm.

Finding simple relatively accurate base rankers often not hard.

How should base rankers be combined?


CD RankBoost (Freund et al., 2003; Rudin et al., 2005)

H ⊆ {0, 1}^X. For s ∈ {−1, 0, +1},

$$\epsilon_t^s(h) = \Pr_{(x, x') \sim D_t}\Big[ \operatorname{sgn}\big(f(x, x')(h(x') - h(x))\big) = s \Big], \qquad \epsilon_t^0 + \epsilon_t^+ + \epsilon_t^- = 1.$$

RankBoost(S = ((x₁, x′₁, y₁), …, (x_m, x′_m, y_m)))
 1  for i ← 1 to m do
 2      D₁(x_i, x′_i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base ranker in H with smallest ε_t^− − ε_t^+ = −E_{i∼D_t}[ y_i (h_t(x′_i) − h_t(x_i)) ]
 5      α_t ← ½ log(ε_t^+ / ε_t^−)
 6      Z_t ← ε_t^0 + 2[ε_t^+ ε_t^−]^{1/2}   ◁ normalization factor
 7      for i ← 1 to m do
 8          D_{t+1}(x_i, x′_i) ← D_t(x_i, x′_i) exp(−α_t y_i (h_t(x′_i) − h_t(x_i))) / Z_t
 9  φ_T ← Σ_{t=1}^T α_t h_t
10  return h = sgn(φ_T)
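The pseudocode above can be sketched as runnable Python. This is an illustrative implementation, not the authors' code: base rankers are threshold functions on scalar points (an assumed choice), and a small smoothing constant guards the ratio ε_t^+/ε_t^− when ε_t^− = 0.

```python
import math

def rankboost(pairs, thresholds, T=10, smooth=1e-8):
    """Minimal RankBoost sketch (after Freund et al., 2003) for scalar points.

    pairs: list of (x, x_prime, y), y in {-1, 0, +1}; y = +1 means x' preferred.
    Base rankers are threshold functions h_theta(x) = 1 if x >= theta else 0.
    Returns the list of (alpha_t, theta_t) defining phi_T = sum_t alpha_t h_t.
    """
    m = len(pairs)
    D = [1.0 / m] * m                       # uniform initial distribution
    ensemble = []
    for _ in range(T):
        # line 4: base ranker minimizing eps_minus - eps_plus
        best = None
        for theta in thresholds:
            h = lambda x, th=theta: 1.0 if x >= th else 0.0
            eps_plus = sum(D[i] for i, (x, xp, y) in enumerate(pairs)
                           if y * (h(xp) - h(x)) > 0)
            eps_minus = sum(D[i] for i, (x, xp, y) in enumerate(pairs)
                            if y * (h(xp) - h(x)) < 0)
            if best is None or eps_minus - eps_plus < best[0]:
                best = (eps_minus - eps_plus, eps_plus, eps_minus, theta, h)
        _, eps_plus, eps_minus, theta, h = best
        alpha = 0.5 * math.log((eps_plus + smooth) / (eps_minus + smooth))
        ensemble.append((alpha, theta))
        # line 8: increase the weight of misranked pairs, then normalize
        D = [D[i] * math.exp(-alpha * y * (h(xp) - h(x)))
             for i, (x, xp, y) in enumerate(pairs)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def score(ensemble, x):
    return sum(a * (1.0 if x >= th else 0.0) for a, th in ensemble)

# toy data: the larger number is always preferred
pairs = [(1, 3, +1), (2, 5, +1), (4, 1, -1), (5, 2, -1), (0, 4, +1)]
ens = rankboost(pairs, thresholds=[1, 2, 3, 4, 5], T=5)
assert score(ens, 5) > score(ens, 1)   # higher value gets higher score
```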

Notes

Distribution D_t over pairs of sample points:

• initially uniform.

• at each round, the weight of a misclassified pair is increased.

• observation:

$$D_{t+1}(x, x') = \frac{e^{-y(\phi_t(x') - \phi_t(x))}}{|S| \prod_{s=1}^{t} Z_s}, \quad \text{since} \quad D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-y \alpha_t (h_t(x') - h_t(x))}}{Z_t} = \frac{1}{|S|}\, \frac{e^{-y \sum_{s=1}^{t} \alpha_s (h_s(x') - h_s(x))}}{\prod_{s=1}^{t} Z_s}.$$

Weight α_t assigned to base classifier h_t: directly depends on the accuracy of h_t at round t.

Coordinate Descent RankBoost

Objective function: convex and differentiable upper bound on the pairwise 0-1 loss (e^{−x} dominates the 0-1 loss):

$$F(\alpha) = \sum_{(x, x', y) \in S} e^{-y(\phi_T(x') - \phi_T(x))} = \sum_{(x, x', y) \in S} \exp\Big(-y \sum_{t=1}^{T} \alpha_t \big(h_t(x') - h_t(x)\big)\Big).$$

• Direction: unit vector e_t with

$$\mathbf{e}_t = \operatorname*{argmin}_{t} \left. \frac{dF(\alpha + \eta \mathbf{e}_t)}{d\eta} \right|_{\eta = 0}.$$

• Since $F(\alpha + \eta \mathbf{e}_t) = \sum_{(x, x', y) \in S} e^{-y \sum_{s=1}^{T} \alpha_s (h_s(x') - h_s(x))}\, e^{-y\eta (h_t(x') - h_t(x))}$,

$$\left. \frac{dF(\alpha + \eta \mathbf{e}_t)}{d\eta} \right|_{\eta = 0} = -\sum_{(x, x', y) \in S} y\big(h_t(x') - h_t(x)\big) \exp\Big(-y \sum_{s=1}^{T} \alpha_s \big(h_s(x') - h_s(x)\big)\Big)$$
$$= -\sum_{(x, x', y) \in S} y\big(h_t(x') - h_t(x)\big)\, D_{T+1}(x, x')\, m \prod_{s=1}^{T} Z_s = -\big[\epsilon_t^+ - \epsilon_t^-\big]\, m \prod_{s=1}^{T} Z_s.$$

Thus, the direction corresponds to the base classifier selected by the algorithm.

• Step size: obtained via

$$\frac{dF(\alpha + \eta \mathbf{e}_t)}{d\eta} = 0$$
$$\Leftrightarrow -\sum_{(x, x', y) \in S} y\big(h_t(x') - h_t(x)\big) \exp\Big(-y \sum_{s=1}^{T} \alpha_s \big(h_s(x') - h_s(x)\big)\Big)\, e^{-y(h_t(x') - h_t(x))\eta} = 0$$
$$\Leftrightarrow -\sum_{(x, x', y) \in S} y\big(h_t(x') - h_t(x)\big)\, D_{T+1}(x, x')\, m \prod_{s=1}^{T} Z_s\; e^{-y(h_t(x') - h_t(x))\eta} = 0$$
$$\Leftrightarrow -\sum_{(x, x', y) \in S} y\big(h_t(x') - h_t(x)\big)\, D_{T+1}(x, x')\, e^{-y(h_t(x') - h_t(x))\eta} = 0$$
$$\Leftrightarrow -\big[\epsilon_t^+ e^{-\eta} - \epsilon_t^- e^{\eta}\big] = 0 \Leftrightarrow \eta = \frac{1}{2} \log \frac{\epsilon_t^+}{\epsilon_t^-}.$$

Thus, the step size matches the base classifier weight α_t used in the algorithm.

L1 Margin Definitions

Definition: the margin of a pair (x, x') with label y ≠ 0 is

$$\rho(x, x') = \frac{y\big(\phi(x') - \phi(x)\big)}{\sum_{t=1}^{T} \alpha_t} = \frac{y \sum_{t=1}^{T} \alpha_t \big(h_t(x') - h_t(x)\big)}{\|\alpha\|_1} = \frac{y\, \alpha \cdot \Delta \mathbf{h}(x, x')}{\|\alpha\|_1}.$$

Definition: the margin of the hypothesis for a sample S = ((x₁, x′₁, y₁), …, (x_m, x′_m, y_m)) is the minimum margin over pairs in S with non-zero labels:

$$\rho = \min_{\substack{(x, x', y) \in S \\ y \neq 0}} \frac{y\, \alpha \cdot \Delta \mathbf{h}(x, x')}{\|\alpha\|_1}.$$

Ranking Margin Bound (Cortes and MM, 2011)

Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ over the choice of a sample of size m, the following holds for all h ∈ H:

$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\Big( \mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H) \Big) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

RankBoost Margin

Smooth-margin RankBoost (Rudin et al., 2005):

$$G(\alpha) = -\frac{\log F(\alpha)}{\|\alpha\|_1}.$$

But RankBoost does not maximize the margin. Empirical performance not reported.

Ranking with SVMs

Optimization problem: a direct application of SVMs; see for example (Joachims, 2002):

$$\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{subject to: } y_i\, w \cdot \big(\Phi(x'_i) - \Phi(x_i)\big) \geq 1 - \xi_i,\quad \xi_i \geq 0,\ \forall i \in [1, m].$$

Decision function:

$$h: x \mapsto w \cdot \Phi(x) + b.$$

Notes

The algorithm coincides with SVMs using the feature mapping

$$(x, x') \mapsto \Psi(x, x') = \Phi(x') - \Phi(x).$$

Can be used with kernels. The algorithm is directly based on the margin bound.
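A small illustrative sketch of this reduction: each labeled pair becomes a classification example with feature Ψ(x, x') = Φ(x') − Φ(x), and a plain subgradient descent on the hinge-loss primal stands in for a real SVM solver. The data, learning rate, and epoch count below are made-up assumptions, and Φ is taken to be the identity.

```python
import numpy as np

def rank_svm(pairs, labels, C=10.0, lr=0.01, epochs=500):
    """Toy pairwise ranking SVM: minimize 1/2 ||w||^2 + C * sum_i
    hinge(y_i * w . (x'_i - x_i)) by subgradient descent (no bias term)."""
    diffs = np.array([xp - x for x, xp in pairs], dtype=float)  # Psi(x, x')
    y = np.array(labels, dtype=float)
    w = np.zeros(diffs.shape[1])
    for _ in range(epochs):
        margins = y * (diffs @ w)
        viol = margins < 1.0                    # pairs violating the margin
        grad = w - C * (y[viol, None] * diffs[viol]).sum(axis=0)
        w -= lr * grad
    return w

# toy data in R^2: the first coordinate determines preference
# (y = +1 means x' preferred to x, matching the pairwise labels above)
pairs = [(np.array([0.0, 1.0]), np.array([1.0, 0.0])),
         (np.array([2.0, 5.0]), np.array([3.0, 5.0])),
         (np.array([4.0, 2.0]), np.array([1.0, 2.0]))]
labels = [+1, +1, -1]
w = rank_svm(pairs, labels)
h = lambda x: w @ x                             # learned scoring function
assert h(np.array([3.0, 0.0])) > h(np.array([1.0, 0.0]))
```

In practice one would hand the difference vectors and labels to any off-the-shelf linear SVM; the point is only that ranking reduces to classification of pair differences.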

Bipartite Ranking

Training data:

• sample of negative points S₋ = (x₁, …, x_m) ∈ U^m drawn according to D₋.

• sample of positive points S₊ = (x′₁, …, x′_{m'}) ∈ U^{m'} drawn according to D₊.

Problem: find hypothesis h: U → R in H with small generalization error

$$R_D(h) = \Pr_{x \sim D_-,\ x' \sim D_+}\big[ h(x') < h(x) \big].$$


Notes

More efficient algorithm in this special case (Freund et al., 2003).

Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05).

• if constant base ranker used.

• relationship between objective functions.

Bipartite ranking results typically reported in terms of AUC.


ROC Curve (Egan, 1975)

Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. false positive rate (FP), obtained by sweeping a threshold θ over the sorted scores h(x).

• TP: percentage of positive points correctly labeled positive.

• FP: percentage of negative points incorrectly labeled positive.

[Figure: ROC curve, true positive rate against false positive rate; below the plot, scores h(x_i) sorted along the axis with the threshold θ separating predicted negatives (−) from predicted positives (+).]

Area under the ROC Curve (AUC) (Hanley and McNeil, 1982)

Definition: the AUC is the area under the ROC curve, a measure of ranking quality.

Equivalently, with m negative points x_i and m' positive points x′_j,

$$\mathrm{AUC}(h) = \frac{1}{m m'} \sum_{i=1}^{m} \sum_{j=1}^{m'} 1_{h(x'_j) > h(x_i)} = \Pr_{\substack{x' \sim \widehat{D}_+ \\ x \sim \widehat{D}_-}}\big[ h(x') > h(x) \big] = 1 - \widehat{R}(h).$$
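A quick sketch of this equivalence (the scores below are made up for illustration): the AUC counts correctly ordered (negative, positive) pairs, and with no ties it equals one minus the empirical pairwise misranking error.

```python
def auc(h, neg, pos):
    """AUC(h) = (1 / (m m')) * #{(x in neg, x' in pos) : h(x') > h(x)}."""
    total = sum(1 for x in neg for xp in pos if h(xp) > h(x))
    return total / (len(neg) * len(pos))

h = lambda x: x                      # identity scorer on toy scalar points
neg = [0.1, 0.4, 0.35]               # sample S- of negative points
pos = [0.8, 0.3, 0.9]                # sample S+ of positive points

# empirical pairwise misranking error over the same pairs
error = sum(1 for x in neg for xp in pos if h(xp) < h(x)) / (len(neg) * len(pos))
print(auc(h, neg, pos))              # 7/9
assert abs(auc(h, neg, pos) + error - 1.0) < 1e-12   # AUC = 1 - R_hat (no ties)
```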

AdaBoost and CD RankBoost

Objective functions: comparison.

$$F_{\mathrm{Rank}}(\alpha) = \sum_{(i, j) \in S_- \times S_+} \exp\big(-[f(x_j) - f(x_i)]\big) = \sum_{(i, j) \in S_- \times S_+} \exp\big(+f(x_i)\big) \exp\big(-f(x_j)\big) = F_-(\alpha)\, F_+(\alpha).$$

$$F_{\mathrm{Ada}}(\alpha) = \sum_{x_i \in S_- \cup S_+} \exp\big(-y_i f(x_i)\big) = \sum_{x_i \in S_-} \exp\big(+f(x_i)\big) + \sum_{x_i \in S_+} \exp\big(-f(x_i)\big) = F_-(\alpha) + F_+(\alpha).$$

AdaBoost and CD RankBoost (Rudin et al., 2005)

Property: AdaBoost (non-separable case).

• with constant base learner h = 1: equal contribution of positive and negative points (in the limit), F₊(α) = F₋(α).

• consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective.

Observation: if F₊(α) = F₋(α),

$$d(F_{\mathrm{Rank}}) = F_+\, d(F_-) + F_-\, d(F_+) = F_+ \big( d(F_-) + d(F_+) \big) = F_+\, d(F_{\mathrm{Ada}}).$$

Bipartite RankBoost - Efficiency

Decomposition of the distribution: for (x, x') ∈ (S₋, S₊),

$$D(x, x') = D_-(x)\, D_+(x').$$

Thus,

$$D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-\alpha_t (h_t(x') - h_t(x))}}{Z_t} = \frac{D_{t,-}(x)\, e^{\alpha_t h_t(x)}}{Z_{t,-}} \cdot \frac{D_{t,+}(x')\, e^{-\alpha_t h_t(x')}}{Z_{t,+}},$$

with

$$Z_{t,-} = \sum_{x \in S_-} D_{t,-}(x)\, e^{\alpha_t h_t(x)}, \qquad Z_{t,+} = \sum_{x' \in S_+} D_{t,+}(x')\, e^{-\alpha_t h_t(x')}.$$

Ranking ≠ Classification

Bipartite case: can we learn to rank by training a classifier on the positive and negative sets?

• different objective functions: AUC vs. 0/1 loss.

• preliminary analysis (Cortes and MM, 2004): different results for imbalanced data sets, on average over all classifications.

• example: stochastic case. [Figure: a small labeled example over patterns of A, B, C split into + and − classes.]


This Talk

Score-based ranking

Preference-based ranking


Preference-Based Setting

Definitions:

• U: universe, full set of objects.

• V: finite query subset to rank, V ⊆ U.

• τ*: target ranking for V (random variable).

Two stages: can be viewed as a reduction.

• learn preference function h: U×U → [0, 1].

• given V, use h to determine a ranking σ of V.

Running time: measured in terms of the number of calls to h.

Preference-Based Ranking Problem

Training data: pairs (V, τ*) sampled i.i.d. according to D,

$$(V_1, \tau^*_1), (V_2, \tau^*_2), \ldots, (V_m, \tau^*_m), \qquad V_i \subseteq U,$$

subsets ranked by different labelers; learn a preference function (classifier) h: U×U → [0, 1].

Problem: for any query set V ⊆ U, use h to return a ranking σ_{h,V} close to the target τ* with small average error

$$R(h, \sigma) = \operatorname*{E}_{(V, \tau^*) \sim D}\big[ L(\sigma_{h,V}, \tau^*) \big].$$

Preference Function

h(u, v) close to 1 when u is preferred to v, close to 0 otherwise. For the analysis, h(u, v) ∈ {0, 1}.

Assumed pairwise consistent: h(u, v) + h(v, u) = 1.

May be non-transitive, e.g., h(u, v) = h(v, w) = h(w, u) = 1.

Output of a classifier or a 'black box'.

Loss Functions

(for a fixed (V, τ*), with n = |V|)

Preference loss:

$$L(h, \tau^*) = \frac{2}{n(n-1)} \sum_{u \neq v} h(u, v)\, \tau^*(v, u).$$

Ranking loss:

$$L(\sigma, \tau^*) = \frac{2}{n(n-1)} \sum_{u \neq v} \sigma(u, v)\, \tau^*(v, u).$$

(Weak) Regret

Preference regret:

$$R_{\mathrm{class}}(h) = \operatorname*{E}_{V, \tau^*}\big[ L(h_{|V}, \tau^*) \big] - \operatorname*{E}_{V} \min_{h'} \operatorname*{E}_{\tau^* | V}\big[ L(h', \tau^*) \big].$$

Ranking regret:

$$R_{\mathrm{rank}}(A) = \operatorname*{E}_{V, \tau^*, s}\big[ L(A_s(V), \tau^*) \big] - \operatorname*{E}_{V} \min_{\sigma \in S(V)} \operatorname*{E}_{\tau^* | V}\big[ L(\sigma, \tau^*) \big].$$

Deterministic Algorithm (Balcan et al., 07)

Stage one: standard classification. Learn preference function h: U×U → [0, 1].

Stage two: sort-by-degree using the comparison function h.

• sort by number of points ranked below.

• quadratic time complexity O(n²).
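A sketch of the sort-by-degree stage under these assumptions; the preference function below is illustrative and deliberately contains a cycle, since h need not be transitive:

```python
def sort_by_degree(V, h):
    """Rank each u in V by the number of elements it beats under h.
    Makes O(|V|^2) calls to h."""
    degree = {u: sum(h(u, v) for v in V if v != u) for u in V}
    return sorted(V, key=lambda u: -degree[u])

# non-transitive preference on {a, b, c, d}: cycle a > b > c > a; d loses to all
order = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 1}
def h(u, v):
    if (u, v) in order: return 1
    if (v, u) in order: return 0
    return 1 if u < v else 0        # arbitrary pairwise-consistent tie-break
print(sort_by_degree(["a", "b", "c", "d"], h))   # ['a', 'b', 'c', 'd']
```

Here a, b, c all beat two elements each (the cycle plus d), so their relative order is arbitrary, while d, beaten by everyone, comes last.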

Randomized Algorithm (Ailon & MM, 08)

Stage one: standard classification. Learn preference function h: U×U → [0, 1].

Stage two: randomized QuickSort (Hoare, 61) using h as the comparison function.

• the comparison function is non-transitive, unlike in the textbook description.

• but the time complexity can be shown to be O(n log n) in general.

Randomized QS

[Figure: a random pivot u is chosen; each element v with h(v, u) = 1 goes to the left recursion, each v with h(u, v) = 1 to the right recursion.]
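The pivoting rule above can be sketched directly (illustrative code, not the authors'); pairwise consistency h(u, v) + h(v, u) = 1 guarantees that every non-pivot element lands in exactly one recursion:

```python
import random

def quicksort_pref(V, h, rng=random):
    """Randomized QuickSort driven by a (possibly non-transitive)
    preference function h; expected O(n log n) calls to h."""
    if len(V) <= 1:
        return list(V)
    u = rng.choice(V)                                   # random pivot
    left = [v for v in V if v != u and h(v, u) == 1]    # preferred to the pivot
    right = [v for v in V if v != u and h(u, v) == 1]   # pivot preferred to them
    return quicksort_pref(left, h, rng) + [u] + quicksort_pref(right, h, rng)

# transitive toy preference: smaller integers are preferred
h = lambda u, v: 1 if u < v else 0
random.seed(0)
print(quicksort_pref([5, 2, 9, 1, 7], h))   # [1, 2, 5, 7, 9]
```

With a transitive h this is ordinary QuickSort; with a non-transitive h it still returns some ranking, and the guarantees of the following slides bound its expected loss.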

Deterministic Algo. - Bipartite Case (V = V₊ ∪ V₋) (Balcan et al., 07)

Bounds: for the deterministic sort-by-degree algorithm,

• expected loss:

$$\operatorname*{E}_{V, \tau^*}\big[ L(A(V), \tau^*) \big] \leq 2 \operatorname*{E}_{V, \tau^*}\big[ L(h, \tau^*) \big].$$

• regret:

$$R_{\mathrm{rank}}(A(V)) \leq 2\, R_{\mathrm{class}}(h).$$

Time complexity: Ω(|V|²).

Randomized Algo. - Bipartite Case (V = V₊ ∪ V₋) (Ailon & MM, 08)

Bounds: for randomized QuickSort (Hoare, 61),

• expected loss (equality):

$$\operatorname*{E}_{V, \tau^*, s}\big[ L(Q^h_s(V), \tau^*) \big] = \operatorname*{E}_{V, \tau^*}\big[ L(h, \tau^*) \big].$$

• regret:

$$R_{\mathrm{rank}}(Q^h_s(\cdot)) \leq R_{\mathrm{class}}(h).$$

Time complexity:

• full set: O(n log n).

• top k: O(n + k log k).

Proof Ideas

QuickSort decomposition:

$$p_{uv} + \frac{1}{3} \sum_{w \notin \{u, v\}} p_{uvw} \big( h(u, w)\, h(w, v) + h(v, w)\, h(w, u) \big) = 1.$$

Bipartite property:

$$\tau^*(u, v) + \tau^*(v, w) + \tau^*(w, u) = \tau^*(v, u) + \tau^*(w, v) + \tau^*(u, w).$$

Lower Bound

Theorem: for any deterministic algorithm A, there is a bipartite distribution for which

$$R_{\mathrm{rank}}(A) \geq 2\, R_{\mathrm{class}}(h).$$

• thus, the factor of 2 is best possible in the deterministic case.

• randomization is necessary for a better bound.

Proof: take the simple case U = V = {u, v, w} and assume that h induces a cycle: h(u, v) = h(v, w) = h(w, u) = 1.

• up to symmetry, A returns (u, v, w) or (w, v, u).

Lower Bound

If A returns (u, v, w), then choose τ* bipartite with v, w positive and u negative.

If A returns (w, v, u), then choose τ* bipartite with u, v positive and w negative.

In either case:

$$L[h, \tau^*] = \frac{1}{3}; \qquad L[A, \tau^*] = \frac{2}{3}.$$

Guarantees - General Case

Loss bound for QuickSort:

$$\operatorname*{E}_{V, \tau^*, s}\big[ L(Q^h_s(V), \tau^*) \big] \leq 2 \operatorname*{E}_{V, \tau^*}\big[ L(h, \tau^*) \big].$$

Comparison with the optimal ranking (see (CSS 99)):

$$\operatorname*{E}_{s}\big[ L(Q^h_s(V), \sigma_{\mathrm{optimal}}) \big] \leq 2\, L(h, \sigma_{\mathrm{optimal}}),$$
$$\operatorname*{E}_{s}\big[ L(h, Q^h_s(V)) \big] \leq 3\, L(h, \sigma_{\mathrm{optimal}}),$$

where σ_optimal = argmin_σ L(h, σ).

Weight Function

Generalization:

$$\tau^*(u, v) = \sigma^*(u, v)\, \omega\big(\sigma^*(u), \sigma^*(v)\big).$$

Properties needed for all previous results to hold:

• symmetry: ω(i, j) = ω(j, i) for all i, j.

• monotonicity: ω(i, j), ω(j, k) ≤ ω(i, k) for i < j < k.

• triangle inequality: ω(i, j) ≤ ω(i, k) + ω(k, j) for all triplets i, j, k.

Weight Function - Examples

Kemeny:

$$\omega(i, j) = 1, \quad \forall\, i, j.$$

Top-k:

$$\omega(i, j) = \begin{cases} 1 & \text{if } i \leq k \text{ or } j \leq k; \\ 0 & \text{otherwise.} \end{cases}$$

Bipartite:

$$\omega(i, j) = \begin{cases} 1 & \text{if } i \leq k \text{ and } j > k; \\ 0 & \text{otherwise.} \end{cases}$$

k-partite: can be defined similarly.

(Strong) Regret Definitions

Ranking regret:

$$R_{\mathrm{rank}}(A) = \operatorname*{E}_{V, \tau^*, s}\big[ L(A_s(V), \tau^*) \big] - \min_{\sigma} \operatorname*{E}_{V, \tau^*}\big[ L(\sigma_{|V}, \tau^*) \big].$$

Preference regret:

$$R_{\mathrm{class}}(h) = \operatorname*{E}_{V, \tau^*}\big[ L(h_{|V}, \tau^*) \big] - \min_{h'} \operatorname*{E}_{V, \tau^*}\big[ L(h'_{|V}, \tau^*) \big].$$

All previous regret results hold if for all {u, v} and V₁, V₂ ⊇ {u, v},

$$\operatorname*{E}_{\tau^* | V_1}\big[ \tau^*(u, v) \big] = \operatorname*{E}_{\tau^* | V_2}\big[ \tau^*(u, v) \big]$$

(pairwise independence on irrelevant alternatives).

Mehryar Mohri - Courant Seminar, Courant Institute, NYU

References

• Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6, 393-425.

• Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT (pp. 32-47).

• Ailon, N., and Mohri, M. (2008). An efficient reduction of ranking to classification. Proceedings of COLT 2008. Helsinki, Finland. Omnipress.

• Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. Proceedings of COLT (pp. 604-619). Springer.

• Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. J. Artif. Intell. Res. (JAIR), 10, 243-270.

• Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT (pp. 605-619).

• Cortes, C., and Mohri, M. (2004). AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems (NIPS 2003). MIT Press.

• Cortes, C., Mohri, M., and Rastogi, A. (2007). An alternative ranking problem for search engines. Proceedings of WEA 2007 (pp. 1-21). Rome, Italy: Springer.

• Cortes, C., and Mohri, M. (2005). Confidence intervals for the area under the ROC curve. Advances in Neural Information Processing Systems (NIPS 2004). MIT Press.

• Crammer, K., and Singer, Y. (2001). Pranking with ranking. Proceedings of NIPS 2001 (pp. 641-647). MIT Press.

• Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the Web. WWW 10. ACM Press.

• Egan, J. P. (1975). Signal Detection Theory and ROC Analysis. Academic Press.

• Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. JMLR 4:933-969.

• Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.

• Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115-132.

• Joachims, T. (2002). Optimizing search engines using clickthrough data. Proceedings of the 8th ACM SIGKDD, pages 133-142.

• Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco, California: Holden-Day.

• Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project.

• Rudin, C., Cortes, C., Mohri, M., and Schapire, R. E. (2005). Margin-based ranking meets boosting in the middle. Proceedings of COLT 2005, pages 63-78.