
  • Advanced Machine Learning

    MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH.

    Domain Adaptation

  • Non-Ideal World

    [Diagram: the ideal learning scenario contrasted with the real world, where training and test data can differ through sampling, domain, and time.]

  • Outline

    Domain adaptation.

    Multiple-source domain adaptation.

  • Domain Adaptation

    Sentiment analysis.

    Language modeling, part-of-speech tagging.

    Statistical parsing.

    Speech recognition.

    Computer vision.

    A solution is critical for these applications.

  • This Talk

    Domain adaptation:

    • Discrepancy
    • Theoretical guarantees
    • Algorithm
    • Enhancements

  • Domain Adaptation Problem

    Domains: source $(Q, f_Q)$, target $(P, f_P)$.

    Input:

    • labeled sample $S$ drawn from the source.
    • unlabeled sample $T$ drawn from the target.

    Problem: find a hypothesis $h \in H$ with small expected loss with respect to the target domain, that is

    $$\mathcal{L}_P(h, f_P) = \mathbb{E}_{x \sim P}\big[L(h(x), f_P(x))\big].$$

  • Sample Bias Correction

    Problem: special case of domain adaptation with

    • $\mathrm{supp}(Q) \subseteq \mathrm{supp}(P)$.
    • $f_Q = f_P$.

  • Related Work in Theory

    Single-source adaptation:

    • relation between adaptation and the $d_A$ distance (Devroye et al., 1996; Kifer et al., 2004; Ben-David et al., 2007).

    • a few negative examples of adaptation (Ben-David et al., AISTATS 2010).

    • analysis and learning guarantees for importance weighting (Cortes, Mansour, and MM, NIPS 2010).

  • Related Work in Theory

    Multiple-source:

    • same input distribution, but different labels (Crammer et al., 2005, 2006).

    • theoretical analysis and method for multiple-source adaptation (Mansour, MM, Rostamizadeh, 2008).

  • Distribution Mismatch

    [Figure: two mismatched distributions $P$ and $Q$.]

    Which distance should we use to compare these distributions?

  • Simple Analysis

    Proposition: assume that the loss $L$ is bounded by $M$; then

    $$|\mathcal{L}_Q(h, f) - \mathcal{L}_P(h, f)| \le M\, L_1(Q, P).$$

    Proof:

    $$|\mathcal{L}_P(h, f) - \mathcal{L}_Q(h, f)| = \Big|\mathbb{E}_{x \sim P}\big[L(h(x), f(x))\big] - \mathbb{E}_{x \sim Q}\big[L(h(x), f(x))\big]\Big| = \Big|\sum_x \big(P(x) - Q(x)\big)\, L(h(x), f(x))\Big| \le M \sum_x \big|P(x) - Q(x)\big|.$$

    But is this bound informative?
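    As a quick numerical sanity check of this proposition, the sketch below compares $|\mathcal{L}_Q(h, f) - \mathcal{L}_P(h, f)|$ with $M \cdot L_1(Q, P)$ on a toy finite domain. The distributions, hypothesis, and labeling function are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)                      # toy finite domain
P = rng.dirichlet(np.ones(10))         # stand-in "target" distribution
Q = rng.dirichlet(np.ones(10))         # stand-in "source" distribution

f = (X >= 5).astype(float)             # made-up labeling function
h = (X >= 3).astype(float)             # made-up hypothesis
loss = np.abs(h - f)                   # 0/1 loss values, bounded by M = 1

LP = float(P @ loss)                   # L_P(h, f)
LQ = float(Q @ loss)                   # L_Q(h, f)
L1 = float(np.abs(P - Q).sum())        # L1(Q, P)
print(abs(LQ - LP), "<=", 1.0 * L1)    # |L_Q - L_P| <= M * L1(Q, P)
```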

  • Example - Zero-One Loss

    [Figure: a target function $f$ and a hypothesis $h$ that disagree exactly on a region $a$.]

    For the 0/1 loss,

    $$|\mathcal{L}_Q(h, f) - \mathcal{L}_P(h, f)| = |Q(a) - P(a)|.$$

  • Discrepancy

    Definition (Mansour, MM, Rostami, COLT 2009):

    $$\mathrm{disc}(P, Q) = \max_{h, h' \in H}\Big|\mathcal{L}_P(h, h') - \mathcal{L}_Q(h, h')\Big|.$$

    • symmetric, satisfies the triangle inequality, in general not a distance.
    • helps compare distributions for arbitrary losses, e.g. hinge loss, or $L_p$ loss.
    • generalization of the $d_A$ distance (Devroye et al., 1996; Kifer et al., 2004; Ben-David et al., 2007).
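    For a finite hypothesis set, the empirical discrepancy can be computed exactly by enumerating pairs $(h, h')$. The sketch below assumes threshold classifiers on the real line and the 0/1 loss; the data and hypothesis set are illustrative only.

```python
import itertools
import numpy as np

def empirical_discrepancy(XP, XQ, hypotheses, loss):
    """Brute-force empirical discrepancy for a finite hypothesis set:
    max over pairs (h, h') of |E_{P_hat}[L(h, h')] - E_{Q_hat}[L(h, h')]|."""
    best = 0.0
    for h, h2 in itertools.product(hypotheses, repeat=2):
        lp = np.mean([loss(h(x), h2(x)) for x in XP])
        lq = np.mean([loss(h(x), h2(x)) for x in XQ])
        best = max(best, abs(lp - lq))
    return best

# toy example: threshold classifiers on the line, 0/1 loss
thresholds = np.linspace(-2, 2, 21)
hyps = [lambda x, t=t: float(x > t) for t in thresholds]
zero_one = lambda a, b: float(a != b)

rng = np.random.default_rng(0)
XP = rng.normal(0.5, 1.0, size=200)    # unlabeled "target" sample
XQ = rng.normal(-0.5, 1.0, size=200)   # unlabeled "source" sample
print(empirical_discrepancy(XP, XQ, hyps, zero_one))
```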

  • Discrepancy - Properties

    Theorem: for an $L_q$ loss bounded by $M$, for any $\delta > 0$, with probability at least $1 - \delta$,

    $$\mathrm{disc}(P, Q) \le \mathrm{disc}(\widehat P, \widehat Q) + 4q\big(\widehat{\mathfrak{R}}_S(H) + \widehat{\mathfrak{R}}_T(H)\big) + 3M\Bigg(\sqrt{\frac{\log\frac{4}{\delta}}{2m}} + \sqrt{\frac{\log\frac{4}{\delta}}{2n}}\Bigg).$$

  • Discrepancy = Distance

    Theorem (Cortes & MM, TCS 2013): let $K$ be a universal kernel (e.g., Gaussian kernel) and $H = \{h \in H_K : \|h\|_K \le \Lambda\}$. Then, for the $L_2$ loss, the discrepancy is a distance over a compact set $X$.

    Proof: $\Psi\colon h \mapsto \mathbb{E}_{x \sim P}[h^2(x)] - \mathbb{E}_{x \sim Q}[h^2(x)]$ is continuous for the norm $\|\cdot\|_\infty$, thus continuous on $C(X)$.

    • $\mathrm{disc}(P, Q) = 0$ implies $\Psi(h) = 0$ for all $h \in H$.
    • since $H$ is dense in $C(X)$, $\Psi = 0$ over $C(X)$.
    • thus $\mathbb{E}_P[f] - \mathbb{E}_Q[f] = 0$ for all $f \ge 0$ in $C(X)$.
    • this implies $P = Q$.

  • Theoretical Guarantees

    Two types of questions:

    • difference between the average loss of a hypothesis $h$ on $P$ versus on $Q$?

    • difference of loss (measured on $P$) between the hypothesis $h'$ obtained when training on $(\widehat Q, f_Q)$ versus the hypothesis $h$ obtained when training on $(\widehat P, f_P)$?

  • Generalization Bound

    Notation:

    • $\mathcal{L}_Q(h^*_Q, f_Q) = \min_{h \in H} \mathcal{L}_Q(h, f_Q)$.
    • $\mathcal{L}_P(h^*_P, f_P) = \min_{h \in H} \mathcal{L}_P(h, f_P)$.

    Theorem (Mansour, MM, Rostami, COLT 2009): assume that $L$ obeys the triangle inequality; then the following holds:

    $$\mathcal{L}_P(h, f_P) \le \mathcal{L}_Q(h, h^*_Q) + \mathcal{L}_P(h^*_P, f_P) + \mathrm{disc}(P, Q) + \mathcal{L}_Q(h^*_Q, h^*_P).$$

  • Some Natural Cases

    When $h^* = h^*_Q = h^*_P$,

    $$\mathcal{L}_P(h, f_P) \le \mathcal{L}_Q(h, h^*) + \mathcal{L}_P(h^*, f_P) + \mathrm{disc}(P, Q).$$

    When $f_P \in H$ (consistent case),

    $$|\mathcal{L}_P(h, f_P) - \mathcal{L}_Q(h, f_P)| \le \mathrm{disc}(Q, P).$$

    • Bound of (Ben-David et al., NIPS 2006) or (Blitzer et al., NIPS 2007): always worse in these cases.

  • Regularized ERM Algorithms

    Objective function:

    $$F_{\widehat Q}(h) = \lambda \|h\|_K^2 + \widehat R_{\widehat Q}(h),$$

    where $K$ is a PDS kernel, $\lambda > 0$ is a trade-off parameter, and $\widehat R_{\widehat Q}(h)$ is the empirical error of $h$.

    • broad family of algorithms including SVM, SVR, kernel ridge regression, etc.
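    For the squared loss, this objective is kernel ridge regression and has a closed-form dual solution. A minimal sketch under illustrative assumptions (Gaussian kernel, synthetic data, empirical error taken as the mean squared error):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    """Regularized ERM with the squared loss (kernel ridge regression):
    minimizes lam * ||h||_K^2 + (1/m) * sum_i (h(x_i) - y_i)^2,
    whose dual solution solves (K + lam * m * I) alpha = y."""
    m = len(X)
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def krr_predict(alpha, X_train, X_test, gamma=1.0):
    return gaussian_kernel(X_test, X_train, gamma) @ alpha

# toy labeled source sample
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = krr_fit(X, y)
print(krr_predict(alpha, X, X[:5]))
```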

  • Guarantees for Reg. ERM

    Theorem (Cortes & MM, TCS 2013): let $K$ be a PDS kernel with $K(x, x) \le R^2$ and $L$ a loss function such that $L(\cdot, y)$ is $\mu$-Lipschitz. Let $h'$ be the minimizer of $F_{\widehat P}$ and $h$ that of $F_{\widehat Q}$. Then, for all $(x, y) \in X \times Y$,

    $$\big|L(h'(x), y) - L(h(x), y)\big| \le \mu R\, \sqrt{\frac{\mathrm{disc}(\widehat P, \widehat Q) + \mu\, \eta_H(f_P, f_Q)}{\lambda}},$$

    where

    $$\eta_H(f_P, f_Q) = \inf_{h \in H}\Big[\max_{x \in \mathrm{supp}(\widehat P)} |f_P(x) - h(x)| + \max_{x \in \mathrm{supp}(\widehat Q)} |f_Q(x) - h(x)|\Big].$$

  • Proof

    By the property of the minimizers, there exist subgradients such that

    $$2\lambda h' = -\delta \widehat R_{\widehat P}(h'), \qquad 2\lambda h = -\delta \widehat R_{\widehat Q}(h).$$

    Thus, using the convexity of the empirical errors,

    $$\begin{aligned}
    2\lambda \|h' - h\|^2 &= -\big\langle h' - h,\ \delta \widehat R_{\widehat P}(h') - \delta \widehat R_{\widehat Q}(h)\big\rangle \\
    &= -\big\langle h' - h,\ \delta \widehat R_{\widehat P}(h')\big\rangle + \big\langle h' - h,\ \delta \widehat R_{\widehat Q}(h)\big\rangle \\
    &\le \widehat R_{\widehat P}(h) - \widehat R_{\widehat P}(h') + \widehat R_{\widehat Q}(h') - \widehat R_{\widehat Q}(h) \\
    &\le 2\,\mathrm{disc}(\widehat P, \widehat Q) + 2\mu\, \eta_H(f_P, f_Q).
    \end{aligned}$$

  • Guarantees for Reg. ERM

    Theorem (Cortes & MM, TCS 2013): let $K$ be a PDS kernel with $K(x, x) \le R^2$ and let $L$ be the $L_2$ loss, bounded by $M$. Then, for all $(x, y)$,

    $$\big|L(h'(x), y) - L(h(x), y)\big| \le \frac{R\sqrt{M}}{\lambda}\Big(\delta + \sqrt{\delta^2 + 4\lambda\, \mathrm{disc}(\widehat P, \widehat Q)}\Big),$$

    where

    $$\delta = \min_{h \in H}\Big\|\, \mathbb{E}_{x \sim \widehat Q}\big[(h(x) - f_Q(x))\, K(x, \cdot)\big] - \mathbb{E}_{x \sim \widehat P}\big[(h(x) - f_P(x))\, K(x, \cdot)\big]\Big\|_K.$$

    For $f_P = f_Q = f$, $\delta = 0$:

    • if $f$ is $\epsilon$-close to $H$ on the samples.
    • for a Gaussian kernel and $f$ continuous.

  • Empirical Discrepancy

    The discrepancy measure $\mathrm{disc}(\widehat P, \widehat Q)$ is the critical term in the bounds.

    A smaller empirical discrepancy guarantees closeness of the pointwise losses of $h'$ and $h$.

    But can we further reduce the discrepancy?

  • Algorithm - Idea

    Search for a new empirical distribution $q^*$ with the same support:

    $$q^* = \operatorname*{argmin}_{\mathrm{supp}(q) \subseteq \mathrm{supp}(\widehat Q)} \mathrm{disc}(\widehat P, q).$$

    Solve the modified optimization problem:

    $$\min_h\ F_{q^*}(h) = \sum_{i=1}^m q^*(x_i)\, L(h(x_i), y_i) + \lambda \|h\|_K^2.$$
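    With the squared loss, the reweighted objective is still a kernel ridge regression problem; only the dual linear system changes. A minimal sketch, assuming the weights $q$ are given (computing $q^* = \operatorname{argmin}_q \mathrm{disc}(\widehat P, q)$ itself is the LP/SDP discussed on the following slides); the kernel and data are illustrative:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr_fit(X, y, q, lam=0.1, gamma=1.0):
    """Reweighted regularized ERM with the squared loss:
    minimizes sum_i q_i (h(x_i) - y_i)^2 + lam * ||h||_K^2.
    Stationarity gives the linear system (diag(q) K + lam I) alpha = diag(q) y."""
    K = gaussian_kernel(X, X, gamma)
    W = np.diag(q)
    return np.linalg.solve(W @ K + lam * np.eye(len(X)), W @ y)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
q = rng.dirichlet(np.ones(50))     # stand-in for the minimizer q* of disc(P_hat, q)
alpha = weighted_krr_fit(X, y, q)
print(alpha[:5])
```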

  • Case of Halfspaces

  • Min-Max Problem

    Reformulation:

    $$\widehat Q^* = \operatorname*{argmin}_{\widehat Q' \in \mathcal{Q}}\ \max_{h, h' \in H}\ \big|\mathcal{L}_{\widehat P}(h', h) - \mathcal{L}_{\widehat Q'}(h', h)\big|.$$

    • game theoretical interpretation.
    • gives lower bound:

    $$\max_{h, h' \in H}\ \min_{\widehat Q' \in \mathcal{Q}}\ \big|\mathcal{L}_{\widehat P}(h', h) - \mathcal{L}_{\widehat Q'}(h', h)\big| \;\le\; \min_{\widehat Q' \in \mathcal{Q}}\ \max_{h, h' \in H}\ \big|\mathcal{L}_{\widehat P}(h', h) - \mathcal{L}_{\widehat Q'}(h', h)\big|.$$

  • Classification - 0/1 Loss

    Problem:

    $$\min_{Q'}\ \max_{a \in H \Delta H}\ \big|\widehat Q'(a) - \widehat P(a)\big| \qquad \text{subject to } \forall x \in S_Q,\ \widehat Q'(x) \ge 0 \ \wedge \ \sum_{x \in S_Q} \widehat Q'(x) = 1.$$

  • Classification - 0/1 Loss

    Linear program (LP):

    $$\begin{aligned}
    \min_{Q', \delta}\ & \delta \\
    \text{subject to}\ & \forall a \in H \Delta H,\ \widehat Q'(a) - \widehat P(a) \le \delta \\
    & \forall a \in H \Delta H,\ \widehat P(a) - \widehat Q'(a) \le \delta \\
    & \forall x \in S_Q,\ \widehat Q'(x) \ge 0 \ \wedge\ \sum_{x \in S_Q} \widehat Q'(x) = 1.
    \end{aligned}$$

    • Number of constraints bounded by the shattering coefficient $\Pi_{H \Delta H}(m' + n')$.
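    The LP can be set up directly with a generic solver. A minimal sketch using scipy's linprog, assuming one-dimensional data and threshold classifiers, for which the regions of $H \Delta H$ are intervals between pairs of thresholds; all data and names are illustrative:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

SP = np.array([-1.5, -0.5, 0.0, 0.7, 1.2])      # sample from P_hat (uniform weights)
SQ = np.array([-2.0, -1.0, -0.3, 0.4, 2.0])     # support points for the new q'
thresholds = np.linspace(-2.5, 2.5, 11)

def region_indicator(x, t1, t2):
    # two threshold classifiers disagree exactly on the interval (t1, t2]
    return float(min(t1, t2) < x <= max(t1, t2))

regions = list(itertools.combinations(thresholds, 2))
P_hat = np.full(len(SP), 1.0 / len(SP))

# LP variables: [q'(x) for x in SQ] + [delta]
n = len(SQ)
A_ub, b_ub = [], []
for (t1, t2) in regions:
    a_q = np.array([region_indicator(x, t1, t2) for x in SQ])
    p_a = sum(P_hat[i] * region_indicator(x, t1, t2) for i, x in enumerate(SP))
    A_ub.append(np.append(a_q, -1.0)); b_ub.append(p_a)      # q'(a) - P(a) <= delta
    A_ub.append(np.append(-a_q, -1.0)); b_ub.append(-p_a)    # P(a) - q'(a) <= delta

A_eq, b_eq = [np.append(np.ones(n), 0.0)], [1.0]             # sum_x q'(x) = 1
c = np.append(np.zeros(n), 1.0)                              # minimize delta
bounds = [(0, None)] * (n + 1)
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[-1], res.x[:n])                                  # minimal discrepancy and q*
```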

  • Algorithm - 1D

  • Regression - L2 Loss

    Problem:

    $$\min_{\widehat Q' \in \mathcal{Q}}\ \max_{h, h' \in H}\ \Big|\operatorname*{\mathbb{E}}_{\widehat P}\big[(h'(x) - h(x))^2\big] - \operatorname*{\mathbb{E}}_{\widehat Q'}\big[(h'(x) - h(x))^2\big]\Big|.$$

    For linear hypotheses $h\colon x \mapsto \mathbf{w}^\top x$ with $\|\mathbf{w}\| \le 1$:

    $$\begin{aligned}
    \min_{\widehat Q' \in \mathcal{Q}}\ \max_{\|\mathbf{w}\| \le 1,\, \|\mathbf{w}'\| \le 1}\ & \Big|\operatorname*{\mathbb{E}}_{\widehat P}\big[((\mathbf{w}' - \mathbf{w})^\top x)^2\big] - \operatorname*{\mathbb{E}}_{\widehat Q'}\big[((\mathbf{w}' - \mathbf{w})^\top x)^2\big]\Big| \\
    = \min_{\widehat Q' \in \mathcal{Q}}\ \max_{\|\mathbf{w}\| \le 1,\, \|\mathbf{w}'\| \le 1}\ & \Big|\sum_{x \in S} \big(\widehat P(x) - \widehat Q'(x)\big)\big[(\mathbf{w}' - \mathbf{w})^\top x\big]^2\Big| \\
    = \min_{\widehat Q' \in \mathcal{Q}}\ \max_{\|\mathbf{u}\| \le 2}\ & \Big|\sum_{x \in S} \big(\widehat P(x) - \widehat Q'(x)\big)\big[\mathbf{u}^\top x\big]^2\Big| \\
    = \min_{\widehat Q' \in \mathcal{Q}}\ \max_{\|\mathbf{u}\| \le 2}\ & \Big|\mathbf{u}^\top \Big(\sum_{x \in S} \big(\widehat P(x) - \widehat Q'(x)\big)\, x x^\top\Big)\mathbf{u}\Big|.
    \end{aligned}$$

  • Regression - L2 Loss

    Problem equivalent to

    $$\min_{\|\mathbf{z}\|_1 = 1,\, \mathbf{z} \ge 0}\ \max_{\|\mathbf{u}\| = 1}\ \big|\mathbf{u}^\top M(\mathbf{z})\, \mathbf{u}\big|,$$

    with

    $$M(\mathbf{z}) = M_0 - \sum_{i=1}^{m'} z_i M_i, \qquad M_0 = \sum_{x \in S} \widehat P(x)\, x x^\top, \qquad M_i = s_i s_i^\top,$$

    where the $s_i$ are the elements of $\mathrm{supp}(\widehat Q)$.
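    For a fixed candidate $\mathbf{z}$, the inner maximization has a closed form: since $M(\mathbf{z})$ is symmetric, $\max_{\|\mathbf{u}\|=1} |\mathbf{u}^\top M(\mathbf{z})\, \mathbf{u}|$ is its largest eigenvalue in absolute value. A small numpy sketch on toy data (names and data are illustrative):

```python
import numpy as np

def discrepancy_l2_linear(z, SP_points, P_weights, SQ_points):
    """Evaluate max_{||u||=1} |u^T M(z) u| for the L2 loss with linear hypotheses.
    M(z) = sum_x P_hat(x) x x^T - sum_i z_i s_i s_i^T; the max over the unit sphere
    of |u^T M u| is the largest absolute eigenvalue of the symmetric matrix M(z)."""
    M0 = sum(p * np.outer(x, x) for p, x in zip(P_weights, SP_points))
    Mz = M0 - sum(zi * np.outer(s, s) for zi, s in zip(z, SQ_points))
    return np.max(np.abs(np.linalg.eigvalsh(Mz)))

# toy 2D example with uniform P_hat weights and a uniform candidate reweighting z
rng = np.random.default_rng(0)
SP = rng.normal(size=(100, 2)) + np.array([1.0, 0.0])
SQ = rng.normal(size=(80, 2))
z = np.full(len(SQ), 1.0 / len(SQ))
print(discrepancy_l2_linear(z, SP, np.full(len(SP), 1.0 / len(SP)), SQ))
```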

  • Regression - L2 Loss

    Semi-definite program (SDP), linear hypotheses:

    $$\begin{aligned}
    \min_{\mathbf{z}, \lambda}\ & \lambda \\
    \text{subject to}\ & \lambda I - M(\mathbf{z}) \succeq 0 \\
    & \lambda I + M(\mathbf{z}) \succeq 0 \\
    & \mathbf{1}^\top \mathbf{z} = 1 \ \wedge\ \mathbf{z} \ge 0,
    \end{aligned}$$

    where the matrix $M(\mathbf{z})$ is defined by:

    $$M(\mathbf{z}) = \sum_{x \in S} \widehat P(x)\, x x^\top - \sum_{i=1}^{m'} z_i\, s_i s_i^\top.$$
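    A minimal sketch of this discrepancy-minimization problem with cvxpy, written with the equivalent spectral-norm objective $\min_{\mathbf{z}} \|M(\mathbf{z})\|_2$ (which cvxpy compiles to an SDP); the deck does not prescribe a solver, and the toy data and uniform $\widehat P$ weights are assumptions:

```python
import cvxpy as cp
import numpy as np

# Toy 2D data; the sample arrays and uniform P_hat weights are illustrative.
rng = np.random.default_rng(0)
SP = rng.normal(size=(100, 2)) + np.array([1.0, 0.0])   # sample from P_hat
SQ = rng.normal(size=(80, 2))                           # support points s_i of Q_hat
m = len(SQ)

M0 = sum(np.outer(x, x) for x in SP) / len(SP)          # sum_x P_hat(x) x x^T
z = cp.Variable(m, nonneg=True)                         # candidate reweighting of supp(Q_hat)
Mz = cp.Constant(M0) - sum(z[i] * np.outer(SQ[i], SQ[i]) for i in range(m))

# min_z ||M(z)||_2 subject to z in the simplex; for symmetric M(z) the spectral
# norm equals max_{||u||=1} |u^T M(z) u|, i.e. the empirical discrepancy above.
prob = cp.Problem(cp.Minimize(cp.sigma_max(Mz)), [cp.sum(z) == 1])
prob.solve()                                            # requires an SDP-capable solver (e.g. SCS)
print(prob.value, z.value)                              # minimal discrepancy and weights q*
```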

  • Regression - L2 Loss

    SDP, generalization to an RKHS $H$ for some kernel $K$:

    $$\begin{aligned}
    \min_{\mathbf{z}, \lambda}\ & \lambda \\
    \text{subject to}\ & \lambda I - M(\mathbf{z}) \succeq 0 \\
    & \lambda I + M(\mathbf{z}) \succeq 0 \\
    & \mathbf{1}^\top \mathbf{z} = 1 \ \wedge\ \mathbf{z} \ge 0,
    \end{aligned}$$

    with:

    $$M(\mathbf{z}) = M_0 - \sum_{i=1}^{m'} z_i M_i, \qquad M_0 = \mathbf{K}^{1/2}\, \mathrm{diag}\big(\widehat P(s_1), \ldots, \widehat P(s_{p'})\big)\, \mathbf{K}^{1/2}, \qquad M_i = \mathbf{K}^{1/2} I_i\, \mathbf{K}^{1/2}.$$

  • Discrepancy Min. Algorithm

    Convex optimization (Cortes & MM, TCS 2013):

    • cast as a semi-definite programming (SDP) problem.
    • efficient solution using smooth optimization.

    Algorithm and solution for arbitrary kernels.

    Outperforms other algorithms in experiments.

  • Experiments

    Classification:

    • Q and P Gaussians.
    • H: halfspaces.
    • f: interval [-1, +1].

    [Plot: performance as a function of the number of training points, comparing "w/ min disc" and "w/ orig disc".]

  • Experiments

    Regression:

    [Plot: error as a function of the number of training points for training on $\widehat Q$, on the reweighted $\widehat Q'$, and on $P$.]

    SDP solved in about 15s using SeDuMi on a 3GHz CPU with 2GB memory.

  • Experiments

  • Enhancement

    Shortcomings:

    • discrepancy depends on the maximizing pair of hypotheses.
    • DM algorithm too conservative.

    Ideas (Cortes, MM, and Muñoz, 2014):

    • finer quantity: generalized discrepancy, hypothesis-dependent.
    • reweighting depending on the hypothesis.

  • Algorithm

    Choose $Q_h$ such that the objectives

    $$\lambda \|h\|_K^2 + \mathcal{L}_{Q_h}(h, f_Q) \qquad \text{and} \qquad \lambda \|h\|_K^2 + \mathcal{L}_{\widehat P}(h, f_P)$$

    are uniformly close (Cortes, MM, and Muñoz, 2014).

    Ideally:

    $$Q_h = \operatorname*{argmin}_{q}\ \big|\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat P}(h, f_P)\big|.$$

    Using a convex surrogate $H''$:

    $$Q_h = \operatorname*{argmin}_{q}\ \max_{h'' \in H''}\ \big|\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat P}(h, h'')\big|.$$

  • Optimization

    $$\begin{aligned}
    \mathcal{L}_{Q_h}(h, f_Q) &= \operatorname*{argmin}_{l \in \{\mathcal{L}_q(h, f_Q)\,:\, q \in F(S_X, \mathbb{R})\}}\ \max_{h'' \in H''}\ \big|l - \mathcal{L}_{\widehat P}(h, h'')\big| \\
    &= \operatorname*{argmin}_{l \in \mathbb{R}}\ \max_{h'' \in H''}\ \big|l - \mathcal{L}_{\widehat P}(h, h'')\big| \\
    &= \frac{1}{2}\Big(\max_{h'' \in H''} \mathcal{L}_{\widehat P}(h, h'') + \min_{h'' \in H''} \mathcal{L}_{\widehat P}(h, h'')\Big).
    \end{aligned}$$

    Convex optimization problem (loss jointly convex):

    $$\min_h\ \lambda \|h\|_K^2 + \frac{1}{2}\Big(\max_{h'' \in H''} \mathcal{L}_{\widehat P}(h, h'') + \min_{h'' \in H''} \mathcal{L}_{\widehat P}(h, h'')\Big).$$

    (Cortes, MM, and Muñoz, 2014)

  • Convex Surrogate Hypothesis Set

    Choice of $H''$ among balls

    $$B(r) = \{h'' \in H \mid \mathcal{L}_q(h'', f_Q) \le r^p\}.$$

    Generalization bound proven to be more favorable than DM for some choices of the radius $r$.

    Radius $r$ chosen via cross-validation using a small amount of labeled data from the target.

    Further improvement of empirical results.

    (Cortes, MM, and Muñoz, 2014)

  • Conclusion

    Theory of adaptation based on the discrepancy:

    • key term in the analysis of adaptation and drifting.
    • discrepancy minimization algorithm DM.
    • compares favorably to other adaptation algorithms in experiments.

    Generalized discrepancy:

    • extension to hypothesis-dependent reweighting.
    • convex optimization problem.
    • further empirical improvements.

  • Outline

    Domain adaptation.

    Multiple-source domain adaptation.

  • Problem Formulation

    Given $k$ source distributions $D_1, \ldots, D_k$ and corresponding hypotheses $h_1, \ldots, h_k$; the target function $f$ is unknown.

    Each hypothesis performs well in its own domain: $\forall i,\ \mathcal{L}(D_i, h_i, f) \le \epsilon$.

    Notation: $\mathcal{L}(D_i, h_i, f) = \mathbb{E}_{x \sim D_i}\big[L(h_i(x), f(x))\big]$.

    The loss $L$ is assumed non-negative, bounded, convex and continuous.

  • Problem Formulation

    The unknown target distribution is a mixture of the input distributions:

    $$D_T(x) = \sum_{i=1}^k \lambda_i D_i(x).$$

    Task: choose a hypothesis combination that performs well on the target distribution.

    Convex combination rule:

    $$h_{\mathbf{z}}(x) = \sum_{i=1}^k z_i\, h_i(x).$$

    Distribution weighted combination rule:

    $$h_{\mathbf{z}}(x) = \sum_{i=1}^k \frac{z_i D_i(x)}{\sum_{j=1}^k z_j D_j(x)}\, h_i(x).$$
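    Both combination rules are straightforward to implement on a finite domain. The sketch below is illustrative (D[i, x] stands for $D_i(x)$); the smoothing parameter eta anticipates the refined rule $h^{\eta}_{\mathbf{z}}$ introduced a few slides later:

```python
import numpy as np

def convex_combination(z, h_vals):
    """h_z(x) = sum_i z_i h_i(x); h_vals[i, x] = h_i(x)."""
    return z @ h_vals

def distribution_weighted(z, D, h_vals, eta=0.0):
    """h_z(x) = sum_i [z_i D_i(x) + eta/k] / [sum_j z_j D_j(x) + eta] * h_i(x).
    eta = 0 gives the rule on this slide; eta > 0 gives the smoothed variant
    used later to handle points where all z_j D_j(x) vanish."""
    k = len(z)
    num = z[:, None] * D + eta / k               # (k, n)
    den = num.sum(axis=0, keepdims=True)
    weights = num / np.maximum(den, 1e-12)
    return (weights * h_vals).sum(axis=0)

# toy example: 2 domains over 4 points
D = np.array([[0.4, 0.4, 0.1, 0.1],
              [0.1, 0.1, 0.4, 0.4]])
h_vals = np.array([[1.0, 1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0, 0.0]])
z = np.array([0.5, 0.5])
print(convex_combination(z, h_vals))
print(distribution_weighted(z, D, h_vals))
```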

  • Known Target Distribution

    For some distributions, any convex combination performs poorly:

    • the base hypotheses have no error within their own domain.
    • yet any convex combination has error 1/2!

    Hypothesis output:

        f    h0   h1
    a   1    1    0
    b   0    1    0

    Distribution weights:

        DT   D0   D1
    a   0.5  1    0
    b   0.5  0    1

  • Main Results

    Thus, although convex combinations seem natural, they can perform very poorly.

    We will show that distribution weighted combinations seem to define the "right" combination rule.

    There exists a single "robust" distribution weighted hypothesis that does well for any target mixture:

    $$\forall f,\ \exists \mathbf{z},\ \forall \boldsymbol{\lambda},\quad \mathcal{L}(D_{\boldsymbol{\lambda}}, h_{\mathbf{z}}, f) \le \epsilon.$$

  • Known Target Distribution

    If the mixture distribution is known, the distribution weighted rule will always do well. Choose $\mathbf{z} = \boldsymbol{\lambda}$:

    $$h_{\boldsymbol{\lambda}}(x) = \sum_{i=1}^k \frac{\lambda_i D_i(x)}{\sum_{j=1}^k \lambda_j D_j(x)}\, h_i(x).$$

    Proof: by convexity of the loss,

    $$\mathcal{L}(D_T, h_{\boldsymbol{\lambda}}, f) = \sum_{x \in X} L(h_{\boldsymbol{\lambda}}(x), f(x))\, D_T(x) \le \sum_{x \in X} \sum_{i=1}^k \frac{\lambda_i D_i(x)}{D_T(x)}\, L(h_i(x), f(x))\, D_T(x) = \sum_{i=1}^k \lambda_i\, \mathcal{L}(D_i, h_i, f) \le \sum_{i=1}^k \lambda_i\, \epsilon = \epsilon.$$

  • Unknown Target Mixture

    Zero-sum game:

    • NATURE: selects a target distribution $D_i$.
    • LEARNER: selects a $\mathbf{z}$, i.e. a distribution weighted hypothesis $h_{\mathbf{z}}$.
    • Payoff: $\mathcal{L}(D_i, h_{\mathbf{z}}, f)$.
    • Already shown: the game value is at most $\epsilon$.

    Minimax theorem (modulo discretization of $\mathbf{z}$): there exists a mixture of distribution weighted hypotheses $\sum_j \mu_j h_{\mathbf{z}_j}$ that does well for any distribution mixture.

  • Balancing Losses

    Brouwer's Fixed Point theorem: for any compact, convex, non-empty set $A$ and any continuous function $f\colon A \to A$, there exists $x \in A$ such that $f(x) = x$.

    Notation: $\mathcal{L}_{\mathbf{z},i} := \mathcal{L}(D_i, h_{\mathbf{z}}, f)$.

    Define the mapping $\phi$ by:

    $$[\phi(\mathbf{z})]_i = \frac{z_i\, \mathcal{L}_{\mathbf{z},i}}{\sum_j z_j\, \mathcal{L}_{\mathbf{z},j}}.$$

    By the fixed point theorem (modulo continuity):

    $$\exists \mathbf{z}\colon\ \forall i,\ z_i = \frac{z_i\, \mathcal{L}_{\mathbf{z},i}}{\sum_j z_j\, \mathcal{L}_{\mathbf{z},j}} \quad \Longrightarrow \quad \forall i,\ \mathcal{L}_{\mathbf{z},i} = \sum_j z_j\, \mathcal{L}_{\mathbf{z},j} =: \gamma.$$
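    The argument above is an existence proof via Brouwer's theorem, not an algorithm. Purely for illustration, the sketch below iterates the mapping $\phi$ on a toy discrete problem with the squared loss and inspects whether the per-domain losses $\mathcal{L}_{\mathbf{z},i}$ become balanced; convergence of this naive iteration is an assumption here, not a guarantee:

```python
import numpy as np

def hz_predictions(z, D, h_vals):
    """Distribution weighted combination h_z over a finite domain.
    D: (k, n) array with D[i, x] = D_i(x); h_vals: (k, n) array of h_i(x)."""
    w = z[:, None] * D
    w = w / np.maximum(w.sum(axis=0, keepdims=True), 1e-12)
    return (w * h_vals).sum(axis=0)                       # shape (n,)

def per_domain_losses(z, D, h_vals, f_vals):
    pred = hz_predictions(z, D, h_vals)
    return (D * (pred - f_vals) ** 2).sum(axis=1)         # L(D_i, h_z, f), squared loss

def balance_weights(D, h_vals, f_vals, iters=500):
    k = D.shape[0]
    z = np.full(k, 1.0 / k)
    for _ in range(iters):
        L = per_domain_losses(z, D, h_vals, f_vals) + 1e-12
        z = z * L / np.dot(z, L)                          # the mapping phi
    return z, per_domain_losses(z, D, h_vals, f_vals)

# toy example: 2 domains over 3 points
D = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
h_vals = np.array([[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]])
f_vals = np.array([1.0, 0.5, 1.0])
z, losses = balance_weights(D, h_vals, f_vals)
print(z, losses)   # inspect whether the per-domain losses are roughly balanced
```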

  • Bounding Loss

    For the fixed point $\mathbf{z}$,

    $$\mathcal{L}(D_{\mathbf{z}}, h_{\mathbf{z}}, f) = \sum_{x \in X} L(h_{\mathbf{z}}(x), f(x)) \Big[\sum_{i=1}^k z_i D_i(x)\Big] = \sum_{i=1}^k z_i \sum_{x \in X} D_i(x)\, L(h_{\mathbf{z}}(x), f(x)) = \sum_{i=1}^k z_i\, \mathcal{L}_{\mathbf{z},i} = \sum_{i=1}^k z_i\, \gamma = \gamma.$$

    Also, by convexity,

    $$\gamma = \mathcal{L}(D_{\mathbf{z}}, h_{\mathbf{z}}, f) \le \sum_{x \in X} \sum_{i=1}^k \frac{z_i D_i(x)}{D_{\mathbf{z}}(x)}\, L(h_i(x), f(x))\, D_{\mathbf{z}}(x) = \sum_{i=1}^k z_i\, \mathcal{L}(D_i, h_i, f) \le \epsilon.$$

  • Bounding Loss

    Thus, $\mathcal{L}(D_i, h_{\mathbf{z}}, f) = \mathcal{L}_{\mathbf{z},i} = \gamma \le \epsilon$ for every $i$, and for any mixture $D_{\boldsymbol{\lambda}} = \sum_i \lambda_i D_i$,

    $$\mathcal{L}(D_{\boldsymbol{\lambda}}, h_{\mathbf{z}}, f) = \sum_{i=1}^k \lambda_i\, \mathcal{L}(D_i, h_{\mathbf{z}}, f) \le \sum_{i=1}^k \lambda_i\, \gamma = \gamma \le \epsilon.$$
  • pageAdvanced Machine Learning - Mohri@

    DetailsTo deal with non-continuity refine hypotheses: 




    Theorem: for any target function and any , 


    If loss obeys triangle inequality:

    53

    holds for all admissible target functions.

    f �>0

    �� > 0, z : ��,L(D�, h�z , f) � � + �.

    �� > 0, �z, � > 0, ��, f � F , L(D�, h�z , f) � 3� + �.

    h�z(x) =k�

    i=1

    ziDi(x) + �/k�kj=1 zjDj(x) + �

    hi(x).

  • A Simple Algorithm

    A simple constructive choice: take $\mathbf{z}$ with uniform weights ($z_i = 1/k$), giving $h_u$. Then, for any mixture $D_{\boldsymbol{\lambda}}$,

    $$\begin{aligned}
    \mathcal{L}(D_{\boldsymbol{\lambda}}, h_u, f) &= \sum_x D_{\boldsymbol{\lambda}}(x)\, L\Big(\sum_{i=1}^k \frac{D_i(x)}{\sum_{j=1}^k D_j(x)}\, h_i(x),\ f(x)\Big) \\
    &= \sum_x \Big[\sum_{m=1}^k \lambda_m D_m(x)\Big]\, L\Big(\sum_{i=1}^k \frac{D_i(x)}{\sum_{j=1}^k D_j(x)}\, h_i(x),\ f(x)\Big) \\
    &\le \sum_x \underbrace{\frac{\sum_{m=1}^k \lambda_m D_m(x)}{\sum_{j=1}^k D_j(x)}}_{\le 1}\ \sum_{i=1}^k D_i(x)\, L(h_i(x), f(x)) \\
    &\le \sum_{i=1}^k \sum_x D_i(x)\, L(h_i(x), f(x)) = \sum_{i=1}^k \mathcal{L}(D_i, h_i, f) \le k\epsilon.
    \end{aligned}$$

  • Preliminary Empirical Results

    Sentiment analysis: given a product review (a text string), predict a rating (between 1.0 and 5.0).

    4 domains: Books, DVDs, Electronics, and Kitchen Appliances.

    Base hypotheses are trained within each domain (Support Vector Regression).

    We are not given the distributions; we model each distribution using a bag-of-words model.

    We then test the distribution combination rule on known target mixture domains.

  • Empirical Results

    [Plot: MSE on the uniform mixture over the 4 domains, comparing in-domain and out-domain predictors.]

  • Empirical Results

    [Plot: MSE as a function of $\alpha$ for the target mixture $\alpha\,\text{book} + (1 - \alpha)\,\text{kitchen}$, comparing the distribution weighted rule, the linear (convex) combination, and the individual book and kitchen predictors.]

  • Conclusion

    Formulation of the multiple-source adaptation problem.

    Theoretical analysis for mixture distributions.

    Efficient algorithm for finding the distribution weighted combination hypothesis?

    Beyond mixture distributions?

  • Rényi Divergences

    Definition: for $\alpha \ge 0$,

    $$D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)\Big[\frac{P(x)}{Q(x)}\Big]^{\alpha - 1}.$$

    • $\alpha = 1$: coincides with the relative entropy.
    • $\alpha = 2$: logarithm of the expected probability ratio,

    $$D_2(P \| Q) = \log \operatorname*{\mathbb{E}}_{x \sim P}\Big[\frac{P(x)}{Q(x)}\Big].$$

    • $\alpha = +\infty$: logarithm of the maximum probability ratio,

    $$D_{\infty}(P \| Q) = \log \sup_{x \in \mathrm{supp}(P)} \frac{P(x)}{Q(x)}.$$
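    A small sketch computing $D_\alpha(P \| Q)$ for discrete distributions, together with the exponentiated form $d_\alpha(P \| Q) = \exp(D_\alpha(P \| Q))$ that appears in the multiple-source bound on the next slide:

```python
import numpy as np

def renyi_divergence(P, Q, alpha):
    """Rényi divergence D_alpha(P || Q) for discrete distributions (alpha > 0, alpha != 1)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    return np.log(np.sum(P[mask] * (P[mask] / Q[mask]) ** (alpha - 1))) / (alpha - 1)

def d_alpha(P, Q, alpha):
    """Exponentiated form d_alpha(P || Q) = [sum_x P^alpha(x) / Q^{alpha-1}(x)]^{1/(alpha-1)},
    i.e. exp(D_alpha(P || Q)), used in the multiple-source adaptation bound."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    return np.sum(P[mask] ** alpha / Q[mask] ** (alpha - 1)) ** (1.0 / (alpha - 1))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
for a in (1.5, 2.0, 10.0):
    print(a, renyi_divergence(P, Q, a), d_alpha(P, Q, a))
```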

  • Extensions - Arbitrary Target

    Theorem (Mansour, MM, and Rostami, 2009): for any $\delta > 0$ and $\alpha > 1$, with closeness to the family of mixtures $\mathcal{Q} = \big\{\sum_{i=1}^k \lambda_i D_i : \boldsymbol{\lambda} \in \Delta_k\big\}$ measured in terms of the Rényi divergence,

    $$\exists \eta, \mathbf{z}\colon\ \forall P,\quad \mathcal{L}(P, h^{\eta}_{\mathbf{z}}, f) \le \big[d_{\alpha}(P \| \mathcal{Q})\,(\epsilon + \delta)\big]^{\frac{\alpha - 1}{\alpha}}\, M^{\frac{1}{\alpha}},$$

    where

    $$d_{\alpha}(P \| Q) = \Big[\sum_x \frac{P^{\alpha}(x)}{Q^{\alpha - 1}(x)}\Big]^{\frac{1}{\alpha - 1}}.$$

  • Proof

    By Hölder's inequality, for any hypothesis $h$,

    $$\begin{aligned}
    \mathcal{L}(P, h, f) &= \sum_x \frac{P(x)}{Q^{\frac{\alpha-1}{\alpha}}(x)}\; Q^{\frac{\alpha-1}{\alpha}}(x)\, L(h(x), f(x)) \\
    &\le \Big[\sum_x \frac{P^{\alpha}(x)}{Q^{\alpha-1}(x)}\Big]^{\frac{1}{\alpha}} \Big[\sum_x Q(x)\, L^{\frac{\alpha}{\alpha-1}}(h(x), f(x))\Big]^{\frac{\alpha-1}{\alpha}} \\
    &= \big(d_{\alpha}(P \| Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\operatorname*{\mathbb{E}}_{x \sim Q}\big[L^{\frac{\alpha}{\alpha-1}}(h(x), f(x))\big]\Big]^{\frac{\alpha-1}{\alpha}} \\
    &= \big(d_{\alpha}(P \| Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\operatorname*{\mathbb{E}}_{x \sim Q}\big[L(h(x), f(x))\, L^{\frac{1}{\alpha-1}}(h(x), f(x))\big]\Big]^{\frac{\alpha-1}{\alpha}} \\
    &\le \big(d_{\alpha}(P \| Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\mathcal{L}(Q, h, f)\, M^{\frac{1}{\alpha-1}}\Big]^{\frac{\alpha-1}{\alpha}}.
    \end{aligned}$$

  • Other Extensions

    Approximate (estimated) distributions:

    • similar results shown, depending on the divergence between the true and the estimated distributions.

    Different source target functions $f_i$:

    • similar results when the target functions $f_i$ are close to $f$ on the target distribution.

    (Mansour, MM, and Rostami, 2009)

  • References


    S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS. 2007.

    S. Ben-David, T. Lu, T. Luu, and D. Pal. Impossibility theorems for domain adaptation. Journal of Machine Learning Research-Proceedings Track, 9:129–136, 2010.

    S. Bickel, M. Bruckner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10:2137–2155, 2009.

    J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS. 2008.

    Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In NIPS. 2010.

    Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014.


  • References


    Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Proceedings of ALT. 2008.

    Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from Data of Variable Quality. In Proceedings of NIPS, 2006.

    Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. In Proceedings of NIPS, 2007.

    Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer.

    M. Dredze, J. Blitzer, P. P. Talukdar, K. Ganchev, J. Graca, and F. Pereira. Frustratingly Hard Domain Adaptation for Parsing. In CoNLL, 2007.

    J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Scholkopf. Correcting sample selection bias by unlabeled data. In NIPS, volume 19, pages 601–608. 2006.


  • References


    Kifer D., Ben-David S., Gehrke J. Detecting change in data streams. In: Proceedings of VLDB, pp 180–191. 2004.

    Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT. 2009.

    Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in NIPS, pages 1041-1048. 2009.

    Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of UAI. 2009.

    Mehryar Mohri and Andres Muñoz Medina. New analysis and algorithm for learning with drifting distributions. In Proceedings of ALT, volume 7568, pages 124-138. 2012.


  • References


    H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227– 244, 2000.

    M. Sugiyama, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS. 2008.

    M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699–746, 2008.
