
  • Advanced Machine Learning

    MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH.

    Domain Adaptation

  • Non-Ideal World

    [Diagram: the ideal learning scenario contrasted with the real world, where training and test data can differ through sampling, domain, and time.]

  • Outline

    Domain adaptation.

    Multiple-source domain adaptation.

  • Domain Adaptation

    Sentiment analysis.

    Language modeling, part-of-speech tagging.

    Statistical parsing.

    Speech recognition.

    Computer vision.

    A solution is critical for these applications.

  • This Talk

    Domain adaptation:

    • Discrepancy
    • Theoretical guarantees
    • Algorithm
    • Enhancements

  • Domain Adaptation Problem

    Domains: source $(Q, f_Q)$, target $(P, f_P)$.

    Input:

    • labeled sample $S$ drawn from the source.
    • unlabeled sample $T$ drawn from the target.

    Problem: find a hypothesis $h \in H$ with small expected loss with respect to the target domain, that is

    $$\mathcal{L}_P(h, f_P) = \mathbb{E}_{x \sim P}\big[L(h(x), f_P(x))\big].$$

  • Sample Bias Correction

    Problem: special case of domain adaptation with

    • $\mathrm{supp}(Q) \subseteq \mathrm{supp}(P)$.
    • $f_Q = f_P$.

  • Related Work in Theory

    Single-source adaptation:

    • relation between adaptation and the $d_A$ distance (Devroye et al., 1996; Kifer et al., 2004; Ben-David et al., 2007).

    • a few negative examples of adaptation (Ben-David et al., AISTATS 2010).

    • analysis and learning guarantees for importance weighting (Cortes, Mansour, and MM, NIPS 2010).

  • Related Work in Theory

    Multiple-source:

    • same input distribution, but different labels (Crammer et al., 2005, 2006).

    • theoretical analysis and method for multiple-source adaptation (Mansour, MM, Rostamizadeh, 2008).

  • Distribution Mismatch

    [Figure: two mismatched distributions $P$ and $Q$.]

    Which distance should we use to compare these distributions?

  • Simple Analysis

    Proposition: assume that the loss $L$ is bounded by $M$; then

    $$|\mathcal{L}_Q(h, f) - \mathcal{L}_P(h, f)| \le M\, L_1(Q, P).$$

    Proof:

    $$|\mathcal{L}_P(h, f) - \mathcal{L}_Q(h, f)| = \Big|\mathbb{E}_{x \sim P}\big[L(h(x), f(x))\big] - \mathbb{E}_{x \sim Q}\big[L(h(x), f(x))\big]\Big| = \Big|\sum_x \big(P(x) - Q(x)\big)\, L(h(x), f(x))\Big| \le M \sum_x \big|P(x) - Q(x)\big|.$$

    But is this bound informative?
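    As a quick numerical sanity check of this proposition, the sketch below compares $|\mathcal{L}_Q(h, f) - \mathcal{L}_P(h, f)|$ with $M \cdot L_1(Q, P)$ on a toy finite domain. The distributions, hypothesis, and labeling function are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)                      # toy finite domain
P = rng.dirichlet(np.ones(10))         # stand-in "target" distribution
Q = rng.dirichlet(np.ones(10))         # stand-in "source" distribution

f = (X >= 5).astype(float)             # made-up labeling function
h = (X >= 3).astype(float)             # made-up hypothesis
loss = np.abs(h - f)                   # 0/1 loss values, bounded by M = 1

LP = float(P @ loss)                   # L_P(h, f)
LQ = float(Q @ loss)                   # L_Q(h, f)
L1 = float(np.abs(P - Q).sum())        # L1(Q, P)
print(abs(LQ - LP), "<=", 1.0 * L1)    # |L_Q - L_P| <= M * L1(Q, P)
```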

  • Example - Zero-One Loss

    [Figure: a target function $f$ and a hypothesis $h$ that disagree exactly on a region $a$.]

    For the 0/1 loss,

    $$|\mathcal{L}_Q(h, f) - \mathcal{L}_P(h, f)| = |Q(a) - P(a)|.$$

  • Discrepancy

    Definition (Mansour, MM, Rostami, COLT 2009):

    $$\mathrm{disc}(P, Q) = \max_{h, h' \in H}\Big|\mathcal{L}_P(h, h') - \mathcal{L}_Q(h, h')\Big|.$$

    • symmetric, satisfies the triangle inequality, in general not a distance.
    • helps compare distributions for arbitrary losses, e.g. hinge loss, or $L_p$ loss.
    • generalization of the $d_A$ distance (Devroye et al., 1996; Kifer et al., 2004; Ben-David et al., 2007).
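    For a finite hypothesis set, the empirical discrepancy can be computed exactly by enumerating pairs $(h, h')$. The sketch below assumes threshold classifiers on the real line and the 0/1 loss; the data and hypothesis set are illustrative only.

```python
import itertools
import numpy as np

def empirical_discrepancy(XP, XQ, hypotheses, loss):
    """Brute-force empirical discrepancy for a finite hypothesis set:
    max over pairs (h, h') of |E_{P_hat}[L(h, h')] - E_{Q_hat}[L(h, h')]|."""
    best = 0.0
    for h, h2 in itertools.product(hypotheses, repeat=2):
        lp = np.mean([loss(h(x), h2(x)) for x in XP])
        lq = np.mean([loss(h(x), h2(x)) for x in XQ])
        best = max(best, abs(lp - lq))
    return best

# toy example: threshold classifiers on the line, 0/1 loss
thresholds = np.linspace(-2, 2, 21)
hyps = [lambda x, t=t: float(x > t) for t in thresholds]
zero_one = lambda a, b: float(a != b)

rng = np.random.default_rng(0)
XP = rng.normal(0.5, 1.0, size=200)    # unlabeled "target" sample
XQ = rng.normal(-0.5, 1.0, size=200)   # unlabeled "source" sample
print(empirical_discrepancy(XP, XQ, hyps, zero_one))
```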

  • Discrepancy - Properties

    Theorem: for an $L_q$ loss bounded by $M$, for any $\delta > 0$, with probability at least $1 - \delta$,

    $$\mathrm{disc}(P, Q) \le \mathrm{disc}(\widehat P, \widehat Q) + 4q\big(\widehat{\mathfrak{R}}_S(H) + \widehat{\mathfrak{R}}_T(H)\big) + 3M\Bigg(\sqrt{\frac{\log\frac{4}{\delta}}{2m}} + \sqrt{\frac{\log\frac{4}{\delta}}{2n}}\Bigg).$$

  • Discrepancy = Distance

    Theorem (Cortes & MM, TCS 2013): let $K$ be a universal kernel (e.g., Gaussian kernel) and $H = \{h \in H_K : \|h\|_K \le \Lambda\}$. Then, for the $L_2$ loss, the discrepancy is a distance over a compact set $X$.

    Proof: $\Psi\colon h \mapsto \mathbb{E}_{x \sim P}[h^2(x)] - \mathbb{E}_{x \sim Q}[h^2(x)]$ is continuous for the norm $\|\cdot\|_\infty$, thus continuous on $C(X)$.

    • $\mathrm{disc}(P, Q) = 0$ implies $\Psi(h) = 0$ for all $h \in H$.
    • since $H$ is dense in $C(X)$, $\Psi = 0$ over $C(X)$.
    • thus $\mathbb{E}_P[f] - \mathbb{E}_Q[f] = 0$ for all $f \ge 0$ in $C(X)$.
    • this implies $P = Q$.

  • Theoretical Guarantees

    Two types of questions:

    • difference between the average loss of a hypothesis $h$ on $P$ versus on $Q$?

    • difference of loss (measured on $P$) between the hypothesis $h'$ obtained when training on $(\widehat Q, f_Q)$ versus the hypothesis $h$ obtained when training on $(\widehat P, f_P)$?

  • Generalization Bound

    Notation:

    • $\mathcal{L}_Q(h^*_Q, f_Q) = \min_{h \in H} \mathcal{L}_Q(h, f_Q)$.
    • $\mathcal{L}_P(h^*_P, f_P) = \min_{h \in H} \mathcal{L}_P(h, f_P)$.

    Theorem (Mansour, MM, Rostami, COLT 2009): assume that $L$ obeys the triangle inequality; then the following holds:

    $$\mathcal{L}_P(h, f_P) \le \mathcal{L}_Q(h, h^*_Q) + \mathcal{L}_P(h^*_P, f_P) + \mathrm{disc}(P, Q) + \mathcal{L}_Q(h^*_Q, h^*_P).$$

  • Some Natural Cases

    When $h^* = h^*_Q = h^*_P$,

    $$\mathcal{L}_P(h, f_P) \le \mathcal{L}_Q(h, h^*) + \mathcal{L}_P(h^*, f_P) + \mathrm{disc}(P, Q).$$

    When $f_P \in H$ (consistent case),

    $$|\mathcal{L}_P(h, f_P) - \mathcal{L}_Q(h, f_P)| \le \mathrm{disc}(Q, P).$$

    • Bound of (Ben-David et al., NIPS 2006) or (Blitzer et al., NIPS 2007): always worse in these cases.

  • Regularized ERM Algorithms

    Objective function:

    $$F_{\widehat Q}(h) = \lambda \|h\|_K^2 + \widehat R_{\widehat Q}(h),$$

    where $K$ is a PDS kernel, $\lambda > 0$ is a trade-off parameter, and $\widehat R_{\widehat Q}(h)$ is the empirical error of $h$.

    • broad family of algorithms including SVM, SVR, kernel ridge regression, etc.
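    For the squared loss, this objective is kernel ridge regression and has a closed-form dual solution. A minimal sketch under illustrative assumptions (Gaussian kernel, synthetic data, empirical error taken as the mean squared error):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    """Regularized ERM with the squared loss (kernel ridge regression):
    minimizes lam * ||h||_K^2 + (1/m) * sum_i (h(x_i) - y_i)^2,
    whose dual solution solves (K + lam * m * I) alpha = y."""
    m = len(X)
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def krr_predict(alpha, X_train, X_test, gamma=1.0):
    return gaussian_kernel(X_test, X_train, gamma) @ alpha

# toy labeled source sample
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = krr_fit(X, y)
print(krr_predict(alpha, X, X[:5]))
```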

  • Guarantees for Reg. ERM

    Theorem (Cortes & MM, TCS 2013): let $K$ be a PDS kernel with $K(x, x) \le R^2$ and $L$ a loss function such that $L(\cdot, y)$ is $\mu$-Lipschitz. Let $h'$ be the minimizer of $F_{\widehat P}$ and $h$ that of $F_{\widehat Q}$. Then, for all $(x, y) \in X \times Y$,

    $$\big|L(h'(x), y) - L(h(x), y)\big| \le \mu R\, \sqrt{\frac{\mathrm{disc}(\widehat P, \widehat Q) + \mu\, \eta_H(f_P, f_Q)}{\lambda}},$$

    where

    $$\eta_H(f_P, f_Q) = \inf_{h \in H}\Big[\max_{x \in \mathrm{supp}(\widehat P)} |f_P(x) - h(x)| + \max_{x \in \mathrm{supp}(\widehat Q)} |f_Q(x) - h(x)|\Big].$$

  • Proof

    By the property of the minimizers, there exist subgradients such that

    $$2\lambda h' = -\delta \widehat R_{\widehat P}(h'), \qquad 2\lambda h = -\delta \widehat R_{\widehat Q}(h).$$

    Thus, using the convexity of the empirical errors,

    $$\begin{aligned}
    2\lambda \|h' - h\|^2 &= -\big\langle h' - h,\ \delta \widehat R_{\widehat P}(h') - \delta \widehat R_{\widehat Q}(h)\big\rangle \\
    &= -\big\langle h' - h,\ \delta \widehat R_{\widehat P}(h')\big\rangle + \big\langle h' - h,\ \delta \widehat R_{\widehat Q}(h)\big\rangle \\
    &\le \widehat R_{\widehat P}(h) - \widehat R_{\widehat P}(h') + \widehat R_{\widehat Q}(h') - \widehat R_{\widehat Q}(h) \\
    &\le 2\,\mathrm{disc}(\widehat P, \widehat Q) + 2\mu\, \eta_H(f_P, f_Q).
    \end{aligned}$$

  • Guarantees for Reg. ERM

    Theorem (Cortes & MM, TCS 2013): let $K$ be a PDS kernel with $K(x, x) \le R^2$ and let $L$ be the $L_2$ loss, bounded by $M$. Then, for all $(x, y)$,

    $$\big|L(h'(x), y) - L(h(x), y)\big| \le \frac{R\sqrt{M}}{\lambda}\Big(\delta + \sqrt{\delta^2 + 4\lambda\, \mathrm{disc}(\widehat P, \widehat Q)}\Big),$$

    where

    $$\delta = \min_{h \in H}\Big\|\, \mathbb{E}_{x \sim \widehat Q}\big[(h(x) - f_Q(x))\, K(x, \cdot)\big] - \mathbb{E}_{x \sim \widehat P}\big[(h(x) - f_P(x))\, K(x, \cdot)\big]\Big\|_K.$$

    For $f_P = f_Q = f$, $\delta = 0$:

    • if $f$ is $\epsilon$-close to $H$ on the samples.
    • for a Gaussian kernel and $f$ continuous.

  • Empirical Discrepancy

    The discrepancy measure $\mathrm{disc}(\widehat P, \widehat Q)$ is the critical term in the bounds.

    A smaller empirical discrepancy guarantees closeness of the pointwise losses of $h'$ and $h$.

    But can we further reduce the discrepancy?

  • Algorithm - Idea

    Search for a new empirical distribution $q^*$ with the same support:

    $$q^* = \operatorname*{argmin}_{\mathrm{supp}(q) \subseteq \mathrm{supp}(\widehat Q)} \mathrm{disc}(\widehat P, q).$$

    Solve the modified optimization problem:

    $$\min_h\ F_{q^*}(h) = \sum_{i=1}^m q^*(x_i)\, L(h(x_i), y_i) + \lambda \|h\|_K^2.$$
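    With the squared loss, the reweighted objective is still a kernel ridge regression problem; only the dual linear system changes. A minimal sketch, assuming the weights $q$ are given (computing $q^* = \operatorname{argmin}_q \mathrm{disc}(\widehat P, q)$ itself is the LP/SDP discussed on the following slides); the kernel and data are illustrative:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr_fit(X, y, q, lam=0.1, gamma=1.0):
    """Reweighted regularized ERM with the squared loss:
    minimizes sum_i q_i (h(x_i) - y_i)^2 + lam * ||h||_K^2.
    Stationarity gives the linear system (diag(q) K + lam I) alpha = diag(q) y."""
    K = gaussian_kernel(X, X, gamma)
    W = np.diag(q)
    return np.linalg.solve(W @ K + lam * np.eye(len(X)), W @ y)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
q = rng.dirichlet(np.ones(50))     # stand-in for the minimizer q* of disc(P_hat, q)
alpha = weighted_krr_fit(X, y, q)
print(alpha[:5])
```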

  • Case of Halfspaces

  • Min-Max Problem

    Reformulation:

    $$\widehat Q^* = \operatorname*{argmin}_{\widehat Q' \in \mathcal{Q}}\ \max_{h, h' \in H}\ \big|\mathcal{L}_{\widehat P}(h', h) - \mathcal{L}_{\widehat Q'}(h', h)\big|.$$

    • game theoretical interpretation.
    • gives lower bound:

    $$\max_{h, h' \in H}\ \min_{\widehat Q' \in \mathcal{Q}}\ \big|\mathcal{L}_{\widehat P}(h', h) - \mathcal{L}_{\widehat Q'}(h', h)\big| \;\le\; \min_{\widehat Q' \in \mathcal{Q}}\ \max_{h, h' \in H}\ \big|\mathcal{L}_{\widehat P}(h', h) - \mathcal{L}_{\widehat Q'}(h', h)\big|.$$

  • Classification - 0/1 Loss

    Problem:

    $$\min_{Q'}\ \max_{a \in H \Delta H}\ \big|\widehat Q'(a) - \widehat P(a)\big| \qquad \text{subject to } \forall x \in S_Q,\ \widehat Q'(x) \ge 0 \ \wedge \ \sum_{x \in S_Q} \widehat Q'(x) = 1.$$

  • Classification - 0/1 Loss

    Linear program (LP):

    $$\begin{aligned}
    \min_{Q', \delta}\ & \delta \\
    \text{subject to}\ & \forall a \in H \Delta H,\ \widehat Q'(a) - \widehat P(a) \le \delta \\
    & \forall a \in H \Delta H,\ \widehat P(a) - \widehat Q'(a) \le \delta \\
    & \forall x \in S_Q,\ \widehat Q'(x) \ge 0 \ \wedge\ \sum_{x \in S_Q} \widehat Q'(x) = 1.
    \end{aligned}$$

    • Number of constraints bounded by the shattering coefficient $\Pi_{H \Delta H}(m' + n')$.
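    The LP can be set up directly with a generic solver. A minimal sketch using scipy's linprog, assuming one-dimensional data and threshold classifiers, for which the regions of $H \Delta H$ are intervals between pairs of thresholds; all data and names are illustrative:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

SP = np.array([-1.5, -0.5, 0.0, 0.7, 1.2])      # sample from P_hat (uniform weights)
SQ = np.array([-2.0, -1.0, -0.3, 0.4, 2.0])     # support points for the new q'
thresholds = np.linspace(-2.5, 2.5, 11)

def region_indicator(x, t1, t2):
    # two threshold classifiers disagree exactly on the interval (t1, t2]
    return float(min(t1, t2) < x <= max(t1, t2))

regions = list(itertools.combinations(thresholds, 2))
P_hat = np.full(len(SP), 1.0 / len(SP))

# LP variables: [q'(x) for x in SQ] + [delta]
n = len(SQ)
A_ub, b_ub = [], []
for (t1, t2) in regions:
    a_q = np.array([region_indicator(x, t1, t2) for x in SQ])
    p_a = sum(P_hat[i] * region_indicator(x, t1, t2) for i, x in enumerate(SP))
    A_ub.append(np.append(a_q, -1.0)); b_ub.append(p_a)      # q'(a) - P(a) <= delta
    A_ub.append(np.append(-a_q, -1.0)); b_ub.append(-p_a)    # P(a) - q'(a) <= delta

A_eq, b_eq = [np.append(np.ones(n), 0.0)], [1.0]             # sum_x q'(x) = 1
c = np.append(np.zeros(n), 1.0)                              # minimize delta
bounds = [(0, None)] * (n + 1)
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[-1], res.x[:n])                                  # minimal discrepancy and q*
```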

  • Algorithm - 1D

  • Regression - L2 Loss

    Problem:

    $$\min_{\widehat Q' \in \mathcal{Q}}\ \max_{h, h' \in H}\ \Big|\operatorname*{\mathbb{E}}_{\widehat P}\big[(h'(x) - h(x))^2\big] - \operatorname*{\mathbb{E}}_{\widehat Q'}\big[(h'(x) - h(x))^2\big]\Big|.$$

    For linear hypotheses $h\colon x \mapsto \mathbf{w}^\top x$ with $\|\mathbf{w}\| \le 1$:

    $$\begin{aligned}
    \min_{\widehat Q' \in \mathcal{Q}}\ \max_{\|\mathbf{w}\| \le 1,\, \|\mathbf{w}'\| \le 1}\ & \Big|\operatorname*{\mathbb{E}}_{\widehat P}\big[((\mathbf{w}' - \mathbf{w})^\top x)^2\big] - \operatorname*{\mathbb{E}}_{\widehat Q'}\big[((\mathbf{w}' - \mathbf{w})^\top x)^2\big]\Big| \\
    = \min_{\widehat Q' \in \mathcal{Q}}\ \max_{\|\mathbf{w}\| \le 1,\, \|\mathbf{w}'\| \le 1}\ & \Big|\sum_{x \in S} \big(\widehat P(x) - \widehat Q'(x)\big)\big[(\mathbf{w}' - \mathbf{w})^\top x\big]^2\Big| \\
    = \min_{\widehat Q' \in \mathcal{Q}}\ \max_{\|\mathbf{u}\| \le 2}\ & \Big|\sum_{x \in S} \big(\widehat P(x) - \widehat Q'(x)\big)\big[\mathbf{u}^\top x\big]^2\Big| \\
    = \min_{\widehat Q' \in \mathcal{Q}}\ \max_{\|\mathbf{u}\| \le 2}\ & \Big|\mathbf{u}^\top \Big(\sum_{x \in S} \big(\widehat P(x) - \widehat Q'(x)\big)\, x x^\top\Big)\mathbf{u}\Big|.
    \end{aligned}$$

  • Regression - L2 Loss

    Problem equivalent to

    $$\min_{\|\mathbf{z}\|_1 = 1,\, \mathbf{z} \ge 0}\ \max_{\|\mathbf{u}\| = 1}\ \big|\mathbf{u}^\top M(\mathbf{z})\, \mathbf{u}\big|,$$

    with

    $$M(\mathbf{z}) = M_0 - \sum_{i=1}^{m'} z_i M_i, \qquad M_0 = \sum_{x \in S} \widehat P(x)\, x x^\top, \qquad M_i = s_i s_i^\top,$$

    where the $s_i$ are the elements of $\mathrm{supp}(\widehat Q)$.
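    For a fixed candidate $\mathbf{z}$, the inner maximization has a closed form: since $M(\mathbf{z})$ is symmetric, $\max_{\|\mathbf{u}\|=1} |\mathbf{u}^\top M(\mathbf{z})\, \mathbf{u}|$ is its largest eigenvalue in absolute value. A small numpy sketch on toy data (names and data are illustrative):

```python
import numpy as np

def discrepancy_l2_linear(z, SP_points, P_weights, SQ_points):
    """Evaluate max_{||u||=1} |u^T M(z) u| for the L2 loss with linear hypotheses.
    M(z) = sum_x P_hat(x) x x^T - sum_i z_i s_i s_i^T; the max over the unit sphere
    of |u^T M u| is the largest absolute eigenvalue of the symmetric matrix M(z)."""
    M0 = sum(p * np.outer(x, x) for p, x in zip(P_weights, SP_points))
    Mz = M0 - sum(zi * np.outer(s, s) for zi, s in zip(z, SQ_points))
    return np.max(np.abs(np.linalg.eigvalsh(Mz)))

# toy 2D example with uniform P_hat weights and a uniform candidate reweighting z
rng = np.random.default_rng(0)
SP = rng.normal(size=(100, 2)) + np.array([1.0, 0.0])
SQ = rng.normal(size=(80, 2))
z = np.full(len(SQ), 1.0 / len(SQ))
print(discrepancy_l2_linear(z, SP, np.full(len(SP), 1.0 / len(SP)), SQ))
```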

  • Regression - L2 Loss

    Semi-definite program (SDP), linear hypotheses:

    $$\begin{aligned}
    \min_{\mathbf{z}, \lambda}\ & \lambda \\
    \text{subject to}\ & \lambda I - M(\mathbf{z}) \succeq 0 \\
    & \lambda I + M(\mathbf{z}) \succeq 0 \\
    & \mathbf{1}^\top \mathbf{z} = 1 \ \wedge\ \mathbf{z} \ge 0,
    \end{aligned}$$

    where the matrix $M(\mathbf{z})$ is defined by:

    $$M(\mathbf{z}) = \sum_{x \in S} \widehat P(x)\, x x^\top - \sum_{i=1}^{m'} z_i\, s_i s_i^\top.$$
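    A minimal sketch of this discrepancy-minimization problem with cvxpy, written with the equivalent spectral-norm objective $\min_{\mathbf{z}} \|M(\mathbf{z})\|_2$ (which cvxpy compiles to an SDP); the deck does not prescribe a solver, and the toy data and uniform $\widehat P$ weights are assumptions:

```python
import cvxpy as cp
import numpy as np

# Toy 2D data; the sample arrays and uniform P_hat weights are illustrative.
rng = np.random.default_rng(0)
SP = rng.normal(size=(100, 2)) + np.array([1.0, 0.0])   # sample from P_hat
SQ = rng.normal(size=(80, 2))                           # support points s_i of Q_hat
m = len(SQ)

M0 = sum(np.outer(x, x) for x in SP) / len(SP)          # sum_x P_hat(x) x x^T
z = cp.Variable(m, nonneg=True)                         # candidate reweighting of supp(Q_hat)
Mz = cp.Constant(M0) - sum(z[i] * np.outer(SQ[i], SQ[i]) for i in range(m))

# min_z ||M(z)||_2 subject to z in the simplex; for symmetric M(z) the spectral
# norm equals max_{||u||=1} |u^T M(z) u|, i.e. the empirical discrepancy above.
prob = cp.Problem(cp.Minimize(cp.sigma_max(Mz)), [cp.sum(z) == 1])
prob.solve()                                            # requires an SDP-capable solver (e.g. SCS)
print(prob.value, z.value)                              # minimal discrepancy and weights q*
```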

  • Regression - L2 Loss

    SDP, generalization to an RKHS $H$ for some kernel $K$:

    $$\begin{aligned}
    \min_{\mathbf{z}, \lambda}\ & \lambda \\
    \text{subject to}\ & \lambda I - M(\mathbf{z}) \succeq 0 \\
    & \lambda I + M(\mathbf{z}) \succeq 0 \\
    & \mathbf{1}^\top \mathbf{z} = 1 \ \wedge\ \mathbf{z} \ge 0,
    \end{aligned}$$

    with:

    $$M(\mathbf{z}) = M_0 - \sum_{i=1}^{m'} z_i M_i, \qquad M_0 = \mathbf{K}^{1/2}\, \mathrm{diag}\big(\widehat P(s_1), \ldots, \widehat P(s_{p'})\big)\, \mathbf{K}^{1/2}, \qquad M_i = \mathbf{K}^{1/2} I_i\, \mathbf{K}^{1/2}.$$

  • Discrepancy Min. Algorithm

    Convex optimization (Cortes & MM, TCS 2013):

    • cast as a semi-definite programming (SDP) problem.
    • efficient solution using smooth optimization.

    Algorithm and solution for arbitrary kernels.

    Outperforms other algorithms in experiments.

  • Experiments

    Classification:

    • Q and P Gaussians.
    • H: halfspaces.
    • f: interval [-1, +1].

    [Plot: performance as a function of the number of training points, comparing "w/ min disc" and "w/ orig disc".]

  • Experiments

    Regression:

    [Plot: error as a function of the number of training points for training on $\widehat Q$, on the reweighted $\widehat Q'$, and on $P$.]

    SDP solved in about 15s using SeDuMi on a 3GHz CPU with 2GB memory.

  • Experiments

  • Enhancement

    Shortcomings:

    • discrepancy depends on the maximizing pair of hypotheses.
    • DM algorithm too conservative.

    Ideas (Cortes, MM, and Muñoz, 2014):

    • finer quantity: generalized discrepancy, hypothesis-dependent.
    • reweighting depending on the hypothesis.

  • Algorithm

    Choose $Q_h$ such that the objectives

    $$\lambda \|h\|_K^2 + \mathcal{L}_{Q_h}(h, f_Q) \qquad \text{and} \qquad \lambda \|h\|_K^2 + \mathcal{L}_{\widehat P}(h, f_P)$$

    are uniformly close (Cortes, MM, and Muñoz, 2014).

    Ideally:

    $$Q_h = \operatorname*{argmin}_{q}\ \big|\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat P}(h, f_P)\big|.$$

    Using a convex surrogate $H''$:

    $$Q_h = \operatorname*{argmin}_{q}\ \max_{h'' \in H''}\ \big|\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\widehat P}(h, h'')\big|.$$

  • Optimization

    $$\begin{aligned}
    \mathcal{L}_{Q_h}(h, f_Q) &= \operatorname*{argmin}_{l \in \{\mathcal{L}_q(h, f_Q)\,:\, q \in F(S_X, \mathbb{R})\}}\ \max_{h'' \in H''}\ \big|l - \mathcal{L}_{\widehat P}(h, h'')\big| \\
    &= \operatorname*{argmin}_{l \in \mathbb{R}}\ \max_{h'' \in H''}\ \big|l - \mathcal{L}_{\widehat P}(h, h'')\big| \\
    &= \frac{1}{2}\Big(\max_{h'' \in H''} \mathcal{L}_{\widehat P}(h, h'') + \min_{h'' \in H''} \mathcal{L}_{\widehat P}(h, h'')\Big).
    \end{aligned}$$

    Convex optimization problem (loss jointly convex):

    $$\min_h\ \lambda \|h\|_K^2 + \frac{1}{2}\Big(\max_{h'' \in H''} \mathcal{L}_{\widehat P}(h, h'') + \min_{h'' \in H''} \mathcal{L}_{\widehat P}(h, h'')\Big).$$

    (Cortes, MM, and Muñoz, 2014)

  • Convex Surrogate Hypothesis Set

    Choice of $H''$ among balls

    $$B(r) = \{h'' \in H \mid \mathcal{L}_q(h'', f_Q) \le r^p\}.$$

    Generalization bound proven to be more favorable than DM for some choices of the radius $r$.

    Radius $r$ chosen via cross-validation using a small amount of labeled data from the target.

    Further improvement of empirical results.

    (Cortes, MM, and Muñoz, 2014)

  • Conclusion

    Theory of adaptation based on the discrepancy:

    • key term in the analysis of adaptation and drifting.
    • discrepancy minimization algorithm DM.
    • compares favorably to other adaptation algorithms in experiments.

    Generalized discrepancy:

    • extension to hypothesis-dependent reweighting.
    • convex optimization problem.
    • further empirical improvements.

  • Outline

    Domain adaptation.

    Multiple-source domain adaptation.

  • Problem Formulation

    Given $k$ source distributions $D_1, \ldots, D_k$ and corresponding hypotheses $h_1, \ldots, h_k$; the target function $f$ is unknown.

    Each hypothesis performs well in its own domain: $\forall i,\ \mathcal{L}(D_i, h_i, f) \le \epsilon$.

    Notation: $\mathcal{L}(D_i, h_i, f) = \mathbb{E}_{x \sim D_i}\big[L(h_i(x), f(x))\big]$.

    The loss $L$ is assumed non-negative, bounded, convex and continuous.

  • Problem Formulation

    The unknown target distribution is a mixture of the input distributions:

    $$D_T(x) = \sum_{i=1}^k \lambda_i D_i(x).$$

    Task: choose a hypothesis combination that performs well on the target distribution.

    Convex combination rule:

    $$h_{\mathbf{z}}(x) = \sum_{i=1}^k z_i\, h_i(x).$$

    Distribution weighted combination rule:

    $$h_{\mathbf{z}}(x) = \sum_{i=1}^k \frac{z_i D_i(x)}{\sum_{j=1}^k z_j D_j(x)}\, h_i(x).$$
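    Both combination rules are straightforward to implement on a finite domain. The sketch below is illustrative (D[i, x] stands for $D_i(x)$); the smoothing parameter eta anticipates the refined rule $h^{\eta}_{\mathbf{z}}$ introduced a few slides later:

```python
import numpy as np

def convex_combination(z, h_vals):
    """h_z(x) = sum_i z_i h_i(x); h_vals[i, x] = h_i(x)."""
    return z @ h_vals

def distribution_weighted(z, D, h_vals, eta=0.0):
    """h_z(x) = sum_i [z_i D_i(x) + eta/k] / [sum_j z_j D_j(x) + eta] * h_i(x).
    eta = 0 gives the rule on this slide; eta > 0 gives the smoothed variant
    used later to handle points where all z_j D_j(x) vanish."""
    k = len(z)
    num = z[:, None] * D + eta / k               # (k, n)
    den = num.sum(axis=0, keepdims=True)
    weights = num / np.maximum(den, 1e-12)
    return (weights * h_vals).sum(axis=0)

# toy example: 2 domains over 4 points
D = np.array([[0.4, 0.4, 0.1, 0.1],
              [0.1, 0.1, 0.4, 0.4]])
h_vals = np.array([[1.0, 1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0, 0.0]])
z = np.array([0.5, 0.5])
print(convex_combination(z, h_vals))
print(distribution_weighted(z, D, h_vals))
```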

  • Known Target Distribution

    For some distributions, any convex combination performs poorly:

    • the base hypotheses have no error within their own domain.
    • yet any convex combination has error 1/2!

    Hypothesis output:

        f    h0   h1
    a   1    1    0
    b   0    1    0

    Distribution weights:

        DT   D0   D1
    a   0.5  1    0
    b   0.5  0    1

  • Main Results

    Thus, although convex combinations seem natural, they can perform very poorly.

    We will show that distribution weighted combinations seem to define the "right" combination rule.

    There exists a single "robust" distribution weighted hypothesis that does well for any target mixture:

    $$\forall f,\ \exists \mathbf{z},\ \forall \boldsymbol{\lambda},\quad \mathcal{L}(D_{\boldsymbol{\lambda}}, h_{\mathbf{z}}, f) \le \epsilon.$$

  • Known Target Distribution

    If the mixture distribution is known, the distribution weighted rule will always do well. Choose $\mathbf{z} = \boldsymbol{\lambda}$:

    $$h_{\boldsymbol{\lambda}}(x) = \sum_{i=1}^k \frac{\lambda_i D_i(x)}{\sum_{j=1}^k \lambda_j D_j(x)}\, h_i(x).$$

    Proof: by convexity of the loss,

    $$\mathcal{L}(D_T, h_{\boldsymbol{\lambda}}, f) = \sum_{x \in X} L(h_{\boldsymbol{\lambda}}(x), f(x))\, D_T(x) \le \sum_{x \in X} \sum_{i=1}^k \frac{\lambda_i D_i(x)}{D_T(x)}\, L(h_i(x), f(x))\, D_T(x) = \sum_{i=1}^k \lambda_i\, \mathcal{L}(D_i, h_i, f) \le \sum_{i=1}^k \lambda_i\, \epsilon = \epsilon.$$

  • Unknown Target Mixture

    Zero-sum game:

    • NATURE: selects a target distribution $D_i$.
    • LEARNER: selects a $\mathbf{z}$, i.e. a distribution weighted hypothesis $h_{\mathbf{z}}$.
    • Payoff: $\mathcal{L}(D_i, h_{\mathbf{z}}, f)$.
    • Already shown: the game value is at most $\epsilon$.

    Minimax theorem (modulo discretization of $\mathbf{z}$): there exists a mixture of distribution weighted hypotheses $\sum_j \mu_j h_{\mathbf{z}_j}$ that does well for any distribution mixture.

  • Balancing Losses

    Brouwer's Fixed Point theorem: for any compact, convex, non-empty set $A$ and any continuous function $f\colon A \to A$, there exists $x \in A$ such that $f(x) = x$.

    Notation: $\mathcal{L}_{\mathbf{z},i} := \mathcal{L}(D_i, h_{\mathbf{z}}, f)$.

    Define the mapping $\phi$ by:

    $$[\phi(\mathbf{z})]_i = \frac{z_i\, \mathcal{L}_{\mathbf{z},i}}{\sum_j z_j\, \mathcal{L}_{\mathbf{z},j}}.$$

    By the fixed point theorem (modulo continuity):

    $$\exists \mathbf{z}\colon\ \forall i,\ z_i = \frac{z_i\, \mathcal{L}_{\mathbf{z},i}}{\sum_j z_j\, \mathcal{L}_{\mathbf{z},j}} \quad \Longrightarrow \quad \forall i,\ \mathcal{L}_{\mathbf{z},i} = \sum_j z_j\, \mathcal{L}_{\mathbf{z},j} =: \gamma.$$
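    The argument above is an existence proof via Brouwer's theorem, not an algorithm. Purely for illustration, the sketch below iterates the mapping $\phi$ on a toy discrete problem with the squared loss and inspects whether the per-domain losses $\mathcal{L}_{\mathbf{z},i}$ become balanced; convergence of this naive iteration is an assumption here, not a guarantee:

```python
import numpy as np

def hz_predictions(z, D, h_vals):
    """Distribution weighted combination h_z over a finite domain.
    D: (k, n) array with D[i, x] = D_i(x); h_vals: (k, n) array of h_i(x)."""
    w = z[:, None] * D
    w = w / np.maximum(w.sum(axis=0, keepdims=True), 1e-12)
    return (w * h_vals).sum(axis=0)                       # shape (n,)

def per_domain_losses(z, D, h_vals, f_vals):
    pred = hz_predictions(z, D, h_vals)
    return (D * (pred - f_vals) ** 2).sum(axis=1)         # L(D_i, h_z, f), squared loss

def balance_weights(D, h_vals, f_vals, iters=500):
    k = D.shape[0]
    z = np.full(k, 1.0 / k)
    for _ in range(iters):
        L = per_domain_losses(z, D, h_vals, f_vals) + 1e-12
        z = z * L / np.dot(z, L)                          # the mapping phi
    return z, per_domain_losses(z, D, h_vals, f_vals)

# toy example: 2 domains over 3 points
D = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
h_vals = np.array([[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]])
f_vals = np.array([1.0, 0.5, 1.0])
z, losses = balance_weights(D, h_vals, f_vals)
print(z, losses)   # inspect whether the per-domain losses are roughly balanced
```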

  • Bounding Loss

    For the fixed point $\mathbf{z}$,

    $$\mathcal{L}(D_{\mathbf{z}}, h_{\mathbf{z}}, f) = \sum_{x \in X} L(h_{\mathbf{z}}(x), f(x)) \Big[\sum_{i=1}^k z_i D_i(x)\Big] = \sum_{i=1}^k z_i \sum_{x \in X} D_i(x)\, L(h_{\mathbf{z}}(x), f(x)) = \sum_{i=1}^k z_i\, \mathcal{L}_{\mathbf{z},i} = \sum_{i=1}^k z_i\, \gamma = \gamma.$$

    Also, by convexity,

    $$\gamma = \mathcal{L}(D_{\mathbf{z}}, h_{\mathbf{z}}, f) \le \sum_{x \in X} \sum_{i=1}^k \frac{z_i D_i(x)}{D_{\mathbf{z}}(x)}\, L(h_i(x), f(x))\, D_{\mathbf{z}}(x) = \sum_{i=1}^k z_i\, \mathcal{L}(D_i, h_i, f) \le \epsilon.$$

  • Bounding Loss

    Thus, $\mathcal{L}(D_i, h_{\mathbf{z}}, f) = \mathcal{L}_{\mathbf{z},i} = \gamma \le \epsilon$ for every $i$, and for any mixture $D_{\boldsymbol{\lambda}} = \sum_i \lambda_i D_i$,

    $$\mathcal{L}(D_{\boldsymbol{\lambda}}, h_{\mathbf{z}}, f) = \sum_{i=1}^k \lambda_i\, \mathcal{L}(D_i, h_{\mathbf{z}}, f) \le \sum_{i=1}^k \lambda_i\, \gamma = \gamma \le \epsilon.$$
  • pageAdvanced Machine Learning - Mohri@

    DetailsTo deal with non-continuity refine hypotheses: 




    Theorem: for any target function and any , 


    If loss obeys triangle inequality:

    53

    holds for all admissible target functions.

    f �>0

    �� > 0, z : ��,L(D�, h�z , f) � � + �.

    �� > 0, �z, � > 0, ��, f � F , L(D�, h�z , f) � 3� + �.

    h�z(x) =k�

    i=1

    ziDi(x) + �/k�kj=1 zjDj(x) + �

    hi(x).

  • A Simple Algorithm

    A simple constructive choice: take $\mathbf{z}$ with uniform weights ($z_i = 1/k$), giving $h_u$. Then, for any mixture $D_{\boldsymbol{\lambda}}$,

    $$\begin{aligned}
    \mathcal{L}(D_{\boldsymbol{\lambda}}, h_u, f) &= \sum_x D_{\boldsymbol{\lambda}}(x)\, L\Big(\sum_{i=1}^k \frac{D_i(x)}{\sum_{j=1}^k D_j(x)}\, h_i(x),\ f(x)\Big) \\
    &= \sum_x \Big[\sum_{m=1}^k \lambda_m D_m(x)\Big]\, L\Big(\sum_{i=1}^k \frac{D_i(x)}{\sum_{j=1}^k D_j(x)}\, h_i(x),\ f(x)\Big) \\
    &\le \sum_x \underbrace{\frac{\sum_{m=1}^k \lambda_m D_m(x)}{\sum_{j=1}^k D_j(x)}}_{\le 1}\ \sum_{i=1}^k D_i(x)\, L(h_i(x), f(x)) \\
    &\le \sum_{i=1}^k \sum_x D_i(x)\, L(h_i(x), f(x)) = \sum_{i=1}^k \mathcal{L}(D_i, h_i, f) \le k\epsilon.
    \end{aligned}$$

  • Preliminary Empirical Results

    Sentiment analysis: given a product review (a text string), predict a rating (between 1.0 and 5.0).

    4 domains: Books, DVDs, Electronics, and Kitchen Appliances.

    Base hypotheses are trained within each domain (Support Vector Regression).

    We are not given the distributions; we model each distribution using a bag-of-words model.

    We then test the distribution combination rule on known target mixture domains.

  • Empirical Results

    [Plot: MSE on the uniform mixture over the 4 domains, comparing in-domain and out-domain predictors.]

  • Empirical Results

    [Plot: MSE as a function of $\alpha$ for the target mixture $\alpha\,\text{book} + (1 - \alpha)\,\text{kitchen}$, comparing the distribution weighted rule, the linear (convex) combination, and the individual book and kitchen predictors.]

  • Conclusion

    Formulation of the multiple-source adaptation problem.

    Theoretical analysis for mixture distributions.

    Efficient algorithm for finding the distribution weighted combination hypothesis?

    Beyond mixture distributions?

  • Rényi Divergences

    Definition: for $\alpha \ge 0$,

    $$D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)\Big[\frac{P(x)}{Q(x)}\Big]^{\alpha - 1}.$$

    • $\alpha = 1$: coincides with the relative entropy.
    • $\alpha = 2$: logarithm of the expected probability ratio,

    $$D_2(P \| Q) = \log \operatorname*{\mathbb{E}}_{x \sim P}\Big[\frac{P(x)}{Q(x)}\Big].$$

    • $\alpha = +\infty$: logarithm of the maximum probability ratio,

    $$D_{\infty}(P \| Q) = \log \sup_{x \in \mathrm{supp}(P)} \frac{P(x)}{Q(x)}.$$
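    A small sketch computing $D_\alpha(P \| Q)$ for discrete distributions, together with the exponentiated form $d_\alpha(P \| Q) = \exp(D_\alpha(P \| Q))$ that appears in the multiple-source bound on the next slide:

```python
import numpy as np

def renyi_divergence(P, Q, alpha):
    """Rényi divergence D_alpha(P || Q) for discrete distributions (alpha > 0, alpha != 1)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    return np.log(np.sum(P[mask] * (P[mask] / Q[mask]) ** (alpha - 1))) / (alpha - 1)

def d_alpha(P, Q, alpha):
    """Exponentiated form d_alpha(P || Q) = [sum_x P^alpha(x) / Q^{alpha-1}(x)]^{1/(alpha-1)},
    i.e. exp(D_alpha(P || Q)), used in the multiple-source adaptation bound."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    return np.sum(P[mask] ** alpha / Q[mask] ** (alpha - 1)) ** (1.0 / (alpha - 1))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
for a in (1.5, 2.0, 10.0):
    print(a, renyi_divergence(P, Q, a), d_alpha(P, Q, a))
```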

  • Extensions - Arbitrary Target

    Theorem (Mansour, MM, and Rostami, 2009): for any $\delta > 0$ and $\alpha > 1$, with closeness to the family of mixtures $\mathcal{Q} = \big\{\sum_{i=1}^k \lambda_i D_i : \boldsymbol{\lambda} \in \Delta_k\big\}$ measured in terms of the Rényi divergence,

    $$\exists \eta, \mathbf{z}\colon\ \forall P,\quad \mathcal{L}(P, h^{\eta}_{\mathbf{z}}, f) \le \big[d_{\alpha}(P \| \mathcal{Q})\,(\epsilon + \delta)\big]^{\frac{\alpha - 1}{\alpha}}\, M^{\frac{1}{\alpha}},$$

    where

    $$d_{\alpha}(P \| Q) = \Big[\sum_x \frac{P^{\alpha}(x)}{Q^{\alpha - 1}(x)}\Big]^{\frac{1}{\alpha - 1}}.$$

  • Proof

    By Hölder's inequality, for any hypothesis $h$,

    $$\begin{aligned}
    \mathcal{L}(P, h, f) &= \sum_x \frac{P(x)}{Q^{\frac{\alpha-1}{\alpha}}(x)}\; Q^{\frac{\alpha-1}{\alpha}}(x)\, L(h(x), f(x)) \\
    &\le \Big[\sum_x \frac{P^{\alpha}(x)}{Q^{\alpha-1}(x)}\Big]^{\frac{1}{\alpha}} \Big[\sum_x Q(x)\, L^{\frac{\alpha}{\alpha-1}}(h(x), f(x))\Big]^{\frac{\alpha-1}{\alpha}} \\
    &= \big(d_{\alpha}(P \| Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\operatorname*{\mathbb{E}}_{x \sim Q}\big[L^{\frac{\alpha}{\alpha-1}}(h(x), f(x))\big]\Big]^{\frac{\alpha-1}{\alpha}} \\
    &= \big(d_{\alpha}(P \| Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\operatorname*{\mathbb{E}}_{x \sim Q}\big[L(h(x), f(x))\, L^{\frac{1}{\alpha-1}}(h(x), f(x))\big]\Big]^{\frac{\alpha-1}{\alpha}} \\
    &\le \big(d_{\alpha}(P \| Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\mathcal{L}(Q, h, f)\, M^{\frac{1}{\alpha-1}}\Big]^{\frac{\alpha-1}{\alpha}}.
    \end{aligned}$$

  • Other Extensions

    Approximate (estimated) distributions:

    • similar results shown, depending on the divergence between the true and the estimated distributions.

    Different source target functions $f_i$:

    • similar results when the target functions $f_i$ are close to $f$ on the target distribution.

    (Mansour, MM, and Rostami, 2009)

  • References


    S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS. 2007.

    S. Ben-David, T. Lu, T. Luu, and D. Pal. Impossibility theorems for domain adaptation. Journal of Machine Learning Research-Proceedings Track, 9:129–136, 2010.

    S. Bickel, M. Bruckner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10:2137–2155, 2009.

    J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS. 2008.

    Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In NIPS. 2010.

    Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014.


  • References


    Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Proceedings of ALT. 2008.

    Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from Data of Variable Quality. In Proceedings of NIPS, 2006.

    Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. In Proceedings of NIPS, 2007.

    Devroye, L., Gyorfi, L., and Lugosi, G. (1996). A probabilistic theory of pattern recognition. Springer.

    M. Dredze, J. Blitzer, P. P. Talukdar, K. Ganchev, J. Graca, and F. Pereira. Frustratingly Hard Domain Adaptation for Parsing. In CoNLL, 2007.

    J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Scholkopf. Correcting sample selection bias by unlabeled data. In NIPS, volume 19, pages 601–608. 2006.


  • References


    Kifer D., Ben-David S., Gehrke J. Detecting change in data streams. In: Proceedings of VLDB, pp 180–191. 2004.

    Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT. 2009.

    Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in NIPS, pages 1041-1048. 2009.

    Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of UAI. 2009.

    Mehryar Mohri and Andres Muñoz Medina. New analysis and algorithm for learning with drifting distributions. In Proceedings of ALT, volume 7568, pages 124-138. 2012.


  • References


    H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227– 244, 2000.

    M. Sugiyama, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS. 2008.

    M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699–746, 2008.
