domain adaptation with multiple sources yishay mansour, tel aviv univ. & google mehryar mohri,...

Domain Adaptation with Multiple Sources

Yishay Mansour, Tel Aviv Univ. & Google

Mehryar Mohri, NYU & Google

Afshin Rostami, NYU


Adaptation – motivation

• High level:
  – The ability to generalize from one domain to another
• Significance:
  – Basic human property
  – Essential in most learning environments
  – Implicit in many applications.


Adaptation - examples

• Sentiment analysis:
  – Users leave reviews
    • products, sellers, movies, …
  – Goal: score reviews as positive or negative.
  – Adaptation example:
    • Learn for restaurants and airlines
    • Generalize to hotels


Adaptation - examples

• Speech recognition
  – Adaptation:
    • Learn a few accents
    • Generalize to new accents
      – think “foreign accents”.


Adaptation and generalization

• Machine Learning prediction:
  – Learn from examples drawn from distribution D
  – Predict the label of unseen examples
    • drawn from the same distribution D
  – Generalization within a distribution
• Adaptation:
  – Predict the label of unseen examples
    • drawn from a different distribution D’
  – Generalization across distributions


Adaptation – Related Work

• Learn from D and test on D’
  – relating the increase in error to dist(D,D’)
    • Ben-David et al. (2006), Blitzer et al. (2007)
• Single distribution, varying label quality
  • Crammer et al. (2005, 2006)


Our Model - input

• Target function: f
• Distributions: D1, …, Dk
• Hypotheses: h1, …, hk
• Expected loss: L(D1,h1,f) ≤ ε, …, L(Dk,hk,f) ≤ ε
• Typical loss function: L(a,b) = |a-b| and L(D,h,f) = Ex~D[ |f(x)-h(x)| ]


Our Model – target distribution

• Basic distributions: D1, …, Dk
• Mixture weights: λ1, …, λk
• Target distribution Dλ:

$$D_\lambda(x) = \sum_{i=1}^{k} \lambda_i D_i(x)$$


Our model – Combination Rule

• Combine h1, …, hk to a hypothesis h*
  – Low expected loss
    • hopefully at most ε
• Combining rules:
  – let z: Σ zi = 1 and zi ≥ 0
  – linear: h*(x) = Σ zi hi(x)
  – distribution weighted:

$$h_z(x) = \sum_{i=1}^{k} \frac{z_i D_i(x)}{\sum_{j=1}^{k} z_j D_j(x)}\, h_i(x)$$

[Diagram: h1, …, hk → combining rule]
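The two combining rules can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the slides: a finite domain, hypotheses and distributions stored as arrays (`H[i, x]` for hi(x), `D[i, x]` for Di(x)), hypothetical toy numbers, and an arbitrary value of 0 wherever the distribution weighted rule has a zero denominator.

```python
import numpy as np

def linear_rule(z, H):
    """Linear rule h*(x) = sum_i z_i h_i(x); H[i, x] holds h_i(x)."""
    return z @ H

def distribution_weighted_rule(z, D, H):
    """Distribution weighted rule h_z(x) = sum_i z_i D_i(x) h_i(x) / sum_j z_j D_j(x).

    D[i, x] holds D_i(x). Where the denominator is zero the rule is undefined;
    this sketch returns 0 there (an arbitrary choice, not from the slides).
    """
    w = z[:, None] * D                      # w[i, x] = z_i D_i(x)
    num = (w * H).sum(axis=0)
    den = w.sum(axis=0)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Toy instance (hypothetical numbers): k = 2 sources over 3 points.
D = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
z = np.array([0.5, 0.5])

h_lin = linear_rule(z, H)                   # same weights at every x
h_dw = distribution_weighted_rule(z, D, H)  # weights depend on x through D_i(x)
```

The linear rule applies the same weights z at every point, while the distribution weighted rule lets the source that is more likely to have produced x dominate the prediction at x.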


Combining Rules – Pros

• Alternative: Build a dataset for the mixture.
  – Learning the mixture parameters is non-trivial
  – Combined data set might be huge
  – Domain dependent data unavailable
• Sometimes only classifiers are given/exist
  – privacy
• MOST IMPORTANT: FUNDAMENTAL THEORY QUESTION


Our Results:

• Linear combining rule:
  – Seems like the first thing to try
  – Can be very bad
    • Simple settings where any linear combining rule performs badly.


Our Results:

• Distribution weighted combining rules:
  – Given the mixture parameter λ:
    • there is a good distribution weighted combining rule.
    • expected loss at most ε
  – For any target function f,
    • there is a good distribution weighted combining rule hz
    • expected loss at most ε
  – Extension for multiple “consistent” target functions
    • expected loss at most 3ε
• OUTCOME: This is the “right” hypothesis class


Linear combining rules

X    f    h1   h0
a    1    1    0
b    0    1    0

X    D    Da   Db
a    ½    1    0
b    ½    0    1

Original loss: ε = 0 !!!

Any linear combining rule h has expected absolute loss ½
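The claim for this example can be checked numerically. The sketch below assumes the reading of the tables in which h1 is constantly 1, h0 is constantly 0, Da and Db are point masses on a and b, and the target is D = ½Da + ½Db; the array layout and the function name `expected_loss` are illustrative choices.

```python
import numpy as np

# Domain X = {a, b}, stored in that order.
f = np.array([1.0, 0.0])     # f(a) = 1, f(b) = 0
h1 = np.array([1.0, 1.0])    # h1 is constantly 1
h0 = np.array([0.0, 0.0])    # h0 is constantly 0
Da = np.array([1.0, 0.0])    # point mass on a
Db = np.array([0.0, 1.0])    # point mass on b
D = 0.5 * Da + 0.5 * Db      # target mixture

def expected_loss(dist, h, target):
    """L(D, h, f) = E_{x~D} |f(x) - h(x)| on a finite domain."""
    return float(np.sum(dist * np.abs(h - target)))

# Each base hypothesis is perfect on its own distribution (epsilon = 0).
base_losses = (expected_loss(Da, h1, f), expected_loss(Db, h0, f))

# Any linear rule z*h1 + (1-z)*h0 is the constant z, so its loss on D is
# 0.5*|1 - z| + 0.5*|0 - z| = 0.5 for every z in [0, 1].
linear_losses = [expected_loss(D, z * h1 + (1 - z) * h0, f)
                 for z in np.linspace(0.0, 1.0, 11)]
```

Every entry of `linear_losses` is ½ regardless of z, even though both base hypotheses have zero loss on their own distributions.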


Distribution weighted combining rule

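• Target distribution – a mixture: Dλ(x) = Σ λi Di(x)
• Set z = λ:

$$h_\lambda(x) = \sum_{i=1}^{k} \frac{\lambda_i D_i(x)}{\sum_{j=1}^{k} \lambda_j D_j(x)}\, h_i(x) = \sum_{i=1}^{k} \frac{\lambda_i D_i(x)}{D_\lambda(x)}\, h_i(x)$$

• Claim: L(Dλ,hλ,f) ≤ ε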


Distribution weighted combining rule

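PROOF:

$$
\begin{aligned}
L(D_\lambda, h_\lambda, f) &= \sum_x D_\lambda(x)\, L(h_\lambda(x), f(x)) \\
&\le \sum_x D_\lambda(x) \sum_{i=1}^{k} \frac{\lambda_i D_i(x)}{D_\lambda(x)}\, L(h_i(x), f(x)) \\
&= \sum_{i=1}^{k} \lambda_i \sum_x D_i(x)\, L(h_i(x), f(x)) \\
&= \sum_{i=1}^{k} \lambda_i\, L(D_i, h_i, f) \le \sum_{i=1}^{k} \lambda_i\, \varepsilon = \varepsilon
\end{aligned}
$$

The inequality in the second line uses convexity of L in its first argument: hλ(x) is a convex combination of the hi(x).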


Back to the bad example

X    f    h1   h0
a    1    1    0
b    0    1    0

X    D    Da   Db
a    ½    1    0
b    ½    0    1

Original loss: ε = 0 !!!

hλ(x):
  x = a: hλ(a) = h1(a) = 1
  x = b: hλ(b) = h0(b) = 0
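The same example can be run through the distribution weighted rule with z = λ = (½, ½). This is a minimal sketch assuming the table reading above (h1 constantly 1 and paired with Da, h0 constantly 0 and paired with Db); the array names are illustrative.

```python
import numpy as np

# Domain X = {a, b}, stored in that order.
f = np.array([1.0, 0.0])
H = np.array([[1.0, 1.0],     # h1, accurate on Da
              [0.0, 0.0]])    # h0, accurate on Db
Dk = np.array([[1.0, 0.0],    # Da: point mass on a
               [0.0, 1.0]])   # Db: point mass on b
lam = np.array([0.5, 0.5])    # mixture parameter

w = lam[:, None] * Dk                         # w[i, x] = lambda_i D_i(x)
h_lam = (w * H).sum(axis=0) / w.sum(axis=0)   # distribution weighted rule with z = lambda
D_lam = lam @ Dk                              # target mixture D_lambda
loss = float(np.sum(D_lam * np.abs(h_lam - f)))
```

At x = a only Da has mass, so the rule returns h1(a) = 1; at x = b only Db has mass, so it returns h0(b) = 0. The resulting loss on the mixture is 0, in contrast to ½ for every linear rule.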


Unknown mixture distribution

• Zero-sum game:
  – NATURE: selects a distribution Di
  – LEARNER: selects a z
    • hypothesis hz
  – Payoff: L(Di,hz,f)
• Restating the previous result:
  – For any mixed action λ of NATURE
  – LEARNER has a pure action z = λ
    • such that the expected loss is at most ε


Unknown mixture distribution

• Consequence:
  – LEARNER has a mixed action (over z’s)
  – for any mixed action λ of NATURE
    • a mixture distribution Dλ
  – The loss is at most ε
• Challenge:
  – show a specific hypothesis hz
    • pure, not mixed, action


Searching for a good hypothesis

• Uniformly good hypothesis hz:
  – for any Di we have L(Di,hz,f) ≤ ε
• Assume all the hi are identical
  – Extremely lucky and unlikely case
• If we have a good hypothesis we are done!
  – L(Dλ,hz,f) = Σ λi L(Di,hz,f) ≤ Σ λi ε = ε
• We need to show in general a good hz !
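The step L(Dλ,h,f) = Σ λi L(Di,h,f) holds for any fixed hypothesis h because the expected loss is linear in the distribution. A short derivation, assuming a discrete domain so that expectations are sums:

```latex
\begin{aligned}
L(D_\lambda, h, f)
 &= \sum_x D_\lambda(x)\, L(h(x), f(x)) \\
 &= \sum_x \sum_{i=1}^{k} \lambda_i D_i(x)\, L(h(x), f(x)) \\
 &= \sum_{i=1}^{k} \lambda_i \sum_x D_i(x)\, L(h(x), f(x))
  = \sum_{i=1}^{k} \lambda_i\, L(D_i, h, f)
\end{aligned}
```

So a hypothesis with loss at most ε on every Di has loss at most ε on every mixture of the Di.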


Proof Outline:

• Balancing the losses:
  – Show that some hz has identical loss on any Di
  – uses Brouwer Fixed Point Theorem
    • holds very generally
• Bounding the losses:
  – Show this hz has low loss for some mixture
    • specifically Dz


[Figure: a compact and convex set A with a continuous mapping φ: A→A]

Brouwer Fixed Point Theorem: For any convex and compact set A and any continuous mapping φ: A→A, there exists a point x in A such that φ(x) = x


Balancing Losses

A = { z : Σi zi = 1 and zi ≥ 0 }

$$[\varphi(z)]_i = \frac{z_i\, L(D_i, h_z, f)}{\sum_{j=1}^{k} z_j\, L(D_j, h_z, f)}$$

Problem 1: Need to get φ continuous


Balancing Losses

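A = { z : Σi zi = 1 and zi ≥ 0 }

$$[\varphi(z)]_i = \frac{z_i\, L(D_i, h_z, f)}{\sum_{j=1}^{k} z_j\, L(D_j, h_z, f)}$$

Fixed point: z = φ(z)

$$z_i = \frac{z_i\, L(D_i, h_z, f)}{\sum_{j=1}^{k} z_j\, L(D_j, h_z, f)}$$

$$z_i \sum_{j=1}^{k} z_j\, L(D_j, h_z, f) = z_i\, L(D_i, h_z, f)$$

Problem 2: Needs that zi ≠ 0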


Bounding the losses

• We can guarantee balanced losses even for a linear combining rule!

X    f    h1   h0
a    1    1    0
b    0    1    0

X    D    Da   Db
a    ½    1    0
b    ½    0    1

For z = (½, ½) we have L(Da,hz,f) = ½ and L(Db,hz,f) = ½


Bounding Losses

• Consider the previous z
  – from Brouwer fixed point theorem
• Consider the mixture Dz
  – Expected loss is at most ε
• Also: L(Dz,hz,f) = Σ zj L(Dj,hz,f) = γ
• Conclusion:
  – For any mixture expected loss at most γ ≤ ε


Solving the problems:

• Redefine the distribution weighted rule:

$$h_{z,\eta}(x) = \sum_{i=1}^{k} \frac{z_i D_i(x) + \eta/k}{\sum_{j=1}^{k} z_j D_j(x) + \eta}\, h_i(x)$$

• Claim: For any distribution D, L(D,hz,η,f) is continuous in z.
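A minimal sketch of the smoothed rule, assuming it has the form hz,η(x) = Σi (zi Di(x) + η/k) / (Σj zj Dj(x) + η) · hi(x) on a finite domain. The function name `smoothed_rule` and the two-point example are illustrative. The point is that the rule stays well defined on the boundary of the simplex, where the unsmoothed rule would divide 0 by 0.

```python
import numpy as np

def smoothed_rule(z, D, H, eta):
    """Return (h, weights): h[x] = sum_i weights[i, x] * H[i, x], where
    weights[i, x] = (z_i D_i(x) + eta/k) / (sum_j z_j D_j(x) + eta)."""
    k = len(z)
    w = z[:, None] * D                                # w[i, x] = z_i D_i(x)
    weights = (w + eta / k) / (w.sum(axis=0) + eta)   # each column sums to 1
    return (weights * H).sum(axis=0), weights

# Two-point domain {a, b}: Da, Db are point masses, h1 = 1 and h0 = 0 everywhere.
D = np.array([[1.0, 0.0], [0.0, 1.0]])
H = np.array([[1.0, 1.0], [0.0, 0.0]])

# z on the boundary of the simplex: no mass reaches b, so the unsmoothed
# weights at b would be 0/0. With eta > 0 they fall back to uniform (1/k).
z = np.array([1.0, 0.0])
h, weights = smoothed_rule(z, D, H, eta=0.1)
```

With η > 0 the denominator is never zero, which is what makes the loss continuous in z over the whole simplex.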


Main Theorem

For any target function f and any δ>0, there exist η>0 and z such that for any λ we have L(Dλ,hz,η,f) ≤ ε+δ


Balancing Losses

• The set A = { z : Σ zi = 1 and zi ≥ 0 }
  – The simplex
• The mapping φ with parameters η and η’
  – [φ(z)]i = (zi Li,z + η’/k) / (Σ zj Lj,z + η’)
    • where Li,z = L(Di,hz,η,f)
• For some z in A we have φ(z) = z
  – zi = (zi Li,z + η’/k) / (Σ zj Lj,z + η’) > 0
  – Li,z = (Σ zj Lj,z) + η’ - η’/(zi k) < (Σ zj Lj,z) + η’
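The mapping φ can be evaluated directly on a small instance. The sketch below assumes a finite domain, the absolute loss, and a smoothed rule of the form (zi Di(x) + η/k) / (Σj zj Dj(x) + η); the helper names and the two-point example (point masses Da, Db with h1 constantly 1 and h0 constantly 0) are illustrative. Brouwer's theorem only guarantees that a fixed point exists, so the code verifies a fixed point that is known by symmetry rather than searching for one.

```python
import numpy as np

def smoothed_rule(z, D, H, eta):
    """h_{z,eta}(x) with weights (z_i D_i(x) + eta/k) / (sum_j z_j D_j(x) + eta)."""
    k = len(z)
    w = z[:, None] * D
    return (((w + eta / k) / (w.sum(axis=0) + eta)) * H).sum(axis=0)

def source_losses(z, D, H, f, eta):
    """L_{i,z} = L(D_i, h_{z,eta}, f) under the absolute loss, for every i."""
    return D @ np.abs(smoothed_rule(z, D, H, eta) - f)

def phi(z, D, H, f, eta, eta_p):
    """[phi(z)]_i = (z_i L_{i,z} + eta'/k) / (sum_j z_j L_{j,z} + eta')."""
    k = len(z)
    L = source_losses(z, D, H, f, eta)
    return (z * L + eta_p / k) / (z @ L + eta_p)

f = np.array([1.0, 0.0])
D = np.array([[1.0, 0.0], [0.0, 1.0]])   # Da, Db
H = np.array([[1.0, 1.0], [0.0, 0.0]])   # h1, h0
eta, eta_p = 0.1, 0.05

z = np.array([0.5, 0.5])                 # symmetric point of the simplex
L = source_losses(z, D, H, f, eta)       # the two losses coincide by symmetry
z_next = phi(z, D, H, f, eta, eta_p)     # phi maps z to itself here
```

Here z = (½, ½) satisfies φ(z) = z, every zi is positive, and each Li,z stays below Σ zj Lj,z + η’, as in the bullet above.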


Bounding Losses

• Consider the previous z
  – from Brouwer fixed point theorem
• Consider the mixture Dz
  – Expected loss is at most ε+η
• By definition Σ zj Lj,z = L(Dz,hz,η,f)
• Conclusion: γ = Σ zj Lj,z ≤ ε+η


Putting it together

• There exists (z,η) such that:
  – Expected loss of hz,η approximately balanced
  – L(Di,hz,η,f) ≤ γ+η’
• Bounding γ using Dz
  – γ = L(Dz,hz,η,f) ≤ ε+η
• For any mixture Dλ
  – L(Dλ,hz,η,f) ≤ ε+η+η’


A more general model

• So far: NATURE first fixes target function f
• Consistent target functions f
  – the expected loss w.r.t. Di is at most ε
    • for any of the k distributions
• Function class F = { f : f is consistent }
• New model:
  – LEARNER picks a hypothesis h
  – NATURE picks f in F and mixture Dλ
  – Loss L(Dλ,h,f)
• RESULT: L(Dλ,h,f) ≤ 3ε.


Uniform Algorithm

• Hypothesis: set z = (1/k, …, 1/k):

$$h_u(x) = \sum_{i=1}^{k} \frac{(1/k)\, D_i(x)}{\sum_{j=1}^{k} (1/k)\, D_j(x)}\, h_i(x) = \sum_{i=1}^{k} \frac{D_i(x)}{\sum_{j=1}^{k} D_j(x)}\, h_i(x)$$

• Performance:
  – For any mixture, expected error ≤ kε
  – There exists a mixture with expected error Ω(kε)
  – For k=2, there exists a mixture with expected error 2ε-ε²
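One way to see the kε upper bound for the uniform rule hu (the distribution weighted rule with z = (1/k, …, 1/k)) is sketched below. The derivation is a reconstruction, not taken from the slides, and assumes a discrete domain and a loss L that is convex in its first argument, as the absolute loss is. The second inequality uses Dλ(x) ≤ Σj Dj(x), which holds because every λj ≤ 1.

```latex
\begin{aligned}
L(D_\lambda, h_u, f)
 &= \sum_x D_\lambda(x)\, L(h_u(x), f(x)) \\
 &\le \sum_x D_\lambda(x) \sum_{i=1}^{k} \frac{D_i(x)}{\sum_{j=1}^{k} D_j(x)}\, L(h_i(x), f(x)) \\
 &\le \sum_{i=1}^{k} \sum_x D_i(x)\, L(h_i(x), f(x)) \\
 &= \sum_{i=1}^{k} L(D_i, h_i, f) \le k\varepsilon
\end{aligned}
```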


Open Problem

• Find a uniformly good hypothesis
  – efficiently !!!
• Algorithmic issues:
  – Search over the z’s
  – Multiple local minima.


Empirical Results

• Data-set of sentiment analysis:
  – good product takes a little time to start operating very good for the price a little trouble using it inside ca
  – it rocks man this is the rockinest think i've ever seen or buyed dudes check it ou
  – does not retract agree with the prior reviewers i can not get it to retract any longer and that was only after 3 uses
  – dont buy not worth a cent got it at walmart can't even remove a scuff i give it 100 good thing i could return it
  – flash drive excelent hard drive good price and good time for seller thanks


Empirical analysis

• Multiple domains:
  – dvd, books, electronics, kitchen appliances.
• Language model:
  – build a model for each domain
    • unlike the theory, this is an additional error source
• Tested on mixture distribution
  – known mixture parameters
• Target: score (1-5)
  – error: Mean Squared Error (MSE)


[Figure: linear vs. distribution weighted combining rules; domains: books, dvd, electronics, kitchen]


Summary

• Adaptation model
  – combining rules
    • linear
    • distribution weighted
• Theoretical analysis
  – mixture distribution
• Future research
  – algorithms for combining rules
  – beyond mixtures


Adaptation – Our Model

• Input:
  – target function: f
  – k distributions D1, …, Dk
  – k hypotheses: h1, …, hk
  – For every i: L(Di,hi,f) ≤ ε
    • where L(D,h,f) defines the expected loss
    • think L(D,h,f) = Ex~D[ |f(x)-h(x)| ]
