
Domain Adaptation with Multiple Sources

Yishay Mansour, Tel Aviv Univ. & Google

Mehryar Mohri, NYU & Google

Afshin Rostamizadeh, NYU


Adaptation – motivation

• High level:
  – The ability to generalize from one domain to another

• Significance:
  – Basic human property
  – Essential in most learning environments
  – Implicit in many applications


Adaptation - examples

• Sentiment analysis:
  – Users leave reviews
    • products, sellers, movies, …
  – Goal: score reviews as positive or negative
  – Adaptation example:
    • Learn from restaurant and airline reviews
    • Generalize to hotel reviews


Adaptation - examples

• Speech recognition
  – Adaptation:
    • Learn a few accents
    • Generalize to new accents – think “foreign accents”


Adaptation and generalization

• Machine learning prediction:
  – Learn from examples drawn from a distribution D
  – Predict the label of unseen examples
    • drawn from the same distribution D
  – Generalization within a distribution

• Adaptation:
  – Predict the label of unseen examples
    • drawn from a different distribution D’
  – Generalization across distributions


Adaptation – Related Work

• Learn from D and test on D’
  – relating the increase in error to dist(D, D’)
  – Ben-David et al. (2006), Blitzer et al. (2007)

• Single distribution with varying label quality
  – Crammer et al. (2005, 2006)


Our Model - input

[Diagram: a target function f; distributions D1, …, Dk; hypotheses h1, …, hk; with L(D1, h1, f) ≤ ε, …, L(Dk, hk, f) ≤ ε]

Expected loss:
– Typical loss function: L(a,b) = |a−b| and L(D,h,f) = Ex~D[ |f(x)−h(x)| ]
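For concreteness, here is a minimal Python sketch of the expected-loss notation above over a finite domain; the toy distribution, hypothesis, and dict/lambda encoding are assumptions for illustration, not code from the talk.

```python
# A minimal sketch of the expected loss L(D, h, f) = E_{x~D}[ |f(x) - h(x)| ]
# over a finite domain. The concrete values below are made up for illustration.

def expected_loss(D, h, f):
    """D: dict mapping point -> probability; h, f: functions point -> value."""
    return sum(D[x] * abs(f(x) - h(x)) for x in D)

# Toy two-point domain {a, b}, mirroring the style of the examples used later.
D1 = {"a": 1.0, "b": 0.0}               # a source distribution concentrated on a
f = lambda x: 1 if x == "a" else 0      # target function
h1 = lambda x: 1                        # a hypothesis that is perfect on D1

print(expected_loss(D1, h1, f))         # 0.0, so L(D1, h1, f) <= eps holds with eps = 0
```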


Our Model – target distribution

[Diagram: basic distributions D1, …, Dk mixed with weights λ1, …, λk into the target distribution Dλ]

Dλ(x) = Σi=1..k λi Di(x)


Our model – Combination Rule

• Combine h1, …, hk into a hypothesis h*
  – Low expected loss
    • hopefully at most ε

• Combining rules (see the sketch below):
  – let z satisfy Σ zi = 1 and zi ≥ 0
  – linear: h*(x) = Σi zi hi(x)
  – distribution weighted:

    hz(x) = Σi=1..k [ zi Di(x) / Σj=1..k zj Dj(x) ] hi(x)

[Diagram: h1, …, hk fed into a combining rule]
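As a concrete sketch, the two combining rules can be written as follows in Python; the helper names and the dict-based encoding of distributions are assumptions for illustration, not code from the talk.

```python
# Sketch of the two combining rules above over a finite domain.
# Distributions are dicts x -> probability, hypotheses are functions x -> value.

def linear_rule(z, hypotheses):
    """h*(x) = sum_i z_i * h_i(x), with z on the simplex."""
    return lambda x: sum(zi * h(x) for zi, h in zip(z, hypotheses))

def distribution_weighted_rule(z, distributions, hypotheses):
    """h_z(x) = sum_i [ z_i D_i(x) / sum_j z_j D_j(x) ] * h_i(x)."""
    def h_z(x):
        denom = sum(zj * Dj[x] for zj, Dj in zip(z, distributions))
        if denom == 0:                 # x outside the support of every weighted source
            return 0.0
        return sum(zi * Di[x] / denom * h(x)
                   for zi, Di, h in zip(z, distributions, hypotheses))
    return h_z
```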


Combining Rules – Pros

• Alternative: build a dataset for the mixture
  – Learning the mixture parameters is non-trivial
  – The combined data set might be huge
  – Domain-dependent data may be unavailable

• Sometimes only the classifiers are given/exist
  – e.g., for privacy reasons

• MOST IMPORTANT:

FUNDAMENTAL THEORY QUESTION


Our Results:

• Linear combining rule:
  – Seems like the first thing to try
  – Can be very bad
    • There are simple settings where any linear combining rule performs badly.


Our Results:

• Distribution weighted combining rules:
  – Given the mixture parameter λ:
    • there is a good distribution weighted combining rule
    • expected loss at most ε
  – For any target function f:
    • there is a good distribution weighted combining rule hz
    • expected loss at most ε
  – Extension to multiple “consistent” target functions:
    • expected loss at most 3ε

• OUTCOME: this is the “right” hypothesis class


Linear combining rules

  x | f | h1 | h0
  a | 1 | 1  | 0
  b | 0 | 1  | 0

  x | D | Da | Db
  a | ½ | 1  | 0
  b | ½ | 0  | 1

Original loss: ε = 0 !!!

Any linear combining rule h has expected absolute loss ½ (see the worked check below).
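To fill in the arithmetic behind this claim: since h1 ≡ 1 and h0 ≡ 0 in the table above, any linear rule h*(x) = z1·h1(x) + z0·h0(x) with z1 + z0 = 1 and z1, z0 ≥ 0 outputs the constant z1. Under D its expected absolute loss is ½·|1 − z1| + ½·|0 − z1| = ½(1 − z1) + ½·z1 = ½, whatever z is chosen.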


Distribution weighted combining rule

• Target distribution – a mixture: Dλ(x)=Σ λi Di(x)

• Set z=λ :

• Claim: L(Dλ,hλ,f) ≤ ε

hλ(x) = Σi=1..k [ λi Di(x) / Σj=1..k λj Dj(x) ] hi(x) = Σi=1..k [ λi Di(x) / Dλ(x) ] hi(x)


Distribution weighted combining rule

PROOF (using convexity of the loss in its first argument):

L(Dλ, hλ, f) = Σx Dλ(x) · L(hλ(x), f(x))
            ≤ Σx Dλ(x) · Σi=1..k [ λi Di(x) / Dλ(x) ] · L(hi(x), f(x))
            = Σi=1..k λi Σx Di(x) · L(hi(x), f(x))
            = Σi=1..k λi L(Di, hi, f)
            ≤ Σi=1..k λi ε = ε


Back to the bad example

  x | f | h1 | h0
  a | 1 | 1  | 0
  b | 0 | 1  | 0

  x | D | Da | Db
  a | ½ | 1  | 0
  b | ½ | 0  | 1

Original loss: ε = 0 !!!

The distribution weighted rule h+ with z = (½, ½):
  x = a: h+(a) = h1(a) = 1
  x = b: h+(b) = h0(b) = 0
so h+ matches f exactly and L(D, h+, f) = 0 (a small numerical check follows below).
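The dict-based toy encoding below is an assumption made for illustration; it just reproduces the table numerically.

```python
# Numerical check of the two-point example above.
Da = {"a": 1.0, "b": 0.0}
Db = {"a": 0.0, "b": 1.0}
D  = {"a": 0.5, "b": 0.5}                 # D = 1/2 Da + 1/2 Db
f  = {"a": 1, "b": 0}                     # target function
h1 = {"a": 1, "b": 1}                     # perfect on Da
h0 = {"a": 0, "b": 0}                     # perfect on Db

def loss(D, h, f):                        # L(D, h, f) = E_{x~D}[ |f(x) - h(x)| ]
    return sum(D[x] * abs(f[x] - h[x]) for x in D)

def h_linear(z1):                         # any linear rule z1*h1 + (1 - z1)*h0
    return {x: z1 * h1[x] + (1 - z1) * h0[x] for x in D}

def h_plus():                             # distribution weighted rule with z = (1/2, 1/2)
    out = {}
    for x in D:
        w1 = 0.5 * Da[x] / (0.5 * Da[x] + 0.5 * Db[x])
        out[x] = w1 * h1[x] + (1 - w1) * h0[x]
    return out

print(loss(D, h_linear(0.3), f))          # 0.5, and the same for any z1
print(loss(D, h_plus(), f))               # 0.0: h+ matches f exactly
```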


Unknown mixture distribution

• Zero-sum game:
  – NATURE selects a distribution Di
  – LEARNER selects a z
    • i.e., a hypothesis hz
  – Payoff: L(Di, hz, f)

• Restating the previous result:
  – For any mixed action λ of NATURE
  – LEARNER has a pure action z = λ
    • such that the expected loss is at most ε


Unknown mixture distribution

• Consequence (by the minimax theorem):
  – LEARNER has a mixed action (over z’s) such that
  – for any mixed action λ of NATURE
    • i.e., a mixture distribution Dλ
  – the loss is at most ε

• Challenge:
  – show a specific hypothesis hz
    • a pure, not mixed, action


Searching for a good hypothesis

• Uniformly good hypothesis hz:
  – for every Di we have L(Di, hz, f) ≤ ε

• Example: if all the hi are identical, any z works
  – an extremely lucky and unlikely case

• If we have a uniformly good hypothesis we are done!
  – L(Dλ, hz, f) = Σ λi L(Di, hz, f) ≤ Σ λi ε = ε

• We need to show that a good hz exists in general!


Proof Outline:

• Balancing the losses:
  – Show that some hz has identical loss on every Di
  – Uses the Brouwer Fixed Point Theorem
    • holds very generally

• Bounding the losses:
  – Show this hz has low loss for some mixture
    • specifically Dz


A: compact and convex set
φ: A → A continuous mapping

Brouwer Fixed Point Theorem: for any convex and compact set A and any continuous mapping φ: A → A, there exists a point x in A such that φ(x) = x.


Balancing Losses

A = { z : Σi zi = 1 and zi ≥ 0 }

[φ(z)]i = zi L(Di, hz, f) / Σj=1..k zj L(Dj, hz, f)

Problem 1: need φ to be continuous


Balancing Losses

A = { z : Σi zi = 1 and zi ≥ 0 }

[φ(z)]i = zi L(Di, hz, f) / Σj=1..k zj L(Dj, hz, f)

Fixed point: z = φ(z) (a small iteration sketch follows below)

  zi = zi L(Di, hz, f) / Σj=1..k zj L(Dj, hz, f)

  ⇒ L(Di, hz, f) = Σj=1..k zj L(Dj, hz, f)   (whenever zi ≠ 0)

Problem 2: needs that zi ≠ 0
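The following is an illustrative sketch, not the talk’s algorithm: it simply iterates the mapping φ to look for a z whose source losses are approximately balanced. Brouwer’s theorem only guarantees that a fixed point exists, not that this iteration converges, and the later slides smooth the mapping with η, η′ terms to keep every zi strictly positive.

```python
# Heuristic sketch: iterate [phi(z)]_i = z_i L_i(z) / sum_j z_j L_j(z) on the simplex,
# looking for a z whose source losses L(D_i, h_z, f) are (approximately) balanced.
# Plain iteration is not guaranteed to converge; it only illustrates the object sought.

def balance_losses(loss_per_source, k, steps=1000):
    """loss_per_source(z) -> [L(D_1, h_z, f), ..., L(D_k, h_z, f)]."""
    z = [1.0 / k] * k                       # start at the uniform point of the simplex
    for _ in range(steps):
        L = loss_per_source(z)
        denom = sum(zj * Lj for zj, Lj in zip(z, L))
        if denom == 0:                      # all losses are zero: trivially balanced
            break
        z = [zi * Li / denom for zi, Li in zip(z, L)]
    return z
```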


Bounding the losses

• We can guarantee balanced losses even for a linear combining rule!

  x | f | h1 | h0
  a | 1 | 1  | 0
  b | 0 | 1  | 0

  x | D | Da | Db
  a | ½ | 1  | 0
  b | ½ | 0  | 1

For z = (½, ½) we have L(Da, hz, f) = ½ and L(Db, hz, f) = ½
  – so balanced losses alone do not guarantee low loss


Bounding Losses

• Consider the previous z
  – from the Brouwer fixed point theorem (balanced losses)

• Consider the mixture Dz
  – Expected loss is at most ε (by the earlier result for z = λ)

• Also: L(Dz, hz, f) = Σ zj L(Dj, hz, f) = γ

• Conclusion:
  – For any mixture Dλ, the expected loss is L(Dλ, hz, f) = Σ λi L(Di, hz, f) = γ ≤ ε


Solving the problems:

• Redefine the distribution weighted rule:

  hz,η(x) = Σi=1..k [ (zi Di(x) + η/k) / (Σj=1..k zj Dj(x) + η) ] hi(x)

• Claim: for any distribution D, L(D, hz,η, f) is continuous in z.
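As a sketch, the smoothed rule can be written as follows; the helper name and dict-based encoding are assumptions for illustration, not code from the talk.

```python
# Smoothed distribution weighted rule h_{z,eta}: the eta/k and eta terms keep the
# weights well defined and continuous in z even where all z_i D_i(x) vanish.

def smoothed_weighted_rule(z, distributions, hypotheses, eta):
    k = len(z)
    def h(x):
        denom = sum(zj * Dj[x] for zj, Dj in zip(z, distributions)) + eta
        return sum((zi * Di[x] + eta / k) / denom * hi(x)
                   for zi, Di, hi in zip(z, distributions, hypotheses))
    return h
```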


Main Theorem

For any target function f and any δ > 0, there exist η > 0 and z such that, for any mixture parameter λ,

  L(Dλ, hz,η, f) ≤ ε + δ


Balancing Losses

• The set A = { z : Σ zi = 1 and zi ≥ 0 }
  – the simplex

• The mapping φ with parameters η and η′:
  – [φ(z)]i = (zi Li,z + η′/k) / (Σj zj Lj,z + η′)
    • where Li,z = L(Di, hz,η, f)

• For some z in A we have φ(z) = z
  – zi = (zi Li,z + η′/k) / (Σj zj Lj,z + η′) > 0
  – Li,z = (Σj zj Lj,z) + η′ − η′/(zi k) < (Σj zj Lj,z) + η′


Bounding Losses

• Consider the previous z
  – from the Brouwer fixed point theorem

• Consider the mixture Dz
  – Expected loss is at most ε + η

• By definition, Σj zj Lj,z = L(Dz, hz,η, f)

• Conclusion: γ = Σj zj Lj,z ≤ ε + η


Putting it together

• There exists (z, η) such that:
  – the expected loss of hz,η is approximately balanced
  – L(Di, hz,η, f) ≤ γ + η′

• Bounding γ using Dz:
  – γ = L(Dz, hz,η, f) ≤ ε + η

• For any mixture Dλ:
  – L(Dλ, hz,η, f) ≤ ε + η + η′


A more general model

• So far: NATURE first fixes the target function f

• Consistent target functions f:
  – the expected loss of hi w.r.t. Di is at most ε
    • for each of the k distributions
  – Function class F = { f : f is consistent }

• New model:
  – LEARNER picks a hypothesis h
  – NATURE picks f in F and a mixture Dλ
  – Loss: L(Dλ, h, f)

• RESULT: L(Dλ, h, f) ≤ 3ε


Uniform Algorithm

• Hypothesis: set z = (1/k, …, 1/k) (see the sketch below):

  hu(x) = Σi=1..k [ (1/k) Di(x) / Σj=1..k (1/k) Dj(x) ] hi(x) = Σi=1..k [ Di(x) / Σj=1..k Dj(x) ] hi(x)

• Performance:
  – For any mixture, expected error ≤ kε
  – There exists a mixture with expected error Ω(kε)
  – For k = 2, there exists a mixture with expected error 2ε − ε²
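In code, the uniform rule is simply the distribution weighted rule with uniform weights; a minimal self-contained sketch, with assumed helper names and dict-based distributions:

```python
# Uniform combining rule h_u: distribution weighted rule with z = (1/k, ..., 1/k),
# which simplifies to the weights D_i(x) / sum_j D_j(x).

def uniform_rule(distributions, hypotheses):
    def h_u(x):
        denom = sum(Dj[x] for Dj in distributions)
        if denom == 0:                     # x outside the support of every source
            return 0.0
        return sum(Di[x] / denom * h(x)
                   for Di, h in zip(distributions, hypotheses))
    return h_u
```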


Open Problem

• Find a uniformly good hypothesis
  – efficiently!!!

• Algorithmic issues (a naive search sketch follows below):
  – Search over the z’s
  – Multiple local minima
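The following is an assumed naive baseline, not an answer to the open question: a brute-force grid search over the simplex for a z minimizing the worst-case source loss, exponential in k and only meant to illustrate the search problem.

```python
# Naive sketch: grid search over the simplex for z minimizing max_i L(D_i, h_z, f).
import itertools

def grid_search_z(loss_per_source, k, resolution=10):
    """loss_per_source(z) -> [L(D_1, h_z, f), ..., L(D_k, h_z, f)]."""
    best_z, best_worst = None, float("inf")
    for counts in itertools.product(range(resolution + 1), repeat=k):
        if sum(counts) != resolution:
            continue                        # keep only points on the simplex
        z = [c / resolution for c in counts]
        worst = max(loss_per_source(z))
        if worst < best_worst:
            best_z, best_worst = z, worst
    return best_z, best_worst
```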


Empirical Results

• Sentiment analysis data set – sample reviews (raw text):
  – good product takes a little time to start operating very good for the price a little trouble using it inside ca
  – it rocks man this is the rockinest think i've ever seen or buyed dudes check it ou
  – does not retract agree with the prior reviewers i can not get it to retract any longer and that was only after 3 uses
  – dont buy not worth a cent got it at walmart can't even remove a scuff i give it 100 good thing i could return it
  – flash drive excelent hard drive good price and good time for seller thanks


Empirical analysis

• Multiple domains:
  – dvd, books, electronics, kitchen appliances

• Language model:
  – build a model for each domain
    • unlike in the theory, this is an additional error source

• Tested on a mixture distribution
  – known mixture parameters

• Target: score (1–5)
  – error: Mean Squared Error (MSE), as sketched below
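A minimal sketch of the evaluation criterion under a known mixture; the dict-based encoding and helper name are assumptions, and the talk’s actual pipeline uses per-domain language models rather than explicit distributions.

```python
# Mean squared error of a combined predictor h under a known mixture lam
# of the domain distributions, on a finite set of points.

def mse_under_mixture(lam, distributions, h, f):
    """lam: mixture weights; distributions: list of dicts x -> prob; h, f: x -> score."""
    points = set().union(*distributions)          # union of the supports
    return sum(
        sum(li * Di.get(x, 0.0) for li, Di in zip(lam, distributions)) * (f(x) - h(x)) ** 2
        for x in points
    )
```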


[Plots: MSE of the linear vs. distribution weighted combining rules on mixtures of the books, dvd, electronics, and kitchen domains]


Summary

• Adaptation model
  – combining rules
    • linear
    • distribution weighted

• Theoretical analysis
  – mixture distributions

• Future research
  – algorithms for combining rules
  – beyond mixtures


Adaptation – Our Model

• Input:
  – target function: f
  – k distributions: D1, …, Dk
  – k hypotheses: h1, …, hk
  – For every i: L(Di, hi, f) ≤ ε
    • where L(D, h, f) denotes the expected loss
    • think L(D, h, f) = Ex~D[ |f(x)−h(x)| ]
