Post on 21-Dec-2015
Domain Adaptation with Multiple Sources
Yishay Mansour, Tel Aviv Univ. & Google
Mehryar Mohri, NYU & Google
Afshin Rostami, NYU
3
Adaptation – motivation
• High level:
  – The ability to generalize from one domain to another
• Significance:
  – Basic human property
  – Essential in most learning environments
  – Implicit in many applications
4
Adaptation - examples
• Sentiment analysis:
  – Users leave reviews
    • products, sellers, movies, …
  – Goal: score reviews as positive or negative.
  – Adaptation example:
    • Learn for restaurants and airlines
    • Generalize to hotels
5
Adaptation - examples
• Speech recognition
  – Adaptation:
    • Learn a few accents
    • Generalize to new accents
      – think “foreign accents”.
6
Adaptation and generalization
• Machine Learning prediction:
  – Learn from examples drawn from distribution D
  – Predict the label of unseen examples
    • drawn from the same distribution D
  – Generalization within a distribution
• Adaptation:
  – Predict the label of unseen examples
    • drawn from a different distribution D’
  – Generalization across distributions
7
Adaptation – Related Work
• Learn from D and test on D’
  – relating the increase in error to dist(D,D’)
    • Ben-David et al. (2006), Blitzer et al. (2007)
• Single distribution, varying label quality
  – Crammer et al. (2005, 2006)
9
Our Model - input
• target function: f
• distributions: D1, …, Dk
• hypotheses: h1, …, hk
• Expected loss: L(D1,h1,f) ≤ ε, …, L(Dk,hk,f) ≤ ε
• Typical loss function: L(a,b) = |a-b| and L(D,h,f) = Ex~D[ |f(x)-h(x)| ]
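A minimal sketch of the expected loss above, for a distribution over a finite domain. The dictionary representation and the toy values of `f`, `h`, and `D` are assumptions made for illustration; they do not come from the slides.

```python
def expected_loss(D, h, f):
    """L(D, h, f) = E_{x~D}[ |f(x) - h(x)| ] for a distribution D over a finite domain.

    D maps each point x to its probability; h and f are callables.
    """
    return sum(p * abs(f(x) - h(x)) for x, p in D.items())


# Hypothetical toy domain {0, 1, 2, 3}.
def f(x):
    """Target function: 1 on {2, 3}, 0 elsewhere."""
    return 1.0 if x >= 2 else 0.0


def h(x):
    """Hypothesis that agrees with f everywhere except at x = 3."""
    return 1.0 if x == 2 else 0.0


D = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}

# Only x = 3 contributes to the loss: D(3) * |f(3) - h(3)|.
loss = expected_loss(D, h, f)
```

With these toy values the hypothesis satisfies L(D,h,f) ≤ ε for any ε ≥ 0.1.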
10
Our Model – target distribution
• basic distributions: D1, …, Dk
• mixture parameters: λ1, …, λk
• target distribution Dλ:
$$D_\lambda(x) = \sum_{i=1}^{k} \lambda_i D_i(x)$$
11
Our model – Combination Rule
• Combine h1, … , hk to a hypothesis h*
  – Low expected loss
    • hopefully at most ε
• Combining rules:
  – let z: Σ zi = 1 and zi ≥ 0
  – linear: h*(x) = Σ zi hi(x)
  – distribution weighted:
$$h_z(x) = \sum_{i=1}^{k} \frac{z_i D_i(x)}{\sum_{j=1}^{k} z_j D_j(x)}\, h_i(x)$$
(diagram: h1, …, hk → combining rule)
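The two combining rules can be sketched as follows, assuming finite-domain distributions stored as dictionaries. The function names and the two-source example are hypothetical, and returning 0.0 where Σ zj Dj(x) = 0 is a choice made only for this sketch.

```python
def linear_rule(z, hs):
    """Linear rule: h*(x) = sum_i z_i h_i(x)."""
    return lambda x: sum(zi * hi(x) for zi, hi in zip(z, hs))


def distribution_weighted_rule(z, Ds, hs):
    """Distribution weighted rule:
    h_z(x) = sum_i [z_i D_i(x) / sum_j z_j D_j(x)] h_i(x).

    Ds is a list of dicts mapping points to probabilities.
    Assumption of this sketch: return 0.0 where the denominator vanishes.
    """
    def h_z(x):
        weights = [zi * Di.get(x, 0.0) for zi, Di in zip(z, Ds)]
        total = sum(weights)
        if total == 0.0:
            return 0.0
        return sum(w * hi(x) for w, hi in zip(weights, hs)) / total
    return h_z


# Hypothetical two-source example on the domain {"a", "b"}.
Ds = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
hs = [lambda x: 1.0, lambda x: 0.0]
z = [0.5, 0.5]

h_lin = linear_rule(z, hs)
h_dw = distribution_weighted_rule(z, Ds, hs)
```

Here `h_lin` weights both hypotheses equally at every point, while `h_dw` leans toward the hypothesis whose source distribution puts more mass on the queried point.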
12
Combining Rules – Pros
• Alternative: build a dataset for the mixture.
  – Learning the mixture parameters is non-trivial
  – Combined dataset might be huge
  – Domain-dependent data may be unavailable
• Sometimes only classifiers are given/exist
  – privacy
• MOST IMPORTANT: FUNDAMENTAL THEORY QUESTION
13
Our Results:
• Linear combining rule:
  – Seems like the first thing to try
  – Can be very bad
    • Simple settings where any linear combining rule performs badly.
14
Our Results:
• Distribution weighted combining rules:
  – Given the mixture parameter λ:
    • there is a good distribution weighted combining rule
    • expected loss at most ε
  – For any target function f:
    • there is a good distribution weighted combining rule hz
    • expected loss at most ε
  – Extension for multiple “consistent” target functions
    • expected loss at most 3ε
• OUTCOME: This is the “right” hypothesis class
16
Linear combining rules
X    f    h1   h0
a    1    1    0
b    0    1    0

     D    Da   Db
a    ½    1    0
b    ½    0    1

Original Loss: ε=0 !!!
Any linear combining rule h has expected absolute loss ½ on D
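A quick numerical check of this example, sketched in Python with dictionaries for f, h1, h0, and D. It evaluates the linear rule z·h1 + (1−z)·h0 on a grid of z values; the grid is an arbitrary choice.

```python
# The example above: f(a)=1, f(b)=0; h1 is constantly 1, h0 is constantly 0.
f = {"a": 1.0, "b": 0.0}
h1 = {"a": 1.0, "b": 1.0}
h0 = {"a": 0.0, "b": 0.0}
D = {"a": 0.5, "b": 0.5}  # target mixture of Da and Db


def linear_loss(z):
    """Expected absolute loss on D of the linear rule z*h1 + (1-z)*h0."""
    return sum(D[x] * abs(f[x] - (z * h1[x] + (1 - z) * h0[x])) for x in D)


# Sweep z over a grid in [0, 1]; the loss does not depend on z.
losses = [linear_loss(i / 10) for i in range(11)]
```

Every entry of `losses` equals ½ up to rounding, because ½·|1−z| + ½·|z| = ½ for any z in [0, 1].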
17
Distribution weighted combining rule
• Target distribution – a mixture: Dλ(x)=Σ λi Di(x)
• Set z=λ:
$$h_\lambda(x) = \sum_{i=1}^{k} \frac{\lambda_i D_i(x)}{\sum_{j=1}^{k} \lambda_j D_j(x)}\, h_i(x) = \sum_{i=1}^{k} \frac{\lambda_i D_i(x)}{D_\lambda(x)}\, h_i(x)$$
• Claim: L(Dλ,hλ,f) ≤ ε
18
Distribution weighted combining rule
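PROOF:
$$
\begin{aligned}
L(D_\lambda, h_\lambda, f) &= \sum_{x} D_\lambda(x)\, L(h_\lambda(x), f(x)) \\
&\le \sum_{x} D_\lambda(x) \sum_{i=1}^{k} \frac{\lambda_i D_i(x)}{D_\lambda(x)}\, L(h_i(x), f(x)) \\
&= \sum_{i=1}^{k} \lambda_i \sum_{x} D_i(x)\, L(h_i(x), f(x)) \\
&= \sum_{i=1}^{k} \lambda_i\, L(D_i, h_i, f) \le \sum_{i=1}^{k} \lambda_i\, \varepsilon = \varepsilon
\end{aligned}
$$
The inequality uses the convexity of L in its first argument: hλ(x) is a convex combination of the hi(x).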
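The claim L(Dλ,hλ,f) ≤ ε can also be checked numerically. The sketch below draws a random instance (sizes, seed, and helper names are arbitrary assumptions), takes ε to be the largest source loss, and compares it with the loss of hλ on Dλ under the absolute loss.

```python
import random

random.seed(0)
k, n = 3, 6  # hypothetical sizes: k sources, n domain points


def random_distribution(m):
    """Random distribution over m points with strictly positive mass."""
    w = [random.random() + 0.01 for _ in range(m)]
    s = sum(w)
    return [wi / s for wi in w]


f = [random.random() for _ in range(n)]                       # target values
Ds = [random_distribution(n) for _ in range(k)]               # source distributions
hs = [[random.random() for _ in range(n)] for _ in range(k)]  # source hypotheses


def loss(D, h):
    """Expected absolute loss of h with respect to D and the target f."""
    return sum(D[x] * abs(f[x] - h[x]) for x in range(n))


# Smallest eps for which L(Di, hi, f) <= eps holds for every source.
eps = max(loss(Ds[i], hs[i]) for i in range(k))

lam = random_distribution(k)  # known mixture parameter
D_lam = [sum(lam[i] * Ds[i][x] for i in range(k)) for x in range(n)]
# Distribution weighted rule with z = lam.
h_lam = [sum(lam[i] * Ds[i][x] * hs[i][x] for i in range(k)) / D_lam[x]
         for x in range(n)]

mixture_loss = loss(D_lam, h_lam)
```

On any such instance `mixture_loss` should not exceed `eps`, since the absolute loss is convex in the prediction.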
19
Back to the bad example
X    f    h1   h0
a    1    1    0
b    0    1    0

     D    Da   Db
a    ½    1    0
b    ½    0    1

Original Loss: ε=0 !!!
h+(x):
  x=a: h+(x) = h1(x) = 1
  x=b: h+(x) = h0(x) = 0
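The same example can be run through the distribution weighted rule with z = λ = (½, ½). This is a small sketch with dictionaries standing in for f, the hypotheses, and the distributions; the variable names are illustrative.

```python
f = {"a": 1.0, "b": 0.0}
hs = [{"a": 1.0, "b": 1.0}, {"a": 0.0, "b": 0.0}]  # h1, h0
Ds = [{"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}]  # Da, Db
lam = [0.5, 0.5]                                   # D = 1/2 Da + 1/2 Db


def h_plus(x):
    """Distribution weighted rule with z = lam."""
    weights = [l * Di[x] for l, Di in zip(lam, Ds)]
    return sum(w * hi[x] for w, hi in zip(weights, hs)) / sum(weights)


# Target mixture and the expected absolute loss of h_plus on it.
D = {x: sum(l * Di[x] for l, Di in zip(lam, Ds)) for x in f}
loss = sum(D[x] * abs(f[x] - h_plus(x)) for x in f)
```

At each point only the distribution that puts mass there contributes, so the rule picks h1 at a and h0 at b and the loss on D is zero.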
21
Unknown mixture distribution
• Zero-sum game:
  – NATURE: selects a distribution Di
  – LEARNER: selects a z
    • hypothesis hz
  – Payoff: L(Di,hz,f)
• Restating the previous result:
  – For any mixed action λ of NATURE
  – LEARNER has a pure action z=λ
    • such that the expected loss is at most ε
22
Unknown mixture distribution
• Consequence:
  – LEARNER has a mixed action (over z’s)
  – for any mixed action λ of NATURE
    • a mixture distribution Dλ
  – The loss is at most ε
• Challenge:
  – show a specific hypothesis hz
    • pure, not mixed, action
23
Searching for a good hypothesis
• Uniformly good hypothesis hz:
– for any Di we have L(Di, hz,f) ≤ ε
• Assume all the hi are identical
– Extremely lucky and unlikely case
• If we have a good hypothesis we are done!
  – L(Dλ,hz,f) = Σ λi L(Di,hz,f) ≤ Σ λi ε = ε
• We need to show that a good hz exists in general!
24
Proof Outline:
• Balancing the losses:
  – Show that some hz has identical loss on every Di
  – uses Brouwer Fixed Point Theorem
    • holds very generally
• Bounding the losses:
  – Show this hz has low loss for some mixture
    • specifically Dz
25
A: compact and convex set
φ: A→A continuous mapping
Brouwer Fixed Point Theorem: For any convex and compact set A and any continuous mapping φ: A→A, there exists a point x in A such that φ(x)=x
26
Balancing Losses
A = { z : Σi zi = 1 and zi ≥ 0 }
$$[\phi(z)]_i = \frac{z_i\, L(D_i, h_z, f)}{\sum_{j=1}^{k} z_j\, L(D_j, h_z, f)}$$
Problem 1: Need to get φ continuous
27
Balancing Losses
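A = { z : Σi zi = 1 and zi ≥ 0 }
$$[\phi(z)]_i = \frac{z_i\, L(D_i, h_z, f)}{\sum_{j=1}^{k} z_j\, L(D_j, h_z, f)}$$
Fixed point: z=φ(z)
$$z_i = \frac{z_i\, L(D_i, h_z, f)}{\sum_{j=1}^{k} z_j\, L(D_j, h_z, f)}$$
$$z_i \sum_{j=1}^{k} z_j\, L(D_j, h_z, f) = z_i\, L(D_i, h_z, f)$$
Problem 2: Needs that zi ≠ 0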
28
Bounding the losses
• We can guarantee balanced losses even for the linear combining rule!

X    f    h1   h0
a    1    1    0
b    0    1    0

     D    Da   Db
a    ½    1    0
b    ½    0    1

For z=(½, ½) we have
  L(Da,hz,f)=½
  L(Db,hz,f)=½
29
Bounding Losses
• Consider the previous z
  – from Brouwer fixed point theorem
• Consider the mixture Dz
  – Expected loss is at most ε
• Also: L(Dz,hz,f) = Σ zj L(Dj,hz,f) = γ
• Conclusion:
  – For any mixture, expected loss is at most γ ≤ ε
30
Solving the problems:
• Redefine the distribution weighted rule:
$$h_{z,\eta}(x) = \sum_{i=1}^{k} \frac{z_i D_i(x) + \eta/k}{\sum_{j=1}^{k} z_j D_j(x) + \eta}\, h_i(x)$$
• Claim: For any distribution D, L(D,hz,η,f) is continuous in z.
31
Main Theorem
For any target function f and any δ>0,
there exist η>0 and z such that
for any λ we have
$$L(D_\lambda, h_{z,\eta}, f) \le \varepsilon + \delta$$
32
Balancing Losses
• The set A = { z : Σ zi = 1 and zi ≥ 0 }
  – The simplex
• The mapping φ with parameters η and η’
  – [φ(z)]i = (zi Li,z + η’/k) / (Σ zj Lj,z + η’)
    • where Li,z = L(Di,hz,η,f)
• For some z in A we have φ(z)=z
  – zi = (zi Li,z + η’/k) / (Σ zj Lj,z + η’) > 0
  – Li,z = (Σ zj Lj,z) + η’ − η’/(zi k) < (Σ zj Lj,z) + η’
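The mapping φ can be sketched in code to see why it stays inside the simplex. The block assumes the smoothed rule has the form hz,η(x) = Σi (zi Di(x) + η/k) / (Σj zj Dj(x) + η) · hi(x), which is a reading of the slides and not a confirmed formula; the function names, the two-point instance, and the values of η and η’ are illustrative. It only evaluates φ and does not search for the fixed point, whose existence Brouwer's theorem guarantees non-constructively.

```python
def smoothed_rule(z, eta, Ds, hs, x):
    """Assumed form of h_{z,eta}(x); eta > 0 keeps the denominator positive."""
    k = len(z)
    denom = sum(zj * Dj[x] for zj, Dj in zip(z, Ds)) + eta
    num = sum((zi * Di[x] + eta / k) * hi[x] for zi, Di, hi in zip(z, Ds, hs))
    return num / denom


def phi(z, eta, eta_prime, Ds, hs, f):
    """[phi(z)]_i = (z_i L_{i,z} + eta'/k) / (sum_j z_j L_{j,z} + eta')."""
    k = len(z)
    # L_{i,z} = L(D_i, h_{z,eta}, f): expected absolute loss under D_i.
    L = [sum(Di[x] * abs(f[x] - smoothed_rule(z, eta, Ds, hs, x)) for x in f)
         for Di in Ds]
    total = sum(zj * Lj for zj, Lj in zip(z, L)) + eta_prime
    return [(zi * Li + eta_prime / k) / total for zi, Li in zip(z, L)]


# Two-point instance: f(a)=1, f(b)=0, h1 = 1, h0 = 0, point-mass sources.
f = {"a": 1.0, "b": 0.0}
Ds = [{"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}]
hs = [{"a": 1.0, "b": 1.0}, {"a": 0.0, "b": 0.0}]

z_next = phi([0.3, 0.7], eta=0.01, eta_prime=0.01, Ds=Ds, hs=hs, f=f)
z_vertex = phi([1.0, 0.0], eta=0.01, eta_prime=0.01, Ds=Ds, hs=hs, f=f)
```

The numerators sum to the denominator, so φ(z) always sums to 1, and the η’/k term keeps every coordinate strictly positive even when z sits on a vertex of the simplex.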
33
Bounding Losses
• Consider the previous z
  – from Brouwer fixed point theorem
• Consider the mixture Dz
  – Expected loss is at most ε+η
• By definition Σ zj Lj,z = L(Dz,hz,η,f)
• Conclusion: γ = Σ zj Lj,z ≤ ε+η
34
Putting it together
• There exists (z,η) such that:
  – Expected loss of hz,η approximately balanced
  – L(Di,hz,η,f) ≤ γ+η’
• Bounding γ using Dz
  – γ = L(Dz,hz,η,f) ≤ ε+η
• For any mixture Dλ
  – L(Dλ,hz,η,f) ≤ ε+η+η’
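The last step can be spelled out. It uses only that the expected loss is linear in the distribution and that the λi are non-negative and sum to 1:

```latex
\begin{align*}
L(D_\lambda, h_{z,\eta}, f)
  &= \sum_{i=1}^{k} \lambda_i \, L(D_i, h_{z,\eta}, f) \\
  &\le \sum_{i=1}^{k} \lambda_i \, (\gamma + \eta') = \gamma + \eta' \\
  &\le \varepsilon + \eta + \eta'.
\end{align*}
```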
35
A more general model
• So far: NATURE first fixes the target function f
• Consistent target functions f:
  – the expected loss of hi w.r.t. Di is at most ε
    • for each of the k distributions
• Function class F = {f : f is consistent}
• New Model:
  – LEARNER picks a hypothesis h
  – NATURE picks f in F and mixture Dλ
  – Loss L(Dλ,h,f)
• RESULT: L(Dλ,h,f) ≤ 3ε.
37
Uniform Algorithm
• Hypothesis sets z=(1/k, …, 1/k):
$$h_u(x) = \sum_{i=1}^{k} \frac{(1/k)\, D_i(x)}{\sum_{j=1}^{k} (1/k)\, D_j(x)}\, h_i(x) = \sum_{i=1}^{k} \frac{D_i(x)}{\sum_{j=1}^{k} D_j(x)}\, h_i(x)$$
• Performance:
  – For any mixture, expected error ≤ kε
  – There exists a mixture with expected error Ω(kε)
  – For k=2, there exists a mixture with expected error 2ε−ε²
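The kε upper bound for the uniform rule can be checked on a random instance. This is a sketch under arbitrary assumptions (sizes, seed, helper names, absolute loss). Because the loss is linear in λ, the worst mixture is one of the sources Di, so it is enough to evaluate the k vertices.

```python
import random

random.seed(1)
k, n = 2, 5  # hypothetical sizes: k sources, n domain points


def random_distribution(m):
    """Random distribution over m points with strictly positive mass."""
    w = [random.random() + 0.01 for _ in range(m)]
    s = sum(w)
    return [wi / s for wi in w]


f = [random.random() for _ in range(n)]
Ds = [random_distribution(n) for _ in range(k)]
hs = [[random.random() for _ in range(n)] for _ in range(k)]


def loss(D, h):
    """Expected absolute loss of h with respect to D and the target f."""
    return sum(D[x] * abs(f[x] - h[x]) for x in range(n))


eps = max(loss(Ds[i], hs[i]) for i in range(k))

# Uniform rule: h_u(x) = sum_i D_i(x) h_i(x) / sum_j D_j(x).
h_u = [sum(Ds[i][x] * hs[i][x] for i in range(k)) / sum(Ds[j][x] for j in range(k))
       for x in range(n)]

# Worst case over all mixtures, attained at one of the sources.
worst = max(loss(Ds[i], h_u) for i in range(k))
```

The value `worst` stays below `k * eps`, in line with the first performance bullet; the sketch says nothing about how tight the bound is.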
38
Open Problem
• Find a uniformly good hypothesis
  – efficiently !!!
• Algorithmic issues:
  – Search over the z’s
  – Multiple local minima.
40
Empirical Results
• Data-set of sentiment analysis:
  – good product takes a little time to start operating very good for the price a little trouble using it inside ca
  – it rocks man this is the rockinest think i've ever seen or buyed dudes check it ou
  – does not retract agree with the prior reviewers i can not get it to retract any longer and that was only after 3 uses
  – dont buy not worth a cent got it at walmart can't even remove a scuff i give it 100 good thing i could return it
  – flash drive excelent hard drive good price and good time for seller thanks
41
Empirical analysis
• Multiple domains:
  – dvd, books, electronics, kitchen appliances.
• Language model:
  – build a model for each domain
    • unlike the theory, this is an additional error source
• Tested on mixture distribution
  – known mixture parameters
• Target: score (1-5)
  – error: Mean Squared Error (MSE)
42
[Figure: linear vs. distribution weighted combining rules; panels: books, dvd, electronics, kitchen]
46
Summary
• Adaptation model
  – combining rules
    • linear
    • distribution weighted
• Theoretical analysis
  – mixture distribution
• Future research
  – algorithms for combining rules
  – beyond mixtures
48
Adaptation – Our Model
• Input:
  – target function: f
  – k distributions D1, …, Dk
  – k hypotheses: h1, …, hk
  – For every i: L(Di,hi,f) ≤ ε
    • where L(D,h,f) defines the expected loss
      – think L(D,h,f) = Ex~D[ |f(x)-h(x)| ]