
Clarinet: A One-step Approach Towards Budget-friendly Unsupervised Domain Adaptation

Yiyang Zhang1,2,∗, Feng Liu2,∗, Zhen Fang2, Bo Yuan1, Guangquan Zhang2 and Jie Lu2,†

1 Shenzhen International Graduate School, Tsinghua University
2 Centre for Artificial Intelligence, University of Technology Sydney

[email protected], [email protected], [email protected], [email protected], {guangquan.zhang, jie.lu}@uts.edu.au

Abstract

In unsupervised domain adaptation (UDA), classifiers for the target domain are trained with massive true-label data from the source domain and unlabeled data from the target domain. However, it may be difficult to collect fully-true-label data in a source domain given a limited budget. To mitigate this problem, we consider a novel problem setting where the classifier for the target domain has to be trained with complementary-label data from the source domain and unlabeled data from the target domain, named budget-friendly UDA (BFUDA). The key benefit is that it is much less costly to collect complementary-label source data (required by BFUDA) than collecting the true-label source data (required by ordinary UDA). To this end, the complementary label adversarial network (CLARINET) is proposed to solve the BFUDA problem. CLARINET maintains two deep networks simultaneously, where one focuses on classifying complementary-label source data and the other takes care of the source-to-target distributional adaptation. Experiments show that CLARINET significantly outperforms a series of competent baselines.

1 Introduction

Domain Adaptation (DA) aims to train a target-domain classifier with data in source and target domains [Yan et al., 2017; Zhou et al., 2019b]. Based on the availability of data in the target domain (e.g., fully-labeled, partially-labeled and unlabeled), DA is divided into three categories: supervised DA [Sukhija et al., 2016], semi-supervised DA [Ao et al., 2017; Zhou et al., 2019a] and unsupervised DA (UDA) [Gong et al., 2018; Liu et al., 2020; Saito et al., 2017; Fang et al., 2019]. In practice, UDA methods have been applied to many real-world problems, such as object recognition [Agresti et al., 2019; Deng et al., 2019; Zhao et al., 2019].

∗ Equal Contribution
† Corresponding Author

UDA methods train a target-domain classifier with massive true-label data from the source domain (true-label source data) and unlabeled data from the target domain (unlabeled target data).

Figure 1: Budget-friendly unsupervised domain adaptation. The red line denotes that UDA methods transfer knowledge from $D_s$ (true-label source data) to $D_t$ (unlabeled target data). However, acquiring fully-true-label source data is costly and unaffordable (black dash line, $x_s \to D_s$, where $x_s$ means unlabeled source data). This brings budget-friendly unsupervised domain adaptation (BFUDA), namely transferring knowledge from $\bar{D}_s$ (complementary-label source data) to $D_t$. It is much less costly to collect complementary-label source data (black line, required by BFUDA) than collecting the true-label one (black dash line, required by UDA). To handle BFUDA, a weak solution is a two-step approach (green dash line), which sequentially combines complementary-label learning methods ($\bar{D}_s \to \hat{D}_s$, label correction) and existing UDA methods ($\hat{D}_s \to D_t$). This paper proposes a one-step approach called complementary label adversarial network (CLARINET, green line, $\bar{D}_s \to D_t$ directly).

Existing works in the literature can be roughly categorised into the following three groups: integral-probability-metrics based UDA [Long et al., 2015]; adversarial-training based UDA [Ganin et al., 2016; Long et al., 2018]; and causality-based UDA [Gong et al., 2016; Gong et al., 2018; Zhang et al., 2015]. Since adversarial-training based UDA methods extract better domain-invariant representations via deep networks, they usually have good target-domain accuracy [Sankaranarayanan et al., 2018].

However, the success of UDA still highly relies on the scale of true-label source data. Namely, the target-domain accuracy of a UDA method (e.g., CDAN) decays when the scale of true-label source data decreases. Hence, massive true-label source data are inevitably required by UDA methods, which is very expensive and prohibitive, especially when the budget



Figure 2: Overview of the proposed complementary label adversarial network (CLARINET). It consists of a feature extractor G, a label predictor F and a conditional domain discriminator D. $g_s$ and $g_t$ are the outputs of G, representing the extracted features of source and target data. $f_s$ and $f_t$ represent classifier predictions. T is a mapping function which we propose to scatter the classifier predictions. In Algorithm 1, we show how to use the two losses mentioned in this figure to train CLARINET.

is limited [Liu et al., 2019]. This circumstance may hinder the promotion of DA to more areas.

While determining the correct label from many candidates is laborious, choosing one of the incorrect labels (i.e., complementary labels) would be much easier and thus less costly, especially when we have many candidates [Ishida et al., 2017]. For example, suppose that we need to annotate labels of a bunch of animal images from 1,000 candidates. One strategy is to ask crowd-workers to choose the true labels from the 1,000 candidates, while the other is to judge the correctness of a label randomly chosen by the system from the candidates. Apparently, the cost of the second strategy is much lower than that of the first one [Ishida et al., 2019].

This brings us a novel problem setting, budget-friendly UDA (abbreviated as BFUDA), which aims to transfer knowledge from complementary-label source data to unlabeled target data (Figure 1). We describe this problem setting using the word budget-friendly since, compared to ordinary UDA, we can greatly save the labeling cost by annotating complementary labels in the source domain rather than annotating true labels [Ishida et al., 2017; Yu et al., 2018]. Please note that existing UDA methods cannot handle BFUDA, as they require fully-true-label source data [Liu et al., 2020; Saito et al., 2017] or at least 20% true-label source data (demonstrated in [Liu et al., 2019; Shu et al., 2019]).

A straightforward but weak solution to BFUDA is a two-step approach, which sequentially combines complementary-label learning (CLL) methods and existing UDA methods (green dash line in Figure 1)1. CLL methods are used to assign pseudo labels to complementary-label source data. Then, we can train a target-domain classifier with pseudo-label source data and unlabeled target data using existing

1 We implement this two-step approach and take it as a baseline.

UDA methods. Nevertheless, pseudo-label source data contain noise, which may cause poor domain-adaptation performance of such a two-step approach [Liu et al., 2019].

Therefore, we propose a powerful one-step solution to BFUDA, the complementary label adversarial network (CLARINET). It maintains two deep networks trained simultaneously in an adversarial way, where one can accurately classify complementary-label source data, and the other can discriminate source and target domains. Since Long et al. [2018] and Song et al. [2009] have shown that multimodal structures of distributions can only be captured sufficiently by the cross-covariance dependency between the features and classes (i.e., true labels), we set the input of the domain discriminator D as the outer product of the feature representation (e.g., $g_s$ in Figure 2) and the mapped classifier prediction (e.g., $T(f_s)$ in Figure 2).

Due to the nature of complementary-label classification, the predicted probabilities of the classes (i.e., the elements of $f_s$, Figure 2) are relatively close to each other. According to [Song et al., 2009], this kind of predicted probabilities cannot provide sufficient information to capture the multimodal structure of distributions. To fix this, we propose a mapping function T to make the predicted probabilities more scattered (i.e., $T(f_s)$, Figure 2) than the original ones (i.e., $f_s$, Figure 2). By doing so, the mapped classifier predictions can better indicate their choice. In this way, we can take full advantage of classifier predictions and effectively align the distributions of the two domains. Our ablation study (see Table 2) verifies that T indeed helps improve the target-domain accuracy.

We conduct experiments on 6 BFUDA tasks and compare CLARINET with a series of competent baselines. Empirical results demonstrate that CLARINET can effectively transfer knowledge from complementary-label source data to unlabeled target data and is superior to all baselines.

2 Budget-friendly Unsupervised Domain Adaptation

In this section, we propose a novel problem setting, called budget-friendly unsupervised domain adaptation (BFUDA), and prove a learning bound for this new problem. Then, we show how BFUDA brings benefits to the domain adaptation field.

2.1 Problem Setting

Let $\mathcal{X} \subset \mathbb{R}^d$ be a feature (input) space and $\mathcal{Y} := \{1, ..., K\}$ be a label (output) space. A domain is defined as follows.

Definition 1 (Domains for BFUDA). Given random variables $X_s, X_t \in \mathcal{X}$ and $Y_s, \bar{Y}_s, Y_t \in \mathcal{Y}$, the source and target domains are the joint distributions $P(X_s, \bar{Y}_s)$ and $P(X_t, Y_t)$, where the joint distributions $P(X_s, Y_s) \neq P(X_t, Y_t)$ and $P(\bar{Y}_s = c \mid Y_s = c) = 0$ for all $c \in \mathcal{Y}$.

Then, we propose the BFUDA problem as follows.

Problem 1 (BFUDA). Given independent and identically distributed (i.i.d.) labeled samples $\bar{D}_s = \{(x_s^i, \bar{y}_s^i)\}_{i=1}^{n_s}$ drawn from the source domain $P(X_s, \bar{Y}_s)$ and i.i.d. unlabeled samples $D_t = \{x_t^i\}_{i=1}^{n_t}$ drawn from the target marginal distribution $P(X_t)$, the aim of BFUDA is to train a classifier $F: \mathcal{X} \to \mathcal{Y}$ with $\bar{D}_s$ and $D_t$ such that $F$ can accurately classify target data drawn from $P(X_t)$.



It is clear that it is impossible to design a suitable learning procedure without any assumptions on $P(X_s, \bar{Y}_s)$. In this paper, we use the assumption for unbiased complementary learning proposed by [Ishida et al., 2019; Ishida et al., 2017]:
$$P(\bar{Y}_s = k \mid X_s) = \frac{1}{K-1} \sum_{c=1, c \neq k}^{K} P(Y_s = c \mid X_s), \quad (1)$$
for all $k, c \in \mathcal{Y} = \{1, ..., K\}$ and $c \neq k$.
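For intuition, the uniform assumption in Eq. (1) corresponds to drawing each complementary label uniformly from the $K-1$ classes other than the true one, which is also how complementary-label data are generated in our experiments (Section 4, following [Ishida et al., 2019]). Below is a minimal sketch of such a generation step; the function name and the use of NumPy are our own illustration, not the authors' released code.

```python
import numpy as np

def to_complementary_labels(true_labels, num_classes, rng=None):
    """Draw one complementary label per sample, uniformly from the
    K-1 classes that differ from the true label (Eq. (1))."""
    rng = np.random.default_rng() if rng is None else rng
    true_labels = np.asarray(true_labels)
    # Sample an offset in {1, ..., K-1} and shift the true label by it (mod K);
    # this never returns the true class and is uniform over the other K-1 classes.
    offsets = rng.integers(1, num_classes, size=true_labels.shape)
    return (true_labels + offsets) % num_classes

# Example: 5 MNIST-style labels with K = 10 candidate classes.
print(to_complementary_labels([3, 0, 7, 7, 9], num_classes=10))
```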

2.2 Learning Bound of BFUDA

This subsection presents a learning bound of BFUDA. Practitioners may safely skip it. Let $\ell: \mathbb{R}^K \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function. The decision function is a vector-valued function $h: \mathcal{X} \to \mathbb{R}^K$, and $h_k$ is the $k$-th element of $h$. The complementary risk for $h$ with respect to $\ell$ over $P(X_s, \bar{Y}_s)$ is
$$\bar{L}_s(h) = \mathbb{E}\,\ell(h(X_s), \bar{Y}_s).$$
The risks for the decision function $h$ with respect to the loss $\ell$ over the implicit distributions $P(X_s, Y_s)$ and $P(X_t, Y_t)$ are:
$$L_s(h) = \mathbb{E}\,\ell(h(X_s), Y_s), \quad L_t(h) = \mathbb{E}\,\ell(h(X_t), Y_t).$$
Then, we introduce our main theorem as follows.

Theorem 1. Given a loss function $\ell$ and a hypothesis space $\mathcal{H} \subset \{h: \mathcal{X} \to \mathbb{R}^K\}$, under the unbiased assumption, for any $h \in \mathcal{H}$ we have
$$L_t(h) \le \tilde{L}_s(h) + \frac{1}{2} d_{\mathcal{H}}^{\ell}(P_{X_s}, P_{X_t}) + \Lambda,$$
where $\tilde{L}_s(h) := \sum_{k=1}^{K} \int_{\mathcal{X}} \ell(h(x), k)\, dP_{X_s} - (K-1)\bar{L}_s(h)$, $P_{X_s}$ and $P_{X_t}$ are the source and target marginal distributions, $\Lambda = \min_{h \in \mathcal{H}} [L_s(h) + L_t(h)]$, and $d_{\mathcal{H}}^{\ell}(P_{X_s}, P_{X_t})$ is the distribution discrepancy defined in [Ben-David et al., 2010].

Proof. We first investigate the connection between $L_s(h)$ and $\bar{L}_s(h)$ under the unbiased assumption in Eq. (1). Given the $K \times K$ matrix $Q$ whose diagonal elements are 0 and whose other elements are $1/(K-1)$, we represent the unbiased assumption by $\bar{\eta} = Q\eta$, where $\bar{\eta} = [P(\bar{Y}_s = 1 \mid X_s), ..., P(\bar{Y}_s = K \mid X_s)]^T$ and $\eta = [P(Y_s = 1 \mid X_s), ..., P(Y_s = K \mid X_s)]^T$. Note that $Q$ has an inverse matrix $Q^{-1}$ whose diagonal elements are $-(K-2)$ and whose other elements are 1. Thus, we have
$$Q^{-1}\bar{\eta} = \eta. \quad (2)$$
According to Eq. (2), we have $P(Y_s = k \mid X_s) = 1 - (K-1)P(\bar{Y}_s = k \mid X_s)$, which implies that
$$L_s(h) = \sum_{k=1}^{K} \int_{\mathcal{X}} \ell(h(x), k)\, dP_{X_s} - (K-1)\bar{L}_s(h). \quad (3)$$
Hence, $L_s(h) = \tilde{L}_s(h)$. Combining Eq. (3) with the domain adaptation bound presented in [Ben-David et al., 2010],
$$L_t(h) \le L_s(h) + \frac{1}{2} d_{\mathcal{H}}^{\ell}(P_{X_s}, P_{X_t}) + \Lambda,$$
we prove this theorem.
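The algebra in the proof is easy to check numerically. The snippet below is our own sanity check (not part of the paper) that $Q^{-1}$ recovers the true class-posterior from the complementary one:

```python
import numpy as np

K = 5
Q = (np.ones((K, K)) - np.eye(K)) / (K - 1)    # diag 0, off-diag 1/(K-1)
Q_inv = np.ones((K, K)) - (K - 1) * np.eye(K)  # diag -(K-2), off-diag 1

eta = np.random.dirichlet(np.ones(K))  # a valid true class-posterior vector
eta_bar = Q @ eta                      # complementary posterior, Eq. (1)

assert np.allclose(Q @ Q_inv, np.eye(K))
assert np.allclose(Q_inv @ eta_bar, eta)        # Eq. (2)
assert np.allclose(eta, 1 - (K - 1) * eta_bar)  # the step before Eq. (3)
print("Q^{-1} recovers the true posterior from the complementary one.")
```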

In this bound, the first term, i.e., Eq. (3), is the source classification error based on complementary-label source data. The second term is the distribution discrepancy distance between the two domains, and $\Lambda$ is the difference in labeling functions across the two domains. The empirical form of Eq. (3) is known as the complementary-label loss (see Eq. (5)).

2.3 How does BFUDA Bring Benefits to the DA Field?

Collecting true-label data is always expensive in the real world. Thus, learning from less expensive data [Kumar et al., 2017; Sakai et al., 2017] has been extensively studied in the machine learning field, including label-noise learning [Han et al., 2018b; Han et al., 2018a], complementary-label learning [Ishida et al., 2019; Yu et al., 2018; Ishida et al., 2017] and so on. Among these research directions, obtaining complementary labels is a cost-effective option. As described in the previous works mentioned above, compared with precisely choosing the true class out of many candidate classes, collecting complementary labels is obviously much easier and less costly. In addition, a classifier trained with complementary-label data is equivalent to a classifier trained with true-label data, as shown in [Ishida et al., 2019].

At present, the success of DA still highly relies on the scale of true-label source data, which is a critical bottleneck. With a limited budget, it is unrealistic to obtain enough true-label source data, and thus a good distribution adaptation result cannot be achieved. For the same budget, we can get multiple times more complementary-label data than true-label data. In addition, the adaptation scenario is currently limited to some commonly used datasets, as they have sufficient true labels to support distributional adaptation. Thus, if we can reduce the labeling cost in the source domain, for example by using complementary-label data to replace true-label data (i.e., BFUDA), we can promote DA to more fields. Because existing UDA methods require at least 20% true-label source data [Shu et al., 2019], they cannot handle the BFUDA problem. To address the BFUDA problem directly, we propose a powerful solution, CLARINET, as follows.

3 CLARINET: One-step BFUDA Approach

The proposed CLARINET (Figure 2) mainly consists of a feature extractor G, a label predictor F and a domain discriminator D. Furthermore, we add a mapping function T between the label predictor F and the domain discriminator D to take full advantage of classifier predictions. In this section, we first introduce the two losses used to train CLARINET, and then the whole training procedure of CLARINET is presented.

3.1 Loss Function in CLARINET

In this subsection, we show how to compute the two losses mentioned above in CLARINET after obtaining a mini-batch $\bar{d}_s$ from $\bar{D}_s$ and $d_t$ from $D_t$.

Complementary-label Loss. We first divide $\bar{d}_s$ into $K$ disjoint subsets according to the complementary labels in $\bar{d}_s$,
$$\bar{d}_s = \cup_{k=1}^{K} \bar{d}_{s,k}, \quad \bar{d}_{s,k} = \{(x_k^i, k)\}_{i=1}^{n_{s,k}}, \quad (4)$$
where $\bar{d}_{s,k} \cap \bar{d}_{s,k'} = \emptyset$ if $k \neq k'$ and $n_{s,k} = |\bar{d}_{s,k}|$. Then, following Eq. (3), the complementary-label loss on $\bar{d}_{s,k}$ is
$$\bar{L}_s(G, F, \bar{d}_{s,k}) = -\frac{(K-1)\pi_k}{n_{s,k}} \sum_{i=1}^{n_{s,k}} \ell(F \circ G(x_k^i), k) + \sum_{j=1}^{K} \frac{\pi_j}{n_{s,j}} \sum_{l=1}^{n_{s,j}} \ell(F \circ G(x_j^l), k), \quad (5)$$



where $\ell$ can be any loss (we use the cross-entropy loss) and $\pi_k$ is the proportion of the samples complementary-labeled $k$. The total complementary-label loss on $\bar{d}_s$ is as follows:
$$\bar{L}_s(G, F, \bar{d}_s) = \sum_{k=1}^{K} \bar{L}_s(G, F, \bar{d}_{s,k}). \quad (6)$$
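For concreteness, a minimal PyTorch sketch of Eqs. (5)-(6) is given below. The function and tensor names are our own illustration (not the released implementation), and the non-negative correction of Remark 1 (below) is applied later in the training loop rather than inside this helper.

```python
import torch
import torch.nn.functional as F

def complementary_label_loss(model, x, cbar, num_classes):
    """Per-class complementary losses (Eq. (5)) and their sum (Eq. (6)).

    model: the composed network F∘G; x: a source mini-batch;
    cbar: the complementary labels of x; returns (per_class, total)."""
    logits = model(x)                                    # shape (n, K)
    n = x.shape[0]
    per_class = []
    for k in range(num_classes):
        mask_k = (cbar == k)
        n_k = int(mask_k.sum())
        if n_k == 0:
            per_class.append(logits.new_zeros(()))
            continue
        pi_k = n_k / n                                   # proportion of class k
        target_k = torch.full((1,), k, dtype=torch.long, device=logits.device)
        # Negative term: -(K-1)*pi_k times the mean cross-entropy of class-k samples w.r.t. k.
        neg = -(num_classes - 1) * pi_k * F.cross_entropy(
            logits[mask_k], target_k.expand(n_k))
        # Positive term: sum over j of pi_j times the mean cross-entropy of class-j samples w.r.t. k.
        pos = logits.new_zeros(())
        for j in range(num_classes):
            mask_j = (cbar == j)
            n_j = int(mask_j.sum())
            if n_j == 0:
                continue
            pos = pos + (n_j / n) * F.cross_entropy(
                logits[mask_j], target_k.expand(n_j))
        per_class.append(neg + pos)
    per_class = torch.stack(per_class)
    return per_class, per_class.sum()
```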

As shown in Section 2.2, the complementary-label loss (i.e., Eq. (6)) is an unbiased estimator of the true-label-data risk. Namely, the minimizer of the complementary-label loss agrees with the minimizer of the true-label-data risk, with no constraints on the loss $\ell$ and the model $F \circ G$ [Ishida et al., 2019].

Remark 1. Due to the negative part in $\bar{L}_s(G, F, \bar{d}_s)$, minimizing it directly will cause over-fitting [Kiryo et al., 2017]. To overcome this problem, we use a corrective approach [Ishida et al., 2019] to minimize $\bar{L}_s(G, F, \bar{d}_s)$ (lines 7-13 in Algorithm 1).

Scattered Conditional Adversarial Loss. According to [Song et al., 2009], it is important to capture the multimodal structures of distributions using the cross-covariance dependency between features and classes (i.e., true labels). Since there are no true-label target data in UDA, CDAN adopts the outer product of feature representations and classifier predictions (i.e., outputs of the softmax layer) as new features of the two domains [Long et al., 2018]. The newly constructed features have shown great ability to discriminate source and target domains, since classifier predictions of true-label source data are dispersed, expressing the predicted goal clearly.

However, in the complementary-label classification mode, we observe that the predicted probabilities of the classes (i.e., the elements of $f_s$ in Figure 2) are relatively close to each other. Namely, it is hard to find a significant predictive preference from the classifier predictions. According to [Song et al., 2009], this kind of predictions cannot provide sufficient information to capture the multimodal structure of distributions. To fix this, we propose a mapping function T to scatter the classifier predictions $f = [f_1, ..., f_K]^T$ ($f$ could be $f_s$ or $f_t$ in Figure 2):
$$T(f) = \left[ \frac{f_1^{1/l}}{\sum_{j=1}^{K} f_j^{1/l}}, \ldots, \frac{f_k^{1/l}}{\sum_{j=1}^{K} f_j^{1/l}}, \ldots, \frac{f_K^{1/l}}{\sum_{j=1}^{K} f_j^{1/l}} \right]^T. \quad (7)$$
This is a common way of adjusting the "temperature" of a categorical distribution. As $l \to 0$, the output of $T(f)$ approaches a Dirac ("one-hot") distribution.
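As a sketch, Eq. (7) can be written in a few lines of PyTorch (our own rendering; the small `eps` is only for numerical stability and is not part of the paper's formula):

```python
import torch

def scatter_predictions(f, l=0.5, eps=1e-8):
    """Eq. (7): sharpen a batch of softmax predictions f (shape (n, K)).

    Raising each probability to the power 1/l and renormalizing makes the
    predictions more peaked; as l -> 0 they approach one-hot vectors."""
    sharpened = (f + eps) ** (1.0 / l)
    return sharpened / sharpened.sum(dim=1, keepdim=True)

probs = torch.softmax(torch.randn(4, 10), dim=1)
print(scatter_predictions(probs, l=0.5).sum(dim=1))  # each row still sums to 1
```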

Then, to prioritize the discriminator on those easy-to-transfer examples, following [Long et al., 2018], we measure the uncertainty of the prediction for a sample $\mathbf{x}$ by
$$H(G, F, \mathbf{x}) = -\sum_{k=1}^{K} T_k(f(\mathbf{x})) \log T_k(f(\mathbf{x})), \quad (8)$$
where $T_k(f(\mathbf{x}))$ is the $k$-th element of the mapped prediction. Thus, the scattered conditional adversarial loss is as follows:
$$L_{adv}(G, F, D, \bar{d}_s, d_t) = \frac{\sum_{\mathbf{x} \in \bar{d}_s[X]} \omega_s(\mathbf{x}) \log\big(D(g(\mathbf{x}))\big)}{\sum_{\mathbf{x} \in \bar{d}_s[X]} \omega_s(\mathbf{x})} + \frac{\sum_{\mathbf{x} \in d_t} \omega_t(\mathbf{x}) \log\big(1 - D(g(\mathbf{x}))\big)}{\sum_{\mathbf{x} \in d_t} \omega_t(\mathbf{x})}, \quad (9)$$

Algorithm 1 CLARINET: One-step BFUDA Approach

Input: $\bar{D}_s = \{(x_s^i, \bar{y}_s^i)\}_{i=1}^{n_s}$, $D_t = \{x_t^i\}_{i=1}^{n_t}$.
Parameters: learning rates $\gamma_1$ and $\gamma_2$, epoch number $T_{max}$, start epoch $T_s$, iteration number $N_{max}$, class number $K$, tradeoff $\lambda$, network parameters $\theta_{F \circ G}$ and $\theta_D$.
Output: the neural network $F \circ G$, namely the target-domain classifier for $D_t$.

1: Initialize $\theta_{F \circ G}$ and $\theta_D$;
2: for $t = 1, 2, \ldots, T_{max}$ do
3:   Shuffle the training sets $\bar{D}_s$, $D_t$;
4:   for $N = 1, 2, \ldots, N_{max}$ do
5:     Fetch mini-batches $\bar{d}_s$, $d_t$ from $\bar{D}_s$, $D_t$;
6:     Divide $\bar{d}_s$ into $\{\bar{d}_{s,k}\}_{k=1}^{K}$;
7:     Calculate $\{\bar{L}_s(G, F, \bar{d}_{s,k})\}_{k=1}^{K}$ using Eq. (5), and $\bar{L}_s(G, F, \bar{d}_s)$ using Eq. (6);
8:     if $\min_k \{\bar{L}_s(G, F, \bar{d}_{s,k})\}_{k=1}^{K} \ge 0$ then
9:       Update $\theta_{F \circ G} = \theta_{F \circ G} - \gamma_1 \nabla \bar{L}_s(G, F, \bar{d}_s)$;
10:    else
11:      Calculate $L_{neg} = \sum_{k=1}^{K} \min\{0, \bar{L}_s(G, F, \bar{d}_{s,k})\}$;
12:      Update $\theta_{F \circ G} = \theta_{F \circ G} + \gamma_1 \nabla L_{neg}$;
13:    end if
14:    if $t > T_s$ then
15:      Calculate $L_{adv}(G, F, D, \bar{d}_s, d_t)$ using Eq. (9);
16:      Update $\theta_D = \theta_D - \gamma_2 \nabla L_{adv}(G, F, D, \bar{d}_s, d_t)$;
17:      Update $\theta_{F \circ G} = \theta_{F \circ G} + \gamma_2 \lambda \nabla L_{adv}(G, F, D, \bar{d}_s, d_t)$;
18:    end if
19:  end for
20: end for

where $\omega_s(\mathbf{x})$ and $\omega_t(\mathbf{x})$ are $1 + e^{-H(G, F, \mathbf{x})}$, $g(\mathbf{x})$ is $G(\mathbf{x}) \otimes T(F \circ G(\mathbf{x}))$, and $\bar{d}_s[X]$ is the feature part of $\bar{d}_s$.
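Putting Eqs. (7)-(9) together, the discriminator sees the flattened outer product $g(\mathbf{x})$ and the contribution of each sample is weighted by its certainty weight $\omega(\mathbf{x})$. The sketch below is our own illustration of these formulas (not the released implementation): `feats` and `t_probs` denote the extracted features and the Eq. (7)-mapped predictions of a mini-batch, and `D` is assumed to output a domain probability in (0, 1) per sample.

```python
import torch

def conditional_features(feats, t_probs):
    """g(x) = G(x) ⊗ T(F∘G(x)): flattened outer product fed to the discriminator D.
    feats: (n, d) extracted features; t_probs: (n, K) predictions mapped by Eq. (7)."""
    return torch.bmm(t_probs.unsqueeze(2), feats.unsqueeze(1)).flatten(1)

def certainty_weights(t_probs, eps=1e-8):
    """w(x) = 1 + exp(-H(G, F, x)), with H the entropy of the mapped prediction (Eq. (8))."""
    entropy = -(t_probs * (t_probs + eps).log()).sum(dim=1)
    return 1.0 + torch.exp(-entropy)

def scattered_adversarial_loss(D, feats_s, t_probs_s, feats_t, t_probs_t):
    """Weighted conditional adversarial loss of Eq. (9)."""
    g_s = conditional_features(feats_s, t_probs_s)
    g_t = conditional_features(feats_t, t_probs_t)
    w_s = certainty_weights(t_probs_s)
    w_t = certainty_weights(t_probs_t)
    loss_s = (w_s * torch.log(D(g_s).squeeze(1))).sum() / w_s.sum()
    loss_t = (w_t * torch.log(1.0 - D(g_t).squeeze(1))).sum() / w_t.sum()
    return loss_s + loss_t
```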

3.2 Training Procedure of CLARINET

Based on the two losses proposed in Section 3.1, in CLARINET we try to solve the following optimization problem:
$$\min_{G, F} \; \bar{L}_s(G, F, \bar{D}_s) - \lambda L_{adv}(G, F, D, \bar{D}_s, D_t),$$
$$\min_{D} \; L_{adv}(G, F, D, \bar{D}_s, D_t), \quad (10)$$

where D tries to distinguish the samples from different domains by minimizing $L_{adv}$, while $F \circ G$ wants to maximize $L_{adv}$ to make the domains indistinguishable. To solve the minimax optimization problem in Eq. (10), we add a gradient reversal layer [Ganin et al., 2016] between the domain discriminator and the classifier, which multiplies the gradient by a negative constant ($-\lambda$) during back-propagation. $\lambda$ is a hyper-parameter between the two losses that trades off source risk and domain discrepancy.
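A gradient reversal layer can be written in a few lines of PyTorch; the sketch below is our own illustration (class and function names are hypothetical). It is the identity on the forward pass and flips the gradient, scaled by $\lambda$, on the backward pass, so that minimizing the discriminator loss through this layer maximizes it for $F \circ G$.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambda on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input: reversed gradient for x, None for lambd.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: the discriminator input is passed through the reversal layer, e.g.
# domain_prob = D(grad_reverse(g, lambd))
```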

The training procedure of CLARINET is shown in Algorithm 1. First, we initialize the whole network (line 1) and shuffle the training sets (line 3). During each epoch, after mini-batches $\bar{d}_s$ and $d_t$ are fetched (line 5), we divide the source mini-batch $\bar{d}_s$ into $\{\bar{d}_{s,k}\}_{k=1}^{K}$ using Eq. (4) (line 6). Then, $\{\bar{d}_{s,k}\}_{k=1}^{K}$ are used to calculate the complementary-label loss for each class (i.e., $\{\bar{L}_s(G, F, \bar{d}_{s,k})\}_{k=1}^{K}$) and the whole complementary-label loss $\bar{L}_s(G, F, \bar{d}_s)$ (line 7).

If $\min_k \{\bar{L}_s(G, F, \bar{d}_{s,k})\}_{k=1}^{K} \ge 0$, we calculate the gradient $\nabla \bar{L}_s(G, F, \bar{d}_s)$ and update the parameters of G and F using



Figure 3: Test Accuracy (%) vs. Epochs on 6 BFUDA tasks: (a) USPS→MNIST, (b) MNIST→USPS, (c) SVHN→MNIST, (d) MNIST→MNIST-M, (e) SYND→MNIST, (f) SYND→SVHN. Each panel plots GAC+DAN, GAC+DANN, GAC+CDAN and CLARINET over 500 training epochs. In (a)-(f), we compare the target-domain accuracy of the one-step approach, i.e., CLARINET (ours), with that of the two-step approaches (ours). It is clear that the accuracy of CLARINET gradually and steadily increases and eventually converges, achieving the best accuracy on each task.

gradient descent (lines 8-9). Otherwise, we sum the negative elements in $\{\bar{L}_s(G, F, \bar{d}_{s,k})\}_{k=1}^{K}$ as $L_{neg}$ (line 11) and calculate the gradient $\nabla L_{neg}$ (line 12). Then, we update the parameters of G and F using gradient ascent (line 12), as suggested by [Ishida et al., 2019]. When the number of epochs (i.e., $t$) exceeds $T_s$, we start to update the parameters of D (line 14). We calculate the scattered conditional adversarial loss $L_{adv}$ (line 15). Then, $L_{adv}$ is minimized over D (line 16), but maximized over $F \circ G$ (line 17) for adversarial training.
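For clarity, the classifier update of lines 7-13 can be sketched as follows. This is our own illustration: it reuses the hypothetical `complementary_label_loss` helper sketched after Eq. (6), and `optimizer_fg` is our own name for the SGD optimizer over $\theta_{F \circ G}$.

```python
import torch

def classifier_update(model, optimizer_fg, x_s, cbar_s, num_classes):
    """One classifier update (lines 7-13 of Algorithm 1) with the non-negative correction."""
    per_class, total = complementary_label_loss(model, x_s, cbar_s, num_classes)  # Eqs. (5)-(6)
    optimizer_fg.zero_grad()
    if per_class.min() >= 0:
        # Lines 8-9: gradient descent on the full complementary-label loss.
        total.backward()
    else:
        # Lines 11-12: gradient ascent on the sum of the negative per-class losses (Remark 1).
        l_neg = torch.clamp(per_class, max=0.0).sum()
        (-l_neg).backward()  # ascending on l_neg == descending on -l_neg
    optimizer_fg.step()
```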

4 Experiments

Based on five commonly used datasets, MNIST (M), USPS (U), SVHN (S), MNIST-M (m) and SYN-DIGITS (Y), we verify the efficacy of CLARINET on 6 BFUDA tasks: M→U, U→M, S→M, M→m, Y→M and Y→S. Note that we generate complementary-label data according to [Ishida et al., 2019].

4.1 Baselines

We compare CLARINET with the following baselines: gradient ascent complementary label learning (GAC) [Ishida et al., 2019], i.e., a non-transfer method, and several two-step methods, which sequentially combine GAC with UDA methods (including DAN [Long et al., 2015], DANN [Ganin et al., 2016] and CDAN [Long et al., 2018]). In the two-step approach, the GAC method is first used to assign pseudo labels to complementary-label source data. Then, we train the classifier with pseudo-label source data and unlabeled target data using UDA methods. Thus, we have four baselines: GAC, GAC+DAN, GAC+DANN and GAC+CDAN. The two-step methods share the same pseudo-label source data on each task. Note that, in this paper, we use the entropy conditioning variant of CDAN.

4.2 Experimental Setup

We design our feature extractor G, label predictor F and domain discriminator D according to the architectures from previous works. More precisely, we pick the structures of the feature extractor G from [Ganin et al., 2016; Long et al., 2018]. The label predictor F and the domain discriminator D share the same structure across all tasks, following CDAN [Long et al., 2018]. We follow the standard protocols for unsupervised domain adaptation and compare the average classification accuracy based on 5 random experiments. For each experiment, we take the result of the last epoch. The batch size is set to 128 and the number of epochs is set to 500. We use the SGD optimizer (momentum 0.9, weight decay 5×10^-5) with an initial learning rate of 0.005 for the adversarial network and 5×10^-5 for the classifier. In the mapping function T, l is set to 0.5. We update λ according to [Long et al., 2018].



                     Two-step approaches (ours)
Tasks     GAC       GAC+DAN         GAC+DANN        GAC+CDAN-E      CLARINET (ours)
U → M     51.860    60.692±1.300    77.580±0.770    71.498±1.077    83.692±0.928
M → U     77.796    87.215±0.603    88.688±1.280    92.366±0.365    94.538±0.292
S → M     39.260    45.132±1.363    50.882±2.440    61.922±2.983    63.070±1.990
M → m     45.045    43.346±2.224    62.273±2.261    71.379±0.620    71.717±1.262
Y → M     77.070    81.150±0.591    92.328±0.138    95.532±0.873    97.040±0.212
Y → S     72.480    78.270±0.311    75.147±1.401    82.878±0.278    84.499±0.537
Average   60.585    65.968          74.483          79.263          82.426

Table 1: Results on 6 BFUDA tasks. The bold value represents the highest accuracy (%) in each row. Please note that the two-step methods and CLARINET are all first proposed in our paper.

Methods     U → M          M → U          S → M          M → m          Y → M          Y → S          Average
C w/ LCE    0.445±0.722    0.055±0.129    3.708±0.688    7.088±0.424    1.832±0.102    1.298±0.070    2.404
C w/o T     83.192±1.796   93.419±0.588   52.438±1.927   72.128±1.569   95.442±1.004   83.055±0.652   79.946
CLARINET    83.692±0.928   94.538±0.292   63.070±1.990   71.717±1.262   97.040±0.212   84.499±0.537   82.426

Table 2: Ablation study. The bold value represents the highest accuracy (%) in each column. The results show that UDA methods cannot handle BFUDA tasks directly and that the mapping function T helps improve the adaptation performance under BFUDA.

For the parameters of each baseline, we follow the original settings. We implement all methods with default parameters in PyTorch. The code of CLARINET is available at github.com/Yiyang98/BFUDA.
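For reference, the optimizer configuration described above corresponds roughly to the following sketch; the module definitions are placeholders standing in for G, F and D (the real architectures follow [Ganin et al., 2016; Long et al., 2018]) and the variable names are our own.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the feature extractor G, label predictor F
# and domain discriminator D; the actual architectures are taken from prior work.
G, F_net, D = nn.Linear(784, 128), nn.Linear(128, 10), nn.Linear(128 * 10, 1)

optimizer_fg = torch.optim.SGD(list(G.parameters()) + list(F_net.parameters()),
                               lr=5e-5, momentum=0.9, weight_decay=5e-5)
optimizer_d = torch.optim.SGD(D.parameters(),
                              lr=0.005, momentum=0.9, weight_decay=5e-5)
```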

4.3 Results on BFUDA Tasks

Table 1 reports the target-domain accuracy of 5 methods on 6 BFUDA tasks. As shown, our CLARINET performs best on each task and the average accuracy of CLARINET is significantly higher than those of the baselines. Compared with the GAC method, CLARINET successfully transfers knowledge from complementary-label source data to unlabeled target data. Since CDAN has shown much better adaptation performance than DANN and DAN [Long et al., 2018], GAC+CDAN should outperform the other two-step methods on each task. However, on the task U→M, the accuracy of GAC+CDAN is much lower than that of GAC+DANN. This abnormal phenomenon shows that the noise contained in pseudo-label source data significantly reduces the transferability of existing UDA methods. Namely, we cannot obtain reliable adaptation performance using the two-step BFUDA approach.

Figure 3 shows the target-domain accuracy of the two-step methods and CLARINET on the 6 BFUDA tasks as the number of epochs increases. It is clear that the accuracy of CLARINET gradually and steadily increases and eventually converges, achieving the best accuracy on each task. The accuracy of GAC+CDAN always reaches a plateau quickly. For GAC+DANN, the accuracy is unstable and drops significantly after certain epochs on the M→U and M→m tasks. The accuracy of GAC+DAN is relatively stable but not satisfactory.

4.4 Ablation Study

Finally, we conduct experiments to show the contributions of the components in CLARINET. We consider the following baselines:

• C w/ LCE: train CLARINET by Algorithm 1, while replacing $\bar{L}_s(G, F, \bar{D}_s)$ with the cross-entropy loss.

• C w/o T: train CLARINET by Algorithm 1, without the mapping function T.

C w/ LCE uses the cross-entropy loss to replace the complementary-label loss; its target-domain accuracy shows whether UDA methods can address the BFUDA problem. Comparing CLARINET with C w/o T reveals whether the mapping function T takes effect. As shown in Table 2, the target-domain accuracy of C w/ LCE is much lower than that of the other methods; namely, UDA methods cannot handle BFUDA tasks directly. Although C w/o T achieves better accuracy than the two-step methods, its accuracy is still worse than CLARINET's. Thus, the mapping function T helps improve the adaptation performance under BFUDA.

5 Conclusion

This paper presents a new problem setting for the domain adaptation field, called budget-friendly unsupervised domain adaptation (BFUDA), which exploits economical complementary-label source data instead of expensive true-label source data. Since existing UDA methods cannot address the BFUDA problem, we propose a novel one-step BFUDA approach, called the complementary label adversarial network (CLARINET). Experiments conducted on 6 BFUDA tasks confirm that CLARINET effectively achieves distributional adaptation from complementary-label source data to unlabeled target data and outperforms competitive baselines.

Acknowledgements

The work presented in this paper was supported by the Australian Research Council (ARC) under FL190100149 and DP170101632. We would like to thank the anonymous reviewers for their thoughtful comments. The first author particularly thanks the support by UTS-CAI during her visit.



References

[Agresti et al., 2019] G. Agresti, H. Schafer, P. Sartor, and P. Zanuttigh. Unsupervised domain adaptation for ToF data denoising with adversarial learning. In CVPR, pages 5584–5593, 2019.

[Ao et al., 2017] S. Ao, X. Li, and C.X. Ling. Fast generalized distillation for semi-supervised domain adaptation. In AAAI, pages 1719–1725, 2017.

[Ben-David et al., 2010] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J.W. Vaughan. A theory of learning from different domains. MLJ, 79(1-2):151–175, 2010.

[Deng et al., 2019] Z. Deng, Y. Luo, and J. Zhu. Cluster alignment with a teacher for unsupervised domain adaptation. In ICCV, pages 9944–9953, 2019.

[Fang et al., 2019] Z. Fang, J. Lu, F. Liu, and G. Zhang. Unsupervised domain adaptation with sphere retracting transformation. In IJCNN, pages 1–8. IEEE, 2019.

[Ganin et al., 2016] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. JMLR, 17(59):1–35, 2016.

[Gong et al., 2016] M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Scholkopf. Domain adaptation with conditional transferable components. In ICML, pages 2839–2848, 2016.

[Gong et al., 2018] M. Gong, K. Zhang, B. Huang, C. Glymour, D. Tao, and K. Batmanghelich. Causal generative domain adaptation networks. CoRR, abs/1804.04333, 2018.

[Han et al., 2018a] B. Han, J. Yao, G. Niu, M. Zhou, I. Tsang, Y. Zhang, and M. Sugiyama. Masking: A new perspective of noisy supervision. In NeurIPS, pages 5836–5846, 2018.

[Han et al., 2018b] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018.

[Ishida et al., 2017] T. Ishida, G. Niu, W. Hu, and M. Sugiyama. Learning from complementary labels. In NeurIPS, pages 5639–5649, 2017.

[Ishida et al., 2019] T. Ishida, G. Niu, A. Menon, and M. Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, pages 2971–2980, 2019.

[Kiryo et al., 2017] R. Kiryo, G. Niu, M.C. du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pages 1675–1685, 2017.

[Kumar et al., 2017] A. Kumar, P. Sattigeri, and T. Fletcher. Semi-supervised learning with GANs: Manifold invariance with improved inference. In NeurIPS, pages 5534–5544, 2017.

[Liu et al., 2019] F. Liu, J. Lu, B. Han, G. Niu, G. Zhang, and M. Sugiyama. Butterfly: A panacea for all difficulties in wildly unsupervised domain adaptation. In NeurIPS LTS Workshop, 2019.

[Liu et al., 2020] F. Liu, G. Zhang, and J. Lu. Heterogeneous domain adaptation: An unsupervised approach. TNNLS, Early Access, 2020.

[Long et al., 2015] M. Long, Y. Cao, J. Wang, and M.I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.

[Long et al., 2018] M. Long, Z. Cao, J. Wang, and M.I. Jordan. Conditional adversarial domain adaptation. In NeurIPS, pages 1640–1650, 2018.

[Saito et al., 2017] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, pages 2988–2997, 2017.

[Sakai et al., 2017] T. Sakai, M.C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, pages 2998–3006, 2017.

[Sankaranarayanan et al., 2018] S. Sankaranarayanan, Y. Balaji, C.D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, pages 8503–8512, 2018.

[Shu et al., 2019] Y. Shu, Z. Cao, M. Long, and J. Wang. Transferable curriculum for weakly-supervised domain adaptation. In AAAI, pages 4951–4958, 2019.

[Song et al., 2009] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In ICML, pages 961–968, 2009.

[Sukhija et al., 2016] S. Sukhija, N.C. Krishnan, and G. Singh. Supervised heterogeneous domain adaptation via random forests. In IJCAI, pages 2039–2045, 2016.

[Yan et al., 2017] Y. Yan, W. Li, M.K.P. Ng, M. Tan, H. Wu, H. Min, and Q. Wu. Learning discriminative correlation subspace for heterogeneous domain adaptation. In IJCAI, pages 3252–3258, 2017.

[Yu et al., 2018] X. Yu, T. Liu, M. Gong, and D. Tao. Learning with biased complementary labels. In ECCV, pages 69–85, 2018.

[Zhang et al., 2015] K. Zhang, M. Gong, and B. Scholkopf. Multi-source domain adaptation: A causal view. In AAAI, pages 3150–3157, 2015.

[Zhao et al., 2019] S. Zhao, H. Fu, M. Gong, and D. Tao. Geometry-aware symmetric domain adaptation for monocular depth estimation. In CVPR, pages 9788–9798, 2019.

[Zhou et al., 2019a] J.T. Zhou, S.J. Pan, and I.W. Tsang. A deep learning framework for hybrid heterogeneous transfer learning. AIJ, 275:310–328, 2019.

[Zhou et al., 2019b] J.T. Zhou, H. Zhang, D. Jin, and X. Peng. Dual adversarial transfer for sequence labeling. TPAMI, Early Access, 2019.
