
COARSE-TO-FINE PSEUDO-LABELING GUIDED META-LEARNING FOR INEXACTLY-SUPERVISED FEW-SHOT CLASSIFICATION

Jinhai Yang, Hua Yang∗, Lin Chen

Institution of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China

Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai, China

ABSTRACT

Meta-learning has recently emerged as a promising technique to address the challenge of few-shot learning. However, most existing meta-learning algorithms require fine-grained supervision, thereby involving prohibitive annotation cost. In this paper, we present a new problem named inexactly-supervised meta-learning to alleviate such limitation, focusing on tackling few-shot classification tasks with only coarse-grained supervision. Accordingly, we propose a Coarse-to-Fine (C2F) pseudo-labeling process to construct pseudo-tasks from coarsely-labeled data by grouping each coarse-class into pseudo-fine-classes via similarity matching. Moreover, we develop a Bi-level Discriminative Embedding (BDE) to obtain a good image similarity measure in both visual and semantic aspects with inexact supervision. Experiments across representative benchmarks indicate that our approach shows profound advantages over baseline models.

Index Terms— Meta-learning, few-shot, inexact supervision, coarse-to-fine pseudo-labeling

1. INTRODUCTION

As a hallmark of intelligence, humans can easily learn new concepts with scarce samples. In stark contrast, recent advances of deep learning models usually demand immense quantities of data with fine-grained annotations to learn robust systems [1, 2]. This requirement severely limits their practicality due to two issues: (1) the difficulty of collecting massive training samples and (2) the high cost of exhaustively fine-grained data-labeling. To relieve the first issue, meta-learning [3, 4, 5, 6] blazes a trail to acquire transferable knowledge from a variety of analogous few-shot tasks. A task (a.k.a. episode) is considered as a single datapoint, which consists of a support set for task-specific learning and a query set for evaluating or updating the acquired model. In general, there are three common types of meta-learning methods: (1) metric-based [6, 7]: learning components of a differentiable weighted k-NN predictor, (2) model-based [8, 9]: building a function to map the query sample and the support set to a probability distribution, (3) optimization-based: designing or learning a novel optimization algorithm [10, 11, 12]. Although meta-learning systems have shown their capacity for learning in a low-data regime, most of them still rely heavily on fine-grained annotations. The second issue remains unsolved.

∗Contact email: [email protected]. This work was funded by National Natural Science Foundation of China (NSFC, Grant No. 61771303), Science and Technology Commission of Shanghai Municipality (STCSM, Grant Nos. 20DZ1200203, 19DZ1209303, 18DZ1200102), and the SJTU-Yitu laboratory for visual computing and application. We also acknowledge the computational support from the Student Innovation Center of SJTU.


Fig. 1: Illustration of our problem setup in comparison with existing methods. (a) Fully-supervised meta-learning. All samples are associated with fine labels. (b) Semi-supervised meta-learning. A pool of unlabeled samples is introduced as additional training data. (c) Inexactly-supervised meta-learning (ours). Training samples are associated with only coarse labels.

Recently, various meta-learning methods that exploit data with limited annotations have emerged. As shown in Fig. 1 (a, b), instead of purely learning with labeled data as in fully-supervised meta-learning techniques, semi-supervised meta-learning [13, 14] allows a large pool of unlabeled data to assist the recognition of labeled support sets. [15] proposed a prototype propagation mechanism, leveraging the hierarchical labels (both coarse and fine) of ImageNet [16] to acquire a multi-level directed acyclic prototype graph. Nevertheless, the above-mentioned methods all treat data with limited annotations as extra information in addition to the fine-labeled data. In comparison, purely training with weakly-labeled data, which helps relieve the difficulty and expertise required in data-labeling, has rarely been investigated in the meta-learning literature.

In this paper, we present inexactly-supervised meta-learning to alleviate the second issue (the expensive cost of fine-grained annotation), advancing the few-shot classification paradigm towards a scenario where we utilize only inexact supervision. Inexact supervision [17] is a typical type of weak supervision, where the training data are only associated with coarse-grained labels. As illustrated in Fig. 1 (c), different from existing meta-learning frameworks, in our framework only coarse-grained labels are available for both the support set and the query set in meta-training tasks.


Fig. 2: Overview of the framework of coarse-to-fine pseudo-labeling guided meta-learning. Left: C2F process with BDE features. Right: An example of 2-way 1-shot classification with one query sample.

To tackle this new problem, we propose a coarse-to-fine (C2F) pseudo-labeling algorithm to generate pseudo-fine-labels to guide meta-learning models. As illustrated in Fig. 2, we first establish the task distribution for meta-training by grouping each coarse-class into pseudo-fine-classes via greedy-based similarity matching, and then conduct meta-learning methods on the generated pseudo-tasks. How to acquire a similarity measure of good quality from coarsely-labeled data becomes the crucial challenge of the overall problem.

To address this challenge, we design a Bi-level Discriminative Embedding (BDE) to map images to embedding vectors such that cosine distances reflect a similarity measure between images. Supervised embedding learning [18, 19, 20] tends to optimize the embedding space explicitly by maximizing inter-class variation and minimizing intra-class variation. Note that under the inexactly-supervised regime, this kind of method would eliminate the intra-class discrimination within a coarse class, undesirably making the fine-grained classes it contains inseparable. Instead, instance-based unsupervised embedding methods [21, 22], which take each instance as an individual class, become promising candidates. [23] attempted to acquire data-augmentation-invariant and instance spread-out features by learning to recognize augmented images as the original ones. Instead of utilizing category-wise labels, they take advantage of data augmentation to conduct instance-wise supervision, approximating the positive-concentrated and negative-separated properties. However, this data-augmentation-based method emphasizes visual discrimination but ignores semantic discrimination. To overcome this drawback, we propose BDE, which integrates both discriminations simultaneously in the presence of inexact supervision.

We evaluate the proposed methods on two popular benchmarks, Omniglot [24] and tieredImageNet [13]. To simulate the situation of inexact supervision, we remove the fine labels and preserve only coarse labels during meta-training. Taking Omniglot as an example, the model is exposed to the coarse-grained alphabet labels rather than the character labels. Experimental results show that our method consistently outperforms the baseline models.

To our knowledge, this work is the first attempt to purely use inexact supervision in few-shot classification. Our contributions are summarized as follows: First, we present the inexactly-supervised meta-learning problem for few-shot classification with inexact labels. Second, we devise a coarse-to-fine process that can be integrated with existing meta-learning systems to solve this new problem. Third, we propose a Bi-level Discriminative Embedding for similarity matching of the pseudo-labeling process.

Algorithm 1 Coarse-to-Fine (C2F) Pseudo-Labeling

Require: Ns: number of samples per pseudo sub-category
Require: C: number of coarse classes of D
Require: D = {(x_i, y_i)}: coarsely-annotated data
 1: Initialize pseudo dataset Dp = {}
 2: for class c from 1 to C do
 3:     Retrieve Dc = {x_i | y_i = c, (x_i, y_i) ∈ D}, let M = |Dc|
 4:     Get the embeddings Fc ∈ R^(D×M) of the samples in Dc
 5:     Obtain the similarity matrix S = Fc^T Fc ∈ R^(M×M)
 6:     while the number of remaining samples in Dc ≥ Ns do
 7:         Sample x_j from Dc as the seed of a new category c_n
 8:         Retrieve {x_k}, k = 1, ..., Ns−1, with top similarity to x_j from Dc
 9:         Dp ← Dp ∪ {(x_j, c, c_n)} ∪ {(x_k, c, c_n)}, k = 1, ..., Ns−1
10:         Remove the selected Ns samples from Dc
11:     end while
12:     Drop the remaining samples of Dc
13: end for

2. METHODS

2.1. Problem Definition

In inexactly-supervised few-shot classification, only coarse-grained supervision Y is available for all training data X. Each x ∈ X is associated with a coarse-category label y ∈ Y. Taking ImageNet [16] as an example, instead of being labeled exactly, the images of class “lamp” and “bookcase” may share the coarse-category label “furnishing”. The training classes and testing classes should be non-overlapping, as in the standard setting of few-shot classification. In coarse-to-fine pseudo-labeling guided meta-learning, we follow the episodic paradigm commonly employed in previous few-shot learning methods [5, 6, 7, 11]. In an episode, an N-way K-shot classification model is trained on a novel task, which contains K pseudo-labeled support samples and Q query samples for each of the N classes.

2.2. Coarse-to-Fine Pseudo-Labeling

Unlike pseudo-labeling in the usual sense [25], coarse-to-fine pseudo-labeling takes as input a coarsely-annotated image and assigns a fine-grained pseudo-label to it. As outlined in Algorithm 1, this process continually selects random images as seed samples and then makes locally-optimal choices to produce the pseudo sub-categories, until the pseudo-labeled training dataset Dp is derived from the coarsely-annotated training dataset D. We conduct image similarity matching with cosine similarity (inner product) on top of the corresponding BDE features.
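To make the grouping step concrete, the following is a minimal sketch of the greedy per-coarse-class grouping of Algorithm 1, assuming the ℓ2-normalized BDE features of one coarse class are already available as a NumPy array; the function and variable names are ours, not from a released implementation.

    import numpy as np

    def c2f_group_one_coarse_class(features, n_s, rng=None):
        """Greedily split one coarse class into pseudo sub-categories of size n_s.

        features: (M, D) array of L2-normalized BDE embeddings of one coarse class.
        Returns a list of index groups, one group per pseudo-fine-class; leftover
        samples (fewer than n_s) are dropped, as in Algorithm 1.
        """
        rng = rng or np.random.default_rng()
        sim = features @ features.T                 # cosine similarity = inner product
        remaining = list(range(len(features)))
        groups = []
        while len(remaining) >= n_s:
            seed = remaining[rng.integers(len(remaining))]   # random seed sample
            others = [i for i in remaining if i != seed]
            # locally-optimal choice: the n_s - 1 samples most similar to the seed
            nearest = sorted(others, key=lambda i: sim[seed, i], reverse=True)[:n_s - 1]
            group = [seed] + nearest
            groups.append(group)
            remaining = [i for i in remaining if i not in group]
        return groups

Running this for every coarse class and tagging each returned group with a fresh pseudo-fine-class id yields the pseudo dataset Dp.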

After pseudo-labeling, pseudo-tasks can be sampled from Dp for meta-learning. For each episode, a subset of the pseudo-labeled training dataset is sampled to construct an N-way K-shot classification task T, which consists of N classes, each with K support samples and Q query samples. The class labels in each subset are temporarily assigned to a random permutation of (1, 2, ..., N).
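A pseudo-task could then be drawn from Dp roughly as follows; this is a sketch under the assumption that Dp is stored as a mapping from pseudo-fine-class ids to their samples, and the names are hypothetical.

    import random

    def sample_pseudo_task(dp, n_way, k_shot, q_query):
        """Sample one N-way K-shot episode from the pseudo-labeled dataset Dp.

        dp: dict mapping pseudo-fine-class id -> list of samples (e.g., image paths).
        Returns (support, query) as lists of (sample, episode_label) pairs.
        """
        classes = random.sample(sorted(dp.keys()), n_way)    # pick N pseudo-fine-classes
        random.shuffle(classes)                              # random permutation of episode labels
        support, query = [], []
        for episode_label, cls in enumerate(classes):
            picks = random.sample(dp[cls], k_shot + q_query)
            support += [(x, episode_label) for x in picks[:k_shot]]
            query += [(x, episode_label) for x in picks[k_shot:]]
        return support, query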

2.3. Bi-level Discriminative Embedding

Here we present the Bi-level Discriminative Embedding (BDE), which approximates both visual discrimination and semantic discrimination. As shown in Fig. 3, the BDE system exploits two types of supervision: (1) instance-wise supervision and (2) coarse class-wise supervision.


Fig. 3: Training scheme of Bi-level Discriminative Embedding.

Although these two types of supervision share the objective of pulling similar images together and pushing dissimilar ones away in a compact space, they encourage the model to distinguish images from quite different perspectives (the former at the visual level, the latter at the semantic level). The following are the details of the proposed approach.

For a coarsely-labeled sample (x_i, y_i), let x̂_i denote the augmented sample, and let f_i and f̂_i denote their embedding features, respectively. As shown in Fig. 3, the embeddings are ℓ2-normalized, thus ‖f_i‖_2 = 1 and the cosine similarity can be simply calculated by a dot product. Based on softmax matching, in a batch with m instances, the probability of the augmented image x̂_i being recognized as instance i can be written as

    P(i | x̂_i) = exp(f_i^T f̂_i / τ) / Σ_{k=1}^{m} exp(f_k^T f̂_i / τ),    (1)

where τ is the temperature parameter [26] scaling the entropy of the output probability distribution. On the contrary, the probability of another instance x_j being recognized as instance i can be written as

    P(i | x_j) = exp(f_i^T f_j / τ) / Σ_{k=1}^{m} exp(f_k^T f_j / τ),    j ≠ i.    (2)

The visual discrimination preserving problem is subsequently solved by minimizing the empirical risk (maximizing the log-likelihood) over all the instances within a batch:

    L_D = − Σ_i log P(i | x̂_i) − Σ_i Σ_{j≠i} log(1 − P(i | x_j)).    (3)
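As a reference, Eqs. (1)–(3) could be implemented in PyTorch along the following lines, assuming f and f_hat are the ℓ2-normalized embeddings of the original and augmented images of one batch; this is a sketch with our own names, not the authors' code.

    import torch
    import torch.nn.functional as F

    def instance_discrimination_loss(f, f_hat, tau=0.1):
        """Visual-discrimination loss L_D of Eq. (3).

        f, f_hat: (m, D) L2-normalized embeddings of original / augmented images.
        """
        m = f.size(0)
        eps = 1e-8
        # row i of p_aug is P(. | x_hat_i): softmax over instances of f_k . f_hat_i, Eq. (1)
        p_aug = F.softmax(f_hat @ f.t() / tau, dim=1)    # (m, m)
        # row j of p_inst is P(. | x_j): the same matching applied to originals, Eq. (2)
        p_inst = F.softmax(f @ f.t() / tau, dim=1)       # (m, m)

        # maximize P(i | x_hat_i): diagonal of p_aug
        loss_pos = -torch.log(p_aug.diagonal() + eps).sum()
        # minimize P(i | x_j) for j != i: off-diagonal entries of p_inst
        off_diag = ~torch.eye(m, dtype=torch.bool, device=f.device)
        loss_neg = -torch.log(1.0 - p_inst[off_diag] + eps).sum()
        return loss_pos + loss_neg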

To further optimize the semantic discrimination among coarse classes and leverage the weak annotations y ∈ {1, 2, ..., C}, the embedding features are fed into an auxiliary linear classifier W = [W_1^T, W_2^T, ..., W_C^T]^T ∈ R^(D×C) with a softmax layer. Similar to standard classification, the probability of x_i being classified to class c is defined by

    P(c | x_i) = exp(W_c^T f_i) / Σ_{k=1}^{C} exp(W_k^T f_i).    (4)

On the other hand, the augmented images should be classified into the original class. Thereby, the probability of the augmented image x̂_i being classified to class c is denoted by

    P(c | x̂_i) = exp(W_c^T f̂_i) / Σ_{k=1}^{C} exp(W_k^T f̂_i).    (5)

Let y_{i,j} denote the j-th entry of the (one-hot) class label of instance x_i. The negative log-likelihood classification loss over all the instances within a batch is

    L_C = − Σ_i Σ_{j=1}^{C} y_{i,j} log [ P(j | x_i) P(j | x̂_i) ].    (6)

The joint loss is the sum of the instance-wise discriminative loss and the coarse class-wise classification loss,

    L = m L_D + n L_C,    (7)

where m and n are a pair of trade-off parameters controlling the relative contributions of the two levels of discrimination.
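Continuing the sketch above, the semantic branch and the joint objective of Eqs. (4)–(7) might look as follows; W denotes the auxiliary classifier weights, instance_discrimination_loss is the L_D sketch given earlier, and the default weights follow the m = 1, n = 10 setting reported in Sec. 3.2.

    import torch
    import torch.nn.functional as F

    def coarse_class_loss(f, f_hat, y_coarse, W):
        """Semantic-discrimination loss L_C of Eq. (6).

        f, f_hat: (m, D) embeddings of original / augmented images.
        y_coarse: (m,) coarse labels in {0, ..., C-1}.
        W: (D, C) weight matrix of the auxiliary linear classifier.
        """
        log_p = F.log_softmax(f @ W, dim=1)          # log P(c | x_i), Eq. (4)
        log_p_hat = F.log_softmax(f_hat @ W, dim=1)  # log P(c | x_hat_i), Eq. (5)
        idx = torch.arange(f.size(0))
        # -sum_i log [ P(y_i | x_i) * P(y_i | x_hat_i) ], Eq. (6) with one-hot labels
        return -(log_p[idx, y_coarse] + log_p_hat[idx, y_coarse]).sum()

    def bde_loss(f, f_hat, y_coarse, W, m_weight=1.0, n_weight=10.0, tau=0.1):
        """Joint BDE objective of Eq. (7): L = m * L_D + n * L_C."""
        l_d = instance_discrimination_loss(f, f_hat, tau)  # L_D sketch from above
        l_c = coarse_class_loss(f, f_hat, y_coarse, W)
        return m_weight * l_d + n_weight * l_c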

3. EXPERIMENTS

3.1. Datasets

We conduct experiments on two widely-used few-shot classification benchmarks, Omniglot and tieredImageNet. These datasets are both “hierarchical” in the sense that the data are annotated by not only fine labels but also higher-level coarse labels, thus they are well-suited for evaluating inexactly-supervised meta-learning methods.

Omniglot [24] consists of 1623 characters from 50 alphabets, each containing 20 grayscale images drawn by different people. We select the alphabets in the background split for meta-training, seven of the alphabets (Manipuri, Atemayar Qelisayer, Sylheti, Keble, Gurmukhi, ULOG, Old Church Slavonic (Cyrillic)) in the evaluation split for meta-validation, and the remaining alphabets of the evaluation split for meta-test, ensuring that the alphabets in each set are non-overlapping. In the meta-training set, the character labels (fine) of the images are invisible, whereas the alphabet labels (coarse) are exposed to provide inexact supervision.

tieredImageNet [13] is a subset of ILSVRC-12 [27], containing 608 classes from 34 super-categories in accordance with the ImageNet [16] hierarchy, where the training classes are ensured to be distinct enough from the test classes semantically. These super-categories are split into 20 meta-training (351 classes), 6 meta-validation (97 classes), and 8 meta-test (160 classes) categories. The mean number of samples per class is 1281. Similar to Omniglot, we discard the exact class labels in the meta-training set and use only the coarse super-category label.

3.2. Experiment Settings

Training Procedures. The training process of the proposed framework (Fig. 2) consists of three procedures: (1) BDE Learning (Fig. 3): train a Convolutional Neural Network (CNN) to represent images with instance-wise supervision and coarse class-wise supervision on the meta-training set. (2) C2F Pseudo-Labeling (Alg. 1): use the CNN to derive BDE features from the training images of each coarse class and then conduct similarity matching to pseudo-label them into pseudo-fine-classes according to the similarity of the corresponding features. (3) Meta-Learning: run fully-supervised meta-learning algorithms (in this paper, MetaBL [6]) on the pseudo-labeled few-shot classification tasks generated from the pseudo-fine-classes to excavate transferable knowledge.

Implementation Details. We use ResNet-18 [2] as the backbone of BDE and use a weighted kNN classifier to evaluate the quality of the learned embedding features. The labels of the top-k (we use k = 200) nearest neighbors of a test example, in the sense of cosine similarity, are retrieved to make a weighted vote for prediction. For this evaluation, we again take the coarse labels as the ground truth, hypothesizing that a model performing well on coarse-level tasks can also perform well on fine-level tasks. The meta-training set is further split into a training set and a validation set for model selection, containing 80% and 20% of the images of each coarse-class, respectively. We set the dimension of the feature embeddings D to 128 and the temperature τ to 0.1. The model is trained for a total of 200 epochs, with the learning rate starting from 0.3 and decaying by factors of 0.1 and 0.01 at epochs 120 and 160. A mini-batch contains 128 samples. The optimizer is SGD with momentum 0.9 and weight decay 5 × 10^−4. Four methods (RandomResizedCrop, ColorJitter, RandomGrayscale, RandomHorizontalFlip) with default parameters in PyTorch are chosen as data augmentation to enhance the visual-discriminative property of BDE. The 1-channel grayscale images of Omniglot are converted to 3 channels to allow for flexible image augmentations. We set the trade-off parameters m and n to 1 and 10, respectively. To avoid leaking additional priors about the meta-test data, we use the average number of images per fine-class in the meta-validation set as Ns.
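The weighted kNN check described above could be sketched as follows; the paper does not spell out the exact vote weighting, so we use the cosine similarity itself as the weight, which is one common choice, and the names are ours.

    import torch

    def weighted_knn_predict(test_feat, train_feats, train_labels, k=200):
        """Weighted kNN vote over cosine similarities for one test embedding.

        test_feat: (D,) L2-normalized embedding of a test image.
        train_feats: (N, D) L2-normalized embeddings of the coarsely-labeled set.
        train_labels: (N,) integer coarse labels.
        """
        num_classes = int(train_labels.max()) + 1
        sims = train_feats @ test_feat                 # cosine similarities, shape (N,)
        topk_sims, topk_idx = sims.topk(k)             # top-k nearest neighbors
        votes = torch.zeros(num_classes)
        for s, idx in zip(topk_sims, topk_idx):
            votes[train_labels[idx]] += s              # similarity-weighted vote
        return int(votes.argmax())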


Table 1: Results of inexactly-supervised meta-learning on Omniglot. Average 5-way accuracy (%) with 95% confidence intervals.

Method                        1-shot          5-shot
MetaBL (inexact sup.) [6]     80.60 ± 0.26    93.57 ± 0.15
C2F w/ Pixels-MetaBL          72.87 ± 0.70    88.53 ± 0.30
C2F w/ BDE-MetaBL (Ours)      88.85 ± 0.54    96.56 ± 0.15

Table 2: Results of inexactly-supervised meta-learning on tieredImageNet. Average 5-way accuracy (%) with 95% confidence intervals.

Method                        1-shot          5-shot
MetaBL (inexact sup.) [6]     57.84 ± 0.30    72.21 ± 0.45
C2F w/ Pixels-MetaBL          46.65 ± 0.73    61.04 ± 0.75
C2F w/ BDE-MetaBL (Ours)      60.54 ± 0.79    75.22 ± 0.63

Evaluation Metric. Following the standard setting of few-shot learning, our evaluation metric is the accuracy averaged over 1000 test episodes, reported with 95% confidence intervals. To provide fair and accurate comparisons, we employ the fine-labeled meta-validation set and meta-test set to evaluate the models. Note that we utilize the fine-labeled meta-validation set for model selection only, so it leaks no additional information to the models. During meta-training, the models take the pseudo-labels of query samples as the ground truth to evaluate and update for better across-task generalization.
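For completeness, a minimal sketch of the reported metric (mean episode accuracy with a normal-approximation 95% confidence interval); this is the standard computation in few-shot learning papers, not code taken from the authors.

    import numpy as np

    def mean_accuracy_with_ci95(episode_accuracies):
        """Mean accuracy over test episodes (e.g., 1000) with a 95% confidence interval."""
        acc = np.asarray(episode_accuracies, dtype=float)
        ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
        return acc.mean(), ci95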

3.3. Results on Standard Benchmarks

Baseline models. Since there is no prior work focusing on inexactly-supervised few-shot learning, we create two baselines for comparison. To investigate the effectiveness of C2F and BDE, we considered two alternatives as the baseline models: (1) Train MetaBL [6] with inexact supervision straightforwardly (without C2F pseudo-labeling). (2) Conduct C2F pseudo-labeling for pseudo-task generation with pixel-level features rather than BDE features (referred to as “C2F w/ Pixels-MetaBL”). We refer to our approach, which conducts C2F pseudo-labeling with BDE features, as “C2F w/ BDE-MetaBL”.

Results. The results on Omniglot and tieredImageNet are shown in Tab. 1 and Tab. 2, respectively. On both datasets, our approach consistently outperforms the baselines by a large margin. Especially in the 1-shot setting, which is more challenging than the 5-shot setting, we significantly improve the accuracy by 3-16% for Omniglot and 3-14% for tieredImageNet. In contrast, we also notice that C2F pseudo-labeling with raw pixels performs even worse than directly training MetaBL with coarse labels, which indicates the importance of carefully devising a good embedding for the C2F process.

3.4. Ablation Study

In this section, we show that both levels of discrimination in BDE contribute to the final performance. We start with the full model (C2F w/ BDE-MetaBL) and then remove either the visual discrimination (instance-wise supervision) or the semantic discrimination (coarse class-wise supervision). The results are shown in Tab. 3, with 95% confidence intervals omitted due to space limitations. We observe that excluding either discrimination decreases the few-shot classification accuracy on both datasets. In particular, visual discrimination contributes the most. We suspect that this is because the instance-wise supervision applied for visual discrimination is far stricter than the coarse class-wise supervision for semantic discrimination. We also note that these weakened versions still outperform the “C2F w/ Pixels-MetaBL” baselines in Tab. 1 and Tab. 2.

Table 3: Effect of the components of BDE. “Dis.” stands for discrimination. Average 5-way accuracy (%) of C2F w/ BDE-MetaBL.

                              Omniglot              tieredImageNet
Visual Dis.   Semantic Dis.   1-shot    5-shot      1-shot    5-shot
     –             ✓          76.49     91.19       57.10     71.12
     ✓             –          81.28     94.26       58.06     73.37
     ✓             ✓          88.85     95.56       60.54     75.22

3.5. Comparisons of Meta-Backbones

As described in Sec. 3.2, our framework can be integrated with existing fully-supervised meta-learning algorithms (which we call meta-backbones). Here, we choose MetaOptNet [12], MAML [11], ProtoNets [7] and MetaBL [6] for comparison. We first conduct coarse-to-fine pseudo-labeling with BDE on tieredImageNet to generate pseudo-datasets, and then apply the various meta-backbones to conduct few-shot classification learning on the fixed datasets. As shown in Tab. 4, when integrated into our framework, MetaBL achieves the best performance among the compared models.

Table 4: Effect of various meta-backbones on tieredImageNet. Average 5-way accuracy (%) with 95% confidence intervals.

Method                        1-shot          5-shot
C2F w/ BDE-MetaOptNet         38.98 ± 0.53    54.20 ± 0.51
C2F w/ BDE-MAML               47.85 ± 0.49    64.61 ± 0.48
C2F w/ BDE-ProtoNets          51.83 ± 0.84    68.71 ± 0.79
C2F w/ BDE-MetaBL (Ours)      60.54 ± 0.79    75.22 ± 0.63

4. CONCLUSION

In this work, we propose an inexactly-supervised meta-learning system, motivated by the goal of addressing few-shot classification problems with only coarse-grained labels. To achieve this goal, we design a bi-level discriminative embedding that jointly optimizes the visual-discriminative and semantic-discriminative properties of image representations, and introduce a coarse-to-fine pseudo-labeling algorithm that generates pseudo-tasks for model training via greedy-based similarity matching on the learned embeddings. Experimental results show that our framework achieves substantial improvements over the baseline models. In addition, we investigate and analyze our models in detail via ablation studies.


5. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[3] Jürgen Schmidhuber, Evolutionary Principles in Self-Referential Learning, or on Learning How to Learn: The Meta-Meta-... Hook, Ph.D. thesis, Technische Universität München, 1987.

[4] Sebastian Thrun and Lorien Pratt, “Learning to learn: Introduction and overview,” in Learning to Learn, pp. 3–17. Springer, 1998.

[5] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.

[6] Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell, “A new meta-baseline for few-shot learning,” arXiv preprint arXiv:2003.04390, 2020.

[7] Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.

[8] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap, “Meta-learning with memory-augmented neural networks,” in International Conference on Machine Learning, 2016, pp. 1842–1850.

[9] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel, “A simple neural attentive meta-learner,” in International Conference on Learning Representations, 2018.

[10] Sachin Ravi and Hugo Larochelle, “Optimization as a model for few-shot learning,” in International Conference on Learning Representations, 2017.

[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1126–1135.

[12] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto, “Meta-learning with differentiable convex optimization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10657–10665.

[13] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel, “Meta-learning for semi-supervised few-shot classification,” in International Conference on Learning Representations, 2018.

[14] Xinzhe Li, Qianru Sun, Yaoyao Liu, Qin Zhou, Shibao Zheng, Tat-Seng Chua, and Bernt Schiele, “Learning to self-train for semi-supervised few-shot classification,” in Advances in Neural Information Processing Systems, 2019, pp. 10276–10286.

[15] Lu Liu, Tianyi Zhou, Guodong Long, Jing Jiang, Lina Yao, and Chengqi Zhang, “Prototype propagation networks (PPN) for weakly-supervised few-shot learning on category graph,” in International Joint Conference on Artificial Intelligence (IJCAI), 2019.

[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

[17] Zhi-Hua Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44–53, 2018.

[18] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese, “Deep metric learning via lifted structured feature embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.

[19] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy, “Deep metric learning via facility location,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5382–5390.

[20] Ben Harwood, B. G. Kumar, Gustavo Carneiro, Ian Reid, Tom Drummond, et al., “Smart mining for deep metric learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2821–2829.

[21] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.

[22] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox, “Discriminative unsupervised feature learning with exemplar convolutional neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1734–1747, 2015.

[23] Mang Ye, Xu Zhang, Pong C. Yuen, and Shih-Fu Chang, “Unsupervised embedding learning via invariant and spreading instance feature,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6210–6219.

[24] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.

[25] Dong-Hyun Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on Challenges in Representation Learning, ICML, 2013, vol. 3, p. 2.

[26] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.

[27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.