
Diversity Transfer Network for Few-Shot Learning

Mengting Chen1∗†, Yuxin Fang1∗, Xinggang Wang1‡, Heng Luo2, Yifeng Geng2, Xinyu Zhang, Chang Huang2, Wenyu Liu1‡, Bo Wang3,4

1 School of EIC, Huazhong University of Science and Technology  2 Horizon Robotics, Inc.  3 Vector Institute  4 PMCC, UHN
{mengtingchen, yuxin_fang, xgwang, liuwy}@hust.edu.cn  {heng.luo, yifeng.geng, chang.huang}@horizon.ai

[email protected] [email protected]

Abstract

Few-shot learning is a challenging task that aims at training a classifier for unseen classes with only a few training examples. The main difficulty of few-shot learning lies in the lack of intra-class diversity within the insufficient training samples. To alleviate this problem, we propose a novel generative framework, Diversity Transfer Network (DTN), that learns to transfer latent diversities from known categories and composite them with support features to generate diverse samples for novel categories in feature space. The learning problem of the sample generation (i.e., diversity transfer) is solved via minimizing an effective meta-classification loss in a single-stage network, instead of the generative loss in previous works. Besides, an organized auxiliary task co-training over known categories is proposed to stabilize the meta-training process of DTN. We perform extensive experiments and ablation studies on three datasets, i.e., miniImageNet, CIFAR100 and CUB. The results show that DTN, with single-stage training and faster convergence speed, obtains the state-of-the-art results among the feature generation based few-shot learning methods. Code and supplementary material are available at: https://github.com/Yuxin-CV/DTN.

Introduction

Deep neural networks (DNNs) have shown tremendous success in solving many challenging real-world problems when a large amount of training data is available (Krizhevsky, Sutskever, and Hinton 2012; Simonyan and Zisserman 2014; He et al. 2015). Common practice suggests that models with more parameters have greater capacity to fit data, and that more training data usually provides better generalization ability. However, DNNs struggle to generalize given only a few training examples, while humans excel at learning new concepts from just a few examples (Bloom 2000). Few-shot learning has therefore been proposed to close the performance gap between the machine learner and the human learner. In the canonical setting of few-shot learning, there are a training set $D_{train}$ (seen, known) and a testing set $D_{test}$ (unseen, novel) with disjoint categories. Models are trained on the training set and tested in an $N$-way $K$-shot scheme (Vinyals et al. 2016), where the models need to classify the queries into one of the $N$ categories correctly when only $K$ samples of each novel category are given. This unique setting of few-shot learning poses an unprecedented challenge in fully utilizing the prior information in the training set $D_{train}$, which corresponds to the known or historical information of the human learner. Common approaches to address this challenge either learn a good metric for novel tasks (Snell, Swersky, and Zemel 2017; Vinyals et al. 2016; Sung et al. 2018) or train a meta-learner for fast adaptation (Finn, Abbeel, and Levine 2017; Munkhdalai and Yu 2017; Ravi and Larochelle 2017).

∗ Equal contribution. † Mengting Chen was an intern at Horizon Robotics when working on this paper. ‡ Corresponding authors.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recently, the generation based approach has become an effective solution for few-shot learning (Hariharan and Girshick 2017; Schwartz et al. 2018; Wang et al. 2018; Zhang et al. 2018), since it directly alleviates the problem of lacking training samples. We propose a Diversity Transfer Network (DTN) for sample generation. In DTN, the offset between a random sample pair from a known category is composited with a support sample of a novel category in the latent feature space. The generated features, as well as the support features, are then averaged to form the proxy of the novel category. Finally, query samples are evaluated against the proxy. Only if the generated samples are diverse and follow the distribution of the real samples can the meta-classifier (i.e., the proxy) be robust enough to classify queries correctly.

In addition to the new sample generation scheme, we utilize an effective meta-training curriculum called OAT (Organized Auxiliary task co-Training), inspired by the auxiliary task co-training in TADAM (Oreshkin, Rodríguez López, and Lacoste 2018) and curriculum learning (Bengio et al. 2009). OAT organizes auxiliary tasks and meta-tasks reasonably and effectively reduces training complexity. Experiments show that by applying OAT, our DTN converges much faster compared with the naïve meta-training strategy (i.e., meta-training from scratch), the multi-stage training strategy used in ∆-encoder (Schwartz et al. 2018), and the auxiliary task co-training strategy used in TADAM.

The main components of DTN are integrated into a single network and can be optimized in an end-to-end fashion. Thus, DTN is very simple to implement and easy to train. Our experimental results show that this simple method outperforms many previous works on a variety of datasets.

arXiv:1912.13182v1 [cs.CV] 31 Dec 2019


Related Work

Metric Learning Based Approaches

Metric learning is the most common and straightforward solution for few-shot learning. An embedding function can be learned from a myriad of instances of known categories. Then some simple metrics, such as Euclidean distance (Snell, Swersky, and Zemel 2017) and cosine distance (Vinyals et al. 2016; Qiao et al. 2018; Qi, Brown, and Lowe 2018), are used to build nearest neighbor classifiers for instances of unseen categories. Furthermore, to model the contextual information among support images and query images, a bidirectional LSTM and an attention mechanism are adopted in Matching Network (Vinyals et al. 2016). Besides measuring the distances of a query to its support images, a newer solution compares the query to the center of the support images of each class in feature space, as in Snell, Swersky, and Zemel; Qiao et al.; Qi, Brown, and Lowe; Gidaris and Komodakis. The center is usually termed a proxy of the class. Specifically, squared Euclidean distance is used in Prototypical Network (Snell, Swersky, and Zemel 2017), and cosine distance is used in the other works. Snell, Swersky, and Zemel; Qi, Brown, and Lowe directly calculate proxies by averaging the embedding features, while Qiao et al.; Gidaris and Komodakis take a small network to predict proxies. Based on Prototypical Network, TADAM (Oreshkin, Rodríguez López, and Lacoste 2018) further proposes a dynamic task-conditioned feature extractor by predicting layer-level element-wise scale and shift vectors for each convolutional layer. Different from simple metrics, Relation Network (Sung et al. 2018) takes a neural network as a non-linear metric and directly predicts the similarities between the query and support images. TPN (Liu et al. 2019) performs transductive learning on a similarity graph that contains both query and support images to obtain high-order similarities.

Meta-Learning Based Approaches

Meta-learning approaches have been widely used in few-shot learning scenarios by learning an optimization for fast adaptation, aiming to obtain a meta-learner that can solve novel tasks quickly. Meta Network (Munkhdalai and Yu 2017) and adaResNet (Munkhdalai et al. 2018) are memory-based methods. Example-level and task-level information in Meta Network are preserved in fast and slow weights, respectively. AdaResNet performs rapid adaptation by mimicking conditionally shifted neurons, which modify activation values with task-specific shifts retrieved from a memory module. An LSTM-based update rule for the parameters of a classifier is proposed in Ravi and Larochelle, where both short-term knowledge within a task and long-term knowledge common among all the tasks are learned. MAML (Finn, Abbeel, and Levine 2017), LEO (Rusu et al. 2019) and MT-net (Lee and Choi 2018) all differentiate through gradient update steps to optimize performance after fine-tuning. While MAML operates directly in the high-dimensional parameter space, LEO performs meta-learning within a low-dimensional latent space. Different from MAML, which assumes a fixed model, MT-net chooses a subset of its weights to fine-tune. Franceschi et al. propose a method based on bi-level programming that unifies gradient-based hyper-parameter optimization and meta-learning.

Generation Based Approaches

Sample synthesis using generative models has recently emerged as a popular direction for few-shot learning (Zhu et al. 2017; Goodfellow et al. 2014). How to synthesize new samples based on a few examples remains an interesting open problem. AGA (Dixit et al. 2017) and FATTEN (Liu et al. 2018) are attribute-guided (w.r.t. pose and depth) augmentation methods in feature space that leverage a corpus with attribute annotations. Hariharan and Girshick try to transfer transformations from a pair of examples of a known category to a "seed" example of a novel class. Finding specific generation targets requires a carefully designed pipeline with heuristic steps. ∆-encoder (Schwartz et al. 2018) also tries to extract intra-class deformations between image pairs sampled from the same class. Wang et al. propose to generate samples by adding random noises to support features. Different from previous methods, MetaGAN (Zhang et al. 2018) generates fake samples that need to be discriminated by the classifier rather than used for augmentation, which sharpens the decision boundaries of novel categories.

Our proposed DTN shares a philosophical similarity with image hallucination (Hariharan and Girshick 2017) and ∆-encoder (Schwartz et al. 2018), with distinct differences in the following aspects. The first difference is that DTN does not require setting specific target points for the generator. More specifically, ∆-encoder takes a pair of images $X_a^1$ and $X_a^2$ from the same class and learns to infer the diversity between them by reconstructing $X_a^2$. The image hallucination method collects quadruples $(X_a^1, X_a^2, X_b^1, X_b^2)$ for training based on clustering and traversal; each quadruple contains two image pairs from two classes $a$ and $b$; a generation network is trained to predict a sample $X_a^2$ from the quadruple when the remaining three $(X_a^1, X_b^1, X_b^2)$ are given as input. Then, synthesized samples are used to train a linear classifier. The input of the generator in DTN is also a triplet $(X_a^1, X_b^1, X_b^2)$, as in Hariharan and Girshick, but the generated sample $X_a^2$ is used directly to construct the meta-classifier, and the generator is optimized by minimizing the meta-classification loss instead of setting specific generation targets. Secondly, DTN integrates feature extraction, feature generation, and meta-learning into a single network and enjoys the simplicity and effectiveness of end-to-end training, while Hariharan and Girshick; Schwartz et al. are stage-wise methods.

More recent works based on sample generation and data augmentation are IDeMe-Net (Zitian Chen 2019) and SalNet (Zhang, Zhang, and Koniusz 2019). The former utilizes an additional deformation sub-network with a large number of parameters to synthesize diverse deformed images; the latter needs to pre-train a saliency network on the MSRA-B dataset. In contrast to these approaches, our method is based on a simple diversity transfer generator that learns a better proxy of each category with fewer parameters and faster convergence speed. Besides, our method can be regarded as an instance of compositional learning (Yuille 2011) in the latent feature space.


Figure 1: Illustration of the proposed diversity transfer network. The branch indicated by orange arrows is the meta-task, which is trained in a meta-learning way. The orange solid arrows indicate the process of meta-training, while the orange dashed arrows indicate the process of meta-testing. During meta-training, the features of the support image and reference images from the feature extractor are fed into the feature generator to generate new features. The parameters of the meta-classifier are formed by the averaged proxies of the support features and generated features. Then the query image is fed in to evaluate the performance of the meta-classifier during meta-testing. The branch indicated by grey arrows is the auxiliary task, which aims to accelerate convergence and improve the generalization ability.

Method

Problem Definition

Different from the conventional classification task, where the training set $D_{train}$ and the testing set $D_{test}$ consist of samples from the same classes, few-shot learning aims to address the problem where the label spaces of $D_{train}$ and $D_{test}$ are disjoint. We follow the standard $N$-way $K$-shot classification scenario defined in Vinyals et al. to study the few-shot learning problem. An $N$-way $K$-shot task is termed an episode. An episode is formed by first sampling $N$ classes from the training/testing set. Then $K$ images sampled from each of the $N$ classes constitute the support set $\{(X_s^{n,k}, Y_s^{n,k})\}$, where $n \in [1, 2, ..., N]$ and $k \in [1, 2, ..., K]$. For the sake of simplicity, we take $N$-way 1-shot (i.e., $K = 1$) classification as the example in the following sections, and the support set is simplified to $\{(X_s^n, Y_s^n)\}$. The query sample $(X_q, Y_q)$ is sampled from the remaining images of the $N$ classes. The goal is to classify the query into one of the $N$ classes correctly based only on the support set and the prior meta-knowledge learned from the training set $D_{train}$.
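For concreteness, the episode construction described above can be sketched in Python as follows; the dataset structure (`images_by_class`) and the number of query images per class are illustrative assumptions, not details given by the paper.

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=15):
    """Build one N-way K-shot episode: sample N classes, then K support
    images and a few query images per class (n_query is an assumed value)."""
    classes = random.sample(list(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = random.sample(images_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in imgs[:k_shot]]
        query += [(x, label) for x in imgs[k_shot:]]
    return support, query
```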

An Overview of Diversity Transfer Network

The overall structure of the Diversity Transfer Network (DTN) is shown in Fig. 1. DTN contains four modules and is organized into two task branches. The task branch indicated by orange arrows is the meta-task, which is trained in a meta-learning way. The input for the meta-task consists of the following three parts: support images $X_s^n$, a query image $X_q$, and reference images $X_r^{h,j}$, where $n \in [1, 2, ..., N]$, $h \in [1, 2, ..., H]$, $j \in [1, 2]$. All images are mapped to L2-normalized feature vectors $z = F(X)$ by a feature extractor $F$, where $z \in \mathbb{R}^C$. $z_r^{h,1}$ and $z_r^{h,2}$ are the feature vectors of two reference images; they come from the same category and make up a reference pair. The diversity of the pair is transferred to the support feature $z_s^n$ by the feature generator $G$ to generate a new feature. The generated feature $z_g^{n,h}$ is supposed to belong to the same category as $z_s^n$. For each support feature, $H$ samples are generated based on it. Since a meta-task is an $N$-way 1-shot image classification task, the meta-classifier is an $N$-way classifier consisting of a weight matrix $W$ and a trainable temperature $\alpha$. The values in $W$ are determined by the proxies formed by the support features and the features generated from them. The meta-classifier is differentiable, so the feature extractor $F$ and the feature generator $G$ can be updated by standard back-propagation according to the loss function defined by the cosine similarity between the query and the proxies. The task branch indicated by grey arrows in Fig. 1 is the auxiliary task, which aims to accelerate and stabilize the training of DTN. It is a conventional classification task over all categories of the training set $D_{train}$.

Feature Generation via Diversity Transfer

Each image $X$ is mapped to a feature vector $z = F(X)$ by the feature extractor $F$. $z_q$, $z_s^n$ and $\{z_r^{h,1}, z_r^{h,2}\}$ are the feature vectors of the query image $X_q$, the support image $X_s^n$ and the reference image pair $\{X_r^{h,1}, X_r^{h,2}\}$, respectively, where $n \in [1, 2, ..., N]$ and $h \in [1, 2, ..., H]$.


Figure 2: Feature generator in DTN. The three input features are mapped into a latent space by the mapping function $\phi_1$. The diversity (i.e., offset) between the reference features is then added to the support feature in this space, and the result is mapped by $\phi_2$ to keep the same size as the inputs. The output is a generated feature which is supposed to be a sample belonging to category $n$.

For a specific support feature $z_s^n$, during both the meta-training and meta-testing phases, the reference image pairs $\{X_r^{h,1}, X_r^{h,2}\}$ are always sampled from the training set $D_{train}$ (seen, known). Specifically, we first randomly sample $H$ classes from the whole set of training classes in $D_{train}$ with replacement. For each sampled class, we then randomly sample two different images $X_r^{h,1}$ and $X_r^{h,2}$ to form a reference pair. We do not sample any images from $D_{test}$ (unseen, novel) during the whole process. The conventional few-shot evaluation setting, termed the $N$-way $K$-shot setting, requires building an $N$-way classifier with the support of only $K$ samples for each novel class and the prior meta-knowledge from the whole training set $D_{train}$. Therefore, our sampling method strictly complies with the few-shot learning protocol.
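A rough sketch of this sampling procedure is given below; the data structure `train_images_by_class` and the default value of `h` are assumptions for illustration.

```python
import random

def sample_reference_pairs(train_images_by_class, h=64):
    """Sample H reference pairs from the seen training classes only:
    classes are drawn with replacement, then two distinct images per class."""
    pairs = []
    for _ in range(h):
        cls = random.choice(list(train_images_by_class))
        x1, x2 = random.sample(train_images_by_class[cls], 2)
        pairs.append((x1, x2))
    return pairs
```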

As shown in Fig. 2, the feature generator $G$ of DTN consists of two mapping functions $\phi_1$ and $\phi_2$. The three input features are first mapped into a latent space, $\bar{z} = \phi_1(z)$, $\bar{z} \in \mathbb{R}^{C'}$. The elementwise difference $\bar{z}_r^{h,1} - \bar{z}_r^{h,2}$ measures the diversity between the two reference features. It is applied to the support feature by a simple linear combination $\bar{z}_s^n + (\bar{z}_r^{h,1} - \bar{z}_r^{h,2})$. After mapping by $\phi_2$, we obtain a feature $z_g^{n,h}$ which has the same size as the input $z_s^n$ and should belong to the same category as the support feature $z_s^n$. More specifically:

$$z_g^{n,h} = G(z_s^n, z_r^{h,1}, z_r^{h,2}) = \phi_2\big(\phi_1(z_s^n) + \phi_1(z_r^{h,1}) - \phi_1(z_r^{h,2})\big). \quad (1)$$

Given $H$ different reference pairs for a single support feature $z_s^n$, there will be $H$ generated features that enrich the diversity of category $n$. They are helpful to construct a more robust classifier for unseen categories. When $K > 1$, each of the $K$ support samples is taken as a "seed" and $H$ samples are generated based on it. Therefore, there will be $K$ support samples and $K \times H$ generated samples for each novel category.
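A minimal functional sketch of Eq. (1), assuming $\phi_1$ and $\phi_2$ are given as callables and features are NumPy vectors (the function and argument names are illustrative):

```python
import numpy as np

def generate_features(z_support, ref_pairs, phi1, phi2):
    """Eq. (1): transfer the latent offset of each reference pair onto the support feature.
    z_support: (C,) array; ref_pairs: list of H (z_r1, z_r2) tuples; returns an (H, C) array."""
    return np.stack([phi2(phi1(z_support) + phi1(z1) - phi1(z2))
                     for z1, z2 in ref_pairs])
```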

Meta-Learning Based on Averaged Proxies

The meta-task branch of DTN is indicated by orange arrows in Fig. 1. The orange solid arrows and dashed arrows indicate the processes of meta-training and meta-testing, respectively. Each image $X$ is mapped to a feature vector $z = F(X)$. Similar to Qiao et al.; Gidaris and Komodakis; Qi, Brown, and Lowe, all the features here are L2-normalized vectors (i.e., $\|z\|_2 = 1$). The support feature $z_s^n$ and all the $H$ reference feature pairs $\{z_r^{h,1}, z_r^{h,2}\}$ ($h \in [1, 2, ..., H]$) are fed into the generator $G$ to generate $H$ new features $z_g^{n,h}$ ($H$ is set to 3 in Fig. 1 for illustration). So we obtain $H + 1$ features for the $n$-th category. The meta-task is an $N$-way classification task, therefore the meta-classifier is represented by a matrix $W \in \mathbb{R}^{N \times C}$, in which each row $w_n \in \mathbb{R}^C$ can be viewed as a proxy (Movshovitz-Attias et al. 2017) of the $n$-th category. After obtaining all the $H + 1$ features for category $n$, the $n$-th row $w_n$ of $W$, termed the averaged proxy, is the L2-normalized average of those features:

$$\bar{w}_n = \frac{1}{H + 1}\Big(z_s^n + \sum_{h=1}^{H} z_g^{n,h}\Big), \quad (2)$$

$$w_n = \frac{\bar{w}_n}{\|\bar{w}_n\|}. \quad (3)$$

All the averaged proxies are also L2-normalized vectors, so the meta-classifier essentially becomes a cosine-similarity based classification model. After constructing the meta-classifier, the L2-normalized query feature $z_q$ is fed into it for evaluation. The prediction $p = z_q W^{\top}$ is the combination of the classification scores $p_n = z_q w_n^{\top}$ over the $N$ categories.

To further increase stability and robustness when dealing with a large number of categories, we adopt a learnable temperature $\alpha$ in our meta-task loss, following Qi, Brown, and Lowe, where $\alpha$ is updated by back-propagation during training. The meta-task loss $L$ can be defined as follows:

$$L = -\log \frac{\exp(\alpha\, z_q w_{Y_q}^{\top})}{\sum_{n=1}^{N} \exp(\alpha\, z_q w_n^{\top})}. \quad (4)$$
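A minimal NumPy sketch of Eqs. (2)-(4), assuming all features are already L2-normalized (function and variable names are illustrative):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def averaged_proxy(z_support, z_generated):
    # Eqs. (2)-(3): average the support feature with its H generated features, then L2-normalize.
    w = (z_support + z_generated.sum(axis=0)) / (1 + z_generated.shape[0])
    return l2_normalize(w)

def meta_loss(z_query, proxies, label, alpha):
    # Eq. (4): cross-entropy over temperature-scaled cosine similarities.
    # proxies: (N, C) matrix of averaged proxies; z_query: (C,) L2-normalized query feature.
    logits = alpha * proxies @ z_query
    return float(np.log(np.exp(logits).sum()) - logits[label])
```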

Organized Auxiliary Task Co-training

In order to accelerate the convergence of training and obtain better generalization ability, the meta-learning network in DTN is jointly trained with an auxiliary task. The auxiliary task is a conventional classification over all $N'$ categories in $D_{train}$. It shares the same feature extractor $F$ with the meta-task branch. Different from the meta-classifier $W \in \mathbb{R}^{N \times C}$, which consists of the averaged proxies, the auxiliary classifier $W' \in \mathbb{R}^{N' \times C}$ after the feature extractor $F$ is randomly initialized and updated via back-propagation. The mini-batch $\{(X_i, Y_i)\}$ is randomly sampled from the training set $D_{train}$, where $i \in [1, 2, ..., B]$ and $B$ is the batch size. The auxiliary task loss $L'$ has the same form as the meta-task loss $L$:

$$L' = -\log \frac{\exp(\alpha'\, z_i {w'}_{Y_i}^{\top})}{\sum_{n=1}^{N'} \exp(\alpha'\, z_i {w'}_n^{\top})}, \quad (5)$$

where $z_i$ is one of the training features in the mini-batch, $w'_n \in \mathbb{R}^C$ is the $n$-th row of $W'$, and $\alpha'$ is learnable.

In TADAM (Oreshkin, Rodríguez López, and Lacoste 2018), the auxiliary task is sampled with a probability that is annealed exponentially. We observe some positive effects from this training strategy compared with naïve meta-training and multi-stage training in our DTN.


Table 1: Few-shot image classification accuracies on miniImageNet. '-': not reported.

| Methods | Ref. | Backbone | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|---|
| Matching Network (Vinyals et al.) | NeurIPS'16 | 64-64-64-64 | 43.56% ± 0.84% | 55.31% ± 0.73% |
| Meta-Learn LSTM (Ravi and Larochelle) | ICLR'17 | 64-64-64-64 | 43.44% ± 0.77% | 60.60% ± 0.71% |
| MAML (Finn, Abbeel, and Levine) | ICML'17 | 32-32-32-32 | 48.70% ± 1.84% | 63.11% ± 0.92% |
| Prototypical Network (Snell, Swersky, and Zemel) | NeurIPS'17 | 64-64-64-64 | 49.42% ± 0.78% | 68.20% ± 0.66% |
| Relation Network (Sung et al.) | CVPR'18 | 64-96-128-256 | 50.44% ± 0.82% | 65.32% ± 0.70% |
| MT-net (Lee and Choi) | ICML'18 | 64-64-64-64 | 51.70% ± 1.84% | - |
| MetaGAN♦ (Zhang et al.) | NeurIPS'18 | 64-96-128-256 | 52.71% ± 0.64% | 68.63% ± 0.67% |
| Qiao et al. | CVPR'18 | 64-64-64-64 | 54.53% ± 0.40% | 67.87% ± 0.20% |
| Gidaris and Komodakis | CVPR'18 | 64-64-64-64 | 56.20% ± 0.86% | 73.00% ± 0.64% |
| DTN (Ours)♦ | - | 64-64-64-64 | 57.89% ± 0.84% | 73.28% ± 0.65% |
| Gidaris and Komodakis | CVPR'18 | ResNet-12 | 55.45% ± 0.89% | 70.13% ± 0.68% |
| adaResnet (Munkhdalai et al.) | ICML'18 | ResNet-12 | 56.88% ± 0.62% | 71.94% ± 0.57% |
| TADAM (Oreshkin, Rodríguez López, and Lacoste) | NeurIPS'18 | ResNet-12 | 58.50% ± 0.30% | 76.70% ± 0.30% |
| Qiao et al. | CVPR'18 | WRN-28-10 | 59.60% ± 0.41% | 73.74% ± 0.19% |
| STANet (Yan, Zhang, and He) | AAAI'19 | ResNet-12 | 58.35% ± 0.57% | 71.07% ± 0.39% |
| TPN (Liu et al.) | ICLR'19 | ResNet-12 | 59.46% | 75.65% |
| LEO (Rusu et al.) | ICLR'19 | WRN-28-10 | 61.76% ± 0.08% | 77.59% ± 0.12% |
| ∆-encoder (Schwartz et al.)♦ | NeurIPS'18 | VGG-16 | 59.9% | 69.7% |
| IDeMe-Net (Zitian Chen)♦ | CVPR'19 | ResNet-18♠ | 59.14% ± 0.86% | 74.63% ± 0.74% |
| SalNet Intra-class Hal. (Zhang, Zhang, and Koniusz)♦ | CVPR'19 | ResNet-101♣ | 62.22% ± 0.87% | 77.95% ± 0.65% |
| Deep DTN (Ours)♦ | - | ResNet-12 | 63.45% ± 0.86% | 77.91% ± 0.62% |

♦ Generation based approaches  ♠ Using a deformation sub-network  ♣ Using a saliency network pre-trained on MSRA-B

However, this approach has a shortcoming: the randomness in both the frequency and the order of the two tasks affects the final result to some extent, and the distribution of auxiliary tasks is unpredictable rather than annealed exponentially, especially when the number of training epochs is not very large. Another problem brought by the randomness is that it is hard to determine the training schedule, e.g., the learning rate, the number of training epochs, etc., since the permutation of auxiliary tasks and meta-tasks varies according to the random seed. We empirically find that the stochastic auxiliary task co-training strategy used in TADAM results in a large fluctuation in the meta-classification accuracy (over 4%, see Table 3 for details) when using different random seeds. This randomness makes the choice of hyperparameters as well as the training schedule more difficult.

Therefore, we propose the OAT (Organized Auxiliary task co-Training) strategy, which organizes auxiliary tasks and meta-tasks in a more orderly and more reasonable manner. More specifically, there are two kinds of training epochs: the auxiliary training epoch $\mathcal{A}$ and the meta-training epoch $\mathcal{M}$. We select $T$ training epochs to form one training unit $U$; the $i$-th training unit $U_i$ has $\gamma_i$ meta-training epochs and $(T - \gamma_i)$ auxiliary training epochs. The array of $\gamma_i$ is denoted as $\gamma = \{\gamma_i\}_{i=1}^{L}$, where $L$ is the total number of training units. Then the total number of training epochs is $T \times L$, and the whole training sequence $S$ can be expressed as follows:

$$S = \sum_{i=1}^{L}\big((T - \gamma_i)\,\mathcal{A} + \gamma_i\,\mathcal{M}\big). \quad (6)$$

By changing $T$ and $\gamma$, we can obtain different training sequences arranged with different frequencies and orders, which proves to be more manageable and effective compared with the training strategy used in TADAM.

Table 2: 5-way 1-shot / 5-way 5-shot image classification accuracies on CIFAR100 and CUB. '-': not reported.

| Methods | CIFAR100 | CUB |
|---|---|---|
| Nearest neighbor | 56.1% / 68.3% | 52.4% / 66.0% |
| Meta-Learn LSTM (Ravi and Larochelle) | - | 40.4% / 49.7% |
| Matching Network (Vinyals et al.) | 50.5% / 60.3% | 49.3% / 59.3% |
| MAML (Finn, Abbeel, and Levine) | 49.3% / 58.3% | 38.4% / 59.1% |
| ∆-encoder (Schwartz et al.) | 66.7% / 79.8% | 69.8% / 82.6% |
| Deep DTN (Ours) | 71.5% / 82.8% | 72.0% / 85.1% |

Intuitively, we would like to gradually add harder few-shot classification tasks into a series of simpler auxiliary classification tasks. Therefore the setting of $T$ and $\gamma$ is quite simple and straightforward. We choose $T = 5$ and $\gamma = [0, 0, 1, 1, 2, 2]$ for training DTN, though a more careful schedule may achieve better performance. The whole training sequence $S$ is therefore organized as follows:

$$S = 5\mathcal{A} \times 2 + (4\mathcal{A} + 1\mathcal{M}) \times 2 + (3\mathcal{A} + 2\mathcal{M}) \times 2. \quad (7)$$

Initially, the auxiliary tasks can be considered a simpler curriculum (Bengio et al. 2009); later, they bring regularization effects to the meta-tasks. Ablation studies show that, compared with the training strategy used in TADAM, DTN trained with OAT obtains better and more robust results with a faster convergence speed.
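To make Eqs. (6)-(7) concrete, a small helper that expands the chosen $T$ and $\gamma$ into a per-epoch task sequence might look as follows (a sketch; the function name and string labels are illustrative):

```python
def oat_schedule(T=5, gamma=(0, 0, 1, 1, 2, 2)):
    """Expand Eq. (6) into a per-epoch task list: each training unit consists of
    (T - gamma_i) auxiliary epochs followed by gamma_i meta-training epochs."""
    schedule = []
    for g in gamma:
        schedule += ["aux"] * (T - g) + ["meta"] * g
    return schedule

# With the defaults this reproduces Eq. (7):
# 5A, 5A, 4A+1M, 4A+1M, 3A+2M, 3A+2M -> 30 epochs in total.
```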

Experiments

Implementation Details

Dataset. The proposed method is evaluated on multiple datasets: miniImageNet, CIFAR100 and CUB. The miniImageNet dataset has been widely used for few-shot learning since it was first proposed by Vinyals et al.. There are 64, 16 and 20 classes for training, validation, and testing, respectively.


Figure 3: Visualization of generated samples, support samples, and real samples. The light dots indicate real samples, the shapes (circle, square, triangle, pentagon and diamond) with black borders indicate support samples (which are also real samples), and the shapes without borders indicate generated samples. There are 64 generated samples for each support sample. The top row shows the results of 3-way 1-shot learning and the bottom row shows the results of 3-way 5-shot learning. The data in the left two columns are from the training set and the data in the right two columns are from the testing set.

The hyper-parameters are optimized on the validation set; after that, the validation set is merged into the training set for the final results. The CIFAR100 dataset (Krizhevsky and Hinton 2009) contains 60,000 images of 100 classes. We use 64, 16, and 20 classes for training, validation, and testing, respectively. The CUB dataset (Wah et al. 2011) is a fine-grained dataset of 200 categories of birds. It is divided into training, validation, and testing sets with 100, 50, and 50 categories, respectively. The splits of CIFAR100 and CUB follow Schwartz et al..

Architectures. The feature extractor for DTN is a CNN with 4 convolutional modules. Each module contains a 3 × 3 convolutional layer with 64 channels followed by a batch normalization (BN) layer, a ReLU non-linearity, and a 2 × 2 max-pooling layer. The structure of the feature extractor is the same as in earlier methods, e.g., Vinyals et al.; Snell, Swersky, and Zemel, for fair comparisons. Many other works use deeper networks for feature extraction to achieve better accuracy, e.g., Munkhdalai et al.; Oreshkin, Rodríguez López, and Lacoste. To compare with them, we also implement our algorithm with the ResNet-12 architecture (He et al. 2015). The output of the feature extractor is a 1024-dimensional vector. The mapping function $\phi_1$ in the feature generator $G$ is a fully-connected (FC) layer with 2048 units followed by a leaky ReLU activation ($\max(x, 0.2x)$) and a dropout layer with a 30% dropout rate. The mapping function $\phi_2$ has the same settings as $\phi_1$ except that the number of units of the FC layer is 1024.
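Under the layer sizes stated above, the feature generator could be sketched in PyTorch roughly as follows (a sketch under those stated settings; the class name is illustrative, and initialization and other training details are not specified here):

```python
import torch.nn as nn

class DiversityTransferGenerator(nn.Module):
    """phi1: FC(1024 -> 2048) + LeakyReLU(0.2) + Dropout(0.3);
    phi2: FC(2048 -> 1024) with the same activation and dropout."""
    def __init__(self, feat_dim=1024, latent_dim=2048, p_drop=0.3):
        super().__init__()
        self.phi1 = nn.Sequential(nn.Linear(feat_dim, latent_dim),
                                  nn.LeakyReLU(0.2), nn.Dropout(p_drop))
        self.phi2 = nn.Sequential(nn.Linear(latent_dim, feat_dim),
                                  nn.LeakyReLU(0.2), nn.Dropout(p_drop))

    def forward(self, z_s, z_r1, z_r2):
        # Eq. (1): add the reference-pair offset to the support feature in latent space.
        return self.phi2(self.phi1(z_s) + self.phi1(z_r1) - self.phi1(z_r2))
```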

Results

Quantitative Results. Table 1 provides comparative results on the miniImageNet dataset. All these results are reported with 95% confidence intervals following the setting in Vinyals et al.. Under the 4-CONV feature extractor setting, our approach significantly outperforms the previous state-of-the-art works, especially on the 5-way 1-shot task. As for the comparisons with models using deep feature extractors, deep DTN also surpasses other alternatives in the 5-way 1-shot scenario and achieves very competitive results under the 5-way 5-shot setting. The results confirm that our feature generation method is extremely useful for addressing the problem of learning with scarce data, i.e., the 5-way 1-shot scenario. DTN is also one of the simplest and most lightweight feature generation methods that learn to enrich intra-class diversity, and it does not rely on any extra information from other datasets, such as the salient object information in Zhang, Zhang, and Koniusz.

Table 2 shows that DTN also obtains large improvements on the CIFAR100 and CUB datasets compared with the existing state of the art on both the 5-way 1-shot task and the 5-way 5-shot task, which confirms that DTN is generally useful for different few-shot learning scenarios.

Visualization Results. To better understand the results, Fig. 3 shows t-SNE (Maaten and Hinton 2008) visualizations of generated samples, support samples, and real samples. It can be seen that our method can greatly enrich the diversity of an unseen class with only a single or a few support examples given. Most of the generated samples fit the distribution of real samples, which means that the category information of each support sample is well preserved by the generated samples, and they are close to the center of the real distribution even when the support sample lies on the edge. From the diagrams of 3-way 5-shot learning, it can be seen that the features generated from 5 support samples can cover the major part of the real distribution, which facilitates building a more robust classifier for unseen classes.
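A visualization of this kind can be reproduced along the following lines with scikit-learn and matplotlib (a rough sketch; the exact plotting setup of Fig. 3 is not specified, so the function and its arguments are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(real_feats, gen_feats, real_labels, gen_labels):
    """Embed real and generated features together with t-SNE and color by class."""
    emb = TSNE(n_components=2).fit_transform(np.concatenate([real_feats, gen_feats]))
    n = len(real_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c=real_labels, alpha=0.3, label="real")
    plt.scatter(emb[n:, 0], emb[n:, 1], c=gen_labels, marker="x", label="generated")
    plt.legend()
    plt.show()
```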


Table 3: Ablation studies on the fluctuation of the results obtained by AT and OAT on miniImageNet. AT: auxiliary task co-training strategy used in TADAM. OAT: organized auxiliary task co-training. The representation of the training sequence follows the notation introduced in the previous section, and the training sequence of AT is completely determined by the random seed; every sequence has 30 training epochs in total, and the OAT sequence is the same for all seeds: 10A-4A-1M-4A-1M-3A-2M-3A-2M. The results show that, compared with the AT strategy (over 4% fluctuation), the model trained by OAT (less than 1.5% fluctuation) obtains better and more robust results.

| Random seed | Training sequence of AT | AT 5-way 1-shot | AT 5-way 5-shot | OAT 5-way 1-shot | OAT 5-way 5-shot |
|---|---|---|---|---|---|
| Seed #1 | 13A-1M-2A-1M-1A-1M-2A-1M-2A-1M-5A | 58.60% | 72.78% | 62.78% | 77.58% |
| Seed #2 | 11A-1M-7A-1M-7A-3M | 60.71% | 74.10% | 62.19% | 76.82% |
| Seed #3 | 13A-1M-9A-1M-4A-2M | 61.61% | 75.52% | 63.45% | 77.91% |
| Seed #4 | 18A-1M-2A-1M-1A-1M-2A-2M-1A-1M | 60.97% | 75.77% | 62.48% | 77.22% |
| Seed #5 | 14A-1M-4A-1M-3A-2M-3A-2M | 62.37% | 77.45% | 63.17% | 77.47% |

Table 4: Ablation studies for different feature generators on miniImageNet.

| | Methods | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|
| 1 | Gaussian noise generator | 60.14% ± 1.24% | 75.57% ± 0.96% |
| 2 | ∆-encoder† | 60.38% ± 1.12% | 73.44% ± 0.92% |
| 3 | DTN w/ two-stage training | 61.95% ± 0.85% | 76.52% ± 0.65% |
| 4 | DTN w/ OAT (Ours) | 63.45% ± 0.86% | 77.91% ± 0.62% |

† Our reimplementation, which outperforms the original.

Table 5: Ablation studies for different training strategies on miniImageNet. AT: auxiliary task co-training strategy used in TADAM. OAT: organized auxiliary task co-training.

| | Methods | 5-way 1-shot | 5-way 5-shot |
|---|---|---|---|
| 1 | DTN w/ naïve meta-training | 59.81% ± 1.36% | 74.97% ± 1.01% |
| 2 | DTN w/ two-stage training | 61.95% ± 0.85% | 76.52% ± 0.65% |
| 3 | DTN w/ AT | 62.65% ± 0.86% | 77.12% ± 0.65% |
| 4 | DTN w/ OAT (Ours) | 63.45% ± 0.86% | 77.91% ± 0.62% |

Ablation Study

In this section, we study the impact of the feature generator, the training strategy, and the number of generated features. We conduct the following ablation studies using models with the deep feature extractor on miniImageNet. All the results are summarized in Table 3, Table 4, Table 5 and Table 6.

First, we compare different feature generators. For the sake of fairness, we use exactly the same meta-classifier (cosine-similarity based classifiers) and the same training strategy (two-stage training); only the feature generators differ. All models are trained until convergence. Experiments show that the diversity transfer generator outperforms the Gaussian noise seeded generator (by 1.81% in 5-way 1-shot and 0.95% in 5-way 5-shot; Table 4, Row 1 and Row 3) and the ∆-encoder (by 1.57% in 5-way 1-shot and 3.08% in 5-way 5-shot; Table 4, Row 2 and Row 3).

Second, we study the effects of different training strategies. Clearly, OAT (Table 5, Row 4) surpasses naïve meta-training (Table 5, Row 1) and two-stage training (Table 5, Row 2).

Table 6: Ablation studies for DTN trained with different numbers of generated features on miniImageNet. Numbers in parentheses are the difference in meta-classification accuracy compared with the result with 64 generated features.

| Number of generated features | 5-way 1-shot | 5-way 5-shot |
|---|---|---|
| 2 | 60.71% (−2.74%) | 75.56% (−2.35%) |
| 4 | 61.18% (−2.27%) | 76.17% (−1.74%) |
| 16 | 61.87% (−1.58%) | 76.76% (−1.15%) |
| 32 | 62.58% (−0.87%) | 77.14% (−0.77%) |
| 64 | 63.45% (−0.00%) | 77.91% (−0.00%) |
| 96 | 63.09% (−0.36%) | 77.45% (−0.46%) |
| 128 | 63.26% (−0.19%) | 77.99% (+0.08%) |

As mentioned before, a large fluctuation is observed in the meta-classification accuracy (e.g., from 58.60% to 62.37% for 5-way 1-shot, and from 72.78% to 77.45% for 5-way 5-shot; see Table 3 for details) if DTN is trained with the auxiliary task sampled via a probability for 30 epochs. The result becomes better and more stable if we increase the total number of training epochs to 60 (Table 5, Row 3), but this is still worse than the result obtained by DTN trained with OAT in only 30 epochs (Table 5, Row 4). A comparison of the fluctuation of the results between the two training strategies is detailed in Table 3.

Finally, in Table 6 we study the impact of the number of generated features. The results gradually improve as the number of generated features increases. No further improvement is observed when the number of generated features exceeds 64. We attribute this to the fact that 64 generated features already fit the real sample distribution well.

Conclusion and Future Work

In this work, we propose a novel generative model, Diversity Transfer Network (DTN), for few-shot image recognition. It learns transferable diversity from the known categories and augments unseen categories through sample generation. DTN achieves competitive performance on three benchmarks. We believe that the proposed generative method can be utilized in various problems challenged by the scarcity of supervision information, e.g., semi-supervised learning, active learning and imitation learning. These interesting research directions will be explored in the future.


Acknowledgment

This work was supported by National Natural Science Foundation of China (NSFC) (No. 61733007, No. 61572207 and No. 61876212), National Key R&D Program of China (No. 2018YFB1402600) and HUST-Horizon Computer Vision Research Center.

References

[Bengio et al. 2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In International Conference on Machine Learning (ICML).
[Bloom 2000] Bloom, P. 2000. How children learn the meanings of words, volume 377. Citeseer.
[Dixit et al. 2017] Dixit, M.; Kwitt, R.; Niethammer, M.; and Vasconcelos, N. 2017. AGA: Attribute-guided augmentation. In Computer Vision and Pattern Recognition (CVPR).
[Finn, Abbeel, and Levine 2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML).
[Franceschi et al. 2018] Franceschi, L.; Frasconi, P.; Salzo, S.; Grazzi, R.; and Pontil, M. 2018. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning (ICML).
[Gidaris and Komodakis 2018] Gidaris, S., and Komodakis, N. 2018. Dynamic few-shot visual learning without forgetting. In Computer Vision and Pattern Recognition (CVPR).
[Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Neural Information Processing Systems (NIPS).
[Hariharan and Girshick 2017] Hariharan, B., and Girshick, R. 2017. Low-shot visual recognition by shrinking and hallucinating features. In International Conference on Computer Vision (ICCV).
[He et al. 2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
[Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS).
[Lee and Choi 2018] Lee, Y., and Choi, S. 2018. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning (ICML).
[Liu et al. 2018] Liu, B.; Wang, X.; Dixit, M.; Kwitt, R.; and Vasconcelos, N. 2018. Feature space transfer for data augmentation. In Computer Vision and Pattern Recognition (CVPR).
[Liu et al. 2019] Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S. J.; and Yang, Y. 2019. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations (ICLR).
[Maaten and Hinton 2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
[Movshovitz-Attias et al. 2017] Movshovitz-Attias, Y.; Toshev, A.; Leung, T. K.; Ioffe, S.; and Singh, S. 2017. No fuss distance metric learning using proxies. In International Conference on Computer Vision (ICCV).
[Munkhdalai and Yu 2017] Munkhdalai, T., and Yu, H. 2017. Meta networks. In International Conference on Machine Learning (ICML).
[Munkhdalai et al. 2018] Munkhdalai, T.; Yuan, X.; Mehri, S.; and Trischler, A. 2018. Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning (ICML).
[Oreshkin, Rodríguez López, and Lacoste 2018] Oreshkin, B.; Rodríguez López, P.; and Lacoste, A. 2018. TADAM: Task dependent adaptive metric for improved few-shot learning. In Neural Information Processing Systems (NIPS).
[Qi, Brown, and Lowe 2018] Qi, H.; Brown, M.; and Lowe, D. G. 2018. Low-shot learning with imprinted weights. In Computer Vision and Pattern Recognition (CVPR).
[Qiao et al. 2018] Qiao, S.; Liu, C.; Shen, W.; and Yuille, A. L. 2018. Few-shot image recognition by predicting parameters from activations. In Computer Vision and Pattern Recognition (CVPR).
[Ravi and Larochelle 2017] Ravi, S., and Larochelle, H. 2017. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).
[Rusu et al. 2019] Rusu, A. A.; Rao, D.; Sygnowski, J.; Vinyals, O.; Pascanu, R.; Osindero, S.; and Hadsell, R. 2019. Meta-learning with latent embedding optimization. In International Conference on Learning Representations (ICLR).
[Schwartz et al. 2018] Schwartz, E.; Karlinsky, L.; Shtok, J.; Harary, S.; Marder, M.; Kumar, A.; Feris, R.; Giryes, R.; and Bronstein, A. 2018. Delta-encoder: An effective sample synthesis method for few-shot object recognition. In Neural Information Processing Systems (NIPS).
[Simonyan and Zisserman 2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
[Snell, Swersky, and Zemel 2017] Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Neural Information Processing Systems (NIPS).
[Sung et al. 2018] Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Computer Vision and Pattern Recognition (CVPR).
[Vinyals et al. 2016] Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In Neural Information Processing Systems (NIPS).
[Wah et al. 2011] Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 dataset.
[Wang et al. 2018] Wang, Y.-X.; Girshick, R.; Hebert, M.; and Hariharan, B. 2018. Low-shot learning from imaginary data. In Computer Vision and Pattern Recognition (CVPR).
[Yan, Zhang, and He 2019] Yan, S.; Zhang, S.; and He, X. 2019. A dual attention network with semantic embedding for few-shot learning. In Association for the Advancement of Artificial Intelligence (AAAI).
[Yuille 2011] Yuille, A. L. 2011. Towards a theory of compositional learning and encoding of objects. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 1448–1455.
[Zhang et al. 2018] Zhang, R.; Che, T.; Ghahramani, Z.; Bengio, Y.; and Song, Y. 2018. MetaGAN: An adversarial approach to few-shot learning. In Neural Information Processing Systems (NIPS).
[Zhang, Zhang, and Koniusz 2019] Zhang, H.; Zhang, J.; and Koniusz, P. 2019. Few-shot learning via saliency-guided hallucination of samples. In Computer Vision and Pattern Recognition (CVPR).


[Zhu et al. 2017] Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV).
[Zitian Chen 2019] Chen, Z.; Fu, Y.; Wang, Y.-X.; Ma, L.; Liu, W.; and Hebert, M. 2019. Image deformation meta-networks for one-shot learning. In Computer Vision and Pattern Recognition (CVPR).