
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6994–7007, July 5-10, 2020. ©2020 Association for Computational Linguistics


Language to Network: Conditional Parameter Adaptation with Natural Language Descriptions

Tian Jin∗
IBM Thomas J. Watson Research Center
Yorktown Heights, New York 10598
[email protected]

Zhun Liu∗♦
Microsoft
Bellevue, WA 98004
[email protected]

Shengjia Yan
Tandon School of Engineering, New York University
Brooklyn, NY 11201
[email protected]

Alexandre Eichenberger
IBM Thomas J. Watson Research Center
Yorktown Heights, New York 10598
[email protected]

Louis-Philippe Morency
School of Computer Science, Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Abstract

Transfer learning using ImageNet pre-trained models has been the de facto approach in a wide range of computer vision tasks. However, fine-tuning still requires task-specific training data. In this paper, we propose N3 (Neural Networks from Natural Language) - a new paradigm of synthesizing task-specific neural networks from language descriptions and a generic pre-trained model. N3 leverages language descriptions to generate parameter adaptations as well as a new task-specific classification layer for a pre-trained neural network, effectively “fine-tuning” the network for a new task using only language descriptions as input. To the best of our knowledge, N3 is the first method to synthesize entire neural networks from natural language. Experimental results show that N3 can out-perform previous natural-language-based zero-shot learning methods across 4 different zero-shot image classification benchmarks. We also demonstrate a simple method to help identify keywords in language descriptions leveraged by N3 when synthesizing model parameters.1

1 Introduction

A person with generic world knowledge can learn to perform a new task based on verbal instructions. On the other hand, despite recent successes in deep learning, it remains challenging to re-purpose pre-trained visual classification models to recognize a new set of objects without labeling a new training dataset. A natural question emerges from this observation: can a computer also learn to recognize new objects, simply by reading descriptions of them in natural language? Concretely, can we create visual classifiers using language descriptions of the objects of interest?

In this paper, we introduce a new paradigm for synthesizing task-specific neural networks for image classification simply from language descriptions of the relevant objects. We propose N3 - Neural Networks from Natural Language, a meta-model that takes a list of object descriptions as input and produces a classification model for these objects, as illustrated in Figure 1. The capability of producing task-specific neural networks from language descriptions makes N3 ideal for a wide range of zero-shot tasks (Wah et al., 2011; Zhu et al., 2017; Elhoseiny et al., 2017; Nilsback and Zisserman, 2006) where we do not have visual training data but can still easily obtain language descriptions of the objects of interest.

∗ First two authors contributed equally.
♦ This work was performed when Zhun Liu was affiliated with Carnegie Mellon University.
1 Code is released at https://github.com/tjingrant/n3cr.

[Figure 1 shows N3 taking a list of dog-breed descriptions (e.g., for Chihuahua: "This breed has a well rounded, apple like shaped head with a muzzle that is tiny in contrast to the head.", along with Husky and Golden Retriever) and, in a single synthesis pass, producing a CNN that classifies dog images.]

Figure 1: An illustration of N3. A list of object class descriptions is fed to the N3 model to produce a CNN classifier that can classify images belonging to the object classes described.

Prior zero-shot learning methods aim to achieve a similar goal of applying pre-trained networks on unseen classes. In zero-shot image classification, a typical method of generalizing to unseen classes is to construct class embeddings to augment the classification layer of the pre-trained network, or to take retrieval-based approaches while utilizing generic visual features from pre-trained networks (Akata et al., 2015; Kodirov et al., 2017; Elhoseiny et al., 2013; Lei Ba et al., 2015; Reed et al., 2016).

Extending the idea of generating classification layers for the pre-trained network, N3 modifies the parameters of all layers in the pre-trained network. While the underlying pre-trained network in previous approaches only extracts generic visual features, N3 makes it possible to extract task-specific ones. This approach effectively increases the capacity at which semantic information in the descriptions can affect the pre-trained network.

In our experiments, we evaluated our proposed N3 method on 4 popular zero-shot image classification benchmark datasets. We performed ablation studies to understand the importance of synthesizing task-specific feature extractors, the necessity of a pre-trained visual classification model, and the effect of language representation choices on the efficacy of N3. In addition, we provide a simple approach to help interpret what aspects of the language descriptions N3-generated models examine when making predictions.

To summarize, our contributions are 3-fold:

1. We propose a novel meta-model, N3, to synthesize task-specific neural network models using natural language descriptions.

2. We demonstrate N3’s superior efficacy in solving natural language guided zero-shot learning problems. Our analysis shows that N3’s ability to tailor neural network models to extract task-specific features plays an important role in achieving such superior accuracy.

3. We show that N3 can aid the interpretation of the predictions of synthesized models, as they can be traced back to both supportive and refutative evidence within the language descriptions.

2 Related work

In this section, we situate our work in the context of zero-shot learning and dynamic parameter generation for neural networks.

Zero-shot Learning Zero-shot learning studies how we can generalize our models to perform well on tasks without any labeled training data at all. Achieving classification accuracy above chance level in such scenarios requires modeling the relationships between the classes seen during training and the classes unseen during testing. A typical and effective method is to manually engineer class attribute vectors to obtain representations of seen and unseen classes in a shared attribute space (Duan et al., 2012; Kankuekul et al., 2012; Parikh and Grauman, 2011; Zhang and Saligrama, 2016; Akata et al., 2016). Yet such a method requires laborious engineering of class attributes, which is not feasible for large-scale and/or fine-grained classification tasks (Russakovsky et al., 2015; Khosla et al., 2011; Welinder et al., 2010). Hence, there is also work in zero-shot learning that attempts to leverage textual data as object class representations (Lei Ba et al., 2015; Elhoseiny et al., 2013). The majority of these models are committed to embedding-based retrieval approaches, where classification is re-formulated as retrieving the class embedding with maximal similarity (Akata et al., 2015; Kodirov et al., 2017). While they handle well the case where there is an indefinite number of classes at test time, such approaches suffer from extra computation cost at inference time since they need to traverse all the seen data points. Moreover, these models often rely on a pre-trained feature extractor for input, which is usually fixed and cannot be further adapted for the unseen classes (Akata et al., 2015; Kodirov et al., 2017; Elhoseiny et al., 2013). While there is a handful of work that tries to modify the parameters in place at test time, it either relies on shallow language representations with limited expressiveness (Lei Ba et al., 2015) or requires a significant amount of textual descriptions per class to train the model (Reed et al., 2016), neither of which is ideal. In our work, we aim to learn a model that can synthesize parameters for entire neural networks to adapt to new tasks using short descriptions of the object classes. This leads to better metadata efficiency for our proposed method.


Dynamic Parameter Generation As mentioned before, N3 dynamically generates classification models for designated classes. Dynamic parameter generation has been explored in the context of generating recurrent cells at different time-steps of RNNs (Ha et al., 2016), constructing intermediate linear models for interpretability inside neural networks (Al-Shedivat et al., 2017), and contextual parameter generation for different language pairs in multilingual machine translation (Platanios et al., 2018). As mentioned in Section 2, some zero-shot learning methods can also be viewed as generating classifier parameters (Lei Ba et al., 2015; Elhoseiny et al., 2013). However, much of the previous work directly or indirectly mentions the challenge of memory and computation complexity - after all, the outputs of the parameter generation model are large, excessively high-dimensional matrices. To tackle this issue, previous work either generates only very simple linear layers (Lei Ba et al., 2015; Elhoseiny et al., 2013; Al-Shedivat et al., 2017) or imposes low-rank constraints on the weights to mitigate the memory issues (Ha et al., 2016). In our work, we utilize the architecture of sequence-to-sequence models and treat the weight matrices to be generated as sequences of vectors. This allows parameter generation for entire neural networks with little memory bottleneck.

3 N3 Methodology

In this section, we describe our approach for synthesizing task-specific neural networks to recognize new objects from their natural language descriptions and a generic pre-trained model. We denote a list of natural language descriptions for K objects, each containing L tokens, as D = {dk,l} with k = 1, ..., K and l = 1, ..., L. We denote a pre-trained model with parameters Θ as F(·; Θ), so that for a set of images X, F(X; Θ) produces the classification prediction Y.

We can now precisely formulate our problem as follows: given a pre-trained classification model F(·; Θ) and the natural language descriptions of K object classes D, adapt the original parameters Θ to specialized parameters Θ′ so that the fine-tuned model F(·; Θ′) accurately classifies the K objects described in D.

3.1 Synthesizing Task-specific Models via Parameter Adaptation

Our method draws inspiration from transfer learning, which is often employed when the training dataset is small. Transfer learning entails training a neural network from a generic pre-trained model to one that solves a new, often task-specific problem. Thus, we can similarly formulate N3 to synthesize task-specific model parameters by adapting the existing ones in the generic pre-trained model with the guidance of natural language task descriptions. While transfer learning updates the pre-trained parameters using signals derived from task-specific training data, N3 relies only on language descriptions to achieve the same objective. Concretely, the adapted parameters Θ′ are computed as follows:

Θ′ = Θ + µ · Φ(D; Γ) (1)

where Φ(·; Γ) is a function with parameters Γ, mapping natural language descriptions to parameter adaptations for all parameters in the pre-trained model. Since the transfer learning process often proceeds with a tiny learning rate to restrict the effect of fine-tuning, we introduced a trainable scaling factor µ to similarly regulate the effect of parameter adaptation. The initial value of µ is a hyper-parameter. In our experiments, we used an initial µ value of 1e−3 to mimic the effect of using a small learning rate for transfer learning. The specific value of 1e−3 is derived from the default learning rate used in the PyTorch transfer learning tutorial (Chilamkurthy). In a later section, we evaluate the necessity of this scaling factor as well as the effect of a range of initial values of µ.
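As a concrete illustration of Equation 1, the following PyTorch-style sketch shows how a generated adaptation could be scaled by µ and added to the frozen pre-trained parameters. It is a minimal sketch under assumptions: `phi` stands for the meta-model Φ(·; Γ) and is assumed to return a dictionary mapping parameter names to adaptation tensors; this is not the released implementation.

```python
import torch
import torch.nn as nn

# Trainable scaling factor mu, initialized to 1e-3 to mimic a small fine-tuning learning rate
mu = nn.Parameter(torch.tensor(1e-3))

def adapt_parameters(pretrained: nn.Module, descriptions, phi: nn.Module):
    """Compute Theta' = Theta + mu * Phi(D; Gamma) (Eq. 1) for every pre-trained parameter."""
    delta = phi(descriptions)  # assumed: dict {parameter name -> adaptation tensor}
    return {name: theta.detach() + mu * delta[name]
            for name, theta in pretrained.named_parameters()}
```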

3.2 Making Adaptation Computationally Feasible via Hierarchical Attention and Layer Sharing

The mapping Φ from natural language descriptions to parameter adaptations is particularly high-dimensional. Constructing Φ with the transformer block (Vaswani et al., 2017) is not straightforward for our scenario because it requires O(N²) memory, where N is the length of the input/output sequence and our N = K × L can be prohibitively large. Thus, to reduce the memory consumption of Φ, its attention span must be restricted to a small but semantically relevant subset of all input elements. To this end, we designed Φ to be a two-level hierarchy of transformer blocks, as illustrated in Figure 2. The first level of transformers, named the Tokens2Label Encoder, encodes the natural language description of each object class into a label embedding vector. Intuitively, this level summarizes the described visual features of an object class in a single, fixed-sized embedding vector. Thus, attention in this level has a maximum span of L, spanning all tokens in each description. The second level, named the Labels2Adaptation Encoder-Decoder, encodes the sequence of label embedding vectors and decodes them into parameter adaptations of multiple layers. This level of transformer blocks examines the characteristics of all encoded object classes and determines how to adapt the pre-trained model parameters to classify images corresponding to these object labels. In this level, the model attention has a maximum span of K, spanning all the object classes. Moreover, due to the sheer number of layers in state-of-the-art CNN models, we initialize layer-specific adaptation decoders for each layer and decode from shared hidden states encoded by the adaptation encoder. Only through these measures can we materialize the high-dimensional mapping Φ under reasonable memory constraints.

[Figure 2 depicts the two-level hierarchy: a Tokens2Label Encoder maps the tokens of each language description to a label embedding vector, and a Labels2Adaptation Encoder feeds per-layer Adaptation Decoders (A.D. 1, 2, 3) that emit parameter adaptations for Layers 1, 2, 3.]

Figure 2: Hierarchical encoder for descriptions and decoder for parameter adaptations.

Putting everything together, for a pre-trained network with parameters Θ, N3 applies the mapping Φ to the input descriptions to generate a parameter adaptation ∆Θ with the same shape as Θ. The adaptation ∆Θ is multiplied by a trainable scaling factor µ to control its impact on the pre-trained model. The scaled parameter adaptation µ · ∆Θ is then combined with its pre-trained counterpart Θ via point-wise addition.

The mapping Φ from object descriptions D to ∆Θ contains two parts. The Tokens2Label Encoder transforms the set of tokens contained in the K entries of object class descriptions, each with length up to L, denoted as {dk,l}, into a set of K object label embedding vectors {ck}, k = 1, ..., K:

ck = EncoderT2L(dk,1:L), k = 1, ...,K (2)

Subsequently, Labels2Adaptation Encoder-Decoders translate the object label embedding vectors into a parameter adaptation matrix ∆Θ. Parameters in a typical neural network often have more than two dimensions, but for simplicity in our setup, they are always viewed as a two-dimensional matrix consisting of a sequence of parameter columns.1 Viewing the object-label embeddings {ci}, i = 1, ..., K, as an input sequence and the parameter adaptation ∆Θ as an output sequence of M columns {∆θm}, m = 1, ..., M, the Labels2Adaptation Encoder-Decoders can be expressed as:

h1:K = EncoderL2A(c1:K) (3)

∆θm = Decoder^m_L2A(h1:K), m = 1, ..., M (4)

∆Θ = Concat([∆θ1; ∆θ2; ...; ∆θM ]) (5)

Finally, the pre-trained parameters Θ are adapted to Θ′ using the following equation, with µ being a trainable scaling factor:

Θ′ = Θ + µ ·∆Θ (6)
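A minimal sketch of this two-level mapping Φ, written with standard PyTorch transformer modules, is shown below. The layer counts, hidden size, mean-pooling in the Tokens2Label encoder, and learned decoder queries are our own simplifying assumptions for illustration, not the exact configuration used in the paper.

```python
import math
import torch
import torch.nn as nn

class N3Mapper(nn.Module):
    """Minimal sketch of the two-level mapping Phi (Eqs. 2-5); all sizes are illustrative."""
    def __init__(self, target_shapes, d_model=768, n_heads=8):
        super().__init__()
        # Tokens2Label encoder: attention spans the L tokens of one description (Eq. 2)
        self.t2l = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=1)
        # Labels2Adaptation encoder: attention spans the K label embeddings (Eq. 3)
        self.l2a_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=1)
        # One layer-specific decoder per adapted layer, sharing the encoded hidden states
        self.shapes = target_shapes                        # e.g. [K_out, C, kH, kW] per layer
        self.decoders, self.queries, self.heads = nn.ModuleList(), nn.ParameterList(), nn.ModuleList()
        for shape in target_shapes:
            m, col = shape[0], math.prod(shape[1:])        # M columns of size C*kH*kW each
            self.decoders.append(nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=1))
            self.queries.append(nn.Parameter(torch.randn(1, m, d_model) * 0.02))
            self.heads.append(nn.Linear(d_model, col))

    def forward(self, token_embs):                         # [K, L, d_model] BERT token features
        c = self.t2l(token_embs).mean(dim=1)               # Eq. 2: one label embedding per class
        h = self.l2a_enc(c.unsqueeze(0))                   # Eq. 3: shared hidden states h_{1:K}
        deltas = []
        for dec, q, head, shape in zip(self.decoders, self.queries, self.heads, self.shapes):
            cols = head(dec(q, h))                         # Eq. 4: decode M parameter columns
            deltas.append(cols.squeeze(0).view(*shape))    # Eq. 5: reshape columns to the layer shape
        return deltas                                      # one Delta-Theta tensor per adapted layer
```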

3.3 Training Methodology

We formulate the training of N3 as the following optimization problem. Optimal parameters Γ for the N3 model Φ(·; Γ) should map a list of language descriptions for K object classes D = {D1, · · · , DK} to parameter adaptations ∆Θ = Φ(D; Γ), such that the cross-entropy loss between the ground-truth label Y and the model prediction F(X; Θ, Φ(D; Γ)) is minimized for all image-label pairs (X, Y) in the training set.

Thus, to train N3 on an image dataset I with labels L, class descriptions D and a pre-trained model F to produce a K-way classification model, we first draw meta-batches of K class labels LK ⊂ L. Then, a subset of the image dataset ILK and of the description dataset DLK corresponding to the drawn labels LK are constructed. We then draw mini-batches of images B and ground-truth labels Y from ILK. For each mini-batch B, distinct parameter adaptations ∆Θ are generated by evaluating Φ(·; Γ) at DLK. The batch loss is then calculated as (1/|B|) ∑i ℓ(Yi, F(Xi; Θ, ∆Θ)), where ℓ(·, ·) refers to the cross-entropy loss. Since the meta-model Φ(·; Γ) and the pre-trained model F are fully differentiable, gradients can be propagated back to the meta-model to optimize the meta-model parameters Γ.

1 For instance, the parameter of a convolutional layer W of shape [K, C, kH, kW], where K is the number of output channels, C the number of input channels, and kH, kW the kernel height and width, is viewed as a sequence of K parameter columns, with a column size of C × kH × kW.
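The training procedure above can be summarized by the following sketch. The sampling helpers (`sample_class_labels`, `make_minibatches`), the dictionary-valued meta-model `phi`, and the use of `torch.func.functional_call` to run the pre-trained network with adapted weights are illustrative assumptions, not the authors' released training code.

```python
import torch
import torch.nn.functional as F

def train_n3(phi, pretrained, mu, images_ds, descriptions, labels, K, optimizer, meta_steps):
    """Sketch of N3 meta-training: sample K classes, adapt Theta, backprop into Phi and mu."""
    for _ in range(meta_steps):
        label_set = sample_class_labels(labels, K)                 # hypothetical helper
        desc_K = [descriptions[c] for c in label_set]              # descriptions D for the drawn labels
        for x, y in make_minibatches(images_ds, label_set):        # hypothetical helper
            delta = phi(desc_K)                                    # Delta-Theta = Phi(D; Gamma)
            adapted = {n: p.detach() + mu * delta[n]               # Theta' = Theta + mu * Delta (Eq. 6)
                       for n, p in pretrained.named_parameters()}
            logits = torch.func.functional_call(pretrained, adapted, (x,))
            loss = F.cross_entropy(logits, y)                      # batch cross-entropy loss
            optimizer.zero_grad()                                  # optimizer covers phi's parameters and mu
            loss.backward()                                        # gradients flow back into Gamma and mu
            optimizer.step()
```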

4 Experimental Setup

We evaluate N3 by comparing its efficacy in solving natural language guided zero-shot learning problems with prior state-of-the-art methods.

In this section, we introduce our training method, datasets and evaluation protocols.

4.1 Datasets

To evaluate N3, we selected 4 standard zero-shot image classification datasets and collected natural language descriptions for their object classes.

Caltech-UCSD-Birds 200-2011 (CUB) (Wah et al., 2011) contains images of 200 species of birds. Each species of bird forms its own class label. In total, there are 11,788 images in this dataset.

Animal with Attributes (AWA) (Lampert et al., 2014; Xian et al., 2017) is another dataset used to evaluate zero-shot classification methods. It consists of 50 classes of animals with a total of 37,322 images.

North America’s Birds (NAB) is a dataset used by prior state-of-the-art methods related to our task. Following established practices (Zhu et al., 2017; Elhoseiny et al., 2017), we consolidated the class labels into 404 distinct bird species. The consolidation process combines closely related labels (e.g., ‘American Kestrel (Female, immature)’ and ‘American Kestrel (Adult male)’) into a single label (e.g., ‘American Kestrel’). We end up with 48,000 images of 404 classes of bird species.

Flowers-Species (FS) is a dataset we built based on Oxford Flowers (Nilsback and Zisserman, 2006), another commonly used zero-shot dataset. The original contains label categories that are a mixture of species and genera. Some genera include thousands of species, yet the dataset examples only cover a fraction of them. Such a mismatch creates biases in the dataset that fundamentally cannot be addressed by learning from external descriptions, which undermines its utility as a test of our proposed method: for instance, when N3 is asked to generate a classifier to decide whether an object has the label “anthurium”, which is a genus of around 1,000 species of varying visual appearance, the efficacy of the generated model can only be evaluated on a representative sample covering most of the species within the genus “anthurium”. However, the dataset only contains a tiny number (i.e., 105) of correlated (species-wise) samples, making such an evaluation neither comprehensive nor conclusive in the context of our task objective, and it may introduce unexpected noise into the evaluation results. Therefore, we decided to filter out the genera from the original Oxford Flowers dataset, leaving only the species as class labels, as an effort towards homogenizing the sample space implied by the class labels and the image dataset. This leaves us with 55 classes and 3,545 images.

For each dataset, we collect language descriptions for the object classes from websites like Wikipedia. To collect language descriptions from Wikipedia, we use the python package Wikipedia (Goldsmith) to access a structured representation of Wikipedia pages, and extract the section content under “Description” to be used as the object class description. If no Wikipedia entry exists for a specific object class, we resort to manually searching for the object class description on Google. Furthermore, we truncate these textual excerpts to a maximum length of 512 tokens to avoid excessively long descriptions and the accompanying computational issues.
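A sketch of this collection step is given below, using the Wikipedia python package (Goldsmith); the fallback to the page summary, the error handling, and the use of a HuggingFace BERT tokenizer for the 512-token truncation are our own assumptions, not necessarily the authors' exact pipeline.

```python
import wikipedia                               # python package "Wikipedia" (Goldsmith)
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def fetch_description(class_name: str, max_tokens: int = 512) -> str:
    """Fetch the 'Description' section of a class's Wikipedia page, truncated to 512 tokens."""
    page = wikipedia.page(class_name)                      # structured Wikipedia page object
    text = page.section("Description") or page.summary     # assumed fallback if the section is missing
    tokens = tokenizer.tokenize(text)[:max_tokens]          # truncate overly long descriptions
    return tokenizer.convert_tokens_to_string(tokens)
```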

4.2 Evaluation Protocol

A recent study showed that previous zero-shot learning evaluation protocols are inadequate and proposed a set of rigorous evaluation protocols for attribute-based zero-shot learning methods (Xian et al., 2017). Although both our tasks and datasets differ, we nevertheless followed the guiding principles of Xian et al.’s Rigorous Protocol and developed our own evaluation protocol:

• Similar to Rigorous Protocol, we used two meta-splits, Standard Splits (SS) and Proposed Splits (PS), to evaluate all methods; the Standard Splits are established meta-splits, and the Proposed Splits are meta-splits that guarantee the exclusion of ImageNet-1K classes from the test set.

• Due to class imbalance, Rigorous Protocol proposes to use per-class averaged accuracy for more meaningful evaluation. Thus, to evaluate a meta-model on a meta-split containing the set of classes C, we calculate the per-class averaged accuracy as shown in Equation 7 (a short sketch of this metric appears after this list):

AccC = (1/|C|) ∑c∈C (#correctly predicted samples in c) / (#total samples in c) (7)

• Unlike Rigorous Protocol, which uses a ResNet-101 model, we use ResNet-18 as our pre-trained model; this choice helps reduce the output dimensionality of N3 by reducing the number of parameters N3 adapts.

• Our method is unique in that permutations of the classes belonging to the same meta-split count as distinct tasks; therefore, to account for variations, we test our models on 10 different permutations of the test set classes and report the median value of the relevant evaluation metric.
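As referenced above, the per-class averaged accuracy of Equation 7 can be computed as in this short sketch (plain Python, our own illustration, not tied to any particular framework):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Per-class averaged accuracy (Eq. 7): mean over classes of within-class accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Example: class 'a' is 100% correct, class 'b' 50% correct -> per-class accuracy 0.75
print(per_class_accuracy(['a', 'a', 'b', 'b'], ['a', 'a', 'b', 'a']))
```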

We have tabulated the number of class labels used for training, validation and testing in Table 1.

4.3 Baseline Models

PDCNN PDCNN (Lei Ba et al., 2015) is the most relevant prior method, as it uses the natural language descriptions of class labels to generate classification layers capable of distinguishing between the objects described. Note that PDCNN is distinct from our work in that it dynamically generates fully connected layers and/or additional convolutional layers to be appended to a pre-trained deep neural network (VGG-19), whilst ours generates parameter adaptations to be combined directly with all existing layers of the deep neural network, effectively “fine-tuning” the pre-trained model. We compare with two variants of PDCNN, with PDCNN-FC generating a fully connected layer only for classification and PDCNN-FC+Conv producing an additional convolutional layer to help with classification. To make our works comparable, we replaced the TF-IDF feature extractor in PDCNN with a BERT-based document embedding (specifically, a BERT token embedding followed by max-pooling) and changed the pre-trained model from VGG-19 to ResNet-18.
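One possible form of such a BERT-based document embedding (token embeddings followed by max-pooling) is sketched below; the specific HuggingFace model name and the use of the last hidden states are our assumptions for illustration, not necessarily the exact module used in the experiments.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def document_embedding(description: str) -> torch.Tensor:
    """BERT token embeddings followed by max-pooling over the token dimension."""
    inputs = tokenizer(description, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        token_states = bert(**inputs).last_hidden_state     # [1, L, 768] token embeddings
    return token_states.max(dim=1).values.squeeze(0)         # [768] document vector
```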

MEGAZSL Due to the scarcity of prior work leveraging natural language descriptions for zero-shot classification in a metadata-efficient way, we also adapted less metadata-efficient methods to function under stricter metadata-efficiency requirements. Specifically, ZSLPP (Elhoseiny et al., 2017), GAZSL (Zhu et al., 2017) and CorrectionNetwork (Hu et al., 2019) all utilize natural language object class descriptions to produce classifiers capable of distinguishing between images belonging to categories unseen during training. However, all of them require significantly more metadata: specifically, parts annotations of each sample image are used to provide extra supervision during training. Among these methods, GAZSL (Zhu et al., 2017) stands out as the most cited work; we therefore adapted the code released by its authors to learn from natural language metadata only, without using parts annotations; to distinguish our modified version from the original, we refer to it as MEGAZSL (Metadata-Efficient GAZSL). To make our works comparable, we also updated its language representation from TF-IDF to BERT-based ones and used ResNet-18 as the image feature embedding module. It is worth noting that CorrectionNetwork (Hu et al., 2019) is orthogonal to our work, as it is designed to improve any existing zero-shot classification task module; in its original setup, GAZSL (Zhu et al., 2017) was used as the main task module, which we do include in our experimental comparison.

For all experiments, hyper-parameters are tuned with the exact same algorithm (random search) and for the same number of runs (10).

Method                                 CUB-50       AWA2-10      NAB-40       FS-10
                                       SS    PS     SS    PS     SS    PS     SS    PS
PDCNN-FC (Lei Ba et al., 2015)         6.1   6.1    12.8  18.4   6.7   7.4    13.1  11.9
PDCNN-FC+Conv (Lei Ba et al., 2015)    7.5   6.5    22.6  17.2   8.9   5.6    7.7   13.6
MEGAZSL (Zhu et al., 2017)             2.9   1.8    14.0  10.4   2.8   3.7    10.3  12.3
N3 (Proposed)                          17.6  9.5    34.0  37.5   14.1  20.7   16.4  17.6

Table 2: Zero-Shot Classification Accuracy (defined by Eq. 7) on various dataset/meta-split combinations; the number of classification labels used in the test set is recorded next to the name of each dataset.

5 Results and Discussion

In this section, we compare the N3 method with prior work on 4 zero-shot image classification benchmark datasets. Furthermore, we provide a simple approach to help interpret what part of the language input is taken as evidence by the synthesized visual classifier when making predictions. We then show through ablation studies the importance of adapting all parameters of the pre-trained network, the necessity of pre-trained networks, and the effect of language representation choices on the success of N3.

5.1 Benchmark Evaluations

We report performance as per-class averaged accuracy, as shown in Table 2. In these experiments, we standardized the language representations, dataset splits, and other factors orthogonal to our model design to ensure fairness of comparison. We also include experimental results comparing N3 to models in their respective original settings in the Appendix. From Table 2, we can clearly observe that N3 outperforms all competing methods by a significant margin on all 8 dataset/meta-split combinations. Noticing the large performance gap between MEGAZSL and the original GAZSL, we performed additional investigations to pinpoint the cause: we reproduced one of their experiments (on the CUB dataset with the SCE meta-split) and replaced their TF-IDF module with a BERT document embedding module. The modified module performed noticeably better (from 10.3% to 11.3%); however, when we replaced its image feature embedding module, which is trained with additional parts annotations of bird images, the performance dropped significantly (from 11.3% to 3.3%), confirming our conjecture that such methods cannot easily be adapted to work without extra supervision in the form of additional data annotation.

5.2 Interpreting Model Predictions

In this section, we explore how the N3 model architecture can help interpret the predictions made by the adapted model. Specifically, the design of N3 is unique in that it examines all object class descriptions in order to adapt the neural network parameters. This means that N3 can adapt neural networks to seek both positive and negative evidence for an image to be classified. Naturally, we want to understand how the model is using these object class descriptions. In Figure 3, we present our findings by visualizing the magnitude of ∂E/∂Dij, where E is the loss value computed on a test example (in our experiment, this example is correctly predicted to be an Acadian Flycatcher) and Dij is the BERT representation of the j-th word in the i-th object class description. We present two patterns in the data that are indicative of the model behavior:

Top Positive Evidence: top positive evidence is identified as tokens in the description of the ground-truth label with large gradients. Intuitively, these tokens are the top supporting evidence that encourages the prediction of the correct label. We locate them by ranking all tokens in the description of the ground-truth label by the magnitude of their gradients and taking the top few. Top Negative Evidence: top negative evidence is identified as tokens with large gradients in descriptions of negative labels that also have a low predicted probability. Intuitively, these tokens are the keywords deemed important evidence when rejecting a label. We locate such evidence by first ranking the descriptions by the ratio between their largest token gradient and the softmax probability of the corresponding label. Then, within the set of class descriptions with the largest such ratios, we locate the tokens with the largest gradients.
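A minimal sketch of how these token-gradient magnitudes can be obtained is shown below; `end_to_end` is a hypothetical wrapper that re-generates the parameter adaptations from the (differentiable) description token embeddings via Φ and then classifies the image, and is not part of the released code.

```python
import torch
import torch.nn.functional as F

def token_gradient_magnitudes(end_to_end, token_embeddings, image, true_label):
    """Magnitude of d(loss)/d(D_ij) for every description token embedding D_ij (shape [K, L, d])."""
    token_embeddings = token_embeddings.clone().requires_grad_(True)
    logits = end_to_end(image, token_embeddings)     # adapt parameters from D via Phi, then classify
    loss = F.cross_entropy(logits, true_label)
    loss.backward()
    return token_embeddings.grad.norm(dim=-1)        # [K, L]: one gradient magnitude per token
```

Ranking these magnitudes within the ground-truth description gives the top positive evidence, while ranking descriptions by the ratio of their largest token gradient to the predicted class probability gives the top negative evidence, as described above.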


[Figure 3 contents. Top Positive Evidence ("The image is an Acadian Flycatcher because ..."), drawn from the ground-truth description (Acadian Flycatcher) and a related label (Northern Flicker): "the breast is washed with olive"; "adults have olive upperparts, darker on the wings and tail"; "they also have a call similar to that of the northern flicker". Top Negative Evidence ("The image is not a [tern, oriole, warbler] because ..."): "it is a small tern, 22-24 cm (8.7-9.4 in) long" (Least Tern, negative label); "immature males are yellow-orange on the breast" (Baltimore Oriole, negative label); "females feature a similar coloration pattern, but the black is replaced with light grey" (Golden Winged Warbler, negative label).]

Figure 3: When classifying an image of an Acadian Flycatcher, we visualized the magnitude of gradients of description tokens w.r.t. the loss. Higher magnitudes are colored red and lower magnitudes blue. Intuitively, words with a higher magnitude of gradient represent textual features that are more important for the classification decision.

Examples of both types of keyword evidence are presented in Figure 3 with their context. Several observations can be drawn about how N3 uses textual descriptions to make classification decisions. Firstly, we can observe that distinguishing features described with keywords such as “olive”, “darker”, “yellow” and “grey” are used to support both positive and negative identifications, which supports our conjecture that N3 examines all object class descriptions to make classification decisions, sometimes employing a process of elimination. Secondly, we can observe that some labels (or label word pieces) like “flicker” and “tern” are used as evidence to support or refute a classification decision, which seems to suggest that some task-specific knowledge is learned and employed in the language representations.

5.3 Ablation Study: Task-specificness of Synthesized Models

To account for the improved accuracy of models synthesized with whole-model parameter adaptations, we performed additional analysis to understand the features models are utilizing when making predictions. Two variants of N3 meta-models sharing the same set of hyper-parameters are trained on CUB-50 with the standard split. One is allowed to adapt the parameters of every layer whilst the other is ablated to only generate the classification layer. A saliency map (as described in (Simonyan et al., 2014)) visualizing the importance of each pixel w.r.t. predictions is plotted as a heatmap in Figure 4. Clearly, fully adapted models show a greater concentration on task-specific regions.

Figure 4: Saliency map showing the magnitude of gradients of the loss w.r.t. every pixel in the input image. Left: original image. Middle: saliency map from a model with every layer adapted by N3. Right: saliency map from a model where N3 is restricted to act on only its fc layer. Purple indicates small values while blue/green indicates large values.
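The saliency maps in Figure 4 are plain input gradients in the spirit of Simonyan et al. (2014); a minimal sketch (our own illustration, with the classifier passed in as an argument):

```python
import torch
import torch.nn.functional as F

def saliency_map(classifier, image, target_class):
    """Heatmap of |d(loss)/d(pixel)|, reduced over color channels, for one input image."""
    image = image.clone().requires_grad_(True)                        # shape [1, 3, H, W]
    loss = F.cross_entropy(classifier(image), torch.tensor([target_class]))
    loss.backward()
    return image.grad.abs().max(dim=1).values.squeeze(0)              # [H, W] saliency heatmap
```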

5.4 Ablation Study: Importance of Pre-trained Model

To adapt generic pre-trained model parameters to task-specific ones, N3 mixes the pre-trained model parameters with a set of generated model parameter adaptations, using a trainable mixing ratio as illustrated in Equation 6. It is natural to question whether the pre-trained parameters are necessary. In other words, can N3 generate parameters for a task-specific model from scratch and achieve reasonable classification accuracy? To study the importance of pre-trained model parameters, we experiment with a modified version of Equation 6:

Θ′ = γ · Θ + µ · ∆Θ

where γ is a fixed scalar weight given to the pre-trained model and µ is still the trainable scaling factor. The choice of γ and the initial value of µ affect the extent to which the pre-trained model contributes to the parameter-adapted one.

In our main experiments, we fixed γ to be 1 and initialized µ to 1e−3; this initialization proves to work well, yet it remains unclear to what extent the superior performance of N3 depends on these two hyperparameters. To answer this question, we vary γ and µ; in order to keep the magnitude of the parameters relatively stable, we always set the initial value of µ to 1 − γ across the different settings here. We trained N3 to produce a 50-way classification model with varying γ, and the resulting zero-shot classification performance is shown in Table 3. This ablation experiment demonstrates that setting a small initial µ is crucial for performance, yet smaller is not always better, re-affirming the crucial role of parameter adaptation in achieving good performance.

γ       initial µ   Acc@1
0.999   0.001       16.2 %
0.99    0.01        18.5 %
0.9     0.1         16.2 %
0.5     0.5         6.2 %
0.1     0.9         2.1 %
0.01    0.99        2.3 %

Table 3: Comparison of different γ and initial µ values. The pre-trained model is ignored when γ = 0.

Embedding Method    Acc @ 1
ELMo (paragraph)    4.1 %
ELMo (sentence)     5.0 %
GloVe               8.9 %
BERT                17.6 %

Table 4: Comparison of different embedding methods for N3 in the zero-shot classification task.

5.5 Ablation Study: Impact of Different Word Embeddings

To understand the importance of the pre-trained BERT module, and whether N3 can be generalized to work with pre-trained word embedding modules other than BERT, we compared the classification accuracy of models adapted by variants of N3 using different word embeddings. Concretely, we experimented with BERT (Devlin et al., 2018), ELMo (Peters et al., 2018) and GloVe (Pennington et al., 2014) embeddings. Results are shown in Table 4. While BERT prevails as expected, ELMo seems to under-perform GloVe. We hypothesize that ELMo might have been impacted by unexpectedly long sequence lengths (here each description is a paragraph of up to 512 tokens), since LSTM-based models are known to be worse at capturing long-range dependencies. To further investigate, we trained a variant of N3 where we limit the context of ELMo to each sentence instead of the entire paragraph. As expected, the performance increases noticeably, but the resulting performance is still not on par with the other methods. In contrast, GloVe embeddings work better since such static embeddings are not affected by long context windows.

6 Conclusion

In this paper, we have demonstrated that a small amount of unstructured natural language descriptions of object classes can provide enough information to fine-tune an entire pre-trained neural network to perform classification on class labels unseen during training. We have achieved state-of-the-art performance for natural language guided zero-shot image classification tasks on 4 public datasets with practical metadata requirements. In addition, we presented in-depth analysis and extensive ablation studies on various aspects of the model’s functioning mechanism and architecture design, showing the necessity of our design contributions in achieving good results.

Acknowledgment

This material is based upon work partially supported by the National Science Foundation (Awards #1750439 and #1722822) and the National Institutes of Health. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the National Institutes of Health, and no official endorsement should be inferred.


References

2019. albanie/convnet-burden: memory consumption and FLOP count estimates for convnets. https://github.com/albanie/convnet-burden.

2019. torchvision.models. https://pytorch.org/docs/stable/torchvision/models.html.

Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2016. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1425–1438.

Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. 2015. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936.

Maruan Al-Shedivat, Avinava Dubey, and Eric P. Xing. 2017. Contextual explanation networks. CoRR, abs/1705.10301.

Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. 2016. Synthesized classifiers for zero-shot learning. CoRR, abs/1603.00550.

Sasank Chilamkurthy. Transfer learning for computer vision tutorial.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Kun Duan, Devi Parikh, David Crandall, and Kristen Grauman. 2012. Discovering localized attributes for fine-grained recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3474–3481. IEEE.

Mohamed Elhoseiny, Ahmed M. Elgammal, and Babak Saleh. 2016. Write a classifier: Predicting visual classifiers from unstructured text descriptions. CoRR, abs/1601.00025.

Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. 2013. Write a classifier: Zero-shot learning using purely textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 2584–2591.

Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed M. Elgammal. 2017. Link the head to the "beak": Zero shot learning from noisy text description at part precision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6288–6297.

Jonathan Goldsmith. Wikipedia.

David Ha, Andrew M. Dai, and Quoc V. Le. 2016. Hypernetworks. CoRR, abs/1609.09106.

R. Lily Hu, Caiming Xiong, and Richard Socher. 2019. Correction networks: Meta-learning for zero-shot learning.

P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa. 2012. Online incremental attribute-based zero-shot learning. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3657–3664.

Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. 2011. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO.

Elyor Kodirov, Tao Xiang, and Shaogang Gong. 2017. Semantic autoencoder for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

B. Kulis, K. Saenko, and T. Darrell. 2011. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR 2011, pages 1785–1792.

C. H. Lampert, H. Nickisch, and S. Harmeling. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465.

Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. 2015. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4247–4255.

M-E. Nilsback and A. Zisserman. 2006. A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1447–1454.

Devi Parikh and Kristen Grauman. 2011. Relative attributes. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 503–510, Washington, DC, USA. IEEE Computer Society.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. CoRR, abs/1802.05365.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435. Association for Computational Linguistics.

Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2016. Less is more: zero-shot learning from online textual documents with noise suppression. CoRR, abs/1604.01146.

Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. 2016. Learning deep representations of fine-grained visual descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Bernardino Romera-Paredes and Philip H. S. Torr. 2015. An embarrassingly simple approach to zero-shot learning. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 2152–2161. JMLR.org.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 6000–6010.

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.

P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.

Yongqin Xian, Bernt Schiele, and Zeynep Akata. 2017. Zero-shot learning - the good, the bad and the ugly. CoRR, abs/1703.04394.

Z. Zhang and V. Saligrama. 2016. Zero-shot learning via joint latent similarity embedding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6034–6042.

Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, and Ahmed M. Elgammal. 2017. Imagine it for me: Generative adversarial approach for zero-shot learning from noisy texts. CoRR, abs/1712.01381.

A Impact of Different Base Model Architectures

In this section, we provide an additional ablation experiment using the Standard Split on the CUB dataset. Given the superior results N3 obtained on the above datasets using ResNet-18 as the base architecture, it is reasonable to wonder whether N3 can work effectively with other base model architectures. To answer this question, we trained N3 with 2 additional base model architectures, GoogLeNet and SqueezeNet, and compared them in terms of zero-shot classification accuracy. Results are tabulated in Table 5.

Based on the results shown in Table 5, we were surprised to find that SqueezeNet performs as well as ResNet-18 as N3's base model, and that GoogLeNet can even out-perform ResNet-18 as a base model. One explanation for the superior performance of GoogLeNet is that the parameters of batch normalization are not known to correlate with task semantics and therefore can hardly be fine-tuned. GoogLeNet, on the other hand, does not make use of batch normalization, thus avoiding the potential performance degradation caused by the lack of fine-tuning of batch normalization layers.

B Additional Comparison with Prior Work

In our main experimental evaluation section, we performed extensive experiments comparing our method with prior work using newly established rigorous zero-shot learning evaluation guidelines (Xian et al., 2017). To adhere to the stricter guidelines, and to compare fairly with prior work, we had to reproduce the results of prior methods ourselves; one may wonder how we compare with each of these prior methods in their own experimental setup, and, more importantly, how we compare with their originally published performance numbers. In this section we seek to assure readers that our method continues to perform noticeably better than prior methods in their original experimental setups. To begin with, we compare the zero-shot classification performance of N3 with prior work which also generates neural network parameters directly (Lei Ba et al., 2015).

To make our results comparable, we adopted the meta-training/testing splits from (Lei Ba et al., 2015), where in CUB-2010/CUB-2011 the 200 classes are randomly split into 160 "seen" classes for training and validation (within which 120 classes are randomly selected for training and 40 for validation) and 40 "unseen" classes for testing, and a 40-class classification model is generated by N3. Within the seen classes, 20% of the data is excluded since (Lei Ba et al., 2015) used it for testing to report classification results on seen classes, which is irrelevant to our comparison. Meta-models are trained using the 160 seen classes only. Similarly, in Oxford Flowers, the 102 classes of flowers are randomly split into 82 "seen" classes and 20 "unseen" classes. We used a ResNet-18 pre-trained on ImageNet as our base model. Note that while we did not adopt the larger VGG-19 base model of previous work, we show below that we are still able to outperform it significantly with the smaller ResNet-18 architecture. During test time, N3 generates neural networks based on the test-set class descriptions, and these generated networks are used for zero-shot evaluation.

Model Arch        Acc @ 1   ImageNet Test Acc. (tor, 2019)   Param Size (MB) (con, 2019)
GoogLeNet         21.3 %    69.78 %                          51
ResNet18          17.6 %    69.76 %                          45
SqueezeNetV1.1    17.7 %    58.19 %                          5

Table 5: N3 Zero-Shot Classification Accuracy on CUB-2011 with Different Base Model Architectures

As shown in Table 6, N3-generated ResNet-18 out-performs prior work by a significant margin, both in terms of average precision and classification accuracy, on both the CUB and Oxford Flowers datasets.

We then move on to evaluate the performance of N3 on the more rigorous and difficult zero-shot meta-splits of the CUB-2011 and NAB datasets. This split is called the SCE (super-category exclusive) split, which ensures that the parent categories of unseen classes are exclusive from those of the seen classes. Such a split is used in many prior works (Zhu et al., 2017; Elhoseiny et al., 2017; Changpinyo et al., 2016) to evaluate their proposed methods. We evaluated N3 using this meta-split and results are shown in Figure 5. As explained in the experimental evaluation section of our paper, later prior works (Zhu et al., 2017; Elhoseiny et al., 2017) use significantly more metadata (e.g., per-sample parts annotation) than N3, and yet N3 is able to exceed their classification accuracy using a generic ResNet-18 base model.

C Magnitude of Parameter Adaptation

In this section, we visualize in Fig. 6 the magnitude of N3's parameter adaptation per layer (quantified by the L2 norm of each layer's parameter adaptation tensor generated by N3) when adapting the pre-trained ResNet-18.
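The per-layer quantities plotted in Fig. 6 can be computed as in the short sketch below, assuming `delta` is a dictionary of adaptation tensors keyed by parameter name (an assumption of ours, mirroring the earlier sketches):

```python
def adaptation_norms(pretrained, delta):
    """Per-layer L2 norms of the generated adaptations and of the pre-trained parameters."""
    return {name: (delta[name].norm().item(), p.norm().item())
            for name, p in pretrained.named_parameters()}
```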

D Hyper Parameters

We tabulated the set of hyper-parameters used to produce our experimental results in Figure 7.

E Computational Resource Consumption

We used two types of machines - 4 x NVidia Pascal-100 and 4 x NVidia Volta-100 - depending on availability. It is worth noting that computation was not a bottleneck for our experiments, and the multi-GPU setup is primarily used for the larger total memory available, since transformers are known to consume a lot of memory when processing lengthy sequences. We conjecture that recent developments of more efficient transformer variants would enable N3 to train with fewer GPUs. The training time ranged from a few hours to a single day, depending on the size of the dataset and the computing power of the machine.


Dataset     Method                                  A.P.    Acc.@1   Acc.@5   Base Model
CUB-2010    DA (VGG) (Kulis et al., 2011)           0.037   -        -        VGG-19
            PDCNN(fc) (Lei Ba et al., 2015)         0.1     12.0%    42.8%    VGG-19
            PDCNN(conv) (Lei Ba et al., 2015)       0.043   -        -        VGG-19
            PDCNN(fc+conv) (Lei Ba et al., 2015)    0.08    -        -        VGG-19
            Ours (ResNet18)                         0.31    33.9%    77.0%    ResNet-18
CUB-2011    PDCNN(fc) (Lei Ba et al., 2015)         0.11    -        -        VGG-19
            PDCNN(conv) (Lei Ba et al., 2015)       0.085   -        -        VGG-19
            PDCNN(fc+conv) (Lei Ba et al., 2015)    0.13    -        -        VGG-19
            Ours (ResNet18)                         0.27    32.6%    68.6%    ResNet-18
Flowers     PDCNN(fc) (Lei Ba et al., 2015)         0.07    -        -        VGG-19
            PDCNN(conv) (Lei Ba et al., 2015)       0.054   -        -        VGG-19
            PDCNN(fc+conv) (Lei Ba et al., 2015)    0.067   -        -        VGG-19
            Ours (ResNet18)                         0.16    10.4%    39.7%    ResNet-18

Table 6: N3 Zero-Shot Classification Performance on CUB-2010, CUB-2011 and Oxford Flowers

Method                                   CUB Acc.@1   NAB Acc.@1
WAC-Linear (Elhoseiny et al., 2016)      5 %          -
WAC-Kernel (Elhoseiny et al., 2016)      7.7 %        6.0 %
ESZSL (Romera-Paredes and Torr, 2015)    7.4 %        6.3 %
ZSLNS (Qiao et al., 2016)                7.3 %        6.8 %
SynCfast (Changpinyo et al., 2016)       8.6 %        3.8 %
SynCOVO (Changpinyo et al., 2016)        5.9 %        -
ZSLPP (Elhoseiny et al., 2017)           9.7 %        8.1 %
GAZSL (Zhu et al., 2017)                 10.3 %       8.6 %
CorrectionNetwork (Hu et al., 2019)      10.0 %       9.5 %
Ours (ResNet18)                          11.9 %       11.1 %

Figure 5: Zero-Shot Classification Performance on CUB-2011/NAB with SCE-split.

[Figure 6 is a bar chart showing, for every parameter tensor of the pre-trained ResNet-18 (conv1, bn1, the conv/bn/downsample weights and biases of layer1 through layer4, and the fc weight and bias), the L2-norm of the parameter adaptations generated by N3 alongside the L2-norm of the corresponding pre-trained parameters; the vertical axis (L2-norm) ranges from 0.0 to 50.0.]

Figure 6: Magnitude of Parameter Adaptation


              (Max) Training Epoch   T2L Num Layer   L2A Num Layer   T2L LR       L2A LR
CUB-SS/PS     40                     2/1             1/1             3e-5/1e-5    5e-5/1e-5
AWA-SS/PS     20                     2/1             1/1             2e-6/9e-6    1e-5/2e-6
NAB-SS/PS     25                     2/2             1/1             2e-5/1e-5    2e-5/2e-5
FS-SS/PS      40                     2/2             1/1             2e-5/1e-6    1e-6/1e-5

Figure 7: Hyper-parameters