
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6102–6120, November 16–20, 2020. ©2020 Association for Computational Linguistics


Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation

    Minki Kang1∗ Moonsu Han1∗ Sung Ju Hwang1,2

KAIST1, Daejeon, South Korea; AITRICS2, Seoul, South Korea

    {zzxc1133, mshan92, sjhwang82}@kaist.ac.kr

Abstract

We propose a method to automatically generate domain- and task-adaptive maskings of the given text for self-supervised pre-training, such that we can effectively adapt the language model to a particular target task (e.g. question answering). Specifically, we present a novel reinforcement learning-based framework which learns the masking policy, such that using the generated masks for further pre-training of the target language model helps improve task performance on unseen texts. We use off-policy actor-critic with entropy regularization and experience replay for reinforcement learning, and propose a Transformer-based policy network that can consider the relative importance of words in a given text. We validate our Neural Mask Generator (NMG) on several question answering and text classification datasets using BERT and DistilBERT as the language models, on which it outperforms rule-based masking strategies, by automatically learning optimal adaptive maskings.1

    1 Introduction

The recent success of language model pre-training approaches (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2019; Raffel et al., 2019; Yang et al., 2019), which train language models on diverse text corpora with self-supervised or multi-task learning, has brought huge performance improvements on several natural language understanding (NLU) tasks (Wang et al., 2019; Rajpurkar et al., 2016). The key to this success is their ability to learn generalizable text embeddings that achieve near-optimal performance on diverse tasks with only a few additional steps of fine-tuning on each downstream task.

Most existing works on language models aim to obtain a universal language model that can address nearly the entire set of available natural language tasks on heterogeneous domains. Although this train-once and use-anywhere approach has been shown to be helpful for various natural language tasks (Devlin et al., 2019; Radford et al., 2019; Dong et al., 2019; Raffel et al., 2019), there has been considerable need for adapting the learned language models to domain-specific corpora (e.g. healthcare or legal). Such domains may contain new entities that are not included in the common text corpora, and may contain only a small amount of labeled data, as obtaining annotations on them may require expert knowledge. Some recent works (Sun et al., 2019a; Lee et al., 2019; Beltagy et al., 2019; Gururangan et al., 2020) suggest further pre-training the language model with self-supervised tasks on the domain-specific text corpus for adaptation, and show that it yields improved performance on tasks from the target domain.

∗ Equal contribution.
1 Code is available at github.com/Nardien/NMG.

The Masked Language Model (MLM) objective in BERT (Devlin et al., 2019) has been shown to be effective for the language model to learn knowledge of the language in a bi-directional manner (Vaswani et al., 2017). In general, masks in MLMs are sampled at random (Devlin et al., 2019; Liu et al., 2019c), which seems reasonable for learning a generic language model pre-trained from scratch, since it needs to learn about as many words in the vocabulary as possible in diverse contexts.

However, in the case of further pre-training of an already pre-trained language model, such a conventional selection method may make domain adaptation inefficient, since not all words will be equally important for the target task. Repeatedly learning uninformative instances will thus be wasteful. Instead, as done with instance selection (Ngiam et al., 2018; Jiang et al., 2018; Yoon et al., 2019; Zhu et al., 2019), it will be more effective if the masks focus on the most important words for the target domain, and for the specific NLU task at hand.



How can we then obtain such a masking strategy to train the MLMs?

Several works (Joshi et al., 2019; Sun et al., 2019b,c; Glass et al., 2019) propose rule-based masking strategies which work better than random masking (Devlin et al., 2019) when applied to language model pre-training from scratch. Based on those works, we assume that adaptation of the pre-trained language model can be improved via a learned masking policy which selects the words to mask. Yet, existing models are inevitably suboptimal since they do not consider the target domain and the task. To overcome this limitation, in this work, we propose to adaptively generate masks by learning the optimal masking policy for the given task, for the task-adaptive pre-training (Gururangan et al., 2020) of the language model.

As described in Figure 1, we want to further pre-train the language model on a specific task with a task-dependent masking policy, such that it directs the solution to the set of parameters that can better adapt to the target domain, while a task-agnostic random policy leads the model to an arbitrary solution. To tackle this problem, we pose the given learning problem as a meta-learning problem where we learn the task-adaptive mask-generating policy, such that the model learned with the masking strategy obtains high accuracy on the target task. We refer to this meta-learner as the Neural Mask Generator (NMG). Specifically, we formulate mask learning as a bi-level problem where we pre-train and fine-tune a target language model in the inner loop, and learn the NMG at the outer loop, and solve it using reinforcement learning. We validate our method on diverse NLU tasks, including question answering and text classification. The results show that the models trained using our NMG outperform the models pre-trained using rule-based masking strategies, and that our method finds a proper adaptive masking strategy for each domain and task.

    Our contribution is threefold:

• We propose to learn the mask-generating policy for further pre-training of masked language models, to obtain optimal maskings that focus on the most important words for the given text domain and the NLU task.

• We formulate the problem of learning the task-adaptive mask-generating policy as a bi-level meta-learning framework which learns the LM in the inner loop, and the mask generator at the outer loop using reinforcement learning.

• We validate our mask generator on diverse tasks across various domains, and show that it outperforms heuristic masking strategies by learning an optimal task-adaptive masking for each LM and domain. We also perform empirical studies on various heuristic masking strategies for language model adaptation.

Figure 1: Concept. Pre-training on domain text leads the language model parameters to adapt to the given target domain. We assume that adjusting the masking policy of the MLM objective affects the training trajectory of the language model, such that it moves towards a better solution space for the target domain. This illustration of the solution spaces for the two domains is motivated by Gururangan et al. (2020).

    2 Related Work

Language Model Pre-training  Ever since Howard and Ruder (2018) suggested language model pre-training with multi-task learning, inspired by the success of fine-tuning ImageNet pre-trained models on computer vision tasks (Liu et al., 2019b), research on representation learning for natural language understanding tasks has focused on obtaining a global language model that can generalize to any NLU task. A popular approach is to use self-supervised pre-training tasks for learning contextualized embeddings from large unannotated text corpora using auto-regressive (Peters et al., 2018; Yang et al., 2019) or auto-encoding (Devlin et al., 2019; Liu et al., 2019c) language modeling. Following the success of the Masked Language Model (MLM) from (Devlin et al., 2019; Liu et al., 2019c), several works have proposed different model architectures (Lan et al., 2019; Raffel et al., 2019; Clark et al., 2020) and pre-training objectives (Sun et al., 2019b,c; Liu et al., 2019c; Joshi et al., 2019; Glass et al., 2019; Dong et al., 2019) to improve upon its performance. Some works have also proposed alternative masking policies for MLM pre-training over random sampling, such as SpanBERT (Joshi et al., 2019)


and ERNIE (Sun et al., 2019b). Yet, none of the existing approaches have tried to learn a task-adaptive mask in a context-dependent manner, which is the problem we target in this work.

Language Model Adaptation  Pre-training the language model on the target domain, then fine-tuning on downstream tasks, is the simplest yet most successful approach for adapting the language model to a specific task. Some studies (Lee et al., 2019; Beltagy et al., 2019; Whang et al., 2019) have shown the advantage of further pre-training the language model on a large unlabeled text corpus collected from a specific domain. Moreover, Sun et al. (2019a) and Han and Eisenstein (2019) investigate the effectiveness of further pre-training of the language model on small domain-related text corpora. Recently, Gururangan et al. (2020) integrated prior works and defined domain-adaptive pre-training and task-adaptive pre-training, showing that domain adaptation of the language model can be done with additional pre-training with the MLM objective on a domain-related text corpus, as well as a smaller but directly task-relevant text corpus.

Meta-Learning  Meta-learning (Thrun and Pratt, 1998) aims to train the model to generalize over a distribution of tasks, such that it can generalize to an unseen task. There exist a large number of different approaches to train the meta-learner (Santoro et al., 2016; Vinyals et al., 2016; Ravi and Larochelle, 2017; Finn et al., 2017; Liu et al., 2019a). However, existing meta-learning approaches do not scale well to the training of large models such as masked language models. Thus, instead of existing meta-learning methods such as gradient-based approaches, we formulate the problem as a bi-level problem (Franceschi et al., 2018) of learning the language model in the inner loop and the mask at the outer loop, and solve it using reinforcement learning. Such optimization of the outer objective using RL is similar to the formulation used in previous works on neural architecture search (Zoph and Le, 2017; Zoph et al., 2018).

    3 Problem Statement

We now describe how to formulate the problem of learning to generate masks for the Masked Language Model (MLM) as a bi-level optimization problem. Then, we describe how we can reformulate it as a reinforcement learning problem.

3.1 Masked Language Model

For pre-training of the language models, we need an unannotated text corpus S = {s(1), · · · , s(T)}. Here s = [w1, w2, · · · , wN] is the context, whose element wi ∈ s is a single word or token. To formulate the meta-learning objective, we regard each context corpus as part of the task T = {S, Dtr, Dte}, consisting of the context and its corresponding task dataset. For the MLM, we need to generate a noisy version of s, which we denote as ŝ. Let zi be the indicator of masking the i-th word of s. If zi = 1, the i-th word is replaced with the [MASK] token. The objective of the MLM then is to predict the original token of each [MASK] token. Therefore, we can formulate this problem as follows:

$$\min_\theta \mathcal{L}_{MLM}(\theta; S, \hat{S}) = \sum_{s,\hat{s} \in S,\hat{S}} -\log p_\theta(\bar{s} \mid \hat{s})$$

where θ is the parameter of the language model, s̄ denotes the original tokens of the [MASK] tokens, and Ŝ is the set of masked contexts. Following the formulation from (Yang et al., 2019), we can approximate the MLM objective with zi as follows:

$$\max_\theta \log p_\theta(\bar{s} \mid \hat{s}) \approx \sum_{i=1}^{N} z_i \log p_\theta(w_i \mid \hat{s}) \quad (1)$$

$$p_\theta(w_i \mid \hat{s}) = \frac{\exp(H_\theta(\hat{s})_i^\top e(w_i))}{\sum_{w'} \exp(H_\theta(\hat{s})_i^\top e(w'))}$$

where Hθ(ŝ) indicates the contextualized representation of ŝ from the language model layers (e.g. pre-trained Transformer layers in (Devlin et al., 2019)), and e(wi) denotes the embedding of word wi from the last prediction layer. Our objective then is to learn the optimal policy for determining each mask indicator zi, which we describe in detail in the next subsection.
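To make the approximation in equation 1 concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of computing the MLM loss for a single context given binary mask indicators zi; the function and tensor names are our own, and a real implementation would batch contexts and reuse the language model's output embedding matrix.

```python
import torch
import torch.nn.functional as F

def mlm_loss(token_ids, z, mask_token_id, encoder, output_emb):
    """Approximate MLM loss of Eq. (1) for one context.

    token_ids:  LongTensor [N], original token ids w_1..w_N
    z:          BoolTensor [N], mask indicators (True -> replace with [MASK])
    encoder:    callable returning contextualized states H_theta(s_hat), shape [N, d]
    output_emb: output embedding matrix e(w), shape [V, d]
    """
    corrupted = token_ids.clone()
    corrupted[z] = mask_token_id                      # build the noisy context s_hat
    hidden = encoder(corrupted)                       # [N, d]
    logits = hidden @ output_emb.t()                  # [N, V]
    log_probs = F.log_softmax(logits, dim=-1)
    # sum_i z_i * log p_theta(w_i | s_hat): only masked positions contribute
    per_token = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return -(per_token * z.float()).sum()
```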

3.2 Bi-level formulation

We now describe how to formulate this learning problem as a bi-level problem consisting of inner- and outer-level objectives. Consider that zi can be represented using an arbitrary function F parameterized with λ:

$$\pi_\lambda(a_t = i \mid s) = F_\lambda(w_i) \quad (2)$$

$$a_\lambda = \arg\max_a \prod_t \pi_\lambda(a_t \mid s) \quad (3)$$

$$z_{\lambda,i} = \begin{cases} 1, & \text{if } i \in a_\lambda \\ 0, & \text{if } i \notin a_\lambda \end{cases} \quad (4)$$

where πλ(at = i|s) is the probability of masking the i-th word wi of s, aλ indicates the list of word indices to be masked, and zλ,i is the mask indicator parameterized by the parameter λ. The details of the corresponding equations will be described in Section 3.3. Therefore, the MLM objective changes slightly from its original form into the following objective:

$$\theta(\lambda) = \arg\min_\theta \mathcal{L}_{MLM}(\theta, \lambda; S, \hat{S}) = \arg\max_\theta \sum_{s \in S} \log p_\theta(\bar{s} \mid \hat{s}) \approx \arg\max_\theta \sum_{s \in S} \sum_{i=1}^{N} z_{\lambda,i} \log p_\theta(w_i \mid \hat{s}) \quad (5)$$

where the masked context ŝ is parameterized by the parameter λ. Now assume that we have found the optimal parameter θ(λ) for the language model from equation 5. Then, we need to fine-tune the language model on the downstream task. Although the linear heads used in pre-training and fine-tuning are different, we describe the parameters of both models as θ(λ) for simplicity. Following the bi-level framework notation described in (Franceschi et al., 2018), the inner objective function for fine-tuning can be written as follows:

$$\min_{\theta(\lambda)} \mathcal{L}_{train}(\theta(\lambda), \lambda) = \sum_{(x,y) \in D_{tr}} l(g_{\theta(\lambda)}(x), y) \quad (6)$$

where Dtr is the training dataset, l is the loss function of the supervised learning, and gθ(λ) is the function representation of the downstream task solver model. In the case of the question answering task, each x consists of a context and its corresponding question, and y is the corresponding answer span.

Assume that we find the optimal parameter θ∗(λ) from supervised task fine-tuning. Then, the final outer-level objective can be described as follows:

$$\begin{aligned} \min_\lambda \; & \mathcal{L}_{test}(\lambda, \theta^*(\lambda)) = \sum_{(x,y) \in D_{te}} l(g_{\theta^*(\lambda)}(x), y) \\ \text{s.t.} \; & \theta^*(\lambda) = \arg\min_{\theta(\lambda)} \mathcal{L}_{train}(\theta(\lambda), \lambda) \\ & \theta(\lambda) = \arg\min_\theta \mathcal{L}_{MLM}(\theta, \lambda) \end{aligned} \quad (7)$$

This will allow us to obtain the optimal parameter λ∗ which minimizes the task objective function on a test dataset Dte.

Although the outer objective is differentiable, we formulate the optimization of the outer objective as a reinforcement learning problem to avoid the excessive computation cost caused by the two constraint terms.

Justification of Reinforcement Learning  In this paragraph, we explain why we use reinforcement learning (RL) instead of a differentiable method to train the parameter λ. As indicated in equation 7, our inner loop includes two consecutive steps of language model training. The NMG model λ acts on the pre-training step rather than the task fine-tuning step. Therefore, the direct differentiation of the outer objective contains two second-derivative terms, for both the MLM loss LMLM and the task training loss Ltrain. With a single-step approximation, the derivative of the outer objective is approximated as follows:

$$\begin{aligned} \nabla_\lambda \mathcal{L}_{test}(\lambda, \theta^*(\lambda)) \approx \; & \nabla_\lambda \mathcal{L}_{test}(\lambda, \tilde{\theta}^*(\lambda)) \\ & - \alpha \nabla^2_{\lambda,\theta} \mathcal{L}_{MLM}(\theta, \lambda) \, \nabla_\theta \mathcal{L}_{test}(\tilde{\theta}^*(\lambda), \lambda) \\ & - \beta \nabla^2_{\lambda,\tilde{\theta}(\lambda)} \mathcal{L}_{train}(\tilde{\theta}(\lambda), \lambda) \, \nabla_{\tilde{\theta}(\lambda)} \mathcal{L}_{test}(\tilde{\theta}(\lambda), \lambda) \end{aligned}$$

where θ̃(λ) = θ − α∇θ LMLM(θ, λ) and θ̃∗(λ) = θ̃(λ) − β∇θ̃(λ) Ltrain(θ̃(λ), λ) are approximated parameters, and α, β are learning rates. Such gradient estimation incurs a high computational cost, since it requires Hessian-vector products over the massive language model's parameters, approximately 110 million of them (Finn et al., 2017; Devlin et al., 2019).

Instead, we can apply the first-order approximation (α = 0, β = 0) to the derivative of the outer objective to avoid the second-order derivative computation, as follows:

$$\nabla_\lambda \mathcal{L}_{test}(\lambda, \theta^*(\lambda)) \approx \nabla_\lambda \mathcal{L}_{test}(\theta)$$

where θ∗(λ) is approximated by θ. Such an approximation trivially results in a meaningless optimization, since it ignores the pre-training step induced by the parameter λ, which decides the masking policy (Liu et al., 2019a). Therefore, we solve this optimization problem with RL instead of a differentiable method to avoid this issue. In the next section, we introduce how we formulate this problem as RL with the outer objective as a non-differentiable reward.

    3.3 Reinforcement learning formulation

We now propose a reinforcement learning (RL) framework which, given the context s as the state, decides on the actions a = {a1, · · · , aT}, where T is the number of masked tokens in the given context, each at ∈ [1, N] is the token index that indicates a decision on masking the token w_{a_t} following equation 4, and a1 ≠ a2 ≠ · · · ≠ aT. The objective of the RL agent then is to find an optimal masking policy that minimizes the outer objective from Section 3.2. Minimizing it on Dte can be seen as maximizing the model's performance; therefore, the objective is the maximization of the performance of the model on Dte. We induce this by setting the reward R as the accuracy improvement on Dte. We describe the details of the reward design in Section 4.

In the general RL formulation (Sutton and Barto, 2018) following a Markov Decision Process (MDP), the state transition probability is described as p(s_{t+1}|s_t, a_t). The probability of masking T tokens is then formulated as p(ŝ|s) = ∏_{t=1}^{T} p(s_{t+1}|s_t, a_t), where ŝ contains T [MASK] tokens. Although the representations of s_t and s_{t+1} are slightly different because of the addition of a [MASK] token, we can approximate them as s_t ≈ s_{t+1} following the approximation in equation 1, which inductively approximates the representations after each word masking as the representation of the original context.

Therefore, we can approximate the probability of masking T tokens, turning the MDP problem into a problem without a state-transition formulation, as follows:

$$p(\hat{s} \mid s) = \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t) \approx \prod_{t=1}^{T} \pi(a_t \mid s)$$

where a_t denotes the index of a masked word included in the context s and π(·|s) is the masking policy of the agent. With this approximation, we do not need to consider the trajectory along the temporal horizon, which makes the problem much easier. Instead, we approximate the problem as the task of selecting multiple discrete actions simultaneously for the same state. We approximate the policy with a neural network parameterized by λ as πλ(a|s). As in equations 3 and 4, the mask zλ,i is determined by the actions generated from the neural policy πλ(a|s). In Section 4, we describe how to train the Neural Mask Generator to produce the neural policy πλ(a|s) that maximizes the reward R using RL.
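Since the factorized policy above drops the state transition, choosing the T mask positions reduces to drawing T distinct indices from one categorical distribution over the tokens of s. The sketch below is our own illustration of that selection (the `mask_ratio` argument and the greedy/sampling switch are assumptions); the greedy branch mirrors the meta-testing behavior described in Section 5, which takes the maximum-probability indices.

```python
import torch

def select_mask_positions(token_logits, mask_ratio=0.15, greedy=False):
    """Choose T distinct token indices to mask from per-token policy logits.

    token_logits: FloatTensor [N], one score per token of the context.
    Returns the actions a_1..a_T (LongTensor of T indices) and the binary
    mask indicators z of Eq. (4) (BoolTensor of length N).
    """
    n_tokens = token_logits.size(0)
    n_mask = max(1, int(round(mask_ratio * n_tokens)))
    probs = torch.softmax(token_logits, dim=-1)        # pi_lambda(a_t = i | s)
    if greedy:                                          # meta-testing: top-probability indices
        actions = probs.topk(n_mask).indices
    else:                                               # exploration: sample without replacement
        actions = torch.multinomial(probs, n_mask, replacement=False)
    z = torch.zeros(n_tokens, dtype=torch.bool)
    z[actions] = True
    return actions, z
```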

    4 Neural Mask Generator

In this section, we describe our model, the Neural Mask Generator (NMG), which learns a masking policy to generate an optimal mask on an unseen context using deep RL, along with detailed descriptions of the framework setup. The overview of the meta mask generator framework is shown in Figure 2. For detailed descriptions of the approach, the procedures of both the training and test phases, and the algorithm, please see Appendix A.

Figure 2: The overview of the meta-training framework for our Neural Mask Generator (NMG). In each episode, contexts masked by the NMG are used for further pre-training. Then, the further pre-trained language model is used in both the mask generator and fine-tuning.

4.1 Reinforcement Learning Details

Model Architecture  The probability of selecting the i-th word wi of s as the t-th action at can be described as πλ(at = i|s), where Σ_i πλ(at = i|s) = 1. Instead of Fλ in equation 2, the neural policy from the NMG is given as follows:

$$\pi_\lambda(a_t = i \mid s) = \frac{\exp(f_\lambda(H_{\theta'}(s)_i))}{\sum_{t} \exp(f_\lambda(H_{\theta'}(s)_t))}$$

where fλ is a deep neural network parameterized by λ, which has a self-attention layer (Vaswani et al., 2017) followed by linear layers with GELU activation (Hendrycks and Gimpel, 2016), and Hθ′(s)i is the contextualized representation of wi in context s from the frozen language model layers θ′. Note that θ′ is shared with the target language model θ and is not trained during the NMG training. Further, fλ(Hθ′(s)i) outputs the scalar logit of wi, and the final probabilistic policy is computed by the softmax function.
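For illustration, the policy head described above can be sketched as follows; the layer sizes, the number of attention heads, and the use of `nn.MultiheadAttention` are our assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MaskPolicyHead(nn.Module):
    """f_lambda: maps frozen-LM hidden states [batch, N, d] to a masking policy over N tokens."""

    def __init__(self, hidden_dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),      # scalar logit f_lambda(H_theta'(s)_i) per token
        )

    def forward(self, hidden_states):      # hidden_states: H_theta'(s), shape [batch, N, d]
        attended, _ = self.attn(hidden_states, hidden_states, hidden_states)
        logits = self.mlp(attended).squeeze(-1)     # [batch, N]
        return torch.softmax(logits, dim=-1)        # pi_lambda(a_t = i | s)
```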

Training Objective  We train the NMG model using the Advantage Actor-Critic method (Mnih et al., 2016) with a value estimator. Furthermore, we resort to off-policy learning (Degris et al., 2012) such that the agent can explore the optimal policy based on its past experiences. To this end, we leverage a prioritized experience replay, in which we store every (state, action, reward, old-policy) pair (Mnih et al., 2013) and sample them based on the absolute value of their advantage (Schaul et al., 2016). We use importance sampling to estimate the value of the target policy πλ with the samples from the behavior policy πλold, which are sampled from the replay buffer (Degris et al., 2012; Schulman et al., 2015, 2017; Wang et al., 2017). To sum up, the objectives for both the policy and value networks are as follows:

$$\mathcal{L}_{policy} = \sum_{(s,a,R,\pi_{\lambda_{old}})} -\frac{\pi_\lambda(a \mid s)}{\pi_{\lambda_{old}}(a \mid s)} (R - V_\lambda(s)) - \alpha H(\pi_\lambda(a \mid s)) \quad (8)$$

$$\mathcal{L}_{value} = \sum_{(s,a,R,\pi_{\lambda_{old}})} \frac{1}{2} (R - V_\lambda(s))^2 \quad (9)$$

where (s, a, R, πλold) is the set of sampled replays from the replay buffer, H is the entropy function H(πλ(a|s)) = −Σ_t πλ(at|s) log πλ(at|s), and α is a hyperparameter for entropy regularization. Vλ(s) is the estimated value of the state s. The value network consists of linear layers with an activation function applied after mean pooling (1/N) Σ_t Hθ′(s)_t.

To summarize, the outer-level objective for updating the NMG parameter λ can be written as follows:

$$\min_\lambda \; \mathcal{L}_{policy} + \mathcal{L}_{value} \quad (10)$$

At each episode of meta-training, we update the NMG parameter λ by optimizing the above objective function.
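As a rough sketch of how equations 8–10 translate into a mini-batch loss (our paraphrase, not the released implementation; detaching the advantage in the policy term is our own choice), one could write:

```python
import torch

def nmg_loss(new_action_probs, old_action_probs, policy_probs, rewards, values,
             entropy_coef=0.01):
    """Off-policy actor-critic objective of Eqs. (8)-(10) for a batch of replays.

    new_action_probs: pi_lambda(a_t|s) of the stored actions under the current policy, [B, T]
    old_action_probs: pi_lambda_old(a_t|s) recorded in the replay buffer,               [B, T]
    policy_probs:     full current policy over all N tokens (for the entropy term),     [B, N]
    rewards:          scalar reward R per replay, [B]
    values:           value estimates V_lambda(s), [B]
    """
    advantage = rewards - values                                          # R - V_lambda(s)
    ratio = (new_action_probs / old_action_probs.clamp_min(1e-8)).prod(dim=-1)
    entropy = -(policy_probs * policy_probs.clamp_min(1e-8).log()).sum(dim=-1)
    policy_loss = -ratio * advantage.detach() - entropy_coef * entropy    # Eq. (8)
    value_loss = 0.5 * advantage.pow(2)                                   # Eq. (9)
    return (policy_loss + value_loss).mean()                              # Eq. (10)
```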

Reward Design and Self-Play  As in Section 3.3, the reward is defined as the accuracy improvement on the test set Dte. Therefore, the pre-training step (equation 5) and the fine-tuning and evaluation step (equations 6, 7) must be carried out to obtain the reward in every episode.

Since using the full-size dataset in the inner loop is generally not feasible, we randomly sample a smaller sub-task T′ = {S′, D′tr, Dval} from T at every episode. For the evaluation, we randomly split the training set Dtr to generate a hold-out validation set Dval and replace Dte in equation 7 with Dval during meta-training, where Dte is unobservable. We use a sufficiently large hold-out validation set Dval to prevent the masking policy from overfitting to Dval. We assume that the meta-learner NMG also performs well on T if it is trained on diverse sub-tasks T′ where |T| ≫ |T′|.

A problem to be considered when using diverse sub-tasks is that the NMG model encounters a different sub-task T′ at every new episode. Since T′ determines the state distribution S′ and the data Dtr to be trained on, this results in a reward scale problem: the expected validation accuracy on Dval varies depending on the composition of T′, which makes it harder to evaluate the performance improvement of the neural policy across episodes.

To address this problem, we introduce a random policy as an opponent policy and evaluate the neural policy relative to it. The reward R is therefore defined as sgn(r − b), where sgn is the sign function and r and b are the accuracy scores on Dval from the neural and random policies, respectively. However, the random policy may be too weak an opponent. To overcome this limitation, we add another neural policy as an additional opponent, inducing a zero-sum game between two learning agents following the concept of the self-play algorithm (Wei et al., 2017; Grau-Moya et al., 2018).

Then, the three distinct policies are compared with each other during episodes, and the two neural policies are trained individually. Furthermore, when training each neural agent, only the actions that are disjoint from those of the other policies are stored in the global replay buffer, for a more accurate reward assignment to each action.
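The relative reward itself is simple to compute; the helper below is our reading of the scheme (the per-action disjointness bookkeeping of Algorithm 3 in Appendix A is omitted, and the argument names are hypothetical).

```python
def assign_reward(acc_neural, acc_random, acc_opponent=None):
    """sgn-based relative reward for the neural policy against its opponents.

    acc_neural:   validation accuracy obtained with the neural policy's masks
    acc_random:   accuracy obtained with the random opponent policy
    acc_opponent: accuracy of the self-play neural opponent, if one is used
    """
    sgn = lambda x: (x > 0) - (x < 0)        # sign function: -1, 0, or +1
    reward = sgn(acc_neural - acc_random)
    if acc_opponent is not None:             # self-play: also compare against the other neural agent
        reward = min(reward, sgn(acc_neural - acc_opponent))
    return reward
```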

Continual Adaptive Learning  For a fair comparison at every episode, the same language model should be used to evaluate the policy. Initializing the language model before the start of each inner loop is the simplest way to handle this. However, since we pre-train the language model for only a few steps during each episode of meta-training, the model would then always be evaluated around the original language model's domain (see Figure 1). To avoid this, at each episode the language model pre-trained by the NMG model in the previous episode is loaded instead of the fixed checkpoint, except for the first episode. In this way, we intend for our agent to learn the optimal policy over various environments. Furthermore, the agent can learn a dynamic policy based on how far the target language model has been adapted.

    5 Experiment

We now experimentally validate our Neural Mask Generator (NMG) model on multiple NLU tasks, including question answering and text classification, using two different language models, and analyze its behavior. In Section 5.1, we evaluate the NMG against several baselines. Then, we evaluate the effect of the specific design choices made for our model through ablation studies in Section 5.2.


Finally, we analyze how the policy learned by the NMG model works in Section 5.3.

Tasks and Datasets  For question answering, we use three datasets, namely SQuAD v1.1 (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), and emrQA (Pampari et al., 2018), to validate our model. We use the MRQA2 version of both SQuAD and NewsQA to sustain coherency. We also preprocess emrQA to fit the format of the other datasets. We use the standard evaluation metrics, Exact Match (EM) and F1 score, for the question answering task. For text classification, we use IMDb (Maas et al., 2011) and ChemProt (Kringelum et al., 2016), following the experimental settings of (Gururangan et al., 2020).

Baselines  According to Devlin et al. (2019) and Joshi et al. (2019), training language models with difficult objectives is much more beneficial when pre-training from scratch. To test whether this is also the case for task-adaptive pre-training, we experiment with two heuristic masking strategies, which we refer to as whole-random and span-random. In addition, we also test the named entity masking proposed by Sun et al. (2019b) (entity-random). Below is the complete list of the heuristic baselines we compare against in the experiments.

1) No-PT  A baseline without any further pre-training of the language model.

2) Random  A random masking strategy introduced in BERT (Devlin et al., 2019).

3) Whole-Random  A random masking strategy which masks the entire word instead of the token (sub-word). This method is introduced by the authors of BERT (Devlin et al., 2019)3.

4) Span-Random  A random masking strategy which selects multiple consecutive tokens.

5) Entity-Random  A random masking strategy which selects named entities with highest priority, then randomly selects other tokens.

6) Punctuation-Random  A random masking strategy which selects punctuation tokens first, then randomly selects other tokens.

Implementation Details  For the language model θ, we use the same hyperparameters and architectures as the DistilBERT (Sanh et al., 2019) model (66M parameters) and the BERTBASE (Devlin et al., 2019) model (110M parameters). Our implementation is based on huggingface's PyTorch implementation (Wolf et al., 2019; Paszke et al., 2019). We load the pre-trained parameters from the checkpoint of each language model in meta-testing and in the first episode of meta-training. As the text corpus S to pre-train the language model, we use the collection of contexts from the given NLU task. We use only the Masked Language Model (MLM) objective for further pre-training. In the initial stage of meta-training, the NMG randomly selects actions for exploration. In meta-testing, it takes the maximum-probability indices as actions. We describe the details of language model training in Appendix B, since we use different settings for each task and experiment. As for reinforcement learning (RL), we use the off-policy actor-critic method described in Section 4. For more details of the reinforcement learning framework, please see Appendix B.

2 https://mrqa.github.io/
3 https://github.com/google-research/bert
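As a concrete reference point for this setup, a single step of MLM further pre-training with the Hugging Face Transformers and PyTorch stack might look like the sketch below; the checkpoint name, the example context, the hand-picked mask positions, and the learning rate are illustrative assumptions rather than the paper's configuration, and in the actual framework the positions would come from the NMG policy or one of the heuristic baselines.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

context = "A myocardial infarction, also known as a heart attack, occurs when blood flow stops."
enc = tokenizer(context, return_tensors="pt")
input_ids = enc["input_ids"].clone()

mask_positions = torch.tensor([2, 9])               # would be chosen by the masking policy
labels = torch.full_like(input_ids, -100)           # -100 positions are ignored by the MLM loss
labels[0, mask_positions] = input_ids[0, mask_positions]
input_ids[0, mask_positions] = tokenizer.mask_token_id

outputs = model(input_ids=input_ids, attention_mask=enc["attention_mask"], labels=labels)
outputs.loss.backward()                             # one further pre-training step
optimizer.step()
```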

    5.1 Results

First of all, we need to discuss how the MLM works in language model pre-training. In practice, the Cloze task (Taylor, 1953) benefits when the words that need to be learned are masked. However, for the MLM, we observed that masking prevents the language model from learning the masked words themselves. Rather, the MLM learns representations of, and relations between, the non-masked words by using them to predict the masked words. In the case of adaptive pre-training, learning the domain-specific vocabulary is crucial for domain adaptation. Therefore, masking out trivial words (e.g. punctuation) may be more beneficial than masking out unique words (e.g. named entities) for the language model to learn the knowledge of a new domain. To see how this works, we experiment with punctuation-random, which masks punctuation tokens, which are clearly useless for domain adaptation.

In Tables 1 and 2, we report the performance of the baselines and our model on both question answering and text classification tasks. From the baseline results in Table 1, we can speculate on the important aspects of a masking strategy for better adaptation. The whole-word, span, and entity maskings often lead to better results since they make the MLM objective more difficult and meaningful (Joshi et al., 2019; Sun et al., 2019b). For instance, in emrQA, most words are tokenized into sub-words since contexts include many unique words such as medical terminology. Therefore, whole-word masking could be most suitable for domain adaptation on emrQA.


Table 1: Performance of various masking strategies on the QA tasks. We run the model three times with different random seeds and report the average performance, with standard deviations after the ± sign. The numbers in bold fonts denote best scores, and the numbers with underlines denote the second best scores.

Base LM: BERT
Model    | SQuAD EM     | SQuAD F1     | emrQA EM     | emrQA F1     | NewsQA EM    | NewsQA F1
No PT    | 80.63 ± 0.32 | 88.18 ± 0.25 | 70.58 ± 0.09 | 76.60 ± 0.29 | 51.66 ± 0.31 | 66.23 ± 0.16
Random   | 80.64 ± 0.02 | 88.14 ± 0.10 | 71.40 ± 1.01 | 77.31 ± 0.78 | 51.46 ± 0.19 | 66.21 ± 0.22
Whole    | 80.73 ± 0.23 | 88.20 ± 0.11 | 71.65 ± 0.33 | 77.84 ± 0.31 | 51.12 ± 0.28 | 65.94 ± 0.51
Span     | 80.66 ± 0.19 | 88.19 ± 0.11 | 70.87 ± 0.38 | 76.76 ± 0.33 | 51.17 ± 0.63 | 65.93 ± 0.50
Entity   | 80.79 ± 0.13 | 88.23 ± 0.22 | 71.50 ± 0.34 | 77.57 ± 0.27 | 51.55 ± 0.43 | 65.44 ± 0.34
Punct.   | 80.60 ± 0.24 | 88.07 ± 0.25 | 71.17 ± 0.58 | 77.08 ± 0.57 | 51.47 ± 0.37 | 66.18 ± 0.36
NMG      | 80.83 ± 0.20 | 88.28 ± 0.21 | 71.84 ± 0.68 | 77.49 ± 0.55 | 51.81 ± 0.33 | 66.57 ± 0.48

Base LM: DistilBERT
Model    | SQuAD EM     | SQuAD F1     | emrQA EM     | emrQA F1     | NewsQA EM    | NewsQA F1
No PT    | 76.75 ± 0.41 | 85.13 ± 0.26 | 68.52 ± 0.39 | 75.00 ± 0.53 | 48.61 ± 0.39 | 63.45 ± 0.56
Random   | 76.46 ± 0.35 | 84.92 ± 0.17 | 69.02 ± 0.40 | 75.70 ± 0.29 | 48.52 ± 0.46 | 63.06 ± 0.21
Whole    | 76.48 ± 0.27 | 84.96 ± 0.15 | 69.64 ± 0.39 | 76.16 ± 0.24 | 48.22 ± 0.41 | 62.92 ± 0.33
Span     | 76.73 ± 0.13 | 84.96 ± 0.13 | 69.54 ± 0.21 | 76.11 ± 0.33 | 48.59 ± 0.09 | 63.15 ± 0.39
Entity   | 76.34 ± 0.36 | 84.78 ± 0.21 | 69.25 ± 0.43 | 75.98 ± 0.39 | 48.19 ± 0.20 | 62.97 ± 0.16
Punct.   | 76.85 ± 0.09 | 85.16 ± 0.03 | 69.41 ± 0.21 | 76.14 ± 0.18 | 48.58 ± 0.26 | 63.14 ± 0.20
NMG      | 76.93 ± 0.32 | 85.30 ± 0.23 | 69.98 ± 0.37 | 76.51 ± 0.45 | 48.75 ± 0.08 | 63.55 ± 0.14

Table 2: Performance of various masking strategies on the text classification tasks. The presentation format is the same as in Table 1.

Base LM: BERT
Model    | ChemProt Acc | IMDb Acc
No PT    | 80.40 ± 0.70 | 92.28 ± 0.05
Random   | 81.25 ± 0.72 | 92.45 ± 0.21
Whole    | 80.18 ± 1.20 | 92.55 ± 0.04
Span     | 78.06 ± 1.72 | 92.40 ± 0.10
Entity   | 79.68 ± 1.32 | 92.38 ± 0.13
Punct.   | 79.68 ± 0.30 | 92.40 ± 0.18
NMG      | 81.66 ± 0.37 | 92.53 ± 0.03

Table 3: Ablation results on the NewsQA dataset.

Base LM: DistilBERT (NewsQA)
Model                      | EM           | F1
NMG                        | 48.75 ± 0.08 | 63.55 ± 0.14
NMG w/o Self-Play          | 48.64 ± 0.27 | 63.37 ± 0.17
NMG w/o Continual-Adaptive | 48.52 ± 0.12 | 63.06 ± 0.20

In contrast, such maskings sometimes lead to worse results than random masking in some domains, such as NewsQA. In particular, in the case of DistilBERT, punctuation masking performs better than the others. These results suggest that masking complicated words can actually hinder the adaptation of a small language model.

On the other hand, our NMG learns the optimal masking policy for a given task and domain in an adaptive manner on any language model. In Tables 1 and 2, we can see that this adaptive characteristic of our model allows the neural masking to achieve better, or at least comparable, performance relative to the baselines on all tasks. We further analyze the learned masking strategy in Section 5.3.

    5.2 Ablation study

Effectiveness of Self-Play  We further investigate the effectiveness of self-play by comparing against the NMG model without self-play, where the model only competes with the random agent. We validate this on the NewsQA dataset. The result in Table 3 shows that the NMG model with self-play obtains better performance than its counterpart without self-play. This result verifies that competing with the opponent neural agent during learning helps the NMG model learn a better policy.

Continual Adaptation  We also perform an ablation study of continual adaptive learning. The result in Table 3 shows that the continual-adaptive masking strategy is significantly effective for language model adaptation. The result suggests that which words are helpful for the language model to learn depends on its degree of adaptation.

    5.3 Analysis

Masked Word Statistics  To analyze how our model behaves, we measure the difference in which kinds of word tokens are masked by the random and the neural policy on the pre-trained checkpoint. For qualitative analysis, we provide examples of masked tokens in context in Figure 3. As shown in Figure 3, NMG tends to mask highly informative words such as seminary or islam, which are parts of the answer spans. Furthermore, we analyze the masking behavior of our NMG by performing Part-of-Speech (POS) tagging on the masked words using spaCy4. Figure 4 shows the six most frequent tags for the words masked out by the random and neural policies. The neural policy masks more words with noun, verb, and proper noun tags than the random policy, suggesting that our NMG model learns that masking such informative words is beneficial for adapting to the NewsQA task with the BERTBASE model as the language model.

4 https://spacy.io

Figure 3: Examples of masked tokens using our Neural Mask Generator (Neural) model and random sampling (Random). The tokens colored in red denote tokens masked by the model, and words highlighted with a yellow box are the answer to the given question. For more examples on multiple datasets, please see Appendix C.

Learning Curves  As is well known, RL-based methods often suffer from instability. Therefore, we further analyze the learning curves of the NMG training in this section. In Figure 5, we plot three kinds of learning curves to show the training process in detail. Cumulative regret indicates how many times the neural agent has been defeated by the random agent up to a given episode. The grey plot indicates the worst case, in which the random agent always defeats the neural agent. Entropy indicates the average entropy of the policy over the states given in a certain episode; lower entropy means that the policy places high probability on a few significant actions. Loss indicates the RL loss described in equation 10. We ignore outliers in the loss plot for brevity.

From the entropy and loss plots, we can see that the policy converges as learning proceeds. However, it seems that such convergence is not continually sustained. From the cumulative regret plot, we can observe that the neural policy still often loses against the random policy, even after it has been trained for a while. Such instability may come from the difficulty of exact credit assignment for each action. Alternatively, the continuous change of the state distribution caused by continual adaptive learning may hinder the neural policy's convergence.

Even though the NMG shows notable results, there is room for improvement in the RL in terms of efficiency and stability. We leave this as future work.

Figure 4: Top-6 POS of Masked Words on NewsQA.

Figure 5: Reinforcement learning curves on NewsQA (cumulative regret, entropy, and loss over episodes).

    6 Conclusion

We proposed a novel framework which automatically generates adaptive maskings for masked language models based on the given context, for language model adaptation to low-resource domains. To this end, we proposed the Neural Mask Generator (NMG), which is trained with reinforcement learning to mask out words that are helpful for domain adaptation. We performed an empirical study of various rule-based masking strategies on multiple datasets for question answering and text classification tasks, which shows that the optimal masking strategy depends on both the language model and the domain. We then validated NMG against rule-based masking strategies, and the results show that it either outperforms, or obtains comparable performance to, the best heuristic. Further qualitative analysis suggests that this good performance comes from its ability to adaptively mask meaningful words for the given task.


    Acknowledgments

This work was supported by Samsung Advanced Institute of Technology (SAIT), the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korea Government MSIT (NRF2018R1A5A1059921), Institute for Information & communications Technology Promotion (IITP) grants funded by the Korea Government (MSIT) (No. 2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion, and No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), and a study on the "HPC Support" Project supported by the Ministry of Science and ICT and NIPA.

References

Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. CoRR, abs/1903.10676.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

Thomas Degris, Martha White, and Richard S. Sutton. 2012. Linear off-policy actor-critic. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13042–13054.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1126–1135.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. 2018. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1563–1572.

Michael R. Glass, Alfio Gliozzo, Rishav Chakravarti, Anthony Ferritto, Lin Pan, G. P. Shrivatsa Bhargav, Dinesh Garg, and Avirup Sil. 2019. Span selection pre-training for question answering. CoRR, abs/1909.04120.

Jordi Grau-Moya, Felix Leibfried, and Haitham Bou-Ammar. 2018. Balancing two-player stochastic games with soft Q-learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 268–274.

Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. CoRR, abs/2004.10964.

Xiaochuang Han and Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4237–4247.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 328–339.

Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2309–2318.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Jens Kringelum, Sonny Kim Kjærulff, Søren Brunak, Ole Lund, Tudor I. Oprea, and Olivier Taboureau. 2016. ChemProt-3.0: a global chemical biology diseases mapping. Database, 2016.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. CoRR, abs/1909.11942.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. CoRR, abs/1901.08746.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019a. DARTS: Differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4487–4496.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019c. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 142–150.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1928–1937.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602.

Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V. Le, and Ruoming Pang. 2018. Domain adaptive transfer learning with specialist models. CoRR, abs/1811.07056.

Anusri Pampari, Preethi Raghavan, Jennifer J. Liang, and Jian Peng. 2018. emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2357–2368.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 8024–8035.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392.

Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1842–1850.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized experience replay. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1889–1897.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. CoRR, abs/1707.06347.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019a. How to fine-tune BERT for text classification? In Chinese Computational Linguistics - 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings, pages 194–206.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019b. ERNIE: Enhanced representation through knowledge integration. CoRR, abs/1904.09223.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019c. ERNIE 2.0: A continual pre-training framework for language understanding. CoRR, abs/1907.12412.

Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction, second edition. The MIT Press.

Wilson L. Taylor. 1953. "Cloze procedure": a new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Sebastian Thrun and Lorien Pratt, editors. 1998. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017, pages 191–200.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3630–3638.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. 2017. Sample efficient actor-critic with experience replay. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. 2017. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4987–4997.

Taesun Whang, Dongyub Lee, Chanhee Lee, Kisu Yang, Dongsuk Oh, and Heuiseok Lim. 2019. Domain adaptive training BERT for response selection. CoRR, abs/1908.04812.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.


Jinsung Yoon, Sercan Ömer Arik, and Tomas Pfister. 2019. Data valuation using reinforcement learning. CoRR, abs/1909.11671.

Linchao Zhu, Sercan Ömer Arik, Yi Yang, and Tomas Pfister. 2019. Learning to transfer learn. CoRR, abs/1908.11406.

Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8697–8710.

    A Algorithms

We provide the pseudocode of the meta-training algorithm for the Neural Mask Generator (NMG).

For meta-testing, running InnerLoop with the full task T, the pre-trained language model checkpoint θ, and the trained policy π_λ as inputs serves as the meta-testing procedure.

Algorithm 1 NMG Training Algorithm
  Initialize random policy ψ ∼ Uniform
  Initialize two neural agents λ, λ_op for Self-Play
  Initialize replay buffers D, D_op
  Arbitrarily split D_tr → D_tr, D_val
  Load pre-trained language model θ
  while not done do
    Randomly sample T′ from T
    r_ψ, E_ψ, θ(ψ) = InnerLoop(T′, θ, ψ)
    r_op, E_op, θ(λ_op) = InnerLoop(T′, θ, π_λop)
    r, E, θ(λ) = InnerLoop(T′, θ, π_λ)
    A_op ← E_ψ, r_ψ, E, r, E_op, r_op
    D_op, λ_op ← OuterLoop(D_op, A_op, λ_op)
    A ← E_ψ, r_ψ, E_op, r_op, E, r
    D, λ ← OuterLoop(D, A, λ)
    θ ← θ(λ)
  end while

Algorithm 2 InnerLoop
  Input: Task T, LM θ, Policy π
  Output: Episode buffer E, Accuracy r, LM θ(λ)
  Initialize episode buffer E
  Ŝ = {}; S, D_tr, D_val ← T
  for s in S do
    Mask s following Equations 3 and 4 to obtain ŝ and actions a
    Ŝ ← Ŝ ∪ {ŝ}, E ← E ∪ {(s, a, π)}
  end for
  # In practice, the two updates below are performed with mini-batch optimization over multiple steps
  θ(λ) ← θ − ∇_θ L_MLM(θ, λ; S, Ŝ)
  θ*(λ) ← θ(λ) − ∇_θ(λ) L_train(θ(λ), λ; D_tr)
  Evaluate θ*(λ) on D_val and acquire r
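As a concrete illustration of the masking step in Algorithm 2, the sketch below samples mask positions from a policy's per-token distribution and replaces them with the mask token. This is only a minimal sketch of how Equations 3 and 4 could be realized, under the assumption that the policy assigns a categorical distribution over token positions; the function name and arguments are ours, not the released implementation.

import torch

def mask_with_policy(token_ids, policy_logits, mask_token_id, mask_prob=0.05):
    """Sample mask positions from the policy and build the corrupted context s_hat.

    token_ids:     LongTensor [seq_len], input token ids of one context s.
    policy_logits: FloatTensor [seq_len], per-token scores from the policy network.
    Returns (masked_ids, positions), where positions correspond to the actions a.
    """
    seq_len = token_ids.size(0)
    num_masks = max(1, int(mask_prob * seq_len))
    probs = torch.softmax(policy_logits, dim=-1)
    # Draw positions without replacement, favoring tokens the policy scores highly.
    positions = torch.multinomial(probs, num_masks, replacement=False)
    masked_ids = token_ids.clone()
    masked_ids[positions] = mask_token_id
    return masked_ids, positions

# With a BERT tokenizer, mask_token_id would typically be tokenizer.mask_token_id.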

    B Hyperparameters

    B.1 Reinforcement Learning (Outer Loop)

We describe the detailed hyperparameters for reinforcement learning (RL) in Table 4. We use prioritized experience replay (Schaul et al., 2016) with the exponent value set to 1.


Algorithm 3 OuterLoop
  Input: Replay buffer D, E_ψ, r_ψ, E_op, r_op, E, r, Agent λ
  Output: Replay buffer D, Trained Agent λ
  R ← sgn(r − r_op)
  for (s, a) in E do
    if (s, a) ∉ E ∩ E_op and (s, a) ∉ E ∩ E_ψ then
      R ← min(sgn(r − r_ψ), sgn(r − r_op))
    else if (s, a) ∉ E ∩ E_op then
      R ← sgn(r − r_op)
    else if (s, a) ∉ E ∩ E_ψ then
      R ← sgn(r − r_ψ)
    else
      continue
    end if
    D ← D ∪ {(s, a, R, π_old)}
  end for
  Sample a mini-batch {(s, a, R, π_old)} from D
  λ ← λ + ∇_λ L following Equations 8 and 9 using the sampled replays
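To make the reward assignment in Algorithm 3 concrete, the sketch below compares each of the agent's actions against the opponent's and the random policy's episodes and stores sign-based rewards. It is a minimal sketch under our assumptions: episodes are represented as sets of (state, action) pairs, the stored tuples are simplified, and the helper names are illustrative rather than taken from the released code.

def sign(x):
    # Ternary sign used for the self-play reward: +1, 0, or -1.
    return (x > 0) - (x < 0)

def assign_rewards(episode, ep_opponent, ep_random, r, r_op, r_psi, replay_buffer):
    """Assign a reward R to each (state, action) pair in the agent's episode.

    episode, ep_opponent, ep_random: sets of (state, action) pairs from the inner loop.
    r, r_op, r_psi: downstream accuracies of the agent, opponent, and random policy.
    Pairs chosen by all three policies carry no comparative signal and are skipped.
    """
    for state_action in episode:
        in_op = state_action in ep_opponent
        in_psi = state_action in ep_random
        if not in_op and not in_psi:
            reward = min(sign(r - r_psi), sign(r - r_op))
        elif not in_op:
            reward = sign(r - r_op)
        elif not in_psi:
            reward = sign(r - r_psi)
        else:
            continue  # action shared with both other policies: skip
        replay_buffer.append((state_action, reward))
    return replay_buffer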

In addition, we apply the concept of subsampling (Mikolov et al., 2013) to replay sampling: we divide the priority of each replay by the square root of the word frequency within the corresponding context (see the sketch below). For optimization, we use the Adam optimizer (Kingma and Ba, 2015) to train the NMG model and its opponent (Self-Play).
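The frequency-based priority adjustment just described could be implemented as follows; this is a minimal sketch in which the base priority and the per-context word counts are assumed inputs, not quantities defined elsewhere in the paper.

import math

def subsampled_priority(base_priority, word, word_counts):
    """Down-weight replays whose masked word is frequent within its context.

    base_priority: the usual prioritized-experience-replay priority of the replay.
    word_counts:   mapping word -> frequency within the corresponding context.
    """
    freq = max(word_counts.get(word, 1), 1)  # guard against missing or zero counts
    return base_priority / math.sqrt(freq)

# Example: a replay that masked a very frequent word gets a lower sampling priority.
counts = {"the": 25, "warfarin": 1}
print(subsampled_priority(1.0, "the", counts))       # 0.2
print(subsampled_priority(1.0, "warfarin", counts))  # 1.0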

Hyperparameters                 Value
Learning Rate                   0.0001
Number of Epochs                10
Minibatch Size                  64
Replay Buffer Size              50000
Entropy Regularization          0.01
Maximum Episodes                200

Table 4: Hyperparameters for Reinforcement Learning Training (Outer Loop). QA is short for Question Answering and TC is short for Text Classification.

    B.2 LM Training in Meta-Train (Inner Loop)

We describe the detailed hyperparameters for language model (LM) training in meta-training in Table 5. When pre-processing the pre-training dataset, we use the contexts of the triplets for Question Answering and the sentences for Text Classification. In particular, for emrQA (Pampari et al., 2018), we preprocess the data to match the other QA formats by removing yes-no and multiple-answer questions. Furthermore, since Pampari et al. (2018) do not provide a separate validation set, we arbitrarily split the training set into training and validation sets with an 8:2 ratio. For optimization, we use the AdamW optimizer (Loshchilov and Hutter, 2019) with a linear learning rate scheduler, as sketched below. Each meta-training run uses two Titan XP or RTX 2080 Ti GPUs and takes at most 2 days in the case of BERT. The batch size is chosen according to the size of the data and the model.
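The optimizer and scheduler setup mentioned above could look like the following. This is a minimal sketch assuming the Hugging Face transformers library; the hyperparameter values and function name are illustrative, not a transcript of our training script.

import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, learning_rate, num_training_steps, num_warmup_steps=0):
    """AdamW with a linearly decaying learning-rate schedule."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler

# Typical usage inside a training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()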

B.3 LM Training in Meta-Test

We describe the detailed hyperparameters for LM training in meta-testing in Table 6. In meta-testing, we use the same pre-processing and optimization settings described in Section B.2. The batch size is likewise chosen according to the size of the data and the model.

Regarding the number of pre-training epochs and the masking probability p in meta-testing, we use two distinct settings for the baselines and the NMG model. For the baselines, we train the LM for 1 epoch with p = 0.15, following the conventional setting. However, we observed that masking fewer tokens while pre-training for more epochs is much more beneficial during meta-training on tasks with long contexts, since it allows the NMG to evaluate actions more precisely. Therefore, for the NMG model, we train the LM for 3 epochs with p = 0.05, following the meta-training setting.

B.4 Neural Mask Generator Architecture

The policy network of the NMG model consists of a single self-attention layer and two linear layers. The self-attention layer follows the configuration of the transformer layer of BERT_BASE; we omit the linear layers of the original transformer (Vaswani et al., 2017) implementation in our architecture. The hidden size of the linear layers is 128, and GELU (Hendrycks and Gimpel, 2016) is used as the activation function. The value network consists of the same linear layers, applied after mean-pooling the word representations. The total number of parameters of the NMG model is approximately 2.5M, which is far smaller than that of a conventional language model.
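A schematic PyTorch rendering of this architecture is shown below. It is a sketch under our assumptions (a single multi-head self-attention layer standing in for the BERT_BASE transformer layer without its feed-forward sub-layer, hidden size 128, GELU, and a mean-pooled value head); it is not the released model definition. With these dimensions, the parameter count lands in the low millions, roughly matching the figure above.

import torch.nn as nn

class NMGPolicy(nn.Module):
    """Schematic NMG policy/value network: one self-attention layer over contextual
    word representations, followed by small two-layer heads (hidden size 128, GELU)."""

    def __init__(self, embed_dim=768, num_heads=12, hidden_dim=128):
        super().__init__()
        # Single self-attention layer; the feed-forward sub-layer of the original
        # transformer block is omitted, following the description above.
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.policy_head = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 1)
        )
        self.value_head = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, word_reprs):
        # word_reprs: [batch, seq_len, embed_dim] contextual word representations.
        attended, _ = self.self_attn(word_reprs, word_reprs, word_reprs)
        policy_logits = self.policy_head(attended).squeeze(-1)      # [batch, seq_len]
        value = self.value_head(attended.mean(dim=1)).squeeze(-1)   # [batch]
        return policy_logits, value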

B.5 Hyperparameter Searching

To find proper hyperparameters, we perform manual tuning over conventional hyperparameter values for reinforcement learning (RL) and language model (LM) training. The selection criterion is based on the meta-testing results: we use the F1 score for question answering and accuracy for text classification.


Table 5: Hyperparameters for LM Training of Meta-Train (Inner Loop).

Hyperparameters                            Value
Pre-Training Masking Probability           0.05
Pre-Training Learning Rate                 0.00002
Pre-Training Epochs                        3
Sampled Pre-Training Dataset Size          200
Pre-Training Batch Size                    Chosen from {8, 16, 32}
Maximum Sequence Length in Pre-Training    512 (QA) or 256 (TC)
Fine-Tuning Learning Rate                  0.00003 (QA) or 0.00002 (TC)
Fine-Tuning Epochs                         1 (QA) or 5 (TC)
Maximum Training Set Size                  1000
Validation Set Size                        10000
Fine-Tuning Batch Size                     Chosen from {8, 16, 32}

Table 6: Hyperparameters for LM Training of Meta-Test.

Hyperparameters                            Value
Pre-Training Masking Probability           Chosen from {0.05, 0.15}
Pre-Training Learning Rate                 0.00002
Pre-Training Epochs                        Chosen from {1, 3}
Pre-Training Batch Size                    Chosen from {12, 16, 24}
Maximum Sequence Length in Pre-Training    512 (QA) or 256 (TC)
Fine-Tuning Learning Rate                  0.00003 (QA) or 0.00002 (TC)
Fine-Tuning Epochs                         2 (QA) or 3 (TC)
Fine-Tuning Batch Size                     Chosen from {12, 16, 32}


    C More Examples

To illustrate the masking strategy learned by our NMG model, we provide additional examples from the various datasets used in our experiments.


Figure 6: SQuAD examples of masked tokens using the Neural Mask Generator (NMG). The red marks and yellow boxes indicate the tokens masked by the model and the answer to the given question, respectively.


Figure 7: emrQA examples of masked tokens using the Neural Mask Generator (NMG). The red marks and yellow boxes indicate the tokens masked by the model and the answer to the given question, respectively.


Figure 8: NewsQA examples of masked tokens using the Neural Mask Generator (NMG). The red marks and yellow boxes indicate the tokens masked by the model and the answer to the given question, respectively.


Figure 9: ChemProt and IMDb examples of masked tokens using the Neural Mask Generator (NMG). The red marks indicate the tokens masked by the model.