

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 5393–5404, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics


Transfer Fine-Tuning: A BERT Case Study

Yuki Arase¹* and Junichi Tsujii*²
¹Osaka University, Japan

*Artificial Intelligence Research Center (AIRC), AIST, Japan
²NaCTeM, School of Computer Science, University of Manchester, UK

[email protected], [email protected]

Abstract

A semantic equivalence assessment is defined as a task that assesses semantic equivalence in a sentence pair by binary judgment (i.e., paraphrase identification) or grading (i.e., semantic textual similarity measurement). It constitutes a set of tasks crucial for research on natural language understanding. Recently, BERT realized a breakthrough in sentence representation learning (Devlin et al., 2019), which is broadly transferable to various NLP tasks. While BERT's performance improves by increasing its model size, the required computational power is an obstacle preventing practical applications from adopting the technology. Herein, we propose to inject phrasal paraphrase relations into BERT in order to generate suitable representations for semantic equivalence assessment instead of increasing the model size. Experiments on standard natural language understanding tasks confirm that our method effectively improves a smaller BERT model while maintaining the model size. The generated model exhibits superior performance compared to a larger BERT model on semantic equivalence assessment tasks. Furthermore, it achieves larger performance gains on tasks with limited training datasets for fine-tuning, which is a property desirable for transfer learning.

1 Introduction

Paraphrase identification and semantic textual similarity (STS) measurements aim to assess semantic equivalence in sentence pairs. These tasks are central problems in natural language understanding research and its applications. In this paper, these tasks are defined as semantic equivalence assessments.

Sentence representation learning is the basis of assessing semantic equivalence. Unsupervised learning is becoming the preferred approach because it only requires plain corpora, which are now abundantly available. In this approach, a model is pre-trained to generate generic sentence representations that are broadly transferable to various natural language processing (NLP) tasks. Subsequently, it is fine-tuned to generate specific representations for solving a target task using an annotated corpus. Considering the high costs of annotation, a pre-trained model that efficiently fits the target task with a smaller annotated corpus is desired.

Recently, Bidirectional Encoder Representations from Transformers (BERT) realized a breakthrough, which dramatically improved sentence representation learning (Devlin et al., 2019). BERT pre-trains its encoder using language modeling and by discriminating surrounding sentences in a document from random ones. Pre-training in this manner allows distributional relations between sentences to be learned. Intensive efforts are currently being made to pre-train larger models by feeding them enormous corpora for improvement (Radford et al., 2019; Yang et al., 2019). For example, the large model of BERT has 340M parameters, which is 3.1 times larger than its smaller alternative. Although such a large model achieves performance gains, the required computational power hinders its application to downstream tasks.

Given the importance of natural language understanding research, we focus on sentence representation learning for semantic equivalence assessment. Instead of increasing the model size, we propose the injection of semantic relations into a pre-trained model, namely BERT, to improve performance. Phang et al. (2019) showed that BERT's performance on downstream tasks improves by simply inserting extra training on data-rich supervised tasks. Unlike them, we inject semantic relations of finer granularity, using phrasal paraphrase alignments automatically identified by the method of Arase and Tsujii (2017), to improve semantic equivalence assessment tasks.


Specifically, our method learns to discriminate phrasal and sentential paraphrases on top of the representations generated by BERT. This approach explicitly introduces the concept of the phrase to BERT and supervises semantic relations between phrases. Thanks to studies on sentential paraphrase collection (Lan et al., 2017) and generation (Wieting and Gimpel, 2018), a million-scale paraphrase corpus is ready for use. We empirically show that further training of a pre-trained model on relevant tasks transfers well to downstream tasks of the same kind, which we name transfer fine-tuning.

The contributions of our paper are:

• We empirically demonstrate that transfer fine-tuning using paraphrasal relations allows a smaller BERT to generate representations suitable for semantic equivalence assessment. The generated model exhibits superior performance to the larger BERT while maintaining the small model size.

• Our experiments indicate that phrasal paraphrase discrimination contributes to representation learning, which complements simpler sentence-level paraphrase discrimination.

• Our model exhibits a larger performance gain over the BERT model for a limited amount of fine-tuning data, which is an important property of transfer learning.

We hope that this study will open up one of the crucial research directions that will make the approach of pre-trained models more practically useful. Our code, datasets, and trained models will be made publicly available on our website.

2 Related Work

Sentence representation learning is an active research area due to its importance in various downstream tasks. Early studies employed supervised learning, where a sentence representation is learned in an end-to-end manner using an annotated corpus. Among these, the importance of phrase structures in representation learning has been discussed (Tai et al., 2015; Wu et al., 2018). In this paper, we use structural relations in sentence pairs for sentence representations. Specifically, we employ phrasal paraphrase relations that introduce the notion of a phrase to the model.

The research focus of sentence representation learning has moved toward unsupervised learning in order to exploit gigantic corpora. Skip-Thought, an early attempt, learns to generate the surrounding sentences of a given sentence in a document (Kiros et al., 2015). This can be interpreted as an extension of the distributional hypothesis to sentences. Quick-Thoughts, a successor of Skip-Thought, conducts classification to discriminate surrounding sentences instead of generating them (Logeswaran and Lee, 2018). GenSen combines these approaches in massive multi-task learning (Subramanian et al., 2018), based on the premise that learning dependent tasks enriches sentence representations.

Embeddings from Language Models (ELMo) made a significant step forward (Peters et al., 2018). ELMo uses language modeling with bidirectional recurrent neural networks (RNNs) to improve word embeddings. ELMo's embeddings contribute to the performance of various downstream tasks. OpenAI GPT (Radford et al., 2018) replaced ELMo's bidirectional RNN for language modeling with the Transformer (Vaswani et al., 2017) decoder. More recently, BERT combined the approaches of Quick-Thoughts (i.e., next-sentence prediction) and language modeling on top of a deep bidirectional Transformer. BERT broke the records of the previous state-of-the-art methods on eleven different NLP tasks. While BERT's pre-training generates generic representations that are broadly transferable to various NLP tasks, we aim to fit them to semantic equivalence assessment by injecting paraphrasal relations. Liu et al. (2019) showed that BERT's performance improves when fine-tuning in a multi-task learning setting, which is applicable to our trained model for further improvement.

3 Background

3.1 Phrase Alignment for Paraphrases

To obtain phrasal paraphrases, we use the phrase alignment method proposed by Arase and Tsujii (2017) and apply it to our paraphrase corpora. The method aligns phrasal paraphrases on the parse forests of a sentential paraphrase pair, as illustrated in Fig. 1.

According to the evaluation results reported by Arase and Tsujii (2017), the precision and recall of the alignments are 83.6% and 78.9%, which are 89% and 92% of those of humans, respectively.


Figure 1: Phrasal paraphrases are obtained from (Arase and Tsujii, 2017); arrows indicate phrase alignments.

Although alignment errors occur, previous studies show that neural networks are relatively robust against noise in a training corpus and still benefit from the extra supervision, as demonstrated in (Edunov et al., 2018; Prabhumoye et al., 2018).

We collect all the spans of phrases in a sentential paraphrase pair and their alignments as pairs of phrase spans. Because the phrase alignment method allows unaligned phrases, not all phrases have aligned counterparts.
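As a hypothetical illustration of the resulting data, aligned phrases can be represented as pairs of word spans; the sentences and alignments below are toy examples, not output of the alignment method:

    from typing import List, Tuple

    Span = Tuple[int, int]           # (start, end) word indices, inclusive
    Alignment = Tuple[Span, Span]    # (source phrase span, target phrase span)

    s = "dreams will come true very soon".split()
    t = "wishes will be fulfilled in the near future".split()

    # Toy alignments for illustration only; unaligned phrases simply have no entry.
    alignments: List[Alignment] = [
        ((0, 0), (0, 0)),  # "dreams" <-> "wishes"
        ((2, 3), (2, 3)),  # "come true" <-> "be fulfilled"
        ((4, 5), (4, 7)),  # "very soon" <-> "in the near future"
    ]

    for (sj, sk), (tm, tn) in alignments:
        print(" ".join(s[sj:sk + 1]), "<->", " ".join(t[tm:tn + 1]))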

3.2 Pre-Training on BERT

BERT is a bidirectional Transformer that generates a sentence representation by conditioning on both the left and right contexts of a sentence. A pre-trained BERT model can be easily fine-tuned for a wide range of tasks by just adding a fully-connected layer, without any task-specific architectural modifications. BERT achieved state-of-the-art performance on eleven NLP tasks, outperforming the previous state-of-the-art methods by a large margin.

Pre-training in BERT accomplishes two tasks. The first task is masked language modeling, where some words in a sentence are randomly masked and the model then predicts them from the context. This task design allows the representation to fuse both the left and the right context. The second task predicts whether a pair of sentences are consecutive in a document, so as to learn the relation between sentences. Specifically, as illustrated in Fig. 2, BERT takes two sentences as input, concatenated by a special token [SEP].¹ The first token of every input is always the special token [CLS]. The final hidden state corresponding to this [CLS] token is regarded as an aggregated representation of the input sentence pair and is used during pre-training to predict whether the sentence pair consists of consecutive sentences in a document.

¹Throughout the paper, typewriter font represents tokens and labels.

BERT has a deep architecture. The BERT-base model has 12 layers with a hidden size of 768 and 12 self-attention heads. The BERT-large model has 24 layers with a hidden size of 1024 and 16 self-attention heads. Both the BERT-base and BERT-large models were pre-trained using BookCorpus (Zhu et al., 2015) and English Wikipedia (3.3B words in total).

4 Transfer Fine-Tuning with Paraphrasal Relation Injection

We inject semantic relations between a sentence pair into a pre-trained BERT model through classification of phrasal and sentential paraphrases. After this training, the model can be fine-tuned in exactly the same manner as the original BERT models.

4.1 Overview

Algorithm 4.1 provides an overview of our method. It takes a sentential paraphrase pair 〈s, t〉 as input, referred to as the source and target, respectively, for the sake of clarity.

Algorithm 4.1 Paraphrasal Relation Injection
Input: paraphrase sentence pairs P = {〈s, t〉}, a pre-trained BERT model
 1: Obtain a set of phrase alignments A as pairs of spans for each 〈s, t〉 ∈ P
 2: Apply WordPiece tokenization to P
 3: Accommodate the phrase spans in A to BERT's token indexing: A = {〈(j, k), (m, n)〉}
 4: repeat
 5:   for all mini-batches bt ∈ {〈Pi, Ai〉} do
 6:     Encode bt with the BERT model
 7:     Compute the loss L(Θ):
 8:       for the phrasal paraphrase task: Lp(Θ)
 9:       for the sentential paraphrase task: Ls(Θ)
10:     L(Θ) = Lp(Θ) + Ls(Θ)
11:     Compute the gradient ∇(Θ)
12:     Update the model parameters
13: until convergence
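The joint objective (lines 7 to 10) simply sums the two cross-entropy losses. Below is a minimal sketch of this objective and the parameter update, with random tensors standing in for BERT outputs and hypothetical classifier heads; it is an illustration under these assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    hidden, batch = 768, 4                         # hidden size λ of BERT-base, toy batch size
    phrase_clf = nn.Linear(4 * hidden, 3)          # phrasal classifier: paraphrase / random / in-paraphrase
    sent_clf = nn.Linear(hidden, 2)                # sentential classifier: paraphrase / random pair
    optimizer = torch.optim.Adam(
        list(phrase_clf.parameters()) + list(sent_clf.parameters()), lr=5e-5)
    ce = nn.CrossEntropyLoss()

    for step in range(3):                          # "repeat ... until convergence" in Algorithm 4.1
        phrase_feat = torch.randn(batch, 4 * hidden)   # stand-in for the phrasal features (Sec. 4.1)
        h_cls = torch.randn(batch, hidden)             # stand-in for the final [CLS] hidden state
        loss_p = ce(phrase_clf(phrase_feat), torch.randint(0, 3, (batch,)))  # Lp(Θ)
        loss_s = ce(sent_clf(h_cls), torch.randint(0, 2, (batch,)))          # Ls(Θ)
        loss = loss_p + loss_s                                               # L(Θ) = Lp(Θ) + Ls(Θ)
        optimizer.zero_grad()
        loss.backward()                            # line 11: compute gradients
        optimizer.step()                           # line 12: update parameters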


Figure 2: Our method injects semantic relations to sentence representations through paraphrase discrimination.

First, a set of phrase alignments A is obtained for 〈s, t〉 (line 1), as described in Sec. 3.1. Because BERT uses sub-words as its unit instead of words, all the input sentences are tokenized by WordPiece (Wu et al., 2016) (line 2). In addition, t is concatenated to s when input to the BERT model, where the first token must always be [CLS] and the sentence pair is separated by [SEP], as described in Sec. 3.2. To accommodate these factors, the phrase spans in the alignments A are adjusted accordingly (line 3).
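Line 3 only needs to account for the leading [CLS] token, the sub-word splits of s, and the [SEP] separator before t. Below is a minimal sketch under these assumptions; the sub-word counts are toy values, not real WordPiece output.

    from typing import List, Tuple

    def adjust_span(word_span: Tuple[int, int],
                    subwords_per_word: List[int],
                    offset: int) -> Tuple[int, int]:
        """Convert an inclusive word span to an inclusive token span."""
        start_word, end_word = word_span
        start_tok = offset + sum(subwords_per_word[:start_word])
        end_tok = offset + sum(subwords_per_word[:end_word + 1]) - 1
        return start_tok, end_tok

    # Toy example: s has 4 words split into [1, 2, 1, 1] sub-words, t has 3 words.
    s_subwords, t_subwords = [1, 2, 1, 1], [1, 1, 2]
    cls_offset = 1                                   # position 0 is [CLS]
    t_offset = cls_offset + sum(s_subwords) + 1      # skip the tokens of s and the first [SEP]

    src_span = adjust_span((1, 2), s_subwords, cls_offset)   # words 1-2 of s
    tgt_span = adjust_span((0, 1), t_subwords, t_offset)     # words 0-1 of t
    print(src_span, tgt_span)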

Our method learns to discriminate phrasal and sentential paraphrases simultaneously, as illustrated in Fig. 2. Cross-entropy is used as the loss function for both tasks (lines 8 and 9).

Phrasal Paraphrase Classification The middle part of Fig. 2 illustrates phrasal paraphrase classification. We first generate a phrase embedding for each aligned phrase as follows. The tokenized sentence pair is encoded by the BERT model. For an input sequence of N tokens {w_i} (i = 1, ..., N), we obtain the final hidden states {h_i} (i = 1, ..., N), i.e., the output of the bidirectional Transformer:

h_i = Transformer(w_1, ..., w_N),

where h_i ∈ R^λ and λ is the hidden size. We then combine {h_i} for a phrase pair with an alignment 〈(j, k), (m, n)〉, where 2 ≤ j < k < m < n ≤ N − 1 represent the indexes of the beginnings and endings of the phrases (recall that the first and last tokens are always special tokens in BERT). As the combination function, we apply max-pooling, which showed strong performance in (Conneau et al., 2017), to generate representations of the source and target phrases:

h_s = max-pooling(h_j, ..., h_k),   (1)
h_t = max-pooling(h_m, ..., h_n).   (2)

The max-pooling(·) function selects the maximum value over each dimension of the hidden units.

Then h_s and h_t are converted into a single vector. To extract relations between h_s and h_t, three matching methods are used (Conneau et al., 2017): (a) concatenating the representations (h_s, h_t), (b) taking the element-wise product h_s ∗ h_t, and (c) taking the absolute element-wise difference |h_s − h_t|. The final vector of size 4λ is fed into a classifier.²

Because our method aims to generate representations for semantic equivalence assessment, the classifier should be simple (Logeswaran and Lee, 2018); otherwise, a sophisticated classifier would fit itself to the task instead of the representations. We use a single fully-connected layer culminating in a softmax layer as our classifier.
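Below is a minimal sketch of Eqs. (1) and (2) and the feature construction above, assuming hidden_states holds BERT's final hidden states for one tokenized pair and (j, k) and (m, n) are the adjusted token spans of an aligned phrase pair; all names and spans are illustrative.

    import torch
    import torch.nn as nn

    hidden = 768                              # hidden size λ of BERT-base
    N = 32                                    # number of tokens in the pair
    hidden_states = torch.randn(N, hidden)    # stand-in for Transformer(w_1, ..., w_N)
    j, k = 3, 6                               # source phrase token span (inclusive)
    m, n = 15, 19                             # target phrase token span (inclusive)

    # Eqs. (1) and (2): max-pooling over each dimension of the phrase's hidden states.
    h_s = hidden_states[j:k + 1].max(dim=0).values
    h_t = hidden_states[m:n + 1].max(dim=0).values

    # Matching features: concatenation, absolute difference, and element-wise
    # product, giving a vector of size 4λ.
    features = torch.cat([h_s, h_t, (h_s - h_t).abs(), h_s * h_t], dim=-1)

    # A single fully-connected layer followed by softmax over the three classes
    # (paraphrase / random / in-paraphrase).
    classifier = nn.Linear(4 * hidden, 3)
    probs = torch.softmax(classifier(features), dim=-1)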

Previous studies have calculated interactions between words (He and Lin, 2016) and phrases (Chen et al., 2017) using the final hidden states of bidirectional RNNs or recursive neural networks when composing a sentence representation. Our approach differs from these by giving explicit supervision of which phrase pairs have semantic interactions (i.e., are paraphrases).

²Our follow-up study confirms that simpler feature generation improves the generality of our model so that it contributes not only to semantic equivalence assessment but also to natural language inference. For details, please refer to the Appendix.


Negative Example Selection In paraphrase identification, non-paraphrases with large lexical differences are easy to discriminate. Discrimination becomes far more difficult when they contain a number of identical or related words. To effectively supervise the model by solving difficult discrimination problems, we designed a three-way classification task: discrimination of paraphrase, random, and in-paraphrase pairs.

The random examples are generated by pairing s with a random sentence t′ from the training corpus and then pairing all phrases in s with randomly chosen phrases in t′. The in-paraphrase examples aim to make the discrimination problem difficult by requiring the model to distinguish true paraphrases from other phrases in the paraphrasal sentence pair. These may provide sub-phrases or ancestor phrases of the true paraphrases as difficult negative examples, which tend to retain the same topic and similar wordings. To prepare such examples, for each phrase pair 〈(j, k), (m, n)〉 ∈ A, the target span (m, n) is replaced by a randomly chosen phrase span in t.
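Below is a minimal sketch of this three-way example construction over word spans; the helper name, labels, and toy spans are illustrative rather than the authors' implementation.

    import random

    def make_examples(s_phrases, t_phrases, alignments, other_sentence_phrases):
        """Return (source span, target span, label) triples for one pair <s, t>."""
        examples = []
        for (src, tgt) in alignments:
            # Positive: a true phrasal paraphrase.
            examples.append((src, tgt, "paraphrase"))
            # in-paraphrase: replace the target span by another phrase span of t,
            # e.g., a sub-phrase or ancestor of the true counterpart.
            distractor = random.choice([p for p in t_phrases if p != tgt] or t_phrases)
            examples.append((src, distractor, "in-paraphrase"))
            # random: pair the source phrase with a phrase from an unrelated sentence t'.
            examples.append((src, random.choice(other_sentence_phrases), "random"))
        return examples

    # Toy spans for illustration only.
    examples = make_examples(
        s_phrases=[(0, 1), (2, 4)],
        t_phrases=[(0, 2), (3, 5), (3, 4)],
        alignments=[((0, 1), (0, 2)), ((2, 4), (3, 5))],
        other_sentence_phrases=[(1, 3), (0, 0)],
    )
    print(examples)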

Phrasal paraphrase classification aims to give explicit supervision of semantic relations among phrases during representation learning. It also introduces sentence structure, which is completely missing in BERT's pre-training. Swayamdipta et al. (2018) showed that supervision of phrase-based syntax improves the performance of tasks relevant to semantics, e.g., semantic role labeling.

Sentential Paraphrase Classification The left side of Fig. 2 illustrates sentential paraphrase classification. The process is simple: the final hidden state of the [CLS] token, i.e., h_1, is fed into a classifier to discriminate whether a sentence pair is a paraphrase or a random sentence combination. Note that these random sentence pairs also provide the random phrases for the phrasal paraphrase classification described above.
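As a minimal sketch (illustrative shapes and names, not the authors' code), the sentential classifier is a single fully-connected layer over the [CLS] state:

    import torch
    import torch.nn as nn

    hidden = 768
    h_cls = torch.randn(hidden)                 # stand-in for h_1, the final [CLS] hidden state
    sent_classifier = nn.Linear(hidden, 2)      # paraphrase vs. random sentence combination
    probs = torch.softmax(sent_classifier(h_cls), dim=-1)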

4.2 Training Setting

We collected paraphrases from various sources, as summarized in Table 1, which shows the numbers of sentential and phrasal paraphrase pairs after phrase alignment.³ All the datasets were downloaded from the Linguistic Data Consortium (LDC) or the authors' websites. The following bullets describe the sources.

³The numbers of sentential paraphrase pairs were reduced due to parsing and alignment failures.

Source              Sentence   Phrase
NIST OpenMT         47k        711k
Simple Wikipedia    97k        1.4M
Twitter URL corpus  50k        396k
Para-NMT            3.9M       26.7M
Total               4.1M       29.2M

Table 1: Numbers of sentential and phrasal paraphrases after the phrase alignment process.


• NIST OpenMT⁴: We randomly paired reference translations of the same source sentence, as was done in (Arase and Tsujii, 2017).

• Twitter URL corpus (Lan et al., 2017): This corpus was collected from Twitter by linking tweets through shared URLs. We used a three-month collection of paraphrases.⁵

• Simple Wikipedia (Kauchak, 2013): This corpus aligned English Wikipedia and Simple English Wikipedia for text simplification. We used "sentence-aligned, version 2.0."⁶

• Para-NMT (Wieting and Gimpel, 2018): This corpus was created by translating the Czech side of a large Czech-English parallel corpus and pairing the translated English with the originally target-side English as paraphrases. We used "Para-nmt-5m-processed."⁷

Note that these sentential and phrasal paraphrases were obtained by automatic methods. In contrast, dataset creation for downstream tasks generally requires expensive human annotation.

We employed the pre-trained BERT-base model⁸ and conducted paraphrase classification using the collected paraphrase corpora. Adam (Kingma and Ba, 2015) was used as the optimizer with a learning rate of 5e−5. The dropout probability was 0.2 for the fully-connected layers in the classifiers.

4LDC catalogue number: LDC2010T14, LDC2010T17,LDC2010T21, LDC2010T23, LDC2013T03

5https://github.com/lanwuwei/Twitter-URL-Corpus

6http://www.cs.pomona.edu/˜dkauchak/simplification/data.v2/sentence-aligned.v2.tar.gz

7https://drive.google.com/file/d/19NQ87gEFYu3zOIp_VNYQZgmnwRuSIyJd/view?usp=sharing

8https://github.com/google-research/bert


A development set and a test set, each with 50k sentence pairs, were held out from the paraphrase corpus, and the rest of the corpus was used for training. The training was conducted on four NVIDIA Tesla V100 GPUs with a batch size of 100. Early stopping was applied: training was stopped at the second decrease in the accuracy of phrasal paraphrase classification, measured on the development set. The final test-set accuracies were 98.1% and 99.9% for phrasal and sentential paraphrase classification, respectively.
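The stopping rule can be sketched as follows, assuming dev_accuracies is the sequence of development-set accuracies measured over training; the values below are illustrative, not real results.

    def should_stop(dev_accuracies, max_decreases=2):
        """Stop the second time the development accuracy decreases."""
        decreases = 0
        for prev, curr in zip(dev_accuracies, dev_accuracies[1:]):
            if curr < prev:
                decreases += 1
                if decreases >= max_decreases:
                    return True
        return False

    print(should_stop([0.95, 0.96, 0.955, 0.97, 0.965]))  # True: two decreases observed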

5 Evaluation Setting

5.1 Hypotheses to Verify

BERT's pre-training learns to generate sentence representations broadly transferable to different NLP tasks. In contrast, our method gives more direct supervision to generate representations suitable for semantic equivalence assessment tasks. We set up the following hypotheses on the features of our method, which will be empirically verified through evaluation:

H1 Our method contributes to semantic equivalence assessment tasks.

H2 Our method achieves improvement on downstream tasks that only have small amounts of training data for fine-tuning.

H3 Our method moderately improves tasks that are relevant to semantic equivalence assessment.

H4 Our training does not transfer to distant downstream tasks that are independent of semantic equivalence assessment.

H5 Phrasal and sentential paraphrase classification complementarily benefit sentence representation learning.

5.2 GLUE Datasets

We empirically verified the hypotheses H1 to H5 using the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019)⁹, which is the standard benchmark and provides collections of datasets for natural language understanding tasks. Table 2 summarizes the tasks and evaluation metrics in GLUE. All the scores reported in this paper are computed by the GLUE evaluation server unless stated otherwise.

9https://gluebenchmark.com/

Corpus    Task               Metrics
MRPC      paraphrase         F1
STS-B     STS                Pearson corr.
QQP       paraphrase         F1
MNLI-m    in-domain NLI      accuracy
MNLI-mm   cross-domain NLI   accuracy
RTE       NLI                accuracy
QNLI      QA/NLI             accuracy
SST       sentiment          accuracy
CoLA      acceptability      Matthews corr.

Table 2: GLUE tasks and evaluation metrics.

Accuracies on MRPC and QQP and the Spearman correlation on STS-B are omitted due to space limitations; note that they showed the same trends as F1 and the Pearson correlation, respectively, in our experiments. WNLI was excluded because the GLUE website reports issues with it.¹⁰

GLUE tasks can be categorized according totheir aims as follows.

Semantic Equivalence Assessment Tasks (MRPC, STS-B, QQP) These are the primary targets of our method and are used to verify hypothesis H1. Paraphrase identification assesses semantic equivalence in a sentence pair by binary judgment. The Microsoft Paraphrase Corpus (MRPC) (Dolan et al., 2004) consists of sentence pairs drawn from news articles, while Quora Question Pairs (QQP)¹¹ consists of question pairs from the community QA website.

STS assesses semantic equivalence by grading. The STS benchmark (STS-B) (Cer et al., 2017) provides sentence pairs drawn from heterogeneous sources, which are human-annotated with a level of equivalence from 1 to 5.

NLI Tasks (MNLI-m/mm, RTE, QNLI) We use natural language inference (NLI) tasks to verify hypothesis H3 because they constitute a class of problems relevant to semantic equivalence assessment. NLI tasks differ from semantic equivalence assessment in that they often require logical inference and an understanding of commonsense knowledge. The Multi-Genre Natural Language Inference corpus (MNLI) (Williams et al., 2018) is a crowd-sourced corpus that covers heterogeneous domains.

¹⁰https://gluebenchmark.com/faq
¹¹https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs


Model                  MRPC   STS-B   QQP    MNLI (m/mm)   RTE    QNLI   SST    CoLA
BERT-base              88.3   84.7    71.2   84.3/83.0     59.8   89.1   93.3   52.7
BERT-large             88.6   86.0    72.1   86.2/85.5     65.5   92.7   94.1   55.7
Transfer Fine-Tuning   89.2   87.4    71.2   83.9/83.1     64.8   89.3   93.1   47.2

Table 3: GLUE test results scored by the GLUE evaluation server. MRPC, STS-B, and QQP are semantic equivalence assessment tasks; MNLI, RTE, and QNLI are NLI tasks; SST and CoLA are single-sentence tasks. The best scores are represented in bold and scores higher than those of BERT-base are underlined.

MNLI-m is an in-domain NLI task, while MNLI-mm is a cross-domain NLI task. The Recognizing Textual Entailment (RTE) corpus¹² was created from news and Wikipedia. Question-answering NLI (QNLI) was created from the Stanford Question Answering Dataset (Rajpurkar et al., 2016), in which all the sentences were drawn from Wikipedia.

Single-Sentence Tasks (SST, CoLA) We use these tasks to verify hypothesis H4. They aim to estimate features of a single sentence, which has little interaction with semantic equivalence assessment in a sentence pair. The Stanford Sentiment Treebank (SST) (Socher et al., 2013) task is binary sentiment classification, while the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018) task is binary classification of grammatical acceptability.

5.3 Fine-Tuning on Downstream Tasks

Once trained, our model can be used in exactly the same manner as the pre-trained BERT models. For fine-tuning our models and replicating BERT's results under the same setting, we set the hyper-parameter values to those recommended in (Devlin et al., 2019): a batch size of 32, a learning rate of 3e−5, 4 training epochs, and a dropout probability of 0.1. We fine-tuned all the models on the downstream tasks using the script provided in the PyTorch version of BERT.¹³ For STS-B, we modified the script slightly to conduct regression instead of classification. All other hyper-parameters were set to the default values defined in BERT's fine-tuning script.
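For reference, the fine-tuning hyper-parameters above can be collected into a configuration; the dictionary and its key names are ours, for illustration only.

    finetune_config = {
        "batch_size": 32,         # recommended in Devlin et al. (2019)
        "learning_rate": 3e-5,
        "num_train_epochs": 4,
        "dropout_prob": 0.1,
    }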

For a fair comparison, we kept the same hyper-parameter settings described above across all tasks and models. Phang et al. (2019) discussed that BERT's performance becomes unstable when the training dataset for fine-tuning is small.

12https://aclweb.org/aclwiki/Recognizing_Textual_Entailment

¹³run_classifier.py in https://github.com/huggingface/pytorch-pretrained-BERT

In our evaluation, performance was stable when setting the same hyper-parameters, but further investigation is left for future work.

6 Results and Discussion

6.1 Effect on Semantic Equivalence Assessment Tasks

Table 3 shows the fine-tuning results on GLUE; our model, denoted as Transfer Fine-Tuning, is compared against BERT-base and BERT-large. The first set of columns shows the results on the semantic equivalence assessment tasks. Our model outperformed BERT-base on MRPC (+0.9 points) and STS-B (+2.7 points). Furthermore, it outperformed even BERT-large by 0.6 points on MRPC and by 1.4 points on STS-B, despite BERT-large having 3.1 times more parameters than our model. Devlin et al. (2019) described that the next-sentence prediction task in BERT's pre-training aims to train a model that understands sentence relations. Herein, we argue that such relations are effective at generating representations broadly transferable to various NLP tasks, but are too generic to generate representations for semantic equivalence assessment tasks. Our method allows semantic relations between sentences and phrases that are directly useful for this class of tasks to be learned.

These results support hypothesis H1, indicating that our approach is more effective than blindly enlarging the model size. A smaller model size is desirable for practical applications. We also applied our method to the BERT-large model, but its performance did not improve enough to warrant the larger model size. Further investigation regarding pre-trained model sizes is left for future work.

6.2 Effect of the Amount of Fine-Tuning Datasets

Our method did not improve upon BERT-base on QQP. We consider this is because the large QQP training set (364k sentence pairs) allows the BERT model to converge to a certain optimum. This also relates to hypothesis H2.


Task    Train. size   BERT-base   Transfer Fine-Tuning
MRPC    1k            81.6        88.1 (+6.5)
        all (3.7k)    89.4        90.2 (+0.8)
STS-B   1k            83.4        86.2 (+2.8)
        all (5.7k)    88.1        90.1 (+2.0)
QQP     1k            69.9        71.4 (+1.5)
        5k            75.5        76.3 (+0.8)
        10k           77.0        77.6 (+0.6)
        20k           79.6        79.5 (−0.1)
        all (364k)    87.7        87.7 (±0.0)

Table 4: Development set scores of the BERT-base model and our model (and their differences) fine-tuned using subsampled and full-size training sets.


To investigate the effect of the size of the training set, we fine-tuned our model and BERT-base on the semantic equivalence assessment tasks using randomly subsampled training sets. Table 4 shows the scores on the development sets.¹⁴ The results clearly indicate that our method is more beneficial when the training data of a downstream task is limited, which supports hypothesis H2. This property is preferable for the transfer learning scenario that unsupervised sentence representation learning assumes.
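Subsampling a fine-tuning set can be sketched as follows; train_pairs is a toy stand-in for a GLUE training set, not real data.

    import random

    random.seed(0)  # fix the seed so the subsample is reproducible
    # Toy labeled sentence pairs standing in for a GLUE training set.
    train_pairs = [(f"sentence A{i}", f"sentence B{i}", i % 2) for i in range(5000)]
    subsample_1k = random.sample(train_pairs, k=1000)   # e.g., the 1k setting in Table 4
    print(len(subsample_1k))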

Another factor that may affect the performance is the domain mismatch between our paraphrase corpora and the QQP corpus. The former were mostly collected from news, while the latter was extracted from a social QA forum. In the future, we will investigate the effect of domains by generating multi-domain paraphrase corpora using the method proposed by Wieting and Gimpel (2018).

6.3 Effect on NLI Tasks

The second set of columns in Table 3 shows the results on the NLI tasks. Our model presents moderate improvements on most NLI tasks, which supports hypothesis H3. We consider this is because the majority of NLI tasks require uni-directional inference, in contrast to the bi-directional entailment relations of paraphrases.

Another reason is that our elaborate feature generation for the phrasal paraphrase classifier tightly fits the model to paraphrase identification. This contributes to performance improvements on that task but sacrifices the model's generality on relevant tasks.

¹⁴We used the development set because the GLUE server allows only two submissions per day. Note that the number of training epochs for fine-tuning is fixed in our experiments; hence, the development set was not used for other purposes.

We tackle this issue in our follow-up study reported in the Appendix.

Among the NLI tasks, our model largely outperformed BERT-base, by 5.0 points on RTE. This may again be due to the property of our method that brings improvement on tasks with a limited training set, as RTE has only 2.5k training sentence pairs.

6.4 Effect on Single-Sentence Tasks

The last two columns of Table 3 show the results on the single-sentence tasks, SST and CoLA, which are the tasks most distant from paraphrase classification. Our model presents a slightly lower score on SST compared to BERT-base and performs poorly on CoLA.

One potential reason for this degradation is that our training takes a sentence pair as input, which may weaken the ability to model a single sentence. Another cause is attributable to similarities between our training and fine-tuning tasks. For SST, sentiment analysis could be adversarial toward paraphrase discrimination: although paraphrasal sentences tend to have the same sentiment, sentences with the same sentiment do not generally hold a paraphrastic relation. For CoLA, semantic relations are unlikely to contribute to determining grammatical acceptability, as required by the CoLA task.

Together with the results in Sec. 6.3, hypothesis H4 is supported; the effectiveness of our method depends on the relevance between paraphrase discrimination and the downstream task. Our future work will examine what characteristics of NLP tasks make our method less effective.

6.5 Ablation Study

To verify hypothesis H5, we conducted an ablation study that investigates the independent effects of sentential and phrasal paraphrase classification. Table 5 shows the results; the last three rows show the performance when conducting only sentential paraphrase classification, only phrasal paraphrase classification, and binary classification of paraphrase and in-paraphrase pairs, respectively. All the models were fine-tuned in the same manner as described in Sec. 5.3.

First, the results support the hypothesis: sentential and phrasal paraphrase classification complement each other in sentence representation learning.


Model                  MRPC   STS-B   QQP    MNLI (m/mm)   RTE    QNLI   SST    CoLA
Transfer Fine-Tuning   89.2   87.4    71.2   83.9/83.1     64.8   89.3   93.1   47.2
BERT-base              88.3   84.7    71.2   84.3/83.0     59.8   89.1   93.3   52.7
+sentence              88.2   87.6    71.1   83.2/82.8     66.2   90.2   92.4   39.8
+3way-PP               88.2   85.8    70.9   82.9/81.9     65.8   88.0   91.3   32.6
+binary-PP             87.7   82.8    70.7   83.7/82.2     61.2   87.6   92.5   42.1

Table 5: Results of the ablation study, where the best scores are represented in bold and scores higher than those of BERT-base are underlined. The last three rows show performances when conducting only sentential paraphrase classification (+sentence), only phrasal paraphrase classification (+3way-PP), and binary classification of phrasal paraphrases (+binary-PP), respectively.

Our model achieved its best scores on the MRPC, MNLI-m/mm, SST, and CoLA tasks by conducting both sentential and phrasal paraphrase classification simultaneously. Interestingly, these scores are higher than those obtained when sentential and phrasal paraphrase classification are conducted independently. This is reasonable considering the process of fine-tuning. Sentential paraphrase classification directly affects the representation of [CLS], which is the primary tuning factor in fine-tuning for downstream tasks. Meanwhile, phrasal paraphrase classification affects the representations of phrases, which are the basis for generating the [CLS] representation. Simultaneously conducting both sentential and phrasal paraphrase classification thus creates synergy.

It is also obvious that the three-way classification of phrasal paraphrases, in which the model discriminates paraphrases, random combinations of phrases from a random pair of sentences, and random combinations of phrases within a paraphrasal sentence pair, is superior to binary classification. This shows that discriminating random combinations of phrases, which is a simpler and easier task, also contributes to representation learning.

7 Conclusion

We empirically demonstrate that sentential and phrasal paraphrase relations help sentence representation learning. While BERT's pre-training aims to generate generic representations transferable to a broad range of NLP tasks, our method generates representations suitable for the class of semantic equivalence assessment tasks. Our method achieves performance gains while maintaining the model size. Furthermore, it exhibits improvement on downstream tasks with limited amounts of training data for fine-tuning, which is a property crucial for transfer learning.

In the future, we plan to investigate the effects of our method on different sizes of BERT models. Additionally, we will apply our model to improve the alignment quality of the phrase alignment model.

Acknowledgments

We appreciate the anonymous reviewers for their insightful comments and suggestions to improve the paper. This work was supported by JST, ACT-I, Grant Number JPMJPR16U2, Japan.

References

Yuki Arase and Jun'ichi Tsujii. 2017. Monolingual phrase alignment on parse forests. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–11, Copenhagen, Denmark.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the International Workshop on Semantic Evaluation (SemEval), pages 1–14, Vancouver, Canada.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1657–1668, Vancouver, Canada.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680, Copenhagen, Denmark.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186.

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 350–356, Geneva, Switzerland.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 489–500, Brussels, Belgium.

Hua He and Jimmy Lin. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 937–948, San Diego, California.

David Kauchak. 2013. Improving text simplification language modeling using unsimplified text data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1537–1546, Sofia, Bulgaria.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), pages 3294–3302.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A continuously growing dataset of sentential paraphrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1224–1234, Copenhagen, Denmark.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv, 1901.11504.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In Proceedings of the International Conference on Learning Representations (ICLR).

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2227–2237, New Orleans, Louisiana.

Jason Phang, Thibault Fevry, and Samuel R. Bowman. 2019. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv, 1811.01088.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 866–876, Melbourne, Australia.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392, Austin, Texas.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642, Seattle, Washington, USA.

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In Proceedings of the International Conference on Learning Representations (ICLR).

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3772–3782.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL-IJCNLP, pages 1556–1566, Beijing, China.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the International Conference on Learning Representations (ICLR).

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv.

John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 451–462, Melbourne, Australia.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1112–1122, New Orleans, Louisiana.

Wei Wu, Houfeng Wang, Tianyu Liu, and Shuming Ma. 2018. Phrase-level self-attention networks for universal sentence encoding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3729–3738, Brussels, Belgium.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv, 1609.08144.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv, 1906.08237.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the International Conference on Computer Vision (ICCV), pages 19–27.

Appendix: Transfer Fine-Tuning with Simple Features

To further investigate the effects of transfer fine-tuning using paraphrase relations on BERT, we designed a model that generates the simplest possible feature to input into the classifier in Fig. 2. We assume that this method transmits learning signals to the underlying BERT in a more effective manner. Specifically, we use mean-pooling to generate the representations of the source and target phrases in Eq. (1) and Eq. (2), respectively. These representations are simply concatenated as a feature representation and then fed into the classifier.
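Below is a minimal sketch of this simple feature generation; shapes and spans are illustrative, not the authors' code.

    import torch
    import torch.nn as nn

    hidden = 768
    hidden_states = torch.randn(32, hidden)        # stand-in for BERT's final hidden states
    j, k, m, n = 3, 6, 15, 19                      # aligned phrase token spans (inclusive)

    h_s = hidden_states[j:k + 1].mean(dim=0)       # mean-pooled source phrase representation
    h_t = hidden_states[m:n + 1].mean(dim=0)       # mean-pooled target phrase representation
    features = torch.cat([h_s, h_t], dim=-1)       # simple concatenation, size 2λ

    classifier = nn.Linear(2 * hidden, 3)          # three-way phrasal paraphrase classifier
    probs = torch.softmax(classifier(features), dim=-1)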

Table 6 compares this new model (denoted as Simple Transfer Fine-Tuning) to the BERT models as well as to our model with the elaborate feature generation described in Sec. 4 (denoted as Transfer Fine-Tuning) on the semantic equivalence assessment and NLI tasks of the GLUE benchmark. Table 7 reports an ablation study. The results and findings are summarized as follows.

• Our model with simple feature generation (Simple Transfer Fine-Tuning) on BERT-base outperformed BERT on both the semantic equivalence assessment and NLI tasks. Furthermore, it performed on par with BERT-large on MRPC and outperformed it on STS-B and RTE, despite BERT-large having 3.1 times more parameters than our model.

• The same trend was confirmed for the model trained on BERT-large, where our model outperformed BERT-large on all the tasks except QNLI.

• Simple Transfer Fine-Tuning also outperformed our model with elaborate feature generation (Transfer Fine-Tuning) on all semantic equivalence assessment and NLI tasks except MRPC. This result implies that the elaborate feature generation tightly fits the model to paraphrase identification while sacrificing its generality on relevant tasks. Further investigation will be our future work.

• Sentential and phrasal paraphrase classification complement each other in sentence representation learning when using simple feature generation, as was also confirmed with the elaborate feature generation in Table 5. Simple Transfer Fine-Tuning achieved higher scores on the STS-B, RTE, and QNLI tasks than models trained with only sentential (+sentence) or only phrasal paraphrase (+3way-PP [Simple Feature]) classification.

• Simple feature generation improves the performance of the model trained with only phrasal paraphrase classification: +3way-PP [Simple Feature] outperformed +3way-PP [Elaborate Feature] on all tasks except RTE.


Model                         MRPC   STS-B   QQP    MNLI (m/mm)   RTE    QNLI
BERT-base                     88.3   84.7    71.2   84.3/83.0     59.8   89.1
Transfer Fine-Tuning          89.2   87.4    71.2   83.9/83.1     64.8   89.3
Simple Transfer Fine-Tuning   88.6   87.7    71.5   84.7/83.6     67.0   91.1

BERT-large                    88.6   86.0    72.1   86.2/85.5     65.5   92.7
Simple Transfer Fine-Tuning   89.9   87.1    72.5   86.5/85.6     68.2   92.2

Table 6: Test results on semantic equivalence assessment and NLI tasks scored by the GLUE evaluation server. The best scores for each task are represented in bold. The scores higher than those of the BERT counterparts (against BERT-base and BERT-large, respectively) are underlined. Our models with simple feature generation (Simple Transfer Fine-Tuning) consistently outperformed the BERT models and achieved the best scores for six out of seven tasks.

Model                          MRPC   STS-B   QQP    MNLI (m/mm)   RTE    QNLI
Simple Transfer Fine-Tuning    88.6   87.7    71.5   84.7/83.6     67.0   91.1
BERT-base                      88.3   84.7    71.2   84.3/83.0     59.8   89.1
+sentence                      88.2   87.6    71.1   83.2/82.8     66.2   90.2
+3way-PP [Elaborate Feature]   88.2   85.8    70.9   82.9/81.9     65.8   88.0
+3way-PP [Simple Feature]      89.0   86.6    71.5   84.7/83.6     65.6   90.6

Table 7: Results of the ablation study, where the best scores are represented in bold and scores higher than those of BERT-base are underlined. The third row shows performance when conducting only sentential paraphrase classification (+sentence), and the fourth row shows that when conducting only phrasal paraphrase classification with elaborate feature generation (+3way-PP [Elaborate Feature]), as reported in Table 5. The last row shows performance when conducting phrasal paraphrase classification with simple feature generation (+3way-PP [Simple Feature]). The results indicate that sentential and phrasal paraphrase classification complementarily contribute to Simple Transfer Fine-Tuning.
