
Do Syntax Trees Help Pre-trained Transformers Extract Information?

Devendra Singh Sachan 1,2, Yuhao Zhang 3, Peng Qi 3, William Hamilton 1,2
1 Mila - Quebec AI Institute
2 School of Computer Science, McGill University
3 Stanford University
[email protected], [email protected], {yuhaozhang, pengqi}@stanford.edu

Abstract

Much recent work suggests that incorporating syntax information from dependency trees can improve task-specific transformer models. However, the effect of incorporating dependency tree information into pre-trained transformer models (e.g., BERT) remains unclear, especially given recent studies highlighting how these models implicitly encode syntax. In this work, we systematically study the utility of incorporating dependency trees into pre-trained transformers on three representative information extraction tasks: semantic role labeling (SRL), named entity recognition, and relation extraction.

We propose and investigate two distinct strategies for incorporating dependency structure: a late fusion approach, which applies a graph neural network on the output of a transformer, and a joint fusion approach, which infuses syntax structure into the transformer attention layers. These strategies are representative of prior work, but we introduce essential design decisions that are necessary for strong performance. Our empirical analysis demonstrates that these syntax-infused transformers obtain state-of-the-art results on SRL and relation extraction tasks. However, our analysis also reveals a critical shortcoming of these models: we find that their performance gains are highly contingent on the availability of human-annotated dependency parses, which raises important questions regarding the viability of syntax-augmented transformers in real-world applications.

1 Introduction

Dependency trees, a form of syntactic representation that encodes an asymmetric syntactic relation between words in a sentence (such as subject or adverbial modifier), have proven very useful in various NLP tasks. For instance, features defined in terms of the shortest path between entities in a dependency tree were used in relation extraction (RE) (Fundel et al., 2006; Björne et al., 2009), parse structure has improved named entity recognition (NER) (Jie et al., 2017), and joint parsing was shown to benefit semantic role labeling (SRL) (Pradhan et al., 2005) systems. More recently, dependency trees have also led to meaningful performance improvements when incorporated into neural network models for these tasks. Popular encoders for including dependency trees in neural models are graph neural networks (GNNs) for SRL (Marcheggiani and Titov, 2017) and RE (Zhang et al., 2018), and biaffine attention in transformers for SRL (Strubell et al., 2018).

In parallel, there has been a renewed interest in investigating self-supervised learning approaches to pre-training neural models for NLP, with recent successes including ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019). Of late, the BERT model, based on pre-training a large transformer model (Vaswani et al., 2017) to encode bidirectional context, has emerged as a dominant paradigm, thanks to its improved modeling capacity which has led to state-of-the-art results in many NLP tasks.

BERT's success has also attracted attention to what linguistic information its internal representations capture. For example, Tenney et al. (2019) attribute different linguistic information to different BERT layers; Clark et al. (2019) analyze BERT's attention heads to find syntactic dependencies; Hewitt and Manning (2019) show evidence that BERT's hidden representation embeds syntactic trees. However, it remains unclear if this linguistic information helps BERT in downstream tasks after finetuning. Further, it is not evident if (external) syntactic information can further improve BERT's performance on downstream tasks.

In this paper, we investigate the extent to which pre-trained transformers can benefit from integrating external dependency tree information.


We provide the first systematic investigation of how dependency trees can be incorporated into pre-trained transformer models, focusing on three representative information extraction tasks where dependency trees have been shown to be particularly useful for neural models: semantic role labeling (SRL), named entity recognition (NER), and relation extraction (RE).

We propose two representative approaches to integrate dependency trees into pre-trained transformers (i.e., BERT) using syntax-based graph neural networks (syntax-GNNs). The first approach involves a sequential assembly of a transformer and a syntax-GNN, which we call Late Fusion, while the second approach interleaves syntax-GNN operations within transformer layers, termed Joint Fusion. These approaches are representative of recent work that combines transformers and syntax input, but we introduce important design decisions, e.g., the use of gating and the alignment between dependency tree structures and BERT wordpieces, that are necessary to achieve strong performance.

Comprehensive experiments using these architectures reveal several important insights:

• Both our syntax-augmented BERT models achieve a new state-of-the-art on the CoNLL-2005 and CoNLL-2012 SRL benchmarks when the gold trees are used, with the best variant outperforming a fine-tuned BERT model by over 3 F1 points on both datasets. The Late Fusion approach also provides performance improvements on the TACRED relation extraction dataset.

• These performance gains are consistent across different pre-trained transformer approaches of different sizes (i.e., BERTBASE/LARGE and RoBERTaBASE/LARGE).

• The Joint Fusion approach that interleaves GNNs with BERT achieves higher performance improvements on SRL, but it is also less stable and more prone to errors when using noisy dependency tree inputs, such as for the RE task, where Late Fusion performs much better, suggesting complementary strengths from both approaches.

• In the SRL task, the performance gains of both approaches are highly contingent on the availability of human-annotated parses for both training and inference, without which the performance gains are either marginal or non-existent. In the NER task, even the gold trees don't show performance improvements.

Our results highlight key design choices that are necessary to achieve gains from incorporating dependency trees into pre-trained transformers. However, our most important result is somewhat negative and cautionary: the performance gains are only substantial when human-annotated parses are available. Indeed, we find that even high-quality automated parses generated by domain-specific parsers do not suffice, and we are only able to achieve meaningful gains with human-annotated parses. This is a critical finding for future work, especially for SRL, as researchers routinely develop models with human-annotated parses, with the implicit expectation that models will generalize to high-quality automated parses.

Finally, our analysis provides indirect evidence that pre-trained transformers do incorporate sufficient syntactic information to achieve strong performance on downstream tasks. While human-annotated parses can still help greatly, with our proposed models it appears that the knowledge in automatically-extracted syntax trees is largely redundant with the implicit syntactic knowledge learned by pre-trained models such as BERT.

2 Models

In this section, we will first briefly review the transformer encoder, then describe the graph neural network (GNN) that learns syntax representations using dependency tree input, which we term the syntax-GNN. Next, we will describe our syntax-augmented BERT models that incorporate such representations learned from the GNN.

2.1 Transformer Encoder

The transformer encoder (Vaswani et al., 2017) consists of three core modules in sequence: an embedding layer, multiple encoder layers, and a task-specific output layer. The core elements in these modules are different sets of learnable weight matrices that perform linear transformations. The embedding layer consists of wordpiece embeddings, positional embeddings, and segment embeddings (Devlin et al., 2019). After embedding lookup, these three embeddings are added to obtain token embeddings for an input sentence. The encoder layers then transform the input token embeddings to hidden state representations. Each encoder layer consists of two sublayers: multi-head dot-product self-attention and a feed-forward network, which will be covered in the following section.

Figure 1: Block diagram illustrating the syntax-GNN applied over a sentence's dependency tree. In the example shown, for the word "have", the graph-attention sublayer aggregates representations from its three adjacent nodes in the dependency graph.

Finally, the output layer is task-specific and consists of a one-layer feed-forward network.

2.2 Syntax-GNN: Graph Neural Network over a Dependency Tree

A dependency tree can be considered as a multi-attribute directed graph where the nodes represent words and the edges represent the dependency relation between the head and dependent words. To learn useful syntax representations from the dependency tree structure, we apply graph neural networks (GNNs) (Hamilton et al., 2017; Battaglia et al., 2018) and henceforth call our model syntax-GNN. Our syntax-GNN encoder, as shown in Figure 1, is a variation of the transformer encoder where the self-attention sublayer is replaced by graph attention (Veličković et al., 2018). Self-attention can also be considered as a special case of graph attention where each word is connected to all the other words in the sentence.

Let $\mathcal{V} = \{v_i \in \mathbb{R}^d\}_{i=1:N_v}$ denote the input node embeddings and $\mathcal{E} = \{(e_k, i, j)\}_{k=1:N_e}$ denote the edges in the dependency tree, where edge $e_k$ is incident on nodes $i$ and $j$. Each layer in our syntax-GNN encoder consists of two sublayers: graph attention and a feed-forward network.

First, interaction scores $s_{ij}$ are computed for all the edges by taking a dot product between the linearly transformed embeddings of adjacent nodes:

$$s_{ij} = (v_i W_Q)(v_j W_K)^{\top}. \quad (1)$$

The terms $v_i W_Q$ and $v_j W_K$ are also known as the query and key vectors respectively. Next, an attention score $\alpha_{ij}$ is computed for each node by applying a softmax over the interaction scores from all its connecting edges:

$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(s_{ik})}, \quad (2)$$

where $\mathcal{N}_i$ refers to the set of nodes connected to the $i$-th node. The graph-attention output $z_i$ is computed by aggregating the attention-weighted value vectors followed by a linear transformation:

$$z_i = \Big( \sum_{j \in \mathcal{N}_i} \alpha_{ij} (v_j W_V) \Big) W_F. \quad (3)$$

The term $v_j W_V$ is also referred to as the value vector. Subsequently, the message $z_i$ is passed to the second sublayer, which consists of a two-layer fully connected feed-forward network with a GELU activation (Hendrycks and Gimpel, 2016):

$$\mathrm{FFN}(z_i) = \mathrm{GELU}(z_i W_1 + b_1) W_2 + b_2. \quad (4)$$

The FFN sublayer outputs are given as input to the next layer. In the above equations, $W_K$, $W_V$, $W_Q$, $W_F$, $W_1$, $W_2$ are trainable weight matrices and $b_1$, $b_2$ are bias parameters. Additionally, layer normalization (Ba et al., 2016) is applied to the input and residual connections (He et al., 2016) are added to the output of each sublayer.
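As a concrete illustration of Eqs. (1)-(4), the PyTorch-style sketch below implements a single-head graph-attention layer whose attention is masked to the dependency edges. It is a minimal sketch under our reading of the equations, not the authors' released code: multi-head attention, layer normalization, and residual connections are omitted, and the boolean adjacency matrix `adj` (assumed here to include self-loops so every row has at least one allowed edge) is a hypothetical input built from the tree.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head sketch of the syntax-GNN sublayers (Eqs. 1-4)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_K
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W_V
        self.w_f = nn.Linear(d_model, d_model, bias=False)  # W_F
        # Eq. 4: two-layer feed-forward network with GELU activation.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, v: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # v:   (num_nodes, d_model) input node embeddings
        # adj: (num_nodes, num_nodes) bool, True where a dependency edge (or self-loop) exists
        s = self.w_q(v) @ self.w_k(v).T                # Eq. 1: interaction scores s_ij
        s = s.masked_fill(~adj, float("-inf"))         # restrict attention to tree edges
        alpha = F.softmax(s, dim=-1)                   # Eq. 2: attention scores alpha_ij
        z = self.w_f(alpha @ self.w_v(v))              # Eq. 3: aggregate values, then apply W_F
        return self.ffn(z)                             # Eq. 4: FFN sublayer output
```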

2.2.1 Dependency Tree over Wordpieces

As BERT models take as input subword units (also known as wordpieces) instead of linguistic tokens, this also necessitates extending the definition of a dependency tree to include wordpieces. For this, we introduce additional edges in the original dependency tree by defining new edges from the first subword (head word) of a token to the remaining subwords (tail words) of the same token.
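A minimal sketch of this wordpiece extension is shown below; `edges` and `tok2wp` (a token-to-wordpiece alignment from the tokenizer) are hypothetical names for illustration, not identifiers from the paper.

```python
from typing import Dict, List, Tuple

def extend_edges_to_wordpieces(
    edges: List[Tuple[int, int]],      # (head_token, dependent_token) pairs from the parse
    tok2wp: Dict[int, List[int]],      # token index -> indices of its wordpieces, in order
) -> List[Tuple[int, int]]:
    wp_edges = []
    # Map each original dependency edge onto the first wordpiece of its two tokens.
    for head, dep in edges:
        wp_edges.append((tok2wp[head][0], tok2wp[dep][0]))
    # Attach the remaining (tail) wordpieces of every token to its first (head) wordpiece.
    for pieces in tok2wp.values():
        wp_edges.extend((pieces[0], tail) for tail in pieces[1:])
    return wp_edges
```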

2.3 Syntax-Augmented BERT

In this section, we propose parameter augmentations over the BERT model to best incorporate syntax information from a syntax-GNN. To this end, we introduce two models: Late Fusion and Joint Fusion. These models represent novel mechanisms, inspired by previous work, through which syntax-GNN features are incorporated at different sublayers of BERT (Figure 2). We refer to these models as Syntax-Augmented BERT (SA-BERT) models. During the finetuning step, the new parameters in each model are randomly initialized while the existing parameters are initialized from pre-trained BERT.

Late Fusion: In this model, we feed the BERT contextual representations to the syntax-GNN encoder, i.e., the syntax-GNN is stacked over BERT (Figure 2a). We also use a Highway Gate (Srivastava et al., 2015) at the output of the syntax-GNN encoder to adaptively select useful representations for the training task. Concretely, if $v_i$ and $z_i$ are the representations from BERT and the syntax-GNN respectively, then the output $h_i$ after the gating layer is computed as

$$g_i = \sigma(W_g v_i + b_g), \quad (5)$$

$$h_i = g_i \odot v_i + (1 - g_i) \odot z_i, \quad (6)$$

where $\sigma$ is the sigmoid function $1/(1 + e^{-x})$, $\odot$ denotes elementwise multiplication, and $W_g$ is a learnable parameter. Finally, we map the output representations back to linguistic tokens by adding the hidden states of all wordpieces that map to the same linguistic token.
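A minimal sketch of the highway gate in Eqs. (5)-(6) follows; the module and parameter names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HighwayGate(nn.Module):
    """Mix BERT states v_i with syntax-GNN states z_i (Eqs. 5-6)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)   # W_g and b_g

    def forward(self, v: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(v))           # Eq. 5: gate values in (0, 1)
        return g * v + (1.0 - g) * z              # Eq. 6: elementwise interpolation
```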

Joint Fusion: In this model, syntax representations are incorporated within the self-attention sublayer of BERT. The motivation is to jointly attend over both syntax and BERT representations. First, the syntax-GNN representations are computed from the input word embeddings and its final-layer hidden states are passed to BERT. Second, as shown in Figure 2b, the syntax-GNN hidden states are linearly transformed using weights $P_K$, $P_V$ to obtain additional syntax-based key and value vectors. Third, the syntax-based key and value vectors are added to BERT's self-attention sublayer key and value vectors respectively. Fourth, the query vector in the self-attention sublayer now attends over this set of keys and values, thereby augmenting the model's ability to fuse syntax information. Overall, in this model, we introduce two new sets of weights per layer, $\{P_K, P_V\}$, which are randomly initialized.

Figure 2: Block diagrams illustrating our proposed syntax-augmented BERT models: (a) Late Fusion and (b) Joint Fusion. Weights shown in color are pre-trained while those not colored are either non-parameterized operations or have randomly initialized weights. The inputs to each of these models are wordpiece embeddings, while their output goes to task-specific output layers. In subfigure 2b, N× indicates that there are N layers, each of which is passed the same set of syntax-GNN hidden states.
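The single-head sketch below illustrates our reading of Figure 2b: the syntax-GNN hidden states are projected with the new weights P_K, P_V and added elementwise to BERT's key and value vectors before attention. It is a hedged sketch, not the released implementation; multi-head splitting, dropout, and output projections are omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFusionAttention(nn.Module):
    """Single-head self-attention with syntax-based keys and values added in."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)    # pre-trained W_Q
        self.w_k = nn.Linear(d_model, d_model)    # pre-trained W_K
        self.w_v = nn.Linear(d_model, d_model)    # pre-trained W_V
        self.p_k = nn.Linear(d_model, d_model)    # new, randomly initialized P_K
        self.p_v = nn.Linear(d_model, d_model)    # new, randomly initialized P_V

    def forward(self, h: torch.Tensor, syn: torch.Tensor) -> torch.Tensor:
        # h:   (seq_len, d_model) previous-layer BERT hidden states
        # syn: (seq_len, d_model) syntax-GNN hidden states (shared across layers)
        q = self.w_q(h)
        k = self.w_k(h) + self.p_k(syn)           # fuse syntax into the keys
        v = self.w_v(h) + self.p_v(syn)           # fuse syntax into the values
        attn = F.softmax(q @ k.T / math.sqrt(h.size(-1)), dim=-1)
        return attn @ v
```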

3 Experimental Setup

3.1 Tasks and Datasets

For our experiments, we consider information extraction tasks for which dependency trees have been extensively used in the past to improve model performance. Below, we provide a brief description of these tasks, the datasets used, and task-specific modeling details.

Semantic Role Labeling (SRL): In this task, the objective is to assign semantic role labels to text spans in a sentence such that they answer the query: who did what to whom and when? Specifically, for every target predicate (verb) of a sentence, we detect syntactic constituents (arguments) and classify them into predefined semantic roles. In our experiments, we study the setting where the predicates are given and the task is to predict the arguments. We use the CoNLL-2005 SRL corpus (Carreras and Màrquez, 2005) and the CoNLL-2012 OntoNotes dataset,1 which contains PropBank-style annotations for predicates and their arguments, and also includes POS tags and constituency parses. We model SRL as a sequence tagging task using a linear-chain CRF (Lafferty et al., 2001) as the last layer. During inference, we perform decoding using the Viterbi algorithm (Forney, 1973). To highlight the predicate position in the sentence, we use indicator embeddings as input to the model.
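For reference, the following self-contained numpy sketch shows Viterbi decoding over per-token label scores and a label-transition matrix, as used at inference time with the CRF layer; shapes and names are illustrative rather than taken from the paper.

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    # emissions:   (seq_len, num_labels) per-token label scores from the tagger
    # transitions: (num_labels, num_labels) score of moving from label i to label j
    seq_len, _ = emissions.shape
    score = emissions[0].copy()                        # best score ending in each label
    backptr = np.zeros_like(emissions, dtype=int)
    for t in range(1, seq_len):
        # candidate[i, j] = best score up to t-1 ending in label i, plus transition i -> j
        candidate = score[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]
    # Follow back-pointers from the best final label to recover the label sequence.
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```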

Named Entity Recognition (NER): NER is the task of recognizing entity mentions in text and tagging them with entity categories. We use the OntoNotes 5.0 dataset (Pradhan et al., 2012), which contains 18 named entity types. Similar to SRL, we model NER as a sequence tagging task, and use a linear-chain CRF layer over the model's hidden states. Sequence decoding is performed using the Viterbi algorithm.

Relation Extraction (RE): RE is the task of predicting the relation between the two entity mentions in a sentence. For our experiments, we use the label-corrected version of the TACRED dataset (Zhang et al., 2017; Alt et al., 2020), which contains 41 relation types as well as a special no_relation class indicating that no relation exists between the two entities. As is common in prior work (Zhang et al., 2018; Miwa and Bansal, 2016), the dependency tree is pruned such that the subtree rooted at the lowest common ancestor of the entity mentions is given as input to the syntax-GNN. Following Zhang et al. (2018), we extract sentence representations by applying a max-pooling operation over the hidden states. We also concatenate the entity representations with the sentence representation before the final classification layer.
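The sketch below illustrates the output side of this setup: a max-pooled sentence representation concatenated with pooled subject and object entity representations before a linear relation classifier. The masks and module names are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Classify the relation from pooled sentence and entity representations."""

    def __init__(self, d_model: int, num_relations: int):
        super().__init__()
        self.classifier = nn.Linear(3 * d_model, num_relations)

    def forward(self, hidden: torch.Tensor, subj_mask: torch.Tensor, obj_mask: torch.Tensor):
        # hidden: (seq_len, d_model); *_mask: (seq_len,) bool marking the entity tokens
        sent = hidden.max(dim=0).values               # max-pooled sentence representation
        subj = hidden[subj_mask].max(dim=0).values    # pooled subject entity representation
        obj = hidden[obj_mask].max(dim=0).values      # pooled object entity representation
        return self.classifier(torch.cat([sent, subj, obj], dim=-1))
```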

1 http://conll.cemantix.org/2012/data.html

3.2 Training Details

We select bert-base-cased to be our reference pre-trained baseline model.2 It consists of 12 layers, 12 attention heads, and 768 model dimensions. In both variants, the syntax-GNN component consists of 4 layers, while other configurations are kept the same as bert-base. Also, for the Joint Fusion method, syntax-GNN hidden states were shared across different layers. It is worth noting that, as our objective is to assess if the use of dependency trees can provide performance gains over pre-trained Transformer models, it is important to tune the hyperparameters of these baseline models to obtain strong reference scores. Therefore, for each task, during the finetuning step, we tune the hyperparameters of the default bert-base model, and use the same hyperparameters to train the SA-BERT models. We refer the reader to Appendix A for additional training details.

3.3 Additional Baselines

Besides BERT models, we also compare our results to the following previous work, which obtained good performance gains from incorporating dependency trees into neural models:

• For SRL, we include results from the SA (self-attention) and LISA (linguistically-informed self-attention) models of Strubell et al. (2018). In LISA, the attention computation in one attention head of the Transformer is biased to enforce that dependent words only attend to their head words. The models were trained using both GloVe (Pennington et al., 2014) and ELMo embeddings.

• For NER, we report the results from Jie and Lu (2019), where they concatenate the child token, head token, and relation embeddings as input to an LSTM and then fuse the child and head hidden states.

• For RE, we report the results of the GCN model from Zhang et al. (2018), where they apply graph convolutional networks on pruned dependency trees over LSTM states.

4 Results and Analysis

In this section, we present our main empirical analyses and key findings.

2 The bert-base configuration was preferred due to computational reasons, and we found that bert-cased provided substantial gains over bert-uncased in the NER task.


Test Set                                     P      R      F1
Baseline Models (without dependency parses)
SA+GloVe†                                    84.17  83.28  83.72
SA+ELMo†                                     86.21  85.98  86.09
BERTBASE                                     86.97  88.01  87.48
Gold Dependency Parses
Late Fusion                                  89.17  91.09  90.12
Joint Fusion                                 90.59  91.35  90.97

Table 1: SRL results on the CoNLL-2005 WSJ test set, averaged over 5 independent runs. † marks results from Strubell et al. (2018).

Test Set                                     P      R      F1
Baseline Models (without dependency parses)
SA+GloVe†                                    82.55  80.02  81.26
SA+ELMo†                                     84.39  82.21  83.28
Deep-LSTM+ELMo‡                              -      -      84.60
BERTBASE                                     85.91  87.07  86.49
Gold Dependency Parses
Late Fusion                                  88.06  90.32  89.18
Joint Fusion                                 89.34  90.44  89.89

Table 2: SRL results on the CoNLL-2012 test set, averaged over 5 independent runs. † marks results from Strubell et al. (2018); ‡ marks results from Peters et al. (2018).

4.1 Benchmark Performance

To recap, our two proposed variants of the Syntax-Augmented BERT models in Section 2.3 mainly differ in the position where the syntax-GNN outputs are fused with the BERT hidden states. Following this, we first compare the effectiveness of these variants on all three tasks, comparing against the baselines outlined in the previous section. For this part we use gold dependency parses to train the models for SRL and NER, and predicted parses for RE, since gold dependency parses are not available for TACRED.

We present our main results for SRL in Table 1 and Table 2, NER in Table 3, and RE in Table 4. All these results report average performance over five runs with different random seeds. First, we note that for all the tasks, our bert-base baseline is quite strong and is directly competitive with other state-of-the-art models.

Test Set                                     P      R      F1
Baseline Models (without dependency parses)
BiLSTM-CRF+ELMo†                             88.25  89.71  88.98
BERTBASE                                     88.75  89.61  89.18
Gold Dependency Parses
DGLSTM-CRF+ELMo†                             89.59  90.17  89.88
Late Fusion                                  88.75  89.19  88.97
Joint Fusion                                 88.58  89.31  88.94

Table 3: NER results on the OntoNotes-5.0 test set, averaged over 5 independent runs. † marks results from Jie and Lu (2019).

Test Set                                     P      R      F1
Baseline Models (without dependency parses)
BERTBASE                                     78.04  76.36  77.09
Stanford CoreNLP Dependency Parses
GCN†                                         74.2   69.3   71.7
GCN+BERTBASE†                                74.8   74.1   74.5
Late Fusion                                  78.55  76.29  77.38
Joint Fusion                                 70.22  75.12  72.52

Table 4: Relation extraction results on the revised TACRED test set (Alt et al., 2020), averaged over 5 independent runs. † marks results reported by Alt et al. (2020).

We observe that both the Late Fusion and Joint Fusion variants of our approach yielded the best results in the SRL tasks. Specifically, on CoNLL-2005 and CoNLL-2012 SRL, Joint Fusion improves over bert-base by an absolute 3.5 F1 points, while Late Fusion improves over bert-base by 2.65 F1 points. On the RE task, the Late Fusion model improves over bert-base by approximately 0.3 F1 points, while the Joint Fusion model leads to a drop of 4.5 F1 points in performance (which we suspect is driven by the longer sentence lengths observed in TACRED). On NER, the SA-BERT models lead to no performance improvements, as their scores lie within one standard deviation of that of bert-base.

Overall, we find that syntax information is most useful to the pre-trained transformer models in the SRL task, especially when intermixing the intermediate representations of BERT with representations from the syntax-GNN. Moreover, when the fusion is done after the final hidden layer of the pre-trained models, apart from providing good gains on SRL, it also provides small gains on the RE task.

Test Set                                     P      R      F1
Stanza Dependency Parses (UAS: 84.2)
Late Fusion                                  86.85  88.06  87.45
Joint Fusion                                 86.87  87.85  87.36
In-domain Dependency Parses (UAS: 92.66)
LISA+GloVe†                                  85.53  84.45  84.99
LISA+ELMo†                                   87.13  86.67  86.90
Late Fusion                                  86.80  87.98  87.39
Joint Fusion                                 87.09  87.95  87.52
Gold Dependency Parses
Late Fusion                                  89.17  91.09  90.12
Joint Fusion                                 90.59  91.35  90.97

Table 5: SRL results with different parses on the CoNLL-2005 WSJ test set, averaged over 5 independent runs. † marks results from Strubell et al. (2018).

We further note that, as we trained all our Syntax-Augmented BERT models using the same hyperparameters as bert-base, it is possible that separate hyperparameter tuning would further improve their performance.

4.2 Impact of Parse Quality

In this part, we study to what extent parsing quality can affect the performance of the resulting Syntax-Augmented BERT models. Specifically, following existing work, we compare the effect of using parse trees from three different sources: (a) gold syntactic annotations;3 (b) a dependency parser trained using gold, in-domain parses;4 and (c) available off-the-shelf NLP toolkits.5 In previous work, it was shown that using in-domain parsers can provide good improvements on SRL (Strubell et al., 2018) and NER tasks (Jie and Lu, 2019), and that the performance can be further improved when gold parses are used at test time.

3 We use Stanford head rules (de Marneffe and Manning, 2008) implemented in Stanford CoreNLP v4.0.0 (Manning et al., 2014) to convert constituency trees to dependency trees in UDv2 format (Nivre et al., 2020).

4 The difference between settings (a) and (b) is at test time. In (a), gold parses are used for both training and test instances, while in (b), gold parses are used for training, but at test time parses are extracted from a dependency parser that was trained on gold parses.

5 In this setting, the parsers are trained on general datasets such as the Penn Treebank or the English Web Treebank.

Test Set                                     P      R      F1
Stanza Dependency Parses (UAS: 83.91)
Late Fusion                                  88.83  89.42  89.12
Joint Fusion                                 88.56  89.38  88.97
In-domain Dependency Parses (UAS: 96.10)
DGLSTM-CRF+ELMo†                             -      -      89.64
Gold Dependency Parses
DGLSTM-CRF+ELMo†                             89.59  90.17  89.88
Late Fusion                                  88.75  89.19  88.97
Joint Fusion                                 88.58  89.31  88.94

Table 6: NER results with different parses on the OntoNotes-5.0 test set, averaged over 5 independent runs. † marks results from Jie and Lu (2019).

Meanwhile, in many practical settings where gold parses are not readily available, the only option is to use parse trees produced by existing NLP toolkits, as was done by Zhang et al. (2018) for RE. In these cases, since the parsers are trained on a different domain of text, it is unclear if the produced trees, when used with the SA-BERT models, can still lead to performance gains. Motivated by these observations, we investigate to what extent gold, in-domain, and off-the-shelf parses can improve performance over strong BERT baselines.

Comparing off-the-shelf and gold parses. We report our findings on the CoNLL-2005 SRL (Table 5) and OntoNotes-5.0 NER (Table 6) tasks. Using gold parses, both the Late Fusion and Joint Fusion models obtain greater than 2.5 F1 improvement on SRL compared with bert-base, while we don't observe significant improvements on NER. We further note that, as the gold parses are produced by expert human annotators, these results can be considered the attainable performance ceiling from using parse trees in these models.

We also observe that using off-the-shelf parses from the Stanza toolkit (Qi et al., 2020) provides little to no gains in F1 scores (see Tables 5 and 6). This is mainly due to the low in-domain accuracy of the predicted parses. For example, on the SRL test set the UAS is 84.2% for the Stanza parser, which is understandable as the parser was trained on the EWT corpus, which covers a different domain.
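For reference, off-the-shelf parses of the kind used here can be obtained from the Stanza toolkit roughly as follows; this is a generic usage sketch of Stanza's dependency parser, not the paper's exact preprocessing pipeline.

```python
import stanza

# Requires the English models, e.g. via stanza.download("en").
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse",
                      tokenize_pretokenized=True)
doc = nlp([["The", "luxury", "auto", "maker", "sold", "1,214", "cars"]])
for sent in doc.sentences:
    for word in sent.words:
        # word.head is the 1-based index of the head token (0 denotes the root).
        print(word.id, word.text, word.head, word.deprel)
```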

In a more fine-grained error analysis, we also examined the correlation between parse quality and performance on individual examples on CoNLL-2005 (Figures 3 and 4), finding a mild but significant positive correlation between parse quality and relative model performance when training and testing with Stanza parses (Figure 3). Interestingly, we found that this correlation between parse quality and validation performance is much stronger when we train a model on gold parses but then evaluate with noisy Stanza parses (Figure 4). This suggests that the model trained on noisy parses tends to simply ignore the noisy dependency tree inputs, while the model trained on gold parses is more sensitive to the external syntactic input.

Do in-domain parses help? Lastly, for the setting of using in-domain parses, we only evaluate on the SRL task, since on the NER task even using gold parses does not yield substantial gains. We train a biaffine parser (Dozat and Manning, 2017) on the gold parses from the CoNLL-2005 training set and obtain parse trees from it at test time. We observe that while the obtained parse trees are fairly accurate (with a UAS of 92.6% on the test set), they lead to marginal or no improvements over bert-base. This finding is similar to the results obtained by Strubell et al. (2018), where their LISA+ELMo model only obtains a relatively small improvement over SA+ELMo. We hypothesize that as the accuracy of the predicted parses further increases, the F1 scores would move closer to those obtained using gold parses. One possible reason for these marginal gains from using the in-domain parses is that, as they are still imperfect, the errors in the parse edges force the model to ignore the syntax information.
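The per-example analysis behind Figures 3 and 4 can be reproduced along the following lines: compute a sentence-level UAS against the gold heads and regress the per-example F1 difference on it. The function and variable names below are hypothetical; only the reported slopes and intercepts come from the paper.

```python
import numpy as np
from scipy import stats

def sentence_uas(pred_heads, gold_heads) -> float:
    # Unlabeled attachment score: fraction of tokens whose predicted head matches gold.
    pred, gold = np.asarray(pred_heads), np.asarray(gold_heads)
    return 100.0 * float((pred == gold).mean())

def parse_quality_correlation(uas_scores, f1_with_stanza, f1_with_gold):
    # Fit F1(Stanza parses) - F1(gold parses) against per-example UAS.
    delta = np.asarray(f1_with_stanza) - np.asarray(f1_with_gold)
    fit = stats.linregress(np.asarray(uas_scores), delta)
    return fit.slope, fit.intercept, fit.rvalue
```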

Overall, we conclude that parsing quality has a drastic impact on the performance of the Syntax-Augmented BERT models, with substantial gains only observed in settings where gold parses are used.

4.3 Generalizing to BERT Variants

Our previous results used the bert-base setting, which is a relatively small configuration among pre-trained models. Devlin et al. (2019) also proposed a large model setting (bert-large)6 that outperformed bert-base in all benchmark tasks. These large BERT models were further improved by modifying the training objective to predict all of a word's wordpieces (whole-word masking),7 whereas the default approach was to predict one wordpiece.

6 24 layers, 16 attention heads, 1024 model dimensions.

7 https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip

Figure 3: Correlation between parse quality (UAS) and the difference in F1 scores between models trained using Stanza and gold parses on the CoNLL-2005 SRL WSJ dataset. We observe a small positive correlation between the performance difference and UAS, suggesting that as the UAS of the Stanza parse increases, the model makes fewer errors. The slope of the fitted linear regression model is 0.075 and the intercept is -9.27.

More recently, Liu et al. (2019) proposed RoBERTa, a better-optimized variant of BERT that demonstrated improved results. A research question that naturally arises is: Is syntactic information equally useful for these more powerful pre-trained transformers, which were pre-trained in a different way than bert-base? To answer this, we finetune these models, with and without Late Fusion, on the CoNLL-2005 SRL task using gold parses and report their performance in Table 7.8

As expected, we observe that both the bert-large and bert-wwm models outperform bert-base, likely due to the larger model capacity from increased width and more layers. Our Late Fusion model consistently improves the results over the underlying BERT models by about 2.2 F1. The RoBERTa models achieve improved results compared with the BERT models. And again, our Late Fusion model further improves the RoBERTa results by about 2 F1. Thus, it is evident that the gains from the Late Fusion model generalize to other widely used pre-trained transformer models.

8 We use the Late Fusion model with gold parses in this section, as it is computationally more efficient to train than the Joint Fusion model.


Figure 4: Correlation between parse quality (UAS) and the difference in F1 scores when inference is done using Stanza parses on a model trained with gold parses on the CoNLL-2005 SRL WSJ split. We see that as the accuracy of the Stanza parses increases, the difference in F1 scores between the Stanza and gold parses drops. The slope of the fitted linear regression model is 0.345 and the intercept is -38.9.

Gold Parses                                  P      R      F1
BERT
BERTLARGE                                    88.14  88.84  88.49
Late Fusion                                  89.86  91.57  90.70
BERTWWM                                      88.04  88.87  88.45
Late Fusion                                  89.88  91.63  90.75
RoBERTa
RoBERTaLARGE                                 89.14  89.90  89.47
Late Fusion                                  90.89  92.08  91.48

Table 7: SRL results from using different pre-trained models on the CoNLL-2005 WSJ test set, averaged over 5 independent runs. WWM indicates whole-wordpiece masking.

5 Related Work

5.1 Incorporation of Syntax into NLP Models

Relation Extraction: Neural network models have shown performance improvements when the shortest dependency path between entities is incorporated in sentence encoders: Liu et al. (2015) apply a combination of recursive neural networks and CNNs; Miwa and Bansal (2016) apply tree-LSTMs for joint entity and relation extraction; and Zhang et al. (2018) apply graph convolutional networks (GCNs) over LSTM features.

Semantic Role Labeling: Recently, several approaches have been proposed to incorporate dependency trees within neural SRL models, such as learning embeddings of the dependency path between predicate and argument words (Roth and Lapata, 2016); combining GCN-based dependency tree representations with LSTM-based word representations (Marcheggiani and Titov, 2017); and linguistically-informed self-attention in one Transformer attention head (Strubell et al., 2018).

Named Entity Recognition: Syntax has also been found to be useful for NER, as it simplifies modeling interactions between multiple entity mentions in a sentence (Finkel and Manning, 2009). To model syntax on the OntoNotes-5.0 NER task, Jie and Lu (2019) feed the concatenated child token, head token, and relation embeddings to an LSTM and then fuse the child and head hidden states.

5.2 Pre-trained Text Encoders and Syntax

More recently, pre-training Transformers using a masked language modeling objective, as was done by BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019), has shown much improved results in many NLP tasks. The successes of pre-trained Transformers, especially that of BERT, have motivated researchers to investigate what kind of internal representations these models capture. For example, recent research (Tenney et al., 2019; Vig and Belinkov, 2019) provides evidence that different layers of BERT specialize in encoding representations for tasks such as POS tagging, parsing, coreference, etc. It was shown by Clark et al. (2019) that some attention heads of BERT can learn to identify syntactic dependencies. Further, Hewitt and Manning (2019) demonstrate that, to a varying degree, dependency trees are embedded within the geometry of BERT's hidden states.

6 Conclusion

In this work, we explore the utility of incorporating syntax information from dependency trees into pre-trained transformers when applied to the information extraction tasks of SRL, NER, and RE. To do so, we compute dependency tree embeddings using a syntax-GNN and propose two models to fuse these dependency tree embeddings into transformers. Our experiments reveal several important findings: syntax representations are most helpful for the SRL task when fused within the pre-trained representations, and these performance gains on the SRL task are contingent on the quality of the dependency parses. We also notice that these models don't provide any performance improvements on NER. Lastly, for the RE task, syntax representations are most helpful when incorporated on top of pre-trained representations.

For future work, we plan to jointly train the Syntax-Augmented BERT models in a multi-task learning setup on SRL, NER, and coreference tasks with the goal of further improving performance. We also plan to do an in-depth analysis to study the factors that contribute to the performance gains of gold parses over extracted parses in the SRL task.

References

Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Peter Battaglia, Jessica Blake Chandler Hamrick, Victor Bapst, Alvaro Sanchez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andy Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Jayne Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.

Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, and Tapio Salakoski. 2009. Extracting complex biological events with rich graph-based feature sets. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task.

Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005).

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations (ICLR).

Jenny Rose Finkel and Christopher D. Manning. 2009. Joint parsing and named entity recognition. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

G. D. Forney. 1973. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.

Katrin Fundel, Robert Küffner, and Ralf Zimmer. 2006. RelEx – Relation extraction using dependency parse trees. Bioinformatics.

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 40:52–74.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. arXiv preprint arXiv:1606.08415.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Zhanming Jie and Wei Lu. 2019. Dependency-guided LSTM-CRF for named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Zhanming Jie, Aldrian Obaja Muis, and Wei Lu. 2017. Efficient dependency-guided named entity recognition. In Thirty-First AAAI Conference on Artificial Intelligence.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In The 2015 International Conference for Learning Representations.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In COLING 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task.

Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James Martin, and Daniel Jurafsky. 2005. Semantic role labeling using different syntactic views. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05).

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical Report.

Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Advances in Neural Information Processing Systems.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations (ICLR).

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

A Additional Training Details

During the finetuning step, the new parameters in each model are randomly initialized while the existing parameters are initialized from pre-trained BERT. For regularization, we apply dropout (Srivastava et al., 2014) with p = 0.1 to attention coefficients and hidden states. For all datasets we use the canonical training, development, and test splits. We use the Adam optimizer (Kingma and Ba, 2015) for finetuning.

We observed that an initial learning rate of 2e-5 with linear decay worked well for all tasks. For model training to converge, we found 10 epochs were sufficient for CoNLL-2012 SRL and RE, and 20 epochs for CoNLL-2005 SRL and NER. We evaluate the test set performance using the best-performing checkpoint on the development set.
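A minimal sketch of the optimizer and learning-rate schedule described above (Adam, initial learning rate 2e-5, linear decay over training) is given below; the helper name and the use of a plain PyTorch scheduler are our own choices, not the authors' training script.

```python
import torch

def make_optimizer_and_scheduler(model, num_training_steps: int, lr: float = 2e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Linearly decay the learning rate from lr to 0; call scheduler.step() once per update.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / num_training_steps)
    )
    return optimizer, scheduler
```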

For evaluation, following convention, we report the micro-averaged precision, recall, and F1 scores for every task. For variance control, in all experiments we report the mean of the results obtained from five independent runs with different seeds.