arXiv:2010.03009v2 [cs.CL] 17 Feb 2021


GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction

Wasi Uddin Ahmad, Nanyun Peng, Kai-Wei Chang
University of California, Los Angeles

{wasiahmad, violetpeng, kwchang}@cs.ucla.edu

Abstract

Recent progress in cross-lingual relation and event extraction uses graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations, such that models trained on one language can be applied to other languages. However, GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree. To address these challenges, we propose to utilize the self-attention mechanism, into which we explicitly fuse structural information, to learn the dependencies between words at different syntactic distances. We introduce GATE, a Graph Attention Transformer Encoder, and test its cross-lingual transferability on relation and event extraction tasks. We perform experiments on the ACE05 dataset, which includes three typologically different languages: English, Chinese, and Arabic. The evaluation results show that GATE outperforms three recently proposed methods by a large margin. Our detailed analysis reveals that, due to its reliance on syntactic dependencies, GATE produces robust representations that facilitate transfer across languages.

1 Introduction

Relation and event extraction are two challenging information extraction (IE) tasks, wherein a model learns to identify semantic relationships between entities and events in narratives. They provide useful information for many natural language processing (NLP) applications, such as knowledge graph completion (Lin et al. 2015) and question answering (Chen et al. 2019). Figure 1 gives an example of the relation and event extraction tasks. Recent advances in cross-lingual transfer learning for relation and event extraction learn a universal encoder that produces language-agnostic contextualized representations, so a model trained on one language can easily transfer to others. Recent works (Huang et al. 2018; Subburathinam et al. 2019) suggested that embedding universal dependency structure into contextual representations improves cross-lingual transfer for IE.

There are a couple of advantages to leveraging dependency structures. First, the syntactic distance between two words[1] in a sentence is typically smaller than the sequential distance. For example, in the sentence "A fire in a Bangladeshi garment factory has left at least 37 people dead and 100 hospitalized", the sequential and syntactic distances between "fire" and "hospitalized" are 15 and 4, respectively. Therefore, encoding syntactic structure helps capture long-range dependencies (Liu, Luo, and Huang 2018). Second, languages have different word orders; e.g., adjectives precede nouns in English ("red apple") but follow them in French ("pomme rouge"). Thus, processing sentences sequentially suffers from the word order difference issue (Ahmad et al. 2019a), while modeling dependency structures can mitigate the problem in cross-lingual transfer (Liu et al. 2019).

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

[1] The shortest path in the dependency graph structure.

Figure 1: A relation (red dashed) between two entities and an event of type Attack (triggered by "firing"), including two arguments and their role labels (blue), are highlighted.

A common way to leverage dependency structures for cross-lingual NLP tasks is to use universal dependency parses.[2] A large pool of recent works in IE (Liu, Luo, and Huang 2018; Zhang, Qi, and Manning 2018; Subburathinam et al. 2019; Fu, Li, and Ma 2019; Sun et al. 2019; Liu et al. 2019) employed Graph Convolutional Networks (GCNs) (Kipf and Welling 2017) to learn sentence representations based on their universal dependency parses, where a k-layer GCN aggregates information from words that are k hops away. Such a way of embedding structure may hinder cross-lingual transfer when the source and target languages have different path length distributions among words (see Table 1). Presumably, a two-layer GCN would work well on English but may not transfer well to Arabic.

Moreover, GCNs have been shown to perform poorly at modeling long-distance dependencies or disconnected words in the dependency tree (Zhang, Li, and Song 2019; Tang et al. 2020). In contrast, the self-attention mechanism (Vaswani

[2] https://universaldependencies.org/


et al. 2017) is capable of capturing long-range dependencies. Consequently, a few recent studies proposed dependency-aware self-attention and found it effective for machine translation (Deguchi, Tamura, and Ninomiya 2019; Bugliarello and Okazaki 2020). The key idea is to allow attention between connected words in the dependency tree and to gradually aggregate information across layers. However, IE tasks are relatively low-resource (the number of annotated documents available for training is small), and thus stacking more layers is not feasible. Besides, our preliminary analysis indicates that the syntactic distance between entities can characterize certain relation and event types.[3] Hence, we propose to allow attention between all words but use the pairwise syntactic distances to weigh the attention.

We introduce the Graph Attention Transformer Encoder (GATE), which utilizes self-attention (Vaswani et al. 2017) to learn structured contextual representations. On one hand, GATE enjoys the capability of capturing long-range dependencies, which is crucial for languages with longer sentences, e.g., Arabic.[4] On the other hand, GATE is agnostic to a language's word order, as it uses syntactic distance to model the pairwise relationships between words. This characteristic makes GATE suitable for transfer across typologically diverse languages, e.g., English to Arabic. One crucial property of GATE is that it allows information propagation among the different heads of the multi-head attention structure based on syntactic distances, which allows it to learn the correlation between different mention types and target labels.

We conduct experiments on cross-lingual transfer among the English, Chinese, and Arabic languages using the ACE 2005 benchmark (Walker et al. 2006). The experimental results demonstrate that GATE outperforms three recently proposed relation and event extraction methods by a significant margin.[5] We perform a thorough ablation study and analysis, which shows that GATE is less sensitive to the source language's characteristics (e.g., word order, sentence structure) and thus excels in cross-lingual transfer.

2 Task Description

In this paper, we focus on sentence-level relation extraction (Subburathinam et al. 2019; Ni and Florian 2019) and event extraction (Subburathinam et al. 2019; Liu et al. 2019) tasks. Below, we first introduce the basic concepts and notation, and then define the problem and the scope of the work.

Relation Extraction is the task of identifying the relation type of an ordered pair of entity mentions. Formally, given a pair of entity mentions (e_s, e_o) from a sentence s, where e_s and e_o denote the subject and object entities respectively, the relation extraction (RE) task is defined as predicting the relation r ∈ R ∪ {None} between the entity mentions,

[3] In the ACE 2005 dataset, the relation type PHYS:Located exists among {PER, ORG, LOC, FAC, GPE} entities. The average syntactic distance in English and Arabic sentences between PER and any of the {LOC, FAC, GPE} entities is approx. 2.8 and 4.2, while the distance between PER and ORG is 3.3 and 1.5.

[4] After tokenization, on average, ACE 2005 English and Arabic sentences have approximately 30 and 210 words, respectively.

[5] Code available at https://github.com/wasiahmad/GATE

where R is a pre-defined set of relation types. In the example provided in Figure 1, there is a PHYS:Located relation between the entity mentions "Terrorists" and "hotel".

Event Extraction can be decomposed into two sub-tasks: Event Detection and Event Argument Role Labeling. Event detection refers to the task of identifying event triggers (the words or phrases that express event occurrences) and their types. In the example shown in Figure 1, the word "firing" triggers the Attack event.

Event argument role labeling (EARL) is defined as predicting whether words or phrases (arguments) participate in events and, if so, their roles. Formally, given an event trigger e_t and a mention e_a (an entity, time expression, or value) from a sentence s, argument role labeling refers to predicting the mention's role r ∈ R ∪ {None}, where R is a pre-defined set of role labels. In Figure 1, the "Terrorists" and "hotel" entities are the arguments of the Attack event, and they have the Attacker and Place role labels, respectively.

In this work, we focus on the EARL task; we assume the event mentions (triggers) of the input sentence are provided.

Zero-Shot Cross-Lingual Transfer refers to the setting where no labeled examples are available for the target language. We train neural relation extraction and event argument role labeling models on one (single-source) or multiple (multi-source) source languages and then deploy the models on target languages. The overall cross-lingual transfer approach consists of four steps:

1. Convert the input sentence into a language-universal tree structure using an off-the-shelf universal dependency parser, e.g., UDPipe[6] (Straka and Straková 2017).

2. Embed the words in the sentence into a shared semantic space across languages. We use off-the-shelf multilingual contextual encoders (Devlin et al. 2019; Conneau et al. 2020) to form the word representations. To enrich the word representations, we concatenate them with universal part-of-speech (POS) tag, dependency relation, and entity type embeddings (Subburathinam et al. 2019). We collectively refer to these as language-universal features.

3. Based on the word representations, encode the input sentence using the proposed GATE architecture, which leverages the syntactic depth and distance information. Note that this step is the main focus of this work.

4. A pair of classifiers predicts the target relation and argument role labels based on the encoded representations.
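The four steps above can be sketched end to end. Everything below is a dummy stand-in for illustration (a chain-tree "parse", random "embeddings", an identity "encoder"), not the actual UDPipe, M-BERT, or GATE components:

```python
import numpy as np

rng = np.random.default_rng(0)

def parse(sentence):
    """Step 1 (stub): a universal dependency parse as (head, dep) edges.
    A real system would call a parser such as UDPipe; we fake a chain."""
    n = len(sentence.split())
    return [(i, i + 1) for i in range(n - 1)]

def embed(sentence, d=16):
    """Step 2 (stub): multilingual word vectors + universal features.
    Random vectors stand in for M-BERT + POS/dep/entity embeddings."""
    return rng.standard_normal((len(sentence.split()), d))

def encode(X, edges):
    """Step 3 (stub): structure-aware encoding; GATE would apply
    syntactic-distance-weighted self-attention here."""
    return X

def classify(H, n_labels=3):
    """Step 4 (stub): a linear classifier over the max-pooled sentence."""
    W = rng.standard_normal((H.shape[1], n_labels))
    return int(np.argmax(H.max(axis=0) @ W))

sent = "Terrorists were firing at the hotel"
label = classify(encode(embed(sent), parse(sent)))
```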

3 Approach

Our proposed approach, GATE, revises the multi-head attention architecture of the Transformer encoder (Vaswani et al. 2017) to model syntactic information while encoding a sequence of input vectors (representing the words in a sentence) into contextualized representations. We first review the standard multi-head attention mechanism (§3.1). Then, we introduce our proposed method, GATE (§3.2). Finally, we describe how we perform the relation extraction (§3.3) and event argument role labeling (§3.4) tasks.

[6] http://ufal.mff.cuni.cz/udpipe

3.1 Transformer Encoder

Unlike recent works (Zhang, Qi, and Manning 2018; Subburathinam et al. 2019) that use GCNs (Kipf and Welling 2017) to encode the input sequences into contextualized representations, we propose to employ the Transformer encoder, as it excels in capturing long-range dependencies. First, the sequence of input word vectors x = [x_1, ..., x_|x|], where x_i ∈ R^d, is packed into a matrix H^0 = [x_1, ..., x_|x|]. Then an L-layer Transformer encoder, H^l = Transformer_l(H^{l-1}), l ∈ [1, L], takes H^0 as input and recursively generates different levels of latent representations H^l = [h^l_1, ..., h^l_|x|]. Typically, the latent representations generated by the last (L-th) layer are used as the contextual representations of the input words. To aggregate the output vectors of the previous layer, multiple (n_h) self-attention heads are employed in each Transformer layer. For the l-th Transformer layer, the output of the previous layer, H^{l-1} ∈ R^{|x| × d_model}, is first linearly projected to queries Q, keys K, and values V using the parameter matrices W^Q_l, W^K_l ∈ R^{d_model × d_k} and W^V_l ∈ R^{d_model × d_v}, respectively:

    Q_l = H^{l-1} W^Q_l,  K_l = H^{l-1} W^K_l,  V_l = H^{l-1} W^V_l.

The output of a self-attention head A_l is computed as:

    A_l = softmax( (Q_l K_l^T) / √d_k + M ) V_l,    (1)

where the matrix M ∈ R^{|x| × |x|} determines whether a pair of tokens can attend to each other:

    M_ij = 0   if token i is allowed to attend to token j,
    M_ij = −∞  otherwise.    (2)

The matrix M thus serves as a mask. By default, M is a zero matrix. In the next section, we discuss how we manipulate the mask matrix M to incorporate syntactic depth and distance information into the sentence representations.
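As a minimal sketch (not the paper's implementation), a single masked attention head following Eqs. (1)–(2) can be written in NumPy; the toy shapes and weight names are illustrative:

```python
import numpy as np

def attention_head(H, Wq, Wk, Wv, M):
    """One self-attention head with an additive mask M (Eqs. 1-2):
    M[i, j] = 0 lets token i attend to token j; M[i, j] = -inf blocks it."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M            # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)             # row-wise softmax
    return P @ V, P

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 4
H = rng.standard_normal((n, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
M = np.zeros((n, n))           # default: zero matrix, everyone attends
M[0, 3] = -np.inf              # e.g., block token 0 from attending token 3
A, P = attention_head(H, Wq, Wk, Wv, M)
```

Filling M from the dependency parse, as described next, is what turns this standard head into a structure-aware one.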

3.2 Graph Attention Transformer Encoder

The self-attention mechanism described in §3.1 learns how much attention to put on the words of a text sequence when encoding the word at a given position. In this work, we revise the self-attention mechanism so that it takes the syntactic structure and distances into account when a token attends to all the other tokens. The key idea is to manipulate the mask matrix to impose the graph structure, and to retrofit the attention weights based on pairwise syntactic distances. We use the universal dependency parse of a sentence and compute the syntactic (shortest path) distances between every pair of words. We illustrate an example in Figure 2.

We denote by D ∈ R^{|x| × |x|} the distance matrix, where D_ij represents the syntactic distance between the words at positions i and j in the input sequence. If we want to allow tokens to attend to their adjacent tokens (those that are 1 hop away) at each layer, we can set the mask matrix as follows:

    M_ij = 0   if D_ij = 1,
    M_ij = −∞  otherwise.

We generalize this notion to model distance-based attention, allowing tokens to attend to tokens that are within a distance δ (a hyper-parameter):

    M_ij = 0   if D_ij ≤ δ,
    M_ij = −∞  otherwise.    (3)

Figure 2: Distance matrix showing the shortest path distances between all pairs of words. The dependency arc direction is ignored while computing pairwise distances. The diagonal value is set to 1, indicating a self-loop. If we set the values in the white cells (with value > 1) to 0, the distance matrix becomes an adjacency matrix.
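One possible realization of the δ-mask, again as a hedged sketch: compute the pairwise shortest-path distances D over the undirected dependency edges with BFS, then apply Eq. (3). The token indices and the toy edge list are made up for illustration (we also set the self-distance to 0 rather than the 1 shown in Figure 2; either choice keeps self-attention under Eq. 3 for δ ≥ 1):

```python
import numpy as np
from collections import deque

def distance_matrix(n, edges):
    """All-pairs shortest-path (syntactic) distances over the
    undirected dependency graph, via BFS from every token."""
    adj = [[] for _ in range(n)]
    for h, d in edges:           # (head, dependent) index pairs
        adj[h].append(d)
        adj[d].append(h)         # arc direction is ignored (cf. Figure 2)
    D = np.full((n, n), np.inf)
    for s in range(n):
        D[s, s] = 0.0            # a token can always attend to itself
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if np.isinf(D[s, v]):
                    D[s, v] = D[s, u] + 1
                    q.append(v)
    return D

def delta_mask(D, delta):
    """Eq. (3): attend only within syntactic distance delta."""
    return np.where(D <= delta, 0.0, -np.inf)

# Toy 5-token tree: token 2 is a hub; token 3 dominates token 4.
edges = [(2, 0), (2, 1), (2, 3), (3, 4)]
D = distance_matrix(5, edges)
M = delta_mask(D, delta=1)
```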

During our preliminary analysis, we observed that the syntactic distances between entity mentions or event mentions often correlate with the target label. For example, if an ORG entity mention appears closer to a PER entity than a LOC entity does, then the {PER, ORG} entity pair is more likely to have the PHYS:Located relation. We hypothesize that modeling the syntactic distance between words can help identify complex semantic structures such as events and entity relations. Hence, we revise the computation of the attention head A_l (defined in Eq. (1)) as follows:

    A_l = F( softmax( (Q_l K_l^T) / √d_k + M ) ) V_l.    (4)

Here, softmax produces an attention matrix P ∈ R^{|x| × |x|}, where P_i denotes the attention that the i-th token pays to all the tokens in the sentence, and F is a function that modifies those attention weights. We could treat F as a parameterized function learned based on distances. However, we adopt a simple formulation of F such that GATE pays more attention to tokens that are closer and less attention to tokens that are farther away in the parse tree. We define the (i, j)-th element of the attention matrix produced by F as:

    F(P)_ij = P_ij / (Z_i D_ij),    (5)

where Z_i = Σ_j P_ij / D_ij is the normalization factor and D_ij is the distance between the i-th and j-th tokens. We found this formulation of F effective for the IE tasks.
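Eq. (5) can be checked in a few lines of NumPy. This is a sketch of the re-weighting function F with a hand-made uniform attention matrix and a toy distance matrix (self-distance 1, as in Figure 2):

```python
import numpy as np

def distance_reweight(P, D):
    """Eq. (5): F(P)_ij = P_ij / (Z_i * D_ij), with the row-wise
    normalizer Z_i = sum_j P_ij / D_ij, so each row still sums to 1."""
    W = P / D                            # element-wise P_ij / D_ij
    Z = W.sum(axis=-1, keepdims=True)    # Z_i
    return W / Z

# Uniform attention over 3 tokens; token 0 is 1, 2, 3 hops from
# tokens 0, 1, 2 respectively (self-distance set to 1 as in Figure 2).
P = np.full((3, 3), 1.0 / 3.0)
D = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 2.0],
              [3.0, 2.0, 1.0]])
F_P = distance_reweight(P, D)
# Row 0 becomes [6/11, 3/11, 2/11]: closer tokens get more weight.
```

Note that F introduces no new parameters; it only redistributes the probability mass that softmax already produced.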

3.3 Relation Extractor

The relation extractor predicts the relationship label (or None) for each mention pair in a sentence. For an input sentence s, GATE produces contextualized word representations h^l_1, ..., h^l_|x|, where h^l_i ∈ R^{d_model}. As different sentences and entity mentions may have different lengths, we perform max-pooling over their contextual representations to obtain fixed-length vectors. For a pair of entity mentions e_s = [h^l_{bs}, ..., h^l_{es}] and e_o = [h^l_{bo}, ..., h^l_{eo}], we obtain single vector representations e_s and e_o by performing max-pooling. Following Zhang, Qi, and Manning (2018) and Subburathinam et al. (2019), we also obtain a vector representation s for the sentence by applying max-pooling over [h^l_1, ..., h^l_|x|]. The concatenation of the three vectors, [e_s; e_o; s], is then fed to a linear classifier followed by a softmax layer to predict the relation type between the entity mentions e_s and e_o as follows:

    O_r = softmax(W_r^T [e_s; e_o; s] + b_r),

where W_r ∈ R^{3 d_model × r} and b_r ∈ R^r are parameters, and r is the total number of relation types. The probability of the t-th relation type is denoted as P(r_t | s, e_s, e_o), which corresponds to the t-th element of O_r. To train the relation extractor, we adopt the cross-entropy loss:

    L_r = − Σ_{s=1}^{N} Σ_{o=1}^{N} log P(y^r_{so} | s, e_s, e_o),

where N is the number of entity mentions in the input sentence s and y^r_{so} denotes the ground-truth relation type between the entity mentions e_s and e_o.
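The pooling-and-classification step can be sketched as follows; this is a NumPy toy, not the trained model, and the span indices and dimensions are arbitrary:

```python
import numpy as np

def max_pool(span):
    """Variable-length span -> fixed-length vector (element-wise max)."""
    return span.max(axis=0)

def relation_probs(H, subj, obj, Wr, br):
    """Pool e_s, e_o, and the sentence s, concatenate, and apply a
    linear layer + softmax: O_r = softmax(W_r^T [e_s; e_o; s] + b_r)."""
    e_s = max_pool(H[subj[0]:subj[1]])     # subject-mention span
    e_o = max_pool(H[obj[0]:obj[1]])       # object-mention span
    s = max_pool(H)                        # whole sentence
    feats = np.concatenate([e_s, e_o, s])  # shape (3 * d_model,)
    logits = Wr.T @ feats + br
    z = np.exp(logits - logits.max())
    return z / z.sum()                     # P(r_t | s, e_s, e_o)

rng = np.random.default_rng(1)
d_model, n_tokens, n_rel = 6, 7, 4         # toy sizes
H = rng.standard_normal((n_tokens, d_model))
Wr = rng.standard_normal((3 * d_model, n_rel))
br = np.zeros(n_rel)
probs = relation_probs(H, subj=(0, 2), obj=(4, 5), Wr=Wr, br=br)
```

Max-pooling makes the feature vector length-independent, so the same classifier applies to mentions and sentences of any size.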

3.4 Event Argument Role Labeler

The event argument role labeler predicts the argument mentions (or None for non-argument mentions) of an event mention and assigns each argument a role label from a pre-defined set of labels. To label an argument candidate e_a = [h^l_{ba}, ..., h^l_{ea}] for an event trigger e_t = [h^l_{bt}, ..., h^l_{et}] in sentence s = [h^l_1, ..., h^l_|x|], we apply max-pooling to form the vectors e_a, e_t, and s, in the same way as for relation extraction. Then we concatenate the vectors ([e_t; e_a; s]) and pass the result through a linear classifier and softmax layer to predict the role label as follows:

    O_a = softmax(W_a^T [e_t; e_a; s] + b_a),

where W_a ∈ R^{3 d_model × r} and b_a ∈ R^r are parameters, and r is the total number of argument role label types. We optimize the role labeler by minimizing the cross-entropy loss.

4 Experiment Setup

Dataset. We conduct experiments on the Automatic Content Extraction (ACE) 2005 corpus (Walker et al. 2006), which includes manual annotations of relation and event mentions (with their arguments) in three languages: English (En), Chinese (Zh), and Arabic (Ar). We present the data statistics in the Appendix. ACE defines an ontology that includes 7 entity types, 18 relation subtypes, and 33 event subtypes. We add a class label None to denote that the two entity mentions, or the pair of an event mention and an argument candidate, under consideration do not have a relationship belonging to the target ontology. We use the same dataset split as Subburathinam et al. (2019) and follow their preprocessing steps. We refer the readers to Subburathinam et al. (2019) for the dataset preprocessing details.

                           Sequential          Syntactic
                         En    Zh    Ar      En    Zh    Ar
  Relation Mention      4.8   3.9  25.8     2.2   2.6   5.1
  Event Mention & Arg.  9.8  21.7  58.1     3.1   4.6  12.3

Table 1: Average sequential and syntactic (shortest path) distances between relation mentions, and between event mentions and their candidate arguments, in the ACE05 dataset. Distances are computed ignoring the order of mentions.

Evaluation Criteria. Following previous works (Ji and Grishman 2008; Li, Ji, and Huang 2013; Li and Ji 2014; Subburathinam et al. 2019), we set the evaluation criteria as follows: (1) a relation mention is correct if its predicted type and the head offsets of the two associated entity mentions are correct, and (2) an event argument role label is correct if the event type, offsets, and argument role label match any of the reference argument mentions.

Baseline Models. To compare GATE on the relation extraction and event argument role labeling tasks, we chose three recently proposed approaches as baselines. The source code of the baselines was not publicly available at the time this research was conducted; therefore, we reimplemented them.

• CL Trans GCN (Liu et al. 2019) is a context-dependent lexical mapping approach, where each word in a source-language sentence is mapped to its best-suited translation in the target language. We use multilingual word embeddings (Joulin et al. 2018) as the continuous representations of tokens, along with the language-universal feature embeddings, including the part-of-speech (POS) tag embedding, dependency relation label embedding, and entity type embedding.[7] Since this model focuses on the target language, we train this baseline for each combination of source and target languages.

• CL GCN (Subburathinam et al. 2019) uses a GCN (Kipf and Welling 2017) to learn structured common-space representations. To embed the tokens of an input sentence, we use multilingual contextual representations (Devlin et al. 2019; Conneau et al. 2020) and the language-universal feature embeddings. We train this baseline on the source languages and directly evaluate it on the target languages.

• CL RNN (Ni and Florian 2019) uses a bidirectional Long Short-Term Memory (LSTM) recurrent neural network (Hochreiter and Schmidhuber 1997) to learn contextual representations. We feed it language-universal features for the words in a sentence, constructed in the same way as in Subburathinam et al. (2019). We train and evaluate this baseline in the same way as CL GCN.

In addition to the above three baseline methods, we compare GATE with the following two encoding methods.

[7] Due to the design principle of Liu et al. (2019), we cannot use multilingual contextual encoders in CL Trans GCN.

                        Event Argument Role Labeling              Relation Extraction
  Model             En⇒Zh En⇒Ar Zh⇒En Zh⇒Ar Ar⇒En Ar⇒Zh   En⇒Zh En⇒Ar Zh⇒En Zh⇒Ar Ar⇒En Ar⇒Zh
  CL Trans GCN       41.8  55.6  41.2  52.9  39.6  40.8    56.7  65.3  65.9  59.7  59.6  46.3
  CL GCN             51.9  50.4  53.7  51.5  50.3  51.9    49.4  58.3  65.0  55.0  56.7  42.4
  CL RNN             60.4  53.9  55.7  52.5  50.7  50.9    53.7  63.9  70.9  57.6  67.1  55.7
  Transformer        61.5  55.0  58.0  57.7  54.3  57.0    57.1  63.4  69.6  60.6  67.0  52.6
  Transformer RPR    62.3  60.8  57.3  66.3  57.5  59.8    58.0  59.9  70.0  55.6  66.5  56.5
  GATE (this work)   63.2  68.5  59.3  69.2  53.9  57.8    55.1  66.8  71.5  61.2  69.0  54.3

Table 2: Single-source transfer results (F-score % on the test set) using perfect event triggers and entity mentions. In each column header, X⇒Y denotes training on source language X and evaluating on target language Y.

  Model             {En,Zh}⇒Ar   {En,Ar}⇒Zh   {Zh,Ar}⇒En
  Event Argument Role Labeling
  CL Trans GCN          57.0         44.5         44.8
  CL GCN                58.9         56.2         57.9
  CL RNN                53.5         62.5         60.8
  Transformer           59.5         62.0         60.7
  Transformer RPR       71.1         68.4         62.2
  GATE (this work)      73.9         65.3         61.3
  Relation Extraction
  CL Trans GCN          66.8         54.4         69.5
  CL GCN                64.0         46.6         65.8
  CL RNN                66.5         60.5         73.0
  Transformer           68.3         59.3         73.7
  Transformer RPR       65.0         62.3         73.8
  GATE (this work)      67.0         57.9         74.1

Table 3: Multi-source transfer results (F-score % on the test set) using perfect event triggers and entity mentions. In each column header, {X,Y}⇒Z denotes training on source languages X and Y and evaluating on target language Z.

• Transformer (Vaswani et al. 2017) uses the multi-head self-attention mechanism and is the base structure of our proposed model, GATE. Note that GATE has the same number of parameters as the Transformer, since GATE does not introduce any new parameters when modeling the pairwise syntactic distances in the self-attention mechanism. Therefore, we credit GATE's improvements over the Transformer to its distance-based attention modeling strategy.

• Transformer RPR (Shaw, Uszkoreit, and Vaswani 2018) uses relative position representations to encode the structure of the input sequences. This method uses pairwise sequential distances, while GATE uses pairwise syntactic distances, to model the attention between tokens.

Implementation Details. To embed words into vector representations, we use multilingual BERT (M-BERT) (Devlin et al. 2019). Note that we do not fine-tune M-BERT but only use it as a feature extractor. We use the universal part-of-speech (POS) tags, the dependency relation labels, and the seven entity types defined by ACE: person, organization, geo-political entity, location, facility, weapon, and vehicle. We embed these language-universal features into fixed-length vectors and concatenate them with the M-BERT vectors to form the input word representations. We set the model size (d_model), number of encoder layers (L), and number of attention heads (n_h) to 512, 1, and 8, respectively. We tune the distance threshold δ (as shown in Eq. (3)) over [1, 2, 4, 8, ∞] for each attention head on each source language (more details are provided in the supplementary material).

We implement all the baselines and our approach based on the implementation of Zhang, Qi, and Manning (2018) and OpenNMT (Klein et al. 2017). We used the transformers library[8] to extract the M-BERT and XLM-R features. We provide a detailed description of the dataset, hyper-parameters, and training of the baselines and our approach in the supplementary material.

5 Results and Analysis

We compare GATE with five baseline approaches on the event argument role labeling (EARL) and relation extraction (RE) tasks; the results are presented in Tables 2 and 3.

Single-source transfer. In the single-source transfer setting, all the models are individually trained on one source language, e.g., English, and directly evaluated on the other two (target) languages, e.g., Chinese and Arabic. Table 2 shows that GATE outperforms all the baselines in four out of six transfer directions on both tasks. CL RNN surprisingly outperforms CL GCN in most settings, although CL RNN uses a BiLSTM, which is not suitable for transfer across syntactically different languages (Ahmad et al. 2019a). We hypothesize that the reason is that GCNs cannot capture long-range dependencies, which is crucial for the two tasks. In comparison, by modeling distance-based pairwise relationships among words, GATE excels in cross-lingual transfer.

A comparison between the Transformer and GATE demonstrates the effectiveness of syntactic distance-based self-attention over the standard mechanism. From Table 2, we see that GATE outperforms the Transformer with average improvements of 4.7% and 1.3% on the EARL and RE tasks, respectively. By implicitly modeling graph structure, Transformer RPR performs effectively. However, GATE achieves average improvements of 1.3% and 1.9% over Transformer RPR on the EARL and RE tasks. Overall, the significant performance improvements achieved by GATE corroborate our hypothesis that syntactic distance-based attention helps in cross-lingual transfer.

[8] https://github.com/huggingface/transformers

                            EARL                  RE
  Model               Chinese  Arabic     Chinese  Arabic
  Wang et al. (2019)
    Absolute            61.2    53.5        57.8    65.2
    Relative            55.3    47.1        58.1    66.4
  GATE                  63.2    68.5        55.1    66.8

Table 4: GATE vs. Wang et al. (2019) results (F-score %) on event argument role labeling (EARL) and relation extraction (RE), using English as the source and Chinese and Arabic as the target languages. To limit the maximum relative position, the clipping distance is set to 10 and 5 for the EARL and RE tasks, respectively.

Multi-source transfer. In multi-source cross-lingual transfer, the models are trained on a pair of languages: {English, Chinese}, {English, Arabic}, or {Chinese, Arabic}. Hence, the models observe more examples during training, and as a result, cross-lingual transfer performance improves compared to the single-source setting. In Table 3, we see that GATE outperforms the previous three IE approaches in the multi-source transfer settings, except on RE for the source {English, Arabic} and target Chinese setting. On the other hand, GATE performs competitively with the Transformer and Transformer RPR baselines, which perform more effectively in this setting because they observe more training examples. The overall result indicates that GATE learns transferable representations for the IE tasks more efficiently.

Encoding dependency structure. GATE encodes the dependency structure of sentences by guiding the attention mechanism in self-attention networks (SANs). However, an alternative way to encode the sentence structure is through positional encoding for SANs. Conceptually, the key difference is the modeling of syntactic distances to capture fine-grained relations among tokens. Hence, we compare these two notions of encoding the dependency structure to emphasize the promise of modeling syntactic distances.

To this end, we compare GATE with Wang et al. (2019), who proposed structural position encoding using the dependency structure of sentences. Results are presented in Table 4. We see that Wang et al. (2019) performs well on RE but poorly on EARL, especially on the Arabic language. While GATE directly uses syntactic distances between tokens to guide the self-attention mechanism, Wang et al. (2019) learns parameters to encode structural positions that can become sensitive to the source language. For example, the average shortest path distance between event mentions and their candidate arguments in English and Arabic is 3.1 and 12.3, respectively (see Table 9). As a result, a model trained on English may learn to attend only to closer tokens and thus fail to generalize to Arabic.
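To make the notion of syntactic distance concrete, it can be computed as a shortest-path length over the dependency graph, e.g., with networkx. The edge list below is a hypothetical, hand-written fragment for illustration; in practice the parses come from a universal dependency parser.

```python
import networkx as nx

# Hypothetical head->dependent edges for a fragment of the parse of
# "A fire ... has left at least 37 people dead and 100 hospitalized"
# (hand-written for illustration, not a real parser output).
edges = [("left", "fire"), ("left", "has"), ("left", "dead"),
         ("dead", "people"), ("dead", "hospitalized")]

# Syntactic distance = shortest path length in the (undirected)
# dependency graph.
g = nx.Graph(edges)
dist = nx.shortest_path_length(g, "fire", "hospitalized")
```

On this simplified fragment the distance is 3 (fire–left–dead–hospitalized); the full parse of the sentence discussed in the paper yields 4.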

Moreover, we anticipate that the different order of subject and verb in English and Arabic9 causes Wang et al. (2019) to transfer poorly on the EARL task (as event triggers are

9According to WALS (Dryer and Haspelmath 2013), the order of subject (S), object (O), and verb (V) for English, Chinese, and Arabic is SVO, SVO, and VSO.

Model     EARL                  RE
          English  Chinese∗     English  Chinese∗

CL GCN    51.5     56.3         46.9     50.7
CL RNN    55.6     59.3         56.8     62.0
GATE      63.8     64.2         58.8     57.0

Table 5: Event argument role labeling (EARL) and relation extraction (RE) results (F-score %); using Chinese as the source and English as the target language. ∗ indicates the English examples are translated into Chinese using Google Cloud Translate.

Figure 3: Performance of models trained on the Chinese language on event argument role labeling in English sentences and their parallel Chinese sentences. The parallel sentences have the same meaning but a different structure. To quantify the structural difference between two parallel sentences, we compute the tree edit distance.

mostly verbs). To verify this anticipation, we modify the relative structural position encoding (Wang et al. 2019) by dropping the directional information (Ahmad et al. 2019a) and observe a performance increase from 47.1 to 52.2 for English to Arabic transfer. In comparison, GATE is order-agnostic as it models syntactic distance; hence, it has better transferability across typologically diverse languages.

Sensitivity towards source language Intuitively, an RE or EARL model would transfer well to target languages if the model is less sensitive towards the source language characteristics (e.g., word order, grammar structure). To measure sensitivity towards the source language, we evaluate the model performance on the target language and its parallel (translated) source language sentences. We hypothesize that if a model performs significantly well on the translated source language sentences, then the model is more sensitive towards the source language and may not be ideal for cross-lingual transfer. To test the models on this hypothesis, we translate all the ACE05 English test set examples into Chinese using Google Cloud Translate.10 We train GATE and two baselines on Chinese and evaluate them on both the English (test set) examples and their Chinese translations. To quantify the difference between the dependency

10Details are provided in the supplementary.

Word features   EARL               RE
                Chinese  Arabic    Chinese  Arabic

Multi-WE        35.9     43.7      41.0     54.9
M-BERT          57.1     54.8      55.1     66.8
XLM-R           51.8     61.7      51.4     68.1

Table 6: Contribution of multilingual word embeddings (Multi-WE) (Joulin et al. 2018), M-BERT (Devlin et al. 2019), and XLM-R (Conneau et al. 2020) as a source of word features; using English as source and Chinese, Arabic as the target languages, respectively.

Input features   EARL               RE
                 Chinese  Arabic    Chinese  Arabic

M-BERT           52.5     47.4      44.0     49.7
+ POS tag        49.3     47.5      44.1     47.0
+ Dep. label     49.7     51.0      48.6     47.0
+ Entity type    57.8     60.2      56.3     63.0

Table 7: Ablation on the use of language-universal features (part-of-speech (POS) tag, dependency relation label, and entity type) in GATE (F-score %); using English as source and Chinese, Arabic as the target languages, respectively.

structure of an English sentence and its Chinese translation, we compute the edit distance between the two tree structures using the APTED11 algorithm (Pawlik and Augsten 2015, 2016).

The results are presented in Table 5. We see that CL GCN and CL RNN have much higher accuracy on the translated (Chinese) sentences than on the target language (English) sentences. On the other hand, GATE makes a roughly similar number of correct predictions whether the target or the translated sentences are given as input. Figure 3 illustrates how the models perform as the structural distance between target sentences and their translations increases. The results suggest that GATE performs substantially better than the baselines when the target language sentences are structurally different from the source language. The overall findings signal that GATE is less sensitive to source language features, and we credit this to the modeling of distance-based syntactic relationships between words. We acknowledge that there might be other factors associated with a model's language sensitivity; however, we leave a detailed analysis of measuring a model's sensitivity towards languages as future work.

Ablation study We perform a detailed ablation study on language-universal features and sources of word features to examine their individual impact on cross-lingual transfer. The results are presented in Tables 6 and 7. We observe that word features produced by M-BERT and XLM-R perform better in Chinese and Arabic, respectively, while they are comparable in English. On average, M-BERT performs better, and thus we chose it as the word feature extractor in all our experiments. Table 7 shows that the part-of-speech and dependency relation embeddings have a limited contribution. This is perhaps due to tokenization errors, as pointed out by Subburathinam et al. (2019). However, the use of language-universal features is useful, particularly when we

11https://pypi.org/project/apted/

have minimal training data. We provide more analysis and results in the supplementary.
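A minimal sketch of how the word features and language-universal feature embeddings can be combined into token representations (dimensions follow Table 10; the random vectors merely stand in for M-BERT outputs and learned embedding lookups):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 5

# Dimensions follow Table 10; the random vectors are placeholders for
# M-BERT outputs and learned embedding lookups.
word = rng.normal(size=(n_tokens, 768))   # word features (M-BERT)
pos = rng.normal(size=(n_tokens, 30))     # part-of-speech embedding
dep = rng.normal(size=(n_tokens, 30))     # dependency relation embedding
ent = rng.normal(size=(n_tokens, 30))     # entity type embedding

# Language-universal features are concatenated to the word features.
x = np.concatenate([word, pos, dep, ent], axis=-1)  # (5, 858)
```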

6 Related Work

Relation and event extraction have drawn significant attention from the natural language processing (NLP) community. Most of the approaches developed in the past several years are based on supervised machine learning, using either symbolic features (Ahn 2006; Ji and Grishman 2008; Liao and Grishman 2010; Hong et al. 2011; Li, Ji, and Huang 2013; Li and Ji 2014) or distributional features (Liao and Grishman 2011; Nguyen, Cho, and Grishman 2016; Miwa and Bansal 2016; Liu et al. 2018; Zhang et al. 2018; Lu and Nguyen 2018; Chen et al. 2015; Nguyen and Grishman 2015; Zeng et al. 2014; Peng et al. 2017; Nguyen and Grishman 2018; Zhang, Qi, and Manning 2018; Subburathinam et al. 2019; Liu et al. 2019; Huang, Yang, and Peng 2020) from a large number of annotations. Joint learning or inference (Bekoulis et al. 2018; Li et al. 2014; Zhang, Ji, and Sil 2019; Liu, Luo, and Huang 2018; Nguyen, Cho, and Grishman 2016; Yang and Mitchell 2016; Han, Ning, and Peng 2019; Han, Zhou, and Peng 2020) is also among the noteworthy techniques.

Most previous works on cross-lingual transfer for relation and event extraction are based on annotation projection (Kim et al. 2010a; Kim and Lee 2012), bilingual dictionaries (Hsi et al. 2016; Ni and Florian 2019), parallel data (Chen and Ji 2009; Kim et al. 2010b; Qian et al. 2014), or machine translation (Zhu et al. 2014; Faruqui and Kumar 2015; Zou et al. 2018). Learning common patterns across languages has also been explored (Lin, Liu, and Sun 2017; Wang et al. 2018; Liu et al. 2018). In contrast to these approaches, Subburathinam et al. (2019) and Liu et al. (2019) proposed to use graph convolutional networks (GCNs) (Kipf and Welling 2017) to learn multilingual structured representations. However, GCNs struggle to model long-range dependencies or disconnected words in the dependency tree. To overcome this limitation, we use syntactic distances to weigh the attentions while learning contextualized representations via the multi-head attention mechanism (Vaswani et al. 2017).

Moreover, our proposed syntax-driven distance-based attention modeling helps mitigate the word order difference issue (Ahmad et al. 2019a) that hinders cross-lingual transfer. Prior works studied dependency structure modeling (Liu et al. 2019), source reordering (Rasooli and Collins 2019), adversarial training (Ahmad et al. 2019b), and constrained inference (Meng, Peng, and Chang 2019) to tackle word order differences across typologically different languages.

7 Conclusion

In this paper, we proposed to model fine-grained syntactic structural information based on the dependency parse of a sentence. We developed a Graph Attention Transformer Encoder (GATE) to generate structured contextual representations. Extensive experiments on three languages demonstrate the effectiveness of GATE in cross-lingual relation and event extraction. In the future, we want to explore other sources of language-universal information to improve structured representation learning.

Acknowledgments

This work was supported in part by National Science Foundation (NSF) Grant OAC 1920462 and the Intelligence Advanced Research Projects Activity (IARPA) via Contract No. 2019-19051600007.

References

Ahmad, W.; Zhang, Z.; Ma, X.; Hovy, E.; Chang, K.-W.; and Peng, N. 2019a. On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing. In Proceedings of NAACL, 2440–2452.

Ahmad, W. U.; Zhang, Z.; Ma, X.; Chang, K.-W.; and Peng, N. 2019b. Cross-Lingual Dependency Parsing with Unlabeled Auxiliary Languages. In Proceedings of CoNLL, 372–382.

Ahn, D. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, 1–8.

Bekoulis, G.; Deleu, J.; Demeester, T.; and Develder, C. 2018. Adversarial training for multi-context joint entity and relation extraction. In Proceedings of EMNLP, 2830–2836.

Bugliarello, E.; and Okazaki, N. 2020. Enhancing Machine Translation with Dependency-Aware Self-Attention. In Proceedings of ACL, 1618–1627.

Chen, Y.; Xu, L.; Liu, K.; Zeng, D.; and Zhao, J. 2015. Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks. In Proceedings of ACL-IJCNLP, 167–176.

Chen, Z.; and Ji, H. 2009. Can One Language Bootstrap the Other: A Case Study on Event Extraction. In Proceedings of the NAACL HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing, 66–74.

Chen, Z.-Y.; Chang, C.-H.; Chen, Y.-P.; Nayak, J.; and Ku, L.-W. 2019. UHop: An Unrestricted-Hop Relation Extraction Framework for Knowledge-Based Question Answering. In Proceedings of NAACL, 345–356.

Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzman, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; and Stoyanov, V. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of ACL.

Deguchi, H.; Tamura, A.; and Ninomiya, T. 2019. Dependency-Based Self-Attention for Transformer NMT. In Proceedings of RANLP, 239–246.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL.

Dryer, M. S.; and Haspelmath, M. 2013. The World Atlas of Language Structures Online.

Faruqui, M.; and Kumar, S. 2015. Multilingual Open Relation Extraction Using Cross-lingual Projection. In Proceedings of NAACL, 1351–1356.

Fu, T.-J.; Li, P.-H.; and Ma, W.-Y. 2019. GraphRel: Modeling Text as Relational Graphs for Joint Entity and Relation Extraction. In Proceedings of ACL, 1409–1418.

Han, R.; Ning, Q.; and Peng, N. 2019. Joint Event and Temporal Relation Extraction with Shared Representations and Structured Prediction. In Proceedings of EMNLP, 434–444.

Han, R.; Zhou, Y.; and Peng, N. 2020. Domain Knowledge Empowered Structured Neural Net for End-to-End Event Temporal Relation Extraction. In Proceedings of EMNLP, 5717–5729.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8): 1735–1780.

Hong, Y.; Zhang, J.; Ma, B.; Yao, J.; Zhou, G.; and Zhu, Q. 2011. Using Cross-Entity Inference to Improve Event Extraction. In Proceedings of NAACL, 1127–1136.

Hsi, A.; Yang, Y.; Carbonell, J.; and Xu, R. 2016. Leveraging Multilingual Training for Limited Resource Event Extraction. In Proceedings of COLING, 1201–1210.

Huang, K.-H.; Yang, M.; and Peng, N. 2020. Biomedical Event Extraction with Hierarchical Knowledge Graphs. In Findings of ACL: EMNLP 2020.

Huang, L.; Ji, H.; Cho, K.; Dagan, I.; Riedel, S.; and Voss, C. 2018. Zero-Shot Transfer Learning for Event Extraction. In Proceedings of ACL, 2160–2170.

Ji, H.; and Grishman, R. 2008. Refining Event Extraction through Cross-Document Inference. In Proceedings of ACL, 254–262.

Joulin, A.; Bojanowski, P.; Mikolov, T.; Jegou, H.; and Grave, E. 2018. Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion. In Proceedings of EMNLP, 2979–2984.

Kim, S.; Jeong, M.; Lee, J.; and Lee, G. G. 2010a. A cross-lingual annotation projection approach for relation detection. In Proceedings of COLING, 564–571.

Kim, S.; Jeong, M.; Lee, J.; and Lee, G. G. 2010b. A Cross-lingual Annotation Projection Approach for Relation Detection. In Proceedings of COLING, 564–571.

Kim, S.; and Lee, G. G. 2012. A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction. In Proceedings of ACL, 48–53.

Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.

Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of ACL 2017, System Demo.

Li, Q.; and Ji, H. 2014. Incremental Joint Extraction of Entity Mentions and Relations. In Proceedings of ACL, 402–412.

Li, Q.; Ji, H.; Hong, Y.; and Li, S. 2014. Constructing Information Networks Using One Single Model. In Proceedings of EMNLP, 1846–1851.

Li, Q.; Ji, H.; and Huang, L. 2013. Joint Event Extraction via Structured Prediction with Global Features. In Proceedings of ACL, 73–82.

Liao, S.; and Grishman, R. 2010. Using Document Level Cross-Event Inference to Improve Event Extraction. In Proceedings of ACL, 789–797.

Liao, S.; and Grishman, R. 2011. Acquiring Topic Features to improve Event Extraction: in Pre-selected and Balanced Collections. In Proceedings of RANLP, 9–16.

Lin, Y.; Liu, Z.; and Sun, M. 2017. Neural Relation Extraction with Multi-lingual Attention. In Proceedings of ACL.

Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; and Zhu, X. 2015. Learning Entity and Relation Embeddings for Knowledge Graph Completion. In Proceedings of AAAI, 2181–2187.

Liu, J.; Chen, Y.; Liu, K.; and Zhao, J. 2018. Event detection via gated multilingual attention mechanism. In Proceedings of AAAI, 4865–4872.

Liu, J.; Chen, Y.; Liu, K.; and Zhao, J. 2019. Neural Cross-Lingual Event Detection with Minimal Parallel Resources. In Proceedings of EMNLP-IJCNLP, 738–748.

Liu, X.; Luo, Z.; and Huang, H. 2018. Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation. In Proceedings of EMNLP, 1247–1256.

Lu, W.; and Nguyen, T. H. 2018. Similar but not the Same: Word Sense Disambiguation Improves Event Detection via Neural Representation Matching. In Proceedings of EMNLP, 4822–4828.

Meng, T.; Peng, N.; and Chang, K.-W. 2019. Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing. In Proceedings of EMNLP-IJCNLP, 1117–1128.

Miwa, M.; and Bansal, M. 2016. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of ACL, 1105–1116.

Nguyen, T. H.; Cho, K.; and Grishman, R. 2016. Joint Event Extraction via Recurrent Neural Networks. In Proceedings of NAACL, 300–309.

Nguyen, T. H.; and Grishman, R. 2015. Event Detection and Domain Adaptation with Convolutional Neural Networks. In Proceedings of EMNLP-IJCNLP, 365–371.

Nguyen, T. H.; and Grishman, R. 2018. Graph convolutional networks with argument-aware pooling for event detection. In Proceedings of AAAI.

Ni, J.; and Florian, R. 2019. Neural Cross-Lingual Relation Extraction Based on Bilingual Word Embedding Mapping. In Proceedings of EMNLP-IJCNLP, 399–409.

Pawlik, M.; and Augsten, N. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems (TODS) 40(1): 1–40.

Pawlik, M.; and Augsten, N. 2016. Tree edit distance: Robust and memory-efficient. Information Systems 157–173.

Peng, N.; Poon, H.; Quirk, C.; Toutanova, K.; and Yih, W.-t. 2017. Cross-sentence N-ary Relation Extraction with Graph LSTMs. Transactions of ACL.

Qian, L.; Hui, H.; Hu, Y.; Zhou, G.; and Zhu, Q. 2014. Bilingual Active Learning for Relation Classification via Pseudo Parallel Corpora. In Proceedings of ACL, 582–592.

Rasooli, M. S.; and Collins, M. 2019. Low-Resource Syntactic Transfer with Unsupervised Source Reordering. In Proceedings of NAACL, 3845–3856.

Shaw, P.; Uszkoreit, J.; and Vaswani, A. 2018. Self-Attention with Relative Position Representations. In Proceedings of NAACL, 464–468.

Straka, M.; and Strakova, J. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 88–99.

Subburathinam, A.; Lu, D.; Ji, H.; May, J.; Chang, S.-F.; Sil, A.; and Voss, C. 2019. Cross-lingual Structure Transfer for Relation and Event Extraction. In Proceedings of EMNLP-IJCNLP, 313–325.

Sun, C.; Gong, Y.; Wu, Y.; Gong, M.; Jiang, D.; Lan, M.; Sun, S.; and Duan, N. 2019. Joint Type Inference on Entities and Relations via Graph Convolutional Networks. In Proceedings of ACL, 1361–1370.

Tang, H.; Ji, D.; Li, C.; and Zhou, Q. 2020. Dependency Graph Enhanced Dual-transformer Structure for Aspect-based Sentiment Classification. In Proceedings of ACL.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS, 5998–6008.

Walker, C.; Strassel, S.; Medero, J.; and Maeda, K. 2006. ACE 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia 57.

Wang, X.; Han, X.; Lin, Y.; Liu, Z.; and Sun, M. 2018. Adversarial Multi-lingual Neural Relation Extraction. In Proceedings of COLING, 1156–1166.

Wang, X.; Tu, Z.; Wang, L.; and Shi, S. 2019. Self-Attention with Structural Position Representations. In Proceedings of EMNLP-IJCNLP, 1403–1409.

Yang, B.; and Mitchell, T. M. 2016. Joint Extraction of Events and Entities within a Document Context. In Proceedings of NAACL, 289–299.

Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; and Zhao, J. 2014. Relation Classification via Convolutional Deep Neural Network. In Proceedings of COLING, 2335–2344.

Zhang, C.; Li, Q.; and Song, D. 2019. Aspect-based Sentiment Classification with Aspect-specific Graph Convolutional Networks. In Proceedings of EMNLP-IJCNLP.

Zhang, J.; Zhou, W.; Hong, Y.; Yao, J.; and Zhang, M. 2018. Using entity relation to improve event detection via attention mechanism. In CCF International Conference on NLPCC.

Zhang, T.; Ji, H.; and Sil, A. 2019. Joint entity and event extraction with generative adversarial imitation learning. Data Intelligence 99–120.

Zhang, Y.; Qi, P.; and Manning, C. D. 2018. Graph Convolution over Pruned Dependency Trees Improves Relation Extraction. In Proceedings of EMNLP, 2205–2215.

Zhu, Z.; Li, S.; Zhou, G.; and Xia, R. 2014. Bilingual Event Extraction: a Case Study on Trigger Type Determination. In Proceedings of ACL, 842–847.

Zou, B.; Xu, Z.; Hong, Y.; and Zhou, G. 2018. Adversarial Feature Adaptation for Cross-lingual Relation Classification. In Proceedings of COLING, 437–448.

                    English  Chinese  Arabic
Relation Mentions   8,738    9,317    4,731
Event Mentions      5,349    3,333    2,270
Event Arguments     9,793    8,032    4,975

Table 8: Statistics of the ACE 2005 dataset.

                               Sequential Distance           Structural Distance
                               English  Chinese  Arabic      English  Chinese  Arabic

Relation mentions              4.8      3.9      25.8        2.2      2.6      5.1
Event mentions and arguments   9.8      21.7     58.1        3.1      4.6      12.3

Table 9: Average sequential and structural (shortest path) distance between relation mentions, and between event mentions and their candidate arguments, in the ACE05 dataset. Distances are computed by ignoring the order of mentions.

Hyper-parameter                      CL Trans GCN  CL GCN    CL RNN    GATE
word embedding size                  300           768       768       768
part-of-speech embedding size        30            30        30        30
entity type embedding size           30            30        30        30
dependency relation embedding size   30            30        30        30
encoder type                         GCN           GCN       BiLSTM    Self-Attention
encoder layers                       2             2         1         1
encoder hidden size                  200           200       300       512
pooling function                     max-pool      max-pool  max-pool  max-pool
mlp layers                           2             2         2         2
dropout                              0.5           0.5       0.5       0.5
optimizer                            Adam          SGD       Adam      SGD
learning rate                        0.001         0.1       0.001     0.1
learning rate decay                  0.9           0.9       0.9       0.9
decay start epoch                    5             5         5         5
batch size                           50            50        50        50
maximum gradient norm                5.0           5.0       5.0       5.0

Table 10: Hyper-parameters of CL Trans GCN (Liu et al. 2019), CL GCN (Subburathinam et al. 2019), CL RNN (Ni and Florian 2019), and our approach, GATE.

A Dataset Details

We conduct experiments on the ACE 2005 dataset, which is publicly available for download.12 We list the dataset statistics in Table 8. In Table 9, we present the statistics of sequential and shortest path distances between relation mentions, and between event mentions and their arguments, in ACE05.

B Hyper-parameter Details

We detail the hyper-parameters for all the baselines and our approach in Table 10.

C Tuning δ (shown in Eq. (3))

During our initial experiments, we observed that setting δ = ∞ in four attention heads provides consistently better performance. We tune δ in the range [1, 2, 4, 8] on the validation set based on the statistics of the shortest path distances between relation mentions, and between event mentions and their arguments, in ACE05 (shown in Table 9). We set δ = [2, 2, 4, 4, ∞, ∞, ∞, ∞] and δ = [1, 1, 2, 2, ∞, ∞, ∞, ∞]

12https://catalog.ldc.upenn.edu/LDC2006T06

for the event argument role labeling and relation extraction tasks, respectively, in all our experiments. This hyper-parameter choice provides comparably better performance (on test sets), as shown in Table 11.
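The per-head thresholds above can be sketched as boolean attention masks, one per head (a simplified illustration with a hypothetical helper, not the exact GATE implementation):

```python
import numpy as np

def per_head_distance_masks(dist, deltas):
    """Hypothetical helper: build one boolean mask per attention head;
    head h lets token i attend to token j only if dist[i, j] <= deltas[h].
    np.inf leaves a head fully connected."""
    dist = np.asarray(dist, dtype=float)
    return np.stack([dist <= d for d in deltas])

# EARL setting above: 8 heads, four clipped (delta = 2, 2, 4, 4) and
# four unrestricted (delta = infinity).
deltas = [2, 2, 4, 4, np.inf, np.inf, np.inf, np.inf]
dist = np.array([[0, 1, 3],
                 [1, 0, 2],
                 [3, 2, 0]])
masks = per_head_distance_masks(dist, deltas)
```

Here the first head blocks the token pair at distance 3 (exceeding its δ = 2), while the δ = ∞ heads keep full attention.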

D Translation Experiment

We perform English to Arabic and Chinese translations using Google Cloud Translate.13 During translation, we use special symbols to identify relation mentions and event mentions and their argument candidates in the sentences, as shown in Figure 4. We drop the examples (≈ 10%) in which we cannot identify the mentions after translation.

E Error Analysis

We compare our proposed approach GATE and the self-attention mechanism (Vaswani et al. 2017) on the event argument role labeling (EARL) and relation extraction (RE) tasks. We consider the models trained on the English language and evaluate them on the Chinese language. We do not use the

13https://cloud.google.com/translate/docs/basic/setup-basic

δ for Attention Heads        En⇒Zh  En⇒Ar  Zh⇒En  Zh⇒Ar  Ar⇒En  Ar⇒Zh  Avg.

Event Argument Role Labeling
[1, 1, 1, 1, ∞, ∞, ∞, ∞]     63.1   65.9   57.3   67.1   53.5   57.2   60.7
[2, 2, 2, 2, ∞, ∞, ∞, ∞]     64.3   69.6   58.9   69.4   52.7   56.2   61.9
[4, 4, 4, 4, ∞, ∞, ∞, ∞]     62.1   69.8   58.9   70.5   53.0   56.1   61.7
[8, 8, 8, 8, ∞, ∞, ∞, ∞]     63.6   69.4   57.9   71.4   54.0   54.9   61.9
[1, 1, 2, 2, ∞, ∞, ∞, ∞]     63.2   68.5   58.7   69.5   52.7   53.7   61.1
[2, 2, 4, 4, ∞, ∞, ∞, ∞]     65.0   69.6   60.2   69.2   53.9   57.8   62.6
[4, 4, 8, 8, ∞, ∞, ∞, ∞]     63.6   70.5   58.3   70.8   53.4   57.6   62.4
[1, 2, 4, 8, ∞, ∞, ∞, ∞]     64.3   69.6   57.8   69.7   52.5   55.5   61.6

Relation Extraction
[1, 1, 1, 1, ∞, ∞, ∞, ∞]     54.8   63.7   70.7   62.3   69.8   50.6   62.0
[2, 2, 2, 2, ∞, ∞, ∞, ∞]     55.1   64.1   70.4   59.4   68.7   50.2   61.3
[4, 4, 4, 4, ∞, ∞, ∞, ∞]     55.5   64.5   71.6   61.2   68.7   51.5   62.2
[8, 8, 8, 8, ∞, ∞, ∞, ∞]     55.5   65.5   71.1   61.7   67.5   53.4   62.5
[1, 1, 2, 2, ∞, ∞, ∞, ∞]     56.4   63.5   70.4   63.1   69.4   51.9   62.5
[2, 2, 4, 4, ∞, ∞, ∞, ∞]     55.6   62.0   70.6   61.6   67.2   51.2   61.4
[4, 4, 8, 8, ∞, ∞, ∞, ∞]     55.8   63.9   71.5   63.0   68.5   50.6   62.2
[1, 2, 4, 8, ∞, ∞, ∞, ∞]     55.4   65.0   70.3   61.1   69.6   50.7   62.0

Table 11: Event Argument Role Labeling (EARL) and Relation Extraction (RE) single-source transfer results (F-score %) of our proposed approach GATE with different distance thresholds δ using perfect event triggers and entity mentions. En, Zh, and Ar denote the English, Chinese, and Arabic languages, respectively. In "X ⇒ Y", X and Y denote the source and target language, respectively.

Figure 4: Translation of an English sentence into Chinese and Arabic with an event trigger (surrounded by <b></b>) and a candidate argument (surrounded by <i></i>).

Model           True Positive  True Negative  False Positive  False Negative
Self-Attention  386            563            179             300
GATE            585            493            249             157

Table 12: Comparing GATE and Self-Attention on the EARL task using English and Chinese as the source and target languages, respectively. The rates are aggregated from the confusion matrices shown in Figures 5 and 6.

event trigger type as a feature while training models for the EARL task. We present the confusion matrices of these two models in Figures 5, 6, 7, and 8. In general, GATE makes more correct predictions. We noticed that in transferring from English to Chinese on the EARL task, GATE improves notably on the Destination, Entity, Person, and Place argument role types. The syntactic distance between event triggers and their argument mentions that share those types corroborates our hypothesis that distance-based dependency relations help in cross-lingual transfer.

However, we observed that GATE makes more false positive and fewer false negative predictions than the self-attention mechanism. We summarize the prediction rates on EARL in Table 12. Several factors may be associated with these wrong predictions. To shed light on those factors, we manually inspect 50 examples, and our findings suggest that the wrong predictions are due to three primary reasons. First, there are errors in the ground truth annotations in the ACE dataset. Second, the knowledge required for prediction is not available in the input sentence. Third, there are entity mentions, event triggers, and contextual phrases in the test data that rarely appear in the training data.

Figure 5: Event argument role labeling confusion matrix (on test set) based on our proposed approach GATE using English and Chinese as the source and target languages, respectively. The diagonal values indicate the number of correct predictions, while the other values denote the incorrect prediction counts.

Figure 6: Event argument role labeling confusion matrix (on test set) based on the Self-Attention (Transformer Encoder) using English and Chinese as the source and target languages, respectively. The diagonal values indicate the number of correct predictions, while the other values denote the incorrect prediction counts.

Figure 7: Relation extraction confusion matrix (on test set) based on our proposed approach GATE using English and Chinese as the source and target languages, respectively. The diagonal values indicate the number of correct predictions, while the other values denote the incorrect prediction counts.

Figure 8: Relation extraction confusion matrix (on test set) based on the Self-Attention (Transformer Encoder) using English and Chinese as the source and target languages, respectively. The diagonal values indicate the number of correct predictions, while the other values denote the incorrect prediction counts.