


Proceedings of NAACL-HLT 2019, pages 751–761, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


GraphIE: A Graph-Based Framework for Information Extraction

Yujie Qian¹, Enrico Santus¹, Zhijing Jin², Jiang Guo¹, and Regina Barzilay¹

¹Computer Science and Artificial Intelligence Laboratory, MIT
²Department of Computer Science, The University of Hong Kong

{yujieq, jiang guo, regina}@csail.mit.edu, {esantus, zhijing}@mit.edu

Abstract

Most modern Information Extraction (IE) systems are implemented as sequential taggers and only model local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. In this paper, we introduce GraphIE, a framework that operates over a graph representing a broad set of dependencies between textual units (i.e. words or sentences). The algorithm propagates information between connected nodes through graph convolutions, generating a richer representation that can be exploited to improve word-level predictions. Evaluation on three different tasks — namely textual, social media and visual information extraction — shows that GraphIE consistently outperforms the state-of-the-art sequence tagging model by a significant margin.¹

1 Introduction

Most modern Information Extraction (IE) systems are implemented as sequential taggers. While such models effectively capture relations in the local context, they have limited capability of exploiting non-local and non-sequential dependencies. In many applications, however, such dependencies can greatly reduce tagging ambiguity, thereby improving overall extraction performance. For instance, when extracting entities from a document, various types of non-local contextual information such as co-references and identical mentions may provide valuable cues. See for example Figure 1, in which the non-local relations are crucial to discriminate the entity type of the second mention of Washington (i.e. PERSON or LOCATION).

Most of the prior work looking at non-local dependencies incorporates them by constraining the output space in a structured prediction framework (Finkel et al., 2005; Reichart and Barzilay, 2012; Hu et al., 2016). Such approaches, however, mostly overlook the richer set of structural relations in the input space. With reference to the example in Figure 1, the co-referent dependencies would not be readily exploited by simply constraining the output space, as they would not necessarily be labeled as entities (e.g. pronouns). In the attempt to capture non-local dependencies in the input space, alternative approaches define a graph that outlines the input structure and engineer features to describe it (Quirk and Poon, 2017). Designing effective features is, however, challenging, arbitrary and time consuming, especially when the underlying structure is complex. Moreover, these approaches have limited capacity of capturing node interactions informed by the graph structure.

¹ Our code and data are available at https://github.com/thomas0809/GraphIE.

Figure 1: Example of the entity extraction task with an ambiguous entity mention (i.e. "...for Washington's request..."). Aside from the sentential forward and backward edges (green, solid) which aggregate local contextual information, non-local relations — such as the co-referent edges (red, dashed) and the identical-mention edges (blue, dotted) — provide additional valuable information to reduce tagging ambiguity (i.e. PERSON or LOCATION). Best viewed in color.

In this paper, we propose GraphIE, a framework that improves predictions by automatically learning the interactions between local and non-local dependencies in the input space. Our approach integrates a graph module with the encoder-decoder architecture for sequence tagging. The algorithm operates over a graph, where nodes correspond to textual units (i.e. words or sentences) and edges describe their relations. At the core of our model, a recurrent neural network sequentially encodes local contextual representations, and then the graph module iteratively propagates information between neighboring nodes using graph convolutions (Kipf and Welling, 2016). The learned representations are finally projected back to a recurrent decoder to support tagging at the word level.

We evaluate GraphIE on three IE tasks, namely textual, social media, and visual (Aumann et al., 2006) information extraction. For each task, we provide as input a simple task-specific graph, which defines the data structure without access to any major processing or external resources. Our model is expected to learn from the relevant dependencies to identify and extract the appropriate information. Experimental results on multiple benchmark datasets show that GraphIE consistently outperforms a strong and commonly adopted sequential model (SeqIE, i.e. a bi-directional long short-term memory (BiLSTM) followed by a conditional random fields (CRF) module). Specifically, in the textual IE task, we obtain an improvement of 0.5% over SeqIE on the CONLL03 dataset, and an improvement of 1.4% on chemical entity extraction (Krallinger et al., 2015). In the social media IE task, GraphIE improves over SeqIE by 3.7% in extracting the EDUCATION attribute from Twitter users. In visual IE, finally, we outperform the baseline by 1.2%.

2 Related Work

The problem of incorporating non-local and non-sequential context to improve information extraction has been extensively studied in the literature. The majority of methods have focused on enforcing constraints in the output space during inference, through various mechanisms such as posterior regularization or generalized expectations (Finkel et al., 2005; Mann and McCallum, 2010; Reichart and Barzilay, 2012; Li et al., 2013; Hu et al., 2016).

Research capturing non-local dependencies in the input space has mostly relied on feature-based approaches. Roberts et al. (2008) and Swampillai and Stevenson (2011) have designed intra- and inter-sentential features based on discourse and syntactic dependencies (e.g., shortest paths) to improve relation extraction. Quirk and Poon (2017) used document graphs to flexibly represent multiple types of relations between words (e.g., syntactic, adjacency and discourse relations).

Graph-based representations can also be learned with neural networks. The work most closely related to ours is the graph convolutional network by Kipf and Welling (2016), which was developed to encode graph structures and perform node classification. In our framework, we adapt the GCN as an intermediate module that learns non-local context, which — instead of being used directly for classification — is projected to the decoder to enrich local information and perform sequence tagging.

A handful of other information extraction approaches have used graph-based neural networks. Miwa and Bansal (2016) applied Tree LSTM (Tai et al., 2015) to jointly represent sequences and dependency trees for entity and relation extraction. Along the same line of work, Peng et al. (2017) and Song et al. (2018) introduced Graph LSTM, which extends the traditional LSTM to graphs by enabling a varied number of incoming edges at each memory cell. Zhang et al. (2018) exploited graph convolutions to pool information over pruned dependency trees, outperforming existing sequence- and dependency-based neural models on a relation extraction task. These studies differ from ours in several respects. First, they can only model word-level graphs, whereas our framework can learn non-local context either from word- or sentence-level graphs, using it to reduce ambiguity during tagging at the word level. Second, all these studies achieved improvements only when using dependency trees. We extend the graph-based approach to validate the benefits of using other types of relations in a broader range of tasks, such as co-reference in named entity recognition, the followed-by link in social media, and layout structure in visual information extraction.

3 Problem Definition

We formalize information extraction as a sequence tagging problem. Rather than simply modeling inputs as sequences, we assume there exists a graph structure in the data that can be exploited to capture non-local and non-sequential dependencies between textual units, namely words or sentences.

Figure 2: GraphIE framework: (a) an overview of the framework; (b) architecture for the sentence-level graph, where each sentence is encoded to a node vector and fed into the graph module, and the output of the graph module is used as the initial state of the decoder; (c) architecture for the word-level graph, where the hidden state for each word of the encoder is taken as the input node vector of the graph module, and then the output is fed into the decoder.

We consider the input to be a set of sentences S = {s_1, . . . , s_N} and an auxiliary graph G = (V, E), where V = {v_1, . . . , v_M} is the node set and E ⊂ V × V is the edge set. Each sentence is a sequence of words. We consider two different designs of the graph:

(1) sentence-level graph, where each node is a sentence (i.e. M = N), and the edges encode sentence dependencies;

(2) word-level graph, where each node is a word (i.e. M is the number of words in the input), and the edges connect pairs of words, such as co-referent tokens.

The edges e_{i,j} = (v_i, v_j) in the graph can be either directed or undirected. Multiple edge types can also be defined to capture different structural factors underlying the task-specific input data.

We use the BIO (Begin, Inside, Outside) tagging scheme in this paper. For each sentence s_i = (w_1^(i), w_2^(i), . . . , w_k^(i)),² we sequentially tag each word as y_i = (y_1^(i), y_2^(i), . . . , y_k^(i)).
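To make the scheme concrete, here is a small helper (our illustration, not part of the paper's released code) that converts annotated entity spans into BIO tags:

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) entity spans to BIO tags.

    `spans` uses inclusive-start, exclusive-end token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for t in range(start + 1, end):
            tags[t] = f"I-{etype}"          # continuation tokens
    return tags

tokens = ["Washington", "came", "from", "a", "prosperous", "family"]
print(spans_to_bio(tokens, [(0, 1, "PER")]))
# ['B-PER', 'O', 'O', 'O', 'O', 'O']
```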

4 Method

GraphIE jointly learns local and non-local dependencies by iteratively propagating information between node representations. Our model has three components:

• an encoder, which generates local context-aware hidden representations for the textual unit (i.e. word or sentence, depending on the task) with a recurrent neural network;

• a graph module, which captures the graph structure, learning non-local and non-sequential dependencies between textual units;

• a decoder, which exploits the contextual information generated by the graph module to perform labelling at the word level.

² While sentences may have different lengths, for notation simplicity we use a single variable k.


Figure 2 illustrates the overview of GraphIE and the model architectures for both sentence- and word-level graphs. In the following sections, we first introduce the case of the sentence-level graph, and then we explain how to adapt the model for the word-level graph.

4.1 Encoder

In GraphIE, we first use an encoder to generate text representations. Given a sentence s_i = (w_1^(i), w_2^(i), . . . , w_k^(i)) of length k, each word w_t^(i) is represented by a vector x_t^(i), which is the concatenation of its word embedding and a feature vector learned with a character-level convolutional neural network (CharCNN; Kim et al. (2016)). We encode the sentence with a recurrent neural network (RNN), defining it as

    h_1:k^(i) = RNN(x_1:k^(i); 0, Θ_enc),    (1)

where x_1:k^(i) denotes the input sequence [x_1^(i), · · · , x_k^(i)], h_1:k^(i) denotes the hidden states [h_1^(i), · · · , h_k^(i)], 0 indicates the initial hidden state is zero, and Θ_enc represents the encoder parameters. We implement the RNN as a bi-directional LSTM (Hochreiter and Schmidhuber, 1997), and encode each sentence independently.

We obtain the sentence representation for s_i by averaging the hidden states of its words, i.e. Enc(s_i) = (1/k) Σ_{t=1}^{k} h_t^(i). The sentence representations are then fed into the graph module.
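The sentence node vector is thus just a mean over the encoder's per-word hidden states. A minimal sketch under assumed shapes (in the actual model, h is produced by the BiLSTM of Eq. 1):

```python
import numpy as np

def sentence_node_vector(h):
    """Mean-pool per-word hidden states into the sentence vector Enc(s_i).

    h: array of shape (k, d), one hidden state per word.
    Returns the sentence node vector of shape (d,).
    """
    return h.mean(axis=0)

h = np.random.randn(12, 128)        # k=12 words, hidden size d=128 (illustrative)
enc_si = sentence_node_vector(h)    # fed into the graph module as g_i^(0)
assert enc_si.shape == (128,)
```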

4.2 Graph Module

The graph module is designed to learn the non-local and non-sequential information from the graph. We adapt the graph convolutional network (GCN) to model the graph context for information extraction.

Given the sentence-level graph G = (V, E), where each node v_i (i.e. sentence s_i) has the encoding Enc(s_i) capturing its local information, the graph module enriches such representation with neighbor information derived from the graph structure.

Our graph module is a GCN which takes as input the sentence representation, i.e. g_i^(0) = Enc(s_i), and conducts graph convolution on every node, propagating information between its neighbors, and integrating such information into a new hidden representation. Specifically, each layer of GCN has two parts. The first gets the information of each node from the previous layer, i.e.

    α_i^(l) = W_v^(l) g_i^(l−1),    (2)

where W_v^(l) is the weight to be learned. The second aggregates information from the neighbors of each node, i.e. for node v_i, we have

    β_i^(l) = (1 / d(v_i)) · W_e^(l) ( Σ_{e_{i,j} ∈ E} g_j^(l−1) ),    (3)

where d(v_i) is the degree of node v_i (i.e. the number of edges connected to v_i) and is used to normalize β_i^(l), ensuring that nodes with different degrees have representations of the same scale.³ In the simplest case, where the edges in the graph are undirected and have the same type, we use the same weight W_e^(l) for all of them. In a more general case, where multiple edge types exist, we expect them to have different impacts on the aggregation. Thus, we model these edge types with different weights in Eq. 3, similar to the relational GCN proposed by Schlichtkrull et al. (2018). When edges are directed, i.e. edge e_{i,j} is different from e_{j,i}, the propagation mechanism should mirror such difference. In this case, we consider directed edges as two types of edges (forward and backward), and use different weights for them.

Finally, α_i^(l) and β_i^(l) are combined to obtain the representation at the l-th layer,

    g_i^(l) = σ( α_i^(l) + β_i^(l) + b^(l) ),    (4)

where σ(·) is the non-linear activation function, and b^(l) is a bias parameter.

Because each layer only propagates information between directly connected nodes, we can stack multiple graph convolutional layers to get a larger receptive field, i.e. each node can be aware of more distant neighbors. After L layers, for each node v_i we obtain a contextual representation, GCN(s_i) = g_i^(L), that captures both local and non-local information.
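A layer of this form (Eqs. 2-4) can be sketched in a few lines. This is a minimal single-edge-type NumPy version, not the released implementation; the adjacency-matrix formulation and the choice of ReLU for σ are our assumptions:

```python
import numpy as np

def gcn_layer(G, A, W_v, W_e, b):
    """One graph-convolution layer in the style of Eqs. 2-4.

    G:   (M, d) node representations g^(l-1)
    A:   (M, M) binary adjacency matrix (undirected, no self-loops)
    W_v: (d, d) self weight, W_e: (d, d) neighbor weight, b: (d,) bias
    """
    alpha = G @ W_v.T                                  # Eq. 2: node's own information
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)  # d(v_i), guard against isolated nodes
    beta = (A @ G) @ W_e.T / deg                       # Eq. 3: degree-normalized neighbor sum
    return np.maximum(alpha + beta + b, 0.0)           # Eq. 4 with sigma = ReLU

M, d = 4, 8
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
G = np.random.randn(M, d)
out = gcn_layer(G, A, np.eye(d), np.eye(d), np.zeros(d))
assert out.shape == (M, d)
```

Stacking L such layers, each node's output depends on its L-hop neighborhood, which is the "larger receptive field" described above.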

4.3 Decoder

To support tagging, the learned representation is propagated to the decoder.

³ We choose this simple normalization strategy instead of the two-sided normalization in Kipf and Welling (2016), as it performs better in the experiments. The same strategy is also adopted by Zhang et al. (2018).


In our work, the decoder is instantiated as a BiLSTM+CRF tagger (Lample et al., 2016). The output representation of the graph module, GCN(s_i), is split into two vectors of the same length, which are used as the initial hidden states for the forward and backward LSTMs, respectively. In this way, the graph contextual information is propagated to each word through the LSTM. Specifically, we have

    z_1:k^(i) = RNN(h_1:k^(i); GCN(s_i), Θ_dec),    (5)

where h_1:k^(i) are the output hidden states of the encoder, GCN(s_i) represents the initial state, and Θ_dec is the decoder parameters. A simpler way to incorporate the graph representation into the decoder is to concatenate it with the decoder's input, but the empirical performance is worse than using it as the initial state.
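The splitting step can be sketched as follows (a toy NumPy illustration with assumed dimensions; in the model the two halves initialize the forward and backward LSTM hidden states):

```python
import numpy as np

def decoder_init_states(gcn_out):
    """Split GCN(s_i) into the two decoder initial states (Section 4.3).

    gcn_out: vector of even length 2*d.
    Returns (h0_forward, h0_backward), each of length d.
    """
    d = gcn_out.shape[0] // 2
    return gcn_out[:d], gcn_out[d:]

gcn_si = np.random.randn(256)                 # graph module output for sentence i
h0_fwd, h0_bwd = decoder_init_states(gcn_si)  # initial states of the two LSTM directions
assert h0_fwd.shape == h0_bwd.shape == (128,)
```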

Finally, we use a CRF layer (Lafferty et al., 2001) on top of the BiLSTM to perform tagging,

    y_i^* = argmax_{y ∈ Y_k} p(y | z_1:k^(i); Θ_crf),    (6)

where Y_k is the set of all possible tag sequences of length k, and Θ_crf represents the CRF parameters, i.e. transition scores of tags. CRF combines the local predictions of BiLSTM and the transition scores to model the joint probability of the tag sequence.⁴

4.4 Adaptation to Word-level Graphs

GraphIE can be easily adapted to model word-level graphs. In such case, the nodes represent words in the input, i.e. the number of nodes M equals the total number of words in the N sentences. At this point, each word's hidden state in the encoder can be used as the input node vector g_i^(0) of the graph module. The GCN can then conduct graph convolution on the word-level graph and generate graph-contextualized representations for the words. Finally, the decoder directly operates on the GCN's outputs, i.e. we change the BiLSTM decoder to

    z_1:k^(i) = RNN([GCN(w_1^(i)), · · · , GCN(w_k^(i))]; 0, Θ_dec),

where GCN(w_t^(i)) is the GCN output for word w_t^(i). In this case, the BiLSTM initial states are set to the default zero vectors. The CRF layer remains unchanged.

⁴ In GraphIE, the graph module models the input space structure, i.e. the dependencies between textual units (i.e. sentences or words), and the final CRF layer models the sequential connections of the output tags. Even though loops may exist in the input graph, CRF operates sequentially, thus the inference is tractable.

As can be seen in Figure 2(c), the word-level graph module differs from the sentence-level one because it directly takes the word representations from the encoder and feeds its output to the decoder. In the sentence-level graph, the GCN operates on sentence representations, which are then used as the initial states of the decoder BiLSTM.

5 Experimental Setup

We evaluate the model on three tasks, including two traditional IE tasks, namely textual information extraction and social media information extraction, and an under-explored task — visual information extraction. For each of these tasks, we created a simple task-specific graph topology, designed to easily capture the underlying structure of the input data without any major processing. Table 1 summarizes the three tasks.

5.1 Task 1: Textual Information Extraction

In this task, we focus on named entity recognition at discourse level (DiscNER). In contrast to traditional sentence-level NER (SentNER), where sentences are processed independently, in DiscNER, long-range dependencies and constraints across sentences have a crucial role in the tagging process. For instance, multiple mentions of the same entity are expected to be tagged consistently in the same discourse. Here we propose to use this (soft) constraint to improve entity extraction.

Dataset We conduct experiments on two NER datasets: the CoNLL-2003 dataset (CONLL03) (Tjong et al., 2003), and the CHEMDNER dataset for chemical entity extraction (Krallinger et al., 2015). We follow the standard split of each corpus. Statistics are shown in Table 2.

Graph Construction In this task, we use a word-level graph where nodes represent words. We create two types of edges for each document:

• Local edges: forward and backward edges are created between neighboring words in each sentence, allowing local contextual information to be utilized.

• Non-local edges: re-occurrences of the same token other than stop words are connected, so that information can be propagated through, encouraging global consistency of tagging.⁵

Evaluation Task   Graph Type      Node           Edge
Textual IE        word-level      word           1. non-local consistency (identical mentions); 2. local sentential forward and backward
Social Media IE   sentence-level  user's tweets  followed-by
Visual IE         sentence-level  text box       spatial layout (horizontal and vertical)

Table 1: Comparisons of graph structure in the three IE tasks used for evaluation.

DATASET           Train   Dev     Test
CONLL03   #doc    946     216     231
          #sent   14,987  3,466   3,684
CHEMDNER  #doc    3,500   3,500   3,000
          #sent   30,739  30,796  26,399

Table 2: Statistics of the CONLL03 and the CHEMDNER datasets (Task 1).
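This two-edge-type construction can be sketched as follows (tokenized sentences and the stop-word list are assumed inputs; the edge representation is our own illustration):

```python
from collections import defaultdict

def build_word_graph(sentences, stop_words):
    """Build local and non-local edges over a document's words.

    sentences: list of token lists; nodes are (sent_idx, tok_idx) pairs.
    Returns a list of (node_a, node_b, edge_type) tuples.
    """
    edges, occurrences = [], defaultdict(list)
    for i, sent in enumerate(sentences):
        for t, word in enumerate(sent):
            if t + 1 < len(sent):                  # local: neighboring words in a sentence
                edges.append(((i, t), (i, t + 1), "local"))
            if word.lower() not in stop_words:     # index re-occurrences of non-stop words
                occurrences[word.lower()].append((i, t))
    for nodes in occurrences.values():             # non-local: identical mentions
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                edges.append((nodes[a], nodes[b], "nonlocal"))
    return edges

doc = [["Washington", "came", "from", "a", "family"],
       ["He", "worked", "for", "Washington"]]
edges = build_word_graph(doc, stop_words={"a", "from", "for", "he"})
```

In the model, local edges would additionally be typed as forward and backward, as described in the bullet list above.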

5.2 Task 2: Social Media Information Extraction

Social media information extraction refers to the task of extracting information from users' posts in online social networks (Benson et al., 2011; Li et al., 2014). In this paper, we aim at extracting education and job information from users' tweets. Given a set of tweets posted by a user, the goal is to extract mentions of the organizations to which they belong. The fact that the tweets are short, highly contextualized and show special linguistic features makes this task particularly challenging.

Dataset We construct two datasets, EDUCATION and JOB, from the Twitter corpus released by Li et al. (2014). The original corpus contains millions of tweets generated by ≈ 10 thousand users, where the education and job mentions are annotated using distant supervision (Mintz et al., 2009). We sample the tweets from each user, maintaining the ratio between positive and negative posts.⁶ The obtained EDUCATION dataset consists of 443,476 tweets generated by 7,208 users, and the JOB dataset contains 176,043 tweets generated by 1,772 users. Dataset statistics are reported in Table 3.

⁵ Note that other non-local relations such as co-references (cf. the example in Figure 1) may be used for further improvement. However, these relations require additional resources to obtain, and we leave them to future work.

⁶ Positive and negative refer here to whether or not the education or job mention is present in the tweet.

                 EDUCATION  JOB
Users            7,208      1,772
Edges            11,167     3,498
Positive Tweets  49,793     3,694
Negative Tweets  393,683    172,349

Table 3: Statistics of the EDUCATION and JOB datasets (Task 2).

The datasets are both split into 60% for training, 20% for development, and 20% for testing. We perform 5 different random splits and report the average results.

Graph Construction We construct the graph as ego-networks (Leskovec and Mcauley, 2012), i.e. when we extract information about one user, we consider the subgraph formed by the user and his/her direct neighbors. Each node corresponds to a Twitter user, who is represented by the set of posted tweets.⁷ Edges are defined by the followed-by link, under the assumption that connected users are more likely to come from the same university or company. An example of the social media graph is reported in the appendices.

5.3 Task 3: Visual Information Extraction

Visual information extraction refers to the extraction of attribute values from documents formatted in various layouts. Examples include invoices and forms, whose format can be exploited to infer valuable information to support extraction.

⁷ As each node is a set of tweets posted by the user, we encode every tweet with the encoder, and then average them to obtain the node representation. In the decoding phase, the graph module's output is fed to the decoder for each tweet.

Dataset The corpus consists of 25,200 Adverse Event Case Reports (AECR) recording drug-related side effects. Each case contains an average of 9 pages. Since these documents are produced by multiple organizations, they exhibit large variability in the layout and presentation styles (e.g. text, table, etc.).⁸ The collection is provided with a separate human-extracted ground truth database that is used as a source of distant supervision.

Our goal is to extract eight attributes related to the patient, the event, the drug and the reporter (cf. Table 6 for the full list). Attribute types include dates, words and phrases — which can be directly extracted from the document.

The dataset is split into 50% of the cases for training, 10% for development, and 40% for testing.

Graph Construction We first turn the PDFs to text using PDFMiner,⁹ which provides words along with their positions in the page (i.e. bounding-box coordinates). Consecutive words are then geometrically joined into text boxes. Each text box is considered as a "sentence" in this task, and corresponds to a node in the graph.

Since the page layout is the major structural factor in these documents, we work on a page-by-page basis, i.e. each page corresponds to a graph. The edges are defined to horizontally or vertically connect nodes (text boxes) that are close to each other (i.e. when the overlap of their bounding boxes, in either the vertical or horizontal direction, is over 50%). Four types of edge are considered: left-to-right, right-to-left, up-to-down, and down-to-up. When multiple nodes are aligned, only the closest ones are connected. An example of a visual document graph is reported in the appendices.
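The horizontal case of this layout-based edge construction can be sketched as follows (the box format and the overlap computation are our assumptions; the closest-only rule is simplified to the nearest right neighbor, and the vertical case would be symmetric):

```python
def overlap_ratio(a_lo, a_hi, b_lo, b_hi):
    """1-D overlap of two intervals, relative to the shorter one."""
    inter = max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    shorter = min(a_hi - a_lo, b_hi - b_lo)
    return inter / shorter if shorter > 0 else 0.0

def horizontal_edges(boxes):
    """Connect each text box to its nearest right neighbor whose vertical
    extent overlaps by more than 50%. boxes: list of (x0, y0, x1, y1)."""
    edges = []
    for i, (ax0, ay0, ax1, ay1) in enumerate(boxes):
        best = None
        for j, (bx0, by0, bx1, by1) in enumerate(boxes):
            if j == i or bx0 < ax1:                 # candidate must start to the right
                continue
            if overlap_ratio(ay0, ay1, by0, by1) > 0.5:
                if best is None or bx0 < boxes[best][0]:
                    best = j                        # keep only the closest aligned box
        if best is not None:
            edges.append((i, best, "left-to-right"))
            edges.append((best, i, "right-to-left"))
    return edges

boxes = [(0, 0, 10, 5), (20, 0, 30, 5), (50, 0, 60, 5)]
edges = horizontal_edges(boxes)   # chains the three aligned boxes left to right
```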

5.4 Baseline and Our Method

We implement a two-layer BiLSTM with a conditional random fields (CRF) tagger as the sequential baseline (SeqIE). This architecture and its variants have been extensively studied and demonstrated to be successful in previous work on information extraction (Lample et al., 2016; Ma and Hovy, 2016). In the textual IE task (Task 1), our baseline is shown to obtain competitive results with the state-of-the-art method on the CONLL03 dataset. In the visual IE task (Task 3), in order to further increase the competitiveness of the baseline, we sequentially concatenate the horizontally aligned text boxes, therefore fully modeling the horizontal edges of the graph.

Our baseline shares the same encoder and decoder architecture with GraphIE, but without the graph module. Both architectures have similar computational cost. In Task 1, we apply GraphIE with the word-level graph module (cf. Figure 2(c)), and in Task 2 and Task 3, we apply GraphIE with the sentence-level graph module (cf. Figure 2(b)).

⁸ This dataset cannot be shared for patient privacy and proprietary issues.

⁹ https://euske.github.io/pdfminer/

DATASET   Model                     F1
CONLL03   Lample et al. (2016)      90.94
          Ma and Hovy (2016)        91.21
          Ye and Ling (2018)        91.38
          SeqIE                     91.16
          GraphIE                   91.74*
CHEMDNER  Krallinger et al. (2015)  87.39
          SeqIE                     88.28
          GraphIE                   89.71*

Table 4: NER accuracy on the CONLL03 and the CHEMDNER datasets (Task 1). Scores for our methods are the average of 5 runs. * indicates statistical significance of the improvement over SeqIE (p < 0.01).

5.5 Implementation Details

The models are trained with Adam (Kingma and Ba, 2014) to minimize the CRF objective. For regularization, we choose dropout with a ratio of 0.1 on both the input word representation and the hidden layer of the decoder. The learning rate is set to 0.001. We use the development set for early stopping and the selection of the best performing hyperparameters. For CharCNN, we use 64-dimensional character embeddings and 64 filters of width 2 to 4 (Kim et al., 2016). The 100-dimensional pretrained GloVe word embeddings (Pennington et al., 2014) are used in Task 1 and 2, and 64-dimensional randomly initialized word embeddings are used in Task 3. We use a two-layer GCN in Task 1, and a one-layer GCN in Task 2 and Task 3. The encoder and decoder BiLSTMs have the same dimension as the graph convolution layer. In Task 3, we concatenate a positional encoding to each text box's representation by transforming its bounding box coordinates to a vector of length 32, and then applying a tanh activation.
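The Task 3 positional encoding can be sketched as follows (the projection is a learned parameter in the model; the random initialization and the exact coordinate format here are only for illustration):

```python
import numpy as np

def positional_encoding(bbox, W, b):
    """Map bounding-box coordinates to a 32-dim feature with tanh activation.

    bbox: (x0, y0, x1, y1); W: (32, 4) learned projection; b: (32,) bias.
    The result is concatenated to the text box's representation.
    """
    return np.tanh(W @ np.asarray(bbox, dtype=float) + b)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((32, 4))   # stands in for the learned weights
b = np.zeros(32)
pe = positional_encoding((72.0, 540.0, 210.0, 552.0), W, b)
assert pe.shape == (32,)
```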

6 Results

6.1 Task 1: Textual Information Extraction

Table 4 reports the NER accuracy on the CONLL03 (Tjong Kim Sang and De Meulder, 2003) and CHEMDNER (Krallinger et al., 2015) datasets.

For CONLL03, we list the performance of existing approaches. Our baseline SeqIE obtains competitive scores compared to the best methods. The fact that GraphIE significantly outperforms


DATASET     Dictionary          SeqIE               GraphIE
            P     R     F1      P     R     F1      P     R     F1
EDUCATION   78.7  93.5  85.4    85.2  93.6  89.2    92.9  92.8  92.9*
JOB         55.7  70.2  62.1    66.2  66.7  66.2    67.1  66.1  66.5

Table 5: Extraction accuracy on the EDUCATION and JOB datasets (Task 2). Dictionary is a naive method which creates a dictionary of entities from the training set and extracts their mentions at test time. Scores are the average of 5 runs. * indicates the improvement over SeqIE is statistically significant (Welch's t-test, p < 0.01).
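The Dictionary baseline described in the caption can be sketched as a longest-match lookup; the entity strings and test sentence below are hypothetical examples, not items from the EDUCATION or JOB datasets.

```python
def build_dictionary(train_entities):
    """Collect every entity surface form seen in the training set."""
    return {entity.lower() for entity in train_entities}

def dictionary_extract(tokens, dictionary, max_len=5):
    """Greedily match dictionary entries (longest first) in a token list,
    returning (start, end) token spans."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + n]).lower()
            if cand in dictionary:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1
    return spans

# Hypothetical training entities and test sentence
dictionary = build_dictionary(["MIT", "University of Hong Kong"])
tokens = "She studied at the University of Hong Kong".split()
print(dictionary_extract(tokens, dictionary))  # → [(4, 8)]
```

Such a baseline can only recover entities already seen in training, which is why the neural models' gains over it indicate genuine pattern learning.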

[Figure 3 bar chart: dev F1 — SeqIE 94.73, random connection 94.29, feature augmentation 94.48, GraphIE 95.12]

Figure 3: Analysis on the CONLL03 dataset. We compare with two alternative designs: (1) random connection, where we replace the constructed graph by a random graph with the same number of edges; (2) feature augmentation, where we use the average embedding of each node and its neighbors as input to the decoder, instead of the GCN, which has additional parameters. We report F1 scores on the development set.

it highlights once more the importance of modeling non-local and non-sequential dependencies, and confirms that our approach is an appropriate method to achieve this goal.10

For CHEMDNER, we show the best performance reported in Krallinger et al. (2015), obtained with a feature-based method. Our baseline outperforms the feature-based method, and GraphIE further improves the performance by 1.4%.

Analysis To understand the advantage of GraphIE, we first investigate the importance of the graph structure to the model. As shown in Figure 3, using random connections clearly hurts performance, bringing the F1 score of GraphIE down from 95.12% to 94.29%. This indicates that the task-specific graph structures introduce a beneficial inductive bias. Trivial feature augmentation also does not work well, confirming the necessity of learning the graph embedding with the GCN.
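The random-connection control condition amounts to keeping the node set and edge count while rewiring the edges uniformly at random. A minimal sketch, with hypothetical node and edge lists:

```python
import random

def random_graph_same_size(nodes, edges, seed=0):
    """Replace a constructed graph by a random one with the same nodes
    and the same number of (undirected) edges."""
    rng = random.Random(seed)
    possible = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]]
    return rng.sample(possible, len(edges))

# Hypothetical word nodes and identical-mention edges
nodes = ["w1", "w2", "w3", "w4"]
edges = [("w1", "w2"), ("w2", "w3")]
rand_edges = random_graph_same_size(nodes, edges)
print(len(rand_edges))  # 2
```

Because the edge budget is held fixed, any performance drop under this control can be attributed to losing the task-specific structure rather than to having fewer connections.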

We further conduct error analysis on the test set to validate our motivation that GraphIE resolves tagging ambiguity by encouraging consistency among identical entity mentions (cf. Figure 1). Here we examine word-level tagging accuracy. We define as ambiguous the words that have more than one possible tag in the dataset. We find that among the 1.78% tagging errors of SeqIE, 1.16% involve ambiguous words and 0.62% unambiguous ones. GraphIE reduces the error rate to 1.67%, with 1.06% ambiguous and 0.61% unambiguous. Most of the error reduction is thus attributable to ambiguous words.

10 We achieve the best reported performance among methods not using the recently introduced ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), which are pretrained on extra-large corpora and computationally demanding.

6.2 Task 2: Social Media Information Extraction

Table 5 shows the results for the social media information extraction task. We first report a simple dictionary-based method as a baseline. Neural IE models achieve much better performance, showing that the models learn meaningful patterns rather than simply memorizing the entities in the training set. The proposed GraphIE outperforms SeqIE on both the EDUCATION and JOB datasets, and the improvement is more significant on the EDUCATION dataset (3.7% versus 0.3%). The reason for this difference is the variance in the affinity scores (Mislove et al., 2010) between the two datasets. Li et al. (2014) report that the affinity value for EDUCATION is 74.3 while for JOB it is only 14.5, meaning that in these datasets neighbors are five times more likely to have studied at the same university than to have worked at the same company. We can therefore expect a model like GraphIE, which exploits neighbors' information, to obtain a larger advantage on a dataset characterized by higher affinity.

6.3 Task 3: Visual Information Extraction

Table 6 shows the results for the visual information extraction task. GraphIE outperforms the SeqIE baseline on most attributes, and achieves a 1.2% improvement in the micro average F1 score. This confirms the benefits of using the layout graph structure in visual information extraction.

The extraction performance varies across the attributes, ranging from 61.4% for Drug Name to


ATTRIBUTE       SeqIE               GraphIE
                P     R     F1      P     R     F1
P. Initials     93.5  92.4  92.9    93.6  91.9  92.8
P. Age          94.0  91.6  92.8    94.8  91.1  92.9
P. Birthday     96.6  96.0  96.3    96.9  94.7  95.8
Drug Name       71.2  51.2  59.4    78.5  50.4  61.4
Event           62.6  65.2  63.9    64.1  68.7  66.3
R. First Name   78.3  95.7  86.1    79.5  95.9  86.9
R. Last Name    84.5  68.4  75.6    85.6  68.2  75.9
R. City         88.9  65.4  75.4    92.1  66.3  77.1

Avg. (macro)    83.7  78.2  80.3    85.7  78.4  81.1†
Avg. (micro)    78.5  73.8  76.1    80.3  74.6  77.3†

Table 6: Extraction accuracy on the AECR dataset (Task 3). Scores are the average of 5 runs. P. is the abbreviation for Patient, and R. for Reporter. † indicates statistical significance of the improvement over SeqIE (p < 0.05).

Model                 Dev F1

GraphIE               77.8
– Edge types          77.0 (↓ 0.8)
– Horizontal edges    74.7 (↓ 3.1)
– Vertical edges      72.4 (↓ 5.4)
– CRF                 72.1 (↓ 5.7)

Table 7: Ablation study (Task 3). Scores are micro average F1 on the development set. "–" means removing the element from GraphIE.

95.8% for Patient Birthday (similar variation is visible in the baseline). Similarly, the gap between GraphIE and SeqIE varies across attributes, ranging from −0.5% for Patient Birthday to 2.4% for Event.

The ablation test in Table 7 shows the contribution of using separate weights for different edge types (+0.8%), horizontal edges (+3.1%), vertical edges (+5.4%), and the CRF (+5.7%).

Generalization We also assess GraphIE's capacity to deal with unseen layouts through an additional analysis. From our dataset, we sample 2,000 reports containing the three most frequent templates, and train the models on this subset. We then test all models in two settings: 1) seen templates, consisting of 1,000 additional reports using the same templates as in training; and 2) unseen templates, consisting of 1,000 reports of two new template types.

The performance of GraphIE and SeqIE is reported in Figure 4. Both models achieve good results on seen templates, with GraphIE still scoring 2.8% higher than SeqIE. The gap becomes even

[Figure 4 bar chart: micro average F1 — seen templates: SeqIE 80.3, GraphIE 83.1; unseen templates: SeqIE 13.4, GraphIE 33.7]

Figure 4: Micro average F1 scores tested on seen and unseen templates (Task 3).

larger when our model and the sequential one are tested on unseen templates (i.e. 20.3%), demonstrating that by explicitly modeling the richer structural relations, GraphIE achieves better generalizability.

7 Conclusions

We introduced GraphIE, an information extraction framework that learns local and non-local contextual representations from graph structures to improve predictions. The system operates over a task-specific graph topology describing the underlying structure of the input data. GraphIE jointly models the representations of the nodes (i.e. textual units, namely words or sentences) and their dependencies. Graph convolutions propagate information through neighboring nodes to support the decoder during tagging at the word level.

We evaluated our framework on three IE tasks, namely textual, social media, and visual information extraction. The results show that it efficiently models non-local and non-sequential context, consistently enhancing accuracy and outperforming the competitive SeqIE baseline (i.e. BiLSTM+CRF).

Future work includes exploring how to automatically learn the underlying graph structure of the input data.

Acknowledgments

We thank the MIT NLP group and the reviewers for their helpful comments. This work is supported by the MIT-IBM Watson AI Lab. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.


References

Yonatan Aumann, Ronen Feldman, Yair Liberzon, Benjamin Rosenfeld, and Jonathan Schler. 2006. Visual information extraction. Knowl. Inf. Syst., 10(1):1–15.

Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In Proceedings of ACL, pages 389–398. ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL, pages 363–370. ACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. In Proceedings of ACL, pages 2410–2420.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of AAAI, pages 2741–2749. AAAI Press.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia. 2015. CHEMDNER: The drugs and chemical names extraction challenge. Journal of Cheminformatics, 7(1):S1.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270, San Diego, California. ACL.

Jure Leskovec and Julian J. Mcauley. 2012. Learning to discover social circles in ego networks. In NIPS, pages 539–547.

Jiwei Li, Alan Ritter, and Eduard Hovy. 2014. Weakly supervised user profile extraction from Twitter. In Proceedings of ACL, volume 1, pages 165–174.

Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features. In Proceedings of ACL, volume 1, pages 73–82.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL, pages 1064–1074, Berlin, Germany. ACL.

Gideon S. Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11(Feb):955–984.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL, pages 1003–1011. ACL.

Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. 2010. You are who you know: Inferring user profiles in online social networks. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, pages 251–260. ACM.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv preprint arXiv:1601.00770.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. TACL, 5:101–115.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, volume 1, pages 2227–2237.

Chris Quirk and Hoifung Poon. 2017. Distant supervision for relation extraction beyond the sentence boundary. In Proceedings of ACL, volume 1, pages 1171–1182.

Roi Reichart and Regina Barzilay. 2012. Multi event extraction guided by global constraints. In Proceedings of NAACL-HLT, pages 70–79. ACL.

Angus Roberts, Robert Gaizauskas, and Mark Hepple. 2008. Extracting clinical relationships from patient narratives. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, pages 10–18. ACL.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. N-ary relation extraction using graph-state LSTM. In Proceedings of EMNLP, pages 2226–2235.

Kumutha Swampillai and Mark Stevenson. 2011. Extracting relations within and across sentences. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 25–32.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of NAACL-HLT, pages 142–147. ACL.

Zhixiu Ye and Zhen-Hua Ling. 2018. Hybrid semi-Markov CRF for neural sequence labeling. In Proceedings of ACL, pages 235–240.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of EMNLP.

A Appendices

We show some examples of the constructed graphs for different information extraction tasks.

A.1 Social Media Information Extraction

Figure 5: Mock-up example of social media information extraction (Task 2). Nodes represent users and edges represent follow-by relations.

A.2 Visual Information Extraction

Figure 6: Mock-up example of visual information extraction (Task 3). The two forms have different layouts. Graphical dependencies are shown as green lines connecting text in blue bounding boxes.