in layman’s terms: semi-open relation extraction from ...meval 2018 task 7 dataset focuses on 6...

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics pages 1489ndash1500July 5 - 10 2020 ccopy2020 Association for Computational Linguistics

1489

In Laymanrsquos Terms Semi-Open Relation Extraction from Scientific Texts

Ruben Kruiperlowast Julian FV Vincent Jessica Chen-BurgerMarc PY Desmulliez Ioannis Konstas

Heriot-Watt UniversityRiccarton Campus EH14 4ASEdinburgh United Kingdom

lowastCorresponding author rk22hwacuk

Abstract

Information Extraction (IE) from scientifictexts can be used to guide readers to the centralinformation in scientific documents But nar-row IE systems extract only a fraction of theinformation captured and Open IE systems donot perform well on the long and complex sen-tences encountered in scientific texts In thiswork we combine the output of both types ofsystems to achieve Semi-Open Relation Ex-traction a new task that we explore in the Bi-ology domain First we present the FocusedOpen Biological Information Extraction (FO-BIE) dataset and use FOBIE to train a state-of-the-art narrow scientific IE system to extracttrade-off relations and arguments that are cen-tral to biology texts We then run both the nar-row IE system and a state-of-the-art Open IEsystem on a corpus of 10k open-access scien-tific biological texts We show that a signifi-cant amount (65) of erroneous and uninfor-mative Open IE extractions can be filtered us-ing narrow IE extractions Furthermore weshow that the retained extractions are signifi-cantly more often informative to a reader1

1 Introduction

Identifying the central theme and concepts in sci-entific texts is a time-consuming task for expertsand a hard task for laymen (Alper et al 2004 El-Arini and Guestrin 2011 Pain 2016) This prob-lem is even more pronounced in inter-disciplinaryfields of study where experts in a target domainoften lack the deeper knowledge of a source do-main (Carr et al 2018) A specific example isbiomimetics an engineering problem-solving pro-cess in which one draws on analogous biologicalsolutions (Kruiper et al 2016) A major issueis that engineers (target domain) know little biol-ogy (source domain) or characteristics of plants or

1We release FOBIE and code at httpsgithubcomrubenkruiperFOBIE

Figure 1 Example of Semi-Open Relation Extractionin the Biology domain (Burgess et al 2006) We firstextract the TRADE-OFF expressed between the centralconcepts lsquosafetyrsquo and lsquoefficiencyrsquo (in blue) that takesplace specifically in lsquoconifer speciesrsquo (in green) Wefurther explore the content of a paper by investigatingthe results of an Open Information Extraction (OIE)system (in red) The central concepts captured by aTRADE-OFF mechanism enable the filtering of manyirrelevant OIE extractions which are found to be error-prone in scientific texts Further OIE extractions foundin the document can shed light on the semantic mean-ing of relevant concepts eg lsquoxylemrsquo as depicted atthe top of the Figure (in the red box)

animals (Vattam and Goel 2013) This domain-mismatch complicates searching for and reason-ing over relevant scientific information renderingbiomimetics adventitious and solutions serendipi-tous (Kruiper et al 2018)

Recently TRADE-OFF relations have become ofinterest to biomimetics (Adriaens 2019) becausea trade-off defined in technology can be directlyused to search for relevant texts in biology (Vincent2016) TRADE-OFF relations express a problemspace in terms of mutual exclusivity constraintsbetween competing demands Therefore trade-offs play a prominent role in evolutionary think-ing (Agrawal et al 2010) and are the principalrelation under investigation in a significant portionof biology research papers (Garland 2014) The

1490

functional demands that are traded off are usuallyabstract and domain-independent terms such aslsquosafetyrsquo and lsquoefficiencyrsquo in Figure 1 A gap remainsin quickly comprehending the central informationin a text eg the biological mechanisms that areused to manipulate a trade-off

Information Extraction (IE) and specifically Re-lation Extraction (RE) can improve the access tocentral information for downstream tasks (Santoset al 2015 Zeng et al 2014 Jiang et al 2016Miwa and Bansal 2016 Luan et al 2018a) How-ever the focus of current RE systems and datasetsis either too narrow ie a handful of semantic rela-tions such as lsquoUSED-FORrsquo and lsquoSYNONYMYrsquo ortoo broad ie an unbounded number of generic re-lations extracted from large heterogeneous corpora(Niklaus et al 2018) referred to as Open IE (OIE)(Etzioni et al 2005 Banko et al 2007) Narrowapproaches to IE from scientific text (Augensteinet al 2017 Gabor et al 2018 Luan et al 2018a)cover only a fraction of the information capturedin a paper ndash usually what is within an abstract Ithas been shown that scientific texts contain manyunique relation types and therefore it is not fea-sible to create separate narrow IE classifiers forthese (Groth et al 2018) On the other hand OIEsystems are primarily developed for the Web andnews-wire domain and have been shown to performpoorly on scientific texts What laymen really needis a bit of both the accuracy of narrow RE systemsto extract central relations from scientific texts andthe flexibility of an OIE system to capture a muchlarger fraction of the possible relations expressedin scientific texts

This work aims to enable rapid comprehensionof a large scientific document by identifying a) thecentral concepts in a text and b) the most signif-icant relations that govern these central conceptsTo this end we introduce the task of Semi-Open Re-lation Extraction (SORE) Figure 1 illustrates theSORE process First we find the central conceptslsquosafetyrsquo and lsquoefficiencyrsquo involved in a TRADE-OFF

relation Then by using the argument conceptsof the relation as anchor points we can explorefurther concepts and relations eg lsquoxylemrsquo in Fig-ure 1 Uncovering these relations can elucidatethe meaning of unfamiliar concepts to a layper-son (Mausam 2016) The SORE approach is hy-pothesized to reduce the number of uninformativeextractions without limiting RE to a finite set ofrelations which could generally benefit IE from sci-

entific articles eg materials discovery (Kononovaet al 2019) and drug-gene-mutation interactions(Jia et al 2019)

To address SORE we create the FocusedOpen Biological Information Extraction (FOBIE)dataset FOBIE includes manually-annotated sen-tences that express explicit trade-offs or syntac-tically similar relations that capture the centralconcepts in full-text biology papers We train aspan-based RE model used in a strong scientific IEsystem (Luan et al 2018a) to jointly extract theserelation structures We explore SORE and use theoutput of our model to filter the output of an OIEsystem (Saha and Mausam Saha et al 2017 Paland Mausam 2016 Christensen et al 2011) ona corpus of biology papers Qualitative analysesshow that the output of a narrow RE model canspeed up expert analysis of trade-offs in biologicaltexts and be used to filter out both erroneous anduninformative OIE extractions

2 Related work

21 Open Information Extraction

OIE systems use a set of handcrafted or learnedextraction rules and rely on dependency featuresto extract open-domain relational tuples from text(Yu et al 2017 Niklaus et al 2018) As OIE sys-tems rely on syntactic features they require littlefine-tuning when applied to different domains andthe extraction rules work for a variety of relationtypes (Mausam 2016) These properties can be es-pecially useful on scientific texts where additionalknowledge on unknown concepts can ease the tex-tual comprehension for non-experts Consider theexample OIE extractions for lsquoxylemrsquo in the top partof Figure 1

Existing OIE systems have been shown to per-form significantly worse on the longer and morecomplex sentences found in scientific texts than onWikipedia texts (Groth et al 2018) Common is-sues of OIE systems on Web News and Wikipediatexts include the correct identification of the bound-aries of an argument handling latent n-ary rela-tions difficulty handling negations and generatinguninformative extractions (Schneider et al 2017)Groth et al (2018) evaluate the output of two state-of-the-art OIE systems based on correctness ratherthan eg the number of missed extractions Theynote that the crux of the IE challenge is that extrac-tions reflect the consequence of the sentence As anexample of an uninformative extraction Fader et al

1491

(2011) note how lsquo(Faust made a deal)rsquo capturesthe consequence but not the critical informationof whom Faust made a deal with in the sentenceldquoFaust made a deal with the devilrdquo In this workwe explore filtering both incorrect and uninforma-tive OIE extractions from scientific texts using thecentral concepts that we extract through narrow IE(cf Section 53)

22 Narrow Relation Extraction fromscientific text

Narrow RE entails identifying two or more relatedentities in a text and classifying the relation thatholds between them Early works on the combinedtask of Named Entity Recognition and labelingof relations between extracted entities used pre-computed dependency features (Liu et al 2013Chen et al 2015 Lin et al 2016) word positionembeddings (Zeng et al 2014) or considered onlythe Shortest Dependency Path between two enti-ties as input (Bunescu and Mooney 2005 Santoset al 2015 Zeng et al 2015) Later work aimed toreduce errors propagated by pre-computed depen-dency features (Nguyen and Grishman 2015) orby joint modeling of entities and relations (Miwaand Bansal 2016) Poor performance of these REsystems on scientific texts has led to the develop-ment of domain-specific datasets2

The SCIENCEIE dataset focuses on the extrac-tion of 3 types of key-phrases rather than NamedEntities and hyponymy and synonymy relationsbetween these (Augenstein et al 2017) The Se-mEval 2018 task 7 dataset focuses on 6 narrow re-lations between 7 entity types (Gabor et al 2018)And the SCIERC dataset focuses on 7 relationtypes including co-reference between 6 types ofentities (Luan et al 2018a) Top systems devel-oped for both SemEval tasks adapt the LSTM-based approach of Miwa amp Bansal (2016) com-bined with semi-supervised learning and ensem-bling (Ammar et al 2018) as well as pre-trainedconcept embeddings (Luan et al 2018b)

23 BioNLP and BioCreAtIvE

In the past several BioNLP and BioCreAtIvEshared tasks were organized that aimed at iden-tifying relations in the biology domain (Hirschman

2SCIENCEIE SemEval 2017 500 paragraphs from full-text Computer Science Material Science and Physics journalarticles SemEval 2018 500 abstracts within the domain ofComputational Linguistics SCIERC 500 abstracts from Arti-ficial Intelligence conference and workshop proceedings

FOBIE SCIENCEIE SE rsquo18 SCIERCArgrsquos 5834 9946 7483 8089Relrsquos 4788 672 1595 4716Rdoc 309 13 32 94

Table 1 Number of arguments relations and relationsper instance for FOBIE SCIENCEIE the SemEval2018 task 7 dataset and SCIERC Rdoc stands for re-lations per sentence for FOBIE (and per abstract orparagraph for the other datasets)

et al 2005 Kim et al 2009 Nedellec et al 2013Zhou et al 2014) Many datasets focus primarilyon a predefined set of biomedical relations suchas interactions between known proteins genes dis-eases drugs and chemicals (Kim et al 2003Krallinger et al 2017 Cohen et al 2017 Isla-maj Dogan et al 2019) Examples of more biology-oriented corpora include the BB corpus (Delegeret al 2016) and the SEEDEV corpus (Chaix et al2016) The BB corpus includes 4 entity types and 2relation types that revolve around microorganismsof food interest Besides abstracts and titles it con-tains paragraphs and sentences from 20 full-textdocuments (Bossy et al 2019) Similarly SEEDEV

consists of 86 paragraphs from 20 full-text articlesabout seed development in a specific plant the Ara-bidopsis thaliana Considering the small size ofthe dataset a relatively large number of many en-tity and relation types are used 16 types of NamedEntities and 21 types of relations This results in animbalanced dataset with 7 relations making up lessthan 1 of all relations Furthermore there is someoverlap in source documents for the traindevtestsplit (Chaix et al 2016)

In contrast to the previously described datasets FO-BIE does not classify arguments of relations intospecific entity-types FOBIE contains annotationsof key-phrases found in full-text scientific paperssimilar to SCIENCEIE The key-phrases and re-lations are annotated in 1548 relatively long andcomplex sentences which were sourced from 1215full-text scientific biological texts using a Rule-Base System Table 1 provides an overview of thesize of FOBIE in comparison to SCIENCEIE theSemEval 2018 task 7 dataset and SCIERC Boththe BB and SEEDEV corpus contain approximately3500 relations within a small sub-domain of biol-ogy while FOBIE focuses more generally on thedomain of biology Section 3 describes the collec-tion of FOBIE and dataset statistics in detail

1492

3 Dataset description

31 Dataset collection

A variety of words are able to indicate a trade-offeg compromise optimization balance interplayand conflict (Kruiper et al 2018) We adapt theseterms as trigger words in a Rule-Based System(RBS) and run it on 10k open-access papers thatwere collected from the Journal of Experimental Bi-ology (JEB) and BioMed Central (BMC) journalson lsquoBiologyrsquo lsquoEvolutionary Biology and lsquoSystemsBiologyrsquo The selection of journals was made onlyto the extent that the articles focus on the biologicaldomain We retained the abstract introduction re-sults discussion and conclusion sections We usedspaCy3 to split the texts into sentences and identifyPOS tags and dependency structure The FOBIEdataset contains only sentences that the RBS iden-tified as expressing a TRADE-OFF relation

32 Annotation

The initial annotations extracted by the RBS weremanually corrected and extended by a biology ex-pert using the BRAT interface (Stenetorp et al2012) We define three relation types TRADE-OFFARGUMENT-MODIFIER and NOT-A-TRADE-OFFThe latter denotes phrases that are related to a trig-ger word but not by a TRADE-OFF relation Thesesyntactically similar relations provide useful train-ing signal as negative samples Negative samplesare important because possible trigger words can becontiguous eg the phrase lsquonegative correlationrsquodenotes a TRADE-OFF relation whereas lsquocorrela-tionrsquo by itself does not As a result the annotationof training examples is harder and lexical and syn-tactic patterns that correctly signify the relation aresparse (Peng et al 2017) For simplicityrsquos sakewith some abuse of terminology we refer to allsuch relations collectively as trade-offs

We found a substantial amount of arguments tobe nested or in a non-projective relationship InFigure 2 the prepositional phrase lsquoin jumpingrsquo con-ceptually refers to both central concept argumentsof the relation ie lsquothe need for energy storagersquoand lsquothe presence of resilinrsquo We adopt the follow-ing annotation heuristic prepositional phrases aretreated as modifying phrases when they apply tomultiple arguments (as is the case in Figure 2) orcan be distinctly separated from the argument egby punctuation

3httpsspacyio

Sentences 1548Avg sent length 3777 of sents ge 25 tokens 7926Relations- TRADE-OFF 765- NOT-A-TRADE-OFF 2502- ARG-MODIFIER 1521Triggers 1600Arguments 4234Spans 6309Unique spans 2701Unique triggers 41Max triggerssent 2Max spanssent 7Spans w multiple relations 2075 single-word arguments 498Avg tokens per argument 344

Table 2 Aggregated statistics for FOBIE

We randomly selected 250 sentences (161)for re-annotation and quality control by a seconddomain expert The inter-annotator agreement Co-hen k is found to be 093 Table 2 summarizesstatistics on FOBIE The final dataset consists of1548 single sentences from 1292 unique docu-ments split into 1248150150 traindevtest Thesplit is controlled for source document overlap toavoid having identical arguments of relations ap-pearing both during training and testing FOBIEcontains relatively long key-phrases with an aver-age of 344 tokens and only 12 of them consistof a single token In comparison SCIENCEIE andSCIERC both contain 31 singleton key-phrasesand the average entity length in SCIERC is 236Furthermore sentences taken from full-text doc-uments are longer than those found in abstractsThe average sentence length in SCIERC is 2431tokens while 7926 of the sentences in FOBIEare longer than 25 tokens

4 Narrow IE baseline

41 Task definition

Following Peng et al (2017) we extract n-ary rela-tions by (1) identifying the trigger and (2) extract-ing the binary relations between this trigger andthe arguments ndash inspired by Davidsonian seman-tics We define key-phrases as spans of consecutivewords s isin S with S all possible spans in a sen-tence and relation-types as r isin Rd with d the totalnumber of unique relations Then a binary rela-tion is a tripleltgovernor relation dependentgtwith governor and dependent elements of S Theunion of the following binary relations found in asentence may constitute a non-projective graph

1493

Figure 2 Example of an annotation in BRAT showing a trigger word lsquocorrelationrsquo that is related to two argumentswhich in turn are related to a single modifier The trigger word does not indicate a TRADE-OFF relation but apositive correlation

Cleartrade-offs

[]energystorage

Figure 3 We provide the SCIIE system with single sentences D as input For all possible spans up to width Wa span label isin LE is computed and a mention score φmr Spans with the lowest mention scores are pruned withvariable beam size λn For combinations of remaining spans a relation label isin LR is predicted The set of spanlabels LE and the set of relation labels LR both contain a dummy class ε

Def 1 An explicit trade-off is an instance of adirected relation t isin T o indicated by trigger wordp isin P u with u the set of unique trigger wordsand P sub S A trade-off is a binary relation t |= owith governor isin P and dependent isin S A singletrigger word p can be in n multiple relations

Def 2 An argument-modifier is a directed bi-nary relation a isin Am where we omit the classi-fication of a into a set of possible modificationtypes isin m An instance of a is then a tupleltgovernor relation dependentgt where one ofthe arguments is related to a trigger word p andboth arguments isin S

42 Baseline system

We adapt a span-based approach that has been usedpreviously for the tasks of co-reference resolution(Lee et al 2017) Semantic Role Labeling (Heet al 2018) and scientific IE (Luan et al 2018a)The use of span representations as classifier fea-tures enables end-to-end learning by propagatinginformation between multiple tasks without increas-ing the complexity of inference We train the SCIIEsystem (Luan et al 2018a) on FOBIE to extractspans that constitute trigger words and key-phrasesas well as the binary relations between these spans

Figure 3 illustrates the input that we provideto SCIIE All tokens are embedded using GloVe(Pennington et al 2014) and ELMo embed-

dings (original) (Peters et al 2018) For a sin-gle sentence D = w1 wn all possible spansS = s1 sN are computed which are within-sentence word sequences

The model deals with O(n4) possible combina-tions of spans where n is the number of wordsin a sentence Therefore pruning is required tomake the classification of span-pairs into relationlabels tractable at both training and test time (Leeet al 2017 He et al 2018) First a score φmr ofhow likely a span is mentioned in a relation is com-puted These mention scores enable beam pruningthe number of spans considered for relation classi-fication with a variable beam of size λn where nis the number of tokens in the input sentence (Luanet al 2018a) Second the maximum width W ofspans is limited to reduce the total number of spansWe set λ to 8 and W to 14 tokens the maximumspan length in FOBIE

After pruning a label ei isin LE is predicted forthe remaining spans si Here LE is the set ofpossible span labels including a non-span class εFor pairs of spans (si sj) the model predicts whichrelation rij isin LR holds between them The set ofpossible relation types is LR which includes a non-relation class ε The output consists of labeledspans and relation labels for pairs of spans For adetailed description of the SCIIE system we referto Luan et al (2018a)

1494

Pred amp

Gold

Gold

Pred

Gold

Pred

Figure 4 Examples of output from the SCIIE system trained on FOBIE in comparison to gold annotation

43 Narrow IE results

We evaluate SCIIE on two sub-tasks (1) ArgumentRecognition and (2) Relation Extraction Table 3summarizes the results on the sub-tasks of Argu-ment Recognition and RE With regards to the firstsub-task we train two SCIIE models One modelonly predicts whether a span is a valid span or notwhile a second model predicts whether the span is atrigger word or a key-phrase For the first sub-taskwe also report the results of the RBS describedin Section 31 The RBS performs significantlyworse it identifies trigger words exceptionally well(F1=9589 on test set) but does not correctly recog-nize many of the remaining key-phrases (F1=2236on test set) resulting in a low overall performance

Figure 4 shows example outputs of the narrowRE model The predicted relation (NOT-A-TRADE-OFF) and its accompanying structure for the firstexample are completely correct Note how theargument modifiers result in a non-projective struc-ture The second example is more challengingwith a longer range dependency between the trade-off span and the second dependent argument Ourmodel predicts the correct relation TRADE-OFFbut only extracts partial argument spans and es-sentially fragments them into several modifying

argument relations The third example exhibits arelatively long argument ndash which is common inscientific literature ndash where only a small part of thespan is predicted

44 Supporting trade-off annotation

A qualitative analysis confirms the ability of thetrained narrow IE system to support a domain ex-pert during trade-off annotation We predict trade-offs for 523 unlabeled scientific papers that havebeen annotated with a trade-off in an ontology ofbiomimetics (Vincent 2014 2016) A domain ex-pert compares the trade-offs found in the ontologyof biomimetics against the output of the SCIIE sys-tem see Table 4 Narrow IE is found to locatethe central TRADE-OFF relations and argumentsfor 4168 of the total 523 papers Explicit trade-offs were found in 243 documents At least one ofthe extracted TRADE-OFF relations for each docu-ment is identical to the expert annotation in 7737of these documents For 8971 of the 243 doc-uments a trade-off was found to be correct aftersome interpretation by the expert Two main typesof uninformative trade-offs were found trade-offsfrom a cited source and trade-offs between genericterms eg a trade-off between cost and benefitwithout defining what the cost and benefit are

1495

Argument Recognition P R F1RBS 4431 3532 3931SCIIE unlabeled spans 8676 7939 8291SCIIE labeled spans 8386 8083 8232Relation Extraction P R F1RBS mdash mdash mdashSCIIE unlabeled spans 6853 6548 6697SCIIE labeled spans 6771 6771 6771

Table 3 Results on test set for our Rule-Based System(RBS) and SCIIE (Luan et al 2018a) wrt argumentrecognition and the combined task of extracting andclassifying relations (RE) Providing the model withlabels to distinguish trigger words from key-phrasesslightly improves performance

Documents with identified trade-offs 243Exact match 7737Match after interpretation 8971

Sentences with identified trade-off 998Exact match 6804Match after interpretation 8447

Table 4 Manual analysis of extractions from 523 sci-entific documents that were used in the creation of anontology of biomimetics (Vincent 2014 2016)

5 Semi-Open Relation Extraction

51 Task description

We define the aim of SORE as extracting the rela-tions and concepts in a text that capture the mostcentral information The application of SORE isespecially of interest to scientific IE where OIEsystems perform poorly and narrow IE systems areunable to cover the wealth of different relationstypes One possible approach is to automaticallyfilter out uninformative and incorrect extractionsgenerated by OIE systems In this approach SORErelies on the output of both types of systems pro-viding a middle ground between precise narrow IEand unbounded but unreliable OIE The resultingextractions are expected to be useful for humanreaders but can also be used to collect data forannotation and training of scientific IE systems

52 Experimental setup

We explore SORE on scientific biology texts usingthe output of the SCIIE system trained on FOBIEpredicting trade-offs for the unlabeled 10k openaccess biology papers (see section 31) The nar-row IE output consists of 2216 trade-offs found in1279 documents We pre-process arguments by ap-pending their modifier removing stop words andembedding the remaining sequences using ELMo

(PubMed)4 We use the K-means algorithm to com-pute clusters on the IDF-weighted average of theresulting argument representations A domain ex-pert inspected the centroids qualitatively Table 5provides insight into some of the resulting argu-ment clusters and their interrelations The exactnumber of clusters does not seem to greatly affectSORE For the given narrow IE outputplusmn50 clustersseems to provide a good balance between genericand more fine-grained topics The IDF weightsare computed over the subword units found in thedataset we use SentencePiece5 with a vocabularyof 16K We then run OpenIE 5 a state-of-the-artOIE system (Saha and Mausam Saha et al 2017Pal and Mausam 2016 Christensen et al 2011)on the same 1279 documents that were found tocontain one or more TRADE-OFF relations

We retain only OIE extractions that contain oneor more arguments that are classified into the samecluster as the TRADE-OFF arguments found in thattext Furthermore we omit OIE arguments thatbelong to noisy clusters containing mostly mathsymbols or long nested phrases We compute asimple IDF-weighted cosine similarity (Galarragaet al 2014) between the vector representations ofthe remaining OIE and trade-off arguments

53 Qualitative analysis of SORE output

We notice a striking drop in the number of irrel-evant and noisy OIE arguments that remain afterapplying SORE The total amount of OIE extrac-tions reduces from 401k before filtering to 140k(3495) after filtering As a result the number ofOIE extractions per document reduces from 314 to110 The unfiltered OIE extractions are found in170k sentences of which 67k (3955) are retainedafter applying SORE

To test our hypothesis that SORE can reducethe number of uninformative extractions withoutlimiting RE to a narrow set of relations we ran-domly select representative samples of unfilteredand filtered OIE extractions (400 each) A domainexpert manually annotated whether each extractionor sentence was thought to be informative egprovides relevant information to understanding abiological text As an example consider the sen-tence ldquoWe have used this approach in a previousstudy to investigate the molecular factors govern-ing the altered liver regeneration dynamics caused

4httpsallennlporgelmo5httpsgithubcomgooglesentencepiece

1496

Cluster name Immunity Size LocomotionTop-5 arguments immunity size swimming

immune function number sprintingthe immune system volume runningincompetence age locomotionimmune response time diving

Top-3 related clusters Mating Temperature Attribute of AnimalReproduction Sperm Length VerbsLife History Traits Offspring Number CapacityEndurance

Table 5 Examples of clusters found using the K-means algorithm on trade-off arguments from 1279 documentsFor the related clusters only TRADE-OFF relations are taken into account

by ablation of the gene adiponectin (Adn)rdquo (Cooket al 2015) OIE extractions such as lsquo(We haveused this [] study)rsquo are considered uninforma-tive in contrast to lsquo(the molecular [] dynamicscaused by ablation [] adiponectin)rsquo

Many OIE extractions are found to be poorlystructured Like Groth et al (2018) we relaxthe requirement of extractions being well-formedeg we consider extractions that incorrectly iden-tify the boundaries of one or more arguments aspossibly capturing relevant information Differentfrom their evaluation on correctness we evaluatewhether an extraction captures information that isrelevant to understanding a text As a result weconsider poorly structured OIE extractions that con-tain relevant information to be informative eg

bull (rsquothe resumption of respirationrsquo rsquo can lead toan increase of superoxide anions in the cytosolperhaps drivingrsquo rsquo increased elevation of Cu-ZnSODrsquo)

bull (rsquotranscriptional coregulation amongst manygenesrsquo rsquo will giversquo rsquo rise to indirect interac-tion effects in mRNA expression datarsquo)

The annotation relies on the correctness of the in-formation captured by OIE extractions and whetherthis information is useful to a reader However thisdoes not imply informative extractions are relevantto the central theme of the text captured in a trade-off We consider OIE extractions uninformative ifthe extraction

bull contains an uninformative argument classeg (rsquoMiller et al 2012rsquo rsquo to minimizersquo rsquotheir swimming effortrsquo)

bull contains incomplete arguments eg (rsquotheRDME requirementrsquo rsquo reactionsrsquo rsquo only firersquo)

bull is non-sensible eg (rsquoP magellanicusrsquo rsquowould have resultedrsquo rsquo in a 16-fold higherVmax for the scallop musclersquo)

bull is unlikely to help understand a text eg(rsquoDeepBindrsquo rsquo was trainedrsquo rsquo on data fromRNAcompete CLIP - RIP - seq [ 10rsquo)and (rsquomicrolepidopteran superfamiliesrsquo rsquo areheavily entombedrsquo rsquo Lin amberrsquo)

We also randomly select representative samplesfrom the 170k unfiltered and 67k filtered sentencesfrom which the OIE extractions are sourced Thereason is that erroneous OIE extractions eg notwell-formed tuples can guide a reader to informa-tive passages in a text We see similar errors asdescribed by Schneider et al (2017) and Groth etal (2018) eg long sentences lead to incorrectextractions and errors in argument boundaries Toillustrate the complexity of sentences that an OIEsystem encounters in scientific texts consider thefollowing examples

bull the arity of relations can be high eg (49tokens) ldquoA large genome size tends to cor-relate with delayed mitotic and meiotic divi-sion [6ndash8] decreased plant invasiveness ofdisturbed sites [9] lower maximum photosyn-thetic rates in plants [2] and lower metabolicrates in mammals [10] and birds [11 12]rdquo(Warringer and Blomberg 2006)

bull many phrases are nested and express non-verbal relations eg (45 tokens) ldquoHoweverfor arboreal animals that regularly jump be-tween branches (often when elevated quitehigh above the ground) jumping accurately(which we define as the ability to land closeto the intended target) may also be importantto fitnessrdquo (Kuo et al 2011)

Table 6 provides an overview of the annotationresults Filtering is found to increase the informa-tiveness of both OIE extractions (χ2=639 plt025)and sentences (χ2=1175 plt01) The percentageof informative OIE extractions increases by 575and of the percentage of informative sentences by

1497

Filtering Nr Informative

OIE extractions before 401588 2925after 140357 3500

Sentences before 170551 3650after 67460 4475

Table 6 Total number of OIE extractions before andafter filtering as well as the sentences that these ex-tractions were found in The informative denotes thepercentage of extractions and sentences annotated as in-formative by a domain expert based on 400 randomlysampled instances from each group (95 confidenceinterval margin of error 5)

825 A second domain expert annotated 25 ofeach set (400 total) the inter-annotator agreementCohen k was found to be 084

54 Results

Manual inspection of the retained OIE extractionsshows that many relevant extractions are retainedeg see Table 7 These extractions are useful to areader in determining whether a document is worthreading in full and can be used to identify informa-tive sections in a text The presented approach toSORE shows promising results wrt automaticallyfiltering out a large proportion of irrelevant incor-rect or uninformative OIE extractions Consider-ing the poor quality of OIE extractions howeverwe propose presenting a reader with the sentencesthat entail the filtered OIE extractions Further-more SORE provides a method to collect data forannotation and training of scientific OIE systems

6 Conclusions

We introduce the task of Semi-Open Relation Ex-traction (SORE) on scientific texts and the FocusedOpen Biological Information Extraction (FOBIE)dataset We adapt off-the-shelf IE systems to showthat SORE is feasible and that our approach isworth improving upon ndash both in terms of perfor-mance as well as reducing the systemrsquos complexityA strong scientific IE system is used as a baselineand its output is used to filter the relations foundby a state-of-the-art OIE system

OIE from scientific text is a hard task The largenumber of errors that we find in OIE extractionsfrom scientific texts render them near-useless todownstream computing tasks A human reader maynevertheless find many incorrect extractions infor-mative An issue for humans is the sheer amountof OIE extractions and the high proportion of unin-formative extractions We show that our approach

TRADE-OFF relationsTrade-off arguments Argument modifierssleepcognitive abilitiesenergy conservationmemory retention (the keeping of memory

over prolonged periods oftime)

memory consolidation (in bats)(without a food reward)(shift from short- to long-term memory)(using torpor)

Examples filtered OIE extractions(A memory is normally formed after repeated learn-ing events sleep enhances this process)(learning is associated with a food reward)(Sleep deprivation has negative effects on both mem-ory consolidation)(torpor has a negative influence on memory consoli-dation)(digestion prevents the bats from falling into torporquickly)(torpor indeed affects learning abilities)

Table 7 SORE extractions from a scientific biologytext (Ruczy ski et al 2014) The TRADE-OFF relationsare extracted by a narrow IE system trained on FOBIEThese relations capture the central theme and conceptsof the text and are used to filter the extractions that anOIE system outputs for the same document The result-ing extractions can support discerning the relevance ofscientific documents

to SORE reduces the number of OIE extractions by65 while increasing the relative amount of infor-mative extractions by 575 As a result SOREimproves the ability for a reader to quickly skimthrough the remaining extractions or sentences thatthey are sourced from and analyze how central con-cepts are related in a scientific text

The presented approach is currently limited tothe domain of biology and the use of trade-off re-lations but we expect that central relations can beidentified for other scientific domains that enableSORE We show that creating a dataset for narrowRE can be done relatively cheaply by re-annotatingthe output of a simple RBS Similarly SORE mayaid the collection of a dataset for scientific OIE

Acknowledgments

The authors would like to gratefully acknowledgethe financial support of the Engineering and Phys-ical Sciences Research Council (EPSRC) Centrefor Doctoral Training in Embedded Intelligence un-der grant reference EPL0149981 and the EPSRCInnovation Placement fund We also thank BenTrevett for proof-reading the document

1498

ReferencesDominique Adriaens 2019 Evomimetics the

biomimetic design thinking 20 In Bioinspira-tion Biomimetics and Bioreplication IX volume1096509 page 8 SPIE

Anurag A Agrawal Jeffrey K Conner and Sergio Ras-mann 2010 Tradeoff s and Negative Correlationsin Evolutionary Ecology Why Are We Interested inTradeoffs In Evolution since Darwin the first 150years chapter 10 pages 243ndash268

Brian S Alper Jason A Hand Susan G Elliott ScottKinkade Michael J Hauan Daniel K Onion andBernard M Sklar 2004 How much effort is neededto keep up with the literature relevant for primarycare Journal of the Medical Library Association JMLA 92(4)429ndash37

Waleed Ammar Matthew Peters Chandra Bhagavat-ula and Russell Power 2018 The AI2 sys-tem at SemEval-2017 Task 10 (ScienceIE) semi-supervised end-to-end entity and relation extractionpages 592ndash596

Isabelle Augenstein Mrinal Das Sebastian RiedelLakshmi Vikraman and Andrew McCallum 2017SemEval 2017 Task 10 ScienceIE - ExtractingKeyphrases and Relations from Scientific Publica-tions In Proceedings of the 11th InternationalWorkshop on Semantic Evaluation (SemEval-2017)pages 546ndash555 Stroudsburg PA USA Associationfor Computational Linguistics

Michele Banko M Cafarella Stephen Soderland M JBroadhead and Oren Etzioni 2007 Open informa-tion extraction from the web Proceedings of the20th International Joint Conference on Artificial In-telligence pages 2670ndash2676

Robert Bossy Louise Deleger Estelle ChaixMouhamadou Ba and Claire Nedellec 2019 Bacte-ria Biotope at BioNLP Open Shared Tasks 2019 InProceedings of The 5th Workshop on BioNLP OpenShared Tasks pages 121ndash131 Stroudsburg PAUSA Association for Computational Linguistics

Razvan C Bunescu and Raymond J Mooney 2005A shortest path dependency kernel for relation ex-traction Proceedings of the Human LanguageTechnology Conference and Conference on Em-pirical Methods in Natural Language Processing(October)724ndash731

Stephen S O Burgess Jarmila Pittermann and Todd EDawson 2006 Hydraulic efficiency and safety ofbranch xylem increases with height in Sequoia sem-pervirens (D Don) crowns Plant Cell and Environ-ment 29(2)229ndash239

Gemma Carr Daniel P Loucks and Gunter Bloschl2018 Gaining insight into interdisciplinary researchand education programmes A framework for evalu-ation Research Policy 47(1)35ndash48

Estelle Chaix Bertrand Dubreucq Abdelhak FatihiDialekti Valsamou Robert Bossy MouhamadouBa Louise Deleger Pierre Zweigenbaum PhilippeBessieres Loıc Lepiniec and Claire Nedellec 2016Overview of the Regulatory Network of Plant SeedDevelopment (SeeDev) Task at the BioNLP SharedTask 2016 pages 1ndash11

Yubo Chen Liheng Xu Kang Liu Daojian Zeng andJun Zhao 2015 Event Extraction via DynamicMulti-Pooling Convolutional Neural Networks InProceedings of the 53rd Annual Meeting of the Asso-ciation for Computational Linguistics and the 7th In-ternational Joint Conference on Natural LanguageProcessing pages 167ndash176

Janara Christensen Mausam Stephen Soderland andOren Etzioni 2011 An analysis of open informa-tion extraction based on semantic role labeling InProceedings of the sixth international conference onKnowledge capture - K-CAP rsquo11 page 113

K Bretonnel Cohen Karin Verspoor Karen FortChristopher Funk Michael Bada Martha Palmerand Lawrence E Hunter 2017 The ColoradoRichly Annotated Full Text (CRAFT) CorpusMulti-Model Annotation in the Biomedical DomainIn Handbook of Linguistic Annotation pages 1379ndash1394 Springer Netherlands

Daniel Cook Babatunde A Ogunnaike and Ra-janikanth Vadigepalli 2015 Systems analysis ofnon-parenchymal cell modulation of liver repairacross multiple regeneration modes BMC SystemsBiology 9(1)71

Louise Deleger Robert Bossy Estelle ChaixMouhamadou Ba Arnaud Ferre Philippe Bessieresand Claire Nedellec 2016 Overview of the Bac-teria Biotope Task at BioNLP Shared Task 2016pages 12ndash22

Khalid El-Arini and Carlos Guestrin 2011 Beyondkeyword search In Proceedings of the 17th ACMSIGKDD international conference on Knowledgediscovery and data mining - KDD rsquo11 page 439New York New York USA ACM Press

Oren Etzioni Michael Cafarella Doug DowneyAna Maria Popescu Tal Shaked Stephen SoderlandDaniel S Weld and Alexander Yates 2005 Un-supervised named-entity extraction from the WebAn experimental study Artificial Intelligence165(1)91ndash134

Anthony Fader Stephen Soderland and Oren Etzioni2011 Identifying relations for Open InformationExtraction In EMNLP 2011 - Conference on Empir-ical Methods in Natural Language Processing Pro-ceedings of the Conference pages 1535ndash1545

Kata Gabor Davide Buscaldi Anne-Kathrin Schu-mann Behrang Qasemizadeh Hafa Zargayounaand Thierry Charnois 2018 SemEval-2018 Task 7Semantic Relation Extraction and Classification inScientific Papers In SemEval pages 679ndash688

1499

Luis Galarraga Geremy Heitz Kevin Murphy andFabian M Suchanek 2014 Canonicalizing OpenKnowledge Bases In Proceedings of the 23rd ACMInternational Conference on Conference on Informa-tion and Knowledge Management - CIKM rsquo14 pages1679ndash1688

Theodore Garland 2014 Trade-offs Current Biology24(2)R60ndashR61

Paul Groth Michael Lauruhn Antony Scerri and RonDaniel 2018 Open Information Extraction on Sci-entific Text An Evaluation In Proceedings ofthe 27th International Conference on ComputationalLinguistics page 3414ndash3423 Association for Com-putational Linguistics

Luheng He Kenton Lee Omer Levy and Luke Zettle-moyer 2018 Jointly Predicting Predicates and Ar-guments in Neural Semantic Role Labeling pages364ndash369

Lynette Hirschman Alexander Yeh ChristianBlaschke and Alfonso Valencia 2005 Overview ofBioCreAtIvE critical assessment of information ex-traction for biology BMC Bioinformatics 6(Suppl1)S1

Rezarta Islamaj Dogan Sun Kim Andrew Chatr-aryamontri Chih-Hsuan Wei Donald C ComeauRui Antunes Sergio Matos Qingyu Chen AparnaElangovan Nagesh C Panyam Karin VerspoorHongfang Liu Yanshan Wang Zhuang Liu BernaAltınel Zehra Melce Husunbeyi Arzucan OzgurAris Fergadis Chen-Kai Wang Hong-Jie Dai TungTran Ramakanth Kavuluru Ling Luo Albert SteppiJinfeng Zhang Jinchan Qu and Zhiyong Lu 2019Overview of the BioCreative VI Precision MedicineTrack mining protein interactions and mutations forprecision medicine Database 2019

Robin Jia Cliff Wong and Hoifung Poon 2019Document-Level N-ary Relation Extraction withMultiscale Representation Learning In NAACLHLT pages 3693ndash3704

Xiaotian Jiang Quan Wang Peng Li and Bin Wang2016 Relation Extraction with Multi-instanceMulti-label Convolutional Neural Networks Coling(89)1471ndash1480

J D Kim T Ohta Y Tateisi and J Tsujii 2003GENIA corpus - A semantically annotated corpusfor bio-textmining In Bioinformatics volume 19pages 180ndash182

Jin-Dong Kim Tomoko Ohta Sampo Pyysalo Yoshi-nobu Kano and Junrsquoichi Tsujii 2009 Overview ofBioNLPrsquo09 shared task on event extraction In Pro-ceedings of the Workshop on BioNLP Shared Task -BioNLP rsquo09 page 1 Morristown NJ USA Associ-ation for Computational Linguistics

Olga Kononova Haoyan Huo Tanjin He Wenhao SunZiqin Rong Tiago Botari Vahe Tshitoyan and Ger-brand Ceder 2019 Text-mined dataset of inorganicmaterials synthesis recipes Sci Data 6(203)

Martin Krallinger Obdulia Rabal Saber A AkhondiMartın Perez Perez Jesus Santamarıa GaelPerez Rodrıguez Georgios Tsatsaronis Ander Intx-aurrondo Jose Antonio Lopez Umesh Nandal ErinVan Buel Akileshwari Chandrasekhar Marleen Ro-denburg Astrid Laegreid Marius Doornenbal JulenOyarzabal Analia Lourenco and Alfonso Valencia2017 Overview of the BioCreative VI chemical-protein interaction Track In Proceedings of BioCre-ative VI workshop pages 141ndash146

Ruben Kruiper Jessica Chen-Burger and Marc P YDesmulliez 2016 Computer-aided biomimeticsIn Lecture Notes in Computer Science (includingsubseries Lecture Notes in Artificial Intelligenceand Lecture Notes in Bioinformatics) volume 9793pages 131ndash143

Ruben Kruiper Julian Vincent Eitan Abraham Ru-pert Soar Ioannis Konstas Jessica Chen-Burger andMarc Desmulliez 2018 Towards a Design Pro-cess for Computer-Aided Biomimetics Biomimet-ics 3(3)14

Chi Yun Kuo Gary B Gillis and Duncan J Irschick2011 Loading effects on jump performance ingreen anole lizards Anolis carolinensis Journal ofExperimental Biology 214(12)2073ndash2079

Kenton Lee Luheng He Mike Lewis and Luke Zettle-moyer 2017 End-to-end Neural Coreference Res-olution In Proceedings of the 2017 Conference onEmpirical Methods in Natural Language Processingpages 188ndash197 Stroudsburg PA USA Associationfor Computational Linguistics

Yankai Lin Shiqi Shen Zhiyuan Liu Huanbo Luanand Maosong Sun 2016 Neural Relation Extrac-tion with Selective Attention over Instances In Pro-ceedings of the 54th Annual Meeting of the Associa-tion for Computational Linguistics (Volume 1 LongPapers) pages 2124ndash2133

Chunyang Liu Wenbo Sun Wenhan Chao and Wanxi-ang Che 2013 Convolution neural network for rela-tion extraction Lecture Notes in Computer Science(including subseries Lecture Notes in Artificial Intel-ligence and Lecture Notes in Bioinformatics) 8347LNAI(PART 2)231ndash242

Yi Luan Luheng He Mari Ostendorf and HannanehHajishirzi 2018a Multi-Task Identification ofEntities Relations and Coreference for ScientificKnowledge Graph Construction In Proceedingsof the 2018 Conference on Empirical Methods inNatural Language Processing pages 3219ndash3232Stroudsburg PA USA Association for Computa-tional Linguistics

Yi Luan Mari Ostendorf and Hannaneh Hajishirzi2018b The UWNLP system at SemEval-2018 Task7 Neural Relation Extraction Model with Selec-tively Incorporated Concept Embeddings In Pro-ceedings of The 12th International Workshop on Se-mantic Evaluation pages 788ndash792 Stroudsburg PAUSA Association for Computational Linguistics

1500

Mausam 2016 Open Information Extraction Systemsand Downstream Applications In Proceedings ofthe 25th International Joint Conference on ArtificialIntelligence pages 4074ndash4077

Makoto Miwa and Mohit Bansal 2016 End-to-EndRelation Extraction using LSTMs on Sequences andTree Structures In Proceedings of the 54th AnnualMeeting of the Association for Computational Lin-guistics (Volume 1 Long Papers) pages 1105ndash1116Stroudsburg PA USA Association for Computa-tional Linguistics

Claire Nedellec Robert Bossy J-D Kim J J Kim andTomoko Ohta 2013 Overview of BioNLP sharedtask 2013 In BioNLP Shared Task 2013 Workshoppages 1ndash7

Thien Huu Nguyen and Ralph Grishman 2015 Re-lation Extraction Perspective from ConvolutionalNeural Networks In NAACL-HLT pages 39ndash48

Christina Niklaus Matthias Cetto Andre Freitas andSiegfried Handschuh 2018 A Survey on Open In-formation Extraction In Proceedings of the 27th In-ternational Conference on Computational Linguis-tics pages 3866ndash3878 Association for Computa-tional Linguistics

Elisabeth Pain 2016 How to keep up with the scien-tific literature Science

Harinder Pal and Mausam 2016 Demonyms andCompound Relational Nouns in Nominal Open IEIn Proceedings of the 5th Workshop on AutomatedKnowledge Base Construction page 35ndash39

Nanyun Peng Hoifung Poon Chris Quirk KristinaToutanova and Wen-tau Yih 2017 Cross-SentenceN -ary Relation Extraction with Graph LSTMsTransactions of the Association for ComputationalLinguistics 5101ndash115

Jeffrey Pennington Richard Socher and ChristopherManning 2014 Glove Global Vectors for WordRepresentation In Proceedings of the 2014 Con-ference on Empirical Methods in Natural LanguageProcessing (EMNLP) pages 1532ndash1543 Strouds-burg PA USA Association for Computational Lin-guistics

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep Contextualized Word Rep-resentations In Proceedings of the 2018 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long Papers) volume 19pages 2227ndash2237 Stroudsburg PA USA Associa-tion for Computational Linguistics

I Ruczy ski T M A Clarin and B M Siemers 2014Do greater mouse-eared bats experience a trade-offbetween energy conservation and learning Journalof Experimental Biology 217(22)4043ndash4048

Swarnadeep Saha and Mausam Open Information Ex-traction from Conjunctive Sentences In To be pub-lished in COLING 2018

Swarnadeep Saha Harinder Pal and Mausam 2017Bootstrapping for Numerical Open IE In Proceed-ings of the 55th Annual Meeting of the Associationfor Computational Linguistics (Volume 2 Short Pa-pers) pages 317ndash323

Cicero Nogueira dos Santos Bing Xiang and BowenZhou 2015 Classifying Relations by Ranking withConvolutional Neural Networks pages 626ndash634

Rudolf Schneider Tom Oberhauser Tobias Klatt Fe-lix A Gers and Alexander Loser 2017 AnalysingErrors of Open Information Extraction Systems InProceedings of the First Workshop on Building Lin-guistically Generalizable NLP Systems pages 11ndash18

Pontus Stenetorp Sampo Pyysalo Goran TopicTomoko Ohta Sophia Ananiadou and Junrsquoichi Tsu-jii 2012 BRAT a Web-based Tool for NLP-Assisted Text Annotation In 13th Conference of theEuropean Chapter of the ACL Figure 1 pages 102ndash107

SS Vattam and AK Goel 2013 An Information For-aging Model of Interactive Analogical Retrieval InProceedings of the Annual Meeting of the CognitiveScience Society pages 3651ndash3656

Julian F V Vincent 2014 An Ontology of Biomimet-ics In Ashok K Goel Daniel A McAdams andRobert B Stone editors Biologically Inspired De-sign pages 269ndash285 Springer London London

Julian F V Vincent 2016 The trade-off a central con-cept for biomimetics Bioinspired Biomimetic andNanobiomaterials 6(2)67ndash76

Jonas Warringer and Anders Blomberg 2006 Evolu-tionary constraints on yeast protein size BMC Evo-lutionary Biology 6(1)61

Dian Yu Lifu Huang and Heng Ji 2017 Open Re-lation Extraction and Grounding Proceedings ofthe Eighth International Joint Conference on Natu-ral Language Processing (Volume 1 Long Papers)1854ndash864

Daojian Zeng Kang Liu Yubo Chen and Jun Zhao2015 Distant Supervision for Relation Extractionvia Piecewise Convolutional Neural Networks InEMNLP pages 1753ndash1762

Daojian Zeng Kang Liu Siwei Lai Guangyou Zhouand Jun Zhao 2014 Relation Classification viaConvolutional Deep Neural Network In COLINGpages 2335ndash2344

Deyu Zhou Dayou Zhong and Yulan He 2014Biomedical Relation Extraction From Binary toComplex Computational and Mathematical Meth-ods in Medicine pages 1ndash18

1490

functional demands that are traded off are usuallyabstract and domain-independent terms such aslsquosafetyrsquo and lsquoefficiencyrsquo in Figure 1 A gap remainsin quickly comprehending the central informationin a text eg the biological mechanisms that areused to manipulate a trade-off

Information Extraction (IE) and specifically Re-lation Extraction (RE) can improve the access tocentral information for downstream tasks (Santoset al 2015 Zeng et al 2014 Jiang et al 2016Miwa and Bansal 2016 Luan et al 2018a) How-ever the focus of current RE systems and datasetsis either too narrow ie a handful of semantic rela-tions such as lsquoUSED-FORrsquo and lsquoSYNONYMYrsquo ortoo broad ie an unbounded number of generic re-lations extracted from large heterogeneous corpora(Niklaus et al 2018) referred to as Open IE (OIE)(Etzioni et al 2005 Banko et al 2007) Narrowapproaches to IE from scientific text (Augensteinet al 2017 Gabor et al 2018 Luan et al 2018a)cover only a fraction of the information capturedin a paper ndash usually what is within an abstract Ithas been shown that scientific texts contain manyunique relation types and therefore it is not fea-sible to create separate narrow IE classifiers forthese (Groth et al 2018) On the other hand OIEsystems are primarily developed for the Web andnews-wire domain and have been shown to performpoorly on scientific texts What laymen really needis a bit of both the accuracy of narrow RE systemsto extract central relations from scientific texts andthe flexibility of an OIE system to capture a muchlarger fraction of the possible relations expressedin scientific texts

This work aims to enable rapid comprehensionof a large scientific document by identifying a) thecentral concepts in a text and b) the most signif-icant relations that govern these central conceptsTo this end we introduce the task of Semi-Open Re-lation Extraction (SORE) Figure 1 illustrates theSORE process First we find the central conceptslsquosafetyrsquo and lsquoefficiencyrsquo involved in a TRADE-OFF

relation Then by using the argument conceptsof the relation as anchor points we can explorefurther concepts and relations eg lsquoxylemrsquo in Fig-ure 1 Uncovering these relations can elucidatethe meaning of unfamiliar concepts to a layper-son (Mausam 2016) The SORE approach is hy-pothesized to reduce the number of uninformativeextractions without limiting RE to a finite set ofrelations which could generally benefit IE from sci-

entific articles eg materials discovery (Kononovaet al 2019) and drug-gene-mutation interactions(Jia et al 2019)

To address SORE we create the FocusedOpen Biological Information Extraction (FOBIE)dataset FOBIE includes manually-annotated sen-tences that express explicit trade-offs or syntac-tically similar relations that capture the centralconcepts in full-text biology papers We train aspan-based RE model used in a strong scientific IEsystem (Luan et al 2018a) to jointly extract theserelation structures We explore SORE and use theoutput of our model to filter the output of an OIEsystem (Saha and Mausam Saha et al 2017 Paland Mausam 2016 Christensen et al 2011) ona corpus of biology papers Qualitative analysesshow that the output of a narrow RE model canspeed up expert analysis of trade-offs in biologicaltexts and be used to filter out both erroneous anduninformative OIE extractions

2 Related work

21 Open Information Extraction

OIE systems use a set of handcrafted or learnedextraction rules and rely on dependency featuresto extract open-domain relational tuples from text(Yu et al 2017 Niklaus et al 2018) As OIE sys-tems rely on syntactic features they require littlefine-tuning when applied to different domains andthe extraction rules work for a variety of relationtypes (Mausam 2016) These properties can be es-pecially useful on scientific texts where additionalknowledge on unknown concepts can ease the tex-tual comprehension for non-experts Consider theexample OIE extractions for lsquoxylemrsquo in the top partof Figure 1

Existing OIE systems have been shown to per-form significantly worse on the longer and morecomplex sentences found in scientific texts than onWikipedia texts (Groth et al 2018) Common is-sues of OIE systems on Web News and Wikipediatexts include the correct identification of the bound-aries of an argument handling latent n-ary rela-tions difficulty handling negations and generatinguninformative extractions (Schneider et al 2017)Groth et al (2018) evaluate the output of two state-of-the-art OIE systems based on correctness ratherthan eg the number of missed extractions Theynote that the crux of the IE challenge is that extrac-tions reflect the consequence of the sentence As anexample of an uninformative extraction Fader et al

1491













1492




32 Annotation



3httpsspacyio





41 Task definition


1493


Cleartrade-offs

[]energystorage




42 Baseline system






1494

Pred amp

Gold

Gold

Pred

Gold

Pred








1495







51 Task description










1496


















1497






54 Results


6 Conclusions










Acknowledgments


1498






















1499





















1500

























1491













1492




32 Annotation



3httpsspacyio





41 Task definition


1493


Cleartrade-offs

[]energystorage




42 Baseline system






1494

Pred amp

Gold

Gold

Pred

Gold

Pred








1495







51 Task description










1496


















1497






54 Results


6 Conclusions










Acknowledgments


1498






















1499





















1500

























1492




32 Annotation



3httpsspacyio





41 Task definition


1493


Cleartrade-offs

[]energystorage




42 Baseline system






1494

Pred amp

Gold

Gold

Pred

Gold

Pred








1495







51 Task description










1496


















1497






54 Results


6 Conclusions










Acknowledgments


1498






















1499





















1500

























1493


Cleartrade-offs

[]energystorage




42 Baseline system






1494

Pred amp

Gold

Gold

Pred

Gold

Pred








1495







51 Task description










1496


















1497






54 Results


6 Conclusions










Acknowledgments


1498






















1499





















1500

























1494

Pred amp

Gold

Gold

Pred

Gold

Pred








1495







51 Task description










1496


















1497






54 Results


6 Conclusions










Acknowledgments


1498






















1499





















1500

























1495







51 Task description










1496


















1497






54 Results


6 Conclusions










Acknowledgments


1498






















1499





















1500

























1496


















1497






54 Results


6 Conclusions










Acknowledgments


1498






















1499





















1500

























1497






54 Results


6 Conclusions










Acknowledgments


1498






















1499





















1500

























1498






















1499





















1500

























1499





















1500

























1500