biocreative vi track 3 bel track · 2018-11-26 · terms training test 2015 test 2017 p() 19.918...
TRANSCRIPT
© Fraunhofer SCAI
BioCreative VI Track 3 – BEL TrackExtraction of Causal Network Information in Biological Expression Language (BEL)
Overview & Results
by Sumit Madan
Sumit Madan,
Jens Dörpinghaus,
Juliane Fluck
Bioinformatics Department
Fraunhofer SCAI
Justyna Szostak
Julia Hoeng
Philip Morris International R&D
© Fraunhofer SCAI
The Research Cycle to Create BEL Networks
Knowledge
Search
Query Manual
Curation
Literature
Results
Curated Data
Ontologies /
Databases /
Models
Public / Private
Further
Enrichment
Other
Sources Specific
GraphExtract
Analysis
Service
Semantic
Search Engine
© Fraunhofer SCAI
Research Cycle: Usage of Internally and Externally Developed Tools
Knowledge
Search
Query Manual
Curation
Corpus
Curated Data
Ontologies /
Databases /
Models
Public / Private
Further
Enrichment
Other
SourcesSpecific
NetworkExtract
Semantic
Search Engine
SCAIViewBELIEF
OLS & OMS
PyBEL
NeuroMMSIG
Aetionomy KB
with Graph Algorithms,
Experimental Data etc.
Analysis
Service
From
EBI
© Fraunhofer SCAI BEL Track
Text to BEL Statements to Networks
Textual Knowledge
Extension of BEL
Networks through
Text Mining extracted
Information
© Fraunhofer SCAI BEL Track
Adding Complexity to Text Mining
Bringing real world settings to text mining
Encoding of BEL using biological background
knowledge (interpretation added by the curators)
Multiple sentences / Tables as supporting text
evidence for relationships
Only semantic information available
No explicit annotation of entities and relationships
Use the data available in form of BEL networks
Composed of BEL Nanopubs with Relationships
(Statements), context information, supported text
evidence, citation reference, and meta data
© Fraunhofer SCAI BEL Track
INTRODUCTION TO BEL AND THE TASKSBioCreative VI Track 3 – BEL Track
© Fraunhofer SCAI BEL Track
BEL Statement can also have a Context described through Annotations
BEL Statementp(HGNC:F2R) -> r(HGNC:IL6)
Supporting text (evidence)
Thrombin receptor mediated signals induce expressions of interleukin 6 and granulocyte colony
stimulating factor via NF-kappa B activation in synovial fibroblasts.
AnatomyDisease
Cell line
Cell typefibroblast
CitationPMID:10343541
OthersOthersOthersOthersOthers
Mandatory annotations
Optional annotations
© Fraunhofer SCAI BEL Track
Task 1: Given textual evidence for a BEL statement, generate the
corresponding BEL statements.
BEL Statementp(HGNC:F2R) -> r(HGNC:IL6)
Supporting text (evidence)
Thrombin receptor mediated signals induce expressions of interleukin 6 and granulocyte colony
stimulating factor via NF-kappa B activation in synovial fibroblasts.
CitationPMID:10343541
Given
Generate
© Fraunhofer SCAI BEL Track
Task 2: Given a BEL statement, provide at most 10 additional evidence
sentences.
BEL Statementp(HGNC:F2R) -> r(HGNC:IL6)
Supporting text (evidence)
Thrombin receptor mediated signals induce expressions of interleukin 6 and granulocyte colony
stimulating factor via NF-kappa B activation in synovial fibroblasts.
CitationPMID:10343541
Given
Find
© Fraunhofer SCAI BEL Track
High Complexity of the Track
Recognition tasks on several levels:
Named entity recognition (NER): 4 classes (Disease, Gene/Protein, Biological Process, Chemical)
Entity normalization: to defined namespaces (MeSH, HGNC/MGI/EntrezGene, GO, ChEBI)
Function extraction: Extraction of biological functions (in combination with the named entities)
Relationship extraction: relate entities and functions
Connect information
Convert to BEL
cat( )p( HGNC:FAS ) increases p( HGNC:RB1, )pmod(P)
Protein ModificationBEL Statement
Relation
Protein Abundance
Function Namespace
Image from: Rinaldi, F., Ellendorff, T. R., Madan, S., et al. (2016)
BioCreative V track 4: a shared task for the extraction of causal network
information using the Biological Expression Language, Database, 2016.
© Fraunhofer SCAI BEL Track
Term Level: Describes Name Spaces and Biological Entities
Namespace Function Long form Function
Short form
BEL Term Example
HGNC geneAbundance(),
rnaAbundance(),
microRNAAbundance(),
proteinAbundance()
g(),
r(),
m(),
p()
p(HGNC:MAPK14)
MGI Similar to HGNC Similar to HGNC p(MGI:Mapk14)
EGID Similar to HGNC Similar to HGNC p(EGID:1432)
GOBP biologicalProcess() bp() bp(GOBP:”cell proliferation”)
MESHD pathology() path() path(MESHD:Hyperoxia)
CHEBI abundance() a() a(CHEBI: lipopolysaccharide)
© Fraunhofer SCAI BEL Track
Function Level: Describes the Biological Function on Term
Function Function Type Example
complex()
complexAbundance()Abundances complex(p(MGI:Itga8),p(MGI:Itgb1))
pmod(P)
proteinModification()Modifications p(MGI:Cav1,pmod(P))
deg()
degradation()Transformations deg(a(CHEBI:”hyaluronic acid”))
tloc()
translocation()Transformations a(CHEBI:”brefeldin A”)
act()
molecularActivity()Activities act(p(MGI:Prkd1))
© Fraunhofer SCAI BEL Track
Relationship level: Relates two Biological Entitites or Functions
Relationship Example
long form short form
decreases -| a(CHEBI:sorbinil) -| bp(GOBP:"DNA replication")
directlyDecreases =| p(HGNC:TIMP1) =| act(p(HGNC:MMP9))
increases -> p(MGI:Hsd11b1) -> path(MESHD:Obesity)
directlyIncreases => p(HGNC:VEGFA) => act(p(HGNC:KDR))
© Fraunhofer SCAI BEL Track
OVERVIEW OF DATA SETS/RESOURCESBioCreative VI Track 3 – BEL Track
© Fraunhofer SCAI BEL Track
Preparation of Training, Sample, Test 2015 and Test 2017 data sets
Training set data from SVB Improver
Network Challenge
Several automatic removal steps such as:
All sentences with statements from additional name
spaces/functions and relationship types
Statement not associated with a PubMed Citation
Test 2015 and Test 2017 data through
manual curation
All sentences and BEL statements are
associated with PMID
Fluck, J., Madan, S., Ansari, S., et al. (2016) Training and evaluation corpora for the
extraction of causal relationships encoded in biological expression language (BEL).,
Database (Oxford)., 2016.
Curation through BELIEF platform
BEL statement
Context annotation
Add/edit/delete/export
Document PubMed
metadata
BEL namespace search
Detected concepts:- Mouse over highlights text
- Provides a fast overview of all
entities in the evidence
Document
Curation View(visualizes text mining
results and also
integrates several
semantic search tools)
© Fraunhofer SCAI BEL Track
Distribution of Evidences and Statements in data sets
Training data set: 11.066 BEL statements with 6353 unique evidences
(Unfiltered set of publicly available data: 29.484 BEL statements with 18.224 unique text excerpts)
Test set 2015 - Task 1: 105 sentence and 205 statements
Test set 2017 – Task 1: 116 sentence and 203 statements (disease context of Ulcerative Colitis)
Test set 2015 – Task 2: 96 statements with at least one single PubMed/PMC evidence sentence
Test set 2017 – Task 2: 98 statements with at least one single PubMed/PMC evidence sentence
© Fraunhofer SCAI BEL Track
Distribution of Instances in Several Data Sets
Terms Training Test 2015 Test 2017
p() 19.918 346 328
a() 1.927 37 52
bp() 877 31 23
path() 244 15 2
Functions Training Test 2015 Test 2017
act() 6.332 36 79
pmod() 1.411 9 36
complex() 750 15 5
tloc() 406 13 10
deg() 205 6 4
Relations Training Test 2015 Test 2017
increases 8.112 155 130
decreases 2.956 53 68
© Fraunhofer SCAI BEL Track
Example of Training Instance provided to the Participants
Training.sentence entry:
SEN:10000032 PMID:10075927 Fas stimulation of Jurkat cells is known to induce p38 kinase
and we find a pronounced increase in Rb phosphorylation
within 30 min of Fas stimulation
Training.BEL entry:
SEN:10000032 cat(p(HGNC:FAS)) increases kin(p(HGNC:MAPK14)) BEL:20029764
SEN:10000032 cat(p(HGNC:FAS)) increases p(HGNC:RB1,pmod(P)) BEL:20006082
SEN:10000032 cat(p(HGNC:FAS)) decreases tscript(p(HGNC:RB1)) BEL:20029794*
SEN:10000032 p(HGNC:RB1,pmod(P)) directlyDecreases tscript(p(HGNC:RB1)) BEL:20011414*
* Statements BEL:20029794 and BEL:20011414 can only be inferred from background knowledge or from other
sentences of the same publication.
© Fraunhofer SCAI BEL Track
ORGANIZATION OF BEL TRACKBioCreative VI Track 3 – BEL Track
© Fraunhofer SCAI BEL Track
Timeline of the BEL Track
Dec - Feb 2017: preparation and approval of proposal for BEL track
Feb 2017: official start of the track
Feb 2017: evaluation platform and supporting data announcement
Feb 2017: restructuring of information pages
Mar 2017 – Jul 2017: annotation of new test set
July 24-27 2017: release of test set and official evaluation
Aug 2017: evaluation of submissions, notification of task 1 results
Sep 2017: curation of task 2 submissions, notification of task 2 results
Sep 2017: submission of participant papers, reviews, analysis of results
Oct 2017: submission of overview paper, preparation of proceeding and
workshop
Track Approval Stage
Announcement & Track
Preparation Stage
System Evaluation
Stage
Proceeding & Workshop
Preparation Stage
© Fraunhofer SCAI BEL Track
Documentation
Track announcement and introduction
http://biocreative.org/
Extended description
Setup of the task
Training, Test 2015 and Test 2017 data
Evaluation details
https://wiki.openbel.org/display/BIOC/BioCreative+BEL+Task+Challenges
© Fraunhofer SCAI BEL Track
Communication Channels
Track announcement on several mailinglists
Track events and information communicated to
BioCreative-Participants mailinglist: [email protected]
Google Group „BELTASK“: https://groups.google.com/forum/#!forum/beltask
Wiki of OpenBEL
Further channels
Organisers
© Fraunhofer SCAI BEL Track
EVALUATION & RESULTSBioCreative VI Track 3 – BEL Track
© Fraunhofer SCAI BEL Track
Evaluation Platform for Task 1 (With the Confusion Matrix)
http://bio-eval.scai.fraunhofer.de/cgi-bin/General_server.rc
© Fraunhofer SCAI BEL Track
Example Evaluation of Task1
Sent.-Id:10004582;PMID:15909112
Sentence: In the present study, we found that transgenic mice overexpressing
wild-type human APP gene (hAPP/+) displayed a much higher expression of FAS,
one of the death receptor subfamily.
Gold standard BEL statements
p(HGNC:APP) -> p(HGNC:FAS)
Predicted BEL statements
act(p(HGNC:APP)) -> bp(GOBP:"gene expression")
act(p(HGNC:APP)) -> act(p(HGNC:FAS))
Sentence based evaluation
Class | TP | FP | FN | Recall | Prec. | F-score
--------------| -- | -- | -- | ------ | -------| -------
Term (T) | 2 | 1 | 0 | 100.00 | 66.67 | 80.00
Func-Sec (FS) | 0 | 1 | 0 | 0 | 0 | 0
Function (F) | 0 | 2 | 0 | 0 | 0 | 0
Rela-Sec (RS) | 1 | 0 | 0 | 100.00 | 100.00 | 100.00
Relation (R) | 1 | 1 | 0 | 100.00 | 50.00 | 66.67
Statement (S) | 0 | 2 | 1 | 0 | 0 | 0
© Fraunhofer SCAI BEL Track
Participants of BioCreative VI BEL Track 2017
12 registered teams
Five participant teams from: Taiwan, China, Germany, USA
Task 1 (4 teams participated):
1. Po-Ting Lai et al., NCU-IISR, Taiwan
2. Ravikumar Komandur Elayavilli et al., Mayo clinic, USA
3. Mehdi Ali et al., University of Bonn, Germany
4. Jiaxin Liu et al., Soochow University, China
Task 2 (1 team participated):
1. Majid Rastegar-Mojarad et al., Mayo clinic, USA
© Fraunhofer SCAI BEL Track
Task 1: Results for Stage 1
Terms Function Function Second. Relation Relation Second. Statement
Team
IdRun F P R F P R F P R F P R F P R F P R
411 r1 63.24 84.62 50.49 33.99 44.83 27.37 51.24 67.39 41.33 40.22 55.38 31.58 62.92 88.19 48.91 22.99 33.33 17.54
r2 57.75 81.93 44.59 31.08 43.4 24.21 38.67 69.05 38.67 36.78 53.33 28.07 57.73 86.84 43.23 20.71 31.82 15.35
r3 61.24 88.27 46.89 32.88 47.06 25.26 46.15 64.29 36 37.43 56.14 28.07 62.03 92.24 46.72 21.15 33.98 15.35
380 r1 50.88 76.82 38.03 6 60 3.16 7.5 60 4 16.77 31.71 11.4 45.14 80 31.44 7.38 15.71 4.82
r2 55.29 81.01 41.97 6.06 75 3.16 7.59 75 4 21.52 38.64 14.91 51.06 84 36.68 10.67 22.22 7.02
r3 67.83 72.22 63.93 20.17 50 12.63 31.25 71.43 20 24.69 28.25 21.93 62.25 70.95 55.46 10.44 12.9 8.77
379 r1 74.14 78.18 70.49 40.54 56.6 31.58 55.28 70.83 45.33 43.65 51.81 37.72 86.17 89.62 82.97 32.28 40.67 26.75
r2 72.89 78.71 67.87 40.29 63.64 29.47 54.39 79.49 41.33 43.77 52.12 37.72 86.71 93 81.22 32.45 41.22 26.75
390 r1 76.39 81.18 72.13 0 0 0 0 0 0 29.87 25.55 35.96 65.19 60.45 70.74 18.08 16.1 20.61
r2 76.39 81.18 72.13 0 0 0 0 0 0 28.92 24.19 35.96 65.23 59.29 72.49 17.88 15.53 21.05
Task 1, Stage 1: Prediction of BEL statements without entity information
411: Rule-based
380: Machine Translation
379: Rule-based
390: Deep Learning
F: Micro-averaged F-Score; P:Precision: R: Recall
© Fraunhofer SCAI BEL Track
Task 1: Results for Stage 2
Terms Function Function Second. Relation Relation Second. Statement
Team
IdRun F P R F P R F P R F P R F P R F P R
411 r1 83.93 99.11 72.79 36.36 47.46 29.47 46.77 59.18 38.67 57.22 73.29 46.93 83.33 98.8 72.05 31.30 46.15 23.68
r2 86.09 99.15 76.07 40.51 50.79 33.68 51.16 61.11 44 56.08 70.67 46.49 83.92 98.82 72.93 30.95 44.63 23.68
r3 85.45 99.13 75.08 39.24 49.21 32.63 50 60.38 42.67 57.6 73.47 47.37 83.63 98.81 72.49 31.79 46.61 24.12
380 r1 90.20 98.83 82.95 12.39 38.89 7.37 21.74 58.82 13.33 42.52 52.94 35.53 84.24 96.61 74.67 22.66 32 17.54
379 r1 87.65 90.56 84.92 51.75 77.08 38.95 57.63 79.07 45.33 66.83 76.7 59.21 92.06 95.75 88.65 49.2 63.01 40.35
r2 86.4 90.94 82.3 52.55 85.71 37.89 58.93 89.19 44 66.83 76.7 59.21 91.92 97.55 86.9 49.6 64.34 40.35
r3 87.41 90.81 84.26 51.75 77.08 38.95 57.63 79.07 45.33 66.83 76.7 59.21 91.99 96.63 87.77 49.2 63.01 40.35
390 r1 91.33 99.23 84.59 0 0 0 0 0 0 43.51 41.6 45.61 86.36 90.05 82.97 23.61 25 22.37
r2 88.36 99.18 79.67 0 0 0 0 0 0 42.47 44.29 40.79 83.41 91.19 76.86 24.06 28.07 21.05
r3 76.71 98.96 62.62 0 0 0 0 0 0 25.68 29.38 22.81 71.76 85.98 61.57 15.06 18.47 12.72
Task 1, Stage 2: Prediction of BEL statements with gold standard entities
411: Rule-based
380: Machine Translation
379: Rule-based
390: Deep Learning
F: Micro-averaged F-Score; P:Precision: R: Recall
© Fraunhofer SCAI BEL Track
Task 1: Comparison of Results (F-Scores) of 2015 and 2017
Levels Stage 1 Stage 2
2015 2017 2015 2017
Term 68.9% 76.4% 96.0% 91.3%
Function-
Secondary
54.6% 55.3% 56.5% 57.6%
Function 32.6% 40.5% 30.5% 52.5%
Relation-Secondary 72.7% 86.7% 86.4% 92.1%
Relation 49.2% 43.8% 65.1% 66.8%
BEL Statement 20.2% 32.4% 35,2% 49.6%
Stage 1: Prediciton
without any entitiy
information
Stage 2: Prediction with
gold standard entities
© Fraunhofer SCAI BEL Track
Task 2: Expert Evaluation & Results
Maximum 5 sentences in each run were evaluated (after agreement with the participant)
Inter-annotator agreement on 150 entries with two following evaluation criteria:
Fully supportive: Agreement of 93% (kappa statistic: 0.75)
Partially supportive: Agreement of 91% (kappa statistic: 0.79)
Results of the single participant:
Runs Criterion TP FP Precision MAP
Run 1Full 117 265 30.6% 59.6%
Partial 175 207 45.8% 77.5%
Run 2Full 121 261 31.7% 50.2%
Partial 192 190 50.3% 76.7%
TP: True Positive
FP: False Positive
MAP: Mean Average Precision
System: Rule- and traditional ML-based
© Fraunhofer SCAI
BioCreative VI Track 3 – BEL TrackExtraction of Causal Network Information in Biological Expression Language (BEL)
Overview & Results
by Sumit Madan
Sumit Madan,
Jens Dörpinghaus,
Juliane Fluck
Bioinformatics Department
Fraunhofer SCAI
Justyna Szostak
Julia Hoeng
Philip Morris International R&D