2017-07-19 john ismb2017 poster
The capacity of modern experimental methods to generate data about biological processes has surpassed the ability of existing informatics approaches to generate meaningful mechanistic explanations. Mechanistic systems biology models could potentially address this gap, but model construction remains a labor-intensive process requiring both biological knowledge and modeling expertise. As a result, modeling studies remain fairly small in scope and are disconnected from genome-scale research. For mechanistic models to attain the necessary scope, methods for the automated assembly and analysis of large models from available knowledge sources will be required. Here we describe the use of the Integrated Network and Dynamical Reasoning Assembler (INDRA) [1] to assemble mechanistic facts from databases and literature into a rule-based Kappa [2] model in order to explain observations in a previously published phosphoproteomic dataset [3]. Explanations were generated by identifying paths through the rule influence map between drug targets and measured protein nodes. The model yielded detailed, biochemically plausible explanations for 20 of the 22 largest effects (91%) and 95 of 135 (70%) smaller effects. Additional improvements in performance could also be made by supplying manually curated mechanistic information in the form of natural language.
Explanation of drug effects using a mechanistic model automatically assembled from natural language, databases, and literature
John A. Bachman1*, Benjamin M. Gyori1*, and Peter K. Sorger1
*These authors contributed equally to this work
1Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
INTRODUCTION
RESULTS
Availability: https://github.com/sorgerlab/indra
Funding: DARPA Big Mechanism program, ARO contract W911NF-14-1-0397
Phosphorylation(RAF, MEK)
Phosphorylation(BRAF, MAP2K1, 218)
Phosphorylation(BRAF, MAP2K1)
Phosphorylation(BRAF, MAP2K1, S, 218)
Phosphorylation(BRAF, MAP2K1, S, 222)
Phosphorylation(BRAF, MAP2K1, S)
Phosphorylation(BRAF, MAP2K1, 222)
Mechanisms are normalized into Statements
Correcting systematic errors in entity grounding
Applying mechanistic models to interpretation of data requires that the named entities extracted from text (genes, proteins, small molecules, biological processes, etc.) be appropriately “grounded” to identifiers in the relevant databases. This is often challenging due to overlapping synonyms among gene names and ambiguous acronyms. A key problem is that mechanisms described in literature often refer to protein families and named complexes which cannot be directly related to genes and proteins measured in experimental data. To solve this problem, we created Bioentities (http://github.com/sorgerlab/bioentities), a resource accounting for the hierarchical relationships between genes/proteins, protein families, and named complexes.
In addition, INDRA includes a curated “grounding map” that maps commonly encountered entities in text to the relevant database identifiers.
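The two resources described above can be pictured as a pair of lookups: one from text strings to identifiers, and one from families and complexes down to member genes. The following is only a minimal sketch; the dictionary names, the synonym entries, and the intermediate AMPK subunit family names are assumptions chosen for illustration and do not reproduce the actual Bioentities files or the INDRA grounding map.

```python
# Illustrative entries only (assumed for this sketch); the real grounding map
# and Bioentities hierarchy are much larger and stored in their own formats.
GROUNDING_MAP = {
    "JNK": "JNK",
    "SAPK": "JNK",
    "c-Jun N-terminal kinase": "JNK",
    "AMPK": "AMPK",
}

# Parent entity -> direct children (families, complex subunits, or genes).
CHILDREN = {
    "JNK": ["MAPK8", "MAPK9", "MAPK10"],
    "AMPK": ["AMPK_alpha", "AMPK_beta", "AMPK_gamma"],
    "AMPK_alpha": ["PRKAA1", "PRKAA2"],
    "AMPK_beta": ["PRKAB1", "PRKAB2"],
    "AMPK_gamma": ["PRKAG1", "PRKAG2", "PRKAG3"],
}


def ground(text):
    """Map an entity string from text to an identifier, or None if unknown."""
    return GROUNDING_MAP.get(text.strip())


def expand_to_genes(entity):
    """Recursively expand a family or complex into its member genes."""
    children = CHILDREN.get(entity)
    if children is None:
        return [entity]  # already a gene-level entity
    genes = []
    for child in children:
        genes.extend(expand_to_genes(child))
    return genes


print(ground("SAPK"))
print(expand_to_genes("AMPK"))
```

Expanding a family to its genes is what lets a mechanism stated about “AMPK” be related to a measurement of, for example, PRKAA1.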
Representation of “AMPK” in the Bioentities hierarchy
Synonyms for JNK protein family in the INDRA grounding map
Identifying relationships between mechanisms
Relations can be organized into a hierarchy based on their specificity
A key challenge in assembling detailed mechanistic networks is that a single mechanism may be described at different levels of specificity among the literature and various databases. Reconciling these overlapping mechanisms is essential to eliminate spuriously distinct edges in the assembled model. Using hierarchical ontologies of protein modification types, activity types, and the protein family information provided in Bioentities, INDRA implements duplicate removal, hierarchy-based redundancy resolution, and other forms of error correction and mechanism linking.
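As a rough illustration of hierarchy-based redundancy resolution, the sketch below treats one phosphorylation statement as a refinement of another when its agents are equal to, or members of, the other's agents and its site information is at least as detailed. The class, the function names, and the `FAMILY_OF` lookup are assumptions made for this example; this is not INDRA's own implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical gene -> family lookup standing in for Bioentities.
FAMILY_OF = {"BRAF": "RAF", "MAP2K1": "MEK"}


@dataclass(frozen=True)
class Phosphorylation:
    enzyme: str
    substrate: str
    residue: Optional[str] = None
    position: Optional[str] = None


def agent_refines(specific, general):
    """An agent refines itself or the family it belongs to."""
    return specific == general or FAMILY_OF.get(specific) == general


def site_refines(specific, general):
    """An unspecified (None) field is refined by any value."""
    return general is None or specific == general


def refines(a, b):
    """True if statement a is at least as specific as statement b."""
    return (agent_refines(a.enzyme, b.enzyme)
            and agent_refines(a.substrate, b.substrate)
            and site_refines(a.residue, b.residue)
            and site_refines(a.position, b.position))


def most_specific(statements):
    """Drop exact duplicates, then drop statements refined by another one."""
    unique = set(statements)
    top = [s for s in unique
           if not any(o != s and refines(o, s) for o in unique)]
    return sorted(top, key=lambda s: (s.enzyme, s.substrate, s.position or ""))


raw = [
    Phosphorylation("RAF", "MEK"),
    Phosphorylation("BRAF", "MAP2K1"),
    Phosphorylation("BRAF", "MAP2K1", "S"),
    Phosphorylation("BRAF", "MAP2K1", None, "218"),
    Phosphorylation("BRAF", "MAP2K1", None, "222"),
    Phosphorylation("BRAF", "MAP2K1", "S", "218"),
    Phosphorylation("BRAF", "MAP2K1", "S", "222"),
    Phosphorylation("BRAF", "MAP2K1", "S", "222"),  # exact duplicate
]
top = most_specific(raw)
for stmt in top:
    print(stmt)
```

Under these assumptions, only the two site-specific statements (S218 and S222) remain at the top of the hierarchy; the less specific ones are kept as supported, more general descriptions rather than as separate edges.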
The Integrated Network and Dynamical Reasoning Assembler (INDRA) [1] automatically assembles mechanistic models from pathway databases, literature, and expert knowledge expressed in natural language. INDRA draws on three existing natural language processing systems [4,5,6] and uses a modular architecture to build different types of models from a variety of sources.
Mechanisms extracted from each source format are normalized into Statements, an SBO-compatible internal representation, where they are processed to remove errors, identify overlaps, and estimate reliability. Statements are designed to correspond in both specificity and ambiguity to descriptions of biochemistry as found in text (e.g., “MEK1 phosphorylates ERK2”, rather than a detailed reaction mechanism). The representation currently encompasses post-translational modifications, chemical conversions, protein expression and degradation, and generic activation/inhibition relationships.
[Figure: INDRA Statement class diagram. Legend: "is a" (inheritance); composition (has one or more, life-cycle dependence)]

Statements
Statement (evidence : Evidence)
Modification (enzyme : Agent, substrate : Agent, residue : string, position : string)
AddModification: Phosphorylation, Hydroxylation, Ubiquitination, Acetylation, Glycosylation, Sumoylation, Farnesylation, Ribosylation, Geranylgeranylation, Palmitoylation, Myristoylation, Methylation
RemoveModification: Dephosphorylation, Dehydroxylation, Deubiquitination, Deacetylation, Deglycosylation, Desumoylation, Defarnesylation, Deribosylation, Degeranylgeranylation, Depalmitoylation, Demyristoylation, Demethylation
SelfModification (enzyme : Agent, residue : string, position : string): Autophosphorylation, Transphosphorylation
RegulateActivity (subject : Agent, object : Agent, obj_activity : string): Activation, Inhibition
RegulateAmount (subject : Agent, object : Agent): IncreaseAmount, DecreaseAmount
ActiveForm (agent : Agent, activity_type : string, is_active : boolean)
Conversion (subj : Agent, obj_from : list[Agent], obj_to : list[Agent])
Gef (gef : Agent, gtpase : Agent, gef_activity : string)
Gap (gap : Agent, gtpase : Agent, gap_activity : string)
Complex (members : list[Agent])
Evidence (text : string, source_api : string, source_id : string, pmid : string, annotations : dict, epistemics : dict)

Agent and components
Agent (name : string, mods : list[ModCondition], mutations : list[MutCondition], bound_conditions : list[BoundCondition], location : string, activity : ActivityCondition, db_refs : dict)
ModCondition (mod_type : string, residue : string, position : string, is_modified : boolean)
MutCondition (from_residue : string, to_residue : string, position : string)
BoundCondition (agent : Agent, is_bound : string)
ActivityCondition (activity_type : string, is_active : boolean)
Conceptual overview of automated assembly
System architecture and approach
INDRA software architecture
Estimating the reliability of extracted mechanisms
Even state-of-the-art NLP and text mining algorithms have limited accuracy, with roughly 20-30% of extracted relations representing a misinterpretation of the corresponding sentence (“reader error”). Given empirical estimates of the per-sentence error rate for different readers, INDRA’s BeliefEngine component aggregates results to estimate the overall probability that a relation is the result of reader error. It accomplishes this by:
1) aggregating evidence from multiple sentences read by the same reader
2) aggregating results from different reading algorithms on the same sentence
3) propagating error estimates through the network of related statements
Mechanisms can then be filtered with a precision threshold (e.g., 95% confidence).
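One simple way to picture this aggregation is sketched below. It assumes that each reader has a random per-sentence error rate and a systematic error rate, that readers err independently, and that evidence for a more specific statement also supports the more general statements above it. The reader names, the error values, and the formula itself are assumptions for illustration and are not taken from the BeliefEngine.

```python
from math import prod

# Hypothetical readers with assumed error rates (not measured values).
READER_ERRORS = {
    "reader_a": {"rand": 0.30, "syst": 0.05},
    "reader_b": {"rand": 0.20, "syst": 0.05},
}


def belief(evidence_counts):
    """Probability that a statement is not purely the result of reader error.

    evidence_counts maps a reader name to its number of supporting sentences.
    Each reader is wrong on all n of its sentences with probability
    syst + rand**n; readers are treated as independent.
    """
    p_all_wrong = prod(
        READER_ERRORS[reader]["syst"] + READER_ERRORS[reader]["rand"] ** n
        for reader, n in evidence_counts.items()
        if n > 0
    )
    return 1.0 - p_all_wrong


def combined_counts(stmt, own_counts, refinements):
    """Add evidence from more specific statements to a general statement.

    Assumes the refinement relation forms a tree, so nothing is counted twice.
    """
    total = dict(own_counts.get(stmt, {}))
    for child in refinements.get(stmt, []):
        child_counts = combined_counts(child, own_counts, refinements)
        for reader, n in child_counts.items():
            total[reader] = total.get(reader, 0) + n
    return total


own = {"general": {"reader_a": 1}, "specific": {"reader_b": 2}}
refinements = {"general": ["specific"]}

beliefs = {s: belief(combined_counts(s, own, refinements)) for s in own}
kept = [s for s, b in beliefs.items() if b >= 0.95]  # precision threshold
print(beliefs, kept)
```

With these assumed numbers, the general statement passes a 95% threshold only because it inherits the evidence of its more specific refinement.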
Reading systems produce partially overlapping extractions
Reliability estimates are propagated through the specificity hierarchy
Use case for explanation: interpreting phosphoproteomic data
REFERENCES
1. B. M. Gyori*, J. A. Bachman*, K. Subramanian, J. L. Muhlich, L. Galescu, and P. K. Sorger. “From word models to executable models of signaling networks using automated assembly.” bioRxiv, 2017.
2. V. Danos, J. Feret, W. Fontana, R. Harmer, and J. Krivine. “Rule-Based Modeling of Cellular Signaling.” Concurrency Theory (CONCUR) 2007, Lecture Notes in Computer Science, 4703:17–41, 2007.
3. E. J. Molinelli, A. Korkut, et al. “Perturbation biology: Inferring signaling networks in cellular systems.” PLoS Computational Biology, 9(12):e1003290, Dec 2013.
4. J. Allen, W. de Beaumont, L. Galescu, and C. M. Teng. “Complex event extraction using DRUM.” 2015.
5. M. A. Valenzuela-Escarcega, G. Hahn-Powell, T. Hicks, and M. Surdeanu. “A domain-independent rule-based framework for event extraction.” In Proc. 53rd Annual Meeting of the ACL-IJCNLP, 2015.
6. D. McDonald et al. “Extending Biology Models with Deep NLP over Scientific Articles.” Workshops at the 30th AAAI Conference on Artificial Intelligence, 2016.
7. C. F. Lopez*, J. L. Muhlich*, J. A. Bachman*, and P. K. Sorger. “Programming biological models in python using PySB.” Molecular Systems Biology, 9(1):646–646, Apr 2014.
Curated mechanisms for MTOR feedback inhibition on AKT
[Figure: network diagram. Node labels include genes and proteins (e.g., BRAF, MAP2K1, MAPK1, AKT1, MTOR), protein families (e.g., RAS, MEK, ERK, AMPK), drugs and small molecules (e.g., vemurafenib, sorafenib, rapamycin), and biological processes (e.g., apoptosis, proliferation).]
Model representations for statically identifying causal paths
Drug combinations
RPPA measurements
How did this happen?
http://www.sanderlab.org/pertbio/
Directed protein interaction graph
Kappa rule influence map [2]
Chemical reaction network
Mechanistic detail/causal context
More false positive paths (less stringent context)
More false negative paths (more stringent context)
Boolean network
The assembly challenge
MEK phosphorylates ERK
ERK phosphorylates MEK
MEK1 phosphorylates ERK2 at T185
MEK1p218p222 phosphorylates ERK2 at T184
MEK1p218p222 phosphorylates ERK2 at T185.
Methyl Ethyl Ketone phosphorylates ERK
“Raw” mechanisms
MEK phosphorylates ERK
MEK phosphorylates ERK
Assembled mechanisms
Generating mechanistic models from assembled Statements
In directed interaction graphs, the relatively limited causal context leads to an explosion of paths between any two proteins. This leads to many false positive paths and makes identification of long causal chains difficult (or even intractable) in large networks.
Generating explanations from the Kappa [2] rule influence map
- identifying rules whose activity is increased by the abundance of the subject (e.g., drug)
- searching for a path to an observable representing the object (e.g., a measured protein) with the appropriate overall polarity
- scoring paths by whether the signs of measured intermediate nodes are correctly predicted
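The path search and scoring steps listed above can be sketched on a small signed graph. In this sketch, the graph encoding, the function names, the edge signs, and the "measured" value are all assumptions for illustration; the node names are borrowed from the pervanadate example on this poster, but the code does not call Kappa or reproduce its influence map.

```python
def signed_paths(graph, source, target, polarity, max_nodes=6):
    """Yield simple paths whose edge signs multiply to the wanted polarity.

    graph maps a node to a list of (successor, sign) pairs, sign in {+1, -1}.
    """
    stack = [(source, [source], 1)]
    while stack:
        node, path, sign = stack.pop()
        if node == target:
            if sign == polarity:
                yield path
            continue
        if len(path) >= max_nodes:
            continue
        for successor, edge_sign in graph.get(node, []):
            if successor not in path:  # keep paths simple (no cycles)
                stack.append((successor, path + [successor], sign * edge_sign))


def score_path(graph, path, source_sign, measured):
    """Fraction of measured intermediate nodes whose sign is predicted."""
    sign = source_sign
    hits = total = 0
    for a, b in zip(path, path[1:]):
        sign *= dict(graph[a])[b]
        if b in measured and b != path[-1]:
            total += 1
            hits += int(measured[b] == sign)
    return hits / total if total else None


# Assumed signs: the drug promotes DUSP binding, which opposes DUSP acting
# on MAPK1, and dephosphorylation decreases the observable.
graph = {
    "Pervanadate": [("Pvd_binds_DUSP", +1)],
    "Pvd_binds_DUSP": [("DUSP_binds_MAPK1_phosT185", -1)],
    "DUSP_binds_MAPK1_phosT185": [("DUSP_dephos_MAPK1_at_T185", +1)],
    "DUSP_dephos_MAPK1_at_T185": [("MAPK1_pT185", -1)],
}

increase_paths = list(signed_paths(graph, "Pervanadate", "MAPK1_pT185", +1))
decrease_paths = list(signed_paths(graph, "Pervanadate", "MAPK1_pT185", -1))
# Hypothetical measurement of one intermediate node, for scoring only.
score = score_path(graph, increase_paths[0], +1,
                   {"DUSP_binds_MAPK1_phosT185": -1})
print(increase_paths, decrease_paths, score)
```

Two negative edges along the path give an overall positive polarity, which is how an inhibitory interaction upstream can explain an increase in the measured phosphoprotein.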
Causal path for “Pervanadate increases MAPK1 phosphorylation”
[Figure: path through the rule influence map. Nodes: Pvd_binds_DUSP, Pvd_binds_DUSP_rev, DUSP_binds_MAPK1_phosT185, DUSP_binds_MAPK1_phosT185_rev, DUSP_dephos_MAPK1_at_T185, MAPK1_pT185. Edge labels: [0->0];[1->1], [1->0], [0->1], [0->0].]
Extending the model by describing mechanisms in English
“IGF1R phosphorylates IRS1 at tyrosine.
Tyrosine-phosphorylated IRS1 binds PI3K.
Serine-phosphorylated IRS1 is degraded.
Active PPP2CA dephosphorylates IRS1 at serine.
Active MTOR inhibits PPP2CA.”
To build a mechanistic model, high-level assertions such as “MEK1 phosphorylates ERK1” must be converted into specific reaction mechanisms. INDRA uses user-specified policies to determine how each Statement type is implemented as PySB [7] rules and corresponding reactions.
Phosphorylation(MEK1, ERK1)
- one-step (pseudo-first-order)
- one-step (Michaelis-Menten)
- two-step (enzyme-substrate complex formation)
- ATP-dependent (unordered bi-bi reaction)
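To make the idea of assembly policies concrete, the sketch below expands a single phosphorylation assertion into one or two reactions depending on the chosen policy. The function name, the policy keys, and the reaction notation are assumptions for illustration; the output is plain text, not PySB or Kappa syntax, and only two of the policies listed above are covered.

```python
def assemble_phosphorylation(enzyme, substrate, policy="one_step"):
    """Expand Phosphorylation(enzyme, substrate) into reaction strings.

    "_u" and "_p" mark the unphosphorylated and phosphorylated substrate.
    """
    if policy == "one_step":
        # The enzyme acts catalytically in a single reaction.
        return [f"{enzyme} + {substrate}_u -> {enzyme} + {substrate}_p"]
    if policy == "two_step":
        # Reversible enzyme-substrate binding followed by catalysis.
        bound = f"{enzyme}:{substrate}_u"
        return [
            f"{enzyme} + {substrate}_u <-> {bound}",
            f"{bound} -> {enzyme} + {substrate}_p",
        ]
    raise ValueError(f"unknown policy: {policy}")


one_step = assemble_phosphorylation("MEK1", "ERK1", policy="one_step")
two_step = assemble_phosphorylation("MEK1", "ERK1", policy="two_step")
for reaction in one_step + two_step:
    print(reaction)
```

The same high-level assertion therefore yields models of different granularity, which is why the policy is left as a user choice.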
Genome assembly
Sequence reads
Assembled sequence
Knowledge assembly
Assembly of a large number of mechanistic facts is analogous to genome assembly: databases and literature yield a large number of redundant, partially overlapping facts that may contain errors. Mechanisms must be corrected and “aligned” in order to produce a set of facts suitable for generating a non-redundant, non-degenerate model.
To evaluate the ability of INDRA to systematically generate explanations of high-throughput data, we assembled a rule-based executable model to explain a previously published dataset of the phospho-proteomic response of a melanoma cell line to 12 different drugs [3]. A rule-based model containing 221 proteins and 1451 rules was assembled from mechanisms extracted from databases and ~95,000 publications (abstracts and full texts). Static analysis of the rule influence map provided by Kappa identified possible mechanistic paths linking drug targets to experimentally observed effects on phosphoprotein abundances.
Drug target   Antibody        Fold-change   Path?
MEK           MAPK pT202      0.47
SRC           CHK2 pT68       1.75
SRC           4EBP1 pT37      0.44
AKT           AKT pT308       0.25
AKT           GSK3A/B pS21    0.44
AKT           AKT pS473       0.17
AKT           S6 pS235        0.36
CDK4          4EBP1 pS65      0.44
CDK4          YB1 pS102       2.13
MTOR          AKT pT308       2.19
MTOR          S6 pS240        0.05
MTOR          AKT pS473       3.19
MTOR          p70S6K pT389    0.33
MTOR          S6 pS235        0.06
PKC           GSK3A/B pS21    1.59
PKC           S6 pS240        0.47
PKC           S6 pS235        0.3
PI3K          p70S6K pT389    0.5
PI3K          S6 pS240        0.44
PI3K          AKT pS473       0.2
PI3K          S6 pS235        0.27
SRC phosphorylated on Y418 phosphorylates PAK2 on S20. PAK2 phosphorylated on S20 phosphorylates RAF1 on S338. RAF1 phosphorylated on S338, T269 and S471 phosphorylates MAPK1 on T185. MAPK1 phosphorylated on T185 and Y187 phosphorylates TP53 on S15. TP53 phosphorylated on S20 and S15 decreases the amount of PLK1. PLK1 phosphorylates CHEK2 on T68, which is measured by CHK2_pT68.
Example explanation: How does Src inhibition increase CHK2 pT68?
Performance: For the largest effects in the data (>50% fold-change), the model generated biochemically plausible explanations for 20 of the 22 effects (91%). For effects at the 20% fold-change level, the model explained 95/135 (70%) of effects. Notably, performance was biased toward drug targets well represented in the literature corpus: the model explained 94/106 (89%) of effects due to PI3K, PKC, SRC, MTOR, MEK, AKT, RAF, and JAK inhibition, but only 1/29 (3%) of effects due to CDK, STAT, or MDM2 inhibition.
Where the model was unable to identify a causal path between a drug perturbation and an observed effect, we were able to use NLP to manually curate a causal path in simplified English and co-assemble it with the automated model.
Overall, this study shows the potential of automatically assembled models to systematically explain high-throughput data, generating mechanistic hypotheses and identifying genuinely novel phenomena.
The Kappa influence map captures detailed context while avoiding the combinatorial explosion of chemical species. Paths are obtained by: