Integrative Informatics
Life Sciences Conference + Expo
April 3rd, 2006 – Boston, MA
John Reynders
Information Officer - LRL Discovery and Development Informatics
Life Sciences C+E 2006, Boston MA Copyright © 2006 Eli Lilly and Company
Outline
Can’t we all just get along?
Navigating silos of silos
Integrative Informatics
Rapid Application Development with the Parallel Object-Oriented
Methods and Applications (POOMA) Framework
Post-Doc Challenge:
Write a 3D Pseudo-Spectral code to simulate two colliding vortices using the Navier-Stokes Equations
Advanced Computing Lab, Los Alamos National Lab
Result:
One Post-Doc with no parallel experience wrote this application in 5 weeks with POOMA
Navier-Stokes simulation iso-surface of vorticity
Encapsulation in the POOMA Framework
[Diagram: layered framework, from computer science at the base, through algorithms, to physics applications]
Computer Science: STL, Expression Templates, User threads, RefCount & Data Pooling, MPI/PVM, Domain Decomp, RTS Scheduling, Load Balancing
Algorithm: Fields, Matrices, Particles, Meshes, FFT, Elliptic Solvers, Stencil Operations, Stencil Operators, Interpolators
Physics / Application: DP Monte Carlo, ER Plasmas, DP Hydro, ER Ocean
Other labels on the diagram: Global, Local, Parallel
Outline
Can’t we all just get along?
Navigating silos of silos
Integrative Informatics
The Problem: Silos of Silos
• Tools, applications, and data are standalone with limited interaction
• Scientists have great difficulty finding their data and associated tools
• Asking cross-domain questions ( e.g. bio+chem ) is very difficult
• Support is becoming very impractical – estimated 400+ individual tools across silos
LLYDB
BioSel
Jockyss
ELIAS
Beacon
ICARIS
ResultsStar
Jubilant
BioGeMs
Sig3
PathArt
TV-GAME
PubDBs
ProteomeXrep
Nautilus
Conformia
Intellichem
MCPACT
Watson
PRDB
LIMS
IDW
Chem Bio PR&D/ADMET
Going from the vertical to the horizontal
PRESENTATION LAYER (.NET)
WORK FLOW / BUSINESS LAYER (some .NET)
DATA LAYER: BioGems
DATA LAYER: Biosel/TINS
DATA LAYER: Process Tracking
DATA LAYER: Data Warehouse
DATA LAYER (.NET?): ...
Lilly Science Grid (LSG) Architecture - Systems
TDC-TAT
Plug-In Manager
Biology Chemistry Toxicology
BioGEMS TV-GAME BioSel System X
WS Provider
WS Consumer
WS Provider WS Provider
WS Cons
WS Provider
SAP Portfolio
Portfolio Sys
WS Provider
Plug-In A Plug-In B Plug-In N…
WS Cons WS Cons
Event Communication
TAO
LSG Architecture - Tools View
TDC-TAT
.NET API
SRS/Oracle Oracle Oracle
Perl CGI Perl CGI Perl CGI Java
SOAP::Lite
Perl
SOAP::Lite Axis
.NET Proxy
SOAP::Lite
Flat Files/Oracle
Java
Axis
.NET User Ctrl
.NET User Ctrl
.NET User Ctrl…
.NET Proxy .NET Proxy
.NET API
Oracle
.NET C#
IIS Web Server
Visual Studio
Apache Web Server
Linux
Tomcat Web Server
Linux
WSDL WSDL WSDL WSDL WSDL
Common XML Schemas (XSD)
Thick Client POC Interface
Navigation tree:
- Portfolio View
- Phase
+ Target of Interest
- Target to Hit
. Target 1
. Target 2
. …
+ Hit to Lead
+ Lead Validation
+ Lead Optimization
+ Platform View
+ DHT View
Panels: Find Targets; Target Information: Target 2; Portfolio Information; Target Validation; Target Druggability; Target Toxicity; Project Docs
Project: ProjectX
Phase: Target to Hit
Indication: Depression
DHT: Psychosis
Project Leads: Bio: John Doe, Chem: John Doe 2, Tox: John Doe 3, etc.
Next Milestone: Hit to Lead / Q4 2005
StarTeam: Biogenic Amine
Webparts Thin Client POC Interface
Portfolio WS
Spreadsheet of Portfolio
Active Cpds WS
BIOSEL/TINS
Prot. Express. WS
GeneAtlas/BioGems
Prot. Express. WS
Proteome/BioGems
Prot. Express. WS
Proteome/BioGems
Prot. Express. WS
Proteome/BioGems
Data WebLinks. WS
BioGems/TV-GAMES …
Gene ID Mapping WS
Custom Spreadsheet
Data Integration/Mapping
Architecture enables encapsulation and division of labor
Grid Architecture Points: The User
MyScience: Enable scientists to dynamically compose their environment from a set of components
Orchestration: Components communicate to enable an action/question in one component to yield results/answers from multiple components
Organic: New capabilities can be added by simply adding a new component
Scalable: The combinatorics of using 4 out of 12 components yields 495 configurations
• It is much easier to maintain a framework and 12 associated components than 495 separate tools!
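The 495 figure is the number of ways to choose 4 components out of 12 when order does not matter. A quick check, as a sketch with hypothetical component names (the slide only gives the counts):

```python
from itertools import combinations
from math import comb

# Hypothetical component names; the slide only says "12 components".
components = [f"component_{i}" for i in range(1, 13)]

# Number of distinct 4-component environments, order ignored.
n_configs = comb(len(components), 4)

# Enumerating the environments explicitly gives the same count.
configs = list(combinations(components, 4))

print(n_configs, len(configs))
```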
Grid Architecture Points: The Developer
Get SEs and scientists out of silos and into layers so they may do what they do best
• Data, applications, algorithms, presentation
Plug-in architecture to factor business/science and framework development
Crisp abstraction barriers between and within layers to enable modular development
Rationalize tool set within layers to improve developer productivity
Some Observations/Thoughts
Where is the Value:
• Data fusion in the military – information supremacy
– F15 heads-up display
– Aegis cruiser
– “Bob” and “Tom” from the NSA
• What makes the drug hunter effective in an “Informatics cockpit”
– It’s partly the quality, speed, accuracy of any given tool ( e.g. the altimeter )
– It’s mostly how the instruments work as an integrated whole
• I can ask questions I could not ask before!
– Integrate to this point of innovation – before spending significant time on optimizations
Some Lessons from Los Alamos:
• How can one go wrong having an application framework built by a team of A+ students?
– By building a framework that can only be used by A+ students
• Surely everyone knowing as much as they can about all aspects of the framework will produce the best framework!
– Nope. By knowing the implementation behind the abstraction, a team fails to program through interface contracts
– Also, the team has challenges scaling in development efforts – because it is not functioning as a team ( everyone run to the soccer ball? )
The old vs. the new architecture
Benefits: rapid development, customizable environment, integration of tools for cross-domain inquiry, reduced support load…
plugin
DATA LAYER
Discovery Informatics Integration Kernel
plugin
DATA LAYER
plugin
DATA LAYER
plugin
DATA LAYER
Jubilant
BioGeMs
Sig3
PathArt
TV-GAME
PubDBs
ProteomeXrep
Web-Service Layer - LSG
Can accelerate staging the old into the Lilly Science Grid
Discovery Integration Kernel
plugin plugin plugin plugin plugin
Integrated Data Layer - LSG
BioGeMs PathArt
Web-Service Layer
Scaling efforts: Divide & Conquer
The Kernel
• Composability, Integration, Interaction, Scalability
• Clear contract with plug-ins
The Plug-in
• Clear contract with kernel and web-service layer
• Domain-specific tool
• Limited knowledge of Kernel required to build plug-in
Web-Services
• Clear contract with plug-ins and data-layer
• Insert web-service layers into tools – preserving legacy interface and creating service to build a plug-in
Integrated Data Layer
• Clear contract with Web-Services
• Design for integration first, optimization next
• Automate ETL
Discovery Integration Kernel
plugin
Integrated Data Layer
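One way to picture the kernel/plug-in contract is a small publish/subscribe sketch. Everything here is an assumption made for illustration (the `PlugIn` and `Kernel` classes, the `target.selected` topic, the two example plug-ins); it is not the Lilly Science Grid API. The point is only that a plug-in needs to know the contract, not the kernel internals.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class PlugIn(ABC):
    """Hypothetical plug-in contract: declare topics, handle events."""

    name = "unnamed"

    @abstractmethod
    def topics(self):
        """Return the event topics this plug-in subscribes to."""

    @abstractmethod
    def handle(self, topic, payload):
        """React to one event and return a result."""


class Kernel:
    """Minimal kernel: registers plug-ins and routes events between them."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register(self, plugin):
        for topic in plugin.topics():
            self._subscribers[topic].append(plugin)

    def publish(self, topic, payload):
        # One action in one component fans out to every subscriber.
        return {p.name: p.handle(topic, payload)
                for p in self._subscribers.get(topic, [])}


class PortfolioPlugIn(PlugIn):
    name = "portfolio"

    def topics(self):
        return ["target.selected"]

    def handle(self, topic, payload):
        # A real plug-in would call its web service here.
        return f"portfolio information for {payload}"


class ExpressionPlugIn(PlugIn):
    name = "expression"

    def topics(self):
        return ["target.selected"]

    def handle(self, topic, payload):
        return f"protein expression for {payload}"


kernel = Kernel()
kernel.register(PortfolioPlugIn())
kernel.register(ExpressionPlugIn())
answers = kernel.publish("target.selected", "Target 2")
print(answers)
```

Adding a capability then means registering one more plug-in; the kernel and the existing plug-ins stay unchanged.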
Indications view
Drug Hunting Team view
Outline
Can’t we all just get along?
Navigating silos of silos
Integrative Informatics
Similarity?
~
Graph - yes. Text - yes. Assay - yes.
Similarity – and adding a magical Methyl
~
Graph - yes. Text - yes. Assay - yes.
~
Graph - yes. Text - maybe. Assay - no.
Gene Objects
• Representation
– Name
– String
• Filters
– PathwaySet
– GeneFamily
– GO
• Measures
– Alignment (Algorithm)
– Text (DocumentSet)
– GeneExpression (SampleSet, MoleculeSet )
ATGAGCCTCCCCAATTCCTCCTGCCTCTTAGAAGACAAGATGTGTGAGGGATGCCA
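The representation / filter / measure split can be sketched as a small data structure. The names below (`Gene`, `in_family`, `identity`) and the short sequences are hypothetical, and the position-by-position identity score is only a stand-in for a real alignment algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str                          # representation: name
    sequence: str                      # representation: string
    families: frozenset = frozenset()  # annotations used by filters


def in_family(family):
    """Filter factory: keep genes annotated with the given family."""
    return lambda gene: family in gene.families


def identity(a, b):
    """Toy measure: fraction of matching positions, no gaps allowed."""
    n = min(len(a.sequence), len(b.sequence))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a.sequence, b.sequence))
    return matches / n


# Made-up genes, only to exercise the filter and the measure.
genes = [
    Gene("GENE_A", "ATGAGC", frozenset({"kinase"})),
    Gene("GENE_B", "ATGAGT", frozenset({"kinase"})),
    Gene("GENE_C", "TTTTTT", frozenset({"protease"})),
]
kinases = [g for g in genes if in_family("kinase")(g)]
score = identity(kinases[0], kinases[1])
print([g.name for g in kinases], score)
```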
Protein Objects
• Representation
– Gene
– String
– State ( e.g., Phosphorylated )
• Filters
– GeneFamily
– GO
• Measures
– Alignment (Algorithm)
– Pathway (PathwaySet)
– Text (DocumentSet)
– ProteinExpression (SampleSet, MoleculeSet )
– Assay ( ExperimentSet )
– 3D Structure (Algorithm)
SNP Objects
• Representation
– Gene Locus
• Filters
– GeneSet
– SNP characterization: Coding/Non-Coding, BLOSUM Score, Exon/Intron, Transcriptional
• Measures
– Linkage disequilibrium: D-Prime, R-Squared
– Haplotype Block Association
– Text (DocumentSet)
ATGAGCCTCCCCAATTCCTCCTACCTCTTCGGAGACAAGATGTGTCAGGGATGCCA
ATGAGCCTCCCCAATTCCTCCTGCCGCTTCGAAGACAAGATGTGTCAGGGATGCCA
ATGAGCCTCCCCAATTCCTCCTACCTCTTAGGAGACAAGATGTGTCAGGGATGCCA
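The two linkage disequilibrium measures named on this slide, D-Prime and R-Squared, can be computed from two-locus haplotype counts using the standard definitions. The function name and the counts below are made up for illustration; they are not derived from the sequences shown.

```python
def linkage_disequilibrium(counts):
    """Return (D', r^2) for two biallelic loci.

    counts maps the haplotypes 'AB', 'Ab', 'aB', 'ab' to observed counts,
    where A/a are the alleles at locus 1 and B/b the alleles at locus 2.
    """
    total = sum(counts.values())
    p_ab = counts.get("AB", 0) / total
    p_a = (counts.get("AB", 0) + counts.get("Ab", 0)) / total
    p_b = (counts.get("AB", 0) + counts.get("aB", 0)) / total

    # D: deviation of the observed haplotype frequency from independence.
    d = p_ab - p_a * p_b

    # D' rescales |D| by its largest possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d_prime, r_squared


# Hypothetical haplotype counts.
d_prime, r_squared = linkage_disequilibrium({"AB": 40, "Ab": 10, "aB": 10, "ab": 40})
print(d_prime, r_squared)
```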
Image Objects
Representation - 2D Matrix
Filter - Laws' Texture Convolution
L5 = [ 1 4 6 4 1 ]
E5 = [ -1 -2 0 2 1 ]
S5 = [ -1 0 2 0 -1 ]
W5 = [ -1 2 0 -2 1 ]
R5 = [ 1 -4 6 -4 1 ]
Measure - Density Functional Signature ( DFS ): Target DFS + L2 Measure
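A sketch of the filter-then-measure idea using two of the 1D vectors above. The 5x5 masks are outer products of the vectors, which is the usual Laws construction. The mean-absolute-response feature is an assumed stand-in, since the slide does not define the Density Functional Signature, and the tiny test images are made up.

```python
from math import sqrt

L5 = [1, 4, 6, 4, 1]      # level
E5 = [-1, -2, 0, 2, 1]    # edge


def outer(u, v):
    """Build a 5x5 mask from two 1D vectors."""
    return [[a * b for b in v] for a in u]


def filter_valid(image, mask):
    """Slide the mask over the image (no flip, no padding)."""
    mh, mw = len(mask), len(mask[0])
    rows = len(image) - mh + 1
    cols = len(image[0]) - mw + 1
    return [[sum(mask[i][j] * image[r + i][c + j]
                 for i in range(mh) for j in range(mw))
             for c in range(cols)]
            for r in range(rows)]


def texture_energy(image, mask):
    """Mean absolute filter response over the valid region."""
    values = [abs(v) for row in filter_valid(image, mask) for v in row]
    return sum(values) / len(values)


def l2_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


masks = [outer(L5, E5), outer(E5, L5)]
flat = [[7] * 6 for _ in range(6)]                 # uniform image
ramp = [[c for c in range(6)] for _ in range(6)]   # brightness rises left to right

flat_features = [texture_energy(flat, m) for m in masks]
ramp_features = [texture_energy(ramp, m) for m in masks]
distance = l2_distance(flat_features, ramp_features)
print(flat_features, ramp_features, distance)
```

Because E5 sums to zero, a uniform image gives no response, while the ramp responds only to the mask whose edge vector runs along the direction of change.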
Molecule Objects
• Representation
– Name
– Graph
– 3D Structure
• Filters
– Library Compounds
– Similarity Search
• Measures
– Text (DocumentSet)
– Fingerprints (Algorithm)
– 2D/3D Similarity (Algorithm)
– HTS (GeneSet)
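A fingerprint measure and a similarity-search filter might look like the sketch below. The Tanimoto coefficient is a common choice for fingerprint similarity, but the slide does not name a specific algorithm, so treat it as an assumption. The compound names and on-bit sets are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_search(query, library, threshold):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up fingerprints, stored as sets of on-bit positions.
library = {
    "C1": {1, 2, 3, 4},
    "C2": {2, 3, 4},
    "C3": {10, 11},
}
query = {1, 2, 3}
hits = similarity_search(query, library, threshold=0.5)
print(hits)
```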
Targets/Compounds T1 T2 T3 T4 T5 T6 T7 logP Vdist
C1 8 4 >20 2
C2 5 7 >20 >20 60% 2340
C3 1.1 3 >20 >20 20% 100
C4 >20 >20 >20
C5 2 0.9
C6 >20 >20 0.1 0.09 90% 2500
C7 0.1 0.2 0.1 0.1 0.3 10% 400
C8 0.77 0.2 >20 >20 0
C9 0.57 0.27 >20 >20
… 0.2 >20 >20 >20 >20 >20
Compound Profiling Bioprint
Target Chemoprint
Chemogenomic Selectivity profiles
[Figure: dendrogram from hierarchical clustering, Method = Ward. Leaf labels: GSK3B, PRKG1/PKG1, SRC, CDK2, CDK4, PRKCA/PKCa, LYN, CDK5, CSF1R/FMS, CSNK1A1/CK1a, PRKCD/PKCd, KIT, PRKCE/PKCe, CHEK1/CHK1, PRKCG/PKCg, PRKCB1/PKCb, PRKCH/PKCh, PDGFRA, PDGFRB, LCK, ABL1, FLT1, FLT4, KDR, MAPK1/Erk2, ERBB2/HER2/ErbB2, MAPK3/Erk1, FGFR1, WEE1, INSR, MAPK9/JNK2, RAF1, EGFR, MAPK11/p38b, MAPK14/p38a, CDC2, TEK/TIE2, IGF1R, FYN, MAP2K1, MYLK2/skMLCK, PRKACA/PKACa, ZAP70]
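Hierarchical clustering with Ward's method repeatedly merges the two clusters whose union adds the least within-cluster variance. A naive sketch follows, using made-up two-value profiles for four hypothetical targets. Real profiling data would also need a policy for missing and censored values such as ">20", which this sketch leaves out.

```python
def ward_cluster(profiles):
    """Naive agglomerative clustering with Ward's criterion.

    profiles maps a name to a numeric vector. Returns the merges in order
    as (members_a, members_b, cost) tuples.
    """
    # Each cluster is (member names, centroid).
    clusters = [([name], list(vec)) for name, vec in profiles.items()]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[i], clusters[j]
                na, nb = len(ma), len(mb)
                dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if merged.
                cost = na * nb / (na + nb) * dist2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        merges.append((tuple(ma), tuple(mb), cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((ma + mb, centroid))
    return merges


# Hypothetical target profiles (for example, activity against two compounds).
profiles = {
    "T1": [0.0, 0.0],
    "T2": [0.0, 1.0],
    "T3": [10.0, 0.0],
    "T4": [10.0, 1.0],
}
merges = ward_cluster(profiles)
for members_a, members_b, cost in merges:
    print(members_a, members_b, cost)
```

The merge order, read bottom-up, is what a dendrogram draws; the cost of each merge sets the height of the join.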
Comparison of kinase dendrograms
[Figure: two dendrograms from hierarchical clustering (Method = Ward) over the same kinase panel, one built from assay similarity (Assay Sim.) and one from sequence similarity (Sequence Sim.)]
HTS as a Mapping Object
• Representation
– ProteinSet
– MoleculeSet
– HTS Array
• Filters
– Protein Filters
– Molecule Filters
• Measures
– Cluster Analysis
– Self-Organizing Maps
– Support Vector Machines
– Neural Networks
Gene Expression as a Mapping Object
• Representation
– GeneSet
– SampleSet
– 2D Expression Matrix
• Filters
– Gene Filters
– Sample Filters
• Measures
– Cluster Analysis
– Self-Organizing Maps
– Support Vector Machines
– Neural Networks
Text as a Mapping Object
• Representation
– ObjectSet A
– ObjectSet B
– RDF Triplets
• Filters
– DocumentSet
– A Filters
– B Filters
• Measures
– QR Factorization
– Text-based Classifiers
Pathway as a Mapping Object
• Representation
– ProteinSet
– MoleculeSet
– VertexSet
• Filters
– Protein Filters
– Molecule Filters
• Measures
– Graph Algorithms
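The mapping objects on these slides share one shape: two object sets plus the data that relates them, with filters on either side. A minimal sketch of that shape for the matrix-like cases (HTS, gene expression) follows. The `Mapping` class, its method names, and the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    rows: list     # e.g. protein names (ProteinSet)
    cols: list     # e.g. molecule names (MoleculeSet)
    values: list   # values[i][j] relates rows[i] to cols[j]

    def filter(self, row_pred=lambda r: True, col_pred=lambda c: True):
        """Return a sub-mapping keeping rows and columns that pass the filters."""
        ri = [i for i, r in enumerate(self.rows) if row_pred(r)]
        ci = [j for j, c in enumerate(self.cols) if col_pred(c)]
        return Mapping(
            [self.rows[i] for i in ri],
            [self.cols[j] for j in ci],
            [[self.values[i][j] for j in ci] for i in ri],
        )

    def row_profile(self, row):
        """Profile of one row object across all column objects."""
        return self.values[self.rows.index(row)]

    def col_profile(self, col):
        """Profile of one column object across all row objects."""
        j = self.cols.index(col)
        return [row[j] for row in self.values]


# Made-up HTS-style data: two proteins screened against three molecules.
hts = Mapping(
    rows=["P1", "P2"],
    cols=["M1", "M2", "M3"],
    values=[[0.1, 5.0, 20.0],
            [0.2, 7.0, 0.5]],
)
subset = hts.filter(col_pred=lambda m: m != "M2")
print(subset.cols, subset.values, hts.col_profile("M3"))
```

Measures such as cluster analysis would then operate on the row or column profiles of the filtered mapping.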
Putting it all together…
Objects Measure
MTS Literature
Binding Coding
Clinical DB
Compounds
Images
Genes
SNPs
Expression
Linkage D
Signature
Fingerprint
Map 1 Map 2
The goal… find “wormholes”
[Diagram: 16 numbered objects (1–16) connected by relations]
16 Objects, 20 Text-based relations
Relation types: Text, Pathway, HTS, Expression, Image
16 Objects, 120 heterogeneous relations
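One reading of a "wormhole" is a short path between two objects that only exists once relations of different types are combined. A breadth-first search over a graph of typed relations sketches that idea. The object names and relations below are hypothetical and `find_path` is an illustrative helper, not something taken from the slides.

```python
from collections import deque


def find_path(relations, start, goal):
    """Shortest chain of typed relations from start to goal, or None.

    relations is a list of (object_a, object_b, relation_type) tuples,
    treated as undirected. Returns a list of (from, relation_type, to) hops.
    """
    graph = {}
    for a, b, kind in relations:
        graph.setdefault(a, []).append((b, kind))
        graph.setdefault(b, []).append((a, kind))

    came_from = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbour, kind in graph.get(node, []):
            if neighbour not in came_from:
                came_from[neighbour] = (node, kind)
                queue.append(neighbour)

    if goal not in came_from:
        return None

    # Walk back from the goal to rebuild the hops in order.
    path = []
    node = goal
    while came_from[node] is not None:
        previous, kind = came_from[node]
        path.append((previous, kind, node))
        node = previous
    return path[::-1]


# Made-up relations of three different types.
relations = [
    ("compound:C7", "protein:T1", "HTS"),
    ("protein:T1", "protein:T2", "Pathway"),
    ("protein:T2", "gene:G2", "Text"),
    ("gene:G9", "snp:S1", "Expression"),
]
path = find_path(relations, "compound:C7", "gene:G2")
print(path)
```

With only one relation type loaded, the compound and the gene in this example would not be connected at all.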