Lilly Singapore Centre for Drug Discovery
LSCDD
Scaling the walls of discovery: using semantic metadata for integrative problem solving
Greg Tucker-Kellogg, Ph.D.
Chief Technology Officer
Senior Director, Systems Biology
Lilly Singapore Centre for Drug Discovery
2LSCDD
Outline
The Challenge of Translational Discovery in Pharmaceutical Research
Integration of Metadata using Semantic Web Technologies
•Why focus on metadata?
•How it helps
Examples
3LSCDD
Lilly Singapore Centre for Drug Discovery
Integrative Computational Sciences (tools)
Wet lab biology
Drug Discovery (drug candidates)
Oncology and diabetes research towards tailored therapy to improve patient outcome
Systems Biology (biomarkers)
Experimental Computational
4LSCDD
Pharmaceutical R&D spends more to get less
5LSCDD
Lost in translation
The limits of my language mean the limits of my world (Ludwig Wittgenstein)
Translate (English to Chinese): 我的语言限制的范围是我的
Translate (Chinese back to English): I limit the scope of the language I (Ludwig Wittgenstein)
6LSCDD
Translational research in cancer: Connecting the dots of genetic aberrations
Targets Disease Patients
Pathways
Tailored Therapeutics
Improve individual patient outcomes and health outcome predictability through tailoring drug, dose, timing of treatment, and relevant information
7LSCDD
The “Web” of heterogeneous data
Cell/Assay Technologies
8LSCDD
Integrating Scientific Data Sets
Uncontrollable diversity
Most of the valuable data is from outside our walls
Much of it is poorly structured
Ranging from large (1TB/day) to boutique
9LSCDD
Scientist’s View of Integrated Information
Protein: IHC, Luminex
Omics
RNAi reagents: Qiagen siRNA, BROAD shRNA, cDNA
Acumen assays
Cellomics assays
High-content bioassays
Biochemical data
Chemical biology
Functional chemogenetics
Target-based chemotype profiling
Mapping and annotation backbone
Interrogators, Reporters
Pathway-based chemotype profiling
Strategic
Cross-domain integration
Domain-level integration
Foundational
Color code
Plate Reader
DNA: CGH, SNP, Mutation
RNA: miRNA, mRNA
Epigenetics: Methylation, ChIP-Chip
Platforms
10LSCDD
Manual Data Integration
A repeated, tedious process:
• Pull data from internal and public data sets
• Normalize terms and values
• Write and run analysis scripts
• Compile into a single Excel file, detached from the data source (no drill-down)
Often this process can consume days with no guaranteed resolution
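The manual steps above might look roughly like this in script form. This is a minimal sketch only: the alias table, field names, and values are made up for illustration and are not taken from LSCDD data.

```python
import csv
import io

# Hypothetical alias table: each data set may name the same gene differently.
GENE_ALIASES = {"HER2": "ERBB2", "NEU": "ERBB2", "P53": "TP53"}


def normalize_gene(symbol: str) -> str:
    """Map a gene symbol to one canonical form (trimmed, upper case, alias resolved)."""
    cleaned = symbol.strip().upper()
    return GENE_ALIASES.get(cleaned, cleaned)


def merge_by_gene(internal_rows, public_rows):
    """Join two record lists on the normalized gene symbol."""
    merged = {}
    for row in internal_rows:
        gene = normalize_gene(row["gene"])
        merged.setdefault(gene, {"gene": gene})["expression"] = row["expression"]
    for row in public_rows:
        gene = normalize_gene(row["symbol"])
        merged.setdefault(gene, {"gene": gene})["copy_number"] = row["copy_number"]
    return [merged[g] for g in sorted(merged)]


def to_csv(rows):
    """Flatten the merged rows into CSV text (the 'single Excel file' step)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["gene", "expression", "copy_number"], restval=""
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up records standing in for an internal and a public data set.
internal = [{"gene": "her2", "expression": 8.1}, {"gene": "TP53", "expression": 5.4}]
public = [{"symbol": "ERBB2", "copy_number": 6}, {"symbol": "p53 ", "copy_number": 1}]

rows = merge_by_gene(internal, public)
print(to_csv(rows))
```

The flat file that comes out has lost every link back to its sources, which is the "no drill-down" problem named in the list.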
11LSCDD
Integration Approaches Considered
•Data Warehouse
  • Difficult to maintain and integrate new data sets
  • Difficult to evolve as data changes
  • Schemas tightly coupled to applications
•Federated queries
  • Query performance issues
  • Where to place the index?
  • Problematic to maintain
  • Translating user search syntax to all sources requires deep knowledge of data layer
•Semantic Integration
  • Relatively unproven in enterprise systems but adaptive to change
  • Relationships between data can be more fully characterized
12LSCDD
Standard Semantic Integration Model
Query Generator
Results Presentation
Semantic Normalization
Source
Source
Source
Source
Domain Ontology & Mappings
Data Set Integration
Query Planning
Query Submission
•All data is mapped to the domain ontology in both directions
•If a single system is down, results are incomplete
•Performance is limited by the slowest system in the network
•Massive mapping effort
•Multiple implementations of this approach, including:
  • Biological and Chemical Integrated Information System (BACIIS)
  • Boeing
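The two operational limits listed above (incomplete results when a source is down, and waiting on the slowest source) can be illustrated with a toy mediator. This is a sketch under stated assumptions, not the BACIIS or Boeing design: the `Source` class, the source names, and the latencies are hypothetical, and only the ontology-to-source direction of the mapping is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Source:
    name: str
    latency_s: float                     # simulated response time, not measured
    to_local: Dict[str, str]             # ontology term -> source-specific term
    fetch: Callable[[str], List[str]]    # source-specific lookup


def federated_query(term: str, sources: List[Source]):
    """Fan one ontology-level term out to every source and merge the answers."""
    results, failed = [], []
    for src in sources:
        local_term = src.to_local.get(term)  # ontology -> source mapping
        if local_term is None:
            continue  # this source has no mapping for the term
        try:
            results.extend((src.name, hit) for hit in src.fetch(local_term))
        except ConnectionError:
            failed.append(src.name)  # one source down means an incomplete answer
    # Even with parallel submission, the caller waits for the slowest source.
    wait_s = max((s.latency_s for s in sources), default=0.0)
    return {"results": results, "complete": not failed, "failed": failed, "wait_s": wait_s}


def unavailable(_term: str) -> List[str]:
    raise ConnectionError("source unavailable")


# Hypothetical sources: one fast and healthy, one slow and down.
sources = [
    Source("expression_db", 0.2, {"Gene": "probe_gene"}, lambda t: [f"{t}:ERBB2"]),
    Source("screening_db", 3.0, {"Gene": "target"}, unavailable),
]

answer = federated_query("Gene", sources)
print(answer)
```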
13LSCDD
Can we do better for our purposes?
•Avoid a complex architecture and extended development effort
•Realize benefits in the near-term
•Preprocess metadata to improve efficiency
•Characterize the type of questions that the ontology should answer
•Identify stable semantic technologies; do not employ parsers
•Allow semantic and relational databases to work together
14LSCDD
What we need
Data Management and Availability
•Capturing and filtering the global and growing avalanche of internal and external scientific data
Data Fusion
•Systems to link, combine and navigate massive and heterogeneous data sets
Information Analysis and Mining
•Algorithms and tools to help scientists seek correlations and find connections between pre-clinical and clinical knowledge to generate and test translational hypotheses
15LSCDD
Data Architecture
Domain/Platform Specific Data: Expression (Affy, Agilent, Illumina), aCGH, Screening, Methylation, SNP, Mutation, Tissue Microarray, ChIP-Chip, miRNA, Analysis Results
34 platforms
Integration Layer
Experimental Metadata Repository
Ontology
Common Vocabulary
Centralized Experiment Context
30 million triples
Annotation Services (Genomics mapping + Gene functional info)
Genomics mapping
Proteome/GO
Functional Information
Experiment Context
Readout
Mapping & Annotation
Derived Results
Query, Visualization
Analysis and Mining
Algorithms, Workflow
16LSCDD
LSCDD Data integration process in use
Query Visualization
Experimental Metadata Repository
Annotation Services (Genomics mapping + Gene Function)
Data sources: Affy Expression, Agilent Expression, Illumina Expression, aCGH, Screening, RNAi Database, Mutation, SNP, TMA, Analysis Results
17LSCDD
LSCDD Semantic Integration Approach
• Use semantic technology on an appropriate problem
• Create an ontology focused on solving LSCDD integration needs
  •Scientists and IT analysts work together to iteratively create a tailored vocabulary
  •Define competency questions to validate the ontology
  •Encourage the ontology to evolve; it is a different animal than RDBMS schemas
  •Create bridges to public and internal ontologies to realize the full capabilities of the vocabulary
• Involve users in verifying the RDBMS-to-ontology mapping to increase confidence in the solution
• SPARQL is hard. Design an intuitive query model or question templates for users to navigate the repository.
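One way to read "question templates" is as parameterized SPARQL that users fill in without ever seeing the query text. The following is a minimal sketch of that idea, not the LSCDD application: the `lscdd:` namespace URI and the `hasSymbol` property are hypothetical, and linking Experiment directly to Gene through `hasGene` is an assumption about the ontology.

```python
from string import Template

# Hypothetical namespace; Experiment and hasGene are names from the ontology,
# hasSymbol is invented for this example.
PREFIX = "PREFIX lscdd: <http://example.org/lscdd#>"

QUESTION_TEMPLATES = {
    "experiments_for_gene": Template(
        PREFIX + "\n"
        "SELECT ?experiment WHERE {\n"
        "  ?experiment a lscdd:Experiment ;\n"
        "              lscdd:hasGene ?gene .\n"
        '  ?gene lscdd:hasSymbol "$symbol" .\n'
        "}"
    ),
}


def escape_literal(value: str) -> str:
    """Escape characters that would break out of a double-quoted SPARQL literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def build_query(question: str, **params: str) -> str:
    """Fill a named question template with escaped user input."""
    template = QUESTION_TEMPLATES[question]
    safe = {key: escape_literal(str(value)) for key, value in params.items()}
    return template.substitute(safe)


query = build_query("experiments_for_gene", symbol="ERBB2")
print(query)
```

Users pick a question and supply values; the SPARQL stays in one reviewed place, and escaping keeps free-text input from altering the query.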
18LSCDD
LSCDD Semantic Integration Approach (Cont)
• Used Agile philosophy throughout: application development, ontology development and mapping effort
• Drive adoption by engaging users to understand their challenges and refine the solution.
• Technologies
  •Protégé Ontology Editor
  •Oracle Semantic Technologies 11g
  •D2R Map (Database to RDF Mapping Language)
  •C# development in Visual Studio 2005
19LSCDD
Metadata RDF Repository
• Aggregates experiment metadata from a diverse set of LSCDD relational databases into an Oracle Semantic Technologies repository for LSCDD scientific investigation.
• Scientists at LSCDD now have a single source of experiment information described with a common vocabulary.
• Current data sources include:
  •Expression Data: Affymetrix, Illumina, Agilent
  •aCGH Data
  •RNAi Screening Data
  •Reagent Data
  •Gene Ontology (GO)
  •Medical Subject Headings (MeSH)
  •Many others
Currently ~30 million triples
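Aggregating relational metadata into a triple repository comes down to turning each row into subject-predicate-object statements. The sketch below shows that idea in plain Python; it is not D2R Map syntax or the Oracle loader. The namespace, column names, and column-to-property mapping are hypothetical, though the property names are ones that appear elsewhere in this deck.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

BASE = "http://example.org/lscdd#"  # hypothetical namespace

# Hypothetical mapping from relational columns to ontology properties.
EXPERIMENT_MAPPING = {
    "contact": "hasContact",
    "type": "hasType",
    "cell_line": "hasCellline",
}


def rows_to_triples(
    rows: List[Dict[str, Optional[str]]],
    id_column: str,
    rdf_class: str,
    mapping: Dict[str, str],
) -> List[Triple]:
    """Turn each relational row into one typed subject plus one triple per non-null column."""
    triples: List[Triple] = []
    for row in rows:
        subject = f"{BASE}{rdf_class}/{row[id_column]}"
        triples.append((subject, "rdf:type", BASE + rdf_class))
        for column, prop in mapping.items():
            value = row.get(column)
            if value is not None:  # skip NULL columns instead of emitting empty literals
                triples.append((subject, BASE + prop, str(value)))
    return triples


# One made-up row standing in for an experiment table.
experiment_rows = [
    {"id": "abc123", "contact": "Bill Smith", "type": "siRNA Screen", "cell_line": None}
]
experiment_triples = rows_to_triples(experiment_rows, "id", "Experiment", EXPERIMENT_MAPPING)
for triple in experiment_triples:
    print(triple)
```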
20LSCDD
LSCDD Metadata Ontology
[Ontology diagram; labels recovered from the figure]
Classes: Experiment, Project, Study, Protocol, Assay, Sample, Tissue, CellLine, Model, DiseaseState, ClinicalData, Treatment, Reagent, DNA Reagent, RNA Reagent, Protein Reagent, ViralBatch, Gene, GeneList, Probe, Chip, Chip Type, Plate, Plate Well, Hardware, Software, MESH, GO
Relations: subclass, IsPartOf, hasProject, hasStudy, hasProtocol, hasAssay, hasSample, hasSource, hasSourceTissue, hasTissue, hasCellline, hasModel, hasDiseaseState, hasTreatment, hasCompound, hasReagent, hasGene, hasGOId, hasMESHId, hasChip, hasChipType, hasPlate, hasPlateCompound
21LSCDD
Metadata Repository Application
• Both browse and query views are provided for repository access.
• The Query View allows the user to search the repository by setting constraints on attributes of the entities in the ontology.
• Links to external data sets such as Gene Ontology and MeSH have been defined, so queries may span multiple ontologies.
• Results View displays details about each of the matches found and allows user to navigate across entities.
• The application is created as a plugin to the Lilly Science Grid and can leverage Integrated Genomics Portal for Cancer Research (IGPCR) plugins to provide details about Genes in hit lists.
22LSCDD
Metadata Repository Application
Find all deacetylases involved in Colorectal Neoplasms
- Add filter to Gene Ontology Label attribute
- Add filter to MeSH Description Name attribute
- Run Query…
Results View shows list of Genes
Navigate across data links
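The two filters in this example amount to intersecting the genes linked to matching GO terms with the genes linked to matching MeSH terms. The following is a minimal in-memory sketch of that join, not the application's query engine. All identifiers and gene-to-term links are toy placeholders rather than findings; `label` and `descriptionName` are assumed attribute names, while `hasGOId` and `hasMESHId` come from the ontology.

```python
# Toy triple set: three placeholder genes linked to GO and MeSH terms.
TRIPLES = {
    ("gene:A", "hasGOId", "go:1"),
    ("gene:A", "hasMESHId", "mesh:1"),
    ("gene:B", "hasGOId", "go:1"),
    ("gene:B", "hasMESHId", "mesh:2"),
    ("gene:C", "hasGOId", "go:2"),
    ("gene:C", "hasMESHId", "mesh:1"),
    ("go:1", "label", "histone deacetylase activity"),
    ("go:2", "label", "protein kinase activity"),
    ("mesh:1", "descriptionName", "Colorectal Neoplasms"),
    ("mesh:2", "descriptionName", "Breast Neoplasms"),
}


def subjects_with(predicate, text):
    """Subjects whose literal for `predicate` contains `text` (case-insensitive)."""
    return {s for s, p, o in TRIPLES if p == predicate and text.lower() in o.lower()}


def genes_linked_to(predicate, targets):
    """Subjects that point at any of `targets` through `predicate`."""
    return {s for s, p, o in TRIPLES if p == predicate and o in targets}


def find_genes(go_label_filter, mesh_name_filter):
    """Apply both attribute filters and intersect the resulting gene sets."""
    go_terms = subjects_with("label", go_label_filter)
    mesh_terms = subjects_with("descriptionName", mesh_name_filter)
    by_function = genes_linked_to("hasGOId", go_terms)
    by_disease = genes_linked_to("hasMESHId", mesh_terms)
    return sorted(by_function & by_disease)


print(find_genes("deacetylase", "Colorectal Neoplasms"))  # prints ['gene:A']
```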
23LSCDD
Experiment Data Annotation
While raw experiment results are not suitable for editing, metadata such as experiment descriptions and relations become more valuable when users augment and refine them.
Experiment
hasId: abc123
hasContact: Bill Smith
hasType: siRNA Screen
hasDescription: ____
…
Experiment
hasId: def456
hasContact: Jane Smith
hasType: siRNA Screen
hasDescription: H460 screen
…
H460 screen: run 789
hasConflictingResults
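The annotation idea can be sketched as a small store in which users may edit descriptions and add relations while raw results stay out of reach. This is illustrative only: the class and method names are hypothetical, and two readings of the slide are assumptions, namely that "H460 screen: run 789" fills the blank description and that hasConflictingResults holds in both directions.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ExperimentMetadata:
    id: str
    contact: str
    type: str
    description: str = ""
    # Each relation is a (predicate, other experiment id) pair.
    relations: Set[Tuple[str, str]] = field(default_factory=set)


class MetadataStore:
    """Holds editable metadata only; raw experiment results are never stored here."""

    def __init__(self) -> None:
        self._experiments: Dict[str, ExperimentMetadata] = {}

    def add(self, experiment: ExperimentMetadata) -> None:
        self._experiments[experiment.id] = experiment

    def get(self, experiment_id: str) -> ExperimentMetadata:
        return self._experiments[experiment_id]

    def describe(self, experiment_id: str, text: str) -> None:
        """Let a user fill in or refine a description."""
        self.get(experiment_id).description = text

    def mark_conflict(self, first_id: str, second_id: str) -> None:
        """Record hasConflictingResults in both directions (assumed symmetric)."""
        first, second = self.get(first_id), self.get(second_id)
        first.relations.add(("hasConflictingResults", second.id))
        second.relations.add(("hasConflictingResults", first.id))


store = MetadataStore()
store.add(ExperimentMetadata("abc123", "Bill Smith", "siRNA Screen"))
store.add(ExperimentMetadata("def456", "Jane Smith", "siRNA Screen", "H460 screen"))

store.describe("abc123", "H460 screen: run 789")
store.mark_conflict("abc123", "def456")
print(store.get("abc123"))
```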
24LSCDD
IGPCR: Integrated Genomics Portal for Cancer Research
An integrated view for analysis results
Helps oncology researchers with:
•Drug target identification and prioritization
•Biomarker discovery
•Combination therapy
25LSCDD
Backup
30LSCDD
Answering scientific questions
• What is the status of the target of my interest across multiple tumor types?
• What are the right model systems to study the perturbation of my gene of interest?
• Are there any reagents available to conduct functional validation?
• Get me all the interactions for methylases that are involved in colorectal cancer. And for all these genes, get the expression and aCGH values for all colon cancer samples.
31LSCDD
Cancer drug discovery
32LSCDD
Integration of high throughput datasets
Chemosensitivity
Tumor Samples
Patient Survival
Cell lines
RNAi
Tissue Microarrays
Expression
CGH / SKY
Public / Private
Mutations
33LSCDD
Going Forward
• Integration with additional external sources: NCBI, KEGG, Proteome, PubMed
• Integration with National Cancer Institute Metathesaurus
• Continued integration with new data types generated internally or from collaborators
• Definition and support of additional ontologies
Integrated Augmented
Query Results
SNOMED
PubMed
NCI Metathesaurus
Stanford Tissue Microarray
Web Resources
Lilly Data
Labs
Internal Data
Public Data
Collaborators
Analysis Pipelines Visualizers
34LSCDD
Acknowledgements
LSCDD, Singapore
IT
•Kevin Gao, Rakhi Bhat, Srinivasulu Kota and Maurice Manning
Systems Biology
•Amit Aggarwal and Mahesh Kumar Guzuva Desikan
ICS
•Pat Hartman
HiSoft Technology – Dalian, China
•Bill Yan, Young Gong, Harold Yin, Steven Cao and Jason Wang
Lilly, Indianapolis USA
•Susie Stephens, Jacob Koehler
35LSCDD
Backup Slides
36LSCDD
Putting it all together…
Objects Measure
MTS Literature
Binding Coding
Clinical DB
Compounds
Images
Genes
SNPs
Expression
Linkage D
Signature
Fingerprint
Map 1 Map 2
37LSCDD
Silos Need to Be Broken Down
[Pipeline diagram; labels recovered from the figure]
Milestones: Target, Hit, Lead, PgS, CS, FHD, FED, PD/RD, FS, FA, FL, GL
Phases: Exploratory, Target To Hit, Hit To Lead, Lead To PgS, Lead Optimization, Pre-Clinical Development, Phase I, Phase 2, Phase 3, Registration, Launch, Global Launch
Project, Program, Product
Cycle repeated across the pipeline: Data, Transform, Model & Understand, Generate/Test Hypothesis, Analyze & Mine
38LSCDD
BACIIS System Architecture
Web Interface: Input user queries and present the query results
BACIIS Knowledge Base: Data Source Schema, Bio-Chemical Ontology
Mediator
Query Generator Module: Translate semantic-based user queries into domain-recognized terms through the ontology
Query Planning and Execution Module
  Query Planner: Decompose the user query into subqueries, define the subquery dependencies, and find the query paths
  Mapping Engine: Map each subquery to specific data source(s)
  Execution Engine: Receive data-source-specific subqueries and invoke the corresponding wrappers to fetch the data from the remote data source
Result Presentation Module: Receive and integrate the individual result sets from wrappers into HTML format and send result pages to the web interface
Wrapper (one per web database, three shown): Fetch HTML/XML pages from remote data source, extract result data
Web Database (three shown)
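The Query Planner step above (decompose, define dependencies, find an execution order) can be illustrated with a topological sort. This is a minimal sketch of the general technique, not BACIIS code: the subquery names and their dependencies are hypothetical, and it relies on `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition of one user query.
# Each subquery maps to the set of subqueries whose results it needs first.
SUBQUERY_DEPENDENCIES = {
    "genes_for_disease": set(),
    "proteins_for_genes": {"genes_for_disease"},
    "compounds_for_proteins": {"proteins_for_genes"},
    "pathways_for_genes": {"genes_for_disease"},
}


def plan(dependencies):
    """Return an execution order in which every subquery follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


execution_order = plan(SUBQUERY_DEPENDENCIES)
print(execution_order)
```

Subqueries with no mutual dependency (here the protein and pathway lookups) may come out in either order, and a circular dependency raises `graphlib.CycleError` rather than producing a plan.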
39LSCDD
Hybrid Architecture
Knowledge-Space Navigation
Presentation Services
Analytic Services
User Interface
Federation Entities
Navigational Entities
Presentation Entities
Personalization Entities
Persistence Entities
Analysis Entities
Metadata Repositories
Source Source Source Source Source
Data Access Service Layer
Navigation Service Layer
Data Set Integration Services
Metadata Services Layer
Query Preparation Service
Semantic Normalization Service
Query Submission Service
Streams Management Service
Request Brokers
Semantic Layer
Adaptive Layer
Physical Access Layer
List Management
40LSCDD
Goals
•Make knowledge emerge from repositories
•Make data more valuable by adding context
•Leverage intellectual assets
•Decision support
•Enhance productivity
•Reduce IT integration efforts