scaling the walls of discovery: using semantic metadata for integrative problem solving

Greg Tucker-Kellogg, Ph.D., Chief Technology Officer and Senior Director, Systems Biology, Lilly Singapore Centre for Drug Discovery (LSCDD)

Post on 11-Jan-2016

Page 1: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Lilly Singapore Centre for Drug Discovery

LSCDD

Scaling the walls of discovery: using semantic metadata for integrative problem solving

Greg Tucker-Kellogg, Ph.D.
Chief Technology Officer
Senior Director, Systems Biology
Lilly Singapore Centre for Drug Discovery

Page 2: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Outline

The Challenge of Translational Discovery in Pharmaceutical Research

Integration of Metadata using Semantic Web Technologies
• Why focus on metadata?
• How it helps

Examples

Page 3: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Lilly Singapore Centre for Drug Discovery

Integrative Computational Sciences (tools)

Wet lab biology

Drug Discovery (drug candidates)

Oncology and diabetes research towards tailored therapy to improve patient outcome

Systems Biology (biomarkers)

Experimental Computational

Page 4: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Pharmaceutical R&D spends more to get less

Page 5: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Lost in translation

"The limits of my language mean the limits of my world." (Ludwig Wittgenstein)

Translate (English to Chinese): 我的语言限制的范围是我的

Translate (Chinese back to English): "I limit the scope of the language I" (Ludwig Wittgenstein)

Page 6: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Translational research in cancer: Connecting the dots of genetic aberrations

Targets Disease Patients

Pathways

Tailored Therapeutics: Improve individual patient outcomes and health outcome predictability through tailoring drug, dose, timing of treatment, and relevant information

Page 7: Scaling the walls of discovery: using semantic metadata for integrative problem solving

The “Web” of heterogeneous data

Cell/Assay Technologies

Page 8: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Integrating Scientific Data Sets

Uncontrollable diversity

Most of the valuable data is from outside our walls

Much of it is poorly structured

Ranging from large (1TB/day) to boutique

Page 9: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Scientist’s View of Integrated Information

Platforms
DNA: CGH, SNP, Mutation
RNA: miRNA, mRNA
Epigenetics: Methylation, ChIP-Chip
Protein: IHC, Luminex
Omics
RNAi reagents: Qiagen siRNA, BROAD shRNA, cDNA
Acumen assays
Cellomics assays
High-content bioassays
Plate Reader
Biochemical data
Chemical biology
Functional chemogenetics
Target-based chemotype profiling
Pathway-based chemotype profiling
Mapping and annotation backbone
Interrogators, Reporters

Color code: Strategic, Cross-domain integration, Domain-level integration, Foundational

Page 10: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Manual Data Integration

A repeated, tedious process:
• Pull data from internal and public data sets
• Normalize terms and values
• Write and run analysis scripts
• Compile into a single Excel file, detached from the data source (no drill-down)

Often this process can consume days with no guaranteed resolution

Page 11: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Integration Approaches Considered

• Data Warehouse
  • Difficult to maintain and integrate new data sets
  • Difficult to evolve as data changes
  • Schemas tightly coupled to applications

• Federated queries
  • Query performance issues
  • Where to place the index?
  • Problematic to maintain
  • Translating user search syntax to all sources requires deep knowledge of data layer

• Semantic Integration
  • Relatively unproven in enterprise systems but adaptive to change
  • Relationships between data can be more fully characterized

Page 12: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Standard Semantic Integration Model

Components: Query Generator, Query Planning, Query Submission, Semantic Normalization, Data Set Integration, Results Presentation, Domain Ontology & Mappings, Sources (four shown)

• All data is mapped to the domain ontology in both directions

• If a single system is down, results are incomplete

• Performance is limited by the slowest system in the network

• Massive mapping effort

• Multiple implementations of this approach, including:
  • Biological and Chemical Integrated Information System (BACIIS)
  • Boeing
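
The two operational drawbacks listed on this slide (incomplete results when one system is down, and performance bounded by the slowest system) can be illustrated with a minimal Python sketch. It is not taken from BACIIS or any system named in the talk; the source names, records, delays, and the functions `make_source` and `federated_query` are all hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_source(name, records, delay=0.0, up=True):
    """Build a stand-in for one remote source (hypothetical, in-memory)."""
    def query(term):
        if not up:
            raise ConnectionError(f"{name} is down")
        time.sleep(delay)  # simulated network and query latency
        return [(name, value) for value in records.get(term, [])]
    return query


def federated_query(term, sources):
    """Send the same term to every source and merge whatever comes back."""
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(query, term) for name, query in sources.items()}
        for name, future in futures.items():
            try:
                results.extend(future.result())  # blocks until this source answers
            except ConnectionError:
                failed.append(name)  # the merged answer is now incomplete
    return results, failed


# Illustrative sources only; none of these names or values come from the slides.
sources = {
    "expression_db": make_source("expression_db", {"GENE_X": ["probe_1"]}, delay=0.01),
    "acgh_db": make_source("acgh_db", {"GENE_X": ["segment_7"]}, delay=0.05),
    "rnai_db": make_source("rnai_db", {"GENE_X": ["sirna_3"]}, up=False),
}
results, failed = federated_query("GENE_X", sources)
print(results)  # rows from the two sources that answered
print(failed)   # the source that was down
```

Even with the sub-queries running in parallel, the call cannot return before the slowest live source has answered, and the caller has to decide what to do with a partial answer.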

Page 13: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Can we do better for our purposes?

•Avoid a complex architecture and extended development effort

•Realize benefits in the near-term

•Preprocess metadata to improve efficiency

•Characterize the types of questions that the ontology should answer

•Identify stable semantic technologies; do not employ parsers

•Allow semantic and relational databases to work together

Page 14: Scaling the walls of discovery: using semantic metadata for integrative problem solving

What we need

Data Management and Availability
• Capturing and filtering the global and growing avalanche of internal and external scientific data

Data Fusion
• Systems to link, combine and navigate massive and heterogeneous data sets

Information Analysis and Mining
• Algorithms and tools to help scientists seek correlations and find connections between pre-clinical and clinical knowledge to generate and test translational hypotheses.

Page 15: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Data Architecture

Domain/Platform Specific Data: Expression (Affy, Agilent, Illumina), aCGH, Screening, Methylation, SNP, Mutation, Tissue Microarray, ChIP-Chip, miRNA, Analysis Results

Experiment Context, Readout, Mapping & Annotation, Derived Results

Integration Layer

Experimental Metadata Repository: Ontology, Common Vocabulary, Centralized Experiment Context

Annotation Services (Genomics mapping + Gene functional info): Genomics mapping, Proteome/GO, Functional Information

Query, Visualization

Analysis and Mining: Algorithms, Workflow

34 platforms

30 million triples

Page 16: Scaling the walls of discovery: using semantic metadata for integrative problem solving

LSCDD Data integration process in use

Query, Visualization

Experimental Metadata Repository

Annotation Services (Genomics mapping + Gene Function)

Affy Expression, Agilent Expression, Illumina Expression, aCGH, Screening, RNAi Database, Mutation, SNP, TMA, Analysis Results

Page 17: Scaling the walls of discovery: using semantic metadata for integrative problem solving

LSCDD Semantic Integration Approach

• Use semantic technology on an appropriate problem

• Create an ontology focused on solving LSCDD integration needs
  • Scientists and IT analysts work together to iteratively create a tailored vocabulary
  • Define competency questions to validate the ontology
  • Encourage the ontology to evolve; it is a different animal from an RDBMS schema
  • Create bridges to public and internal ontologies to realize the full capabilities of the vocabulary

• Involve users in verifying the RDBMS-to-ontology mapping to increase confidence in the solution.

• SPARQL is hard. Design an intuitive query model or question templates for users to navigate the repository.
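
As a rough illustration of the "question templates" idea, the Python sketch below fills a fixed SPARQL pattern from two plain-text inputs so that users never write the query themselves. The namespace, the property names `ls:label` and `ls:descriptorName`, and the way `hasGOId` and `hasMESHId` attach to `Gene` are assumptions made for the example; the slides do not give the real ontology URIs, and the query is generic SPARQL 1.1 rather than anything specific to the Oracle repository.

```python
from string import Template

# Hypothetical namespace and property names, used only for illustration.
QUESTION_TEMPLATES = {
    "genes_by_function_and_disease": Template("""\
PREFIX ls: <http://example.org/lscdd#>
SELECT DISTINCT ?gene WHERE {
  ?gene a ls:Gene ;
        ls:hasGOId ?go ;
        ls:hasMESHId ?mesh .
  ?go ls:label ?goLabel .
  ?mesh ls:descriptorName ?meshName .
  FILTER(CONTAINS(LCASE(STR(?goLabel)), "$go_term"))
  FILTER(CONTAINS(LCASE(STR(?meshName)), "$mesh_term"))
}"""),
}


def build_query(question, **params):
    """Fill a named question template with user-supplied search terms."""
    safe = {}
    for key, value in params.items():
        # Reject characters that could break out of the quoted SPARQL literal.
        if '"' in value or "\\" in value or "\n" in value:
            raise ValueError(f"unsupported character in {key!r}")
        safe[key] = value.lower()  # the template compares against LCASE(...)
    return QUESTION_TEMPLATES[question].substitute(safe)


query = build_query(
    "genes_by_function_and_disease",
    go_term="deacetylase",
    mesh_term="Colorectal Neoplasms",
)
print(query)
```

The user interface only needs to expose the template name and its two text fields; the SPARQL stays in one reviewed place.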

Page 18: Scaling the walls of discovery: using semantic metadata for integrative problem solving

LSCDD Semantic Integration Approach (Cont)

• Used Agile philosophy throughout: application development, ontology development and mapping effort

• Drive adoption by engaging users to understand their challenges and refine the solution.

• Technologies
  • Protégé Ontology Editor
  • Oracle Semantic Technologies 11g
  • D2R Map (Database to RDF Mapping Language)
  • C# development in Visual Studio 2005
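
The general idea behind a database-to-RDF mapping can be sketched in a few lines of Python. This is not D2R Map syntax and not the LSCDD mapping: the table layout, the `MAPPING` dictionary, the namespace, and the function `rows_to_triples` are hypothetical, and only the property names `hasContact`, `hasType`, and `hasDescription` are borrowed from the annotation example later in the deck.

```python
import sqlite3

BASE = "http://example.org/lscdd/"  # hypothetical namespace

# Declarative mapping: table -> (class, key column, {column: property}).
MAPPING = {
    "experiment": ("Experiment", "id", {
        "contact": "hasContact",
        "type": "hasType",
        "description": "hasDescription",
    }),
}


def rows_to_triples(conn, table):
    """Yield (subject, predicate, object) triples for every row of a mapped table."""
    cls, key, columns = MAPPING[table]
    cursor = conn.execute(f"SELECT * FROM {table}")  # table name comes from MAPPING only
    names = [d[0] for d in cursor.description]
    for row in cursor:
        record = dict(zip(names, row))
        subject = f"{BASE}{cls}/{record[key]}"
        yield (subject, "rdf:type", BASE + cls)
        for column, prop in columns.items():
            if record[column] is not None:  # NULL columns produce no triple
                yield (subject, BASE + prop, record[column])


# Toy relational source standing in for one experiment database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id TEXT, contact TEXT, type TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?, ?)",
    [("abc123", "Bill Smith", "SiRNA Screen", None),
     ("def456", "Jane Smith", "SiRNA Screen", "H460 screen")],
)
triples = list(rows_to_triples(conn, "experiment"))
for triple in triples:
    print(triple)
```

Keeping the mapping declarative is what lets users review it, as the previous slide recommends, without reading the extraction code.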

Page 19: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Metadata RDF Repository

• Aggregates experiment metadata from a diverse set of LSCDD relational databases into an Oracle Semantic Technologies repository for LSCDD scientific investigation.

• Scientists at LSCDD now have a single source of experiment information described with a common vocabulary.

• Current data sources include:
  • Expression Data: Affymetrix, Illumina, Agilent
  • aCGH Data
  • RNAi Screening Data
  • Reagent Data
  • Gene Ontology (GO)
  • Medical Subject Headings (MeSH)
  • Many others

Currently ~30 million triples

Page 20: Scaling the walls of discovery: using semantic metadata for integrative problem solving

LSCDD Metadata Ontology

[Ontology diagram; the labels recoverable from the slide are listed below.]

Classes: Experiment, Project, Study, Protocol, Assay, Sample, Tissue, CellLine, Model, DiseaseState, ClinicalData, Treatment, Compound, Plate, Plate Well, Chip, Chip Type, Probe, Gene, GeneList, Reagent, DNA Reagent, RNA Reagent, Protein Reagent, ViralBatch, Hardware, Software

Relations: hasProject, hasStudy, hasProtocol, hasAssay, hasSample, hasSource, hasSourceTissue, hasTissue, hasCellline, hasModel, hasDiseaseState, hasTreatment, hasCompound, hasPlate, hasChip, hasChipType, hasGene, hasReagent, hasGOId, hasMESHId, IsPartOf, subclass

External vocabularies: MeSH, GO

Page 21: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Metadata Repository Application

• Both browse and query views are provided for repository access.

• The Query View allows the user to search the repository by setting constraints on attributes of the entities in the ontology.

• Links to external data sets such as Gene Ontology and MeSH have been defined, so queries may span multiple ontologies.

• Results View displays details about each of the matches found and allows the user to navigate across entities.

• The application is created as a plugin to the Lilly Science Grid and can leverage Integrated Genomics Portal for Cancer Research (IGPCR) plugins to provide details about Genes in hit lists.

Page 22: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Metadata Repository Application

Find all deacetylases involved in Colorectal Neoplasms
- Add filter to Gene Ontology Label attribute
- Add filter to MeSH Description Name attribute
- Run Query…

Results View shows list of Genes

Navigate across data links
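
The two filters in this example amount to a join across the GO and MeSH links. A minimal Python sketch over an in-memory list of triples shows the logic; the gene and term identifiers are placeholders rather than repository data, the predicate names `label` and `descriptorName` are assumptions, and the real application runs against the Oracle semantic store rather than a Python list.

```python
# Toy triples; identifiers are illustrative placeholders, not repository content.
TRIPLES = [
    ("gene:G1", "hasGOId", "go:T1"),
    ("gene:G1", "hasMESHId", "mesh:M1"),
    ("gene:G2", "hasGOId", "go:T1"),
    ("gene:G2", "hasMESHId", "mesh:M2"),
    ("gene:G3", "hasGOId", "go:T2"),
    ("gene:G3", "hasMESHId", "mesh:M1"),
    ("go:T1", "label", "histone deacetylase activity"),
    ("go:T2", "label", "protein kinase activity"),
    ("mesh:M1", "descriptorName", "Colorectal Neoplasms"),
    ("mesh:M2", "descriptorName", "Breast Neoplasms"),
]


def subjects_with(triples, predicate, text):
    """Subjects whose value for `predicate` contains `text`, ignoring case."""
    return {s for s, p, o in triples if p == predicate and text.lower() in o.lower()}


def genes_matching(triples, go_text, mesh_text):
    """Genes linked to both a matching GO term and a matching MeSH descriptor."""
    go_terms = subjects_with(triples, "label", go_text)
    mesh_terms = subjects_with(triples, "descriptorName", mesh_text)
    by_go = {s for s, p, o in triples if p == "hasGOId" and o in go_terms}
    by_mesh = {s for s, p, o in triples if p == "hasMESHId" and o in mesh_terms}
    return sorted(by_go & by_mesh)  # both filters must hold


hits = genes_matching(TRIPLES, "deacetylase", "Colorectal Neoplasms")
print(hits)
```

Each filter in the Query View corresponds to one of the two `subjects_with` calls; the intersection is what the Results View would list.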

Page 23: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Experiment Data Annotation

While raw experiment results are not suitable for editing, metadata such as experiment descriptions and relations becomes more valuable when users augment and refine it.

Experiment
hasId: abc123
hasContact: Bill Smith
hasType: SiRNA Screen
hasDescription: ____

…

Experiment
hasId: def456
hasContact: Jane Smith
hasType: SiRNA Screen
hasDescription: H460 screen

H460 screen: run 789

hasConflictingResults
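
One way to keep user annotations separate from loaded metadata is to record who asserted each statement. The Python sketch below does this with a fourth field on every triple. The `MetadataStore` class, the `"source-db"` marker, and the user name are hypothetical, and which experiment each annotation on the slide belongs to is my reading of the flattened diagram, not something the slide states.

```python
class MetadataStore:
    """Tiny in-memory store of (subject, predicate, object, asserted_by) statements."""

    def __init__(self):
        self.quads = set()

    def load(self, s, p, o):
        """Record a statement extracted from a source database."""
        self.quads.add((s, p, o, "source-db"))

    def annotate(self, s, p, o, user):
        """Record a statement added by a user, keeping it distinguishable."""
        self.quads.add((s, p, o, user))

    def values(self, s, p):
        """All objects asserted for a subject and predicate, from any origin."""
        return sorted({o for s2, p2, o, _ in self.quads if (s2, p2) == (s, p)})

    def asserted_by(self, user):
        """Statements contributed by one user, as plain triples."""
        return sorted((s, p, o) for s, p, o, who in self.quads if who == user)


store = MetadataStore()
store.load("exp:abc123", "hasContact", "Bill Smith")
store.load("exp:abc123", "hasType", "SiRNA Screen")
store.load("exp:def456", "hasContact", "Jane Smith")
store.load("exp:def456", "hasType", "SiRNA Screen")
store.load("exp:def456", "hasDescription", "H460 screen")

# User annotations (assumed assignment, see the note above).
store.annotate("exp:def456", "hasDescription", "H460 screen: run 789", "example_user")
store.annotate("exp:abc123", "hasConflictingResults", "exp:def456", "example_user")

print(store.values("exp:def456", "hasDescription"))
print(store.asserted_by("example_user"))
```

Because the raw readouts never enter this store, users can refine descriptions and add relations freely without touching experiment results.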

Page 24: Scaling the walls of discovery: using semantic metadata for integrative problem solving

IGPCR: Integrated Genomics Portal for Cancer Research

An integrated view for analysis results

Helps oncology researchers with:
• Drug target identification and prioritization
• Biomarker discovery
• Combination therapy

Page 25: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Backup

Page 30: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Answering scientific questions

• What is the status of the target of my interest across multiple tumor types?

• What are the right model systems to study the perturbation of my gene of interest?

• Are there any reagents available to conduct functional validation?

• Get me all the interactions for methylases that are involved in colorectal cancer. And for all these genes, get the expression and aCGH values for all colon cancer samples.

Page 31: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Cancer drug discovery

Page 32: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Integration of high throughput datasets

Chemosensitivity

Tumor Samples

Patient Survival

Cell lines

RNAi

Tissue Microarrays

Expression

CGH / SKY

Mutations

Public / Private

Page 33: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Going Forward

• Integration with additional external sources: NCBI, KEGG, Proteome, PubMed
• Integration with National Cancer Institute Metathesaurus
• Continued integration with new data types generated internally or from collaborators
• Definition and support of additional ontologies

Diagram labels: Integrated, Augmented Query Results; SNOMED; PubMed; NCI Metathesaurus; Stanford Tissue Microarray; Web Resources; Lilly Data; Labs; Internal Data; Public Data; Collaborators; Analysis Pipelines; Visualizers

Page 34: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Acknowledgements

LSCDD, Singapore

IT
• Kevin Gao, Rakhi Bhat, Srinivasulu Kota and Maurice Manning

Systems Biology
• Amit Aggarwal and Mahesh Kumar Guzuva Desikan

ICS
• Pat Hartman

HiSoft Technology – Dalian, China
• Bill Yan, Young Gong, Harold Yin, Steven Cao and Jason Wang

Lilly, Indianapolis USA
• Susie Stephens, Jacob Koehler

Page 35: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Backup Slides

Page 36: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Putting it all together…

Objects Measure

MTS Literature

Binding Coding

Clinical DB

Compounds

Images

Genes

SNPs

Expression

Linkage D

Signature

Fingerprint

Map 1 Map 2

Page 37: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Silos Need to Be Broken Down

Milestones: Target, Hit, Lead, PgS, CS, FHD, FED, PD/RD, FS, FA, FL, GL

Stages: Target To Hit, Hit To Lead, Lead To PgS, Lead Optimization, Pre-Clinical Development, Phase 1, Phase 2, Phase 3, Registration, Launch, Global Launch

Project, Program, Product

Exploratory

Repeated for each silo on the slide: Data, Transform, Model & Understand, Generate/Test Hypothesis, Analyze & Mine

Page 38: Scaling the walls of discovery: using semantic metadata for integrative problem solving

BACIIS System Architecture

Web Interface: Input user queries and present the query results

BACIIS Knowledge Base: Data Source Schema, Bio-Chemical Ontology

Mediator

Query Generator Module: Generate semantic-based user queries into domain-recognized terms through the ontology

Query Planning and Execution Module
• Query Planner: Decompose the user query into subqueries, define the subquery dependencies, and find the query paths
• Mapping Engine: Map each subquery to specific data source(s)
• Execution Engine: Receive data-source-specific subqueries and invoke the corresponding wrappers to fetch the data from the remote data source

Result Presentation Module: Receive and integrate the individual result sets from wrappers into HTML format and send result pages to the web interface

Wrapper (three shown, one per Web Database): Fetch HTML/XML pages from remote data source, extract result data

Page 39: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Hybrid Architecture

User Interface

Knowledge-Space Navigation, Presentation Services, Analytic Services, List Management

Federation Entities, Navigational Entities, Presentation Entities, Personalization Entities, Persistence Entities, Analysis Entities

Metadata Services Layer

Metadata Repositories

Navigation Service Layer, Data Set Integration Services

Query Preparation Service, Semantic Normalization Service

Query Submission Service, Streams Management Service

Request Brokers

Semantic Layer

Adaptive Layer

Physical Access Layer

Data Access Service Layer

Sources (five shown)

Page 40: Scaling the walls of discovery: using semantic metadata for integrative problem solving

Goals

• Make knowledge emerge from repositories
• Make data more valuable by adding context
• Leverage intellectual assets
• Decision support
• Enhance productivity
• Reduce IT integration efforts