

Page 1: A Data Warehouse Platform for the Analysis of Molecular ...dbs.uni-leipzig.de/file/dwtb06-rahm.pdf · Central data management and analysis platform Data of chip-based experiments

A Data Warehouse Platform for the Analysis of Molecular-biological and Clinical Data

Erhard Rahm

H.-H. Do, M. Hartung, T. Kirsten, J. Lange

http://dbs.uni-leipzig.de

www.izbi.de

Data Warehouse Technologies in Bioinformatics (DWTB06), December 05, 2006

Interdisciplinary Center for Bioinformatics

IZBI: Bioinformatics Center of the Univ. Leipzig

Grant of the DFG-Initiative Bioinformatics

founded 2001

several working groups, including Databases / Data Integration

Initiation of the international workshop series Data Integration in the Life Sciences (DILS)

DILS2004: Leipzig (IZBI)

DILS2005: San Diego (UCSD Supercomputing Center)

DILS2006: Cambridge, UK (EBI)

DILS2007: Philadelphia (UPenn)

LNBI 2994


Agenda

Data Integration in Bioinformatics

Data characteristics

Data integration alternatives: Warehousing / Mediators / P2P

The GeWare data integration and analysis platform

System architecture

Integration of clinical data

Multidimensional data organization

Annotation management

BioFuice: Mapping-based P2P-like data integration

Summary

Data Integration in Bioinformatics

Many heterogeneous data sources

Experimental data

Experimental annotations

Clinical data

Lots of inter-connected web data sources and ontologies

Sequence data

Annotation data

Private vs. public data

Different kinds of analysis needs

Analysis of sequence data (e.g. multiple alignments)

Gene expression analysis

Pathway analysis and reconstruction

Functional profiling

Transcription analysis

Identification of transcription factor binding sites, …


High-Volume Experimental Data

High-throughput, chip-based measurement techniques

Genome-wide measurements

Gene expression data, e.g. by expression microarrays

Mutation data, e.g. Matrix-CGH arrays, SNP arrays, …

Different chip types, continuous improvement

Very voluminous raw data

Several pre-processing routines (no standard)

Different data aggregation levels (e.g. Affy probe vs. probeset expression values)

Wide spectrum of analysis methods

Statistical approaches, e.g. tests and resampling procedures, …

Data mining techniques, e.g. clustering, …

Visualizations, e.g. Heatmap, M/A plot, …

Affymetrix microarray

Clinical Data

Patient-related data and findings

Data about patients and their clinical and pathological state

Typically manually captured in hospitals

Mostly of textual nature

Relatively small volume (compared to chip-based data)

Requirements

Uniform data specification (metadata, values)

Autonomous data input (online data input)

Data integration: Patient-related findings + chip-based genetic data

Protection of patients' privacy

Utilization of existing software, e.g. for study management


Molecular-biological annotations

Annotation data vs. mapping data (cross-references)

[Figure: example entry of a gene data source, showing its source-specific ID (accession), its annotations (names, symbols, synonyms, etc.) and its references to other data sources (Enzyme, GeneOntology, OMIM, UniGene, KEGG)]

Interconnected data sources

heterogeneous schemas, formats, semantics

many, highly connected data sources and ontologies

frequent changes

incomplete data sources

common (global) database schema ???


Data integration: physical vs. virtual

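[Figure: Virtual Integration (query mediators): Source 1 … Source m … Source n, each accessed through its own wrapper (Wrapper 1 … Wrapper m … Wrapper n); a mediator with meta data sits on top of the wrappers and serves Client 1 … Client k]

[Figure: Physical Integration (Data Warehousing): operational systems are loaded by import (ETL) into the data warehouse; data marts and analysis tools build on the warehouse; meta data accompanies the warehouse]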

P2P Integration: Typical Scenario

[Figure: peers Gene Ontology, SwissProt, Ensembl, NetAffx and a local data source, connected by mappings; example questions: "Protein annotations for gene X?" and "Check GO annotation for genes of interest?"]

Bidirectional mappings between data sources instead of global schema

Queries refer to single source and are propagated to relevant peers

Adding new sources becomes simpler

Support for local data sources (e.g. private gene list)
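The propagation of a query along peer mappings can be pictured with a small sketch. This is only an illustration under stated assumptions, not the API of BioFuice or any other system: the peer names are taken from the slide, while `MAPPINGS`, `find_path`, `propagate` and all identifiers such as `my_gene_1` are hypothetical placeholders. For brevity each mapping is stored in one direction only, whereas the slide speaks of bidirectional mappings.

```python
from collections import deque

# Hypothetical instance mappings between peers: (source, target) -> {id: {ids}}.
# All identifiers are invented placeholders, not real accessions.
MAPPINGS = {
    ("Local", "NetAffx"): {"my_gene_1": {"probe_1"}, "my_gene_2": {"probe_2"}},
    ("NetAffx", "Ensembl"): {"probe_1": {"ens_gene_X"}, "probe_2": {"ens_gene_Y"}},
    ("Ensembl", "GeneOntology"): {"ens_gene_X": {"go_term_a", "go_term_b"}},
    ("Ensembl", "SwissProt"): {"ens_gene_X": {"protein_P"}},
}


def find_path(source, target):
    """Breadth-first search for a chain of peers connected by mappings."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for (src, dst) in MAPPINGS:
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    return None


def propagate(ids, source, target):
    """Translate a set of ids from one peer to another along the mapping path."""
    path = find_path(source, target)
    if path is None:
        return set()
    current = set(ids)
    for src, dst in zip(path, path[1:]):
        table = MAPPINGS[(src, dst)]
        current = {t for i in current for t in table.get(i, set())}
    return current


# "Check GO annotation for genes of interest?" asked against the local gene list.
go_terms = propagate({"my_gene_1"}, "Local", "GeneOntology")
```

Adding a new peer in this sketch only requires one more entry in `MAPPINGS`; no global schema has to be revised.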


Data integration: physical vs. virtual

Criterion                        Physical       Virtual:           Virtual:
                                 (Warehouse)    Query mediators    Peer-to-Peer

Schema integration               A priori       A priori           No schema integration
Instance data integration        A priori       At query runtime   At query runtime
Scalability to many sources      -              o                  o
Analysis of large data volumes   +              -                  -
Achievable data quality          +              o                  o
Data freshness                   o              +                  +
Source autonomy                  o              +                  +
(HW) resource requirements       -              -                  o

The GeWare System

GeWare – Genetic Data Warehouse

Central data management and analysis platform

Data of chip-based experiments (i.e. expression microarrays & Matrix-CGH arrays)

Uniform and autonomous specification of experiment annotations

Import of clinical data

Integration of gene annotations from public sources

Various methods for pre-processing, analysis and visualization

Coupling with existing tools for powerful and flexible analysis


Applications

Two collaborative cancer research studies

Molecular Mechanism in Malignant Lymphoma (MMML): http://www.lymphome.de/Projekte/MMML

German Glioma Network: http://www.gliomnetzwerk.de/

Data from several national clinical, pathological and molecular-genetics centers

Experimental and clinical data for hundreds of patients

Local research groups at the Univ. Leipzig, e.g.

Expression analysis of different types of human thyroid nodules

Expression analysis of physiological properties of mice

Analysis of factors influencing the specific binding of sequences on microarrays

System Architecture

[Figure: three-part architecture: Data Sources, Data Warehouse, Web Interface]

Data Sources: CEL files and expression/CGH matrices (CSV) for expression/mutation data; manual user input and a daily import from the study management system for experimental and clinical annotation data; public data sources (GO, Ensembl, NetAffx) with local copies, SRS and a Mapping DB for gene annotations

Data Warehouse: staging area; core data warehouse with a multidimensional data model including gene expression data, clone copy numbers, experimental and clinical annotations, and public data; data mart with expression/CGH matrices; pre-processing results; access via data im-/export, database API and stored procedures

Web Interface: data pre-processing; data analysis (canned queries, statistics, visualization); administration


GeWare – System Workflows

[Figure: the two GeWare workflows]

Import Workflow: experiment creation / selection; import of raw data; preprocessing (normalization / aggregation); import of pre-processed data; manual experiment annotation

Analysis Workflow (Closed Loop): browse / search in annotations; management of analysis objects (gene/clone groups, treatment groups, expression/CGH matrices); internal / integrated analysis (statistics, visualization, reporting); export; external analysis (functional profiling, clustering)

Integrated Analysis

Different types of pre-processing methods

MAS5, RMA, Li/Wong, …

Statistics and reporting

Different kinds of statistical analysis, e.g. Multivariate oligo-based t-Test, …

Various canned queries, e.g. for outlier detection

Visualization

M/A plot to visualize differentially expressed genes

possible differentially expressed genes
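For readers unfamiliar with the M/A plot, the two plotted quantities follow the usual definition: M is the log2 ratio of two intensities and A is their mean log2 intensity. The sketch below only illustrates that definition; it is not GeWare code, and the function names, the gene labels and the threshold of 1.0 are assumptions made for the example.

```python
import math


def ma_values(intensity_a, intensity_b):
    """Return (M, A) for one gene measured under two conditions."""
    log_a = math.log2(intensity_a)
    log_b = math.log2(intensity_b)
    m = log_a - log_b            # log ratio
    a = 0.5 * (log_a + log_b)    # mean log intensity
    return m, a


def candidate_genes(chip_a, chip_b, m_threshold=1.0):
    """List genes whose |M| reaches the threshold (1.0 means a two-fold change)."""
    hits = []
    for gene in sorted(chip_a.keys() & chip_b.keys()):
        m, _ = ma_values(chip_a[gene], chip_b[gene])
        if abs(m) >= m_threshold:
            hits.append(gene)
    return hits


# Hypothetical intensities for three genes on two chips.
chip_1 = {"g1": 8.0, "g2": 4.0, "g3": 1.0}
chip_2 = {"g1": 2.0, "g2": 4.0, "g3": 2.0}
possible_hits = candidate_genes(chip_1, chip_2)
```

Genes far from M = 0 are the "possible differentially expressed genes" marked in the plot; whether they are significant still has to be decided by a statistical test.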


Integrated Analysis cont.

Visualizations of expression values using clinical data

[Figure: heatmap of a selected gene expression matrix; columns: chips/patients (Chip 1 to Chip 25) with chip/patient dendrogram; rows: genes with gene dendrogram]

Clinical Data: Integration Architecture*

[Figure: integration of chip-based genetic data and patient-related findings]

Chip-based genetic data (gene expression data, Matrix-CGH data, lab annotation data), identified by chip ID, are managed in GeWare (Management of Chip-related Data): data warehouse, data analysis & reports, data export

Public gene/clone annotations: GO, Ensembl, NetAffx, …

Patient-related findings: clinical findings from clinical centers, pathological findings from pathological centers, location-specific genetic findings from genetics centers; linked by a common patient ID and validated by data checks

Management of Clinical Studies (eResearch Network): study repository; administration, simple reports, data export

Periodic transfer from the study repository to the data warehouse; a mapping table relates patient IDs to chip IDs

*Kirsten, T; Lange, J; Rahm, E: An integrated platform for analyzing molecular-biological data within clinical studies. Information Integration in Healthcare Application, LNCS 4254, 2006
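The role of the mapping table between patient IDs and chip IDs can be shown with a minimal sketch. It assumes plain dictionaries in place of the study repository and the warehouse; the function name and all IDs and values are hypothetical.

```python
def join_findings_with_chips(findings, patient_to_chips, expression):
    """Attach patient-related findings to chip data via the mapping table."""
    rows = []
    for patient_id, finding in findings.items():
        for chip_id in patient_to_chips.get(patient_id, []):
            if chip_id in expression:
                rows.append(
                    {
                        "chip_id": chip_id,
                        "finding": finding,
                        "expression": expression[chip_id],
                    }
                )
    return rows


# Hypothetical inputs: findings per patient, the mapping table, data per chip.
findings = {"P1": "finding_a", "P2": "finding_b"}
patient_to_chips = {"P1": ["C1", "C2"], "P3": ["C9"]}
expression = {"C1": {"geneA": 7.5}, "C2": {"geneA": 9.5}}

rows = join_findings_with_chips(findings, patient_to_chips, expression)
```

In this sketch the joined rows carry the chip ID and the finding but not the patient ID, which is one simple way to keep patient identifiers out of the analysis output.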


Annotation management

Generic approach to specify structure and vocabulary for experimental, clinical and genetic annotations

Consistent metadata instead of free text or undocumented abbreviations and naming

Manual specification of experimental annotations describing the experimental set-up and procedure: sample modifications, hybridization process, utilized devices, …

Automatic import of clinical annotations and genetic annotations

Annotation templates: collections of hierarchically structured annotation categories

permissible annotation values can be restricted to controlled vocabularies

MIAME compliant templates

Controlled vocabularies: locally developed or external (e.g. NCBI Taxonomy)

MAGE-ML export (data exchange)
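How a template with hierarchical categories and controlled vocabularies constrains annotation values can be sketched as follows. The structure is an assumption made for illustration and not the GeWare implementation: `TEMPLATE`, `validate`, the category names and the vocabulary terms are all hypothetical.

```python
# Hypothetical template: category path -> controlled vocabulary (None = free text).
TEMPLATE = {
    ("Sample", "Organism"): {"Homo sapiens", "Mus musculus"},
    ("Sample", "Tissue"): {"thyroid", "lymph node", "brain"},
    ("Hybridization", "Protocol"): None,
}


def validate(annotation, template=TEMPLATE):
    """Return a list of problems; an empty list means the annotation is valid."""
    problems = []
    for path, value in annotation.items():
        label = " / ".join(path)
        if path not in template:
            problems.append("unknown category: " + label)
            continue
        vocabulary = template[path]
        if vocabulary is not None and value not in vocabulary:
            problems.append("value not in vocabulary for " + label)
    for path in template:
        if path not in annotation:
            problems.append("missing category: " + " / ".join(path))
    return problems


good = {
    ("Sample", "Organism"): "Homo sapiens",
    ("Sample", "Tissue"): "thyroid",
    ("Hybridization", "Protocol"): "free-text protocol description",
}
bad = {
    ("Sample", "Organism"): "Homo sapiens",
    ("Sample", "Tissue"): "liver",
}
```

Restricting values to a vocabulary at input time is what later makes annotation searches reliable, since the same tissue or organism is always spelled the same way.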

Experiment Annotation: Implementation (1)

Template example

Easy specification and adaptation

Association of available vocabularies

Description


Experiment Annotation: Implementation (2)

Template example

Automatically generated web GUI

Hierarchically ordered categories

Index page

Generated page to capture annotation values

Utilization of terms of associated vocabularies

Experiment Annotation: Application

Search in experiment annotation: Create treatment groups (later reuse in analysis)

Search for relevant chips by specifying queries

Save result as group
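The search-then-save pattern can be sketched in a few lines. This is an illustrative assumption about the idea, not GeWare's interface: `search_chips`, `GroupStore`, the annotation keys and the chip names are hypothetical.

```python
def search_chips(chips, **criteria):
    """Return ids of chips whose annotations match every given criterion."""
    return sorted(
        chip_id
        for chip_id, annotations in chips.items()
        if all(annotations.get(key) == value for key, value in criteria.items())
    )


class GroupStore:
    """Keeps named chip groups so that later analyses can reuse them."""

    def __init__(self):
        self._groups = {}

    def save(self, name, chip_ids):
        self._groups[name] = tuple(chip_ids)

    def load(self, name):
        return self._groups[name]


# Hypothetical chip annotations captured through an annotation template.
chips = {
    "Chip 1": {"tissue": "thyroid", "treatment": "control"},
    "Chip 2": {"tissue": "thyroid", "treatment": "treated"},
    "Chip 3": {"tissue": "brain", "treatment": "treated"},
}

store = GroupStore()
store.save("thyroid_treated", search_chips(chips, tissue="thyroid", treatment="treated"))
store.save("all_treated", search_chips(chips, treatment="treated"))
```

A saved group is a frozen list of chip IDs, so an analysis run later uses the same chips even if new experiments have been imported in the meantime.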


Multidimensional Data Management

Fact tables: expression values for different chip types and many chips

Scalability and extensibility

Dimensions (chips/patients, genes, analysis methods)

Multidimensional analysis

Easy selection, aggregation and comparison of values

Basis to support more advanced analysis methods

Focused selection and creation of matrices

[Figure: data cube with the axes analysis methods, experiments (chips) and genes]

GeWare – Data Warehouse Model

[Figure: data warehouse model; the 1 and * cardinalities of the diagram are not recoverable from the transcript]

Annotation-related Dimensions: Experiment; Chip (sample, array, treatment, …); Treatment Group; Gene (GO function, location, pathway, ...); Gene Group; Clone (chromosomal location, …); Clone Group

Facts (Expression Data, Analysis Results): Gene Intensity and Clone Intensity in the data warehouse; Expression Matrix and CGH Matrix in the data mart

Processing-related Dimensions: Transformation Method (MAS5, RMA, Li-Wong, …); Analysis Method (clustering, classification, Westfall/Young, ...)
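A reduced version of such a star schema can be tried out with SQLite. The sketch below is an assumption-laden illustration rather than the GeWare schema: it keeps only one fact table and three dimensions, and all table names, column names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE chip (chip_id INTEGER PRIMARY KEY, name TEXT, treatment TEXT);
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE transformation_method (method_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: one intensity per chip, gene and transformation method.
    CREATE TABLE gene_intensity (
        chip_id INTEGER REFERENCES chip(chip_id),
        gene_id INTEGER REFERENCES gene(gene_id),
        method_id INTEGER REFERENCES transformation_method(method_id),
        intensity REAL,
        PRIMARY KEY (chip_id, gene_id, method_id)
    );
    """
)
conn.executemany(
    "INSERT INTO chip VALUES (?, ?, ?)",
    [(1, "Chip 1", "control"), (2, "Chip 2", "treated")],
)
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
conn.executemany(
    "INSERT INTO transformation_method VALUES (?, ?)", [(1, "MAS5"), (2, "RMA")]
)
conn.executemany(
    "INSERT INTO gene_intensity VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 7.5), (2, 1, 2, 9.5), (1, 2, 2, 4.0), (2, 2, 2, 4.5), (1, 1, 1, 180.0)],
)


def mean_intensity_by_treatment(method_name):
    """Aggregate the facts along the chip dimension for one transformation method."""
    return conn.execute(
        """
        SELECT g.symbol, c.treatment, AVG(f.intensity)
        FROM gene_intensity f
        JOIN chip c ON c.chip_id = f.chip_id
        JOIN gene g ON g.gene_id = f.gene_id
        JOIN transformation_method m ON m.method_id = f.method_id
        WHERE m.name = ?
        GROUP BY g.symbol, c.treatment
        ORDER BY g.symbol, c.treatment
        """,
        (method_name,),
    ).fetchall()
```

Calling `mean_intensity_by_treatment("RMA")` selects one slice of the cube (a single transformation method) and aggregates it per gene and treatment, which is the kind of selection and comparison the multidimensional organization is meant to make easy.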


Integration of Public Sources*

Expression Analysis: identification of relevant genes using experimental data (expression (signal) value, p-value, …)

Annotation Analysis: identification of relevant genes using annotation data (molecular function, gene location, protein (product), disease, …)

[Figure: the DWH with its analysis tools exchanges gene/clone groups with a query mediator that accesses gene annotations via SRS and the Mapping-DB]

*Kirsten, T; Rahm, E: Hybrid integration of molecular-biological annotation data. Proc. 2nd Intl. Workshop DILS, July 2005

BioFuice*

BioFuice: Bioinformatics information fusion utilizing instance correspondences and peer mappings

Based on iFuice approach for P2P data integration

P2P-like infrastructure

Mappings between autonomous data sources (peers)

Mapping: Set of instance correspondences

Simple integration of new sources

High-level operators to process mappings and objects

Mapping Mediator

Control of mapping and operator execution

Utilization of application specific semantic domain model

*Kirsten, T; Rahm, E: BioFuice: Mapping-based data integration in bioinformatics. Proc. 3rd Intl. Workshop DILS, July 2006


Script Example

Scenario

Given: Set of sequences in local source MySequences

Wanted: Three classes: unaligned sequences, non-coding sequences, protein-coding sequences

$alignedSeqMR := map( MySequences, { SeqDnaBlast } );
$codingSeqMR := compose( $alignedSeqMR, { Ensembl.SRegionExons } );

$unalignedSeqOI := diff( MySequences, domain( $alignedSeqMR ) );
$protCodingSeqOI := domain( $codingSeqMR );
$nonCodingSeqOI := diff( domain( $alignedSeqMR ), $protCodingSeqOI );

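[Figure: source MySequences (Sequence) is linked by the mapping SeqDnaBlast to Ensembl (Sequence Region), which is linked by the mapping Ensembl.SRegionExons to Ensembl (Exon); legend: LDS, PDS, mapping (same: )]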
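The logic of the script can be mimicked with plain Python sets. This is a sketch of how the operator names read, not the iFuice or BioFuice implementation: a mapping is modelled as a set of (source, target) correspondences, the operator semantics are an assumption based on the names used in the script, and all sequence, region and exon identifiers are hypothetical.

```python
def map_(objects, mapping):
    """Correspondences of the mapping whose source is among the given objects."""
    return {(s, t) for (s, t) in mapping if s in objects}


def compose(mapping_result, mapping):
    """Chain correspondences: (a, b) and (b, c) yield (a, c)."""
    return {(a, c) for (a, b) in mapping_result for (b2, c) in mapping if b == b2}


def domain(mapping_result):
    """Source objects that take part in at least one correspondence."""
    return {s for (s, _) in mapping_result}


def diff(left, right):
    """Objects in left that are not in right."""
    return set(left) - set(right)


def classify(my_sequences, seq_dna_blast, sregion_exons):
    """Split sequences into unaligned, protein-coding and non-coding ones."""
    aligned_mr = map_(my_sequences, seq_dna_blast)
    coding_mr = compose(aligned_mr, sregion_exons)
    unaligned = diff(my_sequences, domain(aligned_mr))
    protein_coding = domain(coding_mr)
    non_coding = diff(domain(aligned_mr), protein_coding)
    return unaligned, protein_coding, non_coding


# Hypothetical data: s1 aligns to a region with an exon, s2 to a region
# without one, and s3 does not align at all.
my_sequences = {"s1", "s2", "s3"}
seq_dna_blast = {("s1", "r1"), ("s2", "r2")}
sregion_exons = {("r1", "e1")}

unaligned, protein_coding, non_coding = classify(
    my_sequences, seq_dna_blast, sregion_exons
)
```

The three result sets are disjoint and together cover `my_sequences`, which mirrors the three classes asked for in the scenario.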

Conclusions

Different data integration architectures for bioinformatics needed

Data Warehousing

Virtual integration approaches (Mediators, P2P)

Combinations

GeWare

Management of a high volume of expression data and Matrix-CGH mutation data

Comprehensive support for consistent experimental annotations

Import of clinical data from study management system

Access to gene annotations from web sources

Different kinds of pre-processing methods and analysis

BioFuice

P2P-like data integration

Domain model using semantic object and mapping types

Set of high-level operators for query and mapping execution