
Post on 23-Apr-2018


Building a Data Warehouse in Biologics Research

Dr. Alex Kohn (Roche Diagnostics GmbH)

Dr. Bernhard Schirm (Quattro Research GmbH)

The Roche Group

Key Facts at a Glance

• Founded 1896 in Basel, Switzerland

• Founding families still hold majority stake

• Employing 80,000 people

• Leadership in pharmaceuticals: leading supplier of medicines for cancer and a market leader in virology

• Leadership in in vitro diagnostics

• Focus on Personalised Healthcare


Current Treatment is the Same for Most Patients

Outcomes can vary widely

Patients + Disease + Treatment = Different outcomes

Different treatment outcomes affect patients’ safety, survival and quality of life

Personalized Healthcare (PHC)

Tailors treatment to the patient


• Molecular diagnostic testing can stratify patients according to their specific genetic makeup and/or the nature of their disease or condition

• This approach improves drug safety, may increase patient survival, and may improve quality of life


The Roche Penzberg Site

One of the largest biotech centers in Europe

Divisions: Pharma & Diagnostics

Established: 1972 (Boehringer Mannheim GmbH)

1998 acquired by Roche

Employees: 4,825* FTE (Diagnostics 62%; Pharma 38%)

Area: ~350,000 m² = ~86 acres

Investments 2003 to 2011: ~€1.84 bn

* Headcount of December 2011


Pharma Research & Early Development (pRED)

Centre of excellence for therapeutic proteins

Oncology, Inflammatory & autoimmune diseases, Metabolic diseases, Central nervous system diseases, Virology



Biologics

Size and Complexity Make Them Different

• Biopharmaceuticals are at least 100-fold larger than traditional chemical products.

• Produced by living cells

• Modified during expression, incubation in bioreactor, purification and storage

• Presence of impurities (host cell proteins, DNA, endotoxins, degradation products and aggregates)

• The process is the product

Aspirin (< 200 daltons): chemical pharmaceutical

Erythropoietin (EPO) (~30,000 daltons): biopharmaceutical

Building a Data Warehouse in Biologics Research

Project Objectives

Establish a Data Warehouse as a Data Consolidation and Integration platform for all data within Biologics Research stored in relational repositories

Provide an ad-hoc query, reporting, and analysis toolset that gives users immediate access to information in the Data Warehouse.


IT-Systems in the Biologics Research Process Chain

Lab 1, Lab 2, ... Lab n

Process stages: Lab Data Acquisition, Registration & Tracking, Data Analysis

[Figure: example data analysis output with three panels: an Actual by Predicted Plot for DNA clearance (P=0,2616, RSq=0,84, RMSE=241,81); a Scaled Estimates table (Term, Scaled Estimate, Std Error, t Ratio, Prob>|t|) for the factors Load pH, Load Mass, W1 pH, W1 Cond, W2 pH, EL Cond and Flow rate; and a Contour Profiler of Load Mass against Load pH for the responses Yield, Purity, ProteinA, DNA clearance and HCP]

Data Warehouse Components

Source Systems: ELN, LIMS, Projects, Screening, Inventory


Data Capturing Source Systems

Captured objects: Fermentation, Cell Line, Vector Variant, Protein Variant, B-Cell (Hybridoma), Specimen, Animal, Fusion Sort, Purified & Characterized Protein

System | Description

Key Manager | Object and relationship management (proprietary)

Labware | Immunization management (Animals, Specimen)

TheraPS | Workflow and request tracking (proprietary)

E-Workbook / BioBook | IDBS Electronic Lab Notebook

Sample Management | Sample management including analytical data (proprietary)

Material Management | Material management (proprietary)

PI | OSIsoft: Online monitoring of fermentation processes

Data Warehouse Components

Source Systems: ELN, LIMS, Projects, Screening, Inventory

ETL

Data Warehouse

Data Marts


Extract, Transform, Load

ETL

• Environment: Mostly Oracle Systems

• Tools used

o Oracle Warehouse Builder

o Oracle Workflow

o SQL, PL/SQL

o For PI integration

− PI SDK, C++, Java

− Integrated into Oracle Warehouse Builder

o APEX

− Master Data Management

− Monitoring

Oracle Warehouse Builder

OWB

• Graphical design of ETL routines

• Modeling of process flows

• Provides documentation of ETL

• Versions used

o OWB 11gR2

o Oracle Workflow 2.6.4 to automate process flows

• Part of every 11gR2 database installation

• Only Standard Edition features are used

• OMBPLUS

o Deployments of mappings and process flows


Change Data Capturing

CDC

• Load only changed data

o No updates, only inserts

• Non-invasive

o E.g. materialized views, views with timestamps

o May need to load too much data

• Invasive

o Triggers, redo log transport

o Smaller delta, faster

o Problem: could not change some systems due to license and support issues

→ We use only the non-invasive approach
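A minimal sketch of the non-invasive, timestamp-based approach, assuming the source system exposes a view with a modification timestamp. The names `v_samples`, `stg_samples`, `modified_at` and `load_delta` are hypothetical, and SQLite stands in for the Oracle systems named in the slides so that the example is self-contained.

```python
import sqlite3

def load_delta(source, warehouse, last_loaded):
    """Copy rows changed since last_loaded and return the new high-water mark."""
    rows = source.execute(
        "SELECT sample_id, value, modified_at FROM v_samples "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_loaded,),
    ).fetchall()
    # Insert-only load: each changed version becomes a new staging row,
    # existing warehouse rows are never updated.
    warehouse.executemany(
        "INSERT INTO stg_samples (sample_id, value, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    warehouse.commit()
    return rows[-1][2] if rows else last_loaded

# Hypothetical source system with a timestamped view (the non-invasive hook).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE samples (sample_id TEXT, value REAL, modified_at TEXT)")
source.execute(
    "CREATE VIEW v_samples AS SELECT sample_id, value, modified_at FROM samples"
)
source.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("S1", 1.0, "2012-01-01T10:00:00"), ("S2", 2.0, "2012-01-02T10:00:00")],
)

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE stg_samples (sample_id TEXT, value REAL, modified_at TEXT)"
)

# Initial load, then a second run after one source row has changed.
mark = load_delta(source, warehouse, "1970-01-01T00:00:00")
source.execute(
    "UPDATE samples SET value = 1.5, modified_at = '2012-01-03T09:00:00' "
    "WHERE sample_id = 'S1'"
)
mark = load_delta(source, warehouse, mark)
row_count = warehouse.execute("SELECT COUNT(*) FROM stg_samples").fetchone()[0]
```

The second run transfers only the changed row, and the staging table keeps both versions of `S1`. When a source offers no reliable timestamp, the delta has to be derived from a wider extract, which is the "may need to load too much data" drawback listed above.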

Data Warehouse Components

Source Systems: ELN, LIMS, Projects, Screening, Inventory

ETL

Data Warehouse

Data Marts

Master Data Management

Reporting, Ad-hoc Queries, Data Mining, OLAP


Master Data Management

MDM

• Definition

o Reference terms consolidated throughout an organization

o Non-transactional data

o E.g.: Physical units, projects, parameters

• In reality

o Use of controlled vocabulary not enforced everywhere

• Role in data warehouse

o Essential for data linking and comparison

• How to solve?

o Get rid of Excel MDM

o Build MDM curation tools

o User buy-in
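What a simple curation step might look like, as a sketch only: free-text parameter names from source systems are mapped onto reference terms, and anything unknown is returned for manual review. The vocabulary entries and the function name `curate` are illustrative assumptions, not the tooling described in the talk.

```python
# Hypothetical controlled vocabulary: normalized synonym -> reference term.
PARAMETER_VOCABULARY = {
    "ic50": "IC50",
    "ic 50": "IC50",
    "ic-50": "IC50",
    "hcp": "HCP",
    "host cell protein": "HCP",
}

def curate(raw_terms, vocabulary):
    """Map raw terms to reference terms; collect unknown terms for review."""
    mapped, unmapped = {}, []
    for term in raw_terms:
        # Normalize case and whitespace before the lookup.
        key = " ".join(term.strip().lower().split())
        if key in vocabulary:
            mapped[term] = vocabulary[key]
        else:
            unmapped.append(term)
    return mapped, unmapped

mapped, unmapped = curate(
    ["ic 50", " Host Cell Protein ", "IC-50", "DNA clr."],
    PARAMETER_VOCABULARY,
)
```

Terms that end up in `unmapped` would go to a curator rather than being loaded as new reference terms, which is where curation tools and user buy-in come in.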

Challenges

1. Dimensionality

2. Genealogy

3. Heterogeneous user requirements


Dimensional Modeling

• Star model for business warehouses

• Simple fact tables

• Large amount of data

o E.g. Walmart ~1000TB

Star schema example: Transaction (fact table) with the dimensions Product, Product Group, Region, Time and Customer

OLAP Cubes

Aggregate facts along dimensions
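Aggregating facts along dimensions can be sketched in a few lines. The fact rows, the `amount` measure and the helper `roll_up` below are invented for illustration; a real warehouse would run this as SQL or inside an OLAP engine.

```python
from collections import defaultdict

# Hypothetical fact rows: dimension keys plus one additive measure.
transactions = [
    {"product": "A", "region": "North", "month": "2012-01", "amount": 100.0},
    {"product": "A", "region": "South", "month": "2012-01", "amount": 50.0},
    {"product": "B", "region": "North", "month": "2012-01", "amount": 70.0},
    {"product": "A", "region": "North", "month": "2012-02", "amount": 30.0},
]

def roll_up(facts, dimensions, measure):
    """Sum a measure over all facts that share the given dimension values."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)

by_product = roll_up(transactions, ["product"], "amount")
by_region_month = roll_up(transactions, ["region", "month"], "amount")
```

Because a sales amount is additive, any subset of dimensions gives a meaningful total, and that property is what makes pre-computed cubes attractive for business data.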


Dimensions in Biologics Research?

• High dimensionality

• Multiple facts having hierarchies

o E.g. IC50 has Hill-Slope & R²

• Warehouse size is relatively small (~1TB) compared to a finance warehouse

Diagram labels: Study, B-Cell, Binding, Device, Process, Time, Fermentation Phase, Experiment, Absorbance, Technique, Target, Species, Assay, Activity, IC50, Batch, Project

Aggregation of Data

• Summary data is a key part of data warehouses

o Pre-aggregation to optimize queries

• Straightforward for business data

o Aggregate sales numbers (mean, sum)

• Difficult for Biologics Research data

o Aggregation is only valid under a specific set of dimensions

− E.g.: Average IC50 only for the dimensions protocol, target, concentration

o No aggregation possible / desired

− Time-dependent data like cell growth

− Qualifying data like sequences

→ Aggregation of Biologics Research data is the exception
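One way to read the IC50 example is that values may only be averaged within a fixed combination of protocol, target and concentration. The sketch below enforces that reading. The data, `REQUIRED_DIMENSIONS` and `mean_ic50` are hypothetical, and the arithmetic mean is used only because the slide says "average".

```python
from collections import defaultdict
from statistics import mean

# Assumption: averaging is only meaningful when all of these are held fixed.
REQUIRED_DIMENSIONS = ("protocol", "target", "concentration")

measurements = [
    {"protocol": "P1", "target": "T1", "concentration": 10, "ic50": 2.0},
    {"protocol": "P1", "target": "T1", "concentration": 10, "ic50": 4.0},
    {"protocol": "P2", "target": "T1", "concentration": 10, "ic50": 9.0},
]

def mean_ic50(rows, group_by):
    """Average IC50 per group, refusing groupings that mix required dimensions."""
    missing = set(REQUIRED_DIMENSIONS) - set(group_by)
    if missing:
        raise ValueError(f"IC50 must not be averaged across: {sorted(missing)}")
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[d] for d in group_by)].append(row["ic50"])
    return {key: mean(values) for key, values in groups.items()}

valid = mean_ic50(measurements, ("protocol", "target", "concentration"))

# Grouping by target alone would mix protocols, so it is rejected.
try:
    mean_ic50(measurements, ("target",))
    rejected = False
except ValueError:
    rejected = True
```

A generic cube would happily compute the per-target average. The guard makes explicit that such a number has no scientific meaning here.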


The BI Approach

• OBIEE

o OLAP Cubes

o Admin Tool: Physical / Business / Presentation Layer

o Answers: Ad-hoc query and analysis

• OBIEE in the domain of Biologics Research

o Many dimensions / conditions → ok

o Most of the data can’t be aggregated because the aggregate would have no scientific meaning

→ OLAP approach not applicable



Genealogy

• In Finance we usually have a clear reference point

o Product

o Customer

• In Biologics Research reference points are contained in a complex genealogy

o Data consolidation depends on scientific need

o Different levels on which data should be viewed

− Antibody

− Hybridoma

→ OLAP / Hierarchy approach not applicable due to the curse of aggregation
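Viewing the same results at different genealogy levels can be sketched as a walk up parent links. The entity IDs, the parent-child chain (animal, specimen, hybridoma, antibody) and the function names are illustrative assumptions. Real genealogies in this domain are more complex than a single-parent tree.

```python
# Hypothetical genealogy: entity id -> (level, parent id).
ENTITIES = {
    "animal-1": ("animal", None),
    "specimen-3": ("specimen", "animal-1"),
    "hybridoma-7": ("hybridoma", "specimen-3"),
    "hybridoma-8": ("hybridoma", "specimen-3"),
    "antibody-1": ("antibody", "hybridoma-7"),
    "antibody-2": ("antibody", "hybridoma-7"),
    "antibody-3": ("antibody", "hybridoma-8"),
}

def ancestor_at(entity_id, level):
    """Walk up the parent links until an entity of the requested level is found."""
    current = entity_id
    while current is not None:
        current_level, parent = ENTITIES[current]
        if current_level == level:
            return current
        current = parent
    return None

def group_by_level(results, level):
    """Collect result values under their ancestor at the given level."""
    grouped = {}
    for entity_id, value in results:
        grouped.setdefault(ancestor_at(entity_id, level), []).append(value)
    return grouped

results = [("antibody-1", 0.8), ("antibody-2", 1.1), ("antibody-3", 2.5)]
per_antibody = group_by_level(results, "antibody")
per_hybridoma = group_by_level(results, "hybridoma")
```

The values are grouped but deliberately not aggregated: whether results from different antibodies of one hybridoma may be combined depends on the scientific question, which is why a fixed OLAP hierarchy does not fit.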



Heterogeneous User Requirements

• Initial approach: Use one system for all use cases

o e.g. OBIEE

• Not applicable due to heterogeneous user requirements

o Consolidation needs vary depending on the level on which the data should be viewed

• We ended up using multiple tools for accessing the data marts

o Tibco Spotfire

o Oracle APEX

o MS Excel

o InSilico

Online & Offline Fermentation Analysis in Spotfire


Genealogy Browser in APEX

Conclusion

• Master data management is a key success factor

• Biologics Research is different from business warehouses

o Dimensionality, aggregation and genealogy

• Most tools on the market are for business warehouses

• Query & Reporting

o No one-size-fits-all solution

o Tailored solutions for each user domain

o Agile development approach → Enabled quick user buy-in


We Innovate Healthcare