accelerating life science discovery using a high ... oslo univ hospital oct 7 2015.pdf · ibm’s...

Accelerating Life Science Discovery using a High-Performance

Analytics Platform in a Collaborative Environment

October 7, 2015

Kathy Tzeng, PhD

Worldwide Technical Lead

Healthcare & Life Sciences

IBM Systems Group

Overview

© 2015 IBM Corporation

Genomic Solution Enablement Team

Mission:

• Porting and optimizing genomics and translational applications on IBM solutions

• Developing solutions with partners

• Making IBM software and hardware available to software developers

Members:

• Independent Software Vendor (ISV) team

• Toronto Compiler Lab

• Boeblingen Development Lab

• Tokyo Research Lab

• Austin Research Lab


GENOMIC MEDICINE: from Sequencing to Personalized Healthcare

NHGRI, a branch of NIH, has defined 5 steps for genomic medicine (source: E. Green et al., Nature 470, 204–213).

Next Generation Sequencing (or other ingestion): the focus is on very large data generation, mainly from $1000 whole genome sequencing, and on data processing and reduction. This includes human, plant, animal, and microbiome genomics.

Translational Research/Early Discovery: the focus is on data integration, including genomic data, and on the analytics required to identify biomarkers, understand disease mechanisms, and identify new medical treatments.

Personalized Healthcare/Clinical Genomics: the focus is on delivering genomic medicine to patients to improve outcomes by associating patients with known genomic-specific treatments.


Predictive Response Function

• F(t): Known Traits or Environmental Features. Quantities describing population traits or environmental factors at time t
• W(t): Predictive Response Function. Model of associations between features and responses as a function of time t
• R(t): Measured Biological Response. Quantities describing response events for an organism at time t

Computational Challenges

• Feature combinatorics
• Large file sizes
• Large population sizes
• Unstructured data types

A Computationally Challenging Problem

Breakthroughs in Genomic Medicine require quantifying associations between known population traits, environmental factors, and biological responses
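The "feature combinatorics" challenge listed on this slide can be made concrete with a little counting. The sketch below is only an illustration: the feature count of one million is a hypothetical figure chosen for the example, and `interaction_count` is a helper name introduced here, not something from the presentation.

```python
from math import comb


def interaction_count(n_features: int, order: int) -> int:
    """Number of distinct feature subsets of a given size (n choose k)."""
    return comb(n_features, order)


if __name__ == "__main__":
    # Hypothetical feature count, chosen only to show how fast the search
    # space grows when features are tested in pairs or triples.
    n = 1_000_000
    for k in (1, 2, 3):
        print(f"order {k}: {interaction_count(n, k):,} candidate combinations")
```

Even at second order the number of candidate associations is in the hundreds of billions, which is one way to read why the slide treats this as a computational problem and not only a statistical one.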


Workload Challenge #1: ‘Big Data’ Analytics

Variant information requires a computationally intensive analysis of raw sequence data across thousands of genomic samples

[Figure: genomic analysis workflow, from sequencing to annotated variants]

• High-Throughput Sequencing: raw reads (file format: FastQ). Whole Human Genome: 3 billion DNA base pairs @ 30x coverage
• Assembly & Alignment (file format: BAM)
  - De Novo Assembly: SOAPdenovo, Velvet, …
  - Reference-Based Mapping: BWA, Bowtie, SOAP, …
  - Reference Genomes: TCGA, GEO, dbSNP, …
• Variant Calling (file format: VCF, 100 to 200 MB): Picard, GATK, SAMtools, SOAPsnp, …
• Variant Annotations. Annotation Tools: ANNOVAR, Gene Ontology, …
  Sample: intergenic … SNP in IL23R associated with Crohn's disease …

Other file sizes shown in the figure: ~150 GB (compressed), ~150 GB, 500 MB

Each human genome can have a few million variants

Processing time per genome: 1 to 100 hours* on 1 compute node

* Duration depends on selection of analytical tools and hardware
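As a rough worked example of the data volumes behind "3 billion DNA base pairs @ 30x coverage", the sketch below multiplies the two numbers from the slide. The byte estimate rests on an assumption that is not in the slides: that an uncompressed FASTQ record stores one character per base call and one character per quality score, with read headers ignored. Treat it as a lower bound, not a measured file size.

```python
GENOME_BASES = 3_000_000_000  # from the slide: 3 billion DNA base pairs
COVERAGE = 30                 # from the slide: 30x coverage


def sequenced_bases(genome_bases: int, coverage: int) -> int:
    """Total base calls produced when the genome is read `coverage` times."""
    return genome_bases * coverage


def fastq_payload_bytes(base_calls: int) -> int:
    """Assumed lower bound: 1 byte per base plus 1 byte per quality score."""
    return 2 * base_calls


if __name__ == "__main__":
    calls = sequenced_bases(GENOME_BASES, COVERAGE)
    payload_gb = fastq_payload_bytes(calls) / 1e9  # decimal gigabytes
    print(f"{calls:,} base calls, at least {payload_gb:.0f} GB uncompressed")
```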


Workload Challenge #2: Unstructured Information

Scientific data must be extracted from very large volumes of natural language content, biomedical images, and other unstructured data, and transformed into a structured format for analysis

• Phenotypic Data (e.g., clinical histories, medical images): "…was in good health until 2-3 months ago when she gradually developed fatigue and intermittent epigastric pain, …"
• Omics Data (variant databases): "exonic NOD2 16 … a frameshift … SNP…", "exonic GJB2 13 … associated with hearing loss …", "exonic CRYL1,GJB6 13 … a 342kb deletion"
• Scientific Literature: peer-reviewed articles, clinical guidelines, textbooks, patents

Information must be transformed into normalized structured data … for statistical analysis and relationship visualization
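A very small sketch of what "transforming text into structured data" can mean in practice is given below. It is a plain dictionary lookup with a regular expression, used here as a stand-in; it is not how Apache UIMA, IBM System T, or any tool named in this deck works. The gene symbols in the lexicon are taken from the slides, while `GENE_LEXICON` and `extract_gene_mentions` are names introduced for this example.

```python
import re

# Gene symbols that appear in this deck; a real lexicon would be far larger.
GENE_LEXICON = {"IL23R", "NOD2", "GJB2", "GJB6", "CRYL1"}

# Match any lexicon entry as a whole word.
_GENE_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(re.escape(g) for g in GENE_LEXICON)) + r")\b"
)


def extract_gene_mentions(text: str) -> list:
    """Return one structured record per gene symbol found in free text."""
    return [
        {"gene": m.group(1), "start": m.start(), "end": m.end()}
        for m in _GENE_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sentence = "SNP in IL23R associated with Crohn's disease"
    for record in extract_gene_mentions(sentence):
        print(record)
```

Each record carries character offsets so that a downstream table can point back to the source sentence, which is the kind of traceability the later slides describe as data provenance.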


Workload Challenge #3: ‘Big Data’ Integration

Discovery of genotype-phenotype associations requires an analysis of complex data types that must be integrated within a common analytical environment

Data sources: 1 Omics Data + 2 Phenotypic Data + 3 Knowledge Base

3 Knowledge Base: Electronic Text & Web Sites

1 Omics Data: Variant Calls & Annotations (VCF), for example:

##FORMAT=<ID=DP, …

##FORMAT=<ID=HQ, …

#CHROM POS ID REF ALT …

20 14370 rs6054257 G A …
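To show how a VCF record like the one above becomes a structured row, here is a minimal parsing sketch. It reads only the first five fixed columns shown in the slide's header line (CHROM, POS, ID, REF, ALT) and assumes tab-delimited input as in the VCF format; the slide text shows the columns separated by spaces. `VariantSite` and `parse_vcf_line` are names introduced for this example, and a production pipeline would more likely use an established VCF library.

```python
from typing import NamedTuple, Optional


class VariantSite(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str


def parse_vcf_line(line: str) -> Optional[VariantSite]:
    """Parse the first five columns of a VCF data line.

    Header and meta-information lines (starting with '#') return None.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-separated columns")
    chrom, pos, variant_id, ref, alt = fields[:5]
    return VariantSite(chrom, int(pos), variant_id, ref, alt)


if __name__ == "__main__":
    print(parse_vcf_line("20\t14370\trs6054257\tG\tA"))
```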

2 Phenotypic Data: Clinical Features, Environmental Factors, Biological Responses

Patient-Centric Logical Data Model (‘Big’ Data Warehouse Environment, RDBMS and/or NoSQL)

• Data domains: 1 Genotypic Data (VCF), 2 Phenotypic Data, 3 Knowledge Base
• Keys: Patient ID, Variant ID, Phenotype ID
• Elements: Patient Population, Variant List, Detail on a Single Variant, Observed Traits & Responses, Observation Detail
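One possible reading of the patient-centric model is a set of tables joined on Patient ID, Variant ID, and Phenotype ID. The sketch below uses an in-memory SQLite database purely for illustration. Every table and column name is an assumption made for this example, as are the patient and phenotype values; only the variant ID `rs6054257` comes from the slide.

```python
import sqlite3

# Hypothetical schema: one table per key in the slide's logical data model.
SCHEMA = """
CREATE TABLE patient     (patient_id TEXT PRIMARY KEY);
CREATE TABLE variant     (variant_id TEXT PRIMARY KEY, detail TEXT);
CREATE TABLE phenotype   (phenotype_id TEXT PRIMARY KEY, detail TEXT);
CREATE TABLE genotype    (patient_id TEXT REFERENCES patient(patient_id),
                          variant_id TEXT REFERENCES variant(variant_id));
CREATE TABLE observation (patient_id TEXT REFERENCES patient(patient_id),
                          phenotype_id TEXT REFERENCES phenotype(phenotype_id));
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO patient VALUES (?)", [("P1",), ("P2",)])
    conn.execute("INSERT INTO variant VALUES (?, ?)", ("rs6054257", "placeholder"))
    conn.execute("INSERT INTO phenotype VALUES (?, ?)", ("PH1", "placeholder"))
    conn.executemany(
        "INSERT INTO genotype VALUES (?, ?)",
        [("P1", "rs6054257"), ("P2", "rs6054257")],
    )
    conn.execute("INSERT INTO observation VALUES (?, ?)", ("P1", "PH1"))
    return conn


def patients_with(conn: sqlite3.Connection, variant_id: str, phenotype_id: str) -> list:
    """Patients who carry the variant AND have the phenotype recorded."""
    rows = conn.execute(
        """SELECT g.patient_id
             FROM genotype g
             JOIN observation o ON o.patient_id = g.patient_id
            WHERE g.variant_id = ? AND o.phenotype_id = ?
            ORDER BY g.patient_id""",
        (variant_id, phenotype_id),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(patients_with(build_demo_db(), "rs6054257", "PH1"))
```

Keying both genotype and observation tables on the patient is what lets a genotype-phenotype association query reduce to a join, whichever RDBMS or NoSQL store is chosen.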


Key Capabilities

Leading biomedical research organizations are asking for technology capabilities that will give them a low-cost solution to accelerate scientific discovery in Genomic Medicine

• Flexible, scalable, and low-cost high-performance compute and storage solutions capable of efficiently processing rapidly growing quantities of genomic and other types of complex life science data

• Seamless integration of complex life science data types on a common analytical platform

• Rapid extraction and analysis of unstructured language content from very large volumes of clinical and scientific documents

• Metadata collection capabilities providing detailed audit trails as source data are transformed into analytical results

• Tools for scientific collaboration that enable data and workload sharing across organizational and geographic boundaries in a secure environment that ensures data privacy


A Foundation for Computational Science

IBM’s Reference Architecture for Genomic Medicine supports ‘big data’ computational research on a foundation of HPC compute, storage, and workload management capabilities

Research Applications (IBM Research, IBM Watson, IBM Business Partners)

• Genomic Analysis Pipelines, Computational Modeling, Image Analysis: performance optimization for open source and commercial analytics applications
• Text Analytics / NLP (Apache UIMA, IBM System T): text analytics for the conversion of natural language concepts into structured data entities

‘Big Data’ Foundation

• ‘Big’ Data Warehouse (IBM BigInsights, IBM Business Partners): RDBMS or NoSQL database environments enabling rapid processing of large volumes of complex high-dimensional data structures in a data warehouse
• Workload Orchestration with Metadata Capture (IBM Platform Computing, IBM Business Partners): intelligent resource allocation, sharing, and monitoring across parallel HPC workloads
• Data Management: File System & Storage / ILM (IBM Spectrum Scale / Elastic Storage Server): low-cost, low-latency, easy-access storage & archiving of data and metadata across heterogeneous environments
• LAN


IBM Systems Facilitate Scientific Collaboration

Data management and analytics tools can be accessed and shared across heterogeneous systems in on-premise and cloud environments

[Figure: a local data center linked to external collaborators in heterogeneous environments]

• Users: On-Premise Users, Private Cloud Users, Public Cloud Users
• Local Data Center: On-Premise Cluster, HPC Network (10GbE or InfiniBand), 1/10 GbE
• External Collaborators (Heterogeneous Environments): Virtual Private Clouds, Encrypted VPN, WAN, Workload Burst
• Shared layers: Applications; ‘Big’ Data Warehouse; Workload Orchestration with Metadata Capture; Data Management: File System / Storage ILM

The ‘Big Data’ foundation enables data access, data management, and HPC workload orchestration across heterogeneous on-premise, private cloud, public cloud, and hybrid cloud environments


A framework for NGS and HPC Systems Architecture

[Figure: layered systems architecture]

• Users, Devices
• Application areas: Genomics, Translational, Personalized Healthcare, …

Platform Software Stack

• Access: Application & Workflow, File & Database, Visualization, System & Log
• AppCenter (PAC, Galaxy, DataBiology, Lab7)
• Orchestrator (ASC/EGO, LSF, Symphony, PPM)
• Datahub (Spectrum Scale, Zato, Nirvana)
• Workload Orchestration, HPC Management Suite

Platforms

• Compute: HPC Cluster, Big Data, Spark Cluster, Openstack, Docker; Scale-out cluster, Scale-up SMP
• Storage: SSD/Flash, FC/IB Attached, Low-cost Storage, HA/DR Storage, Cloud Storage; Active Archive (TSM/LTFS/HPSS); Spectrum Scale, ESS


IBM Genomics Reference Architecture

The IBM Reference Architecture is an ecosystem of data management and analytics tools developed by IBM and industry-leading commercial and open source software providers

Edico Genome


BioBuilds – Open Source Bioinformatics

• Turn-key: Pre-built binaries and complete build scripts enable easy deployment

• Optimized: POWER8 binaries provide the best performance for your hardware

• Ready for the Clinic: A single source for tools streamlining verification and audit

• Long Term Support: Community sponsorship and support contracts ensure ongoing support for tools

http://biobuilds.org/

Open Source bioinformatics tools for research, commercial, and regulated environments.


2014.11

• ALLPATHS-LG

• Bedtools

• Bfast

• BLAST (NCBI)

• Bowtie

• Bowtie2

• BWA

• Cufflinks

• FastQC

• HMMER

• HTSeq

• Mothur

• Numpy

• PICARD

• PLINK

• Python

• SAMTools

• SOAP3-DP

• SOAPDenovo

• SQLite

• Tabix

• TopHat

• Velvet/Oases

2015.02

• R

• Bioconductor

• FASTA

• Trinity

• SHRiMP

Updated tools

• HMMER (LE)

• OpenSSL

• IGV

• iRODS

• RNAStar

• ISAAC

• TMAP

• SOAPaligner/soap2

Updated tools

• Bowtie2

• BWA

• OpenSSL

2015.04

Open Source Application Portfolio in BioBuilds


https://www.broadinstitute.org/gatk/blog?id=4833

Optimization of GATK from Broad Institute

IBM works with genomics leaders to improve performance of analytical

workflows like GATK on IBM Power 8 Systems


Steps                 Intel Runtime*   IBM Runtime
BWA                   7                3.88
Samtools              5                3.18
MarkDuplicates        11               7.46
RealignTargets        1                0.23
IndelRealigner        6.5              0.75
BaseRecalibrator      1.3              1.13
PrintReads+Index      12.3             2.48
Preprocessing Total   44               19.09
HaplotypeCaller                        2.03
Total                                  21.12
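The per-step ratios implied by the runtime table can be computed directly. The sketch below copies the preprocessing rows from the slide and makes no claim about the time unit, which the slide does not state. The names `PREPROCESSING` and `speedup` are introduced here for illustration.

```python
# (Intel runtime, IBM runtime) per preprocessing step, copied from the slide.
PREPROCESSING = {
    "BWA": (7.0, 3.88),
    "Samtools": (5.0, 3.18),
    "MarkDuplicates": (11.0, 7.46),
    "RealignTargets": (1.0, 0.23),
    "IndelRealigner": (6.5, 0.75),
    "BaseRecalibrator": (1.3, 1.13),
    "PrintReads+Index": (12.3, 2.48),
}


def speedup(intel: float, ibm: float) -> float:
    """How many times shorter the IBM runtime is than the Intel runtime."""
    return intel / ibm


if __name__ == "__main__":
    for step, (intel, ibm) in PREPROCESSING.items():
        print(f"{step:<18} {speedup(intel, ibm):5.2f}x")
    # Totals as printed on the slide (44 and 19.09); the rounded rows sum to
    # slightly different values, so the slide's own totals are used here.
    print(f"{'Preprocessing':<18} {speedup(44, 19.09):5.2f}x")
```

The rows show that the gain is uneven across steps, so the overall ratio depends heavily on which tools dominate a given pipeline.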

Note*: http://library.wolfram.com/infocenter/Conferences/9045/Intel_LifeSciences_Personalized_Medicine_Wolfram%202014_Paolo%20Narvaez.pdf

Input Dataset: G15512.HCC1954.1, coverage: 65x

Both IBM and Intel solution: # of Machines = 1, # of cores/Machine = 24

IBM Solution: 3.325 GHz Power8 with GPFS

Optimization of Broad’s Best Practice Pipeline

~ 65X Whole Human Genome analysis done within a day

~ 150X Whole Exome analysis done in 3.45 hours


Performance of L3 Bioinformatics BALSA on Power 8 with GPU

Power8 3.32 GHz, 2x k40 GPU and GPFS


IO Cache Library to Optimize Performance of Genomics Application

IBM uses a File Cache Library to improve I/O performance and reduce workflow runtimes

Application: Illumina’s Casava V. 1.8 (BCL to FASTQ)
Data Set: 8 lanes of HiSeq data

Without cache library: Elapsed Time = 1730 min
With cache library: Elapsed Time = 107 min


Speed of the file system matters

Bowtie2: NGS Benchmarks on 2.6 GHz iDataPlex with GPFS and NFS
Elapsed Time in Minutes, lower is better

GPFS   NFS
119    437

Accelerating Genomics Applications using GPFS

IBM and BIOVIA’s Pipeline Pilot scale genomic analysis from the

desktop to the enterprise using IBM GPFS


Genomic Workflow Optimization

Typical Genomic Sequencing Workflow – Command Line

• bwa aln -t 12 -l 40 -n 3 -k 2

• bwa sampe -a 700 -P -o 1000

• samtools view -bt

• samtools sort

• Picard: java -Xmx8g -Djava.io.tmpdir MarkDuplicates.jar METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR

• Picard: java -Xmx8g -Djava.io.tmpdir AddOrReplaceReadGroups.jar SORT_ORDER=coordinate RGID=sample_lane RGLB=sample RGPL=illumina RGPU=lane RGSM=sample RGCN=center_name CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT TMP_DIR

• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T RealignerTargetCreator -nt 1

• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T IndelRealigner -targetIntervals -known 1000G_biallelic.indels.hg19.vcf

• Picard: java -Xmx8g -Djava.io.tmpdir FixMateInformation.jar SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true TMP_DIR

• Gatk lite: java -Xmx#{JAVA_REQMEM}g -Djava.io.tmpdir -T CountCovariates -recalFile -knownSites:dbsnp,VCF /gpfs/gpfs1/GENOME/SNP_INDEL_VCF/dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate

• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T TableRecalibration -recalFile -sMode SET_Q_ZERO -solid_nocall_strategy THROW_EXCEPTION -nback 7 --baq RECALCULATE

• Gatk lite: java -Xmx4g -jar $GATK_BIN/GenomeAnalysisTK.jar -glm BOTH -R $REFERENCE -T UnifiedGenotyper -I recalibrated.bam



IBM Platform Process Manager facilitates genomic workflow execution


Runs                1st Set     2nd Set    3rd Set     4th Set     Total Sets
1 set on 8 nodes    10.06 hrs   ------     ------      ------      10.06 hrs
4 sets on 8 nodes   19.02 hrs   20.9 hrs   21.26 hrs   25.07 hrs   25.10 hrs

Data Set: 37x coverage of whole human genomes

Workflow Input: 74 fastq.gz files; Workflow Output: recalibrated BAM file

Dependency steps: using the LSF bsub -w option
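The dependency mechanism mentioned here can be sketched as follows: each step is submitted with `bsub -J <name>`, and every step after the first adds `-w "done(<previous name>)"` so that LSF holds it until the preceding job finishes successfully. The sketch only builds the command lines and does not submit anything. The job names and script paths are hypothetical placeholders, and `build_bsub_chain` is a helper introduced for this example.

```python
import shlex


def build_bsub_chain(steps: list) -> list:
    """Build bsub argument lists for a linear chain of dependent jobs.

    `steps` is an ordered list of (job_name, command) pairs. Each job after
    the first waits for the previous one via the -w dependency expression.
    """
    commands = []
    previous = None
    for name, command in steps:
        argv = ["bsub", "-J", name]
        if previous is not None:
            argv += ["-w", f"done({previous})"]
        argv.append(command)
        commands.append(argv)
        previous = name
    return commands


if __name__ == "__main__":
    # Hypothetical step names and wrapper scripts, for illustration only.
    workflow = [
        ("align", "./align.sh"),
        ("sort", "./sort.sh"),
        ("markdup", "./markdup.sh"),
    ]
    for argv in build_bsub_chain(workflow):
        print(shlex.join(argv))
```

Keeping the chain as data makes it easy to print and review the submissions before running them, which matters when one workflow fans out over dozens of input files.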


IBM Platform LSF workload scheduler is linked to the Process Manager and maximizes the utilization of HPC resources to improve workflow runtimes
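The utilization claim can be checked against the numbers in the runs table with simple arithmetic. The sketch assumes, as a baseline that the slides do not state, that running the four sets one after another would take four times the single-set time. `throughput_gain` is a name introduced for this example.

```python
def throughput_gain(single_set_hours: float, n_sets: int,
                    concurrent_total_hours: float) -> float:
    """Ratio of assumed back-to-back runtime to measured concurrent runtime."""
    sequential_hours = single_set_hours * n_sets
    return sequential_hours / concurrent_total_hours


if __name__ == "__main__":
    # From the table: 1 set takes 10.06 hrs; 4 concurrent sets finish in 25.10 hrs.
    gain = throughput_gain(10.06, 4, 25.10)
    print(f"4 concurrent sets: {gain:.2f}x the throughput of running them in sequence")
```

Each individual set runs slower when four share the same 8 nodes, but under this assumption the cluster completes more work per hour overall.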


Data Compression Appliance

Compression Algorithms: compression ratio (lossless) and speed/throughput

• gzip on Power 8 with FPGA board (available now)
  Compression ratio: on average 1:3 for fastq files
  Speed/throughput: 2.5 GB/s on average (200 GB of fastq can be compressed in 80 seconds)

• CRAM
  Compression ratio: 1:2 to 1:4 with respect to BAM files, depending on the sequencing depth and other factors (the ratio from FASTQ to compressed BAM is 16X)
  Speed/throughput: more than 10 times speedup achieved using 12 cores (approximately 0.5 GB/min); FPGA acceleration is ongoing
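The throughput figure quoted for gzip can be verified with one division, and a lossless compression ratio can be measured with Python's standard `gzip` module. This is a software-only sketch and does not model the FPGA board. The helper names are introduced here, and a ratio measured on a small repetitive buffer says nothing about real FASTQ files.

```python
import gzip


def seconds_to_compress(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time needed to compress `size_gb` at a sustained throughput."""
    return size_gb / throughput_gb_per_s


def gzip_ratio(data: bytes) -> float:
    """Uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


if __name__ == "__main__":
    # Slide figures: 200 GB of fastq at 2.5 GB/s on average.
    print(f"{seconds_to_compress(200, 2.5):.0f} seconds")
    # Synthetic, highly repetitive input; used only to exercise the function.
    print(f"ratio on synthetic data: {gzip_ratio(b'ACGT' * 1000):.1f}:1")
```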

• The Pistoia compression contest was held in 2012. James Bonfield of the Sanger Institute won with a 1:9 compression ratio and 0.1 GB/min

• CRAM was released by EBI in late 2012 to compress BAM files and was accepted by the Global Alliance for Genomics and Health

• IBM is collaborating with the Sanger Institute and EBI on improving compression for genomics data: Samtools, Picard, CRAM

Source: Baker M., Nature Methods 7, 495 - 499 (2010)


IBM works with Lab7 to deliver data provenance with performance, reliability and security


>187_29_706_F3

T23302010303131123123022203111123200210100122001

102

T22211130023020133231323302310303131123123022201

211

>187_29_829_F3

T23302010003130123123022203111120122123202132301

212

>187_29_858_F3

T23302010303131123123022203111123222123122122321

212

>

Experimental Design, Sample Prep, Sequencing, Mapping, Analysis, Meta Analysis, Reporting

Workflow Engine

Federated Data Engine

Pipeline Engine

Visualization/EDA

Sample LIMS

User Experience

Sample Data Reference Pipeline Attribute Sheet

IBM Power System Solution with GPFS and Platform LSF delivers:

• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security:
  - Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime
  - IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems

Lab7 ESP

• Comprehensive software platform: combines LIMS and informatics functionalities
• Data provenance: maintains continuous data provenance by:
  - Tracking the history of samples, analyses, and results
  - Providing detailed audit trails
• Sequencing platform flexibility: manages data generated from any sequencing platform

Enterprise Data Management


IBM Power System Solution with GPFS and Platform LSF delivers:

• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security:
  - Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime
  - IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems

[Figure: Databiology for Enterprise Functional Architecture]

Layers: Information Management, Interface, Orchestration

• 3 C’s (Configure, Command, Collaborate)
• Ontologies, Annotation, Samples, Comments + Attachments, Roles + Access, Shopping Basket
• Social, Scientific, Lifecycle Management, Meta Information, Financial + Resource Mgmt, Task Management, Project Management
• Applications: Import, Analysis, Visualization
• Infrastructure: Network, Storage, Compute
• Configuration, Instruments
• Compute and Storage: Softlayer, LSF, GPFS
• Transport: DBE Download Manager; S3, SCP, RSync, SFTP, FTP, HTTP
• Logic: Version Control + Reproducible, Data Provenance
• Everything as an app: Scripts, Binaries, Pipelines, Workflow Management, Virtual Machines
• Portal, API, Custom Web Apps via API, DBE Multiprot, Email + WF Integration, Identity Management

Databiology for Enterprise

• SaaS + customer-specific instances
• Central hub to manage all ‘omics data and to orchestrate all activities
• Functionally rich and oriented to key steps in the R&D life cycle
• Insight to Instrument with best-in-class applications
• Easy integration with existing environments
• Automatic data provenance and reporting
• Cost-neutral deployment
• Gradual roll-out / low risk

Data Provenance with Performance, Reliability and Security


tranSMART - Optimized on Power8 and Spectrum Scale

• tranSMART associates genotypic & phenotypic data for complex analytics

• Watson Explorer extracts insight from scientific literature and data records and provides enrichment to tranSMART’s analysis

https://www.dropbox.com/s/9qw2kr339cl0mie/wats_tran.mp4?dl=0


tranSMART Power8 Deployment Architecture

[Figure: deployment components and connections]

• Users: Browser
• Power8: Web Server (Apache2); Application Server (Tomcat 7) running tranSMART; I2b2 Application Server; PostgreSQL tranSMART DB on GPFS; Solr Full Text index; R Analytics Tools; Gene Patterns; PLINK
• Watson Analytics Server: Watson Analytics Application
• Connections: HTTP, JDBC, Quartz Job Call


Dataset      No. Records
TCGA_OV      5,789,632
Simulation   40,774,968
GSE32583     942,724
GSE13168     1,203,282
GSE1456      3,600,555
GSE15258     4,702,050

Accelerate tranSMART ETL by Power8/Spectrum Scale


Zato’s Scalable Data Federation Solution for Healthcare and Genomics Data

Spanning Data Centers in Parallel with a Single Pane of Glass for Clinical and Research Applications on Power 8 and GPFS

[Figure: federated data sources]

• Data sources: NIH Data, CDC Data, NLM Data, Lab Results, Imaging Data, Radiology Reports, Microbiology Reports, Nursing Home Records, Claims Data, Electronic Health Record Data, Genomic Data, Accepted Medical Knowledge
• Connections: Internet, VPN, LAN


Thank You



Noblis BioVelocity is Developed and Optimized on IBM’s Power 8
