accelerating life science discovery using a high ... oslo univ hospital oct 7 2015.pdf · ibm’s...
TRANSCRIPT
Accelerating Life Science Discovery using a High-Performance
Analytics Platform in a Collaborative Environment
October 7, 2015
Kathy Tzeng, PhD
Worldwide Technical Lead
Healthcare & Life Sciences
IBM Systems Group
Overview
© 2015 IBM Corporation
Genomic Solution Enablement Team
Mission:
• Porting and Optimization of Genomics/Translational applications on IBM solutions
• Developing Solutions with Partners
• Making IBM SW/HW available to Software developers
Members:
• Independent Software Vendor (ISV) team
• Toronto Compiler Lab
• Boeblingen Development Lab
• Tokyo Research Lab
• Austin Research Lab
GENOMIC MEDICINE: from Sequencing to Personalized Healthcare
NHGRI, a branch of NIH, has defined 5 steps for genomic medicine (source: E. Green et al., Nature 470, 204–213).
• Next Generation Sequencing (or other ingestion): the focus is on very large data generation, mainly from $1000 whole genome sequencing, and on data processing and reduction. Includes human, plant, animal, and microbiome genomics.
• Translational Research/Early Discovery: the focus is on data integration, including genomic data, and the analytics required to identify biomarkers, understand disease mechanisms, and identify new medical treatments.
• Personalized Healthcare/Clinical Genomics: the focus is on delivering genomic medicine to patients to improve outcomes by associating patients with known genomic-specific treatments.
A Computationally Challenging Problem
Breakthroughs in Genomic Medicine require quantifying associations between known population traits, environmental factors, and biological responses
• F(t), Known Traits or Environmental Features: quantities describing population traits or environmental factors at time t
• W(t), Predictive Response Function: model of associations between features and responses as a function of time t
• R(t), Measured Biological Response: quantities describing response events for an organism at time t
Computational Challenges
• Feature combinatorics
• Large file sizes
• Large population sizes
• Unstructured data types
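The "feature combinatorics" challenge can be made concrete with a small count. The sketch below is illustrative only: the feature counts are hypothetical, and the slide does not say how W(t) is actually modeled. It simply counts how many feature subsets of size 1 up to k an association model would have to consider.

```python
from math import comb


def interaction_terms(n_features: int, max_order: int) -> int:
    """Count the feature subsets of size 1..max_order among n_features."""
    return sum(comb(n_features, k) for k in range(1, max_order + 1))


# Hypothetical feature counts, chosen only to show how quickly the count grows.
for n in (10, 100, 1000):
    pairs = interaction_terms(n, 2)    # single features plus pairs
    triples = interaction_terms(n, 3)  # plus three-way combinations
    print(f"{n} features: up to pairs = {pairs}, up to triples = {triples}")
```

Even before file sizes and population sizes are taken into account, the number of candidate associations grows polynomially with the order of interaction considered.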
Workload Challenge #1: ‘Big Data’ Analytics
Variant information requires a computationally intensive analysis of raw sequence data across thousands of genomic samples
Whole Human Genome: 3 billion DNA base pairs @ 30x coverage
• High-Throughput Sequencing: Raw Reads, File Format FastQ
• Assembly & Alignment: File Format BAM
• De Novo Assembly: SOAPdenovo, Velvet, …
• Reference-Based Mapping: BWA, Bowtie, SOAP, …
• Reference Genomes: TCGA, GEO, dbSNP, …
• Variant Calling: File Format VCF (100 to 200 MB); Picard, GATK, SAMtools, SOAPsnp, …
• Variant Annotations: Annotation Tools ANNOVAR, Gene Ontology, …
• Sample annotation: intergenic … SNP in IL23R associated with Crohn's disease …
File sizes along the pipeline: ~ 150 GB (compressed), ~ 150 GB, 100 to 200 MB (VCF), 500 MB
Each human genome can have a few million variants
Processing time per genome: 1 to 100 hours* on 1 compute node
* Duration depends on selection of analytical tools and hardware
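The figures on this slide lend themselves to a quick back-of-the-envelope calculation. The sketch below uses the slide's numbers (3 billion base pairs, 30x coverage, 1 to 100 hours per genome on one node); the cohort size of 1,000 genomes is a hypothetical stand-in for "thousands of genomic samples".

```python
GENOME_BASE_PAIRS = 3_000_000_000  # from the slide
COVERAGE = 30                      # from the slide
HOURS_PER_GENOME = (1, 100)        # from the slide, on 1 compute node


def sequenced_bases(genome_bp: int, coverage: int) -> int:
    """Total bases read when every position is covered `coverage` times on average."""
    return genome_bp * coverage


def node_hours(n_genomes: int, hours_per_genome: tuple) -> tuple:
    """Best-case and worst-case single-node hours to process a cohort."""
    low, high = hours_per_genome
    return n_genomes * low, n_genomes * high


N_GENOMES = 1_000  # hypothetical cohort size, not from the slide

total_bases = sequenced_bases(GENOME_BASE_PAIRS, COVERAGE)
low, high = node_hours(N_GENOMES, HOURS_PER_GENOME)
print(f"Bases sequenced per genome: {total_bases:,}")
print(f"Node-hours for {N_GENOMES:,} genomes: {low:,} to {high:,}")
```

The two-orders-of-magnitude spread in node-hours is why the choice of analytical tools and hardware matters at cohort scale.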
Workload Challenge #2: Unstructured Information
Scientific data must be extracted from very large volumes of natural language content, biomedical images, and other unstructured data, and transformed into a structured format for analysis
• Phenotypic Data (e.g., Clinical Histories, Medical Images): …was in good health until 2-3 months ago when she gradually developed fatigue and intermittent epigastric pain, …
• Omics Data (Variant Databases): exonic NOD2 16 … a frameshift … SNP… exonic GJB2 13 … associated with hearing loss … exonic CRYL1,GJB6 13 … a 342kb deletion
• Scientific Literature: Peer-Reviewed Articles, Clinical Guidelines, Textbooks, Patents
Information must be transformed into normalized structured data … for statistical analysis and relationship visualization
Workload Challenge #3: ‘Big Data’ Integration
Discovery of genotype-phenotype associations requires an analysis of complex data types that must be integrated within a common analytical environment
1 Omics Data: Variant Calls & Annotations (VCF)
##FORMAT=<ID=DP, …
##FORMAT=<ID=HQ, …
#CHROM POS ID REF ALT …
20 14370 rs6054257 G A …
2 Phenotypic Data: Clinical Features, Environmental Factors, Biological Responses
3 Knowledge Base: Electronic Text & Web Sites
‘Big’ Data Warehouse Environment: RDBMS and/or NoSQL
Patient-Centric Logical Data Model
• Patient Population
• Genotypic Data: Variant List, Detail on a Single Variant
• Phenotypic Data: Observed Traits & Responses, Observation Detail
• Knowledge Base
• Keys: Patient ID, Variant ID, Phenotype ID
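A minimal sketch of the patient-centric idea: parse the VCF data line shown on the slide, then link genotypic and phenotypic records through a shared patient ID. The patient ID, the in-memory dictionaries, and the phenotype values are hypothetical placeholders for illustration; a real warehouse would use the RDBMS or NoSQL environment the slide names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str


def parse_vcf_line(line: str):
    """Parse the first five fixed VCF columns; return None for header lines."""
    if line.startswith("#"):
        return None
    chrom, pos, variant_id, ref, alt = line.split()[:5]
    return Variant(chrom, int(pos), variant_id, ref, alt)


# Header and data lines as shown on the slide (remaining columns elided there).
vcf_lines = [
    "##FORMAT=<ID=DP, ...",
    "#CHROM POS ID REF ALT",
    "20 14370 rs6054257 G A",
]
variants = [v for v in map(parse_vcf_line, vcf_lines) if v is not None]

# Hypothetical records keyed by Patient ID (not taken from the slides).
genotypic_data = {"patient-001": [variants[0].variant_id]}
phenotypic_data = {"patient-001": ["fatigue", "epigastric pain"]}


def patient_view(patient_id: str) -> dict:
    """Join genotypic and phenotypic data for one patient."""
    return {
        "patient_id": patient_id,
        "variant_ids": genotypic_data.get(patient_id, []),
        "observed_traits": phenotypic_data.get(patient_id, []),
    }


print(patient_view("patient-001"))
```

Keeping Variant ID as a separate key lets per-variant knowledge base detail be attached without duplicating it for every patient.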
Key Capabilities
Leading biomedical research organizations are asking for technology capabilities that will give them a low-cost solution to accelerate scientific discovery in Genomic Medicine
• Flexible, scalable, and low-cost high-performance compute and storage solutions capable of efficiently processing rapidly growing quantities of genomic and other types of complex life science data
• Seamless integration of complex life science data types on a common analytical platform
• Rapid extraction and analysis of unstructured language content from very large volumes of clinical and scientific documents
• Metadata collection capabilities providing detailed audit trails as source data are transformed into analytical results
• Tools for scientific collaboration that enable data and workload sharing to cross organizations and geographic boundaries in a secure environment that ensures data privacy
A Foundation for Computational Science
IBM’s Reference Architecture for Genomic Medicine supports ‘big data’ computational research on a foundation of HPC compute, storage, and workload management capabilities
Research Applications (IBM Research, IBM Watson, IBM Business Partners)
• Genomic Analysis Pipelines
• Computational Modeling
• Image Analysis
• Text Analytics / NLP (Apache UIMA, IBM System T): Text Analytics for the conversion of natural language concepts into structured data entities
• Performance optimization for open source and commercial analytics applications
‘Big Data’ Foundation
• ‘Big’ Data Warehouse (IBM BigInsights, IBM Business Partners): RDBMS or NoSQL database environments enabling rapid processing of large volumes of complex high-dimensional data structures in a data warehouse
• Workload Orchestration with Metadata Capture (IBM Platform Computing, IBM Business Partners): Intelligent resource allocation, sharing, and monitoring across parallel HPC workloads
• Data Management: File System & Storage / ILM (IBM Spectrum Scale / Elastic Storage Server): Low-cost, low-latency, easy-access storage & archiving of data and metadata across heterogeneous environments
IBM Systems Facilitate Scientific Collaboration
Data management and analytics tools can be accessed and shared across heterogeneous systems in on-premise and cloud environments
• Local Data Center: On-Premise Users, On-Premise Cluster
• External Collaborators (Heterogeneous Environments): Private Cloud Users, Public Cloud Users, Virtual Private Clouds
• Network labels: HPC Network, WAN, Encrypted VPN, 10GbE or InfiniBand, 1/10 GbE
• Shared layers: Applications, ‘Big’ Data Warehouse, Workload Orchestration with Metadata Capture, Data Management: File System / Storage ILM
• Workload Burst
The ‘Big Data’ foundation enables data access, data management, and HPC workload orchestration across heterogeneous on-premise, private cloud, public cloud, and hybrid cloud environments
Workload Orchestration
• Workloads: Genomics, Translational, Personalized Healthcare, …
• Access: Application & Workflow, File & Database, Visualization, System & Log
• Platforms: AppCenter (PAC, Galaxy, DataBiology, Lab7); Orchestrator (ASC/EGO, LSF, Symphony, PPM); Datahub (Spectrum Scale, Zato, Nirvana)
• Compute: HPC Cluster, Big Data, Spark Cluster, Openstack, Docker
• Storage: SSD/Flash, FC/IB Attached, Low-cost Storage, HA/DR Storage, Cloud Storage
A framework for NGS and HPC Systems Architecture
• Users and Devices
• Scale-out cluster
• Scale-up SMP
• HPC Management Suite
• Platform Software Stack
• Spectrum Scale, ESS
• Active Archive (TSM/LTFS/HPSS)
IBM Genomics Reference Architecture
The IBM Reference Architecture is an ecosystem of data management and analytics tools developed by IBM and industry-leading commercial and open source software providers
Edico Genome
BioBuilds – Open Source Bioinformatics
• Turn-key: Pre-built binaries and complete build scripts enable easy deployment
• Optimized: POWER8 binaries provide the best performance for your hardware
• Ready for the Clinic: A single source for tools streamlining verification and audit
• Long Term Support: Community sponsorship and support contracts ensure ongoing support for tools
http://biobuilds.org/
Open Source bioinformatics tools for research, commercial, and regulated environments.
2014.11
• ALLPATHS-LG
• Bedtools
• Bfast
• BLAST (NCBI)
• Bowtie
• Bowtie2
• BWA
• Cufflinks
• FastQC
• HMMER
• HTSeq
• Mothur
• Numpy
• PICARD
• PLINK
• Python
• SAMTools
• SOAP3-DP
• SOAPDenovo
• SQLite
• Tabix
• TopHat
• Velvet/Oases
2015.02
• R
• Bioconductor
• FASTA
• Trinity
• SHRiMP
Updated tools
• HMMER (LE)
• OpenSSL
• IGV
• iRODS
• RNAStar
• ISAAC
• TMAP
• SOAPaligner/soap2
Updated tools
• Bowtie2
• BWA
• OpenSSL
2015.04
Open Source Application Portfolio in BioBuilds
https://www.broadinstitute.org/gatk/blog?id=4833
Optimization of GATK from Broad Institute
IBM works with genomics leaders to improve performance of analytical
workflows like GATK on IBM Power 8 Systems
Optimization of Broad’s Best Practice Pipeline
Steps                 Intel Runtime*   IBM Runtime
BWA                   7                3.88
Samtools              5                3.18
MarkDuplicates        11               7.46
RealignTargets        1                0.23
IndelRealigner        6.5              0.75
BaseRecalibrator      1.3              1.13
PrintReads+Index      12.3             2.48
Preprocessing Total   44               19.09
HaplotypeCaller       -                2.03
Total                 -                21.12
Note*: http://library.wolfram.com/infocenter/Conferences/9045/Intel_LifeSciences_Personalized_Medicine_Wolfram%202014_Paolo%20Narvaez.pdf
Input Dataset: G15512.HCC1954.1, coverage: 65x
Both IBM and Intel solution: # of Machines = 1, # of cores/Machine = 24
IBM Solution: 3.325 GHz Power8 with GPFS
• ~ 65X Whole Human Genome analysis done within a day
• ~ 150X Whole Exome analysis done in 3.45 hours
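The per-step comparison above reduces to simple ratios. The sketch below recomputes them from the slide's figures; because only ratios are taken, the runtime unit (which the slide does not state) cancels out. Summing the individual steps gives 44.1 and 19.11, which differ slightly from the slide's stated preprocessing totals of 44 and 19.09, presumably because of rounding.

```python
# (Intel runtime, IBM runtime) per preprocessing step, as listed on the slide.
runtimes = {
    "BWA": (7.0, 3.88),
    "Samtools": (5.0, 3.18),
    "MarkDuplicates": (11.0, 7.46),
    "RealignTargets": (1.0, 0.23),
    "IndelRealigner": (6.5, 0.75),
    "BaseRecalibrator": (1.3, 1.13),
    "PrintReads+Index": (12.3, 2.48),
}

# Speedup = Intel runtime divided by IBM runtime for the same step.
speedups = {step: intel / ibm for step, (intel, ibm) in runtimes.items()}

intel_total = sum(intel for intel, _ in runtimes.values())
ibm_total = sum(ibm for _, ibm in runtimes.values())

for step, ratio in sorted(speedups.items(), key=lambda item: -item[1]):
    print(f"{step:<18} {ratio:5.2f}x")
print(f"{'Preprocessing':<18} {intel_total / ibm_total:5.2f}x")
```

Sorting by ratio makes it easy to see which steps contribute most of the gain and which are nearly unchanged.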
Performance of L3 Bioinformatics BALSA on Power 8 with GPU
Power8 3.32 GHz, 2x k40 GPU and GPFS
IO Cache Library to Optimize Performance of Genomics Application
IBM uses a File Cache Library to improve I/O Performance and reduce workflow runtimes
Application: Illumina’s Casava V. 1.8 (BCL to FASTQ)
Data Set: 8 lanes of HiSeq data
• Without cache library: Elapsed Time = 1730 min
• With cache library: Elapsed Time = 107 min
Accelerating Genomics Applications using GPFS
IBM and BIOVIA’s Pipeline Pilot scale genomic analysis from the desktop to the enterprise using IBM GPFS
Speed of the file system matters
Bowtie2: NGS Benchmarks on 2.6 GHz iDataPlex with GPFS and NFS (Elapsed Time in Minutes, lower is better)
• GPFS: 119
• NFS: 437
Genomic Workflow Optimization
Typical Genomic Sequencing Workflow – Command Line
• bwa aln -t 12 -l 40 -n 3 -k 2
• bwa sampe -a 700 -P -o 1000
• samtools view -bt
• samtools sort
• Picard: java -Xmx8g -Djava.io.tmpdir MarkDuplicates.jar METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR
• Picard: java -Xmx8g -Djava.io.tmpdir AddOrReplaceReadGroups.jar SORT_ORDER=coordinate RGID=sample_lane RGLB=sample RGPL=illumina RGPU=lane RGSM=sample RGCN=center_name CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT TMP_DIR
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T RealignerTargetCreator -nt 1
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T IndelRealigner -targetIntervals -known 1000G_biallelic.indels.hg19.vcf
• Picard: java -Xmx8g -Djava.io.tmpdir FixMateInformation.jar SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true TMP_DIR
• Gatk lite: java -Xmx#{JAVA_REQMEM}g -Djava.io.tmpdir -T CountCovariates -recalFile -knownSites:dbsnp,VCF /gpfs/gpfs1/GENOME/SNP_INDEL_VCF/dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T TableRecalibration -recalFile -sMode SET_Q_ZERO -solid_nocall_strategy THROW_EXCEPTION -nback 7 --baq RECALCULATE
• Gatk lite: java -Xmx4g -jar $GATK_BIN/GenomeAnalysisTK.jar -glm BOTH -R $REFERENCE -T UnifiedGenotyper -I recalibrated.bam
Genomic Workflow Optimization
IBM Platform Process Manager facilitates genomic workflow execution
Genomic Workflow Optimization
IBM Platform LSF workload scheduler is linked to the Process Manager and maximizes the utilization of HPC resources to improve workflow runtimes
Runs                1st Set     2nd Set    3rd Set     4th Set     Total Sets
1 set on 8 nodes    10.06 hrs   ------     ------      ------      10.06 hrs
4 sets on 8 nodes   19.02 hrs   20.9 hrs   21.26 hrs   25.07 hrs   25.10 hrs
Data Set: 37x coverage of whole human genomes
Workflow Input: 74 fastq.gz files
Workflow Output: Recalibrated Bam file
Dependency steps = Using LSF bsub -w option
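The dependency mechanism mentioned for this workflow (the LSF `bsub -w` option) can be sketched as follows. This is an illustration under stated assumptions, not the workflow used in the benchmark: the stage names and shell scripts are hypothetical, and the sketch only builds command strings rather than submitting jobs. It relies on LSF's `-J` option to name a job and on the `done(job_name)` dependency expression accepted by `-w`.

```python
import shlex


def chain_with_dependencies(stages):
    """Build one `bsub` command line per stage; each waits for the previous stage.

    `stages` is a list of (job_name, shell_command) pairs in execution order.
    """
    commands = []
    previous = None
    for name, command in stages:
        parts = ["bsub", "-J", name]
        if previous is not None:
            # Start only after the previous job finishes successfully.
            parts += ["-w", f"done({previous})"]
        parts.append(command)
        commands.append(" ".join(shlex.quote(part) for part in parts))
        previous = name
    return commands


# Hypothetical stage names and scripts, for illustration only.
stages = [
    ("align", "./align.sh sample1"),
    ("sort", "./sort.sh sample1"),
    ("markdup", "./markdup.sh sample1"),
]
for line in chain_with_dependencies(stages):
    print(line)
```

Because the scheduler holds each job until its dependency completes, several sample sets can be queued at once and share the same nodes, which is the situation the 4-set run above measures.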
Data Compression Appliance
gzip on Power 8 with FPGA board (available now)
• Compression ratio (lossless): on average 1:3 for fastq files
• Speed/throughput: 2.5GB/s on average (200 GB fastq can be compressed in 80 seconds)
CRAM
• Compression ratio (lossless): 1:2 to 1:4 with respect to BAM files, depending on the sequencing depth and other factors (from FASTQ to compressed BAM ratio is 16X)
• Speed/throughput: achieved beyond 10 times speed up using 12 cores (approximately 0.5GB/min). FPGA acceleration is ongoing.
• The Pistoia compression contest was held in 2012. James Bonfield of Sanger Institute won with a 1:9 compression ratio and 0.1GB/min
• CRAM was released in late 2012 by EBI to compress BAM files and is accepted by the Global Alliance for Genomics and Health.
• IBM is collaborating with Sanger Institute and EBI on improving compression for genomics data: Samtools, Picard, CRAM
Source: Baker M., Nature Methods 7, 495 - 499 (2010)
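The gzip figures on this slide can be checked with simple arithmetic. The sketch below assumes that "1:3" means the compressed file is about one third of the original size; the slide does not spell out that reading.

```python
def compression_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to compress `size_gb` of input at a sustained throughput."""
    return size_gb / throughput_gb_per_s


def compressed_size_gb(size_gb: float, ratio: float) -> float:
    """Output size for a 1:ratio compression (assumed: output = input / ratio)."""
    return size_gb / ratio


FASTQ_GB = 200          # from the slide
GZIP_GB_PER_S = 2.5     # from the slide (gzip on Power 8 with FPGA board)
GZIP_RATIO = 3          # from the slide, "on average 1:3 for fastq files"

seconds = compression_seconds(FASTQ_GB, GZIP_GB_PER_S)
output_gb = compressed_size_gb(FASTQ_GB, GZIP_RATIO)
print(f"Compression time: {seconds:.0f} s")
print(f"Approximate compressed size: {output_gb:.1f} GB")
```

The computed time matches the 80 seconds quoted on the slide for a 200 GB fastq file.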
Enterprise Data Management
IBM works with Lab7 to deliver data provenance with performance, reliability and security
Sample read data:
>187_29_706_F3
T23302010303131123123022203111123200210100122001
102
T22211130023020133231323302310303131123123022201
211
>187_29_829_F3
T23302010003130123123022203111120122123202132301
212
>187_29_858_F3
T23302010303131123123022203111123222123122122321
212
Lab7 ESP
• Stages: Experimental Design, Sample Prep, Sequencing, Mapping, Analysis, Meta Analysis, Reporting
• Components: Sample LIMS, Workflow Engine, Pipeline Engine, Federated Data Engine, Visualization/EDA, User Experience
• Inputs: Sample Data, Reference, Pipeline, Attribute Sheet
• Comprehensive software platform: combines LIMS and informatics functionalities
• Data provenance: maintains continuous data provenance by tracking the history of samples, analyses, and results, and by providing detailed audit trails
• Sequencing platform flexibility: manages data generated from any sequencing platform
IBM Power System Solution with GPFS and Platform LSF delivers:
• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security: Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime; IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems
Data Provenance with Performance, Reliability and Security
Databiology for Enterprise
• SaaS + customer specific instances
• Central hub to manage all ‘omics data and to orchestrate all activities
• Functionally rich and orientated on key steps in R&D life cycle
• Insight to Instrument with best in class applications
• Easy integration with existing environments
• Automatic data provenance and reporting
• Cost neutral deployment
• Gradual roll-out / Low risk
Databiology for Enterprise Functional Architecture
• 3 C’s (Configure, Command, Collaborate)
• Layers: Information Management, Interface, Orchestration
• Scientific, Social: Ontologies, Annotation, Samples, Comments + Attachments, Roles + Access, Shopping Basket
• Lifecycle Management: Meta Information, Financial + Resource Mgmt, Task Management, Project Management
• Applications: Import, Analysis, Visualization
• Infrastructure: Network, Storage, Compute
• Configuration: Instruments
• Compute and Storage: Softlayer, LSF, GPFS
• Transport: DBE Download Manager; S3, SCP, RSync, SFTP, FTP, HTTP
• Logic: Version Control + Reproducible, Data Provenance
• Everything as an app: Scripts, Binaries, Pipelines, Workflow Management, Virtual Machines
• Portal, API, Custom Web Apps via API, DBE Multiprot, Email + WF Integration, Identity Management
IBM Power System Solution with GPFS and Platform LSF delivers:
• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security: Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime; IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems
tranSMART - Optimized on Power8 and Spectrum Scale
• tranSMART associates genotypic & phenotypic data for complex analytics
• Watson Explorer extracts insight from scientific literature and data records and provides enrichment to tranSMART’s analysis
https://www.dropbox.com/s/9qw2kr339cl0mie/wats_tran.mp4?dl=0
tranSMART Power8 Deployment Architecture
• Users: Application Browser
• Web Server (Apache2)
• Application Server (Tomcat 7): tranSMART
• I2b2 Application Server
• tranSMART DB: PostgreSQL, GPFS
• Tools: R Analytics, Gene Patterns, PLINK, Solr Full Text index
• Watson Analytics Server: Watson Analytics
• Connections: HTTP, JDBC, Quartz Job Call
• Platform: Power8
Accelerate tranSMART ETL by Power8/Spectrum Scale
Dataset       TCGA_OV     Simulation   GSE32583   GSE13168    GSE1456     GSE15258
No. Records   5,789,632   40,774,968   942,724    1,203,282   3,600,555   4,702,050
Zato’s Scalable Data Federation Solution for Healthcare and Genomics Data
Spanning Data Centers in Parallel with a Single Pane of Glass for Clinical and Research Applications on Power 8 and GPFS
• Data sources: NIH Data, CDC Data, NLM Data, Lab Results, Imaging Data, Radiology Reports, Microbiology Reports, Nursing Home Records, Claims Data, Electronic Health Record Data, Genomic Data, Accepted Medical Knowledge
• Connections: Internet, VPN, LAN
Thank You
Noblis BioVelocity is Developed and Optimized on IBM’s Power 8