

HiTSeq’15, July 10–11, 2015

CS-BWAMEM: A Fast and Scalable Read Aligner at the Cloud Scale for Whole Genome Sequencing

Available at: https://github.com/ytchen0323/cloud-scale-bwamem

(1) Computer Science Department, UCLA, USA; (2) Dept. of Molecular and Medical Genetics, Oregon Health & Science University, USA

Yu-Ting Chen (1), Jason Cong (1), Jie Lei (1), Sen Li (1), Myron Peto (2), Paul Spellman (2), Peng Wei (1), and Peipei Zhou (1)

Contact: [email protected]

Chart: BWA-MEM vs. CS-BWAMEM. Runtime in hours (0 to 20) versus number of nodes, comparing BWA-MEM on 1 node against CS-BWAMEM on 1, 3, 6, 12, and 25 nodes.

Motivation: Build a Fast and Scalable Read Aligner for WGS Data


Target: whole-genome sequencing

A huge amount of data (500M to 1B reads for 30x coverage)

• Pair-end sequencing

Goal: improve the speed of whole-genome sequencing

Limitation of state-of-the-art aligners

BWA-MEM takes about 10 hours to align WGS data with multi-threaded parallelization in a single server

The speed is limited by single-node computation power and I/O bandwidth when the data size is huge

Methods

Proposed tool: Cloud-Scale BWAMEM

Leverage the BWA-MEM algorithm

Exploit the enormous parallelism by using cloud infrastructures

Use the MapReduce programming model in a cluster

Most commonly used for big data analytics

Good for large-scale deployment: handles the enormous parallelism of input reads

Computation infrastructure: Spark

In-memory MapReduce system

Cache intermediate data in memory for future steps, avoiding unnecessary slow disk I/O accesses

Storage infrastructure: Hadoop Distributed File System (HDFS)

HDFS brings scalable I/O bandwidth, since aggregate disk I/O scales linearly with the size of the cluster

Store the FASTQ input in a distributed fashion; Spark reads the data from HDFS before processing (see the sketch below)
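To make this data path concrete, here is a minimal Spark sketch in Scala. It is not the actual CS-BWAMEM code: the HDFS path and the four-lines-per-record grouping are illustrative assumptions. It reads the FASTQ input from HDFS and caches it in memory for the later alignment stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadReadsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cs-bwamem-load-sketch"))

    // FASTQ previously uploaded from the driver node to HDFS (path is illustrative).
    val fastqLines = sc.textFile("hdfs:///user/aligner/sample_1.fastq")

    // Group every 4 FASTQ lines (name, sequence, '+', quality) into one read record.
    val reads = fastqLines
      .zipWithIndex()
      .map { case (line, idx) => (idx / 4, (idx % 4, line)) }
      .groupByKey()
      .map { case (_, fields) => fields.toSeq.sortBy(_._1).map(_._2) }

    // Cache in memory so downstream map/reduce stages avoid re-reading from disk.
    reads.cache()
    println(s"FASTQ records loaded: ${reads.count()}")
    sc.stop()
  }
}
```

Caching the grouped reads is what lets the subsequent alignment stages run without re-reading roughly 300GB from disk.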

Tool Highlight

Provide scalable speedup for read alignment: users can choose an adequate number of nodes in their cluster based on their alignment performance target

Speed of aligning a whole-genome sample: under 80 minutes for whole-genome data (30x coverage; ~300GB) on a 25-node cluster with 300 cores, versus 9.5 hours for BWA-MEM

Features

Support both pair-end and single-end alignment

Achieve similar quality to BWA-MEM

Input: FASTQ files

Output: SAM (single-node) or ADAM (cluster) format

Cloud-Scale BWAMEM (CS-BWAMEM) system architecture (figure)

Pair-end short reads (FASTQ, ~300GB) are uploaded in advance from the driver node's local Linux file system to HDFS and stored in a distributed fashion across Nodes 1 to n

The reference genome (FASTA, ~6GB) is broadcast from the driver node to all workers (see the broadcast sketch below)

Each of Nodes 1 to n runs a Spark worker that performs the BWA-MEM computation; the master node and worker nodes are connected by 10GB Ethernet
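A minimal sketch of the broadcast step, assuming a hypothetical ReferenceGenome container and loadFasta helper rather than the real CS-BWAMEM index structures; the point is only that the ~6GB reference is shipped once to each Spark worker as a read-only broadcast variable instead of being resent with every task.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical container for the reference (packed FASTA sequence; a real aligner
// would also hold the BWT/FM-index structures).
case class ReferenceGenome(name: String, packedSeq: Array[Byte])

object BroadcastRefSketch {
  // Load the reference on the driver node's local file system, then broadcast it.
  def broadcastReference(sc: SparkContext, fastaPath: String): Broadcast[ReferenceGenome] = {
    val ref = loadFasta(fastaPath)   // runs once, on the driver
    sc.broadcast(ref)                // each worker receives one read-only copy
  }

  // Illustrative loader: reads the raw FASTA bytes (no indexing).
  def loadFasta(path: String): ReferenceGenome = {
    val bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(path))
    ReferenceGenome(path, bytes)
  }
}
```

Worker-side tasks then access the reference through the broadcast handle's .value method.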

CS-BWAMEM design: two MapReduce stages (figure)

Input: raw reads (FASTQ); output: aligned reads (SAM/ADAM)

Stage 1, Map: on each node, BWA seeding followed by Smith-Waterman (SW) extension for each read; Reduce: calculate pair-end statistics

Stage 2, MapPartitions (batched processing): batched SW' with pre-computed reference segments, using JNI + P-SW implemented in C (Intel SSE vector engines); Reduce: collect results at the driver or write to HDFS
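The sketch below mirrors this two-stage structure in Spark-style Scala. ReadPair, alignRead, and batchedSmithWaterman are hypothetical placeholders (the real seeding and extension code is BWA-MEM's, and the batched SW' kernel is called through JNI into SSE-vectorized C), so this shows the data flow rather than the actual implementation.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast

// Hypothetical record/result types, for illustration only.
case class ReadPair(seq1: String, seq2: String)
case class Alignment(pos1: Long, pos2: Long, score: Int)
case class PairEndStats(meanInsert: Double, stdInsert: Double)

object TwoStageSketch {
  // Stage 1, Map: BWA seeding + Smith-Waterman extension for each read pair,
  // using the broadcast reference on every worker.
  def stage1Map(reads: RDD[ReadPair], ref: Broadcast[Array[Byte]]): RDD[(ReadPair, Alignment)] =
    reads.map(r => (r, alignRead(r, ref.value)))

  // Stage 1, Reduce: aggregate pair-end (insert-size) statistics across all pairs.
  def pairEndStats(aligned: RDD[(ReadPair, Alignment)]): PairEndStats = {
    val inserts = aligned.map { case (_, a) => math.abs(a.pos2 - a.pos1).toDouble }.cache()
    val n    = inserts.count().toDouble
    val mean = inserts.sum() / n
    val std  = math.sqrt(inserts.map(x => (x - mean) * (x - mean)).sum() / n)
    PairEndStats(mean, std)
  }

  // Stage 2, MapPartitions: batched SW' per partition, where a JNI call into the
  // SSE-vectorized C kernel (with pre-computed reference segments) would go.
  def stage2(aligned: RDD[(ReadPair, Alignment)], stats: PairEndStats): RDD[Alignment] =
    aligned.mapPartitions { it =>
      val batch = it.toArray
      batchedSmithWaterman(batch, stats).iterator
    }

  // Placeholder kernels so the sketch compiles; they do not implement real alignment.
  def alignRead(r: ReadPair, ref: Array[Byte]): Alignment = Alignment(0L, 0L, 0)
  def batchedSmithWaterman(batch: Array[(ReadPair, Alignment)],
                           stats: PairEndStats): Array[Alignment] = batch.map(_._2)
}
```

The final reduce corresponds to either collecting the results at the driver or writing them out to HDFS (for example with saveAsTextFile, or an ADAM/Parquet writer in the cluster case).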

Collaborations

Test our aligner on the data from our collaborators

Oregon Health & Science University (OHSU)

Spellman Lab

• Application: cancer genomics and precision medicine

UCLA

Coppola Lab

• Application: neuroscience

Neurodegenerative conditions, including Alzheimer’s Disease (AD), Frontotemporal Dementia (FTD), and Progressive Supranuclear Palsy

Fan Lab

University of Michigan, Ann Arbor; University of Michigan Comprehensive Cancer Center (UMCCC)

• Application: motility-based cell selection for understanding cancer metastasis

Our Project & Cancer Genome Applications

A large genome collection of a healthy population and cancer patients: big data, compute-intensive

A genomic analysis pipeline running on a supercomputer in a rack with customized accelerators

Gene mutations discovered in codon reading frames

Alignment Quality

Comparison with BWA-MEM

Flow: CS-BWAMEM / BWA-MEM -> sort -> mark duplicates -> indel realignment -> base recalibration

Detailed analysis on two cancer exome samples: a buffy sample and a primary tumor sample

Almost the same mapping quality as BWA-MEM

Mapped reads (CS-BWAMEM vs. BWA-MEM):

Buffy sample: 99.59% vs. 99.59%

Primary sample: 99.28% vs. 99.28%

Difference in total reads (CS-BWAMEM vs. BWA-MEM):

Buffy sample: 0.006%

Primary sample: 0.04%

Cluster Deployment

25 worker nodes and one master/driver node (hosting the HDFS master and Spark master), connected through a 10GbE switch

Server node setting: Intel Xeon server with two E5-2620 v3 CPUs, 64GB DDR3/4 RAM, and a 10GbE NIC

Software infrastructure

Spark 1.3.1 (v0.9 – v1.3.0 tested)

Hadoop 2.5.2 (v2.4.1 tested)

Hardware acceleration (ongoing project)

PCIe-based FPGA board

Examples: Smith-Waterman / Burrows-Wheeler transform kernels

Performance

Input: whole-genome data sample (30x coverage; about 300GB of FASTQ files)

7x speedup over BWA-MEM: can align whole-genome data within 80 minutes

Users can adjust the cluster size based on their performance demand
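As a sanity check, the 7x figure follows from the runtimes reported above (9.5 hours for BWA-MEM versus under 80 minutes for CS-BWAMEM on 25 nodes):

\[
\text{speedup} \approx \frac{9.5\ \text{h} \times 60\ \text{min/h}}{80\ \text{min}} = \frac{570}{80} \approx 7.1\times
\]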
