PAGE: A Framework for Easy Parallelization of Genomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
IPDPS 2014, Phoenix, Arizona
Motivation
• The sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
Motivation
• Big data problem
– The 1000 Genomes Project has already produced 200 TB of data
– Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Typical Analysis on Genomic Data
• Single Nucleotide Polymorphism (SNP) calling
– Reference (positions 1 to 8): A G C G T A C C
– Alignment File-1 reads: Read-1 A G C G; Read-2 G C G G; Read-3 G C G T A; Read-4 C G T T C C
– Alignment File-2 reads: Read-1 A G A G; Read-2 A G A G T; Read-3 G A G T; Read-4 G T T C C
*Adapted from Wikipedia
• A single SNP may cause Mendelian disease!
Outline
• Motivation
• Existing Solutions for Implementation
• Our Work
• Experimental Evaluation
• Conclusion
Existing Solutions for Implementation
• Serial tools
– SamTools, VCFTools, BedTools: file merging, sorting, etc.
– VarScan: SNP calling
• Parallel implementations
– Turboblast: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
• Middleware systems
– Hadoop
  • Not designed for the specific needs of genetic data
  • Limited programmability
– Genome Analysis Tool Kit (GATK)
  • Designed for genetic data processing
  • Provides special data traversal patterns
  • Limited parallelization for some of its tools
Our Goal
• We want to develop a middleware system that
– Is specific to parallel genetic data processing
– Allows parallelization of a variety of algorithms for genetic data
– Works with different popular genetic data formats
– Allows use of existing programs
Challenges
• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
– (Figure: coverage variance)
• High overhead of tasks
• I/O contention
Our Work
• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
Intra-dependent Processing
• Each file is processed independently
– (Diagram: for each input file File-1 … File-m, map tasks run on Region-1 … Region-n and produce O-i1 … O-in; a reduce task combines them into Output-i)
Inter-dependent Processing
• Each map task processes a particular region of ALL files
– (Diagram: the map task for Region-k reads that region from all input files and produces Ok; a single reduce task combines O1 … On into the output)
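The difference between the two execution models can be sketched in a few lines of Python. This is only an illustration of the data flow in the two diagrams, not PAGE's implementation: the function names, the `map_fn(files, region)` and `reduce_fn(parts)` signatures, and the toy map and reduce functions are all assumptions made for the example, and the tasks run serially here.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical signatures: a map task receives a list of files and one region;
# a reduce task combines the per-region outputs, kept in region order.
MapFn = Callable[[Sequence[str], str], str]
ReduceFn = Callable[[List[str]], str]


def intra_dependent(files: Sequence[str], regions: Sequence[str],
                    map_fn: MapFn, reduce_fn: ReduceFn) -> Dict[str, str]:
    """Each file is processed independently: one reduce and one output per file."""
    outputs = {}
    for f in files:
        partial = [map_fn([f], r) for r in regions]  # O-i1 ... O-in
        outputs[f] = reduce_fn(partial)              # Output-i
    return outputs


def inter_dependent(files: Sequence[str], regions: Sequence[str],
                    map_fn: MapFn, reduce_fn: ReduceFn) -> str:
    """Each map task processes one region of ALL files; one reduce follows."""
    partial = [map_fn(files, r) for r in regions]    # O1 ... On
    return reduce_fn(partial)


# Toy stand-ins for real executables, used only to make the data flow visible.
def toy_map(files: Sequence[str], region: str) -> str:
    return f"{region}<{'+'.join(files)}>"


def toy_reduce(parts: List[str]) -> str:
    return " ".join(parts)


files = ["File-1", "File-2"]
regions = ["Region-1", "Region-2"]
per_file = intra_dependent(files, regions, toy_map, toy_reduce)
combined = inter_dependent(files, regions, toy_map, toy_reduce)
print(per_file)
print(combined)
```

In the intra-dependent case the result has one entry per input file, while the inter-dependent case yields a single output built from map tasks that each saw every file.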
What Can PAGE Parallelize?
• PAGE can parallelize all applications that have the following property
– M is the map task
– R, R1 and R2 are three regions such that R = concatenation of R1 and R2
– M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
– (Diagram: region R split into adjacent regions R1 and R2)
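A small worked example may make the property M(R) = M(R1) ⊕ M(R2) concrete. The sketch below assumes a toy map task that reports positions where an observed sequence differs from the reference, with list concatenation as ⊕. The reference string is the one from the SNP-calling slide; the observed sequence, the function names and the choice of ⊕ are assumptions for illustration only.

```python
from typing import List, Tuple

REFERENCE = "AGCGTACC"   # reference bases from the SNP-calling slide
OBSERVED = "AGAGTTCC"    # made-up observed sequence, for illustration only

Mismatch = Tuple[int, str, str]  # (1-based position, reference base, observed base)


def map_task(start: int, end: int) -> List[Mismatch]:
    """M: report mismatches in the half-open, 0-based region [start, end)."""
    return [
        (i + 1, REFERENCE[i], OBSERVED[i])
        for i in range(start, end)
        if OBSERVED[i] != REFERENCE[i]
    ]


def reduce_fn(left: List[Mismatch], right: List[Mismatch]) -> List[Mismatch]:
    """The reduction function: concatenate results in region order."""
    return left + right


whole = map_task(0, 8)   # M(R)
left = map_task(0, 4)    # M(R1)
right = map_task(4, 8)   # M(R2)

print(whole)
print(reduce_fn(left, right) == whole)
```

The property holds here because the result at each position depends only on that position, so splitting R anywhere changes nothing. A map task whose result at one position depends on data outside its own region would not satisfy it without extra care at the boundaries.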
Data Partitioning
• Data is NOT packaged into equal-size data blocks as in Hadoop
– Each application has a different way of reading the data
– Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions and each map task is assigned a region
– Takes location information into account
– The map task is responsible for accessing its particular region of the input files
  • This is a common feature of many genomic tools (GATK, SamTools)
Genome Partition
• PAGE provides two data partitioning methods
– By-locus partitioning: Chromosomes are divided into regions
– By-chromosome partitioning: Chromosomes preserve their unity
– (Diagram: Chr-1 … Chr-6 shown under each partitioning method)
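The two partitioning methods can be sketched as follows. This is a minimal illustration, not PAGE's code: the function names, the fixed `region_size` parameter, the toy chromosome lengths and the `name:start-end` region strings are assumptions made for the example.

```python
from typing import Dict, List


def by_chromosome(chrom_lengths: Dict[str, int]) -> List[str]:
    """By-chromosome: every chromosome stays whole and forms one region."""
    return [f"{chrom}:1-{length}" for chrom, length in chrom_lengths.items()]


def by_locus(chrom_lengths: Dict[str, int], region_size: int) -> List[str]:
    """By-locus: each chromosome is cut into regions of at most region_size bases."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append(f"{chrom}:{start}-{end}")
    return regions


# Toy lengths, far smaller than real chromosomes.
toy_genome = {"Chr-1": 250, "Chr-2": 100}

whole_chromosomes = by_chromosome(toy_genome)
loci = by_locus(toy_genome, region_size=100)
print(whole_chromosomes)
print(loci)
```

By-locus partitioning produces more and smaller map tasks, which gives a scheduler more room to balance load, while by-chromosome partitioning suits tools that need to see a chromosome in one piece.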
Task Scheduling
• PAGE provides two types of scheduling schemes
• Static
– Each processor is responsible for regions of equal length
– All map tasks must finish before the reduce tasks execute
• Dynamic
– Map and reduce tasks are assigned by a master process
– Reduce tasks can start once enough intermediate results are available
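The effect of the two schemes on load balance can be illustrated with a small simulation. This is a sketch under stated assumptions, not PAGE's scheduler: the per-region costs are invented to mimic uneven coverage, "static" is modelled as giving each worker a contiguous chunk with the same number of regions, and "dynamic" is modelled as a master handing the next region to whichever worker becomes free first. Reduce tasks are not modelled.

```python
import heapq
from typing import List


def static_makespan(costs: List[int], n_workers: int) -> int:
    """Static: each worker gets a contiguous chunk with the same number of regions."""
    chunk = -(-len(costs) // n_workers)  # ceiling division
    loads = [sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)]
    return max(loads)


def dynamic_makespan(costs: List[int], n_workers: int) -> int:
    """Dynamic: the next region goes to the worker that becomes free first."""
    free_at = [0] * n_workers          # time at which each worker is next free
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


# Invented map-task costs: the first region is much more expensive than the rest.
region_costs = [8, 1, 1, 1, 1, 1, 1, 2]

static_time = static_makespan(region_costs, n_workers=2)
dynamic_time = dynamic_makespan(region_costs, n_workers=2)
print(static_time, dynamic_time)
```

With these made-up costs the static split leaves one worker with most of the work, while the dynamic assignment evens it out. The trade-off is that dynamic scheduling needs a master process and some communication per task.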
Applications Developed Using PAGE
• We parallelized 4 applications
– VarScan: SNP detection
– Realigner Target Creator: Detects insertions/deletions in alignment files
– Indel Realigner: Applies local realignment to improve the quality of alignment files
– Unified Genotyper: SNP detection
Sample Application Development with PAGE
• Serial execution command of the VarScan software
– samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
– Genome Partition: By-Locus
– Scheduling Scheme: Dynamic (or Static)
– Execution Model: Inter-dependent
– Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp >outputloc
– Reduction: cat (bash shell command)
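The following sketch shows one way the map command above could be specialized per region and how a cat-style reduction combines the pieces. It is an assumption about the mechanism, not PAGE's actual interface: treating `regionloc` and `outputloc` as placeholders replaced by plain string substitution, the region strings, the `part_N.snp` file names and the helper names are all hypothetical. The commands are only built as strings and never executed, so neither samtools nor VarScan is required.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

# Map command from the slide; regionloc and outputloc are treated as placeholders.
MAP_TEMPLATE = (
    "samtools mpileup -b file_list -r regionloc -f reference "
    "| java -jar VarScan.jar mpileup2snp >outputloc"
)


def map_command(region: str, output: str) -> str:
    """Fill in the region and output location for one map task (hypothetical helper)."""
    return MAP_TEMPLATE.replace("regionloc", region).replace("outputloc", output)


def reduce_cat(parts: List[Path], final: Path) -> None:
    """Concatenate the map outputs in order, like `cat part_0 part_1 > final`."""
    with open(final, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Hypothetical by-locus regions and output names.
regions = ["chr1:1-1000", "chr1:1001-2000"]
commands = [map_command(r, f"part_{i}.snp") for i, r in enumerate(regions)]
for cmd in commands:
    print(cmd)

# Stand-in map outputs, so the reduction step can be shown without running the tools.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    parts = []
    for i, text in enumerate(["snp-a\n", "snp-b\n"]):
        part = tmp_dir / f"part_{i}.snp"
        part.write_text(text)
        parts.append(part)
    final = tmp_dir / "all.snp"
    reduce_cat(parts, final)
    merged = final.read_text()

print(merged, end="")
```

Because the mapper is an ordinary command line and the reducer is plain concatenation, the user writes no parallel code, which is the point of the slide.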
Experiments
• Experimental setup
– In our cluster, each node has
  • 12 GB memory
  • 8 cores (2.53 GHz)
– We obtained the data from the 1000 Genomes Project
– We evaluated PAGE with 4 applications
– We compared PAGE with Hadoop Streaming and GATK
Comparison with GATK
• Indel Realigner tool of GATK
– Figure panels: Scalability (Data Size: 11 GB) and Data Size Impact (# of cores: 128)
– Speedups annotated on the figure: GATK 3.3x, PAGE 9x
Comparison with GATK
• Unified Genotyper tool of GATK
– Figure panels: Scalability (Data Size: 34 GB) and Data Size Impact (# of cores: 128)
– Speedups annotated on the figure: GATK 10.9x, PAGE 12.8x
Comparison with Hadoop Streaming
• VarScan application
– Figure panels: Scalability (Data Size: 52 GB) and Data Size Impact (# of cores: 128)
– Speedups annotated on the figure: Hadoop Streaming 6.9x, PAGE 12.7x
Summary of Experimental Results
Speedup when the computing power is increased by 16 times:

| | Indel Realigner | Unified Genotyper | VarScan | Realigner Target Creator |
|---|---|---|---|---|
| PAGE | 9x | 12.8x | 12.7x | 14.1x |
| GATK | 3.3x | 10.9x | - | - |
| Hadoop Streaming | - | - | 6.9x | - |
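One way to read the speedups above is as parallel efficiency, the achieved speedup divided by the ideal one. The sketch below assumes the ideal speedup is 16x, taking "computing power increased by 16 times" at face value; the numbers are the ones reported in the summary, and the dictionary layout is only for illustration.

```python
IDEAL_SPEEDUP = 16  # assumption: 16 times the computing power means an ideal 16x speedup

# Speedups as reported in the summary; combinations that were not measured are omitted.
speedups = {
    "PAGE": {
        "Indel Realigner": 9.0,
        "Unified Genotyper": 12.8,
        "VarScan": 12.7,
        "Realigner Target Creator": 14.1,
    },
    "GATK": {"Indel Realigner": 3.3, "Unified Genotyper": 10.9},
    "Hadoop Streaming": {"VarScan": 6.9},
}

# Parallel efficiency = achieved speedup / ideal speedup.
efficiency = {
    framework: {app: s / IDEAL_SPEEDUP for app, s in apps.items()}
    for framework, apps in speedups.items()
}

for framework, apps in efficiency.items():
    for app, value in apps.items():
        print(f"{framework:17s} {app:25s} {value:.0%}")
```

Under that assumption, PAGE's efficiency ranges from roughly 56% (Indel Realigner) to roughly 88% (Realigner Target Creator).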
Conclusion
• We developed a middleware that
– Easily parallelizes genomic applications
– Has high applicability
  • No restriction on programming language or data format
  • Allows the use of existing applications
– Lets the user control the parallel execution while hiding the details
  • Alternative scheduling schemes, execution models and data partitioning types
– Scales well
Thank you for listening …
Questions