![Page 1: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/1.jpg)
Cluster-based SNP Calling on Large Scale Genome Sequencing Data
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
CCGrid 2014, Chicago, IL
![Page 2: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/2.jpg)
What is SNP?
• Stands for Single-Nucleotide Polymorphism
• A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species.
• Essential for medical research and for developing personalized medicine.
• A single SNP may cause a Mendelian disease.
*Adapted from Wikipedia
![Page 3: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/3.jpg)
Motivation
• The sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
![Page 4: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/4.jpg)
Motivation
• Big data problem
– The 1000 Genomes Project has already produced 200 TB of data
– Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
![Page 5: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/5.jpg)
Outline
• Motivation
• Parallel SNP Calling
• Proposed Scheduling Schemes
• Experiments
• Conclusion
![Page 6: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/6.jpg)
General Idea of SNP Calling Algorithms
Reference (locations 1-8): A G C G T A C C
Alignment File-1
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C
Alignment File-2
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C
Two main observations:
• In order to detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location.
• The existence of an SNP at one location is independent of the others.
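The first observation can be sketched as a per-location pileup over the reads of all samples. The Python below is a minimal illustration, not VarScan's actual calling rule: the name `call_snps`, the `(start, bases)` read representation, and the `min_reads` and `min_fraction` thresholds are assumptions made only for this example.

```python
from collections import Counter


def call_snps(reference, samples, min_reads=2, min_fraction=0.8):
    """Return {location: alternate base} for locations that look like SNPs.

    reference: string of bases; location 1 is reference[0].
    samples:   list of samples, each a list of (start, bases) reads,
               where start is the 1-based location of the read's first base.
    The thresholds are hypothetical, chosen only for illustration.
    """
    calls = {}
    for pos in range(1, len(reference) + 1):
        counts = Counter()
        # Observation 1: look at this location in ALL samples.
        for reads in samples:
            for start, bases in reads:
                offset = pos - start
                if 0 <= offset < len(bases):
                    counts[bases[offset]] += 1
        depth = sum(counts.values())
        if depth == 0:
            continue  # no read covers this location
        ref_base = reference[pos - 1]
        # Most frequent base that differs from the reference, if any.
        alt, alt_count = max(
            ((b, c) for b, c in counts.items() if b != ref_base),
            key=lambda bc: bc[1],
            default=(None, 0),
        )
        if alt_count >= min_reads and alt_count / depth >= min_fraction:
            calls[pos] = alt
    return calls


if __name__ == "__main__":
    toy_samples = [[(1, "AGGT"), (1, "AGG")], [(2, "GGT")]]
    print(call_snps("ACGT", toy_samples))
```

Because each location is decided from its own counts only (the second observation), the loop over `pos` can be split across processes without any exchange of intermediate results.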
![Page 7: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/7.jpg)
Parallel SNP Calling
How to distribute data among nodes?
• Location-based
• Sample-based
• Checkerboard
[Figure: genome files partitioned among Processors 1-4 under each scheme; annotation: "Requires communication among processes"]
![Page 8: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/8.jpg)
Challenges
• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
• I/O contention
• High overhead of random access to a particular region
[Figure: coverage variance]
![Page 9: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/9.jpg)
Histogram Showing Coverage Variance
• Chromosome: 1
• Locations: 1-200M
• Number of samples: 256
• Interval size: 1M
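A histogram of this kind can be built by counting alignments per fixed-size interval. The sketch below assumes the alignment start locations have already been extracted from the alignment files; the function name `coverage_histogram` and its parameters are hypothetical.

```python
def coverage_histogram(alignment_starts, first, last, interval_size):
    """Count alignments per interval over the locations first..last (inclusive).

    alignment_starts: iterable of 1-based start locations, pooled over samples.
    Returns a list with one count per interval.
    """
    n_bins = -(-(last - first + 1) // interval_size)  # ceiling division
    hist = [0] * n_bins
    for pos in alignment_starts:
        if first <= pos <= last:  # ignore alignments outside the region
            hist[(pos - first) // interval_size] += 1
    return hist


if __name__ == "__main__":
    # Toy data: six alignments, region 1..30, interval size 10.
    print(coverage_histogram([1, 2, 10, 11, 25, 31], 1, 30, 10))  # [3, 1, 1]
```

With the slide's settings (locations 1-200M, interval size 1M) the same function would return 200 counts.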
![Page 10: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/10.jpg)
![Page 11: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/11.jpg)
Proposed Scheduling Schemes
• Dynamic Scheduling
• Static Scheduling
• Combined Scheduling
…Each scheduling scheme uses location-based data division. That is, the genome is divided into regions and each task is responsible for a region.
![Page 12: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/12.jpg)
Dynamic Scheduling
• Master & Worker approach
• Tasks are assigned dynamically
• Two types of data chunks are used
– Big chunk: covers B locations
– Small chunk: covers S locations
– B > S
• Big chunks are assigned first, then small chunks are assigned
[Figure: Alignment File-1 and Alignment File-2 divided into big chunks of B locations]
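The big-then-small chunk order can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' MPI implementation: `make_chunks`, `simulate_master_worker`, the `big_fraction` split between big and small chunks, and the toy cost function are all hypothetical.

```python
import heapq


def make_chunks(start, end, big, small, big_fraction=0.75):
    """Split locations start..end into big chunks first, then small chunks.

    big_fraction (an assumed parameter) is the share of the region that
    is covered by big chunks before switching to small chunks.
    """
    assert big > small > 0
    big_limit = start + int((end - start + 1) * big_fraction)
    chunks, pos = [], start
    while pos <= end:
        size = big if pos + big - 1 < big_limit else small
        stop = min(pos + size - 1, end)
        chunks.append((pos, stop))
        pos = stop + 1
    return chunks


def simulate_master_worker(chunks, n_workers, cost):
    """Hand the next chunk to whichever worker becomes free first."""
    free_at = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(free_at)
    assignment = {w: [] for w in range(n_workers)}
    for chunk in chunks:
        t, w = heapq.heappop(free_at)
        assignment[w].append(chunk)
        heapq.heappush(free_at, (t + cost(chunk), w))
    makespan = max(t for t, _ in free_at)
    return assignment, makespan


if __name__ == "__main__":
    chunks = make_chunks(1, 100, big=20, small=5, big_fraction=0.6)
    # Toy cost: proportional to chunk length (real cost depends on coverage).
    _, makespan = simulate_master_worker(chunks, 2, lambda c: c[1] - c[0] + 1)
    print(len(chunks), makespan)
```

Big chunks keep the number of master requests low early on, while the trailing small chunks let workers finish at nearly the same time.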
![Page 13: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/13.jpg)
Static Scheduling
• Pre-processing step
– We count the number of alignments for each region and generate a histogram
• Estimated cost
– We use an estimation function and our histogram for data partitioning, with the following terms:
– k: histogram interval
– TR: cost of accessing/reading the region
– TP: cost of processing an alignment
– N(l): number of alignments at location l
– Each task is responsible for regions having the same estimated cost.
[Figure: Alignment File-1 and Alignment File-2 divided into regions]
• Tasks are scheduled statically; there is no master-worker approach.
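The estimation function itself is not reproduced in this transcript. As an illustration only, the sketch below assumes a linear form, cost(k) = TR + TP × (number of alignments in interval k), and cuts the histogram into contiguous tasks of roughly equal estimated cost with a greedy prefix-sum rule. The names `estimate_costs` and `partition_equal_cost`, the linear form, and the greedy rule are assumptions, not the authors' method.

```python
def estimate_costs(histogram, t_read, t_proc):
    """Assumed linear cost per interval: read cost plus per-alignment cost."""
    return [t_read + t_proc * n for n in histogram]


def partition_equal_cost(costs, n_tasks):
    """Cut intervals into n_tasks contiguous groups of similar total cost.

    Returns a list of (first_interval, last_interval) index pairs, inclusive.
    Requires n_tasks <= len(costs) so that every task gets an interval.
    """
    assert 0 < n_tasks <= len(costs)
    total = sum(costs)
    tasks, start, running = [], 0, 0.0
    for k, c in enumerate(costs):
        running += c
        task_idx = len(tasks) + 1                # task currently being filled
        intervals_left = len(costs) - (k + 1)
        tasks_left = n_tasks - task_idx          # tasks still to be opened
        reached_share = running >= total * task_idx / n_tasks
        must_cut = intervals_left == tasks_left  # keep one interval per task
        if task_idx < n_tasks and (reached_share or must_cut):
            tasks.append((start, k))
            start = k + 1
    tasks.append((start, len(costs) - 1))
    return tasks


if __name__ == "__main__":
    hist = [10, 10, 40, 20, 20]                  # toy alignment counts
    costs = estimate_costs(hist, t_read=0, t_proc=1)
    print(partition_equal_cost(costs, 2))        # [(0, 2), (3, 4)]
```

Each process can compute the same partition locally from the histogram and pick its own task by rank, which is why no master is needed.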
![Page 14: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/14.jpg)
Combined Scheduling
• Combination of static and dynamic scheduling
• We use small and big chunks as in dynamic scheduling
• The sizes of the chunks are determined according to the histogram
• Master-Worker approach
[Figure: Alignment File-1 and Alignment File-2 divided into big chunks and small chunks]
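One way to read "chunk sizes determined according to the histogram" is that chunks are sized by estimated cost rather than by a fixed number of locations. The sketch below follows that reading; `cost_sized_chunks`, the per-interval cost list, the target costs, and `big_fraction` are hypothetical and not taken from the talk.

```python
def cost_sized_chunks(costs, big_cost, small_cost, big_fraction=0.75):
    """Group histogram intervals into chunks by estimated cost.

    costs: estimated cost of each histogram interval.
    Chunks aim for big_cost until big_fraction of the total cost has been
    covered, and for small_cost afterwards.
    Returns (first_interval, last_interval) index pairs, inclusive.
    """
    assert big_cost > small_cost > 0
    total = sum(costs)
    chunks, start, running, done = [], 0, 0.0, 0.0
    for k, c in enumerate(costs):
        running += c
        target = big_cost if done < total * big_fraction else small_cost
        if running >= target:
            chunks.append((start, k))
            done += running
            start, running = k + 1, 0.0
    if start < len(costs):  # leftover intervals form a final chunk
        chunks.append((start, len(costs) - 1))
    return chunks


if __name__ == "__main__":
    toy_costs = [8, 2, 2, 4, 1, 1, 1, 1]
    print(cost_sized_chunks(toy_costs, big_cost=8, small_cost=2, big_fraction=0.5))
    # [(0, 0), (1, 3), (4, 5), (6, 7)]
```

The resulting chunks would then be handed out by a master in order, big ones first, so a dense interval forms a chunk on its own while sparse intervals are grouped together.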
![Page 15: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/15.jpg)
Parameters of Scheduling Schemes
• Our proposed scheduling schemes have user-defined parameters
– Dynamic Scheduling
  • Length of big and small chunks
– Static Scheduling
  • Histogram interval size
  • Estimation function parameters
– Combined Scheduling
  • All parameters for dynamic and static scheduling
• All parameters can be determined with an offline training phase
![Page 16: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/16.jpg)
![Page 17: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/17.jpg)
Experiments
• Local cluster whose nodes have 2 quad-core 2.53 GHz Xeon(R) processors and 12 GB RAM
• We obtained genomes of 256 samples from the 1000 Genomes Project
• The data is replicated to all local disks unless noted otherwise
• Parallel implementation:
– We implemented VarScan in the C programming language
  • We also modified VarScan such that BAM files can be read directly.
– We used the MPI library for parallelization
![Page 18: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/18.jpg)
Experiments: Scalability
First 192M locations of Chr. 1

| Scheduling Scheme | Scalability |
|---|---|
| Basic | 8.4x |
| Dynamic | 10.9x |
| Static | 19.7x |
| Combined | 23.5x |
![Page 19: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/19.jpg)
Experiments: Data Size Impact
128 cores are allocated
![Page 20: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/20.jpg)
Experiments: I/O Contention Impact
128 cores are allocated

| Scheduling Scheme | I/O Contention Impact (sec) |
|---|---|
| Basic | 174 |
| Dynamic | 229 |
| Static | 251 |
| Combined | 220 |
![Page 21: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/21.jpg)
Comparison with Hadoop
- The first 192M locations of Chr. 2 in 512 samples are analyzed.
- Lower (dark) portions of the bars show pre-processing time.
![Page 22: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/22.jpg)
Scheduling With Replication
• Data-intensive processing motivates new schemes
• Replicate each chunk a fixed/variable number of times
• Dynamic scheduling while processing only local chunks
• Interesting new tradeoffs
• Under submission
![Page 23: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/23.jpg)
Other Work
• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)
• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
![Page 24: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/24.jpg)
PAGE vs. State-of-the-Art
• A middleware system
– Specific to parallel genetic data processing
– Allows parallelization of a variety of genetic algorithms
– Able to work with different popular genetic data formats
– Allows use of existing programs
![Page 25: Cluster-based SNP Calling on Large Scale Genome Sequencing Data](https://reader036.vdocuments.us/reader036/viewer/2022062301/56815f68550346895dce6b02/html5/thumbnails/25.jpg)
Conclusion
• We have developed a methodology for parallel identification of variants in large-scale genome sequencing data.
• Coverage variance and I/O contention are the two main problems.
• We proposed 3 scheduling schemes.
• Combined scheduling gives the best results.
• Our approach has good speedup and outperforms Hadoop.