masher: mapping long(er) reads with hash-based …esaule/public-website/slides/acmbcb13...masher:...
Post on 17-Apr-2018
220 Views
Preview:
TRANSCRIPT
Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs
Anas Abu-Doleh1,2, Erik Saule1, Kamer Kaya1 and Ümit V. Çatalyürek1,2
1 Department of Biomedical Informatics2 Department of Electrical and Computer Engineering
The Ohio State University
I. Introduction
• Motivation
• Contribution
• Related Work
II. Masher Workflow
• Index Construction
• Mapping
III. Experiments and Results
IV. Conclusion and Future Work
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 2ACM-BCB13
23 Sep 2013
Outline
The read length of next generation sequencing (NGS) devices is continuouslyincreasing so there is a wide interest in efficient and accurate mapping oflong(er) reads.
Utilizing the powerful capabilities of GPUs to improve the mapping of NGSreads.
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 3ACM-BCB13
23 Sep 2013
Motivation
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 4ACM-BCB13
23 Sep 2013
Related Work and Contributions
Burrows-Wheeler Transform (BWT)o Bowtie2o CUSHAW2o Soap3-dp
Hash Indexingo SeqAltoo BFAST
)
Related Work
A novel hash-based indexing technique by which: For large genomes, the memory footprint small enough to be stored in a
restricted-memory device such as a GPU. The index data structure is more suitable for GPU parallelization
Contribution
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 5ACM-BCB13
23 Sep 2013
Masher workflow
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 6ACM-BCB13
23 Sep 2013
Index Construction
Base pairs to 2 bit format.
Replacing each N with A.
Processing genome file
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 7ACM-BCB13
23 Sep 2013
Index Construction
Base pairs to 2 bit format.
Replacing each N with A.
Processing genome file
Seed length LS
Indexing step size ∆G
Indexing
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 8ACM-BCB13
23 Sep 2013
Index Construction
Genome length, N Stores the indexed locations in
order for each seed Location array size = log2(N) x
𝑁/∆G
Size ≈ 2.9 GB , hg19, ∆G = 4
Index arrays - Locations array
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 9ACM-BCB13
23 Sep 2013
Index Construction
Stores the number of occurrences for each seed
Size = 4Ls x log2 𝑁/∆G
Store at most 255 locations. Appear more than 255, do
uniform selection. Size = 1 GB , LS = 15.
Index arrays - Count array
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 10ACM-BCB13
23 Sep 2013
Index Construction
Stores the starting index at locs array for a group of seeds
Seed group size, δ. Group id = seed/δ Size = 4L/ δ x log2 ( 𝑁/∆G
Size = 0.5 GB , δ = 8, ∆G = 4.
Index arrays - Ptrs array
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 11ACM-BCB13
23 Sep 2013
Index Construction
LS = 15, ∆G = 4, δ = 8 , hg19 Total indexing arrays size =
2.9 + 1 + 0.5 = 4.4 GB. Space–time tradeoff
Index arrays
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 12ACM-BCB13
23 Sep 2013
Index Construction
Count array• Assume seed = i + 4• Belongs to seed group (i , i + δ −1 ) • , δ = 8 , i mod δ = 0.• Seed index in group, k = (i +4) mod δ• Ck=4 = count[i + 4 ]
Ptrs array• j = seed /δ , • Locs group index (Lgi) = ptrs[ j ]
• Locs seed index (Lsi) = Lgi + 𝑛=0𝑘−1𝐶𝑛
Locs array• Extract locations from (Lsi , Lsi + Ck - 1 )
Accessing the Index
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 13ACM-BCB13
23 Sep 2013
Index Construction
0.5
0.6
0.7
0.8
0.9
1
1 6 11 16 21 26 31 36 41 46 51 56
Pr(
cou
nt
<= x
)
Seeds count
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 14ACM-BCB13
23 Sep 2013
Mapping
Read step size, ∆R
Read length, LR
Nseeds = ∆G x (LR − LS)/∆R
Seed & hash
Each thread is assigned to a specific seed.
Locate candidate alignment locations (CALs)
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 15ACM-BCB13
23 Sep 2013
Mapping
In merging CALs, if two CALs are within a threshold distance, the second weight will be added to the first weight.
For efficiency purpose, Masher consists of two main loops.
Merge CALs and weights
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 16ACM-BCB13
23 Sep 2013
Mapping
Sorting and setting the CALs in batches with respect to their weights. At this stage, a filter operation for CALs with low weight could be applied.
Sorting and Batching CALs
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 17ACM-BCB13
23 Sep 2013
Mapping
A parameterized variant of Smith-Waterman (SW) algorithm supporting affinity gap scoring.
Bounded alignment, only the matrix cells (i, j) where |i - j| <= w are visited and scored.
Masher does two passes and sets w to 4 and 16 respectively
GPU block performs multiple SWs in parallel.
Bounded local Alignment
Sorting and setting the CALs in batches with respect to their weights. At this stage, a filter operation for CALs with low weight could be applied.
Sorting and Batching CALs
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 18ACM-BCB13
23 Sep 2013
Experiments and Results
Intel core i7-960 CPU clocked at 3.2 Ghz. 4 Hyper-Threading cores, 24GB of DDR3 memory.
Tesla K20c GPU, 4.8GB of global memory. CUDA 5.0 and GCC 4.2.4.
Platform
Human genome hg19 Wgsim simulator, 100K reads of length 100, 300, 500, and 1000 with error rates 2%, 4%,
6%, and 8%.
Human genome and Simulated Reads
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 19ACM-BCB13
23 Sep 2013
Experiments and Results
Sensitivity, is the percentage of the aligned reads. Accuracy, is the percentage of the reads correctly aligned to simulator locations among
all aligned reads. Execution time: Only alignment time was measured. The lower bound for a valid alignment score is set to
scoreLB = LR x (1.9 - 0.5 x Error Rate)
Metrics for comparison
Normal mode, ∆R = 0.7 LR Fast mode, ∆R = LR
Two modes of Masher
Bowtie2 (sensitive and fast) , 8 threads SOAP3-dp CUSHAW2-GPU.
Comparison with
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 20ACM-BCB13
23 Sep 2013
Experiments and Results
99
.23
99
.44
99
.36
98
.87
97
.55
96
.81
94
.5
89
.8398
.8
98
96
93
.15
98
94
.63
88
.8
80
.6
98
.5
92
.5
81
.7
67
.7
99
.9
99
.9
98
.8
96
.2
40
50
60
70
80
90
100
1 2 3 4
Sen
siti
vity
%
Masher Masher-fast Bowtie2 Bowtie2-fast SOAP3-dp CUSHAW2-GPU
95
.01
93
.82
92
.42
90
.84
95
.49
94
.44
93
.07
91
.499
5.2
94
92
.6
91
.1
95
93
.78
91
.7
89
.47
96
.2
95
.5
94
.5
93
95
.2
94
.3
93
.2
91
.9
80
85
90
95
100
2% 4% 6% 8%
Acc
ura
cy %
Error rate
LR = 100 bps.
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 21ACM-BCB13
23 Sep 2013
Experiments and Results
99
.89
99
.84
99
.74
99
.62
99
.89
99
.78
99
.51
98
.93
99
.9
99
.9
99
.9
99
.9
99
.9
99
.8
99
.34
97
.7
99
.2
94
.3
75
.3
48
.6
40
50
60
70
80
90
100
1 2 3 4
Sen
siti
vity
%
Masher Masher-fast Bowtie2 Bowtie2-fast SOAP3-dp
97
.69
97
.2
96
.83
96
.25
97
.78
97
.19
96
.83
96
.15
98
.2
98
97
.8
97
.4
98
.1
97
.8
97
.6
979
8.8
98
.5
98
.3
98
80
85
90
95
100
2% 4% 6% 8%
Acc
ura
cy %
Error rate
LR = 500 bps.
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 22ACM-BCB13
23 Sep 2013
Experiments and Results
10
0
10
0
10
0
10
0
99
.8
99
.73
99
.53
98
.93
99
.99
99
.9
99
.9
99
.8
99
.9
99
.9
99
.8
99
.5
99
.3
98
.7
91
.4
68
.9
40
50
60
70
80
90
100
1 2 3 4
Sen
siti
vity
%
Masher Masher-fast Bowtie2 Bowtie2-fast SOAP3-dp
98
.5
98
.28
97
.86
97
.41
98
.25
97
.78
97
.24
96
.66
98
.5
98
.3
97
.5
96
.43
98
.5
98
.1
97
.3
96
.198
.9
98
.5
97
.8
96
80
85
90
95
100
2% 4% 6% 8%
Acc
ura
cy %
Error rate
LR = 1000 bps.
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 23ACM-BCB13
23 Sep 2013
Experiments and Results
9.1
8.6 9.3 9.4
5 4.9 5.3 5.5
11
10
9 9
5 5 5 5
7.3 8
.3
6.6
6.6
9.3 1
0.5
14
.9
11
.8
1
5
25
2% 4% 6% 8%
Exe
cuti
on
tim
e (
sec.
) in
log
scal
e
Error rate
Masher Masher-fast Bowtie2 Bowtie2-fast SOAP3-dp CUSHAW2-GPU
LR = 100 bps.
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 24ACM-BCB13
23 Sep 2013
Experiments and Results
15
.1 19
.1 24 3
1.7
9.9
8.2 1
1 11
.7
13
4
16
0
16
5
18
0
10
0
11
1
11
7
12
3
10
10
73
4
52
2
33
3
1
5
25
125
625
3125
2% 4% 6% 8%
Exe
cuti
on
tim
e (
sec.
) in
log
scal
e
Error rate
Masher Masher-fast Bowtie2 Bowtie2-fast SOAP3-dp
LR = 500 bps.
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 25ACM-BCB13
23 Sep 2013
Experiments and Results
17
.6
18
.5
20
.4
21
.8
15
.5
17
.5
20
.2
22
45
6 56
7
66
2
75
2
34
5
40
3
45
2
49
1
56
07
46
00
32
06
20
27
1
5
25
125
625
3125
2% 4% 6% 8%
Exe
cuti
on
tim
e (
sec.
) in
log
scal
e
Error rate
Masher Masher-fast Bowtie2 Bowtie2-fast SOAP3-dp
LR = 1000 bps.
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 26ACM-BCB13
23 Sep 2013
Experiments and Results
LR = 1000 bps, Error rate 2%
1
10
100
1000
10000
90 92 94 96 98 100
Accuracy %
Masher
Masher-fast
Bowtie2
Bowtie2-fast
SOAP3-dp1
10
100
1000
10000
90 92 94 96 98 100Exe
cuti
on
tim
e (
sec.
) in
log
scal
e
Sensitivity %
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 27ACM-BCB13
23 Sep 2013
Conclusion and future work
Masher, a fast and accurate short/long read mapper, which uses memory efficient indexing scheme to reduce the size of a human genome index and to make it fit to the memory of a GPU.
The results show that Masher produces accurate alignments. Its speed is competitive with the tested state-of-the-art tools for reads of length less than
500 and an order of magnitude faster when the reads are longer than 500.
Conclusion
Making the software publicly available. Improving Masher’s performance further by using GPU-specific optimizations and with a
better CPU/GPU pipelining. Adding new features such as a support for paired-end sequences or fastq format.
Future work
• For more information• Visit http://bmi.osu.edu/hpc
• Acknowledgement of Support
A Abu-Doleh “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs" 28ACM-BCB13
23 Sep 2013
Thanks
top related