![Page 1: Using Cascading Bloom filters to Improve the memory usage ...igm.univ-mlv.fr/AlgoB/slides/Kucherov_WABI_2013.pdf · Bloom filter B1 Key observation: in assembly algorithms, not all](https://reader035.vdocuments.us/reader035/viewer/2022070922/5fba5b5b7576ec63c73b2e3d/html5/thumbnails/1.jpg)
Using Cascading Bloom filters to Improve the memory usage of de Bruijn graphs
Kamil Salikhov (Moscow State U)
Gustavo Sacomoto (INRIA/LBBE Lyon)
Gregory Kucherov (CNRS/LIGM Marne-la-Vallée)
Processing Gigabytes of NGS data
- NGS technologies produce on the order of 10^10 B (= 10 GB) per run
- Ex: in silico assembly of the Panda genome (2010) was done with 56×2.4 GB ≈ 135 GB of sequence
☞ Efficient data structures for representing NGS data
Two main approaches to storing NGS data
- de Bruijn graphs (refers back to EULER)
  - nodes = k-mers, edge = (k-1)-overlap
  - k typically 20 to 50
  - sometimes the multiplicities of edges are stored
  - de Bruijn graphs are used in many popular assemblers: Velvet, ABySS, SOAPdenovo, …
- overlap graphs (refers back to Celera assembler)
  - nodes = reads, edge = significant overlap (with errors)
  - the recent SGA assembler [Simpson, Durbin 2012] is based on overlap graphs
  - key ingredient is the FM-index for storing reads
  - FM-index (based on BWT) is used successfully in many mapping programs such as Bowtie, BWA, SOAP2
de Bruijn graph: toy example
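The figure for this slide is not reproduced here. As a stand-in, the following is a minimal Python sketch of a toy de Bruijn graph with nodes = k-mers and edges = (k-1)-overlaps. The two reads, the value k=4 and the function names are made up for illustration, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def de_bruijn(reads, k):
    """Node-centric de Bruijn graph: nodes are k-mers, and there is an
    edge u -> v when the last k-1 letters of u equal the first k-1 of v."""
    nodes = {km for read in reads for km in kmers(read, k)}
    edges = defaultdict(set)
    for u in nodes:
        for base in "ACGT":
            v = u[1:] + base      # candidate successor with a (k-1)-overlap
            if v in nodes:
                edges[u].add(v)
    return nodes, edges

# Hypothetical toy input: two overlapping reads, k = 4.
nodes, edges = de_bruijn(["ACGTAC", "GTACGA"], k=4)
for u in sorted(edges):
    print(u, "->", sorted(edges[u]))
```

Note that the edges are never stored as input: they are implied by the node set, which is why a membership structure for k-mers is enough to traverse the graph.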
Storing de Bruijn graph
- explicit storage of nodes takes a lot of memory
  - 2k bits per node (in practice, 64 or 128 bits)
  - 15GB for the human genome
- lower bound (information-theoretic) ≈24 bits/node for k=27
- several works appeared recently on compact storage of dB graphs:
  - Conway&Bromage, Bioinformatics 2011
  - Ye, Ma, Cannon, Pop, Yu, BMC Bioinfo 2012 [storing a fraction of all k-mers] SparseAssembler
  - Bowe, Onodera, Sadakane, Shibuya, WABI 2012 [ad hoc sophisticated compact data structures]
  - two others based on Bloom filters
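The figure of 2k bits per node comes from spending 2 bits on each of the k nucleotides. A small Python sketch of such a packing is given below; the code table A=0, C=1, G=2, T=3 and the function names are one possible choice, not something prescribed by the slides.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit code per nucleotide
BASES = "ACGT"

def encode(kmer):
    """Pack a k-mer into an integer using 2 bits per nucleotide."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def decode(value, k):
    """Inverse of encode(); k is needed because leading 'A's encode as zeros."""
    letters = []
    for _ in range(k):
        letters.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(letters))

k = 27
w = "ACGT" * 6 + "ACG"        # an arbitrary 27-mer
x = encode(w)
bits_needed = 2 * k           # 54 bits, so a 27-mer fits in one 64-bit word
```

With k=27 the 54 bits are rounded up to a 64-bit machine word, which is where the "64 or 128 bits" in practice comes from.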
Bloom filters
- classical method of storing a subset in order to support the membership operation
- has a one-sided error (false positives)
- n inserted elements, m size of the bitmap
- if r=m/n, then the probability of a false positive (with the optimal number of hash functions) is c^r, where c=2^(-ln 2)≈0.6185
- Pell, Hintze, Canino-Koning, Howe, Brown, PNAS 2012: applying (lossy) Bloom filter to store dB graphs
- Chikhi, Rizk, WABI 2012: compact ‘lossless’ representation of dB graphs using Bloom filters (Minia)
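To make the false positive estimate concrete, here is a minimal Bloom filter sketch in Python together with the theoretical rate c^r. The class, the salted BLAKE2b hashing scheme and the test items are illustrative assumptions, not the implementation used in any of the cited tools.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: m bits, h hash functions (illustrative only)."""

    def __init__(self, m, h):
        self.m, self.h = m, h
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive h bit positions from salted BLAKE2b digests of the item.
        for seed in range(self.h):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))

def false_positive_rate(r):
    """Theoretical rate c**r with c = 2**(-ln 2), assuming optimal h = r ln 2."""
    c = 2 ** (-math.log(2))
    return c ** r

n, r = 1000, 10
bf = BloomFilter(m=r * n, h=round(r * math.log(2)))
members = [f"item{i}" for i in range(n)]
for s in members:
    bf.add(s)

# Empirical rate on 10000 strings that were never inserted.
observed = sum(f"other{i}" in bf for i in range(10000)) / 10000
print(f"theory: {false_positive_rate(r):.4f}, observed: {observed:.4f}")
```

Inserted elements are always reported as present; only non-members can be misreported, which is the one-sided error mentioned above.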
Idea of Chikhi and Rizk
- Assume we have to represent a set T0 of k-mers by a Bloom filter B1
- Key observation: in assembly algorithms, not all k-mers can be queried, but only those which have a (k-1)-overlap with k-mers known to be in the graph
- The set T1 of false positives of B1 that can actually be queried (‘critical false positives’) is much smaller than the set of all false positives, so it can be stored explicitly (e.g. via a hash table)
- Storing B1 and T1 is much more space efficient than storing T0. Membership of w is tested by first querying B1 and, if the answer is positive, checking that w is not in T1
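The construction of B1 and T1 and the two-step membership test can be sketched as follows. This is a simplified illustration under stated assumptions: the tiny `Bloom` class, the parameters r=4 and h=3, the random test sequence and all function names are hypothetical, and reverse complements are ignored.

```python
import hashlib
import random

class Bloom:
    """Tiny Bloom filter over strings (illustrative, not tuned for speed)."""
    def __init__(self, m, h):
        self.m, self.h, self.bits = m, h, 0
    def _pos(self, s):
        for i in range(self.h):
            d = hashlib.sha256(f"{i}:{s}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m
    def add(self, s):
        for p in self._pos(s):
            self.bits |= 1 << p
    def __contains__(self, s):
        return all((self.bits >> p) & 1 for p in self._pos(s))

def neighbours(kmer):
    """All k-mers with a (k-1)-overlap with kmer (successors and predecessors)."""
    for b in "ACGT":
        yield kmer[1:] + b
        yield b + kmer[:-1]

def build(T0, r=4, h=3):
    """Return (B1, T1): a Bloom filter for T0 and its critical false positives."""
    B1 = Bloom(max(1, r * len(T0)), h)
    for w in T0:
        B1.add(w)
    # Only neighbours of true k-mers can ever be queried during a traversal.
    T1 = {v for w in T0 for v in neighbours(w) if v not in T0 and v in B1}
    return B1, T1

def contains(w, B1, T1):
    """Exact membership for any w that is in T0 or adjacent to a k-mer of T0."""
    return w in B1 and w not in T1

# Hypothetical input: k-mers (k = 11) of a random 300-letter sequence.
rng = random.Random(1)
seq = "".join(rng.choice("ACGT") for _ in range(300))
T0 = {seq[i:i + 11] for i in range(len(seq) - 10)}
B1, T1 = build(T0)
print(len(T0), "true k-mers,", len(T1), "critical false positives")
```

The answer of `contains` is only guaranteed for k-mers reachable by graph traversal; an arbitrary k-mer far from the graph may still be a false positive of B1 that is not recorded in T1.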
[Figure: T0, the false positives of B1, and T1]
- Represent T0 by Bloom filter B1
- Compute T1 (‘critical false positives’) and represent it e.g. by a hash table
- Result (example): 13.2 bits/node for k=27 (of which 11.1 bits for B1 and 2.1 bits for T1)
Improving on Chikhi and Rizk’s method
Main idea: iteratively apply the same construction to T1, i.e. encode T1 by a Bloom filter B2 and a set of ‘false-false positives’ T2, then apply this to T2, etc.
☞ cascading Bloom filters
[Figure: T0, T1, T2, T3, T4, T5 and the false positives of B1]
- further encode T1 via a Bloom filter B2 and a set T2, where T2⊆T0 is the set of k-mers of T0 accepted by B2 by mistake (‘false² positives’)
- iterate the construction on T2
- we obtain a sequence of sets T0, T1, T2, T3, … encoded by Bloom filters B1, B2, B3, B4, … respectively
- T0⊇T2⊇T4⊇…, T1⊇T3⊇T5⊇…
Correctness
Lemma [correctness]: For a k-mer w, consider the smallest i such that w∉B_{i+1}. Then w∈T0 if i is odd and w∉T0 if i is even.
- if w∉B1 then w∉T0
- if w∈B1, but w∉B2 then w∈T0
- if w∈B1, w∈B2, but w∉B3 then w∉T0
- etc.
[Figure: T0, T1, T2, T3, T4, T5 and the false positives of B1]
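The lemma translates directly into a query procedure. The sketch below builds a cascade of t filters plus an explicit final set Tt and answers membership by finding the first filter that rejects w. It is an illustration under assumptions: the `Bloom` class, the parameters r=4 and h=3, the random test sequence and the function names are hypothetical, reverse complements are ignored, and the fallback to Tt for k-mers accepted by all t filters follows the finite-cascade variant described later in the talk.

```python
import hashlib
import random

class Bloom:
    """Tiny Bloom filter over strings (illustrative, not tuned for speed)."""
    def __init__(self, m, h):
        self.m, self.h, self.bits = m, h, 0
    def _pos(self, s):
        for i in range(self.h):
            d = hashlib.sha256(f"{i}:{s}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m
    def add(self, s):
        for p in self._pos(s):
            self.bits |= 1 << p
    def __contains__(self, s):
        return all((self.bits >> p) & 1 for p in self._pos(s))

def neighbours(kmer):
    """All k-mers with a (k-1)-overlap with kmer."""
    for b in "ACGT":
        yield kmer[1:] + b
        yield b + kmer[:-1]

def build_cascade(T0, t=4, r=4, h=3):
    """Build B1..Bt and the explicit set Tt from the true k-mers T0."""
    S = {v for w in T0 for v in neighbours(w)} - T0   # queryable non-members
    filters = []
    stored, other = set(T0), S    # stored = T_{i-1}, other = pool that T_i is drawn from
    for _ in range(t):
        B = Bloom(max(1, r * len(stored)), h)
        for w in stored:
            B.add(w)
        filters.append(B)
        # T_i: elements of the opposite side that this filter accepts by mistake.
        stored, other = {w for w in other if w in B}, stored
    return filters, stored        # stored is now T_t

def contains(w, filters, Tt):
    """Membership of w in T0, valid for w in T0 or adjacent to a k-mer of T0."""
    for j, B in enumerate(filters, start=1):
        if w not in B:
            # First rejection at B_j, i.e. i = j - 1 in the lemma: member iff i is odd.
            return (j - 1) % 2 == 1
    # Accepted by all t filters: consult the explicit set T_t.
    t = len(filters)
    return (w in Tt) if t % 2 == 0 else (w not in Tt)

# Hypothetical input: k-mers (k = 11) of a random 300-letter sequence.
rng = random.Random(1)
seq = "".join(rng.choice("ACGT") for _ in range(300))
T0 = {seq[i:i + 11] for i in range(len(seq) - 10)}
filters, Tt = build_cascade(T0, t=4)
```

With t=1 the same code reduces to a single filter B1 plus the explicit set T1.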
Assuming an infinite number of filters
Let N=|T0| and assume that r=m_i/n_i is the same for every B_i. Then the total size is
$$\underbrace{rN}_{|B_1|} + \underbrace{6rNc^r}_{|B_2|} + \underbrace{rNc^r}_{|B_3|} + \underbrace{6rNc^{2r}}_{|B_4|} + \underbrace{rNc^{2r}}_{|B_5|} + \dots = N(1+6c^r)\frac{r}{1-c^r}$$
The minimum is achieved for r=5.464, which yields a memory consumption of 8.45 bits/node
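The closed form and its minimum can be checked numerically. The following Python sketch evaluates the total size per node, N(1+6c^r)·r/(1−c^r) divided by N, compares it with a term-by-term partial sum of the series, and locates the minimum with a plain grid search. The grid range and step are arbitrary choices made for this illustration.

```python
import math

C = 2 ** (-math.log(2))            # c = 2^(-ln 2), about 0.6185

def bits_per_node_infinite(r):
    """Closed form of the total size of B1, B2, ... divided by N."""
    x = C ** r
    return (1 + 6 * x) * r / (1 - x)

def partial_sum(r, terms=200):
    """Same quantity summed term by term: r, 6r x, r x, 6r x^2, r x^2, ..."""
    x = C ** r
    return sum(r * x ** j + 6 * r * x ** (j + 1) for j in range(terms))

# Grid search over r in (0, 20] with step 0.001.
grid = [i / 1000 for i in range(1, 20001)]
r_opt = min(grid, key=bits_per_node_infinite)
best = bits_per_node_infinite(r_opt)
print(f"r = {r_opt:.3f}, {best:.2f} bits/node")
```

The odd-numbered filters contribute the geometric series rN(1 + c^r + c^{2r} + …) and the even-numbered ones 6rNc^r(1 + c^r + c^{2r} + …), which is where the common factor 1/(1−c^r) comes from.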
Infinity difficult to deal with ;)
- In practice we will store only a small finite number of filters B1, B2, …, Bt together with the set Tt stored explicitly
- t=1 ➟ Chikhi&Rizk’s method
- The estimation should be adjusted and the optimal value of r has to be updated; example for t=4
Table: Estimations for t=4. Optimal r and corresponding memory consumption
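The table itself is not reproduced in this transcript. The adjusted estimate can be sketched as follows, using the set sizes from the infinite case (N, 6Nc^r, Nc^r, 6Nc^{2r}, Nc^{2r}, …) and assuming that each k-mer of the explicit set Tt costs 2k bits. That storage cost, the function names and the sample values of k are assumptions of this sketch, so its output should not be read as the values of the original table.

```python
import math

C = 2 ** (-math.log(2))            # c = 2^(-ln 2), about 0.6185

def set_size(i, x):
    """|T_i| / N under the model N, 6Nx, Nx, 6Nx^2, Nx^2, ... with x = c^r."""
    return x ** (i // 2) if i % 2 == 0 else 6 * x ** ((i + 1) // 2)

def bits_per_node(r, k, t):
    """Estimated bits per k-mer for filters B1..Bt plus an explicit T_t."""
    x = C ** r
    filters = sum(r * set_size(i - 1, x) for i in range(1, t + 1))
    explicit = 2 * k * set_size(t, x)      # assumed 2k bits per stored k-mer
    return filters + explicit

def optimise(k, t):
    """Grid search for the common ratio r minimising bits_per_node."""
    grid = [i / 1000 for i in range(1, 20001)]
    r_opt = min(grid, key=lambda r: bits_per_node(r, k, t))
    return r_opt, bits_per_node(r_opt, k, t)

for k in (16, 32, 64, 128):                # sample k values, chosen for illustration
    r_opt, bits = optimise(k, t=4)
    print(f"k={k:3d}: r={r_opt:.3f}, {bits:.3f} bits/k-mer")
```

For t=4 the estimate is r(1 + 7c^r + 6c^{2r}) + 2k·c^{2r}: truncating the cascade replaces the tail of the series by the cost of storing T4 explicitly, so the optimal r grows slowly with k.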
Compared to Chikhi&Rizk’s method
Table: Space (bits/node) compared to Chikhi&Rizk for t=4 and different values of k.
We can cut down a bit more …
Rather than using the same r for all filters B1, B2,…, we can use different properly chosen coefficients r1,r2, …
This allows saving another 0.2 – 0.4 bits/k-mer
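One way to explore per-filter coefficients is a simple coordinate descent on a size model for t=4. The sketch below is only an illustration: the model (each filter B_i shrinking the false positives of the previous stage by c^{r_i}, and 2k bits per k-mer of the explicit T4), the starting point and the search grid are assumptions, and the slides do not say how the coefficients were actually chosen.

```python
import math

C = 2 ** (-math.log(2))            # c = 2^(-ln 2), about 0.6185

def bits_per_node(rs, k):
    """Size model for t = 4 with per-filter ratios rs = (r1, r2, r3, r4)."""
    r1, r2, r3, r4 = rs
    t1 = 6 * C ** r1               # |T1|/N: critical false positives of B1
    t2 = C ** r2                   # |T2|/N: k-mers of T0 accepted by B2
    t3 = t1 * C ** r3              # |T3|/N: elements of T1 accepted by B3
    t4 = t2 * C ** r4              # |T4|/N: elements of T2 accepted by B4
    return r1 + r2 * t1 + r3 * t2 + r4 * t3 + 2 * k * t4

def optimise(k, start, step=0.01, sweeps=50):
    """Coordinate descent: improve one r_i at a time over a fixed grid."""
    rs = list(start)
    best = bits_per_node(rs, k)
    candidates = [j * step for j in range(1, 2001)]   # 0.01 .. 20.00
    for _ in range(sweeps):
        improved = False
        for i in range(4):
            for cand in candidates:
                trial = rs[:i] + [cand] + rs[i + 1:]
                value = bits_per_node(trial, k)
                if value < best:
                    best, rs, improved = value, trial, True
        if not improved:
            break
    return rs, best

k = 32
uniform = bits_per_node([6.0] * 4, k)       # same r for every filter
rs, best = optimise(k, start=[6.0] * 4)
print("uniform r=6.0:", round(uniform, 3), "bits/k-mer")
print("per-filter r :", [round(r, 2) for r in rs], round(best, 3), "bits/k-mer")
```

Because only improving moves are accepted, the result is never worse than the uniform starting point; whether it reaches the global optimum of the model is not guaranteed.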
Experiments I: E. coli, varying k
- 10M E. coli reads of 100bp
- 3 versions compared: 1 Bloom (=Chikhi&Rizk), 2 Bloom (t=2) and 4 Bloom (t=4)
Experiments I (cont)
Experiments II: Human dataset
564M Human reads of 100bp (~17X coverage)
method implemented in Minia
http://minia.genouest.org/