Memory-aware BWT by Segmenting Sequences
DESCRIPTION
Memory-aware BWT by Segmenting Sequences. Presented by Jiaying Wang, April 12, 2012. The 14th Asia-Pacific Web Conference (APWeb). Northeastern University, China. Motivation. Most interesting massive data sets contain string data (web data, record data, genome data, etc.)
TRANSCRIPT
![Page 1: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/1.jpg)
Memory-aware BWT by Segmenting Sequences
presented by Jiaying Wang
April 12, 2012
Northeastern University, China
The 14th Asia-Pacific Web Conference (APWeb)
![Page 2: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/2.jpg)
Motivation
• Most interesting massive data sets contain string data (web data, record data, genome data, etc.)
• BWT as a full text index provides fast substring search over large text collections
• Building the BWT has an enormous memory cost (n log n + n log σ bits)
![Page 3: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/3.jpg)
Preliminaries
• text: T[0..n − 1], T[i] ∈ Σ, |Σ| = σ
• We append a $ to the end of the text. $ does not belong to Σ
• T[i..j] is the sequence starting at position i and ending at position j
– empty string iff i > j
– prefix iff i = 0
– suffix iff j = n − 1
![Page 4: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/4.jpg)
Problem definition
• Let T[0..n − 1] be a text, and P[0..m − 1] be a query. The substring matching problem is to find all the start positions of occurrences of P in T, i.e. {i | 0 ≤ i ≤ n − m; T[i..i + m − 1] = P[0..m − 1]}.
• We take the memory cost into account.
• The process should guarantee query efficiency and low memory cost at the same time.
![Page 5: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/5.jpg)
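BWT transformation
text: mississippi$
bwt: ipssm$pissii
SA: 11 10 7 4 1 0 9 8 6 3 5 2
Sorted rotations (F = first column, L = last column):

| i | SA[i] | F | sorted rotation | L |
|---|-------|---|-----------------|---|
| 0 | 11 | $ | $mississippi | i |
| 1 | 10 | i | i$mississipp | p |
| 2 | 7 | i | ippi$mississ | s |
| 3 | 4 | i | issippi$miss | s |
| 4 | 1 | i | ississippi$m | m |
| 5 | 0 | m | mississippi$ | $ |
| 6 | 9 | p | pi$mississip | p |
| 7 | 8 | p | ppi$mississi | i |
| 8 | 6 | s | sippi$missis | s |
| 9 | 3 | s | sissippi$mis | s |
| 10 | 5 | s | ssippi$missi | i |
| 11 | 2 | s | ssissippi$mi | i |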
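The SA and bwt values on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: it sorts the suffixes naively, which is not how a large-scale builder would work, and the function names `suffix_array` and `bwt_from_sa` are chosen here for illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Return the suffix array of text using a naive sort (fine for small inputs)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: str, sa: list[int]) -> str:
    """bwt[i] is the character just before suffix sa[i]; index -1 wraps to the final '$'."""
    return "".join(text[i - 1] for i in sa)


text = "mississippi" + "$"  # '$' sorts before every letter in ASCII
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
print(sa)   # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(bwt)  # ipssm$pissii
```

Because the terminator `$` is unique, sorting suffixes gives the same order as sorting the rotations shown on the slide.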
![Page 6: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/6.jpg)
Backward search on BWT
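l ← 0, h ← bwt.length
For i from pat.length − 1 down to 0
  k = pat[i]
  l = C[k] + occ(k, l)
  h = C[k] + occ(k, h)
Return h − l
(C[k]: number of characters in the text smaller than k; occ(k, j): number of occurrences of k in bwt[0..j − 1])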
Searching "ssi"
[Figure: sorted rotations of mississippi$ with their F and L columns]
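The backward-search pseudocode on this slide can be turned into a small runnable sketch. The helper `build_index` and the dense `occ` table are assumptions made for brevity (a real index would store sampled or compressed counts), and the interval [l, h) is treated as half-open.

```python
def build_index(bwt: str):
    """Build C (number of smaller characters) and a prefix-count table for occ."""
    alphabet = sorted(set(bwt))
    c, total = {}, 0
    for ch in alphabet:
        c[ch] = total
        total += bwt.count(ch)
    # occ[ch][j] = number of occurrences of ch in bwt[:j]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for j, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][j + 1] = occ[ch][j] + (1 if ch == b else 0)
    return c, occ


def backward_search(bwt: str, pattern: str):
    """Return the half-open suffix-array interval [l, h) of suffixes starting with pattern."""
    c, occ = build_index(bwt)
    l, h = 0, len(bwt)
    for k in reversed(pattern):
        if k not in c:
            return 0, 0
        l = c[k] + occ[k][l]
        h = c[k] + occ[k][h]
        if l >= h:  # empty interval: the pattern does not occur
            return 0, 0
    return l, h


bwt = "ipssm$pissii"  # BWT of "mississippi$"
sa = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
l, h = backward_search(bwt, "ssi")
print(h - l)            # 2
print(sorted(sa[l:h]))  # [2, 5]
```

For "ssi" the interval narrows from [1, 5) after 'i' to [8, 10) and then [10, 12), so the pattern occurs twice, at text positions 2 and 5.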
![Page 7: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/7.jpg)
Memory cost analysis
• Enormous memory cost for building the BWT.
• n log n + n log σ bits, about 5n bytes (a 1 GB text needs about 5 GB).
• For example: mississippi
text: mississippi$ (12 bytes)
SA: 11 10 7 4 1 0 9 8 6 3 5 2 (12 × 4 bytes)
bwt: ipssm$pissii
total: 12 + 12 × 4 = 12 × 5 bytes
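The arithmetic above can be checked with a short sketch. It assumes 4-byte suffix-array entries and 1-byte characters, which is what the 12 × 4 and "5n bytes" figures on the slide suggest. The function name is hypothetical.

```python
def naive_bwt_build_bytes(n: int, sa_entry_bytes: int = 4, char_bytes: int = 1) -> int:
    """Bytes needed to hold the text plus a full suffix array during construction."""
    return n * char_bytes + n * sa_entry_bytes


print(naive_bwt_build_bytes(12))              # 60, i.e. 12 x 5 for "mississippi$"
print(naive_bwt_build_bytes(2**30) / 2**30)   # 5.0, i.e. about 5 GB for a 1 GB text
```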
![Page 8: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/8.jpg)
Our idea (1/2)
mississippi
missis | sippi
Search "ssi": loading one segment at a time helps us save memory
How do we find an occurrence that is split across two segments?
![Page 9: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/9.jpg)
Our idea (2/2)
mississippi
mississi | issippi (overlapping segments)
Search "ssi"
Oops, the occurrence in the overlap is found a second time
![Page 10: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/10.jpg)
BWT on Overlapped Segments
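[Figure: the text T is cut into overlapping segments T1, T2, …, Tk (lengths labelled L and l in the figure), and each segment Ti is transformed separately into BWT1, BWT2, …, BWTk.]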
![Page 11: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/11.jpg)
Searching cases
• Prerequisite: query length ≤ l
• For the second case, we have to remove duplicate results
![Page 12: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/12.jpg)
Filtering method
Filter interval f = l − m
All occurrences starting at positions in a filter interval should be filtered out.
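The overlapped-segment idea and the duplicate filter can be sketched as follows. Several things here are assumptions, not the author's algorithm: segments start every `stride` characters and extend `overlap` characters into the next one, plain `str.find` stands in for the per-segment BWT search, and the filter rule (drop a hit that lies entirely inside the leading overlap of a segment) is one simple way to realise the filter interval. Its boundary may differ by one from f = l − m as defined on the slide.

```python
def overlapped_segments(text, stride, overlap):
    """Cut text into segments of length stride + overlap, one every stride characters."""
    return [(start, text[start:start + stride + overlap])
            for start in range(0, len(text), stride)]


def find_all(segment, query):
    """Stand-in for a per-segment BWT search: all local start positions of query."""
    hits, pos = [], segment.find(query)
    while pos != -1:
        hits.append(pos)
        pos = segment.find(query, pos + 1)
    return hits


def search_overlapped(text, query, stride, overlap):
    """Search segment by segment and report each global occurrence once."""
    m = len(query)
    if not 0 < m <= overlap:
        raise ValueError("query length must be between 1 and the overlap length")
    result = []
    for j, (start, segment) in enumerate(overlapped_segments(text, stride, overlap)):
        for p in find_all(segment, query):
            # A hit lying entirely inside the leading overlap was already
            # reported by the previous segment, so it is filtered out.
            if j > 0 and p + m <= overlap:
                continue
            result.append(start + p)
    return result


print(search_overlapped("mississippi", "ssi", stride=4, overlap=4))  # [2, 5]
```

With stride 4 and overlap 4 the first two segments are "mississi" and "issippi", as on the earlier slide. The hit at local position 1 of "issippi" is the duplicate that gets filtered.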
![Page 13: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/13.jpg)
Searching algorithm
![Page 14: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/14.jpg)
BWT on Disjoint Segments
[Figure: the text T is cut into disjoint segments T1, T2, …, Tk, and each segment Ti is transformed separately into BWT1, BWT2, …, BWTk.]
![Page 15: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/15.jpg)
Searching cases
• For the second case, we need to
– 1 Find a suffix of the query that is a prefix of a segment.
– 2 Verify that the remaining prefix of the query matches the end of the left segment.
![Page 16: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/16.jpg)
Suffix checking
Time complexity: Θ(m)
![Page 17: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/17.jpg)
Prefix verification
• To verify the prefix, we can
– 1 keep the text (wastes space)
– 2 recover the text from the BWT on the fly (wastes a little time)
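The two-step handling of boundary-crossing occurrences on disjoint segments can be illustrated with a small sketch. It is a simplification under stated assumptions: the raw segment text is kept (option 1 above), `str.find`, `startswith` and `endswith` stand in for the BWT-based search and checks, the query is no longer than one segment, and the naive loop over split points does not achieve the Θ(m) suffix-checking bound quoted in the slides. All function names are hypothetical.

```python
def disjoint_segments(text, size):
    """Cut text into consecutive, non-overlapping segments of the given size."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def find_all(segment, query):
    """Stand-in for a per-segment BWT search: all local start positions of query."""
    hits, pos = [], segment.find(query)
    while pos != -1:
        hits.append(pos)
        pos = segment.find(query, pos + 1)
    return hits


def search_disjoint(text, query, size):
    """Report occurrences inside segments and occurrences crossing one boundary."""
    m = len(query)
    if not 0 < m <= size:
        raise ValueError("query length must be between 1 and the segment size")
    segments = disjoint_segments(text, size)
    result = []
    for j, segment in enumerate(segments):
        base = j * size
        # Case 1: the occurrence lies entirely inside segment j.
        result.extend(base + p for p in find_all(segment, query))
        # Case 2: the occurrence crosses the boundary between segments j-1 and j.
        if j > 0:
            left = segments[j - 1]
            for t in range(1, m):
                # Step 1: the suffix query[t:] must be a prefix of segment j.
                # Step 2: the prefix query[:t] must end the left segment.
                if segment.startswith(query[t:]) and left.endswith(query[:t]):
                    result.append(base - t)
    return sorted(result)


print(search_disjoint("mississippi", "ssi", size=6))  # [2, 5]
```

With segments "missis" and "sippi", position 2 is found inside the first segment, while position 5 is found by matching the suffix "si" at the start of the second segment and verifying the prefix "s" at the end of the first.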
![Page 18: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/18.jpg)
Searching algorithm
![Page 19: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/19.jpg)
Analysis
• Overlap method
– Memory cost: (n + l + k) × (log σ + log(n + l + k) − log(k))/k
– Time complexity: Θ(occ + δ + mk)
• Backwalk method
– Memory cost: n(log σ + log n − log k)/k bits
– Time complexity: Θ(occ + (η + k)m)
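One plausible reading of these memory formulas, offered as an interpretation and not taken from the slides, is that each is the standard construction cost N log N + N log σ bits applied to a single segment of length N. For the backwalk method, N = n/k:

```latex
\begin{aligned}
N \log N + N \log \sigma
  &= \frac{n}{k}\,\log\frac{n}{k} + \frac{n}{k}\,\log\sigma \\
  &= \frac{n\,(\log\sigma + \log n - \log k)}{k}\ \text{bits}, \qquad N = \frac{n}{k}.
\end{aligned}
```

Substituting N = (n + l + k)/k in the same way gives the overlap-method expression, (n + l + k)(log σ + log(n + l + k) − log k)/k.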
![Page 20: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/20.jpg)
Experiment
• Environment
– C++ language
– PC with 2.93 GHz Intel Core CPU
– 4 GB main memory
– Ubuntu operating system (Linux distribution)
• Data sets
– English text from the Pizza&Chili Corpus
– Genome sequence from UCSC goldenPath
![Page 21: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/21.jpg)
Performance on English
[Figure panels: Memory cost, Build time, Query time, Query time]
![Page 22: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/22.jpg)
Performance on genome
[Figure panels: Memory cost, Build time, Query time, Query time]
![Page 23: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/23.jpg)
More performance
![Page 24: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/24.jpg)
Conclusion
• We propose a novel variation of BWT called S-BWT
• Our index uses less memory than BWT
• Two query methods based on S-BWT
• Our method is faster than the BWT method on large texts.
![Page 25: Memory-aware BWT by Segmenting Sequences](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813e73550346895da8884f/html5/thumbnails/25.jpg)
Thank you!
Q&A