James A. Edwards, Uzi Vishkin, University of Maryland
![Page 1: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/1.jpg)
Truly Parallel Burrows-Wheeler Compression and Decompression
James A. Edwards, Uzi Vishkin
University of Maryland
![Page 2: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/2.jpg)
Introduction
Lossless data compression
Common tool: better use of memory (e.g., disk space) and network bandwidth.
Burrows-Wheeler (BW) compression, e.g., bzip2: relatively higher compression ratio (pro) but slower (con)
Snappy (Google): lower compression ratios but fast. Example: for MPI on large machines, speed is critical.
Our motivation: fast and high compression ratio
Unexpected: prior work unknown to us made empirical follow-up … stronger
Assumption throughout: fixed constant-size alphabet
![Page 3: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/3.jpg)
State of the field
Irregular algorithms: prevalent in CS curriculum and daily work (open-ended problems/programs).
Yet, very limited support on today’s parallel hardware. Even more limited with strong scaling.
Low support for irregular parallel code in HW
SW developers limit themselves to regular algorithms
HW benchmarks optimize HW for regular code
…
Hence, parallel data compression is of general interest: it is an undeniable application that represents a large, underrepresented “application crowd”.
![Page 4: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/4.jpg)
“Truly Parallel” BW compression
Existing parallel approach: break input into blocks, compress blocks in parallel
Practical drawback: good compression & speed only with large input
Theory drawback: not really parallel
Truly parallel: compress entire input using a parallel algorithm
Works for both large and small inputs
Can be combined with block-based approach
Applications of small inputs:
Faster (decompression) & greater compression: better use of main memory [ISCA05] & cache [ISCA12]
Warehouse-scale computers. Bandwidth between various pairs of nodes can be extremely different; for MPI and MapReduce, low bandwidth between pairs is debilitating [HP 5th ed.] (i.e., Snappy was a solution)
![Page 5: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/5.jpg)
Attempts at truly parallel BW compression
A 2011 survey paper [Eirola] stipulates that parallelizing BW could hardly work on GPGPU, and decompression would fall behind further.
Portions require “very random memory accessing”
“…it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible.”
The best GPGPU result: even more painful
In 2012, Patel et al. concurrently attempted to develop parallel code for BW compression on GPUs, but their best result was a 2.8X slowdown.
Patel reported separately a 1.2X speedup for decompression (hence, not referenced in SPAA13 version).
![Page 6: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/6.jpg)
Stages of BW compression & decompression
Compression: S → Block-Sorting Transform (BST) → SBST → Move-to-Front (MTF) encoding → SMTF → Huffman encoding → SBW
Decompression: SBW → Huffman decoding → SMTF → Move-to-Front (MTF) decoding → SBST → Inverse Block-Sorting Transform (IBST) → S
![Page 7: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/7.jpg)
Inverse Block-Sorting Transform
Serial algorithm:
1. Sort characters of SBST; the sorted order T[i] forms a ring i → T[i]
2. Starting with $, traverse the ring to recover S
Parallel algorithm:
1. Use parallel integer sorting to find T[i]
2. Use parallel list ranking to traverse the ring
Both steps require O(log n) time and O(n) work
On current parallel HW, list ranking is the step that "gets you", which is why we chose to present this step
i 0 1 2 3 4 5 6
SBST[i] a n n b $ a a
T[i] 1 5 6 4 0 2 3

Linked ring i → T[i], listed from rank 0 to END (S is recovered by traversing from $ and reading right to left):
i 3 6 2 5 1 0 4 (END)
rank[i] 0 1 2 3 4 5 6
SBST[i] b a n a n a $
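
The two steps of the inverse transform can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the input is assumed to contain a unique end marker `$` that sorts before all other characters, and list ranking is simulated with synchronous pointer jumping (O(log n) rounds but O(n log n) total work, so it does not match the O(n) work bound quoted on the slide).

```python
def sorted_positions(sbst):
    """T[i] = position of sbst[i] after a stable sort of the characters."""
    order = sorted(range(len(sbst)), key=lambda i: sbst[i])  # Python's sort is stable
    T = [0] * len(sbst)
    for pos, i in enumerate(order):
        T[i] = pos
    return T


def inverse_bst(sbst, end="$"):
    """Recover S from SBST by ranking the ring i -> T[i] (assumes a unique end marker)."""
    n = len(sbst)
    T = sorted_positions(sbst)
    start = sbst.index(end)      # the traversal starts at the end marker
    tail = T.index(start)        # last node before the ring closes
    nxt = T[:]
    nxt[tail] = tail             # cut the ring into a list ending at tail
    dist = [0 if i == tail else 1 for i in range(n)]
    # Pointer jumping: each round doubles the distance covered by every node.
    for _ in range((n - 1).bit_length()):
        dist = [dist[i] + dist[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    # dist[i] is the rank of node i, i.e. the position of sbst[i] in S.
    s = [None] * n
    for i in range(n):
        s[dist[i]] = sbst[i]
    return "".join(s)
```

On the slide's example, `sorted_positions("annb$aa")` reproduces the T[i] row of the table, and `inverse_bst("annb$aa")` recovers `banana$`. Each list comprehension inside the loop corresponds to one round in which all nodes could be updated concurrently.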
![Page 8: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/8.jpg)
Conclusion and where to go from here?
Despite being originally described as a serial algorithm, BW compression can be accomplished by a parallel algorithm.
Material for a few good exercises on prefix sum & list ranking?
For a more detailed description of our algorithm, see reference [4] in our brief announcement.
This algorithm demonstrates the importance of parallel primitives such as prefix sums and list ranking.
Requires support of fine-grained, irregular parallelism and sometimes also strong scaling. Issues on all current parallel hardware.
Indeed: recent work from UC Davis (2012) on parallel BW compression on GPUs, which we had missed, took away ~20% of our originality (same Step 2). Yet it failed to achieve any speedup on compression, reporting instead a slowdown of 2.8x. For decompression: 1.2X speedup.
On the UMD experimental Explicit Multi-Threading (XMT) architecture, we achieved speedups of 25x for compression and 13x for decompression [5].
On balance, the UC Davis paper was a huge gift: 70x vs. GPU for compression and 11X for decompression.
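
The XMT-vs-GPU figures appear to follow directly from the numbers quoted above, assuming each speedup or slowdown is measured against a comparable serial baseline (an assumption not stated on the slide):

$$
\frac{\text{XMT compression speedup}}{\text{GPU compression speedup}} = \frac{25}{1/2.8} = 25 \times 2.8 = 70,
\qquad
\frac{\text{XMT decompression speedup}}{\text{GPU decompression speedup}} = \frac{13}{1.2} \approx 10.8 \approx 11 .
$$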
![Page 9: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/9.jpg)
Where to go from here?
Remaining options for the community:
Figure out how to do it on current HW
Or, bash PRAM
Or, the alternative we pursued: develop a parallel algorithm that will work well on buildable HW designed to support the best-established parallel algorithmic theory
Final thought connecting to several other SPAA presentations:
This is an example where MPI on large systems works in tandem with PRAM-like support on small systems.
Intra-node (of a large system): use PRAM compression & decompression algorithms for inter-node MPI messages
Counter-argument to an often unstated position: that we need the same parallel programming model at very large and small scales
![Page 10: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/10.jpg)
References
[4] J. A. Edwards and U. Vishkin. Parallel algorithms for Burrows-Wheeler compression and decompression. TR, UMD, 2012. http://hdl.handle.net/1903/13299.
[5] J. A. Edwards and U. Vishkin. Empirical speedup study of truly parallel data compression. TR, UMD, 2013. http://hdl.handle.net/1903/13890.
![Page 11: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/11.jpg)
Backup slides
![Page 12: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/12.jpg)
Block-Sorting Transform (BST)
Goal: bring occurrences of characters together
Serial algorithm:
1. Form a list of all rotations of the input string
2. Sort the list lexicographically
3. Take the last column of the list as output
Equivalent to sorting the suffixes of the input string
Input to BST: banana$
List of rotations: banana$ anana$b nana$ba ana$ban na$bana a$banan $banana
After sort: $banana a$banan ana$ban anana$b banana$ na$bana nana$ba
Output of BST: annb$aa
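
The serial algorithm above can be written directly from its three steps. The sketch below is illustrative only (the function name is hypothetical) and uses the quadratic-space list of rotations exactly as described, rather than an efficient suffix-based method.

```python
def bst_by_rotations(s):
    """Block-sorting transform via an explicit, sorted list of rotations."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]  # step 1: all rotations
    rotations.sort()                                    # step 2: lexicographic sort
    return "".join(r[-1] for r in rotations)            # step 3: last column
```

For the slide's input, `bst_by_rotations("banana$")` yields `annb$aa`.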
![Page 13: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/13.jpg)
Block-Sorting Transform (BST)
Parallel algorithm:
1. Find the suffix tree of S (O(log² n) time, O(n) work)
2. Find the suffix array SA of S by traversing the suffix tree (Euler tour technique: O(log n) time, O(n) work)
3. Permute characters according to SA (O(1) time, O(n) work)
[Figure: suffix tree of S = banana$, with leaves labeled by suffix start index]
i 0 1 2 3 4 5 6
S[i] b a n a n a $
SA[i] 6 5 3 1 0 4 2
S[SA[i]-1] a n n b $ a a
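
Step 3 of the parallel algorithm (permuting characters according to SA) is a simple gather. In the sketch below the suffix array is built by naive sorting of suffixes, purely as a stand-in for the suffix-tree construction of steps 1 and 2; both function names are hypothetical, and the input is assumed to end with a unique `$` that sorts before all other characters.

```python
def suffix_array_naive(s):
    """Suffix array by direct sorting of suffixes (stand-in for steps 1 and 2)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bst_from_suffix_array(s, sa):
    """Step 3: output[i] = S[SA[i] - 1], wrapping around when SA[i] == 0."""
    n = len(s)
    # Every output position is independent, so this maps to O(1) parallel time.
    return "".join(s[(i - 1) % n] for i in sa)
```

With `s = "banana$"`, `suffix_array_naive(s)` gives the SA[i] row of the table and `bst_from_suffix_array(s, sa)` gives its last row, `annb$aa`.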
![Page 14: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/14.jpg)
Move-to-Front (MTF) encoding
Goal: Assign low codes to repeated characters
Serial algorithm: Maintain list of characters in order last seen; each character is encoded as its current index in the list
Parallel algorithm: use prefix sums to compute the MTF list for each character (O(log n) time, O(n) work)
Associative binary operator: X + Y = Y concat (X - Y)
i 0 1 2 3
SBST[i] a n n b
SMTF[i] 1 3 0 3

MTF lists Li:
j L0[j] L1[j] L2[j] L3[j]
0 $ a n n
1 a $ a a
2 b b $ $
3 n n b b

Prefix-sums tree (leaves to root):
Leaves: assumed prefix n b a $, then SBST a n n b $ a a
Level 1: b,n $,a n,a b,n a,$ a
Level 2: $,a,b,n b,n,a a,$
Level 3: b,n,a,$ a,$
Root: a,$,b,n
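
The operator X + Y = Y concat (X - Y) can be tried out in a few lines. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the initial list `$,a,b,n` is the one shown for L0, and `itertools.accumulate` performs the prefix computation serially. Because the operator is associative, a parallel prefix-sums routine could replace it without changing the result.

```python
from itertools import accumulate


def combine(x, y):
    """X + Y = Y concat (X - Y): characters of Y first, then those of X not in Y."""
    return y + tuple(c for c in x if c not in y)


def mtf_encode_serial(text, initial):
    """Reference serial encoder: emit the index, then move the character to the front."""
    lst, codes = list(initial), []
    for c in text:
        j = lst.index(c)
        codes.append(j)
        lst.insert(0, lst.pop(j))
    return codes


def mtf_encode_scan(text, initial):
    """Encoder driven by prefix 'sums' of single-character lists under combine."""
    items = [tuple(initial)] + [(c,) for c in text[:-1]]
    lists = accumulate(items, combine)   # L0, L1, ..., one list per input character
    return [lst.index(c) for lst, c in zip(lists, text)]
```

Both encoders agree on the slide's example: `mtf_encode_scan("annb$aa", "$abn")` starts with the codes 1 3 0 3 shown in the SMTF row. With a constant-size alphabet each `combine` call takes constant time.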
![Page 15: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/15.jpg)
Move-to-Front (MTF) decoding
Same algorithm as encoding, with the following changes:
Serial: The MTF lists are used in reverse
Parallel: Instead of combining MTF lists, combine permutation functions
[Figure: permutation functions for SMTF = 1 3 0 3 (e.g., 0123, 1023) and their combination with the + operator (e.g., 3102)]
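
One way to read "combine permutation functions" is sketched below. It is an interpretation rather than the authors' code: each code k is taken to stand for the permutation that moves list position k to the front, permutations are composed with a prefix computation, and each character is then looked up in the initial list. The function names and the initial list `$abn` are assumptions for illustration, and `itertools.accumulate` stands in for a parallel prefix-sums routine.

```python
from itertools import accumulate


def mtf_permutation(code, k):
    """Permutation p with new_list[j] = old_list[p[j]] after moving index `code` to front."""
    return (code, *range(code), *range(code + 1, k))


def compose(p, q):
    """Apply p first, then q: result[j] = p[q[j]] (associative)."""
    return tuple(p[j] for j in q)


def mtf_decode_scan(codes, initial):
    """Decode MTF codes using prefix compositions of permutation functions."""
    k = len(initial)
    identity = tuple(range(k))
    perms = [mtf_permutation(c, k) for c in codes]
    # prefixes[i] maps positions of the list before step i back to the initial list.
    prefixes = accumulate([identity] + perms[:-1], compose)
    return "".join(initial[P[c]] for P, c in zip(prefixes, codes))
```

Composing the permutations for codes 1 and 3, `compose((1, 0, 2, 3), (3, 0, 1, 2))`, gives `(3, 1, 0, 2)`, which matches the 3102 that survives from the figure. Decoding `[1, 3, 0, 3, 3, 3, 0]` with the initial list `$abn` returns `annb$aa`.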
![Page 16: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/16.jpg)
Huffman Encoding
Goal: Assign shorter bit strings to more-frequent MTF codes
The parallelization of this step is already well known
Serial algorithm:
1. Count frequencies of characters
2. Build Huffman table based on frequencies
3. Encode characters using the table
Parallel algorithm:
1. Use integer sorting to count frequencies (O(log n) time, O(n) work)
2. Build Huffman table using the (standard, heap-based) serial algorithm (O(1) time and work, since the alphabet has constant size)
3. (a) Compute the prefix sums of the code lengths to determine where in the output to write the code for each character (O(log n) time, O(n) work)
(b) Actually write the output (O(1) time, O(n) work)
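
Steps 3(a) and 3(b) can be illustrated as follows. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, frequencies are counted with `collections.Counter` instead of integer sorting, and the prefix sums are computed serially with `itertools.accumulate`. The point is that once the offsets are known, every codeword can be written independently.

```python
import heapq
from collections import Counter
from itertools import accumulate


def huffman_table(freqs):
    """Standard heap-based construction; returns {symbol: bit string}."""
    if not freqs:
        return {}
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # single symbol: give it a 1-bit code
        return {s: "0" for s in heap[0][2]}
    tie = len(heap)                          # unique tie-breaker, so dicts are never compared
    while len(heap) > 1:
        f0, _, t0 = heapq.heappop(heap)
        f1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_encode(symbols):
    """Encode using prefix sums of code lengths to place each codeword."""
    table = huffman_table(Counter(symbols))
    lengths = [len(table[s]) for s in symbols]
    offsets = [0] + list(accumulate(lengths))    # step 3(a): where each code starts
    out = [None] * offsets[-1]
    for i, s in enumerate(symbols):              # step 3(b): independent writes
        out[offsets[i]:offsets[i] + lengths[i]] = table[s]
    return "".join(out), table
```

For example, `huffman_encode([1, 3, 0, 3, 3, 3, 0])` assigns the most frequent MTF code (3) a one-bit codeword and produces a 10-bit output.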
![Page 17: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/17.jpg)
Huffman Decoding
Serial algorithm: Read through compressed data, decoding one character at a time
Parallel algorithm: partition input and apply serial algorithm to each partition
Problem: Decoding cannot start in the middle of the codeword for a character
Solution: Identify a set of valid starting bits using prefix sums (O(log n) time, O(n) work)
[Figure: example compressed bit string 01 1 1 0 0 0 010, repeated four times]
![Page 18: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/18.jpg)
Huffman Decoding
How to identify valid starting positions:
Divide the input string into partitions of length l (the length of the longest Huffman codeword)
1. Assign a processor to each bit in the input. Processor i decodes the compressed input starting at index i and stops when it crosses a partition boundary, recording the index where it stopped. (O(1) time, O(n) work)
Now each partition has l pointers entering it, all of which originate from the immediately preceding partition.
2. Use prefix sums to merge consecutive pointers. (O(log n) time, O(n) work)
Now each partition still has l pointers entering it, but they all originate from the first partition.
3. For each bit in the input, mark it as a valid starting position if and only if the pointer that points to that bit originates from the first bit (index 0) of the first partition (O(1) time, O(n) work)
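
The three steps above can be simulated serially as follows. This sketch makes several assumptions beyond the slide: the function names are hypothetical, the code table is passed as a `{bit string: symbol}` dictionary for a complete prefix code (as Huffman construction produces), pointers are stored as offsets within the next partition, and `itertools.accumulate` stands in for parallel prefix sums over pointer composition.

```python
from itertools import accumulate


def decode_until(bits, start, codes, limit):
    """Decode codewords from `start` until reaching an index >= limit.
    Returns (stop index, decoded symbols)."""
    out, j = [], start
    while j < limit:
        k = j + 1
        while bits[j:k] not in codes:
            if k >= len(bits):               # ran out of input mid-codeword
                return len(bits), out
            k += 1
        out.append(codes[bits[j:k]])
        j = k
    return j, out


def valid_starts(bits, codes):
    """One valid starting index per partition of length l (longest codeword)."""
    n, l = len(bits), max(map(len, codes))
    num_parts = -(-n // l)                   # ceiling division
    # Step 1: every bit records where decoding from it enters the next partition.
    jump = [decode_until(bits, i, codes, min((i // l + 1) * l, n))[0] for i in range(n)]
    maps = [tuple(jump[p * l + o] - (p + 1) * l for o in range(l))
            for p in range(num_parts - 1)]
    # Step 2: merge consecutive pointers (associative function composition).
    compose = lambda g, f: tuple(f[x] for x in g)
    prefix = accumulate([tuple(range(l))] + maps, compose)
    # Step 3: keep the pointer that originates from bit 0 of the first partition.
    return [p * l + F[0] for p, F in enumerate(prefix)]


def decode_by_partitions(bits, codes):
    """Apply the serial decoder to each partition from its valid start."""
    n, l = len(bits), max(map(len, codes))
    out = []
    for p, s in enumerate(valid_starts(bits, codes)):   # partitions are independent
        out += decode_until(bits, s, codes, min((p + 1) * l, n))[1]
    return out
```

As a usage example with a hypothetical table `{"00": 1, "01": 0, "1": 3}` and input `"0010111101"`, the partitions have length 2, the valid starts are indices 0, 2, 5, 6, 8, and decoding the partitions independently yields 1 3 0 3 3 3 0.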
![Page 19: James A. Edwards, Uzi Vishkin University of Maryland](https://reader036.vdocuments.us/reader036/viewer/2022082712/56649e2a5503460f94b17ecc/html5/thumbnails/19.jpg)
Lossless data compression on GPGPU architectures (2011)
Inverse BST: “Problems would possibly arise from poor GPU performance of the very random memory accessing caused by the scattering of characters throughout the string.”
MTF decoding: “Speeding up decoding on GPGPU platforms might be more challenging since the character lookup is already constant time on serial implementations, and starting decoding from multiple places is difficult since the state of the stack is not known at the other places.”
Huffman decoding: “Here again, decompression is harder. This is due to the fact that the decoder doesn’t know where one codeword ends and another begins before it has decoded the whole prior input.”
“As for the codeword tables for the VLE, it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible.”