Truly Parallel Burrows-Wheeler Compression and Decompression
James A. Edwards, Uzi Vishkin
University of Maryland

TRANSCRIPT

Page 1: Title

Truly Parallel Burrows-Wheeler Compression and Decompression
James A. Edwards, Uzi Vishkin
University of Maryland

Page 2: Introduction

Lossless data compression: a common tool for making better use of memory (e.g., disk space) and network bandwidth.
Burrows-Wheeler (BW) compression (e.g., bzip2): relatively high compression ratio (pro) but slow (con).
Snappy (Google): lower compression ratio but fast. Example: for MPI on large machines, speed is critical.
Our motivation: fast compression with a high compression ratio.
Unexpected: prior work unknown to us made our empirical follow-up … stronger.
Assumption throughout: a fixed, constant-size alphabet.

Page 3: James A. Edwards, Uzi Vishkin University of Maryland

State of the fieldIrregular algorithms: prevalent in CS curriculum

and daily work (open-ended problems/programs).Yet, very limited support on today’s parallel

hardware. Even more limited with strong scaling Low support for irregular parallel code in HW

SW developers limit themselves to regular algorithms HW benchmarks optimize HW for regular code …

Namely, parallel data compression is of general interest as an undeniable application representing a big underrepresented “application crowd”

Page 4: “Truly Parallel” BW compression

Existing parallel approach: break the input into blocks and compress the blocks in parallel.
Practical drawback: good compression and speed only with large input.
Theory drawback: not really parallel.
Truly parallel: compress the entire input using a parallel algorithm. Works for both large and small inputs; can be combined with the block-based approach.
Applications of small inputs:
Faster (de)compression and greater compression allow better use of main memory [ISCA05] and cache [ISCA12].
Warehouse-scale computers: bandwidth between various pairs of nodes can be extremely different; for MPI and MapReduce, low bandwidth between pairs is debilitating [HP 5th ed.] (i.e., Snappy was a solution to this).

Page 5: Attempts at truly parallel BW compression

A 2011 survey paper [Eirola] stipulates that parallelizing BW could hardly work on GPGPUs, and that decompression would fall behind even further:
Portions require “very random memory accessing”.
“…it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible.”
The best GPGPU result was even more painful:
In 2012, Patel et al. concurrently attempted to develop parallel code for BW compression on GPUs, but their best result was a 2.8x slowdown.
Patel et al. reported separately a 1.2x speedup for decompression (hence, not referenced in the SPAA13 version).

[Note (James Edwards): The full author list is R. Patel, Y. Zhang, J. Mak, A. Davidson, J. Owens.]
Page 6: Stages of BW compression & decompression

Compression: S → Block-Sorting Transform (BST) → S_BST → Move-to-Front (MTF) encoding → S_MTF → Huffman encoding → S_BW
Decompression: S_BW → Huffman decoding → S_MTF → MTF decoding → S_BST → Inverse Block-Sorting Transform (IBST) → S

Page 7: Inverse Block-Sorting Transform

Serial algorithm:
1. Sort the characters of S_BST; the sorted order T[i] forms a ring i → T[i].
2. Starting with $, traverse the ring to recover S.

Parallel algorithm:
1. Use parallel integer sorting to find T[i].
2. Use parallel list ranking to traverse the ring.
Both steps require O(log n) time and O(n) work.
On current parallel HW, list ranking is the step that gets you; this is why we chose to highlight this step.

Example (S = banana$):

i         0 1 2 3 4 5 6
S_BST[i]  a n n b $ a a
T[i]      1 5 6 4 0 2 3

Starting at the position of $ (i = 4), the linked ring i → T[i] visits 4, 0, 1, 5, 2, 6, 3 (END); ranking the ring and reading off S_BST recovers S (read right to left: banana$).
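
To make the serial version concrete, here is a minimal Python sketch (the function name inverse_bst and the use of a unique '$' sentinel are our illustrative assumptions; in the parallel version the stable sort becomes parallel integer sorting and the traversal becomes list ranking):

    def inverse_bst(s_bst: str) -> str:
        """Serial inverse BST; assumes a unique '$' sentinel in the input."""
        n = len(s_bst)
        # Step 1: stably sort positions by character; inverting that
        # permutation yields T, the ring i -> T[i].
        order = sorted(range(n), key=lambda i: s_bst[i])  # Python's sort is stable
        T = [0] * n
        for rank, i in enumerate(order):
            T[i] = rank
        # Step 2: starting at '$', traverse the ring; this spells S backwards.
        out, i = [], s_bst.index("$")
        for _ in range(n):
            out.append(s_bst[i])
            i = T[i]
        return "".join(reversed(out))

    assert inverse_bst("annb$aa") == "banana$"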

Page 8: Conclusion and where to go from here?

Despite being originally described as a serial algorithm, BW compression can be accomplished by a parallel algorithm. Material for a few good exercises on prefix sums and list ranking?
For a more detailed description of our algorithm, see reference [4] in our brief announcement.
This algorithm demonstrates the importance of parallel primitives such as prefix sums and list ranking. It requires support for fine-grained, irregular parallelism, and sometimes also strong scaling; these are issues on all current parallel hardware. Indeed:
While recent work from UC Davis (2012) on parallel BW compression on GPUs that we had missed taxed ~20% of our originality (same Step 2), it failed to achieve any speedup on compression; instead, a slowdown of 2.8x. For decompression: a 1.2x speedup.
On the UMD experimental Explicit Multi-Threading (XMT) architecture, we achieved speedups of 25x for compression and 13x for decompression [5]. On balance, the UC Davis paper is a huge gift: 70x vs. the GPU for compression and 11x for decompression.

[Note (Uzi Vishkin): how about representing it in a figure as a structure, similar to my 1991 paper (& class notes)?]
Page 9: Where to go from here?

Remaining options for the community:
Figure out how to do it on current HW.
Or, bash PRAM.
Or, the alternative we pursued: develop a parallel algorithm that will work well on buildable HW designed to support the best-established parallel algorithmic theory.

Final thought, connecting to several other SPAA presentations: this is an example where MPI on large systems works in tandem with PRAM-like support on small systems. Intra-node (within a large system), use PRAM compression & decompression algorithms for inter-node MPI messages. This is a counter-argument to an often unstated position: that we need the same parallel programming model at very large and very small scales.

Page 10: References

[4] J. A. Edwards and U. Vishkin. Parallel algorithms for Burrows-Wheeler compression and decompression. Technical report, University of Maryland, 2012. http://hdl.handle.net/1903/13299.

[5] J. A. Edwards and U. Vishkin. Empirical speedup study of truly parallel data compression. Technical report, University of Maryland, 2013. http://hdl.handle.net/1903/13890.

Page 11: Backup slides

Page 12: Block-Sorting Transform (BST)

Goal: bring occurrences of characters together.
Serial algorithm:
1. Form a list of all rotations of the input string.
2. Sort the list lexicographically.
3. Take the last column of the list as output.
This is equivalent to sorting the suffixes of the input string.

Example (input to BST: banana$):

List of rotations:   Sorted:
banana$              $banana
anana$b              a$banan
nana$ba              ana$ban
ana$ban              anana$b
na$bana              banana$
a$banan              na$bana
$banana              nana$ba

Output of BST (last column of the sorted list): annb$aa
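
A minimal serial Python sketch of these three steps (quadratic space, for illustration only; the function name bst is our own):

    def bst(s: str) -> str:
        """Serial Block-Sorting Transform; assumes s ends with a unique '$'."""
        n = len(s)
        rotations = sorted(s[i:] + s[:i] for i in range(n))  # steps 1 and 2
        return "".join(rot[-1] for rot in rotations)         # step 3: last column

    assert bst("banana$") == "annb$aa"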

Page 13: Block-Sorting Transform (BST), parallel algorithm

Parallel algorithm:
1. Find the suffix tree of S (O(log² n) time, O(n) work).
2. Find the suffix array SA of S by traversing the suffix tree (Euler tour technique: O(log n) time, O(n) work).
3. Permute the characters according to SA (O(1) time, O(n) work).

[Figure: suffix tree of banana$, with leaves 6, 5, 3, 1, 0, 4, 2 in left-to-right order.]

i           0 1 2 3 4 5 6
S[i]        b a n a n a $
SA[i]       6 5 3 1 0 4 2
S[SA[i]-1]  a n n b $ a a
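
A serial Python sketch of step 3 (the suffix array here is built by plain sorting just for illustration; our algorithm obtains it in parallel from the suffix tree):

    def bst_from_sa(s: str) -> str:
        """BST output via a suffix array; assumes s ends with a unique '$'."""
        n = len(s)
        sa = sorted(range(n), key=lambda i: s[i:])  # SA[i] = start of i-th smallest suffix
        # Output S[SA[i]-1]; for SA[i] == 0, Python's s[-1] wraps around to
        # the last character, matching the rotation semantics.
        return "".join(s[j - 1] for j in sa)

    assert bst_from_sa("banana$") == "annb$aa"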

Page 14: Move-to-Front (MTF) encoding

Goal: assign low codes to repeated characters.
Serial algorithm: maintain a list of the characters in the order last seen; for each input character, output its position in the list and move it to the front.
Parallel algorithm: use prefix sums to compute the MTF list for each character (O(log n) time, O(n) work). Associative binary operator: X + Y = Y concat (X − Y).

Example (encoding the prefix "annb" of S_BST = annb$aa, with initial MTF list L0 = $, a, b, n):

i           0        1        2        3
S_BST[i]    a        n        n        b
S_MTF[i]    1        3        0        3
L(i+1)      a,$,b,n  n,a,$,b  n,a,$,b  b,n,a,$

In the parallel version, each character contributes the singleton list of itself, and prefix sums under the operator above compute every Li; e.g., ($, a, b, n) + (a) = a, $, b, n and (a, $, b, n) + (n) = n, a, $, b.
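
A Python sketch of both pieces (the function names mtf_encode and combine are ours; the alphabet parameter gives the assumed initial list):

    def mtf_encode(s: str, alphabet: str) -> list:
        """Serial MTF encoding starting from the given initial list."""
        lst, out = list(alphabet), []
        for c in s:
            j = lst.index(c)            # current code for c
            out.append(j)
            lst.insert(0, lst.pop(j))   # move c to the front
        return out

    def combine(x: list, y: list) -> list:
        """The associative operator from this slide: X + Y = Y concat (X - Y)."""
        return y + [c for c in x if c not in y]

    assert mtf_encode("annb", "$abn") == [1, 3, 0, 3]
    assert combine(list("$abn"), ["a"]) == list("a$bn")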

Page 15: Move-to-Front (MTF) decoding

Same algorithm as encoding, with the following changes:
Serial: the MTF lists are used in reverse (each code indexes into the current list to recover the character).
Parallel: instead of combining MTF lists, combine permutation functions (each MTF code induces a permutation of the list positions, and permutation composition is associative, so prefix sums still apply).

[Figure: composing the permutation functions of list positions 0–3 induced by the codes of S_MTF = 1, 3, 0, 3; e.g., the permutations for codes 1 and 3 compose into a single permutation of 0123.]
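
A Python sketch of the serial decoder plus the permutation view (the helper names and the perm[new_pos] = old_pos convention are illustrative assumptions of ours):

    def mtf_decode(codes: list, alphabet: str) -> str:
        """Serial MTF decoding: each code indexes into the current list."""
        lst, out = list(alphabet), []
        for j in codes:
            out.append(lst[j])
            lst.insert(0, lst.pop(j))   # same move-to-front update as encoding
        return "".join(out)

    def perm_of_code(j: int, k: int) -> list:
        """Permutation of k list positions induced by one MTF step with code j
        (perm[new_pos] = old_pos)."""
        return [j] + [p for p in range(k) if p != j]

    def compose(p: list, q: list) -> list:
        """Apply p, then q; composition is associative, enabling prefix sums."""
        return [p[i] for i in q]

    assert mtf_decode([1, 3, 0, 3], "$abn") == "annb"
    assert compose(perm_of_code(1, 4), perm_of_code(3, 4)) == [3, 1, 0, 2]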

Page 16: Huffman encoding

Goal: assign shorter bit strings to more-frequent MTF codes. The parallelization of this step is already well known.
Serial algorithm:
1. Count the frequencies of the characters.
2. Build the Huffman table based on the frequencies.
3. Encode the characters using the table.
Parallel algorithm:
1. Use integer sorting to count the frequencies (O(log n) time, O(n) work).
2. Build the Huffman table using the standard, heap-based serial algorithm (O(1) time and work, since the alphabet has constant size).
3. (a) Compute the prefix sums of the code lengths to determine where in the output to write the code for each character (O(log n) time, O(n) work). (b) Actually write the output (O(1) time, O(n) work).
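
A serial Python sketch of step 3, using a small assumed code table (the function name huffman_pack and the table are ours; a parallel version computes the offsets with a prefix-sums primitive and performs the writes concurrently):

    from itertools import accumulate

    def huffman_pack(codes: dict, s: str) -> str:
        """Write each character's codeword into a shared bit array at the
        offset given by a prefix sum of the code lengths (non-empty s)."""
        lengths = [len(codes[c]) for c in s]
        offsets = [0] + list(accumulate(lengths[:-1]))   # step 3a: prefix sums
        out = ["?"] * (offsets[-1] + lengths[-1])        # shared output "bits"
        for i, c in enumerate(s):                        # step 3b: independent writes
            out[offsets[i]:offsets[i] + lengths[i]] = codes[c]
        return "".join(out)

    # Assumed prefix-free table for the running example:
    table = {"a": "0", "n": "10", "b": "110", "$": "111"}
    assert huffman_pack(table, "annb$aa") == "0101011011100"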

Page 17: Huffman decoding

Serial algorithm: read through the compressed data, decoding one character at a time.
Parallel algorithm: partition the input and apply the serial algorithm to each partition.
Problem: decoding cannot start in the middle of the codeword for a character.
Solution: identify a set of valid starting bits using prefix sums (O(log n) time, O(n) work).

[Figure: the same compressed bit string repeated four times, with decoding attempted from four different starting bits; only some starts fall on codeword boundaries.]

Page 18: Huffman decoding, continued

How to identify valid starting positions:
Divide the input string into partitions of length l (the length of the longest Huffman codeword).
1. Assign a processor to each bit of the input. Processor i decodes the compressed input starting at index i and stops when it crosses a partition boundary, recording the index where it stopped (O(1) time, O(n) work).
Now each partition has l pointers entering it, all of which originate from the immediately preceding partition.
2. Use prefix sums to merge consecutive pointers (O(log n) time, O(n) work).
Now each partition still has l pointers entering it, but they all originate from the first partition.
3. Mark each bit of the input as a valid starting position if and only if the pointer that points to that bit originates from the first bit (index 0) of the first partition (O(1) time, O(n) work). A serial simulation follows below.
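
A serial Python simulation of this scheme, under the assumption of a prefix-free code table (the names valid_starts and stop are ours; the real algorithm computes all the pointers at once and merges them with prefix sums instead of walking the chain):

    def valid_starts(bits: str, codes: dict) -> list:
        """Mark the bits reachable from bit 0 via partition-crossing decodes."""
        L = max(len(c) for c in codes.values())   # partition length l
        cws = set(codes.values())                 # assumed prefix-free
        n = len(bits)

        def stop(i: int) -> int:
            """Pointer of the parallel scheme: decode from bit i and return the
            first codeword start at or past the next partition boundary."""
            boundary = (i // L + 1) * L
            while i < boundary and i < n:
                j = i + 1
                while j <= n and bits[i:j] not in cws:
                    j += 1
                i = j                             # advance to next codeword start
            return i

        valid, i = [False] * n, 0
        while i < n:                              # chain of merged pointers from bit 0
            valid[i] = True
            i = stop(i)
        return valid

    table = {"a": "0", "n": "10", "b": "110", "$": "111"}
    starts = valid_starts("0101011011100", table)
    assert [i for i, v in enumerate(starts) if v] == [0, 3, 8, 11, 12]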

Page 19: Lossless data compression on GPGPU architectures (2011)

Inverse BST: “Problems would possibly arise from poor GPU performance of the very random memory accessing caused by the scattering of characters throughout the string.”

MTF decoding: “Speeding up decoding on GPGPU platforms might be more challenging since the character lookup is already constant time on serial implementations, and starting decoding from multiple places is difficult since the state of the stack is not known at the other places.”

Huffman decoding: “Here again, decompression is harder. This is due to the fact that the decoder doesn’t know where one codeword ends and another begins before it has decoded the whole prior input.”

“As for the codeword tables for the VLE, it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible.”