[ieee comput. soc 12th international workshop on rapid system protyping. rsp 2001 - monterey, ca,...

Prototyping of Efficient Hardware Algorithms for Data Compression in Future Communication Systems

A. Mukherjee, N. Motgi

University of Central Florida, Department of Computer Science

Orlando, Florida 32816-2362, USA, amar@cs.ucf.edu

J. Becker, A. Friebe, C. Habermann, M. Glesner

Institute of Microelectronic Systems, Darmstadt University of Technology

D-64283 Darmstadt, Germany, becker@mes.tu-darmstadt.de

Abstract

Due to high bandwidth requirements of up to 2 Mbit/sec in third generation mobile communication systems, efficient data compression approaches are necessary to reduce communication and storage costs. The current state of VLSI technology promises complete System-on-Chip (SoC) solutions for both mobile and network based communication systems, including new compression algorithms based on the Burrows-Wheeler transform (BWT). The most complex task of the BWT algorithm is the lexicographic sorting of the n cyclic rotations of a given string of n characters. The paper discusses the feasibility and VLSI implementation of a scalable BWT architecture by simulating and prototyping its highly utilized systolic hardware structure with Virtex FPGAs.

1 Introduction

In the last decade of the 20th century we have experienced an unprecedented explosion in the amount of digital data transmitted via the Internet, representing text, images, video, sound, computer programs, etc. [23]. This has been coupled with a rapid development of high-bandwidth mobile communication systems, which will play a critical strategic and economic role for human civilization. These systems will demand services that require rapid transmission of data with efficient utilization of the available bandwidth. Second generation digital communication systems such as GSM and IS-95 have a bandwidth capability in the range of 9.6 to 14.4 Kbit/sec, and this may go up to 2 Mbit/sec with the third generation UMTS system [11]. Data compression offers an attractive approach to reducing the redundancy in data representation and thereby decreasing the storage required for that data. In parallel with these developments, very large scale integration (VLSI) technology has also made tremendous strides and opened up the possibility of implementing a system-on-chip (SoC) by integrating various functions of mobile communication systems [2], [6]. With these trends expected to continue, it makes sense to pursue research in developing hardware algorithms for functions of high-speed communication systems, both mobile and network based, using the emerging FPGA (field programmable gate array) architectures [5]. FPGA technology is highly suitable for reconfigurable hardware structures for the kind of application algorithms that can be decomposed into smaller parallel tasks and that utilize a pipelined data flow typical of systolic algorithms. FPGAs could be used as a first step for prototyping and synthesizing the algorithms [18], but more significantly, the challenge is to find new innovative SoCs that could incorporate reconfigurable architectures for these applications. FPGA architectures provide high-speed, high-density, low-cost and low-power alternatives to general purpose DSP and ASIC solutions, either by off-loading computation intensive functions or by operating concurrently with them [1]. They also provide flexibility, adaptability and risk minimization (time-to-market) for changing functional requirements. This work addresses the development of hardware algorithms, application-specific architectures and implementations for data compression in future communication systems.

In this paper, we present an efficient scalable FPGA architecture for the suffix sorting problem. The suffix sorting problem occurs in various pattern matching and text search operations [7], [15], [16] and in text compression applications [3], [10], [20], [21], [14], [23]. Suffix sorting produces an array of pointers to the suffixes of a string of characters from a finite alphabet, arranged in lexicographic order. The suffix array can also be used to speed up and improve the compression ratio of dictionary based algorithms such as the LZ77 family [24], statistical methods such as the PPM family [9], and the block sorting compression algorithm based on the Burrows-Wheeler transformation (BWT) [4], [8], which is the main concern of this paper. We propose a novel systolic hardware algorithm for suffix sorting, which is an important and time consuming computation step in the BWT based compression algorithm. We implement an FPGA prototype of this architecture on a Xilinx Virtex board with a complexity of 100 cells and a speed of 45 MHz. The architecture is easily scalable and could be mapped onto an ASIC chip. It can be used as a component of a system for both text search and data compression applications, and also in multimedia and mobile computation.

2 Problem Definition

A number of sophisticated algorithms have been proposed recently for lossless data compression. BWT based compression [8] and the PPM family of algorithms [9] outperform classical compression algorithms such as Huffman [13], arithmetic coding [19], Gzip, Unix compress and several others [23]. BWT has compression performance comparable to that of PPM but is much faster than PPM. Several improvements of the BWT algorithm have been proposed recently [12], [22]. The basic idea of the BWT is to divide the input text into fixed size blocks. For each block, all cyclic rotations of the block are sorted lexicographically. If the resulting sorted strings are arranged in the form of a matrix, then the last column of this matrix, together with the index of the row in which the original text block is located, becomes the output of this preprocessing stage. This output shows clustering of characters that have a similar forward context, which can then be exploited by an entropy coder such as a Huffman or arithmetic coder to compress the file. The lexicographic sorting is also referred to in the literature as suffix sort or block sort. As an example, consider the character string abaabc over the alphabet Σ = {a, b, c}, stored in consecutive memory addresses (0,1,2,3,4,5). Assuming a < b < c, the suffix array for this string is (2,0,3,1,4,5), corresponding to the sorted cyclic rotations [aabcab, abaabc, abcaba, baabca, bcabaa, cabaab] of the original string. The first index in the array, 2, means that the cyclic string beginning at position 2 of the original string, viz. [aabcab], is the first row in the sorted matrix, and so on for the other indices. Given a character string of arbitrary length, the problem is to determine the suffix array. In the next section, we propose a systolic hardware algorithm and describe an FPGA implementation of this algorithm.
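The definition above can be checked with a small software sketch. This is only a naive reference that materializes every rotation, not the hardware algorithm of this paper; the function name `bwt_by_rotations` is a hypothetical name introduced here for illustration.

```python
def bwt_by_rotations(block):
    """Naive reference BWT: sort all cyclic rotations of one block.

    Returns (order, last_column, row_of_original), where order[k] is the
    start address of the k-th smallest rotation (the suffix array in the
    sense used in the text).
    """
    n = len(block)
    # Sort start addresses by the rotation that begins at each address.
    order = sorted(range(n), key=lambda i: block[i:] + block[:i])
    # The last character of the rotation starting at i is block[i - 1].
    last_column = "".join(block[(i - 1) % n] for i in order)
    return order, last_column, order.index(0)


order, last_column, row = bwt_by_rotations("abaabc")
print(order)        # [2, 0, 3, 1, 4, 5]
print(last_column)  # bcaaab
print(row)          # 1
```

For the running example this reproduces the suffix array (2,0,3,1,4,5); the original block appears in row 1 of the sorted matrix.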

3 Algorithm and Platform Description

3.1 A High Level Algorithm Description

If we sort the characters of the original string, we can determine the groups of suffix addresses that must follow each other in the final suffix array. For our running example, the sorted string is aaabbc, giving the first group of addresses (0,2,3) corresponding to the three a's, the second group (1,4) corresponding to the two b's, and the third group (5) corresponding to the single character c. If a group has only one character, we call the group resolved. Thus the third group (5) is resolved. If a group has more than one address, we need to examine the characters at the next memory addresses for this group in the original string and sort these characters again. Thus, for the first group, the next addresses are (1,3,4), giving the characters (b,a,b). Sorting these characters gives (a,b,b) and two subgroups of addresses: (3), corresponding to (a), which is resolved, and (1,4), corresponding to (bb), which must be resolved further. For the resolved address (3), the original address is 2, which must be the first address in the suffix array. For the second subgroup, the next addresses yield two resolved groups of addresses, (2) and (5), corresponding to the original addresses (0) and (3), respectively. Proceeding in a similar fashion, we can resolve the second group of addresses (1,4) as (1) and (4), since the characters in address locations 2 and 5 are a and c, respectively. Thus, the final suffix array is (2,0,3,1,4,5). A technical problem arises if the next address is one more than the maximum address of the original array. We take care of this problem in the hardware solution by introducing a dummy smallest character "*", as explained in the next subsection. Summarizing, a typical step of the algorithm consists of applying a basic sorting algorithm to groups of characters from the original string, followed by the computation of the "next" characters for the subgroups. Initially, there is only one group, which is the given string, and the computation is carried on until all subgroups consist of a single character.

Rather than giving a more formal description of the above process in software, we will now discuss a novel implementation of the algorithm on a systolic array, called a Weavesorter Machine, as discussed in the next subsection.
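For readers who nevertheless want a software view, the group refinement described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the name `suffix_array_by_refinement` is hypothetical, the dummy smallest character is modeled by appending "*" (assumed not to occur in the input), and addresses wrap around modulo the extended length.

```python
def suffix_array_by_refinement(text, sentinel="*"):
    """Resolve groups of addresses by repeatedly sorting on the next character."""
    s = text + sentinel          # assumption: sentinel does not occur in text
    n = len(s)

    def key(addr, depth):
        ch = s[(addr + depth) % n]
        # The sentinel sorts below every real character.
        return (ch != sentinel, ch)

    groups = [list(range(n))]    # initially one group: the whole string
    depth = 0
    while any(len(g) > 1 for g in groups):
        refined = []
        for g in groups:
            if len(g) == 1:      # already resolved
                refined.append(g)
                continue
            g = sorted(g, key=lambda a: key(a, depth))
            sub = [g[0]]
            for a in g[1:]:
                if key(a, depth) == key(sub[-1], depth):
                    sub.append(a)        # same character: same subgroup
                else:
                    refined.append(sub)  # character changes: close subgroup
                    sub = [a]
            refined.append(sub)
        groups = refined
        depth += 1               # next round looks one address further
    # Drop the address of the sentinel itself.
    return [g[0] for g in groups if g[0] != n - 1]


print(suffix_array_by_refinement("abaabc"))  # [2, 0, 3, 1, 4, 5]
```

Because the sentinel is unique, every group is eventually resolved, even for a periodic input such as abab.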

3.2 Weavesorter Machine Structure

The Weavesorter algorithm was developed by one of the authors of this paper (Mukherjee) in 1981 and is described in [17]. The Weavesorter consists of a slightly modified bidirectional shift register. It can perform the following operations: a shift-right, a shift-left and a compare/swap operation. For the compare/swap operation the elements of the Weavesorter are grouped in pairs. In this operation the characters of each pair are compared and swapped if the left character is larger than the right one. The idea of the Weavesorting algorithm is to shift the input string character by character into the Weavesorter, starting from the left edge, and to do a compare/swap operation after each shift step. After the string is completely inserted, the smallest character of the whole string is in the leftmost cell of the Weavesorter. The rest of the string is not yet sorted but pre-sorted. Now the shift direction is changed to shift-left and the string is shifted out of the Weavesorter. While shifting out, each shift operation is followed by a compare/swap operation. This guarantees that the smallest character of the substring still in the Weavesorter is always the one read out. So the output of the Weavesorter is a sorted version of the original input string.
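The shift and compare/swap behaviour described above can be simulated in software. The following is a sketch under assumptions rather than a model of the actual cell design: the class and function names are hypothetical, pairs are assumed to be the fixed cell pairs (0,1), (2,3) and so on, empty cells are assumed to compare as larger than any character, and only the single-string case (insert from the left, read out on the left) is modeled.

```python
class Weavesorter:
    """Behavioural sketch of a bidirectional shift register with compare/swap."""

    def __init__(self, n_cells):
        self.cells = [None] * n_cells     # None marks an empty cell

    @staticmethod
    def _greater(left, right):
        # Assumption: an empty cell counts as larger than any real item.
        if left is None:
            return right is not None
        if right is None:
            return False
        return left > right

    def compare_swap(self):
        c = self.cells
        for i in range(0, len(c) - 1, 2):  # fixed pairs (0,1), (2,3), ...
            if self._greater(c[i], c[i + 1]):
                c[i], c[i + 1] = c[i + 1], c[i]

    def shift_right(self, item):
        if self.cells[-1] is not None:
            raise OverflowError("Weavesorter is full")
        self.cells = [item] + self.cells[:-1]
        self.compare_swap()

    def shift_left(self):
        out = self.cells[0]                # smallest remaining item
        self.cells = self.cells[1:] + [None]
        self.compare_swap()
        return out


def weavesort(items):
    items = list(items)
    ws = Weavesorter(len(items))
    for x in items:                        # shift in from the left edge
        ws.shift_right(x)
    return [ws.shift_left() for _ in items]  # shift out, smallest first


print("".join(weavesort("abaabc")))  # aaabbc
```

In this model every even-numbered cell holds a value no larger than anything to its right after each compare/swap, which is why the leftmost cell always holds the current minimum.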

While the first string is being shifted out, a second string can be shifted in from the other side. The largest character of this second string is always in the rightmost cell of the Weavesorter. After the shift direction changes, the second string leaves the Weavesorter on the right side, beginning with its largest character. In this way a hardware utilization of almost 100 percent can be achieved. Figure 1 shows a Weavesorter with eight basic cells together with the corresponding VHDL codeframe, in which the number of cells is a scalable parameter.

    cells : for i in length*2+1 downto 0 generate
      -- first cell is connected with the Input and the Output --
      first : if i = 0 generate
        cell_first : eCell
          generic map (adr_width, char_width)
          port map (
            Clk           => Clk,
            ShiftR_In_Reg => Input(adr_width+char_width-1 downto 0),
            ShiftL_In_Reg => Out_Reg(i+1),
            Config        => config_cell(i),
            Out_Reg       => Out_Reg(i));
      end generate first;
      -- middle cells are connected to their neighbours --
      middle : if i > 0 and i < length*2+1 generate
        cell_middle : eCell
          generic map (adr_width, char_width)
          port map (
            Clk           => Clk,
            ShiftR_In_Reg => Out_Reg(i-1),
            ShiftL_In_Reg => Out_Reg(i+1),
            Config        => config_cell(i),
            Out_Reg       => Out_Reg(i));
      end generate middle;
      -- last cell is connected to the Input and the Output --
      last : if i = length*2+1 generate
        cell_last : eCell
          generic map (adr_width, char_width)
          port map (
            Clk           => Clk,
            ShiftR_In_Reg => Out_Reg(i-1),
            ShiftL_In_Reg => Input(adr_width+char_width-1 downto 0),
            Config        => config_cell(i),
            Out_Reg       => Out_Reg(i));
      end generate last;
    end generate cells;

Figure 1: Weavesorter with 8 Cells and VHDL Codeframe

3.3 Suffix Sorting with the Weavesorter

This section explains how the Weavesorter can be used to perform the BWT. As explained earlier, the BWT generates all cyclic rotations of a given string and sorts them lexicographically. A problem with sorting these cyclic rotations occurs whenever the original string is periodic. In that case some cyclic rotations are identical and therefore cannot be ordered, so the sorting algorithm never terminates. To solve this problem, an artificial smallest character is added at the end of the input string. This character can never occur in the input string and therefore guarantees an aperiodic string.
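The effect of the artificial smallest character can be seen in a few lines. This is only an illustration; the helper name `rotations` and the choice of "*" as the terminating character are assumptions made here.

```python
def rotations(s):
    """All cyclic rotations of s, one per start address."""
    return [s[i:] + s[:i] for i in range(len(s))]


periodic = "abab"
# Without a terminator, two pairs of rotations are identical.
print(len(set(rotations(periodic))), "distinct of", len(periodic))
# With a unique terminator appended, all rotations differ.
terminated = periodic + "*"
print(len(set(rotations(terminated))), "distinct of", len(terminated))
```

With the terminator every rotation is distinct, so repeated refinement on successor characters is guaranteed to separate all groups.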

A straightforward implementation of this algorithm would generate the matrix of the shifted versions of the input string and sort the rows of this matrix. The hardware architecture described in this paper instead uses only one Weavesorter and a memory that stores the input string. This leads to small and highly utilized hardware.

As we can see in the example in section 3.1, it is not sufficient to sort according to the first character of a cyclic rotation. This is only the first step. After sorting according to the first character there might be groups of the same character. These groups have to be sorted according to the second character of the cyclic rotation. This has to be repeated until there are no groups of the same character left. To use the Weavesorter for the BWT, certain requirements have to be met. The input string must be stored in a memory, so that each character gets its own address. With this address, the successor of a character in the original string can easily be accessed by incrementing the address. The character and its address are stored together in the Weavesorter. The address is only carried along with the character and does not influence the sorting. After the first iteration of the Weavesorter, the output string might contain groups of the same character, which have to be sorted again according to their successors. Therefore the successors of the characters in the output string are inserted into the Weavesorter. This can be done from the other side of the Weavesorter. To prevent mixing of characters from different groups, these groups have to be separated. This is done by inserting control bits which block the compare/swap operation if two characters in adjacent cells belong to different groups. Note that as soon as the first character drops out of the Weavesorter, its successor in the original string is read out of the memory and inserted from the other side. After this iteration is finished, the output is inserted again from the left side. As before, the output string might contain groups of equal characters, which again have to be grouped and sorted according to their following character in the original string. This is repeated until all characters are separated by a control bit. Then no further sorting is necessary because each group contains only one character. The next step is to generate the last column of the sorted matrix of the cyclic rotations of the original string. Before we explain how this is done, we say a few words about the artificial smallest character and about how the address problem is handled.

The number of bits used for the address within the Weavesorter is one more than the number of bits needed to address the input string. This is necessary to prevent an overflow when incrementing the addresses while executing the algorithm. This additional address space is called the virtual address space. Each input string is terminated with an artificial smallest character. This terminating character has the lowest address in the virtual address space. As the artificial smallest character is not allowed to be part of the alphabet of the input string, an additional bit would normally be needed to create this character. But with the assumption that the artificial smallest character is stored in the virtual address space, the MSB of the address can be used to identify it. As it is unique, there will never be a group of artificial smallest characters that has to be sorted. Therefore, if the successor of a character is the artificial smallest character, that character will be isolated as a resolved group after the next sorting step. As this group is already resolved and will remain isolated, the value of the character is no longer important. This makes it unnecessary to evaluate the cyclic successor of this character in the string and simplifies the address handling. Now we can explain how the last column is generated.
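The address convention above can be sketched as follows. This is an illustration under stated assumptions, not the prototype's logic: the constants and the function `fetch_key` are hypothetical, the physical memory is assumed to hold exactly 2^ADDR_BITS characters, and any address whose MSB is set is assumed to read as the artificial smallest character.

```python
ADDR_BITS = 3                       # assumption: 8 physical memory cells
STORED_BITS = ADDR_BITS + 1         # one extra bit: the virtual address space
VIRTUAL_FLAG = 1 << ADDR_BITS       # MSB; also the lowest virtual address


def fetch_key(memory, addr):
    """Sort key of the character at addr.

    Addresses with the MSB set lie in the virtual address space and are
    treated as the artificial smallest character; no extra character bit
    is needed to represent it.
    """
    assert addr < (1 << STORED_BITS), "address does not fit the stored width"
    if addr & VIRTUAL_FLAG:
        return (0, 0)               # smaller than the key of any real character
    return (1, memory[addr])


memory = b"abaabcab"                # 8 characters, addresses 0..7
last = fetch_key(memory, 7)         # a real character
term = fetch_key(memory, 7 + 1)     # successor of the last address
print(term < last)                  # True: the terminator sorts first
```

Incrementing the last physical address (7) gives 8, which still fits in the four stored bits and is recognised by its MSB alone.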

What is actually stored in the Weavesorter at the end of the sorting algorithm is the first column of the sorted matrix. However, because the successor of each character is processed in every iteration, the first column is represented by the X-th successor of each character, where X is the number of iterations. To get the first column, the number of iterations must be subtracted from the addresses stored within the Weavesorter. The last column, which is the result of the BWT, can then be generated by decrementing the addresses of the first column by one.
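The address arithmetic of this step can be illustrated with a short sketch. The function name `recover_columns` is hypothetical, and the input list is constructed by hand here: it assumes that every stored address equals the start address of its rotation plus the number of iterations X, and that the terminator "*" occupies the address directly after the text.

```python
def recover_columns(text, stored_addresses, iterations, sentinel="*"):
    """Recover first-column addresses and the last column of the sorted matrix."""
    s = text + sentinel
    n = len(s)
    # Subtract the iteration count to get the start address of each rotation.
    first = [a - iterations for a in stored_addresses]
    # The last character of a rotation is the predecessor of its first one;
    # the predecessor of address 0 is the terminator at address n - 1.
    last = "".join(s[(a - 1) % n] for a in first)
    return first, last


# Running example, assuming X = 3 iterations were needed.
stored = [9, 5, 3, 6, 4, 7, 8]
first, last = recover_columns("abaabc", stored, 3)
print(first)                   # [6, 2, 0, 3, 1, 4, 5]
print(last)                    # cb*aaab
print(last.replace("*", ""))   # cbaaab (terminator removed from the output)
```

Note that the last column of this terminated variant differs from the last column obtained by sorting plain cyclic rotations, because the terminator takes part in the sort.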

4 Simulation and Prototyping of BWT Hardware Algorithm

For simulation and prototyping of our BWT architecture we have developed a testbench, which provides the necessary input data. We have used the Virtual Workbench VW-300 from VCC. The core of this Virtual Workbench is an FPGA from Xilinx. The Virtual Workbench VW-300 also provides several other elements for prototyping. We used the four pushbuttons, the eight DIP switches and the display with eight characters. The FPGA used is the Xilinx Virtex XCV300-4 BG352. This FPGA consists of a 32x48 CLB array and has 3072 slices. The largest BWT architecture we have implemented was based on a Weavesorter with 100 cells. The corresponding scalable hardware architecture was specified in VHDL and simulated with Synopsys front-end tools. Figure 2 shows the waveform results of a correct functional logic simulation. The first two signals are the states of the FSMs that control the BWT algorithm. The signals Char_Out(9) to Char_Out(0) are the Weavesorter elements. The Memory signals display the memory used to provide the input string. The first 8 memory cells contain the original input string. The last 8 memory cells contain the result. They are written during the OUTPUT stage of the major FSM. Note that the artificial smallest character only exists within our architecture. So, when the output string is written into the memory, there is one character which must be cut out of the output stream. This can be seen at memory cell 10, which is written twice. First, an irregular character is stored, which represents the virtual smallest character in the BWT architecture. This character is then overwritten with the next character of the output stream.

Rapid Prototyping Results (Virtex XCV300-4) | Resource Allocation: Percentage of Used Slices | Performance: Post Place & Route Simulation
Weavesorter Architecture (100 Cells) | 88 % (1 Virtex CLB = 2 Slices) | Min. Critical Path Delay = 22 ns -> Max. Clock Frequency = 45 MHz

Table 1: CLB Utilization and Critical Path Delay of Weavesorter Prototype

The rapid prototyping implementation of the hardware architecture was performed with the Xilinx Foundation tool suite. A Weavesorter architecture with 100 cells occupied 88% of the slices of the Virtex XCV300. According to the post place-and-route simulation, the critical path delay was 22 ns (see Table 1). This corresponds to a maximum clock frequency of 45 MHz. For an input string with 99 characters, a whole iteration takes 100*2*2 = 400 clock cycles. This is the number of clock cycles needed to shift the string completely in and out. However, the second iteration is started while the result of the first iteration is being shifted out. If we assume that a normal text string is sorted within three iterations, the complete sort takes 100*2*4 = 800 clock cycles: 200 clock cycles for shifting in during the first iteration; another 200 clock cycles for shifting out the result of the first iteration while starting the second iteration; another 200 clock cycles for the third iteration, at the end of which the string is completely sorted; and the last 200 clock cycles to shift out the string and generate the last column of the sorted matrix. With a clock frequency of 45 MHz this takes around 18 µs.

[Waveform display with the signals MINOR_STATE, MAJOR_STATE, CHAR_OUT(9) to CHAR_OUT(0), and MEMORY0_OUT_S to MEMORY13_OUT_S]

Figure 2: Logic Simulation Results of BWT-based Suffix Sorting Hardware Algorithm

For controlling the BWT architecture we used the four pushbuttons provided on the Virtual Workbench. First, an input string has to be provided, which is done with the eight DIP switches. The string is written character by character into a memory implemented on the FPGA. The value of each character is chosen with the DIP switches. The content of the memory is shown on the LED display. Once the whole input string has been entered, the architecture can be started with one of the pushbuttons. The result is again written onto the display and can be checked. In further work the BWT hardware algorithm can be integrated into the Bzip2 program.
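The cycle count and run time estimated earlier in this section can be reproduced with a short calculation. The function name `bwt_cycles` and the generalization to an arbitrary number of iterations (k overlapped iterations taking k + 1 shift phases) are assumptions made for this sketch; the text itself only gives the three-iteration case.

```python
def bwt_cycles(cells, iterations, cycles_per_step=2):
    """Clock cycles for a sort with overlapped iterations.

    Each phase shifts `cells` items, and each step is one shift followed by
    one compare/swap (2 clock cycles). Assumption: k overlapped iterations
    need k + 1 phases, as in the three-iteration example of the text.
    """
    phase = cells * cycles_per_step
    return phase * (iterations + 1)


CLOCK_HZ = 45e6                        # maximum clock frequency from Table 1
cycles = bwt_cycles(cells=100, iterations=3)
time_us = cycles / CLOCK_HZ * 1e6
print(cycles)                          # 800
print(round(time_us, 1))               # 17.8
```

At 45 MHz one clock cycle lasts about 22 ns, so 800 cycles correspond to roughly 18 microseconds.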

5 Conclusions

The paper introduced new data compression algorithms based on the Burrows-Wheeler transform (BWT), which are necessary in future high-speed communication systems, especially in next generation mobile devices. Lexicographic sorting, the most complex task of the BWT, has been described in detail. Based on this, an efficient hardware structure for suffix sorting with Weavesorter machines was proposed; suffix sorting is the most time critical part of Bzip2 algorithms. The paper focused on demonstrating the feasibility and scalability of the VLSI implementation, which benefits from the fact that the hardware algorithm does not have to store all cyclic rotations of the original string. The proposed systolic hardware structure was simulated and finally prototyped with fine-grained reconfigurable hardware devices (Virtex FPGAs), which operated at a clock frequency of 45 MHz. The illustrated hardware algorithm for efficient data compression and its final VLSI implementation within Systems-on-Chip (SoC) can significantly reduce the corresponding communication and storage costs. Future work in this project will address optimized hardware/software realizations of complete Bzip2 algorithms and hardware implementations of complex tasks in mobile communication applications.

6 Acknowledgement

The work at UCF was partially supported by a grant from the National Science Foundation, grant No. IIS-9977336.

References

[1] P. Athanas, A. Abbott: Real-Time Image Processing on a Custom Computing Platform, IEEE Computer, Vol. 28, No. 2, Feb. 1995.

[2] Peter Jung, Joerg Plechinger: M-GOLD: a multimode baseband platform for future mobile terminal, CTMC'99, IEEE International Conference on Communications, Vancouver, June 1999.

[3] Z. Arnavut: Move-To-Front Inverse Coding, Proc. Data Compression Conference, Snowbird, Utah, pp. 172-182, 2000.

[4] B. Balkenhol, S. Kurtz and Y. M. Shtarkov: Modifications of the Burrows Wheeler Data Compression Algorithm, Proc. Data Compression Conference, Snowbird, Utah, pp. 188-198, 1999.

[5] J. Becker, A. Kirschbaum, F. M. Renner and M. Glesner: Perspective of Reconfigurable Computing in Research, Industry and Education, 8th International Conference Workshop on Field Programmable Logic and Applications, FPL '98, Tallinn, Estonia, Aug. 31-Sept. 1998, Lecture Notes in Computer Science, Springer Press, 1998.

[6] J. Becker, M. Glesner: A Parallel Dynamically Reconfigurable Architecture Designed for Application-specific Hardware/Software Systems in Future Mobile Communication, to be published in: The Journal of Supercomputing, Kluwer Academic Publishers, January 2001.

[7] J. L. Bentley and R. Sedgewick: Fast Algorithms for Sorting and Searching Strings, Proc. 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360-369, 1997.

[8] M. Burrows, D. J. Wheeler: A Block-sorting Lossless Data Compression Algorithm, SRC Research Report 124, Digital Systems Research Center, Palo Alto, CA 94301, May 10, 1994.

[9] J. G. Cleary and I. H. Witten: Data Compression using Adaptive Coding and Partial String Matching, IEEE Transactions on Communications, Vol. COM-32, No. 4, pp. 396-402, April 1984.

[10] G. V. Cormack and R. N. Horspool: Data Compression Using Dynamic Markov Modeling, Computer Journal, Vol. 30, No. 6, pp. 541-550, 1987.

[11] H. Erban, K. Sabatakakis: Advanced Software Radio Architecture for 3rd Generation Mobile Systems, Vehicular Technology Conference, IEEE VTC 98, Vol. 2, pp. 825-829, 1998.

[12] P. Fenwick: Block Sorting Text Compression, Proc. 19th Australian Computer Science Conference, Melbourne, Australia, Jan. 31-Feb. 3, 1996.

[13] D. A. Huffman: A Method for the Construction of Minimum Redundancy Codes, Proc. IRE, 40(9), pp. 1098-1101, 1952.

[14] T. Kasai, H. Arimura and S. Arikawa: Virtual Suffix Tree: Fast Computation of Subword Frequencies using Suffix Arrays, RIMS Kokyuroku 1093, Kyoto University, April 1999 (in Japanese).

[15] R. M. Karp, R. E. Miller and A. L. Rosenberg: Rapid Identification of Repeated Patterns in Strings, Arrays and Trees, 4th ACM Symposium on Theory of Computing, pp. 125-136, 1972.

[16] U. Manber and G. Myers: Suffix Arrays: A New Method for On-line String Searches, SIAM Journal of Computing, Vol. 22, No. 5, pp. 935-948, October 1993.

[17] Amar Mukherjee: Introduction to nMOS and CMOS VLSI System Design, Prentice Hall, 1986.

[18] F. M. Renner, J. Becker and M. Glesner: Communication Performance Models for Architecture-Precise Prototyping of Real-Time Embedded System, Proc. 10th International IEEE Workshop on Rapid System Prototyping (RSP 99), Clearwater, USA, June 16-18, 1999.

[19] J. Rissanen and G. G. Langdon: Arithmetic Coding, IBM Journal of Research and Development, Vol. 23, pp. 149-162, 1979.

[20] K. Sadakane: Unifying Text Search and Compression - Suffix Sorting, Block Sorting and Suffix Arrays, Ph.D. Dissertation, University of Tokyo, Tokyo, Japan, 1999.

[21] K. Sadakane, T. Okazaki and H. Imai: Implementing The Context Tree Weighting Method for Text Compression, Proc. Data Compression Conference, Snowbird, Utah, pp. 123-132, 2000.

[22] J. Seward: On the Performance of BWT Sorting Algorithms, Proc. Data Compression Conference, Snowbird, Utah, pp. 173-182, 2000.

[23] I. H. Witten, A. Moffat, and T. C. Bell: Managing Gigabytes, Van Nostrand Reinhold, New York, 1994.

[24] J. Ziv and A. Lempel: A Universal Algorithm for Sequential Data Compression, IEEE Trans. on Information Theory, IT-23, pp. 337-343, 1977. Also, by the same authors: Compression of Individual Sequences via Variable Rate Coding, IT-24, pp. 530-536, 1978.
