[ieee 2009 international conference on reconfigurable computing and fpgas (reconfig) - cancun,...

Acceleration of Fractal Image Compression Using the Hardware-Software Co-Design Methodology

Oscar Alvarado Nava
Departamento de Electrónica
Universidad Autónoma Metropolitana, Azcapotzalco
Mexico D.F., Mexico

[email protected]

Arturo Díaz Pérez
Laboratorio de Tecnologías de Información
Centro de Investigación y de Estudios Avanzados, Tamaulipas
Cd. Victoria, Tamaulipas, Mexico

[email protected]

Abstract—Fractal Image Compression (FIC) is a lossy technique whose features are promising for computer systems with few resources; however, it has been largely ignored because of the large number of operations needed to complete the coding. On the other hand, advances in VLSI technology allow the creation of programmable devices with greater capabilities, which not only offer a large gate density for programming hardware modules but also contain one or more embedded processors, allowing complete systems to be built inside a single chip (SoC). Using hardware and software components in a single electronic system makes it possible to combine the flexibility offered by software with the high computing power and parallelism of hardware. This paper describes a Hardware-Software Co-Design (HSC) of FIC that improves the compression time, obtaining an acceleration factor between 6.6 and 8.5. The system was built on an FPGA-based SoC.

Keywords-Hardware-Software Co-Design; FPGA; Fractal Compression;

I. INTRODUCTION

Digital images require an array of pixels to store information, which demands a large amount of storage. For this reason, data coding or compression, and especially image compression, has received a lot of attention, and novel methods to represent information more efficiently keep emerging.

Fractal Image Compression (FIC) is an image compression technique whose high compression rate, fast decoding, resolution independence, and progressive coding-decoding make it competitive for improving the representation and transmission of digital images [1], [2], [3], [4]. The price to pay for these features is a considerable coding time when it is implemented on a conventional computer system. Despite this, one can take advantage of two features of FIC: on the one hand, it is an asymmetric algorithm (the computational cost of decompression is much lower than that of compression), and on the other hand, its operations are highly parallelizable [5], [6], [7].

The objective of the Hardware-Software Co-Design (HSC) technique is to accelerate an application while implementing in hardware only what is necessary to accelerate it [8], [9]. Therefore, an application is distributed in two or more hardware and software partitions. Those parts of the application whose speed is not critical for the process are kept in software, while the parallelizable or computationally expensive parts are candidates to be implemented in hardware.

HSC could be a solution for accelerating applications on limited computing systems, such as mobile devices, which can be classified as Systems on Chip. The design and development of an application in hardware and software is more flexible than its implementation in an application-specific integrated circuit (ASIC).

The present document is organized as follows. In Section II we present a brief description of FIC using a Locally Iterated Function System (LIFS). In Section III we describe the performance profile of a program for FIC, from which we obtain quantitative information on which parts of the program are the most time-consuming and are, therefore, candidates to be implemented in hardware. Based on the results of the profiling, in Section IV we implement a processing unit that performs the most computationally costly part of FIC. In Section V we explain the communication interface between the program and the designed hardware unit. Finally, we present our results and conclusions in Sections VI and VII, respectively.

II. FRACTAL IMAGE COMPRESSION

The basic concept behind FIC is to take an image W and to express it as an LIFS [10], [11], [1], [12]. An LIFS is a set of functions w_i, each describing a part of a fractal, which when taken together recreate the whole fractal.

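Said functions have the form:

$$W = w_1 \cup w_2 \cup w_3 \cup \dots \cup w_u \cup \dots \cup w_{N_R \times N_R}, \qquad (1)$$

where each w_u is of the form:

$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = w_u \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_u & b_u & 0 \\ c_u & d_u & 0 \\ 0 & 0 & s_u \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} + \begin{bmatrix} e_u \\ f_u \\ o_u \end{bmatrix} \qquad (2)$$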

If an image can be described by a small number of these functions, the LIFS is a compact description of the image.

2009 International Conference on Reconfigurable Computing and FPGAs

978-0-7695-3917-1/09 $26.00 © 2009 IEEE

DOI 10.1109/ReConFig.2009.76


When values are iterated through each of the functions of the system, they converge towards an image called the attractor of the system. The attractor can be quickly displayed at any degree of magnification with unlimited resolution.

The algorithm proposed in [1] takes an image and divides it into N_R × N_R range blocks, each of n × n pixels, and (N_D − 1) × (N_D − 1) domain blocks of 2n × 2n pixels, overlapped by half.

The eight possible transformations are applied to each range block, and each of the results is compared against each of the domain blocks. The comparison that results in the least distortion determines the parameters of transformation (a_u, b_u, c_u, d_u), translation (e_u, f_u), and contrast and luminance (s_u, o_u). These data make up the fractal code.

The search for fractal codes that approximate an image with a desired accuracy is a computationally costly task, and the cost grows as the number of range blocks increases, whether because of the resolution of the image or the partition used to code it.
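To give a feeling for how this cost grows, the following C sketch counts the block comparisons made by an exhaustive search. It is only an illustration under stated assumptions: a square image whose side is a multiple of n, domain blocks that advance in steps of n pixels (overlapped by half), and every range block compared with every domain block under all eight transformations. The function name `fic_comparisons` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of (range block, domain block, transformation) comparisons in an
 * exhaustive search. Assumptions of this sketch: square image of `size`
 * pixels per side, range blocks of n x n pixels, domain blocks of 2n x 2n
 * pixels placed every n pixels, and 8 transformations per block pair. */
uint64_t fic_comparisons(uint32_t size, uint32_t n)
{
    assert(n > 0 && size % n == 0 && size >= 2 * n);
    uint64_t nr = size / n;               /* range blocks per side */
    uint64_t nd = (size - 2 * n) / n + 1; /* domain blocks per side */
    return nr * nr * nd * nd * 8u;
}
```

Under these assumptions, a 512 × 512 image with 8 × 8 range blocks has 64 × 64 range blocks and 63 × 63 domain blocks, which gives roughly 1.3 × 10^8 comparisons.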

The results presented in Tables II and III were obtained by coding the image Lena with the Fractal Transform Algorithm proposed in [1], implemented on the conventional computer system described in Table I. The reconstructed images are shown in Figure 1.

Table I
CONVENTIONAL COMPUTER SYSTEM

CPU     Core2Duo @ 2.0 GHz
Cache   2 MB L2
RAM     1 GB @ 800 MHz
FSB     800 MHz

Table II
CODING TIME OF THE IMAGE LENA AT DIFFERENT RESOLUTIONS, WITH RANGE BLOCKS OF 8 × 8 PIXELS, ON A CONVENTIONAL COMPUTER SYSTEM

Resolution [pixels]   Coding time [seconds]
64 × 64               1
128 × 128             25
256 × 256             482
512 × 512             7200

Figure 1. Fractal approximation of the image Lena at 512 × 512 pixels using range blocks of different sizes: (a) 4 × 4 pixels, (b) 8 × 8 pixels, (c) 16 × 16 pixels.

Table III
CODING AND DECODING TIME OF THE IMAGE LENA AT A RESOLUTION OF 512 × 512 PIXELS AS A FUNCTION OF THE SIZE OF THE RANGE BLOCK

Partition [pixels]   Coding time [seconds]   Decoding time [seconds]
4 × 4                7740                    0.39
8 × 8                7200                    0.36
16 × 16              6480                    0.34

Table IV
CODED IMAGE QUALITY AND COMPRESSION RATE AS A FUNCTION OF THE PARTITION SIZE

Partition [pixels]   PSNR [dB]   RMSE     Compression rate
4 × 4                81.4820     0.0215   5.12
8 × 8                77.9736     0.0322   22.26
16 × 16              75.6451     0.0421   97.52

As can be observed in Table III, coding and decoding times are asymmetric: to code an image, one must perform a large number of operations between blocks and pixels, while decoding is a fast iterative process. To evaluate how similar two images W and W′ are, we compute a distortion δ. The most common distortion criteria are the Root Mean Squared Error δ_RMSE and the Peak Signal-to-Noise Ratio δ_PSNR, defined by equations (3) and (4), respectively.

Table IV shows the parameters δ_RMSE, δ_PSNR, and the compression rate obtained by comparing the decoded images shown in Figure 1 with the original image.

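$$\delta_{RMSE}(W, W') = \frac{1}{n^2} \sqrt{\sum_{k=1}^{n} \sum_{l=1}^{n} \left(w_{k,l} - w'_{k,l}\right)^2} \qquad (3)$$

$$\delta_{PSNR} = 20 \log_{10}\left(\frac{255}{RMSE}\right) \qquad (4)$$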

III. HARDWARE-SOFTWARE PARTITION

The application can be partitioned once its performance profile has been obtained, for example with software tools such as gprof. This involves analyzing the program to determine how its execution time is distributed among its functions or procedures. With this we can see which parts of the application are used most often or are the most time-consuming, and therefore determine which parts can stay in software and which could be implemented in hardware.

The algorithm proposed in [1] is composed of various procedures functionally divided into three groups: file read and write, block control, and pixel calculations. The performance profile of the program while coding the gray-scale image Lena at 512 × 512 pixels with a range block size of 8 × 8 pixels shows that two functions use up 85% of the execution time. The first, called reflect, uses 65% of the execution time; each call performs one of the eight possible transformations or reflections on a range block. The second, called distance, computes the distortion between a domain block and a transformed range block.
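As a rough reading of this profile, and not part of the original analysis, Amdahl's law gives an idea of the acceleration that can be expected from moving these two functions to hardware. Here p is the fraction of the run time moved to hardware and s is a hypothetical factor by which the hardware speeds up that fraction; both symbols are introduced only for this sketch.

$$S(s) = \frac{1}{(1 - p) + p/s}, \qquad p = 0.85 \;\Rightarrow\; \lim_{s \to \infty} S(s) = \frac{1}{0.15} \approx 6.7$$

The 85% figure was measured for one image size on the conventional computer, so the fraction, and with it this estimate, may differ on other platforms and at other resolutions.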

IV. TRANSFORMATION AND COMPARISON UNIT

With the information obtained from the profile analysis of the program, it was possible to describe in VHDL, at the register-transfer level, a Transformation and Comparison Unit (TCU) that performs the functions reflect and distance. The TCU circuit is shown in Figure 2.

The TCU receives the range and domain blocks to be processed through a 32-bit input port called in_data, with each block divided into 16 frames of 32 bits. The 8-bit input port in_commd receives a control word from the processor. The 4-bit output port out_sim gives the identification of the transformation of the range block with the least distortion with respect to the domain block. At the same time, the output port out_data shows the value of this distortion. The signal done, coming from the control unit, indicates the end of processing and can be used to send an interrupt to the processor.
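The framing described above can be illustrated with a short C sketch: a block of 64 pixels of 8 bits occupies 512 bits, that is, 16 frames of 32 bits. The byte order inside each frame is not stated in the text, so the most-significant-byte-first order used here is an assumption, and the name `pack_block` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_PIXELS 64 /* 8 x 8 block of 8-bit pixels = 512 bits */
#define BLOCK_WORDS  16 /* 512 bits / 32 bits per frame */

/* Pack 64 pixels into 16 frames of 32 bits, four pixels per frame.
 * Assumption: the first pixel of each group of four goes in the most
 * significant byte; the real TCU may use a different order. */
void pack_block(const uint8_t pixels[BLOCK_PIXELS], uint32_t frames[BLOCK_WORDS])
{
    assert(pixels != NULL && frames != NULL);
    for (size_t i = 0; i < BLOCK_WORDS; ++i) {
        frames[i] = ((uint32_t)pixels[4 * i]     << 24) |
                    ((uint32_t)pixels[4 * i + 1] << 16) |
                    ((uint32_t)pixels[4 * i + 2] <<  8) |
                     (uint32_t)pixels[4 * i + 3];
    }
}
```

Each of the 16 frames would then be written in turn to the in_data port.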

The organization of the TCU includes a data path and a control unit. The data path is formed by three 512-bit registers, an arithmetic unit, and three routing units. The RangeBlock and DomainBlock registers contain the range block and the domain block, respectively, while the ReflectedRangeBlock register contains the transformations applied to the range block. The arithmetic unit AU computes the absolute value of the pixel difference, accumulating the results in Acc. The routing units 001, 010, and 100 take the pixels from the RangeBlock register to the ReflectedRangeBlock register, effecting a spatial relocation.

The control unit of the TCU receives control words from the processor. These control words enable or disable the units for each operation: register load, transformation, comparison, and restart. Figure 3 shows the state diagram followed by the control unit in order to transform and compare blocks.

The TCU stays in a waiting state Hd until it receives the activation signal E. The state Hd is used to assign initial values for block processing. In state Rin, the values needed to process each transformation are initialized. The state BR determines the transformation to be applied and, depending on the transformation, moves to one of the states 001, 010, or 100. For example, to complete transformation 5 (101) it is necessary to pass first through state 001 and then through state 100. States SmP and ShR compute the absolute value of the difference of each of the 64 pairs of pixels that form the transformed range block and the domain block. This calculation is performed by shifting the registers RangeBlock and DomainBlock and using the AU unit. The calculated values are accumulated in the register Acc. In state Cmp, the values in the registers Acc and MinDist are compared. If the value of Acc is less than the value of MinDist, then in state Act the registers MinDist and FinalSymmetry are updated, storing the identification of the transformation with the least distortion. The 4-bit counter Symmetry identifies the transformation performed. If its value is less than 8, the unit moves back to state Rin; otherwise processing is done.

Figure 2. Organization of the TCU. [Block diagram: routing units 001, 010, and 100; registers RangeBlock, ReflectedRangeBlock, DomainBlock, Acc, MinDist, Symmetry, and FinalSymmetry; AU, Comp, Demux, and Control; ports clk, in_commd, in_data, out_sim, out_data, and done.]

Figure 3. State diagram of transformation process. [States Hd, Rin, BR, 001, 010, 100, SmP, ShR, Cmp, and Act.]
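The behavior of the TCU described above can be modeled in software as a rough C sketch. Each bit of the 3-bit transformation code enables one elementary relocation, applied in the order 001, 010, 100, and the distance is the accumulated sum of absolute pixel differences. Which geometric operation each routing unit performs is not stated in the text, so the mapping used here (mirror columns, mirror rows, transpose) is an assumption, as are all function names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SIDE = 8, PIXELS = SIDE * SIDE };

/* Elementary relocations, one per routing unit. The geometric meaning
 * given to each unit is an assumption made for this sketch. */
static void mirror_columns(uint8_t b[PIXELS]) /* unit 001 (assumed) */
{
    for (int r = 0; r < SIDE; ++r)
        for (int c = 0; c < SIDE / 2; ++c) {
            uint8_t t = b[r * SIDE + c];
            b[r * SIDE + c] = b[r * SIDE + (SIDE - 1 - c)];
            b[r * SIDE + (SIDE - 1 - c)] = t;
        }
}

static void mirror_rows(uint8_t b[PIXELS]) /* unit 010 (assumed) */
{
    for (int r = 0; r < SIDE / 2; ++r)
        for (int c = 0; c < SIDE; ++c) {
            uint8_t t = b[r * SIDE + c];
            b[r * SIDE + c] = b[(SIDE - 1 - r) * SIDE + c];
            b[(SIDE - 1 - r) * SIDE + c] = t;
        }
}

static void transpose(uint8_t b[PIXELS]) /* unit 100 (assumed) */
{
    for (int r = 0; r < SIDE; ++r)
        for (int c = r + 1; c < SIDE; ++c) {
            uint8_t t = b[r * SIDE + c];
            b[r * SIDE + c] = b[c * SIDE + r];
            b[c * SIDE + r] = t;
        }
}

/* Apply transformation `code` (0..7) by chaining units 001, 010, 100. */
void apply_symmetry(const uint8_t in[PIXELS], uint8_t out[PIXELS], unsigned code)
{
    assert(code < 8);
    memcpy(out, in, PIXELS);
    if (code & 1u) mirror_columns(out);
    if (code & 2u) mirror_rows(out);
    if (code & 4u) transpose(out);
}

/* Sum of absolute differences over the 64 pixel pairs (the role of AU and Acc). */
uint32_t block_distance(const uint8_t a[PIXELS], const uint8_t b[PIXELS])
{
    uint32_t acc = 0;
    for (int i = 0; i < PIXELS; ++i)
        acc += (uint32_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    return acc;
}

/* Try all eight transformations of the range block against one domain block.
 * Returns the code with the least distance and stores that distance. */
unsigned tcu_model(const uint8_t range[PIXELS], const uint8_t domain[PIXELS],
                   uint32_t *min_dist)
{
    uint8_t reflected[PIXELS];
    unsigned final_symmetry = 0;
    uint32_t best = UINT32_MAX;
    assert(min_dist != NULL);
    for (unsigned symmetry = 0; symmetry < 8; ++symmetry) {
        apply_symmetry(range, reflected, symmetry);
        uint32_t acc = block_distance(reflected, domain);
        if (acc < best) { /* update only on a strictly smaller distance */
            best = acc;
            final_symmetry = symmetry;
        }
    }
    *min_dist = best;
    return final_symmetry;
}
```

The model runs the eight transformations sequentially, as the state diagram does; it says nothing about the timing of the hardware.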


V. HARDWARE-SOFTWARE INTERFACE

In order to communicate and synchronize the TCU with the program executed by the processor, it was necessary to map the TCU into the address space of the processor. That is, a part of the CPU address space is interpreted not as an access to main memory, but as an access to the input or output registers of the TCU. Therefore, the interface needs to add a set of registers for both the control and the input and output data of the TCU. Figure 4 shows the registers and signals that form the interface. For example, the 32-bit register Bus2IP_Data was connected to the in_data port of the TCU in order to receive data, while the register IP2Bus_Data is used for data output from the TCU. It was also necessary to add control signals to integrate the TCU into the communication protocol of the peripheral bus, for example the bus clock signal, the read or write request signals, the read or write acknowledge signal, and the retransmission request signal.

Figure 4. Hardware interface. [Block diagram: the TCU of Figure 2 connected to the bus signals Bus2IP_Data, IP2Bus_Data, Bus2IP_Clk, Bus2IP_Rst, IP2Bus_Ack, IP2Bus_Error, Bus2IP_WrCE, and IP2Bus_RdCE through the registers slv_reg0 and slv_reg1.]

To facilitate reading data from and writing data to the TCU, we developed C functions in which we specify the device base address, the offset, and the data to be written or read. The following is an example of the prototype of such a function:

void UTC_mWriteReg(Xuint32 BaseAddress, unsigned RegOffset, Xuint32 Data);
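A function with a prototype of this kind is typically a thin wrapper around a volatile memory access. The following C sketch shows one possible shape of such wrappers; it is not the authors' code. To stay self-contained it uses `uint32_t` and `uintptr_t` instead of the Xilinx `Xuint32` type, and the register offsets and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register map: one data register and one command register. */
#define TCU_REG_DATA 0x0u
#define TCU_REG_CMD  0x4u

/* Write one 32-bit word to a memory-mapped register. */
void tcu_write_reg(uintptr_t base, unsigned offset, uint32_t data)
{
    assert(offset % 4u == 0u); /* registers are 32-bit aligned */
    *(volatile uint32_t *)(base + offset) = data;
}

/* Read one 32-bit word from a memory-mapped register. */
uint32_t tcu_read_reg(uintptr_t base, unsigned offset)
{
    assert(offset % 4u == 0u);
    return *(volatile uint32_t *)(base + offset);
}

/* Send one block as 16 consecutive writes to the data register. */
void tcu_send_block(uintptr_t base, const uint32_t frames[16])
{
    for (int i = 0; i < 16; ++i)
        tcu_write_reg(base, TCU_REG_DATA, frames[i]);
}
```

The `volatile` qualifier keeps the compiler from removing or merging the accesses, which matters when every write to the same address delivers a new frame to the device.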

VI. FRACTAL IMAGE COMPRESSION ON A SOC

We created a computational system on the XC2VP30 FPGA that includes an embedded PowerPC 405 processor at 400 MHz (ppc405), a 128 KB block of on-chip RAM (bram_block) with its corresponding bus access module (plb_ram), and a hierarchical bus system made up of a processor local bus (plb) and an on-chip peripheral bus (opb), together with their corresponding arbiters and a bridge between the buses (plb2opb). To be able to read the image from a file and to write the fractal codes to another file, it was necessary to add the access module for a FAT12 file system on an 8 MB Compact Flash memory (opb_sysace). Furthermore, in order to code images at resolutions higher than 64 × 64 pixels, we added an external memory module to the address space of the processor, namely a 256 MB DIMM DDR SDRAM (plb_ddr). We also included a serial transmitter and receiver as the standard input and output of the system (opb_uart).

A. System without TCU

The system previously described (processor, memory, and file system) is able to execute programs like any conventional computer system. We slightly modified the code proposed in [1] in order to implement it on this system. Just as with the conventional computer system, we compressed the Lena image at different resolutions. The coding times are shown in Table V in the column Without TCU.

B. System with TCU

Figure 5 shows the schematic of the SoC with the TCU. Table V shows the coding times for this system in the column With TCU.

Figure 5. SoC with TCU. [Block diagram of the XUP2VP board. On-chip: ppc405, jtagppc, plb_v34 and opb_v20 buses with their arbiters, plb2opb bridge, bram block, plb ram, plb ddr, dcm modules, opb sysace, opb uart, and the TCU. Off-chip: Compact Flash, RS232, and DIMM DDR.]


Table V
CODING TIMES ON THE SOC SYSTEM WITHOUT AND WITH THE TCU

Resolution [pixels]   Without TCU [seconds]   With TCU [seconds]   Speedup
64 × 64               52                      7                    7.42
128 × 128             1056                    160                  6.6
256 × 256             17408                   2048                 8.5
512 × 512             303104                  45056                6.7

Table VI
DEVICE UTILIZATION SUMMARY

Number of External IOBs   25%
Number of RAMB16s         47%
Number of SLICEs          53%
Maximum frequency         101.327 MHz

The compressions performed on the FPGA system, both with and without the TCU, produced the same fractal codes as the conventional system described in Section II. Therefore, image (b) in Figure 1 is the image that would be obtained by reconstructing the image coded by the SoC at a resolution of 512 × 512 pixels. Likewise, the results shown in Table IV are valid for the system both with and without the TCU.

Table VI shows a summary of the FPGA resource utilization obtained when we synthesized the complete system shown in Figure 5.

VII. CONCLUSIONS

As the results in Table V show, the performance of coding an image with the fractal image compression algorithm improved substantially when a hardware component was integrated into the application. The added component computes the most time-consuming part of the application, which was identified through the analysis of the application's profile. We have therefore shown that it is possible to accelerate applications that require intensive computation through the use of hardware-software co-design. Moreover, this type of development offers great flexibility and speed during the design, implementation, and testing stages of the application, thereby reducing the gap between the design of hardware and the design of software. Furthermore, it is now possible to explore various strategies to obtain greater accelerations, for example by increasing the percentage of the application that is implemented in hardware, increasing the number of processing elements in hardware, or changing the degree of parallelism of the hardware processing unit.

In addition, it may be possible to find a combination of hardware and software that reduces power consumption.

REFERENCES

[1] Michael F. Barnsley and Lyman P. Hurd. Fractal Image Compression. A. K. Peters, Ltd., Natick, MA, USA, 1993.

[2] R. Heart D. Saupe. Fractal image compression: an introductory overview, chapter 2. New Orleans, USA, 1996.

[3] Erjun Zhao and Dan Liu. Fractal image compression methods: A review. In ICITA '05: Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05) Volume 2, pages 756–759, Washington, DC, USA, 2005. IEEE Computer Society.

[4] I. Kopilovic, D. Saupe, and R. Hamzaoui. Progressive fractal coding. In Proc. IEEE ICIP-01, pages 86–89, Thessaloniki, October 2001.

[5] Kevin P. Acken, Mary Jane Irwin, and Robert M. Owens. A parallel ASIC architecture for efficient fractal image coding. J. VLSI Signal Process. Syst., 19(2):97–113, 1998.

[6] A. Martínes-Ramírez, A. Díaz-Sánchez, M. Linares Aranda, and J. Vega Pineda. An architecture for fractal image compression using quad-tree multiresolution. ISCAS, IEEE, 2, 2004.

[7] Alejandro Martínez Ramírez. Diseño de un Procesador Digital de Señales para Aplicaciones Específicas en Comunicaciones. Tesis doctoral, Instituto Nacional de Astrofísica, Óptica y Electrónica, 2005.

[8] Tai-Chi Lee, Patrick Robinson, Michael Gubody, and Erik Henne. Software/hardware co-design implementation for fractal image compression. In ACM-SE 37: Proceedings of the 37th annual Southeast regional conference (CD-ROM), page 4, New York, NY, USA, 1999. ACM Press.

[9] M. Abid, T. Ben Ismail, A. Changuel, C. A. Valderrama, M. Romdhani, G. F. Marchioro, J. M. Daveau, and A. A. Jerraya. Hardware/software co-design methodology for design of embedded systems. Integr. Comput.-Aided Eng., 5(1):69–84, 1998.

[10] A. E. Jacquin. Image coding based on a fractal theory of iterated contractive image transformations. IEEE Transactions, 1(1):18–30, 1992.

[11] A. E. Jacquin. Fractal image coding: a review. Proceedings of the IEEE, 81:1451–1465, 1993.

[12] Ning Lu. Fractal Imaging. Academic Press, 525 B Street, Suite 1900, San Diego, CA, USA, 1997.
