SPECIAL ISSUE

Automation techniques for implementation of hybrid wave-pipelined 2D DWT

G. Seetharaman · B. Venkataramani · G. Lakshminarayanan

Received: 12 July 2007 / Accepted: 19 May 2008 / Published online: 10 June 2008

© Springer-Verlag 2008

Abstract In the literature, techniques such as pipelining

and wave-pipelining (WP) are proposed for increasing the

operating frequency of a digital circuit. In general, use of

pipelining results in higher speed at the cost of increase in

the area and clock routing complexity. On the other hand,

use of WP results in less clock routing complexity and less

area but enables the digital circuit to be operated only at

moderate speeds. In this paper, a hybrid wave-pipelining

scheme is proposed to get the benefits of both pipelining

and WP techniques. Major contributions of this paper are:

proposal for the implementation of 2D DWT using lifting

scheme by adopting the hybrid wave-pipelining and pro-

posal for the automation of the choice of clock frequency

and clock skew between the input and output registers of

wave-pipelined circuit using built in self test (BIST) and

system-on-chip (SOC) approaches. In the hybrid scheme,

different lifting blocks are interconnected using pipelining

registers and the individual blocks are implemented using

WP. For the purpose of evaluating the superiority of the

schemes proposed in this paper, the system for the com-

putation of one level 2D DWT is implemented using the

following techniques: pipelining, non-pipelining and

hybrid wave-pipelining. The BIST approach is used for the

implementation on a Xilinx Spartan-II device. The SOC approach is adopted for implementation on Altera and Xilinx field programmable gate array (FPGA) based SOC kits with Nios II or MicroBlaze soft-core processors. From

the implementation results, it is verified that the hybrid WP

circuit is faster than non-pipelined circuit by a factor of

1.25–1.39. The pipelined circuit is in turn faster than the

hybrid wave-pipelined circuit by a factor of 1.15–1.38 and

this is achieved with the increase in the number of registers

by a factor of 1.79–3.15 and increase in the number of LEs

by a factor of 1.11–1.65. The soft-core processor based

automation scheme has considerably reduced the effort

required for the design and testing of the hybrid wave-pipelined circuit. The techniques proposed in this paper are also applicable to ASICs, and the optimization schemes are also applicable to the computation of other image transforms such as the DCT and DHT.

Keywords DWT · Lifting · SOC · Wave-pipelining · Pipelining · Self test

1 Introduction

Programmable logic devices such as FPGAs offer an

alternative solution for the computationally intensive

functions performed traditionally by digital signal proces-

sors with Harvard architecture. The ability to design,

fabricate and test application specific integrated circuits

(ASICs) as well as FPGAs with gate count of the order of a

few tens of millions, has led to the development of com-

plex embedded system-on-chip. The development of

intellectual property (IP) cores for the FPGAs for a variety

of standard functions including processors enables a mul-

timillion gate FPGA to be configured to contain all the

components of a complete system. Development tools from FPGA vendors such as Altera or Xilinx enable the integration of IP cores and user-designed custom blocks with soft-core processors such as the MicroBlaze or Nios II processors [1, 2]. Systems designed by integrating IP cores and user-designed custom blocks with soft-core processors are far more flexible than those based on hard-core processors, and they can be enhanced with custom hardware to optimize them for a specific application [3].

G. Seetharaman (✉) · B. Venkataramani · G. Lakshminarayanan

Department of ECE, National Institute of Technology, Tiruchirappalli, India

e-mail: [email protected]

J Real-Time Image Proc (2008) 3:217–229

DOI 10.1007/s11554-008-0087-8

The increased performance available with SOC based

FPGAs makes them quite suited for implementation of area

as well as speed intensive image processing applications

such as discrete cosine transform (DCT) and discrete

wavelet transform (DWT). For example, the study in [4]

shows that an FPGA based image processing system is 8–800 times faster than one using a Pentium III processor.

For image processing applications, in addition to DCT,

wavelet transform is increasingly used. It is a part of the

joint photographic experts group (JPEG) 2000 standard for

still image compression. The VLSI implementation of

image encoders with DWT has been addressed in a number

of previous works. The implementation of 2D DWT using

lifting scheme and compression using EZT algorithm is

reported in [5] taking the advantage of flexible memory

configuration available in FPGAs. The image is partitioned into sub-images of size 32 × 32 and external memory is used for storing the sub-images and the transform coefficients in [5].

Block RAMs in FPGAs are proposed for storing the sub-

images and 2D DWT coefficients in [6]. A new multiplier

algorithm denoted as Baugh–Wooley pipelined constant

coefficient multiplier (BW-PKCM) which combines the

KCM with Baugh–Wooley multiplication algorithm is

proposed and used for the study and comparison of dis-

tributed arithmetic algorithm and lifting scheme [7] for 2D

DWT on FPGAs in [6].

Even though pipelining is adopted for high speed

applications such as that in [6], pipelined systems have a

number of disadvantages such as increase of power dissi-

pation, clock routing complexity and clock skews between

different parts of the system. The circuit design technique

such as wave-pipelining is one of the techniques proposed

for achieving high speed without the above limitations.

Wave-pipelined circuit dispenses with the need for regis-

ters for storing the intermediate results and instead uses the

inherent capacitance at the input to the various combina-

torial blocks. A number of systems have been implemented

using wave-pipelining on ASICs and FPGAs [8, 9]. The

concept of wave-pipelining has been described in a number

of previous works [10–14]. One of the limitations of the

wave-pipelined circuits is that their highest operating fre-

quency reduces with the complexity of the circuit or

equivalently the logic depth [14]. In order to combine the

advantages of both pipelining and wave-pipelining a hybrid

scheme is proposed in this paper. A complex circuit is split

into a number of smaller circuits and is pipelined. Each of

the smaller circuits is realized using wave-pipelining.

The organization of the rest of the paper is as follows: in

Sect. 2, the review of previous work on lifting based 2D

DWT with BW multiplier is described. In Sect. 3, the

previous work related to wave-pipelining and the chal-

lenges involved in the design of wave-pipelined circuits are

described. In Sect. 4, automation schemes for wave-pipe-

lined circuits are presented. In Sect. 5, the architecture

used and assumptions made for the implementation of the

2D DWT are presented. In Sect. 6, the implementation

results of the pipelined and hybrid wave-pipelined 2D

DWT are presented. Section 7 summarizes the conclusions.

2 Review of previous work on lifting based 2D DWT

with BW multiplier

The hybrid scheme is proposed to be used for the com-

putation of 2D DWT. The DWT decomposes a signal into

different sub-bands so that the lower frequency sub-bands

have finer frequency resolution and coarser time resolution

compared to the higher frequency sub-bands. A survey of

VLSI architectures for the computation of 2D DWT is

given in [15]. The 2D DWT may be computed using filter

banks. Figure 1 shows how an N × M image can be decomposed using sub-band decomposition for one level 2D DWT. The samples corresponding to the image pixels are passed through two stages of analysis filters. The elements of the pixel matrix are read row wise and are first processed by the low pass h[n] and high pass g[n] horizontal filters. The transform coefficient matrices are then sub-sampled by two along the rows to obtain two N × M/2 matrices L1 and H1. Subsequently, the outputs (L1, H1) are processed by low pass and high pass vertical filters to obtain four N/2 × M/2 transform coefficient matrices. Out

of these four matrices denoted as LL1, LH1, HH1 and HL1,

respectively, LL1 represents a coarse approximation of the

original image [15, 16].
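The row-then-column decomposition described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names are hypothetical, the filters use a simple periodic extension at the borders, and the two-tap averaging and differencing filters in the usage note are placeholders rather than the 9/7 filters used in this paper. The sub-band naming (first letter for the horizontal filter, second for the vertical filter) is also an assumption.

```python
def filter_and_downsample(samples, taps):
    """Filter a 1D sequence (periodic extension) and keep every second output."""
    n = len(samples)
    out = []
    for i in range(0, n, 2):
        acc = 0.0
        for k, t in enumerate(taps):
            acc += t * samples[(i + k) % n]
        out.append(acc)
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def dwt2_one_level(image, h, g):
    """One level of 2D sub-band decomposition of an N x M image (list of rows)."""
    # Horizontal pass: each row is filtered and sub-sampled, giving N x M/2.
    l1 = [filter_and_downsample(row, h) for row in image]
    h1 = [filter_and_downsample(row, g) for row in image]

    # Vertical pass: filter the columns by transposing, giving N/2 x M/2.
    def vertical(matrix, taps):
        cols = [filter_and_downsample(col, taps) for col in transpose(matrix)]
        return transpose(cols)

    ll1, lh1 = vertical(l1, h), vertical(l1, g)
    hl1, hh1 = vertical(h1, h), vertical(h1, g)
    return ll1, lh1, hl1, hh1
```

For a constant 4 × 8 image and the placeholder filters h = [0.5, 0.5], g = [0.5, -0.5], the LL1 output is a 2 × 4 matrix holding the same constant and the other three sub-bands are zero, which matches the interpretation of LL1 as a coarse approximation.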

For the two level 2D DWT, LL1 component is pro-

cessed by both horizontal and vertical filters and sub-

sampled to obtain four more matrices LL2, LH2, HH2 and

HL2. This process is continued until the desired level of

sub-band structure is obtained. The horizontal and vertical

filters shown in Fig. 1 may be implemented by adopting the

lifting scheme [7] which uses a factorization scheme for

the poly-phase matrix corresponding to the analysis filter.

The main feature of lifting based DWT scheme is to break

up the high pass and low pass wavelet filters into a

sequence of smaller filters. This scheme requires about

50% less computational complexity compared to that using


the convolution-based approach [7]. It has other advantages, including "in-place" computation of the DWT, integer

to integer wavelet transform and symmetric hardware

architecture for the computation of both forward and

inverse transform [15].

In the lifting scheme for a filter bank with the low pass

and high pass filters of nine and seven taps, respectively,

the odd and even input samples are processed by five lifting

blocks [a, b, c, d, n (n1, n2)] in cascade as shown in Fig. 2.

n1, n2 are scaling blocks.
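A minimal software sketch of such a cascade is given below. It is not taken from this paper: the coefficient values are the commonly quoted constants of the 9/7 lifting factorization, the border handling (mirroring the nearest neighbour) and the choice of which output is multiplied or divided by the scaling constant are assumptions, and all function names are hypothetical. The input length is assumed to be even.

```python
# Commonly quoted 9/7 lifting constants (assumed here, not listed in the text).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def _predict(even, odd, k):
    """odd[i] += k * (even[i] + even[i+1]), mirrored at the right edge."""
    n = len(odd)
    return [odd[i] + k * (even[i] + even[min(i + 1, n - 1)]) for i in range(n)]


def _update(even, odd, k):
    """even[i] += k * (odd[i-1] + odd[i]), mirrored at the left edge."""
    return [even[i] + k * (odd[max(i - 1, 0)] + odd[i]) for i in range(len(even))]


def lifting_97_forward(x):
    """Four lifting steps followed by scaling; returns (low, high) bands."""
    even, odd = list(x[0::2]), list(x[1::2])
    odd = _predict(even, odd, ALPHA)
    even = _update(even, odd, BETA)
    odd = _predict(even, odd, GAMMA)
    even = _update(even, odd, DELTA)
    return [ZETA * v for v in even], [v / ZETA for v in odd]


def lifting_97_inverse(low, high):
    """Undo the scaling, then undo the lifting steps in reverse order."""
    even = [v / ZETA for v in low]
    odd = [v * ZETA for v in high]
    even = _update(even, odd, -DELTA)
    odd = _predict(even, odd, -GAMMA)
    even = _update(even, odd, -BETA)
    odd = _predict(even, odd, -ALPHA)
    x = [0.0] * (len(even) + len(odd))
    x[0::2], x[1::2] = even, odd
    return x
```

Because every step only adds a function of the other stream, the inverse is obtained by running the same steps backwards with negated constants, which is the structural reason the forward and inverse transforms can share hardware.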

The internal diagrams of the a and b blocks are shown in

Figs. 3 and 4. The c and d blocks are obtained by replacing

the constants a, b with c, d. In Figs. 3 and 4, since the

output from one block is fed as the input to the next block,

the maximum rate at which the input can be fed to the

system depends on the sum of the delays in all the four

stages. The speed may be increased by introducing pipe-

lining at the points indicated by dotted lines in Figs. 3 and

4. In this case, the input rate is determined by the largest

delay among all the four blocks.
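The two cases above can be compared numerically with a short sketch. The delay values are hypothetical and register overheads (setup time, clock-to-output delay) are ignored, so the result is only an upper bound on the input rate.

```python
def max_input_rate_mhz(stage_delays_ns, pipelined):
    """Upper bound on the input rate from the per-block delays (in ns)."""
    # Without pipelining the data must traverse all blocks in one period;
    # with pipelining the slowest block alone sets the period.
    period_ns = max(stage_delays_ns) if pipelined else sum(stage_delays_ns)
    return 1000.0 / period_ns


# Hypothetical delays of the four lifting blocks, in ns.
delays = [10.0, 8.0, 10.0, 8.0]
rate_plain = max_input_rate_mhz(delays, pipelined=False)
rate_pipelined = max_input_rate_mhz(delays, pipelined=True)
```

With these assumed delays the non-pipelined cascade is limited by the 36 ns total, while the pipelined cascade is limited by the 10 ns slowest block.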

The delay in the individual stages may be reduced fur-

ther by using constant coefficient multiplier (KCM) which

uses a look up table (LUT) for finding the product of a

constant and a variable. The variable is fed as address to

the LUT, which contains the products corresponding to all

possible combinations of the operands. FPGAs normally

contain four-input LUTs. When a LUT with more inputs is required, it has to be implemented using a number of stages of four-input LUTs and adders. For example, a 12 × 12 bit KCM is implemented using three 4 × 12 bit KCMs and two stages of 16 bit adders. The speed

of the KCM can be increased by introducing the pipelining

registers at the outputs of LUTs and adders.
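The decomposition of a 12 × 12 bit KCM into three 4 × 12 bit look-up tables can be modelled functionally as follows. This is a behavioural sketch for unsigned operands only; the function names are hypothetical, and the adder widths and pipelining registers of the hardware are not modelled.

```python
def build_kcm_luts(constant, nibbles=3):
    """One 16-entry table per 4-bit slice; entry v holds constant * v."""
    return [[constant * v for v in range(16)] for _ in range(nibbles)]


def kcm_multiply(luts, x):
    """Multiply the unsigned 12-bit operand x by the constant stored in luts."""
    # Each 4-bit slice of x addresses its own table; the looked-up partial
    # product is weighted by the position of the slice.
    partial = [luts[i][(x >> (4 * i)) & 0xF] << (4 * i) for i in range(3)]
    stage1 = partial[0] + partial[1]   # first adder stage
    return stage1 + partial[2]         # second adder stage


luts = build_kcm_luts(1234)            # 1234 is an arbitrary example constant
product = kcm_multiply(luts, 0x5A3)
```

The two additions correspond to the two adder stages mentioned above; in the pipelined version, registers would be placed after the table look-ups and after each adder.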

The content of the LUT corresponding to multiplication

of signed numbers can be computed using three approa-

ches: (a) Assuming unsigned multiplication and 2’s

complement blocks (resulting multiplier is referred to as

conventional 2’s complement multiplier (C2CM)) (b)

Using sign extension (c) Baugh Wooley (BW) multiplier.

Fig. 1 Sub-band decomposition of an N × M image

Fig. 2 Simplified block diagram of lifting scheme for 9/7 filter

Fig. 3 a Block

Fig. 4 b Block

The pipelined constant coefficient multiplier (PKCM) using the BW content is referred to as BW-PKCM and it is shown to be superior compared to the other two approaches [6]. Hence, only this multiplier is considered for wave-pipelining in this paper. The detailed diagram of the a block implemented using BW-PKCM is shown in Fig. 5.

The same scheme can be adopted for the b, c, d, n1, n2

blocks. The dotted line indicates points where registers

may be inserted for pipelining. For wave-pipelining all the

stages are directly connected without registers. The regis-

ters are used only at the inputs and outputs. In hybrid wave-

pipelining, registers are used between adjacent lifting

blocks and the individual lifting blocks are connected

without registers.

3 Review of previous work on wave-pipelining

In this section, the technique used for wave-pipelining the

a block in Fig. 5 is considered. An RTL model of a circuit

consists of a combinational logic circuit separated by the

input and output registers. The combinational logic circuit

may be considered to be a wave-pipelined circuit if a

number of waves are made to simultaneously propagate

through it as shown in Fig. 6, [10]. In other words, at any

point of time, a sequence of data is processed in the

combinational logic block. In the case of pipelining, only

one data is processed in the combinational logic block at a

time. Further, the maximum data rate in the pipelined

circuit depends only on Dmax, the maximum propagation

delay in the combinational logic block. Figure 7 shows

temporal/spatial diagram of combinational logic circuits

[11]. If Dmin denotes the minimum propagation delay of the

signal through the combinational logic block, the maxi-

mum data rate of the wave-pipelined circuit depends on

(Dmax–Dmin).

In the case of a block, Dmax corresponds to the pro-

cessing and propagation delay between the even samples

and a0 output (this involves three adder delays, one LUT

delay and four interconnect delays); Dmin corresponds to

the processing and propagation delay between the odd

samples and a0 output (this involves two adder delays and

two interconnect delays).
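As a rough numerical illustration, the dependence on (Dmax–Dmin) can be evaluated for the a block using the delay values reported in Sect. 6.1 (Dmax = 15.302 ns, Dmin = 7.34 ns). The sketch below is idealized: register setup and hold times, clock skew and delay variations are lumped into an optional overhead term that defaults to zero, the 8.4 ns period is the shortest setting of the programmable clock described in Sect. 6.1, and the estimate of the number of waves in flight is a simple approximation. The function names are hypothetical.

```python
import math


def wave_pipeline_min_period(d_max_ns, d_min_ns, overhead_ns=0.0):
    """Idealized lower bound on the clock period of a wave-pipelined block."""
    return (d_max_ns - d_min_ns) + overhead_ns


def waves_in_flight(d_max_ns, t_clk_ns):
    """Approximate number of data waves inside the logic at the same time."""
    return math.ceil(d_max_ns / t_clk_ns)


t_min = wave_pipeline_min_period(15.302, 7.34)   # about 7.96 ns
n_waves = waves_in_flight(15.302, 8.4)           # 8.4 ns: shortest programmable period
```

For comparison, a register-to-register design with the same logic would need a period of at least Dmax, which is why equalizing Dmax and Dmin is the main lever for speeding up a wave-pipelined circuit.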

Traditionally, in a wave-pipelined circuit, higher speeds

are achieved by equalizing the Dmax and Dmin [10]. The

output of the wave-pipelined circuit alternates between

unstable and stable states. The stable period decreases with

the increase in the logic depth. By adjusting the latching

instant at the output register to lie in the stable period, the

wave-pipelined circuit can be made to work properly. But,

for large logic depths, there may not be any stable period.

Hence, adjusting the latching instant by itself may not be

adequate for storing the correct result at the output register.

For such cases, the clock period has to be increased to

increase the stable period. Equalization of path delays,

adjustment of the clock period and clock skew are the three

tasks carried out for maximizing the operating speed of the

wave-pipelined circuit. All the three tasks require the

delays to be measured and altered if required. Layout

editors, such as FPGA editor from Xilinx, or Floor planner

from Altera may be used for this purpose.

Fig. 5 a Block using BW-PKCM

Fig. 6 Multiple coherent waves of data sent through combinational logic acting as pipeline in WP

Fig. 7 Temporal/spatial diagram of combinational logic circuits

These tasks are carried out manually in [13, 14]. The wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because of the difference between the actual delays and the delays calculated by the layout editor: the layout editor considers only the worst case delays, and the actual delays may be significantly different due to fabrication variations. This difference becomes important as the logic depth of the circuit increases. Hence, the design is downloaded to the actual FPGA and its operation is checked using a personal computer (PC) based test system in [14]. If correct results are not obtained, delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until the correct results are obtained. The design of a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in the next section.

4 Automation schemes for wave-pipelined circuits

Equalization of the path delays of the combinational logic

blocks such as the a block is considered first. This cannot

be completely automated as the commercially available

synthesis tools do not support the specification of inter-

connect delays. However, the difference in path delays can

be minimized by specifying the physical location of logic

cells (referred to as slices in Xilinx FPGAs) or logic ele-

ments used for the implementation, through either the user

constraints file (UCF) or the Logic lock feature supported

by the FPGA CAD tools [2, 3]. UCF approach is proposed

for Xilinx FPGAs in [14]. The logic lock feature is adopted

for the Altera FPGAs in this paper.

The adjustment of the clock skew and clock period can

be automated by using programmable clock, skew gener-

ator and a processor. Clock generator using LUTs and

interconnects (nets) is proposed for the first time in [14].

(The LUTs are programmed as non-inverting buffers). The

interconnects are manually chosen using the FPGA layout

editor in [14]. The programmable clock is proposed in this

paper using multiplexers in addition to the LUTs and nets

as shown in Fig. 8. The interconnect delays are selected

using the multiplexer. The number of possible interconnect

delays (Di) is restricted to minimize the overheads due to

the additional LUTs required for the introduction of the

delay and the multiplexers. Hence, only a coarse variation

in the delay values can be achieved.

Using the manual routing, much smaller variations in

the delay may be achieved. In Fig. 8, inputs C0–C3 are the

programmable select inputs, which determine the actual

clock frequency. The diagram of programmable clock skew

circuit is given in Fig. 9. In Fig. 9, Di denotes the ith

interconnect (net) specifically introduced to vary the delay.

In addition to this, there are interconnects between the

output of one multiplexer and the input of another multi-

plexer and also between the LUTs. But their delay values

are not controlled by the program. The select inputs S0–S3

are the programmable delay inputs.

The clocks required for the wave-pipelined circuit may

also be derived using the internal system clock generator of

Altera and Xilinx system-on-programmable chip (SOPC)

devices. The maximum operating frequency in this case is

limited by the system bus. Alternately, an external clock

may be multiplied by an arbitrary number using the Altera

mega core function altclkclock or pllclock. Similarly, in the

case of Xilinx Spartan-III family FPGAs, the delay-locked

loop (DLL)/digital clock manager (DCM) module may be

used for clock multiplication. However, the multiplication

factor has to be specified at the synthesis time and hence

the clock frequency cannot be dynamically altered as in the

scheme given in Fig. 8.

Fig. 8 Programmable clock generator

Fig. 9 Programmable clock skew circuit

The circuit using the programmable clock and skew generator is a suboptimal wave-pipelined circuit but can operate at a higher frequency than that reported by the commercially available synthesis tools which use Dmax for fixing the operating frequency. The clock and skew generator may be programmed using either an off-chip processor or an on-chip processor. In order to minimize the time

required for adjustment of the parameters of the wave-

pipelined circuit (clock frequency and skew), the BIST

approach for design for testability [17, 18] may be used. In

the BIST approach, a finite state machine (FSM) is

assumed to be available off-chip and is used for adjustment

of the parameters of the wave-pipelined circuit [19, 20]. In

the SOC approach, a processor is assumed to be available

on-chip and it is used for adjustment of the parameters of

the wave-pipelined circuit.

4.1 BIST approach for wave-pipelined circuit

Testing a large chip requires a large test sequence and

application of these test sequences to the circuit under test

(CUT) using external testers is time consuming. Built in

self test scheme is an alternative for minimizing the testing

time. In the BIST scheme, the test sequences are internally

generated, applied to the CUT at full speed and a signature

is generated for finding whether it is good or bad. The

block diagram of a wave-pipelined circuit with BIST is

given in Fig. 10. This is obtained by adding the FSM block and a self-test circuit to the wave-pipelined circuit. The self-test circuit contains a programmable clock, a clock skew generator, a signature analyzer and a test vector RAM.

4.1.1 FSM block

The flow chart given in Fig. 11 describes the function

performed by the FSM. In Fig. 11, {T_i : i = 0, 1, 2, …, N − 1} and {d_j : j = 0, 1, 2, …, M − 1} denote the sets of clock periods and clock skews, respectively. The FSM block

generates the control signal to choose between the normal

mode and the self test mode and this is applied to the select

input of multiplexer. In the self test mode, the FSM sys-

tematically varies the clock skews and clock periods. For

each clock frequency and skew, the self test circuit gen-

erates the test inputs, applies them, generates the signature,

compares it with the expected result and finally generates a

flag indicating the match. The FSM progresses with the

testing till the frequency at which the circuit under test

works for at least three or more skew values is found. The

operating skew value is chosen to be the middle value so

that the CUT would reliably work even if the delays

change due to environmental conditions. For example, in

Fig. 7, when the skew is chosen so that it corresponds to

either t2 or t20, the circuit would reliably work during its

entire life time. In order to minimize the time required to

determine the correct value of clock skew and clock per-

iod, a two step procedure is adopted. The clock frequencies

are varied by large steps to determine the range of fre-

quency in which the circuit works. This is achieved by

varying only the higher order two bits of the select inputs

of the programmable clock. After the range is determined,

fine tuning is achieved by varying the lower order bits. For

every frequency at which the circuit is tested, the clock skews are varied gradually, the results are tested for correctness, and the clock skews for which the circuit works satisfactorily are noted. The testing time can be

minimized by using the optimal test vector set and a sig-

nature analyzer [17].
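The search performed by the FSM can be sketched in software as follows. Several details are assumptions made for illustration and are not specified by the flow chart description: the clock and skew select inputs are taken to be 4-bit codes, a larger clock code is assumed to give a longer period, the search starts from the shortest period, and the "three or more skew values" are required to be consecutive so that a middle value is meaningful. The callback `cut_passes(clk_code, skew_code)` is a hypothetical stand-in for running the self test and reading the match flag.

```python
def passing_skews(cut_passes, clk_code, n_skews=16):
    """Skew codes for which the self test passes at the given clock code."""
    return [s for s in range(n_skews) if cut_passes(clk_code, s)]


def longest_run(codes):
    """Longest run of consecutive integers in an ascending list."""
    best, run = [], []
    for c in codes:
        run = run + [c] if run and c == run[-1] + 1 else [c]
        if len(run) > len(best):
            best = run
    return best


def tune(cut_passes, min_window=3):
    """Return (clk_code, skew_code) or None if no usable setting exists."""
    def window(clk_code):
        run = longest_run(passing_skews(cut_passes, clk_code))
        return run if len(run) >= min_window else None

    # Coarse step: vary only the two higher order bits of the clock select.
    coarse = next((c for c in (0b0000, 0b0100, 0b1000, 0b1100) if window(c)), None)
    if coarse is None:
        candidates = range(0b1101, 0b10000)        # remaining longest periods
    else:
        candidates = range(max(coarse - 3, 0), coarse + 1)

    # Fine step: vary the lower order bits inside the range found above.
    for clk_code in candidates:
        run = window(clk_code)
        if run:
            return clk_code, run[len(run) // 2]    # middle of the passing skews
    return None
```

Choosing the middle of the passing run leaves margin on both sides of the latching instant, which is the reason given above for requiring at least three working skew values.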

4.1.2 Signature generator

Fig. 10 Self-tuned wave-pipelined circuit

For testing the correctness of the circuit, N test vectors may be fed one after another and the N outputs obtained should be compared with the expected outputs. In order to minimize the number of comparisons, a unique signature is generated out of the N outputs and it is compared with the signature corresponding to the expected outputs. The signature generator consists of a pseudo random binary sequence (PRBS) generator with multiple data inputs [17] as shown in Fig. 12. Each successive output of the output register is XORed with the state of the PRBS generator to generate the next state.

If the test vector set consists of N vectors, the PRBS

generator output contains the signature after application of

N clock pulses. However, due to the propagation delay in

the random access memory (RAM), I/O registers and the

combinational logic block, the time at which signature

generation begins should be delayed with respect to the time

at which the application of test vectors begins. The delay

depends on the depth of the combinational logic blocks.
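A multiple-input signature register of this kind can be modelled in a few lines. The sketch below is an assumption-laden illustration rather than the circuit of Fig. 12: the register width, the feedback polynomial (0x1021) and the zero seed are arbitrary choices, and the `latency` parameter is a hypothetical way of modelling the delayed start of signature generation by discarding the first few output words.

```python
def misr_signature(outputs, width=16, poly=0x1021, seed=0, latency=0):
    """Compress a sequence of output words into a single signature."""
    mask = (1 << width) - 1
    state = seed & mask
    # Skip the words produced before the first valid result reaches the output.
    for word in outputs[latency:]:
        msb = (state >> (width - 1)) & 1
        state = (state << 1) & mask        # shift the PRBS register
        if msb:
            state ^= poly                  # feedback taps
        state ^= word & mask               # fold in the parallel data input
    return state


expected = [0x123, 0x456, 0x789, 0x0AB]
observed = [0x123, 0x456, 0x78B, 0x0AB]    # one bit differs in the third word
match = misr_signature(observed) == misr_signature(expected)
```

Only the final state has to be compared with the stored reference, so one comparison replaces N. Different output sequences can in principle map to the same signature, which is the usual aliasing trade-off of signature analysis.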

4.1.3 Test vector generation

In principle, the number of test vectors required for an M

input combinational logic circuit is 2^M. If the value of M is

small, exhaustive testing of the circuit may be carried out

by generating the test inputs through an M bit counter and

checking the signature after the counter completes one full

cycle. However, some of the inputs may contribute more to

Dmax than the others. For example, in the case of the

multipliers, the maximum propagation delay occurs only

when MSBs of the operands are 1. If the multiplier works

for this case, it will work for the other cases where at least

one of the MSBs is zero. Hence, a (M - 2) bit counter is

adequate for testing. For circuits with large number of

inputs, exhaustive testing would require very large testing

time. Minimal test vector set, which reduces the testing

time without compromising the quality of detection of

faults, may be obtained using the automatic test pattern

generator (ATPG) algorithms [17]. Computer-aided design

tools may also be used for generating the minimal test

vectors using ATPG algorithm and assessing their fault

coverage ratio. However, the generation of test patterns for

wave-pipelined circuit is non-trivial because we have to

account for data dependent delays (delay for 001 is dif-

ferent from that for 101) [11] and this is compounded by

the absence of accurate models for interconnects in FPGAs.

Since the conventional ATPG techniques are not applicable

for wave-pipelined circuits, we have to be content with only

random test vectors. By choosing different test vector sets

consisting of different combinations and different ordering

of test vectors, we can improve the confidence level.
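The reduced exhaustive test mentioned above for multipliers, in which the MSBs of both operands are held at 1 and the remaining inputs are driven by an (M − 2) bit counter, can be sketched as follows. The generator name and the assumption of two operands of equal width (so that M is twice the operand width) are illustrative.

```python
def msb_forced_vectors(operand_bits):
    """Yield (a, b) operand pairs with the MSB of each operand held at 1."""
    free = operand_bits - 1                  # free bits per operand
    msb = 1 << free
    for count in range(1 << (2 * free)):     # the (M - 2)-bit counter
        a = msb | (count >> free)            # upper half of the counter
        b = msb | (count & (msb - 1))        # lower half of the counter
        yield a, b


vectors = list(msb_forced_vectors(4))        # two 4-bit operands, M = 8
```

For two 4-bit operands this gives 64 vectors instead of 256. Whether such a reduced set is sufficient depends on the claim that the worst case delay occurs only with both MSBs set, so it should be treated as a heuristic.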

4.2 SOC approach for wave-pipelined circuits

Fig. 11 Flowchart of FSM operation

Fig. 12 Signature generator

As mentioned in Sect. 4.1, the BIST approach requires a number of overheads such as the FSM, signature generator and test vector RAM. These blocks are useful only when the clock frequency and skew are to be varied. If the operating frequency is chosen so that the stable period in Fig. 7 is at least twice the worst case variation in the delay due to temperature, neither the clock frequency nor the skew needs to be adjusted again. After this initial

selection, the 2D DWT blocks require no further tuning and

work satisfactorily without any external intervention.

Instead of using a dedicated circuit such as BIST, a pro-

cessor may be used to carry out the above tuning task. For

example, an FPGA based speech recognition with SOC

may perform the various tasks required by optimally par-

titioning between hardware and software [21]. The tasks

performed in software uses the on-chip processor. The

hardware block may use wave-pipelining and it may be

tuned by the on-chip processor at the beginning.

For the SOC approach, PRBS generator, signature

comparator blocks in Fig. 8 may be replaced by a block

RAM which is used to store the outputs of the CUT cor-

responding to the test inputs. Since the communication

interface between the on chip processor and the circuit

under test is faster, the outputs can be directly read and

compared with the expected output for every combination

of skew and clock frequency. The flow chart in Fig. 11 can

be modified accordingly. The select inputs for the clock as

well as skew blocks and the data inputs to the wave-

pipelined circuit may be applied and varied through the on-

chip processor.

A variety of choices exist for the implementation of

SOC. The SOC may consist of a hard-core processor such as a PowerPC or ARM processor and an FPGA coprocessor or DSP block. Alternatively, it may consist of a soft-core processor such as Nios II or MicroBlaze and a custom DSP block implemented in the FPGA. In this paper, FPGA based SOCs consisting of either a Nios II or a MicroBlaze soft-core processor are used for the implementation. Figure 13 shows

the interface diagram of a Nios II processor along with the

custom block (hybrid wave-pipelined circuit).

5 Architecture for the computation of 2D DWT using

lifting scheme

The automation schemes proposed in the previous section

are used for tuning the hybrid scheme for 2D DWT. The

details of architecture used and the assumptions made

about the individual blocks of 2D DWT are presented in

this section. Sub-images of size 32 × 32 with 8 bits per pixel are used for the computation. The DWT coefficients are assumed to be represented using 11 bits. Each pixel is extended to 11 bits by prepending three zeros at the most significant positions. This is done in order

to make the word size of the inputs to the horizontal filter

and vertical filters to be the same. This enables the same

hardware or program to be reused for the computation of

the outputs of both horizontal and vertical filters. The

inputs to the horizontal filter are the pixel intensity values

whereas the inputs to the vertical filters are DWT coeffi-

cients. The lifting multiplier constants (a, b, c, d, n) are

assumed to be of 11 bits each. The block diagram of one

level 2D DWT is shown in Fig. 14. For the horizontal fil-

ters, the even and odd inputs are applied from two block

RAMs of size 512 × 11. The result is written into four block RAMs of size 256 × 11. For the vertical filters, the inputs are applied from these four block RAMs and

the outputs are written into another four block RAMs. For

testing, the image is assumed to be loaded into the block

RAMs using memory initialization file (MIF).
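The preparation of the input data described above (zero extension of each 8-bit pixel to 11 bits and splitting of the row-wise pixel stream into even and odd block RAMs) can be sketched as follows. The function names are hypothetical, and the ordering of words inside each RAM is an assumption.

```python
def to_11_bit(pixel):
    """Zero-extend an 8-bit pixel to the 11-bit word used by the filters."""
    assert 0 <= pixel <= 0xFF
    return pixel & 0x7FF                   # the upper three bits stay zero


def split_even_odd(sub_image):
    """Read a 32 x 32 sub-image row wise and split it into even/odd streams."""
    even, odd = [], []
    for row in sub_image:
        words = [to_11_bit(p) for p in row]
        even.extend(words[0::2])           # contents of the even block RAM
        odd.extend(words[1::2])            # contents of the odd block RAM
    return even, odd


# Hypothetical test pattern standing in for an image loaded from a MIF.
sub_image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]
even_ram, odd_ram = split_even_odd(sub_image)
```

Each stream holds 512 words, which is consistent with the two 512 × 11 block RAMs at the input of the horizontal filter.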

5.1 Block diagram of two level 2D DWT

Fig. 13 Adding custom logic to the Nios II ALU

Fig. 14 Overall block diagram of one level 2D DWT

The block diagram of two level 2D DWT is shown in Fig. 15. In order to minimize the area required for implementation, the horizontal filter and the vertical filters are reused to compute the multilevel 2D DWT. Block RAMs E1, O1 contain the even and odd streams of the initial data to be transformed. Block RAMs E2E/E2O, E3, E4, E5 denote the output of one level 2D DWT. The even and odd numbered coefficients of the LL1 component are stored in two block RAMs E2E and E2O and are used as inputs for the 2nd level DWT. The outputs of the 2nd level DWT are stored in block RAMs E6, E7, E8 and E9. The output of the horizontal filter is stored in four block RAMs E10, O10, E11, O11. If only LL2, the low pass band corresponding to the two level DWT, is required, only one demultiplexer and seven block RAMs (E1, E2E, O1, E2O, E10, O10, E3) are required. For the purpose of verification, only LL2 is computed and compared for the different schemes of computation of two level 2D DWT. For the computation of the LL1 component of one level 2D DWT, only block RAMs E1, O1, E2E, E2O, E10, O10 are used.

5.2 Overlapping scheme for the computation

of 2D DWT of complete image

The architecture proposed in Sect. 5.1 for sub-images of

size 32 × 32 may be used for the computation of 2D DWT of a larger image by splitting it into a number of overlapping sub-images of size 32 × 32. The advantage of

splitting the image into a number of sub-images is to per-

form the computation of 2D DWT in parallel in a number

of computational engines. Further, it also reduces the

memory required for storing the image and its transform. In

the overlapping scheme, the image blocks are formed such that the number of pixels overlapped between adjacent blocks along the vertical and horizontal directions is equal

to the order of the filter. For example, for the 9/7 bi-

orthogonal filter used for the 2D DWT, the number of

overlap pixels should be equal to four on the left and four

on the right between horizontal blocks. Similarly, the

number of pixel overlap between vertical blocks should be

equal to four on the top and four on the bottom. For the

blocks on the boundary, overlapping needs to be done only

on the non-boundary edge.
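One possible way of generating the block positions along one image axis is sketched below. It rests on an interpretation that is an assumption: each block contributes four pixels of overlap on every non-boundary side, so adjacent blocks share eight pixels and the block start advances by 24 pixels. The function name is hypothetical, and the last block is simply clipped at the image border.

```python
def tile_windows(length, block=32, overlap=4):
    """Return (start, stop) windows covering [0, length) along one axis."""
    windows = []
    start = 0
    while True:
        stop = min(start + block, length)
        windows.append((start, stop))
        if stop == length:                 # reached the image boundary
            return windows
        # The next block starts `overlap` pixels before the part of this
        # block that is not itself overlap margin.
        start = stop - 2 * overlap


rows = tile_windows(80)
cols = tile_windows(80)
tiles = [(r, c) for r in rows for c in cols]   # 2D blocks as pairs of windows
```

Applying the same function to the row and column axes gives the set of sub-images that can be handed to separate computational engines.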

6 Implementation results

In order to demonstrate the applicability of the automation approaches to both Xilinx and Altera FPGAs, the 2D DWT is implemented using both Xilinx Spartan and Altera Cyclone FPGAs and the results are presented in this section. On each of the FPGAs, the 2D DWT is computed using three multiplication schemes: hybrid WP-P BW-KCM, non-pipelined BW-KCM and BW-PKCM.

6.1 Programmable clock and skew generators

The operating frequency of the wave-pipelined circuit is expected to lie between those of the non-pipelined and pipelined circuits. Hence, the minimum and maximum frequencies of the clock generator should correspond to the maximum operating frequencies of the non-pipelined and pipelined circuits, respectively. The approximate values of the clock periods of these circuits for the implementation of the β block on Cyclone FPGA are 5.6 and 7.4 ns, respectively. The values of Dmax and Dmin for the α block are 15.302 and 7.34 ns, respectively. The programmable clock and skew generators are designed such that the clock period can be varied from 8.4 to 20.6 ns in steps of approximately 0.8 ns and the skew can be varied from 12.3 to 26.2 ns in steps of approximately 0.9 ns. The same exercise is carried out for the β, γ and δ blocks using the synthesis report. A single clock generator is used for all four blocks, and a separate skew generator is used for each block. To remove glitches in the clock signal, the use of a “majority logic gate” is suggested in [23]. The operating frequency and skews are chosen using an FSM such that all the blocks work satisfactorily. A similar procedure is adopted for the implementation on the other two FPGAs.
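The numbers quoted above can be related to each other with a short calculation. The sketch below assumes the textbook wave-pipelining condition that the clock period must exceed Dmax − Dmin (register set-up, hold and clock uncertainty are ignored here), and it enumerates the generator settings under the assumption of perfectly uniform steps, whereas the text only gives approximate step sizes. The helper `settings` is hypothetical.

```python
def settings(start_ns, stop_ns, step_ns):
    """Uniformly spaced settings from start_ns up to stop_ns (inclusive)."""
    values = []
    k = 0
    while start_ns + k * step_ns <= stop_ns + 1e-9:
        values.append(round(start_ns + k * step_ns, 3))
        k += 1
    return values


D_MAX_NS, D_MIN_NS = 15.302, 7.34          # delays quoted in the text
min_period_bound = D_MAX_NS - D_MIN_NS      # overheads ignored (assumption)

periods = settings(8.4, 20.6, 0.8)          # programmable clock periods
skews = settings(12.3, 26.2, 0.9)           # programmable clock skews
usable = [p for p in periods if p > min_period_bound]

print(f"Dmax - Dmin = {min_period_bound:.3f} ns")
print(len(periods), "clock settings,", len(skews), "skew settings")
print("shortest period above the bound:", min(usable), "ns")
```

With these assumptions the shortest programmable period of 8.4 ns lies just above the bound of about 7.96 ns, so the generator range starts close to the fastest rate at which waves would not collide.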

The locations of the logic elements and interconnects used for the implementation of the clock and skew blocks should be fixed so that the interconnect delays are not altered when these blocks are integrated with the 2D DWT or the soft-core processor. This is achieved using the LogicLock feature for Altera FPGAs and using macros for Xilinx FPGAs.

6.2 Implementation of 2D DWT using BIST approach

Fig. 15 Block diagram of two level 2D DWT

The one level 2D DWT is implemented on a Xilinx Spartan-II XC2S100 FPGA using the BIST approach. It may be noted that the BIST approach is also applicable to Altera FPGAs. A personal computer (PC) is used for the realization of the FSM. The interface used between the PC and the FPGA is the same as that described in [14]. The output of the hybrid circuit (11 bits) is XORed with the output of an 11-bit PRBS generator and the signature is obtained.
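One way to picture this signature step is as a multiple-input signature register, in which the state of an 11-bit PRBS generator is XORed with each 11-bit output word. The sketch below is an illustration only: the feedback polynomial x^11 + x^9 + 1, the all-ones seed and the function names are assumptions, not details given in the paper or in [14].

```python
WIDTH = 11
MASK = (1 << WIDTH) - 1


def lfsr_step(state):
    """Advance an 11-bit LFSR (assumed polynomial x^11 + x^9 + 1)."""
    feedback = ((state >> 10) ^ (state >> 8)) & 1
    return ((state << 1) | feedback) & MASK


def signature(outputs, seed=MASK):
    """Fold a stream of 11-bit output words into one 11-bit signature."""
    state = seed
    for word in outputs:
        state = lfsr_step(state) ^ (word & MASK)
    return state


golden = signature([5, 1023, 42, 2047, 0])
faulty = signature([5, 1023, 43, 2047, 0])   # one bit flipped in one word
print(hex(golden), hex(faulty), golden != faulty)
```

Because the register update is linear and invertible, a single corrupted output word always changes the final signature, so comparing one 11-bit value against a stored reference is enough to flag such an error.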

The implementation results of the 9/7 horizontal filters for one level 2D DWT on the Xilinx Spartan-II XC2S100 FPGA are given in Table 1. Multipliers of size 11 × 8 are implemented. From Table 1, it may be concluded that, for the filter, the method using hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.32 and requires the same area. The pipelined BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.97, and this is achieved with an increase in the number of registers by a factor of 4.6 and an increase in the number of slices by a factor of 1.79.

The implementation results of one level 2D DWT for a sub-image of size 32 × 32 using the BIST approach are shown in Table 2. In order to make the horizontal and vertical filters identical, multipliers of size 11 × 11 are used for both of them. Three zeros are appended before the input samples as discussed in Sect. 5. The overheads required for the wave-pipelined circuits are also shown in Table 2. It may be noted that the overhead required is about 22.5%. From Table 2, it may be concluded that, for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.4 and requires the same area. The pipelined BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.2, and this is achieved with an increase in the number of registers by a factor of 2.73 and an increase in the number of slices by a factor of 1.32.

Table 3 shows the implementation results for two level 2D DWT for the pipelined scheme. The implementation of the hybrid WP two level 2D DWT using the BIST approach is in progress.

6.3 Implementation of 2D DWT using SOC approach

For the hybrid WP block, the optimal clock period and clock skews are determined using the procedure described in Sect. 6.1. The hybrid wave-pipelined 2D DWT unit (obtained by adding the input and output block RAMs to the non-pipelined circuit along with the programmable clock and clock skew blocks) is tested first using simulation. As mentioned in Sect. 3, simulation is inadequate to test the hybrid wave-pipelined circuit. Hence, this circuit is implemented along with the Nios II or MicroBlaze soft-core processor, to which it is added as a custom block using SOPC Builder or the Embedded Development Kit (EDK), respectively. The program to be executed by the Nios II or MicroBlaze is written in C/C++ and the custom block is invoked as a function in this program. The program reads from and writes to the block RAM in the custom block. It is compiled, and the executable code, along with the configuration bits corresponding to the Nios II or MicroBlaze integrated with the custom block, is downloaded to the FPGA. When the program is run, it systematically varies the select inputs for the clock and clock skew blocks, and uploads the content of the output block RAM.

Table 1 Implementation of 9/7 bi-orthogonal filters with 11 × 8 multipliers using the various schemes

Multiplier (BW-KCM)   Slices   Number of registers   Speed (MHz)
Non-pipelined         253      176                   57.3
Pipelined             453      803                   149.18
Hybrid WP-P           253      176                   75.75

Table 2 Implementation results on one level 2D DWT (lifting scheme)

Approach and device                 Scheme           Slices or LEs   Registers    Speed (MHz)
BIST, Spartan-II XC2S100PQ208-5     Non-pipelining   836             611          54.45
BIST, Spartan-II XC2S100PQ208-5     Pipelining       1,110           1,670        87.54
BIST, Spartan-II XC2S100PQ208-5     Hybrid WP-P      836 a(188)      611 a(85)    75.75
SOC, Cyclone-II EP2C35F672C6        Non-pipelining   703             375          117.83
SOC, Cyclone-II EP2C35F672C6        Pipelining       782             671          203.92
SOC, Cyclone-II EP2C35F672C6        Hybrid WP-P      703 a(30)       375 a(8)     147.5
SOC, Spartan-III XC3S200FT256-4     Non-pipelining   897             730          67.9
SOC, Spartan-III XC3S200FT256-4     Pipelining       1,381           2,305        114.19
SOC, Spartan-III XC3S200FT256-4     Hybrid WP-P      897 a(32)       730 a(8)     82.6

1 slice = 2 LUTs
a Denotes additional overhead for testing WP circuits

The clock and skew are adjusted till a match occurs for at least three consecutive clock skews. The operating clock and clock skew of the wave-pipelined circuit are then fixed at the middle value, and from then on the custom block works without any intervention from the Nios II or MicroBlaze processor.
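The tuning loop described in this section can be sketched as follows. This is a minimal illustration, not the authors' program: `output_matches` is a hypothetical callback that stands for "apply these select inputs, run the custom block, upload the output block RAM and compare it with the expected result", and the clock settings are assumed to be ordered from the shortest to the longest period.

```python
def tune(num_clock_settings, num_skew_settings, output_matches, min_run=3):
    """Return (clock_sel, skew_sel) for the first clock setting that has at
    least `min_run` consecutive passing skews, or None if there is none.
    The returned skew is the middle of that passing run."""
    for clock_sel in range(num_clock_settings):
        passing = [output_matches(clock_sel, s) for s in range(num_skew_settings)]
        start = 0
        while start < num_skew_settings:
            if not passing[start]:
                start += 1
                continue
            end = start
            while end + 1 < num_skew_settings and passing[end + 1]:
                end += 1
            if end - start + 1 >= min_run:
                return clock_sel, (start + end) // 2
            start = end + 1
    return None


def mock_block(clock_sel, skew_sel):
    # Toy stand-in for the hardware: the three fastest clocks always fail,
    # slower clocks pass for skew settings 5 to 9.
    return clock_sel >= 3 and 5 <= skew_sel <= 9


print(tune(16, 16, mock_block))
```

Choosing the middle of the passing window, rather than its first entry, leaves margin on both sides against delay variations once the processor stops supervising the block.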

6.3.1 Implementation results on one level 2D DWT using Cyclone-II EP2C35F672C6

The one level 2D DWT is implemented on the Cyclone-II EP2C35F672C6 with and without pipelining. A single filter is implemented and time-shared for the computation of the outputs of both the horizontal and vertical filters. The 2D DWT block is added as a custom block to the Nios II CPU and downloaded to the Cyclone-II. The 2D DWT is also computed using the built-in instruction set of the Nios II [22]. The numbers of CPU clock cycles for both cases are tabulated in Table 4. (The clock frequency obtained using the above device is 40 MHz.)
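The cycle counts in Table 4 can be turned into a speed-up and into execution times with a few lines of arithmetic. The sketch assumes that the 40 MHz figure quoted above is the clock to which both cycle counts refer; the variable names are illustrative.

```python
CPU_CLOCK_HZ = 40e6        # clock frequency quoted in the text (assumed CPU clock)
SOFTWARE_CYCLES = 73_280   # Table 4, software approach
CUSTOM_CYCLES = 814        # Table 4, custom block

speedup = SOFTWARE_CYCLES / CUSTOM_CYCLES
software_time_us = SOFTWARE_CYCLES / CPU_CLOCK_HZ * 1e6
custom_time_us = CUSTOM_CYCLES / CPU_CLOCK_HZ * 1e6

print(f"speed-up: {speedup:.1f}x")
print(f"software: {software_time_us:.1f} us, custom block: {custom_time_us:.2f} us")
```

Under this assumption the custom block is roughly 90 times faster than the software routine, about 20 µs against about 1.8 ms.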

For the hybrid wave-pipelined circuit, the number of logic elements, number of registers, maximum operating frequency and power dissipated are computed using the Cyclone-II FPGA and the results are given in Table 2. It may be noted that the overhead required for the wave-pipelined circuit is about 4%. From Table 2, it may be concluded that, for the lifting scheme, the method using the hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.25. The scheme with the Baugh–Wooley pipelined constant coefficient multiplier is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.38, and this is achieved with an increase in the number of registers by a factor of 1.78 and an increase in the number of LEs by a factor of 1.11.

Pipelining may be used either for increasing the operating frequency of a circuit or for reducing the power dissipation [12]. Pipelining requires more registers and area, but this does not automatically lead to more power dissipation. In order to assess whether hybrid wave-pipelining is superior with regard to power dissipation, both the hybrid wave-pipelined and pipelined circuits are operated at the same frequency (corresponding to the maximum operating frequency of the hybrid wave-pipelined circuit) and the power dissipated by the two approaches is given in Table 5. From Table 5, it may be noted that the pipelined circuit dissipates 11% less power than the hybrid wave-pipelined 2D DWT.
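The 11% figure follows directly from the two entries of Table 5. Writing the tabulated values as P_pipelined and P_hybrid (symbols introduced here only for this check; the table does not state the unit), the relative saving is:

```latex
\[
\frac{P_{\text{hybrid}} - P_{\text{pipelined}}}{P_{\text{hybrid}}}
  = \frac{179.58 - 158.97}{179.58}
  = \frac{20.61}{179.58}
  \approx 0.115
\]
```

That is, a saving of between 11 and 12% relative to the hybrid circuit.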

6.3.2 Implementation results on one level 2D DWT using Spartan-III XC3S200

Implementation results for one level 2D DWT on the Xilinx Spartan-III XC3S200 using all three approaches are given in Table 2. The programmable clock and clock skew blocks are implemented as macro blocks using Xilinx ISE 8.1i Project Navigator. For tuning the hybrid wave-pipelined circuit, the MicroBlaze soft-core processor is used. Xilinx EDK software is used to integrate the custom block with the MicroBlaze processor. The rest of the steps are similar to those used for the Altera SOC kit. For all three schemes, the number of logic elements, number of registers and maximum operating frequency are computed and the results are given in Table 2. It may be noted that the overhead required for the wave-pipelined circuit is about 3.5%. It may also be noted that dedicated filters are used for the computation of the outputs of the horizontal and vertical filters. Hence, the area required for this scheme is higher than that required using the Cyclone-II device.

From Table 2, it may be concluded that, for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.21. The scheme with the Baugh–Wooley pipelined constant coefficient multiplier is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.38, and this is achieved with an increase in the number of registers by a factor of 3.15 and an increase in the number of LEs by a factor of 1.53.
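The speed and area factors quoted for the three devices can be recomputed from the entries of Table 2. The sketch below simply copies the tabulated values; the dictionary layout and key names are illustrative choices.

```python
# (slices or LEs, registers, speed in MHz) copied from Table 2;
# "overhead" is the extra slices/LEs used for testing the WP circuit.
TABLE_2 = {
    "Spartan-II (BIST)": {"non": (836, 611, 54.45), "pipe": (1110, 1670, 87.54),
                          "hybrid": (836, 611, 75.75), "overhead": 188},
    "Cyclone-II (SOC)": {"non": (703, 375, 117.83), "pipe": (782, 671, 203.92),
                         "hybrid": (703, 375, 147.5), "overhead": 30},
    "Spartan-III (SOC)": {"non": (897, 730, 67.9), "pipe": (1381, 2305, 114.19),
                          "hybrid": (897, 730, 82.6), "overhead": 32},
}


def summarize(row):
    """Ratios between the three schemes for one device."""
    non, pipe, hybrid = row["non"], row["pipe"], row["hybrid"]
    return {
        "hybrid_vs_non_speed": hybrid[2] / non[2],
        "pipe_vs_hybrid_speed": pipe[2] / hybrid[2],
        "pipe_vs_hybrid_registers": pipe[1] / hybrid[1],
        "pipe_vs_hybrid_area": pipe[0] / hybrid[0],
        "test_overhead_percent": 100 * row["overhead"] / hybrid[0],
    }


for device, row in TABLE_2.items():
    ratios = summarize(row)
    print(device, {name: round(value, 2) for name, value in ratios.items()})
```

The recomputed ratios agree with the factors stated in Sects. 6.2 and 6.3 to within rounding.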

6.4 Validation of the scheme for 2D DWT

Table 3 Area and speed performance of two level forward 2D DWT on Xilinx Spartan-II XC2S-200PQ208-5

Lifting scheme   Slices used   Speed (MHz)   Number of registers
Pipelined        1,511         61.42         2,506

Table 4 Computation time for 2D DWT

Function   Number of CPU clock cycles for software approach   Equivalent CPU clock cycles for custom block
2D DWT     73,280                                              814

Table 5 Power dissipated by pipelined and hybrid wave-pipelined one level 2D DWT at normalized frequency

Description of the circuit                                     Pipelined circuit   Hybrid circuit
Power at normalized frequency including additional overhead   158.97              179.58

To verify the correctness of the schemes proposed for the computation of 2D DWT, the Lena image of size 128 × 128 with blocks (sub-images) of size 32 × 32 pixels is used. The 128 × 128 image is shown in Fig. 16 and is obtained by subsampling the standard image of size 512 × 512 by a factor of four along both dimensions. As mentioned in Sect. 5.2, an overlap of four pixels is used between adjacent blocks. In total, 36 image blocks are used for the 128 × 128 image. The 2D DWT of the image is computed using both a software approach (a program in the high-level language C) and a hardware approach (FPGA). For the implementation in C, the lifting multiplier constants (α, β, γ, δ, ξ1, ξ2) and the filter coefficients for the distributed arithmetic algorithm are declared as “double” type (64-bit) variables. The pixel intensities are declared as “short” type (16-bit) variables. The analysis filter outputs obtained for the 36 image blocks are merged suitably and the LL1 component of the image is shown in Fig. 16.
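For orientation, the floating-point reference computation can be sketched for a single row of pixels. The code below is not the authors' C program: it is written in Python, it uses the widely quoted 9/7 lifting constants from the factorization in [7], and the scaling convention (ZETA and 1/ZETA) and the symmetric boundary extension are assumptions that may differ from the scaling constants used in the paper.

```python
# Widely quoted 9/7 lifting constants; assumed here, the paper's own
# (possibly quantized) values may differ.
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398   # assumed scaling constant


def lift_97_forward(samples):
    """One level of the 1D 9/7 transform; returns (low_pass, high_pass)."""
    if len(samples) < 2 or len(samples) % 2:
        raise ValueError("expects an even number of samples")
    s = [float(v) for v in samples[0::2]]   # even samples
    d = [float(v) for v in samples[1::2]]   # odd samples
    last = len(s) - 1

    def predict(coeff):
        # d[n] += coeff * (s[n] + s[n+1]), mirrored at the right edge
        for n in range(len(d)):
            d[n] += coeff * (s[n] + s[min(n + 1, last)])

    def update(coeff):
        # s[n] += coeff * (d[n-1] + d[n]), mirrored at the left edge
        for n in range(len(s)):
            s[n] += coeff * (d[max(n - 1, 0)] + d[n])

    predict(ALPHA)
    update(BETA)
    predict(GAMMA)
    update(DELTA)
    return [v * ZETA for v in s], [v / ZETA for v in d]


low, high = lift_97_forward([100] * 16)
print(max(abs(v) for v in high))   # close to zero for a constant input
print(low[0] / 100)                # DC gain of the low-pass branch
```

A 2D transform of a 32 × 32 block would apply such a routine to every row and then to every column of the result; a constant row is a convenient sanity check because its high-pass output should vanish.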

The forward 2D DWT for an image block of size 32 × 32 is implemented for the lifting scheme with BW-hybrid WP-PKCM. For the implementation, the Xilinx Spartan XC2S100PQ208-5 device is used. The block RAMs are configured suitably for storing the image input, the outputs of the horizontal filter and the outputs of the vertical filters. The image is loaded into the block RAMs through the UCF of the implementation tool. The one level 2D DWT is computed using the above scheme for all the 36 image blocks and the results are merged suitably. The LL1 component of the image is shown in Fig. 16. From these figures, it may be concluded that the LL1 component obtained through the FPGA implementation matches well with that obtained using C. The LL1 components also match well with the original image.

In order to make a quantitative comparison of the LL1 component with the original image, the original Lena image is subsampled to a size of 64 × 64. Treating the LL1 component itself as the compressed image, the PSNRs of the compressed images obtained using BW-hybrid WPKCM and C are computed and are found to be 28.22 and 33.33, respectively.
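For reference, a PSNR comparison of this kind can be sketched as below. The peak value of 255 (8-bit pixels) and the assumption that the LL1 component has already been rescaled to the pixel range of the subsampled image are choices made for this illustration; the text does not state how the values were normalized.

```python
import math


def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized images given as lists of rows."""
    if len(reference) != len(test) or any(
        len(r) != len(t) for r, t in zip(reference, test)
    ):
        raise ValueError("images must have the same shape")
    squared_error = sum(
        (a - b) ** 2 for r, t in zip(reference, test) for a, b in zip(r, t)
    )
    pixel_count = sum(len(r) for r in reference)
    mse = squared_error / pixel_count
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


original = [[10, 20], [30, 40]]
approx = [[11, 21], [31, 41]]      # every pixel off by one grey level
print(round(psnr(original, approx), 2))
```

A uniform error of one grey level gives a mean squared error of 1 and hence a PSNR of 20·log10(255), about 48.13 dB, which is a quick check that the routine is scaled correctly.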

7 Conclusion

Two automation schemes are proposed in this paper for the implementation of the 9/7 bi-orthogonal filters using the hybrid WP-P constant coefficient multiplier with the Baugh–Wooley multiplication algorithm. Nios II and MicroBlaze soft-core processors are integrated with the 2D DWT blocks successfully and the optimum clock period and clock skews for the 2D DWT blocks are selected using them. After this initial selection, the 2D DWT blocks work satisfactorily without any external intervention and the processors are free to do other tasks. The 9/7 bi-orthogonal filters are implemented on both Xilinx and Altera devices using the lifting scheme with the following three multipliers: BW-PKCM, BW-KCM and hybrid WP-P BW-KCM. From the implementation results, it is verified that hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM. The scheme with BW-PKCM is in turn faster than the hybrid WP-P BW-KCM, and this is achieved with an increase in the number of registers and in the number of LEs. The custom instruction for 2D DWT is found to be faster than the implementation using C. The correctness of the procedure for the computation of the 2D DWT of an image using the 2D DWT of sub-images is verified by computing the 2D DWT using both hardware and software approaches (using C) and displaying the LL1 components for an image of size 128 × 128. The automation schemes proposed in this paper have also been successfully employed in [23] for the implementation of wave-pipelined filters using the distributed arithmetic algorithm and a sine wave generator using CORDIC. The work on the computation of multi level 2D DWT and real time computation of 2D DWT using the hybrid scheme is in progress.

One of the challenges in the design of FPGA based wave-pipelined circuits is the accurate modeling of the interconnect and device delays and their temperature dependence. In the absence of these models, the wave-pipelined circuits can only be operated at moderate speeds.

Fig. 16 LL1 component compared with input image

References

1. Xilinx documentation library, Xilinx Corporation, USA
2. Altera documentation library 2003, Altera Corporation, USA
3. Sheldon, D., Kumar, R., Vahid, F., Tullsen, D., Lysecky, R.: Conjoining soft-core FPGA processors. In: IEEE/ACM International Conference on Computer Aided Design, 2006, ICCAD, pp. 694–701 (2006)
4. Draper, B.A., Beveridge, J.R., Willem Bohm, A.P., Ross, C., Chawathe, M.: Accelerated image processing on FPGAs. IEEE Trans. Image. Process. 12(12), 1543–1551 (2003)
5. Ritter, J., Molitor, P.: A pipelined architecture for partitioned DWT based lossy image compression using FPGA’s. In: Proceedings ACM Conference FPGA 2001, pp. 201–206 (2001)
6. Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K., Sriram, G.: Design and FPGA implementation of image block encoders with 2D-DWT. Proceedings TENCON 2003. 3, 1015–1019 (2003)
7. Daubechies, I., Sweldens, W.: Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl. 4, 247–269 (1998)
8. Nyathi, J., Delgado-Frias, J.G.: A hybrid wave-pipelined network router. IEEE Trans. Circuits. Syst-I, Fundam. Theory. Appl. 49(12), 1764–1772 (2002)
9. Hauck, O., Katoch, A., Huss, S.A.: VLSI system design using asynchronous wave pipelines: a 0.35 μm CMOS 1.5 GHz elliptic curve public key cryptosystem chip. In: Proceeding of sixth international symposium on advanced research in asynchronous circuits and systems 2000 (ASYNC 2000), pp. 188–197 (2000)
10. Burleson, W.P., Ciesielski, M., Klass, F., Liu, W.: Wave-pipelining: a tutorial and research survey. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 6(3), 464–474 (1998)
11. Gray, T.C., Liu, W., Cavin, R.K., III.: Wave-Pipelining: Theory and CMOS Implementation. Kluwer, Boston (1994)
12. Parhi, K.K.: VLSI Signal Processing Systems. Wiley, New York (1999)
13. Boemo, E.I., Lopez-Buedo, S., Meneses, J.M.: Wave-pipelines via look-up tables. IEEE Int. Symp. Circuits Syst. (ISCAS ’1996). 4, 85–88 (1996)
14. Lakshminarayanan, G., Venkataramani, B.: Optimization techniques for FPGA based wave-pipelined DSP blocks. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 13(7), 783–793 (2005)
15. Acharya, T., Tsai, P.-S.: JPEG2000 Standard for Image Compression Concepts Algorithms and VLSI Architectures. Wiley, New York (2005)
16. Sayood, K.: Introduction to Data Compression. Morgan Kaufmann, Menlo Park (2000). An Imprint of Elsevier
17. Smith, M.J.S.: Application Specific Integrated Circuits. Pearson Education Asia Pvt. Ltd, Singapore (2003)
18. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of self tuned wave-pipelined filters. IETE J. Res. 524, 281–286 (2006)
19. Design and FPGA implementation of wave-pipelined image block encoders using 2D-DWT. In: Proceedings of VLSI design and test symposium VDAT 2005, pp. 12–20 (2005)
20. Design and FPGA implementation of wave-pipelined distributed arithmetic based filters. In: Proceedings of VLSI Design and Test workshop VDAT 2004, pp. 216–220 (2004)
21. Amudha, V., Venkataramani, B., Vinoth Kumar, R., Ravishankar, S.: SOC Implementation of HMM Based Speaker Independent Isolated Digit Recognition System. In: 20th International IEEE Conference on VLSI Design (VLSID’07), pp. 1–6 (2007)
22. Seetharaman, G., Venkataramani, B., Amudha, V., Saundattikar, A.: System on chip implementation of 2D DWT using lifting scheme. In: Proceedings of the International Asia and South Pacific Conference on Embedded SOCs (ASPICES 2005), (2005)
23. Seetharaman, G., Venkataramani, B.: SOC implementation of wave-pipelined circuits. Proceedings of IEEE International conference on Field Programmable Technology 2007 (ICFPT 2007), pp. 9–16 (2007)

Author Biographies

G. Seetharaman received his B.E. and M.E. degrees in Electronics and Communication Engineering from Regional Engineering College, Tiruchirappalli in 1997 and 2002, respectively. Presently, he is carrying out his doctoral thesis work in the National Institute of Technology, Tiruchirappalli. Previously, he worked as faculty in the Jayaram College of Engineering and Technology, Tiruchirappalli, for 6 years and as Research Associate for three semesters in the National Institute of Technology, Tiruchirappalli. Presently, he is working as Laboratory Engineer in National Institute of Technology, Tiruchirappalli. His current research interests include embedded system design using field-programmable gate arrays (FPGAs) and system-on-chip (SOC).

B. Venkataramani received his B.E. degree in Electronics and Communication Engineering from Regional Engineering College, Tiruchirappalli in 1979 and M.Tech. and Ph.D. degrees in Electrical Engineering from Indian Institute of Technology, Kanpur in 1984 and 1996, respectively. He worked as Deputy Engineer in Bharat Electronics Limited, Bangalore, India, and as a research Engineer in Indian Institute of Technology, Kanpur, each for approximately 3 years. Since 1987 he has been a faculty member of National Institute of Technology (formerly Regional Engineering College), Tiruchirappalli. Presently, he is working as Professor and Head of the Department of Electronics and Communication in National Institute of Technology. He has published two books and numerous papers in journals and international conferences. His current research interests include FPGA applications and SOC based system design and performance analysis of high speed computer networks.

G. Lakshminarayanan received his M.E. and Ph.D. degrees in Electronics and Communication Engineering from Bharathidasan University, Tiruchirappalli in 1995 and 2005, respectively. He previously worked as a Service Engineer for 5 years and as a scientist and Research Associate for 4 years in Regional Engineering College, Tiruchirappalli. He was a faculty member in SASTRA, Tanjore, for two semesters and an Assistant Professor in Saranathan College of Engineering, Tiruchirappalli for 1 year. Presently he is working as Assistant Professor in National Institute of Technology, Tiruchirappalli. His current research interests include FPGA based system design and VLSI front end design.

