SPECIAL ISSUE
Automation techniques for implementation of hybrid wave-pipelined 2D DWT
G. Seetharaman · B. Venkataramani · G. Lakshminarayanan
Received: 12 July 2007 / Accepted: 19 May 2008 / Published online: 10 June 2008
© Springer-Verlag 2008
Abstract In the literature, techniques such as pipelining
and wave-pipelining (WP) are proposed for increasing the
operating frequency of a digital circuit. In general, use of
pipelining results in higher speed at the cost of increase in
the area and clock routing complexity. On the other hand,
use of WP results in less clock routing complexity and less
area but enables the digital circuit to be operated only at
moderate speeds. In this paper, a hybrid wave-pipelining
scheme is proposed to get the benefits of both pipelining
and WP techniques. Major contributions of this paper are:
proposal for the implementation of 2D DWT using lifting
scheme by adopting the hybrid wave-pipelining and pro-
posal for the automation of the choice of clock frequency
and clock skew between the input and output registers of
wave-pipelined circuit using built in self test (BIST) and
system-on-chip (SOC) approaches. In the hybrid scheme,
different lifting blocks are interconnected using pipelining
registers and the individual blocks are implemented using
WP. For the purpose of evaluating the superiority of the
schemes proposed in this paper, the system for the com-
putation of one level 2D DWT is implemented using the
following techniques: pipelining, non-pipelining and
hybrid wave-pipelining. The BIST approach is used for the
implementation on Xilinx Spartan-II device. The SOC
approach is adopted for implementation on Altera and
Xilinx field programmable gate arrays (FPGAs) based SOC
kits with Nios II or MicroBlaze soft-core processors. From
the implementation results, it is verified that the hybrid WP
circuit is faster than non-pipelined circuit by a factor of
1.25–1.39. The pipelined circuit is in turn faster than the
hybrid wave-pipelined circuit by a factor of 1.15–1.38 and
this is achieved with the increase in the number of registers
by a factor of 1.79–3.15 and increase in the number of LEs
by a factor of 1.11–1.65. The soft-core processor based
automation scheme has considerably reduced the effort
required for the design and testing of the hybrid wave-pipelined
circuit. The techniques proposed in this paper
are also applicable for ASICs. The optimization schemes
proposed in this paper are also applicable for the computation
of other image transforms such as DCT and DHT.
Keywords DWT · Lifting · SOC · Wave-pipelining · Pipelining · Self test
G. Seetharaman (✉) · B. Venkataramani · G. Lakshminarayanan
Department of ECE, National Institute of Technology,
Tiruchirappalli, India
e-mail: [email protected]
B. Venkataramani
e-mail: [email protected]
G. Lakshminarayanan
e-mail: [email protected]
J Real-Time Image Proc (2008) 3:217–229
DOI 10.1007/s11554-008-0087-8
1 Introduction
Programmable logic devices such as FPGAs offer an
alternative solution for the computationally intensive
functions performed traditionally by digital signal processors
with Harvard architecture. The ability to design,
fabricate and test application specific integrated circuits
(ASICs) as well as FPGAs with gate count of the order of a
few tens of millions, has led to the development of complex
embedded system-on-chip. The development of
intellectual property (IP) cores for the FPGAs for a variety
of standard functions including processors enables a multimillion
gate FPGA to be configured to contain all the
components of a complete system. Development tools from
FPGA vendors such as Altera or Xilinx enable the
integration of IP cores and the user designed custom blocks
with soft-core processors such as the MicroBlaze or
Nios II processors [1, 2]. Systems designed by integrating
IP cores and user-designed custom blocks with soft-core
processors are far more flexible than those built around
hard-core processors, and they can be enhanced with custom
hardware to optimize them for a specific application [3].
The increased performance available with SOC based
FPGAs makes them quite suited for implementation of area
as well as speed intensive image processing applications
such as discrete cosine transform (DCT) and discrete
wavelet transform (DWT). For example, the study in [4]
shows that an FPGA based image processing system is
8–800 times faster than one using a Pentium III
processor.
For image processing applications, in addition to DCT,
wavelet transform is increasingly used. It is a part of the
joint photographic experts group (JPEG) 2000 standard for
still image compression. The VLSI implementation of
image encoders with DWT has been addressed in a number
of previous works. The implementation of 2D DWT using
the lifting scheme and compression using the EZT algorithm
is reported in [5], taking advantage of the flexible memory
configuration available in FPGAs. In [5], the image is
partitioned into sub-images of size 32 × 32 and external
memory is used for storing the sub-images and the
transform coefficients.
Block RAMs in FPGAs are proposed for storing the sub-
images and 2D DWT coefficients in [6]. A new multiplier
algorithm denoted as Baugh–Wooley pipelined constant
coefficient multiplier (BW-PKCM) which combines the
KCM with Baugh–Wooley multiplication algorithm is
proposed and used for the study and comparison of dis-
tributed arithmetic algorithm and lifting scheme [7] for 2D
DWT on FPGAs in [6].
Even though pipelining is adopted for high speed
applications such as that in [6], pipelined systems have a
number of disadvantages such as increase of power dissi-
pation, clock routing complexity and clock skews between
different parts of the system. Wave-pipelining is one of the
circuit design techniques proposed for achieving high speed
without the above limitations. A wave-pipelined circuit
dispenses with the need for registers for storing the
intermediate results and instead uses the inherent
capacitance at the input to the various combinatorial
blocks. A number of systems have been implemented
using wave-pipelining on ASICs and FPGAs [8, 9]. The
concept of wave-pipelining has been described in a number
of previous works [10–14]. One of the limitations of the
wave-pipelined circuits is that their highest operating fre-
quency reduces with the complexity of the circuit or
equivalently the logic depth [14]. In order to combine the
advantages of both pipelining and wave-pipelining a hybrid
scheme is proposed in this paper. A complex circuit is split
into a number of smaller circuits and is pipelined. Each of
the smaller circuits is realized using wave-pipelining.
The organization of the rest of the paper is as follows: in
Sect. 2, the review of previous work on lifting based 2D
DWT with BW multiplier is described. In Sect. 3, the
previous work related to wave-pipelining and the chal-
lenges involved in the design of wave-pipelined circuits are
described. In Sect. 4, automation schemes for wave-pipe-
lined circuits are presented. In Sect. 5, the architecture
used and assumptions made for the implementation of the
2D DWT are presented. In Sect. 6, the implementation
results of the pipelined and hybrid wave-pipelined 2D
DWT are presented. Section 7 summarizes the conclusions.
2 Review of previous work on lifting based 2D DWT
with BW multiplier
The hybrid scheme is proposed to be used for the com-
putation of 2D DWT. The DWT decomposes a signal into
different sub-bands so that the lower frequency sub-bands
have finer frequency resolution and coarser time resolution
compared to the higher frequency sub-bands. A survey of
VLSI architectures for the computation of 2D DWT is
given in [15]. The 2D DWT may be computed using filter
banks. Figure 1 shows how an N × M image can be
decomposed using sub-band decomposition for one level
2D DWT. The samples corresponding to the image pixels
are passed through two stages of analysis filters. The elements
of the pixel matrix are read row wise and are first
processed by the low pass h[n] and high pass g[n] horizontal
filters. The transform coefficient matrices are then
sub-sampled by two along the rows to obtain two N × M/2
matrices L1 and H1. Subsequently, the outputs (L1, H1) are
processed by low pass and high pass vertical filters to
obtain four N/2 × M/2 transform coefficient matrices. Out
of these four matrices denoted as LL1, LH1, HH1 and HL1,
respectively, LL1 represents a coarse approximation of the
original image [15, 16].
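The row-then-column structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the average/difference pair in `analysis_1d` is a Haar-like stand-in for h[n] and g[n] (not the 9/7 filters used in this paper), the function names are hypothetical, and the labelling of the mixed bands (LH1 versus HL1) follows one common convention.

```python
def analysis_1d(seq):
    """Return (low, high) for one row or column.

    A Haar-like average/difference pair stands in for h[n] and g[n];
    taking every second position performs the sub-sampling by two.
    """
    low = [(seq[i] + seq[i + 1]) / 2 for i in range(0, len(seq) - 1, 2)]
    high = [(seq[i] - seq[i + 1]) / 2 for i in range(0, len(seq) - 1, 2)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_columns(matrix):
    """Apply analysis_1d to every column and return (low, high) matrices."""
    lows, highs = [], []
    for col in transpose(matrix):
        lo, hi = analysis_1d(col)
        lows.append(lo)
        highs.append(hi)
    return transpose(lows), transpose(highs)


def dwt2_one_level(image):
    """One-level 2D decomposition of an N x M image (N and M even)."""
    # Horizontal stage: each row gives one row of L1 and one row of H1 (N x M/2).
    l1, h1 = [], []
    for row in image:
        lo, hi = analysis_1d(row)
        l1.append(lo)
        h1.append(hi)
    # Vertical stage: each of L1 and H1 gives two N/2 x M/2 matrices.
    ll1, lh1 = filter_columns(l1)
    hl1, hh1 = filter_columns(h1)
    return ll1, lh1, hl1, hh1
```

For a constant image the three detail bands are zero and LL1 reproduces the constant, which is a quick sanity check on the data flow.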
For the two level 2D DWT, LL1 component is pro-
cessed by both horizontal and vertical filters and sub-
sampled to obtain four more matrices LL2, LH2, HH2 and
HL2. This process is continued until the desired level of
sub-band structure is obtained. The horizontal and vertical
filters shown in Fig. 1 may be implemented by adopting the
lifting scheme [7] which uses a factorization scheme for
the poly-phase matrix corresponding to the analysis filter.
The main feature of lifting based DWT scheme is to break
up the high pass and low pass wavelet filters into a
sequence of smaller filters. This scheme requires about
50% less computational complexity compared to that using
the convolution-based approach [7]. It has other advantages,
including "in-place" computation of the DWT, integer
to integer wavelet transform and symmetric hardware
architecture for the computation of both forward and
inverse transform [15].
In the lifting scheme for a filter bank with the low pass
and high pass filters of nine and seven taps, respectively,
the odd and even input samples are processed by five lifting
blocks [a, b, c, d, n (n1, n2)] in cascade as shown in Fig. 2.
n1, n2 are scaling blocks.
The internal diagrams of the a and b blocks are shown in
Figs. 3 and 4. The c and d blocks are obtained by replacing
the constants a, b with c, d. In Figs. 3 and 4, since the
output from one block is fed as the input to the next block,
the maximum rate at which the input can be fed to the
system depends on the sum of the delays in all the four
stages. The speed may be increased by introducing pipe-
lining at the points indicated by dotted lines in Figs. 3 and
4. In this case, the input rate is determined by the largest
delay among all the four blocks.
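As a software illustration of the cascade of lifting blocks, the following Python sketch applies the four lifting steps and the final scaling to a 1D signal and then inverts them. It is not the hardware architecture of this paper: the coefficient values are the commonly published floating-point 9/7 lifting constants, the periodic (wrap-around) boundary handling and the scaling convention (multiply the low band by ZETA, divide the high band by ZETA) are assumptions, and all names are hypothetical.

```python
# Commonly published lifting constants for the 9/7 filter pair (assumed here).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def forward_97(x):
    """Forward 1D lifting transform of an even-length sequence.

    Returns (low, high). Indices wrap around (periodic extension).
    """
    s, d = list(x[0::2]), list(x[1::2])  # even and odd samples
    n = len(s)
    d = [d[i] + ALPHA * (s[i] + s[(i + 1) % n]) for i in range(n)]  # a block
    s = [s[i] + BETA * (d[i - 1] + d[i]) for i in range(n)]         # b block
    d = [d[i] + GAMMA * (s[i] + s[(i + 1) % n]) for i in range(n)]  # c block
    s = [s[i] + DELTA * (d[i - 1] + d[i]) for i in range(n)]        # d block
    return [v * ZETA for v in s], [v / ZETA for v in d]             # scaling


def inverse_97(low, high):
    """Undo forward_97 by running the same steps backwards with opposite signs."""
    s = [v / ZETA for v in low]
    d = [v * ZETA for v in high]
    n = len(s)
    s = [s[i] - DELTA * (d[i - 1] + d[i]) for i in range(n)]
    d = [d[i] - GAMMA * (s[i] + s[(i + 1) % n]) for i in range(n)]
    s = [s[i] - BETA * (d[i - 1] + d[i]) for i in range(n)]
    d = [d[i] - ALPHA * (s[i] + s[(i + 1) % n]) for i in range(n)]
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = s, d
    return x
```

Because each step only adds a function of the other stream, every step is trivially invertible, which is why the same structure serves both the forward and the inverse transform. The data dependence between consecutive steps is also what makes the unpipelined input rate depend on the sum of the block delays.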
The delay in the individual stages may be reduced fur-
ther by using constant coefficient multiplier (KCM) which
uses a look up table (LUT) for finding the product of a
constant and a variable. The variable is fed as address to
the LUT, which contains the products corresponding to all
possible combinations of the operands. FPGAs normally
contain four input LUTs. When an LUT with more
inputs is required, it has to be implemented using a
number of stages of four input LUTs and adders. For
example, a 12 × 12 bit KCM is implemented using three
4 × 12 bit KCMs and two stages of 16 bit adders. The speed
of the KCM can be increased by introducing the pipelining
registers at the outputs of LUTs and adders.
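The decomposition of a wide KCM into four-input LUTs and adders can be illustrated as follows. This is a minimal behavioural sketch for unsigned operands only (the signed LUT contents discussed next are not modelled); the function names and the example constant are hypothetical.

```python
def build_kcm_luts(constant, slices=3):
    """One 16-entry table per 4-bit slice of the variable operand.

    Entry a of each table holds constant * a, i.e. the product of the
    constant with every possible 4-bit address.
    """
    return [[constant * a for a in range(16)] for _ in range(slices)]


def kcm_multiply(luts, x):
    """Multiply a 12-bit unsigned x by the constant baked into the tables."""
    # Each 4-bit slice of x addresses its own table.
    partial = [lut[(x >> (4 * k)) & 0xF] for k, lut in enumerate(luts)]
    # First adder stage: low and middle partial products (middle weighted by 2^4).
    stage1 = partial[0] + (partial[1] << 4)
    # Second adder stage: add the high partial product (weighted by 2^8).
    return stage1 + (partial[2] << 8)
```

For example, `kcm_multiply(build_kcm_luts(1897), 100)` gives the same value as `1897 * 100`. In a pipelined version, registers would sit after the table look-up and after each adder stage.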
The content of the LUT corresponding to multiplication
of signed numbers can be computed using three approaches:
(a) assuming unsigned multiplication and 2's
complement blocks (the resulting multiplier is referred to as
the conventional 2's complement multiplier (C2CM)); (b)
using sign extension; (c) using the Baugh–Wooley (BW) multiplier.
Fig. 1 Sub-band decomposition of an N × M image
Fig. 2 Simplified block diagram of lifting scheme for 9/7 filter
Fig. 3 a Block
Fig. 4 b Block
The pipelined constant coefficient multiplier (PKCM)
using the BW content is referred to as BW-PKCM and it is
shown to be superior compared to the other two approaches
[6]. Hence, only this multiplier is considered for wave-pipelining
in this paper. The detailed diagram of the a block
implemented using BW-PKCM is shown in Fig. 5.
The same scheme can be adopted for the b, c, d, n1, n2
blocks. The dotted line indicates points where registers
may be inserted for pipelining. For wave-pipelining all the
stages are directly connected without registers. The regis-
ters are used only at the inputs and outputs. In hybrid wave-
pipelining, registers are used between adjacent lifting
blocks and the individual lifting blocks are connected
without registers.
3 Review of previous work on wave-pipelining
In this section, the technique used for wave-pipelining the
a block in Fig. 5 is considered. An RTL model of a circuit
consists of a combinational logic circuit placed between
input and output registers. The combinational logic circuit
may be considered to be a wave-pipelined circuit if a
number of waves are made to simultaneously propagate
through it as shown in Fig. 6 [10]. In other words, at any
point of time, a sequence of data is processed in the
combinational logic block. In the case of pipelining, only
one data item is processed in the combinational logic block at a
time. Further, the maximum data rate in the pipelined
circuit depends only on Dmax, the maximum propagation
delay in the combinational logic block. Figure 7 shows
temporal/spatial diagram of combinational logic circuits
[11]. If Dmin denotes the minimum propagation delay of the
signal through the combinational logic block, the maxi-
mum data rate of the wave-pipelined circuit depends on
(Dmax–Dmin).
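The difference between the two timing limits can be made concrete with a small calculation. The sketch below uses the idealized bounds implied by the text (clock period at least Dmax for a registered stage, at least Dmax − Dmin for wave-pipelining) and deliberately ignores register setup and hold times, clock uncertainty and the latching-instant constraint, all of which tighten the bound in practice. The function names are hypothetical; the delay values are those quoted for the a block in Sect. 6.1.

```python
def min_period_pipelined(d_max):
    """Idealized lower bound on the clock period of a registered stage (ns)."""
    return d_max


def min_period_wave_pipelined(d_max, d_min):
    """Idealized lower bound on the clock period with wave-pipelining (ns)."""
    return d_max - d_min


# Delays quoted for the a block in Sect. 6.1 (ns).
D_MAX, D_MIN = 15.302, 7.34
print(min_period_pipelined(D_MAX))              # bound set by Dmax alone
print(min_period_wave_pipelined(D_MAX, D_MIN))  # bound set by the delay spread
```

The closer Dmin is brought to Dmax, the smaller the second bound becomes, which is why delay equalization is the first tuning task discussed below.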
In the case of the a block, Dmax corresponds to the processing
and propagation delay between the even samples
and a0 output (this involves three adder delays, one LUT
delay and four interconnect delays); Dmin corresponds to
the processing and propagation delay between the odd
samples and a0 output (this involves two adder delays and
two interconnect delays).
Traditionally, in a wave-pipelined circuit, higher speeds
are achieved by equalizing the Dmax and Dmin [10]. The
output of the wave-pipelined circuit alternates between
unstable and stable states. The stable period decreases with
the increase in the logic depth. By adjusting the latching
instant at the output register to lie in the stable period, the
wave-pipelined circuit can be made to work properly. But,
for large logic depths, there may not be any stable period.
Hence, adjusting the latching instant by itself may not be
adequate for storing the correct result at the output register.
For such cases, the clock period has to be increased to
increase the stable period. Equalization of path delays,
adjustment of the clock period and clock skew are the three
tasks carried out for maximizing the operating speed of the
wave-pipelined circuit. All the three tasks require the
delays to be measured and altered if required. Layout
editors, such as FPGA editor from Xilinx, or Floor planner
from Altera may be used for this purpose.
These tasks are carried out manually in [13, 14]. The
wave-pipelined circuit designed using the layout editor
may be tested using simulation. However, the simulation is
inadequate for testing due to the difference between the
actual delays and the delays calculated by the layout editor.
Fig. 5 a Block using BW-PKCM
Fig. 6 Multiple coherent waves of data sent through combinational logic acting as pipeline in WP
Fig. 7 Temporal/spatial diagram of combinational logic circuits
This is because the layout editor considers only the worst
case delays and the actual delays may be significantly
different due to fabrication variations. This difference
becomes important as the logic depth of the circuit
increases. Hence, the design is downloaded to the actual
FPGA and its operation is checked using a personal com-
puter (PC) based test system in [14]. If correct results are
not obtained, delays are altered and the design is down-
loaded for testing again. A number of iterations of place
and route, simulation, downloading and testing in the
actual device may be required till the correct results are
obtained. The design of wave-pipelined circuit in this
fashion requires human intervention and is time consuming.
Automation of the above three tasks is considered in
the next section.
4 Automation schemes for wave-pipelined circuits
Equalization of the path delays of the combinational logic
blocks such as the a block is considered first. This cannot
be completely automated as the commercially available
synthesis tools do not support the specification of inter-
connect delays. However, the difference in path delays can
be minimized by specifying the physical location of logic
cells (referred to as slices in Xilinx FPGAs) or logic ele-
ments used for the implementation, through either the user
constraints file (UCF) or the Logic lock feature supported
by the FPGA CAD tools [2, 3]. UCF approach is proposed
for Xilinx FPGAs in [14]. The logic lock feature is adopted
for the Altera FPGAs in this paper.
The adjustment of the clock skew and clock period can
be automated by using programmable clock, skew gener-
ator and a processor. Clock generator using LUTs and
interconnects (nets) is proposed for the first time in [14].
(The LUTs are programmed as non-inverting buffers). The
interconnects are manually chosen using the FPGA layout
editor in [14]. The programmable clock is proposed in this
paper using multiplexers in addition to the LUTs and nets
as shown in Fig. 8. The interconnect delays are selected
using the multiplexer. The number of possible interconnect
delays (Di) is restricted to minimize the overheads due to
the additional LUTs required for the introduction of the
delay and the multiplexers. Hence, only a coarse variation
in the delay values can be achieved.
Using the manual routing, much smaller variations in
the delay may be achieved. In Fig. 8, inputs C0–C3 are the
programmable select inputs, which determine the actual
clock frequency. The diagram of programmable clock skew
circuit is given in Fig. 9. In Fig. 9, Di denotes the ith
interconnect (net) specifically introduced to vary the delay.
In addition to this, there are interconnects between the
output of one multiplexer and the input of another multi-
plexer and also between the LUTs. But their delay values
are not controlled by the program. The select inputs S0–S3
are the programmable delay inputs.
The clocks required for the wave-pipelined circuit may
also be derived using the internal system clock generator of
Altera and Xilinx system-on-programmable chip (SOPC)
devices. The maximum operating frequency in this case is
limited by the system bus. Alternately, an external clock
may be multiplied by an arbitrary number using the Altera
mega core function altclkclock or pllclock. Similarly, in the
case of Xilinx Spartan-III family FPGAs, the delay-locked
loop (DLL)/digital clock manager (DCM) module may be
used for clock multiplication. However, the multiplication
factor has to be specified at the synthesis time and hence
the clock frequency cannot be dynamically altered as in the
scheme given in Fig. 8.
Fig. 8 Programmable clock generator
Fig. 9 Programmable clock skew circuit
The circuit using the programmable clock and skew
generator is a suboptimal wave-pipelined circuit but can
operate at a higher frequency than that reported by the
commercially available synthesis tools, which use Dmax for
fixing the operating frequency. The clock and skew generator
may be programmed using either an off-chip processor
or an on-chip processor. In order to minimize the time
required for adjustment of the parameters of the wave-
pipelined circuit (clock frequency and skew), the BIST
approach for design for testability [17, 18] may be used. In
the BIST approach, a finite state machine (FSM) is
assumed to be available off-chip and is used for adjustment
of the parameters of the wave-pipelined circuit [19, 20]. In
the SOC approach, a processor is assumed to be available
on-chip and it is used for adjustment of the parameters of
the wave-pipelined circuit.
4.1 BIST approach for wave-pipelined circuit
Testing a large chip requires a large test sequence and
application of these test sequences to the circuit under test
(CUT) using external testers is time consuming. Built in
self test scheme is an alternative for minimizing the testing
time. In the BIST scheme, the test sequences are internally
generated, applied to the CUT at full speed and a signature
is generated for finding whether it is good or bad. The
block diagram of a wave-pipelined circuit with BIST is
given in Fig. 10. This is obtained by adding the FSM
block and a self-test circuit to the wave-pipelined circuit.
The self-test circuit contains the programmable clock,
clock skew generator, signature analyzer and test vector
RAM.
4.1.1 FSM block
The flow chart given in Fig. 11 describes the function
performed by the FSM. In Fig. 11, {Ti: i = 0, 1, 2, …, N − 1}
and {dj: j = 0, 1, 2, …, M − 1} denote the sets of clock
periods and clock skews, respectively. The FSM block
generates the control signal to choose between the normal
mode and the self test mode and this is applied to the select
input of multiplexer. In the self test mode, the FSM sys-
tematically varies the clock skews and clock periods. For
each clock frequency and skew, the self test circuit gen-
erates the test inputs, applies them, generates the signature,
compares it with the expected result and finally generates a
flag indicating the match. The FSM continues testing
until it finds a frequency at which the circuit under test
works for at least three skew values. The
operating skew value is chosen to be the middle value so
that the CUT would reliably work even if the delays
change due to environmental conditions. For example, in
Fig. 7, when the skew is chosen so that it corresponds to
either t2 or t20, the circuit would reliably work during its
entire life time. In order to minimize the time required to
determine the correct value of clock skew and clock per-
iod, a two step procedure is adopted. The clock frequencies
are varied by large steps to determine the range of fre-
quency in which the circuit works. This is achieved by
varying only the higher order two bits of the select inputs
of the programmable clock. After the range is determined,
fine tuning is achieved by varying the lower order bits. For
every frequency at which the circuit is tested, the clock
skews are varied gradually, the results are checked for
correctness and the clock skews for which the circuit
works satisfactorily are noted. The testing time can be
minimized by using the optimal test vector set and a sig-
nature analyzer [17].
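The two-step search performed by the FSM can be sketched in software as follows. This is one possible reading of the procedure, not the FSM of Fig. 11 itself. The assumptions are: a 4-bit clock select code in which a larger code means a longer clock period, a circuit that keeps working once the clock is slow enough, a requirement that the working skew settings be consecutive, and a hypothetical callback `passes(clock_code, skew_code)` that runs the self test and returns the match flag.

```python
def passing_run(passes, clock_code, n_skews):
    """Longest run of consecutive skew codes for which the self test passes."""
    best, run = [], []
    for skew in range(n_skews):
        if passes(clock_code, skew):
            run.append(skew)
            if len(run) > len(best):
                best = list(run)
        else:
            run = []
    return best


def tune(passes, n_skews=16, min_run=3):
    """Return (clock_code, skew_code) for the fastest reliable setting, or None."""
    # Coarse step: vary only the two higher-order select bits and test the
    # slowest setting of each group to locate the working frequency range.
    for high in range(4):
        slowest_in_group = (high << 2) | 0b11
        if len(passing_run(passes, slowest_in_group, n_skews)) >= min_run:
            break
    else:
        return None  # no clock setting works for enough skew values
    # Fine step: vary the two lower-order bits inside the selected group.
    for low in range(4):
        code = (high << 2) | low
        run = passing_run(passes, code, n_skews)
        if len(run) >= min_run:
            # The middle skew leaves margin on both sides for delay drift.
            return code, run[len(run) // 2]
    return None
```

With this ordering at most 4 + 4 clock settings are tried instead of all 16, which reflects the reduction in tuning time that the two-step procedure aims for.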
4.1.2 Signature generator
For testing the correctness of the circuit, N test vectors may
be fed one after another and the N outputs obtained should
be compared with the expected outputs. In order to minimize
the number of comparisons, a unique signature is
generated out of the N outputs and it is compared with the
signature corresponding to the expected outputs. The signature
generator consists of a pseudo random binary
sequence (PRBS) generator with multiple data input [17] as
shown in Fig. 12. The successive output of the output
register is XOR'ed with the state of the PRBS to generate
the next state.
Fig. 10 Self-tuned wave-pipelined circuit
If the test vector set consists of N vectors, the PRBS
generator output contains the signature after application of
N clock pulses. However, due to the propagation delay in
the random access memory (RAM), I/O registers and the
combinational logic block, the time at which signature
generation begins should be delayed with respect to the time
at which the application of test vectors begins. The delay
depends on the depth of the combinational logic blocks.
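A behavioural sketch of such a signature generator is given below. It models a multiple-input signature register in which the register is advanced as a PRBS generator and each output word is then XOR'ed into the state. The 11-bit width matches the coefficient word length assumed later in this paper, but the feedback taps and the `skip` parameter (which models the delayed start of signature generation) are assumptions made for illustration only.

```python
def misr_signature(outputs, width=11, taps=(11, 2), skip=0):
    """Compress a sequence of output words into one signature.

    taps are 1-based bit positions XOR'ed to form the feedback bit; the
    highest tap must equal width so that no state information is lost.
    skip is the number of leading words ignored while the first valid
    result is still propagating through the RAM, registers and logic.
    """
    mask = (1 << width) - 1
    state = 0
    for word in outputs[skip:]:
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        # Advance the PRBS generator by one step ...
        state = ((state << 1) | feedback) & mask
        # ... then fold the current output word into the state.
        state ^= word & mask
    return state
```

Only the final `width`-bit value has to be compared with the stored reference signature, so one comparison replaces N. Since the state update is invertible, a single corrupted word always changes the signature; multiple errors can in principle cancel, which is the usual aliasing risk of signature analysis.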
4.1.3 Test vector generation
In principle, the number of test vectors required for an M
input combinational logic circuit is 2^M. If the value of M is
small, exhaustive testing of the circuit may be carried out
by generating the test inputs through an M bit counter and
checking the signature after the counter completes one full
cycle. However, some of the inputs may contribute more to
Dmax than the others. For example, in the case of the
multipliers, the maximum propagation delay occurs only
when MSBs of the operands are 1. If the multiplier works
for this case, it will work for the other cases where at least
one of the MSBs is zero. Hence, an (M − 2) bit counter is
adequate for testing. For circuits with large number of
inputs, exhaustive testing would require very large testing
time. Minimal test vector set, which reduces the testing
time without compromising the quality of detection of
faults, may be obtained using the automatic test pattern
generator (ATPG) algorithms [17]. Computer aided design
tools may also be used for generating the minimal test
vectors using ATPG algorithm and assessing their fault
coverage ratio. However, the generation of test patterns for
wave-pipelined circuit is non-trivial because we have to
account for data dependent delays (delay for 001 is dif-
ferent from that for 101) [11] and this is compounded by
the absence of accurate models for interconnects in FPGAs.
Since the conventional ATPG techniques are not applicable
for wave-pipelined circuits, we have to be content with only
random test vectors. By choosing different test vector sets
consisting of different combinations and different ordering
of test vectors, we can improve the confidence level.
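The counter-based reduction mentioned above for multipliers can be sketched as follows. The generator enumerates operand pairs for a two-operand multiplier with M input bits in total, holds the MSB of each operand at 1 and sweeps the remaining M − 2 bits with a single counter. The function name and the equal split of the M inputs between the two operands are assumptions; whether the MSB = 1 cases really cover the worst-case delay depends on the multiplier structure, as argued in the text.

```python
def multiplier_test_vectors(m_bits):
    """Yield (a, b) operand pairs for a multiplier with m_bits input bits.

    Each operand is m_bits // 2 wide. Its MSB is forced to 1 and the
    remaining bits are taken from an (m_bits - 2)-bit counter.
    """
    free = m_bits // 2 - 1          # counter bits available per operand
    msb = 1 << free                 # weight of the forced MSB
    low_mask = (1 << free) - 1
    for count in range(1 << (2 * free)):
        a = msb | (count >> free)       # upper half of the counter
        b = msb | (count & low_mask)    # lower half of the counter
        yield a, b
```

For an 8-input multiplier (two 4-bit operands) this produces 64 vectors instead of the 256 needed for exhaustive testing.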
Fig. 11 Flowchart of FSM operation
Fig. 12 Signature generator
4.2 SOC approach for wave-pipelined circuits
As mentioned in Sect. 4.1, the BIST approach requires a
number of overheads such as FSM, signature generator and
test vector RAM. These blocks are useful only when the
clock frequency and skew are to be varied. If the operating
frequency is chosen so that the stable period in Fig. 7 is
at least twice the worst case variation in the
delay due to temperature, neither the clock frequency nor
the skew needs to be adjusted again. After this initial
selection, the 2D DWT blocks require no further tuning and
work satisfactorily without any external intervention.
Instead of using a dedicated circuit such as BIST, a pro-
cessor may be used to carry out the above tuning task. For
example, an FPGA based speech recognition with SOC
may perform the various tasks required by optimally par-
titioning between hardware and software [21]. The tasks
performed in software uses the on-chip processor. The
hardware block may use wave-pipelining and it may be
tuned by the on-chip processor at the beginning.
For the SOC approach, the PRBS generator and signature
comparator blocks in Fig. 10 may be replaced by a block
RAM, which is used to store the outputs of the CUT
corresponding to the test inputs. Since the communication
interface between the on chip processor and the circuit
under test is faster, the outputs can be directly read and
compared with the expected output for every combination
of skew and clock frequency. The flow chart in Fig. 11 can
be modified accordingly. The select inputs for the clock as
well as skew blocks and the data inputs to the wave-
pipelined circuit may be applied and varied through the on-
chip processor.
A variety of choices exist for the implementation of
SOC. The SOC may consist of a hard-core processor such
as a PowerPC or ARM processor and an FPGA coprocessor
or DSP block. Alternatively, it may consist of a soft-core
processor such as Nios II or MicroBlaze and a custom DSP
block implemented in FPGA. In this paper, FPGA based
SOCs consisting of either a Nios II or a MicroBlaze soft-core
processor are used for the implementation. Figure 13 shows
the interface diagram of a Nios II processor along with the
custom block (hybrid wave-pipelined circuit).
5 Architecture for the computation of 2D DWT using
lifting scheme
The automation schemes proposed in the previous section
are used for tuning the hybrid scheme for 2D DWT. The
details of the architecture used and the assumptions made
about the individual blocks of 2D DWT are presented in
this section. Sub-images of size 32 × 32 with 8 bits per
pixel are used for the computation. The DWT coefficients
are assumed to be represented using 11 bits. Each pixel is
extended to 11 bits by padding three zeros at the most
significant positions. This is done in order
to make the word size of the inputs to the horizontal filter
and vertical filters to be the same. This enables the same
hardware or program to be reused for the computation of
the outputs of both horizontal and vertical filters. The
inputs to the horizontal filter are the pixel intensity values
whereas the inputs to the vertical filters are DWT coeffi-
cients. The lifting multiplier constants (a, b, c, d, n) are
assumed to be of 11 bits each. The block diagram of one
level 2D DWT is shown in Fig. 14. For the horizontal filters,
the even and odd inputs are applied from two block
RAMs of size 512 × 11. The result is written into four
block RAMs of size 256 × 11. For the vertical filters, the
inputs are applied from these four block RAMs and
testing, the image is assumed to be loaded into the block
RAMs using memory initialization file (MIF).
Fig. 13 Adding custom logic to the Nios II ALU
Fig. 14 Overall block diagram of one level 2D DWT
5.1 Block diagram of two level 2D DWT
The block diagram of two level 2D DWT is shown in
Fig. 15. In order to minimize the area required for implementation,
the horizontal filter and the vertical filters are
reused to compute the multilevel 2D DWT. Block RAMs
E1, O1 contain the even and odd streams of the initial data
to be transformed. Block RAMs E2E/E2O, E3, E4, E5
hold the output of one level 2D DWT. The even and odd
numbered coefficients of the LL1 component are stored in two
block RAMs E2E and E2O and are used as inputs for the
2nd level DWT. The outputs of the 2nd level DWT are
stored in block RAMs E6, E7, E8 and E9. The output of the
horizontal filter is stored in four block RAMs E10, O10,
E11, O11. If LL2, the low pass band corresponding to two
level DWT alone is required, only one demultiplexer and
seven block RAMs (E1, E2E, O1, E2O, E10, O10, E3) are
required. For the purpose of verification, only LL2 is
computed and compared for the different schemes of
computation of two level 2D DWT. For the computation of
LL1 component of one level 2D DWT, only block RAMs
E1, O1, E2E, E2O, E10, O10 are used.
5.2 Overlapping scheme for the computation
of 2D DWT of complete image
The architecture proposed in Sect. 5.1 for sub-images of
size 32 × 32 may be used for the computation of 2D DWT
of a larger image by splitting it into a number of overlapping
sub-images of size 32 × 32. The advantage of
splitting the image into a number of sub-images is to per-
form the computation of 2D DWT in parallel in a number
of computational engines. Further, it also reduces the
memory required for storing the image and its transform. In
the overlapping scheme, the image blocks are formed such
that the number of pixels that overlap between adjacent
blocks along the vertical and horizontal directions is equal
to the order of the filter. For example, for the 9/7 biorthogonal
filter used for the 2D DWT, the number of
overlap pixels should be equal to four on the left and four
on the right between horizontal blocks. Similarly, the
number of pixel overlap between vertical blocks should be
equal to four on the top and four on the bottom. For the
blocks on the boundary, overlapping needs to be done only
on the non-boundary edge.
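The block boundaries implied by this overlapping scheme can be computed as in the sketch below, shown for one dimension (the same function applies to rows and columns). The text fixes the overlap at four pixels per side; the spacing between blocks is not stated, so the `core` value of 24 is an assumption chosen so that an interior block, with four extra pixels on each side, is 32 pixels wide. The function name is hypothetical.

```python
def overlapped_tiles(length, core=24, overlap=4):
    """Return (start, stop) pixel ranges of overlapping blocks along one axis.

    Each block covers `core` new pixels plus `overlap` extra pixels on every
    side that has a neighbour; at the image boundary no extension is made.
    """
    tiles = []
    for start in range(0, length, core):
        stop = min(start + core, length)
        tiles.append((max(start - overlap, 0), min(stop + overlap, length)))
    return tiles
```

For a 72-pixel axis this yields the ranges (0, 28), (20, 52) and (44, 72): the interior block is 32 pixels wide and the two boundary blocks are extended on their inner edge only.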
6 Implementation results
In order to demonstrate the applicability of the automation
approaches for both Xilinx and Altera FPGAs, the 2D
DWT is implemented using both Xilinx Spartan and Altera
Cyclone FPGAs and the results are presented in this sec-
tion. In each of the FPGAs, the 2D DWT is computed using
three multiplication schemes: hybrid WP-P BW-KCM,
non-pipelined BW-KCM and BW-PKCM.
6.1 Programmable clock and skew generators
The operating frequency of the wave-pipelined circuit is
expected to lie between those of the non-pipelined and
pipelined circuits. Hence, the minimum and maximum
frequencies of the clock generator should correspond to the
maximum operating frequencies of the non-pipelined and
pipelined circuits, respectively. The approximate values of
the clock periods of these circuits for the implementation
of the β block on Cyclone FPGA are 5.6 and 7.4 ns,
respectively. The values of Dmax and Dmin for the α block
are 15.302 and 7.34 ns, respectively. The programmable
clock and skew generators are designed such that the
clock period can be varied from 8.4 to 20.6 ns in steps of
0.8 ns and the skew can be varied from 12.3 to 26.2 ns in
steps of 0.9 ns approximately. The same exercise is carried
out for the β, γ and δ blocks using the synthesis report. A
single clock generator is used for all the four blocks.
Separate skew generators are used for each of the four
blocks. In order to remove the glitches in the clock signal,
a “Majority Logic Gate” is suggested in [23]. The operating
frequency and skews are chosen using the FSM such that all
the blocks work satisfactorily. A similar procedure is
adopted for the implementation on the other two FPGAs.
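As a rough illustration of these numbers, the sketch below enumerates the nominal clock and skew settings and compares the shortest clock period with the delay spread Dmax − Dmin of the α block. It assumes perfectly uniform steps (the text says the steps are approximate) and uses the standard wave-pipelining requirement that the clock period must exceed Dmax − Dmin, with register and clocking overheads ignored. The helper name settings is hypothetical.

```python
def settings(first_ns, last_ns, step_ns):
    """Nominal generator settings from first_ns up to (at most) last_ns."""
    count = int((last_ns - first_ns) / step_ns + 1e-9) + 1
    return [round(first_ns + i * step_ns, 3) for i in range(count)]


D_MAX_NS, D_MIN_NS = 15.302, 7.34            # alpha block, from the text

clock_periods = settings(8.4, 20.6, 0.8)     # nominal clock periods in ns
clock_skews = settings(12.3, 26.2, 0.9)      # nominal skews in ns
delay_spread = round(D_MAX_NS - D_MIN_NS, 3)

print(len(clock_periods), len(clock_skews), delay_spread)
print(min(clock_periods) > delay_spread)
```

With uniform steps both generators offer 16 settings, which would fit a 4-bit select input, and the shortest nominal period of 8.4 ns lies just above the 7.962 ns delay spread.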
The location of the logic elements and the interconnects
used for the implementation of the clock and skew blocks
should be fixed so that when these blocks are integrated
with the 2D DWT or the soft-core processor, the
interconnect delays are not altered. This is achieved by
using the LogicLock feature in Altera. In the case of Xilinx
FPGAs, this is achieved by using macros.
Fig. 15 Block diagram of two level 2D DWT
6.2 Implementation of 2D DWT using BIST approach
The one level 2D DWT is implemented on Xilinx Spartan-
II XC2S100 FPGA using the BIST approach. It may be noted
that the BIST approach is also applicable to Altera
FPGAs. A personal computer (PC) is used for the
realization of the FSM. The interface used between the PC
and the FPGA is the same as that described in [14]. The
output of the hybrid circuit (11 bits) is XORed with the
output of the 11 bit PRBS generator and the signature is
obtained.
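The signature step can be sketched as a multiple-input signature register. The text gives neither the PRBS polynomial nor the exact way the responses are compacted, so the feedback taps (x^11 + x^9 + 1, a commonly used 11-bit PRBS polynomial), the seed and the function names below are assumptions made only for illustration.

```python
WIDTH = 11
MASK = (1 << WIDTH) - 1


def prbs_step(state):
    """Advance an 11-bit Fibonacci LFSR by one clock (assumed taps: bits 11 and 9)."""
    feedback = ((state >> 10) ^ (state >> 8)) & 1
    return ((state << 1) | feedback) & MASK


def signature(outputs, seed=MASK):
    """Fold a stream of 11-bit output words into a single 11-bit signature."""
    state = seed
    for word in outputs:
        # XOR each circuit output with the advancing PRBS state.
        state = prbs_step(state) ^ (word & MASK)
    return state


golden = signature([5, 900, 1023, 0, 77])
faulty = signature([5, 900, 1022, 0, 77])   # one flipped bit in the third word
print(golden != faulty)
```

The FSM on the PC would then only need to compare the signature read back from the FPGA with a reference signature, rather than every output word.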
The implementation results of the 9/7 horizontal filters
for one level 2D DWT on Xilinx Spartan-II XC2S100
FPGA are given in Table 1. Multipliers of size 11 × 8 are
implemented. From Table 1, it may be concluded that for
the filter, the method using hybrid WP-P BW-KCM is
faster than non-pipelined BW-KCM by a factor of 1.32 and
requires the same area. The pipelined BW-PKCM is in turn
faster than the hybrid WP-P BW-KCM by a factor of 1.97
and this is achieved with an increase in the number of
registers by a factor of 4.6 and an increase in the number
of slices by a factor of 1.79.
The implementation results of one level 2D DWT for a
sub-image of size 32 × 32 using the BIST approach are
shown in Table 2. In order to make the horizontal and
vertical filters identical, multipliers of size 11 × 11
are used for both of them. Three zeros are appended
before the input samples as discussed in Sect. 5. The
overheads required for the wave-pipelined circuits are
also shown in Table 2. It may be noted that the overhead
required is about 22.5%. From Table 2, it may be concluded
that for the lifting scheme, the method using hybrid WP-P
BW-KCM is faster than non-pipelined BW-KCM by a factor of
1.4 and requires the same area. The pipelined BW-PKCM is in
turn faster than the hybrid WP-P BW-KCM by a factor of 1.2
and this is achieved with an increase in the number of
registers by a factor of 2.73 and an increase in the number
of slices by a factor of 1.32.
Table 3 shows the implementation results for the two level
2D DWT for the pipelined scheme. The implementation of the
hybrid WP two level 2D DWT using the BIST approach is
in progress.
6.3 Implementation of 2D DWT using SOC approach
For the hybrid WP block, the optimal clock period and
clock skews are determined using the procedure described
in Sect. 6.1. The hybrid wave-pipelined 2D DWT unit
(obtained by adding the input and output block RAMs to
the non-pipelined circuit along with the programmable
clock and clock skew blocks) is tested first using
simulation. As mentioned in Sect. 3, simulation is
inadequate to test the hybrid wave-pipelined circuit.
Hence, this circuit is implemented along with the Nios II
or Micro blaze soft-core processor, to which it is added as
a custom block using SOPC Builder or the Embedded
Development Kit (EDK). The program to be executed by the
Nios II or Micro blaze is written in C/C++ and the custom
block is invoked as a function in this program. The program
reads from and writes to the block RAM in the custom block.
It is compiled and the executable code, along with the
configuration bits corresponding to the Nios II or Micro
blaze integrated with the custom block, is downloaded to
the FPGA. When the program is run, it systematically varies
the select inputs for the clock and clock skew blocks, and
uploads the content of the output block RAM.
Table 1 Implementation of 9/7 bi-orthogonal filters with 11 × 8 multipliers using the various schemes
Multiplier (BW-KCM)   Slices   Number of registers   Speed (MHz)
Non-pipelined         253      176                   57.3
Pipelined             453      803                   149.18
Hybrid WP-P           253      176                   75.75
Table 2 Implementation results on one level 2D DWT
Approach and device                            Lifting scheme   Number of slices or LEs   Number of registers   Speed (MHz)
                                                                (1 slice = 2 LUTs)
BIST approach with Spartan-II XC2S100PQ208-5   Non-pipelining   836                       611                   54.45
                                               Pipelining       1,110                     1,670                 87.54
                                               Hybrid WP-P      836a (188)                611a (85)             75.75
SOC approach with Cyclone-II EP2C35F672C6      Non-pipelining   703                       375                   117.83
                                               Pipelining       782                       671                   203.92
                                               Hybrid WP-P      703a (30)                 375a (8)              147.5
SOC approach with Spartan-III XC3S200FT256-4   Non-pipelining   897                       730                   67.9
                                               Pipelining       1,381                     2,305                 114.19
                                               Hybrid WP-P      897a (32)                 730a (8)              82.6
a Denotes additional overhead for testing WP circuits
The clock and skew are adjusted until a match occurs
for at least three consecutive clock skews. The operating
clock and clock skew of the wave-pipelined circuit are fixed
at the middle value and from then on, the custom block
works without any intervention from the Nios II or Micro
blaze processor.
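The selection rule just described (sweep the select inputs, look for at least three consecutive skew settings that give the correct output, then settle on the middle one) can be sketched as follows. The callback matches stands in for the hardware access performed by the soft-core processor and is hypothetical, as are the function name tune and the sweep order of the clock select.

```python
def tune(matches, n_clock, n_skew, min_run=3):
    """Return (clock_select, skew_select), or None if no setting qualifies.

    matches(clock_select, skew_select) is a hypothetical callback that applies
    the two select values, runs the test vectors and reports whether the
    output block RAM holds the expected result.
    """
    for clock in range(n_clock):
        run_start = None
        for skew in range(n_skew + 1):       # the extra pass closes a run at the end
            ok = skew < n_skew and matches(clock, skew)
            if ok and run_start is None:
                run_start = skew
            elif not ok and run_start is not None:
                run_len = skew - run_start
                if run_len >= min_run:
                    return clock, run_start + (run_len - 1) // 2
                run_start = None
    return None


def fake_hardware(clock, skew):
    """Stand-in for the FPGA: works from clock setting 2 on, for skews 5 to 7."""
    return clock >= 2 and 5 <= skew <= 7


print(tune(fake_hardware, 16, 16))   # prints (2, 6)
```

Choosing the middle of a run, rather than its first working value, leaves a margin on both sides against delay variations.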
6.3.1 Implementation results on one level 2D DWT using
Cyclone-II EP2C35F672C6
The one level 2D DWT is implemented on Cyclone-II
EP2C35F672C6 with and without pipelining. A single filter
is implemented and time shared for the computation of the
outputs of both horizontal and vertical filters. The 2D DWT
block is added as a custom block to the Nios II CPU and
downloaded to the Cyclone-II. The 2D DWT is also computed
using the in-built instruction set of Nios II [22]. The
number of CPU clock cycles for both cases is tabulated in
Table 4. (The clock frequency obtained using the above
device is 40 MHz.)
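To put the cycle counts of Table 4 in perspective, the short calculation below converts them into a speed-up factor and into execution times. It assumes that the 40 MHz figure is the CPU clock to which both cycle counts refer; the variable names are illustrative.

```python
CPU_CLOCK_HZ = 40e6          # assumed to be the clock behind both cycle counts

cycles_software = 73_280     # Table 4, software approach
cycles_custom = 814          # Table 4, custom block

speedup = cycles_software / cycles_custom
time_software_us = cycles_software / CPU_CLOCK_HZ * 1e6
time_custom_us = cycles_custom / CPU_CLOCK_HZ * 1e6

print(f"speed-up: {speedup:.1f}x")
print(f"software: {time_software_us:.2f} us, custom block: {time_custom_us:.2f} us")
```

Under this assumption the custom block is about 90 times faster than the software routine, roughly 20 µs against 1.8 ms.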
For the hybrid wave-pipelined circuit, the number of
logic elements, number of registers, maximum operating
frequency and power dissipated are computed using
Cyclone-II FPGA and the results are given in Table 2. It
may be noted that the overhead required for the wave-
pipelined circuit is about 4%. From Table 2, it may be
concluded that for the lifting scheme, the method using the
hybrid WP-P BW-KCM is faster than non-pipelined BW-
KCM by a factor of 1.25. The scheme with Baugh–Wooley
Pipelined Constant Coefficient Multiplier is in turn faster
than the hybrid WP-P BW-KCM by a factor of 1.38 and
this is achieved with the increase in the number of registers
by a factor of 1.78 and increase in number of LEs by a
factor of 1.11.
Pipelining may be used either for increasing the operating
frequency of a circuit or for reducing the power
dissipation [12]. Although pipelining requires more
registers and area, it does not automatically lead to higher
power dissipation. In order to assess whether hybrid
wave-pipelining is superior with regard to power
dissipation, both the hybrid wave-pipelined and pipelined
circuits are operated at the same frequency (corresponding
to the maximum operating frequency of the hybrid
wave-pipelined circuit) and the power dissipated by the two
circuits is given in Table 5. From Table 5, it may be noted
that the pipelined circuit dissipates 11% less power than
the hybrid wave-pipelined 2D DWT.
6.3.2 Implementation results on one level 2D DWT using
Spartan-III XC3S200
Implementation results for one level 2D DWT on Xilinx
Spartan-III XC3S200 using all the three approaches are
given in Table 2. The programmable clock and clock skew
blocks are implemented as Macro blocks using Xilinx ISE
8.1i project navigator. For tuning the hybrid wave-pipe-
lined circuit, the Micro blaze soft-core processor is used.
Xilinx EDK software is used to integrate the custom block
to the Micro blaze processor. The rest of the steps are
similar to what is used for the Altera SOC kit. For all the
three schemes, the number of logic elements, number of
registers and maximum operating frequency are computed
and the results are given in Table 2. It may be noted that
the overhead required for the wave-pipelined circuit is
about 3.5%. It may also be noted that dedicated filters are
used for the computation of the outputs of the horizontal
and vertical filters. Hence, the area required for this
scheme is higher than that using Cyclone-II devices.
From Table 2, it may be concluded that for the
lifting scheme, the method using hybrid WP-P BW-KCM
is faster than non-pipelined BW-KCM by a factor of 1.21.
The scheme with Baugh–Wooley Pipelined Constant
Coefficient Multiplier is in turn faster than the hybrid WP-
P BW-KCM by a factor of 1.38 and this is achieved with
the increase in the number of registers by a factor of 3.15
and increase in the number of LEs by a factor of 1.53.
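The overhead percentages and speed factors quoted for the three platforms can be recomputed from the Table 2 entries. The sketch below is only a consistency check; the dictionary layout and the name summarize are illustrative, and the overhead is taken relative to the slice or LE count of the hybrid circuit.

```python
# (slices or LEs of hybrid circuit, test overhead in slices or LEs,
#  speeds in MHz: non-pipelined, pipelined, hybrid WP-P), all from Table 2
TABLE2 = {
    "Spartan-II (BIST)": (836, 188, 54.45, 87.54, 75.75),
    "Cyclone-II (SOC)": (703, 30, 117.83, 203.92, 147.5),
    "Spartan-III (SOC)": (897, 32, 67.9, 114.19, 82.6),
}


def summarize(entry):
    """Return (overhead %, hybrid/non-pipelined, pipelined/hybrid) for one device."""
    area, overhead, f_non, f_pipe, f_hybrid = entry
    return 100.0 * overhead / area, f_hybrid / f_non, f_pipe / f_hybrid


summary = {device: summarize(entry) for device, entry in TABLE2.items()}
for device, (pct, hybrid_gain, pipe_gain) in summary.items():
    print(f"{device}: overhead {pct:.1f}%, hybrid x{hybrid_gain:.2f}, pipelined x{pipe_gain:.2f}")
```

The recomputed values agree with the figures quoted in the text to within rounding.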
Table 3 Area and speed performance of two level forward 2D DWT on Xilinx Spartan-II XC2S-200PQ208-5
Lifting scheme   Slices used   Speed (MHz)   Number of registers
Pipelined        1,511         61.42         2,506
Table 4 Computation time for 2D DWT
Function   Number of CPU clock cycles for software approach   Equivalent CPU clock cycles for custom block
2D DWT     73,280                                             814
Table 5 Power dissipated by pipelined and hybrid wave-pipelined one level 2D DWT at normalized frequency
Description of the circuit                                    Pipelined circuit   Hybrid circuit
Power at normalized frequency including additional overhead   158.97              179.58
6.4 Validation of the scheme for 2D DWT
To verify the correctness of the schemes proposed for the
computation of 2D DWT, the Lena image of size 128 × 128
with blocks (sub-images) of size 32 × 32 pixels is used.
The 128 × 128 image is shown in Fig. 16 and is obtained
by subsampling the standard image of size 512 × 512 by a
factor of four along both dimensions. As mentioned in
Sect. 5.2, an overlap of four pixels is used between
adjacent blocks. In total, 36 image blocks are used for the
128 × 128 image. The 2D DWT of the image is computed both
in software, using a C program, and in hardware, using the
FPGA. For the implementation in C, the lifting multiplier
constants (α, β, γ, δ, n1, n2) and the filter coefficients
for the distributed arithmetic algorithm are declared as
“double” type (64 bits) variables. The pixel intensities
are declared as “short” type (16 bits) variables. The
analysis filter outputs corresponding to the 36 image
blocks are merged suitably and the LL1 component of the
image is shown in Fig. 16.
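For orientation, a floating-point reference of the kind described above can be sketched for one row of pixels. The sketch uses the widely published lifting factorization of the 9/7 filter with its usual constants and whole-sample symmetric extension at the ends; the scaling convention (multiplying the low-pass output by about 1.1496 and dividing the high-pass output by it), the name lift97_forward and the boundary handling are assumptions and need not match the authors' C program or their scaling constants.

```python
import math

# Commonly published 9/7 lifting constants (assumed, not taken from this paper)
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
SCALE = 1.149604398


def lift97_forward(samples):
    """One level of the 1D 9/7 analysis filter bank computed by lifting."""
    if len(samples) < 2 or len(samples) % 2:
        raise ValueError("expected an even number of samples")
    s = [float(v) for v in samples[0::2]]    # even samples -> low-pass branch
    d = [float(v) for v in samples[1::2]]    # odd samples  -> high-pass branch
    last = len(s) - 1

    def s_right(i):                          # mirror at the right boundary
        return s[min(i + 1, last)]

    def d_left(i):                           # mirror at the left boundary
        return d[max(i - 1, 0)]

    for predict, update in ((ALPHA, BETA), (GAMMA, DELTA)):
        for i in range(len(d)):
            d[i] += predict * (s[i] + s_right(i))
        for i in range(len(s)):
            s[i] += update * (d_left(i) + d[i])
    return [v * SCALE for v in s], [v / SCALE for v in d]


low, high = lift97_forward([100] * 32)
print(round(low[0], 3), round(abs(high[0]), 6))
```

For a constant row the high-pass outputs vanish (up to the precision of the constants) and, with this scaling, the low-pass outputs equal the input multiplied by the square root of two. Applying the same routine to the rows and then to the columns of a 32 × 32 block gives one level of the 2D transform.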
The implementation of the forward 2D DWT for an image
block of size 32 × 32 is carried out for the lifting scheme
with BW-hybrid WP-PKCM. For the implementation, the Xilinx
Spartan-II XC2S100PQ208-5 device is used. For storing the
image input, the outputs of the horizontal filter and the
outputs of the vertical filter, the block RAMs are
configured suitably. The image is loaded into the block
RAMs through the UCF of the implementation tool. The one
level 2D DWT is computed using the above scheme for all the
36 image blocks and the results are merged suitably. The
LL1 component of the image is shown in Fig. 16. From
Fig. 16, it may be concluded that the LL1 components
obtained through the FPGA implementation match well with
those obtained using C. The LL1 components also match well
with the original image.
In order to make a quantitative comparison of the LL1
component with the original image, the original Lena
image is subsampled to be of size 64 × 64. Treating the
LL1 component itself as the compressed image, the PSNR
values of the compressed images obtained using BW-hybrid
WPKCM and C are computed and are found to be 28.22 and
33.33, respectively.
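The PSNR comparison can be reproduced with a few lines of code. The sketch below uses the standard definition of PSNR, 10 log10(peak² / MSE), and assumes 8-bit pixels with a peak value of 255; the text does not state the peak value or any scaling applied to the LL1 coefficients before the comparison, and the function name psnr is illustrative.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio between two equally sized images (lists of rows)."""
    squared_errors = [
        (r - t) ** 2
        for row_ref, row_test in zip(reference, test)
        for r, t in zip(row_ref, row_test)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


original = [[10, 20], [30, 40]]
approximation = [[11, 19], [31, 39]]     # every pixel off by one grey level
print(round(psnr(original, approximation), 2))
```

In the setting of this section, reference would be the 64 × 64 subsampled Lena image and test the merged LL1 component.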
7 Conclusion
Two automation schemes are proposed in this paper for
the implementation of the 9/7 bi-orthogonal filters using
hybrid WP-P constant coefficient multiplier with Baugh–
Wooley multiplication algorithm. Nios II and Micro blaze
soft-core processors are integrated with 2D DWT blocks
successfully and the optimum clock period and clock
skews for the 2D DWT blocks are selected using them.
After this initial selection, the 2D DWT blocks work
satisfactorily without any external intervention and the
processors are free to do other tasks. The 9/7 bi-orthog-
onal filters are implemented on both Xilinx and Altera
devices using the lifting scheme with the following three
multipliers: BW-PKCM, BW-KCM and hybrid WP-P
BW-KCM. From the implementation results, it is verified
that hybrid WP-P BW-KCM is faster than non-pipelined
BW-KCM. The scheme with BW-PKCM is in turn faster
than the hybrid WP-P BW-KCM and this is achieved with
the increase in the number of registers and increase in the
number of LEs. The custom instruction for 2D DWT is
found to be faster than the implementation using
C. The correctness of the procedure for the computation
of the 2D DWT of an image, using the 2D DWT of
sub-images, is verified by computing the 2D DWT using both
hardware and software approaches (using C) and displaying
the LL1 components for an image of size 128 × 128. The
automation schemes proposed in this paper have also been
successfully employed in [23] for the implementation of
wave-pipelined filters using the distributed arithmetic
algorithm and a sine wave generator using CORDIC. Work on
the computation of multi level 2D DWT and real time
computation of 2D DWT using the hybrid scheme is in
progress.
One of the challenges in the design of FPGA based
wave-pipelined circuits is the accurate modeling of the
interconnects as well as device delays and their tempera-
ture dependence. In the absence of these models, the wave-
pipelined circuits can only be operated at moderate speeds.
Fig. 16 LL1 component compared with input image
References
1. Xilinx documentation library, Xilinx Corporation, USA
2. Altera documentation library-2003, Altera Corporation, USA
3. Sheldon, D., Kumar, R., Vahid, F., Tullsen, D., Lysecky, R.:
Conjoining soft-core FPGA processors. In: IEEE/ACM International
Conference on Computer Aided Design, 2006, ICCAD, pp.
694–701 (2006)
4. Draper, B.A., Beveridge, J.R., Willem Bohm, A.P., Ross, C.,
Chawathe, M.: Accelerated image processing on FPGAs. IEEE
Trans. Image. Process. 12(12), 1543–1551 (2003)
5. Ritter, J., Molitor, P.: A pipelined architecture for partitioned
DWT based lossy image compression using FPGA’s. In:
Proceedings ACM Conference FPGA 2001, pp. 201–206 (2001)
6. Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J.,
Yousuf, A.K., Sriram, G.: Design and FPGA implementation of
image block encoders with 2D-DWT. Proceedings TENCON
2003. 3, 1015–1019 (2003)
7. Daubechies, I., Sweldens, W.: Factoring wavelet transforms into
lifting steps. J. Fourier Anal. Appl. 4, 247–269 (1998)
8. Nyathi, J., Delgado-Frias, J.G.: A hybrid wave-pipelined network
router. IEEE Trans. Circuits. Syst-I, Fundam. Theory. Appl.
49(12), 1764–1772 (2002)
9. Hauck, O., Katoch, A., Huss, S.A.: VLSI system design using
asynchronous wave pipelines: a 0.35 μm CMOS 1.5 GHz elliptic
curve public key cryptosystem chip. In: Proceedings of the sixth
international symposium on advanced research in asynchronous
circuits and systems 2000 (ASYNC 2000), pp. 188–197 (2000)
10. Burleson, W.P., Ciesielski, M., Klass, F., Liu, W.: Wave-pipe-
lining: a tutorial and research survey. IEEE Trans. Very Large
Scale Integration (VLSI) Syst. 6(3), 464–474 (1998)
11. Gray, T.C., Liu, W., Cavin, R.K., III.: Wave-Pipelining: Theory
and CMOS Implementation. Kluwer, Boston (1994)
12. Parhi, K.K.: VLSI Signal Processing Systems. Wiley, New York
(1999)
13. Boemo, E.I., Lopez-Buedo, S., Meneses, J.M.: Wave-pipelines
via look-up tables. IEEE Int. Symp. Circuits Syst. (ISCAS ’1996).
4, 85–88 (1996)
14. Lakshminarayanan, G., Venkataramani, B.: Optimization tech-
niques for FPGA based wave-pipelined DSP blocks. IEEE Trans.
Very Large Scale Integration (VLSI) Syst. 13(7), 783–793 (2005)
15. Acharya, T., Tsai, P.-S.: JPEG2000 Standard for Image
Compression: Concepts, Algorithms and VLSI Architectures. Wiley,
New York (2005)
16. Sayood, K.: Introduction to Data Compression. Morgan Kauf-
mann, Menlo Park (2000). An Imprint of Elsevier
17. Smith, M.J.S.: Application Specific Integrated Circuits. Pearson
Education Asia Pvt. Ltd, Singapore (2003)
18. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.:
Design and FPGA implementation of self tuned wave-pipelined
filters. IETE J. Res. 52(4), 281–286 (2006)
19. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.:
Design and FPGA implementation of wave-pipelined image
block encoders using 2D-DWT. In: Proceedings of VLSI design
and test symposium VDAT 2005, pp. 12–20 (2005)
20. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.:
Design and FPGA implementation of wave-pipelined distributed
arithmetic based filters. In: Proceedings of VLSI Design and Test
workshop VDAT 2004, pp. 216–220 (2004)
21. Amudha, V., Venkataramani, B., Vinoth Kumar, R., Ravishankar,
S.: SOC Implementation of HMM Based Speaker Independent
Isolated Digit Recognition System. In: 20th International IEEE
Conference on VLSI Design (VLSID’07), pp. 1–6 (2007)
22. Seetharaman, G., Venkataramani, B., Amudha, V., Saundattikar,
A.: System on chip implementation of 2D DWT using lifting
scheme. In: Proceedings of the International Asia and South
Pacific Conference on Embedded SOCs (ASPICES 2005), (2005)
23. Seetharaman, G., Venkataramani, B.: SOC implementation of
wave-pipelined circuits. Proceedings of IEEE International con-
ference on Field Programmable Technology 2007 (ICFPT 2007),
pp. 9–16 (2007)
Author Biographies
G. Seetharaman received his B.E. and M.E. degrees in Electronics
and Communication Engineering from Regional Engineering College,
Tiruchirappalli in 1997 and 2002, respectively. Presently, he is car-
rying out his doctoral thesis work in the National Institute of
Technology, Tiruchirappalli. Previously, he worked as faculty in the
Jayaram College of Engineering and Technology, Tiruchirappalli, for
6 years and as Research Associate for three semesters in the National
Institute of Technology, Tiruchirappalli. Presently, he is working as
Laboratory Engineer in National Institute of Technology, Tiruchi-
rappalli. His current research interests include embedded system
design using field-programmable gate arrays (FPGAs) and system-on-
chip (SOC).
B. Venkataramani received his B.E. degree in Electronics and
Communication Engineering from Regional Engineering College,
Tiruchirappalli in 1979 and M.Tech. and Ph.D. degrees in Elec-
trical Engineering from Indian Institute of Technology, Kanpur in
1984 and 1996, respectively. He worked as Deputy Engineer in
Bharat Electronics Limited, Bangalore, India, and as a research
Engineer in Indian Institute of Technology, Kanpur, each for
approximately 3 years. Since 1987 he has been faculty member of
National Institute of Technology, (formerly Regional Engineering
College) Tiruchirappalli. Presently, he is working as Professor and
Head of the Department of Electronics and Communication in
National Institute of Technology. He has published two books and
numerous papers in journals and international conferences. His
current research interests include FPGA applications and SOC
based system design and performance analysis of high speed
computer networks.
G. Lakshminarayanan received his M.E. and Ph.D. degrees in
Electronics and Communication Engineering from Bharathidasan
University, Tiruchirappalli in 1995 and 2005, respectively. He pre-
viously worked as a Service Engineer for 5 years and as a scientist
and Research Associate for 4 years in Regional Engineering College,
Tiruchirappalli. He was a faculty member in SASTRA, Tanjore, for
two semesters and an Assistant Professor in Saranathan College of
Engineering, Tiruchirappalli for 1 year. Presently he is working as
Assistant Professor in National Institute of Technology, Tiruchirap-
palli. His current research interests include FPGA based system
design and VLSI front end design.