ORIGINAL RESEARCH PAPER
An efficient hardware implementation of parallel EBCOT algorithm for JPEG 2000
Taoufik Saidani • Mohamed Atri • Lazhar Khriji •
Rached Tourki
Received: 10 July 2012 / Accepted: 4 January 2013
© Springer-Verlag Berlin Heidelberg 2013
Abstract With the growth of multimedia technology, demand for high-speed real-time image compression systems has also increased. The JPEG 2000 still image compression standard was developed to accommodate such application requirements. Embedded block coding with optimal truncation (EBCOT) is an essential and computationally very demanding part of the JPEG 2000 compression process. Various applications, such as satellite imagery, medical imaging, digital cinema, and others, require a high-speed, high-performance EBCOT architecture. In the JPEG 2000 standard, the context formation block of EBCOT tier-1 involves highly complex computation and becomes the bottleneck of the system. In this paper, we propose a fast and efficient VLSI hardware architecture for context formation in EBCOT tier-1. A high-speed parallel bit-plane coding (BPC) hardware architecture for the EBCOT module in JPEG 2000 is proposed and implemented. Experimental results show that our design outperforms well-known techniques with respect to processing time, reaching a 70 % reduction compared to bit-plane sequential processing.
Keywords JPEG 2000 · EBCOT algorithm · Bit-plane coding · VHDL · FPGA implementation
1 Introduction
Nowadays, the demand for good compression techniques for multimedia applications keeps increasing, in order to provide excellent visual quality as well as efficient solutions to the end user. The algorithms supporting these features are very costly in terms of computational time and complexity. For instance, JPEG 2000 [1] is the latest international standard for still image compression and supports a rich set of features. Compared to existing JPEG image compression techniques, this standard not only achieves better compression ratios but also offers additional features. The features supported by JPEG 2000 include lossy and lossless compression, continuous-tone and bi-level compression, progressive transmission by pixel accuracy and resolution, region of interest coding, compressed domain processing, and error resilience [2, 3].
The JPEG 2000 encoder architecture includes the component transform, discrete wavelet transform, quantization and embedded block coding with optimized truncation (EBCOT) [1, 2]. It differs fundamentally from the original JPEG standard. First, JPEG 2000 replaces JPEG’s discrete cosine transform (DCT) frequency decomposition with a discrete wavelet transform (DWT). The JPEG 2000 standard specifies two kinds of wavelet transform: (1) the integer 5/3 transform for lossless image compression and (2) the 9/7 transform intended for the lossy compression mode [1, 2].
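As an illustration of the integer 5/3 transform mentioned above, the following Python sketch applies one level of the reversible 5/3 lifting scheme to a 1-D signal of even length. It is a minimal sketch, not the implementation used in this paper: the function names `dwt53_forward` and `dwt53_inverse` are hypothetical, and symmetric extension at the borders is assumed.

```python
def dwt53_forward(x):
    """One level of the reversible 5/3 lifting transform (even-length input).

    Returns (s, d): low-pass and high-pass integer coefficients.
    Borders use symmetric extension (an assumption of this sketch).
    """
    n = len(x)
    half = n // 2
    d = []
    for k in range(half):
        # Right neighbour of the odd sample; mirrored at the right border.
        right = x[2 * k + 2] if 2 * k + 2 < n else x[n - 2]
        # Predict step: detail = odd sample minus floor of neighbour mean.
        d.append(x[2 * k + 1] - (x[2 * k] + right) // 2)
    # Update step: d[-1] is mirrored to d[0] at the left border.
    s = [x[2 * k] + (d[max(k - 1, 0)] + d[k] + 2) // 4 for k in range(half)]
    return s, d


def dwt53_inverse(s, d):
    """Exact inverse of dwt53_forward (lossless reconstruction)."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    for k in range(half):
        x[2 * k] = s[k] - (d[max(k - 1, 0)] + d[k] + 2) // 4
    for k in range(half):
        right = x[2 * k + 2] if 2 * k + 2 < n else x[n - 2]
        x[2 * k + 1] = d[k] + (x[2 * k] + right) // 2
    return x
```

Because both lifting steps use only integer additions and floor divisions, the inverse recovers the input exactly, which is what makes this filter suitable for the lossless mode.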
The DWT is a multi-resolution decomposition, which
allows for resolution scalability within the embedded bit
stream. In addition, the DWT exhibits better energy
T. Saidani (✉) · M. Atri · R. Tourki
Electronics and Micro-Electronics Laboratory,
Faculty of Sciences, Monastir University, Monastir, Tunisia
L. Khriji
Electrical and Computer Engineering Department,
Sultan Qaboos University, Muscat, Oman
J Real-Time Image Proc
DOI 10.1007/s11554-013-0322-9
compaction than the DCT, allowing for superior compression efficiency. The DWT is typically applied to an image tile or to the entire image as a whole. This large-scale application allows the DWT to minimize the blocking artifacts that plagued the 8 × 8 DCT in the original JPEG [3, 4]. The second major deviation of JPEG 2000 from its predecessor is the replacement of the Huffman entropy coding scheme with an adaptive binary arithmetic coder. JPEG 2000 uses the embedded block coding with optimal truncation (EBCOT) algorithm to arithmetically encode the DWT coefficients [1, 2]. EBCOT works at the bit-plane level, generating a neighborhood context for each coded bit. Because it must touch every bit of every bit plane in the image, EBCOT dominates the processing time (50–75 %, depending on the source image) [3, 4]. In addition, because EBCOT works solely at the bit-plane level, it is fairly inefficient to implement in software. For these two reasons, the EBCOT algorithm warrants a custom FPGA implementation: it is the system bottleneck, and a fast implementation is needed to speed up the overall encoding process.
This paper is organized as follows: Sect. 2 provides an overview of EBCOT and of related work on existing architectures for embedded block coding. Section 3 presents a software analysis of the EBCOT algorithm with some discussion. The proposed BPC architecture is presented in Sect. 4. In Sect. 5, experimental results are reported and compared with other architectures. Finally, conclusions are drawn in Sect. 6.
2 Overview of JPEG 2000 encoding system
The block diagram of the JPEG 2000 coder is shown in Fig. 1. The input image frame is partitioned into rectangular, non-overlapping tiles. All unsigned image samples are DC level shifted so that they are centered on zero [1, 3]. Next, color space conversion from RGB to YCbCr is performed, and the wavelet transform is applied independently to each channel of the image. The DWT coefficients are supplied to the EBCOT coder, which comprises two processing stages: the bit plane coder (BPC) and the MQ coder [2, 4, 5].
The BPC generates context and decision (CX,D) pairs, which are supplied to the MQ coder [4]. The latter performs entropy coding and produces the embedded bit stream.
2.1 Bit plane coding
The digital cinema initiative (DCI) [1, 3, 4] standard is restricted to the 5/3 DWT filter [4–6]. The coefficients generated by the wavelet transform are in 2’s complement format. The bit plane coding is shown in Fig. 2.
These coefficients are converted to sign-magnitude format and stored in the code block (CB) memory. In DCI applications, the CB size is restricted to 32 × 32 [3, 4]. The CB memory comprises a sign plane and several magnitude planes, as shown in Fig. 3a. During encoding, the magnitude bit planes are scanned from the most significant bit (MSB) plane to the least significant bit (LSB) plane [7, 8]. Each bit plane is further divided into stripes of four rows (Fig. 3b). Samples are scanned column-wise, from left to right within a stripe and from top to bottom within a column, as shown in Fig. 3c. Three coding passes are used to examine each sample in a CB, namely, the clean-up pass (CUP), the significance propagation pass (SPP) and the magnitude refinement pass (MRP) [1, 3, 4]. The type of coding pass to be applied to a sample is determined by forming a context window surrounding it, as shown in Fig. 3d, along with the three state variables σ, σ′ and η: respectively, the significance state variable, the magnitude refinement state variable and the pass membership state variable. While scanning the wavelet coefficients, a coefficient is considered significant as soon as its first non-zero magnitude bit occurs, and its σ is updated to one.
When MRP is run for the first time on a coefficient, σ′ is set to one, and if zero coding (ZC) is applied on a coefficient
Fig. 1 Overview of JPEG 2000 coding process
Fig. 2 Functional block diagram of EBCOT Tier-1 algorithm
in SPP, η is set to one. Initially, all state variables are zero. Encoding starts from the first non-zero magnitude bit plane, and only CUP is run on this bit plane, whereas the remaining bit planes are coded sequentially using SPP, MRP and CUP [3, 4]. Four coding primitives, zero coding (ZC), sign coding (SC), run length coding (RLC), and magnitude refinement coding (MRC), are used to encode the samples and generate the (CX,D) pairs [10, 11].
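The sign-magnitude conversion, bit-plane extraction and stripe-based scan order described in this section can be sketched in Python as follows. This is only an illustrative sketch; the helper names (`to_sign_magnitude`, `magnitude_bit`, `stripe_scan_order`) are hypothetical and do not come from the proposed design.

```python
def to_sign_magnitude(coeff):
    """Split a signed coefficient into (sign_bit, magnitude); sign_bit is 1 if negative."""
    return (1 if coeff < 0 else 0), abs(coeff)


def magnitude_bit(magnitude, plane):
    """Return the bit of `magnitude` in bit plane `plane` (0 = LSB plane)."""
    return (magnitude >> plane) & 1


def stripe_scan_order(height, width, stripe_height=4):
    """Yield (row, col) positions in stripe-based scan order.

    Stripes are visited top to bottom, columns left to right within a
    stripe, and rows top to bottom within a column.
    """
    for top in range(0, height, stripe_height):
        for col in range(width):
            for row in range(top, min(top + stripe_height, height)):
                yield row, col
```

For a 32 × 32 code block, `stripe_scan_order(32, 32)` visits 8 stripes of 32 columns, each column holding four samples.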
2.2 MQ coder
The compression technique adopted in the JPEG 2000 standard is a statistical binary arithmetic coder, also called the MQ coder [1, 3, 18]. The MQ coder uses the probability associated with the context (CX) to compress the decision (D). In the MQ coder, each symbol in a code stream is classified as either the most-probable symbol (MPS) or the least-probable symbol (LPS) [18]. The basic operation of the MQ coder is to divide the interval recursively according to the probability of the input symbols.
Figure 4 shows the interval calculation for MPS and LPS in JPEG 2000 [4, 18]. Whether an MPS or an LPS is coded, the new interval is shorter than the original one. To avoid finite-precision problems, whenever the length of the probability interval falls below a certain minimum size, the interval must be renormalized to become greater than the minimum bound.
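The interval subdivision and renormalization can be illustrated with a short Python sketch. It follows the usual MPS/LPS coding flow of the MQ coder with conditional exchange, but deliberately omits the probability state machine and the byte-output stage; the function name `mq_code_symbol` and the 16-bit interval convention (minimum bound 0x8000) are assumptions of this sketch.

```python
def mq_code_symbol(a, c, qe, is_mps):
    """Update interval register `a` and code register `c` for one symbol.

    `qe` is the current LPS probability estimate. Returns (a, c, shifts),
    where `shifts` is the number of renormalization shifts performed.
    Probability adaptation and byte output are intentionally left out.
    """
    a -= qe
    if is_mps:
        if a & 0x8000:
            # Interval still above the minimum bound: no renormalization.
            return a, c + qe, 0
        if a < qe:
            a = qe          # conditional exchange
        else:
            c += qe
    else:
        if a < qe:
            c += qe         # conditional exchange
        else:
            a = qe
    shifts = 0
    while a < 0x8000:       # renormalize until a >= minimum bound
        a <<= 1
        c <<= 1
        shifts += 1
    return a, c, shifts
```

In both branches the interval `a` shrinks before renormalization, and the left shifts restore it to at least 0x8000.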
2.3 Related works
In [9] a serial architecture for EBCOT tier-1 is presented. A code block of size N × N (8 bit planes) needs 3 × N × N × bp clock cycles. The embedded coding block operates at 18.5 MHz on the APEX20KE family from Altera. The architecture proposed in Ref. [10] requires a very large number of clock cycles to process an image, and its drawback is a very high memory requirement. Parallel coding of two stripe columns is proposed in [11]. The coding passes are executed concurrently; for this purpose, two processing elements are used. Sequential processing is still needed for the four-bit column stripe. This architecture works at 50 MHz on a Xilinx Virtex II. In [12] all bit planes are coded in parallel: (0.35–0.46) × N × N clock cycles are required to code an N × N block. It works at 150 MHz.
To code all samples of a column stripe in one cycle, a pass-parallel architecture is proposed in [13]. This design works at 100 MHz to code 50 Ms/s (mega samples per
Fig. 3 a Bit-plane representation of a code block consisting of 8 magnitude bit-planes of dimension 32 × 32. b Each bit-plane consists of stripes made up of four rows. c Stripe-based scanning order for every pass. d Context window for a sample location
Fig. 4 JPEG 2000 arithmetic encoding procedure
second). In [14] a co-design architecture is proposed. The computationally intensive part (DWT, BPC and MQ coder) of the JPEG 2000 coder is implemented on an FPGA, while the other blocks of the encoder are implemented on a powerful PC. This architecture is capable of coding one stripe column in 2–5 clock cycles. An architecture for concurrent coding of all bit planes is presented in [15]. In this approach, the need for state variable memory is completely eliminated. Further, one arithmetic coder is shared between two bit planes. However, knowledge of the leading zero bit planes is not embedded in this design. Consequently, (CX,D) pairs are generated for these planes too, which places an extra burden on the arithmetic coder. An architecture for a JPEG 2000 encoder is presented in [16]. As a result of the applied hardware optimizations, a maximum throughput of 180 Ms/s is achieved with a 100 MHz clock and a compression rate of 0.333 bpp. That encoder core, at competitive area resources, provides superior frame rate and excellent compression quality for high-definition (HD) and full HD video material. In [17] the proposed BPC architecture processes bits serially and raises the operating frequency through an optimized data path design and an appropriate CB data handling technique. The estimated working frequency is 67 MHz on a Xilinx XC2V1000 target device. All these architectures require either a long encoding time or a large silicon area. Both area and execution time are critical parameters when developing an intellectual property core for real-time applications. The proposed EBCOT encoder improves performance through a compromise between both parameters.
3 Analysis of bit plane coding algorithm
Before starting the design of the hardware architecture, this section presents a detailed run-time analysis of the JPEG 2000 codec and an analysis of bit plane coding. The profile is obtained using the JasPer verification model software [19], which was written by the official JPEG 2000 development organization (ISO SC29) for the development and verification of the algorithms used by JPEG 2000. The test image is Lena, of size 512 × 512 pixels (8 bpp), and the compression is set to baseline mode (5/3 filter, 3-level wavelet decomposition, 32 × 32 code blocks).
Table 1 summarizes the profiling results for lossy and lossless compression of the 512 × 512 Lena test image. It shows that bit plane coding (BPC) is the most intensive part of the JPEG 2000 coder; it consumes 51.8 and 55 % of the total execution time for lossy and lossless compression, respectively.
In the software implementation of BPC, all passes are encoded sequentially. Coding a code block of size N × N with bp bit planes requires 3 × N × N × bp clock cycles. The analysis of this intensive block of the JPEG 2000 coder is performed in order to develop a new technique that reduces the large number of clock cycles required for processing.
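The sequential cycle count can be checked with a few lines of Python. The sketch below simply evaluates 3 × N × N × bp and scales it to a 512 × 512 image split into 32 × 32 code blocks; the function names are hypothetical, and every code block is assumed to carry all 8 bit planes.

```python
def sequential_bpc_cycles(n, bit_planes, passes=3):
    """Clock cycles to code one N x N code block sequentially (3 passes per bit plane)."""
    return passes * n * n * bit_planes


def code_blocks_per_image(image_size, cb_size):
    """Number of square code blocks in a square image (sizes assumed to divide evenly)."""
    return (image_size // cb_size) ** 2


cycles_per_cb = sequential_bpc_cycles(32, 8)          # one 32 x 32 block, 8 bit planes
blocks = code_blocks_per_image(512, 32)               # 512 x 512 image
cycles_per_image = blocks * cycles_per_cb
```

Under these assumptions a 512 × 512 image needs 256 × 24,576 = 6,291,456 clock cycles when coded sequentially.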
For the BPC analysis, four gray-scale ISO images (Lena, Barbara, Peppers and Baboon) are used. All test images are of size 512 × 512. A Le Gall 5/3 wavelet transform [5, 6] is developed in the MATLAB environment. Every test image is decomposed into three levels. Figure 5 shows a 3-level, 2-D DWT decomposition of the Lena image using the (5, 3) filter-bank.
Each sub-band of wavelet coefficients is partitioned into CBs of size 32 × 32, with coefficients stored in eight bits. Finally, using MATLAB simulation, all CBs in an image are encoded and (CX,D) pairs are generated. Table 2 presents the number of (CX,D) pairs generated by BPC for all test images.
For this analysis, three types of CB of size 32 × 32 are selected:
1. A CB in which all coefficients are positive.
2. A CB in which all coefficients are negative.
3. A CB in which a large number of coefficients are zero.
Table 1 Runtime of JPEG 2000 encoder
JPEG2000 modules            Lossy compression   Lossless compression
DWT                         20.1 %              12.2 %
Quantization                5.5 %               5.8 %
EBCOT (total)               70.4 %              71.8 %
  Bit plane coding          51.8 %              55.0 %
  MQ coder                  6.9 %               8.2 %
  Rate distortion control   11.7 %              13.9 %
Others                      4.0 %               4.9 %
Total                       100 %
The relationship between the CB and the magnitude of the wavelet coefficients is shown in Fig. 6. For the code block in which all samples are positive (Fig. 6a), the maximum and minimum values are 251 and 12, respectively. This CB is taken from the low-resolution LL3 sub-band of the Lena image. During CB encoding, the MSB plane (bp1) is skipped because all coefficients are insignificant. The MRP passes do not generate any (CX,D) pairs while encoding bit plane 2, as demonstrated in Fig. 7a. In this bit plane, a large number of magnitude samples become significant, so a large number of (CX,D) pairs is produced in the SPP passes and only a few pairs in the CUP passes. The last coefficients become significant in bit plane 6, after which no (CX,D) pairs are generated by SPP and CUP.
In the case of the CB with a large number of zero wavelet coefficients (Fig. 6b), one bit is needed for magnitude representation. The contribution of CUP (CX,D) pairs is much higher in all bit planes except the LSB, as shown in Fig. 7b.
For the CB from the high-resolution HH1 sub-band in which all samples are negative, the values range from -45 to -5 (see Fig. 6c). A much higher number of (CX,D) pairs is generated in the SPP passes while encoding bp 6, as demonstrated in Fig. 7c.
For the Lena test image of size 512 × 512, there are 256 CBs, of which CBs number 18, 256 and 56 are chosen for this analysis. In CB number 18, all samples have positive sign, whereas in CB number 256, all samples have negative sign. CB number 56 has a large number of zero coefficients (363). The analysis shows similar relationships between the contents of the CBs and the number of (CX,D) pairs generated.
Figure 8 presents the relationship between CBs and the different sub-bands. It is observed that the number of bits needed to represent the coefficient magnitudes increases as the level of decomposition increases.
Each CB has 1,024 samples. The number of bit planes to be processed by embedded block coding (EBC) and the number of code blocks are shown in Fig. 9, where number 8 represents the MSB plane. It is observed that only a small number of CBs have zero bit planes (ZBP) (7 blocks of the Lena image). For the Barbara image, all CBs need 6 bits (81 blocks) or 5 bits to represent the coefficient magnitudes. No CB has eight or seven zero bit planes in any of the test images.
This analysis is summarized as follows:
• Bit plane coding (BPC) is the most demanding block of the JPEG 2000 coder; it consumes 51.8 and 55 % of the total execution time for lossy and lossless compression, respectively.
• The number of (CX,D) pairs generated in MRP increases gradually from one bit plane to the next, whereas it decreases for SPP and CUP.
• In the first non-zero bit plane, the generated (CX,D) pairs depend on the content of the CB, while MRP does not generate any (CX,D) pair.
• A very large number of (CX,D) pairs is generated in MRP once most coefficients have become significant.
• The number of zero bit planes depends on the contents of the CBs and the type of sub-band. On average, skipping zero bit planes reduces the number of bit planes to be processed by about 30 % of the total for the test images used.
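The dependence of the zero bit plane count on the CB contents can be made concrete with a small Python sketch. It counts the leading zero bit planes of a code block from its largest magnitude; the function name `leading_zero_bit_planes` is hypothetical, and the default of 8 magnitude planes is an assumption taken from the eight-bit storage used in this analysis.

```python
def leading_zero_bit_planes(magnitudes, total_planes=8):
    """Number of all-zero bit planes above the first non-zero plane of a code block.

    `magnitudes` holds the coefficient magnitudes (non-negative integers).
    """
    needed = max(magnitudes, default=0).bit_length()  # planes actually carrying data
    if needed > total_planes:
        raise ValueError("magnitude does not fit in the given number of bit planes")
    return total_planes - needed
```

For example, a block whose largest magnitude is 45 needs 6 bit planes, so 2 of the 8 planes can be skipped, while an all-zero block skips all 8.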
It is easy to conclude that the number of contexts gen-
erated in a bit plane depends on the contents of the CB. To
Fig. 5 2-D, 3-level wavelet decomposition of Lena using the (5,3) filter-bank
Table 2 (CX,D) number generated for test images
Image Number of (CX,D) pairs
Lena 1,282,176
Barbara 1,490,628
Peppers 1,397,547
Baboon 1,708,678
Average 1,463,765
achieve the specifications imposed by DCI, an EBCOT architecture with both pass-parallel and concurrent sample coding must be adopted. Such an architecture is presented next in this paper.
4 Proposed architecture
The proposed embedded block coding architecture is shown in Fig. 10. After initialization of the state variables (σ, σ′, and η), the
Fig. 6 Contents of a CB in different type of blocks
Fig. 7 (CX,D) statistics generated from the analyzed CBs
context modeling controller reads a stripe, its sign and the state variables from six separate memories. This architecture uses single-port RAM for the sample magnitudes and signs, and dual-port RAM for the state variables. The 32 column stripes are coded with the coding operation blocks. For this purpose, four ZC, four SC, four MRC, and one RLC blocks are used in this architecture. In order to process all magnitude bits simultaneously, an entire magnitude column and the corresponding sign bit column are generated in one clock cycle by the column information generator. For each magnitude bit in a column stripe, four sign neighbors, eight σ, four σ′ and four η neighbors are required. In one clock cycle, 4–10 (CX,D) pairs are generated simultaneously. These pairs are sent to the MQ coder sequentially; this scheduling is done by the CX/D sequencer. When the last column stripe is coded, the state variables are updated for the next stripe.
The key building blocks of this architecture are as
follows.
• Data organization and memory arrangement.
• Column information generator.
• Coding operation.
• CX/D sequencer.
• Context modeling controller.
4.1 Definition of terms
To make the proposed architecture easier to follow for a wider range of readers, we first define the terms used to describe it, followed
Fig. 8 Contents of a CB for the different sub-bands
Fig. 9 Relationship between contents block and zero bit planes from
test images
by an explanation of the four basic coding operations and the three coding passes. The terms are defined in Table 3.
4.2 Data organization and memory arrangement
To access the data, state variable, magnitude and sign memories efficiently and to reduce the number of clock cycles required for memory access, we propose a new data arrangement and memory organization for bit-plane coding. First, a bit plane is mapped with zeros (Fig. 11). As shown in Fig. 11, the memory blocks are organized in six partitions (i.e. MEM0–MEM5) containing the stripe state variable data. The same structure is adopted for the magnitude bit planes and the sign coefficients. In a single clock cycle they supply the 32 columns to be processed together with their vertical neighbors (Fig. 12). After coding a stripe (32 columns), the updated state variables are read from the state variable memory in one cycle.
With this memory arrangement, the addressing complexity is reduced, and read and write operations can be performed in the same clock cycle.
4.3 Coding operation
There are four different types of operations used in the bit plane
coding process (three passes), namely, zero coding, sign cod-
ing, magnitude refinement coding, and run length coding [4].
4.3.1 Zero coding (ZC) context block
The ZC uses 9 contexts (i.e. from 0 to 8) among the possible 19. The context for the data in bit position X is formed from the 8 neighboring values (D0–D3, H0, H1, V0, V1) in the σ matrix, as shown in Fig. 13a. The data under consideration is the magnitude bit at position X.
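A minimal Python sketch of zero coding context formation is given below. It assumes the context assignment table of JPEG 2000 Part 1, in which the context depends on the numbers of significant horizontal (H0, H1), vertical (V0, V1) and diagonal (D0–D3) neighbors and on the sub-band type; the function name `zc_context` and its argument layout are hypothetical and not taken from the proposed hardware.

```python
def zc_context(h, v, d, subband="LL"):
    """Zero coding context (0..8) from significant-neighbour counts.

    h, v: number of significant horizontal / vertical neighbours (0..2).
    d: number of significant diagonal neighbours (0..4).
    subband: one of "LL", "LH", "HL", "HH".
    """
    if subband not in ("LL", "LH", "HL", "HH"):
        raise ValueError("unknown sub-band")
    if subband == "HH":
        hv = h + v
        if d >= 3:
            return 8
        if d == 2:
            return 7 if hv >= 1 else 6
        if d == 1:
            return 3 + min(hv, 2)
        return min(hv, 2)
    if subband == "HL":
        h, v = v, h  # HL uses the LL/LH table with H and V swapped
    if h == 2:
        return 8
    if h == 1:
        if v >= 1:
            return 7
        return 6 if d >= 1 else 5
    if v == 2:
        return 4
    if v == 1:
        return 3
    return min(d, 2)
```

The decision D paired with the returned context is the magnitude bit at position X in the current bit plane.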
4.3.2 Sign coding (SC) context block
The SC is a two-step process and uses 5 contexts (contexts 9–13). In the first step, the σ and χ values of the horizontal and vertical neighbors (Fig. 13b) are used to form the horizontal and vertical ‘contributions’ and an ‘xor’ bit [4]. In
Fig. 10 Top module of the proposed column-parallel context modeling
Table 3 Terms used in a code block
Category                Name     Description
Bit plane data          Vp[n]    The pth magnitude bit plane
Bit plane data          Xp[n]    The sign bit-plane
Coding state variable   σp[n]    The new significance state of the bit-plane p
Coding state variable   σ′p[n]   The new magnitude refinement (MR) state of the bit-plane p
Coding state variable   ηp[n]    The visited state of the bit-plane p
the second step, the context is formed from the two contributions, and the data bit is formed by the exclusive OR of the sign bit and the xor bit.
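The two-step sign coding can be sketched in Python as follows. The sketch assumes the sign context table of JPEG 2000 Part 1, with contexts numbered 9–13 as in this section; each neighbor is passed as a hypothetical `(significance, sign)` pair, where sign 1 denotes a negative coefficient, and the names `sign_coding` and `_contribution` are illustrative only.

```python
# (horizontal contribution, vertical contribution) -> (context, xor bit)
_SC_TABLE = {
    (1, 1): (13, 0), (1, 0): (12, 0), (1, -1): (11, 0),
    (0, 1): (10, 0), (0, 0): (9, 0), (0, -1): (10, 1),
    (-1, 1): (11, 1), (-1, 0): (12, 1), (-1, -1): (13, 1),
}


def _contribution(n0, n1):
    """Clamp the summed signed significance of two neighbours to -1, 0 or +1."""
    total = sum(sig * (1 - 2 * sign) for sig, sign in (n0, n1))
    return max(-1, min(1, total))


def sign_coding(sign_bit, h0, h1, v0, v1):
    """Return the (CX, D) pair for sign coding of one sample.

    Step 1 forms the horizontal and vertical contributions;
    step 2 looks up the context and XORs the sign bit with the xor bit.
    """
    h = _contribution(h0, h1)
    v = _contribution(v0, v1)
    cx, xor_bit = _SC_TABLE[(h, v)]
    return cx, sign_bit ^ xor_bit
```

With no significant neighbors, the context is 9 and the decision is the sign bit itself.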
4.3.3 Magnitude refinement coding (MRC) context block
The MRC uses three contexts (contexts 14–16). The context is formed according to whether or not magnitude refinement is being applied to the position for the first time, and according to the significance of its 8 immediate neighbors (Fig. 13a). The data is the bit itself.
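The MRC context selection reduces to a small decision, sketched below in Python. The mapping of the three cases to contexts 14, 15 and 16 follows the usual JPEG 2000 assignment and is an assumption here, as is the function name `mrc_context`.

```python
def mrc_context(first_refinement, significant_neighbours):
    """Magnitude refinement context (14..16).

    first_refinement: True if the sample has not been refined before
    (i.e. its refinement state variable is still zero).
    significant_neighbours: count of significant samples among the
    8 immediate neighbours.
    """
    if not first_refinement:
        return 16
    return 15 if significant_neighbours > 0 else 14
```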
4.3.4 Run length coding (RLC) contexts
This is used only when an immediately previously insig-
nificant sample is found to be significant during ZC, SC or
RLC operation. The sign information is encoded using one
of the five different context states that depend on the sign
and the significance of the immediate vertical and hori-
zontal neighbors. The architecture for RLC coding is
shown in Fig. 14.
RLC condition is checked and if it is satisfied then the
zero detector attached to the output of the bit register
determines whether all bits of the column are zeros or not.
Next, an encoder generates the bit position of the first
1-value bit in the column, which is needed during RL-
coding after coding the first 1-bit.
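The zero detector and position encoder of the RLC module can be modeled in a few lines of Python. The sketch below emits the (CX,D) pairs for one four-bit column, using context 17 for the run-length decision and context 18 for the two position bits, consistent with the hard-coded values listed in Table 4; the function name `rlc_column` and the return format are assumptions.

```python
CX_RUN = 17      # run-length context
CX_UNIFORM = 18  # context for the two position bits


def rlc_column(bits):
    """Run-length code one stripe column of four magnitude bits.

    Returns (pairs, position): the list of (CX, D) pairs and the index
    of the first 1-bit (None if the whole column is zero).
    """
    if len(bits) != 4:
        raise ValueError("a stripe column holds exactly four bits")
    if not any(bits):
        # Zero detector: the entire column is insignificant.
        return [(CX_RUN, 0)], None
    position = list(bits).index(1)  # position encoder: first 1-valued bit
    pairs = [
        (CX_RUN, 1),
        (CX_UNIFORM, (position >> 1) & 1),  # MSB of the position
        (CX_UNIFORM, position & 1),         # LSB of the position
    ]
    return pairs, position
```

The returned position marks where normal coding of the remaining samples of the column resumes after the first 1-bit.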
4.3.5 Context modeling controller block
This controller generates all the necessary control signals to enable all the modules in the architecture, including loading the stripe from block ROM, reading and writing the stripe memory, controlling the pointers used to access the CX/D sequencer, and generating the control signals for the BPC encoder.
Fig. 11 Code block lines and code block memories association
Fig. 12 Memory organization for bit plane
Fig. 13 Context windows for coding operations
Fig. 14 Architecture for RLC module
4.3.6 Column information generator
The data movement from the state variable memory to the state variable register is done column by column, as shown in Fig. 15. During zero coding and magnitude refinement coding, the significance register corresponding to the coded sample in the current column stripe may be updated when the stripe is coded.
In order to reduce the clock cycle time spent on the
memory access, the method which is adopted in a con-
ventional context modeling is suitable for vertical causal
context modeling.
4.3.7 CX/D sequencer
The stripe column is coded in one clock cycle. The CX/D sequencer sends the (CX,D) pairs to the MQ coder in a specific order. A multiplexer chooses the context from the outputs of the ZC context block, the SC context block, the MR context block or the hard-coded RLC contexts (17–18). The data bit is chosen from t, the sign data, v, the hard-coded RLC data bits (0, 1) or the ZI (MSB, LSB) bits. The context sequencer circuit is responsible for buffering these (CX,D) pairs in the proper order. The MUX is controlled by a 3-bit word (cntrl_cx). Based on the pass being performed, the controller generates the control word. The contexts and data for the different values of cntrl_cx are given in Table 4.
5 Experimental results and discussion
The proposed architecture is described in VHDL, synthesized with Xilinx ISE 10.1 and implemented on Virtex5 (XC5LX30T), Virtex4 (XC4VLX80) and Spartan 3A DSP 3400 (XC3SD3400a) FPGAs. Table 5 presents a summary of the design implementation. The maximum speed of the BPC is 372, 328 and 189 MHz, respectively. The total power consumption of the proposed design on the Virtex5 family FPGA (XC5VSX50T) was calculated using the XPower utility: the proposed architecture consumes 50 mW at 30.2 °C. All samples in a stripe column are processed in a single clock cycle, and 272 clock cycles are required to encode a bit plane.
Table 6 summarizes the runtime statistics for CBs of size 32 × 32. Each image comprises at least 587 zero bit planes. Encoding an image requires 396,924 clock cycles on average (6,291,456 clock cycles for a sequential architecture). The processing rate is obtained by computing the ratio between the total number of clock cycles required per image and the total number of coefficients in an image. Table 6 shows that the average processing rate of the proposed BPC is 2.0513.
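The cycle counts of Table 6 can be reproduced with the short Python sketch below. The per-image figures in that table equal the number of processed bit planes multiplied by 264; this constant is inferred from the tabulated values and is an assumption of the sketch (the text above quotes 272 cycles per bit plane). The variable names are hypothetical.

```python
# Cycles per processed bit plane, inferred from Table 6 (cycles/image divided
# by bit planes processed); treated here as an assumption.
CYCLES_PER_BIT_PLANE = 264

# Bit planes processed per image, taken from Table 6.
bit_planes_processed = {"Lena": 1390, "Baboon": 1717, "Peppers": 1546, "Barbara": 1361}

cycles = {name: bp * CYCLES_PER_BIT_PLANE for name, bp in bit_planes_processed.items()}
average_cycles = sum(cycles.values()) / len(cycles)

# Sequential reference: 3 passes x 512 x 512 samples x 8 bit planes.
sequential_cycles = 3 * 512 * 512 * 8
speedup = sequential_cycles / average_cycles
```

With these inputs the average is 396,924 clock cycles per image, roughly 15.85 times fewer than the 6,291,456 cycles of the sequential architecture.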
Table 7 presents the encoding cycle statistics and processing rate for CBs of size 64 × 64. Encoding a 512 × 512 image requires 219,120 clock cycles on average. The average processing rate of the proposed BPC is 2.66.
Fig. 15 The relationship of the state variable register and state
variable memory
Table 4 CX/D sequencer for BPC encoder
cntrl_cx CX Data
000 – –
001 ZC cx t
010 SC cx Sign data bit
011 MC cx 0
100 17 0
101 17 1
110 18 ZI[MSB]
111 18 ZI[LSB]
Table 5 Proposed embedded bit plane coding architectures
Used FPGA                XC5VLX50T      XC4VLX80       XC3SD3400a
Max. frequency (MHz)     372            328            189
No. of 4 input LUTs      228            337            340
Total used slices        201            193            188
Total FF slices          26             206            207
Power consumption (mW)   50 (30.2 °C)   79 (27.3 °C)   12 (30.2 °C)
This architecture has also been implemented on a Virtex FPGA for comparison purposes. Table 8 compares our proposed BPC architecture with some existing architectures. The maximum operating speed of this design is 186 MHz, which is 3.72 times faster than the BPC architecture proposed in [11]. The number of LEs required is reduced significantly owing to the reduced data width of the (CX,D) pairs. The size of the code block is 32 × 32 and the maximum coefficient bit width is 8 bits. However, the memory requirement is increased by 8 Kb. Very little memory is required in this architecture because a smaller CB size is used compared to other architectures.
We compare the processing time of the proposed bit plane-parallel scheme on the test images with sequential coding, pass-parallel context modeling (PPCM), group-of-column skipping (GOCS) and bit plane context modeling (BPCM) [20]. Table 9 gives the number of clock cycles used for the test images.
The test results show that our method greatly reduces the number of clock cycles used for encoding compared to other widely known methods.
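The reduction in clock cycles can be quantified directly from Table 9. The Python sketch below computes the percentage reduction of the proposed scheme relative to each reference method for the Lena and Baboon images; the helper name `reduction_percent` is hypothetical, and the figures are copied from Table 9.

```python
# Clock cycles per image, copied from Table 9.
table9 = {
    "Lena": {"Sequential": 4164352, "PPCM": 1312688, "GOCS": 1743283,
             "BPCM": 1116638, "Proposed": 366960},
    "Baboon": {"Sequential": 4947712, "PPCM": 1748956, "GOCS": 2106820,
               "BPCM": 1113828, "Proposed": 453288},
}


def reduction_percent(baseline, proposed):
    """Percentage of clock cycles saved relative to a baseline method."""
    return 100.0 * (1.0 - proposed / baseline)


reductions = {
    image: {method: reduction_percent(value, row["Proposed"])
            for method, value in row.items() if method != "Proposed"}
    for image, row in table9.items()
}
```

For Lena, this gives a reduction of about 67 % relative to BPCM and about 91 % relative to sequential coding.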
6 Conclusion
In this paper, we have designed and implemented an efficient parallel VLSI architecture for the most demanding block of the JPEG 2000 coder. This new architecture is based on parallel access to memory and parallel coding of one column stripe. When implemented on a Virtex5 family FPGA (XC5VSX50T), the BPC runs at 372 MHz. The designed EBCOT architecture is thus capable of encoding 44 video frames/s of high-definition TV of 1,920 pixels, and 40 frames/s at 2,048 × 1,080. The requirements of high-speed real-time image compression systems such as satellite imagery, medical imaging, cartography and others are satisfied by our architecture. Moreover, it improves the processing time by about 30 % compared to well-known techniques from the literature.
References
1. JPEG 2000 image coding system, ISO/IEC International Standard 15444-1. ITU Recommendation T.800 (2000)
2. ISO/IEC JTC1/SC29/WG1 N2678, document JPEG 2000 Part 1 020719 (final publication draft) (2002)
3. Taubman, D.S., Marcellin, M.W.: JPEG2000 image compression fundamentals, standards, and practice (2002)
4. Acharya, T., Tsai, P.: JPEG2000 Standard for Image Compression: Concepts Algorithms and VLSI Architectures. Wiley (2005)
5. Rabbani, M., Joshi, R.: An overview of the JPEG 2000 still image compression standard. Signal Process Image Commun 17(1), 3–48 (2002)
6. Lee, D.: JPEG 2000: Retrospective and new developments. Proc. IEEE 93(1), 32–41 (2005)
7. Das, A., Hazra, A., Banerjee, S.: An efficient architecture for 3-D discrete wavelet transform. IEEE Trans. Circuits Syst. Video Technol. 20(2), 286–296 (2010)
8. Delaunay, X., Chabert, M., Charvillat, V., Morin, G.: Satellite image compression by post-transforms in the wavelet domain. Signal Process. 90(2), 599–610 (2010)
9. Varma, K., Damecharla, H., Bell, A., Carletta, J., Back, G.: A fast JPEG 2000 encoder that preserves coding efficiency: The split arithmetic encoder. IEEE Trans. Circuits Syst. 55(11), 3711–3722 (2008)
10. Huang, Q., Zhou, R., Hong, Z.: Low memory and low complexity VLSI implementation of JPEG 2000 codec. IEEE Trans. Consum. Electron. 50(2), 638–646 (2004)
Table 6 Encoding cycles statistics and processing rate for CB with size 32 × 32
Image LZBP BP processed Cycles/image Processing rate
Lena 918 1,390 366,960 1.514
Baboon 587 1,717 453,288 2.925
Peppers 758 1,546 408,144 2.039
Barbara 788 1,361 359,304 1.727
Table 7 Encoding cycles statistics and processing rate for CB with size 64 × 64
Image LZBP BP processed Cycles/image Processing rate
Lena 189 387 204,336 2.047
Baboon 121 455 240,240 3.765
Boat 171 405 213,840 2.368
Peppers 156 420 221,760 2.692
Barbara 168 408 215,424 2.4286
Table 8 Comparison with existing EBCOT architectures
Architecture [11] [17] Proposed
Frequency (MHz) 50 67 186
No. of 4 input LUTs 7,071 2,488 340
Total used slices 4,420 2,149 186
Total FF slices 1,560 105 27
FPGA used XC2V1000-6
Table 9 Processing time of the proposed scheme and other methods (clock cycles)
Image Sequential PPCM GOCS BPCM [20] Proposed
Lena 4,164,352 1,312,688 1,743,283 1,116,638 366,960
Baboon 4,947,712 1,748,956 2,106,820 1,113,828 453,288
Pepper 4,550,656 – 1,880,388 1,119,712 408,144
11. Gangadhar, M., Bhatia, D.: FPGA based EBCOT architecture for JPEG 2000. Microprocess. Microsyst. 29(8–9), 363–373 (2005)
12. Li, Y., Bayoumi, M.: A three-level parallel high-speed low-power architecture for EBCOT of JPEG 2000. IEEE Trans. Circuits Syst. Video Technol. 16(9), 1153–1163 (2006)
13. Lian, C., Chen, K., Chen, H., Chen, L.: Analysis and architecture design of block-coding engine for EBCOT in JPEG 2000. IEEE Trans. Circuits Syst. Video Technol. 13(3), 219–230 (2003)
14. Zhang, C., Long, Y., Kurdahi, F.: A scalable embedded JPEG 2000 architecture. J. Syst. Archit. 53(8), 524–538 (2007)
15. Fang, H.-C., Chang, Y.-W., Wang, T.-C., Lian, C.-J., Chen, L.-G.: Parallel embedded block coding architecture for JPEG 2000. IEEE Trans. Circuits Syst. Video Technol. 15(9), 1086–1097 (2005)
16. Modrzyk, D., Staworko, M.: A high-performance architecture of JPEG2000 encoder. In: 19th European Signal Processing Conference (EUSIPCO 2011), September 2011
17. Sarawadekar, K., Banerjee, S.: A High Speed Bit Plane Coder for JPEG2000 and its FPGA Implementation. In: 17th European Signal Processing Conference (EUSIPCO 2009), September 2009
18. Liu, K., Zhou, Y., Song Li, Y., Ma, J.F.: A high performance MQ encoder architecture in JPEG2000. Integr. VLSI J. 43(3), 305–317 (2010)
19. JASPER software reference manual. ISO/IEC/JTC1/SC29/WG1N2415
20. Liu, K., Wu, C., Li, Y.: A high-performance VLSI architecture of EBCOT block coding in JPEG2000. J Electron (China) 23(1) (2006)
Author Biographies
Taoufik Saidani received his M.S. degree in Micro-electronics from
Faculty of Science of Monastir, Tunisia in 2007. His major research
interests include VLSI and embedded system in video compression.
Atri Mohamed, born in 1971, received his Ph.D. degree in Micro-
electronics from the Science Faculty of Monastir in 2001. He is
currently a member of the Laboratory of Electronics and Micro-
electronics. His research includes Circuit and System Design, Image
processing, Network Communication, IPs and SoCs.
Lazhar Khriji received his BS degree in electronics, and his MS and
PhD degrees in electrical engineering from University of Tunis II,
Tunisia, in 1990, 1992 and 1999, respectively. In 2002, he received
the Doctor of Technology degree in Information Technology from
Signal Processing Institute, Tampere University of Technology,
Finland. Dr. Khriji is currently Associate Professor at University of
Sousse, Tunisia. Since 2002, he has been on sabbatical leave at Sultan
Qaboos University, Oman. From 1997 to 1999, he was a research
scientist in the Research Institute for Information Technology,
Tampere, Finland. His research interests include signal and image
processing and analysis, nonlinear filtering, adaptive filtering, image
coding, image encryption, genetic algorithms, fuzzy logic, and
hardware implementation of DSP algorithms.
Rached Tourki was born in Tunisia, on May 13, 1948. He received
the B.S. degree in Physics (Electronics option) from Tunis University,
in 1970; the M.S. and the Doctorat de 3eme cycle in Electronics from
Institut d’Electronique d’Orsay, Paris-south University in 1971 and
1973, respectively. From 1973 to 1974 he served as Micro-electronics
engineer in Thomson-CSF. He received the Doctorat d’etat in Physics
from Nice University in 1979. Since this date he has been professor in
Micro-electronics and Microprocessors with the physics department,
Faculte des Sciences de Monastir. His current research interests
include digital signal processing and hardware–software co-design for
rapid prototyping in telecommunications.