An efficient hardware implementation of parallel EBCOT algorithm for JPEG 2000



ORIGINAL RESEARCH PAPER

An efficient hardware implementation of parallel EBCOT algorithm for JPEG 2000

Taoufik Saidani • Mohamed Atri • Lazhar Khriji • Rached Tourki

Received: 10 July 2012 / Accepted: 4 January 2013

© Springer-Verlag Berlin Heidelberg 2013

Abstract With the advance of multimedia technology, demand for high-speed real-time image compression systems has also increased. The JPEG 2000 still image compression standard was developed to accommodate such application requirements. Embedded block coding with optimized truncation (EBCOT) is an essential and computationally very demanding part of the JPEG 2000 compression process. Various applications, such as satellite imagery, medical imaging and digital cinema, require a high-speed, high-performance EBCOT architecture. In the JPEG 2000 standard, the context formation block of EBCOT tier-1 is computationally complex and becomes the bottleneck of the system. In this paper, we propose a fast and efficient VLSI hardware architecture for context formation in EBCOT tier-1. A high-speed parallel bit-plane coding (BPC) hardware architecture for the EBCOT module in JPEG 2000 is proposed and implemented. Experimental results show that our design outperforms well-known techniques with respect to processing time: the reduction can reach 70 % compared to bit-plane sequential processing.

Keywords JPEG 2000 · EBCOT algorithm · Bit-plane coding · VHDL · FPGA implementation

1 Introduction

The demand for good compression techniques for multimedia applications keeps increasing, in order to provide excellent visual quality as well as efficient solutions to the end user. The algorithms supporting these features are very costly in terms of computational time and complexity. For instance, JPEG 2000 [1] is the latest international standard for still image compression and supports a rich set of features. Compared to the existing JPEG image compression techniques, this standard not only achieves better compression ratios but also offers additional features. The features supported by JPEG 2000 include lossy and lossless compression, continuous-tone and bi-level compression, progressive transmission by pixel accuracy and resolution, region-of-interest coding, compressed-domain processing, and error resilience [2, 3].

The JPEG 2000 encoder architecture includes the component transform, the discrete wavelet transform, quantization and embedded block coding with optimized truncation (EBCOT) [1, 2]. It differs fundamentally from the original JPEG standard. First, JPEG 2000 replaces JPEG's discrete cosine transform (DCT) frequency decomposition with a discrete wavelet transform (DWT). The JPEG 2000 standard specifies two kinds of wavelet transform: (1) the integer 5/3 transform for lossless image compression and (2) the 9/7 transform intended for lossy compression [1, 2].

The DWT is a multi-resolution decomposition, which allows for resolution scalability within the embedded bit stream. In addition, the DWT exhibits better energy compaction than the DCT, allowing for superior compression efficiency. The DWT is typically applied to an image tile or to the entire image as a whole. This large-scale application allows the DWT to minimize the blocking artifacts that plagued the 8 × 8 DCT in the original JPEG [3, 4]. The second major deviation of JPEG 2000 from its predecessor is the abandonment of the Huffman entropy encoding scheme in favor of an adaptive binary arithmetic coder. JPEG 2000 uses the embedded block coding with optimized truncation (EBCOT) algorithm to arithmetically encode the DWT coefficients [1, 2]. EBCOT works at the bit-plane level, generating a neighborhood context for each coded bit. Because it must touch every bit of every bit-plane in the image, EBCOT dominates the processing time (50–75 %, depending on the source image) [3, 4]. In addition, because EBCOT works solely at the bit-plane level, it is fairly inefficient to implement in software. For these two reasons, the EBCOT algorithm warrants a custom FPGA implementation: it is the system bottleneck, and a fast implementation is needed to speed up the overall encoding process.

T. Saidani (✉) · M. Atri · R. Tourki
Electronics and Micro-Electronics Laboratory, Faculty of Sciences, Monastir University, Monastir, Tunisia
e-mail: [email protected]

M. Atri
e-mail: [email protected]

R. Tourki
e-mail: [email protected]

L. Khriji
Electrical and Computer Engineering Department, Sultan Qaboos University, Muscat, Oman
e-mail: [email protected]

J Real-Time Image Proc
DOI 10.1007/s11554-013-0322-9

This paper is organized as follows: Sect. 2 provides an overview of EBCOT and of related work on existing architectures for embedded block coding. Section 3 describes the software analysis of the EBCOT algorithm, with some discussion. The proposed BPC architecture is presented in Sect. 4. In Sect. 5 experimental results are given and compared with other architectures. Finally, conclusions are drawn in Sect. 6.

2 Overview of JPEG 2000 encoding system

The block diagram of the JPEG 2000 coder is shown in Fig. 1. The input image frame is partitioned into rectangular, non-overlapping tiles. All unsigned image samples are DC level shifted so that they are centered on zero [1, 3]. Next, color space conversion from RGB to YCbCr is performed, and the wavelet transform is applied independently to each channel of the image. The DWT coefficients are supplied to the EBCOT coder, which comprises two processing stages: the bit plane coder (BPC) and the MQ coder [2, 4, 5].

The BPC generates context and decision (CX,D) pairs, which are supplied to the MQ coder [4]. The latter performs entropy coding and produces the embedded bit stream.

2.1 Bit plane coding

The digital cinema initiative (DCI) [1, 3, 4] standard is confined to the 5/3 DWT filter [4–6]. The coefficients generated by the wavelet transform are in 2's complement format. The bit plane coding is shown in Fig. 2.

These coefficients are converted to sign-magnitude format and stored in the code block (CB) memory. In DCI applications, the CB size is restricted to 32 × 32 [3, 4]. The CB memory comprises a sign plane and several magnitude planes, as shown in Fig. 3a. During encoding, the magnitude bit planes are scanned from the most significant bit (MSB) plane to the least significant bit (LSB) plane [7, 8]. Each bit plane is further divided into stripes of four rows (Fig. 3b). Samples are scanned stripe by stripe: within a stripe the columns are visited from left to right, and within a column the samples are visited from top to bottom, as shown in Fig. 3c. Three coding passes are used to examine each sample in a CB, namely the significance propagation pass (SPP), the magnitude refinement pass (MRP) and the clean-up pass (CUP) [1, 3, 4]. The coding pass applied to a sample is determined by forming a context window surrounding it, as shown in Fig. 3d, together with three state variables σ, σ′ and η: respectively the significance state variable, the magnitude refinement state variable and the pass membership state variable. While the wavelet coefficients are scanned, a coefficient is considered significant as soon as its first non-zero magnitude bit occurs, and its σ is set to one.
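The data layout and scan order described in this section can be illustrated with a short software sketch. It is not the authors' hardware: the function names are hypothetical, plane index 0 is assumed to be the LSB plane, and a sign bit of 1 is assumed to mean a negative coefficient.

```python
from typing import Iterator, List, Tuple

Plane = List[List[int]]


def to_sign_magnitude(coeffs: Plane) -> Tuple[Plane, Plane]:
    """Split signed coefficients into a sign plane (1 = negative) and magnitudes."""
    signs = [[1 if c < 0 else 0 for c in row] for row in coeffs]
    mags = [[abs(c) for c in row] for row in coeffs]
    return signs, mags


def bit_plane(mags: Plane, p: int) -> Plane:
    """Return magnitude bit plane p (assumption: p = 0 is the LSB plane)."""
    return [[(m >> p) & 1 for m in row] for row in mags]


def stripe_scan(height: int, width: int, stripe_height: int = 4) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) in stripe order: stripes from top to bottom, columns
    from left to right within a stripe, samples from top to bottom in a column."""
    for top in range(0, height, stripe_height):
        for col in range(width):
            for row in range(top, min(top + stripe_height, height)):
                yield row, col


signs, mags = to_sign_magnitude([[3, -5], [0, -1]])
first_visits = list(stripe_scan(8, 2))[:5]  # four samples of column 0, then column 1
```

A hardware coder would read whole stripe columns at once instead of iterating sample by sample; the generator only makes the visiting order explicit.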

Fig. 1 Overview of JPEG 2000 coding process

Fig. 2 Functional block diagram of EBCOT Tier-1 algorithm

When MRP is run for the first time on a coefficient, σ′ is set to one, and if zero coding (ZC) is applied to a coefficient in SPP, η is set to one. Initially, all state variables are zero. Encoding starts from the first non-zero magnitude bit plane. Only CUP is run on this bit plane, whereas the remaining bit planes are coded sequentially using SPP, MRP and CUP [3, 4]. Four coding primitives, zero coding (ZC), sign coding (SC), run length coding (RLC) and magnitude refinement coding (MRC), are used to encode the samples and generate the (CX,D) pairs [10, 11].
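As a rough illustration of how a sample is assigned to one of the three passes, the sketch below classifies a sample from a snapshot of the significance map σ. It follows the usual JPEG 2000 rule (already significant: MRP; insignificant with at least one significant neighbor: SPP; otherwise: CUP) and is a simplification: it ignores the updates of σ and η that occur while a bit plane is being scanned. The function names are hypothetical.

```python
from typing import List


def has_significant_neighbor(sigma: List[List[int]], r: int, c: int) -> bool:
    """True if any of the 8 neighbors of (r, c) is significant.
    Neighbors outside the code block are treated as insignificant."""
    rows, cols = len(sigma), len(sigma[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and sigma[rr][cc]:
                return True
    return False


def classify_pass(sigma: List[List[int]], r: int, c: int) -> str:
    """Return the pass ('SPP', 'MRP' or 'CUP') that would code sample (r, c),
    based only on the current snapshot of the significance map."""
    if sigma[r][c]:
        return "MRP"
    if has_significant_neighbor(sigma, r, c):
        return "SPP"
    return "CUP"


sigma = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
labels = [[classify_pass(sigma, r, c) for c in range(4)] for r in range(3)]
```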

2.2 MQ coder

The compression technique adopted in the JPEG 2000 standard is an adaptive binary arithmetic coder, called the MQ coder [1, 3, 18]. The MQ coder uses the context (CX) to select the probability estimate with which the decision (D) is compressed. In the MQ coder, each symbol in the code stream is classified as either the most-probable symbol (MPS) or the least-probable symbol (LPS) [18]. The basic operation of the MQ coder is to subdivide the interval recursively according to the probability of the input symbols.

Figure 4 shows the interval calculation for MPS and LPS in JPEG 2000 [4, 18]. Whether an MPS or an LPS is coded, the new interval is shorter than the original one. To avoid finite-precision problems, whenever the length of the probability interval falls below a certain minimum size, the interval is renormalized so that it exceeds the minimum bound.
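The interval subdivision and renormalization just described can be sketched as follows. This is a deliberately reduced model of one coding step, assuming the usual 16-bit interval register with a renormalization threshold of 0x8000; it omits the probability state transitions, carry handling and byte output of a complete MQ coder, and the function name is hypothetical.

```python
from typing import Tuple


def mq_code_symbol(a: int, c: int, qe: int, is_mps: bool) -> Tuple[int, int, int]:
    """Update the interval register a and code register c for one decision.

    qe is the current LPS probability estimate. Returns (a, c, shifts), where
    shifts is the number of renormalization shifts that were performed.
    """
    a -= qe  # tentatively give the upper sub-interval to the MPS
    if is_mps:
        if a & 0x8000:
            c += qe          # interval still large enough, no renormalization
        elif a < qe:
            a = qe           # conditional exchange: take the larger sub-interval
        else:
            c += qe
    else:
        if a < qe:
            c += qe          # conditional exchange
        else:
            a = qe
    shifts = 0
    while not (a & 0x8000):  # renormalize until a is back above the minimum
        a <<= 1
        c <<= 1
        shifts += 1
    return a, c, shifts


# Illustrative values only: a starts at 0x8000 and qe = 0x5601.
a_mps, c_mps, n_mps = mq_code_symbol(0x8000, 0, 0x5601, True)
a_lps, c_lps, n_lps = mq_code_symbol(0x8000, 0, 0x5601, False)
```

In both calls the interval register ends up at or above 0x8000 again, which is the renormalization property the text refers to.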

2.3 Related works

In [9] a serial architecture for EBCOT tier-1 is presented. A code block of size N × N (8 bit planes) needs 3 × N × N × bp clock cycles. The embedded block coder operates at 18.5 MHz on the APEX20KE family from Altera. The architecture proposed in [10] requires a very large number of clock cycles to process an image, and its drawback is a very high memory requirement. Parallel coding of two stripe columns is proposed in [11]: the coding passes are executed concurrently using two processing elements, while the four bits of a stripe column are still processed sequentially. This architecture works at 50 MHz on a Xilinx Virtex II. In [12] all bit planes are coded in parallel: (0.35–0.46) × N × N clock cycles are required to code an N × N block. It works at 150 MHz.

To code all samples of a stripe column in one cycle, a pass-parallel architecture is proposed in [13]. This design works at 100 MHz and codes 50 Ms/s (mega samples per second). In [14] a co-design architecture is proposed. The computation-intensive part (DWT, BPC and MQ coder) of the JPEG 2000 coder is implemented on an FPGA, while the other blocks of the encoder are implemented on a powerful PC. This architecture is capable of coding one stripe column in 2–5 clock cycles. An architecture for concurrent coding of all bit planes is presented in [15]. In this approach, the need for state variable memory is completely eliminated. Further, one arithmetic coder is shared between two bit planes. However, knowledge of the leading zero bit planes is not embedded in this design. Consequently, (CX,D) pairs are generated for these planes too, which places an extra burden on the arithmetic coder. An architecture for a JPEG 2000 encoder is presented in [16]. As a result of the applied hardware optimizations, a maximum throughput of 180 Ms/s is achieved at a 100 MHz clock and a compression rate of 0.333 bpp. That encoder core, at competitive area resources, provides superior frame rate and excellent compression quality for high-definition (HD) and full HD video material. In [17] the proposed BPC architecture processes bits serially and raises the operating frequency through an optimized data path design and an appropriate CB data handling technique. The estimated working frequency is 67 MHz on a Xilinx XC2V1000 target device. All these architectures require either a large encoding time or a large silicon area. Both area and execution time are critical parameters when developing an intellectual property core for real-time applications. The proposed EBCOT encoder improves performance through a compromise between the two.

Fig. 3 a Bit-plane representation of a code block consisting of 8 magnitude bit-planes of dimension 32 × 32. b Each bit-plane consists of stripes made up of four rows. c Stripe-based scanning order for every pass. d Context window for a sample location

Fig. 4 JPEG 2000 arithmetic encoding procedure

3 Analysis of bit plane coding algorithm

Before the hardware architecture is designed, this section presents a detailed run-time analysis of the JPEG 2000 codec and an analysis of bit plane coding. The profile was obtained with the JasPer implementation (verification model software) [19], which is written by the official JPEG 2000 development organization (ISO SC29) for the development and verification of the algorithms used by JPEG 2000. The test image is Lena, of size 512 × 512 pixels (8 bpp), and the compression is set to baseline mode (5/3 filter, 3-level wavelet decomposition, 32 × 32 code blocks).

Table 1 summarizes the profiling results for lossy and lossless compression of the 512 × 512 Lena test image. It shows that bit plane coding (BPC) is the most intensive part of the JPEG 2000 coder; it consumes 51.8 and 55 % of the total execution time for lossy and lossless compression, respectively.

In the software implementation of BPC, all passes are encoded sequentially. Coding a code block of size N × N with bp bit planes requires 3 × N × N × bp clock cycles. This most intensive block of the JPEG 2000 coder is analyzed in order to develop a new technique that reduces the large number of clock cycles required.
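The sequential cycle count can be checked with a few lines of arithmetic. The sketch below simply evaluates 3 × N × N × bp; the helper name is hypothetical, and the image-level figure assumes 256 code blocks of 32 × 32 with 8 bit planes each, as used for the 512 × 512 test images in this paper.

```python
def sequential_bpc_cycles(n: int, bp: int, passes: int = 3) -> int:
    """Clock cycles for sequential coding of one N x N code block with bp bit
    planes: every sample is visited once per pass and per bit plane."""
    return passes * n * n * bp


per_block = sequential_bpc_cycles(32, 8)   # 3 * 32 * 32 * 8 = 24,576
per_image = 256 * per_block                # 256 code blocks -> 6,291,456
```

The image total of 6,291,456 cycles agrees with the figure quoted for the sequential architecture in Sect. 5.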

For the BPC analysis, four gray-scale ISO images (Lena, Barbara, Peppers and Baboon) are used. All test images are of size 512 × 512. A Le Gall 5/3 wavelet transform [5, 6] was developed in the MATLAB environment. Every test image is decomposed into three levels. Figure 5 shows a 3-level 2-D DWT decomposition of the Lena image using the (5, 3) filter bank.
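For readers who want to reproduce the decomposition, a one-dimensional reversible 5/3 lifting step can be sketched as below. This is not the authors' MATLAB code: it is a minimal Python version of the standard integer lifting equations, assuming an even-length signal and whole-sample symmetric extension at the borders. A 2-D level is obtained by applying it to the rows and then to the columns.

```python
from typing import List, Tuple


def dwt53_forward(x: List[int]) -> Tuple[List[int], List[int]]:
    """One level of the reversible 5/3 lifting transform on an even-length signal.
    Returns (s, d): low-pass and high-pass coefficients."""
    n = len(x)
    half = n // 2
    # Predict: d[k] = x[2k+1] - floor((x[2k] + x[2k+2]) / 2), with x[n] mirrored to x[n-2]
    d = []
    for k in range(half):
        right = x[2 * k + 2] if 2 * k + 2 < n else x[n - 2]
        d.append(x[2 * k + 1] - ((x[2 * k] + right) >> 1))
    # Update: s[k] = x[2k] + floor((d[k-1] + d[k] + 2) / 4), with d[-1] mirrored to d[0]
    s = []
    for k in range(half):
        left = d[k - 1] if k > 0 else d[0]
        s.append(x[2 * k] + ((left + d[k] + 2) >> 2))
    return s, d


def dwt53_inverse(s: List[int], d: List[int]) -> List[int]:
    """Exact inverse of dwt53_forward (the transform is lossless)."""
    half = len(s)
    even = []
    for k in range(half):
        left = d[k - 1] if k > 0 else d[0]
        even.append(s[k] - ((left + d[k] + 2) >> 2))
    x = []
    for k in range(half):
        right = even[k + 1] if k + 1 < half else even[k]
        x.append(even[k])
        x.append(d[k] + ((even[k] + right) >> 1))
    return x


signal = [10, 12, 14, 13, 9, 7, 8, 20]
low, high = dwt53_forward(signal)
restored = dwt53_inverse(low, high)  # equals signal, since the transform is reversible
```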

Each sub-band of wavelet coefficients is partitioned into CBs of size 32 × 32, with the coefficients stored in eight bits. Finally, using MATLAB simulation, all CBs in an image are encoded and the (CX,D) pairs are generated. Table 2 presents the number of (CX,D) pairs generated by BPC for all test images.
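The number of code blocks produced by this partitioning can be counted per sub-band. The sketch below assumes a square image whose sub-band sizes halve at each decomposition level; the function name and the sub-band labels used as dictionary keys are illustrative.

```python
from typing import Dict


def code_blocks_per_subband(image_size: int = 512, levels: int = 3,
                            cb_size: int = 32) -> Dict[str, int]:
    """Count the cb_size x cb_size code blocks in every sub-band of a
    levels-deep dyadic decomposition of a square image."""
    counts: Dict[str, int] = {}
    for level in range(1, levels + 1):
        side = image_size >> level            # sub-band side length at this level
        blocks = (-(-side // cb_size)) ** 2   # ceiling division, then squared
        for band in ("HL", "LH", "HH"):
            counts[f"{band}{level}"] = blocks
    ll_side = image_size >> levels
    counts[f"LL{levels}"] = (-(-ll_side // cb_size)) ** 2
    return counts


counts = code_blocks_per_subband()
total_blocks = sum(counts.values())  # 3*64 + 3*16 + 3*4 + 4 = 256
```

For a 512 × 512 image with three levels this gives 256 code blocks, the same number of CBs reported for the Lena image in this section.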

For this analysis, three types of 32 × 32 CB are selected:

1. A CB in which all coefficients are positive.

2. A CB in which all coefficients are negative.

3. A CB in which the largest number of coefficients are zero.

Table 1 Runtime of JPEG 2000 encoder

JPEG 2000 module            Lossy compression   Lossless compression
DWT                         20.1 %              12.2 %
Quantization                5.5 %               5.8 %
EBCOT (total)               70.4 %              71.8 %
  Bit plane coding          51.8 %              55.0 %
  MQ coder                  6.9 %               8.2 %
  Rate distortion control   11.7 %              13.9 %
Others                      4.0 %               4.9 %
Total                       100 %               100 %

The relationship between CBs and the magnitude of the wavelet coefficients is shown in Fig. 6. For the code block in which all samples are positive (Fig. 6a), the maximum and minimum values are 251 and 12, respectively. This CB is taken from the low-resolution LL3 sub-band of the Lena image. During CB encoding, the MSB plane (bp1) is skipped because all coefficients are insignificant. The MRP pass does not generate any (CX,D) pairs while bit plane 2 is encoded, as demonstrated in Fig. 7a. In this bit plane, a large number of samples become significant: a huge number of (CX,D) pairs is produced in the SPP pass and only a few pairs in the CUP pass. Once the last coefficients have become significant in bit plane 6, no further (CX,D) pairs are generated by SPP and CUP.

For the CB with the maximum number of zero wavelet coefficients (Fig. 6b), one bit is needed for magnitude representation. The contribution of CUP to the (CX,D) pairs is much higher in all bit planes except the LSB, as shown in Fig. 7b.

In the high-resolution sub-band (HH1), for a CB in which all samples are negative, the values range from −45 to −5 (see Fig. 6c). A much higher number of (CX,D) pairs is generated in the SPP pass while bp 6 is encoded, as demonstrated in Fig. 7c.

For the 512 × 512 Lena test image there are 256 CBs, of which CBs number 18, 256 and 56 are chosen for this analysis. In CB number 18 all samples are positive, whereas in CB number 256 all samples are negative. CB number 56 has a large number of zero coefficients (363). The analysis shows similar relationships between the contents of the CBs and the number of (CX,D) pairs generated.

Figure 8 presents the relationship between CBs and the different sub-bands. It is observed that the number of bits needed to represent the coefficient magnitudes increases with the level of decomposition.

Each CB has 1,024 samples. The number of bit planes to be processed by the embedded block coder (EBC) and the corresponding number of code blocks are shown in Fig. 9, where number 8 represents the MSB plane. It is observed that a small number of CBs have zero bit planes (ZBP) (7 blocks of the Lena image). For the Barbara image, all CBs need 6 bits (81 blocks) or 5 bits to represent the coefficient magnitudes. No CB has eight or seven zero bit planes in any of the test images.
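The number of zero bit planes of a code block follows directly from its largest coefficient magnitude. The helper below is a sketch with hypothetical names; it assumes magnitudes are stored in 8 bit planes, and the example uses the −45 to −5 value range quoted above for the all-negative HH1 block.

```python
from typing import Iterable


def magnitude_bits(coeffs: Iterable[int]) -> int:
    """Bits needed to represent the largest magnitude in the block (0 if all zero)."""
    return max(abs(c) for c in coeffs).bit_length()


def zero_bit_planes(coeffs: Iterable[int], total_planes: int = 8) -> int:
    """Number of leading all-zero magnitude bit planes that can be skipped."""
    return total_planes - magnitude_bits(list(coeffs))


block = list(range(-45, -4))        # values from -45 to -5
skipped = zero_bit_planes(block)    # 45 needs 6 bits, so 2 of the 8 planes are zero
```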

This analysis is summarized as follows:

• Bit plane coding (BPC) is the most demanding block of the JPEG 2000 coder; it consumes 51.8 and 55 % of the total execution time for lossy and lossless compression, respectively.

• The number of (CX,D) pairs generated in MRP increases gradually from one bit plane to the next, whereas it decreases for SPP and CUP.

• In the first non-zero bit plane, the generated (CX,D) pairs depend on the content of the CB, and MRP does not generate any (CX,D) pair.

• A very large number of (CX,D) pairs is generated in MRP once most of the coefficients have become significant.

• The number of zero bit planes depends on the contents of the CBs and the type of sub-band. On average, the number of bit planes to be processed can be about 30 % less than the total number of bit planes for the test images used.

It is easy to conclude that the number of contexts generated in a bit plane depends on the contents of the CB. To achieve the specifications imposed by DCI, an EBCOT architecture that combines parallel pass coding with concurrent sample coding must be adopted. Such an architecture is presented next in this paper.

Fig. 5 2-D, 3-level wavelet decomposition of Lena using the (5,3) filter-bank

Table 2 Number of (CX,D) pairs generated for the test images

Image     Number of (CX,D) pairs
Lena      1,282,176
Barbara   1,490,628
Peppers   1,397,547
Baboon    1,708,678
Average   1,463,765

4 Proposed architecture

The proposed embedded block coder is shown in Fig. 10. After initialization of the state variables (σ, σ′ and η), the context modeling controller reads a stripe, together with its sign and state variables, from six separate memories. The architecture uses single-port RAM for the sample magnitudes and signs, and dual-port RAM for the state variables. The 32 column stripes are coded by the coding operation blocks. To this end, four ZC, four SC, four MRC and one RLC blocks are used. In order to process all magnitude bits simultaneously, an entire magnitude column and the corresponding column of sign bits are generated in one clock cycle by the column information generator. For each magnitude bit in a column stripe, four sign neighbors, eight σ, four σ′ and four η neighbors are required. In one clock cycle, 4–10 (CX,D) pairs are generated simultaneously. These pairs are sent to the MQ coder sequentially, a scheduling performed by the CX/D sequencer. When the last column stripe is coded, the state variables are updated for the next stripe.

Fig. 6 Contents of a CB in different types of blocks

Fig. 7 (CX,D) statistics generated from the analyzed CBs

The key building blocks of this architecture are as follows.

• Data organization and memory arrangement.

• Column information generator.

• Coding operation.

• CX/D sequencer.

• Context modeling controller.

4.1 Definition of terms

To make the proposed architecture easier to follow for a wider range of readers, we first define the terms used to describe it, followed by an explanation of the four basic coding operations and the three coding passes. The terms are defined in Table 3.

Fig. 8 Contents of a CB for the different sub-bands

Fig. 9 Relationship between contents block and zero bit planes from test images

4.2 Data organization and memory arrangement

To access the data, state variable, magnitude and sign memories efficiently and to reduce the number of clock cycles required for memory access, we propose a new data arrangement and memory organization for bit-plane coding. First, a bit plane is mapped with zeros (Fig. 11). As shown in Fig. 11, the memory blocks are organized in six partitions (MEM0–MEM5) containing the stripe state variable data. The same structure is adopted for the magnitude bit planes and the sign coefficients. In a single clock cycle the memories supply the 32 columns to be processed together with their vertical neighbors (Fig. 12). After a stripe (32 columns) is coded, the updated state variables are read from the state variable memory in one cycle.

With this memory arrangement, the addressing complexity is reduced, and read and write operations can be performed in the same clock cycle.

4.3 Coding operation

Four different types of operations are used in the bit plane coding process (three passes), namely zero coding, sign coding, magnitude refinement coding and run length coding [4].

4.3.1 Zero coding (ZC) context block

ZC uses 9 contexts (0–8) of the 19 possible. The context for the data at bit position X is formed from the significance states σ of its 8 neighbors (D0–D3, H0, H1, V0, V1), as shown in Fig. 13a. The data to be coded is the magnitude bit at position X.
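The mapping from neighbor significance to a ZC context label can be sketched in software as follows. The assignment follows my reading of the zero coding context table of the JPEG 2000 standard and should be checked against the standard before use; the function name and the sub-band argument are illustrative and are not taken from the proposed hardware.

```python
def zc_context(h: int, v: int, d: int, band: str = "LL") -> int:
    """Zero coding context label (0-8).

    h, v, d: number of significant horizontal (0-2), vertical (0-2) and
    diagonal (0-4) neighbors. band: 'LL', 'LH', 'HL' or 'HH'.
    """
    if band == "HL":
        h, v = v, h                      # HL uses the LL/LH rules with h and v swapped
    if band in ("LL", "LH", "HL"):
        if h == 2:
            return 8
        if h == 1:
            if v >= 1:
                return 7
            return 6 if d >= 1 else 5
        if v == 2:
            return 4
        if v == 1:
            return 3
        return min(d, 2)                 # h = v = 0: context 0, 1 or 2
    if band == "HH":
        hv = h + v
        if d >= 3:
            return 8
        if d == 2:
            return 7 if hv >= 1 else 6
        if d == 1:
            return 3 + min(hv, 2)        # context 3, 4 or 5
        return min(hv, 2)                # d = 0: context 0, 1 or 2
    raise ValueError(f"unknown sub-band: {band}")


isolated = zc_context(0, 0, 0)           # no significant neighbor -> context 0
both_sides = zc_context(2, 0, 0, "LH")   # both horizontal neighbors significant -> context 8
```

In hardware this table is typically a small combinational block, which is why four such ZC blocks can evaluate one stripe column in parallel.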

4.3.2 Sign coding (SC) context block

SC is a two-step process and uses 5 contexts (contexts 9–13). In the first step, the significance (σ) and sign of the horizontal and vertical neighbors (Fig. 13b) are used to form the horizontal and vertical 'contributions' and an 'xor' bit [4]. In the second step, the context is formed from the two contributions, and the data bit is formed by the exclusive OR of the sign bit and the xor bit.

Fig. 10 Top module of the proposed column-parallel context modeling

Table 3 Terms used in a code block

Category                Name     Description
Bit plane data          Vp[n]    The pth magnitude bit plane
                        Xp[n]    The sign bit-plane
Coding state variable   σp[n]    The new significance state of the bit-plane p
                        σ′p[n]   The new magnitude refinement (MR) state of the bit-plane p
                        ηp[n]    The visited state of the bit-plane p
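The two-step sign coding procedure of Sect. 4.3.2 can be sketched as below. The contribution rule and the lookup table follow my reading of the sign coding table of the JPEG 2000 standard and should be verified against it; the function names are hypothetical and a sign bit of 1 is assumed to mean negative.

```python
from typing import Dict, Tuple


def contribution(sig_a: int, neg_a: int, sig_b: int, neg_b: int) -> int:
    """Combine two opposite neighbors (left/right or up/down) into -1, 0 or +1.
    A significant positive neighbor counts +1, a significant negative one -1."""
    total = 0
    for sig, neg in ((sig_a, neg_a), (sig_b, neg_b)):
        if sig:
            total += -1 if neg else 1
    return max(-1, min(1, total))


# (horizontal contribution, vertical contribution) -> (context, xor bit)
_SC_TABLE: Dict[Tuple[int, int], Tuple[int, int]] = {
    (1, 1): (13, 0), (1, 0): (12, 0), (1, -1): (11, 0),
    (0, 1): (10, 0), (0, 0): (9, 0), (0, -1): (10, 1),
    (-1, 1): (11, 1), (-1, 0): (12, 1), (-1, -1): (13, 1),
}


def sc_context(h_contrib: int, v_contrib: int, sign_bit: int) -> Tuple[int, int]:
    """Return the (CX, D) pair for sign coding: D = sign bit XOR xor bit."""
    cx, xor_bit = _SC_TABLE[(h_contrib, v_contrib)]
    return cx, sign_bit ^ xor_bit


h = contribution(1, 0, 1, 0)   # both horizontal neighbors significant and positive -> +1
v = contribution(0, 0, 1, 1)   # one vertical neighbor significant and negative -> -1
pair = sc_context(h, v, 0)     # -> (11, 0)
```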

4.3.3 Magnitude refinement coding (MRC) context block

MRC uses three contexts (contexts 14–16). The context depends on whether or not this is the first time magnitude refinement is applied at the position, and on the significance of its 8 immediate neighbors (Fig. 13a). The data is the bit itself.
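A minimal sketch of the MRC context selection is given below, assuming the usual JPEG 2000 assignment: context 16 once a sample has already been refined, otherwise 15 or 14 depending on whether any of the 8 neighbors is significant. The function and argument names are hypothetical.

```python
def mrc_context(first_refinement: bool, significant_neighbors: int) -> int:
    """Magnitude refinement context label (14-16).

    first_refinement: True if the sample has not been refined before
    (sigma' still 0). significant_neighbors: count over the 8 neighbors.
    """
    if not first_refinement:
        return 16
    return 15 if significant_neighbors > 0 else 14


cx_first_isolated = mrc_context(True, 0)   # 14
cx_first_busy = mrc_context(True, 3)       # 15
cx_later = mrc_context(False, 0)           # 16
```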

4.3.4 Run length coding (RLC) contexts

This is used only when a previously insignificant sample is found to be significant during a ZC, SC or RLC operation. The sign information is encoded using one of the five different context states that depend on the sign and the significance of the immediate vertical and horizontal neighbors. The architecture for RLC coding is shown in Fig. 14.

The RLC condition is checked and, if it is satisfied, the zero detector attached to the output of the bit register determines whether all bits of the column are zero. If they are not, an encoder generates the bit position of the first 1-valued bit in the column, which is needed during run length coding after the first 1-bit is coded.
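The zero detector and position encoder can be modelled in a few lines. The sketch assumes that the RLC condition has already been checked for the column, that the run-length context is 17 and that the two position bits are sent MSB first with context 18, which matches the hard-coded entries of Table 4. The function name is hypothetical.

```python
from typing import List, Tuple


def rlc_pairs(column_bits: List[int]) -> List[Tuple[int, int]]:
    """(CX, D) pairs emitted by run length coding for one 4-bit stripe column.

    column_bits: the four magnitude bits of the column, top to bottom.
    """
    if not any(column_bits):
        return [(17, 0)]                   # whole column is zero
    pos = column_bits.index(1)             # position of the first 1-valued bit (0-3)
    return [(17, 1), (18, (pos >> 1) & 1), (18, pos & 1)]


all_zero = rlc_pairs([0, 0, 0, 0])         # [(17, 0)]
third_bit = rlc_pairs([0, 0, 1, 0])        # [(17, 1), (18, 1), (18, 0)]
```

The samples that follow the first 1-valued bit in the column are not covered here; they are coded with the other primitives.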

4.3.5 Context modeling controller block

This controller generates all the control signals needed to enable the modules of the architecture: loading the stripe from block ROM, reading and writing the stripe memory, controlling the pointers that access the CX/D sequencer, and generating the control signals for the BPC encoder.

Fig. 11 Code block lines and code block memories association

Fig. 12 Memory organization for bit plane

Fig. 13 Context windows for coding operations

Fig. 14 Architecture for RLC module

4.3.6 Column information generator

The data movement from the state variable memory to the state variable register is done column by column, as shown in Fig. 15. During zero coding and magnitude refinement coding, the significance register corresponding to the coded sample in the current column stripe may be updated when the stripe is coded.

To reduce the clock cycles spent on memory access, the method adopted in conventional context modeling is suitable for vertically causal context modeling.

4.3.7 CX/D sequencer

A stripe column is coded in one clock cycle. The CX/D sequencer sends the (CX,D) pairs to the MQ coder in the required order. A multiplexer chooses the context from the outputs of the ZC context block, the SC context block, the MR context block or the hard-coded RLC contexts (17–18). The data bit is chosen from t, the sign data, v, the hard-coded RLC data bits (0, 1) or the ZI (MSB, LSB) bits. The context sequencer circuit is responsible for buffering these (CX,D) pairs in the proper order. The MUX is controlled by a 3-bit word (cntrl_cx), which the controller generates according to the pass being performed. The contexts and data for the different values of cntrl_cx are given in Table 4.

5 Experimental results and discussion

The proposed architecture is described in VHDL, synthesized with Xilinx ISE 10.1 and implemented on Virtex5 (XC5LX30T), Virtex4 (XC4VLX80) and Spartan 3A DSP 3400 (XC3SD3400a) FPGAs. Table 5 presents the design implementation summary. The maximum operating frequency of the BPC is 372, 328 and 189 MHz, respectively. The total power consumption of the proposed design on the Virtex5 family (XC5VSX50T) was calculated using the XPower utility: the proposed architecture consumes 50 mW at 30.2 °C. All samples in a stripe column are processed in a single clock cycle, and 272 clock cycles are required to encode a bit plane.

Table 6 summarizes the runtime statistics for CBs of size 32 × 32. Each image comprises more than 587 zero bit planes. Encoding an image requires 396,924 clock cycles on average (6,291,456 clock cycles for the sequential architecture). The processing rate is obtained as the ratio between the total number of clock cycles required per image and the total number of coefficients in an image. Table 6 shows that the average processing rate of the proposed BPC is 2.0513.

Table 7 presents the encoding cycle statistics and the processing rate for CBs of size 64 × 64. To encode a 512 × 512 image, 219,120 clock cycles are required. The average processing rate of the proposed BPC is 2.66.

Fig. 15 The relationship of the state variable register and state

variable memory

Table 4 CX/D sequencer for BPC encoder

Cntr_cx CX data

000 – –

001 ZC cx t

010 SC cx Sign data bit

011 MC cx 0

100 17 0

101 17 1

110 18 ZI[MSB]

111 18 ZI[LSB]

Table 5 Proposed embedded bit plane coding architectures

| Used FPGA | XC5VLX50T | XC4VLX80 | XC3SD3400a |
|---|---|---|---|
| Max. frequency (MHz) | 372 | 328 | 189 |
| No. of 4 input LUTs | 228 | 337 | 340 |
| Total used slices | 201 | 193 | 188 |
| Total FF slices | 26 | 206 | 207 |
| Power consumption (mW) | 50 (30.2 °C) | 79 (27.3 °C) | 12 (30.2 °C) |


For comparison purposes, this architecture has also been implemented on a Virtex FPGA. Table 8 compares the proposed BPC architecture with some existing architectures. The maximum operating speed of this design is 186 MHz, which is 3.72 times faster than the BPC architecture proposed in [11]. The number of logic elements (LEs) required is reduced significantly owing to the reduced data width of the (CX,D) pairs. The size of the code block is 32 × 32 and the maximum coefficient bit width is 8 bits. However, the memory requirement increases by 8 Kb. Very little memory is required in this architecture because a smaller CB size is used than in the other architectures.

We compare the processing time of the proposed bit plane-parallel scheme on the test images with that of the sequential method, pass-parallel context modeling (PPCM), group-of-column skipping (GOCS) and bit plane context modeling (BPCM) [20]. Table 9 lists the clock cycles used for each test image.

The results show that our method requires substantially fewer clock cycles for encoding than the other widely known methods.
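The size of the reduction can be quantified directly from Table 9. The following sketch computes, for each image, the percentage of clock cycles saved by the proposed scheme relative to each other method; the cycle counts are copied from Table 9, while the function name `reduction_percent` and the dictionary layout are illustrative choices.

```python
# Clock cycles copied from Table 9; None marks the entry missing in the table.
TABLE_9 = {
    "Lena": {"Sequential": 4_164_352, "PPCM": 1_312_688, "GOCS": 1_743_283,
             "BPCM": 1_116_638, "Proposed": 366_960},
    "Baboon": {"Sequential": 4_947_712, "PPCM": 1_748_956, "GOCS": 2_106_820,
               "BPCM": 1_113_828, "Proposed": 453_288},
    "Pepper": {"Sequential": 4_550_656, "PPCM": None, "GOCS": 1_880_388,
               "BPCM": 1_119_712, "Proposed": 408_144},
}


def reduction_percent(baseline, proposed):
    """Percentage of clock cycles saved relative to a baseline method."""
    return 100.0 * (baseline - proposed) / baseline


def reductions(table):
    """Return {image: {method: percent saved}} for every tabulated baseline."""
    result = {}
    for image, row in table.items():
        proposed = row["Proposed"]
        result[image] = {
            method: reduction_percent(cycles, proposed)
            for method, cycles in row.items()
            if method != "Proposed" and cycles is not None
        }
    return result


if __name__ == "__main__":
    for image, row in reductions(TABLE_9).items():
        print(image, {m: round(p, 1) for m, p in row.items()})
```

Relative to BPCM, the closest competitor in the table, the saving is roughly 59 to 67 % depending on the image; relative to the sequential method it exceeds 90 %.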

6 Conclusion

In this paper, we have designed and implemented an efficient parallel VLSI architecture for the most demanding block of the JPEG 2000 coder. The architecture is based on parallel access to memory and parallel coding of one stripe column. When implemented on a Virtex-5 FPGA (XC5VSX50T), the BPC operates at up to 372 MHz. The designed EBCOT architecture is thus capable of encoding 44 video frames/s of high-definition TV of 1,920 pixels, and 40 frames/s at 2,048 × 1,080. It therefore meets the requirements of high-speed real-time image compression systems such as satellite imagery, medical imaging, cartography and others. Moreover, it improves the processing time by about 30 % compared to well-known techniques from the literature.

References

1. JPEG 2000 image coding system, ISO/IEC International Standard 15444-1, ITU Recommendation T.800 (2000)
2. ISO/IEC JTC1/SC29/WG1 N2678, document JPEG 2000 Part 1 020719 (final publication draft) (2002)
3. Taubman, D.S., Marcellin, M.W.: JPEG2000 image compression fundamentals, standards, and practice (2002)
4. Acharya, T., Tsai, P.: JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures. Wiley (2005)
5. Rabbani, M., Joshi, R.: An overview of the JPEG 2000 still image compression standard. Signal Process. Image Commun. 17(1), 3–48 (2002)
6. Lee, D.: JPEG 2000: Retrospective and new developments. Proc. IEEE 93(1), 32–41 (2005)
7. Das, A., Hazra, A., Banerjee, S.: An efficient architecture for 3-D discrete wavelet transform. IEEE Trans. Circuits Syst. Video Technol. 20(2), 286–296 (2010)
8. Delaunay, X., Chabert, M., Charvillat, V., Morin, G.: Satellite image compression by post-transforms in the wavelet domain. Signal Process. 90(2), 599–610 (2010)
9. Varma, K., Damecharla, H., Bell, A., Carletta, J., Back, G.: A fast JPEG 2000 encoder that preserves coding efficiency: The split arithmetic encoder. IEEE Trans. Circuits Syst. 55(11), 3711–3722 (2008)
10. Huang, Q., Zhou, R., Hong, Z.: Low memory and low complexity VLSI implementation of JPEG 2000 codec. IEEE Trans. Consum. Electron. 50(2), 638–646 (2004)

Table 6 Encoding cycles statistics and processing rate for CB with size 32 × 32

| Image | LZBP | BP processed | Cycles/image | Processing rate |
|---|---|---|---|---|
| Lena | 918 | 1,390 | 366,960 | 1.514 |
| Baboon | 587 | 1,717 | 453,288 | 2.925 |
| Peppers | 758 | 1,546 | 408,144 | 2.039 |
| Barbara | 788 | 1,361 | 359,304 | 1.727 |

Table 7 Encoding cycles statistics and processing rate for CB with size 64 × 64

| Image | LZBP | BP processed | Cycles/image | Processing rate |
|---|---|---|---|---|
| Lena | 189 | 387 | 204,336 | 2.047 |
| Baboon | 121 | 455 | 240,240 | 3.765 |
| Boat | 171 | 405 | 213,840 | 2.368 |
| Peppers | 156 | 420 | 221,760 | 2.692 |
| Barbara | 168 | 408 | 215,424 | 2.4286 |

Table 8 Comparison with existing EBCOT architectures

| Architecture | [11] | [17] | Proposed |
|---|---|---|---|
| Frequency (MHz) | 50 | 67 | 186 |
| No. of 4 input LUTs | 7,071 | 2,488 | 340 |
| Total used slices | 4,420 | 2,149 | 186 |
| Total FF slices | 1,560 | 105 | 27 |

FPGA used: XC2V1000-6

Table 9 Processing time (clock cycles) of the proposed scheme and other methods

| Image | Sequential | PPCM | GOCS | BPCM [20] | Proposed |
|---|---|---|---|---|---|
| Lena | 4,164,352 | 1,312,688 | 1,743,283 | 1,116,638 | 366,960 |
| Baboon | 4,947,712 | 1,748,956 | 2,106,820 | 1,113,828 | 453,288 |
| Pepper | 4,550,656 | – | 1,880,388 | 1,119,712 | 408,144 |


11. Gangadhar, M., Bhatia, D.: FPGA based EBCOT architecture for JPEG 2000. Microprocess. Microsyst. 29(8–9), 363–373 (2005)
12. Li, Y., Bayoumi, M.: A three-level parallel high-speed low-power architecture for EBCOT of JPEG 2000. IEEE Trans. Circuits Syst. Video Technol. 16(9), 1153–1163 (2006)
13. Lian, C., Chen, K., Chen, H., Chen, L.: Analysis and architecture design of block-coding engine for EBCOT in JPEG 2000. IEEE Trans. Circuits Syst. Video Technol. 13(3), 219–230 (2003)
14. Zhang, C., Long, Y., Kurdahi, F.: A scalable embedded JPEG 2000 architecture. J. Syst. Archit. 53(8), 524–538 (2007)
15. Fang, H.-C., Chang, Y.-W., Wang, T.-C., Lian, C.-J., Chen, L.-G.: Parallel embedded block coding architecture for JPEG 2000. IEEE Trans. Circuits Syst. Video Technol. 15(9), 1086–1097 (2005)
16. Modrzyk, D., Staworko, M.: A high-performance architecture of JPEG2000 encoder. In: 19th European Signal Processing Conference (EUSIPCO 2011), September 2011
17. Sarawadekar, K., Banerjee, S.: A high speed bit plane coder for JPEG2000 and its FPGA implementation. In: 17th European Signal Processing Conference (EUSIPCO 2009), September 2009
18. Liu, K., Zhou, Y., Song Li, Y., Ma, J.F.: A high performance MQ encoder architecture in JPEG2000. Integr. VLSI J. 43(3), 305–317 (2010)
19. JASPER software reference manual. ISO/IEC/JTC1/SC29/WG1N2415
20. Liu, K., Wu, C., Li, Y.: A high-performance VLSI architecture of EBCOT block coding in JPEG2000. J. Electron. (China) 23(1) (2006)

Author Biographies

Taoufik Saidani received his M.S. degree in Micro-electronics from the Faculty of Science of Monastir, Tunisia, in 2007. His major research interests include VLSI and embedded systems for video compression.

Atri Mohamed, born in 1971, received his Ph.D. degree in Micro-electronics from the Science Faculty of Monastir in 2001. He is currently a member of the Laboratory of Electronics and Micro-electronics. His research includes circuit and system design, image processing, network communication, IPs and SoCs.

Lazhar Khriji received his BS degree in electronics, and his MS and PhD degrees in electrical engineering from University of Tunis II, Tunisia, in 1990, 1992 and 1999, respectively. In 2002, he received the Doctor of Technology degree in Information Technology from the Signal Processing Institute, Tampere University of Technology, Finland. Dr. Khriji is currently Associate Professor at the University of Sousse, Tunisia. Since 2002, he has been on sabbatical leave at Sultan Qaboos University, Oman. From 1997 to 1999, he was a research scientist in the Research Institute for Information Technology, Tampere, Finland. His research interests include signal and image processing and analysis, nonlinear filtering, adaptive filtering, image coding, image encryption, genetic algorithms, fuzzy logic, and hardware implementation of DSP algorithms.

Rached Tourki was born in Tunisia on May 13, 1948. He received the B.S. degree in Physics (Electronics option) from Tunis University in 1970, and the M.S. and the Doctorat de 3eme cycle in Electronics from Institut d’Electronique d’Orsay, Paris-south University, in 1971 and 1973, respectively. From 1973 to 1974 he served as a Micro-electronics engineer at Thomson-CSF. He received the Doctorat d’etat in Physics from Nice University in 1979. Since then, he has been a professor in Micro-electronics and Microprocessors with the physics department, Faculte des Sciences de Monastir. His current research interests include digital signal processing and hardware–software co-design for rapid prototyping in telecommunications.
