Evaluation and Hardware Implementation of
Real-Time Color Compression Algorithms
Master’s Thesis
Division of Electronics Systems
Department of Electrical Engineering
Linköping University
By
Ahmet Caglar
Amin Ojani
Report number: LiTH-ISY-EX--08/4265--SE
Linköping, December 2008
Supervisor: Henrik Ohlsson,
Ericsson Mobile Platforms (EMP)
Examiner: Oscar Gustafsson,
Electronics Systems, Linköping University
Linköping, December 2008
Presentation Date: 2008-12-16
Department and Division: Department of Electrical Engineering, Division of Electronics Systems
URL, Electronic Version: http://www.ep.liu.se
Publication Title: Evaluation and Hardware Implementation of Real-Time Color Compression Algorithms
Author(s): Amin Ojani, Ahmet Caglar
Language: English
Number of Pages: 88
Type of Publication: Degree thesis
ISRN: LiTH-ISY-EX--08/4265--SE
Abstract
A major bottleneck, for performance as well as power consumption, for graphics hardware in
mobile devices is the amount of data that needs to be transferred to and from memory. In, for
example, hardware accelerated 3D graphics, a large part of the memory accesses are due to large
and frequent color buffer data transfers. In a graphics hardware block, color data is typically
processed in RGB format. For both 3D graphics rasterization and image composition, several
pixels need to be read from and written to memory to generate a pixel in the frame buffer.
This generates a lot of data traffic on the memory interfaces, which impacts both performance and
power consumption. Therefore it is important to minimize the amount of color buffer data. One
way of reducing the required memory bandwidth is to compress the color data before writing it to
memory and decompress it before using it in the graphics hardware block. This
compression/decompression must be done “on-the-fly”, i.e. it has to be very fast so that the
hardware accelerator does not have to wait for data. In this thesis, we investigated several exact
(lossless) color compression algorithms from a hardware implementation point of view, for use
in high-throughput hardware. Our study shows that the compression/decompression datapath is
well implementable even under stringent area and throughput constraints. However, the memory
interfacing of these blocks is more critical and could dominate the overall cost.
Keywords: Graphics Hardware, Color Compression, Image Compression, Mobile Graphics,
Compression Ratio, Frame Buffer Compression, Lossless Compression, Golomb-Rice coding.
Acknowledgements
First, we would like to express our gratitude and appreciation to our supervisor Dr. Henrik
Ohlsson from Ericsson Mobile Platforms (EMP) for his valuable guidance and discussions.
We would also like to thank our supervisor from Electronics Systems at Linköping University,
Dr. Oscar Gustafsson, for his great support and recommendations.
Finally, our deepest thanks go to our beloved parents for their everlasting support and
encouragement throughout our educational years. This thesis is dedicated to them.
Table of Contents
Chapter 1
1 Introduction
  1.1 Color Buffer and Graphics Hardware
  1.2 Color Buffer Compression vs. Image Compression
  1.3 Structure of the Report
Chapter 2
2 Lossless Compression Algorithms
  2.1 Introduction
  2.2 Theoretical Background of Lossless Image Compression
    2.2.1 JPEG-LS Algorithm
  2.3 Reference Lossless Compression Algorithm
    2.3.1 Color Transform and Reverse Color Transform
    2.3.2 Predictor and Constructor
    2.3.3 Golomb-Rice Encoder
    2.3.4 Golomb-Rice Decoder
  2.4 Golomb-Rice Encoding Optimization
    2.4.1 Proposed Method for Exhaustive Search Solution
    2.4.2 Estimation Method
  2.5 Improved Lossless Color Buffer Compression Algorithm
    2.5.1 Modular Reduction
    2.5.2 Embedded Alphabet Extension (Run-length Mode)
    2.5.3 Previous Header Flag
  2.6 Compression Performances of Algorithms
  2.7 Possible Future Algorithmic Improvements
    2.7.1 Pixel Reordering
    2.7.2 Spectral Predictor
    2.7.3 CALIC Predictor
    2.7.4 Context Information
Chapter 3
3 Color Buffer Compression/Decompression Hardware
  3.1 Design Constraints
  3.2 Compressor Block
    3.2.1 Addr_Gen1 (Source Memory Address Generator)
    3.2.2 Color_T (Color Transformer)
    3.2.3 Pred_RegFile_Ctrl (Prediction Register File Controller)
    3.2.4 Predictor
    3.2.5 Enc_RegFile_Ctrl (Golomb-Rice Encoder Register File Controller)
    3.2.6 GR_Encoder (Golomb-Rice Encoder)
      3.2.6.1 GR_k Block (Golomb-Rice Parameter Estimation)
      3.2.6.2 Enc Block (Encoding Block)
      3.2.6.3 GR_ctrl (Golomb-Rice Control Block)
    3.2.7 Data_Packer (Variable Bit Length Packer to Memory Word)
    3.2.8 Addr_Gen2 (Destination Memory Address Generator)
    3.2.9 Compressor_Ctrl (Control Path)
    3.2.10 Overall Compressor Datapath and Address Generation
  3.3 Decompressor Block
    3.3.1 Addr_Gen2 (Source Memory Address Generator)
    3.3.2 Rev_Color_T (Reverse Color Transformer)
    3.3.3 Const_RegFile_Ctrl (Construction Register File Controller)
    3.3.4 Constructor
    3.3.5 Dec_RegFile_Ctrl (Golomb-Rice Decoder Register File Controller)
    3.3.6 GR_Decoder (Golomb-Rice Decoder)
    3.3.7 Data_Unpacker (Variable Bit Length Unpacker from Memory Word)
    3.3.8 Addr_Gen1 (Destination Memory Address Generator)
    3.3.9 Decompressor_Ctrl (Control Path)
    3.3.10 Overall Decompressor Datapath and Address Generation
  3.4 Functional Verification Framework
  3.5 Synthesis Results
  3.6 Evaluation of Other Hardware Implementations
    3.6.1 Parallel Pipeline Implementation of LOCO-I for JPEG-LS [17]
    3.6.2 Benchmarking and Hardware Implementation of JPEG-LS [18]
    3.6.3 A Lossless Image Compression Technique Using Simple Arithmetic Operations [19]
    3.6.4 A Low Power, Fully Pipelined JPEG-LS Encoder for Lossless Image Compression [11]
    3.6.5 Hardware Implementation of a Lossless Image Compression Algorithm Using a FPGA [20]
    3.6.6 Comparison
Chapter 4
4 Conclusion
  4.1 Workflow
  4.2 Results and Outcomes
  4.3 Future Work
References
Appendix A
  Proposed Cost Reduction Method Analysis
    A.1 Overlap-limited Search
    A.2 Remainder-Based Correction
Appendix B
  Test Image Sets
    B.1 Standard Photographic Test Images
    B.2 Computer Generated Test Scenes
    B.3 Computer Generated User Menu Scenes
Table of Figures
Figure 1: Compressor/Decompressor hardware on memory interface
Figure 2: Error accumulation due to tandem compression
Figure 3: Compression/decompression functional blocks
Figure 4: Color transform / reverse color transform block interface
Figure 5: Color transform / reverse color transform operation flow graph
Figure 6: Median edge detector (MED) predictor prediction window
Figure 7: Predictor / constructor block interface
Figure 8: Predictor / constructor operation flow graph
Figure 9: Encoded data in the stream
Figure 10: Encoded data for (2, 0, 13, 3) and k = 2
Figure 11: Golomb-Rice encoder functional blocks
Figure 12: Golomb-Rice parameter exhaustive search hardware
Figure 13: A possible Golomb-Rice encoder hardware
Figure 14: A possible Golomb-Rice decoder hardware
Figure 15: HW cost vs. number of input samples (N)
Figure 16: HW cost vs. number of parameters (k)
Figure 17: HW implementation of the new combined method
Figure 18: Illustration of modular reduction
Figure 19: CALIC GAP prediction window
Figure 20: Compressor block
Figure 21: Memory mapping and corresponding pixels of the image
Figure 22: Traversal in prediction window
Figure 23: Address Generator I interface
Figure 24: Address Generator I hardware diagram
Figure 25: Color transform hardware diagram
Figure 26: Prediction register file controller interface
Figure 27: Change of prediction window for pixels of one subtile
Figure 28: States and register input connectivity in prediction register file controller
Figure 29: MED prediction hardware for both predictor and constructor
Figure 30: Predictor block hardware diagram
Figure 31: Encoder register file controller block interface
Figure 32: Golomb-Rice encoder block diagram
Figure 33: k-parameter estimation hardware
Figure 34: Golomb-Rice encoder realization
Figure 35: P3 block, basic hardware realization
Figure 36: Packed data order format in the memory
Figure 37: Data packer
Figure 38: Destination memory address generator block interface
Figure 39: Control path block interface
Figure 40: Overall compressor
Figure 41: Decompressor block
Figure 42: Source memory address generator block interface
Figure 43: Reverse color transform hardware diagram
Figure 44: Construction register file controller interface
Figure 45: States and register input connectivity in construction register file controller
Figure 46: Constructor block hardware diagram
Figure 47: Decoder register file controller block interface
Figure 48: Golomb-Rice decoder hardware
Figure 49: Data unpacker interface and block diagram
Figure 50: Read/write addresses from/to destination memory to construct one subtile
Figure 51: Actual addressing scheme for destination memory addresses
Figure 52: Destination memory address generator block interface
Figure 53: Overall decompressor
Figure 54: Verification framework FSM
Figure 55: Functional verification framework
Figure 56: One block of N values
Figure 57: Overlap regions of consecutive length functions with respect to Et
Figure 58: Overlap regions between length functions L1, L2, L3, L4
Figure 59: Overlap regions for N=4 and k = {0, 1, 2, 3, 4, 5, 6} with respect to Et
Figure 60: Required comparisons of overlap regions for N=4, k = {0, 1, 2, 3, 4, 5, 6} based on Et
Figure 61: Overlap regions of non-consecutive length functions with respect to Et
Figure 62: Motivation behind remainder-based correction
List of Tables
Table 1: Encoded output lengths for each k-parameter
Table 2: Logic cost of functional blocks
Table 3: HW cost comparison of exhaustive search and new combined method
Table 4: Estimation intervals according to sum of inputs
Table 5: HW cost and compression ratio of estimation method
Table 6: Comparison of compression performances
Table 7: Compressor block interface port description
Table 8: Source memory address generator addressing scheme
Table 9: Estimation function
Table 10: Header format generated by GR_ctrl block
Table 11: Decompressor block interface port description
Table 12: Destination memory address generator addressing scheme
Table 13: Compressor synthesis result
Table 14: Decompressor synthesis result
Table 15: Characteristics of different hardware implementations
Table 16: Comparison of cost estimations and actual sizes for blocks
Chapter 1
1 Introduction
A major bottleneck, for performance as well as power consumption, for graphics hardware in
mobile devices is the amount of data that needs to be transferred to and from memory. In, for
example, hardware accelerated 3D graphics, a large part of the memory accesses are due to large
and frequent color buffer data transfers. Therefore it is important to minimize the amount of color
buffer data.
In a graphics hardware block (for example, image composition or 3D graphics rasterization), color
data is typically processed in RGB format. Depending on the color resolution of the
image, 8, 12, 16, or 32 bits could be used to represent one pixel. For both 3D graphics rasterization
and image composition, several pixels need to be read from and written to memory to generate a
pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which
impacts both performance and power consumption.
One way of reducing the required memory bandwidth is to compress the color data before writing
it to memory and decompress it before using it in the graphics hardware block. Figure 1 shows
the location of the compressor/decompressor hardware with respect to the graphics hardware block
and the memory. The compressor/decompressor hardware will help reduce the data traffic on the
memory interface, shown with arrows in the figure. The reduction in memory bandwidth can be used
to minimize power consumption (fewer accesses to the memory bus), to increase performance (more
data traffic with the same memory bandwidth), or a combination of the two. Hence, a better trade-off
between power and performance can be found depending on the design constraints.
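As a rough illustration of the trade-off described above, the sketch below estimates color buffer traffic before and after compression. The resolution, pixel format, frame rate, overdraw factor, and 2:1 compression ratio are illustrative assumptions, not figures from this thesis.

```python
# Back-of-envelope color buffer traffic estimate.
# All numbers (resolution, pixel depth, fps, overdraw, compression ratio)
# are assumptions chosen for illustration only.

def color_buffer_traffic(width, height, bytes_per_pixel, fps, overdraw, ratio):
    """Return (uncompressed, compressed) color buffer traffic in MB/s."""
    raw = width * height * bytes_per_pixel * fps * overdraw  # bytes per second
    return raw / 1e6, raw / ratio / 1e6

before, after = color_buffer_traffic(
    width=320, height=240, bytes_per_pixel=2,  # e.g. an RGB565 mobile display
    fps=30, overdraw=3.0,                      # each pixel touched ~3 times
    ratio=2.0)                                 # assumed 2:1 lossless compression
print(f"uncompressed: {before:.1f} MB/s, compressed: {after:.1f} MB/s")
```

Even with these modest assumed numbers, halving the color buffer traffic saves several MB/s of memory bandwidth, which translates directly into bus accesses avoided or extra bandwidth freed for other traffic.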
[Diagram: Graphics Hardware Block connected to RAM through "Compress Data" and "Decompress Data" paths]
Figure 1: Compressor/Decompressor hardware on memory interface
Hardware implementation of such a compressor/decompressor is the subject of this work. Our
thesis, based on a reference color buffer compression algorithm [1], aims at:
− Evaluation of color buffer compression algorithms with respect to hardware
implementation properties,
− VHDL implementation of a selected algorithm in order to validate the hardware cost
estimations.
Accordingly, the thesis work has been carried out in two phases. The first phase consisted of the
following tasks:
− Analysis of the problem and modeling of the reference algorithm,
− Evaluation of the proposed solution with respect to both compression performance and
implementation properties,
− Exploration of algorithmic and hardware optimizations to improve both compression
performance and implementation cost,
− Selection of the final algorithm to be implemented.
The second phase of the thesis work was dedicated to the hardware implementation in VHDL and
verification of the algorithm selected in the first phase, and to the completion of the thesis
report.
1.1 Color buffer and graphics hardware
The color buffer refers to the portion of memory where the actual pixel data to be sent to the
display is stored. Graphics hardware uses this buffer during rasterization. Depending on the
rasterizer architecture, this buffer can be accessed in different ways. In traditional immediate
mode rendering, each triangle is rendered as soon as it comes in. Hence, for every triangle that is
drawn, the related pixel data are written to the buffer unless the triangle is completely hidden. On
the other hand, in tiled, deferred rendering architectures, the color buffer is written when a
complete tile (a unit of w × h pixels) is finished. Hence only visible color writes are performed,
which reduces the overall color buffer bandwidth. A more detailed explanation of the topic can
be found in [2].
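The tiled write pattern described above can be sketched as follows; the 4×4 tile size and the flat row-major frame buffer layout are assumptions chosen for illustration, not details of any particular architecture discussed in [2].

```python
# Sketch of a tiled, deferred color buffer write: the tile is resolved
# on chip and flushed to the frame buffer in a single pass once finished,
# so the buffer sees only the final, visible colors of that tile.
# Tile size (4x4) and row-major layout are illustrative assumptions.

TILE_W, TILE_H = 4, 4

def flush_tile(framebuffer, fb_width, tile_x, tile_y, tile):
    """Copy one finished TILE_W x TILE_H tile into a flat row-major buffer."""
    for ty in range(TILE_H):
        for tx in range(TILE_W):
            x = tile_x * TILE_W + tx
            y = tile_y * TILE_H + ty
            framebuffer[y * fb_width + x] = tile[ty][tx]

fb = [0] * (8 * 8)  # an 8x8 frame buffer, flattened row-major
tile = [[ty * TILE_W + tx for tx in range(TILE_W)] for ty in range(TILE_H)]
flush_tile(fb, 8, 1, 0, tile)  # write the tile at tile coordinates (1, 0)
```

In an immediate mode renderer, by contrast, each covered pixel would be written (and possibly overwritten) as every triangle arrives, multiplying the traffic to the same buffer region.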
1.2 Color buffer compression vs. image compression
Color buffer data compression, as a specific application of general data compression, shares many
similarities with image compression. Consequently, the theory developed for image compression is
well suited to compressing color buffer data in 3D graphics hardware. Specifically, the
correlation between neighboring pixel values also holds for color buffer data and can be used as
a basis for compression.

On the other hand, there are important differences between color buffer data compression and
image compression. First of all, most of the image compression algorithms in the literature have
been developed for continuous-tone still images. Their compression results are customarily
reported on a set of well-known test images. Those images are real (photographic) images, and it
is harder to get information about the performance of image compression algorithms on computer
generated images. Secondly, and more importantly, most image compression algorithms assume
the availability of a whole, completed image. For example, most (if not all) state-of-the-art
image compression algorithms are adaptive, which can be briefly explained as learning from
the image itself while traversing it in some order. Rasterization in graphics hardware, on the
other hand, is an incremental process. Depending on the rasterizer architecture, the data to be
compressed could be an unfinished scene, and it could also be only a part of the whole scene. In a
tiled architecture, for example, a tile is the unit of compression, and the tile size could be too
small to learn from. Hence the success of adaptive image compression algorithms on color buffer
data is not obvious and depends on the specific rasterizer architecture.
Another difference between our framework and image compression algorithms is the
requirements on the complexity and implementation cost. As mentioned in [1], most of the image
compression algorithms are not symmetric, i.e., compression and decompression take different
times. Moreover, for most of the compression algorithms, the complexity of the forward path
(compression) is discussed, since they aim at the applications where only compression and
storage of the image data is important. The backward path (decompression) is not considered as
critical. However in our case, the compression/decompression must be done “on-the-fly”, i.e. it
has to be very fast so that the hardware accelerator does not have to wait for data. Finally, a
compressor/decompressor for mobile devices has extra requirements on the implementation cost.
Specifically, the size of the hardware block is of prime concern. This prohibits the use of
sophisticated algorithms whose logic and storage (buffering) costs exceed what is affordable in
our case.
1.3 Structure of the report
Chapter 1 of the report has given a description of the aim of this thesis work and some
background information about the application area. Chapter 2, starting with an explanation of the
need for lossless compression in our case, gives a thorough analysis of the lossless compression
algorithms considered for this thesis and evaluation of their implementation properties. This
chapter corresponds to the first phase of our thesis work. Chapter 3 describes the implementation and
hardware of the compressor/decompressor and presents synthesis results. Chapter 4 includes
concluding remarks and discussion of some possible future work.
Chapter 2
2 Lossless Compression Algorithms
In this chapter we discuss several lossless color data compression algorithms and their
performance together with their hardware implementation properties. We then propose a modified
algorithm which is especially effective for highly compressible images. The chapter ends with a
comparison of the compression ratio and cost of those algorithms and some remarks about possible
future improvements.
2.1 Introduction
Lossless image compression is customarily used in specific application areas like medical and
astronomical imaging, preservation of art work and professional photography. It is not surprising
that lossless compression is not used for multimedia in general when one considers its limited
compression performance. The achievable compression ratio varies between 2:1 and 3:1 in
general, which is significantly lower than what lossy compression can offer. Furthermore, in
lossy compression the resulting image quality and desired compression performance can always
be traded-off depending on the requirements.
Considering the disadvantages just mentioned, the use of lossless compression for color buffer
data in 3D graphics hardware might be questioned. However, [1] explains and illustrates the
possibility of getting unbounded errors due to so-called tandem compression when a lossy
algorithm is used. Tandem compression artifacts arise when lossy compression is performed for
every triangle written to a tile during rasterization, resulting in accumulation of error. This is a
direct consequence of rasterization being an incremental process. Figure 2 from [1] illustrates the
accumulation of error.
Figure 2: Error accumulation due to Tandem Compression
Although it is possible to control the accumulated error in those cases as suggested in [1], the
resulting image quality may not be acceptable. In our work we employ a conservative approach
(lossless compression) instead, since the resulting compression ratio is sufficient for our
application.
2.2 Theoretical Background of Lossless Image Compression
In image compression applications, there are several algorithms which offer different approaches
for compression of still images. The most famous algorithms are FELICS [3], LOCO-I [4] and
CALIC [5]. Owing to its better tradeoff between complexity and compression ratio, LOCO-I
was standardized as JPEG-LS [6].
2.2.1 JPEG-LS Algorithm
The idea behind JPEG-LS is to take advantage of both the simplicity and the compression
potential of context models. The error residuals are computed using an adaptive predictor, and the
Golomb-Rice technique is used for encoding the data. The purpose of having an adaptive
predictor instead of a fixed predictor is that it yields smaller prediction residuals,
which leads to a higher compression ratio. It should be noted that better prediction results
help only when the encoding parameters are extracted from the compressed stream itself, which
is the case in JPEG-LS. Otherwise, the major overhead that degrades the compression ratio is
sending the header information, and in that case improving the predictor cannot help much in
achieving higher performance. The reason why non-adaptive algorithms give a lower compression
ratio is that their compression performance is limited by the first-order entropy of the
prediction residuals, which in general cannot achieve total decorrelation of the data [6]. As a
consequence, the compression gap between these simple schemes and more complex algorithms
is significant.
The LOCO-I algorithm is constructed from three main components. The first component is the
predictor, which consists of a fixed part and an adaptive part. The fixed part performs
horizontal and vertical edge detection, where the dependence on the surrounding samples is through
fixed coefficients. The fixed predictor used in LOCO-I is a simple median edge detector (MED)
predictor and will be explained in subsection 2.3.2. The adaptive part, on the other hand, is
context dependent and performs bias cancellation, since a DC offset is typically present
in context-based prediction [6].
The second component is the context model. A more complex context modeling technique results in
a higher achievable dependency order. In LOCO-I, the context model computes the gradients
of neighboring pixels and then quantizes the gradients into a small number of equally probable
connected regions. Although in principle the number of those regions should be adaptively
optimized, the low-complexity requirement dictates a fixed number of equally probable regions.
The gradients carry information about the part of the image surrounding the current pixel. From
the gradients we can learn the level of activity, such as smoothness or edginess, around
the current pixel. This information governs the statistical behavior of the prediction error [6].
For JPEG-LS, the number of contexts is 365. This number represents a suitable trade-off between
compression performance and storage requirements, the latter being proportional to the number of contexts.
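As an illustration only, the gradient quantization just described can be sketched in Python. The thresholds below are the JPEG-LS default values (T1 = 3, T2 = 7, T3 = 21) from the standard, not figures taken from this thesis:

```python
def quantize_gradient(g):
    # Map a gradient into one of 9 regions {-4, ..., 4}; thresholds
    # assumed to be the JPEG-LS defaults (T1 = 3, T2 = 7, T3 = 21).
    sign = -1 if g < 0 else 1
    g = abs(g)
    if g == 0:
        return 0
    if g < 3:
        return sign * 1
    if g < 7:
        return sign * 2
    if g < 21:
        return sign * 3
    return sign * 4

# Three quantized gradients give 9^3 = 729 signed contexts; merging each
# context with its sign-flipped mirror leaves (729 + 1) / 2 = 365 contexts.
assert (9 ** 3 + 1) // 2 == 365
```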
The last component, the coder, is used to encode the corrected prediction residuals. LOCO-I uses
the Golomb-Rice coding technique [6, 7] in two different modes: regular mode and run-length
mode. This coding technique is discussed in detail in subsection 2.3.3.
There are several different implementation approaches for the JPEG-LS algorithm, each of which
uses a specific hardware architecture such as parallel, pipelined, or a combination of both.
Implementation options include dedicated DSPs, FPGA boards, and ASICs. Factors that affect the
choice of platform include cost, speed, memory, size, and power consumption. One of
the very important characteristics of the JPEG-LS algorithm is its sequential execution nature, due to
the use of context statistics in coding the error residuals in the prediction phase. This
characteristic makes it difficult to design a parallel, pipelined encoder architecture to
speed up the compression. In section 3.6, different hardware architectures and their
implementation results are discussed.
Compression in a mobile application is limited by the available storage and memory bandwidth.
Therefore, context-based algorithms such as JPEG-LS may not be applicable, since their storage
requirements for the context information could be quite high for this application.
2.3 Reference Lossless Compression Algorithm
Our thesis work is based on [1], which gives a survey of color buffer data compression
algorithms and proposes a new exact (lossless) algorithm. In this section, we present a thorough
analysis of this algorithm and of the role and hardware implementation cost of its functional blocks.
The result of this analysis serves as the basis for our later work both on algorithmic and hardware
optimizations.
This algorithm, as opposed to more complex adaptive context-modeling schemes like LOCO-I
[4], can be classified as a variant of a simplicity-driven DPCM technique, employing variable
bit-length coding of prediction residuals obtained from a fixed predictor [6]. To get a better
decorrelation of pixel data, a lossless (exactly reversible) color transform precedes those blocks.
The block diagram of the compressor and decompressor is given in figure 3.
Figure 3: Compression / Decompression Functional Blocks
In context-based algorithms, the encoding parameter for each pixel is estimated from previously
traversed data (the context). Since the decoder traverses the data in the same order, it will make the
same decision as the encoder for the parameter of the current pixel. This eliminates the overhead
of sending the encoding parameter in the stream. However, since no context information is stored
in our reference algorithm, the overhead of sending the encoding parameter of each pixel is
significant. An important feature of the algorithm is therefore that it encodes a number of pixels
(a 2×2 subtile) with the same parameter in the encoder stage. This allows a trade-off between the
header overhead and the use of a non-optimal encoding parameter for some pixels.
Another feature of the reference algorithm is that it operates on tiles (8×8 blocks of pixels) to
make it compliant with a tiled architecture. However, the functional blocks of the algorithm itself
do not use any tile specific information.
In the following subsections blocks of the algorithm are discussed.
2.3.1 Color Transform and Reverse Color Transform
The color transform block converts the RGB triplet to a YCoCg triplet in order to decorrelate the
channels. The Y channel is the luminance channel; Co and Cg are chrominance channels. It is stated in
[1] that decorrelation of the channels improves the compression ratio by about 10%. This
transformation and its important features were introduced in [9]. Exact reversibility is an
essential feature of this transformation since the overall algorithm is lossless. The forward and
backward transformation equations are:
    Forward:  Co = R − B;   t = B + (Co >> 1);   Cg = G − t;   Y = t + (Cg >> 1)
    Reverse:  t = Y − (Cg >> 1);   G = Cg + t;   B = t − (Co >> 1);   R = B + Co      (1)
From an implementation point of view, this transformation has a dynamic range expansion of 2 bits,
i.e., if the input RGB channels are n bits each, the output Y channel will require n bits, and the
chrominance channels will require n+1 bits each. The block interfaces of the forward and reverse
transforms with 8-bit RGB channels are given in figure 4.
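As a software reference model only (a sketch assuming the lifting form of the YCoCg-R transform of [9]; the function names are ours, not part of the hardware design), the forward and reverse transforms and their exact reversibility can be checked as follows:

```python
import random

def rgb_to_ycocg(r, g, b):
    co = r - b                 # 9-bit signed
    t = b + (co >> 1)
    cg = g - t                 # 9-bit signed
    y = t + (cg >> 1)          # 8-bit unsigned
    return y, co, cg

def ycocg_to_rgb(y, co, cg):
    # Each lifting step is undone in reverse order, so the round
    # trip is exact for every 8-bit RGB input.
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b

random.seed(0)
samples = [(0, 0, 0), (255, 255, 255), (255, 0, 128)]
samples += [tuple(random.randrange(256) for _ in range(3)) for _ in range(1000)]
assert all(ycocg_to_rgb(*rgb_to_ycocg(r, g, b)) == (r, g, b)
           for r, g, b in samples)
```

Note that reversibility holds regardless of the rounding of the shifts, because the reverse path recomputes the same intermediate value t from data the decoder already has.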
Figure 4: Color Transform / Reverse Color Transform Block Interface
As the equations suggest, both the color transform and the reverse color transform have 2 shift and 4
add/subtract operations per pixel, which can be expressed as follows:
[2 (>>), 4 (+)] per pixel.
The flow-graphs of the operations are given in figure 5.
Figure 5: Color Transform / Reverse Color Transform Operation Flow Graph
The operation cost and data lengths indicate that both blocks can be realized by:
- Two 9-bit adders/subtractors
- Two 8-bit adders/subtractors
This is a per-pixel cost and the overall cost depends on the throughput requirement. It should
also be noted that the color transform has a maximum logic depth of two 9-bit adders and two 8-bit
adders, whereas the reverse color transform has a maximum logic depth of one 9-bit adder and
two 8-bit adders.
2.3.2 Predictor and Constructor
The predictor used in our reference algorithm is called the MED predictor in [6] and was originally
introduced by Martucci [10]. This predictor uses three surrounding pixels to predict the value of
the current pixel, as shown in figure 6.
Figure 6: Median Edge Detector (MED) predictor prediction window
The prediction is performed with the following formula:
    x̂ = min(x1, x2)     if x3 ≥ max(x1, x2)
    x̂ = max(x1, x2)     if x3 ≤ min(x1, x2)                    (2)
    x̂ = x1 + x2 − x3    otherwise
The first two cases correspond to a primitive test for horizontal and vertical edge detection. If no
edge is detected, the third case predicts the value of the current pixel by assuming it lies on a plane
formed by the three neighboring pixels. Despite its simplicity, the MED predictor is reported to be
a very effective fixed predictor.
After the prediction, the predicted value ( x̂ ) is subtracted from the actual pixel value (x) and the
resulting error residual (e) is sent out to be encoded in the encoder block. Conversely, in the
decompression path the same prediction is performed from the previously constructed pixels, and
the resulting prediction ( x̂ ) is added to the decoded error residual (e) from the stream to
reconstruct the actual pixel value (x). The block interfaces of the predictor and constructor are
given in figure 7. In this figure, the input pixel values are 9-bit signed chrominance components
(Co and Cg), and the error residual is a 10-bit signed value. For the Y and α predictors/constructors,
the input size is 8 bits.
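The prediction and construction steps above can be sketched as a small software model (our illustration only; x1, x2, x3 follow the naming of the prediction window):

```python
def med_predict(x1, x2, x3):
    # x1, x2 are the adjacent neighbors and x3 the corner neighbor
    # of the prediction window (naming assumed from the text).
    if x3 >= max(x1, x2):
        return min(x1, x2)        # edge detected
    if x3 <= min(x1, x2):
        return max(x1, x2)        # edge detected
    return x1 + x2 - x3           # planar prediction

# Compressor sends e = x - x_hat; decompressor rebuilds x = x_hat + e.
x, x1, x2, x3 = 100, 98, 105, 97
e = x - med_predict(x1, x2, x3)
assert med_predict(x1, x2, x3) + e == x
```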
Figure 7: Predictor / Constructor Block Interface
The operations extracted from (2) can be expressed as follows:
[3 comp. (<), 3 (+)] per pixel-component.
The flow-graphs of the predictor and constructor operations are identical and are given in figure 8.
Figure 8: Predictor / Constructor Operation Flow Graph
The flow graph and data wordlengths indicate that both the predictor and constructor blocks can
be realized by:
- Three 10-bit comparators
- Two 9-bit adders/subtractors
- One 10-bit adder/subtractor
- One 9-bit 4x1 MUX (with some additional logic at select inputs)
This is a per pixel-component cost and the overall cost depends on the throughput requirement.
Both the predictor and constructor have a maximum logic depth of two 9-bit and one 10-bit
adders.
Since the next stage, i.e. Golomb-Rice encoding, requires unsigned (one-sided) error residuals, the
following signed-to-unsigned conversion, as suggested in [4], needs to be performed after the
prediction:
    M(e) = 2e          if e ≥ 0
    M(e) = −2e − 1     if e < 0                                 (3)
Conversely, after Golomb-Rice decoding in the decompression path, the corresponding
unsigned-to-signed conversion is needed.
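This mapping and its inverse can be sketched as follows (our illustration of the mapping in [4]; the function names are ours):

```python
def to_unsigned(e):
    # Interleave the two-sided residuals 0, -1, 1, -2, 2, ... onto
    # the one-sided sequence 0, 1, 2, 3, 4, ...
    return 2 * e if e >= 0 else -2 * e - 1

def to_signed(m):
    # Inverse mapping, used after Golomb-Rice decoding.
    return m // 2 if m % 2 == 0 else -(m + 1) // 2

assert [to_unsigned(e) for e in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
assert all(to_signed(to_unsigned(e)) == e for e in range(-512, 512))
```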
2.3.3 Golomb-Rice Encoder
Golomb codes are variable bit-rate codes that are optimal for a one-sided geometric distribution
(OSGD) of non-negative integers. Since the statistics of the prediction error residuals from a fixed
predictor in continuous-tone images are well modeled by a two-sided geometric distribution (TSGD)
centered at zero [6], Golomb coding is widely used in lossless image coding algorithms, with a
signed-to-unsigned mapping at the beginning to obtain an OSGD.
Since Golomb coding requires an integer division and a modulo operation with the Golomb parameter
m, Rice codes [8] are generally used in implementations. Rice coding is a special case of Golomb
coding which reduces the division and modulo operations to simple shift and mask operations.
In Golomb-Rice encoding, we encode an input value, e, by dividing it by a constant 2^k. The
result is a quotient q and a remainder r. The quotient q is stored using unary coding, and the
remainder r is stored using plain binary coding with k bits. To illustrate with an example
(figure 10), let us assume that we want to encode the values 2, 0, 13, 3 and that we have
selected the constant k = 2. After the division we get the following (q, r)-pairs: (0, 2), (0, 0), (3,
1), (0, 3). Unary coding represents a value by as many zeros as the magnitude of the value,
followed by a terminating one. The encoded values therefore become (1b, 10b), (1b, 00b),
(0001b, 01b), (1b, 11b), which is 15 bits in total.
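The encoding step above can be expressed as a few lines of Python (our illustration; the hardware realizes it with shifters and mask logic instead):

```python
def gr_encode(e, k):
    # Split e into quotient and remainder by the constant 2^k, then emit
    # the quotient in unary (q zeros, terminating one) followed by the
    # remainder as a plain k-bit binary field.
    q, r = e >> k, e & ((1 << k) - 1)
    return "0" * q + "1" + (format(r, "0%db" % k) if k else "")

# The worked example: values 2, 0, 13, 3 with k = 2.
codes = [gr_encode(e, 2) for e in (2, 0, 13, 3)]
assert codes == ["110", "100", "000101", "111"]
assert sum(len(c) for c in codes) == 15    # 15 bits in total, as above
```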
Figure 9: Encoded Data in the Stream
Figure 10: Encoded Data for (2, 0, 13, 3) and k = 2
(The stream stores, for each component of a sub-tile, the k-parameter header followed by the
codeword pairs q1 r1 … q4 r4 of the four pixels, written and read in the same direction.)
In our reference algorithm, the optimal Golomb-Rice parameter k for a 2×2 pixel subtile of error
residuals is computed with an exhaustive search, and the Golomb-Rice coded residuals are sent
out to the stream preceded by the k-parameter as a header. During decompression, the decoder
decodes the data from the stream with the k-parameter received in the header.
Encoding requires three functional blocks, as given in figure 11: a k-parameter determination block, Golomb-Rice encoding with k, and a data packer to external memory.
Figure 11: Golomb-Rice Encoder functional blocks
The reference algorithm uses a 3-bit header (k = 0, 1, ..., 7) to encode a subtile. Among those
header values, k = 7 is reserved for the special case when all error residuals in a subtile are zero. In this
case only the header is stored; otherwise the header is followed by the coded component-wise residuals.
The exhaustive search for the best k-parameter requires comparing the lengths of the output codes
created by each possible k value (0, 1, ..., 6), excluding the special case. The length of an output
code corresponding to a k-parameter can be expressed with the following formula:

    Lk = ⌊e1/2^k⌋ + ⌊e2/2^k⌋ + ⌊e3/2^k⌋ + ⌊e4/2^k⌋ + 4k + 4        (4)
The lengths of each output code from this formula are given in table 1.
k-parameter   Length of output code (Lk)
0             e1 + e2 + e3 + e4 + 0 + 4
1             ⌊e1/2⌋ + ⌊e2/2⌋ + ⌊e3/2⌋ + ⌊e4/2⌋ + 4 + 4
2             ⌊e1/4⌋ + ⌊e2/4⌋ + ⌊e3/4⌋ + ⌊e4/4⌋ + 8 + 4
3             ⌊e1/8⌋ + ⌊e2/8⌋ + ⌊e3/8⌋ + ⌊e4/8⌋ + 12 + 4
4             ⌊e1/16⌋ + ⌊e2/16⌋ + ⌊e3/16⌋ + ⌊e4/16⌋ + 16 + 4
5             ⌊e1/32⌋ + ⌊e2/32⌋ + ⌊e3/32⌋ + ⌊e4/32⌋ + 20 + 4
6             ⌊e1/64⌋ + ⌊e2/64⌋ + ⌊e3/64⌋ + ⌊e4/64⌋ + 24 + 4
Table 1: Encoded output lengths for each k-parameter
In order to find the best k-parameter, four additions have to be performed for each k to calculate
the length of its corresponding output code (three additions are needed for k = 0). The fixed term
“4” is common to all the choices; therefore its addition is not needed for the comparison. This
corresponds to 6 × 4 + 3 = 27 additions. To compare the lengths of seven values, six comparison
operations are needed. To summarize, the operations needed to find the best k-parameter with
exhaustive search can be expressed as follows:
[6 comp. (<), 27 (+)] per subtile-component = [6 comp. (<), 27 (+)] per pixel
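The exhaustive search can be modeled in software as follows (our illustration of the search, built directly on the length formula of equation (4) and the k = 7 special case):

```python
def code_length(residuals, k):
    # Equation (4): sum of the four quotients, four terminating ones,
    # and k remainder bits for each of the four residuals.
    return sum(e >> k for e in residuals) + 4 * k + 4

def best_k_exhaustive(residuals):
    if all(e == 0 for e in residuals):
        return 7                           # special all-zero header
    return min(range(7), key=lambda k: code_length(residuals, k))

assert code_length([2, 0, 13, 3], 2) == 15   # matches the worked example
assert best_k_exhaustive([2, 0, 13, 3]) == 2
assert best_k_exhaustive([0, 0, 0, 0]) == 7
```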
The hardware diagram is given in figure 12.
Figure 12: Golomb-Rice parameter exhaustive search hardware
More specifically, the hardware cost is:
- Six 13-bit comparators
- Two 12-bit adders
- Four 11-bit adders
- Four 10-bit adders
- Four 9-bit adders
- Four 8-bit adders
- Four 7-bit adders
- Three 6-bit adders
- Two 5-bit adders
This cost is per subtile-component, which can equivalently be thought of as a per-pixel cost. The
overall cost depends on the throughput requirement. This block has a logic depth of three 13-bit,
one 12-bit, one 11-bit and one 10-bit adders.
The second encoder block encodes the input residuals of a subtile with the calculated k-parameter.
The output of this block is four encoded code words, one for each pixel of the subtile, together
with their corresponding lengths.
A very simple possible architecture for this block is given in [11]. Adjusted to our case, the
hardware for each pixel of the second block is given in figure 13.
Figure 13: A possible Golomb-Rice encoder hardware
The hardware cost per pixel-component of this block is:
- One 5-bit adder
- Two 10-bit shifters
- One 22-bit shifter
- 10 XOR gates
- 22 OR gates
The final block of the encoder is the data packer. This block receives the 3-bit header
(k-parameter) and the code/length pairs of each pixel in a subtile. It combines the code words into
fixed-size memory words and sends them to external memory.
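The packing step can be sketched as a software model. The word size, MSB-first bit order and zero-padded flush below are our own assumptions for illustration, not details taken from the reference design:

```python
def pack_bits(codewords, word_size=32):
    # Pack variable-length (value, length) codewords into fixed-size
    # memory words, most-significant bit first.
    buffer, nbits, out = 0, 0, []
    for value, length in codewords:
        buffer = (buffer << length) | value
        nbits += length
        while nbits >= word_size:
            nbits -= word_size
            out.append((buffer >> nbits) & ((1 << word_size) - 1))
            buffer &= (1 << nbits) - 1
    if nbits:                              # flush the tail, zero-padded
        out.append((buffer << (word_size - nbits)) & ((1 << word_size) - 1))
    return out

# 3-bit header k = 2 followed by the codewords of the worked example,
# packed into 8-bit words for illustration:
stream = pack_bits([(2, 3), (0b110, 3), (0b100, 3), (0b000101, 6), (0b111, 3)],
                   word_size=8)
assert stream == [0b01011010, 0b00001011, 0b11000000]
```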
2.3.4 Golomb-Rice Decoder
The role of the decoder is to extract the error residuals of a subtile by decoding the compressed data
using the header, according to figure 9. Its functional blocks are similar to those of the encoder,
but since the header is provided by the incoming stream, the k-parameter determination block is
not needed. The data un-packer provides the header and the (q, r) pairs of each pixel of a subtile.
The q data is obtained with a unary-to-binary conversion.
The next block combines the binary (q, r) pairs with the header and reproduces the error residual
at the output according to:
    e = q · 2^k + r                                              (5)
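A matching software sketch of this decoding step (our illustration; the hardware realizes it with a shifter and OR gates):

```python
def gr_decode(bits, k):
    # Unary quotient: count the zeros before the terminating one,
    # then read the k-bit binary remainder and apply equation (5).
    q = bits.index("1")
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) + r, q + 1 + k         # (residual, bits consumed)

# Round trip against the codeword of 13 from the earlier example (k = 2):
assert gr_decode("000101", 2) == (13, 6)
assert gr_decode("110", 2) == (2, 3)
```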
A simple possible decoder hardware for each pixel-component of a subtile is given in figure 14.
Figure 14: A possible Golomb-Rice decoder hardware
The hardware cost per pixel-component of this block is:
- One 22-bit shifter
- 10 OR gates
To summarize, table 2 gives the logic cost of the functional blocks in both the compressor and the
decompressor (only adder cost is considered). Note that this calculation only includes the
datapath functional blocks shown in figure 3. This means the actual hardware is expected to
include other blocks for memory interfacing, memory addressing, pipelining, the control path,
etc. It is also important to note that the actual hardware size to a great extent
depends on the design requirements, while table 2 shows the generic per-pixel cost of the algorithm.
Functional Block                    Compressor logic cost      Decompressor logic cost
                                    (adder cells) per pixel    (adder cells) per pixel
Color transform                     34                         -
Reverse color transform             -                          34
Prediction                          232                        -
Construction                        -                          232
GR Encoder – k determination        310                        -
GR Encoder – residual encoding      20                         -
GR Decoder – residual decoding      -                          -
Total                               596                        266
Table 2: Logic Cost of Functional Blocks
2.4 Golomb-Rice Encoding Optimization
Considering the result given in table 2 it is obvious that the most costly part of the design is the
hardware necessary to find the best k parameter for Golomb-Rice coding. Therefore, in order to
reduce the hardware cost, it is convenient to try reducing the cost of this circuitry.
Two approaches have been considered to reduce the complexity. The first is to use an improved
exhaustive search method, which is presented in subsection 2.4.1. The second is to use an
estimation formula given in [8], which is presented in subsection 2.4.2.
2.4.1 Proposed method for exhaustive search solution
Exhaustive search method to find k-parameter is straightforward to implement, but the
computational cost of this method is large and it increases linearly with the number of k values.
For all k values, the length of the encoded data should be calculated and the k, corresponding to
minimum length is chosen among them by comparison. For example, consider that we have a
block size of n, which indicates the number of inputs to be encoded together and the set k = {0, 1,
2,…, m-1}, where m is variable and depends on the application requirements. The best member of
the set should be selected as the Golomb-Rice parameter
The computational requirements of the exhaustive search method can be significantly reduced with
our new solution, while still finding the exact Golomb-Rice (best k) parameter for a group of input
data. The proposed approach uses a combination of two different ideas.
The first idea, which will be referred to as “overlap-limited search”, removes the need for
computation and comparison of all the length values for each possible k. It is mathematically
proven that, for any given set of input samples {e1, e2, e3, …, en}, depending on their sum there are
overlap regions only between a fixed, limited number of length functions, and that only those
length functions need to be computed and compared to get the best k. In other words, not all
possible k values, but only a fixed, limited and consecutive subset of them, can be candidates for
being the Golomb-Rice parameter of a block. This idea is not limited to hardware
implementations but reduces the time complexity of the comparison in software implementations
as well.
The second idea, which will be referred to as “remainder-based correction”, eliminates the
computational redundancy of performing identical bit additions in the calculation of the code lengths (Lk)
corresponding to each k. We identify bit additions common to all Lk and save hardware by
performing those additions only once. From another point of view, instead of adding shifted
versions of the input data (the quotients) for each k, we first add the inputs only once and then shift
the same sum for each k. This way of calculating, however, ignores the effect of the remainders on the
sum. To obtain exactly the same result, a correction is performed for each k after the addition,
using the remainders of the division. Since the correction hardware is much smaller than the adders used
for each k, a significant hardware saving is possible. This idea is only applicable to hardware
implementations of finding the Golomb-Rice parameter (best k-parameter).
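The underlying identity can be verified in software: since each e − (e mod 2^k) is divisible by 2^k, adding the inputs once and shifting the remainder-corrected sum gives exactly the sum of the shifted inputs (our illustrative check, not the hardware implementation itself):

```python
import random

def length_sum_direct(residuals, k):
    # Exhaustive-search style: shift every input, then add the quotients.
    return sum(e >> k for e in residuals)

def length_sum_corrected(residuals, k):
    # Remainder-based correction: add the inputs once, subtract the sum
    # of the remainders, and shift the corrected total.
    e_total = sum(residuals)
    correction = sum(e & ((1 << k) - 1) for e in residuals)
    return (e_total - correction) >> k

random.seed(1)
for _ in range(1000):
    block = [random.randrange(1 << 10) for _ in range(4)]  # 10-bit residuals
    assert all(length_sum_direct(block, k) == length_sum_corrected(block, k)
               for k in range(7))
```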
To put the solution into perspective, the plots in figures 15 and 16 show the cost functions of three
different implementations, namely exhaustive search, the overlap-limited search method, and the
combined method (overlap-limited search and remainder-based correction), with respect to n
(the number of input samples) and k (the number of candidates for the Golomb-Rice parameter),
respectively.
In figure 15, the cost function is plotted with respect to n (the number of input samples to be
encoded together). It is assumed that the set k = {0, 1, 2, 3, 4, 5, 6, 7} is fixed and that the input data
word length is 8 bits. It can be observed from the plot that the slope of the cost function
of the combined method is ⅓ of that of the exhaustive search method.
Figure 15: HW-cost vs. number of input samples (n)
In figure 16, the cost is shown as a function of the number of members in the set k. This plot shows a
very important feature of “overlap-limited search”: the number of comparisons needed to find the
Golomb-Rice parameter (best k) is fixed and independent of the number of k values to be
compared. Hence, for applications where the dynamic range of the input data is larger, a larger set of k
values has to be used, and “overlap-limited search” leads to even more significant reductions in
the number of comparisons. Audio applications using 16-bit input data are an
example of this case [12].
The results in figures 15 and 16 show that the combined solution is the cheaper one.
Figure 16: HW-cost vs. number of parameters (k)
Mathematical derivation and data analysis of this proposed method is given in Appendix A.
Our implementation, which combines both methods and whose circuit diagram is given in figure 17,
takes the input bits (A5-A0, B5-B0, C5-C0, D5-D0), eT, and k, k+1, k+2 as inputs. eT is obtained by
adding the input values. Then the region corresponding to eT is located to find the three k values
(k, k+1, k+2) to compare. The outputs of the circuit are Lk, Lk+1 and Lk+2. These three values are
compared using two comparators in a final stage to find the best Golomb-Rice parameter.
Figure 17: HW implementation of the new combined method
This method is a general solution for implementations of Golomb-Rice encoders in all
applications, with any set of Golomb-Rice parameters k and different block sizes n (the subtile size
in our case). It is an exact method which replaces the exhaustive search for the
best k-parameter and leads to much lower computational requirements. The improvement in
hardware cost with our implementation is given in table 3.
Method                         Cost (full adders)   Compression ratio (norm.)
Exhaustive search (exact)      310                  1
New combined method (exact)    111                  1
Table 3: HW cost comparison of exhaustive search and new combined method
The table shows that the new implementation method leads to a 65% reduction in hardware cost
over exhaustive search, while still finding the best k-parameter for a block.
The comparison of the results is presented in figures 15 and 16, which show the advantage of the
new method in reducing the hardware cost. For example, in figure 16, for a word length of 32
bits, the set k = {0, 1, 2, …, 31} should be used in order to achieve the minimum code length. The
hardware cost in this case is reduced by 83% with overlap-limited search and by 89% with the
combined method, with exactly the same result.
2.4.2 Estimation method
In [8], an estimation formula based on the sum of all inputs is given, where the k-parameter is
determined according to the range of the sum of the input values. The estimation works according
to table 4, where sum is the summation of the inputs to be encoded (in our case, the four error
residuals of a subtile):

sum = e1 + e2 + e3 + e4        (6)
sum                 k
sum = 0             7
0 < sum < 8         0
8 ≤ sum < 16        1
16 ≤ sum < 32       2
32 ≤ sum < 64       3
64 ≤ sum < 128      4
128 ≤ sum < 256     5
sum ≥ 256           6
Table 4: Estimation intervals according to the sum of the inputs
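The interval test of table 4 reduces to a simple range check on the sum; a software sketch (the function name is ours):

```python
def estimate_k(residuals):
    # Interval-based estimate of the Golomb-Rice parameter per table 4.
    s = sum(residuals)
    if s == 0:
        return 7                 # all-zero subtile, header-only case
    # 0 < s < 8 -> 0, 8 <= s < 16 -> 1, ..., s >= 256 -> 6
    return max(0, min(6, s.bit_length() - 3))

assert estimate_k([0, 0, 0, 0]) == 7
assert estimate_k([2, 0, 13, 3]) == 2     # sum = 18 lies in [16, 32)
```

For the worked example of subsection 2.3.3 the estimate agrees with the exhaustive search, illustrating why the loss in compression ratio is small in practice.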
The advantage of the estimation method over the exhaustive method in reducing the hardware cost is
given in table 5. The cost of the estimation method is the cost of the hardware needed to calculate the
sum in (6). Therefore, two 10-bit adders and one 11-bit adder are required. The estimation method may
occasionally select a non-optimal k-parameter. However, empirical results with a wide range of test
images show that the reduction in compression performance is insignificant, as shown in table 5. In [8]
it is also mathematically proven that the effect of the estimation on the compression performance is
bounded.
Method                       Cost (full adders)   Compression ratio (norm.)
Exhaustive search (exact)    310                  1
Estimation                   31                   0.998
Table 5: HW cost and compression ratio of the estimation method
For applications where the exact exhaustive search is preferred, the method proposed in
subsection 2.4.1 can reduce the hardware cost significantly. However, in this thesis work the
estimation method has been chosen since it is cheaper and the resulting compression ratio is good
enough.
2.5 Improved Lossless Color Buffer Compression Algorithm
As mentioned in [1], our reference algorithm is influenced by the LOCO-I algorithm. It can be
thought of as a low-cost, non-adaptive projection of LOCO-I. This has led us to a deeper analysis of
the ideas behind LOCO-I and hence enabled us to improve the algorithm to get a better
compression ratio, especially for highly compressible images, with negligible extra hardware cost.
The modifications to the reference algorithm are the use of the estimation method (explained in
subsection 2.4.2), modular reduction, a run-length mode and a previous-header flag, which will be
explained in the following subsections.
2.5.1 Modular Reduction
The error residual at the output of the predictor is one bit wider than the data at the predictor
inputs. In our case, the inputs x, x1, x2, x3 are all 9-bit data and the error residual is 10-bit. The
reason for this expansion is the x − x̂ subtraction. However, since the predicted value (x̂) is
known to both the encoder and the decoder, the error residual (e) can in fact be restricted to
values representable with the same number of bits as the input data. Since this data is not
centered around zero, a remapping of large prediction residuals is needed. This is called
modular reduction [4]. Figure 18 illustrates the technique.
Figure 18: Illustration of Modular Reduction (positive and negative prediction cases; the
out-of-range residual e is wrapped back into the [−256, 255] range of the input x)
The effect of modular reduction is twofold. Firstly, it leads to slightly better compression during
the encoding stage, since the absolute value of the error residual is smaller. Secondly, the
compression and decompression hardware blocks have smaller area due to the smaller data size
in the datapath.
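The remapping can be sketched as follows. This is a behavioral model; the bit widths follow the thesis (9-bit signed data in [−256, 255], 10-bit raw residuals), while the helper names are our own:

```python
M = 512  # 9-bit signed input range [-256, 255]

def modular_reduce(e):
    """Map a 10-bit prediction residual e = x - x_hat back into the
    9-bit range [-256, 255] by reduction modulo 512."""
    e %= M
    return e - M if e >= M // 2 else e

def modular_expand(x_hat, e_reduced):
    """Decoder side: recover x, using the fact that the original
    value x is known to lie in [-256, 255]."""
    x = x_hat + e_reduced
    if x < -M // 2:
        x += M
    elif x > M // 2 - 1:
        x -= M
    return x
```

For example, with x̂ = 200 and x = −250 the raw residual is −450, which reduces to 62; the decoder recovers 200 + 62 = 262, wraps it by 512 and obtains −250 again.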
2.5.2 Embedded Alphabet Extension (Run-length Mode)
In subsection 2.3.3 it is mentioned that the header k = 7 is used for the case where all four error
residuals of a subtile are zero during the GR-encoding process. In this case the whole subtile is
encoded with only 3 bits. This removes the redundancy of sending extra terminating bits for
each error residual in a subtile. Although the redundancy is removed within a single subtile
boundary, significant redundancy may still exist among adjacent subtiles. In a graphics
application this corresponds to the case where a whole tile (an 8×8 block of pixels) is covered
by one or two triangles during rasterization. A typical example is the user menus of mobile
devices: a menu typically consists of large icons and several flat regions in the background.
In image compression applications, a quite similar problem exists for large smooth regions of a
still image. In [4] it is stated that, in general, symbol-by-symbol (in our case Golomb-Rice)
encoding of error residuals in low-entropy distributions (large flat regions) results in significant
redundancy. They address this problem by introducing “alphabet extension”. Specifically, the
LOCO-I/JPEG-LS algorithm enters a “run-length mode” when a flat region is encountered.
We used the same idea for more efficient encoding of low-entropy regions. To do this, we
keep track of the headers used for each component of the previous subtile. Whenever all four
headers are 7 (kα = 7, kY = 7, kCg = 7, kCo = 7) the algorithm enters run-length mode. In this mode
we no longer put any bits into the output stream as long as the incoming error residuals to the
Golomb-Rice encoder are zero. Instead, we increment a 4-bit run-length counter by one for each
component; the counter thus indicates the total number of zero error residuals so far.
Whenever a non-zero error residual is encountered, the run-length mode is broken. In this case
the current value of the run-length counter is put into the output stream and the normal mode of
operation resumes.
During decoding, the decoder also keeps track of the headers of the previously decoded subtile.
Hence it enters run-length mode at the same position during traversal. As soon as it enters run-
length mode, it first reads the 4-bit run-length counter value from the stream. Then it outputs
zero error residuals for that many cycles and continues in normal mode.
The 4-bit run-length counter has a fixed range (0–15). This raises the problem of representing
run lengths longer than four subtiles (16 components). The problem is solved by introducing a
run-length flag. During encoding, when the run-length counter becomes 15, a “1” bit is put into
the stream, representing the completion of one 16-component block. Correspondingly, when the
run length is broken, a “0” bit is put into the stream just before the run-length counter value. For
the decoder, each “1” read from the stream means one full 16-component block in run-length
mode, and a following “0” bit designates that the run length is broken.
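The run-length bookkeeping described above can be sketched in software as follows. This is a behavioral model only: the 4-bit counter and the flag bits follow the text (we treat a full block as 16 zero residuals, matching the 4-bit counter range), while the function names and the return convention are our own:

```python
BLOCK = 16  # one full 16-component block (four 2x2 subtiles)

def rle_encode(residuals, emit):
    """Run-length mode encoder loop, entered once all four headers of
    the previous subtile were 7. emit(bits) appends to the output
    stream. Returns (counter value, breaking residual) when the run is
    broken, or (counter, None) at the end of the residual stream."""
    count = 0
    for e in residuals:
        if e == 0:
            count += 1
            if count == BLOCK:
                emit("1")                 # flag: one full block done
                count = 0
        else:
            emit("0")                     # flag: run broken ...
            emit(format(count, "04b"))    # ... then 4-bit counter
            return count, e               # caller resumes GR coding
    return count, None
```

For instance, three zero residuals followed by a non-zero one emit the flag “0” and the counter value 0011 before normal Golomb-Rice encoding resumes.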
The hardware cost of the run-length mode implementation is four 3-bit registers to store the
component headers and a 4-bit run-length counter. Its size relative to the other functional blocks
is given in section 3.5.
2.5.3 Previous Header Flag
Once the headers of the previous subtile are stored in the encoder for run-length mode, better
compression can be achieved by comparing the current header with the previous one. Due to
the spatial correlation among adjacent subtiles, it is likely that these two headers have the same
value. Hence, instead of putting a 3-bit header into the output stream for each subtile, a “0” flag
bit is put, meaning that the current header is the same as the previous header. Conversely, when
the headers differ, a “1” bit is put before the actual header.
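As a sketch, the header emission with the previous-header flag reduces to the following (the function name and string return value are our own; the real hardware emits these bits into the Data_Packer stream):

```python
def emit_header(current_k, prev_k):
    """Return the header bits for a subtile under the previous-header
    flag scheme of subsection 2.5.3."""
    if current_k == prev_k:
        return "0"                          # 1 bit instead of 3
    return "1" + format(current_k, "03b")   # flag + 3-bit header
```

So a subtile whose k equals the previous one costs a single bit, while a changed k costs four bits instead of three.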
Now that all the modifications to the reference algorithm have been introduced, the final
algorithm to be implemented is decided; it includes all the modifications explained in this
section. Moreover, the algorithm is implemented not for tiled traversal but for scan-line
traversal of the input data. Therefore, both the reference algorithm and the modified algorithm
are modeled for a left-to-right scan-line traversal, and the results in table 6 are obtained from
scan-line traversal of the images as well.
It is important to note that the maximum output size for a 32-bit input pixel is 64 bits.
Theoretically, it is therefore possible for the compressed size to be twice the original input size.
However, unless the input is completely noisy, meaningless data, the output size is always
smaller than the input size. The same holds for most other compression algorithms.
2.6 Compression Performances of Algorithms
In order to evaluate the compression performance gained, software models of both algorithms
have been prepared in the MATLAB™ environment. Three different groups of test data have
been used. The first group includes well-known standard photographic test images used for
benchmarking image compression algorithms, taken from [15]. The second group includes
several computer-generated scenes; the first four of them in table 6 are also used in [1] to
benchmark the reference algorithm. The third group includes several menu screen snapshots
typical of mobile devices. Finally, the compression of a completely black image is also evaluated
to observe the performance of the algorithms in the extreme case. All test data are 24-bit color
images in .PNG or .BMP format. The images evaluated are given in Appendix B.
It is important to note that the data used for evaluation are compressed screenshots. This means
that the result does not include the full, incremental rasterization process. An evaluation of the
improvement gained within a real or software-simulated rasterizer framework is definitely of
interest. Nevertheless, we anticipate that the results would be similar or even better during a
rasterization process since an unfinished scene is generally simpler and contains fewer details
than a complete scene. It is already mentioned that the improved algorithm works better on
simpler, compressible scenes. This is also verified in table 6 for group 3 data.
Another important point is that all the input data are 24-bit RGB images, while the algorithms
are modeled for the 32-bit RGBA data format. For the evaluation, the alpha channel of all image
data was padded with eight “0” bits; hence the evaluation is performed with 32-bit RGB0 data
for all input images. This is the reason for the higher-than-expected compression ratios of both
algorithms. For example, the compression ratio for the well-known Lena image is 1.945 and
2.021 for the two algorithms, respectively, while the JPEG-LS compression ratio is reported as
1.773 [16]. JPEG-LS would certainly be expected to compress better than both algorithms
within the same framework.
IMAGE                                              REFERENCE    IMPROVED
                                                   ALGORITHM    ALGORITHM

Group 1 (standard photographic test images, 24-bit color)
  Peppers (512×512)                                2.812        3.016
  Peppers2 (512×512)                               1.769        1.828
  Mandrill (512×512)                               1.542        1.591
  Lena (512×512)                                   1.945        2.021
  House (256×256)                                  2.131        2.226
  Sailboat (512×512)                               1.690        1.744
  Airplane (512×512)                               2.289        2.404
  Average                                          2.025        2.118

Group 2 (computer generated test scenes, 24-bit color)
  Ducks (640×480)                                  2.785        2.991
  Square (640×480)                                 2.937        3.155
  Car (640×480)                                    3.609        4.059
  Quake4 (640×480)                                 3.173        3.469
  Bench_scr1 (640×360)                             2.992        3.253
  Bench_scr2 (640×360)                             2.976        3.249
  Bench_scr4 (640×360)                             3.168        3.567
  Average                                          3.091        3.392

Group 3 (computer generated user menu scenes, 24-bit color)
  Menu1 (240×320)                                  4.684        6.377
  Menu2 (240×320)                                  2.776        3.056
  Menu3 (240×320)                                  1.992        2.068
  Menu4 (240×320)                                  2.700        2.941
  Menu5 (240×320)                                  4.166        5.734
  Menu6 (320×480)                                  3.416        3.803
  Menu7 (320×480)                                  4.606        6.395
  Average                                          3.477        4.340

Group 4
  Black (1280×1024)                                10.667       511.926

Table 6: Comparison of compression performances
2.7 Possible Future Algorithmic Improvements
In this thesis work several solutions have been examined to improve the compression
performance while still keeping the complexity and hardware cost reasonably low. However,
there are still several possibilities for algorithmic and architectural improvements. This chapter
describes some of those techniques proposed by several scientific papers which might be
applicable to image compression for mobile 3D graphics and to be considered as future works in
the area.
2.7.1 Pixel Reordering
This is one of the solutions that have been examined within our work. The objective of this
algorithm is to minimize the header overhead of the Golomb-Rice encoder. The idea is to group
the pixels/subtiles inside a tile based on their Golomb-Rice parameter (k value). This can
increase the compression ratio significantly, since it reduces the header overhead in the stream.
As future work, it would be interesting to investigate the storage required to keep track of the
original positions of the pixels in order to reconstruct them in their original order [13].
2.7.2 Spectral Predictor
As it is mentioned before, the main overhead which degrades the compression performance is in
storing the header in the encoded stream. Improving the predictor might not have a large
contribution into compression performance and this small improvement might not justify having
a more complex and costly predictor. However there is an opportunity to get rid of color
transform block, if we could efficiently take advantage of spectral correlation between the color
components R, G, and B. In order to do so, a spectral predictor is needed which can predict pixel
values of one color component, based on the predicted value of another component for the same
pixel. This method is described in [14] in detail. What is interesting for the area of mobile image
compression is to investigate the cost and complexity of this method, compare it with the total
cost of both color transform block and fixed MED predictor, and measure the compression
performance improvement that could be achieved by using spectral predictor.
2.7.3 CALIC Predictor
Context-Based Adaptive Lossless Image Compression (CALIC) was proposed by Wu and
Memon [5]. The algorithm is built around an adaptive predictor followed by a context-based
arithmetic coder. CALIC uses a gradient-adjusted predictor (GAP), which adapts itself to the
intensity gradients of the neighboring pixels around the pixel under prediction [14].
Figure 19: CALIC GAP prediction window
GAP calculates two intensity-gradient variables as follows:

dh = |Iw − Iww| + |In − Inw| + |In − Ine|
dv = |Iw − Inw| + |In − Inn| + |Ine − Inne|    (7)
It detects and classifies three different kinds of edges, “sharp”, “normal” and “weak”, and the
prediction value is corrected using this classification. The basic idea is that if one gradient is
small and the other is large, the predictor estimates the current pixel value along the direction of
the smaller gradient. Otherwise, if one gradient is only somewhat larger than the other, the
prediction is adjusted according to their difference. The pseudo-code in [5] explains in detail
how GAP works.
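As an illustration, the GAP prediction rule commonly quoted in the CALIC literature can be sketched as follows (the thresholds 80/32/8 are the usual ones; this is a sketch, not necessarily bit-exact to [5], and the neighbor names follow figure 19):

```python
def gap_predict(Iw, In, Ine, Inw, Inn, Inne, Iww):
    """Gradient-adjusted prediction of the current pixel from its
    causal neighbors (w = west, n = north, etc.)."""
    # intensity gradients, equation (7)
    dh = abs(Iw - Iww) + abs(In - Inw) + abs(In - Ine)
    dv = abs(Iw - Inw) + abs(In - Inn) + abs(Ine - Inne)
    if dv - dh > 80:            # sharp horizontal edge
        return Iw
    if dh - dv > 80:            # sharp vertical edge
        return In
    pred = (Iw + In) / 2 + (Ine - Inw) / 4   # smooth-region estimate
    if dv - dh > 32:            # horizontal edge
        pred = (pred + Iw) / 2
    elif dv - dh > 8:           # weak horizontal edge
        pred = (3 * pred + Iw) / 4
    elif dh - dv > 32:          # vertical edge
        pred = (pred + In) / 2
    elif dh - dv > 8:           # weak vertical edge
        pred = (3 * pred + In) / 4
    return pred
```

In a flat region both gradients vanish and the prediction reduces to the average of the west and north neighbors; across a sharp edge it snaps to the neighbor along the edge direction.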
The appealing part of the CALIC algorithm in this context is the predictor. By replacing the
MED predictor with GAP, one could observe how much the compression performance improves
and whether the improvement justifies the extra cost introduced by this more complex
predictor.
2.7.4 Context Information
Context-based algorithms are powerful in terms of compression performance and are used by
most of today’s well-known lossless image compression algorithms, such as LOCO-I/JPEG-LS
and CALIC. Storage elements are needed to store the context information of the image under
compression. In image compression applications where throughput or storage area is not a
bottleneck, it is advantageous to use context-based approaches. However, this may not be the
case in mobile applications, where storing a large amount of context information may not be
affordable. As future work, it would be interesting to investigate how context-based algorithms
could be modified to suit mobile image compression, for instance by using a limited number of
contexts. Our modifications to the original algorithm (especially the previous-header flag) are a
first attempt to use information from previously traversed pixels and can be considered a very
simple form of context: the previous pixel.
Chapter 3
3 Color Buffer Compression/Decompression Hardware
In this chapter we describe the hardware implementation of the color buffer compression
algorithm presented in Chapter 2. The chapter starts with a description of the design constraints
and the design environment. Then, in a hierarchical way, we describe each hardware block of
the design, followed by a description of the functional verification framework. At the end of the
chapter we present and discuss the synthesis results and give a survey of several lossless
compression hardware implementations.
3.1 Design Constraints
The main goal of the thesis was to investigate the hardware implementation properties of a
selected color buffer compression algorithm and to design corresponding synthesizable RTL level
compressor and decompressor block descriptions in VHDL. The hardware has been designed
considering the following constraints:
- Pixels are represented with 32 bits (8-bit integer R, G, B and α channels) before compression;
  the RGB 8880 representation is assumed.
- The blocks are interfaced with single-port memories of 64-bit word length. Together, these
  two properties translate into a constraint of at most 2 pixels read per clock cycle.
- The target clock frequency is 208 MHz in 65 nm technology. This constraint defines the
  longest combinational path allowed in the design.
- The target throughput is one uncompressed pixel per clock cycle, i.e. the compressor should
  be able to process one uncompressed pixel per clock and, conversely, the decompressor
  should produce one uncompressed pixel per clock.
The compressor and decompressor blocks have been designed separately and both have two sets
of memory interfaces, i.e. one to the source memory and one to the destination memory. If a
single memory block is to be used as both source and destination, a separate memory interface
block is needed to coordinate the accesses to the memory unit. The same requirement applies if
the compression and decompression blocks operate concurrently on the same memory blocks.
3.2 Compressor Block
The hierarchical block diagram of the compressor is given in figure 20.
Figure 20: Compressor Block
The top level of the compression block has interfaces to the source and destination memories as
well as the block controlling the compressor. The compression starts with issuing the “start”
signal together with the “start_addr1” and “start_addr2” signals to point the source and
destination addresses. The completion of the operation is communicated through the
“exec_finish” signal from the compressor block.
The interface port description of the compressor is given in table 7.
Port name     Width  Direction  Source / Destination     Description
clk           1      I          Controller               208 MHz clock signal
rst           1      I          Controller               Block reset signal
start         1      I          Controller               Compression start signal
exec_finish   1      O          Controller               Compression complete signal
start_addr1   24     I          Controller               Source memory start address
start_addr2   24     I          Controller               Destination memory start address
rd_req1       1      O          Source mem. controller   Read request from source memory
wr_req1       1      O          Source mem. controller   Write request to source memory
rdy1          1      I          Source mem. controller   Source mem. data available
addr1         24     O          Source memory            Source mem. address bus
input         64     I          Source memory            Source mem. data bus
rd_req2       1      O          Dest. mem. controller    Read request from dest. memory
wr_req2       1      O          Dest. mem. controller    Write request to dest. memory
rdy2          1      I          Dest. mem. controller    Destination mem. data available
addr2         24     O          Destination memory       Destination mem. address bus
output        64     O          Destination memory       Destination mem. data bus

Table 7: Compressor Block Interface Port Description
It should be noted that the compressor block reads data only from source memory and writes data
only into destination memory. Therefore, “wr_req1” and “rd_req2” are connected to logic ‘0’.
The sub-blocks inside the compressor block are discussed in the following subsections:
3.2.1 Addr_Gen1 (Source memory address generator)
This sub-block generates the read addresses for the source memory. Since the algorithm operates
on 2×2-pixel subtiles, the corresponding read addresses need to be generated. The data in the
source memory are assumed to be in left-to-right scan-line order. Assuming a resolution of
w × l pixels and two pixels per memory word, the memory map and the corresponding
pixels of the image are shown in figure 21:
Figure 21: Memory mapping and corresponding pixels of the image
Since the subtile is the unit of processing, the data should be read in subtile order from the
source memory. For example, in figure 21 the first subtile (top left corner) consists of the pixels
p(0), p(1), p(w) and p(w+1). The memory map shows that these pixels are at addresses a and
a + (w/2) in the source memory. Hence, the corresponding addressing scheme is as follows:
[a] → [a+(w/2)] → [a+1] → [a+(w/2 + 1)] → [a+2] → … → [a+(w/2 – 1)] → [a+(w – 1)]
[a+w] → [a+(3w/2)] → [a+(w + 1)] → [a+(3w/2 + 1)] → …
However, this is not the whole story. As will be explained in subsection 3.2.3, the prediction
window of a subtile includes five neighboring pixels above and to the left. As an example, in order to
encode the grey subtile in figure 21, p(w+1), p(2w+1), p(3w+1), p(w+2) and p(w+3) pixel
values should be available. Figure 22 further illustrates the change of prediction window from
one subtile to the next subtile.
Figure 22: Traversal in prediction window
In order to encode the second subtile, shown on the right side of figure 22, the pixel data at all
six addresses shown in the figure are needed. However, the data at addresses “A”, “A + w/2”
and “A + w” have already been read in order to encode the first subtile, shown on the left side of
the figure. The conclusion is that to encode the second subtile, three read operations should be
performed, to addresses “A + 1”, “A + w/2 + 1” and “A + w + 1”. The arrows in the figure
indicate the basic addressing scheme followed by the address generator.
Encoding one subtile takes four cycles, while three cycles are enough to read the required data
from memory. In the spare cycle the memory bus is released to allow better usage of the memory
bandwidth; during this cycle the “address_valid” signal is ‘0’.
All these considerations lead to the basic cyclic addressing scheme of:

            cycle 1    cycle 2    cycle 3    cycle 4
Operation   read       -          read       read
Address     +(w/2)     keep       +(w/2)     -(w - 1)

Table 8: Source memory address generator addressing scheme
The ends of lines and the first line of the image need special treatment. The first line of the
image does not have pixels above it, hence that read is not needed. The change of the prediction
window at the end of a line changes the addressing scheme there; the scheme for the last subtile
of a line is [“+(w/2)”, keep, “+(w/2)”, “-(w/2 - 1)”].
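The cyclic scheme of table 8 can be modeled in software as follows. This is a sketch of the steady-state behavior only: `A` is the address of a subtile's upper memory word, `w` the image width in pixels (two pixels per 64-bit word), and the first-line and end-of-line special cases are omitted:

```python
def read_sequence(A, w, subtiles):
    """Per-subtile cycle sequence of the source address generator:
    three reads and one idle cycle, with the address register updated
    as in table 8 (+w/2, keep, +w/2, -(w-1))."""
    ops = []
    addr = A
    for _ in range(subtiles):
        ops.append(("read", addr)); addr += w // 2   # cycle 1
        ops.append(("idle", addr))                   # cycle 2: bus released
        ops.append(("read", addr)); addr += w // 2   # cycle 3
        ops.append(("read", addr)); addr -= w - 1    # cycle 4
    return ops
```

For A = 0 and w = 8 the generated read addresses are 0, 4, 8 for the first subtile and 1, 5, 9 for the next, matching the “A + 1”, “A + w/2 + 1”, “A + w + 1” pattern of figure 22.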
The block interface and the hardware diagram of the source memory address generator are shown
in figures 23 and 24, respectively. The address register is a 24-bit register, which is the assumed
address bus width, corresponding to 128 MB of RAM. The multiplexer select inputs are
determined by a simple state machine using the block inputs “enable”, “end_of_line”,
“end_of_image” and “first_line”. The synthesis results of the sub-block are given in section 3.5.
Figure 23: Address Generator I interface
Figure 24: Address Generator I Hardware Diagram
3.2.2 Color_T (Color Transformer)
The color transform block performs the RGB → YCoCg conversion explained in subsection
2.3.1. The block interface is shown in figure 4 in subsection 2.3.1, and the hardware diagram of
the sub-block is given in figure 25.
Figure 25: Color Transform Hardware Diagram
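The datapath appears to correspond to the reversible YCoCg-R lifting transform (the shift-and-add structure with the intermediate value t suggests this; the helper names below are our own). A bit-exact software sketch:

```python
def rgb_to_ycocg(r, g, b):
    """Forward lossless YCoCg-R lifting transform: Y stays 8-bit,
    Co and Cg become 9-bit signed, and the transform is exactly
    invertible (required for lossless compression)."""
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg

def ycocg_to_rgb(y, co, cg):
    """Inverse transform, undoing the lifting steps in reverse order."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b
```

Because every lifting step is undone exactly, the round trip RGB → YCoCg → RGB is lossless for all 8-bit inputs.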
The synthesis result of the sub-block is given in section 3.5.
3.2.3 Pred_RegFile_Ctrl (Prediction Register File Controller)
This sub-block is responsible for providing the predictor block with current pixel (x) as well as
three neighboring pixels in its prediction window (x1, x2, x3) as shown in figure 6. The block
interface is shown in figure 26.
Figure 26: Prediction Register File Controller interface
The operation is controlled by the compression control block through the signals “enable”,
“end_of_line” and “end_of_image”. In each cycle this sub-block receives two pixels (p1, p2),
corresponding to one memory word read from memory, and outputs the (x, x1, x2, x3) pixels to
the combinational predictor block. This is performed for each pixel of one 2×2 subtile before
passing to the next subtile. Figure 27 shows the pixels involved in the prediction of one subtile.
Figure 27: Change of prediction window for pixels of one subtile
The figure clearly shows that 9 pixels are involved in the prediction of one subtile. However, at
any time instant at most 7 pixel values need to be stored. (In figure 27, before step 2 the lower-
right two pixels of the subtile have not been read in yet; after step 2 the upper-left two pixels are
no longer used and can be overwritten by incoming pixels.) Therefore, this block includes
seven 9-bit registers and a state machine controlling the data transfers among them as well as the
input and output.
The basic data transfer scheme is shown in figure 28. The block outputs are taken directly from
registers X, X1, X2, X3, while registers A, B, C are used for temporary storage. As an example,
the figure shows the input connectivity of register X3: it receives data only from register X1 and
register A, in different states. Different data transfer schemes resulting in the same functionality
are possible; however, the connectivity affects the MUX sizes at the register inputs and hence
the hardware cost. In this scheme, care has been taken to use 4:1 MUXes or smaller.
Figure 28: States and register input connectivity in Prediction Register File Controller
It is also important to note that state S4 of this scheme changes at the end of a line, due to the
change in the prediction window.
This sub-block is instantiated 4 times in the design, corresponding to the four components Y,
Cg, Co and α. The synthesis results are given in section 3.5.
State   X    X1   X2   X3   A    B    C
S1      P2   P2   C    X    X1   X2   B
S2      P1   P2   X2   B    A    X1   X
S3      C    P2   B    X    X1   A    P1
S4      P1   C    B    X1   A    X3   X
3.2.4 Predictor
The MED predictor explained in subsection 2.3.2 has been realized with the hardware shown in
figures 29 and 30.
Figure 29 shows the prediction hardware common to both the predictor and the constructor.
This hardware block generates the predicted value (x̂) from the neighboring pixels x1, x2, x3.
Figure 29: MED Prediction Hardware for both predictor and constructor
The predictor block is given in figure 30. The block performs the x − x̂ subtraction, modular
reduction and signed-to-unsigned conversion as explained in subsection 2.5.1.
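The MED prediction computed by this hardware follows the standard LOCO-I median edge detector. As a software sketch (which of x1, x2, x3 is the left, above, or diagonal neighbor is our naming assumption here):

```python
def med_predict(x1, x2, x3):
    """Median edge detector: x1 and x2 are the two causal neighbors
    (e.g. left and above) and x3 is the diagonal neighbor between
    them. Equivalent to median(x1, x2, x1 + x2 - x3)."""
    if x3 >= max(x1, x2):
        return min(x1, x2)   # edge toward the smaller neighbor
    if x3 <= min(x1, x2):
        return max(x1, x2)   # edge toward the larger neighbor
    return x1 + x2 - x3      # smooth region: planar prediction
```

The comparator outputs s2, s1, s0 of figure 29 select among the same three candidates x1, x2 and x1 + x2 − x3.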
Figure 30: Predictor block Hardware diagram
This sub-block is instantiated 4 times in the design, corresponding to the four components Y,
Cg, Co and α. The synthesis result of the sub-block is given in section 3.5.
3.2.5 Enc_RegFile_Ctrl (Golomb-Rice Encoder Register File Controller)
This sub-block is responsible for preparing the data for the GR encoder. More precisely, it
performs pixel-to-subtile conversion, i.e. at each clock cycle it receives the 4 component error residuals
corresponding to one pixel from the predictors, and it outputs one component of 4 pixels (a
subtile) to be encoded together in the GR encoder.
The block interface is shown in Figure 31.
Figure 31: Encoder Register File Controller block interface
As the figure suggests, this sub-block consists of 16 9-bit registers organized as a small 4×4
transpose memory. This means that, in an alternating fashion, the registers are filled (and
simultaneously read out) column-wise first and then read out (and simultaneously filled)
row-wise, in a FIFO manner. The conversion from 4 components of one pixel to one component
of four pixels is performed this way. The alternation is realized with a simple 2-state state
machine, which is started and stopped by the signals “enable” and “end_of_image” coming from
the compression control block.
Since all registers can be loaded both column-wise and row-wise, there are 2:1 MUXes at their
inputs (except for the topmost-left register, which has only one input). Also, since the block
outputs can come from two registers (except for the lowermost-right one), there are three 2:1
MUXes at the outputs.
The synthesis result of this sub-block is given in section 3.5.
3.2.6 GR_Encoder (Golomb-Rice Encoder)
The main task of the Golomb-Rice encoder, as described before, is to generate code and length
values. At each clock cycle it receives the residual values of the four pixels of one subtile (e1,
e2, e3 and e4) for one component, and generates the corresponding “code”, “length” and
“header” values for each pixel.
The GR_Encoder block consists of three sub-blocks, GR_k, Enc and GR_ctrl, as shown in
figure 32.
Figure 32: Golomb-Rice Encoder block diagram
The synthesis result of GR_Encoder is given in section 3.5.
3.2.6.1 GR_k Block (Golomb-Rice Parameter Estimation)
This block is responsible for determining the Golomb-Rice parameter, i.e. the best k. It uses the
estimation formula described in subsection 2.4.2. The hardware that determines the k-parameter
is given in figure 33.
x7 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] OR s[5] OR s[4] OR s[3] OR s[2] OR s[1] OR s[0]
x6 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] OR s[5] OR s[4] OR s[3]
x5 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] OR s[5] OR s[4]
x4 = s[10] OR s[9] OR s[8] OR s[7] OR s[6] OR s[5]
x3 = s[10] OR s[9] OR s[8] OR s[7] OR s[6]
x2 = s[10] OR s[9] OR s[8] OR s[7]
x1 = s[10] OR s[9] OR s[8]                                                  (8)
Equation (8) shows the cases corresponding to table 9.
Figure 33: k-Parameter Estimation Hardware
The estimation works based on table 4 using equation (6).
sum                 sum[10:0]       K
sum = 0             00000000000     7
0 < sum < 8         00000000XXX     0
8 ≤ sum < 16        00000001XXX     1
16 ≤ sum < 32       0000001XXXX     2
32 ≤ sum < 64       000001XXXXX     3
64 ≤ sum < 128      00001XXXXXX     4
128 ≤ sum < 256     0001XXXXXXX     5
sum ≥ 256           XX1XXXXXXXX     6

Table 9: Estimation Function
The synthesis result of GR_k is given in section 3.5.
3.2.6.2 Enc Block (Encoding Block)
The other sub-block, Enc, is responsible for performing the encoding according to its inputs, k
and e; its outputs are length and code. Considering the throughput constraint of one pixel per
cycle, four instances of the encoder are necessary. The encoder hardware is shown in figure 34.
The first two multiplexers on the left determine the quotient and the remainder of the division
e / 2^k, respectively. One addition is performed to calculate the length value, k + q + 1, and the
last multiplexer and the OR function generate the output code by appending the binary
remainder r to the unary quotient q.
The synthesis result of Enc is given in section 3.5.
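The codeword computed by this datapath is a standard Golomb-Rice code. In software (a sketch; the unary polarity, i.e. whether the quotient is sent as zeros terminated by a one or the reverse, is an assumption here):

```python
def gr_encode(e, k):
    """Golomb-Rice encoding of a non-negative residual e with
    parameter k: the quotient q = e >> k in unary (q zeros and a
    terminating '1'), followed by the k-bit binary remainder.
    Returns (code string, length), with length = k + q + 1 as in
    figure 34."""
    q, r = e >> k, e & ((1 << k) - 1)
    code = "0" * q + "1"
    if k:
        code += format(r, "0%db" % k)
    return code, k + q + 1
```

For example, e = 5 with k = 2 gives q = 1, r = 1 and the 4-bit codeword 0101.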
Figure 34: Golomb-Rice Encoder Realization
3.2.6.3 GR_ctrl (Golomb-Rice Control Block)
This sub-block is an FSM added to the design to handle the new compression algorithm, which
improves the compression performance. By adding only this block, it is possible to make use of
the new algorithm without changing the two previous blocks. The encoded header format is not
always the same; it changes according to the state of operation. The state encoding is defined by
the Golomb-Rice control, based on the previous k-parameters. The possible encoded formats are
given in table 10.
Mode        Header format                                        Header length  Condition
start       {header}                                             3              Only first subtile of image
normal      {flag = ‘0’}                                         1              current_k = prev_k
            {flag = ‘1’, header}                                 4              current_k ≠ prev_k
run-length  {-}                                                  0              run_length_counter < 15
            {run_length_flag = ‘1’}                              1              End of image
            {run_length_flag = ‘1’}                              1              run_length_counter = 15
            {run_length_flag = ‘0’, run_length_counter, header}  8              Run-length mode broken
            {run_length_flag = ‘1’, run_length_flag = ‘1’}       2              run_length_counter = 15 and end of image

Table 10: Header format generated by GR_ctrl block
The synthesis result of GR_ctrl is given in section 3.5.
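As an illustration of the normal-mode rows of table 10, the header selection can be modelled as below. This is a sketch of the table's logic only (the start and run-length rows are omitted), and the 3-bit width of the k field is inferred from the header lengths in the table:

```python
def normal_mode_header(current_k: int, prev_k: int) -> str:
    """Return the header bit string for 'normal' mode (table 10)."""
    if current_k == prev_k:
        return "0"                          # 1-bit header: flag only
    return "1" + format(current_k, "03b")   # 4-bit header: flag + new k
```

The two branches reproduce the 1-bit and 4-bit header lengths of the normal-mode rows.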
3.2.7 Data_Packer (Variable Bit Length Packer to Memory Word)
The output of the GR_Encoder block is five pairs of code and length registers. In each pair, the
length register gives the number of valid bits in the code register. According to the throughput
requirements, the values of these registers must be stored at each clock cycle in a memory with
64-bit word length. A piece of hardware called the data packer is needed in order to combine
these variable-length codes and store them in the memory each cycle, while keeping track of the
next empty buffer cell for the next cycle. The data packer hardware consists of four stages, as
shown in figure 37; at each stage a certain hardware block is used.
In the first stage, there are two instantiations of a block called P1. This block takes two variable-
length codes together with information about their lengths, combines them by concatenation
(shift & OR), and outputs the result together with a new length value, which is the sum of the
original ones. The outputs of the two P1 blocks are the inputs to the P2 block. This block, as well
as the next-stage block P4, does exactly what P1 does but with different register word lengths.
The last stage contains one P3 block, the only sequential block in the data packer; it performs the
final combination and keeps track of the next empty buffer cell. Whenever 64 bits of packed data
are ready, P3 gives them out and issues a ready
signal, which is used in the control block to generate a write request to the memory. The hardware
realization of the P3 block of the data packer is given in figure 35.
Figure 35: P3 block, basic hardware realization
The output of the data packer is stored in the memory in the format shown in figure 36.
Figure 36: Packed data order format in the memory
Figure 37: Data Packer
As shown in figure 40, the data packer is not considered part of the datapath, because the
hardware overhead of this block is independent of how the datapath is designed. The data packer
design is not optimized, since it was not the focus of this work. According to the design
constraints, the data packer must be capable of packing four codes per clock cycle and giving out
the result whenever 64 bits of packed data are ready.
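The packing behaviour can be modelled in software as follows. This is a behavioural sketch of the shift & OR concatenation and the P3 buffer bookkeeping, done here in one loop rather than the P1/P2/P4/P3 tree of the actual hardware:

```python
class DataPackerModel:
    """Behavioural model: pack (value, length) codes into 64-bit words."""

    def __init__(self) -> None:
        self.buf = 0      # partially filled output word (P3 buffer)
        self.fill = 0     # number of valid bits in buf
        self.words = []   # completed 64-bit memory words

    def pack(self, codes) -> None:
        """Pack one clock cycle's worth of (value, length) pairs."""
        for value, length in codes:
            self.buf = (self.buf << length) | value   # shift & OR
            self.fill += length
            if self.fill >= 64:                       # a word is ready
                spill = self.fill - 64                # leftover bits
                self.words.append(self.buf >> spill)
                self.buf &= (1 << spill) - 1          # keep the spill
                self.fill = spill
```

For instance, sixteen 4-bit codes fill exactly one 64-bit memory word.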
3.2.8 Addr_Gen2 (Destination memory address generator)
This sub-block is responsible for generating destination memory addresses for the compressed
data. The compressed data, packed into 64-bit memory words, are written to consecutive
locations in the memory. Therefore this sub-block is simply a 24-bit counter with parallel load,
which loads the destination memory start address at the beginning of the compression operation.
The block interface is shown in figure 38.
Figure 38: Destination memory address generator block interface
The synthesis results of this sub-block are given in section 3.5.
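A behavioural sketch of such a loadable counter (illustrative only; the real block is VHDL):

```python
class AddrGen2Model:
    """24-bit address counter with parallel load (behavioural sketch)."""

    MASK = (1 << 24) - 1  # 24-bit address space

    def __init__(self) -> None:
        self.addr = 0

    def tick(self, load_en: bool, count_en: bool, start_addr: int = 0) -> int:
        if load_en:                                   # parallel load
            self.addr = start_addr & self.MASK
        elif count_en:                                # count up, 24-bit wrap
            self.addr = (self.addr + 1) & self.MASK
        return self.addr
```

Load takes priority over counting, matching the load_en/count_en interface of figure 38.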
3.2.9 Compressor_Ctrl (Control Path)
This sub-block is responsible for controlling all other blocks in the compression hardware. More
specifically, it is responsible for starting the operation, stalling the datapath when memory is not
available, and providing the other sub-blocks with image traversal information such as end of
line, first line, etc. The block interface is shown in figure 39.
Figure 39: Control path block interface
The control path basically includes a 25-bit pixel counter to keep track of the current position in
the image. Comparators compare the pixel counter with specific values in order to compute
locations in the image such as the first line, end of line and end of image. The enable signals of
the datapath and address generator sub-blocks are used to stall the pipeline whenever the
memories are not available (the pipeline is stalled according to the “rdy” input signals coming
from the memories or memory controllers).
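The comparator logic can be illustrated as follows. Plain pixel indices and row-major traversal are assumed in this sketch, while the real block works at subtile granularity with a 25-bit counter:

```python
def traversal_flags(pixel_count: int, width: int, height: int) -> dict:
    """Derive image-traversal flags from a pixel counter by comparison."""
    x = pixel_count % width    # column within the current line
    y = pixel_count // width   # current line number
    return {
        "first_line":   y == 0,
        "end_of_line":  x == width - 1,
        "end_of_image": pixel_count == width * height - 1,
    }
```

Each flag is a single comparison against a value derived from the image dimensions, which is why this part of the control path is comparator-dominated.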
3.2.10 Overall Compressor Datapath and Address Generation
Figure 40: Overall Compressor
Figure 40 shows the complete hardware for the compressor datapath and address generator blocks.
The design is divided into three main blocks. Our effort has mainly been on the datapath and
address generator blocks. The synthesizable control path and data packer blocks are designed to
ensure correct overall functionality under the given constraints. However, these blocks are not
optimized and need more detailed design consideration. The synthesis result of the datapath is
given in section 3.5.
3.3 Decompressor Block
The block diagram of the decompressor is given in figure 41.
Figure 41: Decompressor Block
The interface port description of the decompressor is given in table 11.
Port name Width Direction Source / Dest Description
clk 1 I Controller 208 MHz clock signal
rst 1 I Controller Block reset signal
start 1 I Controller Decompression start signal
finish 1 O Controller Decompression complete signal
start_addr1 24 I Controller Destination memory start address
start_addr2 24 I Controller Source memory start address
rd_req1 1 O Dest. mem. controller Read request from dest. memory
wr_req1 1 O Dest. mem. controller Write request to dest memory
rdy1 1 I Dest. mem. controller Destination mem. data available
addr1 24 O Dest. memory Destination mem. address bus
Input1 64 I Dest. memory Destination mem. data bus
Output1 64 O Dest. memory Destination mem. data bus
rd_req2 1 O Source mem. controller Read request from source memory
wr_req2 1 O Source mem. controller Write request to source memory
rdy2 1 I Source mem. controller Source mem. data available
addr2 24 O Source memory Source mem. address bus
Input2 64 I Source memory Source mem. data bus
Output2 64 O Source memory Source mem. data bus
Table 11: Decompressor Block Interface Port Description
It should be noted that, similar to the compressor, the decompressor block writes data only into
the destination memory. However, unlike the compressor, it reads data from both the source
memory and the destination memory. Therefore, only “wr_req2” is connected to logic ‘0’.
The sub-blocks inside the decompressor block are discussed in the following subsections:
3.3.1 Addr_Gen2 (Source memory address generator)
This sub-block is responsible for generating source memory addresses for reading in compressed
data. Since the compressed data, packed into 64-bit memory words, are located in consecutive
locations in the memory, this sub-block is simply a 24-bit counter with parallel load, which loads
the source memory start address at the beginning of the decompression operation.
The block interface is shown in figure 42.
Figure 42: Source memory address generator block interface
The synthesis results of this sub-block are given in section 3.5.
3.3.2 Rev_Color_T (Reverse Color Transformer)
The reverse color transform sub-block performs the YCoCg → RGB conversion explained in
subsection 2.3.1. The block interface is shown in figure 4 in subsection 2.3.1. The hardware
diagram of the sub-block is given in figure 43.
Figure 43: Reverse Color Transform hardware diagram
The synthesis result of this sub-block is given in section 3.5.
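The conversion can be illustrated in software. This sketch assumes the reversible, lifting-based YCoCg-R variant, which matches the shift-and-add structure of figure 43 but should be checked against the exact transform of subsection 2.3.1:

```python
def rgb_to_ycocg(r: int, g: int, b: int):
    """Forward transform (for reference), reversible YCoCg-R style."""
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg

def ycocg_to_rgb(y: int, co: int, cg: int):
    """Inverse transform: exact integer inverse of the forward steps."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b
```

Because each step is a lifting step (undone in reverse order), the round trip is exact for all 8-bit inputs, which is what makes the transform usable in a lossless pipeline.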
3.3.3 Const_RegFile_Ctrl (Construction Register File Controller)
The functionality of this sub-block is similar to the Pred_RegFile_Ctrl block in the compressor
datapath. It is responsible for providing the constructor block with the three neighboring pixels
(x1, x2, x3) in the prediction window of the current pixel to be constructed, as shown in figure 6.
The block interface is shown in figure 44.
Figure 44: Construction Register File Controller interface
The operation is controlled by the decompression control block through the signals “enable”,
“end_of_line” and “end_of_image”. In each cycle, this sub-block receives one pixel (p) read from
memory as well as the currently constructed pixel (x), to be used for subsequent predictions. The
sub-block outputs the pixels (x1, x2, x3) to the combinational constructor block. This is performed
for each pixel of one 2×2 subtile before passing to the next subtile.
The functionality is slightly different from the Pred_RegFile_Ctrl sub-block in the compressor
datapath in the sense that the current pixel (x) is not an input to the constructor but an output.
Hence it is not provided by the Const_RegFile_Ctrl sub-block. This leads to different storage
requirements in the block: at any time instant, at most 5 pixel values need to be stored. The block
includes five 9-bit registers and a state machine controlling the data transfers among them as well
as the input and output. The basic data transfer scheme is shown in figure 45. The block outputs
come directly from registers X1, X2 and X3, while registers A and B are used for temporary
storage of data. As an example, the figure shows the input connectivity of register X3, i.e. register
X3 receives data only from register X1 and register A in different states. Different data transfer
schemes resulting in the same
functionality are possible; however, the connectivity affects the MUX sizes at the inputs of the
registers and hence the hardware cost. Again, care is taken in the scheme to use 4-to-1 MUXes or
smaller, and state S4 of this scheme changes at the end of lines due to the change in the
prediction window.
Figure 45: States and register input connectivity in Construction Register File Controller
This sub-block is instantiated 4 times in the design, corresponding to the four components Y,
Cg, Co and α. The synthesis results are given in section 3.5.
3.3.4 Constructor
The constructor block uses the same prediction hardware (figure 29) as the predictor. The sub-
block performs unsigned-to-signed conversion, modular correction and the addition with the
prediction x̂, as explained in subsection 2.3.2. The constructor hardware is given in figure 46.
This sub-block is instantiated 4 times in the design, corresponding to the four components Y,
Cg, Co and α. The synthesis results are given in section 3.5.
States and register input connectivity (figure 45):

        X1   X2   X3   A    B
S1      P    X    A    X2   B
S2      X2   B    A    X    B
S3      A    X    X1   P    B
S4      P    X1   A    P    X
Figure 46: Constructor block Hardware diagram
3.3.5 Dec_RegFile_Ctrl (Golomb-Rice Decoder Register File Controller)
This sub-block performs the inverse operation of the Enc_RegFile_Ctrl sub-block in the
compression datapath. More precisely, it performs subtile-to-pixel conversion this time. At each
clock cycle, it receives the error residuals of one component of the 4 pixels of a subtile, decoded
by the GR_Decoder, and outputs the 4 component error residuals corresponding to one pixel to
the corresponding predictors.
The block interface is shown in Figure 47.
Figure 47: Decoder Register File Controller block interface
The design of this sub-block is identical to the Enc_RegFile_Ctrl sub-block of the compression
datapath. Refer to subsection 3.2.5 for details. The synthesis result of this sub-block is given in
section 3.5.
3.3.6 GR_Decoder (Golomb-Rice Decoder)
A very simple circuit is used as the GR_Decoder. In order to fulfill the throughput requirements,
four instances of this block are needed in the design, since all four pixel errors must be ready at
the same time. The inputs to this block are the quotient q, the residual r, and the Golomb-Rice
parameter k; the output is the pixel error e, computed using (5), which can be realized as shown
in figure 48.
Figure 48: Golomb-Rice Decoder hardware
The synthesis result of GR_Decoder is given in section 3.5.
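The decoding of equation (5) amounts to a single shift and OR; an illustrative software model:

```python
def gr_decode(q: int, r: int, k: int) -> int:
    """Reassemble the pixel error from quotient, remainder and k:
    e = q * 2^k + r  (equation (5))."""
    return (q << k) | r
```

For example, q = 3, r = 1, k = 2 recovers e = 13. The simplicity of this expression is why four parallel decoder instances remain cheap.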
3.3.7 Data_Unpacker (Variable Bit Length Unpacker from Memory Word)
This sub-block performs the reverse task of the data packer block in the compression path. It
consists of two sub-blocks, unpacker and GR_ctrl. The unpacker block receives a 64-bit data
stream from the memory and extracts the four codes as well as the Golomb-Rice parameter that
was used for encoding the data. The codes are given out in terms of quotients, q, and residuals, r.
The GR_ctrl block is designed based on the GR_ctrl in the compressor block. As a control block,
it provides information about the state and the k-parameter of the previous cycle. GR_ctrl is
necessary in order to be able to use our new algorithm: it supplies the unpacker with the previous
k-parameter as well as the current state of operation. This information is necessary for the
decoding procedure since, as mentioned earlier, there are several data formats in the stream. In
order to fulfill the throughput requirements, the data unpacker must be capable of producing four
output codes per cycle.
Figure 49: Data Unpacker Interface and block diagram
As shown in figure 53, the data unpacker is not considered part of the datapath, because the
hardware overhead of this block is independent of how the datapath is designed. The data
unpacker design is not optimized, since it was not the focus of this thesis work.
3.3.8 Addr_Gen1 (Destination memory address generator)
This sub-block is responsible for generating addresses for destination memory. The generated
addresses can be both read addresses and write addresses. The write operation writes
uncompressed output data to the destination memory. The read operation is required for the
prediction operation inside the constructor block.
Figure 50: Read / Write Addresses from/to destination memory to construct one subtile
Figure 50 shows that the blue subtile needs two write operations, to addresses “A” and “A + w/2”.
However, a read from address “A - w/2” (shown with a circle) must be performed beforehand in
order to construct this subtile. The conclusion is that, for each subtile, one read and two write
operations must be performed and the corresponding addresses need to be generated.
During implementation, due to the pipeline latency from the destination memory through the
color transform to the constructor, the read data must be requested two cycles before it is used.
Hence, the actual addressing scheme is shown in figure 51 and given in table 12.
Figure 51: Actual addressing scheme for destination memory addresses
Decoding one subtile takes four cycles, while three cycles are enough to write/read the required
data to/from memory. In the spare cycle the memory bus is released to allow better use of the
memory bandwidth; in this cycle both the “rd_valid” and “wr_valid” signals are ‘0’.
All these considerations lead to the following basic cyclic addressing scheme:
            cycle 1       cycle 2       cycle 3     cycle 4
Operation   write         read          write       -
Address     -(w/2 - 1)    -(w/2 - 2)    +(w - 2)    keep
Table 12: Destination memory address generator addressing scheme
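The cyclic scheme of table 12 can be modelled as follows. Whether each delta is applied before or after the access, and the exact idle-cycle behaviour, are assumptions in this sketch:

```python
def subtile_accesses(addr: int, w: int):
    """Apply table 12's four-cycle scheme to the address register.

    Returns the (operation, address) pairs for one subtile and the
    updated register value. Note that the net change over the four
    deltas is +1, so the address advances by one location per subtile.
    """
    deltas = [("write", -(w // 2 - 1)),
              ("read",  -(w // 2 - 2)),
              ("write",   w - 2),
              ("-",       0)]           # spare cycle: bus released
    accesses = []
    for op, delta in deltas:
        addr += delta
        accesses.append((op, addr))
    return accesses, addr
```

For an image width of 8 starting from address 100, the pattern is write 97, read 95, write 101, idle, leaving the register at 101 for the next subtile.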
The ends of lines and the first line of the image again need special treatment. The first line of the
image has no pixels above it, hence no read is needed. The prediction window change at the end
of a line changes the addressing scheme for the last two subtiles of each line.
The block diagram is given in figure 52 and the synthesis results are in section 3.5.
Figure 52: Destination memory address generator block interface
3.3.9 Decompressor_Ctrl (Control Path)
This sub-block is responsible for controlling all other sub-blocks in the decompression hardware.
More specifically, it is responsible for starting the operation, stalling the datapath when memory
is not available, and providing the other sub-blocks with image traversal information such as end
of line, first line, etc.
The control path basically includes a 25-bit pixel counter to keep track of the current position in
the image. Comparators compare the pixel counter with specific values in order to compute
locations in the image such as the first line, end of line and end of image. The enable signals of
the datapath and address generator sub-blocks are used to stall the pipeline whenever the
memories are not available (the pipeline is stalled according to the “rdy” input signals coming
from the memories or memory controllers).
3.3.10 Overall Decompressor Datapath and Address Generation
Figure 53 shows the complete hardware for the decompressor datapath and address generator
blocks. The design is divided into three main blocks. Our effort has mainly been on the datapath
and address generator blocks. The synthesizable control path and data unpacker blocks are
designed to ensure correct overall functionality under the given constraints. However, these
blocks are not optimized and need more detailed design consideration. The synthesis result of the
datapath is given in section 3.5.
Figure 53: Overall Decompressor
3.4 Functional Verification Framework
In order to verify the functionality of the design, a verification framework has been designed.
This framework consists of two RAM blocks, a counter to generate the memory addresses, and a
control block, which is a finite state machine. There are three modes of operation during the
functional verification. The input image to be compressed has already been converted to binary
representation in MATLAB™ and stored in a text file used as the input file to the framework.
This file has 64 columns, equal to the 64-bit memory word length. The number of lines in the file
depends on the image size, and the memory size therefore has to be adjusted accordingly.
In the first mode, the content of the input file is stored in the source memory, RAM_1. This data
transfer takes x clock cycles, where x is the number of lines in the input file. The FSM then issues
the “exec_start” signal and the operation enters the next mode, in which the
compressor/decompressor performs its task and sends the result to the destination memory,
RAM_2. When it issues the “exec_finish” signal, the FSM changes to the final mode, in which
the content of RAM_2 is written to the output text file. The verification FSM is shown in figure
54. The functional verification blocks are shown in figure 55.
For each test image, two output text files are generated: one by this RTL verification framework
from the VHDL code and the other by the equivalent algorithm in MATLAB™. The correct
functionality of the design is verified for several test images by comparing these two text files
and checking that they are identical.
Figure 54: Verification Framework FSM
Figure 55: Functional Verification Framework
3.5 Synthesis Results
The compressor and decompressor datapath blocks are synthesized with the target clock
frequency of 208 MHz.
The overall block size of the compressor is 10.55 kgates. This size includes the datapath blocks,
the input/output data registers and the address generators. 61.3% of this block is combinational
logic and the rest is sequential. The overall block includes 618 registers.
The sub-block sizes inside the datapath are also of interest: table 13 shows the hierarchical area
distribution of the whole block.
Block Name                                Area (kgates)
Compressor 10.55
- Color Transform 1 0.23
- Color Transform 2 0.21
- Golomb-Rice Encoder 2.34
Encoder 1 0.37
Encoder 2 0.36
Encoder 3 0.37
Encoder 4 0.37
GR control 0.48
GR parameter estimation 0.36
- Input Preparation 0.67
- Address generator 1 0.59
- Address generator 2 0.45
- Encoder register file control 1.87
- Prediction register file control 1 0.65
- Prediction register file control 2 0.66
- Prediction register file control 3 0.72
- Prediction register file control 4 0.73
- Predictor 1 0.33
- Predictor 2 0.38
- Predictor 3 0.39
- Predictor 4 0.34
Table 13: Compressor Synthesis Result
The overall block size of the decompressor is 9.23 kgates. This size includes the datapath blocks,
the input/output data registers and the address generators. 58.8% of this block is combinational
logic and the rest is sequential. The overall block includes 584 registers.
The sub-block sizes inside the datapath are also of interest: table 14 shows the hierarchical area
distribution of the whole block.
Block Name                                Area (kgates)
Decompressor 9.23
- Color Transform 0.29
- Golomb-Rice Decoder 0.27
- Output Preparation 1.46
- Reverse Color Transform 0.18
- Address generator 1 0.68
- Address generator 2 0.46
- Construction register file control 1 0.51
- Construction register file control 2 0.51
- Construction register file control 3 0.58
- Construction register file control 4 0.58
- Constructor 1 0.31
- Constructor 2 0.38
- Constructor 3 0.38
- Constructor 4 0.31
- Decoder register file control 1.58
Table 14: Decompressor Synthesis Result
An important result that can be extracted from the synthesis is that the functional blocks
constitute a relatively small portion of the overall cost. The more costly operations are those
related to the traversal of the image. Also, due to the high throughput requirement (one
pixel/clock in our case), the pipeline registers and temporary storage registers constitute a
significant portion of the overall size. In that sense, it would not be wrong to claim that fast
implementations of such simple algorithms are control dominated in terms of hardware cost.
Another result is that the GR_ctrl block, which was added to improve the compression ratio,
takes only 480 gates to implement, corresponding to 4.5% of the overall compressor datapath.
3.6 Evaluation of Other Hardware Implementations
During this thesis work, several hardware implementations of lossless compression algorithms
were investigated. Most implementations target either medical applications, such as wireless
endoscopy systems, or space applications. The majority of the implementations are based on
LOCO-I / JPEG-LS, with minor modifications to adapt it better to hardware constraints such as
speed and area. Another common feature is that generally only the compression side is
implemented, and most algorithms are implemented for compressing 8-bit pixels. In this section
a survey of the investigated implementations is given along with their basic features.
3.6.1 Parallel pipeline Implementation of LOCO-I for JPEG-LS [17]
This is a parallel, pipelined version of a modified LOCO-I lossless compression algorithm used
within the JPEG-LS coding scheme. It doubles the memory required for context statistics, but
achieves a speed-up of almost 2. The latency is 8 clock cycles, and it yields an encoding speed in
the range of 1.1-1.7 pixels/clock.
The context memories are dual-port devices, each consisting of 368 38-bit words. The synthesis
libraries are the ST HCMOS9 0.13 μm process libraries. The synthesized encoder uses a total of
539521 μm², or 76660 equivalent gates.
3.6.2 Benchmarking and Hardware Implementation of JPEG-LS [18]
This is a low-complexity version of the JPEG-LS algorithm. A shared memory architecture is
used between the encoder and the decoder, assuming that they do not process at the same time.
The target is high-speed, real-time compression. The required on-chip memory is 4 KB. The
VHDL code is synthesized with Synopsys; the overall chip area, without wire interconnections,
is 373,862 gates, of which 324,405 gates belong to the on-chip memory and the other 49,457
gates to the functional units. The overall power consumption is 59.07 mW. The emphasis has
been on operation time, not hardware area. No information is given about throughput or
maximum frequency. The processing speed is measured using a 15 ns clock cycle.
3.6.3 A Lossless Image Compression Technique Using Simple Arithmetic Operations [19]
The implemented algorithm is based on logarithmic number system (LNS) properties. It is
suitable for high-quality still image compression where the information content is very large
(there is very little redundant data). The aim is to speed up encoding and decoding using only a
few addition/subtraction and shift operations. It has a simple architecture with fast encoding and
decoding. Each pixel is represented by 8 bits. The algorithm is implemented and synthesized
using the Xilinx Integrated Software Environment (ISE) with the following results:
For the forward arithmetic compression algorithm (FAC, the compression path), the number of
slices is 1897, the number of flip-flops is 236, and the number of 4-input LUTs is 2766.
For the inverse arithmetic compression algorithm (IAC, the decompression path), the number of
slices is 52, the number of flip-flops is 64, and the number of 4-input LUTs is 94.
3.6.4 A Low power, Fully Pipelined JPEG-LS Encoder for Lossless Image Compression
[11]
A fully pipelined VLSI architecture with a clock management scheme has been proposed for
real-time data processing and low-power applications. The input image has a maximum
resolution of 640×480×8 bits. The system clock frequency is 40 MHz, and the frequency of the
sensor’s output pixels is 10 MHz. The design has been implemented in UMC 0.18 μm technology.
The total size of the JPEG-LS encoder is 17.6 kgates, plus 18 kbits of on-chip SRAM. The
overall power consumption is reduced by 15.7% by the clock management scheme.
3.6.5 Hardware Implementation of a Lossless Image Compression Algorithm Using a
FPGA [20]
The algorithm is based on LOCO-I, with some modifications in order to reduce the complexity.
8-bit pixel values are used. The total amount of SRAM needed by the algorithm is 1K×8 for the
pixel memory (the maximum image width is 1024 pixels) and 1K×32 for the context memory (in
total about 5 KB). The clock frequency is 12 MHz and the latency is nine clock cycles. The
throughput is 1.33 Mpixels/second.
3.6.6 Comparison
In this subsection a comparison of several lossless compression hardware implementations is
given in table 15. The data is mainly taken from [11]; the rest is extracted from the respective
scientific papers.
Implementation  Technology    Area (gates)                  Memory usage (bits)  Operating frequency (MHz)      Throughput
[17]            STM 0.13 μm   53096                         236838               -                              1.33
[18]            -             49457                         2k                   66                             0.0364
[19]            Xilinx (ISE)  1897 slices, 236 flip-flops,  No context           -                              -
                              2766 4-input LUTs
[11]            UMC 0.18 μm   17.6k                         36534                10 (main clk) / 40 (high clk)  1
[20]            Xilinx XCV50  -                             -                    12                             0.1108
Proposed        65 nm         10.55k                        No context           208                            1
Table 15: Characteristics of different hardware implementations
Note that the proposed implementation does not include the data packer and control path blocks.
Also, only the compressor size is given, to make it comparable to the other implementations.
Chapter 4
4 Conclusion
The work carried out in this thesis investigated several color compression algorithms, from a
hardware implementation point of view, for use in high-throughput hardware. One such
algorithm was implemented in hardware in order to validate the early cost estimations and to
gain more insight into the problematic parts of the algorithms in terms of hardware
implementation.
4.1 Workflow
An investigation was done on several scientific papers, covering both available algorithms and
their hardware implementations. The reference algorithm [1] was simulated in the MATLAB™
environment. A possible hardware realization of the functional blocks of compression and
decompression was proposed and its cost estimated. The work then continued by introducing a
compression algorithm based on a modification of the algorithm used in [1], in order to obtain
better compression performance while keeping the hardware cost reasonably low. The
MATLAB™ simulation of this modified algorithm verified a significant improvement in the
compression ratio. This algorithm was chosen for hardware implementation in VHDL. The
hardware was designed according to the requirements and constraints, and was simulated with
ModelSim in order to verify both the functionality and the throughput requirements. In order to
extract timing and area information, the design was synthesized.
4.2 Results and Outcomes
The synthesis results given in section 3.5 show that the area estimations for the functional blocks
of the datapath, given in tables 2 and 5, are reasonably close to the actual sizes. Table 16
combines our estimations and the actual sizes.
                               estimated size      actual size
                               (in NAND2 gates)    (in NAND2 gates)
Color/Reverse Color Transform  306                 290
Predictor/Constructor          522                 390
GR k-determination             279                 360
GR encoding                    180                 360
Table 16: Comparison of cost estimations and actual sizes for blocks
Our size estimations in tables 2 and 5 are given in terms of the number of full adders. To compare
them with the actual block sizes, we assume that each full adder is equivalent to nine NAND2
gates. The Color Transform and GR k-determination block estimations are quite close to the real
sizes. The actual predictor/constructor block is smaller, since the estimation is based on 6 adders
whereas the design uses 5 adders (figures 29 and 30). The GR encoder block estimation is
smaller because only adders are taken into account in the estimations, and for this block other
components, such as OR gates and MUXes, constitute a significant portion of the block size.
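The conversion used for table 16 is simple arithmetic; under the nine-NAND2-per-adder assumption, the estimates divide back into whole one-bit full-adder counts:

```python
FA_IN_NAND2 = 9  # assumed NAND2-gate equivalent of one full adder

# Estimated sizes from table 16, expressed back as full-adder counts.
estimates_nand2 = {
    "Color/Reverse Color Transform": 306,
    "Predictor/Constructor": 522,
    "GR k-determination": 279,
    "GR encoding": 180,
}
estimates_fa = {name: n // FA_IN_NAND2 for name, n in estimates_nand2.items()}
```

For example, the predictor/constructor estimate of 522 NAND2 gates corresponds to 58 one-bit full adders.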
In our view, the most important outcome of the hardware implementation concerns the generic
task of variable-length data packing and unpacking. The implementation revealed that the high
throughput requirement complicates the design of data packing/unpacking significantly. To
fulfill our throughput requirement, both blocks must pack/unpack four variable-length
codewords each clock cycle. It may be possible to parallelize this operation over several units,
but the resulting size would probably be too large to afford. Data unpacking is even more
difficult to design, since it requires bit-by-bit reading of the packed data, which is inherently a
serial operation. Hence, our implementation shows that the packer/unpacker is the bottleneck of
the overall design, both in terms of size and speed.
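The packing task can be sketched behaviorally as follows. This is an illustrative Python model, not the thesis RTL; the function name and the 32-bit output word width are assumptions made here.

```python
def pack_cycle(buffer, buf_len, codewords):
    """One packing 'cycle': merge four (value, length) codewords MSB-first
    into a bit buffer, then emit any completed 32-bit output words."""
    words_out = []
    for value, length in codewords:
        # each shift amount depends on the previous codeword lengths:
        # this chained dependency is what makes a single-cycle packer costly
        buffer = (buffer << length) | (value & ((1 << length) - 1))
        buf_len += length
    while buf_len >= 32:                 # emit completed 32-bit words
        buf_len -= 32
        words_out.append((buffer >> buf_len) & 0xFFFFFFFF)
        buffer &= (1 << buf_len) - 1
    return buffer, buf_len, words_out
```

Each codeword's shift amount depends on the accumulated length of the codewords before it, which is the serial dependency that makes a four-codeword-per-cycle hardware packer large and slow.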
We can summarize the outcomes of this thesis work in the following four main points:
- An average size for the complete datapath and address generation blocks was obtained
(~10 kgates).
- Data packing/unpacking was identified as the most critical task for a high-throughput
hardware implementation.
- The compression ratio was improved (+15%), especially for compressible scenes such as user
menus (+25%), at little extra hardware cost (+4.5%).
- The exhaustive search method was replaced with an estimation, which significantly reduces
(-58%) the hardware size of the overall Golomb-Rice encoding with almost the same
compression capability (-0.2%).
4.3 Future Work
An immediate item of future work is to investigate a fast and efficient implementation of
variable-length data packing/unpacking and to integrate it with the existing datapath. Once this
is done, the overall hardware size can be determined.
Other possible future work on algorithmic improvements has been discussed in section 2.7.
References
[1] J. Rasmusson, J. Hasselgren, T. Akenine-Möller, "Exact and error-bounded approximate
color buffer compression and decompression," in Graphics Hardware 2007, San Diego,
California, Aug. 2007.
[2] Course homepage "Mobile computer graphics," Faculty of Computer Science, Lund
University, http://www.cs.lth.se/EDA075/
[3] P. G. Howard and J. S. Vitter, "Fast and efficient lossless image compression," in Proc.
IEEE Data Compression Conference (DCC 1993), pages 351-360, Snowbird, Utah, USA,
March 1993.
[4] M. J. Weinberger, G. Seroussi, and G. Sapiro, "LOCO-I: A low complexity, context-based
lossless image compression algorithm," in Proc. IEEE Data Compression Conference (DCC
1996), pages 140-149, Snowbird, Utah, USA, March 1996.
[5] X. Wu and N. D. Memon, "Context-based, adaptive, lossless image coding," IEEE Trans.
Commun., vol. 45 (4), pp. 437-444, Apr. 1997.
[6] M. J. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I lossless image compression
algorithm: principles and standardization into JPEG-LS," IEEE Trans. Image Processing,
9(8):1309-1324, August 2000.
[7] S. W. Golomb, "Run-length encodings," IEEE Trans. Inform. Theory, vol. IT-12,
pp. 399-401, 1966.
[8] R. F. Rice, "Some practical universal noiseless coding techniques," Tech. Rep. JPL-91-3,
Jet Propulsion Laboratory, Pasadena, CA.
[9] H. Malvar and G. Sullivan, "YCoCg-R: A Color Space with RGB Reversibility and Low
Dynamic Range," JVT-I014r3, 2003.
[10] S. A. Martucci, "Reversible compression of HDTV images using median adaptive
prediction and arithmetic coding," in Proc. IEEE Intern'l Symp. on Circuits and Syst.,
pp. 1310-1313, IEEE Press, 1990.
[11] X. Li and X. Chen, "A low power, fully pipelined JPEG-LS encoder for lossless image
compression," IEEE, Beijing, China, 2007.
[12] J. Coalson, FLAC – Free Lossless Audio Codec, 2005, http://flac.sourceforge.net/
[13] K. Veeraswamy and S. Srinivaskumar, "Lossless image compression using topological
pixel re-ordering," JNTU, College of Engineering, Kakinada, India.
[14] S. Andriani, "Lossless compression and interpolation for high quality still images and
video sequences," Ph.D. thesis, University of Padova, Faculty of Engineering, 2006.
[15] The USC-SIPI Image Database, University of Southern California, Electrical Engineering
Department, Signal and Image Processing Institute, http://sipi.usc.edu/database/index.html
[16] I. Matsuda, T. Kaneko, A. Minezawa, and S. Itoh, "Lossless coding of color images using
block-adaptive inter-color prediction," IEEE ICIP, 2007.
[17] M. Ferretti and M. Boffadossi, "A Parallel Pipeline Implementation of LOCO-I for
JPEG-LS," 17th International Conference on Pattern Recognition (ICPR'04), vol. 1,
pp. 769-772, 2004.
[18] A. Savakis and M. Piorium, "Benchmarking and Hardware Implementation of JPEG-LS,"
ICIP'02, Rochester, NY, Sep. 2002.
[19] S. Kummar Pattaniak and K. K. Mahapatra, "A Lossless Image Compression Technique
Using Simple Arithmetic Operations and Its FPGA Implementation," IEEE, 2006.
[20] M. Klimesh, V. Stanton, and D. Watola, "Hardware Implementation of a Lossless Image
Compression Algorithm Using a Field Programmable Gate Array," NASA JPL TMO Progress
Report 42-144, 2001.
APPENDIX A
Proposed Cost Reduction Method Analysis
A.1 Overlap-limited Search
Consider a block B of size n, B = {e1, e2, …, en}, where n is the number of input values that
are encoded together, n is an integer and n > 0.

Figure 56: One block of n values (the values e1 … en arranged as a grid)
The Golomb-Rice code length of each input ei in the block (when encoded with parameter k) is
computed as

    li = qi + 1 + k

where qi is the quotient of the integer division of ei by 2^k, i.e., qi = (ei − ri)/2^k with
remainder ri = ei mod 2^k.

For the general case where n and k are variables, the total length of the encoded block as a
function of n and k is

    Ltotal-k = (q1 + 1 + k) + … + (qn + 1 + k) = qT + n(k + 1)    (1)

where qT = q1 + q2 + … + qn. In the calculations, Lk = Ltotal-k − n = qT + nk will be used,
which does not affect the comparison result since n is a common term for all k.

We define eT and rT as

    eT = e1 + e2 + … + en,    rT = r1 + r2 + … + rn.

Then

    qT = (eT − rT)/2^k,  and so  Lk = (eT − rT)/2^k + nk.

For the remainders we also have 0 ≤ ri ≤ 2^k − 1, so 0 ≤ rT ≤ n(2^k − 1). Hence

    Lk ≤ eT/2^k + nk    (I)

    Lk ≥ (eT − n(2^k − 1))/2^k + nk    (II)

Combining (I) and (II):

    (eT − n(2^k − 1))/2^k + nk ≤ Lk ≤ eT/2^k + nk

The above inequality gives the bounds of Lk (the length of the output code when coded with
parameter k) as a function of n, k and eT.
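The length identity and its bounds can be checked numerically. The following is a small Python sketch; using 9-bit inputs and n = 4, as in the application later in this appendix, is an assumption made here.

```python
import random

random.seed(1)
n = 4
for _ in range(1000):
    k = random.randint(0, 6)
    e = [random.randint(0, 511) for _ in range(n)]    # 9-bit unsigned inputs
    q_T = sum(v >> k for v in e)                      # sum of quotients
    r_T = sum(v & ((1 << k) - 1) for v in e)          # sum of remainders
    e_T = sum(e)
    L_k = q_T + n * k
    # identity: L_k = (e_T - r_T)/2^k + n*k
    assert L_k == (e_T - r_T) // 2 ** k + n * k
    # bounds (I) and (II), from 0 <= r_T <= n*(2^k - 1)
    assert (e_T - n * (2 ** k - 1)) / 2 ** k + n * k <= L_k <= e_T / 2 ** k + n * k
```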
Now, we want to find the overlap region between two adjacent length functions Lk and Lk+1:

    a ≤ Lk ≤ b
    c ≤ Lk+1 ≤ d

Three cases are of interest:

A. a > d: here Lk > Lk+1 always, so k+1 is always better than k and no comparison is needed.
B. c > b: here Lk < Lk+1 always, so k is always better than k+1 and no comparison is needed.
C. If neither A nor B is satisfied, then Lk and Lk+1 must be compared to find whether k or k+1
gives the shorter code length. Hence, the inequalities in A and B give the bounds of the overlap
region with respect to eT.
Solving a > d gives the upper bound of the overlap region with respect to eT:

    a > d  ⇒  eT > 2n(2^(k+1) − 1)

Solving c > b gives the lower bound of the overlap region with respect to eT:

    c > b  ⇒  eT < n · 2^k

So, the overlap region of two adjacent length functions Lk and Lk+1 is

    n · 2^k ≤ eT ≤ 2n(2^(k+1) − 1)    (2)

Equation (2) is a general formulation of the overlap region for two adjacent code lengths.
Figure 57: Overlap regions of consecutive length functions with respect to eT
Considering figure 57, it can be shown that for any given block of input data there is an overlap
region only for three consecutive length functions Lk, Lk+1, Lk+2.

Proof: Assume that there is an overlap region between A and C, i.e., some eT lies in the overlap
regions of both (Lk, Lk+1) and (Lk+2, Lk+3). By equation (2), this requires

    n · 2^(k+2) ≤ eT ≤ 2n(2^(k+1) − 1),  i.e.,  n · 2^(k+2) ≤ n · 2^(k+2) − 2n.

Solving this inequality, we get n ≤ 0, which is impossible, since n represents the number of
input values in a block. So, such an overlap region never exists for any given set of k and any
block size n.
Result: For any block size n, only three consecutive k values (k, k+1, k+2) among the
consecutive set of all possible k values can give the minimum encoded length, i.e., can be the
best k-parameter for that block. These three k values depend on the sum of the input values
(eT) in the block. Therefore, once they are located, it is sufficient to compare only the three
corresponding length functions (two comparisons).
Figure 58 illustrates the case for n = 4. Point a (eT = 24) on the plot is the boundary between
L1 and L2: after point a, L1 is always greater than L2, so L1 need not be considered any more.
The other point of interest is point b (eT = 32), which marks the start of L4: before point b, L3
is always smaller than L4, so L4 need not be considered there. Since point a (eT = 24) lies
before point b (eT = 32), L1 and L4 never need to be compared.
Figure 58: Overlap regions between length functions L1, L2, L3, L4
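The two boundary points can be reproduced from equation (2) with n = 4. The following is a short Python check; the helper names are ours.

```python
n = 4

def upper(k):
    # eq. (2): above this e_T, L_k is always longer than L_(k+1)
    return 2 * n * (2 ** (k + 1) - 1)

def lower(k):
    # eq. (2): below this e_T, L_(k+1) is always longer than L_k
    return n * 2 ** k

a = upper(1)   # point a: boundary between L1 and L2
b = lower(3)   # point b: boundary where L4 starts to matter
assert (a, b) == (24, 32)
assert a < b   # hence L1 and L4 never need to be compared
```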
Now, we apply this method to our application. The algorithm, given in [4], operates on subtiles
of 2x2 pixels; hence n = 4 in our application. We want to find the Golomb-Rice parameter for a
subtile in the set k = {0, 1, 2, 3, 4, 5, 6}.
The overlap regions for our case are shown in figure 59.
Figure 59: Overlap regions for n=4 and k= {0, 1, 2, 3, 4, 5, 6} with respect to eT
Figure 60 shows the overlap regions with respect to the sum of the input values, eT, which can
be in the range 0 to 2044 (9-bit unsigned inputs). There are overlap regions only in the interval
[4, 504]. The conclusion that can be drawn from this figure is that we never need to do an
exhaustive search among all seven Golomb-Rice parameters in order to find the best one. The
alternative introduced here is to compute the sum of the inputs and find its corresponding
overlap region.
The overlap intervals with respect to eT (n = 4) are:

    L0, L1: eT ∈ [4, 8]
    L1, L2: eT ∈ [8, 24]
    L2, L3: eT ∈ [16, 56]
    L3, L4: eT ∈ [32, 120]
    L4, L5: eT ∈ [64, 248]
    L5, L6: eT ∈ [128, 504]
Knowing the overlap region limits the search to at most three cases, which happens in the
intervals [16, 24], [32, 56], [64, 120], and [128, 248], and to only two cases in the other
intervals. Also notice that for eT > 504 and eT < 4, no comparison is needed at all. The only
remaining issue is how to find the regions. In order to simplify the hardware, we limit the
region boundaries to powers of two. As a result, we always compare three length functions,
selected according to the regions below:
For eT < 16, compare L0, L1, L2
For 16 ≤ eT < 32, compare L1, L2, L3
For 32 ≤ eT < 64, compare L2, L3, L4
For 64 ≤ eT < 128, compare L3, L4, L5
For eT ≥ 128, compare L4, L5, L6

Figure 60: Required comparisons of overlap regions for n = 4, k = {0, 1, 2, 3, 4, 5, 6} based on eT
Notably, this solution is an exact method that finds the best k-parameter among all seven
possible values, yet requires only two hardware comparator units.
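The overlap-limited search above can be sketched behaviorally in Python as follows. The function names are ours; the region selection uses the power-of-two boundaries listed above, and the result is checked against an exhaustive search.

```python
import random

def code_len(e, k):
    """Total Golomb-Rice length of a block: (q + 1 + k) bits per value."""
    return sum((v >> k) + 1 + k for v in e)

def best_k_limited(e):
    """Pick the best k from only three candidates (two comparisons)."""
    e_T = sum(e)
    base = max(0, min(4, e_T.bit_length() - 4))   # e_T < 16 -> 0, < 32 -> 1, ...
    return min((base, base + 1, base + 2), key=lambda k: code_len(e, k))

def best_k_exhaustive(e):
    """Reference: try all seven parameters."""
    return min(range(7), key=lambda k: code_len(e, k))

random.seed(2)
for _ in range(5000):
    e = [random.randint(0, 511) for _ in range(4)]    # random 2x2 subtile
    assert code_len(e, best_k_limited(e)) == code_len(e, best_k_exhaustive(e))
```

The `bit_length` trick maps eT directly to the region table above, which mirrors the hardware simplification of limiting region boundaries to powers of two.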
This result was obtained for a consecutive set of possible k values. In specific implementations
a smaller set may be used. We now extend the method and show that a single comparison is
sufficient to find the best k-parameter if the overall set does not contain any three consecutive
k values, i.e., there exists no subset {k, k+1, k+2} in the set of k.
To show this more general case, we also need a formula for the overlap region between Lk and
Lk+2. Following the same steps for Lk and Lk+2 (writing u for the upper bound of Lk, to avoid
reusing the symbol n):

    m ≤ Lk ≤ u
    o ≤ Lk+2 ≤ p

Solving m > p gives the upper bound of the overlap region with respect to eT:

    m > p  ⇒  eT > (4n/3)(3 · 2^k − 1)

Solving o > u gives the lower bound of the overlap region with respect to eT:

    o > u  ⇒  eT < 2n · 2^k

So, the overlap region of Lk and Lk+2 is

    2n · 2^k ≤ eT ≤ (4n/3)(3 · 2^k − 1)    (3)
Figure 61: Overlap regions of non-consecutive length functions with respect to eT
Figure 61 shows that if the overall set does not include three consecutive k values (k+1 is not in
the set in the figure), then there is an overlap between at most two length functions at a time
(Lk, Lk+2 or Lk+2, Lk+3 in the figure).

Proof: Assume that there is an overlap region between A and B, i.e., some eT lies in the overlap
regions of both (Lk, Lk+2) and (Lk+2, Lk+3). By the overlap regions derived above, this requires

    n · 2^(k+2) ≤ eT ≤ (4n/3)(3 · 2^k − 1),  i.e.,  n · 2^(k+2) ≤ n · 2^(k+2) − 4n/3.

Solving this inequality, we get n ≤ 0, which is impossible, since n represents the number of
input values in a block. So, such an overlap region never exists for any given set of k and any
block size n.
It is important to note that the total hardware cost of the "overlap-limited search" is
independent of the number of k values, as shown in figure 3.
Note: The derivations on pg. 7 – 11 are general in the sense that they do not require Lk to be an
integer. In almost all, if not all, applications Lk is an integer, and the effect of this needs to be
examined. The effect can be seen as a quantization of Lk to integer values, which narrows the
overlap regions. However, two comparisons are still required. Here, without formal
verification, we give the resulting overlap regions for our example:
    L0, L1: eT ∈ [6]
    L1, L2: eT ∈ [10, 20]
    L2, L3: eT ∈ [20, 48]
    L3, L4: eT ∈ [40, 100]
    L4, L5: eT ∈ [80, 204]
    L5, L6: eT ∈ [160, 412]
A.2 Remainder-Based Correction
Golomb-Rice coding separates the data into two parts (q: quotient and r: remainder) by a
division whose divisor is a power of two. The values added in the first-stage additions shown
in figure 12 are the quotients q of the division by 2^k.

For a block size of n, the output length corresponding to k is repeated here as

    Lk = q1 + q2 + … + qn + nk = qT + nk
Figure 62 shows the quotients that are added together for the cases k = 0 and k = 1, respectively.
Figure 62: Motivation behind remainder-based correction
It is clear from figure 62 that bits m down to 1 are identical in both additions. In a hardware
implementation this means that the same data bits are connected to the inputs of two separate
adders. The point is that the sum of the second addition can be obtained from the sum of the
first addition, specifically by right-shifting that sum by one bit. To get exactly the same result,
however, the effect of the LSBs (the remainders for k = 1) on the first sum must be corrected.
That is, by first subtracting the carry-out of the sum of the LSBs from the first sum
(corresponding to k = 0) and then right-shifting it by one bit, the second sum (corresponding to
k = 1) is obtained exactly, without an addition.
The idea generalizes to all k values: once the sum of the input values is obtained (which is
simply the sum of the quotients for k = 0), qT for every other k can be found by a common
correction circuit using the remainder bits of each stage.
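The correction can be sketched in Python as follows. This models the arithmetic only, not the adder structure, and the function name is ours.

```python
def quotient_sum_corrected(e, k):
    """qT for parameter k, derived from the plain input sum via
    remainder correction followed by a shift."""
    e_T = sum(e)                                # adder output for k = 0
    r_T = sum(v & ((1 << k) - 1) for v in e)    # sum of the k remainder LSBs
    return (e_T - r_T) >> k                     # correct first, then shift

e = [23, 7, 140, 66]                            # example subtile values
for k in range(7):
    assert quotient_sum_corrected(e, k) == sum(v >> k for v in e)
```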
Mathematically, the equivalence can be shown as follows:

    qi = (ei − ri)/2^k,  where ri = ei mod 2^k.

So,

    qT = q1 + q2 + … + qn = ((e1 + … + en) − (r1 + … + rn))/2^k.

We define eT and rT as

    eT = e1 + e2 + … + en,    rT = r1 + r2 + … + rn.

Then

    Lk = qT + nk = eT/2^k + (nk − rT/2^k)    (4)
Equation (4) shows that Lk can be obtained by adding a remainder-conditioned operand (the
second term) to the shifted sum of the inputs (the first term).
APPENDIX B
Test Image Sets
B.1 Standard Photographic Test Images
Peppers (512 x 512) Peppers2 (512 x 512) Mandrill (512 x 512)
Lenna (512 x 512) House (256 x 256) Sailboat (512 x 512)
Airplane (512 x 512)
B.2 Computer Generated Test Scenes
Ducks (640 x 480) Square (640 x 480)
Car (640 x 480) Quake4 (640 x 480)
Bench_scr1 (640 x 360) Bench_scr1 (640 x 360)
Bench_scr4 (640 x 360)
B.3 Computer Generated User Menu Scenes
Menu1 (240 x 320) Menu2 (240 x 320) Menu3 (240 x 320)
Menu4 (240 x 320) Menu5 (240 x 320)
Menu6 (320 x 480) Menu7 (320 x 480)