improved transform methods for low complexity, high quality video coding

Tampereen teknillinen yliopisto. Julkaisu 1033 Tampere University of Technology. Publication 1033

Cixun Zhang

Improved Transform Methods for Low Complexity, High Quality Video Coding Thesis for the degree of Doctor of Science in Technology to be presented with due permission for public examination and criticism in Tietotalo Building, Auditorium TB222, at Tampere University of Technology, on the 26th of March 2012, at 12 noon.

Tampereen teknillinen yliopisto - Tampere University of Technology Tampere 2012

ISBN 978-952-15-2791-3 ISSN 1459-2045

i

ABSTRACT

A video signal requires a very large number of bits if it is represented in an uncom-pressed form. This makes many video applications not possible, due to the limited bandwidth of communication channels and the limit of the storage medium. Therefore, video coders that compress the video signal so that it could be represented with a small-er number of bits are used in all digital video applications. The goal of all video coders is to maximize the video quality while minimizing the bitrate. This is achieved by exploit-ing different redundancies present in the video signal. One type of redundancy is present between neighboring samples in the residual frame (spatial redundancy). The focus of this thesis is to develop algorithms to exploit the spatial redundancy with the goal of im-proving coding efficiency of video coders while keeping their implementation complexi-ty as low as possible.

The first part of this thesis introduces fixed-point design methodologies and several resulting implementations of the inverse discrete cosine transform (IDCT). The overall architecture used in the design of the ISO/IEC 23002-2 IDCT algorithm, which can be characterized by its separable and scaled features, is also described. The drift character-istic for the integer IDCT approximations is also analyzed.

The second part of the thesis discusses pre-scaled integer transform (PIT) which re-duces the implementation complexity of the conventional integer cosine transform (ICT) and maintains all the merits such as bit-exact implementation and good coding efficien-cy. Design rules that lead to good PIT kernels are developed and different types of PIT and their target applications are examined. The PIT kernels used in Audio Video coding Standard (AVS), the Chinese National Coding standard, are also introduced.

In the third part of the thesis we discussed a novel algorithm called spatially varying transform (SVT). Unlike state-of-art video codecs where the position of the transform block is fixed, SVT enables video coders to vary the position of the transform block. In addition to changing the position of the transform block, the size of the transform can also be varied within the SVT framework, in order to better localize the prediction error so that the underlying correlations are better exploited. It is shown that by varying the position of the transform block and its size, characteristics of prediction error are better localized, and the coding efficiency is thus improved. A novel low complexity algorithm, operating on a macroblock and a block level, is also proposed to reduce the encoding complexity of SVT.

An extension of SVT, called Prediction Signal Aided Spatially Varying Transform (PSASVT) is also discussed, which utilizes the gradient of prediction signal to eliminate the unlikely location parameters (LPs). As the number of candidate LPs is reduced, a

ii

smaller number of LPs are searched by encoder, which reduces the encoding complexity. In addition, less overhead bits are needed to code the selected LP and thus the coding efficiency can be improved. This reduction in encoding complexity is achieved with a slight increase in coding efficiency, as the number of candidate LPs is reduced. The de-coding complexity increase is only a little.

iii

ACKNOWLEDGMENT

The work presented in this thesis has been carried out in parts at the Department of Sig-nal Processing in Tampere University of Technology (TUT), and at Nokia Research Cen-ter (NRC), both located in Tampere, Finland. During the preparation of this thesis, I have been working in co-ordination with two research groups. In TUT, the group was the “Multimedia Research Group” lead by Prof. Moncef Gabbouj. At NRC, it was the “Audio-Visual Content Representation Team”.

This thesis would remain incomplete without acknowledging some people. Firstly, I wish to express my deepest gratitude to my supervisor Prof. Moncef Gabbouj who has provided me with their support, encouragement, and scientific guidance throughout the years of my work and study.

Further, I would also like to thank a number of people who have directly or indirectly contributed to this thesis or have been working with me on related topics. I would like to thank Kemal Ugur, Jani Lainema, Antti Hallapuro for providing opportunities, technical guidance, valuable comments, and fruitful discussions.

I would like to thank Ms. Virve Larmila, Ms. Ulla Siltaloppi and Ms. Elina Orava for their great help for some routine but important administration work.

I convey special acknowledgement to Zhejiang University, from which I got solid ac-ademic training, during my undergraduate and master studies, which has benefited me for life long. My special thanks go to Prof. Lu Yu, my master supervisor, who guided me patiently on my early research activities and publications.

Financial support of Nokia Foundation is gratefully acknowledged.

Finally, I would wish to thank everyone who contributed to the successful comple-tion of the thesis.

v

CONTENTS

Abstract ................................................................................................................................................. i

Acknowledgment ............................................................................................................................ iii

Contents ............................................................................................................................................... v

List of Publications ........................................................................................................................ vii

List of Abbreviations ...................................................................................................................... ix

List of Tables ..................................................................................................................................... xi

List of Figures ................................................................................................................................. xiii

1. Introduction ................................................................................................................................ 1

1.1 Fundamentals of Video Coding .................................................................................................. 2

1.2 Transforms Used in Video Coding ............................................................................................ 6

1.3 Outline and Objectives of the Thesis ........................................................................................ 8

1.4 Author’s Contributions to the Publications ........................................................................... 9

2. Fixed-Point Approximations of the 8x8 Inverse Discrete Cosine Transform... 11

2.1 Precision Requirements for IDCT Implementations in MPEG and ITU-T Standards ......................................................................................................................................... 11

2.2 Fixed-Point Approximations of DCT/IDCT ......................................................................... 12

2.3 The ISO/IEC 23002-2 IDCT ...................................................................................................... 13

3. Directional Transforms ....................................................................................................... 15

3.1 Reorganization-Based Directional Transform .................................................................. 16

3.2 Lifting-Based Directional Transforms .................................................................................. 17

3.3 Data-Dependent Directional Transforms ........................................................................... 19

4. Pre-Scaled Integer Transform ........................................................................................... 21

4.1 Design Rules of PIT Kernels ..................................................................................................... 24

4.2 Types of Pre-Scaled Integer Transform ............................................................................... 25

5. Spatially Varying Transform .............................................................................................. 27

vi

5.1 Design of SVT .................................................................................................................................. 28

5.1.1 Selection of SVT Block-size ...................................................................................................... 28

5.1.2 Selection and Coding of Candidate LPs .............................................................................. 30

5.1.3 Filtering of SVT Block Boundaries ........................................................................................ 31

5.2 Implementing Spatially Varying Transform in the H.264/AVC Framework ......... 31

5.3 Fast Algorithms for Spatially Varying Transform ............................................................ 32

5.3.1 Macroblock Level Fast Algorithm ......................................................................................... 33

5.3.2 Block Level Fast Algorithm ...................................................................................................... 33

5.4 Experimental Results .................................................................................................................. 35

6. Prediction Signal Aided Spatially Varying Transform .............................................. 43

6.1 Implementation of Prediction Signal Aided Spatially Varying Transform ............. 44

6.2 Experimental Results .................................................................................................................. 46

7. Conclusion ................................................................................................................................ 49

References ........................................................................................................................................ 51

vii

LIST OF PUBLICATIONS

This thesis is written in a summary style followed by the following publications listed below as appendices. The publications are referenced in the thesis as [P1], [P2], etc.

[P1] Yuriy A. Reznik, Arianne T. Hinds, Cixun Zhang, Lu Yu, Zhibo Ni, “Efficient Fixed-Point Approximations of 8x8 Inverse Discrete Cosine Transform,” in Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, pp. 669617-1-17, San Diego, 28 Aug. 2007.

[P2] Arianne T. Hinds, Yuriy A. Reznik, Lu Yu, Zhibo Ni, Cixun Zhang, “Drift Analysis for integer IDCT,” in Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, pp. 669614-1-16, San Diego, 28 Aug. 2007.

[P3] Cixun Zhang, Lu Yu, “Multiplier-less Approximation of the DCT/IDCT with Low Complexity and High Accuracy,” in Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, pp. 669615-1-12, San Diego, 28 Aug. 2007.

[P4] Cixun Zhang, Lu Yu, Jian Lou, Wai-Kuen Cham, Jie Dong, “The Technique of Pre-scaled Integer Transform: Concept, Design and Applications,” IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 18, Issue. 1, Jan 2008, pp. 84-97.

[P5] Cixun Zhang, Kemal Ugur, Jani Lainema, Antti Hallapuro, Moncef Gabbouj, "Video Coding Using Spatially Varying Transform," IEEE Transactions on Circuits and Sys-tems for Video Technology (T-CSVT), Vol. 21, Issue. 2, Feb. 2011, pp. 127-140.

[P6] Cixun Zhang, Kemal Ugur, Jani Lainema, Moncef Gabbouj, “Video Coding Using Spatially Varying Transform,” in Proceedings Pacific-Rim Symposium on Image and Video Technology (PSIVT), Tokyo, Japan, 13th-16th, Jan 2009, pp. 796-806.

[P7] Cixun Zhang, Kemal Ugur, Jani Lainema, Antti Hallapuro, Moncef Gabbouj, “Predic-tion Signal Aided Spatially Varying Transform,” in IEEE International Conference on Multimedia and Expo (ICME), Barcelona, Spain, 11-15 July, 2011.

ix

LIST OF ABBREVIATIONS

720p 1280 pixels wide by 720 pixels height in progressive scan

1080p 1920 pixels wide by 1080 pixels height in progressive scan

1D, 2D, 3D One-, Two-, Three-Dimensional

CIF Common intermediate format

DVD Digital Versatile Disc

Blu-ray Optical disc storage medium with capacity higher than DVD

H.264/AVC Video compression standard jointly developed by MPEG and VCEG

HEVC High efficiency video coding, name for the next generation

standard

KTA Key technological area, experimental software maintained by VCEG

Macroblock A block containing one or more transform blocks

Mbits/s Megabits per second

MPEG-4 Family of video compression standards developed by MPEG

NAL Network access layer

JVT Joint video team of MPEG and VCEG

MPEG Moving picture experts group, standard developing organization

PSNR Peak signal to noise ratio

QCIF Quarter common intermediate format

RD Rate distortion

RDO Rate distortion optimization

SSD Sum of squared differences

TV Television

VCEG Video coding experts group, standard development organization

VCL Video coding layer

x

YUV A color space

xi

LIST OF TABLES

Table 1 The IEEE 1180 | ISO/IEC 23002-1 Pseudo-Random IDCT Test Metrics and Conditions [P1] 12

Table 2 Average Results of ΔBD-Rate Compared to H.264/AVC 38

Table 3 Average Results of Percentage of 8×8 SVT and 16×4+4×16 SVT Selected in VBSVT for Different Sequences 38

Table 4 Encoding/Decoding Time Increase of FVBSVT Compared to H.264/AVC (High Complexity Configuration) (CABAC, 720p) 39

Table 5 Experimental Results of PSASVT 46

Table 6 Percentage of Macroblocks Coded in SVT for Different Sequences 47

xiii

LIST OF FIGURES

Figure 1 Illustration of temporal and spatial redundancy in a video frame 4

Figure 2 Block diagram of the latest video coding standard H.264/AVC 4

Figure 3 Fixed-point 8x8 IDCT architecture. [P1] 14

Figure 4 Loeffler-Ligtenberg-moschytz IDCT factorization (left), and its generalized scaled form (right). [P1] 14

Figure 5 Six directional modes defined in a similar way as was used in H.264/AVC for the block size 8×8. The vertical and horizontal modes are not included here. [9] 16

Figure 6 N×N image block in which the first 1-D DCT will be performed along the diagonal down-left direction. [9] 17

Figure 7 Example of N=8: arrangement of coefficients after the first DCT (left) and arrangement of coefficients after the second DCT as well as the modified zigzag scanning (right). [9] 17

Figure 8 Factorizing 8-point DCT into primary operations. x[n] and y[n] (n=0,1,…,7) are input signal and output DCT coefficient, respectively. Oi (i=1,2,…,35) is primary operation. [28] 18

Figure 9 Exemplified primary operations (a) nondirectional and (b) directional, where white circles denote integer pixels and gray circles half pixels. [28] 19

Figure 10 Block diagram of conventional ICT scheme in H.264/AVC 21

Figure 11 Block diagram of proposed PIT scheme 21

Figure 12 Illustration of 8×8 spatially varying transform. 29

Figure 13 Illustration of 16×4 and 4×16 spatially varying transform. 29

Figure 14 Block diagram of the extended H.264/AVC encoder with spatially varying transform. 32

Figure 15 Illustration of hierarchical search algorithm. 35

Figure 16 Rate-distortion curves for Night sequence (CABAC) and Panslow sequence (CAVLC). 37

Figure 17 (A) Distribution of the macroblocks coded with SVT, (B) corresponding SVT blocks for 32nd prediction error frame. (Night sequence) 40

Figure 18 (A) Distribution of the macroblocks coded with SVT, (B) corresponding SVT blocks for 65th prediction error frame. (Sheriff sequence) 41

Figure 19 Distribution of (Δx, Δy) of 8×8 SVT for BigShips sequence [P5]. 43

1

CHAPTER 1 INTRODUCTION

Video coding deals with representation of video data, which can be used for transmis-sion and/or storage applications. The goals of video coding are to accurately represent the video data compactly, provide means to navigate the video (i.e. search forwards and backwards, random access, etc) and other additional author and content benefits such as text (subtitles), meta information for searching/browsing and digital rights manage-ment. The biggest challenge of video coding is to reduce the size of video data using compression, which is a key technology that enables a number of video services in many industries ranging from entertainment to communications. Anytime a user enjoys a vid-eo clip on a popular video sharing site, records a video on a mobile phone or on a digital camera, plays a movie from a DVD or a Blu-ray disc, attends a video-conferencing meet-ing, uses video surveillance for security, etc. video compression technologies are seam-lessly being used. The need for video compression in these services arises from the fact that a significant amount of space or bandwidth is required to store and transmit an un-compressed video. Consider a video sequence with a resolution of 720×480 pixels (a commonly used resolution in DVD discs), played at 25 frames-per-second. The bitrate of this video with three color components at 8 bits per pixel is over 200 Mbits/s. To store this video in current DVD discs, a compression by a factor of at least 200 is required.

Nowadays, the resolution used in current video services is increasing rapidly. High Definition (HD) resolutions of 720p (1280×720 pixels) and 1080p (1920×1080 pixels) are becoming increasingly common for digital TV broadcast, mobile video services (vid-eo telephony, mobile TV etc.) and handheld consumer electronics (digital still cameras, camcorders etc). With the advances in display and capture technologies, it is foreseen that even higher resolutions such as 4096×2160 pixels will be common in consumer en-tertainment applications. In addition to increased spatial resolution, the temporal frame rates of the video sequences are expected to increase to around 60 frames-per-second and higher. To enable digital video services with these increased resolutions, more effi-cient compression technologies need to be designed.

In addition to compression capabilities of video coders, the implementation com-plexity is a major concern. Consider the case of watching a video sampled at 1080p reso-lution at 60 frames-per-second. This means that a video decoder needs to process more than 6 million pixels in less than 20 milliseconds. Therefore, a practical video coder needs to achieve significant compression rates and at the same time operate at reasona-ble complexity levels. The implementation complexity is especially important in realiz-ing video services on mobile and other handheld devices with limited computational and storage resources.

2

At the time of writing the thesis, increased compression efficiency at high resolutions and improved computational efficiency are key requirements identified in the video standardization community for the next generation video standardization project [1].

1.1 FUNDAMENTALS OF VIDEO CODING

The basic concepts of digital video including capture and representation of video in digital form, color spaces, formats and quality measurement are given below:

Capture: A natural visual scene is spatially and temporally continuous. Representing a visual scene in digital form involves sampling the real scene spatially and temporally. In spatial sampling, sampling occurs at each of the intersection points on the grid and the sampled image may be reconstructed by representing each sample as a square pic-ture element (pixel). The visual quality of the image is thus influenced by the number of sampling points. Choosing a ‘coarse’ sampling grid produces a low-resolution sampled image whilst increasing the number of sampling points slightly increases the resolution of the sampled image. In temporal sampling, a moving video image is captured by taking a rectangular ‘snapshot’ of the signal at periodic time intervals. A higher temporal sam-pling rate (frame rate) gives apparently smoother motion in the video scene but re-quires more samples to be captured and stored. A video signal may be sampled as a se-ries of complete frames (progressive sampling) or as a sequence of interlaced fields (in-terlaced sampling).

Color spaces: The method chosen to represent brightness and color is described as a color space. There are 2 commonly used color spaces: RGB and YCbCr. In the RGB color space, a color image sample is represented with three numbers that indicate the relative proportions of Red, Green and Blue (the three additive primary colors of light). Any col-or can be created by combining red, green and blue in varying proportions. The YCbCr color space and its variations (sometimes referred to as YUV) is a popular way of effi-ciently representing color images. Y is the luminance component and U and V are the chrominance components. The complete description of a color image is given by the lu-minance component Y and the color differences U and V that represent the difference between the color intensity and the mean luminance of each image sample.

Format: CIF (Common Intermediate Format, 352×288), QCIF (Quarter CIF, 176×144), etc are the most commonly used video formats.

Quality: The quality of video is measured subjectively and objectively. In subjective quality measurement, the quality of video is evaluated by the viewers. The most com-monly used objective quality measurement is Peak Signal to Noise Ratio (PSNR).

3

A hybrid video encoder consists of three main functional units: a temporal model, a spatial model and an entropy coder.

1. The input to the temporal model is an uncompressed video sequence. The goal of the temporal model is to reduce temporal redundancy (shown in Figure 1) between transmitted frames by forming a predicted frame and subtracting this from the current frame. The key issue of a temporal model is motion estimation and compensation, which can be conducted for different block sizes, accuracy. The output of the temporal model is a residual frame and a set of model parameters, typically a set of motion vectors describ-ing how the motion was compensated.

2. The residual frame forms the input to the spatial model which makes use of simi-larities between neighboring samples in the residual frame to reduce spatial redundancy (shown in Figure 1). Typically this is achieved by applying a transform to the residual samples and quantizing the results. The transform converts the samples into another domain in which they are represented by transform coefficients. The coefficients are quantized to remove insignificant values, leaving a small number of significant coeffi-cients that provide a more compact representation of the residual frame. The output of the spatial model is a set of quantized transform coefficients.

3. The entropy coder compresses the parameters of the temporal model (typically motion vectors) and the spatial model (coefficients). This removes statistical redundan-cy in the data and produces a compressed bit stream or file that may be transmitted and/or stored. A compressed sequence consists of coded motion vector parameters, coded residual coefficients and header information.

Correspondingly, the video decoder reconstructs a video frame from the compressed bit stream. The coefficients and the motion vectors are decoded by an entropy decoder after which the spatial model is decoded to reconstruct a version of the residual frame. The decoder uses the motion vector parameters, together with one or more previously decoded frames, to create a prediction of the current frame and the frame itself is recon-structed by adding the residual frame to this prediction.

4

FIGURE 1 ILLUSTRATION OF TEMPORAL AND SPATIAL REDUNDANCY IN A VIDEO FRAME

The latest video coding standard, H.264/AVC [2], is a hybrid video coder, which is developed by the Joint Video Team (JVT) of the Moving Picture Experts Group (MPEG) of the International Standardization for Organization (ISO)/International Electrotechnical Commission (IEC) and Video Coding Experts Group (VCEG) of the International Tele-communication Union (ITU) Telecommunication Standardization Sector (ITU-T). H.264/AVC is published by MPEG as MPEG-4 part 10 Advanced Video Coding (AVC) and by ITU-T as ITU-T Recommendation H.264.

Coder Control

Transform/Quant.

Inv. Quant./Inv. Transform

Intra-frame Prediction

Motion Compensation

Entropy Coding

Motion Estimation

Deblocking Filter

+

Frame Buffer

+

Control Data

Output Video

Motion Data

-

Input Video

Output Bitstream

Intra / Inter

Decoder

Quant. Trans. Coeffs.

FIGURE 2 BLOCK DIAGRAM OF THE LATEST VIDEO CODING STANDARD H.264/AVC

5

H.264/AVC provides mechanisms for coding video that are optimized for compres-sion efficiency and aim to meet the needs of practical multimedia communication appli-cations. Figure 2 below shows the diagram of an H.264/AVC codec (encoder and decod-er). The key modules are transform/quantization, intra-frame prediction, motion esti-mation/compensation, deblocking filter, entropy coding etc.

The first version of H.264/AVC defines a set of three Profiles: Baseline, Main, Extend-ed, each supporting a particular set of coding functions and each specifying what is re-quired of an encoder or decoder that complies with the Profile. Performance limits for codecs are defined by a set of Levels, each placing limits on parameters such as sample processing rate, picture size, coded bitrate and memory requirements.

1. The Baseline Profile supports coded sequences containing I and P slices. I slices contain intra-coded macroblocks in which each 16×16 or 4×4 luma region and each 8×8 chroma region is predicted from previously coded samples in the same slice. P slices may contain intra coded, inter coded or skipped MBs. Inter coded MBs in a P slice are predicted from a number of previously coded pictures, using motion compensation with quarter-sample (luma) motion vector accuracy. After prediction, the residual data for each MB is transformed using a 4×4 integer transform and quantized. Quantized trans-form coefficients are reordered and the syntax elements are entropy coded. In the Base-line Profile, transform coefficients are entropy coded using a context-adaptive variable length coding scheme (CAVLC) and all other syntax elements are coded using fixed-length or Exponential-Golomb Variable Length Codes. Quantized coefficients are scaled, inverse transformed, reconstructed (added to the prediction formed during encoding) and (optionally) filtered with a de-blocking filter before being stored for possible use in reference pictures for further intra- and inter-coded macroblocks.

2. The Main Profile is almost a superset of the Baseline Profile, except that multiple slice groups, ASO (arbitrary slice ordering) and redundant slices (all included in the Baseline Profile) are not supported. The additional tools provided by Main Profile are B slices (bi-predicted slices for greater coding efficiency), weighted prediction (providing increased flexibility in creating a motion-compensated prediction block), support for in-terlaced video (coding of fields as well as frames) and CABAC (an alternative entropy coding method based on context adaptive arithmetic coding).

3. The Extended Profile may be particularly useful for applications such as video streaming. It includes all of the features of the Baseline Profile (i.e. it is a superset of the Baseline Profile, unlike the Main Profile), together with B slices, weighted prediction and additional features to support efficient streaming over networks such as the Internet. SP and SI slices facilitate switching between different coded streams and “VCR-like” func-tionality and Data Partitioned slices can provide improved performance in error-prone transmission environments.

6

A coded H.264/AVC video sequence consists of a series of NAL units, each containing a raw byte sequence payloads (RBSP). Coded slices (including Data Partitioned slices and IDR slices) and the End of Sequence RBSP are defined as VCL NAL units whilst all other elements are just NAL units.

1.2 TRANSFORMS USED IN VIDEO CODING

Transforms are widely used in video coding to remove spatial redundancy. The Karhunen-Loeve transform (KLT) [3] is an optimal transform in terms of decorrelation properties and energy compaction representation of the transform data. However, the kernel of KLT is data dependent and should be derived for each image sub-block. More-over, in general the KLT kernel is non-separable and this would require full matrix mul-tiplication which demands unreasonable computational resources. However, it has been shown that for the majority of signal processing applications the type II Discrete Cosine Transform (DCT) is a close approximation of KLT [3]. Basis functions of DCT length N are defined as:

(2 1)cos , 0,..., 12

1 , 0

2 , 0

i i

i

j iA C for i j NN

iNwhere C

iN

π+= ⋅ = −

==

>

(1.1)

where i and j are row and column indexes of the transform matrix, and Ci is a scale factor which is normalizing the transform matrix. The DCT serves as a close approximation of eigenvectors of KLT for signals that can be modeled as first-order Markov processes with high correlation. The output of a 2D DCT is a set of N×N coefficients representing the image block data in the transform domain and these coefficients can be considered as “weights” of a set of standard basis patterns. Typically, the large portion of the signal energy is to be represented by a relatively few number of the transform coefficients, thus energy compaction is achieved.

Another advantage of DCT is that its multidimensional versions can be computed with separable product of DCT along each of the directions. 2D DCT coefficients may be computed by first applying the one dimensional DCT to each image row and then using the corresponding coefficients from each row as input to a one dimensional DCT applied along the columns. There are a number of fast computation algorithms for DCT, when N is not prime and, in particular, when N is a power of 2. In the latter case, these algo-rithms require O(Nlog2N) arithmetic operations to transform Np image or video pixels, with a block size of N.

7

The (near) optimality of the DCT, the separable nature of the 2D DCT as well as the large amount of fast computational algorithms available in the literature [3], [4] and [5], have allowed the wide applicability of this transform. For example, DCT or DCT-like transforms are utilized in all transform-based coding schemes as JPEG, H.26x, and MPEG1/2/4.

In earlier work, residual decoding contains the possibility of drift (mismatch be-tween the decoded data in the encoder and decoder). The drift arises from the fact that the inverse discrete cosine transform (IDCT) is not fully specified in integer arithmetic; rather it must satisfy statistical tests of accuracy compared with a floating point imple-mentation of the inverse transform (e.g., the IDCT accuracy specification in H.261 Annex A). For more information on the analysis of the drift and the IDCT design, refer to [P1], [P2] and [P3]. H.264/AVC makes extensive use of prediction, since even the intra coding modes rely upon spatial prediction. As a result, H.264/AVC is very sensitive to predic-tion drift. To achieve drift-free residual coding, in H.264/AVC, an integer orthogonal ap-proximation to the DCT, namely Integer Cosine Transform (ICT) is used [17], allowing for bit-exact implementation for all encoders and decoders, thus avoiding inverse trans-form mismatch problem and solving the drift problem caused by DCT. The 4×4 and 8×8 ICT matrices, adopted in H.264/AVC, are given below in (1.2), (1.3), (1.4) and (1.5) re-spectively. They are proved to provide similar compression efficiency as DCT.

4f

1 1 1 12 1 1 21 1 1 11 2 2 1

T

− − = − − − −

(1.2)

4i

11 1 12

11 1 1211 1 12

11 1 12

T

− −

= − −

− −

(1.3)

8

8f

8 8 8 8 8 8 8 812 10 6 3 3 6 10 12

8 4 4 8 8 4 4 810 3 12 6 6 12 3 10 1

88 8 8 8 8 8 8 86 12 3 10 10 3 12 64 8 8 4 4 8 8 43 6 10 12 12 10 6 3

T

− − − − − − − − − − − − = ⋅ − − − −

− − − − − − − −

− − − −

(1.4)

8i 8fTT T= (1.5)

Recently, a variation of ICT, called Prescaled Integer Transform (PIT) [P4], was pro-posed to further reduce the implementation complexity of ICT while all the merits of ICT are kept and no penalty in performance is observed. PIT has been adopted in audio vid-eo coding standard (AVS), Chinese National Coding standard [6].

1.3 OUTLINE AND OBJECTIVES OF THE THESIS

This thesis presents novel algorithms and methods to improve the coding efficiency and reduce the complexity of video coders. The developed algorithms have a particular focus on high-resolution, low complexity use-cases to enable the emerging high quality high resolution video services on mobile terminals.

The first main contribution of this thesis is to introduce fixed-point design method-ologies and several resulting implementations of the inverse discrete cosine transform [P1][P2][P3]. The drift analysis for integer IDCT is also given.

The second main contribution of this thesis is to introduce pre-scaled integer trans-form (PIT) [P4] which reduces the implementation complexity of the conventional inte-ger cosine transform (ICT) and maintains all the merits such as bit-exact implementa-tion and good coding efficiency. Design rules that lead to good PIT kernels are developed and different types of PIT and their target applications are examined.

The third main contribution of this thesis is to introduce a new transform technique called spatially varying transform, which achieves up to 13.50% bitrate reduction com-pared to the latest video coding standard H.264/AVC, and at the same time with low im-plementation complexity [P5][P6][P7].

The main goal of the research work presented in this thesis is to develop novel algo-rithms to be used in future video coding standards. It should be noted that, at the time of writing the thesis, VCEG and MPEG, the two international standardization organizations have joined and formed the Joint Collaborative Team on Video Coding (JCT-VC). JCT-VC

9

embarked on a new project to develop the next generation video coding standard called High Efficiency Video Coding (HEVC). The preliminary requirements for HEVC were bit rate reduction of 50% at the same subjective image quality comparing to H.264/AVC high profile, with computational complexity ranging from ½ to 3 times that of the high profile. It was also stated that HEVC would be able to provide 25% bit rate reduction along with 50% reduction in complexity at the same perceived video quality as the high profile, or to provide greater bit rate reduction with somewhat higher complexity. Many new techniques, e.g., extended macroblock sizes, improved interpolation, and flexible motion representation etc, are adopted in HEVC. The proposed novel algorithms and methods in this thesis are developed to improve the coding efficiency and reduce the complexity of video coders, thus could be potentially used in the HEVC framework.

1.4 AUTHOR’S CONTRIBUTIONS TO THE PUBLICATIONS

Publication [P1] proposes efficient fixed-point approximations of 8x8 inverse dis-crete cosine transform. The idea is proposed by Dr. Reznik and Dr. Hinds. The author proposed the extended “linearity test” and performed some of the simulations.

Publication [P2] presents a drift analysis for integer IDCT. The idea is proposed by Dr. Hinds and Dr. Reznik. The author helped perform some of the simulations.

Publication [P3] presents a multiplier-less approximation of the DCT/IDCT with low complexity and high accuracy. The idea is proposed by the author. The author imple-mented the idea, performed all the simulations and wrote most of the publication, with assistance of the co-authors.

Publication [P4] proposes the technique of pre-scaled integer transform. The idea is proposed by Prof. Cham and Dr. Lou. The author implemented the ideas in the H.264/AVC reference software, performed the theoretical analysis and all the simula-tions and most of the writing of the publication, with assistance of the co-authors.

Publication [P5] presents the concept, design and implementation of spatially vary-ing transform. In [P5], the concept, design and implementation of spatially varying transform is first introduced. The idea presented in [P5] is co-invented by the author and Jani Lainema. The author implemented the ideas in the H.264/AVC reference soft-ware, performed all the simulations and most of the writing of the publication, with as-sistance of the co-authors.

Publication [P6] extends the analysis and work done in [P5]. The main ideas are pro-posed by the author. The author also implemented the algorithms, performed all the simulations and wrote most of the publication, with assistance of the co-authors.

10

Publication [P7] proposes the idea of prediction signal aided spatially varying trans-form, which utilizes the gradient of prediction signal to eliminate the unlikely spatially varying transform positions. The original idea was conceived by the author. The author implemented the idea, performed all the simulations and wrote most of the publication, with assistance of the co-authors, especially Dr. Kemal Ugur.

11

CHAPTER 2 FIXED-POINT APPROXIMATIONS OF THE 8X8 INVERSE DISCRETE COSINE TRANSFORM

The Discrete Cosine Transform (DCT) is widely used in video and image coding applica-tions as it is the transform used in both MPEG [44] and JPEG [45] standards. The Inverse Discrete Cosine Transform (IDCT) is the inverse process of the DCT. Theoretically, the DCT/IDCT is defined as real number operations and the implementation complexity is high, thus reducing the complexity is one of the most important issues when designing DCT/IDCT implementation. Another important issue specifically for IDCT implementa-tion is to reduce the drift problem (i.e., mismatch accumulation leading to degradations in quality of reconstructed video), which is caused by different IDCT implementations in encoder and decoder.

In early 2005, because of the expiration and withdrawal of a related IDCT precision specification (IEEE Standard 1180-1990 [46]), MPEG has decided to improve its han-dling of this matter, and produce

• A new precision specification ISO/IEC 23002-1, replacing the IEEE 1180 standard, and harmonizing the treatment of precision requirements across different MPEG standards [47], and

• A new voluntary standard, ISO/IEC 23002-2, providing specific (determinis-tically defined) examples of fixed-point 8x8 IDCT and DCT implementations.

2.1 PRECISION REQUIREMENTS FOR IDCT IMPLEMENTATIONS IN MPEG AND ITU-T STANDARDS

Specifications of MPEG, JPEG, and several ITU-T video coding standards, do not re-quire IDCTs to be implemented exactly to generate the same output as the theoretical output. The MPEG IDCT precision standard: ISO/IEC 23002-1 [48], defines the precise specification of the error tolerances, and how they are to be measured for a given IDCT implementation. This standard includes several tests using a pseudo-random input gen-erator originating from the former IEEE Standard 1180-1990 [46], as well as additional tests required by MPEG standards.

Table 1 below summarizes the error metrics defined by the IEEE 1180 | ISO/IEC 23002-1 specification. The variable i=1,…,Q indicates the index of a pseudo random in-put (an 8x8 matrix) used in a test, and Q=10000 (or in some tests, Q=100000) denotes the total number of sample matrices. The tolerance for each metric is provided in the last column of the paper.

12

TABLE 1 THE IEEE 1180 | ISO/IEC 23002-1 PSEUDO-RANDOM IDCT TEST METRICS AND CONDITIONS [P1]

where ˆ iyxf and i

yxf are theoretical and practical IDCT outputs, respectively; dyx= ˆ i iyx yxf f− ;

eyx= 2ˆ( )i iyx yxf f− . In addition, a so-called “linearity test” [49][50] is used as an additional

(informative) test provided in the ISO/IEC 23002-1/FPDAM1 specification [51]. This test requires the reconstructed pixel values produced by an IDCT implementation under test fyx to be symmetric with respect to the sign reversal of its input coefficients where Fvu is the IDCT input and z, v, t, u, s are variables:

, , , 1,3,...,527,

( ) ( ),0, , , , , 0,...,7.yx vu yx vu vu

z v t u s zf F f F F

otherwise v u t s= = =

− = − = = (2.1)

2.2 FIXED-POINT APPROXIMATIONS OF DCT/IDCT

There are many efficient and well-known factorizations for 1D DCT, including those proposed by Arai et. al. (AAN) [52], (only 5 multiplication and 29 addition operations), and the non-scaled factorizations of Chen, et. al. [53], Lee [54], Vetterli and Ligtenberg (VL) [55], and Loeffler, et. al. (LLM) [56]. The VL and LLM algorithms are the least com-plex among known non-scaled designs, and require only 11 multiplication and 26 addi-tion operations. The suitability of each of these factorizations to the design of fixed point IDCT algorithms and their drift characteristic have been extensively analyzed in the course of work for the ISO/IEC 23002-2 standard. Publications [P1][P2][P3] provide more details.

In order to implement DCT/IDCT in fixed-point arithmetic, one of the most common and practical technique is based on the approximations of irrational factors αi by dyadic fractions:

0

/ 2ki k

kaα

=

≈∑ (2.2)

where both ak and k are integers. In this way, multiplication of x by a factor αi permits the implementation of a very simple approximation in integer arithmetic as follows:

13

0(( * ) )i k

kx x a kα

=

≈ >>∑ (2.3)

where >> denotes the bit-wise right shift operation.

The exact multiplier-less implementation of the multipliers in the factorization is crucial to the accuracy of the IDCT implementation. If we use F(i) and I(i) to represent the output of a theoretical multiplier and a practical multiplier when the input is i, then one straightforward approach is to find a multiplier-less implementation that minimizes the approximation error, e.g., sum of squared approximation errors (SSAE) and sum of absolute errors (SAE), which is calculated as (2.4) and (2.5) below, respectively:

2

2

0( ( ) ( ))

SCALE

iSSAE I i F i

=

= −∑ (2.4)

2

0( ) ( )

SCALE

iSAE I i F i

=

= −∑ (2.5)

However, there may be some disadvantages of this approach:

1. The statistics of the input data of different multipliers is different from each oth-er. Moreover, it is difficult to estimate the statistics since it may vary greatly un-der different situations: different sequence and coding methods, etc.

2. The optimization of implementation of each multiplier is independent from each other. However, in fact, the truncating error among different multipliers will in-teract with one another through the additions/subtractions in the factorization.

Experiments also show that independent optimization of each multiplier using crite-ria such as (2.4) and (2.5) does not yield the best implementation. Normally brute force search is conducted to search for the best implementation. Typically only the multiplier-less implementation with minimum number of additions/subtractions is used for each multiplier and Overall Mean Square Error (OMSE) is selected as the criterion for the op-timization problem, see [P3]. Several fixed-point approximations of DCT/IDCT are de-rived and the drift performance is also analyzed [P3].

2.3 THE ISO/IEC 23002-2 IDCT

The overall architecture used in the design of the ISO/IEC 23002-2 IDCT algorithm is shown in Figure 3, which can be characterized by its separable and scaled features. The scaling stage is performed with a single 8x8 matrix that is precomputed by factoring the 1D scale factors for the row transform with the 1D scale factors for the column trans-form. The scaling stage is also used to pre-allocate P bits of precision to each of the input

14

DCT coefficients. The underlying basis for scaled 1D transform design is a variant of the well-known factorization of Loeffler et. al. [56] with 3 planar rotations and 2 independ-ent factors 2γ = , as shown in Figure 4. This choice has been made empirically based on an extensive analysis of fixed-point designs derived from other known algorithms, in-cluding variants of AAN, VL, and LLM factorizations. For more details of the ISO/IEC 23002-2 IDCT, please see [P1].

FIGURE 3 FIXED-POINT 8X8 IDCT ARCHITECTURE. [P1]

FIGURE 4 LOEFFLER-LIGTENBERG-MOSCHYTZ IDCT FACTORIZATION (LEFT), AND ITS GENERALIZED SCALED FORM (RIGHT). [P1]

15

CHAPTER 3 DIRECTIONAL TRANSFORMS

Traditional 2-D DCT/ICT/PIT [P4] are typically implemented by separable 1-D trans-form in the horizontal and vertical directions. However, this approach does not take im-age orientation features into account, and thus may not be the most appropriate one. This well known fact has recently triggered several attempts towards the development of directional transforms so as to better preserve the directional information. Some of these directional transforms have been applied in video coding, demonstrating a signifi-cant coding gain. Basically, the following properties are desired for a good directional transform in image coding [27]:

• High efficiency of representing various directional signals. To handle different contents in an image, the transform should be able to exploit the correlations along dif-ferent directions, and thus the basis vectors should be directional and anisotropic.

• Low implementation complexity. An ordinary video frame may contain tens of millions of pixels. The transform should be easy to implement to handle such a huge number of pixels.

• Non-redundant. The transform coefficients should not be redundant. Although the transform itself is not necessary to be a critically-sampled one, it needs a very careful judgment to decide whether or not it should be over-complete.

• Able to take full advantage of existing coding tools. Image/video coding has evolved for several decades. Many coding tools are heavily tuned and already very ma-ture. It is very difficult to start from the beginning to improve the current state-of-the-art. For example, any new transform that can generate coefficients in a similar structure as those transforms used in the current coding schemes would be preferred.

Recently, several directional transforms have been proposed and applied in image coding. They can basically be divided into three categories [27]:

• Reorganization-based directional transforms: pixels in an image block are re-organized according to a selected direction, and a conventional transform is then ap-plied.

• Lifting-based directional transforms: the directional transforms are obtained by factorizing a conventional transform into lifting operators and performing each lifting operator directionally.

• Data-dependent directional transforms: the directional transform is constructed by a direction prediction and a corresponding data-dependent transform thereafter.

16

More details are provided in the next subsections about each category.

A new transform technique called spatially varying transform, which tries to tackle the same problem of traditional 2-D DCT, is also proposed [P5][P6][P7].

3.1 REORGANIZATION-BASED DIRECTIONAL TRANSFORM

The basic idea behind the first category of directional transforms is to reorganize pixels along a certain direction for each 1D transform [8]. The idea is very suitable to block transforms such as DCT. Similar to the directional intra-prediction modes defined in H.264/AVC, eight directional modes are usually used in directional DCT. Two modes are the same as the conventional 2D DCT. Other modes cover diagonal down-left, diago-nal down-right, vertical-right, horizontal-down, vertical-left, and horizontal-up; whereas the DC mode in H.264 is not used here. These modes are depicted in Figure 5 below to explain how the directional DCT works. An example of directional DCT is also shown in Figure 6 and Figure 7.

FIGURE 5 SIX DIRECTIONAL MODES DEFINED IN A SIMILAR WAY AS WAS USED IN H.264/AVC FOR THE BLOCK SIZE 8×8. THE VERTICAL AND HORIZONTAL MODES ARE NOT INCLUDED HERE. [9]

17

FIGURE 6 N×N IMAGE BLOCK IN WHICH THE FIRST 1-D DCT WILL BE PERFORMED ALONG THE DIAGONAL DOWN-LEFT DIRECTION. [9]

FIGURE 7 EXAMPLE OF N=8: ARRANGEMENT OF COEFFICIENTS AFTER THE FIRST DCT (LEFT) AND ARRANGEMENT OF COEFFICIENTS AFTER THE SECOND DCT AS WELL AS THE MODIFIED ZIGZAG SCANNING (RIGHT). [9]

3.2 LIFTING-BASED DIRECTIONAL TRANSFORMS

There are various transforms that can be factorized into lifting operators. The lifting-based implementation of the 8-point DCT is illustrated in Figure 8. The inverse trans-form can be implemented by reversing the order of lifting operators and the sign of each operator.

18

FIGURE 8 FACTORIZING 8-POINT DCT INTO PRIMARY OPERATIONS. X[N] AND Y[N] (N=0,1,…,7) ARE INPUT SIGNAL AND OUTPUT DCT COEFFICIENT, RESPECTIVELY. OI (I=1,2,…,35) IS PRIMARY OPERATION. [28]

The transform can be turned into corresponding directional transforms by perform-ing each lifting operator directionally, as shown in Figure 9. The inverse of a directional lifting operator is also a directional operator, which is along the same direction but with the reverse lifting weight. Concatenating all the inverse lifting operators together, the inverse directional transform can be exactly generated.

Based on the lifting-based approach, directional DCT [28] is also proposed to bring directional basis functions into those transforms. These directional functions are gener-ated along the given direction that each lifting operator follows. On the other hand, these directional basis functions still preserve the frequency properties of the DCT. Thus, di-rectional DCT is a direction-frequency analysis of the video signals.

19

FIGURE 9 EXEMPLIFIED PRIMARY OPERATIONS (A) NONDIRECTIONAL AND (B) DIRECTIONAL, WHERE WHITE CIRCLES DENOTE INTEGER PIXELS AND GRAY CIRCLES HALF PIXELS. [28]

In practical coding using lifting-based directional transforms, the image to be coded is divided into blocks. For each block, one direction from a set of predefined directions is assigned and the transform will be along that direction within the block.

3.3 DATA-DEPENDENT DIRECTIONAL TRANSFORMS

In the intra frame coding of H.264/AVC, the intra prediction only removes the direc-tional redundancy among neighboring blocks. It does not exploit the directional correla-tion within the current block. Thus, after the intra prediction, there are still directional correlations left in the prediction residues. To fully remove the directional redundancy within the image, mode-dependent directional redundancy within the image, mode-dependent directional transform (MDDT) [29] is proposed in the framework of H.264/AVC. Instead of using just one transform in H.264/AVC, MDDT uses different transforms according to the prediction direction of the image blocks. The transform for each direction, which can be designed to be separable or non-separable, is obtained by training the KLT based on the signals of the same mode. Thus, the transform is data de-pendent. By applying intra prediction to remove inter-block directional redundancy and MDDT to remove that redundancy within the current block, the whole prediction and transform provide an efficient solution to exploit the directional correlation within H.264/AVC framework.

21

CHAPTER 4 PRE-SCALED INTEGER TRANSFORM

Integer Cosine Transform (ICT) was introduced by Cham in 1989 [39] and is further de-veloped in recent years. It has been proved that some ICTs have almost the same com-pression efficiency as the Discrete Cosine Transform (DCT) but much simpler implemen-tation because only additions and shifts operations are needed [17]. Moreover, ICT can avoid inverse transform mismatch problems of the DCT. Due to these advantages, the latest international video coding standard H.264/AVC adopted order-4 and order-8 ICT transforms [40][41].

In order to reduce the computational complexity, H.264/AVC merges the for-ward/inverse scaling and the quantization/dequantization operations into one step. Figure 10 is the block diagram for the conventional ICT scheme.

FIGURE 10 BLOCK DIAGRAM OF CONVENTIONAL ICT SCHEME IN H.264/AVC

The main idea of pre-scaled integer transform (PIT) is that inverse scaling is moved to the encoder side and combined with forward scaling and quantization as a single pro-cess. The fact that no scaling is needed at the decoder side distinguishes PIT from con-ventional ICT, and this is exactly the reason why PIT can reduce the required memory size and computational complexity on decoder side at the same time, as will be ex-plained below. The block diagram for the PIT scheme is shown in Figure 11.

FIGURE 11 BLOCK DIAGRAM OF PROPOSED PIT SCHEME

To show the advantages of PIT over ICT, let us take order-4 transform as an example first. In H.264/AVC, the QP period is 6 and in decoder a dequantization-scaling matrix QSict is used for 4×4 ICT [10][17][40].

22

10 16 1311 18 1413 20 1614 23 1816 25 2018 29 23

=

ictQS

(4.1)

For the (i, j)th transform coefficient in a block, the search rule for corresponding dequantization-scaling element in QSict is defined by:

( %6,0) ( %2, %2) (0,0)( %6,1) ( %2, %2) (1,1)( %6,2)

ict

ict

ict

QS QP for i j equal toQS QP for i j equal toQS QP else

(4.2)

where “%” is modulus operator and “x%y” means the remainder of x divided by y, de-fined only for integers x and y with x≥0 and y>0.

The main problem here is that for every transform coefficient, an operation needs to be conducted, which uses QP and the coordinates of the coefficients in the block, to search for the corresponding elements in QSict. Some computations as in (4.2) are also needed. In order to reduce the computational complexity, and at the same time facilitate parallel processing and take advantage of the efficient multiply/accumulate architecture of many processors, QSict can be fully expanded to QS’ict, as in (4.3) and correspondingly the search rule is simplified to (4.4). However, the storage will increase from 6×3=18 bytes (see (4.2)) to 6×4×4=96 bytes (see (4.3)). This memory size will be much larger when 8×8 ICT is also used. In that case, a total memory of 6×4×4+6×8×8=480 bytes should be allocated.

23

10 13 10 1313 16 13 1610 13 10 1313 16 13 16

11 14 11 1414 18 14 1811 14 11 1414 18 14 18

13 16 13 1616 20 16 2013 16 13 1616 20 16 20

14 18 14 1818 23 18 2314 18 14 1818 23 18 23

16 20 16 2020 25 20 251

=

'ictQS

6 20 16 2020 25 20 25

18 23 18 2323 29 23 2918 23 18 2323 29 23 29

(4.3)

' ( %6, , )ictQS QP i j (4.4)

When PIT is used, the dequantization-scaling matrix (or more accurately, the dequantization matrix, since no scaling is included any more) QSpit is as below.

T[10,11,13,14,16,18]=pitQS (4.5)

And for the (i, j)th transform coefficient in a block, the search rule for the corresponding element in QSpit is further simplified to:

( %6)pitQS QP . (4.6)

Comparing (4.5), (4.6) to (4.1), (4.2) and (4.3), (4.4) respectively, it is clear that the de-coding complexity is reduced with PIT. First, if the dequantization-scaling matrix is fully

24

expanded, when order-4 ICT is used, a total memory of 6×4×4=96 bytes can be replaced by a memory of only 6 bytes. The memory required when using PIT is much lower than that required when using conventional ICT. Otherwise, if the dequantization-scaling ma-trix is not expanded, a total memory of 6×3=18 bytes can be reduced to 6 bytes. Though this saving is trivial considering only order-4 ICT, keeping in mind that in this case every non-zero coefficient needs a lookup operation and some extra operations when using ICT, the computational complexity of PIT will be lower because 3-D lookup operation is replaced by 1-D lookup operation which only uses QP, and at the same time no other ex-tra operations as in (4.2) is needed. In either case, at the decoder side, the required memory size can be reduced and the 3-D lookup operation can always be replaced by a 1-D lookup operation thus pipeline and parallel processing is facilitated and extra com-putation or storage memory can be saved. While at encoder side, since forward scaling, inverse scaling and quantization are combined as a single process and the original quan-tization-scaling matrix is replaced by a combined-scaling-quantization matrix with the same size, both the computational and storage complexity remain unchanged.

Only order-4 transform is considered in the discussion above. When both order-4 and order-8 transforms are used, a total memory of 6×4×4+6×8×8=480 bytes can be saved and a memory of only 6 bytes is needed instead, assuming that the order-4 and order-8 PITs employed are compatible.

4.1 DESIGN RULES OF PIT KERNELS

Just like ICT, not every PIT kernel performs well. It is practically important to find design rules that lead to good PIT kernels. The following two factors should be consid-ered when choosing a PIT kernel.

Compression Ability: It is proved in [P4] that PIT kernels are the same as the corre-sponding ICT kernels in terms of transform coding gain [42] and DCT frequency distor-tion [43]. So basically, good ICT kernels can and should be used to obtain good PIT ker-nels.

Weighting Factors of Transform Coefficients: It is shown in [P4] that the PIT coeffi-cient matrix can be regarded as a weighted ICT coefficient matrix where Si is the weighting matrix. While all coefficients of an orthogonal ICT have the same maximum and minimum values, those of PIT do not because the weighting matrix Si changes the relative magnitudes of the transform coefficients. The Weighting Factor Difference (WFD) of a PIT represents the difference of weighting factors applied to different trans-form coefficients:

25

,

,

max ( , )

min ( , )ik l

ik l

S k lWFD

S k l= . (4.7)

The deviation of the WFD of a PIT away from unity may cause a problem. It may result in truncation of some transform coefficients that could be retained if Si=αIf where α is a scalar and If is a matrix with all its elements equal to 1. However, unfortunately the con-dition Si=αIf cannot always be true when PIT is used. In order not to change the trans-form coefficients too much so as to retain the compression ability of the corresponding ICT kernels, elements of the weighting matrix Si should have values close to one another. Another advantage of this consideration is that the change from ICT to the correspond-ing PIT will be small so that other parts of a codec need not be re-designed: the scan or-der of the transform coefficients need not be changed and the entropy coding table de-signed for the ICT can also be re-used for the PIT without major changes.

Based on the discussion above, it can be concluded that a good PIT kernel can be ob-tained following the steps below:

1) Obtain a good ICT kernel. To derive good ICT kernels, computer search can be sys-tematically carried out based on transform coding gain and DCT frequency distor-tions.

2) Choose a good ICT kernel whose WFD is not larger than 2 , which is empirically de-rived [P4].

4.2 TYPES OF PRE-SCALED INTEGER TRANSFORM

Besides the design rules derived above, different types of PITs have different charac-teristics and are suitable for different applications, which is also an important issue in practice.

There are generally two types of PITs based on Frequency Scale Factor (FSF). For or-der-N PIT, FSF measures the scaling effect of Si on the transformed coefficients and is defined as:

2

,

( , )1(0,0)

i

i j i

S i jFSFN S

= ∑. (4.8)

1) Type-I PITs, whose FSF is less than 1, have a characteristic that more high-frequency components may be quantized out compared to the corresponding ICT. Type-I PITs may be more suitable for video streaming and conferencing applications where low resolution (such as QCIF and CIF) video sequences are usually used and coded at relatively low bit-rates because Type-I PITs lead to bit savings without degrading

26

the subjective quality significantly in this situation.

2) Type-II PITs, whose FSF is larger than or equal to 1, have a characteristic that more high-frequency components may be retained after quantization compared to the corresponding ICT. Type-II PITs may be more suitable for entertainment quality and other professional applications where higher resolution (such as HD) video se-quences are often used and coded at relatively high bit-rates.

Experiments in [P4] show that no coding efficiency penalty is observed but even some improvement can be obtained compared to H.264/AVC when PIT scheme is care-fully designed. Because of this, PIT has been adopted in Audio Video coding Standard (AVS), the Chinese National Coding standard with PIT8[10,9,6,2;10,4;8] in AVS part 2 and PIT4[3,1;2] in AVS part 7, respectively, due to its implementation complexity reduc-tion and good performance [P4] [6].

27

CHAPTER 5 SPATIALLY VARYING TRANSFORM

Typical block based transform design in most existing video coding standards does not align the underlying transform with the possible edge location. In this case, the coding efficiency decreases. As introduced in chapter 2, directional transforms are proposed to improve the efficiency of transform coding for directional edges. However, efficient cod-ing of horizontal, vertical and non-directional edges was not tackled; whereas typically, in natural video sequences, such edges are more common than directional edges. This is the main reason why only very marginal overall gain has been achieved in some previ-ous work by other researchers [9]. Simple examples are given in [P6] which show that by aligning the transform to the edge location, more energy in the transform domain concentrates in the low frequency part and this facilitates the entropy coding and im-proves the coding efficiency.

Furthermore, coding the entire prediction error signal may not be the best in terms of rate distortion tradeoff. An example is the SKIP mode in H.264/AVC [10], which does not code the prediction error at all. This fact is useful in improving the coding efficiency especially for HD video coding since generally a certain image area in HD video sequenc-es contains less detail than in lower resolution video sequences, and the prediction error tends to have lower energy and to be more noise-like after better prediction.

A new Spatially Varying Transform (SVT) was developed in [P5] to take advantage of the facts stated above in order to improve coding efficiency. The basic idea of SVT is that the transform coding is not restricted inside regular block boundaries but is adjusted to the characteristics of the prediction error. With this flexibility, coding efficiency im-provement can be achieved by selecting and coding the best portion of the prediction error in terms of rate distortion tradeoff. Generally, this can be done by searching inside a certain residual region after intra prediction or motion compensation, for a sub-region and only coding this sub-region. Finally, the location parameter (LP) indicating the posi-tion of the sub-region inside the region is coded into the bitstream if there are non-zero transform coefficients. There are two ways to interpret SVT:

1. SVT can be considered as a region-of-interest (ROI) coding technique [37] where the region of interest is selected in the prediction error domain and only the re-gion of interest is coded.

2. When the region refers to a macroblock, the proposed algorithm can be consid-ered as a special SKIP mode, where part of the macroblock is “skipped”.

In our case, a single block is selected and coded inside a macroblock when applying SVT. Extension to multiple blocks is straightforward. The selection is due to the fact that

28

coding on a macroblock basis reduces complexity, delay and throughput while facilitat-ing the implementation especially for hardware. However, it should be noted that there is no restriction on the size and shape of a “sub-region” and a “region” and thus the pro-posed idea could easily be extended to cover arbitrary sizes and shapes. For instance, SVT can be applied to an extended macroblock [15] of size larger than 16×16. Direction-al spatially varying transforms with a directional oriented block selected and coded with a corresponding directional transform (e.g., [9]) can also be used.

5.1 DESIGN OF SVT

In the following sub-sections, three key issues of SVT are discussed in more detail: selection of SVT block-size, selection and coding of candidate LP and filtering of SVT block boundaries.

5.1.1 SELECTION OF SVT BLOCK-SIZE

In our implementation, an M×N SVT is applied on a selected M×N block inside a mac-roblock of size 16×16, and only this M×N block is coded. It is easy to see that there are in total (17-M)×(17-N) possible LPs if the M×N block is coded by an M×N transform. The following factors are taken into account when choosing suitable M and N:

1) Generally, larger M and N will result in fewer possible LPs and less overhead; and vice versa.

2) Also, larger M and N will result in lower distortion of the reconstructed mac-roblock but need more bits in coding the transform coefficients; and vice versa.

3) Finally, a larger block-size transform is more suitable in coding relatively flat are-as or non-sharp edges, without introducing coding artifacts such as ringing, while a smaller block-size transform is more suitable in coding areas with details or sharp edg-es.

To facilitate the transform design, it can be further assumed that M=2m, N=2n where m and n are integers ranging from 0 to 4 inclusive in our context1. In our case, 8×8 SVT, 16×4 SVT and 4×16 SVT are used and are illustrated in Figure 12 and Figure 13 respec-tively, to show the effectiveness of the proposed algorithm. Nevertheless, it should be noted that, for different sequences with different characteristics, other block-size SVT

1 Note that SKIP mode is a special case when m and n are both equal to 0 and the macroblock partition is 16×16 (and the motion vector equals the predicted value in the context of H.264/AVC), and 16×16 transform is another special case when m and n are both equal to 4.

29

can be more suitable in terms of coding efficiency, complexity, etc. Using VBT [16] in the framework of SVT, namely variable block-size SVT (VBSVT), the prediction error is bet-ter localized and the underlying correlations are better exploited, thus the coding effi-ciency is improved compared to a fixed block-size SVT.

∆x

∆y

16 pixels

16 pixels

8

8

8x8 SVT Transform Block

Location: (∆x, ∆y)

FIGURE 12 ILLUSTRATION OF 8×8 SPATIALLY VARYING TRANSFORM.

Δx

Δy

16pixels

16 pixels

4

4

16x4 SVT Transform BlockLocation: ∆y

16 pixels

16pixels

4x16

SVT

Tra

nsfo

rm B

lock

Loca

tion:

∆x

FIGURE 13 ILLUSTRATION OF 16×4 AND 4×16 SPATIALLY VARYING TRANSFORM.

The transforms used for 8×8, 16×4 and 4×16 SVTs are described below. In general, a separable forward and inverse orthonormal 2-D transform of a 2-D signal can be written as

1v hC T X T −= ⋅ ⋅ , (5.1)

1r v hX T C T−= ⋅ ⋅ (5.2)

respectively, where X denotes a matrix representing M×N pixel block, C is the transform coefficient matrix, and Xr denotes a matrix representing the reconstructed signal block. Tv and Th are the M×M and N×N transform kernels in the vertical and the horizontal di-rection, respectively. The superscript “t” denotes matrix transposition. For 8×8 SVT, we

30

simply reuse the 8×8 transform kernel in H.264/AVC [10]. For 16×4 SVT and 4×16 SVT, we use the 4×4 transform kernel in H.264/AVC [10][17] and the 16×16 transform kernel in [18] because it is simple and can reuse the butterfly structure of the existing 8×8 transform in H.264/AVC. In all cases, normal zig-zag scan (for frame based coding which is used in our experiments) is used to represent the transform coefficients as input sym-bols to the entropy coding.

5.1.2 SELECTION AND CODING OF CANDIDATE LPS

When there are non-zero transform coefficients of the selected SVT block, its location inside the macroblock needs to be coded and transmitted to the decoder. For 8×8 SVT, as shown in Figure 12, the location of the selected 8×8 block inside the current mac-roblock can be denoted by (Δx, Δy) which can be selected from the set Φ8×8={(Δx, Δy), Δx, Δy=0,…,8}, There are in total 81 candidates for 8×8 SVT. For 16×4 SVT and 4×16 SVT, as shown in Figure 13, the locations of the selected 16×4 and 4×16 block inside the cur-rent macroblock can be denoted as Δy and Δx, respectively, which can be selected from the set Φ16×4={Δy, Δy=0,…,12} and Φ4×16={Δx, Δx=0,…,12} respectively. There are in total 26 candidates for 16×4 SVT and 4×16 SVT.

The “best” LP is then selected according to a given criterion. In our case, rate distor-tion optimization (RDO) [19] is used to select the best LP in terms of RD tradeoff by min-imizing the following:

J = D + λ · R (5.3)

where J is the RD cost of the selected configuration, D is the distortion, R is the bit rate and λ is the Lagrangian multiplier. The reconstruction residue for the remaining part of the macroblock is simply set to be 0 in our implementation, but different values can be used and might be beneficial in certain cases (luminance change, etc).

The set of candidate LPs is also important since it directly affects the encoding com-plexity and the performance of SVT. Larger number of candidate LPs provides more room and is more robust for coding efficiency improvement for difference sequences with different characteristics, but adds more encoding complexity and also more over-head. Experimentally, according to criterion (5.3), we choose to use Φ8×8’={(Δx, Δy), Δx=0,…,8, Δy=0; Δx=0,…,8, Δy=8; Δx=0, Δy=1,…,7; Δx=8, Δy=1,…,7} for 8×8 SVT and Φ16×4={Δy, Δy=0,…,12} and Φ4×16={Δx, Δx=0,…,12} for 16×4 SVT and 4×16 SVT, which shows good RD performance over a large test set [P5]. There are in total 58 candidate LPs, and statistics show that the measured entropy of the LP index is 5.73 bits, for all test sequences used in the experiments. Accordingly, the LP index is represented by a 6-bit fixed length code in our implementation. As the overhead bits to code the LPs be-come significant at low bitrates, it would be useful to choose different LPs at different

31

QPs and different bitrates. Similarly, an indication of available LP can also be coded and transmitted in the slice header in H.264/AVC framework.

The LP for the luma component can be used to derive the corresponding LPs for chroma components. Consider the 4:2:0 chroma format as an example, LPs for chroma components can be derived from the LP for the luma component as follows:

1)1(,1)1( >>+∆=∆>>+∆=∆ LCLC yyxx . (5.4)

where ΔxL, ΔyL are the LPs for the luma component, while ΔxC, ΔyC are the LPs for the chroma components.

5.1.3 FILTERING OF SVT BLOCK BOUNDARIES

Due to the coding (transform and quantization) of the selected SVT block, coding ar-tifacts may appear around its boundary with the remaining non-coded part of the mac-roblock. A deblocking filter can be designed and applied to improve both the subjective and objective quality. The basic idea is to adjust the deblocking filter to different possi-ble locations of the SVT block. An example in the framework of H.264/AVC is described in detail in [P6].

5.2 IMPLEMENTING SPATIALLY VARYING TRANSFORM IN THE H.264/AVC FRAMEWORK

The proposed technique is implemented and tested in the H.264/AVC framework. Figure 14 shows the block diagram of the extended H.264/AVC coder with SVT, com-pared to the original H.264/AVC coder shown in Figure 2. The encoder first performs motion estimation and decides the optimal mode to be used, and then searches for the best LPs (illustrated as “SVT Search” in the diagram) if SVT is used for this macroblock. The encoder then calculates the RD cost for using SVT using (5.3), and if lower RD cost can be achieved by using SVT then that macroblock is selected to be coded with SVT. For macroblocks that use SVT, the LPs are coded and transmitted in the bitstream when there are non-zero transform coefficients of the selected SVT block. The LPs for the mac-roblocks that use SVT are decoded in the box marked as “SVT L.P. Decoding” in the dia-gram. However, the normal criteria, e.g., sum of absolute differences (SAD) or sum of squared differences (SSD), which is used in these encoding processes for macroblocks that do not use SVT, may not be optimal for macroblocks that use SVT. Better encoding algorithms are under study.

32

Coder Control

Transform/Quant.



Motion Compensation

Entropy Coding

Motion Estimation

Deblocking Filter

+

Frame Buffer

+

Control Data

SVT L.P.

Output Video

Motion Data

-

Input Video

Output Bitstream

Intra / Inter

SVT Search

SVT Selection

SVT L.P. Decoding

SVT Selection

Decoder


FIGURE 14 BLOCK DIAGRAM OF THE EXTENDED H.264/AVC ENCODER WITH SPATIALLY VARYING TRANSFORM.

Several key parts of the H.264/AVC standard [10], for instance, macroblock types, coded block pattern (CBP), entropy coding, and deblocking, need to be adjusted. The proposed modifications aiming at good compatibility with H.264/AVC are described in [P6].

5.3 FAST ALGORITHMS FOR SPATIALLY VARYING TRANSFORM

As described above, it should be noted that even though the motion estimation, sub-macroblock partition decision process are not changed for the macroblocks that use SVT, the encoding complexity of SVT is higher due to the brute force search process in RDO. Thus, for application cases that have strict requirement of encoding complexity, fast algorithm of SVT should be developed and used.

In this section, a simple yet efficient fast algorithm operating on macroblock and block level is proposed to reduce the encoding complexity of SVT. The basic idea to re-duce the encoding complexity of SVT is to reduce the number of candidate LPs tested in RDO. The proposed fast algorithm first skips testing SVT for macroblocks for which SVT is unlikely to be useful by examining the RD cost of macroblock modes without SVT on a macroblock level. For the remaining macroblocks that SVT may be useful, the proposed fast algorithm selects available candidate LPs based on the motion difference and utiliz-

33

es a hierarchical search algorithm to select the best available candidate LP on a block level. The macroblock level and block level algorithms are described in the following sections.

5.3.1 MACROBLOCK LEVEL FAST ALGORITHM

The basic idea of macroblock level fast algorithm is to skip testing SVT for mac-roblock modes for which SVT is unlikely to be useful. This is done by examining the RD cost of macroblock modes that do not use SVT and are already available prior to SVT coding. In the proposed implementation, SVT is only applied for the macroblock modes in RDO process, when the following two criteria are met:

intraskipinter ),min( JJJ ≤ , (5.5)

thJJJ +≤ ),min( skipintermode (5.6)

where Jinter and Jintra are the minimum RD cost of available inter and intra macroblock modes in regular (without SVT) coding, respectively, Jskip is the RD cost of the SKIP cod-ing mode and Jmode is the RD cost of the current macroblock mode to be tested with SVT (e.g If SVT is being tested for INTER_16×8 mode, then Jmode refers to RD cost of the regu-lar INTER_16×8 without SVT). The threshold th in (5.6) represents the empirical upper limit of bit-rate reduction when SVT is used assuming the reconstruction of macroblock remains the same when SVT is not used. It is calculated using (5.7),

)0,2/23max( QPth −⋅= λ , (5.7)

where λ is the Lagrangian multiplier used in (5.3) and the max function returns the max-imum value of its arguments. More details can be found in [P6].

5.3.2 BLOCK LEVEL FAST ALGORITHM

The proposed block level fast algorithm includes two steps: the selection of available candidate LPs based on the motion difference in the first step and a hierarchical search algorithm in the second step. These two steps are described in detail next.

1) Selection of available candidate LPs based on motion difference: As described in sec-tion 5.1.1, SVT is used for 16×16, 16×8, 8×16 and 8×8 macroblock partitions. One straightforward approach to reduce the encoding complexity is to restrict the trans-form block in a candidate SVT block to be inside the same motion compensation block boundary, since applying the transform across motion compensation block boundaries is usually considered not efficient. However, some penalty was observed in coding efficiency for sequences with slow motion and rich detail, for instance,

34

Preakness and Night. Taking this into account, a more general approach is used in this work, by skipping the testing of candidate LPs when and only when the trans-form block(s) in the SVT block at that position overlaps with different motion com-pensation blocks in which the motion difference is significant according to a certain criterion. In order to measure the motion difference we use a similar method to the one used in deriving the boundary strength parameter in deblocking filter of H.264/AVC [10][20]. Specifically, in our implementation, we skip testing a candidate LP and mark it unavailable if at least one of the following conditions is true:

• If the transform applied to the SVT block at that position overlaps with at least two neighboring motion compensation blocks and the motion vectors of these blocks are larger than or equal to a pre-defined threshold which is set to be one in-teger pixel in this work.

• If the transform applied to the SVT block at that position overlaps with at least two neighboring motion compensation blocks and the reference frames of these two neighboring blocks are different.

Since the number of available candidate LPs varies from one macroblock to an-other and the information to derive the available candidate LPs can be obtained both by the encoder and the decoder, the index of the selected LP is coded as fol-lows. Assume there are N (N>0) available candidate LPs and the index of the select-ed candidate is n (0≤n<N), then it is coded as

=−−=

−<+==

otherwiseNLNnV

NnifNLnVN

N

,log),2(

)2(2,1log,

2log

log2

2

2

(5.8)

where V represents the binary value of the coded bit string and L represents the number of bits coded. This is a near fixed-length code which is chosen based on the observation that all available candidate LPs are equally likely to be used.

This approach shows a stable coding efficiency for sequences with different char-acteristics and achieves similar gain over a wide range of test sets compared to the original algorithm without using the information of motion difference. The addition-al complexity introduced by this approach is marginal when it is carefully imple-mented because: a) the decision is only conducted when necessary and can be skipped for macroblock modes with 16×16 partition and some LPs, e.g. the LP (0,0) for 8×8 SVT; b) the decision is simple and only uses the motion vector and the refer-ence frame information; c) generally several candidate LPs representing spatially consecutive blocks can be marked available or unavailable at the same time in one decision.

2) Hierarchical search algorithm: The basic idea of the hierarchical search algorithm is

35

similar to that of the motion estimation algorithm, i.e., to first find the best LP in a relatively coarse resolution and then refine the results in a finer resolution. In our implementation, we define the candidate LP in coarser resolution as key candidate LP, which are marked as squares in Figure 15. Other candidate LP is also called as non-key candidate LP, which are marked as circles in Figure 15. The hierarchical search algorithm can be summarized as follows.

1. Let Φ1 denote the set of all available key candidate LPs. Select the best one in Φ1 with the lowest RD cost and let Φ2 denote its available neighboring candidate LPs which are marked as triangles in Figure 15.

2. The key LPs are divided into 14 LP zones as shown in Figure 15. Select the best LP zone which is available and has the lowest RD cost. A LP zone is available if and only if all three key candidate LPs in that zone are available and the RD cost of an LP zone is defined to be the sum of the RD costs of the three key candidate LPs in that zone. Let Φ3 denote the additional available non-key candidate LPs which are inside the best LP zone (marked as stars in Figure 15).

3. Select the best LP which has the lowest RD cost among all the candidates in Φ1, Φ2 and Φ3.

Zone 0 Zone 1

Zone 2Zone 3

Zone

7

Zone 5 Zone 4

Zone

6

Zone 8Zone 9

Zone 10

Zone 11 Zone 12 Zone 13

8x8 SVT 16x4 SVT 4x16 SVT

Best key LP

Best LP zone

FIGURE 15 ILLUSTRATION OF HIERARCHICAL SEARCH ALGORITHM.

5.4 EXPERIMENTAL RESULTS

We implemented the proposed algorithm VBSVT and its fast version with the pro-posed fast algorithm, which we denote as FVBSVT, in KTA1.8 reference software [22] in order to evaluate their effectiveness. The most important coding parameters used in the experiments are listed below:

• High Profile

36

• QPI=22, 27, 32, 37, QPP=QPI+1, fixed QP is used and rate control is disabled accord-ing to the common conditions that are set up by ITU-T VCEG and ISO-IEC/MPEG com-munity for coding efficiency experiments [23].

• CAVLC/CABAC is used as the entropy coding

• Frame structure is IPPP, 4 reference frames

• Motion vector search range: ±64/±32 pels for 720p/CIF sequences, resolution ¼-pel

• RDO in the “High Complexity Mode”

Two configurations are tested. 1) Low complexity configuration: motion estimation block sizes are 16×16, 16×8, 8×16, 8×8, and 4×4 transform is not used for macroblocks coded with regular transform or SVT. This represents a low complexity codec with most effective tools for HD video coding; 2) High complexity configuration: motion estimation block sizes are 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, 4×4, and 4×4 transform is also used as an optional transform for both macroblocks coded with regular transform or SVT. This represents a high complexity codec with full usage of the tools provided in H.264/AVC.

We measure the average bitrate reduction (ΔBD-RATE) compared to H.264/AVC us-ing Bjontegaard tool [24] for both low complexity and high complexity configurations. The results are shown in Table 2. The results of 8×8 SVT and 16×4+4×16 SVT are also shown for comparison.

Figure 16 shows the R-D curves for sequences Night and Panslow. We note that the gain of VBSVT comes at medium to high bitrates for many tested sequences and can be much more significant than the average gain over all bitrates reported above. This is true for most test sequences used in our experiment.

37

Night

30

31

32

33

34

35

36

37

38

39

40

41

0 5000 10000 15000 20000 25000

Bitrate (kbit/s)

PSN

R (d

B)

H.264 (low complexity)H.264 (low complexity)+ VBSVTH.264 (high complexity)H.264 (high complexity)+ VBSVT

Panslow

31

32

33

34

35

36

37

38

39

0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000

Bitrate (kbit/s)

PSN

R (d

B)

H.264 (low complexity)

H.264 (low complexity) + VBSVT

H.264 (high complexity)

H.264 (high complexity) + VBSVT

FIGURE 16 RATE-DISTORTION CURVES FOR NIGHT SEQUENCE (CABAC) AND PANSLOW SEQUENCE (CAVLC).

We also measure the percentage of 8×8 SVT and 16×4+4×16 SVT selected in VBSVT. The results are shown in Table 3. We can see that in VBSVT scheme, 16×4+4×16 SVT are used generally more often than 8×8 SVT. However, 8×8 SVT is also used in a significant portion of the cases.

In Figure 17 and Figure 18, we show the distribution of the macroblocks coded with SVT and corresponding SVT blocks in a prediction error frame for Night and Sheriff se-quences, respectively. It can be observed that, as expected, SVT is mostly useful for mac-roblocks where residual signals of larger magnitude are likely to gather and locate at certain regions, which frequently happens, as also observed by other researchers [26].

We also view the reconstructed video sequences in order to analyze the impact of SVT on subjective quality. We conclude that using SVT does not introduce any special

38

visual artifacts and the overall subjective quality remains similar, while the coded bits are reduced.

Similarly, we measure the average bitrate reduction (ΔBD-RATE) of FVBSVT com-pared to H.264/AVC, We also measure the average reduction of candidate LPs tested in RDO over all the tested QPs using FVBSVT. The results are also shown in Table 3.

We can see that with the proposed fast algorithm we can reduce by a little more than 80% the number of candidate LPs tested in RDO while retaining most of the coding effi-ciency, for both low and high complexity configurations. This means that on average about 10-11 candidate LPs are tested in RDO (each RD cost calculation includes trans-form, quantization, entropy coding, inverse transform, inverse quantization etc) when FVBSVT is used. We also measured the execution time of FVBSVT to estimate the compu-tational complexity. Compared to H.264/AVC, the encoding and decoding time of FVBSVT increases on average 2.49% and 2.05%, respectively, over 720p test set and all the tested QPs in high complexity configuration. The encoding and decoding time in-crease vary for different test sequences and detailed results are shown in Table 4.

TABLE 2 AVERAGE RESULTS OF ΔBD-RATE COMPARED TO H.264/AVC

Average Results 8×8 SVT 16×4+4×16 SVT

VBSVT FVBSVT % Reduction of candidate LPs

Average (CAVLC, 720p, low complexity configuration) -2.64% -3.65% -4.11% -3.69% 83.4

Average (CAVLC, CIF, low complexity configuration) -2.44% -3.68% -4.14% -3.55% 83.6

Average (CAVLC, 720p, high complexity configuration) -1.42% -2.33% -2.43% -2.34% 81.1

Average (CAVLC, CIF, high complexity configuration) -1.28% -1.88% -2.08% -1.72% 81.4

Average (CABAC, 720p, low complexity configuration) -4.29% -4.93% -5.85% -5.31% 84.3

Average (CABAC, CIF, low complexity configuration) -3.04% -3.87% -4.66% -3.94% 83.6

Average (CABAC, 720p, high complexity configuration) -3.26% -3.59% -4.40% -4.03% 82.8

Average (CABAC, CIF, high complexity configuration) -1.52% -2.03% -2.45% -2.01% 81.9

TABLE 3 AVERAGE RESULTS OF PERCENTAGE OF 8×8 SVT AND 16×4+4×16 SVT SELECTED IN VBSVT FOR DIFFERENT SEQUENCES

Average Results Low complexity High complexity

39

configuration configuration

8×8 SVT 16×4+4×16 SVT 8×8 SVT 16×4+4×16 SVT

Average (CAVLC, 720p) 45.8% 54.2% 43.9% 56.1%

Average (CAVLC, CIF) 44.4% 55.6% 43.1% 56.9%

Average (CABAC, 720p) 50.2% 49.8% 52.3% 47.7%

Average (CABAC, CIF) 47.5% 52.5% 47.2% 52.8%

TABLE 4 ENCODING/DECODING TIME INCREASE OF FVBSVT COMPARED TO H.264/AVC (HIGH COMPLEXITY CONFIGURATION) (CABAC, 720P)

Sequence % Encoding time increase of FVBSVT % Decoding time increase of FVBSVT

BigShips 2.88 1.79

ShuttleStart 1.16 0.63

City 2.94 3.85

Night 2.62 2.08

Optis 2.18 2.29

Spincalendar 2.74 2.07

Cyclists 1.90 4.40

Preakness 3.71 1.89

Panslow 2.17 0.49

Sheriff 2.41 0.99

Sailormen 2.64 2.07

Average 2.49 2.05

40

(A)

(B)

FIGURE 17 (A) DISTRIBUTION OF THE MACROBLOCKS CODED WITH SVT, (B) CORRESPONDING SVT BLOCKS FOR 32ND PREDICTION ERROR FRAME. (NIGHT SEQUENCE)

41

(A)

(B)

FIGURE 18 (A) DISTRIBUTION OF THE MACROBLOCKS CODED WITH SVT, (B) CORRESPONDING SVT BLOCKS FOR 65TH PREDICTION ERROR FRAME. (SHERIFF SEQUENCE)

43

CHAPTER 6 PREDICTION SIGNAL AIDED SPATIALLY VARYING TRANSFORM

This chapter is based on Publications [P7] and dedicated to the design and implementa-tion of prediction signal aided spatially varying transform (PSASVT), which reduces the encoding complexity and improves the coding efficiency of the original SVT algorithm with only a little increase in the decoding complexity. AS mentioned in chapter 4, the en-coding complexity for SVT is increased because the encoder needs to search for the best LP among a number of candidate LPs. The basic idea to reduce the encoding complexity of SVT is to reduce the number of candidate LPs tested in RDO. In this chapter, we pro-pose to select candidate LPs based on the prediction signal. The motivations leading to this design are two-fold:

1. Statistics show that normally the selected SVT block locates at positions of the macroblock where the prediction error has higher magnitude [P5][P6]. These positions are probably boundary positions [30][38]. One example is illustrat-ed in Figure 19 below [P5].

2. The gradient of the prediction signal is assumed to be positively correlated with the magnitude of the prediction error, i.e., a larger gradient indicates a higher prediction error magnitude. In other words, it is probable that when the gradient of the prediction signal is high, the texture of the original signal is more complex and hard to predict well thus the prediction error has a rela-tively high magnitude. A similar assumption is also used in [31].

FIGURE 19 DISTRIBUTION OF (ΔX, ΔY) OF 8×8 SVT FOR BIGSHIPS SEQUENCE [P5].

44

6.1 IMPLEMENTATION OF PREDICTION SIGNAL AIDED SPATIALLY VARYING TRANSFORM

From the above motivations, we may utilize the gradient information of the predic-tion signal to select the candidate LPs. Corresponding to (5.3), a general way would be to use an estimated RD cost of the whole macroblock for each candidate LP from the pre-diction signal, which can be done by minimizing the following:

e e e eJ D Rλ= + (6.1)

where Je is the estimated RD cost of the selected configuration, De is the estimated distor-tion which can be estimated from the gradient outside the SVT block (i.e., assume that there is no or relatively negligible quantization effect etc inside the SVT block and thus no or relatively negligible distortion2 introduced therein), Re is the estimated rate which can be obtained from the gradient inside the SVT block and λe is the Lagrangian multipli-er used in this case which can be experimentally derived using the method in [32][33]. It turns out that for each candidate LP, the RD cost Je depends only on Gsvt which is the gra-dient of the prediction signal inside the SVT block [P7]. In our implementation, we use the sum of the gradient amplitude of the prediction signal inside the SVT block as an es-timated RD cost. This is found experimentally efficient in terms of tradeoff of complexity and performance.

Based on the above analysis, the proposed algorithm PSASVT is summarized below:

Step 1: Calculate the gradient map of the prediction signal of the current macroblock. In our implementation, we use Sobel operator, which is considered widely to be a good edge detector, to calculate the gradient as follows.

( , ) ( 1, 1) 2 ( , 1) ( 1, 1) ( 1, 1) 2 ( , 1) ( 1, 1)

( 1, 1) 2 ( 1, ) ( 1, 1) ( 1, 1) 2 ( 1, ) ( 1, 1)

G x y S x y S x y S x y S x y S x y S x y

S x y S x y S x y S x y S x y S x y

= − − + − + + − − − + − + − + +

+ − − + − + − + − + − − + − + + (6.2)

where G(x,y) is the gradient amplitude and S(x,y) is the prediction signal value at posi-tion (x,y).

Step 2: If the gradient map is all-zero map, then in this case we assume the prediction quality is very good and thus we only try 8 corner positions at which the prediction er-ror is assumed to be larger than other positions, i.e., the set Φc= Φ8×8’’+Φ16×4’+Φ4×16’, where Φ8×8’’={(Δx, Δy), Δx=0, Δy=0; Δx=8, Δy=0; Δx=0, Δy=8; Δx=8, Δy=8;}; Φ16×4’={Δy, Δy=0, 12}; Φ4×16’={Δx, Δx=0, 12}; then goto step 5.

2 This rough assumption is based on the following: 1. the SVT block is coded and the remaining part of the macroblock is not coded; 2. the size of the SVT block is only 1/3 of the size of the remaining non-coded part of the macroblock.

45

Step 3: Otherwise, for each candidate LP, calculate the sum of gradient (denoted by a variable SoG) of the prediction signal inside the SVT block at that position.

Step 4: Select some candidate LPs to test in RDO following these two sub-steps:

(a): Calculate a threshold SoGt as follows:

SoGt = (SoGmax+SoGmin+1)>>1; (6.3)

where SoGmax and SoGmin are the maximum and minimum value of SoG among all the candidate LPs, respectively. Generally, the calculation of the threshold SoGt can be de-pendent on the statistics of SoG of all the candidate LPs, and/or other characteristics of the current (and neighboring) macroblock. A larger threshold SoGt can reduce more candidate LPs tested in RDO, but on the other hand may degrade the coding efficiency because the selected candidate LPs may not be among the best ones in terms of rate dis-tortion tradeoff, and vice versa. Eq. (6.3) for calculating the threshold SoGt is derived ac-cording to our experience which shows a good tradeoff between performance and com-plexity.

(b): A candidate LP is selected to be tested if SoG for this candidate LP is larger than or equal to SoGt.

Besides the above method we also tried the following variations to select the candi-date LPs to test and the results are also given for comparison. The basic idea here is to utilize different statistics of SoG of all the candidate LPs to derive the threshold SoGt.

(i) We calculate the average value of SoG of all the candidate LPs as the threshold:

1/

N

t ii

SoG SoG N=

=∑ (6.4)

where N is the number of all the candidate LPs. We then test the candidate LPs whose SoG is larger than or equal to SoGt.

(ii) We use the median value of SoG of all the candidate LPs as the threshold SoGt and test the candidate LPs whose SoG is larger than or equal to SoGt.

(iii) We first sort the SoG of all the candidate LPs into non-increasing order and calculate the difference of the neighboring SoG in the ordered list. Then search the largest difference from the beginning of the ordered list which for example is in the i-th (1≤i≤N, where N is the number of all the candidate LPs) position of the list. Set the threshold SoGt to be SoG in the i-th position of the ordered list. In other words, we use the largest difference of ordered SoG as a criterion to derive the threshold SoGt.

Step 5: End.

46

6.2 EXPERIMENTAL RESULTS

We implemented the proposed idea PSASVT on Tandberg, Ericsson and Nokia test model (TENTM) [34][35] in order to evaluate its effectiveness. TENTM was a joint pro-posal to the High-Efficiency Video Coding (HEVC) standardization effort, which achieves similar visual quality measured using Mean Opinion Score (MOS) to H.264/AVC High Profile anchors, and has a better tradeoff between complexity and coding efficiency than H.264/AVC Baseline Profile [34][35]. Because of this, it is selected as the platform in our experiments. The important coding parameters are as follows:

• Quad-tree based coding structure with a support for macroblocks of size 64×64, 32×32 and 16×16 pixels;

• Frame structure is IPPP;

• QPI=11, 16, 21, 26, QPP=QPI+2;

• Low complexity variable length coding scheme with improved context adapta-tion [34][35];

• The proposed algorithm is only used for 16×16 motion compensation partition. For other motion compensation partitions, a previously developed algorithm of se-lecting available candidate LP based on motion difference [P5][P6] is used.

The anchor is the TENTM when SVT is turned off. We measure the average bitrate reduction (ΔBD-RATE) compared to the anchor according to Bjontegaard metric [24], recommended by the VCEG, which is an average difference between two RD curves. The set of sequences tested in this chapter corresponds to the Joint Collaborative Team on Video Coding (JCT-VC) test set [36] extended by 6 additional sequences. The experi-mental results are summarized in Table 5 and Table 6. The detailed results are reported in [P7].

TABLE 5 EXPERIMENTAL RESULTS OF PSASVT

Sequences SVT PSASVT_MID PSASVT_AVG PSASVT_MED PSASVT_DIFF

BD-rate

(%)

BD-rate (%)

Reduction of LPs

tested in RDO (%)

BD-rate (%)

Reduction of LPs

tested in RDO (%)

BD-rate (%)

Reduction of LPs

tested in RDO (%)

BD-rate (%)

Reduction of LPs

tested in RDO (%)

Avg 2560x1600 -2.28 -2.40 -22.50 -2.41 -19.90 -2.39 -18.67 -2.31 -25.39

Avg 1080p -2.18 -2.31 -21.93 -2.31 -19.67 -2.27 -18.64 -2.30 -24.54

Avg 720p -3.47 -3.66 -20.95 -3.61 -18.58 -3.57 -18.21 -3.66 -23.74

Avg 832x480 -2.58 -2.72 -22.09 -2.69 -19.72 -2.67 -18.65 -2.66 -24.84

47

Avg 416x240 -2.20 -2.28 -21.90 -2.27 -19.47 -2.26 -18.40 -2.26 -24.66

Avg 720p extra -2.58 -2.93 -21.20 -2.84 -19.07 -2.88 -16.13 -2.81 -24.14

Avg all -2.52 -2.70 -21.70 -2.67 -19.38 -2.66 -17.92 -2.65 -24.46

TABLE 6 PERCENTAGE OF MACROBLOCKS CODED IN SVT FOR DIFFERENT SEQUENCES

Sequences SVT PSASVT_MID PSASVT_AVG PSASVT_MED PSASVT_DIFF

All parti-tions (%)

16×16 parti-tion (%)

All parti-tions (%)


All parti-tions (%)


All parti-tions (%)


All parti-tions (%)


Avg 2560x1600 9.58 4.73 9.97 4.81 9.96 4.84 9.97 4.86 9.62 4.16

Avg 1080p 7.25 4.02 7.44 3.81 7.40 3.92 7.41 3.94 7.18 3.40

Avg 720p 3.65 2.08 3.87 2.29 3.83 2.25 3.81 2.22 3.85 2.16

Avg 832x480 11.20 6.06 11.58 6.24 11.54 6.19 11.51 6.16 11.37 5.50

Avg 416x240 12.87 6.53 13.30 6.59 13.23 6.57 13.28 6.59 13.11 5.90

Avg 720p extra 8.87 4.97 9.28 5.14 9.22 5.11 9.21 5.10 9.02 4.51

Avg all 8.99 4.83 9.33 4.93 9.28 4.91 9.28 4.90 9.11 4.35

We have tested four variations of the proposed algorithm, namely: (i) PSASVT_MID: use (6.3) to calculate the threshold SoGt; (ii) PSASVT_AVG: use (6.4) to calculate the threshold SoGt; (iii) PSASVT_MED: use the method as explained in step 3 (ii) of the algo-rithm to calculate the threshold SoGt; (iv) PSASVT_DIFF: use the method as explained in step 3 (iii) of the algorithm to calculate the threshold SoGt. From the experimental re-sults, we can see that encoding complexity reduction and slight coding efficiency im-provement are achieved for all four cases. In a general sense, PSASVT_MID performs bet-ter than PSASVT_AVG and PSASVT_MED, and can reduce on average 21.70% of the can-didate LPs tested in RDO, while also achieves on average 0.18% bitrate reduction. PSASVT_DIFF can reduce most of the candidate LPs tested in RDO (on average 24.46%) among all the cases, while achieves slightly less bitrate reduction (on average 0.13%). The complexity of PSASVT_DIFF is higher than others mainly because it needs to sort the SoG of all the candidate LPs into non-increasing order and calculate the difference of the neighboring SoG in the ordered list. On the contrary, the complexity of PSASVT_MID could be (one of) the lowest among all four cases.

For all the cases, the reduction in encoding complexity and slight coding efficiency improvement are achieved with only a little complexity increase in decoder. The decod-ing complexity increase is mainly due to the calculation of the gradient map of the mac-roblock and the sum of gradient of the prediction signal inside the SVT block for each candidate LP. This needs only to be conducted when 16×16 motion compensation parti-tion and SVT are used, which is typically only a small percent (less than 5% on average)

48

of all the macroblocks in a sequence. The detailed results of the percentages of mac-roblocks coded in SVT for different sequences in each case for all partitions and for only 16×16 partition are given in [P7].

49

CHAPTER 7 CONCLUSION

This thesis presented novel methods for improving the compression capability of video coders and reducing their implementation complexities. The goal of this thesis is to de-velop novel algorithms, which could be useful in video coding standards to improve the coding efficiency of the video codecs used in the future mobile video applications.

In the first part of the thesis we discussed efficient fixed-point approximations of the 8x8 inverse discrete cosine transform [P1][P2][P3]. Fixed-point design methodologies and several resulting implementations of the inverse discrete cosine transform are pre-sented. The overall architecture used in the design of the ISO/IEC 23002-2 IDCT algo-rithm, which can be characterized by its separable and scaled features, is also described. The drift analysis for integer IDCT is also given.

The second part of the thesis discusses pre-scaled integer transform (PIT) [P4] which reduces the implementation complexity of the conventional integer cosine trans-form (ICT) and maintains all the merits such as bit-exact implementation and good cod-ing efficiency. Design rules that lead to good PIT kernels are developed and different types of PIT and their target applications are examined. The PIT kernels used in Audio Video coding Standard (AVS), the Chinese National Coding standard, are also introduced.

In the third part of the thesis we discussed a novel algorithm called spatially varying transform (SVT) [P5][P6] which improves the coding efficiency of video coders. SVT en-ables video coders to vary the position of the transform block, unlike state-of-art video codecs where the position of the transform block is fixed. In addition to changing the po-sition of the transform block, the size of the transform can also be varied within the SVT framework, to better localize the prediction error so that the underlying correlations are better exploited. It is shown that by varying the position of the transform block and its size, characteristics of prediction error are better localized, and the coding efficiency is thus improved. We show that the proposed algorithm achieves 5.85% bitrate reduction compared to H.264/AVC on average over a wide range of test set. The gain in coding effi-ciency is achieved with a similar decoding complexity which makes the proposed algo-rithm easy to be incorporated in video codecs. However, the encoding complexity of SVT can be relatively high because of the need to perform a number of rate distortion opti-mization (RDO) steps to select the best location parameter (LP), which indicates the po-sition of the transform. A novel low complexity algorithm operating on a macroblock and a block level is also proposed to reduce the encoding complexity of SVT [P6]. Exper-imental results show that the proposed low complexity algorithm can reduce the num-ber of LPs to be tested in RDO by about 80% with only a marginal penalty in the coding efficiency.

50

An extension of SVT, called Prediction Signal Aided Spatially Varying Transform (PSASVT) [P7] is also presented in the thesis, which utilizes the gradient of prediction signal to eliminate the unlikely LPs. As the number of candidate LPs is reduced, a smaller number of LPs are searched by encoder, which reduces the encoding complexity. In ad-dition, less overhead bits are needed to code the selected LP and thus the coding effi-ciency can be improved. Experimental results show that the number of LPs to be tested in RDO can be reduced on average by up to nearly 30%. This reduction in encoding complexity is achieved with a slight increase in coding efficiency, as the number of can-didate LPs is reduced. The decoding complexity increase is only small.

At the time of writing the thesis, VCEG and MPEG, the two international standardiza-tion organizations have joined and formed the Joint Collaborative Team on Video Coding (JCT-VC). JCT-VC embarked on a new project to develop the next generation video cod-ing standard called High Efficiency Video Coding (HEVC). Many of the techniques pre-sented in the thesis can be promising and worth further study in the HEVC framework. Especially, the proposed SVT algorithm can potentially further improve the coding effi-ciency of the video codecs with acceptable implementation complexity increase. It should be noted that SVT can also be potentially used in many other situations, for in-stance, in inter-layer prediction in scalable video coding (SVC) and inter-view prediction in multiview video coding (MVC). In addition to traditional 2-D video coding, 3D video coding can also be a new area for studying the proposed algorithm.

51

REFERENCES

[1] ISO/IEC JTC1/SC29/WG11, Vision, Applications and Requirements for High-Performance Video Coding, MPEG Document N11096, January 2010.

[2] ISO/IEC 14496–10:2003, “Coding of Audiovisual Objects—Part 10: Advanced Vid-eo Coding,” 2003, also ITU-T Recommendation H.264 “Advanced video coding for generic audiovisual services.”

[3] I. Richardson, H.264 and MPEG4 Video Compression, Chichester, England: John Wiley & Sons, 2003.

[4] N. Cho and S. Lee, “Fast Algorithm and Implementation of 2-D Discrete Cosine Transform,” IEEE Transactions on Circuits and Systems, vol. 38, no 3, pp. 297-305, Mar. 1991.

[5] A. Hung and T.-Y. Meng, “A Comparison of Fast DCT Algorithms,” Multimedia Sys-tems, No. 5 Vol. 2, Dec 1994.

[6] L. Yu, F. Yi, J. Dong, and C. Zhang, “Overview of AVS-Video: Tools, performance and complexity,” in Proc. VCIP, Beijing, China, Jul. 2005, pp. 679–690.

[7] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockham-mer, T. Wedi, “Video Coding with H.264 / AVC: Tools, Performance, and Complexi-ty,” IEEE Circuits and Systems Magazine, vol. 4, no. 1, pp. 7-28, April, 2004.

[8] A. Robert, I. Amonou, B. Pesquet-Popescu, “Improving Intra Mode Coding in H.264/AVC through Block Oriented Transforms,”, in IEEE 8th Workshop on Multi-media Signal Processing, Oct. 2006, pp. 382-386.

[9] B. Zeng and J. Fu, “Directional discrete cosine transforms – A new framework for image coding,” IEEE Transactions on Circuits, Systems and Video Technology, vol. 18, no. 3, pp. 305-313, March. 2008.

[10] ITU-T Recommendation H.264 “Advanced video coding for generic audiovisual services,” Mar. 2005.

52

[11] A. Nosratinia, “Denoising JPEG images by re-application of JPEG,” in Proceedings of the 1998 IEEE Workshop on Multimedia Signal Processing (MMSP), Dec. 1998, pp. 611-615.

[12] R. Samadani, A. Sundararajan and A. Said, “Deringing and deblocking DCT com-pression artifacts with efficient shifted transforms,” in Proceedings of the 2004 IEEE International Conference on Image Processing (ICIP), Oct. 2004, pp. 1799-1802.

[13] J. Katto, J. Suzuki, S. Itagaki, S. Sakaida and K. Iguchi, ”Denoising intra-coded mov-ing pictures using motion estimation and pixel shift,” in Proceedings of the 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Mar. 2008, pp. 1393-1396.

[14] O. G. Guleryuz, “Weighted averaging for denoising with overcomplete dictionaries,” IEEE Transactions on Image Processing, vol. 16, no. 12, pp. 3020-3034, Dec. 2007.

[15] S. Naito and A. Koike, “Efficient coding scheme for super high definition video based on extending H.264 high profile,” SPIE Visual Communications and Image Processing, Vol. 6077, pp. 607727-1-607727-8, Jan. 2006.

[16] M. Wien, “Variable block-size transforms for H.264/AVC,” IEEE Transactions on Circuits, Systems and Video Technology, vol. 13, no. 7, pp. 604-613, Jul. 2003.

[17] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity trans-form and quantization in H.264/AVC,” IEEE Transactions on Circuits, Systems and Video Technology, vol. 13, no. 7, pp. 598-603, Jul. 2003.

[18] S. Ma and C.-C. Kuo, “High-definition video coding with super-macroblocks,” SPIE Visual Communications and Image Processing, Vol. 6508, pp. 650816-1-650816-12, Jan. 2007.

[19] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini and G. J. Sullivan, “Rate-constrained coder control and comparison of video coding standards,” IEEE Transactions on Circuits, Systems and Video Technology, vol. 13, no. 7, pp. 688-703, Jul. 2003.

[20] P. List, A. Joch, J. Lainema, G. Bjontegaard and M. Karczewicz, “Adaptive Deblocking filter,” IEEE Transactions on Circuits, Systems and Video Technology, vol. 13, no. 7, pp. 614-619, Jul. 2003.

53

[21] K. Vermeirsch, J. De Cock, S. Notebaert, P. Lambert, and R. Van de Walle, “Evalua-tion of transform performance when using shape-adaptive partitioning in video coding,” Picture Coding Symposium (PCS), May 2009.

[22] KTA reference model 1.8 http://iphome.hhi.de/suehring/tml/download/KTA/jm11.0kta1.8.zip [online]

[23] T.K. Tan, G. Sullivan and T. Wedi, “Recommended simulation common conditions for coding efficiency experiments,” VCEG Doc. VCEG-AE10, Jan. 2007.

[24] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” VCEG Doc. VCEG-M33, Mar. 2001.

[25] S. Pateux and J. Jung, “An excel add-in for computing Bjontegaard metric and its evolution,” VCEG Doc. VCEG-AE07, Jan. 2007.

[26] F. Kamisli and J. S. Lim, “Transforms for the motion compensation residual,” in Pro-ceedings of the 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2009, pp. 789-792.

[27] J. Xu, B. Zeng and F. Wu, “An Overview of Directional Transforms in Image Coding,” in IEEE International Symposium on Circuits and Systems (ISCAS), pp 3036-3039, 2010.

[28] H. Xu, J. Xu and F. Wu, “Lifting-based directional DCT-like transform for image cod-ing,” IEEE Transactions on Circuits, Systems and Video Technology, pp. 1325-1335, Oct. 2007.

[29] Y. Ye and M. Karczewicz, “Improved H.264 intra coding based on bi-directional in-tra prediction, directional transform, and adaptive coefficient scanning,” in Pro-ceedings IEEE Int. Conf. Image Processing, San Diego, USA, pp. 2116-2119, 12-15 Oct. 2008.

[30] M. T. Orchard and G. J. Sullivan, “Overlapped block motion compensation: An esti-mation-theoretic approach,” IEEE Trans. Image Processing, vol. 3, pp. 693-699, Sept. 1994.

[31] H. Schiller, “Prediction signal controlled scans for improved motion compensated video coding,” Electronics Letters, Vol. 29, Issue. 5, pp. 433-435, 1993.

http://iphome.hhi.de/suehring/tml/download/KTA/jm11.0kta1.8.zip

54

[32] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” IEEE Signal Processing Mag., vol. 15, pp. 74–90, Nov. 1998.

[33] T. Wiegand and B. Girod, “Lagrangian multiplier selection in hybrid video coder control,” in Proc. ICIP, Thessaloniki, Greece, Oct. 2001.

[34] K. Ugur, K.R. Andersson, A. Fuldseth, et al, “High Performance, Low Complexity Video Coding and the Emerging HEVC Coding Standard,” IEEE Transactions on Cir-cuits and Systems for Video Technology, Vol. 20, Issue. 12, pp. 1688-1697, Dec. 2010.

[35] K. Ugur, K.R. Andersson, A. Fuldseth, “Video coding technology proposal by Tandberg, Nokia, and Ericsson,” JCTVC-A119, Dresden, Germany, 15-23 April, 2010.

[36] ISO/IEC JTC1/SC29/WG11, “Joint Call for Proposals on Video Compression Tech-nology,” MPEG Doc. N11113, Jan. 2010.

[37] Y. Liu, Z. G. Li, and Y. C. Soh, “Region-of-Interest Based Re-source Allocation for Conversational Video Communication of H.264/AVC,” IEEE Transactions on Cir-cuits, Systems and Video Technology, vol. 18, Issue. 1, pp. 134-139, Jan. 2008.

[38] W. Zheng, Y. Shishikui, M. Naemura, Y. Kanatsugu and S. Itoh, “Analysis of space-dependent characteristics of motion-compensated frame differences based on a statistical motion distribution model,” IEEE Trans. Image Processing, vol. 11, pp. 377-386, April. 2002.

[39] W. K. Cham, “Development of integer cosine transforms by the principle of dyadic symmetry,” Proc. IEE, I, vol. 136, Issue. 4, pp. 276-282, Aug, 1989.

[40] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol, vol. 13, pp. 560-576, July 2003.

[41] G. J. Sullivan, P. N. Topiwala, and A. Luthra, “The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions,” in Proc. SPIE Int. Soc. Opt. Eng. 5558, Aug, 2004.

[42] J. Liang, T. D. Tran, “Fast multiplierless approximations of the DCT with the lifting scheme,” IEEE Trans Signal Processing, vol. 49, pp. 3032-3044, Dec 2001.

55

[43] M. Wien, S. Sun, “ICT comparison for adaptive block transform,” Doc. VCEG-L12, Eibsee, Germany, Jan, 2001.

[44] J.L. Mitchell, W.B. Pennebaker, D. LeGall, and C. Fogg, MPEG Video Compression Standard, Chapman & Hall: New York © 1997.

[45] W.B. Pennebaker and J.L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold: New York © 1993.

[46] ANSI/IEEE 1180-1990, Standard Specifications for the Implementations of 8x8 In-verse Discrete Cosine Transform, December 1990 (withdrawn by ANSI 9 Septem-ber, 2001; withdrawn by IEEE 7 February, 2003).

[47] G.J. Sullivan, “Standardization of IDCT approximation behavior for video compres-sion: the history and the new MPEG-C parts 1 and 2 standards,” in SPIE Applica-tions of Digital Image Processing XXX, Vol. 6696, August 28-31, 2007.

[48] ISO/IEC 23002-1: Information technology – MPEG video technologies – Part 1: Ac-curacy requirements for implementation of integer-output 8x8 inverse discrete co-sine transform, December 2006.

[49] M.A. Isnardi, “Description of Sample Bitstream for Testing IDCT Linearity,” Moving Picture Experts Group (MPEG) document M13375, Montreux, Switzerland, April 2006.

[50] C. Zhang and L. Yu, “On IDCT Linearity Test,” Moving Picture Experts Group (MPEG) document M13528, Klagenfurt, Austria, July 2006.

[51] ISO/IEC JTC1/SC29/WG11 (MPEG), ISO/IEC 23002-1/FPDAM1: Information tech-nology – MPEG video technologies – Part 1: Accuracy requirements for implemen-tation of integer-output 8x8 inverse discrete cosine transform. Amendment 1: Software for Integer IDCT Accuracy Testing, Moving Picture Experts Group (MPEG) output document N8981, San Jose, CA, April 2007.

[52] Y. Arai, T. Agui and M. Nakajima, “A Fast DCT-SQ Scheme for Images,” Transactions of the IEICE E71(11): 1095, November 1988. (in Japanese)

56

[53] W. Chen, C.H. Smith and S.C. Fralick, “A Fast Computational Algorithm for the Dis-crete Cosine Transform,” IEEE Trans. Comm., vol. com-25, No. 9, pp.1 1004-1009, September 1977.

[54] B.G. Lee, “A new algorithm to compute the discrete cosine transform,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp.1243-1245, May 1984.

[55] M. Vetterli and A. Ligtenberg, “A Discrete Fourier-Cosine Transform Chip,” IEEE Journal on Selected Areas in Communications, Vol. 4, No. 1, pp. 49-61, January 1986.

[56] C. Loeffler, A. Ligtenberg, and G.S. Moschytz, “Practical fast 1-D DCT algorithms with 11 multiplications,” in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. (ICASSP’89), vol. 2, pp. 988-991, February 1989.

Publications

Publication 1

Yuriy A. Reznik, Arianne T. Hinds, Cixun Zhang, Lu Yu, Zhibo Ni,

Efficient Fixed-Point Approximations of 8x8 Inverse Discrete Cosine Transform

Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, pp. 669617-1-17, San Diego, 28

Aug. 2007.

Efficient Fixed-Point Approximations of the 8x8 InverseDiscrete Cosine Transform

Yuriy A. Reznik∗, Arianne T. Hinds†, Cixun Zhang‡, Lu Yu‡, and Zhibo Ni‡

∗QUALCOMM Incorporated, 5775 Morehouse Dr., San Diego, CA 92121; USA†IBM, Incorporated, 6300 Diagonal Hwy, Boulder, CO 80301; USA

‡Zhejiang University, Hangzhou, Zhejiang, 310027; China

ABSTRACT

This paper describes fixed-point design methodologies and several resulting implementations of the InverseDiscrete Cosine Transform (IDCT) contributed by the authors to MPEG’s work on defining the new 8x8 fixedpoint IDCT standard – ISO/IEC 23002-2. The algorithm currently specified in the Final Committee Draft (FCD)of this standard is also described herein.

Keywords: DCT, IDCT, factorizations, diophantine approximations, multiplierless algorithms, MPEG, JPEG,ITU-T, H.261, H.263, MPEG-1, MPEG-2, MPEG-4

1. INTRODUCTION

The Discrete Cosine Transform (DCT)1 is a fundamental operation used by the vast majority of today’s image andvideo compression standards, such as JPEG, MPEG-1, MPEG-2, MPEG-4 (P.2), H.261, H.263, and others 2−.8

Encoders in these standards apply such transforms to each 8x8 block of pixels to produce DCT coefficients,which are then subject to quantization and encoding. The Inverse Discrete Cosine Transform (IDCT) is used inboth the encoder and decoder to convert DCT coefficients back to the spatial domain.

At the time when the first image and video compression standards were defined, the implementation of DCTand IDCT algorithms was considered a major technical challenge, and therefore, instead of defining a specificalgorithm for computing it, ITU-T H.261, JPEG, and MPEG standards have included precision specifications thatmust be met by IDCT implementations conforming to these standards.9 This decision has allowed manufacturersto use the best optimized designs for their respective platforms. However, the drawback of this approach is theimpossibility to guarantee exact decoding of MPEG-encoded videos on across different decoder implementations.

In early 2005, prompted by the expiration and withdrawal of a related IDCT precision specification (IEEEStandard 1180-19909), MPEG has decided to improve its handling of this matter, and produce

• a new precision specification ISO/IEC 23002-1, replacing the IEEE 1180 standard, and harmonizing thetreatment of precision requirements across different MPEG standards,10 and

• a new voluntary standard, ISO/IEC 23002-2, providing specific (deterministically defined) examples offixed-point 8x8 IDCT and DCT implementations.

The Call for Proposals11 for the ISO/IEC 23002-2 standard was issued at the August 2005 meeting inPoznan, Poland, and during several subsequent meetings MPEG has received a number of contributions withspecific algorithm proposals, as well as general optimization techniques, analyses of different factorizations, driftproblem, implementation studies, and other informative documents. Many of the successive submissions havebenefited from earlier ideas contributed to MPEG by other proponents, converging on key design aspects.10 The

Further author information: (please send correspondence to Yuriy A. Reznik)Yuriy A. Reznik: [email protected], phone: +1 (858) 658-1866, Arianne T. Hinds: [email protected], Cixun Zhang: [email protected], Lu Yu: [email protected], Zhibo Ni: [email protected] work of Lu Yu, Cixun Zhang, and Zhibo Ni was supported by NSFC under contract No. 60333020 and Project ofScience and Technology Plan of Zhejiang Province under contract No. 2005C1101-03.

Invited Paper

Applications of Digital Image Processing XXX, edited by Andrew G. TescherProc. of SPIE Vol. 6696, 669617, (2007) · 0277-786X/07/$18 · doi: 10.1117/12.740228

Proc. of SPIE Vol. 6696 669617-1

Committee Draft (CD) of this standard, containing a single algorithm was issued at the October 2006 meetingin Hangzhou, China. The Final Committee Draft (FCD) of this standard was reached at the April 2007 meetingin San Jose, CA,13 and the Final Draft International Standard (FDIS) is now expected to be issued in Octoberof 2007.

This paper describes fixed-point design methodologies and several resulting IDCT implementations con-tributed by the authors to this MPEG project. The algorithm currently specified in the Final CommitteeDraft (FCD) of this standard is also described.

Our paper is organized as follows. In Section 2, we provide background information, including definitionsof the DCT and IDCT, examples of their factorizations, and review of some basic techniques used for theirfixed-point implementations. In Section 2, we also explain several ideas that we have proposed for improvingperformance of fixed-point designs: introduction of floating factors between sub-transforms, the use of fastalgorithms for computation of products by groups of factors, and techniques for minimizing rounding errors inalgorithms using right shift operations. In Section 3, we show how these techniques were applied to design ourproposed IDCT approximations. Finally, in Section 4, we provide a detailed description of the algorithm in theFCD of the ISO/IEC 23002-2 standard. Appendices A and B contain supplemental information and proofs ofour claims.

2. BACKGROUND INFORMATION & MAIN IDEAS USED IN THIS WORK

2.1 Definitions

The order-8, one-dimensional (1D) type II1 Discrete Cosine Transform (DCT), and its corresponding InverseDCT (IDCT) are defined as follows:

Fu =cu

2

7∑

x=0

fx cos(2x + 1)uπ

16, u = 0, . . . , 7 , (1)

fx =7∑

u=0

cu

2Fu cos

(2x + 1)uπ

16, x = 0, . . . , 7 , (2)

where cu = 1/√

2, when u = 0, and cu = 1 otherwise.

The definitions for the two-dimensional (2D) versions of these transforms are:

Fvu =cucv

4

7∑

x=0

7∑

y=0

fyx cos(2x + 1)uπ

16cos

(2y + 1)vπ

16, v, u = 0, . . . , 7 , (3)

fyx =7∑

u=0

7∑

v=0

cucv

4Fvu cos

(2x + 1)uπ

16cos

(2y + 1)vπ

16, y, x = 0, . . . , 7 , (4)

where fy x (y, x = 0...7) denote input spatial domain values (for image and video coding – values of pixels orprediction residuals), and Fv u (u, v = 0...7) denote transform domain values, or transform coefficients. Whenused in image or video coding applications, the pixel values are normally assumed to be in the range of [−256, 255],while the transform coefficients are in the range of [−2048, 2047].

Mathematically, these are linear, orthogonal, and separable transforms. That is, a 2D transform can bedecomposed into a cascade of 1D transforms applied successively to all rows and then to all columns in thematrix. This property of separability is often exploited by implementors to mitigate the complexity of an entire2D transform into a much simpler set of 1D operations.


2.2 Precision Requirements for IDCT Implementations in MPEG and ITU-T Standards

As described previously, specifications of MPEG, JPEG, and several ITU-T video coding standards, do not requireIDCTs to be implemented exactly as specified in (4). Rather, they require practical IDCT implementations toproduce integer output values fyx that fall within certain specified tolerances from outputs of an ideal IDCTrounded to the nearest integer:

fyx = �fyx + 1/2� . (5)

The precise specification of these error tolerances, and how they are to be measured for a given IDCT im-plementation under test, is defined by the MPEG IDCT precision standard: ISO/IEC 23002-1.14 This standardincludes several tests using a pseudo-random input generator originating from the former IEEE Standard 1180-1990,9 as well as additional tests required by MPEG standards.

A summary of the error metrics defined by the IEEE 1180 | ISO/IEC 23002-1 specification is provided inTable 1. Here, the variable i = 1, . . . , Q indicates the index of a pseudo random input 8x8 matrix used in a test,and Q = 10000 (or in some tests, Q = 100000) denotes the total number of sample matrices. The tolerance foreach metric is provided in the last column of this table.

Table 1. The IEEE 1180 | ISO/IEC 23002−1 pseudo-random IDCT test metrics and conditions

Error metric Description Test condition

p = maxy,x,i |f iyx − f i

yx| Peak pixel error p � 1

dyx = 1Q

∑i f i

yx − f iyx Pixel mean error maxy,x |dyx| � 0.015

m = 164

∑y,x dy x Overall mean error |m| � 0.0015

eyx = 1Q

∑i(f

iyx − f i

yx)2 Mean square error maxy,x eyx � 0.06

n = 164

∑y,x ey x Overall mean square error n � 0.02

It should be noted that IEEE 1180 | ISO/IEC 23002 error metrics satisfy the following chain of inequalities

max eyx � max |dyx| � |m| ,max eyx � n (6)

which implies that max eyx or peak mean square error (pmse) metric is the strongest one in this set.

Among the additional (informative) tests provided in the ISO/IEC 23002-1/FPDAM1 specification,15 theso-called “linearity test”∗ is of notable interest. This test requires the reconstructed pixel values produced by anIDCT implementation under test fyx to be symmetric with respect to the sign reversal of its input coefficients:

fyx (−Fvu) = −fyx (Fvu) , Fvu =[

z, v = t , u = s0, otherwise ,

z = 1, 3, . . . , 527 ,v, u, t, s = 0, . . . , 7 .

(7)

This test was motivated by the observation that in the decoding of static regions within consecutive videoframes, the decoder will reconstruct small (zero-mean, symmetrically distributed) differences, that will normallynegate each other over time (across the sequence of frames). If the IDCT implementation does not satisfythis property (7), then the mismatch between IDCT outputs may instead accumulate, eventually producing aremarkable visible degradation in the quality of the reconstructed video.16, 17

2.3 DCT/IDCT Factorizations

Much of the original research for designing fast implementations of DCT transforms was focused on finding DCTfactorizations, resulting in the minimum number of multiplications by irrational factors. Many factorizationshave been derived by utilizing other known fast algorithms, such as the classic Cooley-Tukey FFT algorithm,or by applying systematic approaches, such as a decimation in time, or a decimation in frequency.1 The formal

∗Considering general definition of linearity f(αx+βy) = αf(x)+βf(y), for some operator f(.), this test only considersa case when α = 0, and β = −1. Therefore, this is rather a “sign-symmetry test”.


setting of this problem and an upper bound for the multiplicative complexity of transforms of orders 2n can befound in E. Feig and S. Winograd.20

In one special case of the order-8 two-dimensional DCT/IDCT, the least complex direct 2D factorizationis described by E. Feig and S. Winograd.21 Their implementation requires 96 multiplication and 454 additionoperations for the computation of the complete set of 2D outputs. The same paper further describes an efficientscaled 8x8 DCT implementation, that requires only 54 multiplication, 462 addition, and 6 shift operations.21

The latter of these transforms is refered to as a scaled transform because all of its outputs must be scaled(i.e. multiplied by fixed, possibly irrational, constants) so that each output will equate to the relative outputof a nonscaled DCT. In some applications, such as JPEG, and several video coding algorithms, this process ofscaling can be implemented jointly with the process of quantization (by factoring together the scale constantswith the corresponding quantization values), thereby resulting in significant computational savings.

Some of the most efficient and well-known 1D DCT factorizations include the scaled factorization of Y. Arai,T. Agui and M. Nakajima (AAN),25 (only 5 multiplication and 29 addition operations), and the non-scaledfactorizations of W. Chen, C.H. Smith and S.C. Fralick,23 B.G. Lee,24 M. Vetterli and A. Ligtenberg (VL),26

and C. Loeffler, A. Ligtenberg and G. Moschytz (LLM).27 The VL and LLM algorithms are the least complexamong known non-scaled designs, and require only 11 multiplication and 26 addition operations.

We note that the suitability of each of these factorizations to the design of fixed point IDCT algorithms hasbeen extensively analyzed in the course of work for the ISO/IEC 23002-2 standard.28–31

2.4 Fixed-Point Approximations

As described previously, implementations of the DCT/IDCT require multiplication operations with irrationalconstants (i.e. the cosines). Clever factorizations can only reduce the number of such “essential” multiplications,but not eliminate them altogether. Hence, in the design of implementations of the DCT/IDCT, one is usuallytasked with finding ways of approximately computing products of these irrational factors by using fixed-pointarithmetic.

One of the most common and practical techniques for converting floating-point to fixed-point values is basedon the approximations of irrational factors αi by dyadic fractions:

αi ≈ ai/2k , (8)

where both ai and k are integers. In this way, multiplication of x by factor αi permits the implementation of avery simple approximation in integer arithmetic as follows:

xαi ≈ (x ∗ ai) >> k ; (9)

where >> denotes the bit-wise right shift operation.

In some transform designs, right shift operations in approximations (9) can be delayed to later stages of theimplementation, or done at the very end of the transform, but the more complex operations, such as multiplica-tions for each non-trivial constant αi still need to be performed in the algorithm.

The key variable that affects the precision and complexity of these dyadic rational approximations (8) is thenumber of precision bits k. In software designs, this parameter is often constrained by the width of registers(e.g. 16 or 32) and the consequence of not satisfying such a design constraint can easily result in the doublingof execution time for the transform. In hardware designs, the parameter k affects the number of gates needed toimplement adders and multipliers. Hence, one of the basic goals in fixed point designs is to minimize the totalnumber of bits k, while maintaining sufficient accuracy of approximations.

2.5 Improving Precision of Dyadic Rational Approximations

Without placing any specific constraints on values for αi, and assuming that for any given k, the correspondingvalues of nominators ai are chosen such that:

∣∣αi − ai/2k∣∣ = 2−k

∣∣2kαi − ai

∣∣ = 2−k minz∈Z

∣∣2kαi − z∣∣ ,


we can conclude that the absolute error of approximations in (8) should be inversely proportional to 2k:∣∣αi − ai/2k

∣∣ � 2−k−1 .

That is, each extra bit of precision (i.e. incrementing k), should reduce the error by half.

Nevertheless, it turns out that this rate can be significantly improved if the values α1, . . . , αn that we aretrying to approximate can be simultaneously scaled by some additional parameter ξ.

We claim the following (the proof for which is provided in Appendix A):

Lemma 2.1. Let α1, . . . , αn be a set of n irrational numbers (n � 2). Then, there exist infinitely many n + 2-tuples a1, . . . , an, k, ξ, with a1, . . . , an ∈ Z, k ∈ N, and ξ ∈ Q, such that

max{∣∣ξ α1 − a1/2k

∣∣ , . . . ,∣∣ξαn − an/2k

∣∣} <n

n + 1ξ−1/n 2−k(1+1/n) . (10)

In other words, if the algorithm can be altered such that all of its irrational factors α1, . . . , αn can be pre-scaled by some parameter ξ, then we might be able to find approximations whose absolute error decreases asfast as 2−k(1+1/n). For example, when n=2, this means 50% higher effectiveness in the usage of bits. For largesets of factors α1, . . . , αn, however, this gain will be smaller.

These observations suggest that we can significantly improve the precision of a fixed-point IDCT design bysplitting it into a set of smaller blocks (or sub-transforms) with alterable common factors, and then adjust thesefactors such that they yield high-accuracy solutions predicted by Lemma 2.1.

2.6 Reducing Complexity of Multiplications

The dyadic approximations shown in (8, 9) already reduce the problem of computing products by irrational con-stants to multiplications by integers. However, integer multiplications can still be computationally “expensive”to use on many existing platforms, and in such cases it becomes desirable to find ways to compute these productswithout using general purpose multipliers.

To illustrate this idea, consider a multiplication by an irrational factor 1/√

2, using its 5-bit dyadic approxi-mation: 23/32. By looking at the binary bit pattern of 23 = 10111 and substituting each “1” with an additionoperation, we can compute a product by 23 as follows:

x ∗ 23 = (x << 4) + (x << 2) + (x << 1) + x .

This approximation requires 3 addition and 3 shift operations. By further noting that the last 3 digits form aseries of “1”s, we can instead use:

x ∗ 23 = (x << 4) + (x << 3) − x . (11)

which reduces the complexity to just 2 shift and 2 addition operations.

In engineering literature, the sequences of operations “+” associated with isolated digits “1”, or “+” and“-” associated with beginnings and ends of runs “1 . . . 1” are commonly referred to as a “Canonical SignedDigit” (CSD) decomposition.34 This is a well known and frequently used tool in the design of multiplierlesscircuits.38 However, CSD decompositions do not always produce results with the lowest numbers of operations.For example, considering an 8-bit approximation of the same factor 1/

√2 ≈ 181/256 = 10110101, we find that

its CSD decomposition:

x ∗ 181 = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x ,

needs 4 addition and 4 shift operations. But, by rearranging the computations and reusing intermediate results,a more efficient algorithm can be constructed:

x2 = x + (x << 2); // 101x3 = x2 + (x << 4); // 10100x4 = x3 + (x2 << 5); // 10110101 = x ∗ 181


x*23/32(x>>1) + (x>>2) - (x>>5)x - (x>>2) - (x>>5)x + ((-x) >>6) - ((x + (x>>4)) >> 2)

–6

–4

–2

0

2

4

6

8

–10 –8 –6 –4 –2 2 4 6 8 10x

Figure 1. Errors in multiplierless implementations of a product by 23/32.

This approximation requires only 3 addition and 3 shift operations.

An even more dramatic reduction in complexity can be achieved by performing joint factorization of simul-taneous products by multiple integer constants. For example, consider the task of computing products by twoconstants: 99 = 1100011 and 239 = 11101111. The use of CSD decompositions

x ∗ 99 = (x << 6) + (x << 5) + (x << 1) + x;x ∗ 239 = (x << 8) − (x << 4) − x;

results in a total complexity of 5 addition and 5 shift operations. At the same time, by using a joint factorizationof these two products, the same task can be simplified by the following implementation:

x2 = x + (x << 5); // 100001x3 = x2 << 2; // 10000100x4 = x3 − x2; // 1100011 = x ∗ 99x5 = x3 + x4 + (x << 3); // 11101111 = x ∗ 239

which needs only 4 addition and 3 shift operations.

In the context of the IDCT design, such algorithms can be used for simultaneous computation of productsby pairs of factors in transform butterflies. Moreover, since in each butterfly there are typically two variablesthat need to be multiplied by the same factors, such computations can easily be done in parallel.

In passing, we should note that finding optimal (i.e. with fewest numbers of additions and/or shifts) al-gorithms for computing multiplications by integer constants has been an area of active and fruitful researchduring the last few decades33−40 . It has been established that this problem is NP-complete,36 and numerousfast heuristic algorithms have been proposed for solving it approximately.34, 37, 39, 40

2.7 Minimizing Errors in Multiplierless Algorithms using Right ShiftsAnother family of techniques for computation of products by dyadic fractions (8) can be derived by allowing theuse of right shifts as elementary operations.

For example, considering a factor 1/√

2 ≈ 23/32 = 0.10111, and using right shift and addition operationsaccording to its CSD decomposition, we obtain†:

x ∗ 23/32 ∼ (x >> 1) + (x >> 2) − (x >> 5). (12)

†Hereafter, by writing a(x) ∼ b(x) for some functions a(.) and b(.), we imply that there exists a constant δ � 0, suchthat for all x: |a(x) − b(x)| � δ.


or (by further noting that 1/2 + 1/4 = 1 − 1/4):

x ∗ 23/32 ∼ x − (x >> 2) − (x >> 5). (13)

Yet another (although, somewhat less obvious) way of computing product by the same factor is:

x ∗ 23/32 ∼ x − ((x + (x >> 4)) >> 2) + ((−x) >> 6) . (14)

We present plots of values produced by these algorithms in Figure 1. It can be noted that they all computevalues that approximate products by fraction 23/32, however, the errors in each of these approximations aredifferent. For example, the algorithm (13) produces all positive errors, with a maximum magnitude of 55/32.The algorithm (12) has more balanced errors, with the magnitude of oscillations within ±65/64. Finally, thealgorithm (14) produces perfectly sign-symmetric errors with oscillations in ±7/8.

The sign-symmetry property of an algorithm Aai,b(x) � xai/2b means that for any (x ∈ Z):

Aai,b(−x) = −Aai,b(x) , (15)

and it also implies that for any NN∑

x=−N

[Aai,b(x) − x

ai

b

]= 0 , (16)

that is, a zero-mean error on any symmetric interval.

This property is very important in the design of signal processing algorithms, as it minimizes the probabilitythat rounding errors introduced by fixed-point approximations will accumulate. Below we will establish theexistence of right-shift-based sign-symmetric algorithms for computing products by dyadic fractions and provideupper bounds for their complexity.

Given a set of dyadic fractions a1/2b, . . . , am/2b, we define an algorithm

Aai,...,am,b(x) �(xa1/2b, . . . , xam/2b

)(17)

as the following sequence of steps:x1, x2, . . . , xt , (18)

where x1 := x, and where subsequent values xk (k = 2, . . . , t) are produced by using one of the followingelementary operations:

xk :=

⎡

⎢⎢⎣

xi >> sk; 1 � i < k, sk � 1; or−xi; 1 � i < k; orxi + xj ; 1 � i, j < k; orxi − xj ; 1 � i, j < k, i �= j.

(19)

The algorithm terminates when there exists indices j1, . . . , jm � t, such that:

xj1 ∼ x ∗ a1/2b, . . . , xjm ∼ x ∗ am/2b . (20)

We state the following (the proofs for which are provided in Appendix B):

Theorem 2.2. For any m, b ∈ N and ai, . . . , am ∈ Z, there exist algorithms Aai,...,am,b (17-20), which aresign-symmetric. That is, for any x ∈ Z: Aai,...,am,b(−x) = −Aai,...,am,b(x).

Theorem 2.3. The lowest possible number of shifts in algorithms (17-20) satisfying the sign-symmetry propertyis at most twice the lowest possible number of shifts in algorithms without this property.

Theorem 2.4. The lowest possible total number of instructions in algorithms (17-20) satisfying the sign-symmetry property is at most four times the lowest possible total number of instructions in algorithms withoutthis property.


Pre-scaling of transformcoefficients:

F’vu = (Fvu* Svu) >> (S-P);

12 12+P Add DC bias:F’00 += 1<<(P+2);

12+P 1D rowtransforms

14+P 1D columntransforms

Right shiftsfyx >>= P+3;

1114+P

Figure 2. Fixed-point 8x8 IDCT architecture.

2 sin 3 /8cos 3 /16sin 3 /16sin /16

cos /16

0x

1x

2x

3x

4x

5x

6x

7x

0X

4X

2X

6X

7X

3X

5X

1X2 cos 3 /82

0x

1x

2x

3x

4x

5x

6x

7x

0X

4X

12X

16X

17X

3X

5X

11X

Figure 3. Loeffler-Ligtenberg-Moschytz IDCT factorization (left), and its generalized scaled form (right). Parameters (ξ, ζ)represent additional “floating factors” that we use for finding efficient fixed-point approximations.

We should point out that these are very simple and rather coarse complexity bounds, and that in many cases,the complexity overhead for achieving sign-symmetry is not that high. Moreover, when complexity considerationsare paramount, one can pick algorithms that are sign-symmetric for most, but not all values of x in the expectedrange of this variable. In many cases, such “almost symmetric” algorithms can also be the least complex for agiven set of factors.

In the design of our IDCTs we have used an exhaustive enumeration process for searching for the bestalgorithms (17-20) with symmetric (or at least well-balanced) rounding errors. As additional criteria for selectionof such algorithms, we have used estimates of mean, variance, and magnitude (maximum values) of errors thatthey produce. In assessing their complexity, we have counted the numbers of operations, as well as the longestexecution path, and maximum number of intermediate registers needed for computations.

3. DESIGN OF FIXED-POINT APPROXIMATIONS OF THE 8X8 IDCT

The overall architecture used in the design of the proposed fixed-point IDCT algorithms is shown in Figure 2,which can be characterized by its separable and scaled features. The scaling stage is performed with a single8x8 matrix that is precomputed by factoring the 1D scale factors for the row transform with the 1D scale factorsfor the column transform. The scaling stage is also used to pre-allocate P bits of precision to each of the inputDCT coefficients thereby providing a fixed-point ”mantissa” for use throughout the rest of the transform. Otherkey features of this architecture include simplicity, compactness / cache-efficiency, and flexibility of its interface,by allowing the potential for merging of scaling and quantization logic in video and image codec implementations.

As the underlying basis for scaled 1D transform design, we use a variant of the well-known factorization ofC. Loeffler, A. Ligtenberg, and G.S. Moschytz27 with 3 planar rotations and 2 independent factors γ =

√2 (see

Figure 3). This choice has been made empirically based on an extensive analysis of fixed-point designs derivedfrom other known algorithms, including variants of AAN,25 VL,26 and LLM27 factorizations.31

In order to allow efficient rational approximations of constants α, β, δ, ε, η, and θ within the LLM factoriza-tion, we introduce two floating factors ξ and ζ, and apply them to two sub-groups of these constants as follows(see also Figure 3, right flowgraph):

ξ : α′ = ξα, β′ = ξβ;ζ : δ′ = ζδ, ε′ = ζε, η′ = ζη, θ′ = ζθ; (21)


We invert these multiplications by ξ and ζ in the scaling stage by multiplying each input DCT coefficient withthe respective reciprocal of ξ and ζ. That is, we pre-compute a vector of scale factors for use in the scaling stageprior to the first in the cascade of 1D transforms.

σ = (1, 1/ζ, 1/ξ, γ/ζ, 1, γ/ζ, 1/ξ, 1/ζ)T . (22)

These factors are subsequently merged into a scaling matrix which is precomputed as follows:

Σ = σ σT 2S =

⎛

⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎝

A B C D A D C BB E F G B G F EC F H I C I H FD G I J D J I GA B C D A D C BD G I J D J I GC F H I C I H FB E F G B G F E

⎞

⎟⎟⎟⎟⎟⎟⎟⎟⎟⎟⎠

(23)

where A − J denote unique values in this product:

A = 2S , B =2S

ζ, C =

2S

ξ, D =

γ2S

ζ, E =

2S

ζ2, F =

2S

ξζ, G =

γ2S

ζ2, H =

2S

ξ2, I =

γ2S

ξζ, J =

γ22S

ζ2,

and S denotes the number of fixed-point precision bits allocated for scaling.

This parameter S is chosen such that it is greater than or equal to the number of bits P for the mantissa ofeach input coefficient. This allows scaling of the coefficients Fvu, to be implemented as follows:

F ′vu = (Fvu ∗ Svu) >> (S − P ) , (24)

where Svu ≈ Σvu denote integer approximations of values in matrix of scalefactors (23).

At the end of the last transform in the series of 1D transforms, the P fixed-point mantissa bits (plus 3 extrabits accumulated during executions of each of the 1D stages‡) are simply shifted out of the transform outputsby right shift operations:

fyx = f ′yx >> (P + 3) . (25)

To ensure a proper rounding of the computed value in (25), we add a bias of 2P+2 to the values f ′yx prior to

the shifts. This rounding bias is implemented by perturbing the DC coefficient prior to executing the first 1Dtransform:

F”00 = F ′00 + 2P+2 .

Using this architecture, the task of finding fixed point IDCT implementations is now reduced to finding setsof integer approximations of factors

• A, B, C, D, E, F, G, I, J – the coefficients in the matrix of scale factors (23), and

• α′, β′, δ′, ε′, η′, θ′ – the factors inside the 1D transforms

and algorithms for computing products by them. Global parameters that can also be adjusted are:

• P – the number of fixed-point mantissa bits;

• S – the number of bits used to implement the scaling stage such that S � P ;

• k – the number of bits used for the approximations of factors within 1D transforms.‡The LLM factorization naturally causes all quantities on the output to be multiplied by a factor of 2

√2.27 This results

in 1.5 bits mantissa expansion during row- and column- passes, and 3 bits accumulated at the end of the 2D transform.


Table 2. Fixed-Point IDCT Approximations Derived by Using our Design Framework.��Details

AlgorithmL16 L1 L0 L2

Z0a23002-2

Z1 Z4

A 16384 2048 2048 2048 1024 1024 8192B 15852 1703 2275 2275 1138 867 8037C 22930 2729 2556 2446 1730 1278 11051D 22418 2408 3218 3218 1609 1226 11366

2D scale E 15337 1416 2528 2528 1264 734 7885factors F 22185 2269 2840 2718 1922 1082 10842

G 21690 2002 3575 3574 1788 1038 11151H 32090 3637 3190 2923 2923 1595 14908I 31374 3209 4016 3844 2718 1530 15333J 30675 2832 5056 5055 2528 1468 15770

α′ 519/512 151/128 113/128 113/128 41/128 111/256 6573/16384β′ 413/2048 15/32 45/256 45/256 99/128 67/64 31737/32768

Transform δ′ 55/64 1 1533/2048 1533/2048 113/128 18981/16384 16379/16384factors ε′ 147/256 171/256 1/2 1/2 719/4096 59/256 1629/8192

η′ 99/256 13/32 111/256 29/64 1533/2048 16091/16384 27771/32768θ′ 239/256 251/256 67/64 35/32 1/2 21/32 4639/8192

S 14 11 11 11 10 10 13Bit- P 3 10 10 10 10 10 13

usage k 11 8 11 11 12 14 15

1D 50a, 20s 44a, 18s 42a, 20s 48a, 12s 44a, 20s 48a, 26s 56a, 32sComplexity 2D 801a, 384s 705a, 352s 673a, 384s 769a, 256s 705a, 384s 769a, 480s 897a, 576s

2D+S 1017a, 624s 901a, 552s 829a, 576s 925a, 448s 865a, 576s 925a, 680s 1125a, 792s

p 1 1 1 1 1 1 1max eyx 0.022800 0.022600 0.019400 0.019800 0.024800 0.007500 0.001300

Precision n 0.018577 0.017923 0.015397 0.016094 0.017866 0.005384 0.000425max |dyx| 0.003800 0.006000 0.004000 0.007500 0.004300 0.001800 0.000700

|m| 0.000573 0.000245 0.000425 0.000342 0.000166 0.000153 0.000053

Linearity test fail fail fail fail pass pass pass

Ext. dynamic range test fail fail fail fail pass pass pass

Notably, this list of parameters does not include the values for our “floating factors” – ξ and ζ. The reason fortheir exclusion is that these factors are needed only for establishing the relationship between the values of thefactors inside the transform (21) and the values for the scale factors (22). The actual values for ξ and ζ areabsorbed by the rational fractions assigned to each factor.

This design framework has been used for the design of several IDCT approximations submitted to MPEG.30, 31

The search for the above parameters and algorithms has been organized such that for each candidate transformapproximation we were able to measure: (a) the IDCT precision in terms of accuracy metrics, and (b) thenumber of operations needed for its implementation. This approach allowed us to identify transforms with thebest achievable complexity and precision tradeoffs.

3.1 Examples of IDCT Designs

We summarize the values of parameters and performance characteristics of several algorithms designed usingthis framework in Table 2. These algorithms have the following particular features:

L16 – an algorithm passing all normative ISO/IEC 23002-1 precision tests using the lowest achievable numberof mantissa bits: P = 3. This implies that this algorithm is implementable on 16-bit platforms.

L1 – an ISO/IEC 23002-1 compliant IDCT approximation with the lowest achievable number of bits in approx-imations of transform factors: k = 8.

Proc. of SPIE Vol. 6696 669617-10

Codec: MPEG-2, quant_scale=1, W[I,j]=16, no I-MBs

38

40

42

44

46

48

50

52

54

56

58

0 50 100 150 200 250 300

Frame #

PSN

R(Y

)

Ref. IDCTL16L1L0L2Z0aZ1Z4XVIDMPEG2 TM5H263 Ann.W

Codec: H.263, QP=1, Annex T, no I-MBs

45

45.5

46

46.5

47

47.5

48

48.5

49

49.5

0 50 100 150 200 250 300

Frame #

PSN

R(Y

)

Ref. IDCTL16L1L0L2Z0aZ1Z4XVIDMPEG2 TM5H263 Ann.W

Figure 4. Drift performance of our IDCT approximations using sequence “News” (CIF, 300 frames).

L0 – an ISO/IEC 23002-1 compliant IDCT approximation with the lowest achievable number of additions: 42additions per 1D transform. Since the underlying factorization contains 26 additions + 12 multiplications,this means that each multiplication in algorithm L0 is implemented by using only 1 + 1

3 additions.

L2 – an ISO/IEC 23002-1 compliant IDCT approximation with the lowest achievable number of shifts (12 shiftsper 1D transform). Since the underlying factorization contains 12 multiplications, this means that eachmultiplication in algorithm L2 is implemented by using only 1 shift operation.

Z0a – a higher-accuracy (linearity-test compliant) algorithm, selected for the Final Committee Draft (FCD) ofthe ISO/IEC 23002-2 standard.13

Z1 – an algorithm that was originally selected for the Committee Draft (CD) of ISO/IEC 23002-2 standard.12

This algorithm is considerably more complex than the FCD design (Z0a).

Z4 – an ultra-high precision IDCT approximation.

In characterizing IDCT precision, Table 2 lists worst-case values of ISO/IEC 23002-1 metrics, collected overall normative pseudo-random tests.14 In describing complexity, letters ”a” are used to denote the numbers ofadditions and letters ”s” – to denote the numbers of shifts necessary to implement these algorithms. The ”1D”complexity section provides the numbers of operations necessary to implement each scaled one-dimensionaltransform. The ”2D” complexity section shows the total numbers of operations necessary to implement thescaled 2D transform. Finally, the ”2D+S” complexity section shows the total numbers of operations necessary toimplement the complete 2D IDCT transform, including scaling (assuming that all input coefficients are non-zero).

The collection of algorithms L16, L0, L1, and L2 illustrates extremes that can be reached if the goal is tosimply pass the basic set of precision requirements for IDCT implementations in MPEG standards. AlgorithmsZ0a, Z1, and Z4 strive to go beyond this basic goal and have some nice additional properties. For example, theyall pass the linearity test,16, 17 pass extended dynamic range tests,15 and perform better in so-called IDCT-drifttests described in the next section.

3.2 Drift Performance Analysis

The IEEE 1180 | ISO/IEC 23002-1 tests define mandatory requirements for IDCT implementations in MPEGand ITU-T video coding standards. However, passing them does not always guarantee high quality of thedecoded video, particularly in situations with low quantization noise and long runs of predicted (P-type) framesor macroblocks.42 This is why, in evaluating an IDCT design, it is important to use additional tests, such asthose measuring drift (difference between reconstructed video frames in encoder and decoder) caused by the useof this approximate IDCT design in the decoder.

Proc. of SPIE Vol. 6696 669617-11

In order to measure the drift performance of our IDCTs we have used reference software encoders (employingfloating-point DCTs and IDCTs) of H.263, MPEG-2, and MPEG-4 P2 standards. In order to emphasize IDCTdrift effects, we have also:

• forced all frames after the first one to be P-frames;

• disabled Intra-macroblock refreshes;

• forced QP = 1 (quant scale = 1, and w[i, j] = 16 in MPEG 2,4) for all frames;

In the decoder we have used our IDCT approximations, and for comparison, we have also run tests for thefollowing existing IDCT implementations:

• MPEG-2 TM5 IDCT - fixed-point implementation included in MPEG-2 reference software,43

• XVID IDCT - a high-accuracy fixed-point implementation of IDCT in XVID (MPEG-4 P2) codec,44 and

• H.263 Annex W IDCT - a 16-bit IDCT algorithm specified in Annex W of ITU-T Recommendation H.263.7

The results of our tests for sequence “News”, using H.263 and MPEG-2 codecs, are shown in Figure 4.It can be observed, that the high-precision algorithm Z4 has virtually no drift. Then algorithms Z1 and Z0a fol-low with their worst case accumulated drift contained approximately within 0.5dB in H.263 tests, and within 2dBin MPEG-2 tests. Algorithms L0, L2, L1, then follow with their worst case drift being slightly worse (approxi-mately 0.625dB in H.263 and 2.25dB in MPEG-2 tests). The rest of the algorithms, however, perform much worse.

The MPEG-2 TM5 and XVID implementations show approximately 3dB drift in H.263 test, and almost 12dBdrift in the MPEG-2 environment. Even worse is the drift behavior of the 16-bit algorithms in our tests L16 andH.263 Annex W: they both show approximately 4dB drift in H.263 test, and 18-20dB drift in the MPEG-2 test.

These results illustrate that IDCT drift performance can be significantly affected by the choice of the fixed-point architecture, and its parameters. In particular, in testing numerous implementations produced using ourscaled, LLM-based framework, we have observed that drift performance is most significantly affected by our“mantissa” parameter P . For the majority of algorithms: L0, L1, L2, Z0a, and Z1, reducing the mantissa by 2, 3,sometimes even by 4 bits had almost no effect on most of the IEEE 1180 | ISO/IEC 23002-1 precision metrics,and yet, each such bit had a major effect (about 1-2dB per bit difference) in drift tests. The algorithm L16 isan extreme example of such a mantissa reduction process (leaving only P = 3), and it is obviously unacceptablein terms of drift performance. For this reason, we have retained at least P = 10 bits of mantissa in the designof most of our algorithms proposed to MPEG.

4. THE ISO/IEC 23002-2 FCD FIXED POINT IDCT ALGORITHM

The overall architecture and 1D factorization flowgraph used by ISO/IEC 23002-2 FCD algorithm are depictedin Figure 2 and Figure 3 correspondingly. All integer factors and parameters used in this algorithm are listed inTable 2 under the column “23002-2”.

This transform allocates P = 10 bits for the fixed-point mantissa, and uses the same number of bits forspecifying the scale factors S = 10. This cancels out right shifts in the processing of input coefficients (24), andmakes the scaling stage of this transform particularly simple:

F ′vu = Fvu ∗ Svu , v, u = 0, . . . , 7 , (26)

F ′′00 = F ′

00 + 212 , (27)

where Fvu are input coefficients, and the DC-term adjustment (27) is done to ensure proper rounding at the endof the transform:

fyx = f ′yx >> 13 , y, x = 0, . . . , 7 .

Proc. of SPIE Vol. 6696 669617-12

Scaling, rounding and righ-shifting of transformcoefficients:

Fvu = (F’vu* Svu + 2P+S-1 -1 + sgn(F’vu)) >> (P+S);

9 121D row transforms

12+P1D columntransforms

Left shiftsfyx << P-3;

6+P 9+P

Figure 5. ISO/IEC 23002-2 FDCT architecture.

0x

1x

2x

3x

4x

5x

6x

7x

0X

4X

12X

16X

17X

3X

5X

11X

Figure 6. Factorization employed in ISO/IEC 23002-2 FDCT design.

The maximum total number of bits needed by all variables in this transform is 26 bits, which assumes full12-bit dynamic range of reconstructed pixel values, which is sufficient to cover even extreme cases of quantizationnoise expansion, as described in 41 .

There are three groups of rational dyadic factors processed by this algorithm (see Figure 3, and Table. 2):

• α′ = 41/128 and β′ = 99/128 – in the butterfly with coefficients X2 and X6,

• δ′ = 113/128 and ε′ = 719/4096 – in the butterfly with coefficients X3 and X5, and

• η′ = 1533/2048 and θ′ = 1/2 – in the butterfly with coefficients X1 and X7.

The computation of products by these factors is performed as follows:

x2 = x + (x >> 5); // 1.00001x3 = x2 >> 2; // 0.0100001x4 = x3 + (x >> 4); // 0.0101001 ∼ x ∗ 41/128 = x ∗ α′

x5 = x2 − x3; // 0.1100011 ∼ x ∗ 99/128 = x ∗ β′

x2 = (x >> 3) − (x >> 7); // 0.0001111x3 = x2 − (x >> 11); // 0.00011101111x4 = x2 + (x3 >> 1); // 0.001011001111 ∼ x ∗ 719/4096 = x ∗ ε′

x5 = x − x2; // 0.1110001 ∼ x ∗ 113/256 = x ∗ δ′

x2 = (x >> 9) − x; // −0.111111111x3 = x >> 1; // 0.1 ∼ x/2 = x ∗ θ′

x4 = (x2 >> 2) − x2; // 0.10111111101 ∼ x ∗ 1533/2046 = x ∗ η′

The combined complexity of all these operations is only 9 addition and 10 shift operations. Therefore, the averagecomplexity for computing a single multiplication in this algorithm is only 9/6 = 1.5 addition and 10/6 ≈ 1.66shift operations. In comparing this with traditional fixed point-point implementation of products by factors:

x ∗ η′ ∼ (x ∗ 1533 + 1024) >> 11

which includes an addition (for proper rounding) and a shift, we conclude that the effective cost of each integermultiplication in ISO/IEC 23002-2 FCD algorithm is only 0.5 addition + 0.66 shift operations.

Proc. of SPIE Vol. 6696 669617-13

The total complexity of computing each scaled 1D transform in this algorithm is 44 addition and 20 shiftoperations. The description of a complete 1D transform in C programming language requires only 50 lines.13

Extra C-code needed to describe the full 2D version takes only 20 lines.

The scaling of transform coefficients can be done either outside of the transform, e.g. in the quantizationstage, thereby taking advantage of the sparseness of the input matrix of coefficients, or inside the transform, byexecuting multiplications (26).

This algorithm passes all normative ISO/IEC 23002-1 precision tests,14 as well as many additional tests thathave been created in the process of evaluating fixed point designs in MPEG. These additional tests includeMPEG-2 and MPEG-4, and T.83 (JPEG) conformance tests, drift tests with H.263, MPEG-2, and MPEG-4encoders and decoders, as well as linearity test, and extended dynamic range tests.15

4.1 ISO/IEC 23002-2 FCD FDCT Design

The design of the corresponding fixed-point forward ISO/IEC 23002-2 DCT is fully symmetric relative to theIDCT design. Its overall architecture and 1D factorization are presented in Figure 5 and Figure 6 correspondingly.All integer factors and algorithms for computing products in this FDCT design are exactly the same as in theIDCT, with the only difference being simply the order in which they are executed.

The two elements in the FDCT design that are implemented differently when compared to the IDCT designare: the reservation of mantissa bits and scaling. The allocation of mantissa bits is done at the very beginningof the FDCT transform as follows:

f ′yx = fyx << 7 , y, x = 0, . . . , 7 , (28)

and the scaling is done at the very end, by using

Fvu =(F ′

vu ∗ Svu + 219 − 1 + sgn(F ′vu)

)>> 20 , v, u = 0, . . . , 7 , (29)

where

sgn(x) =[

0, if x � 01, if x < 0 . (30)

The use of the term (30) in rounding (29) assures that FDCT scaling is done in a sign-symmetric fashion, witha slightly wider deadzone around 0.

We note that the scaled architecture of ISO/IEC FDCT design makes it also possible to combine the fi-nal scaling stage (29-30) with the quantization process in video or image encoders, thereby enabling furthercomplexity reductions.

5. CONCLUSIONS

In this paper we have described our proposed fixed-point IDCT design methodologies and several resultingalgorithms achieving different precision/complexity characteristics. We have explained choices of the parametersin such designs, and their connection to IDCT precision and drift performance.

The fixed-point 8x8 IDCT and DCT algorithms adopted in ISO/IEC 23002-2 FCD standard are also described.Their architecture has benefited from the ideas contributed to the MPEG standardization process by multipleproponents and yielded a remarkably efficient implementation, surpassing all IEEE 1180 | ISO/IEC 23002-2precision requirements, with low implementation complexity (requiring only 44 addition and 20 shift operationsper scaled 1D transform), and performing very well in linearity, extended dynamic range, and IDCT drift tests.

6. ACKNOWLEDGEMENTS

The authors wish to thank Honggang Qi, Wen Gao, Debin Zhao, Siwei Ma, and other participants in the MPEGIDCT project for their contributions influencing the design of the ISO/IEC 23002-2 FCD algorithm. The authorsalso wish to thank Joan Mitchell and Gary Sullivan for their helpful comments on the manuscript of this paper.

Proc. of SPIE Vol. 6696 669617-14

REFERENCES1. K. R. Rao, and P.Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press,

San Diego, 1990.2. W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Compression Standard, Van Nostrand Reinhold,

New York, 1993.3. J. L. Mitchell, W. B. Pennebaker, D. Le Gall, and C. Fogg, MPEG Video Compression Standard, Chapman

& Hall, New York, 1997.4. ITU-T Recommendation T.81 | ISO/IEC 10918-1: Information Technology – Digital Compression and Coding

of Continuous-Tone Still Images – Requirements and Guidelines , 1992.5. ISO/IEC 11172: Coding of moving pictures and associated audio for digital storage media at up to about 1,5

Mbit/s – Part 2 - Video, August, 1993.6. ITU-T Recommendation H.262 | ISO/IEC 13818-2: Information technology – Generic coding of moving

pictures and associated audio information: Video, 1994.7. ITU-T Recommendation H.263: Video Coding for Low Bit Rate Communication, 1995.8. ISO/IEC 14496-2: Information technology – Coding of audio-visual objects – Part 2: Visual , July, 2001.9. ANSI/IEEE 1180-1990, Standard Specifications for the Implementations of 8 8 Inverse Discrete Cosine Trans-

form, December 1990 (withdrawn by ANSI 9 September, 2001; withdrawn by IEEE 7 February, 2003).10. G. J. Sullivan, Standardization of IDCT approximation behavior for video compression: the history and the

new MPEG-C parts 1 and 2 standards, SPIE Applications of Digital Image Processing XXX, Proc. SPIE,Vol. 6696 , August 28–31, 2007 (this conference).

11. ISO/IEC JTC1/SC29/WG11 (MPEG), Call for Proposals on Fixed-Point IDCT and DCT Standard , MovingPicture Experts Group (MPEG) output document N7335, Poznan, Poland, July 2005.

12. ISO/IEC JTC1/SC29/WG11 (MPEG), ISO/IEC CD 23002-2: Fixed-point IDCT and DCT , Moving PictureExperts Group (MPEG) output document N8479, Hangzhou, China, October 2006,

13. ISO/IEC JTC1/SC29/WG11 (MPEG), ISO/IEC FCD 23002-2: Information technology MPEG video tech-nologies, Part 2: Fixed-point 8x8 IDCT and DCT , Moving Picture Experts Group (MPEG) output documentN8983, San Jose, CA, April 2007.

14. ISO/IEC 23002-1: Information technology – MPEG video technologies – Part 1: Accuracy requirements forimplementation of integer-output 8x8 inverse discrete cosine transform, December 2006.

15. ISO/IEC JTC1/SC29/WG11 (MPEG), ISO/IEC 23002-1/FPDAM1: Information technology – MPEG videotechnologies – Part 1: Accuracy requirements for implementation of integer-output 8x8 inverse discrete cosinetransform. Amendment 1: Software for Integer IDCT Accuracy Testing, Moving Picture Experts Group(MPEG) output document N8981, San Jose, CA, April 2007.

16. M. A. Isnardi, Description of Sample Bitstream for Testing IDCT Linearity, Moving Picture Experts Group(MPEG) document M13375, Montreux, Switzerland, April 2006.

17. C. Zhang and L. Yu, On IDCT Linearity Test, Moving Picture Experts Group (MPEG) document M13528,Klagenfurt, Austria, July 2006.

18. Z. Ni and L. Yu, Drift Problem of Fixed-Point IDCT on News Sequence, Moving Picture Experts Group(MPEG) document M13912, Hangzhou, China, October 2006.

19. G. Sullivan, Video IDCT static-content pathological drift: Analysis and encoder techniques, Moving PictureExperts Group (MPEG) document M14077, Marrakech, Morocco, January 2007.

20. E. Feig and S. Winograd, On the multiplicative complexity of discrete cosine transforms (Corresp.), IEEETrans. Info. Theory, vol. IT-38, pp. 1387–1391, July 1992.

21. E. Feig, and S. Winograd, ”Fast algorithms for the discrete cosine transform”, IEEE Trans. Signal Processing ,vol. 40, pp. 2174–2193, September 1992.

22. N. Ahmed, T. Natarajan and K. R. Rao, Discrete cosine transform, IEEE Trans. Comput , vol.C-23, pp.90–93, January 1974.

23. W. Chen, C. H. Smith and S. C. Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform,IEEE Trans. Comm., vol. com-25, No. 9, pp. 1004–1009, September 1977.

24. B. G. Lee, A new algorithm to compute the discrete cosine transform, IEEE Trans. Acoust., Speech, SignalProcessing , vol. ASSP-32, pp.1243–1245, May 1984.

Proc. of SPIE Vol. 6696 669617-15

25. Y. Arai, T.Agui and M. Nakajima, A Fast DCT-SQ Scheme for Images, Transactions of the IEICE E71(11):1095, November 1988. (in Japaneze)

26. M. Vetterli and A. Ligtenberg, A Discrete Fourier-Cosine Transform Chip, IEEE Journal on Selected Areasin Communications, Vol. 4, No. 1, pp. 49–61, January 1986.

27. C. Loeffler, A. Ligtenberg, and G. S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications,in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. (ICASSP’89), vol. 2, pp. 988-991, February 1989.

28. Y.A. Reznik, A.T. Hinds, N. Rijavec, Low-Complexity Fixed-Point Approximation of Inverse Discrete CosineTransform, In Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. (ICASSP’07), Honolulu, HI, April 15-20,2007.

29. A. Navarro, A. Silva, Y. Resnik, A Full 2D IDCT with Extremely Low Complexity, SPIE Applications ofDigital Image Processing XXX, Proc. SPIE, Vol. 6696 , August 28–31, 2007 (this conference).

30. Y. Reznik, A. Hinds, C. Zhang, L. Yu, and Z. Ni, Response to CE on Convergence of Scaled and Non-ScaledIDCT Architectures, Moving Picture Experts Group (MPEG) document M13650, Klagenfurt, Austria, July2006.

31. Y. Reznik, Analysis of hybrid scaled/non-scaled IDCT architectures, Moving Picture Experts Group (MPEG)document M13705, Klagenfurt, Austria, July 2006.

32. J. W. S. Cassels, An Introduction to Diophantine Approximations , Cambridge University Press, 1957.33. D. Knuth, The Art of Computer Programming: Seminumerical Algorithms, vol. 2. Addison-Wesley, 1969.34. A. Avizienis, Signed-digit number representations for fast parallel arithmetic, IRE Transactions on Electronic

Computers , Vol. EC- 10, pp. 389-400. 1961.35. A. Karatsuba and Y. Ofman, Multiplication of multidigit numbers on automata, Soviet Phys. Doklady,

Vol. 7, No. 7, pp. 595–596, January 1963.36. P. R. Cappello and K. Steiglitz, Some complexity issues in digital signal processing, IEEE Trans. Acoustics,

Speech Signal Proc., ASSP-32, 5, pp. 1037–1041, 1984.37. R. Bernstein, Multiplication by Integer Constants, Software - Practice and Experience, Vol 16, No. 7, pp

641-652, 1986.38. R.M.Hewlitt and E.S. Swartzlantler, Canonical signed digit representation for FIR digital filters, in Proc.

IEEE Workshop on Signal Processing Systems (SiPS 2000), pp. 416-426, 2000.39. V. Lefevre, Moyens arithmetiques pour un calcul fiable, PhD thesis, Ecole Normale Superieure de Lyon,

Lyon, France, January 2000.40. Y. Voronenko and M. Puschel, Multiplierless multiple constant multiplication, ACM Trans. Algorithms , 3,

2, pp. 11-, May, 2007.41. M. Zhou, and J. De Lameillieure, IDCT output range in MPEG video coding, Signal Processing: Image

Communication, Vol. 11, No. 2, pp. 137–145, Dec. 1997.42. A.T. Hinds, Z.Ni, C.Zhang, L.Yu, and Y.A. Reznik, On IDCT Drift Problem, SPIE Applications of Digital

Image Processing XXX, Proc. SPIE, Vol. 6696 , August 28–31, 2007 (this conference).43. MPEG-2 TM5 source code and documentation:

http://www.mpeg.org/MPEG/MSSG/tm5/44. XVID open source implementation of MPEG-4 ASP:

http://downloads.xvid.org/

APPENDIX A. SOME FACTS FROM DIOPHANTINE APPROXIMATION THEORYAND PROOF OF LEMMA 2.1

Let θ be a real number, and let us try to approximate it by a rational fraction

θ ≈ p/q.

Here, p and q are integers and, without loss of generality, it can be assumed that q > 0.

Given a fixed q, it can be noted that the precision of best approximation p/q satisfies:

|θ − p/q| = q−1 |qθ − p| = q−1 minz∈Z

|qθ − z| = q−1 ‖qθ‖ ,

Proc. of SPIE Vol. 6696 669617-16

where ‖θ‖ denotes the distance of θ to the nearest integer. Based on the above, it appears that the magnitudeof error should decrease inverse proportional to q.

Nevertheless such approximations can be much more precise. We quote the following result from [32, p. 11,Theorem V] (in this context the term ”equivalent” means multiple by any integer factor).

Theorem A.1. Let θ be irrational. Then there are infinitely many q such that

q ‖qθ‖ < 5−1/2.

If θ is equivalent to 12

(5−1/2 − 1

)then the constant 5−1/2 cannot be replaced by any smaller constant. If θ is not

equivalent to 12

(5−1/2 − 1

), then there are infinitely many q such that:

q ‖qθ‖ < 2−3/2.

Even more notable is the result concerning the achievable precision of simultaneous approximations of irra-tional numbers θ1, . . . , θn by fractions p1/q, . . . , pn/q, with common denominator q (see [32, p. 14, Theorem III]):

Theorem A.2. There are infinitely many integers q such that

q1/n max {‖q θ1‖ , . . . , ‖q θn‖} <n

n + 1.

This means that there exist sets of integers (p1, . . . , pn, q) such that:

max {|θ1 − p1/q| , . . . , |θn − pn/q|} <n

n + 1q−1−1/n .

It can be seen that our Lemma 2.1 is a simple consequence of the above fact, where we additionally introduceparameter k ∈ N, and set ξ := q 2−k. That is, by multiplying both sides of the last formula by ξ we obtain:

max{∣∣ξ θ1 − p1/2k

∣∣ , . . . ,∣∣ξθn − pn/2k

∣∣} <n

n + 1ξ−1/n 2−k(1+1/n) .

APPENDIX B. SIGN-SYMMETRIC RIGHT SHIFT OPERATOR AND PROOFS OFTHEOREMS 2.2-2.4

For the purpose of our analysis we will need to introduce the following operation.

Definition B.1. The Sign-Symmetric Right Shift (SSRS) xsym>> b of an integer variable x by b � 1 bits is

computed as follows:x

sym>> b := (x >> (b + 1)) − ((−x) >> (b + 1)) , (31)

where >> denotes the ordinary (arithmetic) right shift operation.

Based on its definition, it is easy to see that

(−x)sym>> b = −

(x

sym>> b

), (32)

which implies that it satisfies the sign-symmetry property.

The proof of Theorem 2.2 follows by construction: we take any existing non-sign symmetric algorithm, andreplace all its right shifts with SSRS operators. The complexity of the SSRS operator is 2 shifts, 1 addition, and1 negation. The total complexity is 4 operations. Theorems 2.3 and 2.4 follow.

Proc. of SPIE Vol. 6696 669617-17

Publication 2 Arianne T. Hinds, Yuriy A. Reznik, Lu Yu, Zhibo Ni, Cixun Zhang,

Drift Analysis for integer IDCT


Aug. 2007.

Drift Analysis for Integer IDCT

Arianne T. Hindsa, Yuriy A. Reznikb, Lu Yuc, Zhibo Nic, Cixun Zhangc aInfoPrint Solutions Company, 6300 Diagonal Highway, Boulder, CO, USA 80301; bQUALCOMM Incorporated, 5775 Morehouse Drive, San Diego, CA, USA 92121;

cZhejiang University, Hangzhou, China 310027;

ABSTRACT

This paper analyzes the drift phenomenon that occurs between video encoders and decoders that employ different implementations of the Inverse Discrete Cosine Transform (IDCT). Our methodology utilizes MPEG-2, MPEG-4 Part 2, and H.263 encoders and decoders to measure drift occurring at low QP values for CIF resolution video sequences. Our analysis is conducted as part of the effort to define specific implementations for the emerging ISO/IEC 23002-2 Fixed-Point 8x8 IDCT and DCT standard. Various IDCT implementations submitted as proposals for the new standard are used to analyze drift. Each of these implementations complies with both the IEEE Standard 1180 and the new MPEG IDCT precision specification ISO/IEC 23002-1. Reference implementations of the IDCT/DCT, and implementations from well-known video encoders/decoders are also employed. Our results indicate that drift is eliminated entirely only when the implementations of the IDCT in both the encoder and decoder match exactly. In this case, the precision of the IDCT has no influence on drift. In cases where the implementations are not identical, then the use of a highly precise IDCT in the decoder will reduce drift in the reconstructed video sequence only to the extent that the IDCT used in the encoder is also precise.

Keywords: IDCT, DCT, drift, artifacts, video standardization, MPEG, MPEG-2, MPEG-4, H.263, MPEG-C

1. INTRODUCTION The Discrete Cosine Transform (DCT)1 is a fundamental operation used by the vast majority of today’s image and video compression standards, such as JPEG, MPEG-1, MPEG-2, MPEG-4 (P.2), H.261, H.263, and others1-5. Encoders in these standards apply such transforms to each 8x8 block of pixels to produce DCT coefficients, which are then subject to quantization and encoding. The Inverse Discrete Cosine Transform (IDCT) is used in both the encoder and decoder to convert DCT coefficients back to the spatial domain.

At the time when the first image and video compression standards were defined, the implementation of DCT and IDCT algorithms was considered a major technical challenge, and therefore, instead of defining a specific algorithm for computing it, ITU-T H.261, JPEG, and MPEG standards have included precision specifications that must be met by IDCT implementations conforming to these standards8. This decision has allowed manufacturers to use the best optimized designs for their respective platforms. However, the drawback of this approach is the impossibility to guarantee exact decoding of MPEG-encoded videos across different decoder implementations. It was further observed that in some cases the mismatch caused by different IDCT implementations in the encoder and decoder can accumulate, leading to degradations in quality of reconstructed video. This effect has become well known in video compression/distribution industry as “IDCT drift”, and it has catalyzed the development of various modifications and work-arounds in MPEG encoder implementations (such as periodic Intra-macroblock refreshes, “John Morris” test8, and others) aiming at reducing it. Still, in some modes/profiles of today’s video coding standards (such as when using the deblocking filter in H.263 standard, or the quarter-sample motion supported in MPEG-4 part 2 standard) the IDCT drift

Further author information: (please send correspondence to Yuriy A. Reznik); Yuriy A. Reznik: [email protected], phone: +1 (858) 658-1866, Arianne T. Hinds: [email protected], Cixun Zhang: [email protected], Lu Yu: [email protected], Zhibo Ni: [email protected]. The work of Lu Yu, Cixun Zhang, and Zhibo Ni was supported by NSFC under contract No. 60333020 and Project of Science and Technology Plan of Zhejiang Province under contract No. 2005C1101-03.

Invited Paper



remains to be of significant influence on video quality9-11, and its further study is likely to be of interest for the community of engineers implementing such video coding standards.

This paper presents results of an experimental study to analyze drift that occurs between video encoders and decoders that employ different implementations of the IDCT. Our experiments and associated results are collected using MPEG-2, H.263, and MPEG-4 Part 2 testbeds as described below, in the course of the efforts to design a fixed-point IDCT implementation for ISO/IEC 23002-2.

Our paper is organized as follows. Section 2 analyzes the IEEE 1180 and ISO/IEC 23002-1 metrics used to measure the precision of IDCT approximations (with respect to the ideal integer-valued IDCT), and the relationship of these metrics to the drift phenomenon. Section 3 will review the results of our experimental analysis to measure drift with multiple fixed-point IDCT implementations with respect to the above precision metrics. Section 4 presents our conclusions

2. IDCT PRECISION SPECIFICATIONS AND DRIFT PROBLEM The mathematical definition of the Inverse Discrete Cosine Transform (IDCT) of size 8x8 is provided by the following equation:

7 7

0 0

( ) ( ) (2 1) (2 1)( , ) ( , ) cos cos4 16 16u v

c u c v x u y vf x y F u v π π= =

+ += ∑∑ , (x,y=0,…,7) (1)

where ( ) 1/ 2 for 0, otherwise 1,c u u= = and where F(u,v) is the matrix containing forward DCT coefficients computed for a block of sample pixel values, and f(x,y) is the corresponding matrix of reconstructed sample pixel values.

As evident from (1), the direct computation of IDCT involves products by cosine values of various angles, most of which are irrational values, and which can’t be computed exactly using today’s conventional (finite-precision) computers. Hence in order to realize efficient and practical hardware and software implementations of the IDCT, fixed-point approximations of the irrational values are often identified so that the required bit-depth for the final implementation can be bounded in exchange for some amount of tolerable error. The amount of acceptable error for fixed point IDCT implementations for use in MPEG-1, MPEG-2, and MPEG-4 Part 2 standards was originally defined by the IEEE Standard 11806. The current version of this specification is provided by the ISO/IEC 23002-17 (also known as MPEG-C Part 1) standard.

2.1 IEEE 1180 and ISO/IEC 23002-1 Basics

The IEEE 1180 standard and its replacement ISO/IEC 23002-1 specify a procedure for producing a test sequence of “Q” sample blocks of DCT coefficients, which are subsequently converted into the spatial domain using the reference (double-precision floating point) IDCT with its outputs rounded to integer values. In these tests, Q is defined to be either 10000 or 1000000, and the DCT coefficients are generated assuming that pixel values are in the ranges of [-5,5], [-256, 255], [-300, 300], or [-384,383]. The outputs of the IDCT under test are then compared to the corresponding integer outputs of the reference IDCT produced using the same set of inputs. The output matrices from these transforms are denoted by hz”[][] and gz”[][] for the reference and approximate IDCTs respectively.

The IEEE 1180 and ISO/IEC 23002-1 precision specifications define the following set of accuracy criteria that are to be collected across the series of tests:

Peak pixel error (ppe), p: ]][[]][[max]][[ xygxyhxyp zzz

′′−′′= ;

Pixel mean error (pme), d[y][x]: ( )∑−

=

−=1

0]][['']][[''1]][[

Q

zzz xygxyh

Qxyd ;

Pixel mean square error (pmse), e[y][x]: ( )∑−

=

−=1

0

2]][['']][[''1]][[Q

zzz xygxyh

Qxye

Overall mean error (ome): ∑∑= =

=7

0

7

0]][[

641

x yxydm


Overall mean square error (omse): ∑∑= =

=7

0

7

0]][[

641

x yxyen

We note that only three from the above set of criteria satisfy metric properties (i.e. they are non-negative, commutative, and satisfy the triangle inequality). These are the ppe, pmse, and omse metrics which can also be considered as variants of the well known L2 and L ∞ measures. In contrast, the other two criteria: pme and ome do not satisfy metric properties, and hence, do not offer an indication of how precise the approximate IDCT is with respect to the reference algorithm. Instead, they measure how well the errors produced by an approximate transform are balanced around zero. Hence, there are complimentary, and in some cases, redundant test criteria. For instance, due to:

d[y][x] ≤ e[y][x] m ≤ n

it becomes obvious, that designs minimizing pmse and omse should simultaneously minimize pme and ome, and that passing pme and ome thresholds in pmse and omse dimensions correspondingly would simply render pme and ome tests unnecessary.

The IEEE 1180 and ISO/IEC 23002-1 specifications define the following thresholds for these criteria:

p – maximum absolute difference between reconstructed pixels (required: 1≤p ); d[x,y] – average differences between pixels (for all [x,y]: 015.0|],[| ≤yxd );

m – average of all pixel-wise differences (required: 0015.0|| ≤m ); e[x,y] – average square difference between pixels (for all [x,y]: [ , ] 0.06e x y ≤ ); n – average of all pixel-wise square differences (required: 02.0|| ≤n ).

It can be noted that the thresholds set for pme and ome are much smaller than for square-error type metrics: pmse, and omse.

2.2 Errors between ISO/IEC 23002-1-compliant IDCT implementations

Given a sample sequence x1… xQ of test matrices (assumingly generated by some stationary stochastic process) and two IDCT implementations F1 and F2 correspondingly, we are interested in studying the distance between outputs produced by these algorithms:

∆(x) =М(F1(x),F2(x)) or

∆ =М( F1,F2), assuming that the sequence of ∆ (x1) … ∆ (xQ) converges to some quantity ∆. Operator M in both expressions is a given IEEE 1180 | ISO/IEC 23002-1 criteria with metric properties (such as ppe, pmse, or omse). Then, by the triangle inequality:

∆ =М( F1,F2)≤ М( F1,Fref)+ М( F2,Fref) ≤2T

where T is a threshold defined for the distance in terms of metric M to the reference transform Fref. Hence, by upper-bounding ppe, pmse, and omse, the IEEE 1180 | ISO/IEC 23002-1 precision specifications provide a means for controlling the maximum pair-wise error caused by any two (precision compliant) implementations of the IDCT transform. 2.3 Relation of IDCT Precision Metrics to Drift

An illustration explaining the processing of data in video encoder and decoder is provided in Fig. 1.


Fig. 1. A simplified view of the processes in video encoder and decoder modules for inter-coded blocks. In this

Figure, the implementation of IDCT F1 in the encoder’s reconstruction loop does not exactly match the implementation of IDCT F2 in the decoder.

For the inter-coding of blocks, the residual block ( )ex n is first transformed by the forward DCT and then quantized to produce coefficient block ( )ey n in the encoder. ( )ey n is then provided as input to both the decoder and the encoder’s reconstruction loop where it undergoes the reverse process of inverse quantization followed by inverse transformation. In the encoder, IDCT F1 computes the reconstructed residual block ' ( )ex n while in the decoder, IDCT F2 computes the reconstructed residual block ' ( )ex n + ∆ . These two blocks differ because the implementations of F1 and F2 used to compute them are not exactly the same; both are compliant with the IEEE 1180 and/or ISO/IEC 23002-1 specifications, but each computes its corresponding reconstructed residual block with some amount of error. The difference between these two blocks, denoted by ∆, is the error introduced into the reconstructed frame at the decoder.

In turn, the current reconstructed block ( ) ( )( )r p ex n x n x n= + is obtained as a sum of the current predicted block ( )px n and the reconstructed residual. The same processing is done in the decoder, but the result is different by the IDCT-introduced error ∆: ( ) ( )' ( ) ' 'r p ex n x n x n= + + ∆ . In processing of subsequent frames, the reconstructed blocks are used

to derive new predicted blocks ( )1px n + and ( )' 1px n + in the encoder and decoder correspondingly. Hence, the IDCT error introduced in the previous frame will now reappear in the predicted block in the decoder:

( ) ( )( ) ( )( ) ( )' 1 ' ' ,p r rx n f x n f x n f n+ = = + ∆ ,

where ( )' ,f n∆ is a function describing the accumulation of IDCT drift in the predictor after processing of n frames.

Moreover, if the predictor function ( ).f is linear, then it can be shown that ( ) ( )' ,f n n f∆ = ∆ .

In general, we can assume that drift function ( )' ,f n∆ is monotonic and increasing with respect to point-wise IDCT error ∆, and hence if ∆ is bounded:

∆ =М( F1,F2)≤ М( F1,Fref)+ М( F2,Fref) ≤2T then, correspondingly:

( ) ( )( ) ( ) ( )( ) ( )1 2 1 2' , ' , , ' , , , ' 2 ,ref reff n f M F F n f M F F M F F n f T n∆ = ≤ + ≤ ,

where T is the IEEE 1180 | ISO/IEC 23002-1 threshold set for metric M.

Reconstruction Loop

Encoder

Decoder

Forward DCT

( )ey n Quantization

' ( )ex n + ∆IDCT F1 (fixed-point)

' ( )ex n

( )rx n

( )px n Inverse Quantization

Inverse Quantization

IDCT F2 (fixed-point)

' ( )px n

' ( )rx n

( )ey n

( )ex n


Based on the above discussion, it can be noted that the only way to ensure that drift is eliminated entirely is to make IDCT implementations F1 and F2 identical. Making only one of these IDCTs very precise with respect to reference IDCT will reduce drift only to

( ) ( ) ( )( )1 2, 0 ' , ' , ,ref refM F F f n f M F F n= ⇒ ∆ ≤

or

( ) ( ) ( )( )2 1, 0 ' , ' , ,ref refM F F f n f M F F n= ⇒ ∆ ≤

which we will call a self-drift of the IDCT on the other end of the system. Hence, the IDCT problem is generally not solvable by changing the design of only one IDCT (either in the encoder or decoder).

3. EXPERIMENTAL STUDY OF IDCT DRIFT PROBLEM This section summarizes the results of our experimental study10 on the precision performance of IDCTs submitted in response to the MPEG call for proposals11. In this study, we explore the relative drift observed when multiple IDCT implementations are used in both the encoder and decoder, including cases when some existing well-known IDCT implementations are used in the encoder. We also compare the resulting measured drift with the IEEE 1180 metrics to further analyze the relation of these metrics to the reduction of drift.

3.1 Methodology

The following describes our testing methodology, and provides the rationale for selecting particular operating modes for the encoder, quantization parameter (QP) values, and test sequences. 3.1.1 Testbed Encoders/Decoders and Precision Testbed

For our experiments, we have used the reference software of the MPEG-212, MPEG-413, and H.26314 video coding standards to serve as the basis for encoder and decoder testbeds. These testbeds disable rate control, as well as some other standard features, such as the forced insertion of intra macroblocks in order to amplify IDCT drift effects. Also, for the same purpose, the H.263 testsbed employs Annex T of this ITU-T standard15, which allows transmission of transform coefficients outside the normal clipping range of [-2048,2047].

We also employ the use of a precision testbed to measure the conformance of IDCT implementations to the precision metrics of IEEE 1180. This particular testbed is also developed jointly by several members of the MPEG ad hoc group responsible for the development of parts one and two of ISO/IEC 2300216. Its primary purpose is to perform the requisite series of tests defined in IEEE 1180 for each IDCT under test, and to report any failures of the corresponding IDCT with respect to the thresholds defined for each precision metric.

Notably, each of these software testbeds is provided in an amendment to the original ISO/IEC 23002-117. The aim of this amendment is to provide the same software tools used by MPEG to aid developers of fixed-point IDCT implementations with their analysis of IDCT drift and precision performance.

3.1.2 IDCT Algorithms

We substitute the IDCT implementations in the corresponding reference software for each testbed, with several IDCT implementations submitted for consideration for the new standard. Each of these proposed algorithms has been verified to conform to the metrics specified in IEEE 1180 using the above precision testbed. In addition to the proposed algorithms we’ve also included several examples of well-known existing IDCT implementations:

• MPEG-2 TM5 IDCT – fixed-point implementation included in MPEG-2 reference software18,

• XVID IDCT – a high-accuracy fixed-point implementation of IDCT in XVID opens source implementation of MPEG-4 P2 codec19,


• H.263 Annex W IDCT – idct algorithm specified in Annex W of ITU-T Recommendation H.263, and

• a 64-bit floating-point IDCT implementation.

The full list of all algorithms used in this study is presented in Table 1 with their names denoting the following characteristics identified in the list below. In these descriptions, LLM11 denotes the Loeffler, Ligtenberg, and Moschytz algorithm with 11 multiplication operations required, while LLM12 denotes the Loeffler, Ligtenberg, and Moschytz algorithm with 12 multiplication operations20.

• Annn: An algorithm based on the Arai, Agui, and Nakajima factorization21

• Bnnn: An algorithm based on the factorization by B.G. Lee22, implemented with right shift and addition operations

• Cnnn: An algorithm based on the Chen factorization23 implemented with right shift and addition operations

• CWnnn: An algorithm based on the Chen-Wang factorization24 implemented with fixed-point multipliers or left shift and addition operations

• FWnnn: A non-separable two-dimensional algorithm based on the Feig-Winograd25 factorization implemented with right shift and addition operations

• Mnnn: an algorithm based both on LLM12 and fixed-point multiplication or left shift and addition operations

• Nnnn: an algorithm based both on LLM12 and right shift and addition operations

• Tnnn: an algorithm based both on LLM11 and lifting steps implemented with shift and addition opertions

• Rnnn: an algorithm based on LLM11 with a scaling stage implemented in each one-dimensional transform.

• Snnn: an algorithm based on LLM11, lifting steps, and designed specifically to achieve high precision

• Innn: an algorithm that approximates the cosines in the IDCT matrix by scaling them by powers of two, and then performs matrix-vector multiplication operations to compute the outputs.

3.1.3 Test Sequences

In order to conduct our tests, we have used several standard CIF-resolution sequences consisting of the Foreman, Mobile, News, and Football test sequences. The Foreman sequence is 300 frames in length and characterized by its low motion with little static region content. The Mobile sequence is 300 frames in length, also with low motion, but slightly more static region content than Foreman. News, is also 300 frames, but can be characterized by its balanced content of high motion (the dancers), low motion (the news anchors), and static content (the news room background). Football consists of only 150 frames with content that can be characterized as high motion and little static content.

3.1.4 QP Values In order to understand at which QP level the difference between various IDCT implementations will become noticeable, we have conducted a series of “pretests” with the H.263+ testbed using the floating-point IDCT in the encoder and a subset of IDCT proposals operating at various degrees of precision in the decoder. The CIF sequence “Foreman” was used for this purpose. The results of our pretests for QP=1, 2, and 3 are presented in Figures 2-4.

Based on these results it can be seen that IDCT drift with respect to the floating point reference IDCT can be seen most clearly at QP=1 (with Annex T enabled), but the differences between various IDCTs is sharply declining with the use of higher QP values. In fact, even at QP=3, these differences are within 0.2 dB. Hence, in our subsequent experiments we choose to consider only the case of QP=1 as this is the mode in which drift effects are most amplified*. * Nevertheless, all subsequent results should be understood as an extreme illustrations of the drift phenomena, which in practical video applications (ones operating in higher QP ranges and using normal Intra-refreshes) may not be present or have much small influence on quality of decoded video.


Table 1. IDCT/DCT algorithms used in H.263/MPEG-4/MPEG-2 testbeds

Worst-case IEEE-1180 precision metrics Index Algorithm Ppe pmse omse pme ome

0 T001 1 0.023800 0.016960 0.004100 0.000004 1 T002 1 0.005100 0.002575 0.001600 0.000003 2 R001 1 0.023500 0.016820 0.005600 0.000038 3 R002 1 0.005600 0.002895 0.005200 0.000013 4 R003 1 0.023200 0.016392 0.005400 0.000023 5 R004 1 0.004200 0.002633 0.003600 0.000013 6 S001 1 0.001100 0.000316 0.000600 0.000003 7 CW001 1 0.004900 0.003753 0.001900 0.000006 8 C001 1 0.002600 0.001330 0.001000 0.000002 9 A001 1 0.020800 0.010388 0.002900 0.000016

10 A002 1 0.001200 0.000434 0.000600 0.000003 11 I001 1 0.005700 0.004345 0.002000 0.000001 12 A003 1 0.016000 0.010263 0.009500 0.000156 13 A004 1 0.008253 0.004758 0.007963 0.000153 14 A005 1 0.004900 0.002956 0.001800 0.000022 15 A006 1 0.006900 0.003639 0.005500 0.000034 16 A007 1 0.007200 0.003531 0.002700 0.000016 17 A008 1 0.024800 0.017208 0.005800 0.000185 18 A009 1 0.022800 0.017242 0.008700 0.000087 19 M001 1 0.009200 0.006613 0.002800 0.000012 20 M002 1 0.016800 0.013555 0.003400 0.000023 21 M003 1 0.007400 0.005059 0.002200 0.000014 22 N001 1 0.001500 0.000433 0.001500 0.000018 23 N002 1 0.005700 0.003986 0.001900 0.000010 24 FW001 1 0.022900 0.017296 0.003900 0.000005 25 A010 1 0.004500 0.002136 0.001900 0.000046 26 A011 1 0.000800 0.000259 0.000500 0.000003 27 B001 1 0.009500 0.005069 0.002400 0.000015 28 MPEG2_TM5 1 0.017800 0.013300 0.003900 0.000035 29 XVID 1 0.009800 0.007452 0.002500 0.000010 30 H263_Annex_W* 1 0.011400 0.009530 0.003010 0.000014 31 64 bit floating point IDCT ~0 ~0 ~0 ~0 ~0

(*)Excluding test ranges +-384 and +-512. The above table also lists the worst-case IEEE 1180 metrics obtained by the precision testbed for each of the algorithms. 3.1.5 IDCT Algorithms used in the Encoder

In addition to an idealized case where the floating point reference IDCT is employed by the encoder, we also utilize the following fixed point algorithms in our tests:

H.263 Annex W IDCT – An IDCT algorithm specified in Annex W of ITU-T Recommendation H.263,

MPEG-2 TM5 IDCT – A fixed-point implementation included in MPEG-2 reference software,

XVID IDCT – A high-accuracy fixed-point implementation of the IDCT in the XVID open source implementation of the MPEG-4 Part 2 encoder/decoder,

These algorithms cover a wide range of possible precision characteristics, and the first three of them are well known and used in various implementations of MPEG and ITU-T encoders/decoders.


Fig. 2. Drift under reference IDCT in the encoder at QP=1.


Foreman CIF 300 frames, QP=1, Annex T

47.0000

47.5000

48.0000

48.5000

49.0000

49.5000

0 50 100 150 200 250 300Frame number

PSN

R (Y

)

S001 N002 N001 A003 A004 A006 A007


41.5000

42.0000

42.5000

43.0000

43.5000

44.0000

44.5000

0 50 100 150 200 250 300Frame number

PSN

R (Y

)

S001 N002 N001 A003 A004 A006 A007



3.2 Results

In this section we provide a subset of the exhaustive set of results that we have produced using all variations of testbeds, encoder IDCT implementations, and test sequences.

3.2.1 Results produced using the H.263 Encoder/Decoder

Figures 5-8 provide results that are produced using the H.263 encoder/decoder and with the Foreman video sequence.

Fig. 5. Drift with respect to the H.263 Annex W IDCT implemented in the encoder of the H.263 testbed.


39.0000

39.5000

40.0000

40.5000

41.0000

41.5000

42.0000

42.5000

43.0000

0 50 100 150 200 250 300Frame number

PSN

R (Y

)

S001 N002 N001 A003 A004 A006 A007

Foreman CIF 300 frames, QP=1, Annex T Encoder's IDCT = H.263 Annex W IDCT

45.0000 45.5000 46.0000 46.5000 47.0000 47.5000 48.0000 48.5000 49.0000 49.5000 50.0000

0 50 100 150 200 250 300 Frame number

PSN

R (Y

)

H263 Annex W T001 T002 R001 R002 R003 R004 S001 CW01 C001 A001 A002 I001 A003 A004 A005 A006 A007 A008 A009 M001 M002 M003 N001 N002 FW001 A010 A011 B001


Fig. 6. Drift with respect to the MPEG-2 TM5 IDCT implemented in the encoder of the H.263 testbed.

Fig. 7. Drift with respect to the XVID high accuracy IDCT implemented in the encoder of the H.263 testbed.

Foreman CIF 300 frames, QP=1, Annex T Encoder's IDCT = MPEG-2 TM5 IDCT

45.0000 45.5000 46.0000 46.5000 47.0000 47.5000 48.0000 48.5000 49.0000 49.5000 50.0000

0 50 100 150 200 250 300

Frame number

PSN

R (Y

) MPEG-2 TM5 T001 T002 R001 R002 R003 R004 S001 CW01 C001 A001 A002 I001 A003 A004 A005 A006 A007 A008 A009 M001 M002 M003 N001 N002 FW001 A010 A011 B001

Foreman CIF 300 frames, QP=1, Annex T Encoder's IDCT = XVID high accuracy IDCT

45.0000 45.5000 46.0000 46.5000 47.0000 47.5000 48.0000 48.5000 49.0000 49.5000 50.0000

0 50 100 150 200 250 300

Frame number

PSN

R (Y

)

XVID IDCT T001 T002 R001 R002 R003 R004 S001 CW01 C001 A001 A002 I001 A003 A004 A005 A006 A007 A008 A009 M001 M002 M003 N001 N002 FW001 A010 A011 B001

Proc. of SPIE Vol. 6696 669614-10

Fig. 8. Drift with respect to the 64-bit floating point IDCT implemented in the encoder of the H.263 testbed.

It can be seen in Figure 5, that the highest video fidelity is obtained when the H.263 IDCT is used in both the encoder and the decoder, and that the remaining IDCT implementations yield results considerably lower in PSNR regardless of their precision performance with respect to the metrics specified in IEEE 1180. It can also be noted that the curves corresponding to IDCTs with higher-precision performance tend to be clustered together, ultimately centered around a curve, corresponding to the use of the ideal IDCT in the decoder. We refer to this curve as the self-induced drift of the fixed-point IDCT in the encoder. Notably, we observe the same behavior when other (more precise) IDCTs are employed in the encoder.

Referring to Figures 6 and 7, we also note that with an IDCT operating at higher precision in the encoder, the differences between the curves produced by the various IDCTs used in the decoder are getting larger. But still, there is a clear concentration around a self-induced drift curve.

However, in Figure 8, when a 64-bit floating-point IDCT is used in the encoder, the results are dramatically different. Here, all curves corresponding to the use of a fixed-point IDCT in the decoder simply fall below the top curve (for the case where a floating-point IDCT is used in both the encoder and decoder). There is also a significant variation between performances of different fixed-point IDCT implementations.

3.2.2 Results produced using MPEG-4 Encoder Figures 9 through 12 provide a subset of the complete results that are produced using the MPEG-4 Part 2 Advanced Simple Profile encoder/decoder and with the Foreman video sequence. In these results, we also illustrate the effects of enabling and disabling of Quarter Pixel Motion Compensation.

Foreman CIF 300 frames, QP=1, Annex T Encoder's IDCT = 64-bit floating-point IDCT

45.0000

45.5000

46.0000

46.5000

47.0000

47.5000

48.0000

48.5000

49.0000

49.5000

50.0000

0 50 100 150 200 250 300 Frame number

PSN

R (Y

)

Float.point IDCT T001 T002 R001 R002 R003 R004 S001 CW01 C001 A001 A002 I001 A003 A004 A005 A006 A007 A008 A009 M001 M002 M003 N001 N002 FW001 A010 A011 B001

Proc. of SPIE Vol. 6696 669614-11

Fig. 9. Drift with respect to the MPEG-2 TM5 IDCT employed in the encoder of MPEG-4, QMC enabled

As in the case of the H.263 drift results, here we see a concentration of drift curves around the self-induced drift curve of an IDCT in the encoder. We also note that the overall magnitude of drift when quarter-pixel interpolation is utilized is significantly larger than the corresponding drift produced using half-pixel interpolation and prediction.

Fig. 10. Drift with respect to the MPEG-2 TM5 IDCT employed in the encoder of MPEG-4,QMC disabled

Foreman CIF 300 frames, QP=1, Progressive Encoder's IDCT = MPEG-2 TM5 IDCT

Quarter Pel Disabled, MPEG-2 Quantization

41.0000

42.0000

43.0000

44.0000

45.0000

46.0000

47.0000

48.0000

49.0000

50.0000

0 50 100 150 200 250 300 Frame number

PSN

R (Y

)

MPEG-2 TM5 T001 T002 R001 R002 R003 R004 S001 CW01 C001 A001 A002 I001 A003 A004 A005 A006 A007 A008 A009 M001 M002 M003 N001 N002 FW001 A010 A011 B001 floating-point IDCT

Foreman CIF 300 frames, QP=1, Progressive Encoder's IDCT = MPEG-2 TM5 IDCT

QuarterPel Enabled, MPEG2 Quantization

36.0000

38.0000

40.0000

42.0000

44.0000

46.0000

48.0000

50.0000

0 50 100 150 200 250 300 Frame number

PSN

R (Y

)

MPEG-2 TM5 T001 T002 R001 R002 R003 R004 S001 CW01 C001 A001 A002 I001 A003 A004 A005 A006 A007 A008 A009 M001 M002 M003 N001 N002 FW001 A010 A011 B001 float-point IDCT

Proc. of SPIE Vol. 6696 669614-12

Fig. 11. Drift with respect to the 64-bit floating point IDCT implemented in the encoder of the MPEG-4 testbed,

QMC enabled

Fig. 12. Drift with respect to the 64-bit floating point IDCT implemented in the encoder of the MPEG-4 testbed, QMC disabled

Foreman CIF 300 frames, QP=1, Progressive Encoder's IDCT = 64-bit floating-point IDCTQuarterPel Enabled, MPEG-2 Quantization

36.0000

38.0000

40.0000

42.0000

44.0000

46.0000

48.0000

50.0000

0 50 100 150 200 250 300 Frame number

PSN

R (Y

) Float.point IDCT T001 T002 R001 R002 R003 R004 S001 CW01 C001 A001 A002 I001 A003 A004 A005 A006 A007 A008 A009 M001 M002 M003 N001 N002 FW001 A010 A011 B001

Foreman CIF 300 frames, QP=1, Progressive Encoder's IDCT = 64-bit floating-point IDCTQuarter Pel Disabled, MPEG-2 Quantization

45.0000

45.5000

46.0000

46.5000

47.0000

47.5000

48.0000

48.5000

49.0000

49.5000

50.0000

0 50 100 150 200 250 300 Frame number

PSN

R (Y

)

Float.point IDCT T001 T002 R001 R002 R003 R004 S001 CW01 C001 A001 A002 I001 A003 A004 A005 A006 A007 A008 A009 M001 M02 M003 N001 N002 FW001 A010 A011 B001

Proc. of SPIE Vol. 6696 669614-13

3.3 Connection between the Drift and IEEE1180 Accuracy metrics

This section attempts to connect the observed drift performance for the various existing IDCT implementations with their IEEE 1180 precision characteristics. The horizontal lines in figures 13 through 16 below mark the average self-induced drift of an IDCT used in the MPEG-4 encoder. i.e. the average drift values measured by using a particular fixed-point IDCT in the encoder and a 64-bit floating-point IDCT in the decoder.

Fig. 13. Average self drift in terms of OMSE with respect to the 64-bit floating point IDCT implemented in the

decoder of the MPEG-4 testbed

Fig. 14. Average self drift in terms of PMSE with respect to the 64-bit floating point IDCT implemented in the


Average drift in first 131 frames vs OMSE

0

0.5

1

1.5

2

2.5

0 0.005 0.01 0.015 0.02

omse

Drif

t [dB

]

Drift wrt H.263 Ann W IDCTDrift wrt MPEG-2 TM5 IDCTDrift wrt XVID IDCT Drift wrt CW001 IDCT Drift wrt Floating-Point IDCT

Average drift in first 131 frames vs PMSE

0

0.5

1

1.5

2

2.5

0 0.005 0.01 0.015 0.02 0.025 0.03

pmse

Drif

t [dB

]

Drift wrt H.263 Ann W IDCT Drift wrt MPEG-2 TM5 IDCT Drift wrt XVID IDCT Drift wrt CW001 IDCT Drift wrt Floating-Point IDCT

Proc. of SPIE Vol. 6696 669614-14

Fig. 15. Average self drift in terms of OME with respect to the 64-bit floating point IDCT implemented in the


Fig. 16. Average self drift in terms of PME with respect to the 64-bit floating point IDCT implemented in the


It can be seen that in all cases when an integer IDCT is used in the encoder, the drift results tend to concentrate around self-induced drift of the IDCT employed in the encoder. This effect is observed for all metrics.

4. CONCLUSION This paper presents the results of our experimental drift analysis conducted in conjunction with the development of (MPEG-C Part 2) ISO/IEC 23002-2 Fixed-Point 8x8 IDCT and DCT. Our main conclusion from this study is that drift is caused by mismatches between IDCT implementations in the encoder and decoder, and that such drift is eliminated entirely only when these IDCT implementations are identical. Moreover, we present results that suggest that the

Average drift in first 131 frames vs OME

0

0.5

1

1.5

2

2.5

0 0.0001 0.0002 0.0003 0.0004

ome

Drif

t [dB

]

Drift wrt H.263 Ann W IDCTDrift wrt MPEG-2 TM5 IDCTDrift wrt XVID IDCT Drift wrt CW001 IDCT Drift wrt Floating-Point IDCT

Average drift in first 131 frames vs PME

0

0.5

1

1.5

2

2.5

0 0.002 0.004 0.006 0.008 0.01

pme

Drif

t [dB

]

Drift wrt H.263 Ann W IDCT Drift wrt MPEG-2 TM5 IDCT Drift wrt XVID IDCT Drift wrt CW001 IDCT Drift wrt Floating-Point IDCT

Proc. of SPIE Vol. 6696 669614-15

relationship between the precision of an IDCT approximation (in terms of IEEE 1180 metrics with respect to the ideal integerized IDCT) and the prevention of drift, is not easily identified. That is, our evidence suggests that the use of a highly precise IDCT in the decoder can eliminate drift only to the extent that the corresponding IDCT implementation used in the encoder is also precise. Hence, the goal of a completely drift-free operation is not attainable by employing a highly precise IDCT in only the decoder (or only the encoder). Our study also suggests that the drift performance of a particular IDCT approximation should be analyzed by empirically measuring drift resulting between the IDCT approximation implemented in the decoder (or encoder) against a typical or common set of IDCT implementations in the corresponding encoder (or decoder) system. This analysis is especially useful to assess the IDCT performance for a variety of existing bitstreams, encoders, and decoders. To aid in this analysis, ISO/IEC 23002-1 provides a set of drift testbeds with the goal of measuring drift in MPEG-2, H.263, and MPEG-4 (Pt.2), under common test conditions.

REFERENCES

1. ITU-T Recommendation T.81 | ISO/IEC 10918-1: Information Technology -- Digital Compression and Coding of Continuous-Tone Still Images -- Requirements and Guidelines, 1992.

2. ISO/IEC 11172: Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s -- Part 2 -Video, August, 1993.

3. ITU-T Recommendation H.262 | ISO/IEC 13818-2: Information technology -- Generic coding of moving pictures and associated audio information: Video, 1994.

4. ISO/IEC 14496-2: Information technology -- Coding of audio-visual objects -- Part 2: Visual, July, 2001. 5. ITU-T Recommendation H.263: Video Coding for Low Bit Rate Communication, 1995. 6. ANSI/IEEE 1180-1990, Standard Specifications for the Implementations of 8 8 Inverse Discrete Cosine Transform,

December 1990 (withdrawn by ANSI 9~September, 2001; withdrawn by IEEE 7~February, 2003). 7. ISO/IEC 23002-1: Information technology -- MPEG video technologies -- Part 1: Accuracy requirements for

implementation of integer-output 8x8 inverse discrete cosine transform, December 2006. 8. G. J. Sullivan, Standardization of IDCT approximation behavior for video compression: the history and the new

MPEG-C parts 1 and 2 standards, SPIE Applications of Digital Image Processing XXX, Proc. SPIE, Vol. 6696, August 28--31, 2007 (this conference).

9. Yi-Shin Tung, Ja-Ling Wu, Jim Shiu, Zhixiong Wu, Quality Degradation Problem: IDCT Mismatch Propagated and Scaled by MPEG-4 Quarter-pel Filtering, Moving Picture Experts Group (MPEG) document M9822, July 2003.

10. Z. Ni and L. Yu, On the Problem of Quater Pixel Motion Compensation, Moving Picture Experts Group (MPEG) document M14544, April. 2007.

11. A. T. Hinds, Y. A. Reznik, P. Sagetong, On IDCT Exactness, Precision, and Drift Problem, Moving Picture Experts Group (MPEG) document M13657, July 2006.

12. Z. Ni and L. Yu, The Drift Problem of Fixed-Point IDCT on News Sequence, Moving Picture Experts Group (MPEG) document M13912, Oct. 2006.

13. ISO/IEC JTC1/SC29/WG11 (MPEG), Call for Proposals on Fixed-Point IDCT and DCT Standard, Moving Picture Experts Group (MPEG) output document N7335, Poznan, Poland, July 2005.

14. MPEG-2 TM5 source code and documentation: http://www.mpeg.org/MPEG/MSSG/tm5/ 15. XVID open source implementation of MPEG-4 ASP: http://downloads.xvid.org/downloads/xvidcore-1.1.0.tar.gz

Proc. of SPIE Vol. 6696 669614-16

Publication 3 Cixun Zhang, Lu Yu,

Multiplier-less Approximation of the DCT/IDCT with Low Complexity and High Accuracy


Aug. 2007.

Multiplier-less Approximation of the DCT/IDCT with Low Complexity and High Accuracy

Cixun Zhang∗ and Lu Yu§

∗Institute of Signal Processing, Tampere University of Technology, Tampere, Finland §Institute of Information and Communication Engineering, Zhejiang University, Hangzhou, China

ABSTRACT This paper presents a straightforward multiplier-less approximation of the forward and inverse Discrete Cosine Trans-form (DCT) with low complexity and high accuracy. The implementation, design methodology, complexity and per-formance tradeoffs are discussed. Particular, the proposed IDCT implementations, in spite of simplicity, comply with and can reach far beyond the MPEG IDCT accuracy specification ISO/IEC 23002-1, and also reduce drift favorably com-pared to other existing IDCT implementations.

Keywords: DCT, IDCT, transform, drift, fixed point, multiplier-less

1. INTRODUCTION The Discrete Cosine Transform (DCT) is widely used in video and image coding applications as it is the transform used in both MPEG [1] and JPEG [2] standards. The Inverse Discrete Cosine Transform (IDCT) is the inverse process of the DCT. Theoretically, the DCT/IDCT is defined as real number operations and the implementation complexity is high, thus reducing the complexity is one of the most important issues when designing DCT/IDCT implementation. Another im-portant issue specifically for IDCT implementation is to reduce the drift problem, which is caused by different IDCT implementations in encoder and decoder.

In this paper, a straightforward multiplier-less approximation of the DCT/IDCT is proposed with low complexity and high accuracy. The proposed IDCT implementations comply with and can reach far beyond the MPEG IDCT accu-racy specification ISO/IEC 23002-1 [3], and also reduce drift favorably compared to other existing IDCT implementa-tions.

The rest of the paper is organized as follows. In Section 2, a detailed description about the proposed implementation and design methodology is presented. Section 3 gives the experimental results and Section 4 gives the complexity and accuracy analysis of the proposed implementations. Section 5 concludes the paper.

2. IMPLEMENTATION AND DESIGN 2.1. Implementation For simplicity, we put most of our discussion on IDCT in the following of this paper and similar design approach can be applied for DCT. Similar to our previous work [6] [7], the fixed point IDCT matrix is derived by (1). The fixed point

This research is supported by NSFC under contract No. 60333020 and Project of Science and Technology Plan of Zhejiang Province

under contract No. 2005C1101-03.

This work was done when the first author was with Institute of Information and Communication Engineering, Zhejiang University.

Please send correspondence to Lu Yu (E-mail: [email protected])

Invited Paper



xO

xl

x2

x3

x4

x5

x6

DCT matrix is its transpose.

( 8 2 ) / 2 / 2SCALE SCALE SCALE

G A E B G C F DG B F D G A E CG C F A G D E BG D E C G B F A

roundG D E C G B F AG C F A G D E BG B F D G A E CG A E B G C F D

⎡ ⎤⎢ ⎥− − − − −⎢ ⎥⎢ ⎥− − −⎢ ⎥− − − −⎢ ⎥= ⋅ ⋅ = ⎢ ⎥− − − −⎢ ⎥

− − − − −⎢ ⎥⎢ ⎥− − −⎢ ⎥

− − − −⎢ ⎥⎣ ⎦

fpIDCT IDCT (1)

Such a fixed point IDCT can be easily implemented based on an efficient 12-multiply-32-add butterfly structure

first proposed by Loeffler et al [8], as shown in Figure 1 (IDCT Butterfly Structure, not including rounding) and Figure 2 (DCT Butterfly Structure, including rounding) below. The apparent rounding offset used before the right shift in final 1-D IDCT can be implemented by simply adding a constant to the DC term near the beginning of the IDCT process. The DCT butterfly structure is obtained by simply reversing the signal flow in the IDCT butterfly structure. Similarly, the rounding offset used before the right shift in final 1-D DCT can be implemented by 3 additions in the middle of the proc-ess as shown in Figure 2.

Figure 1. IDCT Butterfly Structure (not including rounding)


k k

k k

k k

k 0'

-

) L-

—

—-I

4—

0

Figure 2. DCT Butterfly Structure (including rounding)

For a fixed point IDCT matrix derived by (1), the multipliers in the butterfly structure in Figure 1 and Figure 2 can be obtained by (2) and assured to be solvable and unique. This is mainly due to the fact that there is at most one multi-plier in every path inside the butterfly structure. Actually, this characteristic also helps to achieve high accuracy.

0

1

2

0

1

2

3

4

5

6

7

8

( ) / 2

/ 2

( ) / 2

( ) / 2

( ) / 2

( ) / 2

( ) / 2

( ) / 2

( ) / 2

( ) / 2

( ) / 2

/ 2

SCALE

SCALE

SCALE

SCALE

SCALE

SCALE

SCALE

SCALE

SCALE

SCALE

SCALE

SCALE

e E Fe Fe E Fd A B C Dd A B C Dd A B C Dd A B C Dd B Dd A Bd B Cd B Cd B

= − −

=

= −

= − + + −

= + − +

= + + −

= + − −

= − +

= − −

= − −

= − +

=

(2)

By using right shifts rather than left shifts, the proposed methods can effectively reduce the bit widths needed in the implementation of the fixed point IDCT compared to other IDCT implementations [6] [7] [13] [14]. However, in order to achieve high accuracy, it is necessary to pre-multiply all coefficients by a certain constant C which is often chosen to be implemented by simple left shift before the transform process. That is, we have C=2UP_SCALE, where UP_SCALE denotes


IXPU-' OUTPUT

the number of reserved bits. The 1-D IDCT can then be operated directly without spending any cycles on internal pre- or post-scaling, and finally when both rows and columns are processed, we simply need to shift all resulting quantities by UP_SCALE+3 bits to the right. As mentioned above, the apparent rounding operation of IDCT process can be achieved by simply adding the constant (1<<(UP_SCALE+2)) to the DC coefficient right after scaling since there is no right shift after first 1-D IDCT in proposed methods. Such a whole process is denoted by (SCALE, UP_SCALE) in this paper. The block diagram of 2-D IDCT is shown in Figure 3.

Figure 3. 2-D IDCT

Note that the butterfly structures in Figure 1 and Figure 2 can possibly be adjusted to different (mathematical equivalent or not) structures according to different needs. For one example, there are four 3-multiplication structures in the butterfly structures: 1) e0, e1, e2; 2) d0, d3, d4; 3) d1, d2, d5; 4) d6, d7, d8 and indeed if we replace one or all of them with a simple plane rotation, further reduction of the operation count can be possibly achieved with almost the same accuracy using the technique of [12]. At the same time, this will facilitate other implementations and optimizations, e.g., to change non-scaled implementations into scaled implementations [11] [17]. 2.2. Design Methodology The exact multiplier-less implementation (using binary additions and shifts) of the multipliers in the butterfly structure is crucial to the accuracy of the IDCT implementation. If we use F(i) and I(i) to represent the output of a theoretical multi-plier and a practical multiplier when the input is i, then from (2), one straightforward approach is to find a multiplier-less implementation that minimizes the approximation error which, for example, is calculated as (3) or (4) below.

∑=

−=SCALE

iiFiIErrorionApproximat

2

0

2))()(( (3)

∑=

−=SCALE

iiFiIErrorionApproximat

2

0)()( (4)

However, there may be some disadvantages of this approach: 1. The statistics of input data of different multipliers is different from each other. Moreover, it is difficult to estimate

the statistics since it may vary greatly under different situations: different sequence and coding methods, etc. 2. The optimization of implementation of each multiplier is independent from each other. However, in fact, the trun-

cating error among different multipliers will interact with one another through the additions/subtracts in the butterfly structure. Experiments also show that independent optimization of each multiplier using criteria like (3) and (4) does not yield

best implementation. However, on the other hand, exhaustive search is simply impossible because of the huge number of possible implementations. Therefore, the following methods are introduced to reduce the optimization complexity.


1. Only the multiplier-less implementation with minimum number of adders is used [5] for each multiplier. Clearly, this also reduces the implementation complexity.

2. The first step of multiplier-less implementation is restricted to be in the form 1±(1>>x). That is, the shift operations are delayed in the multiplier-less implementations of the multipliers. This not only reduces the number of multi-plier-less implementations being considered, but also reduces the truncation error caused by the shifts, especially when the range of input data is small, which is critical in reducing drift. At the same time, this restriction does not affect the worst case results (regardless of input range of random test or number of samples in random test) for the metrics measured in the MPEG IDCT accuracy specification ISO/IEC 23002-1 [3] obviously, but improves the re-sults when input range is small.

3. Early termination is necessary in spite of method 1 and 2. In practice, we can stop search when Overall Mean Square Error (OMSE) becomes steady. Different start points are used to avoid resulting in bad local minimum as best as possible. The reason why OMSE is selected as criterion is given in sub-section 2.3 below. Note it is also possible that only multiplier-less implementations whose approximation error are less than a specified threshold are consid-ered. However, this is not as necessary as the above methods and was not used in our optimization. We used above methods to optimize the original (13,10) proposed in [4]. Our optimized multiplier-less implementa-

tion of the multipliers in improved (13,10) is given in Table 1. The reason why (13,10) is selected is because it is one of the most efficient methods among others and this will be detailed in section 4. Note this optimization method can also be used for other IDCT approximations with different butterfly structure.

2.3. Optimization Criterion Selection There are basically six metrics defined in MPEG IDCT accuracy specification ISO/IEC 23002-1 [3]. They are: near-DC inversion, Peak Pixel Error (PPE), Pixel Mean Error (PME), Pixel Mean Square Error (PMSE), Overall Mean Error (OME), Overall Mean Square Error (OMSE). Generally speaking, near-DC inversion test result will be 0 when suitable combination of SCALE and UP_SCALE is selected [4], while PPE will be 1 for fixed point IDCTs. So they are not suit-able as a criterion for optimization. Among other four, OMSE turns out to be the most important because PME and OME do not really say how far away the approximate IDCT is from a reference one but instead how well the errors produced by an approximate transform are balanced around 0 [9]. In fact, PME and OME can be effectively reduced by simply using different rounding offset instead of 1/2 for the whole transform. At the same time, OMSE appears more important than PMSE because it takes errors of all the coefficients into account. Therefore, we use OMSE as a criterion in our op-timization process. In practice, we can use the OMSE in range [-256, 255] instead of the maximum value of all ranges for accuracy consideration and OMSE in range [-5, 5] (or even smaller range like [-1, 1]) for drift reduction considera-tion because generally drift is most significant when small QP is used and therefore the input into the transform is small.

Table 1. Multiplier-less Implementation of the Multipliers in improved (13,10)

0e x0=1+(1>>2), 0e =(1<<1)-(x0>>3)+(1>>8) 4shifts+3additions

1e x0=1+(1>>4), x1=1+(x0>>2), x2=x0+(x1>>6), 1e =(x2>>1) 4shifts+3additions

2e x0=1-(1>>4), 2e =1-(x0>>2)-(1>>12) 3shifts+3additions


0d x0=1+(1>>3), x1=x0+(x0>>4)-(1>>10), 0d =(x1>>2) 4shifts+3additions

1d x0=1+(1>>3), 1d =(1<<1)+(x>>5)+(x>>6)+(1>>11) 5shifts+4additions

2d x0=1+(1>>4), x1=x0+(1>>2), 2d =(1<<1)+x0+(x1>>7) 4shifts+4additions

3d x0=1+(1>>10), 3d =x0+(x0>>1) 2shifts+2additions

4d x0=1+(1>>4), x1=1+(x0>>4), 4d =1-(x1>>3)+(x1>>5) 4shifts+4additions

5d x0=1+(1>>7), x1=1+(x0>>3), 5d =(1<<1)+(x1>>1) 4shifts+3additions

6d x0=1+(1>>2), 6d =(1<<1)-(x0>>5)+(x0>>11) 4shifts+3additions

7d x0=1+(1>>1), x1=1-(x0>>6), 7d =(x1>>6)+(x0>>2) 4shifts+3additions

8d x0=1-(1>>5), x1=(x0>>6)+(1>>2), x2=x0-x1, 8d =1+(x2>>2) 4shifts+4additions

Total 46shifts+39additions

3. EXPERIMENTAL RESULTS Proposed improved (13,10) turn out to be very efficient. In Table 2 we list MPEG IDCT accuracy specification ISO/IEC 23002-1 compliance results based on corresponding testbed in [10]. To show the effectiveness of our optimization method, we also compared the drift of improved (13,10) with original implementation of (13,10) in [4] and the results are shown in Figure 4 and Figure 5, where the suffixes “ed”, “e”, “d” mean that the proposed IDCT was used in both en-coder and decoder, used only in encoder, and used only in decoder respectively while for the latter three cases, 64-bit floating point IDCT was used on the other end. In Figure 6 and Figure 7, we also compared improved (13,10) to other existing IDCT implementations like the high accuracy IDCT (13,9,20) proposed in [6] [7], MPEG-2 TM5 IDCT [13] and XVID high accuracy IDCT [14]. It is quite clearly that improved (13,10) performs much better than all these IDCTs and up to several dB improvement in PSNR can be achieved. Actually, it also performs quite well compared to other many IDCTs in spite of its simplicity and more results can be found in [15] [16]. Note that the drift tests are carried out under extreme worst-case: only first frame was coded as intra frame, no intra macroblocks in P frame and QP was set to 1.

Table 2. MPEG IDCT Accuracy Specification ISO/IEC 23002-1 Compliance Results of Improved (13,10)

Q L, H S PPE (<=1)

PMSE (<=0.06)

OMSE (<=0.02)

PME (<=0.015)

OME (-0.0015<=OME<=0.0015)

10000 5, 5 + 1 5.00e-004 1.77e-004 5.00e-004 4.22e-005 10000 5, 5 - 1 5.00e-004 1.62e-004 3.00e-004 3.13e-005 10000 256, 255 + 1 5.80e-003 3.97e-003 1.70e-003 -7.81e-006


10000 256, 255 - 1 5.90e-003 4.00e-003 2.00e-003 9.06e-005 10000 300, 300 + 1 5.20e-003 3.88e-003 1.50e-003 -7.50e-005 10000 300, 300 - 1 5.10e-003 3.88e-003 1.50e-003 1.27e-004 10000 384, 383 + 1 5.50e-003 3.95e-003 2.10e-003 9.06e-005 10000 384, 383 - 1 5.60e-003 3.95e-003 2.10e-003 -5.47e-005 10000 512, 511 + 1 5.30e-003 3.75e-003 1.60e-003 -1.72e-005 10000 512, 511 - 1 5.40e-003 3.75e-003 1.50e-003 4.53e-005

1000000 5, 5 + 1 3.50e-004 1.69e-004 3.48e-004 3.69e-005 1000000 5, 5 - 1 3.55e-004 1.67e-004 3.49e-004 3.78e-005 1000000 256, 255 + 1 4.11e-003 3.94e-003 3.27e-004 3.53e-005 1000000 256, 255 - 1 4.07e-003 3.94e-003 3.59e-004 4.41e-005 1000000 300, 300 + 1 4.05e-003 3.92e-003 2.98e-004 4.05e-005 1000000 300, 300 - 1 4.05e-003 3.92e-003 3.63e-004 2.67e-005 1000000 384, 383 + 1 4.04e-003 3.90e-003 2.46e-004 1.89e-005 1000000 384, 383 - 1 4.04e-003 3.90e-003 2.26e-004 3.27e-005 1000000 512, 511 + 1 4.04e-003 3.91e-003 2.07e-004 2.56e-005 1000000 512, 511 - 1 4.05e-003 3.91e-003 2.01e-004 1.57e-005

Note: Q-the number of input blocks; L, H-input data range [-L, H]; S-the sign of input data. The near-dc inversion test result is 0.

Mobile CIF 300 frames, QP=1Encoder's IDCT = 64-bit floating-point IDCT

54

54.5

55

55.5

56

56.5

57

57.5

58

0 50 100 150 200 250 300

Frame number

PSN

R (Y

)

floating_ed

(13,10)e

(13,10)d

(13,10)ed

Improved(13,10)e

Improved(13,10)d

Improved(13,10)ed

Figure 4. MPEG-2 test results, (13,10) vs Improved (13,10)


Bus CIF 150 frames, QP=1, MPEG2 Quantization1/4 pel enabled

Encoder's IDCT = 64-bit floating-point IDCT

44

44.5

45

45.5

46

46.5

47

47.5

0 20 40 60 80 100 120 140

Frame number

PSN

R (Y

)

floating_ed

(13,10)e

(13,10)d

(13,10)ed

Improved(13,10)e

Improved(13,10)d

Improved(13,10)ed

Figure 5. MPEG-4 test results, (13,10) vs Improved (13,10)

50

51

52

53

54

55

56

57

58

0 50 100 150 200 250 300Frame number

PSN

R (Y

)

Float.Point IDCT

(13,9,20)

Improved (13,10)

MPEG-2 TM5

Foreman CIF 300 frames, QP=1 Encoder's IDCT = 64-bit floating-point IDCT

Figure 6. MPEG-2 test results


Foreman CIF 300 frames, QP=1, MPEG-2 Quantization1/4 pel disabled

Encoder's IDCT = 64-bit floating-point IDCT

47.0

47.5

48.0

48.5

49.0

49.5

0 50 100 150 200 250 300Frame number

PSN

R (Y

)

Float.point IDCT

(13,9,20)

Improved (13,10)

XVID IDCT

Figure 7. MPEG-4 test results

4. COMPLEXITY AND ACCURACY ANALYSIS

Different methods with different accuracy and complexity tradeoffs can be easily obtained by simply adjusting UP_SCALE with all multiplier-less implementations of the multipliers in the butterfly structures unchanged. This allows designers to choose one implementation that best meets the requirements of applications. For example, high accuracy transforms (with more operations) can be used for medical image compression and low accuracy transforms (with fewer operations) can be used for printed images.

In Table 3 we summarize the results of operation counts with SCALE equal to 11, 13, 14 and 16. Note that SCALE=11 is the minimum integer value to pass all the accuracy tests, and SCALE=12 and SCALE=15 are not as effec-tive as others in terms of operation count [7]. In Figure 8-11 we plot PMSE, OMSE, PME, OME (the worst case absolute error regardless of input range of random test or number of samples in random test in ISO/IEC 23002-1) versus UP_SCALE for different SCALEs respectively (the implementations in [4] were used). From these results we can see that, generally speaking, both parameters, SCALE and UP_SCALE affect the accuracy and higher accuracy needs larger, or at least not smaller SCALE or UP_SCALE. Also, the parameter SCALE determines the best level of accuracy that can be achieved (while large UP_SCALE can improve the accuracy when range of input data is small). Moreover, for one certain SCALE, there is a best corresponding UP_SCALE in terms of accuracy and complexity tradeoff, and from Figure 8-11, we can see that (13,10) is one of the most efficient methods among others. An interesting observation is that in-creasing SCALE can significantly reduce OMSE, and indeed very high accuracy can be achieved with large SCALE as shown in [4].


Table 3. Operation Counts of Different SCALEs

SCALE 1-D (not including rounding) 2-D (including rounding)

11 69 additions + 44 shifts 1105 additions + 832 shifts




0.00E+00

1.00E-02

2.00E-02

3.00E-02

4.00E-02

5.00E-02

6.00E-02

7.00E-02

8.00E-02

9.00E-02

0 2 4 6 8 10 12 14 16 18

UP_SCALE

PMSE

(≤0.

06)

SCALE=11

SCALE=13

SCALE=14

SCALE=16

Figure 8. PMSE vs UP_SCALE

0.00E+00

2.00E-03

4.00E-03

6.00E-03

8.00E-03

1.00E-02

1.20E-02

1.40E-02

1.60E-02

0 2 4 6 8 10 12 14 16 18

UP_SCALE

OM

SE (≤

0.02

)

SCALE=11

SCALE=13

SCALE=14

SCALE=16

Figure 9. OMSE vs UP_SCALE

Proc. of SPIE Vol. 6696 669615-10

0.00E+00

1.00E-02

2.00E-02

3.00E-02

4.00E-02

5.00E-02

6.00E-02

7.00E-02

8.00E-02

9.00E-02

0 2 4 6 8 10 12 14 16 18

UP_SCALE

PME

(≤0.

015)

SCALE=11

SCALE=13

SCALE=14

SCALE=16

Figure 10. PME vs UP_SCALE

0.00E+00

2.00E-04

4.00E-04

6.00E-04

8.00E-04

1.00E-03

1.20E-03

1.40E-03

0 2 4 6 8 10 12 14 16 18

UP_SCALE

OM

E (≤

0.00

15)

SCALE=11

SCALE=13

SCALE=14

SCALE=16

Figure 11. OME vs UP_SCALE

5. CONCLUSION

In this paper, a straightforward multiplier-less approximation of the forward and inverse DCT with low complexity and high accuracy is proposed. The implementation, design methodology, complexity and performance tradeoffs are dis-cussed. Particular, the proposed IDCT implementations, in spite of simplicity, comply with and can reach far beyond the MPEG IDCT accuracy specification ISO/IEC 23002-1 when suitable parameters are selected. Among others, (13,10) was identified as one of the most efficient methods. It also reduces drift favorably compared to other existing IDCT imple-mentations like the ones used in MPEG-2 TM5, XVID etc with improvement of up to several dB in PSNR.

Proc. of SPIE Vol. 6696 669615-11

REFERENCES

[1] J. L. Mitchell, W. B. Pennebaker, D. LeGall, and C. Fogg, MPEG Video Compression Standard, Chapman & Hall: New York © 1997.

[2] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold: New York © 1993.

[3] ISO/IEC JTC1/SC29/WG11 N7815 [23002-1 FDIS] Information technology – MPEG video technologies – Part 1: Accuracy requirements for implementation of integer-output 8×8 inverse discrete cosine transform.

[4] C. -X. Zhang, L. Yu, “Low Complexity and High Fidelity Fixed-Point Multiplier-less DCT/IDCT Implementation Scheme”, MPEG Doc. M12936, Bangkok, Thailand, Jan 2006.

[5] O. Gustafsson, A. G. Dempster and L. Wanhammer, “Extended results for minimum-adder constant integer multi-pliers,” Proc. IEEE International Symposium on Circuits and Systems. (ISCAS), vol. 1, pp. 26-29, May. 2002.

[6] C. -X. Zhang, J. Wang, L. Yu, “Extended Results for Fixed-Point 8×8 DCT/IDCT Design and Implementation”, MPEG Doc. M12935, Bangkok, Thailand, Jan 2006.

[7] C. –X. Zhang, J. Wang, L. Yu, “Systematic Approach of Fixed Point 8×8 IDCT and DCT Design and Implementa-tion”, Picture Coding Symposium (PCS), April 2006.

[8] C. Loeffler, A. Ligtenberg, C.S. Moschytz, “Practical Fast 1D DCT Algorithm with Eleven Multiplications”, Proc. ICASSP pp. 988-991, 1989.

[9] Y. A. Reznik, “Considerations for Choosing Precision of MPEG Fixed Point 8x8 IDCT Standard”, MPEG Doc. M13005, Bangkok, Thailand, Jan 2006.

[10] ISO/IEC JTC1/SC29/WG11 N8981 [23002-1 FPDAM1] Information technology – MPEG video technologies – Part 1: Accuracy requirements for implementation of integer-output 8×8 inverse discrete cosine transform, AMENDMENT 1: Software for integer IDCT accuracy testing.

[11] Y. A. Reznik, A. T. Hinds, C. Zhang, L. Yu, Z. Ni, “Response to CE on Convergence of Scaled and Non-Scaled IDCT Architectures”, MPEG Doc. M13650, Klagenfurt, Austria, July 2006.

[12] Y. Voronenko and M. Puschel, “Multiplierless Multiple Constant Multiplication”, ACM Transactions on Algorithms vol 3, issue 2, May 2007.

[13] MPEG-2 TM5 source code: ftp://ftp.mpegtv.com/pub/mpeg/mssg/mpeg2vidcodec_v12.tar.gz [14] XVID open source implementation of MPEG-4 ASP: http://downloads.xvid.org/downloads/xvidcore-1.1.0.tar.gz [15] A. T. Hinds, Y. A. Reznik, P. Sagetong, “On IDCT Exactness, Precision, and Drift Problem”, MPEG Doc. M13657,

Klagenfurt, Austria, July 2006. [16] Z. Ni, C. Zhang, L. Yu, “Updated MPEG-2 Testbed and Drift Test Results”, MPEG Doc. M13599, Klagenfurt, Aus-

tria, July 2006. [17] A. T. Hinds and J. L. Mitchell, “A Fast and Accurate Inverse Discrete Cosine Transform”, Proceedings of the IEEE

Workshop on Signal Processing Systems Design and Implementation, pp. 87-92, November 2005.

Proc. of SPIE Vol. 6696 669615-12

Publication 4 Cixun Zhang, Lu Yu, Jian Lou, Wai-Kuen Cham, Jie Dong,

The Technique of Pre-scaled Integer Transform: Concept, Design and Applications

IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 18, Issue. 1, Jan 2008, pp.

84-97.

84 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 1, JANUARY 2008

The Technique of Prescaled Integer Transform:Concept, Design and Applications

Cixun Zhang, Lu Yu, Member, IEEE, Jian Lou, Student Member, IEEE, Wai-Kuen Cham, Senior Member, IEEE,and Jie Dong

Abstract—Integer cosine transform (ICT) is adopted byH.264/AVC for its bit-exact implementation and significantcomplexity reduction compared to the discrete cosine transform(DCT) with an impact in peak sigal-to-noise ratio (PSNR) of lessthan 0.02 dB. In this paper, a new technique, named prescaled in-teger transform (PIT), is proposed. With PIT, while all the merits ofICT are kept, the implementation complexity of decoder is furtherreduced compared to corresponding conventional ICT, which isespecially important and beneficial for implementation on low-endprocessors. Since not all PIT kernels are good in respect of coding ef-ficiency, design rules that lead to good PIT kernels are considered inthis paper. Different types of PIT and their target applications areexamined. Both fixed block-size transform and adaptive block-sizetransform (ABT) schemes of PIT are also studied. Experimentalresults show that no penalty in performance is observed with PITwhen the PIT kernels employed are derived from the design rules.Up to 0.2 dB of improvement in PSNR for all intra frame codingcompared to H.264/AVC can be achieved and the subjective qualityis also slightly improved when PIT scheme is carefully designed.Using the same concept, a variation of PIT, Post-scaled IntegerTransform, can also be potentially designed to simplify the encoderin some special applications. PIT has been adopted in audio videocoding standard (AVS), Chinese National Coding standard.

Index Terms—Adaptive block-size transform (ABT), audio videocoding standard (AVS), complexity reduction, discrete cosine trans-form (DCT), H.264/AVC, integer cosine transform (ICT), prescaledinteger transform (PIT), standard, transform, video coding.

I. INTRODUCTION

I NTEGER cosine transform (ICT) was first introduced byW. K. Cham in 1989 [1] and is further developed in recent

years. It has been proved that some ICTs have almost the samecompression efficiency as the Discrete Cosine Transform (DCT)but much simpler implementation because only additions andshifts operations are needed [2]. Moreover, ICT can avoid inverse

Manuscript received September 6, 2005; revised February 14, 2007. Thiswork was supported by Natural Science Foundation of China under Grant60333020 and Grant 90207005. This paper was recommended by AssociateEditor I. Ahmad.

C. Zhang was with Institute of Information and Communication Engineering,Zhejiang University, Hangzhou 310027, China. He is now with the Instituteof Signal Processing, Tampere University of Technology, Tampere FIN-33101,Finland.

L. Yu is with the Institute of Information and Communication Engineering,Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]).

J. Lou was with the Institute of Information and Communication Engineering,Zhejiang University, Hangzhou 310027, China. He is now with the Departmentof Electrical Engineering, University of Washington, Seattle, WA 98195 USA.

W.-K. Cham is with the Department of Electronic Engineering, The ChineseUniversity of Hong Kong, Shatin, Hong Kong.

J. Dong was with the Institute of Information and Communication Engi-neering, Zhejiang University, Hangzhou 310027, China. She is now with theDepartment of Electronic Engineering, The Chinese University of Hong Kong,Shatin, Hong Kong.

Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2007.913749

transform mismatch problems of the DCT. Due to these advan-tages, the latest international video coding standard H.264/AVCadopted order-4 and order-8 ICT transforms [3], [4].

In this paper, a technique called prescaled integer transform(PIT) is proposed. PIT can further reduce the implementationcomplexity of corresponding ICT while no penalty in perfor-mance is observed. The paper is organized as follows. Funda-mentals of the DCT and ICT are first reviewed in Section II.Then the concept of PIT is introduced and examined in detail inSection III. The proposed PIT scheme and its benefits comparedto conventional ICT scheme such as that in H.264/AVC are elab-orated. In Section IV, the design rules lead to good PIT kernelsin respect of performance are considered. We also find that dif-ferent types of PITs have different characteristics and are suit-able for different applications. Experimental results and analysisof fixed block-size transform (FBT) scheme of PIT are given inSection V. Besides, in Section VI, adaptive block-size transform(ABT) schemes of PIT are also studied and corresponding ex-perimental results and analysis are presented. Section VII con-cludes the paper.

II. FUNDAMENTALS OF THE DISCRETE COSINE TRANSFORM

AND INTEGER COSINE TRANSFORM

The forward and inverse DCT [5] are defined as

(1)

(2)

where and stand for the original input matrix and theDCT coefficient matrix while serves as both forward andinverse DCT kernels.

ICT originates from the DCT and was derived using the prin-ciple of dyadic symmetry [1]. The forward and inverse ICT aredefined as

(3)

(4)

where is the ICT coefficient matrix and is the ICTkernel. The ICT kernel includes two parts, and [1]. Forthe same ICT, the choice of and is not unique and can berepresented as

(5)

where the , used in forward transform are denoted as ,and the , used in inverse transform are denoted as ,

, respectively. The properties of , , , and their re-lationship with each other are described in the following para-graphs and will be frequently used in this paper.

In this paper, we will concentrate on the order-4 and order-8transforms since they are the most useful ones in practical appli-cations and the discussions can be easily extended to transforms

1051-8215/$25.00 © 2008 IEEE

ZHANG et al.: TECHNIQUE OF PRESCALED INTEGER TRANSFORM 85

other than order-4 and order-8. Traditionally, order-8 transformshave been used for image and video coding. The order-8 size hasthe advantages of being large enough to capture the redundancywhile being small enough to provide good adaptation to the in-homogeneity inside an image. However, small size transformsuch as order-4 transform has the advantages of reducing ringingartifacts at edges and discontinuities [2]–[4]. Because of thesereasons, H.264/AVC uses both order-4 and order-8 ICTs [4].

For an order-8 ICT kernel specified as [a,b,c,d;e,f;g] inthis paper, and in (5) are defined asfollows where subscript 8 is used to indicate the transform sizeis 8 8:

(6)

and is an 8 8 diagonal matrix with its th diagonal ele-ments such that where is theth row vector of . The values of , , , , , , and in an

ICT should be integers. However, they are sometimes expressedas rational numbers by suitably adjusting . For example, inH.264/AVC, the choice of , , , , , and is 12/8, 10/8, 6/8,3/8, 1, 4/8, and 1, which can be regarded the same as 12, 10, 6,3, 8, 4, and 8, respectively.

Similarly, for an order-4 ICT kernel specified as [a,b;c]in this paper, and in (5) are defined asfollows where subscript 4 is used to indicate the transform sizeis 4 4:

(7)

where , and are integers and is a 4 4 diagonalmatrix with its th diagonal elements such that

where is the th row vector of . InH.264/AVC, the values of , , and in the inverse transformare expressed as 1, 1/2, 1, which can be regarded the same aschoosing , and as 2, 1, 1, respectively.

In the rest of this paper, for convenience, we use the notation, which is a column vector, to represent the

main diagonal of an matrix , i.e.,

(8)Two operators and used in this paper are defined as follows.When , , are all matrices, means

(9)

and similarly, means

(10)

III. CONCEPT OF PRESCALED INTEGER TRANSFORM

A. Conventional Integer Cosine Transform Scheme

Unlike the popular order-8 DCT used in previous standards,such as MPEG1/2/4, H.261 and H.263, H.264/AVC employs

Fig. 1. Block diagram of conventional ICT scheme in H.264.

order-4 and order-8 ICTs, thus avoiding inverse transform mis-match problems. In H.264/AVC, on encoder side, using (5),the forward transform and quantization process can be repre-sented as (11). For convenience, we do not take the practicalrounding operation in the quantization process into account andhere is a real number matrix rather than an integer matrix

(11)

where is the input matrix; is the quantized transformedcoefficient matrix; is the quantization matrix on encoderside and the dequantization matrix on decoder side, which is de-pendent on quantization parameter (QP), when uniform quan-tization is used, can be simply replaced by QP;

is the forward scaling matrix; andis the quantization-scaling matrix

On decoder side, the corresponding inverse transform and de-quantization process is

(12)where is the reconstructed matrix—theoretically is equalto ; is the inverse scalingmatrix; and is the dequantization-scalingmatrix.

Equations (11) and (12) above represent the merging of theforward/inverse scaling and the quantization/dequantization op-erations into one step so as to reduce the computational com-plexity as in H.264/AVC. Fig. 1 is the block diagram for theconventional ICT scheme.

Note that theoretically, in (11) and in (12)normally contain non-integer elements. However, in actual videocoding standards, these are usually implemented using mul-tiplications and shifts to reduce the complexity. For example,in H.264/AVC, in (11) and in (12) contain onlyinteger elements and right-shift is applied to the results of (11)and (12).

B. Proposed Prescaled Integer Transform Scheme

Let us take order-4 transform as an example first. In H.264/AVC, the QP period is 6 and a dequantization-scaling matrix

is used for 4 4 ICT [2], [3], [32].

(13)


For the th transformed coefficient in a block, the searchrule for corresponding dequantization-scaling element inis defined by

for equal tofor equal toelse

(14)

where “%” is modulus operator and “x%y” means remainder ofx divided by y, defined only for integers x and y with and

.The main problem here is that for every transformed coeffi-

cient, we need to conduct a 3-D operation, which uses QP andthe coordinates of the coefficients in the block, to search forthe corresponding elements in . Some computations as in(14) are also needed. In order to reduce the computational com-plexity, and at the same time facilitate parallel processing andtake advantage of the efficient multiply/accumulate architectureof many processors, we can fully expand the to ' , asin (15) and correspondingly the search rule is simplified to (16).However, the storage will increase from bytes to

bytes. This memory size will be much largerwhen we also take 8 8 ICT into account. In that case, a totalmemory of bytes should be allo-cated. Fortunately, in fact, the required memory size and com-putational complexity can be reduced at the same time if thetechnique of PIT is used. The concept of PIT is first proposedby the authors in [6] and will be elaborated below

(15)

(16)

Substituting (11) into (12), we get

(17)

Let

(18)

which will be called combined-scaling-quantization matrix inthe following. Then the forward transform and quantizationprocess can be represented as

(19)

In order to derive the PIT kernel, we have

(20)

and, therefore, the forward PIT kernel is

(21)

Correspondingly, the inverse PIT kernel is

(22)

From (17)–(19) above, the inverse transform and dequantizationprocess can be represented as

(23)

From (23), we can also derive the theoretical inverse PIT kernel,which is the same as in (22).

Based on the derivation above, theoretically we can define theforward and inverse PIT as

(24)

(25)

However, it should be noted that similar to the case of ICTshown in (11) and (12), we always implement PIT using (19)and (23) instead of using (24) and (25) directly.

In the rest of this paper, for order-4 and order-8 ICT ker-nels [a,b;c] and [a,b,c,d;e,f;g], the correspondingorder-4 and order-8 PIT kernels are denoted as [a,b;c] and

[a,b,c,d;e,f;g], respectively.The main idea of PIT is that inverse scaling is moved to en-

coder side and combined with forward scaling and quantizationas one single process. The fact that no scaling is needed on de-coder side anymore distinguishes PIT from conventional ICT,and this is exactly the reason why PIT can reduce the requiredmemory size and computational complexity on decoder side at


Fig. 2. Block diagram of proposed PIT scheme.

the same time. The block diagram for the PIT scheme is shownin Fig. 2.

When PIT is used, the dequantization-scaling matrix (or moreaccurately, the dequantization matrix, since no scaling is in-cluded any more) is as follows:

(26)

And for the th transformed coefficient in a block, thesearch rule for corresponding element in is further sim-plified to

(27)

Comparing (26) and (27) to (13) and (14), and (15) and (16),respectively, we can clearly see that the decoding complexity isreduced with PIT. First, if the dequantization-scaling matrix isfully expanded, when order-4 ICT is used, a total memory of

bytes can be replaced by a memory of only6 bytes. The memory required when using PIT is much lowerthan that required using conventional ICT. Otherwise, if the de-quantization-scaling matrix is not expanded, a total memory of

bytes can be reduced to 6 bytes. Though this savingis trivial considering only order-4 ICT, keeping in mind that inthis case every non-zero coefficient needs a lookup operationand some extra operations when using ICT, the computationalcomplexity of PIT will be lower because 3-D lookup operation isreplaced by 1-D lookup operation which only uses QP, and at thesame time no other extra operations as in (14) is needed. In ei-ther case, on decoder side, required memory size can be reducedand 3-D lookup operation can always be replaced by 1-D lookupoperation thus pipeline and parallel processing is facilitated andextra computation or storage memory can be saved. While onencoder side, comparing (11) to (19), we can find that sinceforward scaling, inverse scaling and quantization are combinedas one single process and the original quantization-scaling ma-trix is replaced by combined-scaling-quantization ma-trix with the same size, both the computational andstorage complexity remain unchanged.

We only consider order-4 transform in the discussion above.When both order-4 and order-8 transforms are used, a totalmemory of bytes can be savedand a memory of only 6 bytes is needed instead, assuming thatthe order-4 and order-8 PITs employed are compatible, whichwill be discussed in more detail in Section VI.

Further, besides H.264/AVC, in some other video codingstandards like AVS part 2 [12], which is Chinese NationalCoding Standard for digital TV Broadcasting and HD-DVD,the scheme of periodic QP is not used in order to reducecomputational complexity. In this case, PIT can provide evenlarger saving of memory. Only a memory of 64 bytes is neededinstead of a memory of bytes because QPrange 0 63 and order-8 PIT is employed in AVS part 2. Inorder to save memory, the scaling and quantization/dequanti-zation are separated in AVS part 2. In this case, a memory of

bytes for inverse scaling matrix can be saved and atthe same time the computational complexity is reduced whenPIT is used because no scaling is needed on decoder side anymore.

C. Post-Scaled Integer Transform

A variation of PIT might be post-scaled integer transform,in which the scaling of the forward transform can be movedto the decoder side. This can be potentially used to simplifythe encoder in some kinds of applications. The issues of post-scaled integer transform are similar to those of PIT. In this paper,we consider only PIT and similar analysis approaches can becarried out for post-scaled integer transform.

D. Fixed-Point Error Analysis of Prescaled Integer Transform

In this part, we first summarize our previous work on fixed-point error analysis of ICT and PIT, and then give some com-parison between them. In [7], an analysis of the statisticalerror behavior due to rounding of the transformed coefficientsfor the DCT and ICT is presented. The analysis considers asystem which forwardly transforms pixels into coefficientsand then inversely transforms back into reconstructed pixels.The transform kernel is assumed to be implemented preciselybut the transformed coefficients are represented using finitenumber of bits and so errors are generated. Both theoretical andexperimental results show that [10,9,6,2;9,3;8] generatesa smaller mean square error between the original and recon-structed pixels than the DCT when the same number of bitsis used for the representation of the transformed coefficients.When the number of bits increases, such mean square errordecreases. It approaches zero faster in the case of ICT than theDCT.

In [8], similar theoretical analysis and experiments are re-ported for [10,9,6,2;10,4;8] which is also found to gen-erate a smaller mean square error (MSE) between the originaland reconstructed pixels than the DCT when the same numberof bits is used for the representation of the coefficients. Fig. 3shows the experiment results for 1-D and 2-D transform sys-tems. The image data are integers generated randomly with uni-form distribution between [ 128,127]. Also, it is assumed thatthe 2-D transform introduces rounding error due to the use offinite number of bits for the 2-D transform coefficients at theoutput only and no rounding error is introduced inside the 2-Dtransform.

The results in [7] and [8] were obtained under the assump-tion that and are implemented precisely without error.Under this condition, we would expect that PIT and the corre-sponding ICT have the same performance. However, it is inter-esting to note that the implementation of cannot be exactbecause it contains irrational numbers but when PIT is usedthe implementation of can be exact especially when

which is usually true because it contains rational num-bers. Furthermore, if the PIT kernel is selected carefully, e.g.,

[2,1;1], [3,1;2], when , can berepresented using few bits. In this case, lossless coding of PITcoefficients can be achieved easily. For the case of ICT, how-ever, lossless coding of transformed coefficients is difficult. Forexample, in H.264/AVC, only lossless coding of residue is sup-ported [4].


Fig. 3. MSE of reconstructed pixels versus coefficient bit length in 1-D image vector and 2-D image data block [8]. Left: MSE of reconstructed pixels versuscoefficient bit length in 1-D image vector. Right: MSE of reconstructed pixels versus coefficient bit length in 2-D image data block.

IV. DESIGN OF PRESCALED INTEGER TRANSFORM

Just like ICT, not every PIT kernels performs well. It is practi-cally important to find design rules that lead to good PIT kernels.

For convenience of discussion, (21) and (22) are rewrittenhere as (28) and (29)

(28)

(29)

We can see that is constructed by and whileis constructed by and . To choose

good PIT kernels, we should take these two factors into ac-count. It turns out that considering compression ability, weshould choose a good and correspondingly whileconsidering bit representation of transformed coefficients weshould choose a good and correspondingly . In thefollowing subsections, these two factors are examined in detailand based on the analysis, design rules are obtained.

A. Consideration of Compression Ability

In (28) and (29), and can be regarded as scaling fac-tors applied to basis vectors of corresponding forward and in-verse ICT kernels to construct corresponding PIT kernels. Since

is diagonal, we can see that the normalized basis vectorsof forward and inverse PIT kernels are the same as the corre-sponding normalized basis vectors of forward and inverse ICTkernels, respectively. This observation suggests that PIT ker-nels and corresponding ICT kernels have similar compressionability. In the following, transform coding gain [9] and DCTfrequency distortion [10] are studied.

1) Transform Coding Gain: Strictly speaking, PIT is not anorthogonal transform, thus we should use biorthogonal trans-form coding gain [9] here. Biorthogonal transform coding gainmeasures the energy compacting ability of the transform in thetransform coding system. Assume that the input vector istransformed into coefficient vector by an order-N forwardtransform , i.e.,

(30)

then the covariance matrix of the vector in the transform do-main is given as

(31)

where is the expectation operator.If the input signal is modeled by a zero mean, first-order au-

toregressive (AR(1)) process which is characterized by the cor-relation coefficient , then the covariance matrix of the inputsampled vector x has a form of Toeplitz matrix, i.e., isequal to

(32)In this situation, the biorthogonal transform coding gain of a

given transform is a function of the correlation coefficient andcan be calculated analytically as follows [9]:

(33)

where is the norm of the th inverse transform basis vector.Note that (33) can also be used for orthogonal transform suchas conventional ICT. Although and areoften used as a measure of the compression ability of the giventransform kernel, in this paper, we define

(34)

as the measure of the compression ability of a given transform,in order to take into account different correlation coefficients ofthe input data. This is because in advanced video coding stan-dards such as H.264/AVC and AVS, the input data fed into thetransform is the residue data after intra or inter prediction, andthe correlation will be less than 0.9 in most cases [2]. Gener-ally, larger indicates better compression ability of thegiven transform kernel.


2) DCT Frequency Distortion: DCT frequency distortion[10] measures the frequency distortion of a given transformwith respect to the DCT. For order- transform with trans-form kernel , the first-order frequency distortion and thesecond-order frequency distortion are defined as

(35)

and

(36)

respectively, where is calculated as

(37)

3) Comparison of Prescaled Integer Transform and IntegerCosine Transform: In this subsection, we will discuss aboutthe relationship between PIT kernel and its corresponding ICTkernel in terms of transform coding gain and DCT frequencydistortion.

For transformcodinggain, from(33),wecanfind that the trans-form coding gain of one certain ICT kernel and its correspondingPIT kernel are given by (38) and (39) below, respectively

(38)

(39)

And we have

(40)

so

(41)

Equation (41) shows that PIT kernels are the same as corre-sponding ICT kernels in terms of transform coding gain.

For DCT frequency distortion, from (28) and (37), we have

(42)

Since is a diagonal matrix, it is easy to see

(43)

(44)

Equations (43) and (44) show that PIT kernels have the sameDCT frequency distortion with corresponding ICT kernels. Sobasically, we can and we should use good ICT kernels to obtaingood PIT kernels.

B. Consideration of Weighting Factors of TransformedCoefficients

Besides compression ability, because the forward PIT kernelis not normalized, it is important to analyze the effect ofwhich is regarded as scaling factors applied to basis vectors ofcorresponding forward ICT kernels.

From (24) and (21), we can get

(45)

It shows that the PIT coefficient matrix can be regarded as aweighted ICT coefficient matrix where is the weighting ma-trix. While all coefficients of an orthogonal ICT have the samemaximum and minimum values, those of PIT do not becausethe weighting matrix changes the relative magnitudes of thetransformed coefficients. We define the weighting factor differ-ence (WFD) of a PIT to represent the difference of weightingfactors applied to different transformed coefficients

(46)

The deviation of the WFD of a PIT away from unity maycause a problem. It may result in truncation of some transformedcoefficients that could be retained if where is scalarand is a matrix with all its elements equal to 1. However,unfortunately the condition can not always be truewhen PIT is used. In order not to change the transformed coef-ficients too much so as to retain the compression ability of cor-responding ICT kernels, elements of the weighting matrix


should have values close to one another. Another advantage ofthis consideration is that the change from ICT to correspondingPIT will be small so that other parts of a codec need not be re-designed: the scan order of the transformed coefficients need notbe changed and the entropy coding table designed for the ICTcan also be reused for the PIT without much change. Besides, itis also worth noting here that compensation for the scaling effectof can also be potentially achieved by using customized quan-tization matrix, but additional complexity would be introduced.

Generally, according to our experience, the following condi-tion should be met

(47)

That is, the bit-width difference between largest element andsmallest elements in should be no larger than 1/2. One couldexpect that more sophisticated measures using all the elementsin rather than only the maximum and minimum in (46) willgive a more precise evaluation, however, (47) already worksrather well, which is justified by the experiment results inSection V. Also, since many PITs with WFD less than arealready obtained and demonstrated to show good performancetherein, there may be no need to search for PITs with WFDlarger than because the smaller the WFD is, the closer thePIT is to the corresponding ICT.

C. Design Rules

Based on the discussion above, we can conclude that we canobtain a good PIT kernel following the steps below.

1) Obtain a good ICT kernel. To derive good ICT kernels,computer search can be systematically carried out basedon transform coding gain and DCT frequency distortions.

2) Choose a good ICT kernel whose WFD is not larger than.

D. Types of Prescaled Integer Transform

Besides the design rules derived above, we have found thatdifferent types of PITs have different characteristics and are suit-able for different applications, which is also an important issuein practical.

There are generally two types of PITs, based on frequencyscale factor (FSF). For order- PIT, FSF measures the scalingeffect of on the transformed coefficients and is defined as

(48)

1) Type-I PITs, whose FSF is less than 1, have a characteristicthat more high-frequency components may be quantizedout compared to corresponding ICT. Type-I PITs may bemore suitable for video streaming and conferencing appli-cations where low resolution (such as QCIF and CIF) videosequences are usually used and coded at relatively low bitrates because Type-I PITs lead to bit savings without de-grading the subjective quality significantly in this situation.

2) Type-II PITs, whose FSF is larger than or equal to 1, havea characteristic that more high-frequency components maybe retained after quantization compared to correspondingICT. Type-II PITs may be more suitable for entertainment

TABLE ITEST CONDITIONS AND ENCODING PARAMETERS FOR ORDER-4 ICT AND PIT

TABLE IITEST CONDITIONS AND ENCODING PARAMETERS FOR ORDER-8 ICT AND PIT

quality and other professional applications where higherresolution (such as HD) video sequences are often used andcoded at relatively high bit rates because chance is higherthat more detailed texture will be preserved.

V. EXPERIMENTAL RESULTS OF FIXED BLOCK-SIZE

TRANSFORM SCHEME WITH PRESCALED

INTEGER TRANSFORM

In order to test performance of different PIT kernels, ex-tensive experiments have been done based on JM 7.6 and JM9.3 [28].

The test conditions are listed in Tables I and II. The experi-ments on order-4 ICTs and PITs target video streaming and con-ferencing applications, so relatively large QPs and QCIF andCIF sequence are used. The experiments on order-8 ICTs andPITs target entertainment-quality and other professional appli-cations so relatively small QPs and HD sequences are used. Inall the experiments, context-based adaptive binary arithmeticcoding (CABAC) [3], [4], [29] is used and number of referenceframes are set to two.

Figs. 4 and 5 give the plot of transform coding gain and trans-form coding gain difference compared with the DCT of differentorder-4 and order-8 ICTs and PITs where correlation coefficientis in the range [0,1], while Tables III and IV give the trans-form coding gain , , and DCTfrequency distortion of different order-4 and order-8 ICTs andPITs. All the order-4 and order-8 ICTs and PITs chosen in ourexperiments have comparable transform coding gains with thatof the DCT and little DCT frequency distortion, so they are ex-pected to have good compression performance which has beenproved by the experimental results given in Tables V–VIII. Theexperimental results are presented in the form of average peaksignal-to-noise ratio (PSNR) gains using the method proposedin [11]. Some rate-distortion curves are given in Figs. 6 and 7,respectively.

Among all the ICTs and PITs in our experiments,[1,1/2;1] ( [2,1;1]) and

are used in H.264/AVC, [10,9,6,2;9,3;8] was proposedin [1] because of its high decorrelation ability and relativelylow complexity. [10,9,6,2;10,4;8] is adopted in AVSpart 2 [12] because its implementation complexity is similar to


Fig. 4. Left: Transform coding gain of order-4 DCT and different order-4 ICTs Right: Transform coding gain difference of different order-4 ICTs compared withorder-4 DCT (Traform coding gain difference larger than 0 means that the ICT has better energy compacting ability than the DCT, and vice versa.) Note that thetransform coding gain of these ICTs and the DCT are nearly the same and the curves in the left figure overlap.

Fig. 5. Left: Transform coding gain of order-8 DCT and different order-8 ICTs Right: Transform coding gain difference of different order-8 ICTs compared withorder-8 DCT (Transform coding gain difference larger than 0 means that the ICT has better energy compacting ability than the DCT, and vice versa.) Note that thetransform coding gain of these ICTs and the DCT are nearly the same and the curves in the left figure overlap.

TABLE IIITRANSFORM CODING GAIN AND DCT FREQUENCY DISTORTION OF DIFFERENT ORDER-4 ICTS AND PITS

TABLE IVTRANSFORM CODING GAIN AND DCT FREQUENCY DISTORTION OF DIFFERENT ORDER-8 ICTS AND PITS

[10,9,6,2;9,3;8] but has a slightly higher transform codinggain, which is shown in Fig. 5 and Table IV. [3,1;2],

[5,2;4], [9,4;7] were proposed in AVS part 7 (also

known as AVS-M, where M represents mobility.), which targetsapplications of video communications on mobile devices, with

[3,1;2] finally adopted for its simplicity and satisfactory


TABLE VCOMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 TRANSFORM MATRIXCES USING ICT COMPARED TO H.264/AVC

TABLE VICOMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 TRANSFORM MATRIXCES USING PIT COMPARED TO H.264/AVC

TABLE VIICOMPRESSION PERFORMANCE OF DIFFERENT ORDER-8 TRANSFORM MATRIXCES USING ICT COMPARED TO H.264/AVC

TABLE VIIICOMPRESSION PERFORMANCE OF DIFFERENT ORDER-8 TRANSFORM MATRIXCES USING PIT COMPARED TO H.264/AVC

performance [12]. [17,7;13] is used in original H.264design (also known as H.26L) [2] and has an advantage ofhaving same basis vector norms, In this case, [17,7;13]is in fact the same as [17,7;13]. Similar technique as PIT

with [22,10;17] and [16,15,9,4;16,6;12] are usedin WMV9/VC-1 [30], [31]. However, the elements in compo-nent in these two transform kernels are relatively larger inorder to meet the conditions mentioned in [30] which are much


Fig. 6. Rate-distortion curves of different order-4 PITs for news sequence (CIF, 30 fps) and foreman sequence (QCIF,15 fps).

Fig. 7. Rate-distortion curves of different order-8 PITs for harbour sequence (HD, IBBP, 60 fps) and night sequence (HD, IPP, 30 fps).

stricter than (47) and thus complexity is increased and accu-racy is somewhat lost when implemented in 16 bit arithmetic.Moreover, [22,10;17] and [16,15,9,4;16,6;12]are not completely compatible. Here compatible means thatdifferent transforms can be implemented in only one transformunit rather than multiple: they can share the same forward/in-verse transform butterfly structure and quantization-scaling/dequantization-scaling matrix [21].

The following conclusions can be drawn based on the exper-imental results.

1) From Tables III and IV, we can see that [0,1] givesa very good estimation of the energy compacting abilityof a given ICT or PIT, which is justified by the experi-mental results in Tables V–VIII. Generally, and

may be also good measures. But from Figs. 4and 5 we can see that using a fixed correlation coefficientpoint may not be appropriate because transform codinggain difference varies for different correlation coefficients.DCT frequency distortion may also serve as an estimationof the compression performance of a given ICT or PIT, butit treats the distortion of each basis vectors equally, wherea weighting should be considered intuitively. Further, dis-tortion from the DCT might not necessarily be bad. Be-sides these two theoretical criterions, we should note thatlarger elements in component of the ICT or PIT kernelwill suffer additional loss in PSNR when implemented in16-bit arithmetic.

2) From Tables V–VIII, we can see that PIT performs gen-erally as well as the corresponding ICT if (47) is met.Generally, the smaller the WFD of the given PIT is, thesmaller the difference with the corresponding ICT in PSNRis, which is expected.

3) According to the experimental results in Tables VI andVIII, Type-I PITs lead to lower bit rate and lower PSNR,while Type-II PITs result in higher bit rate and higherPSNR at the same QP, compared to corresponding ICTs.This is due to the weighting of the frequency components.The larger the FSF of the given PIT is, the higher the bitrate and PSNR are compared to corresponding ICT.

4) From Table VI, we can find that [3,1;2] performs aswell as or even better than some other PITs though its trans-form coding gain is a little lower and its DCT frequencydistortion is a little larger. This is mainly because no rightshift is needed when it is implemented in 16 bit arithmetic,thus precision is retained. [3,1;2] belongs to Type-IPITs and is suitable for low fidelity video coding. It isalso the simplest PIT kernel we can find that meets (47).Due to its low complexity and satisfactory performance,

[3,1;2] is adopted by AVS part 7.5) [10,9,6,2;10,4;8] has high transform coding gain

and approximates the DCT very well. It outperforms otherPITs as shown in Table VIII. [10,9,6,2;10,4;8] is aType-II PIT and is fit for high fidelity video coding. Due toits good performance and favorable complexity reduction,

[10,9,6,2;10,4;8] is adopted by AVS part 2.

VI. ADAPTIVE BLOCK-SIZE TRANSFORM SCHEME WITH

PRESCALED INTEGER TRANSFORM

The fixed block-size transform scheme of PIT has been ex-amined in detail in previous sections. In this section we studythe ABT scheme of PIT. ABT was once introduced in H.26Lbut was removed later because of its high implementation com-plexity. However, ABT can not only improve the coding effi-ciency significantly [13], but can also provide subjective bene-


TABLE IXCOMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 AND ORDER-8 TRANSFORM MATRIXCES USING PIT COMPARED TO H.264/AVC

fits, especially for HD movie sequences from the point of subtletexture preservation [14], such as keeping film details and grainnoise which are very crucial to subjective quality especially forfilm contents [15]. Due to this, ABT has been considered againand adopted in fidelity range extensions (FRExt) of H.264/AVCwith the major concern of complexity reduction [16]–[20].

Compatible ABT is one major consideration during theH.264/AVC FRExt standardization work and the order-8 ICTadopted in H.264/AVC is especially chosen to be compatiblewith the existing order-4 ICT [17]. While it is well knownthat [a,b,c,d;e,f;g] is compatible with [e,f;g], itis also easy to understand that the corresponding PIT kernel

[a,b,c,d;e,f;g] is also compatible with [e,f;g]because there is no need to change the butterfly structureswhen PIT is used and it is shown in [21] that [e,f;g]can also reuse the combined-scaling-quantization matrix of

[a,b,c,d;e,f;g].It has been mentioned that keeping film grain in film con-

tents is crucial to subjective quality especially for HD sequence.This has also been paid much attention to during the H.264/AVCFRExt standardization work [22]–[25]. As analyzed in [22], filmgrain preservation is always a challenge for encoders becausethe film grain is in nature temporally uncorrelated and thus largecompression gains brought by temporal prediction can not beexploited efficiently. As a result, most of the film grain remainsin the prediction error and appears as small transformed coeffi-cients at high frequencies in the DCT domain and thus is typ-ically quantized out with other noise for a wide range of QPvalues. Even at high bit rates, film grain can only be encodedand preserved at a high compression cost. In order to allow en-coding film grain at lower bit rates more efficiently, H.264/AVCuses a standardized film grain characteristics supplemental en-hanced information (SEI), which allows encoder to generate aparameterized model of film grain statistics instead of encodingthe exact film grain, and send it along with the video data to de-coder, to provide the information for film grain synthesis as apost process when decoding [4], [22], [23], [25], [32]. At lowerbit rates, the film grain characteristics SEI message has been

proved to be a good tool to improve picture quality for sequenceswith film grain. However, it was reported that SEI message is notaccepted by the movie industry where higher target bit rates areused mainly because it is very difficult and even impossible toreach transparent picture quality for any kind of input sequenceand have full control on the decoded sequence [26].

From the viewpoint of keeping details, Type-II PITs aremore suitable because film grain and other subtle texturethat get quantized out have a higher chance of survival.From the experimental results in Section V, it can be in-ferred that a compatible ABT scheme using [5,2;4] and

is promising because both of themhave high compression efficiency and are Type-II PITs. Exten-sive experiments have been done to evaluate the performanceof [5,2;4] and . The objectiveresults compared with H.264/AVC high profile and its ICTcounterpart ( [5,2;4] and .)are given in Table IX. The operational rate-distortion curvesare shown in Fig. 8 and the subjective quality comparison forstockholm sequence1 is presented in Fig. 9. The test conditionsare the same as those listed in Table II except that both order-4and order-8 transforms are used. For subjective quality com-parison, rate control is used and the sequence is encoded at6Mbit/s and IBBP structure is used.

Based on the experimental results in Table IX, the followingconclusions can be drawn.

1) Both ABT schemes, [5,2;4] and, [5,2;4] and, outperform the one in

H.264/AVC. Three GOP structures have been tested, i.e.,IPPP, IBBP, Intra frame only. We can see that the gain inPSNR from IPPP and IBBP is only less than 0.05 dB onaverage, which is mainly because of high efficiency of themotion compensation and it can be expected the gain willbe larger if the number of reference frames is reduced toone instead of two and smaller search range for motion

1[Online]. Available: ftp://ftp.ldv.e-technik.tu-muenchen.de/pub/test_se-quences


Fig. 8. Rate-distortion curves of H.264/AVC, ICT [5; 2; 4] + ICT [10; 9; 6; 2; 10;4; 8]=2, PIT [5; 2; 4]+PIT [10;9; 6; 2; 10;4; 8]=2 for harbour sequence(1) HD, IBBP, 60 fps; (2) HD, IPP, 30 fps; (3) HD, intra-frame only, 30 fps.

Fig. 9. Subjective quality for stockholm sequence (720p, IBBP, 30 fps, 6 Mbps, 26th frame, local) top-left: original, top-right: PIT [5; 2; 4] +PIT [10;9; 6; 2; 10;4; 8]=2, 34.976 dB, bottom-left: ICT [5; 2; 4] + ICT [10;9; 6; 2; 10;4; 8]=2, 35.108 dB, bottom-right: H.264/AVC, 35.093 dB

estimation is used in our experiments. However, the gainin PSNR from Intra frame is not trival which can bealmost 0.2dB on average keeping in mind that H.264/AVCdoes not rely much on transform for decorrelation [27].Noting that in real entertainment-quality and otherprofessional applications such as DVD-Video systems andHDTV where HD sequences are generally used, frequentperiodic intra coded pictures are typical in order to enablefast random access, the ABT schemes, [5,2;4]and , [5,2;4] and

are good in terms of codingefficiency.

2) The subjective results in Fig. 9 show that the ABT schemeof [5,2;4] and is slightly

better than both its ICT counterpart and that of H.264/AVC.More details are preserved.

It should be noted that in our implementation of the ABTscheme of [5,2;4] and , on thedecoder side, a memory of only 6 bytes is allocated to store thematrix ' which is used for dequantization

(49)

and, therefore, only 1-D lookup operation is needed. Thus,good performance and complexity reduction are achieved atthe same time when the ABT scheme of [5,2;4] and

is used.


VII. CONCLUSION

In this paper, a new technique, named PIT, is proposed to fur-ther reduce the implementation complexity of conventional ICTscheme such as that adopted in H.264/AVC without any sacrificein performance. Since not all PIT kernels perform well, designrules that lead to good PIT kernels are considered in this paper.It is also found that different types of PIT have different char-acteristics and are suitable for different applications. PIT hasbeen adopted in AVS, Chinese National Coding standard with

[10,9,6,2;10,4;8] in AVS part 2 and [3,1;2] in AVSpart 7, respectively, due to its implementation complexity re-duction and good performance. Besides fixed block-size trans-form scheme, compatible ABT scheme of PIT are also studiedin this paper because of its higher coding efficiency and subjec-tive benefits. A compatible ABT scheme using [5,2;4] and

is proposed and it is shown that upto 0.2 dB in PSNR for all intra frame coding can be achievedcompared to the counterpart in H.264/AVC. Besides, subjectivequality is also slightly improved because more subtle texturecan be preserved. Using the same concept, a variation of PIT,post-scaled integer transform, can also be potentially used tosimplify the encoder in some kinds of applications.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers fortheir useful comments to improve this paper.

REFERENCES

[1] W. K. Cham, “Development of integer cosine transforms by the prin-ciple of dyadic symmetry,” Proc. IEE, I, vol. 136, no. 4, pp. 276–282,Aug. 1989.

[2] H. Malvar, A. Hallapuro, M. Karczwicz, and L. Kerofsky, “Low-complexity transform and quantization in H.264/AVC,” IEEE Trans.Circuits Syst. Video Technol., vol. 13, no. 7, pp. 598–603, Jul. 2003.

[3] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overviewof the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst.Video Technol, vol. 13, no. 7, pp. 560–576, Jul. 2003.

[4] G. J. Sullivan, P. N. Topiwala, and A. Luthra, “The H.264/AVC ad-vanced video coding standard: Overview and introduction to the fidelityrange extensions,” in Proc. SPIE, Appl. Dig. Image Process. XXVII,Aug. 2004, vol. 5558, pp. 454–474.

[5] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,”IEEE Trans. Comput., vol. C-23, no. 1, pp. 90–93, Jan. 1974.

[6] C.-X. Zhang, J. Lou, L. Yu, J. Dong, and W. K. Cham, “The techniqueof pre-scaled integer transform,” in Proc. IEEE ISCAS, May 2005, pp.316–319.

[7] W. K. Cham, “Integer sinusoidal transform,” Adv. Electron. ElectronPhys., vol. 88, pp. 1–61, 1994.

[8] W. K. Cham and C. K. Fong, “ICT fixed point error performanceanalysis,” in Proc. 2004 Int. Symp. Intell. Multimedia, Video SpeechProcess., Oct. 2004, pp. 294–297.

[9] J. Liang and T. D. Tran, “Fast multiplierless approximations of theDCT with the lifting scheme,” IEEE Trans. Signal Process., vol. 49,pp. 3032–3044, Dec. 2001.

[10] M. Wien and S. Sun, “ICT comparison for adaptive block transform,”Doc. VCEG-L12. Eibsee, Germany, Jan. 2001.

[11] G. Bjontegaard, “Calculation of average PSNR differences betweenRD-curves,” Doc. VCEG-M33. Austin, TX, Apr. 2001.

[12] L. Yu, F. Yi, J. Dong, and C. Zhang, “Overview of AVS-Video: Tools,performance and complexity,” in Proc. VCIP, Beijing, China, Jul. 2005,pp. 679–690.

[13] M. Wien, “Variable block size transforms for H.264/AVC,” IEEETrans. Circuits Syst. Video Technol, vol. 13, no. 7, pp. 604–613, Jul.2003.

[14] S. Gordon, “Adaptive block transform for film grain reproduction inhigh definition sequences,” Doc. JVT-H029. Geneva, Switzerland,May 2003.

[15] T. Wedi, Y. Kashiwagi, and T. Takahashi, “H.264/AVC for next gener-ation optical disc: A proposal on FRExt profiles,” Doc. JVT-K025.Munich, Germany, Mar. 2004.

[16] M. Wien, “Clean-up and improved design consistency for ABT,” Doc.JVT-E025. Geneva, Switzerland, Oct. 2002.

[17] F. Bossen, “ABT cleanup and complexity reduction,” Doc.JVT-E087. Geneva, Switzerland, Oct. 2002.

[18] S. Gordon, “Simplified use of 8� 8 transform,” Doc. JVT-I022. SanDiego, CA, Sep. 2003.

[19] S. Gordon, D. Marpe, and T. Wiegand, “Simplified use of 8� 8 trans-form—Proposal,” Doc. JVT-J029. Waikoloa, HI, Dec. 2003.

[20] S. Gordon, D. Marpe, and T. Wiegand, “Simplified use of 8� 8 trans-form—Results,” Doc. JVT-J030. Waikoloa, HI, Dec. 2003.

[21] J. Dong, J. Lou, C.-X. Zhang, and L. Yu, “A new approach to com-patible adaptive block-size transforms,” in Proc. VCIP, Beijing, China,Jul. 2005, pp. 38–47.

[22] C. Gomila and A. Kobilansky, “SEI message for film grain encoding,”Doc. JVT-H022. Geneva, Switzerland, May 2003.

[23] C. Gomila, “SEi message for film grain encoding: Syntax and results,”Doc. JVT-I013. San Diego, CA, Sep. 2003.

[24] M. Schlockermann, S. Wittmann, and T. Wedi, “Film grain coding inH.264/AVC,” Doc. JVT-I034. San Diego, CA, Sep. 2003.

[25] C. Gomila and J. Llach, “Film grain modeling versus encoding,” Doc.JVT-K036. Munich, Germany, Mar. 2004.

[26] T. Wedi and S. Wittmann, “Quantization with an adaptive dead zonesize for H.264/AVC FRExt,” Doc. JVT-K026. Munich, Germany,Mar. 2004.

[27] A. Puri, X. Chen, and A. Luthra, “Video coding using the H.264/MPEG-4 AVC compression standard,” Signal Process.: ImageCommun., vol. 19, no. 9, pp. 793–849, Oct. 2004.

[28] JM7.6 and JM9.3 H.264 reference software [Online]. Available: http://iphome.hhi.de/suehring/tml

[29] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive bi-nary arithmetic coding in the H.264/AVC video compression standard,”IEEE Trans. Circuits Syst. Video Technol, vol. 13, no. 7, pp. 620–636,Jul. 2003.

[30] S. Srinivasan, P. J. Hsu, T. Holcomb, K. Mukerjee, S. L. Regu-nathan, B. Lin, J. Liang, M. C. Lee, and J. Ribas-Corbera, “Windowsmedia video 9: Overview and Applications,” Signal Process.: ImageCommun., vol. 19, no. 9, pp. 851–875, Oct. 2004.

[31] S. Srinivasan and S. L. Regunathan, “An overview of VC-1,” in Proc.VCIP, Beijing, China, Jul. 2005, pp. 720–728.

[32] Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, AdvancedVideo Coding (AVC) ITU-T Rec. T.81 and ISO/IEC 14496-10(MPEG-4 Part 10), Mar. 2005.

Cixun Zhang received the B.Eng. and M.Eng.degrees in information engineering from ZhejiangUniversity, Hangzhou, China, in 2004 and 2006, re-spectively. He is currently working toward the Ph.D.degree in Department of Information Technologyat Tampere University of Technology, Tampere,Finland.

Since August 2006, he has been a Researcher in theInstitute of Signal Processing, Tampere University ofTechnology, Tampere, Finland. His research interestsinclude video compression and communication.

Mr. Zhang was the recipient of the AVS special award from the AVS workinggroup of China in 2005.


Lu Yu (M’00) received the B.Eng. degree (hons.)in radio engineering and the Ph.D. degree in com-munication and electronic systems from ZhejiangUniversity, Hangzhou, China, in 1991 and 1996,respectively.

Since 1996, she has been with the faculty of Zhe-jiang University, Hangzhou, China, and is presentlyProfessor of information and communication engi-neering. She was a Senior Visiting Scholar in Univer-sity Hannover and the Chinese University of HongKong in 2002 and 2004, respectively. Her research

areas include video coding, multimedia communication, and relative ASIC de-sign, in which she is principal investigator of a number of national research anddevelopment projects and inventor or co-inventor of 14 granted and 13 pendingpatents. She published more than 80 academic papers and contributed 119 pro-posals to international and national standards in the recent years.

Dr. Yu acts as the chair of the video subgroup of Audio Video coding Standard(AVS) of China and she was also the co-chair of implementation subgroup ofAVS from 2003 to 2005. She organized the 15th International Workshop onPacket Video as a General Chair in 2006. She organized two special sessionsand gave five invited talks and tutorials in international conferences. Now sheserves as a member of Technical Committee of Visual Signal Processing andCommunication of IEEE Circuits and Systems Society.

Jian Lou (S’07) was born in Hangzhou, Zhejiang,China. He received the B.E. and M.E. degrees ininformation science and electronic engineering fromZhejiang University, Hangzhou, China. He is cur-rently working toward the Ph.D. degree in electricalengineering at the University of Washington, Seattle.

His research interests include video processing andvideo compression. He has interned with several re-search laboratories including Microsoft Corporation,IBM T.J. Watson Research Center, Thomson R&DLaboratory, and Mitsubishi Electric Research Labo-

ratories.

Wai-Kuen Cham (S’77–M’79–SM’91) graduatedfrom the Chinese University of Hong Kong in 1979in electronics. He received the M.Sc. and Ph.D.degrees from Loughborough University of Tech-nology, Loughborough, U.K., in 1980 and 1983,respectively.

From 1984 to 1985, he was a Senior Engineer withDatacraft Hong Kong Limited and a Lecturer in theDepartment of Electronic Engineering, Hong KongPolytechnic. Since May 1985, he has been with theDepartment of Electronic Engineering, the Chinese

University of Hong Kong. His research interests include image coding, imageprocessing and video coding.

Jie Dong received the B.Eng. and M.Eng. degrees inInformation Engineering from Zhejiang University,Hangzhou, China, in 2002 and 2005, respectively.She is currently working toward the Ph.D. degree inelectronic engineering at the Chinese University ofHong Kong.

Her research interests include HD video compres-sion and processing.

Publication 5 Cixun Zhang, Kemal Ugur, Jani Lainema, Antti Hallapuro, Moncef Gabbouj,

Video Coding Using Spatially Varying Transform

IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 21, Issue. 2, Feb. 2011, pp.

127-140.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011 127

Video Coding Using Spatially Varying TransformCixun Zhang, Student Member, IEEE, Kemal Ugur, Jani Lainema, Antti Hallapuro, and

Moncef Gabbouj, Fellow, IEEE

Abstract—In this paper, a novel algorithm called spatiallyvarying transform (SVT) is proposed to improve the codingefficiency of video coders. SVT enables video coders to varythe position of the transform block, unlike state-of-art videocodecs where the position of the transform block is fixed. Inaddition to changing the position of the transform block, the sizeof the transform can also be varied within the SVT framework,to better localize the prediction error so that the underlyingcorrelations are better exploited. It is shown in this paper thatby varying the position of the transform block and its size,characteristics of prediction error are better localized, and thecoding efficiency is thus improved. The proposed algorithm isimplemented and studied in the H.264/AVC framework. We showthat the proposed algorithm achieves 5.85% bitrate reductioncompared to H.264/AVC on average over a wide range oftest set. Gains become more significant at medium to highbitrates for most tested sequences and the bitrate reductionmay reach 13.50%, which makes the proposed algorithm verysuitable for future video coding solutions focusing on high fidelityvideo applications. The gain in coding efficiency is achievedwith a similar decoding complexity which makes the proposedalgorithm easy to be incorporated in video codecs. However, theencoding complexity of SVT can be relatively high because ofthe need to perform a number of rate distortion optimization(RDO) steps to select the best location parameter (LP), whichindicates the position of the transform. In this paper, a novel lowcomplexity algorithm is also proposed, operating on a macroblockand a block level, to reduce the encoding complexity of SVT.Experimental results show that the proposed low complexityalgorithm can reduce the number of LPs to be tested in RDO byabout 80% with only a marginal penalty in the coding efficiency.

Index Terms—H.264/AVC, spatially varying transform (SVT),transform, variable block-size transforms (VBT), video coding.

I. Introduction

H .264/AVC is the latest international video coding stan-dard which provides up to 50% gain in coding efficiency

compared to previous standards. However, this is achieved atthe cost of both increased encoding and decoding complexity.It is estimated in [1] that the encoder complexity increases

Manuscript received April 16, 2009; revised September 29, 2009, Febru-ary 1, 2010 and May 10, 2010; accepted September 11, 2010. Date of currentversion March 2, 2011. This work was supported by the Academy of Finland,Application 129657, Finnish Program for Centers of Excellence in Research2006–2011. This paper was recommended by Associate Editor C.-W. Lin.

C. Zhang and M. Gabbouj are with the Tampere University of Tech-nology, FI-33720 Tampere, Finland (e-mail: [email protected]; [email protected]).

K. Ugur, J. Lainema, and A. Hallapuro are with Nokia Re-search Center, FI-33720 Tampere, Finland (e-mail: [email protected];[email protected]; [email protected]).

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2011.2105595

by more than one order of magnitude between Moving Pic-ture Experts Group (MPEG)-4 Part 2 (Simple Profile) andH.264/AVC (Main Profile) and by a factor of 2 for thedecoder. For mobile video services (video telephony, mobileTV, and others) and handheld consumer electronics (digitalstill cameras, camcorders, and others), additional complexityof H.264/AVC becomes an issue due to the limited resourcesof these devices. On the other hand, as display resolutionsand available bandwidth/storage increase rapidly, high defi-nition (HD) video is becoming more popular and commonlyused, making the implementation of video codecs even morechallenging.

To better satisfy the requirements of increased usage ofHD video in resource-constrained applications, two key issuesshould be addressed: coding efficiency and implementationcomplexity. In this paper, we propose a novel algorithm,namely, spatially varying transform (SVT), which providescoding efficiency gains over H.264/AVC. The concept of SVTwas first introduced by the authors in [2] and later extendedin [3]. The technique is developed and studied mainly forcoding HD resolution video, but also proved to be useful forcoding lower resolution video. The motivations leading to thedevelopment of SVT are two fold:

1) The typical block-based transform design in most exist-ing video coding standards does not align the underlyingtransform with the possible edge location. In this case,the coding efficiency decreases. In [4], directional dis-crete cosine transforms (DCTs) is proposed to improvethe efficiency of transform coding for directional edges.However, efficient coding of horizontal, vertical, andnondirectional edges was not addressed, whereas typ-ically, in natural video sequences, such edges are morecommon than directional edges. This is the main reasonwhy only very marginal overall gain has been achieved,as stated in [4]. Some toy examples are given below in(1)–(3), which represent a vertical edge, a nondirectionaledge, and a directional edge, respectively, where E1, E3,and E5 are the original prediction error matrices, E2, E4,and E6 are the corresponding prediction error matricesafter we align the transform (by shifting the transformhorizontally and vertically) to the edge location, andCi (i = 1, 2, 3, 4, 5, 6) is the corresponding transformcoefficient matrix. Fig. 1 shows the energy distributionof the transform coefficients, where the x-axis denotesthe sum of x and y components of the coefficients in thetransform coefficient matrix, and the y-axis denotes thepercentage of the energy of the transform coefficients.

1051-8215/$26.00 c© 2011 IEEE

128 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011

We can see that by aligning the transform to the edgelocation, more energy in the transform domain concen-trates in the low frequency part and this facilitates theentropy coding and improves the coding efficiency. Itis interesting that this is true not only for horizontal,vertical, and nondirectional edges but also for directionaledges as follows:

E1 =

⎡

⎢⎢⎣

0 2 1 00 2 1 00 2 1 00 2 1 0

⎤

⎥⎥⎦after 2−D DCT−→

C1 =

⎡

⎢⎢⎣

3.0000 0.5412 −3.0000 −1.30660 0 0 00 0 0 00 0 0 0

⎤

⎥⎥⎦

E2 =

⎡

⎢⎢⎣

2 1 0 02 1 0 02 1 0 02 1 0 0

⎤


C2 =

⎡

⎢⎢⎣

3.0000 3.1543 1.0000 −0.22420 0 0 00 0 0 00 0 0 0

⎤

⎥⎥⎦ (1)

E3 =

⎡

⎢⎢⎣

0 1 0 00 0 1 10 2 1 00 0 0 0

⎤


C3 =

⎡

⎢⎢⎣

1.5000 −0.1913 −1.0000 −0.46190.1913 −0.1464 0.0793 −0.1464−1.0000 0.4619 0.5000 −0.19130.4619 0.8536 −1.1152 −0.8536

⎤

⎥⎥⎦

E4 =

⎡

⎢⎢⎣

0 0 0 01 0 0 00 1 1 02 1 0 0

⎤


C4 =

⎡

⎢⎢⎣

1.5000 1.1152 0 0.0793−1.1152 −0.8536 0.0793 0.14640 0.4619 0.5000 −0.1913−0.0793 −0.8536 −1.1152 −0.1464

⎤

⎥⎥⎦ (2)

E5 =

⎡

⎢⎢⎣

0 1 0 00 2 1 00 3 2 10 0 0 0

⎤


C5 =

⎡

⎢⎢⎣

2.5000 0.0793 −2.0000 −1.1152−0.0793 0.3536 −0.1913 −0.3536−2.0000 0.1913 1.5000 0.46191.1152 −0.3536 −0.4619 −0.3536

⎤

⎥⎥⎦

E6 =

⎡

⎢⎢⎣

1 0 0 02 1 0 03 2 1 00 0 0 0

⎤


C6 =

⎡

⎢⎢⎣

2.5000 2.2304 0.5000 0.1585−0.0793 0.2500 0.4619 0.1036−2.0000 −1.5772 0 0.11211.1152 0.6036 −0.1913 0.2500

⎤

⎥⎥⎦ .(3)

Fig. 1. Energy distribution of the transform coefficients before and aftershifting the transform.

2) Coding the entire prediction error signal may not bethe best in terms of rate distortion (RD) tradeoff. Anexample is the SKIP mode in H.264/AVC [5], whichdoes not code the prediction error at all. We note thatthis fact is useful in improving the coding efficiencyespecially for HD video coding since, generally, a cer-tain image area in HD video sequences contains lessdetail than in lower resolution video sequences, and theprediction error tends to have lower energy and morenoise like after better prediction. More importantly, aswe will see, utilizing this fact, our proposed algorithmcan be implemented elegantly, e.g., on a macroblockbasis, without boundary problem.

Both of these two factors contribute to the coding efficiencyimprovement achieved by the SVT [2]. The basic idea of SVTis that we do not restrict the transform coding inside regularblock boundaries but adjust it to the characteristics of theprediction error. With this flexibility, we are able to achievecoding efficiency improvement by selecting and coding thebest portion of the prediction error in terms of RD tradeoff.Generally, this can be done by searching inside a certain resid-ual region after intra prediction or motion compensation, for asubregion and only coding this subregion. Note that, whenthe region refers to a macroblock, the proposed algorithmcan be considered as a special SKIP mode, where part ofthe macroblock is “skipped.” Finally, the location parameter(LP) indicating the position of the subregion inside the regionis coded into the bitstream if there are nonzero transformcoefficients.

It is worth noting that the idea of shifting the transform, asused in the SVT, has been used in video and image denoising,where typically several denoised estimates produced for eachshift of the transform are combined in a certain manner toproduce the final denoised output [6]–[9]. It is often used inpost-processing of a decoded image or a video frame sinceits complexity is generally too high to be incorporated in animage or video decoder, for instance, as an in-loop filter. Also,the boundary problem of this algorithm has not received muchattention, and the performance tends to become worse at the

ZHANG et al.: VIDEO CODING USING SPATIALLY VARYING TRANSFORM 129

boundary and when it is applied to a smaller area such as amacroblock rather than a whole image or a video frame. Bycontrast, our proposed algorithm has no boundary problem andcan be applied elegantly, e.g., on a macroblock basis. Also,using the proposed algorithm, there is no complexity penaltyto the decoder since the LP of the subregion inside the regionis coded in the bitstream and the decoder can simply use thisinformation to reconstruct the region. For these reasons, theproposed algorithm can be easily incorporated in an image or avideo decoder. In this case, various denoising algorithms [6]–[9] may still be efficiently used as post-processing algorithms.

In this paper, the proposed algorithm is studied and imple-mented in H.264/AVC framework. We show that 5.85% bitratereduction is achieved compared to H.264/AVC on average overa wide range of test sets. Gains become more significant atmedium to high bitrates for most tested sequences and thebitrate reduction can reach 13.50%, which makes the proposedalgorithm very suitable for future video coding solutionsfocusing on high fidelity video applications.

The main drawback of the proposed algorithm is thatthe encoding complexity is higher mainly due to the bruteforce search process to select the best LP for the codedsubregion. Nevertheless, the additional encoding complexityof the proposed algorithm is acceptable, for offline encodingapplication cases. For other cases with strict requirement ofencoding complexity, a fast algorithm should be used. Forsuch applications, we propose a fast algorithm operating onmacroblock and block level to reduce the encoding complexityof SVT [10]. The proposed fast algorithm first skips testingSVT for macroblock modes for which SVT is unlikely to beuseful by examining the RD cost of macroblock modes withoutSVT on a macroblock level. For the remaining macroblocksthat SVT may be useful, the proposed fast algorithm selectsavailable candidate LP based on motion difference and utilizesa hierarchical search algorithm to select the best availablecandidate LP at the block level. Experimental results showthat the proposed fast algorithm can reduce the number ofcandidate LPs that need to be tested in search process by about80% with only a marginal penalty in the coding efficiency.

The remainder of this paper is organized as follows. Theproposed algorithm is introduced in Section II and its integra-tion into H.264/AVC framework is described in Section III.The fast algorithm is introduced in Section IV. Experimentalresults are given in Section V. Section VI concludes this paper.

II. SVT

Transform coding is widely used in video coding stan-dards to decorrelate the prediction error and achieve highcompression rates. Typically, in video coding, a transform isapplied to the prediction error at fixed locations. However,this has several drawbacks that may hurt the coding efficiencyand decrease the visual quality. First of all, if the localizedprediction error at a fixed location has a structure that is notsuitable for the underlying transform, many high frequencytransform coefficients will be generated thus requiring a largenumber of bits to code. Consequently, the coding efficiencydecreases. Moreover, notorious visual artifacts such as ring-

ing may appear when these high frequency coefficients getquantized. In this paper, SVT is proposed to reduce thesedrawbacks of transform coding. The basic idea of SVT isthat the transform coding is not restricted inside regular blockboundary but can be applied to a portion of the predictionerror according to the characteristics of the prediction error.

In this paper, we select and code a single block insidea macroblock when applying SVT. Extension to multipleblocks is straightforward. The selection is due to the factthat coding on a macroblock basis reduces complexity, delay,and throughput, which facilitate the implementation especiallyfor hardware. This means that the position and shape ofthe transform block within a macroblock is variable, andinformation about its shape and location is signaled to thedecoder, when there are nonzero transform coefficients. How-ever, it should be noted that there is no restriction on the sizeand shape of a “subregion” and a “region” and thus theproposed idea could easily be extended to cover arbitrary sizesand shapes. For instance, SVT can be applied to an extendedmacroblock [11] of size larger than 16×16. Directional SVTswith a directional oriented block selected and coded with acorresponding directional transform [4] can also be used.

In the following sub-sections, we further discuss three keyissues of SVT in more detail: selection of SVT block-size,selection and coding of candidate LP, and filtering of SVTblock boundaries.

A. Selection of SVT Block-Size

In this paper, an M × N SVT is applied on a selected M ×N block inside a macroblock of size 16 × 16, and only thisM × N block is coded. It is easy to see that there are in total(17−M)× (17−N) possible LPs if the M ×N block is codedby an M × N transform. The following factors are taken intoaccount when choosing suitable M and N.

1) Generally, larger M and N will result in fewer possibleLPs and less overhead, and vice versa.

2) Also, larger M and N will result in lower distortionof the reconstructed macroblock but need more bits incoding the transform coefficients, and vice versa.

3) Finally, a larger block-size transform is more suitable incoding relatively flat areas or non-sharp edges, withoutintroducing coding artifacts such as ringing, while asmaller block-size transform is more suitable in codingareas with details or sharp edges.

To facilitate the transform design, we can further assumeM = 2m, N = 2n, where m and n are integers ranging from0 to 4 inclusive in our context.1 In this paper, we use 8 × 8SVT, 16 × 4 SVT, and 4 × 16 SVT, which are illustrated inFigs. 2 and 3, respectively, to show the effectiveness of theproposed algorithm. Nevertheless, it should be noted that, fordifferent sequences with different characteristics, other block-size SVT can be more suitable in terms of coding efficiency,complexity, and others. For instance, larger block-size SVT,

1Note that SKIP mode is a special case when m and n are both equalto 0 and the macroblock partition is 16 × 16 (and the motion vector equalsthe predicted value in the context of H.264/AVC), and 16 × 16 transform isanother special case when m and n are both equal to 4.


Fig. 2. Illustration of 8 × 8 SVT.

Fig. 3. Illustration of 16 × 4 and 4 × 16 SVT.

such as 16 × 8 SVT or 8 × 16 SVT, may be more preferablein coding flat areas and when using a higher quantization pa-rameter (QP). It is well established in video coding communitythat variable block-size transforms (VBT) can improve thecoding efficiency [12]. Using VBT in the framework of SVT,namely, variable block-size SVT (VBSVT), we will show thatthe prediction error is better localized and the underlyingcorrelations are better exploited; thus, the coding efficiencyis improved compared to a fixed block-size SVT. However,when the number of different block-size SVT used becomeslarge, the coding gain of VBSVT tends to saturate since moreoverhead bits are needed to code the LPs. In this paper,we adaptively choose among 8 × 8 SVT, 16 × 4 SVT, and4×16 SVT in VBSVT. It may also prove beneficial to choosedifferent block-size SVT at different QPs and different bitrates.It is also possible and beneficial for the encoder to code SVTparameters, such as M, N (or m, n), in the bitstream, andtransmit them in the slice header in H.264/AVC framework.

The transforms used for 8×8, 16×4, and 4×16 SVTs aredescribed in the following. In general, the separable forwardand inverse 2-D transform of a 2-D signal can be written,respectively, as follows:

C = Tv · X · T th (4)

Xr = T tv · C · Th (5)

where X denotes a matrix representing M × N pixel block, Cis the transform coefficient matrix, and Xr denotes a matrixrepresenting the reconstructed signal block. Tv and Th are theM × M and N × N transform kernels in the vertical and thehorizontal direction, respectively. The superscript “t” denotesmatrix transposition. For 8 × 8 SVT, we simply reuse the 8 ×8 transform kernel in H.264/AVC [5]. For 16 × 4 SVT and4 × 16 SVT, a 4 × 4 and a 16 × 16 transform kernels need

to be specified. Here, we use the 4 × 4 transform kernel inH.264/AVC [5], [13] and the 16 × 16 transform kernel in [14]because it is simple and can reuse the butterfly structure of theexisting 8×8 transform in H.264/AVC. It is noted that different16 × 16 transform kernels could be used, which, however, isnot the main focus of our proposed algorithm. In all cases,normal zig-zag scan (for frame-based coding which is used inour experiments) is used to represent the transform coefficientsas input symbols to the entropy coding. Different scan patternswere also tested but not much gain in coding efficiency wasachieved.

The block-size of SVT for luma component can be usedto derive the corresponding block-size of SVT for the corre-sponding chroma components. Take 4:2:0 chroma format asan example, if M × N SVT is used for the luma component,then M/2 × N/2 SVT can be used for chroma components.

B. Selection and Coding of Candidate LPs

When there are nonzero transform coefficients of the se-lected SVT block, its location inside the macroblock needsto be coded and transmitted to the decoder. For 8 × 8 SVT,as shown in Fig. 2, the location of the selected 8 × 8 blockinside the current macroblock can be denoted by (�x, �y)which can be selected form the set �8×8 = {(�x, �y), �x,�y = 0, . . . , 8}. There are in total 81 candidates for 8×8 SVT.For 16×4 SVT and 4×16 SVT, as shown in Fig. 3, the loca-tions of the selected 16×4 and 4×16 block inside the currentmacroblock can be denoted as �y and �x, respectively, whichcan be selected from the set �16×4 = {�y, �y = 0, . . . , 12}and �4×16 = {�x, �x = 0, . . . , 12}, respectively. There are intotal 26 candidates for 16 × 4 SVT and 4 × 16 SVT.

The “best” LP is then selected according to a given criterion.In this paper, rate distortion optimization (RDO) [15] is usedto select the best LP in terms of RD tradeoff by minimizingthe following:

J = D + λ·R (6)

where J is the RD cost of the selected configuration, Dis the distortion, R is the bitrate, and λ is the Lagrangianmultiplier. In our implementation, the reconstruction residualfor the remaining part of the macroblock is simply set to be0, but different values can be used and might be beneficial incertain cases (luminance change, and others). It may be usefulto also utilize the reconstructed SVT block and neighboringmacroblock to derive the reconstructed value for the remainingpart of the macroblock.

The set of candidate LPs is also important since it directlyaffects the encoding complexity and the performance of SVT.Larger number of candidate LPs provides more room and ismore robust for coding efficiency improvement for differentsequences with different characteristics, but adds more encod-ing complexity and also more overhead. Experimentally, ac-cording to criterion (6), we choose to use �8×8

′ = {(�x, �y),�x = 0, . . . , 8, �y = 0; �x = 0, . . . , 8, �y = 8; �x =0, �y = 1, . . . , 7; �x = 8, �y = 1, . . . , 7} for 8 × 8 SVTand �16×4 = {�y, �y = 0, . . . , 12} and �4×16 = {�x, �x =0, . . . , 12} for 16 × 4 SVT and 4 × 16 SVT, which shows


good RD performance over a large test set [2]. There are intotal 58 candidate LPs, and statistics show that the measuredentropy of the LP index is 5.73 bits, for all test sequences usedin the experiments. Accordingly, the LP index is representedby a 6-bit fixed length code in our implementation. Moreadvanced representation methods can be designed to achievea better compression performance. These LPs are shown tohave a good tradeoff in coding efficiency and complexity fordifferent test sequences. As the overhead bits to code the LPsbecome significant at low bitrates, it would be useful to choosedifferent LPs at different QPs and different bitrates. Similarly,an indication of available LP can also be coded and transmittedin the slice header in H.264/AVC framework.

The LP for the luma component can be used to derive thecorresponding LPs for chroma components. Consider the 4:2:0chroma format as an example, LPs for chroma components canbe derived from the LP for the luma component as follows:

�xC = (�xL + 1) >> 1, �yC = (�yL + 1) >> 1 (7)

where �xL, �yL are the LPs for the luma component, while�xC, �yC are the LPs for the chroma components.

C. Filtering of SVT Block Boundaries

Due to the coding (transform and quantization) of theselected SVT block, coding artifacts may appear around itsboundary with the remaining non-coded part of the mac-roblock. A deblocking filter can be designed and applied toimprove both the subjective and objective quality. An examplein the framework of H.264/AVC will be described in detail inSection III-D.

III. Implementing SVT in the H.264/AVC

Framework

The proposed technique is implemented and tested in theH.264/AVC framework. Fig. 4 shows the block diagram ofthe extended H.264/AVC encoder with SVT. The encoder firstperforms motion estimation and decides the optimal mode tobe used, and then searches for the best LPs (illustrated as “SVTSearch” in the diagram) if SVT is used for this macroblock.The encoder then calculates the RD cost for using SVT using(6), and if lower RD cost can be achieved by using SVTthen that macroblock is selected to be coded with SVT. Formacroblocks that use SVT, the LPs are coded and transmittedin the bitstream when there are nonzero transform coefficientsof the selected SVT block. The LPs for the macroblocksthat use SVT are decoded in the box marked as “SVT L.P.Decoding” in the diagram. However, we note that the normalcriteria, e.g., sum of absolute differences or sum of squareddifferences, which is used in these encoding processes formacroblocks that do not use SVT, may not be optimal formacroblocks that use SVT. Better encoding algorithms areunder study.

Several key parts of the H.264/AVC standard [5], for in-stance, macroblock types, coded block pattern (CBP), entropycoding, and deblocking, need to be adjusted. The proposedmodifications aiming at good compatibility with H.264/AVCare described in the following sub-sections.

Fig. 4. Block diagram of the extended H.264 encoder with SVT.

TABLE I

Extended Macroblock Types for P Slices in H.264/AVC with SVT

mb−type Name of mb−type0 P−16 × 161 P−16 × 16−SVT2 P−16 × 83 P−16 × 8−SVT4 P−8 × 165 P−8 × 16−SVT6 P−8 × 87 P−8 × 8−SVT8 P−8 × 8ref0a

9 P−8 × 8ref0−SVT

Inferredb P−Skip

aP−8 × 8ref0 is a macroblock mode defined in H.264/AVC with 8 × 8macroblock partition and the reference index of this mode shall be in-ferred to be equal to 0 for all sub-macroblocks of the macroblock [5].bNo further data is present for P−Skip macroblock in the bitstream and itcan be inferred [5].

A. Macroblock Type

We focus our study of SVT on coding the inter predictionerror in P slices although the idea can be easily extendedfor I and B slices. Table I shows the extended macroblocktypes for P slices in H.264/AVC [5] (original intra macroblocktypes in H.264/AVC are not included). We use SVT for16 × 16, 16 × 8, 8 × 16, and 8 × 8 macroblock partitions,based on the observation that SVT is used in a considerableportion when the same macroblock partition is concerned.This is shown in Fig. 5, where the statistics is based onall the test sequences we use and the overall test QP whencontext adaptive variable length coding (CAVLC) is used asthe entropy coding. The macroblock type index is coded ina similar way as H.264/AVC. The sub-macroblock types arekept unchanged and therefore not shown in the table.

B. CBP

In this implementation, SVT is used for coding only theluma component. As shown in Figs. 2 and 3, since only oneblock is selected and coded in macroblocks that use SVT,we can either use 1 bit for luma CBP or jointly code it withchroma CBP as is done in H.264/AVC [5] when CAVLC isused as the entropy coding. However, in our experiments with


many test sequences, we found that luma CBP is often equalto 1 in high fidelity video coding, which is our main focusin this paper. Based on this observation, we restrict the newmacroblock modes to have luma CBP equal to 1 and thusthere is no need to code it. Chroma CBP is not changed andit is represented in the same way as in H.264/AVC [5]. Analternative way would be to infer the luma CBP according toQP or bitrate.

C. Entropy Coding

In H.264/AVC [5], when CAVLC is used as the entropycoding, a different coding table for the total number ofnonzero transform coefficients and trailing ones of currentblock is selected depending on the characteristics (the numberof nonzero transform coefficients) of the neighboring blocks.For macroblocks that use SVT, a fixed coding table is usedfor simplicity without loss of coding efficiency. Besides, wemay also need to derive the information about the number ofnonzero transform coefficients contained in each 4 × 4 lumablock. When the selected SVT block aligns with the regularblock boundaries, no special scheme is needed. Otherwise,the following two-step scheme with negligible complexityincrease is used in our implementation as follows.

Step 1: A 4 × 4 luma block is marked to have nonzerotransform coefficients if it overlaps with a codedblock that has nonzero transform coefficients in theselected SVT block, and marked not to have nonzerotransform coefficients otherwise. This informationmay also be used in other process, e.g., deblocking.

Step 2: The number of nonzero transform coefficients, ncB,for each 4 × 4 block that is marked to have nonzerotransform coefficients, is empirically set to

ncB = �(ncMB + �nB/2�)/nB� (8)

where ncMB is the total number of nonzero transformcoefficients in the current macroblock and nB is thenumber of blocks marked to have nonzero transformcoefficients, and �.� denotes the floor operator. In(8), we (approximately) distribute the total numberof nonzero transform coefficients evenly to all theblocks that are marked as having nonzero transformcoefficients. This is easy to calculate and it is shownexperimentally to have similar coding efficiency com-pared to a more sophisticated scheme, for instance,distributing the total number of nonzero transformcoefficients to all the blocks that are marked ashaving nonzero transform coefficients according tothe overlapping size of the SVT block and eachunderlying 4 × 4 block.

Note that due to the floor operation in (8), a 4 × 4 block maybe marked to have nonzero transform coefficients according toStep 1, but has zero nonzero transform coefficients accordingto (8) in Step 2.2 In our implementation, when only theinformation about whether a block has nonzero transform

2Alternatively, one can set the number of nonzero transform coefficients tobe 1 in this case. However, experimentally this does not show benefit over(8) in terms of coding efficiency.

TABLE II

Increase of Number of Edges to Be Checked in Deblocking for

Luma Component When SVT is Used (CABAC, 720p)

Sequence Increase of Number of Edges to Be Checked inDeblocking (%)

BigShips 32.8ShuttleStart 31.3City 39.5Night 15.9Optis 26.6Spincalendar 27.6Cyclists 20.2Preakness 21.7Panslow 33.6Sheriff 23.6Sailormen 27.2Average 27.3

coefficients or not is needed (e.g., in deblocking), we use theresult of Step 1; while when the information about how manynonzero transform coefficients a block has is needed, we usethe result of Step 2 (e.g., in CAVLC).

The coding of the transform coefficients in CAVLC andcontext-based adaptive binary arithmetic coding (CABAC) isnot changed, and thus no additional complexity is introduced.

D. Deblocking

For macroblocks that use SVT, the deblocking process inH.264/AVC [5] needs to be adjusted because the selected SVTblock may not align with the regular block boundaries. Twocases are shown in Fig. 6: one is when 4 × 4 transform isnot used as an optional transform and the other is when 4 × 4transform is also used as an optional transform. The edges ofthe selected SVT block and the macroblock shown in Fig. 6may be filtered. The filtering criteria and process of theseedges are similar to those used in H.264/AVC [5], [16] by firstderiving the necessary information of the blocks on both sideof the boundary such as having nonzero transform coefficientsor not (using Step 1 in Section III-C), motion difference, andothers, with minor modifications.

The existing regular structure of deblocking in H.264/AVCis made less regular by SVT because the location of theedges of selected SVT block are not fixed. This could be animportant issue especially in hardware implementation becauseit affects the data flow. Meanwhile, when SVT is used, thenumber of edges to be checked in deblocking would increasedue to the fact that some of the macroblocks originally codedin SKIP mode will now be coded with SVT. According toour preliminary results, for instance, when 4 × 4 transform isalso used as an optional transform, the number of edges to bechecked in deblocking increases on average 27.3% for lumacomponent over the 720p test set and all the tested QPs usedin our experiment. Detailed results are shown in Table II.

IV. Fast Algorithms for SVT

As described in Section III, we note that even thoughwe do not change the motion estimation, sub-macroblock


Fig. 5. Percentage of the use of SVT for different macroblock partitions when CAVLC is used. (a) Without using 4×4 transform. (b) Using 4×4 transform.

Fig. 6. Illustration of edges that may be filtered when SVT is used. Boldlines represent the edges of the SVT block. (a), (c), (e) When 4×4 transformis not used as an optional transform. (b), (d), (f) When 4 × 4 transform isalso used as an optional transform.

partition decision process for the macroblocks that use SVT,the encoding complexity of SVT is higher due to the bruteforce search process in RDO. For example, in case of VBSVTdescribed in Section II, there are a total of 58 candidate LPs forone macroblock mode and we need to conduct RDO for eachcandidate LP to select the best one. Note that typically we needto conduct transform, quantization, entropy coding, inversetransform, inverse quantization, and others to calculate the RDcost in (6) and this complexity is high.3 Thus, for application

3Of course, many fast algorithms can be used to reduce the complexity oftransform, quantization, entropy coding, and in estimating the RD cost.

cases that have strict requirement of encoding complexity, fastalgorithm of SVT should be developed and used.

In this section, we propose a simple yet efficient fastalgorithm operating on macroblock and block level to reducethe encoding complexity of SVT. The basic idea to reducethe encoding complexity of SVT is to reduce the number ofcandidate LPs tested in RDO. The proposed fast algorithmfirst skips testing SVT for macroblocks for which SVT isunlikely to be useful by examining the RD cost of mac-roblock modes without SVT on a macroblock level. For theremaining macroblocks that SVT may be useful, the proposedfast algorithm selects available candidate LPs based on themotion difference and utilizes a hierarchical search algorithmto select the best available candidate LP on a block level. Themacroblock level and block-level algorithms are described inSections IV-A and IV-B, respectively.

A. Macroblock Level Fast Algorithm

The basic idea of macroblock level fast algorithm is to skiptesting SVT for macroblock modes for which SVT is unlikelyto be useful. This is done by examining the RD cost of mac-roblock modes that do not use SVT and are already availableprior to SVT coding. In the proposed implementation, SVT isonly applied for the macroblock modes in RDO process, whenthe two criteria are met as follows:

min(Jinter, Jskip) ≤ Jintra (9)

Jmode ≤ min(Jinter, Jskip) + th (10)

where Jinter and Jintra are the minimum RD cost of availableinter and intra macroblock modes in regular (without SVT)coding, respectively, Jskip is the RD cost of the SKIP codingmode, and Jmode is the RD cost of the current macroblockmode to be tested with SVT (e.g., if SVT is being tested forINTER−16 × 8 mode, then Jmode refers to RD cost of theregular INTER−16 × 8 without SVT). In (9), by comparinginter, intra, and SKIP modes, we assume that if the RD cost forintra modes is the lowest, then the probability for inter modeswith SVT to have a lower RD cost will be very small, so thereis no need to check SVT for this macroblock. The basis ofthis assumption is that intra modes are often used in codingdetailed areas for which SVT is unlikely to be useful sincepart of the macroblock is not coded when it is used. In (10),we assume that if the RD cost for the current mode is much


Fig. 7. Derivation of (11).

larger than the RD cost for the best mode, then the probabilityfor this mode with SVT to have a lower RD cost is alsoexpected to be very small. The threshold th in (10) representsthe empirical upper limit of bitrate reduction when SVT isused assuming the reconstruction of macroblock remains thesame when SVT is not used. It is calculated using (11), whichis derived empirically using the following statistical analysis.We used four HD sequences, Crew, Harbour, Raven, and Jetsas train set. The sequences are first encoded and the valuesJmode, Jinter, Jintra, and Jskip are recorded for each modein each macroblock that is coded with SVT. According toresulting costs, a threshold value th in (10) is calculated foreach QP so that a certain number of the macroblocks satisfythe corresponding inequality in (9) and (10). This results for adifferent threshold value for each QP and using these thresholdvalues (11) is derived by fitting a first order polynomial to thedata. This simple derivation is also illustrated in Fig. 7 asfollows:

th = λ · max(23 − QP/2, 0) (11)

where λ is the Lagrangian multiplier used in (6) and themax function returns the maximum value of its arguments.As reflected in (11), the reason that the threshold valueth, which represents the upper limit of bitrate reduction,tends to get lower when QP is larger, is mainly becausedetails/edges are likely to be quantized and thus the benefit ofSVT to align the transform to the underlying details/edges iscompromised.

B. Block-Level Fast Algorithm

The proposed block-level fast algorithm includes two steps:the selection of available candidate LPs based on the motiondifference in the first step and a hierarchical search algorithmin the second step. These two steps are described in detailbelow.

1) Selection of Available Candidate LPs Based on MotionDifference: SVT is used here for 16 × 16, 16 × 8, 8 × 16, and8×8 macroblock partitions, as described in Section III-A. Onestraightforward approach to reduce the encoding complexity isto restrict the transform block in a candidate SVT block to be

inside the same motion compensation block boundary, sinceapplying the transform across motion compensation blockboundaries is usually considered not efficient. In other words,it is typically assumed that edges exist at different motioncompensation block boundaries that the transform should alignto. This approach is simple because only macroblock partitioninformation is used. However, some penalty was observed incoding efficiency for sequences with slow motion and richdetail, for instance, Preakness and Night. This is becauseprediction from different motion compensation blocks is notthe major reason for causing blocking effect [16] whichis likely to degrade the efficiency of the transform coding.Actually, it may not even cause blocking effect if the motiondifference of different motion compensation blocks is smallenough and/or the prediction is good.4 In this case, edgesdo not necessarily appear at different motion compensationblock boundaries where the transform should align to andSVT would be still useful to improve coding efficiency. Takingthis into account, we use a more general approach in thispaper, by skipping the testing of candidate LPs when andonly when the transform block(s) in the SVT block at thatposition overlaps with different motion compensation blocksin which the motion difference is significant according to acertain criterion. In order to measure the motion difference, weuse a similar method to the one used in deriving the boundarystrength parameter in deblocking filter of H.264/AVC [5], [16].Specifically, in our implementation, we skip testing a candidateLP and mark it unavailable if at least one of the followingconditions is true.

a) If the transform applied to the SVT block at thatposition overlaps with at least two neighboring motioncompensation blocks and the motion vectors of theseblocks are larger than or equal to a predefined thresholdwhich is set to be one integer pixel in this paper.

b) If the transform applied to the SVT block at thatposition overlaps with at least two neighboring motioncompensation blocks and the reference frames of thesetwo neighboring blocks are different.

Since the number of available candidate LPs varies fromone macroblock to another and the information to derive theavailable candidate LPs can be obtained both by the encoderand the decoder, the index of the selected LP is coded asfollows. Assume there are N (N > 0) available candidate LPsand the index of the selected candidate is n(0 ≤ n < N), thenit is coded as follows:

{V = n, L =

⌊log2 N

⌋+ 1, if n < 2(N − 2�log2 N�)

V = n − (N − 2�log2 N�), L =⌊log2 N

⌋, otherwise

(12)

where V represents the binary value of the coded bit stringand L represents the number of bits coded. This is a nearfixed-length code which is chosen based on the observationthat all available candidate LPs are similarly likely to beused.

4We would also like to point out that it is shown in a recent paper that“despite the transform residual block being composed of two parts that werepredicted from different areas of the reference picture, correlation withinmixed blocks is very similar to that of normal blocks. The DCT is onlymarginally suboptimal with respect to Karhunen-Loeve transform” [17].


This approach shows a stable coding efficiency for se-quences with different characteristics and achieves similargain over a wide range of test sets compared to the originalalgorithm without using the information of motion difference.Finally, we note that the additional complexity introduced bythis approach is marginal when it is carefully implementedbecause: 1) the decision is only conducted when necessary andcan be skipped for macroblock modes with 16 × 16 partitionand some LPs, e.g., the LP (0, 0) for 8×8 SVT; 2) the decisionis simple and only uses the motion vector and the referenceframe information; and 3) generally several candidate LPs rep-resenting spatially consecutive blocks can be marked availableor unavailable at the same time in one decision.

2) Hierarchical Search Algorithm: The basic idea of thehierarchical search algorithm is similar to that of the motionestimation algorithm, i.e., to first find the best LP in a relativelycoarse resolution and then refine the results in a finer resolu-tion. This assumes that the prediction error structure remainssimilar in the coarse resolution as in the original resolution,which is often the case for a macroblock of size 16 × 16especially in HD video material. In our implementation, wedefine the candidate LP in coarser resolution as key candidateLP, which are marked as squares in Fig. 8. Other candidateLPs are also called as non-key candidate LP, which are markedas circles in Fig. 8. The hierarchical search algorithm can besummarized as follows.

a) Let �1 denote the set of all available key candidate LPs.Select the best one in �1 with the lowest RD cost andlet �2 denote its available neighboring candidate LPswhich are marked as triangles in Fig. 8.

b) The key LPs are divided into 14 LP zones as shownin Fig. 8. Select the best LP zone which is availableand has the lowest RD cost. A LP zone is available ifand only if all three key candidate LPs in that zone areavailable and the RD cost of an LP zone is defined tobe the sum of the RD costs of the three key candidateLPs in that zone. Let �3 denote the additional availablenon-key candidate LPs which are inside the best LP zone(marked as stars in Fig. 8).

c) Select the best LP which has the lowest RD cost amongall the candidates in �1, �2, and �3.

More sophisticated algorithms using the same basic idea canbe designed, for instance, by using different definitions of keyposition and zone, and/or examining different number of goodkey candidate LPs and zones.

V. Experimental Results

We implemented the proposed algorithm VBSVT and itsfast version with the proposed fast algorithm, which we denoteas FVBSVT, in KTA1.8 reference software [18] in order toevaluate their effectiveness. Although our primary interest isin HD video coding, we also test the proposed algorithm inlower resolution [common intermediate format (CIF)] videocoding. The most important coding parameters used in theexperiments are listed as follows.

1) High Profile.

Fig. 8. Illustration of hierarchical search algorithm.

2) QPI = 22, 27, 32, 37, QPP = QPI + 1, fixed QPis used and rate control is disabled according to thecommon conditions that are set up by ITU-T VCEGand ISO-IEC/MPEG community for coding efficiencyexperiments [19].

3) CAVLC/CABAC is used as the entropy coding.4) Frame structure is IPPP, four reference frames.5) Motion vector search range: ±64/±32 pels for 720p/CIF

sequences, resolution 14 -pel.

6) RDO in the “high complexity mode.”

Two configurations are tested as follows.

1) Low complexity configuration: Motion estimation blocksizes are 16×16, 16×8, 8×16, 8×8, and 4×4 transformis not used for macroblocks coded with regular transformor SVT. This represents a low complexity codec withmost effective tools for HD video coding.

2) High complexity configuration: Motion estimation blocksizes are 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, 4×4,and 4×4 transform is also used as an optional transformfor both macroblocks coded with regular transform orSVT. This represents a high complexity codec with fullusage of the tools provided in H.264/AVC.

We measure the average bitrate reduction (�BD-RATE)compared to H.264/AVC using Bjontegaard tool [20] forboth low complexity and high complexity configurations. TheBjontegaard tool measures the numerical average in data rateor quality differences between two rate–distortion curves. Thisis a more compact way to present the performance comparisonand required by MPEG and VCEG community. The results areshown in Tables III–VII. We also show the results of 8 × 8SVT and 16 × 4 + 4 × 16 SVT for comparison. As we cansee, 16 × 4 + 4 × 16 SVT performs better than 8 × 8 SVTwhile VBSVT performs best, both in low and high complexityconfigurations. VBSVT achieves on average 4.11% (up to6.19%) bitrate reduction in low complexity configuration andon average 2.43% (up to 3.48%) bitrate reduction in highcomplexity configuration, respectively, when CAVLC is used,and achieves on average 5.85% (up to 7.92%) bitrate reductionin low complexity configuration and on average 4.40% (upto 6.25%) bitrate reduction in high complexity configuration,respectively, when CABAC is used, compared to H.264/AVC.The gain of VBSVT is lower in high complexity configuration,because the overhead to code the LP becomes more significant


TABLE III

�BD-Rate Compared to H.264/AVC (Low Complexity Configuration) (CABAC, 720p)

Sequence 8 × 8 SVT (%) 16 × 4 + 4× 16 SVT (%) VBSVT (%) FVBSVT (%) Reduction of Candidate LPs (%)BigShips −4.27 −5.31 −6.22 −5.96 82.5ShuttleStart −4.09 −4.70 −5.76 −5.22 91.4City −6.98 −6.76 −7.92 −7.61 81.6Night −3.53 −4.86 −5.41 −4.87 84.7Optis −5.28 −5.17 −6.70 −5.95 83.3Spincalendar −3.24 −4.02 −4.87 −4.01 84.4Cyclists −4.56 −4.28 −5.68 −5.32 85.5Preakness −2.78 −3.23 −4.04 −3.34 78.7Panslow −4.15 −4.91 −5.62 −5.09 86.1Sheriff −4.46 −5.15 −5.93 −5.35 84.5Sailormen −3.88 −5.86 −6.22 −5.69 84.8Average −4.29 −4.93 −5.85 −5.31 84.3

TABLE IV

�BD-Rate Compared to H.264/AVC (Low Complexity Configuration) (CABAC, CIF)

Sequence 8 × 8 SVT (%) 16 × 4 + 4× 16 SVT (%) VBSVT (%) FVBSVT (%) Reduction of Candidate LPs (%)Bus −2.27 −3.34 −3.87 −3.13 81.0Container −3.87 −4.27 −5.31 −4.79 84.4Foreman −2.61 −2.86 −3.52 −3.07 82.7Hall −4.23 −4.99 −5.98 −5.49 84.9Mobile −2.30 −2.73 −3.81 −2.86 81.3News −2.84 −3.98 −4.52 −3.81 86.7Paris −3.28 −4.81 −5.69 −4.32 85.8Tempete −2.91 −3.94 −4.60 −4.03 81.8Average −3.04 −3.87 −4.66 −3.94 83.6

TABLE V

�BD-Rate Compared to H.264/AVC (High Complexity Configuration) (CABAC, 720p)

Sequence 8 × 8 SVT (%) 16 × 4 + 4× 16 SVT (%) VBSVT (%) FVBSVT (%) Reduction of Candidate LPs (%)BigShips −3.05 −3.62 −4.46 −4.34 80.9ShuttleStart −3.56 −4.02 −4.69 −4.33 90.9City −5.41 −4.92 −6.25 −5.65 80.0Night −2.46 −3.65 −4.00 −3.60 82.8Optis −4.20 −3.93 −5.36 −4.71 82.1Spincalendar −2.44 −2.76 −3.28 −2.96 82.2Cyclists −3.87 −3.24 −4.30 −4.24 83.8Preakness −1.22 −1.45 −2.00 −1.60 77.9Panslow −2.26 −2.60 −3.50 −3.02 84.0Sheriff −3.98 −4.25 −5.14 −4.63 82.8Sailormen −3.40 −5.07 −5.44 −5.19 83.2Average −3.26 −3.59 −4.40 −4.03 82.8

TABLE VI

�BD-Rate Compared to H.264/AVC (High Complexity Configuration) (CABAC, CIF)

Sequence 8 × 8 SVT (%) 16 × 4 + 4× 16 SVT (%) VBSVT (%) FVBSVT (%) Reduction of Candidate LPs (%)Bus −1.33 −1.99 −2.32 −1.83 78.8Container −1.35 −1.42 −2.12 −1.80 83.0Foreman −1.58 −2.02 −2.51 −1.83 80.5Hall −2.79 −3.42 −4.01 −3.63 83.3Mobile −0.99 −1.23 −1.58 −1.22 79.3News −0.90 −1.99 −2.31 −1.81 85.8Paris −1.45 −1.79 −2.15 −1.69 84.5Tempete −1.79 −2.39 −2.62 −2.24 79.7Average −1.52 −2.03 −2.45 −2.01 81.9


TABLE VII

Average Results of �BD-Rate Compared to H.264/AVC

Average Results 8 × 8 SVT (%) 16× 4 + 4 × 16 SVT (%) VBSVT (%) FVBSVT (%) Reduction of Candidate LPs (%)Average (CAVLC, 720p, low complexity configuration) −2.64 −3.65 −4.11 −3.69 83.4Average (CAVLC, CIF, low complexity configuration) −2.44 −3.68 −4.14 −3.55 83.6Average (CAVLC, 720p, high complexity configuration) −1.42 −2.33 −2.43 −2.34 81.1Average (CAVLC, CIF, high complexity configuration) −1.28 −1.88 −2.08 −1.72 81.4Average (CABAC, 720p, low complexity configuration) −4.29 −4.93 −5.85 −5.31 84.3Average (CABAC, CIF, low complexity configuration) −3.04 −3.87 −4.66 −3.94 83.6Average (CABAC, 720p, high complexity configuration) −3.26 −3.59 −4.40 −4.03 82.8Average (CABAC, CIF, high complexity configuration) −1.52 −2.03 −2.45 −2.01 81.9

TABLE VIII

Percentage of 8 × 8SVT and 16 × 4 + 4 × 16 SVT Selected in VBSVT for Different Sequences (CABAC, 720p)

Sequence Low Complexity Configuration High Complexity Configuration8 × 8 SVT (%) 16 × 4 + 4 × 16 SVT (%) 8 × 8 SVT (%) 16 × 4 + 4 × 16 SVT (%)

BigShips 51.4 48.6 54.4 45.6ShuttleStart 48.9 51.1 52.8 47.2City 50.1 49.9 53.0 47.0Night 49.7 50.3 49.8 50.2Optis 57.0 43.0 59.1 40.9Spincalendar 44.8 55.2 53.7 46.3Cyclists 52.3 47.7 50.0 50.0Preakness 50.2 49.8 49.9 50.1Panslow 49.4 50.6 55.2 44.8Sheriff 54.3 45.7 51.2 48.8Sailormen 48.3 51.7 46.4 53.6Average 50.2 49.8 52.3 47.7

TABLE IX

Percentage of 8 × 8 SVT and 16 × 4 + 4 × 16SVT Selected in VBSVT for Different Sequences (CABAC, CIF)

Sequence Low Complexity Configuration High Complexity Configuration8 × 8 SVT (%) 16 × 4 + 4 × 16 SVT (%) 8× SVT (%) 16 × 4 + 4 × 16 SVT (%)

Bus 38.6 61.4 39.9 60.1Container 54.1 45.9 56.0 44.0Foreman 53.1 46.9 53.1 46.9Hall 48.9 51.1 48.8 51.2Mobile 49.2 50.8 47.2 52.8News 46.4 53.6 44.6 55.4Paris 45.2 54.8 44.6 55.4Tempete 44.3 55.7 43.6 56.4Average 47.5 52.5 47.2 52.8

and the 4 × 4 transform in H.264/AVC (and smaller motionestimation block sizes) is more efficient in certain cases. Thegain of VBSVT is higher in HD video coding because typicallyless detail is contained in a certain image area in HD video se-quences than in lower resolution video sequences and thus thenon-coded part of the macroblock can be more safely skippedwithout introducing much degradation in reconstructedquality.

Fig. 9 shows the R–D curves for sequences Night andPanslow. We note that the gain of VBSVT comes at mediumto high bitrates for many tested sequences and can be muchmore significant than the average gain over all bitrates reportedabove. This is true for most sequences we tested. TakePanslow sequence as an example, by using the performanceevaluation tool provided in [21], we are able to show thatVBSVT achieves 13.50% and 6.90% bitrate reduction at37 dB compared to H.264/AVC, in low and high complexityconfiguration, respectively, when CAVLC is used.

We also measure the percentage of 8 × 8 SVT and 16 ×4 + 4 × 16 SVT selected in VBSVT. The results are shown inTables VIII–X. We can see that in VBSVT scheme, 16 × 4 +4 × 16 SVT are used generally more often than 8 × 8 SVT.However, 8 × 8 SVT is also used in a significant portion ofthe cases. This shows that different block-size SVT can besuitable for different situations and VBSVT would be a morepreferable algorithm for coding prediction error with differentcharacteristics than fixed block-size SVT.

In Figs. 10 and 11, we show the distribution of the mac-roblocks coded with SVT and corresponding SVT blocks ina prediction error frame for Night and Sheriff sequences,respectively. It can be observed that, as expected, SVT ismostly useful for macroblocks where residual signals of largermagnitude are likely to gather and locate at certain regions,which frequently happens, as also observed by other re-searchers [22]. However, when using RDO, the selection of theSVT depends on the distortion of the whole macroblock and


Fig. 9. R–D curves for (a) Night sequence (CABAC) and (b) Panslow

sequence (CAVLC).

the coded bits of the selected SVT block, so it is also possiblethat SVT can also be chosen to be used for macroblockswith relatively evenly distributed residual signal of mediummagnitude. Macroblocks with residual signal of low magnitudeare often coded in SKIP mode.

We also view the reconstructed video sequences in order toanalyze the impact of SVT on subjective quality. We concludethat using SVT does not introduce any special visual artifactsand the overall subjective quality remains similar, while thecoded bits are reduced.

Similarly, we measure the average bitrate reduction (�BD-RATE) of FVBSVT compared to H.264/AVC. We also mea-sure the average reduction of candidate LPs tested in RDOover all the tested QPs using FVBSVT. The results are alsoshown in Tables III–VII. We can see that with the proposedfast algorithm we can reduce by a little more than 80% thenumber of candidate LPs tested in RDO while retaining mostof the coding efficiency, for both low and high complexity con-figurations. The reduction in high complexity configuration isslightly less than that in low complexity configuration becausewhen the 4×4 transform is used, the number of candidate LPsis increased according to the selection method based on motiondifference described in Section IV-B. This means that on av-erage about 10–11 candidate LPs are tested in RDO (each RDcost calculation includes transform, quantization, entropy cod-ing, inverse transform, inverse quantization, and others) when

Fig. 10. Distribution of the macroblocks coded with SVT and correspondingSVT blocks for 32nd prediction error frame (Night sequence).

Fig. 11. Distribution of the macroblocks coded with SVT and correspondingSVT blocks for 65th prediction error frame (Sheriff sequence).

FVBSVT is used. We also measured the execution time ofFVBSVT to estimate the computational complexity. Comparedto H.264/AVC, the encoding and decoding time of FVBSVTincreases on average 2.49% and 2.05%, respectively, over 720ptest set and all the tested QPs in high complexity configuration.The encoding and decoding time increase vary for different testsequences and detailed results are shown in Table XI. The run-ning time increase is marginal because only a relatively smallpart of the whole coding process is affected. The memory us-age remains similar when SVT is used because it needs to storethe number of nonzero transform coefficients, access the pixels


TABLE X

Average Results of Percentage of 8 × 8 SVT and 16 × 4 + 4 × 16 SVT Selected in VBSVT for Different Sequences

Average Results Low Complexity Configuration High Complexity Configuration8 × 8 SVT (%) 16 × 4 + 4 × 16 SVT (%) 8 × 8 SVT (%) 16 × 4 + 4 × 16 SVT (%)

Average (CAVLC, 720p) 45.8 54.2 43.9 56.1Average (CAVLC, CIF) 44.4 55.6 43.1 56.9Average (CABAC, 720p) 50.2 49.8 52.3 47.7Average (CABAC, CIF) 47.5 52.5 47.2 52.8

TABLE XI

Encoding/Decoding Time Increase of FVBSVT Compared to

H.264/AVC (High Complexity Configuration) (CABAC, 720p)

Sequence Encoding Time Decoding TimeIncrease of FVBSVT (%) Increase of FVBSVT (%)

BigShips 2.88 1.79ShuttleStart 1.16 0.63City 2.94 3.85Night 2.62 2.08Optis 2.18 2.29Spincalendar 2.74 2.07Cyclists 1.90 4.40Preakness 3.71 1.89Panslow 2.17 0.49Sheriff 2.41 0.99Sailormen 2.64 2.07Average 2.49 2.05

in neighboring blocks in deblocking and store the motionvector and reference frame information etc, which is similar tothe case when SVT is not used. The simulation PC we use isIntel(R) Core(TM)2 Quad CPU Q6600 @2.40 GHz 2.39 GHz,2.00 GB of RAM, and the operating system is MicrosoftWindows XP Professional Version 2002 Service Pack 2.

VI. Conclusion

In this paper, we proposed a novel algorithm, called SVTto improve the coding efficiency of video coders. The mainidea of SVT is to vary the position of the transform block.In addition to changing the position of the transform block,the size of the transform can also be varied within the SVTframework. We showed in this paper that by varying theposition of the transform block and its size, the characteristicsof the prediction error is better localized, and the codingefficiency is improved. The proposed algorithm is studied andimplemented in the H.264/AVC framework. We showed thatthe proposed algorithm achieves on average 5.85% bitratereduction in a low complexity configuration which representsa low complexity codec with most effective tools for HDvideo coding which we focus on and on average 4.40% bitratereduction in high complexity configuration, which representsa high complexity codec with full usage of the tools providedin the standard, respectively. Gains become more significantat medium to high bitrates for most tested sequences andthe bitrate reduction may reach up to 13.50%, which makesthe proposed algorithm very suitable for future video codingsolutions focusing on high fidelity video applications. Thesubjective quality remains similar when SVT is used whilethe coded bits are reduced.

The decoding complexity remains similar when incorporat-ing SVT in video codecs with the proposed implementation.However, the encoding complexity of SVT can be relativelyhigh because of the need to perform a number of RDO stepsto select the best position and size for the transform. Toaddress this issue, a novel fast algorithm was also proposedoperating on macroblock and block levels to reduce theencoding complexity of SVT. The proposed fast algorithmfirst skips testing SVT for macroblocks for which SVT isunlikely to be useful by examining the RD cost of macroblockmodes without SVT at a macroblock level. For the remainingmacroblocks for which SVT may be useful, the proposed fastalgorithm selects available candidate LPs based on the motiondifference and utilizes a hierarchical search algorithm to selectthe best available candidate LP at a block level. A reductionof about 80% in the number of candidate LPs that need to betested in RDO has been achieved, with only a marginal penaltyin coding efficiency. Compared to H.264/AVC, the fast SVTachieves on average 5.31% and 4.03% bitrate reduction for lowcomplexity configuration and high complexity configuration,respectively. Nevertheless, we note that still SVT may influ-ence the data flow to make hardware implementation moredifficultly than regular block boundary transform. We believeefficient hardware implementation could be studied further andis an interesting topic.

The proposed method can also be potentially used inmany other situations, for instance, in inter-layer predictionin scalable video coding (SVC) and inter-view prediction inmultiview video coding (MVC).

Acknowledgment

The authors would like to thank the reviewers and theAssociate Editor for their valuable comments which helpedimprove the manuscript.

References

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira,T. Stockhammer, and T. Wedi, “Video coding with H.264/AVC: Tools,performance, and complexity,” IEEE Circuits Syst. Mag., vol. 4, no. 1,pp. 7–28, Apr. 2004.

[2] C. Zhang, K. Ugur, J. Lainema, and M. Gabbouj, “Video coding usingspatially varying transform,” in Proc. 3rd PSIVT, Jan. 2009, pp. 796–806.

[3] C. Zhang, K. Ugur, J. Lainema, and M. Gabbouj, “Video coding usingvariable block-size spatially varying transforms,” in Proc. IEEE ICASSP,Apr. 2009, pp. 905–908.

[4] B. Zeng and J. Fu, “Directional discrete cosine transforms: A new frame-work for image coding,” IEEE Trans. Circuits, Syst. Video Technol., vol.18, no. 3, pp. 305–313, Mar. 2008.

[5] Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec.H.264|ISO/IEC IS 14496-10 v3, 2005.


[6] A. Nosratinia, “Denoising JPEG images by re-application of JPEG,” inProc. IEEE Workshop MMSP, Dec. 1998, pp. 611–615.

[7] R. Samadani, A. Sundararajan, and A. Said, “Deringing and deblockingDCT compression artifacts with efficient shifted transforms,” in Proc.IEEE ICIP, Oct. 2004, pp. 1799–1802.

[8] J. Katto, J. Suzuki, S. Itagaki, S. Sakaida, and K. Iguchi, “Denoisingintra-coded moving pictures using motion estimation and pixel shift,” inProc. IEEE ICASSP, Mar. 2008, pp. 1393–1396.

[9] O. G. Guleryuz, “Weighted averaging for denoising with overcompletedictionaries,” IEEE Trans. Image Process., vol. 16, no. 12, pp. 3020–3034, Dec. 2007.

[10] C. Zhang, K. Ugur, J. Lainema, A. Hallapuro, and M. Gabbouj, “Lowcomplexity algorithms for spatially varying transform,” in Proc. PCS,May 2009.

[11] S. Naito and A. Koike, “Efficient coding scheme for super high definitionvideo based on extending H.264 high profile,” in Proc. SPIE Vis.Commun. Image Process., vol. 6077, 607727. Jan. 2006, pp. 1–8.

[12] M. Wien, “Variable block-size transforms for H.264/AVC,” IEEE Trans.Circuits, Syst. Video Technol., vol. 13, no. 7, pp. 604–613, Jul. 2003.

[13] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity transform and quantization in H.264/AVC,” IEEE Trans.Circuits, Syst. Video Technol., vol. 13, no. 7, pp. 598–603, Jul. 2003.

[14] S. Ma and C.-C. Kuo, “High-definition video coding with super-macroblocks,” in Proc. SPIE Vis. Commun. Image Process., vol. 6508,650816. Jan. 2007, pp. 1–12.

[15] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan,“Rate-constrained coder control and comparison of video coding stan-dards,” IEEE Trans. Circuits, Syst. Video Technol., vol. 13, no. 7, pp.688–703, Jul. 2003.

[16] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz,“Adaptive deblocking filter,” IEEE Trans. Circuits, Syst. Video Technol.,vol. 13, no. 7, pp. 614–619, Jul. 2003.

[17] K. Vermeirsch, J. D. Cock, S. Notebaert, P. Lambert, and R. V. deWalle, “Evaluation of transform performance when using shape-adaptivepartitioning in video coding,” in Proc. PCS, May 2009.

[18] KTA Reference Model 1.8 [Online]. Available: http://iphome.hhi.de/suehring/tml/download/KTA/jm11.0kta1.8.zip

[19] T. K. Tan, G. Sullivan, and T. Wedi, “Recommended simulation commonconditions for coding efficiency experiments,” document VCEG-AE10,ITU-T Q.6/SG16 VCEG, Jan. 2007.

[20] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” document VCEG-M33, ITU-T SG16/Q6 VCEG, Mar. 2001.

[21] S. Pateux and J. Jung, “An excel add-in for computing Bjontegaardmetric and its evolution,” document VCEG-AE07, ITU-T Q.6/SG16VCEG, Jan. 2007.

[22] F. Kamisli and J. S. Lim, “Transforms for the motion compensationresidual,” in Proc. IEEE ICASSP, Apr. 2009, pp. 789–792.

Cixun Zhang (S’09) received the B.E. and M.E.degrees in information engineering from ZhejiangUniversity, Hangzhou, China, in 2004 and 2006, re-spectively. He is currently pursuing the Ph.D. degreefrom the Department of Signal Processing, TampereUniversity of Technology, Tampere, Finland.

Since August 2006, he has been a Researcher withthe Department of Signal Processing, Tampere Uni-versity of Technology, Tampere. His current researchinterests include video, image coding, and commu-nication. He is currently working on standardization

of high efficiency video coding standard.Mr. Zhang was the recipient of the AVS Special Award from the AVS

Working Group of China in 2005, the Nokia Foundation Scholarship in 2008,and the Chinese Government Award for Outstanding Self-Financed StudentsAbroad in 2009.

Kemal Ugur received the M.S. degree in electricaland computer engineering from the University ofBritish Columbia, Vancouver, BC, Canada, in 2003,and the Ph.D. degree from the Tampere Universityof Technology, Tampere, Finland, in 2010.

He was with Nokia Research Center, Tampere,in 2004, where, currently, he is a Principal Re-searcher and leading a project on next generationvideo coding technologies. Since he joined Nokia,he has been actively participating in several stan-dardization forums, such as Joint Video Team for

the standardization of multiview video coding extension of H.264/AVC,Video Coding Experts Group for exploration toward next generation videocoding standard, 3GPP for mobile broadcast and multicast standard, and,recently, Joint Collaborative Team on Video Coding for standardization ofhigh efficiency video coding standard. He has more than 25 publications inacademic conferences and journals and around 30 patent applications.

Mr. Ugur is a member of the research team that won the highly prestigiousNokia Quality Award in 2006.

Jani Lainema received the M.S. degree in computerscience from the Tampere University of Technology,Tampere, Finland, in 1996.

He was with the Visual Communications Labo-ratory, Nokia Research Center, Tampere, in 1996.Since then he has contributed to the designs ofITU-T and MPEG video coding standards as wellas to the evolution of different multimedia servicestandards in 3GPP, digital video broadcasting, anddigital living network alliance. Recently, he has beenworking as a Principal Scientist with Nokia Research

Center. His current research interests include video, image and graphicscoding, and communications, as well as practical applications of game theory.

Antti Hallapuro received the M.S. degree in com-puter science from Tampere University of Technol-ogy, Tampere, Finland, in 2010.

He was with Nokia Research Center, Tampere,in 1998, where he has been working on videocoding-related topics. He participated in H.264/AVCvideo codec standardization and is the author andco-author of several input documents and relatedacademic papers. He has contributed significantlyto productization of high performance H.264/AVCcodecs for various computing platforms. His current

research interests include practical video coding and processing algorithmsand their high performance implementations. He is currently working on topicsof next generation video-coded standardization.

Moncef Gabbouj (S’86–M’91–SM’95–F’11) re-ceived the B.S. degree in electrical engineering fromOklahoma State University, Stillwater, in 1985, andthe M.S. and Ph.D. degrees in electrical engineeringfrom Purdue University, West Lafayette, IN, in 1986and 1989, respectively.

He has been an Academy Professor with theAcademy of Finland, Helsinki, Finland, since Jan-uary 2011. He was a Professor with the Depart-ment of Signal Processing, Tampere University ofTechnology, Tampere, Finland. He was the Head of

the department from 2002 to 2007. He was a Visiting Professor with theAmerican University of Sharjah, Sharjah, UAE, from 2007 to 2008, and aSenior Research Fellow with the Academy of Finland from 1997 to 1998 andfrom 2007 to 2008. His current research interests include multimedia content-based analysis, indexing and retrieval, nonlinear signal and image processingand analysis, voice conversion, and video processing and coding.

Dr. Gabbouj served as the Distinguished Lecturer for the IEEE Circuits andSystems Society from 2004 to 2005. He served as an Associate Editor ofthe IEEE Transactions on Image Processing, and was a Guest Editorof Multimedia Tools and Applications, European Journal Applied SignalProcessing. He was the Past Chairman of the IEEE Finland Section, the IEEECAS Society, the Technical Committee on DSP, and the IEEE SP/CAS FinlandChapter. He was the recipient of the 2005 Nokia Foundation RecognitionAward. He was the co-recipient of the Myril B. Reed Best Paper Award fromthe 32nd Midwest Symposium on Circuits and Systems and the co-recipientof the NORSIG’94 Best Paper Award from the 1994 Nordic Signal ProcessingSymposium. He is a member of the IEEE Signal Processing Society and theIEEE Circuits and Systems Society.

Publication 6 Cixun Zhang, Kemal Ugur, Jani Lainema, Moncef Gabbouj,


Proceedings Pacific-Rim Symposium on Image and Video Technology (PSIVT), Tokyo, Japan, 13th-16th, Jan

2009, pp. 796-806.

T. Wada, F. Huang, and S. Lin (Eds.): PSIVT 2009, LNCS 5414, pp. 796–806, 2009. © Springer-Verlag Berlin Heidelberg 2009


Cixun Zhang1, Kemal Ugur2, Jani Lainema2, and Moncef Gabbouj1

1 Tampere University of Technology, Tampere, Finland {cixun.zhang,moncef.gabbouj}@tut.fi

2 Nokia Research Center, Tampere, Finland {kemal.ugur,jani.lainema}@nokia.com

Abstract. In this paper, we propose a novel algorithm, named as Spatially Varying Transform (SVT). The basic idea of SVT is that we do not restrict the transform coding inside normal block boundary but adjust it to the characteristics of the prediction error. With this flexibility, we are able to achieve coding efficiency improvement by selecting and coding the best portion of the prediction error in terms of rate distortion tradeoff. The proposed algorithm is implemented and studied in the H.264/AVC framework. We show that the proposed algorithm achieves 2.64% bit-rate reduction compared to H.264/AVC on average over a wide range of test set. Gains become more significant at high bit-rates and the bit-rate reduction can be up to 10.22%, which makes the proposed algorithm very suitable for future video coding solutions focusing on high fidelity applications. The decoding complexity is expected to be decreased because only a portion of the prediction error needs to be decoded.

Keywords: H.264/AVC, video coding, transform, spatially varying transform (SVT).

1 Introduction

H.264/AVC (H.264 for short hereafter) is the latest international video coding standard and it provides up to 50% gain in coding efficiency compared to previous standards. However, this is achieved at the cost of both increased encoding and decoding complexity. It is estimated in [1] that the encoder complexity increases with more than one order of magnitude between MPEG-4 Part 2 (Simple Profile) and H.264 (Main Profile) and with a factor of 2 for the decoder. For mobile video services (video telephony, mobile TV etc.) and handheld consumer electronics (digital still cameras, camcorders etc), additional complexity of H.264 becomes an issue due to the limited resources of these devices. On the other hand, as display resolutions and available bandwidth/storage increases rapidly, High-Definition (HD) video is becoming more popular and commonly used, making the implementation of video codecs even more challenging.

To better satisfy the requirements of increased usage of HD video in resource constrained applications, two key issues should be addressed: coding efficiency and implementation complexity. In this paper, we propose a novel algorithm, named as Spatially Varying Transform (SVT), which provides coding efficiency gains over

Video Coding Using Spatially Varying Transform 797

H.264 and is expected to lower the decoding complexity. The technique is developed and studied mainly for coding HD resolution video, but it could be extended also for other resolutions. The motivations leading to design of SVT are two-fold:

1. The block based transform design in most existing video coding standards does not align the underlying transform with the possible edge location. In this case, the coding efficiency decreases. In [2], directional discrete cosine transforms is proposed to improve the efficiency of transform coding for directional edges. However, efficient coding of horizontal/vertical edges inside the blocks and non-directional edges was not addressed.

2. Coding the entire prediction error signal may not be the best in terms of rate distortion tradeoff. An example is the SKIP mode in H.264 [3], which does not code the prediction error at all.

The basic idea of SVT is that we do not restrict the transform coding inside normal block boundary but adjust it to the characteristics of the prediction error. With this flexibility, we are able to achieve coding efficiency improvement by selecting and coding the best portion of the prediction error in terms of rate distortion tradeoff. This is done by searching inside a certain residual region after intra prediction or motion compensation, for a sub-region and only coding this sub-region. The location parameter of the sub-region inside the region is coded into the bitstream if there are non-zero coefficients.

The proposed algorithm is implemented and studied in H.264 framework. Extensive experimental results show that it can improve the coding efficiency of H.264. In addition decoding complexity is expected to be lowered a little mainly because only a portion of the prediction error needs to be decoded. Encoding complexity of the proposed technique is higher mainly due to the brute force search process. Fast encoding algorithms are being studied to alleviate this aspect of the proposed technique.

The paper is organized as follows: The proposed algorithm is introduced in section 2 and its integration into H.264 framework is described in section 3. Experimental results are given in section 4. Section 5 concludes the paper and also presents future research directions.

2 Spatially Varying Transform

The basic idea of SVT is that the transform coding is not restricted inside normal block boundary but applied to a portion of the prediction error according to the characteristics of the prediction error. We only code a sub-region in a certain residual region after intra prediction or motion compensation. The sub-region is found by searching inside the region according to a certain criterion. Information of the location of the selected sub-region inside the region is coded into the bitstream, if there are non-zero coefficients. Fig. 1 shows an illustrative example of the idea: one 8x8 block inside a 16x16 macroblock is selected and only this 8x8 block is coded. In this paper, we focus our discussion on this particular configuration, which turns out to be promising, as we will see later. However, we note that there is no restriction of the “sub-region” and “region”, for example, on their size, shape, etc when using the idea

798 C. Zhang et al.

in a general sense. Other possible configurations of the idea to achieve further gain in coding efficiency are under study.

In the following, we further discuss two key issues of SVT in more detail: selection of location parameter candidates and filtering of block boundaries.

Fig. 1. Illustration of spatially varying transform

2.1 Selection of Location Parameter Candidates

When there are non-zero coefficients of the selected 8x8 block, its location inside the macroblock needs to be coded and transmitted to the decoder. As shown in Fig. 1, the location of the selected 8x8 block inside the current macroblock is denoted by (Δx, Δy) where Δx and Δy each can take integer value from 0 to 8, if the selected block is restricted to have the same size (which facilitates the transform design) for all locations. There are in total 81 possible combinations and we need to select the best one according to a certain criterion. In this paper, Rate-Distortion Optimization (RDO) is used to select the best (Δx, Δy) in terms of RD tradeoff by minimizing the following:

J = D + λ · R . (1)

where J is the RD cost of the selected combination, D is the distortion, R is the bit rate and λ is the Lagrangian multiplier. The reconstruction residue for the remaining part of the 16x16 residual macroblock is simply set to be 0 in our implementation, but different values can be used and might be beneficial in certain cases (luminance change, etc). Similarly, RDO can also be used to decide if SVT should be used for a macroblock.

Selection of location parameter candidates is important since it directly affects the encoding complexity and the performance of SVT. We study the frequency distribution of (Δx, Δy) and it is observed that the most frequently selected (Δx, Δy), are (0..8,0), (0..8,8), (0,1..7), (8,1..7)1, which takes up a percentage around 60% of all 81 combinations. According to extensive experiments, this is generally true for

1 In this paper, notation x..y is used to specify a range of integer values starting from x to y

inclusive, with x, y being integer numbers.


different sequences, macroblock partitions and Quantization Parameters (QP). Fig. 2 below shows the distributions of (Δx, Δy) for different macroblock partitions of BigShips sequence when QP equal to 23. As we will see in section 4, using this subset of location parameters turns out to be an efficient configuration of the proposed algorithm.

Fig. 2. Frequency distribution of (Δx, Δy) for different macroblock partitions of BigShips sequence, at QP=23 (Z axis denotes the frequency)

2.2 Filtering of Block Boundaries

Due to the coding (transform and quantization) of the selected 8x8 block, blocking artifacts may appear around its boundary with the remaining non-coded part of the macroblock. A deblocking filter can be applied to improve the subjective quality and possibly also the objective quality. An example in the framework of H.264 will be described in detail later in section 3.4.

3 Integration of Spatially Varying Transform into H.264 Framework

In this paper, we study the proposed technique in H.264 framework. Fig. 3 below is the block diagram of extended H.264 encoder with SVT. As shown in Fig. 3, encoder needs to search the best 8x8 block inside macroblocks that use SVT, which is marked as “SVT Search” in the diagram. Then encoder decides whether to use SVT for the current macroblock, using RDO in our implementation. The location parameter is

800 C. Zhang et al.

coded and transmitted in the bitstream. A corresponding decoder needs to decode the location parameter for macroblocks that use SVT, which is marked as “SVT L.P. Decoding” in the diagram. One thing that is worth mentioning here is, in this paper and also the experimental results in section 4, we do not change the motion estimation, sub-macroblock partition decision process, even for the macroblocks that use SVT. After the residual macroblock is generated as normal, RDO is used to decide whether SVT should be used. The reason is to keep the encoding complexity low. However, we note that the normal criteria used in these encoding processes for normal macroblocks may not be optimal for macroblocks that use SVT. Better encoding algorithms are under study.

Several key parts of the H.264 standard [3], for example, macroblock types, Coded Block Pattern (CBP), entropy coding, deblocking, also need to be adjusted. Proposed modifications aiming at good compatibility with H.264 are described in the following sub-sections.

Coder Control

Transform/Quant.



Motion Compensation

Entropy Coding

Motion Estimation

Deblocking Filter

+

Frame Buffer

+

Control Data

SVT L.P.

Output Video

Motion Data

-

Input Video

Output Bitstream

Intra / Inter

SVT Search

SVT Selection

SVT L.P. Decoding

SVT Selection

Decoder


Fig. 3. Block diagram of extended H.264 encoder with spatially varying transform

3.1 Macroblock Types

In this work, we focus our study of SVT on coding inter prediction error in P slices although the idea can be easily extended to be also used in I and B slices. Table 1 below shows the extended macroblock types for P slices in H.264 [3] (original intra macroblock types in H.264 are not included), with the name of new macroblock types that use SVT in italics. The macroblock type index is coded using Exp-Golomb codes in the same way as H.264. The sub-macroblock types are kept unchanged and therefore not shown in the table.


Table 1. Extended macroblock types for P slices in H.264 with spatially varying transform

mb_type Name of mb_type 0 P_16x16 1 P_16x16_SVT 2 P_16x8 3 P_16x8_SVT 4 P_8x16 5 P_8x16_SVT 6 P_8x8 7 P_8x8_SVT 8 P_8x8ref0 9 P_8x8ref0_SVT Inferred P_Skip

3.2 Coded Block Pattern

In this work, we only use SVT for luma component coding. As shown in Fig. 1, since only one 8x8 block is selected and coded in macroblocks that use SVT, we can use 1 bit for luma CBP or jointly code it with chroma CBP as H.264 does [3]. However, in our experiments with many test sequences, we found that luma CBP is probably 1 when QP is low (and probably 0 when the QP is high) where most gain of SVT comes from, so we restrict the new macroblock modes to have luma CBP equal to 1 and there is no need to code this information. Chroma CBP is represented in the same way as H.264 [3]. An alternative way would be to infer the luma CBP according to QP.

3.3 Entropy Coding

In H.264 [3], when Context Adaptive Variable Length Coding (CAVLC) is used as the entropy coding, different coding table for total number of non-zero transform coefficients and trailing ones of current block is selected depending on the characteristics (the number of non-zero transform coefficients) of the neighboring blocks. For macroblocks that use SVT, for simplicity, a fixed coding table is used. Besides, we may also need to derive the information about the number of non-zero transform coefficients every luma 4x4 block has. When the selected 8x8 block aligns with the normal block boundaries, no special scheme is needed. Otherwise, the following scheme is used in our implementation:

1. A luma 4x4 block is marked to have non-zero coefficients if it overlaps with a coded block that has non-zero coefficients in the selected 8x8 block, and marked not to have non-zero coefficients otherwise. This information may also be used in other processes, e.g., deblocking.

2. The number of non-zero transform coefficients for each 4x4 block that is marked to have non-zero coefficients, is empirically set to the same to

(nC+nB/2)/nB . (2)

where nC is the total number of non-zero transform coefficients in the current macroblock and nB is the number of blocks marked to have non-zero

802 C. Zhang et al.

coefficients. Operator “/” is integer division with truncation of the result toward zero.

We note that due to the truncation in (2), a 4x4 block may be marked to have non-zero coefficients according to step 1 but has no non-zero coefficients according to (2) in step 2 at the same time. In our implementation, when only the information about whether a block has non-zero transform coefficients or not is, we use the result of step 1; while when the information about how many non-zero transform coefficients a block has is needed, we use the result of step 2.

3.4 Deblocking

As shown in Fig. 4 below, for macroblocks that use SVT, the deblocking process in H.264 [3] needs to be adjusted because the selected 8x8 block may not align with the normal block boundaries. The following scheme is used in our implementation:

1. First, the boundary edges of the selected 8x8 block and the remaining part of the macroblock are filtered. The filtering criteria and process of these edges are similar to those used in H.264, with minor modifications based on empirical tests to produce visually pleasing results for a variety of content.

2. Second, the normal internal edges and macroblock boundary edges are filtered except those which are inside the selected 8x8 block or overlap with the boundary edges which have already been filtered in the first step. The filtering criteria and process of these edges are kept unchanged as in H.264.

Fig. 4. Illustration of different edges of macroblocks that use spatially varying transform

4 Experimental Results

We implemented SVT on KTA1.8 reference software [4] in order to evaluate its effectiveness. Important coding parameters used in our experiments are listed as follows:


• High Profile • QPI=22, 27, 32, 37, QPP=QPI+1 • CAVLC is used as the entropy coding • Frame structure is IPPP, 4 reference frames • Motion vector search range ±64 pels, resolution ¼-pel • RDO in the “High Complexity Mode” • Two configurations are tested. 1) Low complexity configuration: motion

compensation block size are 16x16, 16x8, 8x16, 8x8, only 8x8 transform is used. In this case, also only 8x8 transform is used for macroblocks that use SVT. This represents a low complexity codec with most effective tools for HD video coding; 2) High complexity configuration: motion compensation block size are 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4, both 4x4 and 8x8 transform are used. In this case, either 4x4 or 8x8 transform is selected for macroblocks that use SVT. This represents a high complexity codec with full usage of the tools provided in the standard.

We test three configurations of the proposed algorithm in our experiments:

1. SVT32: The location parameter (Δx, Δy) is selected in the set: Φ32={(0..8,0), (0..8,8), (0,1..7), (8,1..7)} which has 32 candidates. The index is coded using 5-bit fixed length code. As we will see, this turns out to be an efficient configuration of SVT.

2. SVT4: The location parameter (Δx, Δy) is selected in the set: Φ4={(0,0), (0,8), (8,0), (8,8)} which has 4 candidates. The index is coded using 2-bit fixed length code. This serves as a comparison to show the effectiveness and necessity of the searching process of SVT.

3. SVT81: The location parameter (Δx, Δy) is selected in the set: Φ81={(0..8,0..8)} which has 81 candidates. The index is coded using 7-bit fixed length code. Although this overhead may be reduced a little by using variable length code, SVT81 serves as a meaningful comparison to show the performance of a configuration of the proposed algorithm with a selected subset of location parameters, which is SVT32 in our case.

Table 2. Experimental results (Low complexity configuration)

Sequence ΔBD-RATE (1280x720/60p) SVT32 SVT4 SVT81 BigShips -2.87% -0.92% -2.41% ShuttleStart -2.51% -1.76% -2.18% City -3.30% -1.77% -2.97% Night -2.33% -0.55% -1.95% Optis -2.65% -0.12% -2.04% Spincalendar -2.07% -0.84% -1.84% Cyclists -2.14% -1.41% -1.48% Preakness -2.34% -0.17% -2.26% Panslow -4.59% -2.17% -4.14% Sheriff -2.18% -0.68% -1.77% Sailormen -2.11% -0.60% -1.81% Average -2.64% -1.00% -2.26%

804 C. Zhang et al.

Table 3. Experimental results (High complexity configuration)

Sequence ΔBD-RATE (1280x720/60p) SVT32 SVT4 SVT81 BigShips -1.68% -0.14% -0.96% ShuttleStart -0.91% -0.45% -0.62% City -1.46% -0.31% -1.19% Night -1.35% -0.21% -0.99% Optis -1.48% +0.06% -1.25% Spincalendar -1.46% -0.59% -1.23% Cyclists -1.07% -0.32% -0.60% Preakness -1.18% +0.12% -0.93% Panslow -2.46% -0.96% -2.18% Sheriff -1.20% -0.13% -0.76% Sailormen -1.34% +0.07% -1.14% Average -1.42% -0.26% -1.08%

Table 4. Percentage of Φ32 and Φ4 selected in Φ81 for different sequences

Low complexity configuration High complexity configuration Sequence (1280x720/60p) Φ32/Φ81 Φ4/Φ81 Φ32/Φ81 Φ4/Φ81 BigShips 59.8% 6.6% 56.7% 4.7% ShuttleStart 63.1% 6.7% 58.4% 4.3% City 58.7% 6.8% 55.8% 4.7% Night 67.2% 12.0% 65.1% 10.2% Optis 57.7% 5.4% 55.3% 3.3% Spincalendar 57.9% 8.5% 54.6% 5.9% Cyclists 63.2% 7.4% 61.0% 5.4% Preakness 68.1% 10.3% 64.9% 6.1% Panslow 59.0% 8.5% 56.3% 6.0% Sheriff 58.9% 5.9% 56.4% 4.1% Sailormen 57.5% 7.1% 55.0% 5.3% Average 61.0% 7.7% 58.1% 5.5%

We calculate the average bit-rate reduction (ΔBD-RATE) according to [5] for both low complexity and high complexity configurations. The results are shown in Table 2 and Table 3 below. We also measure the percentage of Φ32 and Φ4 selected in SVT81. The results are shown in Table 4.

As we can see from Table 2 and Table 3, SVT32 performs better than both SVT4 and SVT81. It achieves on average 2.64% (up to 4.59%) bit-rate reduction in low complexity configuration and on average 1.42% (up to 2.46%) bit-rate reduction in high complexity configuration, respectively. The percentage of Φ32 selected in Φ81 is around 60%. Fig. 5 and Fig. 6 show the R-D curves for City and Panslow sequences. We can see that the gain of SVT comes mainly at high bit-rates and can be quite significant. This is true for most sequences we tested. Take Panslow sequence as an example, by using the performance evaluation tool provided in [6], we are able to show that VBSVT achieves 10.22% and 5.59% bit-rate reduction at 37 dB and 38 dB compared to H.264/AVC, in low and high complexity configuration, respectively.


City

2929.5

3030.5

3131.5

3232.5

3333.5

3434.5

3535.5

3636.5

3737.5

3838.5

3939.5

40

0 5000 10000 15000 20000 25000 30000

Bitrate (kbit/s)

PS

NR

(d

B) H.264 (low

complexity)

H.264 (lowcomplexity)+SVT32

H.264 (highcomplexity)

H.264 (highcomplexity)+SVT32

Fig. 5. R-D curve for City sequence

Panslow

31.5

32

32.5

33

33.5

34

34.5

35

35.5

36

36.5

37

37.5

38

38.5

0 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000

Bitrate (kbit/s)

PS

NR

(d

B)

H.264 (lowcomplexity)

H.264 (lowcomplexity)+SVT32

H.264 (highcomplexity)

H.264 (highcomplexity)+SVT32

Fig. 6. R-D curve for Panslow sequence

5 Conclusions

In this paper, we propose a novel algorithm, named as Spatially Varying Transform (SVT). The basic idea of SVT is that we do not restrict the transform coding inside normal block boundary but adjust it to the characteristics of the prediction error. With

806 C. Zhang et al.

this flexibility, we are able to achieve coding efficiency improvement by selecting and coding the best portion of the prediction error in terms of rate distortion tradeoff. The proposed algorithm is implemented and studied in H.264/AVC framework. Two key issues of SVT: selection of location parameter candidates and filtering of block boundaries, are addressed in detail. It is shown that a configuration of the proposed algorithm achieves on average 2.64% bit-rate reduction in low complexity configuration which represents a low complexity codec with most effective tools for HD video coding and on average 1.42% bit-rate reduction in high complexity configuration which represents a high complexity codec with full usage of the tools provided in the standard, respectively, compared to H.264/AVC. Gains become more significant at high bit-rates and the bit-rate reduction can be up to 10.22%, which makes the proposed algorithm very suitable for future video coding solutions focusing on high fidelity applications. The decoding complexity is expected to be decreased because only a portion of the prediction error needs to be decoded.

Future studies include: 1) Better configuration of the proposed algorithm to achieve better performance, e.g., by selecting different portions inside a macroblock. 2) Better encoding algorithms especially in motion estimation and macroblock partition decision to generate residuals more suitable to code for the proposed algorithm. 3) Encoding algorithms to reduce the encoding complexity.

References

1. Ostermann, J., Bormans, J., List, P., Marpe, D., et al.: Video Coding with H.264 / AVC: Tools, Performance, and Complexity. IEEE Circuits and Systems Magazine 4(1), 7–28 (2004)

2. Zeng, B., Fu, J.: Directional discrete cosine transforms – A new framework for image coding. IEEE Trans. Circuits Syst. Video Technol. 18(3), 305–313 (2008)

3. Advanced video coding for generic audiovisual services, ITU-T Recommendation H.264 (March 2005)

4. KTA reference model 1.8m, http://iphome.hhi.de/suehring/tml/ download/KTA/jm11.0kta1.8.zip

5. Bjontegaard, G.: Calculation of average PSNR differences between RD-curves, VCEG Doc. VCEG-M33 (March 2001)

6. Pateux, S., Jung, J.: An excel add-in for computing Bjontegaard metric and its evolution, VCEG Doc. VCEG-AE 2007 (January 2007)

Publication 7 Cixun Zhang, Kemal Ugur, Jani Lainema, Antti Hallapuro, Moncef Gabbouj,

Prediction Signal Aided Spatially Varying Transform

IEEE International Conference on Multimedia and Expo (ICME), Barcelona, Spain, 11-15 July, 2011.

PREDICTION SIGNAL AIDED SPATIALLY VARYING TRANSFORM

Cixun Zhang*, Kemal Ugur§, Jani Lainema§, Antti Hallapuro§, Moncef Gabbouj*

*Tampere University of Technology, Tampere, Finland

§Nokia Research Center, Tampere, Finland

ABSTRACT Spatially Varying Transform (SVT) is a technique intro-duced earlier to improve the coding efficiency of video cod-ers [1][2]. SVT allows the position of the transform block within the macroblock to vary in order to better localize the underlying residual signal. The coding gains of SVT come with increased encoding complexity due to the additional need in the encoder to search for the best Location Parame-ter (LP) which indicates the position of the transform. In this paper, a new technique called Prediction Signal Aided Spatially Varying Transform (PSASVT) is proposed that utilizes the gradient of prediction signal to eliminate the unlikely LPs. As the number of candidate LPs is reduced, a smaller number of LPs are searched by encoder, which re-duces the encoding complexity. In addition, less overhead bits are needed to code the selected LP and thus the coding efficiency can be improved. Experimental results show that the number of LPs to be tested in RDO is reduced on aver-age by more than 20%. This reduction in encoding com-plexity is achieved with a slight increase in coding efficien-cy, as the number of candidate LPs is reduced. The decod-ing complexity increase is only a little.

Index Terms— Video coding, Transform, Spatially Varying Transform (SVT)

1. INTRODUCTION Nowadays, as display resolutions and available band-width/storage increase rapidly, High-Definition (HD) video is becoming more popular and commonly used. This makes the implementation of video codecs more challenging, es-pecially in resource constrained applications, such as mo-bile video services (mobile TV, video telephony etc) and handheld consumer electronics (camcorders, digital still cameras etc). To better satisfy the requirements of increased usage of HD video in resource constrained applications, two key issues should be addressed: coding efficiency and im-plementation complexity. In [1][2], the concept of Spatially Varying Transform is introduced to improve the coding efficiency of video coders. The motivations leading to de-sign of SVT are two-fold:

1. The block based transform design in most existing video coding standards does not adapt the underly-ing transform to the structure of the prediction error to be coded. This leads to inefficiency in the coding efficiency.

2. Coding the entire prediction error signal may not be the best in terms of rate distortion tradeoff because the prediction error signal may contain noise which contributes little to quality but requires additional bits to code.

Both of these two factors contribute to the coding effi-ciency improvement achieved by SVT [2].

The basic idea of SVT is that the transform coding is not restricted inside regular block boundaries but is ad-justed to the characteristics of the prediction error. With this flexibility, coding efficiency improvement can be achieved by selecting and coding the best portion of the prediction error in terms of rate distortion tradeoff. Gener-ally, this can be done by searching inside a certain residual region after intra prediction or motion compensation, for a sub-region and only coding this sub-region. Finally, the Location Parameter (LP) indicating the position of the sub-region inside the region is coded into the bitstream if neces-sary. This approach leads however to increased encoding complexity because the encoder needs to search for the best LP among a number of candidate LPs. This added encoding complexity is problematic for real-time encoding use-cases on resource constrained devices.

In this paper, a new technique called Prediction Signal Aided Spatially Varying Transform (PSASVT) is proposed, which enables reducing the number of candidate LPs based on the prediction signal and thus also reducing the encod-ing complexity. In addition, the overhead bits for transmit-ting the index of the selected LP are reduced, as there are a smaller number of candidate LPs. Experimental results show that the number of LPs to be tested in RDO is reduced on average by more than 20%. Furthermore, this reduction in encoding complexity is achieved with a slight increase in coding efficiency and only a little decoding complexity in-crease.

The paper is organized as follows. A brief review of SVT is presented in section 2. Section 3 introduces PSASVT and presents a detailed explanation of its design.

Experimental results are given in section 4 and section 5 concludes the paper.

2. SPATIALLY VARYING TRANSFORMS

Transform coding is widely used in video coding standards to decorrelate the prediction error and achieve increased compression rates. Typically, transform coding is applied to prediction error at fixed locations. However, this has sever-al drawbacks that may hurt the coding efficiency and de-crease visual quality. First of all, if the localized prediction error inside the fixed block boundaries has a structure that is not suitable for the underlying transform, many high fre-quency coefficients will be generated in the transform do-main and they need many bits to code. This leads to ineffi-ciency in the coding efficiency. Moreover, notorious visual artifacts such as ringing may appear when these high fre-quency coefficients get quantized.

In [1][2], SVT is proposed to reduce these drawbacks of transform coding. The main idea of SVT is that the trans-form coding is not restricted to be aligned with the tradi-tional grid of block boundaries, but instead can be applied at any location according to the characteristics of the pre-diction error. With this flexibility, coding efficiency im-provement can be achieved by selecting and coding the best portion of the prediction error in terms of rate distortion tradeoff. Generally this can be done as follows. Only a sub-region is coded in a certain residual region after intra/inter prediction. The sub-region is found by searching inside the region according to a certain criterion. Finally, the location parameter (LP) indicating the position of the selected sub-region inside the region is coded into the bitstream if neces-sary.

For 8×8 SVT, as shown in Fig. 1, the location of the se-lected 8×8 block inside the current macroblock can be de-noted by (Δx, Δy) which can be selected from the set Φ8×8={(Δx, Δy), Δx, Δy=0,…,8}. There are in total 81 can-didates for 8×8 SVT. To reduce complexity, it is suggested to use a reduced set Φ8×8’={(Δx, Δy), Δx=0,…,8, Δy=0; Δx=0,…,8, Δy=8; Δx=0, Δy=1,…,7; Δx=8, Δy=1,…,7} instead. For 16×4 SVT and 4×16 SVT, as shown in Fig. 2, the locations of the selected 16×4 and 4×16 block inside the current macroblock can be denoted as Δy and Δx, respec-tively, which can be selected from the set Φ16×4={Δy, Δy=0,…,12} and Φ4×16={Δx, Δx=0,…,12} respectively. In total there are 26 candidates for 16×4 SVT and 4×16 SVT. Rate Distortion Optimization (RDO) is used to select the best LP for SVT and determine whether SVT is used for coding each macroblock.

For the selected SVT block, corresponding block-size transform is applied accordingly. Generally, the separable forward and inverse 2-D transform of a 2-D signal can be written as

Thv TXTC ××= , (1)

hT

vr TCTX ××= (2) respectively, where X denotes a matrix representing M×N pixel block of N pixels horizontally and M pixels vertically, C is the transform coefficient matrix, and Xr denotes a ma-trix representing reconstructed signal block. Tv and Th are the M×M and N×N transform kernels in vertical and hori-zontal direction, respectively. The superscript T denotes matrix transposition. Traditional zig-zag scan is used to represent the transform coefficients as input symbols to the entropy coding.

Fig.1 Illustration of 8×8 spatially varying transform

Fig.2 Illustration of 16×4 and 4×16 spatially varying transform

It is noted that even though the motion estimation, sub-

macroblock partition decision process for the macroblocks that use SVT are not changed [1][2], the encoding com-plexity of SVT is higher due to the brute force search process in RDO. For example, there are a total of 58 candi-date LPs for one macroblock mode and RDO needs to be conducted for each candidate LP to select the best one. Note that typically one needs to conduct transform, quantization, entropy coding, inverse transform and inverse quantization etc to calculate the RD cost accurately and this complexity is high. This is problematic especially for resource con-strained applications.

3. PREDICTION SIGNAL AIDED SPATIALLY VARYING TRANSFORMS

As mentioned above, the encoding complexity for SVT is increased because the encoder needs to search for the best LP among a number of candidate LPs. The basic idea to reduce the encoding complexity of SVT is to reduce the

number of candidate LPs tested in RDO. In this paper, we propose to select candidate LPs based on the prediction signal. The motivations leading to this design are two-fold:

1. Statistics show that normally the selected SVT block locates at positions of the macroblock where the prediction error has higher magni-tude [1][2]. These positions are probably boun-dary positions [3][10]. One example is illu-strated in Fig. 3 below [2].

Fig.3 Distribution of (Dx, Dy) of 8×8 SVT for BigShips sequence.

2. The gradient of the prediction signal is assumed to be positively correlated with the magnitude of the prediction error, i.e., a larger gradient indi-cates a higher prediction error magnitude. In other words, it is probable that when the gra-dient of the prediction signal is high, the texture of the original signal is more complex and hard to predict well thus the prediction error has a relatively high magnitude. A similar assump-tion is also used in [4].

The proposed algorithm PSASVT is summarized below: Step 1: Calculate the gradient map of the prediction sig-

nal of the current macroblock using Sobel operator, as fol-lows.

)1,1(),1(2)1,1()1,1(),1(2)1,1(

)1,1()1,(2)1,1()1,1()1,(2)1,1(

),(

++-+--+-+-+-+--

+

++-+-+---++-+--

=

yxSyxSyxSyxSyxSyxS

yxSyxSyxSyxSyxSyxS

yxG

(3) where G(x,y) is the gradient value and S(x,y) is the predic-tion signal value at position (x,y).

Step 2: If the gradient map is all-zero map, then in this case we assume the prediction quality is very good and thus we only try 8 corner positions at which the prediction error is assumed to be larger than other positions, i.e., the set Φc= Φ8×8’’+Φ16×4’+Φ4×16’, where Φ8×8’’={(Δx, Δy), Δx=0, Δy=0; Δx=8, Δy=0; Δx=0, Δy=8; Δx=8, Δy=8;}; Φ16×4’={Δy, Δy=0, 12}; Φ4×16’={Δx, Δx=0, 12}; then goto step 5.

Step 3: Otherwise, for each candidate LP, calculate the sum of gradient (denoted by a variable SoG) of the predic-tion signal inside the SVT block at that position.

Step 4: Select some candidate LPs to test in RDO fol-lowing these two sub-steps:

(a): Calculate a threshold SoGt as follows: SoGt = (SoGmax+SoGmin+1)>>1; (4)

where SoGmax and SoGmin are the maximum and minimum value of SoG among all the candidate LPs, respectively. Generally, the calculation of the threshold SoGt can be de-pendent on the statistics of SoG of all the candidate LPs, and/or other characteristics of the current (and neighbor-ing) macroblock. A larger threshold SoGt can reduce more candidate LPs tested in RDO, but on the other hand may degrade the coding efficiency because the selected candi-date LPs may not be among the best ones in terms of rate distortion tradeoff, and vice versa. Eq. (4) for calculating the threshold SoGt is derived according to our experience which shows a good tradeoff between performance and complexity.

(b): A candidate LP is selected to be tested if SoG for this candidate LP is larger than or equal to SoGt.

Besides the above method we also tried the following variations to select the candidate LPs to test and the results are given in section 4 for comparison.

(i) We calculate the average value of SoG of all the candidate LPs as the threshold:

1/

N

t ii

SoG SoG N=

=å (5)

where N is the number of all the candidate LPs. We then test the candidate LPs whose SoG is larger than or equal to SoGt.

(ii) We use the median value of SoG of all the candi-date LPs as the threshold SoGt and test the candidate LPs whose SoG is larger than or equal to SoGt.

(iii) We first sort the SoG of all the candidate LPs into non-increasing order and calculate the differ-ence of the neighboring SoG in the ordered list. Then search the largest difference from the be-ginning of the ordered list which for example is in the i-th (1≤i≤N, where N is the number of all the candidate LPs) position of the list. Set the threshold SoGt to be SoG in the i-th position of the ordered list. In other words, we use the largest difference of ordered SoG as a criterion to derive the threshold SoGt.

Step 5. End An example of the PSASVT algorithm is illustrated in

Fig. 4 below:

(a) (b) (c) Fig.4 (a) the prediction signal; (b) the gradient map; (c) the abso-lute prediction error signal; as an example, the SVT block marked in blue has SoGmin and will not be checked in RDO and the SVT block marked in green has SoGmax and will be checked in RDO.

We note that a more general way to decide on the SVT

positions would be to use an estimated RD cost of the whole macroblock for each candidate LP from the prediction sig-nal. Generally this can be done by minimizing the follow-ing:

J = D + λ · R (6) where D is the estimated distortion which can be estimated from the gradient outside the SVT block and R is the esti-mated rate which can be obtained from the gradient inside the SVT block. This generalization method is under further study, however, according to our experiments, simply cal-culating the sum of gradient of the prediction signal, as in our current implementation, a good tradeoff between per-formance and complexity could be achieved.

4. EXPERIMENTAL RESULTS We implemented the proposed idea PSASVT on Tandberg, Ericsson and Nokia test model (TENTM) [5][6] in order to evaluate its effectiveness. TENTM was a joint proposal to the High-Efficiency Video Coding (HEVC) standardization effort, which achieves similar visual quality measured using Mean Opinion Score (MOS) to H.264/AVC High Profile anchors, and has a better tradeoff between complexity and coding efficiency than H.264/AVC Baseline Profile [5][6]. Because of this, it is selected as the platform in our experi-ments. The important coding parameters are as follows:

• Quad-tree based coding structure with a support for macroblocks of size 64×64, 32×32 and 16×16 pixels;

• Frame structure is IPPP; • QPI=11, 16, 21, 26, QPP=QPI+2; • Low complexity variable length coding scheme with

improved context adaptation [5][6]; • The proposed algorithm is only used for 16×16 motion

compensation partition. For other motion compensation partitions, a previously developed algorithm of selecting available candidate LP based on motion difference [1][7] is used.

The anchor is the TENTM when SVT is turned off. We measure the average bitrate reduction (ΔBD-RATE) com-pared to the anchor according to Bjontegaard metric [8], recommended by the VCEG, which is an average difference between two RD curves. The set of sequences tested in this paper corresponds to the Joint Collaborative Team on

Video Coding (JCT-VC) test set [9] extended by 6 addi-tional sequences. The results are shown in Table 1 below.

We have tested four variations of the proposed algo-rithm, namely: (i) PSASVT_MID: use (4) to calculate the threshold SoGt; (ii) PSASVT_AVG: use (5) to calculate the threshold SoGt; (iii) PSASVT_MED: use the method as explained in step 3 (ii) of the algorithm in section 3 to cal-culate the threshold SoGt; (iv) PSASVT_DIFF: use the me-thod as explained in step 3 (iii) of the algorithm in section 3 to calculate the threshold SoGt. From the experimental results, we can see that encoding complexity reduction and slight coding efficiency improvement are achieved for all cases. In a general sense, PSASVT_MID performs better than PSASVT_AVG and PSASVT_MED, and can reduce on average 21.70% of the candidate LPs tested in RDO, while also achieves on average 0.18% bitrate reduction. PSASVT_DIFF can reduce most of the candidate LPs tested in RDO (on average 24.46%) among all the cases, while achieves slightly less bitrate reduction (on average 0.13%). The complexity of PSASVT_DIFF is higher than others mainly because it needs to sort the SoG of all the candidate LPs into non-increasing order and calculate the difference of the neighboring SoG in the ordered list. On the contrary, the complexity of PSASVT_MID could be (one of) the low-est among all cases.

For all the cases, the reduction in encoding complexity and slight coding efficiency improvement are achieved with only a little complexity increase in decoder. The decoding complexity increase is mainly due to the calculation of the gradient map of the macroblock and the sum of gradient of the prediction signal inside the SVT block for each candi-date LP. This needs only to be conducted when 16×16 mo-tion compensation partition and SVT are used, which is typically only a small percent (less than 5% on average) of all the macroblocks in a sequence. The percentages of ma-croblocks coded in SVT for different sequences in each case for all partitions and for only 16×16 partition are given in Table 2.

5. CONCLUSIONS

Spatially Varying Transform (SVT) is a technique intro-duced earlier to improve the coding efficiency of video cod-ers [1][2]. In SVT, the position of the transform block with-in the macroblock, which is indicated by Location Parame-ter (LP), can be varied in order to better localize the under-lying prediction error signal. The coding efficiency of SVT comes with an increase in encoding complexity due to the need to search for the best candidate LP among a number of candidate LPs. In this paper, a new technique called Predic-tion Signal Aided Spatially Varying Transform (PSASVT) is proposed, which enables reducing the number of candi-date LPs based on the prediction signal. Experimental re-sults show that PSASVT can reduce on average more than

20% of the candidate LPs tested in RDO, while also achiev-ing slight increase in coding efficiency, because less over-head bits are needed to code the selected LP. Moreover, the bitrate reduction and the encoder complexity reduction are achieved with only a little complexity increase in decoder.

6. REFERENCES

[1] C. Zhang, K. Ugur, J. Lainema A. Hallapuro and M. Gabbouj, “Video Coding Using Spatially Varying Transform”, IEEE Trans-actions on Circuits and Systems for Video Technology, vol. 21, Issue. 2, pp. 127-140, Feb. 2011. [2] C. Zhang, K. Ugur, J. Lainema and M. Gabbouj, “Video Cod-ing Using Spatially Varying Transform,” in Advances in Image and Video Technology, Lecture Notes in Computer Science, vol. 5414, Springer 2009. [3] M. T. Orchard and G. J. Sullivan, “Overlapped block motion compensation: An estimation-theoretic approach,” IEEE Trans. Image Processing, vol. 3, pp. 693-699, Sept. 1994. [4] H. Schiller, “Prediction signal controlled scans for improved motion compensated video coding”, Electronics Letters, Vol. 29, Issue. 5, pp. 433-435, 1993.

[5] K. Ugur, K.R. Andersson, A. Fuldseth, et al, “High Perfor-mance, Low Complexity Video Coding and the Emerging HEVC Coding Standard”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 20, Issue. 12, pp. 1688-1697, Dec. 2010. [6] K. Ugur, K.R. Andersson, A. Fuldseth, “Video coding tech-nology proposal by Tandberg, Nokia, and Ericsson”, JCTVC-A119, Dresden, Germany, 15-23 April, 2010. [7] C. Zhang, K. Ugur, J. Lainema, A. Hallapuro, and M. Gab-bouj, "Low Complexity Algorithm for Spatially Varying Trans-forms," PCS 2009, May 6-8, 2009, Chicago, Illinois, USA. [8] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves”, VCEG Doc. VCEG-M33, March 2001. [9] ISO/IEC JTC1/SC29/WG11, “Joint Call for Proposals on Vid-eo Compression Technology”, MPEG Doc. N11113, Jan. 2010. [10] W. Zheng, Y. Shishikui, M. Naemura, Y. Kanatsugu and S. Itoh, “Analysis of space-dependent characteristics of motion-compensated frame differences based on a statistical motion dis-tribution model”, IEEE Trans. Image Processing, vol. 11, pp. 377-386, April. 2002.

Table 1 Experimental Results

Sequences SVT PSASVT_MID PSASVT_AVG PSASVT_MED PSASVT_DIFF BD-rate

(%) BD-rate

(%)

Reduction of LPs

tested in RDO (%)

BD-rate (%)

Reduction of LPs

tested in RDO (%)

BD-rate (%)

Reduction of LPs

tested in RDO (%)

BD-rate (%)

Reduction of LPs

tested in RDO (%)

Traffic -2.88 -3.06 -20.64 -3.10 -18.49 -3.04 -17.52 -2.90 -23.01 PeopleOnStreet -1.67 -1.74 -24.35 -1.73 -21.31 -1.75 -19.82 -1.72 -27.76 Avg 2560x1600 -2.28 -2.40 -22.50 -2.41 -19.90 -2.39 -18.67 -2.31 -25.39 Kimono -0.65 -0.66 -22.38 -0.70 -20.56 -0.64 -19.57 -0.67 -24.60 ParkScene -2.22 -2.32 -20.70 -2.34 -18.78 -2.31 -17.83 -2.30 -22.99 Cactus -2.02 -2.11 -21.58 -2.07 -19.22 -2.06 -18.19 -2.13 -24.07 BasketballDrive -3.32 -3.59 -23.26 -3.55 -20.50 -3.51 -19.49 -3.64 -26.50 BQTerrace -2.69 -2.86 -21.72 -2.86 -19.31 -2.81 -18.13 -2.78 -24.14 Avg 1080p -2.18 -2.31 -21.93 -2.31 -19.67 -2.27 -18.64 -2.30 -24.54 Vidyo1 -2.21 -2.56 -20.57 -2.35 -18.42 -2.37 -17.49 -2.41 -23.58 Vidyo3 -5.04 -5.17 -20.90 -5.15 -18.55 -5.05 -18.33 -5.16 -23.22 Vidyo4 -3.15 -3.24 -21.39 -3.34 -18.76 -3.28 -18.81 -3.40 -24.43 Avg 720p -3.47 -3.66 -20.95 -3.61 -18.58 -3.57 -18.21 -3.66 -23.74 BQMall -3.84 -4.10 -22.10 -4.02 -19.34 -4.01 -18.24 -4.03 -25.02 PartyScene -2.82 -2.96 -21.40 -2.94 -18.61 -2.93 -17.65 -2.84 -24.81 RaceHorses -1.03 -1.01 -23.82 -1.01 -21.57 -1.02 -20.42 -0.96 -26.66 BasketballDrill -2.62 -2.80 -21.05 -2.79 -19.36 -2.72 -18.28 -2.82 -22.88 Avg 832x480 -2.58 -2.72 -22.09 -2.69 -19.72 -2.67 -18.65 -2.66 -24.84 BlowingBubbles -2.94 -2.97 -20.26 -2.98 -18.15 -2.98 -17.29 -2.98 -22.82 BQSquare -2.19 -2.37 -20.52 -2.31 -18.27 -2.30 -17.20 -2.32 -22.62 RaceHorses -1.13 -1.14 -23.64 -1.20 -21.11 -1.13 -19.90 -1.12 -26.65 BasketballPass -2.54 -2.64 -23.19 -2.60 -20.34 -2.65 -19.20 -2.60 -26.53 Avg 416x240 -2.20 -2.28 -21.90 -2.27 -19.47 -2.26 -18.40 -2.26 -24.66 BigShips -3.31 -3.72 -21.20 -3.68 -18.68 -3.59 -17.69 -3.64 -24.34 ShuttleStart -1.71 -2.15 -20.54 -1.95 -19.70 -2.09 -6.78 -1.97 -24.08 Cyclists -0.99 -1.24 -21.67 -1.14 -19.28 -1.28 -18.31 -1.05 -24.81 Panslow -3.98 -4.22 -20.45 -4.18 -18.37 -4.20 -17.62 -4.08 -22.59 Sheriff -2.46 -2.83 -21.17 -2.72 -18.99 -2.74 -18.03 -2.67 -23.78

Sailormen -3.04 -3.43 -22.19 -3.38 -19.37 -3.38 -18.36 -3.46 -25.25 Avg 720p extra -2.58 -2.93 -21.20 -2.84 -19.07 -2.88 -16.13 -2.81 -24.14 Avg all -2.52 -2.70 -21.70 -2.67 -19.38 -2.66 -17.92 -2.65 -24.46

Table 2 Percentage of macroblocks coded in SVT for different sequences Sequences SVT PSASVT_MID PSASVT_AVG PSASVT_MED PSASVT_DIFF

All par-titions (%)

16×16 partition

(%)

All par-titions (%)

16×16 partition

(%)

All par-titions (%)

16×16 partition

(%)

All par-titions (%)

16×16 partition

(%)

All par-titions (%)

16×16 parti-

tion (%) Traffic 7.80 4.10 8.15 4.25 8.13 4.26 8.12 4.26 7.90 3.69 PeopleOnStreet 11.36 5.35 11.78 5.36 11.79 5.41 11.81 5.46 11.34 4.63 Avg 2560x1600 9.58 4.73 9.97 4.81 9.96 4.84 9.97 4.86 9.62 4.16 Kimono 4.43 2.30 4.40 2.01 4.42 2.04 4.45 2.09 4.21 1.73 ParkScene 10.21 5.36 10.47 5.13 10.43 5.16 10.45 5.21 10.11 4.40 Cactus 6.78 3.61 6.92 3.37 6.91 3.41 6.90 3.44 6.61 2.78 BasketballDrive 7.80 4.60 8.04 4.74 7.95 4.67 7.97 4.68 7.85 4.31 BQTerrace 7.05 4.21 7.36 4.36 7.31 4.32 7.28 4.30 7.10 3.78 Avg 1080p 7.25 4.02 7.44 3.81 7.40 3.92 7.41 3.94 7.18 3.40 Vidyo1 3.33 1.76 3.52 1.94 3.50 1.91 3.48 1.88 3.46 1.77 Vidyo3 4.08 2.40 4.36 2.71 4.31 2.66 4.28 2.61 4.43 2.66 Vidyo4 3.54 2.09 3.72 2.22 3.68 2.19 3.66 2.17 3.66 2.05 Avg 720p 3.65 2.08 3.87 2.29 3.83 2.25 3.81 2.22 3.85 2.16 BQMall 11.69 6.41 12.10 6.56 12.05 6.49 12.04 6.48 11.93 5.94 PartyScene 18.12 11.02 19.02 11.90 18.85 11.76 18.77 11.64 18.76 10.30 RaceHorses 9.48 4.32 9.53 4.10 9.60 4.13 9.57 4.13 9.26 3.60 BasketballDrill 5.51 2.48 5.68 2.38 5.66 2.38 5.66 2.39 5.53 2.14 Avg 832x480 11.20 6.06 11.58 6.24 11.54 6.19 11.51 6.16 11.37 5.50 BlowingBubbles 16.06 8.42 16.50 8.29 16.39 8.35 16.45 8.43 16.29 7.33 BQSquare 12.64 7.70 13.55 8.39 13.36 8.23 13.35 8.19 13.33 7.67 RaceHorses 12.45 5.06 12.60 4.80 12.63 4.81 12.67 4.80 12.39 4.22 BasketballPass 10.32 4.93 10.55 4.88 10.52 4.88 10.63 4.95 10.42 4.36 Avg 416x240 12.87 6.53 13.30 6.59 13.23 6.57 13.28 6.59 13.11 5.90 BigShips 11.36 6.33 11.88 6.40 11.84 6.43 11.86 6.44 11.57 5.50 ShuttleStart 3.18 1.81 3.42 2.03 3.38 2.00 3.34 1.96 3.29 1.75 Cyclists 6.57 3.66 7.21 4.29 7.08 4.17 7.03 4.10 7.00 3.94 Panslow 10.48 6.29 10.54 5.78 10.51 5.80 10.55 5.83 10.11 4.84 Sheriff 10.09 5.13 10.64 5.58 10.58 5.53 10.54 5.50 10.41 4.98 Sailormen 11.53 6.60 12.00 6.75 11.95 6.74 11.95 6.76 11.72 6.05 Avg 720p extra 8.87 4.97 9.28 5.14 9.22 5.11 9.21 5.10 9.02 4.51 Avg all 8.99 4.83 9.33 4.93 9.28 4.91 9.28 4.90 9.11 4.35

improved transform methods for low complexity, high quality video coding

Documents