data-dependent low-power 8x8 dct/idct · 2005-02-09 · design and evaluation of a data-dependent...
TRANSCRIPT
Design and Evaluation of a
Data-Dependent Low-Power 8x8 DCT/IDCT
Cheng-Yu Pai
A Thesis
in
The Department
of
Electrical and Computer Engineering
Presented in Partial Fulfillment of the Requirements
for the Degree of Master of Applied Science (Electrical) at
Concordia University
Montreal, Quebec, Canada
December 2000
© Cheng-Yu Pai, 2000
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Design and Evaluation of a
Data-Dependent Low-Power 8 x 8 DCT/IDCT
Cheng-Yu Pai*
Traditional fast Discrete Cosine Transform (DCT)/Inverse DCT (IDCT) algorithms have focused on reducing arithmetic complexity and have fixed run-time complexities regardless of the input. Recently, data-dependent signal processing has been applied to the DCT/IDCT. These algorithms have variable run-time complexities.
A new two-dimensional 8x8 low-power DCT/IDCT design is implemented in VHDL by applying the data-dependent signal-processing concept to the traditional fixed-complexity fast DCT/IDCT algorithm. To reduce power, the design is based on Loeffler's fast algorithm, which uses a low number of multiplications. On top of that, zero bypassing, data segmentation, input truncation, and hardwired canonical sign-digit (CSD) multipliers are used to reduce the run-time computation, hence reducing the switching activities and thus the power.
When synthesized using Canadian Microelectronics Corporation 3-V 0.35 µm CMOSP technology, this FDCT/IDCT design consumes 122.7/124.9 mW at a clock frequency of 40 MHz and a processing rate of 320M samples/sec. With technology scaling to 0.35 µm technology, the proposed design features lower switching capacitance per sample, i.e. is more power-efficient, than other previously reported high-performance FDCT/IDCT designs.

* This work is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) post-graduate scholarship, and by NSERC research grants.
Keywords: Data-dependent computation, discrete cosine transform (DCT), inverse discrete cosine transform (IDCT), low power, canonical sign-digit multiplier.
Acknowledgements

I would like to express my deepest and most sincere gratitude toward my supervisors, Dr. Asim J. Al-Khalili and Dr. William E. Lynch. They have given me clear and helpful guidance throughout my years as a master's student. Above all, I wish to thank them for the great amount of time devoted to me and my work.

I wish to acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) Post-Graduate Scholarship (PGS-A) and the NSERC research grants. Their financial support allowed me to concentrate my time and effort on my research.

I would also like to thank my fellow friends Wassim Tout, Wei Wang, and VLSI lab specialist Ted Obuchowicz for helping me through the technical problems with the simulation environments, and for giving me their valuable opinions about the comparison strategy.

Finally, I would like to dedicate this work to my family for their love and support. I thank you all for your patience and your sacrifices. This work is as much yours as it is mine.
Table of Contents

List of Figures
List of Tables
List of Acronyms
1. Introduction
   1.1. Research Motivation
   1.2. Contribution of this Thesis
   1.3. Power Measurement Criteria
   1.4. Thesis Organization
2. Background of FDCT/IDCT
   2.1. Definition of DCT and its Inverse
   2.2. Choices of Algorithms
      2.2.1. Chen's Algorithm Family
      2.2.2. Loeffler's FDCT/IDCT Algorithm
      2.2.3. Jeong's FDCT Algorithm
      2.2.4. Summary and Comparison of Algorithm Complexities
   2.3. Precision Requirements of IDCT
   2.4. Chapter Summary
3. Design Choices for the FDCT/IDCT
   3.1. Data-Dependent Loeffler's FDCT Algorithm
      3.1.1. Data-Dependent Bypassing Logic
      3.1.2. Truncate Some Least-Significant Bits from Input
   3.2. Data-Dependent Loeffler's IDCT Algorithm
   3.3. Transpose Memory Architecture
   3.4. Chapter Summary
4. Multiplier Architectures
   4.1. Survey of Constant Multiplication Schemes
      4.1.1. Modified Booth Multiplier
      4.1.2. Distributed Arithmetic (DA)
      4.1.3. Hardwired Canonical-Sign-Digit (CSD) Wallace-Tree Multiplier
      4.1.4. Pattern-Based CSD Multiplier
   4.2. CSD Multiplier Implementation Procedure
   4.3. Multiplier Synthesis Result
   4.4. Chapter Summary
5. Implementation
   5.1. Hardwired CSD Multiplier Generator
   5.2. IEEE Standard 1180-1990 IDCT Compliant
   5.3. Pipelining Design
   5.4. Chapter Summary
6. Synthesis Results
   6.1. Synthesis Results of the Proposed Design
   6.2. Comparison with Past FDCT/IDCT VLSI Implementations
   6.3. Chapter Summary
7. Conclusion
   7.1. Summary of Research
   7.2. Conclusion
   7.3. Possible Improvements for Future Research
Bibliography
Appendix A: Truncation Test Result
Appendix B: Sample Output of CSD Multiplier Generator
Appendix C: Source Code of Constant Multiplier Generator
Appendix D: IEEE Standard 1180-1990 Compliant Test Program
List of Figures

Figure 1: General block diagram of video compression encoder
Figure 2: 2-D FDCT/IDCT using row-column (separable) method
Figure 3: Loeffler's FDCT algorithm
Figure 4: Loeffler's IDCT algorithm
Figure 5: Jeong's fast FDCT algorithm
Figure 6: Setup for measuring the accuracy of a proposed 8x8 IDCT
Figure 7: Zero Bypassing Multiplier
Figure 8: Multiplication Segmentation
Figure 9: 2-D row-column FDCT with truncation
Figure 10: Test model to measure the effect of truncation
Figure 11: Ping-pong transpose memory
Figure 12: On-the-fly 8x8 Transpose Memory
Figure 13: States of the transpose matrix for different clock cycles
Figure 14: Converting binary number 0110010111 into CSD representation
Figure 15: Hardwired CSD multiplier for multiplying cos(3π/16) with 8-bit unsigned integer
Figure 16: Hardwired CSD multiplier for multiplying cos(3π/16) with 8-bit signed integer
Figure 17: Pipelined kcn block
List of Tables

Table 1: Transfer function of Loeffler's FDCT building blocks
Table 2: Transfer function of Loeffler's IDCT building blocks
Table 3: Complexities of different FDCT algorithms
Table 4: IEEE Standard 1180-1990 IDCT Precision Requirement
Table 5: Truncation errors against the number of truncated bits
Table 6: Comparison of general-purpose multiplication against ROM-based multiplication
Table 7: Canonical sign-digit representation of cos(nπ/16)
Table 8: Truth table of b-1
Table 9: Truth table to simplify sign-extension
Table 10: Comparison of 32-bit CSD Wallace-tree multiplier with 4 different general-purpose multipliers using Xilinx 4052XL-1 FPGA technology
Table 11: IEEE Standard 1180-1990 Compliance for Proposed DCT
Table 12: Latencies for 1-D FDCT and 1-D IDCT
Table 13: Latencies for 2-D FDCT and 2-D IDCT
Table 14: Process and Specifications of the proposed FDCT/IDCT designs
Table 15: Summary of specifications of several FDCT/IDCT chips
Table 16: Energy Efficiency (Switching Capacitances/Sample in 0.35 µm technology)
Table 17: Truncation errors of test sequences: coke, salesman, and tennis
List of Acronyms

CCITT   International Telegraph and Telephone Consultative Committee
CMC     Canadian Microelectronics Corporation
CMG     Constant Multiplier Generator
CPA     Carry Propagate Adder
CSA     Carry Save Adder
CSD     Canonical Sign-Digit
DA      Distributed Arithmetic
dB      Decibel
DCT     Discrete Cosine Transform
DFT     Discrete Fourier Transform
FDCT    Forward Discrete Cosine Transform
FPGA    Field-Programmable Gate-Array
HDTV    High-Definition TV
IDCT    Inverse Discrete Cosine Transform
IEEE    Institute of Electrical and Electronics Engineers
JPEG    Joint Photographic Experts Group
MC      Motion Compensation
ME      Motion Estimation
MHz     Mega-Hertz
MOS     Metal-Oxide Semiconductor
MPEG    Moving Picture Experts Group
MUX     Multiplexer
NMOS    N-type MOS
PMOS    P-type MOS
PSNR    Peak Signal-to-Noise Ratio
ROM     Read-Only Memory
SD      Sign-Digit
SFG     Signal Flow Graph
VLC     Variable-Length Coding
VLSI    Very Large-Scale Integration
Chapter 1
Introduction
1.1. Research Motivation
Waveform compression has been an important research topic, and it has wide industry applications. The term waveform is a generic term that can be applied to speech, still-image, or video signals. Generally speaking, these waveforms require large storage in physical devices, and require large communication bandwidth to transmit. For example, one hour of colored 704x480 frame-size video requires 704x480 (bytes/frame) x 1.5 (for color frames) x 30 (frames/sec) x 60 (sec/min) x 60 (min/hour) ≈ 54.7 GB to store/transmit. That is an enormous amount of data. Due to the nature of these signals, redundancies can be removed by means of waveform compression. In practice, for video signals, one can achieve from 40:1 (for high quality) up to 80:1 (for low quality) compression ratios. In other words, one hour of digital video requires only about 1.37 GB to store or transmit.
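The storage arithmetic above can be checked directly (a short Python sketch; the 40:1 ratio is the high-quality figure quoted in the text):

```python
# One hour of 704x480 color video, uncompressed, as computed in the text:
# bytes/frame x 1.5 (for color) x 30 frames/sec x 3600 sec/hour.
bytes_per_frame = 704 * 480 * 1.5
raw_bytes = bytes_per_frame * 30 * 60 * 60
print(raw_bytes / 1e9)        # ~54.7 GB

# At a 40:1 (high-quality) compression ratio:
print(raw_bytes / 40 / 1e9)   # ~1.37 GB
```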
The discrete cosine transform (DCT) has been widely used in waveform compression because it features good energy compaction and low computational complexity. It has become an integral part of many waveform compression standards, such as JPEG, MPEG-2, MPEG-4, CCITT Recommendations H.261 and H.263, and HDTV. [36]
The DCT, like the Discrete Fourier Transform (DFT), is used to transform the signal to the frequency domain. Unlike the DFT, which uses complex exponentials as basis functions, the DCT uses cosines (real numbers) as basis functions. Since the human audio-visual system is less sensitive to high-frequency harmonics, waveform compression standards use the DCT to transform the signal to the frequency domain and perform compression on the DCT coefficients.
As an example, for video compression, both temporal and spatial redundancies are eliminated, as shown in Figure 1. The motion estimation/motion compensation (ME/MC) block is used to reduce temporal redundancies due to high correlation among adjacent frames. The forward DCT (FDCT) together with the quantizer is used to reduce spatial redundancies. Finally, the variable-length coder (VLC) is used to reduce coding redundancies.
Figure 1: General block diagram of video compression encoder (uncompressed sequence in, compressed sequence out)
With the advances in communication and VLSI technologies, it is expected that video telephony/conferencing on mobile devices will become more and more common in the future. Because mobile devices operate on battery power, in order to increase battery life and the time between recharges, mobile devices always have stringent power specifications. Also, to save valuable communication bandwidth, video compression is always performed in these applications. As a result, the DCT chip is an integral part of mobile video communication devices, and the design of a low-power DCT chip is an important problem. In this thesis, a low-power data-dependent DCT/IDCT design is presented to meet this need.
1.2. Contribution of this Thesis
Many earlier fast DCT algorithms are aimed at reducing the number of
multiplications because general-purpose multipIiers are assumed to be the basic hardware
elements for computing the DCT. Later on, other design techniques, such as digital
filtering and distributed arithmetic (DA), are also used to compute DCT [9]. In more
recent works. data-dependent DCT algorithms ' have been introduced in [19]-[2 11 [ B I .
unIike traditional algorithms, which have fixed-cornputation complexity, data-dependent
algorithms have variable run-time complexities that depend on the statistical properties of
the input data. They may yield fewer or more computations in the nin-time than the fixed
complexity dgorithms.
To reduce the power consumption, optimizations are performed at both the algorithmic level and the architectural level. The low-complexity Loeffler's [10] fast FDCT/IDCT algorithm is chosen to reduce the hardware requirement, which in turn reduces power.

The concept of data-dependent signal processing has also been applied to the fixed-complexity Loeffler's algorithm to reduce the switching activities. For both the FDCT and IDCT, zero-bypassing logic is inserted into the circuit to bypass redundant computations. The zero-bypassing logic takes advantage of the high correlation among input data for the FDCT, and the high proportion of zero inputs for the IDCT. Furthermore, the FDCT design also truncates bits from its input to reduce the amount of data to be processed, consequently reducing power consumption. The error introduced by the truncation is also analyzed in the thesis.
Further architectural optimization is performed on the multipliers. Since multiplication is a high-complexity operation compared to addition, the FDCT/IDCT designs use hardwired canonical sign-digit (CSD) Wallace-tree multipliers, since they consume the least power among the multipliers surveyed.
To summarize, the main contributions made in this thesis are as follows:

- Introduce a new data-dependent FDCT/IDCT algorithm by merging the data-dependent processing concept with a fast FDCT/IDCT algorithm.
- Empirically study the effect of truncating some least-significant bits of the FDCT input to save computation.
- Derive a detailed design procedure for implementing low-power constant-coefficient multipliers.
- Develop a code generator, written in C++, that generates VHDL code of constant multipliers for different specifications.
1.3. Power Measurement Criteria
In VLSI design, it is always difficult to compare one design with another due to different process technology (feature size), supply voltages, operating frequency, implementation approach (full-custom, semi-custom, etc.), optimization parameters, and design algorithms/architectures. Depending on the design goal, several comparison methods have been suggested and used, such as A, P, T, PT, AT, AP, etc., where A stands for area, T stands for time (delay), and P stands for power. Unfortunately, these measurement criteria give rough measures, which do not take all process technology into account.
In this thesis, the proposed design is compared with other reported designs by comparing the switching capacitance per sample, which has been used in [19]-[21], [28]. In VLSI design, power can be estimated from the well-known formula:

P = (1/2) p_t C_L f_clk V_DD^2    (1)

where P is the power, p_t is the switching probability, C_L is the load capacitance (of the DCT/IDCT in this case), f_clk is the clock frequency, and V_DD is the supply voltage. From equation (1), the switching capacitance is defined as (1/2) p_t C_L, and the switching capacitance per sample can be obtained by dividing the switching capacitance by the number of input/output samples per clock cycle. Since switching capacitance is directly proportional to power, this measurement method leads to comparing relative energy efficiency rather than absolute values such as in AP, PT, etc. It indicates how much power (switching capacitance) is required to obtain one output.
The main advantage of this method is that it takes out the effect of different process technology by performing technology scaling. Thus, to compare a design in one technology with another design in a different technology, technology scaling is first performed on the measured power; then the effects of clock frequency and supply voltage are factored out to obtain the switching capacitance per sample.
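As a numeric illustration of this normalization (a Python sketch using the figures quoted in the abstract: 124.9 mW at 40 MHz; the 320M samples/sec rate implies 8 samples per clock cycle; Vdd = 3.3 V is an assumption, since the transcript gives only "3-V"):

```python
# From equation (1), P = C_sw * f_clk * Vdd^2 with C_sw = (1/2) p_t C_L,
# so the switching capacitance is C_sw = P / (f_clk * Vdd^2).
P = 124.9e-3          # measured IDCT power (W), from the abstract
f_clk = 40e6          # clock frequency (Hz)
Vdd = 3.3             # supply voltage (V): assumed value for illustration
c_sw = P / (f_clk * Vdd ** 2)

# 320M samples/sec at a 40 MHz clock means 8 samples per clock cycle.
c_per_sample = c_sw / 8
print(c_sw * 1e12, "pF;", c_per_sample * 1e12, "pF/sample")
```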
1.4. Thesis Organization
The organization of this thesis is as follows: in Chapter 2, the definition of the discrete cosine transform and its inverse, and the algorithm used in the proposed design, are described. Chapter 3 describes the data-dependent signal processing concept and how it is incorporated into the design. Chapter 4 summarizes the pros and cons of several multiplier architectures, and provides a detailed design procedure for the selected multiplier, the hardwired canonical sign-digit (CSD) Wallace-tree multiplier. Chapter 5 describes the design automation effort made to facilitate the implementation of hardwired multipliers. The IDCT accuracy test result and pipelining design are also described. In Chapter 6, synthesis results of the new FDCT/IDCT designs are reported and compared against previously reported implementations.
Chapter 2
Background of FDCT/IDCT
Since there exist many DCT definitions [38], the forward DCT (FDCT) and its
inverse (IDCT) are defined in Section 2.1 for clarification.
Numerous fast algorithms for both the FDCT and IDCT have been reported in the literature. Most of them attempted to minimize the number of additions and multiplications ([1], [8]-[15], [17]-[18], [19], etc.). These algorithms usually take advantage of the symmetry in the cosine basis functions, and the computation complexity is fixed for all input data (data-independent algorithms). Since multiplication requires more hardware and computation time than addition, fewer multiplications imply lower power.
In Section 2.2, several existing fast FDCT/IDCT algorithms are studied and compared. Loeffler's [10] algorithm is chosen as the fundamental FDCT/IDCT algorithm of the proposed design.
Since the FDCT is always followed by a quantizer, its precision requirement is not high. On the contrary, the IDCT is used to perform the inverse transformation at both the encoder and the decoder, which requires high precision. It needs to conform to IEEE Standard 1180-1990, which is described in Section 2.3.
2.1. Definition of DCT and its Inverse
The N-point 1-D forward DCT (FDCT) is defined in equation (2):

X(n) = sqrt(2/N) C(n) sum_{k=0}^{N-1} x(k) cos[(2k+1)n pi / (2N)],  n = 0, 1, ..., N-1    (2)

The N-point 1-D inverse DCT (IDCT) is defined in equation (3):

x(k) = sqrt(2/N) sum_{n=0}^{N-1} C(n) X(n) cos[(2k+1)n pi / (2N)],  k = 0, 1, ..., N-1    (3)

where C(n) = 1/sqrt(2) for n = 0, and C(n) = 1 for n = 1, 2, ..., N-1.

Similarly, the NxN 2-D FDCT is defined as follows: [4]

X(m,n) = (2/N) C(m) C(n) sum_{j=0}^{N-1} sum_{k=0}^{N-1} x(j,k) cos[(2j+1)m pi / (2N)] cos[(2k+1)n pi / (2N)]    (4)

and the NxN 2-D IDCT is defined as:

x(j,k) = (2/N) sum_{m=0}^{N-1} sum_{n=0}^{N-1} C(m) C(n) X(m,n) cos[(2j+1)m pi / (2N)] cos[(2k+1)n pi / (2N)]    (5)
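A direct transcription of equations (2) and (3) (a Python sketch for illustration, not the hardware implementation) confirms that the IDCT inverts the FDCT:

```python
import math

def fdct_1d(x):
    # Equation (2): X(n) = sqrt(2/N) C(n) sum_k x(k) cos((2k+1)n*pi/(2N))
    N = len(x)
    C = lambda n: 1 / math.sqrt(2) if n == 0 else 1.0
    return [math.sqrt(2 / N) * C(n) *
            sum(x[k] * math.cos((2 * k + 1) * n * math.pi / (2 * N))
                for k in range(N))
            for n in range(N)]

def idct_1d(X):
    # Equation (3): x(k) = sqrt(2/N) sum_n C(n) X(n) cos((2k+1)n*pi/(2N))
    N = len(X)
    C = lambda n: 1 / math.sqrt(2) if n == 0 else 1.0
    return [math.sqrt(2 / N) *
            sum(C(n) * X[n] * math.cos((2 * k + 1) * n * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]

x = [12, -3, 7, 0, 5, 5, -8, 1]   # arbitrary 8-point input
assert all(abs(a - b) < 1e-9 for a, b in zip(x, idct_1d(fdct_1d(x))))
```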
Notice that the 2-D NxN FDCT/IDCT is a separable transformation, which means that it can be obtained by first performing the 1-D N-point DCT/IDCT on the rows, then performing the 1-D N-point DCT/IDCT on the columns, or the other way around. This method of computing the 2-D DCT/IDCT is generally referred to as the row-column method or indirect method. The general block diagram of this method is shown in Figure 2.
Figure 2: 2-D FDCT/IDCT using row-column (separable) method
The row-column method is the most popular method in VLSI implementations ([2]-[7], [9], [14]-[16], etc.). Also, since the 8x8 block size is used by MPEG and other standards, the FDCT/IDCT design presented in this thesis uses the 8x8 block size.
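Separability is easy to verify numerically; the sketch below (Python, using the 1-D definition of equation (2)) applies the 1-D FDCT to the rows and then to the columns of an 8x8 block and compares the result against the direct 2-D definition of equation (4):

```python
import math

def C(n):
    return 1 / math.sqrt(2) if n == 0 else 1.0

def fdct_1d(x):
    N = len(x)
    return [math.sqrt(2 / N) * C(n) *
            sum(x[k] * math.cos((2 * k + 1) * n * math.pi / (2 * N))
                for k in range(N)) for n in range(N)]

def fdct_2d_direct(x):
    # Direct N x N 2-D FDCT, equation (4)
    N = len(x)
    return [[(2 / N) * C(m) * C(n) *
             sum(x[j][k] * math.cos((2 * j + 1) * m * math.pi / (2 * N))
                        * math.cos((2 * k + 1) * n * math.pi / (2 * N))
                 for j in range(N) for k in range(N))
             for n in range(N)] for m in range(N)]

def fdct_2d_rowcol(x):
    # Row-column method: 1-D FDCT on each row, then on each column
    rows = [fdct_1d(r) for r in x]
    cols = [fdct_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

blk = [[(3 * i + 5 * j) % 17 - 8 for j in range(8)] for i in range(8)]
a, b = fdct_2d_direct(blk), fdct_2d_rowcol(blk)
assert all(abs(a[i][j] - b[i][j]) < 1e-9 for i in range(8) for j in range(8))
```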
2.2. Choices of Algorithms

Many fast DCT/IDCT algorithms have been reported in the literature. In this section, several fixed-complexity algorithms are reviewed and compared based on their arithmetic complexities. The comparison suggests that Loeffler's FDCT/IDCT algorithm is the most efficient, and it is used as the basis of the proposed design.
2.2.1. Chen's Algorithm Family
Chen's fast algorithm [1], reported in 1977, is by far the most widely used DCT/IDCT algorithm. It has been used in [2]-[7] and many other papers. It is a fixed-complexity algorithm. The idea of Chen's algorithm is to exploit the symmetry in the DCT/IDCT transformation matrix. The 8x8 DCT can be written in matrix form:
X = [T] x    (6)

where the entries of the 8x8 transformation matrix [T] are cosines. Following Chen's notation, let a = cos(pi/4), b = cos(pi/16), c = cos(pi/8), d = cos(3pi/16), e = cos(5pi/16), f = cos(3pi/8), and g = cos(7pi/16).

Since the even rows of the transformation matrix are even-symmetric and the odd rows are odd-symmetric, by exploiting the symmetry and separating even and odd rows, equation (6) can be rewritten as two 4x4 products (7): the even coefficients [X(0), X(2), X(4), X(6)] are obtained from the butterfly sums x(k) + x(7-k), and the odd coefficients [X(1), X(3), X(5), X(7)] from the butterfly differences x(k) - x(7-k) through

        [ b  d  e  g ]
(1/2) . [ d -g -b -e ]
        [ e -b  g  d ]
        [ g -e  d -b ]

Similarly, the 1-D IDCT can be rewritten in the same form (8); note that this odd-part matrix is symmetric, so its transpose, which appears in the inverse, is identical.
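The odd-row block above can be checked numerically; the sketch below (Python, with the 0.5 = sqrt(2/8) scale of the normalized 8-point DCT) verifies that the 4x4 matrix applied to the butterfly differences reproduces the odd DCT coefficients:

```python
import math

cos = math.cos
b, d, e, g = (cos(k * math.pi / 16) for k in (1, 3, 5, 7))
ODD = [[ b,  d,  e,  g],
       [ d, -g, -b, -e],
       [ e, -b,  g,  d],
       [ g, -e,  d, -b]]

def fdct_1d(x):
    # reference 8-point DCT from equation (2)
    N = len(x)
    C = lambda n: 1 / math.sqrt(2) if n == 0 else 1.0
    return [math.sqrt(2 / N) * C(n) *
            sum(x[k] * cos((2 * k + 1) * n * math.pi / (2 * N))
                for k in range(N)) for n in range(N)]

x = [7, -2, 4, 4, 0, -5, 3, 1]
diffs = [x[k] - x[7 - k] for k in range(4)]       # butterfly differences
odd = [0.5 * sum(ODD[r][k] * diffs[k] for k in range(4)) for r in range(4)]
ref = fdct_1d(x)
assert all(abs(odd[r] - ref[2 * r + 1]) < 1e-9 for r in range(4))
```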
2.2.2. Loeffler's FDCT/IDCT Algorithm
Loeffler's 1-D 8-point FDCT algorithm uses only 11 multiplications and 29 additions. The signal flow graph (SFG) of the 8-point 1-D DCT is shown in Figure 3, and the transfer functions of the building blocks are given in Table 1.
Figure 3: Loeffler's FDCT algorithm [10] (stages 1 to 4)
Table 1: Transfer function of Loeffler's FDCT building blocks [10] (columns: Symbol, Equation, Effort)
Notice that the second building block (kcn), the rotator with transfer function O0 = I0 (k cos(n pi / 2N)) + I1 (k sin(n pi / 2N)) and O1 = -I0 (k sin(n pi / 2N)) + I1 (k cos(n pi / 2N)), requires only 3 multiplications and 3 additions, instead of 4 multiplications and 2 additions, when equation (9) is used:

O0 = a I0 + b I1 = (b - a) I1 + a (I0 + I1)
O1 = -b I0 + a I1 = -(a + b) I0 + a (I0 + I1)    (9)

where a = k cos(n pi / 2N) and b = k sin(n pi / 2N). The shared product a (I0 + I1) is computed once, so only 3 multiplications are needed.
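Equation (9) is an algebraic identity; a quick numerical check (a Python sketch of the kcn rotator written both ways, sharing the product a (I0 + I1)):

```python
import math

def rotate_4mult(i0, i1, k, theta):
    # Direct rotator: 4 multiplications, 2 additions
    a, b = k * math.cos(theta), k * math.sin(theta)
    return a * i0 + b * i1, -b * i0 + a * i1

def rotate_3mult(i0, i1, k, theta):
    # Equation (9): share a*(i0+i1), leaving 3 multiplications, 3 additions
    a, b = k * math.cos(theta), k * math.sin(theta)
    shared = a * (i0 + i1)
    return (b - a) * i1 + shared, -(a + b) * i0 + shared

o = rotate_4mult(13, -7, math.sqrt(2), 3 * math.pi / 16)
p = rotate_3mult(13, -7, math.sqrt(2), 3 * math.pi / 16)
assert all(abs(u - v) < 1e-9 for u, v in zip(o, p))
```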
By reversing the transfer function of each building block shown in Table 1, and reversing the signal-flow direction, it is easy to show that the IDCT has the SFG shown in Figure 4, with the building-block transfer functions shown in Table 2. Notice that the Loeffler IDCT algorithm has the same arithmetic complexity as in the FDCT case (11 multiplications and 29 additions). Notice also that division by 2 is considered to require no operation, since it can be realized by dropping the least-significant bit of the value to be divided.
Figure 4: Loeffler's IDCT algorithm (stages 1 to 4)
Table 2: Transfer function of Loeffler's IDCT building blocks (columns: Symbol, Equation, Effort)
2.2.3. Jeong's FDCT Algorithm

Jeong's [13] 8-point FDCT algorithm, reported in 1998, uses 28 additions and 12 multiplications. This algorithm is special because it performs most multiplications at the final stage and requires fewer multiplication stages than other algorithms, so propagation errors occurring in the fixed-point computation can be reduced.

By separating even and odd points in the DCT, this algorithm uses trigonometric identities to reduce the number of multiplications needed to calculate the DCT.
Even points:

X(2l) = sum_{k=0}^{N/2-1} [x(k) + x(N-1-k)] cos[(2k+1) 2l pi / (2N)],  l in [0, 3]

Odd points:

X(2l+1) = [2 cos((2l+1) pi / (2N))]^{-1} sum_{m=0}^{N/2-1} [y(2m) + y(2m+1)] cos[(2l+1)(2m+1) pi / (2N)]

where y(k) = x(k) - x(N-1-k) and y(-1) = 0.
The signal flow graph is shown in Figure 5.

Figure 5: Jeong's fast FDCT algorithm [13]
2.2.4. Summary and Comparison of Algorithm Complexities

Since in a VLSI implementation each computation, i.e. addition and multiplication, requires hardware and consumes power, algorithms with fewer additions/multiplications lead to lower power. Also, since multiplication requires more power than addition, one algorithm is better than another if it requires fewer multiplications (for integer operations).

Table 3 summarizes the complexity of several fixed-complexity FDCT algorithms. In [34], Duhamel demonstrated that the theoretical lower bound for an 8-point DCT is 11 multiplications. Since the number of multiplications in Loeffler's [10] algorithm reaches the theoretical lower bound and the number of additions is not worse than other algorithms (except Jeong's), Loeffler's algorithm is chosen.
Table 3: Complexities of different FDCT algorithms (adapted from Table 1 in [10])

Algorithm       Multiplications  Additions
Chen [1]              16             26
Lee [11]              12             29
Wang [31]             13             29
Vetterli [32]         12             29
Suehiro [33]          12             29
Hu [12]               12             29
Jeong [13]            12             28
Loeffler [10]         11             29
2.3. Precision Requirements of IDCT
In video compression, the precision requirement of the FDCT is not high because it is always followed by heavy quantization. On the contrary, since the IDCT is used for sequence reconstruction, it is important for the IDCT to be computed with high precision.

The IEEE Standard 1180-1990 [27] defines the specification for implementations of the IDCT. The setup for measuring the accuracy of an 8x8 IDCT is shown in Figure 6.
Figure 6: Setup for measuring the accuracy of a proposed 8x8 IDCT (Figure 2 in [27]); the reference 8x8 FDCT and reference 8x8 IDCT are separable and orthogonal, computed with at least 64-bit floating-point accuracy, with round and clip stages before comparison.
The standard defines a random number generator that can generate numbers within the lower and upper bounds (-L and H) inclusive. Based on these random numbers, 10000 8x8 blocks for (L=256, H=255), (L=H=5), and (L=H=300) are used as input for the reference FDCT (see Figure 6), and passed through the diagram shown in Figure 6. The error e_k(i,j) is defined to be the difference between the "test" IDCT output x'_k(i,j) and the "reference" IDCT output x_k(i,j), i.e.:

e_k(i,j) = x'_k(i,j) - x_k(i,j)

The standard defines the following terms to measure the error (see Table 4).
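The error measures that Table 4 defines can be written out directly; below is a minimal Python sketch over K test blocks (the standard uses K = 10000 blocks per input range):

```python
def ieee1180_metrics(e):
    # e[k][i][j]: error of block k at pixel (i, j), for K 8x8 blocks
    K = len(e)
    ppe = max(abs(e[k][i][j]) for k in range(K)
              for i in range(8) for j in range(8))
    pmse = [[sum(e[k][i][j] ** 2 for k in range(K)) / K
             for j in range(8)] for i in range(8)]
    pme = [[sum(e[k][i][j] for k in range(K)) / K
            for j in range(8)] for i in range(8)]
    omse = sum(sum(row) for row in pmse) / 64
    ome = sum(sum(row) for row in pme) / 64
    return ppe, pmse, omse, pme, ome

# A zero error field (e.g. the all-zero-input case) gives all-zero metrics:
zero = [[[0] * 8 for _ in range(8)] for _ in range(3)]
ppe, pmse, omse, pme, ome = ieee1180_metrics(zero)
assert ppe == 0 and omse == 0 and ome == 0
```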
2.4. Chapter Surnmary
In ttiis chapter, the FDCT and IDCT are defined. Several fast fixed-complexity
FDCTKDCT algorithms are reviewed and their computational complexities are
surnrnarized in Table 3 . Since low arithmetic complexity usually implies low power, the
Loeffler's algorithm is used as the basis of the proposed design.
The E E E 1180-1990 standard is also described in this chapter. The standard
defines the precision requirements of DCT, which the new IDCT design will conform to.
In the next chapter, detailed discussion/description is presented to show how the
data-dependent concept is integrated into Loeffler's FDCT/IDCT dgonùun to make it a
data-dependent algorithm.
Table 4: IEEE Standard 1180-1990 IDCT precision requirements

Term                                     Definition                                               Maximum Magnitude
Peak error (ppe)                         max |e_k(i,j)| over all k, i, j                          1
Mean square error for any pixel (pmse)   pmse(i,j) = (1/10000) sum_{k=1}^{10000} e_k(i,j)^2       0.06
Overall mean square error (omse)         omse = (1/(64x10000)) sum_k sum_i sum_j e_k(i,j)^2       0.02
Mean error for any pixel (pme)           pme(i,j) = (1/10000) sum_{k=1}^{10000} e_k(i,j)          0.015
Overall mean error (ome)                 ome = (1/(64x10000)) sum_k sum_i sum_j e_k(i,j)          0.0015

In addition, for all-zero input, the proposed IDCT shall generate all-zero output.
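As a concrete reference for these definitions, the following sketch computes the five error figures over a set of 8x8 difference blocks. It is our own illustrative helper, not the Java compliance checker described later in this thesis; the struct and function names are assumptions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Error figures from IEEE Std 1180-1990, computed over K 8x8 blocks.
// e[k][i][j] holds the per-pixel difference between the "test" IDCT
// output and the 64-bit floating-point "reference" IDCT output.
struct Ieee1180Errors {
    double ppe;   // peak error:            max |e_k(i,j)|
    double pmse;  // worst per-pixel MSE:   max_(i,j) (1/K) sum_k e_k(i,j)^2
    double omse;  // overall MSE:           (1/(64K)) sum_{k,i,j} e_k(i,j)^2
    double pme;   // worst per-pixel mean:  max_(i,j) |(1/K) sum_k e_k(i,j)|
    double ome;   // overall mean error:    |(1/(64K)) sum_{k,i,j} e_k(i,j)|
};

Ieee1180Errors measure_errors(const std::vector<std::array<std::array<double, 8>, 8>>& e) {
    const double K = static_cast<double>(e.size());
    Ieee1180Errors r{0, 0, 0, 0, 0};
    double sum_all = 0.0, sumsq_all = 0.0;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j) {
            double sum = 0.0, sumsq = 0.0;
            for (const auto& blk : e) {
                double v = blk[i][j];
                r.ppe = std::max(r.ppe, std::fabs(v));
                sum += v;
                sumsq += v * v;
            }
            r.pmse = std::max(r.pmse, sumsq / K);      // per-pixel statistics
            r.pme = std::max(r.pme, std::fabs(sum / K));
            sum_all += sum;
            sumsq_all += sumsq;
        }
    r.omse = sumsq_all / (64.0 * K);                   // overall statistics
    r.ome = std::fabs(sum_all / (64.0 * K));
    return r;
}
```

The compliance test then simply checks each figure against the limits of Table 4.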
Chapter 3
Design Choices for the FDCT/IDCT
In this chapter, the data-dependent processing concept is applied to Loeffler's FDCT/IDCT algorithm. In Sections 3.1 and 3.2, data-dependent bypassing logic is inserted into Loeffler's FDCT/IDCT algorithms to achieve more power reduction. To further reduce the computational complexity, the least significant bits of the FDCT inputs are truncated. The effect of truncation is studied in detail.
Since the row-column method is used to compute the 2-D FDCT/IDCT by using two 1-D FDCT/IDCT with a transpose memory in between (see Figure 2), Section 3.3 studies two transpose memory architectures. The on-the-fly transpose memory architecture is used in this work.
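The row-column decomposition can be checked numerically. The sketch below is a floating-point reference model of the idea, not the fixed-point hardware path, and the function names are ours: a 1-D DCT is applied to each row, the result is transposed, and a 1-D DCT is applied again, which reproduces the direct 2-D definition.

```cpp
#include <cassert>
#include <cmath>

const double PI = 3.14159265358979323846;

double cu(int u) { return u == 0 ? 1.0 / std::sqrt(2.0) : 1.0; }

// Standard 8-point DCT-II.
void dct8(const double in[8], double out[8]) {
    for (int u = 0; u < 8; ++u) {
        double s = 0.0;
        for (int x = 0; x < 8; ++x)
            s += in[x] * std::cos((2 * x + 1) * u * PI / 16.0);
        out[u] = 0.5 * cu(u) * s;
    }
}

// Row-column 2-D DCT: rows, transpose memory, then (former) columns.
void dct2d_rowcol(const double in[8][8], double out[8][8]) {
    double t[8][8], tt[8][8], o[8][8];
    for (int i = 0; i < 8; ++i) dct8(in[i], t[i]);        // first dimension
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j) tt[i][j] = t[j][i];   // transpose memory
    for (int i = 0; i < 8; ++i) dct8(tt[i], o[i]);        // second dimension
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v) out[u][v] = o[v][u];  // back to (u,v) order
}

// Direct separable 2-D definition, for comparison.
double dct2d_direct(const double in[8][8], int u, int v) {
    double s = 0.0;
    for (int x = 0; x < 8; ++x)
        for (int y = 0; y < 8; ++y)
            s += in[x][y] * std::cos((2 * x + 1) * u * PI / 16.0)
                          * std::cos((2 * y + 1) * v * PI / 16.0);
    return 0.25 * cu(u) * cu(v) * s;
}
```

Both paths agree to floating-point precision, which is why the hardware only needs two 1-D units and one transpose memory.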
3.1. Data-Dependent Loeffler's FDCT Algorithm
To obtain a power-efficient design, data-dependent bypassing and truncation techniques are incorporated into Loeffler's FDCT algorithm.
3.1.1. Data-Dependent Bypassing Logic
Loeffler's FDCT algorithm performs several butterfly operations on the inputs (see Figure 3). In general, the inputs of the FDCT are well correlated. Thus, the subtractions used in the butterfly are very likely to produce zeros or small numbers. Since most multiplications are performed in the kcn blocks, and the inputs of the kcn blocks are the results of subtractions, adding zero-bypassing logic in front of each multiplication in the kcn blocks will reduce the number of multiplications. As shown in Figure 7, the zero-bypassing logic only adds the non-zero-detection logic (AND gate), a register, and a multiplexer (MUX) to the circuit. The overhead introduced, in both area and speed, is small compared to the multiplier itself.
[Figure: constant multiplier preceded by a non-zero register (loaded only when the input is non-zero), with non-zero-detection logic and an output MUX.]
Figure 7: Zero-bypassing multiplier
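The behavior of Figure 7 can be sketched as follows. This is a behavioral model of the data movement, not the gate-level circuit, and the struct name is an assumption: when the input is zero, the operand register keeps its old value so the multiplier array does not toggle, and the MUX forces the product to zero.

```cpp
#include <cassert>
#include <cstdint>

// Behavioral sketch of the zero-bypassing constant multiplier.
struct ZeroBypassMultiplier {
    int32_t coeff;        // hardwired constant
    int32_t reg = 0;      // operand register, loaded only when input != 0
    long mults_done = 0;  // counts multiplications actually performed

    int32_t step(int32_t x) {
        if (x == 0) return 0;  // bypass: MUX outputs 0, multiplier does not switch
        reg = x;               // non-zero detection enables the register load
        ++mults_done;
        return reg * coeff;
    }
};
```

The power saving comes directly from the fraction of inputs for which `step` takes the bypass branch.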
By segmenting the inputs of the multipliers into several smaller chunks (data segmentation), further computational reduction can be achieved by taking advantage of the fact that the inputs of the kcn block are very likely to be small numbers, because the inputs are obtained from butterflying highly correlated data. Thus, instead of multiplying x by c directly, the multiplication is done by breaking x into m segments, performing a multiplication on each segment, and then adding the products together with the proper offsets if necessary (see Figure 8). The sum of the products is still x*c. By inserting bypassing logic in front of each smaller multiplier, some of the small-number inputs can be bypassed, consequently reducing the switching activity and the power. For example, if x = 00000111b (7d) with two segments, x*c is performed as (0000*c)<<4 + 0111*c. With zero-bypassing logic inserted, 0000*c is bypassed and uses no operation.
[Figure: x split into segments, each segment multiplied by c; the shifted partial products are summed to form the final product.]
Figure 8: Multiplication segmentation
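The two-segment scheme of Figure 8 can be sketched directly. This is an illustrative software model (the function name and the 8-bit/4-bit split are our assumptions for the example in the text): the operand is split into a high and a low 4-bit segment, each segment is multiplied by the constant only if it is non-zero, and the partial products are recombined with a shift.

```cpp
#include <cassert>
#include <cstdint>

// Two-segment multiplication with zero bypassing: x*c = (hi*c << 4) + lo*c.
// A zero segment costs no multiplication, so x = 00000111b needs only one.
int32_t seg_mult(uint8_t x, int32_t c, int* mults = nullptr) {
    uint8_t hi = x >> 4, lo = x & 0x0F;
    int32_t acc = 0;
    if (hi != 0) { acc += (hi * c) << 4; if (mults) ++*mults; }  // bypassed when hi == 0
    if (lo != 0) { acc += lo * c;        if (mults) ++*mults; }  // bypassed when lo == 0
    return acc;
}
```

For well-correlated FDCT data the high segment is frequently zero, which is exactly the case the bypassing exploits.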
The choice of segment size affects the probability of zero bypassing. One extreme is to use only one segment, which is direct multiplication of x*c. The other extreme is to make each segment one bit only, which is essentially performing a shift-and-add operation. Theoretically, with one-bit segments, one can achieve the highest bypassing probability and use the lowest amount of multiplication. However, it requires the largest number of additions to sum the partial products into the final product: for n segments, one must add n partial products together. Having more segments implies more complicated control logic and more delay to produce the final result. Thus, with the trade-off between the probability of bypassing and the segmentation overhead in mind, we decided to use two segments for the FDCT multiplications. This allows bypassing of small numbers while keeping the segmentation overhead small, since there are only two partial products to be added.
3.1.2. Truncate Some Least-Significant Bits from the Input
Since the IEEE standard [27] defines precision requirements only for the IDCT, and since the FDCT is usually followed by quantization, in this thesis some least-significant bits (LSBs) of the FDCT inputs are truncated. Truncating input bits results in less computation and, consequently, reduces the power consumption and increases the speed. On the other hand, truncation introduces error at the output. Although some of the error introduced by the truncation will be masked by the heavy quantization that follows the FDCT module, the error still exists. Thus, truncation allows a trade-off between power and error. The goal is to find the best strategy to truncate input bits so that the error stays in an acceptable range depending on the application.
In the 2-D 8x8 FDCT, there are eight 8-point 1-D FDCTs in the first dimension (rows), and eight 8-point 1-D FDCTs in the second dimension (columns). Let Trunc(d,n) denote the number of bits to be truncated from the n-th 1-D FDCT of dimension d, where d = 1 (row), 2 (column) and n = 0...7. The truncation for all eight inputs of any 1-D 8-point FDCT is the same. Figure 9 illustrates the detailed view of the 2-D row-column FDCT with truncation.
[Figure: eight 1-D FDCTs on the rows (dimension 1), a transpose, and eight 1-D FDCTs on the columns (dimension 2), each 1-D block with its own Trunc(d,n) setting.]
Figure 9: 2-D row-column FDCT with truncation
If we allow truncating at most m bits from each 1-D FDCT, since there are 16 1-D FDCT blocks, there are a total of (m+1)^16 possible combinations (including the all-zero pattern of no truncation). Even when m is small, say m=1, there are still 65536 possibilities to be examined. Fortunately, not all combinations are valid from the distortion point of view. In practice, since human eyes/ears are less sensitive to high-frequency signal components, higher-frequency FDCT coefficients (larger n) are quantized more heavily than the lower-frequency coefficients. This fact suggests that the effect of truncation on the higher-frequency FDCT coefficients is smaller than on the lower-frequency coefficients. This argument leads to the following equation:
Trunc(d,n1) <= Trunc(d,n2) if n1 < n2    (10)
Further reduction of the test cases can be achieved due to the fact that the transpose matrix distributes all coefficients computed in each of the first-dimension FDCT modules to all second-dimension FDCT modules. Thus, all first-dimension (d=1) FDCT modules are equally important, i.e.:

For all n: Trunc(1,n) = k, where k is a constant    (11)

Since the truncation error introduced in the first stage affects the entire second stage, to obtain a more accurate result, k = 0 (no truncation at the first-dimension FDCT blocks) is used in the design of the FDCT.
Figure 10: Test model to measure the effect of truncation
To have a quantitative measure of the truncation effect, a standard MPEG-2 encoder is modified as the test model (see Figure 10). By changing Trunc(d,n), different PSNR values are measured. The PSNR values are then compared against the reference: the PSNR with no truncation (Trunc(d,n)=0 for all d, n). A smaller PSNR difference indicates smaller distortion introduced due to the truncation. The truncation error is defined as:

Truncation Error = Average PSNR(reference) - Average PSNR(truncation)    (13)
Since the goal is to save power, one truncation combination is better than another if it truncates more bits but has a higher PSNR (smaller truncation error), i.e. set 1 is preferred over set 2 if

sum_n Trunc_set1(2,n) > sum_n Trunc_set2(2,n)  and  PSNR_set1 > PSNR_set2
Three test video sequences (coke, salesman, and tennis) were used to measure the truncation errors. Each sequence has 180 frames and is encoded using pure I-frames at 8 Mb/s. The FDCT is computed with fixed-point calculation with 11-bit precision after the binary point.
To show the effect of truncation, all 165 possible combinations using m=3 (truncate at most 3 bits) and Trunc(1,n)=0 (no truncation for the first-dimension FDCT) are examined. The testing results (truncation errors) are shown in Appendix A.
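The figure of 165 combinations follows from Eq. (10): with Trunc(1,n) fixed at zero, the second-dimension pattern Trunc(2,0..7) must be a non-decreasing sequence over {0,...,m}. The short sketch below (our own counting helper, not part of the thesis tooling) enumerates such sequences recursively and reproduces 165 for m=3.

```cpp
#include <cassert>

// Counts truncation patterns Trunc(2,0..7) with 0..m bits truncated per
// 1-D FDCT, restricted by Eq. (10): Trunc(2,n1) <= Trunc(2,n2) for n1 < n2.
// n_remaining = positions still to fill, min_trunc = smallest value the
// next position may take (monotonicity constraint).
int count_patterns(int n_remaining, int min_trunc, int m) {
    if (n_remaining == 0) return 1;
    int total = 0;
    for (int t = min_trunc; t <= m; ++t)  // next value must not decrease
        total += count_patterns(n_remaining - 1, t, m);
    return total;
}
```

`count_patterns(8, 0, 3)` gives 165, the number of cases examined in Appendix A, instead of the 4^8 = 65536 unconstrained patterns.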
Table 5 illustrates the best truncation patterns and their average truncation errors compared to all other truncation patterns with the same total number of truncated bits. In this thesis, the truncation pattern Trunc(1,n)=0 and Trunc(2,n)=(1,1,1,1,1,1,1,1) is used in the implementation of the FDCT because its truncation error is moderate (around 0.5 dB).
[Table rows for total truncated bits 0 through 12, each with the best Trunc(2,n) pattern and its truncation error in dB; the numerical entries are not recoverable from the scan.]
Table 5: Truncation errors against the number of truncated bits
3.2. Data-Dependent Loeffler's IDCT Algorithm
Like the FDCT, the row-column method is used to compute the 2-D IDCT. Due to the heavy quantization in the encoder (for high compression), a high proportion of the coefficients are expected to be zero at the input of the first-dimension IDCT.
One problem with Xanthopoulos's data-dependent DCT designs in [19]-[21] is that they may result in more computation than the fixed-complexity fast algorithms. In the worst case, such as when the input does not satisfy the assumed statistical property, the data-dependent design in [19]-[21] may require as many as 1024 multiplications for the 2-D IDCT, i.e. it degenerates to its base algorithm (direct IDCT computation).
In this work, as for the FDCT, zero-bypassing logic is inserted into the IDCT circuit to reduce the number of computations. Since zero-bypassing logic does not increase the number of computations, even in the worst situation the data-dependent design yields the same complexity as the underlying Loeffler algorithm. In other words, in the worst scenario (none of the bypassing logic active), the data-dependent Loeffler 2-D IDCT algorithm uses 176 multiplications (2 dimensions x 8 rows (columns)/dimension x 11 multiplications/row (column)).
In real life, some of the zero-bypassing logic will be active, and the number of multiplications starts to depend on the distribution of the input data. For instance, if there is one non-zero coefficient in the input of the 1-D IDCT, the data-dependent Loeffler IDCT algorithm requires 0, 2, 5 or 6 multiplications depending on the position of the non-zero input. If the probability of the non-zero input position is the same for all 8 inputs, the algorithm requires only 3.25 multiplications on average. Thus, by applying zero-bypassing logic to Loeffler's IDCT algorithm, the fixed-complexity algorithm is transformed into a data-dependent algorithm. The new 2-D IDCT multiplication lower bound is the same as Xanthopoulos' (0), while the upper bound is significantly reduced from 1024 down to 176.
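The two bounds quoted above can be reproduced with simple arithmetic. In the sketch below, the worst case is exactly 2 x 8 x 11; the per-position multiplication counts used for the average are an illustrative assignment consistent with the figures quoted in the text (two positions each costing 0, 2, 5 and 6 multiplications), since the exact position-to-cost mapping depends on the Loeffler dataflow.

```cpp
#include <cassert>

// Worst case: no bypass fires, every 1-D pass runs all 11 multiplications.
int worst_case_mults() { return 2 * 8 * 11; }  // vs. 1024 for [19]-[21]

// Average for a single non-zero 1-D input, uniform over the 8 positions.
// The counts below are an assumed assignment matching the 0/2/5/6 figures.
double avg_mults_single_nonzero() {
    const int per_position[8] = {0, 0, 2, 2, 5, 5, 6, 6};
    int sum = 0;
    for (int c : per_position) sum += c;
    return sum / 8.0;
}
```

Under this assignment the average is 26/8 = 3.25 multiplications, as stated in the text.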
3.3. Transpose Memory Architecture
There are various ways to transpose an 8x8 matrix in hardware. The trivial way is to have two matrices (as shown in Figure 11). They are used for reading and writing alternately (ping-pong buffering). Two matrices are required since the data arrives row-by-row.
Figure 11: Ping-pong transpose memory
Another way to transpose a matrix is reported in [28]. As shown in Figure 12, only one matrix is required. Data is transposed on the fly by changing the shifting direction (top-to-bottom or left-to-right).
Figure 12: On-the-fly 8x8 transpose memory [28]
The state of the transposition matrix over the clock cycles is illustrated in Figure 13. To fill up the matrix, from clock cycle 1 to 8, the shifting direction is top-to-bottom. From clock cycle 9 to 16, the shifting direction is left-to-right. From clock cycle 17 to 24, the shifting direction is top-to-bottom again. Clock cycle 25 is identical to clock cycle 9, and so on.
[Figure: matrix contents after 8 cycles and at cycles 9, 10, 16, 17, 18, 23 and 24, showing the alternating shift directions.]
Figure 13: States of the transpose matrix for different clock cycles
Since an 8x8 matrix of n-bit elements is built with 64n flip-flops, if n is large, the area consumption will also be large. In the proposed FDCT/IDCT design, the on-the-fly transposition architecture is used, since it requires only 64n flip-flops instead of the 128n flip-flops of the ping-pong case.
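The key property of the single-matrix scheme can be shown with a behavioral model. The sketch below models the data movement only, not the shift-register implementation of [28], and the struct name is ours: while one block streams in row-by-row, the previous block streams out transposed from the same storage, and the access orientation flips every block (the role of the alternating shift direction in Figures 12 and 13).

```cpp
#include <cassert>
#include <cstdint>

// Behavioral sketch of a single-matrix transpose memory.
struct TransposeMemory {
    int16_t m[8][8] = {};
    bool row_major = true;  // current access orientation; flips every block

    // Writes one incoming row while reading out one transposed row of the
    // previously stored block. Each cell is read before it is overwritten,
    // which is why a single matrix suffices (no ping-pong buffer needed).
    void exchange_row(int n, const int16_t in[8], int16_t out[8]) {
        for (int k = 0; k < 8; ++k) {
            if (row_major) { out[k] = m[n][k]; m[n][k] = in[k]; }
            else           { out[k] = m[k][n]; m[k][n] = in[k]; }
        }
        if (n == 7) row_major = !row_major;  // next block flips orientation
    }
};
```

Feeding two blocks back-to-back, the second block's arrival clocks out the first block transposed, exactly as required between the two 1-D FDCT/IDCT stages.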
3.4. Chapter Summary
In this chapter, the data-dependent Loeffler FDCT/IDCT algorithms are described. Zero-bypassing logic is inserted into the fixed-complexity Loeffler algorithm to convert it into a data-dependent algorithm, on which the new design is based. For the FDCT, the input truncation technique was also analyzed and applied to further reduce the amount of data to be processed, and hence the power consumption. Based on the simulation results, we decided to truncate one bit from the input of the second-dimension FDCT.
The transpose memory architecture has also been studied. The on-the-fly transpose memory reported in [28] is chosen because it requires only half the area compared to the ping-pong architecture.
Since the multiplier is the fundamental building block of the FDCT/IDCT, in the next chapter, different multiplier architectures are analyzed based on low-power criteria.
Chapter 4
Multiplier Architectures
In VLSI implementation, floating-point multipliers are much larger, slower, and consume more power than fixed-point multipliers due to the normalization of the mantissa. For this reason, all FDCT/IDCT designs reviewed in this thesis used fixed-point multiplication instead of floating-point multiplication.
Since fixed-point or integer multipliers are larger, slower, and consume more power than adders, the choice of multiplier greatly affects the overall FDCT/IDCT performance and power consumption.
One special note about the multiplications performed in the FDCT/IDCT is that they are all constant multiplications, i.e. one of the multiplicands is a constant. In Section 4.1, several constant multiplication schemes are studied, and the hardwired CSD multiplier is chosen for the low-power design. Section 4.2 describes the design procedure of the hardwired CSD multipliers. In Section 4.3, synthesis is performed, and the result indicates that the CSD multipliers indeed consume less power than general-purpose multipliers.
4.1. Survey of Constant Multiplication Schemes
The following is a brief description of the characteristics of different constant multipliers. More detailed descriptions can be found in the references.
4.1.1. Modified Booth Multiplier
The modified Booth multiplier [35] is a popular general-purpose multiplier. Both of its multiplicands are variables that can be changed at run-time. However, in DCT/IDCT multiplications, only one of the multiplicands is variable; the other one is a constant (cos(nπ/16)). Having both operands of the multiplier variable implies more hardware and consequently more power. Thus, the general-purpose modified Booth multiplier is not a good choice for a low-power DCT/IDCT design.
4.1.2. Distributed Arithmetic (DA)
Distributed arithmetic (DA) is a bit-serial scheme that performs shift-and-add operations to multiply two numbers (one of which is a constant). It replaces the multiplication with additions and a ROM lookup table [14]. The input is used as an index into the ROM, the ROM contains the partial product of multiplying the address with the constant multiplicand, and the partial products are then added by using shift-and-add operations.
The main disadvantage of DA is that it is slow due to its bit-serial nature and its parallel-serial/serial-parallel conversions. This implies that it needs a higher internal clock frequency than parallel processing to do the same work. Moreover, the shifting consumes much power because of the high switching activity. In [14] and [15], the authors evaluated the trade-off between performance and power for three multiplication schemes: general-purpose multiplier, pure ROM based, and mixed ROM based (DA).
[Table: delay and power of the multiplier-based, pure-ROM and mixed-ROM (DA) implementations at different supply voltages; the numerical entries are not recoverable from the scan.]
Table 6: Comparison of general-purpose multiplication against ROM-based multiplication [14]
As shown in Table 6, the multiplier-based implementation is slower than the DA-based implementations. However, its power is about 30-50% less than the DA implementations, because about 85% of the entire DA chip runs at a higher frequency due to its bit-serial nature. As a result, DA is not a good choice for low-power design.
4.1.3. Hardwired Canonical-Sign-Digit (CSD) Wallace-Tree Multiplier
Hardwired multipliers hard-code the constant multiplicand by using only shift-and-add operations. Unlike DA, which performs the shift-and-add operations at run-time, these shifts are hard-wired at design time and consume no power. In other words, hardwired multipliers are simply Wallace-tree carry-save adders. This results in a smaller and more power-efficient multiplier than a general-purpose multiplier.
Further power reduction can be achieved on the fixed multiplicand by not using the 2's complement representation, but the radix-2 canonical sign-digit (CSD) representation. By definition, the canonical sign-digit representation is a redundant number system that represents numbers with no adjacent non-zero digits. Every number has a unique CSD representation [30]. It represents numbers with fewer or equally many non-zero digits as the algebraic sum/difference of several powers of two, i.e.:

c = sum_i s_i 2^(-i), where s_i ∈ {-1, 0, 1}
A procedure to transform a conventional binary number into the CSD representation is described in [30]. We have also derived a more intuitive transformation algorithm:
Given an (n+1)-digit binary number B = B_n B_{n-1} ... B_1 B_0 with B_n = 0 and B_i ∈ {0,1} for i ∈ [0, n-1], the following procedure converts B into the (n+1)-digit radix-2 canonical SD vector D = D_n D_{n-1} ... D_1 D_0 with D_n ∈ {0,1} and D_i ∈ {0,1,-1} for i ∈ [0, n-1], such that both vectors D and B represent the same value:
1. If there are consecutive 1's in B, continue to step 2. Otherwise, the resulting number B is in CSD representation (D). End the process.
2. Replace the rightmost (starting from the lowest-order 2^0 end) occurrence of the bit pattern 01...11 with 10...0(-1). This replacement is possible because 2^m - 2^l = 2^(m-1) + 2^(m-2) + ... + 2^l.
3. Go back to step 1.
Figure 14 shows a step-by-step example that converts the binary number 0110010111b (407 in decimal) into CSD representation. The consecutive 1's to be replaced are shaded in the figure. The resulting CSD representation of 407d is 10T010T00T, where T denotes -1. As expected, 407 = 2^9 - 2^7 + 2^5 - 2^3 - 2^0. In this example, the CSD representation reduces the number of non-zero digits from 6 down to 5.
[Figure: successive replacements of the rightmost run of consecutive 1's, ending in the CSD representation 10T010T00T.]
Figure 14: Converting the binary number 0110010111 into CSD representation
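The replacement procedure above can be sketched in software. The version below uses the standard non-adjacent-form recoding, which produces the same digits as repeatedly replacing the rightmost run 011...1 by 10...0(-1); the function names are ours.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Converts a non-negative integer into CSD digits (least-significant
// first), each in {-1, 0, 1} with no two adjacent non-zero digits.
std::vector<int> to_csd(uint32_t n) {
    std::vector<int> d;
    uint64_t v = n;
    while (v != 0) {
        if (v & 1) {
            int digit = 2 - (int)(v & 3);  // +1 if v mod 4 == 1, -1 if == 3
            d.push_back(digit);
            v -= digit;                    // clears the low bit either way
        } else {
            d.push_back(0);
        }
        v >>= 1;
    }
    return d;
}

// Reconstructs the value from the digits, for checking.
int64_t csd_value(const std::vector<int>& d) {
    int64_t v = 0;
    for (int i = (int)d.size() - 1; i >= 0; --i) v = 2 * v + d[i];
    return v;
}

int nonzero_digits(const std::vector<int>& d) {
    int c = 0;
    for (int x : d) c += (x != 0);
    return c;
}
```

For 407 this yields five non-zero digits (down from six in plain binary), matching the Figure 14 example.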
As another example, Table 7 shows the CSD representations of the constant operands (cos(nπ/16)) used in the FDCT/IDCT with 15-bit precision after the binary point (16 bits in total).
[Table: for each cos(nπ/16), the traditional binary bit pattern and the CSD bit pattern with their non-zero digit counts and the percentage of bits saved; the individual rows are not recoverable from the scan.]
Table 7: Canonical sign-digit representation of cos(nπ/16)
As shown in Table 7, the CSD representation can reduce the number of non-zero bits by up to 50% over the traditional representation. In a hardwired multiplier, each non-zero digit (except the first 3 non-zero digits) in the constant multiplicand requires one extra carry-save adder stage.
Because canonical means no adjacent non-zero digits, any n-bit number can be represented with at most ceil(n/2) non-zero digits, which in turn saves at least half of the carry-save adder stages compared to a general-purpose array multiplier. It can also be shown that CSD requires an average of n/3 additions [40]. Since fewer non-zero
bits imply less computation, less switching activity, and less power consumption, the hardwired CSD multiplier is a good choice for a low-power design.
4.1.4. Pattern-Based CSD Multiplier
The CSD representation uses the minimum number of shift-and-add (S&A) operations when multiplying a constant k with a variable x digit by digit. However, direct multiplication does not necessarily use the minimum number of S&A operations to perform x*k. In some situations, it is possible to find patterns inside the CSD representation which can be reused to avoid repeated computation. Thus, instead of multiplying x with k directly, x is multiplied with sub-expressions of k, and the partial products are then used to construct the final product. As an example, let k = 11100111b = 100T0100T in CSD (231d). Using the CSD representation without pattern searching, 231x requires 4 additions. However, with a pattern-based algorithm, 231x can be represented by (7x<<5)+7x, which requires only 3 additions. Bernstein's algorithm [41], Lefevre's algorithms [39-40], and Potkonjak's algorithm [42] are pattern-based algorithms.
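The 231x example can be made concrete. In the sketch below (our own illustration of the sub-expression idea), 7x is computed once as (x<<3)-x and reused, against the direct evaluation of the CSD terms 256x - 32x + 8x - x.

```cpp
#include <cassert>
#include <cstdint>

// Pattern-based evaluation: 231 = 11100111b = 7 * 33, so the shared
// sub-expression 7x is computed once and reused.
int64_t mult231_pattern(int64_t x) {
    int64_t x7 = (x << 3) - x;  // 7x, the reused pattern
    return (x7 << 5) + x7;      // 231x = 7x*32 + 7x
}

// Direct CSD evaluation of 231x = 100T0100T: 256x - 32x + 8x - x.
int64_t mult231_csd(int64_t x) {
    return (x << 8) - (x << 5) + (x << 3) - x;
}
```

Both produce the same product; the pattern-based form simply shares one partial product between two terms.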
The pattern-based algorithms are very useful for multiplication with very large constants, where the patterns can be reused frequently. For example, in encryption/decryption, the constant may have several hundreds or thousands of bits. In such situations, a pattern-based algorithm can reduce the computation significantly. However, for the purposes of the FDCT/IDCT and most DSP applications, the constant word lengths are usually small, and patterns (if any) are reused less frequently.
For pattern reuse, one must obtain the entire partial product, which requires using a carry-save adder (CSA) followed by a carry-propagate adder (CPA). In general, in VLSI implementations, a CPA is slower and consumes more power than a CSA due to carry propagation. The lower speed of the pattern-based algorithm can be compensated by adding pipeline registers after each CSA used for partial product (pattern) computation. The extra power consumption due to the carry propagation in the CPA can be reduced by using other types of adders, such as carry-bypass adders or carry-select adders. However, given that the patterns are not reused frequently, the overall power consumption of the pattern-based multiplier is still larger than that of the one without patterns. Since the design criterion of this thesis is power, only CSD multiplication without patterns is considered, and all multipliers used in the FDCT/IDCT are hardwired CSD multipliers.
Notice that the application of the hardwired CSD Wallace-tree multiplier is not restricted to the FDCT/IDCT. It can be used in many other digital signal processing (DSP) applications, such as digital filters, where fixed-coefficient multiplication is required.
4.2. CSD Multiplier Implementation Procedure
To design a hardwired CSD multiplier for multiplying an unsigned variable integer operand (v) with a constant operand, we derived the following steps:
1. Obtain the CSD representation of the constant operand by using the algorithm described in Section 4.1.3.
2. For each non-zero bit position p in the constant operand:
- For each 1 in the constant operand, place the unsigned variable operand, i.e. performing v x 2^p.
- For each -1 in the constant operand, negate the unsigned variable operand with a 1 placed at the least-significant bit (2's complement), and extend 1's to the left of the most-significant bit (sign extension) of the variable operand, i.e. performing (-v) x 2^p.
3. Simplify the diagram by adding the constant 1's together to avoid redundant computation at run time. By studying the truth table of addition, we found that further optimization can be achieved by using Identity 1.1:
Identity 1.1: A variable bit b plus a constant 1 results in sum ~b and carry b, where ~b denotes the NOT operation.

Table 8: Truth table of b+1
b      0  1
Sum    1  0
Carry  0  1

This identity allows removing one operand to be added at position p at the cost of one extra operand to be added at position p+1. Intelligent use of this identity can reduce the number of carry-save adder (CSA) stages (critical path delay) without introducing any extra hardware.
4. Combine the operands placed in step 2 and the simplified constant 1's (from step 3) with carry-save adders in Wallace-tree form. The output of the carry-propagate adder is the product of the variable input operand and the constant operand.
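What the finished circuit computes can be summarized in a few lines. The sketch below is an arithmetic model of the hardwired CSD multiplier, not its gate-level structure: each non-zero CSD digit contributes one shifted copy of the variable operand, added (digit +1) or subtracted (digit -1). In hardware the shifts are wiring, the sums form the Wallace tree of carry-save adders, and the -1 terms become inversion plus a constant 1 simplified at design time with Identity 1.1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Arithmetic model of a hardwired CSD multiplier. csd_digits holds the
// constant's CSD digits, least-significant first, each in {-1, 0, 1}.
int64_t csd_constant_mult(int64_t x, const std::vector<int>& csd_digits) {
    int64_t acc = 0;
    int64_t weight = 1;  // 2^p; in hardware this shift is just wiring
    for (int d : csd_digits) {
        acc += (int64_t)d * x * weight;  // one summand per non-zero digit
        weight <<= 1;
    }
    return acc;
}
```

With the CSD digits of 407 (10T010T00T), for example, this evaluates 2^9·x - 2^7·x + 2^5·x - 2^3·x - x = 407x.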
To illustrate the above algorithm, Figure 15 shows the procedure for constructing a hardwired CSD multiplier of the constant cos(3π/16) multiplied with an 8-bit unsigned integer. The constant cos(3π/16) is chosen because it contains the most non-zero bits in Table 7. As shown in Figure 15, in step 3, the application of Identity 1.1 reduces the number of operands to be summed in the CSA tree from 7 down to 4. As a result, the multiplication of cos(3π/16) with an 8-bit unsigned number has a critical path of only 2 full-adder stages plus a 19-bit CPA. Notice that despite the fact that the multiplier uses the CSD representation for the constant operand, both the variable operand and the product are still in 2's complement representation.
[Figure: step 1 gives the CSD digits of cos(3π/16); step 2 places a shifted copy of the operand for each non-zero digit; step 3 simplifies the constant 1's with Identity 1.1; step 4 builds the Wallace tree of carry-save adders.]
Figure 15: Hardwired CSD multiplier for multiplying cos(3π/16) with an 8-bit unsigned integer
Similarly, to multiply a signed 2's complement variable operand (v) having a sign bit (s) with a constant operand, the following procedure is derived:
1. Obtain the CSD representation of the constant operand.
2. For each non-zero bit position p in the constant operand:
- For each 1 in the constant operand, place the signed variable operand, i.e. performing v x 2^p. Sign-extend towards the left.
- For each -1 in the constant operand, negate the signed variable operand v with a 1 placed at the least-significant bit (2's complement), and extend ~s (the negated sign bit) to the left of the most-significant bit (sign extension) of the variable operand, i.e. performing (-v) x 2^p.
3. Simplify the sign-extension bits and the constant 1's in the diagram:
- Let s=0, replace all s with 0 and ~s with 1, add all constant 1's together, and obtain a constant value SE0.
- Let s=1, replace all s with 1 and ~s with 0, add all constant 1's together, and obtain a constant value SE1.
- For each bit position p, merge SE0 and SE1 together to obtain another value SE using the following truth table:
Table 9: Truth table to simplify the sign extension
- Remove all sign-extension bits (s or ~s), and insert SE into the diagram.
- Like the unsigned case, apply Identity 1.1 where suitable.
4. Combine the operands placed in step 2 and the simplified sign-extension bits and constant 1's (from step 3) with carry-save adders in Wallace-tree form. The output of the carry-propagate adder is the result of multiplying the variable signed 2's complement input operand with the constant operand.
As in the unsigned case, a step-by-step illustration of the construction of a hardwired CSD multiplier of the constant operand cos(3π/16) multiplied with an 8-bit signed 2's complement integer is shown in Figure 16.
[Figure: steps 1-4 for the signed case: CSD digits of the constant, placement of sign-extended operands, simplification of the sign-extension bits into SE0 (s=0) and SE1 (s=1) and of the constant 1's with Identity 1.1, and the final Wallace-tree construction.]
Figure 16: Hardwired CSD multiplier for multiplying cos(3π/16) with an 8-bit signed integer
4.3. Multiplier Synthesis Results
To demonstrate that the hardwired CSD Wallace-tree constant multiplier consumes less power and area while offering comparable speed, its delay, area and power consumption figures are compared with those of other popular 32-bit general-purpose multipliers.
Since the hardwired CSD multiplier has one constant operand, several CSD multipliers are implemented with the different constant operands used in the FDCT/IDCT (cos(nπ/16) and √2). All constants have a 1-bit integer part and a 31-bit fraction part to form a 32-bit fixed-point number. The constants are then multiplied with a 32-bit signed integer (variable operand). All multipliers are synthesized using Xilinx 4052XL-1 FPGA technology.
[Table: delay, area and power of the proposed CSD Wallace-tree multiplier against 32-bit array, Booth, Wallace-tree and modified Booth-Wallace-tree multipliers; the numerical entries are not recoverable from the scan.]
Table 10: Comparison of the 32-bit CSD Wallace-tree multiplier with 4 different general-purpose multipliers using Xilinx 4052XL-1 FPGA technology (columns 1-5 adopted from Table 1 in [36])
As shown in Table 10, the CSD multiplier uses the least amount of area and power (less than half of the power of the array multiplier) while offering speed comparable to the other multipliers (around 100 ns). This result agrees with the analysis: the hardwired CSD multiplier is more power-efficient than the general-purpose multipliers. Therefore, hardwired CSD Wallace-tree multipliers are used in the FDCT/IDCT designs presented in this thesis.
4.4. Chapter Summary
In this chapter, by analyzing different constant multiplication schemes, a new constant-coefficient multiplier design is presented. The multiplier is based on the canonical sign-digit representation with a Wallace-tree formation. As shown in the analysis and simulation, the CSD multiplier is both more power- and area-efficient than a general-purpose multiplier while offering similar speed. Consequently, it is used in the FDCT/IDCT design presented in this work. Detailed design procedures for both unsigned and signed integers are also described.
In the next chapter, more implementation details, such as design automation and pipeline design, are presented.
Chapter 5
Implementation
Since the main efforts are concentrated on the arithmetic level (data-dependent algorithm) and the implementation level (hardwired CSD multipliers), we decided to use VHDL to implement the FDCT/IDCT designs. No optimization at the circuit level or technology level is made.
To ensure error-free coding, some design automation effort is made. In Section 5.1, a C++ program that generates the VHDL code of hardwired CSD Wallace-tree multipliers is developed. Similarly, to make the IDCT design compliant with IEEE Std. 1180-1990, in Section 5.2, a Java program is developed that calculates the error figures defined in the IEEE standard [27] for different internal bandwidths. The pipeline designs for both the FDCT and IDCT are also described in this chapter (Section 5.3).
5.1. Hardwired CSD Multiplier Generator
Since the FDCT/IDCT design uses hardwired CSD multipliers, a different multiplier is required for each constant operand and each bandwidth of the variable operand. To save design time and avoid bugs in the coding, it is ideal to generate the constant multipliers through a code generator.
Several constant multiplier generators [40][43-44] have been reported in the literature. All of them are optimized for the Xilinx FPGA 4000 and Virtex technologies. To have a technology-independent constant multiplier generator, a C++ program that generates VHDL code for hardwired CSD multipliers was developed. The program is called the constant multiplier generator (CMG). The C++ source code of the generator is listed in Appendix C and on the attached CD.
The CMG is capable of generating VHDL code that multiplies a signed/unsigned variable operand with any positive integer constant multiplicand. The constant operand can have the size of the long type in the C++ language. The CMG takes the following information from the user:
- VHDL entity name.
- Integer value of the constant operand: for a real-number constant operand, use the integer value of the corresponding fixed-point representation. For Intel Pentium processors running 32-bit Microsoft Windows, the constant operand is limited to the range 0 to 2147483647.
- Variable operand: the number of bits of the signed/unsigned variable operand.
- Product least-significant-bit truncation: this feature is useful for real-number (fixed-point) multiplications. In many situations, not all bits of the fractional part are required. Truncating some least-significant bits from the product results in a smaller, faster, and more power-efficient multiplier. The truncation error has been analyzed in [15].
The generator uses the algorithm described in Section 4.2 to generate the VHDL code. At the end of the code generation, it also reports critical statistics: the number of carry-save adder stages, and the numbers of inverters, half adders, and full adders. This information is useful for power, area, speed, and pipelining analysis.
As an example, for the constant operand cos(3π/16) with 15-bit precision multiplied
with a 12-bit variable operand and no truncation, the CMG generates the VHDL code
shown in Appendix B.
5.2. IEEE Standard 1180-1990 IDCT Compliance
To ensure the proposed IDCT chip conforms to IEEE Standard 1180-1990, a Java
program is developed. The program reads the data path bandwidths, multiplier precisions,
and truncation patterns used in Loeffler's IDCT in each pipeline stage from a file, and
calculates the error figures (ppe, pmse, omse, pme, and ome) defined in the standard (see
Section 2.3). Again, the source code is listed in Appendix D and on the attached CD.
Notice that Java is chosen as the programming language because the long type in Java is
a 64-bit integer, which is more suitable for simulating fixed-point arithmetic. In C++, the
size of a data type is machine dependent, while in Java, the size of a data type is machine
independent and fixed.
After testing different combinations of internal bandwidths, the first dimension
IDCT produces l;-bit integer / j-bit precision fixed-point output. The second dimension
IDCT produces 14-bit signed integer output (after rounding of the 10-bit precision result).
The 2-D IDCT presented in this thesis conforms to the IEEE 1180-1990 Standard. The
compliance test results are shown in Table 11.
The zero-input, zero-output test is passed, and |ppe| ≤ 1 for all tests.
Table 11: IEEE Standard 1180-1990 compliance for the proposed IDCT. For all six
random-data ranges ([-300,300], [-256,255], [-5,5], and their negated counterparts
-[-300,300], -[-256,255], -[-5,5]), the measured error figures fall within the limits of
the standard: pme = 0 for every range (limit 0.015), ome is at most 0.00085 (limit
0.0015), and every measured pmse and omse value lies below 0.017 (limits 0.06 and
0.02, respectively).

5.3. Pipelining Design

Since the hardwired CSD multiplier is essentially a carry-save adder, and the
speed of the carry-save adder is mostly limited by the carry-propagate adder, the speed of
the FDCT/IDCT is directly related to the carry-propagate adders. Thus, it is logical to insert
pipelining registers after each adder (including the adders in the multipliers). As shown in
Figure 17, for kcn blocks in the IDCT, there are 3 pipeline stages (add inputs, multiply, and
add product). For kcn blocks in the FDCT, there are 4 stages. The extra stage is required to
add the partial products of the segmented multiplications. Therefore, there are 10 pipeline
delays (latency) for the 1-D FDCT, and 8 pipeline delays for the 1-D IDCT (see Table 12).

Figure 17: Pipelined kcn block (Stage 1: add inputs; Stage 2: multiply; Stage 3: add product)

Table 12: Latencies for the 1-D FDCT and 1-D IDCT. The 1-D FDCT has a total
latency of 10 clock cycles over its four pipeline stages; the 1-D IDCT has a total
latency of 8 clock cycles over its three pipeline stages.

For the transpose memory, the on-the-fly transpose memory architecture is used.
From Figure 12, it is clear that the latency is 8 clock cycles because the transposed output
can be obtained starting from the 9th clock cycle.

To summarize, the proposed 2-D FDCT has a latency of 28 clock cycles, and the 2-D
IDCT has a latency of 24 clock cycles (see Table 13).

Latency             FDCT    IDCT
First Dimension       10       8
Transpose Memory       8       8
Second Dimension      10       8
Total                 28      24
Table 13: Latencies for the 2-D FDCT and 2-D IDCT

5.4. Chapter Summary

In this chapter, a new constant CSD multiplier generator is introduced. Written in
C++, the program generates VHDL code that multiplies a constant integer operand with a
signed/unsigned variable operand. Truncation can also be applied to the product to reduce
hardware, power, and delay.

A Java program is developed to select the internal bandwidths such that the 2-D
8x8 IDCT conforms to IEEE Standard 1180-1990.

Both the FDCT and IDCT designs have also been pipelined to achieve a throughput of
1 output per clock cycle. The latency is 28 clock cycles for the FDCT, and 24 clock cycles for the
IDCT.

In the next chapter, the VHDL code of the proposed FDCT/IDCT chip is
synthesized using Synopsys with Canadian Microelectronics Corporation (CMC) 3-volt
0.35-μm technology. Synthesis results (power/area/delay) are compared with previous
works.
Chapter 6
Synthesis Results
In this chapter, synthesis results of the proposed FDCT/IDCT are presented in Section
6.1. The proposed design is compared with previously reported designs in Section 6.2,
using the switching-capacitance-per-sample criterion described in Section 1.3.
6.1. Synthesis Results of the Proposed Design
The VHDL code of the proposed FDCTRDCT core is synthesized using Synopsis
with Canadian Microelectronic Corporation (CMC) 3-volt 0.35-pn technology. Since the
design goal is low power, the compiler constraint is set to minimize the dynamic power
consumption (ideally zero). The synthesis result indicates that the proposed FDCT core
consumes 222.7mW at 40MHz' and IDCT core consumes 124.9mW at 40MHz. The
detailed specifications of the new FDCT/IDCT design are s h o w in Table 14.
Only the dynamic power reported by Synopsys is compared with other designs
in the next section. In real life, there may be other power consumptions, such as leakage
power and short-circuit power. Since the leakage power is related to the fabrication,
which is not the concern of this work, it is ignored in the comparison. As for the short-
circuit power, it is assumed to be small and negligible, which is usually the case in
practice. Its effect can be minimized with proper timing design.
The power measurements are performed under the worst-case condition where the
assumed statistical properties do not hold, i.e. under white noise input. In this situation,
most of the bypassing logic is not active, and the power consumption is higher. This
simulation condition is chosen because in real life, for MPEG-2 video compression, the
assumed statistical properties apply only to I-frames, but less so to B-frames and P-
frames. For those frames, the redundancies at the input have already been reduced, and the
input behaves like white noise.
                                     FDCT                         IDCT
Process Technology                   CMC 0.35μm CMOSP technology
Supply Voltage (V)                   3 Volts
Input/Output Numeric System          2's complement signed integer
Input Specification                  8 inputs/clock cycle
Input Bandwidth (for each input)     9-bit                        12-bit
Output Bandwidth (for each output)   17-bit                       14-bit
Operating Frequency (MHz)            40                           40
Processing Rate (samples/sec)        320 M                        320 M
Throughput                           8 outputs/clock cycle
Latency (clock cycles)               28                           24
Dynamic Power (mW)                   122.6666                     124.8587
Leakage Power (nW)                   16.8610                      18.6860
Area (reported by Synopsys)          3.2969125                    3.2548425
Maximum Pipeline Stage Delay (ns)    24.43                        24.51
Table 14: Process and specifications of the proposed FDCT/IDCT designs

6.2. Comparison with past FDCT/IDCT VLSI implementations

Many FDCT/IDCT VLSI implementations have been reported in the literature.
The specifications of several recent high-performance FDCT/IDCT chips are summarized
in Table 15. Due to different process technologies (supply voltage, operating frequency,
etc.), implementation approaches (full-custom, semi-custom, etc.), optimization parameters
(RTL, transistor level, layout level, etc.), and design algorithms/architectures, comparing
different implementations is always a tough job in VLSI design. Also, in some situations,
not all measurement figures are reported. As a result, it is very difficult to compare one
design with another accurately.
Implementation                   Process                  Area / Transistors  Supply Voltage                  Power                    Clock Rate
Toshiba 1994 FDCT/IDCT [23][24]  0.6μm, 2ML               13.3mm² / 120K      3.3V / 2V                       0.35W at 3.3V, 200MHz;   200 MHz
                                                                                                              0.15W at 2V, 100MHz
Toshiba 1996 FDCT/IDCT [22]      0.3μm, 2ML, triple well  4mm² / 120K         0.9V, VT=0.15/0.1V              10mW                     150 MHz
AT&T IDCT [37]                   0.5μm, 3ML               7mm² / 69K          3V                              250mW                    58 MHz
Xanthopoulos's IDCT [25]         0.6μm, 3ML               20.7mm² / 160K      1.1-1.9V, VTN/VTP=0.66/-0.92V,  4.65mW at 1.32V, 14MHz   5-43 MHz
                                                                              TOX=9.6nm
Xanthopoulos's FDCT [28]         0.6μm                    20.7mm² / 160K      1.1-3V, VTN/VTP=0.75/-0.82V,    4.38mW at 14MHz          2-43 MHz
                                                                              TOX=14.8nm
Sarmiento's FDCT [6]             0.6μm, E/D-MESFET GaAs   32.2mm² / 51K       3V                              7W                       600 MHz
Table 15: Summary of specifications of several FDCT/IDCT chips

In order to compare the proposed design with other works fairly, like [19]-[21],
the switching capacitance per sample (hence power per sample) is calculated and
compared. It can be used as an indication of energy efficiency since it is directly
proportional to the power consumption required to process each input sample.
As described in Section 1.3, the switching capacitance of each design is obtained
by dividing the power by the frequency and the squared voltage. Notice that the switching
capacitance per sample is obtained by dividing the switching capacitance by the number
of samples per clock cycle.
Technology scaling is also performed to normalize all designs to
0.35μm technology. The scaling factor from 0.35μm (CMC 0.35μm CMOSP) technology
to 0.5μm (0.6μm drawn) (CMC CMOSIS5) technology is obtained by performing
HSPICE simulations on two inverters, one as the load of the other. For both technologies,
the power supply is 3 volts with a 40-MHz 3-volt square pulse input. The PMOSs have size
L=Wmin with W=4Wmin, and the NMOSs have size L=Wmin with W=2Wmin, where Wmin is
the minimum feature size of the corresponding technology. The simulation result
indicates that the power consumption is 0.634 mW for the 0.35μm technology, and 1.19 mW
for the 0.5μm technology. Since both circuits operate at the same voltage and
frequency, the ratio between the powers is the ratio between the switching capacitances.
For simplicity, the 0.5μm and 0.6μm technologies are treated equally, and similarly the 0.3μm
and 0.35μm technologies. Thus, the switching capacitance in 0.5μm (and 0.6μm)
technology is multiplied by 0.532 to scale to 0.35μm technology. The effect of
circuit-level optimization, such as the variable threshold voltage used in [22], is ignored since
it cannot be quantified correctly.
The switching capacitance per sample is shown in Table 16 after technology
normalization. As an example, the switching capacitance per sample of the proposed
FDCT design is calculated as 122.6666×10⁻³ / (320×10⁶ × 3²) = 42.6 pF. For
Xanthopoulos's IDCT, which is a 0.6μm design, technology scaling is performed, and the
switching capacitance per sample is calculated as
(4.65×10⁻³ × 0.532) / (14×10⁶ × 1.32²) = 101.6 pF, where 0.532 is the technology
scaling factor to scale the power of a 0.5μm technology down to 0.35μm technology.
As shown in Table 16, the proposed data-dependent FDCT/IDCT designs have
the least switching capacitance per sample, i.e. they consume the least amount of power to
process each input data sample. Thus, the proposed FDCT/IDCT design is the most power-
efficient one among the designs reviewed in this thesis.

Implementation                   Switching Capacitance / Sample (pF)
Toshiba 1994 FDCT/IDCT [23][24]  85.6 (3.3V design) / 199.8 (2V design)
Toshiba 1996 FDCT/IDCT [22]      82.3
AT&T IDCT [37]                   478.9
Xanthopoulos's FDCT [28]         68.5
Xanthopoulos's IDCT [28]         101.6
Sarmiento's FDCT [6]             1553.9
Proposed FDCT Design             42.6
Proposed IDCT Design             43.4
Table 16: Energy Efficiency (Switching Capacitance/Sample in 0.35μm technology)

6.3. Chapter Summary

In this chapter, the proposed FDCT/IDCT design is synthesized using Synopsys
with CMC 3-volt 0.35-μm technology. To compare the proposed design with previous
works, the switching capacitance per sample is used. This comparison method permits
technology-independent comparison of different DCT/IDCT architectures. From Table
16, it has been shown that the new FDCT/IDCT designs have the smallest switching
capacitance per sample, and are the most power-efficient designs.
Chapter 7
Conclusion
7.1. Summary of Research
In this work, a data-dependent low-power FDCT/IDCT design is presented. Low
power is achieved by performing optimizations at both the algorithm and architectural
levels.
Both the FDCT and IDCT designs are built based on low-complexity Loeffler's
fast algorithm combined with data-dependent zero-bypassing logic. In the FDCT, to have
a high zero-bypassing probability, segmented multiplication is used. Also, to reduce the
internal bandwidth, and hence the amount of data to be processed, the least-significant-bit
truncation technique has also been employed. The error introduced by truncation is
empirically studied.
The multiplier architecture is optimized by developing low-power CSD
multipliers. To reduce the possibility of bugs in coding, a C++ program that generates the
technology-independent VHDL code for the multiplier is developed. This generator can
be used in many other DSP applications where constant multiplication is required.
The FDCT/IDCT designs are coded using VHDL, and synthesized using Synopsys
1998 with CMC 0.35μm CMOSP technology. No transistor-level circuit optimization is
done. Operating at 3V and 40MHz, the FDCT design consumes 122.7mW, while the
IDCT design consumes 124.9mW. By comparing with other recent works, the proposed
FDCT/IDCT designs are the most power-efficient ones since they have the least
switching capacitance per sample. Low-power operation is achieved through the selection
of the low-complexity Loeffler's algorithm, data-dependent zero-bypassing logic, and least-
significant-bit truncation.
7.2. Conclusion
From the analysis and simulation results, the following conclusions can be made
about this thesis:
• A data-dependent algorithm can reduce the number of operations when bypassing
logic is properly inserted. Improper use of the data-dependent algorithm may
lead to increasing the computation rather than decreasing it.
• The hardwired CSD Wallace-tree multiplier is a good choice for low-power design
where constant multiplication is required. Its application is not limited to the
DCT/IDCT. In many non-adaptive signal processing/filter applications, constant
multiplications are required. The use of hardwired CSD multipliers can lead to a
more power-efficient design.
• Low-power design can be achieved by having optimization at both design time
and run time. The design-time optimization is done by carefully choosing a good
algorithm that reduces the number of operations. The run-time optimization is
achieved by using data-dependent bypassing logic to reduce the switching
activity, which is directly proportional to the power consumption.
• The data-dependent low-power design approach is not limited to the DCT/IDCT.
It can be used in other applications as well where the statistical property of the
input is well understood.
7.3. Possible Improvements for Future Research
Following are some recommendations and possible improvements for future
research endeavors.
• Study the effect of integrating the data-dependent algorithm with other fast
algorithms: This thesis is based on Loeffler's fast algorithm. It is chosen because
it has the least number of multiplications among the surveyed papers. It would be
interesting to know the effect of applying bypassing logic to other fast algorithms
to determine the potential of the data-dependent algorithm.
• Study the effect of segmented multiplication: As discussed in Section 3.1.1, a smaller
segmentation size leads to a higher bypassing probability at the expense of more
complicated control logic and more delay. In this work, the multiplications in the
FDCT are split into two segments. This choice may not be optimum. A
different segmentation strategy may lead to a more power-efficient design.
• Study the truncation effect for P-frames and B-frames: The truncation simulation
is performed for I-frames only. It would be a good idea to measure the truncation
effect on P- and B-frames as well.
• Explore the possibility of using truncation as a means of quantization: Truncation
behaves like quantization since both operations reduce numerical precision. Thus,
instead of having the 2-D FDCT and the quantizer as two separate blocks, it could be
possible to merge them together. In such a situation, a sophisticated control
algorithm is necessary for adapting the FDCT to different quantization levels (Q-
factors).
• More power simulations under different conditions: The power measurements
presented in this work are performed under the worst-case condition where the
assumed statistical properties do not hold, i.e. under white noise input. In order to
get a more accurate power estimation, it is recommended to pass many different
real sequences (with I-, B-, and P-frames) as the input of the system, and measure
the power consumptions.
• Improve the Constant Multiplier Generator (CMG): Several possible
improvements can be made to the CMG:
1. Negative constant support: Currently, the CMG supports only multiplication
with non-negative integers. In this work, the negative constant coefficients of the
DCT/IDCT are taken care of by using subtractions instead of additions when the
products are used. However, for other applications, if negative constant
multiplication is required, the CMG can easily be modified to support
multiplying by negative integer constants.
2. Carry-save-adder optimization: In some situations, there are common
operands to be added in the carry-save adder array for different bit positions.
It is possible to share the partial sums of the full/half adders. Unlike the
pattern-based algorithms that require full summation, sharing carry-save adders
reduces the hardware and power without increasing the delay. The only
drawback of doing so is that the overall design becomes highly irregular due
to complex routing caused by sharing wires.
3. Better final-addition support: Currently, at the end of the CSA, a CPA is used. It
is possible to reduce the power consumption even further by using a carry-
bypass adder or a carry-select adder.
4. Support for pattern-based CSD algorithms: As mentioned before, the CMG is
designed for DSP applications where the constants are assumed to be small.
However, if the constants are large, pattern-based algorithms should reduce
the computation significantly, thus reducing the power.
Bibliography
[1] W. H. Chen, C. H. Smith, and S. C. Fralick, "A Fast Computational Algorithm for
the Discrete Cosine Transform", IEEE Trans. on Communications, vol. COM-25,
no. 9, pp. 1004-1009, September 1977.
[2] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terane, and
M. Yoshimoto, "A 100-MHz 2-D discrete cosine transform core processor", IEEE
J. of Solid-State Circuits, vol. 27, no. 4, pp. 492-499, April 1992.
[3] Y. F. Jang, J. N. Kao, J. S. Yang, and P. C. Huang, "A 0.8μ 100-MHz 2-D DCT
core processor", IEEE Trans. on Consumer Electronics, vol. 40, no. 3, pp. 703-709,
August 1994.
[4] A. Madisetti and A. N. Willson, Jr., "A 100 MHz 2-D 8x8 DCT/IDCT Processor for
HDTV Applications", IEEE Trans. on Circuits and Systems for Video Tech., vol. 5,
no. 2, pp. 158-165, April 1995.
[5] T. Masaki, Y. Morimoto, T. Onoye, and I. Shirakawa, "VLSI Implementation of
Inverse Discrete Cosine Transform and Motion Compensator for MPEG2 HDTV
Video Decoding", IEEE Trans. on Circuits and Systems for Video Tech., vol. 5, no.
5, pp. 387-395, October 1995.
[6] R. Sarmiento, C. Pulido, F. Tobajas, V. Armas, R. E. Chain, J. López, J. M. Nelson,
and A. Núñez, "A 600 MHz 2-D DCT processor for MPEG applications",
Conference Record of the 31st Asilomar Conference on Signals, Systems &
Computers, vol. 2, pp. 1527-1531, 1998.
[7] M. T. Sun, T. C. Chen, and A. M. Gottlieb, "VLSI Implementation of a 16x16
discrete cosine transform", IEEE Trans. on Circuits and Systems, vol. 36, no. 4,
pp. 610-617, April 1989.
[8] W. Li, "A new algorithm to compute the DCT and its inverse", IEEE Trans. on
Signal Processing, vol. 39, no. 6, pp. 1305-1313, June 1991.
[9] D. Slawecki and W. Li, "DCT/IDCT Processor Design for High Data Rate Image
Coding", IEEE Trans. on Circuits and Systems for Video Tech., vol. 2, no. 2, pp.
135-146, June 1992.
[10] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, "Practical fast 1-D DCT
algorithms with 11 multiplications", ICASSP-89, vol. 2, pp. 988-991, 1989.
[11] B. G. Lee, "A new algorithm to compute the discrete cosine transform", IEEE
Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1243-
1245, December 1984.
[12] H. S. Hou, "A fast recursive algorithm for computing the discrete cosine
transform", IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-35,
no. 10, pp. 1455-1461, October 1987.
[13] Y. Jeong, I. Lee, H. S. Kim, and K. T. Park, "Fast DCT algorithm with fewer
multiplication stages", Electronics Letters, vol. 34, no. 8, pp. 723-724, April 1998.
[14] E. N. Farag and M. I. Elmasry, "Low-power implementation of discrete cosine
transform", Sixth Great Lakes Symposium on VLSI, Proceedings, pp. 174-177,
1996.
[15] M. Kuhlmann and K. Parhi, "Power comparison of flow-graph and distributed
arithmetic based DCT architectures", Conference Record of the 32nd Asilomar
Conference on Signals, Systems & Computers, vol. 2, pp. 1214-1219, 1998.
[16] C. V. Schimpfle, P. Rieder, and J. A. Nossek, "A power efficient implementation of
the discrete cosine transform", Conference Record of the 31st Asilomar Conference
on Signals, Systems & Computers, vol. 1, pp. 729-733, 1998.
[17] S. Masupe and T. Arslan, "Low power DCT implementation approach for VLSI
DSP processors", ISCAS '99, vol. 1, pp. 149-152, 1999.
[18] S. Masupe and T. Arslan, "Low power DCT implementation approach for CMOS-
based DSP processors", Electronics Letters, vol. 34, no. 25, pp. 2392-2394,
December 1998.
[19] T. Xanthopoulos and A. Chandrakasan, "A low-power DCT core using adaptive
bitwidth and arithmetic activity exploiting signal correlations and quantization",
Digest of Technical Papers, 1999 Symposium on VLSI Circuits, pp. 11-12, 1999.
[20] T. Xanthopoulos and A. Chandrakasan, "A low-power IDCT macrocell for
MPEG2 MP@ML exploiting data distribution properties for minimal activity",
Digest of Technical Papers, 1998 Symposium on VLSI Circuits, pp. 38-39, 1998.
[21] T. Xanthopoulos and A. Chandrakasan, "A low-power IDCT macrocell for
MPEG2 MP@ML exploiting data distribution properties for minimal activity",
IEEE J. of Solid-State Circuits, vol. 34, no. 5, pp. 693-703, May 1999.
[22] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano, M.
Norishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai, "A
0.9V, 150MHz, 10mW, 4mm², 2-D discrete cosine transform core processor with
variable threshold-voltage (VT) scheme", IEEE J. of Solid-State Circuits, vol. 31,
no. 11, pp. 1770-1779, November 1996.
[23] M. Matsui, H. Hara, Y. Uetani, L. S. Kim, T. Nagamatsu, Y. Watanabe, A. Chiba,
K. Matsuda, and T. Sakurai, "A 200 MHz 13 mm² 2-D DCT macrocell using sense-
amplifying pipeline flip-flop scheme", IEEE J. of Solid-State Circuits, vol. 29, no.
12, pp. 1482-1490, December 1994.
[24] M. Matsui, H. Hara, K. Seta, Y. Uetani, L. S. Kim, T. Nagamatsu, T. Shimazawa,
S. Mita, G. Otomo, T. Oto, Y. Watanabe, F. Sano, A. Chiba, K. Matsuda, and T.
Sakurai, "200MHz video compression macrocells using low-swing differential
logic", ISSCC '94, pp. 76-77, 1994.
[25] M. Hamada, T. Terazawa, T. Higashi, S. Kitabayashi, S. Mita, Y. Watanabe, M.
Ashino, H. Hara, and T. Kuroda, "Flip-flop selection technique for power-delay
trade-off", ISSCC '99, pp. 270-271, 1999.
[26] T. H. Chen, "A cost-effective 8x8 2-D IDCT core processor with folded
architecture", IEEE Trans. on Consumer Electronics, vol. 45, no. 2, pp. 333-339,
May 1999.
[27] "IEEE Standard Specifications for the Implementations of 8x8 Inverse Discrete
Cosine Transform", IEEE Std. 1180-1990, March 1991.
[28] T. Xanthopoulos, "Low power data-dependent transform video and still image
coding", Ph.D. Thesis, M.I.T., February 1999.
[29] E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform",
IEEE Trans. on Signal Processing, vol. 40, no. 9, pp. 2174-2193, September 1992.
[30] K. Hwang, Computer Arithmetic - Principles, Architecture, and Design, John
Wiley & Sons, 1979, pp. 149-151.
[31] Z. Wang, "Fast Algorithms for the Discrete W-Transform and for the Discrete
Fourier Transform", IEEE Trans. on Acoustics, Speech, and Signal Processing,
vol. ASSP-32, no. 4, pp. 803-816, August 1984.
[32] M. Vetterli and H. Nussbaumer, "Simple FFT and DCT Algorithms with Reduced
Number of Operations", Signal Processing (North Holland), vol. 6, no. 4, pp. 264-
275, August 1984.
[33] N. Suehiro and M. Hatori, "Fast algorithms for the DFT and other sinusoidal
transforms", IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-
34, no. 3, pp. 642-664, June 1986.
[34] P. Duhamel and H. H'Mida, "New 2ⁿ DCT algorithms suitable for VLSI
implementation", Proceedings IEEE International Conference on Acoustics, Speech
and Signal Processing, ICASSP-87, Dallas, pp. 1805-1808, April 1987.
[35] K. Hwang, Computer Arithmetic - Principles, Architecture, and Design, pp. 152-155.
[36] S. Shah, A. J. Al-Khalili, and D. Al-Khalili, "Comparison of 32-bit multipliers for
various performance measures", Proceedings of the 12th International Conference
on Microelectronics, ICM'2000, pp. 75-80, October 31 - November 2, 2000.
[37] A. Bhattacharya and S. Haider, "A VLSI implementation of the inverse cosine
transform", International J. of Pattern Recognition and AI, vol. 9, no. 2, pp. 303-314,
1995.
[38] K. R. Rao and P. Yip, Discrete Cosine Transform - Algorithms, Advantages,
Applications, Academic Press, 1990, pp. 10-15.
[39] V. Lefèvre, "Multiplication by an integer constant", LIP Research Report RR1999-
06, Laboratoire d'Informatique du Parallélisme, Lyon, France, 1999.
[40] F. de Dinechin and V. Lefèvre, "Constant Multipliers for FPGAs", LIP Research
Report RR2000-18, Laboratoire d'Informatique du Parallélisme, Lyon, France, 2000.
[41] R. Bernstein, "Multiplication by integer constants", Software - Practice and
Experience, vol. 16, no. 7, pp. 641-652, July 1986.
[42] M. Potkonjak, M. Srivastava, and A. Chandrakasan, "Multiple Constant
Multiplications: Efficient and Versatile Frameworks for Exploring Common
Subexpression Elimination", IEEE Trans. on CAD of IC and Systems, vol. 15, no.
2, pp. 151-165, February 1996.
[43] Xilinx Corporation, "Constant (k) Coefficient Multiplier Generator for Virtex",
Application Note, Version 1.1, March 12, 1999.
[44] Xilinx Corporation, "Constant Coefficient Multipliers for the XC4000E", Application
Note XAPP054, Version 1.1, December 11, 1996.
[45] R. Hartley, "Optimization of Canonic Signed Digit Multipliers for Filter Design",
IEEE International Symposium on Circuits and Systems, 1991, vol. 4, pp. 1992-1995,
1991.
Appendix A
Truncation Test Result
Table 17 shows the truncation error of 3 test video sequences: coke, salesman,
and tennis. The truncation error is defined as:
Truncation Error = Average PSNR(reference) - Average PSNR(truncation)
Each sequence is encoded with pure I-frames, at 8 Mb/s and 180 frames. The FDCT
is computed with fixed-point calculation with 11-bit precision after the binary point.

Trunc(2^n) Number of Truncated Bits | Tennis (dB) | Coke (dB) | Salesman (dB) | Average of 3 Sequences (dB)
Table 17: Truncation errors of test sequences: coke, salesman, and tennis
architecture Structural of COS_3_16 is
    component HalfAdder
        port (A, B : in Std_Logic; Sum, Cout : out Std_Logic);
    end component;
    component FullAdder
        port (A, B, Ci : in Std_Logic; Sum, Cout : out Std_Logic);
    end component;
    signal S0, C0, S1, C1, S2, C2, S3, C3 : Std_Logic_Vector(25 downto 0);
    signal N, P : Std_Logic_Vector(25 downto 0);
    signal ZERO : Std_Logic;  -- constant signal '0'
    signal ONE  : Std_Logic;  -- constant signal '1'
begin
    ZERO <= '0';
    ONE  <= '1';
    -- Inverted input signal:
    N <= not P;
    -- Bit 3 Stage 2:
    HA_2_3:  HalfAdder port map (S0(3), C1(2), S2(3), C2(3));
    -- Bit 4 Stage 1:
    FA_1_4:  FullAdder port map (N(4), N(1), C0(3), S1(4), C1(4));
    -- Bit 4 Stage 3:
    HA_3_4:  HalfAdder port map (S1(4), C2(3), S3(4), C3(4));
    -- Bit 7 Stage 0:
    FA_0_7:  FullAdder port map (N(7), P(4), P(1), S0(7), C0(7));
    -- Bit 11 Stage 0:
    FA_0_11: FullAdder port map (N(11), N(8), P(5), S0(11), C0(11));
    -- Bit 13 Stage 0:
    FA_0_13: FullAdder port map (N(10), P(7), P(5), S0(13), C0(13));
    -- Bit 20 Stage 1:
    FA_1_20: FullAdder port map (ONE, S0(20), C0(19), S1(20), C1(20));
    -- Bit 21 Stage 0:
    FA_0_21: FullAdder port map (P(11), N(9), P(7), S0(21), C0(21));
    -- Bit 21 Stage 1:
    FA_1_21: FullAdder port map (ONE, S0(21), C0(20), S1(21), C1(21));
    -- Bit 21 Stage 2:
    HA_2_21: HalfAdder port map (S1(21), C1(20), S2(21), C2(21));
    -- Bit 22 Stage 0:
    FA_0_22: FullAdder port map (N(10), P(8), ONE, S0(22), C0(22));
    -- Bit 22 Stage 2:
    FA_2_22: FullAdder port map (S0(22), C0(21), C1(21), S2(22), C2(22));
    -- Bit 23 Stage 0:
    FA_0_23: FullAdder port map (N(11), P(9), ONE, S0(23), C0(23));
    -- Bit 23 Stage 1:
    HA_1_23: HalfAdder port map (S0(23), C0(22), S1(23), C1(23));
    -- Bit 24 Stage 1:
    HA_1_24: HalfAdder port map (P(10), C0(23), S1(24), C1(24));
end;
-- Statistical Information:
--   # Adder stages : 4
--   # Inverters    : 12
--   # Half adders  : 13
--   # Full adders  : 48
Appendix C
Source Code of Constant Multiplier Generator
The following is the C++ source code listing for the constant multiplier generator.
The files are listed in alphabetic order of the source file name. The header file (.h) is
always in front of the implementation file (.cpp). The main program is located inside the file
IntMult.cpp. Notice that all code is also included on the attached CD.
using naneSpace s t d ;
ucs igned nXezcyAïStzge(SignalVector& imï, int curSïage);
void gecAdderOperana (znsigced n O p , msigned r n a x C o n s t I n p u t , S i g n a l V e c t o r & i m t ,
void createE? {vec=cr<Sig?.aItJector> & k t , unsigned xnsignea m a x C o n s t O p , oscreemb 0 ) ; voia c r e ~ t e ? ~ (vêctcr<SiqnaItiector> S i m r , umigned xnsianed nzxCanscC~, oscrsam& G ) ;
curBiz , unsigned curstaqe,
c u r B i t , u n s i g n e d c u r s t a g e ,
voici genprate-VHDI, - CSA - Eody(vec~or<ÇignalVec~sr> &imt, unsignea CSA-Scage, u r r s i g c e c & n H â l f A d d e r , unsigned a c F u l l A c d e r , ostrem& csa) ;
using namespace std;

static Signal
    SIGNAL_SUM  (VARIABLE, "S",    SUM,   false, NONE, -1, -1),
    SIGNAL_CARRY(VARIABLE, "C",    CARRY, false, NONE, -1, -1),
    SIGNAL_SIGN (SIGN,     "Sign", 0,     false, NONE, -1, -1);

//---------------------------------------------------------------------------
void getAdderOperand(unsigned nOp, unsigned maxConstInput,
                     SignalVector& sv, SignalVector& opToAdd)
{
    unsigned i = 0;
    while (nOp > 0 && i < sv.size()) {
        if (sv[i].ID == CARRY || sv[i].ID == SUM) {
            nOp--;
            opToAdd.push_back(sv[i]);
            sv.erase(sv.begin() + i);
        }
        else i++;
    }
    if (nOp == 0) return;
//---------------------------------------------------------------------------
void createHA(vector<SignalVector>& sv, unsigned curBit, unsigned curStage,
              unsigned maxConstOp, ostream& o)
{
    SignalVector opToAdd;
    getAdderOperand(2, maxConstOp, sv[curBit], opToAdd);
    SIGNAL_SUM.bitPos = SIGNAL_CARRY.bitPos = curBit;
    SIGNAL_SUM.stage  = SIGNAL_CARRY.stage  = curStage;

void createFA(vector<SignalVector>& sv, unsigned curBit, unsigned curStage,
              unsigned maxConstOp, ostream& o)
{
    SignalVector opToAdd;
    getAdderOperand(3, maxConstOp, sv[curBit], opToAdd);
    SIGNAL_SUM.bitPos = SIGNAL_CARRY.bitPos = curBit;
    SIGNAL_SUM.stage  = SIGNAL_CARRY.stage  = curStage;
    sv[curBit].push_back(SIGNAL_SUM);
    if (curBit == sv.size() - 1) return;
    sv[curBit + 1].push_back(SIGNAL_CARRY);
    return;
}
// Generating carry-save adder VHDL code
nHalfAdder = nFullAdder = 0;              // Complexity stat
bool HA_for_2op = true;
bool isFirstAdder;
int  nReady;
for (i = 0; i < sv.size(); i++)
{
    csa << "\n";
    isFirstAdder = true;
    for (j = 0; j < CSA_Stage; j++)
    {
        csa << "-- Bit " << i << " Stage " << j << ":\n";
        if (HA_for_2op)
        {
            switch (nReadyAtStage(sv[i], j))
            {
            case 0: break;
            case 1: break;
            case 2:
                if (sv[i].size() == 2)
                {
                    createHA(sv, i, j, (isFirstAdder ? 2 : 1), csa);
                    if (j == CSA_Stage - 1) HA_for_2op = false;
                    isFirstAdder = false;
                    nHalfAdder++;
                }
                break;
            default:
                createFA(sv, i, j, (isFirstAdder ? 3 : 1), csa);
                if (j == CSA_Stage - 1) HA_for_2op = false;
                isFirstAdder = false;
                nFullAdder++;
            }
        }
        else  // HA_for_2op == false
        {
            nReady = nReadyAtStage(sv[i], j);
            if (sv[i].size() == 3)
            {
                if (nReady == 2 || nReady == 3)
                {
                    createHA(sv, i, j, (isFirstAdder ? 2 : 1), csa);
                    isFirstAdder = false;
                    nHalfAdder++;
                }
            }
            else if (nReady >= 3)
            {
                createFA(sv, i, j, (isFirstAdder ? 3 : 1), csa);
                isFirstAdder = false;
                nFullAdder++;
            }
        }
    }
}
void simplifyConstants(vector<SignalVector>& c)
{
    SignalVector::iterator result;
    int nOne, carry = 0;
    for (unsigned i = 0; i < c.size(); i++) {
        nOne = 0;
        //count(c[i].begin(), c[i].end(), SIGNAL_ONE, nOne);
        nOne = count(c[i].begin(), c[i].end(), SIGNAL_ONE);

        // Removing constant zeros: no operation
        result = remove(c[i].begin(), c[i].end(), SIGNAL_ZERO);
        c[i].erase(result, c[i].end());

        // Simplify constant ones: adding them together
        result = remove(c[i].begin(), c[i].end(), SIGNAL_ONE);
        c[i].erase(result, c[i].end());

        nOne += carry;
        carry = nOne / 2;
        nOne %= 2;
        if (nOne != 0) c[i].push_back(SIGNAL_ONE);
    }
}
void create_CSA_Vector(
    vector<SignalVector>& op, vector<SignalVector>& csa, bool isSigned)
{
    int i, j;
    unsigned maxBit = 0;
    for (i = 0; i < op.size(); i++)
        if (op[i].size() > maxBit) maxBit = op[i].size();

    for (i = op.size() >> 1, j = 0; i != 0; i >>= 1, j++)
        ;                    // Get the MSB position of i: log2(op.size())
    maxBit += j;             // n m-bit operands will have an output of n+m bits
    csa.resize(maxBit);

    Signal signal(VARIABLE, "Op", 0, true, NONE, -1, -1);
    for (signal.ID = 0; signal.ID < op.size(); signal.ID++)
        o << "    " << signal << " : in Std_Logic_Vector("
          << (op[signal.ID].size() - 1) << " downto 0);\n";
    signal.name = "Sum";
    for (signal.ID = 1; signal.ID <= 2; signal.ID++)
        o << "    " << signal << " : out Std_Logic_Vector("
          << (nBitOut - 1) << " downto 0);\n";
    o << "  );\n" << "end;\n\n";
    // Generate VHDL architecture header
    o << "architecture Structural of " << entityName << " is\n"
      << "  component HalfAdder\n"
      << "    port (A, B: in Std_Logic; Sum, Cout: out Std_Logic);\n"
      << "  end component;\n\n"
      << "  component FullAdder\n"
      << "    port (A, B, Cin: in Std_Logic; Sum, Cout: out Std_Logic);\n"
      << "  end component;\n\n";
    CSA_Stage = (op.size() >= 3 ? op.size() - 2 : 1);
    SIGNAL_SUM.showID = SIGNAL_CARRY.showID = true;
    o << "  signal ";
    for (i = 0; i <= CSA_Stage; i++) {
        SIGNAL_SUM.ID = SIGNAL_CARRY.ID = i;
        o << " " << SIGNAL_SUM << ", " << SIGNAL_CARRY;
        if (i != CSA_Stage)
            o << ",\n";
        else
            o << ": Std_Logic_Vector(" << (nBitOut - 1) << " downto 0);\n\n";
    }
    if (isSigned) {
        o << "  signal ";
        for (SIGNAL_SIGN.ID = 0; SIGNAL_SIGN.ID < op.size(); SIGNAL_SIGN.ID++)
            o << SIGNAL_SIGN << (SIGNAL_SIGN.ID < op.size() - 1 ? ", " : " ");
        o << ": Std_Logic_Vector(" << (op.size() - 1) << " downto 0);\n";
    }
    if (isSigned) {
        signal.name = "Op";
        for (i = 0; i < op.size(); i++) {
            SIGNAL_SIGN.ID = signal.ID = i;
void generate_VHDL_CSA_Tail(
    vector<SignalVector>& imt,
    SignalVector& out1, SignalVector& out2, ostream& o)
{
    //------------------------------------------------------------------------
    // Map internal signals to output
    ostrstream num1, num2;
    Signal signal(VARIABLE, "Sum", 0, true, NONE, -1, -1);

    char* s1 = num1.str();  s1[num1.pcount()] = '\0';
    char* s2 = num2.str();  s2[num2.pcount()] = '\0';
    o << "\n" << s1 << s2 << "\n";
    o << "end;\n\n";
    o.flush();
}
//---------------------------------------------------------------------------
void generate_VHDL_CSA(
    char* entityName, vector<SignalVector>& op, bool isSigned,
    SignalVector& out1, SignalVector& out2, ostream& o)
{
    int CSA_Stage;
    unsigned nHalfAdder, nFullAdder;
    vector<SignalVector> csa;

    create_CSA_Vector(op, csa, isSigned);
    generate_VHDL_CSA_Header(entityName, op, isSigned, csa.size(), CSA_Stage, o);
    generate_VHDL_CSA_Body(csa, CSA_Stage, nHalfAdder, nFullAdder, o);
    generate_VHDL_CSA_Tail(csa, out1, out2, o);
}
HWMult.h

#ifndef _HWMULT_H
#define _HWMULT_H

void HWMult(unsigned nVarBit, bool signedVar, unsigned long constOp,
            vector<SignalVector>& out, ostream& o, ostream& component,
            char* entityName = 0, unsigned truncLSB = 0,
            bool generateProduct = true, bool byPass = false);
#include "HWMult.h"
#include "CSA.h"
#include "NumberSystem.h"
#include "NonZero.h"

using namespace std;

static Signal
    SIGNAL_SIGN_P(SIGN, "Sign", 0, false, POSITIVE, -1, -1),
    SIGNAL_SIGN_N(SIGN, "Sign", 0, false, NEGATIVE, -1, -1);
    unsigned nOutBit1, unsigned nOutBit2, unsigned outBit2Offset,
    unsigned CSA_Stage, unsigned nCSABit, bool invertedInput,
    ostream& o, ostream& c, bool generateProduct, bool byPass)
{
        o << "    Result : out Std_Logic_Vector(" << (nOutBit1-1) << " downto 0)\n";
        c << "    Result : out Std_Logic_Vector(" << (nOutBit1-1) << " downto 0)\n";
    }
    else {
        o << "    Result1: out Std_Logic_Vector(" << (nOutBit1-1) << " downto 0);\n"
          << "    Result2: out Std_Logic_Vector(" << (nOutBit2-1) << " downto 0)\n";
        c << "    Result1: out Std_Logic_Vector(" << (nOutBit1-1) << " downto 0);\n"
          << "    Result2: out Std_Logic_Vector(" << (nOutBit2-1) << " downto 0)\n";
    }
    o << "  );\n" << "end;\n\n";
    c << "  );\n" << "  end component;\n\n";
    // Generate VHDL architecture header
    o << "architecture Structural of " << entityName << " is\n"
      << "  component HalfAdder\n"
      << "    port (A, B: in Std_Logic; Sum, Cout: out Std_Logic);\n"
      << "  end component;\n\n"
      << "  component FullAdder\n"
      << "    port (A, B, Cin: in Std_Logic; Sum, Cout: out Std_Logic);\n"
      << "  end component;\n\n";
        o << "  S" << i << ", C" << i;
        if (i != CSA_Stage - 1)
            o << ",\n";
        else
            o << ": Std_Logic_Vector(" << (nCSABit-1) << " downto 0);\n\n";
    }
    if (invertedInput) {
        o << "  signal N    : Std_Logic_Vector(" << (nVarBit-1) << " downto 0);\n\n";
        if (signedVar)
            o << "  signal " << SIGNAL_SIGN_P << ", " << SIGNAL_SIGN_N
              << ": Std_Logic;\n";
    }
    o << "  signal P    : Std_Logic_Vector(" << (nVarBit-1)  << " downto 0);\n"
      << "  signal num1 : Std_Logic_Vector(" << (nOutBit1-1) << " downto 0);\n"
      << "  signal num2 : Std_Logic_Vector(" << (nOutBit2-1) << " downto 0);\n\n";
    o << "  signal ZERO: Std_Logic;  -- Constant signal '0'\n"
      << "  signal ONE : Std_Logic;  -- Constant signal '1'\n";
    if (byPass)
        o << "  signal NonZeroIn: Std_Logic;\n"
          << "  signal ZERO_Out : Std_Logic_Vector(" << (nOutBit1-1) << " downto 0);\n\n";
    if (byPass) {
        o << "  BP: NZ" << nVarBit << " port map (VarIn, NonZeroIn);\n"
          << "  P <= VarIn when (NonZeroIn='1') else P;\n\n";
    }
    else
        o << "  P <= VarIn;\n\n";

    if (invertedInput) {
        o << "-- Inverted input signals:\n";
        //for (i = 0; i < nVarBit; i++)

        if (signedVar)  // Signed variable operand
            o << "  " << SIGNAL_SIGN_P << " <= P(" << (nVarBit-1) << ");\n"
              << "  " << SIGNAL_SIGN_N << " <= N(" << (nVarBit-1) << ");\n\n";
    }
void generate_VHDL_HWMult_Tail(
    vector<SignalVector>& imt, vector<SignalVector>& out,
    ostream& o, bool generateProduct, bool byPass)
{
    //------------------------------------------------------------------------
    // Map internal signals to output
    ostrstream num1, num2;
    num1 << "  num1 <= ";
    num2 << "  num2 <= ";

            j++;
            if ((j % 8) == 0) {
                num1 << "\n          ";
                num2 << "\n          ";
            }

    num1 << ";\n";  num2 << ";\n";
    num1.flush();   num2.flush();
    char* s1 = num1.str();  s1[num1.pcount()] = '\0';
    char* s2 = num2.str();  s2[num2.pcount()] = '\0';
    o << "\n" << s1 << s2 << "\n";
    o << "  ZERO_Out <= \"";
    for (i = 0; i < out.size(); i++) o << "0";

    if (generateProduct) {
        o << "  num <= Unsigned(num1) + Unsigned(num2);\n";
        if (byPass)
            o << "  Result <= num when (NonZeroIn='1') else ZERO_Out;\n\n";
        else
            o << "  Result <= num;\n\n";
    }
    else {
        if (byPass)
            o << "  Result1 <= num1 when (NonZeroIn='1') else ZERO_Out;\n"
              << "  Result2 <= num2 when (NonZeroIn='1') else ZERO_Out;\n\n";
        else
            o << "  Result1 <= num1;\n"
              << "  Result2 <= num2;\n\n";
    }
    vector<SignalVector> signZero(sign), signOne(sign);
    SignalVector::iterator result;

        // For Sign=1 => Remove all "Sign_N (=0)" & replace Sign_P with "1"
        result = remove(signOne[i].begin(), signOne[i].end(), SIGNAL_SIGN_N);
        signOne[i].erase(result, signOne[i].end());
        replace(signOne[i].begin(), signOne[i].end(), SIGNAL_SIGN_P, SIGNAL_ONE);
    }
    unsigned nOutBit = nVarBit + constBit.size();  // Number of output bits
    imt.resize(nOutBit);    // Intermediate signals
    sign.resize(nOutBit);   // Sign and constant 1's

    // Insert all intermediate signals
    Signal signal;
    invertedInput = false;
    for (i = 0; i < constBit.size(); i++) {
        if (constBit[i] == 1) {
            for (j = 0; j < (signedVar ? nVarBit-1 : nVarBit); j++) {
                signal.bitPos = j;
                signal.inverted = POSITIVE;
                imt[i+j].push_back(signal);
            }
            sign[i].push_back(SIGNAL_ONE);
    // Merge sign/constants together and perform optimization for constant 1's
    unsigned maxDepth = 0;
    for (i = 0; i < imt.size(); i++) {
        if (sign[i].size() != 0) {
            if (sign[i][0] == SIGNAL_ONE && i < imt.size()-1 && imt[i].size() == 1 &&
                imt[i+1].size() <= 2)
            {
                // bit + 1 => sum = (not bit), carry = bit
                imt[i+1].push_back(imt[i][0]);
                imt[i][0].inverted = imt[i][0].inverted == POSITIVE ? NEGATIVE : POSITIVE;
                invertedInput = true;
            }
            else
                imt[i].push_back(sign[i][0]);
        }
        if (imt[i].size() > maxDepth) maxDepth = imt[i].size();
    }
    CSA_Stage = (maxDepth > 3) ? maxDepth-2 : (maxDepth > 0 ? 1 : 0);
}
void HWMult(
    unsigned nVarBit, bool signedVar, unsigned long constOp,
    vector<SignalVector>& out, ostream& o, ostream& component,
    char* entityName, unsigned truncLSB, bool generateProduct, bool byPass)
{
    unsigned i, j;

    // Construct multiplication vector to be used in CSA
    create_HW_Vector(nVarBit, signedVar, constOp, imt, invertedInput,
                     CSA_Stage);

    imt.erase(imt.begin(), imt.begin() + truncLSB);
    cout << "\nAfter truncating " << truncLSB << " bits:\n";

    //------------------------------------------------------------------------
    // Generating VHDL code
    generate_VHDL_HWMult_Header(nVarBit, signedVar, constOp, entityName,
        imt.size(), imt.size(), 0, CSA_Stage, imt.size(), invertedInput,
        o, component, generateProduct, byPass);
    o.flush();

    //------------------------------------------------------------------------
    // Generating carry-save adder VHDL code
    unsigned nHalfAdder, nFullAdder;
    generate_VHDL_CSA_Body(imt, CSA_Stage, nHalfAdder, nFullAdder, o);
    o.flush();

    // Generating VHDL tail (end architecture) & statistical information
    generate_VHDL_HWMult_Tail(imt, out, o, generateProduct, byPass);
using namespace std;

const double pi = 3.14159265358979323846;

int main(int argc, char* argv[])
{
    unsigned long val;
    cout << "Constant Operand             : ";
    cin >> val;

    int nVarBit, truncLSB, bypass;
    int signedVar, generateProduct;
    cout << "# bit of Variable Operand    : ";
    cin >> nVarBit;
    cout << "Signed variable operand (0/1): ";
    cin >> signedVar;
    cout << "# bit truncated at LSB       : ";
    cin >> truncLSB;
    cout << "Generate product (0/1)       : ";
    cin >> generateProduct;
    cout << "Bypass Zero (0/1)            : ";
    cin >> bypass;
    cout << "Entity (file) name           : ";
    cin >> entityName;
    if (entityName[0] == '\0')

    cout << "\n\n";
    cout << "Constant Value : " << val << "\n\n";
#include <fstream>

using namespace std;

void NonZero(char* entityName, unsigned nBit, ostream& c)
{
    char fileName[256];
    sprintf(fileName, "%s.vhd", entityName);
    ofstream f(fileName, ios::out);

      << "  port\n"
      << "  (\n"
      << "    D : in Std_Logic_Vector(" << (nBit-1) << " downto 0);\n"
      << "    NZ: out Std_Logic\n"
      << "  );\n"
      << "end;\n\n";
// Convert unsigned long to a sequence of bits.
// The MSB of the returned bits is always 0
void ulongToBit(unsigned long l, vector<char>& bit);
void binaryToSignDigit(vector<char>& bit);
void optimizeSD(vector<char>& bit);    // Reduce -1's
void ulongToSignDigit(unsigned long l, vector<char>& sd);
void ulongToBooth(unsigned long l, vector<char>& booth);
void BoothToSignDigit(vector<char>& booth, vector<char>& sd);
ostream& printBit(ostream& o, vector<char>& bit);
void showBit(vector<char>& bit);

#endif
// Convert unsigned long to a sequence of bits.
// The MSB of the returned bits is always 0
void ulongToBit(unsigned long l, vector<char>& bit)
{
    bit.clear();
    for (; l != 0; l >>= 1)
        bit.push_back(l & 1 ? 1 : 0);
}

void showBit(vector<char>& bit)
{
    printBit(cout, bit);
}

ostream& printBit(ostream& o, vector<char>& bit)
{
    int weight = 0;
    for (int i = bit.size()-1; i >= 0; i--) {
        o << setw(3) << (int) bit[i];
        if (bit[i] != 0) weight++;
    }
    o << "  Weight=" << weight;
    return o;
}
void binaryToSignDigit(vector<char>& bit)
{
    int start, end;   // Start and end position of consecutive ones
    for (unsigned i = 0; i < bit.size(); i++) {
        if (bit[i] == 1) {
            start = i;
            for (end = i+1; end < bit.size(); end++)
                if (bit[end] == 0) break;
            if (end - start > 1) {   // More than one '1'
                bit[start] = -1;
                bit[end] = 1;
                for (start++; start < end; start++)
                    bit[start] = 0;
            }
            i = end - 1;
        }
    }
    if (bit[bit.size()-1] == 0)
        bit.erase(bit.end() - 1);
}

void ulongToSignDigit(unsigned long l, vector<char>& bit)
{
    ulongToBit(l, bit);
    binaryToSignDigit(bit);
    // optimizeSD(bit);
}
static const char toBooth[] = { 0, 1, 1, 2, -2, -1, -1, 0 };

void BoothToSignDigit(vector<char>& booth, vector<char>& sd)
{
    for (int i = 0; i < booth.size(); i++)
        switch (booth[i]) {
        case -2: sd.push_back( 0); sd.push_back(-1); break;
        case -1: sd.push_back(-1); sd.push_back( 0); break;
        case  0: sd.push_back( 0); sd.push_back( 0); break;
        case  1: sd.push_back( 1); sd.push_back( 0); break;
        case  2: sd.push_back( 0); sd.push_back( 1); break;
        }
}
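The toBooth table above maps a 3-bit window b(i+1) b(i) b(i-1) of the binary operand to a radix-4 Booth digit in {-2,...,2}. A stand-alone sketch of that recoding, with the same table (class and method names here are illustrative, not from the thesis):

```java
public class BoothSketch {
    // Index = b(i+1)*4 + b(i)*2 + b(i-1), same table as the C++ listing.
    static final int[] TO_BOOTH = { 0, 1, 1, 2, -2, -1, -1, 0 };

    // Radix-4 Booth recoding: digits LSB first, value = sum of d[k] * 4^k.
    static int[] recode(long v, int nDigits) {
        int[] d = new int[nDigits];
        for (int k = 0; k < nDigits; k++) {
            int bm1 = (k == 0) ? 0 : (int) ((v >>> (2*k - 1)) & 1);
            int b0  = (int) ((v >>> (2*k)) & 1);
            int b1  = (int) ((v >>> (2*k + 1)) & 1);
            d[k] = TO_BOOTH[(b1 << 2) | (b0 << 1) | bm1];
        }
        return d;
    }

    // Evaluate a digit vector back to an integer (Horner in radix 4).
    static long value(int[] d) {
        long v = 0;
        for (int k = d.length - 1; k >= 0; k--) v = 4 * v + d[k];
        return v;
    }
}
```

BoothToSignDigit then unpacks each radix-4 digit into two radix-2 signed digits, which is exactly the five-way switch in the listing above.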
VHDLSignal.h

#ifndef _VHDL_SIGNAL_H
#define _VHDL_SIGNAL_H

#include <vector>
#include <string>
#include <iostream>
#include <iomanip>
#include <strstream>

using namespace std;

typedef enum { CONSTANT, VARIABLE, SIGN } signalType;
typedef enum { NONE, POSITIVE, NEGATIVE } invertType;
typedef enum { ZERO, ONE, OPEN } constIDType;
typedef enum { INPUT, SUM, CARRY } varIDType;
    Signal() : type(VARIABLE), name(""), ID(0), inverted(NONE),
               bitPos(-1), stage(-1), showID(false) {}
    Signal(signalType t, char* n, unsigned id, bool showid,
           invertType inv, int pos, int stg);

    signalType type;
    char*      name;
    unsigned   ID;        // ID for the signal
    bool       showID;
    invertType inverted;
    int        bitPos;    // Bit position (non-negative integer; -1 will not show
                          // the bit position)
    int        stage;     // Stage where this signal is generated (-1: input or
                          // constant signal & will not show the stage)

    static int createNewSignal() { unsigned save = idCount++; return save; }
    static int idCount;
};
ostream& operator << (ostream& stream, const Signal& signal);
bool operator == (const Signal& s1, const Signal& s2);
bool operator != (const Signal& s1, const Signal& s2);
bool operator <  (const Signal& s1, const Signal& s2);

const Signal
    SIGNAL_ZERO(CONSTANT, "ZERO",   ZERO, false, NONE, -1, -1),
    SIGNAL_ONE (CONSTANT, "ONE",    ONE,  false, NONE, -1, -1),
    SIGNAL_OPEN(CONSTANT, "\'X\'",  OPEN, false, NONE, -1, -1);
using namespace std;

int Signal::idCount = 0;

//---------------------------------------------------------------------------
ostream& operator << (ostream& stream, const Signal& signal)
{
    stream << signal.name;
    if (signal.type == CONSTANT)
        return stream;
    if (signal.showID)
        stream << signal.ID;
    if (signal.inverted != NONE)
        stream << (signal.inverted == POSITIVE ? "P" : "N");
    // Variable signal
    if (signal.stage >= 0)
        stream << signal.stage;
//---------------------------------------------------------------------------
bool operator == (const Signal& s1, const Signal& s2)
{
    if (s1.type != s2.type) return false;
    if (s1.type == CONSTANT) return (s1.ID == s2.ID);
    return (s1.ID == s2.ID && s1.inverted == s2.inverted &&
            s1.bitPos == s2.bitPos && s1.stage == s2.stage);
}

//---------------------------------------------------------------------------
bool operator < (const Signal& s1, const Signal& s2)
{
    if (s1.stage < s2.stage) return true;
    if (s1.type == CONSTANT) return true;
    if (s1.stage == s2.stage)
        if (s1.ID == SUM && s2.ID == CARRY) return true;
    return false;
}
//---------------------------------------------------------------------------
void printVSV(vector<SignalVector>& sv, ostream& o)
Appendix D

IEEE Standard 1180-1990 Compliant Test
Program

The following is the Java source code listing for the IEEE Standard 1180-1990
compliance test program for the IDCT. It is used to determine the internal bandwidth of the
IDCT for both the first-dimension IDCT and the second-dimension IDCT.

The code is listed in alphabetical order based on the source file names. The main
program is located inside file IEEE_1180_1990.java. Notice that all code is also
included in the attached CD.

To execute the program, use the following command: java IEEE_1180_1990. The
program reads the internal bandwidth configuration from file Setup.txt, and performs tests
to check if the bandwidth yields IEEE 1180-1990 compliance.
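IEEE Std 1180-1990 judges an IDCT implementation by error statistics accumulated over many random 8x8 blocks: peak mean error (pme), peak mean square error (pmse), overall mean error (ome) and overall mean square error (omse) — the same counters the listing below accumulates. A minimal sketch of that bookkeeping, with the per-standard thresholds omitted (class and method names are illustrative, not from the thesis):

```java
public class DctErrorStats {
    // Accumulate per-pixel error sums over n reference/test block pairs,
    // then reduce them to IEEE 1180-style statistics {pme, pmse, ome, omse}.
    static double[] stats(short[][][] ref, short[][][] test) {
        long[][] sumE = new long[8][8], sumE2 = new long[8][8];
        int n = ref.length;
        for (int b = 0; b < n; b++)
            for (int i = 0; i < 8; i++)
                for (int j = 0; j < 8; j++) {
                    int err = test[b][i][j] - ref[b][i][j];
                    sumE[i][j]  += err;                 // signed error
                    sumE2[i][j] += (long) err * err;    // squared error
                }
        double pme = 0, pmse = 0, ome = 0, omse = 0;
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 8; j++) {
                pme  = Math.max(pme,  Math.abs((double) sumE[i][j] / n));
                pmse = Math.max(pmse, (double) sumE2[i][j] / n);
                ome  += (double) sumE[i][j]  / (64.0 * n);
                omse += (double) sumE2[i][j] / (64.0 * n);
            }
        return new double[] { pme, pmse, ome, omse };
    }
}
```

The reference values come from a double-precision IDCT; the test values come from the truncated fixed-point IDCT under evaluation.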
CSD.java

/* Convert conventional binary number to canonical sign-digit representation
   Algorithm: K. Hwang, Computer Arithmetic, Wiley, 1979, pp. 150
   Coding   : Pai, Cheng-Yu
   Note     :
     To compile, execute "javac SignDigit.java"
     To run, execute "java SignDigit xxxx",
     where xxxx is the number you wish to convert. */
public class CSD {
    public static byte[] toCSD(long l) {
        // System.out.println("Integer value = " + l);
        // System.out.println("Integer bits  = " + Long.toBinaryString(l));

        // Construct bit array representation of the input
        byte[] b = ("0" + Long.toBinaryString(l)).getBytes();
        for (int i = 0, j = b.length-1; i <= j; i++, j--) {
            byte t = (byte) (b[i] - '0');
            b[i] = (byte) (b[j] - '0');
            b[j] = t;
        }

        byte[] d = new byte[b.length];
        byte ci = 0, ci_1;
        for (int i = 0; i < b.length; i++, ci = ci_1) {
            if (i == b.length-1)
                ci_1 = (byte) ((b[i] + ci > 1) ? 1 : 0);
            else
                ci_1 = (byte) ((b[i] + b[i+1] + ci > 1) ? 1 : 0);
            d[d.length-i-1] = (byte) (b[i] + ci - 2*ci_1);
        }
        return d;
    }
}
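The recurrence in toCSD is the classic carry formulation: c(i+1) = floor((b(i) + b(i+1) + c(i)) / 2), d(i) = b(i) + c(i) - 2c(i+1), which guarantees no two adjacent nonzero digits. As a cross-check, the same recoding can be re-derived in a stand-alone sketch that emits digits LSB first (class and method names are illustrative, not the thesis code):

```java
public class CSDSketch {
    // Canonical signed-digit recoding of a non-negative value;
    // returns digits in {-1, 0, 1}, least significant first.
    static int[] toCSD(long v) {
        int n = 65 - Long.numberOfLeadingZeros(v);   // bit length + 1 extra MSB
        int[] d = new int[n];
        int c = 0;                                   // carry
        for (int i = 0; i < n; i++) {
            int bi  = (int) ((v >>> i) & 1);
            int bi1 = (i + 1 < 64) ? (int) ((v >>> (i + 1)) & 1) : 0;
            int c1  = (bi + bi1 + c >= 2) ? 1 : 0;   // carry out
            d[i] = bi + c - 2 * c1;                  // digit in {-1, 0, 1}
            c = c1;
        }
        return d;
    }

    // Evaluate a digit vector back to an integer.
    static long value(int[] d) {
        long v = 0;
        for (int i = 0; i < d.length; i++) v += (long) d[i] << i;
        return v;
    }
}
```

For example, 7 = 0111 recodes to 100(-1), i.e. 8 - 1, trading three additions for one subtraction in the generated multiplier.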
FDCT.java
public class FDCT {
    static double s[][] = new double[5][8];
    static double tmp[][] = new double[8][8];
    static final int map[] = {0,4,2,6,7,3,5,1};

    static void Butterfly(int stage, int x0, int x1) {
        s[stage+1][x0] = s[stage][x0] + s[stage][x1];
        s[stage+1][x1] = s[stage][x0] - s[stage][x1];
    }

    static void Loeffler(double A, double BminusA, double AplusB,
                         int stage, int x0, int x1) {
        double tmp = A * (s[stage][x0] + s[stage][x1]);
        s[stage+1][x0] = BminusA * s[stage][x1] + tmp;
        s[stage+1][x1] = AplusB  * s[stage][x0] - tmp;
    }

    static void Lr1(int stage, int x0, int x1) {
        final int n = 1;
        final double k = Math.sqrt(2);
        final double a = k*Math.cos(n*Math.PI/16),
                     b = k*Math.sin(n*Math.PI/16),
                     BminusA = b-a, AplusB = a+b;
        Loeffler(a, BminusA, AplusB, stage, x0, x1);
    }

    static void Lc1(int stage, int x0, int x1) {
        final int n = 1;
        final double k = 1;
        final double a = k*Math.cos(n*Math.PI/16),
                     b = k*Math.sin(n*Math.PI/16),
                     BminusA = b-a, AplusB = a+b;
        Loeffler(a, BminusA, AplusB, stage, x0, x1);
    }

    static void Lc3(int stage, int x0, int x1) {
        final int n = 3;
        final double k = 1;
        final double a = k*Math.cos(n*Math.PI/16),
                     b = k*Math.sin(n*Math.PI/16),
                     BminusA = b-a, AplusB = a+b;
        Loeffler(a, BminusA, AplusB, stage, x0, x1);
    }
        for (i = 0; i < 8; i++) {
            // Input mapping
            for (j = 0; j < 8; j++)
                s[0][j] = block[i][j];

            // Stage 1: Butterfly
            for (j = 0; j < 4; j++)
                Butterfly(0, j, 7-j);

            // Stage 2
            for (j = 0; j < 2; j++)
                Butterfly(1, j, 3-j);
            Lc3(1,4,7);
            Lc1(1,5,6);

            // Stage 3
            Butterfly(2,0,1);
            Lr1(2,2,3);
            Butterfly(2,4,6);
            Butterfly(2,7,5);

            // Stage 4
            for (j = 0; j < 4; j++)
                s[4][j] = s[3][j];
            Butterfly(3,7,4);
            s[4][5] = root2 * s[3][5];
            s[4][6] = root2 * s[3][6];

            // Output mapping
            for (j = 0; j < 8; j++)
                tmp[map[j]][i] = s[4][j];
            /*
            System.out.println("1D S: ");
            for (j = 0; j < 5; j++) {
                for (k = 0; k < 8; k++)
                    System.out.print(s[j][k] + ", ");
                System.out.println();
            }
            */
        }
        /*
        System.out.println("FDCT 1D: ");
        for (i = 0; i < 8; i++) {
            for (j = 0; j < 8; j++)
                System.out.print(tmp[i][j] + ", ");
            System.out.println();
        }
        System.out.println();
        */
        for (i = 0; i < 8; i++) {
            // Input mapping
            for (j = 0; j < 8; j++)
                s[0][j] = tmp[i][j];

            // Stage 4
            for (j = 0; j < 4; j++)
                s[4][j] = s[3][j];
            Butterfly(3,7,4);
            s[4][5] = root2 * s[3][5];
            s[4][6] = root2 * s[3][6];

            // Output mapping
            for (j = 0; j < 8; j++)
                block[i][map[j]] = (short) Math.round(s[4][j]);
            /*
            System.out.println("2D S: ");
            for (j = 0; j < 5; j++) {
                for (k = 0; k < 8; k++)
                    System.out.print(s[j][k] + ", ");
                System.out.println();
            }
            */
        }
IDCT.java

public class IDCT {
    static double s[][] = new double[5][8];
    static double tmp[][] = new double[8][8];
    static final int map[] = {0,4,2,6,7,3,5,1};

    static void IButterfly(int stage, int x0, int x1) {
        s[stage+1][x0] = (s[stage][x0] + s[stage][x1]) / 2;
        s[stage+1][x1] = (s[stage][x0] - s[stage][x1]) / 2;
    }

    static void ILoeffler(double C, double DminusC, double DplusC,
                          int stage, int x0, int x1) {
        double tmp = C * (s[stage][x0] + s[stage][x1]);
        s[stage+1][x0] = DplusC  * s[stage][x0] - tmp;
        s[stage+1][x1] = DminusC * s[stage][x1] + tmp;
    }
    static void Ic1(int stage, int x0, int x1) {
        final int n = 1;
        final double k = 1;
        final double c = Math.sin(n*Math.PI/16)/k,
                     d = Math.cos(n*Math.PI/16)/k,
                     DminusC = d-c, DplusC = d+c;
        ILoeffler(c, DminusC, DplusC, stage, x0, x1);
    }

    static void Ic3(int stage, int x0, int x1) {
        final int n = 3;
        final double k = 1;
        final double c = Math.sin(n*Math.PI/16)/k,
                     d = Math.cos(n*Math.PI/16)/k,
                     DminusC = d-c, DplusC = d+c;
        ILoeffler(c, DminusC, DplusC, stage, x0, x1);
    }

    static void Ir1(int stage, int x0, int x1) {
        final int n = 1;
        final double k = Math.sqrt(2);
        final double c = Math.sin(n*Math.PI/16)/k,
                     d = Math.cos(n*Math.PI/16)/k,
                     DminusC = d-c, DplusC = d+c;
        ILoeffler(c, DminusC, DplusC, stage, x0, x1);
    }
    public static void idct(short block[][]) {
        int i, j, k;
        final double invRoot2 = 1.0/Math.sqrt(2);
        /*
        System.out.println();
        System.out.println("Dimension 0: ");
        */
        for (i = 0; i < 8; i++) {
            // Input mapping
            for (j = 0; j < 8; j++)
                s[0][j] = block[i][map[j]];
            // Stage 3
            for (j = 0; j < 2; j++)
                IButterfly(2, j, 3-j);
            Ic3(2,4,7);
            Ic1(2,5,6);

            // Stage 4
            for (j = 0; j < 4; j++)
                IButterfly(3, j, 7-j);
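Each ILoeffler call above applies the well-known three-multiplication factorization of a scaled plane rotation: with t = c(x0 + x1), the outputs (d+c)x0 - t and (d-c)x1 + t equal d*x0 - c*x1 and c*x0 + d*x1, saving one multiplier per rotation. The equivalence can be sketched stand-alone (class name illustrative):

```java
public class RotationSketch {
    // Direct form of the rotation: four multiplications.
    static double[] direct(double c, double d, double x0, double x1) {
        return new double[] { d*x0 - c*x1, c*x0 + d*x1 };
    }

    // Loeffler factorization: three multiplications, one shared product,
    // matching the ILoeffler routine in the listing above.
    static double[] threeMult(double c, double d, double x0, double x1) {
        double t = c * (x0 + x1);                       // shared product
        return new double[] { (d + c)*x0 - t, (d - c)*x1 + t };
    }
}
```

In the truncated IDCT the three coefficients c, d-c and d+c are the three CSD constants stored per rotation in the cOp table.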
IDCT_Trunc.java

public class IDCT_Trunc {
    static long s[][] = new long[5][8];
    static long tmp[][] = new long[8][8];
    static final int map[] = {0,4,2,6,7,3,5,1};

    static void initCoeff(int idx, double c, double sub, double sum, int prec)
    {
        long factor = ((long) 1) << prec;
        long cL, subL, sumL;
        cL   = (long) Math.round(c   * factor);
        subL = (long) Math.round(sub * factor);
        sumL = (long) Math.round(sum * factor);
        cOp[idx][0] = CSD.toCSD(cL);
        cOp[idx][1] = CSD.toCSD(subL);
    }

    public static void init_IDCT_Trunc(int prec[])
    {
        final double k[] = {1, 1, Math.sqrt(2)};
        final int n[] = {1, 3, 1};
        double c, d, sub, sum;

        System.out.println("Initialize IDCT coefficients: ");

        final double ir = 1/Math.sqrt(2);
        long factor = ((long) 1) << prec[3];
        long r2L = (long) Math.round(ir * factor);
        invRoot2 = CSD.toCSD(r2L);
        System.out.println("1/Sqrt(2) = " + r2L);
    }
    /*
    static long mult(byte sd[], long val, int trunc) {
        long result = 0;
        ...
    }
    */
    static long mult(byte sd[], long val, int trunc) {
        long result = 0;
        long pp;
        if (val == 0) return 0;
        nMul++;
        for (int i = 0; i < sd.length; i++) {
            if (sd[i] == 0) continue;
            if (i < trunc) pp = val >> (trunc - i);
            else           pp = val << (i - trunc);
            if (sd[i] == 1)
                result += pp;
            else  // sd[i] == -1
                result -= pp;
        }
        return result;
    }
    static void IButterfly(int stage, int x0, int x1) {
        s[stage+1][x0] = (s[stage][x0] + s[stage][x1]) / 2;
        s[stage+1][x1] = (s[stage][x0] - s[stage][x1]) / 2;
        nAdd += 2;
    }

    static void IButterfly2(int stage, int x0, int x1)
    static void ILoeffler(byte C[], byte DminusC[], byte DplusC[],
                          int stage, int x0, int x1, int trunc)
    {
        long tmp = mult(C, s[stage][x0] + s[stage][x1], trunc);
        s[stage+1][x0] = mult(DplusC,  s[stage][x0], trunc) - tmp;
        s[stage+1][x1] = mult(DminusC, s[stage][x1], trunc) + tmp;
    }

    static void Ic1(int stage, int x0, int x1, int trunc) {
        ILoeffler(cOp[0][0], cOp[0][1], cOp[0][2], stage, x0, x1, trunc);
    }

    static void Ic3(int stage, int x0, int x1, int trunc) {
        ILoeffler(cOp[1][0], cOp[1][1], cOp[1][2], stage, x0, x1, trunc);
    }

    static void Ir1(int stage, int x0, int x1, int trunc) {
        ILoeffler(cOp[2][0], cOp[2][1], cOp[2][2], stage, x0, x1, trunc);
    }
    static void adjustOffset(long stage[], int offset[]) {
        for (int i = 0; i < 8; i++) {
            if (offset[i] < 0) stage[i] >>= (-offset[i]);
            else if (offset[i] > 0) stage[i] <<= offset[i];
        }
    }

    static long calcR(int r) {
        if (r < 2) return 0;   // Do nothing
        int i;
        long offset;
        for (i = 1, offset = 1; i < r-1; i++)
            offset = (offset << 1) | 1;
        // offset = ((long)1) << (r-2);
        return offset;
    }
    public static void idctTrunc(short block[][], int trunc[][],
                                 int offset[][][], int round[])
    {
        int i, j, k;
        long r0, r1;
        r0 = calcR(round[0]);
        r1 = calcR(round[1]);
        /*
        System.out.println();
        System.out.println("Dimension 0: ");
        */
        for (i = 0; i < 8; i++) {
            // Input mapping
            for (j = 0; j < 8; j++)
                s[0][j] = block[i][map[j]];

            // Stage 1
            IButterfly2(0,0,1);
            Ir1(0,2,3, trunc[0][2]);
            IButterfly2(0,7,4);
            s[1][5] = mult(invRoot2, s[0][5], trunc[0][3]);
            s[1][6] = mult(invRoot2, s[0][6], trunc[0][3]);

            // Stage 3
            for (j = 0; j < 4; j++)   // Rounding
                s[3][j] = s[2][j] + (r0 << (-offset[0][4][j] - round[0]));
            nAdd += 4;
            Ic3(2,4,7, trunc[0][1]);
            Ic1(2,5,6, trunc[0][0]);
            adjustOffset(s[3], offset[0][3]);

            // Stage 4
            for (j = 0; j < 4; j++)
                IButterfly2(3, j, 7-j);
            // Stage 1
            IButterfly2(0,0,1);
            Ir1(0,2,3, trunc[1][2]);
            IButterfly2(0,7,4);
            s[1][5] = mult(invRoot2, s[0][5], trunc[1][3]);
            s[1][6] = mult(invRoot2, s[0][6], trunc[1][3]);
            adjustOffset(s[1], offset[1][1]);

            // Stage 3
            for (j = 0; j < 4; j++)
                s[3][j] = s[2][j] + (r1 << (-offset[1][4][j] - round[1]));
            nAdd += 4;
            Ic3(2,4,7, trunc[1][1]);
            Ic1(2,5,6, trunc[1][0]);

            // Stage 4
            for (j = 0; j < 4; j++)
                IButterfly2(3, j, 7-j);
IEEE_1180_1990.java

    static long e[][]    = new long[8][8],
                pmse[][] = new long[8][8],
                pme[][]  = new long[8][8],
                ome, omse;

    static boolean checkError(short xCal[][], short xRef[][]) {
        int i, j, err, e2;
        long eAbs, pmeAbs, omeAbs;

            e[i][j] = err = xCal[i][j] - xRef[i][j];
            e2 = err * err;
            pmse[i][j] += e2;
            pme[i][j]  += err;
            ome  += err;
            omse += e2;
            System.out.print(err + " ");

            eAbs   = (err < 0 ? -err : err);
            pmeAbs = (pme[i][j] < 0 ? -pme[i][j] : pme[i][j]);
scacic void cransforrnBlock ( s h o r z bl [ J [ ] , shori b2 [ ] [ ] , o f f s e t [ ] [ ] [ ] , i nc round[] )
( FDCT. facc (bl) ; clFpFDCT ! bl) ;
IDCT . i d c ~ (SI 1 ; clipIDCT (bl) ;
pme ="+prne[il [j J
orne ="+orne
omse="iomse
re turn false; 1
trunc [ 1 1, int
IDCT-Trunc. idcLTrunc (b2# t runc , o f f se t , round) ; clFpIDCT (b2) ;
    static boolean checkLH(long L, long H, boolean negatePixel, int trunc[][],
                           int offset[][][], int round[])
    {
        short b[][]  = new short[8][8],
              b1[][] = new short[8][8],
              b2[][] = new short[8][8];
        int i, j, k;

        // Initialize stat variables
        for (i = 0; i < 8; i++)
            for (j = 0; j < 8; j++) {
                e[i][j] = pmse[i][j] = pme[i][j] = 0;
            }
        ome = omse = 0;

        IDCT_Trunc.idctTrunc(b1, trunc, offset, round);
        if (!checkZero(b1)) return false;

        IEEE_Random.init(L, H);
        for (j = 0; j < 8; j++)    // Generate random pixel data
            for (k = 0; k < 8; k++)

        if (!checkError(b1, b2)) {
            System.out.println("Block=" + i + ", L=" + L + ", H=" + H +
                               ", negate=" + negatePixel);
            return false;
        }
        long max;
        int percent;
        System.out.print("PASSED: pme(max)=");
        for (i = 0, max = 0; i < 8; i++)
            for (j = 0; j < 8; j++)
                if (pme[i][j] > max) max = pme[i][j];
        percent = (int) ((max * 100.0) / pmeMAX);
        System.out.print((max / 10000.0) + " (" + percent + "%), pmse(max)=");
        for (i = 0, max = 0; i < 8; i++)
            for (j = 0; j < 8; j++)
                if (pmse[i][j] > max) max = pmse[i][j];
        percent = (int) ((max * 100.0) / pmseMAX);
        System.out.print((max / 10000.0) + " (" + percent + "%), ");
        percent = (int) ((ome * 100.0) / omeMAX);
        System.out.print((ome / (64 * 10000.0)) + " (" + percent + "%), ");
    static int getInt(StreamTokenizer s) throws Exception
    {
        for (int token = s.nextToken(); token != s.TT_NUMBER; token = s.nextToken())
            ;
        return (int) s.nval;
    }

    static void initSetup(int trunc[][], int offset[][][], int round[])
        throws Exception
    {
        int d, i, j;

        IDCT_Trunc.init_IDCT_Trunc(prec);

            for (i = 0; i < 5; i++)
                for (j = 0; j < 4; j++)
                    offset[d][i][j] = getInt(setup);
            round[d] = getInt(setup);
    }
    public static void main(String args[]) throws Exception
    {
        initSetup(trunc, offset, round);

        for (i = 0; i < 2; i++, negate = !negate)
            for (j = 0; j < 3; j++)
                if (!checkLH(L[j], H[j], negate, trunc, offset, round))
                    return;
        System.out.println("All tests passed!");
    }
IEEE_Random.java

    public static void init(long l, long h)
    {
        randx = 1;
        z = (double) 0x7fffffff;
    }

    public static long rand()
    {
        long i, j;
        double x;
    }

    public static void main(String args[])
    {
        long l, h, n;
        n = Long.parseLong(args[0]);
        l = Long.parseLong(args[1]);
        h = Long.parseLong(args[2]);