The Usage of Dual Edge Triggered Flip-flops
in Low Power, Low Voltage Applications
by
Wai Man Chung
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Applied Science
in
Electrical and Computer Engineering
Waterloo, Ontario, Canada, January 2003
© Wai Man Chung, 2003
I hereby declare that I am the sole author of this thesis.
I authorize the University of Waterloo to lend this thesis to other institutions or individuals
for the purpose of scholarly research.
Wai Man Chung
I authorize the University of Waterloo to reproduce this thesis by photocopying or other
means, in total or in part, at the request of other institutions or individuals for the purpose
of scholarly research.
Wai Man Chung
The University of Waterloo requires the signatures of all persons using or photocopying
this thesis. Please sign below, and give address and date.
Acknowledgements
I would like to thank...
Professor Manoj Sachdev
for the opportunity of working on this project, his guidance and especially his patience;
Dr. A. Opal and Dr. M.A. Hasan, my thesis readers,
for their invaluable time in reviewing my thesis;
Tim Lo and Sarmad Musa,
who have shown me what it is like to work in a real team;
Bhaskar Chatterjee, Oleg Semonov, Mohamed Elgebaly
for their insightful discussions;
Phil Regier,
whom I have given tremendous trouble to;
Wendy Boles and Wendy Gauthier, our helpful and friendly secretaries,
for their help and great smiles;
Gennum Corporation’s HIP group, Mr. Rob Cram in particular,
for their interest and support;
Canadian Microelectronics Corporation (CMC)
for providing resources and for their support;
My friends, Ying, Jen, Chris, Hadi, Ed, Ka Lok, Zhinian, Dorothy, etc.,
for bringing me so much laughter!!
My best friend Shannen for her sense of humor and encouragement;
My best friend Nora, who has listened and cheered me all along;
My family for their love, care and faith in me;
God, our heavenly Father, for His blessings.
Abstract
In research on low power and low voltage VLSI circuits, the use and implementation of dual edge triggered flip-flops (DETFFs) have gained increasing attention in gate-level design. The main advantage of using a DETFF is that it maintains a constant throughput while operating at only half the clock frequency. This thesis compares four previously published static DETFFs together with our design in terms of performance, power dissipation, and suitability for low voltage, low power applications. For each DETFF, the optimal delay, power consumption, and energy are determined as the primary figures of merit. The proposed design demonstrates the lowest energy at low voltages. To illustrate the advantages of using DETFFs over conventional single-edge triggered flip-flops (SETFFs), a digital half-band FIR filter is designed, implemented and used as a benchmark circuit for further investigation. The implementation of the FIR filter with DETFFs exhibits a power saving of 38% over the implementation with SETFFs.
Contents
1 Introduction
1.1 Motivation
1.2 Thesis Outline
2 Low Power and Low Voltage CMOS Design
2.1 Low Voltage
2.2 Parallelism
2.3 Pipeline
2.4 Switching Activity Reduction
2.5 Switching Capacitance Reduction
3 Dual Edge Triggered Flip-Flop
3.1 Types of Flip-flops
3.2 Dual Edge-Triggered Flip-flops
3.2.1 DETFF implementations
3.3 Analysis of Dual-Edge Triggered Flip-flops
3.3.1 Power Consumption of a DETFF
3.3.2 Timing Characterization of a DETFF
3.3.3 Simulation
3.4 Results
3.4.1 Discussion
3.5 Other Considerations
3.5.1 Parallel Interconnects and Clock Requirements
3.5.2 Design for testability
4 Hearing Aids and Digital Filters
4.1 Hearing Aids
4.2 Digital Filters
4.2.1 Half-Band FIR
4.2.2 Implementation of FIR - Direct Form Structure
4.3 Design and Implementation of a Chebychev Half-Band FIR Filter
4.3.1 Number representation
4.3.2 Processing Blocks
5 Results and Discussions
5.1 Measurements
5.2 Discussions
6 Conclusions
6.1 Future Work
A Glossary of Terms
List of Tables
3.1 Simulation parameters
3.2 Optimal parameters for DETFFs studied [1]
3.3 Performance Characteristics for DETFFs studied [1]
3.4 Summary of DETFF performance as Vdd reduces [1]
4.1 Specifications for Chebychev half-band FIR filter with N=39
5.1 Back annotation parameters for DETFFs and standard SETFF
5.2 Simulation Results for Chebyshev half-band FIR filter operating at 0.9 V
List of Figures
2.1 A generic data path of a synchronous system
2.2 Parallel architecture operating at lower clock rate and with reduced Vdd
2.3 Pipeline architecture
2.4 Latch with clock gating [2]
2.5 Interconnect Capacitance
3.1 Classic implementation of DETFF
3.2 DETFF proposed in [3], DETgago
3.3 DETFF proposed in [4], DETllopis
3.4 DETFF proposed in [5], DETpedram
3.5 DETFF proposed in [6], DETstrollo
3.6 Proposed DETFF, DETproposed
3.7 The simulation testbench for flip-flops
3.8 DETFF internal node transitions given input sequence of 1010101000
3.9 PDPCQ vs tCQ, used to determine the initial optimization point [1]
3.10 Power consumption dependence on data transition activity α [1]
3.11 Power consumption dependence on supply voltage [1]
3.12 tCQ as a function of supply voltage [1]
3.13 PDP dependency as a function of supply voltage [1]
3.14 DETFF proposed in [4] with unidirectional characteristic
3.15 DETFF using XOR operation to generate delayed clock pulses [7]
3.16 DETFF proposed in [8] with DET pulse generator
4.1 FIR filter implementing with direct form structure
4.2 Block Diagram of the Chebychev half-band FIR filter design
4.3 (a) Bit-serial adder (b) Bit-serial subtractor [9]
4.4 Serial/Parallel multiplier
4.5 Structures for shift registers
5.1 DETgago with reset active low control
5.2 DETproposed with reset active low control
5.3 Frequency response of the Chebyshev half-band FIR filter
5.4 Improved implementation of DETgago with reset active low control
5.5 Layout of the half-band FIR filter implemented with standard SETFF
5.6 Layout of the half-band FIR filter implemented with standard DETFF
Chapter 1
Introduction
Today's technologies make possible powerful computing devices with multimedia capabilities. Consumers increasingly expect accessibility and mobility, and this desire has created a demand for an ever-increasing number of portable applications requiring low power and high throughput. For example, notebook and handheld computers are now built with computational capabilities comparable to those of desktop machines. Equally demanding are personal communication applications in pocket-sized devices, where not only voice but also data and video are transmitted via wireless links. It is important that these high computational capabilities are placed in a low-power, portable environment. The weight and size of these portable devices are determined by the amount of power required. Battery lifetime is crucial for such products; hence, a well-planned low-energy design strategy must be in place [10, 11].
As the density of integrated circuits and the size of chips and systems continue to grow, it becomes increasingly difficult to provide adequate cooling [10]. In addition to heat removal, there are also economic and environmental reasons for low-power development. In the United States, computer equipment accounts for about 2-3% of national electricity consumption. This figure is expected to increase with the tremendous growth in household computer applications, Web phones, handheld computers, and Internet terminals [11, 12, 13]. These economic and environmental concerns have created a need for energy-efficient computers.
To meet the demand for high computational performance, clock rates are steadily increasing, and clock jitter and clock skew are becoming an increasingly significant part of the clock cycle. The energy consumed by low-skew clock distribution networks keeps growing. Clock-related power consumption can exceed 30-40% of the total power of a microprocessor and is becoming a larger fraction of chip power. In addition, the number of logic gate delays in a clock period is reduced by 25% per generation, and is approaching 10 or fewer beyond the 0.13 µm technology generation. As a result, the latency of flip-flops and latches is becoming a larger portion of the cycle time. To achieve a design that is both high-performance and power-efficient, careful attention must be paid to the design of flip-flops and latches [7, 14].
1.1 Motivation
Energy is the product of average power and delay, and power is proportional to the square of the supply voltage. Voltage reduction is therefore an efficient way to reduce power consumption, but it also reduces logic speed. For signal processing applications such as digital filters, however, it is important to maintain a given level of computation, or throughput. Hence, parallel architectures should be used to maintain the throughput at a reduced supply voltage [10]. Dual-edge triggered flip-flops are a device-level realization of this concept: they obtain the same data throughput at one half of the clock frequency, thus relaxing the power and clock uncertainty requirements [14].
1.2 Thesis Outline
This thesis focuses on the applicability of DETFFs in low power and low voltage applications. Chapter 2 provides background on low power and low voltage applications and design considerations, and describes techniques for low power and low voltage such as parallelism. Chapter 3 first describes all the DETFFs investigated in this study, including a newly proposed DETFF. It then states the analysis methodology, outlines the simulation testbench and parameters, and explains the DETFF optimization procedure, followed by simulation results. Chapter 4 gives a brief introduction to a stringent low power, low voltage application, hearing aids. It then describes a Chebyshev half-band finite impulse response (FIR) digital filter and its implementation. This FIR filter is used as a benchmark circuit to further investigate the use of DETFFs in practical low power and low voltage applications. Chapter 5 summarizes the simulation and layout results of the FIR filter. Finally, conclusions are drawn in Chapter 6.
Chapter 2
Low Power and Low Voltage CMOS
Design
The design of portable devices requires consideration for peak power consumption to ensure
reliability and proper operation. However, the time averaged power is often more critical
as it is linearly related to the battery life. There are four sources of power dissipation
in digital CMOS circuits: switching power, short-circuit power, leakage power and static
power. The following equation describes these four components of power:
$$P_{avg} = P_{switching} + P_{short\text{-}circuit} + P_{leakage} + P_{static} \tag{2.1}$$
$$P_{avg} = \alpha C_L V_{dd} V_s f_{ck} + I_{sc} V_{dd} + I_{leakage} V_{dd} + I_{static} V_{dd} \tag{2.2}$$
Pswitching is the switching power. For a properly designed CMOS circuit, this component usually dominates and may account for more than 90% of the total power [10, 15]. α denotes the transition activity factor, defined as the average number of power-consuming transitions made at a node in one clock period. Vs is the voltage swing, which in most cases equals the supply voltage, Vdd. CL is the node capacitance. It can be broken into three components: the gate capacitance, the diffusion capacitance, and the interconnect capacitance. The interconnect capacitance is in general a function of placement and routing. fck is the clock frequency. The switching power for static CMOS is derived as follows [11].
During the low to high output transition, the path from Vdd to the output node is con-
ducting to charge CL. Hence, the energy provided by the supply source is
$$E = \int_0^{\infty} V_{dd}\, I(t)\, dt \tag{2.3}$$
where $I(t) = \frac{V_s}{R} e^{-t/RC_L}$ is the current drawn from the supply. Here, R is the resistance of the path between Vdd and the output node. Therefore, the energy can be rewritten as
$$E = C_L V_{dd} V_s \tag{2.4}$$
During the high to low transition, no energy is supplied by the source. Hence, the average power consumed during one clock cycle of period T = 1/fck is
$$P = \frac{E_{per\,cycle}}{T} = C_L V_{dd} V_s f_{ck} \tag{2.5}$$
Eq. (2.4) and Eq. (2.5) estimate the energy and the power of a single gate only. From a
system point of view, α is used to account for the actual number of gates switching at a
point in time.
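The Python sketch below gives a small numerical illustration of Eq. (2.2), Eq. (2.4) and Eq. (2.5) by evaluating the switching term for a single node. The capacitance, supply voltage, activity factor and frequencies are hypothetical values chosen only for illustration, and Vs is assumed equal to Vdd (rail-to-rail swing). Halving fck halves the switching power, while the energy drawn per transition stays the same.

```python
def switching_power(alpha, c_load, vdd, v_swing, f_clk):
    """Switching term of Eq. (2.2): alpha * C_L * Vdd * Vs * f_ck, in watts."""
    return alpha * c_load * vdd * v_swing * f_clk


def energy_per_transition(c_load, vdd, v_swing):
    """Eq. (2.4): energy drawn from the supply per low-to-high transition, in joules."""
    return c_load * vdd * v_swing


# Hypothetical values, not taken from the thesis.
C_L = 20e-15   # node capacitance: 20 fF
VDD = 1.2      # supply voltage in volts; rail-to-rail swing, so Vs = Vdd
ALPHA = 0.25   # on average, one power-consuming transition every four clock cycles

for f_clk in (100e6, 50e6):
    p = switching_power(ALPHA, C_L, VDD, VDD, f_clk)
    print(f"f_ck = {f_clk / 1e6:.0f} MHz -> P_switching = {p * 1e6:.2f} uW")

e = energy_per_transition(C_L, VDD, VDD)
print(f"E per transition = {e * 1e15:.1f} fJ (independent of f_ck)")
```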
Pshort−circuit is the short-circuit power. It is a type of dynamic power and is typically
much smaller than Pswitching [11]. Isc is known as the direct-path short circuit current.
It refers to the conducting current from power supply directly to ground when both the
NMOS and PMOS transistors are simultaneously active during switching [10].
Pleakage is the leakage power, and Ileakage is the leakage current. Leakage is primarily determined by fabrication technology and originates from two sources. The first is the reverse leakage current of the parasitic drain-/source-substrate diodes. This current is on the order of a few femtoamperes per diode, which translates into a few microwatts of power for a million transistors. The second source is the subthreshold current of MOSFETs, which is on the order of a few nanoamperes per transistor. For a million transistors, the total subthreshold leakage current results in a few milliwatts of power [10, 11].
Pstatic is the static power and Istatic is the static current. This current arises from circuits that have a constant current path between the power supplies, such as bias circuitry and pseudo-NMOS logic [10]. In the conventional CMOS logic family, power is dissipated only when the circuits switch, with no static power consumption [15].
Energy is independent of the clock frequency. Reducing the frequency lowers the power consumption but does not change the energy required to perform a given operation, as shown by Eq. (2.4) and Eq. (2.5). Note that battery life is determined by energy consumption, whereas heat dissipation is related to power consumption [11].
Four factors influence the power dissipation of CMOS circuits: technology, circuit design style, architecture, and algorithm. The challenge of meeting the conflicting goals of high performance and low power operation has motivated the development of low power process technologies and the scaling of device feature sizes. Low power should be considered at every step of the design hierarchy, namely 1) fundamental, 2) material, 3) device, 4) circuit, and 5) system [10, 15].
2.1 Low Voltage
Power consumption is linearly proportional to the voltage swing (Vs) and the supply voltage (Vdd), as indicated in Eq. (2.5). For most CMOS logic families, the swing is rail-to-rail. Hence, power consumption is also said to be proportional to the square of the supply voltage, Vdd. Therefore, lowering Vdd is an efficient approach to reducing both energy and power, presuming that the signal voltage swing can be freely chosen. This comes, however, at the expense of circuit delay. The delay can be shown to follow $t_d \propto V_{dd}/(V_{dd} - V_T)^{\gamma}$, where the exponent γ is between 1 and 2. It tends to be closer to 1 for MOS transistors in the deep sub-micrometer region, where carrier velocity saturation may occur, and increases toward 2 for longer channel transistors [11].
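The trade-off described above can be explored with the delay model $t_d \propto V_{dd}/(V_{dd}-V_T)^{\gamma}$. The sketch below normalizes delay and energy to a nominal supply. The nominal supply (1.8 V), the threshold voltage (0.4 V) and the swept supply values are all hypothetical, and energy is taken as proportional to $V_{dd}^2$, which assumes a rail-to-rail swing. The output shows that the same supply reduction costs more delay as γ grows.

```python
def relative_delay(vdd, vt, gamma):
    """Gate delay up to a constant factor: t_d ~ Vdd / (Vdd - VT)**gamma, for Vdd > VT."""
    if vdd <= vt:
        raise ValueError("the delay model is only meaningful for Vdd > VT")
    return vdd / (vdd - vt) ** gamma


V_NOM = 1.8  # hypothetical nominal supply (V)
VT = 0.4     # hypothetical threshold voltage (V)

for gamma in (1.0, 1.5, 2.0):
    base = relative_delay(V_NOM, VT, gamma)
    for vdd in (1.8, 1.2, 0.9):
        slowdown = relative_delay(vdd, VT, gamma) / base
        energy = (vdd / V_NOM) ** 2  # energy relative to nominal, E ~ Vdd^2
        print(f"gamma={gamma:.1f} Vdd={vdd:.1f} V: delay x{slowdown:.2f}, energy x{energy:.2f}")
```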
The current technology trends are to reduce feature size and lower the supply voltage. Lowering Vdd increases circuit delays and therefore lowers functional throughput. Smaller feature sizes, however, reduce gate delay, which to first order scales with the square of the effective channel length of the devices. In addition, thinner gate oxides impose voltage limits for reliability reasons, so the supply voltage must be lowered for smaller geometries. The net effect is that circuit performance improves as CMOS technologies scale down, despite the Vdd reduction. Therefore, new technologies have made it possible to fulfill the conflicting requirements of low power and high throughput [10, 11, 15].
The techniques currently used to scale the supply voltage include optimizing the technology and device reliability, trading off area for low power in an architecture-driven approach, and exploiting concurrency through algorithmic transformations [10]. Reference [11] has shown that the optimum supply voltage for CMOS technology is equal to $3V_{th}/(3-\gamma)$. Hence, voltage scaling is limited by the threshold voltage Vth.
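The expression $3V_{th}/(3-\gamma)$ can be recovered under one assumption: that the optimality criterion is the energy-delay product. Energy is taken as proportional to $V_{dd}^2$ (rail-to-rail swing), and delay follows the model given earlier in this section with $V_T = V_{th}$:

$$\mathrm{EDP}(V_{dd}) \propto V_{dd}^2 \cdot \frac{V_{dd}}{(V_{dd}-V_{th})^{\gamma}} = \frac{V_{dd}^3}{(V_{dd}-V_{th})^{\gamma}}$$

$$\frac{d}{dV_{dd}} \ln \mathrm{EDP} = \frac{3}{V_{dd}} - \frac{\gamma}{V_{dd}-V_{th}} = 0 \;\Longrightarrow\; 3\,(V_{dd}-V_{th}) = \gamma\, V_{dd} \;\Longrightarrow\; V_{dd,opt} = \frac{3V_{th}}{3-\gamma}$$

For γ between 1 and 2, this places the optimum between 1.5 Vth and 3 Vth. This is why the threshold voltage bounds how far the supply can usefully be scaled.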
In applications such as digital signal processing, where throughput is of more concern than speed, the architecture can be designed to reduce the supply voltage at the expense of speed without degrading throughput, so that system performance is maintained. This can be achieved using parallelism and/or pipelining, both of which are discussed next [11].
2.2 Parallelism
Power dissipation increases linearly with frequency, as shown in Eq. (2.2). The system clock typically switches at the highest frequency on the chip, so clock switching has a considerable impact on power. Clock power can be as much as twice the logic power for static logic and three times the logic power for dynamic logic. To minimize clock power, a system should be operated at the minimum frequency necessary to attain the required level of performance. Logic parallelism allows operation at a lower frequency while maintaining the desired throughput. It also allows for a reduced supply voltage [15].
Consider a generic data path consisting of two latches that synchronize the data flow, as depicted in Figure 2.1. The energy consumed by this implementation is $E = C_L V_{dd}^2$.
Figure 2.1: A generic data path of a synchronous system (input Din, a logic block between two D latches clocked by CK at frequency f, output Dout)
Now consider the parallel structure illustrated in Figure 2.2. If the logic is duplicated n times, the input can be fed to each logic block at a lower frequency: the latches of each block are clocked at one n-th of the frequency of the latch in Figure 2.1. The outputs of the parallel blocks are sent to the output latch through a multiplexer.
Figure 2.2: Parallel architecture operating at lower clock rate and with reduced Vdd (n logic sub-blocks with latches clocked at CK1(f/n), CK2(f/n), ..., CKn(f/n); their outputs are combined by a multiplexer controlled by Select and captured by an output latch clocked at CK(f))
In this implementation, the clock rate of each logic block is reduced to f/n. Hence, the delay of each block can be relaxed by a factor of n, while the throughput remains the same as in the non-parallel case. Furthermore, because a larger block delay is allowed, the supply voltage can be scaled down by a factor of n. The optimum supply voltage for the parallel architecture, Vdd(//), neglecting Vth, is expressed as
$$V_{dd(//)} = \frac{V_{dd}}{n} \tag{2.6}$$
If the power overhead due to the multiplexer is neglected, the energy for the parallel architecture, E(//), can be expressed as
$$E_{(//)} = C_L V_{dd(//)}^2 = E\,\frac{V_{dd(//)}^2}{V_{dd}^2} = \frac{E}{n^2} \tag{2.7}$$
where E is the energy consumed by the generic, non-parallel structure.
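Eq. (2.6) and Eq. (2.7) neglect Vth. The sketch below is a rough cross-check that reuses the delay model of Section 2.1. It solves, by bisection, for the supply at which each of the n parallel blocks is exactly n times slower than the original block at its nominal supply. The nominal supply, threshold voltage and γ are hypothetical, and the multiplexer overhead is ignored, as in Eq. (2.7). With a non-zero Vth, the achievable voltage scaling, and therefore the energy saving, comes out smaller than the ideal 1/n².

```python
def relative_delay(vdd, vt, gamma):
    """Gate delay up to a constant factor: t_d ~ Vdd / (Vdd - VT)**gamma, for Vdd > VT."""
    return vdd / (vdd - vt) ** gamma


def parallel_vdd(n, vdd_nom, vt, gamma, tol=1e-9):
    """Supply at which delay is n times the delay at vdd_nom, found by bisection.

    For gamma >= 1 the delay decreases monotonically as Vdd rises above VT,
    so the solution is bracketed by (VT, vdd_nom].
    """
    target = n * relative_delay(vdd_nom, vt, gamma)
    lo, hi = vt + 1e-6, vdd_nom
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relative_delay(mid, vt, gamma) > target:
            lo = mid  # still too slow: a higher supply is needed
        else:
            hi = mid  # fast enough: try a lower supply
    return hi


V_NOM, VT, GAMMA = 1.8, 0.4, 2.0  # hypothetical process values
for n in (2, 4):
    v_ideal = V_NOM / n                       # Eq. (2.6), Vth neglected
    v_real = parallel_vdd(n, V_NOM, VT, GAMMA)
    print(f"n={n}: Vdd ideal {v_ideal:.3f} V, with Vth {v_real:.3f} V; "
          f"E/E0 ideal {1 / n**2:.3f}, with Vth {(v_real / V_NOM) ** 2:.3f}")
```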
This method works well for computationally intensive functions. However, it costs a large area because it requires duplicating hardware. With this architecture-driven voltage scaling strategy, parallel logic can provide the same throughput as the original logic while operating at greatly reduced frequency and voltage [15]. Eq. (2.7) indicates that the energy saving in a parallel architecture is proportional to the square of the voltage scaling factor [11]. Parallelism can be applied at different levels of a design: system, architecture, circuit/logic, device, etc. A DET flip-flop can be considered a device-level implementation of parallelism.
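The behavioral sketch below illustrates why a DETFF is regarded as device-level parallelism by comparing ideal sampling instants. A single-edge flip-flop clocked with period P samples only on rising edges, whereas a dual-edge flip-flop clocked with period 2P samples on both edges. The integer time base, the 50% duty cycle and the data stream are hypothetical, and setup and hold timing are ignored. Both flip-flops capture the same sequence, but the DETFF clock runs at half the frequency.

```python
def sampling_instants(period, duration, dual_edge):
    """Ideal sampling instants in [0, duration) for a 50%-duty clock with rising edges at multiples of period."""
    rising = list(range(0, duration, period))
    if not dual_edge:
        return rising
    falling = list(range(period // 2, duration, period))
    return sorted(rising + falling)


def capture(data, instants):
    """Values of a time-indexed data stream at the given sampling instants."""
    return [data[t] for t in instants]


DURATION = 80
# Hypothetical stream: a new bit every 10 time units, changing midway between samples.
data = [((t + 5) // 10) % 2 for t in range(DURATION)]

set_times = sampling_instants(period=10, duration=DURATION, dual_edge=False)  # SETFF at f
det_times = sampling_instants(period=20, duration=DURATION, dual_edge=True)   # DETFF at f/2

print("SETFF samples:", capture(data, set_times), "clock cycles:", DURATION // 10)
print("DETFF samples:", capture(data, det_times), "clock cycles:", DURATION // 20)
```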
2.3 Pipeline
Pipelining is another approach to relaxing the speed requirement without degrading throughput. In this approach, a logic block is broken down into n sub-logic blocks, and latches are inserted between them, as shown in Figure 2.3. The delay of each sub-logic block is therefore reduced, which allows the supply voltage, Vdd, to be lowered. Note that the pipeline approach may offer power savings comparable to the parallel architecture but with less area overhead [11]. In some applications, pipelining and parallelism can be applied together to achieve higher power and energy savings.
Figure 2.3: Pipeline architecture (logic sub-blocks (1) to (n) separated by D latches clocked at CK(f), from input Din to output Dout)
2.4 Switching Activity Reduction
CMOS circuits dissipate power only when switching; therefore, it is important to minimize switching activity in low power applications. Switching decreases when the data rate is low. Switching activity can also be reduced by circuit and architectural optimizations that exploit data correlation. For instance, human speech exhibits higher correlation than random data [10]. Switching activity can be reduced through algorithmic optimization, architecture optimization, logic topology, and circuit optimization, which are discussed as follows.
Algorithmic optimization depends heavily on the application and on the characteristics of the data. Furthermore, the data representation may have a significant impact on switching activity. Recent research shows that using Gray code for address bits, where data changes sequentially, results in fewer transitions than binary code. Moreover, sign-magnitude notation is more efficient than two's complement notation for data that changes sign frequently. In the two's complement representation, a change in sign causes transitions of all the most significant bits, whereas in sign-magnitude notation only the sign bit changes.
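Both data-representation effects above can be checked by counting bit toggles between consecutive words. In the sketch below, the sequential address stream, the sign-alternating samples and the 8-bit word width are hypothetical examples. The encodings are the standard binary-reflected Gray code, two's complement and sign-magnitude.

```python
def hamming(a, b):
    """Number of bit positions that differ between two unsigned words."""
    return bin(a ^ b).count("1")


def total_transitions(words):
    """Sum of bit toggles between consecutive words on a bus."""
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))


def to_gray(i):
    """Binary-reflected Gray code of a non-negative integer."""
    return i ^ (i >> 1)


def twos_complement(x, bits):
    """Two's complement encoding of x as an unsigned bits-wide word."""
    return x & ((1 << bits) - 1)


def sign_magnitude(x, bits):
    """Sign-magnitude encoding: the MSB is the sign, the remaining bits hold |x|."""
    return ((1 << (bits - 1)) | -x) if x < 0 else x


addresses = list(range(16))          # sequential address stream
binary_toggles = total_transitions(addresses)
gray_toggles = total_transitions([to_gray(a) for a in addresses])

samples = [1, -1, 1, -1]             # data that changes sign every sample
tc_toggles = total_transitions([twos_complement(s, 8) for s in samples])
sm_toggles = total_transitions([sign_magnitude(s, 8) for s in samples])

print("binary:", binary_toggles, "Gray:", gray_toggles)                 # binary: 26 Gray: 15
print("two's complement:", tc_toggles, "sign-magnitude:", sm_toggles)   # two's complement: 21 sign-magnitude: 3
```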
Architecture optimization can be achieved through delay balancing, precomputation logic, and power management schemes. Balanced tree topologies are often used to balance path delays and hence reduce glitching. Precomputation logic predicts the output signal one clock cycle ahead with minimal circuit overhead. It generally allows only a small subset of inputs to pass to the combinational blocks, and hence minimizes the switching activity of the system as a whole. Figure 2.4 shows a latch with clock gating. The XOR gate compares the values of D and Q. If D and Q are the same, the output of the XOR gate is 0, and the AND gate prevents the clock from triggering the latch. If D and Q differ, the XOR-AND logic passes the clock signal. This scheme eliminates unnecessary clock switching internal to the latch.
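The clock-gating scheme of Figure 2.4 can be summarized at the cycle level as follows. This is an idealized behavioral sketch, not a circuit model. It assumes one clock pulse per cycle and ignores the timing hazard of Q changing while the gated pulse is high. The data sequence is a hypothetical example.

```python
def gated_latch(d_stream, q_init=0):
    """Cycle-level model of the XOR/AND clock-gated storage element.

    Each cycle, the XOR compares D with the stored Q, and the AND passes the
    clock pulse only when they differ. Returns the final Q and the number of
    clock pulses actually delivered to the storage element.
    """
    q, pulses = q_init, 0
    for d in d_stream:
        enable = d ^ q      # XOR output: 1 only when D differs from Q
        if enable:          # AND gate lets the clock pulse through
            pulses += 1
            q = d           # the storage element captures D
    return q, pulses


data = [0, 0, 1, 1, 1, 0, 0, 1]
q, gated = gated_latch(data)
print(f"clock pulses: {len(data)} ungated vs {gated} gated, final Q = {q}")
```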
Figure 2.4: Latch with clock gating [2] (D and Q are compared by an XOR gate whose output is ANDed with CK to produce the gated clock CKG for the storage element)
Power management is one of the most effective approaches to reducing switching activity. This power-down method puts circuits into a sleep mode when they are idle. It can be applied at different levels of the hierarchy, from module to chip level, and even at the printed circuit board level. Circuit optimization may come down to the choice of logic families as well as gate topologies, and this selection is also application-oriented [11].
2.5 Switching Capacitance Reduction
Energy consumption is proportional to switching capacitance, as shown in Eq. (2.4). Switching capacitance consists of transistor parasitic capacitance as well as the wire capacitance of metal interconnects. In general, the fewer the transistors, the smaller the parasitic gate-oxide and source/drain diffusion capacitances. The Complementary Passgate Logic (CPL) family has the lowest transistor count compared to the dynamic and static logic families. The interconnect capacitance can be further divided into three main components, as shown in Figure 2.5: parallel plate capacitance, fringing field capacitance, and wire-wire capacitance. These three are inter-related through the width (W) and height (H) of the wire, as well as the thickness of the dielectric (tox).
As tox increases, parallel plate capacitance decreases. When W is much larger than tox, parallel plate capacitance dominates. When tox becomes comparable to W and H, the fringing field effect dominates, and when W is smaller than H, the wire-wire capacitance dominates. The optimum ratio for minimum capacitance is obtained when W/H is 1.75.
Figure 2.5: Interconnect Capacitance (cross-section of a metal interconnect of width W and height H over a dielectric of thickness tox, showing the parallel plate, fringing, and wire-wire capacitance components)
Moreover, the rule for low power design is to upsize only the transistors on critical paths to meet the speed requirement and to keep the remaining transistors at minimum size as much as possible. Layout optimization is also crucial. Appropriate layout styles minimize not only the diffusion capacitances but also the interconnect length, and hence lead to significant power savings [10, 11].
Chapter 3
Dual Edge Triggered Flip-Flop
In a synchronous system, operations and data sequences take place with a fixed and predetermined time relationship. The timing of computations is controlled by flip-flops and latches together with a global clock, as shown in Figure 2.1. Flip-flops and latches are clocked storage elements that store the values applied to their inputs. They are classified according to their behaviour during the clock phases. A latch is level sensitive: it is transparent and propagates its input to the output during one clock phase (clock low or high), while holding its value during the other clock phase. A flip-flop is edge triggered: it captures its input and propagates it to the output at a clock edge (rising or falling), while keeping the output constant at all other times. The design of these clocked storage elements depends heavily on the clocking strategy and circuit topology [9, 15, 16]. This research focuses on synchronous systems with an edge-triggered clocking strategy, so only flip-flops are discussed from here on. In particular, dual edge-triggered flip-flops are introduced and explored.
3.1 Types of Flip-flops
A storage element generally stores its value as charge on a capacitor. A CMOS flip-flop can be static or dynamic, depending on how it retains its value against charge leakage. A static flip-flop retains its value using positive feedback, while a dynamic flip-flop requires periodic refreshing of its charge.
Besides the method of retaining the stored value, flip-flops are also classified by their topologies. Three types are briefly discussed in the following: master-slave flip-flops, pulse-based flip-flops, and amplifier-based flip-flops.
The master-slave flip-flop is the most commonly used flip-flop topology in low power applications. It is composed of a master latch cascaded with a slave latch, and the two latches are active during opposite clock phases.
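Below is a cycle-level behavioral sketch of this arrangement. It assumes a positive-edge-triggered variant in which the master is transparent while the clock is low and the slave is transparent while the clock is high. The clock and data sequences are hypothetical, and each list element represents one half clock period.

```python
class Latch:
    """Level-sensitive D latch: transparent while enabled, holds its value otherwise."""

    def __init__(self, q=0):
        self.q = q

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q


def master_slave_ff(clk_stream, d_stream):
    """Two latches on opposite clock phases form a positive-edge-triggered flip-flop."""
    master, slave = Latch(), Latch()
    out = []
    for clk, d in zip(clk_stream, d_stream):
        m = master.update(d, enable=(clk == 0))          # master transparent while clock is low
        out.append(slave.update(m, enable=(clk == 1)))   # slave transparent while clock is high
    return out


clk = [0, 1, 0, 1, 0, 1]
d = [1, 1, 0, 0, 1, 0]
print(master_slave_ff(clk, d))  # Q only changes when clk goes from 0 to 1
```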
Pulse-based flip-flops are popular for their soft clock edge property, which, as with a level-sensitive latch, allows time borrowing and alleviates the clock skew penalty. They also provide superior latency and can incorporate complex logic. The hybrid latch flip-flop (HLFF) and the semidynamic flip-flop (SDFF) are two practical examples of pulse-based flip-flops. HLFF is a latch with a brief transparent pulse derived from the global clock edge [17]. SDFF is composed of a dynamic stage coupled to a static stage [18, 19].
The amplifier-based flip-flop is mainly designed as a de-skewing element [16]. The sense amplifier-based flip-flop (SAFF) is one example. It incorporates a precharged sense amplifier in the first stage to generate a negative pulse, and a set-reset (SR) latch in the second stage to capture and hold the result [20].
For the critical paths of a design, a small flip-flop delay is crucial while power con-
sumption is a secondary concern. Therefore, pulse-based flip-flops, which have very short
latency, are appropriate for these types of applications. For paths that are not critical in
the design, lower power consumption can be achieved by employing static flip-flops [7].
With low power and low voltage applications in mind, static flip-flops are the focus of
analysis. Static flip-flops are of two main types, made from gates or transmission gates.
Strictly speaking, transmission gate inputs and outputs have slightly smaller capacitance than inverter outputs, since transmission gates do not exhibit the Miller effect. However, the gate-based static flip-flop is found to have lower power dissipation than the transmission gate flip-flop, despite using considerably more transistors, because it has fewer clocked transistors [15].
In addition, many other features can be incorporated into a flip-flop as enhancements. For instance, a conditional shutoff capability can reduce the sensitivity of a pulse-based flip-flop to variations of its sampling window, and a conditional capture feature can provide statistical power reduction [20]. In Reference [7], pulse-based and master-slave flip-flops are integrated into a new topology to further improve latency and power efficiency.
3.2 Dual Edge-Triggered Flip-flops
As discussed in Chapter 2, clock-related power is one of the most significant components of
the dynamic power consumption. The total clock related power dissipation in synchronous
VLSI circuits is further divided into three major components [4, 21]: (i) power dissipation
in the clock network, (ii) power dissipation in the clock buffers, and (iii) power dissipation
in the flip-flops. The total power dissipation of the clock network depends on both the
clock frequency and the data rate, and can be computed based on Eq. (2.5):
$$P_{CK} = V_{dd}^{2}\left[f_{CK}\left(C_{CK} + C_{ff,CK}\right) + f_{D}\,C_{ff,D}\right] \tag{3.1}$$
where
$f_{CK}$ is the clock frequency;
$f_{D}$ is the average data rate;
$C_{CK}$ is the total capacitance seen by the clock network;
$C_{ff,CK}$ is the capacitance of the clock path seen by the flip-flop;
$C_{ff,D}$ is the capacitance of the data path seen by the flip-flop.
From Eq. (3.1), the clock power can be reduced by reducing any of the parameters on the right-hand side of the equation. Reducing $V_{dd}$ is already the trend of contemporary design, and it has the strongest impact on $P_{CK}$ because the expression depends quadratically on $V_{dd}$ but only linearly on the other parameters. Reducing the overall capacitance of the clock network, $C_{CK}$, also lowers the power dissipation; for instance, this capacitance can be reduced by proper design of the clock drivers and buffers. Similarly, reducing the capacitances inside a flip-flop, $C_{ff,CK}$ and $C_{ff,D}$, lowers the power as well.
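To illustrate why $V_{dd}$ has the strongest impact, the sketch below evaluates Eq. (3.1) directly. The operating point (capacitances and data rate) is hypothetical and chosen only for illustration; it is not taken from the simulations in this study.

```python
def clock_power(vdd, f_ck, c_ck, c_ff_ck, f_d, c_ff_d):
    """Clock-related power per Eq. (3.1): Vdd^2 * [f_CK*(C_CK + C_ff,CK) + f_D*C_ff,D]."""
    return vdd ** 2 * (f_ck * (c_ck + c_ff_ck) + f_d * c_ff_d)

# Hypothetical operating point (illustrative values only, SI units)
base = dict(vdd=1.8, f_ck=500e6, c_ck=1e-12, c_ff_ck=10e-15, f_d=50e6, c_ff_d=5e-15)

p0 = clock_power(**base)
p_half_vdd = clock_power(**{**base, "vdd": 0.9})      # halve the supply voltage
p_half_cck = clock_power(**{**base, "c_ck": 0.5e-12})  # halve the clock network capacitance

print(f"baseline P_CK      : {p0 * 1e6:.1f} uW")
print(f"Vdd halved         : {p_half_vdd / p0:.2f} x baseline")
print(f"C_CK halved        : {p_half_cck / p0:.2f} x baseline")
```

Halving the supply quarters the clock power, whereas halving the clock network capacitance cuts it by slightly less than half, because the flip-flop terms remain.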
Furthermore, the clock power dissipation is linearly proportional to the clock frequency. Although the clock frequency is determined by the system specifications, it can be reduced with the use of dual edge triggered flip-flops (DETFFs). As its name implies, a DETFF responds to both rising and falling clock edges. Hence, it can halve the clock frequency while keeping the same data throughput. As a result, the power consumption of the clock distribution network is reduced, making DETFFs desirable for low power applications. Even for high performance applications, DETFFs offer certain benefits: since the clock frequency is reduced by a factor of two, a relatively high speed clock signal need not be distributed.
A classic dual edge triggered flip-flop can be implemented as in Figure 3.1. In this configuration, two level-sensitive latches of opposite polarity are connected in parallel, and their outputs are multiplexed at the output stage [8].
Figure 3.1: Classic implementation of DETFF (latch outputs Y and Z multiplexed onto Q by CK)
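The behaviour of the classic DETFF can be sketched at the logic level. The model below is a simplified sketch, assuming ideal latches and a time axis discretized into half clock periods; which of Y and Z in Figure 3.1 corresponds to `latch_hi` or `latch_lo` is not recoverable from the figure, so those names are hypothetical. The key idea is that the multiplexer always routes the latch that is currently holding (opaque) to Q.

```python
from typing import List

def classic_detff(d_half_cycles: List[int], q_init: int = 0) -> List[int]:
    """Logic-level sketch of a classic DETFF (two parallel latches + output mux).

    In half-cycle k the data input holds d_half_cycles[k] and CK = k % 2.
    latch_hi is transparent while CK = 1, latch_lo while CK = 0; the mux
    selects the opaque latch, so Q only changes at clock edges.
    """
    latch_hi = q_init  # hypothetical name: latch transparent when CK is high
    latch_lo = q_init  # hypothetical name: latch transparent when CK is low
    q_trace = []
    for k, d in enumerate(d_half_cycles):
        if k % 2 == 1:          # CK high
            latch_hi = d        # tracks D
            q = latch_lo        # holds the value captured at the rising edge
        else:                   # CK low
            latch_lo = d
            q = latch_hi        # holds the value captured at the falling edge
        q_trace.append(q)
    return q_trace

d = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]
q = classic_detff(d)
print(q)
```

Q reproduces D delayed by one half clock period, i.e. a new data value is sampled at every clock edge, which is why the clock frequency can be halved for the same throughput.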
If the clock load of the DETFF is not significantly larger than that of the traditional single edge triggered flip-flop (SETFF), the power in the clock distribution network is reduced by as much as a factor of two. Because clock distribution power is a large fraction of the total power of a synchronous VLSI system, significant overall power savings are possible [7].
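The "as much as a factor of two" condition can be made concrete with a one-line ratio. This is a sketch assuming equal Vdd and equal data throughput, so that the DETFF clock runs at half the SETFF frequency; the clock-load ratios used below are hypothetical.

```python
def clock_distribution_ratio(c_clk_det: float, c_clk_set: float) -> float:
    """P_clk,DET / P_clk,SET at equal throughput and equal Vdd.

    Clock power scales as f_CK * C_clk * Vdd^2; the DETFF needs f_CK / 2,
    so the ratio reduces to (C_DET / 2) / C_SET.
    """
    return (0.5 * c_clk_det) / c_clk_set

for load_ratio in (1.0, 1.2, 1.5, 2.0):  # hypothetical C_DET / C_SET values
    r = clock_distribution_ratio(load_ratio, 1.0)
    print(f"C_DET/C_SET = {load_ratio:.1f} -> P_DET/P_SET = {r:.2f}")
```

The saving vanishes once the DETFF clock load reaches twice the SETFF clock load.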
3.2.1 DETFF implementations
A few previously reported DETFFs along with a newly proposed DETFF are analyzed in
this study for their performance and applicability in low power and low voltage applications.
DETgago
The flip-flop DETgago, proposed in [3], is illustrated in Figure 3.2. Nodes N2, N3, N4, and N5 represent parallel connections between the input buffers and the latches. The appropriate clock phase and its complement connect the input buffers and storage elements to, and disconnect them from, the power supply and ground. When CK is high, the top input buffer and the bottom latch are active while the bottom input buffer and the top latch are inactive, and vice versa. As a result, the design has potential for low power applications. However, although the complete isolation of the active and inactive parts of the circuit helps save power, it leads to a larger delay.
Figure 3.2: DETFF proposed in [3], DETgago
DETllopis
Figure 3.3 shows the circuit implementation of DETllopis, proposed in [4], which is a modified version of the DETFF proposed earlier in [22]. Complementary transmission and logic gates are employed to balance the output rise and fall times of the original DETFF. This modification improves power and latency at the expense of an increased total transistor count.
Figure 3.3: DETFF proposed in [4], DETllopis
DETpedram
Pedram et al. proposed the DETFF shown in Figure 3.4, DETpedram [5]. In this DETFF, the roles of the clock enable signal and the input data signal are reversed in the feedback transmission gate loops of the storage latches. This implementation reduces the transistor count at the expense of increased latency. Consider an operation at a falling clock edge. CK is initially high, and the upper latch is active. If D is 1, the input transmission gates pass a 1 to N1, and N2 becomes 0. N2 then switches on P-passgate M6, which passes a 0 (the complement of CK) onto N1. This creates contention at node N1 and hence increases the delay. However, when CK switches low, the input transmission gates close and M6 now passes a 1 (again the complement of CK) onto N1, reinforcing the stored value. The case D = 0 can be analyzed similarly.
Figure 3.4: DETFF proposed in [5], DETpedram
DETstrollo
A DETFF proposed by Strollo et al. in [6], DETstrollo, is illustrated in Figure 3.5. It is a pulse-based, single-latch DETFF whose operation relies on pulse triggering created by its internal clock buffers. The series-connected input passgates act as an AND operation that provides a short transparent pulse. Since the input passgates are N-type, PMOS transistors are used to restore the value at N1 to full swing. The weakly-on PMOS (the PMOS whose gate is tied to ground) helps minimize the parasitic capacitance. The transparent pulse width is crucial in this design; hence, the proper operation of this DETFF depends strongly on the sizing and propagation delay of the internal clock buffers.
Proposed DETFF
The proposed DETFF, DETproposed, is illustrated in Figure 3.6. It consists of two sets of back-to-back inverters as storage elements. A true and complement combination of input data and clock signals controls the latching of the data value in these storage elements. When CK is high, node N7 is pre-discharged to 0. If D is 1, N1 is pulled down to 0; otherwise, if D is 0, N2 is 0 and N1 becomes 1. The main advantage of this configuration is that it avoids stacking PMOS transistors. As a consequence, low voltage and low power operation becomes feasible.
Figure 3.5: DETFF proposed in [6], DETstrollo
Figure 3.6: proposed DETFF, DETproposed
3.3 Analysis of Dual-Edge Triggered Flip-flops
Several metrics are available for the comparative analysis of digital circuits. For example, power consumption, delay and latency, energy or power-delay product (PDP), energy-delay product (EDP), and energy-delay-squared product (ED²P) have been reported by several researchers [23, 24]. In general, a PDP-based metric is appropriate for low power portable systems, in which battery life is the primary index of energy efficiency. In contrast, EDP and ED²P weight delay more heavily and suit high performance systems [23].
In this study, we are primarily interested in DETFF usage for low power, low voltage applications. Therefore, PDP is selected as the primary figure of merit. However, scaling Vdd down reduces the energy consumption but increases the delay, so energy alone is not a sufficient metric for low voltage applications. The energy-delay product accounts for both energy and delay and is therefore used as well.
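The three energy metrics differ only in the power of the delay term. The sketch below computes them from a power and a delay value and, as a consistency check, recomputes the PDPCQ column of Table 3.2 from its total power and tCQ columns; the helper name `energy_metrics` is ours, not part of the methodology of [23, 25].

```python
def energy_metrics(power_w: float, delay_s: float) -> dict:
    """PDP (energy), EDP = E*D and ED2P = E*D^2 for one power/delay pair."""
    pdp = power_w * delay_s
    return {"PDP": pdp, "EDP": pdp * delay_s, "ED2P": pdp * delay_s ** 2}

# (total power in uW, t_CQ in ps, reported PDP_CQ in fJ), from Table 3.2
table_3_2 = {
    "DETpedram": (324.9, 233.1, 75.7),
    "DETllopis": (175.0, 237.5, 41.6),
    "DETgago": (166.2, 202.2, 33.6),
    "DETstrollo": (237.8, 214.4, 51.0),
    "DETproposed": (218.4, 161.3, 35.2),
}

computed = {}
for cell, (p_uw, t_ps, pdp_fj) in table_3_2.items():
    m = energy_metrics(p_uw * 1e-6, t_ps * 1e-12)
    computed[cell] = m["PDP"] * 1e15          # joules -> fJ
    print(f"{cell:12s} PDP_CQ = {computed[cell]:5.1f} fJ (table: {pdp_fj} fJ)")

m = energy_metrics(166.2e-6, 202.2e-12)       # DETgago, CQ path
```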
In particular, this analysis is similar to the comparative technique described by Sto-
janovic et al. [25]. Their study establishes a set of guidelines for objective comparisons of
single edge triggered (SET) latches and flip-flops. The details of power and delay param-
eters employed in this study are defined in the following two subsections:
3.3.1 Power Consumption of a DETFF
There are three main components of power dissipation of a flip-flop:
(a) Internal power dissipation of the flip-flop represents the power consumed by the internal
and input nodes during latching operations, including the power dissipated driving the
output load.
(b) Local clock power dissipation represents the portion of the power dissipated in the
clock buffer that is driving the clock input of the flip-flop.
(c) Local data power dissipation represents the portion of the power dissipated in the logic
gate that is driving the data input of the flip-flop.
The clock power dissipation is determined solely by the clock load of the flip-flop,
whereas the distribution of the internal and data power dissipation is affected by the
structure and operation of the latching element itself as well as the input switching ac-
tivity [20]. The sum of these three components is referred to as the total power (PTOT ).
All three components of power require independent estimation in any comparative analysis because, inherently, a tradeoff exists among them. A comparison that does not take all three components into account may yield misleading results.
3.3.2 Timing Characterization of a DETFF
There are two delay parameters of interest in this study. The first delay is the time
measured between the clock edge and the output edge, or tCQ. The second delay is the
time measured between the input data edge and the output edge, or tDQ. The latter
parameter is often referred to as the latency of a flip-flop. It is composed of tCQ and
tDC (the data setup time). Since there are two parallel paths in a classic DETFF, two
characteristics are obtained. One corresponds to the rising edge of the clock and the other
one corresponds to the falling edge of the clock. These two characteristics are independent
of each other and generally are not the same. Hence, the latency of a DETFF is defined
as:
$$t_{d1} = t_{CQ,LH} + t_{DC,HL}$$
$$t_{d2} = t_{CQ,HL} + t_{DC,LH}$$
$$t_{DQ} = \max\left(t_{d1},\, t_{d2}\right)$$
where $t_{CQ,LH}$ and $t_{CQ,HL}$ are the clock-to-output times at the rising and falling clock edges, respectively, and $t_{DC,HL}$ and $t_{DC,LH}$ are the setup times required at the falling and rising clock edges, respectively [14]. Thus, the latency of a DETFF is computed indirectly as the maximum $t_{DQ}$ over rising and falling data transitions at both rising and falling clock edges.
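The latency definition above reduces to a small function. The characterization values used below are hypothetical and serve only to show that the larger of the two edge-dependent sums determines the DETFF latency.

```python
def detff_latency(t_cq_lh: float, t_cq_hl: float,
                  t_dc_hl: float, t_dc_lh: float) -> float:
    """t_DQ = max(t_d1, t_d2) with t_d1 = t_CQ,LH + t_DC,HL and t_d2 = t_CQ,HL + t_DC,LH."""
    td1 = t_cq_lh + t_dc_hl
    td2 = t_cq_hl + t_dc_lh
    return max(td1, td2)

# Hypothetical characterization values in ps (illustrative only)
print(detff_latency(t_cq_lh=160.0, t_cq_hl=175.0, t_dc_hl=70.0, t_dc_lh=60.0))
```

Here t_d1 = 230 ps and t_d2 = 235 ps, so the falling-edge path sets the latency.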
Latency is significant in synchronous systems because the system cycle time depends on the longest delay of the network [16]. However, tCQ is equally important for this comparison, since the setup time is often also a function of the independent variable of the simulations. This is the case in the optimization process, where changes in transistor widths affect the setup time, and in the supply voltage analysis, where the voltage is the independent variable.
For completeness, the setup and hold times, the maximum data rate, and the total transistor width are included as additional flip-flop performance metrics. Total transistor width is used as a measure of flip-flop area, since the physical layout is not available at this point.
3.3.3 Simulation
A tradeoff between speed and power consumption is often possible, and it is normally
determined by the application. Hence, a given flip-flop can either be optimized for high
performance or low power. However, when both power dissipation and performance are
critical, one needs to determine a design that operates at the optimum. At this optimum
operating point, the power-delay product is minimized, i.e., energy utilization is optimal for a given clock frequency. However, since the optimal delay and power parameters cannot be obtained in a single step, the energy optimization procedure is often iterative [25].
Testbench
For this study, 0.18 µm CMOS technology is used. Apart from the supply voltage analysis, all simulations are carried out at nominal conditions: Vdd = 1.8 V and room temperature (25 °C). The clock frequency is kept at 500 MHz; because a DETFF samples data on both clock edges, this is equivalent in data throughput to 1 GHz for SETFFs. Details of the simulation parameters are summarized in Table 3.1.
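Table 3.1: Simulation parameters
Technology: 0.18 µm CMOS; MOSFET model: BSIM3 Level 49; nominal conditions: Vdd = 1.8 V, T = 25 °C
| Signal | Frequency | Rise time | Fall time | Duty cycle | Sequence length |
|--------|-----------|-----------|-----------|------------|-----------------|
| Clock  | 500 MHz   | 100 ps    | 100 ps    | 50%        | n/a             |
| Data   | n/a       | 100 ps    | 100 ps    | n/a        | 16 clock cycles |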
The testbench for this study is illustrated in Figure 3.7. Additionally, input buffers are
used to provide realistic clock and data signals. A fanout of five inverters (approximately
32 fF in 0.18 µm technology) is used as the nominal load for each DETFF. These inverters,
in turn, drive a capacitive load CL of 25 fF each, to simulate the loading from the previous
logic stages, as well as the following stages. All the measurements are taken over a 16-cycle
data sequence of alternating 1’s and 0’s. As mentioned before, the total power dissipation
is composed of three components. They are represented and calculated in the testbench
as follows:
(a) Local data power represents the portion of power dissipated in the grey inverter driving
the data input of the flip-flop.
(b) Local clock power represents the portion of power dissipated in the black inverter,
which drives the clock input of the flip-flop.
(c) Internal power consumption is the intrinsic power dissipated on switching the internal
nodes of the flip-flop.
Figure 3.7: The simulation testbench for flip-flops (buffered Clock and Data inputs drive CK and D; Q drives a fanout of inverters, each loaded by CL)
To compute the local data power and the local clock power, the flip-flop under test is initially disconnected, and the power dissipated by the grey inverter and by the black inverter is recorded. The flip-flop is then connected to the testbench for performance analysis, and the power consumed by the grey and black inverters is recorded again. The local data power is then the difference between the two power values of the grey inverter. Likewise, the local clock power is the difference between the two power values of the black inverter.
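The subtraction procedure for the local data and clock power can be sketched as follows. The driver power readings are hypothetical numbers, chosen so that the differences reproduce the DETgago row of Table 3.2; how the internal power is extracted is not modelled here, so it is simply taken as a given value (an assumption).

```python
def local_power(p_connected_uw: float, p_disconnected_uw: float) -> float:
    """Power attributed to a flip-flop input: driver power with the FF attached minus without."""
    return p_connected_uw - p_disconnected_uw

# Hypothetical driver power readings in uW (illustrative only)
grey_alone, grey_with_ff = 28.4, 40.0     # grey inverter: drives the data input
black_alone, black_with_ff = 71.8, 95.0   # black inverter: drives the clock input
internal = 131.4                          # assumed given: internal FF power in uW

data_power = local_power(grey_with_ff, grey_alone)
clock_power = local_power(black_with_ff, black_alone)
total_power = clock_power + data_power + internal   # P_TOT = clock + data + internal

print(f"data {data_power:.1f} uW, clock {clock_power:.1f} uW, total {total_power:.1f} uW")
```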
Size Optimization of DETFF
Due to the interdependence of transistor sizes, the flip-flops are sized using a line optimization algorithm. Starting from an initial guess in which all devices are minimum sized, the inverter driving the output Q is optimized first; the optimization then works backward from the Q output to the D input. This sequence of one-dimensional optimizations is iterated until the power-delay product stops decreasing [26].
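The iteration order of the line optimization can be illustrated with a coordinate-descent sketch. The PDP model `toy_pdp` is a hypothetical stand-in for circuit simulation (a chain of stages whose delay falls with width while switched capacitance grows with width); only the search structure, minimum-sized start, output-to-input sweep, and stopping when PDP no longer decreases, follows the text.

```python
from typing import Callable, List, Sequence

def toy_pdp(widths: Sequence[float], c_load: float = 20.0) -> float:
    """Hypothetical PDP model for a chain of stages ordered input -> output."""
    delay = 0.0
    for i, w in enumerate(widths):
        next_cap = widths[i + 1] if i + 1 < len(widths) else c_load
        delay += 1.0 + next_cap / w      # intrinsic + load-dependent delay
    power = sum(widths) + c_load         # switched capacitance ~ total width + load
    return power * delay

def line_optimize(pdp: Callable[[Sequence[float]], float], n_stages: int,
                  candidates: Sequence[float], w_min: float = 1.0,
                  max_iter: int = 50) -> List[float]:
    widths = [w_min] * n_stages          # initial guess: all devices minimum sized
    best = pdp(widths)
    for _ in range(max_iter):
        improved = False
        for i in reversed(range(n_stages)):  # from the Q output back to the D input
            for w in candidates:             # 1-D search over this stage only
                trial = widths.copy()
                trial[i] = w
                cost = pdp(trial)
                if cost < best:
                    best, widths, improved = cost, trial, True
        if not improved:                 # PDP stopped decreasing
            break
    return widths

w_opt = line_optimize(toy_pdp, n_stages=3, candidates=[1.0, 2.0, 3.0, 4.0, 6.0, 8.0])
print(w_opt, toy_pdp(w_opt))
```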
During the size optimization, a data transition probability, α, of 0.5 is assumed. The
critical path is first identified. The width of the NMOS transistor, wn, is then selected
as the parameter of interest. The sizing of the PMOS transistors that are located on the
critical path is kept at a certain ratio with respect to wn. This ratio is determined by
balancing the rising and falling edges of the output waveform of a test inverter. Note that
this ratio changes with NMOS sizing. Moreover, transmission gates and transistors that
are not located on the critical path are implemented with relatively smaller sizes.
Delay and power are measured as functions of wn. The measured power is the sum of all three components discussed earlier, whereas the delay is expressed by tCQ. Once the power and delay measurements are obtained, PDPCQ is calculated as their product and plotted as a function of tCQ. The initial PDPCQ point is taken as the minimum point of the PDPCQ versus tCQ curve. If no minimum point exists, the operating point with the minimum tCQ for a given energy is selected as the initial PDPCQ point. Once the initial PDPCQ point is determined for each flip-flop, the flip-flops are further optimized using the iterative line optimization method until the best PDPCQ and PDPDQ are found.
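Selecting the initial point amounts to finding an interior minimum of PDPCQ along the wn sweep. The sketch below does only that; when PDPCQ is monotonic over the sweep it returns `None`, and the fallback rule described above (minimum tCQ for a given energy) must be applied separately. The sweep data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

def initial_pdp_point(wn: Sequence[float], p_tot: Sequence[float],
                      t_cq: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (wn, PDP_CQ) at the interior minimum of PDP_CQ = P_TOT * t_CQ, or None."""
    pdp = [p * t for p, t in zip(p_tot, t_cq)]
    k = min(range(len(pdp)), key=pdp.__getitem__)
    if 0 < k < len(pdp) - 1:              # a true minimum inside the sweep
        return wn[k], pdp[k]
    return None                           # monotonic: apply the fallback rule instead

# Hypothetical sweep: widths in um, power in uW, delay in ps (uW * ps = 1e-18 J)
wn_um = [0.5, 1.0, 2.0, 4.0, 8.0]
p_uw = [100.0, 120.0, 150.0, 200.0, 290.0]
t_ps = [400.0, 280.0, 220.0, 200.0, 195.0]
print(initial_pdp_point(wn_um, p_uw, t_ps))
```

With these numbers the minimum lies at wn = 2.0 µm, where PDPCQ = 33000 × 1e-18 J = 33 fJ.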
Dependency of Data Transition Probability/Activity α
Generally in a VLSI circuit, each flip-flop could have input data with different transition
probability, α. As a consequence, it is interesting to observe the behaviour of flip-flop
power-delay product as a function of α [26].
Furthermore, the power saving of using DETFF is strongly dependent on α. Recall
that power dissipation of a CMOS circuit is
$$P_{D} = \frac{1}{2} f_{D} V_{dd}^{2} \sum_{j} \alpha_{j} C_{j} \tag{3.2}$$
And the power dissipation due to clock nodes’ switching is
$$P_{CK} = f_{CK} V_{dd}^{2} C_{CK} \tag{3.3}$$
In a SET-based system, $f_{D,SET} = f_{CK,SET}$, as there is at most one signal change per clock period in the absence of glitching. In a DET-based system, by contrast, $f_{D,DET} = 2 f_{CK,DET}$, as there are at most two signal changes per clock period. For a fixed data throughput, $f_{D,SET} = f_{D,DET} = f$. Hence,
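$$P_{SET} = f V_{dd}^{2} C_{CK,SET} + \frac{1}{2} f V_{dd}^{2} \sum_{j} \alpha_{j,SET} C_{j,SET} \tag{3.4}$$
$$P_{DET} = \frac{1}{2} f V_{dd}^{2} C_{CK,DET} + \frac{1}{2} f V_{dd}^{2} \sum_{j} \alpha_{j,DET} C_{j,DET} \tag{3.5}$$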
In addition, it is interesting to note that the internal-node transition probability of a classic DETFF is the same as that of a SETFF. Consider the classic DETFF illustrated in Figure 3.1 and a master-slave SETFF. As shown in Figure 3.8, the transition probability of the DETFF's internal nodes is equal to that of the D input; hence, $\alpha_Y = \alpha_Z = \alpha_Q = \alpha_D$. Similarly, for the master-slave SETFF, $\alpha_X = \alpha_Q = \alpha_D$.
Figure 3.8: DETFF internal node transitions given input sequence of 1010101000
Therefore, (3.4) and (3.5) can be rewritten as:
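$$P_{SET} = f V_{dd}^{2} C_{CK,SET} + \frac{1}{2} f V_{dd}^{2} \alpha_{D} C_{\alpha,SET} \tag{3.6}$$
$$P_{DET} = \frac{1}{2} f V_{dd}^{2} C_{CK,DET} + \frac{1}{2} f V_{dd}^{2} \alpha_{D} C_{\alpha,DET} \tag{3.7}$$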
where $C_{\alpha,SET} = C_{D,SET} + C_{X,SET} + C_{Q,SET}$ and $C_{\alpha,DET} = C_{D,DET} + C_{Y,DET} + C_{Z,DET} + C_{Q,DET}$. The power saving is expressed by the ratio $\eta$ between the DETFF and SETFF power dissipation:
$$\eta = \frac{P_{DET}}{P_{SET}} = \frac{C_{CK,DET} + \alpha_{D} C_{\alpha,DET}}{2 C_{CK,SET} + \alpha_{D} C_{\alpha,SET}} \tag{3.8}$$
As Eq. (3.8) shows, if α is low, the reduced clock frequency of a DETFF may result in significant power savings; for example, with equal clock capacitances and $\alpha_D = 0$, η = 1/2. For a large number of low power applications, the transition activity of the input data is indeed approximately one-tenth of the clock signal activity [7]. For high input activity, the $\alpha_D C_\alpha$ terms dominate. Moreover, the total switched capacitance on the clock line is actually larger in DETFF than in SETFF structures. This requires larger buffers in the clock tree and hence increases the overall power dissipated by the clock. Therefore, the energy saving from using DETFFs is due to the halved clock frequency and not to the value of the clock capacitance [26].
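The dependence of the saving on data activity can be seen by evaluating Eq. (3.8) over a range of α_D. The capacitances below are hypothetical: the DETFF is simply assumed to present 30% more clock and data-dependent capacitance than the SETFF, reflecting the larger DETFF clock load discussed above.

```python
def eta(alpha_d: float, c_ck_det: float, c_alpha_det: float,
        c_ck_set: float, c_alpha_set: float) -> float:
    """Power ratio P_DET / P_SET from Eq. (3.8)."""
    return (c_ck_det + alpha_d * c_alpha_det) / (2 * c_ck_set + alpha_d * c_alpha_set)

# Hypothetical capacitances in fF (illustrative, not extracted from the simulations)
C_CK_SET, C_A_SET = 10.0, 20.0
C_CK_DET, C_A_DET = 13.0, 26.0

alphas = (0.0, 0.1, 0.5, 1.0)
etas = [eta(a, C_CK_DET, C_A_DET, C_CK_SET, C_A_SET) for a in alphas]
for a, e in zip(alphas, etas):
    print(f"alpha_D = {a:.1f}: eta = {e:.3f}")
```

Under these assumptions η rises from 0.65 at α_D = 0 to 0.975 at α_D = 1, so the DETFF advantage shrinks as the data activity grows.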
Once the DETFFs are optimized, they are simulated at different data activity rates: 0 (all 0's and all 1's), 0.5, and 1. This determines the efficiency and performance of each DETFF over a wide range of data activities. As discussed above, the total power consumption of a DETFF consists of three separate components. Owing to the diverse design styles, these components can vary from flip-flop to flip-flop, and as a result, the total power consumption of a flip-flop may change with the data activity. It is therefore desirable to simulate the various DETFFs at different data activities. It is, however, worth noting that this behaviour is independent of the α value assumed in the optimization procedure.
Influence of Supply Voltage
The nominal power supply voltage for 0.18 µm technology is 1.8 V. However, in battery-operated systems, the supply voltage is reduced drastically to lower the power consumption. An efficient low voltage flip-flop should also show a smaller increase in delay as the supply voltage is reduced. Therefore, the delay, power, and energy of all the DETFFs are computed as functions of supply voltage. Again, since the setup time increases as the supply voltage is reduced, the simulations use relaxed setup time conditions to provide results over a wide range. Hence, tCQ and PDPCQ are reported in this analysis for precise results.
3.4 Results
All five DETFFs under study have been optimized as described in Section 3.3.3. It is
found that the delay decreases as the width increases until the minimum point is reached,
if such a point exists. At this point, any further increase in the width does not result in
any further appreciable decrease in the delay. On the contrary, owing to the increased
parasitics associated with the increased width, the delay may increase. On the other hand,
for all the DETFFs, PTOT increases monotonically as the width increases. PDPCQ is
then determined by multiplying PTOT by tCQ for the corresponding width. Furthermore,
by combining the tCQ and the PDPCQ curves, we can plot PDPCQ versus tCQ, which is
illustrated in Figure 3.9. These curves represent the first step of the optimization process.
The slopes of the PDPCQ curves in Figure 3.9 indicate the sensitivity of the flip-flops to delay as the width varies. When tCQ is small, PDPCQ is large, since the total power dominates the product at larger widths. As the width decreases, the power consumption decreases while the delay increases. This trade remains favourable until the local minimum is reached. Beyond this point, the weakened driver strength makes the delay grow faster than the power falls, so PDPCQ increases again. Figure 3.9 also depicts the spread of DETFF performance in terms of PDPCQ and delay. As shown, the performance of the DETFFs studied is comparable: PDPCQ ranges from 30 fJ to 75 fJ and delay ranges from 200 ps to 300 ps.
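Table 3.2: Optimal parameters for DETFFs studied [1]
| Cell        | Clock power (µW) | Data power (µW) | Internal power (µW) | Total power (µW) | tCQ (ps) | PDPCQ (fJ) | tDQ (ps) | PDPDQ (fJ) |
|-------------|------------------|-----------------|---------------------|------------------|----------|------------|----------|------------|
| DETpedram   | 17.6             | 65.6            | 241.7               | 324.9            | 233.1    | 75.7       | 245.3    | 79.7       |
| DETllopis   | 17.0             | 4.6             | 153.4               | 175.0            | 237.5    | 41.6       | 312.3    | 54.7       |
| DETgago     | 23.2             | 11.6            | 131.4               | 166.2            | 202.2    | 33.6       | 262.2    | 43.6       |
| DETstrollo  | 30.0             | 13.4            | 194.5               | 237.8            | 214.4    | 51.0       | 235.3    | 56.0       |
| DETproposed | 18.1             | 10.9            | 189.4               | 218.4            | 161.3    | 35.2       | 230.5    | 50.3       |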
Figure 3.9: PDPCQ vs tCQ under relaxed setup time condition (PDPCQ from 0 to 300 fJ versus tCQ from 100 ps to 300 ps, one curve each for DETpedram, DETllopis, DETgago, DETstrollo, and DETproposed), used to determine the initial optimization point [1]
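Table 3.3: Performance Characteristics for DETFFs studied [1]
| Cell        | Setup (ps) | Hold (ps) | Max. data rate (GHz) | Total width (µm) |
|-------------|------------|-----------|----------------------|------------------|
| DETpedram   | 17.9       | 34.0      | 1.75                 | 23.0             |
| DETllopis   | 80.3       | -15.7     | 2.22                 | 37.7             |
| DETgago     | 49.5       | -5.7      | 2.63                 | 44.6             |
| DETstrollo  | -41.4      | 85.9      | 2.22                 | 40.5             |
| DETproposed | 76.9       | -5.1      | 1.56                 | 56.1             |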
The initial optimization points are then extracted from Figure 3.9, and an iterative process is used to complete the optimization. The goal of the optimization is to minimize the energy consumption, PDPDQ. The different DETFFs are compared in terms of power, delay, and energy. The final optimal parameters are summarized in Table 3.2: the first column lists the DETFFs, the next four columns give the three components of power dissipation and the total power consumption, and the remaining columns report the delay and energy consumption for the CQ and DQ paths, respectively. Table 3.3 lists other performance characteristics, such as setup and hold times, maximum data rate, and total transistor width. As shown in the tables, DETpedram consumes the most power, due to its excessively large internal and data power dissipation. This also leads to the highest energy consumption. However, it has the smallest total transistor width. DETllopis has the largest delay, yet the smallest clock and data power consumption. DETgago consumes the least internal and total power, and hence the least energy. DETstrollo consumes the most clock power, yet this does not affect its overall performance compared to the other DETFFs studied. DETproposed has the smallest delay, but it requires the largest total width.
After the DETFFs are optimized, they are simulated at different data activity rates. The results are shown in Figure 3.10. In general, applications with α = 1 exhibit the largest total power consumption. Clock power dissipation is rather constant over all data activity rates, while data and internal power consumption increase as the data activity increases. One exception is DETpedram. When the data sequence consists of all zeros, its internal power is remarkably large. For the case of all ones, on the other hand, its internal power is especially small, whereas its data power is notably larger. However, its data power at α = 0.5 and α = 1 is almost the same. Furthermore, DETpedram shows the worst power consumption at all data activities except α = 1. DETgago is the best in terms of power dissipation at all data activities. The total power consumption of DETllopis is very close to that of DETgago at all data activities. DETproposed has power consumption similar to that of DETgago, except in the case of α = 1, in which it exhibits substantially larger internal power dissipation.
Figure 3.10: Power consumption dependence on data transition activity α [1] (power distribution in µW among clock, data, and internal power for each DETFF at α = 0 (all 0's), α = 0 (all 1's), α = 0.5, and α = 1)
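Table 3.4: Summary of DETFF performance as Vdd reduces [1]
| Cell        | tCQ (ps), 0.9 V | PTOT (µW), 0.9 V | PDPCQ (fJ), 0.9 V | tCQ (ps), 1.3 V | PTOT (µW), 1.3 V | PDPCQ (fJ), 1.3 V | tCQ (ps), 1.6 V | PTOT (µW), 1.6 V | PDPCQ (fJ), 1.6 V |
|-------------|-----------------|------------------|-------------------|-----------------|------------------|-------------------|-----------------|------------------|-------------------|
| DETpedram   | 734.1           | 77.3             | 56.7              | 329.5           | 172.2            | 56.7              | 244.2           | 257.9            | 63.0              |
| DETllopis   | 762.8           | 75.4             | 57.5              | 350.7           | 117.2            | 41.1              | 264.8           | 152.4            | 40.4              |
| DETgago     | 721.2           | 37.0             | 26.7              | 335.3           | 89.1             | 29.9              | 253.3           | 143.1            | 36.2              |
| DETstrollo  | failed          | failed           | failed            | 932.4           | 118.4            | 110.4             | 262.2           | 183.2            | 48.0              |
| DETproposed | 445.6           | 51.2             | 22.8              | 233.7           | 111.7            | 26.1              | 180.0           | 174.8            | 31.5              |
† CQ-delay and PDP as a function of supply voltage with relaxed setup time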
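One simple way to quantify the "rate of incremental delay" mentioned in Section 3.3.3 is the average tCQ slope between the two supply voltages in Table 3.4. The two-point slope and the choice of 0.9 V and 1.6 V as endpoints are our own simplification, not a metric defined in this study; DETstrollo is omitted because it fails at 0.9 V.

```python
# t_CQ in ps at (0.9 V, 1.6 V), from Table 3.4
t_cq_ps = {
    "DETpedram": (734.1, 244.2),
    "DETllopis": (762.8, 264.8),
    "DETgago": (721.2, 253.3),
    "DETproposed": (445.6, 180.0),
}
V_LOW, V_HIGH = 0.9, 1.6

slopes = {}
for cell, (t_low, t_high) in t_cq_ps.items():
    # average extra delay per volt of supply reduction
    slopes[cell] = (t_low - t_high) / (V_HIGH - V_LOW)
    print(f"{cell:12s} {slopes[cell]:6.1f} ps/V  (t_CQ grows {t_low / t_high:.2f}x)")
```

By this measure DETproposed degrades most gracefully (about 381 ps/V), while DETllopis degrades fastest (about 711 ps/V), in line with the discussion that follows.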
The performance of the DETFFs under reduced voltage conditions is depicted in Figures 3.11, 3.12, and 3.13. Figure 3.11 plots the total power consumption of the DETFFs as a function of supply voltage. DETgago exhibits the lowest power consumption, and DETproposed shows the second lowest power consumption at low supply voltage. DETllopis has the second best power dissipation near the nominal supply voltage; however, by the time the supply voltage drops to 1.4 V, its power starts to exceed that of DETproposed. The worst power consumption is exhibited by DETpedram. The power consumption curve of DETstrollo is somewhat misleading, since DETstrollo fails to function below 1.3 V. Figure 3.12 depicts the tCQ of the DETFFs as a function of supply voltage. DETproposed exhibits the lowest delay. On the other hand, DETstrollo shows the worst delay and quickly fails to latch below 1.3 V. All the other DETFFs have similar delay at all supply voltages tested. Figure 3.13 plots PDPCQ as a function of supply voltage. The best energy consumption versus supply voltage is obtained with the proposed DETFF, but DETgago is comparable. DETpedram and DETllopis have similar energy dissipation at half of the nominal supply voltage (56.7 fJ and 57.5 fJ at 0.9 V). The results are further summarized in Table 3.4.
3.4.1 Discussion
DETpedram consumes the most data power in this study. The high data and internal power dissipation is found to result from the positive feedback of the transmission gate loop at the input end of the flip-flop. In the feedback path of the latches, the input data controls the passing of the clock signals. For instance, in Fig. 3.4, when D = 0 and CK = 1, M1 turns on. Hence, Node A discharges to 0 and Node B switches to 1. Node B then switches M2 on. As a result, M1 and M2 simultaneously attempt to write 0 and (VDD - Vtn), respectively, onto Node A. This voltage conflict persists until the clock changes state and results in a degraded noise margin. This has two implications. First, the structure allows a large current to flow through the transmission gates at the input. Second, the degraded voltage level at Node A causes a direct path current in the subsequent inverters. Together, these effects produce large data and internal power dissipation. In addition, both the data power and the internal power depend on the data level rather than on the data activity. When D = 0, the NMOS pass gates are active in the input loop, while the PMOS is active in the inverter that follows the loop; the opposite is true for D = 1. In either case, the PMOS transistors draw more current. The all 0's and all 1's cases are extreme examples of this effect. Despite the large data power consumption, its clock power dissipation is small because of the local clock buffers. The absence of local data buffers, however, brings into question the robustness of the flip-flop, since the transparent nature of the pass gates fails to secure unidirectional data flow. Furthermore, its energy consumption at low supply voltage is roughly 2.5 times that of the proposed DETFF (56.7 fJ versus 22.8 fJ at 0.9 V). Hence, the usage of DETpedram in low voltage and low power applications is not recommended.
Figure 3.11: Power consumption dependence on supply voltage [1] (total power PTOT versus VDD from 0.9 V to 1.6 V under relaxed setup time condition)
Figure 3.12: tCQ as a function of supply voltage [1] (tCQ versus VDD from 0.9 V to 1.6 V under relaxed setup time condition)
DETllopis has the best clock and data power dissipation. Its clock power consumption is low because of its small clock capacitance, whereas its data power dissipation is low due to the use of an inverting input buffer. Although it has one of the smallest power consumptions at all data activities, it has the longest delay at nominal voltage, since the data must propagate through more logic stages than in the other DETFF configurations. This leads to comparatively large energy consumption at nominal conditions. As the supply voltage decreases, its total power consumption drops at a much lower rate and its delay rises at a slightly higher rate compared to the other DETFFs studied. Hence, it has higher energy consumption at low voltage. Therefore, its application under low voltage conditions is limited, and its best energy consumption is seen around 1.5 V.
Figure 3.13: PDP dependency as a function of supply voltage [1] (PDPCQ versus VDD from 0.9 V to 1.6 V under relaxed setup time condition)
DETgago is found to be the most energy efficient DETFF under nominal conditions in all circumstances examined in this study. Its superior low power performance is mainly due to the complete isolation of its elements when they are not in use, which demonstrates its suitability for low power applications. Under low supply voltage conditions, although it has the lowest power consumption, its delay is relatively higher than that of the proposed DETFF. This results in slightly higher energy consumption than DETproposed at low supply voltage.
DETstrollo consumes the largest clock power because of its chain of internal clock buffers. The delay through these buffers defines the activation pulse of the flip-flop, and the activation pulse width is crucial to its operation. As the supply voltage is reduced, the activation pulse width varies, causing the delay to increase at a much higher rate. The delay rapidly approaches the clock pulse width, at which point the flip-flop can no longer latch the input data. Therefore, it is not suitable for use in low voltage environments.
DETproposed has superior delay because of its use of NMOS transistors and its avoidance of PMOS transistor stacking. However, its inferior slew rate leads to notably high power consumption at high data rates. As a result, its overall energy consumption under nominal conditions is close to that of DETgago, which has the lowest energy dissipation. At reduced supply voltage, DETproposed has the second-best power consumption and the best delay, and therefore the best energy consumption. Hence, it is promising for low energy, low voltage applications.
The proposed design is an attempt to design a low voltage DETFF. Although DETproposed achieves good performance, this study finds that complete isolation of the deactivated elements, as in DETgago, is key to low power dissipation. Nevertheless, DETproposed has been shown to operate most efficiently at low supply voltage. Hence, DETproposed and DETgago are recommended for further research in low power, low voltage subsystems. In the next chapter, these two DETFFs will be implemented in a digital filter, along with a standard SETFF.
3.5 Other Considerations
3.5.1 Parallel Interconnects and Clock Requirements
Although many DETFFs have been proposed, their use is still uncommon, for several reasons. DETFFs require a more complex implementation and denser interconnect than SET structures. In DETFFs, latches are connected in parallel, which results in more internal nodes and higher node capacitances, including those on the data and clock inputs. These byproducts of the DETFF structure offset some of the benefits of a reduced clock rate [7, 14, 26]. For instance, the setup and hold times of DETFFs are typically larger than those of conventional flip-flops [27], which makes DETFFs less attractive for high performance applications. Furthermore, DETFFs pay a penalty in design area [22, 27]: the larger number of transistors and interconnects makes the footprint of a DETFF much larger than that of a conventional SETFF. Nevertheless, careful floorplanning and layout can minimize wire lengths, which in turn reduces interconnect capacitance and keeps these offsets to a minimum.
In addition, since a DETFF captures data on both clock edges, a 50% duty cycle is required, and the jitter tolerance specification is more stringent. With the trend towards higher clock frequencies, it is increasingly difficult to control the duty cycle and both clock edges in the clock distribution system [26]. Hence, local clock regeneration or a more complex system phase-locked loop (PLL) is necessary.
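To make the duty-cycle requirement concrete, the sketch below computes the two edge-to-edge intervals that a DETFF sees for a clock of period T and duty cycle d. Because data are captured on both edges, the shorter interval, min(d, 1 - d) x T, is the timing window that the flip-flop and the logic it feeds must fit into. The function names and the 1 ns example period are illustrative assumptions, and clock jitter is not modeled.

```python
from typing import Tuple


def det_edge_intervals(period_s: float, duty: float) -> Tuple[float, float]:
    """Return the (high-phase, low-phase) durations of one clock period.

    A DETFF samples on both edges, so each phase acts as a full timing
    window for the logic between flip-flops.
    """
    if not 0.0 < duty < 1.0:
        raise ValueError("duty cycle must lie strictly between 0 and 1")
    return duty * period_s, (1.0 - duty) * period_s


def worst_case_interval(period_s: float, duty: float) -> float:
    """Shortest edge-to-edge interval available to a DETFF-based pipeline stage."""
    return min(det_edge_intervals(period_s, duty))


if __name__ == "__main__":
    period = 1e-9  # hypothetical 1 ns clock period
    for duty in (0.50, 0.45, 0.40):
        window_ps = worst_case_interval(period, duty) * 1e12
        print(f"duty cycle {duty:.0%}: worst-case window {window_ps:.0f} ps")
```

With a 45% duty cycle the worst-case window shrinks from 500 ps to 450 ps, so any duty-cycle error translates directly into lost timing margin for a DETFF, whereas a SETFF timed from one edge only is insensitive to it.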
3.5.2 Design for testability
In a typical VLSI system, millions of transistors are involved. At this scale, testing and debugging would be impossible to accomplish in a timely manner without the aid of automated tools. Controllability and observability are two important attributes in testing and debugging. Controllability is the ability to establish a specific signal value at each node in a circuit by setting values on the circuit's inputs. Observability is the ability to determine the signal value at any node in a circuit by retrieving values on the circuit's outputs. The ability to observe a snapshot of the operation at a particular cycle or cycles, and/or to control the state of the system at a desired cycle, is invaluable for debugging.
Testability is a design characteristic that allows the status of a device to be determined, faults to be isolated quickly and effectively, and tests to be developed cost-effectively. It strongly influences testing time and cost. Design for testability (DFT) techniques are design efforts employed to ensure that a device is testable [28, 29].
Most DFT techniques require circuit modifications and affect factors such as logic complexity, die area, I/O pins, and circuit delay. Increased logic complexity results in increased power consumption and decreased yield. Hence, the extent of DFT used must be carefully balanced against the gain achieved. In brief, DFT is used to reduce test generation costs, enhance the fault coverage of tests, and hence reduce defect levels. When DFT is applied well, it can also reduce test length, tester memory and test application time.
One of the most popular DFT techniques is scanning, with the use of scan registers.
A scan register is a flip-flop or a latch with both shift and parallel load capability. The
storage cells in the register are used as observation points and/or input controls. The
implementation of scan can be achieved at a relatively small cost to latency and die area,
and its benefits far outweigh the added hardware complexity.
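A behavioral sketch may help clarify how a scan register provides controllability and observability. Each cell is modeled as a flip-flop preceded by a 2:1 multiplexer that selects either its functional input (normal mode) or the previous cell's output (scan mode). The class, function and signal names (ScanChain, scan_enable, scan_in) are hypothetical, and the model applies one active clock edge per call, independent of whether the underlying flip-flops are single- or dual-edge triggered.

```python
from typing import List


class ScanChain:
    """Behavioral model of a chain of mux-D scan flip-flops (cell 0 is nearest scan-in)."""

    def __init__(self, length: int) -> None:
        self.state: List[int] = [0] * length

    def clock(self, functional_d: List[int], scan_enable: bool, scan_in: int = 0) -> int:
        """Apply one active clock edge and return the scan-out bit seen before the edge."""
        if len(functional_d) != len(self.state):
            raise ValueError("one functional input per cell is required")
        scan_out = self.state[-1]
        if scan_enable:
            # Scan mode: each mux selects the previous cell, so the chain shifts by one.
            self.state = [scan_in] + self.state[:-1]
        else:
            # Normal mode: each mux selects the functional input (parallel load).
            self.state = list(functional_d)
        return scan_out


def load_chain(chain: ScanChain, pattern: List[int]) -> None:
    """Controllability: shift a pattern in so that chain.state == pattern afterwards."""
    idle = [0] * len(chain.state)
    for bit in reversed(pattern):
        chain.clock(idle, scan_enable=True, scan_in=bit)


def unload_chain(chain: ScanChain) -> List[int]:
    """Observability: shift the captured state out and return it in cell order."""
    idle = [0] * len(chain.state)
    bits = [chain.clock(idle, scan_enable=True) for _ in range(len(chain.state))]
    return bits[::-1]  # the last cell emerges first


if __name__ == "__main__":
    chain = ScanChain(4)
    chain.clock([1, 0, 1, 1], scan_enable=False)  # capture a functional state
    print(unload_chain(chain))                    # [1, 0, 1, 1]
    load_chain(chain, [0, 1, 1, 0])               # set up a test state
    print(chain.state)                            # [0, 1, 1, 0]
```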
The parallel structure of the DETFF also presents new challenges in forming scan chains. Llopis and Sachdev were among the first to address this challenge: they proposed a DETFF, illustrated in Figure 3.14, whose data paths are made unidirectional [4]. However, its area and delay penalties are rather large.
Recent research has explored single-latch alternatives to the classic DETFF configuration. Since there is only one data path, scan is much easier to implement than in the classic DETFF [7]. But how can a single latch capture and store data on both the rising and falling edges of the clock? The focus has shifted to the multiplexing mechanism at the input: a new clock signal is generated from the original system-wide clock, and the input multiplexer thereby ensures that data are captured at each edge of the system clock.
Figure 3.14: DETFF proposed in [4] with unidirectional characteristic
Figures 3.15 and 3.16 show two examples of this new direction in DETFF implementation. In Figure 3.15, an XOR operation is performed on delayed clock signals generated by an inverter chain. This results in a new clock signal at twice the frequency of the system clock, which is then fed into a standard SETFF. In essence, it is a SETFF that operates locally at twice the system clock frequency in order to maintain the data throughput. The area overhead is much less than that of a classic implementation. However, the XOR gate delay adds to the overall flip-flop delay and reduces the maximum clock frequency.
Figure 3.15: DETFF using XOR operation to generate delayed clock pulses [7]
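The pulse-generation principle behind the XOR-based DETFF can be checked with a discrete-time sketch: XOR-ing a clock with a delayed copy of itself produces a short pulse after every rising and every falling edge, that is, a pulse train at twice the clock frequency. The sampling resolution, the 8-sample half period and the 3-sample non-inverting delay (standing in for the inverter chain) are arbitrary assumptions; real pulse widths depend on the actual inverter-chain delay.

```python
from typing import List


def square_clock(half_period: int, cycles: int) -> List[int]:
    """Sampled 50% duty-cycle clock, starting low."""
    wave: List[int] = []
    for _ in range(cycles):
        wave += [0] * half_period + [1] * half_period
    return wave


def delay(signal: List[int], steps: int) -> List[int]:
    """Delay a sampled signal by `steps` samples (a non-inverting delay line)."""
    return [signal[0]] * steps + signal[:-steps] if steps else list(signal)


def xor_pulses(clk: List[int], steps: int) -> List[int]:
    """XOR the clock with its delayed copy: high for `steps` samples after each edge."""
    return [a ^ b for a, b in zip(clk, delay(clk, steps))]


def rising_edges(signal: List[int]) -> List[int]:
    """Sample indices at which the signal goes from 0 to 1."""
    return [i for i in range(1, len(signal)) if signal[i - 1] == 0 and signal[i] == 1]


if __name__ == "__main__":
    clk = square_clock(half_period=8, cycles=4)
    pulses = xor_pulses(clk, steps=3)
    print(rising_edges(clk))     # [8, 24, 40, 56]
    print(rising_edges(pulses))  # [8, 16, 24, 32, 40, 48, 56]
```

The pulse spacing (8 samples) is half the clock period (16 samples), which is why a SETFF clocked by this pulse train captures data at the rate of a DETFF.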
Figure 3.16 presents an efficient realization of a dual-edge-triggered pulse generator, composed of an inverter chain followed by a pair of parallel transmission gates. It exhibits minimal area overhead, no delay penalty and good clock skew absorption.
Figure 3.16: DETFF proposed in [8] with DET pulse generator
From the above discussion, it is crucial that DETFF implementations achieve DET behavior without increasing the load on either the clock or the data input. DETFF structures should also facilitate the implementation of DFT, for ease of debugging and for robust designs.
Chapter 4
Hearing Aids and Digital Filters
To further analyze DETFF performance and possible power savings in low power, low voltage VLSI circuits, digital filters for hearing aid applications are explored.
4.1 Hearing Aids
Hearing aids impose the strictest set of requirements among portable audio systems. There are four common types of hearing aids, which differ in size and in where they are worn: behind-the-ear, in-the-ear, in-the-canal, and completely-in-the-canal (CIC). CIC aids are the smallest and most discreet, and can be fitted entirely within the bony canal. Not only do they offer superior sound quality, but they also have better cosmetic appeal, since they are completely invisible to others. In general, smaller hearing aids offer fewer controls for acoustical adjustment and have smaller batteries, so they do not last as long as larger hearing aids. However, smaller hearing aids offer a more natural sound because of the acoustic characteristics of the ear.
Until very recently, all contemporary hearing aids were designed using analog circuitry. The shortcoming of these analog aids is that their electroacoustic characteristics could not easily be modified to suit individual patient requirements. As technology advances, digitally programmable hearing devices are becoming readily available at more affordable prices. In these devices, the input signal is digitized and then processed with digital signal processing circuitry. With algorithmic fitting methods, the electroacoustic performance and sound output can be adjusted to meet individual hearing loss requirements [30, 31].
Hearing aid devices have challenging power and area requirements due to their small battery capacity and limited dimensions. Since hearing aids are battery powered, the maximum supply current is 1.5 mA, with an operating voltage range of 1 to 1.5 V. The core of a hearing aid consists of a digital signal processor (DSP), which executes basic filter algorithms such as finite impulse response (FIR) filtering as well as special adaptive algorithms used for noise reduction. The basic building blocks of these filters are adders, multipliers, multiply-accumulators and memories, in addition to counters, registers, multiplexers and demultiplexers. In this study, only the digital filters are covered, and they are used as the benchmark circuit for the comparison. Their implementation with DETFFs is discussed next.
4.2 Digital Filters
Digital filters can be categorized into two classes: FIR (finite-length impulse response) and IIR (infinite-length impulse response) filters. The advantages of FIR filters over IIR filters are that they are guaranteed to be stable and can have an exactly linear-phase response. Linear-phase FIR filters are widely used in digital communication systems, speech and image processing systems, spectral analysis, and particularly in applications where nonlinear-phase distortion cannot be tolerated. Digital FIR filters are limited to transfer functions whose poles all lie at the origin of the z-plane, while IIR filters can have poles anywhere within the unit circle. Hence, in IIR filters, the poles can be used to improve frequency selectivity, and as a consequence the required filter order is much lower for IIR than for FIR filters. However, it is not possible to realize exactly linear-phase IIR filters. Furthermore, FIR filters are straightforward to design using CAD tools. One of the major drawbacks of FIR filters is that they need large amounts of memory and arithmetic processing, which makes them unattractive in many applications. IIR filters, on the other hand, require much less memory and fewer arithmetic operations, but they are difficult to design and suffer from stability problems. Although the design is much more demanding, the use of an IIR filter may result in lower system cost and higher performance [9, 32].
The primary goal of this study is to determine the usefulness of DETFFs in low power, low voltage VLSI applications such as hearing aids, and the possible power savings. Hence, the simpler FIR filter is selected to minimize design effort.
4.2.1 Half-Band FIR
An FIR filter is a nonrecursive filter and is therefore always stable. It cannot sustain any type of parasitic oscillation, except when it is part of a recursive loop, and it generates little round-off noise. However, it requires a large number of arithmetic operations and large memories.
The transfer function of an FIR filter of order M is
$$
H(z) = \sum_{n=0}^{M} h(n)\,z^{-n} = \frac{h(0)z^{M} + h(1)z^{M-1} + \cdots + h(M-1)z + h(M)}{z^{M}}
$$
where h(n) is the impulse response. Instead of using the order of the filter to describe an
FIR filter, it is customary to use the length (N) of the impulse response. N is equal to
M + 1 in this case.
The FIR filters of interest are those with linear phase. A linear-phase response has constant group delay, which implies a pure delay of the signal. The group delay is defined as the negative rate of change of the total phase shift with respect to angular frequency. For a linear-phase FIR filter of length N, it is
$$
\tau_g(\omega T) = -\frac{\partial \Phi(\omega T)}{\partial \omega} = \frac{N-1}{2}\,T
$$
Linear-phase filters are useful in applications where frequency dispersion effects must be minimized, such as the speech processing systems of hearing aids. The impulse response of a linear-phase filter exhibits symmetry or antisymmetry. Therefore, the transfer function can be rewritten using a real function $H_R$:
$$
H(e^{j\omega T}) = e^{j\Phi(\omega T)}\,H_R(e^{j\omega T}) \tag{4.1}
$$
where $\Phi(\omega T) = c - \omega\tau_g$, with c = 0 and c = π/2 for symmetric and antisymmetric impulse responses, respectively. $H_R(e^{j\omega T})$ is referred to as the zero-phase response. For an FIR filter with an antisymmetric impulse response and odd N, the zero-phase response is
$$
H_R(e^{j\omega T}) = \sum_{n=1}^{(N-1)/2} 2\,h\!\left(\frac{N-1}{2} - n\right)\sin(\omega T n) \tag{4.2}
$$
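As a numerical check on the linear-phase property, the sketch below evaluates the frequency response of a short symmetric impulse response, removes a pure delay of (N - 1)/2 samples, and confirms that what remains is real-valued, i.e. a zero-phase response as in Eq. (4.1) with c = 0. It also estimates the group delay by a central difference. The coefficients are arbitrary and are not those of the filter designed in this thesis; T = 1 is assumed, so delays are in samples.

```python
import cmath
from typing import Sequence


def freq_response(h: Sequence[float], w: float) -> complex:
    """H(e^{jw}) = sum_n h(n) e^{-jwn}, with the sampling period T normalized to 1."""
    return sum(hn * cmath.exp(-1j * w * n) for n, hn in enumerate(h))


def zero_phase_response(h: Sequence[float], w: float) -> complex:
    """Multiply out the pure delay of (N-1)/2 samples; the result is real if h is symmetric."""
    return freq_response(h, w) * cmath.exp(1j * w * (len(h) - 1) / 2)


def group_delay(h: Sequence[float], w: float, dw: float = 1e-6) -> float:
    """Central-difference estimate of -dPhi/dw (in samples), assuming H_R keeps its sign near w."""
    ratio = freq_response(h, w + dw) / freq_response(h, w - dw)
    return -cmath.phase(ratio) / (2 * dw)


if __name__ == "__main__":
    h = [0.1, -0.2, 0.5, -0.2, 0.1]  # arbitrary symmetric example, N = 5
    for w in (0.3, 1.0, 2.5):
        hr = zero_phase_response(h, w)
        print(f"w = {w}: H_R = {hr.real:+.4f}, |Im| = {abs(hr.imag):.1e}, "
              f"group delay = {group_delay(h, w):.6f} samples")
```

For N = 5 the estimated group delay is (N - 1)/2 = 2 samples at every test frequency, independent of the coefficient values, as long as the impulse response stays symmetric.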
Many DSP schemes exploit the large number of zero-valued coefficients in the impulse response of certain types of FIR filters, such as the half-band FIR filter. The required number of arithmetic operations can then be reduced, since multiplications by zero coefficients are unnecessary.
An even-order (odd N) half-band FIR filter has a zero-phase function that is antisymmetric about ωT = π/2. Hence, for a lowpass half-band FIR filter,
$$
H_R(e^{j\omega T}) = 1 - H_R(e^{j(\pi - \omega T)})
$$
As a result, every other coefficient in the impulse response is zero, except for the center coefficient, which is 0.5. The filter is called half-band because its bandwidth is about half of the whole frequency band. The symmetry implies that the cutoff and stopband angles satisfy ωcT + ωsT = π, and that the passband and stopband deviations are equal. The reduction in the number of arithmetic operations is significant, although in practice the required filter order is slightly higher than that of a corresponding general linear-phase filter. The normalized zero-phase function at the band center is H_R(e^{jπ/2}) = 0.5, implying an attenuation of 6 dB. Note that this property holds only for even-order filters: for odd-order half-band filters, the coefficients are nonzero [9, 33].
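The saving can be quantified for a filter of length N = 39, the length used later in this chapter. Assuming an ideal even-order half-band structure as described above (center tap 0.5 at index (N - 1)/2, every other tap zero, symmetric impulse response), the sketch counts the taps that can be nonzero and the multiplications per output sample when symmetric taps share a multiplier. These counts are derived from the stated assumptions only; they are not results reported in this thesis, whose architecture (Section 4.3.2) processes all 39 coefficients with a single multiplier.

```python
from typing import List


def halfband_nonzero_taps(n_taps: int) -> List[int]:
    """Indices of the taps that can be nonzero in an ideal even-order (odd-length) half-band filter.

    The center tap equals 0.5; any other tap whose distance from the center is
    even is zero, so only taps at odd distances remain.
    """
    if n_taps % 2 == 0:
        raise ValueError("an even-order half-band filter has an odd length N")
    center = (n_taps - 1) // 2
    return [n for n in range(n_taps) if n == center or (n - center) % 2 != 0]


def multiplications_per_output(n_taps: int) -> int:
    """Multiplications per output sample when symmetric taps h(n) = h(N-1-n) share one multiplier.

    The center tap (0.5) is counted too, although in hardware it reduces to a shift.
    """
    center = (n_taps - 1) // 2
    return sum(1 for n in halfband_nonzero_taps(n_taps) if n < center) + 1


if __name__ == "__main__":
    N = 39
    print(len(halfband_nonzero_taps(N)), multiplications_per_output(N))  # 21 11
```

Under these assumptions, 21 of the 39 taps can be nonzero, and exploiting symmetry reduces the work to 11 multiplications per output, compared with 39 for a direct evaluation of all coefficients.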
4.2.2 Implementation of FIR - Direct Form Structure
Only a few structures are of interest for the realization of FIR filters. One of the best and
yet simplest structures is the direct form, as depicted in Figure 4.1.
Figure 4.1: FIR filter implemented with the direct form structure (delay chain x(n), x(n-1), ..., x(n-38) weighted by coefficients h(0), h(1), ..., h(38) and summed to form y(n))
The direct form FIR filter of order M (length N = M + 1) can be described by a single difference equation:
$$
y(n) = \sum_{k=0}^{M} h(k)\,x(n-k) \tag{4.3}
$$
The required numbers of multiplications and additions are N and N − 1 respectively.
This structure is suitable for implementation on processors that are efficient in computing
sum-of-products. Most standard signal processors provide special features to support sum-
of-product computations, i.e., a multiplier-accumulator and hardware implementation of
loops and circular memory addressing. The signal levels in this structure are inherently
scaled except for the output [9, 32].
As shown in Eq. (4.3), digital FIR filters consist of a series of multiplications of input samples by constant coefficients, followed by additions of these products. The building blocks include memory cells, multipliers, adders, and a programmer to control the sequence of operations [34].
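Eq. (4.3) maps directly onto a tapped delay line followed by a sum of products. The sketch below is a minimal software model of the direct form: for each new input sample it shifts the delay line and performs one multiply-accumulate per coefficient, i.e. N multiplications and N - 1 additions, as stated above. The function name and the deque-based delay line are illustrative choices; a hardware realization uses registers and a multiplier-accumulator instead.

```python
from collections import deque
from typing import Iterable, List, Sequence


def fir_direct_form(h: Sequence[float], x: Iterable[float]) -> List[float]:
    """Filter x with y(n) = sum_{k=0}^{M} h(k) x(n-k); samples before x(0) are taken as 0."""
    taps = deque([0.0] * len(h), maxlen=len(h))  # taps[k] holds x(n-k)
    y: List[float] = []
    for sample in x:
        taps.appendleft(sample)      # shift the delay line by one sample
        acc = 0.0
        for k, coeff in enumerate(h):
            acc += coeff * taps[k]   # one multiply-accumulate per coefficient
        y.append(acc)
    return y


if __name__ == "__main__":
    h = [0.25, 0.5, 0.25]            # hypothetical 3-tap example
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(fir_direct_form(h, impulse))  # [0.25, 0.5, 0.25, 0.0], the impulse response
```

Feeding in a unit impulse returns the coefficients themselves, which is a convenient sanity check for any FIR realization.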
4.3 Design and Implementation of a Chebyshev Half-Band FIR Filter
The human audible frequency range extends from about 20 Hz to 22 kHz, and the human speech bandwidth is roughly 100 Hz to 8 kHz. By the Nyquist sampling theorem, data are only valid up to Fs/2, where Fs is the sampling frequency of the input data; hence, the maximum input frequency should be less than Fs/2. The bandwidth of a half-band filter is therefore ½(Fs/2) = Fs/4. To cover the human speech spectrum up to 8 kHz, a sampling frequency of Fs = 4 × 8 kHz = 32 kHz is required.
Chebyshev approximation has been chosen to determine the filter coefficients, because linear-phase, equal-ripple FIR filters are naturally described in terms of Chebyshev polynomials. The filter coefficient design method described in [35] is adopted here. A filter length of 39 is first selected, for which the minimum stopband loss and the passband ripple are found to be 36.3 dB and 0.26 dB, respectively. The details of the filter specification are listed in Table 4.1.
Table 4.1: Specifications for Chebyshev half-band FIR filter with N = 39
Parameter                   Value
Length (N)                  39
Cutoff frequency (ωcT)      1.507 rad
Minimum stopband loss       36.3 dB
Passband ripple             0.26 dB
4.3.1 Number representation
Before describing the implementation of the design, its number representation must be defined. The performance of the processing elements with respect to speed, chip area, and power dissipation depends on the number representation used. In this research, two's-complement representation and fractional fixed-point arithmetic are employed.
Two's-complement representation is the most common number representation used in digital signal processing. It is one of the binary representations of signed numbers. One of the main advantages of a complement representation is that addition and subtraction can be performed without regard to the signs of the operands. The value of a normalized Wd-bit binary word x in two's-complement representation is
$$
x = -x_0 + \sum_{i=1}^{W_d-1} x_i\,2^{-i} \tag{4.4}
$$
For x ≥ 0, the two's-complement word is the same as the signed-magnitude word. The negative of a number in two's-complement representation can be obtained from the corresponding positive number by adding one LSB, $2^{-(W_d-1)}$, to its bit-complement.
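Eq. (4.4) and the negation rule can be checked with exact rational arithmetic. The sketch below treats a Wd-bit word as a list of bits x0, x1, ..., x_{Wd-1} with x0 as the sign bit, evaluates its value, and negates it by complementing every bit and adding one LSB with a rippling carry. The helper names are hypothetical.

```python
from fractions import Fraction
from typing import List


def tc_value(bits: List[int]) -> Fraction:
    """Value of a normalized two's-complement word x0 x1 ... x_{Wd-1}, Eq. (4.4)."""
    value = Fraction(-bits[0])               # sign bit x0 has weight -1
    for i, b in enumerate(bits[1:], start=1):
        value += Fraction(b, 2 ** i)         # bit x_i has weight 2^-i
    return value


def tc_negate(bits: List[int]) -> List[int]:
    """Complement every bit, then add one LSB (weight 2^-(Wd-1)), rippling the carry leftwards."""
    out = [1 - b for b in bits]
    carry = 1
    for i in range(len(out) - 1, -1, -1):   # from the LSB x_{Wd-1} to the sign bit x0
        total = out[i] + carry
        out[i], carry = total % 2, total // 2
    return out                               # a carry out of the sign position is discarded


if __name__ == "__main__":
    x = [0, 1, 1, 0]                         # Wd = 4: 0.5 + 0.25
    print(tc_value(x), tc_negate(x), tc_value(tc_negate(x)))  # 3/4 [1, 0, 1, 0] -3/4
```

One corner case is worth noting: x = -1 (word 1000 for Wd = 4) negates to itself, because +1 is not representable in a normalized two's-complement word.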
A useful property of two's-complement representation is that several numbers can be added even though the partial sums temporarily overflow the available number range, provided the final sum lies within the proper range. Thus, the numbers can be added in arbitrary order without considering possible overflow. Hence, arithmetic operations such as addition, subtraction, and multiplication are simple to implement in two's-complement representation, since they are independent of the signs of the numbers involved.
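The wrap-around property can be demonstrated with three 4-bit fractions whose partial sum overflows while the final sum is in range. In the sketch, words are held as integers in units of one LSB, and every addition is reduced modulo 2^Wd, which models a Wd-bit adder that discards the carry out of the sign position. The word length and the example values are hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence

WD = 4                              # word length: sign bit plus 3 fractional bits
MOD = 2 ** WD                       # modulus of a WD-bit adder
LSB = Fraction(1, 2 ** (WD - 1))    # weight of the least-significant bit


def wrap(code: int) -> int:
    """Reduce an integer code to the WD-bit two's-complement range [-2^(WD-1), 2^(WD-1) - 1]."""
    return (code + MOD // 2) % MOD - MOD // 2


def to_code(x: Fraction) -> int:
    """Integer code (in LSBs) of a representable fractional value."""
    code = x / LSB
    if code.denominator != 1 or not -MOD // 2 <= code < MOD // 2:
        raise ValueError(f"{x} is not representable in {WD} bits")
    return int(code)


def wrapped_partial_sums(values: Sequence[Fraction]) -> List[Fraction]:
    """Accumulate values in a WD-bit register and return each partial sum."""
    acc = 0
    partials: List[Fraction] = []
    for v in values:
        acc = wrap(acc + to_code(v))
        partials.append(acc * LSB)
    return partials


if __name__ == "__main__":
    vals = [Fraction(3, 4), Fraction(1, 2), Fraction(-5, 8)]
    print([str(p) for p in wrapped_partial_sums(vals)])  # ['3/4', '-3/4', '5/8']
```

The intermediate result 3/4 + 1/2 = 5/4 overflows and wraps to -3/4, yet the final sum 5/8 is exact because it lies within [-1, 7/8].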
In fractional fixed-point arithmetic, the k leftmost digits represent the integer part and the remaining Wd − k digits represent the fractional part. For example, the binary representation of a number x with k = 2 is $x_0 x_1 . x_2 \cdots x_{W_d-1}$. The radix point is not stored in the fixed-point representation; instead, its position is implied. The advantages of fractional fixed-point arithmetic are that parasitic oscillations are more easily suppressed, and that it requires less chip area and is much faster than floating-point arithmetic. Hence, in most VLSI circuits for dedicated DSP applications, calculations are done using fractional fixed-point arithmetic [9].
In the Chebyshev half-band FIR filter design, a 16-bit word length is selected for both the data inputs and the filter coefficients.
4.3.2 Processing Blocks
Bit-serial arithmetic is a viable alternative to traditional bit-parallel arithmetic in digital signal processing applications. Bit-serial arithmetic significantly reduces chip area by eliminating wide buses, and it simplifies wire routing. The speed penalty is not as large as it seems: the speed ratio between the two is much smaller than the word length would suggest, because of the long carry propagation paths in parallel arithmetic [9, 36].
The comparison of power consumption is, however, more complicated. Bit-parallel arithmetic suffers from energy losses in the glitches that occur as the carry propagates, although there are fewer glitches if successive data are strongly correlated; driving long, wide buses also consumes large amounts of power. Bit-serial arithmetic, on the other hand, performs only useful computations without glitches, but requires more clocked elements, which consume significant amounts of power. A power-efficient realization of the clocked elements is then more important [9]. Since the focus of this research is on the efficiency and performance of DETFFs, bit-serial arithmetic is selected for the design of the Chebyshev half-band digital filter core.
The architecture of the FIR filter depicted in Figure 4.1 has been modified to reduce the number of multipliers used. The block diagram of the design is illustrated in Figure 4.2.
Input data with a 16-bit word length arrive at a rate of Fs. The data are stored in a queue through a mux stage, which selects whether to push new data into the queue or to loop the output of the queue back to its beginning. New data are pushed into the queue 1/39 of the time. Since only one multiplier and one accumulator are available while 39 coefficients must be processed for each sample, the queue needs to operate 39 times faster than the input data rate. Because the filter core uses bit-serial arithmetic, it is required to operate at (Wc + Wd) × 39 × Fs, where Wc and Wd are the word lengths of the filter coefficients and the data, respectively. In bit-serial arithmetic, numbers are normally processed least-significant bit first.
SRin is a parallel-to-serial shift register followed by a serial/parallel multiplier. The accumulator adds up all the products. SRout stores the accumulated sum and feeds it back to the accumulator; it also drives the output buffers once all multiplications with the 39 coefficients and the additions are completed. These processing blocks are described in more detail below.
Figure 4.2: Block diagram of the Chebyshev half-band FIR filter design (16-bit input at Fs; MUX and queue with push and pop at 39Fs; SRin; serial/parallel multiplier with a ROM of 39 16-bit coefficients; accumulator; SRout; output buffer at Fs)
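The time-multiplexed organization described above can be sketched at word level: a 39-entry circulating queue holds the input history, and a single multiply-accumulate unit steps through the coefficients once per input sample, looping each word back into the queue as it is used. The sketch checks that this gives the same output as the direct form of Eq. (4.3) and evaluates the bit-serial core clock rate (Wc + Wd) × 39 × Fs for Wc = Wd = 16 and Fs = 32 kHz. It abstracts away the bit-serial arithmetic and the exact push/pop timing of the real design, and the function names are hypothetical.

```python
from collections import deque
from typing import List, Sequence


def fir_time_multiplexed(h: Sequence[float], x: Sequence[float]) -> List[float]:
    """Word-level model of a single-MAC FIR filter with a circulating sample queue."""
    n_taps = len(h)
    queue = deque([0.0] * n_taps)  # queue[0] holds the newest sample x(n)
    y: List[float] = []
    for sample in x:
        queue.pop()                # the oldest sample x(n-N) leaves the history
        queue.appendleft(sample)   # mux selects new data in one of the N slots
        acc = 0.0                  # accumulator cleared once per output sample
        for k in range(n_taps):    # N multiply-accumulate steps per input sample
            word = queue.popleft()         # word == x(n-k)
            acc += h[k] * word             # coefficient h(k) read from the ROM
            queue.append(word)             # mux loops the word back into the queue
        y.append(acc)              # after N steps the queue is back in its original order
    return y


def bit_serial_clock_hz(wc: int, wd: int, n_taps: int, fs_hz: float) -> float:
    """Core clock rate (Wc + Wd) x N x Fs for a bit-serial MAC."""
    return (wc + wd) * n_taps * fs_hz


if __name__ == "__main__":
    h = [0.1, 0.2, 0.3]            # hypothetical coefficients
    x = [1.0, 2.0, 3.0, 4.0]
    print(fir_time_multiplexed(h, x))
    print(f"{bit_serial_clock_hz(16, 16, 39, 32e3) / 1e6:.3f} MHz")  # 39.936 MHz
```

Under these assumptions the bit-serial core must run at about 39.9 MHz, which is the clock rate at which the choice of flip-flop determines the core's clock power.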
Adder
Figure 4.3(a) shows a bit-serial adder, which is composed of a full adder (FA) and a D flip-flop. This adder is also called a carry-save adder, since the carry is saved from one bit position to the next. The D flip-flop is reset at the start of a computation to clear the memory of the adder.
Figure 4.3: (a) Bit-serial adder (b) Bit-serial subtractor [9]
A subtractor can be obtained from the adder by simply inverting one of the addends and setting the D flip-flop at the beginning of a computation. The implementation of a subtractor is illustrated in Figure 4.3(b).
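The bit-serial adder and subtractor can be modeled in a few lines: a full adder whose carry output is stored in a D flip-flop and fed back on the next clock, with the operands arriving LSB first. For subtraction, the subtrahend bits are inverted and the flip-flop starts set (carry-in = 1), which adds the two's complement of the subtrahend. The function names and the integer encoding of words are assumptions of this sketch.

```python
from typing import List, Tuple


def to_bits_lsb_first(value: int, width: int) -> List[int]:
    """Two's-complement bits of `value`, least-significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits: List[int]) -> int:
    """Interpret LSB-first bits as a two's-complement integer."""
    raw = sum(b << i for i, b in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw


def full_adder(x: int, y: int, c: int) -> Tuple[int, int]:
    """Return (sum, carry) of three input bits."""
    return x ^ y ^ c, (x & y) | (x & c) | (y & c)


def bit_serial_add(xs: List[int], ys: List[int], subtract: bool = False) -> List[int]:
    """Bit-serial adder/subtractor: one FA whose carry is held in a D flip-flop."""
    carry_ff = 1 if subtract else 0  # flip-flop is reset for addition, set for subtraction
    out: List[int] = []
    for x, y in zip(xs, ys):         # one clock cycle per bit position, LSB first
        if subtract:
            y ^= 1                   # invert the subtrahend bit
        s, carry_ff = full_adder(x, y, carry_ff)  # carry is stored for the next cycle
        out.append(s)
    return out                       # the final carry-out is discarded (modulo 2^width)


if __name__ == "__main__":
    W = 8
    a, b = 23, -7
    xs, ys = to_bits_lsb_first(a, W), to_bits_lsb_first(b, W)
    print(from_bits_lsb_first(bit_serial_add(xs, ys)))                 # 16
    print(from_bits_lsb_first(bit_serial_add(xs, ys, subtract=True)))  # 30
```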
Serial/Parallel Multiplier
Most bit-serial multipliers are based on the shift-and-add algorithm, where several bit-products are added in each time slot. In a serial/parallel multiplier, the Wd-bit multiplicand, x, arrives bit-serially, while the Wc-bit multiplier, a, is applied in a bit-parallel
format. Many different schemes for bit-serial multipliers are available. They differ mainly
in the order of bit-products generation and addition and in the handling of subtraction.
A common approach is to generate a row of bit-products in each time slot and then add
them concurrently.
Figure 4.4 illustrates an example of a serial/parallel multiplier based on carry-save adders. It is composed of an AND stage and a cascaded array of one subtractor, for the sign bit, and Wc − 1 adders. At the beginning of a computation, all the D flip-flops are clocked to set the subtractor and reset the adders. The input data bits are broadcast bit-serially, LSB first, to the array, while the Wc-bit multiplier is applied in parallel. Each x bit is first ANDed with the multiplier word a, and the resulting bit-products are then added or subtracted in the array. As the D flip-flops are clocked, the sum bits from the FAs are shifted one bit to the right, while each carry bit is saved and added in the same FA stage at the next clock. In each cycle, one partial product is formed and shift-accumulated; these operations correspond to multiplying the accumulator contents by $2^{-1}$. After Wd clock cycles, only the Wd least-significant product bits have been shifted out of the end of the array. The Wc − 1 most-significant product bits remain in the accumulator as a carry-save residue, and must be combined to form the whole product by clocking them through the accumulator. Hence, a bit-serial multiplication takes at least Wd + Wc − 1 clock cycles. Two successive multiplications are therefore separated by Wd + Wc clock cycles, since one clock cycle is required to clear the accumulator [9, 37].
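A word-level sketch of the underlying shift-and-add algorithm may help: x arrives LSB first, and in each time slot the bit-product x_i · a (the AND stage) is accumulated with weight 2^i, the sign bit of x being subtracted because it carries weight -2^(Wd-1). The sketch verifies the product against ordinary multiplication and returns the cycle counts quoted above. It is an algorithmic model, not a gate-level model of the carry-save array: it shifts each partial product left instead of shifting the accumulator right, which is arithmetically equivalent, it handles the sign bits arithmetically rather than with the dedicated subtractor of Figure 4.4, and it uses integer rather than fractional operands for simplicity.

```python
from typing import List, Tuple


def bits_lsb_first(value: int, width: int) -> List[int]:
    """Two's-complement bits of `value`, least-significant bit first (serial arrival order)."""
    return [(value >> i) & 1 for i in range(width)]


def shift_and_add_multiply(x: int, a: int, wd: int) -> int:
    """Multiply a wd-bit two's-complement x, applied bit-serially, by a parallel coefficient a."""
    if not -(1 << (wd - 1)) <= x < (1 << (wd - 1)):
        raise ValueError("x does not fit in wd bits")
    acc = 0
    for i, xi in enumerate(bits_lsb_first(x, wd)):  # one time slot per bit of x
        partial = a if xi else 0                     # AND stage: x_i AND every bit of a
        if i == wd - 1:
            acc -= partial << i                      # sign bit of x carries weight -2^(wd-1)
        else:
            acc += partial << i                      # ordinary bit carries weight +2^i
    return acc


def multiply_cycles(wd: int, wc: int) -> Tuple[int, int]:
    """(Minimum cycles per product, cycles between successive products), as stated in the text."""
    return wd + wc - 1, wd + wc


if __name__ == "__main__":
    print(shift_and_add_multiply(-5, 7, wd=4))  # -35
    print(multiply_cycles(16, 16))              # (31, 32)
```

For the 16-bit data and coefficients of this design, successive products are 32 cycles apart, which matches the (Wc + Wd) factor in the core clock rate given in Section 4.3.2.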
In most serial/parallel multipliers, the speed is limited by the propagation delay of the unpipelined data path; in this case, the propagation time through one AND gate and one full adder [36]. In addition, the use of carry-save adders yields a regular hardware structure. A weakness of serial/parallel multiplier architectures is that data and control signals must be broadcast to the array. The loading on these wires, together with their inherent RC time constant, may impair the achievable performance. This may be combated in two ways: (i) inserting buffers/drivers before broadcasting; (ii) inserting pipelining latches in all direct paths every few stages in the accumulator [37].
Figure 4.4: Serial/parallel multiplier (coefficient bits a0 to a4 applied in parallel; data bits x0, x1, x2, ..., x_{Wd-1} applied bit-serially; output y)
Shift Registers
Shift registers are composed of flip-flops and, except in the serial-in-serial-out (SISO) structure, multiplexers. The multiplexers select the inputs and load the registers. In this filter design, the input shift register has a parallel-in-serial-out (PISO) structure, whereas the output register is a simplified combination of the SISO and serial-in-parallel-out (SIPO) structures: SRout is essentially a SISO shift register with parallel access points between the flip-flops that drive the output buffers. These structures are illustrated in Figure 4.5.
A queue is essentially an array of SISO shift registers. It is also referred to as a FIFO (first-in-first-out).
Figure 4.5: Structures for shift registers (PISO, SISO and SIPO)
Chapter 5
Results and Discussions
5.1 Measurements
As noted in Chapter 3, DETgago and DETproposed are selected for implementation in a Chebyshev digital filter for further analysis. These two DETFFs are modified to include reset and set control inputs. Figures 5.1 and 5.2 illustrate the implementations with the reset control input; similar realizations are carried out for the set control input. These DETFFs are optimized, laid out according to the 0.18 µm standard cell rules, and extracted. All the DETFF implementations have a width of 21.12 µm, except for DETgago with set control, which is 20.46 µm wide. Compared to the 0.18 µm standard SETFF with reset and set inputs, the DETFFs are about 77% larger in area. This is expected because of the more complex parallel interconnects of the DETFF structures.
The DETFFs, together with the standard SETFF, are simulated over a range of Vdd from the nominal 1.8 V down to 0.9 V, using the same testbench and simulation parameters introduced in Section 3.3.3. The results are summarized in Table 5.1. As shown in the table, the two DETFFs demonstrate superior power, delay and energy compared to the SETFF for the same throughput. DETproposed exhibits the best energy efficiency among the three flip-flops, while DETgago consumes the least power. The DETFFs demonstrate a power saving of 10% and an energy saving of as much as 23%.
Figure 5.1: DETgago with reset active low control
Furthermore, the standard SETFF fails to deliver the data throughput specified in the testbench at the high clock frequency (1 GHz) for Vdd of 1 V or less. It can, however, deliver the same throughput at 1.05 V, with power, delay and energy of 68.19 µW, 846.4 ps and 57.72 fJ, respectively. The standard SETFF still operates properly at lower clock frequencies, such as that of the FIR filter used in this study. It is worth noting that although the DETFFs dissipate the least power at Vdd = 0.9 V, their minimum energy consumption occurs at around Vdd = 1 V: at 0.9 V, the delay penalty outweighs the power savings.
Figure 5.2: DETproposed with reset active low control
Table 5.1: Back annotation parameters for DETFFs and standard SETFF
(power in µW, delay in ps, energy in fJ; "-": the SETFF does not deliver the required throughput at this supply voltage)
              Vdd = 0.9 V              Vdd = 1 V                Vdd = 1.8 V
Cell          PTOT    td      E        PTOT    td      E        PCK     PD      PFF     td      E
SETFF         -       -       -        -       -       -        39.66   26      150.2   340.1   73.41
DETgago       43.37   1212    52.56    54.45   877.5   47.78    25.04   33.12   134.2   299     57.52
DETproposed   44.89   951.6   42.71    55.91   710.1   39.7     25.42   30.06   147.5   277.9   56.41
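The energy column of Table 5.1 is consistent with energy = total power × delay, where at 1.8 V the total power is taken as PCK + PD + PFF. That interpretation of the 1.8 V columns is inferred from the numbers rather than stated in the text. The check below recomputes each energy from the tabulated power and delay, and derives the savings of each DETFF relative to the SETFF at 1.8 V.

```python
from typing import Dict, Tuple

# Values transcribed from Table 5.1 (power in uW, delay in ps, energy in fJ).
TABLE_1V8: Dict[str, Tuple[float, float, float, float, float]] = {
    # cell: (PCK, PD, PFF, td, E)
    "SETFF": (39.66, 26.0, 150.2, 340.1, 73.41),
    "DETgago": (25.04, 33.12, 134.2, 299.0, 57.52),
    "DETproposed": (25.42, 30.06, 147.5, 277.9, 56.41),
}
TABLE_LOW_VDD: Dict[Tuple[str, float], Tuple[float, float, float]] = {
    # (cell, Vdd): (PTOT, td, E)
    ("DETgago", 0.9): (43.37, 1212.0, 52.56),
    ("DETproposed", 0.9): (44.89, 951.6, 42.71),
    ("DETgago", 1.0): (54.45, 877.5, 47.78),
    ("DETproposed", 1.0): (55.91, 710.1, 39.7),
}


def energy_fj(power_uw: float, delay_ps: float) -> float:
    """E = P * td; 1 uW * 1 ps = 1e-18 J = 1e-3 fJ."""
    return power_uw * delay_ps * 1e-3


def recomputed_energies() -> Dict[Tuple[str, float], float]:
    """P * td for every table entry, taking PTOT = PCK + PD + PFF at 1.8 V (an inference)."""
    out: Dict[Tuple[str, float], float] = {}
    for cell, (pck, pd, pff, td, _) in TABLE_1V8.items():
        out[(cell, 1.8)] = energy_fj(pck + pd + pff, td)
    for key, (ptot, td, _) in TABLE_LOW_VDD.items():
        out[key] = energy_fj(ptot, td)
    return out


def tabulated_energy(cell: str, vdd: float) -> float:
    """Energy as printed in Table 5.1."""
    return TABLE_1V8[cell][4] if vdd == 1.8 else TABLE_LOW_VDD[(cell, vdd)][2]


def savings_vs_setff(cell: str) -> Tuple[float, float]:
    """(power saving %, energy saving %) of a DETFF relative to the SETFF at 1.8 V."""
    p_set, p_det = sum(TABLE_1V8["SETFF"][:3]), sum(TABLE_1V8[cell][:3])
    e_set, e_det = TABLE_1V8["SETFF"][4], TABLE_1V8[cell][4]
    return 100 * (1 - p_det / p_set), 100 * (1 - e_det / e_set)


if __name__ == "__main__":
    for (cell, vdd), e in recomputed_energies().items():
        print(f"{cell:12s} {vdd:.1f} V: P*td = {e:6.2f} fJ, table {tabulated_energy(cell, vdd):6.2f} fJ")
    for cell in ("DETgago", "DETproposed"):
        dp, de = savings_vs_setff(cell)
        print(f"{cell}: power saving {dp:.1f}%, energy saving {de:.1f}% at 1.8 V")
```

All recomputed energies agree with the table to within 0.01 fJ. The energy saving of DETproposed comes out at about 23% and the power saving of DETgago at about 11%, in line with the figures quoted above.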
The Chebyshev half-band FIR filter is verified functionally using Matlab. Figure 5.3
shows the frequency response of the half-band FIR filter.
Figure 5.3: Frequency response of the Chebyshev half-band FIR filter (plot of magnitude (dB), from −250 dB to 50 dB, versus w (rad/s), from 0 to 3.5)
The Chebyshev half-band FIR filter design is coded in RTL VHDL, compiled, simulated, and then synthesized using Synopsys. It is then laid out using the automated place-and-route tools Design Planner and Silicon Ensemble, and ported into Cadence for final post-layout simulation. Table 5.2 summarizes the power consumption and filter core dimensions. The FIR filter with DETproposed consumes 38% less power than the one with the standard SETFF cell; however, its area overhead is approximately 70%.
Table 5.2: Simulation results for the Chebyshev half-band FIR filter operating at 0.9 V
FIR with       Total power (µW)     Core area (µm × µm)
SETFF          135.7                280 × 280
DETgago        - (not available)    280 × 475
DETproposed    83.61                280 × 475
Simulation data for DETgago are not available at this time because the reset control signal was implemented improperly. In the circuit of Figure 5.1, when reset is asserted, the output of the flip-flop appears to be reset. However, since the latches are not disconnected during reset, the value stored in the flip-flop persists when the reset is released. This causes problems in resetting the filter at the start of simulation, though less so during filter operation, since the clock is always toggling. Hence, the flip-flops must initially be clocked with input data equal to 0. An improved implementation of DETgago with reset, in which the latches are disconnected during reset, is illustrated in Figure 5.4.
[Schematic: DETgago cell with input D, clock CK, output Q, internal nodes N2 to N5, and active-low reset R]
Figure 5.4: Improved implementation of DETgago with active-low reset control
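A purely behavioral model can reproduce the reset problem described above. The sketch below assumes a generic latch-and-multiplexer DETFF: one latch is transparent while CK is high, the other while CK is low, and an output multiplexer selects whichever latch is opaque. It is not the transistor-level DETgago circuit of Figures 5.1 and 5.4. The class, function and parameter names are hypothetical. One variant only forces the output low during reset, which mirrors the original cell. The other also clears both latches, which is a rough stand-in for disconnecting the latches in the improved cell.

```python
class DualEdgeFF:
    """Behavioral latch-mux DETFF model (hypothetical; not the DETgago netlist)."""

    def __init__(self, reset_clears_latches, initial_state=1):
        # initial_state mimics an arbitrary value left in the latches at start-up.
        self.reset_clears_latches = reset_clears_latches
        self.latch_hi = initial_state  # transparent while CK = 1
        self.latch_lo = initial_state  # transparent while CK = 0

    def step(self, d, ck, reset_n):
        """Apply D, CK and active-low reset for one time step; return Q."""
        in_reset = reset_n == 0
        if in_reset and self.reset_clears_latches:
            # Improved variant: both latches are forced to 0 during reset.
            self.latch_hi = 0
            self.latch_lo = 0
        elif ck == 1:
            self.latch_hi = d  # otherwise only the transparent latch follows D,
        else:                  # even while reset is asserted (latches stay connected)
            self.latch_lo = d
        # The output multiplexer selects the latch that is currently opaque.
        q = self.latch_lo if ck == 1 else self.latch_hi
        # Both variants force the visible output low while reset is asserted.
        return 0 if in_reset else q


def reset_then_release(reset_clears_latches):
    """Assert reset with the clock stopped low, then release it; return (Q during, Q after)."""
    ff = DualEdgeFF(reset_clears_latches)
    q_during = ff.step(d=0, ck=0, reset_n=0)
    q_after = ff.step(d=0, ck=0, reset_n=1)
    return q_during, q_after


print("output-only reset:", reset_then_release(False))
print("latches cleared:  ", reset_then_release(True))
```

In the output-only variant, the stale 1 reappears as soon as reset is released. Applying one full clock cycle with D = 0 (CK = 1, then CK = 0) clears both latches, which mirrors the workaround of initially clocking zeros into the filter.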
5.2 Discussions
It should be noted that the FIR implementation with DETFFs is not as optimized as the one with SETFFs because the design tools are incompatible with DETFFs. The layouts of the two filter implementations, with SETFFs and with DETFFs, are shown in Figures 5.5 and 5.6, respectively. At the time of writing, automated tools such as Synopsys, Design Planner and Silicon Ensemble do not support cells controlled by multiple clock signals. For instance, the rising and falling clock edges are treated as two separate clock controls. Such cells can be simulated but not synthesized. Hence, many manual workarounds are used in this study to implement the FIR filter with DETFFs. Because this DETFF implementation is built on the SET design flow, it is far less optimized. If the automated tools supported multiple clock controls, the area overhead of the DETFF-based FIR filter could have been much smaller and its interconnect routing better optimized. The resulting overhead would be smaller than what is presented here, and hence greater power savings could be achieved. Therefore, advances in automated tools are essential to take full advantage of DETFFs in VLSI designs.
Figure 5.5: Layout of the half-band FIR filter implemented with standard SETFF
Figure 5.6: Layout of the half-band FIR filter implemented with DETFFs
Chapter 6
Conclusions
The proposed design is an attempt to create a low-power, low-voltage DETFF. Among the flip-flops studied, DETproposed exhibits the best energy efficiency, while DETgago consumes the least power. Compared with the standard SETFF, the DETFFs achieve power savings of about 10% and energy savings of as much as 23%. The standard SETFF, on the other hand, fails to deliver the data throughput specified in the testbench at a high clock frequency (1 GHz) and a low supply voltage (1 V).
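The energy figures quoted above follow from the power-delay product (PDP) defined in the glossary. The sketch below recomputes them from the last column group of the flip-flop comparison table at the start of this section, which is assumed here to be the tail of Table 5.1. The column headers for that group are not visible in this excerpt. However, each tabulated energy equals the sum of the three power columns multiplied by the delay, so the sketch treats that sum as the total power. This is an inference from the numbers, not a stated definition.

```python
# Last column group of the flip-flop comparison table: three power columns (uW)
# and the delay (ps). Summing the powers is an inference from the tabulated energies.
CELLS = {
    "SETFF": ((39.66, 26.0, 150.2), 340.1),
    "DETgago": ((25.04, 33.12, 134.2), 299.0),
    "DETproposed": ((25.42, 30.06, 147.5), 277.9),
}


def pdp_fj(power_uw, delay_ps):
    """Power-delay product in fJ: 1 uW x 1 ps = 1e-18 J = 1e-3 fJ."""
    return power_uw * delay_ps * 1e-3


def savings_vs(reference, cells):
    """Return {cell: (total power, energy, power saving, energy saving)}."""
    ref_powers, ref_delay = cells[reference]
    ref_p = sum(ref_powers)
    ref_e = pdp_fj(ref_p, ref_delay)
    result = {}
    for name, (powers, delay) in cells.items():
        p = sum(powers)
        e = pdp_fj(p, delay)
        result[name] = (p, e, 1 - p / ref_p, 1 - e / ref_e)
    return result


for name, (p, e, dp, de) in savings_vs("SETFF", CELLS).items():
    print(f"{name:12s} P = {p:6.2f} uW  PDP = {e:5.2f} fJ  "
          f"power saving = {dp:5.1%}  energy saving = {de:5.1%}")
```

With these values the sketch gives a power saving of about 10.9% for DETgago and an energy saving of about 23.2% for DETproposed. It also reproduces the tabulated energies of 73.41, 57.52 and 56.41 fJ.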
In a digital filter setting, the benefits of DETFFs are even more prominent. The global clock net capacitance can be very large in a VLSI circuit. DETFFs allow a constant throughput to be maintained at only half the clock frequency, which, at the same supply voltage, halves the switching power of that net. The FIR filter with DETproposed is found to consume 38% less power than the one with the standard SETFF cell. However, this comes at the price of an area overhead of approximately 70%. Advances in automated tools to support multiple clock control signals are essential to minimize this area overhead and to take full advantage of DETFFs in VLSI designs.
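The clock-frequency argument can be made concrete with the usual dynamic power relation P = αCV²f, in which a clock net switches every cycle (α = 1). The sketch below is a minimal illustration. The 10 pF net capacitance and the 100 MHz data rate are hypothetical values, not taken from the filter design. The clock load of the flip-flops themselves is ignored, even though a DETFF may present a different load.

```python
def switching_power_w(c_farads, vdd_volts, f_hz, alpha=1.0):
    """Dynamic power alpha * C * Vdd^2 * f; a clock net has alpha = 1."""
    return alpha * c_farads * vdd_volts ** 2 * f_hz


C_CLOCK_NET = 10e-12  # hypothetical 10 pF global clock net
VDD = 0.9             # supply voltage of the FIR comparison (Table 5.2)
DATA_RATE = 100e6     # hypothetical 100 MHz sample rate

# SETFF registers capture on one edge, so the clock must run at the data rate.
p_set = switching_power_w(C_CLOCK_NET, VDD, DATA_RATE)
# DETFF registers capture on both edges, so half the clock frequency gives the same throughput.
p_det = switching_power_w(C_CLOCK_NET, VDD, DATA_RATE / 2)

print(f"clock net, SETFF: {p_set * 1e6:.0f} uW")
print(f"clock net, DETFF: {p_det * 1e6:.0f} uW")
```

The net capacitance and supply are unchanged, so the clock-net power scales directly with frequency. It is therefore halved whatever values are chosen for the hypothetical parameters.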
This work has shown that using DETFFs in VLSI systems is beneficial in low-power, low-voltage applications as well as in high-speed applications. As illustrated in Table 5.1 of Section 5.1, using DETFFs in high-speed applications relaxes the required clock rate and hence makes low-voltage operation feasible. As demonstrated in the half-band FIR filter design, DETFFs offer significant power savings over standard SETFFs in applications where constant data throughput is important and where the data transition probability, α, is low.
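The dependence on α can be illustrated with a first-order per-flip-flop model. In this model, clock-related power is paid every cycle, while data-related power scales with α. All capacitances below are hypothetical, and the DETFF is simply assumed to load the clock 1.3 times as much as the SETFF. The model shows only the trend and does not reproduce the thesis's measured results.

```python
def ff_power(alpha, c_clk, c_data, vdd, f_clk, f_data):
    """Clock-related power switches every cycle; data-related power scales with alpha."""
    return c_clk * vdd ** 2 * f_clk + alpha * c_data * vdd ** 2 * f_data


# Hypothetical per-flip-flop capacitances (not from the thesis).
VDD, F_DATA = 0.9, 100e6
SET = dict(c_clk=4e-15, c_data=6e-15)
DET = dict(c_clk=5.2e-15, c_data=6e-15)  # assumed 1.3x the SETFF clock load


def saving(alpha):
    """Relative power saving of the DETFF model at the same data throughput."""
    p_set = ff_power(alpha, vdd=VDD, f_clk=F_DATA, f_data=F_DATA, **SET)
    p_det = ff_power(alpha, vdd=VDD, f_clk=F_DATA / 2, f_data=F_DATA, **DET)
    return 1 - p_det / p_set


for a in (0.1, 0.25, 0.5, 1.0):
    print(f"alpha = {a:4.2f}  DETFF saving = {saving(a):6.1%}")
```

With these assumed values, the relative saving falls from about 30% at α = 0.1 to about 14% at α = 1. At low α the clock term dominates, and that is the term a DETFF halves.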
6.1 Future Work
The design flow and the automated place-and-route tools should be enhanced to support the design and layout of DETFF implementations in VLSI circuits. This would reduce the area overhead of DETFF designs and improve their performance. It would also encourage further research on DETFF circuits and topologies.
Appendix A
Glossary of Terms
BIST built-in self test
CMOS complementary metal oxide semiconductor
C2MOS clocked complementary metal oxide semiconductor
CPL complementary passgate logic
DET dual-edge triggered
DETFF dual-edge triggered flip-flop
DSP digital signal processing
FA full adder
FIFO first-in-first-out
FIR finite impulse response
IIR infinite impulse response
IC integrated circuit
I/O input and output
LSB least-significant bit
MSB most-significant bit
MOS metal oxide semiconductor
MOSFET metal oxide semiconductor field-effect transistor
NMOS n-channel metal oxide semiconductor
PDP power-delay product, which is also referred to as energy
PDPCQ energy calculated with clock-to-output delay
PDPDQ energy calculated with data-to-output delay
PIPO parallel-in-parallel-out
PISO parallel-in-serial-out
PMOS p-channel metal oxide semiconductor
RTL register transfer language
SAFF sense amplifier flip-flop
SDFF semi-dynamic flip-flop
SET single-edge triggered
SETFF single-edge triggered flip-flop
SIPO serial-in-parallel-out
SISO serial-in-serial-out
SOC system-on-a-chip
TSPC true single phase clock
VHDL VHSIC hardware description language
VHSIC very high speed integrated circuit
VLSI very large scale integration
ULSI ultra large scale integration
Bibliography
[1] W.M. Chung, T. Lo, and M. Sachdev, “A comparative analysis of low-power low-voltage dual-edge-triggered flip-flops,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2001 (accepted).
[2] A.G.M. Strollo and D. De Caro, “Low Power Flip-flop with Clock Gating on Master
and Slave Latches,” Electronics Letters, vol. 31, no. 4, pp. 294–295, February 2000.
[3] A. Gago, R. Escano, and J.A. Hidalgo, “Reduced Implementation of D-type DET
Flip-Flops,” IEEE J. of Solid-State Circuits, pp. 400–442, March 1993.
[4] R.P. Llopis and M. Sachdev, “Low Power, Testable Dual Edge Triggered Flip-Flops,”
1996 International Symposium on Low Power Electronics and Design, pp. 341–345,
1996.
[5] M. Pedram, Q. Wu, and X. Wu, “A New Design of Double Edge Triggered Flip-Flops,”
1998 Proceedings of the Asian and South Pacific Design Automation Conference (ASP-
DAC ’98), pp. 417–421, 1998.
[6] A.G.M. Strollo, E. Napoli, and C. Cimino, “Low Power Double Edge-Triggered Flip-
Flop Using One Latch,” Electronics Letters, vol. 35, no. 3, pp. 187–188, 1999.
[7] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, and M. Sachdev, “Comparative Delay
and Energy of Single Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops for
High-Performance Microprocessors,” 2001 Symposium on VLSI Circuits, Digest of
Technical Papers, pp. 217–218, 2001.
[8] T.A. Johnson and I.S. Kourtev, “A Single Latch, High Speed Double-Edge Triggered
Flip-Flop (DETFF),” The 8th IEEE International Conference on Electronics, Circuits
and Systems, 2001, vol. 1, pp. 189–192, 2001.
[9] L. Wanhammar, DSP Integrated Circuits, Academic Press, San Diego, CA, USA,
1999.
[10] A.P. Chandrakasan and R.W. Brodersen, Low Power Digital CMOS Design, Kluwer
Academic Publishers, Norwell, Massachusetts, USA, 1995.
[11] E. Sanchez-Sinencio and A.G. Andreou, Eds., Low-Voltage/Low-Power Integrated
Circuits and Systems: low-voltage mixed-signal circuits, IEEE Press, New York, New
York, USA, 1999.
[12] K. Kawamoto, J. Koomey, B. Nordman, R. Brown, M.A. Piette, and
A. Meier, “Electricity Used by Office Equipment and Network Equipment in the U.S.,”
2000 Proceedings of the ACEEE Summer Study on Energy Efficiency in Buildings.
Asilomar, CA., p. Panel 7, August 2000.
[13] Energy Information Administration, Office of Energy Markets and End Use, U.S. Department of Energy, Washington, “A Look at Residential Energy Consumption in 1997,” November 1999.
[14] N. Nedovic, M. Aleksic, and V.G. Oklobdzija, “Timing Characterization of Dual-Edge
Triggered Flip-flops,” 2001 Proceedings of International Conference on Computer
Design, pp. 538–541, 2001.
[15] J.M. Rabaey and M. Pedram, Eds., Low Power Design Methodologies, Kluwer Aca-
demic, Norwell, Massachusetts, USA, 2000.
[16] A. Chandrakasan, W.J. Bowhill, and F. Fox, Eds., Design of high-performance mi-
croprocessor circuits, IEEE Press, Piscataway, NJ, USA, 2001.
[17] H. Partovi, R. Burd, U. Salim, F. Weber, L. DiGregorio, and D. Draper, “Flow-
through latch and edge-triggered flip-flop hybrid elements,” 1996 IEEE International
Solid-State Circuits Conference, Digest of Technical Papers, pp. 138–139, 1996.
[18] F. Klass, “Semi-Dynamic and Dynamic Flip-flops with Embedded Logic,” 1998
Symposium on VLSI Circuits, Digest of Technical Papers, pp. 108–109, 1998.
[19] F. Klass, C. Amir, A. Das, K. Aingaran, C. Truong, R. Wang, A. Mehta, R. Heald,
and G. Yee, “A New Family of Semidynamic and Dynamic Flip-flops with Embedded
Logic for High-Performance Processors,” IEEE Journal of Solid-State Circuits, vol.
34, no. 5, pp. 712–716, May 1999.
[20] B. Kong, S. Kim, and Y. Jun, “Conditional-Capture Flip-Flop for Statistical Power
Reduction,” IEEE Journal of Solid-State Circuits, vol. 36, no. 8, pp. 1263–1271, 2001.
[21] W.M. Chung and M. Sachdev, “A comparative analysis of dual edge triggered flip-
flops,” 2000 Canadian Conference on Electrical and Computer Engineering, vol. 1,
pp. 564–568, 2000.
[22] R. Hossain, L.D. Wronski, and A. Albicki, “Low Power Design Using Double Edge
Triggered Flip-Flops,” IEEE Trans. on VLSI Systems, pp. 261–265, June 1994.
[23] D.M. Brooks, P. Bose, S.E. Schuster, H. Jacobson, P.N. Kudva, A. Buyuktosunoglu,
J. Wellman, V. Zyuban, M. Gupta, and P.W. Cook, “Power-aware microarchitecture:
design and modelling challenges for next generation microprocessors,” IEEE Micro,
vol. 20, no. 6, pp. 26–44, November–December 2000.
[24] S.J. Abou-Samra and A. Guyot, “Performance/Complexity Space Exploration: Bulk
vs. SOI,” PATMOS ’98, International workshop – Power And Timing Modeling,
Optimization and Simulation, October 7-9, 1998.
[25] V. Stojanovic and V.G. Oklobdzija, “Comparative Analysis of Master-Slave Latches
and Flip-Flops for High-Performance and Low-Power Systems,” IEEE J. of Solid-State
Circuits, vol. 34, no. 4, pp. 536–548, April 1999.
[26] A.G.M. Strollo, E. Napoli, and C. Cimino, “Analysis of Power Dissipation in Dou-
ble Edge-Triggered Flip-flops,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 8, no. 5, pp. 624–629, October 2000.
[27] S.L. Lu and M. Ercegovac, “A Novel CMOS Implementation of Double-Edge-Triggered
Flip-Flops,” IEEE J. of Solid-State Circuits, pp. 1008–1010, August 1990.
[28] N. Shastry, “Tutorial on design for testability,” 1992 Proceedings of Fifth Annual
IEEE International ASIC Conference and Exhibit, pp. 139–142, 1992.
[29] R.D. Hess, “Considerations in selecting a design-for-testability technique,” 1988 IEEE
Region 5 Conference: Spanning the Peaks of Electrotechnology, pp. 157–160, 1988.
[30] Audiotech Healthcare Corporation, “Hearing aid types for all levels of hearing impairment,” http://www.hearingcenteronline.com/hearaid.shtml, 2000.
[31] Dr G K Hebbar’s Micro Ear Surgery & E.N.T Endoscopy Centre, “Types of hearing aids,” http://entcentre.faithweb.com/faqs/HearingAids_BroadPerspectives/type_of_hearing_aids.htm, 2000.
[32] A.V. Oppenheim, R.W. Schafer, and J.R. Buck, Discrete-time Signal Processing,
Prentice Hall, Upper Saddle River, NJ, USA, second edition, 1999.
[33] P.P. Vaidyanathan, “Multirate digital filters, filter banks, polyphase networks, and
application: A tutorial,” Proceedings of the IEEE, vol. 78, no. 1, pp. 56–92, January
1990.
[34] V. Cappellini, Digital Filters and Their Applications, Prentice Hall, New York, USA,
1978.
[35] A.N. Willson, Jr. and H.J. Orchard, “A design method for half-band FIR filters,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 46, no. 1, pp. 95–101, January 1999.
[36] G. Bi and E.V. Jones, “High-performance bit-serial adders and multipliers,” IEEE
Proceedings-G, vol. 139, no. 1, pp. 109–113, February 1992.
[37] S.G. Smith, “Serial/Parallel Automultiplier,” Electronics Letters, vol. 23, no. 8, pp.
413–414, April 1987.