[ieee 2013 23rd international workshop on power and timing modeling, optimization and simulation...

Design of Variable Latency Adder Based On Present and Transitional States Prediction

Xinghua Yang*, Fei Qiao§, Chang Liu†, Huazhong Yang‡

Institute of Circuits and Systems, Dept. of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology,

Tsinghua University, Beijing, P.R. China {*yang-xh11, †changliu11}@mails.tsinghua.edu.cn

{§qiaofei, ‡yanghz}@tsinghua.edu.cn

Abstract—A novel circuit architecture for a variable latency adder based on the present and transitional states prediction (PTSP) method is presented in this paper to exploit the low-power benefits of voltage-over-scaling. With the scaling down of CMOS technology, failures from process variation and high power consumption have become severe problems in VLSI design, and the traditional conservative methodology is about to reach its limit. Adaptive clocking has proved promising for addressing these two issues jointly. Previous works have focused on two-stage or multi-stage prediction of the present input data with error recovery but ignored data correlation, which can result in many redundant cycles. In this work, along with the present data, the sequence dependence between successive data is also introduced into function speculation and realized by a simple feedback strategy. Analytical energy saving and performance models have been derived and validated by Hspice simulation with 65nm CMOS technology: compared with a conventional adaptive clocking adder, redundant cycles are reduced by up to 16% and the maximum energy saving is 15%, with 3% area overhead. Furthermore, the new adder with PTSP is applied to approximate computation and achieves a reduction in error deviation of up to 50% in an accumulator.

Keywords—Voltage-over-Scaling; Adaptive Clocking; Sequence Dependence; Energy Efficient

I. INTRODUCTION

Process variation will cause more failures in digital circuits as CMOS technology scales into the nanometer regime. Conservative design methodology, such as scaling up the operating voltage to avoid errors, faces a dilemma since it leads to higher power consumption, which is another severe problem in VLSI design at present. Hence, it is becoming both challenging and indispensable to address variation awareness and low energy jointly [1].

Many approaches have been investigated that consider both failure from process variation and energy consumption. The basic idea of traditional approaches is to optimize power/energy while ensuring that no computation error is incurred. Following this principle, several methodologies have been proposed, such as dual/multiple Vth, gate sizing and adaptive body biasing [2, 3]. Among these, Razor [4] seems the closest to ideal in theory since it is implemented with dynamic detection and correction, which makes it optimal for DVS under process variation. But as pointed out in [5], Razor faces great design complexity and poor portability in practical applications, problems that will get worse as the feature size of CMOS technology continues to decrease.

As an alternative, variable latency adder design based on speculation is attracting interest. Elastic Clocking based on Input Prediction (ECIP) with Voltage-over-Scaling (VoS) [1, 6] has proved to be a strong approach for low-power and variation-aware design. Liu [7] improved this work with error detection and recovery mechanisms and proposed a multi-stage function speculation structure for the adder. In [8], Barrio applied a multispeculative methodology to high-level datapath synthesis and eliminated certain penalty cycles caused by wrong predictions.

However, the previous works, which apply extra cycles to ensure always-correct computation, have two drawbacks: (a) The prediction method or speculative function is based only on the current input data and ignores the previous carry output bit, which we call the transitional state. As data correlation plays an important role in practical computation, some of the extra cycles applied indiscriminately to every computation with a predicted critical path are wasted. This defect is especially serious in domains such as sensor networks, where the data to be processed may change slowly. (b) The prediction block in ECIP has to use more than five bits to keep the activation probability small enough, which is not appropriate for arithmetic units whose bit length is below eight.

The novelty of this work is a modified circuit architecture based on the Present and Transitional States Prediction (PTSP) method. The effect of sequence dependence between successive data is introduced into our prediction method and realized by simple logic at the circuit level, resulting in the following advantages: (1) elimination of many redundant cycles due to data correlation, and lower energy consumption, with 3% area overhead; (2) high throughput even when only two bits in the arithmetic unit are used for prediction, which resolves the contradiction between low activation probability and long latency path in the previous design; (3) a better circuit architecture for reducing the error deviation in approximate computation.

978-1-4799-1170-7/13/$31.00 ©2013 IEEE

Fig. 1. 32-Bits Ripple Carry Adder with Adaptive Clocking in ECIP [6].

Fig. 2. Sequence Dependence In Calculation [9].

The remainder of this paper is organized as follows. Previous related works and detailed analysis of their circuit technique are described in Section II. The proposed PTSP circuit architecture is shown in Section III, as well as the analysis of its energy and performance. Section IV presents the simulation results using Hspice. Finally, conclusions are drawn in Section V.

II. ANALYSIS OF RELATED WORK

In this section, we first point out the limitations of previous circuit architectures for variable latency adders. The architecture of ECIP [6] is selected as the object of analysis since it has the most representative structure. We then illustrate the effect of sequence dependence among input data and show that using the transitional state to predict the activation of the critical path largely makes up for the deficiency in ECIP.

A. Preliminary Analysis of ECIP

The basic assumption of the circuit technique in ECIP is similar to that of Razor, as both rely on the observations that (a) the critical path of a logic block is activated only by certain patterns of input data; (b) the probability of this activation should be quite small in order to avoid a large performance penalty. Unlike Razor, ECIP adopts adaptive clocking with VoS by predicting from the current input data, making it far simpler than Razor as there is no feedback control logic, and the design can be easily extended to multi-stage prediction. To satisfy the assumption of small activation probability, more than five bits in the middle position of the input vector are used to make the prediction, as shown in Figure 1. Consider the two vectors A31…0 and B31…0 and define C to be the set of computations that take two cycles; C is given by (1).

$$C = \{\,\text{2-cycle computation} \mid (A_{13} \oplus B_{13}) \cap \dots \cap (A_{17} \oplus B_{17}) = 1\,\} \tag{1}$$

That is, if the two bits at the same position of the input vectors differ, and the same holds for each of the next four successive positions, the system should provide two cycles for the computation under VoS. The stretched cycle can be realized by clock gating at the circuit level. The advantage of ECIP is that the operating frequency of the design remains unchanged under VoS, but the voltage cannot decrease below the point where the delay from FA0 to FA17 or from FA13 to FA31, named the short latency path (SLP), exceeds one operating cycle. However, there are two limitations in ECIP:

1) Contradiction between Short Latency Path and Low Activation Probability of Critical Path. From Figure 1, it can be seen that the extent to which the supply voltage can be lowered depends only on the length of the SLP, which means that the margin for lowering the voltage shrinks as more bits in the middle position are used for prediction. One-bit prediction would seem best since it allows the largest voltage scaling, but it is unacceptable because it violates the assumption of low activation probability. In ECIP, to avoid a performance penalty, at least five bits are used as input to the prediction block, giving an activation probability of nearly 3%. This contradiction between short latency path and low activation probability of the critical path gets worse when the length of the vectors to be processed is reduced; for example, in an 8-bit Ripple Carry Adder (RCA), the energy saving from VoS is weak if five or six bits are used for prediction.

2) Redundant Cycles for Slowly Changing Input Data. In some practical applications, such as sensor networks that process temperature or humidity data, the input vectors change slowly. If this correlation in the data is ignored, two cycles are assigned without discrimination, resulting in a large number of redundant cycles, which reduces the throughput of the system or increases the energy consumed for a given computational load. Another kind of correlation in sequential calculation is illustrated in the next subsection; it can also be exploited to eliminate some redundant cycles from the original design.
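The ECIP prediction condition in (1) and the trade-off described in the first limitation can be illustrated with a minimal Python sketch. The function names (`ecip_two_cycles`, `activation_probability`) are hypothetical and not taken from [6]; the sketch assumes uniformly distributed, independent input bits and simply enumerates every pattern of the prediction window.

```python
from itertools import product


def ecip_two_cycles(a: int, b: int, lo: int = 13, hi: int = 17) -> bool:
    """Return True when every bit position lo..hi satisfies A_i xor B_i = 1.

    This is the condition of Equation (1); lo and hi default to the
    window FA13..FA17 of the 32-bit adder in Figure 1.
    """
    width = hi - lo + 1
    window_mask = (1 << width) - 1
    return (((a ^ b) >> lo) & window_mask) == window_mask


def activation_probability(num_bits: int) -> float:
    """Fraction of all window patterns that trigger a 2-cycle computation.

    Assumes uniform, independent input bits (an assumption of this sketch).
    """
    hits = 0
    total = 0
    for a_bits, b_bits in product(range(1 << num_bits), repeat=2):
        total += 1
        if ecip_two_cycles(a_bits, b_bits, lo=0, hi=num_bits - 1):
            hits += 1
    return hits / total


for bits in range(1, 7):
    print(bits, activation_probability(bits))
```

Under these assumptions the enumeration gives 2^-α for an α-bit window, so five bits give 1/32, in line with the "nearly 3%" figure quoted above, while a one-bit window would stretch half of all computations to two cycles.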

B. Sequence Dependence In Calculation

Even if we do not consider the effect of slowly changing data in practical applications, some computations do not need two cycles even though the critical path is activated. This is attributed to the sequence dependence in calculation, which is explained in [9]. As shown in Figure 2, in the second addition, the pattern of the input data activates the critical path in ECIP, and one extra cycle is applied to this calculation under the original strategy. But in actual processing, the delay of the second addition is less than one cycle since the previous carry output bits happen to equal the steady values of the second addition, which means that the extra cycle is wasted. This kind of redundancy becomes very large when the effect of slowly changing data is taken into account, as the correlation in such data is more significant. Thus, the previous carry out bit, or in other words the transitional state, should be used together with the current input data for prediction, as we propose in the next section. We will show that with the improved prediction strategy and the correspondingly modified circuit architecture, many redundant cycles are eliminated, which leads to better energy and performance efficiency. Furthermore, the contradiction in ECIP between short latency path and low activation probability of the critical path is also resolved.

(a) Logic circuit for improved prediction method

(b) Sequence circuit for clock-gating.

Fig. 3. Proposed Circuit Architecture based on PTSP method.

III. NEW ARCHITECTURE WITH PTSP METHOD

In this section, we first describe the proposed modified circuit architecture based on the present and transitional states prediction (PTSP) method and its timing analysis, and then present the theoretical model of the energy consumption and performance of our design. Finally, we apply our adder to an accumulator for approximate computation and obtain a much smaller error deviation than the previous design.

A. PTSP Circuit Architecture

We propose a new prediction method that considers both the current input data and the transitional state of the carry output bit, realized with a little extra logic as shown in Figure 3(a). The carry out bit from FA12 to FA13 is preserved and fed back to the input register by a D flip-flop, and is then compared with the current value. The compare signal (CompSignal) is logic “1” when they are not equal and logic “0” otherwise. The bits in the middle position of the input vectors, along with CompSignal, are fed into the Prediction Block (PB) to produce the Enable signal for clock gating and the Carry_sel signal that selects the Carry_bit from FA12 to FA13, as shown in Figure 3(b). The functional description and timing analysis are as follows:

1) When (A13 ⨁ B13) ⋂…⋂ (A17 ⨁ B17) = 0 for the current input vectors, meaning that the critical path is not activated, Carry_bit selects Cout_Present and the Enable signal from the PB stays at logic “1”, indicating that one cycle is assigned to the current computation, as shown in Figure 4(a).

2) When (A13 ⨁ B13) ⋂…⋂ (A17 ⨁ B17) = 1 for the current input, Carry_bit selects Cout_Preserved and the whole computation can be executed in parallel in the first cycle. Before the next cycle arrives, the correct values of Cout_Present and CompSignal become steady; the PB then sets the Enable signal to logic “1” if CompSignal is low, which means that the prediction of the carry out bit in the first cycle was right. Otherwise, the Enable signal is pulled down to logic “0” so that the register is held for one more cycle, during which Carry_bit selects Cout_Present, and the whole computation completes within these two cycles, as shown in Figure 4(b) and (c).

(a) (A13 ⨁ B13) ⋂…⋂ (A17 ⨁ B17) = 0 for every computation.

(b) (A13 ⨁ B13) ⋂…⋂ (A17 ⨁ B17) = 1 with sequence dependence.

(c) (A13 ⨁ B13) ⋂…⋂ (A17 ⨁ B17) = 1 with slow changed input data.

Fig. 4. Timing Analysis Diagram of PTSP circuit.

In Figure 4(b), every computation satisfies (A13 ⨁ B13) ⋂…⋂ (A17 ⨁ B17) = 1, but there is no need to assign two cycles to the critical path after the third addition because the previous carry out bit and the current one happen to be equal, as described in Section II and achieved here by simple feedback logic in a practical circuit design. Figure 4(c) shows that a large number of redundant cycles for the critical path are eliminated when the input data changes slowly, a problem left unresolved in ECIP and other previous works. Notably, with PTSP, two bits in the middle position, rather than five or more, are enough to make the prediction with little performance penalty, which is quite useful when small vectors are processed. We demonstrate this advantage in the following part. The generation of the Carry_sel signal is critical to the correct function of our design; the detailed implementation is shown in Figure 5.
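The two cases above can be summarized in a small behavioral model that counts cycles per addition. This is only a sketch of the decision rule, not the circuit of Figure 3: the helper names are hypothetical, and it assumes that the D flip-flop is refreshed with the carry from FA12 after every addition and that it resets to 0.

```python
def carry_into(a: int, b: int, bit: int) -> int:
    """Carry entering position `bit` of a + b (the carry out of bit - 1)."""
    mask = (1 << bit) - 1
    return ((a & mask) + (b & mask)) >> bit


def window_propagates(a: int, b: int, lo: int, hi: int) -> bool:
    """True when A_i xor B_i = 1 for every i in lo..hi (critical path active)."""
    width = hi - lo + 1
    window_mask = (1 << width) - 1
    return (((a ^ b) >> lo) & window_mask) == window_mask


def ecip_cycles(pairs, lo=13, hi=17):
    """ECIP rule: two cycles whenever the prediction window propagates."""
    return [2 if window_propagates(a, b, lo, hi) else 1 for a, b in pairs]


def ptsp_cycles(pairs, lo=13, hi=17, reset_carry=0):
    """PTSP rule: two cycles only if the window propagates AND the preserved
    carry differs from the present one (CompSignal = 1).

    Assumption of this sketch: the preserved carry is updated after every
    addition and starts at `reset_carry`.
    """
    preserved = reset_carry
    cycles = []
    for a, b in pairs:
        present = carry_into(a, b, lo)
        comp_signal = present != preserved
        activated = window_propagates(a, b, lo, hi)
        cycles.append(2 if (activated and comp_signal) else 1)
        preserved = present
    return cycles


# Three additions that all activate the critical path of the 32-bit adder.
pairs = [
    (0b11111 << 13, 0),             # carry into FA13 is 0, matches reset value
    ((0b11111 << 13) | 0x1FFF, 1),  # carry into FA13 becomes 1: misprediction
    ((0b11111 << 13) | 0x1FFF, 1),  # same carry as before: prediction is right
]
print(ecip_cycles(pairs))  # [2, 2, 2]
print(ptsp_cycles(pairs))  # [1, 2, 1]
```

In this example ECIP spends six cycles while the PTSP rule spends four, because an extra cycle is only paid when the preserved carry turns out to be wrong.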


Fig. 5. Circuit implementation for generating Carry_sel signal

B. Energy and Performance Model

1) Preliminary Definition and Assumption: before comparing the energy efficiency of our work and ECIP, some basic parameters and assumptions should be defined for modeling the energy consumption and performance of the design:

• Ω: the number of data to be processed, following the uniform distribution over [0, 2^n - 1], where “n” is the bit length of the RCA.

• β, σ: the parameters that model the degree of variability in the input vector. Every datum in Ω is duplicated β times consecutively, following a Gaussian distribution with σ as the variance and the datum itself as the mean. Here, σ is fixed and assumed to be 3 in the following model derivation.

• PE-α: the activation probability of the critical path in ECIP with α bits for the prediction block.

• PY-γ: the activation probability of the critical path in our design with γ bits for the prediction block, not including the feedback bit.

• E1-cycle, E2-cycle: energy consumption of a 1-cycle and a 2-cycle calculation.
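One possible reading of the Ω, β and σ parameters is sketched below in Python. The function name `make_inputs`, the clipping to the adder range and the rounding are assumptions of this sketch; the paper does not specify how the samples are quantized. Note that `random.gauss` expects a standard deviation, so the variance is converted first.

```python
import math
import random


def make_inputs(omega: int, beta: int, variance: float = 3.0,
                n_bits: int = 16, seed: int = 0) -> list:
    """Generate omega base values, each followed by beta correlated samples.

    Base values are uniform over [0, 2^n_bits - 1]; each is expanded into
    beta Gaussian samples centred on it with the given variance.
    Rounding and clipping to the adder range are assumptions of this sketch.
    """
    rng = random.Random(seed)
    top = (1 << n_bits) - 1
    sigma = math.sqrt(variance)  # random.gauss takes a standard deviation
    data = []
    for _ in range(omega):
        base = rng.randint(0, top)
        for _ in range(beta):
            sample = round(rng.gauss(base, sigma))
            data.append(min(max(sample, 0), top))
    return data


stream = make_inputs(omega=5, beta=4)
print(len(stream))  # 20 values: omega * beta
```

A larger β produces longer runs of nearly identical operands, which is the slowly changing input case the model is meant to capture.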

Two assumptions are made in our energy saving model. First, E2-cycle is twice E1-cycle since one more cycle is applied. Second, as we add only a little extra logic in our design, we assume that the energy consumption of a 1-cycle computation is approximately equal in our design and in ECIP, and likewise for a 2-cycle computation. These assumptions are reasonable and are checked in the simulation section later.

2) Analysis of Energy and Performance Efficiency: with the parameters defined above, the total amount of processed data is Ω⋅β, where β can take values from {1, 2, 3, …}. In ECIP, the amount of computation that needs two cycles is Ω⋅β⋅PE-α (PE-α = 2^(-α)). Thus, the energy in ECIP (EE) is:

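$$E_E = (\Omega\cdot\beta - \Omega\cdot\beta\cdot P_{E\text{-}\alpha})\cdot E_{1\text{-cycle}} + \Omega\cdot\beta\cdot P_{E\text{-}\alpha}\cdot E_{2\text{-cycle}} \tag{2}$$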

In our work, with the feedback strategy shown in Figure 3(a), the wasted cycles can be eliminated from the original design by taking advantage of the correlation in the data. We simulated our circuit architecture in Matlab with 10000 random input vectors following the uniform distribution. The result shows that half of the computations with a predicted critical path can be completed in only one cycle due to the sequence dependence explained in Figure 4(b). Accordingly, we assume that the activation probability of the critical path in our design, defined as PY-γ, is half of PE-α in ECIP when α=γ, i.e. PY-γ = 2^(-(γ+1)). Moreover, the parameter σ is supposed to be very small to model practical input data. Thus, the amount of computation with two cycles in our design stays approximately constant at Ω⋅PY-γ because the feedback strategy eliminates the redundant cycles among the slowly changing input data. With this analysis, the energy of our design (EY) is given in (3) and the percentage of energy saving (△E) is expressed by (4).

[Plot: x-axis "Activation Probability of Critical Path" (0 to 0.5); y-axis "Energy Saving" (0 to 0.35); curves data1 to data5.]

Fig. 6. Energy saving (△E) with different value of β and probability

$$E_Y = (\Omega\cdot\beta - \Omega\cdot P_{Y\text{-}\gamma})\cdot E_{1\text{-cycle}} + \Omega\cdot P_{Y\text{-}\gamma}\cdot E_{2\text{-cycle}} \tag{3}$$

$$\Delta E = \frac{E_E - E_Y}{E_E} \tag{4}$$
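As a worked step, the first modeling assumption (E2-cycle = 2·E1-cycle) can be substituted into Equations (2) and (3) to obtain a closed form for (4). This is only a rearrangement under that assumption and does not account for the extra feedback logic:

$$E_E = \Omega\,\beta\,(1 + P_{E\text{-}\alpha})\,E_{1\text{-cycle}}, \qquad E_Y = \Omega\,(\beta + P_{Y\text{-}\gamma})\,E_{1\text{-cycle}}$$

$$\Delta E = \frac{E_E - E_Y}{E_E} = \frac{\beta\,P_{E\text{-}\alpha} - P_{Y\text{-}\gamma}}{\beta\,(1 + P_{E\text{-}\alpha})}$$

In this form Ω and E1-cycle cancel, so under the stated assumption the saving depends only on β and the two activation probabilities.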

Based on the above assumptions, we evaluate Equation (4) for different values of β and activation probability, as shown in Figure 6. △E increases as the number of predicted bits gets smaller. The energy saving is about 17% if two bits in each vector are used for prediction when β=4. At the same time, △E increases with larger β, which agrees with our previous analysis since the redundant cycles among slowly changing data are eliminated in our design.
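The analytical model of Equations (2) to (4) can be evaluated numerically with a short Python sketch. The function names are hypothetical; the sketch assumes E2-cycle = 2·E1-cycle, PE-α = 2^(-α) and PY-γ = 2^(-(γ+1)) as stated in the text, and the values chosen for Ω and E1-cycle are arbitrary because they cancel in the ratio.

```python
def energy_ecip(omega, beta, p_e, e1, e2):
    """Equation (2): energy of ECIP."""
    return (omega * beta - omega * beta * p_e) * e1 + omega * beta * p_e * e2


def energy_ptsp(omega, beta, p_y, e1, e2):
    """Equation (3): energy of the PTSP design."""
    return (omega * beta - omega * p_y) * e1 + omega * p_y * e2


def energy_saving(beta, alpha, gamma, omega=10000, e1=1.0):
    """Equation (4), with the model assumptions from the text."""
    e2 = 2.0 * e1                  # assumption: a 2-cycle op costs twice a 1-cycle op
    p_e = 2.0 ** (-alpha)          # ECIP activation probability
    p_y = 2.0 ** (-(gamma + 1))    # PTSP activation probability (half of ECIP's)
    e_e = energy_ecip(omega, beta, p_e, e1, e2)
    e_y = energy_ptsp(omega, beta, p_y, e1, e2)
    return (e_e - e_y) / e_e


# Sweep the number of prediction bits for beta = 4.
for bits in range(1, 6):
    print(bits, round(energy_saving(beta=4, alpha=bits, gamma=bits), 4))
```

With β=4 and two prediction bits the model evaluates to 0.175, which matches the figure of about 17% quoted above.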

As we pointed out in Section II, in order to get a low activation probability of the critical path and avoid a performance penalty, more than five bits have to be used for prediction in ECIP, which is inappropriate for arithmetic units with small bit length. This problem is resolved in our design. Equations (2) and (3) can also be used to model the performance of the adder if the energy parameters (EE, EY, E1-cycle, E2-cycle) are replaced by the cycle-count parameters (CE, CY, C1-cycle, C2-cycle). When β=4, α=5 (PE-α=1/32) and γ=2 (PY-γ=1/8), CE and CY are approximately equal. This result means that two bits from each of the input vectors plus one feedback bit of the carry out signal in our design give the same performance as five bits for prediction in ECIP. The performance penalty is greatly alleviated by our methodology.
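The equality of CE and CY for these parameters can be checked by direct substitution into the cycle-count versions of Equations (2) and (3). The step below assumes C2-cycle = 2·C1-cycle, by analogy with the energy assumption:

$$C_E = \Omega\,\beta\,(1 + P_{E\text{-}\alpha})\,C_{1\text{-cycle}} = 4\left(1 + \tfrac{1}{32}\right)\Omega\,C_{1\text{-cycle}} = 4.125\,\Omega\,C_{1\text{-cycle}}$$

$$C_Y = \Omega\,(\beta + P_{Y\text{-}\gamma})\,C_{1\text{-cycle}} = \left(4 + \tfrac{1}{8}\right)\Omega\,C_{1\text{-cycle}} = 4.125\,\Omega\,C_{1\text{-cycle}}$$

Under this idealized model the two cycle counts coincide, which is the basis for comparing 2-bit PTSP prediction with 5-bit ECIP prediction.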

TABLE I. ENVIRONMENT FOR HSPICE SIMULATION

Parameter                    Value
Nominal Operating Voltage    1.2V @65nm
Temperature                  25℃
Capacitive Load              0.004 pF
Clock Frequency              250MHz
Technology Corner            Typical

C. Decrement in error deviation for approximate computation

Another advantage of the PTSP circuit architecture is that it offers a more efficient mechanism for controlling error deviation in approximate computation. In [10], the whole computation of Motion Estimation in MPEG-2 is divided into significant and insignificant parts. ECIP is applied to the significant computation, ensuring that there are no errors when the system is under VoS. For the insignificant computation, only one cycle is ever applied to the arithmetic unit, whether or not the critical path is activated. However, we find that this indiscriminate one-cycle strategy for insignificant computation may lead to large errors, as the higher bits of the result will be random when a computation on the critical path is executed in only one cycle. In our design, errors still occur since we also apply one cycle to all insignificant computation under VoS, but the difference is that the carry out bit from FA12 to FA13 is substituted by the feedback logic bit when the critical path is activated. In this way, the error deviation is kept at an acceptably low level, which means that more computation can be allocated to the insignificant part and more energy can be saved for the system. The results of the accumulator-based simulation for approximate computation are shown in Section IV.
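The effect of substituting the carry from FA12 to FA13 can be sketched with a simple arithmetic model. This is an idealization, not a timing simulation: it assumes that within one cycle the lower segment (FA0 to FA12) and the upper segment (FA13 to FA31) each settle correctly, and that the only uncertainty is the carry between them. The names `one_cycle_add` and `carry_guess` are hypothetical.

```python
def carry_into(a: int, b: int, bit: int) -> int:
    """True carry entering position `bit` of a + b."""
    mask = (1 << bit) - 1
    return ((a & mask) + (b & mask)) >> bit


def one_cycle_add(a: int, b: int, carry_guess: int,
                  cut: int = 13, width: int = 32) -> int:
    """Add a and b with the carry into position `cut` replaced by carry_guess.

    Models the activated-critical-path case of a one-cycle addition where the
    upper segment uses the preserved (fed back) carry instead of the true one.
    """
    low_mask = (1 << cut) - 1
    low_sum = ((a & low_mask) + (b & low_mask)) & low_mask
    high_sum = (a >> cut) + (b >> cut) + carry_guess
    return ((high_sum << cut) | low_sum) & ((1 << width) - 1)


a, b = 0x1FFF, 1
exact = (a + b) & 0xFFFFFFFF
good = one_cycle_add(a, b, carry_guess=carry_into(a, b, 13))
bad = one_cycle_add(a, b, carry_guess=0)
print(exact - good)  # 0: the preserved carry matched the true carry
print(exact - bad)   # 8192: a wrong guess costs exactly 2**13
```

In this model a wrong preserved carry changes the result by exactly 2^13 (modulo the adder width), and a correct one gives the exact sum, which is one way to see why the deviation stays bounded compared with unpredictable upper bits.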

IV. SIMULATION RESULT

In this section, the environment for our experiment is explained, and then the simulation results are given to verify our analytical energy and performance model by comparing our design with ECIP. Finally, the decrement in error deviation when one cycle is applied to insignificant computation in an accumulator is shown.

A. Experimental Environment and Setup

All the adders in our experiment are implemented in 65nm technology. We used Verilog to design all the adders, and pre-simulation was conducted with ModelSim to verify the correctness of our design. The Verilog code was then synthesized by DesignCompiler, after which the post-gate netlist was transformed by Hsimplus from Synopsys so that the energy consumption of the different designs could be obtained through Hspice simulation with 5000 random input vectors. We compiled all the designs at 250MHz; the other parameters of the simulation environment are shown in Table I.

Fig. 7. Energy saving (△E) when β=4

Fig. 8. Energy saving (△E) with α=γ=2

B. Validation of energy and performance model

Based on Equation (4), the parameter β and the activation probability of the critical path determine the energy and performance model. Figure 7 shows the percentage of energy saving of our design relative to ECIP when β=4 and the number of prediction bits changes from 1 to 5. This result is based on simulating a 16-bit RCA with 5000 random vectors as input data in Hspice. The percentage of energy saving increases as fewer bits are used for prediction. This trend is consistent with our previous analysis, but the absolute value is slightly smaller than the analytical result from Equation (4). This is because E1-cycle and E2-cycle in our design are slightly larger than those of ECIP, as extra logic blocks for the feedback strategy are added in our circuit architecture. The percentage of energy saving also increases when the input data changes slowly, as shown in Figure 8. In this simulation, two bits are taken from the 16-bit RCA for prediction (α=γ=2) and β changes from 1 to 5. The trend of the result verifies our analytical model, and the small decline in absolute value has the same cause as explained above. The comparison of performance penalty is shown in Figure 9, obtained by simulating 5000 random vectors in ModelSim. The result shows that the performance of our design with 2-bit prediction is approximately the same as that of ECIP with 5-bit prediction. This advantage is quite useful when designing small-bit adders, since the operating voltage can be scaled down further in a practical implementation when only two bits are predicted with little performance penalty.


Fig. 9. Cycle consumption for different prediction bits

Fig. 10. Energy per computation for different bit-length

For completeness, the energy per computation of our design for different bit lengths is shown in Figure 10. Two bits in the adder are taken for prediction in each simulation, and the operating voltage is scaled down from 1.2V to 0.7V with no error incurred thanks to the variable latency strategy.

C. Error Deviation in Approximate Computation

In image and audio processing, different computations in the system differ in significance. A certain amount of error in insignificant computation does not degrade the output quality dramatically. Thus, more computation can be allocated to the insignificant processing if the errors caused by the arithmetic units can be kept at a lower level under VoS, resulting in more energy savings. In [10], for the insignificant computation, one cycle is applied to the RCA under VoS whether or not the critical path is activated. This strategy may introduce large errors into the final result, as the higher bits of the addition will be randomly erroneous. In our design, however, the error deviation is kept small since the feedback logic bit substitutes the carry signal from FA12 to FA13 in the middle position when the prediction of the critical path is hit. We used an 8-bit RCA accumulator with the ECIP methodology and with our proposed method to verify the above analysis. The decrement in error deviation of our design is shown in Figure 11. When the operating voltage is scaled down to 5/8 of Vdd, one cycle is only long enough to correctly compute the lower half of the bits in the RCA, and here we obtained about a 50% decrement in error deviation.

Fig. 11. Decrement in error deviation compared with ECIP

V. CONCLUSION

In this paper, we have proposed a novel modified circuit architecture for a variable latency adder based on PTSP. Sequence dependence is considered in our implementation and realized by a simple feedback strategy. We have built an energy and performance model, and the Hspice simulation results show that our design has better energy efficiency and error control with negligible area overhead.

REFERENCES

[1] Ghosh S, Bhunia S, Roy K. A new paradigm for low-power, variation-tolerant circuit synthesis using critical path isolation. Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design. ACM, 2006: 619-624.

[2] Srivastava A, Sylvester D, Blaauw D. Statistical optimization of leakage power considering process variations using dual-Vth and sizing. Proceedings of the 41st annual Design Automation Conference. ACM, 2004: 773-778.

[3] Borkar S, Karnik T, De V. Design and reliability challenges in nanometer technologies. Proceedings of the 41st annual Design Automation Conference. ACM, 2004: 75-75.

[4] Fojtik M, Fick D, Kim Y, et al. Bubble razor: An architecture-independent approach to timing-error detection and correction. Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012: 488-490.

[5] Dreslinski R G, Wieckowski M, Blaauw D, et al. Near-threshold computing: Reclaiming moore's law through energy efficient integrated circuits. Proceedings of the IEEE, 2010, 98(2): 253-266.

[6] Mohapatra D, Karakonstantis G, Roy K. Low-power process-variation tolerant arithmetic units using input-based elastic clocking. Proceedings of the 2007 international symposium on Low power electronics and design. ACM, 2007: 74-79.

[7] Liu Y, Sun Y, Zhu Y, et al. Design methodology of variable latency adders with multistage function speculation. Quality Electronic Design (ISQED), 2010 11th International Symposium on. IEEE, 2010: 824-830.

[8] Del Barrio A A, Hermida R, Memik S O, et al. Multispeculative Addition Applied to Datapath Synthesis. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 2012, 31(12): 1817-1830.

[9] Kim S H, Mukhopadhyay S, Wolf W. Experimental analysis of sequence dependence on energy saving for error tolerant image processing. Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design. ACM, 2009: 347-350.

[10] Mohapatra D, Karakonstantis G, Roy K. Significance driven computation: a voltage-scalable, variation-aware, quality-tuning motion estimator. Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design. ACM, 2009: 195-200.
