

28.3 Conditional Push-Pull Pulsed Latches With 726fJ·ps Energy-Delay Product in 65nm CMOS

Elio Consoli¹, Massimo Alioto²,³, Gaetano Palumbo¹, Jan Rabaey⁴
¹University of Catania, Catania, Italy; ²University of Siena, Siena, Italy; ³University of Michigan, Ann Arbor, MI; ⁴University of California at Berkeley, Berkeley, CA

Flip-flops (FFs) are key building blocks in the design of high-speed energy-efficient microprocessors, as their data-to-output delay (D-Q) and power dissipation strongly affect the processor’s clock period and overall power [1]. Previous analyses [2] showed that the Transmission-Gate Pulsed Latch (TGPL) [3] is the most energy-efficient FF in a large portion of the design space, ranging from high-speed designs (minimizing EDʲ products with j>1) to minimum ED product designs, while simple Master-Slave FFs (TGFF [1] and ACFF [4]) are the most energy efficient. TGPL also has the lowest D-Q delay, along with STFF [5]. However, the latter has considerably worse energy efficiency [2]; hence, TGPL is the best reference for a comparison.

In this work, two new FFs are introduced: the Conditional Push-Pull Pulsed Latch (CP³L) and a version with a shareable Pulse Generator (PG), CSP³L. The adoption of a fast push-pull second stage, which requires a conditional PG, enables 50-to-100% delay improvements compared to TGPL, and an absolute D-Q as low as 0.7FO4. CP³L and CSP³L also exhibit better energy efficiency than TGPL in terms of minimum ED³ and ED products. A test chip was fabricated in 65nm CMOS technology (VDD=1V) to measure the delay and energy consumption of CP³L, CSP³L and TGPL in minimum ED and minimum ED³ sizings. Different loads are used in the minimum ED (16×) and the minimum ED³ (64×) cases.

Figure 28.3.1 shows schematics of the proposed FFs and TGPL [3]. In CP³L, the data-to-output path is split into two parallel paths, which capture rising and falling data transitions. The output inverter is replaced by a push-pull stage, the first stage is split into two half-latches, and the gated keeper is moved to the output.
Since the push-pull stage is prone to current contention, the conditional PG ensures that only one of the signals Sneg and R can be asserted at a time, depending on the previous Q value. The pulsed signals CPr (rising) and CPf (falling) are alternately enabled by pseudo-NOR/NAND gates that are gated by QD, a delayed version of Q. If QD=1 (0), CP³L can change its state if D=0 (D=1), and does not change its state if D=1 (D=0). With QD=1 (0), the pseudo-NOR (pseudo-NAND) does not change CPr (CPf), and Sneg (R) does not make any transition. Conversely, the pseudo-NAND (pseudo-NOR) is enabled and generates a pulse on CPf (CPr). Furthermore, if D remains equal to 1 (0), R (Sneg) does not change. QD has to be sufficiently delayed to prevent the pseudo-NOR (pseudo-NAND) from being enabled when the previous output was equal to 1 (0). Otherwise, an undesired transition on CPr (CPf) would occur, dissipating power. This would not affect correctness, as the input signal on D must remain stable until the end of the transparent window to avoid hold time violations.

In CP³L, the PG cannot be shared among different latches, since the pseudo-NOR/NAND gates are driven by QD, which is unique to each latch. CSP³L solves this by fully integrating the conditional logic in the latch so that the PG can be shared among several FFs. Two transmission gates and a few small keepers have to be added at the two pulsed nodes to achieve the same operation as before.

Figure 28.3.2 shows a block diagram of the delay and energy test circuits. The delay measurement setup is essentially the same as in [6]. Clock-to-data (CK-D), clock-to-output (CK-Q), and D-Q delays are measured as time differences by using a capturing FF, clocked by a signal (CKMS) obtained by propagating a pulse through a programmable Delay Generator (DG), as is also the case for the D and CK inputs provided to the FF under test (DUT) [6]. The average delay step of the DG is equal to 1.8ps (≈FO4/12).
Since the delay measurement is carried out locally by the capturing FF coupled to each DUT, we employed a single DG to test an array of 384 FFs, i.e., 64 DUTs for each topology-sizing combination. The energy measurement approach permits extraction of transient energy (dynamic and short circuit) under different data activities [6].

Figure 28.3.3 shows measured setup and hold curves of the proposed FFs and TGPL. CP³L and CSP³L achieve a minimum D-Q close to 15ps (0.7FO4) and 17.5ps (0.8FO4) in their minimum ED³ (64× load) and ED (16× load) sizings, respectively, whereas TGPL has a D-Q 50% (100%) larger in the minimum ED³ (ED) case. TGPL is known to exhibit better performance than other existing FFs thanks to the small logical effort, small branching effort and low parasitic delay of its data-to-output path [2]. The D-Q delay of the proposed FFs is further reduced, since the Sneg and R nodes have approximately half the load compared with TGPL. Hence, the half-latches are fast and the push-pull stage size can be increased without degrading energy efficiency. In addition, the branching due to fixed-size gates is reduced and the load due to interconnect parasitics is smaller. Although the PGs of all FFs were designed to achieve a 4FO4 pulse width, the flat D-Q region is more pronounced in TGPL than in CP³L and CSP³L.

It may seem that the proposed FFs do not suffer from the normal setup/hold time tradeoff, since when the previous Q value is equal to 1 (0), only a falling (rising) data transition can be captured in the upcoming clock cycle. However, the hold time requirement has to be honored even if the Q value is unchanged. That is, the usual setup/hold time tradeoff is present for CP³L and CSP³L (the critical hold time is the last CK-D leading to finite D-Q).

Figure 28.3.4 shows the measured energy dissipated per clock cycle vs. data activity and the energy/leakage-delay tradeoffs. CP³L and CSP³L dissipate 40-to-60% more energy than TGPL, due to their higher complexity. Nevertheless, the conditional technique reduces energy by avoiding unnecessary internal transitions when the data input does not change. The energy efficiency of CP³L and CSP³L is indeed higher than that of TGPL. In particular, CP³L has a 1.3× (2.3×) better ED (ED³) product than TGPL at 25% data activity, and a 2.7× (5.4×) better I_LEAK·D (I_LEAK·D³) product, with I_LEAK being the average leakage current of the FF (estimated through simulations).

Figure 28.3.5 shows the variability of timing parameters from measurements on 256 FFs taken from four different dies, and leakage variation results from Monte Carlo simulations.
Inspection of Fig. 28.3.5 reveals that CP³L and CSP³L have variability similar to TGPL, except for the slightly worse (<1.4×) values in the D-Q and CK-Q cases. Nevertheless, even including the impact of process variations, CP³L and CSP³L are considerably faster than TGPL.

Figure 28.3.6 summarizes the figures of merit and also includes simulation results for TGFF [1] and STFF [5], and measurements for ACFF [4]. Although they target different applications (TGFF/ACFF), or have much worse energy efficiency (STFF), these FFs are valuable references for comparison because of their small ED product (TGFF/ACFF) and D-Q (STFF). Hence, data for these FFs is included for completeness. The improvements achieved with the proposed FFs are highlighted in the table, which shows that TGPL and STFF have >50% higher D-Q than CP³L and CSP³L. Regarding energy efficiency, CP³L and CSP³L have the best ED³ product (>120% reduction) and also outperform TGPL, TGFF and ACFF in terms of ED product (>20% reduction).

Acknowledgements:
The authors thank STMicroelectronics for chip fabrication.

References:
[1] C. Giacomotto, et al., "The Effect of the System Specification on the Optimal Selection of Clocked Storage Elements," IEEE J. Solid-State Circuits, vol. 42, no. 6, pp. 1392-1404, 2007.
[2] M. Alioto, et al., "Analysis and Comparison in the Energy-Delay-Area Domain of Nanometer CMOS Flip-Flops – Part I-II," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 737-750, 2011.
[3] S. Naffziger, et al., "The Implementation of the Next-Generation 64b Itanium™ Microprocessor," ISSCC Dig. Tech. Papers, pp. 276-504, 2002.
[4] C. Teh, et al., "A 77% Energy-Saving 22-Transistor Single-Phase-Clocking D-Flip-Flop with Adaptive-Coupling Configuration in 40nm CMOS," ISSCC Dig. Tech. Papers, pp. 338-340, 2011.
[5] N. Nedovic, et al., "A Clock Skew Absorbing Flip-Flop," ISSCC Dig. Tech. Papers, pp. 342-497, 2003.
[6] N. Nedovic, et al., "A Test Circuit for Measurement of Clocked Storage Element Characteristics," IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1294-1304, 2004.


Figure 28.3.1: Schematic of CP³L, CSP³L, their operation and TGPL [3].
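
Since the schematic itself is not reproduced here, the conditional operation described in the text can be summarized with a small behavioral sketch. It is only a zero-delay, cycle-level abstraction written for illustration: the names `CycleResult` and `conditional_latch_cycle` are hypothetical, QD is assumed to have fully settled to the previous Q before the clock edge, and no timing (pulse width, setup, hold) or transistor-level behavior is modeled.

```python
from dataclasses import dataclass


@dataclass
class CycleResult:
    cpr_pulse: bool       # pseudo-NOR fired a pulse on CPr
    cpf_pulse: bool       # pseudo-NAND fired a pulse on CPf
    set_asserted: bool    # Sneg made a transition (pull-up of the push-pull stage)
    reset_asserted: bool  # R made a transition (pull-down of the push-pull stage)
    q_next: int           # output value after the clock edge


def conditional_latch_cycle(q_prev: int, d: int) -> CycleResult:
    """One clock edge of the conditional latch, idealized.

    Assumption: QD equals the previous Q (it is 'sufficiently delayed'
    and already settled), so exactly one pulse generator branch is enabled.
    """
    qd = q_prev
    # QD=0 enables the pseudo-NOR (CPr): only a rising D can be captured.
    cpr_pulse = qd == 0
    # QD=1 enables the pseudo-NAND (CPf): only a falling D can be captured.
    cpf_pulse = qd == 1
    # Sneg / R move only if the enabled half-latch sees the opposite data value.
    set_asserted = cpr_pulse and d == 1
    reset_asserted = cpf_pulse and d == 0
    if set_asserted:
        q_next = 1
    elif reset_asserted:
        q_next = 0
    else:
        q_next = q_prev  # the keeper at the output holds the state
    return CycleResult(cpr_pulse, cpf_pulse, set_asserted, reset_asserted, q_next)


if __name__ == "__main__":
    for q_prev in (0, 1):
        for d in (0, 1):
            print(q_prev, d, conditional_latch_cycle(q_prev, d))
```

In this abstraction the push-pull stage never sees both of its inputs asserted in the same cycle, and neither Sneg nor R moves when D equals the stored value, which is the source of the energy saving at low data activity discussed in the text.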

Figure 28.3.2: Block diagram of delay and energy test circuits.

Figure 28.3.3: Setup (D-Q/CK-Q vs. CK-D) and Hold (CK-Q vs. CK-D) time characteristics.

Figure 28.3.4: Transient energy vs. data switching activity and energy/leakage-delay tradeoff.

Figure 28.3.5: Variability of timing parameters and leakage.

Figure 28.3.6: Performance summary and comparison with other FFs.
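
The headline numbers quoted in the text can be cross-checked against each other with a few lines of arithmetic. The sketch below is not taken from the paper: the helper names `in_fo4` and `efficiency_gain` are hypothetical, FO4 is inferred from the stated DG step (1.8ps ≈ FO4/12), and the 40-to-60% energy overhead is assumed to apply to both sizings and to the 25% data-activity point, which the text does not state explicitly.

```python
# FO4 inferred from the delay generator step: 1.8 ps is about FO4/12.
FO4_PS = 12 * 1.8  # about 21.6 ps (an inferred value, not a reported one)


def in_fo4(delay_ps: float) -> float:
    """Express a delay in units of FO4."""
    return delay_ps / FO4_PS


def efficiency_gain(tgpl_delay_ratio: float, energy_overhead: float, j: int) -> float:
    """Factor by which E*D^j of the proposed FF is lower than that of TGPL.

    tgpl_delay_ratio: D(TGPL) / D(proposed), e.g. 1.5 for '50% larger'.
    energy_overhead:  E(proposed) / E(TGPL), e.g. 1.4 for '40% higher'.
    """
    return tgpl_delay_ratio ** j / energy_overhead


if __name__ == "__main__":
    print("15 ps   =", round(in_fo4(15.0), 2), "FO4")
    print("17.5 ps =", round(in_fo4(17.5), 2), "FO4")
    for label, ratio, j in (("ED", 2.0, 1), ("ED^3", 1.5, 3)):
        low = efficiency_gain(ratio, 1.6, j)   # 60% energy overhead
        high = efficiency_gain(ratio, 1.4, j)  # 40% energy overhead
        print(label, "gain between", round(low, 2), "and", round(high, 2))
```

Under these assumptions the ED gain falls between roughly 1.25× and 1.43× and the ED³ gain between roughly 2.1× and 2.4×, which brackets the quoted 1.3× and 2.3×; the inferred FO4 of about 21.6ps likewise agrees with 15ps ≈ 0.7FO4 and 17.5ps ≈ 0.8FO4.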