razor and recycle - computer science and...

32
AMEEN AKEL Razor and ReCycle

Upload: doandat

Post on 30-Apr-2018

222 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

A M E E N A K E L

Razor and ReCycle

Page 2: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor

Page 3: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Motivation

  Power   Today’s designs are extremely power hungry

 Power now a limiting factor

  Performance cannot be sacrificed (overall) to save on power  Both situations must continue improving

  Static Voltage Scaling   Not adaptive enough   Must be conservative estimates

 Wastes power savings for little to no performance benefit

  Make average silicon matter!

Page 4: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Approach

  The Main Idea   Circuit delay is data dependent, so why should designers care

about the conservative case?  Shooting for the “average” case – just like with ReCycle

  Lower the supply voltage (to sub-critical voltages) to reduce power throughout the chip

  What happens if execution encounters a “worst-case” path through a pipeline stage?  Wrong data can be latched and moved to the next stage

  Razor hybrids a few previous designs to solve this.

Page 5: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor Design Goals

  Razor hardware must not interfere with error-free operation of a pipeline   Nearly invisible to the common case

  Razor hardware cannot fail; it must always be correct   Razor hardware must be minimal in both hardware

size and power footprint

Page 6: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Approach

  What’s unique to Razor?   Counterflow pipeline in the synchronous world   Handling metastability   Inducing error-prone state

  What did it inherit?   Delayed latch idea (Triple Latch), but not implementation

(Shadow Latch)   Error correction method (from DIVA)

Page 7: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Approach

  Exploring the Shadow Latch   Method to detect and recover from errors in a minimal number of cycles   Shadow latch is delayed about 50% behind the main flip-flop’s clock in

order to catch any timing errors   A comparator will quickly decide how accurate the data in the flip-flop is

via an XOR gate.   Pipeline stages are designed so that in the absolute worst case, the

shadow latch’s setup time is met.   An encountered error will invalidate any data coming out of the flip-flop

for that cycle

60 Computer

a timing-error detection and recovery mechanismto handle the rare cases that require a higher volt-age. In addition, the system must include a voltagecontrol system capable of responding to the oper-ating condition changes, such as temperature, thatmight require higher or lower voltages.

Detecting circuit timing errors with Razor flip-flops

Figure 4a illustrates a Razor flip-flop for apipeline stage. At the circuit level, a shadow latchaugments each delay-critical flip-flop. A delayedclock controls the shadow latch.

Figure 4b illustrates a Razor flip-flop operation.In clock cycle 1, the combinational logic L1 meetsthe setup time by the clock’s rising edge, and boththe main flip-flop and the shadow latch will latchthe same data. In this case, the error signal at theXOR gate’s output remains low and the pipeline’soperation is unaltered. In cycle 2, the combina-tional logic exceeds the intended delay due to sub-critical voltage operation. In this case, the mainflip-flop does not latch the data; but since theshadow latch operates using a delayed clock, it suc-cessfully latches the data in cycle 3.

To guarantee that the shadow latch will alwayslatch the input data correctly, the allowable oper-ating voltage is constrained at design time such thatunder worst-case conditions, the logic delay doesnot exceed the shadow latch’s setup time. In cycle3, a comparison of the valid shadow latch data withthe main flip-flop data generates an error signal. Incycle 4, the shadow latch’s data moves into the

main flip-flop and becomes available to the nextpipeline stage L2.

If a timing error occurs in a particular clock cycleof pipeline stage L1, the data in L2 in the followingclock cycle is incorrect and must be flushed fromthe pipeline. However, since the shadow latch con-tains stage L1’s correct output data, the pipelinedoes not need to reexecute the instruction throughL1. This is a key feature of Razor: It reexecutes aninstruction failure in one pipeline stage through thefollowing stage, while incurring a one-cycle penalty.The proposed approach therefore guarantees aninstruction’s forward progress and avoids the per-petual reexecution of an instruction at a particularpipeline stage because of timing failure.

Razor flip-flop construction must minimize thepower and delay overhead. The power overhead isinherently low because in most cycles a flip-flop’sinput will not transition; thus, the only power over-head incurred comes from switching the delayedclock. To minimize even this power requirement,Razor inverts the main clock to generate the delayedclock locally, thus reducing its routing capacitance.

Many noncritical flip-flops in a design will notneed Razor technology. For example, if the maxi-mum delay at a flip-flop input is guaranteed to meetthe required cycle time under the worst-case sub-critical voltage setting, it isn’t necessary to replaceit with a Razor flip-flop because it will never needto initiate timing recovery. In the prototype Razorpipeline designed to study this problem, for exam-ple, we found that only 192 of a total of 2,408 flip-flops required Razor.

Razor FF

01

Logic stageL2Main

flip-flop

Shadowlatch

Error_L

ErrorComparator

clk

clk_delayed

Q1D1Logic stageL1

Figure 4. Razor flip-flop for apipeline stage. (a) A shadow latchcontrolled by adelayed clock augments each flip-flop. (b) Razor flip-flop operation with atiming error in cycle2 and recovery incycle 4.

clkclk_delayed

D

Error

Q

Cycle 1 Cycle 2 Cycle 3 Cycle 4

Instr 1 Instr 2

Instr 1 Instr 2

(a)

(b)

Authorized licensed use limited to: U niv of C alif S an D iego. D ownloaded on February 5 , 2009 at 05:14 from IE E E X plore . R estrictions apply.

60 Computer

a timing-error detection and recovery mechanismto handle the rare cases that require a higher volt-age. In addition, the system must include a voltagecontrol system capable of responding to the oper-ating condition changes, such as temperature, thatmight require higher or lower voltages.

Detecting circuit timing errors with Razor flip-flops

Figure 4a illustrates a Razor flip-flop for apipeline stage. At the circuit level, a shadow latchaugments each delay-critical flip-flop. A delayedclock controls the shadow latch.

Figure 4b illustrates a Razor flip-flop operation.In clock cycle 1, the combinational logic L1 meetsthe setup time by the clock’s rising edge, and boththe main flip-flop and the shadow latch will latchthe same data. In this case, the error signal at theXOR gate’s output remains low and the pipeline’soperation is unaltered. In cycle 2, the combina-tional logic exceeds the intended delay due to sub-critical voltage operation. In this case, the mainflip-flop does not latch the data; but since theshadow latch operates using a delayed clock, it suc-cessfully latches the data in cycle 3.

To guarantee that the shadow latch will alwayslatch the input data correctly, the allowable oper-ating voltage is constrained at design time such thatunder worst-case conditions, the logic delay doesnot exceed the shadow latch’s setup time. In cycle3, a comparison of the valid shadow latch data withthe main flip-flop data generates an error signal. Incycle 4, the shadow latch’s data moves into the

main flip-flop and becomes available to the nextpipeline stage L2.

If a timing error occurs in a particular clock cycleof pipeline stage L1, the data in L2 in the followingclock cycle is incorrect and must be flushed fromthe pipeline. However, since the shadow latch con-tains stage L1’s correct output data, the pipelinedoes not need to reexecute the instruction throughL1. This is a key feature of Razor: It reexecutes aninstruction failure in one pipeline stage through thefollowing stage, while incurring a one-cycle penalty.The proposed approach therefore guarantees aninstruction’s forward progress and avoids the per-petual reexecution of an instruction at a particularpipeline stage because of timing failure.

Razor flip-flop construction must minimize thepower and delay overhead. The power overhead isinherently low because in most cycles a flip-flop’sinput will not transition; thus, the only power over-head incurred comes from switching the delayedclock. To minimize even this power requirement,Razor inverts the main clock to generate the delayedclock locally, thus reducing its routing capacitance.

Many noncritical flip-flops in a design will notneed Razor technology. For example, if the maxi-mum delay at a flip-flop input is guaranteed to meetthe required cycle time under the worst-case sub-critical voltage setting, it isn’t necessary to replaceit with a Razor flip-flop because it will never needto initiate timing recovery. In the prototype Razorpipeline designed to study this problem, for exam-ple, we found that only 192 of a total of 2,408 flip-flops required Razor.

Razor FF

01

Logic stageL2Main

flip-flop

Shadowlatch

Error_L

ErrorComparator

clk

clk_delayed

Q1D1Logic stageL1

Figure 4. Razor flip-flop for apipeline stage. (a) A shadow latchcontrolled by adelayed clock augments each flip-flop. (b) Razor flip-flop operation with atiming error in cycle2 and recovery incycle 4.

clkclk_delayed

D

Error

Q

Cycle 1 Cycle 2 Cycle 3 Cycle 4

Instr 1 Instr 2

Instr 1 Instr 2

(a)

(b)

Authorized licensed use limited to: U niv of C alif S an D iego. D ownloaded on February 5 , 2009 at 05:14 from IE E E X plore . R estrictions apply.

Page 8: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Approach

  Metastability – The state in which a signal is neither 0 nor 1. The state usually settles around Vdd/2.   Shadow latch can never be metastable, based upon its timing

constraints.   If flip-flop becomes metastable, the metastability detector can report on

that fact (most of the time).   Small chance that Error can become metastable, which is claimed as

inevitable. In this case, a panic signal is raised and the pipeline is flushed.

flip-flops is restored even when only one of the Razor flip-flops gen-erates an error.

If an error occurs in pipeline stage L1 in a particular clockcycle, the data in L2 in the following clock cycle is incorrect andmust be flushed from the pipeline using one of the pipeline controlmethods described in Section 2.2. However, since the shadow latchcontains the correct output data of pipeline stage L1, the instructiondoes not need to be re-executed through this failing stage. Thus, akey feature of Razor is that if an instruction fails in a particular pipe-line stage it is re-executed through the following pipeline stage,while incurring a one cycle penalty. The proposed approach there-fore guarantees forward progress of a failing instruction, which isessential to avoid the perpetual failure of an instruction at a particu-lar stage in the pipeline.

In addition to invalidating the data in the following pipelinestage, an error must also stall the preceding pipeline stages while theshadow latch data is restored into the main flip-flops. A number ofdifferent methods, such as clock gating or flushing the instruction inthe preceding stages, were examined to accomplish this and are dis-cussed in Section 2.2. The proposed approach also raises a numberof circuit related issues. The Razor flip-flop must be constructedsuch that the power and delay overhead is minimized. Also, the pres-ence of the delayed clock introduces a new short-path constraint inthe design. And finally, allowing the setup time of the main flip-flopto be exceeded raises the possibility of meta-stability. These issuesare discussed in more detail in Section 2.1. In the proposed Razorbased DVS approach, the error signal is used to tune the supply volt-age to its optimal value. In Section 2.3, we therefore discuss differ-ent algorithms to control the supply voltage based on the observederror rate.

In general, maximum power savings is obtained from Razortechnology when it is applied to all parts of a microprocessor design.To accomplish this, we identify three distinct design challenges. Thefirst design challenge, and the focus of this paper, is the detectionand recovery of timing errors in combinational logic containedwithin pipeline datapaths, e.g., adders, shifters, and decode logic.The second design challenge is the application of Razor to on-chipSRAM structures. In SRAM structures, such as register files andcaches, it is necessary to introduce Razor-compatible sense amplifi-ers and support for fast non-speculative stores. The third challenge isthe use of Razor on pipeline control logic to restore correct programexecution in the presence of incorrect control decisions.

For the sake of brevity and clarity, the focus of this paper islimited to the first design challenge, which is the use of Razor oncombinational logic blocks contained within the pipeline datapaths.We therefore apply Razor to a simple embedded processor whichutilizes an in-order pipeline with simple control and small caches. Insuch a processor, control logic and SRAM structures remain error-free, even at the worst-case frequency and voltage and do not requireRazor technology. However, to effectively apply Razor in largemicroprocessor designs with large caches and complex control logic,it will be necessary to apply Razor technology to all parts of thedesign. Therefore, in concert with the effort presented in this paper,we are developing Razor-compatible memory structures based onbit-line sampling and architectural modifications for reduced typi-cal-case latency. For control logic, we are developing techniques tocheckpoint control state to enable control logic recovery. These addi-tional developments will be presented in future reports.

2.1 Circuit-level implementation issuesA key requirement for Razor based DVS is that during error-

free operation, the delay and power overhead due to the error detec-tion and correction circuitry is minimal. Otherwise, the power gainfrom more aggressive voltage scaling is overcome by the poweroverhead due to the presence of the error detection and correctioncircuitry. In addition, the overhead of performing an error correctionmust also be minimized to enable efficient operation at moderate

error rates. A number of methods were applied to reduce the powerand delay overhead of the Razor flip-flop, shown in Figure 1. Themultiplexer at the input the razor flip-flop results in a significantdelay and power overhead, and was therefore moved to the feedbackpath of the master latch of the main flip-flop, as shown in Figure 2.Hence, it introduces only a slight increase in the capacitive loadingof the critical path and has minimal impact on the performance andpower of the design.

The power overhead of Razor is also reduced by the fact that inmost cycles, the input of a flip-flop will not transition and only thepower overhead from switching the delayed clock is incurred. Tofurther minimize this additional clock power, the delayed clock islocally generated, reducing its routing capacitance. If the delayedclock is delayed by half the clock cycle, it can be derived by simplyinverting the main clock. Also, many non-critical flip-flops in thedesign do not need Razor. If the maximum delay at the input of aflip-flop is guaranteed to meet the required cycle time under theworst-case sub-critical voltage, the flip-flop cannot fail and does notneed to be replaced with a Razor flip-flop. It was found that in theprototype Alpha processor only 192 flip-flops out of a total of 2408required Razor, thereby significantly reducing the power overheadof the Razor approach. For this prototype processor, the total poweroverhead in error free operation (due to Razor flip-flops) was foundto be less than 1%, while the delay overhead was negligible.

The use of a delayed clock at the shadow latch raises the possi-bility that a short path in the combinational logic will corrupt thedata in the shadow latch. Figure 3 shows how a short-path allowsdata launched at the start of a cycle to be latched into the shadowlatch, instead of the data launched from the previous cycle. To pre-vent this corruption of the shadow latch data, a minimum-pathlength constraint is added at the input of each Razor flip-flop in thedesign. These minimum-path constraints result in the addition ofbuffers during logic synthesis to slow down fast paths and thereforeintroduce a certain power overhead. Figure 3 shows that the mini-mum-path constraint is equal to the clock delay tdelay plus the holdtime thold of the shadow latch (which is typically a small negativevalue). A large clock delay increases the severity of the short pathconstraint and therefore increases the power overhead due to theneed for additional buffers. On the other hand, a small clock delayreduces the margin between the main flip-flop and the shadow latch,and hence reduces the amount by which the supply voltage can bedropped below the critical supply voltage. The clock delay thereforepresents a trade-off between the power overhead incurred fromshort-path correction and the degree of possible power saving fromsub-critical voltage operation. In the prototype 64-bit Alpha design,the clock delay was set at 1/2 the clock period. This simplified thegeneration of the delayed clock while the short-path constraintscould still be easily met and resulted in a power overhead (due tobuffers) of less than 3%.

In subcritical voltage operation, it is possible that the data at theinput of the main latch transitions at the same time as the clock. Thiscan give rise to meta-stability of the main flip-flop, where the output

Figure 2. Reduced overhead Razor flip-flop and meta-stability detection circuits.

!"#$%

!"#

!"#

!"#$%

& '

())*)$+

!"#$"

!"#$,-"

!"#$,-"$%

!"#$%

.-/012/0%3"3/45,-/-!/*)

())*)$+

670,*85+0/!7

!"#$%

!"#

!"#

!"#$%

& '

())*)$+

!"#$"

!"#$,-"

!"#$,-"$%

!"#$%

.-/012/0%3"3/45,-/-!/*)

())*)$+

670,*85+0/!7

Page 9: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor Approach – Recovery

  Clock Gating

Recovering pipeline state after timing-error detection

A pipeline recovery mechanism guarantees thatany timing failures that do occur will not corrupt theregister and memory state with an incorrect value.We have developed two approaches to recoveringpipeline state.1 The first is a simple method based onclock gating, while the second is a more scalabletechnique based on counterflow pipelining.3

Figure 5 illustrates pipeline recovery using aglobal clock-gating approach. In the event that anystage detects a timing error, pipeline control logicstalls the entire pipeline for one cycle by gating thenext global clock edge. The additional clock periodallows every stage to recompute its result using theRazor shadow latch as input. Consequently, recov-ery logic replaces any previously forwarded errantvalues with the correct value from the shadow latch.

Because all stages reevaluate their result with theRazor shadow latch input, a Razor flip-flop can tol-erate any number of errant values in a single cycleand still guarantee forward progress. If all stagesfail each cycle, the pipeline will continue to run butat half the normal speed.

In aggressively clocked designs, implementingglobal clock gating can significantly impact proces-sor cycle time. Consequently, we have designed andimplemented a fully pipelined recovery mechanismbased on counterflow pipelining techniques. Figure6 illustrates this approach, which places negligibletiming constraints on the baseline pipeline design atthe expense of extending pipeline recovery over afew cycles.

When a Razor flip-flop generates an error signal,pipeline recovery logic must take two specificactions. First, it generates a bubble signal to nullifythe computation in the following stage. This signalindicates to the next and subsequent stages that the

pipeline slot is empty. Second, recovery logic trig-gers the flush train by asserting the ID of the stagegenerating the error signal. In the following cycle,the Razor flip-flop injects the correct value fromthe shadow latch data back into the pipeline, allow-ing the errant instruction to continue with its cor-rect inputs.

Additionally, the flush train begins propagatingthe failing stage’s ID in the opposite direction ofinstructions. At each stage that the active flush trainvisits, a bubble replaces the pipeline stage. Whenthe flush ID reaches the start of the pipeline, theflush control logic restarts the pipeline at theinstruction following the failing instruction.

In the event that multiple stages generate errorsignals in the same cycle, all the stages will initiaterecovery, but only the failing instruction closest tothe end of the pipeline will complete. Later recov-ery sequences will flush earlier ones.

RAZOR PIPELINE PROTOTYPETo obtain a realistic prediction of the power

overhead for detecting and correcting circuit tim-ing errors, we implemented Razor in a simplified64-bit Alpha pipeline design, using TaiwanSemiconductor Manufacturing Co. 0.18-microm-eter technology to produce the layout.1 In additionto gate- and circuit-level power analysis on theerror-detection-and-recovery design, we performedarchitectural simulations to analyze the overallthroughput and power characteristics of Razor-based voltage reduction for different benchmarktest programs. The benchmark studies demon-strated that, on average, Razor reduced simulatedpower consumption by nearly a factor of two—agreater than 40 percent reduction—compared totraditional design-time dynamic voltage scaling anddelay chain-based approaches.

March 2004 61

Time (in cycles)

IF ID EX* MEM* MEMMEM

WB

(b)

IF ID EX MEM WB

IF ID EX WBIF ID EX MEM

STST

ST

STStall

Stall

Razor latch getscorrect EX value

Correct valueprovided to MEM

Inst

ruct

ions

IDIF

EX

PC

Recover Recover Recover Recover

Razo

r FF

Stab

ilize

r FF

Razo

r FF

Razo

r FF

Razo

r FF

(a)

clk

Error Error Error Error

MEM WB(reg/mem)

ST

Figure 5. Pipelinerecovery usingglobal clock gating. (a) Pipelineorganization and (b) pipeline timingfor an erroroccurring in the execute (EX) stage.Asterisks denote a failing stage computation. IF =instruction fetch; ID = instructiondecode; MEM =memory; WB = writeback.

Authorized licensed use limited to: U niv of C alif S an D iego. D ownloaded on February 5 , 2009 at 05:14 from IE E E X plore . R estrictions apply.

Page 10: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor Approach – Recovery

  Clock Gating   Pipeline stalls on any Razor error   Forward progress is guaranteed, as the problematic input is

always available at the previous stage’s Shadow Latch  Only a single cycle stall is required to recompute the next stage’s

value, and the pipeline can continue.   Possible long cycle time

 Cycle time must be long enough so that any stage in the pipeline can deliver a clock gating signal to the rest of the Flip Flops.

Page 11: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor Approach – Recovery

  Counterflow

62 Computer

Power analysisFigure 7 shows the design layout; Table 1 lists

the specifications and test results for error-freeoperation and for error-correction-and-recoveryoverhead. The pipeline consists of instruction fetch,instruction decode, execute, and memory/write-back with 8 Kbytes of both I-cache and D-cache.Performance analysis revealed that only the instruc-tion decode and execute stages were critical at theworst-case voltage and frequency settings, thusrequiring Razor flip-flops for their critical paths.While the overall design included a total of 2,408flip-flops, only 192 of them implemented Razortechnology. The clock for the Razor flip-flops wasdelayed by a half cycle from the system clock.

We performed both gate-level power simulationsand SPICE (simulation program with integratedcircuit emphasis) to evaluate the power overheadof the timing-failure detection and recovery circuit.The total power consumption during error-freeoperation at 200 MHz is 425 mW at 1.8 V.

Table 1 lists two energy consumption values,switching and static, for standard and Razor flip-

flops over one clock cycle in error-free operation.These values reflect whether the latched data ischanging or not changing, respectively. We expecta power overhead of 12.2 mW for inserting delaybuffers to meet short-path constraints, bringing thetotal overhead for the detection and recovery cir-cuitry in error-free operation to 3.1 percent of totalpower consumption.

The energy required to detect a setup violation,generate an error signal, and restore the correctshadow latch data into the main flip-flop was 210femtojoules (10–15) per such event for each Razorflip-flop. The total energy required to perform a sin-gle timing-error detection and recovery event in thepipeline was 189 picojoules (10–12), resulting in anadditional overhead of approximately 1 percentmore total power when operating at a pessimistic10 percent error rate.

Architectural benchmark testsTo further explore the design’s efficiency, we

developed an advanced simulation technique basedon the SimpleScalar architectural tool set. This tech-nique combines function-level architectural simu-lation with detailed SPICE-level circuit simulation,enabling the study of how voltage influences thetiming of the pipeline stage computation.

Table 2 lists simulation results for the SPEC2000benchmarks running on the simulated Razor pro-totype pipeline. Through extensive simulation, weidentified the fixed energy-optimal supply voltagefor each benchmark. This is the single voltage thatresults in the lowest overall energy requirement foreach program. Table 2 also shows the averagepipeline error rate, energy reduction, and pipelinethroughput reduction (instructions per clock) at thefixed energy-optimal voltage. The total energy com-putation includes computation, Razor latch andcheck circuitry, and the total pipeline recoveryenergy incurred when an error is detected.

The Razor latches and error-detection circuitry

Time (in cycles)

(b)

IF ID EX* Bubble MEMFlushEX FlushID FlushIF

WBID EX MEM WB

IF ID EX IDIF ID

STST

IFIF

Razor detects fault,forwards bubble toward WB,initiates flush toward IF

Pipeline flushcompletes

Inst

ruct

ions

IDIF EX

PC

Recover Recover Recover Recover

Razo

r FF

Stab

ilize

r FF

Razo

r FF

Razo

r FF

Razo

r FF

(a)

Error

MEM(read only)

WB(reg/mem)

ST

IF

FlushIDFlush

control

Bubble Error

FlushID

Bubble Error

FlushID

Bubble Error

FlushID

Bubble

Figure 6. Pipelinerecovery using counterflow pipelining. (a)Pipelineorganization and (b) pipeline timingfor an erroroccurring in the execute (EX) stage.Asterisks denote a failing stage computation. IF =instruction fetch; ID = instructiondecode; MEM =memory; WB = writeback.

3.0 mm

3.3 mm

I-cache Register file

WB

D-cache

IF ID EX

MEM

Figure 7. Razor prototype layout. A simplified 64-bitAlpha pipeline consists of instruction fetch,instruction decode,execute, and memory/writebackwith both I- and D-cache.

Authorized licensed use limited to: U niv of C alif S an D iego. D ownloaded on February 5 , 2009 at 05:14 from IE E E X plore . R estrictions apply.

Page 12: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor Approach – Recovery

  Counterflow Pipelining   Uses an asynchronous-like design to propagate errors

backwards   Now the error propagation is also pipelined, which translates

to a minimal effect on the cycle time of each stage.  This translates into a tradeoff between resuming within one cycle

versus a faster cycle time   Error signal travels through each pipelined register until

reaching the PC, which then restarts execution.

Page 13: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor Approach – Dynamic Adjustments

  Focus on a constant error rate (Eref)   Change voltages based upon this measurement

  Pros   Real dynamic changes based on the runtime conditions

  Cons   Voltage regulators are slow   Slow reaction causes overcompensation

flip-flops for their critical paths. Out of a total of 2408 flip-flops inthe design, 192 Razor flip-flops were used. The clock for the Razorflip-flops was delayed by 1/2 the clock cycle from the system clock.

Power analysis was performed on the processor design, usingboth gate level power simulations and SPICE to evaluate the over-head of the error correction and detection circuits. The total powerconsumption during error free operation is expected to be 425 mW at1.8 V at a clock frequency of 200 MHz. The energy consumption ofthe standard and Razor flip-flops over one clock cycle in error freeoperation is listed in Figure 7(a). Two values are shown for each flip-flop, reflecting the cases when the latched data is changing (switch-ing) and is not changing (static). The total power overhead due to theinsertion of delay buffers to meet short-path constraints in the designwas simulated and is expect to be 12.2 mW. The total power over-head due to the presence of the Razor error detection and correctioncircuitry in error-free operation is expected to be 3.1% of the totalpower. The final three rows of the table show the power overheaddue to error detection and recovery. The energy required to detect anerror and restore the correct shadow latch data into the main flip-flopwas 210 fJ per error event for each Razor flip-flop. The total energyto perform a single error detection and correction event in the Alphapipeline was 189 pJ, resulting in a power overhead of approximately1% of total power when operating at a 10% error rate. Note that this

error detection and correction power overhead does not include theoverhead due to re-execution of instructions that were flushed fromthe pipeline. This additional power overhead is accounted for in thearchitectural simulations discussed in Sections 3.4 and 3.5.

3.2 Error rate analysisRazor permits a microprocessor to tolerate circuit timing errors,

thereby permitting operation at a lower voltage at the expense ofdecreased instruction throughput. As an initial step in gauging thebenefits of Razor technology, we empirically examined the error rateof an 18x18-bit multiplier block contained within a high-densityFPGA. In addition, we used SPICE-level models to measure theerror rates of an adder over a range of voltages and workloads.

FPGA-based analysis. The multiplier experiments were per-formed using a Xilinx XC2V250-F456-5 FPGA [25]. This part wasselected because it contains full-custom 18x18-bit multiplier blocks,which permit the measurement of error rates for a multiplier withminimal impact due to the overhead of the FPGA routing fabric. Fig-ure 8 illustrates the multiplier circuit under test (shaded in the sche-matic) and accompanying test harness. The multiplier circuitimplements an 18-bit by 18-bit multiplier, producing a 36-bit resulteach clock cycle. During placement, synthesis was directed to fore-most optimize the performance of the fast multiplier pipeline. Theresulting placement is fairly efficient with the Xilinx static timing

Figure 6. Supply Voltage Control System

E ref

V oltage

C ontrolF unction

!...

P ipe line

reset

V dd

E diff = E ref - E sample

-

E sample

panic

V oltageR egulator

E diff erro

rs

ign

alsE ref

V oltage

C ontrolF unction

!!...

P ipe line

reset

V dd

E diff = E ref - E sample

-

E sample

panic

V oltageR egulator

E diff erro

rs

ign

als

Figure 7. Razor prototype implementation details and die photo.

(a) (b)

Technology node 0.18 µm

Voltage range 1.8 V to 1.2 V

Total number of logic gates 45,661

D-cache size 8 KBytes

I-cache size 8 KBytes

Die size 3 x 3.3 mm

Clock frequency 200 MHz

Clock delay 2.5 nS

Total number of flip-flops 2408

Number of Razor flip-flops 192

Total number of delay buffers 2498

Error free operation

Total power 425 mW

Standard FF energy (switching/static) 49 fJ / 95 fJ

Razor FF energy (switching/static) 60 fJ / 160 fJ

Total delay buffer power overhead 12.2 mW

% total power overhead 3.1%

Error correction and recovery overhead

Energy per Razor FF per error event 210 fJ

Total energy per error event 189 pJ

Razor FF recovery overhead at 10% error rate 1%

D -C a che

IF ID E X

ME

M

W B

R e giste r F ileI-C a che

3 .3 m m

3 m m

D -C a che

IF ID E X

ME

M

W B

R e giste r F ileI-C a che

3 .3 m m

3 m m

Page 14: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Simulations/Data

  Alpha-64 Simulation   Parameters:

  In-order pipeline  8 KB I/D Caches  192/2408 flip-flops were augmented with a shadow latch.

  Important results:  3.1% total power overhead for Razor parts  1% of total power for recovery overhead

Page 15: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Simulations/Data

  FPGA Multiplier Simulation

instructions using the adder (e.g., adds, subtracts, loads, stores)recorded their inputs. The benchmark samples are taken in programexecution order starting at the SimPoint point of the execution, asspecified by Sherwood et al. [18].

As shown in Figure 10, the random input, like for the multi-plier, demonstrates a gradual rise in the error rate with decreasingvoltage. We see a similar trend for the benchmark samples analyzed.The error rates for the real program samples increase even moreslowly at first than the random sample sequence. For instance, theammp benchmark experiences very few errors until 1.05 V, and bzipdoes not generate any substantial error rates until 1.2 V. With realprogram samples, the error rate tends to rise faster once errors dotake hold, even performing slightly worse than the random sequenceat lower voltages. However, at error rates that we would expect to beeasily tolerated (e.g., below 5%), the real program samples demon-strate substantially lower operating voltages than the random samplesequence.

3.3 Simulator Framework and BenchmarksThe architectural simulators used in this paper are derived from

the SimpleScalar/Alpha version 3.0 tool set [2], a suite of functionaland timing simulation tools for the Alpha AXP ISA. Simulation isexecution-driven, including execution down any speculative pathuntil the detection of a fault, TLB miss, or branch misprediction. Thebaseline processor modeled was a single-issue, in-order pipelinewith the pipeline stages that are described in Section 3.1. The base-line model was modified to simulate Razor error recovery with itsproper penalties. Furthermore, the detailed C-level Kogge-Stone

adder model was integrated into the execute stage, where it was usedto determine when voltage scaling introduced adder timing errors.

To perform our evaluation, we collected results from 11 of theSPEC2000 benchmarks. All SPEC programs were compiled for aCompaq Alpha AXP-21264 processor using the Compaq C and For-tran compilers under the OSF/1 V4.0 operating system using fullcompiler optimization (-O4). The simulations were run for 100 mil-lion instructions using the SPEC reference inputs. We used the Sim-Point toolset’s Early SimPoints to pinpoint program locations thatwere highly representative of the entire program execution [18].

3.4 Energy Analysis for Fixed VoltageFigure 11 illustrates qualitatively the relationship between sup-

ply voltage, adder energy and pipeline throughput. The total energyconsumed by the adder (Eadder) is the sum of the energy required toperform add operations (Eadditions) plus the energy required torecover the pipeline in the event of an adder timing error (Erecovery).Moreover, there is a fixed amount of energy overhead incurred toimplement Razor checking for the adder. This energy is consumedby the shadow latches and comparison logic. A trade-off existsbetween the adder and recovery energy components. When supplyvoltage is decreased, the energy required to perform addition opera-tions is decreased, but fewer of these operations are able to completewithin the clock period. As a result, pipeline recovery is invokedmore frequently with additional energy expense. Energy for theadder (Eadder) is optimized when any additional decrease in voltage

Figure 9. Measured Error Rates for an 18x18-bit FPGA Multiplier Block at 90 MHz and 27 C.

0.0000000%

0.0000001%

0.0000010%

0.0000100%

0.0001000%

0.0010000%

0.0100000%

0.1000000%

1.0000000%

10.0000000%

100.0000000%

1.141.181.221.261.301.341.381.421.461.501.541.581.621.661.701.741.78

Supply Voltage (V)

Erro

r ra

te (

log

scale

)

random

Zero-margin

@ 1.54 V

Safety-margin

@ 1.63 V

Environmental-margin

@ 1.69 V

35% energy savings with 1.3% error

30% energy saving

22% saving

One error every ~20 seconds

0.0000000%

0.0000001%

0.0000010%

0.0000100%

0.0001000%

0.0010000%

0.0100000%

0.1000000%

1.0000000%

10.0000000%

100.0000000%

1.141.181.221.261.301.341.381.421.461.501.541.581.621.661.701.741.78

Supply Voltage (V)

Erro

r ra

te (

log

scale

)

random

Zero-margin

@ 1.54 V

Safety-margin

@ 1.63 V

Environmental-margin

@ 1.69 V

35% energy savings with 1.3% error35% energy savings with 1.3% error

30% energy saving30% energy saving

22% saving22% saving

One error every ~20 seconds

Figure 10. Simulated Error Rates for a Kogge-Stone Adder at 870 MHz and 27 C.

0.00%

0.01%

0.10%

1.00%

10.00%

100.00%

0.60.811.21.41.61.82

Supply Voltage

Erro

r ra

te

random

bzip

ammp

Figure 11. The Qualitative Relationship Between Supply Voltage, Energy and Pipeline Throughput (for

a fixed frequency).

D e cre a sing S upply V olta ge

E ne rgy

E ne rgy o f A dde rO pe ra tions, E a dditions

E ne rgy o f

P ipe lineR e cove ry,

E re cove ry

T ota l A dde r E ne rgy,E a dde r = E a ddition s + E re cove ry

Optimal Eadder

P ipe lineT hroughput

IP C

E ne rgy o f A dde rw /o R a zor S upport

D e cre a sing S upply V olta ge

E ne rgy

E ne rgy o f A dde rO pe ra tions, E a dditions

E ne rgy o f

P ipe lineR e cove ry,

E re cove ry

T ota l A dde r E ne rgy,E a dde r = E a ddition s + E re cove ry

Optimal Eadder

P ipe lineT hroughput

IP C

E ne rgy o f A dde rw /o R a zor S upport

Page 16: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Simulations/Data

  Adder Simulation   Fixed voltage sweep   Goal:

 Reduce energy without sacrificing IPC

results in an energy savings that is smaller than the extra energy costincurred by more pipeline recoveries. The energy-optimal voltagevaries from program to program (and even within the phases of aprogram) because pipeline error rate is heavily dependent on the datavalues sent to the adder. These trade-offs are further complicatedunder a pipeline performance constraint. Decreasing voltage willincur additional pipeline errors, which in turn decreases pipelinethroughput (i.e., instructions per cycle). Consequently, the programwill take longer to execute. Under a performance constraint, the opti-mal voltage is limited to the minimal energy that meets the perfor-mance constraint.

Table 1 lists for each benchmark the energy-optimal supplyvoltage, average adder error rate, energy reduction, and IPC reduc-tion at the fixed energy-optimal voltage. The simulations are per-formed by sweeping the voltage in 25 mV steps from 1.8 V down to0.6 V. The voltage remains fixed for the entire simulation (i.e., eachpoint on the graph is a different simulation). All experiments are per-formed at 27 C and 870 MHz, the maximum speed at which theadder runs error-free at room temperature (i.e., the zero-marginpoint). All Razor energy estimates were made using RTL-levelpower analysis of the Razor prototype physical design described inSection 3.1. The total energy of the Razor adder includes the energyof the adder, Razor latch and check circuitry, and the total pipelinerecovery energy incurred when a Razor adder error is detected. TheRazor latches and error detection circuitry increase adder energy byabout 4.3%. Error recovery energy is conservatively estimated at 18times the cost of a single add (at 1.8 V), based on a 6-cycle recoverysequence at typical activity rates. It should be noted that the energysavings reflect only that due to eliminating data-dependent delaymargins. If comparisons were made to existing DVS techniques thatrequire safety margins (e.g., delay line speed detector) or tempera-ture margins (e.g., design-time DVS), the resulting energy savingwould be substantially higher. Table 1 also shows the relative perfor-mance of the benchmark, given as the IPC of the program withRazor timing speculation divided by the IPC of a non-speculativepipeline. Since all the experiments are run at the same frequency, thechange in IPC due to pipeline recovery reflects true performanceimpacts. Figure 12 illustrates the relative energy and performanceacross the entire supply voltage operating range, for the benchmarksbzip and gcc.

It is important to note that the energy analysis presented in thissection only reflects the energy savings in the Razor pipeline adder.

We only consider the energy of the entire Razor pipeline when anadder timing error occurs. In this event, all activities (and pipelineenergy) are directly attributable to the Razor timing error, and thusmust be counted against Razor adder energy savings. In essence, thisis the adder energy savings one could expect if the adder were givenits own independently tunable voltage source. Total energy reductionfor the entire pipeline would only be the same if the remaining com-ponents could scale their voltage to the same degree without increas-ing the overall error rate.

Clearly, there is significant energy to be reclaimed by runningthe adder at a low error rate. All of the benchmarks experienced sig-nificant energy savings, ranging from 23.7% to 64.2%. One particu-larly encouraging result is that error rates and performance impactsare muted up to and slightly past the energy-optimal voltage, afterwhich error rates rise very quickly. At the energy-optimal voltagepoint, the benchmarks suffered at most a 2.49% reduction in pipelineperformance (due to recovery flushes). There appears to be littletrade-off in performance when fully exploiting adder energy savingsat subcritical voltages. While we have simulated voltages down to0.6V, our Razor prototype design is only capable of validating circuittiming down to 1.2 V. This constraint will limit the energy savings offour of the benchmarks. Since additional voltage scaling headroomexists, we are examining techniques to further reduce voltage onfuture prototype designs.

3.5 Energy Analysis for Dynamic Voltage ScalingReducing voltage to the energy-optimal fixed voltage point will

certainly improve the energy characteristics of a system thatemploys Razor. In this section, we consider the potential value ofdynamically adjusting supply voltage to workload characteristics.We perform these experiments by engaging the proportional controlsystem described in Section 2.3. For the simulated experiments, weassume a voltage regulator response time of 20 cycles per 1 mV. Thecontrol system samples (and then resets) an error counter every 5000cycles, and adjusts the voltage regular plus or minus 25 mV, depend-

Table 1. Energy-Optimal Characteristics

Program

Optimal

Vdd

Error

Rate

% Energy

Reduced

% IPC

Reduced

bzip 1.1 0.31% 57.6% 0.70%

crafty 1.175 0.41% 50.5% 0.60%

eon 1.3 1.21% 34.4% 1.24%

gap 1.275 1.15% 30.1% 2.49%

gcc 1.375 1.62% 23.7% 1.47%

gzip 1.3 1.03% 35.6% 0.41%

mcf 1.175 0.67% 48.7% 0.00%

parser 1.2 0.61% 47.9% 0.29%

twolf 1.275 2.67% 30.7% 0.31%

vortex 1.3 0.53% 42.8% 0.14%

vpr 1.075 0.01% 64.2% 0.00%

Average 42.4%

Figure 12. Relative Adder Energy and Pipeline Throughput for Simulated Benchmarks.

BZIP

0 .3 1 % E rro r R a te0 .3

0 .5

0 .7

0 .9

1 .1

1 .3

1 .5

0 .6

0 .67 5

0 .75

0 .82 5

0 .9

0 .97 5

1 .05

1 .12 5

1 .2

1 .27 5

1 .35

1 .42 5

1 .5

1 .57 5

1 .65

1 .72 5

1 .8

Voltage

Re

lati

ve

IP

C a

nd

En

erg

y

R e l E ne rg y

R e l P e rfo rm a nc e

GCC

1 .6 2 % E rro r R a te

0 .3

0 .5

0 .7

0 .9

1 .1

1 .3

1 .5

0 .6

0 .67 5

0 .75

0 .82 5

0 .9

0 .97 5

1 .05

1 .12 5

1 .2

1 .27 5

1 .35

1 .42 5

1 .5

1 .57 5

1 .65

1 .72 5

1 .8

Voltage

Re

lati

ve

IP

C a

nd

En

erg

y

R e l E ne rg y

R e l P e rfo rm a nc e

instructions using the adder (e.g., adds, subtracts, loads, stores)recorded their inputs. The benchmark samples are taken in programexecution order starting at the SimPoint point of the execution, asspecified by Sherwood et al. [18].

As shown in Figure 10, the random input, like for the multi-plier, demonstrates a gradual rise in the error rate with decreasingvoltage. We see a similar trend for the benchmark samples analyzed.The error rates for the real program samples increase even moreslowly at first than the random sample sequence. For instance, theammp benchmark experiences very few errors until 1.05 V, and bzipdoes not generate any substantial error rates until 1.2 V. With realprogram samples, the error rate tends to rise faster once errors dotake hold, even performing slightly worse than the random sequenceat lower voltages. However, at error rates that we would expect to beeasily tolerated (e.g., below 5%), the real program samples demon-strate substantially lower operating voltages than the random samplesequence.

3.3 Simulator Framework and BenchmarksThe architectural simulators used in this paper are derived from

the SimpleScalar/Alpha version 3.0 tool set [2], a suite of functionaland timing simulation tools for the Alpha AXP ISA. Simulation isexecution-driven, including execution down any speculative pathuntil the detection of a fault, TLB miss, or branch misprediction. Thebaseline processor modeled was a single-issue, in-order pipelinewith the pipeline stages that are described in Section 3.1. The base-line model was modified to simulate Razor error recovery with itsproper penalties. Furthermore, the detailed C-level Kogge-Stone

adder model was integrated into the execute stage, where it was usedto determine when voltage scaling introduced adder timing errors.

To perform our evaluation, we collected results from 11 of theSPEC2000 benchmarks. All SPEC programs were compiled for aCompaq Alpha AXP-21264 processor using the Compaq C and For-tran compilers under the OSF/1 V4.0 operating system using fullcompiler optimization (-O4). The simulations were run for 100 mil-lion instructions using the SPEC reference inputs. We used the Sim-Point toolset’s Early SimPoints to pinpoint program locations thatwere highly representative of the entire program execution [18].

3.4 Energy Analysis for Fixed VoltageFigure 11 illustrates qualitatively the relationship between sup-

ply voltage, adder energy and pipeline throughput. The total energyconsumed by the adder (Eadder) is the sum of the energy required toperform add operations (Eadditions) plus the energy required torecover the pipeline in the event of an adder timing error (Erecovery).Moreover, there is a fixed amount of energy overhead incurred toimplement Razor checking for the adder. This energy is consumedby the shadow latches and comparison logic. A trade-off existsbetween the adder and recovery energy components. When supplyvoltage is decreased, the energy required to perform addition opera-tions is decreased, but fewer of these operations are able to completewithin the clock period. As a result, pipeline recovery is invokedmore frequently with additional energy expense. Energy for theadder (Eadder) is optimized when any additional decrease in voltage

Figure 9. Measured Error Rates for an 18x18-bit FPGA Multiplier Block at 90 MHz and 27 C.

0.0000000%

0.0000001%

0.0000010%

0.0000100%

0.0001000%

0.0010000%

0.0100000%

0.1000000%

1.0000000%

10.0000000%

100.0000000%

1.141.181.221.261.301.341.381.421.461.501.541.581.621.661.701.741.78

Supply Voltage (V)

Erro

r ra

te (

log

scale

)

random

Zero-margin

@ 1.54 V

Safety-margin

@ 1.63 V

Environmental-margin

@ 1.69 V

35% energy savings with 1.3% error

30% energy saving

22% saving

One error every ~20 seconds

0.0000000%

0.0000001%

0.0000010%

0.0000100%

0.0001000%

0.0010000%

0.0100000%

0.1000000%

1.0000000%

10.0000000%

100.0000000%

1.141.181.221.261.301.341.381.421.461.501.541.581.621.661.701.741.78

Supply Voltage (V)

Erro

r ra

te (

log

scale

)

random

Zero-margin

@ 1.54 V

Safety-margin

@ 1.63 V

Environmental-margin

@ 1.69 V

35% energy savings with 1.3% error35% energy savings with 1.3% error

30% energy saving30% energy saving

22% saving22% saving

One error every ~20 seconds

Figure 10. Simulated Error Rates for a Kogge-Stone Adder at 870 MHz and 27 C.

0.00%

0.01%

0.10%

1.00%

10.00%

100.00%

0.60.811.21.41.61.82

Supply Voltage

Erro

r ra

te

random

bzip

ammp

Figure 11. The Qualitative Relationship Between Supply Voltage, Energy and Pipeline Throughput (for

a fixed frequency).

D e cre a sing S upply V olta ge

E ne rgy

E ne rgy o f A dde rO pe ra tions, E a dditions

E ne rgy o f

P ipe lineR e cove ry,

E re cove ry

T ota l A dde r E ne rgy,E a dde r = E a ddition s + E re cove ry

Optimal Eadder

P ipe lineT hroughput

IP C

E ne rgy o f A dde rw /o R a zor S upport

D e cre a sing S upply V olta ge

E ne rgy

E ne rgy o f A dde rO pe ra tions, E a dditions

E ne rgy o f

P ipe lineR e cove ry,

E re cove ry

T ota l A dde r E ne rgy,E a dde r = E a ddition s + E re cove ry

Optimal Eadder

P ipe lineT hroughput

IP C

E ne rgy o f A dde rw /o R a zor S upport

Page 17: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Simulations/Data

  Dynamic Scaling   Target error rate was 1.5%   Takes 5000 cycle chunk

samples   Uses those chunks to

dynamically scale voltage

  Slow reaction times

ing on the error rate differential. All simulations use a target errorrate of 1.5%, which was set based on the energy-optimal error ratesanalyzed in the previous section.

Table 2 lists the adder energy reduction compared to a non-Razor adder at the zero-margin voltage (1.8 V). Dynamically adjust-ing voltage again results in substantial energy savings. Compared thefixed voltage experiments of the previous section, about one half ofthe benchmarks see better energy savings, and the other half hasslightly worse energy savings. With dynamic voltage scaling, mostof the benchmarks ran slower, although overall performance impactswere still small, with the largest slowdown limited to just under 6%.Figure 13 illustrates the change in error rate of the adder over timeand the voltage control systems response to the error rate for the gccand gap benchmarks. Overall, the results for the proportional controlsystem are mixed. Given that it represents a fairly unsophisticatedclass of control functions; further investigations into supply voltagecontrol will likely yield additional energy savings.

4 Previous WorkTable 3 lists a number of prior proposals supporting adaptive

voltage and frequency scaling. With Design-Time DVS, conservativedesign techniques are used to specify “legal” voltage and frequencypairs that allow reliable operation of the processor under worst-casevoltage, temperature, and process conditions. Examples of systemsthat utilize this approach are Intel’s x86 SpeedStep technology [10]and Transmeta’s Longrun technology [20].

A Correlating VCO allows ambient margins to be eliminated;examples of this design have been proposed by Burd [3] and Gutnik[7]. The approach implements a voltage controlled oscillator using atiming loop constructed to slightly exceed the latency of the worst-case critical path in the machine, plus process and safety margins.When supply voltage changes, the oscillator speed will automati-cally adjust to match the fastest safe clock speed. It is important tonote that this approach cannot compensate for intra-die process andtemperature variations, IR drop, or noise. As a result, additional volt-

age margins (implemented with additional timing loop delay) arerequired for safe operation.

A Delay Line Speed Detector is a device that models the worst-case critical path of the system, plus a safety margin. Examples ofthese devices have been proposed by Dhar [5] and Uht [21]. Periodi-cally, a signal transition is propagated down a delay chain and sam-pled at the end of the current clock cycle. If the signal transition doesnot propagate to the end of the delay chain within the clock period,the system is running too close to failure and frequency and/or volt-age must be adjusted. Since the delay chain fails prior to the core cir-cuitry, any failure detected in the delay chain will proceed a corecircuitry failing, assuming that the delay line is frequently monitoredand the system is adjusted promptly upon detection of a delay linefailure. To ensure that the delay line fails first, it is necessary to addlatency margins to accommodate intra-die process and temperaturevariations, IR drop and noise. Unlike the Correlating VCO, it may bepossible to put multiple delay line speed detectors across the die andcombine their timing signals in an effort to mitigate intra-die processand temperature variation. However, some variation is inherentlylocal (e.g., cross-coupling noise), thus some delay margin willalways be required. We have not seen the use of multiple delay linedetectors explored to date.

Kehl’s Triple-Latch Monitor is similar to the delay line speeddetector, but like Razor, utilizes in-situ circuit monitoring [11].Using this approach, all monitored system state is captured usingthree latches, clocked in succession with a small delay between each.The staggered latches provide three closely spaced samples of alogic block’s value each cycle. The value in the latest-clocked latchis assumed correct and always forwarded to later logic. The systemis considered “tuned” when the first latch does not match the secondand third latch values, meaning that the logic transition was verynear the critical speed, but not dangerously close. If all latches seethe same value, the system is running too slowly and should be spedup. If the first two latches see different values than the last, then thesystem is running dangerously fast and should be slowed down.Because of the in-situ nature of this approach, it could conceivablyadjust to intra-die process and temperature variations. However,data-dependent delay variations complicate Kehl’s approach. Toavoid too aggressively clocking the system, speedup evaluationsmust be limited to tests on worst-cast latency vectors. Kehl suggeststhat the system should periodically stop and test worst-case vectorsto determine if the system should be sped up.

Figure 13. Adder Error Rate and Voltage Controller Response.

GCC

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Time

Su

pp

ly V

olt

ag

e

0.00%

5.00%

10.00%

15.00%

20.00%

25.00%

30.00%

35.00%

40.00%

Err

or

Ra

te

V oltage

E rror Rate

G a p

0 . 6

0 . 8

1

1 . 2

1 . 4

1 . 6

1 . 8

2

T im e

Su

pp

ly V

olt

ag

e

0 . 0 0 %

3 . 0 0 %

6 . 0 0 %

9 . 0 0 %

1 2 . 0 0 %

1 5 . 0 0 %

1 8 . 0 0 %

2 1 . 0 0 %

2 4 . 0 0 %

2 7 . 0 0 %

3 0 . 0 0 %

Err

or

Ra

te

V o lta g e

E rro r R a te

Table 2. Simulated DVS Energy Savings

Program

% Energy

Reduced

% IPC

Reduced

bzip 54.5% 4.13%

crafty 54.8% 1.78%

eon 30.4% 0.78%

gap 12.9% 2.14%

gcc 31.3% 5.88%

gzip 44.6% 1.27%

mcf 36.9% 0.47%

parser 53.0% 1.94%

twolf 20.4% 0.06%

vortex 49.1% 1.07%

vpr 63.6% 1.66%

Average 41.0%

Page 18: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Simulations/Data

  Mixed results   Half see increases, half see

decreases in energy relative to static scaling

ing on the error rate differential. All simulations use a target errorrate of 1.5%, which was set based on the energy-optimal error ratesanalyzed in the previous section.

Table 2 lists the adder energy reduction compared to a non-Razor adder at the zero-margin voltage (1.8 V). Dynamically adjust-ing voltage again results in substantial energy savings. Compared thefixed voltage experiments of the previous section, about one half ofthe benchmarks see better energy savings, and the other half hasslightly worse energy savings. With dynamic voltage scaling, mostof the benchmarks ran slower, although overall performance impactswere still small, with the largest slowdown limited to just under 6%.Figure 13 illustrates the change in error rate of the adder over timeand the voltage control systems response to the error rate for the gccand gap benchmarks. Overall, the results for the proportional controlsystem are mixed. Given that it represents a fairly unsophisticatedclass of control functions; further investigations into supply voltagecontrol will likely yield additional energy savings.

4 Previous WorkTable 3 lists a number of prior proposals supporting adaptive

voltage and frequency scaling. With Design-Time DVS, conservativedesign techniques are used to specify “legal” voltage and frequencypairs that allow reliable operation of the processor under worst-casevoltage, temperature, and process conditions. Examples of systemsthat utilize this approach are Intel’s x86 SpeedStep technology [10]and Transmeta’s Longrun technology [20].

A Correlating VCO allows ambient margins to be eliminated;examples of this design have been proposed by Burd [3] and Gutnik[7]. The approach implements a voltage controlled oscillator using atiming loop constructed to slightly exceed the latency of the worst-case critical path in the machine, plus process and safety margins.When supply voltage changes, the oscillator speed will automati-cally adjust to match the fastest safe clock speed. It is important tonote that this approach cannot compensate for intra-die process andtemperature variations, IR drop, or noise. As a result, additional volt-

age margins (implemented with additional timing loop delay) arerequired for safe operation.

A Delay Line Speed Detector is a device that models the worst-case critical path of the system, plus a safety margin. Examples ofthese devices have been proposed by Dhar [5] and Uht [21]. Periodi-cally, a signal transition is propagated down a delay chain and sam-pled at the end of the current clock cycle. If the signal transition doesnot propagate to the end of the delay chain within the clock period,the system is running too close to failure and frequency and/or volt-age must be adjusted. Since the delay chain fails prior to the core cir-cuitry, any failure detected in the delay chain will proceed a corecircuitry failing, assuming that the delay line is frequently monitoredand the system is adjusted promptly upon detection of a delay linefailure. To ensure that the delay line fails first, it is necessary to addlatency margins to accommodate intra-die process and temperaturevariations, IR drop and noise. Unlike the Correlating VCO, it may bepossible to put multiple delay line speed detectors across the die andcombine their timing signals in an effort to mitigate intra-die processand temperature variation. However, some variation is inherentlylocal (e.g., cross-coupling noise), thus some delay margin willalways be required. We have not seen the use of multiple delay linedetectors explored to date.

Kehl’s Triple-Latch Monitor is similar to the delay line speeddetector, but like Razor, utilizes in-situ circuit monitoring [11].Using this approach, all monitored system state is captured usingthree latches, clocked in succession with a small delay between each.The staggered latches provide three closely spaced samples of alogic block’s value each cycle. The value in the latest-clocked latchis assumed correct and always forwarded to later logic. The systemis considered “tuned” when the first latch does not match the secondand third latch values, meaning that the logic transition was verynear the critical speed, but not dangerously close. If all latches seethe same value, the system is running too slowly and should be spedup. If the first two latches see different values than the last, then thesystem is running dangerously fast and should be slowed down.Because of the in-situ nature of this approach, it could conceivablyadjust to intra-die process and temperature variations. However,data-dependent delay variations complicate Kehl’s approach. Toavoid too aggressively clocking the system, speedup evaluationsmust be limited to tests on worst-cast latency vectors. Kehl suggeststhat the system should periodically stop and test worst-case vectorsto determine if the system should be sped up.

Figure 13. Adder Error Rate and Voltage Controller Response.

GCC

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Time

Su

pp

ly V

olt

ag

e

0.00%

5.00%

10.00%

15.00%

20.00%

25.00%

30.00%

35.00%

40.00%

Err

or

Ra

te

V oltage

E rror Rate

G a p

0 . 6

0 . 8

1

1 . 2

1 . 4

1 . 6

1 . 8

2

T im e

Su

pp

ly V

olt

ag

e

0 . 0 0 %

3 . 0 0 %

6 . 0 0 %

9 . 0 0 %

1 2 . 0 0 %

1 5 . 0 0 %

1 8 . 0 0 %

2 1 . 0 0 %

2 4 . 0 0 %

2 7 . 0 0 %

3 0 . 0 0 %

Err

or

Ra

te

V o lta g e

E rro r R a te

Table 2. Simulated DVS Energy Savings

Program

% Energy

Reduced

% IPC

Reduced

bzip 54.5% 4.13%

crafty 54.8% 1.78%

eon 30.4% 0.78%

gap 12.9% 2.14%

gcc 31.3% 5.88%

gzip 44.6% 1.27%

mcf 36.9% 0.47%

parser 53.0% 1.94%

twolf 20.4% 0.06%

vortex 49.1% 1.07%

vpr 63.6% 1.66%

Average 41.0%

Page 19: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

Razor – Conclusions

  Similarities to DIVA   Error checking component becomes an oracle   Added error checking does not interfere with original pipeline

other than the case of an error

  Differences from DIVA   Does not handle transient errors; only handles timing errors

Page 20: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle

Page 21: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Motivation

  Process Variation   As transistor sizes continue to shrink, our their error margins

continue to become more significant  The same stage on two different chips may not be equal with regard to

timing  This requires designers to set more conservative cycle times, which

will in turn affect the cycle time of all stages   Guard Banding – Bad for performance!

  Also, like Razor, Power   Same arguments as Razor…

  Finally, make average silicon matter!   Salvage chips that vary beyond the set threshold (and therefore fail

hold-time tests)

Page 22: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Approach

  What’s unique to ReCycle?   Analyzing feedback paths in cycle time analysis   Using cycle time stealing to tackle process variation   The implementation of Donor stages, but not the idea in its

entirety

  What did it inherit?   The notion of skewing the clock to alter cycle times   Mapping the clock skew optimization problem as a graph

Page 23: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Approach

  The Main Idea   Why should we increase the cycle time of all of a pipeline’s stages if

only one or two stages are the culprits of long cycle times (due to variation)?  This greater level of unbalance is actually more optimal for ReCycle

  Why doesn’t a designer concentrate on the average cycle time instead of the longest?  Let one slower stage “borrow” or “steal” time from a faster stage

!"" #$% &'(()*+&!#+'* "+*,-. /* 0!1#+&)"!12 3% '*"4 -$'3 #$% 5%%67

8!&, 0!#$- #$!# 3% &'*-+6%1 ('-# +(0'1#!*#. 9'1 %:!(0"%2 3% 6' *'#

-$'3 !"" #$% 5'13!16+*; 0!#$-. <$+"% 3% 3+"" 8!-% ')1 !*!"4-+- '*

#$+- -+(0"+!%6 0+0%"+*%2 =%>4&"% +- ;%*%1!" %*');$ #' 8% !00"+&!8"%

#' ('1% &'(0"+&!#%6 0+0%"+*%-.

!"#" $%&'()) *+%,+-,&. +./ 0-) 012+'-

<$+"% 01'&%-- ?!1+!#+'* %:+-#- !# -%?%1!" "%?%"-2 3% 5'&)- '*

<+#$+*7@+% A</@B ?!1+!#+'*2 3$+&$ +- &!)-%6 84 8'#$ !"!#$%&#'( %57

5%&#- 6)% #' "+#$';1!0$+& +11%;)"!1+#+%- !*6 )&*+,% %55%&#- 01+(!1+"4

6)% #' ?!14+*; 6'0!*# &'*&%*#1!#+'*- CDEF. G4-#%(!#+& ?!1+!#+'* %:7

$+8+#- -#1'*; -0!#+!" &'11%"!#+'* H -#1)&#)1%- #$!# !1% &"'-% #';%#$%1

!1% "+,%"4 #' $!?% -+(+"!1 ?!")%-H3$+"% 1!*6'( ?!1+!#+'* 6'%- *'#.

I3' +(0'1#!*# 01'&%-- 0!1!(%#%1- !55%&#%6 84 ?!1+!#+'* !1% #$%

#$1%-$'"6 ?'"#!;% AVtB !*6 #$% %55%&#+?% &$!**%" "%*;#$ ALeff B. J!1+7

!#+'* '5 #$%-% 0!1!(%#%1- 6+1%&#"4 !55%&#- ! ;!#%K- 6%"!4 ATgB2 !-

;+?%* 84 #$% !"0$!70'3%1 ('6%" CELFM

Tg ! LeffVdd

µ(Vdd " Vt)!AEB

3$%1% µ +- #$% &!11+%1 ('8+"+#42 Vdd +- #$% -)00"4 ?'"#!;%2 !*6 ! +-)-)!""4 1.3. N'#$ µ !*6 Vt !1% ! 5)*&#+'* '5 #$% #%(0%1!#)1% -.

<% #1%!# 1!*6'( !*6 -4-#%(!#+& ?!1+!#+'* -%0!1!#%"4. <% ('6%"

#$% -4-#%(!#+& ?!1+!#+'* '5 Vt 3+#$ ! ()"#+?!1+!#% *'1(!" 6+-#1+8)7

#+'* 3+#$ ! -0%&+!& &'11%"!#+'* -#1)&#)1% CDEF. /# +- &$!1!&#%1+O%6 84

#$1%% 0!1!(%#%1-M µ2 "sys2 !*6 #. G0%&+!&!""42 3% 6+?+6% #$% &$+0+*#' ! ;1+6 3+#$ PQ &%""-. R!&$ &%"" #!,%- '* ! -+*;"% ?!")% '5 Vt !-

;+?%* 84 ! ()"#+?!1+!#% *'1(!" 6+-#1+8)#+'* 3+#$ 0!1!(%#%1- µ !*6"sys. S"'*; 3+#$ #$+-2 Vt +- -0!#+!""4 &'11%"!#%6.

<% !--)(% #$!# #$% &'11%"!#+'* +- +-'#1'0+& !*6 +*6%0%*6%*# '5

0'-+#+'* CDTF. I$+- (%!*- #$!# #$% &'11%"!#+'* 8%#3%%* #3' 0'+*#-

$x !*6 $y +* #$% ;1+6 6%0%*6- '* #$% 6+-#!*&% 8%#3%%* #$%( !*6 *'#'* #$% 6+1%&#+'* '1 0'-+#+'*. >'*-%U)%*#"42 3% %:01%-- #$% &'11%7

"!#+'* 5)*&#+'* '5 Vt($x) !*6 Vt($y) !- %(r)2 3$%1% r = |$x " $y|.N4 6%!*+#+'*2 %AVBWP A+.%.2 #'#!""4 &'11%"!#%6B. <% !"-' -%# %(#)WVA+.%.2 #'#!""4 )*&'11%"!#%6B. <% #$%* !--)(% #$!# %(r) &$!*;%- 3+#$ )!- 0%1 #$% G0$%1+&!" 6+-#1+8)#+'* CPXF. /* #$% G0$%1+&!" 6+-#1+8)#+'*2

%(r) 6%&1%!-%- 51'( P #' V -(''#$"4 !*6 1%!&$%- V !# ! 6+-#!*&% #&!""%6 )&*.$. /*#)+#+?%"42 #$+- (%!*- #$!# !# 6+-#!*&% #2 #$%1% +- *'-+;*+!&!*# &'11%"!#+'* 8%#3%%* #$% Vt '5 #3' #1!*-+-#'1-. I$+- !07

01'!&$ (!#&$%- %(0+1+&!" 6!#! '8#!+*%6 84 91+%68%1; $# &/0 CPYF. #+- ;+?%* !- ! 51!&#+'* '5 #$% &$+0K- 3+6#$.

=!*6'( ?!1+!#+'* '5 Vt '&&)1- !# ! ()&$ !*%1 ;1!*)"!1+#4 #$!*

-4-#%(!#+& ?!1+!#+'*M +# '&&)1- !# #$% "%?%" '5 +*6+?+6)!" #1!*-+-#'1-.

<% ('6%" +# !- !* )*&'11%"!#%6 *'1(!" 6+-#1+8)#+'* 3+#$ "rand !*6

! O%1' (%!*.

Leff +- ('6%"%6 "+,% Vt 3+#$ ! 6+55%1%*# µ2 "sys2 !*6 "rand 8)#

#$% -!(% #.91'( #$% Vt !*6 Leff ?!1+!#+'*2 3% &'(0)#% #$% Tg ?!1+!#+'*

)-+*; RU)!#+'* E. <% #$%* )-% #$% &1+#+&!" 0!#$ ('6%" '5 N'37

(!* $# &/0 CZF #' %-#+(!#% #$% 51%U)%*&4 -)00'1#%6 84 %!&$ 0+0%"+*%

-#!;%. I$+- ('6%" #!,%- #$% *)(8%1 '5 ;!#%- An&0B +* ! &1+#+&!"0!#$ !*6 #$% *)(8%1 '5 &1+#+&!" 0!#$- AN&0B +* ! -#1)&#)1%2 !*6&'(0)#%- #$% 01'8!8+"+#4 6+-#1+8)#+'* '5 #$% "'*;%-# &1+#+&!" 0!#$

6%"!4 Amax{T&0}B +* #$% -#1)&#)1%. I$+- +- #$% 0!#$ #$!# 6%#%17(+*%- #$% (!:+()( 51%U)%*&4 '5 #$% -#1)&#)1%2 3$+&$ 3% -%# #'

8% 1/ max{T&0}.9'1 -+(0"+&+#42 3% ('6%" ! &1+#+&!" 0!#$ !- n&0 9[D ;!#%- &'*7

*%&#%6 84 ?%14 -$'1# 3+1%- H 3$%1% n&0 +- #$% )-%5)" "';+& 6%0#$'5 ! 0+0%"+*% -#!;%. \*5'1#)*!#%"42 !&&)1!#%"4 %-#+(!#+*; N&0 5'1

%!&$ 0+0%"+*% -#!;% +- 6+5!&)"# 8%&!)-% N&0 +- 6%-+;*7-0%&+!& !*6*'# 0)8"+&"4 !?!+"!8"% 5'1 ! 6%-+;*. >'*-%U)%*#"42 3% !--)(% #$!#

&1+#+&!" 0!#$- !1% 6+-#1+8)#%6 +* ! -0!#+!""47)*+5'1( (!**%1 '* #$%

01'&%--'1 "!4')# H %:&%0# +* #$% ]X2 3$'-% 0!#$- 3% !--)(% *%?%1

!55%&# #$% &4&"% #+(%. 91'( #$% "!4')# !1%! '5 %!&$ 0+0%"+*% -#!;%

!*6 N'3(!* $# &/0K- %-#+(!#% #$!# ! $+;$70%15'1(!*&% 01'&%--'1

&$+0 !# ')1 #%&$*'"';4 *'6% $!- !8')# PV2VVV &1+#+&!" 0!#$- CZF2 3%

6%#%1(+*% #$% &1+#+&!" 0!#$- +* %!&$ -#!;%. I$% -"'3%-# &1+#+&!" 0!#$

+* ! -#!;% 6%#%1(+*%- #$% 51%U)%*&4 '5 #$% -#!;%^ #$% -"'3%-# -#!;%

6%#%1(+*%- #$% 0+0%"+*% 51%U)%*&4.

#" $,2(3,.( 4/+2-+-,&. 5,-6 7(89'3(

#":" ;+,. 0/(+

I' )*6%1-#!*6 #$% +6%! 8%$+*6 =%>4&"%2 &'*-+6%1 #$% 0+0%"+*% '5

9+;)1% XA!B !*6 &!"" Ti #$% #+(% #!,%* #' 0%15'1( #$% 3'1, +* -#!;%

'. 9'1 -+(0"+&+#42 +* #$% !8-%*&% '5 01'&%-- ?!1+!#+'*2 3% !--)(%

#$!# Ti +- #$% -!(% 5'1 !"" ' !*62 #$%1%5'1%2 #$% 0+0%"+*%K- 0%1+'6 +-

TCP = Ti, $i. <$%* ?!1+!#+'* -%#- +*2 +# -"'3- 6'3* -'(% -#!;%-3$+"% +# -0%%6- )0 '#$%1-. S- -$'3* +* 9+;)1% XA!B2 #$% 1%-)"#+*;

)*8!"!*&%6 0+0%"+*% $!- #' 8% &"'&,%6 3+#$ ! "'*;%1 0%1+'6 TCP =Max(Ti), $i.

!"#$%&#

'()"(#"%*

+",-."*-

/00-1#

%0

'()"(#"%*

+",-."*-

20#-)

'()"(#"%*

34%56-781.-9

6/:/;6/4<= <>

+-)"%?

7.%1@5#%

<*"#"(.

6-A"B#-)

7.%1@5#%

="*(.

6-A"B#-)

34%56-781.-9

:"0%)5(..5"

"2CA3:559

<= 6/4 /; 6/:

<=

<>

6/4

<>

/;

6/:

6/:/;<>6/4<=

3(96-781.-9

3!"#$

'()"(#"%*

20#-)

+",-."*-

7.%1@5#%

="*(.

6-A"B#-)

3!"#$

6-781.-93D9

E

:

:

F"*

B@-G

B-#&,: $%.?F(H

E

"I(H3:559

<,=>%( !? R55%&# '5 01'&%-- ?!1+!#+'* '* 0+0%"+*%- A!B !*6 -,%3+*;

! &"'&, -+;*!" A8B.

<+#$ =%>4&"%2 3% &'(01%$%*-+?%"4 !00"4 &4&"% #+(% -#%!"+*;

#' &'11%&# #$+- ?!1+!#+'*7+*6)&%6 0+0%"+*% )*8!"!*&%. I$% 1%-)"#+*;

&"'&, 0%1+'6 '5 #$% 0+0%"+*% +- TCP = Average(Ti), $i. I$+- 0%71+'6 &!* 0'#%*#+!""4 8% -+(+"!1 #' #$!# '5 #$% *'7?!1+!#+'* 0+0%"+*%.

S- -$'3* +* 9+;)1% XA!B2 #$% -"'3 -#!;%- ;%# ('1% #$!* ! &"'&, 0%7

1+'6 #' 01'0!;!#% #$%+1 -+;*!"2 !# #$% %:0%*-% '5 5!-#%1 -#!;%- #$!#

#1!*-5%1 #$%+1 -"!&,.

<+#$ #$+- !001'!&$2 3% 6' *'# *%%6 #' &$!*;% #$% 0+0%"+*% -#1)&7

#)1%2 0+0%"+*% 6%0#$2 '1 #$% +*$%1%*# -3+#&$+*; -0%%6 '5 #1!*-+-#'1-.

325

Page 24: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Approach

  Illustration of how ReCycle skews the clock to account for process variation:

!"" #$% &'(()*+&!#+'* "+*,-. /* 0!1#+&)"!12 3% '*"4 -$'3 #$% 5%%67

8!&, 0!#$- #$!# 3% &'*-+6%1 ('-# +(0'1#!*#. 9'1 %:!(0"%2 3% 6' *'#

-$'3 !"" #$% 5'13!16+*; 0!#$-. <$+"% 3% 3+"" 8!-% ')1 !*!"4-+- '*

#$+- -+(0"+!%6 0+0%"+*%2 =%>4&"% +- ;%*%1!" %*');$ #' 8% !00"+&!8"%

#' ('1% &'(0"+&!#%6 0+0%"+*%-.

!"#" $%&'()) *+%,+-,&. +./ 0-) 012+'-

<$+"% 01'&%-- ?!1+!#+'* %:+-#- !# -%?%1!" "%?%"-2 3% 5'&)- '*

<+#$+*7@+% A</@B ?!1+!#+'*2 3$+&$ +- &!)-%6 84 8'#$ !"!#$%&#'( %57

5%&#- 6)% #' "+#$';1!0$+& +11%;)"!1+#+%- !*6 )&*+,% %55%&#- 01+(!1+"4

6)% #' ?!14+*; 6'0!*# &'*&%*#1!#+'*- CDEF. G4-#%(!#+& ?!1+!#+'* %:7

$+8+#- -#1'*; -0!#+!" &'11%"!#+'* H -#1)&#)1%- #$!# !1% &"'-% #';%#$%1

!1% "+,%"4 #' $!?% -+(+"!1 ?!")%-H3$+"% 1!*6'( ?!1+!#+'* 6'%- *'#.

I3' +(0'1#!*# 01'&%-- 0!1!(%#%1- !55%&#%6 84 ?!1+!#+'* !1% #$%

#$1%-$'"6 ?'"#!;% AVtB !*6 #$% %55%&#+?% &$!**%" "%*;#$ ALeff B. J!1+7

!#+'* '5 #$%-% 0!1!(%#%1- 6+1%&#"4 !55%&#- ! ;!#%K- 6%"!4 ATgB2 !-

;+?%* 84 #$% !"0$!70'3%1 ('6%" CELFM

Tg ! LeffVdd

µ(Vdd " Vt)!AEB

3$%1% µ +- #$% &!11+%1 ('8+"+#42 Vdd +- #$% -)00"4 ?'"#!;%2 !*6 ! +-)-)!""4 1.3. N'#$ µ !*6 Vt !1% ! 5)*&#+'* '5 #$% #%(0%1!#)1% -.

<% #1%!# 1!*6'( !*6 -4-#%(!#+& ?!1+!#+'* -%0!1!#%"4. <% ('6%"

#$% -4-#%(!#+& ?!1+!#+'* '5 Vt 3+#$ ! ()"#+?!1+!#% *'1(!" 6+-#1+8)7

#+'* 3+#$ ! -0%&+!& &'11%"!#+'* -#1)&#)1% CDEF. /# +- &$!1!&#%1+O%6 84

#$1%% 0!1!(%#%1-M µ2 "sys2 !*6 #. G0%&+!&!""42 3% 6+?+6% #$% &$+0+*#' ! ;1+6 3+#$ PQ &%""-. R!&$ &%"" #!,%- '* ! -+*;"% ?!")% '5 Vt !-

;+?%* 84 ! ()"#+?!1+!#% *'1(!" 6+-#1+8)#+'* 3+#$ 0!1!(%#%1- µ !*6"sys. S"'*; 3+#$ #$+-2 Vt +- -0!#+!""4 &'11%"!#%6.

<% !--)(% #$!# #$% &'11%"!#+'* +- +-'#1'0+& !*6 +*6%0%*6%*# '5

0'-+#+'* CDTF. I$+- (%!*- #$!# #$% &'11%"!#+'* 8%#3%%* #3' 0'+*#-

$x !*6 $y +* #$% ;1+6 6%0%*6- '* #$% 6+-#!*&% 8%#3%%* #$%( !*6 *'#'* #$% 6+1%&#+'* '1 0'-+#+'*. >'*-%U)%*#"42 3% %:01%-- #$% &'11%7

"!#+'* 5)*&#+'* '5 Vt($x) !*6 Vt($y) !- %(r)2 3$%1% r = |$x " $y|.N4 6%!*+#+'*2 %AVBWP A+.%.2 #'#!""4 &'11%"!#%6B. <% !"-' -%# %(#)WVA+.%.2 #'#!""4 )*&'11%"!#%6B. <% #$%* !--)(% #$!# %(r) &$!*;%- 3+#$ )!- 0%1 #$% G0$%1+&!" 6+-#1+8)#+'* CPXF. /* #$% G0$%1+&!" 6+-#1+8)#+'*2

%(r) 6%&1%!-%- 51'( P #' V -(''#$"4 !*6 1%!&$%- V !# ! 6+-#!*&% #&!""%6 )&*.$. /*#)+#+?%"42 #$+- (%!*- #$!# !# 6+-#!*&% #2 #$%1% +- *'-+;*+!&!*# &'11%"!#+'* 8%#3%%* #$% Vt '5 #3' #1!*-+-#'1-. I$+- !07

01'!&$ (!#&$%- %(0+1+&!" 6!#! '8#!+*%6 84 91+%68%1; $# &/0 CPYF. #+- ;+?%* !- ! 51!&#+'* '5 #$% &$+0K- 3+6#$.

=!*6'( ?!1+!#+'* '5 Vt '&&)1- !# ! ()&$ !*%1 ;1!*)"!1+#4 #$!*

-4-#%(!#+& ?!1+!#+'*M +# '&&)1- !# #$% "%?%" '5 +*6+?+6)!" #1!*-+-#'1-.

<% ('6%" +# !- !* )*&'11%"!#%6 *'1(!" 6+-#1+8)#+'* 3+#$ "rand !*6

! O%1' (%!*.

Leff +- ('6%"%6 "+,% Vt 3+#$ ! 6+55%1%*# µ2 "sys2 !*6 "rand 8)#

#$% -!(% #.91'( #$% Vt !*6 Leff ?!1+!#+'*2 3% &'(0)#% #$% Tg ?!1+!#+'*

)-+*; RU)!#+'* E. <% #$%* )-% #$% &1+#+&!" 0!#$ ('6%" '5 N'37

(!* $# &/0 CZF #' %-#+(!#% #$% 51%U)%*&4 -)00'1#%6 84 %!&$ 0+0%"+*%

-#!;%. I$+- ('6%" #!,%- #$% *)(8%1 '5 ;!#%- An&0B +* ! &1+#+&!"0!#$ !*6 #$% *)(8%1 '5 &1+#+&!" 0!#$- AN&0B +* ! -#1)&#)1%2 !*6&'(0)#%- #$% 01'8!8+"+#4 6+-#1+8)#+'* '5 #$% "'*;%-# &1+#+&!" 0!#$

6%"!4 Amax{T&0}B +* #$% -#1)&#)1%. I$+- +- #$% 0!#$ #$!# 6%#%17(+*%- #$% (!:+()( 51%U)%*&4 '5 #$% -#1)&#)1%2 3$+&$ 3% -%# #'

8% 1/ max{T&0}.9'1 -+(0"+&+#42 3% ('6%" ! &1+#+&!" 0!#$ !- n&0 9[D ;!#%- &'*7

*%&#%6 84 ?%14 -$'1# 3+1%- H 3$%1% n&0 +- #$% )-%5)" "';+& 6%0#$'5 ! 0+0%"+*% -#!;%. \*5'1#)*!#%"42 !&&)1!#%"4 %-#+(!#+*; N&0 5'1

%!&$ 0+0%"+*% -#!;% +- 6+5!&)"# 8%&!)-% N&0 +- 6%-+;*7-0%&+!& !*6*'# 0)8"+&"4 !?!+"!8"% 5'1 ! 6%-+;*. >'*-%U)%*#"42 3% !--)(% #$!#

&1+#+&!" 0!#$- !1% 6+-#1+8)#%6 +* ! -0!#+!""47)*+5'1( (!**%1 '* #$%

01'&%--'1 "!4')# H %:&%0# +* #$% ]X2 3$'-% 0!#$- 3% !--)(% *%?%1

!55%&# #$% &4&"% #+(%. 91'( #$% "!4')# !1%! '5 %!&$ 0+0%"+*% -#!;%

!*6 N'3(!* $# &/0K- %-#+(!#% #$!# ! $+;$70%15'1(!*&% 01'&%--'1

&$+0 !# ')1 #%&$*'"';4 *'6% $!- !8')# PV2VVV &1+#+&!" 0!#$- CZF2 3%

6%#%1(+*% #$% &1+#+&!" 0!#$- +* %!&$ -#!;%. I$% -"'3%-# &1+#+&!" 0!#$

+* ! -#!;% 6%#%1(+*%- #$% 51%U)%*&4 '5 #$% -#!;%^ #$% -"'3%-# -#!;%

6%#%1(+*%- #$% 0+0%"+*% 51%U)%*&4.

#" $,2(3,.( 4/+2-+-,&. 5,-6 7(89'3(

#":" ;+,. 0/(+

I' )*6%1-#!*6 #$% +6%! 8%$+*6 =%>4&"%2 &'*-+6%1 #$% 0+0%"+*% '5

9+;)1% XA!B !*6 &!"" Ti #$% #+(% #!,%* #' 0%15'1( #$% 3'1, +* -#!;%

'. 9'1 -+(0"+&+#42 +* #$% !8-%*&% '5 01'&%-- ?!1+!#+'*2 3% !--)(%

#$!# Ti +- #$% -!(% 5'1 !"" ' !*62 #$%1%5'1%2 #$% 0+0%"+*%K- 0%1+'6 +-

TCP = Ti, $i. <$%* ?!1+!#+'* -%#- +*2 +# -"'3- 6'3* -'(% -#!;%-3$+"% +# -0%%6- )0 '#$%1-. S- -$'3* +* 9+;)1% XA!B2 #$% 1%-)"#+*;

)*8!"!*&%6 0+0%"+*% $!- #' 8% &"'&,%6 3+#$ ! "'*;%1 0%1+'6 TCP =Max(Ti), $i.

!"#$%&#

'()"(#"%*

+",-."*-

/00-1#

%0

'()"(#"%*

+",-."*-

20#-)

'()"(#"%*

34%56-781.-9

6/:/;6/4<= <>

+-)"%?

7.%1@5#%

<*"#"(.

6-A"B#-)

7.%1@5#%

="*(.

6-A"B#-)

34%56-781.-9

:"0%)5(..5"

"2CA3:559

<= 6/4 /; 6/:

<=

<>

6/4

<>

/;

6/:

6/:/;<>6/4<=

3(96-781.-9

3!"#$

'()"(#"%*

20#-)

+",-."*-

7.%1@5#%

="*(.

6-A"B#-)

3!"#$

6-781.-93D9

E

:

:

F"*

B@-G

B-#&,: $%.?F(H

E

"I(H3:559

<,=>%( !? R55%&# '5 01'&%-- ?!1+!#+'* '* 0+0%"+*%- A!B !*6 -,%3+*;

! &"'&, -+;*!" A8B.

<+#$ =%>4&"%2 3% &'(01%$%*-+?%"4 !00"4 &4&"% #+(% -#%!"+*;

#' &'11%&# #$+- ?!1+!#+'*7+*6)&%6 0+0%"+*% )*8!"!*&%. I$% 1%-)"#+*;

&"'&, 0%1+'6 '5 #$% 0+0%"+*% +- TCP = Average(Ti), $i. I$+- 0%71+'6 &!* 0'#%*#+!""4 8% -+(+"!1 #' #$!# '5 #$% *'7?!1+!#+'* 0+0%"+*%.

S- -$'3* +* 9+;)1% XA!B2 #$% -"'3 -#!;%- ;%# ('1% #$!* ! &"'&, 0%7

1+'6 #' 01'0!;!#% #$%+1 -+;*!"2 !# #$% %:0%*-% '5 5!-#%1 -#!;%- #$!#

#1!*-5%1 #$%+1 -"!&,.

<+#$ #$+- !001'!&$2 3% 6' *'# *%%6 #' &$!*;% #$% 0+0%"+*% -#1)&7

#)1%2 0+0%"+*% 6%0#$2 '1 #$% +*$%1%*# -3+#&$+*; -0%%6 '5 #1!*-+-#'1-.

325

Page 25: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Uses

  Long Pipelines   ReCycle continues to outperform a non-ReCycle scheme at an

exponential rate as the number of stages are added to a given pipeline

  Donor Stages   Can increase the frequency of a pipeline by adding empty “donor”

stages.   With ReCycle, this essentially behaves in a similar way to adding a

pipeline stage and rebalancing the stages to fit equally in the new pipeline slots.

  This can be done either statically or dynamically   Statically-Donor Algorithm is run a single time on each new chip  Dynamically-Donor Algorithm is run on a “new phase” (as seen by a

phase detector)

Page 26: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Uses

  ReCycling to Feedback Paths   Remembering back to the Future of Wires paper, longer wires—in

this case, feedback paths—make use of repeaters to break the quadratic relation between wire delay and wire length:  Delay = constant*wire_length2

  Non-critical pipeline loops will redirect excess cycle time (slack) to reduce overall repeater usage:   In this case, the slack is used to reduce a feedback path’s requirement

for speed, thereby reducing its dependence on frequent repeaters.

!"# $%&%' ()*%'+,"- .(& /# '0& 1,(,+.())2 (, ,"# -(&03(.,0'#'41

1+,# %&.# %' 52&(-+.())2 (, '0&,+-# -(&2 ,+-#16 7& ,"# 3%'-#' .(1#8

,"# -(&03(.,0'#' "(1 ( '#9'#1#&,(,+:# ;%'<)%(58 (&5 -(<#1 #(."

5#.+1+%& +& ,"# ()*%'+,"- /(1#5 %& ,"# +-9(., %& ,"# 9#'3%'-(&.#

%3 ,"# ;%'<)%(56

73 ,"# $%&%' ()*%'+,"- +1 '0& 52&(-+.())28 ;# '#)2 %& ( 9"(1#

5#,#.,%' (&5 9'#5+.,%' =#6*68 >?@AB ,% 5#,#., 9"(1#1 +& ,"# '0&&+&*

(99)+.(,+%&6 C, ,"# /#*+&&+&* %3 #(." &#; 9"(1#8 ,"# 121,#- '0&1

,"# ;"%)# ()*%'+,"- ,% 5#.+5# ;"(, 5%&%' 1,(*#1 ,% (556 !"# ()*%D

'+,"- %:#'"#(5 +1 ,%)#'(/)# /#.(01# (99)+.(,+%& 9"(1#1 ('# )%&* E

,"# (:#'(*# 9"(1# +1 ,29+.())2 %:#' F@@-16 G%'#%:#'8 50'+&* ,"#

9#'+%5 &##5#5 ,% 9'%!)# ,"# 7HI %3 ( *+:#& 9+9#)+&# .%&!*0'(,+%&

=#6*68 !F@8@@@ .2.)#1B8 ,"# (99)+.(,+%& +1 1,+)) '0&&+&*6J%,#8 "%;#:#'8 ,"(, (, #:#'2 1,#9 +& ,"# $%&%' ()*%'+,"- ,"(, ;#

;(&, ,% ."(&*# ,"# .)%.< 1<#;1 +& ,"# 9+9#)+&#8 ;# &##5 ,% ;(+, 0&,+)

,"# 9+9#)+&# 5'(+&16 I%&1#K0#&,)28 10." %9#'(,+%& +1 (1 #L9#&1+:# (1

( .%1,)2 /'(&." -+19'#5+.,+%&6 !% '#50.# %:#'"#(518 1+&.# 9'%*'(-

9"(1#1 ,#&5 ,% '#9#(,8 ;# #&:+1+%& ,"(,8 (3,#' ,"# 121,#- 1#)#.,1 ,"#

1<#;18 9#'+%58 (&5 5%&%'=1B 3%' ( &#; 9"(1#8 +, 1(:#1 ,"#- +& ( ,(/)#6

!"# 5(,( +& ,"# ,(/)# ;+)) /# '#01#5 +3 ,"# 9"(1# +1 1##& (*(+& +& ,"#

30,0'#6 G%'#%:#'8 (1 (& %9,+%&8 ;# -(2 5#.+5# ,% 1,%9 (3,#' ;# "(:#

(55#5 %&# %' ,;% 5%&%' 1,(*#16

M099%',+&* ,"# (/+)+,2 ,% (55 5%&%' 1,(*#1 &#.#11('+)2 .%-9)+D

.(,#1 ,"# 9+9#)+&# +-9)#-#&,(,+%&6 N%' #L(-9)#8 #L,#&5+&* ,"#

&0-/#' %3 .2.)#1 ,(<#& /2 ( 30&.,+%&() 0&+, +&,'%50.#1 .%-9)#LD

+,2 +& ,"# +&1,'0.,+%& 1."#50)#'6 O# ('# &%, (;('# %3 (&2 ;%'<

%& 121,#-(,+.())2 -(&(*+&* :('+(/)# &0-/#'1 %3 .2.)#1 3%' )%*+.()

9+9#)+&# 1,(*#1 E (),"%0*" 1%-# '#1,'+.,#5 1."#-#1 "(:# /##& '#D

.#&,)2 9'%9%1#5 >P?A =M#.,+%& QB6 O# 9)(& ,% ,('*#, ,"+1 9'%/)#- +&

%0' 30,0'# ;%'<6

!"#" $%&'()* +,-./ 01 23345-./ $-0'&

C ,"+'5 01# %3 R#I2.)# +1 ,% 901" ,"# 1)(.< %3 &%&D.'+,+.() )%%91

,% ,"# )%%914 3##5/(.< 9(,"16 M0." 1)(.< .(& ,"#& /# 01#5 ,% '#D

50.# 9%;#' %' ,% +-9'%:# ;+'# '%0,(/+)+,26 !% 1## ;"28 '#.()) ,"(,

,"# 9+9#)+&# -%5#) ,"(, ;# ('# 01+&* =M#.,+%& S6SB -%5#)1 )%%91 (1

1#,1 %3 1,(*#1 ;+," 3##5/(.< 9(,"16 !"# )(,,#' ('# (/1,'(.,#5 (;(2

(1 %&#D.2.)# 1,(*#1 %3 1+-9)2 ;+'#1 ;+," '#9#(,#'16 R#9#(,#'1 ('#

,29+.())2 +&:#',#'1 ,"(,8 /2 +&,#''09,+&* )%&* ;+'#18 '#50.# ,"# ,%,()

;+'# 5#)(2 >SSA6

!;% )%%91 +& ( 9+9#)+&# .(& /# 5+1T%+&, %' %:#')(99+&*6 N%' #LD

(-9)#8 N+*0'# ? 1"%;1 ,;% %:#')(99+&* )%%91 3'%- N+*0'# F=(BU ,"#

/'(&." -+19'#5+.,+%& %&# (&5 ,"# +&,#*#' )%(5 -+119#.0)(,+%& %&#6

1

IF BPred IntQ IntReg IntExec

DcacheLdStU2

R R

RRR

Load misspeculation loop

Branch misprediction loop

: RepeaterR

1

2

IntMap

2(*%63 !7 VL(-9)# %3 %:#')(99+&* )%%916

7& ()) ,"# 9+9#)+&# )%%91 /0, ,"# .'+,+.() %&#8 ;# 01# R#I2.)# ,%

901" ()) ,"# 1)(.< +& ,"# )%%9 ,% +,1 3##5/(.< 9(,"6 !"+1 5%#1 &%,

(33#., ,"# .2.)# ,+-#6 J%,# ,"(, ( 1,(*# ,"(, /#)%&*1 ,% -0),+9)#

)%%91 "(1 ( 19#.+() 9'%9#',2U +,1 1)(.< +1 ,'(&13#''#5 ,% ,"# 3##5/(.<

9(,"1 %3 !"" ,"# )%%91 +, /#)%&*1 ,%6 N%' #L(-9)#8 +& N+*0'# ?8 ,"#

1)(.< +& ,"# #$%& 1,(*# +1 9(11#5 1+-0),(&#%01)2 ,% /%," 3##5/(.<

9(,"16

W2 (..0-0)(,+&* ,"# 1)(.<1 +& ,"# 3##5/(.< 9(,"18 ;# .(& 9#'D

3%'- ,"# 3%))%;+&* ,;% %9,+-+X(,+%&16

!"#"8" $1936 :34%.0(1)

O+," %9,+-() '#9#(,#' 5#1+*&8 (/%0, Y@Z %3 ,"# 9%;#' +& ( 3##5D

/(.< 9(," +1 5+11+9(,#5 +& ,"# '#9#(,#'1 >S[A6 V)+-+&(,+&* '#9#(,#'1

;%0)5 1(:# 9%;#'8 /0, +, ;%0)5 ()1% +&.'#(1# ,"# 5#)(2 %3 ,"# 3##5D

/(.< 9(,"8 1+&.# ( ;+'# 5#)(2 +1 ' ( )l28 ;"#'# " +1 ,"# )#&*," %3 ,"#;+'# ;+,"%0, '#9#(,#'16 I%&1#K0#&,)28 ;# 9'%9%1# ,% 1(:# 9%;#' /2

#)+-+&(,+&* (1 -(&2 '#9#(,#'1 (1 +, ,(<#1 ,% .%&10-# ()) ,"# 1)(.< +&

,"# 3##5/(.< 9(,"6

O# #&:+1+%& (& #&:+'%&-#&, ;"#'# ,"# -(&03(.,0'#'8 (3,#' -#(D

10'+&* ,"# #33#., %3 9'%.#11 :('+(,+%& %& ( 9(',+.0)(' 9+9#)+&#8 .%0)5

#)+-+&(,# +&5+:+50() '#9#(,#'1 3'%- 3##5/(.< 9(,"1 ,% 1(:# 9%;#'6 7&

,"+1 .(1#8 ;# ;%0)5 9'%.##5 /2 '#-%:+&* %&# '#9#(,#' (, ( ,+-#8 1#D

)#.,+&* !'1, '#9#(,#'1 /#,;##& (5T(.#&, 1"%',#1, ;+'# 1#*-#&,1 =lsB673 ;# (110-# ( ;+'# ;+," '#9#(,#'1 5#1+*&#5 3%' %9,+-() ,%,() 5#)(28

,"# 5#)(2 ,"'%0*" ( '#9#(,#' +1 #K0() ,% ,"# 5#)(2 ,"'%0*" ( ;+'#

1#*-#&, >SSA6 I%&1#K0#&,)28 #)+-+&(,+&* %&# '#9#(,#' +&.'#(1#1 ,"#

5#)(2 3'%- P)l2s ,% )=SlsB28 ;"+." +1 )l2s 6

!"#";" <=>61?34 :1%0-5(,(0@

!"# 1)(.< %3 ,"# 3##5/(.< 9(,"1 .(& +&1,#(5 /# 01#5 ,% #(1# ;+'#

'%0,+&* 50'+&* ,"# )(2%0, 1,(*# %3 9+9#)+&# 5#1+*&6 M9#.+!.())28 ;#

.(& *+:# ,"# '%0,+&* ,%%) -%'# "#L+/+)+,2 ,% #+,"#' )#&*,"#& ,"# ;+'#1

%' 90, ,"#- +& 1)%;#' -#,() )(2#'16 \&3%',0&(,#)28 ,"# '%0,+&* 1,(*#

+1 9'#D3(/'+.(,+%& (&58 ,"#'#3%'#8 ;# 5% &%, <&%; ,"# #L(., 1)(.<

,"(, ;+)) /# (:(+)(/)# 3%' #(." 3##5/(.< 9(," (3,#' 3(/'+.(,+%&6 I%&D

1#K0#&,)28 ,"# (-%0&, %3 )##;(2 *+:#& ,% ,"# '%0,+&* ,%%) "(1 ,% /#

/(1#5 %& 1,(,+1,+.() #1,+-(,#16 O# .(& 01# "#0'+1,+.1 10." (1 *+:+&*

)##;(2 %&)2 ,% ,"# 3##5/(.< 9(,"1 %3 )%%91 ,"(, ('# )%&* E 1+&.#

,"#2 ('# 0&)+<#)2 ,% /# .'+,+.() /#.(01# ,"#2 .(& .%))#., 1)(.< 3'%-

-(&2 1,(*#1 E (&5 *+:+&* &% )##;(2 ,% ,"# 3##5/(.< 9(,"1 %3 ,"#

)%%91 ,"(, ('# :#'2 1"%', E 1+&.# %&# %3 ,"#- +1 )+<#)2 ,% /# ,"# .'+,D

+.() )%%96 7& (&2 .(1#8 #:#& +3 3%' ( 9(',+.0)(' 9+9#)+&#8 ( )%%9 ;"%1#

3##5/(.< 9(," ;(1 '%0,#5 10/%9,+-())2 #&51 09 /#+&* ,"# .'+,+.()

)%%98 ;# "(:# &%, "0', .%''#.,&#11U R#I2.)# ;+)) 1+-9)2 ."%%1# (

1)+*")2 )%&*#' .)%.< 9#'+%5 ,"(& +, ;%0)5 "(:# ."%1#& %,"#';+1#6

O# )(.< ,"# +&3'(1,'0.,0'# ,% 9'%9#')2 #:()0(,# ,"+1 %9,+-+X(,+%&6

]%;#:#'8 5+1.011+%&1 ;+," M2&%9121 5#1+*&#'1 10**#1, ,"(, ,"# )##D

;(2 ,"(, R#I2.)# 9'%:+5#1 ;%0)5 #(1# ,"# T%/ %3 '%0,+&* ,"# 3##5D

/(.< 9(,"1 %3 ,"# 9+9#)+&#6

!"!" +-,?-*()* A'(>& :3B3.034 C%3 01 D1,4 E(1,-0(1)&

C !&() 01# %3 R#I2.)# +1 ,% 1():(*# ."+91 ,"(, ;%0)5 %,"#';+1#

/# '#T#.,#5 50# ,% :('+(,+%&D+&50.#5 "%)5D,+-# 3(+)0'#16 !"+1 +1 (

19#.+() .(1# %3 R#I2.)#41 01# %3 .2.)# ,+-# 1,#()+&* ,% +-9'%:#

9+9#)+&#1 (3,#' 3(/'+.(,+%&6 ]%;#:#'8 ;"+)# .%''#.,+&* 1#,09 :+D

%)(,+%&1 =:+%)(,+%&1 %3 VK0(,+%& ?B .(& /# (..%-9)+1"#5 ,"'%0*"

%,"#'8 &%&DR#I2.)# ,#."&+K0#18 .%''#.,+&* "%)5 :+%)(,+%&1 =:+%)(D

,+%&1 %3 VK0(,+%& YB (3,#' 3(/'+.(,+%& ;+," %,"#' ,#."+K0#1 +1 "('5#'6

M9#.+!.())28 ( 1#,09D,+-# 9'%/)#- .(& /# .%''#.,#5 /2 +&.'#(1+&*

,"# 9+9#)+&#41 .)%.< 9#'+%56 ]%;#:#'8 .%''#.,+&* ( "%)5D,+-# 9'%/D

)#- (3,#' 3(/'+.(,+%& .(& /# 5%&# %&)2 ;+," ,'+.<+#' ,#."&+K0#1 10."

(1 1)%;+&* 5%;& .'+,+.() 9(,"1 /2 5#.'#(1+&* ,"# :%),(*# E ;+," (&

(5:#'1# #33#., %& &%+1# -('*+&16 C1 ( '#10),8 ."+91 ;+," "%)5D,+-#

9'%/)#-1 ,29+.())2 #&5 09 /#+&* 5+1.('5#56

!"# R#I2.)# 3'(-#;%'< 1#(-)#11)2 !L#1 9+9#)+&#1 ;+," "%)5

3(+)0'#16 R#3#''+&* ,% VK0(,+%& Y8 ( "%)5 3(+)0'# -(<#1 ,"# '+*", 1+5#

&#*(,+:# 3%' 1%-# 9+9#)+&# 1,(*#8 /0, R#I2.)# .(& -(<# ,"# )#3, 1+5#

&#*(,+:# (1 ;#))6 R0&&+&* ,"# R#I2.)# ()*%'+,"- %3 M#.,+%& P6S

328

Page 27: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle - Uses

  More on Feedback Path ReCycling   By reducing the number of repeaters, we are also saving a great

deal of power.   Research also shows that a reduction in repeaters could be

very helpful for future power reductions Technology Node (Nm)

0.18 0.15 0.12 0.1 0.07 0.05 2 0 0 3

1898 2000 2002 2004 2006 2008 2010 2012 Year

Figure 5: Repeater power dissipation as function of tech

node. ITRS dictated total chip power budget also shown.

optimally on a single wire. This capacitance can be obtained by multiplying a single repeater capacitance (6SoptCmos) by the

number of repeaters on a wire (L/lopt), where, lo, and Sqpt can be obtained using (1) and (4), respectively. The expression, thus obtained, is independent of the wire resistance, and is given by:

Where, Crepperline is the total capacitance of all delay optimized repeaters on a single wire and Cline is the capacitance of a single wire. Since we established before that all global wires use repeaters, the total power dissipated due to global wires is approximately same as that due to all global repeaters. Hence, the total power dissipation approximately doubles, yielding about 120 Watts of power at 50 nm node with a Rent's exponent of 0.55. This can be a substantial fkaction of the total chip power.

3. REPEATER POWER MINIMIZATION METHODOLOGY

The exorbitant power consumption due to delay-optimized repeaters at future technology nodes can be of serious concem. A simple method to reduce repeater power is to decrease the repeater size and/or space them hrther apart. Both these solutions lead to a delay penalty. In this section, we develop a novel formulation which optimizes the separation and sizing of the repeaters such that the power savings is maximized for a given delay penalty.

The expression for delay due to repeaters which are spaced distance 1 apart and whose NMOS transistor is sized, S, (channel width to length ratio) can be simply obtained by applying Elmore delay model to a simplified RC network for a stage (one repeater to the next) and is given by

= L[

b ( l + e ) ( 1 + f)roCnmos + aR C I + !@& f b(l + e)RwC,,,,,S

S w w

I

Here, L is the length of the wire, a and b are switching model dependent parameters. If we assume that the output of the repeaters

switches when the input reaches half of the voltage swing, a and b are found to be about 0.4 and 0.7, respectively [12]. Parameter e is the ratio of the PMOS to the NMOS size and f is the ratio of the diffusion capacitance to the gate capacitance of the transistors. Equation (8) can be optimized independently with respect to S and 1 to give minimum delay. This yields

zrpOpt =ZL(Jab(l+e)(l+f ) + b & b w (9)

For the typical value of e=2 (PMOS sized hvice of NMOS), f=l

(diffusion capacitance is same as gate capacitance), and above

stated a and b values, (10) and (11) reduce to (1) and (4), respectively. Now, in an attempt to reduce power, we decrease S and increase 1, such that S = xsSopt and 1 = lopt/xl. Here, x, and XI are

less than one and denote the fractional change in sizing and spacing from delay optimal values. The total wire dela), can be written. as

rv=L(,/-

For x, and xI equal to 1, (12) reduces to (9). 'The delay penalty, p, expressed as a ratio of delay with sub-optimal (xs and xI not equal

to 1) repeaters to that with delay optimized repeaters (x, and XI

equal to 1) can be written as

where, A = ~ (14) Ja(Y Next, we examine the power consumption of a single repeated

wire due to its capacitance and the capacitance of repeaters on it. This power for the delay sub-optimal case (general form) is:

(15)

l 2 L cw + ( I -k f )( [+ e)Cnmos -- x s fclock

L p t

= sw ' fclock x s nl A )

The first and the second terms in the parenthesis correspond to the wire and the repeater contributions, respectively. For delay optimal case, where xs=xl=l, the ratio of the Capacitance of all the repeaters on a single wire to the wire capacitance becomes equal to A. For a reasonable value of 6 1 , A is 1.07 from (14), agreeing with (7).

The amount of power saving obtained per wire can be expressed as the ratio of the total power per wire in the power saving

repeaters to that in the delay optimized repeaters (6). This is easily obtained using (1 5 ) and is given by

We propose that using the expressions for delay penalty, (13) and power savings, (1 6), one can find x, and xI , such that, for a required power saving, minimum delay penalty is incurred, or vice versa.

This condition can be achieved by substituting xj expressed in

terms of 6 and x, from (16), into the expression for p, and

minimizing p with respect to x,. The minimum p and the

corresponding xSopt and xlopt are obtained to ba the following

464

Authorized licensed use limited to: U niv of C alif S an D iego. D ownloaded on February 5 , 2009 at 17:36 from IE E E X plore . R estrictions apply.

to accommodate the wires at far future nodes. For the present and near future technology nodes, the allocated metal layers appear to

be in excess of the number required. However, owing to a large number of wires on the chip, slight increase in the pitch will lead to a rapid increase in metal levels. Thus, we don’t expect a significant

deviation in the average wire pitch from the ITRS dictated pitch even for near term technology nodes.

Technology Node (km)

0.18 0.15 0.12 0.1 0.07 0.05

1 Allocated for all I 14---)--7-----

1 2 -

YI

E 1 0 - m

m

- -

E 0

b 6 -

z 4 - E,

-A- p = 0 6 t y p e 2 f w i r e s 1 * ITRS projections -9- p = 0 5 5

2 - Required for only

signal wires

(1)

Here, C,,,, and r, are the capacitance and resistance of the minimum sized NMOS transistor, respectively. R, and C, are the resistance and capacitance per unit length of wires respectively. We

find that I,,, for global wires is always less than the minimum

global wire length. Hence, all global wires will have repeaters on them. We call the length, beyond which repeaters are inserted, as

the crossover length. In our case, this length is the same as the minimum global wire length. Thus, for a wire of length I, the number of repeaters on that wire is:

lop, = 3 . 2 ‘ f d ~ yo Cnmos

0, if 1 1 crossover

nrepeo,er(4 = (2)

(round - ) - I , otherwise Lt 1 Using the statistical wire length distribution, the minimum global

wire length, and the number of repeaters at a given length from (2), we compute the total number of repeaters, Nrepeater The resulting

number of repeaters, for two Rent’s exponents of 0.55 and 0.6, are

shown in Fig. 4, for realistic as well as ideal copper resistivity. The global signal wire repeaters are found to be as high as 5.5 million at

the 50 nm technology node with reasonable copper resistivity and a Rent’s exponent of 0.55. We compare our repeater number estimates with those obtained by other authors [4], [15] at the 70

nm technology node (Table 2). Our prediction of about 0.85 million

repeaters, for a Rent’s exponent of 0.55, lies between the two numbers predicted by references [15] and [4], where as, a Rent’s exponent of 0.6 yields results which match well with [4]. The repeater estimate obtained in [ 151 is quite less because in this work the global wires are kept at a constant pitch at future nodes.

Technology Node ( wm)

0.18 0.15 0.12 0.1 0.07 0.05

2000 2002 2004 2006 2008 2010 2012 Year

Figure 4: Total no. of repeaters on global wires as a

function of tech. node for different p (Rent’s exponent)

Table 2: Comparison of no. of repeaters of our approach with

previous work. The numbers shown are for 70 nm technology

Number of Number of

repeaters repeaters approach, approach, estimated by Estimated by p=0.55 p=0.6 1 [I51 1 Our 1 Our 1 [4]

1.6 million 0.2 million 0.85 million 1.61 million I I I I

2.2.3 Power Due to Delay optimized Repeaters The short circuit power of repeaters is neglected in our analysis.

For estimating dynamic power, the capacitance due to all the

repeaters on global wires, Crepeater, is given by

Sop, = 0.58 ___ (4)

( 5 )

R w Cnmos d Where,

and Cnmos = C g (2n)

Here, So, is the optimal sizing of the NMOS in the repeater [3],

[I I]. C, is the NMOS gate capacitance per micron, and is expected to stay constant at about 1.75 fF/lm for future technology nodes [9]. For a repeater, PMOS is assumed to be twice as large as

NMOS. Also the di&sion capacitance is assumed to be the same as the gate capacitance. This leads to 6 times the NMOS gate capacitance in (3). The total dynamic power dissipation due to

repeaters is

( 6 ) -

Prepeater - Sw Crepeater V 'frock

Where, s, is the switching activity factor, and V and fclock are supply voltage and clock frequency, respectively. For a reasonable

switching activity of 0.15 [16], the power dissipation due to global

wire repeaters for future technology nodes is shown in Fig. 5. It is evident that the added power dissipation due to repeaters is a

serious problem in the future. At 50 nm technology node, with a reasonable Rent’s exponent of 0.55 [13] and using ideal copper resistivity, the repeater power dissipation is about 50 Watts, and

with realistic copper resistivity it is about 60 Watts. The resistance plays a role in repeater power as it dictates the crossover length

beyond which repeaters are inserted. The power numbers are much

worse for a Rent’s exponent of 0.6.

2.2.4 Power Due to Global Wires The power dissipation due to global wires themselves can be

simply obtained from the repeater power by realizing an interesting fact regarding the total capacitance of all the repeaters placed

463

Authorized licensed use limited to: U niv of C alif S an D iego. D ownloaded on February 5 , 2009 at 17:36 from IE E E X plore . R estrictions apply.

Page 28: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle - Uses

  More on Feedback Path ReCycling   Also, with more cycle time for wires versus repeaters, a

designer is allowed more freedom with routing.

  Catering to the Average Case   Designs equipped with ReCycle will also have the ability to

correct hold violations post-fabrication.  Greater yield -> Lower prices

Page 29: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Implementation

!"#$%"&'()*$#

'()*$#

'+",-#./+!0(1"02

'()*$#3455"02

!"#

6./$#7"8"$9"02:0(&

-#./+

!$#

:"*;

<*9=

!"#$%& '( %&'()*+ ,-' ./0.& 1)+*$/2 ./0.& 3)1,4)"5,)0* *',(04& !$# $*3 .)4.5),46 ,0 .-$*+' ,-' 3'/$6 07 ,-' 1)+*$/ !"#8

()// .09:5,' ,-' 0:,)9$/ 4'+)1,'4 1&'(1 704 $// 1,$+'1 ,0 9$&' 15.-

$ :):'/)*' 4'51$"/'8

') *+,-&+&./0/"1. *22$&2

;'<6./' -$1 ,-4'' .09:0*'*,12 ,5*$"/' 3'/$6 "577'41= ,-' 107,>

($4' 161,'9 9$*$+'4= $*3 35:/).$,' 4'+)1,'418 ?* $33),)0*= ), .$*

0:,)0*$//6 -$@' $ :-$1' 3','.,04 $*3 :4'3).,04= $*3 ,'9:'4$,54' 1'*>

10418 ?* ,-)1 1'.,)0*= (' 0@'4@)'( ,-')4 )9:/'9'*,$,)0* $*3 ,-'*

1-0( ,-' 0@'4$// ;'<6./' 161,'98

')3) 4$.05-& 6&-07 8$99&%2

;'<6./' 51'1 A5*$"/' B'/$6 C577'41 !ABC# )* ,-' ./0.& *',>

(04& ,0 )*,'*,)0*$//6 1&'( ,-' 1)+*$/ ,-$, 4'$.-'1 )*3)@)35$/

:):'/)*' 4'+)1,'418 A-)1 .$* "' '$1)/6 30*'8 D)+54' E!$# 1-0(1 $

.0*@'*,)0*$/ ./0.& *',(04&= (-'4' ,-' ./0.& 1)+*$/ )1 3)1,4)"5,'3

,-405+- $ 95/,)>/'@'/ *',(04& F 515$//6 $ "$/$*.'3 G ,4''8 A-'

1)+*$/ 34)@'* "6 ,-' ./0.& +'*'4$,04 )1 "001,'3 "6 4':'$,'41 $*3

"577'4'3 )* 1)+*$/ "577'41 F $, /'$1, 0*.'= "5, 07,'* $, $ 7'( /'@'/1

F "'704' 34)@)*+ $ /0.$/ ./0.& +4)38 H /0.$/ ./0.& +4)3 ./0.&1 $

:):'/)*' 1,$+'8 A-)1 95/,)>/'@'/ ,4'' )1 1,4$,'+).$//6 :$4,),)0*'3 )*,0

I0*'1 ,-$, 70//0( :):'/)*'>1,$+' "05*3$4)'18

J' 4':/$.' ,-' /$1, 1)+*$/ "577'4 $, '$.- I0*$/ /'@'/ )* D)+54' E!$#

(),- $ ABC= .$:$"/' 07 )*K'.,)*+ $* )*,'*,)0*$/ 1&'( )*,0 ),1 ./0.&>

)*+ 15",4''8 A-)1 .$* "' 30*' "6 1)9:/6 $33)*+ $ .)4.5), ,0 3'/$6

,-' ./0.& 1)+*$/= 704 'L$9:/' $1 1-0(* )* D)+54' E!"#8 H 1,4)*+ 07

)*@'4,'41 )1 ,$::'3 )*,0 $, 3)77'4'*, :0)*,1 ,0 1$9:/' ,-' 1)+*$/ $,

3)77'4'*, )*,'4@$/1= $*3 ,-'* $ 95/,):/'L'4 )1 51'3 ,0 1'/'., ,-' 1)+>

*$/ (),- ,-' 3'1)4'3 3'/$68 H 1)9)/$4 3'1)+* )1 51'3 )* ,-' ?,$*)59

./0.& *',(04& MNOP F )* ,-')4 .$1' ,0 '*154' ,-$, $// 1)+*$/1 4'$.-

,-' 1,$+'1 (),- ,-' 1$9' 1&'(8

A-' ABC ),1'/7 .05/3 "' 15"K'., ,0 @$4)$,)0*8 A-)1 .$* "'

$@0)3'3 "6 1)I)*+ ),1 ,4$*1)1,041 /$4+'48

'):) ;72/&+ <0.0#&%

J' :40:01' ,0 )9:/'9'*, ,-' ;'<6./' $/+04),-9 )* $ :4)@)/'+'3

107,($4' -$*3/'4 ,-$, 'L'.5,'1 "'/0( ,-' 0:'4$,)*+ 161,'9 /)&' ,-'

%61,'9 Q$*$+'4 !%Q# )* R'*,)59 O MSTP8 A-' ;'<6./' $/+04),-9

.03' $*3 ),1 3$,$ 1,45.,54'1 $4' 1,04'3 )* ,-' %Q ;HQ8 J-'* $

%61,'9 Q$*$+'9'*, ?*,'445:, !%Q?# )1 +'*'4$,'3= ,-' ;'<6./' %Q

-$*3/'4 )1 )*@0&'38 A-' -$*3/'4 3','49)*'1 ,-' *'( :):'/)*' 4'+)1>

,'4 1&'(1 $*3 :40+4$91 ,-'9 )*,0 ,-' ABC18 ?, $/10 3','49)*'1 ,-'

*'( .6./' ,)9'8 H1 )*3).$,'3 )* %'.,)0* S8U= ,-' ;'<6./' $/+04),-9

:'470491 $"05, O=OVV "$1). 0:'4$,)0*1 704 054 :):'/)*'= (-).- ,$&'

$405*3 WEV*1 0* $ XYGI :40.'11048

H* %Q? .$* "' +'*'4$,'3 )* ,(0 .$1'12 (-'* .-): .0*3),)0*1

15.- $1 ,'9:'4$,54' .-$*+' !%'.,)0* S8S# 04= )7 30*04 1,$+'1 $4' 15:>

:04,'3= (-'* ,-' .544'*,/6 45**)*+ $::/).$,)0* '*,'41 $ *'( :-$1'

!%'.,)0* O8U#8 ?* ,-' 7049'4 .$1'= ,-' %Q? )1 +'*'4$,'3 "6 1'*1041

,-$, 3','., (-'* :$,- 3'/$61 .-$*+'= 15.- $1 $ ,'9:'4$,54' 1'*1048

?* ,-' /$,,'4 .$1'= ,-' %Q? )1 +'*'4$,'3 "6 $ :-$1' 3','.,04 $*3 :4'>

3).,04= 15.- $1 ,-' -$43($4' 5*), :40:01'3 "6 %-'4(003 !" #$% MOVP8

')=) 6$,-">0/& ?&#"2/&%2

A0 $::/6 ,-' B0*04 %,$+' 0:,)9)I$,)0* 07 %'.,)0* O8U= (' )*>

./53' 0*' 35:/).$,' 4'+)1,'4 )* '$.- $&'()#$ :):'/)*' 1,$+' F 704

'L$9:/'= )99'3)$,'/6 $7,'4 ,-' 05,:5, :):'/)*' 4'+)1,'4 07 ),1 /$1,

:-61).$/ 1,$+'8 C6 3'7$5/,= ,-'1' 35:/).$,' 4'+)1,'41 $4' 3)1$"/'3Z

(-'* 0*' )1 '*$"/'3= ), .4'$,'1 $ 30*04 1,$+'8

R4'@)051 (04& 0* @$4)$"/' :):'/)*'>3':,- )9:/'9'*,$,)0*1

1-0(1 -0( :):'/)*' 4'+)1,'41 .$* "' 9$3' ,4$*1:$4'*, 51)*+ :$11>

,4$*1)1,04 95/,):/'L)*+ 1,45.,54'1 MSSP8 ?* 054 3'1)+*= ,-' 1)*+/'>"),

'*$"/'[3)1$"/' 1)+*$/1 07 $// 35:/).$,' 4'+)1,'41 $4' .0//'.,'3 )* $

1:'.)$/ -$43($4' 4'+)1,'4 .$//'3 B0*04 <4'$,)0* 4'+)1,'48 %5.- 4'+>

)1,'4 )1 1', )* :4)@)/'+'3 903' "6 ,-' ;'<6./' %Q -$*3/'48

')@) AB&%0-- ?&C7>-& ;72/&+

A-' 0@'4$// ;'<6./' 161,'9 )1 1-0(* )* D)+54' X8 A-' !+54'

1-0(1 0*' /0+).$/ 1,$+' .09:4)1'3 07 0*' :-61).$/ 1,$+'8 A-' 35>

:/).$,' 4'+)1,'4 07 ,-' :4'@)051 1,$+' !1-0(* )* 3$1-'3 /)*'1# )1 *0,

'*$"/'3= "5, ,-' 0*' 07 ,-)1 1,$+' )18

$*&>?0"&(/9.0?@$2">!"9"/9.0

7")(29"0!.*.0>-0"$9(.*

A*$B#"A*$B#"

C!3 C!3 C!3 C!3

D'.59,$0"E

'%29"F>G$*$)"0

C"F8"0$940"'"*2.0

'G< 'G<

!"#$%& D( \@'4$// ;'<6./' 161,'98

A-' -$43($4' 0@'4-'$3 07 ;'<6./' )1 $1 70//0(18 D04 '$.- /0+>

).$/ 1,$+'= ;'<6./' $331 $ 35:/).$,' :):'/)*' 4'+)1,'4 $*3 ),1 ABC8

H ABC )1 $ 1)+*$/ "577'4 /)&' ,-01' )* .0*@'*,)0*$/ :):'/)*'1 $5+>

9'*,'3 (),- $ .-$)* 07 )*@'4,'41 $*3 $ 95/,):/'L'48 Q04'0@'4= 704

'$.- :-61).$/ :):'/)*' 1,$+'= ;'<6./' $5+9'*,1 ,-' 'L)1,)*+ ./0.&

1)+*$/ "577'4 (),- $ .-$)* 07 )*@'4,'41 $*3 $ 95/,):/'L'48 D)*$//6= ;'>

<6./' $331 ,-' B0*04 <4'$,)0* 4'+)1,'48 \:,)0*$//6= ;'<6./' $/10

51'1 $ :-$1' 3','.,04 $*3 :4'3).,04= $*3 ,'9:'4$,54' 1'*10418 %'.>

,)0* X8U ]5$*,)!'1 ,-'1' 4'1054.'1 704 ,-' $.,5$/ :):'/)*' 903'/'38

329

Page 30: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Simulation/Results

  Simulation Model   Alpha 21264-based   64 KB L1 I/D Caches   2MB L2 Cache

  Balanced Pipeline Stages   45 nm Feedback wire proces

Page 31: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Results

  ReCycle is able to reclaim almost 60% of the frequency lost to process variation.   The simulation was fixed at a useful logic depth per stage of

17FO4 (measure of delay)

Page 32: Razor and ReCycle - Computer Science and Engineeringcseweb.ucsd.edu/classes/wi09/cse240c/Slides/17_Razor_and_ReCycle.pdfRazor and ReCycle . ... Counterflow pipeline in the synchronous

ReCycle – Results !" #$ %&'( !"#) *+,- ,- .%/ '.%01+ /% 2%34'.-5/' /+' '66'2/ %6

&5(,5/,%.7 -,.2' $%!"# ,- 89$ 65-/'( /+5. !"#) :%;'&'(7 &'()*

+,'-./0%1%# 5.< &'()+,'-0)10%1%#3%(' /+5. ('2%&'( /+' =%--'-

<0' /% &5(,5/,%.7 -,.2' /+'" 5(' 8>$ 5.< 8?$ 65-/'( /+5. !"#7 ('@

-4'2/,&'=")

!"#$%& '() A'(6%(35.2' %6 <,66'('./ '.&,(%.3'./-)

!"#$%& '*) B".53,2 4%;'( 6%( 2%.-/5./ 4'(6%(35.2')

! " #$ #% #& #! #"

$#$

%$

'$

&$

()*+,-.-/012.3*4.)560*789&:

;*3*65*4).4*</=*>.7?:

!

!!

!! !

! ! ! ! !!

! !@$A#!@$A'!@$AB!@$AC

!"#$%& '+) C(52/,%. %6 ('4'5/'(- '=,3,.5/'< !" D'E"2=')

D'E"2=' 1'.'(5/'- /,3,.1 -=52F /+5/ ;' 0-' /% ,.2('5-' /+' 6('@

G0'.2") H./'('-/,.1="7 ;' 2%0=< 0-' /+' -=52F 1'.'(5/'< !" D'E"2='

/% -5&' 4%;'( ,.-/'5<) I4'2,!25=="7 ;' 2%0=< 0-' <".53,2 &%=/51'

-25=,.1 /% !(,.1 /+' 6('G0'.2" !52F /% /+' !"# ='&'= ;+,=' ('<02,.1

/+' %4'(5/,.1 &%=/51'J -5&,.1 <".53,2 4%;'( ,. /+' 4(%2'--) I,3@

,=5(="7 ;' 25. <% /+' -53' /+,.1 ;,/+ /+' -=52F 2('5/'< !" /+' B%.%(

%4/,3,K5/,%.7 'L2'4/ /+5/ .%; ;' ;5./ /% (%== !52F /+' 6('G0'.2"

0./,= ;' 1'/ /+' -53' 4'(6%(35.2' 5- !"#)

*+' ('-0=/- %6 /+'-' 'L4'(,3'./- 544'5( ,. C,10(' 8>) *+' !10('

-+%;- /+' <".53,2 4%;'( 2%.-03'< !" /+' '.&,(%.3'./- %6 C,1@

0(' 89 5/ +%12/"1/ 3'#4%#5"1+') M' -'' /+5/ D'E"2=' -5&'- %. 5&@

'(51' N?$ %6 /+' !"# <".53,2 4%;'() O%('%&'(7 &'()+,'-./0%1%#

5.< &'()+,'-0)10%1%# -5&' >P$ 5.< >Q$7 ('-4'2/,&'=" %6 /+'!"#

<".53,2 4%;'() *+'-' 5(' -,K5!=' 4%;'( ('<02/,%.-)

,-*- ./"0"123"1# 4&5&23&%6

I'2/,%. 9)Q)8 4(%4%-'< 0-,.1 D'E"2=' /% 40-+ /+' -=52F %6 .%.@

2(,/,25= =%%4- /% /+',( 6''<!52F 45/+- 5.< /+'. 2%.-03,.1 ,/ !" '=,3@

,.5/,.1 ('4'5/'(- /+'(' J 5.< -5&,.1 4%;'( ,. /+' 4(%2'--) C,1@

0(' 8? -+%;- /+' 4'(2'./51' %6 ('4'5/'(- ,. 6''<!52F 45/+- /+5/ D'@

E"2=' '=,3,.5/'- 6%( <,66'('./ &5=0'- %6 /+' 0-'60= =%1,2 <'4/+ 4'(

4,4'=,.' -/51' 5.< !)

*+' !10(' -+%;- /+5/ D'E"2=' ('3%&'- 8>@QP$ %6 /+' ('4'5/'(-7

<'4'.<,.1 %. /+' =%1,2 <'4/+ 4'( -/51') *+' +,1+'( '66'2/,&'.'--

2%(('-4%.<- /% 4,4'=,.'- ;,/+ ='-- =%1,2 4'( -/51') *+,- ,- !'250-'7

5- ;' 35F' /+' 4,4'=,.' =%.1'( 5.< -/51'- -+%(/'(7 ;' +5&' 3%('

0.!5=5.2' 52(%-- -/51'-) *+,- ('-0=/- ,. .%.@2(,/,25= =%%4- +5&,.1

3%(' -=52F) *+' !,11'( /+' -=52F ,-7 /+' 3%(' ('4'5/'(- ;' 25.

('3%&') R. /+' %/+'( +5.<7 /+' &5=0' %6 ! +5- =,//=' '66'2/) R&'(5==7-,.2' !>P$ %6 /+' 4%;'( ,. 6''<!52F 45/+- ,- ,. ('4'5/'(- SN?T7

D'E"2=' 25. -5&' !U)>@8>$ %6 /+' 4%;'( ,. 6''<!52F 45/+-)

7- 4&/23&8 9:%;

D'E"2=' ,- 5. 5(2+,/'2/0(5= 6(53';%(F 6%( 4,4'=,.'- /+5/ 2%3@

4('+'.-,&'=" 4'(6%(3- 2"2=' /,3' -/'5=,.1 56/'( 65!(,25/,%. V',/+'(

-/5/,25==" 5/ /+' 35.0652/0('( -,/' %( <".53,25==" !5-'< %. %4'(5/@

,.1 2%.<,/,%.-W /% /%='(5/' 4(%2'-- &5(,5/,%.) *+' 3%-/ ('=5/'< 5(@

'5- %6 ('-'5(2+ 5(' /+%-' %6 2=%2F -F'; %4/,3,K5/,%. 5.< 5<54/,&'

4,4'=,.,.1)

</:=; 6;&> :53"0"?23":1 +5- !''. ;'== -/0<,'< ,. /+' 2,(20,/- 2%3@

30.,/" S9T) H/ +5- !''. 544=,'< !%/+ 5/ <'-,1. /,3' 5.< 56/'( 65!(,25@

/,%. /% ,34(%&' 2,(20,/ /,3,.1 35(1,.-) C,-+!0(. S8XT ;5- /+' !(-/ /%

4(%4%-' 5 =,.'5( 4(%1(533,.1 6%(30=5/,%. /% !.< /+' %4/,35= 2=%2F

-F';- ,. 5 2,(20,/)

I'&'(5= ;%(F- 0-' 2=%2F -F';,.1 /% 5<<('-- /+' 4(%!='3 %6 4(%@

2'-- &5(,5/,%.) O%-/ %6 /+'3 544=" -F';,.1 /% =5/2+ '='3'./- ,. /+'

2=%2F <,-/(,!0/,%. .'/;%(F /% 2%34'.-5/' 6%( /+' '66'2/- %6 4(%2'--

&5(,5/,%. %1 /6' +,%+7 3"/6 8',")2 /+'3-'=&'- V')1)7 S98TW) C%( 'L@

534='7 H/5.,03 +5- !066'(- ,. /+' 2=%2F .'/;%(F /+5/ <".53,25=="

8'27'9 /+' -,1.5= J ,)')7 '.-0(' /+5/ /+' 2=%2F -,1.5= ('52+'- 5== /+'

45(/- %6 /+' 4(%2'--%( ;,/+ /+' -53' -F'; S89T) R. /+' %/+'( +5.<7

Y,5.1 5.< Z(%%F- SQ8T 0-' 2=%2F -F';,.1 /% !5=5.2' /;% 4,4'=,.'

-/51'-) I4'2,!25=="7 /+'" 0-' 2"2=' /,3' -/'5=,.1 !'/;''. /+' ('1,-@

/'( !=' 5.< 'L'20/' -/51'- 5.<7 ;,/+ ='&'=@-'.-,/,&' =5/2+'-7 !'/;''.

-/51'- ,. /+' "%5/,.1@4%,./ 0.,/)

*+' %.=" ;%(F /+5/ 544=,'- 2"2=' /,3' -/'5=,.1 ,. 5 -"-/'35/,2

35..'( ,. 5 4,4'=,.' ,- /+5/ %6 Y'' '/ ",: SQPT) *+'" 0-' ,/ ,. /+'

2%./'L/ %6 D5K%( /% !5=5.2' 4,4'=,.' '((%( (5/'-) *+'" .',/+'( 544="

,/ /% 4(%2'-- &5(,5/,%. .%(7 3%(' ,34%(/5./="7 -/0<" /+' ,3452/ %6

4,4'=,.' -/(02/0(' -02+ 5- 4,4'=,.' <'4/+ %( =%%4 %(15.,K5/,%. %. /+'

4'(6%(35.2' %6 2"2=' /,3' -/'5=,.1)

@8253"A& 5"5&/"1"1# 3&=B1"C$&6 5(' ('=5/'< /% %0( <%.%( -/51' %4@

/,3,K5/,%.) [%445.5=,= '/ ",: SN#T -/0<" /+' '66'2/ %6 <".53,25=="

3'(1,.1 4,4'=,.' -/51'- /% 'L/'.< /+' 6('G0'.2" (5.1' %6 <".53,2

&%=/51' -25=,.1\ /+'" <% .%/ 'L4=,2,/=" <'-2(,!' /+' ,34='3'./5/,%.

<'/5,=- %6 /+',( -2+'3') ]6/+"3,%0 '/ ",: S8?T 0-' 5-".2+(%.%0-

<'-,1. /'2+.,G0'- /% 5<54/,&'=" 3'(1' 5<^52'./ -/51'- ,. 5. '3@

!'<<'<7 -,.1='@,--0' 4(%2'--%( 4,4'=,.') _=!%.'-, SNT 4(%4%-'-

<".53,25=="@&5(",.1 60.2/,%.5= 0.,/ =5/'.2,'- 5- 5. 5<54/,&' 4(%@

2'--,.1 -2+'3'7 !0/ +' <%'- .%/ <,-20-- /+' ('-0=/,.1 -2+'<0=,.1

2%34='L,/,'-) D'2'./="7 RK<'3,( '/ ",: SQ9T 5<<('-- /+' ,--0' %6

-2+'<0=,.1 2%34='L,/" 6%( &5(,5!='@522'-- Y8 252+' !" 0-,.1 5<@

<,/,%.5= =%5<@!"45-- !066'(-) C,.5=="7 2%.20(('./=" ;,/+ %0( ;%(F7

Y,5.1 5.< Z(%%F- SQ8T 4(%4%-' ,.-'(/,.1 ='&'=@-'.-,/,&' =5/2+'- ,.@

-,<' /+' CA 0.,/ /+5/ 25. !' '.5!='< 56/'( 65!(,25/,%.) H6 4(%2'--

&5(,5/,%. ,- -02+ /+5/ /+' 0.,/ <%'- .%/ 3''/ /,3,.17 /+' =5/2+'- 5('

'.5!='<7 5<<,.1 %.' 'L/(5 2"2=' /% /+' CA 0.,/)

D- <:1=/$6":16

A(%2'-- &5(,5/,%. 566'2/- 4(%2'--%( 4,4'=,.'- !" 'L52'(!5/,.1

4,4'=,.' 0.!5=5.2' 5.< ('<02,.1 /+' 5//5,.5!=' 6('G0'.2") *% /%='(@

333

!" #$ %&'( !"#) *+,- ,- .%/ '.%01+ /% 2%34'.-5/' /+' '66'2/ %6

&5(,5/,%.7 -,.2' $%!"# ,- 89$ 65-/'( /+5. !"#) :%;'&'(7 &'()*

+,'-./0%1%# 5.< &'()+,'-0)10%1%#3%(' /+5. ('2%&'( /+' =%--'-

<0' /% &5(,5/,%.7 -,.2' /+'" 5(' 8>$ 5.< 8?$ 65-/'( /+5. !"#7 ('@

-4'2/,&'=")

!"#$%& '() A'(6%(35.2' %6 <,66'('./ '.&,(%.3'./-)

!"#$%& '*) B".53,2 4%;'( 6%( 2%.-/5./ 4'(6%(35.2')

! " #$ #% #& #! #"

$#$

%$

'$

&$

()*+,-.-/012.3*4.)560*789&:

;*3*65*4).4*</=*>.7?:

!

!!

!! !

! ! ! ! !!

! !@$A#!@$A'!@$AB!@$AC

!"#$%& '+) C(52/,%. %6 ('4'5/'(- '=,3,.5/'< !" D'E"2=')

D'E"2=' 1'.'(5/'- /,3,.1 -=52F /+5/ ;' 0-' /% ,.2('5-' /+' 6('@

G0'.2") H./'('-/,.1="7 ;' 2%0=< 0-' /+' -=52F 1'.'(5/'< !" D'E"2='

/% -5&' 4%;'( ,.-/'5<) I4'2,!25=="7 ;' 2%0=< 0-' <".53,2 &%=/51'

-25=,.1 /% !(,.1 /+' 6('G0'.2" !52F /% /+' !"# ='&'= ;+,=' ('<02,.1

/+' %4'(5/,.1 &%=/51'J -5&,.1 <".53,2 4%;'( ,. /+' 4(%2'--) I,3@

,=5(="7 ;' 25. <% /+' -53' /+,.1 ;,/+ /+' -=52F 2('5/'< !" /+' B%.%(

%4/,3,K5/,%.7 'L2'4/ /+5/ .%; ;' ;5./ /% (%== !52F /+' 6('G0'.2"

0./,= ;' 1'/ /+' -53' 4'(6%(35.2' 5- !"#)

*+' ('-0=/- %6 /+'-' 'L4'(,3'./- 544'5( ,. C,10(' 8>) *+' !10('

-+%;- /+' <".53,2 4%;'( 2%.-03'< !" /+' '.&,(%.3'./- %6 C,1@

0(' 89 5/ +%12/"1/ 3'#4%#5"1+') M' -'' /+5/ D'E"2=' -5&'- %. 5&@

'(51' N?$ %6 /+' !"# <".53,2 4%;'() O%('%&'(7 &'()+,'-./0%1%#

5.< &'()+,'-0)10%1%# -5&' >P$ 5.< >Q$7 ('-4'2/,&'=" %6 /+'!"#

<".53,2 4%;'() *+'-' 5(' -,K5!=' 4%;'( ('<02/,%.-)

,-*- ./"0"123"1# 4&5&23&%6

I'2/,%. 9)Q)8 4(%4%-'< 0-,.1 D'E"2=' /% 40-+ /+' -=52F %6 .%.@

2(,/,25= =%%4- /% /+',( 6''<!52F 45/+- 5.< /+'. 2%.-03,.1 ,/ !" '=,3@

,.5/,.1 ('4'5/'(- /+'(' J 5.< -5&,.1 4%;'( ,. /+' 4(%2'--) C,1@

0(' 8? -+%;- /+' 4'(2'./51' %6 ('4'5/'(- ,. 6''<!52F 45/+- /+5/ D'@

E"2=' '=,3,.5/'- 6%( <,66'('./ &5=0'- %6 /+' 0-'60= =%1,2 <'4/+ 4'(

4,4'=,.' -/51' 5.< !)

*+' !10(' -+%;- /+5/ D'E"2=' ('3%&'- 8>@QP$ %6 /+' ('4'5/'(-7

<'4'.<,.1 %. /+' =%1,2 <'4/+ 4'( -/51') *+' +,1+'( '66'2/,&'.'--

2%(('-4%.<- /% 4,4'=,.'- ;,/+ ='-- =%1,2 4'( -/51') *+,- ,- !'250-'7

5- ;' 35F' /+' 4,4'=,.' =%.1'( 5.< -/51'- -+%(/'(7 ;' +5&' 3%('

0.!5=5.2' 52(%-- -/51'-) *+,- ('-0=/- ,. .%.@2(,/,25= =%%4- +5&,.1

3%(' -=52F) *+' !,11'( /+' -=52F ,-7 /+' 3%(' ('4'5/'(- ;' 25.

('3%&') R. /+' %/+'( +5.<7 /+' &5=0' %6 ! +5- =,//=' '66'2/) R&'(5==7-,.2' !>P$ %6 /+' 4%;'( ,. 6''<!52F 45/+- ,- ,. ('4'5/'(- SN?T7

D'E"2=' 25. -5&' !U)>@8>$ %6 /+' 4%;'( ,. 6''<!52F 45/+-)

7- 4&/23&8 9:%;

D'E"2=' ,- 5. 5(2+,/'2/0(5= 6(53';%(F 6%( 4,4'=,.'- /+5/ 2%3@

4('+'.-,&'=" 4'(6%(3- 2"2=' /,3' -/'5=,.1 56/'( 65!(,25/,%. V',/+'(

-/5/,25==" 5/ /+' 35.0652/0('( -,/' %( <".53,25==" !5-'< %. %4'(5/@

,.1 2%.<,/,%.-W /% /%='(5/' 4(%2'-- &5(,5/,%.) *+' 3%-/ ('=5/'< 5(@

'5- %6 ('-'5(2+ 5(' /+%-' %6 2=%2F -F'; %4/,3,K5/,%. 5.< 5<54/,&'

4,4'=,.,.1)

</:=; 6;&> :53"0"?23":1 +5- !''. ;'== -/0<,'< ,. /+' 2,(20,/- 2%3@

30.,/" S9T) H/ +5- !''. 544=,'< !%/+ 5/ <'-,1. /,3' 5.< 56/'( 65!(,25@

/,%. /% ,34(%&' 2,(20,/ /,3,.1 35(1,.-) C,-+!0(. S8XT ;5- /+' !(-/ /%

4(%4%-' 5 =,.'5( 4(%1(533,.1 6%(30=5/,%. /% !.< /+' %4/,35= 2=%2F

-F';- ,. 5 2,(20,/)

I'&'(5= ;%(F- 0-' 2=%2F -F';,.1 /% 5<<('-- /+' 4(%!='3 %6 4(%@

2'-- &5(,5/,%.) O%-/ %6 /+'3 544=" -F';,.1 /% =5/2+ '='3'./- ,. /+'

2=%2F <,-/(,!0/,%. .'/;%(F /% 2%34'.-5/' 6%( /+' '66'2/- %6 4(%2'--

&5(,5/,%. %1 /6' +,%+7 3"/6 8',")2 /+'3-'=&'- V')1)7 S98TW) C%( 'L@

534='7 H/5.,03 +5- !066'(- ,. /+' 2=%2F .'/;%(F /+5/ <".53,25=="

8'27'9 /+' -,1.5= J ,)')7 '.-0(' /+5/ /+' 2=%2F -,1.5= ('52+'- 5== /+'

45(/- %6 /+' 4(%2'--%( ;,/+ /+' -53' -F'; S89T) R. /+' %/+'( +5.<7

Y,5.1 5.< Z(%%F- SQ8T 0-' 2=%2F -F';,.1 /% !5=5.2' /;% 4,4'=,.'

-/51'-) I4'2,!25=="7 /+'" 0-' 2"2=' /,3' -/'5=,.1 !'/;''. /+' ('1,-@

/'( !=' 5.< 'L'20/' -/51'- 5.<7 ;,/+ ='&'=@-'.-,/,&' =5/2+'-7 !'/;''.

-/51'- ,. /+' "%5/,.1@4%,./ 0.,/)

*+' %.=" ;%(F /+5/ 544=,'- 2"2=' /,3' -/'5=,.1 ,. 5 -"-/'35/,2

35..'( ,. 5 4,4'=,.' ,- /+5/ %6 Y'' '/ ",: SQPT) *+'" 0-' ,/ ,. /+'

2%./'L/ %6 D5K%( /% !5=5.2' 4,4'=,.' '((%( (5/'-) *+'" .',/+'( 544="

,/ /% 4(%2'-- &5(,5/,%. .%(7 3%(' ,34%(/5./="7 -/0<" /+' ,3452/ %6

4,4'=,.' -/(02/0(' -02+ 5- 4,4'=,.' <'4/+ %( =%%4 %(15.,K5/,%. %. /+'

4'(6%(35.2' %6 2"2=' /,3' -/'5=,.1)

@8253"A& 5"5&/"1"1# 3&=B1"C$&6 5(' ('=5/'< /% %0( <%.%( -/51' %4@

/,3,K5/,%.) [%445.5=,= '/ ",: SN#T -/0<" /+' '66'2/ %6 <".53,25=="

3'(1,.1 4,4'=,.' -/51'- /% 'L/'.< /+' 6('G0'.2" (5.1' %6 <".53,2

&%=/51' -25=,.1\ /+'" <% .%/ 'L4=,2,/=" <'-2(,!' /+' ,34='3'./5/,%.

<'/5,=- %6 /+',( -2+'3') ]6/+"3,%0 '/ ",: S8?T 0-' 5-".2+(%.%0-

<'-,1. /'2+.,G0'- /% 5<54/,&'=" 3'(1' 5<^52'./ -/51'- ,. 5. '3@

!'<<'<7 -,.1='@,--0' 4(%2'--%( 4,4'=,.') _=!%.'-, SNT 4(%4%-'-

<".53,25=="@&5(",.1 60.2/,%.5= 0.,/ =5/'.2,'- 5- 5. 5<54/,&' 4(%@

2'--,.1 -2+'3'7 !0/ +' <%'- .%/ <,-20-- /+' ('-0=/,.1 -2+'<0=,.1

2%34='L,/,'-) D'2'./="7 RK<'3,( '/ ",: SQ9T 5<<('-- /+' ,--0' %6

-2+'<0=,.1 2%34='L,/" 6%( &5(,5!='@522'-- Y8 252+' !" 0-,.1 5<@

<,/,%.5= =%5<@!"45-- !066'(-) C,.5=="7 2%.20(('./=" ;,/+ %0( ;%(F7

Y,5.1 5.< Z(%%F- SQ8T 4(%4%-' ,.-'(/,.1 ='&'=@-'.-,/,&' =5/2+'- ,.@

-,<' /+' CA 0.,/ /+5/ 25. !' '.5!='< 56/'( 65!(,25/,%.) H6 4(%2'--

&5(,5/,%. ,- -02+ /+5/ /+' 0.,/ <%'- .%/ 3''/ /,3,.17 /+' =5/2+'- 5('

'.5!='<7 5<<,.1 %.' 'L/(5 2"2=' /% /+' CA 0.,/)

D- <:1=/$6":16

A(%2'-- &5(,5/,%. 566'2/- 4(%2'--%( 4,4'=,.'- !" 'L52'(!5/,.1

4,4'=,.' 0.!5=5.2' 5.< ('<02,.1 /+' 5//5,.5!=' 6('G0'.2") *% /%='(@

333