
CS 152 L03 Testing Processors () UC Regents Fall 2004 © UCB

2004-09-07

Dave Patterson

(www.cs.berkeley.edu/~patterson)

John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

www-inst.eecs.berkeley.edu/~cs152/

CS152 – Computer Architecture and Engineering

Lecture 4 – Timing


Last Time: Test plan for your project

complete processor testing
Top-down testing
Bottom-up testing
unit testing
processor testing with self-checks

Which testing types are good for each epoch?

processor assembly complete
correctly executes single instructions
correctly executes short programs

Time

Epoch 1 Epoch 2 Epoch 3 Epoch 4

unit testing
early multi-unit testing
later multi-unit testing
processor testing with self-checks
multi-unit testing
unit testing
diagnostics
complete processor testing
verification
processor testing with self-checks
diagnostics
processor testing with self-checks
multi-unit testing
unit testing
diagnostics


Outline - Timing

Team networking break!

A clocked logic circuit primer

More clocked logic circuits


1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

The process […] was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the […] versus […] dependence and source-to-body bias is used to electrically limit transistor […] in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-

Architects draw blocks ...

Circuit designers draw ???

Spring 2003 EECS150 – Lec10-Timing

Qualitative Analysis of Logic Delay

• Improved Transistor Model:
• We refer to transistor "strength" as the amount of current that flows for a given Vds and Vgs.
• The strength is linearly proportional to the ratio of W/L.

[Figure: nFET and pFET transistor symbols]

Logic is where they meet.


CS 152 L02 Design as a Team Sport () UC Regents Fall 2004 © UCB

Architects reach logic top-down ...

Next State Combinational Logic

[Figure: block with inputs R, G, Y, Change, Rst and outputs next_R, next_Y, next_G]

wire next_R, next_Y, next_G;
assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);

Is this structural Verilog?
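To see what the three continuous assignments compute, here is a minimal Python sketch of the same next-state function. It assumes a one-hot light state (R, Y, G), which the slide does not state explicitly, and the name `next_state` is hypothetical. Each `assign` describes a multiplexer by its behavior rather than by instantiating gates, which is why such code is usually classed as dataflow rather than structural Verilog.

```python
def next_state(r, y, g, change, rst):
    """Mirror of the slide's assigns: returns (next_R, next_Y, next_G)."""
    if rst:
        return (1, 0, 0)          # reset forces the light to red
    if change:
        return (y, g, r)          # next_R = Y, next_Y = G, next_G = R
    return (r, y, g)              # no change: hold the current state


state = (1, 0, 0)                 # start at red (assumed one-hot encoding)
sequence = []
for _ in range(4):
    state = next_state(*state, change=1, rst=0)
    sequence.append(state)
print(sequence)
```

With `change` held high the state rotates red, green, yellow, red, which is the cycle the ternary expressions encode.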


1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz

Basic Components: CMOS Logic Gates

NAND Gate: Out = NOT (A • B)

A B Out
0 0 1
0 1 1
1 0 1
1 1 0

NOR Gate: Out = NOT (A + B)

A B Out
0 0 1
0 1 0
1 0 0
1 1 0

[Figure: gate symbols and CMOS transistor circuits for both gates, connected between Vdd and ground]

Basic Components: CMOS Logic Gates

4-input NAND Gate (inputs A, B, C, D)

More inputs, more asymmetric edge times!

Ideal versus Reality

° When input 0 -> 1, output 1 -> 0 but NOT instantly
  • Output goes 1 -> 0: output voltage goes from Vdd (5v) to 0v
° When input 1 -> 0, output 0 -> 1 but NOT instantly
  • Output goes 0 -> 1: output voltage goes from 0v to Vdd (5v)
° Voltage does not like to change instantaneously

[Plot: Vin and Vout voltage versus time, between 0 => GND and 1 => Vdd]

Fluid Timing Model

° Water <=> Electrical Charge; Tank Capacity <=> Capacitance (C)
° Water Level <=> Voltage; Water Flow <=> Charge Flowing (Current)
° Size of Pipes <=> Strength of Transistors (G)
° Time to fill up the tank proportional to C / G

[Figure: reservoir at level Vdd, tank (Cout) with tank level Vout, bottomless sea at sea level (GND), switches SW1 and SW2]
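The "time to fill the tank is proportional to C / G" rule matches a first-order RC charging curve. The sketch below assumes an ideal switch of conductance G charging a capacitor C from 0 V toward Vdd, so Vout(t) = Vdd (1 - e^(-tG/C)); the component values and the function name are made up for illustration.

```python
import math


def time_to_reach(v_target, vdd, c, g):
    """Time for a node charging from 0 V toward vdd to reach v_target.

    Assumes Vout(t) = vdd * (1 - exp(-t * g / c)), with c in farads and
    g in siemens. Both values are illustrative, not taken from the slides.
    """
    return (c / g) * math.log(vdd / (vdd - v_target))


VDD = 5.0
t1 = time_to_reach(VDD / 2, VDD, c=100e-15, g=1e-3)   # 100 fF, 1 mS
t2 = time_to_reach(VDD / 2, VDD, c=200e-15, g=1e-3)   # twice the tank
t3 = time_to_reach(VDD / 2, VDD, c=100e-15, g=2e-3)   # twice the pipe
print(t1, t2, t3)
```

Doubling the capacitance doubles the time to reach Vdd/2, and doubling the conductance halves it, which is the C / G scaling in the last bullet.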

EEs reach logic bottom-up ...

Can you build a processor entirely out of NAND gates?

Small number of high-performance logic circuits.

For some definition of performance.
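On the logic side the NAND question has a well-known answer: NAND is functionally complete, so any combinational function can be composed from it. The sketch below builds NOT, AND, OR, XOR and a one-bit full adder from a single `nand` primitive. All function names are illustrative, and the sketch says nothing about speed or area, which is the slide's caveat about performance.

```python
def nand(a, b):
    """The only primitive: 2-input NAND on 0/1 values."""
    return 1 - (a & b)


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))   # DeMorgan: A + B = NOT(NOT A . NOT B)


def xor_(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))   # classic four-NAND XOR


def full_adder(a, b, cin):
    """One-bit full adder built only from the NAND-derived gates above."""
    s1 = xor_(a, b)
    total = xor_(s1, cin)
    carry = or_(and_(a, b), and_(s1, cin))
    return total, carry


print(full_adder(1, 1, 1))
```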


Logic Synthesis bridges the gap

Design Refinement

Informal System Requirement
Initial Specification
Intermediate Specification
Final Architectural Description
Intermediate Specification of Implementation
Final Internal Specification
Physical Implementation

(refinement: increasing level of detail)

Logic Components

Elements of the design zoo

° Wires: Carry signals from one point to another
  • Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
  • Data goes in, Results come out after some propagation delay
° Flip-Flops: Storage Elements
  • After a clock edge, input copied to output
  • Otherwise, the flip-flop holds its value
  • Also: a “Latch” is a storage element that is level triggered

[Figure: flip-flop symbols (D, Q and D[8], Q[8]) and a combinational logic block with an 11-bit bus and 8-bit buses]

Basic Combinational Elements + DeMorgan Equivalence

Wire: Out = In

In Out
0 0
1 1

Inverter: Out = NOT In

In Out
0 1
1 0

NAND Gate: Out = NOT (A • B)

A B Out
0 0 1
0 1 1
1 0 1
1 1 0

NOR Gate: Out = NOT (A + B)

A B Out
0 0 1
0 1 0
1 0 0
1 1 0

DeMorgan’s Theorem

NAND: Out = NOT (A • B) = (NOT A) + (NOT B)

A B | NOT A, NOT B | Out
0 0 | 1 1 | 1
0 1 | 1 0 | 1
1 0 | 0 1 | 1
1 1 | 0 0 | 0

NOR: Out = NOT (A + B) = (NOT A) • (NOT B)

A B | NOT A, NOT B | Out
0 0 | 1 1 | 1
0 1 | 1 0 | 0
1 0 | 0 1 | 0
1 1 | 0 0 | 0
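DeMorgan's equivalences can be checked by exhausting the four input combinations. The following is a minimal sketch of that check; the helper names `inv` and `demorgan_holds` are illustrative.

```python
from itertools import product


def inv(x):
    """Inverter on a 0/1 value."""
    return 1 - x


def demorgan_holds():
    """Check both DeMorgan identities for every 2-bit input."""
    for a, b in product((0, 1), repeat=2):
        nand_out = inv(a & b)
        nor_out = inv(a | b)
        if nand_out != (inv(a) | inv(b)):   # NOT(A.B) == NOT A + NOT B
            return False
        if nor_out != (inv(a) & inv(b)):    # NOT(A+B) == NOT A . NOT B
            return False
    return True


print(demorgan_holds())
```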


assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);

It’s easier to work at one level of abstraction if you have a basic understanding of the level below.


Administrivia - Team Networking Break!

First homework due 9/15.

Lab 1 due Monday 9/13.

Mini-Lab 2 this Friday (9/10). Remember to do the pre-lab!

Lab 2 goes out on 9/14. The first team lab (next break)


A Logic Circuit Primer


Inverters: A simple transistor model


Delay Model: CMOS

Review: General C/L Cell Delay Model

° Combinational Cell (symbol) is fully specified by:
  • functional (input -> output) behavior
    - truth-table, logic equation, VHDL
  • load factor of each input
  • critical propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes

[Plot: delay (Va -> Vout) versus Cout, a straight line with intercept "Internal Delay" and slope "delay per unit load", with Ccritical marked on the Cout axis. Figure: combinational logic cell with inputs A, B, ..., X driving Cout]

Basic Technology: CMOS

° CMOS: Complementary Metal Oxide Semiconductor
  • NMOS (N-Type Metal Oxide Semiconductor) transistors
  • PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor
  • Apply a HIGH (Vdd) to its gate: turns the transistor into a “conductor”
  • Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
  • Apply a HIGH (Vdd) to its gate: shuts off the conduction path
  • Apply a LOW (GND) to its gate: turns the transistor into a “conductor”

Vdd = 5V, GND = 0v

Basic Components: CMOS Inverter

° Inverter Operation

[Figure: inverter symbol (In, Out) and circuit (PMOS, NMOS). Operation diagrams labeled Vin, Vout, Vdd, Open, Charge, Discharge]

pFET. A switch. “On” if gate is grounded.

nFET. A switch. “On” if gate is at Vdd.
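The two switch rules are enough to model a CMOS gate at the switch level. The sketch below treats each FET as an ideal switch and derives an inverter and a 2-input NAND from pull-up and pull-down networks. It assumes the standard CMOS NAND topology (pFETs in parallel, nFETs in series), and all names are illustrative.

```python
VDD, GND = 1, 0


def pfet_on(gate):
    """pFET conducts when its gate is grounded."""
    return gate == GND


def nfet_on(gate):
    """nFET conducts when its gate is at Vdd."""
    return gate == VDD


def inverter(vin):
    pull_up = pfet_on(vin)       # path from Vdd to Out
    pull_down = nfet_on(vin)     # path from Out to GND
    assert pull_up != pull_down, "exactly one network should conduct"
    return VDD if pull_up else GND


def nand2(a, b):
    pull_up = pfet_on(a) or pfet_on(b)       # pFETs in parallel
    pull_down = nfet_on(a) and nfet_on(b)    # nFETs in series
    assert pull_up != pull_down, "exactly one network should conduct"
    return VDD if pull_up else GND


print([inverter(v) for v in (0, 1)])
print([nand2(a, b) for a in (0, 1) for b in (0, 1)])
```

The assertions capture the complementary property: for every legal input exactly one of the two networks conducts, so the output is never floating and never shorted.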


Transistors as water valves

If electrons are water molecules, and a capacitor a bucket ...

An “on” p-FET fills up the capacitor with charge.


An “on” n-FET empties the bucket.


Gate Switching Behavior

• Inverter:
• NAND gate:

[Plots: water level versus time, between “0” and “1”]


What is the bucket? A gate’s “fan-out”.

Driving other gates slows a gate down.

Driving wires slows a gate down.

“Fan-out”: The number of gate inputs driven by a gate’s output.


A closer look at fan-out ...

Gate Delay

• Fan-out:
• The delay of a gate is proportional to its output capacitance. Because of this, gates #2 and 3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and 3 as we add more output capacitance.)

[Figure: gate 1 driving gates 2 and 3]

Driving more gates adds delay.

Series Connection

[Figure: inverters G1 and G2 in series: Vin drives G1, whose output V1 (load C1) drives G2, whose output is Vout (load Cout). Plot: Vin, V1, and Vout versus time between GND and Vdd, with delays d1 and d2 marked at Vdd/2]

° Total Propagation Delay = Sum of individual delays = d1 + d2
° Capacitance C1 has two components:
  • Capacitance of the wire connecting the two gates
  • Input capacitance of the second inverter

Calculating Aggregate Delays

[Figure: Vin drives G1, whose output V1 (load C1) fans out to G2 (output V2) and G3 (output V3)]

° Sum delays along serial paths
° Delay (Vin -> V2) != Delay (Vin -> V3)
  • Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
  • Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
° Critical Path = The longest among the N parallel paths
° C1 = Wire C + Cin of Gate 2 + Cin of Gate 3
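The aggregate-delay rules can be sketched as a short calculation: add the capacitances on node V1, apply a linear delay model per gate, sum along each serial path, and take the maximum over the parallel paths. Every number below is hypothetical and chosen only for illustration, as is the helper `gate_delay`.

```python
def gate_delay(internal_ns, slope_ns_per_ff, load_ff):
    """Linear delay model: fixed internal delay plus load-dependent delay."""
    return internal_ns + slope_ns_per_ff * load_ff


# Hypothetical values, not taken from the slides.
WIRE_C = 20.0                    # fF of wire on node V1
CIN_G2, CIN_G3 = 50.0, 50.0      # fF input capacitance of gates 2 and 3
C1 = WIRE_C + CIN_G2 + CIN_G3    # total load seen by gate 1

d_vin_v1 = gate_delay(0.1, 0.002, C1)      # G1 driving C1
d_v1_v2 = gate_delay(0.1, 0.002, 30.0)     # G2 driving a light load
d_v1_v3 = gate_delay(0.1, 0.002, 200.0)    # G3 driving a heavy load

paths = {
    "Vin->V2": d_vin_v1 + d_v1_v2,
    "Vin->V3": d_vin_v1 + d_v1_v3,
}
critical = max(paths, key=paths.get)       # longest of the parallel paths
print(paths, critical)
```

Both paths share the Vin -> V1 segment, so the difference between them comes entirely from the loads on G2 and G3.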

Characterize a Gate

° Input capacitance for each input
° For each input-to-output path:
  • For each output transition type (H->L, L->H, H->Z, L->Z ... etc.)
    - Internal delay (ns)
    - Load dependent delay (ns / fF)
° Example: 2-input NAND Gate (inputs A, B; output Out)

For A and B: Input Load (I.L.) = 61 fF

For either A -> Out or B -> Out:
  Tlh = 0.5ns    Tlhf = 0.0021ns / fF
  Thl = 0.1ns    Thlf = 0.0020ns / fF

[Plot: Delay A -> Out (Out: Low -> High) versus Cout, with intercept 0.5ns and slope 0.0021ns / fF]
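Plugging the NAND characterization into the linear model gives a delay for any load. The sketch below uses the slide's four parameters; the fan-out of two NAND inputs and the zero wire capacitance in the example are assumptions, and the names are illustrative.

```python
NAND2 = {
    "input_load_ff": 61.0,
    "tlh_ns": 0.5, "tlhf_ns_per_ff": 0.0021,   # output low -> high
    "thl_ns": 0.1, "thlf_ns_per_ff": 0.0020,   # output high -> low
}


def nand_delay(cout_ff, rising):
    """Delay in ns from either input to Out for a given load in fF."""
    if rising:
        return NAND2["tlh_ns"] + NAND2["tlhf_ns_per_ff"] * cout_ff
    return NAND2["thl_ns"] + NAND2["thlf_ns_per_ff"] * cout_ff


# Assumed example: the gate drives two NAND inputs and no wire capacitance.
load = 2 * NAND2["input_load_ff"]            # 122 fF
rise = nand_delay(load, rising=True)         # 0.5 + 0.0021 * 122
fall = nand_delay(load, rising=False)        # 0.1 + 0.0020 * 122
print(rise, fall)
```

The rising and falling delays differ because the internal delays differ, which is why each transition type is characterized separately.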

A Specific Example: 2 to 1 MUX

Y = (A and !S) or (B and S)

[Figure: 2 x 1 Mux with inputs A, B, select S, and output Y, built from Gate 1, Gate 2, and Gate 3 connected by Wire 0, Wire 1, and Wire 2]

° Input Load (I.L.)
  • A, B: I.L. (NAND) = 61 fF
  • S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
° Load Dependent Delay (L.D.D.): Same as Gate 3
  • TAYlhf = 0.0021 ns / fF    TAYhlf = 0.0020 ns / fF
  • TBYlhf = 0.0021 ns / fF    TBYhlf = 0.0020 ns / fF
  • TSYlhf = 0.0021 ns / fF    TSYhlf = 0.0020 ns / fF
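The input-load figures follow from summing the input capacitance of every gate input that a mux pin drives. The sketch below reproduces that sum and checks the mux function. It assumes the common implementation with one inverter and three NAND gates, which the extracted text does not spell out, and the dictionary and function names are illustrative.

```python
INPUT_LOAD_FF = {"INV": 50.0, "NAND": 61.0}   # values from the slide

# Which gate inputs each mux pin drives, under the assumed
# inverter + three-NAND implementation of Y = (A and !S) or (B and S).
FANOUT = {
    "A": ["NAND"],
    "B": ["NAND"],
    "S": ["INV", "NAND"],
}


def input_load(pin):
    """Total capacitance in fF presented by one mux input pin."""
    return sum(INPUT_LOAD_FF[gate] for gate in FANOUT[pin])


def nand(x, y):
    return 1 - (x & y)


def mux(a, b, s):
    """NAND-NAND form of the mux equation: selects a when s=0, b when s=1."""
    return nand(nand(a, 1 - s), nand(b, s))


loads = {pin: input_load(pin) for pin in FANOUT}
print(loads)
```

The select input is the heaviest pin because it feeds both the inverter and one NAND directly.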

Linear model works for reasonable fan-out


Propagation delay graphs ...

Gate Delay

• Cascaded gates:

[Figure labels: Vout, Vin]

1->0


Intuition: Critical paths ...

Gate Delay

• Fan-in:
• What is the delay in this circuit?
• Critical Path: the path with the maximum delay, from any input to any output.
  - In general, we include register set-up and clk-to-Q times in critical path calculation.
• Why do we care about the critical path?

x = g(a, b, c, d, e, f)

If d going 0-to-1 switches x 0-to-1, delay is T1.

If a going 0-to-1 switches x 0-to-1, delay is T2.

Would you be surprised if T1 > T2? Why?

[Figure labels: T1, T2]

T2 might be the critical (worst-case delay) path.
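The delay from an input to x depends on how many gates lie on its longest route, which can be found with a longest-path search over the netlist. The netlist below is entirely hypothetical (the slide's figure is not recoverable from the text), as are the gate delays and the function name `worst_delay`. It is arranged so that input a passes through more gates than input d.

```python
from functools import lru_cache

# Hypothetical netlist: node -> (gate delay in ns, list of fan-in nodes).
# Primary inputs a..f have no entry. None of these values come from the slide.
NETLIST = {
    "n1": (0.3, ["a", "b"]),
    "n2": (0.3, ["n1", "c"]),
    "n3": (0.2, ["d", "e"]),
    "x": (0.4, ["n2", "n3", "f"]),
}


def worst_delay(src, dst="x"):
    """Longest delay from primary input src to dst, or None if unconnected."""
    @lru_cache(maxsize=None)
    def longest(node):
        if node == src:
            return 0.0
        if node not in NETLIST:
            return None                     # a different primary input
        delay, fanin = NETLIST[node]
        reachable = [d for d in map(longest, fanin) if d is not None]
        return delay + max(reachable) if reachable else None
    return longest(dst)


delays = {pin: worst_delay(pin) for pin in "abcdef"}
critical_input = max(delays, key=delays.get)
print(delays, critical_input)
```

In this made-up circuit the a-to-x delay exceeds the d-to-x delay because a passes through three gates and d through two. Wire delay is ignored here, so the ordering is only as good as that assumption.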


Why “might”? Wires have delay too ...

Wire Delay

• Even in those cases where the transmission line effect is negligible:
  - Wires possess distributed resistance and capacitance
  - Time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to effective R of transistors), but C is important.
  - Typically around half of C of gate load is in the wires.
• For long wires on ICs:
  - busses, clock lines, global control signal, etc.
  - Resistance is significant, therefore distributed RC effect dominates.
  - signals are typically “rebuffered” to reduce delay:

[Figure: a rebuffered wire with nodes v1 through v4, and the node voltages plotted versus time]
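A common first-order estimate for an unbuffered on-chip wire treats it as a distributed RC line whose delay grows with the square of its length. The sketch below uses the Elmore-style estimate 0.5 · R_total · C_total and then splits the wire into buffered segments. The per-millimeter values and the buffer delay are invented for illustration, and the model ignores buffer input capacitance and driver resistance.

```python
def wire_delay(length_mm, r_per_mm, c_per_mm):
    """Distributed-RC delay estimate: 0.5 * R_total * C_total (seconds)."""
    return 0.5 * (r_per_mm * length_mm) * (c_per_mm * length_mm)


def rebuffered_delay(length_mm, segments, r_per_mm, c_per_mm, t_buf):
    """Split the wire into equal segments, each followed by one buffer."""
    per_segment = wire_delay(length_mm / segments, r_per_mm, c_per_mm)
    return segments * (per_segment + t_buf)


# Illustrative values only: 100 ohm/mm, 0.2 pF/mm, 50 ps per buffer.
R, C, T_BUF = 100.0, 0.2e-12, 50e-12

unbuffered = wire_delay(10.0, R, C)
doubled = wire_delay(20.0, R, C)
buffered = rebuffered_delay(10.0, 4, R, C, T_BUF)
print(unbuffered, doubled, buffered)
```

Doubling the length quadruples the unbuffered delay, while cutting the same wire into four buffered segments makes the wire term shrink by four at the cost of four buffer delays. With more segments the buffer delays eventually dominate, so there is an optimum segment count for any given set of parameters.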

Looks benign, but ...


Team Networking Break!

Optimal team size? Min/max?

Disagreements are inevitable. Build a bridge and get over it.

Lab 2 goes out on 9/14. Time to form your team.

Talk out little problems before they get big. Communicate.


Clocked Logic Circuits


From Delay Models to Timing Analysis

1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001


One of the major goals of the design was to minimize the en-

ergy consumed to complete a given task. Conventional wisdom

has been that shorter pipelines are more efficient due to re-

Example: Parallel to Serial Converter

T ≥ time(clk→Q) + time(mux) + time(setup)

T ≥ t_clk-to-Q + t_mux + t_setup

f          T
1 MHz      1 µs
10 MHz     100 ns
100 MHz    10 ns
1 GHz      1 ns

Timing Analysis

What is the smallest T that

produces correct operation?
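As a rough sketch of the arithmetic behind this question: period and frequency are reciprocals (T = 1/f), and a register-to-register path needs a period at least as long as clock-to-Q delay plus logic delay plus setup time. The function names and the delay values below are hypothetical illustration choices, not figures from the lecture.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz (T = 1/f)."""
    return 1000.0 / freq_mhz


def min_period_ns(t_clk_to_q: float, t_logic: float, t_setup: float) -> float:
    """Smallest period that lets data launched on one clock edge
    be captured correctly on the next edge (all delays in ns)."""
    return t_clk_to_q + t_logic + t_setup


def max_freq_mhz(t_min_ns: float) -> float:
    """Highest clock frequency (MHz) allowed by a minimum period in ns."""
    return 1000.0 / t_min_ns


# Hypothetical delays in ns, chosen only to make the arithmetic easy to follow.
t_min = min_period_ns(t_clk_to_q=0.5, t_logic=1.0, t_setup=0.5)
print(f"T_min = {t_min} ns, f_max = {max_freq_mhz(t_min)} MHz")
```

With these made-up numbers the logic delay alone is 1 ns, but the flip-flop overheads push the minimum period to 2 ns.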


Timing Analysis and Logic Delay

Can T be smaller than the worst-case delay through CL?


1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz

Lec3.9

General C/L Cell Delay Model

° Combinational Cell (symbol) is fully specified by:

• functional (input -> output) behavior

- truth-table, logic equation, VHDL

• Input load factor of each input

• Propagation delay from each input to each output for each transition

- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load

° Linear model composes

[Figure: combinational logic cell (inputs A, B, ...; output X) driving load Cout; plot of delay (Va -> Vout) versus Cout, labeled with internal delay, delay per unit load, and Ccritical.]
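The linear delay model above can be sketched in a few lines. This is a simplified illustration, assuming one delay value per cell rather than a separate value for each input and transition as the slide specifies; the `Cell` class, `chain_delay` helper, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Cell:
    internal_delay: float  # fixed internal delay (ns)
    delay_per_load: float  # load-dependent delay (ns per unit load)
    input_load: float      # load this cell's input presents to its driver

    def delay(self, load: float) -> float:
        """Linear model: fixed internal delay + delay-per-unit-load * load."""
        return self.internal_delay + self.delay_per_load * load


def chain_delay(cells: Sequence[Cell], final_load: float) -> float:
    """Delay through cells wired in series. Each cell drives the input load
    of the next cell; the last cell drives final_load. This is the sense in
    which the linear model composes: per-cell delays simply add."""
    total = 0.0
    for i, cell in enumerate(cells):
        load = cells[i + 1].input_load if i + 1 < len(cells) else final_load
        total += cell.delay(load)
    return total


# Hypothetical gate parameters, for illustration only.
gate = Cell(internal_delay=0.5, delay_per_load=0.25, input_load=1.0)
print(gate.delay(4.0))                 # one gate driving 4 unit loads
print(chain_delay([gate, gate], 4.0))  # two gates in series
```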


Storage Element’s Timing Model

[Figure: storage element with data input D, clock input Clk, and output Q.]

° Setup Time: Input must be stable BEFORE trigger clock edge

° Hold Time: Input must REMAIN stable after trigger clock edge

° Clock-to-Q time:

• Output cannot change instantaneously at the trigger clock edge

• Similar to delay in logic gates, two components:

- Internal Clock-to-Q

- Load dependent Clock-to-Q

[Figure: timing diagram of D and Q around the trigger clock edge, labeled Setup, Hold, Don't Care, Unknown, and Clock-to-Q.]
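A minimal sketch of how these three parameters might be checked in software, assuming times in ns and a single trigger edge. The function names, the choice of an open (strict) window, and the numbers are illustrative assumptions, not part of the lecture.

```python
from typing import Iterable


def violates_setup_or_hold(d_change_times: Iterable[float], clock_edge: float,
                           t_setup: float, t_hold: float) -> bool:
    """True if D changes inside the window in which it must be stable:
    from t_setup before the trigger edge to t_hold after it."""
    window_start = clock_edge - t_setup
    window_end = clock_edge + t_hold
    return any(window_start < t < window_end for t in d_change_times)


def q_valid_time(clock_edge: float, t_clk_to_q_internal: float,
                 t_clk_to_q_per_load: float, load: float) -> float:
    """Time at which Q holds the newly sampled value: the edge time plus the
    internal and load-dependent clock-to-Q components."""
    return clock_edge + t_clk_to_q_internal + t_clk_to_q_per_load * load


# Hypothetical numbers: edge at 10 ns, 0.5 ns setup, 0.25 ns hold.
print(violates_setup_or_hold([9.8], 10.0, 0.5, 0.25))        # change inside window
print(violates_setup_or_hold([9.0, 10.5], 10.0, 0.5, 0.25))  # changes outside window
print(q_valid_time(10.0, 0.5, 0.25, 2.0))
```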


Clocking Methodology

[Figure: storage elements clocked by Clk on either side of a block of combination logic.]

° All storage elements are clocked by the same clock edge

° The combination logic blocks:

• Inputs are updated at each clock tick

• All outputs MUST be stable before the next clock tick


Critical Path & Cycle Time

[Figure: storage elements clocked by Clk, with logic paths between them.]

° Critical path: the slowest path between any two storage devices

° Cycle time is a function of the critical path

° Cycle time must be greater than:

Clock-to-Q + Longest Path through Combination Logic + Setup
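One way to make the critical-path bound concrete is to treat the combinational logic as an acyclic graph of gates and take the longest path through it. The sketch below assumes one fixed delay per gate and no wire delay; the gate names, the `fanin` representation, and the numbers are all hypothetical.

```python
from functools import lru_cache
from typing import Dict, List


def longest_path_delay(gate_delay: Dict[str, float],
                       fanin: Dict[str, List[str]],
                       outputs: List[str]) -> float:
    """Worst-case delay from register outputs to register inputs.

    gate_delay: delay of each gate (ns)
    fanin:      gates driving each gate; a gate with no entry is driven
                directly by register outputs (arrival time 0)
    outputs:    gates whose outputs feed register D inputs
    Assumes the logic is acyclic, as combinational logic must be.
    """
    @lru_cache(maxsize=None)
    def arrival(gate: str) -> float:
        latest_input = max((arrival(d) for d in fanin.get(gate, [])), default=0.0)
        return latest_input + gate_delay[gate]

    return max(arrival(g) for g in outputs)


def cycle_time_bound(t_clk_to_q: float, t_logic_worst: float, t_setup: float) -> float:
    """Clock-to-Q + longest path through combinational logic + setup."""
    return t_clk_to_q + t_logic_worst + t_setup


# Hypothetical four-gate block: a and b feed c; a alone feeds d.
delays = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 1.0}
drivers = {"c": ["a", "b"], "d": ["a"]}
worst = longest_path_delay(delays, drivers, outputs=["c", "d"])
print(worst, cycle_time_bound(0.5, worst, 0.5))
```

Here the critical path runs through gates b and c, so the path through d has slack and does not affect the cycle time.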

Register:

An Array of Flip-Flops


Combinational Logic


Flip Flops have internal delays ...

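[Figure: flip-flop symbol with input D, output Q, and clock input CLK.]

Value of D is sampled on positive clock edge.

Q outputs sampled value for rest of cycle.

[Figure: timing diagram of D and Q, labeled t_setup and t_clk-to-Q.]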
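The sampling behavior described on this slide can be sketched as a small behavioral model. It is an idealization that ignores t_setup and t_clk-to-Q entirely (Q updates in the same sample as the edge); the function name and the waveforms are made up for illustration.

```python
from typing import List, Sequence


def simulate_dff(clk: Sequence[int], d: Sequence[int], q_init: int = 0) -> List[int]:
    """Idealized positive-edge-triggered D flip-flop.

    clk and d are equal-length sequences of 0/1 samples. On each 0 -> 1
    transition of clk, Q takes the current value of D; otherwise Q holds
    its previous value. Returns the value of Q after each sample.
    """
    q = q_init
    prev_clk = 0
    trace = []
    for c, din in zip(clk, d):
        if prev_clk == 0 and c == 1:  # positive clock edge: sample D
            q = din
        trace.append(q)
        prev_clk = c
    return trace


clk = [0, 1, 0, 1, 0, 1]
d   = [1, 1, 0, 0, 1, 0]
print(simulate_dff(clk, d))
```

Note that the 1 on D in the fifth sample is never captured, because D returns to 0 before the next positive edge.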


Conclusion -- Timing

Flip-flops: setup and clk-to-Q

Critical path limits clock speed

Logic delay: fan-out and wires
