cs 152 lec 3.1 ©ucb fall 2002 cs 152: computer architecture and engineering lecture 3 performance,...

©UCB Fall 2002 CS 152Lec 3.1

CS 152: Computer Architecture and Engineering

Lecture 3

Performance, Technology & Delay Modeling

Modified From the Lectures of Randy H. KatzUC Berkeley

Lec 3.2

Review: Salient features of MIPS I

• 32-bit fixed format inst (3 formats)• 32 32-bit GPR (R0 contains zero) and 32 FP

registers (+ HI LO)– Partitioned by software convention

• 3-address, reg-reg arithmetic instr.• Single address mode for load/store:

base+displacement– No indirection, scaled

• 16-bit immediate plus LUI• Simple branch conditions

– Compare against zero or two registers for =,– No integer condition codes

• Support for 8 bit, 16 bit, and 32 bit integers• Support for 32 bit and 64 bit floating point

Lec 3.3

Review: MIPS Addr Modes/Instruction Formats

op rs rt rd

immed

register

Register (direct)

op rs rt

register

Base+index

+

Memory

immedop rs rtImmediate

immedop rs rt

PC

PC-relative

+

Memory

• All instructions 32 bits wide

Lec 3.4

Review: When Does MIPS Sign Extend?•When value is sign extended, copy upper bit to full

value:

Examples of sign extending 8 bits to 16 bits:

00001010 00000000 0000101010001100 11111111 10001100

•When is an immediate value sign extended?– Arithmetic instructions (add, sub, etc.) sign extend

immediates even for the unsigned versions of the instructions!

– Logical instructions do not sign extend

addi $r2, $r3, -1 has 0xFFFF in immediate fieldand will extend to 0xFFFFFFFF before

adding

andi $r2, $r3, -1 has 0xFFFF in immediate field and will extend to 0x0000FFFF before

anding

– Kinda weird to put negative numbers in logical instructions

Lec 3.5

Review: Details of the MIPS Instruction Set• Register zero always has the value zero (even if you try to write

it)• Branch/jump and link put the return addr.

PC+4 into the link register (R31), also called “ra”• All instructions change all 32 bits of the destination register

(including lui, lb, lh) and all read all 32 bits of sources (add, and, …)

• Difference between signed and unsigned versions:– For add and subtract: signed causes exception on overflow

» No difference in sign-extension behavior!– For multiply and divide, distinguishes type of operation

• Thus, overflow can occur in these arithmetic and logical instructions:

– add, sub, addi– it cannot occur in addu, subu, addiu, and, or, xor, nor, shifts, mult, multu,

div, divu

• Immediate arithmetic & logical instructions are extended as follows:

– logical immediates ops are zero extended to 32 bits– arithmetic immediates ops are sign extended to 32 bits (including addu)

• Data loaded by the instructions lb and lh are extended as follows:

– lbu, lhu are zero extended– lb, lh are sign extended

Lec 3.6

Performance

•Purchasing perspective – Given a collection of machines, which has the

» Best performance ?» Least cost ?» Best performance / cost ?

•Design perspective– Faced with design options, which has the

» Best performance improvement ?» Least cost ?» Best performance / cost ?

•Both require– basis for comparison– metric for evaluation

•Our goal: understand cost & performance implications of architectural choices

Lec 3.7

Two Notions of “Performance”

Which has higher performance?• Time to do the task (Execution Time)

– execution time, response time, latency• Tasks per day, hour, week, sec, ns. .. (Performance)

– throughput, bandwidth

Response time and throughput often are in opposition

Plane

Boeing 747

BAD/Sud Concorde

Speed

610 mph

1350 mph

DC to Paris

6.5 hours

3 hours

Passengers

470

132

Throughput (pmph)

286,700

178,200

Lec 3.8

Definitions•Performance is in units of things-per-second

– bigger is better

•If we are primarily concerned with response time– performance(x) = 1

execution_time(x)

" X is n times faster than Y" means

Performance(X)

n = ----------------------

Performance(Y)

Lec 3.9

Example• Time of Concorde vs. Boeing 747?

• Concord is 1350 mph / 610 mph = 2.2 times faster

= 6.5 hours / 3 hours

• Throughput of Concorde vs. Boeing 747 ?• Concord is 178,200 pmph / 286,700 pmph = 0.62 “times

faster”• Boeing is 286,700 pmph / 178,200 pmph = 1.60 “times

faster”

• Boeing is 1.6 times (“60%”) faster in terms of throughput• Concord is 2.2 times (“120%”) faster in terms of flying

time

We will focus primarily on execution time for a single job

Lots of instructions in a program => Instruction thruput important!

Lec 3.10

Basis of Evaluation

Actual Target Workload

Full Application Benchmarks

Small “Kernel” Benchmarks

Microbenchmarks

Pros Cons

• representative• very specific• non-portable• difficult to run, or measure• hard to identify cause

• portable• widely used• improvements useful in reality

• easy to run, early in design cycle

• identify peak capability and potential bottlenecks

• less representative

• easy to “fool”

• “peak” may be a long way from application performance

Lec 3.11

SPEC95•Eighteen application benchmarks (with inputs) reflecting a technical computing workload

•Eight integer–go, m88ksim, gcc, compress, li, ijpeg, perl, vortex

•Ten floating-point intensive–tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5

•Must run with standard compiler flags–eliminate special undocumented incantations that may not even generate working code for real programs

Lec 3.12

Metrics of Performance

Compiler

Programming Language

Application

DatapathControl

Transistors Wires Pins

ISA

Function Units

(millions) of Instructions per second – MIPS(millions) of (F.P.) operations per second – MFLOP/s

Cycles per second (clock rate)

Megabytes per second

Seconds per program

Useful Operations per second

Each metric has a place and a purpose, and each can be misused

Lec 3.13

CPI

CPU time = ClockCycleTime * CPI * Ii = 1

n

i i

CPI = CPI * F where F = I i = 1

n

i i i iInstruction Count

"instruction frequency"

Invest Resources where time is Spent!

CPIave = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count

“ Average cycles per instruction”

Lec 3.14

Aspects of CPU Performance

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle instr count CPI clock rate

Program

Compiler

Instr. Set

Organization

Technology

Lec 3.15

Speedup due to enhancement E: ExTime w/o E Performance w/

ESpeedup(E) = -------------------- = --------------------- ExTime w/ E Performance w/o

E

Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then,

ExTime(with E) = ((1-F) + F/S) x ExTime(without E)

Speedup(with E) = 1 (1-F) + F/S

Amdahl's Law

Lec 3.16

Example (RISC Processor)

Typical Mix

Base Machine (Reg / Reg)Op Freq Cycles CPI(i) % Time

ALU 50% 1 .5 23%

Load 20% 5 1.0 45%

Store 10% 3 .3 14%

Branch 20% 2 .4 18%

2.2

How much faster would the machine be if a better data cachereduced the average load time to 2 cycles?

How does this compare with using branch prediction to shave a cycle off the branch time?

What if two ALU instructions could be executed at once?

Lec 3.17

Summary: Evaluating Instruction Sets and Implementation

•Design-time metrics:– Can it be implemented, in how long, at what cost?– Can it be programmed? Ease of compilation?

•Static Metrics:– How many bytes does the program occupy in memory?

•Dynamic Metrics:– How many instructions are executed?– How many bytes does the processor fetch to execute the

program?– How many clocks are required per instruction?– How "lean" a clock is practical?

•Best Metric: Time to execute the program!

NOTE: Depends on instructions set, processor organization,and compilation techniques.

CPI

Inst. Count Cycle Time

Lec 3.18

Administrative Matters

•Course accounts available from the TAs

•Lab #2 posted, due in 1.5 weeks—individual work on this one; submission by midnight

Lec 3.19

Finite State Machines:• System state is explicit in representation

• Transitions between states represented as arrows with inputs on arcs

• Output may be either part of state or on arcs

Alpha/

0

Delta/

2

Beta/

10

1

1

0

0

1

“ Mod 3 Machine”

Input (MSB first)

0 1 0 1 00 1 2 2

1

106

Mod 3

1

1

1 1

0

Lec 3.20

“M

eale

y M

ach

ine”

“M

oore

Mach

ine”

Implementation as Combinational Logic + Latch

Alpha/

0

Delta/

2

Beta/

1

0/0

1/0

1/1

0/10/0

1/1

Latc

h

Com

bin

ati

onal

Logic

I nput Stateold Statenew Div

000

000110

001001

001

111

000110

010010

011

Lec 3.21

Year

Perf

orm

an

ce

0.1

1

10

100

1000

1965 1970 1975 1980 1985 1990 1995 2000

Microprocessors

Minicomputers

Mainframes

Supercomputers

Performance and Technology Trends

•Technology Power: 1.2 x 1.2 x 1.2 = 1.7 x / year– Feature Size: shrinks 10% / yr. => Switching speed improves 1.2

/ yr.

– Density: improves 1.2x / yr.

– Die Area: 1.2x / yr.

•Lesson of RISC is to keep the ISA as simple as possible:

– Shorter design cycle => fully exploit the advancing technology (~3yr)

– Advanced branch prediction and pipeline techniques

– Bigger and more sophisticated on-chip caches

Lec 3.22

Range of Design Styles

Gates

Routing Channel

Gates

Routing Channel

Gates

StandardALU

Standard Registers

Gates

Cust

om

Contr

ol Lo

gic

CustomRegister File

Custom Design Standard Cell Gate Array/FPGA/CPLD

CustomALU

Performance

Design Complexity (Design Time)

Longer wiresCompact

Lec 3.23

•CMOS: Complementary Metal Oxide Semiconductor

– NMOS (N-Type Metal Oxide Semiconductor) transistors

– PMOS (P-Type Metal Oxide Semiconductor) transistors

•NMOS Transistor– Apply a HIGH (Vdd) to its gate

turns the transistor into a “conductor”

– Apply a LOW (GND) to its gateshuts off the conduction path

•PMOS Transistor– Apply a HIGH (Vdd) to its gate

shuts off the conduction path

– Apply a LOW (GND) to its gate turns the transistor into a “conductor”

Basic Technology: CMOS

Vdd = 5V

GND = 0v

GND = 0v

Vdd = 5V

Lec 3.24

•Inverter Operation

Vdd

OutIn

Symbol Circuit

Basic Components: CMOS Inverter

OutIn

Vdd VddVdd

Out

Open

Discharge

Open

Charge

Vin

Vout

Vdd

Vdd

PMOS

NMOS

Lec 3.25

Basic Components: CMOS Logic Gates

NAND Gate NOR Gate

Vdd

A

B

Out

Vdd

A

B

Out

OutA

B

A

B

Out

A B Out

0 0 10 1 11 0 11 1 0

A B Out

0 0 10 1 01 0 01 1 0

Lec 3.26

Voltage Waveforms versus Time

Time

Voltage

1 => Vdd

VinVout

0 => GND

Vin Vout

Lec 3.27

Series Connection

• Total Propagation Delay = Sum of individual delays = d1 + d2

• Capacitance C1 has two components:

– Capacitance of the wire connecting the two gates

– Input capacitance of the second inverter

Vdd

Cout

Vout

Vdd

C1

V1Vin

V1Vin Vout

Time

G1 G2 G1 G2

VoltageVdd

Vin

GND

V1 Vout

Vdd/2d1 d2

Lec 3.28

Gate Comparison

•PMOS are 3 times slower than NMOS (3 times higher resistance) so if all devices are the same size then a NAND Low to High will be

•Better to put NMOS transistors in series

Vdd

A

B

Out

Vdd

A

B

Out

NAND Gate NOR Gate

Lec 3.29

Calculating Delays

•Sum delays along serial paths

•Delay (Vin -> V2) ! = Delay (Vin -> V3)– Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)

– Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V2) + Delay (V1 -> V3)

•Critical Path = The longest delay path (Vin->V3)

•C1 = Wire Capacitance + Cin of Gate 2 + Cin of Gate 3

Vdd

V2

VddV1Vin V2

C1

V1VinG1 G2

Vdd

V3G3

V3

Lec 3.30

General C/L Cell Delay Model

•Combinational Cell (symbol) is fully specified by:– Functional (input -> output) behavior

» Truth-table, logic equation, VHDL

– Load at each input

– Critical prop delay from each input to each output for each transition

» THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load capacitance

– Linear model is good enough up to Ccritical

Cout

VoutA

B

X

.

.

.

CombinationalLogic Cell

Cout

DelayVa -> Vout

XX

X

X

Ccritical

Internal Delay

delay per load (Cload)Nanoseconds/femtoFarad= ns/fF

Lec 3.31

Characterize a Gate•Input capacitance for each input

•For each input-to-output path:– For each output transition (H->L, L->H)

» Internal delay (ns) - e.g. for low to high from A to O: TPAOlh

» Load dependent delay (ns / fF) - e.g. TPAOlhf

•Example: 2-input NAND GateOA

B

For A and B: Input Load = 61 fF

For A -> O : TPAOlh = 0.5ns TPAOlhf = 0.0021ns / fF TPAOhl = 0.1ns TPAOhlf = 0.0020ns / fF

Delay A -> OO: Low -> High

CO

0.5ns

Slope =0.0021ns / fF

Lec 3.32

A Specific Example: 2 to 1 MUX

•Input Load (I.L.)– A, B: I.L. (NAND) = 61 fF

– S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF

•Load Dependent Delay (L.D.D.): Set by Gate 3 – TAYlhf = 0.021 ns / fF TAYhlf = 0.020 ns / fF

– TBYlhf = 0.021 ns / fF TBYhlf = 0.020 ns / fF

– TSYlhf = 0.021 ns / fF TSYlhf = 0.020 ns / fF

Y = (A and !S) or (A and S)

A

B

S

Gate 3

Gate 2

Gate 1Wire 1

Wire 2

Wire 0

A

B

Y

S

2 x

1 M

ux

Inv: I.L. = 50 fF; T I.D.= .01 ns; T L.D.D. = .01 ns/fF

Lec 3.33

2 to 1 MUX: Internal Delay Calculation

•Internal Delay (I.D.):– A to Y: I.D. G1 + (Wire_1_C + G3_Input_C) * L.D.D G1 + I.D. G3

– B to Y: I.D. G2 + (Wire_2_C + G3_Input_C) * L.D.D. G2 + I.D. G3

– S to Y (Worst Case) : I.D. Inv + (Wire_0_C + G1_Input_C) * L.D.D. Inv + Internal Delay A to Y

•We can approximate the size of “Wire_1_C” by:– Assume Wire 1 has the same C as the input capacitance off all the

gates attached to it (G3 in this case).

Therefore the total C that Gate1 needs to drive = 2.0 x G3_Input_C


A

B

S

Gate 3

Gate 2

Gate 1Wire 1

Wire 2

Wire 0

L.D.D. = Load Dependent DelayI.D. = Internal Delay

Lec 3.34

2 to 1 MUX: Internal Delay Calculation (continue)

•Internal Delay (I.D.):– A to Y: I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D G1 + I.D. G3

– B to Y: I.D. G2 + (Wire 2 C + G3 Input C) * L.D.D. G2 + I.D. G3

– S to Y (Worst Case): I.D. Inv + (Wire 0 C + G1 Input C) * L.D.D. Inv + Internal Delay A to Y

•Specific Example:– TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3

= 0.1ns + 122 fF * 0.0020 ns/fF + 0.5ns = 0.844 ns


A

B

S

Gate 3

Gate 2

Gate 1Wire 1

Wire 2

Wire 0

Lec 3.35

Abstraction: 2 to 1 MUX

•Input Load: A = 61 fF, B = 61 fF, S = 111 fF

•Load Dependent Delay:– TAYlhf = 0.021 ns / fF TAYhlf = 0.020 ns / fF

– TBYlhf = 0.021 ns / fF TBYhlf = 0.020 ns / fF

– TSYlhf = 0.021 ns / fF TSYlhf = 0.020 ns / f F

•Internal Delay:– TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3

= 0.1ns + 122 fF * 0.0020ns/fF + 0.5ns = 0.844ns

– Fun Exercises!: TAYhl, TBYlh, TSYlh, TSYlh

A

B

Y

S

2 x

1 M

ux

A

B

S

Gate 3

Gate 2

Gate 1

Y

Lec 3.36

Storage Element’s Timing Model

•Setup Time: Input must be stable BEFORE the trigger clock edge

•Hold Time: Input must REMAIN stable after the trigger clock edge

•Clock-to-Q time:– Output cannot change instantaneously at the trigger

clock edge

– Similar to delay in logic gates, two components:» Internal Clock-to-Q

» Load dependent Clock-to-Q

D QD Don’t Care Don’t Care

Clk

UnknownQ

Setup Hold

Clock-to-Q

Lec 3.37

CS152 Building Blocks (maybe more….)

•Logic elements– NAND2, NAND3, NAND 4

– NOR2, NOR3, NOR4

– INV1x (normal inverter)

– INV4x (inverter with large output drive)

– XOR2

– XNOR2

– PWR: Source of 1’s

– GND: Source of 0’s

– fast MUXes

•Storage Element– D flip flop - negative edge triggered

Lec 3.38

Clocking Methodology

•All storage elements are clocked by the same clock edge (but there may be clock skews)

•The combination logic block’s:– Inputs are updated at each clock tick

– All outputs MUST be stable before the next clock tick

Clk

.

.

.

.

.

.

.

.

.

.

.

.Combination Logic

Lec 3.39

Hold Time Violations

• The worst case scenario for hold time consideration:

– The input register sees CLK1

– The output register sees CLK2

– fast FF1 output must not change input to FF2 for same clock edge

• (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time

Clk1

Clk2

Clk1 Clk2

.

.

.

.

.

.

.

.

.

.

.

.Combination Logic

Hold Time

Clk-to-Q+Delay

Lec 3.40

Clock Skew’s Effect on Hold Time Violation

• The worst case scenario for hold time consideration:

– The input register sees CLK1



• (CLK-to-Q + Shortest Delay Path) > Hold Time + Clock Skew or (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time

Clk1

Clk2

Clk1 Clk2

.

.

.

.

.

.

.

.

.

.

.

.Combination Logic

Clock Skew

Hold Time

Clk-to-Q+Delay

Lec 3.41

Critical Path & Cycle Time

•Critical path: the slowest path between any two storage devices

•Cycle time is a function of the critical path

•must be greater than:– Clock-to-Q + Longest Path through the Combination

Logic + Setup

Clk

.

.

.

.

.

.

.

.

.

.

.

.

Lec 3.42

Clock Skew’s Effect on Cycle Time

•Worst case scenario for cycle time consideration:– The input register sees CLK1


•Cycle Time = CLK-to-Q(FF1) + Longest Delay(C/L) + Setup(FF2) + Clock Skew

Clk1

Clk2 Clock Skew

.

.

.

.

.

.

.

.

.

.

.

.

Clock Skw

Clk1 Clk2

FF1 FF2Setup

Lec 3.43

Tricks to Reduce Cycle Time

•Reduce the number of gate levels

Pay attention to loading

• One gate driving many gates is a bad idea

• Avoid using a small gate to drive a long wire

Use multiple stages to drive large load

A

B

CD

A

B

C

D

INV4x

INV4x

Clarge

Lec 3.44

How to Avoid Hold Time Violations?

•Hold time requirement:– Input to register must NOT change immediately after the

clock tick

•This is usually easy to meet in the “edge trigger” clocking scheme

•Hold time of most FFs is <= 0 ns

•CLK-to-Q + Shortest Delay Path must be greater than Hold Time

Clk

.

.

.

.

.

.

.

.

.

.

.

.Combination Logic

Lec 3.45

Hold Time Violation

• The worst case scenario for hold time consideration:– The input register sees CLK1



• For no violation (CLK-to-Q + Shortest Delay Path) > Hold Time

• A violation is shown above

Clk1

Clk2

Clk1 Clk2

.

.

.

.

.

.

.

.

.

.

.

.Combination Logic

Hold Time

Clk-to-Q+Delay

Lec 3.46

A Hold Time Violation Because of Clock Skew

•For no violation(CLK-to-Q + Shortest Delay Path) > Hold Time +

Clock Skew

or (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time

Clk1

Clk2

Clk1 Clk2

.

.

.

.

.

.

.

.

.

.

.

.Combination Logic

Clock Skew

Hold Time

Clk-to-Q+Delay

Lec 3.47

Summary•Total execution time is the most reliable measure

of performance

•Amdahl’s law: Law of Diminishing Returns

•Performance and Technology Trends– Keep the design simple (KISS rule) to take advantage of the

latest technology

– CMOS inverter and CMOS logic gates

•Delay Modeling and Gate Characterization– Delay = Internal Delay + (Load Dependent Delay x Output

Load)

•Clocking Methodology and Timing Considerations– Simplest clocking methodology

» All storage elements use the SAME clock edge

– Cycle Time CLK-to-Q + Longest Delay Path + Setup + Clock Skew

– (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time

Lec 3.48

To Get More Information

•EECS 141 - Digital Integrated Circuit Design - 105 no longer a prerequisite - only EECS 40 required!

•Book: Digital Integrated Circuits - A design perspective - by Jan Rabaey

•Web page (slides from book) – http://bwrc.eecs.berkeley.edu/icdesign/instructors.html

cs 152 lec 3.1 ©ucb fall 2002 cs 152: computer architecture and engineering lecture 3 performance,...

Documents