©UCB Fall 2002 CS 152 Lec 3.1
CS 152: Computer Architecture and Engineering
Lecture 3
Performance, Technology & Delay Modeling
Modified From the Lectures of Randy H. Katz, UC Berkeley
Lec 3.2
Review: Salient features of MIPS I
• 32-bit fixed-format instructions (3 formats)
• 32 32-bit GPRs (R0 contains zero) and 32 FP registers (+ HI, LO)
– Partitioned by software convention
• 3-address, reg-reg arithmetic instructions
• Single address mode for load/store: base+displacement
– No indirection, scaled
• 16-bit immediate plus LUI
• Simple branch conditions
– Compare against zero or two registers for =, ≠
– No integer condition codes
• Support for 8-bit, 16-bit, and 32-bit integers
• Support for 32-bit and 64-bit floating point
Lec 3.3
Review: MIPS Addr Modes/Instruction Formats
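[Figure: MIPS addressing modes and the instruction fields they use]
• Register (direct): op rs rt rd; operand is a register
• Immediate: op rs rt immed; operand is the immed field
• Base+index: op rs rt immed; register + immed addresses Memory
• PC-relative: op rs rt immed; PC + immed addresses Memory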
• All instructions 32 bits wide
Lec 3.4
Review: When Does MIPS Sign Extend?
• When a value is sign extended, copy the upper bit to fill the full value:
Examples of sign extending 8 bits to 16 bits:
00001010 => 00000000 00001010
10001100 => 11111111 10001100
• When is an immediate value sign extended?
– Arithmetic instructions (add, sub, etc.) sign extend immediates even for the unsigned versions of the instructions!
– Logical instructions do not sign extend
» addi $r2, $r3, -1 has 0xFFFF in the immediate field and will extend to 0xFFFFFFFF before adding
» andi $r2, $r3, -1 has 0xFFFF in the immediate field and will extend to 0x0000FFFF before anding
– Kinda weird to put negative numbers in logical instructions
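The two extension rules can be sketched in a few lines of Python. This is only an illustration of the bit manipulation, not a MIPS simulator; the function names `sign_extend` and `zero_extend` are made up for this sketch.

```python
def sign_extend(value, from_bits, to_bits):
    """Copy the top bit of a from_bits-wide field into the upper bits."""
    value &= (1 << from_bits) - 1              # keep only the field
    if value >> (from_bits - 1):               # top bit of the field is 1
        upper = ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
        value |= upper                         # fill the upper bits with 1s
    return value


def zero_extend(value, from_bits):
    """Upper bits stay 0; only the field itself is kept."""
    return value & ((1 << from_bits) - 1)


# The 8-bit to 16-bit examples from the slide.
print(format(sign_extend(0b00001010, 8, 16), "016b"))
print(format(sign_extend(0b10001100, 8, 16), "016b"))

# The immediate field of "addi/andi $r2, $r3, -1" holds 0xFFFF.
field = -1 & 0xFFFF
print(hex(sign_extend(field, 16, 32)))   # arithmetic: sign extended
print(hex(zero_extend(field, 16)))       # logical: zero extended
```

Arithmetic immediates go through `sign_extend`, logical immediates through `zero_extend`, which is why the same 0xFFFF field gives two different 32-bit operands.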
Lec 3.5
Review: Details of the MIPS Instruction Set
• Register zero always has the value zero (even if you try to write it)
• Branch/jump and link put the return addr. PC+4 into the link register (R31), also called “ra”
• All instructions change all 32 bits of the destination register (including lui, lb, lh) and all read all 32 bits of sources (add, and, …)
• Difference between signed and unsigned versions:
– For add and subtract: signed causes exception on overflow
» No difference in sign-extension behavior!
– For multiply and divide, distinguishes type of operation
• Thus, overflow can occur in these arithmetic and logical instructions:
– add, sub, addi
– it cannot occur in addu, subu, addiu, and, or, xor, nor, shifts, mult, multu, div, divu
• Immediate arithmetic & logical instructions are extended as follows:
– logical immediate ops are zero extended to 32 bits
– arithmetic immediate ops are sign extended to 32 bits (including addiu)
• Data loaded by the instructions lb and lh are extended as follows:
– lbu, lhu are zero extended
– lb, lh are sign extended
Lec 3.6
Performance
• Purchasing perspective
– Given a collection of machines, which has the
» Best performance?
» Least cost?
» Best performance / cost?
• Design perspective
– Faced with design options, which has the
» Best performance improvement?
» Least cost?
» Best performance / cost?
• Both require
– basis for comparison
– metric for evaluation
• Our goal: understand cost & performance implications of architectural choices
Lec 3.7
Two Notions of “Performance”
Which has higher performance?
• Time to do the task (Execution Time)
– execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ... (Performance)
– throughput, bandwidth
Response time and throughput often are in opposition

Plane              Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours     470          286,700
BAC/Sud Concorde   1350 mph   3 hours       132          178,200
Lec 3.8
Definitions
• Performance is in units of things-per-second
– bigger is better
• If we are primarily concerned with response time
– $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
• "X is n times faster than Y" means
– $n = \dfrac{\text{Performance}(X)}{\text{Performance}(Y)}$
Lec 3.9
Example
• Time of Concorde vs. Boeing 747?
– Concorde is 1350 mph / 610 mph = 2.2 times faster
– = 6.5 hours / 3 hours
• Throughput of Concorde vs. Boeing 747?
– Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
– Boeing is 286,700 pmph / 178,200 pmph = 1.60 “times faster”
• Boeing is 1.6 times (“60%”) faster in terms of throughput
• Concorde is 2.2 times (“120%”) faster in terms of flying time
We will focus primarily on execution time for a single job
Lots of instructions in a program => instruction throughput important!
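The "n times faster" definition can be applied to both notions of performance with a short Python sketch. The numbers are the 747 and Concorde figures from the earlier slide; the dictionary layout and function names are assumptions made for this illustration.

```python
# Figures from the Boeing 747 / Concorde comparison.
PLANES = {
    "Boeing 747": {"speed_mph": 610, "trip_hours": 6.5, "passengers": 470},
    "Concorde": {"speed_mph": 1350, "trip_hours": 3.0, "passengers": 132},
}


def response_performance(plane):
    """Performance when response time matters: 1 / execution time."""
    return 1.0 / plane["trip_hours"]


def throughput_pmph(plane):
    """Throughput in passenger-miles per hour."""
    return plane["passengers"] * plane["speed_mph"]


def times_faster(perf_x, perf_y):
    """n such that X is n times faster than Y."""
    return perf_x / perf_y


concorde, boeing = PLANES["Concorde"], PLANES["Boeing 747"]

latency_ratio = times_faster(response_performance(concorde),
                             response_performance(boeing))
throughput_ratio = times_faster(throughput_pmph(boeing),
                                throughput_pmph(concorde))

print(f"Concorde vs. 747 on flying time: {latency_ratio:.2f}x")
print(f"747 vs. Concorde on throughput:  {throughput_ratio:.2f}x")
```

The same pair of machines ranks in opposite order depending on which metric is plugged into `times_faster`.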
Lec 3.10
Basis of Evaluation
• Actual Target Workload
– Pros: representative
– Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
• Full Application Benchmarks
– Pros: portable; widely used; improvements useful in reality
– Cons: less representative
• Small “Kernel” Benchmarks
– Pros: easy to run, early in design cycle
– Cons: easy to “fool”
• Microbenchmarks
– Pros: identify peak capability and potential bottlenecks
– Cons: “peak” may be a long way from application performance
Lec 3.11
SPEC95
• Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer
– go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive
– tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
– eliminate special undocumented incantations that may not even generate working code for real programs
Lec 3.12
Metrics of Performance
[Figure: levels of a computer system, top to bottom: Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins. Metrics attached to the levels:]
• Useful operations per second
• (Millions) of instructions per second – MIPS
• (Millions) of (F.P.) operations per second – MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)
• Seconds per program
Each metric has a place and a purpose, and each can be misused
Lec 3.13
CPI
“Average cycles per instruction”
CPIave = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count
$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$
$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$, where $F_i = \dfrac{I_i}{\text{Instruction Count}}$ ("instruction frequency")
Invest resources where time is spent!
Lec 3.14
Aspects of CPU Performance
$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
[Table: rows Program, Compiler, Instr. Set, Organization, Technology; columns instr count, CPI, clock rate]
Lec 3.15
Amdahl's Law
Speedup due to enhancement E:
$\text{Speedup}(E) = \dfrac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \dfrac{\text{Performance w/ } E}{\text{Performance w/o } E}$
Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected; then,
ExTime(with E) = ((1-F) + F/S) x ExTime(without E)
$\text{Speedup}(\text{with } E) = \dfrac{1}{(1-F) + F/S}$
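Amdahl's Law is easy to try out numerically. The following is a minimal Python sketch of the speedup formula; the function name and the sample values of F and S are chosen only for illustration.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the task is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# Illustrative values: half of the task is enhanced.
for s in (2, 10, 1000):
    print(f"F = 0.5, S = {s:>4}: speedup = {amdahl_speedup(0.5, s):.3f}")
```

However large S gets, the speedup with F = 0.5 stays below 1 / (1 - F) = 2, because the unenhanced half of the task still takes its full time.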
Lec 3.16
Example (RISC Processor)
Typical Mix, Base Machine (Reg / Reg)

Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Total                    2.2

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
• How does this compare with using branch prediction to shave a cycle off the branch time?
• What if two ALU instructions could be executed at once?
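One way to work through the three questions is to recompute the average CPI for each variant of the mix. The Python sketch below assumes that instruction count and clock rate stay unchanged, so speedup is just the ratio of CPIs, and it models "two ALU instructions at once" as an ideal halving of the effective ALU cycles; both are simplifying assumptions, and the helper names are invented here.

```python
# (frequency, cycles) per instruction class, from the typical mix.
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """CPI = sum over classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, cycles):
    """Copy of the mix with one class given a new cycle count."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed


base_cpi = average_cpi(BASE_MIX)
variants = {
    "better data cache (load = 2 cycles)": with_cycles(BASE_MIX, "Load", 2),
    "branch prediction (branch = 1 cycle)": with_cycles(BASE_MIX, "Branch", 1),
    "two ALU ops at once (ALU = 0.5 cycle, ideal)": with_cycles(BASE_MIX, "ALU", 0.5),
}

print(f"base CPI = {base_cpi:.2f}")
for name, mix in variants.items():
    cpi = average_cpi(mix)
    print(f"{name}: CPI = {cpi:.2f}, speedup = {base_cpi / cpi:.3f}")
```

Under these assumptions the load improvement helps most, since loads account for the largest share of execution time in the base machine.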
Lec 3.17
Summary: Evaluating Instruction Sets and Implementation
• Design-time metrics:
– Can it be implemented, in how long, at what cost?
– Can it be programmed? Ease of compilation?
• Static Metrics:
– How many bytes does the program occupy in memory?
• Dynamic Metrics:
– How many instructions are executed?
– How many bytes does the processor fetch to execute the program?
– How many clocks are required per instruction?
– How "lean" a clock is practical?
• Best Metric: Time to execute the program!
NOTE: Depends on instruction set, processor organization, and compilation techniques.
[Figure: triangle relating Inst. Count, CPI, and Cycle Time]
Lec 3.18
Administrative Matters
•Course accounts available from the TAs
•Lab #2 posted, due in 1.5 weeks—individual work on this one; submission by midnight
Lec 3.19
Finite State Machines:
• System state is explicit in representation
• Transitions between states represented as arrows with inputs on arcs
• Output may be either part of state or on arcs
[Figure: “Mod 3 Machine” state diagram with states Alpha/0, Beta/1, Delta/2 and arcs labeled with input bits 0 and 1; input is applied MSB first. Example: input 106 gives Mod 3 output 1]
Lec 3.20
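Implementation as Combinational Logic + Latch
[Figure: “Mealy Machine” and “Moore Machine” versions of the Mod 3 machine; states Alpha/0, Beta/1, Delta/2 with arcs labeled input/output (0/0, 1/0, 1/1, 0/1, 0/0, 1/1); implemented as a Combinational Logic block whose new state feeds a Latch]

Input   State old   State new   Div
0       00          00          0
0       01          10          0
0       10          01          1
1       00          01          0
1       01          00          1
1       10          10          1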
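The Mod 3 machine can be simulated directly from its state table. In the Python sketch below the table maps (old state, input bit) to (new state, Div output), with states 0, 1, 2 standing for Alpha, Beta, Delta; the function names are made up for this illustration, and the input bits are fed MSB first as on the slide.

```python
# (state_old, input_bit) -> (state_new, div_output)
TRANSITIONS = {
    (0, 0): (0, 0),
    (1, 0): (2, 0),
    (2, 0): (1, 1),
    (0, 1): (1, 0),
    (1, 1): (0, 1),
    (2, 1): (2, 1),
}


def msb_first_bits(n):
    """Binary digits of a non-negative integer, most significant bit first."""
    return [int(c) for c in format(n, "b")]


def run_mod3(bits):
    """Feed bits MSB first; return (final state, list of Div outputs)."""
    state = 0                     # start in Alpha
    div_bits = []
    for bit in bits:
        state, div = TRANSITIONS[(state, bit)]
        div_bits.append(div)      # Mealy-style output, produced on the arc
    return state, div_bits


remainder, div_bits = run_mod3(msb_first_bits(106))
quotient = int("".join(map(str, div_bits)), 2)
print(remainder, quotient)
```

The final state is the remainder of the input modulo 3, and the Div outputs read MSB first form the quotient, which is why the same table serves both the Moore view (output from the state) and the Mealy view (output on the arcs).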
Lec 3.21
Performance and Technology Trends
[Figure: Performance (log scale, 0.1 to 1000) versus Year (1965 to 2000) for Supercomputers, Mainframes, Minicomputers, and Microprocessors]
• Technology Power: 1.2 x 1.2 x 1.2 = 1.7x / year
– Feature Size: shrinks 10% / yr. => Switching speed improves 1.2x / yr.
– Density: improves 1.2x / yr.
– Die Area: 1.2x / yr.
• Lesson of RISC is to keep the ISA as simple as possible:
– Shorter design cycle => fully exploit the advancing technology (~3 yr)
– Advanced branch prediction and pipeline techniques
– Bigger and more sophisticated on-chip caches
Lec 3.22
Range of Design Styles
[Figure: three design styles side by side]
• Custom Design: Custom ALU, Custom Register File, Custom Control Logic
• Standard Cell: Standard ALU, Standard Registers, Gates
• Gate Array/FPGA/CPLD: rows of Gates separated by Routing Channels
• Figure labels: Performance; Design Complexity (Design Time); Compact; Longer wires
Lec 3.23
Basic Technology: CMOS
• CMOS: Complementary Metal Oxide Semiconductor
– NMOS (N-Type Metal Oxide Semiconductor) transistors
– PMOS (P-Type Metal Oxide Semiconductor) transistors
• NMOS Transistor
– Apply a HIGH (Vdd) to its gate: turns the transistor into a “conductor”
– Apply a LOW (GND) to its gate: shuts off the conduction path
• PMOS Transistor
– Apply a HIGH (Vdd) to its gate: shuts off the conduction path
– Apply a LOW (GND) to its gate: turns the transistor into a “conductor”
[Figure: NMOS and PMOS transistors with Vdd = 5V and GND = 0V applied to the gate]
Lec 3.24
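Basic Components: CMOS Inverter
• Inverter Operation
[Figure: inverter symbol (In, Out) and circuit: a PMOS transistor between Vdd and Out, an NMOS transistor between Out and ground. One case shows the PMOS charging Out while the NMOS is open; the other shows the NMOS discharging Out while the PMOS is open. Plot of Vout versus Vin, both ranging up to Vdd]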
Lec 3.25
Basic Components: CMOS Logic Gates
[Figure: symbols and transistor circuits of the NAND gate and the NOR gate, inputs A and B, output Out]

NAND Gate
A   B   Out
0   0   1
0   1   1
1   0   1
1   1   0

NOR Gate
A   B   Out
0   0   1
0   1   0
1   0   0
1   1   0
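The truth tables follow from treating each transistor as a switch. The Python sketch below assumes the usual static CMOS arrangement (NAND: NMOS in series to ground, PMOS in parallel to Vdd; NOR: the reverse); it is a switch-level illustration only, with no timing or voltage modeling, and the function names are invented for this sketch.

```python
def nmos_conducts(gate):
    """An NMOS transistor conducts when its gate is HIGH (1)."""
    return gate == 1


def pmos_conducts(gate):
    """A PMOS transistor conducts when its gate is LOW (0)."""
    return gate == 0


def cmos_inverter(a):
    pull_up = pmos_conducts(a)        # path from Vdd to Out
    pull_down = nmos_conducts(a)      # path from Out to GND
    assert pull_up != pull_down       # exactly one network conducts
    return 1 if pull_up else 0


def cmos_nand(a, b):
    pull_up = pmos_conducts(a) or pmos_conducts(b)       # PMOS in parallel
    pull_down = nmos_conducts(a) and nmos_conducts(b)    # NMOS in series
    assert pull_up != pull_down
    return 1 if pull_up else 0


def cmos_nor(a, b):
    pull_up = pmos_conducts(a) and pmos_conducts(b)      # PMOS in series
    pull_down = nmos_conducts(a) or nmos_conducts(b)     # NMOS in parallel
    assert pull_up != pull_down
    return 1 if pull_up else 0


print("A B NAND NOR")
for a in (0, 1):
    for b in (0, 1):
        print(a, b, cmos_nand(a, b), "  ", cmos_nor(a, b))
```

The internal assertions check the complementary property: for every input combination exactly one of the pull-up and pull-down networks conducts, so the output is always driven to either Vdd or GND.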
Lec 3.26
Voltage Waveforms versus Time
[Figure: Vin and Vout of an inverter plotted as Voltage versus Time; logic 1 => Vdd, logic 0 => GND]
Lec 3.27
Series Connection
• Total Propagation Delay = Sum of individual delays = d1 + d2
• Capacitance C1 has two components:
– Capacitance of the wire connecting the two gates
– Input capacitance of the second inverter
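[Figure: two inverters G1 and G2 in series (Vin -> V1 -> Vout), with capacitance C1 at node V1 and Cout at the output; Voltage versus Time waveforms of Vin, V1, and Vout between GND and Vdd, with delays d1 and d2 measured at Vdd/2]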
Lec 3.28
Gate Comparison
•PMOS are 3 times slower than NMOS (3 times higher resistance) so if all devices are the same size then a NAND Low to High will be
•Better to put NMOS transistors in series
[Figure: transistor-level circuits of the NAND Gate and the NOR Gate, each with inputs A and B, output Out, and supply Vdd]
Lec 3.29
Calculating Delays
•Sum delays along serial paths
• Delay (Vin -> V2) != Delay (Vin -> V3)
– Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
– Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
• Critical Path = The longest delay path (Vin -> V3)
• C1 = Wire Capacitance + Cin of Gate 2 + Cin of Gate 3
[Figure: gate G1 driven by Vin produces V1 (node capacitance C1), which fans out to G2 (output V2) and G3 (output V3)]
Lec 3.30
General C/L Cell Delay Model
• Combinational Cell (symbol) is fully specified by:
– Functional (input -> output) behavior
» Truth-table, logic equation, VHDL
– Load at each input
– Critical prop delay from each input to each output for each transition
» THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load capacitance
– Linear model is good enough up to Ccritical
[Figure: Combinational Logic Cell with inputs A, B, ..., X and output Vout loaded by Cout; plot of Delay (Va -> Vout) versus Cout, a straight line with intercept Internal Delay and slope delay per load (ns/fF, nanoseconds per femtofarad), valid up to Ccritical]
Lec 3.31
Characterize a Gate
• Input capacitance for each input
• For each input-to-output path:
– For each output transition (H->L, L->H)
» Internal delay (ns) - e.g. for low to high from A to O: TPAOlh
» Load dependent delay (ns / fF) - e.g. TPAOlhf
• Example: 2-input NAND Gate (inputs A, B; output O)
– For A and B: Input Load = 61 fF
– For A -> O: TPAOlh = 0.5 ns, TPAOlhf = 0.0021 ns / fF, TPAOhl = 0.1 ns, TPAOhlf = 0.0020 ns / fF
[Figure: Delay A -> O (O: Low -> High) versus CO; intercept 0.5 ns, slope 0.0021 ns / fF]
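The linear delay model can be written as a one-line function. The Python sketch below uses the NAND characterization values from this slide; the dictionary layout, the function name, and the choice of load (one NAND input, 61 fF, with no wire capacitance) are assumptions made for illustration.

```python
def gate_delay_ns(internal_delay_ns, load_dep_ns_per_ff, load_ff):
    """Delay = internal delay + load-dependent delay x load capacitance."""
    return internal_delay_ns + load_dep_ns_per_ff * load_ff


# 2-input NAND, path A -> O, values from the characterization above.
NAND2_A_TO_O = {
    "lh": {"internal_ns": 0.5, "per_ff_ns": 0.0021},   # output low -> high
    "hl": {"internal_ns": 0.1, "per_ff_ns": 0.0020},   # output high -> low
}
NAND2_INPUT_LOAD_FF = 61.0

# Assumed load for this example: a single NAND input, wire ignored.
for transition, p in NAND2_A_TO_O.items():
    d = gate_delay_ns(p["internal_ns"], p["per_ff_ns"], NAND2_INPUT_LOAD_FF)
    print(f"A -> O ({transition}) driving {NAND2_INPUT_LOAD_FF} fF: {d:.4f} ns")
```

The model is only meant to hold up to the critical load capacitance; beyond that the straight-line fit no longer describes the gate.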
Lec 3.32
A Specific Example: 2 to 1 MUX
• Input Load (I.L.)
– A, B: I.L. (NAND) = 61 fF
– S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
• Load Dependent Delay (L.D.D.): Set by Gate 3
– TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
– TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
– TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
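Y = (A and !S) or (B and S)
[Figure: 2 x 1 Mux symbol (inputs A, B, select S, output Y) and its gate-level circuit: an inverter on S drives Wire 0, Gate 1 and Gate 2 drive Wire 1 and Wire 2, and Gate 3 produces Y]
Inv: I.L. = 50 fF; T I.D. = .01 ns; T L.D.D. = .01 ns/fF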
Lec 3.33
2 to 1 MUX: Internal Delay Calculation
• Internal Delay (I.D.):
– A to Y: I.D. G1 + (Wire_1_C + G3_Input_C) * L.D.D. G1 + I.D. G3
– B to Y: I.D. G2 + (Wire_2_C + G3_Input_C) * L.D.D. G2 + I.D. G3
– S to Y (Worst Case): I.D. Inv + (Wire_0_C + G1_Input_C) * L.D.D. Inv + Internal Delay A to Y
• We can approximate the size of “Wire_1_C” by:
– Assume Wire 1 has the same C as the input capacitance of all the gates attached to it (G3 in this case).
– Therefore the total C that Gate 1 needs to drive = 2.0 x G3_Input_C
Y = (A and !S) or (B and S)
[Figure: gate-level circuit of the mux with inputs A, B, S; Gate 1, Gate 2, Gate 3; Wire 0, Wire 1, Wire 2]
L.D.D. = Load Dependent Delay
I.D. = Internal Delay
Lec 3.34
2 to 1 MUX: Internal Delay Calculation (continued)
• Internal Delay (I.D.):
– A to Y: I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D. G1 + I.D. G3
– B to Y: I.D. G2 + (Wire 2 C + G3 Input C) * L.D.D. G2 + I.D. G3
– S to Y (Worst Case): I.D. Inv + (Wire 0 C + G1 Input C) * L.D.D. Inv + Internal Delay A to Y
• Specific Example:
– TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3
= 0.1 ns + 122 fF * 0.0020 ns/fF + 0.5 ns = 0.844 ns
Y = (A and !S) or (B and S)
[Figure: gate-level circuit of the mux with inputs A, B, S; Gate 1, Gate 2, Gate 3; Wire 0, Wire 1, Wire 2]
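The A-to-Y calculation can be reproduced step by step. The Python sketch below follows the specific example: Gate 1's output falls (its high-to-low parameters apply), it drives Wire 1 plus one Gate 3 input, and Gate 3's output then rises. The constant names are invented for this illustration, and the wire capacitance uses the slide's approximation that the wire adds as much C as the gate input it feeds.

```python
# NAND characterization values used in the example (ns, ns/fF, fF).
TP_HL_G1 = 0.1          # Gate 1 internal delay, output high -> low
TP_HLF_G1 = 0.0020      # Gate 1 load-dependent delay, high -> low
TP_LH_G3 = 0.5          # Gate 3 internal delay, output low -> high
G3_INPUT_C = 61.0       # input capacitance of Gate 3

# Approximation from the slides: Wire 1 has the same C as the gate input
# attached to it, so Gate 1 drives 2.0 x G3_Input_C in total.
WIRE_FACTOR = 2.0


def internal_delay_a_to_y_lh():
    """TAYlh = TPhl G1 + (2.0 * G3_Input_C) * TPhlf G1 + TPlh G3."""
    load_on_g1 = WIRE_FACTOR * G3_INPUT_C
    return TP_HL_G1 + load_on_g1 * TP_HLF_G1 + TP_LH_G3


t_ay_lh = internal_delay_a_to_y_lh()
print(f"TAYlh = {t_ay_lh:.3f} ns")
```

The load that Y itself drives is not part of this number; it enters later through the mux's own load-dependent delay, which is what lets the mux be treated as a single cell with an internal delay and a slope.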
Lec 3.35
Abstraction: 2 to 1 MUX
•Input Load: A = 61 fF, B = 61 fF, S = 111 fF
• Load Dependent Delay:
– TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
– TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
– TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
• Internal Delay:
– TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3
= 0.1 ns + 122 fF * 0.0020 ns/fF + 0.5 ns = 0.844 ns
– Fun Exercises!: TAYhl, TBYlh, TSYlh, TSYhl
[Figure: the 2 x 1 Mux symbol (inputs A, B, S; output Y) next to its gate-level circuit (Gate 1, Gate 2, Gate 3)]
Lec 3.36
Storage Element’s Timing Model
• Setup Time: Input must be stable BEFORE the trigger clock edge
• Hold Time: Input must REMAIN stable after the trigger clock edge
• Clock-to-Q time:
– Output cannot change instantaneously at the trigger clock edge
– Similar to delay in logic gates, two components:
» Internal Clock-to-Q
» Load dependent Clock-to-Q
[Figure: D flip-flop (D, Q, Clk) timing diagram: D is Don’t Care outside the Setup and Hold window around the clock edge; Q is Unknown until Clock-to-Q after the edge]
Lec 3.37
CS152 Building Blocks (maybe more….)
• Logic elements
– NAND2, NAND3, NAND4
– NOR2, NOR3, NOR4
– INV1x (normal inverter)
– INV4x (inverter with large output drive)
– XOR2
– XNOR2
– PWR: Source of 1’s
– GND: Source of 0’s
– fast MUXes
• Storage Element
– D flip flop - negative edge triggered
Lec 3.38
Clocking Methodology
•All storage elements are clocked by the same clock edge (but there may be clock skews)
• The combination logic block’s:
– Inputs are updated at each clock tick
– All outputs MUST be stable before the next clock tick
[Figure: a Combination Logic block between input and output registers, all clocked by Clk]
Lec 3.39
Hold Time Violations
• The worst case scenario for hold time consideration:
– The input register sees CLK1
– The output register sees CLK2
– fast FF1 output must not change input to FF2 for same clock edge
• (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
[Figure: input register clocked by Clk1 and output register clocked by Clk2 around a Combination Logic block; timing diagram comparing Clk-to-Q + Delay with Hold Time]
Lec 3.40
Clock Skew’s Effect on Hold Time Violation
• The worst case scenario for hold time consideration:
– The input register sees CLK1
– The output register sees CLK2
– fast FF1 output must not change input to FF2 for same clock edge
• (CLK-to-Q + Shortest Delay Path) > Hold Time + Clock Skew or (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
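[Figure: input register clocked by Clk1 and output register clocked by Clk2 around a Combination Logic block; timing diagram showing Clock Skew between Clk1 and Clk2, Hold Time, and Clk-to-Q + Delay]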
Lec 3.41
Critical Path & Cycle Time
•Critical path: the slowest path between any two storage devices
•Cycle time is a function of the critical path
• Cycle time must be greater than:
– Clock-to-Q + Longest Path through the Combination Logic + Setup
[Figure: combination logic between two sets of storage elements, all clocked by Clk]
Lec 3.42
Clock Skew’s Effect on Cycle Time
• Worst case scenario for cycle time consideration:
– The input register sees CLK1
– The output register sees CLK2
• Cycle Time = CLK-to-Q(FF1) + Longest Delay(C/L) + Setup(FF2) + Clock Skew
[Figure: FF1 (clocked by Clk1) and FF2 (clocked by Clk2) around the combinational logic; timing diagram showing the Clock Skew between Clk1 and Clk2 and the Setup time of FF2]
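The cycle-time formula can be turned into a small calculator. The Python sketch below is illustrative only: the function names are invented, and the delay values in the example are hypothetical numbers, not taken from the lecture.

```python
def min_cycle_time_ns(clk_to_q, longest_delay, setup, clock_skew=0.0):
    """Cycle Time = CLK-to-Q(FF1) + Longest Delay(C/L) + Setup(FF2) + Clock Skew."""
    return clk_to_q + longest_delay + setup + clock_skew


def max_clock_mhz(cycle_time_ns):
    """Clock rate in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / cycle_time_ns


# Hypothetical values in ns, chosen only to exercise the formula.
no_skew = min_cycle_time_ns(clk_to_q=0.5, longest_delay=4.0, setup=0.3)
with_skew = min_cycle_time_ns(clk_to_q=0.5, longest_delay=4.0, setup=0.3,
                              clock_skew=0.2)

print(f"no skew:   {no_skew:.2f} ns -> {max_clock_mhz(no_skew):.1f} MHz")
print(f"with skew: {with_skew:.2f} ns -> {max_clock_mhz(with_skew):.1f} MHz")
```

Skew is added in full because the worst case has to be assumed: the capturing register may see its clock edge early relative to the launching register.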
Lec 3.43
Tricks to Reduce Cycle Time
• Reduce the number of gate levels
• Pay attention to loading
– One gate driving many gates is a bad idea
– Avoid using a small gate to drive a long wire
• Use multiple stages to drive large load
[Figure: a chain of gates on inputs A, B, C, D rearranged into fewer levels; two INV4x stages in series driving a large load Clarge]
Lec 3.44
How to Avoid Hold Time Violations?
• Hold time requirement:
– Input to register must NOT change immediately after the clock tick
• This is usually easy to meet in the “edge trigger” clocking scheme
• Hold time of most FFs is <= 0 ns
• CLK-to-Q + Shortest Delay Path must be greater than Hold Time
[Figure: a Combination Logic block between input and output registers, all clocked by Clk]
Lec 3.45
Hold Time Violation
• The worst case scenario for hold time consideration:
– The input register sees CLK1
– The output register sees CLK2
– fast FF1 output must not change input to FF2 for same clock edge
• For no violation: (CLK-to-Q + Shortest Delay Path) > Hold Time
• A violation is shown in the figure
[Figure: input register clocked by Clk1 and output register clocked by Clk2 around a Combination Logic block; timing diagram in which Clk-to-Q + Delay is shorter than Hold Time]
Lec 3.46
A Hold Time Violation Because of Clock Skew
• For no violation:
(CLK-to-Q + Shortest Delay Path) > Hold Time + Clock Skew
or
(CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
[Figure: input register clocked by Clk1 and output register clocked by Clk2 around a Combination Logic block; timing diagram showing Clock Skew, Hold Time, and Clk-to-Q + Delay]
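The hold-time condition can be checked the same way as the cycle-time one. This Python sketch computes the slack in the inequality above; the function names are invented and the example delays are hypothetical values picked to show one passing and one failing case.

```python
def hold_slack_ns(clk_to_q, shortest_delay, hold, clock_skew=0.0):
    """Slack in (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time."""
    return clk_to_q + shortest_delay - clock_skew - hold


def hold_ok(clk_to_q, shortest_delay, hold, clock_skew=0.0):
    """True when the hold-time requirement is met."""
    return hold_slack_ns(clk_to_q, shortest_delay, hold, clock_skew) > 0


# Hypothetical values in ns: the same path without and with clock skew.
print(hold_ok(clk_to_q=0.3, shortest_delay=0.2, hold=0.1))
print(hold_ok(clk_to_q=0.3, shortest_delay=0.2, hold=0.1, clock_skew=0.5))
```

Unlike a setup problem, a negative hold slack cannot be fixed by slowing the clock, since the cycle time does not appear in the inequality; the short path has to be lengthened or the skew reduced.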
Lec 3.47
Summary
• Total execution time is the most reliable measure of performance
• Amdahl’s law: Law of Diminishing Returns
• Performance and Technology Trends
– Keep the design simple (KISS rule) to take advantage of the latest technology
– CMOS inverter and CMOS logic gates
• Delay Modeling and Gate Characterization
– Delay = Internal Delay + (Load Dependent Delay x Output Load)
• Clocking Methodology and Timing Considerations
– Simplest clocking methodology
» All storage elements use the SAME clock edge
– Cycle Time ≥ CLK-to-Q + Longest Delay Path + Setup + Clock Skew
– (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
Lec 3.48
To Get More Information
•EECS 141 - Digital Integrated Circuit Design - 105 no longer a prerequisite - only EECS 40 required!
•Book: Digital Integrated Circuits - A design perspective - by Jan Rabaey
•Web page (slides from book) – http://bwrc.eecs.berkeley.edu/icdesign/instructors.html