ece 4100/6100 advanced computer architecture lecture 3 performance
Post on 11-Feb-2016
ECE 4100/6100 Advanced Computer Architecture
Lecture 3 Performance
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
2
Performance
• Execution/Response time (Latency)
– Elapsed time between start and completion of an event
– How long does my job take?
• Throughput (Bandwidth)
– Total amount of work done within a given period of time
– How many jobs are done per unit time on a system?
3
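CPU Performance
• Execution Time = Seconds / Program
$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
• Instructions / Program is influenced by
– Programmer
– Algorithms
– ISA
– Compilers
• Cycles / Instruction is influenced by
– Microarchitecture
– System architecture
• Seconds / Cycle is influenced by
– Microarchitecture, pipeline depth
– Circuit design
– Technology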
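The execution-time breakdown on this slide can be checked numerically with a minimal Python sketch. The function name `execution_time` and the sample workload (1 billion instructions, CPI of 1.5, 2 GHz clock) are illustrative assumptions, not values from the lecture.

```python
def execution_time(instructions, cpi, clock_hz):
    """Seconds per program = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle

# Hypothetical workload: 1e9 instructions, CPI = 1.5, 2 GHz clock.
t = execution_time(1e9, 1.5, 2e9)
print(f"{t:.3f} s")
```

Halving any one of the three factors halves the execution time, which is why each group of design levers listed on the slide matters.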
4
Pipeline Stage
[Figure: a pipeline stage drawn as combinational logic between two flip-flops (F/F); labels: P4 pipe stage ~16 FO4, 1 FO4]
• Optimal FO4 per pipe
– 6 to 8 [UT/Compaq, ISCA-29]
– 18 (15+3 latch) [IBM, MICRO-35]
Slide from Lecture 1 Pipelining
5
Architecture Comparison
• Much architecture research simply makes the following assumptions
• Instructions / program is fixed
– Same binary
– Same compiler
– Same benchmark
• Seconds per cycle is constant
– Same frequency
– Same pipeline depth
– Typically a bad assumption today
• Focus on IPC or CPI
• It is more complicated for today’s architects!
6
Example: Calculating CPI
Typical mix of instruction types in program

Base Machine (Reg / Reg)
Op      Freq  Cycles  CPI(i)  (% Time)
ALU     50%   1       0.5     (33%)
Load    20%   2       0.4     (27%)
Store   10%   2       0.2     (13%)
Branch  20%   2       0.4     (27%)
Total                 1.5

Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
Run benchmark and collect workload characterization (simulate, machine counters, or sampling)
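The CPI table for the base machine can be reproduced with a short Python sketch. The dictionary layout and variable names are illustrative choices; the frequencies and cycle counts are the ones given on the slide.

```python
# Instruction mix of the base machine: op -> (frequency, cycles).
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

# CPI(i) is each class's contribution to the overall CPI: frequency x cycles.
cpi_i = {op: freq * cycles for op, (freq, cycles) in mix.items()}
cpi = sum(cpi_i.values())

# Share of execution time spent in each instruction class.
time_share = {op: c / cpi for op, c in cpi_i.items()}

print(f"CPI = {cpi:.2f}")
for op in mix:
    print(f"{op:<6} CPI(i) = {cpi_i[op]:.2f}  time = {time_share[op]:.0%}")
```

The time-share column is what points at the common case: ALU ops are half of the instructions but only about a third of the time.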
7
Performance Comparison
• For some program running on machine X,
Performance_X = 1 / Execution time_X
• "X is n times faster than Y"
Performance_X / Performance_Y = n, where n = speedup of X over Y
• Problem:
– machine A runs a program in 20 seconds
– machine B runs the same program in 25 seconds
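A minimal Python sketch of the problem on this slide, using the definition of performance as the reciprocal of execution time. The function name `speedup` is an illustrative choice.

```python
def speedup(time_x, time_y):
    """How many times faster X is than Y: Perf_X / Perf_Y = Time_Y / Time_X."""
    return time_y / time_x

n = speedup(time_x=20.0, time_y=25.0)  # machine A vs. machine B
print(f"A is {n:.2f} times faster than B")
```

Because performance is the reciprocal of time, the ratio flips: the slower machine's time goes in the numerator.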
8
Performance Evaluation: Benchmark
• (Real) Programs
– In the form of collection of programs
– E.g., SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC
• Kernels:
– Small key pieces of real programs
– E.g., Livermore Fortran Loops Kernels (LFK), Linpack
• Modified (or scripted)
– To focus on some particular aspects (e.g. remove I/O, focus on CPU)
• (Toy) Benchmarks
– Produce expected results
• Synthetic Benchmarks:
– Representative instruction mix
– E.g., Dhrystone, Whetstone
• Important for
– Architectural and microarchitectural design trade-off
– Competitive analysis of real products
9
Performance Summary Measurement
• Average of total execution time
• This is the Arithmetic Mean (or Weighted Arithmetic Mean)
$\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i$ or $\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$
10
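Performance Summary Measurement
• Rate_i is a function of 1/Time_i
• Used to represent the average “rate” such as instruction per cycle (IPC)
• This is the Harmonic Mean (or Weighted Harmonic Mean)
$\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}}$ or $\frac{1}{\sum_{i=1}^{n} \frac{\text{Weight}_i}{\text{Rate}_i}}$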
11
Why Harmonic Mean?
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30+90)/2 = 60 mph??
• Wrong!
• Average speed = total distance / total time
• (10+10)/(10/30 + 10/90) = 45 mph
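The driving example can be checked with a small Python sketch of the arithmetic and harmonic means, in plain and weighted form. The function names are illustrative, and the weighted variants assume the weights sum to 1.

```python
def arithmetic_mean(values, weights=None):
    """Plain or weighted arithmetic mean (weights assumed to sum to 1)."""
    if weights is None:
        return sum(values) / len(values)
    return sum(w * v for w, v in zip(weights, values))

def harmonic_mean(rates, weights=None):
    """Plain or weighted harmonic mean (weights assumed to sum to 1)."""
    if weights is None:
        return len(rates) / sum(1.0 / r for r in rates)
    return 1.0 / sum(w / r for w, r in zip(weights, rates))

speeds = [30.0, 90.0]             # mph, each held for 10 miles
naive = arithmetic_mean(speeds)   # averages the rates directly
correct = harmonic_mean(speeds)   # total distance / total time
print(f"arithmetic: {naive:.0f} mph, harmonic: {correct:.0f} mph")
```

The plain harmonic mean applies here because both legs cover the same distance. If the legs had different lengths, the weighted form with each leg's share of the total distance as its weight would give total distance over total time.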
12
New Breed of Metrics
• Performance / Watt
– Performance achievable at the same cooling capacity
• Performance / Joule (Energy)
– Achievable performance over the lifetime of the same energy source (i.e., battery = energy)
– Equivalent to reciprocal of energy-delay product (ED product)
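The equivalence between performance per joule and the reciprocal of the ED product follows in one step if performance is taken to be the reciprocal of delay, as in the definition used for performance comparison. The symbols $E$ (energy consumed by the task) and $D$ (its execution time, or delay) are notation assumed here for illustration.

```latex
\frac{\text{Performance}}{\text{Energy}}
  = \frac{1/D}{E}
  = \frac{1}{E \times D}
```

Maximizing performance per joule is therefore the same as minimizing the energy-delay product.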
13
Amdahl’s Law (Law of Diminishing Returns)
• Make the common case faster
• Speedup = Perf_new / Perf_old = T_old / T_new = $\frac{1}{(1 - f) + \frac{f}{P}}$
• Performance improvement from using faster mode is limited by the fraction the faster mode can be applied.
[Figure: T_old divided into (1 - f) and f; T_new divided into (1 - f) and f / P]
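A minimal Python sketch of Amdahl's Law, where `f` is the fraction of the old execution time that can use the faster mode and `p` is how many times faster that mode is. The function name `amdahl_speedup` and the sample values of `f` and `p` are illustrative choices.

```python
def amdahl_speedup(f, p):
    """Speedup when a fraction f of the old time runs p times faster."""
    return 1.0 / ((1.0 - f) + f / p)

for f in (0.1, 0.5, 0.9):
    row = ", ".join(f"P={p}: {amdahl_speedup(f, p):.2f}x" for p in (2, 8, 64))
    print(f"f={f}: {row}")
```

Even with an arbitrarily large `p`, the speedup is bounded by 1 / (1 - f): with half of the time in the faster mode, the result can never reach 2x.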
14
Amdahl’s Law Analogy
• Driving from Orlando to Atlanta
– 60 miles/hr from Orlando to Macon
– 120 miles/hr from Macon to Atlanta
– How much time can you save compared against driving all the way at 60 miles/hr from Orlando to Atlanta?
• 6hr 45min vs. 7hr 30min = ~11% speedup
• Key is to speed up the biggie portion, i.e. speed up frequently executed blocks
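The trip times can be reproduced with a short Python sketch. The slide does not state distances; the 450-mile total and 90-mile Macon-to-Atlanta leg below are assumptions inferred from the quoted times (7.5 hours at 60 miles/hr, and 45 minutes saved by doubling the speed on the last leg).

```python
TOTAL_MILES = 450.0     # assumption: 7hr 30min at 60 miles/hr
FAST_LEG_MILES = 90.0   # assumption: Macon to Atlanta leg

t_old = TOTAL_MILES / 60.0
t_new = (TOTAL_MILES - FAST_LEG_MILES) / 60.0 + FAST_LEG_MILES / 120.0
speedup = t_old / t_new

# Same result from Amdahl's Law: f is the fraction of the old time sped up.
f = (FAST_LEG_MILES / 60.0) / t_old
amdahl = 1.0 / ((1.0 - f) + f / 2.0)

print(f"old = {t_old:.2f} h, new = {t_new:.2f} h, speedup = {speedup:.3f}")
print(f"f = {f:.2f}, Amdahl speedup = {amdahl:.3f}")
```

Only 20% of the original trip time is affected, so doubling the speed on that leg yields roughly an 11% speedup overall.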
15
Parallelism vs. Speedup
[Figure: Amdahl's Law speed-up as a function of parallelism. x-axis: code portion in faster mode (f), 0 to 1; y-axis: speed-up, log scale from 1 to 100; one curve each for P = 1, 2, 4, 8, 16, 32, 64; annotated points: 1.11x, 1.33x, 1.97x]
16
Gustafson’s Law
• Amdahl’s Law killed massive parallel processing (MPP)
• Gustafson came to rescue
[Figure: time bars. T_new is divided into Seq and Parallel; T_old is divided into Seq and P * Parallel]
Assume: Seq + Parallel = 1 (T_new)
Speedup = Seq + p * (1 – Seq), where p = parallel factor
If Seq diminishes with increased problem size, Speedup → p
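A minimal Python sketch of Gustafson's formula, showing how the speedup approaches the parallel factor as the sequential share shrinks. The function name `gustafson_speedup` and the sample values of `seq` and `p` are illustrative choices.

```python
def gustafson_speedup(seq, p):
    """Scaled speedup: seq is the sequential share of the parallel run time."""
    return seq + p * (1.0 - seq)

p = 64
for seq in (0.5, 0.1, 0.01):
    print(f"Seq = {seq}: speedup = {gustafson_speedup(seq, p):.2f} (p = {p})")
```

Note that `seq` here is measured against the run time on the parallel machine (T_new), whereas the fraction in Amdahl's Law is measured against the original run time (T_old), so the two formulas are not directly interchangeable for the same number.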
17
Amdahl versus Gustafson
Who is right?
18
The Principle of Locality
• Knuth made the original observation about program locality in 1971.
– … less than 4 percent of a program generally accounts for more than half of its running time.
• 90/10 rule: a program spends 90% of its execution time in only 10% of the code
• Two types of locality
– Temporal locality (locality in time)
– Spatial locality (locality in space)
• Memory subsystem design heavily leverages the locality concept for better performance
19
Example of Performance Evaluation (I)

Operation          Frequency  Clock cycle count
ALU Ops (reg-reg)  43%        1
Loads              21%        2
Stores             12%        2
Branches           24%        2

Assume 25% of the ALU ops directly use a loaded operand that is not used again.
We propose adding ALU instructions that have one src operand in memory. These new reg-mem instructions spend 2 clock cycles. Also assume that the extended instruction set increases the branch’s clock cycle count by 1, with no impact on cycle time.
Would this change improve performance?
20
Example of Performance Evaluation (I)
$\text{Cycles}_{old} = 0.43 \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2 = 1.57$
$\text{Cycles}_{new} = 0.25 \times 0.43 \times 2 + (0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3 = 1.703$
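The two cycle counts can be recomputed with a short Python sketch. Variable names are illustrative; the frequencies and cycle counts come from the problem statement. Both totals are expressed per instruction of the original program, so they can be compared directly even though the new program executes fewer instructions.

```python
ALU, LOAD, STORE, BRANCH = 0.43, 0.21, 0.12, 0.24

# Original machine: cycles per original instruction.
old_cycles = ALU * 1 + LOAD * 2 + STORE * 2 + BRANCH * 2

# 25% of the ALU ops absorb their load and become 2-cycle reg-mem ops.
fused = 0.25 * ALU
new_cycles = (
    fused * 2             # new reg-mem ALU ops
    + (ALU - fused) * 1   # remaining reg-reg ALU ops
    + (LOAD - fused) * 2  # loads that are still needed
    + STORE * 2
    + BRANCH * 3          # branches now take one more cycle
)

new_count = 1.0 - fused   # instructions left, relative to the original count
print(f"old cycles = {old_cycles:.3f}, new cycles = {new_cycles:.3f}")
print(f"new CPI = {new_cycles / new_count:.3f} over {new_count:.4f} instructions")
```

Under these numbers the total cycle count rises from 1.57 to about 1.70 while the cycle time is unchanged, so the proposed change would not improve performance: the extra branch cycle costs more than the removed loads save.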
21
Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2
• Design Option 2: decrease the average CPI of all FP instructions to 2.5
22
Example of Performance Evaluation (II)
Original CPI = 0.25*4 + 1.33*(1-0.25) = 2.0
Option 1 CPI = 2.0 - 2%*(20-2) = 1.64
Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) = 1.625
Speedup of Option 1 = 2/1.64 = 1.2195
Speedup of Option 2 = 2/1.625 = 1.2308
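The comparison of the two design options can be recomputed with a short Python sketch. Variable names are illustrative. The sketch uses 1.33 exactly as given, so its results differ from the slide's in the third decimal place: the slide rounds the original CPI to 2.0 before computing the options.

```python
FP_FRAC, FP_CPI, OTHER_CPI = 0.25, 4.0, 1.33
SQRT_FRAC, SQRT_CPI = 0.02, 20.0

original = FP_FRAC * FP_CPI + (1 - FP_FRAC) * OTHER_CPI

# Option 1: FPSQRT CPI drops from 20 to 2; only 2% of instructions benefit.
option1 = original - SQRT_FRAC * (SQRT_CPI - 2.0)

# Option 2: the average CPI of all FP instructions drops from 4.0 to 2.5.
option2 = FP_FRAC * 2.5 + (1 - FP_FRAC) * OTHER_CPI

speedup1 = original / option1
speedup2 = original / option2
print(f"original CPI = {original:.4f}")
print(f"option 1: CPI = {option1:.4f}, speedup = {speedup1:.4f}")
print(f"option 2: CPI = {option2:.4f}, speedup = {speedup2:.4f}")
```

Either way the ordering is the same: Option 2 comes out slightly ahead when the clock frequency is held constant.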
23
Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
• Design Option 2: decrease the average CPI of all FP instructions to 2.5, clock freq = 1.1 GHz
24
Example of Performance Evaluation (III)
Original CPI = 2.0, IPC = 1/2, Inst/Sec = 1/2*1.4G = 0.7G inst/s
Option 1 CPI = 1.64, IPC = 1/1.64, Inst/Sec = 1/1.64*1.2G = 0.73G inst/s
Option 2 CPI = 1.625, IPC = 1/1.625, Inst/Sec = 1/1.625*1.1G = 0.68G inst/s
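Once clock frequency differs between designs, CPI alone no longer decides the comparison. The following Python sketch recomputes instructions per second from the CPI and frequency values on the slide; the dictionary layout and names are illustrative.

```python
# design -> (CPI, clock frequency in Hz), values as given on the slide
designs = {
    "Original": (2.0, 1.4e9),
    "Option 1": (1.64, 1.2e9),
    "Option 2": (1.625, 1.1e9),
}

# Instructions per second = IPC x frequency = frequency / CPI.
throughput = {name: freq / cpi for name, (cpi, freq) in designs.items()}
best = max(throughput, key=throughput.get)

for name, ips in throughput.items():
    print(f"{name}: {ips / 1e9:.2f}G inst/s")
print(f"best: {best}")
```

Option 2 has the lowest CPI but ends up slower than the original design because of its lower clock, while Option 1 still comes out ahead.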