Introduction and performance analysis
1
Computer Architecture
2
Performance
What do we mean by the performance of a computer? Two important metrics:
• Response Time or Latency – Time taken for completion of a single job. Smaller is better.
• Throughput – Number of jobs done per unit of time. Larger is better.
Does one imply the other?
• Yes. E.g. if latency decreases, throughput will increase.
• No. E.g. in pipelining, latency may have to be increased to increase throughput!
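The pipelining case can be illustrated with a small numerical sketch. The stage count, stage time and latch overhead below are assumed values chosen only for illustration; they do not come from the slides.

```python
# Hypothetical numbers: a job needs 5 steps of 1.0 ns each, and pipelining
# adds 0.2 ns of latch overhead to every stage (all three are assumptions).
STAGES = 5
STAGE_TIME_NS = 1.0
LATCH_OVERHEAD_NS = 0.2

# Unpipelined: one job occupies the whole datapath until it finishes.
latency_unpipelined = STAGES * STAGE_TIME_NS        # ns per job
throughput_unpipelined = 1 / latency_unpipelined    # jobs per ns

# Pipelined: every stage is slightly slower, but once the pipeline is full
# one job completes per stage time.
pipelined_stage_time = STAGE_TIME_NS + LATCH_OVERHEAD_NS
latency_pipelined = STAGES * pipelined_stage_time   # ns per job
throughput_pipelined = 1 / pipelined_stage_time     # jobs per ns (steady state)

print(f"latency:    {latency_unpipelined:.1f} ns -> {latency_pipelined:.1f} ns")
print(f"throughput: {throughput_unpipelined:.2f} -> {throughput_pipelined:.2f} jobs/ns")
```

Under these assumptions the latency of a single job gets worse while the steady-state throughput improves, which is the "No" case above.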
3
CPU Performance Equation
\[ \text{CPU Time} = \text{Clock Cycles Needed} \times \text{Clock Cycle Time} \]
\[ \text{CPU Time} = \frac{\text{No. of Instructions} \times \text{Clocks Per Instruction}}{\text{Clock Rate}} \]
Is this Response Time or Throughput?
4
How can we Improve Performance ?
• No. of Instructions can be reduced by:
  – Better ISA
  – Better Compiler
  – Better Algorithm
• Clocks Per Instruction can be reduced by:
  – Better Hardware Design
  – Make the common case faster
• Clock Rate can be increased by:
  – Hardware Design
5
Numerical Assignment
A computer (3.06 GHz) has the following CPI:
Instruction Type   A   B   C
CPI                1   2   3
An algorithm may be implemented in 2 ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:
Instruction Type   A   B   C
I1                 0   2   2
I2                 2   2   1
1. Which implementation has fewer instructions?
2. What is the average CPI for both implementations? Which implementation is faster?
3. What is the total time taken for executing I1 and I2?
4. What can you say about the MIPS rating?
6
1. No. of instructions, I1 = 4 M
   No. of instructions, I2 = 5 M
   Hence I1 has fewer instructions.
2. Clocks required by I1 = 2*2 + 2*3 = 10 M
   Clocks required by I2 = 2*1 + 2*2 + 1*3 = 9 M
   Average CPI for I1 = 10/4 = 2.5
   Average CPI for I2 = 9/5 = 1.8
   I2 is faster as it requires fewer clock cycles, even though I1 uses fewer instructions.
3. Total time for I1 = 10 M / 3.06 GHz = 3.27 ms
   Total time for I2 = 9 M / 3.06 GHz = 2.94 ms
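The arithmetic in steps 1 to 3 can be checked with a short script. This is only a sketch: the function name `analyse` and the dictionary layout are illustrative choices, while the clock rate, CPI values and instruction counts are taken from the assignment.

```python
CLOCK_RATE_HZ = 3.06e9
CPI = {"A": 1, "B": 2, "C": 3}
# Instruction counts per type, from the assignment table (given there in millions).
COUNTS = {
    "I1": {"A": 0, "B": 2_000_000, "C": 2_000_000},
    "I2": {"A": 2_000_000, "B": 2_000_000, "C": 1_000_000},
}

def analyse(counts, cpi, clock_rate_hz):
    """Apply CPU time = instructions * CPI / clock rate to one implementation."""
    instructions = sum(counts.values())
    cycles = sum(n * cpi[kind] for kind, n in counts.items())
    return {
        "instructions": instructions,
        "cycles": cycles,
        "avg_cpi": cycles / instructions,
        "time_s": cycles / clock_rate_hz,
    }

results = {name: analyse(c, CPI, CLOCK_RATE_HZ) for name, c in COUNTS.items()}
for name, r in results.items():
    print(f"{name}: {r['instructions'] / 1e6:.0f} M instructions, "
          f"{r['cycles'] / 1e6:.0f} M cycles, CPI {r['avg_cpi']:.2f}, "
          f"{r['time_s'] * 1e3:.2f} ms")
```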
7
4. MIPS rating = Million Instructions Per Second. This can be calculated from:
• CPI and clock rate of the machine
  MIPS = clock rate / (CPI × 10^6)
• Total execution time and instruction count
  MIPS = instruction count / (total execution time × 10^6)
MIPS rating for I1 = 1224 MIPS
MIPS rating for I2 = 1700 MIPS
MIPS rating for I2 > MIPS rating for I1. This is as expected, since I2 has the shorter execution time.
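As a cross-check, both ways of computing the MIPS rating should agree. The sketch below uses the figures from the worked solution (4 M and 5 M instructions, 10 M and 9 M cycles, 3.06 GHz); the function names are illustrative only.

```python
CLOCK_RATE_HZ = 3.06e9

def mips_from_cpi(clock_rate_hz, avg_cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (avg_cpi * 1e6)

def mips_from_time(instruction_count, exec_time_s):
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (exec_time_s * 1e6)

# (instruction count, clock cycles) for each implementation, from the solution.
workloads = {"I1": (4e6, 10e6), "I2": (5e6, 9e6)}

ratings = {}
for name, (instructions, cycles) in workloads.items():
    avg_cpi = cycles / instructions
    exec_time_s = cycles / CLOCK_RATE_HZ
    ratings[name] = (
        mips_from_cpi(CLOCK_RATE_HZ, avg_cpi),
        mips_from_time(instructions, exec_time_s),
    )
    print(f"{name}: {ratings[name][0]:.0f} MIPS (from CPI), "
          f"{ratings[name][1]:.0f} MIPS (from time)")
```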
8
Probable Conclusions
1. Total Number of instructions is definitely not a good metric.
2. MIPS is a good metric.
9
Numerical Assignment
A computer (3.06 GHz) has the following CPI:
Instruction Type   A   B   C
CPI                5   2   3
An algorithm may be implemented in 2 ways, I1 and I2. For each implementation, the number of instructions used (in millions) is as follows:
Instruction Type   A   B   C
I1                 0   2   2
I2                 1   2   0
1. Which implementation has fewer instructions?
2. What is the average CPI for both implementations? Which implementation is faster?
3. What is the total time taken for executing I1 and I2?
4. What can you say about the MIPS rating?
10
1. No. of instructions, I1 = 4 M
   No. of instructions, I2 = 3 M
   Hence I1 uses more instructions than I2.
2. Clocks required by I1 = 2*2 + 2*3 = 10 M
   Clocks required by I2 = 1*5 + 2*2 = 9 M
   Average CPI for I1 = 10/4 = 2.5
   Average CPI for I2 = 9/3 = 3
   I2 is faster as it requires fewer clock cycles.
3. Total time for I1 = 10 M / 3.06 GHz = 3.27 ms
   Total time for I2 = 9 M / 3.06 GHz = 2.94 ms
11
4. MIPS rating for I1 = 1224 MIPS
   MIPS rating for I2 = 1020 MIPS
   MIPS rating for I1 > MIPS rating for I2. This is unexpected, since I2 has the shorter execution time.
Conclusion
MIPS is also not a good metric for overall system performance.
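The disagreement between the two rankings can be reproduced directly. A minimal sketch using the numbers of this second assignment (CPI of 5, 2, 3 at 3.06 GHz); the helper name `time_and_mips` is illustrative.

```python
CLOCK_RATE_HZ = 3.06e9
CPI = {"A": 5, "B": 2, "C": 3}
# Instruction counts in millions, from the second assignment table.
COUNTS_MILLIONS = {
    "I1": {"A": 0, "B": 2, "C": 2},
    "I2": {"A": 1, "B": 2, "C": 0},
}

def time_and_mips(counts_millions):
    """Return (execution time in seconds, MIPS rating) for one implementation."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(n * CPI[kind] for kind, n in counts_millions.items()) * 1e6
    time_s = cycles / CLOCK_RATE_HZ
    return time_s, instructions / (time_s * 1e6)

stats = {name: time_and_mips(c) for name, c in COUNTS_MILLIONS.items()}

fastest = min(stats, key=lambda name: stats[name][0])       # shortest time wins
highest_mips = max(stats, key=lambda name: stats[name][1])  # largest MIPS wins

for name, (time_s, mips) in stats.items():
    print(f"{name}: {time_s * 1e3:.2f} ms, {mips:.0f} MIPS")
print(f"fastest by execution time: {fastest}; highest MIPS: {highest_mips}")
```

The two selections differ, which is why the MIPS rating alone cannot be trusted as a ranking.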
12
Conclusion
Total execution time is always the better metric, as it sums up all factors. It cannot be replaced by considering
1. MIPS
2. Total number of instructions
3. Clock Rate
alone.
13
Measuring Performance
Now that we know that performance depends on the program, which program(s) should be used to measure performance?
Benchmarks.
14
Benchmarks
• A set of programs specifically chosen for measuring performance.
• Types of Benchmarks
  – Real Programs
  – Kernel
    • Extracts the key feature from a program
  – Component
  – Synthetic
    • Dhrystone – Integer and String Arithmetic
    • Whetstone – Floating Point
  – I/O
  – Parallel
15
Challenges
1. Vendors may tinker with benchmarks to make them run better on their platform. At times this is permitted.
2. Give data set rather than a single performance number.
3. Concentrate only on computational power.
16
Popular Benchmarks
• SPEC – Standard Performance Evaluation Corporation
  – Floating point
  – Integer
  – Web
  – Graphics
• TPC – Transaction Processing Performance Council
  – Web Server
  – Transaction Processing
  – Decision Support Systems
• BAPCo – Business Applications Performance Corporation
  – Popular business applications
• EEMBC – Embedded Microprocessor Benchmark Consortium
  – Embedded Applications
17
Statistical Summarization of Data
For Response time metric
Arithmetic Mean
For Throughput metric
Harmonic Mean or Geometric Mean.
SPEC uses Geometric Mean
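The three means can be compared with Python's standard `statistics` module. The benchmark results below are made-up values used only to show the calculation. The last lines illustrate one common argument for the harmonic mean of rates: when every benchmark does the same amount of work, it equals total work divided by total time.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical measurements for three benchmarks (assumed values).
response_times_s = [2.0, 4.0, 8.0]     # seconds per run
throughputs_jps = [10.0, 20.0, 40.0]   # jobs per second

avg_response_time = mean(response_times_s)        # arithmetic mean of times
hm_throughput = harmonic_mean(throughputs_jps)    # harmonic mean of rates
gm_throughput = geometric_mean(throughputs_jps)   # geometric mean of rates

# Cross-check: run one job on each benchmark and divide total jobs by total time.
total_time_s = sum(1 / rate for rate in throughputs_jps)
overall_rate = len(throughputs_jps) / total_time_s

print(f"arithmetic mean response time: {avg_response_time:.3f} s")
print(f"harmonic mean throughput:      {hm_throughput:.3f} jobs/s")
print(f"geometric mean throughput:     {gm_throughput:.3f} jobs/s")
print(f"total jobs / total time:       {overall_rate:.3f} jobs/s")
```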
18
Are Benchmarks enough?
Benchmarks give the overall performance. To optimize performance, it may be necessary to know the instruction or section of the program where the most time is being spent.
Profilers do this job.
19
Profiling or Dynamic Program Analysis
• Program behavior is analysed as it is being run.
• Techniques used
  – Instruction Set Simulation
  – Hardware Interrupts
  – OS Hooks
  – Code instrumentation
• Examples: Intel VTune, gprof
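To give a feel for what a profiler reports, here is a minimal sketch using Python's built-in `cProfile` module. It is a stand-in for illustration, not VTune or gprof, and the two functions being profiled are invented for this example.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Sum of squares 0^2 + ... + (n-1)^2 with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def fast_sum(n):
    """Same sum using the closed form (n-1) * n * (2n-1) / 6."""
    return (n - 1) * n * (2 * n - 1) // 6

def workload():
    return slow_sum(200_000), fast_sum(200_000)

# Record per-function call counts and times while the workload runs.
profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Format the five most expensive entries, ordered by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists each function with its call count and time, so the loop in `slow_sum` should stand out as the place where the time is being spent.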
20
Simulation
• Building the actual system is difficult; simulation is cost effective.
• Beneficial for learning/improving some aspect of architecture.
• Simulators available are:
  – Keil – Instruction Simulator
  – Little Man Computer Simulator – Simulator of a machine
  – Cacheprof – Cache Simulator
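As an idea of what a very small simulator looks like, the sketch below models a direct-mapped cache and counts hits and misses for an address trace. It is a toy written for illustration and is not related to Cacheprof or any of the tools listed above; the cache geometry and the two traces are assumptions.

```python
def simulate_direct_mapped(addresses, num_lines=4, block_size=16):
    """Return (hits, misses) for a direct-mapped cache of num_lines blocks."""
    tags = [None] * num_lines           # tag stored in each cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size      # which memory block the byte lies in
        index = block % num_lines       # the only line that block may occupy
        tag = block // num_lines        # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag           # fetch the block, evicting the old one
    return hits, misses

# Sequential 4-byte accesses: one miss per 16-byte block, then three hits.
sequential = list(range(0, 256, 4))
# Two addresses that map to the same line keep evicting each other.
conflicting = [0, 64] * 8

print("sequential :", simulate_direct_mapped(sequential))
print("conflicting:", simulate_direct_mapped(conflicting))
```

Changing `num_lines` or `block_size` and re-running the traces is the kind of cheap experiment that would be costly to perform on real hardware.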
21
Moore’s Law (1965)
Moore's Law states that the number of transistors on a single chip at the same price will double every 18 to 24 months.
22
Implication?
As more transistors are packed into a chip of the same area, each transistor becomes smaller and switches faster, hence circuits become faster and the clock rate increases.
Moore’s Law, in combination with other factors such as ILP (Instruction Level Parallelism), was responsible for major performance improvements for a long time.
23
Trends in Computing (Intel Processors)
                                Fastest processor reported in text, 2003   Current fastest processor, 2008
Intel® processor name           Pentium 4                                  Intel® Core™ i7-965 Processor Extreme Edition
Processor speed                 3.20 GHz                                   3.20 GHz
Processor primary level cache   12 KB + 8 KB                               4x32 KB
Processor secondary cache       512 KB                                     4x256 KB Level 2 cache
Processor third level cache     2 MB                                       Unified inclusive 8 MB L3
24
Observations
Processor Speed or Clock Rate has not changed!!!
25
Observations
What does the "4x" in the cache sizes mean?
26
The Answer
Multi-Core Approach – the additional transistors are being used to pack more cores into a chip, rather than to increase clock speed.
Why?
1. Power Wall
2. Memory Wall
3. No more ILP.
27
Topics for further Study
• Papers
  – Performance papers
  – Memory Wall
• Software
  – Intel VTune or any other profiling tool
  – Little Man Computer Simulator or any other simulator apart from Keil
28
Amdahl’s Law
\[ \text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected by improvement} \]
29
What does this mean?
Even if we substantially increase the performance of any one component, it may not result in a substantial improvement in overall performance.
Example: A new architecture halves the time taken by memory instructions (a 50% reduction). If memory instructions account for 50% of the total time taken, what is the overall improvement in performance?
Told = 100, Tnew = 50/2 + 50 = 25 + 50 = 75. Execution time falls by 25%, an overall speedup of 100/75 ≈ 1.33.
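The same calculation can be written as a small function. This is a sketch: the function name is illustrative, and the improvement factor of 2 is chosen to match the slide's arithmetic (25 + 50 = 75), i.e. the affected part takes half as long.

```python
def time_after_improvement(t_affected, t_unaffected, improvement_factor):
    """Amdahl's Law: affected time / amount of improvement + unaffected time."""
    return t_affected / improvement_factor + t_unaffected

t_old = 100.0
# Memory instructions account for 50 of the 100 time units and become 2x faster.
t_new = time_after_improvement(50.0, 50.0, 2.0)
speedup = t_old / t_new
reduction_pct = (t_old - t_new) / t_old * 100

# Even an infinitely fast memory system cannot remove the unaffected half.
best_case = time_after_improvement(50.0, 50.0, float("inf"))

print(f"new time: {t_new:.0f}, speedup: {speedup:.2f}x, reduction: {reduction_pct:.0f}%")
print(f"best possible time: {best_case:.0f}")
```

The best-case line shows the limit: however large the improvement to memory instructions, the total time here can never drop below 50.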
30
Which is better?
a. A 20% increase in the performance of instructions that account for 90% of execution time.
b. A 90% increase in the performance of instructions that account for 20% of execution time.