Post on 20-Jan-2016
Computer Architecture
“The architecture of a computer is the interface between the machine and the software”
- Andris Padegs, IBM 360/370 Architect
Course Outline
Computer Architecture
Quarter: Autumn 2006-7
Instructor: Muhammad Jahangir Ikram
Office: Room 424
e-mail: [email protected]
Office Hours: Monday and Wednesday, 3:00 – 4:30pm
Course Outline (Contd..)
Description
This course focuses on the principles, practices and issues in Computer Architecture, while examining computer design tradeoffs both qualitatively and quantitatively.
The course starts with a quick overview of computer design fundamentals and instruction set principles, material that students have already covered in the prerequisite for this course.
The following topics are covered in greater detail:
Advanced Pipelining
Instruction-level Parallelism and Compiler Support
Memory-hierarchy Design
SIMD, VLIW, Superscalar Architectures
Code Optimization and Compiler Issues
Course Outline (Contd..)
Text Book
Hennessy, J. L., and Patterson, D. A., Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann, 1996.
Course Outline (Contd..)
Lectures
There will be two 75-minute lectures per week, and a 50-minute lecture / 100-minute lab.
TOTAL SESSIONS = 29
There will be four labs, during weeks 2, 3, 4 and 5.
Course Outline (Contd..)
Grading
Quizzes & assignments: 17+3%
Laboratory: 10% (Attendance 3 + Lab Task 3 + HW 4)
Midterm exam: 30%
Final exam: 40%
Schedule
Topic | Sessions | Sections
Fundamentals of Computer Design | 1, 2 | 1.1 – 1.10
  Measuring and Reporting Performance
  Quantitative Principles of Computer Design
Instruction Set Principles and Examples | 3-5 | 2.1 – 2.8
  Classifying Instruction Set Architectures
  Memory Addressing
  Operations in the Instruction Set
  Encoding an Instruction Set
LAB 1: MIPS Instruction Format and Instruction Study | 6 |
Pipelining Overview | 7-14 | A.1 – A.10
  What Is Pipelining?
  Single Cycle Computer Study | 9 |
  The Major Hurdle of Pipelining – Pipeline Hazards
  Data Hazards
LAB 2: Study of Pipelining | 12 |
Schedule
Control Hazards and Static Branch Prediction
LAB 3: Pipeline Studies and Control Hazards | 15 |
Scoreboarding
MIDTERM
ILP and Dynamic Exploitation | 17-19 | 3.1 – 3.5
  Static Branch Prediction
  Tomasulo’s Dynamic Scheduling
  Dynamic Branch Prediction
  Superscalar and VLIW architectures
Advanced Pipelining and ILP (Cont’d.) | 20-22 | 3.6 – 3.10
  Taking Advantage of More ILP with Multiple Issue
  P6 Architecture
Advanced Pipelining and ILP (Cont’d.) | 23-25 | 4.1, 4.7
  Compiler Support for Exploiting ILP
  Hardware Support for Extracting More Parallelism
  Putting It All Together: The PowerPC 620, and Itanium
Schedule
Memory-Hierarchy Design | 26-29 | 5.1 – 5.7
  The ABCs of Caches
  Reducing Cache Misses
  Reducing Cache Miss Penalty
  Virtual Memory System
Computer I/O | 30 | 6.1 - ?
Background
Emergence of the microprocessor in the late 1970s
Roughly 35% performance growth per year
Important changes in the marketplace:
Virtual elimination of assembly language programming reduced the need for object code compatibility
Creation of standardized, vendor-independent operating systems, such as UNIX and Linux, lowered the risk of bringing out a new architecture
Development of RISC
These changes led to the development of a new set of architectures, called RISC (Reduced Instruction Set Computer) architectures
RISC uses two performance techniques:
Instruction-level parallelism (pipelining)
Use of caches
Growth in microprocessor performance
Moore’s Law
Technology Scaling
Scaling of transistors: feature size has shrunk from 3 microns in 1985 to 0.09 micron.
Reducing feature size means a quadratic increase in transistor count and better performance.
But it also means higher routing delays and poor performance of long wires.
It also means more power consumption (despite less load capacitance per transistor).
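As a rough check of the quadratic claim, the sketch below assumes transistor density scales with the inverse square of feature size. This is a first-order assumption that ignores die-size growth, wiring and layout overhead; the function name `density_gain` is made up for illustration.

```python
def density_gain(old_feature_um: float, new_feature_um: float) -> float:
    """First-order gain in transistors per unit area after a shrink.

    Assumes the area of one transistor scales with feature_size**2,
    so density scales with 1 / feature_size**2.
    """
    if old_feature_um <= 0 or new_feature_um <= 0:
        raise ValueError("feature sizes must be positive")
    return (old_feature_um / new_feature_um) ** 2


linear_shrink = 3.0 / 0.09          # about 33x smaller in each dimension
gain = density_gain(3.0, 0.09)      # about 1111x more transistors per area
print(f"linear shrink {linear_shrink:.1f}x, density gain {gain:.0f}x")
```

A 33x linear shrink therefore gives on the order of a thousand times more transistors in the same area, before any increase in die size is counted.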
The Itanium Processor
Intel microprocessor die
IC Cost Trends (Source: IC Knowledge)
Measuring performance
Definition of time:
Response time, elapsed time: the latency to complete the task, including disk access, input/output, operating system overhead, etc.
CPU time:
User CPU time: time spent in the program
System CPU time: time spent in the operating system on behalf of the program
Unix time command: 90.7 s user, 12.9 s system, 2:39 (159 s) elapsed
CPU time as a share of elapsed time: (90.7 + 12.9)/159 = 65%
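The arithmetic behind the 65% figure can be sketched in a few lines. The helper names `parse_elapsed` and `cpu_fraction` are illustrative and are not part of any Unix tool.

```python
def parse_elapsed(text: str) -> float:
    """Convert an elapsed time written as 'M:SS' or 'H:MM:SS' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_fraction(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed (wall-clock) time spent as user + system CPU time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return (user_s + system_s) / elapsed_s


elapsed = parse_elapsed("2:39")                  # 159 seconds
share = cpu_fraction(90.7, 12.9, elapsed)
print(f"CPU share of elapsed time: {share:.0%}")
```

The remaining 35% of elapsed time is spent waiting, for example on I/O or on other programs sharing the machine.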
What is a Benchmark?
A benchmark is "a standard of measurement or evaluation" (Webster’s II Dictionary).
A computer benchmark is typically a computer program that performs a strictly defined set of operations - a workload - and returns some form of result - a metric - describing how the tested computer performed.
Computer benchmark metrics usually measure speed: how fast was the workload completed; or throughput: how many workload units per unit time were completed.
Running the same computer benchmark on multiple computers allows a comparison to be made.
Source: Standard Performance Evaluation Corporation
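A minimal sketch of the workload/metric idea described above, using a made-up toy workload. `run_benchmark` and `toy_workload` are hypothetical names, and real benchmark suites control far more (compiler flags, run rules, input sets).

```python
import time
from typing import Callable, Dict


def run_benchmark(workload: Callable[[], int], repeats: int = 3) -> Dict[str, float]:
    """Time a workload and report a speed metric and a throughput metric.

    The workload must return the number of work units it completed.
    The best (smallest) elapsed time over the repeats is reported.
    """
    best_elapsed = float("inf")
    units = 0
    for _ in range(repeats):
        start = time.perf_counter()
        units = workload()
        best_elapsed = min(best_elapsed, time.perf_counter() - start)
    throughput = units / best_elapsed if best_elapsed > 0 else float("inf")
    return {"units": units, "elapsed_s": best_elapsed, "units_per_s": throughput}


def toy_workload() -> int:
    """A strictly defined set of operations: sum the first n squares."""
    n = 200_000
    total = 0
    for i in range(n):
        total += i * i
    return n


result = run_benchmark(toy_workload)
print(f"{result['elapsed_s']:.4f} s elapsed, {result['units_per_s']:.0f} units/s")
```

Running the same script on two machines gives two elapsed times (speed) and two rates (throughput) that can be compared directly.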
Programs to Evaluate Performance
Real Applications
Modified (or scripted) applications
Kernels
Toy benchmarks
Synthetic benchmarks
Programs to evaluate performance
Real Applications
Example: compilers for C, text-processing software, etc.
Modified (or scripted) applications
CPU-oriented benchmark; I/O may be removed to minimize its impact on execution time
Programs to evaluate performance
Kernels
To isolate the performance of individual features of a machine.
Toy benchmarks
Produce a result that the user already knows
Synthetic benchmarks
Try to match the average frequency of operations and operands of a large set of programs
Benchmark Suites
SPEC95, SPEC2000 (11 Integer, 14 FP), SPEC2006 (12 Integer, 17 FP)
C Compiler, Router, FEM
Desktop (CPU and Graphics Intensive)
Server (File Servers, Web Servers, Transaction Processing)
Embedded (EEMBC): 34 Kernels
What is SPEC
SPEC is the Standard Performance Evaluation Corporation. SPEC is a non-profit organization whose members include computer hardware vendors, software companies, universities, research organizations, systems integrators, publishers and consultants. SPEC's goal is to establish, maintain and endorse a standardized set of relevant benchmarks for computer systems. Although no one set of tests can fully characterize overall system performance, SPEC believes that the user community benefits from objective tests which can serve as a common reference point.
What does a benchmark measure?
SPEC CPU2006 benchmarks emphasize the performance of the computer processor (CPU), the memory architecture, and the compilers.
SPEC CPU2006 contains two components that focus on two different types of compute intensive performance:
The CINT2006 suite measures compute-intensive integer performance, and
The CFP2006 suite measures compute-intensive floating point performance
Source: Standard Performance Evaluation Corporation
Reference Machine
Source: Standard Performance Evaluation Corporation
SPEC uses a historical Sun system, the "Ultra Enterprise 2" which was introduced in 1997, as the reference machine. The reference machine uses a 296 MHz UltraSPARC II processor, as did the reference machine for CPU2000. But the reference machines for the two suites are not identical: the CPU2006 reference machine has substantially better caches, and the CPU2000 reference machine could not have held enough memory to run CPU2006.
It takes about 12 days to do a rule-conforming run of the base metrics for CINT2006 and CFP2006 on the CPU2006 reference machine.
SPEC2000 now takes less than a minute on the latest high-performance machines.
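The reference machine matters because SPEC reports each benchmark as a ratio of the reference machine's time to the measured time, and summarizes a suite with the geometric mean of those ratios. The sketch below shows that calculation. The timings are hypothetical values chosen for illustration, not published results, and `spec_ratio` and `geometric_mean` are illustrative names.

```python
import math
from typing import Iterable


def spec_ratio(reference_s: float, measured_s: float) -> float:
    """Ratio of reference-machine time to measured time (higher is faster)."""
    if reference_s <= 0 or measured_s <= 0:
        raise ValueError("times must be positive")
    return reference_s / measured_s


def geometric_mean(values: Iterable[float]) -> float:
    """n-th root of the product of n positive values, computed via logs."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference seconds, measured seconds) pairs for three benchmarks.
timings = [(9770.0, 500.0), (9650.0, 700.0), (8050.0, 400.0)]
ratios = [spec_ratio(ref, run) for ref, run in timings]
print(f"suite score: {geometric_mean(ratios):.1f}")
```

The geometric mean is used so that the relative ranking of two machines does not depend on which machine was chosen as the reference.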
Example Result for SPEC 2000
Source: Standard Performance Evaluation Corporation
System | #CPU | Base | Peak
Intel SE440BX-2 (800 MHz Pentium III) | 1 core, 1 chip, 1 core/chip | 340 | 344
Intel D850GB motherboard (1.4 GHz, Pentium 4 processor) | 1 core, 1 chip, 1 core/chip | 502 | 512
Sun Blade 2500 (1.28GHz) | 1 core, 1 chip, 1 core/chip | 604 | 696
Intel D850EMV2 motherboard (2.0A GHz, Pentium 4 processor) | 1 core, 1 chip, 1 core/chip | 756 | 759
Dell PowerEdge 2650 (3.06 GHz Xeon) | 1 core, 1 chip, 1 core/chip (Hyper-Threading Technology disabled) | 1014 | 1056
Dell Precision WorkStation 350 (2.8 GHz P4) | 1 core, 1 chip, 1 core/chip | 1017 | 1061
SGI Altix 3000 (1300MHz, Itanium 2) | 1 core, 1 chip, 1 core/chip | 1019 | --
Example Result for SPEC 2000
Source: Standard Performance Evaluation Corporation
System | #CPU | Base | Peak
Precision Workstation 690 (Intel® Xeon® processor 5160, 3.0 | 4 cores, 2 chips, 2 cores/chip | 3057 | 3063
PowerEdge 1950 (Intel Xeon processor 5160, 3.00GHz) | 4 cores, 2 chips, 2 cores/chip | 3061 | 3065
Intel(R) DG965WH motherboard( 2.93 GHz, Intel(R) Core(TM) 2 | 2 cores, 1 chip, 2 cores/chip | 3099 | 3109
Intel(R) DG965WH motherboard( 2.93 GHz, Intel(R) Core(TM) 2 | 2 cores, 1 chip, 2 cores/chip | 3106 | 3111
Precision Workstation 390 (Intel Core 2 Extreme processor X6 | 2 cores, 1 chip, 2 cores/chip | 3108 | 3119
Summarizing Performance
Amdahl’s Law
The performance improvement to be gained from using a faster mode of execution is limited by the fraction of the time the faster mode can be used.
Amdahl’s Law: Law of Diminishing Returns
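\[
\text{Speedup} = \frac{\text{Performance for entire task with the enhancement when possible}}{\text{Performance for entire task without the enhancement}}
\]
\[
\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)
\]
\[
\text{Speedup} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\]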
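Amdahl's Law is easy to experiment with numerically. The sketch below uses made-up inputs (an enhancement usable 40% of the time) to show the diminishing returns; `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is enhanced.

    fraction_enhanced: share of the ORIGINAL execution time that can use
                       the enhancement (between 0 and 1).
    speedup_enhanced:  how much faster that share runs with the enhancement.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Illustrative numbers: the enhanced mode is usable 40% of the time.
for s in (2, 10, 100, 1_000_000):
    print(f"enhanced speedup {s:>9}x -> overall {amdahl_speedup(0.4, s):.4f}x")
```

However large the enhanced speedup becomes, the overall speedup here stays below 1/(1 - 0.4), about 1.67, which is the diminishing-returns limit.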
CPU performance Equations
\[
\text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}
\]
Example:
Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume the CPI of FPSQR is decreased to 2, OR the average CPI of all FP operations is decreased to 2.5
Compare these two designs using the CPU performance equations
Example: Solution
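\[
\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}} = (4 \times 25\%) + (1.33 \times 75\%) = 2.0
\]
CPI for enhanced FPSQR
\[
\text{CPI}_{\text{with new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times (\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{of new FPSQR only}}) = 2.0 - 2\% \times (20 - 2) = 1.64
\]
CPI for enhanced FP operation
\[
\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) = 1.625
\]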
Example: Solution
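\[
\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.0}{1.625} = 1.23
\]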
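The example can be re-checked numerically. This sketch assumes, as the solution does, that instruction count and clock cycle time are the same in both designs, so speedup reduces to a ratio of CPIs. The helper `average_cpi` is an illustrative name. With 1.33 used literally, the intermediate values come out slightly below the rounded 2.0 and 1.625 shown on the slide.

```python
from typing import Iterable, Tuple


def average_cpi(mix: Iterable[Tuple[float, float]]) -> float:
    """Weighted CPI from (instruction frequency, CPI) pairs."""
    return sum(freq * cpi for freq, cpi in mix)


# Original design: 25% FP at CPI 4.0, 75% other instructions at CPI 1.33.
cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])

# Design 1: FPSQR (2% of instructions) drops from CPI 20 to CPI 2.
cpi_new_fpsqr = cpi_original - 0.02 * (20.0 - 2.0)

# Design 2: all FP operations drop to an average CPI of 2.5.
cpi_new_fp = average_cpi([(0.25, 2.5), (0.75, 1.33)])

# Same instruction count and clock cycle, so speedup is a CPI ratio.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"CPI original {cpi_original:.3f}")
print(f"new FPSQR: CPI {cpi_new_fpsqr:.3f}, speedup {speedup_fpsqr:.2f}")
print(f"new FP:    CPI {cpi_new_fp:.3f}, speedup {speedup_fp:.2f}")
```

Improving all FP operations comes out marginally ahead of improving FPSQR alone (a speedup of about 1.23 against about 1.22).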
Another Measure -- MIPS
\[
\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}
\]
Example: An Embedded Processor
120 MIPS for the single processor.
80 MIPS for the processor/co-processor combination (that is how they are measured for the combined system).
I = Number of integer instructions
F = Number of floating-point instructions (8M)
Y = Number of integer instructions to emulate one FP instruction (50)
W = Time for choice 1 (4 seconds)
B = Time for choice 2
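One way to work the example is sketched below. It rests on assumptions the slide does not state outright. Choice 1 is taken to be the processor alone, emulating each FP instruction with Y integer instructions at 120 MIPS. Choice 2 is taken to be the processor with the co-processor, executing FP instructions directly at 80 MIPS. The function name `compare_choices` is illustrative.

```python
MILLION = 1_000_000


def compare_choices(mips_alone: float, mips_with_coproc: float,
                    fp_instructions: float, emulation_cost: float,
                    time_alone_s: float):
    """Return (I, B): integer instruction count and time for choice 2.

    Assumption: choice 1 runs on the processor alone, replacing each FP
    instruction by `emulation_cost` integer instructions.
    Assumption: choice 2 executes each FP instruction directly on the
    co-processor, so it counts as a single instruction.
    """
    # MIPS = instruction count / (time * 10^6), solved for instruction count.
    executed_alone = mips_alone * MILLION * time_alone_s
    integer_instructions = executed_alone - fp_instructions * emulation_cost
    executed_with_coproc = integer_instructions + fp_instructions
    time_with_coproc_s = executed_with_coproc / (mips_with_coproc * MILLION)
    return integer_instructions, time_with_coproc_s


I, B = compare_choices(mips_alone=120, mips_with_coproc=80,
                       fp_instructions=8 * MILLION, emulation_cost=50,
                       time_alone_s=4.0)
print(f"I = {I / MILLION:.0f}M integer instructions, B = {B:.2f} s")
```

Under these assumptions I is 80M and B is about 1.1 seconds, so the configuration with the lower MIPS rating finishes the program faster. That is the usual caution against comparing machines by MIPS alone.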
End of Lecture 1
CINT 2006
400.perlbench C PERL Programming Language
401.bzip2 C Compression
403.gcc C C Compiler
429.mcf C Combinatorial Optimization
445.gobmk C Artificial Intelligence: go
456.hmmer C Search Gene Sequence
458.sjeng C Artificial Intelligence: chess
462.libquantum C Physics: Quantum Computing
464.h264ref C Video Compression
471.omnetpp C++ Discrete Event Simulation
473.astar C++ Path-finding Algorithms
483.xalancbmk C++ XML Processing
CFP 2006
410.bwaves Fortran Fluid Dynamics
416.gamess Fortran Quantum Chemistry
433.milc C Physics: Quantum Chromodynamics
434.zeusmp Fortran Physics/CFD
435.gromacs C/Fortran Biochemistry/Molecular Dynamics
436.cactusADM C/Fortran Physics/General Relativity
437.leslie3d Fortran Fluid Dynamics
444.namd C++ Biology/Molecular Dynamics
447.dealII C++ Finite Element Analysis
450.soplex C++ Linear Programming, Optimization
453.povray C++ Image Ray-tracing
454.calculix C/Fortran Structural Mechanics
459.GemsFDTD Fortran Computational Electromagnetics
465.tonto Fortran Quantum Chemistry
470.lbm C Fluid Dynamics
481.wrf C/Fortran Weather Prediction
482.sphinx3 C Speech recognition